Best approach at logistic regression model for prediction

Hi all, I am working on a master's thesis and have a pretty massive dataset in which I am trying to see whether any of several possible predictors is associated with a binary outcome. I split the data into a training set of 50 observations and a test set of 32 observations. I tested 14 predictor variables individually; 5 of them are statistically significant on their own and the other 9 are not. I have built 3 models and want to explain my logic and see which you think is the best approach.

Model 1: Includes all variables that were statistically significant on their own (5 variables; none stays significant when all are in the model). Pseudo R2: 0.2968

Model 2: Created by taking Model 1 and removing variables stepwise with a p-value threshold of 0.1 (2 variables stay, 1 of them significant). Pseudo R2: 0.2342

Model 3: Includes all 14 variables tested, including those not significant on their own, followed by stepwise removal with a p-value threshold of 0.1 (4 variables stay, 1 of them significant). Pseudo R2: 0.4401
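For anyone comparing the pseudo R2 values above, here is a minimal sketch of how such a number can be computed, assuming it is McFadden's pseudo R2 (the post does not say which definition the software used). The function names and the toy outcomes and probabilities are made up for illustration and are not the thesis data.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p):
    """McFadden's pseudo R2: 1 - LL(fitted model) / LL(intercept-only model)."""
    base_rate = sum(y) / len(y)  # an intercept-only model predicts the outcome rate
    ll_null = log_likelihood(y, [base_rate] * len(y))
    ll_model = log_likelihood(y, p)
    return 1.0 - ll_model / ll_null

# Made-up outcomes and fitted probabilities, for illustration only.
y_toy = [1, 0, 1, 1, 0, 0, 1, 0]
p_toy = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2, 0.9, 0.5]
r2_toy = mcfadden_r2(y_toy, p_toy)
print(round(r2_toy, 3))
```

Under this definition the value is computed on the training data and cannot decrease when predictors are added to a nested model, so a higher pseudo R2 alone does not show that a larger model will predict better on new data.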

When applied to the test set, this is how they perform:

Model 1: Area under ROC curve: 0.8203, correctly predicts 15/32 test set observations

Model 2: Area under ROC curve: 0.8008, correctly predicts 16/32 test set observations

Model 3: Area under ROC curve: 0.8359, correctly predicts 15/32 test set observations
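The two test-set metrics above measure different things: the area under the ROC curve depends only on how the predicted probabilities rank the observations, while the count of correct predictions depends on the cutoff used to turn a probability into a yes/no call. The sketch below illustrates this with made-up numbers; the 0.5 default cutoff is an assumption, since the post does not say which cutoff was used.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, cutoff=0.5):
    """Fraction of observations classified correctly at a given probability cutoff."""
    hits = sum((s >= cutoff) == (y == 1) for y, s in zip(y_true, scores))
    return hits / len(y_true)

# Made-up test-set outcomes and predicted probabilities, for illustration only.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]
p_hat = [0.05, 0.10, 0.15, 0.30, 0.20, 0.35, 0.40, 0.45]

print(roc_auc(y_test, p_hat))         # 0.9375: positives mostly outrank negatives
print(accuracy(y_test, p_hat))        # 0.5: no score reaches the 0.5 cutoff
print(accuracy(y_test, p_hat, 0.18))  # 0.875: same scores, lower cutoff
```

In this toy case the ranking is good but every predicted probability sits below 0.5, so accuracy at the default cutoff is no better than a coin flip. Whether something similar explains the 15/32 and 16/32 figures would need checking against the actual predicted probabilities.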

Which method of building a model do you think is the most sound? I can describe all three in my write-up, but I feel I should report one of them as the "main" model. I am personally leaning toward Model 2, since one of its variables stays significant and the other is very close (95% CI 0.996 to 1.103). The association seems rather weak anyway: in this model the two ORs are 1.03 and 1.05, nothing crazy. The important thing for my thesis is that I explain how I built the model and what I did, not so much that I have results that are highly significant.
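As a rough check on the "very close" interval quoted above, the sketch below backs out the odds ratio, standard error, z statistic, and two-sided p-value from the reported bounds. It assumes the interval is a 95% Wald interval, symmetric on the log-odds scale with a critical value of 1.96; the post does not state how the interval was computed, and `wald_summary` is a hypothetical helper name.

```python
import math

def wald_summary(ci_low, ci_high, z_crit=1.96):
    """Recover OR, log-odds SE, z, and two-sided p from a Wald CI for an odds ratio."""
    log_low, log_high = math.log(ci_low), math.log(ci_high)
    beta = (log_low + log_high) / 2.0             # midpoint on the log-odds scale
    se = (log_high - log_low) / (2.0 * z_crit)    # half-width divided by the critical value
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return math.exp(beta), se, z, p_value

# Bounds taken from the interval quoted in the post.
odds_ratio, se, z, p_value = wald_summary(0.996, 1.103)
print(f"OR={odds_ratio:.3f}  SE={se:.4f}  z={z:.2f}  p={p_value:.3f}")
```

Under these assumptions the interval corresponds to an OR of about 1.05 with a p-value of roughly 0.07, which fits a variable that survives a 0.1 removal threshold without reaching 0.05. An odds ratio is also expressed per one-unit change in the predictor, so how weak 1.03 or 1.05 really is depends on the predictor's scale; per 10 units, 1.05 compounds to `1.05 ** 10`, about 1.63.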

Posted
2 years ago