Best approach to building a logistic regression model for prediction

Post Body

Hi all, I am working on a master's thesis and have a pretty massive dataset in which I am trying to see whether there is any correlation between possible predictors and a binary outcome. I split the data into a training set of 50 observations and a test set of 32 observations. I tested 14 predictor variables individually; 5 of them are statistically significant on their own and the rest are not. I have built 3 models and wanted to explain my logic and see which approach you think is best.

Model 1: Includes all variables that were statistically significant on their own (5 variables; none stays significant when all are in the model). Pseudo R2: 0.2968

Model 2: Created by taking Model 1 and removing variables stepwise with a p-value threshold of 0.1 (2 variables stay, 1 of them significant). Pseudo R2: 0.2342

Model 3: Starts from all 14 variables tested, including the ones that were insignificant on their own, and removes variables stepwise with a p-value threshold of 0.1 (4 variables stay, 1 of them significant). Pseudo R2: 0.4401
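For readers unfamiliar with the stepwise removal used in Models 2 and 3, here is a minimal sketch of backward elimination by p-value. It is not necessarily the exact procedure any particular statistics package runs. The function name `backward_eliminate` and the `fit_pvalues` callback are hypothetical: the callback stands in for whatever routine refits the logistic regression on the remaining variables and returns one p-value per variable. Removing a variable only when its p-value is strictly above the threshold is also an assumption.

```python
from typing import Callable, Dict, List

# Hypothetical callback type: takes the variable names currently in the
# model, refits the logistic regression, and returns {name: p_value}.
PValueFitter = Callable[[List[str]], Dict[str, float]]


def backward_eliminate(
    variables: List[str],
    fit_pvalues: PValueFitter,
    threshold: float = 0.1,
) -> List[str]:
    """Drop the least significant variable, one at a time, until every
    remaining variable has a p-value at or below the threshold."""
    kept = list(variables)
    while kept:
        pvalues = fit_pvalues(kept)  # refit on the remaining variables
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= threshold:
            break  # every remaining variable passes the threshold
        kept.remove(worst)  # assumption: remove only if p > threshold
    return kept
```

Because the model is refit after every removal, the p-values of the surviving variables can change from step to step, which is one reason Model 3 can end up with variables that were not significant on their own.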

When applied to the test set, this is how they perform:
Model 1: Area under ROC curve: 0.8203, correctly predicts 15/32 test set observations

Model 2: Area under ROC curve: 0.8008, correctly predicts 16/32 test set observations

Model 3: Area under ROC curve: 0.8359, correctly predicts 15/32 test set observations
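The two numbers reported for each model measure different things: the area under the ROC curve depends only on how the predicted probabilities rank the observations, while the count of correct predictions depends on the cutoff used to turn a probability into a yes/no prediction. The sketch below illustrates that difference with made-up toy scores, not the thesis data. The function names and the 0.5 default cutoff are assumptions, since the post does not say which cutoff was used.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the share of (positive, negative)
    pairs in which the positive case gets the higher score; ties count
    as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(
    scores: Sequence[float], labels: Sequence[int], cutoff: float = 0.5
) -> float:
    """Fraction of observations classified correctly at a given cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (illustrative only): the positives are ranked above the
# negative, but every predicted probability is below 0.5.
toy_scores = [0.40, 0.30, 0.45, 0.10]
toy_labels = [1, 1, 1, 0]
```

In this toy case the ranking is perfect, so the AUC is 1.0, yet only 1 of 4 observations is classified correctly at a 0.5 cutoff; lowering the cutoff to 0.2 classifies all 4 correctly. A high AUC alongside a low count of correct predictions can therefore point to the cutoff rather than to the model's ranking ability.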

Which model-building method do you think is the most sound? I can describe all three in my write-up, but I feel I should report one of them as the "main" model. I am personally leaning toward Model 2, since one of its variables stays significant and the other is very close (95% CI 0.996 to 1.103). The association seems rather weak anyway: in this model the two ORs are 1.03 and 1.05, nothing crazy. The important thing for my thesis is that I explain how I built the model and what I did, not so much that my results are super significant.
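As a check on the "very close" interval, the odds ratio and an approximate p-value can be recovered from the confidence limits alone. This is a sketch under the assumption that the reported interval is a standard Wald interval, symmetric on the log-odds scale; the function name `wald_from_ci` is hypothetical.

```python
import math
from typing import Tuple


def wald_from_ci(
    lower: float, upper: float, z_crit: float = 1.959964
) -> Tuple[float, float, float, float]:
    """Recover the odds ratio, standard error, z statistic, and two-sided
    p-value from a 95% Wald confidence interval for an odds ratio."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    beta = (log_lo + log_hi) / 2            # midpoint on the log-odds scale
    se = (log_hi - log_lo) / (2 * z_crit)   # half-width divided by 1.96
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value
    return math.exp(beta), se, z, p


odds_ratio, se, z, p_value = wald_from_ci(0.996, 1.103)
```

Under that assumption, the interval 0.996 to 1.103 corresponds to an odds ratio of about 1.05, matching the larger of the two reported ORs, and to a p-value between 0.05 and 0.10, which is consistent with the variable surviving the 0.1 removal threshold without reaching significance at 0.05.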

Author

HaveAGreatGay

Subreddit

r/AskStatistics

Posted: 2 years ago
External URL: reddit.com/r/AskStatisti...