Hi all, I am working on a master's thesis and have a pretty massive dataset in which I am trying to see whether any of the possible predictors are associated with a binary outcome. I split my data into a training set with 50 observations and a test set with 32 observations. I tested 14 predictor variables individually; 5 of them are statistically significant on their own and the rest are not. I have built 3 models and wanted to explain my logic and see what you think is the best approach.
Model 1: Includes all variables that were statistically significant on their own (5 variables; none stays significant when all are in the model). Pseudo R2: 0.2968
Model 2: Created by taking Model 1 and doing a stepwise removal of variables with a p-value threshold of 0.1 (2 variables stay, 1 of them significant). Pseudo R2: 0.2342
Model 3: Includes all 14 variables tested, including the ones that were insignificant on their own, followed by a stepwise removal of variables with a p-value threshold of 0.1 (4 variables stay, 1 of them significant). Pseudo R2: 0.4401
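For readers who want to see the stepwise removal spelled out, here is a minimal sketch of backward elimination with a 0.1 threshold. It is not the exact procedure used above: `backward_eliminate`, `fit_pvalues`, and the toy p-values are hypothetical names and numbers made up for illustration, and the rule "drop the variable with the largest p-value while it exceeds the threshold, then refit" is an assumption about how the removal was done.

```python
from typing import Callable, Dict, List, Sequence


def backward_eliminate(
    variables: Sequence[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.1,
) -> List[str]:
    """Drop the least significant variable until every p-value is <= threshold."""
    kept = list(variables)
    while kept:
        pvals = fit_pvalues(kept)  # refit the model on the current variable set
        worst = max(kept, key=lambda name: pvals[name])
        if pvals[worst] <= threshold:
            break  # everything left passes the threshold
        kept.remove(worst)
    return kept


# Hypothetical stand-in for a real model fit: fixed, made-up p-values.
# In a real analysis they would change every time the model is refitted.
_TOY_P = {"x1": 0.03, "x2": 0.08, "x3": 0.45, "x4": 0.72}


def toy_fit(names: List[str]) -> Dict[str, float]:
    return {name: _TOY_P[name] for name in names}


selected = backward_eliminate(["x1", "x2", "x3", "x4"], toy_fit)
print(selected)
```

In practice `fit_pvalues` would wrap an actual logistic regression fit on the training set and return the p-value of each coefficient. Note that variables kept this way can still sit above 0.05, which matches the "stays in the model but is not significant" pattern described for Models 2 and 3.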
When applied to the test set, this is how they perform:
Model 1: Area under ROC curve: 0.8203, correctly predicts 15/32 test set observations
Model 2: Area under ROC curve: 0.8008, correctly predicts 16/32 test set observations
Model 3: Area under ROC curve: 0.8359, correctly predicts 15/32 test set observations
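The two test-set metrics above measure different things, and a small sketch may help make that concrete. The code below assumes the models output a predicted probability per observation and that "correctly predicts" means classifying at a 0.5 cutoff; both are assumptions, and the labels and probabilities are invented toy values, not the thesis data.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, cutoff=0.5):
    """Share of observations classified correctly at the given probability cutoff."""
    hits = sum(int(s >= cutoff) == y for y, s in zip(y_true, scores))
    return hits / len(y_true)


# Made-up outcomes and predicted probabilities for six observations.
y = [0, 0, 1, 1, 0, 1]
p_hat = [0.10, 0.20, 0.30, 0.40, 0.35, 0.45]

auc = roc_auc(y, p_hat)
acc = accuracy(y, p_hat)
print(f"AUC = {auc:.3f}, accuracy at 0.5 cutoff = {acc:.3f}")
```

AUC depends only on how the observations are ranked, while the count of correct predictions depends on the chosen cutoff, so the two can diverge: in the toy data the ranking is nearly perfect, yet every predicted probability falls below 0.5 and only the negatives are classified correctly.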
What method of building a model do you think is the most sound? I can describe all three in my write-up, but I feel like I should report one of them as the "main" model. I am personally leaning toward Model 2, since one of the variables stays significant and the other is very close (95% CI 0.996, 1.103). The association seems rather weak anyway: in this model the ORs are 1.03 and 1.05, nothing crazy. The important thing for my thesis is that I explain how I built the model and what I did, not so much that I have results that are super significant.
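As a side calculation, the reported interval (0.996, 1.103) can be turned back into an approximate odds ratio and p-value. This is only a sketch: it assumes the interval is a standard Wald interval, symmetric on the log-odds scale, which the post does not state, and `wald_from_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def wald_from_ci(lower, upper, level=0.95):
    """Recover OR, standard error, z, and two-sided p from a Wald CI on the OR scale."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    log_lo, log_hi = math.log(lower), math.log(upper)
    beta = (log_lo + log_hi) / 2           # midpoint on the log-odds scale
    se = (log_hi - log_lo) / (2 * z_crit)  # half-width divided by the critical value
    z = beta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return math.exp(beta), se, z, p_value


odds_ratio, se, z, p_value = wald_from_ci(0.996, 1.103)
print(f"OR = {odds_ratio:.3f}, SE(log OR) = {se:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

Under that assumption the implied odds ratio is close to the 1.05 quoted above, and the p-value falls between 0.05 and 0.10, consistent with the variable surviving a 0.1 removal threshold without being significant at 0.05.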
Post Details
- Posted
- 2 years ago
- External URL
- reddit.com/r/AskStatisti...