Dear statisticians,
I have a dataset with around 100 000 observations and 800 variables. My target variable is binary (0 and 1), and my covariates include continuous variables as well as ordered and unordered categorical variables.
All variables may contain missing values.
My objective is to estimate a logistic regression model that allows me to predict my target variable.
That being said, I am looking for advice on how to reduce the number of variables included in the model, preferably without using Principal Component Analysis. I have some interrelated questions:
1- I could start by looking at correlations between the continuous variables and the dependent variable. But what if the relationships are nonlinear?
2- What about the categorical variables? Which test is appropriate for assessing the relationship between an ordered or unordered categorical variable (with two or more levels) and a binary variable? And how can I make that metric comparable with the one I use for the continuous variables?
3- Should I bin the continuous variables, and how do I decide whether to do so? Binning has the advantage of making them more comparable with the categorical variables.
I was considering the following: bin the continuous variables, then select the variables with the highest chi-square statistic against the binary target. But should I then leave the continuous variables in their binned form in the model? And does the chi-square test take into account the ordinality of the variables?
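To make the idea concrete, here is a minimal sketch of the binning-plus-chi-square screening described above, written with only the Python standard library. Everything in it is an assumption for illustration: the toy variables (`x_u`, `x_noise`, `colour`), the choice of five quantile bins, the decision to give missing values their own "NA" category, and the use of Cramér's V (the chi-square statistic rescaled by sample size and table size) as one possible way to put binned continuous and categorical variables on a common 0 to 1 scale. It is not a recommendation of this approach over others, only an illustration of what the procedure computes.

```python
import math
import random
from collections import Counter
from statistics import quantiles


def quantile_bin(values, n_bins=5):
    """Map each value to a quantile-bin index; missing values (None) get the label "NA"."""
    observed = [v for v in values if v is not None]
    cuts = quantiles(observed, n=n_bins)  # n_bins - 1 cut points

    def bin_of(v):
        if v is None:
            return "NA"
        return sum(v > c for c in cuts)  # number of cut points below v

    return [bin_of(v) for v in values]


def chi_square(categories, target):
    """Pearson chi-square statistic of the categories x target contingency table."""
    n = len(target)
    cell = Counter(zip(categories, target))
    row = Counter(categories)
    col = Counter(target)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat


def cramers_v(categories, target):
    """Cramér's V: chi-square rescaled to [0, 1] so tables of different sizes compare."""
    n = len(target)
    k = min(len(set(categories)), len(set(target))) - 1
    if k == 0:  # a constant variable carries no information
        return 0.0
    return math.sqrt(chi_square(categories, target) / (n * k))


# Hypothetical toy data: x_u has a U-shaped (nonlinear) relationship with y,
# x_noise and colour are unrelated to y, and x_u has about 5% missing values.
random.seed(0)
n = 2000
x_u = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < (0.8 if abs(x) > 1 else 0.2) else 0 for x in x_u]
x_noise = [random.gauss(0, 1) for _ in range(n)]
x_u_obs = [None if random.random() < 0.05 else x for x in x_u]
colour = [random.choice("abc") for _ in range(n)]

features = {
    "x_u": quantile_bin(x_u_obs),
    "x_noise": quantile_bin(x_noise),
    "colour": colour,
}
scores = {name: cramers_v(cats, y) for name, cats in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy setup the U-shaped variable is ranked first even though a linear correlation with `y` would largely miss it, which is the appeal of binning for screening. Note that the Pearson statistic is unchanged if the bin labels are permuted, so a score computed this way does not use the ordering of the bins.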
Any guidance?
Best regards and thank you in advance for your help.
Posted 5 years ago at reddit.com/r/AskStatisti...