There are a few comments in this thread we could talk about, but I'm going to focus on this one, particularly, the notion that a model in social sciences with a low R2 value immediately means the model is useless, as exemplified by this:
Just for fucking fun I decided to recreate the chart in excel. The fucking r-squared is roughly 50%.
Before we get into it, for transparency's sake I should say that I have not read the substance of what the original thread is referring to, so I can't comment on the study itself; this is purely an RI about R2.
Anyways, it is true that R2 represents the proportion of the variance in the outcome that your model captures (edited per /u/brberg’s comment below), but I have a few points about why this is not necessarily all that important, assuming you care about causality and not just prediction. I'm not going to go too in depth, since I've discussed this throughout my comment history, but here we go:
Want to get a high R2 value? Just add more variables to your regression. (For this section, you can use this to view the LaTeX equations, aka the stuff between dollar signs.) That's because R2 is just the explained sum of squares,
$\sum_{i = 1}^{n}(\hat{y}_i-\bar{y})^2$
divided by the total sum of squares,
$\sum_{i = 1}^{n}(y_i-\bar{y})^2$
where we're talking about a dataset with n values indexed by i, each with an observed value $y_i$ and a predicted (or modeled) value $\hat{y}_i$, and where $\bar{y}$ is the sample mean. Equivalently, for OLS with an intercept, R2 is one minus the residual sum of squares, $\sum_{i = 1}^{n}(y_i-\hat{y}_i)^2$, divided by the total sum of squares. If you add more variables, the residual sum of squares is necessarily non-increasing, which is nicely explained here, so R2 can never go down. As a result, you really shouldn't be as concerned about the proportion of the variance in the outcome variable that is predictable from a single predictor... a lot of different things could affect your outcome, especially in the social sciences where units are highly heterogeneous. Taken from an old comment of mine here.

Ok, you say, then just consider adjusted R2, which penalizes you for the extra terms you include in the regression. Well, you might be able to explain more variation, but if you're interested in causal inference then you have to be careful about including "bad" controls, aka conditioning on a collider. Consider this example from another old comment about the gender pay gap. Guess which of the two models has a higher adjusted R2 value? Hint: the wrong one! Also you might be overfitting, which the next point gets at.
This is obviously an extraordinarily simplified example, but consider this data-generating process as a toy model of why controlling for A, B, C, and D (or just A in this case) is a bad idea when those variables sit downstream on the causal pathway. This was written in R, if you're familiar with the software.
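male <- rbinom(n=1000, size=1, prob=0.5)  # 1 = male, 0 = female
wages <- 2*male + rnorm(1000)  # wages depend only on gender, plus noise
hours_worked <- wages + rnorm(1000)  # hours are caused by wages, not the other way around
lm(wages ~ male)  # estimates the hardcoded gap of 2
lm(wages ~ male + hours_worked)  # "bad" control: the estimated gap shrinks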
There's a hardcoded gender wage gap of "2" here, and notice that wages are purely a function of gender (i.e. discrimination) and not hours worked. The second regression will produce a biased estimate of the effect of gender on wages (you will underestimate this effect). It does not mean it doesn't exist!
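To put numbers on the "guess which model has the higher adjusted R2" question, here is a minimal sketch that reruns the same data-generating process and prints the estimated gap and adjusted R2 for both regressions. The seed, the larger sample size, and the object names (`fit_good`, `fit_bad`, and so on) are my own choices for illustration, not part of the original comment.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 10000    # larger than the original 1000 to reduce noise (assumption)

male <- rbinom(n = n, size = 1, prob = 0.5)
wages <- 2 * male + rnorm(n)        # true gap is hardcoded at 2
hours_worked <- wages + rnorm(n)    # hours are downstream of wages

fit_good <- lm(wages ~ male)                 # no bad control
fit_bad  <- lm(wages ~ male + hours_worked)  # conditions on a downstream variable

# Estimated gender gap in each model
gap_good <- unname(coef(fit_good)["male"])
gap_bad  <- unname(coef(fit_bad)["male"])

# Adjusted R2 in each model
adj_good <- summary(fit_good)$adj.r.squared
adj_bad  <- summary(fit_bad)$adj.r.squared

print(c(gap_good = gap_good, gap_bad = gap_bad))
print(c(adj_r2_good = adj_good, adj_r2_bad = adj_bad))
```

Under this setup the first regression should recover a gap close to 2, while the second should report a noticeably smaller gap together with a higher adjusted R2, which is exactly the trap.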
Scott Cunningham, on pages 74-78 of his book on causal inference, walks through this as an example of collider bias, and I think he does so quite nicely (plus, it's in Stata, if you're unfamiliar with R).
Of course we don't know that this is the true data-generating process: the point is that just because the gender pay gap diminishes when we control for these sorts of variables does not mean that discrimination does not exist.
Here are some graphical examples of why higher values of R2 could mean a worse model, from Nick Huntington-Klein's Twitter.
And finally, this Twitter thread makes similar points nicely with a policy example:
Imagine you are studying a population in which everyone has a very serious disease, except one person. Then, of course, you find that the disease explains little variation in happiness. Would you then conclude that the "effects are too small to warrant policy change"? Surely not. The low "variance explained" is due to low variation in exposure, but the effect of an intervention could be huge. Thus, if screen time affects one's happiness substantially, but almost everyone in the population has the same exposure to screen time, then screen time will surely explain little variation in happiness, since there is low variation of the exposure to begin with.
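The quoted scenario is easy to simulate. The sketch below is a toy version under assumptions of my own: a population of 10,000, ten healthy people instead of one (so the effect can be estimated with less noise), and a disease that lowers happiness by 5 units. None of these numbers come from the thread.

```r
set.seed(1)     # arbitrary seed, only for reproducibility
n <- 10000      # hypothetical population size

# Almost everyone has the disease; only 10 people are healthy.
disease <- c(rep(0, 10), rep(1, n - 10))

# Assumed effect: the disease lowers happiness by 5 units (noise has sd 1).
happiness <- 10 - 5 * disease + rnorm(n)

fit <- lm(happiness ~ disease)
effect <- unname(coef(fit)["disease"])  # estimated effect of the disease
r2 <- summary(fit)$r.squared            # share of variance "explained"

print(c(effect = effect, r2 = r2))
```

The estimated effect should be large and negative while R2 stays small, because there is almost no variation in exposure for the model to explain.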
I'm not saying R2 doesn't matter or we should throw it away entirely. That said, in the social sciences we should worry less about these values than in the hard sciences. Why?
Units are far more heterogeneous in the social sciences! Every single carbon atom is the same, but every human is different. This is exactly why we have to use randomization to get at causality as opposed to being able to create a perfectly controlled environment like in the lab sciences. /u/rationalities was trying to make this same point in the original thread. See also their point here.
Social science outcomes are highly complex and have many causes, so it's unlikely that any particular model will be able to explain all of the variation in an outcome perfectly, or even above a certain threshold, especially when just using a single variable model. We have to work probabilistically, not deterministically.
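The two points above can be sketched with a toy outcome that has many independent causes. Everything here is an assumption for illustration: 20 causes, each with a true effect of 1, drawn independently (which stands in for randomization), plus the variable names.

```r
set.seed(7)   # arbitrary seed, only for reproducibility
n <- 10000    # hypothetical sample size
k <- 20       # hypothetical number of independent causes

# Each column is one cause; every cause has a true effect of 1 on the outcome.
causes <- matrix(rnorm(n * k), nrow = n, ncol = k)
outcome <- rowSums(causes) + rnorm(n)

# Single-variable model: regress the outcome on the first cause only.
x1 <- causes[, 1]
fit_one <- lm(outcome ~ x1)

effect_one <- unname(coef(fit_one)["x1"])  # estimated effect of the first cause
r2_one <- summary(fit_one)$r.squared       # variance explained by that one cause

print(c(effect = effect_one, r2 = r2_one))
```

The single-variable model should recover an effect near 1 even though it explains only a small slice of the variation, since the other 19 causes are left in the error term.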
So be wary of R2, especially when the regression includes many predictors. Be sure to look at adjusted R2, but even then, don't put too much weight on it if you care about causality. It may not matter anyway.
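As a quick check on the "many predictors" warning, here is a minimal sketch in which an outcome depends on one variable and we then throw in 30 predictors that are pure noise. The sample size, the number of junk predictors, and all object names are hypothetical.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
n <- 100       # hypothetical sample size

x <- rnorm(n)
y <- x + rnorm(n)                         # y depends on x only
junk <- matrix(rnorm(n * 30), nrow = n)   # 30 predictors unrelated to y

fit_small <- lm(y ~ x)
fit_big   <- lm(y ~ x + junk)

r2_small  <- summary(fit_small)$r.squared
r2_big    <- summary(fit_big)$r.squared
adj_small <- summary(fit_small)$adj.r.squared
adj_big   <- summary(fit_big)$adj.r.squared

print(c(r2_small = r2_small, r2_big = r2_big))
print(c(adj_small = adj_small, adj_big = adj_big))
```

The R2 of the larger model cannot be lower than that of the smaller one, even though the extra predictors are junk. Adjusted R2 applies a penalty for the extra terms, so it need not rise along with it.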
Edit: shifted the organization of the post a bit