A colleague recently forwarded me some discussions of the R-Squared fit statistic and asked for my advice. The criticisms of the R-Squared’s usefulness were raised in Cosma Shalizi’s (Carnegie Mellon) lecture notes and further discussed in a blog post by Clay Ford (University of Virginia). Shalizi argues that the R-Squared does not really measure model fit.
One of Shalizi’s arguments is that the R-Squared is a poor model fit statistic because it can yield high scores with an incorrect model and low scores with a correct one. While true, it is important to keep these arguments in proportion when thinking about their impact on the analytics process.
Low R-Squared with Correct Model
Shalizi writes that R-Squared can be arbitrarily low when a regression has very large error terms or predictors with very little variance. Following Ford’s exposition of this point, I generate hypothetical data from a known “true” model:
#Set up true model and data
#Specify number of observations
n.obs <- 100
#Create a fictitious predictor with mean = 100 and SD = 20
predictor <- rnorm(n.obs, 100, 20)
#Set error size so typical error is 20
error.1 <- 20
outcome.1 <- 20 + 0.8 * predictor + rnorm(n.obs, 0, error.1)
dat.1 <- data.frame(cbind(outcome.1, predictor))
#plot the data
library(ggplot2)
ggplot(dat.1, aes(x = predictor, y = outcome.1)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
We model the data below. Note the R-Squared:
summary(lm(outcome.1 ~ predictor, dat.1))$r.squared
## [1] 0.4231167
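For readers who want to see what that number measures: the R-Squared reported by summary() is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, using freshly simulated data with an arbitrary seed (so the value will not match the one printed above). The names fit, rss, tss, and r2.manual are illustrative:

```r
# Arbitrary seed, chosen only to make this sketch reproducible
set.seed(123)
n.obs <- 100
predictor <- rnorm(n.obs, 100, 20)
outcome.1 <- 20 + 0.8 * predictor + rnorm(n.obs, 0, 20)
fit <- lm(outcome.1 ~ predictor)

# Residual sum of squares: variation left over after the fit
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of the outcome around its mean
tss <- sum((outcome.1 - mean(outcome.1))^2)

# R-Squared computed by hand, compared with the value from summary()
r2.manual <- 1 - rss / tss
r2.summary <- summary(fit)$r.squared
all.equal(r2.manual, r2.summary)
```

Seen this way, the statistic is a ratio of variances. Anything that inflates the leftover variation pushes it down, whether or not the model is specified correctly.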
Now, let’s mess with the model above and give it huge errors. Tinker with the code above if you are reading this document in R. If you can’t, check out this version, which uses the same predictor but errors ten times larger:
#Set up true model and data
error.2 <- 200
outcome.2 <- 20 + 1.8 * predictor + rnorm(n.obs, 0, error.2)
dat.2 <- data.frame(cbind(outcome.2, predictor))
#plot the data
ggplot(dat.2, aes(x = predictor, y = outcome.2)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
#Linear regression R-Squared
summary(lm(outcome.2 ~ predictor, dat.2))$r.squared
## [1] 0.01082093
Now the R-Squared is much smaller, even though we are still modeling the “true” model.
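The drop is what the arithmetic predicts. For a simple linear model whose error term is independent of the predictor, the population R-Squared is slope^2 * Var(predictor) / (slope^2 * Var(predictor) + Var(error)). A rough sketch follows. Here pop_r2 is a hypothetical helper, and the slope of 0.8 is taken from the first model purely for illustration. Sample values from 100 observations will scatter around these figures:

```r
# Population R-Squared for outcome = intercept + slope * predictor + error,
# assuming the error is independent of the predictor
pop_r2 <- function(slope, sd.predictor, sd.error) {
  signal <- slope^2 * sd.predictor^2  # variance explained by the predictor
  noise <- sd.error^2                 # variance of the error term
  signal / (signal + noise)
}

pop_r2(0.8, sd.predictor = 20, sd.error = 20)   # about 0.39
pop_r2(0.8, sd.predictor = 20, sd.error = 200)  # about 0.006
# A predictor with very little variance has the same effect as a large error
pop_r2(0.8, sd.predictor = 2, sd.error = 20)    # about 0.006
```

The last call illustrates Shalizi’s second case: shrinking the spread of the predictor lowers the R-Squared just as surely as inflating the errors.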
When I look at these comparisons, I’m quite underwhelmed by the argument. Sure, we have the “true” model, but I strain to see how this model is at all useful. If you use this model, you are still bound to be off in your predicted values.
High R-Squared with Incorrect Model
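One way to make this concrete is to put the residual standard error next to the R-Squared. A sketch with simulated data follows. The seed is arbitrary, fit_stats is a hypothetical helper, and both fits use a slope of 0.8 so that only the error size differs:

```r
# Arbitrary seed, chosen only to make this sketch reproducible
set.seed(1)
n.obs <- 100
predictor <- rnorm(n.obs, 100, 20)

# Fit the correctly specified linear model for a given error SD and
# return the R-Squared together with the residual standard error
fit_stats <- function(error.sd) {
  outcome <- 20 + 0.8 * predictor + rnorm(n.obs, 0, error.sd)
  fit <- lm(outcome ~ predictor)
  c(r.squared = summary(fit)$r.squared, resid.se = summary(fit)$sigma)
}

small <- fit_stats(20)
large <- fit_stats(200)
rbind(small, large)
```

The residual standard error is in the units of the outcome, so it shows directly how far off a typical prediction is likely to be. In this setup the low R-Squared and the large prediction error are two views of the same noise.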
Shalizi notes that you can get a high R-Squared with a poorly-fitted model. Following Ford’s lead, I’ll develop the point by example. Imagine we have a “true” model with non-linear effects:
predictor <- rexp(n.obs, rate = 0.005)
error.3 <- 20
outcome.3 <- -100 + 0.01 * predictor + 1.25 * predictor^2 + rnorm(n.obs, 0, error.3)
dat.3 <- data.frame(cbind(outcome.3, predictor))
#Plot the data
ggplot(dat.3, aes(x = predictor, y = outcome.3)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
#Linear regression R-Squared
reg.3 <- lm(outcome.3 ~ predictor, dat.3)
summary(reg.3)
##
## Call:
## lm(formula = outcome.3 ~ predictor, data = dat.3)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -69805 -48338 -11725  38780 511727
##
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)
## (Intercept) -81926.10   10427.02  -7.857     0.00000000000511 ***
## predictor      870.80      38.82  22.434 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70660 on 98 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.8354
## F-statistic: 503.3 on 1 and 98 DF,  p-value: < 0.00000000000000022
So here we have a very high R-Squared, but the model is clearly misspecified! The relationship between the predictor and the outcome is non-linear.
I grant this point, but this situation is most likely to trick someone who is not doing regression diagnostics. If you diagnose properly, you should receive several indications that the model is not well-specified. While this particular argument has merit in a purely technical sense, its impact on practical modeling shouldn’t be overestimated. Run your diagnostics.
library(olsrr)
ols_plot_resid_qq(reg.3) #Errors are not Normally distributed
ols_plot_resid_fit(reg.3) #Errors do not appear homoskedastic
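A complementary check that does not depend on reading plots is to fit a richer model and test whether the extra term is needed. The sketch below uses base R only, on freshly simulated data with an arbitrary seed. The names reg.linear and reg.quad are illustrative, and adding a squared term is just one of several ways to probe for curvature:

```r
# Arbitrary seed, chosen only to make this sketch reproducible
set.seed(2)
n.obs <- 100
predictor <- rexp(n.obs, rate = 0.005)
outcome.3 <- -100 + 0.01 * predictor + 1.25 * predictor^2 + rnorm(n.obs, 0, 20)
dat.3 <- data.frame(outcome.3, predictor)

# Misspecified straight-line model versus one that allows curvature
reg.linear <- lm(outcome.3 ~ predictor, data = dat.3)
reg.quad <- lm(outcome.3 ~ predictor + I(predictor^2), data = dat.3)

# F test for nested models: does the squared term improve the fit?
comparison <- anova(reg.linear, reg.quad)
comparison
```

A very small p-value in the second row of the table signals that the straight-line model leaves systematic structure in the residuals, however high its R-Squared.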
Other Shortcomings & Uses
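The R-Squared has other shortcomings. It is not defined, or not directly comparable, for many kinds of models outside ordinary least squares. You can’t use it to compare models fit to different data sets. It has trouble comparing models with transformed outcomes against models with untransformed ones. It also says little about how well a model will predict new data.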
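On the prediction point, one common alternative is to hold out part of the data and measure the error on it. The sketch below uses simulated data with an arbitrary seed. The 150/50 split and the names train, test, and rmse are assumptions made for illustration, not a recommendation for any particular data set:

```r
# Arbitrary seed, chosen only to make this sketch reproducible
set.seed(3)
n.obs <- 200
predictor <- rnorm(n.obs, 100, 20)
outcome <- 20 + 0.8 * predictor + rnorm(n.obs, 0, 20)
dat <- data.frame(outcome, predictor)

# Hold out 50 observations that the model never sees during fitting
train.rows <- sample(n.obs, size = 150)
train <- dat[train.rows, ]
test <- dat[-train.rows, ]

fit <- lm(outcome ~ predictor, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out observations
rmse <- sqrt(mean((test$outcome - pred)^2))
rmse
```

Like the residual standard error, the held-out RMSE is in the units of the outcome, which makes it easier to judge against the precision a given application needs.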
The R-Squared has one important feature: it is widely understood by audiences. The people want it. They can be frustrated and suspicious when you don’t present it. Think of it as a crude, widely-loved metric that can help assure some audience members that a model isn’t completely divorced from the data. If you do proper diagnostics, model well, use other fit statistics, and are careful not to fetishize the R-Squared, it can help you communicate.