Does the R-Squared Measure Model Fit Well?

A colleague recently forwarded discussions about the R-Squared fit statistic, asking for my advice. The criticisms of the R^2's usefulness were raised in Cosma Shalizi's (Carnegie Mellon) lecture notes and discussed further in a blog post by Clay Ford (University of Virginia). Shalizi argues that the R-Squared does not really measure model fit.

One of Shalizi's arguments is that the R^2 is a poor model fit statistic because it can produce high scores for an incorrect model and low scores for a correct one. While true, it is important to keep these arguments in proportion when thinking about their impact on the analytics process.

Low R^2 with Correct Model

Shalizi writes that the R-Squared can be arbitrarily low when a regression has very large error terms or highly invariant predictors. Following Ford's exposition of this point, I generate hypothetical data and the “true” model:

#Set up true model and data
#Specify number of observations
n.obs <- 100                       
#Create a fictitious predictor with mean = 100 and SD = 20
predictor <- rnorm(n.obs,100,20)                                     
#Set the error standard deviation to 20
error.1 <- 20
outcome.1 <- 20 + 0.8 * predictor + rnorm(n.obs, 0, error.1)
dat.1 <- data.frame(cbind(outcome.1, predictor))

#plot the data
library(ggplot2)
ggplot(dat.1, aes(x = predictor, y = outcome.1)) + 
         geom_point() + geom_smooth(method="lm", se = F)

[Figure: scatterplot of outcome.1 against predictor with fitted regression line]

We model the data below. Note the R-Squared (no random seed is set, so your exact values will differ from those shown):

summary(lm(outcome.1 ~ predictor, dat.1))$r.squared
## [1] 0.4231167
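
To keep what the statistic measures in view, the R-Squared is just one minus the ratio of residual to total variation. Here is a minimal sketch computing it by hand (reg.1, ss.res, and ss.tot are names I introduce for this sketch; they are not from Shalizi or Ford):

#Compute the R-Squared by hand: 1 - (residual SS / total SS)
reg.1 <- lm(outcome.1 ~ predictor, dat.1)
ss.res <- sum(residuals(reg.1)^2)
ss.tot <- sum((dat.1$outcome.1 - mean(dat.1$outcome.1))^2)
1 - ss.res / ss.tot   #matches summary(reg.1)$r.squared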

Now, let's mess with the model above and give it huge errors. Tinker with the model yourself if you are reading this document in R; if you can't, check out the model below. Same data-generating process, but ten-times larger errors:

#Set up true model and data
error.2 <- 200
outcome.2 <- 20 + 0.8 * predictor + rnorm(n.obs, 0, error.2)
dat.2 <- data.frame(cbind(outcome.2, predictor))

#plot the data
ggplot(dat.2, aes(x = predictor, y = outcome.2)) + 
         geom_point() + geom_smooth(method="lm", se = F)

[Figure: scatterplot of outcome.2 against predictor with fitted regression line]

#Linear regression R-Squared
summary(lm(outcome.2 ~ predictor, dat.2))$r.squared
## [1] 0.01082093

Now the R-Squared is much smaller, even though we are still fitting the “true” model.
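
To see how tightly the R-Squared tracks error size rather than model correctness, a small sketch along the same lines (the error.sizes values are my own choices for illustration) refits the same true model under progressively larger errors:

#Refit the same true model while scaling up the error SD;
#the R-Squared shrinks even though the model is always correct
error.sizes <- c(10, 20, 50, 100, 200)
sapply(error.sizes, function(e) {
  y <- 20 + 0.8 * predictor + rnorm(n.obs, 0, e)
  summary(lm(y ~ predictor))$r.squared
})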

My Take

When I look at these comparisons, I'm quite underwhelmed by the argument. Sure, we have the “true” model, but I strain to see how this model is at all useful. If you use this model, you are still bound to be off in your predicted values.

High R^2 with Incorrect Model

Shalizi notes that you can get a high R-Squared with a poorly-fitted model. Following Ford’s lead, I’ll develop the point by example. Imagine we have a “true” model with non-linear effects:

predictor <- rexp(n.obs, rate=0.005)
error.3 <- 20
outcome.3 <- -100 + 0.01 * predictor + 1.25 * predictor^2 + 
          rnorm(n.obs, 0, error.3)
dat.3 <- data.frame(cbind(outcome.3, predictor))

#Plot the data
ggplot(dat.3, aes(x = predictor, y = outcome.3)) + 
         geom_point() + geom_smooth(method="lm", se = F)

[Figure: scatterplot of outcome.3 against predictor with fitted linear regression line]

#Linear regression R-Squared
reg.3 <- lm(outcome.3 ~ predictor, dat.3)
summary(reg.3)
## 
## Call:
## lm(formula = outcome.3 ~ predictor, data = dat.3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69805 -48338 -11725  38780 511727 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -81926.10   10427.02  -7.857     0.00000000000511 ***
## predictor      870.80      38.82  22.434 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70660 on 98 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.8354 
## F-statistic: 503.3 on 1 and 98 DF,  p-value: < 0.00000000000000022

So here we have a very high R-Squared, but the model is clearly misspecified! The relationship between x and y is non-linear.
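
For contrast, here is a quick sketch refitting the model with the squared term that the data-generating process actually contains (reg.3b is a name I introduce here):

#Refit with the quadratic term from the true data-generating process
reg.3b <- lm(outcome.3 ~ predictor + I(predictor^2), dat.3)
summary(reg.3b)$r.squared  #high again, but now the specification matches the data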

My Take

My view on this criticism is that, while I grant the point, this situation is most likely to trick someone who is not running regression diagnostics. If you diagnose properly, you should see several indications that the model is not well-specified. So while this argument has merit in a purely technical sense, its impact on practical modeling shouldn't be overestimated. Run your diagnostics.

library(olsrr)
ols_plot_resid_qq(reg.3)  #Errors deviate from a Normal distribution

[Figure: normal Q-Q plot of the residuals from reg.3]

ols_plot_resid_fit(reg.3)  #Residuals show a clear pattern against the fitted values

[Figure: residuals vs. fitted values plot for reg.3]
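
Beyond eyeballing plots, a formal specification check will also flag the problem. As one option (my addition here, not part of Shalizi's or Ford's discussion), Ramsey's RESET test from the lmtest package:

library(lmtest)
#RESET test: adds powers of the fitted values and tests whether they
#improve the fit; a small p-value signals a misspecified functional form
resettest(reg.3)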

Other Shortcomings & Uses

The R-Squared has other shortcomings. It is not defined for many classes of models. You can't use it to compare models fit to different data sets. It can't meaningfully compare a model with a transformed outcome against one with the untransformed outcome (see the sketch below). And it says little about predictive accuracy.
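
On the transformed-outcome point, for instance, the two R-Squared values below describe variance on different scales (raw y versus log y), so comparing them tells you nothing about which model is better. A toy sketch with data I make up here:

#R-Squared values on different outcome scales are not comparable
x <- rnorm(100)
y <- exp(1 + 0.5 * x + rnorm(100, 0, 0.5))
summary(lm(y ~ x))$r.squared        #variance explained on the raw scale
summary(lm(log(y) ~ x))$r.squared   #variance explained on the log scale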

The R-Squared does have one important feature: it is widely understood by audiences. People want it, and they can become frustrated and suspicious when you don't present it. If you do proper diagnostics and model well, think of it as a crude, widely-loved metric that will help assure some audience members that a model isn't completely divorced from the data. If you run proper diagnostics, model well, use other fit statistics, and are careful not to fetishize the R-Squared, it can help you communicate.
