The R-Squared Statistic

This tutorial explains the R-Squared fit statistic through its “manual” calculation in R. The best way to engage with this lesson is to run these commands as you read. Tinker with the values. Be hands-on. Packages used: ggplot2
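If you want to follow along, load ggplot2 before running the plotting commands below (installing it first if you do not already have it):

# ggplot2 is used for the scatterplots in this tutorial
# install.packages("ggplot2")   # uncomment if the package is not yet installed
library(ggplot2)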

What is the R-Squared?

Defined

The R-Squared is a fit statistic, a metric that we use to measure how well our regression model fits the observed data.

Conventionally, people interpret the R-Squared as the percentage of observed variance captured by a model. More strictly, the statistic is the ratio of (1) the variance in the dependent variable that we cannot attribute to model error (the discrepancy between the model’s predicted and observed values) to (2) the total variance observed in the dependent variable.

I see the R-Squared as a widely known, rough measure of model fit whose greatest benefit lies in communicating a point to audiences, rather than in serving as a powerful diagnostic to guide your analysis.

Calculating

The metric is the ratio of the regression sum of squares (SS_{reg}) to the total sum of squares (SS_{tot}):

R^{2} = SS_{reg} / SS_{tot}

Where:

  • SS_{reg} is the regression sum of squares, a metric that captures the portion of the dependent variable’s variation that our regression model accounts for, and
  • SS_{tot} is the total sum of squares, a metric that captures overall variance in the regression’s dependent variable.
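As a quick preview before we deconstruct each piece below, here is a minimal sketch of this ratio using R’s built-in mtcars data; the object names fit, ss.reg, and ss.tot are just for illustration:

# A quick preview using the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
ss.reg <- sum((fitted(fit) - mean(mtcars$mpg))^2)   # regression sum of squares
ss.tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
ss.reg / ss.tot            # manual R-squared
summary(fit)$r.squared     # should match R's reported value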

Total Sum of Squares

Explanation

The total sum of squares (SS_{tot}) is a measure of a variable’s overall variance (i.e., how much its scores vary). When this score is lower, the variable’s scores mostly cluster around the mean. When it is higher, scores are more spread out.

So, for example, the variance of people’s internal body temperatures is lower than that of outdoor temperatures, because people’s body temperatures differ by only one or two degrees Celsius, whereas the weather outside can vary by tens of degrees Celsius.

Calculating

The equation below shows how to calculate the total sum of squares. It is calculated by (1) subtracting the group’s overall mean from each observed score, (2) squaring each of the resulting terms, and then (3) adding all these individual terms together.

Let y_{mean} be the mean value of variable y, and let y_{i} be the i^{th} observation of y.

SS_{tot} = \sum(y_{i} - y_{mean})^2
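If it helps, you could wrap this formula in a small helper function; the name sstot is just for illustration:

# Total sum of squares: sum of squared deviations from the mean
sstot <- function(y) sum((y - mean(y))^2)
sstot(c(2, 4, 6))   # deviations are -2, 0, 2, so the result is 4 + 0 + 4 = 8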

Illustrated

The total sum of squares is one way to measure a variable’s variability. The higher the score, the higher the observed variability.

Consider two temperature measures: We measure the body temperatures of 10 randomly-chosen people, and the outdoor temperatures of 10 randomly-chosen places.

# Generate 10 randomly-generated body temperature scores.  
# Assume mean of 36.8 degrees, and a standard deviation of 0.4
body.temps <- rnorm(10, 36.8, 0.4)
body.temps <- round(body.temps, 1)  #Rounding the number to make handout more legible
body.temps
##  [1] 36.6 36.3 36.8 36.6 36.6 36.9 36.8 36.5 37.1 37.1
# Generate 10 hypothetical outdoor temperatures
# Assume mean of 17 degrees, and a standard deviation of 8
outdoor.temps <- rnorm(10, 17, 8)
outdoor.temps <- round(outdoor.temps, 1) #Rounding for legibility
outdoor.temps
##  [1] 22.2 18.9  8.6 11.7 26.1 22.5 14.7 10.2 13.5 10.3

The outdoor temperatures vary more widely than the body temperatures, and so the outdoor temperatures register a higher total sum of squares, our metric for measuring variance:

sum((body.temps - mean(body.temps))^2)
## [1] 0.601
sum((outdoor.temps - mean(outdoor.temps))^2)
## [1] 338.261

These figures do not have an easy, direct interpretation on their own. They are useful in the larger exercise of assessing model fit, however.
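To connect them to a more familiar quantity: dividing the total sum of squares by n - 1 gives R’s sample variance, var(). A quick check, assuming the vectors generated above:

# SS_tot divided by (n - 1) equals the sample variance
sum((outdoor.temps - mean(outdoor.temps))^2) / (length(outdoor.temps) - 1)
var(outdoor.temps)   # should print the same value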

Calculation Deconstructed

Let’s begin by randomly generating a data set with 10 people’s ages, with a mean age of 45 and a standard deviation of 9.

ages <- rnorm(10, 45, 9)  #generate 10 random numbers, normal distribution,
                          #mean 45 and standard deviation of 9

ages <- round(ages, 0)    #Rounding variable to zero decimal places, so we
                          #don't have to look at decimals when I list them below.
ages
##  [1] 26 40 50 43 48 69 52 45 45 44

The mean age in this set is:

mean(ages)
## [1] 46.2

Calculate the difference between each observed value and the mean. That gives you y_{i} - y_{mean} for each individual observation:

ages - mean(ages)
##  [1] -20.2  -6.2   3.8  -3.2   1.8  22.8   5.8  -1.2  -1.2  -2.2

Then square each term:

(ages - mean(ages))^2
##  [1] 408.04  38.44  14.44  10.24   3.24 519.84  33.64   1.44   1.44   4.84

Then sum the terms together. This will give you the total sum of squares score:

sum((ages - mean(ages))^2)  #This is the total sum of squares
## [1] 1035.6

A side note: We square the terms so that each deviation contributes a positive amount, and the negative deviations do not cancel out the positive ones when we sum them. If we did not square the terms, the deviations would always sum to (essentially) zero. Try it out. Run the script a few times to try it with different randomly generated numbers:

ages - mean(ages)
##  [1] -20.2  -6.2   3.8  -3.2   1.8  22.8   5.8  -1.2  -1.2  -2.2
sum(ages - mean(ages))
## [1] -0.00000000000002842171

Residual Sum of Squares

Our Example

Above, we focused on the fact that our dependent variable scores vary. We see different scores on some metric (e.g., temperature), and we might run a regression to figure out how to predict that metric score using other variables (say latitude or altitude).

For this part of the lesson, let’s focus on a new example: The owner of Paradise Ice Cream asks our friend Bob for help predicting ice cream sales. Better predictions will allow her to better estimate how much ice cream to make the night before. We are going to use temperature forecasts to predict ice cream sales.

Generate Data. Let’s begin by generating 60 hypothetical days, on which the daily forecasted temperature is Normally distributed with a mean of 25 degrees Celsius (SD = 4 degrees):

temps <- rnorm(60, 25, 4)
temps <- round(temps) #round for legibility
temps
##  [1] 24 25 25 31 32 32 29 25 26 24 20 23 20 31 23 23 24 28 28 21 21 32 20
## [24] 20 23 22 23 21 27 21 17 23 29 19 28 19 19 24 20 20 24 27 27 29 33 24
## [47] 24 29 24 30 22 25 24 27 21 26 24 17 24 29

Bob doesn’t know it yet, but in reality there is a relationship between temperature and Paradise Ice Cream sales. The true model is:

SALES = $300 + $90 \times TEMP + \epsilon

Where SALES is the ice cream shop’s total dollar sales, and TEMP is the forecasted temperature in degrees Celsius. So the model says that, for each additional degree of temperature, Paradise Ice Cream sells $90 more in ice cream. The \epsilon term captures our model’s residual error – the number of dollars by which ice cream sales deviate from our predictions.

Let’s generate some hypothetical numbers. Imagine that our predictions are typically off by plus or minus $500, and that these errors are Normally distributed.

error.1 <- rnorm(60, 0, 500)   #Generating 60 days' of error terms
error.1 <- round(error.1)

sales <- 300 + 90 * temps + error.1
sales <- round(sales, 0)  #rounding for legibility
sales
##  [1] 2356 2997 3269 2729 3180 3313 2092 1965 2179 2500 2367 2744 2673 3309
## [15] 2924 3015 2042 3841 3434 2660 2287 3532 2059 1667 2123 2022 1293 2339
## [29] 2942 2340 1249 2101 2679 1948 2447 1645 1951 2206 2050 2903 2354 2346
## [43] 2649 3149 3260 2460 2269 2473 2019 2724 1729 2758 2678 2776 2723 2709
## [57] 2642 2350 2638 2366

Let’s compile our sales and temps vectors into a data frame, and then run a regression. The relationship is highly significant as specified above. You can change the regression results by altering the “error” and “sales” vectors above:

#Compile into data frame
icecream <- data.frame(cbind(sales, temps))
head(icecream)  #Take a look at the first few rows of the data frame
##   sales temps
## 1  2356    24
## 2  2997    25
## 3  3269    25
## 4  2729    31
## 5  3180    32
## 6  3313    32
#The relationship visualized:
ggplot(icecream, aes(x = temps, y = sales)) + geom_point() + geom_smooth(method="lm", se= F)

[Figure: scatterplot of sales against temps, with the fitted regression line]

#Run the regression
reg.1 <- lm(sales ~ temps, icecream)
summary(reg.1)
## 
## Call:
## lm(formula = sales ~ temps, data = icecream)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1088.72  -287.65    16.32   242.20  1049.45 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)   496.46     342.71   1.449       0.153    
## temps          81.97      13.79   5.944 0.000000169 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 422.6 on 58 degrees of freedom
## Multiple R-squared:  0.3785, Adjusted R-squared:  0.3678 
## F-statistic: 35.33 on 1 and 58 DF,  p-value: 0.0000001688

Note the R-Squared on this output, but don’t dwell on it yet. Let’s make these calculations by hand, so that we can see how they work.

Let’s add the model’s predicted values to the “icecream” data frame:

icecream$predicted <- fitted(reg.1)
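As a side note, calling predict() on the fitted model without new data returns the same values as fitted(); either would work here:

# fitted() and predict() (with no new data) give the same predictions
head(fitted(reg.1))
head(predict(reg.1))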

Calculating Residuals

Residuals (or “error”) refer to the difference between our model’s predicted outcome values and the data’s observed outcome values.

resid = y_{observed} - y_{predicted}

Let’s create a new variable in our “icecream” data frame that contains our model’s residual:

icecream$resid = icecream$sales - icecream$predicted
head(icecream, 10)
##    sales temps predicted      resid
## 1   2356    24  2463.684 -107.68390
## 2   2997    25  2545.652  451.34841
## 3   3269    25  2545.652  723.34841
## 4   2729    31  3037.458 -308.45775
## 5   3180    32  3119.425   60.57455
## 6   3313    32  3119.425  193.57455
## 7   2092    29  2873.522 -781.52237
## 8   1965    25  2545.652 -580.65159
## 9   2179    26  2627.619 -448.61928
## 10  2500    24  2463.684   36.31610
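R also computes these residuals for you; residuals(reg.1) should match the column we just calculated by hand:

# Built-in residuals should match our manual calculation
head(residuals(reg.1))
head(icecream$resid)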

Sum of Squared Residuals

These residuals are deviations, like the ones we worked with in the earlier section on the total sum of squares. They capture the difference between predicted and observed values, rather than the difference between each observed value and the overall mean.

To calculate the residual sum of squares (SS_{resid}):

SS_{residual} = \sum(y_{observed} - y_{predicted})^2

We calculate below:

#Calculate the residual sum of squares
SS.resid <- sum((icecream$sales - icecream$predicted)^2)
round(SS.resid, 3)
## [1] 10357420
#And the total sum of squares, while we're at it:
SS.tot <- sum((icecream$sales - mean(icecream$sales))^2)
round(SS.tot, 3)
## [1] 16665834
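As a check, R will report the residual sum of squares directly from the fitted model; for an lm object, deviance() returns this quantity:

# The residual sum of squares, straight from the model object
deviance(reg.1)
sum(residuals(reg.1)^2)   # same value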

Residuals and Overall Variance

If we take the ratio of SS_{resid} and SS_{total}:

round(SS.resid / SS.tot, 3)
## [1] 0.621

The resulting ratio gives us the proportion of the total variance that we can attribute to model error, the degree to which observed values differ from predicted values.

Note that, if you subtract this figure from one, the result is the R-Squared of our regression model above:

round(1 - SS.resid / SS.tot, 3)
## [1] 0.379
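You can confirm this against the value R stores in the model summary; summary(reg.1)$r.squared holds the same statistic:

# Compare our manual figure with the one R reports
round(summary(reg.1)$r.squared, 3)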

The explanation follows.

Regression Sum of Squares

The regression sum of squares (SS_{reg}) is a metric that tries to capture how much of the variance estimated by SS_{tot} is predicted by our model. This is the conceptual framework by which we estimate it:

  1. The total sum of squares (SS_{tot}) measures variation in the dependent variable.
  2. The residual sum of squares (SS_{resid}) is the amount of dependent variable variation attributable to the model’s failure to predict values.
  3. The remainder is presumed to be the amount of variation that is predicted by the model (see the sketch below):

SS_{reg} = SS_{tot} - SS_{resid}

so that

R^{2} = SS_{reg} / SS_{tot} = 1 - SS_{resid} / SS_{tot}
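A minimal sketch with the ice cream example, reusing the SS.tot and SS.resid objects computed above:

# Regression sum of squares as the remainder of total minus residual
SS.reg <- SS.tot - SS.resid
round(SS.reg / SS.tot, 3)          # R-squared, via SS_reg / SS_tot
round(1 - SS.resid / SS.tot, 3)    # same value, via 1 - SS_resid / SS_tot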

Deconstructed Example

Generate Data

Imagine a hypothetical sample in which we asked 100 people to rate Funzo’s Burger House on a scale of zero to 100. Assume that Funzo’s is for kids, and people like it less as they age. We assume the true model underlying the data is:

RATING = 100 - 2 * AGE + \epsilon

Where RATING is the person’s rating of Funzo’s, AGE is their age in years, and \epsilon is the error term.

#Create the data

age <- runif(100, 5, 70)   #Generate 100 values from a uniform distribution between 5 and 70
error <- rnorm(100,0, 15)  #Set expected model error at 15 points
rating <- 100 - (2 * age) + error
funzo <- data.frame(cbind(rating, age))

Here is a scatterplot, to depict the bivariate regression. Play with the numbers above and see the changes…

ggplot(funzo, aes(x = age, y = rating)) + geom_point() + geom_smooth(method = "lm", se=F)

[Figure: scatterplot of rating against age, with the fitted regression line]

Regression

Next, run the regression:

reg.2 <- lm(rating~age, funzo)
summary(reg.2)
## 
## Call:
## lm(formula = rating ~ age, data = funzo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.902  -9.897  -0.289   7.944  39.270 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 95.99133    3.18496   30.14 <0.0000000000000002 ***
## age         -1.89573    0.07581  -25.00 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.44 on 98 degrees of freedom
## Multiple R-squared:  0.8645, Adjusted R-squared:  0.8631 
## F-statistic: 625.3 on 1 and 98 DF,  p-value: < 0.00000000000000022

Note the R-Squared. Calculate the fitted values and error terms, and then add them to the data set:

funzo$predicted <- fitted(reg.2)
funzo$error <- funzo$rating - funzo$predicted
head(funzo,10)
##       rating      age predicted       error
## 1  -11.58271 46.50208   7.83606 -19.4187687
## 2  -22.28820 66.67941 -30.41465   8.1264496
## 3   60.53316 14.76919  67.99298  -7.4598183
## 4   19.73782 32.07206  35.19145 -15.4536250
## 5  -34.78134 68.10289 -33.11319  -1.6681523
## 6   25.30219 37.20698  25.45703  -0.1548443
## 7   37.16856 41.00807  18.25121  18.9173463
## 8   35.28165 30.14312  38.84820  -3.5665496
## 9  -47.37257 67.86045 -32.65358 -14.7189878
## 10 -17.70322 45.29438  10.12554 -27.8287609

Calculation

Now we can calculate our variance and fit statistics:

#Calculate our sums of squares:
ss.total <- sum((funzo$rating-mean(funzo$rating))^2)
ss.total
## [1] 150803.8
ss.residual <- sum((funzo$rating - funzo$predicted)^2)
ss.residual
## [1] 20433.77
#And we calculate the R-Squared:

r.squared <- 1 - (ss.residual/ss.total)
r.squared
## [1] 0.8645009

This should match the R-Squared given in the regression output above. Your results will differ depending on your session’s random seed (see the set.seed() command).
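If you want your numbers to reproduce exactly from run to run, set the seed before generating data; the value 123 below is arbitrary:

set.seed(123)    # any integer works; 123 is arbitrary
runif(3, 5, 70)  # these draws will now be identical every time you rerun the script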
