This tutorial explains the R-Squared fit statistic through its “manual” calculation in R. The best way to engage with this lesson is to run these commands as you read. Tinker with the values. Be hands-on. *Packages used:* **ggplot2**

## What is the R-Squared?

#### Defined

The R-Squared is a **fit statistic**, a metric that we use to measure how well our regression model fits the observed data.

Conventionally, people interpret the R-Squared as the percentage of observed variance captured by a model. More strictly, the statistic expresses the ratio of (1) the variance in the dependent variable that we cannot attribute to *model error*, the discrepancy between the model’s predicted and observed values and (2) the variance observed within the model’s dependent variable values.

I see the metric as a widely-known, rough measure of model fit whose greatest benefit lies in communicating a point to an audience, rather than in being a powerful metric to guide your analysis.

#### Calculating

The metric is calculated as the ratio of the regression sum of squares (\(SS_{reg}\)) to the total sum of squares (\(SS_{tot}\)):

\[R^2 = \frac{SS_{reg}}{SS_{tot}}\]

Where:

- \(SS_{reg}\) is the *regression sum of squares*, a metric that is widely used to describe the proportion of a variable’s variance that is captured by our regression model, and
- \(SS_{tot}\) is the *total sum of squares*, a metric that captures overall variance in the regression’s dependent variable.
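As a toy illustration of the arithmetic (the sums of squares here are hypothetical numbers, not drawn from any data set):

```
# Hypothetical sums of squares, just to show the arithmetic of the ratio:
ss.reg <- 75   #variance captured by the model (hypothetical)
ss.tot <- 100  #total variance in the dependent variable (hypothetical)
ss.reg / ss.tot #an R-Squared of 0.75
```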

## Total Sum of Squares

#### Explanation

The **total sum of squares** (\(SS_{tot}\)) is a measure of a variable’s overall variance (i.e., how much its scores vary). When this variance score is lower, the variable’s scores mostly cluster around the mean. When it is higher, scores are more spread out.

So, for example, the variance of people’s internal body temperatures is lower than that of outdoor temperatures, because people’s body temperatures will only differ by one or two degrees Celsius, whereas the weather outside can vary by tens of degrees Celsius.

#### Calculating

The equation below shows how to calculate the *total sum of squares*. It is calculated by (1) subtracting the group’s overall mean from each observed score, (2) squaring each of the resulting terms, and then (3) adding all these individual terms together.

Let \(\bar{x}\) be the mean value of variable \(x\), and let \(x_i\) be the \(i^{th}\) observation of \(x\):

\[SS_{tot} = \sum_{i=1}^{n}(x_i - \bar{x})^2\]

#### Illustrated

The *total sum of squares* is one way to measure a variable’s variability. The higher the score, the higher the observed variability.

Consider two temperature measures: We measure the body temperatures of 10 randomly-chosen people, and the outdoor temperatures of 10 randomly-chosen places.

```
# Generate 10 randomly-generated body temperature scores.
# Assume mean of 36.8 degrees, and a standard deviation of 0.4
body.temps <- rnorm(10, 36.8, 0.4)
body.temps <- round(body.temps, 1) #Rounding the number to make handout more legible
body.temps
```

## [1] 36.6 36.3 36.8 36.6 36.6 36.9 36.8 36.5 37.1 37.1

```
# Generate 10 hypothetical outdoor temperatures
# Assume mean of 17 degrees, and a standard deviation of 8
outdoor.temps <- rnorm(10, 17, 8)
outdoor.temps <- round(outdoor.temps, 1) #Rounding for legibility
outdoor.temps
```

## [1] 22.2 18.9 8.6 11.7 26.1 22.5 14.7 10.2 13.5 10.3

The outdoor temperatures vary more widely than the body temperatures, and so the outdoor temperatures register a higher *total sum of squares*, our metric for measuring variance:

```
sum((body.temps - mean(body.temps))^2)
```

## [1] 0.601

```
sum((outdoor.temps - mean(outdoor.temps))^2)
```

## [1] 338.261

These figures do not have an easy, direct interpretation on their own. They are useful in the larger exercise of assessing model fit, however.
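A side connection, in case it helps: dividing the *total sum of squares* by \(n - 1\) yields the familiar sample variance, which R computes with **var()**. A quick check, using the body-temperature scores listed above:

```
body.temps <- c(36.6, 36.3, 36.8, 36.6, 36.6, 36.9, 36.8, 36.5, 37.1, 37.1)
ss.tot <- sum((body.temps - mean(body.temps))^2)
ss.tot / (length(body.temps) - 1) #total sum of squares over (n - 1)...
var(body.temps)                   #...equals R's sample variance
```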

## Calculation Deconstructed

Let’s begin by randomly generating a data set with 10 people’s ages, with a mean age of 45 and a standard deviation of 9.

```
ages <- rnorm(10, 45, 9) #generate 10 random numbers, normal distribution,
#mean 45 and standard deviation of 9
ages <- round(ages, 0) #Rounding variable to zero decimal places, so we
#don't have to look at decimals when I list them below.
ages
```

## [1] 26 40 50 43 48 69 52 45 45 44

The mean age in this set is:

```
mean(ages)
```

## [1] 46.2

Calculate the difference between each observed value and its mean. That will get you \((x_i - \bar{x})\) for each individual value:

```
ages - mean(ages)
```

## [1] -20.2 -6.2 3.8 -3.2 1.8 22.8 5.8 -1.2 -1.2 -2.2

Then square each term:

```
(ages - mean(ages))^2
```

## [1] 408.04 38.44 14.44 10.24 3.24 519.84 33.64 1.44 1.44 4.84

Then sum the terms together. This will give you the total sum of squares score:

```
sum((ages - mean(ages))^2) #This is the total sum of squares
```

## [1] 1035.6

*A side note:* We square the terms to ensure that variance is always expressed as a positive number, so that when we sum the individual scores together the negatives do not cancel the positive ones out. If we did not square the terms, then our variance estimates would be virtually zero. Try it out. Run the script a few times to try it with different randomly generated numbers:

```
ages - mean(ages)
```

## [1] -20.2 -6.2 3.8 -3.2 1.8 22.8 5.8 -1.2 -1.2 -2.2

```
sum(ages - mean(ages))
```

## [1] -0.00000000000002842171
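The deviations cancel because the mean is, by definition, the balance point of the data; squaring makes every term non-negative. A minimal check with the ages listed above:

```
ages <- c(26, 40, 50, 43, 48, 69, 52, 45, 45, 44)
sum(ages - mean(ages))     #essentially zero (floating-point noise)
sum((ages - mean(ages))^2) #strictly positive
```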

## Residual Sum of Squares

#### Our Example

Above, we focused on the fact that our dependent variable scores vary. We see different scores on some metric (e.g., temperature), and we might run a regression to figure out how to predict that metric score using other variables (say latitude or altitude).

For this part of the lesson, let’s focus on a new example: The owner of Paradise Ice Cream asks our friend Bob for help predicting ice cream sales. Better predictions will allow her to better estimate how much ice cream to make the night before. We are going to use temperature forecasts to predict ice cream sales.

**Generate Data.** Let’s begin by generating 60 hypothetical days, in which the daily forecasted temperature is Normally distributed with a mean of 25 degrees Celsius (SD = 4 degrees):

```
temps <- rnorm(60, 25, 4)
temps <- round(temps) #round for legibility
temps
```

## [1] 24 25 25 31 32 32 29 25 26 24 20 23 20 31 23 23 24 28 28 21 21 32 20
## [24] 20 23 22 23 21 27 21 17 23 29 19 28 19 19 24 20 20 24 27 27 29 33 24
## [47] 24 29 24 30 22 25 24 27 21 26 24 17 24 29

Bob doesn’t know it yet, but in reality there is a relationship between temperature and Paradise Ice Cream sales. The true model is:

\[sales = 300 + 90 \times temp + \epsilon\]

Where \(sales\) is the ice cream shop’s total dollar sales, and \(temp\) is the forecasted temperature in degrees Celsius. So the model predicts that, for each degree of higher temperature, Paradise Ice Cream sells $90 more in ice cream. The \(\epsilon\) term captures our model’s residual error – the number of dollars by which ice cream sales are off from our predictions.

Let’s generate some hypothetical numbers. Let’s imagine that, on average, our model is typically off by plus/minus $500, and that the errors are Normally distributed.

```
error.1 <- rnorm(60, 0, 500) #Generating 60 days' of error terms
error.1 <- round(error.1)
sales <- 300 + 90 * temps + error.1
sales <- round(sales, 0) #rounding for legibility
sales
```

## [1] 2356 2997 3269 2729 3180 3313 2092 1965 2179 2500 2367 2744 2673 3309
## [15] 2924 3015 2042 3841 3434 2660 2287 3532 2059 1667 2123 2022 1293 2339
## [29] 2942 2340 1249 2101 2679 1948 2447 1645 1951 2206 2050 2903 2354 2346
## [43] 2649 3149 3260 2460 2269 2473 2019 2724 1729 2758 2678 2776 2723 2709
## [57] 2642 2350 2638 2366

Let’s compile our sales and temps vectors into a data frame, and then run a regression. The relationship is highly significant, as I wrote it above. You can change the regression results by changing my formulations of the “error.1” and “sales” vectors above:

```
#Compile into data frame
icecream <- data.frame(sales, temps)
head(icecream) #Take a look at the first few rows of the data frame
```

##   sales temps
## 1  2356    24
## 2  2997    25
## 3  3269    25
## 4  2729    31
## 5  3180    32
## 6  3313    32

```
#The relationship visualized:
library(ggplot2) #load the ggplot2 package for plotting
ggplot(icecream, aes(x = temps, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```
#Run the regression
reg.1 <- lm(sales ~ temps, icecream)
summary(reg.1)
```

## 
## Call:
## lm(formula = sales ~ temps, data = icecream)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1088.72  -287.65    16.32   242.20  1049.45 
## 
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)   496.46     342.71   1.449       0.153    
## temps          81.97      13.79   5.944 0.000000169 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 422.6 on 58 degrees of freedom
## Multiple R-squared:  0.3785, Adjusted R-squared:  0.3678 
## F-statistic: 35.33 on 1 and 58 DF,  p-value: 0.0000001688

Note the R-Squared on this output, but don’t dwell on it yet. Let’s make these calculations by hand, so that we can see how this works.

Let’s add the model’s predicted values to the “icecream” data frame:

```
icecream$predicted <- fitted(reg.1)
```
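If you want to see where **fitted()** gets these numbers, you can reproduce them from the estimated coefficients. The sketch below regenerates the example data with a hypothetical seed, so its numbers will differ from the output shown elsewhere in this handout:

```
set.seed(1) #hypothetical seed, for reproducibility
temps <- round(rnorm(60, 25, 4))
sales <- round(300 + 90 * temps + rnorm(60, 0, 500))
icecream <- data.frame(sales, temps)
reg.1 <- lm(sales ~ temps, icecream)

#fitted() applies the estimated equation: intercept + slope * temps
by.hand <- coef(reg.1)[1] + coef(reg.1)[2] * icecream$temps
all.equal(unname(by.hand), unname(fitted(reg.1))) #TRUE
```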

#### Calculating Residuals

**Residuals** (or “error”) refer to the difference between our model’s predicted outcome values and the data’s observed outcome values.

Let’s create a new variable in our “icecream” data frame that contains our model’s residual:

```
icecream$resid <- icecream$sales - icecream$predicted
head(icecream, 10)
```

##    sales temps predicted      resid
## 1   2356    24  2463.684 -107.68390
## 2   2997    25  2545.652  451.34841
## 3   3269    25  2545.652  723.34841
## 4   2729    31  3037.458 -308.45775
## 5   3180    32  3119.425   60.57455
## 6   3313    32  3119.425  193.57455
## 7   2092    29  2873.522 -781.52237
## 8   1965    25  2545.652 -580.65159
## 9   2179    26  2627.619 -448.61928
## 10  2500    24  2463.684   36.31610

#### Sum of Squared Residuals

These residual scores are deviation scores, like the ones we worked with in the earlier section on the *total sum of squares*. They just capture the difference between predicted and observed values, instead of the difference between individual observed values and the overall mean.

To calculate the *residual sum of squares* (\(SS_{resid}\)), where \(y_i\) is the \(i^{th}\) observed value and \(\hat{y}_i\) is the model’s prediction for it:

\[SS_{resid} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

We calculate \(SS_{resid}\) below:

```
#Calculate the residual sum of squares
SS.resid <- sum((icecream$sales - icecream$predicted)^2)
round(SS.resid, 3)
```

## [1] 10357420

```
#And the total sum of squares, while we're at it:
SS.tot <- sum((icecream$sales - mean(icecream$sales))^2)
round(SS.tot, 3)
```

## [1] 16665834

#### Residuals and Overall Variance

If we take the ratio of \(SS_{resid}\) to \(SS_{tot}\):

```
round(SS.resid / SS.tot, 3)
```

## [1] 0.621

The resulting ratio gives us the proportion of the total variance that we can attribute to **model error**, the degree to which observed values differ from predicted values.

Note that, if you subtract this figure from one, the result is the R-Squared of our regression model above:

```
round(1 - SS.resid / SS.tot, 3)
```

## [1] 0.379
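Since the data above were generated without a fixed seed, here is a self-contained check (with a hypothetical seed) that this identity matches the R-Squared that **lm()** itself reports:

```
set.seed(42) #hypothetical seed, for reproducibility
temps <- round(rnorm(60, 25, 4))
sales <- round(300 + 90 * temps + rnorm(60, 0, 500))
icecream <- data.frame(sales, temps)
reg.1 <- lm(sales ~ temps, icecream)

SS.resid <- sum(resid(reg.1)^2)
SS.tot <- sum((icecream$sales - mean(icecream$sales))^2)
all.equal(1 - SS.resid / SS.tot, summary(reg.1)$r.squared) #TRUE
```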

The explanation follows.

## Regression Sum of Squares

The **regression sum of squares** (\(SS_{reg}\)) is a metric that tries to capture how much of the variance estimated by \(SS_{tot}\) is predicted by our model. This is the conceptual framework by which we estimate this metric:

- The *total sum of squares* (\(SS_{tot}\)) measures variation in the dependent variable.
- The *residual sum of squares* (\(SS_{resid}\)) is the amount of dependent variable variance attributable to the model’s failure to predict values.
- The remainder is presumed to be the amount of variance that *is* predicted by the model:

\[SS_{reg} = SS_{tot} - SS_{resid}\]
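For a regression with an intercept, this “remainder” definition agrees with computing the regression sum of squares directly as the squared deviations of the fitted values from the mean. A sketch with toy data (hypothetical seed and variables):

```
set.seed(7) #hypothetical seed
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
m <- lm(y ~ x)

ss.tot <- sum((y - mean(y))^2)
ss.resid <- sum(resid(m)^2)
ss.reg <- sum((fitted(m) - mean(y))^2) #direct formula
all.equal(ss.tot - ss.resid, ss.reg)   #TRUE
```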

## Deconstructed Example

#### Generate Data

Imagine a hypothetical sample in which we asked 100 people to rate Funzo’s Burger House on a scale of zero to 100. Assume that Funzo’s is for kids, and people like it less as they age. We assume the true model underlying the data is:

\[rating = 100 - 2 \times age + \epsilon\]

Where \(rating\) is the person’s rating of Funzo’s, \(age\) is their age in years, and \(\epsilon\) is the error term.

```
#Create the data
age <- runif(100, 5, 70) #Generate 100 values from a uniform distribution between 5 and 70
error <- rnorm(100, 0, 15) #Error term with a standard deviation of 15 points
rating <- 100 - (2 * age) + error
funzo <- data.frame(rating, age)
```

Here is a scatterplot, to depict the bivariate regression. Play with the numbers above and see the changes…

```
ggplot(funzo, aes(x = age, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

#### Regression

Next, run the regression:

```
reg.2 <- lm(rating ~ age, funzo)
summary(reg.2)
```

## 
## Call:
## lm(formula = rating ~ age, data = funzo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.902  -9.897  -0.289   7.944  39.270 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 95.99133    3.18496   30.14 <0.0000000000000002 ***
## age         -1.89573    0.07581  -25.00 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.44 on 98 degrees of freedom
## Multiple R-squared:  0.8645, Adjusted R-squared:  0.8631 
## F-statistic: 625.3 on 1 and 98 DF,  p-value: < 0.00000000000000022

Note the R-Squared. Calculate the fitted values and error terms, and then add them to the data set:

```
funzo$predicted <- fitted(reg.2)
funzo$error <- funzo$rating - funzo$predicted
head(funzo,10)
```

##       rating      age predicted       error
## 1  -11.58271 46.50208   7.83606 -19.4187687
## 2  -22.28820 66.67941 -30.41465   8.1264496
## 3   60.53316 14.76919  67.99298  -7.4598183
## 4   19.73782 32.07206  35.19145 -15.4536250
## 5  -34.78134 68.10289 -33.11319  -1.6681523
## 6   25.30219 37.20698  25.45703  -0.1548443
## 7   37.16856 41.00807  18.25121  18.9173463
## 8   35.28165 30.14312  38.84820  -3.5665496
## 9  -47.37257 67.86045 -32.65358 -14.7189878
## 10 -17.70322 45.29438  10.12554 -27.8287609

#### Calculation

Now we can calculate our variance and fit statistics:

```
#Calculate our sums of squares:
ss.total <- sum((funzo$rating-mean(funzo$rating))^2)
ss.total
```

## [1] 150803.8

```
ss.residual <- sum((funzo$rating - funzo$predicted)^2)
ss.residual
```

## [1] 20433.77

```
#And we calculate the R-Squared:
r.squared <- 1 - (ss.residual/ss.total)
r.squared
```

## [1] 0.8645009

This should match the R-Squared given in the regression output above. The results will be different depending on your session’s random seed (see the command **set.seed()**).
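If you want identical numbers across runs, fix the random seed before generating data. A minimal illustration:

```
set.seed(123)
a <- rnorm(5)
set.seed(123)
b <- rnorm(5)
identical(a, b) #TRUE: same seed, same draws
```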