The American Community Survey is a workhorse survey series in demography, geography, and many other fields. This post provides a walk-through of an analysis that I conducted with the 2018 release of the American Community Survey’s five-year sample. Its examples will focus on assessing the prevalence and migration patterns of New York State’s elderly population. For a more complicated analysis that conducts these comparisons across states, review the Markdown file associated with my recent research note on the topic.
- Introduction
- Downloading the Data
- Set Up Survey Object
- Creating or Changing Variables
- Subsetting
- Parameter Estimates
The American Community Survey (ACS) is a large-scale, high-quality running survey program run by the United States Census Bureau. It samples thousands of U.S. households annually through multiple survey projects, collecting information on their residences, demographics, household finances, and other topics. The set is a widely-used basis for estimating more detailed information that is not captured in the U.S. Census, and for estimates changes in the U.S. population that have occurred since the Census.
We begin by starting our script as a fresh session:
#Start Session
#Clear Memory
rm(list=ls())
gc()
#Set directory to folder where I plan to store the data
directory <- "E:/Dropbox/Research/ACS/2018/5-Year"
setwd(directory)
# Load Packages
library(curl) #For downloading data
library(survey) #Package for survey analysis
Downloading the Data
The are distributed through a structured folder and file naming system. The survey data we wish to access is distributed through the same web folder: https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/

The data are distributed as state-specific CSV files. Each state-level file is demarcated using a standard two-letter state naming system. This series is stored in R’s base package as state.abb, but as lower-case characters.
I ran the operation as a loop:
states <- tolower(state.abb) #Convert *state.abb* to lower-case
for (i in states){
#Part 1: Downalod and Unzip Person Files
#Create state directory
dir.create(paste0(directory,"/", i))
setwd(paste0(directory, "/", i))
#Create objects with file URL & destination
#based on state abbreviation
source_url <-
paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_p",
i, ".zip")
dest_url <- paste0(directory, "/", i, "/csv_p", i, ".zip")
#Download and unzip
curl_download(source_url, dest_url)
unzip(paste0("csv_p", i, ".zip"))
temp.nam <- list.files(pattern = "\\.csv$")
temp.dat <- read.csv(temp.nam)
#Save data as RDS in main directory
setwd(directory)
saveRDS(temp.dat, file = paste0("acs2018_5yp_", i, ".RDS"))
#Part 2: Download & Unzip Household Files
#Same as Part 1, but applied to csv_h* files
setwd(paste0(directory, "/", i))
source_url <-
paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_h",
i, ".zip")
dest_url <- paste0(directory, "/", i, "/csv_h", i, ".zip")
curl_download(source_url, dest_url)
unzip(paste0("csv_p", i, ".zip"))
temp.nam <- list.files(pattern = "\\.csv$")
temp.dat <- read.csv(temp.nam)
temp.dat$STATE <- i
setwd(directory)
saveRDS(temp.dat, file = paste0("acs2018_5yh_", i, ".RDS"))
}
The result is that I have all the state-specific household and person data files stored in our data directory. In this example, we will work with the Arkansas person set:
setwd(directory)
DAT.AL <- readRDS("acs2018_5yp_al.RDS")
Set Up Survey Object
We are going to use the survey package to analyze these data (documentation here). The package author, Thomas Lumley, has written an excellent introduction to survey analysis using this package: Complex Surveys: A Guide to Analysis Using R (2010, Wiley).
The first step in the process of analyzing these data is to create the survey design object, which combines the respondents’ answers along with information that helps us make more sophisticated population estimates from our sample data. The information required to set up this set’s survey object is in its readme file. The code that I’m using is built on Anthony Damico’s scripts. Note that I am creating an object for a specific state survey. My strategy is to get state-level estimates through state-specific objects, and then consolidate them after estimation.
ACS_DESIGN_AL <-
svrepdesign(
weight = ~PWGTP , #Person's weight
repweights = 'PWGTP[0-9]+' , #Replicate person weighting variables
scale = 4 / 80 , #See page 15
rscales = rep( 1 , 80 ) ,
#Scales constant across 80 replicates
mse = TRUE ,
type = 'JK1' ,
#Jacknife pop corr, for unstratified samples
data = DAT.AL
#Using Alabama data in this analysis
)
And now we can engage the data.
Creating or Changing Variables
Use the update() function to add or change variables in the survey object. For example, I create a variable that designates respondents as “elderly” if they are at least 65 years old:
ACS_DESIGN_AL <- update(ACS_DESIGN_AL, #Update the object
ELDERLY = factor(AGEP>65) #Creating a variable
)
Refer to this set’s data dictionary for variable names, explanations, and codes.
Subsetting
Use subset() to subset your survey sample:
ACS_DESIGN_AL_ELDERLY <- subset(ACS_DESIGN_AL, AGEP > 65)
The new object restricts its estimates to elderly Alabamans, defined as age 65+.
Parameter Estimates
Continuous Univariate
Using the AGEP variable, which measures respondents’ ages. The variable is top-coded at 99 years.
Use svymean() for a parameter mean.
#Mean age:
svymean( ~ AGEP , ACS_DESIGN_AL )
## mean SE
## AGEP 38.995 0.0105
So, the mean age of Alabamans, as suggested by this data, is 38.995 years old, with a standard error of 0.01 on that estimate.
#Median age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.5 , na.rm = TRUE )
## Statistic:
## AGEP
## q0.5 38
## SE:
## AGEP
## q0.5 0.2511995
#10th percentile age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.1 , na.rm = TRUE )
## Statistic:
## AGEP
## q0.1 8
## SE:
## AGEP
## q0.1 0.2511995
The median Alabaman is 38 years old. The 10th percentile in this distribution is 8 years old.
Discrete Univariate
FOr discrete variables, svymean() works.
#Proporotion elderly
svymean( ~ ELDERLY , ACS_DESIGN_AL )
## mean SE
## ELDERLYFALSE 0.8504 3e-04
## ELDERLYTRUE 0.1496 3e-04
An estimated 14.96% of Albama is “elderly”, which we defined as over 65 years old.
If the operation seems to be giving you a mean rather than a proportion, then respecify the variables as a factor using factor().
Discrete-Discrete Bivariate
Use svyby() in conjunction with svymean() for cross-tabs. In the operation svyby(x, y), the results will express proportions of “x” across categories of “y”. This command looks at the proportion of ELDERLY by categories of SEX:
#Elderly by sex
svyby( ~ ELDERLY , ~ SEX , ACS_DESIGN_AL , svymean )
## SEX ELDERLYFALSE ELDERLYTRUE se1 se2
## 1 1 0.8665133 0.1334867 0.0003242783 0.0003242783
## 2 2 0.8352724 0.1647276 0.0004199916 0.0004199916
The results suggest that 13.3% of males in Alabama are elderly. This figure is 16.5% for females.
#Elderly by sex
svyby( ~ factor(SEX) , ~ ELDERLY , ACS_DESIGN_AL , svymean )
## ELDERLY factor(SEX)1 factor(SEX)2 se1 se2
## FALSE FALSE 0.4932625 0.5067375 0.0003111538 0.0003111538
## TRUE TRUE 0.4319345 0.5680655 0.0007504035 0.0007504035
So 56.8% of the elderly are female, whereas only 50.7% of the non-elderly population is female.
Discrete-Continuous Bivariate
Use svyby() for the mean score of a continuous variable across the categories of a discrete one.
#Mean age by sex
svyby( ~ AGEP , ~ SEX , ACS_DESIGN_AL , svymean , na.rm = TRUE )
## SEX AGEP se
## 1 1 37.78940 0.01966696
## 2 2 40.12691 0.01833421
The operation can also render the percentile estimates in conjunction with svyquantile():
#Median age by sex
svyby(
~ AGEP ,
~ SEX ,
ACS_DESIGN_AL ,
svyquantile ,
0.5 ,
ci = TRUE ,
keep.var = TRUE ,
na.rm = TRUE
)
## SEX V1 se
## 1 1 37 0.2511995
## 2 2 40 0.2511995
So the median Alabaman male is 37 years old, whereas the median Alabaman female is 40.
Continuous-Continuous Bivariate
The survey package does not include an operation for calculating correlations. There is such an operatino in the jtools package. Use svycor() for correlations in survey data. For example, AGEP and personal income (PINCP)
library(jtools) #Remember, svycor is not from survey package
svycor(~PINCP + AGEP, na.rm = T, design = ACS_DESIGN_AL)
## PINCP AGEP
## PINCP 1.00 0.17
## AGEP 0.17 1.00
Personal income and age have a correlatino of 0.17. For a recommended extended tutorial on svycor(), see this post by Jacob Long.
Linear Regression
For an OLS regression of a continuous variable on predictors, use svyglm() with default settings. Below, I use SEX and WKHP (hours worked last week) to estimate PINP (personal income):
model.1 <- svyglm(PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
summary(model.1)
##
## Call:
## svyglm(formula = PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
##
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18908.20 834.36 22.66 <2e-16 ***
## SEX -13474.16 317.23 -42.47 <2e-16 ***
## WKHP 1154.15 15.89 72.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.746482e+14)
##
## Number of Fisher Scoring iterations: 2
This model suggests a baseline income of $18,908, a $13,474 earnings penalty to women, and an additional $1,154 in income per hour worked last week.
Logit Regression
For a logit:
model.2 <- svyglm(ELDERLY ~ WKHP, family = binomial(link = "logit"), ACS_DESIGN_AL)
summary(model.2)
##
## Call:
## svyglm(formula = ELDERLY ~ WKHP, family = binomial(link = "logit"),
## ACS_DESIGN_AL)
##
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.289704 0.043399 -29.72 <2e-16 ***
## WKHP -0.046834 0.001216 -38.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 121606.1)
##
## Number of Fisher Scoring iterations: 6
Obviously, we’re not positing a causal model here. The model predicts a -0.04 decrease in the logged odds of being elderly for each addition hour of work performed in the previous week. This converts into $ (-0.04) = 0.96 $ the odds of being elderly per additional hour worked. In other words a 0.96 – 1= -0.04 = 4% $ decrease in odds per additional hour.