Associate Professor of Sociology at the City University of New York, Queens College

Quick Tutorial: American Community Survey

A quick walk-through of simple analyses using American Community Survey data.

The American Community Survey is a workhorse survey series in demography, geography, and many other fields. This post provides a walk-through of an analysis that I conducted with the 2018 release of the American Community Survey’s five-year sample. Its examples will focus on assessing the prevalence and migration patterns of New York State’s elderly population. For a more complicated analysis that conducts these comparisons across states, review the Markdown file associated with my recent research note on the topic.

The American Community Survey (ACS) is a large-scale, high-quality running survey program run by the United States Census Bureau. It samples thousands of U.S. households annually through multiple survey projects, collecting information on their residences, demographics, household finances, and other topics. The set is a widely-used basis for estimating more detailed information that is not captured in the U.S. Census, and for estimates changes in the U.S. population that have occurred since the Census.

We begin by starting our script as a fresh session:

#Start Session

#Clear Memory
rm(list=ls())
gc()

#Set directory to folder where I plan to store the data
directory <- "E:/Dropbox/Research/ACS/2018/5-Year"
setwd(directory)

# Load Packages
library(curl)    #For downloading data
library(survey)  #Package for survey analysis

Downloading the Data

The are distributed through a structured folder and file naming system. The survey data we wish to access is distributed through the same web folder: https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/

Web folder where this ACS set is distributed. Note the patterned file names.

The data are distributed as state-specific CSV files. Each state-level file is demarcated using a standard two-letter state naming system. This series is stored in R’s base package as state.abb, but as lower-case characters.

I ran the operation as a loop:

states <- tolower(state.abb)   #Convert *state.abb* to lower-case

for (i in states){
  
  #Part 1: Downalod and Unzip Person Files
  #Create state directory
  dir.create(paste0(directory,"/", i))
  setwd(paste0(directory, "/", i))
  
  #Create objects with file URL & destination
  #based on state abbreviation
  source_url <- 
    paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_p", 
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_p", i, ".zip")
  
  #Download and unzip
  curl_download(source_url, dest_url)
  unzip(paste0("csv_p", i, ".zip"))
  temp.nam <- list.files(pattern = "\\.csv$")
  temp.dat <- read.csv(temp.nam)
  
  #Save data as RDS in main directory
  setwd(directory)
  saveRDS(temp.dat, file = paste0("acs2018_5yp_", i, ".RDS"))
  
  #Part 2: Download & Unzip Household Files
  #Same as Part 1, but applied to csv_h* files
  setwd(paste0(directory, "/", i))
  source_url <- 
    paste0("https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/csv_h", 
           i, ".zip")
  dest_url <- paste0(directory, "/", i, "/csv_h", i, ".zip")
  curl_download(source_url, dest_url)
  unzip(paste0("csv_p", i, ".zip"))
  temp.nam <- list.files(pattern = "\\.csv$")
  temp.dat <- read.csv(temp.nam)
  temp.dat$STATE <- i
  
  setwd(directory)
  saveRDS(temp.dat, file = paste0("acs2018_5yh_", i, ".RDS"))
  
}

The result is that I have all the state-specific household and person data files stored in our data directory. In this example, we will work with the Arkansas person set:

setwd(directory)
DAT.AL <- readRDS("acs2018_5yp_al.RDS")

Set Up Survey Object

We are going to use the survey package to analyze these data (documentation here). The package author, Thomas Lumley, has written an excellent introduction to survey analysis using this package: Complex Surveys: A Guide to Analysis Using R (2010, Wiley).

The first step in the process of analyzing these data is to create the survey design object, which combines the respondents’ answers along with information that helps us make more sophisticated population estimates from our sample data. The information required to set up this set’s survey object is in its readme file. The code that I’m using is built on Anthony Damico’s scripts. Note that I am creating an object for a specific state survey. My strategy is to get state-level estimates through state-specific objects, and then consolidate them after estimation.

ACS_DESIGN_AL <-
    svrepdesign(
        weight = ~PWGTP ,              #Person's weight
        repweights = 'PWGTP[0-9]+' ,   #Replicate person weighting variables
        scale = 4 / 80 ,               #See page 15
        rscales = rep( 1 , 80 ) ,      
              #Scales constant across 80 replicates
        mse = TRUE ,
        type = 'JK1' ,                 
              #Jacknife pop corr, for unstratified samples
        data = DAT.AL                  
              #Using Alabama data in this analysis
    )

And now we can engage the data.

Creating or Changing Variables

Use the update() function to add or change variables in the survey object. For example, I create a variable that designates respondents as “elderly” if they are at least 65 years old:

ACS_DESIGN_AL <- update(ACS_DESIGN_AL,   #Update the object 
                     ELDERLY = factor(AGEP>65)  #Creating a variable
)

Refer to this set’s data dictionary for variable names, explanations, and codes.

Subsetting

Use subset() to subset your survey sample:

ACS_DESIGN_AL_ELDERLY <- subset(ACS_DESIGN_AL, AGEP > 65)

The new object restricts its estimates to elderly Alabamans, defined as age 65+.

Parameter Estimates

Continuous Univariate

Using the AGEP variable, which measures respondents’ ages. The variable is top-coded at 99 years.

Use svymean() for a parameter mean.

#Mean age:
svymean( ~ AGEP , ACS_DESIGN_AL )
##        mean     SE
## AGEP 38.995 0.0105

So, the mean age of Alabamans, as suggested by this data, is 38.995 years old, with a standard error of 0.01 on that estimate.

#Median age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.5 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.5   38
## SE:
##           AGEP
## q0.5 0.2511995
#10th percentile age
svyquantile( ~ AGEP , ACS_DESIGN_AL , 0.1 , na.rm = TRUE )
## Statistic:
##      AGEP
## q0.1    8
## SE:
##           AGEP
## q0.1 0.2511995

The median Alabaman is 38 years old. The 10th percentile in this distribution is 8 years old.

Discrete Univariate

FOr discrete variables, svymean() works.

#Proporotion elderly
svymean( ~ ELDERLY , ACS_DESIGN_AL )
##                mean    SE
## ELDERLYFALSE 0.8504 3e-04
## ELDERLYTRUE  0.1496 3e-04

An estimated 14.96% of Albama is “elderly”, which we defined as over 65 years old.

If the operation seems to be giving you a mean rather than a proportion, then respecify the variables as a factor using factor().

Discrete-Discrete Bivariate

Use svyby() in conjunction with svymean() for cross-tabs. In the operation svyby(x, y), the results will express proportions of “x” across categories of “y”. This command looks at the proportion of ELDERLY by categories of SEX:

#Elderly by sex
svyby( ~ ELDERLY , ~ SEX , ACS_DESIGN_AL , svymean )
##   SEX ELDERLYFALSE ELDERLYTRUE          se1          se2
## 1   1    0.8665133   0.1334867 0.0003242783 0.0003242783
## 2   2    0.8352724   0.1647276 0.0004199916 0.0004199916

The results suggest that 13.3% of males in Alabama are elderly. This figure is 16.5% for females.

#Elderly by sex
svyby( ~ factor(SEX) , ~ ELDERLY , ACS_DESIGN_AL , svymean )
##       ELDERLY factor(SEX)1 factor(SEX)2          se1          se2
## FALSE   FALSE    0.4932625    0.5067375 0.0003111538 0.0003111538
## TRUE     TRUE    0.4319345    0.5680655 0.0007504035 0.0007504035

So 56.8% of the elderly are female, whereas only 50.7% of the non-elderly population is female.

Discrete-Continuous Bivariate

Use svyby() for the mean score of a continuous variable across the categories of a discrete one.

#Mean age by sex
svyby( ~ AGEP , ~ SEX , ACS_DESIGN_AL , svymean , na.rm = TRUE )
##   SEX     AGEP         se
## 1   1 37.78940 0.01966696
## 2   2 40.12691 0.01833421

The operation can also render the percentile estimates in conjunction with svyquantile():

#Median age by sex
svyby( 
    ~ AGEP , 
    ~ SEX , 
    ACS_DESIGN_AL , 
    svyquantile , 
    0.5 ,
    ci = TRUE ,
    keep.var = TRUE ,
    na.rm = TRUE
)
##   SEX V1        se
## 1   1 37 0.2511995
## 2   2 40 0.2511995

So the median Alabaman male is 37 years old, whereas the median Alabaman female is 40.

Continuous-Continuous Bivariate

The survey package does not include an operation for calculating correlations. There is such an operatino in the jtools package. Use svycor() for correlations in survey data. For example, AGEP and personal income (PINCP)

library(jtools)   #Remember, svycor is not from survey package
svycor(~PINCP + AGEP, na.rm = T,  design = ACS_DESIGN_AL)
##       PINCP AGEP
## PINCP  1.00 0.17
## AGEP   0.17 1.00

Personal income and age have a correlatino of 0.17. For a recommended extended tutorial on svycor(), see this post by Jacob Long.

Linear Regression

For an OLS regression of a continuous variable on predictors, use svyglm() with default settings. Below, I use SEX and WKHP (hours worked last week) to estimate PINP (personal income):

model.1 <- svyglm(PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
summary(model.1)
## 
## Call:
## svyglm(formula = PINCP ~ SEX + WKHP, ACS_DESIGN_AL)
## 
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18908.20     834.36   22.66   <2e-16 ***
## SEX         -13474.16     317.23  -42.47   <2e-16 ***
## WKHP          1154.15      15.89   72.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.746482e+14)
## 
## Number of Fisher Scoring iterations: 2

This model suggests a baseline income of $18,908, a $13,474 earnings penalty to women, and an additional $1,154 in income per hour worked last week.

Logit Regression

For a logit:

model.2 <- svyglm(ELDERLY ~ WKHP, family = binomial(link = "logit"), ACS_DESIGN_AL)
summary(model.2)
## 
## Call:
## svyglm(formula = ELDERLY ~ WKHP, family = binomial(link = "logit"), 
##     ACS_DESIGN_AL)
## 
## Survey design:
## update(ACS_DESIGN_AL, ELDERLY = factor(AGEP > 65))
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.289704   0.043399  -29.72   <2e-16 ***
## WKHP        -0.046834   0.001216  -38.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 121606.1)
## 
## Number of Fisher Scoring iterations: 6

Obviously, we’re not positing a causal model here. The model predicts a -0.04 decrease in the logged odds of being elderly for each addition hour of work performed in the previous week. This converts into $ (-0.04) = 0.96 $ the odds of being elderly per additional hour worked. In other words a 0.96 – 1= -0.04 = 4% $ decrease in odds per additional hour.

Leave a Reply

Your email address will not be published. Required fields are marked *