Associate Professor of Sociology at the City University of New York, Queens College

Start Analyzing Consumer Expenditure Survey in R

A quick guide to getting started with the Consumer Expenditure Survey.

The Consumer Expenditure Survey (CEX) is an annual survey of U.S. household spending. The survey is performed on behalf of the U.S. Bureau of Labor Statistics, a federal government agency that compiles and develops data and research on U.S. household economic activity. This walk-through covers steps involved in a basic analysis of this survey’s public-use microdata. In addition to this tutorial, you should also review the program’s Getting Started Guide

This tutorial focuses on the analysis of the CEX’s interview, as opposed to its diary data. The latter data contains individual-level spending diary data. The former, with which we are working here, covers household-level expenditures as discerned through interviews of household represen.

Downloading the Data

The CEX’s public-use microdata is distributed here. If you look at the links to the individual data tables, you will notice that they distributed using structured URLs and file names. This makes it easy to script their download. Note that I am downloading Stata files. You can download another format by pointing the download operation to a different file.

#Set directory where data will be stored
data.directory <- "E:/Dropbox/Data/cex"
setwd(data.directory)

#Which years do you want to download?
first.year <- 1990 
last.year <- 2019

#Create array with only last two digits of years
years <- substr(first.year:last.year, 3, 4)

#A loop to fetch data files from web
for (i in years){
  download.file(
    paste0("https://www.bls.gov/cex/pumd/data/stata/intrvw", i, ".zip"),
    paste0("intrvw",i,".zip"))
  unzip(paste0("intrvw",i,".zip"))
}

The zip files will extract at least two folders (it is not 100% consistent across all years available). The one with the prefix “intrvw” will contain interview data, and “expn” for the diary data for a particular year. Choose the “intrvw” folder for the year that you wish to analyze. Note

Preparing the Data

Each year’s interview data is stored in five tables, each representing data from a quarter-year. Data from any survey year makes inferences from five quarter-years of data, each of which asks about spending in the current and previous quarter.

We are going to work with the 2019 interview data, in the folder “intrvw19”. I search the folder to find the “fmli*” files. There should be five files with those file names.

setwd(paste0(data.directory, "/intrvw19/intrvw19"))
list.files(pattern="fmli")
## [1] "fmli191x.dta" "fmli192.dta"  "fmli193.dta"  "fmli194.dta"  "fmli201.dta"
#Import the Stata files using read.dta from package "foreign"
library(foreign)
for (i in 1:5){
  temp <- read.dta(list.files(pattern="fmli")[i] )
  temp$qtr <- i
  assign(paste0("fmli",i), temp)
}

The next steps involve consolidating them into a single data table, parsing them into five different sets with individual imputed data estimates, and then generating the survey object using the five tables with unique imputed values. The following code is a adapted version of code by Anthony Damico. Those who are interested in exploring major social science sets using R may appreciate his very excellent R code repository at www.asdfree.com

# Generating elements of survey object

# constrain to focal year variables
fmli5 <- fmli5[, names(fmli4)]

# consolidate to common data table for efficient recoding
fmly <- rbind(fmli1, fmli2, fmli3, fmli4, fmli5)

# 1. Identify weight and replicate weight variables
# create a vector with weight variable names
wtrep <- c( paste0( "wtrep" , stringr::str_pad( 1:44 , 2 , pad = "0" ) ) , "finlwt21" )
for ( i in wtrep ) {
  fmly[ is.na( fmly[ , i ] ) , i ] <- 0
}

# 2. Data Tables for imputations
# create a vector containing all of the multiply-imputed variables (cleaned names)
mi_vars <- gsub( "5$" , "" , grep( "[a-z]5$" , names( fmly ) , value = TRUE ) )

# creating five sets, each with unique imputation, to be recombined below
for ( i in 1:5 ){
    x <- fmly
    for ( j in mi_vars ){
        # copy the contents of the current column (for example 'welfare1')
        # over to a new column ending in 'mi' (for example 'welfaremi')
        x[ , paste0( j , 'mi' ) ] <- x[ , paste0( j , i ) ]
        # delete the all five of the imputed variable columns
        x <- x[ , !( names( x ) %in% paste0( j , 1:5 ) ) ]
    }
    
    # save the current table in the sqlite database as 'imp1' 'imp2' etc.
    assign( paste0( 'imp' , i ) , x )
    rm( x )
}

Setting Up Survey Object

We are going to use the survey package to analyze these data (documentation here). The package author, Thomas Lumley, has written an excellent introduction to survey analysis using this package: Complex Surveys: A Guide to Analysis Using R (2010, Wiley).

After the above operations, we are left with five tables. Each table consolidate all five quarterly waves. For multiply-imputed variables, each table has unique imputations. The tables are bound together using the imputationList() function in the mitools package.

library(survey)
library(mitools)
cex.design <- 
    svrepdesign( 
        weights = ~finlwt21 ,                 #Sampling weight
        repweights = "wtrep[0-9]+",          #Replication weights, from array above
        data = imputationList(list( imp1 , imp2 , imp3 , imp4 , imp5 ) ) , 
        #data combined by imputationList
        type = "BRR" ,
        combined.weights = TRUE ,
        mse = TRUE
    )

Creating or Changing Variables

With this set, you may often wish to combine past and current quarter spending to generate estimates of annual household spending. For example, the variables apparpq and apparcq capture the household’s spending on apparel in the past and current quarters, respectively. If we combine the two, we will get a six-month spending estimate. One might assume that double that amount roughly approximates a year’s worth of spending.

cex.design <- update(cex.design,
                     apparel = (apparcq + apparpq) * 2
                     )

Parameter Estimates

Combined imputations using MIcombine() from the mitools package.

Continuous Univariate

# Mean
MIcombine( with( cex.design , svymean( ~ apparel ) ) )
## Multiple imputation results:
##       with(cex.design, svymean(~apparel))
##       MIcombine.default(with(cex.design, svymean(~apparel)))
##          results       se
## apparel 513.5305 14.54195
# Median
MIcombine( with( cex.design ,
    svyquantile(
        ~ apparel ,
        0.5 , se = TRUE 
) ) )
## Multiple imputation results:
##       with(cex.design, svyquantile(~apparel, 0.5, se = TRUE))
##       MIcombine.default(with(cex.design, svyquantile(~apparel, 0.5, 
##     se = TRUE)))
##      apparel       se
## q0.5     196 6.942061
# 10th percentile
MIcombine( with( cex.design ,
    svyquantile(
        ~ apparel ,
        0.1 , se = TRUE 
) ) )
## Multiple imputation results:
##       with(cex.design, svyquantile(~apparel, 0.1, se = TRUE))
##       MIcombine.default(with(cex.design, svyquantile(~apparel, 0.1, 
##     se = TRUE)))
##      apparel se
## q0.1       0  0

Discrete Univariate

# Proportions Table
MIcombine( with( cex.design , svymean( ~ factor(sex_ref) ) ) )
## Multiple imputation results:
##       with(cex.design, svymean(~factor(sex_ref)))
##       MIcombine.default(with(cex.design, svymean(~factor(sex_ref))))
##                    results          se
## factor(sex_ref)1 0.4826752 0.005276427
## factor(sex_ref)2 0.5173248 0.005276427

Bivariate: Discrete-Continuous

Table of mean scores

MIcombine( with( cex.design ,
    svyby( ~ apparel , ~ sex_ref , svymean )
) )
## Multiple imputation results:
##       with(cex.design, svyby(~apparel, ~sex_ref, svymean))
##       MIcombine.default(with(cex.design, svyby(~apparel, ~sex_ref, 
##     svymean)))
##    results       se
## 1 535.2375 19.97322
## 2 493.2774 13.00179

Median by Group

MIcombine( with( cex.design ,
    svyby(
        ~ apparel , ~ bls_urbn , svyquantile ,
        0.5 , se = TRUE ,
        keep.var = TRUE , ci = TRUE 
) ) )
## Multiple imputation results:
##       with(cex.design, svyby(~apparel, ~bls_urbn, svyquantile, 0.5, 
##     se = TRUE, keep.var = TRUE, ci = TRUE))
##       MIcombine.default(with(cex.design, svyby(~apparel, ~bls_urbn, 
##     svyquantile, 0.5, se = TRUE, keep.var = TRUE, ci = TRUE)))
##   results        se
## 1     200  8.925507
## 2      80 19.195478

Bivariate: Two Discrete

MIcombine( with( cex.design ,
    svyby( ~ sex_ref , ~ bls_urbn , svymean )
) )
## Multiple imputation results:
##       with(cex.design, svyby(~sex_ref, ~bls_urbn, svymean))
##       MIcombine.default(with(cex.design, svyby(~sex_ref, ~bls_urbn, 
##     svymean)))
##              results          se
## 1:sex_ref1 0.4848522 0.005826197
## 2:sex_ref1 0.4517365 0.033628859
## 1:sex_ref2 0.5151478 0.005826197
## 2:sex_ref2 0.5482635 0.033628859

Regression

glm_result <- 
    MIcombine( with( cex.design ,
        svyglm( apparel ~ bls_urbn + sex_ref )
    ) )
    
summary( glm_result )
## Multiple imputation results:
##       with(cex.design, svyglm(apparel ~ bls_urbn + sex_ref))
##       MIcombine.default(with(cex.design, svyglm(apparel ~ bls_urbn + 
##     sex_ref)))
##                results       se     (lower     upper) missInfo
## (Intercept)  546.38234 19.96643  507.24887 585.515822      0 %
## bls_urbn2   -181.14220 49.93095 -279.00507 -83.279326      0 %
## sex_ref2     -40.48467 16.54160  -72.90562  -8.063722      0 %

Leave a Reply

Your email address will not be published. Required fields are marked *