Bridging the Gaps: Dealing with Major Survey Changes in Data Set Harmonization
Joint Statistical Meetings
Minneapolis, MN, August 9, 2005
Presented by: Michael Davern, Ph.D.
Assistant Professor, Research Director
SHADAC, Health Services Research and Policy
University of Minnesota
Supported by a grant from The Robert Wood Johnson Foundation
Co-authors
This work is coauthored with: Miriam King, Ph.D., Research Associate
We are both with the Minnesota Population Center at the University of Minnesota.
Data set harmonization
The goal is to simplify access to all available years of a data set for analysis of trends over time.
This goal has many difficulties associated with it.
We focus on the issues involved with handling major sources of survey error over time.
Survey changes present challenges to harmonization
Sample design: How people and records are drawn into a data set changes over time, which affects how variance estimation is done.
Nonresponse: How surveys account for unit, supplement, person and item nonresponse changes over time.
Survey questions and measurement: Changes to question wording and question universes.
Survey processing/editing: Changes to processing and data editing.
Decennial census sample designs
Decennial census sampling involves both sampling of people/households to receive the long form and sampling of long form records to release (1% and 5%).
Both the household/person selection and the process used to select the public use microdata samples change over time.
Data users need access to the sample design information to calculate appropriate variances/standard errors.
Although appropriate estimates can be obtained with replicate weights, at the moment most users do not use them.
We are testing sample design variables to add to the IPUMS for Taylor Series estimation.
These will include a stratification variable, a cluster variable, and a weighting variable (when available) so analysts can simply program them in SAS, Stata, SUDAAN, etc.
Our approach will make the changes in sample design seamless to the data user and will increase the use of more appropriate estimation methods.
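To illustrate how a stratification variable, a cluster variable, and a weight feed Taylor Series estimation, the following Python sketch computes a linearized standard error for a weighted mean. It assumes a with-replacement approximation at the first stage and simply skips strata that contain a single PSU. The function name and the record layout are hypothetical; they are not taken from IPUMS or from any statistical package.

```python
from collections import defaultdict
from math import sqrt


def taylor_se_weighted_mean(records):
    """Weighted mean and its Taylor-linearized standard error.

    records: iterable of (stratum, cluster, weight, y) tuples.
    Assumption: PSUs are treated as sampled with replacement within strata.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized score of the ratio estimator, summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        psu_totals[stratum][cluster] += w * (y - mean) / total_w

    variance = 0.0
    for clusters in psu_totals.values():
        n_h = len(clusters)
        if n_h < 2:
            # Simplification: a single-PSU stratum contributes no variance here.
            continue
        z_bar = sum(clusters.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in clusters.values())
    return mean, sqrt(variance)


# Two PSUs in one stratum, one record each, equal weights.
mean, se = taylor_se_weighted_mean([("s1", "c1", 1.0, 1.0), ("s1", "c2", 1.0, 3.0)])
```

With harmonized stratum, cluster, and weight variables, the same call works unchanged on any year of data, which is the sense in which the design change becomes seamless to the analyst.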
Survey sample designs
The NHIS and CPS change sample designs over time.
Non-self-representing PSUs are reshuffled, so some are not retained from one design to the next.
Self-representing PSUs (MSAs) can also change (boundaries annex or lose counties).
Pooling data between two sample designs is a major challenge.
Data users often like to pool data to get larger samples for rare characteristics (e.g., those with SSI income).
When working with data from years that span two sample designs, it is best to average the estimates and the standard errors from single years.
Also, some surveys (e.g., NHIS) release sample design information that can be used for Taylor Series estimates, whereas others (e.g., CPS) do not.
Nonresponse
There are several types of survey nonresponse: unit, person, supplement and item.
Nonresponse is also handled differently by the various surveys and can cause problems for data users.
Unit nonresponse is generally handled by adjusting the survey weights of responders to account for nonresponders.
Heterogeneity among the weights makes it important to use appropriate statistical routines for variance estimation.
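One common way to adjust responder weights for unit nonresponse is a weighting-class adjustment, sketched below in Python. This is an illustrative assumption about the method, not a description of what any particular survey does; the adjustment cells, field names, and weights are hypothetical.

```python
from collections import defaultdict


def adjust_for_unit_nonresponse(units):
    """Inflate responder weights so each cell keeps its full weighted total.

    units: list of dicts with keys 'cell', 'weight', 'responded' (hypothetical layout).
    Returns new records for responders only, with adjusted weights.
    """
    total = defaultdict(float)      # weighted count of all sampled units per cell
    responding = defaultdict(float)  # weighted count of responders per cell
    for u in units:
        total[u["cell"]] += u["weight"]
        if u["responded"]:
            responding[u["cell"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cell"]] / responding[u["cell"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted


sample = [
    {"cell": "A", "weight": 100.0, "responded": True},
    {"cell": "A", "weight": 100.0, "responded": True},
    {"cell": "A", "weight": 100.0, "responded": False},
    {"cell": "B", "weight": 200.0, "responded": True},
    {"cell": "B", "weight": 200.0, "responded": False},
]
adjusted_sample = adjust_for_unit_nonresponse(sample)
```

Because the adjustment factor differs by cell, the resulting weights are more variable than the base weights, which is why design-based variance routines matter.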
Person and supplement nonresponse
Person and supplement nonresponse can be more difficult to deal with.
The NHIS, for example, contains information on a household, but if household members refused the supplement there is no supplement data for them.
This makes the data structure uneven.
The CPS, on the other hand, fully imputes data for ASEC (i.e., March) supplement nonresponders (currently about 10% of the cases).
This evens out the data structure, making it easier for data users to work with.
However, this can be problematic, as the CPS full supplement imputation process can lead to rather large biases in estimates (e.g., health insurance coverage).
We are investigating ways of evening out portions of the NHIS data structure to make it easier to work with and disseminate.
Item nonresponse
Item nonresponse is also a challenge.
Decennial census and CPS data are fully imputed for item nonresponse.
This makes things much easier for data users, although it can simplify things too much.
The NHIS, on the other hand, does not impute missing values.
This is a major problem for people who want to work with the income series on the NHIS (recently, separate imputed income files were released).
We are experimenting with imputing the income information on the NHIS files using CPS income data.
Question wording and measurement
Question wording changes take many forms:
  Change in the basic question
  The inclusion of examples
  The placement of the question in the survey
  Changes in the type of response allowed (e.g., can income amounts be reported in smaller than yearly intervals?)
Providing facsimiles of question wording, and highlighting wording changes in variable documentation, allows users to decide whether comparability is possible for their analyses.
Changes to question universes
Changes in universe definitions affect multiple variables (e.g., the age limit for adults answering work and income questions).
Other changes affect single variables.
Providing universe definitions in variable documentation tells users how to restrict their data to achieve comparability.
Testing variable universes reveals when data cleaning is needed before the data are released to users.
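A universe test of the kind described above might look like the following Python sketch. It flags records whose value contradicts the documented universe, in either direction. The "not in universe" marker, the field names, and the age 15 cutoff are hypothetical and chosen only for illustration.

```python
NIU = None  # hypothetical marker for "not in universe"


def universe_violations(records, in_universe, variable):
    """Return records whose value for `variable` contradicts the documented universe.

    in_universe: function that says whether a record should have been asked the question.
    A violation is an in-universe record with no value, or an out-of-universe record with one.
    """
    problems = []
    for rec in records:
        has_value = rec[variable] is not NIU
        if in_universe(rec) != has_value:
            problems.append(rec)
    return problems


people = [
    {"age": 10, "income": NIU},   # consistent: outside the universe, no value
    {"age": 20, "income": 5000},  # consistent: inside the universe, has a value
    {"age": 12, "income": 300},   # violation: value reported outside the universe
    {"age": 30, "income": NIU},   # violation: in universe but no value
]
flagged = universe_violations(people, lambda r: r["age"] >= 15, "income")
```

A nonempty result suggests that either the documented universe is wrong or the data need cleaning before release.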
Changes in response categories
Many data harmonization projects lose detail by adopting a least common denominator approach.
IPUMS projects adopt the joint goal of:
  Losing no information
  Providing comparability over time
IPUMS projects achieve these goals through composite coding schemes.
  The first digit(s) provide detail available across all years.
  Trailing digits provide additional detail available in only limited years.
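The composite coding idea can be sketched in a few lines of Python. The codes below are hypothetical two-digit values invented for illustration, not actual IPUMS codes: the leading digit is the category available in every year, and the trailing digit carries detail that exists only in some years.

```python
from collections import Counter


def general_code(detailed_code, detail_digits=1):
    """Drop the trailing detail digits, leaving the part comparable across all years."""
    return detailed_code // 10 ** detail_digits


# Hypothetical composite codes: 10 = category 1 with no further detail,
# 11 and 12 = category 1 with extra detail, 20 = category 2.
CODES_BY_YEAR = {
    "year with less detail": [10, 10, 20],
    "year with more detail": [11, 12, 20],
}

# Tabulating on the general code gives distributions that line up across years.
comparable = {
    year: Counter(general_code(code) for code in codes)
    for year, codes in CODES_BY_YEAR.items()
}
```

Users who need the extra detail keep the full code for the years that have it, so no information is lost.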
Other strategies for handling changes in response categories
Creating bridging variables is another means of achieving comparability over time.
When responses are given in intervalled form in some years, and in full detail in other years, IPUMS projects provide both detailed and intervalled variables.
Recoding data using a common standard (e.g., the 1950 occupation and industry codes), together with providing the original, unrecoded data, is a third strategy employed by IPUMS projects.
When response changes are too great to achieve comparability (e.g., the shift from 4 to 5 categories for health status in NHIS), the data are provided in separate variables and the issue is discussed in the documentation.
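Deriving an intervalled variable from detailed responses can be sketched as follows in Python. The interval bounds are hypothetical and stand in for whatever brackets the intervalled years actually used.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the intervals used in the intervalled years.
INTERVAL_LOWER_BOUNDS = [0, 10000, 25000, 50000]


def to_interval(amount):
    """Map a detailed amount to the index of the interval that contains it."""
    if amount < INTERVAL_LOWER_BOUNDS[0]:
        raise ValueError("amount falls below the lowest interval")
    # bisect_right counts the lower bounds that are <= amount.
    return bisect_right(INTERVAL_LOWER_BOUNDS, amount) - 1


detailed = [0, 9999, 10000, 60000]
intervalled = [to_interval(a) for a in detailed]
```

Releasing both the detailed amount and the derived interval lets users choose between maximum detail and comparability with the intervalled years.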
Changes in data processing
Variable documentation also helps users by pointing out subtle changes in data processing by the agency releasing the non-harmonized public use data.
Conclusions
Simplifying data dissemination and harmonization is a difficult goal, and demographic survey design and processing play a major role in making it difficult:
  Sample design
  Survey nonresponse
  Survey questions and items
  Survey processing/editing