Dealing with variables: Resources and topics in enhancing secondary survey data


4th ESRC Research Methods Festival, St Catherine's College, Oxford, 5-8 July 2010. Dealing with variables: Resources and topics in enhancing secondary survey data. Paul Lambert, University of Stirling, DAMES research Node


  • Dealing with variables: Resources and topics in enhancing secondary survey data

    Paul Lambert, University of Stirling, DAMES research Node

    Session 17, Resources (i): Resources for data management, 6 July 2010

    4th ESRC Research Methods Festival, St Catherine's College, Oxford, 5-8 July 2010

  • Dealing with variables: Resources and topics in enhancing secondary survey data

    Rigorous and vigorous approaches to dealing with variables

    Three specialist topics: The GESDE services for data on occupations, ethnicity and educational qualifications

  • Survey research and variable analysis

  • Data management applied to variables refers to the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis [DAMES Node..]

    Usually performed by social scientists themselves
    Pre-analysis tasks (though often revised/updated)
    Inputs also from data providers
    Usually a substantial component of the work process
    But may not be explicitly rewarded (sometimes even penalised..)

    A little different from archiving / controlling data itself

  • Some components in secondary survey research

    Manipulating data
      Recoding categories / operationalising variables
    Linking data
      Linking related data (e.g. longitudinal studies)
      Combining / enhancing data (e.g. linking micro- and macro-data)
    Secure access to data
      Linking data with different levels of access permission
      Full or restricted access to detailed micro-data
    Harmonisation standards
      Approaches to linking concepts and measures (indicators)
      Recommendations on particular variable constructions
    Cleaning data
      Missing values; implausible responses; extreme values

  • Example: recoding data [use a recode or file matching routine]
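    A recode routine of this kind can be sketched in a few lines. This is an illustration only: the qualification labels and the three-level target scheme are hypothetical, and Python stands in here for the SPSS or Stata syntax that would more usually be used.

```python
# Hypothetical recode: detailed qualification labels -> three broad levels.
# Holding the mapping in one named dictionary documents the recode itself.
EDUC_RECODE = {
    "degree": "high",
    "a-level": "medium",
    "gcse": "medium",
    "none": "low",
}


def recode(values, mapping, missing=None):
    """Return recoded values; anything not in the mapping becomes `missing`."""
    return [mapping.get(value, missing) for value in values]


# Illustrative input, including one response the mapping does not cover.
raw = ["degree", "gcse", "none", "a-level", "refused"]
recoded = recode(raw, EDUC_RECODE)
print(recoded)
```

    Values outside the mapping are set to missing rather than silently kept, so unexpected categories show up when the recoded variable is reviewed.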

  • The centrality of keeping clear records of DM activities

    Reproducible (for self)
    Replicable (for all)
    Paper trail for whole lifecycle
    Cf. Dale 2006; Freese 2007

    In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)

    Syntax Examples:

  • Some provocative examples for the UK

    Social mobility is increasing, not decreasing!!
      Popularity of controversial findings associated with Blanden et al (2004)
      Contradicted by wider ranging datasets and/or better measures of stratification position
      DM: researchers ought to be able to more easily access wider data and better variables

    Degrees, MScs and PhDs are getting easier {or at least, more people are getting such qualifications}
      Correlates with measures of education are changing over time
      DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn't, but should, and could, be widespread

    Black-Caribbeans are not disappearing
      As the 1948-70 immigrant cohort ages, the Black-Caribbean group is decreasingly prominent due to return migration and social integration of immigrant descendants
      Data collectors under pressure to measure large groups only
      DM: It ought to be possible to harmonise measures of ethnicity over time, and to build richer data resources with more cases (e.g. by merging survey data)

    People interpreted the RAE wrongly!
      Most responses to the RAE 2008 involved comparing GPA scores between subject areas within and/or across institutions; but standardising relative to subject area distribution, or scaling by subject area, often gives very different results.
      DM: see Lambert and Gayle (2008) for a demo of alternative uses of RAE data

  • What might a rigorous and vigorous variable analysis look like? Open to debate, but I'd nominate:

    Replicability
    Features a pro-active review of variables
      Review a full set of alternative measures
      Review alternative functional forms
      Attention to distribution/standardisation
      Attention to harmonisation

  • How should I make my work replicable? The concept of a workflow is a useful device for documenting a survey research project

    Workflows involve organising materials as a series of interrelated but distinctive components
    In survey research, software syntax files make excellent templates for documenting our work in component elements [Long, 2009; Treiman, 2009; Altman & Franklin, 2010; Kulas, 2008]
    Computer science researchers have developed workflow depositories [e.g. MyExperiment] and workflow capture tools [e.g. Taverna]

  • Ad hoc organisation of a workflow as a master file in Stata

    Forthcoming workshop: Documentation and workflows for social survey research, University of Stirling, 1-2 September 2010, see
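    The master file idea can be illustrated with a short sketch. It is written in Python rather than Stata, and the step names, functions and toy data are all hypothetical. The point is only that a single file lists the components in order, runs them, and leaves a record of what was run.

```python
# A master script that runs each component of a project in a fixed order
# and records what was run. Step names, functions and data are hypothetical.
from datetime import datetime


def prepare_data(state):
    # Stand-in for reading and cleaning the source data.
    state["cases"] = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]


def derive_variables(state):
    # Stand-in for the recodes and derived variables.
    for case in state["cases"]:
        case["age_group"] = "under 40" if case["age"] < 40 else "40 plus"


def run_analysis(state):
    # Stand-in for the analysis: a simple frequency count.
    groups = [case["age_group"] for case in state["cases"]]
    state["counts"] = {g: groups.count(g) for g in sorted(set(groups))}


# The ordered list is the workflow: one entry per component file or task.
WORKFLOW = [
    ("01_prepare_data", prepare_data),
    ("02_derive_variables", derive_variables),
    ("03_run_analysis", run_analysis),
]


def run_workflow(steps):
    """Run every step in order and return the results plus a run log."""
    state, log = {}, []
    for name, step in steps:
        step(state)
        log.append(f"{datetime.now().isoformat(timespec='seconds')} ran {name}")
    return state, log


state, log = run_workflow(WORKFLOW)
print(state["counts"])
print("\n".join(log))
```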

  • A workflow summary in Excel (following Long, 2009)

  • How should I review variables/functional forms/distributions/harmonisations?

    We tend to rely on personal expertise in particular subject domains
      Expertise of the depositor of the data
      Expertise of the analyst
    Some textbooks and other capacity building events cover these topics generically [e.g. Treiman 2009], but by and large they are unduly neglected in methodological training

    Something called e-Science can help with both variable reviews and replication

  • The e-Social Science endeavour

    see for up-to-date links
    A number of UK projects seeking to improve social science research by capitalising on emerging computer science techniques
    Handling distributed data; collaborative technologies; large and complex data; secure data

    'The Grid' embodies these technologies, but more generic terms like e-Social Science & Digital Social Research are increasingly preferred
    GESDE: Grid Enabled Specialist Data Environments

  • Example: Understanding New Forms of Digital Records (DReSS) (slide from e-Social Science, BSA2009)

    [Figure labels: transcribed talk, audio, video, digital records, system logs, location; panels: transcript, code tree, video, system log]

  • This session is part-organised by the Data Management through e-Social Science (DAMES) node

    ESRC Node funded 2008-2011

    Aim: Useful social science provisions by exploiting tools for data management developed in computer science. Core components are:
      Data curation tool
      Data fusion tool
      Portals for access to data and data resources

  • Data curation tool collects metadata and allows data resources of different formats to be organised in an accessible depository

  • Data fusion tool supports merging of data files through shared variables (e.g. for recodes, aggregations, pooling data, linking related data, probabilistic linkages)

    External user (micro-social data)
      id   oug   sex   .
      1    110   1     .
      2    320   1     .
      3    320   2     .
      4    874   1     .
      5    874   2     .

    Occ info (index file) (aggregate)
      oug   CS-M   CS-F   EGP
      110   60     58     I
      320   69     71     II
      874   39     51     VIIa

    User's output (micro-social data)
      id   oug   CS
      1    110   60    .
      2    320   69    .
      3    320   71    .
      4    874   39    .
      5    874   51    .
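    The file matching in the example above can be reproduced with a short lookup routine. This is a sketch in Python, not the data fusion tool itself. It uses the values from the slide and assumes that CS-M and CS-F are scores for men and women respectively, with sex coded 1 = male and 2 = female (an assumption that is consistent with the output shown).

```python
# Micro-social data supplied by the user (values taken from the slide).
# Assumption: sex is coded 1 = male, 2 = female.
micro = [
    {"id": 1, "oug": 110, "sex": 1},
    {"id": 2, "oug": 320, "sex": 1},
    {"id": 3, "oug": 320, "sex": 2},
    {"id": 4, "oug": 874, "sex": 1},
    {"id": 5, "oug": 874, "sex": 2},
]

# Index file of occupational information, keyed by occupational unit group.
occ_index = {
    110: {"CS-M": 60, "CS-F": 58, "EGP": "I"},
    320: {"CS-M": 69, "CS-F": 71, "EGP": "II"},
    874: {"CS-M": 39, "CS-F": 51, "EGP": "VIIa"},
}


def attach_cs(records, index):
    """Match each record to the index file and attach the sex-specific score."""
    linked = []
    for rec in records:
        info = index.get(rec["oug"])
        column = "CS-M" if rec["sex"] == 1 else "CS-F"
        score = info[column] if info else None  # unmatched codes stay missing
        linked.append({"id": rec["id"], "oug": rec["oug"], "CS": score})
    return linked


linked = attach_cs(micro, occ_index)
for row in linked:
    print(row["id"], row["oug"], row["CS"])
```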

  • GEMDE: example of a portal for distributing and accessing supplementary data related to ethnicity

  • 2) Special Topics: The GESDE services for sociological classifications

    Key variables in social science research are not just for sociology, but are much debated there
      Complex categorical measures and variable operationalisation recommendations/debates
      Individual level measures of social positioning

    GESDE = 3 related online services which are Grid Enabled Specialist Data Environments
      GEODE: the 'O' is for data on Occupations
      GEEDE: the 'E' is for data on Educational qualifications
      GEMDE: the 'M' is for data on ethnic Minorities

  • Our contribution in GESDE..

    Many existing resources on these topics [see Appendix]
      Academic reviews and projects [e.g. Rose & Harrison 2010; Ganzeboom, 2008; Schneider, 2008; Guveli, 2006]
      Service providers [e.g. ESDS variable guides; CESSDA-PPP]
      National Statistics Institutes' guidelines [e.g.]

    It'd be good if more people were engaging with and exploiting these resources to enhance their own data..!

  • At the centre of this are problems of standardizing categorical data

    Measurement equivalence (e.g. van Deth, 2003) is often not feasible for complex categorical measures
    For categorical data, equivalence for comparisons is often best approached in terms of meaning equivalence (because of non-linear relations between categories and shifting underlying distributions), even if measurement equivalence seems possible

    Arithmetic standardisation offers a convenient form of meaning equivalence by indicating relative position within the structure defined by the current context
    For categorical data, this can be achieved/approximated by scaling categories in one or more dimensions of difference
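    As a minimal sketch of arithmetic standardisation, the Python example below converts a measure into z-scores within each context, so that a value is read relative to its own distribution. The data are hypothetical (years of education in two birth cohorts) and the function name is our own.

```python
from statistics import mean, pstdev


def standardise_within(records, value_key, group_key):
    """Add a z-score for `value_key`, computed separately within each group."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec[value_key])
    stats = {g: (mean(vals), pstdev(vals)) for g, vals in groups.items()}
    scored = []
    for rec in records:
        centre, spread = stats[rec[group_key]]
        z = (rec[value_key] - centre) / spread if spread > 0 else 0.0
        scored.append({**rec, "z": z})
    return scored


# Hypothetical data: years of education in two birth cohorts.
people = [
    {"id": 1, "cohort": "1940s", "educ_years": 9},
    {"id": 2, "cohort": "1940s", "educ_years": 11},
    {"id": 3, "cohort": "1940s", "educ_years": 13},
    {"id": 4, "cohort": "1970s", "educ_years": 11},
    {"id": 5, "cohort": "1970s", "educ_years": 13},
    {"id": 6, "cohort": "1970s", "educ_years": 15},
]
scored = standardise_within(people, "educ_years", "cohort")
for row in scored:
    print(row["id"], row["cohort"], row["educ_years"], round(row["z"], 2))
```

    In this toy example 13 years of education is well above average in the earlier cohort but exactly average in the later one, which is the sense in which the standardised score carries relative position within the current context.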

  • Effect proportional scaling using parents' occupational advantage

  • What was that then?

    We can represent categories through positions on a scale
    In turn, we can use position in the dimension as a category score which then plugs into a further analysis (e.g. regression main and interaction effects)

    E.g. some options for data on ethnicity:
      Stereotyped Ordered Logistic Regression (SOR) models, which summarize dimensions of difference according to regression predictor values [e.g. Lambert and Penn, 2001]
      Geometric data analysis for distances between people, or things [cf. Prandy, 1979; Bennett et al., 2009]
      Assign category scores by hand (a priori or by selected average)
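    The simplest of these options, assigning category scores by a selected average, can be sketched as follows. This is one reading of the idea and not the SOR or geometric approaches. The group labels and the parental occupational advantage scores are hypothetical.

```python
from statistics import mean


def category_scores(records, category_key, criterion_key):
    """Score each category by the mean of a criterion variable within it."""
    by_category = {}
    for rec in records:
        by_category.setdefault(rec[category_key], []).append(rec[criterion_key])
    return {cat: mean(vals) for cat, vals in by_category.items()}


# Hypothetical data: a categorical measure plus a criterion variable
# (here a parental occupational advantage score).
sample = [
    {"group": "A", "parent_occ": 40.0},
    {"group": "A", "parent_occ": 50.0},
    {"group": "B", "parent_occ": 55.0},
    {"group": "B", "parent_occ": 65.0},
    {"group": "C", "parent_occ": 30.0},
]

scores = category_scores(sample, "group", "parent_occ")

# The category score then replaces the category as a metric variable
# that can enter a further analysis, e.g. as a regression predictor.
for rec in sample:
    rec["group_score"] = scores[rec["group"]]

print(scores)
```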



  • 2(a) Data on occupations

    Occupational unit groups = standardised lists of occupational titles
    E.g. via CASCOT

  • ...on occupations: find ways of attaching summary information about occupations to occupational unit groups

  • Comparability problems => value of documenting methods & comparing alternatives

  • GEODE: Our contribution

    GEODE acts as a library style service for access to occupational information resources
    We encourage people to supply data they've produced, and we upload data ourselves
    Researchers are encouraged to use the portal to find and exploit suitable data
    Services: search, browse, deposit data, link data, user ratings

  • GEODE (v1) Occupational data

  • Using occupational data: example as a measure of marked social disadvantage, Lambert & Gayle (2009) (slide from Survey Network, 4 June 2009)

  • [Example: Occupational not geographical inequality]

  • 2(b) Data on educational qualifications

    Similar issues arise with the use of educational data
    Specialist resources exist which can enhance measures of educational data
    Many users aren't aware of alternative coding schemes or harmonised approaches

    GEEDE acts as a service for bringing together and disseminating relevant data resources on educational measures

  • Example: recoding data

  • Family and Working Lives Survey (54 vars per educ record)

  • 2(c) Data on ethnicity

    We can conceive of similar information resources and data analysis requirements for measures of ethnicity
    There are generally fewer published resources / agreed standards in this domain

    GEMDE publishes resources but puts more emphasis on understanding complex ethnicity data

  • ...working with ethnicity data in surveys is hard!
    - It's sparse
    - It's collinear (e.g. to age, location)
    - It's dynamic (cf. comparative research)

  • EFFNATIS sample (1999): Subjective ethnic identity [Heckmann et al., 2001]

  • A data management contribution

    Preserve information on what was done with categorical data
    Communicate information on what should/could be done

  • GEMDE seeks to promote replicability / transparency

    Document your own recodes
    Access somebody else's recodes
    Identify commonly used recodes (& use them..!)

  • ..and making complex analysis of ethnicity data easier..

    Organising complex categorical data
    Labelling, recoding, etc
    Effect proportional scaling
    Standardisation
    Interaction terms


  • The GEODE model for GEMDE? A service for MUGs and MIRs

    Define/register Minority Unit Groups

    Define/register Minority Information Resources

    Explore data resources and obtain help in approaching analysis of complex, sparse data

  • What's a MIR? 'Minority Information Resource'

    This is our own terminology. By a MIR, we mean any piece of information which supplies systematic data on a minority unit group (MUG) classification. We've used this term to be deliberately similar to the phrase 'Occupational Information Resources' that we used on GEODE.
      E.g. summary statistical data about the categories, and documentation or information
      E.g. recodings which have been used in a particular study

    Social scientists are not in general aware of the existence of MIRs (cf. the wide use of popular Occupational Information Resources). In GEMDE we seek to publicise little-known resources and promote their uptake: we argue that better communication and dissemination of MIRs is in fact an important step towards better scientific practice of replication and standardisation of research.

    In our terms, every MIR necessarily links to a MUG (but not every MUG has a MIR).
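    The relationship between MUGs and MIRs can be pictured as a small data structure. The sketch below is purely illustrative and does not describe how the GEMDE portal is implemented: the class layout, the example categories and the recode are all hypothetical. It encodes the one rule stated above, that a MIR cannot exist without a MUG, while a MUG may have no MIR.

```python
from dataclasses import dataclass, field


@dataclass
class MUG:
    """Minority Unit Group classification: a named list of categories."""
    name: str
    categories: list


@dataclass
class MIR:
    """Minority Information Resource: always attached to exactly one MUG."""
    title: str
    mug: MUG  # required field, so a MIR cannot be created without a MUG
    recode: dict = field(default_factory=dict)

    def __post_init__(self):
        # A recode may only refer to categories defined by its MUG.
        unknown = set(self.recode) - set(self.mug.categories)
        if unknown:
            raise ValueError(f"categories not in the MUG: {sorted(unknown)}")

    def apply(self, values):
        """Recode a list of values; unmapped values become None."""
        return [self.recode.get(value) for value in values]


# Hypothetical classifications: the second has no MIR attached, which is allowed.
four_cat = MUG("example_4cat", ["group 1", "group 2", "group 3", "other"])
unused = MUG("example_unused", ["x", "y"])

# Hypothetical MIR: a documented recode from four categories to two.
two_way = MIR(
    title="Two-category recode used in a (hypothetical) study",
    mug=four_cat,
    recode={"group 1": "majority", "group 2": "minority",
            "group 3": "minority", "other": "minority"},
)

print(two_way.apply(["group 3", "group 1"]))
```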

  • The GEMDE portal

    Liferay portal with access to MUGs and MIRs, first release Jan 2010, now available for general use

    Shibboleth access for registered users
    Guest level access
    Deposit MUGs/MIRs
    Search/browse deposited resources

    Feedback on resources (user ratings)
    Review live data (e.g. pooled LFS records)
    Expert and user quality ratings

  • Screenshot here!

  • Summary: Remind me how these topics enhance survey data..?

    Variable operationalisations can ordinarily be improved by more rigour and vigour
      More transparent operationalisation/documentation
      Better use of detailed data
      Better ability to include measures in suitably complex models/analysis

    The GESDE approach has been to seek technological solutions to the organisation and distribution of complex variable-related information

  • Data used

    Department for Education and Employment. (1997). Family and Working Lives Survey, 1994-1995 [computer file]. Colchester, Essex: UK Data Archive [distributor], SN: 3704.

    Heckmann, F., Penn, R. D., & Schnapper, D. (Eds.). (2001). Effectiveness of National Integration Strategies Towards Second Generation Migrant Youth in a Comparative Perspective - EFFNATIS. Bamberg: European Forum for Migration Studies, University of Bamberg.

    Li, Y., & Heath, A. F. (2008). Socio-Economic Position and Political Support of Black and Ethnic Minority Groups in the United Kingdom, 1972-2005 [computer file]. 2nd Edition. Colchester, Essex: UK Data Archive [distributor], SN: 5666.

    Office for National Statistics. Social and Vital Statistics Division and Northern Ireland Statistics and Research Agency. Central Survey Unit, Quarterly Labour Force Survey, January - March, 2008 [computer file]. 4th Edition. Colchester, Essex: UK Data Archive [distributor], March 2010. SN: 5851.

    University of Essex, & Institute for Social and Economic Research. (2009). British Household Panel Survey: Waves 1-17, 1991-2008 [computer file], 5th Edition. Colchester, Essex: UK Data Archive [distributor], March 2009, SN: 5151.

  • References

    Altman, M., & Franklin, C. H. (2010). Managing Social Science Research Data. London: Chapman and Hall.

    Bennett, T., Savage, M., Silva, E. B., Warde, A., Gayo-Cal, M., Wright, D., et al. (2009). Culture, Class, Distinction. London: Routledge.

    Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe (pp. 147-189). Cambridge: Cambridge University Press.

    Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.

    Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-171.

    Ganzeboom, H. B. G. (2008). Tools for deriving status measures from ISCO-88 and ISCO-68. Retrieved 1 March 2008.

    Guveli, A. (2006). New Social Classes within the Service Class in the Netherlands and Britain: Adjusting the EGP class schema for the technocrats and the social and cultural specialists. Nijmegen: Radboud U. Nijmegen.

    Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. NY: Wiley.

    Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.

    Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.

    Kulas, J. T. (2008). SPSS Essentials: Managing and Analyzing Social Sciences Data. New York: Jossey Bass.

    Lambert, P. S., & Gayle, V. (2009). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. Stirling: University of Stirling, Technical paper 2008-3 of the Data Management through e-Social Science research Node.

    Lambert, P. S., & Gayle, V. (2009). 'Escape from Poverty' and Occupations. Colchester, Essex: BHPS Research Conference, 9-11 July 2009.

    Lambert, P. S., & Penn, R. D. (2001). SOR models and Ethnicity data in LIS and LES: Country by Country Report. Syracuse University, Syracuse, New York 13244-1020: Luxembourg Income Study Paper No. 260.

    Levesque, R., & SPSS Inc. (2010). Programming and Data Management for IBM SPSS Statistics 18: A Guide for PASW Statistics and SAS users. Chicago: SPSS Inc.

    Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.

    Penn, R. D., & Lambert, P. S. (2009). Children of International Migrants in Europe: Comparative Perspectives. Basingstoke: Palgrave.

    Prandy, K. (1979). Ethnic discrimination in employment and housing. Ethnic and Racial Studies, 2(1), 66-79.

    Rose, D., & Harrison, E. (Eds.). (2010). Social Class in Europe: An Introduction to the European Socio-economic Classification. London: Routledge.

    Schneider, S. L. (2008). The International Standard Classification of Education (ISCED-97). An Evaluation of Content and Criterion Validity for 15 European Countries. Mannheim: MZES.

    Simpson, L., & Akinwale, B. (2006). Quantifying Stability and Change in Ethnic Group. Manchester: University of Manchester, CCSR Working Paper 2006-05.

    Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.

    van Deth, J. W. (2003). Using Published Survey Data. In J. A. Harkness et al. (2003) (pp. 329-346).

  • Appendix

    Existing resources: sources and types of support for data management in the social sciences

  • Existing resources (i): Data providers (slide from NCRM, Session 27, 1 July 2008)

    a) Documentation and metadata files

  • Existing resources (i): Data providers

    Resources for variables
      CESSDA PPP on key variables
      UK Question Bank
      ONS Harmonisation
    Resources for datasets
      UK Census data portal
      IPUMS international census data facilities
      European Social Survey
    Data manipulations prior to data release
      Missing data imputation / documentation
      Survey design / weighting information
      Influential: most analysts use the archive version

  • Existing resources (ii): Resource projects / infrastructures

    UK ESDS
      ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
      Helpdesks; online instructions; user support..

    UK ESRC NCRM / NCeSS / RDI initiatives
      Longitudinal data
      Linking micro/macro

    Other resources / projects / initiatives
      EDACwowe

  • Existing resources (iii): Analytical and software support

    Textbooks featuring data management
      [Levesque & SPSS Inc, 2010] [Altman & Franklin, 2010] [Long, 2009] [Kulas, 2008]

    Software training covering DM
      Stata's data management manual
      SPSS user group course on syntax and data management

    But generally, sustained marginalisation of DM as a topic
      Advanced methods texts use simplistic data
      Advanced software for analysis isn't usually combined with extended DM requirements

  • Existing resources (iv): Data analysts' contributions

    Academic researchers often generate and publish their own DM resources
      E.g. Harry Ganzeboom on education and occupations
      Provision of whole or partial syntax programming examples
    Analysts often drive wider resource provisions related to DM
      CAMSIS project on occupational scales
      CASMIN project on education and social class

  • Existing resources (v): Literatures on harmonisation and standardisation

    National Statistics Institutes' principles and practices
      E.g. ONS
    Cross-national organisations
      E.g. UNSTATS
    Academic studies
      E.g. [Harkness et al. 2003] [Hoffmeyer-Zlotnik & Wolf 2003] [Jowell et al. 2007] [Schneider, 2008] [Rose and Harrison 2010]

