
Data Archiving and Processing

Karine Sahakyan, MD, MPH

American University of Armenia

June 26-27, 2008

Caucasus Research Resource Centers – Armenia

A Program of the Eurasia Foundation


Overview

• Introduction, general archiving theory and practices

• Data structures and data processing

• Survey documentation

• User guides


Course schedule

26th June

• Introduction and orientation

• General archiving theory and practices

• Data processing, data file structures, deriving variables

• Practical exercises and coursework


Course schedule

27th June

• Data linkage

• Survey documentation

• Thematic user guides


Introduction

[Diagram built up over several slides, linking Project, Data, Documentation, Theory and User. Successive slides highlight:]

• Data storage and preservation (official archives)

• Data processing and analysis (personal archives)

• Supporting documentation

• User guides


Sources

• Elder, GH et al (1993) Working with Archival Data: Studying Lives, London, Sage.

• Dale, A et al (1988) Doing Secondary Analysis, London, Unwin Hyman.

• UK Data Archive (www.data-archive.ac.uk)

• Royal Statistical Society (www.rss.org.uk)

• ESDS (www.esds.ac.uk)

• SPSS to process and analyse data (www.spsstools.net/SampleSyntax.htm)

• DI2007 and INTAS 2007 survey data sets and documentation


Intended course outcomes

• 1: Understanding of the need to systematically archive and document data.

• 2: Ability to distinguish between different data types and structures.

• 3: Ability to use SPSS syntax files to perform data derivation, data merging and analysis

• 4: Appreciation of the benefits of user guides for prospective data users


Assessment

• User guide and codebook development for a specific thematic module of DI2007. Tests intended learning outcomes (ILOs) 1, 2 and 4.

• Analysis of the DI2007 or INTAS survey using SPSS syntax files to derive variables, merge files and produce tables and statistics. Tests ILO 3.


Assessment

• Coursework sessions are timetabled, so you are expected to start both projects during this course

• Projects should be submitted 2 weeks after the end of the course

• I will be available by email to assist with any questions in this 2-week period


Data Archives


Data preservation: why?

Scientific responsibility

Costs

Legal requirement

Future use (secondary analysis)


Data preservation: what?

Original digitised data

Questionnaire forms (?)

Explanatory documentation (purpose and technical)

Unique identifiers (for future linkage and follow-up)

Data at risk of being lost


Data preservation: how?

Design surveys with preservation in mind (consent forms, anonymisation)

Use commonly used formats (e.g. SPSS)

Collate developmental reports (track changes)

Recognised archive sites (CRRC!)


Data preservation: threats

Initial user needs delay access

Ownership and copyright

Confidentiality, disclosure, ethics, data protection

Physical storage media

Logical (digital) storage format

Costs

Organisational change

Poor data infrastructure (funding and strategy)


Survey Data: ‘version’ control

Early (pre-cleaning) release

‘Final’ release

Additional variables (derivations)

Preserving the original codings through:

1. using syntax to process the original data

2. saving processed data under a different file name

3. creating archive of derived data sets (possibly thematic)
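The three steps above can be sketched in Python rather than SPSS syntax. This is only an illustration: the file names, the variable names and the missing-value code -9 are all hypothetical. The point is that the original file is read by a script and never overwritten.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical working folder and file names, for illustration only.
workdir = Path(tempfile.mkdtemp())
original = workdir / "survey_original.csv"
derived = workdir / "survey_derived_v1.csv"

# Stand-in for the archived original release.
original.write_text("id,age\n1,34\n2,-9\n3,51\n", encoding="utf-8")


def process(src: Path, dst: Path) -> None:
    """Read the original, derive a variable, save under a different name."""
    if src.resolve() == dst.resolve():
        raise ValueError("refusing to overwrite the original data file")
    with src.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        age = int(row["age"])
        # -9 is an assumed missing-value code; the original coding stays in 'age'.
        row["age_group"] = "" if age < 0 else ("under 40" if age < 40 else "40+")
    with dst.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "age", "age_group"])
        writer.writeheader()
        writer.writerows(rows)


before = original.read_text(encoding="utf-8")
process(original, derived)
```

Because every change lives in the script, the derived file can be rebuilt from the original at any time, and a new version only needs a new output file name.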


Exercise

1. What factors constitute the major threats to data preservation in the South Caucasus?

2. Using your list of threats, formulate a ‘best practice’ guide for the preservation of data that aims to safeguard the future of statistical data in the region.


Data file structures


Simple one-off cross-sectional

1. Simplest file structure

2. Data arranged in a case/variables matrix

3. Each case has a value on each variable

4. Each case has a unique identifier

5. Processing involves:

   1. Selecting sub-sets of cases

   2. Selecting sub-sets of variables

   3. Recoding original variables

   4. Deriving new variables from existing ones

ID V1 V2 V3 V4 .. ..

1 1 3 70 6 .. ..

2 1 3 75 3 .. ..

3 2 2 73 4 .. ..

4 2 1 73 2 .. ..

5 2 1 72 6 .. ..

6 1 2 74 10 .. ..

7 1 3 73 2 .. ..

8 .. .. .. .. .. ..

.. .. .. .. .. .. ..
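The four processing steps can be illustrated with a short Python sketch using the first seven cases of the example matrix. The slides do not say what V1 to V4 measure, so the selection rule, the recode bands and the derived variable are hypothetical.

```python
# The first seven cases from the example matrix. The meaning of V1..V4 is
# not given, so the rules applied below are purely illustrative.
cases = [
    {"ID": 1, "V1": 1, "V2": 3, "V3": 70, "V4": 6},
    {"ID": 2, "V1": 1, "V2": 3, "V3": 75, "V4": 3},
    {"ID": 3, "V1": 2, "V2": 2, "V3": 73, "V4": 4},
    {"ID": 4, "V1": 2, "V2": 1, "V3": 73, "V4": 2},
    {"ID": 5, "V1": 2, "V2": 1, "V3": 72, "V4": 6},
    {"ID": 6, "V1": 1, "V2": 2, "V3": 74, "V4": 10},
    {"ID": 7, "V1": 1, "V2": 3, "V3": 73, "V4": 2},
]

# 1. Select a sub-set of cases (here: V1 == 1).
subset = [c for c in cases if c["V1"] == 1]

# 2. Select a sub-set of variables (copies, so 'cases' is left untouched).
slim = [{k: c[k] for k in ("ID", "V3", "V4")} for c in subset]

# 3. Recode an original variable into a new one, keeping the original V4.
for c in slim:
    c["V4_band"] = "low" if c["V4"] <= 4 else "high"

# 4. Derive a new variable from existing ones.
for c in slim:
    c["V3_plus_V4"] = c["V3"] + c["V4"]
```

Writing recodes into new variables, as in step 3, keeps the original codings available for checking.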


Repeated cross-sectional

1. As above, but the questionnaire, or a newer version of it, is administered at different points in time (say annually)

2. Respondents are sampled anew

3. Data processing as above

4. Comparisons over time are macro, not micro, i.e. they represent aggregate change over time, not individual change.

Different respondents, same questions

T1

ID V1 V2 V3 ..

1 1 3 72 ..

2 1 3 71 ..

3 2 1 74 ..

.. .. .. .. ..

T2

ID V1 V2 V3 ..

1 2 2 79 ..

2 2 1 80 ..

3 1 3 83 ..

.. .. .. .. ..
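A minimal Python sketch of a macro comparison, using the V3 values from the two example files and assuming the first file is T1 and the second is T2. Only sample summaries are compared, because the same ID number refers to different people at each time point.

```python
from statistics import mean

# V3 values from the two example files (assumed to be T1 and T2).
# The respondents at T1 and T2 are different people.
t1 = [{"ID": 1, "V3": 72}, {"ID": 2, "V3": 71}, {"ID": 3, "V3": 74}]
t2 = [{"ID": 1, "V3": 79}, {"ID": 2, "V3": 80}, {"ID": 3, "V3": 83}]

mean_t1 = mean(r["V3"] for r in t1)
mean_t2 = mean(r["V3"] for r in t2)

# Aggregate (macro) change: a difference between sample summaries.
aggregate_change = mean_t2 - mean_t1

# Not meaningful here: t2[0]["V3"] - t1[0]["V3"]. ID 1 at T1 and ID 1 at T2
# are unrelated respondents, so there is no individual (micro) change.
print(round(mean_t1, 2), round(mean_t2, 2), round(aggregate_change, 2))
```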


Hierarchical cross-sectional

1. Similar to the above, but now there is more than one structure in the data, e.g. respondents within households.

2. The case/variable matrix is now ‘nested’, i.e. some data is for the household (HH) and some for the individual (this can be in the same data file or in separate files)

3. Separate unique code numbers are needed.

4. Data processing involves:

   1. Accurate separation of different levels

   2. Suitable linkage where appropriate

Hierarchical structure #1 (people in households)

HHID H1 H2 PNO P1 P2 .. ..

1 4 1 1 2 1 .. ..

1 4 1 2 1 2 .. ..

2 1 1 1 2 2 .. ..

3 2 2 1 2 2 .. ..

3 2 2 2 1 1 .. ..

3 2 2 3 1 2 .. ..

4 1 1 1 1 2 .. ..

4 1 1 2 2 1 .. ..

.. .. .. .. .. .. .. ..
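As a sketch of "accurate separation of different levels", the nested example file can be split in Python into one household-level table and one person-level table. The column order (HHID, H1, H2, PNO, P1, P2) follows the example; treating (HHID, PNO) as the person's unique identifier is an assumption.

```python
# Rows from the example file: (HHID, H1, H2, PNO, P1, P2).
flat = [
    (1, 4, 1, 1, 2, 1), (1, 4, 1, 2, 1, 2),
    (2, 1, 1, 1, 2, 2),
    (3, 2, 2, 1, 2, 2), (3, 2, 2, 2, 1, 1), (3, 2, 2, 3, 1, 2),
    (4, 1, 1, 1, 1, 2), (4, 1, 1, 2, 2, 1),
]

households = {}  # one record per household, keyed by HHID
persons = {}     # one record per person, keyed by (HHID, PNO)

for hhid, h1, h2, pno, p1, p2 in flat:
    hh_record = {"H1": h1, "H2": h2}
    # Household variables must agree on every row of the same household.
    if households.setdefault(hhid, hh_record) != hh_record:
        raise ValueError(f"inconsistent household data for HHID {hhid}")
    key = (hhid, pno)
    if key in persons:
        raise ValueError(f"duplicate person identifier {key}")
    persons[key] = {"P1": p1, "P2": p2}
```

The two checks catch the most common problems at this step: household values that differ between rows, and person identifiers that are not unique.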

Hierarchical structure #2 (episodes of employment)

PID EMPNO P1 P2 P3 ..

1 1 2 2001 2005 ..

1 2 1 2005 -8 ..

2 1 2 2002 -8 ..

3 1 2 1995 1998 ..

3 2 1 1998 2001 ..

3 3 1 2001 -8 ..

4 1 1 2004 2005 ..

4 2 2 2005 -8 ..

.. .. .. .. .. ..
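An episode file like this one is usually summarised to one record per person before it is attached to person-level data. The Python sketch below does that for the example rows. The slides do not define P1 to P3 or the code -8, so reading P2 as a start year, P3 as an end year and -8 as "no end year recorded" is an assumption made for illustration.

```python
from collections import defaultdict

# Rows from the example file: (PID, EMPNO, P1, P2, P3).
# Assumed for illustration: P2 = start year, P3 = end year,
# -8 = no end year recorded.
episodes = [
    (1, 1, 2, 2001, 2005), (1, 2, 1, 2005, -8),
    (2, 1, 2, 2002, -8),
    (3, 1, 2, 1995, 1998), (3, 2, 1, 1998, 2001), (3, 3, 1, 2001, -8),
    (4, 1, 1, 2004, 2005), (4, 2, 2, 2005, -8),
]

by_person = defaultdict(list)
for pid, empno, p1, start, end in episodes:
    by_person[pid].append({"EMPNO": empno, "P1": p1, "start": start, "end": end})

# One summary record per person, ready to attach to a person-level file.
summary = {
    pid: {
        "n_episodes": len(eps),
        "first_start": min(e["start"] for e in eps),
        "open_episode": any(e["end"] == -8 for e in eps),
    }
    for pid, eps in by_person.items()
}
```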


Panel

1. Using the same sample of respondents over time

2. Questions are often also repeated at different surveys

3. Data structure can be a simple case/variable for each phase of data collection

4. A unique identifier for each respondent, which remains for the life of the panel, is needed.

5. Data processing:

   1. Connecting variables for a single individual over successive waves of the survey (micro data analysis)

Same respondents over time: T1, T2, T3, T4
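A small Python sketch of connecting one individual's data across waves. The two waves, the person IDs and the income variable are hypothetical; the point is that the link is made on a person ID that stays fixed for the life of the panel.

```python
# Hypothetical two-wave panel: the same person ID appears in each wave.
wave1 = {101: {"income": 200}, 102: {"income": 150}, 103: {"income": 300}}
wave2 = {101: {"income": 240}, 103: {"income": 280}, 104: {"income": 90}}

# Link on the person ID; only respondents present in both waves can
# contribute to a measure of individual (micro) change.
linked = {
    pid: {
        "income_w1": wave1[pid]["income"],
        "income_w2": wave2[pid]["income"],
        "change": wave2[pid]["income"] - wave1[pid]["income"],
    }
    for pid in sorted(wave1.keys() & wave2.keys())
}

dropped_out = sorted(wave1.keys() - wave2.keys())   # in wave 1 only
new_entrants = sorted(wave2.keys() - wave1.keys())  # in wave 2 only
```

Listing the unmatched IDs alongside the linked file makes attrition and new entrants visible rather than silently dropping them.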


Cohort

1. Similar to Panel but each case is from a common cohort (where this is taken to be time related)

2. Birth cohort studies for example – all babies born in a particular week during a particular year, traced through their lives

3. Data structure and processing are the same as for a panel

Same respondents over time: T1, T2, T3, T4


Retrospective

1. Not really a survey type but a data collection tool

2. Can be included in any of the surveys listed above

3. Data is (retrospectively) longitudinal

4. Each retrospective element needs to have unique codings for different events or episodes

5. Data structure is ‘relational’: each element relates to its respondent as well as to the respondent’s other retrospective elements

6. Data processing:

   1. Time-sensitive linkages of different elements

Looking backwards: from T1 to T1 - X
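A time-sensitive linkage answers questions such as "which job did the respondent hold in the year they married?". The Python sketch below uses invented marriage and job modules for one respondent; the variable names and the rule for treating interval boundaries are assumptions.

```python
# Hypothetical retrospective modules for one respondent (PID 1).
marriages = [{"PID": 1, "MARNO": 1, "year": 1999}]
jobs = [
    {"PID": 1, "JOBNO": 1, "start": 1995, "end": 1998},
    {"PID": 1, "JOBNO": 2, "start": 1998, "end": 2003},
    {"PID": 1, "JOBNO": 3, "start": 2003, "end": None},  # still ongoing
]


def job_at(pid, year, job_episodes):
    """Return the job episode covering `year` for this person, if any.

    Intervals are treated as start-inclusive and end-exclusive, so a job
    change within a year is attributed to the new job (an assumption).
    """
    for job in job_episodes:
        ends_after = job["end"] is None or year < job["end"]
        if job["PID"] == pid and job["start"] <= year and ends_after:
            return job
    return None


# Attach to each marriage the job number held in the year of marriage.
linked = [
    {**m, "JOBNO": (job_at(m["PID"], m["year"], jobs) or {}).get("JOBNO")}
    for m in marriages
]
```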


Exercise

1. What type of survey design can help with the following ideas:

   1. Young people are taking longer to get married than they used to

   2. Fear of crime is highest in the urban environment

   3. Employment and income are generally under-reported

   4. The Democrats will win the US presidency in 2008


Structure of DI2007

1. Cross-sectional

2. Individuals within HH

3. All HH members

4. Absent HH migrants

Household ID links each file:

• HH and Ind

• HH members

• Absent migrants

Structure of INTAS 2007

• Linked to DI 2005, so panel and hierarchical (though these properties are not being used)

• Retrospective data collection

• Main file and 8 retrospective modules

• Relational structure

Each data file relates to every other through the person ID:

Education

Marriage

Leisure

Employment

Job

Housing

Cohabitation

Core

Children


Deriving variables


Coding and recoding

1. Original codings (as in code book)

2. Simplifications:

   1. Dealing with DK / NA and missing codes (tidying up)

   2. Collapsing categories (substantive and statistical)

3. Improves analysis and presentation

4. See D1.sps – DI2007 and contributions of absent HH members
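The two simplifications can be sketched in Python as follows. This is not the content of D1.sps: the 1 to 5 agreement scale and the codes 8 (DK) and 9 (NA) are hypothetical, and real codes should be taken from the code book.

```python
# Hypothetical codes: 1-5 agreement scale, 8 = don't know (DK),
# 9 = no answer (NA). Real codes should be taken from the code book.
MISSING_CODES = {8, 9}
COLLAPSE = {1: "agree", 2: "agree", 3: "neutral", 4: "disagree", 5: "disagree"}


def recode(value):
    """Map an original code to a collapsed category, or None if missing."""
    if value is None or value in MISSING_CODES:
        return None
    return COLLAPSE[value]  # a KeyError flags a code not in the code book


original = [1, 2, 3, 8, 5, 9, 4]
recoded = [recode(v) for v in original]
```

The recoded values go into a new list, so the original codings remain available for comparison.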


Creating analytic files

1. Protects the original data from being deleted/overwritten

2. Small files are processed faster

3. Less scrolling through data/variables

4. If syntax file used, it is easy to adapt (to include or delete variables)

5. See D2.sps


Deriving variables

1. Combining variables to produce a hybrid

2. Can be scale related to summarise a concept (i.e. where all response codes are of the same type, as in the ‘safety’ example; see D3.sps)

3. Can relate to a broad conceptual category (social origins using parental education and employment. See CF1.sps)

4. To adjust data where you have reason to suspect that one variable might help to improve another (using reported expenditure to adjust reported income. See CF7.sps)
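As an illustration of a scale-related derived variable, the Python sketch below averages several items that share the same response codes. It is not the content of D3.sps: the item names, the 1 to 4 coding and the rule requiring at least two valid answers are all assumptions.

```python
# Hypothetical 'safety' items, each coded 1 (very unsafe) to 4 (very safe),
# with None for a missing answer. Item names are invented for illustration.
ITEMS = ("safe_home", "safe_street", "safe_night")


def safety_scale(case, min_valid=2):
    """Mean of the valid items, or None if too few items were answered."""
    valid = [case[i] for i in ITEMS if case.get(i) is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


cases = [
    {"ID": 1, "safe_home": 4, "safe_street": 3, "safe_night": 2},
    {"ID": 2, "safe_home": 4, "safe_street": None, "safe_night": 2},
    {"ID": 3, "safe_home": None, "safe_street": None, "safe_night": 1},
]
for case in cases:
    case["safety"] = safety_scale(case)
```

Using the mean rather than the sum keeps scores comparable between respondents who answered different numbers of items.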


Data linkage


Hierarchical links

1. Data is nested: individual within household, or episodes belong to an individual.

2. Link 1: attach HH data to the individual (for analysing individuals; not needed for DI as already linked)

3. Link 2: produce summary data of all individuals in the HH and attach it to the HH data (for analysing HHs). See D6.sps, though it is a bit long.

4. Link 3: attach episode data to an individual. See CF2.sps.

5. See earlier slides on hierarchical data.
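Links 1 and 2 can be sketched in a few lines of Python. This is not a translation of D6.sps; the two small files and the variables region and employed are hypothetical.

```python
from collections import Counter

# Hypothetical household-level and person-level files sharing HHID.
households = {1: {"region": "A"}, 2: {"region": "B"}}
persons = [
    {"HHID": 1, "PNO": 1, "employed": 1},
    {"HHID": 1, "PNO": 2, "employed": 0},
    {"HHID": 2, "PNO": 1, "employed": 1},
]

# Link 1: attach household data to each individual (unit = person).
person_file = [{**p, **households[p["HHID"]]} for p in persons]

# Link 2: summarise individuals within each household and attach the
# summary to the household record (unit = household).
size = Counter(p["HHID"] for p in persons)
employed = Counter(p["HHID"] for p in persons if p["employed"] == 1)
household_file = {
    hhid: {**hh, "hh_size": size[hhid], "n_employed": employed[hhid]}
    for hhid, hh in households.items()
}
```

The direction of the link decides the unit of analysis: Link 1 yields one row per person, Link 2 one row per household.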


Longitudinal links

1. The respondents’ data from successive surveys is joined together

2. Cross-wave ID number used for both individual and family

3. Panel surveys and cohort surveys


Relational links

1. Linking an individual’s marital statuses and fertility statuses

2. Linking an individual’s education, employment and job statuses

3. Linking both of the above

4. …adding housing and leisure

5. These are so-called ‘many to many’ links, i.e. one individual may have had 5 jobs, 4 different addresses, 2 marriages, 4 children and so on, while others might have had far fewer.

6. See CF5.sps and CF6.sps
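One simple way to bring many-to-many modules together is to count each person's episodes in every module and attach the counts to the core record. The Python sketch below does this with invented data; it does not reproduce CF5.sps or CF6.sps, and the module contents are hypothetical.

```python
from collections import Counter

# Hypothetical retrospective modules; each row is one episode or event
# and carries the person ID that relates it to the core file.
core = {1: {"sex": 1}, 2: {"sex": 2}}
modules = {
    "job": [{"PID": 1}, {"PID": 1}, {"PID": 1}, {"PID": 2}],
    "housing": [{"PID": 1}, {"PID": 1}, {"PID": 2}],
    "marriage": [{"PID": 1}],
    "children": [{"PID": 1}, {"PID": 1}],
}

# Count episodes per person in every module and attach them to a copy of
# the core record; a person with no rows in a module gets a count of 0.
profile = {pid: dict(rec) for pid, rec in core.items()}
for name, rows in modules.items():
    counts = Counter(r["PID"] for r in rows)
    for pid in profile:
        profile[pid][f"n_{name}"] = counts[pid]
```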


Data Processing Coursework


Survey Documentation


Survey Documentation exercise

1. In groups, detail exactly what is needed to begin analysing a survey data set effectively.

2. Try to be as precise as possible about the type of documentation, its content and the amount of detail that is required.

3. If you were to manage an archive of different social surveys, what would be on your check list of details to catalogue the surveys?


Survey Documentation (ESDS)

– all variables should be named. Ideally, variable names should not exceed 8 characters, which ensures compatibility between all current dissemination formats used by the ESDS. The absolute maximum is 32 characters, which ensures compatibility with recent versions of all major dissemination formats (SPSS, ver. 12 onwards; Stata, ver. 7 onwards; and SAS, ver. 8 onwards)

– all variables should be labelled. Labels should be brief (preferably < 80 characters), but precise, and should always make explicit the unit of measurement for continuous (interval) variables. Where possible, all variable labels should reference the question number (and if necessary the questionnaire)

– where possible, all data labelling should be created and supplied to the ESDS as part of the data file itself. This is the expectation with data supplied in one of the three major statistical packages - SPSS, Stata or SAS

– if the package being used for data management does not allow such variable and code labelling it must be provided as part of the documentation - i.e. a comprehensive list of variable names, variable descriptions, code names and variable formatting information

– the code used to create all derived variables (e.g. the SPSS syntax file or Stata do file) should be provided so that interested users can see exactly how these variables have arisen

– See also ESDS ‘Research Management and Documentation’
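The naming and labelling limits above are easy to check automatically before a file is deposited. The Python function below is only an illustrative check written for this purpose, not an ESDS tool; the example variable names and labels used with it are hypothetical.

```python
# Limits taken from the guidance above; the function itself is only an
# illustrative check, not an ESDS tool.
IDEAL_NAME_LEN = 8
MAX_NAME_LEN = 32
MAX_LABEL_LEN = 80


def check_variable(name, label):
    """Return a list of documentation problems for one variable."""
    problems = []
    if not name:
        problems.append("variable is not named")
    elif len(name) > MAX_NAME_LEN:
        problems.append("name exceeds 32 characters")
    elif len(name) > IDEAL_NAME_LEN:
        problems.append("name exceeds the ideal 8 characters")
    if not label:
        problems.append("variable is not labelled")
    elif len(label) >= MAX_LABEL_LEN:
        problems.append("label is 80 characters or longer")
    return problems
```

For example, `check_variable("household_income", "")` reports that the name is longer than the ideal 8 characters and that the label is missing.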


Survey Documentation (Metadata)

1. Rationale / purpose / history

2. Questionnaires

3. Code book

4. Technical details (design issues, sampling, weighting, imputation etc)

5. Technical details for users (user based examples)

6. Publications (working papers / technical reports / academic papers)


Example

1. www.iser.essex.ac.uk/ulsc/bhps/doc/


User Guides


What is a thematic data user guide?

1. A tool to assist researchers in locating and using data, i.e. it is not meant to provide all answers, only to point to sources which contain more detail

2. An up-to-date collation of research/publications on a particular topic

3. A brief description of different data sources

4. A brief description of different research projects, their aims, and design.

5. A collation of different theoretical questions of relevance

6. See ESDS Education / Social Capital


Examples

1. www.esds.ac.uk/support/thematicguides.asp


Exercise

1. Select a substantive topic of interest to you

2. List current data sources / theories / publications

3. Draft an introduction to this topic that would help a researcher to quickly learn the main issues.
