data archiving and processing

Data Archiving and Processing. Karine Sahakyan, MD, MPH, American University of Armenia. June 26-27, 2008. Caucasus Research Resource Centers – Armenia, a program of the Eurasia Foundation.

Upload: crrc-armenia

Post on 06-May-2015


Page 1: Data Archiving and Processing

1

Data Archiving and Processing

Karine Sahakyan, MD, MPH

American University of Armenia

June 26-27, 2008

Caucasus Research Resource Centers – Armenia

A Program of the Eurasia Foundation

Page 2: Data Archiving and Processing

2

Overview

• Introduction, general archiving theory and practices

• Data structures and data processing

• Survey documentation

• User guides

Page 3: Data Archiving and Processing

3

Course schedule

26th June

• Introduction and orientation

• General archiving theory and practices

• Data processing, data file structures, deriving variables

• Practical exercises and coursework

Page 4: Data Archiving and Processing

4

Course schedule

27th June

• Data linkage

• Survey documentation

• Thematic user guides

Page 5: Data Archiving and Processing

5

Introduction

Page 6: Data Archiving and Processing

Diagram elements: Data

Page 7: Data Archiving and Processing

Diagram elements: Data, Project

Page 8: Data Archiving and Processing

Diagram elements: Data, Project, Documentation

Page 9: Data Archiving and Processing

Diagram elements: Data, Project, Documentation, Theory

Page 10: Data Archiving and Processing

Diagram elements: Data, User, Documentation, Theory, Project

Page 11: Data Archiving and Processing

Diagram elements: Data, User, Documentation, Theory, Project

Data storage and preservation (official archives)

Page 12: Data Archiving and Processing

Diagram elements: Data, User, Documentation, Theory, Project

Data processing and analysis (personal archives)

Page 13: Data Archiving and Processing

Diagram elements: Data, User, Documentation, Theory, Project

Supporting documentation

Page 14: Data Archiving and Processing

Diagram elements: Data, User, Documentation, Theory, Project

User guides

Page 15: Data Archiving and Processing

15

Sources

• Elder, GH et al. (1993) Working with Archival Data: Studying Lives, London, Sage.

• Dale, A et al. (1988) Doing Secondary Analysis, London, Unwin Hyman.

• UK Data Archive (www.data-archive.ac.uk)

• Royal Statistical Society (www.rss.org.uk)

• ESDS (www.esds.ac.uk)

• SPSS to process and analyse data (www.spsstools.net/SampleSyntax.htm)

• DI2007 and INTAS 2007 survey data sets and docs

Page 16: Data Archiving and Processing

16

Intended course outcomes

• 1: Understanding of the need to systematically archive and document data.

• 2: Ability to distinguish between different data types and structures.

• 3: Ability to use SPSS syntax files to perform data derivation, data merging and analysis

• 4: Appreciation of the benefits of user guides for prospective data users

Page 17: Data Archiving and Processing

17

Assessment

• User guide and codebook development for the DI 2007 on a specific thematic module. Tests intended outcomes (ILOs) 1, 2 and 4.

• Analysis of DI2007 or INTAS survey using SPSS syntax files to derive variables / merge files / produce tables/statistics. Tests ILO 3.

Page 18: Data Archiving and Processing

18

Assessment

• Coursework sessions are timetabled, so you are expected to start both projects during this course

• Projects should be submitted 2 weeks after the end of the course

• I will be available by email to assist with any questions in this 2-week period

Page 19: Data Archiving and Processing

19

Data Archives

Page 20: Data Archiving and Processing

20

Data preservation: why?

Scientific responsibility

Costs

Legal requirement

Future use (secondary analysis)

Page 21: Data Archiving and Processing

21

Data preservation: what?

Original digitised data

Questionnaire forms (?)

Explanatory documentation (purpose and technical)

Unique identifiers (for future linkage and follow up)

Data at risk of being lost

Page 22: Data Archiving and Processing

22

Data preservation: how?

Design surveys with preservation in mind (consent forms, anonymisation)

Use commonly used formats (e.g. SPSS)

Collate developmental reports (track changes)

Recognised archive sites (CRRC!)

Page 23: Data Archiving and Processing

23

Data preservation: threats

Initial user needs delay access

Ownership and copyright

Confidentiality, disclosure, ethics, data protection

Physical storage media

Logical (digital) storage format

Costs

Organisational change

Poor data infrastructure (funding and strategy)

Page 24: Data Archiving and Processing

24

Survey Data: ‘version’ control

Early (pre-cleaning) release

‘Final’ release

Additional variables (derivations)

Preserving the original codings through:

1. using syntax to process the original data

2. saving processed data with different file name

3. creating archive of derived data sets (possibly thematic)
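The three steps above can be sketched in code. The course itself uses SPSS syntax files for this; the Python below is only an illustrative stand-in, and the file names, the `V3_HIGH` variable and its cut-off are all hypothetical. The point it shows is that processing is scripted and the result is written under a new file name, so the original codings are never overwritten.

```python
import csv
import tempfile
from pathlib import Path

def derive_release(original: Path, derived: Path) -> None:
    """Read the original file, add a derived column, write to a NEW file."""
    if derived.resolve() == original.resolve():
        raise ValueError("refusing to overwrite the original data file")
    with original.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    for row in rows:
        # Hypothetical derivation: flag V3 values of 73 or more.
        row["V3_HIGH"] = "1" if int(row["V3"]) >= 73 else "0"
    with derived.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames + ["V3_HIGH"])
        writer.writeheader()
        writer.writerows(rows)

# Demonstration in a throwaway directory with made-up values.
workdir = Path(tempfile.mkdtemp())
original = workdir / "survey_original.csv"
original.write_text("ID,V3\n1,70\n2,75\n")
before = original.read_text()

derived = workdir / "survey_derived_v1.csv"
derive_release(original, derived)
```

Because the derivation lives in a script rather than in manual edits, the derived file can be regenerated from the original at any time.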

Page 25: Data Archiving and Processing

25

Exercise

1. What factors constitute the major threats to data preservation in the South Caucasus?

2. Using your list of threats, formulate a ‘best practice’ guide for the preservation of data which aims to safeguard the future of statistical data in the region.

Page 26: Data Archiving and Processing

26

Data file structures

Page 27: Data Archiving and Processing

27

Simple one-off cross-sectional

1. Simplest file structure

2. Data arranged in a case/variables matrix

3. Each case has a value on each variable

4. Each case has a unique identifier

5. Processing involves:

   1. Selecting sub-sets of cases

   2. Selecting sub-sets of variables

   3. Recoding original variables

   4. Deriving new variables from existing ones

Page 28: Data Archiving and Processing

ID  V1  V2  V3  V4  ..  ..
1   1   3   70  6   ..  ..
2   1   3   75  3   ..  ..
3   2   2   73  4   ..  ..
4   2   1   73  2   ..  ..
5   2   1   72  6   ..  ..
6   1   2   74  10  ..  ..
7   1   3   73  2   ..  ..
8   ..  ..  ..  ..  ..  ..
..  ..  ..  ..  ..  ..  ..
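The four processing steps for a simple cross-sectional file can be illustrated on the matrix above. This is a minimal Python sketch rather than the SPSS syntax used in the course; the selection rule, the recode map and the derived variable names (`V2_GRP`, `V4_HIGH`) are hypothetical, since the slide does not say what V1 to V4 measure.

```python
# Case/variable matrix from the slide: one dict per case, keyed by variable.
cases = [
    {"ID": 1, "V1": 1, "V2": 3, "V3": 70, "V4": 6},
    {"ID": 2, "V1": 1, "V2": 3, "V3": 75, "V4": 3},
    {"ID": 3, "V1": 2, "V2": 2, "V3": 73, "V4": 4},
    {"ID": 4, "V1": 2, "V2": 1, "V3": 73, "V4": 2},
    {"ID": 5, "V1": 2, "V2": 1, "V3": 72, "V4": 6},
    {"ID": 6, "V1": 1, "V2": 2, "V3": 74, "V4": 10},
    {"ID": 7, "V1": 1, "V2": 3, "V3": 73, "V4": 2},
]

# 1. Select a sub-set of cases (hypothetical rule: V1 == 1).
subset = [c for c in cases if c["V1"] == 1]

# 2. Select a sub-set of variables; the unique ID is always kept.
keep = ("ID", "V2", "V4")
narrow = [{k: c[k] for k in keep} for c in subset]

# 3. Recode an original variable into a NEW variable (V2 itself is untouched).
v2_recode = {1: 1, 2: 1, 3: 2}  # hypothetical: collapse codes 1 and 2

for c in narrow:
    c["V2_GRP"] = v2_recode[c["V2"]]
    # 4. Derive a new variable from an existing one (hypothetical cut-off).
    c["V4_HIGH"] = int(c["V4"] >= 5)
```

The list comprehensions build new records, so the `cases` list keeps the original codings.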

Page 29: Data Archiving and Processing

29

Repeated cross-sectional

1. As above, but the questionnaire, or a newer version of it, is administered at different points in time (say annually)

2. Respondents are sampled anew

3. Data processing as above

4. Comparisons over time are macro, not micro, i.e. they represent aggregate change over time and not individual change.

Page 30: Data Archiving and Processing

Different respondents, same questions

T1

ID  V1  V2  V3  ..
1   1   3   72  ..
2   1   3   71  ..
3   2   1   74  ..
..  ..  ..  ..  ..

T2

ID  V1  V2  V3  ..
1   2   2   79  ..
2   2   1   80  ..
3   1   3   83  ..
..  ..  ..  ..  ..
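A small sketch of what "macro, not micro" comparison means for these two tables, written in Python purely for illustration. Treating V3 as a numeric measure is an assumption; the slide does not say what it records.

```python
from statistics import mean

# V3 values from the T1 and T2 tables on the slide.
t1 = [{"ID": 1, "V3": 72}, {"ID": 2, "V3": 71}, {"ID": 3, "V3": 74}]
t2 = [{"ID": 1, "V3": 79}, {"ID": 2, "V3": 80}, {"ID": 3, "V3": 83}]

# Respondents are sampled anew, so ID 1 at T1 and ID 1 at T2 are different
# people. The files must NOT be matched on ID; only aggregates are compared.
mean_t1 = mean(r["V3"] for r in t1)
mean_t2 = mean(r["V3"] for r in t2)
aggregate_change = mean_t2 - mean_t1
```

Subtracting row by row would produce numbers, but they would describe no real individual's change.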

Page 31: Data Archiving and Processing

31

Hierarchical cross-sectional

1. Similar to the above, but now there is more than one structure in the data, e.g. respondents within households.

2. The case/variable matrix is now ‘nested’, i.e. some data is for the HH and some for the individual (this can be in the same data file or in separate files)

3. Separate unique code numbers are needed for each level.

4. Data processing involves:

   1. Accurate separation of different levels

   2. Suitable linkage where appropriate

Page 32: Data Archiving and Processing

Hierarchical structure #1 (people in households)

HHID H1 H2 PNO P1 P2 .. ..

1 4 1 1 2 1 .. ..

1 4 1 2 1 2 .. ..

2 1 1 1 2 2 .. ..

3 2 2 1 2 2 .. ..

3 2 2 2 1 1 .. ..

3 2 2 3 1 2 .. ..

4 1 1 1 1 2 .. ..

4 1 1 2 2 1 .. ..

.. .. .. .. .. .. .. ..
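Separating the levels in a flat file like this one can be sketched as follows. This is an illustrative Python version, not the course's SPSS syntax; it assumes that H1 and H2 are household-level variables and P1 and P2 person-level variables, as the column prefixes suggest.

```python
flat = [
    # (HHID, H1, H2, PNO, P1, P2), copied from the slide
    (1, 4, 1, 1, 2, 1),
    (1, 4, 1, 2, 1, 2),
    (2, 1, 1, 1, 2, 2),
    (3, 2, 2, 1, 2, 2),
    (3, 2, 2, 2, 1, 1),
    (3, 2, 2, 3, 1, 2),
    (4, 1, 1, 1, 1, 2),
    (4, 1, 1, 2, 2, 1),
]

households = {}  # one record per household, keyed by HHID
persons = []     # one record per person, keyed by (HHID, PNO)

for hhid, h1, h2, pno, p1, p2 in flat:
    hh = {"H1": h1, "H2": h2}
    # Household values are repeated on every member's row; they must agree.
    if households.setdefault(hhid, hh) != hh:
        raise ValueError(f"household {hhid} has inconsistent HH-level data")
    persons.append({"HHID": hhid, "PNO": pno, "P1": p1, "P2": p2})
```

The person file keeps HHID so the two levels can be linked again, and the pair (HHID, PNO) serves as the unique person identifier.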

Page 33: Data Archiving and Processing

Hierarchical structure #2 (episodes of employment)

PID EMPNO P1 P2 P3 ..

1 1 2 2001 2005 ..

1 2 1 2005 -8 ..

2 1 2 2002 -8 ..

3 1 2 1995 1998 ..

3 2 1 1998 2001 ..

3 3 1 2001 -8 ..

4 1 1 2004 2005 ..

4 2 2 2005 -8 ..

.. .. .. .. .. ..

Page 34: Data Archiving and Processing

34

Panel

1. Using the same sample of respondents over time

2. Questions are often also repeated at different surveys

3. Data structure can be a simple case/variable for each phase of data collection

4. A unique identifier is needed for each respondent, and it must remain the same for the life of the panel.

5. Data processing:

   1. Connecting variables for a single individual over successive waves of the survey (micro data analysis)
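Connecting one individual's data across waves might look like the sketch below. It is a Python illustration with made-up respondent IDs and values, and the variable names (`V3_W1`, `V3_CHANGE`) are hypothetical.

```python
# One record per respondent per wave, keyed by the persistent respondent ID.
wave1 = {101: {"V3": 72}, 102: {"V3": 71}, 103: {"V3": 74}}
wave2 = {101: {"V3": 75}, 103: {"V3": 70}}  # respondent 102 was not re-interviewed

linked = {}
for pid in wave1.keys() & wave2.keys():  # respondents present in both waves
    v3_w1 = wave1[pid]["V3"]
    v3_w2 = wave2[pid]["V3"]
    linked[pid] = {"V3_W1": v3_w1, "V3_W2": v3_w2, "V3_CHANGE": v3_w2 - v3_w1}

# Respondents lost between waves are worth recording, not silently dropping.
attrition = sorted(wave1.keys() - wave2.keys())
```

Unlike the repeated cross-section, the change computed here belongs to a real individual, which is what makes micro analysis possible.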

Page 35: Data Archiving and Processing

Same respondents over time

T1 T2 T3 T4

Page 36: Data Archiving and Processing

36

Cohort

1. Similar to Panel but each case is from a common cohort (where this is taken to be time related)

2. Birth cohort studies for example – all babies born in a particular week during a particular year, traced through their lives

3. Data structure and processing same as panel

Page 37: Data Archiving and Processing

Same respondents over time

T1 T2 T3 T4

Page 38: Data Archiving and Processing

38

Retrospective

1. Not really a survey type but a data collection tool

2. Can be included in any of the surveys listed above

3. Data is (retrospectively) longitudinal

4. Each retrospective element needs to have unique codings for different events or episodes

5. Data structure is ‘relational’: each element relates to its respondent as well as to that respondent’s other retrospective elements

6. Data processing:

   1. Time-sensitive linkages of different elements
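A time-sensitive linkage can be sketched as finding which job episode was in progress when another event happened. The Python below is illustrative only: the job episodes reuse person 3 from the employment table earlier in the slides, the marriage record is made up, and reading the code -8 as "episode not yet ended" is an assumption, since the slides do not define it.

```python
ONGOING = -8  # assumption: -8 marks an episode with no end date yet

jobs = [
    {"PID": 3, "EMPNO": 1, "START": 1995, "END": 1998},
    {"PID": 3, "EMPNO": 2, "START": 1998, "END": 2001},
    {"PID": 3, "EMPNO": 3, "START": 2001, "END": ONGOING},
]
marriages = [{"PID": 3, "MARNO": 1, "YEAR": 1999}]  # hypothetical event

def job_at(pid, year):
    """Return the EMPNO of the episode covering `year`, or None.

    An episode covers START up to but not including END, so a year in
    which one job ends and the next begins is assigned to the new job.
    """
    for ep in jobs:
        if ep["PID"] != pid or ep["START"] > year:
            continue
        if ep["END"] == ONGOING or year < ep["END"]:
            return ep["EMPNO"]
    return None

linked = [{**m, "EMPNO_AT_MARRIAGE": job_at(m["PID"], m["YEAR"])} for m in marriages]
```

With only years recorded, the handling of boundary years is a design choice that should be written down in the documentation.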

Page 39: Data Archiving and Processing

Looking backwards

T1    T1 - X

Page 40: Data Archiving and Processing

40

Exercise

1. What type of survey design can help with the following ideas:

   1. Young people are taking longer to get married than they used to

   2. Fear of crime is highest in the urban environment

   3. Employment and income are generally under-reported

   4. The Democrats will win the US presidency in 2008

Page 41: Data Archiving and Processing

41

Structure of DI2007

1. Cross-sectional

2. Individuals within HH

3. All HH members

4. Absent HH migrants

Page 42: Data Archiving and Processing

Household ID links each file

HH and Ind

HH members

Absent Migrants

Page 43: Data Archiving and Processing

Structure of INTAS 2007

• Linked to DI 2005, so panel and hierarchical (though these properties are not being used)

• Retrospective data collection

• Main file and 8 retrospective modules

• Relational structure

Page 44: Data Archiving and Processing

Each data file relates to every other file through the person ID

Education

Marriage

Leisure

Employment

Job

Housing

Cohabitation

Core

Children

Page 45: Data Archiving and Processing

45

Deriving variables

Page 46: Data Archiving and Processing

46

Coding and recoding

1. Original codings (as in code book)

2. Simplifications:

   1. Dealing with DK / NA and Missing codes (tidying up)

   2. Collapsing categories (substantive and statistical)

3. Improves analysis and presentation

4. See D1.sps – DI2007 and contributions of absent HH members
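The two simplifications can be sketched as below. This is a Python illustration and not the content of D1.sps; the codes 98 and 99 for DK and NA, the 5-point item and the collapse map are all hypothetical and would come from the code book in practice.

```python
DK, NA = 98, 99  # hypothetical code book values for DK and NA

responses = [1, 2, 5, 98, 3, 99, 4]  # made-up answers to a 5-point item

# Step 1 (tidying up): map DK / NA onto a single missing marker.
tidy = [None if r in (DK, NA) else r for r in responses]

# Step 2 (collapsing): reduce five categories to three.
collapse = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
collapsed = [None if r is None else collapse[r] for r in tidy]
```

The recoded values go into new lists, so `responses` still holds the original codings from the code book.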

Page 47: Data Archiving and Processing

47

Creating analytic files

1. Protects the original data from being deleted/overwritten

2. Small files are processed faster

3. Less scrolling through data/variables

4. If a syntax file is used, it is easy to adapt (to include or delete variables)

5. See D2.sps

Page 48: Data Archiving and Processing

48

Deriving variables

1. Combining variables to produce a hybrid

2. Can be scale-related, to summarise a concept (i.e. where all response codes are of the same type, as in the ‘safety’ example. See D3.sps)

3. Can relate to a broad conceptual category (social origins using parental education and employment. See CF1.sps)

4. To adjust data where you have reason to suspect that one variable might help to improve another (using reported expenditure to adjust reported income. See CF7.sps)
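Two of these derivations can be sketched in Python. Neither function reproduces D3.sps or CF7.sps, which are not shown in the slides: the missing-data rule in `scale_score` and, in particular, the adjustment rule in `adjusted_income` (take the larger of reported income and reported expenditure) are assumptions chosen only to make the idea concrete.

```python
def scale_score(items):
    """Sum items that share the same response coding.

    Assumed rule: the score is missing (None) if any item is missing.
    """
    if any(v is None for v in items):
        return None
    return sum(items)

def adjusted_income(income, expenditure):
    """Hypothetical adjustment: income is taken to be at least expenditure."""
    if income is None:
        return expenditure
    if expenditure is None:
        return income
    return max(income, expenditure)

# Made-up respondent: three 'safety' items and reported money amounts.
safety = scale_score([2, 3, 1])
income_adj = adjusted_income(100, 150)
```

Whatever rule is actually used, keeping it in a syntax file lets later users see exactly how the derived variable arose.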

Page 49: Data Archiving and Processing

49

Data linkage

Page 50: Data Archiving and Processing

50

Hierarchical links

1. Data is nested: individual within household, or episodes belong to an individual.

2. Link 1: attach HH data to the individual (analysing individuals, not needed for DI as already linked)

3. Link 2: produce summary data of all individuals in the HH, and attach to the HH data (analysing HHs). See D6.sps, though it is a bit long.

4. Link 3: attach episode data to an individual. See CF2.sps.

5. See earlier slides on hierarchical data.
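Links 1 and 2 can be sketched briefly. This is an illustrative Python version, not D6.sps; the data values, the `AGE` variable and the summary variables `N_PERSONS` and `N_CHILDREN` are hypothetical.

```python
households = {1: {"H1": 4}, 2: {"H1": 1}}
persons = [
    {"HHID": 1, "PNO": 1, "AGE": 40},
    {"HHID": 1, "PNO": 2, "AGE": 12},
    {"HHID": 2, "PNO": 1, "AGE": 67},
]

# Link 1: attach HH data to each individual (unit of analysis: the person).
person_level = [{**p, **households[p["HHID"]]} for p in persons]

# Link 2: summarise the individuals in each HH and attach the summary to
# the HH record (unit of analysis: the household).
hh_level = {
    hhid: dict(hh, N_PERSONS=0, N_CHILDREN=0) for hhid, hh in households.items()
}
for p in persons:
    hh = hh_level[p["HHID"]]
    hh["N_PERSONS"] += 1
    hh["N_CHILDREN"] += int(p["AGE"] < 18)
```

Link 3 follows the same pattern as Link 2, with episodes summarised per person instead of persons per household.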

Page 51: Data Archiving and Processing

51

Longitudinal links

1. The respondents’ data from successive surveys is joined together

2. Cross-wave ID number used for both individual and family

3. Panel surveys and cohort surveys

Page 52: Data Archiving and Processing

52

Relational links

1. Linking an individual’s marital statuses and fertility statuses

2. Linking an individual’s education / employment and job statuses

3. Linking both of the above

4. …adding housing and leisure

5. These are so-called ‘many to many’ links, i.e. one individual may have had 5 jobs, 4 different addresses, 2 marriages, 4 children and so on… others might have had far fewer.

6. See CF5.sps and CF6.sps
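One way to handle many-to-many structures without multiplying rows is to summarise each module to one record per person before joining. The Python below is a sketch of that idea only; it does not reproduce CF5.sps or CF6.sps, and the module names and records are made up.

```python
from collections import Counter

core = [{"PID": 1}, {"PID": 2}]  # one row per person
modules = {                      # one row per episode or event
    "JOBS": [{"PID": 1}, {"PID": 1}, {"PID": 2}],
    "MARRIAGES": [{"PID": 1}],
    "CHILDREN": [{"PID": 2}, {"PID": 2}, {"PID": 2}],
}

summary = {row["PID"]: {"PID": row["PID"]} for row in core}
for name, episodes in modules.items():
    counts = Counter(ep["PID"] for ep in episodes)
    for pid, record in summary.items():
        # A Counter returns 0 for people with no rows in this module.
        record["N_" + name] = counts[pid]
```

Joining the raw modules directly would give person 1 two job rows times one marriage row, and the row count grows quickly as modules are added; summarising first keeps one row per person.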

Page 53: Data Archiving and Processing

53

Data Processing Coursework

Page 54: Data Archiving and Processing

54

Survey Documentation

Page 55: Data Archiving and Processing

55

Survey Documentation exercise

1. In groups, detail exactly what is needed to effectively begin analysing a survey data set.

2. Try to be as precise as possible about the type of documentation, its content and the amount of detail that is required.

3. If you were to manage an archive of different social surveys, what would be on your check list of details to catalogue the surveys?

Page 56: Data Archiving and Processing

56

Survey Documentation (ESDS)

– all variables should be named. Ideally, variable names should not exceed 8 characters, which ensures compatibility between all current dissemination formats used by the ESDS. The absolute maximum is 32 characters, which ensures compatibility with recent versions of all major dissemination formats (SPSS, ver. 12 onwards; Stata, ver. 7 onwards; and SAS, ver. 8 onwards)

– all variables should be labelled. Labels should be brief (preferably < 80 characters) but precise, and should always make explicit the unit of measurement for continuous (interval) variables. Where possible, all variable labels should reference the question number (and, if necessary, the questionnaire)

– where possible, all data labelling should be created and supplied to the ESDS as part of the data file itself. This is the expectation with data supplied in one of the three major statistical packages - SPSS, Stata or SAS

– if the package being used for data management does not allow such variable and code labelling it must be provided as part of the documentation - i.e. a comprehensive list of variable names, variable descriptions, code names and variable formatting information

– the code used to create all derived variables (e.g. the SPSS syntax file or Stata do file) should be provided so that interested users can see exactly how these variables have arisen

– See also ESDS ‘Research Management and Documentation’
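The naming and labelling limits quoted above lend themselves to an automatic check before a file is deposited. The Python below is a hypothetical helper, not an ESDS tool; the function name, the messages and the example variable are invented, and only the three numeric limits come from the slide.

```python
IDEAL_NAME_LEN = 8   # ideal maximum for variable names
MAX_NAME_LEN = 32    # absolute maximum for variable names
MAX_LABEL_LEN = 80   # labels should preferably be shorter than this

def check_variable(name, label):
    """Return a list of problems found for one variable (empty if none)."""
    problems = []
    if len(name) > MAX_NAME_LEN:
        problems.append("name exceeds 32 characters")
    elif len(name) > IDEAL_NAME_LEN:
        problems.append("name exceeds the ideal 8 characters")
    if not label:
        problems.append("label missing")
    elif len(label) >= MAX_LABEL_LEN:
        problems.append("label is 80 characters or longer")
    return problems

# Made-up example: a short name and a label citing the question number and unit.
ok = check_variable("HHINCOME", "Q12: Monthly household income (local currency)")
```

A check like this cannot judge whether a label is precise or states the unit of measurement; that still needs a human reading of the code book.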

Page 57: Data Archiving and Processing

57

Survey Documentation (Metadata)

1. Rationale / purpose / history

2. Questionnaires

3. Code book

4. Technical details (design issues, sampling, weighting, imputation etc)

5. Technical details for users (user based examples)

6. Publications (working papers / technical reports / academic papers)

Page 58: Data Archiving and Processing

58

Example

1. www.iser.essex.ac.uk/ulsc/bhps/doc/

Page 59: Data Archiving and Processing

59

User Guides

Page 60: Data Archiving and Processing

60

What is a thematic data user guide?

1. A tool to assist researchers in locating and using data, i.e. it is not meant to provide all answers, only to point to sources which contain more detail

2. An up-to-date collation of research/publications on a particular topic

3. A brief description of different data sources

4. A brief description of different research projects, their aims, and design.

5. A collation of different theoretical questions of relevance

6. See ESDS Education / Social Capital

Page 61: Data Archiving and Processing

61

Examples

1. www.esds.ac.uk/support/thematicguides.asp

Page 62: Data Archiving and Processing

62

Exercise

1. Select a substantive topic of interest to you

2. List current data sources / theories / publications

3. Draft an introduction to this topic that would help a researcher to quickly learn the main issues.