Organising social science data – computer science perspectives
Simon Jones
Computing Science and Mathematics
University of Stirling, Stirling, Scotland, UK
Seminar: Data management in the social sciences and the contribution of the DAMES Node
Stirling 31 January 2012
DAMES: Data Management through e-Social Science
http://www.dames.org.uk
2
DAMES: Background
DAMES: Case studies, provision and support for data management in the social sciences
This talk: focusing on "support for data management"
Infrastructure/tools
Driven by social science needs for support for advanced data management operations
“In practice, social researchers often spend more time on data management than any other part of the research process” (Lambert)
A ‘methodology’ of data management is relevant to ‘harmonisation’, ‘comparability’, ‘reproducibility’ in quantitative social science
3
DAMES: Themes
Enabling the (social science) researcher:
To deposit, search and process heterogeneous data resources
To access online services/‘tools’ that enable researchers to carry out repeatable and challenging data management techniques such as: • fusion • matching • imputation …
Facilitating access is an important goal
Underlying computer science research themes
Metadata
Data curation
Data management/processing
Portals
4
Data management/processing scenarios
Curation scenarios include:
Uploading occupational data to distribute across academic community
Recording data properties prior to undertaking data fusion involving a survey and an aggregate dataset
Fusion scenarios include:
Linking a micro-social survey with aggregate occupational information (deterministic link)
Enhancing a survey dataset with ‘nearest match’ explanatory variables (probabilistic link)
Other processes: recoding, operationalising, linking, cleaning…
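A deterministic link of the kind listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names (`occ_code`, `status_score`), the codes and all values are made up for the example, and the DAMES tools may implement linking quite differently.

```python
# Hypothetical micro-social survey records; "occ_code" is an assumed key variable.
survey = [
    {"person_id": 1, "occ_code": "A1", "age": 41},
    {"person_id": 2, "occ_code": "B2", "age": 29},
    {"person_id": 3, "occ_code": "Z9", "age": 55},  # no aggregate entry for this code
]

# Hypothetical aggregate occupational information, one entry per occupation code.
occupation_info = {
    "A1": {"occ_title": "Teaching", "status_score": 80.0},
    "B2": {"occ_title": "Cleaning", "status_score": 30.0},
}


def deterministic_link(records, lookup, key):
    """Attach aggregate values to each record whose key matches exactly."""
    linked = []
    for record in records:
        merged = dict(record)          # copy, so the source data is not modified
        extra = lookup.get(record[key])
        if extra is None:
            merged["linked"] = False   # flag unmatched records instead of dropping them
        else:
            merged.update(extra)
            merged["linked"] = True
        linked.append(merged)
    return linked


linked_survey = deterministic_link(survey, occupation_info, "occ_code")
```

Unmatched records are kept and flagged rather than discarded, so the researcher can see how complete the link was.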
5
Generic data flows
Data set store
Processing
Data sets are deposited
Data sets are selected
Processing is configured
Data set selection and the configuration of processing jobs must be informed by knowledge about the data sets: metadata
Result is saved
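The flow above (deposit, select, configure, process, save) might be sketched as follows. Everything here is a hypothetical illustration, not the DAMES portal's design: the class names, the metadata keys (`topic`, `variables`, `derived_from`, `process`) and the toy data are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDataSet:
    name: str
    rows: list
    metadata: dict = field(default_factory=dict)


class DataSetStore:
    """Toy in-memory store; selection works on metadata only."""

    def __init__(self):
        self._items = {}

    def deposit(self, data_set):
        self._items[data_set.name] = data_set

    def select(self, **criteria):
        return [ds for ds in self._items.values()
                if all(ds.metadata.get(k) == v for k, v in criteria.items())]


def process(data_set, keep_variables, result_name):
    """Keep a subset of variables; the metadata is used to validate the configuration."""
    known = data_set.metadata.get("variables", [])
    missing = [v for v in keep_variables if v not in known]
    if missing:
        raise ValueError(f"variables not described in metadata: {missing}")
    rows = [{v: row[v] for v in keep_variables} for row in data_set.rows]
    metadata = dict(data_set.metadata)
    metadata.update({
        "variables": list(keep_variables),
        "derived_from": data_set.name,   # record the input
        "process": "select variables",   # record the operation
    })
    return StoredDataSet(result_name, rows, metadata)


store = DataSetStore()
store.deposit(StoredDataSet(
    "survey_a",
    [{"age": 41, "sex": "F", "income": 300}],
    {"topic": "households", "variables": ["age", "sex", "income"]},
))

selected = store.select(topic="households")[0]              # data set is selected
result = process(selected, ["age", "sex"], "survey_a_subset")  # processing is configured and run
store.deposit(result)                                       # result is saved
```

Because the result carries its own metadata, it can be found and reused through the same selection step as any deposited data set.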
6
Key role for metadata
Metadata records are absolutely core to the functioning of the portal infrastructure
For adequate, searchable records for the heterogeneous resources (data tables, command files, notes and documentation)
To connect the resources and the data mgmt tools
To document the data sets resulting from application of the data mgmt tools: inputs, process, rationale,…
DAMES requirements:
(Micro-)data based, very general
DDI (= Data Documentation Initiative)
7
DDI 2 – An XML language
<ddi2:codeBook xmlns:ddi2="http://www.icpsr.umich.edu/DDI">
  <ddi2:docDscr>
    <ddi2:citation>
      <ddi2:titlStmt>
        <ddi2:titl>An interesting study</ddi2:titl>
        <ddi2:IDNo agency="DAMES-M">12</ddi2:IDNo>
      </ddi2:titlStmt>
      <ddi2:prodStmt>
        <ddi2:producer>DAMES Portal</ddi2:producer>
        <ddi2:copyright>Univ of Stirling </ddi2:copyright>
        <ddi2:prodDate>July 29, 2010</ddi2:prodDate>
        <ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council"> RES-149-25-1066 </ddi2:grantNo>
      </ddi2:prodStmt>
    </ddi2:citation>
  </ddi2:docDscr>
  ...
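As a rough illustration of how a tool might read such a record, the sketch below parses a cut-down copy of the fragment with Python's standard `xml.etree.ElementTree`. The copy closes the elements that the slide elides so that it is well-formed; that closing, and the choice of Python, are assumptions made for the example and say nothing about how the DAMES tools are implemented.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the slide's fragment; the elided remainder is simply closed off here.
DDI_RECORD = """\
<ddi2:codeBook xmlns:ddi2="http://www.icpsr.umich.edu/DDI">
  <ddi2:docDscr>
    <ddi2:citation>
      <ddi2:titlStmt>
        <ddi2:titl>An interesting study</ddi2:titl>
        <ddi2:IDNo agency="DAMES-M">12</ddi2:IDNo>
      </ddi2:titlStmt>
      <ddi2:prodStmt>
        <ddi2:producer>DAMES Portal</ddi2:producer>
        <ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council"> RES-149-25-1066 </ddi2:grantNo>
      </ddi2:prodStmt>
    </ddi2:citation>
  </ddi2:docDscr>
</ddi2:codeBook>
"""

# Map the prefix used in the queries to the namespace URI declared in the record.
NS = {"ddi2": "http://www.icpsr.umich.edu/DDI"}

root = ET.fromstring(DDI_RECORD)
citation = root.find("ddi2:docDscr/ddi2:citation", NS)

title = citation.findtext("ddi2:titlStmt/ddi2:titl", namespaces=NS)
id_node = citation.find("ddi2:titlStmt/ddi2:IDNo", NS)
record_id = id_node.text
agency = id_node.get("agency")
producer = citation.findtext("ddi2:prodStmt/ddi2:producer", namespaces=NS)
# The grant number is padded with whitespace in the record, so strip it.
grant_no = citation.findtext("ddi2:prodStmt/ddi2:grantNo", namespaces=NS).strip()
```

Every query has to be namespace-qualified, because all elements in the record live in the DDI namespace.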
8
The metadata "cycle"
Processing
Metadata
Search
Data is mirrored by metadata
Configure/process
Select
Deposit/curate
9
DAMES portal architecture overview
Portal
DAMES Resources
External Dataset Repositories
User
Services
Search
Enact Fusion
File Access
Compute Resources
Metadata
Local Datasets
(Note: Security omitted)
10
Tools
Since metadata must have a key role in data management…
So tools for managing and exploiting the metadata have a key role in the use and operation of the DAMES portal
At deposit/curation
For searching
For informing the configuration of processing steps
The following slides illustrate use of our tools
11
Curation Tool
The source data:
12-24
[Curation Tool screenshots; images not included in this transcript]
Also automatically uploaded to searchable eXist database
25
Metadata searching
26
Browsing the search results
27
Fusion Tool prototype
Scenario: A social science researcher wishes to fuse Scottish Household Survey (SHS) data with privately collected study data:
Uses the data curation tool to upload the data
Uses the data fusion/imputation tool to select the data, identify corresponding variables, and generate a derived dataset (held in the portal)
The metadata about this derived dataset is stored and (may be) made public through the portal
Another researcher can now search the portal (metadata) for SHS data and find the derived dataset
DAMES metadata handling must facilitate this process
28
The Fusion Tool prototype
Select datasets (recipient and donor)
Select "common variables"
Select variables to be imputed
Select data fusion method
Submit to fusion "enactor"
Metadata accessed
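The choices made in these steps (recipient, donor, common variables, variables to impute, method) map naturally onto the parameters of a fusion routine. The sketch below uses nearest-neighbour matching on squared Euclidean distance as the method; this is one simple option chosen for illustration, not necessarily a method offered by the Fusion Tool, and all variable names and values are invented.

```python
def nearest_match_fusion(recipient, donor, common_vars, impute_vars):
    """For each recipient row, copy impute_vars from the closest donor row.

    Closeness is squared Euclidean distance over the common variables,
    which are assumed to be numeric and on comparable scales.
    """
    fused = []
    for r_row in recipient:
        def distance(d_row):
            return sum((r_row[v] - d_row[v]) ** 2 for v in common_vars)

        best = min(donor, key=distance)   # the 'nearest match' donor record
        new_row = dict(r_row)             # leave the recipient data untouched
        for v in impute_vars:
            new_row[v] = best[v]
        fused.append(new_row)
    return fused


# Hypothetical recipient (lacks "income") and donor (has "income") data sets.
recipient = [
    {"age": 30, "hours": 40},
    {"age": 62, "hours": 10},
]
donor = [
    {"age": 28, "hours": 38, "income": 420},
    {"age": 65, "hours": 12, "income": 150},
    {"age": 45, "hours": 35, "income": 510},
]

fused = nearest_match_fusion(recipient, donor,
                             common_vars=["age", "hours"],
                             impute_vars=["income"])
```

In practice the common variables would usually need rescaling or recoding before distances are meaningful, which is where metadata about each variable becomes useful.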
30
Select datasets (recipient and donor)
Select "common variables"
Select variables to be imputed
Select data fusion method
Submit to fusion "enactor"
Skipped
Metadata for result dataset
31
Job submission: Information flow
Wizard
Enactor
Compute resources (Condor)
subjob1
subjob2
User's local file store
Resultant data
DDI record
notify(job id)
fetch job
submit
JFDL/JSDL
description.xml
Further infrastructure
35
Thank you!