DDI The Movie 2: Applications of the Architecture

(early draft)

By I-Lin Kuo

Table of Contents

• Modules and Instrument Documentation

• The Variable

• Ontologies and Tagging

Modules and Instrument Documentation

Chapter 1

Suggested Approach to Instrument Documentation

• METS has an extremely well-designed structure map which describes the logical structure of its objects of interest. See http://www.loc.gov/standards/mets/presentations/METSIntro2.ppt

• Basically, a skeleton of a structure is created which then contains pointers to items. See next slide for the recipe for building METS.

Building a METS Document: 5 key aspects

1. Expressing the Structure
2. Linking Structure with Content
3. Linking Structure with Descriptive Metadata
4. Linking Structure and Content Files with Administrative Metadata
5. Not covered: Linking behaviors with structures.

Suggested Approach to DDI Instrument Documentation

• I recommend that the DDI adopt a similar approach
  – Create an instrument structure map for each instrument
  – Link the structure with content contained in <InstrumentItem>
    • Examples of <InstrumentItem> would be <SimpleQuestion>, <GridQuestion>, <QuestionGroup>, <Computation>, <FlowCheck>, <InterviewerInstr>, etc.
  – Link the structure and content with display behavior

• This approach has the advantage of allowing questions etc. to be reused in different instrument structure maps. This would be useful in a study with separate male and female questionnaires, for example.

• I also think (haven’t thought this through completely) that this allows a clean separation between question content and question display. Thus, a multi-mode survey would have identical structure maps linked to different display behavior.
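
The reuse argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `InstrumentItem` and `StructureNode`, the `resolve` helper, and the two sample questions are hypothetical and are not part of any DDI or METS schema. The point is that the structure map holds pointers to items rather than the items themselves, so two questionnaires can share one question.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InstrumentItem:
    """Reusable content: a question, computation, flow check, etc. (hypothetical)."""
    item_id: str
    kind: str  # e.g. "SimpleQuestion", "FlowCheck"
    text: str


@dataclass
class StructureNode:
    """One node of a structure map: a label plus pointers to items (hypothetical)."""
    label: str
    item_refs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def resolve(node, items):
    """Walk the map depth-first and return the referenced item texts in order."""
    out = [items[ref].text for ref in node.item_refs]
    for child in node.children:
        out.extend(resolve(child, items))
    return out


# The item pool is defined once; the sample questions are invented for illustration.
items = {
    "q_age": InstrumentItem("q_age", "SimpleQuestion", "How old are you?"),
    "q_preg": InstrumentItem("q_preg", "SimpleQuestion", "Are you currently pregnant?"),
}

# Two structure maps point into the same pool, so q_age is reused, not copied.
male_map = StructureNode("Male questionnaire", children=[
    StructureNode("Demographics", ["q_age"]),
])
female_map = StructureNode("Female questionnaire", children=[
    StructureNode("Demographics", ["q_age"]),
    StructureNode("Health", ["q_preg"]),
])

print(resolve(male_map, items))
print(resolve(female_map, items))
```

Display behavior could be attached the same way, as a separate mapping keyed by node or item id, which would leave the structure maps identical across survey modes.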

The Variable

Chapter 2

DDI 2.0 variable

4.3 var* (ATT == wgt, wgt-var, weight, qstn, files, vendor, dcml, intrvl, rectype, sdatrefs, methrefs, pubrefs, access, aggrMeth, measUnit, scale, origin, nature, additivity, temporal, geog, geoVocab, catQnty)
  4.3.1 location* (ATT == StartPos, EndPos, width, RecSegNo, fileid, locMap)
  4.3.2 labl* (ATT == level, vendor, country, sdatrefs)
  4.3.3 imputation?
  4.3.4 security? (ATT == date)
  4.3.5 embargo? (ATT == date, event, format)
  4.3.6 respUnit?
  4.3.7 anlysUnit?
  4.3.8 qstn*
  4.3.9 valrng*
  4.3.10 invalrng*
  4.3.11 undocCod*
  4.3.12 universe*
  4.3.13 TotlResp?
  4.3.14 sumStat* (ATT == wgtd, wgt-var, weight, type)
  4.3.15 txt* (ATT == level, sdatrefs)
  4.3.16 stdCatgry* (ATT == date, URI)
  4.3.17 catgryGrp*
  4.3.18 catgry*
  4.3.19 codInstr*
  4.3.20 verStmt*
  4.3.21 concept* (ATT == vocab, vocabURI)
  4.3.22 derivation?
  4.3.23 varFormat? (ATT == type, formatname, schema, category, URI)
  4.3.24 geoMap* (ATT == URI, mapformat, levelno)
  4.3.25 catLevel* (ATT == levelnm)
  4.3.26 notes*

DDI 2.0 var major components

• Variable type: @wgt, @intrvl
• Reference: @qstn, @wgt-var, @files, @sdatrefs, @methrefs, @pubrefs
• Descriptive: <notes>, <universe>, <txt>, <concept>, <derivation>, <qstn>, <geoMap>
• Provenance: <verStmt>
• Sampling/Measurement: <imputation>, <respUnit>, <anlysUnit>
• Logical Encoding: <valrng>, <invalrng>, <undocCod>, <catgry>, <catgryGrp>, <stdCatgry>, <codInstr>, <catLevel>, <varFormat>
• Statistics: <TotlResp>, <sumStat>
• Security/Access: <security>, <embargo>
• Physical description: @rectype, <location>
• Other: @vendor
• Note: some elements and attributes straddle several concerns. In that case, I just picked one.

Some problems of 2.0

• No recoding documentation

• One variable, one question

• Question contained within variable

• No virtual recodes

3.0 restructuring goals for the variable

• Standardize the usage of elements such as security, etc. so that they may be machine-actionable

• Standardize the naming of elements and attributes
• Reduce redundancy so there is only one way to mark up
• Compatibility with ISO11179 conception of the variable
• Compatibility with statistical tools conception of variable
• Compatibility with MetaDater concept of Question/variable
• More sophisticated recode documentation
• Better documentation of question flow in instrument documentation
• More complete classification of variable types
• Systematic handling of variable referencing
• Support of virtual recodes

2.0 Classification of variable types

• We’ll start with this as this is relatively easy

• 2.0 already has attributes wgt, wgt-var, qstn but more are needed for a richer [machine-actionable] typology

3.0 Classification of variable types

• “Types” is actually a misnomer. These should be treated as labels rather than types because they are not exclusive
  – Raw/question (codes come directly from questions) – this will probably be affected by the ongoing discussion on question typology at DDI-ID
  – Recodes
  – Weight
  – Attrition
  – Key
  – Imputation Flag
  – Time/geog?
  – Continuous/discrete
  – Aggregated
  – Nominal|ordinal|interval|ratio
  – Virtual recode – a “variable” for display purposes only without corresponding data, such as a continuous variable displayed as a discrete variable
  – dropped
    • [Nonexistent] intermediate – an intermediate variable used only for calculation, without data or display.
    • [Nonexistent] instrument – an artifact of the instrument, without data or display
  – … incomplete
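
Because the labels are not exclusive, they behave more like a set of flags than like a single type field. A minimal sketch, assuming a hypothetical `VarLabel` enumeration that covers only part of the (incomplete) list above:

```python
from enum import Flag, auto


class VarLabel(Flag):
    """Non-exclusive variable labels (hypothetical names, partial list)."""
    RAW = auto()
    RECODE = auto()
    WEIGHT = auto()
    KEY = auto()
    IMPUTATION_FLAG = auto()
    CONTINUOUS = auto()
    DISCRETE = auto()
    VIRTUAL_RECODE = auto()


def has_data(labels):
    """A virtual recode exists for display only, so it has no stored data."""
    return not (labels & VarLabel.VIRTUAL_RECODE)


# One variable can carry several labels at once.
age_group = VarLabel.RECODE | VarLabel.DISCRETE | VarLabel.VIRTUAL_RECODE
age = VarLabel.RAW | VarLabel.CONTINUOUS

print(VarLabel.RECODE in age_group)
print(has_data(age_group), has_data(age))
```

A tool reading such labels could decide, for example, to skip virtual recodes when generating a setup file but include them when rendering a codebook.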

Referencing

• Variables should reference their applicable weight variables, or vice versa

• Imputation flags should reference their corresponding variables

• Variables might need to reference attrition variable in some cumulative dataset

• Recodes will need to reference questions, computations, and other variables in their recode descriptions

• Directionality of the references remains to be decided

Machine-actionable consequences

• Identification of keys enables complex files functionality

• Weight, imputation flag, and attrition references may allow statistics to be intelligently calculated on the fly
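
As an illustration of what "intelligently calculated on the fly" could mean, the sketch below looks up a variable's weight reference in the metadata and applies it when computing a mean. The metadata layout, the `weight_ref` key, and the sample values are assumptions made for this example; the actual direction and form of the references is still undecided.

```python
def weighted_mean(dataset, metadata, var_name):
    """Mean of a variable, weighted if its metadata references a weight variable."""
    weight_name = metadata[var_name].get("weight_ref")
    values = dataset[var_name]
    # Fall back to equal weights when no weight variable is referenced.
    weights = dataset[weight_name] if weight_name else [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical metadata: AGE points at its weight variable WGT.
metadata = {
    "AGE": {"weight_ref": "WGT"},
    "WGT": {"labels": ["weight"]},
}
# Invented sample data.
dataset = {
    "AGE": [20, 40, 60],
    "WGT": [1.0, 1.0, 2.0],
}

print(weighted_mean(dataset, metadata, "AGE"))
```

With these numbers the unweighted mean of AGE would be 40, while following the weight reference gives 45, which is the kind of difference a machine-actionable reference lets software handle without human intervention.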

General approach to compatibility

• By compatibility with statistical tools (SPSS, SAS, Stata), we mean that we should be able to do a round-trip from a setup file to DDI and back to a setup file with no loss of information.
  – It is not realistic to expect as a 3.0 deliverable 3 XSLT stylesheets which transform between DDI and SPSS, SAS, or Stata setup files.
  – It may also be possible to have stylesheets which convert from SPSS and SAS proprietary XML formats to DDI, which perform the round-trip without loss of information. This depends on whether or not the DDI is rich enough to contain all the info.

• By compatibility with ISO11179 and MetaDater, we will suggest a standard way in which <var> may be marked up.

Compatibility with statistical tools: SPSS

FILE HANDLE DATA / NAME="data-filename" LRECL=66.
DATA LIST FILE=DATA /
  STANUM 8-9 QTYPE 13
VARIABLE LABELS
  STANUM 'State ID' /
  QTYPE 'State or National precinct' /
VALUE LABELS
  STANUM 2 'Alaska' /
  QTYPE 1 'State' 2 'National' /

• The simple excerpt from an SPSS setup file above can be round-tripped even with DDI 2.0:
  – Data List column info goes in <location>
  – Variable labels go into var.txt
  – Value labels go into <catgry>

• More analysis is needed to see what is necessary for round-tripping for the SPSS XML format and/or more complicated setup files. Achim is familiar with the XML.
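
To make the round-trip idea concrete, here is a minimal Python sketch covering only the column information. It parses a DATA LIST column specification, stores the positions in <location> elements using the StartPos, EndPos, and width attributes listed earlier, and regenerates the specification from the XML. The function names and the choice of a <dataDscr> wrapper element are assumptions for this example, and the regular expression handles only the simple "NAME start-end" form shown above, not the full SPSS syntax.

```python
import re
import xml.etree.ElementTree as ET


def parse_data_list(spec):
    """Parse 'NAME start-end' or 'NAME col' pairs from a simple column spec."""
    pairs = re.findall(r"([A-Za-z_]\w*)\s+(\d+)(?:-(\d+))?", spec)
    return [(name, int(start), int(end or start)) for name, start, end in pairs]


def to_ddi_vars(spec, labels):
    """Build <var> elements with <location> and <txt> children."""
    root = ET.Element("dataDscr")  # wrapper element chosen for this sketch
    for name, start, end in parse_data_list(spec):
        var = ET.SubElement(root, "var", name=name)
        ET.SubElement(var, "location", StartPos=str(start), EndPos=str(end),
                      width=str(end - start + 1))
        if name in labels:
            ET.SubElement(var, "txt").text = labels[name]
    return root


def to_data_list(root):
    """Regenerate the column spec from the <location> elements."""
    parts = []
    for var in root.findall("var"):
        loc = var.find("location")
        start, end = loc.get("StartPos"), loc.get("EndPos")
        cols = start if start == end else f"{start}-{end}"
        parts.append(f"{var.get('name')} {cols}")
    return " ".join(parts)


spec = "STANUM 8-9 QTYPE 13"
labels = {"STANUM": "State ID", "QTYPE": "State or National precinct"}
root = to_ddi_vars(spec, labels)
print(ET.tostring(root, encoding="unicode"))
print(to_data_list(root) == spec)
```

A lossless round-trip in this narrow sense just means the regenerated specification equals the original; value labels and more complicated setup files would need the further analysis noted above.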

Compatibility with statistical tools: Stata

_column(8)  int   STANUM   :STANUM   %2f   "State ID"
_column(10) int   PRECINCT           %3f   "Sample precinct number"
_column(13) int   QTYPE    :QTYPE    %1f   "State or National precinct"
_column(16) int   BACKSIDE :BACKSIDE %1f   "Backside completion flag"
_column(17) float WGT                %6.3f "Respondent weight"

label define STANUM 2 "Alaska" ;
label define QTYPE 1 "State" 2 "National" ;

• Int/float map to DDI 2.0’s <varFormat>. Q: can all of Stata’s types be mapped into DDI types?
• Does “%6.3f” map to DDI? If not, we need to add a place for it.
• The notation :STANUM indicates that perhaps formats/categories may be shared by different variables. If this is true, then <catgry> would have to be moved out of <var>
• More analysis needed. I’m not too familiar with Stata.

Compatibility with statistical tools: SAS

PROC FORMAT;
  VALUE STANUM 2='(2) Alaska' ;
  VALUE QTYPE 1='(1) State' 2='(2) National' ;
INPUT
  STANUM 8-9 QTYPE 13
LABEL
  STANUM = 'State ID'
  QTYPE = 'State or National precinct'
FORMAT STANUM STANUM. QTYPE QTYPE.

• PROC FORMAT maps to DDI <catgry>
• INPUT maps to DDI <location>
• LABEL maps to DDI var.txt
• FORMAT associates each variable with a coding format. Multiple variables may be associated with the same format. This will not work with 2.0 for the same reason 2.0 cannot associate multiple variables with the same question.
• Thus, <catgry> needs to be taken out of <var> for 3.0
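
The consequence of moving <catgry> out of <var> can be sketched as two separate tables, one for category schemes and one for variables that reference a scheme by name. The dictionary layout, the `scheme` key, and the extra variable `QTYPE_ALT` are hypothetical and exist only to show two variables sharing one scheme.

```python
# Category schemes live outside the variables, like SAS PROC FORMAT values.
category_schemes = {
    "STANUM": {2: "Alaska"},
    "QTYPE": {1: "State", 2: "National"},
}

# Each variable points at a scheme by name instead of containing its categories.
variables = {
    "STANUM": {"label": "State ID", "scheme": "STANUM"},
    "QTYPE": {"label": "State or National precinct", "scheme": "QTYPE"},
    # Hypothetical second variable that shares the QTYPE scheme.
    "QTYPE_ALT": {"label": "Hypothetical second precinct-type variable", "scheme": "QTYPE"},
}


def decode(var_name, code):
    """Look up a code through the variable's scheme reference."""
    scheme = category_schemes[variables[var_name]["scheme"]]
    return scheme.get(code, "undocumented code")


def sas_format_statement():
    """Emit a SAS-style FORMAT statement from the scheme references."""
    pairs = " ".join(f"{name} {meta['scheme']}." for name, meta in variables.items())
    return f"FORMAT {pairs};"


print(decode("QTYPE", 2), decode("QTYPE_ALT", 2))
print(sas_format_statement())
```

With the categories nested inside each variable, as in 2.0, the shared scheme would have to be duplicated and the FORMAT association could not be recovered.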

Compatibility with MetaDater

• Still a lot of reading yet to do on this one….

Compatibility with ISO11179

• Harmonization steps based on Dan Gilman’s 2003 presentation http://www.iassistdata.org/conferences/2003/presentations/

• Goal: seek to harmonize with ISO11179 at the variable model level so that DDI may be used as a transport/exchange format for ISO11179.

ISO/IEC 11179 - Core Model

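[Diagram: the four components of the ISO/IEC 11179 core model and the DDI 2.0 tags/concepts they correspond to]

Representational level:
• Data Element (variables) corresponds to <var>
• Value Domain (values) corresponds to <catgry>

Conceptual level:
• Data Element Concept (variables) corresponds to <concept> and/or <universe>
• Conceptual Domain (values) corresponds to a <concept> inside <catgry>. However, the catgry.concept does not exist in DDI 2.0

• The concepts carry pointers to an ISO11179 ontology or concept registry. Ontologies also do not exist in DDI 2.0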

ISO11179 Harmonization Steps

• 3.0 harmonization with the ISO11179 model on previous slide
  – Move <catgry> out of <var>, as different data elements may point to the same value domain. This is not possible if value domain is contained within data element.
  – Add a <concept> to <catgry> or some means of pointing to the reference domain.
  – Add a way of pointing to an ontology or registry from the <concept>. This will be explained in the section on “Ontologies”

Additional analysis needed

• Changes in the structure for the variable have to be analyzed for their impact on other concerns:
  – Nested categories
  – N-Cubes

Overall restructuring plan

• Need to identify those components which are intrinsic to a variable and those which are extrinsic or may be shared between variables
  – Intrinsic: type(wgt, derivation, txt), <recode>
  – Extrinsic: <sumStat>, <TotlResp>
  – Shared: <qstn>, <catgry>, <security>, <embargo>, <verStmt>

• Extrinsic and shared elements need to be moved out of <var>

• Elements necessary for compatibility with other standards need to be added.

Ontologies and Tagging

Chapter ?

Rel-tag microformat

• Problem: How can we associate keywords to a web page?

• Old solution: “meta” keywords in an HTML page

• 2005 solution: rel-tag microformat, popularized by the Technorati blog aggregator to allow blog authors to tag content to aid the Technorati search engine.

• This isn’t the same as the DDI problem but the solution is instructive.

Rel-tag microformat details

• Example: <a href="http://technorati.com/tag/tech" rel="tag">technology</a>
  – The last segment of the path – “tech” – is the tag
  – The preceding part – http://technorati.com/tag – is the space which knows what to do with the tag
  – “technology” is the visible part of the tag
  – rel="tag" identifies this as a rel-tag rather than a normal anchor

• See http://microformats.org/wiki/reltag or google for more details
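
The decomposition described above is mechanical, which a short Python sketch can show. The helper name `split_rel_tag` is hypothetical; it uses only the standard library and assumes the tag is the final path segment, with "+" standing for a space.

```python
from urllib.parse import urlsplit, unquote_plus


def split_rel_tag(href):
    """Split a rel-tag href into (space, tag): the tag is the last path segment."""
    parts = urlsplit(href)
    space_path, _, tag = parts.path.rstrip("/").rpartition("/")
    space = f"{parts.scheme}://{parts.netloc}{space_path}"
    return space, unquote_plus(tag)


print(split_rel_tag("http://technorati.com/tag/tech"))
```

Note that the visible text ("technology") plays no part here: the tag is carried entirely by the URL.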

DDI ontology problem

• Problem: How can we associate words in DDI markup to controlled vocabularies or ontologies such as Madeira, ICPSR social science thesaurus, or ISO11179 concept registry?

• Note that the rel-tag microformat already contains 75% of what we need:
  – The authority
  – The space
  – The tag = the keyword

• So we can probably modify this to suit our needs

Examples

<var>
  <concept>
    <a href="http://www.icpsr.umich.edu/socSciThes/crime" rel="ddi">crime</a>
  </concept>
</var>

<catgry>
  <concept><a href="http://data-archive.ac.uk/ISO11179/marital+status" rel="ddi">marital status</a></concept>
  <catValu>3</catValu>
  <labl>never been <a href="http://data-archive.ac.uk/Madeira/marriage" rel="ddi">married</a></labl>
</catgry>
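
A consumer of such markup could collect the tags with a few lines of Python. This is a sketch only: the `ddi_tags` helper is hypothetical, rel="ddi" is the proposal made here rather than an existing convention, and the fragment is a well-formed copy of the category example, using the same illustrative URLs.

```python
import xml.etree.ElementTree as ET

fragment = """
<catgry>
  <concept><a href="http://data-archive.ac.uk/ISO11179/marital+status" rel="ddi">marital status</a></concept>
  <catValu>3</catValu>
  <labl>never been <a href="http://data-archive.ac.uk/Madeira/marriage" rel="ddi">married</a></labl>
</catgry>
"""


def ddi_tags(xml_text):
    """Return (space, keyword, visible text) for every rel="ddi" anchor."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for anchor in root.iter("a"):
        if anchor.get("rel") == "ddi":
            # The keyword is the last path segment; the rest identifies the vocabulary.
            space, _, keyword = anchor.get("href").rpartition("/")
            found.append((space, keyword, anchor.text))
    return found


for space, keyword, text in ddi_tags(fragment):
    print(space, keyword, text)
```

The second tag shows a visible word ("married") that differs from its keyword ("marriage"), and the two tags point at different vocabularies within one category.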

Rel-tag flexibility

• Note that tags can occur anywhere and are not restricted to <concept>

• The visible part does not have to match the keyword

• Different ontologies may be used simultaneously

Applications of rel-tags

• ISO11179: Rel-tags plus the variable restructuring suggested in the previous chapter “The Variable” make the DDI variable compatible with the ISO11179 data element/variable model

• Comparative data search: Rel-tags provide a way to implement the “upward-pointing” to a controlled vocabulary that Wendy and Jostein talked about last week. This implementation does not conflict with the variable-variable link mechanism needed for Reto.

• Madeira: rel-tags allow Madeira to mark up individual words in <catgry>

Shortcoming

• As currently used, rel-tags do not allow for nested tags.

Summary

• DDI should look into rel-tags or some variant to be used with ontologies