
DDI the Movie #1: Architecture for a Modular Distributed Metadata XML

By I-Lin Kuo

Table of Contents

• 1. Modularity and Physical Instances

• 2. Modules, Visibility, and Versioning

• 3. Modules and Lists

• 4. The Known Intended Functions of DDI

• 5. DDI Formats

• 6. Modularity and Grouping: Composition vs. Inheritance

• 7. Modules and the Study Space

• 8. Links and Referencing

Modularity and Physical Instances

Chapter 1

Standalone instance (~DDI2.0)

• This is a standalone instance, like DDI 2.0’s main use case.

• Question and variables are all contained within the same physical instance.

• A standalone instance is complete in and of itself, like a “codebook”.

[Diagram: a single physical instance containing Questions and Variables.]

Instances Sharing Modules Use Case

• Exporting a module: We might, however, want to pull the questions out into their own instance so that they can be shared by several studies/datasets. To do so, parts of the standalone instance must be modularized.

• Two questions must be resolved:

– Q1.1: How do we indicate this [sharing] relationship between documents?

– Q1.2: How do we actually reference between the documents in a robust way?

[Diagram: Study 1 and Study 2 each contain their own Variables and share one Questions module.]

Translation Use Case

• A similar case is the adding of translations to an existing DDI.

• We would like to alter the existing DDI instance as little as possible. Thus, we would like the translations to be in separate documents.

• We would like the existing DDI instance to be able to function without the translation.

• We would like multiple translations which could be added later with no more changes to the original.

• We would like the same translation to be usable if the original instance were updated.

[Diagram: an instance containing Questions and Variables, with separate German and Finnish translation documents.]

Q1.1: Relationship Indication

• Q1.1: How do we indicate this relationship between documents?

• A: We borrow a concept from Eclipse plugins. Eclipse plugins have either an extends relationship or a dependency relationship with other plugins. These relationships are expressed explicitly in the plugin’s manifest files.

• We will express these relationships explicitly in the DDI’s header section, using a similar tag style as Eclipse.

Instances sharing modules

• We may refactor a standalone instance so that it uses a shared question module.

• The core instance is incomplete without the Questions module, so the relationship is dependency. A dependency relation is expressed by <requires>

• The original instance indicates this dependency by an appropriate <requires> element in its header.

• The required module simply provides a name/identifier in its header. (the uniqueness and format of this identifier are to be resolved later). The module does not indicate what it is used by since it may be used by several different instances

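[Diagram: Study 1 and Study 2 each keep their own Variables and share the Questions module.]

In the header of each instance that uses the module: <requires module="Questions"/>

In the header of the shared module: <module name="Questions"/>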
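As a minimal sketch of how an application might read these header declarations, the following Python uses the standard library's ElementTree. The <requires> and <module> elements and their attributes come from the slides; the enclosing <header> element and the function names are assumptions made for illustration and are not part of any DDI schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical headers: only <requires> and <module> come from the slides;
# the enclosing <header> element is an assumption for this sketch.
STUDY_1_HEADER = '<header><requires module="Questions"/></header>'
QUESTIONS_HEADER = '<header><module name="Questions"/></header>'


def read_header(xml_text):
    """Return (declared module names, required module names) for one header."""
    root = ET.fromstring(xml_text)
    declared = [m.get("name") for m in root.iter("module")]
    required = [r.get("module") for r in root.iter("requires")]
    return declared, required


def unresolved(required, available_headers):
    """List required modules that none of the available headers declares."""
    declared = set()
    for text in available_headers:
        declared.update(read_header(text)[0])
    return [name for name in required if name not in declared]
```

With these definitions, `unresolved(["Questions"], [QUESTIONS_HEADER])` is empty, while the same call with no available headers reports that the Questions module is missing, which is what makes the core instance incomplete.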

Translation

• Since the original instance is complete without the translation, the relationship is extension.

• It is important to differentiate between dependencies and extensions.

• An <extension-point> indicates some indeterminate module may be plugged in

• Q1.3: Why is it important?

• Q1.4: What happens when we have multiple extensions/translations?

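[Diagram: an instance containing Questions and Variables, with German and Finnish translation modules plugged in.]

Core instance: <extension-point module="translation"/>

German module: <extension name="translation"/>

Finnish module: <extension name="translation"/>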

Q1.4 in more detail

• Q1.4: What happens when we have multiple extensions/translations?

• Note that the Instances-Sharing-Modules is an example of many-sharing-one. The Translation example is an example of one-using-many. In the one-using-many, we often have to decide to use only one – but which one should we use?

• Which-one-to-use is known to the application at runtime, not at markup time. The actual choice may depend on application context or user choice. For example, if the user had previously selected the language German, then the German translation should be used.

• A more precise expression of the above is that extensions are late-bound and dependencies are early-bound.

Identically structured datasets

• Consider a study with multiple datasets all identically structured.

• This might be U.S. Census 2000, or …

• … a simple study provided with multiple physical data formats – SAS, SPSS, STATA.

• As much as possible, we’d want the same DDI instance to be used for all three data formats.

• This is another one-using-many example. Structurally, it is identical to the Translation Use Case.

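[Diagram: one DDI instance with SAS, STATA, and SPSS physical-format modules plugged in.]

Core instance: <extension-point module="physical"/>

SAS module: <extension name="physical"/>

STATA module: <extension name="physical"/>

SPSS module: <extension name="physical"/>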

Q1.4 answered

• Q1.4: What happens when we have multiple extensions/translations?

• A: The application must select the appropriate one based on context.

• Therefore, the markup must specify the context-type to be used when selecting. At run-time, the user or application will provide the actual context.

Selectors

• The Eclipse model of linking between plugins needs to be enriched by adding a selector or context-type concept to capture the conditional relationship between modular DDI instances.

• Selectors allow the decision of which of the many is actually connected to the one to be deferred until runtime.

Q1.4 answered II

[Diagram: language selector example. The instance containing Questions and Variables selects between the German and Finnish translation modules.]

Core instance: <extension-point module="translation" selector="xml:lang"/>

German module: <extension name="translation" xml:lang="german"/>

Finnish module: <extension name="translation" xml:lang="finnish"/>

The context of the selection is “language”, or xml:lang.

Q1.4 answered III

[Diagram: statistical format selector example. One DDI instance selects between the SAS, STATA, and SPSS physical-format modules.]

Core instance: <extension-point module="physical" selector="stat-format"/>

SAS module: <extension name="physical" stat-format="SAS"/>

STATA module: <extension name="physical" stat-format="STATA"/>

SPSS module: <extension name="physical" stat-format="SPSS"/>

The context of the selector is “stat-format”.
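The late binding described by selectors can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Extension` class, its `location` field, and the `select_extension` function are hypothetical names; only the extension-point name, the selector name, and the selector values come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Extension:
    name: str       # extension-point this module plugs into, e.g. "translation"
    context: dict   # selector values declared by the module
    location: str   # hypothetical location of the physical instance


def select_extension(point, selector, extensions, runtime_context):
    """Late-bind one extension for an extension-point using the runtime context."""
    wanted = runtime_context.get(selector)
    if wanted is None:
        return None  # no context supplied: use the core instance alone
    for ext in extensions:
        if ext.name == point and ext.context.get(selector) == wanted:
            return ext
    return None  # the core instance is complete without an extension


TRANSLATIONS = [
    Extension("translation", {"xml:lang": "german"}, "german.xml"),
    Extension("translation", {"xml:lang": "finnish"}, "finnish.xml"),
]
```

The markup supplies only the selector name (`xml:lang`); the value (`german`) arrives from the user or application at runtime, for example `select_extension("translation", "xml:lang", TRANSLATIONS, {"xml:lang": "german"})`.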

Q1.3

• Q1.3: Why is it important to distinguish between dependency and extension?

• A: For example, contrast a multiple language survey with a single language survey that has been translated.

• The documentation for a multi-language survey is incomplete without its language components. The relationship is requires.

• The documentation for the single language survey is complete without its translations. The relationship is extends.

• Both have a language selector, however.

Q1.3

• After rethinking, it may not actually be important to distinguish between dependency and extension. I’ll have to rethink this for the October meeting.

• In any case, a <requires> element is a kind of <extension-point>

Exporting Modules

• In the first example, we exported questions into its own separate instance.

• To do this, we had to add <requires …> to the header.

[Diagram: Questions exported from the instance that contains Variables.]

Header of the exported module: <module name="Questions"/>

Exporting Modules

• Let’s say we wanted to export Variables instead of Questions.

• Should we just place Variables in its own physical instance and add <requires …> to the header like we did before?

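[Diagram: Variables placed in its own physical instance, separate from the instance that contains Questions.]

Original instance: <requires …/>

Exported module: <module name="Variables"/>

Is this OK?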

Exporting Modules

• Variables depends on questions, so there is a circular dependency.

• Circular dependencies must be avoided!

• Q5: Why must circular dependencies be avoided?

• Q6: How do we avoid circular dependencies?

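[Diagram: Variables and Questions in separate physical instances that require each other.]

Original instance (contains Questions): <requires module="Variables"/>

Exported module: <module name="Variables"> with <requires module="Questions"/>

NO!!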

Why circular dependencies must be avoided

• In this example, because there is a circular dependency, Study 2 indirectly depends on Study 1.

• This is very, very bad.

[Diagram: Study 1 and Study 2 each contain their own Questions and both require the shared Variables module.]

Study 1: <requires module="Variables"/>

Shared module: <module name="Variables"/>

Study 2: <requires module="Variables"/>
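A tool that manages modular instances could check for this situation mechanically. The following is a minimal sketch, assuming the <requires> relationships have already been collected into a dictionary that maps each instance or module name to the names it requires; the function name and the example names are illustrative only.

```python
def find_cycle(requires):
    """Return one dependency cycle as a list of names, or None if there is none.

    `requires` maps each name to the list of names it requires.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in requires.get(node, ()):
            if state.get(dep, WHITE) == GREY:  # back edge: cycle found
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(requires):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# The bad case: Variables needs the Questions that still live inside Study1.
CIRCULAR = {"Study1": ["Variables"], "Variables": ["Study1"], "Study2": ["Variables"]}
```

Here `find_cycle(CIRCULAR)` reports the loop between Study1 and Variables, which is exactly the path by which Study2 ends up depending on Study1.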

Export Heuristic

• Q6: How do we avoid circular dependencies?

• HEURISTIC: If we export a module then we must either export or copy all of its dependencies.
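The heuristic amounts to computing the transitive closure of a module's dependencies before exporting it. A minimal Python sketch, assuming the dependencies are available as a dictionary from module name to required module names (the function name is hypothetical):

```python
def export_closure(module, requires):
    """Return the module plus everything it transitively requires.

    Per the heuristic, each of these must be exported (or copied) together.
    """
    seen = []
    stack = [module]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(requires.get(current, ()))
    return seen


# From the example: Variables depends on Questions.
REQUIRES = {"Variables": ["Questions"], "Questions": []}
```

For the running example, `export_closure("Variables", REQUIRES)` yields Variables and Questions, so Questions must leave the original instance (or be copied) whenever Variables is exported.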

Solution 1: Export dependencies

• In this picture, questions has been exported to its own physical instance.

• If there are other modules within the original DDI that depend on Questions, then there would also be a <requires module=“Questions”/> within the original’s header.

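• Verify that the scenario in which Study 2 depends on Study 1 cannot occur.

[Diagram: three physical instances: the original core instance, the exported Variables module, and the exported Questions module.]

Variables module: <module name="Variables"> with <requires module="Questions"/>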

Solution 2: Copy Dependencies

• In this case, there is duplication but no circular dependency.

• This is acceptable but not ideal

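[Diagram: the exported Variables module carries its own copy of Questions, and the original instance keeps Questions as well.]

Variables module: <module name="Variables">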

Solution 3: Export Related Dependencies

• In this case, the Questions module is exported within the Variables module so there are not 3 physical instances as in Solution 1.

• There is also no duplication as in Solution 2.

• Note also that solution 3 can morph into solution 1 by exporting the Questions module from Variables. This is done without further changes to the original core DDI instance

[Diagram: the Questions module is exported inside the Variables module.]

Variables module: <module name="Variables">

What is a module?

• The preceding discussion has implications for module design. In particular, the need to avoid circular dependencies constrains how modules are drawn. If our modules have circular dependencies, then we should not consider them to be modules.

• Our current design decomposes into functional modules. This kind of decomposition doesn’t necessarily avoid circular dependencies, so I think our current modular decomposition must be revised.

• See Chapter 3: Modules and Lists

Discussion on Preservation

• The relation between the core DDI and its modules advanced in this chapter is like that of the hub and spokes of a wheel. METS also has this kind of structure. However, the DDI Modular architecture differs from METS in two significant ways:

– METS is static while DDI modules are dynamic. In other words, in METS, what’s at the end of the spoke may not change, but with DDI, what’s at the end of the spoke may be switched out by the application.

– The METS “spoke” is a loosely coupled, top-level reference. The DDI “spoke” is a tightly coupled, multi-reference.

• The dynamic nature of DDI Modular “spokes” has consequences from the preservation point of view, as it is unclear what it means to preserve a dynamic entity. (not covered by OAIS model?) Preserve a snapshot?

http://www.loc.gov/standards/mets/presentations.html

Discussion on Preservation II

• The motivation for a dynamic modular DDI architecture involving swappable modules comes from the following use cases:

– Translations

– Continuous wave

– Enhancement by end-users, archives and harvesters

– Extended data lifecycle

• The OAIS reference model deals only with the archival phase and thus does not encounter processing issues. It seems that processing concerns and preservation concerns are at odds.

Summary: Important concepts to remember

• Exporting a module

• Dependency relationship

• Extension relationship

• Selectors

• Export heuristic and circular dependencies

Modules, Visibility, and Versioning

Chapter 2

• Note: the completion of this chapter predates the versioning document at http://www.pop.umn.edu/~wlt/arofan/versioning.doc and will need to be updated accordingly to incorporate the ideas of versioning.doc


Visibility

• The header or wrapper for a physical instance declares the availability of its modules to the world.

• Undeclared modules are not visible, i.e., they cannot be referenced from external instances. These are called internal modules. There are reasons why some modules should not be visible externally.

• Modules must also declare their version in their header.

• Q2.1 Why do we care about versioning?

– Historicity/Provenance

– Interoperability

• The two usages should be separated, as provenance requires tracking at a much finer level of granularity than interoperability, and is technically more demanding.

Historicity

• If a changing resource such as DDI is cited, then it is important to identify the specific version of the resource.

• For reproducibility of analysis, it is important that actionable metadata be versioned as well as data. (Non-actionable metadata need not be versioned).

• The debate of whether or not DDI ought to be versioned usually boils down to whether or not DDI is regarded as actionable metadata.

Historicity II

• It is inaccurate to change the @author of a module every time the module is modified throughout the lifecycle. Thus, depending on the requirements, it may be necessary to label individual elements by the versions in which they were last changed.

• It may also be necessary to label elements with multiple @author to track the sources of the changes.

• Placing the above information in the XML metadata itself is verbose and error-prone. It is the opinion of this author that there are far better mechanisms (a la CVS) to accomplish this purpose.

Interoperability

• If metadata is to be machine-actionable, then it needs to be versioned.

• Applications will need to know what versions of the metadata are compatible with each other.

Suggested versioning scheme

• DDI versioning is not mandatory. However, it is recommended that applications which do version DDI instances use the following versioning scheme:

• DDI instance versions should be identified by a 4-part versioning number, as well as a publication timestamp. The versioning number should consist of digits only.

• An optional non-digit identifier may be placed in @edition.

• For both document-based and dynamic RDBMS-based archives, this allows retrieval by either version number or timestamp, but not both.

4-Part Version Number

• Left side driven by data changes – if the data itself is versioned, then this should match as closely as possible

• 1.3 -> 1.4 might involve minor data cleaning related changes, or reissue in a different physical format

• 1.3 -> 2.0 would involve significant data changes such as adding newly recoded variables

• Right side driven by metadata changes

• 7.2 -> 7.3 should involve minor metadata changes with no anticipated incompatibilities such as typo corrections, adding question text, adding related non-data materials etc.

• 7.2 -> 8.0 should involve major metadata changes such as modularization of DDI, adding comparison linkages, adding ISO 11179 markup

Example: 1.3.7.2 (left side 1.3, right side 7.2)

?? I’m uncertain as to whether 2-2 is good enough. Perhaps 3-2 ??

Examples

• 3.11 is not acceptable

– Only has two parts. Recommended: 3.11.0.0

• 4.11a.1.2 is not acceptable

– Has non-digits. Recommended: 4.11.1.2 with edition="a"

• 2.a.5.2 is not acceptable

– Has non-digits. Recommended: 2.0.5.2 with edition="a"
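The recommendations in these examples can be applied mechanically. The following Python sketch is one possible reading of them, not a rule stated by the slides: it pads missing parts with 0, moves any non-digit characters into the edition, and treats a part with no digits as 0. The function names are hypothetical.

```python
import re


def is_valid_version(text):
    """True only for exactly four dot-separated groups of digits."""
    return re.fullmatch(r"\d+(\.\d+){3}", text) is not None


def normalize_version(text):
    """Split a loose version string into (4-part numeric version, edition).

    Missing parts are padded with 0 and non-digit characters are moved to
    the edition, mirroring the recommendations in the examples above.
    """
    parts = text.split(".")
    if len(parts) > 4:
        raise ValueError("more than four parts: " + text)
    numbers, edition = [], ""
    for part in parts:
        digits = "".join(re.findall(r"\d", part))
        edition += re.sub(r"\d", "", part)
        numbers.append(str(int(digits)) if digits else "0")
    numbers += ["0"] * (4 - len(numbers))
    return ".".join(numbers), edition or None
```

Applied to the three rejected strings, this yields `3.11.0.0` with no edition, `4.11.1.2` with edition `a`, and `2.0.5.2` with edition `a`.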

Versioning Authority

• If the data producer provides a data version number, then that should be used for the left side, if possible. The data producer is the data versioning authority.

• The metadata versioning authority is the organization which houses/disseminates the metadata.

• If there is no version number for the data, or if there are multiple pieces of data with different version numbers, then the data versioning authority is the metadata versioning authority and assigns numbers to the left side.

• If a harvester such as VDC or Nesstar does not change the metadata, then the original source of the DDI is the metadata versioning authority. If the harvester enhances the harvested metadata (and redistributes the DDI) then the harvester becomes a metadata versioning authority and should include a reference to the original. If the harvester does not redistribute the DDI, then no metadata versioning authority change is necessary.

Version Derivation

• If an organization or application receives a DDI from another organization or application which it further enhances, then the second is derived from the first.

• Version derivation information should be recorded.

• The data version numbers should agree, if possible, between the original and the derived.

• The metadata version numbers need not agree, and indeed, the derived metadata version number may be lower than the original metadata version number. This is because the keeper of the metadata is the metadata versioning authority.

Version Dependencies

• Dependencies should indicate version for the sake of interoperability. Currently, we have two types of dependency -- <requires> and <extension> -- located in the header.

• <requires requires-version="1.2.3.0" edition="a">

• <extension extends-version="0.9.11.0">

• Modules which lack a version number cannot be used externally.

• However, dependencies should specify a range of versions rather than a single version, since metadata can change.

Version Range Examples

• 1.2.0.* (does not include 1.2.1.0)

• 1.*

• 1.2.0.3+ (does include 1.2.1.0)

• 1.2.0.3 – 1.2.0.5

• 1.2.0.3 – 1.2.0.*

• 1.2.0.3+ – 1.2.0.* (same as 1.2.0.3 – 1.2.0.*)

• It is not anticipated that an edition range is necessary. However, a module may extend more than one extension-point via multiple <extension> elements. This mechanism may effectively allow for an “edition range”.
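One way to interpret these range forms is as a lower and an upper bound on a 4-part version. The Python sketch below is an assumption about the intended semantics, inferred only from the notes "does not include 1.2.1.0" and "does include 1.2.1.0" above: a wildcard matches any value in its position and all later positions, a trailing "+" means "this version or later", and "low – high" is inclusive at both ends. The function names are hypothetical.

```python
import re

INF = float("inf")


def _bound(text, fill):
    """Turn a pattern such as '1.2.0.*' into a 4-tuple.

    The '*' position and any missing positions are replaced by `fill`.
    """
    values = []
    for part in text.strip().rstrip("+").split("."):
        if part == "*":
            break
        values.append(int(part))
    return tuple(values + [fill] * (4 - len(values)))


def in_range(version, spec):
    """Check a 4-part version string against one of the range forms."""
    v = tuple(int(p) for p in version.split("."))
    spec = spec.strip()
    pieces = re.split(r"\s+[-–]\s+", spec)
    if len(pieces) == 2:          # "low – high"
        low, high = pieces
    elif spec.endswith("+"):      # "low+": open-ended
        low, high = spec, "*"
    else:                         # exact version or wildcard pattern
        low = high = spec
    return _bound(low, 0) <= v <= _bound(high, INF)
```

Under this reading, `in_range("1.2.1.0", "1.2.0.*")` is false while `in_range("1.2.1.0", "1.2.0.3+")` is true, matching the two parenthetical notes in the list.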

Summary

• Only modules which are declared in the header are visible externally.

• The declaration must include a @version number and a @publish-date, with an optional @edition if applicable. A @version-authority is required, with possibly a @version-derivation if necessary.

• The recommended versioning scheme (TBD) is a 4-part numeric version number.

• Dependencies must declare the version or version range upon which they depend.

• Versioning and publishing are intertwined. Only published modules may be versioned, and the publishing authority is the same as the versioning authority.

Modules and Lists

Chapter 3

Q3.1 Why Lists?

• In DDI 2.0, we refer to producer, author, researcher, etc. multiple times when they are the same entity. We would like to not have to repeat this information every time we use it.

• In DDI 3.0, multiple variables may be derived from the same question. We would like to not have to repeat the question text and associated information.

• Other examples abound…

• So, for example, we may choose to gather all the producers, authors, etc. in an <institutionsAndPersonsList> and simply refer to the items in the list.

Module = Collection of Lists

• The simplest module is a single list:

– <QuestionsList>

– <institutionsAndPersonsList>

• The most complicated module is a collection of Lists

• More precisely, a module is a collection of unordered lists

• Q3.1: Why unordered lists?

Lists map to OO and RDBMS

• This conception of a module allows a relatively straightforward mapping to OO and RDBMS implementations:

– Modules map to OO packages or RDBMS schemas

– Unordered Lists map to OO Collections or RDBMS tables

– List items map to OO objects or RDBMS rows

– 1:n or n:1 relationships map to foreign key refs in the usual way. n:n relationships map to linking tables.

List/Module Management Fundamental Operations

• Publish/Unpublish

• Export

• Import

• Rename/Move/Copy

• Extract (from inline to List)/Inline

• Filter

• Concatenate

• Merge-common/Consolidate? Is this necessary?

• Merge-all

– Resolve by publication date or version number or derivation

– Resolve conflicts manually if necessary

If you understand the module operations and you also understand where you may want to use these operations in the DDI data lifecycle then you understand DDI Modules

Publish/Unpublish

• A module is published when it is made available “externally”, though exactly what that means remains to be defined

• Most module operations should only be performed on unpublished modules.

• If a published module is unpublished, then it is withdrawn from circulation, perhaps to be modified and republished later. (This has more meaning in the context of a data library).

Example: Export and Import

• We take this example of exporting Variables from Chapter 1. Please view the animation.

• Questions must be exported before Variables because of dependencies.

• Then Variables are exported

• Finally, Questions can be imported into Variables so there are only two physical instances.

[Animation: starting from an instance containing Variables and Questions: (1) Export Questions, (2) Export Variables, (3) Import Questions into Variables.]
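The ordering constraint in this example (a module is exported only after the modules it depends on) is a topological ordering of the dependency graph. As a minimal sketch, Python's standard `graphlib` module (Python 3.9 or later) can compute such an order; the `export_order` function name and the dictionary representation of dependencies are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def export_order(requires):
    """Order modules so each is exported only after the modules it requires.

    `requires` maps each module name to the names it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(requires).static_order())


# From the example: Variables depends on Questions.
ORDER = export_order({"Variables": ["Questions"], "Questions": []})
```

For the example, `ORDER` lists Questions before Variables. A circular dependency makes the ordering impossible, and `export_order` then raises `CycleError` instead of returning a list.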

Rename/Move

• These operations should only be allowed on unpublished modules as they break references.

• More details in “Modularity and Referencing” chapter

Extract/Inline

• Some forms of the DDI may not be modular. In that case, inline elements must be extracted into a list as an internal module. For example, all author/PI/editor type elements might be extracted into a <PersonsAndOrganizationsList>

• The reverse inline operation resolves all the references to the module. This may be useful to create a standalone DDI or a DDI normal form.

Filter

• Filter usually occurs in conjunction with another operation: extract and filter, copy and filter

Example: Concatenate and Merge

• Here’s another animated example

• Questions from two related studies are exported

• At this point, there are four physical instances

• We might then decide to combine all the questions into a single document via a concatenate operation.

• We might go one step further and decide to merge the separate question modules

[Animation: Export Questions from each study (two Questions modules), then Concatenate Questions into one document, then Merge Questions into a single module.]

Concatenate and Merge

• Q3.2: What’s the difference between the Concatenate and Merge operations? Using questions as an example:

• Concatenate:

– Total number of questions before = total number of questions after

– Two question modules are declared in the header

– Concatenate is non-directional: A Concatenate B is functionally the same as B Concatenate A

– No key changes in the referring documents

• Merge:

– Total number of questions before >= total number of questions after

– Only one question module is declared in the header

– Merge is directional: A merged to B is not functionally the same as B merged to A

– May require key changes in the referring documents
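The contrast between the two operations can be sketched in Python by treating a question module as an unordered mapping from key to question text. Everything here is an illustrative assumption rather than a DDI definition: the dictionary representation, the function names, the caller-supplied `same_question` test, and the premise that keys are unique across the two modules.

```python
def concatenate(module_a, module_b):
    """Keep both modules side by side; no question or key is changed."""
    return [dict(module_a), dict(module_b)]


def merge(source, target, same_question):
    """Merge `source` into `target`; return (merged module, key changes).

    Duplicates are resolved in favour of `target`, so referring documents
    that used a dropped `source` key must be updated with `key_changes`.
    Assumes keys are unique across the two modules.
    """
    merged = dict(target)
    key_changes = {}
    for key, item in source.items():
        match = next((k for k, t in target.items() if same_question(item, t)), None)
        if match is None:
            merged[key] = item          # new question: keeps its key
        else:
            key_changes[key] = match    # duplicate: referrers must be re-keyed
    return merged, key_changes


STUDY_1 = {"s1.q1": "Age?", "s1.q2": "Sex?"}
STUDY_2 = {"s2.q1": "Age?", "s2.q2": "Income?"}
```

Merging `STUDY_2` into `STUDY_1` with an equality test drops one of the four questions and re-keys `s2.q1` to `s1.q1`; merging in the other direction re-keys `s1.q1` instead, which is the sense in which merge is directional.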

Merge

• Merge is not a routine operation. It is a major metadata change

• Merging B to A may be easily done when

– A is a published or private module

– B is a private module

• Merging B to A is not easily done when both A and B are published modules because keys in referring documents must be changed. Normally, this should only be done during harmonization. All references in the study space should be changed.

Published

• A module is published if it is officially released to the public. A module may also be considered to be published if it is output from one application to be used in a different application.

• Only published modules may have version numbers. Published modules MUST have version numbers.

• A published module which is reabsorbed as an internal module loses its version number. Each published module may evolve its version number independently.

• A necessary but not sufficient condition to be published is that the appropriate header information has been added so that other documents can refer to the list item in the module.

• The danger with changing published modules is that the links in referencing documents may become invalid.

Merge Conflicts

• As with CVS update conflicts, merge conflicts require human intervention

• ???? I haven’t thought the merge conflict resolution process through yet ????

• Resolving conflicts is a manual operation (re: CVS operations)

• Q3.1 Why Lists?

– Lists are easy to conceptualize. Modules are a sort of black box.

– Module operations are easily defined as list operations.

– It is easy to define retrieval from a list as opposed to some more complicated structure.

• Q3.2 Why unordered lists?

– How to define merge and concatenate for ordered lists is very problematic.

– The list retrieval mechanism (using the key) can’t return order in an obvious way.

– Unordered lists correspond to RDBMS tables in an obvious way.

• Q3.3 What if we need order?

– Define an unordered list where each list item has internal order, but none of the order relationships between the list items matter. This may or may not work, depending on the kind of order needed.

Summary

• Modules are nothing more than a collection of unordered lists.

• Module operations are essentially list operations.

The Known Intended Functions of DDI

Chapter 4

• Metadata Archival Format

• Metadata for Statistical Analysis

• Metadata Dissemination Format

• Browsing Metadata Format

• Transport/Exchange Format

• Application Metadata Persistence Format

• Data Management Metadata Format

• Extensible Metadata Format

Functional vs Substantive Use Cases

• This analysis distinguishes between functional use cases and substantive use cases. Examples of substantive use cases would be:

– Time series

– Geography

– Cross-national studies

• Functional use cases are important for the architecture of DDI, the purpose of this analysis

• Functional use cases are covered in another document by Wendy, see http://www.pop.umn.edu/~wlt/arofan/changingData.doc


History: The Codebook

• The old codebook served three purposes:

– As a metadata archival format to be preserved as a historical document alongside the data.

– As metadata to be used by a researcher to aid in analyzing the data.

– As an easily disseminated metadata document.

• Shortcomings

– Codebook formats varied widely

– Format was human-readable but not machine-readable

– Format was not very complete or easily browseable

Expanded Scope of the DDI

• The new DDI lifecycle intends a number of additional uses of DDI:

– As a metadata format for browsing and searching without having to download the associated data.

– As a metadata transport/exchange format between data management systems, data libraries, or applications.

– As a persistence format to record the full or partial state of a data management system, data library, or application.

– As an extensible metadata format to allow systems, applications, and possibly users to contribute metadata throughout the lifecycle.

Examples of Browsing Use

• An XSLT stylesheet converts a codebook into a series of web pages with drill down capability.

• Nesstar and VDC harvest DDI instances into their own proprietary repositories, which allows their users to search for relevant datasets across compatible archives

Examples of Transport/Exchange

• Between systems -- The MetaDater Project and ISO11179 would like to use DDI to exchange metadata between systems.

• Between applications – Blaise outputs DDI to be used in SPSS analysis

• From application to archive – SPSS outputs DDI to a data archive

• From archive to application – data library DDI is downloaded to user’s machine along with data to be directly read into SAS

Application Metadata Persistence Format Examples

• A questionnaire design application may choose to store its questionnaire in DDI rather than in some proprietary format. This format could be stored alongside a rectangular data file and modified only slightly for dissemination.

Data Management Metadata Format Example

• A simple file-based data management system stores DDI as files in its file system.

• A RDBMS-based data management system such as Metadater stores a snapshot of its current state as DDI documents.

Extensible Metadata Format Example

• A data library wishes to enhance the value of the metadata by adding translations, defined vocabulary classifications, or comparison metadata. It should be able to do so with minimal modification to the original DDI and in a way that easily accommodates to the original DDI.

• A CAI application starts with a DDI as input and then adds instrument documentation.

Q4.1 Can one format rule them all?

• A: I suggest 4 different packaging formats

– DDI Lite: for simplest core functionality such as simple questionnaire documentation and statistical tool, analogous to the codebook or DDI 2.0

– Standalone: for full instrument documentation and limited comparative data and complex data capability

– Modular: to handle translations, comparative data, time series, etc., which require complex inter-document linking.

– DDI-zip: same as Modular, but zipped together in a single file for easy transport.

Document Format vs. Archive Format

• The document format is not the same as the archive format

• The archive format may be file-based, or RDBMS-based such as MetaDater, or a proprietary format such as Nesstar

• The archive format is an implementation choice. The choice of document format is a service provided to the consumer.

• Not all archives need or should support all DDI formats.

• Not all DDI document formats are equivalent. There is a clear hierarchy of functionality. The SRG should provide tools to convert from one format to another, with the understanding that roundtrip conversion may result in loss of functionality/metadata.

Typical DDI 2.0 use case in 3.0: Simple Study Post Hoc Processing

• Archive receives single study with no DDI documentation

• Archive performs post-hoc DDI documentation

– Simple DDI-lite format created at beginning from question text and SPSS/SAS definition statements

– Additional sampling, abstract, summary statistics added

– Enhanced content markup such as Madeira or ISO 11179 added

• Completed DDI is published

– If archive is a file-based archive, then DDI is published in DDI-lite or standalone form

– If archive is a database-based dynamic archive, then DDI is consumed and shredded

– If archive is a DDI-modular archive, then DDI is converted into modular form before publishing

– If a translation is needed, then DDI is converted into modular form before translation is performed

• Note that no knowledge of modularity is necessary until the very end and that most processing can be done with DDI-lite; thus it is still possible to perform early lifecycle DDI editing by hand.

Simple Study Post hoc Processing

[Diagram: the data lifecycle stages (Study Design, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving), with alternative output formats. Labels: No DDI; Created from SPSS.]

DDI 3.0 use case: Full lifecycle simple study

• Researcher uses questionnaire design tool which outputs DDI-lite.

• Data collection tool (Blaise) accepts DDI as input and generates a CAI instrument

• DDI with instrument metadata is generated by tool and read by stat tool (SPSS) along with gathered data. Stat tool records new variables as well as their derivation and outputs DDI Standalone

• Archive receives and enhances the DDI and performs additional manipulation before official dissemination, depending on whether archive is document-based, dynamic, or DDI-Modular.

Full lifecycle simple 3.0 study

[Diagram: the data lifecycle stages (Study Design, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving), with alternative output formats. Labels: Word; generate instrument; Include added variables.]

DDI 3.0 Use Case: Non-archived Simple Study

• Researcher produces questionnaire in Word

• Technical staff converts Word document to DDI-lite

• Application consumes DDI and produces an online questionnaire. Application then gathers data and outputs a DDI with a tab-delimited data file.

• Data is for internal use only, not published, so there is no data distribution step.

• Researcher takes DDI and data file and uses Excel or statistical tool (SPSS) to perform analysis.

Non-archived Simple Study

[Diagram: the data lifecycle stages (Study Design, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving). Labels: Word; Generate online instrument; Data.]

DDI 3.0 Use Case: Administrative Data

• Preexisting administrative data is a pure numeric file without metadata.

• DDI is used to document file format and add metadata.

• The collection instrument and process do not need to be documented

• Questions per se may not exist for such metadata, but <var> and <description> are meaningful concepts

• Data file documented by DDI for internal use only is analyzed using Excel or a statistical tool

DDI 3.0 Administrative Data

[Diagram: the data lifecycle stages (Study Design, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving). Labels: Data.]

Longitudinal study, Post Hoc Processing

1. Several waves of a longitudinal study have already been collected, with SPSS setup files.

2. SPSS setup files are converted to DDI-Lite, one for each wave.

3. Question text for each wave is converted into a question module. The module may be inlined, in which case DDI-Lite may be used. If the module is a separate physical instance, then DDI-Modular must be used.

4. Recode and derivation links are added from each wave to its question module (requires DDI-Modular).

5. Question modules may be merged to eliminate identical questions. Software should preserve the recode and derivation links

6. More derivation links may be added to the question module to indicate the temporal evolution of questions. Note that this is a kind of comparative data study.

7. Additional waves may be processed similarly without affecting previously marked up instances.
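
Step 5 above (merging question modules while preserving links) can be sketched as follows. This is only an illustration in Python, not DDI markup: the module layout, the function name `merge_question_modules`, the sample question texts, and the rule that two questions are "identical" when their text matches after whitespace and case normalization are all assumptions made for the example.

```python
def merge_question_modules(modules, links):
    """Merge per-wave question modules, dropping duplicate questions.

    modules: {module_name: {question_id: question_text}}
    links:   [(wave, variable, module_name, question_id)]
    Returns the merged module and the links re-pointed at it.
    """
    merged = {}    # new question id -> text
    by_text = {}   # normalized text -> new question id
    remap = {}     # (module_name, old question id) -> new question id
    for module_name in sorted(modules):
        for qid in sorted(modules[module_name]):
            text = modules[module_name][qid]
            norm = " ".join(text.split()).lower()  # assumed identity rule
            if norm not in by_text:
                by_text[norm] = f"Q{len(merged) + 1}"
                merged[by_text[norm]] = text
            remap[(module_name, qid)] = by_text[norm]
    # Every existing link survives; only its target changes.
    new_links = [(w, v, "merged", remap[(m, q)]) for (w, v, m, q) in links]
    return merged, new_links


# Hypothetical input: two waves that share one question.
modules = {
    "wave1": {"q1": "What is your age?", "q2": "Are you employed?"},
    "wave2": {"q1": "What is your age?", "q2": "What is your income?"},
}
links = [
    ("wave1", "v1", "wave1", "q1"),
    ("wave1", "v2", "wave1", "q2"),
    ("wave2", "v1", "wave2", "q1"),
    ("wave2", "v2", "wave2", "q2"),
]
merged, new_links = merge_question_modules(modules, links)
```

Both age variables end up pointing at the same merged question, which is what makes the later temporal links in step 6 cheap to add.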

Continuous Wave Study

• Data is gathered continuously rather than in discrete waves. Records are added without logical structure changes

• The logical structure is described once via DDI.

• As records are added to the data file, DDI version numbers and publish dates are updated

DDI 3.0: Repeated, Cross-national Study

• A standard questionnaire is prepared for each wave and converted to a question module.

• Translation module is created for each question module and language combination. Actually, this is not a translation module per se.

• Question module + translation are used to generate an instrument for each country-wave. Additional flow control and computation instrument items are added to the instrument which outputs DDI instrument module.

• Reference/standard instrument module replaces question module

• For each country, the instrument is adjusted. If the instrument is sufficiently DDI-aware, it can record country-specific modifications which point to the standard instrument and translations.

• At this point, the metadata is essentially complete, minus the actual data gathering, file formats, and recodes.

• Two possible lifecycle pictures, depending on Solution #1 or Solution #2 for the Linking/Referencing Problem (see Chapter 8, Links and Referencing)

Repeated Cross-National Study Solution #1

[Lifecycle diagram. Phases: StudyDesign, DataCollection, DataProcessing, DataDistribution, DataDiscovery, DataAnalysis, DataArchiving. Annotations: converted from standalone; synthesized from modular for distribution; comparative data links added.]

Repeated Cross-National Study Solution #2

[Lifecycle diagram. Phases: StudyDesign, DataCollection, DataProcessing, DataDistribution, DataDiscovery, DataAnalysis, DataArchiving. Annotations: synthesized from modular for distribution; comparative data links document derivation from reference standard; more comparative data links added if necessary.]

Harvested DDI

• A harvester such as VDC or Nesstar aims to create a searchable virtual collection of studies harvested from several data archives

• A DDI spider crawls the data archives collecting DDI and possibly associated data as well

• The collected DDIs are shredded and stored into a local database, with possible enhancement (such as summary statistics) provided by the harvester service

Summary

• A number of familiar and unfamiliar use cases are illustrated to show they live within the lifecycle

• Different use cases suggest a need for different DDI formats to be used in different phases of the lifecycle.

DDI Document Formats

Chapter 5

DDI-Lite

• Entry-level DDI sufficient to satisfy basic DDI 2.0 use case functionality

• Provides a low barrier to entry for DDI processing

• Intended to be used as a processing or simple exchange format, not as preservation format

• Likely to be used as a processing format by unsophisticated Data Producer or as a synthesized document disseminated for basic analysis.

DDI-Standalone

• Mid-level DDI suitable for all but the most advanced DDI use cases

• Use of referencing may present a formidable barrier to processing

• May be used as a processing/browsing format, single study exchange format, or as a preservation format

• May be synthesized from DDI-Modular in order to relieve the burden of inter-document linking

DDI-Modular

• Advanced DDI to support all use case functionality

• Much indirect referencing requires special handling

• Distributed processing format may be unsuitable as preservation format. Suitable as the basis for a study-space exchange format

DDI-zip

• May zip as DDI-Lite zip, DDI-Standalone zip, or DDI-Modular zip

• Used to package metadata along with data and related files

• Primarily used as exchange format. May also be used as a preservation format for individual studies or as a snapshot of a dynamic DDI-Modular archive.
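
As a rough idea of what packaging a DDI-zip might look like, here is a small Python sketch using the standard `zipfile` module. The slides do not define the internal layout of a DDI-zip, so the folder names `metadata/` and `data/`, the file names, and the function `build_ddi_zip` are purely hypothetical.

```python
import io
import zipfile


def build_ddi_zip(metadata_xml, data_files):
    """Bundle one DDI instance and its data files into a zip, in memory.

    The metadata/ and data/ folder layout is an assumption for illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("metadata/study.xml", metadata_xml)
        for name, content in data_files.items():
            archive.writestr(f"data/{name}", content)
    return buffer.getvalue()


# Hypothetical study: a placeholder DDI instance plus one tab-delimited file.
package = build_ddi_zip("<study/>", {"responses.tab": "id\tage\n1\t34\n"})
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    names = archive.namelist()
```

The same function would serve for a DDI-Lite, DDI-Standalone, or DDI-Modular zip; only the number of metadata documents placed in the package would differ.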

Modularity and Grouping: Composition vs. Inheritance

Chapter 6

The role of inheritance in DDI 3.0?

• The proposed grouping mechanism in DDI 3.0 is used as an inheritance-based reuse mechanism
– Group attributes are inherited by each of the group members.
– Group members may override attributes.
– Things which are shared by group members are pushed up to the group level.
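
The inherit-and-override behaviour described above resembles a layered lookup. A minimal Python sketch, assuming invented attribute names (`producer`, `language`, `mode`) and invented study names; none of these come from the DDI 3.0 proposal itself:

```python
from collections import ChainMap

# Attributes pushed up to the group level (hypothetical values).
group_attrs = {"producer": "Example Institute", "language": "en", "mode": "telephone"}

# Each member lists only what it overrides.
member_overrides = {
    "study_a": {},
    "study_b": {"mode": "web"},
}


def effective_attrs(member):
    """Member values win; anything missing falls back to the group."""
    return dict(ChainMap(member_overrides[member], group_attrs))
```

Note that every member sees every group attribute whether it wants it or not, which is the all-or-nothing character of single inheritance.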

What Problem Does Grouping Solve Really Well?

• Grouping works extremely well with the problem of organizing the collections and sub-collections of a data library.

Problems with Inheritance Approach

• General Principles (from OO Programming)
– Prefer Composition to Inheritance
– Is-a vs. Has-a

• Specific Use Cases
– Grouping is essentially a single inheritance mechanism and has difficulty dealing with multiple dimensions (e.g. time-space for repeated cross-national study)

• Pragmatic Issues
– Pushing things up leads to large “up” instance.
– Pushing things up leads to constant modification and lack of stability of “up” instance, aka “fragile base class” in OO
– Reuse does not happen when you have to override often. Reuse works best when you have much common ground. As the number of datasets increases, the common ground decreases, and the usefulness of inheritance decreases when it ought to be most useful.
– We would like to be able to pick and choose what we inherit from our parents, on an intermediate level between all-or-nothing and by-individual-characteristic

Problems… II

• Granularity
– The grouping mechanism has been slightly modified to use a grid approach at the lowest level of grouping. However, the lowest level of grouping is at the study level. The Comparative Data concern requires that linkages between studies be done at a variable level.

Problems with Inheritance Approach

• These problems are not fatal in and of themselves, as they might be resolved with a more clever implementation/use of the grouping mechanism. However, I’m unable to find a satisfactory variation.

• I’m going to suggest an approach based on composition rather than inheritance

Modules and the Study Space

Chapter 7

DDI Archive

• A DDI Archive is one which obeys the DDI Exchange Protocol. Should this be named a DDI Dissemination Node instead?

• A DDI Archive is divided into groups and subgroups. At the lowest level, the leaf nodes are study-spaces.

• A DDI Archive may be logically structured as a DDI Standalone archive or a DDI Modular archive.

• A DDI Archive may be implemented as a file-based archive or an RDBMS-based archive

DDI Archive Services

• A DDI Archive provides a number of services. The following is a possible list:
– DDI-only or DDI + Data
– Static or dynamic data services
– Static or dynamic document services: dynamic services may synthesize the DDI on-the-fly from a user-requested extract, for example.
– DDI Modules
– …

General Idea of the Study Space

• Keep grouping for collections and subcollections of a data library/archive. Inheritance (slightly modified) works at the grouping level

• Use a study space at the lowest level rather than a study. Composition is used at the study-space level.

• Linking between datasets within different study spaces is achieved by explicit linking

• Linking between datasets within the same study space is achieved by a combination of explicit linking and dynamically synthesized links uplifted from composed modules.

What is a Study Space?

• A study space is a collection of closely related studies and their associated materials, e.g.:
– A longitudinal study
– A cross-national study
– A repeated cross-national study
– A continuous wave of administrative data
– A single simple poll

• For example, the set of all U.S. Census products should reside in a single study space named “U.S. Census”

• By “closely related”, we mean that we expect the studies to reference each other at a fine grain


• A study space is designed specifically to deal with longitudinal studies, repeated cross-national studies, and continuous studies

• A study space may be static or dynamic. Dynamic study spaces may offer synthesized datasets and synthesized DDI for those datasets, re MetaDater


• A study space may be implemented as a file-based or RDBMS-based system.

• Conceptually, a study space consists of:
– DDI core instances – one for each study in the space
– DDI extension module instances
– Associated data files
– Associated other materials
– A Catalog (required if coordinates are used) which describes the content and structure of the study space
– Labels (optional)

• A group/collection/sub-collection may have optional labels

• A study space may be logically structured according to labels

Example of a Study Space

• “U.S. Census” forms a study space

• “U.S. Census 2000” is the set of all items within the study space that have a time label which overlaps with 2000

• “U.S. Census 2000” & “Summary File 4” is a dataset within the study space.

• Hopefully, this definition sidesteps the debate about what actually constitutes a study
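
Selecting items by coordinate labels, as in the example above, might work roughly like this. A minimal Python sketch, assuming labels are stored as strings with “|” between multiple values and that time values are a single year or a `start-end` range; the item ids and the second product name are invented for illustration.

```python
def time_overlaps(value, year):
    """True if any '|'-separated year or 'start-end' range covers the year."""
    for part in value.split("|"):
        start, _, end = part.partition("-")
        if int(start) <= year <= int(end or start):
            return True
    return False


def select(items, year, **coords):
    """Return ids of items whose time label overlaps the year and whose
    other coordinate labels contain every requested value."""
    chosen = []
    for item in items:
        labels = item["labels"]
        if "time" not in labels or not time_overlaps(labels["time"], year):
            continue
        if all(want in labels.get(key, "").split("|") for key, want in coords.items()):
            chosen.append(item["id"])
    return chosen


# Hypothetical contents of a "U.S. Census" study space.
items = [
    {"id": "a", "labels": {"time": "2000", "product": "Summary File 4"}},
    {"id": "b", "labels": {"time": "2000", "product": "Hypothetical Product B"}},
    {"id": "c", "labels": {"time": "1990|2010", "product": "Summary File 4"}},
    {"id": "d", "labels": {"time": "1995-2005", "product": "Hypothetical Product D"}},
]

census_2000 = select(items, 2000)
sf4_2000 = select(items, 2000, product="Summary File 4")
```

Here “U.S. Census 2000” is never stored as a study; it is simply whatever the time coordinate retrieves.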

Labels

• A label is a key-value pair. A single key may have multiple values, delimited by “|”

• Labels for a group or study-space apply to all the contents of the group or study-space. These labels serve as descriptors. The primary use of descriptor labels is for navigation and finding aids.

• Labels attached to items in a study-space may play the role of coordinates within a study space. The primary use of coordinate labels is for retrieval.

• Ancestral label values are not overridden but rather provide context

Hypothetical Label Example

<group name="Census">
  <group name="N America">
    <label key="geog" value="North America"/>
    <group name="Canada">
      <label key="geog" value="Canada"/>
      <label key="time" value="1850-2000"/>
    </group>
    <group name="US">
      <label key="geog" value="United States"/>
      <label key="time" value="1800-2000"/>
      <study-space name="population"/>
      <study-space name="agricultural">
        …
      </study-space>
    </group>
    <group name="Mexico">
      <label key="geog" value="Mexico"/>
    </group>
  </group>
  <group name="S America">
    <label key="geog" value="South America"/>
  </group>
</group>

Label Inheritance

• In the previous example, the full value of the geog label of the group Canada is “North America/Canada”. Ancestral labels are not overridden or discarded, but rather provide context.

• A finding aid/search on geog=“North America” should retrieve the groups “N America”, “Canada”, “US”, and “Mexico” but not the group “S America”. This would not happen if ancestral labels were overridden

• Disclaimer: The previous example is not an actual example of intended markup; it merely illustrates label value inheritance through grouping
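
To make the "context, not override" rule concrete, here is a minimal Python sketch that walks a trimmed copy of the hypothetical group markup and accumulates ancestral label values. The function names `full_labels` and `search` are invented for this illustration, and the markup is no more official than the example it is copied from.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<group name="Census">
  <group name="N America">
    <label key="geog" value="North America"/>
    <group name="Canada"><label key="geog" value="Canada"/></group>
    <group name="US"><label key="geog" value="United States"/></group>
    <group name="Mexico"><label key="geog" value="Mexico"/></group>
  </group>
  <group name="S America">
    <label key="geog" value="South America"/>
  </group>
</group>
"""


def full_labels(root, key):
    """Map each group name to its ancestral values followed by its own."""
    result = {}

    def walk(node, inherited):
        own = [label.get("value") for label in node.findall("label")
               if label.get("key") == key]
        path = inherited + own          # ancestors are kept, never replaced
        result[node.get("name")] = path
        for child in node.findall("group"):
            walk(child, path)

    walk(root, [])
    return result


def search(root, key, value):
    """Names of all groups whose full label path contains the value."""
    return [name for name, path in full_labels(root, key).items() if value in path]


root = ET.fromstring(MARKUP)
canada = "/".join(full_labels(root, "geog")["Canada"])
hits = search(root, "geog", "North America")
```

If `walk` passed only `own` down to the children, the search would find the group “N America” alone, which is the failure the slide warns about.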

Catalog

• Each group or study-space should have a catalog which contains the following information (possibly a DDI-Lite document?)
– Descriptor labels
– Contents (immediate children but not grandchildren)
– Coordinate labels and their allowed values (for study-spaces only)
– The catalog at root of the archive must contain a machine-actionable description of how to request items from this archive, whether by SOAP, OAI-PMH, http request, etc. It should also list the kinds of services provided by the archive
– A study-space catalog should list the methods of link synthesis, if any

• The catalog specification should be part of DDI Exchange Protocol to allow a DDI spider to browse a DDI archive.

Dynamic Study Spaces

• A dynamic study space may be implemented as a file-based or RDBMS-based system, but is conceptually based on a DDI Modular architecture.

• A “typical” dynamic archive will synthesize DDI standalone instances for distribution. In the process of synthesizing these DDI standalone instances, comparative data links may also be synthesized.

Intelligent Link Synthesis I

• Intelligent link synthesis is the use of simple inference algorithms to deduce additional links from explicit links.

• For example, take the following crude example of a longitudinal study marked up using Reto’s system:
– Wave 1: var 1
– Wave 2: var 1 same as Wave 1: var 1
– Wave 3: var 1 same as Wave 1: var 1
– Wave 4: var 1 no longer same as Wave 1: var 1 (same question but different categories, perhaps)
– Wave 5: var 1 same as Wave 1: var 1

• In Reto’s system (I think), there would be explicit links (written wave:var):
– Identical link: 2:1 → 1:1, 3:1 → 1:1, 5:1 → 1:1
– Same question different format type link: 4:1 → 1:1

• However, the following additional links may be logically inferred and thus produced when the standalone instance is synthesized
– Identical link: 1:1 → 2:1, 2:1 → 3:1, 3:1 → 5:1, …
– Same question different format type link: 4:1 → 2:1, 4:1 → 3:1, 4:1 → 5:1

• These additional links are too tedious to completely do by hand explicitly. Reto also says these links are not explicit in order to save space.
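
The inference in this example can be sketched in a few lines of Python. This is my own illustration, not Reto’s system: it assumes that "identical" behaves as an equivalence relation and that a "same question, different format" link carries over to every variable identical to its target. The function name `infer_links` is hypothetical.

```python
from itertools import combinations


def infer_links(identical, different_format):
    """Close explicit links under the two assumed inference rules."""
    parent = {}

    def find(x):                      # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in identical:            # identical variables share one class
        parent[find(a)] = find(b)
    for a, b in different_format:     # make sure every variable is registered
        find(a)
        find(b)

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)

    all_identical = {pair for members in classes.values()
                     for pair in combinations(sorted(members), 2)}
    all_different = {(x, y)
                     for a, b in different_format
                     for x in classes[find(a)]
                     for y in classes[find(b)]}
    return all_identical, all_different


# Explicit links from the five-wave example, written as wave:var.
explicit_identical = [("2:1", "1:1"), ("3:1", "1:1"), ("5:1", "1:1")]
explicit_different = [("4:1", "1:1")]
ident, diff = infer_links(explicit_identical, explicit_different)
```

Four explicit links yield six identical pairs and four different-format pairs; the pairs are stored unordered, so direction would have to be added at output time.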

Uplifted Link Synthesis Example

• Imagine the following modular structure:

[Diagram. Labels: Questions; Formats; Variables; Wave 1, Wave 2, Wave 3, …, Wave 4; Question Links; Format Links.]

Uplifted Link Synthesis Example

• Wave 1, 2, 3 share a common question and format module.
– Wave 1: var 1 comes from question 1 with format 1
– Wave 2: var 1 comes from question 2 with format 2
– Wave 3: var 1 comes from question 3 with format 3

• In the course of comparative analysis, it is determined that:
– Questions 1, 2, 3 are identical
– Questions 4, 5, 6 are identical
– Formats 1, 2 are identical
– Formats 3, 4, 5 are identical

• The appropriate links are added to the Question Links and Format Links modules.

Uplifted Link Synthesis Example, continued

• When the standalone instance is synthesized, we may choose to synthesize the following links comparing the variables. These variable comparison links are produced by uplifting the relationships in the underlying question and format comparison links:
– Identical link: 1:1 ↔ 2:1, 4:1 ↔ 5:1
– Same question different format link: 2:1 ↔ 3:1, 5:1 ↔ 6:1
– Different question same format link: 3:1 ↔ 4:1, 3:1 ↔ 5:1
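
Uplifting can be sketched as follows in Python. The sketch assumes (this is my extrapolation, since only waves 1 to 3 are spelled out) that var 1 of wave w comes from question w with format w for waves 1 through 6; the function names `class_of` and `uplift` are invented.

```python
from itertools import combinations


def class_of(groups):
    """Map each member to a class label (the smallest member of its group)."""
    lookup = {}
    for group in groups:
        for member in group:
            lookup[member] = min(group)
    return lookup


def uplift(variables, question_groups, format_groups):
    """Derive variable comparison links from question and format links."""
    q_class = class_of(question_groups)
    f_class = class_of(format_groups)
    links = {"identical": [], "same_q_diff_f": [], "diff_q_same_f": []}
    for (va, (qa, fa)), (vb, (qb, fb)) in combinations(sorted(variables.items()), 2):
        same_q = q_class.get(qa, qa) == q_class.get(qb, qb)
        same_f = f_class.get(fa, fa) == f_class.get(fb, fb)
        if same_q and same_f:
            links["identical"].append((va, vb))
        elif same_q:
            links["same_q_diff_f"].append((va, vb))
        elif same_f:
            links["diff_q_same_f"].append((va, vb))
    return links


# Assumed: wave w, var 1 comes from question w with format w.
variables = {f"{w}:1": (w, w) for w in range(1, 7)}
question_groups = [{1, 2, 3}, {4, 5, 6}]   # from the Question Links module
format_groups = [{1, 2}, {3, 4, 5}]        # from the Format Links module
links = uplift(variables, question_groups, format_groups)
```

Under these assumptions the sketch also reports 1:1 with 3:1 and 4:1 with 6:1 as "same question, different format", which follow from the same premises. No variable-level link was ever entered by hand.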

Uplifted Link Synthesis Example, continued

• Uplifted links have a number of advantages over explicit links:

– Economy of scale: the number of explicit links on the underlying modules is O(n) but the number of produced uplifted links is O(n^2)

– Transitive linkage: uplifted links are transitive where applicable. Explicit links require a lot of effort to be transitive

– Completeness: Uplifted links are complete – all variables in the study-space with the desired relationships are linked.

– Bidirectionality: Uplifted links are bidirectional without having to modify the instances of previous waves of a study. Explicit links must modify previous instances in order to be bidirectional.

Discussion

• There is a possibility that the CDG may object to the methods of inference used to synthesize links. While I don’t think that my examples ought to be controversial, nonetheless, objections may be raised on theoretical grounds.

• In the case of objection, it seems to me that those who object should restrict themselves to explicit links only, but that the mechanisms for inferring links should remain in the DDI.

Summary

• The concepts for DDI Archive and study-space are explained

• One of a DDI Archive’s services is to produce DDI Standalone instances for dissemination

• A dynamic DDI study-space may use synthesized links and uplifted links to supplement/complete explicit links for a DDI instance to link to other instances in the study-space

Links and Referencing

Chapter 8

“Linking” technologies

• HTML anchor: <a href="…">

• XLink

• XPointer & XPath

• IDs and IDrefs

• XSLT keys

• Microformats (used in blogs)

• Most of these technologies share the structure of uri-path-to-document#path-within-document

DDI Linking issues

• The DDI has two unique characteristics which fall outside of the use cases of most linking technologies
– 1. The extension of the DDI into the design and processing phases of the lifecycle means that neither the path-to-document nor the path-within-document is a static value. Most technologies assume the archival phase outlook, in which the path-to-document is indeed static
– 2. The distributed nature of the DDI is at odds with those linking mechanisms which assume link source and destination are within the same document

Link Robustness Issues

• Distributed DDI content and the extension of Lifecycle into content-editing stages lead to a number of old and new problems
– URIs are expected to change
  • For example, a DDI document is moved around after it is created by a survey design app, consumed by an online questionnaire app, re-output by the same app with data, and then given to several data analysts as they perform data cleaning. This is not a problem solvable by the use of PURLs.
  • The above problem may be mitigated by restricting to a single document version of the DDI during editing phases.
– Content structure changes as content is edited
– Content version changes as content is edited
– Referring document and referred document may not be under the control of the same entity

• None of the previously mentioned linking technologies, when straightforwardly applied, is robust enough to handle the above URI-changing scenario

• Ideally, we would like a reference mechanism that allows us to change a document’s content without having to update all the referring documents at the same time.

Solution #1: Edit standalone only

• One possible solution is to restrict the format of the DDI used during editing phases to the single document standalone form. This eliminates the need for machine-actionable extra document links altogether. More precisely…
– In the Survey Design, Data Collection, and Data Processing phases, the DDI standalone form is used
– When the dataset is submitted to an archive or library, the library/archive may optionally convert the standalone to modular form in order to add value such as translation or comparative data linkage between studies
– When the dataset is disseminated from the archive, the DDI is synthesized/reassembled dynamically into a standalone form. Such a document is for processing only, not historical. Translation text is substituted inline or placed in an internal module. All links are resolved during this time into static links. Because the extra-document links now point to published, non-changing documents, static links may now be used.

Solution #1 continued

• This solution restricts the use of the multi-document DDI Modular form to within the dynamic archive only. A cursory analysis indicates that this will have a significant impact only upon comparative data and longitudinal data concerns.
– For example, the documentation of variants of a Q/V standard for a repeated cross-national study can only occur during the OAIS ingest step in the archival phase of the lifecycle.
– For a longitudinal study, the relation of the current wave of questions is documented not at design time but at OAIS ingest step.

Solution #2: Dynamic link resolution

• Another solution is to add a robust layer of indirection to minimize the changes needed when document locations change

• Instead of storing path-to-doc#path-within-doc at every link reference, we store pointer-to-path-to-doc#logical-path-within-doc, where the value of pointer-to-path-to-doc is stored within the referring document.

• At link resolution time, we replace the pointer by its actual value, then [optional] query the referenced document to replace the logical path by the physical path.
– In Java, C#, or Perl, this requires a little extra programming
– In XSLT, this would require a multi-step transformation.

• When URIs change, we need only replace the single occurrence at the referring document’s pointer.

• When the structure of the referenced document changes, we replace the appropriate physical path in the module’s header, and this value is returned when the referenced document is queried

Example of dynamic link resolution

• Referring document
– <link ref="quest#InstrumentItems/question[id='Q3']">
– <extension-point name="quest">
  • <location href="http://www.icpsr.umich.edu/DDI?docID=1234">

• Referenced document at docID=1234
– <extension name="quest">
– <physical-xpath value="/root/Instruments/Instrument[id='5']">

• At link resolution time, the application first reads the referring document and then retrieves the referenced document at docID=1234 and reads the value of physical-xpath

• Next, every occurrence of quest# within the referring document is replaced by http://www.icpsr.umich.edu/DDI?docID=1234#/root/Instruments/Instrument[id='5']/

• So the ref in the beginning is resolved to
– http://www.icpsr.umich.edu/DDI?docID=1234#/root/Instruments/Instrument[id='5']/InstrumentItems/question[id='Q3']
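
The resolution steps in this example can be sketched in Python. The dictionaries below stand in for the referring document’s extension-point table and for a query against the referenced document; the function name `resolve`, both dictionary names, and the relocated URL `http://example.org/ddi/1234` are assumptions made for illustration, not part of any DDI specification.

```python
# Pointer table kept in the referring document (its extension points).
extension_points = {"quest": "http://www.icpsr.umich.edu/DDI?docID=1234"}

# Stand-in for querying each referenced document for its physical-xpath.
referenced_documents = {
    "http://www.icpsr.umich.edu/DDI?docID=1234": {
        "quest": "/root/Instruments/Instrument[id='5']",
    },
}


def resolve(ref, extension_points, referenced_documents):
    """Turn pointer#logical-path into location#physical-path/logical-path."""
    pointer, _, logical_path = ref.partition("#")
    location = extension_points[pointer]                  # step 1: pointer -> URI
    physical = referenced_documents[location][pointer]    # step 2: ask the target
    return f"{location}#{physical}/{logical_path}"


ref = "quest#InstrumentItems/question[id='Q3']"
resolved = resolve(ref, extension_points, referenced_documents)

# The referenced document moves (hypothetical new URL): only the single
# pointer entry changes; the ref string itself is untouched.
moved_points = {"quest": "http://example.org/ddi/1234"}
moved_documents = {
    "http://example.org/ddi/1234": referenced_documents[extension_points["quest"]],
}
resolved_after_move = resolve(ref, moved_points, moved_documents)
```

However many links in the referring document start with `quest#`, a change of location touches exactly one entry.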


Solution #1 vs Solution #2

• Solution #1 is simpler technologically and conceptually

• Solution #2 may allow more functionality available over more of the lifecycle

To be discussed

• These are the only two solutions to the linking and referencing problem that I could come up with that are acceptable to me. We should come up with other options and discuss them in October

Link Action Types

• On click, navigate to referenced document

• On click, navigate to referenced document element

• Display destination element inline

• Display processed destination element inline

• Retrieve references from destination element and then resolve those and process

• There may be more complicated behaviors. See METS http://www.loc.gov/standards/mets/presentations/METSIntro2.ppt for an example of how DDI might attach behaviors to links if necessary

IDs vs. Keys

• IDs must be unique within a document. Keys should be (but don’t need to be?) unique within a document context.

• Keys are more flexible – they can be made of a concatenation of sub-element and attribute values. IDs must be a single attribute

• Validation enforces ID uniqueness but not key uniqueness (keys are defined within XSLT, not XML)

• Keys can enforce element reference type integrity (but only at transformation time) but IDs cannot

• When editing, it is clear that you cannot change ID values without violating referential integrity, but it is not as clear with keys.
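
The difference between a key and an ID can be illustrated outside XSLT. The Python sketch below builds a key index from a concatenation of an attribute and a sub-element value, loosely analogous to an XSLT key with a concatenated `use` expression. The element names, the `wave` attribute, and the function `build_key_index` are all invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: ids are unique, but two variables share a key.
DOC = """
<study>
  <var id="v1" wave="1"><name>age</name></var>
  <var id="v2" wave="2"><name>age</name></var>
  <var id="v3" wave="2"><name>age</name></var>
</study>
"""


def build_key_index(root, tag, attribute, child):
    """Index elements by a composite key 'attribute-value:child-text'."""
    index = defaultdict(list)
    for element in root.iter(tag):
        key = f"{element.get(attribute)}:{element.findtext(child)}"
        index[key].append(element.get("id"))   # duplicates simply accumulate
    return index


root = ET.fromstring(DOC)
index = build_key_index(root, "var", "wave", "name")
duplicates = {key: ids for key, ids in index.items() if len(ids) > 1}
```

Nothing rejects the duplicate key `2:age`; the lookup simply returns two elements, so any uniqueness check has to be performed by the processing application.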
