
1

GSLIS Data Curation Institute, June, 2008

Data Integrity

featuring

Identity Conditions

Allen H. Renear
Center for Informatics Research in Science and Scholarship

Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign

slides: Allen Renear
Last change: July 1, 2008

2

For more information…

This slide set is based on collaborative work presented in full in these publications…

• Sperberg-McQueen, Michael, Claus Huitfeldt, and Allen H. Renear (2000). “Meaning and Interpretation in Markup” Markup Languages: Theory and Practice, 2:3, 215-234.

• Renear, Allen H., David Dubin, C. M. Sperberg-McQueen, and Claus Huitfeldt (2002). “Towards a Semantics for XML Markup” In Proceedings of the 2002 ACM Symposium on Document Engineering, (pp. 119-126), New York: Association for Computing Machinery.

• Renear, Allen H., David Dubin, C. M. Sperberg-McQueen, and Claus Huitfeldt (2003). “XML Semantics and Digital Libraries.” In Catherine C. Marshall, Geneva Henry, and Lois Delcambre (Eds.), Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, (pp. 303-305), Houston, May. New York: Association for Computing Machinery.

• Renear, Allen and David Dubin. (2003). “Towards Identity Conditions for Digital Documents”. In DC-2003: Proceedings of International DCMI Conference and Workshop (pp. 181-189). DCMI.

See also:

Lynch, C. (1999). “Canonicalization: a Fundamental Tool to Facilitate Preservation and Management of Digital Information”. D-Lib Magazine.

3

Identity Conditions

• Underlying data curation issues such as integrity, authenticity, etc. is the problem of identity conditions.

• By identity conditions we mean: a method for determining whether an object x and an object y are the same object.

• Is this the same thing as that? How do you tell?

• Identity conditions vary with the kind of object in question…
  • Did you read the same novel I read?
  • Did you read the same text I read?
  • Did you read the same edition I read?
  • Did you read the same copy I read?

4

FRBR: Functional Requirements for Bibliographic Records

… “a conceptual model of the bibliographic universe” [IFLA]

—an ER model of works, texts, editions, authors, subjects … etc.

—for designing systems for managing bibliographic records.

… influential in cataloguing and technology in libraries and elsewhere

-- bibliographic databases (including OCLC) are being “FRBRized”

-- software systems (including Endeavor) are being updated,

-- the new revision of the Anglo-American Cataloging Rules (AACR3, now “RDA”) describes FRBR as its “foundation”.

… and increasingly influential in ontologies beyond the cataloguing community

“When a scientist asks ‘what data do you have?’ …”

—Joe Hourcle (NASA/SDAC), “FRBR in a scientific context” 2007

5

The FRBR “Group 1” Entity Types

There are four Group 1 entity types

Work: “a distinct intellectual or artistic creation”

Expression: “the intellectual or artistic realization of a work in the form of alphanumeric, musical, or choreographic notation, sound, image … etc.”

Manifestation: “the physical embodiment of an expression of a work”.

Item: “a single exemplar of a manifestation”

Or, colloquially (for book-like objects): work, text, edition, copy

Each entity type is assigned a distinctive set of attributes…

-- works have attributes such as subject and genre

-- expressions have attributes such as language

-- manifestations have attributes such as typeface

-- items have attributes such as condition and location.

6

Relationships Between the Entities

7

Data? / Documents?

8

Darwin Core record

<?xml version="1.0" encoding="UTF-8"?>
<response>
<record>
  <darwin:DateLastModified>2003-06-08</darwin:DateLastModified>
  <darwin:InstitutionCode>DGH</darwin:InstitutionCode>
  <darwin:CollectionCode>DGH Lepidoptera</darwin:CollectionCode>
  <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
  <darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName>
  <darwin:BasisOfRecord>O</darwin:BasisOfRecord>
  <darwin:Kingdom>Animalia</darwin:Kingdom>
  <darwin:Order>Lepidoptera</darwin:Order>
  <darwin:Family>Gelechiidae</darwin:Family>
  <darwin:Genus>Dichomeris</darwin:Genus>
  <darwin:Species>marginella</darwin:Species>
  <darwin:ScientificNameAuthor>(Fabricius, 1781)</darwin:ScientificNameAuthor>
  <darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy>
  <darwin:Collector>Donald Hobern</darwin:Collector>
  <darwin:YearCollected>2003</darwin:YearCollected>
  <darwin:MonthCollected>06</darwin:MonthCollected>
  <darwin:DayCollected>08</darwin:DayCollected>
  <darwin:ContinentOcean>Europe</darwin:ContinentOcean>
  <darwin:Country>Denmark</darwin:Country>
  <darwin:County>Københavns Amt</darwin:County>
  <darwin:Locality>Merianvej, Hellerup</darwin:Locality>
  <darwin:Longitude>12.538</darwin:Longitude>
  <darwin:Latitude>55.737</darwin:Latitude>
  <darwin:CoordinatePrecision>100</darwin:CoordinatePrecision>
  <darwin:IndividualCount>1</darwin:IndividualCount>
  <darwin:Notes>1 in Skinner trap</darwin:Notes>
</record>
</response>

From D. Hobern, Architecture and Tools for Biodiversity Data Exchange 2004; www2.gbif.org/ss5hobern.pdf

9

cluML: DNA microarray clustering results

<dataset name="leukemia" xmlns:ds="urn:dataset">
  <objects type="microarray-samples">
    <object name="sample_31">
      <feature name="U22376" value="408" />
      <feature name="X59417" value="1784" />
  . . .
  <partitioning name="hierarchical clustering">
    <object-clusters>
      <cluster name="0">
        <object name="sample_1" />
        <object name="sample_4" />
  . . .
  <cluster-parameters>
    <parameter name="Metrics" value="Euclidean" />
    <parameter name="Intercluster metrics" value="Complete linkage" />
    <parameter name="Transformation" value="No transformation" />
    <parameter name="N" value="2" />
  . . .
  <validation method="Dunn index">
    <validation-parameters>
      <parameter name="Metrics" value="Euclidean" />
      <parameter name="Intercluster metrics" value="Complete linkage" />
      <parameter name="Intracluster metrics" value="Complete diameter" />
  . . .
  <validation-results>
    <result name="Dunn's Index" value="1.227" />
  …

From http://machaon.karanagai.com/cluML.html
See: N Bolshakova, P Cunningham (2005). “cluML: A Markup Language for Clustering and Cluster Validity Assessment of Microarray Data”. Applied Bioinformatics.

10

NeuroML K+ Channel Description

<neuroml
    class="DBChannel"
    description="Hodgkin-Huxley squid K channel"
    author="Dave Beeman"
    keywords="Hodgkin-Huxley potassium squid delayed rectifier"
    [email protected]="An implemention of the GENESIS K_squid_hh channel"
    Erest="-0.07V">
  <channels>
    <channel name="K_squid_hh" class="HHChannel" permeantSpecie="K" Erev="0.09V" Gmax="360.0S/m^2" ivlaw="ohmic">
      <gates>
        <gate name="X" class="HHVGate" timeUnit="sec" voltageUnit="V" vmin="-0.1" vmax="0.05" instantCalculation="false" useState="false" power="4">
          <forwardRate class="ParameterizedHHRate" A="-600.0" B="-10000.0" C="-1.0" D="1.0" E="0.060" F="-0.01"/>
          <backwardRate class="ParameterizedHHRate" A="125.0" B="0.0" C="0.0" D="1.0" E="0.07" F="-0.08"/>
        </gate>
      </gates>
      <log author="Dave Beeman" date="Jul 9, 2002 11:11:15 PM" literatureReference="A.L. Hodgkin and A.F. Huxley, J. Physiol. (Lond) 117, pp 500-544 (1952)">
      </log>
    </channel>
  </channels>
</neuroml>

From S Crook, D Beeman, P Gleeson, F Howell (2005). "XML for model specification in neuroscience". Brains, Minds and Media

11

Identity Problems

12

Preservation

• Is a document being offered already in the repository?

• Is a document retrieved the same document that was deposited?

13

Conversion (e.g., format migration)

• Imagine the format conversion of data or documents

  • UTF-8 to UTF-16
  • EBCDIC to ASCII
  • TEI P4 to TEI P5
  • CML 1.0 to CML 2.1.1
  • mzData to mzML

• In most cases we’d probably say the data, or document, is the same — only in a different format, encoding, etc.

• In a conversion something changes; in a successful conversion something remains the same.

14

and so on…

• Integrity assurance
  • Has this document been corrupted or tampered with?

• Cataloguing
  • Do we already have a record/identifier for this document?
  • Exactly what here are we identifying/cataloguing?

• Copyright

• &c.

15

Positive Curation

• Does document A refer to any of the same things that document B refers to?

• If so, do A’s assertions reiterate, supplement, or contradict B’s?

[Identity is as fundamental to positive curation as to preservation.]

16

The Problem: Finding a method

• How do we determine if x and y are the same document?

17

Inadequate Solution Strategies

Some common approaches to determining document identity.

1) bit stream identity is document identity

2) character string identity is document identity

3) normalized serialization identity is document identity

Each is more accurate than the preceding.

But each still fails to be a sound method.

However, the direction of the progression will point to the solution.

18

1) Bit Stream Strategies

• The strategy: treat the bit sequence as the document — different sequence, different document.

• This approach would yield satisfactory results if the bitstream-to-document relation were one-to-one. However….

• A problem: changes to the bitstream don’t always affect what document is being carried by the (new) bitstream.

• character encoding (e.g., ASCII vs. EBCDIC), and other transcoding for compression, encryption, transport optimization and so on.

• Such “document-preserving” changes are routine, and sometimes silent.

• So strategies that rely on bitstream surrogates under-report identity.
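The under-reporting can be illustrated with a minimal sketch. It assumes SHA-256 digests stand in for bitstream comparison; the sample sentence and the `digest` helper are illustrative choices, not part of the original slides.

```python
import hashlib

# Illustrative sample text (an assumption for this sketch).
text = "Die Welt ist alles, was der Fall ist"

# Two different bitstreams carrying the same character string.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bitstream (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Bitstream comparison reports two different objects...
same_bits = digest(utf8_bytes) == digest(utf16_bytes)

# ...although decoding each bitstream recovers the same characters.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16")

print(same_bits, same_text)
```

A checksum is still useful for detecting accidental corruption of one stored bitstream; the point here is only that it cannot serve as a test of document identity across encodings.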

19

2) Character String Strategies

• The strategy: conceptualize the document as a character string
  • “abc” retains its identity regardless of how it is encoded.
  • Immediately improves empirical results and brings practical advantages.

• The problem: serialization artifacts.
  • e.g. attribute order, declaration order, nonsignificant whitespace, redundancies from namespace and attribute defaulting;
  • Alternative reserializations of a document occur routinely and are obviously document-preserving.
  -- And sometimes silently (e.g., white space, attribute order).

• So character string strategies also under-report identity.
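A small sketch of a serialization artifact, assuming Python's standard `xml.etree.ElementTree` parser; the two sample strings and the `structure` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in attribute order, quoting style,
# and whitespace inside the start tag (illustrative samples).
a = '<p lang="de" id="p1">Die Welt</p>'
b = "<p  id='p1'   lang='de'>Die Welt</p>"


def structure(xml_text):
    """Return the parsed tag, attributes, and text (hypothetical helper)."""
    element = ET.fromstring(xml_text)
    return (element.tag, dict(element.attrib), element.text)


# Character-string comparison says these are different...
same_string = a == b

# ...but a parser recovers the same element from both.
same_structure = structure(a) == structure(b)

print(same_string, same_structure)
```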

20

3) Normalized Serialization Strategies

• The strategy: conceptualize the document as a “canonical” serialization.
  • Again, improved results, tolerating even more document-preserving changes
  -- and adequate for some purposes.

• A problem: XML markup vocabularies can have alternative constructs that “mean the same thing” (often by definition).
  • e.g. <div type="p"> vs. <p class="div">
  • Or the TEI’s various alternatives for encoding “overlapping” elements, each of which creates a different normalized serialization

• So again: identity is under-reported.
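The following sketch shows both halves of this point, using `xml.etree.ElementTree.canonicalize` (C14N 2.0, available in Python 3.8 and later). It assumes a hypothetical vocabulary in which the two constructs from the slide are defined to mean the same thing; the sample strings are illustrative.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

# Serialization artifacts: attribute order, quoting, extra whitespace.
a = '<p lang="de" id="p1">Die Welt</p>'
b = "<p  id='p1'   lang='de'>Die Welt</p>"

# Canonicalization removes these artifacts, so the two now compare equal.
artifacts_removed = canonicalize(a) == canonicalize(b)

# Alternative constructs assumed (for this sketch) to be synonymous
# in some hypothetical vocabulary.
c = '<div type="p">Die Welt</div>'
d = '<p class="div">Die Welt</p>'

# Canonicalization knows nothing about vocabulary semantics,
# so the canonical forms still differ.
synonyms_equal = canonicalize(c) == canonicalize(d)

print(artifacts_removed, synonyms_equal)
```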

21

Summary of failed strategies

1) bit stream identity is document identity
   No, different bit streams can carry the same document

2) character string identity is document identity
   No, different character strings can carry the same document

3) normalized serialization identity is document identity
   No, different canonical serializations can carry the same document

22

Enough ad hockery! A fresh start…

Document identification strategies require:

1. a method to identify the document carried by structures at lower levels of abstraction,

and

2. a method to compare the documents identified, without using the lower level abstractions as surrogates.

But how …?

23

Wasn’t this the problem XML was supposed to solve?

We said we need “a method to identify the document carried by varying structures at lower levels of abstraction…”

But isn’t this what XML has always promised to do?!

• [… identify the abstract common meaning implicit in the varying equivalent encodings … securing interoperability (and other good things).]

And isn’t this why we use XML vocabularies for scientific data?

24

XML doesn’t get us all the way there…

XML representations define structure, but the structure defined (and serialized) is not a conceptual document, or data, but a data structure

• specifically an acyclic directed graph with ordered branches and nodes labeled with element names, further decorated with attribute/value assignments —at the same level of abstraction as the W3C DOM.

So the data structure expressed by an XML document is not the conceptual document (or the data) itself. This is why even canonicalization doesn’t work.

• The data structure consists of things like nodes, labels (strings), attribute/value assignments (more strings), untyped parent/child relationships, etc.

• But a textual document consists of things like titles, chapters, paragraphs, relationships like part_of and title_of, and properties like is_in_German

• And a data document consists of things like (perhaps) predicates, names, functions, assertions, etc.

• Neither textual documents, nor data documents, consist of nodes labeled “p”; documents don’t have nodes labeled with the string “p”, or “voltage”.

25

The last step up the abstraction ladder

• The final step up the ladder of abstraction we’ve been climbing is from the serialized data structure to the conceptual document that data structure represents.

• If we can make this final upward advance in abstraction we will be where we need to be in order to specify document identity conditions.

• The bridge would be a formal system explicitly mapping XML markup to the conceptual structures it signals.

• That is, a system for representing the meaning of XML markup.

• That is what is needed. And that is what is missing.

Summarizing:

• XML languages serialize a data structure; by themselves, without additional semantics, they cannot make assertions.

• XML schemas provide a mechanism for specifying the syntax of an XML language, but there is no commonly used formal mechanism for specifying the semantics of that vocabulary.

• This is why we don’t have robust identity conditions for documents.

• How do we remedy this?

26

A closer look…

<article>
<title>Intension <em purpose="foreign">Redux</em></title>
<author>Billy Bob Otebranne</author>
<affiliation>Decatur University</affiliation>
<p>Wittgenstein wrote: <quot lang="de"><em purpose="foreign">Die Welt ist alles, was der Fall ist</em></quot><foot>The World is everything that is the case.</foot> It is hard to escape, at first reading, the suspicion that Wittgenstein is guilty here of a gross platitude: it is only after reading the rest of the <em purpose="title" lang="la">Tractatus</em> that on returning to its famous first sentence one appreciates the depths of its intensions.</p>

...

27

So how, exactly, does markup work? By licensing inferences…

The meaning of markup is the inferences it licenses.

<p lang="de">...

licenses the inferences

“This element is a paragraph”; “This element is in German”

Similarly[*]:

<darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber> …
<darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy> …
<darwin:Longitude>12.538</darwin:Longitude>
<darwin:Latitude>55.737</darwin:Latitude>
<darwin:CoordinatePrecision>100</darwin:CoordinatePrecision> …

licenses these inferences

“Specimen DGHEUR_0002976 was identified by Donald Hobern”
“Specimen DGHEUR_0002976 was found along latitude 55.737”
“The precision of the locality specification (radius around latitude/longitude coordinates) in this record is 100”

• We say licenses inferences rather than makes assertions because...
  • XML document markup languages are not languages of assertion.
  • They are languages for serializing data structures.
  • Humans must infer the right assertions from markup.

So to represent the meaning of markup we need to convert the document instance into the assertions it licenses.
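As a rough sketch of that conversion, the snippet below turns a fragment of the Darwin Core record into subject/predicate/value triples. Everything outside the element names and values is an assumption made for illustration: the namespace URI, the `RULES` table, the predicate names, and the decision to treat the catalog number as the specimen identifier.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:darwin"  # hypothetical namespace URI, not the real one

record_xml = f"""
<record xmlns:darwin="{NS}">
  <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
  <darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy>
  <darwin:Latitude>55.737</darwin:Latitude>
</record>
"""

# Hypothetical semantic rules: element name -> predicate it licenses.
RULES = {
    "IdentifiedBy": "identified_by",
    "Latitude": "found_at_latitude",
}


def licensed_assertions(xml_text):
    """Return the set of (specimen, predicate, value) triples licensed."""
    record = ET.fromstring(xml_text)
    # Assumption: the catalog number names the specimen the record is about.
    specimen = record.findtext(f"{{{NS}}}CatalogNumber")
    assertions = set()
    for child in record:
        local_name = child.tag.split("}")[-1]  # strip the "{namespace}" part
        if local_name in RULES:
            assertions.add((specimen, RULES[local_name], child.text))
    return assertions


assertions = licensed_assertions(record_xml)
for triple in sorted(assertions):
    print(triple)
```

The rules table is where a human reading of the vocabulary has been written down; nothing in the XML itself supplies it.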

28

Expressing the meaning of markup

And that doesn't seem too hard to do.

Doesn’t <p lang="en">… go pretty easily into predicate logic, $\mathrm{Paragraph}(a) \land \mathrm{English}(a)$ …

Or some other knowledge representation language, like OWL?

No, not so easily.

29

Propagation/Inheritance

<p>Wittgenstein wrote: <quot lang="de"><em purpose="foreign">Die Welt ist alles, was der Fall ist</em></quot>

• Some properties expressed by markup are understood to be propagated, according to rules, to child elements and content.

• But… XML DTDs provide no notation for specifying which properties are propagated or what the rules for propagation are.

Examples of propagation rules

• The property of being a paragraph is not propagated at all (elements within paragraphs aren't themselves necessarily paragraphs).

• Being-in-German is propagated … until defeated (rule: closest ancestor)

• Being-in-Helvetica will be defeated by a subsequent rendition assignment of Times, but not by a subsequent rendition assignment of Bold.

Language designers, content developers, and software engineers all depend upon a common understanding of such rules.
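A minimal sketch of the closest-ancestor rule for language, assuming the rule is supplied by hand because the DTD cannot state it. The `lang="en"` on the paragraph, the path labels, and the `languages` helper are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample based on the slide's example; lang="en" on <p> is an assumption.
doc = ET.fromstring(
    '<p lang="en">Wittgenstein wrote: '
    '<quot lang="de"><em>Die Welt ist alles, was der Fall ist</em></quot>'
    "<em>Tractatus</em></p>"
)


def languages(element, inherited=None, path=""):
    """Yield (path, language) pairs using the closest-ancestor rule."""
    # An element's own lang value defeats whatever it would inherit.
    lang = element.get("lang", inherited)
    here = f"{path}/{element.tag}"
    yield here, lang
    for child in element:
        yield from languages(child, lang, here)


result = dict(languages(doc))
for element_path, lang in result.items():
    print(element_path, lang)
```

Note that only the language is passed down; being a paragraph is deliberately not propagated, which matches the first example rule above.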

30

Ontological variation in reference:

• <word pos="PN" nationality="FR" condition="illegible" state="copyedited"> Voltaire </word>

• Markup might appear to indicate that the same thing…
  • is-a-proper-noun
  • is-a-French-citizen
  • is-illegible
  • has-been-copyedited

• But obviously no one thing has all of these properties

31

Arity and Deixis

Some properties expressed by markup are monadic (is_a_paragraph), some polyadic (is_the_affiliation_of); this distinction cannot be expressed.

• a title that is the immediate first child of a section is probably the title of a section, and probably the title of that section

• But how do we know it is the title of something?

• And how do we know that it is the title of the section it is the title of?

32

Parent/Child Overloading

• The parent/child relationships of the XML graph data structure support a variety of implicit substantive relationships.

• A chapter might have title, sentence, footnote, annotation, and page break, as child elements,

• But in each case the parent/child relation represents a different substantive relationship.

-- A sentence is part of a chapter
-- A title is the title of a chapter.
-- An annotation annotates a chapter
-- A page break is within some rendition of a chapter
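One way to picture the unpacking of the single parent/child arc is a lookup from child element type to the substantive relation it signals. This is only a sketch: the `RELATION_FOR_CHILD` table, the relation names, and the sample chapter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic rules: child element type -> typed relation
# that the untyped parent/child arc stands for.
RELATION_FOR_CHILD = {
    "title": "title_of",
    "sentence": "part_of",
    "annotation": "annotates",
}

chapter = ET.fromstring(
    "<chapter><title>T</title><sentence>S</sentence>"
    "<annotation>A</annotation></chapter>"
)


def typed_relations(parent):
    """Return (relation, child type, parent type) for each known child."""
    return [
        (RELATION_FOR_CHILD[child.tag], child.tag, parent.tag)
        for child in parent
        if child.tag in RELATION_FOR_CHILD
    ]


relations = typed_relations(chapter)
for relation in relations:
    print(relation)
```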

33

Class Relationships and Synonymy

• XML contains no general constructs for expressing class hierarchies among elements, attributes, or attribute values.

• And yet these are frequently intended and assumed
  • (e.g. <chapter>, <section>, <subsection>…)

34

Data documents are no different

<darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber> …
<darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy> …
<darwin:Longitude>12.538</darwin:Longitude>
<darwin:Latitude>55.737</darwin:Latitude>
<darwin:CoordinatePrecision>100</darwin:CoordinatePrecision> …

licenses

“Specimen DGHEUR_0002976 was identified by Donald Hobern”
“Specimen DGHEUR_0002976 was found along latitude 55.737”
“The precision of the locality specification (radius around latitude/longitude coordinates) in this record is 100”

hmm…

35

BECHAMEL

• Hosts
  • Electronic Publishing Research Group, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign
  • Department of Language and Information Technology, Bergen University, Norway

• Researchers
  • Michael Sperberg-McQueen
    World Wide Web Consortium, MIT Laboratory for Computer Science
  • Allen Renear and Dave Dubin
    Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign
  • Claus Huitfeldt
    Language and Information Technology, Bergen University, Norway

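$$\forall x\,[\,\mathrm{titleof}(a,x) \rightarrow \forall y\,[\,\mathrm{titleof}(y,x) \rightarrow (y = x)\,]\,]$$

$$\forall x\,[\,\mathrm{div}(x) \rightarrow \exists y\,[\,\mathrm{para}(y) \land \mathrm{partof}(y,x)\,]\,]$$

$$\forall x\,[\,\mathrm{div}(x) \rightarrow \forall y\,[\,\mathrm{titleof}(y,x) \rightarrow (y = x)\,]\,]$$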

36

The Big Picture

[Figure: architecture diagram. Labels: Document Instances; Syntax Schema (e.g. DTD, XSD); Parser; Parsed Documents; Semantics Schema (Bechamel); Complete Schema for markup vocabulary; BECHAMEL [e.g.] processing; Ontology instance (e.g. RDF ground triples); Domain ontology (e.g., RDFS/OWL); Other axioms; Other facts; Inferencing/Curation, etc.]

37

A Bechamel approach to identity questions

• We process an XML document along with the semantic rules for its vocabulary, producing assertions in predicate logic.

• These are the assertions “licensed” by the serialized XML representation; they are the meaning of that representation; they represent the conceptual document.

• At this level of abstraction not only are all serialization artifacts gone, but so is the data structure itself:

• instead of a tree with untyped arcs and labeled nodes decorated with attribute/value pairs, we now have objects such as paragraphs, with properties such as being in German. The parent-child relationship has been unpacked into various n-place typed relations and axioms that govern propagation and class relationships.

• Document identity is determined by solving for logical equivalence.
  • cross-walking, application of partial or full equivalences in an interlingua, or other heterogeneity management may be applied.
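The idea can be sketched in a few lines, with heavy simplification: equality of two sets of ground facts stands in for logical equivalence, and the text content stands in for the object the facts are about. The `is_paragraph` rule, the predicate names, and the samples are assumptions for illustration, not the Bechamel system itself.

```python
import xml.etree.ElementTree as ET


def is_paragraph(element):
    """Hypothetical semantic rule: both constructs signal a paragraph."""
    return element.tag == "p" or (
        element.tag == "div" and element.get("type") == "p"
    )


def assertions(xml_text):
    """Return the ground facts licensed by a one-element document."""
    element = ET.fromstring(xml_text)
    facts = set()
    if is_paragraph(element):
        facts.add(("paragraph", element.text))
    if element.get("lang") == "de":
        facts.add(("in_german", element.text))
    return facts


# Two serializations whose canonical forms differ.
x = '<div type="p" lang="de">Die Welt ist alles, was der Fall ist</div>'
y = '<p lang="de">Die Welt ist alles, was der Fall ist</p>'

# Set equality of ground facts is a crude stand-in for logical equivalence.
same_document = assertions(x) == assertions(y)
print(same_document)
```

A real comparison would also need the axioms for propagation and class relationships, and a reasoner rather than set equality.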

38

Summary: Levels of interoperability

• Serialization (XML)
  • XML is a language for serializing data structures.
  • But it does not standardize how to use these data structures to make assertions.

• Syntactic (RDF, OWL)
  • RDF and OWL standardize how to express assertions.
  • But they do not define the properties and values which are the components of these assertions.

• Semantic (Domain ontologies)
  • Domain ontologies are agreements about the properties and values which are the components of assertions.

To address data curation problems, such as document identity, there must be formalization at all three levels.

39

Questions…?

40

oops: the part where you take it back

• This is not yet a complete coherent picture

41

I knew you’d ask about that!

• document vs. meaning of the document

• forensics, provenance, authentication