Page 1: CLARIN Metadata & ISO DCR Daan Broeder. Max-Planck Institute for Psycholinguistics TKE ES05 Workshop, August 14th Dublin

CLARIN Metadata & ISO DCR

Daan Broeder.

Max-Planck Institute for Psycholinguistics

TKE ES05 Workshop, August 14th, Dublin

Page 2

CLARIN Project

The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable for Language & SSH (Social Sciences & Humanities) researchers.

There is a CLARIN EU project and there are different national CLARIN projects.

Since 2007, CLARIN EU WP2 has investigated and created (prototypical) solutions for:
- Common AAI infrastructure
- Single system of persistent identifiers (PIDs) for resources
- Common metadata domain
- …

Page 3

Current Metadata Situation

Fragmented landscape

Metadata sets, schemas & infrastructures in our domain: IMDI, OLAC/DCMI, TEI

Problems with current solutions:
- Inflexible: too many (IMDI) or too few (OLAC) metadata elements
- Limited interoperability (both semantic and functional)
- Problematic (unfamiliar) terminology for some sub-communities
- Limited support for LT tool & service descriptions

Page 4

Metadata Components

CLARIN chose a component approach: CMDI
- NOT a single new metadata schema
- but rather allow coexistence of many (community/researcher) defined schemas with explicit semantics for interoperability

How does this work?
- Components are bundles of related metadata elements that describe an aspect of the resource
- A complete description of a resource may require several components
- Components may contain other components
- Components should be designed for reusability
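The component idea above can be sketched in a few lines of Python. This is only an illustration, not the CMDI data model: the `Component` class, its fields and the `SpeechRecording` profile name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """Hypothetical stand-in for a metadata component (not the CMDI schema)."""
    name: str
    elements: List[str] = field(default_factory=list)          # metadata elements
    children: List["Component"] = field(default_factory=list)  # nested components

    def all_elements(self, prefix: str = "") -> List[str]:
        """Return dotted paths of every element, including nested components."""
        path = f"{prefix}{self.name}"
        found = [f"{path}.{e}" for e in self.elements]
        for child in self.children:
            found.extend(child.all_elements(prefix=f"{path}."))
        return found


# Reusable components, each describing one aspect of a resource.
language = Component("Language", ["Name", "Id"])
actor = Component("Actor", ["Name", "Age", "Sex"], children=[language])
technical = Component("TechnicalMetadata", ["Format", "Size"])

# A description of a speech recording bundles several components;
# Language is reused both at top level and inside Actor.
recording = Component("SpeechRecording", children=[technical, language, actor])
paths = recording.all_elements()
```

Here the same `Language` component appears twice, once directly and once nested inside `Actor`, which is the kind of reuse the approach aims for.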

Page 5

Metadata Components

Let's describe a speech recording

TechnicalMetadata: Sample frequency, Format, Size, …

Page 6

Metadata Components

Let's describe a speech recording

Language: Name, Id
TechnicalMetadata

Page 7

Metadata Components

Let's describe a speech recording

Language
TechnicalMetadata
Actor: Sex, Language, Age, Name

Page 8

Metadata Components

Let's describe a speech recording

Language
TechnicalMetadata
Actor
Location: Continent, Country, Address

Page 9

Metadata Components

Let's describe a speech recording

Language
TechnicalMetadata
Actor
Location
Project: Name, Contact, …

Page 10

Metadata Components

Let's describe a speech recording

[Diagram: the components Language, TechnicalMetadata, Actor, Location and Project, with the labels "Metadata profile" and "Metadata schema"]

Page 11

Metadata Components

Let's describe a speech recording

[Diagram: the components Language, TechnicalMetadata, Actor, Location and Project, with the labels "Metadata profile", "Metadata schema" and "Metadata description"]

Page 12

Metadata Components

Let's describe a speech recording

[Diagram: the components Language, TechnicalMetadata, Actor, Location and Project, with the labels "Component definition XML", "Profile definition XML", "Metadata profile", "Metadata schema", "W3C XML Schema", "Metadata description" and "XML File"]

Page 13

CMDI Component Reuse

Selecting metadata components from the registry

Component registry:
- Location: Country, Coordinates
- Actor: BirthDate, MotherTongue
- Text: Language, Title
- Recording: CreationDate, Type
- Dance: Name, Type

The user selects appropriate components to create a new metadata profile, or selects an existing profile.

Page 14

CMDI Explicit Semantics

Selecting metadata components from the registry

Component registry:
- Location: Country, Coordinates
- Actor: BirthDate, MotherTongue
- Text: Language, Title
- Recording: CreationDate, Type
- Dance: Name, Type

ISOcat concept registry: BirthDate dcr:1000, Country dcr:1001, Language dcr:1002

DCMI concept registry: Title dc:title

Semantic interoperability is partly solved via references to the ISO DCR or another registry.

The user selects appropriate components to create a new metadata profile, or selects an existing profile.
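A minimal sketch of how such concept references could make differently named elements comparable. The identifiers are the ones shown on the slide; the element paths, the `Speaker.DateOfBirth` element and the `same_concept` helper are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical concept links: element path -> concept identifier.
# The dcr:/dc: identifiers appear on the slide; the element paths and the
# second, community-specific element are made up for illustration.
CONCEPT_LINKS: Dict[str, str] = {
    "Location.Country": "dcr:1001",
    "Text.Language": "dcr:1002",
    "Actor.BirthDate": "dcr:1000",
    "Text.Title": "dc:title",
    # An element from another profile that uses a different name.
    "Speaker.DateOfBirth": "dcr:1000",
}


def same_concept(element_a: str, element_b: str) -> bool:
    """True if both elements are linked to the same registered concept."""
    concept_a: Optional[str] = CONCEPT_LINKS.get(element_a)
    concept_b: Optional[str] = CONCEPT_LINKS.get(element_b)
    return concept_a is not None and concept_a == concept_b
```

Two profiles can then name an element differently and still be searched together, as long as both point to the same concept.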

Page 15

Relation Registry

Component registry:
- Recording: CreationDate, Type
- Dance: Name, Type

Text 1: Language, Title, Genre1
Text 2: Language, Title, Genre2

ISOcat: Genre1 dcr:1020, Language dcr:1002, Genre2 dcr:1030

User MD search: the user selects or creates a profile that specifies relations between DCs:
- dcr:1020 = dcr:1030
- dcr:1020 ~ dcr:1030
- dcr:1020 > dcr:1030

Metadata modelers or terminology experts can also use the Relation Registry (RR) to specify relations that the ISO DCR can't store.
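One way to picture how a search could use such relations is sketched below. The slide does not define the three symbols, so their reading here is an assumption: "=" as "same as", "~" as "similar to", ">" as "broader than". The `expand` function and the relation record are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical relation records: (concept, relation, concept).
# Assumed reading of the symbols (not defined on the slide):
# "=" same as, "~" similar to, ">" broader than.
Relation = Tuple[str, str, str]

RELATIONS: List[Relation] = [
    ("dcr:1020", "~", "dcr:1030"),
]


def expand(concept: str, relations: List[Relation], accepted: Set[str]) -> Set[str]:
    """Return the concept plus every concept reachable via accepted relation types.

    "=" and "~" are followed in both directions; ">" only from left to right.
    """
    result = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for left, rel, right in relations:
            if rel not in accepted:
                continue
            neighbours = []
            if left == current:
                neighbours.append(right)
            if right == current and rel in ("=", "~"):
                neighbours.append(left)
            for n in neighbours:
                if n not in result:
                    result.add(n)
                    frontier.append(n)
    return result


# A strict search accepts only "same as"; a relaxed one also accepts "similar to".
strict = expand("dcr:1030", RELATIONS, accepted={"="})
relaxed = expand("dcr:1030", RELATIONS, accepted={"=", "~"})
```

The point of the sketch is that the user, not the registry, decides which relation types count as a match for a given search.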

Page 16

CMDI Metadata Life-cycle

[Diagram components: CLARIN Component Registry/Editor; ISOcat Concept Registry; DCMI Concept Registry; other Concept Registry; Relation Registry; Metadata Repository (two instances); Joint Metadata Repository; Semantic Mapping; Search Service]

- Create a metadata schema from a selection of existing components. Allow creation of new components if they have references to ISOcat.
- A metadata component profile is selected from the metadata component registry.
- Metadata descriptions are created.
- Metadata is harvested via the OAI-PMH protocol.
- Perform search/browsing on the metadata catalog using the ISO DCR, other concept registries and the CLARIN relation registry.
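For the harvesting step, a small sketch of how an OAI-PMH `ListRecords` request URL is put together. `verb`, `metadataPrefix` and `resumptionToken` are standard OAI-PMH arguments; the endpoint URL and the `cmdi` prefix value are assumptions for illustration, and no request is actually sent.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and metadata prefix, for illustration only.
first_page = list_records_url("https://repository.example.org/oai", "cmdi")
next_page = list_records_url("https://repository.example.org/oai", "cmdi",
                             resumption_token="abc")
```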

Page 17

CMDI: Browsing the Component Registry

Page 18

CMDI: Editing a Component

Page 19

MD Components & Semantic Granularity

Problems with component metadata: too high granularity in ISOcat
- Actor.Name, Actor.Fullname, Actor.Address, Actor.email, …
- Creator.Name, …, Creator.email, …
- Funder.Name, …, Funder.email

Having a DC for every one of these MD elements would explode ISOcat. Using just a generic "Name" loses precision.

1. Compromise: use fine granularity only for elements that are expected to be used often for searching in metadata (CreatorName, ActorName). Map the rest to generic "Name".

2. More fundamental solution: use container concepts. Create an "Actor" DC; then we can reason with the context:
   Actor ~ Participant, Name ~ Fullname
   -> Actor.Fullname ~ Participant.name
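The container-concept reasoning can be sketched as a part-by-part comparison of element paths. The two relations come from the slide's example; the helper functions and the case-insensitive matching are assumptions made for this sketch.

```python
from typing import FrozenSet, Set

# "Similar to" relations between single concepts, from the slide's example
# (Actor ~ Participant, Name ~ Fullname). The storage format is hypothetical.
SIMILAR: Set[FrozenSet[str]] = {
    frozenset({"Actor", "Participant"}),
    frozenset({"Name", "Fullname"}),
}


def similar(a: str, b: str) -> bool:
    """Two concepts match if identical (ignoring case) or declared similar."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    return any({a, b} == {c.lower() for c in pair} for pair in SIMILAR)


def similar_paths(path_a: str, path_b: str) -> bool:
    """Compare 'Container.Element' paths part by part."""
    parts_a, parts_b = path_a.split("."), path_b.split(".")
    return len(parts_a) == len(parts_b) and all(
        similar(x, y) for x, y in zip(parts_a, parts_b)
    )
```

With only two relations stored, `similar_paths("Actor.Fullname", "Participant.name")` holds, while `Actor.Fullname` and `Funder.email` stay distinct; no separate DC per combination is needed.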

Page 20

Metadata Thematic Domain

- DCs to describe Language Resources & Technology
- Chair: Peter Wittenburg, MPI for Psycholinguistics
- Started entering data in 2009, based on two expert meetings in Athens and guided by:
  - existing metadata sets: IMDI, OLAC/DC
  - inventory: ENABLER
  - resulted in: 218 DCs
- Translation work was initiated via the CLARIN national coordinators (15 language sections for "audio file format")
- Dutch CLARIN metadata project 2010 added: 76 new DCs (of which 30 still private)

Page 21

Some experiences I

- The GUI is not too fast
- Need a discussion platform to discuss a DC's attributes (now solved with the forum function)
- UI arrangement: for instance, the value domain attributes (type, data type, value domain) are not in one panel
- Metadata terms often needed to be linked to DCs that are either too broad or too narrow (situation did not merit a new DC)
- Search for existing DCs is only effective if you know the terminology

Page 22

Some experiences II

- Duplicate entries (e.g. "source"). Entry was made before check was in place.
- Illogical or unsystematic definitions:
  - DC-2512: The name of the person who was participating in the creation project.
  - DC-2454: The name of the person that can be contacted to get access to the resource or to the tool/service.
  - DC-2505: The address of an organization that was/is involved in creating, managing and accessing resource or tool/service.
  - DC-2521: The email address of a person or an organization that is involved in creating, managing or accessing resources or tools/services.
  - DC-2459: The organization that was leading the creation project or that is responsible for accessing the resource and the contact person is affiliated with.
  - DC-2461: The telephone number of a person or an organization that is involved in creating, managing or accessing the resource.

Page 23

Next Steps for Metadata TD

- Expected standardization: October 2010
- Before that we will reexamine all DCs
- Build jury -> vote
- DCR board -> vote

What happens then?
- The DCs will get new PIDs
- But we have metadata records where the PIDs of "old" DCs are used
  - Curate: update the metadata records
  - Redirect (if owner agrees) to standardized version
  - Make use of Relation Registry: old_DC == new_DC ?
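The "curate" and "old_DC == new_DC" options can be illustrated with a small lookup. All identifiers below are invented for the example, and the mapping format is an assumption, not how the Relation Registry stores anything.

```python
from typing import Dict

# Hypothetical "old_DC == new_DC" links; the identifiers are invented.
SAME_AS: Dict[str, str] = {
    "DC-old-1": "DC-new-1",
    "DC-old-2": "DC-new-2",
}


def resolve(pid: str, same_as: Dict[str, str]) -> str:
    """Follow old -> new links until a PID with no successor is reached."""
    seen = set()
    while pid in same_as and pid not in seen:
        seen.add(pid)  # guard against accidental cycles
        pid = same_as[pid]
    return pid


def curate(record: Dict[str, str], same_as: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of a record (element -> DC PID) with old PIDs replaced."""
    return {element: resolve(pid, same_as) for element, pid in record.items()}


old_record = {"Actor.Name": "DC-old-1", "Text.Title": "DC-unchanged"}
new_record = curate(old_record, SAME_AS)
```

Curating rewrites the stored records once; using the relation at search time instead would leave the records untouched and call `resolve` on the fly.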

Page 24

Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230

Page 25

WS05: Standardizing Data Categories in ISOcat: Implementing Group Work for Thematic Domains

Page 26

CMDI Architecture I

Division into:
- MD producer components
- MD exploitation or consumer components
- OAI-PMH components
- Knowledge components: DCR, Relation Registry

The CMDI takes an archivist or "production first" viewpoint
- Prioritize that the metadata can be of good quality: consistent, coherent, correctly linked to the concept registries
- The consumer side can be more "experimental" and diverse
- Many MD exploitation "stacks" or consumers can work in parallel on the same metadata

Page 27

Concept registries

Basically a list of concepts and their descriptions, where every concept has a unique identifier.

Some have a complicated structure and are associated with elaborate (administrative) processes to determine the status and acceptance of concepts in the registry, e.g. ISO-DCR.

Others are static, simple lists of concepts and descriptions, e.g. DCTERMS.

Page 28

ISO DCR

- ISO-DCR is important for more CLARIN objectives than metadata and is under control of the linguistic community (ISO-TC37)
- It is an implementation of the model defined in ISO 12620, offering a GUI and programming APIs
- Every DC is subject to a standardization process and carries information on the status of that process
- Metadata is just one of 13 Thematic Domains in the DCR
- It can contain no relations between the DCs; only a value domain relation is possible

Page 29

CMDI Architecture II

[Architecture diagram. Components: MD Comp. Editor; MD Comp. Registry; ISOcat DCR; MD Editor; Local MD Repository; OAI-PMH Data Provider; OAI-PMH Service Provider; CLARIN Joint MD Repository; MD Services; Semantic Mapping Services; Relation Registry; MD Catalog; Virtual Collection Registry. Actors: user; Metadata modeler; ISO TDG; MD Creator; External agents]

Page 30

Current CMDI status I

- ISO-DCR: 218 metadata concepts
- CMDI component registry: 135 components, 19 profiles
- Produced & inspired by:
  - Deconstructing existing metadata schemas: IMDI, OLAC, TEI
  - Considering requirements of other CLARIN activities like profile matching
- CLARIN NL metadata project tested the CMDI model and delivered components and profiles for the resources in two major Dutch Language Resource centers

Page 31

Current CMDI status II

Operational or test phase:
- ISOcat DCR
- Component registry & editor
- ARBIL metadata editor

Still working on:
- Joint Metadata Repository, Metadata Catalog, Semantic Mapping, Relation Registry

Expect a usable first version in the third quarter of 2010

Page 32

CMDI contributors

Collaboration on the CMDI implementation
- MPI for Psycholinguistics: metadata modeling and editing facilities
- Språkbanken, University of Gothenburg: Joint CLARIN metadata repository
- Austrian Academy: Metadata catalog, metadata & semantic mapping services
- IDS: Virtual Collection Registry
- MPG / CLARIN NL: ISO-DCR
- DFKI: Relation Registry

Page 33

Common metadata domain

Why a common metadata domain:
- Finding and sharing resources housed at all archives & repositories participating in CLARIN
- Specifying distributed heterogeneous collections of LRs and processing these collections
- In general, a common metadata domain helps bring about a single domain of LRs

Page 34

Relation Registry

Component registry:
- Recording: CreationDate, Type
- Dance: Name, Type

Text 1: Language, Title, Genre1
Text 2: Language, Title, Genre2

ISOcat: Genre1 dcr:1020, Language dcr:1002, Genre2 dcr:1030

MD search user: the user selects or creates a profile that specifies relations between DCs:
- dcr:1020 = dcr:1030
- dcr:1020 ~ dcr:1030
- dcr:1020 > dcr:1030

MD modeler

Page 35

Relation Registry

Component registry:
- Recording: CreationDate, Type
- Dance: Name, Type

Text 1: Language, Title, Genre1
Text 2: Language, Title, Genre2

ISOcat: Genre1 dcr:1020, Language dcr:1002, Genre2 dcr:1030