Interoperability aspects in the Virtual Language Observatory

Dieter Van Uytvanck

Max Planck Institute for Psycholinguistics

Dieter.VanUytvanck@mpi.nl

Metadata in Context workshop

2010-09-08

Nijmegen

www.clarin.eu

Overview

• Context sketch
• VLO: ideas, sources, modalities
• Interoperability issues
• Future plans

Context sketch

• Lots of resources somewhere out there:
  • Data collections
    • Corpora
    • Lexica
    • Grammars
    • Multimedia recordings
  • Software
  • Web applications / services
  • Old-school linguistic resources:
    • Books
    • Articles
    • CD-ROMs
• It’s like a jungle, sometimes ...

VLO: the idea

• Researcher: “where do I start?”
• Provide a single entry point giving access to all information
• Because of the large amount of data:
  • Drill-down paradigm (decrease search space gradually)
• Multiple ways of exploring:
  • Full-text search
  • Facet browsing
  • Geographic overlay
• Unified interface, links to the original context
• Available via www.clarin.eu/vlo
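
The drill-down idea can be illustrated with a minimal Python sketch. It is not the VLO implementation: the records, their field names, and the helper functions `drill_down`, `facet_counts`, and `full_text_search` are all hypothetical and only show how selecting facet values narrows the search space step by step.

```python
from collections import Counter

# Hypothetical records; the values are invented for illustration only.
RECORDS = [
    {"title": "Spoken Dutch corpus", "country": "Netherlands",
     "language": "Dutch", "genre": "conversation"},
    {"title": "Sign language recordings", "country": "Netherlands",
     "language": "Dutch Sign Language", "genre": "elicitation"},
    {"title": "Marquesan field notes", "country": "French Polynesia",
     "language": "Marquesan", "genre": "narrative"},
]


def drill_down(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facet):
    """Count the remaining records per value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


def full_text_search(records, query):
    """Case-insensitive substring search over all field values."""
    q = query.lower()
    return [r for r in records
            if any(q in str(value).lower() for value in r.values())]
```

Each call to `drill_down` shrinks the result set, and `facet_counts` on the remaining records gives the numbers a facet browser would show next to each value.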

VLO: the sources

VLO: the sources – LRT inventory

• http://www.clarin.eu/inventory
• Initiated by CLARIN
• Ad-hoc, low-barrier, user-driven inventory of Language Resources and Tools
• Number of records (approx.):
  • Resources: 848
  • Tools: 180
• You can add new entries yourself!

VLO: the sources – OLAC catalogue

• http://catalog.clarin.eu > OLAC data providers
• Metadata as harvested from 40 OLAC providers (among them several CLARIN centres)
• Quality and quantity differ hugely

VLO: the sources – MPI catalogue

• http://corpus1.mpi.nl
• About 130,000 metadata records
• Broad spectrum:
  • Experimental data
  • Spoken Dutch corpus
  • Sign Language corpora
  • Endangered languages documentation
• Archive in principle open for externally created linguistic data collections (e.g. endangered languages, see Donated Corpora)
  • If these collections comply with the technical requirements (archivable formats, metadata, …)

VLO: the sources – DFKI tool registry

• http://registry.dfki.de/
• Contains information about 292 (linguistic) software packages
• You can add entries yourself

VLO: the modalities

• GIS

VLO: the modalities

• Hierarchical catalogue

VLO: the modalities

• Facet browser

Interaction between modalities

… all leading to the data

Interoperability issues (1)

• The seven facets to which all of the metadata records are mapped are currently:
  • country
  • continent
  • origin
  • language
  • organization
  • genre
  • subject

Interoperability issues (2)

• Observations:
  • Lots of inconsistencies and errors, e.g. for 1 organisation:
    • MPI (5)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (2)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (39)
    • Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (112)
    • Max Planck Institute for Psycholinguistics (13849)
    • Max Planck Institute for Psycholinguistics & Volkswagen Stiftung (12)
    • Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands (2)
    • Max Planck Institute for Psycholinguistics, Postbus 310, 6500 AH Nijmegen, The Netherlands (15)
  • Facets help to detect them
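
Once the facet counts have exposed such variants, a curation step could merge them. The following Python sketch uses a subset of the values and counts listed above; the `normalise` heuristic, the `merge_counts` helper, and the assumption that a bare "MPI" refers to the same institute are illustrative only and would need manual review in practice.

```python
import re
from collections import defaultdict

# A subset of the organisation facet values and counts listed above.
OBSERVED = {
    "MPI": 5,
    "Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)": 112,
    "Max Planck Institute for Psycholinguistics": 13849,
    "Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands": 2,
    "Max Planck Institute for Psycholinguistics, Postbus 310, "
    "6500 AH Nijmegen, The Netherlands": 15,
}

CANONICAL = "Max Planck Institute for Psycholinguistics"


def normalise(name):
    """Heuristic key: strip parenthesised and comma-separated address parts."""
    name = re.sub(r"\(.*?\)", "", name)       # drop "(Nijmegen, Netherlands)"
    name = name.split(",")[0]                 # drop a trailing postal address
    name = re.sub(r"\s+", " ", name).strip()  # collapse leftover whitespace
    # Assumption: in this catalogue a bare "MPI" means the same institute.
    return CANONICAL if name == "MPI" else name


def merge_counts(observed):
    """Sum the record counts of all variants that share a normalised key."""
    merged = defaultdict(int)
    for name, count in observed.items():
        merged[normalise(name)] += count
    return dict(merged)
```

With these five variants, `merge_counts(OBSERVED)` yields a single entry for the canonical name. Joint entries such as the one with the Académie Marquisienne are deliberately left out here, since they name more than one organisation and cannot be merged by string cleanup alone.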

Interoperability issues (3)

• Because of the distributed approach:
  • Distributed responsibilities
  • Loss of specificity by converting all metadata records to a common subset
  • Important to provide link to original record (also for the context!)
• Need for high-quality and well-maintained controlled vocabularies and relevant Persistent Identifiers:
  • MIME types
  • Organisation names
  • ISO 639-3 language codes (cf. ISOcat)
  • Domain-specific vocabularies
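
The trade-off between a common subset and the original record can be sketched as follows. The source names, field names, and URL in this Python example are hypothetical and do not reflect any actual provider schema; the point is only that unmapped fields are lost in the conversion, which is why the link back to the original record is kept.

```python
# Hypothetical per-source mappings from source fields onto shared facets.
FIELD_MAPS = {
    "source_a": {"lang": "language", "publisher": "organization"},
    "source_b": {"Language": "language", "Genre": "genre", "Country": "country"},
}


def to_common_subset(source, record, original_url):
    """Map one source record onto the shared facets.

    Returns the reduced record plus the names of the fields that were lost.
    """
    mapping = FIELD_MAPS[source]
    common = {facet: record[field]
              for field, facet in mapping.items() if field in record}
    dropped = sorted(set(record) - set(mapping))
    # Keep a pointer to the full record so the lost detail stays reachable.
    common["original_record"] = original_url
    return common, dropped
```

For a `source_b` record that also carries, say, a recording date, the date would appear in `dropped` while remaining available via `original_record`.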

Interoperability issues (4)

• Metadata exchange protocols exist (e.g. OAI-PMH) but:
  • They are not always used
  • For the VLO one still has to rely on non-continuous information flows like CSV files
  • Clearly an undesired situation in the longer term
• Granularity: how to indicate it in a standardized way?
• User feedback
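
For comparison with one-off CSV exports, a continuous OAI-PMH harvest boils down to repeated `ListRecords` requests that follow the `resumptionToken` returned by the provider. The Python sketch below only builds the request URLs and reads the token from a response; the helper names and the base URL in the usage note are hypothetical, and no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH the resumptionToken is an exclusive argument, so it replaces
    metadataPrefix on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    return token.text if token is not None and token.text else None
```

A harvester would call `list_records_url("https://example.org/oai")` first and then keep requesting with the token from `next_token` until it returns `None`.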

Future steps

• Curate the metadata:
  • correct typographical errors
  • add information
  • use consistent terminology, etc.
• Process CMDI- and ISOcat-based metadata
• Use (emerging) standards to refer to
  • persons
  • projects
  • resources
  • ... in a persistent and interoperable way

Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230
