Page 1: Interoperability aspects in the Virtual Language Observatory, Dieter Van Uytvanck, Max Planck Institute for Psycholinguistics, Dieter.VanUytvanck@mpi.nl

Interoperability aspects in the Virtual Language Observatory

Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Dieter.VanUytvanck@mpi.nl

Metadata in Context workshop
2010-09-08
Nijmegen

Page 2

www.clarin.eu

Overview

• Context sketch
• VLO: ideas, sources, modalities
• Interoperability issues
• Future plans

Page 3

Context sketch

• Lots of resources somewhere out there:
  • Data collections
    • Corpora
    • Lexica
    • Grammars
    • Multimedia recordings
  • Software
  • Web applications / services
  • Old-school linguistic resources:
    • Books
    • Articles
    • CD-ROMs
• It’s like a jungle, sometimes ...

Page 4

VLO: the idea

• Researcher: “where do I start?”
• Provide a single entry point giving access to all information
• Because of the large amount of data:
  • Drill-down paradigm (decrease search space gradually)
• Multiple ways of exploring:
  • Full-text search
  • Facet browsing
  • Geographic overlay
• Unified interface, links to the original context
• Available via www.clarin.eu/vlo
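The drill-down paradigm can be illustrated with a minimal Python sketch. It is not the VLO implementation: the records, field names and helper functions below are hypothetical and only show how each facet selection shrinks the search space.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"language": "Dutch", "country": "Netherlands", "genre": "conversation"},
    {"language": "Dutch", "country": "Belgium", "genre": "news"},
    {"language": "Marquesan", "country": "French Polynesia", "genre": "narrative"},
    {"language": "Dutch", "country": "Netherlands", "genre": "news"},
]


def facet_counts(recs, facet):
    """Count how many records carry each value of one facet."""
    return Counter(r[facet] for r in recs if facet in r)


def drill_down(recs, facet, value):
    """Keep only the records whose facet has the selected value."""
    return [r for r in recs if r.get(facet) == value]


# Each selection narrows the result set a little further.
step1 = drill_down(records, "language", "Dutch")
step2 = drill_down(step1, "country", "Netherlands")

print(facet_counts(records, "language"))
print(len(records), len(step1), len(step2))
```

A facet browser would typically show the output of `facet_counts` next to each facet so that the user can see how many records remain before choosing a value.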

Page 5

VLO: the sources

Page 6

VLO: the sources – LRT inventory

• http://www.clarin.eu/inventory
• Initiated by CLARIN
• Ad-hoc, low-barrier, user-driven inventory of Language Resources and Tools
• Number of records (+/-):
  • Resources: 848
  • Tools: 180
• You can add new entries yourself!

Page 7

VLO: the sources – OLAC catalogue

• http://catalog.clarin.eu > OLAC data providers
• Metadata as harvested from 40 OLAC providers (among them several CLARIN centres)
• Quality and quantity differ hugely

Page 8

VLO: the sources – MPI catalogue

• http://corpus1.mpi.nl
• About 130,000 metadata records
• Broad spectrum:
  • Experimental data
  • Spoken Dutch corpus
  • Sign language corpora
  • Endangered languages documentation
• Archive in principle open for externally created linguistic data collections (e.g. endangered languages, see Donated Corpora)
  • If these collections comply with the technical requirements (archivable formats, metadata, …)

Page 9

VLO: the sources – DFKI tool registry

• http://registry.dfki.de/
• Contains information about 292 (linguistic) software packages
• You can add entries yourself

Page 10

VLO: the modalities

• GIS

Page 11

VLO: the modalities

• Hierarchical catalogue

Page 12

VLO: the modalities

• Facet browser

Page 13

Interaction between modalities

Page 14

… all leading to the data

Page 15

Interoperability issues (1)

• The seven facets to which all of the metadata records are mapped are currently:
  • country
  • continent
  • origin
  • language
  • organization
  • genre
  • subject
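As a rough illustration of mapping heterogeneous records onto these common facets, here is a minimal Python sketch. The per-source field names in `FIELD_MAPS` are hypothetical, and treating "origin" as the catalogue a record came from is an assumption, not something the slides state.

```python
# Facet names as listed on the slide.
FACETS = ["country", "continent", "origin", "language",
          "organization", "genre", "subject"]

# Hypothetical per-source field mappings (facet -> source field name).
FIELD_MAPS = {
    "olac": {"language": "dc:language", "subject": "dc:subject",
             "organization": "dc:publisher"},
    "lrt": {"language": "Languages", "country": "Country",
            "organization": "Institute"},
}


def to_facets(record, source):
    """Project a source-specific record onto the common facet set.

    Facets the source cannot supply are set to None, which makes the
    loss of specificity and coverage visible.
    """
    mapping = FIELD_MAPS[source]
    facets = {f: (record.get(mapping[f]) if f in mapping else None)
              for f in FACETS}
    # Assumption: "origin" records which catalogue the record came from.
    facets["origin"] = source
    return facets


example = to_facets({"dc:language": "nld", "dc:publisher": "MPI"}, "olac")
print(example)
```

Keeping the unmapped facets as explicit `None` values, rather than dropping them, makes it easy to count how many records per source lack a given facet.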

Page 16

Interoperability issues (2)

• Observations:
  • Lots of inconsistencies and errors, e.g. for one organisation:
    • MPI (5)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (2)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (39)
    • Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (112)
    • Max Planck Institute for Psycholinguistics (13849)
    • Max Planck Institute for Psycholinguistics & Volkswagen Stiftung (12)
    • Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands (2)
    • Max Planck Institute for Psycholinguistics, Postbus 310, 6500 AH Nijmegen, The Netherlands (15)
  • Facets help to detect them
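The variants listed above could be merged with a simple normalisation rule, sketched here in Python. The labels and counts are taken from the slide; the rule itself is a hypothetical illustration, and it assumes that a bare "MPI" refers to the Max Planck Institute for Psycholinguistics.

```python
from collections import Counter

CANONICAL = "Max Planck Institute for Psycholinguistics"

# (label, record count) pairs as shown on the slide. One label appears
# twice in the transcript with different counts; both rows are kept as given.
observed = [
    ("MPI", 5),
    ("MPI for Psycholinguistics (Nijmegen, Netherlands), "
     "Académie Marquisienne (Tuhuna 'Eo 'Enata)", 2),
    ("MPI for Psycholinguistics (Nijmegen, Netherlands), "
     "Académie Marquisienne (Tuhuna 'Eo 'Enata)", 39),
    ("Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)", 112),
    ("Max Planck Institute for Psycholinguistics", 13849),
    ("Max Planck Institute for Psycholinguistics & Volkswagen Stiftung", 12),
    ("Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands", 2),
    ("Max Planck Institute for Psycholinguistics, Postbus 310, "
     "6500 AH Nijmegen, The Netherlands", 15),
]


def normalise(label):
    """Naive rule: expand a leading 'MPI' and match on the canonical prefix."""
    expanded = label
    if label.startswith("MPI"):
        expanded = label.replace("MPI", "Max Planck Institute", 1)
    # Assumption: a bare "MPI" means the Psycholinguistics institute.
    if expanded == "Max Planck Institute" or expanded.startswith(CANONICAL):
        return CANONICAL
    return label


merged = Counter()
for label, count in observed:
    merged[normalise(label)] += count

print(merged)
```

Note that this prefix rule silently drops the co-listed organisations (Académie Marquisienne, Volkswagen Stiftung), so real curation would need to split multi-organisation values rather than collapse them.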

Page 17

Interoperability issues (3)

• Because of the distributed approach:
  • Distributed responsibilities
  • Loss of specificity by converting all metadata records to a common subset
  • Important to provide a link to the original record (also for the context!)
• Need for high-quality and well-maintained controlled vocabularies and relevant Persistent Identifiers:
  • MIME types
  • Organisation names
  • ISO 639-3 language codes (cf. ISOcat)
  • Domain-specific vocabularies
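To show what checking a metadata value against a controlled vocabulary might look like, here is a minimal Python sketch for language values. The three-entry table is only a tiny excerpt of ISO 639-3 chosen for illustration, and the function name and status labels are hypothetical.

```python
import re

# Tiny illustrative excerpt; the full ISO 639-3 table has thousands of entries.
ISO_639_3_SUBSET = {"nld": "Dutch", "deu": "German", "eng": "English"}

# ISO 639-3 identifiers are three lowercase letters.
CODE_SHAPE = re.compile(r"^[a-z]{3}$")


def check_language(value):
    """Classify a language value as a known code, an unknown code, or free text."""
    if value in ISO_639_3_SUBSET:
        return ("ok", ISO_639_3_SUBSET[value])
    if CODE_SHAPE.match(value):
        return ("unknown-code", value)
    return ("free-text", value)


for value in ["nld", "xyz", "Dutch"]:
    print(value, check_language(value))
```

Values flagged as free text are the ones that would need manual or rule-based mapping to a code before records from different sources can be compared reliably.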

Page 18

Interoperability issues (4)

• Metadata exchange protocols exist (e.g. OAI-PMH) but:
  • They are not always used
  • For the VLO one still has to rely on non-continuous information flows like CSV files
  • Clearly an undesirable situation in the longer term
• Granularity: how to indicate it in a standardized way?
• User feedback
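For contrast with one-off CSV exports, the following Python sketch builds an OAI-PMH `ListRecords` request. The verb and the `metadataPrefix`, `from` and `set` parameters are part of the OAI-PMH protocol; the base URL and the function name are hypothetical, and the sketch only constructs the URL without contacting any server.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `from_date` restricts the harvest to records changed since that date,
    which is what makes a continuous, incremental flow possible.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, used only to show the shape of the request.
url = list_records_url("https://repository.example.org/oai",
                       from_date="2010-09-01")
print(url)
```

A harvester that stores the date of its last successful run can pass it as `from_date` on the next run, which avoids re-importing a full dump each time.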

Page 19

Future steps

• Curate the metadata:
  • correct typographical errors
  • add information
  • use consistent terminology, etc.
• Process CMDI- and ISOcat-based metadata
• Use (emerging) standards to refer to
  • persons
  • projects
  • resources
  • ... in a persistent and interoperable way

Page 20

Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230