Interoperability aspects in the Virtual Language Observatory
Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Metadata in Context workshop
2010-09-08
Nijmegen
Metadata in Context, 2010-09-08
Nijmegen
www.clarin.eu
Overview
• Context sketch
• VLO: ideas, sources, modalities
• Interoperability issues
• Future plans
Context sketch
• Lots of resources somewhere out there:
  • Data collections
    • Corpora
    • Lexica
    • Grammars
    • Multimedia recordings
  • Software
  • Web applications / services
  • Old-school linguistic resources:
    • Books
    • Articles
    • CD-ROMs
• It’s like a jungle, sometimes ...
VLO: the idea
• Researcher: “where do I start?”
• Provide a single entry point giving access to all information
• Because of the large amount of data:
  • Drill-down paradigm (decrease search space gradually)
  • Multiple ways of exploring:
    • Full-text search
    • Facet browsing
    • Geographic overlay
• Unified interface, links to the original context
• Available via www.clarin.eu/vlo
VLO: the sources
VLO: the sources – LRT inventory
• http://www.clarin.eu/inventory
• Initiated by CLARIN
• Ad-hoc, low-barrier, user-driven inventory of Language Resources and Tools
• Number of records (approximate):
  • Resources: 848
  • Tools: 180
• You can add new entries yourself!
VLO: the sources – OLAC catalogue
• http://catalog.clarin.eu > OLAC data providers
• Metadata as harvested from 40 OLAC providers (among them several CLARIN centres)
• Quality and quantity differ hugely
VLO: the sources – MPI catalogue
• http://corpus1.mpi.nl
• About 130,000 metadata records
• Broad spectrum:
  • Experimental data
  • Spoken Dutch corpus
  • Sign Language corpora
  • Endangered languages documentation
• Archive in principle open for externally created linguistic data collections (e.g. endangered languages, see Donated Corpora)
  • If these collections comply with the technical requirements (archivable formats, metadata, …)
VLO: the sources – DFKI tool registry
• http://registry.dfki.de/
• Contains information about 292 (linguistic) software packages
• You can add entries yourself
VLO: the modalities
• GIS
VLO: the modalities
• Hierarchical catalogue
VLO: the modalities
• Facet browser
Interaction between modalities
… all leading to the data
Interoperability issues (1)
• The seven facets to which all of the metadata records are mapped are currently:
  • country
  • continent
  • origin
  • language
  • organization
  • genre
  • subject
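Facet browsing over records mapped this way could be sketched roughly as follows. This is only an illustration: the facet names come from the slide, but the record layout (one dictionary per record), the sample values and the helper names `facet_counts` and `drill_down` are assumptions, not the VLO implementation.

```python
from collections import Counter

# Facet names as listed on the slide; everything else here is hypothetical.
FACETS = ["country", "continent", "origin", "language",
          "organization", "genre", "subject"]


def facet_counts(records, facet):
    """Count how many records carry each value of one facet."""
    return Counter(r[facet] for r in records if r.get(facet))


def drill_down(records, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(f) == v for f, v in selected.items())]


# Made-up sample records, for illustration only.
records = [
    {"language": "Dutch", "country": "Netherlands", "genre": "speech"},
    {"language": "Dutch", "country": "Belgium", "genre": "speech"},
    {"language": "English", "country": "Netherlands", "genre": "lexicon"},
]

# Selecting one facet value narrows the search space; the remaining
# facets are then recounted over the smaller result set.
dutch = drill_down(records, language="Dutch")
counts = facet_counts(dutch, "country")
```

Each selection shrinks the result set, and the counts shown next to the remaining facet values are recomputed over what is left, which is the drill-down behaviour the facet browser offers.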
Interoperability issues (2)
• Observations:
  • Lots of inconsistencies and errors, e.g. for one organisation:
    • MPI (5)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (2)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (39)
    • Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (112)
    • Max Planck Institute for Psycholinguistics (13849)
    • Max Planck Institute for Psycholinguistics & Volkswagen Stiftung (12)
    • Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands (2)
    • Max Planck Institute for Psycholinguistics, Postbus 310, 6500 AH Nijmegen, The Netherlands (15)
  • Facets help to detect them
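One possible way to merge such variants is a hand-curated lookup table, sketched below. The spellings and counts are copied from the list above; the choice of canonical name, the table itself and the `normalize` helper are assumptions made for illustration. The entries naming a second organisation are deliberately left out, since they may describe genuine joint ownership.

```python
from collections import Counter

CANONICAL = "Max Planck Institute for Psycholinguistics"

# Hypothetical curation table: variant spelling -> canonical name.
VARIANTS = {
    "MPI": CANONICAL,
    "Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)": CANONICAL,
    "Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands": CANONICAL,
    "Max Planck Institute for Psycholinguistics, Postbus 310, "
    "6500 AH Nijmegen, The Netherlands": CANONICAL,
}


def normalize(name):
    """Collapse whitespace, then map known variants to the canonical name."""
    cleaned = " ".join(name.split())
    return VARIANTS.get(cleaned, cleaned)


# Facet values and record counts as shown on the slide.
observed = {
    "MPI": 5,
    "Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands)": 112,
    "Max Planck Institute for Psycholinguistics": 13849,
    "Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands": 2,
    "Max Planck Institute for Psycholinguistics, Postbus 310, "
    "6500 AH Nijmegen, The Netherlands": 15,
}

# Sum the counts of all spellings that normalize to the same name.
merged = Counter()
for name, count in observed.items():
    merged[normalize(name)] += count
```

After merging, the five spellings collapse into a single facet value. A lookup table is the simplest option; a shared registry of organisation names with persistent identifiers would avoid maintaining such tables by hand.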
Interoperability issues (3)
• Because of the distributed approach:
  • Distributed responsibilities
  • Loss of specificity by converting all metadata records to a common subset
  • Important to provide link to original record (also for the context!)
• Need for high-quality and well-maintained controlled vocabularies and relevant Persistent Identifiers:
  • MIME types
  • Organisation names
  • ISO 639-3 language codes (cf. ISOcat)
  • Domain-specific vocabularies
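As a small illustration of what a controlled vocabulary buys, the sketch below maps free-text language values to ISO 639-3 codes. The four-entry table is only an excerpt chosen for the example (the real code set has thousands of entries), and the helper name `to_iso639_3` is hypothetical.

```python
import re

# Tiny excerpt of language name -> ISO 639-3 code, for illustration only.
LANGUAGE_CODES = {
    "dutch": "nld",
    "english": "eng",
    "german": "deu",
    "french": "fra",
}

# ISO 639-3 codes consist of exactly three lowercase letters.
CODE_PATTERN = re.compile(r"^[a-z]{3}$")


def to_iso639_3(value):
    """Return the ISO 639-3 code for a name or code, or None if unknown."""
    v = value.strip().lower()
    if CODE_PATTERN.match(v) and v in LANGUAGE_CODES.values():
        return v  # already a code from the vocabulary
    return LANGUAGE_CODES.get(v)  # try it as a language name
```

Values that return `None` are the ones a curator would need to look at; everything else ends up under one code regardless of how the provider spelled it.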
Interoperability issues (4)
• Metadata exchange protocols exist (e.g. OAI-PMH) but:
  • They are not always used
  • For the VLO one still has to rely on non-continuous information flows like CSV files
  • Clearly an undesirable situation in the longer term
• Granularity: how to indicate it in a standardized way?
• User feedback
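For comparison with a one-off CSV export, a continuous OAI-PMH harvest boils down to repeated `ListRecords` requests that follow the `resumptionToken` until none is returned. The sketch below only builds the request URL and parses one response page; the endpoint `http://example.org/oai`, the sample response and the helper names are assumptions, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hand-written sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://example.org/oai")
ids, token = parse_page(SAMPLE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would fetch `first_url`, parse the page, and keep requesting the URL built from the returned token until the token is empty or absent.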
Future steps
• Curate the metadata:
  • correct typographical errors
  • add information
  • use consistent terminology, etc.
• Process CMDI- and ISOcat-based metadata
• Use (emerging) standards to refer to
  • persons
  • projects
  • resources
  • ... in a persistent and interoperable way
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230