Interoperability aspects in the Virtual Language Observatory
Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Metadata in Context workshop
2010-09-08
Nijmegen
Metadata in Context, 2010-09-08
Nijmegen
www.clarin.eu
Overview
• Context sketch
• VLO: ideas, sources, modalities
• Interoperability issues
• Future plans
Context sketch
• Lots of resources somewhere out there:
  • Data collections
    • Corpora
    • Lexica
    • Grammars
    • Multimedia recordings
  • Software
  • Web applications / services
  • Old-school linguistic resources:
    • Books
    • Articles
    • CD-ROMs
• It’s like a jungle, sometimes ...
VLO: the idea
• Researcher: “where do I start?”
• Provide a single entry point giving access to all information
• Because of the large amount of data:
  • Drill-down paradigm (decrease search space gradually)
  • Multiple ways of exploring:
    • Full-text search
    • Facet browsing
    • Geographic overlay
• Unified interface, links to the original context
• Available via www.clarin.eu/vlo
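The drill-down idea can be illustrated with a minimal Python sketch. It is not the VLO implementation: the record layout (a dict of facet name to list of values) and the function names are assumptions made for illustration only.

```python
from collections import Counter

def facet_counts(records, facet):
    """Count how often each value of `facet` occurs in the current result set."""
    return Counter(value for rec in records for value in rec.get(facet, []))

def drill_down(records, facet, value):
    """Shrink the search space: keep only records carrying `value` for `facet`."""
    return [rec for rec in records if value in rec.get(facet, [])]

# Hypothetical records, already mapped to facets.
RECORDS = [
    {"language": ["Dutch"], "country": ["Netherlands"]},
    {"language": ["Dutch"], "country": ["Belgium"]},
    {"language": ["German"], "country": ["Germany"]},
]
```

Each call to `drill_down` returns a smaller list, and `facet_counts` on that list gives the values and counts to display for the next selection step.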
VLO: the sources
VLO: the sources – LRT inventory
• http://www.clarin.eu/inventory
• Initiated by CLARIN
• Ad-hoc, low-barrier, user-driven inventory of Language Resources and Tools
• Number of records (approximate):
  • Resources: 848
  • Tools: 180
• You can add new entries yourself!
VLO: the sources – OLAC catalogue
• http://catalog.clarin.eu > OLAC data providers
• Metadata as harvested from 40 OLAC providers (among them several CLARIN centres)
• Quality and quantity differ hugely
VLO: the sources – MPI catalogue
• http://corpus1.mpi.nl
• About 130,000 metadata records
• Broad spectrum:
  • Experimental data
  • Spoken Dutch corpus
  • Sign Language corpora
  • Endangered languages documentation
• Archive in principle open for externally created linguistic data collections (e.g. endangered languages, see Donated Corpora)
  • If these collections comply with the technical requirements (archivable formats, metadata, …)
VLO: the sources – DFKI tool registry
• http://registry.dfki.de/
• Contains information about 292 (linguistic) software packages
• You can add entries yourself
VLO: the modalities
• GIS
VLO: the modalities
• Hierarchical catalogue
VLO: the modalities
• Facet browser
Interaction between modalities
… all leading to the data
Interoperability issues (1)
• The facets to which all metadata records are currently mapped:
  • country
  • continent
  • origin
  • language
  • organization
  • genre
  • subject
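Mapping heterogeneous records onto a shared set of facets can be sketched as follows. This is only an illustration under stated assumptions: the per-source field table is hypothetical (the Dublin Core element names exist, but their assignment to facets here is invented), and treating "origin" as the name of the source catalogue is also an assumption.

```python
FACETS = ("country", "continent", "origin", "language",
          "organization", "genre", "subject")

# Hypothetical mapping from source-specific field names to facet names.
FIELD_MAP = {
    "olac": {
        "dc:language": "language",
        "dc:publisher": "organization",
        "dc:subject": "subject",
        "dc:type": "genre",
    },
}

def to_facets(record, source):
    """Map one source record to the common facets.

    Returns (facets, dropped); `dropped` holds the fields that have no
    facet and are therefore lost in the common view.
    """
    mapping = FIELD_MAP[source]
    facets = {name: [] for name in FACETS}
    dropped = {}
    for field, value in record.items():
        facet = mapping.get(field)
        if facet is None:
            dropped[field] = value
        elif value:
            facets[facet].append(value)
    # Assumption: "origin" records which catalogue the record came from.
    facets["origin"].append(source)
    return facets, dropped
```

The `dropped` dictionary makes visible what a common subset cannot express, which is one reason to keep a link back to the original record.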
Interoperability issues (2)
• Observations:
  • Lots of inconsistencies and errors, e.g. for a single organisation:
    • MPI (5)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (2)
    • MPI for Psycholinguistics (Nijmegen, Netherlands), Académie Marquisienne (Tuhuna 'Eo 'Enata) (39)
    • Max Planck Institute for Psycholinguistics (Nijmegen, Netherlands) (112)
    • Max Planck Institute for Psycholinguistics (13849)
    • Max Planck Institute for Psycholinguistics & Volkswagen Stiftung (12)
    • Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands (2)
    • Max Planck Institute for Psycholinguistics, Postbus 310, 6500 AH Nijmegen, The Netherlands (15)
• Facets help to detect them
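One way such variants could be surfaced automatically is to group facet values under a crude normalised key. The sketch below is an assumption-laden illustration, not a description of how the VLO works: the abbreviation table and the rule "keep only the text before the first parenthesis or comma" are invented here and would produce candidates for human review, not final merges.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a curated vocabulary would replace it.
ABBREVIATIONS = {"MPI": "Max Planck Institute"}

def normalise(name):
    """Build a crude grouping key for an organisation name."""
    # Drop everything from the first "(" or "," onwards (locations, addresses).
    head = re.split(r"[(,]", name, maxsplit=1)[0].strip()
    words = [ABBREVIATIONS.get(word, word) for word in head.split()]
    return " ".join(words).casefold()

def group_variants(counts):
    """Group facet values (name -> record count) by their normalised key."""
    groups = defaultdict(dict)
    for name, n in counts.items():
        groups[normalise(name)][name] = n
    return groups
```

Note that the rule also discards a second organisation listed after a comma, so a record shared with another institution ends up in the same group and still needs a manual decision.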
Interoperability issues (3)
• Because of the distributed approach:
  • Distributed responsibilities
  • Loss of specificity by converting all metadata records to a common subset
  • Important to provide a link to the original record (also for the context!)
• Need for high-quality and well-maintained controlled vocabularies and relevant Persistent Identifiers:
  • MIME types
  • Organisation names
  • ISO 639-3 language codes (cf. ISOcat)
  • Domain-specific vocabularies
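As a small illustration of what a controlled vocabulary buys, the following Python sketch resolves free-text language values to ISO 639-3 codes. The three-entry table is only an excerpt chosen for the example, and the function name is hypothetical; a real check would load the full code table.

```python
# Tiny excerpt of ISO 639-3 (code -> English name), for illustration only.
ISO_639_3 = {"nld": "Dutch", "eng": "English", "deu": "German"}
NAME_TO_CODE = {name.casefold(): code for code, name in ISO_639_3.items()}

def to_language_code(value):
    """Return the ISO 639-3 code for a code or English name, else None."""
    cleaned = value.strip()
    if cleaned.lower() in ISO_639_3:
        return cleaned.lower()
    return NAME_TO_CODE.get(cleaned.casefold())
```

Values that come back as `None` are the ones a curator would have to look at, since they match neither a known code nor a known name.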
Interoperability issues (4)
• Metadata exchange protocols exist (e.g. OAI-PMH) but:
  • They are not always used
  • For the VLO one still has to rely on non-continuous information flows like CSV files
  • Clearly an undesirable situation in the longer term
• Granularity: how to indicate it in a standardized way?
• User feedback
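For comparison with one-off CSV dumps, a continuous OAI-PMH harvest boils down to repeated ListRecords requests that follow a resumption token. The sketch below only builds request URLs and parses one response page; the base URL and the sample response are made up, and no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces the other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token or None)

# Made-up response page, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would fetch `list_records_url(base)`, call `parse_page` on the response, and repeat with the returned token until it is `None`.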
Future steps
• Curate the metadata:
  • correct typographical errors
  • add information
  • use consistent terminology, etc.
• Process CMDI- and ISOcat-based metadata
• Use (emerging) standards to refer to
  • persons
  • projects
  • resources
  • ... in a persistent and interoperable way
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme
under grant agreement n° 212230