data portal based on open archives initiative protocols and apache lucene

9
WDC-MARE – World Data Center for Marine Environmental Sciences Data portal based on Open Archives Initiative Protocols and Apache Lucene Uwe Schindler , [email protected] Michael Diepenbroek, mdiepenbroek@wdc- mare.org MARUM, University of Bremen, Germany EGU 2006, Vienna, 2006-04-03 WDC-MARE – World Data Center for Marine Environmental Sciences

Upload: lesa

Post on 28-Jan-2016

43 views

Category:

Documents


0 download

DESCRIPTION

WDC-MARE – World Data Center for Marine Environmental Sciences. Data portal based on Open Archives Initiative Protocols and Apache Lucene Uwe Schindler , [email protected] Michael Diepenbroek, [email protected] MARUM, University of Bremen, Germany EGU 2006, Vienna, 2006-04-03. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Data portal based on Open Archives Initiative Protocols and Apache

Lucene

Uwe Schindler, [email protected] Diepenbroek, mdiepenbroek@wdc-

mare.org

MARUM, University of Bremen, Germany

EGU 2006, Vienna, 2006-04-03WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Page 2: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Data PortalsWDC-MARE with its information system PANGAEA provides data portals for several EU/international

projects:CARBOOCEAN, EUR-OCEANS, IODP

Problem:Not all data are stored centralized, so all datasets

provided in portals must be consolidated from different sources!

Page 3: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Example:CARBOOCEAN data portal

•Data stays at the data providers•Metadata is harvested by the portal

•Search queries are handled by the centralized catalogue

•Scientist gets link to data at the provider

Page 4: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Open Archives ProtocolThe Open Archives Initiative Protocol for

Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.

• uses it during web crawling ( Scholar)

•Almost all digital libraries support it (most famous ones: arXiv and the CERN Document Server)

•Very simple to implement (XML over HTTP based)•Repository software for databases or file system

metadata providers is widely available

Page 5: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Current OAI-PMH software1. Limited to Dublin Core metadata (libraries)!

2. Limited full text search functionality due to relational databases in the background!

3. No geographic retrievals (because of Dublin Core limitation)!

4. End user interface is part of the software, this limits usability in CMS systems

Page 6: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Requirements for portal software

1. Open for any XML metadata format2. Any mappings to document fields should be done

by XPath3. Possibility to map incompatible XML schemas

during harvesting by XSL4. No relational database, only a full text search

engine, that contains everything needed for operation

5. Range queries for specific fields (date/time or numeric)

6. Web service interface for the end user software that is accessible from any language (Java/JSP, PHP,

Perl,...)

Page 7: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Lucene

Lucene

Lucene

XML-Files

OAI-PMH

OAI-PMH

OAI-Harvester

OAI-Harvester

Filesystem-Harvester

OAI protocol in HTTP

OAI protocol in HTTP (specific set)

filesystem directory, FTP,…

Mini PanHTTP ServerJetty HTTP Server

Tomcat

Apache Axis

VirtualIndex

VirtualIndex

XSL

XSL

Portal 1(Webserver, PHP)

Portal 2(Webserver, JSP)Stored:

xmldata (same format everywhere, XSL before indexing), identifier, lastModified, sets

Searchable:

field1: “/oai_dc:dc/dc:author”field2: “/oai_dc:dc/dc:title”field3: “java:org.test.LatLon.parse(/oai_dc:dc/dc:coverage)” *default: “.”

*) xmlns:java=“http://xml.apache.org/xalan/java”

MetadataPortalMetadataPortal Java Java PackagePackage

Page 8: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Metadata standard harvested for search: DIF v9.4

Searchable fields: Bounding box, date/time, parameters, authors, investigators, title

Data centers:World Data Center for Marine Environmental Sciences (WDC-MARE), University of Bremen and Alfred-Wegener-Institute in Bremerhaven, Germany

Carbon Dioxide Information Analysis Center (CDIAC), Environmental Sciences Division at Oak Ridge National Laboratory, USA

French National Oceanographic Data Centre, SISMER (Systèmes d'Informations Scientifiques pour la Mer) at the Ifremer in Brest, France

CARBOOCEAN Data Portal

Page 9: Data portal based on Open Archives Initiative Protocols and Apache Lucene

WDC-MARE – World Data Center for Marine Environmental SciencesWDC-MARE – World Data Center for Marine Environmental Sciences

Thank you!