Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Christian Gendreau, Université de Montréal / Canadensys
David P. Shorthouse, Université de Montréal / Canadensys
Marie-Élise Lecoq, GBIF France
Tim Robertson, GBIF

Upload: kristgen

Post on 14-Jun-2015

TRANSCRIPT

Page 1: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Christian Gendreau, Université de Montréal / Canadensys
David P. Shorthouse, Université de Montréal / Canadensys
Marie-Élise Lecoq, GBIF France
Tim Robertson, GBIF

Page 2: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Darwin Core Archive (DwC-A)

The DarwinCore standard does not impose strong rules on the content associated with any DarwinCore term.

Page 3: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Current GBIF DwC-A Validator

Original goal
“… test Darwin Core Archives as specified in the Darwin Core Text Guide.”

http://tools.gbif.org/dwca-validator/

Page 4: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Current GBIF DwC-A Validator

Original target
DwC-A files are simple and can be created with simple custom scripts.

“… make sure GBIF and others can read the information as expected.”

Page 5: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Current GBIF DwC-A Validator

• Validates archive structure
• Offers a web presence
  – Report viewer
  – API

Page 6: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Next GBIF DwC-A Validator?

New goal
Extend validation to the content of the archive

https://github.com/gbif/dwca-validator

Page 7: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Current content validators

• Atlas of Living Australia sandbox
• VertNet – Spatial quality
• GBIF Spain – Darwin Test
• Encyclopedia of Life – dwc-validator
• Scratchpads – dwca-validator
• GlobalNames – dwc-archive ruby gem
• … and many more

See Appendix 1 for links

Page 8: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

What do we need?

• Accommodate different scopes
• Configuration/customizations
  – Use more knowledge when available
• Web access (page and API)

Page 9: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Scopes

• Data entry
• Desktop software
  – Scientific workflow
  – Statistical software
• Integrated Publishing Toolkit (IPT)
• National nodes
• Aggregators

Page 10: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Configuration/Customization

• Where will the validator be used?
• Can we provide more information?
  – e.g. I know all the dates in my file should be ISO 8601
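
As a rough illustration of such a hint, a caller who knows that every date should be ISO could switch on a strict date check. The sketch below is Python with hypothetical names (`is_iso_date`, `check_dates`, the `dates_are_iso` flag) and assumes the plain `YYYY-MM-DD` form; it is not the validator's API.

```python
import re
from datetime import date

# Only the plain calendar form YYYY-MM-DD is accepted in this sketch.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_iso_date(value):
    """Return True when value is a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2015-02-30
    except ValueError:
        return False
    return True

def check_dates(records, term="eventDate", dates_are_iso=False):
    """Return the indexes of records whose non-empty `term` is not an ISO date.

    The check only runs when the caller states that dates are ISO
    (hypothetical `dates_are_iso` flag); otherwise nothing is reported.
    """
    if not dates_are_iso:
        return []
    return [
        index
        for index, record in enumerate(records)
        if record.get(term) and not is_iso_date(record[term])
    ]
```

Without the flag the same file passes untouched, which is the point of the hint: the extra knowledge makes the validation stricter only where the data provider asked for it.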

Page 11: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Components

• Library
• Web
• Extension support

Page 12: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Library

• Defines the structure of the validation process
• Provides a validation framework that enables sharing
• Stays close to the DarwinCore specification

Page 13: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Web

• Web page to submit an archive or a URL
• Report viewer
• API

Page 14: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Extension Support

• Include domain knowledge
• Propose interpreted data

Page 15: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Internals

• Validation types
  – Structure
    • Metadata
  – Records: rows
    • Field data (e.g. date, coordinates)
  – Records: columns
    • ID uniqueness
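
The difference between the two record-level types is that a row check looks at one record at a time, while a column check needs every value of a column. A minimal Python sketch of the column case, using a hypothetical `duplicated_ids` helper and an assumed `id` column name:

```python
from collections import Counter

def duplicated_ids(rows, id_column="id"):
    """Return the sorted identifiers that occur more than once in `id_column`.

    Unlike a row-level check, this cannot decide anything from a single
    record: it has to see the whole column first.
    """
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)
```

For example, `duplicated_ids([{"id": "a"}, {"id": "b"}, {"id": "a"}])` reports `"a"` as the only duplicate.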

Page 16: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Internals – Record level

• Validation chain
  – Composed of chain elements
  – Possible parallelism

Page 17: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Internals – Record level

• Immutable chain element
  – Self-contained
    • Never relies on another chain element
  – Ordering independent
    • Same behaviour wherever the element is used in the chain
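
One way to picture these properties is the Python sketch below. The names (`ChainElement`, `run_chain`, `require`, `numeric`) are hypothetical and chosen for illustration; the project's own classes may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

@dataclass(frozen=True)  # frozen: an element cannot be modified once built
class ChainElement:
    name: str
    # Returns a problem message, or None when the record passes.
    check: Callable[[Mapping[str, str]], Optional[str]]

def run_chain(elements, record):
    """Run every element on the record and collect the reported problems."""
    messages = []
    for element in elements:
        message = element.check(record)
        if message is not None:
            messages.append(f"{element.name}: {message}")
    return messages

def require(term):
    """Element that reports a missing or empty value for `term`."""
    return ChainElement(
        name=f"require {term}",
        check=lambda record: None if record.get(term) else "missing value",
    )

def numeric(term):
    """Element that reports a non-numeric value for `term`."""
    def check(record):
        value = record.get(term)
        if not value:
            return None  # emptiness is another element's concern
        try:
            float(value)
        except ValueError:
            return f"not a number: {value!r}"
        return None
    return ChainElement(name=f"numeric {term}", check=check)
```

Because each element reads only the record and never the output of another element, reversing the list passed to `run_chain` yields the same set of messages, and the elements could be evaluated in parallel.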

But what if I really need ordering?

Page 18: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Internals - Composition

• Composed chain element
• Exposed as one chain element

Page 19: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Composition example

• Mandatory latitude/longitude
  – Check record completion on lat/long
  – Check decimal lat/long value
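
A minimal Python sketch of this composition, assuming hypothetical helper names (`compose`, `has_lat_long`, `lat_long_is_decimal`) and the usual ±90/±180 degree limits. The ordering lives inside the composed element, so from the outside it still behaves as a single, self-contained element.

```python
def has_lat_long(record):
    """Report each coordinate term that is missing or empty."""
    terms = ("decimalLatitude", "decimalLongitude")
    return [f"{term} is missing" for term in terms if not record.get(term)]

def lat_long_is_decimal(record):
    """Report coordinates that are not decimal numbers or are out of range.

    Assumes both terms are present; the composition below guarantees it.
    """
    problems = []
    for term, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        try:
            value = float(record[term])
        except ValueError:
            problems.append(f"{term} is not a decimal number")
            continue
        if not -limit <= value <= limit:
            problems.append(f"{term} is out of range")
    return problems

def compose(*steps):
    """Return one check that runs `steps` in order, stopping at the first
    step that reports problems."""
    def composed(record):
        for step in steps:
            problems = step(record)
            if problems:
                return problems
        return []
    return composed

# Exposed to the chain as a single element.
mandatory_lat_long = compose(has_lat_long, lat_long_is_decimal)
```

A record with only `decimalLatitude` is reported as incomplete and the decimal check never runs on it, which is the ordering guarantee the composition provides.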

Page 20: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Configuration example

• Select mandatory DarwinCore terms
  – scientificName must be provided
• Restrict bounding box
  – decimalLatitude and decimalLongitude must be between
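
Such a configuration could be sketched in Python as follows. `ValidatorConfig` and `validate` are hypothetical names, and the bounding-box values are supplied by the user; nothing here reflects the project's actual configuration format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ValidatorConfig:
    # DarwinCore terms that every record must fill in.
    mandatory_terms: Tuple[str, ...] = ()
    # (min_lat, max_lat, min_lon, max_lon); None disables the check.
    bounding_box: Optional[Tuple[float, float, float, float]] = None

def validate(record, config):
    """Return the problems found in one record under the given configuration."""
    problems = [
        f"{term} must be provided"
        for term in config.mandatory_terms
        if not record.get(term)
    ]
    if config.bounding_box is not None:
        min_lat, max_lat, min_lon, max_lon = config.bounding_box
        try:
            lat = float(record.get("decimalLatitude") or "")
            lon = float(record.get("decimalLongitude") or "")
        except ValueError:
            problems.append("coordinates are missing or not numeric")
        else:
            if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
                problems.append("coordinates fall outside the bounding box")
    return problems
```

A configuration such as `ValidatorConfig(mandatory_terms=("scientificName",), bounding_box=(40.0, 50.0, -80.0, -70.0))` uses made-up bounds purely for illustration.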

Page 21: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Customization example

• Apply your own controlled vocabulary
  – Use your own dictionary for a term
  – ControlledVocabularyEvaluationRule
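
The idea of a dictionary-backed rule can be illustrated with a few lines of Python. The function below is a hypothetical stand-in written for this transcript, not the project's `ControlledVocabularyEvaluationRule`; the case-insensitive matching is an assumption.

```python
def controlled_vocabulary_rule(term, allowed_values):
    """Build a rule that checks `term` against a user-supplied dictionary.

    Matching ignores case and surrounding whitespace (an assumption of
    this sketch); empty values are left to a separate mandatory-term check.
    """
    allowed = frozenset(value.strip().lower() for value in allowed_values)

    def rule(record):
        value = (record.get(term) or "").strip()
        if not value or value.lower() in allowed:
            return None
        return f"{term}: {value!r} is not in the controlled vocabulary"

    return rule
```

Usage: `rule = controlled_vocabulary_rule("basisOfRecord", ["PreservedSpecimen", "HumanObservation"])`, then call `rule(record)` for each record; it returns `None` when the value is accepted.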

Page 22: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Extension Example

• Suggester, link to narwhal-processor
  – Suède -> ISO 3166-2:SE
  – URI -> http://sws.geonames.org/2661886
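
A suggester proposes an interpreted value instead of only flagging the raw one. The Python sketch below uses a hypothetical `suggest_country` function and a one-entry lookup table taken from the example above; it does not call narwhal-processor, whose API is not shown in these slides.

```python
from typing import NamedTuple, Optional

class Suggestion(NamedTuple):
    raw: str   # value as found in the archive
    code: str  # proposed country code
    uri: str   # proposed identifier

# Tiny illustrative dictionary; a real extension would carry domain knowledge.
_COUNTRIES = {
    "suède": ("SE", "http://sws.geonames.org/2661886"),
}

def suggest_country(raw: str) -> Optional[Suggestion]:
    """Return an interpreted value for a raw country string, or None."""
    entry = _COUNTRIES.get(raw.strip().casefold())
    if entry is None:
        return None
    code, uri = entry
    return Suggestion(raw=raw, code=code, uri=uri)
```

The raw value is kept alongside the proposal so that a report can show both and leave the decision to the data provider.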

Page 23: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Collaborative

• Share configuration
• Share customization (dictionary)
• Implement new reusable components
  – e.g. validation on a specific DwC-A extension

Page 24: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Collaboration

• Where to go?
  – https://github.com/gbif/dwca-validator
• Who can contribute?
  – Everyone
• What is needed?
  – Ideas, constructive comments
  – Code review, feedback

Page 25: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Project status

• Not yet released
• Command line interface available

Follow the project on GitHub

Page 26: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Acknowledgments

Page 27: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Special thanks

• SiB Colombia
• SiB Brazil
• Peter Desmet
• John Wieczorek
• Dag Endresen
• …

Page 28: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Appendix 1: DwC content validators

Atlas of Living Australia sandbox
http://sandbox.ala.org.au/datacheck/

VertNet – Spatial quality
Displayed on occurrence pages at http://portal.vertnet.org/search

GBIF Spain – Darwin Test
http://www.gbif.es/darwin_test/Darwin_Test_in.php

Encyclopedia of Life – dwc-validator
http://services.eol.org/dwc_validator/

Page 29: Darwin Core Archive (DwC-A) validation: A New Collaborative Effort

Appendix 1 (continued)

Scratchpads – dwca-validator
https://github.com/edwbaker/dwca_validator/

GlobalNames – dwc-archive ruby gem
https://github.com/GlobalNamesArchitecture/dwc-archive