building data infrastructures for science

Building data infrastructures for science Vince Smith Informatics Horizons, London 24 July 2013

Upload: vincent-smith

Post on 27-Jan-2015

DESCRIPTION

Presented by V. Smith at the Informatics Horizons event, Natural History Museum, London, UK. 24 July 2013.

TRANSCRIPT

Page 1: Building data infrastructures for science

Building data infrastructures for science

Vince Smith

Informatics Horizons, London, 24 July 2013

Page 2: Building data infrastructures for science

Overview

1. (my) Background
• Lice to data infrastructures!
• Why data infrastructures at the NHM

2. Building data infrastructures
• Recent core investment in NHM infrastructures
• Leveraging external investment in NHM infrastructures
• Infrastructure design principles & coordination

3. NHM 5-year data infrastructure horizons
• Collections digitisation
• Large-scale use of collections data
• New approaches to biodiversity discovery

4. Decadal community infrastructure challenges
• The long view – science data strategies
• Data modeling and real time monitoring as a unifying theme

Page 3: Building data infrastructures for science

1. (my) Background

Page 4: Building data infrastructures for science

Lice to data infrastructures!

Systematics (circa 1998)
- No high level keys
- Poor high level taxonomy
- Just one phylogeny
- Few living experts!

Circa 5,000 spp. Mammals & birds

12,000 associations 15,000 potential hosts

Page 5: Building data infrastructures for science

My data infrastructure (circa 1998)

Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.

142 pieces of “raw” data in 4 of 54 pages, in 1 of 9,110 taxonomic papers on lice

- Taxonomic names
- Authorities (name concepts)
- Citations
- Collection data
- Morphological characters
- Textual descriptions
- Diagnostic keys
- Illustrations
- Photographs

Page 6: Building data infrastructures for science

“The bane of my existence is doing things that I know the computer could do for me”

-- Dan Connolly, The XML Revolution (Nature, 1998)

Page 7: Building data infrastructures for science

http://darwin.zoology.gla.ac.uk/~rpage/LouseBase/2/

LouseBASE

Specimens Images

(SID)

http://darwin.zoology.gla.ac.uk/~SID/

Literature

PHPBib http://myphpbib.sourceforge.net/

Lab Notebook

http://www2.flmnh.ufl.edu/pdb/

Host-Parasite Checklists

http://www2.flmnh.ufl.edu/adb/

Glasgow version at:

My data infrastructures (circa 2004)

Page 8: Building data infrastructures for science

My publications in 2004 (enabled by these infrastructures)

Science; Proc. R. Soc. B; Syst. Biol.; Mol. Phyl. Evol.; Zoo. Scripta; Biol. Letters; PLoS Biology; Grzimek’s Ency.; Ent. Abh.

Images; Literature; Specimens; Checklists; Lab Notebooks

Making louse research more efficient, more collaborative and more productive

Page 9: Building data infrastructures for science

Why data infrastructures at the NHM: lots of potential

Card indices; Archives; Library; Staff; Labels; Frozen Tissue; Slides; Dry; Spirit

Page 10: Building data infrastructures for science

2. Building data infrastructures

Page 11: Building data infrastructures for science

Recent NHM investment in science data infrastructures

1. KE EMu (collections data)
• Improved interface (speed, complexity, data quality, support)
• Rapid Data Entry Web-Interface
• Improved import & export functionality (CLD & data portal)

2. DAMS (multimedia) ?
• Review (Digital Strategy Group)

3. NHM Virtual Library (literature)
• Integrated search & discovery of NHM resources
• Better integration with external resources

4. NHM Data Portal (access, citation & archival)
• Discovery & visualisation of collections data on the Web
• Web exposure & archival of NHM research datasets
• Sub-portals for collaborative projects
• As strategically important as the Web in 3 years' time!

Enabling the NHM mission?

Collections Public Engagement Research

Page 12: Building data infrastructures for science

External investment in science data infrastructures

1. ViBRANT (EU FP7 Infrastructures, 17 partners, €4.75M)
• Virtual Biodiversity Research & Access Network for Taxonomy
• Building & integrating tools supporting biodiversity research communities (publishing, literature & vocabulary management, ID keys, conservation assessments, mapping & visualisation tools, citizen science support)

2. e-Monocot (NERC Consortium; Kew, Oxford & NHM, £2.38M)
• Sustainable, integrated resource on Monocot plants
• Content and supporting digital infrastructure (Complete family level keys & taxon pages; generic keys & pages for 8 families; select species-level resources from European Monocots, Red-list species and Slipper orchids)

3. SYNTHESYS 1, 2 & 3 (EU FP5/6/7 Infrastructures, 18 partners, €10M)
• Support for physical access to participating collections
• JRA: Research into mass collections digitisation (Image analysis, segmentation, transcription & crowdsourcing)

4. Others
• Open-UP
• BHL-EUROPE

Page 13: Building data infrastructures for science

What are Scratchpads? (http://scratchpads.eu)

Scratchpad VRE: foundation for ViBRANT & eMonocot

Taxa (Classifications, taxon profiles, specimens, literature, images, maps, phenotypic, genotypic & morphometric datasets, keys, phylogenies)

Conservation; Projects; Regions; Societies

Page 14: Building data infrastructures for science

Impact: Scratchpad usage (July 2013)

525 Scratchpad Communities by 6,550 active registered users, covering 73,444 taxa in 535,317 pages

81 paper citations in 2012

65,000 unique visitors/month; in total more than 1,300,000 visitors

119 NHM staff, 83 sites

[Chart: per month unique visitors to Scratchpad sites]

Page 15: Building data infrastructures for science

3. Our near-term infrastructure horizons

Page 16: Building data infrastructures for science

Digital Ambition: NHM Science Strategy 2013-2017

A New Voyage of Discovery

Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement

Five Challenges
1. The digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills

Resources & funding

Measuring success

Page 17: Building data infrastructures for science

Digital Ambition: NHM Science Strategy 2013-2017

A New Voyage of Discovery

Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement

Five Challenges
1. The digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills

Resources & funding

Measuring success

Near-term infrastructure horizons
• Collections digitisation
• Large-scale use of collections data
• New approaches to biodiversity discovery

Page 18: Building data infrastructures for science

Collections digitisation (data mobilisation)

Target
• 20M specimens available digitally in 5 years

Challenges
• Current fragmented efforts
• Heterogeneity of process
• Existing data (2.8M lots; 400k geo.; 120k images)
• Scale of operation (iCollections, 130k in 1 year)
• Transcription (Citizen Sci. / crowdsourcing)
• Data quality, annotation & feedback

Resources & funding
• Expensive (£20-£60M @ £1-3 per specimen)
• Linked to our public offer

Next steps (Sept. 2013)
• Coll. Descriptions & protocols
• Greater coordination of effort
• Programme group with project portfolio?
• Planning of digital access via NHM Data Portal
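
The cost range and the scale challenge on this slide follow from simple arithmetic, sketched below in Python. The four constants are the figures quoted on the slide; the function names are hypothetical, and the sketch assumes (this is not stated on the slide) that the full 20M target still has to be digitised and that the iCollections rate is a fair single-project baseline.

```python
# Figures quoted on the slide; everything else here is illustrative.
TARGET_SPECIMENS = 20_000_000     # "20M specimens available digitally in 5 years"
TARGET_YEARS = 5
COST_PER_SPECIMEN_GBP = (1, 3)    # "£1-3 per specimen"
ICOLLECTIONS_PER_YEAR = 130_000   # "iCollections, 130k in 1 year"


def cost_range_gbp(specimens, unit_costs):
    """Return (low, high) total cost for a (low, high) unit cost."""
    low, high = unit_costs
    return specimens * low, specimens * high


def required_rate_per_year(specimens, years):
    """Average number of specimens to digitise per year to hit the target."""
    return specimens / years


low, high = cost_range_gbp(TARGET_SPECIMENS, COST_PER_SPECIMEN_GBP)
rate = required_rate_per_year(TARGET_SPECIMENS, TARGET_YEARS)
# Assumption: compare against the iCollections throughput as a baseline.
scale_up = rate / ICOLLECTIONS_PER_YEAR

print(f"Cost: £{low / 1e6:.0f}M to £{high / 1e6:.0f}M")
print(f"Required rate: {rate:,.0f} specimens/year")
print(f"Scale-up over iCollections rate: about {scale_up:.0f}x")
```

Under these assumptions the total matches the £20-£60M quoted above, and the target implies roughly a thirty-fold increase over the iCollections throughput, which is why coordination and crowdsourced transcription appear among the challenges.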

Page 19: Building data infrastructures for science

Large scale use of collections data (or why digitise)

Data applications help set digitisation priorities

Potential applications for NHM data
• Invasive alien species
• Impacts of climate change
• Species conservation & protected areas
• Impacts of human development
• Biodiversity & human health
• Food, farming & biofuels

Sustainable delivery of data
• NHM Data portal
• Promote access & reuse of data
• Sub-portals for specific themes
• Delivering content to third parties (e.g. GBIF)

Next steps (requirements)
• Storage (Access, backup & archival)
• Citation, linking & measuring impact (identifiers)
• Data layering & visualisation
• H.P.C. (Ecol. niche modeling & analysis)

[Figure: NHM Data Portal data visualisation. Bar chart of Crop Wild Relatives by plant family, axis 0 to 1000. Families shown: Poaceae, Leguminosae, Brassicaceae, Rosaceae, Solanaceae, Compositae, Rubiaceae, Vitaceae, Anacardiaceae, Araceae, Arecaceae, Moraceae, Malvaceae, Musaceae, Cucurbitaceae, Amaryllidaceae, Grossulariaceae, Amaranthaceae, Aquifoliaceae, Theaceae, Juglandaceae, Euphorbiaceae, Apiaceae, Caricaceae, Asparagaceae, Dioscoreaceae, Pedaliaceae, Rutaceae, Lauraceae, Betulaceae, Convolvulaceae, Myrtaceae, Oleaceae, Zingiberaceae, Bromeliaceae, Piperaceae, Lecythidaceae]

Page 20: Building data infrastructures for science

New approaches to biodiversity discovery (new types of data)

Take home messages from NHM Tropical Biodiversity Symposium

Molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (env. sequencing) commonplace
• Whole genomes are normal
• The primary route to understanding biodiversity for many

Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity
• Supplement field research, fills in gaps & scales

Digital infrastructure requirements
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map to existing NHM collections infrastructures
• Challenge current networking & storage capacity
• Digital and physical collections become equally important?

3-4 June 2013, NHM

22 July, 2013

Page 21: Building data infrastructures for science

4. Community decadal challenges

Page 22: Building data infrastructures for science

The long view: community informatics challenges

GBIF GBIC Report (Coming soon)

EU Biodiversity Strategy (2011)

Biodiv. Inf. Challenges (2013)

Page 23: Building data infrastructures for science

Modeling the biosphere: a (the) 30 year goal?

Nature 2013, doi:10.1038/493295a

A clear, singular long-term vision that NHM data can contribute to

Page 24: Building data infrastructures for science

QUESTIONS

Page 25: Building data infrastructures for science

Infrastructure design principles*

1. Start with needs - focus on real user needs (not just the ‘official process’)

2. Do less - if someone else is doing it, link to it or use it

3. Design with data - prototype and test with real users on the live website

4. Do the hard work to make it simple - let the computer take the strain

5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable

6. Build for inclusion – it’s easier in the long run

7. Understand context - we are designing for people, not a screen or a brand

8. Build digital services, not websites - there is life beyond the website

9. Be consistent, not uniform - every circumstance is different

10. Make things open: it makes things better - it’s more sustainable

= experience from 7 years with the Scratchpads
= lessons for building NHM data infrastructures?

*https://www.gov.uk/designprinciples

Page 26: Building data infrastructures for science

Better NHM digital coordination from 2013

Digital Strategy Group
• Developing common vision
• High level strategy
• Director level engagement (Science, PEG & Corp. Services)

Digital Design Group
• Delivering & leading digital activities
• Fund raising (internal & external)
• Prioritisation

Digital Programme Group
• Administrative support
• Resource management
• Analysis of impact