iDigBio Technology, Cloud and Appliances. Jose Fortes (on behalf of the iDigBio IT team). Paleocollections Workshop, Gainesville, Florida, April 27, 2012. Supported by NSF Award EF-1115210.


Page 1: iDigBio  Technology,  Cloud and Appliances

iDigBio Technology, Cloud and Appliances
Jose Fortes (on behalf of the iDigBio IT team)

Paleocollections Workshop
Gainesville, Florida

April 27, 2012
Supported by NSF Award EF-1115210

Page 2: iDigBio  Technology,  Cloud and Appliances

Advanced Computing and Information Systems laboratory

iDigBio (idigbio.org)

Goal: making data and images for millions of biological specimens available in electronic format for the biological research community, agencies, students, educators, and public

Mission: leadership, coordination, and outreach in digitization of collections by implementing resources for communication, use of technology, access to data, research and education. The “Hub” part of the NSF ADBC program, aggregating TCNs and PENs

A resource: permanent cloud computing infrastructure
• to link biological data from collections across the USA
• to use search and analytics tools to mine and reference data

Page 3: iDigBio  Technology,  Cloud and Appliances

iDigBio IT Vision

Cyberinfrastructure to enable
• the collaborative creation, integration and management of digitized biocollections,
• their use in scientific research, education and outreach

Visible as a collection of persistent Internet-accessible services, data and resources
• For biocollection “producers”
• For biocollection “consumers”
• For biocollection service providers
• For cyberinfrastructure providers
• For national/global data aggregators

Page 4: iDigBio  Technology,  Cloud and Appliances

CI Stakeholders

[Diagram: iDigBio and five groups of stakeholders: Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, and National/Global Data Aggregators. Labels shown: Museums, Collectors, Government, TCNs, Amazon WS, Google, Microsoft Azure, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, Translation, OCR, Researchers, Teachers, Citizens, GBIF, ALA, EOL, BISON, DataONE, NESCent, Data Conservancy, iPlant.]

Page 5: iDigBio  Technology,  Cloud and Appliances

Stakeholders APIs

[Diagram: the same stakeholder groups as on the previous slide (Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, National/Global Data Aggregators) around iDigBio, annotated with the items exchanged. Exchange labels shown: Domain data; BLOBs; Appliances; Updates; Notification; Query results; Customer Requests; Processed data; Domain-level data; Usage track. Entity labels shown: Museums, Collectors, Government, TCNs, Amazon WS, Google, Microsoft Azure, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, Translation, OCR, Researchers, Teachers, Citizens, GBIF, ALA, EOL, BISON, DataONE, NESCent, Data Conservancy, iPlant.]

Page 6: iDigBio  Technology,  Cloud and Appliances

Interface Model for iDigBio and TCNs

Infrastructure Providers, National/Global Data Aggregators, Domain Service Providers, Domain Data Consumers

[Diagram: iDigBio + Resources, connected to external parties through standard interfaces.]

Standards and protocols shown: TDWG, XMPP, OCCI WG, REST WS, WS-I, TAPIR, HTTP, SQL, UTF-8, RDF, XML, X.509, OpenID, SAML, TCP, JPEG2000, ODBC

Resources shown: Virtual Appliances, Machines, Storage, Networking, Learning Modules, Archiving, Data Collections, Structured Data Services, Non-structured Data Services, Wiki, Workshop Resources, Workflow Engines, Taxonomic Validation, Data Conversion, Geographical Mapping, Collaboration Tools

External parties shown: TCNs, National History Museums, Google App Engine, XSEDE, Microsoft Live, Amazon EC2/S3, Applied Innovations, Microsoft Azure, Google Apps, BISON/Federal Collections, iPlant, NCBI, LifeMapper, ALA, EOL, NESCent, Academic Clouds, DataONE

Page 7: iDigBio  Technology,  Cloud and Appliances

Building the iDigBio Cloud

Cloud-based strategy
• Providing useful services/APIs (programmatic and web-based)
• Federated scalable object storage and information processing
• Digitization-oriented virtual appliances
• Reliance on standards, proven solutions and sustainable software

Continuous consultation with stakeholders
• Surveys, workgroups, summit/workshops, person-to-person …

Page 8: iDigBio  Technology,  Cloud and Appliances

Keeping our eyes on the ball
• Common/frequent needs: archival storage, server hosting, feedback on the data, data-intensive transformations …
• 10-year tsunami of requirements: from being on Facebook to multilingual search-and-compute across multiple data sets …

Page 9: iDigBio  Technology,  Cloud and Appliances

Evolution of iDigBio capabilities

[Timeline chart; time axis marked Q3/2012, Q3/2013, Q3/2014, Q3/2015. Capabilities shown along the axis:]
• Data ingestion
• Data access, provision and visualization
• Provide and enable data feedback
• Data linking and federation
• Process and visualize integrated data

• Increasing storage and server hosting in support of the above
• Increasing number of appliances in support of the above
• Web site for interaction with public, community, education and above

Page 10: iDigBio  Technology,  Cloud and Appliances

Near-term goals: ingest data

• Textual data
  o JSON document database
  o Data ingestion via DwC-a files
  o Get / Set API
• Image data
  o Internet-accessible object storage
  o Upload appliance
  o Limited access to low-level APIs

[Diagram labels: Textual Data (RIAK); Image Data (SWIFT); API Gateway; Internet access]
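
The DwC-a ingestion path into a JSON document store could look roughly like the sketch below. It is only an illustration, not iDigBio's implementation. It assumes the archive's core data file is a tab-delimited `occurrence.txt` with a header row (real archives describe their layout in `meta.xml`, which this sketch ignores), and the function names are hypothetical.

```python
import csv
import io
import json
import zipfile


def dwca_rows_to_json(archive, core_file="occurrence.txt"):
    """Return one JSON document (as a string) per row of the core data file.

    `archive` is a path or a file-like object holding the zipped archive.
    The core file name and the tab delimiter are assumptions of this sketch.
    """
    docs = []
    with zipfile.ZipFile(archive) as zf:
        with zf.open(core_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text, delimiter="\t"):
                docs.append(json.dumps(row, sort_keys=True))
    return docs


def make_sample_archive():
    """Build a tiny in-memory archive so the sketch can be tried without files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "occurrence.txt",
            "occurrenceID\tscientificName\nocc-1\tQuercus alba\n",
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for doc in dwca_rows_to_json(make_sample_archive()):
        print(doc)
```

Each resulting string is the kind of document a Get / Set API could store under a record key; how keys are chosen is left open here.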

Page 11: iDigBio  Technology,  Cloud and Appliances

Medium-term goals

• Textual Data
  o JSON document database
  o Data ingestion via DwC-a files
  o Rich RESTful API
• Image Data
  o Web-accessible object storage
  o Upload appliance
  o Fully abstracted storage
• Indexing and Search
  o Extract EXIF data from images
  o Limited but useful set of indexes
  o Intuitive search UI
  o Search available via API
• Portal
  o Consumes and interfaces text, image and search APIs (minimal server-side code)
  o Web-based mapping: client-side JavaScript limits usable record count to about 50k records at a time

[Diagram labels: Textual Data (RIAK); Image Data (SWIFT); API Gateway; Internet access; Filter Set; Query interface; EXIF extraction; iDigBio Portal]
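
The roughly 50k-record ceiling on client-side mapping suggests handing records to the map in bounded batches. Below is a minimal sketch of that idea, assuming the records arrive as any Python iterable. The constant and function names are hypothetical, and the cap is simply the approximate figure quoted on this slide.

```python
from itertools import islice

# Approximate per-request cap taken from the slide; not a measured limit.
MAX_RECORDS_PER_MAP = 50_000


def batches(records, size=MAX_RECORDS_PER_MAP):
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for i, chunk in enumerate(batches(range(120_000))):
        print(i, len(chunk))
```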

Page 12: iDigBio  Technology,  Cloud and Appliances

(Very) Long-term Goals

Page 13: iDigBio  Technology,  Cloud and Appliances

Virtual appliance cycle

[Diagram labels: download; instantiate; Domain expert; iDigBio; Users at TCNs; Collections Community; Requirements, standards]

Page 14: iDigBio  Technology,  Cloud and Appliances

Toolbox Workflow Example

[Workflow diagram. Labels: Linux, MySQL, Specify, GEOlocate; TCN server; Cloud providers (Amazon, Azure…); iDigBio Cloud; Global Aggregators; Domain Data Consumer. Numbered steps:]

(1) Download iDigBio appliance
(2) Data entry, improvement
(3) Data ingested into iDigBio
(4a) Data publishing
(4b) Replication Services
(5) Download analysis appliance
(6) Search
(7) Visualization

Page 15: iDigBio  Technology,  Cloud and Appliances

Short term

[Diagram labels: Ingestion appliance; Web-based UI; Images captured (e.g. HD/flash media); /images /1/100.tif /1/101.tif /2/200.tif …; File interface; Web server; Cloud client; Batch upload, Cloud APIs; iDigBio object storage cloud (Swift); /1/100.tif GUID1, /1/101.tif GUID2]

• Facilitate data ingestion, interface with iDigBio
• Tools identified by community in workshops/groups
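
A minimal sketch of the path-to-GUID bookkeeping shown on this slide. It assumes random UUIDs serve as the GUIDs, which the slides do not specify, and it leaves out the actual upload to object storage. `assign_guids` is a hypothetical name.

```python
import uuid
from pathlib import PurePosixPath


def assign_guids(paths):
    """Map each image path to a freshly generated random GUID string."""
    return {str(PurePosixPath(p)): str(uuid.uuid4()) for p in paths}


if __name__ == "__main__":
    manifest = assign_guids(["/1/100.tif", "/1/101.tif", "/2/200.tif"])
    for path, guid in manifest.items():
        print(path, guid)
```

Keeping the mapping as a plain dictionary makes it easy to write out as a manifest alongside a batch upload, so later processing can refer to images by GUID rather than by local path.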

Page 16: iDigBio  Technology,  Cloud and Appliances

Medium-term – “Marketplace”

[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; End users; iDigBio Personnel; iDigBio appliances; Proposals]

Page 17: iDigBio  Technology,  Cloud and Appliances

Long-term – information processing

[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; Download; End users; iDigBio Personnel; Deploy; Specimen Database; Workflows; Map/Reduce]

Page 18: iDigBio  Technology,  Cloud and Appliances

Summary

iDigBio cloud
• Service-oriented standards-based cyberinfrastructure focused on the ADBC community needs
• Scalable data management and information processing using standard interfaces, data formats, protocols, tools

Toolboxes as appliances
• Evolving collection of community-selected tools
• Built-in interfaces for effortless iDigBio integration
• Embedded best practices and standards in biocollections work

Software re-use when open-source, well maintained, manageable, sustainable and efficient to re-purpose

Feedback and suggestions: [email protected] and “Contacts” at idigbio.org

Page 19: iDigBio  Technology,  Cloud and Appliances

Acknowledgments

National Science Foundation
• Judith Skog and Anne Maglia

iDigBio team at University of Florida and Florida State University

Page 20: iDigBio  Technology,  Cloud and Appliances

Extras

Page 21: iDigBio  Technology,  Cloud and Appliances

Examples

• Image ingestion appliances (short term)
  o Batch upload of several images from a local storage device/file system to cloud storage
  o Generate GUID/URLs for later processing
  o Reliable transfers using cloud APIs (e.g. Swift/iDigBio)
• Post-processing appliances
  o OCR tools; end-user or for batch processing
• Geo-referencing appliances
  o Training/verification
• Research workflow appliances
  o Data-intensive/batch processing workflows; e.g. data mining, image processing

Page 22: iDigBio  Technology,  Cloud and Appliances

Now: appliance proposal process

• By users/developers through the iDigBio Web portal
  o Requirements – demonstrates usage/buy-in, software license, documentation, etc.
• Queue of appliances for integration
  o iDigBio will prioritize and work with developers
• Leverage expertise in appliance development
  o Focus on images that users can download and run on VMware, VirtualBox
  o Application, in addition to appliance, if applicable/desirable

Page 23: iDigBio  Technology,  Cloud and Appliances

Virtual Appliances in iDigBio

• Packaging of software and dependencies in virtual machines
  o End user/desktop (e.g. VMware, VirtualBox)
  o Infrastructure-as-a-Service clouds (e.g. OpenStack)
  o Enhance user experience, facilitate integration with cloud
• Image ingestion appliances (short term)
  o Batch upload of images from a local storage to cloud
  o Generate GUID/URLs for later processing
  o Reliable transfers using cloud APIs (e.g. Swift/iDigBio)
• Post-processing appliances (OCR tools; end-user or batch)
• Geo-referencing appliances (Training/verification)
• Research appliances (Data-intensive/batch workflows)

Page 24: iDigBio  Technology,  Cloud and Appliances

iDigBio Cloud Internal Architecture

[Architecture diagram. Labels: Domain Data Producers; National/Global Data Aggregators; API/XML, Consumer, GBIF, Morphbank, …; iDigBio Collections Management; Media Data/Metadata; Specimen-record objects; Specimen-image objects; Publish, Comment, Updates, Notifications; Data Intensive Processing; Compute (NOVA); Object store (SWIFT); Database (RIAK)]

Initial deployment on UF ACIS resources; partially replicated at FSU for reliability and performance

Page 25: iDigBio  Technology,  Cloud and Appliances

Archer cyber-infrastructure

[Diagram labels: Archer seed resources; Local resource pools: servers, clusters, desktop labs; User desktops; Self-configuring virtual appliances; Deployment, support, configuration, troubleshooting; Archer software and management; Voluntary resources; Web portal, documentation, tutorials; Community-contributed content: applications, datasets]

www.archer-project.org

Page 26: iDigBio  Technology,  Cloud and Appliances

Unique UF+FSU IT resources

Excellent resources
• Computational
  o ACIS lab: 14 clusters, 700+ cores, 500 Terabytes
  o 3 HP centers: ~6000 cores, 300 Terabytes
• Networking to/from UF and FSU
  o 10 Gbit connectivity to UF Campus Research Network
  o 10 Gbit connections to Florida Lambda Rail, National Lambda Rail, and Internet2

Page 27: iDigBio  Technology,  Cloud and Appliances

Invasive Species
• Where have they been introduced, and how quickly are they spreading?
• What is the pattern of spread, and do they covary with other taxa?
• What is the effect of climate change on the spread of invasives?

Page 28: iDigBio  Technology,  Cloud and Appliances

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

Vascular Plant Diversity in Florida
• 2609 species (of 4200), all included in phylogeny
• 203 species endemic to Florida
• Ratio of endemics to all species
• ~200,000 location points; data from UF, FSU, USF, GBIF, FNAI

Page 29: iDigBio  Technology,  Cloud and Appliances

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

Vascular Plant Diversity in Florida
2609 species (of ~4200), all included in phylogeny

+

Phylogenetic tree, 2609 species
GenBank, new (1000 spp)

Page 30: iDigBio  Technology,  Cloud and Appliances

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

• Integrate distribution data, ecological data, climate models, phylogeny
• How does species diversity compare to phylogenetic diversity?
• How do species diversity and phylogenetic diversity change?
• How do invasive species respond?
• Integrate across clades
• Develop workflows to facilitate such studies

D. Soltis, G. Burleigh, C. Germain-Aubrey, J. Allen, L. Majure

Page 31: iDigBio  Technology,  Cloud and Appliances

Research & Scientific Outreach
• Foster, encourage, enhance, enable research using collections data
• Foster research in IT
• Integrate with various research communities
• Work with research communities to develop collections and research-related workshops and symposia at meetings
• Work with research communities to develop interfaces with data repositories, etc. to promote integrated research
• Coordinate these efforts with TCNs and PENs

Page 32: iDigBio  Technology,  Cloud and Appliances

Linking Collections to Ecology

Through collections from LTERs

Page 33: iDigBio  Technology,  Cloud and Appliances

Linking Collections to Ecology

Through NEON (National Ecological Observatory Network)
• Biological monitoring at sites across USA; collections
• Baseline for changes in species distribution and abundance over time

Page 34: iDigBio  Technology,  Cloud and Appliances

Paleobiology Database (http://paleodb.org/cgi-bin/bridge.pl)

Linking Collections to Paleobiology

Page 35: iDigBio  Technology,  Cloud and Appliances

Linking Collections to Genomics

National network of tissue and genetic resources

Page 36: iDigBio  Technology,  Cloud and Appliances

Linking Collections to Genomics

Extend HUB connections to genomics databases

Page 37: iDigBio  Technology,  Cloud and Appliances

Linking to Living Collections

Botanical gardens, zoos, culture collections

Page 38: iDigBio  Technology,  Cloud and Appliances

Interactions with Systematics Community and Beyond

• Facilitate digitization efforts
• Coordinate with other databasing efforts in systematics
• Connect to databases outside systematics: ecology to genomics (NEON to GenBank)

Page 39: iDigBio  Technology,  Cloud and Appliances

Interactions Fostered Through…

• Discussions at national meetings of professional societies (systematics, ecology, evolution, genomics)
• Workshops to engage members of systematics community
• Workshops to engage members of different communities

Page 40: iDigBio  Technology,  Cloud and Appliances

Unique UF+FSU record

Track record of building cyberinfrastructure
• PUNCH and In-VIGO
  o Nanohub, Netcare, In-VIGOBlast …
• Morphbank
• AFRESH
• Telecenter
• Archer

Page 41: iDigBio  Technology,  Cloud and Appliances

Archer cyber-infrastructure

• Hundreds of distributed compute/router nodes
• 24/7 operation, 650+ cores
• Custom appliance image for computer architecture community
• Job scheduling across participating institutions

Page 42: iDigBio  Technology,  Cloud and Appliances

Research Questions

• How are species distributed in geographical and ecological space?
• What is the history of life on Earth?
• What factors lead to speciation, dispersal, and extinction?
• What are the impacts of climate change likely to be?
• What information is needed for effective conservation strategies?

Slide provided by Pam Soltis