iDigBio Technology, Cloud and Appliances
Jose Fortes (on behalf of the iDigBio IT team)
Paleocollections Workshop, Gainesville, Florida
April 27, 2012
Supported by NSF Award EF-1115210
Advanced Computing and Information Systems laboratory 2
iDigBio (idigbio.org)
Goal: making data and images for millions of biological specimens available in electronic format for the biological research community, agencies, students, educators, and public
Mission: leadership, coordination, and outreach in digitization of collections by implementing resources for communication, use of technology, access to data, research and education. The “Hub” part of the NSF ADBC program, aggregating TCNs and PENs
A resource: permanent cloud computing infrastructure
to link biological data from collections across the USA
to use search and analytics tools to mine and reference data
iDigBio IT Vision
Cyberinfrastructure to enable
the collaborative creation, integration and management of digitized biocollections,
their use in scientific research, education and outreach
Visible as a collection of persistent Internet-accessible services, data and resources
For biocollection “producers”
For biocollection “consumers”
For biocollection service providers
For cyberinfrastructure providers
For national/global data aggregators
CI Stakeholders
[Diagram: iDigBio and five stakeholder groups: Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, National/Global Data Aggregators. Entities shown: Museums, TCNs, Collectors, Researchers, Teachers, Citizens, Government, Amazon WS, Microsoft Azure, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, Translation, OCR, GBIF, ALA, EOL, BISON, DataONE, NESCent, Data Conservancy, iPlant.]
Stakeholders APIs
[Diagram: iDigBio connected by API flows to five stakeholder groups: Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, National/Global Data Aggregators. Entities shown: Museums, TCNs, Collectors, Researchers, Teachers, Citizens, Government, Amazon WS, Microsoft Azure, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, Translation, OCR, GBIF, ALA, EOL, BISON, DataONE, NESCent, Data Conservancy, iPlant. Flow labels: Domain data; BLOBs; Appliances; Updates; Notification; Query results; Customer Requests; Processed data; Domain-level data; Usage track.]
Interface Model for iDigBio and TCNs
[Diagram: iDigBio + Resources, interfacing with Infrastructure Providers, National/Global Data Aggregators, Domain Service Providers and Domain Data Consumers.]
Standards and protocols shown: TDWG, XMPP, OCCI WG, REST WS, WS-I, TAPIR, HTTP, SQL, UTF-8, RDF, XML, X.509, OpenID, SAML, TCP, JPEG2000, ODBC
Resources shown: Virtual Appliances, Machines, Storage, Networking, Learning Modules, Archiving, Data Collections, Structured Data Services, Non-structured Data Services, Wiki, Workshop Resources, Workflow Engines, Taxonomic Validation, Data Conversion, Geographical Mapping, Collaboration Tools
External parties shown: TCNs, National History Museums, Google App Engine, XSEDE, Microsoft Live, Amazon EC2/S3, Applied Innovations, Microsoft Azure, Google Apps, BISON/Federal Collections, iPlant, NCBI, LifeMapper, ALA, EOL, NESCent, Academic Clouds, DataONE
Building the iDigBio Cloud
Cloud-based strategy
Providing useful services/APIs (programmatic and web-based)
Federated scalable object storage and information processing
Digitization-oriented virtual appliances
Reliance on standards, proven solutions and sustainable software
Continuous consultation with stakeholders
Surveys, workgroups, summit/workshops, person-to-person …
Keeping our eyes on the ball
Common/frequent needs: archival storage, server hosting, feedback on the data, data-intensive transformations …
10-year tsunami of requirements: from being on Facebook to multilingual search-and-compute across multiple data sets …
Evolution of iDigBio capabilities
[Timeline with time axis marked Q3/2012, Q3/2013, Q3/2014, Q3/2015; capabilities added over time:]
Data ingestion
Data access, provision and visualization
Provide and enable data feedback
Data linking and federation
Process and visualize integrated data
Increasing storage and server hosting in support of the above
Increasing number of appliances in support of the above
Web site for interaction with public, community, education and above
Near-term goals: ingest data
• Textual data
o JSON document database
o Data ingestion via DwC-A files
o Get / Set API
• Image data
o Internet-accessible object storage
o Upload appliance
o Limited access to low-level APIs
[Diagram: Internet access; API Gateway; Textual Data (RIAK); Image Data (SWIFT)]
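The near-term ingestion path (DwC-A files loaded into a JSON document store behind a Get / Set API) could be sketched roughly as below. This is an illustration under stated assumptions, not the iDigBio implementation: it assumes the archive's core file is a tab-delimited `occurrence.txt` with a header row and an `id` column, it skips `meta.xml` parsing entirely, and `DocumentStore` is a hypothetical in-memory stand-in for the Riak-backed store.

```python
import csv
import io
import json
import zipfile


class DocumentStore:
    """Hypothetical in-memory stand-in for a key/value JSON document store."""

    def __init__(self):
        self._docs = {}

    def set(self, key, record):
        # Keep the record as serialized JSON, as a document database would.
        self._docs[key] = json.dumps(record, sort_keys=True)

    def get(self, key):
        return json.loads(self._docs[key])


def ingest_dwca(archive, store, core_file="occurrence.txt", id_field="id"):
    """Load every row of the archive's core file into the store.

    `archive` is a path or a binary file-like object. Assumes a
    tab-delimited core file with a header row; a real reader would
    consult meta.xml instead. Returns the number of records ingested.
    """
    count = 0
    with zipfile.ZipFile(archive) as zf:
        with zf.open(core_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text, delimiter="\t"):
                store.set(row[id_field], row)
                count += 1
    return count


def build_sample_archive():
    """Build a tiny in-memory archive (made-up records, demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "occurrence.txt",
            "id\tscientificName\tcountry\n"
            "rec-1\tQuercus alba\tUSA\n"
            "rec-2\tPinus palustris\tUSA\n",
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    store = DocumentStore()
    ingested = ingest_dwca(build_sample_archive(), store)
    print(ingested, store.get("rec-1")["scientificName"])
```

Keeping each record as one JSON document keyed by its identifier is what makes a plain Get / Set interface sufficient at this stage; richer querying is deferred to the medium-term indexing work.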
Medium-term goals
• Textual Data
o JSON document database
o Data ingestion via DwC-A files
o Rich RESTful API
• Image Data
o Web-accessible object storage
o Upload appliance
o Fully abstracted storage
• Indexing and Search
o Extract EXIF data from images
o Limited but useful set of indexes
o Intuitive search UI
o Search available via API
• Portal
o Consumes and interfaces text, image and search APIs (minimal server-side code)
o Web-based mapping: client-side JavaScript limits usable record count to about 50k records at a time
[Diagram: Internet access; API Gateway; Textual Data (RIAK); Image Data (SWIFT); Filter Set Query interface; EXIF extraction; iDigBio Portal]
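One possible reading of "limited but useful set of indexes" combined with a filter-set query interface is an inverted index over a few chosen fields, where filters are combined by set intersection. The sketch below only illustrates that idea; the `RecordIndex` class, its method names, and the field names are hypothetical and do not come from any iDigBio API.

```python
from collections import defaultdict


class RecordIndex:
    """Hypothetical inverted index over a small, fixed set of fields."""

    def __init__(self, indexed_fields):
        self.indexed_fields = tuple(indexed_fields)
        self._records = {}
        # field name -> normalized value -> set of record ids
        self._index = {f: defaultdict(set) for f in self.indexed_fields}

    @staticmethod
    def _norm(value):
        # Case-insensitive, whitespace-trimmed matching.
        return str(value).strip().lower()

    def add(self, record_id, record):
        self._records[record_id] = record
        for field in self.indexed_fields:
            if field in record:
                self._index[field][self._norm(record[field])].add(record_id)

    def query(self, **filters):
        """Return sorted ids of records matching all filters (logical AND)."""
        result = None
        for field, value in filters.items():
            if field not in self._index:
                raise KeyError(f"field is not indexed: {field}")
            ids = self._index[field].get(self._norm(value), set())
            result = set(ids) if result is None else result & ids
        if result is None:
            # No filters given: every record matches.
            result = set(self._records)
        return sorted(result)


if __name__ == "__main__":
    idx = RecordIndex(["genus", "state"])
    idx.add("rec-1", {"genus": "Quercus", "state": "Florida"})
    idx.add("rec-2", {"genus": "Quercus", "state": "Georgia"})
    idx.add("rec-3", {"genus": "Pinus", "state": "Florida"})
    print(idx.query(genus="quercus", state="florida"))
```

Rejecting queries on non-indexed fields keeps the index set deliberately small, at the cost of flexibility. The same filter results could also be paged to the portal in batches, given the roughly 50k-record limit noted for client-side mapping.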
(Very) Long-term Goals
Virtual appliance cycle
[Diagram labels: download; instantiate; Domain expert; iDigBio; Users at TCNs; Collections Community; Requirements, standards]
Toolbox Workflow Example
(1) Download iDigBio appliance
(2) Data entry, improvement
(3) Data ingested into iDigBio
(4a) Data publishing
(4b) Replication Services
(5) Download analysis appliance
(6) Search
(7) Visualization
[Diagram labels: iDigBio Cloud; TCN server; Linux, MySQL, Specify, GEOlocate; Cloud providers (Amazon, Azure …); Global Aggregators; Domain Data Consumer]
Short term
Ingestion appliance
Web-based UI
Images captured (e.g. HD/flash media): /images /1/100.tif /1/101.tif /2/200.tif …
iDigBio object storage cloud (Swift)
Batch upload, Cloud APIs
Web server
Cloud client
File interface
/1/100.tif GUID1
/1/101.tif GUID2
Facilitate data ingestion, interface with iDigBio
Tools identified by community in workshops/groups
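The path-to-GUID mapping on this slide (for example `/1/100.tif` paired with `GUID1`) might be produced by a small manifest step before batch upload. The following is a minimal sketch under assumptions: GUIDs are random UUIDs, only `.tif` files are considered, and `upload` is a caller-supplied function standing in for whatever cloud client the appliance uses. No real Swift or iDigBio API is called.

```python
import tempfile
import uuid
from pathlib import Path


def assign_guids(root, pattern="*.tif"):
    """Map each image path (relative to root, POSIX style) to a new GUID."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            rel = "/" + path.relative_to(root).as_posix()
            manifest[rel] = str(uuid.uuid4())
    return manifest


def batch_upload(root, manifest, upload):
    """Call upload(local_path, guid) for every manifest entry.

    `upload` is a hypothetical, caller-supplied callable. Entries whose
    upload raises OSError are collected and returned so they can be retried.
    """
    failed = []
    for rel, guid in manifest.items():
        try:
            upload(Path(root) / rel.lstrip("/"), guid)
        except OSError:
            failed.append(rel)
    return failed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Recreate the directory layout used as the example on the slide.
        for rel in ("1/100.tif", "1/101.tif", "2/200.tif"):
            target = Path(tmp) / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(b"")
        manifest = assign_guids(tmp)
        for rel, guid in manifest.items():
            print(rel, guid)
        failed = batch_upload(tmp, manifest, lambda path, guid: None)
        print("failed uploads:", failed)
```

Assigning the GUID before the transfer means a failed upload can be retried under the same identifier, which fits the stated goal of reliable transfers.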
Medium-term – “Marketplace”
[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; End users; iDigBio Personnel; iDigBio appliances; Proposals]
Long-term – information processing
[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; Download; End users; iDigBio Personnel; Deploy; Specimen Database; Workflows; Map/Reduce]
Summary
iDigBio cloud
Service-oriented standards-based cyberinfrastructure focused on the ADBC community needs
Scalable data management and information processing using standard interfaces, data formats, protocols, tools
Toolboxes as appliances
Evolving collection of community-selected tools
Built-in interfaces for effortless iDigBio integration
Embedded best practices and standards in biocollections work
Software re-use when open-source, well maintained, manageable, sustainable and efficient to re-purpose
Feedback and suggestions [email protected] and “Contacts” at idigbio.org
Acknowledgments
National Science Foundation
Judith Skog and Anne Maglia
iDigBio team at University of Florida and Florida State University
Extras
Examples
Image ingestion appliances (short term)
Batch upload of several images from a local storage device/file system to cloud storage
Generate GUID/URLs for later processing
Reliable transfers using cloud APIs (e.g. Swift/iDigBio)
Post-processing appliances
OCR tools; end-user or for batch processing
Geo-referencing appliances
Training/verification
Research workflow appliances
Data-intensive/batch processing workflows; e.g. data mining, image processing
Now: appliance proposal process
By users/developers through the iDigBio Web portal
Requirements: demonstrates usage/buy-in, software license, documentation, etc.
Queue of appliances for integration
iDigBio will prioritize and work with developers
Leverage expertise in appliance development
Focus on images that users can download and run on VMware, VirtualBox
Application, in addition to appliance, if applicable/desirable
Virtual Appliances in iDigBio
Packaging of software and dependencies in virtual machines
End user/desktop (e.g. VMware, VirtualBox)
Infrastructure-as-a-Service clouds (e.g. OpenStack)
Enhance user experience, facilitate integration with cloud
Image ingestion appliances (short term)
Batch upload of images from local storage to cloud
Generate GUID/URLs for later processing
Reliable transfers using cloud APIs (e.g. Swift/iDigBio)
Post-processing appliances (OCR tools; end-user or batch)
Geo-referencing appliances (Training/verification)
Research appliances (Data-intensive/batch workflows)
iDigBio Cloud Internal Architecture
[Diagram: iDigBio Collections Management built on Object store (SWIFT), Database (RIAK) and Compute (NOVA). Other labels: Media; Data/Metadata; Data Intensive Processing; Specimen-record objects; Specimen-image objects; Publish, Comment, Updates, Notifications; Domain Data Producers; National/Global Data Aggregators; API/XML Consumer (GBIF, Morphbank …).]
Initial deployment on UF ACIS resources; partially replicated at FSU for reliability and performance
Archer cyber-infrastructure
Archer seed resources
Local resource pools: servers, clusters, desktop labs
User desktops
Self-configuring virtual appliances
Deployment, support, configuration, troubleshooting
Archer software and management
Voluntary resources
Web portal, documentation, tutorials
Community-contributed content: applications, datasets
www.archer-project.org
Unique UF+FSU IT resources
Excellent resources
Computational
ACIS lab: 14 clusters, 700+ cores, 500 Terabytes
3 HP centers: ~6000 cores, 300 Terabytes
Networking to/from UF and FSU
10 Gbit connectivity to UF Campus Research Network
10 Gbit connections to Florida Lambda Rail, National Lambda Rail, and Internet2
Invasive Species
Where have they been introduced, and how quickly are they spreading?
What is the pattern of spread, and do they covary with other taxa?
What is the effect of climate change on the spread of invasives?
Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change
Vascular Plant Diversity in Florida
2609 species (of 4200), all included in phylogeny
203 species endemic to Florida
Ratio of endemics to all species
~200,000 location points; data from UF, FSU, USF, GBIF, FNAI
Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change
Vascular Plant Diversity in Florida
2609 species (of ~4200), all included in phylogeny
+
Phylogenetic tree, 2609 species
GenBank, new (1000 spp)
Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change
Integrate distribution data, ecological data, climate models, phylogeny
How does species diversity compare to phylogenetic diversity?
How do species diversity and phylogenetic diversity change?
How do invasive species respond?
Integrate across clades
Develop workflows to facilitate such studies
D. Soltis, G. Burleigh, C. Germain-Aubrey, J. Allen, L. Majure
Research & Scientific Outreach
Foster, encourage, enhance, enable research using collections data
Foster research in IT
Integrate with various research communities
Work with research communities to develop collections and research-related workshops and symposia at meetings
Work with research communities to develop interfaces with data repositories, etc. to promote integrated research
Coordinate these efforts with TCNs and PENs
Linking Collections to Ecology
Through collections from LTERs
Linking Collections to Ecology
Through NEON
Biological monitoring at sites across USA; collections
Baseline for changes in species distribution and abundance over time
National Ecological Observatory Network
Paleobiology Database (http://paleodb.org/cgi-bin/bridge.pl)
Linking Collections to Paleobiology
Linking Collections to Genomics
National network of tissue and genetic resources
Linking Collections to Genomics
Extend HUB connections to genomics databases
Linking to Living Collections
Botanical gardens, zoos, culture collections
Interactions with Systematics Community and Beyond
Facilitate digitization efforts
Coordinate with other databasing efforts in systematics
Connect to databases outside systematics: ecology to genomics (NEON to GenBank)
Interactions Fostered Through…
Discussions at national meetings of professional societies (systematics, ecology, evolution, genomics)
Workshops to engage members of systematics community
Workshops to engage members of different communities
Unique UF+FSU record
Track record of building cyberinfrastructure
PUNCH and In-VIGO
Nanohub, Netcare, In-VIGO Blast …
Morphbank
AFRESH
Telecenter
Archer
Archer cyber-infrastructure
Hundreds of distributed compute/router nodes
24/7 operation, 650+ cores
Custom appliance image for computer architecture community
Job scheduling across participating institutions
Research Questions
• How are species distributed in geographical and ecological space?
• What is the history of life on Earth?
• What factors lead to speciation, dispersal, and extinction?
• What are the impacts of climate change likely to be?
• What information is needed for effective conservation strategies?
Slide provided by Pam Soltis