Grids for Chemical Informatics
Randall Bramley, Geoffrey Fox, Dennis Gannon, Beth Plale
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
What is a Grid?
• Name borrowed from the power grid.
• The concept: a ubiquitous information & computation resource
• A definition: a network of compute and data resources that has been supplemented with a layer of services that provide uniform and secure access to a set of applications of interest to a distributed community of users.
• Grids may be wide-area or enterprise
Scientific Challenges
The current and future generations of scientific problems are:
• Data oriented: increasingly stream based, often need petabyte archives
• In need of on-demand computing resources
• Conducted by geographically distributed teams of specialists who don’t want to become experts in grid computing.
Science Communities and Outreach
• Communities
  • CERN’s Large Hadron Collider experiments
  • Physicists working in HEP and similarly data intensive scientific disciplines
  • National collaborators and those across the digital divide in disadvantaged countries
• Scope
  • Interoperation between LHC Data Grid Hierarchy and ETF
  • Create and Deploy Scientific Data and Services Grid Portals
  • Bring the Power of ETF to bear on LHC Physics Analysis: Help discover the Higgs Boson!
• Partners
  • Caltech
  • University of Florida
  • Open Science Grid and Grid3
  • Fermilab
  • DOE PPDG
  • CERN
  • NSF GriPhyn and iVDGL
  • EU LCG and EGEE
  • Brazil (UERJ,…)
  • Pakistan (NUST,…)
  • Korea (KAIST,…)
LHC Data Distribution Model
Genetics and Disease Susceptibility
[Figure labels: Identify Genes; Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4; Predictive Disease Susceptibility; Physiology; Metabolism; Endocrine; Proteome; Immune; Transcriptome; Biomarker Signatures; Morphometrics; Pharmacokinetics; Ethnicity; Environment; Age; Gender]
Source: Terry Magnuson, UNC
On-Demand Storm Predictions
[Figure labels: Streaming Observations; Forecast Model; Data Mining; Storms Forming]
Information/Knowledge Grids
• Distributed (10’s to 1000’s) of data sources (instruments, file systems, curated databases …)
• Data Deluge: 1 (now) to 100’s petabytes/year (2012)
  • Moore’s law for Sensors
• Possible filters assigned dynamically (on-demand)
  • Run image processing algorithm on telescope image
  • Run Gene sequencing algorithm on compiled data
• Needs decision support front end with “what-if” simulations
• Metadata (provenance) critical to annotate data
• Integrate across experiments as in multi-wavelength astronomy
Data Deluge comes from pixels/year available
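The idea of filters assigned dynamically to a data stream, with provenance recorded alongside the data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind these slides: the FilterRegistry class, the to_celsius filter, and the record layout are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Filter = Callable[[Record], Record]


class FilterRegistry:
    """Hypothetical registry: filters can be attached or detached while a stream runs."""

    def __init__(self) -> None:
        self._filters: Dict[str, Filter] = {}

    def assign(self, name: str, func: Filter) -> None:
        self._filters[name] = func

    def remove(self, name: str) -> None:
        self._filters.pop(name, None)

    def apply(self, stream: Iterable[Record]) -> Iterator[Record]:
        for record in stream:
            out = dict(record)
            provenance: List[str] = list(out.get("provenance", []))
            # Snapshot the active filters for each record, so a filter
            # assigned between records takes effect on the next record.
            for name, func in list(self._filters.items()):
                out = func(out)
                provenance.append(name)
            # Provenance: which filters touched this record, in order.
            out["provenance"] = provenance
            yield out


def to_celsius(rec: Record) -> Record:
    """Made-up example filter: convert a Fahrenheit reading to Celsius."""
    rec = dict(rec)
    rec["value"] = round((rec["value"] - 32) * 5 / 9, 1)
    return rec


registry = FilterRegistry()
results = registry.apply(iter([{"value": 212.0}, {"value": 32.0}]))
first = next(results)                       # no filter assigned yet
registry.assign("to_celsius", to_celsius)   # filter assigned on demand
second = next(results)                      # filtered, with provenance
```

Because apply is a generator, the second record is processed only after the filter has been assigned, which is the on-demand behaviour the slide describes.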
Internet Scale Distributed Services
• Grids use Internet technology to manage sets of network connected resources
  • Classic Web: independent one-to-one access to individual resources
  • Grids integrate together and manage multiple Internet-connected resources: People, Sensors, computers, data systems
• Grids are built on top of commodity web service technology with broad industry support
• Organization can be explicit as in
  • TeraGrid which federates many supercomputers;
  • CrisisGrid which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data
• Organization can be implicit such as curated databases and simulation resources that “harmonize a community”
The Architecture of Gateway Grids
[Figure: layered architecture]
• The User’s Desktop
• Grid Portal Server
• Gateway Services: Proxy Certificate Server / vault; Application Events; Resource Broker; User Metadata Catalog; Replica Mgmt; Application Workflow; App. Resource catalogs; Application Deployment
• Core Grid Services (OGSA-like Layer): Execution Management; Information Services; Self Management; Data Services; Resource Management; Security Services
• Physical Resource Layer
Let’s look at a few real examples
(about a dozen … many more exist!)
BIRN – Biomedical Information
Mesoscale Meteorology
NSF LEAD project - making the tools that are needed to make accurate predictions of tornados and hurricanes.
- Data exploration and Grid workflow
Workflow in the LEAD Grid
Katrina output
Renci Bio Gateway
Providing access to biotechnology tools running on a back-end Grid.
- leverage state-wide investment in bioinformatics
- undergraduate & graduate education, faculty research
- another portal soon: national evolutionary synthesis center
X-Ray Crystallography
SERVOGrid
SERVOGrid Requirements
• Seamless Access to Data repositories and large scale computers
• Integration of multiple data sources including sensors, databases, file systems with analysis system
  • Including filtered OGSA-DAI (Grid database access)
• Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid
• Portals with component model for user interfaces and web control of all capabilities
• Collaboration to support world-wide work
• Basic Grid tools: workflow and notification
• NOT metacomputing
Grid of Grids: Research Grid and Education Grid
[Figure labels: SERVOGrid; Database; Analysis and Visualization Portal; Repositories, Federated Databases; Data Filter Services; Field Trip Data; Streaming Data; Sensors; Discovery Services; Research Simulations; Research; Education; Customization Services; From Research to Education; Education Grid; Computer Farm; GIS Grid; Sensor Grid; Database Grid; Compute Grid]
Integrating Archived Web Feature Services and Google Maps
Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.
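Filtering seismic records for a map view usually comes down to a bounding-box test, optionally combined with an attribute threshold. The Python sketch below mimics that filter locally on in-memory records. It makes no call to any Web Feature Service or Google Maps API, and the SeismicRecord fields and the sample events are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SeismicRecord:
    """Hypothetical record layout; real archives will differ."""
    event_id: str
    lon: float
    lat: float
    magnitude: float


# (min_lon, min_lat, max_lon, max_lat)
BBox = Tuple[float, float, float, float]


def filter_records(
    records: List[SeismicRecord],
    bbox: BBox,
    min_magnitude: float = 0.0,
) -> List[SeismicRecord]:
    """Keep records inside the bounding box and at or above the magnitude threshold."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        r
        for r in records
        if min_lon <= r.lon <= max_lon
        and min_lat <= r.lat <= max_lat
        and r.magnitude >= min_magnitude
    ]


# Made-up sample events, not real seismic data.
events = [
    SeismicRecord("ev1", -118.2, 34.1, 4.5),
    SeismicRecord("ev2", -122.4, 37.8, 3.1),
    SeismicRecord("ev3", -117.9, 33.9, 2.0),
]

visible = filter_records(events, bbox=(-119.0, 33.0, -117.0, 35.0), min_magnitude=3.0)
visible_ids = [r.event_id for r in visible]
```

In a deployed system the same bounding box would be sent to the archive service so that only matching features travel over the network; the local version here just shows the selection logic.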
MyGrid - Bioinformatics
The Williams Workflows
• A: Identification of overlapping sequence
• B: Characterisation of nucleotide sequence
• C: Characterisation of protein sequence
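A workflow of this kind can be pictured as a chain of steps that pass a shared context along. The Python sketch below assumes, for illustration only, that A, B and C run in sequence. The three step functions are toy stand-ins invented here (an overlap merge, a GC-content count, a residue count) and do not reproduce the actual myGrid services. The protein string is supplied as input rather than derived.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Step = Callable[[Context], Context]


def merge_overlap(ctx: Context) -> Context:
    """Toy stand-in for A: merge two reads on their longest suffix/prefix overlap."""
    left, right = ctx["left"], ctx["right"]
    overlap = 0
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            overlap = size
            break
    return {**ctx, "overlap": overlap, "nucleotide": left + right[overlap:]}


def characterise_nucleotide(ctx: Context) -> Context:
    """Toy stand-in for B: length and GC fraction of the merged sequence."""
    seq = ctx["nucleotide"]
    gc = sum(base in "GC" for base in seq) / len(seq)
    return {**ctx, "length": len(seq), "gc_fraction": gc}


def characterise_protein(ctx: Context) -> Context:
    """Toy stand-in for C: residue counts of a protein sequence given as input."""
    counts: Dict[str, int] = {}
    for residue in ctx["protein"]:
        counts[residue] = counts.get(residue, 0) + 1
    return {**ctx, "residue_counts": counts}


def run_workflow(steps: List[Step], ctx: Context) -> Context:
    """Run the steps in order, feeding each step the previous step's output."""
    for step in steps:
        ctx = step(ctx)
    return ctx


result = run_workflow(
    [merge_overlap, characterise_nucleotide, characterise_protein],
    {"left": "ATGGCC", "right": "GCCTAA", "protein": "MA"},
)
```

Keeping every step as a function from context to context is what lets a workflow engine reorder, replace or distribute the steps without changing their code.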
[Figure labels: Physical Network; Core Low Level Grid Services; Security; Workflow; Messaging; Management; Policy; Discovery; Metadata; Portals; Collaboration; Services; Data Access/Storage; Information/Knowledge; Instrument/Sensor; Compute/Supercomputer; MIS; Application Services; Domain Specific Grids/Services: BioInformatics Grid, Chemical Informatics Grid, …]
[Figure labels: HTS Tools; Quantum Calculations; CIS]
[Figure labels: Sequencing Tools; Biocomplexity Simulations; BIS]
M(B,C)IS is Molecular (Bio, Chem) Information System supporting specific metadata (CML, CellML, SBML) and physical representations
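To make the reference to CML metadata concrete, the Python sketch below reads a small hand-written CML fragment with the standard library and summarises it. It is an illustration only: the summarise helper and the water example are assumptions introduced here, and nothing in it describes how an MIS would actually be built.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict

# CML elements live in this XML namespace.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written example molecule (water), for illustration only.
WATER_CML = """
<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def summarise(cml_text: str) -> Dict[str, object]:
    """Hypothetical helper: molecule id, element counts and bond count."""
    root = ET.fromstring(cml_text.strip())
    atoms = root.findall("cml:atomArray/cml:atom", CML_NS)
    bonds = root.findall("cml:bondArray/cml:bond", CML_NS)
    return {
        "id": root.get("id"),
        "elements": dict(Counter(a.get("elementType") for a in atoms)),
        "bond_count": len(bonds),
    }


summary = summarise(WATER_CML)
```

A summary like this is the sort of record a metadata catalog could index, so that molecules can be discovered without parsing every document on each query.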
Comments on Grid Components
• Support GT4 and WS-I+(+); Support Java and .NET
• Portals – all services will have a portlet interface
• Compute Grid -- This is some sort of Condor Grid (as used by Cambridge)
• Supercomputer Grid -- (extended) TeraGrid
• Workflow, Metadata, Information Management – learn from Taverna, link with BPEL style workflow, link with other Semantic Grid/metadata services
• Instruments – learn from CIMA/Reciprocal Net, compare with Sensors in LEAD/SERVOGrid
• MIS/CIS – See if idea sensible – in any case need CML, LSID, Molecular visualization
• Application Services – Need a wizard. Support “filters” (Wild) and loosely coupled simulations (Baik)
• Data – Link to PubChem and Bioinformatics – link to Baik database
• Discovery – Extended UDDI
• Security – review any special requirements and status of PubChem, caBIG, myGrid etc.
• Collaboration, Management, Messaging, Policy -- nothing special needed