CyberInfrastructure to Support Scientific Exploration and Collaboration
Dennis Gannon (based on work with many collaborators, most notably Beth Plale)
School of Informatics
Indiana University
Overview
• CyberInfrastructure for Virtual Organizations in Computational Science
– Science Portals and The Gateway Concept
• Automating the Trail From Data to Scientific Discovery
– An Example in Depth: Mesoscale Storm Prediction
• The Challenge for the Individual Researcher
– Connecting tools to on-line services
• The Promise of the Future
– Multicore, personal petabyte and gigabit bandwidth
The Realities of Science in the U.S.
• “Big Science” dominates the funding hierarchy.
– Why? It’s important and easy to sell to Congress.
• The NSF is investing in a vast network of supercomputers to support big science.
– The results are empowering a broad range of scientific communities.
• Where is the single investigator?
– The Web has enabled democratization of information access.
– Is there a similar path for access to advanced computational resources?
Democratizing Access to Science
• What is needed for the individual or small team to do large scale science?
– Access to data and the tools to analyze it and transform it.
– A means to publish not just the results of a study but a way to share the path to discovery.
• Where are the resources?
– What we have now: TeraGrid
– What is emerging?
The TeraGrid
• The US National Supercomputer Grid
– CyberInfrastructure composed of a set of resources (compute and data) that provide common services for
• Wide area data management
• Single sign-on user authentication
• Distributed job scheduling and management (in the works)
• Collectively
– 1 Petaflop
– 20 Petabytes
• Soon to triple.
• Will add a petaflop each year.
– But at a slower rate than Google, eBay, Amazon add resources.
TeraGrid Wide: Science Gateways
• Science Portals
– A Portal = a web-based home + personal workspace + personal tools.
– Web Portal Technology + Grid Middleware
• Enables a community of researchers to:
– Access shared resources (both data and computation)
– Use a forum for collaboration on shared problem solving
• TeraGrid Science Gateways
– Allow the TeraGrid to be the back-end resource.
NEESGrid
Realtime access to earthquake shake table experiments at remote sites.
BIRN – Biomedical Information
Geological Information Grid Portal
Renci Bio Portal
Providing access to biotechnology tools running on a back-end Grid.
- leverage state-wide investment in bioinformatics
- undergraduate & graduate education, faculty research
- another portal soon: National Evolutionary Synthesis Center
Nanohub - nanotechnology
X-Ray Crystallography
ServoGrid Portal
The LEAD Project
Predicting Storms
• Hurricanes and tornadoes cause massive loss of life and damage to property.
• Underlying physical systems involve highly non-linear dynamics, so they are computationally intense.
• Data comes from multiple sources
– “Real time” data derived from streams of data from sensors
– Archived data in databases of past storms
• Infrastructure challenges:
– Data mine instrument radar data for storms
– Allocate supercomputer resources automatically to run forecast simulations
– Monitor results and retarget instruments
– Log provenance and metadata about experiments for auditing
Traditional Methodology
[Diagram: a serial pipeline from observations to end users]
• STATIC OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
• Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
• Prediction/Detection: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS, Private Companies, Students
The Process is Entirely Serial and Static (Pre-Scheduled): No Response to the Weather!
The LEAD Vision: Enabling a new paradigm of scientific exploration.
[Diagram: the same pipeline, now with feedback to the observing systems]
• DYNAMIC OBSERVATIONS
• Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
• Prediction/Detection: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS, Private Companies, Students
• Models and Algorithms Driving Sensors
The CS challenge:
• Build cyberinfrastructure services that provide adaptability, scalability, availability, usability.
• Create a new paradigm of meteorology research.
Building Experiments that Respond to the Future
Can we pose a scientific search and discovery query that the cyberinfrastructure executes as our agent?
• In the LEAD case it is Data Driven, Persistent and Agile
– Weather data streams define the nature of the computation
– Mine the data streams, detect “interesting” features; an event triggers a workflow scenario that has been waiting for months.
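A persistent, data-driven query of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not LEAD's implementation: the `Observation` schema, the reflectivity threshold, the site names and the `launch_workflow` callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Observation:
    """One reading from a weather data stream (hypothetical schema)."""
    site: str
    reflectivity_dbz: float


@dataclass
class StandingQuery:
    """A persistent query: a predicate plus the workflow to launch on a match."""
    name: str
    is_interesting: Callable[[Observation], bool]
    launch_workflow: Callable[[Observation], str]


def watch(stream: Iterable[Observation], query: StandingQuery) -> List[str]:
    """Scan the stream and launch the workflow for each matching observation."""
    launched = []
    for obs in stream:
        if query.is_interesting(obs):
            launched.append(query.launch_workflow(obs))
    return launched


# Hypothetical threshold and workflow, chosen only for illustration.
storm_query = StandingQuery(
    name="storm-forecast",
    is_interesting=lambda obs: obs.reflectivity_dbz >= 50.0,
    launch_workflow=lambda obs: f"forecast launched for {obs.site}",
)

# Made-up readings standing in for a live stream.
stream = [
    Observation("site-A", 22.0),
    Observation("site-B", 57.5),
    Observation("site-C", 31.0),
]
launched = watch(stream, storm_query)
print(launched)
```

In a real deployment the stream would be unbounded and the query would sit idle until a matching event arrived; the finite list here just keeps the sketch runnable.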
The LEAD Gateway Portal
• To support three classes of users
– Meteorology research scientists & grad students
– Undergrads in meteorology classes
– People who want easy access to weather data
Go to: http://www.leadproject.org
Gateway Components
• A Framework for Discovery
– Four basic components
• Data Discovery
– Catalogs and index services
• The experiment
– Computational workflow managing on-demand resources
• Data analysis and visualization
• Data product preservation
– Automatic metadata generation and experimental data provenance.
Data Search
• Select a region and a time range and desired attributes
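As a rough sketch of what such a search amounts to, the filter below matches catalog entries against a bounding box, a time range and a set of required attributes. The `CatalogEntry` and `SearchQuery` types, their field names and the sample entries are assumptions made for illustration; they do not describe the portal's actual catalog schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set


@dataclass
class CatalogEntry:
    """A data product known to the catalog (hypothetical schema)."""
    name: str
    lat: float
    lon: float
    time: datetime
    attributes: Set[str]


@dataclass
class SearchQuery:
    """Region (bounding box), time range and desired attributes."""
    south: float
    north: float
    west: float
    east: float
    start: datetime
    end: datetime
    required_attributes: Set[str]

    def matches(self, entry: CatalogEntry) -> bool:
        in_region = (self.south <= entry.lat <= self.north
                     and self.west <= entry.lon <= self.east)
        in_time = self.start <= entry.time <= self.end
        # Every requested attribute must be present on the entry.
        has_attributes = self.required_attributes <= entry.attributes
        return in_region and in_time and has_attributes


def search(catalog: List[CatalogEntry], query: SearchQuery) -> List[str]:
    """Return the names of all entries that satisfy the query."""
    return [entry.name for entry in catalog if query.matches(entry)]


# Made-up catalog entries, for illustration only.
catalog = [
    CatalogEntry("radar-001", 35.2, -97.4, datetime(2007, 5, 4, 18), {"radar", "level-II"}),
    CatalogEntry("surface-014", 35.9, -96.8, datetime(2007, 5, 4, 19), {"surface"}),
    CatalogEntry("radar-002", 41.0, -88.0, datetime(2007, 5, 4, 18), {"radar", "level-II"}),
]
query = SearchQuery(
    south=34.0, north=37.0, west=-99.0, east=-95.0,
    start=datetime(2007, 5, 4, 12), end=datetime(2007, 5, 5, 0),
    required_attributes={"radar"},
)
results = search(catalog, query)
print(results)
```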
Building Experiments
• As the user interacts with the portal they are creating “experiments”
• An experiment is
– A collection of data (or desired data)
– A set of analysis, transformational or predictive tasks
• Defined by a workflow or a high level query
– A provenance document that encodes a repeatable history of the experiment.
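The three parts of an experiment map naturally onto a small data structure. The sketch below is a minimal illustration under assumed names (`Experiment`, `ProvenanceEntry`, and the placeholder data and task names); a real task would do actual computation, whereas here each task just emits a placeholder output so that the provenance chain is visible.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceEntry:
    """What one task consumed and produced."""
    task: str
    inputs: List[str]
    outputs: List[str]


@dataclass
class Experiment:
    """Data, an ordered set of tasks, and the provenance of running them."""
    data: List[str]
    tasks: List[str]
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def run(self) -> List[str]:
        """Run tasks in order, recording each step's inputs and outputs."""
        self.provenance.clear()
        products = list(self.data)
        for task in self.tasks:
            # Placeholder for real work: each task yields one named output.
            outputs = [f"{task}-output"]
            self.provenance.append(ProvenanceEntry(task, products, outputs))
            products = outputs
        return products


# Hypothetical data and task names, for illustration only.
experiment = Experiment(data=["radar-data"], tasks=["assimilate", "forecast"])
final_products = experiment.run()
print(final_products)
for entry in experiment.provenance:
    print(entry)
```

Because the provenance records every input and output in order, replaying the `tasks` list against the recorded inputs is enough to repeat the experiment.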
Portal: Experimental Data & Metadata Space
• CyberInfrastructure extends user’s desktop to incorporate vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
– Portal provides ways to explore this data and search and discover it.
• Metadata about experiments is largely automatically generated, and highly searchable.
– Describes the data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to a real, on-line data “file”.
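The indirection between an abstract identifier and a physical file can be sketched as follows. The `MetadataRecord` and `DataService` classes, the `urn:example:` identifier and the `example.org` addresses are hypothetical and stand in for whatever the real metadata catalog and data service use.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetadataRecord:
    """Application-level description of a data object (hypothetical fields)."""
    unique_id: str      # abstract identifier, independent of storage location
    description: str    # application-rich description of the content
    resolver_uri: str   # the data service that can resolve unique_id


class DataService:
    """Maps abstract identifiers to current physical locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, unique_id: str, location: str) -> None:
        self._locations[unique_id] = location

    def resolve(self, unique_id: str) -> str:
        """Return the on-line location, or raise KeyError if unknown."""
        return self._locations[unique_id]


service = DataService()
service.register("urn:example:forecast-0042", "https://data.example.org/files/forecast-0042.nc")

record = MetadataRecord(
    unique_id="urn:example:forecast-0042",
    description="forecast output, 2 km grid",
    resolver_uri="https://resolver.example.org",
)
location = service.resolve(record.unique_id)
print(location)

# If the file moves, only the service mapping changes; the record stays valid.
service.register(record.unique_id, "https://archive.example.org/forecast-0042.nc")
moved_location = service.resolve(record.unique_id)
```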
Typical weather forecast runs as workflow
[Workflow diagram]
• Input data: Terrain data files; Surface data files; ETA, RUC, GFS data; Radar data (level II); Radar data (level III); Satellite data; Surface, upper air mesonet & wind profiler data
• Workflow tasks: arpssfc, arpstrn, Ext2arps-ibc, Ext2arps-lbc, 88d2arps, mci2arps, nids2arps, ADAS assimilation, arps2wrf, WRF, wrf2arps, arpsplot, IDV viz
• Stages: Pre-Processing, Assimilation, Forecast, Visualization
~400 Data Products Consumed & Produced (transformed) during Workflow Lifecycle
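A workflow like this is a dependency graph, and a workflow engine has to run it in an order that respects those dependencies. The sketch below uses only the four stage names from the diagram and assumes they form a simple chain; the task-level edges are not recoverable from the slide text, so they are not modeled. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Each stage maps to the set of stages that must finish before it starts.
# The linear chain is an assumption based on the left-to-right stage labels.
depends_on: Dict[str, Set[str]] = {
    "Pre-Processing": set(),
    "Assimilation": {"Pre-Processing"},
    "Forecast": {"Assimilation"},
    "Visualization": {"Forecast"},
}


def execution_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run(graph: Dict[str, Set[str]]) -> List[Tuple[str, List[str]]]:
    """Walk the stages in order, logging which upstream stages feed each one."""
    log = []
    for stage in execution_order(graph):
        log.append((stage, sorted(graph[stage])))
    return log


order = execution_order(depends_on)
print(order)
for stage, upstream in run(depends_on):
    print(stage, "<-", upstream)
```

The same structure extends to the individual tasks once their real input and output relationships are known: add one entry per task and list the tasks it depends on.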
The Experiment Builder
• A Portal “wizard” that leads the user through the set-up of a workflow
• Asks the user:
– “Which workflow do you want to run?”
• Once this is known, it can prompt the user for the required input data sources
• Then it “launches” the workflow.
Parameter Selection
Selecting the forecast region
Experience so far
• First release to support “WxChallenge: the new collegiate weather forecast challenge”
– The goal: “forecast the maximum and minimum temperatures, precipitation, and maximum sustained wind speeds for select U.S. cities.
– to provide students with an opportunity to compete against their peers and faculty meteorologists at 64 institutions for honors as the top weather forecaster in the nation.”
– 79 “users” ran 1,232 forecast workflows generating 2.6 TBytes of data.
• Over 160 processors were reserved on Tungsten from 10am to 8pm EDT (EST), five days each week
• National Spring Forecast
– First use of user initiated 2 km forecasts as part of that program. Generated serious interest from National Severe Storm Center.
• Integration with CASA project scheduled for final year of LEAD ITR.
Is TeraGrid the Only Enabler?
• The web has evolved a set of information and service “super nodes”
– Directories & indexes (Google, MS, Yahoo)
– Transactional mosh pits (eBay, Facebook, Wikipedia)
– Raw data and compute services (Amazon …)
• Can we build the tools for scientific discovery on this “private sector” grid?
– Yes.
– One CS student + one Bio-informatician + Amazon Storage Service + Amazon Compute Cloud = ..
A Virtual Lab for Evolutionary Genomics
• Data and databases live on S3
• Computational Tools run (on-demand) as services on EC2.
• User composes workflows.
• Result data and metadata visible to user through desktop client.
Validating Scientific Discovery
• The Gateway is becoming part of the process of science by being an active repository of data provenance
• The system records each computational experiment that a user initiates
– A complete audit trail of the experiment or computation
– Published results can include a link to provenance information for repeatability and transparency.
• The Scientific Method is all about repeatability of experiments
– Are we there yet?
Almost
• The provenance contains the workflow and if we publish it, it can be re-run
– Are the same resources still available?
• Not a necessary condition for validation
– Has the data changed?
• Another user can modify it.
– Replace an analysis step with another
– Test it on different data.
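The three operations named here (re-run, swap a step, try other data) can be sketched against a published provenance record. Everything in this example is hypothetical: the `Provenance` type, the step registry and the toy numeric steps stand in for real analysis services and recorded workflows.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Sequence, Tuple

# Registry of named analysis steps (toy stand-ins for real services).
STEPS: Dict[str, Callable[[List[float]], List[float]]] = {
    "drop_negatives": lambda xs: [x for x in xs if x >= 0],
    "double": lambda xs: [2 * x for x in xs],
    "clip_at_ten": lambda xs: [min(x, 10.0) for x in xs],
}


@dataclass(frozen=True)
class Provenance:
    """A published record: the input data and the ordered steps applied."""
    input_data: Tuple[float, ...]
    step_names: Tuple[str, ...]


def rerun(record: Provenance) -> List[float]:
    """Replay the recorded steps on the recorded input."""
    data = list(record.input_data)
    for name in record.step_names:
        data = STEPS[name](data)
    return data


def replace_step(record: Provenance, old: str, new: str) -> Provenance:
    """Return a copy of the record with one analysis step swapped out."""
    names = tuple(new if name == old else name for name in record.step_names)
    return replace(record, step_names=names)


def with_data(record: Provenance, data: Sequence[float]) -> Provenance:
    """Return a copy of the record that runs the same steps on other data."""
    return replace(record, input_data=tuple(data))


published = Provenance(input_data=(1.0, -2.0, 8.0), step_names=("drop_negatives", "double"))

original_result = rerun(published)                                        # repeat the experiment
modified_result = rerun(replace_step(published, "double", "clip_at_ten"))  # swap a step
other_result = rerun(with_data(published, [3.0]))                          # different data
print(original_result, modified_result, other_result)
```

The published record is immutable, so modified variants are new records and the original remains available for validation.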
The Future Experimental Testbed
• In five years: multicore, personal petabytes and ubiquitous gigabit bandwidth
– Much richer experimental capability on the desktop. More of the computational work can be downloaded
• Do we no longer need the massive remote data/compute center?
– Demand scales with capability.
• But there is more.
Last Thought
• Vastly improved capability for interactive experimentation
– Data exploration and visualization. Interacting with hundreds of incoming data streams.
– Tracking our path and exploring 100 possible experimental scenarios concurrently.
– Deep search agents
• Discovering new data and new tools
• Grab data - automatically fetch and analyze the provenance and set up the workflow to be re-run.
Questions
The Realization in Software
[Architecture diagram]
• User Portal, Portal server
• Data Catalog service
• MyLEAD User Metadata catalog
• MyLEAD Agent service
• Data Management Service
• Workflow Engine, Workflow graph
• Provenance Collection service
• Fault Tolerance & scheduler
• Event Notification Bus
• Application services, Compute Engine, Data Storage
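One plausible reading of the diagram is that services publish events to the notification bus and other services subscribe to them. The sketch below illustrates that publish/subscribe pattern with a workflow engine announcing task events and a provenance collector recording them. The class names, the `"workflow"` topic and the event fields are assumptions, and the actual wiring between the LEAD services may differ.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Minimal in-process publish/subscribe bus."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in self._subscribers[topic]:
            handler(event)


class ProvenanceCollector:
    """Records every workflow event it hears on the bus."""

    def __init__(self, bus: EventBus) -> None:
        self.records: List[Event] = []
        bus.subscribe("workflow", self.records.append)


class WorkflowEngine:
    """Runs tasks in order and announces progress on the bus."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus

    def run(self, tasks: List[str]) -> None:
        for task in tasks:
            self._bus.publish("workflow", {"task": task, "status": "started"})
            # The real work of the task would happen here.
            self._bus.publish("workflow", {"task": task, "status": "finished"})


bus = EventBus()
collector = ProvenanceCollector(bus)
WorkflowEngine(bus).run(["assimilation", "forecast"])
for record in collector.records:
    print(record)
```

The engine never refers to the collector directly, so further subscribers (a monitor, a fault-tolerance service) can be added without changing the engine.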