Transcript
Page 1: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

UCSD

SAN DIEGO SUPERCOMPUTER CENTER

Anke Kamrath, Division Director, San Diego Supercomputer Center

[email protected]

Welcome and Cyberinfrastructure Overview

MSI Cyberinfrastructure Institute, June 26-30, 2006

Page 2: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

The Digital World

Shopping

Entertainment

Information

Page 3: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Science is a Team Sport

Astronomy

Physics

Life Sciences

Modeling and Simulation

Data Management and Mining

GAMESS

QCD

Geosciences

Page 4: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Cyberinfrastructure – A Unifying Concept

Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations).

NSF’s “Atkins Report” provided a compelling vision for integrated Cyberinfrastructure

Page 5: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

A Deluge of Data

• Today data comes from everywhere
  • “Volunteer” data
  • Scientific instruments
  • Experiments
  • Sensors and sensornets
  • Computer simulations
  • New devices (personal digital devices, computer-enabled clothing, cars, …)

• And is used by everyone
  • Researchers, educators
  • Consumers
  • Practitioners
  • General public

• Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access

[Figure labels: Data from sensors; Data from simulations; Data from instruments; Data from analysis; Volunteer data]

Page 6: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Using Data as a Driver: SDSC Cyberinfrastructure

SDSC Data Cyberinfrastructure

• Data-oriented HPC, Resources, High-end storage, Large-scale data analysis, simulation, modeling

• Community Databases and Data Collections, Data management, mining and preservation

• Data-oriented Tools, SW Applications, and Community Codes (SRB, Biology Workbench)

• Data- and Computational Science Education and Training (Summer Institute)

• Collaboration, Service and Community Leadership for Data-oriented Projects

IT

Page 7: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Impact on Technology: Data and Storage are Integral to Today’s Information Infrastructure

• Today’s “computer” is a coordinated set of hardware, software, and services providing an “end-to-end” resource.

• Cyberinfrastructure captures how the research and education community has redefined “computer”

[Diagram: networks (including wireless) linking computers, data, storage, visualization, field instruments, and field sensors]

Data and storage are an integral part of today’s “computer”

Page 8: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Goal: SDSC’s Data Cyberinfrastructure should “extend the reach” of the local research and education environment.

• Access to community and reference data collections

• More capable and/or higher capacity computational resources

• Multi-disciplinary expertise

• Community codes, middleware, software tools and toolkits

Building a National Data Cyberinfrastructure Center

Long-term Scientific Data Preservation

Page 9: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges

[Figure: applications plotted on two axes, Compute (more FLOPS) and Data (more BYTES). Regions: Home, Lab, Campus, Desktop Applications; Traditional HPC Applications; Data-oriented Research Applications]

Page 10: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Today’s Research Applications Span the Spectrum

[Figure: research applications plotted on two axes, Compute (more FLOPS) and Data (more BYTES)]

Regions: Home, Lab, Campus, Desktop; Traditional HPC environment; Extreme I/O Environment; Data Mgt. Envt.; Data-oriented Environment

Legend: Lends itself to Grid; Could be targeted efficiently on Grid; Difficult to target efficiently on Grid

Applications shown: NVO, EOL, CiPres, SCEC Visualization, ENZO Visualization, CFD, Turbulence field, Climate, SCEC Simulation, ENZO simulation, QCD, Protein Folding/MD, Turbulence Reattachment length, CPMD, MCell, GridSAT, Seti@Home, EverQuest, GAMESS

Page 11: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Working with Compute and Data – Simulation, Analysis, Modeling

Simulation of a magnitude 7.7 Southern California earthquake on the lower San Andreas Fault

• Physics-based dynamic source model – simulation of mesh of 1.8 billion cubes with spatial resolution of 200 m

• Builds on 10 years of data and models from the Southern California Earthquake Center

• Simulated first 3 minutes of a magnitude 7.7 earthquake, 22,728 time steps of 0.011 second each

• Simulation generates 45+ TB data

Resources Required

Computers and Systems
• 80,000 hours on DataStar
• 256 GB memory p690 used for testing, p655s used for production run, TeraGrid (TG) used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours post-processing for high resolution rendering

People
• 20+ people for IT support
• 20+ people in domain research

Storage
• SAM-QFS archival storage
• HPSS backup
• SRB Collection with 1,000,000 files
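
The figures on this slide can be cross-checked with some quick arithmetic. The sketch below is only a back-of-envelope check, not part of the original talk. The 200 m resolution, the 45 TB output, and the 100 MB/s transfer rate come from the slide; the 600 km x 300 km x 80 km domain size is an assumption added here for illustration, as are all function names.

```python
# Back-of-envelope check of the TeraShake figures quoted on the slide.

CELL_SIZE_M = 200            # spatial resolution, from the slide
OUTPUT_BYTES = 45e12         # "45+ TB" of output, from the slide
TRANSFER_RATE_BPS = 100e6    # 100 MB/s GPFS to SAM-QFS, from the slide

# ASSUMPTION (not on the slide): simulated volume in kilometres.
DOMAIN_KM = (600, 300, 80)


def cell_count(domain_km, cell_size_m):
    """Number of cubes needed to tile the domain at the given resolution."""
    count = 1
    for extent_km in domain_km:
        count *= (extent_km * 1000) // cell_size_m
    return count


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second / 86_400


cells = cell_count(DOMAIN_KM, CELL_SIZE_M)
days = transfer_days(OUTPUT_BYTES, TRANSFER_RATE_BPS)

print(f"{cells:,} cubes in the mesh")
print(f"{days:.1f} days to move the output at 100 MB/s")
```

Under the assumed domain size the mesh comes out at 1.8 billion cubes, matching the slide, and moving 45 TB at a sustained 100 MB/s takes a little over five days.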

Page 12: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Simulating an earthquake 1:

1. Divide up Southern California into “blocks”

2. For each block, get all the data on ground surface composition, geological structures, fault information, etc.

Big Data & Big Compute:

The Southern San Andreas Fault

Page 13: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Simulating an earthquake 2:

3. Map the blocks on to processors (brains) of the computer

SDSC’s DataStar – one of the 25 fastest computers in the world

Big Data & Big Compute:

Page 14: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Simulating an earthquake 3:

4. Run the simulation using current information on fault activity and the physics of earthquakes

Big Data & Big Compute:

Page 15: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Simulating an earthquake 4:

5. The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics
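
Steps 1 and 3 above (divide the region into blocks, then map the blocks onto processors) describe what is usually called domain decomposition. The sketch below shows one simple way to do it and is not the TeraShake code itself. The grid shape and the 8 x 6 x 5 processor layout are illustrative assumptions, chosen only so that the processor count equals 240.

```python
from itertools import product


def split_axis(n_cells, n_parts):
    """Split n_cells into n_parts contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(n_cells, n_parts)
    ranges, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(grid_shape, proc_grid):
    """Map each processor rank to the block (one index range per axis) it owns."""
    per_axis = [split_axis(n, p) for n, p in zip(grid_shape, proc_grid)]
    return dict(enumerate(product(*per_axis)))


def block_cells(block):
    """Number of grid cells inside one block."""
    cells = 1
    for lo, hi in block:
        cells *= hi - lo
    return cells


# ASSUMPTIONS: hypothetical grid shape and processor layout, for illustration only.
blocks = decompose(grid_shape=(3000, 1500, 400), proc_grid=(8, 6, 5))

print(len(blocks), "blocks, one per processor")
print("rank 0 owns", blocks[0], "=", block_cells(blocks[0]), "cells")
```

Each processor then advances only its own block at every time step, exchanging boundary values with the processors that own neighboring blocks. That exchange is left out of this sketch.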

Managing the data

• How much data was output?

• 47 TeraBytes which is

• 2+ times the printed materials in the Library of Congress! or

• The amount of music in 2000+ iPods! or

• About 10,000 copies of a typical (4.7 GB) DVD movie!

Where to store the data?

• In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 TeraBytes), 500 times the printed materials in the Library of Congress

Page 16: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

How long will TeraShake take on your desktop computer?

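Computing Platform: Desktop
• Number of Processors: 1
• Floating Point (arithmetic) Operations per second: 5.3 billion
• Can run TeraShake in: 72 centuries! (approximate)

Computing Platform: DataStar at SDSC
• Number of Processors: 1024 (240 used for TeraShake)
• Floating Point (arithmetic) Operations per second: 10.4 trillion
• Can run TeraShake in: 5 days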

Page 17: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Better Neurosurgery Through Cyberinfrastructure

• PROBLEM: Neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue

• Brain deforms during surgery

• Surgeons must align the preoperative brain image with intra-operative images to get the best opportunity for intra-surgical navigation

Radiologists and neurosurgeons at Brigham and Women’s Hospital, Harvard Medical School exploring transmission of 30/40 MB brain images (generated during surgery) to SDSC for analysis and alignment

Finite element simulation on biomechanical model for volumetric deformation performed at SDSC; output results are sent to BWH where updated images are shown to surgeons

Transmission repeated every hour during 6-8 hour surgery.

Transmission and output must take on the order of minutes
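
To get a feel for the "order of minutes" budget, the sketch below estimates how long one 40 MB image takes to send. The slide gives no link speeds, so the three rates used here are placeholder assumptions, and the function name is hypothetical.

```python
def transfer_seconds(size_mb, link_mbit_per_s):
    """Seconds to send size_mb megabytes over a link of link_mbit_per_s megabits/s."""
    return size_mb * 8 / link_mbit_per_s


IMAGE_MB = 40                          # upper image size from the slide
ASSUMED_LINKS_MBIT = (10, 100, 1000)   # ASSUMPTION: illustrative link speeds

for rate in ASSUMED_LINKS_MBIT:
    seconds = transfer_seconds(IMAGE_MB, rate)
    print(f"{rate:>5} Mbit/s: {seconds:6.2f} s per image")
```

Even at the slowest assumed rate a single image moves in well under a minute, which leaves most of the budget for the finite element simulation and for sending the results back.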

Page 18: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Community Data Repository: SDSC DataCentral

• Provides “data allocations” on SDSC resources to national science and engineering community

• Data collection and database hosting

• Batch oriented access

• Collection management services

• First broad program of its kind to support research and community data collections and databases

• Comprehensive resources

• Disk: 400 TB accessible via HPC systems, Web, SRB, GridFTP

• Databases: DB2, Oracle, MySQL

• SRB: Collection management

• Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP

• 24/7 operations, collection specialists

Example Allocated Data Collections include

• Bee Behavior (Behavioral Science)

• C5 Landscape DB (Art)

• Molecular Recognition Database (Pharmaceutical Sciences)

• LIDAR (Geoscience)

• AMANDA (Physics)

• SIO_Explorer (Oceanography)

• Tsunami and Landsat Data (Earthquake Engineering)

• Terabridge (Structural Engineering)

DataCentral infrastructure includes: Web-based portal, security, networking, UPS systems, web services and software tools

Page 19: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Public Data Collections Hosted in SDSC’s DataCentral

Seismology: 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences: 50 year Downscaling of Global Analysis over California Region
Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics: AMANDA data
Biology: AfCS Molecule Pages
Biomedical Neuroscience: BIRN
Networking: Backbone Header Traces
Networking: Backscatter Data
Biology: Bee Behavior
Biology: Biocyc (SRI)
Art: C5 landscape Database
Geology: Chronos
Biology: CKAAPS
Biology: DigEmbryo
Earth Science Education: ERESE
Earth Sciences: UCI ESMF
Earth Sciences: EarthRef.org
Earth Sciences: ERDA
Earth Sciences: ERR
Biology: Encyclopedia of Life
Life Sciences: Protein Data Bank
Geosciences: GEON
Geosciences: GEON-LIDAR
Geochemistry: Kd
Biology: Gene Ontology
Geochemistry: GERM
Networking: HPWREN
Ecology: HyperLter
Networking: IMDC
Biology: Interpro Mirror
Biology: JCSG Data
Government: Library of Congress Data
Geophysics: Magnetics Information Consortium data
Education: UC Merced Japanese Art Collections
Geochemistry: NAVDAT
Earthquake Engineering: NEESIT data
Education: NSDL
Astronomy: NVO
Government: NARA
Anthropology: GAPP
Neurobiology: Salk data
Seismology: SCEC TeraShake
Seismology: SCEC CyberShake
Oceanography: SIO Explorer
Networking: Skitter
Astronomy: Sloan Digital Sky Survey
Geology: Sensitive Species Map Server
Geology: SD and Tijuana Watershed data
Oceanography: Seamount Catalogue
Oceanography: Seamounts Online
Biodiversity: WhyWhere
Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering: TeraBridge
Various: TeraGrid data collections
Biology: Transporter Classification Database
Biology: TreeBase
Art: Tsunami Data
Education: ArtStor
Biology: Yeast regulatory network
Biology: Apoptosis Database
Cosmology: LUSciD

Page 20: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Data Cyberinfrastructure Requires a Coordinated Approach

[Diagram: layered data cyberinfrastructure stack, spanned by integration and interoperability]

Layers:
• Storage hardware
• Networked Storage (SAN)
• Grid Storage
• Filesystems, Database Systems
• Data Mining, Simulation Modeling, Analysis, Data Fusion
• Knowledge-Based Integration, Advanced Query Processing
• Visualization
• Applications: Medical informatics, Biosciences, Ecoinformatics, …

Connected resources: High speed networking, sensornets, instruments, HPC

Questions:
• How do we configure computer architectures to optimally support data-oriented computing?
• How do we collect, access and organize data?
• How do we obtain usable information from data?
• How do we detect trends and relationships in data?
• How do we represent data, information and knowledge to the user?
• How do we combine data, knowledge and information management with simulation and modeling?

Page 21: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Working with Data: Data Integration for New Discovery

Data Integration in the Biosciences

[Diagram: Users reach Disciplinary Databases through Software to access data and Software to federate data]

• Scales: Organisms, Organs, Cells, Organelles, Bio-polymers, Atoms
• Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry

Data Integration in the Geosciences

• Where can we most safely build a nuclear waste dump?
• Where should we drill for oil?
• What is the distribution and U/Pb zircon ages of A-type plutons in VA?
• How does it relate to host rock structures?

Data Integration: Geologic Map, Geo-Chemical, Geo-Physical, Geo-Chronologic, Foliation Map

Complex “multiple-worlds” mediation

Page 22: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Preserving Data over the Long-Term

Page 23: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Data Preservation

• Many Science, Cultural, and Official Collections must be sustained for the foreseeable future

• Critical collections must be preserved:

• community reference data collections (e.g. Protein Data Bank)

• irreplaceable collections (e.g. field data – tsunami recon)

• longitudinal data (e.g. PSID – Panel Study of Income Dynamics)

• No plan for preservation often means that data is lost or damaged

“….the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come.”

“Preserving Our Digital Heritage”, Library of Congress

Page 24: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

How much Digital Data*?

Kilo 10^3

Mega 10^6

Giga 10^9

Tera 10^12

Peta 10^15

Exa 10^18

1 Low Resolution Photo = 100 KiloBytes

1 novel = 1 MegaByte

iPod Shuffle (up to 120 songs) = 512 MegaBytes

Printed materials in the Library of Congress = 10 TeraBytes

1 human brain at the micron level = 1 PetaByte

SDSC HPSS tape archive = 6 PetaBytes

All worldwide information in one year = 2 ExaBytes

* Rough/average estimates
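
The prefixes on this slide make it easy to compare the examples with each other. The short sketch below does that using only the slide's own rough estimates. The helper name `to_bytes` is made up for illustration.

```python
# Decimal prefixes as powers of ten, as listed on the slide.
PREFIX_EXPONENT = {"Kilo": 3, "Mega": 6, "Giga": 9, "Tera": 12, "Peta": 15, "Exa": 18}


def to_bytes(value, prefix):
    """Convert a value with a decimal prefix (e.g. 10, "Tera") to bytes."""
    return value * 10 ** PREFIX_EXPONENT[prefix]


# Rough estimates taken from the slide.
novel = to_bytes(1, "Mega")
library_of_congress = to_bytes(10, "Tera")
hpss_archive = to_bytes(6, "Peta")

print(hpss_archive // library_of_congress)   # 600 Libraries of Congress in the archive
print(library_of_congress // novel)          # 10000000 novels in the Library of Congress
```

These are decimal (power-of-ten) units, as on the slide; binary units such as 2^40 bytes per terabyte would shift the results by a few percent.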

Page 25: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Key Challenges for Digital Preservation

• What should we preserve?
  • What materials must be “rescued”?
  • How to plan for preservation of materials by design?

• How should we preserve it?
  • Formats
  • Storage media
  • Stewardship – who is responsible?

• Who should pay for preservation?
  • The content generators?
  • The government?
  • The users?

• Who should have access?

Print media provides easy access for long periods of time but is hard to data-mine

Digital media is easier to data-mine but requires management of evolution of media and resource planning over time

Page 26: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

What can go wrong

Entity at risk: File
• Problem: Corrupted media, disk failure
• Frequency: 1 year

Entity at risk: Tape+
• Problem: Simultaneous failure of 2 copies
• Frequency: 5 years

Entity at risk: System+
• Problem: Systemic errors in vendor SW, or Malicious user, or Operator error that deletes multiple copies
• Frequency: 15 years

Entity at risk: Archive+
• Problem: Natural disaster, obsolescence of standards
• Frequency: 50 - 100 years
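
One way to read the frequencies above is as the average time between events of each kind. The sketch below takes that reading and estimates how many events a collection might see over a preservation horizon. It assumes independent events arriving at a constant average rate (a Poisson model), which is an assumption added here and not something the slide states. For the archive row it uses the pessimistic 50-year end of the range.

```python
import math

# Mean years between events, read off the slide. The archive entry uses the
# pessimistic end of the slide's "50 - 100 years" range.
MEAN_YEARS_BETWEEN_EVENTS = {"File": 1, "Tape+": 5, "System+": 15, "Archive+": 50}


def expected_events(horizon_years, mean_interval_years):
    """Average number of events over the horizon at a constant rate."""
    return horizon_years / mean_interval_years


def prob_at_least_one(horizon_years, mean_interval_years):
    """Chance of one or more events over the horizon (ASSUMES a Poisson process)."""
    return 1 - math.exp(-horizon_years / mean_interval_years)


HORIZON_YEARS = 100  # ASSUMPTION: illustrative preservation horizon

for entity, interval in MEAN_YEARS_BETWEEN_EVENTS.items():
    events = expected_events(HORIZON_YEARS, interval)
    chance = prob_at_least_one(HORIZON_YEARS, interval)
    print(f"{entity:9s} ~{events:5.1f} events, P(at least one) = {chance:.2f}")
```

Under these assumptions every class of failure becomes likely over a century, which is why a preservation plan has to cover the rare archive-level events as well as routine media failures.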

Page 27: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

SDSC Cyberinfrastructure Community Resources

COMPUTE SYSTEMS
• DataStar
  • 2396 Power4+ processors, IBM p655 and p690 nodes
  • 10 TB total memory
  • Up to 2 GBps I/O to disk
• TeraGrid Cluster
  • 512 Itanium2 IA-64 processors
  • 1 TB total memory
• Intimidata
  • Only academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes
• http://www.sdsc.edu/user_services/

SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
• User Services
• Application/Community Collaborations
• Education and Training
• SDSC Synthesis Center
• Community SW, toolkits, portals, codes
• http://www.sdsc.edu/

DATA ENVIRONMENT
• 1 PB Storage-area Network (SAN)
• 10 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s
• http://datacentral.sdsc.edu/

Support for 60+ community data collections and databases

Data management, mining, analysis, and preservation

Page 28: Welcome and Cyberinfrastructure Overview MSI Cyberinfrastructure Institute June 26-30, 2006

Thank You

[email protected]

www.sdsc.edu

