A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

“A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research” Invited Presentation Symposium on Computational Biology and Bioinformatics: Remembering John Wooley National Institutes of Health Bethesda, MD July 29, 2016 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD http://lsmarr.calit2.net


Page 1: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

“A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research”

Invited Presentation

Symposium on Computational Biology and Bioinformatics:

Remembering John Wooley

National Institutes of Health

Bethesda, MD

July 29, 2016

Dr. Larry Smarr

Director, California Institute for Telecommunications and Information Technology

Harry E. Gruber Professor,

Dept. of Computer Science and Engineering

Jacobs School of Engineering, UCSD

http://lsmarr.calit2.net

Page 2: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

John Wooley Drove Supercomputing for Biological Sciences

Page 3: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

John Wooley was a Scientific Founder of Calit2

www.calit2.net

220 UCSD & UCI Faculty Working in Multidisciplinary Teams

With Students, Industry, and the Community

The State Provides $100 M For New Buildings and Equipment

LS Slide 2001

John Wooley was the UCSD Layer Leader for DeGeM

Page 4: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

NSF’s OptIPuter Project: Using Supernetworks to Meet the Needs of Data-Intensive Researchers

OptIPortal: Termination Device for the OptIPuter Global Backplane

Calit2 (UCSD, UCI), SDSC, and UIC Leads; Larry Smarr, PI
Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST

Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent

2003-2009 $13,500,000

Biomedical Big Data as Application Driver: Mark Ellisman, co-PI

Page 5: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

The OptIPuter LambdaGrid is Rapidly Expanding

[Network map. Sites shown: UCSD, SDSU, UCI, ISI, NASA JPL, NASA Ames, NASA Goddard, CICESE (via CUDI), UIC EVL, NU, U Amsterdam. Exchange points: CENIC San Diego GigaPOP, CENIC Los Angeles GigaPOP, PNWGP Seattle, StarLight Chicago, NetherLight Amsterdam. Networks: CalREN-XD, CAVEwave/NLR, NLR, CENIC/Abilene Shared Network. Legend: 1 GE Lambda, 10 GE Lambda. Per-link lambda counts are not recoverable from the transcript.]

Source: Greg Hidley, Aaron Chin, Calit2

LS Slide 2005

Page 6: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

PI Larry Smarr

Paul Gilna Ex. Dir.

Announced January 17, 2006

$24.5M Over Seven Years

John Wooley was a CAMERA co-PI & Chief Science Officer

Page 7: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research
Page 8: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Calit2 Microbial Metagenomics Cluster: Next Generation Optically Linked Science Data Server

512 Processors, ~5 Teraflops

~200 Terabytes Storage

1GbE and 10GbE Switched/Routed Core

~200TB Sun X4500 Storage

10GbE

Source: Phil Papadopoulos, SDSC, Calit2

Page 9: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

The CAMERA Project Established a Global Marine Microbial Metagenomics Cyber-Community

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis

http://camera.calit2.net/

4000 Registered Users From Over 80 Countries

Page 10: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Determining the Protein Structures of the Thermophilic Thermotoga maritima Genome: Life at 80°C!

Extremely Thermostable -- Useful for Many Industrial Processes (e.g. Chemical and Food)

173 Structures (122 from JCSG)

• 122 T.M. Structures Solved by JCSG (75 Unique in the PDB)
• Direct Structural Coverage of 25% of the Expressed Soluble Proteins
• Probably Represents the Highest Structural Coverage of Any Organism

Source: John Wooley, JCSG Bioinformatics Core Project Director, UCSD

LS Slide 2005

Page 11: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

John Wooley Organized a Series of International Workshops on Metagenomics and Thermotoga at Calit2

Page 12: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Academic Research OptIPlanet Collaboratory: A 10Gbps “End-to-End” Lightpath Cloud

National LambdaRail

Campus Optical Switch

Data Repositories & Clusters

HPC

HD/4k Video Repositories

End User OptIPortal

10G Lightpaths

HD/4k Live Video

Local or Remote Instruments

LS 2009 Slide

Page 13: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

So Why Don’t We Have a National Big Data Cyberinfrastructure?

“Research is being stalled by ‘information overload,’ Mr. Bement said, because data from digital instruments are piling up far faster than researchers can study. In particular, he said, campus networks need to be improved. High-speed data lines crossing the nation are the equivalent of six-lane superhighways, he said. But networks at colleges and universities are not so capable. ‘Those massive conduits are reduced to two-lane roads at most college and university campuses,’ he said. Improving cyberinfrastructure, he said, ‘will transform the capabilities of campus-based scientists.’”

Arden Bement, Director of the National Science Foundation, May 2005

Page 14: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

DOE ESnet’s Science DMZ: A Scalable Network Design Model for Optimizing Science Data Transfers

• A Science DMZ integrates 4 key concepts into a unified whole:

– A network architecture designed for high-performance applications, with the science network distinct from the general-purpose network

– The use of dedicated systems for data transfer

– Performance measurement and network testing systems that are regularly used to characterize and troubleshoot the network

– Security policies and enforcement mechanisms that are tailored for high performance science environments

http://fasterdata.es.net/science-dmz/

Science DMZ Coined 2010

The DOE ESnet Science DMZ and the NSF “Campus Bridging” Taskforce Report Formed the Basis for the NSF Campus Cyberinfrastructure Network Infrastructure and Engineering (CC-NIE) Program

Page 15: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Based on Community Input and on ESnet’s Science DMZ Concept, NSF Has Funded Over 100 Campuses to Build Local Big Data Freeways

Red: 2012 CC-NIE Awardees
Yellow: 2013 CC-NIE Awardees
Green: 2014 CC*IIE Awardees
Blue: 2015 CC*DNI Awardees
Purple: Multiple Time Awardees

Source: NSF

Page 16: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Creating a “Big Data” Freeway on Campus: NSF-Funded Prism@UCSD and CHERuB Campus CC-NIE Grants

Prism@UCSD, PI Phil Papadopoulos, SDSC, Calit2 (2013-15)

CHERuB, PI Mike Norman, SDSC

CHERuB

Page 17: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

NCMIR Brain Images in Calit2 VROOM: Allows for Interactive Zooming from Cerebellum to Individual Neurons

NCMIR Connected Over Prism to Calit2/SDSC at 80 Gbps

Page 18: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Calit2 3D Immersive StarCAVE OptIPortal: Enables Interactive Exploration of Protein Data Bank

Cluster with 30 Nvidia 5600 cards: 60 GB Texture Memory

Source: Tom DeFanti, Greg Dawe, Calit2

Connected at 50 Gb/s to Quartzite

30 HD Projectors!

15 Meyer Sound Speakers + Subwoofer

Passive Polarization: Optimized the Polarization Separation and Minimized Attenuation

Page 19: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System”

Source: John Hess, CENIC

Funded by NSF $5M Oct 2015-2020

Flash Disk to Flash Disk File Transfer Rate

PI: Larry Smarr, UC San Diego Calit2
Co-PIs:
• Camille Crittenden, UC Berkeley CITRIS
• Tom DeFanti, UC San Diego Calit2
• Philip Papadopoulos, UC San Diego SDSC
• Frank Wuerthwein, UC San Diego Physics and SDSC

Page 20: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Pacific Research Platform Regional Collaboration: Multi-Campus Science Driver Teams

• Jupyter Hub
• Biomedical
– Cancer Genomics Hub/Browser
– Microbiome and Integrative ‘Omics
– Integrative Structural Biology
• Earth Sciences
– Data Analysis and Simulation for Earthquakes and Natural Disasters
– Climate Modeling: NCAR/UCAR
– California/Nevada Regional Climate Data Analysis
– CO2 Subsurface Modeling
• Particle Physics
• Astronomy and Astrophysics
– Telescope Surveys
– Galaxy Evolution
– Gravitational Wave Astronomy
• Scalable Visualization, Virtual Reality, and Ultra-Resolution Video

Page 21: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

PRP Transforms Big Data Microbiome and Integrated ‘Omics Science

[System diagram. Recoverable labels: 12 Cores/GPU, 128 GB RAM, 3.5 TB SSD, 48 TB Disk, 10Gbps NIC; Knight Lab; Gordon; Prism@UCSD; Data Oasis (7.5PB, 200GB/s); Knight 1024 Cluster in SDSC Co-Lo; CHERuB (100Gbps); Emperor & Other Vis Tools; 64Mpixel Data Analysis Wall; link rates of 10Gbps, 40Gbps, 120Gbps, and 1.3Tbps; remote sites PNNL, UC Davis, LBNL, Caltech.]

Page 22: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

To Expand IBD Project the Knight/Smarr Labs Were Awarded ~ 1 Million Core-Hours on SDSC’s Comet Supercomputer

• 8x Compute Resources Over Prior Study
• Smarr Gut Microbiome Time Series
– From 7 Samples Over 1.5 Years
– To 50 Samples Over 4 Years
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 Patients
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients
• New Software Suite from Knight Lab
– Re-annotation of Reference Genomes, Functional/Taxonomic Variations
– Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner

Page 23: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

We Used SDSC’s Comet to Uniformly Compute Protein-Coding Genes, RNAs, & CRISPR Annotations

• We Downloaded from NCBI Over 60,000 Bacterial and Archaeal Genomes
– Required 5 Core-Hours Per Genome
– 300,000 Core-Hours to Complete
– Ran 24 Cores in Parallel
– Over 400 Days Wall-Clock Time
• Requires a Variety of Software Programs
– Prodigal for Gene Prediction
– Diamond for Protein Homolog Search Against UniRef db
– Infernal for ncRNA Prediction
– RNAMMER for rRNA Prediction
– Aragorn for tRNA Prediction
• Will Make These Results a New Community Database
– Knight Lab, Calit2, SDSC

Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD
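
The compute figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes exactly 60,000 genomes and ideal scaling across 24 cores, and the variable names are hypothetical rather than taken from the talk.

```python
# Back-of-the-envelope check of the slide's compute estimate.
# Assumptions (not from the talk): exactly 60,000 genomes and
# perfect 24-way parallel scaling with no idle time.
GENOMES = 60_000
CORE_HOURS_PER_GENOME = 5
CORES = 24

total_core_hours = GENOMES * CORE_HOURS_PER_GENOME
wall_clock_days = total_core_hours / CORES / 24  # wall-clock hours -> days

print(f"{total_core_hours:,} core-hours")
print(f"{wall_clock_days:.0f} days wall-clock on {CORES} cores")
```

Under these assumptions the total is 300,000 core-hours and roughly 520 days of wall-clock time, which is consistent with the slide's "Over 400 Days".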

Page 24: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Cancer Genomics Hub (UCSC) is Housed in SDSC: Large Data Flows to End Users at UCSC, UCB, UCSF, …

1G

8G

Data Source: David Haussler, Brad Smith, UCSC

15G Jan 2016

30,000 TB Per Year
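
To put the annual volume in network terms, it can be converted to an average sustained rate. This is a rough sketch, assuming decimal terabytes (1 TB = 10^12 bytes), a 365-day year, and traffic spread evenly over the year, which real flows are not.

```python
# Convert the slide's annual data volume into an average sustained rate.
# Assumptions (not from the talk): decimal terabytes and a 365-day year.
TB_PER_YEAR = 30_000
SECONDS_PER_YEAR = 365 * 24 * 3600

bits_per_year = TB_PER_YEAR * 10**12 * 8
avg_gbps = bits_per_year / SECONDS_PER_YEAR / 10**9

print(f"Average rate: {avg_gbps:.1f} Gb/s")
```

Under these assumptions, 30,000 TB per year averages out to about 7.6 Gb/s around the clock.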

Page 25: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Creating a Distributed Cluster for Integrated Modeling of Large Macromolecular Machines

• UCSF: 10-100 Gbps Science DMZ
– QB3@UCSF (~5000 cores)
– Institute for Human Genetics (~1200 cores)
– Cancer Center (~800 cores)
– Molecular Structure Group (~1000 cores)
• Coupled Via PRP to:
– LBNL NERSC
– SDSC

• Bring Huge Datasets from Supercomputer Centers Back to UCSF Clusters for Analysis

Requires CPU-months per computation

Lead: Andrej Sali, UCSF

Page 26: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Driving Improvements in Scientific Data Transfer

NCMIR X-ray Microscope (XRM): Zeiss Versa 510

MicroCT reconstructions of chiton radula. Chiton radula have evolved to incorporate an iron oxide mineral, magnetite, making them extremely hard and magnetic. Images courtesy of Steven Herrera, Ph.D., Kisailus Biomimetics and Nanostructured Materials Laboratory, UC Riverside

UCSD/NCMIR Fiona/Data Transfer Node (DTN)

PRP Facilitated Collaborative Data Transfer 10-100Gbps

XRM Data Sets are 100+ GBs

UCR researchers are modeling the teeth (radula) of the marine snail Cryptochiton stelleri to engineer new biomimetic abrasion-resistant composites

UC Riverside Fiona/Data Transfer Node (DTN)

3D Reconstructions from NCMIR X-ray Microscopic Computed Tomography Facilitate Development of Bioinspired “Tough” Materials
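
The link rates and data set size quoted on this slide imply a best-case transfer time. The sketch below is an idealized estimate, assuming a 100 GB data set in decimal units and a link running at its full nominal rate with no protocol or disk overhead. The helper name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time: gigabytes -> gigabits, divided by the link rate."""
    return size_gb * 8 / link_gbps


# Assumed 100 GB data set at both ends of the 10-100 Gbps range on the slide.
for gbps in (10, 100):
    print(f"100 GB at {gbps} Gbps: {transfer_seconds(100, gbps):.0f} s")
```

In this idealized case a 100 GB data set moves in 80 seconds at 10 Gbps and 8 seconds at 100 Gbps. Measured rates would be lower.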

Page 27: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Next Step: Global Research Platform Building on CENIC/Pacific Wave and GLIF

Current International GRP Partners

Page 28: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

Mirror Cell Image Library Infrastructure and Data Management Workflows at Singapore’s NSCC

Cell Image Library Designed For “Big Data” Leverages High Bandwidth Connected High Performance Storage and Computing Resources

Source: Mark Ellisman & Steve Peltier, NCMIR, UCSD

Page 29: A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research