jsm madduri-august-2015
TRANSCRIPT
![Page 1: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/1.jpg)
globus.org/genomics
Finding Needles in a Haystack – Big Data Management and Analysis using Globus
Ravi Madduri
JSM 2015, Seattle, Washington
![Page 2: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/2.jpg)
Who We Are
• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
![Page 3: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/3.jpg)
Finding needles in haystacks
• Pose question
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Hypothesize explanation
• Test hypothesis
• Publish results
![Page 4: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/4.jpg)
Imagine if a researcher, when tackling a problem, could easily:
• Assemble, integrate, and interpret all relevant data within a knowledge network
• Be informed of anomalies, patterns, gaps
• Formulate & apply computational models
• Outsource tasks if local expertise lacking
• Launch automated processes to test hypotheses, expand knowledge network
• Pay for all this by taking on other tasks
![Page 5: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/5.jpg)
We will cover
• Accelerating Scientific Discovery Process by providing Science as a Service
– Research Data Management
– Analyzing Research Data
• Interactive Analysis
• Large-scale Analysis
– Publishing Results so others can
• Discover
• Validate
• Reproduce/Use
![Page 6: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/6.jpg)
90% of cancer patients carry a mutation that may be responsive to a known drug
Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York in Nature, April, 2015
![Page 7: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/7.jpg)
Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack
– Nancy Cox (Vanderbilt)
![Page 8: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/8.jpg)
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”
Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks
![Page 9: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/9.jpg)
How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine?
Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia
![Page 10: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/10.jpg)
Managing big data with Globus
1. PI initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. PI selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
• SaaS: Only a web browser required
• Access using your campus credentials
• Globus monitors and informs throughout
Diagram labels: Light Source, Compute Facility, Personal Computer, Publication Repository
![Page 11: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/11.jpg)
Globus Platform-as-a-Service
Identity, Group, Profile Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
Globus APIs
Globus Connect
![Page 12: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/12.jpg)
Globus Adoption and Usage
• 166,449 active Globus endpoints
• 27,961 users registered
• Biggest transfer: 500.42 TB
• Longest running transfer: 182 days
• Fastest transfer: 58.5 Gbps (average)
• 55 TB moved per day, on average, since the service was launched in November 2010
• Average throughput: 637.7 Mbps (since service launch)
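To put the usage figures above on a common scale, the following sketch converts them between units. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s) and constant rates, neither of which the slide states; the function names are illustrative only.

```python
# Unit conversions for the usage figures quoted on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes) and constant transfer rates.

TB = 10**12              # bytes per terabyte (decimal)
SECONDS_PER_DAY = 86_400


def tb_per_day_to_gbps(tb_per_day: float) -> float:
    """Sustained rate in Gbps needed to move `tb_per_day` terabytes every day."""
    bits_per_day = tb_per_day * TB * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def transfer_days(size_tb: float, rate_mbps: float) -> float:
    """Days needed to move `size_tb` terabytes at a constant `rate_mbps`."""
    seconds = size_tb * TB * 8 / (rate_mbps * 1e6)
    return seconds / SECONDS_PER_DAY


daily_rate_gbps = tb_per_day_to_gbps(55)          # 55 TB/day as a sustained rate
days_at_average = transfer_days(500.42, 637.7)    # biggest transfer at average throughput
days_at_fastest = transfer_days(500.42, 58_500)   # biggest transfer at the fastest rate

print(f"55 TB/day is about {daily_rate_gbps:.2f} Gbps sustained")
print(f"500.42 TB at 637.7 Mbps takes about {days_at_average:.1f} days")
print(f"500.42 TB at 58.5 Gbps takes about {days_at_fastest:.2f} days")
```

The comparison is only illustrative: the slide does not say at what rate the biggest transfer actually ran.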
![Page 13: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/13.jpg)
Analyzing Big Data using Globus Galaxies
Data management
• Data sources and destinations shown: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab
• Transfer protocols shown: FTP, SCP, HTTP, others
• Globus provides for high-performance, fault-tolerant, secure file transfer between all data-endpoints
Data analysis
• Globus Genomics on Amazon EC2
• Galaxy Data Libraries
• Workflow steps shown: Fastq, Ref Genome; Alignment; Variant Calling (Picard, GATK)
• Analytical tools are automatically run on the scalable compute resources when possible
Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creations
• Easily modify workflows with new tools
![Page 14: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/14.jpg)
Our Science Stack
• Galaxy (SaaS)
– Interactive execution, iPython, R
– Creation, Execution, Sharing, Discovering Workflows
• Globus (PaaS)
– Data management
– Identity Management
• AWS (IaaS)
– HTCondor, Chef, EC2, EBS, S3, SNS
– Spot, Route 53, Cloud Formation
![Page 15: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/15.jpg)
Examples of what researchers have done
![Page 16: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/16.jpg)
Cox lab, UChicago
• 134 samples and 4 workflows
• 4 TB data initially
• 2200 core hours in 6 days
![Page 17: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/17.jpg)
Consensus Caller
![Page 18: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/18.jpg)
Figure panels:
• Rediscovery of previously observed variants
• Transition/Transversion Ratio
• Genotype Mendel Error Rate
• Distributions of Mendel Error Counts per Trio
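Two of the quality metrics named on this slide have simple textbook definitions, sketched below. This is not the code behind the plots: it assumes biallelic SNVs given as (ref, alt) pairs and autosomal genotypes given as unphased allele pairs, and all function names are hypothetical.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A<->G and C<->T substitutions are transitions; other SNVs are transversions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs) -> float:
    """Ratio of transitions to transversions over a list of (ref, alt) SNVs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def is_mendel_error(father, mother, child) -> bool:
    """True when the child's genotype cannot be formed from one allele per parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) not in possible


# Toy data, not taken from the study.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(example_snvs))                              # 3 transitions, 1 transversion
print(is_mendel_error(("A", "A"), ("A", "G"), ("G", "G")))    # child needs G from both parents
```

A per-trio Mendel error count is then the number of sites at which `is_mendel_error` is true for that trio.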
![Page 19: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/19.jpg)
Contaminated Samples
![Page 20: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/20.jpg)
Olopade lab, UChicago
A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
• 200 targeted exomes
• 200 GB data initially
• 76,920 core hours in 1.25 days
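The core-hour and wall-clock figures imply an average level of parallelism, which the short calculation below makes explicit. It assumes the quoted days are elapsed wall-clock time and that load was spread evenly, which the slides do not state; the Cox lab figures (2200 core hours in 6 days) are included for comparison.

```python
def average_cores(core_hours: float, wall_days: float) -> float:
    """Average number of cores busy at once, assuming evenly spread load."""
    return core_hours / (wall_days * 24)


olopade_cores = average_cores(76_920, 1.25)   # Olopade lab run
cox_cores = average_cores(2_200, 6)           # Cox lab run

print(f"Olopade lab: about {olopade_cores:.0f} cores on average")
print(f"Cox lab: about {cox_cores:.1f} cores on average")
```

Peak usage on elastically provisioned resources may differ considerably from these averages.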
![Page 21: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/21.jpg)
Expanding Consensus Genotyper – SNVs, Indels, SVs
Input: RAW FASTQs
Callers (each producing a VCF):
• GATK Pipeline/HC
• FreeBayes
• SAMtools mpileup
• GATK Pipeline/UG
• Atlas2
• Delly/Contra
Output: Consensus Genotyper combines the caller VCFs into a single VCF
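The slide does not describe how the Consensus Genotyper combines the per-caller VCFs. One common approach, shown here purely as an illustrative sketch and not as the authors' method, is to keep variants reported by at least a chosen number of callers. Variants are represented as hypothetical (chromosome, position, ref, alt) tuples rather than parsed VCF records, and the data are toy values.

```python
from collections import Counter


def consensus_calls(calls_by_caller: dict, min_callers: int) -> list:
    """Return variants reported by at least `min_callers` callers, sorted."""
    support = Counter()
    for variants in calls_by_caller.values():
        support.update(set(variants))   # each caller counts once per variant
    return sorted(v for v, n in support.items() if n >= min_callers)


# Toy call sets keyed by caller name; positions are made up.
toy_calls = {
    "GATK Pipeline/HC": {("17", 100, "A", "G"), ("13", 200, "C", "T")},
    "FreeBayes": {("17", 100, "A", "G")},
    "SAMtools mpileup": {("17", 100, "A", "G"), ("13", 200, "C", "T"), ("2", 5, "G", "A")},
}

majority = consensus_calls(toy_calls, min_callers=2)
unanimous = consensus_calls(toy_calls, min_callers=3)
print(majority)
print(unanimous)
```

Raising `min_callers` trades sensitivity for specificity; a real implementation would also need to normalize indel representations before comparing calls.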
![Page 22: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/22.jpg)
Preliminary Results are very encouraging
14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing.
![Page 23: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/23.jpg)
Combining Data management and Analysis
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
3. Publish bags
4. Discover published data and execute comparison workflow
Diagram labels: ERMrest; PPMI, ADNI; Adenocarcinoma, Adrenal, Brain; QC, Alignment, Feature count; Alignment Files; BDDS Collection; Differential expression
Links shown: http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y
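The steps above move data around as "bags". Assuming this refers to BagIt-style packages (a payload directory plus a checksum manifest), which the slide does not spell out, the sketch below shows the basic idea using only the Python standard library. The helper names and file contents are hypothetical, and a real bag would normally carry more metadata.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> Path:
    """Write `payload` (file name -> bytes) as a minimal BagIt-style bag."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration file, then the payload manifest.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "alignment_bag", {"sample1.txt": b"toy alignment output\n"})
    intact = verify_bag(bag)                                   # checked right after creation
    (bag / "data" / "sample1.txt").write_bytes(b"corrupted")   # simulate damage in transit
    tampered = verify_bag(bag)

print(intact, tampered)
```

The manifest is what lets a recipient confirm, after a transfer, that the payload matches what was published.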
![Page 24: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/24.jpg)
Gene Expression Results
![Page 25: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/25.jpg)
Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s million core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 1 PB largest single transfer to do
• 5 days longest running workflow
• 100s different species
![Page 26: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/26.jpg)
Other Globus Genomics users
• Dobyns Lab
• Cox Lab
• Volchenboum Lab
• Olopade Lab
• Nagarajan Lab
![Page 27: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/27.jpg)
Costs are remarkably low
Pricing includes
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support
![Page 28: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/28.jpg)
Globus Genomics – Making it routine to find needles in NGS haystacks
www.globus.org/genomics
![Page 29: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/29.jpg)
Other Examples of Science as a Service
• PDACS - Portal for data analysis services for cosmological simulations
• CVRG Galaxy – Large-scale ECG Data Analysis
• Globus Proteomics
• eMatter – Material Science Simulations
• FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)
![Page 30: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/30.jpg)
• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org
![Page 31: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/31.jpg)
Our work is supported by: U.S. DEPARTMENT OF ENERGY
![Page 32: Jsm madduri-august-2015](https://reader036.vdocuments.net/reader036/viewer/2022062523/58ed8ce01a28ab0d278b4673/html5/thumbnails/32.jpg)
Thank you!
@madduri