Genomic-scale Data Pipelines - YOW! Conferences
TRANSCRIPT
![Page 1: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/1.jpg)
Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
![Page 2: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/2.jpg)
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Natalie Twine, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
![Page 3: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/3.jpg)
Big Data in 2025… Petabytes?
[Bar chart, axis 0 to 2500 petabytes. Recoverable labels: Astronomy, YouTube. Bar values: 1000, 17, 2000.]
![Page 4: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/4.jpg)
Genome holds the blueprint for every cell
![Page 5: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/5.jpg)
It affects looks, disease risk, and behavior
![Page 6: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/6.jpg)
GENOMIC Big Data in 2025 - Exabytes
[Bar chart, axis 0 to 25 exabytes. Recoverable labels: Astronomy, YouTube, Genomic. Bar values: 1, 0.17, 2, 20.]
![Page 7: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/7.jpg)
VCF Data
![Page 8: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/8.jpg)
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
![Page 9: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/9.jpg)
Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: cases vs. controls, compared across Gene1 and Gene2]
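The oversimplified rule on this slide can be sketched in a few lines of Python. The genotype flags, sample labels, and variant names below are made up for illustration. Real input would come from VCF data, and a real association study is statistical rather than an exact match.

```python
from typing import Dict, List


def candidate_variants(
    genotypes: Dict[str, List[int]],
    is_case: List[bool],
) -> List[str]:
    """Return variants present in every case and absent in every control.

    genotypes maps a variant name to one 0/1 flag per sample
    (1 = the sample carries the variant); is_case marks affected samples.
    """
    hits = []
    for variant, calls in genotypes.items():
        in_all_cases = all(c == 1 for c, case in zip(calls, is_case) if case)
        in_no_controls = all(c == 0 for c, case in zip(calls, is_case) if not case)
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return hits


# Hypothetical toy data: three cases followed by three controls.
is_case = [True, True, True, False, False, False]
genotypes = {
    "gene1_var": [1, 1, 1, 0, 0, 0],  # fits the rule
    "gene2_var": [1, 0, 1, 0, 1, 0],  # missing in one case, seen in one control
}
print(candidate_variants(genotypes, is_case))
```

With millions of variants and thousands of samples this exact-match loop stops being useful, which is where the machine learning approach later in the talk comes in.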
![Page 10: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/10.jpg)
Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
• Location
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
![Page 11: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/11.jpg)
Cloud Data Pipeline Pattern
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
![Page 12: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/12.jpg)
Machine Learning Pipeline Pattern
![Page 13: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/13.jpg)
What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark
• For usability in bioinformatics: create a domain-specific ML API (library)
• For global use: leverage Cloud Pipeline Patterns
![Page 14: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/14.jpg)
GWAS Analysis with Variant-Spark
On-premise Cluster with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center
![Page 15: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/15.jpg)
Why Apache Spark?
![Page 16: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/16.jpg)
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
![Page 17: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/17.jpg)
Supervised ML: Wide Random Forests
![Page 18: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/18.jpg)
Solving Important Questions…Cancer genomics?
![Page 19: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/19.jpg)
DEMO: Who is a Hipster?
![Page 20: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/20.jpg)
VariantSpark & Databricks Notebook
databricks Notebook
![Page 21: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/21.jpg)
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Plot: Speed (low to high) against Accuracy (low to high)]
![Page 22: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/22.jpg)
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h (AWS: ~$215.50)
100K trees: 200 – 2000h (AWS: ~$8620.00)
• YARN cluster
• 12 workers
• 16 x Intel Xeon [email protected] CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset
[Plot regions: Whole Genome Range, GWAS Range]
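The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes each quoted AWS cost corresponds to the upper end of its runtime range, which the slide does not state. The hourly rate it prints is therefore an inferred number, not one from the talk.

```python
# Figures copied from the slide; names of the dictionary keys are illustrative.
runs = {
    "5-50h run": {"max_hours": 50, "aws_cost_usd": 215.50},
    "200-2000h run": {"max_hours": 2000, "aws_cost_usd": 8620.00},
}


def implied_hourly_rate(cost_usd: float, hours: float) -> float:
    """Cost divided by hours, rounded to cents (an inferred figure)."""
    return round(cost_usd / hours, 2)


rates = {
    name: implied_hourly_rate(run["aws_cost_usd"], run["max_hours"])
    for name, run in runs.items()
}

# Executor memory: 128 executors at 6 GB each, expressed in TB.
executors = 128
gb_per_executor = 6
total_memory_tb = executors * gb_per_executor / 1024

print(rates)
print(total_memory_tb)
```

Both runs imply the same hourly rate under that assumption, and the executor memory matches the 0.75TB given in brackets on the slide.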
![Page 23: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/23.jpg)
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
![Page 24: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/24.jpg)
Future Directions for VariantSpark RF
Additional feature types
Unordered Categorical
For Scores -Continuous
Different feature ranges
Small and Big Inputs
For Gene Expression analysis
![Page 25: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/25.jpg)
Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
Editing does not work every time, e.g. only 7 in 10 embryos were mutation free
Aim: Develop a computational guidance framework to enable edits the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
![Page 26: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/26.jpg)
Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks, so that it takes only seconds
• SCALE: Researchers might want to search for targets in one gene or in 100,000
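As a rough illustration of breaking one search into parallel tasks, the sketch below splits a sequence into overlapping chunks and scans each chunk independently. The motif `"GG"`, the chunk size, and the function names are placeholders. This is not GT-Scan2's actual search, and the thread pool only stands in for independent workers such as separate function invocations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_with_overlap(seq: str, chunk: int, overlap: int) -> List[Tuple[int, str]]:
    """Cut seq into (offset, piece) pairs; pieces overlap so that a match
    straddling a chunk boundary is still found."""
    return [(i, seq[i:i + chunk + overlap]) for i in range(0, len(seq), chunk)]


def find_motif(task: Tuple[int, str], motif: str = "GG") -> List[int]:
    """Return absolute start positions of motif within one piece."""
    offset, piece = task
    hits = []
    pos = piece.find(motif)
    while pos != -1:
        hits.append(offset + pos)
        pos = piece.find(motif, pos + 1)
    return hits


def parallel_search(seq: str, chunk: int = 8, motif: str = "GG") -> List[int]:
    """Search every chunk as an independent task and merge the results."""
    tasks = split_with_overlap(seq, chunk, overlap=len(motif) - 1)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: find_motif(t, motif), tasks))
    return sorted({p for hits in results for p in hits})


sequence = "ACGGTTGGAAGGCCGGTA"  # made-up sequence
print(parallel_search(sequence))
```

Because each chunk is self-contained, the same split works whether there is one task or many thousands, which is the scaling property the slide is after.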
Scalability + Agility =
![Page 27: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/27.jpg)
One of the first Serverless Applications in Research
Featured in
This is My Architecture
![Page 28: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/28.jpg)
GT-Scan2
![Page 29: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/29.jpg)
Considering Services for GT-Scan2
• Use AWS Step Functions
  • Simplify workflow
  • Simplify task timeouts
  • Simplify task failures
• Must evaluate costs
  • SNS vs. Step Functions
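To make the timeout and failure points concrete, here is a minimal sketch of a Step Functions state machine definition, written as a Python dictionary and serialized to Amazon States Language JSON. The state names, the ARN, and the timeout and retry values are all hypothetical. The talk does not show the GT-Scan2 definition.

```python
import json

# Placeholder ARN: not a real resource.
SEARCH_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:search-targets"

state_machine = {
    "Comment": "Hypothetical single-task workflow with timeout and failure handling",
    "StartAt": "SearchTargets",
    "States": {
        "SearchTargets": {
            "Type": "Task",
            "Resource": SEARCH_LAMBDA_ARN,
            # Timeout handled declaratively instead of inside the function code.
            "TimeoutSeconds": 60,
            # Retry timed-out attempts with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Any error left after retries is routed to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
            "End": True,
        },
        "ReportFailure": {
            "Type": "Fail",
            "Error": "SearchFailed",
            "Cause": "Search task did not complete after retries",
        },
    },
}

definition_json = json.dumps(state_machine, indent=2)
print(definition_json)
```

The appeal is that timeouts and failure routing live in the workflow definition rather than in each function. Whether that is worth the cost compared with SNS-based chaining is the evaluation the slide calls for.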
![Page 30: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/30.jpg)
Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Analyze/GWAS (Spark)
• Data: vcf -> S3/Hadoop
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook (SQL, R or Python)
2. Search/GTScan2 (Serverless)
• Data: S3/fastq -> DynamoDB; S3/fastq, bed
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
![Page 31: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/31.jpg)
Spark Pipeline Pattern
Jupyter Notebook
![Page 32: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/32.jpg)
Serverless Architecture Pattern
Lambda function
1
Lambda function
2
Lambda function
3
buckets with objects DynamoDB
API Gateway Users
Step Functions
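One piece of this pattern, a Lambda function behind API Gateway that records a job, might look roughly like the sketch below. The request field `gene`, the job record layout, and the `save_job` callback are assumptions for illustration. In a deployed version the callback would be a DynamoDB write, which is left out here so the example runs without AWS access.

```python
import json
from typing import Any, Callable, Dict


def make_handler(save_job: Callable[[Dict[str, Any]], None]):
    """Build a Lambda-style handler; save_job stands in for a DynamoDB write."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # API Gateway proxy integration passes the request body as a JSON string.
        body = json.loads(event.get("body") or "{}")
        gene = body.get("gene")
        if not gene:
            return {"statusCode": 400, "body": json.dumps({"error": "gene is required"})}
        job = {"job_id": f"job-{gene}", "gene": gene, "status": "QUEUED"}
        save_job(job)
        return {"statusCode": 202, "body": json.dumps(job)}

    return handler


# Local usage: an in-memory list replaces the database table.
saved = []
handler = make_handler(saved.append)
response = handler({"body": json.dumps({"gene": "BRCA1"})}, None)
print(response["statusCode"], saved)
```

Injecting the storage call keeps each function small and testable on its own, which fits the "validate sections" step of the pipeline pattern.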
![Page 33: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/33.jpg)
Cloud Genomic Data Pipelines
• Problem #1 – Analyze
  • Find the mutated genes
  • Solution: Spark-based machine learning
• Problem #2 – Scan
  • Find the nucleotide (DNA letters)
  • Solution: Serverless
![Page 34: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks](https://reader035.vdocuments.net/reader035/viewer/2022062602/5ec464b4db60ee0b64135dfe/html5/thumbnails/34.jpg)
Genomics Big Data Pipelines
Dr. Denis Bauer & Lynn Langit