writing data analysis pipeline as ruby gem
![Page 1: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/1.jpg)
Writing Data Analysis Pipeline As Ruby Gem
Shi-Gang Wang
![Page 2: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/2.jpg)
About me
{
  name: 'Shi-Gang Wang (Sean)',
  email: '[email protected]',
  working_at: ,
  role: ['software engineer'],
  language: 'ruby',
  github: 'https://github.com/seansg'
}
![Page 3: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/3.jpg)
Outline
❖ What is a pipeline
❖ Disassemble a pipeline
❖ Queue a pipeline
![Page 4: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/4.jpg)
?
![Page 5: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/5.jpg)
![Page 6: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/6.jpg)
pineapple.txt
![Page 7: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/7.jpg)
pineapple.txt
cat pineapple.txt
![Page 8: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/8.jpg)
pineapple.txt
cat pineapple.txt
cat pineapple.txt | grep apple
![Page 9: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/9.jpg)
pineapple.txt
cat pineapple.txt
cat pineapple.txt | grep apple
cat pineapple.txt | grep apple | wc -l
![Page 10: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/10.jpg)
Write scripts that each do one thing
Make the scripts work together
=> Pipeline
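The same idea can be sketched in Ruby. This is only an illustration: the three lambdas stand in for `cat`, `grep apple` and `wc -l`, and the sample lines are made up rather than taken from the real pineapple.txt.

```ruby
# Each lambda does one job, mirroring one command in the shell pipeline.
def count_matches(lines, word)
  cat  = ->(input) { input }                                  # like `cat`: pass lines through
  grep = ->(input) { input.select { |l| l.include?(word) } }  # like `grep word`
  wc_l = ->(input) { input.size }                             # like `wc -l`

  # Proc#>> (Ruby 2.6+) composes left to right, much like `|` in the shell.
  (cat >> grep >> wc_l).call(lines)
end

# Made-up sample lines standing in for pineapple.txt.
sample = ["pineapple", "banana", "apple pie", "cherry"]
puts count_matches(sample, "apple")
```

Because every step takes lines in and hands a result on, steps can be swapped or reordered without touching the others.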
![Page 11: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/11.jpg)
![Page 12: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/12.jpg)
Take CAGNUT as an example
![Page 13: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/13.jpg)
CAGNUT
❖ Computational and Analytical Gear for Nucleic acid Utilitarian Techniques
❖ DNA analysis pipeline
❖ Burrows-Wheeler Aligner (BWA) — in C
❖ Sequence Alignment/Map tools (SAMtools) — in C
❖ Genome Analysis Toolkit (GATK) — in Java
❖ Picard — in Java
❖ Generate bash scripts
![Page 14: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/14.jpg)
A Genome Analysis Flowchart
https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png
![Page 15: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/15.jpg)
![Page 16: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/16.jpg)
Demo
![Page 17: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/17.jpg)
How to write the pipeline?
![Page 18: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/18.jpg)
How to write the pipeline?
How to disassemble the pipeline?
![Page 19: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/19.jpg)
Think about the pipeline structure
Pipeline
Tools
CAGNUT Core
![Page 20: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/20.jpg)
Think about the pipeline structure
Pipeline
CAGNUT Core
Tools
![Page 21: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/21.jpg)
Write all parts as ruby gems
![Page 22: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/22.jpg)
Benefits of ruby gems
❖ Reuse
❖ Debug
❖ Maintain
❖ Share
![Page 23: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/23.jpg)
Difficulties
❖ Usage
❖ Integration of tools
❖ Execution order
❖ Automation
![Page 24: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/24.jpg)
Prepare work — Define help
❖ “Help” output shows users how to run each command of the pipeline
![Page 25: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/25.jpg)
Prepare work — Namespace
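One way to namespace the gems is to hang every tool under a shared top-level module. The sketch below assumes that layout; `CagnutDemo`, its submodules and `tool_name` are hypothetical names chosen for illustration, not the real gems' constants.

```ruby
# Hypothetical layout: every gem puts its code under one shared top-level module.
module CagnutDemo
  module Bwa          # would live in its own tool gem
    def self.tool_name
      "bwa"
    end
  end

  module Samtools     # another tool gem, same top-level namespace
    def self.tool_name
      "samtools"
    end
  end

  # List the tool namespaces that have been loaded so far.
  def self.tools
    constants.map { |c| const_get(c) }.select { |m| m.respond_to?(:tool_name) }
  end
end

puts CagnutDemo.tools.map(&:tool_name).sort.inspect
```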
![Page 26: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/26.jpg)
Skills of writing the gems
Part 1 — Tool gems
Part 2 — Pipeline gem
Part 3 — CAGNUT core gem
![Page 27: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/27.jpg)
Part 1 — Tool gems
❖ Tool written as a Singleton
❖ Tool methods written in classes
❖ Job script generation
Pipeline
Tools
CAGNUT Core
![Page 28: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/28.jpg)
Tool written in Singleton
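A minimal sketch of a tool as a singleton, using Ruby's standard `Singleton` module. The class name and the `executable` setting are assumptions made for illustration.

```ruby
require "singleton"

module SingletonSketch
  # One shared instance per tool, so its settings are defined in one place.
  class Bwa
    include Singleton

    attr_accessor :executable   # hypothetical setting: path to the binary

    def initialize
      @executable = "bwa"
    end
  end
end

first  = SingletonSketch::Bwa.instance
second = SingletonSketch::Bwa.instance
first.executable = "/opt/bwa/bwa"   # hypothetical path
puts second.executable              # the change is visible through every reference
puts first.equal?(second)
```

`Singleton` makes `new` private, so callers can only reach the tool through `instance`.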
![Page 29: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/29.jpg)
Tool method written in class
![Page 30: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/30.jpg)
Get specific variables from another class
❖ Use Forwardable
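A sketch of how `Forwardable` can expose selected values from another object. `SampleConfig` and `Aligner` are hypothetical classes; only `def_delegators` and the `bwa mem -t` form are taken as given.

```ruby
require "forwardable"

# Hypothetical holder of shared settings.
class SampleConfig
  attr_reader :reference, :threads

  def initialize(reference:, threads:)
    @reference = reference
    @threads = threads
  end
end

# A tool method class that reads only the settings it needs.
class Aligner
  extend Forwardable

  # Forward #reference and #threads to @config instead of copying the values.
  def_delegators :@config, :reference, :threads

  def initialize(config)
    @config = config
  end

  def command(fastq)
    "bwa mem -t #{threads} #{reference} #{fastq}"
  end
end

config = SampleConfig.new(reference: "ref.fa", threads: 4)
puts Aligner.new(config).command("sample.fq")
```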
![Page 31: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/31.jpg)
Job script generation
❖ Use Tilt
  ❖ Generic interface to multiple Ruby template engines
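Generating a job script from a template might look like the sketch below. To stay dependency-free it calls the standard library's ERB directly instead of going through Tilt, which can wrap ERB among other engines. The template text and the `render_job` helper are assumptions.

```ruby
require "erb"

# Hypothetical job script template; a real one would likely live in its own file.
JOB_TEMPLATE = <<~SCRIPT
  #!/bin/bash
  # job: <%= name %>
  <%= command %> > <%= output %>
SCRIPT

# Fill the template with the values for one job and return the script text.
def render_job(name:, command:, output:)
  ERB.new(JOB_TEMPLATE).result_with_hash(name: name, command: command, output: output)
end

puts render_job(name: "align", command: "bwa mem ref.fa sample.fq", output: "sample.sam")
```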
![Page 32: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/32.jpg)
Part 2 — Pipeline gem
❖ Require tool gems
❖ Create workflow with tool gems
❖ Generate the job list
Pipeline
Tools
CAGNUT Core
![Page 33: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/33.jpg)
Require tool gems
❖ Load the Bundler environment
![Page 34: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/34.jpg)
Create workflow with tool gems
❖ Composed of tool gems
❖ Order
❖ Dependency
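Order and dependency can be modelled as a graph and sorted topologically. The sketch below uses the standard library's `TSort`; it is not necessarily how CAGNUT orders its jobs, and the class and step names are hypothetical.

```ruby
require "tsort"

class WorkflowSketch
  include TSort

  def initialize
    @prerequisites = {}
  end

  # Register a step together with the steps that must finish before it.
  def step(name, after: [])
    @prerequisites[name] = Array(after)
    self
  end

  # Steps in an order that respects every declared dependency.
  def job_list
    tsort
  end

  # Hooks required by TSort: enumerate the nodes, and each node's prerequisites.
  def tsort_each_node(&block)
    @prerequisites.each_key(&block)
  end

  def tsort_each_child(name, &block)
    @prerequisites.fetch(name).each(&block)
  end
end

flow = WorkflowSketch.new
flow.step(:call_variants, after: :mark_duplicates)
    .step(:mark_duplicates, after: :sort)
    .step(:sort, after: :align)
    .step(:align)
puts flow.job_list.inspect
```

The steps are declared in reverse on purpose: the resulting job list depends on the declared dependencies, not on declaration order. A circular dependency raises `TSort::Cyclic`.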
![Page 35: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/35.jpg)
Generate the job list
![Page 36: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/36.jpg)
Part 3 — CAGNUT core gem
❖ Project template preparation
❖ Parameter handling
❖ Tool-specific method overwriting
❖ Job control
Pipeline
Tools
CAGNUT Core
![Page 37: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/37.jpg)
Project template preparation
❖ Define bundle as a Thor command
![Page 38: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/38.jpg)
Parameter handling
❖ Use OptionParser
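A small sketch of parameter handling with the standard library's `OptionParser`. The option names (`--threads`, `--queue`) and the banner are made up for illustration; the same parser also produces the help text.

```ruby
require "optparse"

# Parse an argument list and return [options, remaining arguments, help text].
def parse_options(argv)
  options = { threads: 1, queue: nil }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: pipeline [options] SAMPLE"   # hypothetical command name
    opts.on("-t", "--threads N", Integer, "Number of threads") { |n| options[:threads] = n }
    opts.on("-q", "--queue NAME", "Queue to submit jobs to")   { |q| options[:queue] = q }
  end

  rest = parser.parse(argv)   # non-option arguments are returned, argv is left untouched
  [options, rest, parser.help]
end

options, rest, help = parse_options(%w[-t 4 --queue long sample1])
puts options.inspect
puts rest.inspect
puts help
```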
![Page 39: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/39.jpg)
Tool-specific method overwrite
❖ One tool, one configuration
❖ Use “prepend” to overwrite
dev.af83.com/2012/10/19/ruby-2-0-module-prepend.html
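A sketch of overwriting a tool method with `Module#prepend` (Ruby 2.0+). The module, class and method names are hypothetical, and the memory values are arbitrary.

```ruby
module PrependSketch
  # A tool class with a default setting.
  class Gatk
    def java_options
      "-Xmx4g"
    end

    def command
      "java #{java_options} -jar GenomeAnalysisTK.jar"
    end
  end

  # Tool-specific configuration that should win over the default.
  module GatkConfig
    def java_options
      "-Xmx16g"
    end
  end
end

# prepend places the module ahead of the class in the method lookup chain,
# so its java_options is found first; the class's version stays reachable via super.
PrependSketch::Gatk.prepend(PrependSketch::GatkConfig)
puts PrependSketch::Gatk.new.command
```

With `include` the class's own method would still be found first, which is why `prepend` is the tool for overwriting.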
![Page 40: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/40.jpg)
Jobs Control — desktop run
❖ wait $!
❖ Detach child processes so they do not become zombie processes
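The two job-control ideas can be sketched from Ruby with `Process.spawn`. The helper names are hypothetical, and the running Ruby interpreter stands in for a real job script.

```ruby
require "rbconfig"

# Start a child and block until it exits, like `cmd & wait $!` in bash.
def run_and_wait(*command)
  pid = Process.spawn(*command)
  _, status = Process.wait2(pid)
  status.exitstatus
end

# Start a child without waiting for it. Process.detach starts a thread that
# reaps the child when it exits, so no zombie entry is left behind.
def run_detached(*command)
  pid = Process.spawn(*command)
  Process.detach(pid)
end

ruby = RbConfig.ruby   # path of the running interpreter
puts run_and_wait(ruby, "-e", "exit 3")
watcher = run_detached(ruby, "-e", "sleep 0.1")
puts watcher.value.success?
```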
![Page 41: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/41.jpg)
![Page 42: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/42.jpg)
If the data is large
or
much larger, like the human genome
![Page 43: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/43.jpg)
The size of the human genome is
3 × 10^9 base pairs (bps)
![Page 44: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/44.jpg)
Each base pair takes 2 bits
(you can use 00, 01, 10, and 11 for T, G, C and A)
2 × 3 × 10^9 bits = 6 × 10^9 bits
= 7.5 × 10^8 bytes = ~700 MB
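The arithmetic can be checked in a few lines of Ruby. The method name is hypothetical; the numbers are the ones on the slide. The result is 750 million bytes, which is about 715 MiB when divided by 1024², in line with the rounded ~700 MB figure.

```ruby
# Size of a genome stored at a fixed number of bits per base pair.
def genome_bytes(base_pairs: 3 * 10**9, bits_per_base: 2)
  base_pairs * bits_per_base / 8   # 8 bits per byte
end

bytes = genome_bytes
puts bytes                           # total bytes
puts((bytes / 1024.0**2).round)      # the same size in MiB, rounded
```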
![Page 45: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/45.jpg)
In a perfect world: ~700 MB
(just 3 billion letters)
In the real world: ~200 GB
(right off the genome sequencer)
![Page 46: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/46.jpg)
Crash your desktop/laptop!
![Page 47: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/47.jpg)
Long wait …
![Page 48: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/48.jpg)
Resource allocation
![Page 49: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/49.jpg)
Resource allocation
❖ Specifying the memory used by the program
❖ Using a queueing system
![Page 50: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/50.jpg)
What is a Queueing System?
![Page 51: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/51.jpg)
Queueing System
❖ Queue
  ❖ The list of waiting jobs
❖ Queueing System
  ❖ Waiting jobs + servers
[Diagram: waiting jobs A, B, C and D enter the system, are served by Server 1, Server 2, ..., Server n, and leave as finished jobs]
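A toy model of the picture above: waiting jobs sit in a queue and a fixed number of "servers" (threads here) take them one at a time. It only illustrates the concept, not a real cluster scheduler, and `run_queue` is a hypothetical helper.

```ruby
# Run every job on one of `servers` workers and report who finished what.
def run_queue(jobs, servers: 2)
  waiting = Queue.new
  jobs.each { |job| waiting << job }
  waiting.close                      # once empty and closed, pop returns nil

  finished = Queue.new
  workers = Array.new(servers) do |i|
    Thread.new do
      while (job = waiting.pop)      # take the next waiting job, if any
        finished << [job, "server #{i + 1}"]
      end
    end
  end
  workers.each(&:join)

  Array.new(finished.size) { finished.pop }
end

run_queue(%w[A B C D], servers: 2).each do |job, server|
  puts "job #{job} finished on #{server}"
end
```

Which server picks up which job can differ between runs; the queue only guarantees that each job is handed out exactly once.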
![Page 52: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/52.jpg)
In a desktop computer
![Page 53: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/53.jpg)
Cluster Queues
![Page 54: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/54.jpg)
Queueing System
❖ Props
  ❖ Job scheduling
  ❖ Load balancing
  ❖ Batch job execution
![Page 55: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/55.jpg)
Queueing System
❖ Portable Batch System (PBS)
❖ Sun Grid Engine (SGE)
❖ Load Sharing Facility (LSF)
![Page 56: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/56.jpg)
Submit jobs to Queueing System
❖ Take LSF as an example
❖ Creating a job script
❖ Submitting the job
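A sketch of creating an LSF job script from Ruby. The `#BSUB` directives used (`-J`, `-q`, `-n`, `-o`, `-w`) are standard `bsub` options; the `lsf_script` helper, the job names and the queue name `normal` are assumptions, since queue names are site-specific.

```ruby
# Build the text of an LSF job script. `after` names a job that must finish first.
def lsf_script(name:, command:, queue:, cores: 1, after: nil)
  lines = [
    "#!/bin/bash",
    "#BSUB -J #{name}",            # job name
    "#BSUB -q #{queue}",           # target queue
    "#BSUB -n #{cores}",           # number of job slots
    "#BSUB -o #{name}.%J.out"      # output file; LSF replaces %J with the job ID
  ]
  lines << "#BSUB -w \"done(#{after})\"" if after   # dependency on an earlier job
  lines << command
  lines.join("\n") + "\n"
end

puts lsf_script(
  name: "sort",
  command: "samtools sort -o sample.sorted.bam sample.bam",
  queue: "normal",
  cores: 4,
  after: "align"
)
```

The script would then be saved to a file and submitted with `bsub < sort.sh`.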
![Page 57: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/57.jpg)
Demo
❖ Submit jobs to cluster
![Page 58: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/58.jpg)
Acknowledgement
https://cagnut.golden.io
https://goldenio.com
![Page 59: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/59.jpg)
Thanks
![Page 60: Writing data analysis pipeline as ruby gem](https://reader036.vdocuments.net/reader036/viewer/2022062400/5880ca051a28abba3b8b70af/html5/thumbnails/60.jpg)
Backup