![Page 1: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/1.jpg)
SeqMapReduce: software and web service for accelerating sequence mapping
Yanen Li
Department of Computer Science, University of Illinois at Urbana-Champaign
Email: [email protected]
10/05/2009, CAMDA 2009, Chicago
![Page 2: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/2.jpg)
Challenge of NGS Alignment
• Sequences: short (25 ~ 76 bp)
• Size of data set: large, still increasing
• BLAST?
BLAST: transaction / long query
NGS aligner: batch / short query
We need an INDEX!
![Page 3: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/3.jpg)
The NGS Aligner War
Where are you?
![Page 4: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/4.jpg)
NGS Aligner Classification
• Standalone Algorithms
Hash Reads: Eland, RMAP, MAQ, SHRiMP …
– Pros: less RAM, less overhead
– Cons: waste of genome scan
Hash Genome: SOAP, PASS, Mosaik, BFAST …
– Pros: fast, scale up well
– Cons: big RAM, heavy overhead
Index Genome (Burrows-Wheeler): Bowtie, BWA
![Page 5: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/5.jpg)
NGS Aligner Classification
| Parallel Algorithm Options | Things to Consider |
|---|---|
| Multi-thread | Hard to scale up to many cores |
| Cluster Computing | Load balancing, fault tolerance |
| Cloud Computing | Restricted programming interface |
![Page 6: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/6.jpg)
Programming Model of Cloud Computing
• MapReduce
Developer supplies two functions
– All v’ with the same k’ are reduced together
Simple framework usually can scale up well
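
As a rough illustration of the two-function model, the following is a minimal single-process sketch, not Hadoop code. The driver `run_mapreduce` and the base-counting `map_fn` / `reduce_fn` are hypothetical names chosen for this example; the only property taken from the slide is that all values v' sharing a key k' are reduced together.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory driver (hypothetical): apply map_fn to every (k, v)
    record, group the emitted (k', v') pairs by k', then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # All v' with the same k' are handed to one reduce_fn call.
    return {k: reduce_fn(k, values) for k, values in groups.items()}

def map_fn(chunk_id, sequence):
    """Example map: emit (base, 1) for every base in a sequence chunk."""
    for base in sequence:
        yield base, 1

def reduce_fn(base, counts):
    """Example reduce: sum the counts collected for one base."""
    return sum(counts)

if __name__ == "__main__":
    print(run_mapreduce([(0, "ACGA"), (1, "GGA")], map_fn, reduce_fn))
```

A real framework distributes the map calls and the grouping step across nodes; the developer-facing contract stays the same two functions.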
![Page 7: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/7.jpg)
Why Is Cloud Computing Attractive?
• Fit for Data Intensive Computing (DIC)
• NGS alignment is DIC in nature
• Hadoop – open-source Cloud Computing system
Built-in load balancing and fault tolerance
Easy to program
![Page 8: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/8.jpg)
Cloud Based NGS Aligner
| Hash Reads | Hash Genome | Hash Both |
|---|---|---|
| SeqMapReduce * | | CloudBurst * |

Hash/index Genome will be next
SeqMapReduce: hashes all reads in RAM on every node
CloudBurst: hashes reads and the genome, but not in RAM
![Page 9: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/9.jpg)
The SeqMapReduce Framework
![Page 10: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/10.jpg)
Inside SeqMapReduce
• Pre-processing: formatting the genome
Format once, use every time
Bases at the end are duplicated
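
One plausible reading of "bases at the end are duplicated" is that the genome is cut into chunks that overlap at their ends, so a read spanning a chunk boundary can still be found inside a single chunk. The sketch below assumes exactly that; the function name `chunk_genome`, its parameters, and the choice of an overlap of `read_len - 1` bases are assumptions for illustration, not details from the slides.

```python
def chunk_genome(genome, chunk_size, read_len):
    """Split a genome string into chunks that start every chunk_size bases.
    Each chunk is extended by read_len - 1 bases (assumed overlap) so that
    every window of length read_len lies entirely inside some chunk."""
    overlap = read_len - 1
    chunks = []
    start = 0
    while start < len(genome):
        end = min(start + chunk_size + overlap, len(genome))
        # Keep the start offset so hits can be mapped back to genome coordinates.
        chunks.append((start, genome[start:end]))
        start += chunk_size
    return chunks

if __name__ == "__main__":
    for offset, chunk in chunk_genome("ACGTACGTAC", chunk_size=4, read_len=3):
        print(offset, chunk)
```

Because the formatted chunks depend only on the genome and the read length, they can be written once and reused across runs, which matches "format once, use every time".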
![Page 11: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/11.jpg)
Inside SeqMapReduce
• Map phase: Seed & Filtering
Divide a read into K parts. If there are at most M mismatches, at least (K-M) parts are exactly matched
e.g. K=4, M=2: 4-2=2 parts exactly matched, C(4,2) = 6 combinations
We need only 6 hash tables
Genome seqs are scanned for potential hits
Then go to mismatch counting
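
The seed-and-filter idea can be sketched as follows. This is a simplified single-machine illustration under several assumptions: all reads have the same length, one hash table is built per combination of K-M parts, and the table key is the concatenation of those parts. The names `split_bounds`, `build_tables` and `candidate_hits` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def split_bounds(read_len, k):
    """Return (start, end) offsets that divide a read into k nearly equal parts."""
    base, extra = divmod(read_len, k)
    bounds, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def build_tables(reads, k, m):
    """Build one hash table per combination of (k - m) parts: C(k, k-m) tables."""
    bounds = split_bounds(len(reads[0]), k)
    tables = {}
    for combo in combinations(range(k), k - m):
        table = defaultdict(list)
        for read_id, read in enumerate(reads):
            key = "".join(read[s:e] for s, e in (bounds[i] for i in combo))
            table[key].append(read_id)
        tables[combo] = table
    return bounds, tables

def candidate_hits(genome, reads, k, m):
    """Scan the genome once; report (read_id, position) pairs whose seed parts
    match exactly. Candidates still need full mismatch counting afterwards."""
    bounds, tables = build_tables(reads, k, m)
    read_len = len(reads[0])
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo, table in tables.items():
            key = "".join(window[s:e] for s, e in (bounds[i] for i in combo))
            for read_id in table.get(key, ()):
                hits.add((read_id, pos))
    return hits
```

With K=4 and M=2 this builds 6 tables, and any alignment with at most 2 mismatches is guaranteed to appear among the candidates because at least 2 of the 4 parts must match exactly.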
![Page 12: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/12.jpg)
Inside SeqMapReduce
• Reduce Phase
Aggregating intermediate results
• Post Processing
Duplication detection
Mismatches counting
Final output report
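
A minimal sketch of the aggregation and duplicate-removal steps, assuming the intermediate results are (read ID, genome position) pairs. That record layout and the function name `reduce_hits` are assumptions for illustration; the slides do not specify the intermediate format.

```python
from collections import defaultdict

def reduce_hits(intermediate):
    """Group (read_id, position) pairs by read_id and drop duplicate positions,
    for example the same hit reported more than once by the map phase."""
    grouped = defaultdict(set)
    for read_id, position in intermediate:
        grouped[read_id].add(position)
    # Sorted positions give a stable order for the final report.
    return {read_id: sorted(positions) for read_id, positions in grouped.items()}

if __name__ == "__main__":
    print(reduce_hits([("r1", 10), ("r1", 10), ("r2", 5), ("r1", 3)]))
```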
![Page 13: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/13.jpg)
Inside SeqMapReduce
• Mismatches counting
Naive way: simple counting (O(N))
• Mismatches counting using bit operations
Bit-wise XOR (exclusive or)

| XOR | 00 | 01 | 10 | 11 |
|---|---|---|---|---|
| 00 | 00 | 01 | 10 | 11 |
| 01 | 01 | 00 | 11 | 10 |
| 10 | 10 | 11 | 00 | 01 |
| 11 | 11 | 10 | 01 | 00 |
![Page 14: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/14.jpg)
Mismatches counting
• Original R (read) and G (genome)
• W = R XOR G
• Define 2 constants
W1 = 10101010…
W2 = 01010101…
X = W & W1 (keep 10, clear 01, 11 => 10)
Y = W & W2 (keep 01, clear 10, 11 => 01)
Then Y << 1
N = POPCNT(X | Y)

W is a combination of 00 01 10 11

| W | 00 | 01 | 10 | 11 |
|---|---|---|---|---|
| W1 | 10 | 10 | 10 | 10 |
| X = W & W1 | 00 | 00 | 10 | 10 |
| W2 | 01 | 01 | 01 | 01 |
| Y = W & W2 | 00 | 01 | 00 | 01 |
| Y << 1 | 00 | 10 | 00 | 10 |
| X \| Y (after shift) | 00 | 10 | 10 | 10 |
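
The bit trick can be checked with a short sketch. The 2-bit encoding (A=00, C=01, G=10, T=11) and the helper names `pack` and `count_mismatches` are assumptions for this example; Python's arbitrary-precision integers stand in for machine words, and `bin(...).count("1")` stands in for a hardware POPCNT instruction.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit code

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODING[base]
    return value

def count_mismatches(read, genome_window):
    """Count mismatching bases between two equal-length sequences."""
    n = len(read)
    w = pack(read) ^ pack(genome_window)  # W = R XOR G; non-zero pair = mismatch
    w1 = int("10" * n, 2)                 # W1 = 1010...
    w2 = int("01" * n, 2)                 # W2 = 0101...
    x = w & w1                            # keep the high bit of each pair
    y = (w & w2) << 1                     # move the low bit onto the high bit
    return bin(x | y).count("1")          # one set bit per mismatching base

if __name__ == "__main__":
    print(count_mismatches("ACGT", "TCGA"))
```

After the OR, each mismatching base contributes exactly one set bit regardless of whether its XOR pattern was 01, 10 or 11, so a single population count gives the number of mismatches.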
![Page 15: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/15.jpg)
Web Service of SeqMapReduce
![Page 16: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/16.jpg)
Web Service of SeqMapReduce
• Input format: .zip of FASTA-format reads
• Reads can be uploaded through the web site
• Supports 13 model organisms
• Supports reads longer than 32 bp
• Up to 5 mismatches
• No indels in current version (will update soon)
• Output in ELAND format
• Free of charge for academics
• Users: small labs that want quick results but cannot afford expensive hardware and software
![Page 17: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/17.jpg)
Results on CAMDA 2009 datasets
• Pol II ChIP-seq FC201WVA_20080307_s_5 (4.5 million)
• IFNg stimulated STAT1 ChIP-seq FC302MA_20080507_s_1 (6.2 million)
• Illinois Cloud Computing Testbed (CCT). Each node: 64 bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage.
• 2 mismatches are allowed.
• Accuracy: 95% of results are the same as MAQ.
![Page 18: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/18.jpg)
Speed Up
[Figure: two panels, run time vs. number of cores (1, 2, 4, 8, 16, 32), each plotted with and without overhead. Panels: Pol II data set; STAT1 data set.]
Speed-up is quasi-linear in the number of cores
Average overhead time: 67.22 s and 86.09 s (one value per panel)
![Page 19: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/19.jpg)
Scale Up

| Data set | Size | Size Ratio | Run Time | Run Time Ratio |
|---|---|---|---|---|
| STAT1 | 6.2 million | 1.38 | 364 seconds | 1.03 |
| Pol II | 4.5 million | | 354 seconds | |

RAM requirement: ~50 MB per million reads
Can scale up to tens of millions of reads with several GB of RAM
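
The ratios in the table and the RAM claim can be reproduced with a few lines of arithmetic. This sketch assumes the "50 M" figure means megabytes and that memory grows linearly with the number of reads; `ram_mb` is a hypothetical helper.

```python
# Figures taken from the scale-up table.
pol2_reads, stat1_reads = 4.5e6, 6.2e6
pol2_time, stat1_time = 354, 364  # seconds

size_ratio = stat1_reads / pol2_reads  # about 1.38
time_ratio = stat1_time / pol2_time    # about 1.03

def ram_mb(n_reads, mb_per_million=50):
    """Estimated RAM in MB, assuming linear growth (an assumption)."""
    return n_reads / 1e6 * mb_per_million

if __name__ == "__main__":
    print(round(size_ratio, 2), round(time_ratio, 2), ram_mb(40e6))
```

Under these assumptions, 40 million reads would need about 2,000 MB, which is consistent with "tens of millions of reads with several GB of RAM".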
![Page 20: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/20.jpg)
Comparison to CloudBurst
Why is CloudBurst slow?
It hashes reads and the genome with the Hadoop system hash function
No filtering in the Map phase: heavy I/O to the Reduce phase
![Page 21: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/21.jpg)
Results on Amazon EC2
• Speed-up is similar to that on the UIUC Hadoop cluster, but runs are slower
• Large Standard Instances are chosen
• Cost: $99.01
![Page 22: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/22.jpg)
Future Plans
• Apply to bisulfite reads for genome-wide methylation analysis
• Web-based visualization of short-read alignments
![Page 23: SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign](https://reader035.vdocuments.net/reader035/viewer/2022062417/5518cdca550346881f8b5bad/html5/thumbnails/23.jpg)
Acknowledgements
• UIUC Cloud Test Bed
• Michael Schatz
• CAMDA Organizers
This work is supported by NSF DBI 08-45823 (SZ)
Thank you!