Persona: A High-Performance Bioinformatics Framework
Stuart Byma¹, Sam Whitlock¹, Laura Flueratoru², Ethan Tseng³, Christos Kozyrakis⁴, Edouard Bugnion¹, James Larus¹
¹EPFL, ²U. Politehnica of Bucharest, ³CMU, ⁴Stanford
12/07/2017

Agenda
• Motivation
• Bioinformatics Data and Tools
• Persona
  • AGD
  • Dataflow Engine
• Performance Results

Sequencing cost
Not a wet-lab problem anymore → IT/systems problem

Implications
~300 GB, ~hours
Need efficient systems that scale well

What kind of data?
• Common sequencers produce reads
  • Snippets of DNA → AACCGCTAGCGCGCTAGCTCGAGCTAGAA
  • 100-200 bases
@sequence name, metadata
ACGTTTCGATCGCGCCAGGAGGCTAG
+
-+*''))**55CCF@>>>>>CCCCCC
times a few hundred million…
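
The record above follows the four-line FASTQ layout: a name line starting with `@`, the bases, a `+` separator, and one quality character per base. The sketch below reads such records in Python. The names `Read` and `parse_fastq` are hypothetical, and the Phred+33 quality offset is an assumption (the slide does not state the encoding); this is an illustration, not Persona's importer.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class Read(NamedTuple):
    name: str          # text after the leading '@'
    bases: str         # the DNA letters
    quals: List[int]   # one decoded quality score per base


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[Read]:
    """Yield one Read per four-line FASTQ record (offset 33 is assumed)."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of input
        if not header.strip():
            continue                    # tolerate blank lines between records
        bases = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(qual):
            raise ValueError("bases and qualities differ in length")
        yield Read(header[1:].rstrip("\n"), bases,
                   [ord(c) - offset for c in qual])


# The record shown on the slide, one field per line.
SLIDE_RECORD = (
    "@sequence name, metadata\n"
    "ACGTTTCGATCGCGCCAGGAGGCTAG\n"
    "+\n"
    "-+*''))**55CCF@>>>>>CCCCCC\n"
)
reads = list(parse_fastq(io.StringIO(SLIDE_RECORD)))
```

A real dataset repeats this record a few hundred million times, which is why a streaming generator is used here instead of loading everything into memory.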

Alignment
Ref:  ...TGACCTATAGCGATATAGCTTATTATTGGG-CAAAAATGGAATCGATTGATCG...
                             |||||||||| ||||| |||
Read:                        TATTATTGGGATAAAA-TGG
Differences marked: insertion, deletion, mismatch
times a few hundred million…
~hours

Aligned Reads
• Stored in SAM/BAM
read_name 16 chr12 85500011 70 18M * 0 0 TTTTACACACATTATCTC CDDFAEEC>EDDFFBCDEED?FCC@ PL:Z:Illumina PU:Z:pu LB:Z:lb SM:Z:sm
• Followed by
  • Duplicate marking
  • Sorting
  • Recalibrations, analysis (variant calling)
~hours
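
To make the SAM line above concrete, here is a small Python sketch that splits a record into its eleven mandatory fields plus optional tags and decodes the flag and CIGAR string. The names `SamRecord`, `parse_sam_line`, `parse_cigar`, and `is_reverse` are hypothetical. The slide shows the fields separated by spaces; the sketch assumes the tab-separated form that SAM files use on disk.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: List[Tuple[int, str]]
    seq: str
    qual: str
    tags: Dict[str, str]


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '18M' into [(18, 'M')]; '*' means no CIGAR is available."""
    if cigar == "*":
        return []
    ops = CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"bad CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated alignment line (header lines are not handled)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record needs 11 mandatory fields")
    tags = {}
    for item in f[11:]:
        tag, _type, value = item.split(":", 2)   # e.g. PL:Z:Illumina
        tags[tag] = value
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     parse_cigar(f[5]), f[9], f[10], tags)


def is_reverse(rec: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# The slide's example record, joined with tabs.
SLIDE_FIELDS = ["read_name", "16", "chr12", "85500011", "70", "18M", "*", "0", "0",
                "TTTTACACACATTATCTC", "CDDFAEEC>EDDFFBCDEED?FCC@",
                "PL:Z:Illumina", "PU:Z:pu", "LB:Z:lb", "SM:Z:sm"]
rec = parse_sam_line("\t".join(SLIDE_FIELDS))
```

For the slide's record, the flag value 16 decodes to a reverse-strand alignment and the CIGAR `18M` to a single run of 18 aligned bases.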

Data and Tool Issues
FASTQ, SAM/BAM, BED, VCF, …

Persona: Bioinformatics, Unified
Aggregate Genomic Data

Aggregate Genomic Data
[Figure labels: Manifest; Storage Subsystem; columns: Bases, Q-Scores, Metadata; chunk: Header, Index, Data (compressed)]
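
The figure labels suggest that each column (bases, Q-scores, metadata) is stored in chunks made of a header, an index, and a compressed data block. The Python sketch below shows one way such a chunk could be laid out. The byte layout, the `MAGIC` value, and the function names `encode_chunk` and `decode_chunk` are all hypothetical and are not the actual AGD on-disk format; zlib stands in for whatever compressor AGD uses, and the manifest is not modeled.

```python
import struct
import zlib
from typing import List

MAGIC = b"CHNK"          # hypothetical marker, not AGD's real header
HEADER = "<4sII"         # magic, record count, compressed data length


def encode_chunk(records: List[bytes]) -> bytes:
    """Pack records as header + index (one length per record) + compressed data."""
    index = struct.pack(f"<{len(records)}I", *(len(r) for r in records))
    data = zlib.compress(b"".join(records))
    header = struct.pack(HEADER, MAGIC, len(records), len(data))
    return header + index + data


def decode_chunk(blob: bytes) -> List[bytes]:
    """Invert encode_chunk: read the header, then use the index to split records."""
    magic, count, data_len = struct.unpack_from(HEADER, blob, 0)
    if magic != MAGIC:
        raise ValueError("not a chunk produced by encode_chunk")
    offset = struct.calcsize(HEADER)
    lengths = struct.unpack_from(f"<{count}I", blob, offset)
    offset += 4 * count
    data = zlib.decompress(blob[offset:offset + data_len])
    records, pos = [], 0
    for length in lengths:
        records.append(data[pos:pos + length])
        pos += length
    return records


bases_column = [b"ACGTTTCGATCG", b"", b"TTGACA"]
blob = encode_chunk(bases_column)
```

Keeping the index outside the compressed block means a reader can learn record boundaries without scanning the data for delimiters.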

Dataflow
• Dataflow execution framework
  • Based on TensorFlow engine
  • But no machine learning
• Operators perform computation on AGD chunks

Dataflow
[Figure label: AGD Chunks]
• Modularity
• Balance/tuning
• (Bounded) queueing
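
Bounded queueing between modular stages can be sketched with the Python standard library. This is not how the TensorFlow engine implements its queues; the names `stage` and `run_pipeline`, the sentinel object, and the default capacity are assumptions chosen for illustration. Each stage runs in its own thread, and a full queue blocks the producer, which gives backpressure and a bounded memory footprint.

```python
import queue
import threading
from typing import Callable, Iterable, List

SENTINEL = object()   # marks the end of the stream


def stage(fn: Callable, inq: queue.Queue, outq: queue.Queue) -> threading.Thread:
    """Start a thread that applies fn to every item from inq and forwards it."""
    def run() -> None:
        while True:
            item = inq.get()
            if item is SENTINEL:
                outq.put(SENTINEL)     # pass the end marker downstream
                return
            outq.put(fn(item))         # blocks while the output queue is full
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def run_pipeline(items: Iterable, fns: List[Callable], capacity: int = 2) -> list:
    """Chain the functions as stages connected by bounded queues."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(fns) + 1)]
    for i, fn in enumerate(fns):
        stage(fn, queues[i], queues[i + 1])

    def feed() -> None:
        for item in items:
            queues[0].put(item)
        queues[0].put(SENTINEL)
    threading.Thread(target=feed, daemon=True).start()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            return results
        results.append(item)


out = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
```

Tuning then amounts to choosing queue capacities and the number of workers per stage so that no single stage starves or floods its neighbors.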

Fine-grained Threading
• AGD chunks optimized for storage
  • Too coarse for some tasks
• Split into subchunks
  • Delegate to executor shared resource
  • Task queue + thread pool
[Figure labels: Aligners, Notify, AGD Buf, Thread Pool]
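
The subchunk idea can be sketched with Python's `concurrent.futures`: a storage-sized chunk is cut into smaller pieces, each piece becomes a task on a shared thread pool, and the results are reassembled in order. The names `split_subchunks` and `process_chunk` and the subchunk size are hypothetical; Persona's executor is a native resource shared between operators, which this sketch only approximates.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_subchunks(records: Sequence, size: int) -> List[Sequence]:
    """Cut a chunk into consecutive pieces of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def _run_subchunk(work: Callable, subchunk: Sequence) -> list:
    """Task body: apply the per-record work function to one subchunk."""
    return [work(record) for record in subchunk]


def process_chunk(pool: ThreadPoolExecutor, records: Sequence,
                  work: Callable, subchunk_size: int) -> list:
    """Submit one task per subchunk, then gather results in record order."""
    futures = [pool.submit(_run_subchunk, work, sub)
               for sub in split_subchunks(records, subchunk_size)]
    results = []
    for future in futures:
        results.extend(future.result())   # waits for that subchunk to finish
    return results


# One pool shared by all callers plays the role of the executor.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = process_chunk(pool, list(range(10)), lambda x: x * x, 3)
```

Smaller subchunks balance load across threads better, at the cost of more task-queue traffic.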

Aligner Graph
[Figure: GetChunk → Decompress/Parse → AlignReads → Compress → PutChunk, operating on AGD chunks; Alignment Executor]

Graph Construction
import tensorflow as tf        # TF 1.x API (tf.Session)
c = persona.read_chunk(path)   # read an AGD chunk
d = persona.decompress(c)      # decompress it
o = persona.align(d)           # align its reads
sess = tf.Session()
result = sess.run([o])         # execute the graph

Persona Shell
[Figure: Persona Shell with subcommands align, sort, import, …; local runtime and dist runtime]
$ persona align local -i hg19 data/my_agd.json
$ persona sort local data/my_agd.json

Distributed Computation
[Figure labels: Client, Queue Service, Server 0, Server 1, …, Server N, Storage Subsystem]
$ persona client bwa-align

Current Features
• Import data from FASTQ/BAM/SRA, export to BAM
• Sequence alignment with BWA-MEM, SNAP
• Dataset sorting
• Duplicate marking
• Dataset statistics (samtools flagstat)
• Read coverage (depth)

Evaluation: Setup
• Focused on sequence alignment using SNAP
• Throughput in bases aligned per second
• Data
  • 223 million 101-base reads (~16 GB)
  • AGD chunks of 100K records
• Hardware
  • 32x Ubuntu 16.04, [email protected]
  • Data on 6-disk RAID 0 and single spindle drive
  • 7-server Ceph object store for distributed execution

Evaluation: Questions
• What are the bandwidth-saving effects of AGD?
• What is the overhead of the Persona framework?
• How well do Persona and AGD scale?

Performance: AGD
[Figure: SNAP vs. Persona SNAP, single disk]
Significantly less I/O → more efficient use of HW, BW

Persona Overhead
[Figure: measured on RAID-0]
Negligible overhead!

Scaling
Full dataset aligned in ~17 seconds
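
As a rough sanity check, the aggregate throughput implied by these numbers can be worked out from the dataset described in the evaluation setup (223 million 101-base reads, chunks of 100K records). The figure of 17 seconds is the slide's approximation, "100K" is read here as 100,000, and the constant names are illustrative, so the result is an estimate only.

```python
READS = 223_000_000        # reads in the evaluation dataset
READ_LENGTH = 101          # bases per read
CHUNK_RECORDS = 100_000    # records per AGD chunk ("100K", assumed 100,000)
ALIGN_SECONDS = 17         # approximate wall-clock time from the slide

total_bases = READS * READ_LENGTH
num_chunks = -(-READS // CHUNK_RECORDS)          # integer ceiling division
throughput = total_bases / ALIGN_SECONDS         # bases aligned per second

print(f"{total_bases / 1e9:.1f} Gbases in {num_chunks} chunks, "
      f"~{throughput / 1e9:.2f} Gbases/s")
# prints: 22.5 Gbases in 2230 chunks, ~1.32 Gbases/s
```

The chunk count matters for scaling: with 2,230 chunks there is enough independent work to spread across the servers in the cluster.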

Scaling Limits

Persona: Scalable Bioinformatics
Aggregate Genomic Data
https://github.com/epfl-vlsc/persona

Backup

Performance: Sort and Dup. Mark
• Sort
  • By metadata or aligned location
  • 1.54x speedup over samtools
  • 5.15x speedup over Picard
• Dataset stats
  • 2x speedup
• Duplicate marking
  • Same algorithm as samblaster
  • 3.73x faster than samblaster
• Coverage (depth)
  • 2x speedup

Profiling

Read/Write Single Disk

Alignment
• Example: SNAP
  • Build hash index of reference
  • To align a read:
    • Hash a portion (seed)
    • Lookup
    • Evaluate each hit
      • Edit distance computation
• Cores align reads in parallel
[Figure: read TATTATTGGGATAAAATGGTTT looked up in the Reference Genome Index (40 GB); candidate location ...TATTACTGGGCAAAAATGGTTTATG... scored by edit distance]
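
The steps listed above (index the reference, hash a seed, look it up, score each hit by edit distance) can be sketched in Python. This is a simplified illustration and not SNAP's algorithm: it uses a single seed taken from the start of the read, a plain dictionary as the hash index, and textbook Levenshtein distance against a same-length reference window. The names `build_index`, `edit_distance`, and `align` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def align(read: str, reference: str, index: Dict[str, List[int]],
          k: int) -> Optional[Tuple[int, int]]:
    """Return (position, edit distance) of the best seed hit, or None."""
    best = None
    for pos in index.get(read[:k], []):          # look up the seed
        window = reference[pos:pos + len(read)]  # candidate location
        dist = edit_distance(read, window)       # evaluate the hit
        if best is None or dist < best[1]:
            best = (pos, dist)
    return best


reference = "ACGTACGTTTGACCA"
index = build_index(reference, 4)
exact = align("CGTTTGAC", reference, index, 4)      # matches at position 5
one_off = align("CGTTAGAC", reference, index, 4)    # one substitution
```

Because each read is aligned independently against a read-only index, the work parallelizes across cores with no coordination beyond sharing the index.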

Shared Data
• Sometimes need to share data between ops
  • E.g. multi-GB index of reference genome
• Use TF session resource manager
  • [string, string] → refcount object
• Op can create objects, provide handle to other ops
[Figure: Provider Op calls LookupOrCreate() on the Resource Manager and passes the handle [c, n]; Consumer Op calls Lookup()]
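
The sharing pattern described above can be illustrated with a small Python analogue: objects are registered under a pair of strings, a provider creates the object at most once, and consumers find it through the same pair. This is not TensorFlow's resource manager; the class `ResourceManager`, its method names, and the explicit `release` call are assumptions made for the sketch.

```python
import threading
from typing import Any, Callable, Dict, List, Tuple


class ResourceManager:
    """Registry of shared objects keyed by (container, name), with refcounts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[Tuple[str, str], List[Any]] = {}  # key -> [obj, refs]

    def lookup_or_create(self, container: str, name: str,
                         factory: Callable[[], Any]) -> Any:
        """Return the object for the key, building it on first use only."""
        with self._lock:
            entry = self._entries.get((container, name))
            if entry is None:
                entry = [factory(), 0]
                self._entries[(container, name)] = entry
            entry[1] += 1
            return entry[0]

    def lookup(self, container: str, name: str) -> Any:
        """Return an existing object; raises KeyError if nobody created it."""
        with self._lock:
            entry = self._entries[(container, name)]
            entry[1] += 1
            return entry[0]

    def release(self, container: str, name: str) -> None:
        """Drop one reference and forget the object when none remain."""
        with self._lock:
            entry = self._entries[(container, name)]
            entry[1] -= 1
            if entry[1] == 0:
                del self._entries[(container, name)]

    def refcount(self, container: str, name: str) -> int:
        with self._lock:
            entry = self._entries.get((container, name))
            return entry[1] if entry else 0


builds = []
def load_index() -> dict:
    builds.append(1)                    # stands in for loading a multi-GB index
    return {"ACGT": [0, 4]}

mgr = ResourceManager()
provided = mgr.lookup_or_create("align", "genome_index", load_index)  # provider op
consumed = mgr.lookup("align", "genome_index")                        # consumer op
```

The expensive object is built once and every op that holds the handle sees the same instance.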

Data Movement
• Tensors not amenable to bioinfo data
• Leverage TF shared resources
• Implement reusable buffers
  • Stable memory use
  • Avoid syscalls
[Figure: Buffer Pool Op provides a Pool under a [container, name] handle]
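
A reusable buffer pool, as listed above, can be sketched in a few lines of Python. The class name `BufferPool`, the fixed buffer size, and the blocking `acquire` are assumptions for illustration; Persona's pool lives in native code behind a TensorFlow op. Buffers are allocated once up front and recycled, so memory use stays flat and the steady state avoids repeated allocation.

```python
import queue
from contextlib import contextmanager
from typing import Iterator, Optional


class BufferPool:
    """Fixed set of preallocated buffers handed out and returned for reuse."""

    def __init__(self, count: int, size: int) -> None:
        self._free: "queue.Queue[bytearray]" = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))    # all allocation happens here

    def acquire(self, timeout: Optional[float] = None) -> bytearray:
        """Take a free buffer, blocking until one is available."""
        return self._free.get(timeout=timeout)

    def release(self, buf: bytearray) -> None:
        """Return a buffer to the pool so another stage can reuse it."""
        self._free.put(buf)

    @contextmanager
    def borrow(self) -> Iterator[bytearray]:
        """Acquire a buffer for the duration of a with-block."""
        buf = self.acquire()
        try:
            yield buf
        finally:
            self.release(buf)


pool = BufferPool(count=1, size=16)
first = pool.acquire()
first[:4] = b"ACGT"          # fill the buffer in place
pool.release(first)
second = pool.acquire()      # the same object comes back, no new allocation
pool.release(second)
```

Because `acquire` blocks when the pool is empty, the pool size also caps how many chunks can be in flight at once.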

Bioinformatics?
• Biology, computer science, math, statistics
• Started mid-'90s with Human Genome Project
• Broad field
  • Genomics, proteomics, systems biology
• This talk: Whole Genome Sequence (WGS) analysis
  • Reading the letters of your DNA (ATCG…)