gpu accelerated bioinformatics research at...

39
GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012 May 14-17, 2012 | San Jose, California BingQiang WANG, Head of Scalable Computing, BGI [email protected]

Upload: duongkiet

Post on 03-Oct-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

GPU Accelerated Bioinformatics Research at BGI

GPU Technology Conference 2012

May 14-17, 2012 | San Jose, California

BingQiang WANG, Head of Scalable Computing, BGI

[email protected]

Page 2: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

DNA double helix encodes secrets of life Facts: Human Genome - Trillions of cells - 23 pairs of chromosomes - 2 meters of DNA - 3 billion DNA subunits (A,T,C,G) - Approximately 30,000 genes code for proteins that perform most life functions

Page 3: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Different Types of Gene Variation

Page 4: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Sequencer, Sequencing, Sequence

• Sequencing technology turns the secret codes (A, T, C, G) into digital form

• Thus enabling computational research

Sample

Preparation

Library

Sequencing

Analysis

Results

Page 5: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Sequencing long DNA

Multiple copies of the same source

Randomly fragment the copies

Sequence Unordered genome fragments

Page 6: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Next Generation Sequencing (NGS)

• Indeed 2nd generation sequencing technology

• Low cost (several K$ per human genome)

• High throughput

• Short reads (small pieces of DNA strand)

• Lots, lots of data

Page 7: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Big Data Incoming

• Breadth – As sequencing cost falling falling down – More individuals are being sequenced

• Thousands of human individuals: diagnostics and treatment of diseases

• Tens of thousands of rice individuals: molecular breeding, more food

• Depth – Combining data from other sources / levels – DNA, RNA, protein…

• And, dynamically – The dimension other than breadth and depth - time – Living cells, living life

Page 8: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Craig Venter, Multiple personal genomes await, Nature, Vol 464, April 2010

Exponential growth

Approximately 10 fold every 18 months!

Page 9: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Collecting and integrating large-scale, diverse types of data

Page 10: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

… we are able to isolate and sequence individual cells, monitor the dynamics of single molecules in real time and lower the cost of the technologies that generate all of these data, such that hundreds of millions of individuals can be profiled. Sequencing DNA, RNA, the epigenome, the metabolome and the proteome from numerous cells in millions of individuals, and sequencing environmentally collected samples routinely from thousands of locations a day …

Eric E. Schadt et al, Computational Solutions to Large-scale Data Management and Analysis, Nature Reviews | Genetics, Vol 11, September 2010

Page 11: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Sequencing @

• World’s leading sequencing and genomics research center • Started with Human Genome Project in 1999

– Several sequencers at that time – Now more than 150 sequencers – Consider the trend …

• Mass spectrometers to capture protein information – Complement sequencing – Proteomics, so on

MODEL ABI

3730XL Roche

454 ABI

SOLiD 4 Solexa GA IIx

Illumina HiSeq 2000

INSTALLATION 16 1 27 6 135

Page 12: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Computing @

• Sequencing throughput – 6T base pairs per day (upgraded

from 4T) – ~20 PB data storage

• Connecting raw data and scientific discovery – Analysis tools – High performance computing is

the key

• Computing horsepower – ~20,000 cores – ~20 GPUs – ~220 Tflops peak performance

• Still increasing …

Page 13: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Sequencing vs Computing

• Observation – Exponential growth of

sequence data output

• What will happen if, demand for computation grows with amount of data, as – O(N)

– O(N2)

– beyond O(N2)?

Page 14: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Computational Challenges

• “Classical” sequence data analysis – Alignment as O(N) – Variant calling as O(N) – Linear as data increasing

• Growing computing demand – let us mine for “sth” – Population genomics as O(N2) – Phylogenetic study NP hard – Gene association study high dimensional – Systems biology with various levels of data NP hard – …

• Sequencing cost down leads to more and more high dimensional analysis – Lots, lots of computing

Page 15: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Solution: Disruptive Computing Technology

Scientific and Clinical

Interests

Sequencing Technology

Computing Technology

Page 16: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

GPU Accelerated Bioinformatics Research

• Individual tools for routine analysis

– SOAP3 / SOAP3-DP aligner

– SNP calling with GSNP

• Tackle challenging scientific questions

– Gene association study

• First step: High resolution genotyping with GAMA-MPI

– More incoming

Page 17: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

SOAP3 / SOAP3-DP Aligner

• Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

• Collaboration between University of Hong Kong (HKU) and BGI – Professor T.W. Lam

• SOAP3 and SOAP3-DP – Designed for GPU, great performance – SOAP3: GPU accelerated version of SOAP2

• Compressed indexing: bidirectional Burrows-Wheeler transform

– SOAP3-DP: index-assisted dynamic programming (semi-global alignment) • Alignment with INDELs (insert/delete) & mismatches

• SOAP3: GPU-based Compressed Indexing and Ultra-fast Parallel Alignment for Short Reads – Wednesday (Tomorrow) 16:00-16:30, Hall B – On behalf of Professor T.W. Lam and team members

Page 18: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

1893.45

10671.39

211.53 819.81

0

2000

4000

6000

8000

10000

12000

Human Zebra fish

Total Time (second)

SOAP2 SOAP3

Data type

Reads length

(bp)

Total Number of

Reads (million )

Mismatch number

SOAP3 (Total Time: second) SOP2 Total time

(second)

Alignment Speed-up

ratio (second)

Time for reading reads

Time for alignment and output

Total time

Human 100 16 3 83.30 128.23 211.53 1893.45 14.12

Zebra fish 76 21 3 95.50 724.32 819.81 10671.39 14.6

14.12

14.6

13

13.5

14

14.5

15

Human Zebra fish

Speedup Ratio

SOAP3

84.2

64.49

88.29

76.55

0

20

40

60

80

100

Human Zebra fish

Alignment Ratio (%)

SOAP2 SOAP3

Page 19: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Speedup Compared with Other Tools

• SOAP3-dp is at least 10 times faster when comparing with other tools (BWA, Bowtie2)

0

2

4

6

8

10

12

14

16

18

20

VS. SOAP3 VS. BWA VS. Bowtie2

Speedup

Paired-end

Single-end

SOAP3-DP

Page 20: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

SNP Calling with GSNP

• A single-nucleotide polymorphism (SNP, pronounced snip) is a DNA sequence variation occurring when a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in an individual.

• Collaboration with Hong Kong University of Science and Technology (HKUST) – Professor Qiong Luo – Mian Lu – Jiuxin Zhao

• Based on SOAPsnp – BGI’s home made standard SNP calling tool

Page 21: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

21

A T A G A 13 11 9 7 6

read quality score

A C G A T A C A

A C G T T 12 10 9 6 7

C G A C A 11 10 8 7 5

G T T A C 13 11 9 7 6

Alignment

Differences from reference for each read

Data quality

Experimental errors Alignment

quality score for each base

Calculate likelihood for

each genotype

Bayesian theory

Infer genotype

Prior probability of each genotype

A site

GSNP

Page 22: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

22

A T A G A 13 11 9 7 6

A C G A T A C A

A C G T T 12 10 9 6 7

C G A C A 11 10 8 7 5

G T T A C 13 11 9 7 6

Thread 0 Thread 2 Thread 4 Thread 3 Thread 5 Thread 6 Thread 1 Thread 7

The parallelization strategy on the GPU: one thread handles one site

Optimization techniques of GSNP

Reduce memory overhead and branch divergence

Balance workloads

The Consistency of GPU and CPU Results

Reduce I/O cost

GSNP

Page 23: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

23

con

stan

t m

em

ory

on

th

e G

PU

64 possible results

Searching for the result

Solution to the consistency of GPU and CPU results

The sparse representation of aligned bases

SOAPsnp GSNP

The measured non-zero%: ~0.1%

Site 1

Site 1

Site 2 Site 2

Computing the result

GSNP

Page 24: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

24

527

21879

1

10

100

1000

10000

100000

GSNP SOAPsnp

Elap

sed

tim

e (

sec.

)

Ch.1

73

3675

1

10

100

1000

10000

GSNP SOAPsnp

Elap

sed

tim

e (

sec.

)

Ch. 21

End-to-End Performance Comparison of GSNP

The elapsed time of all components are included. GSNP is around 50X faster than the single-thread CPU-based SOAPsnp.

GSNP

Page 25: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

GPU Accelerated Bioinformatics Research

• Individual tools for routine analysis

– SOAP3 / SOAP3-DP aligner

– SNP calling with GSNP

• Tackle challenging scientific questions

– Gene association study

• First step: High resolution genotyping with GAMA

– More incoming

Page 26: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Estimating MAF in a Population with GPU

• Within a population, SNPs can be assigned a minor allele frequency — the lowest allele frequency at a locus that is observed in a particular population. There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another. (from Wikipedia)

• MAF is the foundation of genome wide association study (GWAS), e.g. HapMap project

• Our approach is a highly accurate yet computationally very expensive one (O(N2))

• Collaboration with Hong Kong University of Science and Technology (HKUST), as well as National Supercomputing Center at Tianjin (Tianhe-1A)

Page 27: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

27

The site has a probability for the occurrence of allele ‘a’ and ‘A’

allele

? ? a A a a A A

A a

A A

a a

or or

a A

A A

Probability of AA Aa aa

One individual Compute

minor allele frequency

Add another individual

Individual SNP Calling alignment

reference

reads

A

G C

C

T G C

C

G

C G

G A

A

A

GAMA

Page 28: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

28

0

0.2

0.4

0.6

0.8

1

0 50 100 150 200 250

Allele Frequency

Gen

oty

pe

Like

liho

od

a A

a A

a a

a A

a A

A A

a A

a a

A A

Different sites represent different alleles

C B A … Individual 1

Individual 2

Individual 3

Individual …

Compute allele frequency likelihood for each site

GAMA

Page 29: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

29 Thread 0 Thread 1 Thread 2 Thread 4 Thread 3

One Site Computation

: the operator

Multiple threads handle a site

Thread 0

Thread 1 Thread …

A thread handles a site ?

Bad solution !

SFS construction

GAMA

Page 30: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Estimating MAF in a Population with Multiple GPUs

– Based on TH-1A

– CUDA + MPI

– Parallel file system (Lustre)

– AllToAll comm (now)

Page 31: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

main thread 1

thread M

… …

main thread 1

thread M

node 1 Node N

… …

wait

wait

wait

calculate and output result

calculate and output result

calculate and output result

decompress and parse data

decompress and parse data

decompress and parse data

finish

Alltoall data exchange

Alltoall data exchange

Alltoall data exchange

GAMA-MPI Flow Chart

wait

… …

start

Page 32: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

32

Version Computing time Total Time Computing Speedup

Total Speedup

Note

CPU ~ 1518 days ~ 1619 days

GPU (Single)

~ 15.75 hours ~ 101 days 2313 16 against CPU

GPU (86 with MPI *)

~ 717 seconds ~ 5.4 hours 79 449 against single GPU

Dataset: Human genome, 512 individuals ( 1024 input files) , full scan of 3G sites

* 86 nodes x 12 cores per node = 1032 cores , with one core processing one file

GAMA

Page 33: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Summary

• GPU is very promising to accelerate bioinformatics analysis and life science research

• Efforts need to be made to leverage data access and computation

• For large-scale data analysis, especially GWAS type analysis, more study needed

Page 34: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Acknowledgement

• Team members and Xing Xu, Lin Fang

• Our research collaborators – Prof T.K. Lam, Dr S.M. Yiu from HKU

– Prof Qiong Luo and group member from HKUST • Mian Lu

• Jiuxin Zhao

• Yuwei Tan

– Prof Xiaowen Chu from HKBU • Kaiyong Zhao

• National Supercomputing Center at Tianjin

Page 35: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Acknowledgement (cont’d)

• NVIDIA

– Dr Simon See

– Allan Hou

– China DevTech Team

• Dr Peng Wang

• Dr Bin Zhou

• Agatha Hu

Page 36: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Contact Us

[email protected]

http://www.easygenomics.com

Next Generation Bioinformatics

on the Cloud

Sifei He

Director of BGI Cloud

[email protected]

Xing Xu, Ph.D

Senior Product Manager

EasyGenomics | BGI

[email protected]

Page 37: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Problems and Solutions

37

Problems:

• Big genomic data

• Geological distribution

• Algorithm integration

• Computational demand

• Big genomic data

• Geological distribution

• Algorithm integration

• Computational demand +)

Cloud

High Speed Data Exchange

Workflows

Resource Management

Solutions

Page 38: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

EasyGenomics™

EasyGenomics is the bioinformatics platform for research and applications on the cloud

Algorithms, Workflows,

Reports

Computational Resources

Database, Data management

Web portal, Simple UI High speed

connection

Page 39: GPU Accelerated Bioinformatics Research at …on-demand.gputechconf.com/gtc/2012/presentations/S0519...GPU Accelerated Bioinformatics Research at BGI GPU Technology Conference 2012

Thank you

[email protected]