tools for hts analysis michael brudno and marc fiume department of computer science university of...

50
TOOLS FOR HTS ANALYSIS Michael Brudno and Marc Fiume Department of Computer Science University of Toronto

Upload: shannon-lindsey

Post on 03-Jan-2016

217 views

Category:

Documents


2 download

TRANSCRIPT

TOOLS FOR HTS ANALYSIS

Michael Brudno and Marc FiumeDepartment of Computer ScienceUniversity of Toronto

Outline•Lab focus•Our tools

•SHRiMP: read mapper•VARiD: SNP and indel finder•Savant: genome browser

• Discussion

Our Tools

READ MAPPING (SHRiMP)

READ MAPPING (SHRiMP)

SNP DETECTION(VARiD)

SNP DETECTION(VARiD)

INDEL DETECTION(MODiL)

INDEL DETECTION(MODiL)

CNV DETECTION(CNVer)

CNV DETECTION(CNVer)

ASSEMBLY(UNNAMED)ASSEMBLY

(UNNAMED)VISUALIZATION(SAVANT)

VISUALIZATION(SAVANT)

SHRIMP – SHORT READ MAPPING PACKAGE

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Key SHRiMP Features•High Sensitivity•Support for common formats (SAM, FASTQ, etc)•Flexible seeding framework•Multi-threading•Full support for SOLiD and Illumina (and 454) reads

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Sensitivity/Specificity Comparison

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Runtime comparison

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Unpaired 50bp Reads Paired 75bp Reads

Mapping 6 million reads to C. Savignyi (180 Mb)

VARID – SNP AND INDEL DETECTION

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

motivation | methods | results | summary

Variation detection from NGS reads

Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC

Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTGAligned reads: ATCCATTGCA GATCCACTGCAC

• Determine differences (variation) between reference and donorusing NGS reads of the donor

MotivationColor-space and Letter-space platforms

bring them together

MotivationColor-space and Letter-space platforms

bring them together

MethodsMethods

SummarySummary

ResultsResults

16

motivation | methods | results | summary

Sequencing Platforms

• letter-space Sanger, 454, Illumina, etc

> NC_005109.2 | BRCA1 SX3TCAGCATCGGCATCGACTGCACAGG

• color-space AB SOLiD less software tools available

> NC_005109.2 | BRCA1 AF3T212313230313232121311120

• many differences -> useful to combine this information• sequencing biases• inherent errors • advantages

17

A G

C T

2

1

2

1 3

0 0

00

A

A

C

0

G T

1 32

C 1

G

0

2

3 2

3 10

T 3 2 01

Color Space

motivation | methods | results | summary

Translation Matrix Translation Automata

18

Translating

> T212313230313232121311120> T

Sequencing Error vs SNP

Sequencing Error> T212313230313232121311120> T212313230310232121311120> TCAGCATCGGCAAGCTGACGTGTCC

SNP> TCAGCATCGGCATCGACTGCACAGG> TCAGCATCGGCAGCGACTGCACAGG> T212313230312332121311120

A G

C T

CAGCATCGGCATCGACTGCACAGG

Color Space

motivation | methods | results | summary

19

Color Space

motivation | methods | results | summary

20

• clear distinction between a sequencing error and a SNP• can this help us in SNP detection? sounds like it!

single color change error, 2 colors changed (likely) SNP.

Easy snp call Well covered bases Difficult Casereference incolor-space

reads

position

Detection• Heterozygous SNPs• Homozygous SNPs• Tri-allelic SNPs• small indels• account for various errors, quality values & misalignments

Motivation • variation caller to handle both letter-space & color-space reads

Motivation

motivation | methods | results | summary

21

VARiD• system to make inferences on the donor bases

• variation detection

Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationMotivation

SummarySummary

ResultsResults

22

Statistical model for a system - states

Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state

We can observe the state’s emission (output)each state has a probability distribution over outputs

Hidden Markov Model (HMM)

motivation | methods | results | summary

23

S1 S2 S3

e1

e2

e1

e2

e1

e2

Hidden Markov Model (HMM)

motivation | methods | results | summary

24

Apply HMM to variation detection: • we don’t know the state (donor), but • we can observe some output determined by the state (reads)

Hidden Markov Model (HMM)

motivation | methods | results | summary

25

. . . . . B6 B7 B8 B9 . . . . .

B6 B7 B7 B8 B8 B9

AA

AC

color 0

color 1

AA

AC

color 0

color 1

AA

AC

color 0

color 1

unknowndonor

Why pairs of letters? Handle colors.• AA and TT gives the same colors. Can’t just model colors

The donor could be:• letters: AA color 0• letters: AC color 1 :• letters: TT color 016 combinations

A G

C T

2

1

2

1 3

0 0

00

A

A

C

0

G T

1 32

C 1

G

0

2

3 2

3 10

T 3 2 01

States

motivation | methods | results | summary

26

B6 B7

States and Transitions

motivation | methods | results | summary

27

. . . B6 B7 B8 . . .

B6 B7 B7 B8

AA

CA

AT

TT

:

:

GA

CT

::

AA

TT

.

.

.

.

.

.

.

.

.

Poss

ible

Sta

tes

Transitions• only certain transitions allowed

• when allowed, p(Xt|Xt-1) = freq(Xt)

• each state depends only on the previous states (Markov Process)

States• 16 possible states• only look at second letter

T01020100311223 T1030101311223 T20100311223

ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC

Unknown genome

Color reads

Letter reads

Emissions

motivation | methods | results | summary

28

..... B6B7B8B9 ..... B7 B8

color 0

color 1

AA

AC

color 0

AA

AA

color 0

color 1

color 2

color 3

letters A

letters C

1 – 3ε

ε

ε

ε

emissionprobabilityp(em|AA)

letters T ξ

1- 3ξ

letters G ξ

ξ

Emission Probabilities

motivation | methods | results | summary

29

Same color emission distribution

TT

Different letter emission distribution

TT

])31[(])31[( 2112 Ep

E.g. For state CC:

Combining emission probabilities• probability that this state emitted these reads.

motivation | methods | results | summary

Emission Probabilities

30

T01020100311223 T1030101311223 T20100311223

ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC

..... B6B7B8B9 .....

Summary

• unknown state • donor pair at location

•transitions • transition probabilities

• emissions • reads at location• emission probabilities

motivation | methods | results | summary

Simple HMM

31

B6 B7

AA

AC

color 0

color 1

• Have set-up a form of an HMM• run Forward-Backward algorithm • get probability distribution over states at some position

AA

CA

AT

TT

:

:

GA

CT

::

likely state

motivation | methods | results | summary

Forward-Backward Algorithm

32

• Variation Detection:compare most likely state with reference:

ref: GCTATCCAdon: ...AT...

Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationMotivation

SummarySummary

ResultsResults

33

Simple HMM • only detects homozygous SNPs

Extended HMM:• short indels• heterozygous SNPs• complex error profiles & quality values

motivation | methods | results | summary

Extended HMM

34

Expand states• Have states that include gaps

• emit: gap or color

A---

-G

AGTG

T-T-

• Have larger states, for diploids

• Transitions built in similar fashion as before• Same algorithm, but in all we have 1600 states with very sparse transitions

Expansion: Gaps and het. SNPs

motivation | methods | results | summary

35

• Emission probabilities o Support quality values o Use variable error rates for emissions

• Translate through the first lettero first color is incorrecto letter-space signal

Donor: ACAGCATCGGCATCGACTGC 1123132303123321213read: >T2123132303123321213 > C123132303123321213

Expansion

motivation | methods | results | summary

36

• Post-process putative SNPso uncorrelated adjacent errors may support het SNPso check putative SNPs

motivation | methods | results | summary

blue: varid steps

Summary

ResultsResults

MotivationMotivation

MethodsMethods

SummarySummary

38

Results

motivation | methods | results | summary

• Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)

Color-space dataset:• Compare random subsets:

• Corona (with AB mapper) • VARiD (with SHRiMP) • VARiD (with AB mapper)

Conclusions:• Using F-measure, the three pipelines perform very similarly. • High-coverage results is as good as can be achieved

39

Results

motivation | methods | results | summary

• Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)

Letter-space dataset:• Compare random subsets :

• GigaBayes (with Mosaik) • VARiD (with SHRiMP) • VARiD (with Mosaik)

Conclusion:• Using F-measure the three pipelines perform very similarly.• High-coverage results is as good as can be achieved

40

Results

VARiD: Combining Letter-space and Color-space Datato achieve increased accuracy in at-cost comparison

motivation | methods | results | summary

41

SummarySummary

MotivationMotivation

MethodsMethods

ResultsResults

42

Summary of VARiD• HMM modeling underlying donor• Treats color-space and letter-space together in the same framework• no translation – take advantage of each technology’s properties• accurately calls short SNPs, indels in both color- and letter-space

• improved results with hybrid data.

Summary

motivation | methods | results | summary

• Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available)

• Contact: [email protected]

• Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available)

• Contact: [email protected]

43

SAVANT GENOME BROWSER

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Challenge in Genomic Data Analysis• genomic data is generated in high volumes• interpretation and analysis challenge• typical pipeline employs many separate tools for computation and visualization

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Tools for HTS data analysisTool Cost Computation Visualization

Read Alignment e.g. Bowtie, BWA

Free Y N

File Format Conversion e.g. Galaxy, SAMTools

Free Y N

Other Comand-line Toolse.g. Genetic Variation Discovery, Comparitive Genomics, etc.

Free Y N

UCSC Genome Browser Free N Y

Integrative Genomics Viewer Free N Y

GBrowse Free N Y

CLC Genomics Workbench $$$ Y Y

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

• substantial disconnect between the processes of computational analysis and visualization

Tools for HTS data analysisTool Cost Computation Visualization

Read Alignment e.g. Bowtie, BWA

Free Y N

File Format Conversion e.g. Galaxy, SAMTools

Free Y N

Other Comand-line Toolse.g. Genetic Variation Discovery, Comparitive Genomics, etc.

Free Y N

UCSC Genome Browser Free N Y

Integrative Genomics Viewer Free N Y

GBrowse Free N Y

CLC Genomics Workbench $$$ Y Y

Savant Genome Browser Free Y Y

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

• substantial disconnect between the processes of computational analysis and visualization

Savant Genome Browser• platform for integrated visual analysis of genomic data

• feature-rich genome browser•computationally extensible via plugin framework

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

(Very) Short List of Features

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

FEATURE DEMONSTRATION

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

INTERFACEHTS READ ALIGNMENTSEXAMPLE PLUGINS: SNP FINDER

Power of visual analytics• task: find the correct parameter for command-line tool

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Plugin Framework• unlocks the potential for performing visual analytics

• mutually beneficial for both users and tool developers

for users: perform complex data analyses on-the-fly within a visual environment

for programmers: platform for simple development and deployment of various programs

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

CONCLUSIONS

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Conclusions• Savant is a platform for integrated visualization and analysis of genomic data

•stand-alone genome browser•novel features: e.g. table view, visualization modes, data selection, etc.

•computationally extensible through plugin framework

• makes interpretation and analysis of genomic data easier and more efficient

Savant Genome Browser - http://compbio.cs.toronto.edu/savant/

Acknowledgements

Recep Andrew Vlad MikeBrudno

Yue Marc

Vanessa OrionJoe Nilgun

Paul

Vera

Misko Yoni

Questions?

SHRiMPhttp://compbio.cs.toronto.edu/shrimp

VARiDhttp://compbio.cs.toronto.edu/varid

Savant Genome Browserhttp://compbio.cs.toronto.edu/savant