qpalma - optimal spliced alignments of short sequence reads · how can we generate data for...

53
QPalma - Optimal Spliced Alignments of Short Sequence Reads Fabio De Bona 1 Stephan Ossowski 2 Korbinian Schneeberger 2 Gunnar R¨ atsch 1 (1) Friedrich Miescher Laboratory of the Max Planck Society (2) Max Planck Institute for Developmental Biology ubingen, Germany September 25, 2008 Fabio De Bona (FML, T¨ ubingen) QPalma September 25, 2008 1 / 18

Upload: others

Post on 01-Nov-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

QPalma - Optimal Spliced Alignments of ShortSequence Reads

Fabio De Bona1

Stephan Ossowski2

Korbinian Schneeberger2

Gunnar Ratsch1

(1) Friedrich Miescher Laboratory of the Max Planck Society(2) Max Planck Institute for Developmental Biology

Tubingen, Germany

September 25, 2008

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 1 / 18

Page 2: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Introductionfml

Next Generation (NG) sequencing technologies produce hugeamounts of sequencing data

Difference to Sanger sequencing:I Much cheaper and fasterI Much more and shorter fragments ⇒ “reads”I Much more errors

Genome / transcriptome sequencing

Data can be used via local alignments:I Discovery of new genes,I Identification of alternative splice forms, . . .

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 2 / 18

Page 3: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Introductionfml

Next Generation (NG) sequencing technologies produce hugeamounts of sequencing data

Difference to Sanger sequencing:I Much cheaper and fasterI Much more and shorter fragments ⇒ “reads”I Much more errors

Genome / transcriptome sequencing

Data can be used via local alignments:I Discovery of new genes,I Identification of alternative splice forms, . . .

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 2 / 18

Page 4: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Transcriptome Analysis with NG Sequencingfml

QPalma’s aim: Accurately Align transcriptome reads to genomic sequences

Genomic read mapping already challenging

Transcriptome read mapping is even more difficult:I Spliced reads stem from several genomic regionsI Short reads getting even shorter due to splicing⇒ Alignments more error prone

Improve alignments by using more information:I Accurate splice site modelsI Intron length modelI Quality scores model

Idea: Use a machine learning method to infer optimal parameters

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 3 / 18

Page 5: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Transcriptome Analysis with NG Sequencingfml

QPalma’s aim: Accurately Align transcriptome reads to genomic sequences

Genomic read mapping already challenging

Transcriptome read mapping is even more difficult:I Spliced reads stem from several genomic regionsI Short reads getting even shorter due to splicing⇒ Alignments more error prone

Improve alignments by using more information:I Accurate splice site modelsI Intron length modelI Quality scores model

Idea: Use a machine learning method to infer optimal parameters

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 3 / 18

Page 6: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Transcriptome Analysis with NG Sequencingfml

QPalma’s aim: Accurately Align transcriptome reads to genomic sequences

Genomic read mapping already challenging

Transcriptome read mapping is even more difficult:I Spliced reads stem from several genomic regionsI Short reads getting even shorter due to splicing⇒ Alignments more error prone

Improve alignments by using more information:I Accurate splice site modelsI Intron length modelI Quality scores model

Idea: Use a machine learning method to infer optimal parameters

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 3 / 18

Page 7: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Spliced vs. Unspliced Alignments

fml

Find matching region on genome with a few mismatches

Efficient data structures for mapping many reads

Most current mapping techniques are limited to unspliced reads

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 4 / 18

Page 8: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Spliced vs. Unspliced Alignments

fml

Find matching region on genome with a few mismatches

Efficient data structures for mapping many reads

Most current mapping techniques are limited to unspliced reads

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 4 / 18

Page 9: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Spliced vs. Unspliced Alignments

fml

Find matching region on genome with a few mismatches

Efficient data structures for mapping many reads

Most current mapping techniques are limited to unspliced reads

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 4 / 18

Page 10: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Extended Smith-Waterman Algorithm

fml

Classical scoring f : Σ× Σ→ R

Source of Information

Sequence matches

Computational splice sitepredictions

Intron length model

Read quality information

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 5 / 18

Page 11: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Extended Smith-Waterman Algorithm

fml

Classical scoring f : Σ× Σ→ R

Source of Information

Sequence matches

Computational splice sitepredictions

Intron length model

Read quality information

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 5 / 18

Page 12: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Extended Smith-Waterman Algorithm

fml

Classical scoring f : Σ× Σ→ R

Source of Information

Sequence matches

Computational splice sitepredictions

Intron length model

Read quality information

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 5 / 18

Page 13: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Extended Smith-Waterman Algorithm

fml

Quality scoring f : (Σ× R)× Σ→ R

Source of Information

Sequence matches

Computational splice sitepredictions

Intron length model

Read quality information

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 5 / 18

Page 14: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Extended Smith-Waterman Algorithm

fml

Quality scoring f : (Σ× R)× Σ→ R

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 5 / 18

Page 15: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Scoring Parameter Inferencefml

What are optimal parameters?

How do we jointly optimize 336 parameters?

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 6 / 18

Page 16: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Scoring Parameter Inferencefml

What are optimal parameters?

How do we jointly optimize 336 parameters?

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 6 / 18

Page 17: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Cartoon: Maximize the Marginfml

truefalse false

Correct alignment is not highest scoring one

Correct alignment is highest scoring one

Can we do better?

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 7 / 18

Page 18: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Cartoon: Maximize the Marginfml

truefalse

Correct alignment is not highest scoring one

Correct alignment is highest scoring one

Can we do better?

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 7 / 18

Page 19: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Cartoon: Maximize the Marginfml

false true

Use a technique motivated by “large-margin” methods

Idea: Enforce a margin between correct and incorrect examples

One has to solve a huge quadratic problem

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 7 / 18

Page 20: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome reads

I Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 21: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 22: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

⇒ Identify reads overlapping into introns

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 23: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

⇒ Remove overlapping parts

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 24: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

⇒ Remove overlapping parts

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 25: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

⇒ Combine both exonic parts

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 26: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

How Can We Generate Data for Training?fml

How do we obtain true alignments for training QPalma?

Simulate realistic transcriptome readsI Consider well-annotated genome: Arabidopsis thalianaI Use short genomic reads (38nt) with natural quality scoresI Use reference annotation (TAIR7)

I Generate artificially spliced reads:

⇒ Truncate to desired length (38nt)

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 8 / 18

Page 27: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome readsTested on 30, 000 spliced reads

SmithW Intron Intron+Splice Intron+Splice +Quality

Alig

nmen

t Er

ror

Rate

14.19% 9.96% 1.94% 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 9 / 18

Page 28: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome readsTested on 30, 000 spliced reads

SmithW Intron Intron+Splice Intron+Splice +Quality

Alig

nmen

t Er

ror

Rate

14.19% 9.96% 1.94% 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 9 / 18

Page 29: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome readsTested on 30, 000 spliced reads

SmithW Intron Intron+Splice Intron+Splice +Quality

Alig

nmen

t Er

ror

Rate

14.19% 9.96% 1.94% 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 9 / 18

Page 30: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Error Rate Depends on Intron Positionfml

1 Trust introns confirmed by spliced read with ≥ 6nt overlap

2 Pure Smith-Waterman algorithm would need longer overlapsand would still perform worse

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 10 / 18

Page 31: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Error Rate Depends on Intron Positionfml

1 Trust introns confirmed by spliced read with ≥ 6nt overlap

2 Pure Smith-Waterman algorithm would need longer overlapsand would still perform worse

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 10 / 18

Page 32: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Error Rate Depends on Intron Positionfml

1 Trust introns confirmed by spliced read with ≥ 6nt overlap

2 Pure Smith-Waterman algorithm would need longer overlapsand would still perform worse

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 10 / 18

Page 33: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Can We Use QPalma for Whole-Genome Alignments?

fmlSo far we had two assumptions:

1 All reads are spliced

2 Genomic region is known

1 Many reads will be fully contained in an exon

2 Direct alignment to genome too expensive (O(n ·m))

Page 34: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Can We Use QPalma for Whole-Genome Alignments?

fmlSo far we had two assumptions:

1 All reads are spliced

2 Genomic region is known

1 Many reads will be fully contained in an exon

2 Direct alignment to genome too expensive (O(n ·m))

Page 35: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

A Pipeline for Efficient Large Scale Alignmentfml

Pipeline Workflow

1 Map unspliced reads & find seed regionsI Use suffix-array based method to find short read match with at most

two mismatches (Vmatch, Kurtz et al.; ShoRe, Ossowski et al. 2008)

I ≈ 4h for 2, 6× 106 reads

2 Recover potentially spliced reads from first mappingI Use fast approximation of QPalma to decide which reads may be

spliced (even if they can be mapped well)

I ≈ 17min for 2, 2× 106 reads

3 Identify accurate alignments for the candidate spliced readsI Use full QPalma model to align remaining reads

I ≈ 8h for ∼ 442, 000 reads

Page 36: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

A Pipeline for Efficient Large Scale Alignmentfml

Pipeline Workflow

1 Map unspliced reads & find seed regionsI Use suffix-array based method to find short read match with at most

two mismatches (Vmatch, Kurtz et al.; ShoRe, Ossowski et al. 2008)

I ≈ 4h for 2, 6× 106 reads

2 Recover potentially spliced reads from first mappingI Use fast approximation of QPalma to decide which reads may be

spliced (even if they can be mapped well)

I ≈ 17min for 2, 2× 106 reads

3 Identify accurate alignments for the candidate spliced readsI Use full QPalma model to align remaining reads

I ≈ 8h for ∼ 442, 000 reads

Page 37: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

A Pipeline for Efficient Large Scale Alignmentfml

Pipeline Workflow

1 Map unspliced reads & find seed regionsI Use suffix-array based method to find short read match with at most

two mismatches (Vmatch, Kurtz et al.; ShoRe, Ossowski et al. 2008)

I ≈ 4h for 2, 6× 106 reads

2 Recover potentially spliced reads from first mappingI Use fast approximation of QPalma to decide which reads may be

spliced (even if they can be mapped well)

I ≈ 17min for 2, 2× 106 reads

3 Identify accurate alignments for the candidate spliced readsI Use full QPalma model to align remaining reads

I ≈ 8h for ∼ 442, 000 reads

Page 38: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

A Pipeline for Efficient Large Scale Alignmentfml

Pipeline Workflow

1 Map unspliced reads & find seed regionsI Use suffix-array based method to find short read match with at most

two mismatches (Vmatch, Kurtz et al.; ShoRe, Ossowski et al. 2008)

I ≈ 4h for 2, 6× 106 reads

2 Recover potentially spliced reads from first mappingI Use fast approximation of QPalma to decide which reads may be

spliced (even if they can be mapped well)

I ≈ 17min for 2, 2× 106 reads

3 Identify accurate alignments for the candidate spliced readsI Use full QPalma model to align remaining reads

I ≈ 8h for ∼ 442, 000 reads

Page 39: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Application to Natural Transcriptome Readsfml

RNA-seq reads for A. thaliana (provided by Weigel lab, MPI Devel. Biology)

4 lanes from Illumina Genome Analyzer 1G (≈ 50× coverage)

38nt reads, polyA enriched

Strand unspecific, young leaves

Initial Read mapping by ShoRe (http://1001genomes.org)

Spliced alignment by QPalma(http://fml.mpg.de/raetsch/projects/qpalma)

⇒ ≈ 30 million unspliced

⇒ ≈ 1 million spliced reads

Exon coverage

Annotation

Intron coverage

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 13 / 18

Page 40: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Application to Natural Transcriptome Readsfml

RNA-seq reads for A. thaliana (provided by Weigel lab, MPI Devel. Biology)

4 lanes from Illumina Genome Analyzer 1G (≈ 50× coverage)

38nt reads, polyA enriched

Strand unspecific, young leaves

Initial Read mapping by ShoRe (http://1001genomes.org)

Spliced alignment by QPalma(http://fml.mpg.de/raetsch/projects/qpalma)

⇒ ≈ 30 million unspliced

⇒ ≈ 1 million spliced reads

Exon coverage

Annotation

Intron coverage

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 13 / 18

Page 41: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Application to Natural Transcriptome Readsfml

RNA-seq reads for A. thaliana (provided by Weigel lab, MPI Devel. Biology)

4 lanes from Illumina Genome Analyzer 1G (≈ 50× coverage)

38nt reads, polyA enriched

Strand unspecific, young leaves

Initial Read mapping by ShoRe (http://1001genomes.org)

Spliced alignment by QPalma(http://fml.mpg.de/raetsch/projects/qpalma)

⇒ ≈ 30 million unspliced

⇒ ≈ 1 million spliced reads

Exon coverage

Annotation

Intron coverage

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 13 / 18

Page 42: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

What Can We Do With It?fml

Exon coverage

Annotation

Intron coverage

For instance:

Assemble alignments to obtain splice graphs (describing alternative splicing)

I Only works well for highly covered genes

Use exon/intron coverage tracks to improve gene findingI Considerable improvements appear possible

Estimate relative abundances of alternative transcriptsI Spliced reads help as they can connect exons

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 14 / 18

Page 43: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

What Can We Do With It?fml

Exon coverage

Annotation

Intron coverage

For instance:

Assemble alignments to obtain splice graphs (describing alternative splicing)

I Only works well for highly covered genes

Use exon/intron coverage tracks to improve gene findingI Considerable improvements appear possible

Estimate relative abundances of alternative transcriptsI Spliced reads help as they can connect exons

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 14 / 18

Page 44: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

What Can We Do With It?fml

Exon coverage

Annotation

Intron coverage

For instance:

Assemble alignments to obtain splice graphs (describing alternative splicing)

I Only works well for highly covered genes

Use exon/intron coverage tracks to improve gene findingI Considerable improvements appear possible

Estimate relative abundances of alternative transcriptsI Spliced reads help as they can connect exons

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 14 / 18

Page 45: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Summaryfml

New method for accurate de novo spliced alignments of short reads

Challenging data: many short & error prone reads

Proposed algorithm for parameter estimation

Successfully combined many information sources

Can be applied to data from other sequencing platforms (ABI, . . .)

Fast enough for whole genome/transcriptome sequencingI Currently, very good for smaller genomesI When excluding indels, then also fast for mouse/human (in progress)

Potential improvements

Speed-up dynamic program

Integrated seed region finding and spliced alignments

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 15 / 18

Page 46: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Summaryfml

New method for accurate de novo spliced alignments of short reads

Challenging data: many short & error prone reads

Proposed algorithm for parameter estimation

Successfully combined many information sources

Can be applied to data from other sequencing platforms (ABI, . . .)

Fast enough for whole genome/transcriptome sequencingI Currently, very good for smaller genomesI When excluding indels, then also fast for mouse/human (in progress)

Potential improvements

Speed-up dynamic program

Integrated seed region finding and spliced alignments

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 15 / 18

Page 47: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Summaryfml

New method for accurate de novo spliced alignments of short reads

Challenging data: many short & error prone reads

Proposed algorithm for parameter estimation

Successfully combined many information sources

Can be applied to data from other sequencing platforms (ABI, . . .)

Fast enough for whole genome/transcriptome sequencingI Currently, very good for smaller genomesI When excluding indels, then also fast for mouse/human (in progress)

Potential improvements

Speed-up dynamic program

Integrated seed region finding and spliced alignments

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 15 / 18

Page 48: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Summaryfml

New method for accurate de novo spliced alignments of short reads

Challenging data: many short & error prone reads

Proposed algorithm for parameter estimation

Successfully combined many information sources

Can be applied to data from other sequencing platforms (ABI, . . .)

Fast enough for whole genome/transcriptome sequencingI Currently, very good for smaller genomesI When excluding indels, then also fast for mouse/human (in progress)

Potential improvements

Speed-up dynamic program

Integrated seed region finding and spliced alignments

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 15 / 18

Page 49: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

AcknowledgementsfmlCoauthors

Stephan Ossowski

Korbinian Schneeberger

Gunnar Ratsch

Weigel Lab – DNA- and RNA-seq Data

Jun Cao

Richard Clark

Christa Lanz

Detlef Weigel

Stipend

Friedrich Miescher Laboratory of the Max Planck Society

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 16 / 18

Page 50: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Thank [email protected]

http://www.fml.tuebingen.mpg.de/raetsch/projects/qpalma

Poster D19

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 17 / 18

Page 51: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome reads

Tested on 30, 000 spliced reads

Quality Splice Intron Errorinformation site pred. length rate

- - - 14.19 %+ - - 13.49 %

- - + 9.96 %+ - + 9.68 %

- + - 3.16 %+ + - 2.81 %

- + + 1.94 %+ + + 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 18 / 18

Page 52: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome reads

Tested on 30, 000 spliced reads

Quality Splice Intron Errorinformation site pred. length rate

- - - 14.19 %+ - - 13.49 %

- - + 9.96 %+ - + 9.68 %

- + - 3.16 %+ + - 2.81 %

- + + 1.94 %+ + + 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 18 / 18

Page 53: QPalma - Optimal Spliced Alignments of Short Sequence Reads · How Can We Generate Data for Training? fml How do we obtain true alignments for training QPalma? Simulate realistic

Results on Artificially Spliced ReadsfmlGiven: Short reads and corresponding genomic regions

Task: Find the correct spliced alignment

Trained on 10, 000 artificially spliced genome reads

Tested on 30, 000 spliced reads

Quality Splice Intron Errorinformation site pred. length rate

- - - 14.19 %+ - - 13.49 %

- - + 9.96 %+ - + 9.68 %

- + - 3.16 %+ + - 2.81 %

- + + 1.94 %+ + + 1.78%

Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 18 / 18