Algorithms for Alignment of Genomic Sequences

Michael Brudno

Department of Computer Science, Stanford University

PGA Workshop 07/16/2004

Conservation Implies Function

Exon

Gene

CNS: Other Conserved

Edit Distance Model (1)

Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA          AGGCACACA
|  ||||  ||    or    |  ||  ||
A--CACATTCA          ACACATTCA
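As a rough illustration of the weighted sum, the sketch below scores the two alignments shown on this slide column by column. The weights (match +1, mismatch -1, gap -2) are assumptions chosen for the example; the slide does not give numeric values.

```python
def alignment_score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Score a gapped pairwise alignment one column at a time.

    The default weights are illustrative assumptions, not values from the talk.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match      # identical bases
        else:
            score += mismatch   # mutation
    return score


# The two candidate alignments from the slide.
gapped = alignment_score("AGGCACA--CA", "A--CACATTCA")
ungapped = alignment_score("AGGCACACA", "ACACATTCA")
```

With these particular weights the ungapped alignment scores higher; a cheaper gap penalty would favor the gapped one, which is why the choice of weights matters.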

Edit Distance Model (2)

Given: x, y

Define: F(i,j) = Score of best alignment of x1…xi to y1…yj

Recurrence: F(i,j) = max( F(i-1,j) - GAP_PENALTY,
                          F(i,j-1) - GAP_PENALTY,
                          F(i-1,j-1) + SCORE(xi, yj) )
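The recurrence can be sketched directly as a dynamic program. This is a minimal version that returns only the optimal score; the scoring values and the linear gap initialization of the first row and column are assumptions made for the example.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Fill F(i,j) with the recurrence above and return F(len(x), len(y)).

    SCORE(xi, yj) is assumed to be +match for equal bases, +mismatch otherwise.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,   # gap in y
                F[i][j - 1] - gap_penalty,   # gap in x
                F[i - 1][j - 1] + s,         # match or mutation
            )
    return F[n][m]
```

Recovering the alignment itself would require keeping back-pointers and tracing back from F(n,m), which is omitted here.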

Edit Distance Model (3)

F(i,j) = Score of best alignment ending at i,j

Time O(n^2) for two seqs, O(n^k) for k seqs

[Figure: DP matrix cell F(i,j) computed from its neighbors F(i-1,j), F(i,j-1) and F(i-1,j-1)]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)

  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

F(i,j) = max (F(i,j), 0)

Return all paths with a position i,j where F(i,j) > C

Time O(n^2) for two seqs, O(n^k) for k seqs
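A minimal sketch of the local variant: the same recurrence, clamped at zero, with every cell whose score exceeds the cutoff C reported. The scoring values are assumptions, and the function returns end positions rather than full paths; tracing each hit back to a zero cell would recover the path.

```python
def local_alignment_hits(x, y, threshold, match=1, mismatch=-1, gap_penalty=2):
    """Return (i, j, score) for every cell with F(i,j) > threshold.

    Positions are 1-based ends of a local alignment in x and y.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = max(
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + s,
            )
            F[i][j] = max(best, 0)   # a local alignment may restart anywhere
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits


# The shared substring ACGTACGT ends at position 11 in both toy sequences.
hits = local_alignment_hits("GGGACGTACGT", "CCCACGTACGTAAA", threshold=7)
```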

Heuristic Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

[Figure: the sequence pair above shown in two panels, labeled BLAST and FASTA]

CHAOS: CHAins Of Seeds

1. Find short matching words (seeds)

2. Chain them

3. Rescore chain
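Step 1 can be sketched with a simple k-mer index. This is a simplified stand-in that finds exact matching words with a hash table; the function name and the choice of index are assumptions for illustration and are not taken from the CHAOS implementation.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k):
    """Return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k] (0-based)."""
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)          # every k-mer of seq2
    seeds = []
    for i in range(len(seq1) - k + 1):          # walk along seq1
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


seeds = find_seeds("ACGTAC", "TACGT", k=3)
```

Seeds come out ordered by their position in seq1, which is the order the chaining step consumes them in.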

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1

• Find the previous seeds that fall into the search box

• Do a range query: seeds are indexed by their diagonal

• Pick a previous seed that maximizes the score of chain

[Figure: seeds plotted with seq1 and seq2 as axes. Labels: location in seq1, distance cutoff, gap cutoff, seed, search box, range of search]

Time O(n log n), where n is number of seeds.
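The chaining loop might look roughly like the sketch below. Several things are assumptions made for the example: seeds are non-overlapping exact k-mers, a chain scores k per seed minus the diagonal shift between consecutive seeds, and the diagonal index is a sorted Python list queried with `bisect`. Insertion into a list is linear, so this sketch does not reach the O(n log n) bound; a balanced tree or skip list would be needed for that, along with eviction of seeds that have fallen behind the distance cutoff.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Chain k-long exact seeds given as (pos_in_seq1, pos_in_seq2) pairs.

    Returns (score, chain) for the best-scoring chain found.
    """
    seeds = sorted(seeds)              # sweep seq1 from left to right
    if not seeds:
        return 0, []
    score = [k] * len(seeds)           # a lone seed scores its k matching bp
    prev = [None] * len(seeds)
    by_diagonal = []                   # sorted (diagonal, seed index) pairs

    for cur, (i, j) in enumerate(seeds):
        diag = i - j
        # Range query: earlier seeds whose diagonal is within gap_cutoff.
        lo = bisect.bisect_left(by_diagonal, (diag - gap_cutoff, -1))
        hi = bisect.bisect_right(by_diagonal, (diag + gap_cutoff, len(seeds)))
        for prev_diag, p in by_diagonal[lo:hi]:
            pi, pj = seeds[p]
            ends_before = pi + k <= i and pj + k <= j
            close_enough = i - (pi + k) <= distance_cutoff
            if ends_before and close_enough:
                candidate = score[p] + k - abs(diag - prev_diag)
                if candidate > score[cur]:
                    score[cur], prev[cur] = candidate, p
        bisect.insort(by_diagonal, (diag, cur))

    # Trace back from the best-scoring seed.
    best = max(range(len(seeds)), key=score.__getitem__)
    chain = []
    node = best
    while node is not None:
        chain.append(seeds[node])
        node = prev[node]
    return score[best], chain[::-1]


best_score, best_chain = chain_seeds(
    [(0, 0), (5, 5), (10, 12), (50, 3)], k=4, distance_cutoff=10, gap_cutoff=3
)
```

In the toy input, the seed at (50, 3) lies on a far-away diagonal, so it falls outside the search box and is left out of the chain.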

CHAOS Scoring

• Initial score = # matching bp - gaps

• Rapid rescoring: extend all seeds to find optimal location for gaps

Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

x

y

z

LAGAN: 1. FIND Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

LAGAN: 2. CHAIN Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

LAGAN: 3. Restricted DP

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP
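Steps 2 and 3 can be illustrated with a small sketch: pick the highest-scoring set of local alignments that is increasing in both sequences, then count how many DP cells remain if alignment is only computed in the boxes between consecutive anchors. The quadratic chaining loop, the tuple layout and the box-only restriction are simplifying assumptions; the sketch is not the LAGAN implementation.

```python
def chain_anchors(anchors):
    """Best-scoring chain of anchors that is increasing in both sequences.

    Each anchor is (start1, end1, start2, end2, score) with half-open ends.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [a[4] for a in anchors]
    prev = [None] * len(anchors)
    for b in range(len(anchors)):
        for a in range(b):
            follows = anchors[a][1] <= anchors[b][0] and anchors[a][3] <= anchors[b][2]
            if follows and best[a] + anchors[b][4] > best[b]:
                best[b] = best[a] + anchors[b][4]
                prev[b] = a
    node = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while node is not None:
        chain.append(anchors[node])
        node = prev[node]
    return chain[::-1]


def restricted_cells(chain, len1, len2):
    """Number of residue pairs inside the boxes between chained anchors."""
    cells = 0
    p1 = p2 = 0
    for start1, end1, start2, end2, _ in chain:
        cells += (start1 - p1) * (start2 - p2)
        p1, p2 = end1, end2
    cells += (len1 - p1) * (len2 - p2)   # box after the last anchor
    return cells


anchors = [(10, 20, 10, 20, 10), (15, 30, 40, 50, 5),
           (40, 50, 30, 40, 8), (60, 70, 60, 70, 10)]
chain = chain_anchors(anchors)
cells = restricted_cells(chain, 100, 100)
```

For this toy input the restricted area is 1,400 residue pairs, against 10,000 for the full matrix of two 100 bp sequences.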

MLAGAN: 1. Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
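The order of pairwise steps follows from a post-order walk of the tree. In the sketch below the tree is written as nested tuples, and the topology ((Human, Baboon), (Mouse, Rat)) is an assumption, since the slide text lists only the leaves.

```python
def progressive_order(tree):
    """List the pairwise alignment steps implied by a binary guide tree.

    Leaves are strings; internal nodes are (left, right) tuples.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node
        left, right = node
        a = visit(left)
        b = visit(right)
        steps.append((a, b))          # align the two child alignments
        return f"({a}/{b})"

    visit(tree)
    return steps


steps = progressive_order((("Human", "Baboon"), ("Mouse", "Rat")))
```

Each step would be carried out by a pairwise aligner, with the later steps aligning alignments rather than single sequences.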

MLAGAN: 2. Multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: diagram with panels labeled XZ, YZ, X/Y and Z]

Cystic Fibrosis (CFTR), 12 species

• Human sequence length: 1.8 Mb

• Total genomic sequence: 13 Mb

Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish

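CFTR (cont’d)

Method   Species            % Exons Aligned   TIME (sec)   MAX MEMORY (Mb)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%

(MLAGAN time and memory are listed once for the multiple alignment.)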

Automatic computational system for comparative analysis of pairs of genomes

http://pipeline.lbl.gov

Alignments (all pair combinations):

Human Genome (Golden Path Assembly)

Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)

Rat assemblies: January 2003, February 2003

D. melanogaster vs D. pseudoobscura (February 2003)

Tandem Local/Global Approach

• Finding a likely mapping for a contig (BLAT)

Progressive Alignment Scheme

[Flowchart with yes/no branches; box labels in reading order:]

• Human, Mouse and Rat genomes

• Pairwise M/R mapping

• Aligned M&R fragments / Unaligned M&R sequences

• Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome

• H/M/R MLAGAN alignment

• M/R pairwise alignment

• M/H and R/H pairwise alignment

• Unassigned M&R DNA fragments

Computational Time

PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.

Pair-wise rat/mouse: 4 hours

Pair-wise rat/human and mouse/human: 2 hours

Multiple human/mouse/rat: 9 hours

Total wall time: ~15 hours

Distribution of Large Indels

[Figure: histogram of large indels. x-axis: indel length (100 to 550); y-axis: count (0 to 200)]

Evolution Over a Chromosome

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

[Figure: the sequence pair above shown in two panels, labeled Local and Global]

Glocal Alignment Problem

Find the least-cost transformation of one sequence into another using new operations:

• Sequence edits

• Inversions

• Translocations

• Duplications

• Combinations of the above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Shuffle-LAGAN

A glocal aligner for long DNA sequences

S-LAGAN: Find Local Alignments

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

S-LAGAN: Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Building the Homology Map

[Figure: four ways of chaining two local alignments, labeled a, b, c, d]

Chain (using Eppstein-Galil); each alignment gets a score which is the MAX over 4 possible chains.

Penalties are affine (event and distance components).

Penalties:

a) regular

b) translocation

c) inversion

d) inverted translocation
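To make the four penalty types concrete, here is a rough sketch of chaining that is monotonic in seq1 only. Everything numeric is hypothetical: the event penalties, the distance unit, and the rule used to tell the four cases apart (same or different strand, forward or backward in seq2) are assumptions for illustration. The quadratic loop is a swapped-in substitute for the Eppstein-Galil sparse dynamic programming named on the slide.

```python
from collections import namedtuple

LocalAlignment = namedtuple("LocalAlignment", "name pos1 pos2 strand score")

# Hypothetical event penalties; the talk does not give numeric values.
EVENT_PENALTY = {
    "regular": 0,
    "translocation": 20,
    "inversion": 15,
    "inverted translocation": 30,
}
DISTANCE_UNIT = 100  # hypothetical: one penalty point per 100 bp in seq2


def transition_kind(prev, cur):
    """Classify the step from prev to cur (a simplified, assumed rule)."""
    same_strand = prev.strand == cur.strand
    forward = cur.pos2 > prev.pos2
    if same_strand:
        return "regular" if forward else "translocation"
    return "inversion" if forward else "inverted translocation"


def penalty(prev, cur):
    """Affine penalty: an event component plus a distance component."""
    distance = abs(cur.pos2 - prev.pos2) // DISTANCE_UNIT
    return EVENT_PENALTY[transition_kind(prev, cur)] + distance


def build_homology_map(alignments):
    """Best chain that is increasing in seq1 but unconstrained in seq2."""
    alns = sorted(alignments, key=lambda a: a.pos1)
    if not alns:
        return 0, []
    best = [a.score for a in alns]
    back = [None] * len(alns)
    for c, cur in enumerate(alns):
        for p in range(c):
            candidate = best[p] - penalty(alns[p], cur) + cur.score
            if candidate > best[c]:
                best[c], back[c] = candidate, p
    node = max(range(len(alns)), key=best.__getitem__)
    top = best[node]
    chain = []
    while node is not None:
        chain.append(alns[node])
        node = back[node]
    return top, chain[::-1]


alns = [
    LocalAlignment("A", 0, 0, "+", 100),
    LocalAlignment("B", 1000, 1000, "-", 200),
    LocalAlignment("C", 2000, 2000, "+", 100),
    LocalAlignment("D", 3000, 500, "+", 100),
]
map_score, map_chain = build_homology_map(alns)
kinds = [transition_kind(a, b) for a, b in zip(map_chain, map_chain[1:])]
```

In the toy input, B is an inverted segment between A and C, and D maps back to an earlier part of seq2, so the chain contains two inversion steps followed by a translocation.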

S-LAGAN: Global Alignment

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

S-LAGAN Results (CFTR)

Local

Glocal

S-LAGAN Results (CFTR)

Hum/Mus

Hum/Rat

S-LAGAN Results (IGF cluster)

S-LAGAN results (HOX)

• 12 paralogous genes

• Conserved order in mammals

S-LAGAN Results (Chr 20)

• Human Chr 20 v. homologous Mouse Chr 2.

• 270 Segments of conserved synteny

• 70 Inversions

S-LAGAN Results (Whole Genome)

           LAGAN     S-LAGAN
Total      37%       38%
Exon       93%       96%
Ups200     78%       81%
CPU Time   350 Hrs   450 Hrs

• Used Berkeley Genome Pipeline

• % Human genome aligned with mouse sequence

• Evaluation criteria from Waterston et al. (Nature 2002)

Rearrangements in Human v. Mouse

Preliminary conclusions:

• Rearrangements come in all sizes

• Duplications are less well conserved than other rearranged regions

• Simple inversions tend to be most common and most conserved

What is next? (Shuffle)

• Better algorithm and scoring

• Whole genome synteny mapping

• Multiple Glocal Alignment (!?)

Biological Story

• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development

Align Human, Mouse, Rat & Fugu

Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174

Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Acknowledgments

Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott, Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan

Berkeley: Inna Dubchak, Alexander Poliakov

Göttingen: Burkhard Morgenstern

Rat Genome Sequencing Consortium

http://lagan.stanford.edu/
