Post on 01-Feb-2022


Genome Assembly: The Orientation Problem

Paul M. Bodily Dr. Mark Clement

Computational Sciences Laboratory Department of Computer Science

Brigham Young University

Background: Genome Assembly


TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACC ATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTAATTTTTAGCTA TGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGG ATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTA CAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGG GTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGG TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTG GCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTAATTT CTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAG TTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACC AGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTG TCAAATAGTCCAGTAGAGGGCAGTCCACCAG CTAATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTT GCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCT TAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAG CTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT GGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTT CTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCA GCTAATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT TTTTTTAATGTTTACATTTATCTCTATGTTTACCTT GAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTAATTTTTAGCTA TTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCA ATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGT GCTAATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT GCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTT

Read 1:    ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGC
Read 2:     CCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCA
Read 3:      CCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAG
Read 4:       CGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGA
Read 5:        GGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAA
Read 6:         GCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG
Read 7:          CGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAGC…
Consensus: ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAGC…

Reads

Read 1:               TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA
Read 2:               CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA
Read 3:                ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Read 4:                ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG…

Possible assemblies:

Contig (reads 1 & 3): TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Contig (reads 1 & 4): TATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG
Contig (reads 2 & 3): CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAA
Contig (reads 2 & 4): CATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAAG
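The two cases on this slide can be sketched in Python. This is a minimal illustration, not the assembler used by the authors: it assumes the offset of every read is already known and that every column is covered by at least one read, and the function name `possible_assemblies` is hypothetical. The first read set is rebuilt as 42-base windows of the slide's consensus at offsets 0 to 6, which is how Reads 1 to 7 line up.

```python
from collections import Counter
from itertools import product

def possible_assemblies(placed_reads):
    """placed_reads: list of (offset, sequence) pairs; offsets are assumed known."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    # At each column keep every base that is tied for the highest count.
    choices = []
    for counts in columns:
        best = max(counts.values())
        choices.append(sorted(b for b, n in counts.items() if n == best))
    # One assembly per combination of tied bases.
    return ["".join(bases) for bases in product(*choices)]

# Case 1: overlapping reads that agree everywhere give a single consensus.
consensus = "ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAGC"
tiled = [(i, consensus[i:i + 42]) for i in range(7)]
unambiguous = possible_assemblies(tiled)

# Case 2: reads that disagree at the first and last base give four contigs.
core = "ATAGTAGCTGATTGTATTATTGATTGTATTGTATACTATATTAA"
conflicting = [(0, "T" + core), (0, "C" + core), (1, core + "A"), (1, core + "G")]
ambiguous = possible_assemblies(conflicting)
```

Enumerating every combination grows exponentially with the number of tied columns, so this only makes sense for a toy example like the one on the slide.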


Read 1:    ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGC
Read 2:     CCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCA
Read 3:      CCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAG
Read 4:       CGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGA
Read 5:        GGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAA
Read 6:         GCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG
Missing Read 7: ?????????????????????????????????????????…
Consensus: ACCCGGCGGCAGGAGAGGGGATGAAGATGGCGGACGCGAAGCAGAAG?


Maternal:

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTAATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT

Paternal:

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTCAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTTTGCTGATTTTTAGCTAATATCTAGCCAGGAGAGCAAGCACATAATTCTGGACAAATAAGTCATATACCTGTT


[Figure: graph with edge weights 6, 4, 8, 3, 4 and 4]

TTTTTTAA…ATTGACC
ATCTTTGC…TTTAGCTA
ATATCTAGC…TACCTGTT
TTTTTTAA…ACCTTTTTA
TGCTGGCT…GAGGTTAGG
CGTTT…GGTTTCAC

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTCAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT


The Problem

[Figure: graph with edge weights 3, 6, 4 and 9]

Problem Statement

  Given a graph where nodes represent DNA sequences and edges represent possible scaffoldings of these sequences, find the subgraph which maximizes edge weight such that

    in any traversal of the remaining subgraph, a node can only be traversed in a single direction, and

    the subgraph is void of cycles.
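The first constraint can be made concrete with a small Python sketch. The encoding is an assumption made for illustration and is not taken from the slides: each edge is a hypothetical tuple `(u, v, weight, flip)`, where `flip = 0` means the edge suggests both sequences lie on the same strand and `flip = 1` means one of them must be reverse-complemented. Under that assumption, a subgraph satisfies the single-direction rule exactly when every node can be given one orientation that agrees with all of its edges. The acyclicity constraint is not checked here.

```python
from collections import deque

def assign_orientations(nodes, edges):
    """Return {node: 0 or 1} if all edges agree on orientations, else None."""
    adjacency = {n: [] for n in nodes}
    for u, v, _weight, flip in edges:
        adjacency[u].append((v, flip))
        adjacency[v].append((u, flip))
    orientation = {}
    for start in nodes:
        if start in orientation:
            continue
        orientation[start] = 0  # arbitrary choice for each component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, flip in adjacency[u]:
                expected = orientation[u] ^ flip
                if v not in orientation:
                    orientation[v] = expected
                    queue.append(v)
                elif orientation[v] != expected:
                    return None  # some node would need both directions
    return orientation

def total_weight(edges):
    """Objective to maximize: the summed weight of the kept edges."""
    return sum(weight for _u, _v, weight, _flip in edges)

nodes = ["A", "B", "C"]
consistent = [("A", "B", 6, 0), ("B", "C", 4, 1), ("A", "C", 3, 1)]
conflicting = [("A", "B", 6, 0), ("B", "C", 4, 1), ("A", "C", 3, 0)]
```

In the second edge list the A to C edge contradicts the orientation implied by the path through B, so no valid assignment exists and at least one edge has to be dropped.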

Related Problem: Minimum Spanning Tree

  Given a connected weighted graph, find a minimum spanning tree.

http://en.wikipedia.org/wiki/Minimum_spanning_tree

Review: Kruskal’s Algorithm (1956)

  Create a forest F (a set of trees), where each vertex in the graph is a separate tree

  Create a set S containing all the edges in the graph

  While S is nonempty and F is not yet spanning

    remove an edge with minimum weight from S

    if that edge connects two different trees, then add it to the forest, combining two trees into a single tree

    otherwise discard that edge.

  At the termination of the algorithm, the forest forms a minimum spanning forest of the graph. If the graph is connected, the forest has a single component and forms a minimum spanning tree.
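The steps above can be sketched in Python with a simple union-find structure. The `(weight, u, v)` edge format and the example graph are illustrative assumptions.

```python
def kruskal(nodes, edges):
    """edges: list of (weight, u, v). Returns the edges of a minimum spanning forest."""
    parent = {n: n for n in nodes}  # each vertex starts as its own tree

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    forest = []
    for weight, u, v in sorted(edges):  # minimum weight first
        if len(forest) == len(nodes) - 1:
            break  # the forest is already spanning
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # combine the two trees
            forest.append((weight, u, v))
        # otherwise the edge would close a cycle and is discarded
    return forest

nodes = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (4, "A", "C"), (3, "B", "C"), (2, "C", "D"), (5, "B", "D")]
mst = kruskal(nodes, edges)
```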

http://en.wikipedia.org/wiki/Kruskal's_algorithm


Orientation Problem: Proposed Solution

  Create a forest F (a set of trees), where each vertex in the graph is a separate tree

  Create a set S containing all the edges in the graph

  While S is nonempty

    remove an edge with maximum weight from S

    if that edge connects two different trees, then add it to the forest, combining two trees into a single tree (reconcile orientations)

    if that edge connects two nodes within the same tree and suggests a consistent orientation, then add it to the tree

    otherwise discard that edge.
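One way to sketch these steps in Python is a union-find structure that also stores, for every node, its orientation relative to its parent. This is an interpretation rather than the authors' implementation: the `(weight, u, v, flip)` edge format is hypothetical (`flip = 1` meaning the edge suggests opposite strands), and reconciling orientations is handled implicitly by the stored parity instead of by toggling nodes one at a time. Path compression is omitted for brevity.

```python
def orient(nodes, edges):
    """edges: list of (weight, u, v, flip). Returns (kept edges, {node: 0 or 1})."""
    parent = {n: n for n in nodes}
    parity = {n: 0 for n in nodes}  # orientation of n relative to parent[n]

    def find(x):
        """Return (root of x's tree, orientation of x relative to that root)."""
        rel = 0
        while parent[x] != x:
            rel ^= parity[x]
            x = parent[x]
        return x, rel

    kept = []
    # Maximum weight first, the reverse of Kruskal's order.
    for weight, u, v, flip in sorted(edges, key=lambda e: e[0], reverse=True):
        root_u, rel_u = find(u)
        root_v, rel_v = find(v)
        if root_u != root_v:
            # Different trees: merge, choosing the parity that makes the
            # new edge consistent (this is the reconcile step).
            parent[root_u] = root_v
            parity[root_u] = rel_u ^ rel_v ^ flip
            kept.append((weight, u, v, flip))
        elif rel_u ^ rel_v == flip:
            # Same tree and the edge agrees with the existing orientations.
            kept.append((weight, u, v, flip))
        # Otherwise the edge is discarded.
    orientation = {n: find(n)[1] for n in nodes}
    return kept, orientation

nodes = ["A", "B", "C"]
edges = [(9, "A", "B", 1), (6, "B", "C", 0), (4, "A", "C", 0), (3, "A", "C", 1)]
kept, orientation = orient(nodes, edges)
```

In this toy input the weight-4 edge contradicts the orientations already implied by the two heavier edges and is dropped, while the weight-3 edge agrees with them and is kept.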

Orientation Problem: Proposed Solution (cont.)

  (reconcile orientations): if two trees have consistent orientations within themselves, but have an edge between them suggesting an inconsistent orientation, then we can arbitrarily toggle the orientation of all nodes in one of the trees while maintaining consistent orientation, allowing us then to add the inter-tree edge.
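The toggle can be written out explicitly. The sketch below assumes each tree is held as a set of node names and orientations as a 0/1 dictionary; the `flip` value has the same hypothetical meaning as before (1 if the edge suggests opposite strands). The slide says the choice of tree is arbitrary; toggling the smaller one is a design choice made here to reduce work.

```python
def merge_trees(tree_a, tree_b, orientation, u, v, flip):
    """Merge two internally consistent trees joined by edge (u, v).

    u must be in tree_a and v in tree_b. orientation is updated in place.
    """
    if orientation[u] ^ orientation[v] != flip:
        # The inter-tree edge disagrees: toggle every node of one tree.
        smaller = tree_a if len(tree_a) <= len(tree_b) else tree_b
        for node in smaller:
            orientation[node] ^= 1
    return tree_a | tree_b

orientation = {"A": 0, "B": 1, "C": 0, "D": 0, "E": 1}
tree_a = {"A", "B"}
tree_b = {"C", "D", "E"}
# Edge B-C suggests the same strand (flip = 0), but B and C currently differ.
merged = merge_trees(tree_a, tree_b, orientation, "B", "C", 0)
```

Because every node in the toggled tree changes, the relative orientation of any two nodes inside that tree is unchanged, which is why the tree stays consistent.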


Orientation Algorithm

[Figure: step-by-step run of the algorithm on an example graph with nodes A to G and edge weights 5, 5, 6, 6, 7, 7, 8, 9, 9, 11 and 15]

Cycles

  In Kruskal’s algorithm, the fact that we only add edges between trees eliminates cycles

  In a bidirected graph, cycles become significantly more complex

  Although applying the same rule would prevent cycles, it would also remove edges that did not create cycles
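A small Python sketch illustrates the second point. It assumes, for illustration only, that once orientations are fixed each kept edge can be read as a directed arc from one sequence to the next; the slides do not specify this encoding, and this is not the authors' cycle test. Kruskal's rule would reject the third edge in both examples because it joins two nodes of the same tree, yet only one of them actually closes a directed cycle.

```python
from collections import deque

def has_directed_cycle(nodes, arcs):
    """Kahn's algorithm: True if the directed graph contains a cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in arcs:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # Nodes on a cycle never reach indegree 0 and are never visited.
    return visited < len(nodes)

nodes = ["A", "B", "C"]
shortcut = [("A", "B"), ("B", "C"), ("A", "C")]  # same tree, no directed cycle
loop = [("A", "B"), ("B", "C"), ("C", "A")]      # same tree, directed cycle
```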

What does the green subgraph mean?!

[Figure: example graph with nodes A to G and edge weights 5, 5, 6, 6, 7, 7, 8, 9, 9, 11 and 15, with the selected subgraph shown in green]

ATCTTTGC…TTTAGCTA
ATATCTAGC…TACCTGTT
TTTTTTAA…ACCTTTTTA
TGCTGGCT…GAGGTTAGG
CGTTT…GGTTTCAC

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTGAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT

TTTTTTAATGTTTACATTTATCTCTATGTTTACCTTTTTAGTCACATTGACCTGCTGGCTCAATACCTCAAATAGTCCAGTAGAGGGCAGTCCACCAGGCAGAAAAGGTTAGGCGTTTTGGTTTCACATCTTTGCTGGGGAATAATAGGGGAAATGGCTGTTTT

TGCTGGCT…GAGGTTAGG

TTTTTTAA…ATTGACC

Algorithm Performance

  Time: O(e) for e edges

  Space: O(e+n) for e edges and n nodes

Results

[Figure: result on the example graph with nodes A to G and edge weights 5, 5, 6, 6, 7, 7, 8, 9, 9, 11 and 15]

Results: Synthetic Genome

Results: Take-Home Points

  In reality, most inconsistent orientations derive from spurious edges (errors)

  Inconsistent orientations do not happen very often

  Inversions and repeats: our biological assumption holds true most of the time, but not always

  The algorithm works well for the problem it is designed to solve

Questions?

  Paul Bodily ([email protected])