Members: Eishita Tyagi Sandeep Namburi Aarthi Talla Vinay Vyas Amin Momin Jay Humphrey COMPUTATIONAL GENOMICS GENOME ASSEMBLY

Upload: steve

Post on 05-Jan-2016

Page 1:

Members: Eishita Tyagi, Sandeep Namburi, Aarthi Talla, Vinay Vyas, Amin Momin, Jay Humphrey

COMPUTATIONAL GENOMICS

GENOME ASSEMBLY

Page 2:

Contents

• Assembly
– De novo
   • Algorithms Involved
– Reference
– Assembly problems
– Task and Strategy

Page 3:

How do we get Reads?

Page 4:

De novo Assembly

Reads

Overlap

Local Multiple Alignment

Contigs

Scaffolding

Alignment Scoring

Finishing

Assembly Problems:

-Repeats

-Chimerism

-Gaps

Page 5:

Overlapping Reads

• Greedy Algorithm
• Overlap-Layout-Consensus Algorithm
• Eulerian path Algorithm

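Page 6:

Greedy Algorithm

For X = abcbdab and Y = bdcaba, a longest common subsequence (LCS) is Z = bcba.

By inserting the non-LCS symbols while preserving the symbol order, we get a shortest common supersequence: abdcabdab. X and Y appear in it as subsequences, not as contiguous substrings.

Shortest common superstring: the "union" of two strings (X U Y), in which each string appears as a contiguous substring. Greedy assemblers approximate it by repeatedly merging the pair of reads with the longest suffix-prefix overlap.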
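The greedy idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the three toy reads are all assumptions made for the example, and sequencing errors and reverse complements are ignored.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # no remaining overlaps: the rest stay as separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_n:])
    return reads


# Toy reads (hypothetical), all taken from the string ATGCAGGTCA.
contigs = greedy_assemble(["ATGCAG", "GCAGGT", "AGGTCA"])
```

Because the choice at each step is only locally optimal, repeats can lead this procedure to merge reads from different genome locations.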

Page 7:

Overlap-Layout-Consensus Algorithm

• Graph based: G(V,E). How is it executed?
– de Bruijn Graph – a directed graph with vertices that represent sequences of symbols from an alphabet, and edges that indicate where the sequence may overlap.
– Nodes (V) = reads
– Edges (E) = between overlapping reads
– Path = Contig (each node occurs at least once)
• Builds graph – alignments
• Removing ambiguities
• Output is a set of nonintersecting simple paths, each path being a contig.
• Consensus sequence
• E.g., Celera Assembler, Arachne
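As a rough sketch of the overlap and layout steps listed above, the following Python builds a graph whose nodes are reads and whose edges are exact suffix-prefix overlaps, then walks one simple path through it. The names `build_overlap_graph` and `layout`, the `min_overlap` threshold, and the toy reads are assumptions for illustration; real assemblers such as those named above use inexact alignments and far more careful ambiguity removal.

```python
def suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Nodes are reads; an edge a -> b is stored with its overlap length."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = suffix_prefix_overlap(a, b)
                if n >= min_overlap:
                    graph[a][b] = n
    return graph


def layout(graph):
    """Walk from a read with no incoming edge, always taking the
    unvisited successor with the largest overlap."""
    has_incoming = {b for edges in graph.values() for b in edges}
    starts = [r for r in graph if r not in has_incoming]
    path = [starts[0]] if starts else []
    seen = set(path)
    while path:
        candidates = [(n, b) for b, n in graph[path[-1]].items() if b not in seen]
        if not candidates:
            break
        _, best = max(candidates)
        path.append(best)
        seen.add(best)
    return path


reads = ["ATGCAG", "GCAGGT", "AGGTCA"]  # hypothetical toy reads
graph = build_overlap_graph(reads, min_overlap=3)
path = layout(graph)
```

The consensus step would then merge the reads along `path` into a single contig sequence.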

Page 8:

Eulerian Path Algorithm

• De Bruijn graph
• Eulerian path – a path that visits every edge of a graph exactly once
• Breaks reads into overlapping n-mers.
• Each n-mer is an edge: its source is the (n-1)-mer prefix and its destination is the (n-1)-mer suffix.
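The construction above can be sketched as follows. This is an illustrative Python version using Hierholzer's algorithm to find the Eulerian path; the function names and the two toy reads are assumptions. It also assumes that each distinct n-mer occurs once in the genome and that an Eulerian path exists, and it does not check that every edge was used.

```python
from collections import defaultdict


def de_bruijn_edges(reads, n):
    """One edge per distinct n-mer: (n-1)-mer prefix -> (n-1)-mer suffix."""
    edges = set()
    for read in reads:
        for i in range(len(read) - n + 1):
            nmer = read[i:i + n]
            edges.add((nmer[:-1], nmer[1:]))
    return sorted(edges)


def eulerian_path(edges):
    """Hierholzer's algorithm; returns the list of nodes along the path."""
    if not edges:
        return []
    out = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((x for x in sorted(nodes) if outdeg[x] - indeg[x] == 1),
                 min(nodes))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Reconstruct the sequence: first node plus last symbol of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


edges = de_bruijn_edges(["ATGCA", "GCAGG"], n=3)  # hypothetical toy reads
sequence = spell(eulerian_path(edges))
```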

Page 9:

n-mer

– Build a table of n-mers contained in sequences (single pass through the genome)
– Generate the pairs from n-mer table

Figure: two views of the same sequence. Labeled "HAMILTONIAN (IDURY - WATERMAN)": ATG, TGC, GCA, CAG, AGG, GGT. Labeled "EULER": AT, TG, GC, CA, AG, GG.
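A minimal sketch of the two steps above, assuming n = 3 and the example sequence ATGCAGG (chosen here so that its 2-mers match the EULER labels in the figure; the sequence itself is not given on the slide). The names `nmer_table` and `nmer_pairs` are hypothetical.

```python
from collections import Counter


def nmer_table(sequence: str, n: int) -> Counter:
    """Count every n-mer in a single left-to-right pass."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def nmer_pairs(table):
    """For each n-mer, emit its ((n-1)-mer prefix, (n-1)-mer suffix) pair."""
    return sorted((nmer[:-1], nmer[1:]) for nmer in table)


table = nmer_table("ATGCAGG", 3)
pairs = nmer_pairs(table)
```

In the Hamiltonian view the n-mers in `table` are the nodes; in the Eulerian view each pair in `pairs` is an edge between (n-1)-mer nodes.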

Page 10:

MSA

• Correct errors using multiple alignment
• Score alignments
• Accept alignments with good scores
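One simple way to picture error correction from a multiple alignment is a per-column majority vote. The sketch below assumes the reads are already aligned to equal length, with "-" marking positions a read does not cover; the function name `consensus` and the three toy rows are assumptions, and real assemblers weight votes by base quality rather than counting them equally.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per column; '-' means the read has no base there."""
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    out = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        if bases:
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)


# The second read carries a single miscalled base (T instead of A).
aligned = [
    "ATGCAG---",
    "--GCTGGTC",
    "----AGGTC",
]
corrected = consensus(aligned)
```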

Page 11:

Parameters for Scoring

• length of overlap
• % identity in overlap region
• maximum overhang size
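The three parameters can be computed from an ungapped alignment as in the sketch below. The definition of overhang used here (unaligned bases on either side where both reads extend past the aligned region) is an assumption, as are the function names and the threshold defaults in `accept`; the slides do not specify them.

```python
def overlap_metrics(a, b, a_start, a_end, b_start, b_end):
    """Score an ungapped alignment of a[a_start:a_end] against b[b_start:b_end].

    Returns (overlap length, percent identity, overhang).
    """
    seg_a, seg_b = a[a_start:a_end], b[b_start:b_end]
    if len(seg_a) != len(seg_b):
        raise ValueError("ungapped alignment needs equal-length segments")
    length = len(seg_a)
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    identity = 100.0 * matches / length if length else 0.0
    # Assumed definition: bases left unaligned on both reads at the same side.
    overhang = min(a_start, b_start) + min(len(a) - a_end, len(b) - b_end)
    return length, identity, overhang


def accept(length, identity, overhang,
           min_length=4, min_identity=90.0, max_overhang=0):
    """Accept the overlap only if all three thresholds (hypothetical) hold."""
    return (length >= min_length
            and identity >= min_identity
            and overhang <= max_overhang)


# The suffix of `a` overlaps the prefix of `b` perfectly over 5 bases.
perfect = overlap_metrics("ATGCAGGT", "CAGGTTCA", 3, 8, 0, 5)
# Same coordinates, but `b` has one mismatch inside the overlap.
noisy = overlap_metrics("ATGCAGGT", "CATGTTCA", 3, 8, 0, 5)
```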

Page 12:

Contigs

• A continuous sequence of DNA that has been assembled from overlapping cloned DNA fragments.

• Reads are combined into contigs based on sequence similarity between reads.

Page 13:

Scaffolding

The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding.

– Scaffolding groups contigs into subsets with known order and orientation.
– Nodes (V) = contigs.
– Directed edge (E) = mate pairs between nodes.
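The scaffold graph described above can be sketched as follows. The data layout (a list of read-name pairs and a read-to-contig dictionary), the `min_links` threshold, and the function names are assumptions for illustration. Orientation and insert-size checks are left out for brevity, and the edge direction is simply taken from the order of the two reads in each pair.

```python
from collections import Counter


def scaffold_links(mate_pairs, read_to_contig, min_links=2):
    """Count mate pairs that join two different contigs; keep well-supported edges."""
    counts = Counter()
    for r1, r2 in mate_pairs:
        c1, c2 = read_to_contig.get(r1), read_to_contig.get(r2)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced read, or both mates inside one contig
        counts[(c1, c2)] += 1
    return {edge: n for edge, n in counts.items() if n >= min_links}


def order_contigs(links):
    """Follow the directed edges from the contig with no incoming edge.

    Assumes the kept edges form a single linear chain.
    """
    succ = {u: v for u, v in links}
    targets = set(succ.values())
    starts = [u for u in succ if u not in targets]
    if not starts:
        return []
    path, seen = [starts[0]], {starts[0]}
    while path[-1] in succ and succ[path[-1]] not in seen:
        nxt = succ[path[-1]]
        path.append(nxt)
        seen.add(nxt)
    return path


# Hypothetical read placements and mate pairs.
read_to_contig = {"r1": "c1", "r2": "c2", "r3": "c1", "r4": "c2",
                  "r5": "c2", "r6": "c3", "r7": "c2", "r8": "c3",
                  "r9": "c3", "r10": "c1"}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"),
              ("r7", "r8"), ("r9", "r10")]
links = scaffold_links(mate_pairs, read_to_contig)
scaffold = order_contigs(links)
```

Requiring at least two supporting pairs per edge is a common way to keep a single chimeric or misplaced pair from joining unrelated contigs.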

Page 14:

Mate Pairs or Paired End Reads

• A library of paired-end reads or mate pairs is used to determine the orientation and relative positions of contigs.
• Reads sequenced from the template DNA
• Known order and orientation (facing in, facing out, or facing the same direction) between reads.
• Known range of separation between read 5' ends.
• Approximately 84-nucleotide DNA fragments that have a 44-mer adaptor sequence in the middle flanked by a 20-mer sequence on each side.
• Mate pairs allow you to remove gaps and merge islands (contigs) into super-contigs.

Figure labels (read orientations): Sameward, Outward, Inward

Page 15:

A scaffold of 3 contigs (the thick arrows) held together by mate pairs

Mate Pairs are Needed to:

• Order Contigs
• Orient Contigs
• Fill Gaps in the assembly

Page 16:

Reference Assembly

Reads

Overlap

Local Multiple Alignment

Contigs

Map to a reference

Alignment Scoring

Finishing

Assembly Problems:

-Repeats

-Chimerism

-Gaps

Page 17:

Mapping contigs to a reference

Page 18:

Assembly Problems

• Errors from sequencing machines, e.g. missing a base or misreading a base

• Even at 8-10X coverage, there is a probability that some portion of the genome remains unsequenced

• Repeats lead to misassembly and gaps

• Chimeric reads: two fragments from two different parts of the genome are combined together
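The coverage point can be made concrete with the usual Poisson idealization (reads placed uniformly at random), under which the probability that a given base is covered by no read at coverage c is e^(-c). The sketch below uses that model; the genome size of 2.2 Mb is an assumed round figure for illustration, not a value from the slides, and real coverage is less uniform than the model suggests.

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Poisson model: probability that a given base is hit by zero reads."""
    return math.exp(-coverage)


def expected_missing_bases(genome_size: int, coverage: float) -> float:
    """Expected number of bases left uncovered under the same model."""
    return genome_size * fraction_unsequenced(coverage)


GENOME_SIZE = 2_200_000  # assumed genome size in bases, for illustration only

for c in (8, 10):
    print(f"{c}X: fraction {fraction_unsequenced(c):.2e}, "
          f"about {expected_missing_bases(GENOME_SIZE, c):.0f} bases uncovered")
```

Even under this optimistic model the expected number of uncovered bases is not zero, which is why gaps remain after assembly.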

Page 19:

Repeat Problems

• The ability of an assembly program to produce one contig for a chromosome is limited by regions of the genome that occur in multiple near-identical copies throughout the genome (repeats).

• The assembler incorrectly collapses the two copies of the repeat, leading to the creation of 2 contigs instead of 1.

• Thus, the number of contigs increases with the number of repeats.

• Repeated sequences within a genome also produce problems with higher-level ordering.

Page 20:

Genome mis-assembled due to a repeat.

Assembly programs may incorrectly combine the reads from the two copies of a repeat, leading to the creation of 2 separate contigs (contig-level misassembly).

Page 21:

Gaps

• A good assembler would have to ignore the repeats and generate one contig instead of two.
• A gap would be created in the place of the repeat.
• The higher the number of repeats, the more gaps are generated.

Chimeric reads

• Two fragments from two different parts of the genome are combined together.
• Can give a completely wrong assembly.

Page 22:

Finishing

• Process of completing the chromosome sequence.

• Re-sequence areas with gaps or less than 2x, 3x, 5x coverage

• Close gaps (usually by PCR or BACs)

• Expensive and time-consuming.
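Picking the regions to re-sequence can be sketched as a scan over per-base read depth. The function name `low_coverage_regions`, the depth list, and the threshold are hypothetical; the slide lists 2x, 3x and 5x as possible cutoffs.

```python
def low_coverage_regions(depth, min_depth):
    """Return half-open (start, end) intervals where depth < min_depth."""
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d < min_depth:
            if start is None:
                start = i  # a low-coverage run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical per-base depth along a short stretch of a contig.
depth = [5, 4, 1, 0, 0, 3, 6, 1]
targets = low_coverage_regions(depth, min_depth=2)
```

Positions with depth 0 inside a returned interval are true gaps; the rest are covered too thinly to trust the consensus.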

Page 23:

Our Task

• To assemble sequences of Neisseria meningitidis strains M13519 and M16917

– Strains are non-groupable

• M13519 matches Serogroup C (PCR), W135 (SASG)

• M16917 matches Serogroup Y (PCR), W135 (SASG)

• No completed genomes available for strains with Serogroup Y and W135.

Page 24:

Our Strategy

De novo assembly with Newbler and Mira3

Reference assembly using AMOScmp and Newbler

Best results from each merged with Minimus2

Finish by manual alignment

Page 25:

Important Assembler Metrics

• Number of large contigs
• Total size
• Coverage
• Average length
• N50
• Longest contig
• % genome assembled
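Several of these metrics follow directly from the list of contig lengths. The sketch below is illustrative: N50 is taken as the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly size. The 500-base cutoff for a "large" contig and the example lengths are assumptions, and coverage and % genome assembled are omitted because they need read and reference information.

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches 50%."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_metrics(lengths, large_cutoff=500):
    """Summary statistics computed from contig lengths alone."""
    return {
        "large_contigs": sum(1 for x in lengths if x >= large_cutoff),
        "total_size": sum(lengths),
        "average_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
        "longest_contig": max(lengths, default=0),
    }


metrics = assembly_metrics([1200, 800, 400, 100])  # hypothetical contig lengths
```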

Page 26:

NEXT PRESENTATION – WEDNESDAY

Initial Results and Lab