
Assemble… or not

Colloque EPGV 2013 Pierre Peterlongo

This talk

A « bird's-eye » view of the GenScale NGS work

EPGV Seminar - Lusignan - April 2013 2

Dominique Lavenier

Rayan Chikhi

Guillaume Rizk

Claire Lemaitre

Raluca Uricaru

And many collaborators, …

Minia, de novo assembly of a human genome on a laptop

Guillaume Rizk, Rayan Chikhi

Space-efficient and exact de Bruijn graph representation based on a Bloom filter
Algorithms in Bioinformatics, 2012 - Springer

Contiging, overview

Reads: TCAGGCAG ACTCAGCA ATATAATA …
→ k-mers
→ De Bruijn graph
→ Contig, Unitig: …TCAGGCAGATCGATACTCAGCACAACGTATATAATA…
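The pipeline above (reads, k-mers, De Bruijn graph, unitig) can be sketched in a few lines. This is a minimal illustration, not the code of any assembler: the function names are hypothetical, reverse complements and sequencing errors are ignored, and the unitig is only extended to the right of a chosen start k-mer.

```python
def kmers(read, k):
    """Return every substring of length k of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge links two k-mers overlapping by k-1 bases."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    succ = {node: set() for node in nodes}
    pred = {node: set() for node in nodes}
    for node in nodes:
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt in nodes:
                succ[node].add(nxt)
                pred[nxt].add(node)
    return nodes, succ, pred


def spell_unitig(start, succ, pred):
    """Extend to the right while the path does not branch, then spell it."""
    path = [start]
    seen = {start}
    while len(succ[path[-1]]) == 1:
        (nxt,) = succ[path[-1]]
        # Stop when the next node has several predecessors or closes a cycle.
        if len(pred[nxt]) != 1 or nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
    return path[0] + "".join(node[-1] for node in path[1:])


# Two overlapping toy reads (illustrative, not taken from the slides), k = 4.
nodes, succ, pred = build_de_bruijn(["ACTCAGCA", "CAGCATTG"], 4)
print(spell_unitig("ACTC", succ, pred))  # ACTCAGCATTG
```

Holding every k-mer in an in-memory set, as done here, is exactly what does not scale to a large genome.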


De Bruijn graph

•  Very simple example: one read

[Figure: De Bruijn graph of a single read. Source: homolog.us]

Contiging, overview

Reads → k-mers → De Bruijn graph (memory bottleneck) → Contig, Unitig

Minia approach

•  Use a BLOOM FILTER: a probabilistic data structure for indexing k-mers:

Is ACGATCGACTCAGCAT indexed?
   o  NO → we are sure ACGATCGACTCAGCAT is absent
   o  YES → we cannot conclude: FALSE POSITIVES

•  Minia:
   o  Store some of the false positives:
      •  Those that may be met while walking the graph.

Minia approach

•  Use a BLOOM FILTER: a probabilistic data structure for indexing k-mers, and store the annoying false positives

•  (Knowing that AACGATCGACTCAGCA exists) Is ACGATCGACTCAGCAT indexed?
   o  NO → we are sure ACGATCGACTCAGCAT is absent
   o  YES →
      •  if ACGATCGACTCAGCAT is not a stored FP, then it is present
      •  else, ACGATCGACTCAGCAT is absent
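The query logic above can be sketched as follows. This is an illustration under stated assumptions, not Minia's implementation: the class and function names are hypothetical, SHA-256 stands in for whatever hash functions the real tool uses, and the true k-mer set is held in memory here only to keep the example short.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions: no false negatives, some false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def neighbors(kmer):
    """The k-mers that a graph walk can reach from kmer in one step."""
    for base in "ACGT":
        yield kmer[1:] + base   # right extension
        yield base + kmer[:-1]  # left extension


def stored_false_positives(true_kmers, bloom):
    """False positives that may be met while walking the graph."""
    return {nb for kmer in true_kmers for nb in neighbors(kmer)
            if nb in bloom and nb not in true_kmers}


def is_present(kmer, bloom, stored_fp):
    """Exact answer as long as kmer is a true k-mer or a neighbor of one."""
    return kmer in bloom and kmer not in stored_fp


true_kmers = {"AACGATCGACTCAGCA", "ACGATCGACTCAGCAT"}
bloom = BloomFilter(n_bits=64, n_hashes=2)  # deliberately tiny
for kmer in true_kmers:
    bloom.add(kmer)
stored_fp = stored_false_positives(true_kmers, bloom)
print(is_present("ACGATCGACTCAGCAT", bloom, stored_fp))  # True
```

The answer is only guaranteed for k-mers reached from a known k-mer, which is why the slide question starts with "knowing that AACGATCGACTCAGCA exists".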


Minia results

k = 25: 13 bits per node.


Minia results: comparison with other human genome assemblies

Minia results: C. elegans nematode

33 million paired-end reads of length 100 bp (SRR065390)

Minia consequences

•  Exploring parameter sets
•  Developing light-memory NGS applications:
   o  Kissnp2
   o  Mapsembler
   o  Ultimate Gap Filler
   o  Kissplice
   o  Intl
   o  Minigraph
   o  …

Kissnp2: search for SNPs on your laptop

•  Assembly difficulty:
   o  Repeats
   o  Polymorphism

•  Find SNPs:
   o  Usual approach: map against a reference sequence (if one exists).
      •  (else) create a reference to map reads
   o  (else) Kissnp

Reads → kissnp →

>SNP_higher_path_1|score_27|high|left_contig_length_41|right_contig_length_45
atggcaadgggaataadcataadtadddctaaagtGACAACAGTCATTTTTTTCAAAGAACTTCAGTCTGGAACTATTATGTTGTTaggacatgtgatcdcatcaccagtatctcgaaatcctaaaada
>SNP_lower_path_1|score_27|high|left_contig_length_41|right_contig_length_45
atggcaadgggaataadcataadtadddctaaagtGACAACAGTCATTTTTTTCAAAGAATTTCAGTCTGGAACTATTATGTTGTTaggacatgtgatcdcatcaccagtatctcgaaatcctaaaada


Kissnp2: main idea

•  A SNP in the De Bruijn graph:


•  Algo
   o  Find a branching k-mer: two (or more) extension possibilities
   o  Walk the k-mers from one to another, until the full bubble is closed
   o  Uses the Minia data structure
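The two algorithm steps can be sketched as below. This is a simplified illustration, not the Kissnp2 code: the function names are hypothetical, a plain Python set stands in for the Minia data structure, reverse complements are ignored, and only isolated SNPs are handled (each branch is exactly k k-mers long before the two branches rejoin).

```python
from itertools import combinations


def successors(kmer, nodes):
    """Right extensions of kmer that exist in the graph."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]


def walk_branch(first, nodes, k):
    """Follow one branch for k k-mers; give up if it branches or stops."""
    path = [first]
    while len(path) < k:
        nxt = successors(path[-1], nodes)
        if len(nxt) != 1:
            return None
        path.append(nxt[0])
    return path


def spell(start, path, closing):
    """Sequence spelled by start k-mer, branch path and closing k-mer."""
    return start + "".join(n[-1] for n in path) + closing[-1]


def find_snp_bubbles(nodes, k):
    bubbles = []
    for start in sorted(nodes):
        # A branching k-mer has two (or more) extension possibilities.
        for a, b in combinations(successors(start, nodes), 2):
            pa, pb = walk_branch(a, nodes, k), walk_branch(b, nodes, k)
            if pa is None or pb is None:
                continue
            # The bubble is closed when both branches reach the same k-mer.
            ea, eb = successors(pa[-1], nodes), successors(pb[-1], nodes)
            if len(ea) == 1 and ea == eb:
                sa, sb = spell(start, pa, ea[0]), spell(start, pb, ea[0])
                if sum(x != y for x, y in zip(sa, sb)) == 1:
                    bubbles.append((sa, sb))
    return bubbles


# Toy example with k = 3: two alleles differing at one position.
k = 3
nodes = {s[i:i + k] for s in ("GACATGC", "GACTTGC") for i in range(len(s) - k + 1)}
print(find_snp_bubbles(nodes, k))  # [('GACATGC', 'GACTTGC')]
```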


Kissnp2: yes, but…

•  Pros:
   o  No reference needed
   o  Does not construct the full graph
   o  Works with 1, 2, 3, …, n datasets:

>SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67
TAATGTTAAATGACGAGTTAATGGGTGCAGCACATGAACATGGCACATGTA
>SNP_lower_path_16|C1_2|C2_5|Q1_70|Q2_68
TAATGTTAAATGACGAGTTAATGGGGGCAGCACATGAACATGGCACATGTA
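Headers like the ones above can be split into fields with a few lines of code. The sketch below assumes, without confirmation from the slides, that `C<i>_<n>` is the coverage in read set i and `Q<i>_<n>` the quality in read set i; the function name `parse_header` is hypothetical and is not part of Kissnp2.

```python
import re


def parse_header(header):
    """Split a '>SNP_<higher|lower>_path_<id>|C1_..|Q1_..' header into fields."""
    name, *fields = header.lstrip(">").split("|")
    match = re.fullmatch(r"SNP_(higher|lower)_path_(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    record = {"path": match.group(1), "snp_id": int(match.group(2)),
              "coverage": {}, "quality": {}}
    for field in fields:
        fm = re.fullmatch(r"([CQ])(\d+)_(\d+)", field)
        if fm is None:
            continue  # ignore fields this sketch does not know about
        key = "coverage" if fm.group(1) == "C" else "quality"  # assumed meaning
        record[key][int(fm.group(2))] = int(fm.group(3))
    return record


print(parse_header(">SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67"))
```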


•  Cons:
   1.  A sequencing error = a SNP
   2.  An approximate repeat = a SNP (with no filtration: 9 billion SNPs found in the human genome!)

•  Avoid this bias:
   1.  Use only k-mers with a minimal support (= coverage)
   2.  Avoid approximate repeats:
       •  Avoid k-mers with too large a support (e.g. 1000)
       •  Extend found SNPs

>SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67
…acgatcgacgacgctaacacatataccTAATGTTAAATGACGAGTTAATGGGTGCAGCACATGAACATGGCACATGTAgagagdacatatgtaacacgcagcat…
>SNP_lower_path_16|C1_2|C2_5|Q1_70|Q2_68
…acgatcgacgacgctaacacatataccTAATGTTAAATGACGAGTTAATGGGGGCAGCACATGAACATGGCACATGTAgagagdacatatgtaacacgcagcat…
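The support filter described above amounts to counting k-mers and keeping those whose count lies between two thresholds. A minimal sketch, with hypothetical function names and an in-memory counter that would not scale to real read sets; the upper bound 1000 comes from the slide, while the lower bound is an arbitrary value chosen for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Support of a k-mer = number of times it occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_support(counts, min_support, max_support):
    """Drop likely sequencing errors (too rare) and likely repeats (too frequent)."""
    return {kmer for kmer, c in counts.items() if min_support <= c <= max_support}


reads = ["ACGT", "ACGT", "ACGA"]  # toy reads, not from the slides
counts = count_kmers(reads, k=3)
print(sorted(filter_by_support(counts, min_support=2, max_support=1000)))  # ['ACG', 'CGT']
```

Here CGA occurs only once, so it is treated as a probable sequencing error and discarded.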


Kissnp2: in a nutshell

•  From reads to SNPs
   o  Any number of read sets
   o  For each SNP found:
      •  The average coverage per set
      •  The average quality of the SNP position
      •  Possibility to extend SNPs (left and right unitigs)

•  First results (publication in progress)
   o  Duck: 350 million reads: 4 GB, 3h30, 527,208 SNPs (on Genotoul)
   o  Duck: 3,500 million reads: 4 GB, 5.5 days, 4,994,126 SNPs (on Genotoul)
   o  Tick: 576 SNPs found and selected w.r.t. coverage → 575 true positives.

Use / extricate polymorphism

•  Idea:
   o  Polymorphism
   o  Polyploidy
   o  Repeats
   o  …

… are the source of both:
   o  Huge volume of information
AND
   o  Huge mess in the assembly process, and in any NGS process.

Use / extricate polymorphism: a trail

•  From a graph of UNITIGS
   o  Produced by assemblers before contiging

•  Map sets of reads…

[Figure: unitig graph with labels A, T, C, G and mapped reads (set1, set2)]


•  Compute
   o  Coverage per set
   o  Quality per set


•  « Phase » polymorphism
   o  Better assembly

Ultimate gap filler

[Figure: reference (Ref.) and reads]


1.  Detection of gaps in the reference

2.  Fill gaps (using pairs)

Take-home message

•  Assemble with no memory:
   o  Minia
•  Find polymorphism (SNPs) with no reference:
   o  Kissnp2
   o  (kissplice)
•  Use and extricate polymorphism
•  Detect and close gaps (reads + ref):
   o  Ultimate Gap Filler
•  More information and links: http://team.inria.fr/genscale