
Page 1: Phred/Phrap/Consed Analysis A User’s View Arthur Gruber International Training Course on Bioinformatics Applied to Genomic Studies Rio de Janeiro 2001

Phred/Phrap/Consed Analysis: A User’s View

Arthur Gruber

International Training Course on Bioinformatics Applied to Genomic Studies

Rio de Janeiro 2001

Faculty of Veterinary Medicine and Zootechny

University of São Paulo, BRAZIL

Page 2: What is Phred/Phrap/Consed?

Phred/Phrap/Consed is a software package, distributed worldwide, for:

a. Reading trace files (chromatograms);

b. Assigning a quality (confidence) value to each individual base;

c. Identifying and masking vector and repeat sequences;

d. Sequence assembly;

e. Assembly visualization and editing;

f. Automatic finishing.

Page 3: Why assemble?

Flowchart (top to bottom):

1. Whole genome or BAC/cosmid clone
2. DNA fragmentation: sonic disruption, nebulization
3. Small fragments: 1.0 - 2.0 kb
4. Clone library: pUC18
5. DNA sequencing: random clones
6. Partial assembly: contigs
7. Finishing: quality, coverage of both strands, gap filling
8. Whole genome or BAC/cosmid clone: final consensus sequence

• Current DNA sequencing methods generate reads of 500-700 bp - the resolution limit of electrophoresis

• Whole genomes or large clones need to be fragmented - clone library

• Short fragments are randomly sequenced (shotgun approach) - reads are assembled to form the final consensus sequence

Page 4: How to deal with the enormous number of reads generated by high-throughput DNA sequencers?

Sanger Centre

Page 5: Phred

Genome Research 8: 175-185, 1998

Page 6: Phred

Genome Research 8: 186-194, 1998

Page 7: Phred

Phred is a program that performs several tasks:

a. Reads trace files - compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.

b. Calls bases - attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.

Page 8: Phred

c. Assigns quality values to the bases - a “Phred value” based on an error rate estimate calculated for each individual base.

d. Creates output files - base calls and quality values are written to output files.

Page 9: Trace File

High quality region - no ambiguities (Ns)

Page 10: Trace File

Medium quality region - some ambiguities (Ns)

Page 11: Trace File

Poor quality region - low confidence

Page 12: Phred value formula

$q = -10 \times \log_{10}(p)$

where
$q$ - quality value
$p$ - estimated error probability for a base call

Examples:

$q = 20$ means $p = 10^{-2}$ (1 error in 100 bases)
$q = 40$ means $p = 10^{-4}$ (1 error in 10,000 bases)
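The formula on this slide can be checked numerically. The following is a minimal Python sketch of the conversion in both directions; the function names are illustrative and are not part of Phred itself.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Quality value q = -10 * log10(p) for an estimated error probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Inverse of the formula above: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # The two examples from the slide: q = 20 and q = 40.
    for q in (20, 40):
        p = phred_to_error_prob(q)
        print(f"q = {q}: p = {p:g} (about 1 error in {round(1 / p)} bases)")
```

Because the scale is logarithmic, every increase of 10 in the quality value corresponds to a tenfold drop in the estimated error probability.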

Page 13: The structure of a phd file

BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT

CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA

END_SEQUENCE
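Each line between BEGIN_DNA and END_DNA holds three fields: the called base, its quality value and the peak position in the trace. A minimal Python sketch of a reader for this layout is given below. It assumes only the structure visible on the slide, the function name `parse_phd` is hypothetical, and the sample text is a shortened copy of the example above.

```python
from typing import List, Tuple

BaseCall = Tuple[str, int, int]  # (base, quality value, peak position)


def parse_phd(text: str) -> Tuple[str, List[BaseCall]]:
    """Return the read name and the base calls found in phd-formatted text."""
    name = ""
    calls: List[BaseCall] = []
    in_dna = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else ""
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()
            if len(fields) >= 3:  # skip anything that is not a base-call line
                calls.append((fields[0], int(fields[1]), int(fields[2])))
    return name, calls


# Shortened sample taken from the slide; a real file has one line per base.
EXAMPLE = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

if __name__ == "__main__":
    read_name, base_calls = parse_phd(EXAMPLE)
    print(read_name, base_calls)
```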

Page 14: Phrap

Phragment Assembly Program or… Phil’s Revised Assembly Program

Phrap is a program for assembling shotgun DNA sequence data

Key Features:

a. Uses the entire read content - no need for trimming.

b. User-supplied (e.g. Repbase) + internally computed data - better accuracy of assembly in the presence of repeats.

c. Contig sequence is a mosaic of the highest quality parts of the reads - it’s not a consensus!

Page 15: Phrap

Key Features (continued):

d. Provides extensive information about the assembly - contained in the phrap.out, *.ace and *.screen.contigs.qual files.

e. Handles very large datasets - hundreds of thousands of reads are easily manipulated.

f. Generates output files - these contain some important data and enable visualization by other programs.
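Feature c (a mosaic of the highest quality parts of the reads, rather than a consensus) can be illustrated with a toy example. The sketch below is a deliberate simplification and not Phrap's algorithm: Phrap selects stretches of reads, whereas this code chooses independently at every aligned column. The column data and function names are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

Column = List[Tuple[str, int]]  # (base, quality value) for each read covering a position


def majority_consensus(columns: List[Column]) -> str:
    """Most frequent base at each position, ignoring quality values."""
    return "".join(
        Counter(base for base, _ in col).most_common(1)[0][0] for col in columns
    )


def highest_quality_mosaic(columns: List[Column]) -> str:
    """Base taken from the read with the highest quality value at each position."""
    return "".join(max(col, key=lambda bq: bq[1])[0] for col in columns)


# Invented alignment: at the second position two low-quality reads call "c",
# while a single high-quality read calls "g".
COLUMNS: List[Column] = [
    [("a", 30), ("a", 25), ("a", 40)],
    [("c", 8), ("c", 6), ("g", 44)],
    [("t", 35), ("t", 12)],
]

if __name__ == "__main__":
    print("majority consensus:", majority_consensus(COLUMNS))
    print("quality mosaic:    ", highest_quality_mosaic(COLUMNS))
```

In this toy case the two approaches disagree only at the second position, where the quality-driven choice follows the single confident read instead of the two unreliable ones.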

Page 16: Phrap output files

• *.contigs - fasta file containing the contigs
  - Contigs with more than one read
  - Singletons (single reads with a match to some other contig but that couldn’t be merged consistently with it)

• *.singlets - fasta file of the singlet reads
  - Reads with no match to any other read

• *.ace - allows for viewing the assembly using Consed

• *.view - required for viewing the assembly using Phrapview

Page 17: Consed

Genome Research 8: 195-202, 1998

Page 18: Consed

Consed is a program for viewing and editing assemblies produced by Phrap

Key Features:

a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.

b. Trace file viewer - single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.

Page 19: Consed

Key Features (continued):

c. Navigation - identifies and lists regions which are below a given quality threshold, contain high quality discrepancies, have single-strand coverage, etc.

d. Autofinish - automatic set of functions for: gap closure, improvement of sequence quality, determination of the relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
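The first navigation criterion, listing regions below a given quality threshold, can be sketched in a few lines of Python. This is only an illustration of the idea and not Consed's implementation; the threshold of 20 is an assumed value, the function name is hypothetical, and the sample values are simply the quality column of one stretch of the phd example from page 13.

```python
from typing import List, Sequence, Tuple


def low_quality_regions(
    quals: Sequence[int], threshold: int = 20, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where quality < threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # a low-quality run begins here
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions


# Quality values of the stretch at trace positions 2221-2482 on page 13.
QUALS = [24, 24, 22, 27, 25, 19, 12, 19, 12, 15, 19,
         23, 33, 36, 44, 44, 39, 39, 34, 35, 34]

if __name__ == "__main__":
    print(low_quality_regions(QUALS, threshold=20))
```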

Page 29: Phred/Phrap/Consed Pipeline

Directories: Chromat_dir, Phd_dir, Edit_dir

1. Input: chromatogram files
2. Quality (confidence) value assignment: Phred
   phd files - *.phd
3. Conversion, phd to fasta: phd2fasta.pl
   nucleotide sequences - seqs_fasta
   quality values - seqs_fasta.screen.qual
4. Vector screening and masking: Cross_Match (local alignment program) x vector.seq
   screened/masked file - seqs_fasta.screen
5. Assembly: Phrap
   assembled contigs - seqs_fasta.screen.contigs
   assembly file - seqs_fasta.screen.ace#
6. Assembly viewing/editing: Consed
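The masking part of the vector screening step can be illustrated in isolation. In the pipeline, Cross_Match locates the vector matches by local alignment; the Python sketch below assumes the matching intervals are already known and only shows the masking itself. The use of X as the masking character follows the usual convention for screened files, while the function name and the example read are invented.

```python
from typing import Iterable, Tuple


def mask_regions(
    seq: str, regions: Iterable[Tuple[int, int]], mask_char: str = "X"
) -> str:
    """Replace every half-open (start, end) interval of seq with mask_char."""
    chars = list(seq)
    for start, end in regions:
        # Clamp the interval so that out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    read = "ACGTACGTAC"     # invented read
    vector_hits = [(0, 4)]  # assumed result of a vector search
    print(mask_regions(read, vector_hits))
```

Masking keeps the read length and base coordinates unchanged, so quality values and trace positions still line up with the screened sequence.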

Page 30: Finishing Problems

Finishing can be a tedious and difficult task due to:

DNA sequencing problems

a. High GC content - genomes with a high GC content are more prone to artifacts such as compressions, sudden drops and bad quality regions. Try using Dye Primer instead of Dye Terminator, changing the chemistry, adding DMSO, increasing the annealing temperature, using deaza-dGTP instead of dGTP, etc.

b. Palindromic regions - lead to strong secondary structures that cause sudden drops. Try using deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.

c. Homopolymeric regions - can reduce DNA synthesis efficiency for some chemistries. Try using Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).
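Two of the problems listed above, biased base composition and homopolymeric regions, can be flagged directly from a sequence. The Python sketch below is a minimal illustration; the function names and the example sequences are invented, and no particular cutoff for "high" GC content or "long" homopolymers is implied.

```python
from typing import Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "GC" for b in counted) / len(counted)


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of a single repeated base."""
    best_base, best_len = "", 0
    run_base, run_len = "", 0
    for b in seq.upper():
        if b == run_base:
            run_len += 1
        else:
            run_base, run_len = b, 1
        if run_len > best_len:
            best_base, best_len = run_base, run_len
    return best_base, best_len


if __name__ == "__main__":
    print(gc_fraction("GGCCAT"))
    print(longest_homopolymer("ACGTTTTTGA"))
```

The AT content of a sequence is simply one minus the value returned by `gc_fraction`, so the same function also serves for AT-rich genomes.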

Page 31: Finishing Problems

Finishing can be a tedious and difficult task due to:

DNA assembly problems

a. High content of repeats - highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with Cross_Match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.

b. High AT content - some highly biased genomes (e.g. Plasmodium falciparum, plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.