Computational Biology, Part 6
Sequence File Formats and Sequence Assembly
Robert F. Murphy
Copyright 1996-2001. All rights reserved.


Page 1: Computational Biology, Part 6 Sequence File Formats and Sequence Assembly Robert F. Murphy Copyright  1996-2001. All rights reserved

Computational Biology, Part 6
Sequence File Formats and Sequence Assembly

Robert F. Murphy

Copyright 1996-2001. All rights reserved.

Page 2:

Sequence file formats

Two characteristics of file formats
  text or binary
  minimal or annotated

Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)

Binary files are usually readable only by the program that created them (e.g., MacVector)

Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)

Page 3:

Sequence file formats

ASCII (text)
  Minimal
    Line, Plain Text
    Staden
    FASTA
    Bionet (allows comments)
  Annotated
    GenBank
    GCG

Binary (usually annotated)
  MacVector

Page 4:

Examples of ASCII sequence file formats

Line (MacVector), Plain Text (AssemblyLIGN)

CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Page 5:

Examples of ASCII sequence file formats

Fasta (Entrez)

>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
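The FASTA layout shown on this slide (one header line starting with ">", then sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not the parser used by Entrez or MacVector; the function name `parse_fasta` is hypothetical, and the example text uses only the first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
"""
header, sequence = parse_fasta(example)[0]
print(header.split()[0])  # identifier part of the header line
print(len(sequence))      # number of bases read
```

Because the header is kept as a single string, any annotation beyond the identifier and description is lost, which is why FASTA counts as a minimal format.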

Page 6:

Examples of ASCII sequence file formats

GenBank (Entrez, MacVector)

LOCUS       RATOBESE      539 bp ss-mRNA            ROD       23-SEP-1995
DEFINITION  Rat mRNA for obese.
ACCESSION   D49653
KEYWORDS    .
SOURCE      Rattus norvegicus (strain OLETF, LETO and Zucker, ) differentiated
            adipose cDNA to mRNA.
  ORGANISM  Rattus norvegicus
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus.
REFERENCE   1  (bases 1 to 539)
  AUTHORS   Murakami,T. and Shima,K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats
  JOURNAL   Biochem. Biophys. Res. Commun. 209, 944-952 (1995)
  STANDARD  full automatic
COMMENT     Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone:  +81-886-33-7184
            Fax:    +81-886-31-9495.

[continued]

Page 7:

Examples of ASCII sequence file formats

GenBank [continued]

            NCBI gi: 995614
FEATURES             Location/Qualifiers
     source          1..539
                     /organism="Rattus norvegicus"
                     /strain="OLETF, LETO and Zucker"
                     /dev_stage="differentiated"
                     /sequenced_mol="cDNA to mRNA"
                     /tissue_type="adipose"
     CDS             30..533
                     /partial
                     /note="NCBI gi: 995615"
                     /codon_start=1
                     /product="obese"
                     /translation="MCWRPLCRFLWLWSYLSYVQAVPIHKVQDDTKTLIKTIVTRIND
                     ISHTQSVSARQRVTGLDFIPGLHPILSLSKMDQTLAVYQQILTSLPSQNVLQIAHDLE
                     NLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQ
                     LDLSPEC"
BASE COUNT      121 a    167 c    133 g    118 t
ORIGIN
        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt
       61 ggctttggtc ctatctgtcc tatgttcaag ctgtgcctat ccacaaagtc caggatgaca
      121 ccaaaaccct catcaagacc attgtcacca ggatcaatga catttcacac acgcagtcgg
      181 tatccgccag gcagagggtc accggtttgg acttcattcc cgggcttcac cccattctga
      241 gtttgtccaa gatggaccag accctggcag tctatcaaca gatcctcacc agcttgcctt
      301 cccaaaacgt gctgcagata gctcatgacc tggagaacct gcgagacctc ctccatctgc
      361 tggccttctc caagagctgc tccctgccgc agacccgtgg cctgcagaag ccagagagcc
      421 tggatggcgt cctggaagcc tcgctctact ccacagaggt ggtggctctg agcaggctgc
      481 agggctctct gcaggacatt cttcaacagt tggaccttag ccctgaatgc tgaggtttc
//
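The ORIGIN block of a GenBank record carries the same bases as the Line/Plain Text format, with position numbers and spaces added for readability. As a rough sketch of how one might recover the plain sequence and recompute a BASE COUNT line, the code below strips everything except letters; the function name `origin_to_plain` is hypothetical, and only the first two ORIGIN lines of the record are used.

```python
from collections import Counter


def origin_to_plain(origin_lines):
    """Strip position numbers and spaces from GenBank ORIGIN lines."""
    bases = []
    for line in origin_lines:
        if line.strip() == "//":
            break  # "//" marks the end of the record
        # Keeping letters only drops the leading number and the blank separators.
        bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases).upper()


origin = [
    "        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt",
    "       61 ggctttggtc ctatctgtcc tatgttcaag ctgtgcctat ccacaaagtc caggatgaca",
    "//",
]
plain = origin_to_plain(origin)
counts = Counter(plain)  # per-base tallies, comparable to a BASE COUNT line
print(plain[:10], len(plain))
print(sorted(counts.items()))
```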

Page 8:

Examples of ASCII sequence file formats

GCG (MacVector, GCG)

LOCUS       RATOBESE.G    539 BP SS-RNA             ENTERED   09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION   -
KEYWORDS    -
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
            Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
            Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference:
            DDBJ RATOBESE
            Accession: D49653
            ------------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495

[continued]

Page 9:

Examples of ASCII sequence file formats

GCG [continued]

FEATURES    From  To/Span    Description
  pept        30      533    obese
  ????         1      539    source; /organism=Rattus norvegicus;
                             /strain=OLETF, LETO and Zucker;
                             /dev_stage=differentiated; /sequenced_mol=cDNA
                             to mRNA; /tissue_type=adipose
BASE COUNT      121 A    167 C    133 G    118 T      0 OTHER
ORIGIN      ?
  RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
        1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//

Page 10:

Sequence file format tips

When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA

When retrieving from a database or exchanging between programs, use an annotated text format such as GCG or GenBank

When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)
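For the first tip, producing FASTA text from a sequence held in a program amounts to writing a ">" header line and wrapping the bases at a fixed width. The sketch below assumes a 70-column width (the width of the example lines earlier in these slides); the function name `format_fasta` and the toy sequence are hypothetical.

```python
def format_fasta(identifier, description, sequence, width=70):
    """Return FASTA text: one header line, then the sequence wrapped at `width`."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Toy 80-base sequence, used only to show the line wrapping.
text = format_fasta("demo", "toy sequence", "ACGT" * 20)
print(text)
```

The result is plain text, so it can be pasted into an email message or a web form without any program-specific encoding.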

Page 11:

Sequence assembly

Goal: Assemble pieces of sequence into a single, continuous sequence

An early commercial system for sequence assembly was the GCG GelOverlap/GelAssemble suite (VMS, Unix)

We will use AssemblyLIGN (Macintosh), a companion to MacVector

Page 12:

Sequence assembly: Terms

project
  collection of fragments, templates and contigs

fragments
  pieces of sequence entered by user or read from files

contigs
  lists of aligned fragments generated (normally) by program

Page 13:

Sequence assembly: Terms

templates
  any sequence to be searched for
    can be entered by user
    can be read from system files
  most common example is sequence of vector
  sequences in templates are NOT included in assembled sequences unless they are ALSO present in a fragment (and have not been removed)

Page 14:

Sequence assembly: File organization

AssemblyLIGN keeps all information (including sequences) in a single project document

GCG keeps all information in a directory (and its subdirectories), with each fragment in a separate file

Page 15:

Sequence assembly: Steps

Data entry/import (fragments, templates)
Removal of unwanted sequence
Automated creation of contigs
Manual editing/confirmation
Export

Page 16:

Automated creation of contigs

Steps
1. Finding pairwise overlaps
2. Resolving overlaps
3. Improving alignment
4. Final assembly and consensus generation

Page 17:

1. Finding pairwise overlaps

Compare each fragment (and its complement) with each other fragment

Generate list of regions of similarity that meet criteria below

Parameters
  minimum overlap length (default 20)
  stringency (% of bases that must match, default 70)
  minimum repeat length (default 30)
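The comparison described on this slide can be sketched as a brute-force search: for every ordered pair of fragments, and for both the second fragment and its reverse complement, look for the longest end-to-start overlap that meets the minimum length and stringency. This is an illustrative sketch only, not AssemblyLIGN's algorithm. It does not allow gaps, it ignores the minimum repeat length parameter (whose exact meaning the slide does not define), and all function names and the toy fragments are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(a, b, min_overlap=20, stringency=70):
    """Longest overlap between the end of `a` and the start of `b`.

    Returns (length, percent_identity), or None if no overlap of at least
    `min_overlap` bases has at least `stringency` percent matching bases.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        matches = sum(x == y for x, y in zip(a[-n:], b[:n]))
        if matches * 100 >= stringency * n:
            return n, 100.0 * matches / n
    return None


def find_pairwise_overlaps(fragments, min_overlap=20, stringency=70):
    """fragments: dict of name -> sequence. Returns a list of
    (left, right, orientation, length, percent_identity) tuples."""
    hits = []
    for left, a in fragments.items():
        for right, b in fragments.items():
            if left == right:
                continue
            # "+" compares b as given, "-" compares its reverse complement.
            for orientation, candidate in (("+", b), ("-", reverse_complement(b))):
                found = best_overlap(a, candidate, min_overlap, stringency)
                if found:
                    hits.append((left, right, orientation) + found)
    return hits


# Toy fragments sharing a 24-base region (hypothetical data).
core = "GATTACAGGCTTAACCGGTTAAGC"
fragments = {"frag1": "TTTTCCCCAAAA" + core, "frag2": core + "GGGGTTTTCCCC"}
for hit in find_pairwise_overlaps(fragments):
    print(hit)
```

Lowering `min_overlap` or `stringency` admits more candidate overlaps, at the cost of more chance matches.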

Page 18:

1. Finding pairwise overlaps

Each fragment may appear in more than one overlap

[Figure: overlapping fragment pairs 1-8, 3-5, 3-6, 8-2, 7-9, 5-4]

Page 19:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 1-8, 3-5, 3-6, 8-2, 7-9, 5-4]

Page 20:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 1-8, 3-5, 3-6, 8-2, 7-9, 5-4; combined piece 1-8-2]

Page 21:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 3-5, 3-6, 7-9, 5-4; combined piece 1-8-2]

Page 22:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 3-5, 3-6, 7-9, 5-4; combined pieces 1-8-2 and 5-3-6]

Page 23:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 7-9, 5-4; combined pieces 1-8-2 and 5-3-6]

Page 24:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pairs 7-9, 5-4; combined pieces 1-8-2 and 5-3-6, with fragment 4 added to 5-3-6]

Page 25:

2. Resolving overlaps

Build larger pieces by combining overlaps

[Figure: overlapping fragment pair 7-9; combined pieces 1-8-2 and 5-3-6, with fragment 4 added to 5-3-6]
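One way to picture "combining overlaps" is to treat each pairwise overlap as a link between two fragments and collect every set of linked fragments into one piece. The sketch below does this with a small union-find structure. It only decides which fragments belong together, not their order or alignment, and it is an illustration rather than the method AssemblyLIGN uses. The pair list is assumed from the fragment numbers in the slide figures (1-8, 3-5, 3-6, 8-2, 7-9, 5-4); the function name `group_fragments` is hypothetical.

```python
def group_fragments(overlap_pairs):
    """Group fragment labels that are connected through pairwise overlaps."""
    parent = {}

    def find(x):
        # Return the representative label of the group containing x.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we walk it
            x = parent[x]
        return x

    for a, b in overlap_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for label in parent:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(members) for members in groups.values())


pairs = [(1, 8), (3, 5), (3, 6), (8, 2), (7, 9), (5, 4)]
print(group_fragments(pairs))
```

With these pairs the fragments fall into three groups: [1, 2, 8], [3, 4, 5, 6] and [7, 9].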

Page 26:

3. Improving alignment

Introduce gaps in sequences if they will improve overlaps

alignment parameters
  gap creation penalty (default 2.0)
  gap extension penalty (default 0.1)
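To see how the two penalties interact, here is a small sketch of a commonly used affine gap cost, in which opening a gap costs the creation penalty once and every gapped position adds the extension penalty. The slides do not state the exact formula AssemblyLIGN applies (for example, whether the first gapped position is charged the extension penalty), so the formula and the function name `gap_penalty` are assumptions.

```python
def gap_penalty(gap_length, creation=2.0, extension=0.1):
    """Affine gap cost: one creation charge plus an extension charge per position.

    Assumed form: creation + extension * gap_length (0 for no gap).
    """
    if gap_length <= 0:
        return 0.0
    return creation + extension * gap_length


for length in (1, 5, 10):
    print(length, round(gap_penalty(length), 2))
```

With the default values, one long gap is much cheaper than several short ones, so the alignment tends to place a few contiguous gaps rather than many scattered ones.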

Page 27:

4. Final assembly and consensus generation

Mark fragments that are now part of a contig (no longer appear by themselves)

Form consensus for each contig by “reading” along aligned sequences and converting to IUB codes by consensus rules

consensus parameter
  base designation threshold (% of all bases at a given position that must be the same for that base to be assigned to the consensus; otherwise, less specific IUB code used; default 80%)
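The base designation threshold can be illustrated with a short sketch: at each aligned position, the most frequent base is used if it reaches the threshold; otherwise the position is given the IUB ambiguity code covering the bases observed there. Two points are assumptions rather than documented AssemblyLIGN behavior: gap characters are ignored when counting, and the ambiguity code is chosen from all bases seen at the position. The function name `consensus` is hypothetical; the IUB code table itself is the standard one.

```python
from collections import Counter

# Standard IUB ambiguity codes, keyed by the set of bases they stand for.
IUB_CODES = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("ACT"): "H", frozenset("CGT"): "B", frozenset("ACG"): "V",
    frozenset("AGT"): "D", frozenset("ACGT"): "N",
}


def consensus(aligned_fragments, threshold=80):
    """Consensus of equal-length aligned sequences; threshold is a percentage."""
    calls = []
    for column in zip(*aligned_fragments):
        counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
        if not counts:
            calls.append("-")  # only gaps at this position
            continue
        base, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if top * 100 >= threshold * total:
            calls.append(base)  # specific base meets the threshold
        else:
            calls.append(IUB_CODES[frozenset(counts)])  # less specific code
    return "".join(calls)


print(consensus(["ACGT", "ACGT", "ACGT", "ACGT", "ACGA"]))  # 4 of 5 agree at the last position
print(consensus(["ACGT", "ACGA"]))                          # 1 of 2 agree at the last position
```

In the first call the last position reaches 80% and is called T; in the second it does not, so the code W (A or T) is used. Lowering the threshold turns more ambiguous positions into specific base calls.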

Page 28:

Manual consensus editing

Crucial to verify alignment and resolve ambiguities (e.g., sequencing errors)

Program keeps an “edit history” that tracks all changes made to the original sequences: essential to be able to “retrace your steps” from original sequencing gels (e.g., in case of conflicts with sequences in database)

Page 29:

AssemblyLIGN Tutorial

Open “demo π” project

Page 30:

AssemblyLIGN Tutorial

Goal: Eliminate vector sequence

Double-click vector

Select all fragments

Drop on vector

Page 31:

AssemblyLIGN Tutorial

“vector Alignments” window shows that frag8 contains vector sequence

Click on ‘shadow’ to edit frag8 and display highlighted vector sequence

Page 32:

AssemblyLIGN Tutorial

Highlighted sequence doesn’t look like sequence in “vector” window

Page 33:

AssemblyLIGN Tutorial

Click on “vector” window

Choose Select All (Edit Menu)

Choose Reverse & Complement (Edit Menu)

Now highlighted sequence in frag8 matches that in “vector” window

Page 34:

AssemblyLIGN Tutorial

Click on “frag8” window

Delete highlighted sequence

Then close “frag8” window
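The tutorial steps above (vector found in frag8 only after reverse-complementing, then deleted) can be mimicked in a few lines. This is a simplified sketch: it looks for an exact copy of the vector on either strand, whereas an assembly program would tolerate mismatches. The function names and the toy vector and fragment sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence (case preserved)."""
    return seq.translate(COMPLEMENT)[::-1]


def remove_vector(fragment, vector):
    """Delete the first exact copy of `vector` (either strand) from `fragment`.

    Returns (trimmed_fragment, orientation); orientation is "+" if the vector
    was found as given, "-" if found as its reverse complement, else None.
    """
    for orientation, probe in (("+", vector), ("-", reverse_complement(vector))):
        start = fragment.find(probe)
        if start != -1:
            return fragment[:start] + fragment[start + len(probe):], orientation
    return fragment, None


# Toy data: the fragment carries the vector on the opposite strand.
vector = "GGATCCTCTAGA"
fragment = "ACGTACGT" + reverse_complement(vector) + "TTGCA"
print(remove_vector(fragment, vector))
```

Here the vector is only found in the "-" orientation, which corresponds to the situation in the tutorial where the highlighted sequence matches the vector window only after Reverse & Complement.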

Page 35:

AssemblyLIGN Tutorial

Choose Select All (Edit Menu)

Choose Assemble (Project Menu)

Page 36:

AssemblyLIGN Tutorial

All but one fragment (frag14) combined into Untitled Contig 1

Page 37:

AssemblyLIGN Tutorial

Goal: Try relaxing assembly parameters to merge frag14 into the contig

Choose Assembly Options (Project Menu)

Reduce minimum overlap length to 5

Page 38:

AssemblyLIGN Tutorial

Now all fragments are merged

Double-click Untitled Contig 2 to see alignment and consensus

Page 39:

AssemblyLIGN Tutorial

Map shows gross alignments of fragments

Click on Magnifying glass ‘A’ and select region of map to view

Page 40:

AssemblyLIGN Tutorial

Positions that do not match at/above the Base Designation Threshold are highlighted in the consensus or the original sequences

Page 41:

Can decrease the Base Designation Threshold to reduce ‘uncalled’ bases

Page 42:

Reading for Next Class

Baxevanis & Ouellette, Chapter 7

Sequence Analysis Primer, pp. 90-124

“Similarity versus Homology” and “Dot Matrix Methods” (on web page)

(03-510) Durbin et al, pp. 12-17

Page 43:

Summary, Part 6

A variety of sequence file formats are currently in use. Files can be either text or binary, and can consist only of sequence or also include annotation information.

The choice of file format is dictated by the requirements of the analysis desired and the subset of formats compatible between the “writing” and “reading” program.

Page 44:

Summary, Part 6

Sequence assembly requires the ability to compare sequences to find regions of high homology.

Pieces of sequence are assembled by “connecting” them via regions of overlap.

A consensus sequence can be generated from the connected pieces (using user-specified rules to resolve ambiguity).

Page 45:

Sequence comparisons using BLAST server Web Page

Main BLAST web page URL
  http://www.ncbi.nlm.nih.gov/BLAST/

Links to Basic and Advanced Search Pages

Two main BLAST programs
  blastn - compares user nucleotide sequence to nucleotide sequences in database
  blastp - compares user peptide sequence to peptide sequences in database
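Since blastn expects a nucleotide query and blastp a peptide query, a script that prepares queries has to decide which kind of sequence it holds. The sketch below uses a simple heuristic that is an assumption, not part of BLAST itself: a sequence made up only of IUB nucleotide codes is treated as nucleotide, anything else as peptide. The function name `choose_blast_program` is hypothetical, and a short peptide that happens to use only those letters would be misclassified.

```python
# IUB nucleotide codes (bases plus ambiguity codes); U is included for RNA.
NUCLEOTIDE_CODES = set("ACGTUNRYMKSWHBVD")


def choose_blast_program(sequence):
    """Guess whether a query should go to blastn or blastp."""
    letters = {ch for ch in sequence.upper() if ch.isalpha()}
    if not letters:
        raise ValueError("empty sequence")
    # Any letter outside the nucleotide codes suggests a peptide sequence.
    return "blastn" if letters <= NUCLEOTIDE_CODES else "blastp"


print(choose_blast_program("ccaagaagaa gaagacccca"))  # start of the rat obese mRNA
print(choose_blast_program("MCWRPLCRFLWLWSYLSYVQ"))   # start of its translation
```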