Computational Biology, Part 6
Sequence File Formats and Sequence Assembly
Robert F. Murphy
Copyright 1996-2001. All rights reserved.
Sequence file formats
Two characteristics of file formats
  text or binary
  minimal or annotated
Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
Binary files are usually readable only by the program that created them (e.g., MacVector)
Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)
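Because text files rely on the IUB codes, a quick check that a sequence contains only those codes can catch pasted-in debris. The following is a minimal sketch, not part of the original slides; the table lists the standard IUB/IUPAC nucleotide codes, and the function name `invalid_positions` is a hypothetical helper introduced for illustration.

```python
# Standard IUB/IUPAC nucleotide codes: each letter stands for a set of bases.
# (U is treated as T here, an assumption made for simplicity.)
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def invalid_positions(sequence):
    """Return (index, character) pairs that are not IUB nucleotide codes."""
    return [(i, ch) for i, ch in enumerate(sequence)
            if ch.upper() not in IUB_CODES]


print(invalid_positions("CCAAGRAGNAG"))  # []
print(invalid_positions("CCAXGA-G"))     # [(3, 'X'), (6, '-')]
```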
Sequence file formats
ASCII (text)
  Minimal
    Line, Plain Text
    Staden
    FASTA
    Bionet (allows comments)
  Annotated
    GenBank
    GCG
Binary (usually annotated)
  MacVector
Examples of ASCII sequence file formats
Line (MacVector), Plain Text (AssemblyLIGN)
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
Examples of ASCII sequence file formats
Fasta (Entrez)
>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
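The FASTA layout above (one `>` header line followed by sequence lines) is simple enough to read with a few lines of code. This is a minimal sketch, not the parser used by any program named in the slides; `parse_fasta` is a hypothetical helper, and only the first two sequence lines of the rat obese record are included in the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
"""
header, seq = parse_fasta(example)[0]
print(header.split()[0])  # gi|995614|dbj|D49653|RATOBESE
print(seq[:20])           # CCAAGAAGAAGAAGACCCCA
```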
Examples of ASCII sequence file formats
GenBank (Entrez, MacVector)
LOCUS       RATOBESE      539 bp ss-mRNA            ROD       23-SEP-1995
DEFINITION  Rat mRNA for obese.
ACCESSION   D49653
KEYWORDS    .
SOURCE      Rattus norvegicus (strain OLETF, LETO and Zucker, ) differentiated
            adipose cDNA to mRNA.
  ORGANISM  Rattus norvegicus
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus.
REFERENCE   1  (bases 1 to 539)
  AUTHORS   Murakami,T. and Shima,K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats
  JOURNAL   Biochem. Biophys. Res. Commun. 209, 944-952 (1995)
  STANDARD  full automatic
COMMENT     Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495.
NCBI gi: 995614
FEATURES             Location/Qualifiers
     source          1..539
                     /organism="Rattus norvegicus"
                     /strain="OLETF, LETO and Zucker"
                     /dev_stage="differentiated"
                     /sequenced_mol="cDNA to mRNA"
                     /tissue_type="adipose"
     CDS             30..533
                     /partial
                     /note="NCBI gi: 995615"
                     /codon_start=1
                     /product="obese"
                     /translation="MCWRPLCRFLWLWSYLSYVQAVPIHKVQDDTKTLIKTIVTRIND
                     ISHTQSVSARQRVTGLDFIPGLHPILSLSKMDQTLAVYQQILTSLPSQNVLQIAHDLE
                     NLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQ
                     LDLSPEC"
BASE COUNT      121 a    167 c    133 g    118 t
ORIGIN
        1 ccaagaagaa gaagacccca gcgaggaaaa tgtgctggag acccctgtgc cggttcctgt
       61 ggctttggtc ctatctgtcc tatgttcaag ctgtgcctat ccacaaagtc caggatgaca
      121 ccaaaaccct catcaagacc attgtcacca ggatcaatga catttcacac acgcagtcgg
      181 tatccgccag gcagagggtc accggtttgg acttcattcc cgggcttcac cccattctga
      241 gtttgtccaa gatggaccag accctggcag tctatcaaca gatcctcacc agcttgcctt
      301 cccaaaacgt gctgcagata gctcatgacc tggagaacct gcgagacctc ctccatctgc
      361 tggccttctc caagagctgc tccctgccgc agacccgtgg cctgcagaag ccagagagcc
      421 tggatggcgt cctggaagcc tcgctctact ccacagaggt ggtggctctg agcaggctgc
      481 agggctctct gcaggacatt cttcaacagt tggaccttag ccctgaatgc tgaggtttc
//
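To illustrate how an annotated record relates to a minimal one, here is a rough sketch that pulls the sequence out of the ORIGIN section of a GenBank-style record and rewrites it as FASTA. It is an assumption-laden illustration, not a full GenBank parser: `genbank_to_fasta` is a hypothetical helper, it reads only the first line of DEFINITION, and the example record is abbreviated to 20 bases.

```python
import re


def genbank_to_fasta(record, line_width=70):
    """Convert one GenBank flat-file record to FASTA text (sequence only)."""
    locus = definition = ""
    seq_parts = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines look like "  61 ggctttggtc ctatctgtcc ..."
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("LOCUS"):
            locus = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    seq = "".join(seq_parts).upper()
    body = "\n".join(seq[i:i + line_width]
                     for i in range(0, len(seq), line_width))
    return f">{locus} {definition}\n{body}\n"


# Abbreviated record for illustration; the real entry has 539 bases.
record = """LOCUS       RATOBESE      539 bp ss-mRNA  ROD  23-SEP-1995
DEFINITION  Rat mRNA for obese.
ORIGIN
        1 ccaagaagaa gaagacccca
//
"""
print(genbank_to_fasta(record))
```

All annotation (features, references, comments) is lost in the conversion, which is the trade-off between annotated and minimal formats described above.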
Examples of ASCII sequence file formats
GCG (MacVector, GCG)
LOCUS       RATOBESE.G     539 BP  SS-RNA    ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION   -
KEYWORDS    -
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
            Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
            Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference:
            DDBJ RATOBESE
            Accession: D49653
            ------------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
FEATURES    From   To/Span   Description
  pept      30     533       obese
  ????      1      539       source; /organism=Rattus norvegicus;
                             /strain=OLETF, LETO and Zucker;
                             /dev_stage=differentiated; /sequenced_mol=cDNA
                             to mRNA; /tissue_type=adipose
BASE COUNT  121 A    167 C    133 G    118 T    0 OTHER
ORIGIN      ?
  RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
        1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
Sequence file format tips
When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
When retrieving from a database or exchanging between programs, use an annotated text format such as GCG or GenBank
When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary is not available)
Sequence assembly
Goal: Assemble pieces of sequence into a single, continuous sequence
An early commercial system for sequence assembly was the GCG GelOverlap/GelAssemble suite (VMS, Unix)
We will use AssemblyLIGN (Macintosh), a companion to MacVector
Sequence assembly: Terms
project
  collection of fragments, templates and contigs
fragments
  pieces of sequence entered by user or read from files
contigs
  lists of aligned fragments generated (normally) by program
Sequence assembly: Terms
templates
  any sequence to be searched for
  can be entered by user
  can be read from system files
  most common example is sequence of vector
  sequences in templates are NOT included in assembled sequences unless they are ALSO present in a fragment (and have not been removed)
Sequence assembly: File organization
AssemblyLIGN keeps all information (including sequences) in a single project document
GCG keeps all information in a directory (and its subdirectories), with each fragment in a separate file
Sequence assembly: Steps
Data entry/import (fragments, templates)
Removal of unwanted sequence
Automated creation of contigs
Manual editing/confirmation
Export
Automated creation of contigs
Steps
1. Finding pairwise overlaps
2. Resolving overlaps
3. Improving alignment
4. Final assembly and consensus generation
1. Finding pairwise overlaps
Compare each fragment (and its complement) with each other fragment
Generate list of regions of similarity that meet criteria below
Parameters
  minimum overlap length (default 20)
  stringency (% of bases that must match, default 70)
  minimum repeat length (default 30)
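The pairwise comparison step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not AssemblyLIGN's actual algorithm: it looks only for ungapped end-to-end overlaps (a suffix of one fragment against a prefix of another), it does not model the minimum repeat length parameter, and the names `best_overlap` and `find_overlaps` are hypothetical. The defaults mirror the minimum overlap length (20) and stringency (70%) listed above.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(left, right, min_overlap=20, stringency=70.0):
    """Longest ungapped overlap of a suffix of `left` with a prefix of `right`.

    Returns (length, percent_identity), or None if nothing qualifies.
    """
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        identity = 100.0 * matches / length
        if identity >= stringency:
            return length, identity
    return None


def find_overlaps(fragments, min_overlap=20, stringency=70.0):
    """Compare every fragment with every other fragment, on both strands."""
    hits = []
    names = sorted(fragments)
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            a, b = fragments[name_a], fragments[name_b]
            # Four distinct layouts, up to reverse-complementing the whole pair.
            candidates = [
                (name_a, name_b, a, b),
                (name_b, name_a, b, a),
                (name_a, name_b + " (rc)", a, reverse_complement(b)),
                (name_a + " (rc)", name_b, reverse_complement(a), b),
            ]
            for left_name, right_name, left, right in candidates:
                hit = best_overlap(left, right, min_overlap, stringency)
                if hit:
                    hits.append((left_name, right_name) + hit)
    return hits


# Two toy fragments cut from the rat obese sequence; they share 20 bases.
frag1 = "CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAG"
frag2 = "GCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGT"
fragments = {"f1": frag1, "f2": reverse_complement(frag2)}
for hit in find_overlaps(fragments):
    print(hit)
```

Storing `f2` as its reverse complement shows why each fragment must also be compared in the complementary orientation: the overlap is only found in the `f1` versus `f2 (rc)` layout.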
1. Finding pairwise overlaps
Each fragment may appear in more than one overlap
[Figure: pairwise overlaps found among fragments 1-9: 1 and 8, 3 and 5, 3 and 6, 8 and 2, 7 and 9, 5 and 4]
2. Resolving overlaps
Build larger pieces by combining overlaps
[Figure, built up over seven slides: the pairwise overlaps are merged step by step. Overlaps 1-8 and 8-2 combine into 1 8 2; overlaps 3-5 and 3-6 combine into 5 3 6; overlap 5-4 then adds 4 to that group; 7 9 remains a separate pair.]
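The grouping shown on these slides can be mimicked with a small union-find sketch. It only decides which fragments end up connected; a real assembler must also work out the order and offset of each fragment, which is not attempted here. The function name `group_overlaps` is hypothetical, and the pair list is read off the fragment numbers on the slides.

```python
def group_overlaps(pairs):
    """Merge pairwise overlaps into groups of connected fragments."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for fragment in parent:
        groups.setdefault(find(fragment), []).append(fragment)
    return sorted(sorted(members) for members in groups.values())


# Pairwise overlaps as numbered on the slides.
pairs = [(1, 8), (3, 5), (3, 6), (8, 2), (7, 9), (5, 4)]
print(group_overlaps(pairs))  # [[1, 2, 8], [3, 4, 5, 6], [7, 9]]
```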
3. Improving alignment
Introduce gaps in sequences if they will improve overlaps
Alignment parameters
  gap creation penalty (default 2.0)
  gap extension penalty (default 0.1)
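A creation penalty plus an extension penalty is the usual affine gap scheme. The sketch below assumes the common form (creation charged once per gap, extension charged for every gapped position); the slides do not state the exact formula the program uses, so treat this as an illustration with the default values above.

```python
def gap_penalty(length, creation=2.0, extension=0.1):
    """Affine gap cost: a fixed charge to open a gap plus a per-base charge.

    Assumed form: creation + extension * length (not confirmed by the slides).
    """
    if length <= 0:
        return 0.0
    return creation + extension * length


for n in (1, 5, 20):
    print(n, round(gap_penalty(n), 2))

# One 20-base gap costs less than two separate 10-base gaps.
print(round(gap_penalty(20), 2), round(2 * gap_penalty(10), 2))
```

With a high creation penalty and a low extension penalty, the alignment prefers a few long gaps over many short ones.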
4. Final assembly and consensus generation
Mark fragments that are now part of a contig (no longer appear by themselves)
Form consensus for each contig by “reading” along aligned sequences and converting to IUB codes by consensus rules
Consensus parameter
  base designation threshold (% of all bases at a given position that must be the same for that base to be assigned to the consensus; otherwise, less specific IUB code used; default 80%)
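The base designation threshold can be illustrated with a short consensus routine. This is a sketch under assumptions: when no base reaches the threshold, it emits the IUB code covering every base observed in that column, which is one plausible reading of "less specific IUB code" and may differ from the program's actual consensus rules. The names `consensus` and `IUB_FROM_SET` are hypothetical.

```python
from collections import Counter

# Map each set of observed bases to its IUB code.
IUB_FROM_SET = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(aligned, threshold=80.0):
    """Consensus of equal-length aligned rows; '-' means gap or no coverage."""
    result = []
    for column in zip(*aligned):
        bases = [b.upper() for b in column if b.upper() in "ACGT"]
        if not bases:
            result.append("-")
            continue
        counts = Counter(bases)
        base, n = counts.most_common(1)[0]
        if 100.0 * n / len(bases) >= threshold:
            result.append(base)          # clear majority: call the base
        else:
            result.append(IUB_FROM_SET[frozenset(counts)])  # ambiguity code
    return "".join(result)


rows = ["ACGTAC", "ACGTAC", "ACATAC", "ACATA-", "-CGTAC"]
print(consensus(rows))                  # ACRTAC
print(consensus(rows, threshold=60.0))  # ACGTAC
```

In the third column G appears in 3 of 5 rows (60%), so at the default 80% threshold the position is reported as R (A or G); lowering the threshold to 60% calls it as G.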
Manual consensus editing
Crucial to verify alignment and resolve ambiguities (e.g., sequencing errors)
Program keeps an “edit history” that tracks all changes made to the original sequences: essential to be able to “retrace your steps” from original sequencing gels (e.g., in case of conflicts with sequences in database)
AssemblyLIGN Tutorial
Open “demo π” project
AssemblyLIGN Tutorial
Goal: Eliminate vector sequence
Double-click vector
Select all fragments
Drop on vector
AssemblyLIGN Tutorial
“vector Alignments” window shows that frag8 contains vector sequence
Click on ‘shadow’ to edit frag8 and display highlighted vector sequence
AssemblyLIGN Tutorial
Highlighted sequence doesn’t look like sequence in “vector” window
AssemblyLIGN Tutorial
Click on “vector” window
Choose Select All (Edit Menu)
Choose Reverse & Complement (Edit Menu)
Now highlighted sequence in frag8 matches that in “vector” window
AssemblyLIGN Tutorial
Click on “frag8” window
Delete highlighted sequence
Then close “frag8” window
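The vector-removal steps above (the match is only visible after Reverse & Complement, then the matching stretch is deleted) can be imitated in code. This is a deliberately simple sketch: it uses exact string matching, whereas the program aligns the template against the fragments, and both `remove_vector` and the short `vector` sequence are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def remove_vector(fragment, vector_piece):
    """Delete the first exact occurrence of vector_piece, on either strand."""
    for probe in (vector_piece, reverse_complement(vector_piece)):
        start = fragment.find(probe)
        if start != -1:
            return fragment[:start] + fragment[start + len(probe):]
    return fragment  # no vector found; leave the fragment unchanged


vector = "GGATCCTCTAGA"  # hypothetical piece of vector sequence
# The fragment carries the vector on the opposite strand, as frag8 does.
fragment = reverse_complement(vector) + "ACGTTGCA"
print(remove_vector(fragment, vector))  # ACGTTGCA
```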
AssemblyLIGN Tutorial
Choose Select All (Edit Menu)
Choose Assemble (Project Menu)
AssemblyLIGN Tutorial
All but one fragment (frag14) combined into Untitled Contig 1
AssemblyLIGN Tutorial
Goal: Try relaxing assembly parameters to merge frag14 into the contig
Choose Assembly Options (Project Menu)
Reduce minimum overlap length to 5
AssemblyLIGN Tutorial
Now all fragments are merged
Double-click Untitled Contig 2 to see alignment and consensus
AssemblyLIGN Tutorial
Map shows gross alignments of fragments
Click on Magnifying glass ‘A’ and select region of map to view
AssemblyLIGN Tutorial
Positions that do not match at/above the Base Designation Threshold are highlighted in the consensus or the original sequences
Can decrease the Base Designation Threshold to reduce ‘uncalled’ bases
Reading for Next Class
Baxevanis & Ouellette, Chapter 7
Sequence Analysis Primer, pp. 90-124
“Similarity versus Homology” and “Dot Matrix Methods” (on web page)
(03-510) Durbin et al., pp. 12-17
Summary, Part 6
A variety of sequence file formats are currently in use. Files can be either text or binary, and can consist only of sequence or also include annotation information.
The choice of file format is dictated by the requirements of the analysis desired and the subset of formats compatible between the “writing” and “reading” program.
Summary, Part 6
Sequence assembly requires the ability to compare sequences to find regions of high homology.
Pieces of sequence are assembled by “connecting” them via regions of overlap.
A consensus sequence can be generated from the connected pieces (using user-specified rules to resolve ambiguity).
Sequence comparisons using BLAST server Web Page
Main BLAST web page URL
  http://www.ncbi.nlm.nih.gov/BLAST/
Links to Basic and Advanced Search Pages
Two main BLAST programs
  blastn - compares user nucleotide sequence to nucleotide sequences in database
  blastp - compares user peptide sequence to peptide sequences in database