Post on 19-Dec-2015
Bioinformatics
ABE 2007
Kent Koster
Group 3
Why bioinformatics?
“Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.”
Outline
Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources
Bioinformatics Defined
Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
It differs from “computational biology”: computational biology is the use of computer technology to solve a single, hypothesis-based question, while bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences.
i.e. converting “data” to “information.”
The nebulous genesis of bioinformatics
1977 – Φ-X174 Phage Genome sequenced
1990 – Paper published in the Journal of Molecular Biology describes sequence alignment search algorithm
1990s – Software used to find fragment overlap for the Human Genome Project
1992 – NCBI takes over GenBank DNA sequence database in response to the growing number of gene patents
The nebulous genesis of bioinformatics
1994 – “Entrez” Global Query Cross-Database Search System allows users to search GenBank database
1995 – Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.) in the sequenced Haemophilus influenzae genome
1996 – NCBI-BLAST created to provide powerful heuristic searches against the GenBank database
Genomics to Proteomics through Bioinformatics
Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the “product” science made possible by bioinformatics
A proteome is the collection of all proteins expressed in a cell at a given time
Every organism has 1 genome, but many proteomes
In addition to “high throughput” protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of “large scale biology” to traditional molecular biology, made possible by bioinformatics
Common Uses of Bioinformatics
Homology and Comparative Modeling
  Protein or gene homology refers to nucleotide or amino acid sequences or domains shared between different genes or proteins, whether from the same or different organisms
Gene or Protein Identification
  Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples
So, how do ya do it?
DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding
DNA Sequencing
Sanger Method
  New nucleotide chains of DNA being replicated by DNA polymerase are stopped when dideoxynucleotides (ddNTPs, added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain
DNA Sequencing
Fluorescent dyes are bound to the ddNTPs, allowing each terminated chain to be detected when it is excited by a laser
Terminated DNA chains are run on a gel, and fragments are resolved by size
By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed
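The read-out step can be sketched in a few lines: sort the terminated fragments by size and read the dye on each one. This is a toy model only; the dye-to-base colour mapping and the `call_bases` helper are assumptions made for illustration, and real base calling works on noisy peak traces rather than clean tuples.

```python
# Toy model of a fluorescent Sanger read-out.
# Each terminated fragment is (length_in_nucleotides, dye_colour).
# The colour-to-base mapping below is an assumption for illustration.
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def call_bases(fragments):
    """Sort terminated fragments by size and read the terminal base of each."""
    ordered = sorted(fragments, key=lambda frag: frag[0])
    return "".join(DYE_TO_BASE[dye] for _, dye in ordered)


# Fragments arrive unordered; the gel resolves them by size.
fragments = [(3, "yellow"), (1, "green"), (4, "yellow"), (2, "red"), (5, "green")]
read = call_bases(fragments)
print(read)
```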
Example Sequence Chromatogram
Sequence Analysis
First Things First – Sequence File Formats:
  Most common for nucleotides: FASTA / Multi-FASTA
  “>” followed by any text; the entire line is read as the sequence title
  Line break followed by the continuous 5’–3’ nucleotide sequence, or protein sequence using 1-letter codes
  Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATGATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATGCGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGACGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGATGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATCTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTCTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAAAA
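A minimal sketch of reading this format, assuming the record layout described above (a “>” title line followed by one or more sequence lines). The `parse_fasta` name and the sample records are made up for illustration; established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA / multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Hypothetical two-record multi-FASTA input.
example = """>sample_1 hypothetical title
ATGGACCTGA
TCACAAATGC
>sample_2
atgcgat
"""
records = parse_fasta(example)
for title, seq in records:
    print(title, len(seq))
```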
Sequence Homology Software
NCBI-BLAST
  Run by the National Center for Biotechnology Information
  BLAST uses a heuristic algorithm that approximates the Smith-Waterman local alignment algorithm
  The algorithm searches the database for a short string (“word”) within the query (default length 11 for nucleotide searches); when it detects a match, it compares the nucleotides at each end of this seed to extend the match
  Gaps are taken into account, then the matches are presented in order of statistical significance
  http://www.ncbi.nlm.nih.gov/BLAST/
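The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST itself: extension here is ungapped and stops at the first mismatch, whereas BLAST scores the extension and allows gaps. The function name and test sequences are invented for the example.

```python
def seed_and_extend(query, subject, word_size=11):
    """Find exact word matches (seeds) and extend them without gaps.

    Returns (query_start, subject_start, length) hits, longest first.
    """
    # Index every word of the subject by its start positions.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    hits = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            start_q, start_s = i, j
            # Extend to the left while the flanking characters match.
            while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
                start_q -= 1
                start_s -= 1
            end_q, end_s = i + word_size, j + word_size
            # Extend to the right while the flanking characters match.
            while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
                end_q += 1
                end_s += 1
            # Seeds on the same diagonal collapse into one hit via the set.
            hits.add((start_q, start_s, end_q - start_q))
    return sorted(hits, key=lambda h: -h[2])


query = "TTGACCTGATCACAAATGCGATGG"
subject = "CCCCGACCTGATCACAAATGCGATAAAA"
hits = seed_and_extend(query, subject)
print(hits)
```

Indexing the subject words first is what makes the search fast: only positions that share an exact word with the query are ever examined.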
Different Types of BLAST
Nucleotide-nucleotide BLAST (BLASTN):
  Basic nucleotide sequence searches
  The BLAST that you used for your sequences
Protein-protein BLAST (BLASTP):
  Similar technology used to search amino acid sequences
Position-Specific Iterative BLAST (PSI-BLAST):
  A more advanced protein BLAST useful for analyzing relationships between divergently evolved proteins.
Different Types of BLAST
BLASTX and TBLASTN:
  Use six-frame translation of the nucleotide query (BLASTX) or of the nucleotide database (TBLASTN), so that the comparison is made at the protein level
MegaBLAST:
  Used for BLASTing several sequences at once to cut down on processing load and server reporting time
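Six-frame translation means translating a nucleotide sequence in all three reading frames on both strands. A minimal sketch using the standard genetic code follows; the function names are illustrative, and the input is the first 21 nucleotides of the FASTA example shown earlier in the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (last base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to protein string."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translation("ATGGACCTGATCACAAATGCG")
for label, protein in frames.items():
    print(label, protein)
```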
Interpreting BLAST Results
Max/Total Score
  Calculated from the number of matches and gaps.
  Higher relative to your query length is better
E Value: $E = K m n \, e^{-\lambda S}$ (m = query length, n = database length, S = alignment score; K and λ depend on the scoring system)
  Translation: the E Value is the number of matches scoring at least S that would be expected by random chance in a database of this size, e.g. E = 1e-6 means that about one such chance match would be expected per 1,000,000 comparable searches
  Smaller E Values are better
  Values larger than E = 1e-5 are too likely to be due to chance
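To see how the formula E = K·m·n·e^(−λS) behaves, here is a small sketch. The K and λ values are placeholders chosen for illustration, not real BLAST parameters (those depend on the scoring scheme), and the lengths are made up.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters for illustration only (not real BLAST values).
K, LAMBDA = 0.1, 0.3
m, n = 500, 1_000_000_000  # query length, total database length

# E falls exponentially as the score rises and grows linearly with database size.
for s in (50, 100, 150):
    print(s, expect_value(s, m, n, K, LAMBDA))
```

This is why the same alignment score gives a larger (worse) E Value against a bigger database.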
Interpreting BLAST Results
Query Coverage
  The percent of the query sequence matched by the database entry
Max Ident
  The percent identity, i.e. the percentage of positions that match within the limits of the full match (e.g. deletions or additions reduce this value)
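Both numbers can be computed from a gapped alignment. The sketch below assumes identity is counted over all alignment columns (gaps included) and coverage over the full query length; the helper name and the short example alignment are invented for illustration.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent_identity, query_coverage) for a gapped pairwise alignment.

    Gaps are written as '-'. Identity counts matching columns over all
    alignment columns; coverage counts aligned query bases over the
    full query length.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns
    coverage = 100.0 * query_bases / query_length
    return identity, coverage


# One inserted base in the subject and one mismatch; the query is 16 nt long,
# but only 8 of its bases take part in the match.
identity, coverage = alignment_stats("ACGT-ACGT", "ACGTTACCT", query_length=16)
print(round(identity, 1), coverage)
```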
Sequence Aligning Software
Clustal (free)
  ClustalX – Software
  ClustalW – Web
DNAStar ($$$)
Functionality is similar, but the difference is in interface, tools, and speed of algorithms
http://www.ebi.ac.uk/clustalw/
SMART
Simple – Modular – Architecture – Research – Tool
Run by EMBL (European Molecular Biology Laboratory)
While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains
PFAM
Protein domain database
Manually curated, trading volume for quality
Uses “hidden Markov models” for domain pattern recognition
Run by the Sanger Institute in the UK
Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
http://www.sanger.ac.uk/Software/Pfam/
Interpro
Database of protein domains and functional sites
Best source of annotation
  Other tools sometimes draw annotation from Interpro
Run by the European Bioinformatics Institute
http://www.ebi.ac.uk/interpro/
Protein Folding
Lowest energy state folding
  Ab initio: tremendously resource heavy, can only be done for tiny proteins
  Distributed computing is used for mid-sized proteins
    Folding@Home
    Human Proteome Folding Project
    Rosetta@Home
    Predictor@Home
Protein Folding
Software-assisted manual folding
  Use knowledge of biochemistry to fold protein into predicted structure, then software to find lowest energy state
  Commercial Programs:
    Protein Shop
    Profold
Manual Motif Verification
Ramachandran Plot – plot of the backbone dihedral angles Ψ against Φ (rotation about the Cα–C and N–Cα bonds, respectively) for each residue of the protein
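A Ramachandran plot is built from two backbone dihedral angles per residue: Φ is defined by the atoms C(i−1), N(i), Cα(i), C(i) and Ψ by N(i), Cα(i), C(i), N(i+1). The sketch below computes a dihedral angle from four 3D points using a common atan2 formulation; the coordinates are made-up test geometry, not real protein data.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (range -180..180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    # atan2 form avoids the sign ambiguity of a plain arccos.
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up geometry: the central bond runs along the z axis.
cis = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
quarter = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
trans = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
print(cis, quarter, trans)
```

Computing Φ and Ψ for every residue and plotting Ψ against Φ gives the Ramachandran plot used to check whether a folded model sits in allowed regions.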
Our Procedure
Colonies were selected from nutrient plates
  Each group selected two colonies to sequence
  Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
  Presence of the PDI insert was expected to disrupt ccdB (lethal protein) and LacZα gene expression in the vector plasmid
  LacZα expression resulted in some blue colonies, as the colonies were able to cleave X-Gal substrate into blue product
Initial Questions Guiding Colony Selection
How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the ampicillin resistance plasmid?
What was the actual sequence of the commercial positive control insert?
Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?
Procedure
Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
Samples were sent to UH Manoa lab for sequencing
Chromatogram results were viewed with Finch TV to determine quality
Procedure
Sequences were trimmed at the 5’ and 3’ ends, then an attempt was made to locate the restriction enzyme sites on the vector with Finch TV
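In our procedure this step was done by eye in Finch TV. For comparison, a scripted version might look like the sketch below. The quality threshold, the quality scores, and the choice of EcoRI (GAATTC) and BamHI (GGATCC) as the sites to search for are all assumptions for illustration; the slides do not state which enzymes flank the insert.

```python
def trim_low_quality(seq, quals, min_quality=20):
    """Trim bases from both ends until a base with quality >= min_quality is reached."""
    start = 0
    while start < len(seq) and quals[start] < min_quality:
        start += 1
    end = len(seq)
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end]


def find_sites(seq, site):
    """Return every 0-based start position of a recognition site in seq."""
    positions = []
    pos = seq.find(site)
    while pos != -1:
        positions.append(pos)
        pos = seq.find(site, pos + 1)
    return positions


# Hypothetical read with poor-quality ends (quality scores are made up).
raw_read = "NNACGAATTCGGATCCTTNN"
qualities = [5, 8] + [30] * 16 + [9, 4]

trimmed = trim_low_quality(raw_read, qualities)
print(trimmed)
print("EcoRI:", find_sites(trimmed, "GAATTC"))
print("BamHI:", find_sites(trimmed, "GGATCC"))
```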
Procedure
Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both strands of each sample with EBI’s tools
Consensus sequence from pair-wise alignment was searched for in BLAST
Gene information was located from BLAST annotation and TAIR website
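The alignment-and-consensus step can be sketched as follows: reverse-complement one read so both point the same way, align the pair globally, then merge. This uses a plain Needleman-Wunsch alignment with made-up scoring values as a stand-in; it is an assumption that this resembles what the EBI tool did, and the reads are invented.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(aligned_a, aligned_b):
    """Agreeing columns keep their base, gaps take the other read, conflicts become N."""
    merged = []
    for x, y in zip(aligned_a, aligned_b):
        if x == y:
            merged.append(x)
        elif x == "-":
            merged.append(y)
        elif y == "-":
            merged.append(x)
        else:
            merged.append("N")
    return "".join(merged)


# Hypothetical reads: the T7 read comes from the opposite strand.
t3_read = "ACGTTGCA"
t7_read = "TGCTACGT"

aligned = global_align(t3_read, reverse_complement(t7_read))
merged = consensus(*aligned)
print(aligned)
print(merged)
```

Marking conflicts as N is a conservative choice; with chromatogram quality scores available, the higher-quality base call could be kept instead.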
Results
General Remarks
  Because colonies were selected before the identity of the positive control insert was questioned, no control colonies were sequenced
  All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
  Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E. coli growing as one colony
Group 3 Results
Sequenced 1 blue and 1 white colony from the same plate
Colonies were transformed with PCR product, not gel-recovered DNA
White colonies had the PDI insert
Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZ alpha gene to be expressed
Group 3 White Colony
T7 strand definitively showed the presence of a PDI insert
Group 3 White Colony
T3 and T7 strand consensus sequence also showed PDI gene presence
Group 3 Blue Colony
Blue colony T3 showed multiple signals
Group 3 Blue Colony
However, T7 strand was salvageable
A 154 nucleotide sequence was found between the restriction sites
Group 1 Results
White colony from PCR product showed PDI gene in both T3 and T7 strands
White colony from gel purification:
  T7 strand sequenced as multiple signals
  T3 strand sequenced excellently
Group 1 Gel White Colony
T3 sequence showed only nucleotides 1540-2320 of the vector
Group 2 Results
White Colony from gel purification
  White colonies sequenced with PDI gene
Blue w/ White Ring Colony from PCR
  Both T3 and T7 strand sequencing showed consistent multiple signals
Group 4 Results
1 white colony from PCR and 1 white colony from gel purification were sequenced
Both showed PDI gene
Final Remarks
All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living together
  Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin selection in broth?
  The ccdB gene could be lost due to mutation
  Bacteria could have cut the plasmid, deleting ccdB but possibly retaining the LacZ and ampicillin resistance genes
No group sequenced the positive control insert – sequence still a mystery!
Resources
http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org