bioinformatics abe 2007 kent koster group 3. why bioinformatics? “other techniques raise more...

Bioinformatics

ABE 2007

Kent Koster

Group 3

Why bioinformatics?

“Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.”

Outline

Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources

Bioinformatics Defined

Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.

It differs from “computational biology”: computational biology is the use of computer technology to solve a single, hypothesis-based question, whereas bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences.

In other words, bioinformatics converts “data” into “information.”

The nebulous genesis of bioinformatics

1977 – Φ-X174 phage genome sequenced
1990 – Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
1990s – Software used to find fragment overlap for the Human Genome Project
1992 – NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents

The nebulous genesis of bioinformatics

1994 – “Entrez” Global Query Cross-Database Search System allows users to search the GenBank database
1995 – Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.) in the sequenced Haemophilus influenzae genome
1996 – NCBI-BLAST created to provide powerful heuristic searches against the GenBank database

Genomics to Proteomics through Bioinformatics

Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the “product” science made possible by bioinformatics
A proteome is the collection of all proteins expressed in a cell at a given time
Every organism has 1 genome, but many proteomes
In addition to “high throughput” protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of “large scale biology” to traditional molecular biology, made possible by bioinformatics

Common Uses of Bioinformatics

Homology and Comparative Modeling
  Protein or gene homology refers to nucleotide or amino acid sequences or domains shared between different proteins, whether from the same or different organisms
Gene or Protein Identification
  Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples

So, how do ya do it?

DNA Sequencing
Sequence Formats
Sequence Homology Software Tools
Aligning Tools
Annotated Information
Protein Folding

DNA Sequencing

Sanger Method
  New nucleotide chains of DNA being replicated by DNA polymerase are stopped when dideoxynucleotides (added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain

DNA Sequencing

Fluorescent dyes are bound to the ddNTPs, allowing the molecule to be detected when it is excited by a laser
Terminated DNA chains are run on a gel, and fragments are resolved by size
By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed
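As a rough illustration of that last step, here is a minimal Python sketch of base calling. It assumes the gel has already been reduced to one set of four dye intensities per fragment length, shortest fragment first. The `call_bases` function, the intensity numbers, and the `min_ratio` cutoff are all hypothetical; real base callers are considerably more involved.

```python
# Toy base caller: one dict of dye intensities per fragment length,
# ordered from the shortest fragment to the longest.
def call_bases(peaks, min_ratio=2.0):
    """Return the called sequence; 'N' marks positions with mixed signal."""
    sequence = []
    for intensities in peaks:
        # Rank the four dye channels from strongest to weakest.
        ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        # Call the base only if one dye clearly dominates (assumed cutoff).
        if second == 0 or best / second >= min_ratio:
            sequence.append(best_base)
        else:
            sequence.append("N")
    return "".join(sequence)

# Made-up intensities for four consecutive fragment lengths.
example_peaks = [
    {"A": 950, "C": 40, "G": 30, "T": 20},
    {"A": 35, "C": 20, "G": 880, "T": 25},
    {"A": 400, "C": 30, "G": 20, "T": 420},  # two dyes at once: ambiguous
    {"A": 15, "C": 910, "G": 45, "T": 30},
]
print(call_bases(example_peaks))
```

A position where two dyes are about equally strong is reported as N here, which is the sort of read described later as showing "multiple signals".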

Example Sequence Chromatogram

Sequence Analysis

First Things First – Sequence File Formats:
Most common for nucleotides: FASTA / Multi-FASTA
  “>” followed by any Unicode text; the entire line is read as the sequence title
  Line break followed by a continuous 5’-3’ nucleotide sequence, or a protein sequence using 1-letter codes
Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATGCGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGATGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATCTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATTAA
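The format rules above are simple enough to parse by hand. The following is a minimal Python sketch, not a replacement for an established parser; the `parse_fasta` name and the two sample records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA / Multi-FASTA text into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line.upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

# Hypothetical Multi-FASTA text with two short records.
example = """>sample_1 hypothetical forward read
ATGGACCTGATC
ACAAATGCG
>sample_2 hypothetical reverse read
TTAGGC
"""
for title, seq in parse_fasta(example):
    print(title, len(seq), seq)
```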

Sequence Homology Software

NCBI-BLAST
Run by the National Center for Biotechnology Information
BLAST uses a heuristic algorithm that approximates the Smith-Waterman local alignment algorithm
The algorithm searches the database for a small string (a "word") within the query (default length 11 for nucleotide searches); when it detects a match, it searches for shared nucleotides at each end of the seed to extend the match
Gaps are taken into account, then the matches are presented in order of statistical significance
http://www.ncbi.nlm.nih.gov/BLAST/
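The seed-then-extend idea can be sketched in a few lines of Python. This is a simplified illustration and not BLAST itself: it extends only through exact matches, ignores gaps and scoring, and ranks hits by length instead of statistical significance. The function name and the two sequences are hypothetical.

```python
def seed_and_extend(query, subject, word_size=11):
    """Find exact word matches (seeds) and extend them without gaps.

    Returns (query_start, subject_start, length) tuples, longest first.
    """
    # Index every word of length word_size in the subject sequence.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    hits = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            # Extend to the left while the bases keep matching.
            qs, ss = i, j
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs -= 1
                ss -= 1
            # Extend to the right while the bases keep matching.
            qe, se = i + word_size, j + word_size
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe += 1
                se += 1
            hits.add((qs, ss, qe - qs))
    # Longest matches first, as a crude stand-in for ranking by significance.
    return sorted(hits, key=lambda h: h[2], reverse=True)

# Made-up sequences sharing one 19-base stretch.
query = "TTGACCTGATCACAAATGCGATGG"
subject = "AAACCTGATCACAAATGCGATTTT"
for q_start, s_start, length in seed_and_extend(query, subject):
    print(q_start, s_start, query[q_start:q_start + length])
```

Several seeds inside the same matching stretch extend to the same hit, which is why the hits are collected in a set.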

Different Types of BLAST

Nucleotide-nucleotide BLAST (BLASTN):
  Basic nucleotide sequence searches
  The BLAST that you used for your sequences
Protein-protein BLAST (BLASTP):
  Similar technology used to search amino acid sequences
Position-Specific Iterative BLAST (PSI-BLAST):
  A more advanced protein BLAST useful for analyzing relationships between divergently evolved proteins.

Different Types of BLAST

BLASTX and TBLASTN:
  Both use six-frame translation: BLASTX translates a nucleotide query and searches it against a protein database, while TBLASTN searches a protein query against a translated nucleotide database
MegaBLAST:
  Optimized for highly similar nucleotide sequences; can process several query sequences at once to cut down on processing load and server reporting time
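Six-frame translation means reading the DNA in three reading frames on each strand. A minimal Python sketch is shown below, assuming the standard genetic code; the function names and the frame labels ("+1" to "-3") are choices made for this illustration, not BLAST's own interface.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Translate all three frames of both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame_translation("ATGGACCTGATCACAAATGCG").items():
    print(name, protein)
```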

Interpreting BLAST Results

Max/Total Score
  Calculated from the number of matches and gaps.
  Higher relative to your query length is better
E Value: $E = Kmn\,e^{-\lambda S}$, where m is the query length, n is the database length, S is the alignment score, and K and λ are statistical parameters of the scoring system
  Translation: the E Value is the number of matches with a score at least this high that would be expected by random chance in a search of a database this size, e.g. E = 1e-6 means about one such chance match would be expected per 1,000,000 comparable searches
  Smaller E Values are better
  Values larger than E = 1e-5 are too likely to be due to chance
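To see how strongly the E Value depends on the score, the formula E = Kmn·e^(−λS) can be evaluated directly. The sketch below is only an illustration: the values chosen for K, λ, the query length, and the database length are made up, and real values depend on the scoring system and database that BLAST reports with each search.

```python
import math

def e_value(raw_score, query_len, db_len, k, lam):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, k, lam):
    """Normalized score S' = (lambda*S - ln K) / ln 2, so E = m*n*2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative, made-up parameters (not taken from any real search).
K, LAM = 0.5, 1.0
M, N = 500, 1_000_000_000  # query length, total database length

for s in (20, 30, 40):
    print(s, f"{e_value(s, M, N, K, LAM):.2e}", round(bit_score(s, K, LAM), 1))
```

Because the score sits in an exponent, each added point of score shrinks the E Value by a constant factor, which is why small differences in score separate convincing hits from chance ones.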

Interpreting BLAST Results

Query Coverage
  The percent of the query sequence matched by the database entry
Max Ident
  The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
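Both numbers can be computed from a single gapped alignment. The Python sketch below assumes identity is counted over all alignment columns (so gaps lower it) and coverage is the share of query bases that take part in the alignment; the function name and the example alignment are hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one alignment.

    Both aligned strings must have equal length, with '-' marking gaps.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    # Identical, non-gap columns count as matches.
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    # Query bases that take part in the alignment (gaps excluded).
    query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / len(aligned_query)
    coverage = 100.0 * query_bases / query_length
    return identity, coverage

# Made-up alignment: 8 of the query's 10 bases are aligned, with one gap
# and one mismatch among the 9 alignment columns.
identity, coverage = alignment_stats("ACGT-ACGT", "ACGTTACCT", query_length=10)
print(f"identity {identity:.1f}%, coverage {coverage:.1f}%")
```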

Sequence Aligning Software

Clustal (free)
  ClustalX – Software
  ClustalW – Web
DNAStar ($$$)
Functionality is similar, but the difference is in interface, tools, and speed of algorithms
http://www.ebi.ac.uk/clustalw/

SMART

Simple – Modular – Architecture – Research – Tool
Run by EMBL (European Molecular Biology Laboratory)
While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains

PFAM

Protein domain database
Manually curated, trading volume for quality
Uses “hidden Markov models” for domain pattern recognition
Run by Sanger Institute in the UK
Heuristic server-load analysis predicts when key protein analysis report is due and crashes server
http://www.sanger.ac.uk/Software/Pfam/

Interpro

Database of protein domains and functional sites
Best source of annotation
  Other tools sometimes draw annotation from Interpro
Run by the European Bioinformatics Institute
http://www.ebi.ac.uk/interpro/

Protein Folding

Lowest energy state folding
  Ab initio: tremendously resource heavy, can only be done for tiny proteins
  Distributed computing is used for mid-sized proteins
    Folding@Home
    Human Proteome Folding Project
    Rosetta@Home
    Predictor@Home

Protein Folding

Software-assisted manual folding
  Use knowledge of biochemistry to fold protein into predicted structure, then software to find lowest energy state
  Commercial Programs:
    Protein Shop
    Profold

Manual Motif Verification

Ramachandran Plot – plot of the ψ angle against the φ angle (the backbone dihedral angles on either side of the α-carbon) for each amino acid residue
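Each point on a Ramachandran plot is a pair of backbone dihedral angles for one residue: φ is defined by the atoms C(i-1), N, Cα, C, and ψ by N, Cα, C, N(i+1). The Python sketch below shows one common way to compute a signed dihedral angle from four 3D points. The helper names are made up for this example, and the coordinates used to exercise it are simple test points, not real atom positions.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def phi_psi(prev_c, n, ca, c, next_n):
    """Backbone angles of one residue from five atom positions."""
    return dihedral(prev_c, n, ca, c), dihedral(n, ca, c, next_n)

# Simple geometric check: the fourth point is rotated a quarter turn
# around the central bond relative to the first.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Plotting the (φ, ψ) pair of every residue and checking that the points fall in the usual allowed regions is the manual verification step the slide refers to.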

Our Procedure

Colonies were selected from nutrient plates
Each group selected two colonies to sequence
Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
Presence of PDI insert was expected to disrupt ccdB (lethal protein) and LacZα gene expression in vector plasmid
LacZα expression resulted in some blue colonies, as the colonies were able to cleave X-Gal substrate into blue product

Initial Questions Guiding Colony Selection

How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the ampicillin resistance plasmid?
What was the actual sequence of the commercial positive control insert?
Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?

Procedure

Samples were prepared with T3 and T7 (forward and backward) primers in solution for sequencing
Samples were sent to UH Manoa lab for sequencing
Chromatogram results were viewed with Finch TV to determine quality

Procedure

Sequences were trimmed at 5’ and 3’ ends, then an attempt was made to locate the vector's restriction enzyme sites with Finch TV
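Locating restriction sites in a trimmed read amounts to a substring search. The Python sketch below is only an illustration: the slides do not name the enzymes used, so the EcoRI recognition site (GAATTC) is an assumption, and the function names and the example read are made up.

```python
def find_sites(sequence, site):
    """Return every 0-based start position of a recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

def insert_between_sites(read, site="GAATTC"):
    """Return the sequence between the first and last site, or None.

    The default site is EcoRI's, chosen only as an example.
    """
    read = read.upper()
    positions = find_sites(read, site)
    if len(positions) < 2:
        return None
    first, last = positions[0], positions[-1]
    return read[first + len(site):last]

# Made-up read: vector sequence, a site, a short insert, a site, more vector.
read = "NNCCGAATTCATGGACCTGATCGAATTCTTAA"
print(find_sites(read, "GAATTC"))
print(insert_between_sites(read))
```

The length of the returned insert is what tells a full-length insert from a partial one.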

Procedure

Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both strands of each sample with EBI’s tools
Consensus sequence from pair-wise alignment was searched for in BLAST
Gene information was located from BLAST annotation and TAIR website
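Because the T3 and T7 reads come from opposite strands, one of them has to be reverse-complemented before the two can be compared. The Python sketch below illustrates that step together with a very naive consensus. It assumes the two reads already cover exactly the same region and have the same length, which a real pair-wise alignment tool does not require; all names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def simple_consensus(forward_read, reverse_read):
    """Merge two reads of the same region taken from opposite strands.

    Agreeing bases are kept, an 'N' in one read is filled from the other,
    and a disagreement is reported as 'N'.
    """
    oriented = reverse_complement(reverse_read)
    if len(oriented) != len(forward_read):
        raise ValueError("this sketch needs reads of equal length")
    consensus = []
    for f, r in zip(forward_read.upper(), oriented):
        if f == r:
            consensus.append(f)
        elif f == "N":
            consensus.append(r)
        elif r == "N":
            consensus.append(f)
        else:
            consensus.append("N")
    return "".join(consensus)

# Made-up reads: the forward read has one uncalled base (N).
forward = "ATGGANCTGA"
reverse = "TCAGGTCCAT"
print(simple_consensus(forward, reverse))
```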

Results

General Remarks
Because colonies were selected before the identity of the positive control insert was questioned, no control colonies were sequenced
All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E. coli growing as one colony

Group 3 Results

Sequenced 1 blue and 1 white colony from the same plate
Colonies were transformed with PCR product, not gel-recovered DNA
White colonies had PDI insert
Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZ alpha gene to be expressed

Group 3 White Colony

T7 strand definitively showed the presence of a PDI insert

Group 3 White Colony

T3 and T7 strand consensus sequence also showed PDI gene presence

Group 3 Blue Colony

Blue colony T3 showed multiple signals

Group 3 Blue Colony

However, T7 strand was salvageable
A 154 nucleotide sequence was found between the restriction sites


White colony from PCR product showed PDI gene in both T3 and T7 strands

White colony from gel purification:
  T7 strand sequenced as multiple signals
  T3 strand sequenced excellently

Group 1 Gel White Colony

T3 sequence showed only nucleotides 1540-2320 of the vector


White Colony from gel purification
  White colonies sequenced with PDI gene

Blue w/ White Ring Colony from PCR
  Both T3 and T7 strand sequencing showed consistent multiple signals


1 white colony from PCR and 1 white colony from gel purification were sequenced

Both showed PDI gene

Final Remarks

All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out ccdB while still allowing the expression of an at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living together
  Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin selection in broth?
  The ccdB gene could be lost due to mutation
  Bacteria could have cut the plasmid, deleting ccdB but possibly retaining the LacZ and ampicillin resistance genes
No group sequenced the positive control insert – sequence still a mystery!

Resources

http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org
