Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre

Upload: kerry-wood

Post on 20-Jan-2016


Page 1:

Basic Overview of Bioinformatics Tools and Biocomputing Applications II

Dr Tan Tin Wee

Director

Bioinformatics Centre

Page 2:

Common Computational Analyses

• Sequence assembly
• Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
• Detection of active sites, functional residues, characteristic structures, substrates, and processing signals
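Three of the simple sequence analyses listed above (reverse complement, composition statistics, translation) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are made up for this example, the input is assumed to be unambiguous DNA (A, C, G, T), and the standard genetic code is assumed.

```python
# Minimal sketch of simple sequence analysis on a DNA string.
# Function names are illustrative and not taken from any particular package.
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, written in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def base_composition(dna):
    """Count how often each base occurs."""
    return Counter(dna.upper())


def gc_content(dna):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    counts = base_composition(dna)
    return (counts["G"] + counts["C"]) / len(dna) if dna else 0.0


def translate(dna):
    """Translate reading frame 1; '*' marks a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    seq = "ATGGCCTAA"
    print(reverse_complement(seq))
    print(dict(base_composition(seq)), round(gc_content(seq), 2))
    print(translate(seq))
```

Translating the reverse complement as well, in all three frames of each strand, gives the six-frame translation commonly used when looking for ORFs.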

Page 3:

Common Computational Analyses

• Database sequence search

• Multiple alignment

• Secondary (2°) and tertiary (3°) structure prediction; transmembrane helix detection

• Structure modeling

• Docking prediction and design

• Hidden Markov model searches

Page 4:

Database Searching

• Text-based database searching: using a text string to match an annotation in a sequence database record, i.e. a keyword search

• Sequence-based database searching: comparing a biological sequence, in whole or in part, against the sequence of every record in a sequence database

Page 5:

Text-Based Database Searching

• Examples: Entrez, SRS, DBGET, AceDB (common integrated database systems)
• Search concepts
  – Boolean search: AND, OR, NOT
  – Broadening the search
  – Narrowing the search
  – Proximity searching, Soundex
  – Wildcards and stemming, e.g. thala* for thalasemia, thalassemia, thalassemic
• Uses standard string search algorithms, Boolean operations, and vocabulary matches
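The Boolean and wildcard concepts above can be illustrated with a toy keyword search over annotation strings. This is a sketch, not how Entrez or SRS work internally: the records, record IDs and function names are hypothetical, and `*` wildcards are handled with Python's standard `fnmatch` module.

```python
# Toy text-based search: Boolean AND / OR / NOT plus '*' wildcards.
# RECORDS and all function names are hypothetical, for illustration only.
import fnmatch
import re

RECORDS = {
    "rec1": "Human beta-globin gene, thalassemia mutation",
    "rec2": "Mouse beta-globin gene",
    "rec3": "Human thalassemic patient haemoglobin variant",
    "rec4": "Human period homolog",
}


def tokens(text):
    """Split an annotation into lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(term, words):
    """True if any word matches the term; '*' acts as a wildcard."""
    return any(fnmatch.fnmatchcase(word, term.lower()) for word in words)


def search(records, all_of=(), any_of=(), none_of=()):
    """all_of = AND terms, any_of = OR terms, none_of = NOT terms."""
    hits = []
    for rec_id, annotation in records.items():
        words = tokens(annotation)
        if not all(matches(term, words) for term in all_of):
            continue
        if any_of and not any(matches(term, words) for term in any_of):
            continue
        if any(matches(term, words) for term in none_of):
            continue
        hits.append(rec_id)
    return hits


if __name__ == "__main__":
    # human AND thala*  (narrowing with AND, broadening with the wildcard)
    print(search(RECORDS, all_of=["human", "thala*"]))
    # globin NOT mouse
    print(search(RECORDS, all_of=["globin"], none_of=["mouse"]))
```

Adding AND and NOT terms narrows the hit list, while OR terms and wildcards broaden it.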

Page 6:

Text-based Database Searching

• Example: to find the human homolog of the Drosophila per gene
• Procedure
  – Go to Entrez on the Web
  – All Fields: enter "human" "per"
  – Hits returned are irrelevant: broaden the search
  – "human" "period": more hits
  – Check every one; find the human RIGUI gene
• Hit and miss, clever guesswork. Free-form or controlled vocabulary (MeSH terms)? Use Boolean searches?

Page 7:

Sequence-based Database Searching

• Homology Search

• Global or Local Sequence Alignment

• Needleman-Wunsch Algorithm

• Smith-Waterman Algorithm

• Lipman-Pearson FASTA

• Altschul's BLAST

• Take a query sequence and compare it pairwise with each sequence in the database

Page 8:

Sequence-based Database Searching

• Basic assumptions:
  – Sequences of homologous genes/proteins diverge over time, even though structure and/or function change little
  – Significant sequence similarity is taken to imply potential structural/functional similarity or a common evolutionary origin
• Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level.

Page 9:

Sequence-based Database Searching

• Global alignment: forces a complete, end-to-end alignment of the two input sequences in the pairwise comparison

• Local alignment: looks for local stretches of similarity and tries to align the most similar segments

• The algorithms used may be similar, but the outputs differ; statistics are needed to assess the results
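The difference between global and local alignment can be seen in a small dynamic-programming sketch. The function below computes only the optimal score (no traceback) and, for brevity, assumes a simple linear gap penalty and illustrative match/mismatch values rather than a PAM or BLOSUM matrix. With `local=False` it follows the Needleman-Wunsch recurrence; with `local=True` it follows the Smith-Waterman recurrence, where cell scores are never allowed to drop below zero.

```python
# Sketch: optimal global vs local alignment score by dynamic programming.
# Scoring values (match=1, mismatch=-1, gap=-2) are arbitrary assumptions.

def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a poor prefix can be dropped by restarting at 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning both sequences end to end.
    # Local: best score of any pair of substrings.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    x, y = "TTTACGTTT", "GGGACGGGG"
    print("global:", alignment_score(x, y))
    print("local: ", alignment_score(x, y, local=True))
```

For these two toy sequences the shared `ACG` core dominates the local score, while the global score is pulled down by the mismatched flanks.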

Page 10:

Sequence-based Database Searching

• Alignment scoring

• Substitution scores and substitution matrices: PAM, BLOSUM

• Affine gap costs / gap penalties and gap scores

• Optimal alignments by dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)

• Additional heuristics to speed up the search: FASTA, BLAST
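The kind of heuristic that FASTA and BLAST add on top of dynamic programming can be hinted at with a word-lookup sketch: index every word of length k in the database once, then use the query's words to pick out candidate sequences worth aligning in full. This is a heavily simplified illustration of the seeding idea only, not the actual FASTA or BLAST algorithm; the function names, the toy database and the `min_hits` threshold are assumptions.

```python
# Sketch of word-based seeding: find database sequences that share
# exact k-letter words with the query, before any full alignment is run.
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def candidate_sequences(query, index, k, min_hits=2):
    """Return names of sequences sharing at least min_hits word hits with the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], ()):
            hits[name] += 1
    return sorted(name for name, count in hits.items() if count >= min_hits)


if __name__ == "__main__":
    database = {"s1": "ACGTACGTGA", "s2": "TTTTTTTTTT", "s3": "GGACGTAC"}
    index = build_word_index(database, k=4)
    print(candidate_sequences("ACGTAC", index, k=4))
```

Only the candidates returned here would go on to the slower, exact alignment step, which is where the speed-up comes from.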

Page 11:

Some definitions

• Affine gap costs - a scoring system for gaps within alignments that charges a penalty for opening a gap plus an additional per-residue penalty proportional to the size of the gap

• Alignment score - a numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment.

• Algorithm - a fixed procedure embodied in a computer program

• Heuristics - a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules.

• Gapped alignment - an alignment of sequences in which gaps are permitted
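The affine gap cost defined above can be written as a one-line function. The sketch assumes the convention cost = gap_open + gap_extend × length; some programs instead charge the extension penalty only from the second residue on. The default values of 10 and 1 are illustrative, not taken from the slides.

```python
# Sketch of an affine gap cost: a fixed opening penalty plus a
# per-residue extension penalty. Parameter values are assumptions.

def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of a single gap spanning `length` residues."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # One gap of three residues is much cheaper than three separate
    # one-residue gaps, so alignments with fewer, longer gaps are favoured.
    print(affine_gap_cost(3), 3 * affine_gap_cost(1))
```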

Page 12:

Computational Genefinding

• A major challenge in genome projects

• Given a DNA sequence, where does a gene begin and stop? - ORF

• Where are the exons and introns?

• Where are the transcription elements?

• Gene structure and other regulatory elements?
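The simplest answer to "where does a gene begin and stop?" is an ORF scan. The sketch below looks for ATG ... stop stretches in the three forward reading frames. It is a deliberately naive illustration: the reverse strand is not scanned, introns are ignored, and the function name and the `min_codons` cutoff are assumptions.

```python
# Naive ORF scan over the three forward reading frames.
# An ORF here runs from the first ATG in a frame to the next in-frame stop codon.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop, including the ATG itself.
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGCC"
    for start, end, frame in find_orfs(seq):
        print(frame, start, end, seq[start:end])
```

To cover both strands, the same function can be run on the reverse complement of the sequence.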

Page 13:

Genomic Elements

• Intron-exon splice sites
• Start and stop codons
• Branch points
• Promoters and terminators of transcription
• Polyadenylation sites
• Ribosomal binding sites
• Topoisomerase II binding sites
• Topoisomerase I cleavage sites
• Transcription factor binding sites

Page 14:

Detecting Genomic Elements

• Local sites and motifs/patterns for such elements: signals and signal sensors

• Extended variable-length regions, e.g. exons and introns: contents and content sensors

• Linguistic technique: gene structure described in a formal grammar (GeneLang genefinding program)

Page 15:

Signal sensors

• Simple consensus sequence: use of pattern-matching algorithms

• Weight matrices: allow the weighted scores from each weight-matrix sensor to be summed

• Use of artificial neural networks (ANN)
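A weight-matrix signal sensor can be sketched as follows: build a log-odds matrix from a set of aligned example sites, then slide it along a sequence and sum the per-position weights for each window. The example sites, the pseudocount and the uniform background frequency of 0.25 are all assumptions made for illustration.

```python
# Sketch of a position weight matrix (log-odds) signal sensor.
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """One dict of log2-odds weights per alignment column."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score_site(matrix, window):
    """Sum the weight of the observed base at each position."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_site(matrix, sequence):
    """Return (score, position) of the highest-scoring window."""
    width = len(matrix)
    return max(
        (score_site(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Hypothetical aligned examples of a TATAAT-like signal.
    sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
    matrix = build_weight_matrix(sites)
    print(best_site(matrix, "GGCGTATAATGCGC"))
```

Unlike a strict consensus match, the summed score degrades gradually as a candidate site departs from the examples.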

Page 16:

Content Sensors

• Long ORFs for bacteria

• Statistical models, e.g. Markov models (GeneMark): statistical models of nucleotide frequencies and dependencies in codon structure

• Neural nets, e.g. Grail: exon detection by a neural network combined with signal sensors for exon-intron splice sites

Page 17:

Some Definitions

• Artificial Neural Nets - a statistical pattern recognition method; a type of nonlinear regression

• Markov Models - statistical models for sequences in which the probability of each residue depends on the residues preceding it.

• Dynamic Programming - a type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures
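The Markov model definition above can be made concrete with a first-order chain, in which each residue depends only on the one immediately before it. The sketch estimates transition probabilities from training sequences and then scores a new sequence by its log-probability. The pseudocount and the uniform probability for the first residue are assumptions; the training data are a toy example.

```python
# Sketch of a first-order Markov chain over nucleotides.
import math


def train_markov_chain(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(current | previous) from counts of adjacent residue pairs."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }


def log_probability(model, seq):
    """Log-probability of seq; the first residue is assumed uniformly distributed."""
    logp = math.log(1.0 / len(model))
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(model[prev][cur])
    return logp


if __name__ == "__main__":
    model = train_markov_chain(["ACGCGCGCGT", "CGCGCGCG"])
    print(log_probability(model, "CGCG"), log_probability(model, "CACA"))
```

Comparing the log-probability of a sequence under two such models, for example one trained on coding and one on non-coding DNA, is the basic idea behind a statistical content sensor.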

Page 18:

Other Genefinding methods

• Use of dynamic programming
  – Linguistic rules for functional features
  – Parameters of a Markov process on hidden variables: hidden Markov models (HMM)
• HMM genefinders: EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan
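How dynamic programming and hidden Markov models combine can be shown with a toy Viterbi decoder. The two-state model below (a GC-rich "C" state and an AT-rich "N" state) and all of its probabilities are invented for illustration; real HMM genefinders such as those listed above use far richer state structures. The decoder returns the most probable hidden state path for an observed sequence.

```python
# Toy Viterbi decoding for a two-state HMM, computed in log space.
# States and probabilities are made-up assumptions, not from any genefinder.
import math

STATES = ("N", "C")  # N: AT-rich "non-coding", C: GC-rich "coding-like"
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation string."""
    layer = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for symbol in observations[1:]:
        new_layer, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(states, key=lambda p: layer[p] + math.log(trans_p[p][s]))
            new_layer[s] = (
                layer[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][symbol])
            )
            pointers[s] = best_prev
        layer = new_layer
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: layer[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "AATTGCGCGCAATT"
    print(seq)
    print("".join(viterbi(seq, STATES, START, TRANS, EMIT)))
```

The decoded path labels each position with a state, which is how an HMM genefinder turns a raw sequence into a proposed segmentation.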