Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU: Sequence Databases, Their Use and BLAST

Post on 24-Dec-2015


TRANSCRIPT

  • Slide 1
  • Sequence Databases, Their Use and BLAST. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
  • Slide 2
  • Why Sequence Databases? Electronic databases are fast becoming the lifeblood of the field because of the power of biomolecular sequence comparison. Discoveries based solely on sequence homology have become routine.
  • Slide 3
  • The first success story: Cellular growth factors are particular proteins or hormones needed to stimulate or sustain the growth of a cell colony. By the early 1970s it was understood that certain viruses could cause particular cells in culture (in vitro) to grow without bound. This cancer-like transformation of cultured cells by viruses suggested that viral infection could be a cause of cancer in animals, but the mechanisms were unknown.
  • Slide 4
  • The first success story: An oncogene is a gene that is mutated or expressed at high levels, and thus helps turn a normal cell into a tumor cell. It was hypothesized that certain genes in the infecting viruses (oncogenes) encode cellular growth factors. The virus-infected cells would thus produce uncontrolled quantities of the growth factor, allowing the cell colony to grow beyond its normal limits.
  • Slide 5
  • The first success story: The hypothesis is now generally accepted. However, the link between oncogenes and growth factors did not come from a direct test of this hypothesis. Instead, it was an unanticipated result of merging two independent sets of data via a computer search.
  • Slide 6
  • The first success story: Simian sarcoma virus is a retrovirus that was known by the early 1970s to cause cancer in a specific species of monkeys. A retrovirus is an RNA virus whose genome must be converted to DNA in the infected cell before the virus can replicate.
  • Slide 7
  • Slide 8
  • The first success story: By the early 1970s a retrovirus was known to cause cancer. Its oncogene, named v-sis, was isolated and sequenced in 1983. A partial amino acid sequence of an important growth factor, platelet-derived growth factor (PDGF), was published at about the same time. When the two sequences were compared, one region of 31 amino acids had 26 exact matches, and another region of 39 residues had 35 exact matches.
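
The match counts on this slide translate directly into percent identity, which is roughly 84% and 90% for the two regions. A minimal Python sketch of that arithmetic follows; the function name `percent_identity` is an illustrative choice and does not come from the slides.

```python
def percent_identity(matches: int, length: int) -> float:
    """Return exact matches as a percentage of the compared region length."""
    if length <= 0:
        raise ValueError("region length must be positive")
    return 100.0 * matches / length


# The two v-sis / PDGF regions reported on the slide: (exact matches, region length).
regions = [(26, 31), (35, 39)]

for matches, length in regions:
    identity = percent_identity(matches, length)
    print(f"{matches}/{length} exact matches = {identity:.1f}% identity")
```

Identity this high over regions of 30 to 40 residues is very unlikely to arise by chance, which is why the match was taken as evidence of a biological relationship.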
  • Slide 9
  • A More Recent Story: The first complete DNA sequence of a free-living organism was reported in 1995. A total of 1,743 putative coding regions were identified. Each of these 1,743 strings was then translated to one or more proteins (depending on reading frames). These protein sequences were searched against the protein sequence database Swiss-Prot for sufficiently similar sequences. In this way, 1,007 of the putative genes not only matched entries in the database, but matched in such an unambiguous manner that the specific biochemical function could be deduced for each one.
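
The phrase "one or more proteins (depending on reading frames)" can be made concrete with a small sketch. The Python below translates a DNA string in each of the three forward reading frames using the standard genetic code; the function name `translate` and the toy sequence are assumptions made for illustration, and the three frames of the reverse-complement strand are left out for brevity.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate dna starting at offset `frame` (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons containing characters outside T, C, A, G are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


toy_sequence = "ATGGCCTAA"  # made-up example, not real genome data
for frame in range(3):
    print(frame, translate(toy_sequence, frame))
```

Each frame gives a different candidate protein string, so a single putative region can yield several queries for the database search.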
  • Slide 10
  • Indirect Applications of Database Search: Clustering similar sequences into sequence families. Such families may reveal important conserved biological phenomena that had not been observed by laboratory work and that would be hard to recognize by looking at two sequences alone. There are also many clever ways to apply database search to both biological and biotechnical problems, such as sequence assembly in bacteria, and multiple sclerosis and database search.
  • Slide 11
  • Multiple Sclerosis and Database Search: A disease that is not well understood.
  • Slide 12
  • Multiple Sclerosis and Database Search: Multiple sclerosis (MS) is an autoimmune disease, meaning that the immune system incorrectly identifies native cells as foreign invaders. The first line of attack in the immune system is the T-cells, which identify foreign matter. Once foreign matter is identified, other elements of the immune system attack and destroy it.
  • Slide 13
  • Multiple Sclerosis and Database Search
  • Slide 14
  • Recently, specific T-cells were found that identify proteins or protein segments that appear on the surface of myelin cells. It is natural to hypothesize that those T-cells (which mistakenly identify proteins on the myelin surface as foreign) had previously been generated by the immune system to (correctly) identify similar proteins on the surface of bacteria or viruses. In other words, some bacteria or viruses have proteins on their outer surface that are very similar to myelin surface proteins. How would you test which bacteria or viruses are involved?
  • Slide 15
  • Multiple Sclerosis and Database Search: Myelin surface proteins were sequenced. A search was conducted in the protein databases for highly similar proteins in bacteria and viruses. About 100 proteins were found. Laboratory work then verified that the specific T-cells that attack the myelin sheath also attack particular proteins found by the database search.
  • Slide 16
  • Multiple Sclerosis and Database Search: So the database search not only confirmed the hypothesis, but also identified the particular bacterial and viral proteins that are confused with proteins on the myelin surface. The hope is that by examining the similarities among those bacterial and viral protein sequences, one might better understand which features of the myelin surface proteins are used by the T-cells to mistakenly identify myelin cells as foreign.
  • Slide 17
  • Biological Databases: There are over 1000 biological databases. They vary in size, quality, coverage, and level of interest. Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research. What makes a good database? Comprehensiveness; accuracy; being up to date; a good interface; batch search/download; an API (web services, DAS, etc.).
  • Slide 18
  • GenBank, the Granddaddy: Its purpose is to store and facilitate retrieval of all DNA sequences ever made public. It is now maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH), USA. The European counterpart is the EMBL Data Library. The DNA Data Bank of Japan (DDBJ) exists in Japan. There is also another one, the Genome Sequence DataBase (GSDB). These four databases share information among themselves, so a submission to one is effectively a submission to all.
  • Slide 19
  • [Figure: data exchange among GenBank (NCBI, NIH; Entrez), EMBL (EBI; SRS), and DDBJ (CIB, NIG; getentry), each receiving submissions and updates.]
  • Slide 20
  • Ten Important Bioinformatics Databases: GenBank (www.ncbi.nlm.nih.gov), nucleotide sequences; Ensembl (www.ensembl.org), human/mouse genome (and others); PubMed (www.ncbi.nlm.nih.gov), literature references; NR (www.ncbi.nlm.nih.gov), protein sequences; SWISS-PROT (www.expasy.ch), protein sequences; InterPro (www.ebi.ac.uk), protein domains; OMIM (www.ncbi.nlm.nih.gov), genetic diseases; Enzymes (www.chem.qmul.ac.uk), enzymes; PDB (www.rcsb.org/pdb/), protein structures; KEGG (www.genome.ad.jp), metabolic pathways. Source: Bioinformatics for Dummies.
  • Slide 21
  • Types of Database: Sequence and Bibliographic; Genomic; Clinical and Mutation; Homologies; Integrated. Most databases are accessible from a web page and are interlinked.
  • Slide 22
  • Sequence Databases: The main nucleic acid sequence databases are EMBL, GenBank, and DDBJ. The main protein sequence database is Swiss-Prot; there are also TrEMBL and GenPept. These are often integrated with other databases.
  • Slide 23
  • Integrating Sequence and Bibliographic Databases: Entrez links nucleic acid sequences, protein sequences and MEDLINE; it is powerful and easy to use. SRS (Sequence Retrieval System) is a universal system for searching sequence and other databases; it is available worldwide, including at HGMP (Human Genome Mapping Project).
  • Slide 24
  • Genomic Databases: GDB (Human Genome Database) is a repository for mapping and genomic data for the Human Genome Project; it is powerful and links to other databases. ACeDB was developed to provide access to C. elegans data.
  • Slide 25
  • Clinical and Mutation Databases: OMIM (Online Mendelian Inheritance in Man) is a database of disease-linked genes and associated phenotypes, with links to Entrez, GDB and other databases. HGMD is a database of sequences and phenotypes of disease-causing mutations. There are also disease-specific mutation databases.
  • Slide 26
  • Homology Databases: MGI provides mapping and gene expression data for the mouse, human homologies, and links to GDB and Entrez. Many other organism-specific databases can be used to search for homologs of human genes.
  • Slide 27
  • An Integrated Database: GeneCards is an integrated resource of information on human genes and their products, with a major emphasis on human disease. It links to many kinds of biomedical information: sequence databases; OMIM, HGMD, MDB; Doctor's Guide to the Internet.
  • Slide 28
  • Databases: Primary (archival): GenBank/EMBL/DDBJ, UniProt, PDB, Medline (PubMed), BIND. Secondary (curated): RefSeq, Taxon, UniProt, OMIM, SGD.
  • Slide 29
  • NCBI (National Center for Biotechnology Information): Over 30 databases, including GenBank, PubMed, OMIM, and GEO. Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/).
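
Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an `esearch` request URL and makes no network call; the helper name `esearch_url` and the example search term are assumptions for illustration, and current usage limits should be checked in NCBI's documentation before sending real requests.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities, the programmatic interface to Entrez.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL that asks one Entrez database for matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Hypothetical query: nucleotide records mentioning the v-sis oncogene.
print(esearch_url("nucleotide", "v-sis", retmax=5))
```

Swapping the `db` value (for example to `protein` or `pubmed`) targets a different Entrez database with the same request shape.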
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006. www.ncbi.nlm.nih.gov/GenBank
  • Slide 43
  • Slide 44
  • The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only. RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time.
  • Slide 45
  • Slide 46
  • Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
  • Slide 47
  • Slide 48
  • Slide 49
  • The MOD squad: Most model organism communities have established organism-specific Model Organism Databases (MODs). Many of these databases have different schemas and implementations, although there is movement toward harmonizing many features via the Generic Model Organism Database project.
  • Slide 50
  • The MOD squad: SGD: yeast (www.yeastgenome.org); WormBase: C. elegans (www.wormbase.org); FlyBase: Drosophila (flybase.bio.indiana.edu); ZFIN: zebrafish (zfin.org); and many others (Xenopus, Dictyostelium, Arabidopsis).
  • Slide 51
  • The MOD squad: what about Homo sapiens? There is no true model organism database for human. The two main sources of genome information that have evolved are the UCSC Genome Browser (genome.ucsc.edu) and Ensembl (www.ensembl.org).
  • Slide 52
  • UCSC Browser
  • Slide 53
  • Slide 54
  • Ensembl
  • Slide 55
  • Slide 56
  • Slide 57
  • Protein Data Bank (PDB)
  • Slide 58
  • [Figure: PDB growth chart; legend: total, yearly]
  • Slide 59
  • Protein Data Bank (PDB)
  • Slide 60
  • Real Sequence Database Search: FASTA: "fast-all", pronounced "fast-AY". BLAST: Basic Local Alignment Search Tool. These tools can be studied from two perspectives. Algorithmic: what algorithms and weight matrices do they use? Technical: how are they used, how should the search results be interpreted, and how can the different parameters be tuned to get meaningful results?
  • Slide 61
  • BLAST: We will cover the technical perspective. Assignment 2: Download and run BLAST, and prepare a report covering what BLAST is, the different types of BLAST and their insights, search result analysis, and so on. Submission deadline: August 07.
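
For the search result analysis part of the assignment, one possible starting point is the tabular output of NCBI BLAST+, produced for example by `blastp -query query.fasta -db swissprot -outfmt 6 -out hits.tsv` (the file and database names here are placeholders). The Python sketch below assumes the default 12-column `-outfmt 6` layout; the sample rows are made up for illustration and are not real BLAST results.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse default BLAST+ `-outfmt 6` rows: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits


def significant(hits: List[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits whose E-value is at or below the chosen cutoff."""
    return [hit for hit in hits if hit.evalue <= max_evalue]


# Made-up rows for illustration only.
SAMPLE = (
    "query1\tsubjA\t83.87\t31\t5\t0\t1\t31\t10\t40\t2e-12\t58.5\n"
    "query1\tsubjB\t35.00\t40\t26\t0\t3\t42\t7\t46\t0.45\t24.3\n"
)

for hit in significant(parse_tabular(SAMPLE)):
    print(hit.subject, hit.percent_identity, hit.evalue)
```

The E-value cutoff of 1e-5 is an arbitrary illustrative choice; deciding how strict it should be is exactly the kind of parameter tuning the technical perspective is concerned with.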
  • Slide 62
  • Reference: Chapter 15, Algorithms on Strings, Trees and Sequences, by Dan Gusfield.