
Informatics for Molecular Biologists

Ansuman Chattopadhyay, PhD
Head, Molecular Biology Information Service

Falk Library, Health Sciences Library System

University of Pittsburgh

Molecular Biology Information Service

Falk Library of Health Sciences
Health Sciences Library System
University of Pittsburgh
200 Scaife Hall
Desoto and Terrace Streets
Pittsburgh, PA 15261

Topics

• Searching tools
– Internet
– PubMed
• NCBI-developed bioinformatics tools
– Entrez Gene
• Structure visualization tools
– Cn3D
• Genome browsers
– UCSC Genome Browser
– NCBI Map Viewer

Information search space

• Biomedical literature databases

• Molecular databases

• Organism whole genome sequences

Literature database

• NCBI PubMed
– contains over 15 million citations dating back to the mid-1950s.

Search:
“apoptosis”: 130,476
“breast cancer”: 160,055
“p53”: 42,418
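The counts above came from interactive PubMed searches. As a rough sketch, a similar count could be requested programmatically through NCBI's E-utilities ESearch endpoint. The helper name `pubmed_count_url` is made up for illustration, the code only builds the request URL (it does not contact NCBI), and current counts will differ from the figures on this slide.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term):
    """Build an ESearch URL that asks PubMed only for the number of hits.

    `pubmed_count_url` is a hypothetical helper, not part of any NCBI library.
    """
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,         # the query text, e.g. "apoptosis"
        "rettype": "count",   # return only the hit count
        "retmode": "json",    # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    for query in ("apoptosis", "breast cancer", "p53"):
        print(pubmed_count_url(query))
```

Opening one of the printed URLs in a browser, or fetching it with any HTTP client, should return a small JSON document containing the count.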

Molecular databases

[Figure: articles and databases by year, 1996 to 2004]

Organisms whole genome sequences

http://www.genomesonline.org/

Internet for Biologists

• Google vs. Clusty
– Google: one ranked list of search results
– Clusty: search results categorized into topical clusters

Vivísimo's clustering technology creates topical categories on the fly from the search results, using terms in the title, snippet, and any other available textual description in the search results themselves.
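To make the idea of on-the-fly clustering concrete, here is a toy sketch that groups results by the most widely shared term in their titles and snippets. It is not Vivísimo's algorithm, which is not described in these slides; the function names, the stopword list, and the sample results are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "is"}

# Made-up search results for the query "Pittsburgh".
SAMPLE_RESULTS = [
    {"title": "Pittsburgh Steelers official site", "snippet": "Football news and schedule"},
    {"title": "Steelers football tickets", "snippet": "Pittsburgh football games"},
    {"title": "University of Pittsburgh", "snippet": "Research university admissions"},
    {"title": "Carnegie Mellon University", "snippet": "University in Pittsburgh"},
]


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def cluster_results(results, query):
    """Group result titles under the most widely shared non-query term."""
    query_terms = set(tokenize(query))
    # Distinct terms of each result, ignoring the query words themselves.
    per_result = [
        set(tokenize(r["title"] + " " + r["snippet"])) - query_terms for r in results
    ]
    # Document frequency: in how many results does each term occur?
    doc_freq = Counter(term for terms in per_result for term in terms)
    clusters = defaultdict(list)
    for result, terms in zip(results, per_result):
        shared = [t for t in terms if doc_freq[t] > 1]
        # Pick the most common shared term; break ties alphabetically (last wins).
        label = max(shared, key=lambda t: (doc_freq[t], t)) if shared else "other"
        clusters[label].append(result["title"])
    return dict(clusters)


if __name__ == "__main__":
    for label, titles in cluster_results(SAMPLE_RESULTS, "Pittsburgh").items():
        print(label, "->", titles)
```

With the sample data, the sports pages and the university pages end up in separate clusters, which is the effect the Clusty slides describe.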

Google Vs Clusty

• Search example: Pittsburgh
– Google
– Clusty

Clusty

Clusters help you see your search results by topic, so you can zero in on exactly what you’re looking for or discover unexpected relationships between items.

Search examples for Clusty

• SNP

• BLAST

• Lupus

Web 2.0

• Website bookmark and tagging tool
– Del.icio.us: a social bookmarking web service for storing, sharing, and discovering web bookmarks.

Web 2.0

• Connotea; http://www.connotea.org/

Medline searching tool

• PubMed vs. ClusterMed

Search example: macular degeneration, cell cycle, p53

Molecular databases

• DNA Sequence Databases and Analysis Tools

• Enzymes and Pathways

• Gene Mutations, Genetic Variations and Diseases

• Genomics Databases and Analysis Tools

• Immunological Databases and Tools

• Microarray, SAGE, and other Gene Expression

• Organelle Databases

• Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others)

• Plant Databases

• Protein Sequence Databases and Analysis Tools

• Proteomics Resources

• RNA Databases and Analysis Tools

• Structure Databases and Analysis Tools

HSLS OBRC

• http://www.hsls.pitt.edu/guides/genetics/obrc/

Types of databases

– By level of curation:

• Archival

–GenBank, GenPept, ssSNP

• Curated

–RefSeq, Swiss-Prot, RefSNP

Types of databases

– Archival data
• repository of information
• redundant; might have many sequence records for the same gene, each from a different lab
• submitters maintain editorial control over their records: what goes in is what comes out
• no controlled vocabulary
• variation in annotation of biological features

Example: GenBank record

GenBank

• archival database of nucleotide sequences from >130,000 organisms

• records annotated with coding region (CDS) features also include amino acid translations

• each record represents the work of a single lab

• redundant; can have many sequence records for a single gene

International Nucleotide Sequence Database Collaboration

Types of databases

RefSeq

• Curated data
– non-redundant; one record for each gene, or each splice variant
– each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article
– records contain value-added information that has been added by one or more experts

RefSeq

• Database of reference sequences

• Curated

• Non-redundant; one record for each gene, or each splice variant, from each organism represented

• A representative GenBank record is used as the source for a RefSeq record

• Value-added information is added by an expert(s)

• Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article

• Variety of accession number prefixes (NM_ , NP_ , etc.) and status codes (provisional, reviewed, etc.). More about those in later slides.

• RefSeq database includes genomic DNA, mRNA, and protein sequences, so organizes information according to the model of the central dogma of biology
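The accession prefixes mentioned above indicate what kind of molecule a RefSeq record describes. As a small sketch, the common prefixes can be looked up from a table; the function name `refseq_molecule_type` and the wording of the descriptions are illustrative choices, and the table covers only a subset of RefSeq prefixes.

```python
# Common RefSeq accession prefixes and the molecule type they denote.
# Only a subset of prefixes is listed here.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete genomic molecule)",
    "NG_": "genomic (incomplete genomic region)",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "mRNA (predicted model)",
    "XR_": "non-coding RNA (predicted model)",
    "XP_": "protein (predicted model)",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix.

    `refseq_molecule_type` is a hypothetical helper written for this example.
    Accessions with an unlisted prefix are reported as "unknown".
    """
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


if __name__ == "__main__":
    # Both accessions appear elsewhere in these slides.
    for acc in ("NP_000537.2", "NP_247556"):
        print(acc, "->", refseq_molecule_type(acc))
```

This reflects the central-dogma organization described above: genomic (NC_, NG_), transcript (NM_, NR_), and protein (NP_) records each have their own prefix.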

RefSeq

Searching GenBank

• Find messenger RNA sequence for Human epidermal growth factor (EGF) gene.

Database developers

• NCBI

• EBI

Neighbors and Hard Links

[Figure: Entrez databases (Genomes, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure) connected by links labeled Word weight, BLAST, 3-D Structure, and Phylogeny. Source: NCBI]

NCBI Tools

Entrez Gene

NCBI’s database for gene-centric information. It focuses on genomes that:

• have been completely sequenced
• have an active research community to contribute gene-specific information
• are scheduled for intense sequence analysis

– Total taxa: 4,246; total genes: 2,843,587

• 160,000 organisms in the nucleotide sequence database (GenBank)

Entrez Gene

• each record represents a single gene from a given organism

Gene record includes:
– a unique identifier or GeneID assigned by NCBI
– a preferred symbol
– and any one or more of:
– sequence information
– map information
– official nomenclature from an authority list
– alternate gene symbols
– summary of gene/protein function
– published references that provide additional information on function
– expression
– homology data
– and more

[Figure: information linked to a gene/protein: genomic sequence, exon-intron structure, expression profile, interacting partners, 3D structure, mRNA sequence, chromosomal localization, disease, amino acid sequence, homologous sequences]

Searching Entrez Gene

Entrez Gene

Find:
• gene symbols and aliases
• sequences: genomic, mRNA, protein
• intron-exon architecture
• genomic context: neighboring and antisense genes
• interacting partners
• associated gene ontology terms: function, cellular component, and biological process

Entrez Gene record
Query: BRCA1

Search Tips:
Query text box: BRCA1
Limits:
• To limit your search to a specific field, select “Gene name” from the drop-down menu
• Limit by taxonomy: select “Homo sapiens”
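The field and taxonomy limits described above can also be typed directly into the query box as field-tagged terms. The sketch below assembles such a query string; the helper names `field_term` and `and_query` are invented for this example, and the field tags used are Entrez's `[Gene Name]` and `[Organism]`.

```python
def field_term(value, field):
    """Return an Entrez term restricted to one search field.

    Multi-word values are quoted so they are searched as a phrase.
    `field_term` is a hypothetical helper, not an NCBI function.
    """
    if " " in value:
        value = '"' + value + '"'
    return value + "[" + field + "]"


def and_query(*terms):
    """Join field-tagged terms with the Boolean operator AND."""
    return " AND ".join(terms)


if __name__ == "__main__":
    query = and_query(
        field_term("BRCA1", "Gene Name"),
        field_term("Homo sapiens", "Organism"),
    )
    print(query)  # BRCA1[Gene Name] AND "Homo sapiens"[Organism]
```

Pasting the printed string into the Entrez Gene search box should have the same effect as choosing the limits from the menus.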

Name and aliases

Chromosomal location

Source: NCBI

Entrez Gene: sequences and genomic context

Sequences: mRNA, Genomic, Protein

mRNA Seq

Protein Seq

Genomic Seq

Transcription and alternative splicing

Alternative splicing: http://www.exonhit.com/UserFiles/Image/epissage.swf?PHPSESSID=d9u8tiu2sioqa8u29bkop3l0l2

Entrez Gene: intron-exon architectures

Tips: Change Display to “Gene Table” from “Summary”

Genomic Seq

mRNA Seq

Protein Seq

Gene Ontology

– Controlled vocabulary tagging

• Function

• Biological Processes

• Cellular Component

Entrez Gene : Gene Ontology

Homologous sequences

Entrez Gene: Homologous sequence

Tips: change Display settings from “Summary” to “Alignment score” to “Multiple Alignment”

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA.
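As a minimal sketch of the example above, the following function compares two equal-length sequences and reports where single bases differ. The function name is made up for illustration, and real SNP calling involves alignment and quality checks that are not shown here.

```python
def single_nucleotide_differences(reference, variant):
    """List (position, reference_base, variant_base) for each mismatch.

    Positions are 1-based. Both sequences must have the same length,
    because this sketch handles substitutions only, not insertions or deletions.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, ref_base, var_base)
        for i, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]


if __name__ == "__main__":
    # The two sequences from the slide text.
    print(single_nucleotide_differences("AAGGCTAA", "ATGGCTAA"))  # [(2, 'A', 'T')]
```

For the slide's example, the only difference is at position 2, where A is replaced by T.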

Coding SNPs

Entrez Gene: SNPs

Protein Info: HPRD

Entrez Gene: Links

Entrez Gene: Linkout

Seq to Entrez Gene: UCSC BLAT
Query Seq: SGLTPEEFMLVYKFARKHHITLTNLITEE

BLAT to Entrez Gene

Find the chromosomal location of your gene of interest. How many exons have been reported for your gene? What are its neighboring genes?

Query sequence: IHYNYMCNSSCMGGMNRRPILTII

Hands-On Exercise Question

Exercise:

Find the protein sequence for rat leptin.

BLAT this sequence vs. the human genome to find the human homolog.

Look for SNPs in the coding region of this gene. Are there any?

Sequence alignment

• Pairwise alignment
• Multiple alignment

Pairwise alignment

• Global
– Needleman-Wunsch (1970)
• Local
– Smith-Waterman (1981)
– Lipman and Pearson / FASTA (1985)
– Basic Local Alignment Search Tool (BLAST: 1990)
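The difference between global and local alignment can be seen in a short dynamic-programming sketch. The two functions below compute only the optimal score (no traceback), using a simple scoring scheme of +1 for a match, -1 for a mismatch, and -1 per gap position; these parameter values are assumptions chosen for illustration, not the defaults of any particular tool.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("AAGGCTAA", "ATGGCTAA"))  # 6
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))    # 3
```

The global score penalizes every unaligned position, while the local score reports only the best matching region (here the shared "ACG"). FASTA and BLAST are heuristic methods that approximate this kind of local search much faster than a full dynamic-programming pass over a database.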

To find homologous sequence for a sequence of interest by searching sequence databases:

Nucleotide:

TTGGATTATTTGGGGATAATAATGAAGATAGCAATTATCTCAGGGAAAGGAGGAGTAGGAAAATCTTCTA
TTTCAACATCCTTAGCTAAGCTGTTTTCAAAAGAGTTTAATATTGTAGCATTAGATTGTGATGTTGAT

Protein:

MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFE
NELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG

• Find statistically significant matches, based on sequence similarity, to a protein or nucleotide sequence of interest.
• Obtain information on the inferred function of the gene or protein.
• Find conserved domains in your sequence of interest that are common to many sequences.
• Compare two known sequences for similarity.

What you can do with BLAST

• Find homologous sequences in all combinations (DNA/protein) of query and database.
– DNA vs. DNA
– DNA translation vs. protein
– Protein vs. protein
– Protein vs. DNA translation
– DNA translation vs. DNA translation
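Each query/database combination listed above corresponds to one of the standard BLAST programs. The lookup below is a small sketch of that mapping; the dictionary layout and the function name `blast_program` are illustrative choices.

```python
# (query type, database type) -> BLAST program name.
BLAST_PROGRAMS = {
    ("DNA", "DNA"): "blastn",
    ("DNA translation", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "DNA translation"): "tblastn",
    ("DNA translation", "DNA translation"): "tblastx",
}


def blast_program(query_type, database_type):
    """Return the BLAST program for a query/database combination.

    Raises KeyError for combinations that are not in the table.
    """
    return BLAST_PROGRAMS[(query_type, database_type)]


if __name__ == "__main__":
    for (query_type, db_type), program in BLAST_PROGRAMS.items():
        print(f"{query_type} vs. {db_type}: {program}")
```

For example, a protein query searched against a translated nucleotide database uses tblastn, which is useful when the database contains DNA that has not been annotated with proteins.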

BLAST exercise

• Find homologous sequences for uncharacterized archaebacterial protein, NP_247556, from Methanococcus jannaschii

BLAST search

Sort by E values

2 × 10^-65

Sequence description

Link to Entrez

The display cutoff (100 hits) overrides the E value cutoff (10)

Descriptions of hits

BLAST search

• Orthologs from closely related species will have the highest scores and lowest E values
– Often E = 10^-30 to 10^-100

• Closely related homologs with highly conserved function and structure will have high scores
– Often E = 10^-15 to 10^-50

• Distantly related homologs may be hard to identify
– Less than E = 10^-4
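The E values above depend on both the alignment score and the size of the search. A minimal sketch, assuming the standard bit-score relation E = m × n × 2^(−S′), where S′ is the bit score, m the query length, and n the total database length; the function name `expected_hits` is invented for this example, and real BLAST output also applies length corrections not shown here.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with m = query_length and
    n = database_length (both in residues or bases).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The numbers below are arbitrary illustrations, not results from a real search.
    m, n = 250, 1_000_000_000
    for score in (40, 80, 160):
        print(score, expected_hits(score, m, n))
```

Two consequences follow from the formula: each additional bit of score halves the E value, and the same alignment gets a larger E value when searched against a larger database.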

Protein domains

• Wikipedia

SH2: Src homology 2 domains. Involved in signal transduction through recognition of phosphorylated tyrosine (pTyr). SH2 domains typically bind pTyr-containing ligands via two surface pockets, a pTyr pocket and a hydrophobic binding pocket, allowing proteins with SH2 domains to localize to tyrosine-phosphorylated sites.

Searching CDD

• CDD SEARCH

Query sequence:

• BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database. This graphical output includes:

– Alignment of up to 200 BLAST hits on the query sequence
– Best hits to each organism
– List of known protein domains in the query sequence
– Filter hits by selecting the BLAST cutoff score
– Distribution of hits by taxonomic grouping
– Display of similar sequences with known 3D structure
– Filter hits by database and/or by taxonomic grouping
– Display a taxonomic tree of all organisms with similar sequences

Access: Link out from NCBI protein records

Link to TP53 BLink: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_000537.2&dopt=gp

Protein structure

Protein Data Bank (PDB)

• international database of 3-D biological macromolecular structures

• accepts direct submissions of structure data

• maintained by a nonprofit organization, the Research Collaboratory for Structural Bioinformatics (RCSB), associated with Rutgers University, San Diego Supercomputer Center, and the Biotechnology Division of the National Institute of Standards and Technology

• contains molecular structures of proteins and nucleic acids, primarily structures experimentally-derived by X-Ray crystallography and NMR

• also includes some theoretical models, though they are not encouraged.

3D structure viewing software

• NCBI Cn3D

• FirstGlance in Jmol

A simple tool for macromolecular visualization.

The Cn3D home page includes a link in the blue sidebar for instructions on installing Cn3D, which is available for PC, Mac, and Unix.

• View the 3-dimensional structure for 1TUP and practice using some of the Cn3D features that allow you to:

– spin the structure using your mouse
– use the control+left mouse button combination to zoom in and out of the structure
– use the shift+left mouse button combination to move the structure across the viewing window
– use the Style menu to render the structure in different ways (e.g., worms, space fill, ball and stick, ...)
– use the Style menu to color the structure in different ways (e.g., secondary structure, domain, ...)
– use Style/Edit Global Style to label every 20th amino acid

What is it?

A genome browser is a computer program that displays gene maps, lets you browse chromosomes, and aligns genes or gene models with ESTs, contigs, and similar features.

Genome Sequence Project Time Line

1976: RNA bacteriophage MS2

1995: Haemophilus influenzae

2003: Human genome reference sequence

2005: 265 genomes; 21 archaeal, 211 bacterial, 33 eukaryotic

http://www.genomesonline.org/

Genome Browsers

• NCBI Map Viewer

• EBI Ensembl

• UCSC Genome Browser
