Bioinformatics for Genomic and Proteomic data analysis
• Sequence Analysis
-- Gene prediction.
-- Phylogenetic analysis.
-- Alignment techniques (BLAST, PSI-BLAST).
-- Major databases and retrieval techniques.
-- Predicting function, domains, etc.
-- Predicting physico-chemical properties of proteins (ProtParam).
-- Predicting signal peptides and transmembrane proteins (SignalP).
-- Finding homology between sequences, identifying repeats, etc. (DOTPLOT).
• Structure analysis
-- Analysis of protein structure and conformation (RasMol, Swiss-PdbViewer, VMD, etc.).
-- Protein structure prediction: homology modeling (SWISS-MODEL, Modeller).
• Some practical applications
-- Sequence analysis
-- Structure analysis
Major bioinformatics databases, search engines and data formats.
By: Sachin Pundhir, Bioinformatics sub-centre, DAVV, Indore
Database
• Collection of records and files
• Organized for a particular purpose
• Tables
• Tuples (records)
– Attributes
» Values
BIO520 Student Database (1998)
Name  ID   Grade
Amy   123  A
Joe   456  B
Sue   789  C
Figure labels: Table (the whole grid), Tuple (one row), Attribute (one column), Value (one cell).
Database Operations
• Tables
– Create, delete
• Tuples (records)
– Read, write, delete
• Search, sort, modify, print…
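The operations listed above can be tried on the BIO520 student table with Python's built-in sqlite3 module. This is only a minimal sketch: the table name `students`, the column types, and the in-memory database are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; the table name "students" is an illustrative assumption.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table: create
cur.execute("CREATE TABLE students (name TEXT, id INTEGER, grade TEXT)")

# Tuples (records): write
cur.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
)

# Search: read the tuple whose ID attribute holds the value 456
joe = cur.execute(
    "SELECT name, grade FROM students WHERE id = ?", (456,)
).fetchone()

# Modify: change one value in one tuple
cur.execute("UPDATE students SET grade = ? WHERE name = ?", ("A", "Sue"))

# Tuples (records): delete
cur.execute("DELETE FROM students WHERE name = ?", ("Joe",))

# Sort and print the remaining tuples
remaining = cur.execute(
    "SELECT name, id, grade FROM students ORDER BY grade, name"
).fetchall()
for row in remaining:
    print(row)

# Table: delete
cur.execute("DROP TABLE students")
conn.close()
```

Each SQL statement maps onto one of the operations on the slide: CREATE/DROP act on tables, INSERT/SELECT/DELETE act on tuples, and UPDATE changes a single value.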
International Nucleotide Sequence Database Collaboration (INSDC)
• Consists of
– DDBJ (Japan)
– GenBank (USA)
– EMBL Nucleotide Sequence Database (Europe)
• The three databases exchange new and updated data daily to stay synchronised.
Bioinformatics databases
• Nucleotide sequence database:
– GenBank: nucleotide sequence database; highly redundant.
– DDBJ: DNA Data Bank of Japan.
– EMBL: nucleotide sequence database.
– RefSeq: integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
Primary databases
• Protein sequence database:
– GenPept: protein sequence database.
– UniProtKB/Swiss-Prot: curated protein sequence database with minimal redundancy and a high level of integration with other databases.
– UniProtKB/TrEMBL: computer-annotated supplement of Swiss-Prot containing all translations of EMBL nucleotide sequence entries not yet integrated into Swiss-Prot.
– RefSeq: well-curated, non-redundant database.
• Structure database:
– PDB: Protein Data Bank
– MMDB: Molecular Modeling Database
Secondary database
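The contrast between a "highly redundant" collection and a "non-redundant" one can be illustrated with a toy example. The sketch below collapses records that carry exactly the same sequence; the accessions and sequences are made up, and the slides do not describe how RefSeq or Swiss-Prot actually reduce redundancy (curation involves far more than exact matching), so this is only the simplest possible illustration.

```python
def non_redundant(records):
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    kept = []
    seen = set()
    for accession, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            kept.append((accession, sequence))
    return kept


# Hypothetical records: SEQ_A and SEQ_B are the same sequence submitted twice.
records = [
    ("SEQ_A", "ATGGCC"),
    ("SEQ_B", "atggcc"),
    ("SEQ_C", "ATGGCA"),
]
unique = non_redundant(records)
print(unique)
```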
GenBank Record
• Header: information that applies to the whole record
• Features: annotations on the record
• Sequence
GenBank Record: Header
• Locus Name
• Sequence Length
• Molecule Type
• GenBank Division
• Modification Date
• Accession Number
• Version Number
GenBank Record: Features
• Link to sequence
GenBank Record: Sequence
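The header fields named above can be pulled out of a GenBank flat-file record with a few lines of Python. This is a rough sketch, assuming the usual whitespace-separated LOCUS line (name, length, "bp", molecule type, topology, division, date); the record shown is entirely made up for illustration and the function name `parse_genbank_header` is hypothetical.

```python
def parse_genbank_header(text):
    """Extract the main header fields from a GenBank flat-file record."""
    header = {}
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # LOCUS name length bp molecule-type topology division date
            parts = line.split()
            header["locus_name"] = parts[1]
            header["sequence_length"] = int(parts[2])
            header["molecule_type"] = parts[4]
            header["division"] = parts[-2]
            header["modification_date"] = parts[-1]
        elif line.startswith("ACCESSION"):
            header["accession"] = line.split()[1]
        elif line.startswith("VERSION"):
            header["version"] = line.split()[1]
        elif line.startswith("FEATURES"):
            break  # the header ends where the feature table begins
    return header


# Made-up record; it is not a real database entry.
SAMPLE = """\
LOCUS       EXAMPLE01                120 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration.
ACCESSION   EX000001
VERSION     EX000001.1
FEATURES             Location/Qualifiers
"""

header = parse_genbank_header(SAMPLE)
for field, value in header.items():
    print(field, "=", value)
```

For real work an established parser (for example the one in Biopython) is a safer choice than splitting lines by hand, since real records have more variation than this sketch handles.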
Using Entrez
An integrated database search and retrieval system
• WWW access
• Entrez & BLAST
Entrez: Database Integration
• Linked databases: PubMed abstracts, nucleotide sequences, protein sequences, 3-D structures, genomes, taxonomy
• Neighbor relationships labeled on the diagram: word weight, VAST, BLAST, phylogeny
Database Searching with Entrez
• Using limits and field restriction to find the human MutL homolog
• Linking and neighboring with MutL
Global Entrez Search
Document Summaries:MutL[All Fields]
Entrez Nucleotides: Limits & Preview/Index
Tabs
MutL
Entrez Nucleotides: Limits
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
Field Restriction
Exclude bulk sequences
MutL
Entrez Nucleotides: Limits
Title == Definition
Exclude Bulk Sequences
Document Summaries: Limits
Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
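Field restriction in Entrez comes down to attaching a field name in square brackets to each search term, as in MutL[All Fields]. The sketch below builds such a query and wraps it in a URL for NCBI's E-utilities ESearch service. The slides only show the web interface, so the use of E-utilities, the database name "nuccore", and the helper names `field_term` and `build_esearch_url` are assumptions added for illustration; the code builds the URL but does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Restrict a search term to one Entrez field, e.g. MutL[Title]."""
    return f"{text}[{field}]"


def build_esearch_url(db, *terms):
    """Combine field-restricted terms with AND and build an ESearch URL."""
    query = " AND ".join(terms)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Unrestricted search, as on the "Document Summaries" slide.
broad = field_term("MutL", "All Fields")

# Restricted search: MutL in the title, limited to human sequences.
narrow_url = build_esearch_url(
    "nuccore",
    field_term("MutL", "Title"),
    field_term("human", "Organism"),
)
print(broad)
print(narrow_url)
```

Restricting by Title and Organism narrows the result set in the same way as choosing those fields from the Limits menu.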
Human MutL Search Results
Human MutL RefSeq
GenBank Records
NM_000249: Links
Literature Links
PubMed
OMIM
NM_000249: PubMed
Books
Books Link
OMIM: Human Disease Genes
Conserved Domain
Sequence Links
Nucleotide Protein
NM_000249: Related Sequences
similarity
Original GenBank mRNAs
Original GenBank genomic
Genome Project BAC
Taxonomy Link
The Tax Browser
NCBI’s Taxonomy
Taxonomy Link
NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated, high-quality protein reviews
• PIR: Protein Information Resource, Georgetown University
• PRF: Protein Research Foundation
• PDB: Protein Data Bank, sequences from structures
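RefSeq accession prefixes such as NP_ and XP_ can be read programmatically to tell what kind of record an accession points to. The following is a small sketch; the NM_ and XM_ entries (curated and model mRNA) are added here beyond what the slide lists, the table is deliberately incomplete, and the function name `describe_accession` is hypothetical.

```python
# Partial prefix table; NP_ and XP_ follow the slide, NM_ and XM_ are added.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein, mRNA based",
    "XM_": "mRNA, predicted model",
    "XP_": "protein, genome based (predicted model)",
}


def describe_accession(accession):
    """Return the record type implied by a RefSeq accession prefix."""
    core = accession.split(".")[0]  # drop a version suffix such as ".1"
    prefix = core[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered in this sketch")


print(describe_accession("NM_000249"))
```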
Protein Link
BLAST Link
Conserved Domains
Related Proteins: Redundancy
Redundant Sequences
Sequence from MutL structure
Related Proteins: Links
BLink: non-redundant relatives
Arabidopsis homolog
Conserved Domain
MLH1 Domain Structure: CDD
ATPase Domain, Mismatch Repair Domain
MLH1: ATPase Domain
ATPase structural alignment
ATP Binding site helix
Genome Resources
NM_000249: Genome Links
Higher Genome Resources
MLH1: UniGene Cluster
ESTs in UniGene
The New Homologene
Figure: an early globin gene undergoes gene duplication, giving an A-chain gene and a B-chain gene. Frog A, chick A, and mouse A are orthologs; mouse B, chick B, and frog B are orthologs; A-chain and B-chain genes are paralogs.
• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs
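The ortholog/paralog distinction in the globin example can be written down as a simple rule. The sketch below applies only to this particular tree, where a single gene duplication happened before the frog, chick, and mouse lineages separated; it is an illustration of the definitions and not how HomoloGene itself builds its groups.

```python
from itertools import combinations

# (species, chain) for the six genes in the globin figure.
genes = [
    ("frog", "A"), ("chick", "A"), ("mouse", "A"),
    ("mouse", "B"), ("chick", "B"), ("frog", "B"),
]


def relationship(gene1, gene2):
    """Classify two globin genes, assuming duplication predates all speciation."""
    (species1, chain1), (species2, chain2) = gene1, gene2
    if chain1 != chain2:
        return "paralogs"   # lineages diverged at the gene duplication
    if species1 != species2:
        return "orthologs"  # same chain, diverged at a speciation event
    return "same gene"


pairs = {(g1, g2): relationship(g1, g2) for g1, g2 in combinations(genes, 2)}
for (g1, g2), kind in pairs.items():
    print(g1, g2, kind)
```

Note that mouse A and mouse B come out as paralogs even though they sit in the same genome, while mouse A and frog A are orthologs.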
Entrez Genes: integrated gene-based access
• LocusLink
• Complete Genomes
– eukaryotic
– microbial
– organelle
Genes MLH1: Central Resource
QUESTIONS!!!