Post on 26-Dec-2015
By Xianfeng (Jeff) Chen
Computational and Systems Biologist
May 7, 2009
Bioinformatics Cyber-infrastructure for Genomics and Proteomics in Systems Biology
Agenda Today
(1) Cyber-infrastructure and systems biology.
(2) High performance computing and software for peptide/protein identification and quantification, and for data mining/target discovery, on mass spectrometry-generated proteomics data.
(3) Relational database management systems, genome annotation methodology, systems biology data integration, and biology knowledge generation and augmentation.
Section One: Cyber-infrastructure and Systems Biology
Reductionist approach: one gene, one protein
Systems approach: multiple genes, network analysis
Cutting edge science and technology
Status of Technologies in Systems Biology
Cyber-infrastructure for Systems Biology
• “... build new types of scientific and engineering knowledge environments and organizations to pursue research in new ways and with increased efficacy.”
• “... new NSF funding of $1 billion per year is needed to achieve critical mass ...”
2008: Awarded $50 million
http://www.communitytechnology.org/nsf_ci_report/
2004: Awarded $100 million
2004: Awarded $85 million
Supporting Cyber-infrastructure and Systems Biology Workflow
Historic strong area
Supporting
(DOE - Genomics: GTL Roadmap, p.52)
Cyber-knowledge System to Enable Genomics-based Predictive Medicine
System Integration at Systems Biology Center
Core Laboratory Facility: Data Generation
Core Computational Facility: Data Processing, Storage, and Dissemination
Cyber-infrastructure, Data Management, Data Analysis Pipeline, and Data Display
(1) LIMS for raw data & protocol
(2) Preprocessed data management
(3) High throughput computing
(4) Data validation and integration
(5) Knowledge representation
Data Mining and Knowledge Discovery
Cyber-infrastructure Component (1): High Performance Computing
--- Migration of Bio-Computing Capability (for large sets of data analysis)
Start point: PC, single-CPU computing (most labs)
Step 1: Unix, multiple-CPU computing (5-10 biological labs in US)
Step 2: Cluster computing (2-4 biological labs)
Cyber-infrastructure Component (2) : Integrated Knowledgebase System
--- Case Study of National Biodefense Proteomics Data Center
Public File Server
Private File Server
Oracle Relational Database
Database query, data upload over HTTP
Batch Processing
(1) Data uploading;
(2) Data validation;
(3) Data analysis;
(4) Data processing
Perl, Java
Web services: data exchange using XML-based SOAP
---- System Integration Case 1: UVa Proteomics Data Center
High Performance and Throughput Computing
Data Management
Section Two: High Performance Computing and Proteomics
Protein Database Search Engines
Mascot (Matrix Science)
Sequest / Bioworks (Scripps/Thermo)
X! Tandem (the GPM)
Spectrum Mill (Agilent Technologies)
OMSSA (NCBI)
PEAKS (Bioinformatics Solutions Inc.)
Phenyx (GeneBio)
Statistical Validation and Quantitation
PeptideProphet (Institute for Systems Biology)
ProteinProphet (Institute for Systems Biology)
ASAPRatio, XPRESS, Libra (Institute for Systems Biology)
Scaffold Batch System (Proteome Software, Inc.)
SIEVE (Thermo)
Census (Scripps Research Institute)
Open Data Standards
FuGE and XAR (FHCRC, ICBC, ITMAT, & Manchester)
MIAPE (HUPO PSI and Collaborators)
mzXML, pepXML, protXML (Institute for Systems Biology)
MS1, MS2, SQT (Scripps Research Institute)
Computational Proteomics Software and Algorithms
Many more ...
System Integration Case 2: National Biodefense Proteomics Data Center
http://www.proteomicsresource.org
Awarded $14 million
Proteomics Research Centers (PRC) and Their Major Data Types
PRC Organization: Major Data Types
(1) University of Michigan: Microarray and mass spectrometry
(2) Caprion Pharmaceuticals: Mass spectrometry
(3) Harvard Proteomics Institute: Genomics and protein expression array
(4) Albert Einstein College of Medicine: Mass spectrometry
(5) PNNL: Mass spectrometry
(6) Scripps: NMR structural data, X-ray crystal diffraction data, and mass spectrometry
(7) Myriad Genetics: Yeast two-hybrid system
Proteomics Data Flow
PRCs
VBI
Public
Data Sources
2D GELS
Protein Array
LC
Immunoaffinity purification
Y2H
MS
MS/MS
NMR
X-Ray Cryoelectron Microscopy
X-Ray Diffraction
etc…
Data Types
QA & QC (Quality Assurance & Quality Control)
Converting to Standard Format
Standard Format for Each Data Type
QA & QC (Quality Assurance & Quality Control)
Data Modeling / Decomposition
Relational Database
MIAME and MIAPE-like Standards/SOP for Data Submission
Proteomics Database Architecture
Search By Experiment/Sample
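The decomposition of submitted data into a relational database, and a search by experiment or sample, can be sketched roughly as follows. This is only an illustration using SQLite; the table names, column names, and accession values are hypothetical placeholders and do not reflect the actual schema of the data center, which the slides do not show.

```python
import sqlite3

# Hypothetical minimal schema: experiment -> sample -> identification.
# The real data center schema is not given in the slides.
SCHEMA = """
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    prc_name      TEXT NOT NULL,
    data_type     TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
    description   TEXT NOT NULL
);
CREATE TABLE identification (
    identification_id INTEGER PRIMARY KEY,
    sample_id   INTEGER NOT NULL REFERENCES sample(sample_id),
    protein_acc TEXT NOT NULL,
    peptide_seq TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO experiment VALUES (?, ?, ?)",
        [(1, "PRC-A", "MS/MS"), (2, "PRC-B", "Y2H")],
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(10, 1, "infected"), (11, 1, "control"), (20, 2, "bait library")],
    )
    conn.executemany(
        "INSERT INTO identification VALUES (?, ?, ?, ?)",
        [
            (1, 10, "P00001", "PEPTIDEK"),
            (2, 10, "P00002", "SAMPLEK"),
            (3, 11, "P00001", "PEPTIDEK"),
        ],
    )
    conn.commit()
    return conn


def proteins_by_experiment(conn, experiment_id):
    """Return the distinct protein accessions identified in one experiment."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.protein_acc
        FROM identification AS i
        JOIN sample AS s ON s.sample_id = i.sample_id
        WHERE s.experiment_id = ?
        ORDER BY i.protein_acc
        """,
        (experiment_id,),
    )
    return [row[0] for row in rows]
```

Calling `proteins_by_experiment(build_demo_db(), 1)` joins identifications to their samples and filters on the parent experiment, which is the general shape of a search-by-experiment query.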
Databases in Proteomics Data Center
• Annotation improvement and interaction network analysis
(1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
(2) Comparative genomics with reference genomes: E. coli, yeast, Arabidopsis, and other model organisms.
• Identifying anchor points for data integration
(1) Known metabolic pathway;
(2) Known signal transduction pathway;
(3) Known gene regulation machinery;
(4) Known protein-protein interaction map.
Strategies for Annotating Raw Data into Meaningful Knowledge
BMC Bioinformatics 2006, 7 (Suppl 4):S18
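Of the non-homology-based methods listed on this slide, phylogenetic profiling is the easiest to illustrate in a few lines. The sketch below assumes each gene has a set of genomes in which an ortholog is present and links genes whose presence patterns are similar, using the Jaccard index as one possible similarity measure. The gene names, species labels, and threshold are hypothetical, and the slides do not say which measure the author used.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of genomes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def linked_pairs(profiles, threshold=0.75):
    """Return gene pairs whose phylogenetic profiles are similar.

    profiles maps a gene name to the set of genomes where it is present.
    """
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs


# Made-up presence/absence data for illustration only.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
    "geneD": {"sp1", "sp2", "sp3", "sp4"},
}
```

Genes with identical profiles (here `geneA` and `geneB`) are candidates for a functional link even when they share no sequence similarity.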
Qualitative Data Integration and Knowledge Augmentation Based on Networks Biology
Quantitative Proteome Profiling
--- The field is 2-3 years old
Thermo SIEVE Scatter Plot of 14 UVa Raw Files for Validation of Data Quality and Absolute Quantification.
Scaffold: Proteome Spectral Counts for Semi-quantification.
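Semi-quantification from spectral counts can be sketched as follows. This is a generic illustration, not the calculation performed by Scaffold or SIEVE: counts are normalized to the total per run, a pseudocount (an assumption here) avoids division by zero, and a log2 ratio is reported per protein. The protein names and counts are invented.

```python
import math


def normalize(counts):
    """Convert raw spectral counts to fractions of the run total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("run has no spectral counts")
    return {protein: c / total for protein, c in counts.items()}


def log2_ratios(case, control, pseudocount=0.5):
    """Log2 ratio of normalized spectral counts, case over control.

    A pseudocount is added to every protein so that proteins seen in
    only one run still get a finite ratio.
    """
    proteins = set(case) | set(control)
    case_adj = {p: case.get(p, 0) + pseudocount for p in proteins}
    ctrl_adj = {p: control.get(p, 0) + pseudocount for p in proteins}
    case_frac = normalize(case_adj)
    ctrl_frac = normalize(ctrl_adj)
    return {p: math.log2(case_frac[p] / ctrl_frac[p]) for p in proteins}


# Invented counts for two runs.
case_counts = {"protA": 30, "protB": 10}
control_counts = {"protA": 10, "protB": 30}
ratios = log2_ratios(case_counts, control_counts)
```

A positive value indicates relatively more spectra for that protein in the case run; the measure is only semi-quantitative because spectral counts depend on protein length and detectability.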
Search Engine Comparison at UVa Proteomics Data Center (1)
Few common annotations
Low annotation rates
Peptide/Protein Identifications with Various Protein Database Search Engines (2)
X!Tandem missed; OMSSA missed
Sequest over-predicted
UVaPDC, MS/MS Search Engine Comparison (3)
Spectra counts
Common annotations
Statistics on confidence values
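A comparison of this kind can be approximated by treating each search engine's output as a set of peptide identifications and counting agreement. The sketch below is a simplified illustration under that assumption; the peptide strings are placeholders and the functions are not part of any of the tools named above.

```python
from collections import Counter


def consensus(ids_by_engine, min_engines=2):
    """Peptides reported by at least min_engines search engines."""
    votes = Counter()
    for peptides in ids_by_engine.values():
        votes.update(set(peptides))
    return {pep for pep, n in votes.items() if n >= min_engines}


def unique_to(ids_by_engine, engine):
    """Peptides reported by one engine and by no other."""
    others = set().union(
        *(set(p) for e, p in ids_by_engine.items() if e != engine)
    )
    return set(ids_by_engine[engine]) - others


# Placeholder identifications, not real search results.
ids_by_engine = {
    "Sequest": {"PEPTIDEK", "SAMPLEK", "ELVISK", "LIVESK"},
    "X!Tandem": {"PEPTIDEK", "SAMPLEK"},
    "OMSSA": {"PEPTIDEK", "ELVISK"},
}
```

Peptides returned by `unique_to` for a single engine are the ones worth inspecting for over-prediction, while the `consensus` set gives the common annotations.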
Statistics and Summarization Capability of Scaffold
--- The best feature of the software
Data Mining on Data Processed via Computational Approach
Knowledge-based Discovery
Identified
Rate-limiting step
Knowledge Inference
Inference on Gene Network in Systems Biology
(1) Y2H, (2) MS pull down assay, (3) Co-expression assay.
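One simple way to combine these three evidence types into a network is to count how many independent sources support each protein pair. The sketch below assumes each source yields a list of undirected pairs; the protein identifiers and the support threshold are hypothetical, and real pipelines usually weight sources by reliability rather than counting them equally.

```python
from collections import defaultdict


def integrate(evidence):
    """Map each undirected protein pair to the set of sources supporting it."""
    support = defaultdict(set)
    for source, edges in evidence.items():
        for a, b in edges:
            if a == b:
                continue  # ignore self-interactions in this sketch
            support[frozenset((a, b))].add(source)
    return support


def edges_with_support(support, min_sources=2):
    """Sorted list of pairs supported by at least min_sources sources."""
    return sorted(
        tuple(sorted(edge))
        for edge, sources in support.items()
        if len(sources) >= min_sources
    )


# Invented evidence lists for illustration.
evidence = {
    "Y2H": [("P1", "P2"), ("P2", "P3")],
    "MS pull-down": [("P2", "P1"), ("P3", "P4")],
    "co-expression": [("P1", "P2"), ("P3", "P4"), ("P1", "P4")],
}
support = integrate(evidence)
```

Edges backed by two or more assay types are better candidates for the inferred gene network than edges seen in a single assay.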
Where are the significant regulatory steps impacting pathway expression ?
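[Figure: EGFR pathway diagram. Labeled components: EGFR, NRas*, Raf, MAPK, EPS8L1* or EPS8L2*, EPS15, EDH1, Mucin-4*, Gα*, Gβ, Gγ, adenylate cyclase, GDP, GTP, ATP, cAMP, P (phosphorylation). Labeled outcomes: cell proliferation, MP formation. Target/lead protein marked.]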
Urinary Biomarker Identification --- EGFR Pathway-Related Bladder Cancer
--- Small-scale analysis
* Differentially expressed
[Figure: workflow diagram. Labels: urine from a patient with bladder cancer; urine from a healthy individual; urine microparticles; exosomes; ectosomes; LC-MS/MS; SEQUEST; spectral count analysis; Western blotting; EPS8L2.]
Pattern Matching on Gene Signatures at Various Biological States
--- Large-scale analysis
*** Query signatures are compared to reference gene/protein expression signatures for known perturbations or disease phenotypes (many-to-many association analysis).
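The many-to-many comparison described above can be sketched by scoring every query signature against every reference signature. The example below uses Pearson correlation over shared genes as one plausible similarity measure; the slides do not state which measure was used, and all gene names, signature names, and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_references(query, references, min_shared=3):
    """Rank reference signatures by correlation with one query signature."""
    results = []
    for name, ref in references.items():
        shared = sorted(set(query) & set(ref))
        if len(shared) < min_shared:
            continue  # too few genes in common to compare
        r = pearson([query[g] for g in shared], [ref[g] for g in shared])
        results.append((name, r))
    return sorted(results, key=lambda item: item[1], reverse=True)


def match_all(queries, references):
    """Many-to-many: rank all references for every query signature."""
    return {q: rank_references(sig, references) for q, sig in queries.items()}


# Invented log-ratio signatures.
queries = {"query1": {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": 1.5}}
references = {
    "perturbation_X": {"g1": 1.8, "g2": -0.9, "g3": 0.4, "g4": 1.2},
    "disease_Y": {"g1": -2.0, "g2": 1.0, "g3": -0.5, "g4": -1.5},
    "sparse_Z": {"g1": 1.0},
}
matches = match_all(queries, references)
```

A strongly positive score suggests the query state resembles the reference perturbation, while a strongly negative score suggests an opposing expression pattern.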
Section Three : Knowledge Base Establishment
Database Case 1: Soybean Upstream Regulatory Elements for Ongoing Regulatory Motif Annotation
Nominated Transcription Factor Involved in Stress Response
Group IX
Red Dot = Soybean ERF genes
Implicated in regulating wounding and jasmonate responses
Soybean Promoters: GmERFs, Gmubis, Gmcons, GmWRKYs, and more
10 promoters per month
Ongoing Effort on Transcription Factor Binding Motifs
---- Identify genetic circuits of cell wall, starch, and lipid biosynthesis and degradation
Elucidation of Conserved Co-expression Networks via Data Integration with Expression Profiling Data
(1) BMC Bioinformatics. 2007, 8:129. (2) BMC Bioinformatics. 2008, 9:53.
Database Case 2: CGKB and TOBFAC Knowledge Bases
Genome Annotation Strategy (1) : Homology-based Annotation
263,425 total cowpea gene space sequences (GSS).
High level of coding region detection!
BMC Genomics. 2008, 9:103.
Genome Annotation Strategy (2) : Metabolic Pathway Integration
BMC Bioinformatics. 2007, 8:129.
Genome Annotation Strategy (3) : GO Integration with Distribution of Function Assignments
BMC Genomics. 2008, 9:103.
Genome Annotation Strategy (4): Comparative Genomics at Genome-scale
BMC Genomics. 2008, 9:103.
---- Example of Medicago vs. cowpea
Genome Annotation Strategy (5): Comparison at Gene Family Level
(1) BMC Genomics. 2008, 9:103. (2) Plant Physiology. 2008, 147:280-295.
--- WRKY and CONSTANS (CO) and CO-like Gene Families of Cowpea Transcription Factors
Genome Annotation Strategies: (6) Repeat, (7) Domain, (8) Gene Model
BMC Bioinformatics. 2007, 8:129.
Repeat
Domain
Gene Model
Genome Annotation Strategy (9) : Comparative Genomics on Network for Conserved Protein Complexes
Comparative genome analysis
Conserved networks
Published Protein-Protein (PPI) Interactions in Organisms
Example of Yeast PPI
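One common way to use published interactions from a reference organism such as yeast is to transfer them to a target genome through ortholog mappings. The sketch below illustrates that idea only; it is not necessarily the method used in this work, and the protein identifiers and ortholog table are hypothetical.

```python
def transfer_interactions(reference_ppi, orthologs):
    """Predict target-genome interactions from reference interactions.

    reference_ppi: iterable of (protein_a, protein_b) pairs in the
                   reference organism.
    orthologs:     dict mapping a reference protein to a list of
                   target-genome orthologs.
    An interaction is transferred only when both partners have orthologs.
    """
    predicted = set()
    for a, b in reference_ppi:
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:
                    predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Invented identifiers for illustration.
yeast_ppi = [("YeastA", "YeastB"), ("YeastB", "YeastC")]
orthologs = {"YeastA": ["Cow1"], "YeastB": ["Cow2", "Cow3"]}
predicted = transfer_interactions(yeast_ppi, orthologs)
```

Interactions whose partners both have orthologs in the target genome form the candidate conserved network; pairs lacking an ortholog on either side are simply not transferred.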
Genome Annotation Strategy (10): Functional Validation of Genes of Interest Through Reverse Genetics Program
Acknowledgement