ioerger lab – bioinformatics research
Post on 04-Jan-2016
Ioerger Lab – Bioinformatics Research
• Pattern recognition/machine learning
– issues of representation
– effect of feature extraction, weighting, and interaction on performance of induction algorithm
• Applications in Structural Biology
– molecular basis of biology: protein structures
– predicting structures
– tools for solving structures (X-ray crystallography, NMR)
– stability, folding, packing, motions
– drug design (small-molecule inhibitors)
– large datasets exist – exploit them – find the patterns
TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition
Principal Investigators: Thomas Ioerger (Dept. Computer Science)
James Sacchettini (Dept. Biochem/Biophys)
Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee,
Lalji Kanbi, Reetal Pai & Jacob Smith
Funding: National Institutes of Health
Texas A&M University
X-ray crystallography
• Most widely used method for protein modeling
• Steps:
– Grow crystal
– Collect diffraction data
– Generate electron density map (Fourier transform)
– Interpret map i.e. infer atomic coordinates
– Refine structure
• Model-building
– Currently: crystallographers
– Challenges: noise, resolution
– Goal: automation
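The "generate electron density map (Fourier transform)" step can be illustrated with a toy one-dimensional Fourier synthesis. This is only a sketch under simplifying assumptions: point atoms, known phases, and a 1D unit cell. The function names, atom positions, and weights are made up for illustration. Truncating the sum at `h_max` plays the role of limited resolution.

```python
import cmath
import math

def structure_factors(atoms, h_max):
    """F(h) = sum_j Z_j * exp(2*pi*i*h*x_j) for point atoms at fractional positions x_j."""
    return {h: sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
            for h in range(-h_max, h_max + 1)}

def density_map(factors, n_grid):
    """rho(x) = sum_h F(h) * exp(-2*pi*i*h*x), sampled at n_grid points in the cell."""
    rho = []
    for k in range(n_grid):
        x = k / n_grid
        value = sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items())
        rho.append(value.real)  # imaginary parts cancel because F(-h) = conj(F(h))
    return rho

# Hypothetical toy cell: two atoms given as (fractional position, scattering weight).
atoms = [(0.25, 6.0), (0.70, 8.0)]
factors = structure_factors(atoms, h_max=10)
rho = density_map(factors, n_grid=100)
peak = max(range(len(rho)), key=rho.__getitem__)
```

The highest grid point sits at the heavier atom; lowering `h_max` broadens the peaks, which is the kind of blurring that makes medium-resolution maps hard to interpret.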
• Automated model-building program
• Can we automate the kind of visual processing of patterns that crystallographers use?
– Intelligent methods to interpret density, despite noise
– Exploit knowledge about typical protein structure
• Focus on medium-resolution maps
– optimized for 2.8 Å (actually, 2.6-3.2 Å is fine)
– typical for MAD data (useful for high-throughput)
– other programs exist for higher-res data (ARP/wARP)
Overview of TEXTAL
[Figure: TEXTAL pipeline]
Input: electron density map (or structure factors) → TEXTAL → output: protein model (may need refinement)
Experimental context: crystal → collect data → diffraction data → electron density map
CAPRA (models backbone): SCALE MAP → TRACE MAP → CALCULATE FEATURES → PREDICT Cα’s → BUILD CHAINS → PATCH & STITCH CHAINS → REFINE CHAINS → model of backbone
LOOKUP (models side chains) → model of backbone & side chains
POST-PROCESSING: SEQUENCE ALIGNMENT → REAL SPACE REFINEMENT → corrected & refined model
F=<1.72,-0.39,1.04,1.55...> F=<1.58,0.18,1.09,-0.25...>
F=<0.90,0.65,-1.40,0.87...> F=<1.79,-0.43,0.88,1.52...>
Examples of Numeric Density Features
• Distance from center-of-sphere to center-of-mass
• Moments of inertia: relative dispersion along orthogonal axes
• Geometric features like “spoke angles”
• Local variance and other statistics
Features are designed to be rotation-invariant, i.e. the same values for a region in any orientation/frame of reference.
TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
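A minimal sketch of how rotation-invariant features over several radii might be computed. It covers only three simple statistics (center-of-mass distance, mean, variance), not the 19 features TEXTAL uses; the radii, the grid, and all function names are assumptions made for illustration.

```python
import math

def features_at_radius(points, radius):
    """points: (x, y, z, rho) tuples, coordinates relative to the region's center."""
    inside = [p for p in points
              if p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= radius ** 2]
    total = sum(p[3] for p in inside)
    if not inside or total == 0:
        return [0.0, 0.0, 0.0]
    # Distance from the center of the sphere to the density-weighted center of mass.
    com = [sum(p[i] * p[3] for p in inside) / total for i in range(3)]
    com_dist = math.sqrt(sum(c * c for c in com))
    # Local statistics of the density values inside the sphere.
    mean = total / len(inside)
    variance = sum((p[3] - mean) ** 2 for p in inside) / len(inside)
    return [com_dist, mean, variance]

def feature_vector(points, radii=(3.0, 4.0, 5.0, 6.0)):
    """Concatenate the per-radius features (radii here are assumed values)."""
    vec = []
    for r in radii:
        vec.extend(features_at_radius(points, r))
    return vec

def rotate_z90(points):
    """Rotate every grid point by 90 degrees about the z axis."""
    return [(-y, x, z, rho) for x, y, z, rho in points]

# Hypothetical density: a Gaussian blob placed off-center on an integer grid.
grid = [(x, y, z, math.exp(-((x - 1.5) ** 2 + y ** 2 + (z - 0.5) ** 2) / 4.0))
        for x in range(-6, 7) for y in range(-6, 7) for z in range(-6, 7)]
original = feature_vector(grid)
rotated = feature_vector(rotate_z90(grid))
```

`original` and `rotated` agree up to floating-point rounding, which is the property that lets a region be matched against the database without first knowing its orientation.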
The LOOKUP Process
[Figure: a region in the map to be interpreted is matched against a database of known maps; find optimal rotation]
“2-norm”: weighted Euclidean distance metric for retrieving matches:
$$\mathrm{dist}(R_1, R_2) = \sum_i w_i \, (F_i(R_1) - F_i(R_2))^2$$
Two-step filter: 1) by features 2) by density correlation
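The two-step filter can be sketched as follows. This is an illustrative outline, not TEXTAL's implementation: the database layout, the names, and the toy numbers are assumptions, and the rotation search is skipped by assuming the density samples are already aligned.

```python
import math

def weighted_distance(f1, f2, weights):
    """Weighted squared feature distance between two regions."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def correlation(d1, d2):
    """Pearson correlation between two equally sampled density regions."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    v1 = sum((a - m1) ** 2 for a in d1)
    v2 = sum((b - m2) ** 2 for b in d2)
    if v1 == 0 or v2 == 0:
        return 0.0
    return cov / math.sqrt(v1 * v2)

def lookup(query, database, weights, k):
    # Step 1: cheap filter, keep the k entries nearest in feature space.
    shortlist = sorted(
        database,
        key=lambda e: weighted_distance(query["features"], e["features"], weights),
    )[:k]
    # Step 2: expensive filter, pick the shortlisted entry with the best density correlation.
    return max(shortlist, key=lambda e: correlation(query["density"], e["density"]))

# Hypothetical toy data.
query = {"features": [1.0, 0.0], "density": [1.0, 2.0, 3.0, 4.0]}
database = [
    {"name": "A", "features": [1.1, 0.1], "density": [4.0, 3.0, 2.0, 1.0]},
    {"name": "B", "features": [0.9, -0.1], "density": [1.0, 2.0, 3.0, 5.0]},
    {"name": "C", "features": [5.0, 5.0], "density": [1.0, 2.0, 3.0, 4.0]},
]
best = lookup(query, database, weights=[1.0, 1.0], k=2)
```

Entry C has a perfect density match but never reaches step 2, which shows why the quality of the feature weights matters for the first filter.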
SLIDER: Feature-weighting algorithm
• Euclidean distance metric used for retrieval:
$$\mathrm{dist}(R_1, R_2) = \sum_i w_i \, (F_i(R_1) - F_i(R_2))^2$$
• relevant features – good, irrelevant features – bad
• Goal: find the optimal weight vector w that generates the highest probability of hits (matches) in the top K candidates from the database
• Concept of Slider: adjust feature weights so that the most matches are ranked higher than mismatches
Slider Algorithm(w, F, {R_i}, matches, mismatches)
  choose feature f ∈ F at random
  for each <R_i, R_j, R_k>, R_j ∈ matches(R_i), R_k ∈ mismatches(R_i)
    compute cross-over point λ_i where dist'(R_i, R_j) = dist'(R_i, R_k)
    dist'(X, Y) = λ (X_f - Y_f)^2 + (1 - λ) dist_{\f}(X, Y)
  pick λ that is the best compromise among the λ_i: ranks most matches above mismatches
  update weight vector: w' ← update(w, f, λ), w_f' = λ
  repeat until convergence
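A rough Python sketch of the Slider idea follows. Several details are assumptions not stated on the slide: the distance over the remaining features is computed with the current weights renormalized to sum to 1, the "best compromise" is taken as the midpoint of the interval between cross-over points that ranks the most matches above mismatches, the other weights are rescaled by (1 - λ), and a fixed number of rounds stands in for a convergence test.

```python
import random

def dist_without(x, y, w, f):
    """Weighted squared distance over all features except f (weights renormalized)."""
    rest = sum(w[i] for i in range(len(w)) if i != f)
    return sum(w[i] * (x[i] - y[i]) ** 2 for i in range(len(w)) if i != f) / rest

def slide_feature(w, f, triples):
    """Re-weight feature f given (region, match, mismatch) feature-vector triples."""
    terms = []
    for ri, rj, rk in triples:
        a, b = (ri[f] - rj[f]) ** 2, dist_without(ri, rj, w, f)
        c, d = (ri[f] - rk[f]) ** 2, dist_without(ri, rk, w, f)
        terms.append((a, b, c, d))
    # Cross-over: lam*a + (1-lam)*b == lam*c + (1-lam)*d.
    cuts = {0.0, 1.0}
    for a, b, c, d in terms:
        denom = (a - c) + (d - b)
        if denom != 0 and 0.0 < (d - b) / denom < 1.0:
            cuts.add((d - b) / denom)
    cuts = sorted(cuts)
    # Rankings only change at cross-overs, so probe one point per interval.
    probes = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]

    def n_correct(lam):
        return sum(lam * a + (1 - lam) * b < lam * c + (1 - lam) * d
                   for a, b, c, d in terms)

    best = max(probes, key=n_correct)
    rest = sum(w[i] for i in range(len(w)) if i != f)
    return [best if i == f else (1 - best) * w[i] / rest for i in range(len(w))]

def slider(w, triples, rounds=20, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        w = slide_feature(w, rng.randrange(len(w)), triples)
    return w

# Hypothetical data: feature 0 is informative, feature 1 is misleading.
triples = [([0.0, 0.0], [0.1, 5.0], [3.0, 0.1])]
weights = slider([0.5, 0.5], triples)
```

With equal starting weights the mismatch ranks closer than the match; after sliding, most of the weight moves to feature 0 and the order is corrected.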
Quality of TEXTAL models
• Typically builds >80% of the protein atoms
• Accuracy of coordinates: ~1 Å error (RMSD)
– Depends on resolution and quality of map
Closeup of β-strand (TEXTAL model in green)
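For reference, the two quality measures quoted above (fraction of atoms built and coordinate RMSD) could be computed along these lines. The dictionary layout and the toy coordinates are assumptions, and no superposition is done because a model built into a map already shares the reference structure's frame.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation over atoms present in both structures.

    Both arguments map an atom key, e.g. (residue number, atom name),
    to (x, y, z) coordinates in angstroms.
    """
    shared = sorted(set(model) & set(reference))
    if not shared:
        raise ValueError("no atoms in common")
    sq = sum((m - r) ** 2
             for key in shared
             for m, r in zip(model[key], reference[key]))
    return math.sqrt(sq / len(shared))

def completeness(model, reference):
    """Fraction of reference atoms that the model actually contains."""
    return len(set(model) & set(reference)) / len(reference)

# Hypothetical toy structures: the model misses one atom and is shifted 1 Å along x.
reference = {(1, "N"): (0.0, 0.0, 0.0), (1, "CA"): (1.0, 2.0, 3.0),
             (1, "C"): (2.0, 2.0, 4.0), (1, "O"): (3.0, 1.0, 4.0),
             (1, "CB"): (1.0, 3.0, 2.0)}
model = {key: (x + 1.0, y, z)
         for key, (x, y, z) in reference.items() if key != (1, "CB")}
error = rmsd(model, reference)
built = completeness(model, reference)
```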
Deployment
• September 2004: Linux and OSX distributions
– Can be downloaded from http://textal.tamu.edu
– 40 trial licenses granted so far
• June 2002: WebTex (http://textal.tamu.edu)
– Until May 2005: TB Structural Genomics Consortium members only
– Recently opened to the public
– Users upload data; data are processed on the server; results can be downloaded
– 120 users from 70 institutions in 20 countries
• July 2003: Model-building component of PHENIX
– Python-based Hierarchical ENvironment for Integrated Xtallography
– Consortium members:
• Lawrence Berkeley National Lab
• University of Cambridge
• Los Alamos National Lab
• Texas A&M University
Intelligent Methods for Drug Design
• structure-based:
– given protein structure, predict ligands that might bind active site
• other methods:
– QSAR, high-throughput/combi-chem, manual design using 3D
• Virtual Screening
– docking algorithm + large library of chemical structures
– sort compounds by interaction energy
– purchase top-ranked hits and assay in lab
– looking for μM inhibitors (leads that can be refined)
– goal: enrichment to ~5% hit rate
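The ranking and hit-rate bookkeeping behind a virtual screen can be sketched as below. The compound IDs, scores, and set of actives are invented for illustration, and "lower score is better" is an assumption that holds for energy-like scores but not for every docking program.

```python
def top_ranked(scores, n):
    """Return the n compound IDs with the lowest (most favorable) docking score."""
    return sorted(scores, key=scores.get)[:n]

def hit_rate(selected, actives):
    """Fraction of the selected compounds that are confirmed active in the assay."""
    return sum(1 for c in selected if c in actives) / len(selected)

def enrichment_factor(scores, actives, n):
    """Hit rate in the top n divided by the hit rate of the whole library."""
    baseline = len(actives) / len(scores)
    return hit_rate(top_ranked(scores, n), actives) / baseline

# Hypothetical screen of ten compounds, two of which turn out to be active.
scores = {"c0": -20.1, "c1": -35.2, "c2": -48.9, "c3": -15.0, "c4": -30.3,
          "c5": -22.7, "c6": -41.0, "c7": -44.5, "c8": -18.8, "c9": -27.4}
actives = {"c2", "c7"}
purchased = top_ranked(scores, 4)
rate = hit_rate(purchased, actives)
factor = enrichment_factor(scores, actives, 4)
```

Here the top four contain both actives, so the hit rate among purchased compounds is 50% against a 20% baseline; in a real screen the baseline is far lower, which is why a ~5% hit rate counts as useful enrichment.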
Virtual Screening
• diversity
• ZINC database: ~2.6 million compounds
– purchasable; satisfy Lipinski’s rules
• docking algorithms:
– FlexX, DOCK, GOLD, AutoDock, ICM...
– search for position and conformation of ligand
• scoring function
– electrostatic + steric + desolvation
– entropy effects?
• major open issues:
– active site flexibility, charge state, waters, co-factors
– works best with co-crystal structures (already bound)
Grid at Texas A&M
[Figure: campus grid layout]
• ~1600 computers in student labs on the TAMU campus (Open-Access Labs); locations shown: Blocker, Zachary, West Campus Library
• Grid master: gridmaster.tamu.edu
• GridMP software by United Devices (Austin, TX)
• Each work unit: DOCK binaries + receptor files + 20 ligands at a time
• Typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP
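Splitting a ligand library into work units of 20 could look like the sketch below. It shows only the batching logic; it does not use the GridMP API, and the function name, dictionary layout, and file name are hypothetical.

```python
def make_work_units(ligand_ids, receptor_file, batch_size=20):
    """Group ligands into batches, each paired with the receptor they dock against."""
    units = []
    for start in range(0, len(ligand_ids), batch_size):
        units.append({
            "receptor": receptor_file,
            "ligands": ligand_ids[start:start + batch_size],
        })
    return units

# Hypothetical library of 45 ligands gives two full units and one partial unit.
ligands = ["lig%05d" % i for i in range(45)]
units = make_work_units(ligands, "receptor.mol2")
sizes = [len(u["ligands"]) for u in units]
```

Small batches keep each work unit short, which suits lab machines that may be reclaimed by a student at any moment.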
Data Mining of Results
• promiscuous binders
• clusters of related compounds
• patterns of contacts within active site
• hydrogen-bonding interactions
• adjust weights of scoring function for unique properties of each site – open/closed, hydrophobic/charged...
• ideas for active site variations
• development of pharmacophore search patterns
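As one example of mining docking results, promiscuous binders can be flagged by counting how often a compound lands in the top ranks across unrelated targets. This is a sketch under assumptions: the thresholds, target names, compound IDs, and scores are all invented, and lower scores are taken to be better.

```python
from collections import Counter

def promiscuous_binders(screens, top_n, min_targets):
    """Compounds ranked in the top_n for at least min_targets different targets.

    screens maps a target name to a dict of compound ID -> docking score.
    """
    counts = Counter()
    for scores in screens.values():
        for compound in sorted(scores, key=scores.get)[:top_n]:
            counts[compound] += 1
    return sorted(c for c, n in counts.items() if n >= min_targets)

# Hypothetical results for three targets.
screens = {
    "target_A": {"z1": -50.0, "z2": -40.0, "z3": -10.0, "z4": -5.0},
    "target_B": {"z1": -45.0, "z3": -42.0, "z2": -8.0, "z4": -7.0},
    "target_C": {"z1": -48.0, "z4": -41.0, "z2": -9.0, "z3": -6.0},
}
flagged = promiscuous_binders(screens, top_n=2, min_targets=3)
```

A compound that scores well everywhere, like `z1` here, is more likely to reflect a bias of the scoring function than a specific interaction, so it can be deprioritized before purchase.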
Current Screens in Sacchettini Lab
• proteins related to tuberculosis (Mycobacterium)
– focus on unique pathways involved in dormancy/starvation
• glyoxylate shunt – slow-growth metabolic pathway
• cell-wall biosynthesis (unique mycolic acid layer in TB)
• biosynthesis of amino acids/co-factors that humans get from diet
– isocitrate lyase
– malate synthase
– PcaA: mycolic acid cyclopropane synthase
– ACPS: acyl-carrier protein synthase
– InhA: enoyl-ACP reductase (target of isoniazid)
– KasB: fatty-acid synthase
– BioA: biotin (co-factor) synthase
– PGDH: phosphoglycerate dehydrogenase (serine biosynthesis)
• Related proteins in malaria, SARS, shigella
Conclusions
• Many opportunities for research in Structural Bioinformatics
– large datasets
– significant problems
• Provides challenges for machine learning
– drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features...
• Requires inherently interdisciplinary approach
– training in biochemistry; knowledge of molecular interactions
– understanding chemical intuition; use of visualization tools
– insights about strengths and limitations of existing methods
• Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns
– translate expectations about what is relevant, dependencies, smoothing, sources of noise...