Machine Learning Challenges in Location Proteomics
Robert F. Murphy
Departments of Biological Sciences and Biomedical Engineering &
Center for Automated Learning and Discovery
Carnegie Mellon University
Protein characteristics relevant to systems approach
• sequence
• structure
• expression level
• activity
• partners
• location
Subcellular locations from major protein databases
• Giantin
  • Entrez: /note="a new 376kD Golgi complex outher membrane protein"
  • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
• GPP130
  • Entrez: /note="GPP130; type II Golgi membrane protein"
  • SwissProt: nothing
More questions than answers
• We learned that Giantin and GPP130 are both Golgi proteins, but do we know:
  • What part (i.e., cis, medial, trans) of the Golgi complex they are each found in?
  • If they have the same subcellular distribution?
  • If they are also found in other compartments?
Vocabulary is part of the problem
• Different investigators may use different terms to refer to the same pattern, or the same term to refer to different patterns
• Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made
SWALL entries for giantin and gpp130
ID GIAN_HUMAN STANDARD; PRT; 3259 AA.
AC Q14789; Q14398;
GN GOLGB1.
DR GO; GO:0000139; C:Golgi membrane; TAS.
DR GO; GO:0005795; C:Golgi stack; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS.
ID O00461 PRELIMINARY; PRT; 696 AA.
AC O00461;
GN GPP130.
DR GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR GO; GO:0005801; C:Golgi cis-face; TAS.
DR GO; GO:0005796; C:Golgi lumen; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
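To make the comparison of the two entries concrete, here is a minimal Python sketch that parses the GO lines shown above and lists the cellular-component terms the two proteins share. The regular expression, the function name `component_terms`, and the use of Jaccard overlap as a summary are illustrative assumptions, not part of the original talk.

```python
import re

# GO cross-reference lines copied from the two SWALL entries above.
GIANTIN = """\
DR GO; GO:0000139; C:Golgi membrane; TAS.
DR GO; GO:0005795; C:Golgi stack; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS.
"""
GPP130 = """\
DR GO; GO:0005810; C:endocytotic transport vesicle; TAS.
DR GO; GO:0005801; C:Golgi cis-face; TAS.
DR GO; GO:0005796; C:Golgi lumen; TAS.
DR GO; GO:0016021; C:integral to membrane; TAS.
"""

# Hypothetical pattern for the line layout shown above (assumption).
GO_LINE = re.compile(r"^DR\s+GO;\s+(GO:\d{7});\s+([CFP]):(.+?);\s+(\w+)\.$")


def component_terms(entry):
    """Return {GO id: term name} for cellular-component (C) annotations."""
    terms = {}
    for line in entry.splitlines():
        match = GO_LINE.match(line.strip())
        if match and match.group(2) == "C":
            terms[match.group(1)] = match.group(3)
    return terms


giantin = component_terms(GIANTIN)
gpp130 = component_terms(GPP130)
shared = set(giantin) & set(gpp130)
jaccard = len(shared) / len(set(giantin) | set(gpp130))

print("shared terms:", sorted(shared))
print("Jaccard overlap:", jaccard)
```

The only shared term is the generic "integral to membrane", which says little about how similar the two Golgi patterns actually look.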
Words are not enough
• Still don’t know how similar the location patterns of these proteins are
• Restricted vocabularies do not provide the necessary complexity and specificity
Needed: Systematic Approach
• Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
• Distinct from drug screening by low-resolution microscopy
• Need to advance past “cartoon” view of subcellular location
• Need systematic, quantitative approach to protein location
First Decision Point
• Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
  • different cells have different shapes, sizes, orientations
  • organelles within cells are not found in fixed locations
• Therefore, use feature-based methods rather than (pixel) model-based methods
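A toy illustration of this decision point, as a sketch only: the same two "organelles" are drawn at a different position in a second image. Pixel-by-pixel overlap drops to zero, while two simple position-independent features are unchanged. The image, the helper names, and the choice of features are hypothetical and only cover translation, not changes in shape or orientation.

```python
def shift(image, dy, dx):
    """Translate a list-of-lists image, filling vacated pixels with 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def pixel_overlap(a, b):
    """Fraction of foreground pixels of `a` that are also foreground in `b`."""
    foreground = [(y, x) for y, row in enumerate(a) for x, v in enumerate(row) if v]
    return sum(1 for y, x in foreground if b[y][x]) / len(foreground)


def simple_features(image):
    """Two position-independent features: foreground area and total intensity."""
    values = [v for row in image for v in row if v]
    return (len(values), sum(values))


# Hypothetical 8x8 "cell": one 2x2 object and one single-pixel object.
cell = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 5, 5, 0, 0, 0, 0, 0],
    [0, 5, 5, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
moved = shift(cell, 0, 4)  # same pattern, four pixels to the right

print("pixel overlap:", pixel_overlap(cell, moved))
print("features:", simple_features(cell), simple_features(moved))
```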
Input Images
• Created 2D image database for HeLa cells
• Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
• Included classes that are similar to each other
Example 2D Images of HeLa
Features: SLF
• Developed sets of Subcellular Location Features (SLF) containing features of different types
• Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear)
• First type of features derived from morphological image processing - finding objects by automated thresholding
Features: Morphological
• Number of fluorescent objects per cell
• Variance of the object sizes
• Ratio of the largest object to the smallest
• Average distance of objects to the ‘center of fluorescence’
• Average “roundness” of objects
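The morphological features listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the threshold is passed in by hand (the talk's automated thresholding method is not specified here), objects are 4-connected, variance is the population variance, and "roundness" is left out. The function names and the test image are hypothetical.

```python
from collections import deque
from math import hypot


def find_objects(image, threshold):
    """Group pixels brighter than `threshold` into 4-connected objects."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for y0 in range(h):
        for x0 in range(w):
            if seen[y0][x0] or image[y0][x0] <= threshold:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


def morphological_features(image, threshold):
    """Compute four of the listed features (needs at least one object)."""
    objects = find_objects(image, threshold)
    sizes = [len(obj) for obj in objects]
    mean_size = sum(sizes) / len(sizes)
    # Center of fluorescence: intensity-weighted centroid of object pixels.
    total = sum(image[y][x] for obj in objects for y, x in obj)
    cy = sum(y * image[y][x] for obj in objects for y, x in obj) / total
    cx = sum(x * image[y][x] for obj in objects for y, x in obj) / total
    distances = []
    for obj in objects:
        oy = sum(y for y, _ in obj) / len(obj)
        ox = sum(x for _, x in obj) / len(obj)
        distances.append(hypot(oy - cy, ox - cx))
    return {
        "object_count": len(objects),
        "size_variance": sum((s - mean_size) ** 2 for s in sizes) / len(sizes),
        "largest_to_smallest": max(sizes) / min(sizes),
        "mean_distance_to_cof": sum(distances) / len(distances),
    }


# Hypothetical 6x6 image with three bright objects.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 7],
    [0, 0, 0, 0, 0, 0],
    [0, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
features = morphological_features(image, threshold=5)
print(features)
```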
Features: Haralick texture
• Give information on correlations in intensity between adjacent pixels to answer questions like
  • is the pattern more like a checkerboard or alternating stripes?
  • is the pattern highly organized (ordered) or more scattered (disordered)?
Example: Difference detected by texture feature “entropy”
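As a rough illustration of the "entropy" texture feature, the sketch below builds a gray-level co-occurrence matrix for horizontally adjacent pixels and computes its entropy. The single offset, the base-2 logarithm, and the two tiny binary test patterns are assumptions made for the example; a full Haralick feature set would typically average over several directions.

```python
from collections import Counter
from math import log2


def glcm_entropy(image, offset=(0, 1)):
    """Entropy (in bits) of the co-occurrence matrix for one pixel offset."""
    dy, dx = offset
    h, w = len(image), len(image[0])
    counts = Counter()
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                a, b = image[y][x], image[ny][nx]
                # Count each pair in both directions so the matrix is symmetric.
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Ordered pattern: alternating vertical stripes.
stripes = [[0, 1, 0, 1] for _ in range(4)]
# Less ordered pattern with the same two gray levels (hypothetical).
scattered = [
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]

print("stripes:", glcm_entropy(stripes))
print("scattered:", glcm_entropy(scattered))
```

The ordered stripes use only two kinds of neighbor pairs, so their entropy is lower than that of the scattered pattern, which uses all four.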
Features: Zernike moment
• Measure degree to which pattern matches a particular Zernike polynomial
• Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern
Examples of Zernike Polynomials
Z(2,0) Z(4,4) Z(10,6)
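For readers who want to see what such a feature looks like in code, here is a minimal sketch of a Zernike moment magnitude using the standard radial polynomial definition. The mapping of the image onto the unit disk, the lack of pixel-area normalization, and the test image are simplifying assumptions. The magnitude is used because it does not change when the pattern is rotated, which the example checks for a 90-degree rotation.

```python
from cmath import exp
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires n - |m| even and >= 0."""
    m = abs(m)
    return sum(
        (-1) ** s * factorial(n - s)
        / (factorial(s) * factorial((n + m) // 2 - s) * factorial((n - m) // 2 - s))
        * rho ** (n - 2 * s)
        for s in range((n - m) // 2 + 1)
    )


def zernike_magnitude(image, n, m):
    """|Z_nm| of a square image mapped onto the unit disk (unnormalized sketch)."""
    size = len(image)
    center = (size - 1) / 2
    radius = size / 2
    total = 0j
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            dy, dx = (y - center) / radius, (x - center) / radius
            rho = hypot(dx, dy)
            if value and rho <= 1:
                total += value * radial_poly(n, m, rho) * exp(-1j * m * atan2(dy, dx))
    return abs(total) * (n + 1) / pi


def rotate90(image):
    """Rotate a square list-of-lists image by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


# Hypothetical 7x7 pattern with two small objects.
image = [[0] * 7 for _ in range(7)]
image[1][2] = image[1][3] = image[2][2] = 1
image[5][4] = 2
rotated = rotate90(image)

for n, m in [(2, 0), (4, 4), (10, 6)]:
    print((n, m), zernike_magnitude(image, n, m), zernike_magnitude(rotated, n, m))
```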
Subcellular Location Features: 2D
• Morphological features
• Haralick texture features
• Zernike moment features
• Geometric features
• Edge features
2D Classification Results
Overall accuracy = 92% (95% for major patterns)
Rows: true class; columns: output of the classifier (values in %)
True class   DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA           99    1    0    0    0    0    0    0    0    0
ER             0   97    0    0    0    2    0    0    0    1
Gia            0    0   91    7    0    0    0    0    2    0
Gpp            0    0   14   82    0    0    2    0    1    0
Lam            0    0    1    0   88    1    0    0   10    0
Mit            0    3    0    0    0   92    0    0    3    3
Nuc            0    0    0    0    0    0   99    0    1    0
Act            0    0    0    0    0    0    0  100    0    0
TfR            0    1    0    0   12    2    0    1   81    2
Tub            1    2    0    0    0    1    0    0    1   95
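The summary figures can be checked against the confusion matrix with a short Python sketch. Two assumptions are made here that the slide does not state: every class contributes equally to the overall accuracy (so it is the mean of the diagonal), and "major patterns" means that confusions between the two Golgi proteins (Gia and Gpp) count as correct. Under those assumptions the matrix gives 92.4% and 94.5%, consistent with the rounded 92% and 95% above.

```python
CLASSES = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]

# Rows: true class; columns: classifier output (percent), from the table above.
CONFUSION = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 97, 0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 91, 7, 0, 0, 0, 0, 2, 0],
    [0, 0, 14, 82, 0, 0, 2, 0, 1, 0],
    [0, 0, 1, 0, 88, 1, 0, 0, 10, 0],
    [0, 3, 0, 0, 0, 92, 0, 0, 3, 3],
    [0, 0, 0, 0, 0, 0, 99, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 1, 0, 0, 12, 2, 0, 1, 81, 2],
    [1, 2, 0, 0, 0, 1, 0, 0, 1, 95],
]


def overall_accuracy(matrix):
    """Mean of the diagonal, assuming equally sized classes (assumption)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def accuracy_with_merged_pair(matrix, labels, a, b):
    """Accuracy when confusing class `a` with class `b` counts as correct."""
    i, j = labels.index(a), labels.index(b)
    correct = [row[k] for k, row in enumerate(matrix)]
    correct[i] += matrix[i][j]
    correct[j] += matrix[j][i]
    return sum(correct) / len(correct)


overall = overall_accuracy(CONFUSION)
merged = accuracy_with_merged_pair(CONFUSION, CLASSES, "Gia", "Gpp")
print("overall:", overall)
print("Golgi classes merged:", merged)
```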
Human Classification Results
Overall accuracy = 83% (92% for major patterns)
Rows: true class; columns: output of the classifier (values in %)
True class   DNA   ER  Gia  GPP  LAM  Mit  Nuc  Act  TfR  Tub
DNA          100    0    0    0    0    0    0    0    0    0
ER             0   90    0    0    3    6    0    0    0    0
Giantin        0    0   56   36    3    3    0    0    0    0
GPP130         0    0   53   43    0    0    0    0    3    0
LAMP2          0    0    6    0   73    0    0    0   20    0
Mitochond.     0    3    0    0    0   96    0    0    0    0
Nucleolin      0    0    0    0    0    0  100    0    0    0
Actin          0    0    0    0    0    0    0  100    0    0
TfR            0   13    0    0    3    0    0    0   83    0
Tubulin        0    3    0    0    0    0    0    3    0   93
Computer vs. Human
[Figure: plot of Human Accuracy against Computer Accuracy; both axes run from 40 to 100]
Extending to 3D: Labeling approach
• Total protein labeled with Cy5 reactive dye
• DNA labeled with PI
• Specific proteins labeled with primary Ab + Alexa488 conjugated secondary Ab
3D Image Set
[Figure panels: Giantin, Nuclear, ER, Lysosomal, gpp130, Actin, Mitoch., Nucleolar, Tubulin, Endosomal]
New features to measure “z” asymmetry
• 2D features treated x and y equivalently
• For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”), z should be treated differently (“top” and “bottom” are not the same)
• We designed features to separate distance measures into x-y component and z component
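The idea of splitting a distance into an x-y component and a z component can be sketched as follows. This is an illustration only: the function names, the use of a signed z offset, and the choice of a DNA center as reference point are assumptions, and the voxel coordinates are made up.

```python
from math import hypot


def center_of_fluorescence(voxels):
    """Intensity-weighted centroid of a {(x, y, z): intensity} mapping."""
    total = sum(voxels.values())
    return tuple(
        sum(coord[i] * value for coord, value in voxels.items()) / total
        for i in range(3)
    )


def split_distance(point, reference):
    """Return (unsigned in-plane x-y distance, signed z offset)."""
    dx, dy, dz = (p - r for p, r in zip(point, reference))
    return hypot(dx, dy), dz


# Hypothetical protein signal and reference point (e.g., center of the DNA stain).
protein = {(3, 4, 5): 2.0, (3, 4, 7): 2.0}
dna_center = (0.0, 0.0, 2.0)

xy, z = split_distance(center_of_fluorescence(protein), dna_center)
print("x-y distance:", xy, "z offset:", z)

# Mirroring in x leaves both components unchanged; flipping z changes the sign.
print(split_distance((-3.0, 4.0, 6.0), dna_center))
print(split_distance((3.0, 4.0, -2.0), dna_center))
```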
Classification Results for 3D images
Overall accuracy = 97%
How to do even better
• Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
• Can simulate this by classifying sets of cells from the same microscope slide
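One simple way to classify a set of cells is to classify each cell separately and take a plurality vote. The sketch below assumes that scheme (the talk does not say how the set decision is made) and also computes, under the further assumption that errors on individual cells are independent, how often more than half of the cells in a set would be classified correctly. The labels are hypothetical.

```python
from collections import Counter
from math import comb


def classify_set(single_cell_labels):
    """Plurality vote over the labels predicted for the cells in one set."""
    return Counter(single_cell_labels).most_common(1)[0][0]


def majority_correct_probability(p, set_size):
    """P(more than half of `set_size` independent cells are correct).

    A strict majority of correct cells guarantees a correct plurality vote,
    so this is a lower bound on set accuracy under the independence assumption.
    """
    need = set_size // 2 + 1
    return sum(
        comb(set_size, k) * p ** k * (1 - p) ** (set_size - k)
        for k in range(need, set_size + 1)
    )


# Hypothetical single-cell predictions for nine cells from one slide.
votes = ["Gia", "Gpp", "Gia", "Gia", "Gpp", "Gia", "Gia", "Gpp", "Gia"]
print(classify_set(votes))
print(majority_correct_probability(0.9, 9))
```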
Classification of Sets of 3D Images
Set size 9, Overall accuracy = 99.7%
Rows: true class; columns: predicted class (values in %)
True class   Tub  Endo  Actin  Nucl  Mito  Lyso  Gpp  Gia   ER  DNA
Tub           99     0      0     0     0     0    0    0    0    0
Endo           0   100      0     0     0     0    0    0    0    0
Actin          0     0    100     0     0     0    0    0    0    0
Nucl           0     0      0   100     0     0    0    0    0    0
Mito           0     0      0     0   100     0    0    0    0    0
Lyso           0     0      0     0     0   100    0    0    0    0
Gpp            0     0      0     0     0     0   99    0    0    0
Gia            0     0      0     0     0     0    0  100    0    0
ER             0     0      0     0     0     0    0    0   99    0
DNA            0     0      0     0     0     0    0    0    0  100
First Conclusion
• Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…
Subcellular Location Image Finder
(Have automated system for finding images in on-line journal articles that match a particular pattern - enables connection between new images and previously published results)
[Diagram: Subcellular Location Image Finder pipeline.
Data items: Figure, Caption, Panels, Scope, Annotated Scopes, Annotated Panels, Image Ptr, Panel labels.
Processing steps: Caption understanding; Entity extraction (proteins, cells, drugs, experimental conditions, …); Panel splitting; Label finding; Panel classification, Micrograph analysis (image type, image scale, subcellular pattern analysis…); Label Matching (alignment between caption entities and panels).
Citations shown: [Murphy et al, 2001], [Cohen et al, 2003]]
Image Similarity
• Classification power of features implies that they capture essential characteristics of protein patterns
• Can be used to measure similarity between patterns
Clustering by Image Similarity
• Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective framework for describing subcellular locations
• Ideal for database references
• One way is by creating a Subcellular Location Tree
• Illustration: Build hierarchical dendrogram
Subcellular Location Tree for 10 classes in HeLa cells
Do this for all proteins: Location Proteomics
• Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: Infect population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell
• Isolate separate clones, each of which expresses one tagged protein
• Use RT-PCR to identify tagged gene in each clone
• Collect images of many cells for each clone using fluorescence microscopy
Example images of CD-tagged clones
(A) Glut1 gene (type 1 glucose transporter)
(B) Tmpo gene (thymopoietin)
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal protein S11)
(G) Hmga1 gene (high mobility group AT-hook 1)
(H) Col1a2 gene (procollagen type I α2)
(I) Atp5a1 gene (ATP synthase isoform 1)
Proof of principle
• Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns
Feature selection
• Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins
• Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
• Best performance obtained with 10 features
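The rank-then-grow procedure above can be sketched in Python. Note the substitution: instead of Stepwise Discriminant Analysis, this sketch ranks each feature on its own by a one-way ANOVA F statistic, which is a much simpler stand-in and ignores correlations between features. The toy data, class names, and function names are hypothetical, and the classifier training step is only indicated by the nested feature subsets it would receive.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for one feature.

    `groups` holds one list of feature values per class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (n - k)
    return between / within


def rank_features(data):
    """Return feature indices, most discriminating first.

    `data` maps a class name to a list of feature vectors.
    """
    n_features = len(next(iter(data.values()))[0])
    scores = [
        f_statistic([[vec[j] for vec in vectors] for vectors in data.values()])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical data: two clones, three cells each, three features per cell.
data = {
    "clone_a": [[1.0, 0.1, 2.0], [2.0, 0.2, 3.0], [3.0, 0.3, 2.5]],
    "clone_b": [[1.5, 5.1, 3.0], [2.5, 5.2, 3.5], [2.0, 5.3, 4.0]],
}
ranking = rank_features(data)

# Nested subsets of increasing size; each would be used to train and
# evaluate a classifier, keeping the size with the best accuracy.
subsets = [ranking[:k] for k in range(1, len(ranking) + 1)]
print(ranking)
print(subsets)
```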
Tree building
• Therefore use these 10 features with z-scored Euclidean distance function to build SLT
• Find optimal number of clusters using k-means clustering and AIC
• Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)
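A minimal sketch of the first step, z-scored Euclidean distance followed by hierarchical clustering, is shown below. It assumes one mean feature vector per clone and uses average linkage, which is a choice made for this example; the k-means/AIC step and the consensus over random half-splits are not covered. Clone names and feature values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit variance.

    Assumes no feature is constant across rows (that would divide by zero).
    """
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)] for row in rows]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_tree(names, rows):
    """Average-linkage agglomerative clustering; returns nested tuples."""
    z = zscore_columns(rows)
    clusters = [(name, [vec]) for name, vec in zip(names, z)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(
                    euclidean(a, b)
                    for a in clusters[i][1]
                    for b in clusters[j][1]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical mean feature vectors for four clones (two features each).
names = ["clone_a", "clone_b", "clone_c", "clone_d"]
rows = [[1.0, 10.0], [1.2, 11.0], [5.0, 50.0], [5.5, 48.0]]
tree = build_tree(names, rows)
print(tree)
```

Z-scoring matters here because the second feature is about ten times larger than the first; without it, that feature would dominate the distance.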
Consensus Subcellular Location Tree
Examples from major clusters
Significance
• Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM)
• Can subdivide clusters by observing response to drugs, oncogenes, etc.
• These represent protein location states
• Base knowledge required for modeling
• Can be used to filter protein interactions
From patterns to causes
• Machine learning approaches have previously been used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles
• High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group
• Can include post-translational modifications
More Conclusions
• Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins
• Prohibitive combinatorial complexity makes colocalization approach infeasible, so major effort should focus on one protein at a time
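The combinatorial argument is easy to quantify. The sketch below counts how many distinct imaging experiments are needed when proteins are imaged one at a time versus in pairs. The proteome size of 10,000 is a round number chosen purely for illustration, not a figure from the talk.

```python
from math import comb


def imaging_experiments(n_proteins, proteins_per_image):
    """Number of distinct protein combinations that would need to be imaged."""
    return comb(n_proteins, proteins_per_image)


n = 10_000  # hypothetical proteome size, for illustration only
one_at_a_time = imaging_experiments(n, 1)
pairwise = imaging_experiments(n, 2)

print("one protein per image:", one_at_a_time)
print("all pairs (colocalization):", pairwise)
print("ratio:", pairwise / one_at_a_time)
```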
Center for Bioimage Informatics
• $2.75 M CMU funding from NSF ITR
• Joint with UCSB and collaborators at Berkeley and MIT
• R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
• Jelena Kovacevic (Biomedical Engineering)
• Tom Mitchell (CALD)
• Christos Faloutsos (CALD)
Acknowledgments
• Former students
  • Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste
• Current grad students
  • Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
• Funding
  • NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund
• Collaborators/Consultants
  • Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget