Machine Learning Challenges in Location Proteomics

Robert F. Murphy

Departments of Biological Sciences and Biomedical Engineering & Center for Automated Learning and Discovery

Carnegie Mellon University

Protein characteristics relevant to systems approach

• sequence
• structure
• expression level
• activity
• partners
• location

Subcellular locations from major protein databases

• Giantin
  • Entrez: /note="a new 376kD Golgi complex outher membrane protein"
  • SwissProt: INTEGRAL MEMBRANE PROTEIN. GOLGI MEMBRANE.
• GPP130
  • Entrez: /note="GPP130; type II Golgi membrane protein"
  • SwissProt: nothing

More questions than answers

We learned that Giantin and GPP130 are both Golgi proteins, but do we know:

• What part (i.e., cis, medial, trans) of the Golgi complex they are each found in?
• If they have the same subcellular distribution?
• If they are also found in other compartments?

Vocabulary is part of the problem

• Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
• Efforts to create restricted vocabularies (e.g., Gene Ontology consortium) for location have been made

SWALL entries for giantin and gpp130

ID GIAN_HUMAN STANDARD; PRT; 3259 AA.

AC Q14789; Q14398;

GN GOLGB1.

DR GO; GO:0000139; C:Golgi membrane; TAS.

DR GO; GO:0005795; C:Golgi stack; TAS.

DR GO; GO:0016021; C:integral to membrane; TAS.

DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS.

ID O00461 PRELIMINARY; PRT; 696 AA.

AC O00461;

GN GPP130.

DR GO; GO:0005810; C:endocytotic transport vesicle; TAS.

DR GO; GO:0005801; C:Golgi cis-face; TAS.

DR GO; GO:0005796; C:Golgi lumen; TAS.

DR GO; GO:0016021; C:integral to membrane; TAS.
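The two entries above can be compared mechanically, which makes the limitation concrete. The following is a minimal sketch, assuming the `DR GO;` lines are split on semicolons as shown; the function name `component_terms` and the use of a Jaccard index are illustrative choices, not part of the original slides.

```python
# Sketch: compare GO cellular-component ("C:") terms of the two entries above.
ENTRIES = {
    "GIAN_HUMAN": [
        "DR GO; GO:0000139; C:Golgi membrane; TAS.",
        "DR GO; GO:0005795; C:Golgi stack; TAS.",
        "DR GO; GO:0016021; C:integral to membrane; TAS.",
        "DR GO; GO:0007030; P:Golgi organization and biogenesis; TAS.",
    ],
    "O00461": [
        "DR GO; GO:0005810; C:endocytotic transport vesicle; TAS.",
        "DR GO; GO:0005801; C:Golgi cis-face; TAS.",
        "DR GO; GO:0005796; C:Golgi lumen; TAS.",
        "DR GO; GO:0016021; C:integral to membrane; TAS.",
    ],
}


def component_terms(lines):
    """Return {GO id: term name} for cellular-component annotations only."""
    terms = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip(".").split(";")]
        if len(fields) < 4 or fields[0] != "DR GO":
            continue
        aspect, _, name = fields[2].partition(":")
        if aspect == "C":  # skip process ("P:") and function ("F:") terms
            terms[fields[1]] = name
    return terms


giantin = component_terms(ENTRIES["GIAN_HUMAN"])
gpp130 = component_terms(ENTRIES["O00461"])
shared = sorted(set(giantin) & set(gpp130))
# Jaccard index: shared terms divided by all distinct terms.
jaccard = len(shared) / len(set(giantin) | set(gpp130))
print(shared, round(jaccard, 3))
```

The only shared component term is "integral to membrane", so term overlap says little about how similar the two Golgi patterns actually are.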

Words are not enough

• Still don’t know how similar the location patterns of these proteins are
• Restricted vocabularies do not provide the necessary complexity and specificity

Needed: Systematic Approach

• Need new methods for accurately and objectively determining the subcellular location pattern of all proteins
• Distinct from drug screening by low-resolution microscopy
• Need to advance past “cartoon” view of subcellular location
• Need systematic, quantitative approach to protein location

First Decision Point

• Classification by direct (pixel-by-pixel) comparison of individual images to known patterns is not useful, since
  • different cells have different shapes, sizes, orientations
  • organelles within cells are not found in fixed locations
• Therefore, use feature-based methods rather than (pixel) model-based methods
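A toy example illustrates the argument. The sketch below assumes a tiny hand-made "image" and two very simple features (total intensity and bright-pixel count); these are illustrative stand-ins, not the features used in the work described here.

```python
def rotate90(img):
    """Rotate a list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def pixel_distance(a, b):
    """Sum of absolute pixel-by-pixel differences between two images."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def simple_features(img):
    """Two orientation-independent numbers: total intensity, bright-pixel count."""
    flat = [v for row in img for v in row]
    return (sum(flat), sum(1 for v in flat if v > 0))


# Hypothetical cell: one short bright bar on a dark background.
cell = [
    [0, 0, 0, 0],
    [9, 9, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rotated = rotate90(cell)  # same pattern, different orientation

print(pixel_distance(cell, rotated))                         # large
print(simple_features(cell) == simple_features(rotated))     # True
```

The same pattern in a different orientation looks very different pixel by pixel, while the feature values are unchanged.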

Input Images

• Created 2D image database for HeLa cells
• Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
• Included classes that are similar to each other

[Figure: Example 2D Images of HeLa]

Features: SLF

• Developed sets of Subcellular Location Features (SLF) containing features of different types
• Motivated in part by descriptions used by biologists (e.g., punctate, perinuclear)

Features: Morphological

• First type of features derived from morphological image processing: finding objects by automated thresholding
  • Number of fluorescent objects per cell
  • Variance of the object sizes
  • Ratio of the largest object to the smallest
  • Average distance of objects to the ‘center of fluorescence’
  • Average “roundness” of objects
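Several of the morphological features can be sketched in a few lines. This is a minimal illustration, assuming a fixed threshold passed in by the caller (the slides describe automated thresholding), 4-connected objects, and unweighted object centroids; all function names are hypothetical.

```python
from math import hypot
from statistics import pvariance


def find_objects(img, threshold):
    """Group above-threshold pixels into 4-connected objects (flood fill)."""
    rows, cols = len(img), len(img[0])
    seen, objects = set(), []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] <= threshold or (r, c) in seen:
                continue
            stack, pixels = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and img[ny][nx] > threshold and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append(pixels)
    return objects


def morphological_features(img, threshold):
    """Object count, size variance, size ratio, mean distance to center of fluorescence."""
    objects = find_objects(img, threshold)
    sizes = [len(p) for p in objects]
    # Center of fluorescence: intensity-weighted centroid of the whole image.
    total = sum(v for row in img for v in row)
    cy = sum(r * v for r, row in enumerate(img) for v in row) / total
    cx = sum(c * v for row in img for c, v in enumerate(row)) / total
    dists = []
    for p in objects:
        oy = sum(y for y, _ in p) / len(p)
        ox = sum(x for _, x in p) / len(p)
        dists.append(hypot(oy - cy, ox - cx))
    return {
        "n_objects": len(objects),
        "size_variance": pvariance(sizes),
        "size_ratio": max(sizes) / min(sizes),
        "mean_distance": sum(dists) / len(dists),
    }


# Hypothetical image: one 2x2 object and one single-pixel object.
image = [
    [5, 5, 0, 0, 0],
    [5, 5, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 5],
    [0, 0, 0, 0, 0],
]
features = morphological_features(image, threshold=0)
print(features)
```

The sketch expects at least one above-threshold object; "roundness" is omitted because the slides do not define it.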

Features: Haralick texture

• Give information on correlations in intensity between adjacent pixels to answer questions like
  • is the pattern more like a checkerboard or alternating stripes?
  • is the pattern highly organized (ordered) or more scattered (disordered)?

[Figure: Example: Difference detected by texture feature “entropy”]
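The entropy feature can be illustrated on two small binary patterns. This is a simplified sketch, assuming a single horizontal neighbor offset and a symmetric co-occurrence matrix; Haralick features are commonly averaged over several directions, which is not done here.

```python
from collections import Counter
from math import log2


def glcm_entropy(img):
    """Entropy (bits) of the gray-level co-occurrence matrix for horizontal neighbors."""
    counts = Counter()
    for row in img:
        for a, b in zip(row, row[1:]):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # count both directions so the matrix is symmetric
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Ordered pattern: vertical stripes, every neighbor pair is (0, 1) or (1, 0).
stripes = [[0, 1, 0, 1]] * 4
# Less ordered pattern: all four neighbor combinations occur equally often.
scattered = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]
print(glcm_entropy(stripes), glcm_entropy(scattered))
```

The ordered pattern concentrates its co-occurrence counts in fewer cells and so has lower entropy than the scattered one.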

Features: Zernike moment

• Measure degree to which pattern matches a particular Zernike polynomial
• Give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern

Examples of Zernike Polynomials

Z(2,0) Z(4,4) Z(10,6)
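As a rough sketch of what "matching a Zernike polynomial" means, the code below evaluates the standard radial polynomial R(n, m) and the magnitude of a discrete Zernike moment. It assumes the image is square, is mapped onto the unit disk by its pixel centers, and that the pixel-area normalization is dropped (so values are correct only up to a constant scale); the function names are hypothetical.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial(n, m, r):
    """Zernike radial polynomial R_n^m(r), for n >= m >= 0 with n - m even."""
    return sum(
        (-1) ** k * factorial(n - k)
        / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
        * r ** (n - 2 * k)
        for k in range((n - m) // 2 + 1)
    )


def zernike_magnitude(img, n, m):
    """|Z(n,m)| of a square image; pixels outside the unit disk are ignored."""
    half = (len(img) - 1) / 2
    total = 0j
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            u, w = (x - half) / half, (y - half) / half  # coordinates in [-1, 1]
            r = hypot(u, w)
            if r <= 1.0:
                total += v * radial(n, m, r) * cmath.exp(-1j * m * atan2(w, u))
    return abs(total) * (n + 1) / pi


# Hypothetical donut-like pattern on a 5x5 grid.
donut = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
]
print(zernike_magnitude(donut, 2, 0), zernike_magnitude(donut, 4, 4))
```

Using the magnitude discards the phase term, so the value does not change when the pattern is rotated about the image center.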

Subcellular Location Features: 2D

• Morphological features
• Haralick texture features
• Zernike moment features
• Geometric features
• Edge features

2D Classification Results

Overall accuracy = 92% (95% for major patterns)

Rows: true class. Columns: output of the classifier (percent).

       DNA   ER  Gia  Gpp  Lam  Mit  Nuc  Act  TfR  Tub
DNA     99    1    0    0    0    0    0    0    0    0
ER       0   97    0    0    0    2    0    0    0    1
Gia      0    0   91    7    0    0    0    0    2    0
Gpp      0    0   14   82    0    0    2    0    1    0
Lam      0    0    1    0   88    1    0    0   10    0
Mit      0    3    0    0    0   92    0    0    3    3
Nuc      0    0    0    0    0    0   99    0    1    0
Act      0    0    0    0    0    0    0  100    0    0
TfR      0    1    0    0   12    2    0    1   81    2
Tub      1    2    0    0    0    1    0    0    1   95
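The overall accuracy can be recomputed from the confusion matrix. This sketch assumes each row is a percentage and that all classes carry equal weight, so the overall accuracy is the mean of the diagonal; the helper names are illustrative.

```python
CLASSES = ["DNA", "ER", "Gia", "Gpp", "Lam", "Mit", "Nuc", "Act", "TfR", "Tub"]
# Rows: true class; columns: classifier output (percent), copied from the table above.
CONFUSION = [
    [99, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 97, 0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 91, 7, 0, 0, 0, 0, 2, 0],
    [0, 0, 14, 82, 0, 0, 2, 0, 1, 0],
    [0, 0, 1, 0, 88, 1, 0, 0, 10, 0],
    [0, 3, 0, 0, 0, 92, 0, 0, 3, 3],
    [0, 0, 0, 0, 0, 0, 99, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 100, 0, 0],
    [0, 1, 0, 0, 12, 2, 0, 1, 81, 2],
    [1, 2, 0, 0, 0, 1, 0, 0, 1, 95],
]


def overall_accuracy(matrix):
    """Mean of the diagonal (assumes equally weighted classes)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def most_confused(matrix, names):
    """Return (true class, predicted class, percent) for the largest off-diagonal cell."""
    value, i, j = max(
        (matrix[i][j], i, j)
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    )
    return names[i], names[j], value


accuracy = overall_accuracy(CONFUSION)
worst = most_confused(CONFUSION, CLASSES)
print(accuracy, worst)
```

The largest single confusion is between the two Golgi proteins, which is the pair the database annotations could not distinguish either.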

Human Classification Results

Overall accuracy = 83% (92% for major patterns)

Rows: true class. Columns: output of the classifier.

             DNA    ER   Gia   GPP   LAM   Mit   Nuc   Act   TfR   Tub
DNA         100%    0%    0%    0%    0%    0%    0%    0%    0%    0%
ER            0%   90%    0%    0%    3%    6%    0%    0%    0%    0%
Giantin       0%    0%   56%   36%    3%    3%    0%    0%    0%    0%
GPP130        0%    0%   53%   43%    0%    0%    0%    0%    3%    0%
LAMP2         0%    0%    6%    0%   73%    0%    0%    0%   20%    0%
Mitochond.    0%    3%    0%    0%    0%   96%    0%    0%    0%    0%
Nucleolin     0%    0%    0%    0%    0%    0%  100%    0%    0%    0%
Actin         0%    0%    0%    0%    0%    0%    0%  100%    0%    0%
TfR           0%   13%    0%    0%    3%    0%    0%    0%   83%    0%
Tubulin       0%    3%    0%    0%    0%    0%    0%    3%    0%   93%

Computer vs. Human

[Figure: plot with axes Computer Accuracy and Human Accuracy, each running from 40 to 100]

Extending to 3D: Labeling approach

• Total protein labeled with Cy5 reactive dye
• DNA labeled with PI
• Specific proteins labeled with primary Ab + Alexa488 conjugated secondary Ab

3D Image Set

[Figure: example 3D images. Panels: Giantin, Nuclear, ER, Lysosomal, gpp130, Actin, Mitoch., Nucleolar, Tubulin, Endosomal]

New features to measure “z” asymmetry

• 2D features treated x and y equivalently
• For 3D images, while it makes sense to treat x and y equivalently (cells don’t have a “left” and “right”), z should be treated differently (“top” and “bottom” are not the same)
• We designed features to separate distance measures into x-y component and z component
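The separation of a distance into x-y and z components can be sketched as follows. This is an illustration only: the choice of a signed z offset (so that "above" and "below" differ) and the function name are assumptions, not details taken from the slides.

```python
from math import hypot


def distance_components(obj_center, reference):
    """Split a 3D displacement into an unsigned x-y distance and a signed z offset.

    Both arguments are (x, y, z) tuples. The x-y part ignores direction because
    x and y are treated equivalently; the z part keeps its sign.
    """
    dx, dy, dz = (o - r for o, r in zip(obj_center, reference))
    return hypot(dx, dy), dz


# Hypothetical object centroid and reference point (e.g., a center of fluorescence).
xy_dist, z_offset = distance_components((4.0, 6.0, 2.0), (1.0, 2.0, 5.0))
print(xy_dist, z_offset)
```

Swapping the x and y displacements leaves the first value unchanged, while flipping the cell top to bottom changes the sign of the second.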

Classification Results for 3D images

Overall accuracy = 97%

How to do even better

• Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
• Can simulate this by classifying sets of cells from the same microscope slide
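One simple way to classify a set of cells is a plurality vote over the single-cell outputs. The sketch below assumes that rule, and adds an idealized calculation (independent errors, strict majority) of how voting can raise accuracy; neither is confirmed by the slides as the exact procedure used.

```python
from collections import Counter
from math import comb


def classify_set(single_cell_labels):
    """Plurality vote over the single-cell labels of one set (ties go to the first seen)."""
    return Counter(single_cell_labels).most_common(1)[0][0]


def majority_correct_probability(p, n):
    """P(more than half of n independent votes are correct), each correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


# Hypothetical classifier outputs for nine cells from one slide.
votes = ["Gia"] * 6 + ["Gpp"] * 3
print(classify_set(votes))
print(majority_correct_probability(0.9, 9))
```

Requiring a strict majority is a conservative simplification for a ten-class problem, since a plurality can win with fewer than half the votes.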

Classification of Sets of 3D Images

Set size 9, Overall accuracy = 99.7%

Rows: true class. Columns: predicted class.

        DNA   ER  Gia  Gpp Lyso Mito Nucl Actin Endo  Tub
DNA     100    0    0    0    0    0    0     0    0    0
ER        0   99    0    0    0    0    0     0    0    0
Gia       0    0  100    0    0    0    0     0    0    0
Gpp       0    0    0   99    0    0    0     0    0    0
Lyso      0    0    0    0  100    0    0     0    0    0
Mito      0    0    0    0    0  100    0     0    0    0
Nucl      0    0    0    0    0    0  100     0    0    0
Actin     0    0    0    0    0    0    0   100    0    0
Endo      0    0    0    0    0    0    0     0  100    0
Tub       0    0    0    0    0    0    0     0    0   99

First Conclusion

• Description of subcellular locations for systems biology should be implemented using a data-driven approach rather than a knowledge-capture approach, but…

Subcellular Location Image Finder

(Have automated system for finding images in on-line journal articles that match a particular pattern; this enables connection between new images and previously published results)

[Diagram: Subcellular Location Image Finder pipeline. Data items: Figure, Caption, Panels, Scope, Annotated Scopes, Annotated Panels, ImagePtr, Panel labels. Processing steps: caption understanding; entity extraction (proteins, cells, drugs, experimental conditions, …); panel splitting; label finding; label matching; panel classification and micrograph analysis (image type, image scale, subcellular pattern analysis, …). Output: alignment between caption entities and panels. Citations on the diagram: [Murphy et al, 2001], [Cohen et al, 2003]]

Image Similarity

• Classification power of features implies that they capture essential characteristics of protein patterns
• Can be used to measure similarity between patterns

Clustering by Image Similarity

• Ability to measure similarity of protein patterns allows us for the first time to create a systematic, objective framework for describing subcellular locations
• Ideal for database references
• One way is by creating a Subcellular Location Tree
• Illustration: Build hierarchical dendrogram

[Figure: Subcellular Location Tree for 10 classes in HeLa cells]

Do this for all proteins: Location Proteomics

• Can use CD-tagging (developed by Dr. Jonathan Jarvik) to randomly tag many proteins: infect a population of cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random gene in each cell
• Isolate separate clones, each of which expresses one tagged protein
• Use RT-PCR to identify the tagged gene in each clone
• Collect images of many cells for each clone using fluorescence microscopy

Example images of CD-tagged clones

[Figure panels]
(A) Glut1 gene (type 1 glucose transporter)
(B) Tmpo gene (thymopoietin)
(C) tuba1 gene (α-tubulin)
(D) Cald gene (caldesmon 1)
(E) Ncl gene (nucleolin)
(F) Rps11 gene (ribosomal protein S11)
(G) Hmga1 gene (high mobility group AT-hook 1)
(H) Col1a2 gene (procollagen type I α2)
(I) Atp5a1 gene (ATP synthase isoform 1)

Proof of principle

• Cluster 46 clones expressing different tagged proteins based on their subcellular location patterns

Feature selection

• Use Stepwise Discriminant Analysis to rank features based on their ability to distinguish proteins
• Use increasing numbers of features to train neural network classifiers and evaluate classification accuracy over all 46 clones
• Best performance obtained with 10 features
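The idea of ranking features by how well they distinguish proteins can be illustrated with a much simpler stand-in. The sketch below scores each feature on its own with a one-way F ratio (between-class over within-class variance); this is a univariate substitute and not Stepwise Discriminant Analysis, which also accounts for features already selected. The feature names, clone names, and values are hypothetical.

```python
from statistics import mean


def f_ratio(values_by_class):
    """Between-class variance divided by within-class variance for one feature."""
    groups = list(values_by_class.values())
    grand = mean(v for g in groups for v in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n - k)
    return between / within


def rank_features(data):
    """data: {feature: {class: [values]}} -> feature names, most discriminating first."""
    return sorted(data, key=lambda name: f_ratio(data[name]), reverse=True)


# Hypothetical measurements for two clones and two features.
data = {
    "roundness": {"clone_a": [0.5, 0.9, 0.7], "clone_b": [0.6, 0.8, 0.7]},
    "n_objects": {"clone_a": [10, 12, 11], "clone_b": [30, 29, 31]},
}
ranking = rank_features(data)
print(ranking)
```

A ranked list like this can then be truncated at increasing lengths to train and evaluate a classifier, as the slide describes.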

Tree building

• Therefore use these 10 features with z-scored Euclidean distance function to build SLT
• Find optimal number of clusters using k-means clustering and AIC
• Find consensus hierarchical trees by randomly dividing the images for each protein in half and keeping branches conserved between both halves (repeat for 50 random divisions)
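The first step, a z-scored Euclidean distance feeding a hierarchical tree, can be sketched as below. Average linkage is an assumption (the slides do not name the linkage rule), the clone names and feature values are hypothetical, and the k-means/AIC and consensus steps are not shown.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def build_tree(names, rows):
    """Average-linkage agglomerative clustering; returns nested tuples of names."""
    points = zscore_columns(rows)
    # Each cluster is (subtree, member indices into points).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def linkage(a, b):
        return mean(dist(points[i], points[j]) for i in a[1] for j in b[1])

    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical mean feature vectors (two features) for four clones.
names = ["clone_a", "clone_b", "clone_c", "clone_d"]
rows = [[1.0, 10.0], [1.2, 11.0], [8.0, 50.0], [9.5, 80.0]]
tree = build_tree(names, rows)
print(tree)
```

Z-scoring keeps a feature with a large numeric range from dominating the distance, which matters when features of different types are mixed.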

[Figure: Consensus Subcellular Location Tree]

[Figure: Examples from major clusters]

Significance

• Proteins clustered by location analogous to proteins clustered by sequence (e.g., PFAM)
• Can subdivide clusters by observing response to drugs, oncogenes, etc.
• These represent protein location states
• Base knowledge required for modeling
• Can be used to filter protein interactions

From patterns to causes

• Machine learning approaches have been previously used to find localization motifs in protein sequences, but the set of locations used was limited to major organelles
• High-resolution subcellular location trees can be used to discover (recursively) new motifs that determine location of each group
• Can include post-translational modifications

More Conclusions

• Organized data collection approach is required to capture high-resolution information on the subcellular location of all proteins
• Prohibitive combinatorial complexity makes colocalization approach infeasible, so major effort should focus on one protein at a time
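The combinatorial argument is simple arithmetic: imaging every pair of proteins together grows quadratically, while imaging one protein at a time grows linearly. The proteome size used below is a hypothetical round number chosen only for illustration.

```python
from math import comb


def pairwise_experiments(n_proteins):
    """Number of distinct protein pairs, i.e. n * (n - 1) / 2."""
    return comb(n_proteins, 2)


n = 10_000  # hypothetical number of proteins, for illustration only
print(n, pairwise_experiments(n))  # 10000 single-protein sets vs. 49995000 pairs
```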

Center for Bioimage Informatics

• $2.75 M CMU funding from NSF ITR
• Joint with UCSB and collaborators at Berkeley and MIT
• R. Murphy (CALD/Biomed.Eng./Biol.Sci.)
• Jelena Kovacevic (Biomedical Engineering)
• Tom Mitchell (CALD)
• Christos Faloutsos (CALD)

Acknowledgments

• Former students: Michael Boland, Mia Markey, William Dirks, Gregory Porreca, Edward Roques, Meel Velliste
• Current grad students: Kai Huang, Xiang Chen, Ting Zhao, Yanhua Hu, Elvira Garcia Osuna, Zhenzhen Kou, Juchang Hua
• Funding: NSF, NIH, Rockefeller Bros. Fund, PA. Tobacco Settlement Fund
• Collaborators/Consultants: Simon Watkins, David Cassasent, Tom Mitchell, Christos Faloutsos, Jon Jarvik, Peter Berget
