![Page 1: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/1.jpg)
Prediction of proteins that participate in the learning process by machine learning
Dan Evron, Miri Michaeli
Project Advisors: Dr. Gal Chechik, Ossnat Bar Shira
![Page 2: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/2.jpg)
Biological Background
• A synapse is a junction between 2 neurons.
• How does synaptic transmission work?
![Page 3: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/3.jpg)
Hebbian theory
Donald Hebb:
"When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased"
![Page 4: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/4.jpg)
Synaptic Plasticity
• Synaptic plasticity is the ability of a synapse to change in strength through molecular alterations.
• What kind of alterations happen during synaptic plasticity?
![Page 5: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/5.jpg)
Synaptic Plasticity changes
• Presynaptic release probability.
• The number of postsynaptic receptors.
• Properties of postsynaptic receptors.
Examples:
• Change in the probability of glutamate release.
• Insertion or removal of postsynaptic AMPA receptors.
• Phosphorylation and de-phosphorylation inducing a change in AMPA receptor conductance.
![Page 6: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/6.jpg)
What is the connection to learning and memory?
Synaptic plasticity is one of the important neurochemical foundations of learning and memory.
![Page 7: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/7.jpg)
Learning in Aplysia
• Habituation
• Sensitization
• Classical conditioning
• All are found in the gill withdrawal reflex!
• Kandel’s work connects organism-level learning to cellular-level learning!
![Page 8: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/8.jpg)
And what about us?
In mammals:
• Many of the pathways are far from understood.
• The nervous system is much bigger and more complex.
• Research shows that many principles are the same (LTP/LTD in the hippocampus).
![Page 9: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/9.jpg)
Project Idea & Goal
Biological research has found many proteins that are connected to the biological pathways involved in learning in the neuron and synapse. Yet these pathways are far from understood, and many components are missing.
Our goal is to find candidate proteins that may take part in these pathways and have not been discovered yet.
![Page 10: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/10.jpg)
How will we do that?
1. Collect numerical data on the organism's proteins.
2. Collect ontologies about synaptic plasticity.
3. Label each gene as related / not related to synaptic ontologies (according to the data).
4. Use an SVM as the classifier.
5. Search for false positive genes in the results.
6. Publish a great article and win a Nobel prize! (or just dream about it…)
![Page 11: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/11.jpg)
Our research organism is… Mus musculus
AKA...
The house mouse!
![Page 12: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/12.jpg)
Tools & Databases
GEO (Gene Expression Omnibus)
MGI (Mouse Genome Informatics)
GO (Gene Ontology)
MPPDB (Mouse Protein-Protein Interaction Database)
SynDB (Synapse Database)
![Page 13: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/13.jpg)
Tools & Databases
• Classifier: SVM (Support Vector Machine)
![Page 14: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/14.jpg)
The project had 2 main phases:
• Phase 1:
– Work only on PPI data
– Create a baseline for further work
• Phase 2:
– Increase our PPI data
– Add another data type: gene expression
– Combine the PPI and GE data
– Try to improve prediction!
![Page 15: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/15.jpg)
Phase 1:
• Extract PPI data from BioGRID
• Label the matrix for each ontology
• Perform SVM algorithm on the sets
• Calculate baseline
![Page 16: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/16.jpg)
Phase 1 - results
• Most ontologies had only a few related genes, which is problematic.
• Baseline:
[Figure: bar chart “baseline SVM prediction” (vertical axis 0 to 90) per ontology: endoplasmic reticulum, ion channel activity, G protein coupled receptor protein signaling pathway]
![Page 17: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/17.jpg)
Phase 2
Will another type of data improve the results?
Gene expression
![Page 18: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/18.jpg)
Step 1 - extracting data
– Representative set of mouse proteins from MGI.
– Gene Expression data from experiments related to synaptic and neuronal learning.
– Mouse protein-protein interaction (PPI) data from several databases.
– Gene ontologies from GO.
– Synaptic ontologies from SynDB.
![Page 19: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/19.jpg)
Step 2 – processing data
• Each gene expression dataset comes in separate files, which need to be combined.
• Normalize the gene expression data.
• Create the PPI matrix.
• Map the PPI proteins to genes.
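The two matrix-building steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the slides do not say which normalization was used, so per-gene z-scoring is an assumption, and the function names `zscore_rows` and `ppi_matrix` are hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Standardize each gene's expression profile to mean 0 and std 1.

    expr maps gene -> list of expression values across samples.
    Z-scoring is an assumed choice of normalization.
    """
    out = {}
    for gene, values in expr.items():
        mu, sigma = mean(values), pstdev(values)
        # A constant profile has no variance; map it to all zeros.
        out[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return out


def ppi_matrix(pairs):
    """Build a symmetric 0/1 adjacency matrix from interaction pairs."""
    genes = sorted({g for pair in pairs for g in pair})
    index = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for a, b in pairs:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return genes, matrix
```

Each row of the adjacency matrix can then serve as that gene's PPI feature vector.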
![Page 20: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/20.jpg)
Step 3 - combine the data
According to the list of genes:
– Matrix that combines PPI & GE, where each gene has at least one data type (“union”)
– Matrix that combines PPI & GE, where each gene has both data types (“intersect”)
– PPI matrices from the two matrices above
– GE matrices from the two matrices above
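A small sketch of the “union” and “intersect” combinations described above. The function name `combine` is hypothetical, and filling missing features with zeros in the union case is an assumption; the slides do not say how missing data was handled.

```python
def combine(ppi, ge, mode):
    """Concatenate PPI and GE feature vectors per gene.

    ppi, ge: dicts mapping gene -> list of floats (one length per dict).
    mode: "union" keeps genes with at least one data type (missing
    features are zero-filled, which is an assumption); "intersect"
    keeps only genes that have both data types.
    """
    if mode == "union":
        genes = sorted(set(ppi) | set(ge))
    elif mode == "intersect":
        genes = sorted(set(ppi) & set(ge))
    else:
        raise ValueError(f"unknown mode: {mode}")
    n_ppi = len(next(iter(ppi.values()), []))
    n_ge = len(next(iter(ge.values()), []))
    return {
        g: ppi.get(g, [0.0] * n_ppi) + ge.get(g, [0.0] * n_ge)
        for g in genes
    }
```

The PPI-only and GE-only matrices for the same gene list are then just the first `n_ppi` or last `n_ge` columns of the combined matrix.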
![Page 21: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/21.jpg)
Step 4 - labeling the data
• For each set and each ontology, we labeled the genes (related / not related).
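Labeling can be pictured as one binary label vector per ontology term. The sketch below assumes annotations are available as a mapping from gene to a set of term names; the function name `label_genes` and the +1 / -1 encoding are illustrative choices, not taken from the slides.

```python
def label_genes(genes, annotations, term):
    """Return +1 for genes annotated with `term` and -1 otherwise.

    annotations maps gene -> set of ontology terms (assumed layout).
    Genes without any annotation are treated as not related.
    """
    return {g: 1 if term in annotations.get(g, set()) else -1 for g in genes}
```

Running this once per ontology term gives one labeled training set per term.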
![Page 22: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/22.jpg)
Step 5 - perform SVM algorithm on the sets
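The slides do not state which SVM implementation or kernel was used. As a rough illustration only, here is a linear soft-margin SVM trained by batch subgradient descent on the regularized hinge loss; the function names and the hyperparameters `lam`, `lr` and `epochs` are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear soft-margin SVM by batch subgradient descent.

    X: list of feature vectors; y: list of labels in {+1, -1}.
    Minimizes lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        hinge_w = [0.0] * d  # sum of y_i * x_i over margin violators
        hinge_b = 0          # sum of y_i over margin violators
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:
                for j in range(d):
                    hinge_w[j] += yi * xi[j]
                hinge_b += yi
        # Step against the subgradient: shrink w, pull toward violators.
        w = [wj - lr * (lam * wj - hj / n) for wj, hj in zip(w, hinge_w)]
        b += lr * hinge_b / n
    return w, b


def decision_value(w, b, x):
    """Signed score; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

The continuous decision values, rather than the hard class labels, are what a ROC analysis and a false positive ranking would be built on.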
![Page 23: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/23.jpg)
Step 6 - process the results
• Evaluate prediction success (AUC).
• Find potential false positive candidates.
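One way to read "false positive candidates": genes labeled as not related that the classifier nevertheless scores as related. The sketch below ranks them by score; the function name, the input layout and the threshold of 0 are assumptions for illustration.

```python
def false_positive_candidates(scores, labels, threshold=0.0):
    """List genes labeled -1 whose classifier score exceeds the threshold.

    scores maps gene -> decision value; labels maps gene -> +1 / -1.
    Returns (gene, score) pairs, highest score first.
    """
    hits = [
        (gene, score)
        for gene, score in scores.items()
        if labels[gene] == -1 and score > threshold
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

The top of this list is where unannotated but plausible genes would be expected to appear.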
So how did we do?
First we have to build a ROC curve...
![Page 24: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/24.jpg)
What is ROC?
ROC = Receiver Operating Characteristic.
• Our SVM builds a ROC curve: a plot of sensitivity (true positive rate) against 1 − specificity (false positive rate) as the decision threshold varies.
• After classification, the SVM run calculates the AUC of the ROC curve it produced.
![Page 25: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/25.jpg)
What is AUC?
• AUC = Area Under the Curve.
• The AUC is a way to evaluate the quality of the learning model: it is the probability that a randomly chosen related gene is scored above a randomly chosen unrelated gene.
• The AUC ranges from 0 to 1. A value of 0.5 means the classifier does no better than chance (equivalent to tossing a coin!), and 1 indicates perfect separation. Useful classifiers score between 0.5 and 1.
• The AUC enables us to examine and compare SVM results.
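For intuition, the AUC can be computed directly from its ranking interpretation, without drawing the curve. This is a minimal sketch (quadratic in the number of genes, so only suitable for small examples); the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: +1 for related genes, anything else for unrelated ones.
    scores: classifier scores, higher meaning "more likely related".
    Ties count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, with labels `[1, 1, -1, -1]` and scores `[0.9, 0.4, 0.5, 0.1]`, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.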
![Page 26: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/26.jpg)
Results
• Intersect of the data:
– Size of all 3 matrices is similar, which enables comparison.
– Average AUC:
GE alone: 75%
PPI alone: 63%
GE + PPI: 75%
![Page 27: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/27.jpg)
Results - intersect
Comparison of AUC in GE, PPI and GE+PPI
[Figure: bar chart of AUC (0.00 to 0.90) per GO term for PPI, GE and GE + PPI. GO terms: endosome, mitochondrial inner membrane, G-protein coupled receptor activity, G-protein coupled receptor protein, ion channel activity, voltage-gated ion channel activity]
![Page 28: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/28.jpg)
Results
• Union of the data:
– Close to reality in number of genes (14K in matrices, 15K in the representative list)
– Average AUC: GE alone = GE + PPI = 74%
– The matrix size issue
– PPI alone corresponded to different GO categories, so it cannot be compared.
![Page 29: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/29.jpg)
Results - union
Comparison of AUC in GE and GE+PPI
[Figure: bar chart of AUC (0.00 to 1.00) per GO term for GE and GE + PPI. GO terms: endoplasmic reticulum membrane, mitochondrial respiratory chain, calcium channel activity, extracellular ligand-gated ion channel activity, cation channel activity, synaptic vesicle, neurotransmitter receptor activity]
![Page 30: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/30.jpg)
Conclusions
• We can compare different types of data only on the “intersect” matrices.
• In the intersect, the PPI data sets the size, so all matrices share the same GO categories.
• In the union, the GE data dominates the size, which is why the GO categories differ (the GO categories in both PPI matrices are the same).
• PPI did not contribute to prediction!
(bad news…)
![Page 31: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/31.jpg)
The good news…
• Still, an average AUC of 75% is a nice result!
• We found several false positive genes that may be related to synaptic plasticity but have not yet been identified as such.
Examples:
– Neurogranin (NRGN)
– CADPS
![Page 32: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/32.jpg)
Neurogranin (NRGN)
Acts as a "third messenger" substrate of protein kinase C-mediated molecular cascades during synaptic development and remodeling. Binds to calmodulin in the absence of calcium.
![Page 33: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/33.jpg)
Ca++-dependent secretion activator (CADPS)
Calcium-binding protein involved in exocytosis of vesicles filled with neurotransmitters and neuropeptides. Probably acts upstream of fusion in the biogenesis or maintenance of mature secretory vesicles.
![Page 34: prediction of proteins that participate in learning process by machine learning](https://reader035.vdocuments.net/reader035/viewer/2022062423/568145ce550346895db2d631/html5/thumbnails/34.jpg)
Next steps...
• Computationally:
– Improve the classification by adding new types of data and/or a different representation of the data.
• Biologically:
– Explore the proteins we have found (the FP list) through biological experiments.