TRANSCRIPT
In silico immune response prediction based on peptide array data
Mitja Luštrek
Institute for Biostatistics and Informatics in Medicine and Aging Research
Overview
• Introduction: peptide arrays, the task of prediction, machine learning
• Immune response prediction: prediction methods and results, insights into the immune system
• Limitations and future work: why we do not do better and how we might
Peptide arrays
• Peptides (antigen)
• Serum (antibodies)
• Immune response
In silico prediction
• Predict in silico which peptides evoke an immune response
• Save costs by putting only the most promising peptides on an array
• Gain insight into the workings of the immune system
Task at hand
• Data:
– 10,218 peptides (15-mers) with negative response
– 3,420 peptides (15-mers) with positive response, multiplied to 10,218 for a balanced data set
• Method:
– machine learning
– 70 % of the data for training, 30 % for testing
Machine learning

Labeled (training) data, where each instance is a peptide:

Attribute 1  Attribute 2  ...  Class
a11          a12          ...  neg
a21          a22               pos
...

A machine learning algorithm turns the labeled data into a classifier:

if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos
Machine learning

Unlabeled (test) data:

Attribute 1  Attribute 2  ...  Class
b11          b12          ...
b21          b22
...

The classifier assigns a class to each instance, producing classified data:

Attribute 1  Attribute 2  ...  Class
b11          b12          ...  pos
b21          b22               neg
...
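The classifier described above is a set of if-then rules applied to attribute vectors. A minimal sketch in Python; the attribute names, values and rules here are invented for illustration, standing in for the output of a learning algorithm:

```python
# Minimal sketch of applying a classifier to unlabeled test data.
# The "classifier" is a hand-written rule list; a learning algorithm
# would produce such rules from the labeled training data.

def classify(instance, rules, default="neg"):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(cond(instance) for cond in conditions):
            return label
    return default

# "if Attribute 1 is such and such and Attribute 2 is such and such
#  ... then Class = pos"
rules = [
    ([lambda x: x["attr1"] >= 2, lambda x: x["attr2"] == 0], "pos"),
]

test_data = [
    {"attr1": 3, "attr2": 0},  # matches the rule
    {"attr1": 1, "attr2": 5},  # matches nothing, falls to the default
]
print([classify(x, rules) for x in test_data])  # ['pos', 'neg']
```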
Support vector machine (SVM)
[Figures: instances plotted against Attribute 1 and Attribute 2; a boundary separating the two classes; Attribute 3 added as a third dimension]
RIPPER rules
Repeated Incremental Pruning to Produce Error Reduction
• Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
• Grow a rule by adding conditions, selected based on information gain:
(Attribute 1 = a1) and (Attribute 2 <= a2) ... Class = c
Conditions are added until all instances matched by the rule belong to class c (the rule never misclassifies). The rule is then perfect on the growing set, but it may overfit it, so we test it on the pruning set.
• Prune a rule by removing conditions, as long as removal improves performance on the pruning set:
(Attribute 1 = a1) ... Class = c
• Grow and prune a rule, delete the instances covered by the rule, and repeat the process until the last rule increases (length of the rules + misclassified instances) by more than a constant
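The grow-and-prune step can be sketched on toy data. This is a strong simplification of RIPPER, not a faithful implementation: conditions are simple equality tests, rule quality is measured by precision rather than information gain, and the global stopping criterion is omitted.

```python
# Strongly simplified sketch of RIPPER's grow-and-prune step:
# conditions (attribute == value) are added greedily until the rule
# covers only the target class, then removed from the end while
# precision on the pruning set does not drop.

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def precision(rule, data, target):
    hit = [cls for x, cls in data if covers(rule, x)]
    return hit.count(target) / len(hit) if hit else 0.0

def grow(grow_set, target, attrs):
    rule = []
    # keep adding conditions while the rule still misclassifies
    while any(cls != target for x, cls in grow_set if covers(rule, x)):
        best = max(
            ((a, x[a]) for x, _ in grow_set for a in attrs),
            key=lambda c: precision(rule + [c], grow_set, target),
        )
        rule.append(best)
    return rule

def prune(rule, prune_set, target):
    while len(rule) > 1 and precision(rule[:-1], prune_set, target) >= \
            precision(rule, prune_set, target):
        rule = rule[:-1]
    return rule

# Toy data: class is "pos" exactly when attribute "a" equals 1.
data = [({"a": 1, "b": 0}, "pos"), ({"a": 1, "b": 1}, "pos"),
        ({"a": 0, "b": 0}, "neg"), ({"a": 0, "b": 1}, "neg")]
rule = grow(data, "pos", ["a", "b"])
rule = prune(rule, data, "pos")
print(rule)  # [('a', 1)]
```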
Related work
• EL-Manzalawy et al. (2008)
• Data: 934 epitopes, 934 random peptides
• Compared several machine learning methods
• Best performance by support vector machine + string kernel

                           SVM
Their data, string kernel  69.59 %
Our data, string kernel    78.11 %
First baseline
Immune response prediction
First attempt: AA counts
Attributes are the counts of each amino acid (alanine count, cysteine count, ...); the class is the positive or negative immune response:

A C D E F G H I K L M N P Q R S T V W Y  Class

Example peptide: QGDYCRPTVQEERKL, response 35 (negative)
A C D E F G H I K L M N P Q R S T V W Y  Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1  neg

               SVM      Rules
String kernel  78.11 %
AA counts      79.44 %  73.93 %
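The AA-count encoding takes only a few lines of Python; the output reproduces the attribute row for the example peptide above:

```python
# Encode a peptide as 20 amino-acid counts, in the standard
# single-letter alphabetical order used on the slide.

AAS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(peptide):
    return [peptide.count(aa) for aa in AAS]

print(aa_counts("QGDYCRPTVQEERKL"))
# [0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1]
```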
Second baseline
• Simple attributes are more accurate than the string kernel
• SVM is more accurate than rules
• But rules can be understood by a human
First attempt: AA counts
(Y = 0) and (F = 0) and (E >= 1) Class = neg
(Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg
(Y = 0 or <= 1) and ... Class = neg
... and (F = 0 or <= 1) and ... Class = neg
... and (W = 0 or <= 1) and ... Class = neg
...
otherwise Class = pos
(Y: tyrosine, W: tryptophan, F: phenylalanine)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
AA counts in sections of peptide
• AA counts ignore position in the peptide

Peptides  X  Y  Class  X left  Y left  X right  Y right
XXYY      2  2  pos    2       0       0        2
XXYY      2  2  pos    2       0       0        2
XXYY      2  2  pos    2       0       0        2
YYXX      2  2  neg    0       2       2        0
YYXX      2  2  neg    0       2       2        0
YYXX      2  2  neg    0       2       2        0
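Counting per section, so that some position information survives, can be sketched as:

```python
# AA counts per section: split the peptide into n roughly equal
# parts and count each letter of the alphabet in each part.

def section_counts(peptide, n_sections, alphabet):
    counts = []
    length = len(peptide)
    for i in range(n_sections):
        lo = i * length // n_sections
        hi = (i + 1) * length // n_sections
        section = peptide[lo:hi]
        counts += [section.count(aa) for aa in alphabet]
    return counts

# The XXYY/YYXX example: attributes X left, Y left, X right, Y right.
print(section_counts("XXYY", 2, "XY"))  # [2, 0, 0, 2]
print(section_counts("YYXX", 2, "XY"))  # [0, 2, 2, 0]
```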
AA counts in sections of peptide
• SVM accuracy increases a bit
• Rules accuracy decreases
• SVM better at coping with many attributes?

                       SVM      Rules
AA counts              79.44 %  73.93 %
AA counts, 2 sections  79.98 %  73.84 %
AA counts, 3 sections  80.31 %  70.87 %
AA counts, 4 sections  79.99 %  69.05 %
AA counts, 5 sections  79.98 %  71.49 %
AA count differences
• Machine learning cannot infer all relations automatically; it needs help

X  Y  Z  Class  X – Y  ...
1  2  5  pos    –1
2  3  3  pos    –1
3  4  1  pos    –1
2  1  5  neg    1
3  2  3  neg    1
4  3  1  neg    1
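The difference attributes follow directly from the counts, one new attribute per pair of amino acids:

```python
# AA count differences: for every pair of letters, add the difference
# of their counts as an attribute, making relations like "more X than
# Y" directly visible to the learner.
from itertools import combinations

def count_differences(counts, alphabet):
    idx = {aa: i for i, aa in enumerate(alphabet)}
    return {f"{a}-{b}": counts[idx[a]] - counts[idx[b]]
            for a, b in combinations(alphabet, 2)}

# The toy table above: X - Y alone separates pos from neg.
print(count_differences([1, 2, 5], "XYZ")["X-Y"])  # -1
print(count_differences([2, 1, 5], "XYZ")["X-Y"])  # 1
```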
AA count differences
                                  SVM      Rules
AA counts                         79.44 %  73.93 %
AA count differences              78.48 %  72.91 %
AA counts + AA count differences  79.39 %  75.29 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small
AA count differences
(E – Y <= –1) and (N – R <= –1) Class = pos
(E – Y <= 0) and (D – Y <= –1) and (Q – Y <= –2) Class = pos
(E – something <= –1 or 0) and ... Class = pos
(something – Y <= –2, –1 or 0) and ... Class = pos
...
otherwise Class = neg
(Y: tyrosine, E: glutamic acid)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No glutamic acid in peptide Class = pos, otherwise Class = neg (59.77 %)
Substring counts
• Perhaps single AAs are not informative enough and we should count longer substrings

Peptides  X  Y  Class  XXX  ...
XXXYY     3  2  pos    1
YXXXY     3  2  pos    1
YYXXX     3  2  pos    1
XYXYX     3  2  neg    0
XYYXX     3  2  neg    0
XXYYX     3  2  neg    0
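Counting all substrings up to a maximum length (overlapping occurrences included) can be sketched as:

```python
# Count occurrences of every substring of length 1..max_len;
# overlapping occurrences all count.

def substring_counts(peptide, max_len):
    counts = {}
    for length in range(1, max_len + 1):
        for i in range(len(peptide) - length + 1):
            sub = peptide[i:i + length]
            counts[sub] = counts.get(sub, 0) + 1
    return counts

# The toy example above: only the pos peptides contain XXX.
print(substring_counts("XXXYY", 3).get("XXX", 0))  # 1
print(substring_counts("XYXYX", 3).get("XXX", 0))  # 0
```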
Substring counts
                                         SVM      Rules
AA counts                                79.44 %  73.93 %
Counts of substrings of lengths up to 2  78.92 %  74.00 %
Counts of substrings of lengths up to 3  79.01 %  73.91 %

• The changes are small
• Only one rule with substrings of length above 1:
(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) Class = neg
Substrings with gaps
• Machine learning needs recurring patterns
• Small counts for substrings of length above 1 – little recurrence
• Increase substring counts by allowing gaps between AAs
• XYXABCXXY: ABC × 1
• YYAXBCYYX: ABC × 0.5
• XYABXXCYX: ABC × 0.5² = 0.25
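The weighting in the examples (a factor of 0.5 per gap position) can be computed by summing over all subsequence matches of the pattern; capping each gap at a maximum length is my assumption to keep the search bounded, mirroring the "gap lengths up to n" settings in the results:

```python
# Gapped substring count: every occurrence of the pattern as a
# subsequence contributes 0.5 ** (total gap length), with each
# individual gap at most max_gap positions long.

def gapped_count(peptide, pattern, max_gap):
    def search(pos, p_idx, gaps):
        if p_idx == len(pattern):
            return 0.5 ** gaps
        if p_idx == 0:
            candidates = range(pos, len(peptide))  # first letter: anywhere
        else:
            candidates = range(pos, min(pos + max_gap + 1, len(peptide)))
        total = 0.0
        for i in candidates:
            if peptide[i] == pattern[p_idx]:
                gap = 0 if p_idx == 0 else i - pos
                total += search(i + 1, p_idx + 1, gaps + gap)
        return total
    return search(0, 0, 0)

# Reproduces the three examples from the slide.
print(gapped_count("XYXABCXXY", "ABC", 4))  # 1.0
print(gapped_count("YYAXBCYYX", "ABC", 4))  # 0.5
print(gapped_count("XYABXXCYX", "ABC", 4))  # 0.25
```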
Substrings with gaps
                                      SVM      Rules
AA counts                             79.44 %  73.93 %
Lengths up to 3, no gaps              79.01 %  73.91 %
Lengths up to 3, gap lengths up to 1  79.11 %  74.37 %
Lengths up to 3, gap lengths up to 2  78.83 %  74.62 %
Lengths up to 3, gap lengths up to 3  79.10 %  74.71 %
Lengths up to 3, gap lengths up to 4  78.91 %  75.54 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small
Substrings with gaps
• Still no rules with substrings of length 3
• More rules with substrings of length 2:
(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) Class = neg
...
otherwise Class = pos
(L: leucine – positive response when in pair)

Substring  Count    Substring  Count
RL/LR      5        EL         1
LL         5        SL         1
LP         2        RR         1
SE         2        PP         1
KK         2
Classes of AAs
• Perhaps individual AAs are too specific and we should merge similar AAs into classes

Flexibility index (from most inflexible to most flexible):
W –0.727, Y –0.721, F –0.719, C –0.693, I –0.682, V –0.669, H –0.662, L –0.631, M –0.626, A –0.605, G –0.537, T –0.525, R –0.448, S –0.423, N –0.381, Q –0.369, D –0.279, P –0.271, E –0.160, K –0.043
Classes: Inflexible (1), Medium (2), Flexible (3)

Example peptide: QGDYCRPTVQEERKL
Flexibility classes: 323113321333332
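Mapping a peptide onto flexibility classes is a simple character translation. The class boundaries below are inferred from the example peptide on the slide, not taken from the original code, so treat them as an assumption:

```python
# Map each amino acid to its flexibility class: 1 = inflexible,
# 2 = medium, 3 = flexible. The boundaries between the classes are
# inferred from the slide's example mapping (an assumption).

FLEX_CLASS = {}
FLEX_CLASS.update({aa: "1" for aa in "WYFCIVH"})   # inflexible
FLEX_CLASS.update({aa: "2" for aa in "LMAGT"})     # medium
FLEX_CLASS.update({aa: "3" for aa in "RSNQDPEK"})  # flexible

def to_flex_string(peptide):
    return "".join(FLEX_CLASS[aa] for aa in peptide)

print(to_flex_string("QGDYCRPTVQEERKL"))  # 323113321333332
```

The class string can then be fed into the same substring-counting attributes as the raw sequence, only over a 3-letter alphabet instead of 20.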
Classes of AAs
Counts of substrings of lengths up to 3 (gaps of length 1):
• 1, 2, 3
• 11, 12, 13, 21, 22, 23, 31, 32, 33
• 111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333
Classes of AAs
                                    SVM      Rules
AA counts                           79.44 %  73.93 %
AAs, lengths up to 3, gaps up to 1  79.11 %  74.37 %
Aromatic/aliphatic                  72.51 %  72.49 %
Basic/acidic                        61.15 %  61.06 %
Flexibility                         69.24 %  67.46 %
Hydrophobicity                      58.92 %  56.85 %
Polarity                            64.94 %  63.50 %
Size                                61.89 %  60.58 %
Turns index                         58.92 %  55.57 %
Aromatic/aliphatic AAs
Aromatic           Aliphatic       Other
Phenylalanine (F)  Isoleucine (I)  all the rest
Histidine (H)      Leucine (L)
Tryptophan (W)     Valine (V)
Tyrosine (Y)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
1 or no aromatic AAs in peptide Class = neg, otherwise Class = pos (71.37 %)
Flexibility of AAs
(F1 >= 5) Class = pos
(F1 >= 4) and (F33 <= 4) and (F131 >= 1.5) and (F231 = 0) and (F213 <= 1) Class = pos
(F1 >= 4) and (F33 <= 3.5) Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos
(F1 >= 3) and (F11 = 0) and (F322 <= 0.5) and (F321 >= 1) Class = pos
otherwise Class = neg

• Inflexible AAs indicate pos
• Flexibility is linked to epitope propensity in the literature (?)
• Y, W, F are inflexible, E is flexible
Polarity of AAs
(N >= 8) and (P+ >= 2) and (P– <= 1) Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) Class = pos
(...N... >= something) and ... Class = pos
(...P–... <= something) and ... Class = pos
...
otherwise Class = neg

• Non-polar AAs indicate pos
• Should polarity not be conducive to antibody binding?
• Y, W, F are non-polar, E is polar negative
Classes of AAs
                                   SVM      Rules
AA counts                          79.44 %  73.93 %
All AA class counts                79.77 %  75.07 %
All AA class counts and AA counts  79.89 %  76.12 %

• SVM accuracy increases a bit
• Highest rules accuracy

(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) Class = pos (precision 88.64 %)
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) Class = pos (precision 91.71 %)
AA pair counts
[Figure: antibody over a peptide, contacting AA1 and AA2 at distance d]
Attributes are counts of AA1 and AA2 at distance d, for all AA1, AA2 and d.
• Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
• Ways to reduce them:
– Classes of AAs instead of individual AAs
– Increment distances in steps > 1
[Figure: distances d = 1 to 5, step 3]
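The pair attributes can be sketched as follows. The exact meaning of the distance step is not fully specified on the slides; here I assume step s keeps distances 1, 1+s, 1+2s, ..., which is an assumption:

```python
# AA pair attributes: count, for every ordered pair (AA1, AA2) and
# every distance d in the chosen set, how often AA1 and AA2 occur
# exactly d positions apart in the peptide.

def pair_counts(peptide, distances):
    counts = {}
    for d in distances:
        for i in range(len(peptide) - d):
            key = (peptide[i], peptide[i + d], d)
            counts[key] = counts.get(key, 0) + 1
    return counts

# A 15-mer has distances 1..14; with step 4 only d = 1, 5, 9, 13
# are used (assumed interpretation of "distance step").
counts = pair_counts("QGDYCRPTVQEERKL", range(1, 15, 4))
print(counts[("E", "E", 1)])  # 1
```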
AA pair counts
                                                 SVM      Rules
AA counts                                        79.44 %  73.93 %
AA × AA, distance step 4                         75.28 %  71.02 %
AA × aromatic/aliphatic, step 3                  78.25 %  72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2  72.56 %  70.90 %

• Accuracy decreases
• Rules not particularly illuminating
AA pair counts with a fixed side
[Figure: antibody over the easily accessible side of the peptide, the other side attached to the peptide array surface; one AA at a fixed position, another AA at distance d]
Attributes are the AA at the fixed position and counts of AAs at distance d from it, for all AA and d.
• Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
AA pair counts with a fixed side
• SVM accuracy increases a bit
• Rules accuracy decreases due to bad attributes

                               SVM      Rules
AA counts                      79.44 %  73.93 %
Pair AA × AA, distance step 4  75.28 %  71.02 %
Fixed pair, step 2             80.07 %  69.35 %
Fixed pair, step 1             78.82 %  67.60 %

(Y at d 2 >= 1) Class = pos
(Y at d 6 >= 1) Class = pos
(Y at d 7 >= 1) Class = pos
(Y at d 9 >= 1) Class = pos
(Y at d 11 >= 1) Class = pos
(Y at d 14 >= 1) Class = pos
...
otherwise Class = neg
AA properties (in sections of peptide)
• AA properties on which AA classes are based can be used directly – averaged over the peptide

                                         SVM      Rules
AA counts                                79.44 %  73.93 %
AA properties                            76.45 %  72.63 %
AA properties, 2 sections                76.97 %  73.01 %
AA properties, 3 sections                77.25 %  71.13 %
AA properties and AA counts              79.30 %  74.97 %
AA properties and AA counts, 2 sections  79.89 %  74.24 %
AA properties and AA counts, 3 sections  80.05 %  70.80 %
AA properties (in sections of peptide)
(aromatic >= 0.3) Class = pos
(polarity <= something) and ... Class = pos
(basic >= something) and ... Class = pos
...
otherwise Class = neg

• Aromatic AAs indicate pos
• Non-polar AAs indicate pos (Y, W, F non-polar)
• Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos
Limitations and future work

Why are we stuck at 80 %

Training data:
• A single antibody, or a group of similar antibodies, binds similar peptides – this similarity is recognized by machine learning: our 80 %
• Single, different antibodies bind peptides that are each different – ignored by machine learning

Test data:
• Similar peptides: recognized by machine learning – our 80 % again
• Each peptide different: not recognized
Where do we go from here
Example pattern: tyrosine followed by another aromatic AA, followed by tryptophan, followed by a polar AA
• Aggregating rules with different attributes
• Kernels to use many attributes simultaneously
• Peptide similarity
Aggregating rules
• Check if different attribute sets cover different instances
• If so, pick the best rules for each attribute set
• Use only the best rules for classification
[Figure: rules 1, 2, 3, ... covering different parts of the training and test data]
Kernels
• Use many attributes without computing them explicitly
• Only works with some methods like SVM

Instance 1: (a11, a12, ..., a1n)
Instance 2: (a21, a22, ..., a2n)
(a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11 a21 + a12 a22 + ... + a1n a2n

• Only the dot product is needed
• Only need to compute attributes that are non-zero in both attribute vectors
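The "non-zero in both vectors" observation can be sketched with sparse attribute vectors; the substring names and counts below are invented for illustration:

```python
# The kernel trick only needs the dot product of two attribute
# vectors. With sparse vectors (most substring counts are zero),
# the dot product only touches attributes that are non-zero in BOTH
# vectors, so the full attribute vector is never materialized.

def sparse_dot(u, v):
    """Dot product of two dicts mapping attribute name -> value."""
    if len(v) < len(u):          # iterate over the smaller vector
        u, v = v, u
    return sum(val * v[key] for key, val in u.items() if key in v)

a = {"YY": 2, "WF": 1, "RL": 3}
b = {"YY": 1, "RL": 2}
print(sparse_dot(a, b))  # 2*1 + 3*2 = 8
```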
Peptide similarity
• Smart similarity:
– Find the best alignment of two peptides
– Then compute similarity
• Nearest-neighbor classification
• Clustering:
– Find groups of similar peptides
– Find groups of peptides that
are similar in the same way
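The align-then-compare idea combined with nearest-neighbor classification can be sketched as follows. Scoring by identical positions under the best ungapped shift is my simplification; a substitution matrix and gapped alignment would be the realistic choice. The peptides and labels are invented:

```python
# Nearest-neighbor classification with a simple alignment-based
# similarity: slide one peptide over the other and take the offset
# with the most matching positions (identity scoring, a
# simplification of a real alignment score).

def similarity(p, q):
    best = 0
    for offset in range(-len(q) + 1, len(p)):
        matches = sum(1 for i, aa in enumerate(q)
                      if 0 <= i + offset < len(p) and p[i + offset] == aa)
        best = max(best, matches)
    return best

def nearest_neighbor(peptide, labeled):
    """Return the class of the most similar labeled peptide."""
    return max(labeled, key=lambda pc: similarity(peptide, pc[0]))[1]

labeled = [("YYWFA", "pos"), ("GGSTA", "neg")]
print(nearest_neighbor("AYYWF", labeled))  # pos
```

The same similarity function could drive the clustering step: group peptides whose pairwise similarity exceeds a threshold.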
Questions? Suggestions?

Peptide arrays – SVM – RIPPER – EL-Manzalawy
AA counts – AA counts in sections – AA count differences – Substring counts (with gaps)
AA classes – AA pairs (with a fixed side) – AA properties
Stuck at 80 % – Aggregating rules – Kernels – Peptide similarity