TRANSCRIPT
In silico immune response prediction based on peptide array data
Mitja Luštrek
Institute for Biostatistics and Informatics in Medicine and Aging Research
Overview
• Introduction: peptide arrays, the task of prediction, machine learning
• Immune response prediction: prediction methods and results, insights into the immune system
• Limitations and future work: why we do not do better and how we might
Peptide arrays
• Peptides (antigen)
• Serum (antibodies)
• Immune response
In silico prediction
• Predict in silico which peptides evoke an immune response
• Save costs by putting only the most promising peptides on an array
• Gain insight into the workings of the immune system
Task at hand
• Data:
– 10,218 peptides (15-mers) with negative response
– 3,420 peptides (15-mers) with positive response, multiplied to 10,218 for a balanced data set
• Method:
– machine learning
– 70 % of the data for training, 30 % for testing
Machine learning

Labeled (training) data, where each instance is a peptide:

Attribute 1  Attribute 2  ...  Class
a11          a12          ...  neg
a21          a22               pos
...

A machine learning algorithm turns the labeled data into a classifier:

if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos
Machine learning

Unlabeled (test) data:

Attribute 1  Attribute 2  ...  Class
b11          b12          ...
b21          b22
...

The classifier assigns a class to each instance, producing classified data:

Attribute 1  Attribute 2  ...  Class
b11          b12          ...  pos
b21          b22               neg
...
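The classifier described above is a set of if-then rules applied to attribute vectors. A minimal sketch in Python; the attribute names, values and rules here are invented for illustration, standing in for the output of a learning algorithm:

```python
# Minimal sketch of applying a classifier to unlabeled test data.
# The "classifier" is a hand-written rule list; a learning algorithm
# would produce such rules from the labeled training data.

def classify(instance, rules, default="neg"):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(cond(instance) for cond in conditions):
            return label
    return default

# "if Attribute 1 is such and such and Attribute 2 is such and such
#  ... then Class = pos"
rules = [
    ([lambda x: x["attr1"] >= 2, lambda x: x["attr2"] == 0], "pos"),
]

test_data = [
    {"attr1": 3, "attr2": 0},  # matches the rule
    {"attr1": 1, "attr2": 5},  # matches nothing, falls to the default
]
print([classify(x, rules) for x in test_data])  # ['pos', 'neg']
```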
Support vector machine (SVM)
[Figures: instances plotted against Attribute 1 and Attribute 2; a boundary separating the two classes; Attribute 3 added as a third dimension]
RIPPER rules
Repeated Incremental Pruning to Produce Error Reduction
• Split data:
– growing set (2/3) to grow new rules
– pruning set (1/3) to prune them
• Grow a rule by adding conditions, selected based on information gain:
(Attribute 1 = a1) and (Attribute 2 <= a2) ... Class = c
Conditions are added until all instances matched by the rule belong to class c (the rule never misclassifies). The rule is then perfect on the growing set, but it may overfit it, so we test it on the pruning set.
• Prune a rule by removing conditions, as long as removal improves performance on the pruning set:
(Attribute 1 = a1) ... Class = c
• Grow and prune a rule, delete the instances covered by the rule, and repeat the process until the last rule increases (length of the rules + misclassified instances) by more than a constant
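The grow-and-prune step can be sketched on toy data. This is a strong simplification of RIPPER, not a faithful implementation: conditions are simple equality tests, rule quality is measured by precision rather than information gain, and the global stopping criterion is omitted.

```python
# Strongly simplified sketch of RIPPER's grow-and-prune step:
# conditions (attribute == value) are added greedily until the rule
# covers only the target class, then removed from the end while
# precision on the pruning set does not drop.

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def precision(rule, data, target):
    hit = [cls for x, cls in data if covers(rule, x)]
    return hit.count(target) / len(hit) if hit else 0.0

def grow(grow_set, target, attrs):
    rule = []
    # keep adding conditions while the rule still misclassifies
    while any(cls != target for x, cls in grow_set if covers(rule, x)):
        best = max(
            ((a, x[a]) for x, _ in grow_set for a in attrs),
            key=lambda c: precision(rule + [c], grow_set, target),
        )
        rule.append(best)
    return rule

def prune(rule, prune_set, target):
    while len(rule) > 1 and precision(rule[:-1], prune_set, target) >= \
            precision(rule, prune_set, target):
        rule = rule[:-1]
    return rule

# Toy data: class is "pos" exactly when attribute "a" equals 1.
data = [({"a": 1, "b": 0}, "pos"), ({"a": 1, "b": 1}, "pos"),
        ({"a": 0, "b": 0}, "neg"), ({"a": 0, "b": 1}, "neg")]
rule = grow(data, "pos", ["a", "b"])
rule = prune(rule, data, "pos")
print(rule)  # [('a', 1)]
```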
Related work
• EL-Manzalawy et al. (2008)
• Data: 934 epitopes, 934 random peptides
• Compared several machine learning methods
• Best performance by support vector machine + string kernel

                           SVM
Their data, string kernel  69.59 %
Our data, string kernel    78.11 %
First baseline
Immune response prediction
First attempt: AA counts
Attributes are the counts of each amino acid (alanine count, cysteine count, ...); the class is the positive or negative immune response:

A C D E F G H I K L M N P Q R S T V W Y  Class

Example peptide: QGDYCRPTVQEERKL, response 35 (negative)
A C D E F G H I K L M N P Q R S T V W Y  Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1  neg

               SVM      Rules
String kernel  78.11 %
AA counts      79.44 %  73.93 %
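The AA-count encoding takes only a few lines of Python; the output reproduces the attribute row for the example peptide above:

```python
# Encode a peptide as 20 amino-acid counts, in the standard
# single-letter alphabetical order used on the slide.

AAS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(peptide):
    return [peptide.count(aa) for aa in AAS]

print(aa_counts("QGDYCRPTVQEERKL"))
# [0, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2, 0, 1, 1, 0, 1]
```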
Second baseline
• Simple attributes are more accurate than the string kernel
• SVM is more accurate than rules
• But rules can be understood by a human
First attempt: AA counts
(Y = 0) and (F = 0) and (E >= 1) Class = neg
(Y = 0) and (W = 0) and (R = 0) and (F <= 1) Class = neg
(Y = 0 or <= 1) and ... Class = neg
... and (F = 0 or <= 1) and ... Class = neg
... and (W = 0 or <= 1) and ... Class = neg
...
otherwise Class = pos
(Y: tyrosine, W: tryptophan, F: phenylalanine)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
AA counts in sections of peptide
• AA counts ignore position in the peptide

Peptides  X  Y  Class  X left  Y left  X right  Y right
XXYY      2  2  pos    2       0       0        2
XXYY      2  2  pos    2       0       0        2
XXYY      2  2  pos    2       0       0        2
YYXX      2  2  neg    0       2       2        0
YYXX      2  2  neg    0       2       2        0
YYXX      2  2  neg    0       2       2        0
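Counting per section, so that some position information survives, can be sketched as:

```python
# AA counts per section: split the peptide into n roughly equal
# parts and count each letter of the alphabet in each part.

def section_counts(peptide, n_sections, alphabet):
    counts = []
    length = len(peptide)
    for i in range(n_sections):
        lo = i * length // n_sections
        hi = (i + 1) * length // n_sections
        section = peptide[lo:hi]
        counts += [section.count(aa) for aa in alphabet]
    return counts

# The XXYY/YYXX example: attributes X left, Y left, X right, Y right.
print(section_counts("XXYY", 2, "XY"))  # [2, 0, 0, 2]
print(section_counts("YYXX", 2, "XY"))  # [0, 2, 2, 0]
```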
AA counts in sections of peptide
• SVM accuracy increases a bit
• Rules accuracy decreases
• SVM better at coping with many attributes?

                       SVM      Rules
AA counts              79.44 %  73.93 %
AA counts, 2 sections  79.98 %  73.84 %
AA counts, 3 sections  80.31 %  70.87 %
AA counts, 4 sections  79.99 %  69.05 %
AA counts, 5 sections  79.98 %  71.49 %
AA count differences
• Machine learning cannot infer all relations automatically; it needs help

X  Y  Z  Class  X – Y  ...
1  2  5  pos    –1
2  3  3  pos    –1
3  4  1  pos    –1
2  1  5  neg    1
3  2  3  neg    1
4  3  1  neg    1
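The difference attributes follow directly from the counts, one new attribute per pair of amino acids:

```python
# AA count differences: for every pair of letters, add the difference
# of their counts as an attribute, making relations like "more X than
# Y" directly visible to the learner.
from itertools import combinations

def count_differences(counts, alphabet):
    idx = {aa: i for i, aa in enumerate(alphabet)}
    return {f"{a}-{b}": counts[idx[a]] - counts[idx[b]]
            for a, b in combinations(alphabet, 2)}

# The toy table above: X - Y alone separates pos from neg.
print(count_differences([1, 2, 5], "XYZ")["X-Y"])  # -1
print(count_differences([2, 1, 5], "XYZ")["X-Y"])  # 1
```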
AA count differences
                                  SVM      Rules
AA counts                         79.44 %  73.93 %
AA count differences              78.48 %  72.91 %
AA counts + AA count differences  79.39 %  75.29 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small
AA count differences
(E – Y <= –1) and (N – R <= –1) Class = pos
(E – Y <= 0) and (D – Y <= –1) and (Q – Y <= –2) Class = pos
(E – something <= –1 or 0) and ... Class = pos
(something – Y <= –2, –1 or 0) and ... Class = pos
...
otherwise Class = neg
(Y: tyrosine, E: glutamic acid)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No glutamic acid in peptide Class = pos, otherwise Class = neg (59.77 %)
Substring counts
• Perhaps single AAs are not informative enough and we should count longer substrings

Peptides  X  Y  Class  XXX  ...
XXXYY     3  2  pos    1
YXXXY     3  2  pos    1
YYXXX     3  2  pos    1
XYXYX     3  2  neg    0
XYYXX     3  2  neg    0
XXYYX     3  2  neg    0
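Counting all substrings up to a maximum length (overlapping occurrences included) can be sketched as:

```python
# Count occurrences of every substring of length 1..max_len;
# overlapping occurrences all count.

def substring_counts(peptide, max_len):
    counts = {}
    for length in range(1, max_len + 1):
        for i in range(len(peptide) - length + 1):
            sub = peptide[i:i + length]
            counts[sub] = counts.get(sub, 0) + 1
    return counts

# The toy example above: only the pos peptides contain XXX.
print(substring_counts("XXXYY", 3).get("XXX", 0))  # 1
print(substring_counts("XYXYX", 3).get("XXX", 0))  # 0
```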
Substring counts
                                         SVM      Rules
AA counts                                79.44 %  73.93 %
Counts of substrings of lengths up to 2  78.92 %  74.00 %
Counts of substrings of lengths up to 3  79.01 %  73.91 %

• The changes are small
• Only one rule with substrings of length above 1:
(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) Class = neg
Substrings with gaps
• Machine learning needs recurring patterns
• Small counts for substrings of length above 1 – little recurrence
• Increase substring counts by allowing gaps between AAs
• XYXABCXXY: ABC × 1
• YYAXBCYYX: ABC × 0.5
• XYABXXCYX: ABC × 0.5² = 0.25
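The weighting in the examples (a factor of 0.5 per gap position) can be computed by summing over all subsequence matches of the pattern; capping each gap at a maximum length is my assumption to keep the search bounded, mirroring the "gap lengths up to n" settings in the results:

```python
# Gapped substring count: every occurrence of the pattern as a
# subsequence contributes 0.5 ** (total gap length), with each
# individual gap at most max_gap positions long.

def gapped_count(peptide, pattern, max_gap):
    def search(pos, p_idx, gaps):
        if p_idx == len(pattern):
            return 0.5 ** gaps
        if p_idx == 0:
            candidates = range(pos, len(peptide))  # first letter: anywhere
        else:
            candidates = range(pos, min(pos + max_gap + 1, len(peptide)))
        total = 0.0
        for i in candidates:
            if peptide[i] == pattern[p_idx]:
                gap = 0 if p_idx == 0 else i - pos
                total += search(i + 1, p_idx + 1, gaps + gap)
        return total
    return search(0, 0, 0)

# Reproduces the three examples from the slide.
print(gapped_count("XYXABCXXY", "ABC", 4))  # 1.0
print(gapped_count("YYAXBCYYX", "ABC", 4))  # 0.5
print(gapped_count("XYABXXCYX", "ABC", 4))  # 0.25
```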
Substrings with gaps
                                      SVM      Rules
AA counts                             79.44 %  73.93 %
Lengths up to 3, no gaps              79.01 %  73.91 %
Lengths up to 3, gap lengths up to 1  79.11 %  74.37 %
Lengths up to 3, gap lengths up to 2  78.83 %  74.62 %
Lengths up to 3, gap lengths up to 3  79.10 %  74.71 %
Lengths up to 3, gap lengths up to 4  78.91 %  75.54 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small
Substrings with gaps
• Still no rules with substrings of length 3
• More rules with substrings of length 2:
(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) Class = neg
...
otherwise Class = pos
(L: leucine – positive response when in pair)

Substring  Count    Substring  Count
RL/LR      5        EL         1
LL         5        SL         1
LP         2        RR         1
SE         2        PP         1
KK         2
Classes of AAs
• Perhaps individual AAs are too specific and we should merge similar AAs into classes

Flexibility index (from most inflexible to most flexible):
W –0.727, Y –0.721, F –0.719, C –0.693, I –0.682, V –0.669, H –0.662, L –0.631, M –0.626, A –0.605, G –0.537, T –0.525, R –0.448, S –0.423, N –0.381, Q –0.369, D –0.279, P –0.271, E –0.160, K –0.043
Classes: Inflexible (1), Medium (2), Flexible (3)

Example peptide: QGDYCRPTVQEERKL
Flexibility classes: 323113321333332
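Mapping a peptide onto flexibility classes is a simple character translation. The class boundaries below are inferred from the example peptide on the slide, not taken from the original code, so treat them as an assumption:

```python
# Map each amino acid to its flexibility class: 1 = inflexible,
# 2 = medium, 3 = flexible. The boundaries between the classes are
# inferred from the slide's example mapping (an assumption).

FLEX_CLASS = {}
FLEX_CLASS.update({aa: "1" for aa in "WYFCIVH"})   # inflexible
FLEX_CLASS.update({aa: "2" for aa in "LMAGT"})     # medium
FLEX_CLASS.update({aa: "3" for aa in "RSNQDPEK"})  # flexible

def to_flex_string(peptide):
    return "".join(FLEX_CLASS[aa] for aa in peptide)

print(to_flex_string("QGDYCRPTVQEERKL"))  # 323113321333332
```

The class string can then be fed into the same substring-counting attributes as the raw sequence, only over a 3-letter alphabet instead of 20.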
Classes of AAs
Counts of substrings of lengths up to 3 (gaps of length 1):
• 1, 2, 3
• 11, 12, 13, 21, 22, 23, 31, 32, 33
• 111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333
Classes of AAs
                                    SVM      Rules
AA counts                           79.44 %  73.93 %
AAs, lengths up to 3, gaps up to 1  79.11 %  74.37 %
Aromatic/aliphatic                  72.51 %  72.49 %
Basic/acidic                        61.15 %  61.06 %
Flexibility                         69.24 %  67.46 %
Hydrophobicity                      58.92 %  56.85 %
Polarity                            64.94 %  63.50 %
Size                                61.89 %  60.58 %
Turns index                         58.92 %  55.57 %
Aromatic/aliphatic AAs
Aromatic           Aliphatic       Other
Phenylalanine (F)  Isoleucine (I)  all the rest
Histidine (H)      Leucine (L)
Tryptophan (W)     Valine (V)
Tyrosine (Y)

No tyrosine in peptide Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide Class = neg, otherwise Class = pos (63.21 %)
1 or no aromatic AAs in peptide Class = neg, otherwise Class = pos (71.37 %)
Flexibility of AAs
(F1 >= 5) Class = pos
(F1 >= 4) and (F33 <= 4) and (F131 >= 1.5) and (F231 = 0) and (F213 <= 1) Class = pos
(F1 >= 4) and (F33 <= 3.5) Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) Class = pos
(F1 >= 3) and (F11 = 0) and (F322 <= 0.5) and (F321 >= 1) Class = pos
otherwise Class = neg

• Inflexible AAs indicate pos
• Flexibility is linked to epitope propensity in the literature (?)
• Y, W, F are inflexible, E is flexible
Polarity of AAs
(N >= 8) and (P+ >= 2) and (P– <= 1) Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) Class = pos
(...N... >= something) and ... Class = pos
(...P–... <= something) and ... Class = pos
...
otherwise Class = neg

• Non-polar AAs indicate pos
• Should polarity not be conducive to antibody binding?
• Y, W, F are non-polar, E is polar negative
Classes of AAs
                                   SVM      Rules
AA counts                          79.44 %  73.93 %
All AA class counts                79.77 %  75.07 %
All AA class counts and AA counts  79.89 %  76.12 %

• SVM accuracy increases a bit
• Highest rules accuracy

(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) Class = pos (precision 88.64 %)
(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) Class = pos (precision 91.71 %)
AA pair counts
[Figure: antibody over a peptide, contacting AA1 and AA2 at distance d]
Attributes are counts of AA1 and AA2 at distance d, for all AA1, AA2 and d.
• Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes
• Ways to reduce them:
– Classes of AAs instead of individual AAs
– Increment distances in steps > 1
[Figure: distances d = 1 to 5, step 3]
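The pair attributes can be sketched as follows. The exact meaning of the distance step is not fully specified on the slides; here I assume step s keeps distances 1, 1+s, 1+2s, ..., which is an assumption:

```python
# AA pair attributes: count, for every ordered pair (AA1, AA2) and
# every distance d in the chosen set, how often AA1 and AA2 occur
# exactly d positions apart in the peptide.

def pair_counts(peptide, distances):
    counts = {}
    for d in distances:
        for i in range(len(peptide) - d):
            key = (peptide[i], peptide[i + d], d)
            counts[key] = counts.get(key, 0) + 1
    return counts

# A 15-mer has distances 1..14; with step 4 only d = 1, 5, 9, 13
# are used (assumed interpretation of "distance step").
counts = pair_counts("QGDYCRPTVQEERKL", range(1, 15, 4))
print(counts[("E", "E", 1)])  # 1
```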
AA pair counts
                                                 SVM      Rules
AA counts                                        79.44 %  73.93 %
AA × AA, distance step 4                         75.28 %  71.02 %
AA × aromatic/aliphatic, step 3                  78.25 %  72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2  72.56 %  70.90 %

• Accuracy decreases
• Rules not particularly illuminating
AA pair counts with a fixed side
[Figure: antibody over the easily accessible side of the peptide, the other side attached to the peptide array surface; one AA at a fixed position, another AA at distance d]
Attributes are the AA at the fixed position and counts of AAs at distance d from it, for all AA and d.
• Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
AA pair counts with a fixed side
• SVM accuracy increases a bit
• Rules accuracy decreases due to bad attributes

                               SVM      Rules
AA counts                      79.44 %  73.93 %
Pair AA × AA, distance step 4  75.28 %  71.02 %
Fixed pair, step 2             80.07 %  69.35 %
Fixed pair, step 1             78.82 %  67.60 %

(Y at d 2 >= 1) Class = pos
(Y at d 6 >= 1) Class = pos
(Y at d 7 >= 1) Class = pos
(Y at d 9 >= 1) Class = pos
(Y at d 11 >= 1) Class = pos
(Y at d 14 >= 1) Class = pos
...
otherwise Class = neg
AA properties (in sections of peptide)
• AA properties on which AA classes are based can be used directly – averaged over the peptide

                                         SVM      Rules
AA counts                                79.44 %  73.93 %
AA properties                            76.45 %  72.63 %
AA properties, 2 sections                76.97 %  73.01 %
AA properties, 3 sections                77.25 %  71.13 %
AA properties and AA counts              79.30 %  74.97 %
AA properties and AA counts, 2 sections  79.89 %  74.24 %
AA properties and AA counts, 3 sections  80.05 %  70.80 %
AA properties (in sections of peptide)
(aromatic >= 0.3) Class = pos
(polarity <= something) and ... Class = pos
(basic >= something) and ... Class = pos
...
otherwise Class = neg

• Aromatic AAs indicate pos
• Non-polar AAs indicate pos (Y, W, F non-polar)
• Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos
Limitations and future work

Why are we stuck at 80 %

Training data:
• A single antibody, or a group of similar antibodies, binds similar peptides – this similarity is recognized by machine learning: our 80 %
• Single, different antibodies bind peptides that are each different – ignored by machine learning

Test data:
• Similar peptides: recognized by machine learning – our 80 % again
• Each peptide different: not recognized
Where do we go from here
Example pattern: tyrosine followed by another aromatic AA, followed by tryptophan, followed by a polar AA
• Aggregating rules with different attributes
• Kernels to use many attributes simultaneously
• Peptide similarity
Aggregating rules
• Check if different attribute sets cover different instances
• If so, pick the best rules for each attribute set
• Use only the best rules for classification
[Figure: rules 1, 2, 3, ... covering different parts of the training and test data]
Kernels
• Use many attributes without computing them explicitly
• Only works with some methods like SVM

Instance 1: (a11, a12, ..., a1n)
Instance 2: (a21, a22, ..., a2n)
(a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11 a21 + a12 a22 + ... + a1n a2n

• Only the dot product is needed
• Only need to compute attributes that are non-zero in both attribute vectors
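The "non-zero in both vectors" observation can be sketched with sparse attribute vectors; the substring names and counts below are invented for illustration:

```python
# The kernel trick only needs the dot product of two attribute
# vectors. With sparse vectors (most substring counts are zero),
# the dot product only touches attributes that are non-zero in BOTH
# vectors, so the full attribute vector is never materialized.

def sparse_dot(u, v):
    """Dot product of two dicts mapping attribute name -> value."""
    if len(v) < len(u):          # iterate over the smaller vector
        u, v = v, u
    return sum(val * v[key] for key, val in u.items() if key in v)

a = {"YY": 2, "WF": 1, "RL": 3}
b = {"YY": 1, "RL": 2}
print(sparse_dot(a, b))  # 2*1 + 3*2 = 8
```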
Peptide similarity
• Smart similarity:
– Find the best alignment of two peptides
– Then compute similarity
• Nearest-neighbor classification
• Clustering:
– Find groups of similar peptides
– Find groups of peptides that
are similar in the same way
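The align-then-compare idea combined with nearest-neighbor classification can be sketched as follows. Scoring by identical positions under the best ungapped shift is my simplification; a substitution matrix and gapped alignment would be the realistic choice. The peptides and labels are invented:

```python
# Nearest-neighbor classification with a simple alignment-based
# similarity: slide one peptide over the other and take the offset
# with the most matching positions (identity scoring, a
# simplification of a real alignment score).

def similarity(p, q):
    best = 0
    for offset in range(-len(q) + 1, len(p)):
        matches = sum(1 for i, aa in enumerate(q)
                      if 0 <= i + offset < len(p) and p[i + offset] == aa)
        best = max(best, matches)
    return best

def nearest_neighbor(peptide, labeled):
    """Return the class of the most similar labeled peptide."""
    return max(labeled, key=lambda pc: similarity(peptide, pc[0]))[1]

labeled = [("YYWFA", "pos"), ("GGSTA", "neg")]
print(nearest_neighbor("AYYWF", labeled))  # pos
```

The same similarity function could drive the clustering step: group peptides whose pairwise similarity exceeds a threshold.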
Questions? Suggestions?

Peptide arrays – SVM – RIPPER – EL-Manzalawy
AA counts – AA counts in sections – AA count differences – Substring counts (with gaps)
AA classes – AA pairs (with a fixed side) – AA properties
Stuck at 80 % – Aggregating rules – Kernels – Peptide similarity