In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research


Post on 19-Jan-2016



In silico immune response prediction based on peptide array data

Mitja Luštrek

Institute for Biostatistics and Informatics in Medicine and Aging Research

Overview

• Introduction: peptide arrays, the task of prediction, machine learning

• Immune response prediction: prediction methods and results, insights into the immune system

• Limitations and future work: why we do not do better and how we might

Peptide arrays

Peptides (antigen)

Serum (antibodies)

Immune response

In silico prediction

• Predict in silico which peptides evoke immune response

• Save costs by putting only the most promising peptides on an array

• Gain insight into the workings of the immune system

Task at hand

• Data:
  – 10,218 peptides (15-mers) with negative response
  – 3,420 peptides (15-mers) with positive response, multiplied to 10,218 for a balanced data set

• Method:
  – machine learning
  – 70 % of the data for training, 30 % for testing
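The balancing and the 70/30 split listed above can be sketched as follows. This is only a minimal sketch under stated assumptions: the slide does not say how the positive peptides were multiplied or how the split was drawn, so cyclic replication and a seeded shuffle are assumptions, and `balance_by_replication` and `split_70_30` are hypothetical names.

```python
import random
from itertools import cycle, islice


def balance_by_replication(minority, target_size):
    """Repeat minority-class items cyclically up to target_size (assumed scheme)."""
    return list(islice(cycle(minority), target_size))


def split_70_30(instances, seed=0):
    """Shuffle with a fixed seed and cut at 70 % (the slide gives only the ratio)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy stand-ins for the peptides; only the two counts come from the slide.
negatives = [("neg_%d" % i, "neg") for i in range(10218)]
positives = [("pos_%d" % i, "pos") for i in range(3420)]

balanced = negatives + balance_by_replication(positives, len(negatives))
train, test = split_70_30(balanced)
print(len(balanced), len(train), len(test))
```

If replication happens before the split, as in this sketch, copies of the same positive peptide can end up in both the training and the test set; the slide does not say which order was used.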

Machine learning

Labeled (training) data, one instance (peptide) per row:

Attribute 1   Attribute 2   ...   Class
a11           a12           ...   neg
a21           a22           ...   pos
...

Machine learning algorithm

Classifier:
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = neg
if Attribute 1 is such and such and Attribute 2 is such and such ... then Class = pos

Unlabeled (test) data:

Attribute 1   Attribute 2   ...   Class
b11           b12           ...
b21           b22           ...
...

Classifier

Classified data:

Attribute 1   Attribute 2   ...   Class
b11           b12           ...   pos
b21           b22           ...   neg
...

Support vector machine (SVM)

[Figure: plot with axes Attribute 1 and Attribute 2; the last build adds Attribute 3]

RIPPER rules

Repeated Incremental Pruning to Produce Error Reduction

• Split data:
  – growing set (2/3) to grow new rules
  – pruning set (1/3) to prune them

• Grow a rule:
  (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
  – conditions are selected based on information gain
  – conditions are added until all instances matched by the rule belong to class c (the rule never misclassifies)

• The rule is perfect on the growing set. But does it overfit the growing set? We test it on the pruning set.

• Prune a rule:
  (Attribute 1 = a1) and (Attribute 2 <= a2) ... → Class = c
  becomes
  (Attribute 1 = a1) ... → Class = c
  – conditions are removed as long as removal improves performance on the pruning set

• Grow and prune a rule
• Delete instances covered by the rule
• Repeat the process until the last rule increases (length of the rules + misclassified instances) by more than a constant
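The pruning step described above can be illustrated with a small sketch. It is a simplification, not the RIPPER implementation used in the talk: a rule is represented as a list of hypothetical `(attribute, operator, value)` conditions, conditions are only removed one at a time from the end, and "performance on the pruning set" is scored as (p − n) / (p + n) over the covered instances, which is an assumption because the slide does not name the measure.

```python
import operator

OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}


def matches(rule, instance):
    """A rule is a list of (attribute, op, value) conditions joined by 'and'."""
    return all(OPS[op](instance[attr], value) for attr, op, value in rule)


def prune_score(rule, pruning_set, target):
    """(p - n) / (p + n) over covered pruning-set instances (assumed measure)."""
    covered = [cls for inst, cls in pruning_set if matches(rule, inst)]
    if not covered:
        return -1.0
    p = sum(1 for cls in covered if cls == target)
    n = len(covered) - p
    return (p - n) / (p + n)


def prune(rule, pruning_set, target):
    """Drop trailing conditions while the pruning-set score strictly improves."""
    best = list(rule)
    best_score = prune_score(best, pruning_set, target)
    while len(best) > 1:
        candidate = best[:-1]
        score = prune_score(candidate, pruning_set, target)
        if score <= best_score:
            break
        best, best_score = candidate, score
    return best


# Toy pruning set on which the second condition of the grown rule does not help.
pruning_set = [
    ({"A1": 1, "A2": 0}, "pos"),
    ({"A1": 1, "A2": 1}, "neg"),
    ({"A1": 1, "A2": 5}, "pos"),
    ({"A1": 1, "A2": 7}, "pos"),
    ({"A1": 0, "A2": 0}, "neg"),
]
grown = [("A1", "=", 1), ("A2", "<=", 2)]
print(prune(grown, pruning_set, "pos"))
```

On this toy pruning set the grown rule covers one positive and one negative instance, while the shortened rule covers three positives and one negative, so the second condition is removed.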

Related work

• EL-Manzalawy et al. (2008)
• Data: 934 epitopes, 934 random peptides
• Compared several machine learning methods
• Best performance by support vector machine + string kernel

                             SVM
Their data, string kernel    69.59 %
Our data, string kernel      78.11 %

First baseline


Immune response prediction

First attempt: AA counts

Attributes: one count per amino acid (A = alanine count, C = cysteine count, ...), plus the class (positive or negative immune response)

Example peptide: QGDYCRPTVQEERKL, response 35 (negative)

A C D E F G H I K L M N P Q R S T V W Y  Class
0 1 1 2 0 1 0 0 1 1 0 0 1 2 2 0 1 1 0 1  neg
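As an illustration of the attribute vector above, the counts for the example peptide can be computed in a few lines. The function name `aa_counts` is hypothetical; the column order follows the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order used on the slide


def aa_counts(peptide):
    """Return the 20 amino-acid counts of a peptide in slide column order."""
    return [peptide.count(aa) for aa in AMINO_ACIDS]


example = "QGDYCRPTVQEERKL"
for aa, count in zip(AMINO_ACIDS, aa_counts(example)):
    print(aa, count)
```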

First attempt: AA counts

                 SVM       Rules
String kernel    78.11 %
AA counts        79.44 %   73.93 %

Second baseline

• Simple attributes are more accurate than string kernel
• SVM is more accurate than rules
• But rules can be understood by a human

First attempt: AA counts

(Y = 0) and (F = 0) and (E >= 1) → Class = neg
(Y = 0) and (W = 0) and (R = 0) and (F <= 1) → Class = neg
(Y = 0 or <= 1) and ... → Class = neg
... and (F = 0 or <= 1) and ... → Class = neg
... and (W = 0 or <= 1) and ... → Class = neg
...
otherwise Class = pos

Y = tyrosine, F = phenylalanine, W = tryptophan

No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide → Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide → Class = neg, otherwise Class = pos (63.21 %)

AA counts in sections of peptide

• AA counts ignore position in peptide

Peptide   X   Y   Class   X left   Y left   X right   Y right
XXYY      2   2   pos     2        0        0         2
XXYY      2   2   pos     2        0        0         2
XXYY      2   2   pos     2        0        0         2
YYXX      2   2   neg     0        2        2         0
YYXX      2   2   neg     0        2        2         0
YYXX      2   2   neg     0        2        2         0
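The per-section counts in the toy table above could be computed as in the sketch below. How a 15-mer is divided when the number of sections does not divide its length is not stated on the slides, so the near-equal split in the hypothetical `split_sections` is an assumption.

```python
def split_sections(peptide, k):
    """Split into k contiguous sections whose lengths differ by at most one (assumed)."""
    base, extra = divmod(len(peptide), k)
    sections, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        sections.append(peptide[start:end])
        start = end
    return sections


def section_counts(peptide, k, alphabet):
    """Counts of each symbol in each section, section by section."""
    return [section.count(aa)
            for section in split_sections(peptide, k)
            for aa in alphabet]


# Toy peptides from the slide: columns are X left, Y left, X right, Y right.
print(section_counts("XXYY", 2, "XY"))
print(section_counts("YYXX", 2, "XY"))
```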

AA counts in sections of peptide

                         SVM       Rules
AA counts                79.44 %   73.93 %
AA counts, 2 sections    79.98 %   73.84 %
AA counts, 3 sections    80.31 %   70.87 %
AA counts, 4 sections    79.99 %   69.05 %
AA counts, 5 sections    79.98 %   71.49 %

• SVM accuracy increases a bit
• Rules accuracy decreases
• SVM better at coping with many attributes?

AA count differences

• Machine learning cannot infer all relations automatically, it needs help

X   Y   Z   Class   X - Y   ...
1   2   5   pos     -1
2   3   3   pos     -1
3   4   1   pos     -1
2   1   5   neg      1
3   2   3   neg      1
4   3   1   neg      1
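The derived X - Y attribute from the toy table can be generated for every pair of counts as sketched below. Whether the talk used each unordered pair once or both orders is not stated; the hypothetical `count_differences` uses each unordered pair once, which is an assumption.

```python
from itertools import combinations


def count_differences(counts):
    """Map 'A-B' to counts[A] - counts[B] for each unordered pair of attributes."""
    return {"%s-%s" % (a, b): counts[a] - counts[b]
            for a, b in combinations(sorted(counts), 2)}


# Toy rows from the slide: no single count separates the classes, but X - Y does.
rows = [
    ({"X": 1, "Y": 2, "Z": 5}, "pos"),
    ({"X": 2, "Y": 3, "Z": 3}, "pos"),
    ({"X": 3, "Y": 4, "Z": 1}, "pos"),
    ({"X": 2, "Y": 1, "Z": 5}, "neg"),
    ({"X": 3, "Y": 2, "Z": 3}, "neg"),
    ({"X": 4, "Y": 3, "Z": 1}, "neg"),
]
for counts, cls in rows:
    print(cls, count_differences(counts)["X-Y"])
```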

AA count differences

                                     SVM       Rules
AA counts                            79.44 %   73.93 %
AA count differences                 78.48 %   72.91 %
AA counts + AA count differences     79.39 %   75.29 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small

AA count differences

(E - Y <= -1) and (N - R <= -1) → Class = pos
(E - Y <= 0) and (D - Y <= -1) and (Q - Y <= -2) → Class = pos
(E - something <= -1 or 0) and ... → Class = pos
(something - Y <= -2, -1 or 0) and ... → Class = pos
...
otherwise Class = neg

Y = tyrosine, E = glutamic acid

No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No glutamic acid in peptide → Class = pos, otherwise Class = neg (59.77 %)

Substring counts

• Perhaps single AAs are not informative enough and we should count longer substrings

Peptide   X   Y   Class   XXX   ...
XXXYY     3   2   pos     1
YXXXY     3   2   pos     1
YYXXX     3   2   pos     1
XYXYX     3   2   neg     0
XYYXX     3   2   neg     0
XXYYX     3   2   neg     0
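Counting all contiguous substrings up to a given length, as in the toy table above, can be sketched like this. The name `substring_counts` is hypothetical, and counting overlapping occurrences separately is an assumption.

```python
def substring_counts(peptide, max_len):
    """Count every contiguous substring of length 1..max_len (overlaps included)."""
    counts = {}
    for n in range(1, max_len + 1):
        for i in range(len(peptide) - n + 1):
            sub = peptide[i:i + n]
            counts[sub] = counts.get(sub, 0) + 1
    return counts


# Toy peptides from the slide: X and Y counts are identical, XXX separates the classes.
for peptide in ["XXXYY", "YXXXY", "YYXXX", "XYXYX", "XYYXX", "XXYYX"]:
    counts = substring_counts(peptide, 3)
    print(peptide, counts["X"], counts["Y"], counts.get("XXX", 0))
```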

Substring counts

                                           SVM       Rules
AA counts                                  79.44 %   73.93 %
Counts of substrings of lengths up to 2    78.92 %   74.00 %
Counts of substrings of lengths up to 3    79.01 %   73.91 %

• The changes are small
• Only one rule with substrings of length above 1:

(Y = 0) and (F = 0) and (W = 0) and (M = 0) and (I = 0) and (LL = 0) → Class = neg

Substrings with gaps

• Machine learning needs recurring patterns
• Substrings of length above 1 have small counts, so there is little recurrence
• Increase substring counts by allowing gaps between AAs

• XYXABCXXY: ABC × 1
• YYAXBCYYX: ABC × 0.5
• XYABXXCYX: ABC × 0.5²
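The three weighted examples above are consistent with each occurrence being weighted by 0.5 raised to the total number of gap positions. The sketch below implements that reading; treating the gap limit as a bound on the total gap length and summing the weights of multiple occurrences are assumptions, and `gapped_count` is a hypothetical name.

```python
from itertools import combinations


def gapped_count(peptide, pattern, max_gap, decay=0.5):
    """Sum decay**gap over in-order occurrences of pattern with total gap <= max_gap."""
    k = len(pattern)
    total = 0.0
    for positions in combinations(range(len(peptide)), k):
        if any(peptide[p] != c for p, c in zip(positions, pattern)):
            continue
        # Number of skipped positions between the first and last matched residue.
        gap = positions[-1] - positions[0] + 1 - k
        if gap <= max_gap:
            total += decay ** gap
    return total


for peptide in ["XYXABCXXY", "YYAXBCYYX", "XYABXXCYX"]:
    print(peptide, gapped_count(peptide, "ABC", max_gap=2))
```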

Substrings with gaps

                                          SVM       Rules
AA counts                                 79.44 %   73.93 %
Lengths up to 3, no gaps                  79.01 %   73.91 %
Lengths up to 3, gap lengths up to 1      79.11 %   74.37 %
Lengths up to 3, gap lengths up to 2      78.83 %   74.62 %
Lengths up to 3, gap lengths up to 3      79.10 %   74.71 %
Lengths up to 3, gap lengths up to 4      78.91 %   75.54 %

• SVM accuracy decreases a bit
• Rules accuracy increases
• The changes are small

Substrings with gaps

• Still no rules with substrings of length 3
• More rules with substrings of length 2

(Y = 0) and (F = 0) and (E >= 1) and (W = 0) and (RL = 0) → Class = neg
...
otherwise Class = pos

Leucine: positive response when in pair

Substring   Count   Substring   Count
RL/LR       5       EL          1
LL          5       SL          1
LP          2       RR          1
SE          2       PP          1
KK          2

Classes of AAs

• Perhaps individual AAs are too specific and we should merge similar AAs into classes

Flexibility index:

AA      W        Y        F        C        I        V        H        L        M        A
Index   -0.727   -0.721   -0.719   -0.693   -0.682   -0.669   -0.662   -0.631   -0.626   -0.605

AA      G        T        R        S        N        Q        D        P        E        K
Index   -0.537   -0.525   -0.448   -0.423   -0.381   -0.369   -0.279   -0.271   -0.160   -0.043

In this order the AAs are split into three classes: Inflexible (1), Medium (2), Flexible (3)

Example peptide:
QGDYCRPTVQEERKL
323113321333332
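Replacing each AA by its flexibility class can be sketched as below. The class boundaries are inferred from the example encoding on the slide; histidine does not occur in that example, so its class here is a guess, and `FLEXIBILITY_CLASS` and `encode` are hypothetical names.

```python
FLEXIBILITY_CLASS = {}
for aa in "WYFCIV":
    FLEXIBILITY_CLASS[aa] = 1  # inflexible
for aa in "LMAGT":
    FLEXIBILITY_CLASS[aa] = 2  # medium
for aa in "RSNQDPEK":
    FLEXIBILITY_CLASS[aa] = 3  # flexible
# ASSUMPTION: H lies between V (class 1) and L (class 2) on the slide;
# which class it belongs to cannot be recovered from the transcript.
FLEXIBILITY_CLASS["H"] = 1


def encode(peptide, classes=FLEXIBILITY_CLASS):
    """Rewrite a peptide as a string of class labels, one per residue."""
    return "".join(str(classes[aa]) for aa in peptide)


print(encode("QGDYCRPTVQEERKL"))
```

The encoded string can then be fed to the same substring counting as the original peptides, now over a three-letter alphabet.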

Classes of AAs

Counts of substrings of lengths up to 3, with gaps of length 1:
• 1, 2, 3
• 11, 12, 13, 21, 22, 23, 31, 32, 33
• 111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333

Classes of AAs

                                       SVM       Rules
AA counts                              79.44 %   73.93 %
AAs, lengths up to 3, gaps up to 1     79.11 %   74.37 %
Aromatic/aliphatic                     72.51 %   72.49 %
Basic/acidic                           61.15 %   61.06 %
Flexibility                            69.24 %   67.46 %
Hydrophobicity                         58.92 %   56.85 %
Polarity                               64.94 %   63.50 %
Size                                   61.89 %   60.58 %
Turns index                            58.92 %   55.57 %

Aromatic/aliphatic AAs

Aromatic            Aliphatic        Other
Phenylalanine (F)   Isoleucine (I)   all the rest
Histidine (H)       Leucine (L)
Tryptophan (W)      Valine (V)
Tyrosine (Y)

No tyrosine in peptide → Class = neg, otherwise Class = pos (66.53 %)
No tryptophan in peptide → Class = neg, otherwise Class = pos (59.27 %)
No phenylalanine in peptide → Class = neg, otherwise Class = pos (63.21 %)

1 or no aromatic AAs in peptide → Class = neg, otherwise Class = pos (71.37 %)
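The last single-condition rule above is simple enough to write out directly. The sketch below applies it to the example peptide from earlier in the talk; `aromatic_rule` is a hypothetical name and the aromatic set is the one listed on this slide.

```python
AROMATIC = set("FHWY")  # phenylalanine, histidine, tryptophan, tyrosine


def aromatic_rule(peptide):
    """1 or no aromatic AAs in the peptide gives neg, otherwise pos."""
    n_aromatic = sum(1 for aa in peptide if aa in AROMATIC)
    return "neg" if n_aromatic <= 1 else "pos"


# The example peptide contains a single aromatic AA (Y).
print(aromatic_rule("QGDYCRPTVQEERKL"))
```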

Flexibility of AAs

(F1 >= 5) → Class = pos
(F1 >= 4) and (F33 <= 4) and (F131 >= 1.5) and (F231 = 0) and (F213 <= 1) → Class = pos
(F1 >= 4) and (F33 <= 3.5) → Class = pos
(F1 >= 3) and (F2 >= 5) and (F311 >= 1) → Class = pos
(F1 >= 3) and (F11 = 0) and (F322 <= 0.5) and (F321 >= 1) → Class = pos
otherwise Class = neg

• Inflexible AAs indicate pos
• Flexibility linked to epitope propensity in the literature (?)
• Y, W, F are inflexible, E is flexible

Polarity of AAs

(N >= 8) and (P+ >= 2) and (P- <= 1) → Class = pos
(N >= 10) and (NP+ >= 0.5) and (NNP0 >= 1.5) → Class = pos
(...N... >= something) and ... → Class = pos
(...P-... <= something) and ... → Class = pos
...
otherwise Class = neg

• Non-polar AAs indicate pos
• Should polarity not be conducive to antibody binding?
• Y, W, F are non-polar, E is polar negative

Classes of AAs

                                       SVM       Rules
AA counts                              79.44 %   73.93 %
All AA class counts                    79.77 %   75.07 %
All AA class counts and AA counts      79.89 %   76.12 %

• SVM accuracy increases a bit
• Highest rules accuracy

(aromatic AAs >= 3) and (non-polar AAs >= 8) and (medium turn presence AAs >= 8) → Class = pos

(aromatic AAs >= 3) and (non-polar AAs >= 8) and (Y >= 2) → Class = pos

Precision 88.64 %

Precision 91.71 %

AA pair counts

[Figure: an antibody binding a peptide at two amino acids, AA1 and AA2, separated by distance d]

Attributes are counts of AA1 and AA2 at distance d, for all AA1, AA2 and d

• Too many attributes: 20 AAs × 20 AAs × 14 possible distances = 5,600 attributes

• Ways to reduce them:
  – Classes of AAs instead of individual AAs
  – Increment distances in steps > 1

[Figure: distances d = 1 to d = 5 along the peptide, illustrating step 3]
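The pair-count attributes and the 5,600 figure can be checked with a short sketch. Treating pairs as ordered (AA1 before AA2 along the peptide) matches the 20 × 20 factor but is still an assumption, as is starting the distance steps at d = 1; `pair_counts` is a hypothetical name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_counts(peptide, step=1):
    """Count (AA1, AA2, d): AA1 at position i and AA2 at position i + d."""
    n = len(peptide)
    distances = range(1, n, step)  # 1..14 for a 15-mer when step is 1
    counts = {(a, b, d): 0
              for a, b in product(AMINO_ACIDS, repeat=2)
              for d in distances}
    for d in distances:
        for i in range(n - d):
            counts[(peptide[i], peptide[i + d], d)] += 1
    return counts


attributes = pair_counts("QGDYCRPTVQEERKL")
print(len(attributes))
print(attributes[("E", "E", 1)])
```

With `step=4` the same 15-mer yields distances 1, 5, 9 and 13 in this sketch, which cuts the attribute count to 1,600.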

AA pair counts

                                                     SVM       Rules
AA counts                                            79.44 %   73.93 %
AA × AA, distance step 4                             75.28 %   71.02 %
AA × aromatic/aliphatic, step 3                      78.25 %   72.80 %
Aromatic/aliphatic × aromatic/aliphatic, step 2      72.56 %   70.90 %

• Accuracy decreases
• Rules not particularly illuminating

AA pair counts with a fixed side

[Figure: a peptide attached to the peptide array surface; the antibody binds at the easily accessible side of the peptide. One AA is at a fixed position, the other AA is at distance d from it]

Attributes are the AA at the fixed position and counts of AA at distance d, for all AA and d

• Fewer attributes: 20 AAs at fixed position + 20 AAs × 14 possible distances = 300 attributes
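The 300 fixed-side attributes could be laid out as in the sketch below. The slides do not say which end of the peptide is the fixed position, so using the first residue (`fixed_index=0`) is an assumption, and the attribute names and `fixed_pair_attributes` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fixed_pair_attributes(peptide, fixed_index=0):
    """20 indicators for the fixed AA plus 20 x 14 'AA at distance d' counts."""
    attrs = {"fixed=%s" % aa: int(peptide[fixed_index] == aa)
             for aa in AMINO_ACIDS}
    for d in range(1, len(peptide)):
        for aa in AMINO_ACIDS:
            attrs["%s at d %d" % (aa, d)] = 0
    for d in range(1, len(peptide)):
        pos = fixed_index + d
        if pos < len(peptide):
            attrs["%s at d %d" % (peptide[pos], d)] += 1
    return attrs


attrs = fixed_pair_attributes("QGDYCRPTVQEERKL")
print(len(attrs))
print(attrs["fixed=Q"], attrs["Y at d 3"])
```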

AA pair counts with a fixed side

                                    SVM       Rules
AA counts                           79.44 %   73.93 %
Pair AA × AA, distance step 4       75.28 %   71.02 %
Fixed pair, step 2                  80.07 %   69.35 %
Fixed pair, step 1                  78.82 %   67.60 %

• SVM accuracy increases a bit
• Rules accuracy decreases due to bad attributes

(Y at d 2 >= 1) → Class = pos
(Y at d 6 >= 1) → Class = pos
(Y at d 7 >= 1) → Class = pos
(Y at d 9 >= 1) → Class = pos
(Y at d 11 >= 1) → Class = pos
(Y at d 14 >= 1) → Class = pos
...
otherwise Class = neg

AA properties (in sections of peptide)

• AA properties on which AA classes are based can be used directly, averaged over the peptide

                                             SVM       Rules
AA counts                                    79.44 %   73.93 %
AA properties                                76.45 %   72.63 %
AA properties, 2 sections                    76.97 %   73.01 %
AA properties, 3 sections                    77.25 %   71.13 %
AA properties and AA counts                  79.30 %   74.97 %
AA properties and AA counts, 2 sections      79.89 %   74.24 %
AA properties and AA counts, 3 sections      80.05 %   70.80 %

AA properties (in sections of peptide)

(aromatic >= 0.3) → Class = pos
(polarity <= something) and ... → Class = pos
(basic >= something) and ... → Class = pos
...
otherwise Class = neg

• Aromatic AAs indicate pos
• Non-polar AAs indicate pos (Y, W, F non-polar)
• Presence of basic (H, K, R) / absence of acidic (E, D) AAs indicates pos
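Averaging a property over a peptide, as used by the first rule above, can be sketched as follows. The slides do not give the scale of the "aromatic" property, so treating it as a 0/1 indicator (making the average the fraction of aromatic AAs) is an assumption, and the function names are hypothetical.

```python
AROMATIC = set("FHWY")


def average_property(peptide, value_of):
    """Mean of a per-residue property over the whole peptide."""
    return sum(value_of(aa) for aa in peptide) / len(peptide)


def aromatic_fraction(peptide):
    """ASSUMPTION: the aromatic property is 1 for F, H, W, Y and 0 otherwise."""
    return average_property(peptide, lambda aa: 1.0 if aa in AROMATIC else 0.0)


def first_rule(peptide):
    """Only the first rule on the slide; the remaining rules are elided there."""
    return "pos" if aromatic_fraction(peptide) >= 0.3 else None


print(aromatic_fraction("QGDYCRPTVQEERKL"))
print(first_rule("YWFHYAAAAAAAAAA"))
```

`first_rule` returns `None` when the rule does not fire, since the later rules and thresholds are not given in full on the slide.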


Limitations and future work

Page 86: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Why are we stuck at 80 %?

[Diagram: training data vs. test data]

Training data
• A single antibody or a group of similar antibodies (two such groups): similar peptides. Similarity recognized by machine learning: our 80 %
• Single different antibodies: each peptide different. Ignored by machine learning

Test data
• Recognized by machine learning: our 80 % again
• Not recognized

Page 88: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Where do we go from here?

Tyrosine followed by another aromatic AA followed by tryptophan followed by polar AA

• Aggregating rules with different attributes
• Kernels to use many attributes simultaneously
• Peptide similarity

Page 90: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Aggregating rules

• Check if different attribute sets cover different instances

• If so, pick the best rules for each attribute set
• Use only the best rules for classification

          Training    Test
Rule 1
Rule 2
Rule 3
...
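
The three steps above could be sketched roughly as follows. This is only an illustration of the idea, not a method the slides specify: the representation of a rule as a Python predicate over an attribute dictionary, the use of precision on the training data as the measure of "best", and the `min_precision` cut-off are all assumptions.

```python
def covered_indices(rules, instances):
    """Indices of the instances on which at least one rule fires."""
    return {
        i for i, (attrs, _label) in enumerate(instances)
        if any(rule(attrs) for rule in rules)
    }


def precision(rule, instances):
    """Share of covered instances labelled pos; None if nothing is covered."""
    covered = [label for attrs, label in instances if rule(attrs)]
    if not covered:
        return None
    return sum(label == "pos" for label in covered) / len(covered)


def select_best_rules(rules_by_attribute_set, training, min_precision=0.8):
    """Keep, from every attribute set, the rules that are precise enough."""
    best = []
    for rules in rules_by_attribute_set.values():
        for rule in rules:
            p = precision(rule, training)
            if p is not None and p >= min_precision:
                best.append(rule)
    return best


def classify(attrs, best_rules):
    """pos if any selected rule fires, otherwise neg."""
    return "pos" if any(rule(attrs) for rule in best_rules) else "neg"
```

Comparing `covered_indices` for two attribute sets shows whether they cover different instances; evaluating `precision` on both training and test data fills in a table like the one above.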

Page 93: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Kernels

• Use many attributes without computing them explicitly

• Only works with some methods like SVM

Instance 1: (a11, a12, ..., a1n) Instance 2: (a21, a22, ..., a2n)

(a11, a12, ..., a1n) · (a21, a22, ..., a2n) = a11 a21 + a12 a22 + ... + a1n a2n

Only need dot product

Only need to compute attributes that are non-zero in both attribute vectors
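
As an illustration of the last point, here is a small sketch that uses substring counts as the attributes. The choice of contiguous substrings of length `k` (a spectrum-style kernel) and the function names are assumptions made for the example; the slides do not say which kernel is intended.

```python
from collections import Counter


def substring_counts(peptide, k=2):
    """Sparse attribute vector: counts of every length-k substring that occurs."""
    return Counter(peptide[i:i + k] for i in range(len(peptide) - k + 1))


def sparse_dot(counts_a, counts_b):
    """Dot product that only visits attributes non-zero in both vectors."""
    # Iterate over the smaller vector; substrings absent from either
    # peptide would contribute a zero term and are never touched.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(
        value * counts_b[key]
        for key, value in counts_a.items()
        if key in counts_b
    )


def kernel(peptide_a, peptide_b, k=2):
    """Kernel value for two peptides, without building the full attribute vector."""
    return sparse_dot(substring_counts(peptide_a, k), substring_counts(peptide_b, k))
```

With 20 amino acids there are 20^k possible substrings of length k, but a short peptide contains only a handful of them, so the dot product stays cheap even when the full attribute vector would be very long.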

Page 96: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Peptide similarity

• Smart similarity:
  – Find best alignment of two peptides
  – Then compute similarity
• Nearest-neighbor classification
• Clustering:
  – Find groups of similar peptides
  – Find groups of peptides that are similar in the same way
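
One possible reading of "find best alignment, then compute similarity", followed by nearest-neighbor classification, is sketched below. The alignment here is deliberately simple (one peptide is slid along the other without gaps, and similarity is the number of identical positions); that scoring, and the function names, are assumptions for illustration and could be replaced by gapped alignment or a substitution matrix.

```python
def aligned_similarity(a, b):
    """Best count of identical positions over all ungapped offsets of b against a."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, aa in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == aa
        )
        best = max(best, matches)
    return best


def nearest_neighbor(peptide, labeled_peptides):
    """Label of the training peptide most similar to `peptide` (1-NN)."""
    _, label = max(
        labeled_peptides,
        key=lambda item: aligned_similarity(peptide, item[0]),
    )
    return label
```

The same similarity function could also serve as the basis for clustering, since most clustering methods only need pairwise similarities.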

Page 97: In silico immune response prediction based on peptide array data Mitja Luštrek Institute for Biostatistics and Informatics in Medicine and Aging Research

Questions? Suggestions?

Peptide arrays – SVM – RIPPER – EL-Manzalawy

AA counts – AA counts in sections
AA count differences – Substring counts (with gaps)

AA classes – AA pairs (with a fixed side) – AA properties

Stuck at 80 %
Aggregating rules – Kernels – Peptide similarity