Bioinformatics of Disease: immune epitope prediction

Shoba Ranganathan
Professor and Chair, Bioinformatics, Dept. of Chemistry and Biomolecular Sciences & Biotechnology Research Institute, Macquarie University, Sydney, Australia
Adjunct Professor, Dept. of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore

Visiting scientist @ Institute for Infocomm Research (I2R), Singapore

Bioinformatics is the study of living systems through computation.

Data in Bioinformatics (in the main) and their management and analysis
• Sequences, genomes, transcriptomes
• Structures
• Networks, pathways and systems
• Genetics and populations
• Evolution and phylogenetics
• Databases, ontologies
• Data & text mining
• Maths/Stats, algorithms, physics/chemistry

Overview of my research
1. Genome analysis
2. Transcriptome analysis
3. Protein/Proteome analysis
4. Systems Biology
5. Immunoinformatics
6. Genome-phenome mapping
7. Biodiversity Informatics

5. What is Immunoinformatics?
• Using bioinformatics to address problems in immunology
• Applying bioinformatics to accelerate immune system research has the potential to deliver vaccines and immunotherapeutics

[Figure: IMMUNOINFORMATICS at the intersection of immunology, computer science and biology. Labels: computational systems biology of immune response; networks, pathways and systems; maths/stats; databases; artificial intelligence; algorithms; physics/chemistry; cell biology; -omics; basic immunology; clinical immunology]

Summary
• Introduction
• Structural immunoinformatics: database development, data analysis, computational models, applications
[Diagram labels: networks, pathways and systems; genetics and populations; -omics; basic immunology; clinical immunology]

The immune system
• Composed of many interdependent cell types, organs, and tissues
• Protects the body from infections (bacterial, parasitic, fungal, or viral) and arrests abnormal growth and differentiation
• Inappropriate immune responses lead to allergies and autoimmunity
• 2nd most complex system in the human body

Genomics vs. Immunomics
• Genomics: solving the genome puzzle; 10^4 genes coding for 10^6 products
• Immunomics: understanding immune response; 10^2-10^3 genes leading to >10^12 products
• Enormous diversity in immunomics has implications for immune function and modulation

It is a numbers game…
• >10^13 MHC class I haplotypes (IMGT-HLA)
• 10^7-10^15 T cell receptors (Arstila et al., 1999)
• >10^9 combinatorial antibodies (Jerne, 1993)
• 10^12 B cell clonotypes (Jerne, 1993)
• 10^11 linear epitopes composed of nine amino acids
• >>10^11 conformational epitopes
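The order of magnitude quoted for linear nine-residue epitopes follows from simple counting: with 20 standard amino acids there are 20^9 distinct nonamers. A minimal sketch (the function name is illustrative, not from the slides):

```python
def count_linear_peptides(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct linear peptides of the given length."""
    return alphabet_size ** length


nonamers = count_linear_peptides(9)
print(f"{nonamers:.2e}")  # 5.12e+11
```

This only counts sequences; it says nothing about which of them are actually processed, presented or recognized.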

T cell mediated adaptive immune response
• Specific peptide residues are critical for stimulating cellular immune responses
• Major histocompatibility complex (MHC) molecules (Human Leukocyte Antigen or HLA in humans) bind short antigenic peptides and present them to T cell receptors for inspection
• Antigen presentation is by two classes of MHC (class I and class II)
• Peptides that bind to a specific MHC and trigger T cell recognition (T cell epitopes) are targets for vaccine and immunotherapy development

How to generate a T cell-mediated immune response
1. Epitope
2. MHC
3. T cell receptor

Major histocompatibility complex
[Figure: Gene structure of the human MHC (MHC Class II, MHC Class I)]
[Figure: 3D structure of the human MHC]
[Figure: MHC Class I for endogenous peptides. Figure by Eric A.J. Reits]
[Figure: MHC class II for exogenous peptides. Figure by Eric A.J. Reits]

Antigen processing pathway: peptides, MHC, T-cells
1. Degradation of antigen (20% processed)
2. Peptide binding to MHC (0.5% bind MHC)
3. Recognition of peptide-MHC complex by T-cells (50% CTL response)
Overall: 0.05% chance of immunogenicity
Yewdell et al. Ann. Rev Immunol (1999)
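The overall 0.05% figure is the product of the three stage-wise fractions quoted on the slide (20% processed, 0.5% bind MHC, 50% CTL response). A short sketch of that arithmetic, assuming the stages can be treated as independent filters (the function and parameter names are illustrative):

```python
def immunogenicity_chance(p_processed=0.20, p_binds_mhc=0.005, p_ctl_response=0.50):
    """Chance that a peptide survives all three stages of the pathway."""
    return p_processed * p_binds_mhc * p_ctl_response


overall = immunogenicity_chance()
print(f"{overall:.2%}")  # 0.05%
```

In other words, roughly 1 peptide in 2000 would be expected to be immunogenic, which is why in silico pre-screening is attractive.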

Physico-chemical properties affect MHC-peptide binding

Epitope prediction ≡ “Fishing”
• Suggest candidate epitopes by in silico screening of entire proteins and even proteomes, with specificity at:
  - the allele level
  - the supertype level
  - disease-implicated alleles alone
• Minimize the number of wet-lab experiments
• Cut down the lead time involved in epitope discovery and vaccine design

Computational models can help identify T cell epitopes
1. Sequence-based approach
  • Pattern recognition techniques: binding motifs, matrices, ANN, HMM, SVM
  • Main limitations: require large amounts of data for training; preclude data with limited sequence conservation
2. Structure-based approach
  • Rigid backbone modeling techniques
  • Flexible docking techniques
  • Main advantage: large training datasets unnecessary
Predicting MHC-binding peptides: Tong, Tan and Ranganathan (2007) Briefings in Bioinformatics 8: 96-108

Why structure?
Our aim: structure-based prediction of MHC-binding peptides
• Great potential to:
  - generate biologically meaningful data for analysis
  - predict candidate peptides for alleles that have not been widely studied, where sequence-based approaches fail or are not attempted
  - predict binding affinity of peptides
  - predict non-contiguous epitopes
• Structure determination through experimental methods is both expensive and time-consuming
• Structure-based prediction has not been extensively studied due to high computational costs and development complexity

Existing Structure-based Prediction Techniques
• Protein threading [Altuvia et al. 1995; Schueler-Furman et al. 2000]
• Homology modeling [Michielin et al. 2000]
• Rigid/flexible docking [Rosenfeld et al. 1993; Sezerman et al. 1996; Rognan et al. 1999; Desmet et al. 2000; Michielin et al. 2003]

Hypothesis for epitope selection
• Peptides bound to MHC alleles are similar to substrates bound to enzymes
• “Lock-and-key” mechanism for peptide selection: shape, size, electrostatic characteristics

Outline: Introduction; Structural immunoinformatics
[Diagram labels: sequences; structures; databases, ontologies; basic immunology; genetics and populations]

MPID: MHC-Peptide Interaction Database
Govindarajan et al. (2003) Bioinformatics, 19: 309-310
RDB of 82 curated pMHC complexes (Class I: 64; Class II: 18)

[Figure: Distribution based on MHC allele specificity. y-axis ticks: 0 to 25; x-axis (MHC allele): A*0201, A*6801, B*0801, B*2705, B*3501, B*5101, B*5301, DQ8, DR1, DR2, DR3, DR4, H2-Db, H2-Dd, H2-Kb, H2-Ld, HLA-Cw3, HLA-Cw4, I-Ad, I-Ak, RT1.Aa]

Peptide/MHC interaction characteristics
• Interface area
• Intermolecular hydrogen bonds
• Gap volume
• Gap index: $\text{Gap index} = \dfrac{\text{Gap volume}}{\text{Interface area}}$
• Interacting residues
• Peptide length
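The gap index is simply the ratio of the two interface measurements listed above. A minimal sketch, assuming gap volume is given in Å³ and interface area in Å² (function and argument names are illustrative); the example values are the HLA-A2 superfamily means quoted later in this deck:

```python
def gap_index(gap_volume_a3: float, interface_area_a2: float) -> float:
    """Gap index = gap volume / interface area (result in Å)."""
    if interface_area_a2 <= 0:
        raise ValueError("interface area must be positive")
    return gap_volume_a3 / interface_area_a2


# HLA-A2 superfamily means from the deck: gap volume 799.8, interface area 846.3
print(round(gap_index(799.8, 846.3), 2))
```

The ratio of the two means is not identical to the mean of per-complex ratios, so it only approximates the tabulated gap index of 0.9±0.2.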

MPID-T: MHC-Peptide-T Cell Receptor Interaction Database
Tong et al. (2006) Applied Bioinformatics, 5: 111-114
• 187 curated pMHC complexes, 16 with TCR
• Human: 110, Murine: 74 and Rat: 3
• Alleles: 40
• Interaction characteristics: interface area, H bonds, gap volume and gap index

[Figure: Distribution of MHC by allele. y-axis ticks: 0 to 40; x-axis alleles: A*0101, A*0201, A*6801, B*1501, B*3501, B*0801, B*2705, B*2709, B*4402, B*4403, B*4405, B*5101, B*5301, Cw*0304, Cw*0401, E*0103, E*0101, G*0101, DRB1*0101, DRB1*0301, DRB1*1501, DRB5*0101, DQB1*0302, DQB1*0602, DRB1*0401, DQB1*0201, RT1.Aa, RT1-A1C, H2-Db, H2-Dd, H2-Kb, H2-Ld, H2-M3, H2-Qa-2, I-Ak, I-Ab, I-Ad, I-Au, I-Ek, I-Ag7]
• 101 new entries
• 187 entries (Human: 110; Murine: 74; Rat: 3)
• 134 non-redundant entries (class I: 100; class II: 34)
• 121 class I and 41 class II entries
• 26 HLA alleles (class I: 18; class II: 8)
• 14 rodent alleles (class I: 8; class II: 6)
• 16 TCR/peptide/MHC complexes

Peptide/MHC binding motifs
• Conserved peptide properties in solution structures
• Classified according to alleles and peptide length
• Legend: polar, amide, basic, acidic, hydrophobic

How to obtain structures of experimentally unsolved alleles?
1. As of 2006, crystal structures were available for only 36 unique MHC alleles, vs. 1765 unique MHC alleles identified in the IMGT/HLA database
2. Structure determination through experimental methods is both expensive and time-consuming
3. Homology model building for alleles with no structural data!


Outline: database development; data analysis of pMHC Class I complexes; computational models; applications
[Diagram labels: data & text mining; maths/stats; structures]

MHC Class I superfamilies have different interaction characteristics

Superfamily            HLA-A2 (36 entries)               HLA-B7 (12 entries)   HLA-B27 (18 entries)
Interface area (Å²)    846.3±48.9                        876.7±72.4            934.0±136.0
Gap volume (Å³)        799.8±195.2                       870.2±198.0           985.1±101.5
Gap index              0.9±0.2                           1.0±0.1               1.0±0.3
Hydrogen bonds         11.1±1.9                          14.3±2.3              17.9±2.8
H-bond distribution    Concentrated at pockets A, B, F   Well distributed      Concentrated at pockets A, B, F

Single linkage cluster analysis of 68 pMHC Class I complexes from 13 alleles (all available A and B)
Details
Data
• 68 peptide–HLA complexes spanning 13 class I alleles from MPID-T
Hierarchical clustering
• Hierarchical clustering using the agglomerative algorithm
• Distance between structures computed by the single-linkage method (MATLAB version 7.0), based on the separation between each pair of data points
• Nearest neighbors merged into clusters; smaller clusters then merged into larger clusters based on inter-cluster distances, until all structures are combined
• Last 3 levels considered for defining HLA class I supertypes
Interaction parameters
• Significant for the characterization of the peptide/MHC interface: intermolecular hydrogen bonds, pMHC interface area, gap volume, gap index
Binding characteristics of HLA supertypes analyzed
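The agglomerative single-linkage procedure described above can be sketched in a few lines. This is a plain-Python illustration, not the MATLAB routine used in the study; the two-descriptor profiles are made-up toy values, and in practice descriptors on different scales (hydrogen bonds, interface area, gap volume, gap index) would presumably need to be normalized first.

```python
from itertools import combinations
from math import dist


def linkage_distance(points, a, b):
    """Single linkage: distance between the closest pair of members."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def single_linkage(points):
    """Agglomerative clustering; returns merges as (distance, members_a, members_b)."""
    clusters = [frozenset([i]) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the two clusters whose nearest members are closest.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage_distance(points, *pair))
        history.append((linkage_distance(points, a, b), sorted(a), sorted(b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


# Hypothetical two-descriptor profiles for five complexes (not real data).
profiles = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0)]
for distance, left, right in single_linkage(profiles):
    print(f"merge {left} + {right} at distance {distance:.2f}")
```

Reading the merge history from the end gives the top levels of the hierarchy, which is the part the slide says was used to define supertypes.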

[Figure legend: B27, B44, B7, B62, B8]

MHC Class I superfamilies from receptor-ligand interactions
Do the Class I alleles aggregate into “superfamilies” using receptor-ligand interaction patterns?
• 80 HLA class I complexes
• 13 class I alleles
• Five descriptors
• Hierarchical clustering using nearest neighbor algorithm
• 77% consensus with data from other groups
Supertype definition: receptor structure, ligand binding motifs, or receptor-ligand interaction patterns
[Figure legend: B27, B44, B7, B62, B8]
Tong, Tan and Ranganathan (2007) Bioinformatics, 23: 177-183



[Diagram labels: maths/stats; structures; sequences; physics/chemistry]

Two-step approach to predict MHC-binding peptides
1. Finding the best fit conformation (docking) of peptides within the MHC binding groove
2. Screening potential binders from the background

Docking is a computationally exhaustive procedure
• Large number of possible peptide conformations:
  - 3 global translational degrees of freedom
  - 3 global rotational degrees of freedom
  - 1 conformational degree of freedom for each rotatable bond
• >10^10 possible conformations for a 10-residue peptide
[Figure: peptide unit (N, Cα, C, O, R) shown against x, y, z axes]
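To see how quickly the search space grows, the degrees of freedom listed above can be counted directly. The sketch below assumes, purely for illustration, that each rotatable bond is sampled at three discrete states; neither that number nor the bond counts come from the slides.

```python
def docking_degrees_of_freedom(n_rotatable_bonds: int) -> int:
    """3 translational + 3 rotational + 1 per rotatable bond."""
    return 3 + 3 + n_rotatable_bonds


def discrete_conformations(n_rotatable_bonds: int, states_per_bond: int = 3) -> int:
    """Conformations if every rotatable bond takes one of a few discrete states."""
    return states_per_bond ** n_rotatable_bonds


# Smallest bond count that exceeds 10^10 conformations at 3 states per bond.
n = 0
while discrete_conformations(n) <= 10**10:
    n += 1
print(n, docking_degrees_of_freedom(n), discrete_conformations(n))
```

Under this assumption about 21 rotatable bonds already exceed 10^10 conformations, before the six rigid-body degrees of freedom are even sampled.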

Conservation of nonamer peptide backbone conformation
• Class I peptides: N-termini residues 0.02 – 0.29 Å; C-termini residues 0.00 – 0.25 Å
• Class II binding registers (only 9 residues fit in the binding groove): N-termini residues 0.01 – 0.22 Å; C-termini residues 0.02 – 0.27 Å

Rapid docking of peptide to MHC
Tong, Tan & Ranganathan (2004) Protein Sci. 13:2523-2532
1. Anchoring root fragments to reduce search space (pseudo-Brownian rigid body docking)
2. Loop modeling (loop closure of central backbone by satisfaction of spatial restraints)
3. Ligand backbone and side-chain refinement (entire backbone and interacting side-chains)

Benchmarking with existing techniques

Author             Technique                         Peptide       RMSD(a)   RMSD(b)
Rognan et al.      Simulated annealing               TLTSCNTSV     1.04      0.46
                                                     FLPSDFFPSV    1.59      1.10
                                                     GILGFVFTL     0.46      0.32
                                                     ILKEPVHGV     0.87      0.87
                                                     LLFGYPVYV     0.78      0.33
Desmet et al.      Combinatorial buildup algorithm   RGYVYQGL      0.56      0.32
Rosenfeld et al.   Multiple copy algorithm           FAPGNYPAL     2.70      0.40
                                                     GILGFVFTL     1.40      0.32
Sezerman et al.    Combinatorial buildup algorithm   LLFGYPVYV     1.40      0.33
                                                     ILKGPVHGV     1.30      0.87
                                                     GILGFVFTL     1.60      0.32
                                                     TLTSCNTSV     2.20      0.46

(a) RMSD of peptide backbone obtained from respective authors.
(b) RMSD of peptide backbone obtained in our work from redocking bound complexes and single template respectively.

Quantitative separation of binders from non-binders: empirical free energy scoring function

$\Delta G_{bind} = \alpha \Delta G_{H} + \beta \Delta G_{S} + \gamma \Delta G_{EL} + C$

• $\Delta G_{bind}$ = binding free energy
• $\Delta G_{H}$ = hydrophobic term
• $\Delta G_{S}$ = decrease in side chain entropy
• $\Delta G_{EL}$ = electrostatic term
• $C$ = entropy change in system due to external factors
• $\alpha$, $\beta$, $\gamma$ optimized by least-square multivariate regression with experimental binding affinities (IC50) of MHC-peptides in training dataset (Rognan et al., 1999)

$\Delta G_{bind} \approx -RT \ln(IC_{50})$ (Rognan et al., 1999)

Test case: MHC Class II DQ8
DQ3.2β (DQA1*0301/DQB1*0302) is involved in several autoimmune diseases:
• Celiac disease
• Insulin-dependent diabetes mellitus (IDDM)
• IDDM-associated periodontal disease
• Autoimmune polyendocrine syndrome type II
Data used
• Structure: 1JK8, the DQ3.2β–insulin B9-23 complex
• Dataset I: 127 peptides with experimentally determined IC50 values derived from biochemical studies: 70 high-affinity (IC50 < 500 nM), 13 medium-affinity (500 nM < IC50 < 1500 nM) and 23 low-affinity (1500 nM < IC50 < 5000 nM) binders, and 21 non-binders (IC50 > 5000 nM); 87 with known binding registers
• Dataset II: 12 Dermatophagoides pteronyssinus (Der p 2) peptides with experimental T-cell proliferation values from functional studies, 7 of which elicit DQ3.2β-restricted T-cell proliferation


Scoring: Training & testing datasets
Training
• 56 binding conformations with known registers
• 30 non-binding conformations from 3 non-binders
Testing
• Test set 1: 68 peptides from biochemical studies (16 strong; 13 medium; 21 weak; 18 non-binders)
• Test set 2: 12 peptides from functional studies (7 elicit T-cell proliferation)

Screening class II binding register: a sliding window approach
E285B 112-126 peptide: YQTIEENIKIFEEDA

Core sequence   Binding Energy
YQTIEENIK       -23.12
QTIEENIKI       -21.34
TIEENIKIF       -25.32
IEENIKIFE       -29.53
EENIKIFEE       -32.27
ENIKIFEED       -21.72
NIKIFEEDA       -22.95
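The sliding-window idea can be sketched as follows: enumerate every 9-residue core of the 15-mer and keep the one with the lowest (most favorable) binding energy. In this sketch the energies are simply copied from the slide for the E285B 112-126 peptide; the docking and scoring that produced them are not reimplemented, and the function name is illustrative.

```python
def sliding_windows(sequence: str, width: int = 9):
    """All contiguous windows of the given width, in N-to-C order."""
    return [sequence[start:start + width]
            for start in range(len(sequence) - width + 1)]


peptide = "YQTIEENIKIFEEDA"  # E285B 112-126

# Binding energies as listed on the slide, keyed by core sequence.
energies = {
    "YQTIEENIK": -23.12, "QTIEENIKI": -21.34, "TIEENIKIF": -25.32,
    "IEENIKIFE": -29.53, "EENIKIFEE": -32.27, "ENIKIFEED": -21.72,
    "NIKIFEEDA": -22.95,
}

cores = sliding_windows(peptide)
best_core = min(cores, key=energies.get)  # most negative energy wins
print(best_core, energies[best_core])  # EENIKIFEE -32.27
```

A 15-mer yields 15 - 9 + 1 = 7 candidate registers, which matches the seven rows in the slide.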

4-step protocol used (docking)
A. Anchoring root fragments (probes) to reduce search space
B. Loop modeling
C. Refinement of binding register
D. Extension of flanking residues for MHC Class II

Accuracy estimates
• Sensitivity (SE) = fraction of binders correctly predicted = TP/AP, where AP = TP + FN
• Specificity (SP) = fraction of non-binders correctly predicted = TN/AN, where AN = TN + FP
• Area under ROC (receiver operating characteristic) curve: >90% excellent; >80% good
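These accuracy measures can be computed from predicted binding energies as in the sketch below. The energies and the threshold are hypothetical toy values; a peptide is called a binder when its energy is at or below the threshold, and the area under the ROC curve is obtained from its rank interpretation (the chance that a random binder scores lower than a random non-binder), which is one standard way to compute it and not necessarily the procedure used in the study.

```python
def confusion_counts(binder_energies, nonbinder_energies, threshold):
    """Return (TP, FN, TN, FP) when energy <= threshold means predicted binder."""
    tp = sum(e <= threshold for e in binder_energies)
    fp = sum(e <= threshold for e in nonbinder_energies)
    return tp, len(binder_energies) - tp, len(nonbinder_energies) - fp, fp


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def area_under_roc(binder_energies, nonbinder_energies):
    """Fraction of binder/non-binder pairs ranked correctly; ties count half."""
    wins = 0.0
    for b in binder_energies:
        for n in nonbinder_energies:
            wins += 1.0 if b < n else 0.5 if b == n else 0.0
    return wins / (len(binder_energies) * len(nonbinder_energies))


# Hypothetical energies (kJ/mol), for illustration only.
binders = [-35.0, -32.0, -30.0, -27.0]
nonbinders = [-29.0, -25.0, -22.0]
tp, fn, tn, fp = confusion_counts(binders, nonbinders, threshold=-28.70)
print(sensitivity(tp, fn), specificity(tn, fp), area_under_roc(binders, nonbinders))
```

Moving the threshold to more negative energies trades sensitivity for specificity, which is the trade-off an ROC curve summarizes.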

Screening class II binding register: HLA-DQ8 prediction accuracy for Test Set I
Results for Training set

Specificity (SP) level   Group   Sensitivity (SE)   Binding energy threshold (kJ/mol)
SP = 0.80                LMH     0.90               -28.70
                         MH      0.85               -29.10
                         H       0.75               -30.82
SP = 0.90                LMH     0.84               -29.10
                         MH      0.77               -30.50
                         H       0.75               -32.74
SP = 0.95                LMH     0.81               -29.93
                         MH      0.73               -32.12
                         H       0.63               -33.59

Lower SP level: high SE (good for most predictions)
Higher SP level: very few FPs, but also fewer predictions

Group   LMH    MH     H
AROC    0.88   0.93   0.93

Classification of binding peptides
• High-affinity binders (H): IC50 ≤ 500 nM
• Medium-affinity binders (M): 500 nM < IC50 ≤ 1500 nM
• Low-affinity binders (L): 1500 nM < IC50 ≤ 5000 nM
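The affinity classes above translate directly into a threshold function. A minimal sketch (the function name is illustrative); peptides with IC50 above 5000 nM are treated as non-binders, as in the dataset description earlier in the deck:

```python
def classify_binder(ic50_nm: float) -> str:
    """Map an experimental IC50 (nM) to the H / M / L classes used in the deck."""
    if ic50_nm <= 500:
        return "H"
    if ic50_nm <= 1500:
        return "M"
    if ic50_nm <= 5000:
        return "L"
    return "non-binder"


for ic50 in (20, 1156, 3000, 8000):
    print(ic50, classify_binder(ic50))
```

The groups LMH, MH and H used in the accuracy tables then correspond to counting low, medium and high binders, medium and high binders, or high binders only as positives.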

Test Set 1: Improved detection of binders lacking position specific binding motifs

Binding motif (positions 1, 4, 6, 7, 9): T D R R Q S V V V N W M D D G K A A A D E I I I P D Y Y R Q E F L M

Ligands / Epitopes      Source                 BE (kJ/mol)   IC50 (nM)
LQLQPFPQPQPFPPL         A-gliadin 56-70        -41.01        20
DMTPADALDDFDL           HSV                    -40.53        173
AAAAAVAAEAY             Artificial sequence    -39.98        48
GVAGLLVALAV             IA-2 499-509           -36.16        95
DSNIMNSINNVMDEIDFFEK    Pf ABRA 487–506        -36.01        171
FESTGNLIAPEYGFKISY      HA 255–271Y            -35.70        62
YPFIEQEGPEFFDQE         MHC Ia 51–63 analog    -35.34        1156
LLDILDTAGLEEYSAMRD      p21 51–66; C out       -35.27        202
QPYPQPQPFPSQQPY         A-gliadin 41-55        -35.26        1120
FPSQQPYLQLQPFPQ         A-gliadin 49-63        -33.93        20
CDGERPTLAFLQDVM         GAD 101–115            -33.57        69
SFPPQQPYPQPQPQY         A-gliadin 77-91        -33.35        370
SQDLELSWNLNGLQADLSS     FceR 104–122           -32.89        123
EPRAPWIEQEGPEYW         MHC Ia 46-63           -32.89        519
PPLYATGRLSQAQLMPSPPM    VP16                   -32.59        538
SQDLELSWNLNGLQAY        FceR 104–122 analog    -32.49        118
IARAKMFPAVAEK           34P3A                  -31.91        541

Binding registers
• 20/23 (87%) binding registers
• Only register (aa 4-12) from Test Set 2 (Der p 2: 1-20)
• (SP = 0.80; SE(LMH) = 0.90)
T-cell proliferation
• Top 5 predictions are experimental positives at very stringent threshold criteria (SP = 0.95; SE(H) = 0.63)

Multiple registers (SP = 0.95, SE(LMH) = 0.81): 58% of Test Set 1
[Figure: number of peptides (y-axis, 0 to 14) vs. number of binding registers (x-axis, 1 to 7) for weak, medium and strong binders]
• Mainly for medium and high binders
• Experimental support: Sinha et al. for DRB1*0402
• Is this why binding motifs are unsuccessful?

Outline: introduction; structural immunoinformatics: database development, data analysis, computational models developed, applications

Pemphigus vulgaris (PV)
• Autoimmune blistering skin disorder
• Characterized by autoantibodies targeting desmoglein-3 (Dsg3)
• Strong association with DR4 and DR6 alleles
[Images: http://www.medscape.com; adam.about.com; www.aafp.org]

Who are the major players in PV?
• DR4 PV implicated alleles (for Semitic): DRB1*0401, DRB1*0402, DRB1*0404, DRB1*0406
• DR6 PV implicated alleles (for Caucasians): DRB1*1401, DRB1*1404, DRB1*1405, DQB1*0503

What is known about DR4?
DR4 PV implicated alleles (DRB1*0401, *0402, *0404, *0406)
• High sequence conservation: 97.9 – 99.0% identity; 98.4 – 99.5% similarity
• High structural conservation: Cα RMSD <0.22 Å for all key binding pockets
• 7 polymorphic residues within binding cleft: Pocket 1 (β86); Pocket 4 (β70, 71, 74); Pocket 6 (β11); Pocket 7 (β71); Pocket 9 (β37)

What is known about DR6?
DR6 PV implicated alleles (DRB1*1401, *1404, *1405, DQB1*0503)
• High sequence conservation: 85.8 – 94.1% identity; 83.2 – 97.3% similarity
• High structural conservation: Cα RMSD <0.22 Å for all key binding pockets
• 14 polymorphic residues within binding clefts: Pocket 1 (β86); Pocket 4 (β13, 70, 71, 74, 78); Pocket 6 (β11); Pocket 7 (β28, 30, 67, 71); Pocket 9 (β9, 37, 57, 60)

Clues…
9 stimulatory Dsg3 peptides tested on PV patients possessing DR4 and DR6 PV implicated alleles:
1. Dsg3 96-112 (DR4, DR6)
2. Dsg3 191-205 (DR4, DR6)
3. Dsg3 206-220 (DR4, DR6)
4. Dsg3 252-266 (DR4, DR6)
5. Dsg3 342-356 (DR4, DR6)
6. Dsg3 380-394 (DR4, DR6)
7. Dsg3 763-777 (DR4, DR6)
8. Dsg3 810-824 (DR4)
9. Dsg3 963-977 (DR4)

Disease associated alleles vs. innocent bystanders
DR4 PV
• 8/9 investigated Dsg3 peptides fit perfectly into DRB1*0402
• Atomic clashes with all other investigated DR4 subtypes
DR6 PV
• 6/9 investigated Dsg3 peptides fit perfectly into DQB1*0503
• Atomic clashes with all other investigated DR6 subtypes
• HLA association in DR6 PV more likely to be at DQ than DR locus
Consistent with experimental work done by Sinha et al. (2002, 2005, 2006)
Tong et al. (2006) Immunome Research, 2: 1

Whither sequence motifs (again!)?
• 1/9 investigated Dsg3 peptides fits existing binding motifs
• Flanking residues: clashes in fitting binding register
• Register-shift for Peptide V (Dsg3 342-356)
  - Detected binding register: Dsg3 346-354
  - Binding motifs: Dsg3 347-355 (Veldman et al., 2003); Dsg3 345-353 (Sinha et al., 2006)

Large-scale screening of Dsg3 peptides
• Docking of 936 15mer Dsg3 peptides generated using a sliding window of size 15 across the entire Dsg3 glycoprotein
• Training set: 8 peptides each, with exp. IC50 values and known binding registers (5 binders and 3 non-binders)
[Figure: Dsg3 peptide (sliding window width 15, N to C terminus) containing the binding register (sliding window width 9) and flanking residues]
Tong et al. (2006) BMC Bioinformatics, 7(Suppl 5): S7

[Figure: Large-scale screening of Dsg3 peptides. Binding energy (y-axis, -40.00 to -15.00) vs. 15-mer start position (x-axis, panels covering positions 50 to 1050, with enlarged panels for positions 450 to 650) for DQB1*0503 and DRB1*0402. Extracellular, transmembrane and intracellular regions and the immunoreactive region are marked.]

Common epitopes possibly responsible for inducing disease in DR4 & DR6 patients
• Significant level of cross reactivity observed between DRB1*0402 and DQB1*0503 (AROC = 0.93)
• 57% of peptides investigated in this study predicted to bind to both alleles with high affinity
• 90% of known Dsg3 peptides predicted to bind to both alleles
• 12/20 top predicted DQB1*0503-specific Dsg3 peptides are from the transmembrane region
• All top predicted DRB1*0402-specific Dsg3 peptides are from extracellular regions
• Disease initiation implications: DR4 from ECD; DR6 from TM

Multiple binding registers revisited
• 76% (410/539) predicted high-affinity binders to DRB1*0402 possess > 2 binding registers
• 57% (384/673) predicted high-affinity binders to DQB1*0503 possess > 2 binding registers
• 66% (354/539) bind both alleles at different registers
• Similar proportion (70%) detected in known binders to both alleles
Both alleles bind similar peptides via different binding registers

[Figure: number of peptides (y-axis, 0 to 350) vs. number of binding registers (x-axis, 0 to 6) for DQB1*0503 and DRB1*0402]

What next?
• We have developed a predictive model for HLA-C (Cw*0401) with very limited (only six) experimental binding values.
• The model yields excellent results for test data (AROC = 0.93).
• Application to determine immunological hot spots for HIV-1 p24gag and gp160gag glycoproteins shows binding energies similar to HLA-A and -B.

Conclusions
• Computational models for immunogenic epitope prediction can be successfully developed, even for alleles with limited experimental data.
• While computations can never completely replace “wet-lab” experiments, in silico predictions can significantly cut down the development time of therapeutic vaccines.

1. Genome analysis
Approaches
• EST analysis
• Annotation pipeline using workflow strategies
Applications
• Parasitic nematodes
• Cancer EST data
Outcomes
• Comprehensive annotation at the gene and protein levels
• Novel &/or pathogen-specific genes
• Immune response evasion strategies

2. Transcriptome analysis
Approaches
• Graph formalism for alternative splicing
• Genome-wide analysis
Applications
• Drosophila genome
• Chicken compared to human and mouse
• Kallikrein variants as markers
Outcomes
• New mRNA-gDNA alignment method, MGAlign & MGAlignIt
• First splicing graph database, DEDB
• Web server for splicing graphs, ASGS
• Sub-graph elements for alternative splicing
• Multi-species splicing graph database, GraphDB

3. Protein/Proteome research: Origin and evolution of structural domains
Approaches
• Intron mapping to domain boundary
• All eukaryotic proteins analyzed
Applications
• Domain prediction in EST/genome data
• Effect of splice variants on domains
Outcomes
• New database of protein coding genes, XPro
• Visualization of intronic locations on protein structural domains, XDomView
• Analysis tool, Go Module Viewer

3. Protein/Proteome research: Small disulfide-rich proteins
<100 aa per domain; ≥ 2 SS bonds
Approaches
• Multiple structure alignment and hierarchical classification
• Comparative modeling rules
• Sequence, structure and evolutionary analysis of Potato II inhibitor family
Applications
• Design of protease inhibitors, channel modulators, growth regulators
Outcomes
• New database, DSFD
• Server for model building, SDPMOD
• Understanding of wound-induced protease inhibitor folding

3. Protein/Proteome research: Protease cleavage site prediction
Approaches
• Detailed structural modeling and docking of signal peptide moiety to signal peptidase I
• SVM for caspases
Applications
• Enhanced production of therapeutic and commercial heterologous proteins
• Apoptosis initiation
Outcomes
• New databases, SPdb, CasBase
• Server for caspase cleavage prediction, CASVM
• Signal peptide cleavage prediction (under development)

4. Systems Biology
Approaches
• Holistic computational, molecular biology and FRET study to locate secretion roadblocks
• EST analysis of host-parasite interactions
Applications
• Trichoderma reesei as fungal bioreactor
• Parasites that lead to liver cancer (food-borne trematode, Opisthorchis viverrini) and bladder cancer (Schistosoma haematobium)
Outcomes
• Improved heterologous protein production using filamentous fungi
• Understanding of how parasites evade host immune activation

6. Genome-Phenome mapping
Approaches
• Mutation data for non-laboratory animals
• Mapping to OMIM
• Mapping to structure
Applications
• OMIA-OMIM mapping to structure
• Correlation between genotype and disease phenotype
Outcomes
• OMIA database, with links to OMIM (courtesy NCBI)
• Mutations linked to severity of disease for α-D-mannosidosis
• Predictions of new human disease mutations from known mutation sites in cow, cat and guinea pig

7. Biodiversity Informatics: Customary medicinal plants
Approaches
• Integrating, visualizing and analyzing ethnobotanical, phytochemical and pharmacological data on customary medicinal plants
• Data from Australian aboriginal elders and Indian Siddha doctors
Applications
• Novel antimicrobial, anti-inflammatory and anti-cancer lead compounds
Outcomes
• CMkb, an integrated knowledgebase

Dedications
• Prof. Bernard Pullman
• Mme. Alberte Pullman
• My brother, a CML survivor

Acknowledgements
• Dr. (Victor) J.C. Tong, NUS & I2R, Singapore
• A/Prof. Tin Wee Tan, NUS
• Dr. Animesh Sinha, Weill Medical College of Cornell University & Michigan State University, USA
• Drs. J. Tom August (JHU) and Vladimir Brusic (DFCI) (NIAID-NIH Grant #5 U19 AI56541 & Contract #HHSN266200400085C)
• All of you!
