Protein Evolution: SARS coronavirus as an example


Page 1: Protein Evolution:  SARS coronavirus as an example

CZ5225 Methods in Computational Biology

Lecture 2-3: Protein Families and Family Prediction Methods

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, NUS
August 2004

Page 2: Protein Evolution:  SARS coronavirus as an example

Protein Evolution: SARS coronavirus as an example

Page 3: Protein Evolution:  SARS coronavirus as an example

SARS Coronavirus

A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)

Page 4: Protein Evolution:  SARS coronavirus as an example

SARS Infection

How the SARS coronavirus enters a cell and reproduces

Page 5: Protein Evolution:  SARS coronavirus as an example

Protein Evolution

Generation of different species

Page 6: Protein Evolution:  SARS coronavirus as an example

Protein Families

• Sequence alignment-based families.
  – Based on the principle of the sequence-structure-function relationship.
  – Derived by multiple sequence alignment.
  – Database: PFAM (Nucleic Acids Res. 30:276-280)

• Structure-based families.
  – Derived by visual inspection and comparison of structures.
  – Database: SCOP (J. Mol. Biol. 247, 536-540)

• Functional families.
  – Databases:
    • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
    • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
    • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
    • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
    • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
    • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
    • Drug side-effect targets: DART (Drug Safety 26: 685-690)

Page 7: Protein Evolution:  SARS coronavirus as an example

Protein Families

Sequence families ≠ Structural families ≠ Functional families

Sequence similar, structure different

Sequence different, structure similar

Sequence similar, function different (distantly related proteins)

Sequence different, function similar

Homework: find examples

Page 8: Protein Evolution:  SARS coronavirus as an example


Protein Family Prediction Methods

Sequence alignment-based families:

• Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190

Structure-based families:

• Visual inspection and comparison of structures

Functional families:

• Statistical learning methods:
  – Neural network: ProtFun (Bioinformatics, 19:635-642)
  – Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)

Page 9: Protein Evolution:  SARS coronavirus as an example

Sequence Comparison as a Mathematical Problem:

Example:

Sequence a: ATTCTTGC

Sequence b: ATCCTATTCTAGC

Best alignment (a single gap, before the start of a):

    a: -----ATTCTTGC
    b: ATCCTATTCTAGC

Bad alignment (two gaps, splitting a into AT, TCTT and GC along b):

    a: AT [gap] TCTT [gap] GC
    b: ATCCTATTCTAGC

Construction of many alignments => which is the best?  

Page 10: Protein Evolution:  SARS coronavirus as an example

How to rate an alignment?

• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-,x) = w(x,-) = -3)

 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
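The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and its keyword parameters are made up here, and a column containing a gap symbol in either row is simply charged the gap penalty once.

```python
def score_alignment(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of two aligned rows ('-' marks a gap symbol)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # w(-, x) = w(x, -) = -3
        elif x == y:
            total += match      # w(x, y) = +8 if x = y
        else:
            total += mismatch   # w(x, y) = -5 if x != y
    return total


# The alignment from the slide: C---TTAACT over CGGATCA--T
print(score_alignment("C---TTAACT", "CGGATCA--T"))  # 12
```

Running it on the slide's alignment reproduces the score of +12.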

Page 11: Protein Evolution:  SARS coronavirus as an example

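Alignment Graph

Sequence a: CTTAACT

Sequence b: CGGATCAT

[Figure: alignment grid with sequence b (C G G A T C A T) across the columns and sequence a (C T T A A C T) down the rows; the marked path corresponds to the alignment below.]

C---TTAACT
CGGATCA--T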

Page 12: Protein Evolution:  SARS coronavirus as an example

An optimal alignment
-- the alignment of maximum score

• Let A = a1a2…am and B = b1b2…bn.

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
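A minimal Python sketch of this recurrence follows, assuming the match/mismatch/gap weights from the scoring slide (+8, -5, -3) and the usual "proper initialization" in which row 0 and column 0 accumulate gap penalties. The names `w` and `global_table` are illustrative, not part of the lecture.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; '-' denotes the gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_table(a, b):
    """Fill S[i][j] = score of an optimal alignment of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])
    # Recurrence: best of the three ways to end the alignment.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
            )
    return S


S = global_table("CTTAACT", "CGGATCAT")
print(S[3][5], S[7][8])  # 5 14
```

The table costs O(mn) time and space; `S[m][n]` is the optimal global score.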

Page 13: Protein Evolution:  SARS coronavirus as an example

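Computing Si,j

[Figure: the dynamic programming matrix. Cell (i, j) is reached from (i-1, j) with w(ai,-), from (i, j-1) with w(-,bj), and from (i-1, j-1) with w(ai,bj); the last cell holds Sm,n.]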

Page 14: Protein Evolution:  SARS coronavirus as an example

Initializations

            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3
T     -6
T     -9
A    -12
A    -15
C    -18
T    -21

Page 15: Protein Evolution:  SARS coronavirus as an example

S3,5 = ?

            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3    8    5    2   -1   -4   -7  -10  -13
T     -6    5    3    0   -3    7    4    1   -2
T     -9    2    0   -2   -5    ?
A    -12
A    -15
C    -18
T    -21

Page 16: Protein Evolution:  SARS coronavirus as an example

S3,5 = max(S2,4 + w(T,T), S2,5 + w(T,-), S3,4 + w(-,T)) = max(-3 + 8, 7 - 3, -5 - 3) = 5

            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3    8    5    2   -1   -4   -7  -10  -13
T     -6    5    3    0   -3    7    4    1   -2
T     -9    2    0   -2   -5    5    2   -1    9
A    -12   -1   -3   -5    6    3    0   10    7
A    -15   -4   -6   -8    3    1   -2    8    5
C    -18   -7   -9  -11    0   -2    9    6    3
T    -21  -10  -12  -14   -3    8    6    4   14

Optimal score: S7,8 = 14

Page 17: Protein Evolution:  SARS coronavirus as an example

C T T A A C - T
C G G A T C A T

            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3    8    5    2   -1   -4   -7  -10  -13
T     -6    5    3    0   -3    7    4    1   -2
T     -9    2    0   -2   -5    5    2   -1    9
A    -12   -1   -3   -5    6    3    0   10    7
A    -15   -4   -6   -8    3    1   -2    8    5
C    -18   -7   -9  -11    0   -2    9    6    3
T    -21  -10  -12  -14   -3    8    6    4   14

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
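The alignment itself is recovered by tracing back from the last cell to the origin, at each step following a predecessor that produced the cell's value. The sketch below is one possible way to do this, assuming the same +8 / -5 / -3 weights; the function name `global_align` and the tie-breaking order (diagonal, then up, then left) are choices made for this example only.

```python
MATCH, MISMATCH, GAP = 8, -5, -3


def pair_score(x, y):
    return MATCH if x == y else MISMATCH


def global_align(a, b):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * GAP
    for j in range(1, n + 1):
        S[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                S[i - 1][j] + GAP,
                S[i][j - 1] + GAP,
            )
    # Traceback from (m, n) to (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + GAP:
            top.append(a[i - 1])   # symbol of a against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")        # gap against a symbol of b
            bottom.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("CTTAACT", "CGGATCAT"))
```

For the two example sequences this returns the score 14 with the rows CTTAAC-T and CGGATCAT.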

Page 18: Protein Evolution:  SARS coronavirus as an example

Global Alignment vs. Local Alignment

• global alignment:

• local alignment:

Page 19: Protein Evolution:  SARS coronavirus as an example

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
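The local recurrence differs from the global one only by the extra 0 option and by zero initialization of the first row and column. A minimal sketch, again assuming the +8 / -5 / -3 weights; `local_align` is a name invented for this example, and the traceback stops as soon as it reaches a cell with score 0.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best local score, aligned segment of a, aligned segment of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                      # start a fresh local alignment here
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
                S[i - 1][j - 1] + sub,
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the best cell until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))
```

For CTTAACT and CGGATCAT the best local score is 18, from the segment pair A-C-T over ATCAT.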

Page 20: Protein Evolution:  SARS coronavirus as an example

local alignment

Match: 8
Mismatch: -5
Gap symbol: -3

            C    G    G    A    T    C    A    T
       0    0    0    0    0    0    0    0    0
C      0    8    5    2    0    0    8    5    2
T      0    5    3    0    0    8    5    3   13
T      0    2    0    0    0    8    5    2   11
A      0    0    0    0    8    5    3    ?
A      0
C      0
T      0

Page 21: Protein Evolution:  SARS coronavirus as an example

local alignment

            C    G    G    A    T    C    A    T
       0    0    0    0    0    0    0    0    0
C      0    8    5    2    0    0    8    5    2
T      0    5    3    0    0    8    5    3   13
T      0    2    0    0    0    8    5    2   11
A      0    0    0    0    8    5    3   13   10
A      0    0    0    0    8    5    2   11    8
C      0    8    5    2    5    3   13   10    7
T      0    5    3    0    2   13   10    8   18

The best score: 18

A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18

Page 22: Protein Evolution:  SARS coronavirus as an example

Multiple sequence alignment (MSA)

• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC

Seq2: AC

Seq3: GATC

GC-TC

A---C

G-ATC

Page 23: Protein Evolution:  SARS coronavirus as an example

How to score an MSA?

• Sum-of-Pairs (SP-score)

Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
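The sum-of-pairs idea can be sketched as follows. The slide does not say how each pair is scored, so this example assumes the pairwise weights used earlier (+8 match, -5 mismatch, -3 per gap symbol) and, as a further assumption, scores a column where both rows have a gap as 0. The names `column_score` and `sp_score` are illustrative.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3):
    if x == "-" and y == "-":
        return 0            # assumption: gap against gap costs nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(rows):
    """Sum of the induced pairwise alignment scores over all pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    return sum(
        column_score(x, y)
        for r1, r2 in combinations(rows, 2)
        for x, y in zip(r1, r2)
    )


# Pairs: (GC-TC, A---C) = -3, (GC-TC, G-ATC) = 18, (A---C, G-ATC) = -3
print(sp_score(["GC-TC", "A---C", "G-ATC"]))  # 12
```

Under these assumptions the example MSA scores -3 + 18 - 3 = 12.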

Page 24: Protein Evolution:  SARS coronavirus as an example

Functional Classification by SVM

• A protein is classified as either belonging (+) or not belonging (-) to a functional family.

• By screening against all families, the function of the protein can be identified (example: SVMProt).

• What is SVM? Support vector machines are a statistical machine learning method that learns from examples and classifies objects into one of two classes.

• Advantages of SVM: tolerates diversity among class members; uses sequence-derived physico-chemical features as the basis for classification; suitable for functional family classification.
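For orientation, the two-class decision described above is usually written as follows. This is the standard textbook hard-margin linear formulation, not something stated on these slides; here x_k are the training feature vectors, y_k ∈ {+1, -1} their class labels, and w and b the learned weight vector and offset.

```latex
f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x} + b),
\qquad
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_k\,(\mathbf{w}\cdot\mathbf{x}_k + b) \ge 1 \;\; \text{for all } k .
```

A protein with feature vector x would then be assigned to the family (+) when f(x) = +1 and rejected (-) otherwise.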

Page 25: Protein Evolution:  SARS coronavirus as an example

SVM References

• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).

• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).

• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).

• Online lecture notes

Page 26: Protein Evolution:  SARS coronavirus as an example

Introduction to Machine Learning

Goal:

To “improve” (gaining knowledge, enhancing computing capability)

Tasks:

• Forming concepts by data generalization.
• Compiling knowledge into compact form.
• Finding useful explanations for valid concepts.
• Clustering data into classes.

Reference:

Machine Learning in Molecular Biology Sequence Analysis .

Internet links:

http://www.ai.univie.ac.at/oefai/ml/ml-resources.html

Page 27: Protein Evolution:  SARS coronavirus as an example

Introduction to Machine Learning

Categories:

• Inductive learning.
  – Forming concepts from data without much domain knowledge (learning from examples).

• Analytic learning.
  – Using existing knowledge to derive new useful concepts (explanation-based learning).

• Connectionist learning.
  – Using artificial neural networks to search for or represent concepts.

• Genetic algorithms.
  – Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.

Page 28: Protein Evolution:  SARS coronavirus as an example

Machine Learning Methods

Inductive learning:

Concept learning and example-based learning

Concept learning:

Page 29: Protein Evolution:  SARS coronavirus as an example

Machine Learning Methods

Analytic learning:

Page 30: Protein Evolution:  SARS coronavirus as an example

Machine Learning Methods

Neural network:

Page 31: Protein Evolution:  SARS coronavirus as an example

Machine Learning Methods

Genetic algorithms:

[Figure labels: Strength, Pattern, Classification]

Page 32: Protein Evolution:  SARS coronavirus as an example


Page 33: Protein Evolution:  SARS coronavirus as an example

SVM

Page 34: Protein Evolution:  SARS coronavirus as an example

SVM

Page 35: Protein Evolution:  SARS coronavirus as an example

SVM

Page 36: Protein Evolution:  SARS coronavirus as an example

SVM

Page 37: Protein Evolution:  SARS coronavirus as an example

SVM

Page 38: Protein Evolution:  SARS coronavirus as an example

SVM

Page 39: Protein Evolution:  SARS coronavirus as an example

SVM

Page 40: Protein Evolution:  SARS coronavirus as an example

SVM

Page 41: Protein Evolution:  SARS coronavirus as an example

SVM

Page 42: Protein Evolution:  SARS coronavirus as an example

SVM

Page 43: Protein Evolution:  SARS coronavirus as an example

SVM

Page 44: Protein Evolution:  SARS coronavirus as an example

SVM for Classification of Proteins

How to represent a protein?

• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
  – amino acid composition
  – hydrophobicity
  – normalized Van der Waals volume
  – polarity
  – polarizability
  – charge
  – surface tension
  – secondary structure
  – solvent accessibility

• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.

Nucleic Acids Res., 31: 3692-3697

Page 45: Protein Evolution:  SARS coronavirus as an example

SVM for Classification of Proteins

Descriptors for amino acid composition of protein:

C=(53.33, 46.67)

T=(51.72)

D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)

Nucleic Acids Res., 31: 3692-3697
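To make the three descriptors concrete, here is a small sketch that computes C, T and D for a sequence already encoded into two classes. The 30-letter string `EXAMPLE` (classes "A" and "E") is a hypothetical input chosen because it reproduces the numbers shown above; the slide itself does not give the sequence. The quantile rule for D (first occurrence, then the occurrences at 25%, 50%, 75% and 100% of the class count, rounded down) is likewise an assumption of this sketch, as are all function and variable names.

```python
EXAMPLE = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"  # hypothetical two-class encoding


def ctd_descriptors(encoded):
    """Composition, transition and distribution (in percent) of a two-class string."""
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two residues")
    classes = sorted(set(encoded))
    # C: percentage of residues in each class.
    comp = tuple(round(100 * encoded.count(c) / n, 2) for c in classes)
    # T: percentage of adjacent pairs whose classes differ.
    changes = sum(1 for x, y in zip(encoded, encoded[1:]) if x != y)
    trans = (round(100 * changes / (n - 1), 2),)
    # D: position (as % of length) of the first, 25%, 50%, 75% and 100%
    # occurrence of each class.
    dist = []
    for c in classes:
        positions = [k for k, x in enumerate(encoded, start=1) if x == c]
        count = len(positions)
        ranks = [1] + [max(1, count * q // 4) for q in (1, 2, 3, 4)]
        dist.extend(round(100 * positions[r - 1] / n, 2) for r in ranks)
    return comp, trans, tuple(dist)


C, T, D = ctd_descriptors(EXAMPLE)
print(C)  # (53.33, 46.67)
print(T)  # (51.72,)
print(D)
```

With a single transition value this sketch only covers the two-class case; a property encoded into three classes would need one transition frequency per pair of classes.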

Page 46: Protein Evolution:  SARS coronavirus as an example

CZ5225 Methods in Computational Biology

Assignment 1

• Project 1: Protein family classification by SVM
  – Construction of training and testing datasets
  – Generating feature vectors
  – SVM classification and analysis
  – Write a report and include a softcopy of your datasets

• Project 2: Develop a program for pairwise sequence alignment using a simple scoring scheme.
  – Write the code in any programming language
  – Test it on a few examples (such as estrogen receptor and progesterone receptor)
  – Can you extend your program to multiple alignment?
  – Write a report and include a softcopy of your program