Mining frequent patterns in protein structures: A study of protease families

Dr. Charles Yan

CS6890 (Section 001) ST: Bioinformatics

The Machine Learning Approach

Presented By: Bhavendra Matta

Presentation Structure

Problem

Introduction

Method Proposed

Results

Findings

About Authors

Questions

Problem

Mining frequent patterns in protein structure:

Analysis of protein sequence and structure databases usually reveals frequent patterns (FP) associated with biological function. Data mining techniques generally consider the physicochemical and structural properties of amino acids and their microenvironment in the folded structures.

Important Terminology

Frequent Patterns in Protein Structures:

The primary structure of a protein is the sequence of amino acids in its polypeptide chain. FP here refers to the frequent patterns found for each type of amino acid.

Conserved Residue:

Conserved residues are used to determine structural relationships between the sequences of a multiple sequence alignment.

VHAVOYJBIO

BHAVJOYBIO

OYJVHAVBIO

Here BIO is the conserved part of the alignment.

Protease:

Protease refers to a group of enzymes whose catalytic function is to break down the peptide bonds of proteins.

Important Terminology (continued)

Catalytic triad:

It refers to three amino acid residues found inside the active site of certain proteases. In chymotrypsin, for example, these are Asp 102, His 57, and Ser 195.

Unsupervised Learning:

It is a method of machine learning in which a model is fit to observations without labeled outputs. The unsupervised learning used here is of the clustering type.

Microenvironment refers to the local structure assumed by residues close in space, but not necessarily contiguous along the sequence.

There are strong correlations between function and microenvironment.

Introduction

The paper presents a novel unsupervised learning approach to discover frequent patterns in protein families.

FP calculations are based on three types of features (with no prior knowledge of functional motifs):

1. Biochemical features

2. Geometric features

3. Dynamic features

The FPs identified for each amino acid type belong to three protease subfamilies:

Chymotrypsin and subtilisin subfamilies of serine proteases

Papain subfamily of cysteine proteases

The catalytic triad residues are distinguished by their strong spatial coupling (high interconnectivity) to other conserved residues.

Introduction (continued)

Protein function is associated with a particular sequence or structure motif.

A few catalytic residue databases are:

PDB (Protein Data Bank)

PROCAT: geometric hashing function

WEBFEATURE: Bayesian network

PINTS

TRILOGY

Method

Training Dataset

Feature Extraction

FP Discovery

Conserved Residue Identification

Rank of Conserved Residues

Dataset

A set of proteins belonging to a given family is selected as the training dataset. Features are extracted from all the amino acids in this dataset.

Two classes of enzymes, serine proteases and cysteine proteases, are analyzed here.

These proteases typically have a catalytic triad at the active site.

These enzymes are classified into evolutionary subfamilies:

S1 (chymotrypsin) and S8 (subtilisin) of serine proteases

C1 (papain) of cysteine proteases

Feature Extraction

Each amino acid is characterized in terms of the following features of the residues in its microenvironment:

Dynamic features

Biochemical features

Geometric features

Dynamic features

The Gaussian network model (GNM), an elastic network model for describing the equilibrium dynamics of proteins, is used for characterizing the dynamic features.

In the GNM, the α-carbons (Cα) form the network nodes, and nodes located within an interaction cutoff distance of 7.0 Å are connected via uniform elastic springs.

Another structural property that has a strong impact on equilibrium dynamics is the coordination number (CN), defined as the number of amino acids (or α-carbons) that coordinate the central amino acid within a first interaction shell of 7.0 Å.
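The network construction and the CN described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy coordinates, and the use of the standard GNM connectivity (Kirchhoff) matrix as the data structure are assumptions made for the example; only the 7.0 Å cutoff comes from the slides.

```python
import math

CUTOFF = 7.0  # interaction cutoff in angstroms, taken from the text above


def kirchhoff_matrix(ca_coords, cutoff=CUTOFF):
    """Connectivity (Kirchhoff) matrix of the elastic network.

    Off-diagonal entry (i, j) is -1 when nodes i and j are within the cutoff,
    and each diagonal entry counts the springs attached to that node.
    """
    n = len(ca_coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


def coordination_numbers(gamma):
    """CN of each residue: the number of alpha-carbons within the cutoff."""
    return [gamma[i][i] for i in range(len(gamma))]


# Hypothetical toy chain: four alpha-carbons spaced 3.8 angstroms apart on a line.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_gamma = kirchhoff_matrix(toy_coords)
toy_cn = coordination_numbers(toy_gamma)
print(toy_cn)
```

In this toy chain only sequence neighbors fall within the cutoff, so the end residues have a CN of 1 and the inner residues a CN of 2. The GNM mode shapes mentioned later would come from an eigenvalue decomposition of this matrix, which is left out here to keep the sketch dependency-free.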

Biochemical features

These describe the amino acid type and its properties. The classification is based on both the specific amino acid identity and the chemical features or functional groups.

Chain mining multiple level association rules.

Geometric features

A 3D reference frame is defined for each residue, using the three backbone atoms N, Cα and C (carbonyl C).

This frame uniquely defines the position and orientation of the residue in 3D space.
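One way to build such a residue frame is sketched below. The slides do not state which axis convention the paper uses, so the choice here (origin at Cα, first axis along Cα to C, third axis normal to the N-Cα-C plane) is an assumption, as are all function names and the toy coordinates.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def residue_frame(n, ca, c):
    """Right-handed orthonormal frame with its origin at the alpha-carbon."""
    e1 = unit(sub(c, ca))               # along the CA -> C bond
    e3 = unit(cross(e1, sub(n, ca)))    # normal to the N-CA-C plane
    e2 = cross(e3, e1)                  # completes the right-handed frame
    return ca, (e1, e2, e3)


def to_local(point, frame):
    """Coordinates of a neighboring atom expressed in the residue frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, e) for e in axes)


# Hypothetical backbone atoms (angstroms), chosen so the frame is easy to check.
frame = residue_frame(n=(-0.5, 1.4, 0.0), ca=(0.0, 0.0, 0.0), c=(1.5, 0.0, 0.0))
neighbor_local = to_local((1.0, 2.0, 3.0), frame)
print(neighbor_local)
```

Expressing each neighbor in the local frame of the central residue makes the direction to that neighbor independent of how the whole protein is placed in space, which is what allows patterns from different structures to be compared.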


FP Discovery

The Apriori algorithm is used.

Algorithm

1. Calculate the occurrence and support of each feature to build the FPs.

2. Discard FPs with support smaller than the predefined minimum support.

3. Join the FPs to generate augmented FPs: if the length of an FP is x, the augmented FP has length x+1.

The choice of minimum support depends on how frequent a pattern must be to be considered.
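The three steps above correspond to the standard Apriori loop, which can be sketched as follows. This is a generic textbook version, not the authors' implementation; the feature labels and the toy residues are hypothetical, and each residue is treated as one "transaction" made of the features of its microenvironment.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / total

    # Step 1: start from single features.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    length = 1
    while current:
        # Step 2: discard candidates below the minimum support.
        kept = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Step 3: join patterns of length x into candidates of length x + 1,
        # keeping only those whose every length-x subset is itself frequent.
        length += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == length}
        current = {
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, length - 1))
        }
    return frequent


# Hypothetical feature sets for four residues of the same amino acid type.
residues = [
    {"polar", "high_CN", "rigid"},
    {"polar", "high_CN", "rigid"},
    {"polar", "high_CN"},
    {"polar", "flexible"},
]
patterns = apriori(residues, min_support=0.5)
longest = max(patterns, key=len)
print(sorted(longest), patterns[longest])
```

With a minimum support of 0.5 the feature "flexible" is dropped in the first pass, and the longest surviving pattern contains three features. Raising the minimum support shortens the patterns that survive; lowering it admits rarer ones.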


Identification of Conserved Residues

Applying the Apriori algorithm to the proteins reveals the FPs with maximum length.

An FP that occurs at least once in each protein of the examined subfamily is considered a conserved FP.

Next, the conserved residues are removed from the original dataset, and the Apriori algorithm is applied again to the modified dataset.

All the conserved patterns of 20 types of amino acids were identified by this iterative search for each family.
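The iterative search can be illustrated with the sketch below. It makes two assumptions that should be read as such: "conserved" is taken to mean that a pattern occurs in at least one residue of every protein in the subfamily, and a brute-force search for the longest shared pattern stands in for the Apriori step to keep the example short. The protein names, residue labels and features are hypothetical.

```python
from itertools import combinations


def longest_shared_pattern(proteins):
    """Longest feature set present in at least one residue of every protein."""
    first = next(iter(proteins.values()))
    max_len = max((len(feats) for feats in first.values()), default=0)
    for size in range(max_len, 0, -1):
        candidates = {
            frozenset(combo)
            for feats in first.values()
            for combo in combinations(sorted(feats), size)
        }
        for pattern in sorted(candidates, key=sorted):
            if all(any(pattern <= feats for feats in residues.values())
                   for residues in proteins.values()):
                return pattern
    return None


def iterative_conserved_search(proteins):
    """Find a conserved pattern, remove its residues, and search again."""
    remaining = {name: dict(residues) for name, residues in proteins.items()}
    conserved = []
    while True:
        pattern = longest_shared_pattern(remaining)
        if pattern is None:
            break
        hits = {
            name: [rid for rid, feats in residues.items() if pattern <= feats]
            for name, residues in remaining.items()
        }
        conserved.append((pattern, hits))
        # Remove the conserved residues before the next pass.
        for name, residue_ids in hits.items():
            for rid in residue_ids:
                del remaining[name][rid]
    return conserved


# Hypothetical serine residues from two proteins of one subfamily.
toy_family = {
    "protein_A": {"Ser_a1": {"polar", "buried", "rigid"},
                  "Ser_a2": {"polar", "exposed"}},
    "protein_B": {"Ser_b1": {"polar", "buried", "rigid"},
                  "Ser_b2": {"polar", "flexible"}},
}
conserved_patterns = iterative_conserved_search(toy_family)
for pattern, hits in conserved_patterns:
    print(sorted(pattern), hits)
```

The first pass finds the three-feature pattern shared by one serine in each protein; once those residues are removed, the second pass finds the shorter pattern shared by the remaining serines, and the loop stops when no shared pattern is left.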

Rank of Conserved Residues

Once the conserved residues are identified by the Apriori algorithm, a ranking method is needed to distinguish the catalytic residues.

It is assumed that the catalytic residues are optimally coupled with other conserved residues to achieve the highest cooperativity.

The amino acids that show the lowest interconnectivity (smallest number of connected neighbors) are removed from the list of considered residues.

The ‘core’ residues are assigned the score zero, and the others are scored according to the number of iterations required to reach the ‘core’ residues.
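A minimal sketch of this ranking is given below. The slides do not spell out the stopping rule or the exact scoring, so both are assumptions here: peeling stops when all remaining residues have the same number of connected neighbors (these form the "core"), and a residue removed in a given iteration is scored by how many iterations remained before the core was reached. The residue labels and contacts are hypothetical.

```python
def rank_conserved_residues(adjacency):
    """Score residues by iteratively removing the least connected ones.

    adjacency maps each conserved residue to the set of conserved residues
    it is connected to. Core residues get score 0.
    """
    remaining = set(adjacency)
    rounds = []  # residues removed in each iteration, in order
    while remaining:
        degree = {r: len(adjacency[r] & remaining) for r in remaining}
        lowest = min(degree.values())
        if lowest == max(degree.values()):
            break  # all remaining residues are equally connected: the core
        peeled = {r for r, d in degree.items() if d == lowest}
        rounds.append(peeled)
        remaining -= peeled
    scores = {r: 0 for r in remaining}
    for index, peeled in enumerate(rounds):
        for r in peeled:
            scores[r] = len(rounds) - index
    return scores


# Hypothetical contact graph: R1-R3 form a triangle, R4 hangs off R1, R5 off R4.
toy_contacts = {
    "R1": {"R2", "R3", "R4"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2"},
    "R4": {"R1", "R5"},
    "R5": {"R4"},
}
toy_scores = rank_conserved_residues(toy_contacts)
print(sorted(toy_scores.items()))
```

In the toy graph R5 is removed first and R4 second, leaving the mutually connected residues R1 to R3 as the core with score 0. Under the assumption stated on the slide, catalytic residues would be expected among the lowest-scoring entries.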

Results

Consider the serine residues in the serine protease family. Information for a set of 111 serine residues is extracted from the 5 proteins in S1, and for a set of 250 serine residues from the 7 proteins in S8.

This is consistent with the fact that the conservation of the microenvironment and global dynamics is a more restrictive (and discriminative) feature than sequence conservation.

Another observation is that amino acids that sequentially neighbor the catalytic residues tend to be conserved.

The present unsupervised learning algorithm identified 22, 22 and 26 conserved residues in the S1, S8 and C1 subfamilies, respectively.


Conclusion

The paper presents a novel unsupervised learning approach to discover biologically meaningful FPs in protein structures.

The approach incorporates features associated with collective dynamics (GNM slow mode shapes) as well as the biochemical (amino acid types and physicochemical properties) and geometric (3D coordination directions) features in the microenvironment.

This approach can be used to discover and annotate all frequent patterns in the protein structure database.

It can help to predict structure and function of uncharacterized proteins, and identify the important amino acids or structural regions.

About Authors

Ivet Bahar

She is currently Chair and Professor of the Department of Computational Biology, University of Pittsburgh, Pittsburgh. She has more than 21 years of research experience.

Current Research Areas:

Characterization of Protein Structural Classes

Characterization of Anti-Cancer Agents

Conformational Dynamics of Proteins

Protein Folding Kinetics

About Authors (continued)

Shann-Ching Chen

Carnegie Mellon University, Pittsburgh

Main focus on Machine Learning.

Current Project Areas:

Retrieval of 3D Protein and Nucleic Acid Structures

Multimodal Biometrics

Questions???

Thank You
