Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

JAVED KHAN ET AL.

NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001

Aronashvili Reuven

Page 1: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

JAVED KHAN ET AL.

NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001

Aronashvili Reuven

Page 2:

Overview

The purpose of this study was to develop a method of classifying cancers into specific diagnostic categories, based on their gene-expression signatures, using artificial neural networks (ANNs).

We trained the ANNs using the small, round blue-cell tumors (SRBCTs) as a model.

Page 3:

The SRBCTs

The small, round blue-cell tumors (SRBCTs) of childhood fall into four distinct categories:

1. Neuroblastoma (NB)
2. Rhabdomyosarcoma (RMS)
3. Non-Hodgkin lymphoma (NHL), specifically Burkitt lymphoma (BL)
4. Ewing family of tumors (EWS)

They often present diagnostic dilemmas in clinical practice because they are similar in appearance on routine histology.

However, accurate diagnosis is essential, as treatment options and response to therapy vary between the categories.

No single test can precisely distinguish SRBCTs; diagnosis relies on a combination of immunohistochemistry, cytogenetics, interphase fluorescence in situ hybridization and reverse transcription PCR.

Page 4:

Introduction

Page 5:

Artificial Neural Networks (ANNs)

Page 6:

Artificial Neural Networks (ANNs) – put to the task

Modeled on the structure and behavior of neurons in the human brain.

Can be trained to recognize and categorize complex patterns.

Pattern recognition is achieved by adjusting the ANN through a process of error minimization, learning from experience.

ANNs were applied to decipher the gene-expression signatures of SRBCTs and were then used for diagnostic classification.

Page 7:

Principal Component Analysis (PCA)

Page 8:

Principal Component Analysis – Neuronal Goal

Transform n-dimensional vectors into m-dimensional vectors, with m < n.

Example: transform from 2 dimensions to 1.

We look for axes which minimize the projection errors and maximize the variance after projection.

Page 9:

Algorithm (cont’d)

Preserve as much of the variance as possible.

[Figure: the axes are rotated, then the data are projected; one direction keeps more information (variance), the other less.]

Page 10:

Terminology (Covariance)

How two dimensions vary from the mean with respect to each other:

cov(X, Y) = (1/(n−1)) · Σ_{i=1..n} (X_i − mean(X)) (Y_i − mean(Y))

cov(X, Y) > 0: the dimensions increase together
cov(X, Y) < 0: one increases while the other decreases
cov(X, Y) = 0: the dimensions are uncorrelated (no linear relationship)
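The covariance estimate above can be checked with a short NumPy sketch (toy data; the variable names are illustrative):

```python
import numpy as np

# Two toy "dimensions", e.g. the expression of two genes across n samples
X = np.array([2.0, 4.0, 6.0, 8.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])
n = len(X)

# cov(X, Y) = (1/(n-1)) * sum_i (X_i - mean(X)) * (Y_i - mean(Y))
cov_xy = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)

# NumPy's estimator uses the same (n - 1) normalization
assert np.isclose(cov_xy, np.cov(X, Y)[0, 1])
```

Here `cov_xy` is positive, matching the first case above: the two dimensions increase together.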

Page 11:

Terminology (Covariance Matrix)

Contains the covariance values between all possible pairs of dimensions:

C = (c_ij), an n×n matrix with c_ij = cov(Dim_i, Dim_j)

Example for three dimensions (x, y, z) (always symmetric):

        [ cov(x,x)  cov(x,y)  cov(x,z) ]
    C = [ cov(y,x)  cov(y,y)  cov(y,z) ]
        [ cov(z,x)  cov(z,y)  cov(z,z) ]

Note that cov(x,x) is the variance of component x, i.e. var(x).

Page 12:

Properties of the covariance matrix

Can be used for creating a distance measure that is not sensitive to linear transformations.

Can be used to find the directions which maximize the variance.

Determines a Gaussian distribution uniquely (up to a shift).

Page 13:

Principal components

How do we avoid correlated features?

Correlations ⇒ the covariance matrix is non-diagonal!

Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features. Z are the eigenvectors of C_X.

C is a symmetric, positive definite matrix (X^T C X > 0 for ||X|| > 0):
its eigenvectors are orthonormal, (Z^(i))^T Z^(j) = δ_ij;
its eigenvalues are all non-negative.

Z, the matrix of orthonormal eigenvectors (orthonormal because C is real and symmetric), transforms X into Y with a diagonal covariance matrix C_Y, i.e. decorrelated:

Y = Z^T X;   C_X Z = Z Λ;   C_Y = Z^T C_X Z = Λ

In matrix form, X and Y are d×n; Z, C_X and C_Y are d×d.
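The decorrelation step can be demonstrated numerically; this sketch (synthetic correlated data standing in for the expression matrix) verifies that C_Y = Z^T C_X Z is diagonal and carries the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                         # samples
base = rng.normal(size=n)
# d x n data matrix with two deliberately correlated features
X = np.vstack([base + 0.1 * rng.normal(size=n),
               2.0 * base + 0.1 * rng.normal(size=n),
               rng.normal(size=n)])
X = X - X.mean(axis=1, keepdims=True)           # center each feature

C_X = X @ X.T / (n - 1)                         # covariance matrix (d x d)
eigvals, Z = np.linalg.eigh(C_X)                # C_X Z = Z Lambda, Z orthonormal

Y = Z.T @ X                                     # transformed (decorrelated) vectors
C_Y = Y @ Y.T / (n - 1)                         # = Z^T C_X Z = Lambda

assert np.allclose(C_Y, np.diag(eigvals))       # diagonal, entries = eigenvalues
assert np.all(eigvals >= -1e-12)                # non-negative, as stated above
```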

Page 14:

Matrix form

The eigenproblem for the C matrix, written out in matrix form (C_X Z = Z Λ):

[ C_11 C_12 … C_1d ] [ Z_11 Z_12 … Z_1d ]   [ Z_11 Z_12 … Z_1d ] [ λ_1  0  …  0  ]
[ C_21 C_22 … C_2d ] [ Z_21 Z_22 … Z_2d ] = [ Z_21 Z_22 … Z_2d ] [  0  λ_2 …  0  ]
[  ⋮            ⋮  ] [  ⋮            ⋮  ]   [  ⋮            ⋮  ] [  ⋮         ⋮  ]
[ C_d1 C_d2 … C_dd ] [ Z_d1 Z_d2 … Z_dd ]   [ Z_d1 Z_d2 … Z_dd ] [  0   0  … λ_d ]

Each column Z^(i) of Z is an eigenvector of C with eigenvalue λ_i.

Page 15:

Principal components

PCA is an old idea: C. Pearson (1901), H. Hotelling (1933).

A small λ_i means small variance: the data change little in direction Y_i.

PCA minimizes the C-matrix reconstruction error: the Z_i vectors for large λ_i are sufficient to get

C_X ≈ Z Λ Z^T,

because the vectors for small eigenvalues make a very small contribution to the covariance matrix.

Result: the principal components are linear combinations of all features, providing new uncorrelated features whose diagonal covariance matrix contains the eigenvalues:

Y = Z^T X;   C_Y = Z^T C_X Z = Λ

Y holds the vectors X transformed using the eigenvectors of C_X. The covariance matrix of the transformed vectors is diagonal ⇒ ellipsoidal distribution of the data.

Page 16:

Two components for visualization

New coordinate system: axes ordered according to variance, i.e. the size of the eigenvalue.

The first k dimensions account for the fraction

V_k = (Σ_{i=1..k} λ_i) / (Σ_{i=1..d} λ_i)

of all the variance (note that the λ_i are variances); frequently 80–90% is sufficient for a rough description.

Diagonalization methods: see Numerical Recipes, www.nr.com
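The variance fraction V_k can be computed directly from the eigenvalue spectrum; a small sketch with a hypothetical spectrum:

```python
import numpy as np

# Hypothetical eigenvalue spectrum (the lambda_i are variances), largest first
eigvals = np.array([5.0, 2.5, 1.0, 0.8, 0.4, 0.2, 0.1])

# V_k = (sum of the first k eigenvalues) / (sum of all eigenvalues)
V = np.cumsum(eigvals) / np.sum(eigvals)

# smallest k whose leading components reach 80% of the total variance
k = int(np.searchsorted(V, 0.80) + 1)

assert k == 3                 # 5.0 + 2.5 + 1.0 = 8.5 of 10.0 total, i.e. 85%
```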

Page 17:

PCA properties

PCA may be achieved by: a transformation making the covariance matrix diagonal; projecting the data onto a line for which the sum of squared distances from the original points to their projections is minimal; or an orthogonal transformation to new variables that have stationary variances.

True covariance matrices are usually not known and are estimated from data.

This works well on single-cluster data; a more complex structure may require local PCA, performed separately for each cluster.

PCA is useful for: finding new, more informative, uncorrelated features; reducing dimensionality by rejecting low-variance features; and reconstructing covariance matrices from low-dimensional data.

Page 18:

PCA disadvantages

Useful for dimensionality reduction, but:

The largest variance determines which components are used, and this does not guarantee an interesting viewpoint for clustering the data.

The meaning of the features is lost when linear combinations are formed. Analysis of the coefficients in Z_1 and other important eigenvectors may show which original features are given the most weight.

PCA may also be done efficiently by performing a singular value decomposition of the standardized data matrix.

PCA is also called the Karhunen-Loève transformation.

Many variants of PCA are described in A. Webb, Statistical Pattern Recognition, J. Wiley, 2002.
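The SVD route mentioned above is equivalent to the eigendecomposition: the singular values s_i of the centered data matrix give the eigenvalues via λ_i = s_i² / (n − 1). A quick numerical check (random data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(4, n))                     # d x n data matrix
X = X - X.mean(axis=1, keepdims=True)           # center each row

# Route 1: eigenvalues of the covariance matrix, sorted descending
C = X @ X.T / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: singular values of the data matrix itself (returned descending)
s = np.linalg.svd(X, compute_uv=False)

assert np.allclose(eigvals, s**2 / (n - 1))     # the two routes agree
```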

Page 19:

Summed Square Error

Page 20:

Error Minimization – Summed Square Error

E = (1/N_s) · Σ_{i=1..N_s} (o_i − t_i)²

where:
N_s is the number of samples (63);
o_i is the average output of the validation models for sample i;
t_i is the correct output (1 = this cancer, 0 = not this cancer).
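The summed square error translates directly into NumPy (toy values for the committee outputs o_i and targets t_i):

```python
import numpy as np

o = np.array([0.9, 0.1, 0.9, 0.1])   # average validation outputs per sample
t = np.array([1.0, 0.0, 1.0, 0.0])   # correct outputs (1 = this cancer, 0 = not)
N_s = len(o)                         # number of samples

# E = (1/N_s) * sum_i (o_i - t_i)^2
E = np.sum((o - t) ** 2) / N_s

assert np.isclose(E, 0.01)           # every output is off by 0.1, so E = 0.1^2
```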

Page 21:

The Algorithm

Page 22:

Calibration and validation of the ANN models

The gene-expression values are taken from cDNA microarrays containing 6567 genes.

63 training samples, comprising 13 EWS and 10 RMS from tumor biopsies, plus 10 EWS, 10 RMS, 12 NB and 8 BL from cell lines.

25 test samples, comprising 5 EWS, 5 RMS and 4 NB from tumors, plus 1 EWS, 2 NB and 3 BL from cell lines, and 5 non-SRBCT samples (to test the ability to reject a diagnosis).

Page 23:

Schematic Illustration of the Analysis Process

1. Quality filtering.
2. PCA – reduced to 10 components.
3. The 25 test samples are set aside and the 63 training samples are randomly partitioned into 3 groups.
4. One group is reserved for validation and the other two are used for calibration.
5. For each model, the calibration was optimized with 100 iterative cycles (epochs).
6. This was repeated using each of the three groups for validation.
7. The samples were again randomly partitioned and the entire training process was repeated. For each selection of a validation group one model was calibrated, resulting in a total of 3750 trained models.
8. Once the models were calibrated, they were used to rank the genes according to their importance for classification.
9. The entire process was repeated using only the top-ranked genes.

Page 24:

Stage 1 – Filtering the Genes

Starting with 6567 genes, filtering for a minimal level of expression reduced the set to 2308 genes.

We filtered the genes by requiring that a gene have a red intensity greater than 20 across all experiments.

Page 25:

Cont.

The 10 dominant PCA components per sample were used as inputs (they contained 63% of the variance in the data matrix).

The remaining PCA components contained variance unrelated to separating the four cancers.

So we have 88×10 inputs (88 samples, each with its 10 PCA components).

Four outputs – (EWS, RMS, NB, BL).

Page 26:

3-Fold Cross-Validation Procedure

The 63 training (labeled) samples were randomly shuffled and split into 3 equally sized groups.

Each linear ANN model was then calibrated with the 10 PCA input variables using 2 of the groups, with the third group reserved for testing predictions (validation).

This procedure was repeated 3 times, each time with a different group used for validation.
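The shuffle-and-split bookkeeping of this procedure can be sketched as follows (the ANN calibration itself is elided; fold sizes follow the 63-sample training set):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_folds = 63, 3

idx = rng.permutation(n_samples)          # random shuffling of the 63 samples
folds = np.array_split(idx, n_folds)      # 3 equally sized groups of 21

models_per_shuffle = 0
for v in range(n_folds):
    valid = folds[v]                      # third group: validation
    calib = np.concatenate([folds[j] for j in range(n_folds) if j != v])
    # ... one linear ANN model would be calibrated on `calib` here ...
    assert len(valid) == 21 and len(calib) == 42
    models_per_shuffle += 1

# repeating the shuffling 1250 times yields the paper's 3750 calibrated models
assert models_per_shuffle * 1250 == 3750
```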

Page 27:

The Learning Procedure

A pair of lines, purple (training) and gray (validation), represents one model.

There is no sign of “over-training” of the models, as would be shown by a rise in the summed square error for the validation set with increasing iterations (epochs).

The results shown are for 200 different models.

Page 28:

Cont. Validation

The random shuffling was redone 1250 times (with 3 ANN models for each shuffling). Thus, in total, each sample belonged to a validation set 1250 times, and 3750 ANN models were calibrated.

Each ANN model gives a number between 0 (not this cancer type) and 1 (this cancer type) as an output for each cancer type.

The average of all model outputs for every validation sample is then computed (denoted the average committee vote).

Each sample is classified as belonging to the cancer type corresponding to the largest committee vote.
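The committee vote is a simple average followed by an argmax; a sketch with hypothetical outputs for a single validation sample:

```python
import numpy as np

classes = ["EWS", "RMS", "NB", "BL"]

# Hypothetical outputs of three committee members for ONE sample:
# each row holds one model's four outputs, each between 0 and 1
votes = np.array([[0.85, 0.10, 0.05, 0.02],
                  [0.90, 0.05, 0.10, 0.05],
                  [0.70, 0.20, 0.05, 0.10]])

committee_vote = votes.mean(axis=0)              # average committee vote
prediction = classes[int(np.argmax(committee_vote))]

assert prediction == "EWS"                       # largest committee vote wins
```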

Page 29:

The 63 Training Samples

Using these ANN models, all 63 training samples were correctly classified to their respective categories.

Page 30:

Optimization of the Genes Used for Classification

The contribution of each gene to the classification by the ANN models was then assessed.

Feature extraction was performed in a model-dependent way, owing to the relatively small number of samples.

This was achieved by monitoring the sensitivity of the classification to a change in the expression level of each gene, using the 3750 previously calibrated models.

Page 31:

Rank using Sensitivity

The sensitivity (S) of the outputs (o) with respect to any of the 2308 input variables (x_k) is defined as:

S_k = (1/N_s)(1/N_o) · Σ_{s=1..N_s} Σ_{o=1..N_o} |∂o/∂x_k|

where N_s is the number of samples (63) and N_o is the number of outputs (4). The procedure for computing S_k involves a committee of 3750 models.
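The sensitivity can be estimated by finite differences; in this sketch a random linear map stands in for a calibrated model (for a linear model, S_k reduces exactly to the mean absolute weight on input k, which the final check exploits):

```python
import numpy as np

rng = np.random.default_rng(3)
n_inputs, n_outputs, n_samples = 10, 4, 63

W = rng.normal(size=(n_outputs, n_inputs))   # stand-in for a calibrated model
model = lambda x: W @ x                      # outputs o(x)

X = rng.normal(size=(n_samples, n_inputs))
eps = 1e-6

S = np.zeros(n_inputs)
for k in range(n_inputs):
    for x in X:
        x_plus = x.copy()
        x_plus[k] += eps
        # accumulate |do/dx_k| over all outputs, via finite differences
        S[k] += np.sum(np.abs(model(x_plus) - model(x)) / eps)
S /= n_samples * n_outputs                   # average over samples and outputs

assert np.allclose(S, np.abs(W).mean(axis=0), atol=1e-4)
```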

Page 32:

Cont.

In this way the genes were ranked according to their significance for classification, and the classification error rate using increasing numbers of these ranked genes was determined.

Page 33:

Optimization of Genes

The classification error rate reached its minimum, 0%, at 96 genes: using the 96 highest-ranked genes, the error is 0%.

Page 34:

Recalibration

The entire process (steps 2–7) was repeated using only the 96 top-ranked genes (step 9).

The 10 dominant PCA components for these 96 genes contained 79% of the variance in the data matrix.

Using only these 96 genes, we recalibrated the ANN models and again correctly classified all 63 samples.

Page 35:

Assessing the Quality of Classification – Diagnosis

The aim of diagnosis is to be able to reject test samples which do not belong to any of the four categories.

To do this, a distance d_c from a sample to the ideal vote for each cancer type was calculated:

Page 36:

Cont.

d_c = sqrt( (1/2) · Σ_{i=1..4} (o_i − δ_{i,c})² )

where c is the cancer type, o_i is the average committee vote for cancer i, and δ_{i,c} is unity if i corresponds to cancer type c and zero otherwise.

The distance is normalized such that the distance between two ideal samples belonging to different disease categories is unity.
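The distance and its normalization are easy to verify in code (the class order EWS, RMS, NB, BL is assumed for illustration):

```python
import numpy as np

def distance(o, c):
    """d_c = sqrt( (1/2) * sum_i (o_i - delta_{i,c})^2 ) over the 4 cancer types."""
    delta = np.zeros(4)
    delta[c] = 1.0                    # ideal vote for cancer type c
    return np.sqrt(0.5 * np.sum((o - delta) ** 2))

# two ideal samples from different categories are at distance exactly 1
ideal_ews = np.array([1.0, 0.0, 0.0, 0.0])
assert np.isclose(distance(ideal_ews, 1), 1.0)   # EWS ideal vs RMS ideal

# a confident committee vote lies close to its own ideal
vote = np.array([0.9, 0.1, 0.05, 0.05])
d = distance(vote, 0)
assert d < 0.2
```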

Page 37:

Cont.

Based on the validation set, an empirical probability distribution of distances for each cancer type was generated.

The empirical probability distributions are built using each ANN model independently.

Thus, the number of entries in each distribution is given by 1250 multiplied by the number of samples belonging to the cancer type.

Page 38:

Cont.

For a given test sample it is thus possible to reject possible classifications based on these probability distributions.

Hence, for each disease category a cutoff distance from the ideal sample was defined, within which a sample of that category is expected to fall.

The distance given by the 95th percentile of the probability distribution was chosen.

This is the basis of diagnosis: a sample that falls outside the cutoff distance cannot be confidently diagnosed.
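The cutoff rule can be sketched with synthetic distances standing in for the empirical distribution of one cancer type:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for the empirical distribution of distances d_c,
# with 1250 shufflings x 13 samples of this hypothetical cancer type
distances = np.abs(rng.normal(0.1, 0.05, size=1250 * 13))

cutoff = np.percentile(distances, 95)     # 95th-percentile cutoff distance

def diagnose(d, cutoff):
    # a sample beyond the cutoff cannot be confidently diagnosed
    return "confident" if d <= cutoff else "rejected"

assert diagnose(0.05, cutoff) == "confident"
assert diagnose(cutoff + 1.0, cutoff) == "rejected"
```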

Page 39:

Cont.

If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output (for example, for EWS the ideal output is EWS = 1, RMS = NB = BL = 0), its diagnosis is rejected.

Using the 3750 ANN models calibrated with the 96 genes, all 5 of the non-SRBCT samples were excluded from all four diagnostic categories, since they fell outside the 95th percentile.

Page 40:

Classification & Diagnosis

Plotted, for each sample, is the distance from its committee vote to the ideal vote for that diagnostic category.

Training samples are shown as squares and test samples as triangles. The 95th percentile is represented by a dashed line.

Page 41:

Table 1 – ANN Diagnostic Prediction

[Table 1 from the paper; the footnote marker “b” flags the non-SRBCT samples, which fell outside the 95th percentile.]

Page 42:

Final Results

Using these criteria for all 88 samples, the sensitivity of the ANN models for diagnostic classification was 93% for EWS, 96% for RMS and 100% for both NB and BL. The specificity was 100% for all four diagnostic categories. (As shown in the table)

Page 43:

Hierarchical Clustering

Using the 96 Top-Ranked Genes

Page 44:

Hierarchical Clustering

The hierarchical clustering dendrogram of the samples uses the 96 genes identified from the ANN models.

The Pearson correlation coefficient was 0.54.

All 63 training samples and the 20 test SRBCTs clustered correctly within their diagnostic categories.

Page 45:

Hierarchical Clustering

Each row represents one of the 96 cDNA clones and each column a separate sample.

On the right are the IMAGE id, the gene symbol, the class in which the gene is highly expressed, and the ANN rank.

*, genes that have not been reported to be associated with these cancers.

Page 46:

Conclusions

When we tested the ANN models calibrated using the 96 genes on 25 blinded samples, we were able to correctly classify all 20 samples of SRBCTs and reject the 5 non-SRBCTs. This supports the potential use of these methods as an adjunct to routine histological diagnosis.

Page 47:

Conclusions

Although we achieved high sensitivity and specificity for diagnostic classification, we believe that with larger arrays and more samples it will be possible to improve on the sensitivity of these models for purposes of diagnosis in clinical practice.

Page 48:

Questions?