Bioinformatics seminar: Finding cancer driver mutations

Ilya Minkin

Saint Petersburg Academical University

8th December 2012

1 / 26

Motivation

I It is widely accepted that tumorigenesis depends heavily on the accumulation of specific mutations

I If we consider the genome of a tumor cell, we may see thousands of alterations

I Which mutations cause functional changes that enhance tumor cell proliferation, i.e. "driver" mutations?

I And which mutations are unrelated to this process, i.e. "passenger" mutations?

I This question is one of the most challenging in cancer genetics

2 / 26

Motivation

I Genes that are mutated in a wide range of tumors can be confidently classified as "driver" genes

I However, there are many "driver" genes that are mutated in < 1% of tumors

I We can't use traditional methods (studies in model organisms, gene knockouts, etc.) to analyze hundreds of candidate genes

I Thus we need a high-throughput computational method for finding "driver" mutations that is independent of mutation frequency

I The papers discussed here focus on missense mutations, not nonsense or frameshift mutations

3 / 26

CAN-Predict

Distinguishing Cancer-Associated Missense Mutations from Common Polymorphisms, Cancer Research (2007)

Joshua S. Kaminker, Yan Zhang, Allison Waugh, Peter M. Haverty, Brock Peters, Dragan Sebisanovic, Jeremy Stinson, William F. Forrest, J. Fernando Bazan, Somasekar Seshagiri, and Zemin Zhang

4 / 26

Overview

I There are a few well-characterized classes of SNPs in the human genome:

I Common missense SNPs

I Mendelian disease SNPs: variants that cause disease and follow a Mendelian pattern of inheritance

I Complex disease SNPs: variants that are related to complex diseases caused by many genetic and environmental factors

I Cancer-driving mutations

I We will try to understand how to distinguish them and consequently build a classifier

5 / 26

Three characteristics

For each known mutation we obtain:

I 1. The SIFT score: SIFT predicts whether an amino acid substitution affects protein function

I 2. The Pfam E-value ratio: align both the variant and the canonical protein against the Pfam database using HMMER, take the best E-value for each, and calculate log10(E_variant / E_canonical)

I 3. The GO log-odds score: the relative frequency with which a Gene Ontology (GO) term was used to annotate cancer versus noncancer gene sets
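
Features 2 and 3 can be illustrated with a short Python sketch. The function names, the pseudocount, and the example values are assumptions made for illustration only; the exact formulas used in the paper may differ.

```python
import math


def log_evalue_ratio(e_variant: float, e_canonical: float) -> float:
    """log10(E_variant / E_canonical) for the best Pfam hit of each sequence.

    A positive value means the variant matches the domain model worse
    (larger E-value) than the canonical protein does.
    """
    return math.log10(e_variant / e_canonical)


def go_log_odds(n_cancer_with_term: int, n_cancer: int,
                n_noncancer_with_term: int, n_noncancer: int,
                pseudocount: float = 1.0) -> float:
    """Log ratio of how often a GO term annotates cancer vs. noncancer genes.

    The pseudocount is an assumption (not from the source) that avoids log(0).
    """
    freq_cancer = (n_cancer_with_term + pseudocount) / (n_cancer + pseudocount)
    freq_noncancer = (n_noncancer_with_term + pseudocount) / (n_noncancer + pseudocount)
    return math.log(freq_cancer / freq_noncancer)


# Hypothetical example: the substitution weakens the best Pfam match.
ratio = log_evalue_ratio(1e-20, 1e-30)
# Hypothetical example: a term seen in 30 of 100 cancer genes
# and in 10 of 1000 noncancer genes.
odds = go_log_odds(30, 100, 10, 1000)
```

A positive ratio suggests the substitution disrupts a conserved domain, and a positive log-odds value suggests the GO term is enriched among cancer genes.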

6 / 26

7 / 26

8 / 26

9 / 26

Conclusions

I We see that cancer-driving mutations are similar to Mendelian disease SNPs

I These three features can be used for classification

I Random forests are used for building the classifier

10 / 26

Random forests

A random forest is an ensemble of decision trees. Random forests are powerful and fast. Each tree is grown as follows:

I If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree

I If there are M input variables, a number m ≪ M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node

I Each tree is grown to the largest extent possible
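
The two sources of randomness in the steps above can be sketched in Python using only the standard library. The helper names and the values N = 10, M = 16, m = 4 are hypothetical; this is not a full tree-growing implementation.

```python
import random


def bootstrap_sample(n_cases: int, rng: random.Random):
    """Draw n_cases indices with replacement; also return the left-out indices."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in drawn]
    return in_bag, out_of_bag


def candidate_features(n_features: int, m: int, rng: random.Random):
    """Pick m of the M features (without replacement) to try at one node."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)                             # fixed seed for repeatability
in_bag, out_of_bag = bootstrap_sample(10, rng)     # N = 10 (hypothetical)
features_at_node = candidate_features(16, 4, rng)  # M = 16, m = 4 (hypothetical)
```

For large N, each bootstrap sample leaves out roughly 1/e (about 37%) of the cases.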

11 / 26

Validation

I The out-of-bag (OOB) error rate is an important measure of accuracy. It is calculated by applying each individual classification tree to the subset of training data points not used in the construction of that tree

I OOB error rate = 3.19%

I Cross-validation: of 730 variants, only 10 of 581 (1.7%) normal variants were misclassified as cancer and only 13 of 149 (8.7%) cancer variants were misclassified as normal
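
The cross-validation percentages follow directly from the reported counts. The snippet below reproduces the arithmetic; the variable names are chosen here for illustration, and the overall error rate is derived from the same counts rather than quoted from the paper.

```python
# Counts reported for the cross-validation
normal_total, normal_as_cancer = 581, 10
cancer_total, cancer_as_normal = 149, 13

false_positive_rate = normal_as_cancer / normal_total   # normal called cancer
false_negative_rate = cancer_as_normal / cancer_total   # cancer called normal
overall_error = (normal_as_cancer + cancer_as_normal) / (normal_total + cancer_total)

print(f"{false_positive_rate:.1%} {false_negative_rate:.1%} {overall_error:.2%}")
# prints "1.7% 8.7% 3.15%"
```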

12 / 26

One application of such a classifier is distinguishing relevant cancer-associated mutations from the expected polymorphic variants that are often identified during sequencing projects

13 / 26

Discussion

I The resulting classifier distinguishes cancer-driving mutations from passenger mutations

I Cross-validation shows that it is fairly accurate (1.7% of normal variants and 8.7% of cancer variants were misclassified)

I It was also shown that somatic mutations are more likely to be predicted as cancer-driving than common SNPs are

I The authors also tried to predict some novel cancer-driving mutations; those results are not covered here

14 / 26

CHASM

Cancer-Specific High-Throughput Annotation of Somatic Mutations: Computational Prediction of Driver Missense Mutations, Cancer Research (2009)

Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E. Velculescu, Kenneth W. Kinzler, Bert Vogelstein, and Rachel Karchin

15 / 26

Previous work

I Several classifiers had been developed earlier

I Driver mutations are similar to mutations associated with Mendelian disease and may be identifiable by constraints on the amino acid residues at mutated positions

I Passengers are more similar to nonsynonymous single nucleotide polymorphisms (nsSNPs) with high minor allele frequencies (MAF)

I Random forests (CAN-Predict), SVM (protein kinase-specific classifier)

16 / 26

Previous work

I Previous work used common SNPs as negative training examples

I Existing computational methods could detect differences between somatic missense mutations observed in cancers and high-MAF nsSNPs

I But these differences might be less relevant to the discrimination between driver and passenger mutations that occur somatically in tumors

17 / 26

SNPs and passenger mutations

I Although high-MAF nsSNPs and passenger mutations have properties in common, they also differ

I Passenger mutations may or may not have a functional impact on proteins; by definition, they are neutral with respect to cancer cell fitness

I In contrast, high-MAF nsSNPs have become fixed in the human genome and must be functionally neutral or have a mild functional impact with respect to normal cell fitness

I Let's generate a synthetic set of passenger mutations and train a classifier on it

18 / 26

Overview

I A Random Forest classifier was trained on 49 predictive features

I Feature selection was done with a protocol based on mutual information

I Driver mutation data set: 2,488 missense mutations previously identified as playing a functional role in oncogenic transformation

I The synthetic passenger mutations were generated by sampling from eight multinomial distributions that depend on dinucleotide context and tumor type

I The final score for each mutation is the fraction of trees that voted for the passenger class

19 / 26
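
The passenger sampling and the scoring rule described for CHASM can be sketched as follows. The context names and probabilities are made up for illustration; the real distributions are estimated per tumor type and are not given on these slides.

```python
import random

# Hypothetical substitution probabilities for two dinucleotide contexts.
# These numbers are placeholders, not values from the paper.
CONTEXT_PROBS = {
    "CpG": {"C>T": 0.7, "C>A": 0.2, "C>G": 0.1},
    "TpC": {"C>T": 0.4, "C>A": 0.3, "C>G": 0.3},
}


def sample_passengers(context: str, k: int, rng: random.Random):
    """Draw k substitution types from the multinomial for this context."""
    probs = CONTEXT_PROBS[context]
    return rng.choices(list(probs), weights=list(probs.values()), k=k)


def passenger_vote_fraction(tree_votes):
    """Score = fraction of trees that voted for the passenger class."""
    return sum(vote == "passenger" for vote in tree_votes) / len(tree_votes)


rng = random.Random(0)
synthetic = sample_passengers("CpG", 5, rng)
score = passenger_vote_fraction(["driver", "driver", "passenger", "driver"])
```

With this scoring rule, a low score means that most trees voted for the driver class.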

Candidate features

80 candidate features in total, such as:

I Change in charge

I BLOSUM 62 substitution score

I 17-way exon conservation

I SNP density (the number of genetic variants or polymorphisms in the exon where the mutation is located)

I ...
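
Feature selection in CHASM was based on mutual information. As a hedged sketch of the underlying quantity (not the authors' actual protocol), the function below estimates the mutual information between a discrete feature and the class label; continuous features would first need to be binned, and the function name is an assumption.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Estimate I(X; Y) in bits from paired discrete observations."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    count_x = Counter(feature_values)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi


labels = ["driver", "driver", "passenger", "passenger"]
informative = mutual_information([1, 1, 0, 0], labels)    # feature mirrors the label
uninformative = mutual_information([1, 0, 1, 0], labels)  # feature independent of label
```

A feature that perfectly separates two equally sized classes carries 1 bit; a feature independent of the label carries 0 bits.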

20 / 26

21 / 26

Probabilistic interpretation of random forest classification scores

I They used the trained Random Forest to compute a classification score for each of 607 glioblastoma multiforme (GBM) missense mutations

I However, these scores are not probabilities, and the statistical behavior of the algorithm has not been well characterized

I It is not evident where to set a trusted score cutoff for the purpose of identifying driver mutations

22 / 26

Probabilistic interpretation of random forest classification scores

I For each of the 607 GBM mutations, we test the null hypothesis that the mutation is not functionally related to the growth of the tumor (passenger) against the alternative hypothesis that it is (driver)

I We obtain a P value for a mutation by comparing its score to the null distribution, which consists of the scores of a filtered set of synthetic passengers that were held out from Random Forest training
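
A minimal sketch of such an empirical P value is shown below. It assumes that the score is the fraction of trees voting for the passenger class, so lower scores are more driver-like and the test is one-sided towards low scores. The function name and the null scores are hypothetical.

```python
from bisect import bisect_right


def empirical_p_value(score: float, null_scores) -> float:
    """Fraction of held-out passenger scores that are <= the observed score."""
    ordered = sorted(null_scores)
    return bisect_right(ordered, score) / len(ordered)


# Hypothetical scores of held-out synthetic passengers (the null distribution)
null_scores = [0.9, 0.8, 0.85, 0.6, 0.95, 0.7, 0.3, 0.75, 0.88, 0.92]
p = empirical_p_value(0.3, null_scores)
```

Here only one of the ten null scores is at or below 0.3, so p is 0.1. With 607 mutations tested, some correction for multiple testing would also be needed before calling drivers.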

23 / 26

Other methods

I PolyPhen classifies missense mutants as "Probably Damaging", "Possibly Damaging" or "Benign" and also provides a continuous measure of a mutation's functional impact, the PSIC score

I SIFT provides a score that ranges between 0 and 1 to report the probability that a missense mutation will be tolerated

I SIFT/PolyPhen consensus

I CanPredict

I KinaseSVM

24 / 26

25 / 26

26 / 26
