Bioinformatics seminar: Finding
cancer driver mutations
Ilya Minkin
Saint Petersburg Academical University
8th December 2012
1 / 26
Motivation
I It is widely accepted that tumorigenesis depends heavily on the accumulation of specific mutations
I If we consider the genome of a tumor cell, we may see thousands of alterations
I Which mutations cause functional changes that enhance tumor cell proliferation, i.e. "driver" mutations?
I And which mutations are unrelated to this process, i.e. "passenger" mutations?
I This question is one of the most challenging in cancer genetics
2 / 26
Motivation
I Genes that mutate in a wide range of tumors can be confidently classified as "driver" genes
I However, there are many "driver" genes that mutate in < 1% of tumors
I We can't use traditional methods (studies in model organisms, gene KO, etc.) to analyze hundreds of candidate genes
I Thus we need a high-throughput computational method for finding "driver" mutations that is independent of mutation frequency
I The papers focus on missense mutations, not nonsense/frameshift
3 / 26
CanPredict
Distinguishing Cancer-Associated Missense Mutations from Common Polymorphisms, Cancer Research (2007)
Joshua S. Kaminker, Yan Zhang, Allison Waugh, Peter M. Haverty, Brock Peters, Dragan Sebisanovic, Jeremy Stinson, William F. Forrest, J. Fernando Bazan, Somasekar Seshagiri, and Zemin Zhang
4 / 26
Overview
I There are a few well-characterized classes of SNPs in the human genome:
I Common missense SNP
I Mendelian disease SNP: variants that cause disease and follow a Mendelian pattern of inheritance
I Complex disease SNP: variants that are related to complex disease caused by many genetic and environmental factors
I Cancer driving mutations
I We will try to understand how to distinguish them and consequently build a classifier
5 / 26
Three characteristics
For each known mutation we obtain:
I 1. SIFT predicts whether an amino acid substitution affects protein function and gives a score
I 2. Align both the variant and the canonical protein against the Pfam database using HMMER. Take the best E-values and calculate log10(E_variant / E_canonical)
I 3. Log-odds scores representing the relative frequency with which a Gene Ontology (GO) term was used to annotate cancer or noncancer gene sets
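The second and third characteristics are simple log ratios, so a small sketch may help. The function names, the example E-values, the annotation counts, and the pseudocount below are all assumptions made for illustration; the slide does not give the exact formula the authors used for the GO log-odds.

```python
import math

def evalue_log_ratio(e_variant: float, e_canonical: float) -> float:
    """log10(E_variant / E_canonical) for the best Pfam hit of each sequence."""
    if e_variant <= 0 or e_canonical <= 0:
        raise ValueError("E-values must be positive")
    # Subtract logs instead of dividing first to avoid underflow on tiny E-values.
    return math.log10(e_variant) - math.log10(e_canonical)

def go_log_odds(cancer_hits: int, cancer_genes: int,
                other_hits: int, other_genes: int,
                pseudocount: float = 1.0) -> float:
    """Log ratio of how often a GO term annotates cancer vs. noncancer genes.

    The pseudocount is an assumption; it only keeps the ratio finite
    when a term never appears in one of the two gene sets.
    """
    f_cancer = (cancer_hits + pseudocount) / (cancer_genes + 2 * pseudocount)
    f_other = (other_hits + pseudocount) / (other_genes + 2 * pseudocount)
    return math.log(f_cancer / f_other)

# Hypothetical numbers: the variant matches its Pfam domain worse than the
# canonical protein does, so the ratio is positive.
print(evalue_log_ratio(1e-20, 1e-25))
# Hypothetical GO term seen in 30 of 100 cancer genes and 5 of 100 other genes.
print(go_log_odds(30, 100, 5, 100))
```

A positive E-value ratio means the substitution made the protein a poorer match to its domain model; a positive GO log-odds means the term is used more often for cancer genes.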
6 / 26
7 / 26
8 / 26
9 / 26
Conclusions
I We see that cancer-driving mutations are similar to Mendelian disease SNPs
I These three features can be used for classification
I Random forests are used for building the classifier
10 / 26
Random forests
A random forest is an ensemble of decision trees. Random forests are powerful and fast. Each tree is grown as follows:
I If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree
I If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node
I Each tree is grown to the largest extent possible
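The three steps above can be sketched in plain Python. This is a toy illustration, not the implementation used in the paper: the Gini split criterion, the function names, and the six-point data set are assumptions.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (impurity, feature, threshold) among the given features, or None."""
    best = None
    for f in features:
        # Dropping the largest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; m of the M features are drawn at every node."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure leaf
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:                         # drawn features are constant here
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))

def grow_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # N draws with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def predict_tree(tree, row):
    while isinstance(tree, tuple):            # internal nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy data (hypothetical): two features, three "normal" and three "cancer" points.
rows = [(0.1, 1.0), (0.2, 2.0), (0.3, 1.5), (0.8, 8.0), (0.9, 9.0), (0.7, 7.5)]
labels = ["normal"] * 3 + ["cancer"] * 3
forest = grow_forest(rows, labels, n_trees=25, m=1)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (1.0, 10.0)))
```

Leaves are plain class labels and internal nodes are tuples, so class labels must not themselves be tuples in this sketch.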
11 / 26
Validation
I The Out Of Bag (OOB) error rate is an important measure of accuracy and is calculated by applying each individual classification tree to a subset of training data points not used in the construction of the tree
I OOB = 3.19%
I Cross validation: of 730 variants, only 10 of 581 (1.7%) normal variants were misclassified as cancer and only 13 of 149 (8.7%) cancer variants were misclassified as normal
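The cross-validation percentages follow directly from the counts on the slide. A short check, with variable names chosen here for illustration:

```python
def rate(errors: int, total: int) -> float:
    """Error rate in percent."""
    return 100.0 * errors / total

normal_total, normal_as_cancer = 581, 10   # counts from the slide
cancer_total, cancer_as_normal = 149, 13

false_positive_rate = rate(normal_as_cancer, normal_total)
false_negative_rate = rate(cancer_as_normal, cancer_total)
overall_error = rate(normal_as_cancer + cancer_as_normal,
                     normal_total + cancer_total)

print(f"normal misclassified as cancer: {false_positive_rate:.1f}%")  # 1.7%
print(f"cancer misclassified as normal: {false_negative_rate:.1f}%")  # 8.7%
print(f"overall cross-validation error: {overall_error:.1f}%")        # 3.2%
```

The overall figure (23 of 730) is not stated on the slide; it is derived here from the two reported counts and is a separate estimate from the OOB error.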
12 / 26
An application of such a classifier is distinguishing relevant cancer-associated mutations from the expected polymorphic variants often identified during sequencing projects
13 / 26
Discussion
I There is a classifier that distinguishes between cancer-driving mutations and passengers
I Cross-validation shows that it is fairly accurate (1.7% of normal and 8.7% of cancer variants misclassified)
I It was also shown that somatic mutations are more likely to be predicted as cancer-driving compared to common SNPs
I They also tried to predict some novel cancer-driving mutations, but I won't go into that here
14 / 26
CHASM
Cancer-Specific High-Throughput Annotation of Somatic Mutations: Computational Prediction of Driver Missense Mutations, Cancer Research (2009)
Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E. Velculescu, Kenneth W. Kinzler, Bert Vogelstein, and Rachel Karchin
15 / 26
Previous work
I Several classifiers were developed earlier
I Driver mutations are similar to mutations associated with Mendelian disease and may be identifiable by setting constraints on amino acid residues at mutated positions
I Passengers are more similar to nonsynonymous single nucleotide polymorphisms (nsSNP) with high minor allele frequencies (MAF)
I Random forests (CanPredict), SVM (protein kinase-specific classifier)
16 / 26
Previous work
I Previous work used common SNPs as negative training examples
I Existing computational methods could detect differences between somatic missense mutations observed in cancers and high MAF nsSNPs
I But these differences might be less relevant to the discrimination between driver and passenger mutations that occur somatically in tumors
17 / 26
SNPs and passenger mutations
I Although high MAF nsSNPs and passenger mutations have properties in common, they also have differences
I Passenger mutations may or may not have a functional impact on proteins; by definition, they are neutral with respect to cancer cell fitness
I In contrast, high MAF nsSNPs have become fixed in the human genome and must be functionally neutral or have a mild functional impact with respect to normal cell fitness
I Let's generate a synthetic set of passenger mutations and train a classifier
18 / 26
Overview
I Random Forest classifier trained on 49 predictive features
I Feature selection was done with a protocol based on mutual information
I Driver mutation data set: 2,488 missense mutations previously identified as playing a functional role in oncogenic transformation
I The synthetic passenger mutations were generated by sampling from eight multinomial distributions that depend on dinucleotide context and tumor type
I The final score for each mutation is the fraction of trees that voted for the passenger class
19 / 26
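Two pieces of the CHASM overview lend themselves to a short sketch: mutual information between a feature and the class label, and the tree-vote score. The slide does not describe the actual selection protocol, so the code below only shows the underlying quantity for discrete (already binned) features; all names and the toy values are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def passenger_vote_fraction(tree_votes):
    """Score of one mutation: fraction of trees voting 'passenger'."""
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    return sum(v == "passenger" for v in tree_votes) / len(tree_votes)

# Toy labels (hypothetical): two passengers, two drivers.
classes = ["passenger", "passenger", "driver", "driver"]
informative = [0, 0, 1, 1]      # feature value tracks the class exactly
uninformative = [0, 1, 0, 1]    # feature value is independent of the class
print(mutual_information(informative, classes))    # 1.0
print(mutual_information(uninformative, classes))  # 0.0

# Hypothetical votes from a four-tree forest for one mutation.
print(passenger_vote_fraction(["driver", "driver", "passenger", "driver"]))  # 0.25
```

With this scoring convention a low score means most trees voted "driver", so driver candidates sit at the low end of the score range.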
Candidate features
In total 80 features, such as:
I Change in charge
I BLOSUM 62 substitution score
I 17-way exon conservation
I SNP density (the number of genetic variants or polymorphisms in the exon where the mutation is located)
I ...
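As an illustration of how simple some of these candidate features are, here is a sketch of "change in charge". The charge table is an assumption (side-chain charges at roughly neutral pH, with histidine treated as neutral); the slide does not say how the authors encoded this feature.

```python
# Assumed side-chain charges at roughly neutral pH; every residue not
# listed (including histidine) is treated as neutral in this sketch.
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

def change_in_charge(ref_aa: str, alt_aa: str) -> int:
    """Charge of the mutant residue minus charge of the reference residue."""
    return SIDE_CHAIN_CHARGE.get(alt_aa, 0) - SIDE_CHAIN_CHARGE.get(ref_aa, 0)

print(change_in_charge("E", "K"))  # 2: negative residue replaced by a positive one
print(change_in_charge("A", "G"))  # 0: both residues are neutral
```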
20 / 26
21 / 26
Probabilistic interpretation of random forest classification scores
I They used the trained Random Forest to compute a classification score for each of 607 glioblastoma multiforme (GBM) missense mutations
I However, these scores are not probabilities and the statistical behavior of the algorithm has not been well-characterized
I It is not evident where to set a trusted score cutoff for purposes of identifying driver mutations
22 / 26
Probabilistic interpretation of random forest classification scores
I For each of the 607 GBM mutants, we test the null hypothesis: the mutant is not functionally related to the growth of the tumor (passenger), versus the alternative hypothesis that it is (driver)
I We obtain a P value for a mutation by comparing its score to the null distribution, which consists of the scores of a filtered set of synthetic passengers that were held out from Random Forest training
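The comparison against the null distribution can be sketched as an empirical P value. The direction of the test is an assumption based on the score being the fraction of trees voting "passenger": lower scores look more driver-like, so the P value is taken as the fraction of held-out passenger scores that are at least as low. The function name and the numbers are hypothetical.

```python
from bisect import bisect_right

def empirical_p_value(score, null_scores):
    """Fraction of null (held-out passenger) scores that are <= score."""
    if not null_scores:
        raise ValueError("null distribution is empty")
    ordered = sorted(null_scores)
    # bisect_right counts how many null scores are less than or equal to score.
    return bisect_right(ordered, score) / len(ordered)

# Hypothetical scores of held-out synthetic passengers.
null = [0.2, 0.4, 0.6, 0.8, 0.9]
print(empirical_p_value(0.3, null))   # 0.2: one of five passengers scores this low
print(empirical_p_value(0.95, null))  # 1.0: every passenger scores this low
```

With a small null set the P value can come out as exactly 0; adding one to both numerator and denominator is a common way to avoid that.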
23 / 26
Other methods
I PolyPhen classifies missense mutants as "Probably Damaging", "Possibly Damaging" or "Benign" and also provides a continuous measure of a mutation's functional impact, the PSIC score
I SIFT provides a score that ranges between 0 and 1 to report the probability that a missense mutation will be tolerated
I SIFT/PolyPhen consensus
I CanPredict
I KinaseSVM
24 / 26
25 / 26
26 / 26