TRANSCRIPT
De novo discovery of mutated driver pathways in cancer
by Fabio Vandin, Eli Upfal, and Benjamin J. Raphael
Genome Research, 2012
Discussion leader: Matthew Bernstein
Scribe: Kun-Chieh Wang
Computational Network Biology BMI 826/Computer Sciences 838
https://compnetbiocourse.discovery.wisc.edu
Problem overview
• Cancer is caused by a genetic mutation, or set of mutations, that leads to uncontrolled growth and division
• A driver pathway is any pathway such that a mutation in the pathway leads to cancer. A mutation in a driver pathway is called a driver mutation
• Other mutations are called passenger mutations
• Problem statement:
Given: A set of cancer genomes
Goal: Find the driver mutations
Challenges
• Passenger mutations are difficult to discern from driver mutations
• Cancer genomes are highly heterogeneous with respect to both passenger and driver mutations
– Many combinations of driver mutations may lead to cancer
– Cannot test all combinations of genes
Assumptions
• As is often done in computational biology, we make some assumptions to make the problem well defined:
– Driver mutations tend to be rare and thus can be assumed to be mutually exclusive, meaning that if a cancer genome has one driver mutation it does not have another
– A set of driver mutations should “explain” the global set of cancer genomes, meaning that each cancer genome should have one driver mutation
Formulating the objective function
• With these assumptions we search for a set of mutations with high:
– Coverage – most patients have at least one mutation in the set of driver mutations
– Exclusivity – a patient has only one driver mutation
Given: A set of cancer genomes
Goal: Find a set of mutations with maximum coverage while maintaining exclusivity
Formulating the objective function
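• $n$ – number of genes
• $m$ – number of patients
• $A$ – an $m \times n$ mutation matrix where $A_{ij} = 1$ if gene $j$ is mutated in patient $i$
• $\Gamma(g) = \{ i : A_{ig} = 1 \}$ – the set of patients for which gene $g$ is mutated
• $\Gamma(M) = \bigcup_{g \in M} \Gamma(g)$ – the set of patients that have a mutation in some gene in the set of genes $M$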
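As a minimal sketch of these definitions, the snippet below stores a toy mutation matrix as a list of rows (one per patient) and computes the set of patients mutated in a single gene and in a set of genes. The matrix values and the helper names `gamma_gene` and `gamma_set` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy mutation matrix: rows are patients, columns are genes.
A = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]


def gamma_gene(A, g):
    """Set of patient indices in which gene (column) g is mutated."""
    return {i for i, row in enumerate(A) if row[g] == 1}


def gamma_set(A, M):
    """Set of patients with a mutation in at least one gene of M."""
    covered = set()
    for g in M:
        covered |= gamma_gene(A, g)
    return covered
```

For this toy matrix, `gamma_set(A, [0, 1])` collects every patient covered by either of the first two genes; the last patient has no mutations and is never covered.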
Formulating the objective function
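• A set of genes $M$ is mutually exclusive if for all pairs of distinct genes $g, g' \in M$ the following holds: $\Gamma(g) \cap \Gamma(g') = \emptyset$, where $\Gamma(g)$ is the set of patients in which gene $g$ is mutated
• An $m \times k$ submatrix $M$ of $A$ is mutually exclusive if each row of the submatrix contains at most one value of 1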
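The two views of mutual exclusivity (disjoint patient sets for every pair of genes, and at most one 1 per row of the submatrix) describe the same property. A small sketch, using hypothetical toy data and helper names chosen for illustration, checks both:

```python
from itertools import combinations

# Hypothetical toy data: rows are patients, columns are genes.
A = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]


def patients_with(A, g):
    """Set of patient indices in which gene (column) g is mutated."""
    return {i for i, row in enumerate(A) if row[g] == 1}


def exclusive_by_pairs(A, M):
    """Gene-set view: no two genes in M share a mutated patient."""
    return all(
        patients_with(A, g).isdisjoint(patients_with(A, h))
        for g, h in combinations(M, 2)
    )


def exclusive_by_rows(A, M):
    """Submatrix view: every row has at most one 1 in the columns of M."""
    return all(sum(row[g] for g in M) <= 1 for row in A)
```

Genes 0 and 1 never co-occur in this toy matrix, while genes 1 and 2 are both mutated in the third patient, so the two checks agree on each pair.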
Formulating the objective function
• The problem can now be restated mathematically:
Given: A mutation matrix $A$ and an integer $k > 0$
Goal: Find a mutually exclusive submatrix of $A$ of size $m \times k$ with the largest number of non-zero rows
• PROBLEM: driver mutations may not be measured as mutually exclusive due to experimental error. Furthermore, passenger mutations may co-occur in driver pathways.
Formulating the objective function
• We must reformulate the problem. Our current formulation is too strict.
• Instead of strictly mutually exclusive mutations, we’ll attempt to find approximately exclusive mutations:
– most patients have no more than 1 mutation in the gene set $M$
• This introduces a tradeoff: increasing coverage tends to decrease exclusivity
– Why? Adding a new mutation to our set of driver mutations can only keep or increase coverage, but this mutation might be highly non-exclusive
Formulating the objective function
• To make this problem mathematically well-defined, we need to formalize this tradeoff
• We measure the coverage overlap using the following equation: $\omega(M) = \sum_{g \in M} |\Gamma(g)| - |\Gamma(M)|$
• Given 2 genes $g$ (red) and $g'$ (blue) we can visualize this equation as:
The area of the overlap is $\omega(M)$
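For the two-gene case, the picture follows from inclusion-exclusion. Writing $\Gamma(g)$ for the set of patients in which gene $g$ is mutated and taking $M = \{g, g'\}$ (the names $g$ and $g'$ are chosen here for illustration):

```latex
\begin{aligned}
\omega(\{g, g'\}) &= |\Gamma(g)| + |\Gamma(g')| - |\Gamma(g) \cup \Gamma(g')| \\
                  &= |\Gamma(g) \cap \Gamma(g')|
\end{aligned}
```

So for two genes the overlap counts exactly the patients mutated in both. For larger sets, a patient with $j \geq 1$ mutations among the genes of $M$ contributes $j - 1$ to the overlap.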
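Formulating the objective function
• We measure the tradeoff between coverage and exclusivity with the following measure: $W(M) = |\Gamma(M)| - \omega(M)$, where $|\Gamma(M)|$ is the number of patients with a mutation in some gene of $M$ and $\omega(M)$ is the coverage overlap
– $|\Gamma(M)|$ measures coverage. The higher the better.
– $\omega(M)$ penalizes non-exclusivity. The lower the better.
Given: A mutation matrix $A$ and an integer $k > 0$
Goal: Find an $m \times k$ submatrix $M$ of $A$ that maximizes $W(M)$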
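A minimal sketch of the measure, assuming the mutation matrix is stored as a list of 0/1 rows (patients) and a gene set is a list of column indices. The function names and the toy data are illustrative assumptions:

```python
# Hypothetical toy data: rows are patients, columns are genes.
A = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]


def coverage(A, M):
    """Number of patients with at least one mutation among the genes in M."""
    return sum(1 for row in A if any(row[g] for g in M))


def overlap(A, M):
    """Coverage overlap: per-gene mutation counts summed, minus the coverage."""
    total_mutations = sum(row[g] for row in A for g in M)
    return total_mutations - coverage(A, M)


def weight(A, M):
    """Coverage minus coverage overlap."""
    return coverage(A, M) - overlap(A, M)
```

On the toy matrix, genes 0 and 1 cover three patients with no overlap, while genes 1 and 2 cover two patients and share one of them, so the first pair scores higher.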
Maximizing the objective function
• The authors prove that solving this problem is NP-hard
• Roughly, this means that no efficient exact algorithm is known: in the worst case we would need to try every combination of genes to find the one that maximizes the objective function
• Thus, we require either an algorithm for finding an approximate solution, or a heuristic
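To see why trying every combination does not scale, here is a sketch of the exhaustive search on a toy matrix. The weight used is coverage minus coverage overlap, written in the equivalent form "twice the coverage minus the total number of mutations". The toy data, the function names, and the figure of 20,000 genes are assumptions for illustration only.

```python
from itertools import combinations
from math import comb


def weight(A, M):
    """Coverage minus overlap, i.e. 2 * covered patients - total mutations."""
    covered = sum(1 for row in A if any(row[g] for g in M))
    total_mutations = sum(row[g] for row in A for g in M)
    return 2 * covered - total_mutations


def exhaustive_best(A, k):
    """Score every set of k genes and return the best one (exponential cost)."""
    n = len(A[0])
    return max(combinations(range(n), k), key=lambda M: weight(A, M))


# Hypothetical toy data: rows are patients, columns are genes.
A = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]
best = exhaustive_best(A, 2)

# Number of candidate sets of 6 genes out of a hypothetical 20,000 genes.
num_candidate_sets = comb(20000, 6)
```

With three genes there are only three pairs to score, but the count of candidate sets grows as "n choose k", which is far out of reach for genome-scale inputs.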
The Greedy Approach
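• Greedily add mutations to the current set of driver mutations as long as the objective function increases until $k$ genes are added:
1. $M \leftarrow \{g_1, g_2\}$, the pair of genes that maximizes $W(\{g_1, g_2\})$
2. for $i = 3, \dots, k$:
   1. $g^* \leftarrow \arg\max_{g} W(M \cup \{g\})$
   2. $M \leftarrow M \cup \{g^*\}$
3. return $M$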
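The greedy procedure can be sketched as follows, under the assumption that it starts from the best-scoring pair of genes and then adds one gene at a time until the set has `k` genes. This sketch always grows the set to size `k` (it does not stop early if the score drops), and the toy data and function names are illustrative.

```python
from itertools import combinations


def weight(A, M):
    """Coverage minus overlap, i.e. 2 * covered patients - total mutations."""
    covered = sum(1 for row in A if any(row[g] for g in M))
    total_mutations = sum(row[g] for row in A for g in M)
    return 2 * covered - total_mutations


def greedy(A, k):
    """Start from the best pair, then repeatedly add the best single gene."""
    n = len(A[0])
    M = list(max(combinations(range(n), 2), key=lambda pair: weight(A, pair)))
    while len(M) < k:
        candidates = [g for g in range(n) if g not in M]
        best_gene = max(candidates, key=lambda g: weight(A, M + [g]))
        M.append(best_gene)
    return M


# Hypothetical toy data: 6 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
selected = greedy(A, 3)
```

Here gene 3 overlaps with both gene 0 and gene 1, so the greedy step prefers gene 2, which covers a new patient without adding overlap.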
Results-Greedy approach
• Even with this very naïve approach, we can make interesting guarantees on its accuracy under the gene independence model
– Gene mutations are independent
– Driver genes have high coverage
– Each driver mutation contributes to the value of the objective function
• Can prove that under this model, the greedy approach would need 2,400 patients to maximize the objective function with probability $1 - 10^{-4}$
– This number of patients is not currently available
Better idea: MCMC
• Markov Chain Monte Carlo (MCMC) is a method for sampling from a complicated joint probability distribution
• Problem:
Given: A joint distribution
Goal: Generate a sample from that distribution
• Solution: Form a Markov chain such that its stationary distribution is the distribution of interest
Quick review: Markov chains
• A Markov chain is a basic model for modeling a stochastic process. It consists of a set of states and probabilities for transitioning from state to state
• Example:
• The stationary distribution is the probability of being in each state if we let the random process traverse from state to state for an infinite amount of time
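As a small illustration of a stationary distribution, the sketch below takes a hypothetical two-state chain (the transition probabilities are made up) and repeatedly pushes a starting distribution through the transition matrix until it stops changing.

```python
# Hypothetical transition matrix: P[i][j] is the probability of moving
# from state i to state j. Each row sums to 1.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]


def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, num_steps=200):
    """Approximate the stationary distribution by repeated stepping."""
    dist = [1.0] + [0.0] * (len(P) - 1)  # start in state 0
    for _ in range(num_steps):
        dist = step(dist, P)
    return dist


pi = stationary(P)
```

For this chain the balance condition `pi[0] * 0.1 == pi[1] * 0.5` gives a stationary distribution of 5/6 and 1/6, and the iteration converges to it regardless of the starting state.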
The MCMC Approach
• Sample from sets of genes $M$ in proportion to $e^{cW(M)}$, where $W(M)$ is the objective function value of $M$ and $c > 0$ is a constant
• We do so by forming a Markov chain such that each state in the Markov chain is associated with a set of genes $M$
• Stochastically transition from state to state. The most frequently visited state is most likely to have the highest $W(M)$
The MCMC Approach
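• More specifically, given current state $M_t$ we obtain $M_{t+1}$ as follows (here $W$ is the objective function and $c > 0$ is a constant):
1. Choose a gene $w$ uniformly at random from the global set of genes $\mathcal{G}$
2. Choose gene $v$ uniformly at random from $M_t$
3. Let $P(M_t, w, v) = \min\{1, e^{cW((M_t \setminus \{v\}) \cup \{w\}) - cW(M_t)}\}$
4. With probability $P(M_t, w, v)$ set $M_{t+1} = (M_t \setminus \{v\}) \cup \{w\}$, otherwise $M_{t+1} = M_t$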
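The transition rule can be sketched as a Metropolis-style swap sampler. This is a minimal illustration rather than the authors' implementation: the weight is coverage minus coverage overlap, the constant `c`, the toy matrix, and all function names are assumptions, and a proposal whose incoming gene is already in the current set is treated as a no-op so that every state keeps exactly `k` genes.

```python
import math
import random
from collections import Counter


def weight(A, M):
    """Coverage minus overlap, i.e. 2 * covered patients - total mutations."""
    covered = sum(1 for row in A if any(row[g] for g in M))
    total_mutations = sum(row[g] for row in A for g in M)
    return 2 * covered - total_mutations


def mcmc_step(A, M, c, rng):
    """Propose swapping a gene v in M for a random gene w, then accept or stay."""
    n = len(A[0])
    w = rng.randrange(n)          # gene drawn from the global set of genes
    v = rng.choice(sorted(M))     # gene drawn from the current set
    if w in M:
        return M                  # assumption: keep the set size fixed at k
    proposal = (M - {v}) | {w}
    accept_prob = min(1.0, math.exp(c * (weight(A, proposal) - weight(A, M))))
    return proposal if rng.random() < accept_prob else M


def run_chain(A, k, c, num_steps, seed=0):
    """Run the chain and count how often each gene set is visited."""
    rng = random.Random(seed)
    M = frozenset(range(k))
    visits = Counter()
    for _ in range(num_steps):
        M = mcmc_step(A, M, c, rng)
        visits[M] += 1
    return visits


# Hypothetical toy data: 6 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
visits = run_chain(A, k=2, c=2.0, num_steps=5000)
```

Moves that raise the weight are always accepted, while moves that lower it are accepted with exponentially small probability, so the chain spends most of its time on high-weight sets. `visits.most_common(1)` then gives the most frequently visited gene set.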
The MCMC Approach
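• With this definition of the transition matrix, the stationary distribution is $\pi(M) = \frac{e^{cW(M)}}{\sum_{R} e^{cW(R)}}$, where the sum runs over all gene sets $R$ of the same size as $M$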
• The authors prove that this Markov chain approaches its stationary distribution quickly
Results – Simulated data
• Generated 2 simulated datasets
– A dataset starting from a set of 6 genes
– A dataset consisting of 2 driver pathways and
• Controlled coverage and exclusivity
• Simulated passenger mutations using observed characteristics in glioblastoma data
• Simulated both single-nucleotide mutations as well as copy-number aberrations (CNAs)
• Ran the MCMC algorithm for $10^7$ iterations and sampled every $10^4$ iterations on each dataset
Results – Simulated data
Results – Simulated data
Results – real data
• Built matrices from various cancer genome studies
• Searched for sets of size
• Once a statistically significant set of mutations was found, it was removed from the matrix and the algorithm was re-run to find new sets
• Performed a statistical test. The test statistic was $W(M)$, the value of the objective function, and the null model was obtained by independently permuting the mutations for each mutation group among the patients
– This preserved the mutation frequency
– The reason for doing this is to assess the significance of the coverage and exclusivity given a fixed mutation frequency
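A permutation test of this kind can be sketched as follows. The null model shuffles each gene's column independently across patients, which keeps every gene's mutation count fixed. This sketch scores one fixed gene set on each permuted matrix, which is a simplification and may differ from the authors' procedure; the add-one p-value estimate, the function names, and the toy data are also assumptions.

```python
import random


def weight(A, M):
    """Coverage minus overlap, i.e. 2 * covered patients - total mutations."""
    covered = sum(1 for row in A if any(row[g] for g in M))
    total_mutations = sum(row[g] for row in A for g in M)
    return 2 * covered - total_mutations


def permute_columns(A, rng):
    """Shuffle each gene's column independently, keeping its mutation count."""
    m, n = len(A), len(A[0])
    columns = [rng.sample([row[g] for row in A], m) for g in range(n)]
    return [[columns[g][i] for g in range(n)] for i in range(m)]


def permutation_p_value(A, M, num_permutations=1000, seed=0):
    """Fraction of permuted matrices where M scores at least as high."""
    rng = random.Random(seed)
    observed = weight(A, M)
    hits = sum(
        weight(permute_columns(A, rng), M) >= observed
        for _ in range(num_permutations)
    )
    return (hits + 1) / (num_permutations + 1)


# Hypothetical toy data: 6 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
p_value = permutation_p_value(A, [0, 1, 2])
```

Because only the assignment of mutations to patients is shuffled, a small p-value indicates coverage and exclusivity beyond what the mutation frequencies alone would produce.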
Results – multiple cancer types
Results – Lung adenocarcinoma
Results – Glioblastoma multiforme
Discussion
• Is there an underlying network model?
• In contrast to nearly every other method that we have discussed in this class, this method does not utilize a biological network such as a protein-protein or protein-DNA interaction network
– Can we incorporate such a network into this method?
• Are coverage and exclusivity the best metrics for finding driver mutations?
• Does their objective function correctly capture coverage and exclusivity?
• What other methods could they have tried in order to solve their combinatorial optimization problem?
• How can this method be validated with biological experiments?