
De novo discovery of mutated driver pathways in cancer

Discussion leader: Matthew Bernstein
Scribe: Kun-Chieh Wang
Computational Network Biology, BMI 826/Computer Sciences 838
https://compnetbiocourse.discovery.wisc.edu

by Fabio Vandin, Eli Upfal, and Benjamin J. Raphael
Genome Research, 2012




Problem overview

• Cancer is caused by a genetic mutation, or set of mutations, that leads to uncontrolled growth and division

• A driver pathway is any pathway such that a mutation in the pathway leads to cancer. A mutation in a driver pathway is called a driver mutation

• Other mutations are called passenger mutations

• Problem statement:

Given: A set of cancer genomes
Goal: Find the driver mutations

Challenges

• Passenger mutations are difficult to discern from driver mutations

• Cancer genomes are highly heterogeneous with respect to both passenger and driver mutations
– Many combinations of driver mutations may lead to cancer
– Cannot test all combinations of genes

Assumptions

• As is often done in computational biology, we make some assumptions to make the problem well defined:
– Driver mutations tend to be rare and thus can be assumed to be mutually exclusive, meaning that if a cancer genome has one driver mutation it does not have another
– A set of driver mutations should “explain” the global set of cancer genomes, meaning that each cancer genome should have one driver mutation

Formulating the objective function

• With these assumptions we search for a set of mutations with high:
– Coverage – most patients have at least one mutation in the set of driver mutations
– Exclusivity – a patient has only one driver mutation

Given: A set of cancer genomes
Goal: Find a set of mutations with maximum coverage while maintaining exclusivity


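• $n$ – number of genes

• $m$ – number of patients

• $A$ – an $m \times n$ matrix where $A_{ij} = 1$ if gene $j$ is mutated in patient $i$, and $A_{ij} = 0$ otherwise

• $\Gamma(g) = \{ i : A_{ig} = 1 \}$ – the set of patients for which gene $g$ is mutated

• $\Gamma(M) = \bigcup_{g \in M} \Gamma(g)$ – the set of patients that have a mutation in some gene in the set of genes $M$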


• A set of genes $M$ is mutually exclusive if for all pairs of distinct genes $g, g' \in M$ the following holds: $\Gamma(g) \cap \Gamma(g') = \emptyset$

• An $m \times k$ submatrix of $A$ is mutually exclusive if each row of the submatrix contains at most one value of 1
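The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the convention from the Vandin et al. paper that rows are patients and columns are genes; the toy matrix and the function names `gamma`, `gamma_set`, and `is_mutually_exclusive` are hypothetical and not from the slides.

```python
# Toy mutation matrix (illustrative data): rows are patients, columns are genes.
A = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

def gamma(A, g):
    """Set of patients (row indices) in which gene g (column index) is mutated."""
    return {i for i, row in enumerate(A) if row[g] == 1}

def gamma_set(A, genes):
    """Set of patients with a mutation in at least one gene of `genes`."""
    covered = set()
    for g in genes:
        covered |= gamma(A, g)
    return covered

def is_mutually_exclusive(A, genes):
    """True if no patient carries a mutation in more than one gene of `genes`."""
    return all(sum(row[g] for g in genes) <= 1 for row in A)

print(gamma(A, 1))                       # {1, 2}
print(gamma_set(A, [0, 1]))              # {0, 1, 2}
print(is_mutually_exclusive(A, [0, 1]))  # True
print(is_mutually_exclusive(A, [1, 2]))  # False
```

Patient 2 carries mutations in both gene 1 and gene 2, which is why the second pair fails the exclusivity check.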


• The problem can now be restated mathematically:

Given: An $m \times n$ mutation matrix $A$ and an integer $k > 0$
Goal: Find a mutually exclusive $m \times k$ submatrix of $A$ with the largest number of non-zero rows

• PROBLEM: driver mutations may not be measured as mutually exclusive due to experimental error. Furthermore, passenger mutations may co-occur in driver pathways.


• We must reformulate the problem. Our current formulation is too strict.

• Instead of strictly mutually exclusive mutations, we’ll attempt to find approximately exclusive mutations:
– most patients have no more than 1 mutation in $M$

• This introduces a tradeoff: increasing coverage tends to decrease exclusivity

Why? We can always increase coverage by adding a new mutation to our set of driver mutations. But this mutation might be highly non-exclusive


• To make this problem mathematically well-defined, we need to formalize this tradeoff

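• We measure the coverage overlap using the following equation:

$\omega(M) = \sum_{g \in M} |\Gamma(g)| - |\Gamma(M)|$

• Given 2 genes $g_1$ (red) and $g_2$ (blue) we can visualize this equation as:

The area of the overlap is $\omega(\{g_1, g_2\}) = |\Gamma(g_1) \cap \Gamma(g_2)|$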
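A small worked example may help make the overlap concrete. The sketch below assumes the paper's definition of coverage overlap (the sum of the per-gene patient counts minus the number of patients covered by the set); the matrix and the name `coverage_overlap` are illustrative assumptions.

```python
# Illustrative data: 5 patients (rows), 2 genes (columns).
A = [
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 1],
    [0, 0],
]

def coverage_overlap(A, genes):
    """Sum of per-gene mutation counts minus the number of covered patients."""
    per_gene_total = sum(row[g] for row in A for g in genes)
    covered = sum(1 for row in A if any(row[g] for g in genes))
    return per_gene_total - covered

# Each gene is mutated in 3 patients and 4 patients are covered overall,
# so the overlap is 3 + 3 - 4 = 2: the two patients mutated in both genes.
print(coverage_overlap(A, [0, 1]))  # 2
```

For two genes the overlap is exactly the number of patients in the intersection, which is the shared region in the two-gene picture.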

Formulating the objective function

• We measure the tradeoff between coverage and exclusivity with the following measure:

$W(M) = |\Gamma(M)| - \omega(M)$

– $|\Gamma(M)|$ measures coverage. The higher the better.
– $\omega(M)$ penalizes non-exclusivity. The lower the better.

Given: An $m \times n$ mutation matrix $A$ and an integer $k > 0$
Goal: Find an $m \times k$ submatrix $M$ of $A$ that maximizes $W(M)$
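As a rough sketch of the objective, the weight can be computed directly from the mutation matrix. This assumes the paper's definition, coverage minus coverage overlap, which simplifies to twice the number of covered patients minus the sum of per-gene mutation counts. The two toy matrices and the name `weight` are hypothetical.

```python
def weight(A, genes):
    """Coverage minus coverage overlap for the gene set `genes`.

    Algebraically: 2 * (covered patients) - (sum of per-gene mutation counts).
    """
    covered = sum(1 for row in A if any(row[g] for g in genes))
    per_gene_total = sum(row[g] for row in A for g in genes)
    return 2 * covered - per_gene_total

# Both toy matrices cover 4 of 5 patients with genes 0 and 1.
exclusive = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
overlapping = [[1, 0], [1, 1], [1, 1], [0, 1], [0, 0]]

print(weight(exclusive, [0, 1]))    # 4: full credit for coverage, no penalty
print(weight(overlapping, [0, 1]))  # 2: same coverage, penalized for 2 shared patients
```

The two sets have identical coverage, so the difference in weight comes entirely from the exclusivity penalty.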

Maximizing the objective function

• The authors prove that solving this problem is NP-hard

• Roughly, this means that no efficient exact algorithm is known: in the worst case we may need to try every combination of $k$ genes to find the one that maximizes $W(M)$

• Thus, we require either an algorithm for finding an approximate solution, or a heuristic
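To see why exhaustive search does not scale, the brute-force approach can be sketched as follows. The weight function assumes the paper's definition (coverage minus coverage overlap); the toy matrix, the function names, and the figure of 20,000 genes are illustrative assumptions.

```python
from itertools import combinations
from math import comb

# Illustrative data: 7 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def weight(A, genes):
    """Coverage minus coverage overlap (assumed definition of the objective)."""
    covered = sum(1 for row in A if any(row[g] for g in genes))
    per_gene_total = sum(row[g] for row in A for g in genes)
    return 2 * covered - per_gene_total

def exhaustive_search(A, k):
    """Score every set of k genes and return the best one."""
    n = len(A[0])
    return max(combinations(range(n), k), key=lambda genes: weight(A, genes))

print(comb(4, 3))               # 4 candidate sets for the toy matrix
print(exhaustive_search(A, 3))  # (0, 1, 2)
print(comb(20000, 6))           # roughly 9e22 candidate sets at a hypothetical genome scale
```

The search is trivial for four genes, but the number of candidate sets grows combinatorially with the number of genes, which motivates the approximate methods that follow.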

The Greedy Approach

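• Greedily add mutations to the current set of driver mutations as long as the objective function increases until $k$ genes are added:
1. $M \leftarrow \{g_1, g_2\}$, the pair of genes that maximizes $W(\{g_1, g_2\})$
2. for $i = 3, \ldots, k$:
   1. $g^* \leftarrow \arg\max_{g} W(M \cup \{g\})$
   2. $M \leftarrow M \cup \{g^*\}$
3. return $M$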
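The numbered steps can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes the weight is coverage minus coverage overlap, seeds the set with the best-scoring pair, and then always adds the best remaining gene until the set has k genes. The toy matrix and function names are hypothetical.

```python
from itertools import combinations

# Illustrative data: 7 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def weight(A, genes):
    """Coverage minus coverage overlap (assumed definition of the objective)."""
    covered = sum(1 for row in A if any(row[g] for g in genes))
    per_gene_total = sum(row[g] for row in A for g in genes)
    return 2 * covered - per_gene_total

def greedy(A, k):
    """Seed with the best pair, then repeatedly add the gene that maximizes the weight."""
    n = len(A[0])
    M = list(max(combinations(range(n), 2), key=lambda pair: weight(A, pair)))
    while len(M) < k:
        candidates = [g for g in range(n) if g not in M]
        best = max(candidates, key=lambda g: weight(A, M + [g]))
        M.append(best)
    return sorted(M)

print(greedy(A, 3))  # [0, 1, 2]
```

On this toy matrix the greedy result matches the exhaustive optimum, but in general the greedy choice can commit to a poor early pair and miss the best set.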

Results – Greedy approach

• Even with this very naïve approach, we can make interesting guarantees on its accuracy under the gene independence model
– Gene mutations are independent
– Driver genes have high coverage
– Each driver mutation contributes to the value of $W(M)$

• Can prove that under this model, we would need 2,400 patients to maximize the objective function with probability $1 - (1 \times 10^{-4})$
– This number of patients is not currently available

Better idea: MCMC

• Markov Chain Monte Carlo (MCMC) is a method for sampling from a complicated joint probability distribution

• Problem:

Given: A joint distribution
Goal: Generate a sample

• Solution: Form a Markov chain such that its stationary distribution is the distribution of interest

Quick review: Markov chains

• A Markov chain is a basic model of a stochastic process. It consists of a set of states and probabilities for transitioning from state to state

• Example:

• The stationary distribution is the probability of being in each state if we let the random process traverse from state to state for an infinite amount of time
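As a small illustration of a stationary distribution, the sketch below repeatedly applies a transition matrix to a starting distribution until it settles. The two-state chain and its transition probabilities are hypothetical and are not the example from the slides.

```python
# Hypothetical two-state chain: P[i][j] is the probability of moving from state i to state j.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

def step(dist, P):
    """Advance a distribution over states by one transition of the chain."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]  # start in state 0 with certainty
for _ in range(200):
    dist = step(dist, P)

# For this chain the stationary distribution is (5/6, 1/6):
# solving pi = pi P gives pi_0 * 0.1 = pi_1 * 0.5.
print([round(p, 4) for p in dist])  # [0.8333, 0.1667]
```

Starting from state 1 instead gives the same limit, which is what makes the stationary distribution a property of the chain rather than of the starting point.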

The MCMC Approach

• Sample from sets of genes in proportion to $e^{cW(M)}$ for a constant $c > 0$

• We do so by forming a Markov chain such that each state in the Markov chain is associated with a set of genes $M$

• Stochastically transition from state to state. The most frequently visited state is most likely to have the highest $W(M)$

The MCMC Approach

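• More specifically, given current state $M_t$ we obtain $M_{t+1}$ as follows:
1. Choose a gene $w$ uniformly at random from the global set of genes $G$
2. Choose gene $v$ uniformly at random from $M_t$
3. Let $P(M_t, w, v) = \min\left[1, e^{cW(M_t - \{v\} + \{w\}) - cW(M_t)}\right]$
4. With probability $P(M_t, w, v)$ set $M_{t+1} = M_t - \{v\} + \{w\}$, otherwise $M_{t+1} = M_t$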
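The swap-and-accept steps can be sketched as a Metropolis-style sampler. This is a simplified illustration under stated assumptions, not the authors' code: the weight is taken to be coverage minus coverage overlap, the acceptance probability is min(1, exp(c * (new weight - old weight))), and a proposed gene that is already in the current set is treated as a no-op so that the set always keeps exactly k genes. The toy matrix, parameter values, and function names are hypothetical.

```python
import math
import random
from collections import Counter

# Illustrative data: 7 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def weight(A, genes):
    """Coverage minus coverage overlap (assumed definition of the objective)."""
    covered = sum(1 for row in A if any(row[g] for g in genes))
    per_gene_total = sum(row[g] for row in A for g in genes)
    return 2 * covered - per_gene_total

def mcmc(A, k, c=1.0, iterations=20000, seed=0):
    """Count how often each k-gene set is visited by the swap-based chain."""
    rng = random.Random(seed)
    n = len(A[0])
    M = rng.sample(range(n), k)  # arbitrary starting set
    visits = Counter()
    for _ in range(iterations):
        w = rng.randrange(n)  # candidate gene, drawn from all genes
        v = rng.choice(M)     # gene in the current set to swap out
        if w in M:
            proposal = M      # assumption: no-op keeps the set at exactly k genes
        else:
            proposal = [w if g == v else g for g in M]
        accept = min(1.0, math.exp(c * (weight(A, proposal) - weight(A, M))))
        if rng.random() < accept:
            M = proposal
        visits[frozenset(M)] += 1
    return visits

visits = mcmc(A, k=3)
best_set, _ = visits.most_common(1)[0]
print(sorted(best_set))  # the most frequently visited set
```

Moves that raise the weight are always accepted, while moves that lower it are accepted with exponentially decreasing probability, so the chain spends most of its time in high-weight sets without getting permanently stuck.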

The MCMC Approach

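• With this definition of the transition matrix, the stationary distribution is $\pi(M) = \frac{e^{cW(M)}}{\sum_{R \in \mathcal{M}_k} e^{cW(R)}}$, where $\mathcal{M}_k$ is the collection of all sets of $k$ genes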

• The authors prove that this Markov chain approaches its stationary distribution quickly

Results – Simulated data

• Generated 2 simulated datasets
– A dataset starting from a set of 6 genes
– A dataset consisting of 2 driver pathways and

• Controlled coverage and exclusivity

• Simulated passenger mutations using observed characteristics in Glioblastoma data

• Simulated both single-nucleotide mutations as well as copy-number aberrations (CNAs)

• Ran the MCMC algorithm for $10^7$ iterations and sampled every $10^4$ iterations on each dataset

Results – Simulated data

Results – real data

• Built matrices from various cancer genome studies

• Searched for sets of size $k$

• Once a statistically significant set of mutations was found, they removed it from the matrix and re-ran the algorithm to find new sets

• Performed a statistical test. The test statistic was $W(M)$ and the null model was obtained by independently permuting the mutations for each mutation group among the patients
– This preserved the mutation frequency
– The reason for doing this is to assess the significance of the coverage and exclusivity given a fixed mutation frequency
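The permutation test can be sketched as follows. This is a simplified illustration under stated assumptions: each gene's column is shuffled independently among the patients, which preserves that gene's mutation frequency, and the statistic is the weight (coverage minus coverage overlap) of one fixed gene set rather than the best set found by a full search. The toy matrix, trial count, and function names are hypothetical.

```python
import random

# Illustrative data: 7 patients (rows), 4 genes (columns).
A = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def weight(A, genes):
    """Coverage minus coverage overlap (assumed definition of the statistic)."""
    covered = sum(1 for row in A if any(row[g] for g in genes))
    per_gene_total = sum(row[g] for row in A for g in genes)
    return 2 * covered - per_gene_total

def permute_columns(A, rng):
    """Shuffle each gene's column independently, preserving its mutation count."""
    m, n = len(A), len(A[0])
    columns = []
    for g in range(n):
        col = [A[i][g] for i in range(m)]
        rng.shuffle(col)
        columns.append(col)
    return [[columns[g][i] for g in range(n)] for i in range(m)]

def permutation_p_value(A, genes, trials=1000, seed=0):
    """Fraction of permuted matrices whose weight is at least the observed weight."""
    rng = random.Random(seed)
    observed = weight(A, genes)
    hits = sum(weight(permute_columns(A, rng), genes) >= observed
               for _ in range(trials))
    return (hits + 1) / (trials + 1)  # add-one correction avoids a p-value of zero

p = permutation_p_value(A, [0, 1, 2])
print(p)
```

Because the per-gene mutation counts are held fixed, a small p-value indicates that the observed combination of coverage and exclusivity is unusual for genes with those frequencies, rather than a side effect of how often each gene is mutated.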

Results – multiple cancer types

Results – Lung adenocarcinoma

Results – Glioblastoma multiforme

Discussion

• Is there an underlying network model?

• In contrast to nearly every other method that we have discussed in this class, this method does not utilize a biological network such as a protein-protein or protein-DNA interaction network
– Can we incorporate such a network into this method?

• Are coverage and exclusivity the best metrics for finding driver mutations?

• Does their objective function correctly capture coverage and exclusivity?

• What other methods could they have tried in order to solve their combinatorial optimization problem?

• How can this method be validated with biological experiments?
