association rule mining on rnaseq data trends in ... · association rule mining on rnaseq data 2....
TRANSCRIPT
Association Rule Mining on RNAseq DataTrends in Bioinformatics
Lukas Ehrig
Jan 23th, 2019
Association Rule Mining on RNAseq Data
Agenda
1. Motivation2. Method
2.1. Filtering Association Rules2.2. Interestingness Measures2.3. Weighting of Interestingness Measures
3. Experiments4. Conclusion
Chart 2
Association Rule Mining on RNAseq Data
1. Motivation
Image: [2]
Chart 3
Association Rule MiningDiscovering significant relations between attributes (“items”) in large datasets.
Association Rules
Gene1↑ → Gene2↓
Illness → Gene1↑, Gene2↓Gene1↑, Gene2↓ → Illness
Association Rule Mining on RNAseq Data
1. Motivation
Chart 4
Association Rule Mining on RNAseq Data
1. Motivation
Chart 5
Number of features infeasible Exponential number of possible frequent item sets→ Runtime complexity
Huge number of possible / output rules n frequent items yield 2n − 2 association rules→ which rules are relevant/interesting?
⇒ Select relevant subset of rules
Association Rule Mining on RNAseq Data
2. Method
Chart 6
Association Rule Mining on RNAseq Data
2. Method
Chart 7
Association Rule Mining on RNAseq Data
2. Method - Filtering of Association Rules
Filter association rules that can be removed for formal reasons or because of what the researcher wants to find out or use the results for.
Chart 8
Examples
Item-FilterFilter association rules containing wanted/unwanted items, e.g. Gene1― → Gene2―.
Redundancy-FilterRemove rules that can be deduced from other rules.
If using multiple filters in a row, the order might matter!
Association Rule Mining on RNAseq Data
2. Method - Min-Max Filter
Min-Max Filter
Idea The antecedent of a rule should be as small, the consequent as large as possible.
Min-Max1: ensure, that or hold for rule .
Min-Max2: Given two rules, and , remove , if and . If , we do not lose information!
Example
r0: X,Y,Z → Cr1: X,Y → C Chart 9
(Strong) Monotonicity
If a rule X → Y holds on a statistically large subset of a dataset D, then X → ¬Y does not hold on any statistically large subset of D.
Example
G2↓ → G4↑G2↓, G3↑ → G4↓
Association Rule Mining on RNAseq Data
2. Method - Monotonicity
Chart 10
Association Rule Mining on RNAseq Data
2. Method - Monotonicity Filter
Graph creation Rule Filtering
Chart 11
Data: TCGA-LAML_TCGA-GBM
Association Rule Mining on RNAseq Data
2. Method
Chart 18
Association Rule Mining on RNAseq Data
2. Method
Chart 19
Interestingness Measures
Association Rule Mining on RNAseq Data
2. Method - Interestingness Measures
Interestingness Measure
Objective Measure Subjective Measure
Chart 20
Interestingness Measures
ObjectiveDepend only on the data and patterns, no knowledge of user or application required. Mostly based on theories in probability, statistics or information theory.
Examples
SupportConfidenceLift
Association Rule Mining on RNAseq Data
2. Method - Objective Interestingness
Chart 21
Interestingness Measures
SubjectiveTake into account both the data and the user of these data.Definition requires domain knowledge (or background knowledge about the data) and its explicit representation.
Examples
NoveltyUnexpectednessActionability
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Chart 22
External domain knowledge
Genes associated with medical condition:http://www.disgenet.org/https://www.targetvalidation.org/
Co-Expression of genes:http://coxpresdb.jp/http://www.genefriends.org
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Figure: Disgenet searchChart 23
First idea:
Jaccard-Index:
Jaccard distance:
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Chart 24
Intuition:
Index measures “overlap” / similarity of sets. If there are lots of genes associated with a medical condition (B), or a rule (A) is very long, an overlap is more likely, but the denominator also increases.
First idea:
For a rule X → Y and a set B of genes associated with a medical condition: measures how much a rule corresponds to the domain knowledge.
measures the novelty of an association rule.
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Chart 25
First idea:
For a rule X → Y and a set B of genes associated with a medical condition: measures how much a rule corresponds to the domain knowledge.
measures the novelty of an association rule.
Problems:
Scores for different medical conditions hardly comparableIf B is small, is almost always 0!
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Chart 26
Next idea:
Define a subjective interestingness measure based on gene co-expression.
E.g. how strongly do the genes in the antecedent of the association rulecorrelate with with the genes in the consequent of the association rule?
Association Rule Mining on RNAseq Data
2. Method - Subjective Interestingness
Chart 27
planned
Association Rule Mining on RNAseq Data
2. Method
Chart 28
Association Rule Mining on RNAseq Data
2. Method
Chart 29
Given interestingness measures in a vector and an -dimensional weight vector , a combined interesting score can be computed by:
Scalar product
Geometric Method
Association Rule Mining on RNAseq Data
2. Method - Weighting
Setting weights
Value range of e.g. support , confidence
Effect of weights in geometric method
increases weight of , but decreases weight of
Chart 30
Setup
DataRNAseq data from The Cancer Genome Atlas (TCGA)DSI: TCGA-BRCA_TCGA-PAAD (307 samples; selected 2.7k genes by variance)DSII: TCGA-LAML_TCGA-GBM (1,28k samples; selected 2.5k genes by variance)
All data sets were discretized by threshold.
Association Rule Miner
Weka 3.9 Apriori Associator
Association Rule Mining on RNAseq Data
3. Experiments
Chart 31
Experiment #1Number of filtered rules
1.1 Min-Max1 filter
Association Rule Mining on RNAseq Data
3. Experiments
Chart 32
Experiment #1Number of filtered rules
1.2 Min-Max2 filter 1.3 Monotonicity filter
Association Rule Mining on RNAseq Data
3. Experiments
Chart 33
Association Rule Mining on RNAseq Data
3. Experiments
Chart 34
Experiment #2 (planned)Compute average change in rank for association rules after weighting with different subjective interestingness measures.
Experiment #3 (planned)Evaluate new subjective novelty measure(s) by holding back domain knowledge and see, whether it is discovered.
Experiment #4 (planned)Retrieve gene network from additional external domain knowledge source and compare to (weighted) association rules / generated graph.
...
Filtering techniques are powerful to reduce number of association rules.
Assessing the biological relevance of rules by using subjective interestingness measures proved to be challenging at first try.
Discussion
How to improve the first subjective interestingness measure based on the Jaccard index?
What other subjective interestingness measures are conceivable?
Other ideas regarding experiments and their evaluation?
Association Rule Mining on RNAseq Data
4. Conclusion
Chart 35
Association Rule Mining on RNAseq DataTrends in Bioinformatics
Lukas Ehrig
Jan 23th, 2019
Title/Final Slide:Boehm Konstruktion http://www.boehm-konstruktion.com/referenzen/hasso-plattner-high-tech-park/
[2] Central Dogma of Molecular Biology:Khan Academyhttps://www.khanacademy.org/science/high-school-biology/hs-molecular-genetics/hs-rna-and-protein-synthesis/a/intro-to-gene-expression-central-dogma
The BPMN model of the method was made using Signavio Academic Initiative:https://academic.signavio.com/
Association Rule Mining on RNAseq Data
Appendix: Image Sources
Association Rule Mining on RNAseq Data
Appendix: Gene Expression Data
Chart 38
Chart 39
Gene
Association Rule Mining on RNAseq Data
Appendix: Gene Expression Data
Chart 40
Sample
Association Rule Mining on RNAseq Data
Appendix: Gene Expression Data
Chart 41
Discretize
Association Rule Mining on RNAseq Data
Appendix: Gene Expression Data
Chart 42
Association Rule Mining on RNAseq Data
Appendix: Gene Expression Data
Chart 43
{Headset,Laptop}
Association Rule Mining on RNAseq Data
Appendix: Association Rule Mining
Chart 44
A: Laptop ⇒ Headsetsup(A) = 4conf(A) = 0.8B: Headset ⇒ Laptopsup(B) = 4 _conf(B) = 0.6
Association Rule Mining on RNAseq Data
Appendix: Association Rule Mining
Chart 45
Association Rule Mining on RNAseq Data
Appendix: Association Rule Mining
Chart 46
Transaction
Item: ENSG0..03=-1
Transaction ID
Association Rule Mining on RNAseq Data
Appendix: Association Rule Mining
Redundant Association Rules
An association rule is redundant if its structure and statistical measures can be deduced from another rule.
Different notions of redundancy. Based on Galois-closure according to Zaki (for approximate association rules):
Association Rule Mining on RNAseq Data
Appendix: Redundancy Filter
[Mohammed J. Zaki. 2004. Mining Non-Redundant Association Rules. Data Min. Knowl. Discov. 9, 3 (Nov. 2004), 223-248.]
Chart 48
Association Rule Mining on RNAseq Data
Appendix: Non-redundant Association Rules
[Mohammed J. Zaki. 2004. Mining Non-Redundant Association Rules. Data Min. Knowl. Discov. 9, 3 (Nov. 2004), 223-248.]
Chart 49
Association Rule Mining on RNAseq Data
Appendix: Non-redundant Association Rules
[Mohammed J. Zaki. 2004. Mining Non-Redundant Association Rules. Data Min. Knowl. Discov. 9, 3 (Nov. 2004), 223-248.]
Chart 50
Association Rule Mining on RNAseq Data
Appendix