an introduction to graph mining - leiden...
TRANSCRIPT
An Introduction to Graph Mining
An Introduction to Graph Mining
Hossein Rahmani
8 December 2009
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 1 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph MiningGraph AlgorithmsFunction Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 2 / 49
An Introduction to Graph Mining
From Data Mining To Graph Mining
Data MiningClassificationClusteringAssociation rule learning
Graph MiningPowerful way to representdataoutput: expressed asgraphs
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 3 / 49
An Introduction to Graph Mining
Graph Mining Domains
Internet Movie DatabaseWeb DataSocial Networks AnalysisBio-Informatics
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 4 / 49
An Introduction to Graph Mining
Internet Movie Database
Movie RecommendationCommunity DetectionPrediction: $2 million in opening weekend
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 5 / 49
An Introduction to Graph Mining
Web Mining
Web Content MiningTopic Prediction
Web Structure MiningCommunity Mining
Web Usage MiningWebsite Roadmap
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 6 / 49
An Introduction to Graph Mining
Social Networks
Relationships and flowsbetween peopleTechnologies
Email, BlogsSocial Networking Softwarelike Orkut, FaceBook
Questions:Who has control over whatflows in the Network?Who has best visibility of whatis happening in the Network?Customer Network Value
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 7 / 49
An Introduction to Graph Mining
Protein-Protein Interaction(PPI) Network
Nodes: ProteinsEdges: Interaction among theProteinsOpen Problems:
Function PredictionDrug Discovery
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 8 / 49
An Introduction to Graph Mining
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 9 / 49
An Introduction to Graph Mining
Drug Discovery in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 10 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph MiningGraph Algorithms
Some DefinitionsGraph MatchingGraph CompressionGraph based Decision Tree
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 11 / 49
An Introduction to Graph Mining
Graph Definition
Graph: G = (V ,E, µ, v)V : finite set of nodes.E ⊆ V× V denotes a set of edges.µ : V → LV denotes a node labeling function.v : E → LE denotes an edge labeling function.
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2)Graph G1 is a subgraph of G2, written G1 ⊆ G2,if:
V1 ⊆ V2E1 ⊆ E2µ1(u) = µ2(u) for all u ∈ V1.v1(u, v) = v2(u, v) for all (u, v) ∈ E1.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 12 / 49
An Introduction to Graph Mining
Graph Definition
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2),A graph isomorphism between G1 and G2 is a bijective functionf : V1 → V2 satisfying
µ1(u) = µ2(f (u)) for all nodes u ∈ V1.For every edge e1 = (u, v) ∈ E1, there exists an edgee2 = (f (u), f (v)) ∈ E2 such that v1(e1) = v2(e2).
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 13 / 49
An Introduction to Graph Mining
Graph Definition
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2)Any bijective function f : V̂1 → V̂2, where V̂1 ⊆ V1 and V̂2 ⊆ V2 iscalled edit path from G1 to G2.Example: f = {u1 → v3,u2 → ε, ..., ε→ v6}u1 → v3: Substitution of node u1 ∈ V1 by node v3 ∈ V2
u2 → ε: Deletion of node u2 ∈ V1
ε→ v6: Insertion of v6 ∈ V2
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 14 / 49
An Introduction to Graph Mining
Frequent Subgraph
Frequent subgraphs:support (subgraph) >=minimum support
Usage:Graph ClassificationGraph ClusteringGraph Indexing
Detection Algorithms:Apriori-Based ApproachPattern Growth Approach
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 15 / 49
An Introduction to Graph Mining
Apriori-Based Approach
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 16 / 49
An Introduction to Graph Mining
Pattern Growth Approach
A graph G is extended by adding new edge e.Edge e may or may not introduce a new node to G.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 17 / 49
An Introduction to Graph Mining
Graph Compression
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 18 / 49
An Introduction to Graph Mining
Graph Compression
Portion of DNAEasy to Analyze?
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 19 / 49
An Introduction to Graph Mining
Graph Compression
Si compresses the inputDNA.Cluster hierarchy
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 20 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Decision TreeEach branch corresponds to attribute valueEach leaf node assigns a classification
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 21 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 22 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 23 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 24 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph MiningGraph AlgorithmsFunction Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 25 / 49
An Introduction to Graph Mining
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 26 / 49
An Introduction to Graph Mining
Formal Description of PPI Networks
Represented as an undirected graphNode set V → ProteinsEdge set E → Direct interaction
∀v ∈ V is described by a description d(v) ∈ Dd(v) derived from the network structureNo additional information, such as the protein structure is available
∀v ∈ V optionally is annotated with a label l(v) ∈ LLabels l(v) are sets of protein functionsE.g., metabolism, transcription, protein synthesis and etc
We assume there is a true labeling function λ that is l(v) = λ(v)where l(v) is definedTask: Find a suitable λ(v) where l(v) is not defined
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 27 / 49
An Introduction to Graph Mining
Related Works
1 Transductive approachesLocal: Majority Rule and its extensionsGlobal: Global Optimization andFunctional Clustering
2 Inductive approachesLocal: Topological RedundanciesGlobal: ? → Our Method
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 28 / 49
An Introduction to Graph Mining
Majority Rule
Local transductive methodAssumption: Two Interactingproteins have something incommon (e.g., same function)Predicted function: Mostcommon function(s) amongclassified partners
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 29 / 49
An Introduction to Graph Mining
Majority Rule
Problem: Links unclassified-unclassified proteins completelyneglectedSolution: Global optimization methods
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 30 / 49
An Introduction to Graph Mining
Functional Clustering
Global transductive methodAssumption: Dense regions are a sign ofthe common involvement in biologicalprocessPredict the function of unclassified proteinbased on the cluster they belong to
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 31 / 49
An Introduction to Graph Mining
Our Method: A Global Description of Proteins
Global inductive approachNode description
N nodes in the network numbered from 1 to NEach node is described by an n-dimensionalvectori ’th component in the vector of node v gives thelength of shortest path between v and node iProbelm: Large Graph→ very high dimensionaldescriptionsSolution: Reduce dimensionality by focusing onshortest-path distance to a few "important" nodes
Feature selection problem
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 32 / 49
An Introduction to Graph Mining
Important Proteins
Definition: Node i is important if the shortest-path distance ofsome node v to node i is likely to be relevant for v ’s classificationFeature fi denotes the shortest-path distance to node iAnova based feature selection
For each function j , let Gj be the set of all proteins that have thatfunction jLet f̄ij be the average fi value in GjLet var(fij ) the variance of the fi in GjAnova (analysis of variance) based relevancy measure:
Ai =Varj [f̄ij ]
Meanj [var(fij )](1)
A high Ai denotes a high relevance of feature fi
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 33 / 49
An Introduction to Graph Mining
Important Proteins
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 34 / 49
An Introduction to Graph Mining
Comparison of Learners
Random Forest performs best among all the learners in 13 out of17 casesOther 4 cases are all characterized by a very high class skewRandom Forest: Best candidate for learning from this type of data
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 35 / 49
An Introduction to Graph Mining
Different Number of Important Proteins
We select the 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80,90, 100, 200 and 300 most important proteins.Describe each protein using its distance to the n most importantproteins.Using random forest for function prediction and record precision,recall, F-measure and AUC.In 4 Datasets for 17 functions.The shape of the curves is qualitatively very similar in all cases.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 36 / 49
An Introduction to Graph Mining
Less than 10 Important Proteins
Bad Bad predictive performance.They do not reach their maximum.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 37 / 49
An Introduction to Graph Mining
10-50 Important Proteins
There is usually a major improvement in all four metrics.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 38 / 49
An Introduction to Graph Mining
50-70 Important Proteins
For most of the functions, selecting 50-70 important proteins isenough to obtain good classication results.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 39 / 49
An Introduction to Graph Mining
Conclusions
Graph Mining DomainsIntroduction to Graph Algorithms
Graph MatchingGraph CompressionGraph based Decision Tree
Protein Function prediction
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 40 / 49
An Introduction to Graph Mining
Thanks!
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 41 / 49
An Introduction to Graph Mining
References
Cook, D. J. and Holder, L. B. 2006 Mining Graph Data. John Wiley& Sons.Rahmani, H., Blockeel, H. and Bender, A., Proc. 3rd Int.Workshop in Machine Learn. Syst. Biol. (MLSB09) 2009 85 – 94.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 42 / 49
An Introduction to Graph Mining
Transductive Learning
Task: Predict the label of all the nodesInput: G = (V ,E ,d , l) with l a partial functionOutput: Complete version G′ = (V ,E ,d , l ′) with l ′ a completefunction that is consistent with ll ′ should approximates λ by optimization criterion oo expresses our assumption about λ
E.g., directly connected nodes tend to have similar labelsNumber of {v1, v2} edges where l ′(v1) 6= l ′(v2) edges should beminimal
Our assumption about λ is called bias of transductive learner
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 43 / 49
An Introduction to Graph Mining
Inductive Learning
Task: Learn a function f : D → L that maps a node descriptiond(v) onto its label l(v)
Input: G = (V ,E ,d , l) with l a partial functionOutput: f : D → L such that f (d(v)) = l(v)
Note: f differs from l in that it maps D, not V , onto LIt can make prediction for node v that is not in the original network,as long as d(v) is known
BiasesTransductive bias: Assumption expressed by optimization criterionoDescription bias DInductive bias: Choice of learning algorithm that is used to learn ffrom (d(v), l(v)) couples
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 44 / 49
An Introduction to Graph Mining
Global Optimization
Global transductive methodLinks unclassified/unclassified proteinsalso taken into accountAny probable function assignment to thewhole set of unclassified proteins isconsidered
Counting number of interacting pairswith no common functionsSelect the function assignment withlowest value
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 45 / 49
An Introduction to Graph Mining
Topological Approaches
Local inductive methodNode description d(v) is built based on the local neighborhoodCount number of patterns (e.g., graphlet) around the proteinsMake the signature vector for each proteinAssumption: Proteins with high similar signature vector havesame functions
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 46 / 49
An Introduction to Graph Mining
Topological Approaches
Some topological patterns (Number of considered patterns = 73)
Orbit: One of the previous patternsSame orbit frequency→ same function
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 47 / 49
An Introduction to Graph Mining
Topological Approaches
Local inductive methodNode description d(v) is built based on the local neighborhoodCount number of patterns (e.g., graphlet) around the proteinsMake the signature vector for each proteinAssumption: Proteins with high similar signature vector havesame functions
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 48 / 49
An Introduction to Graph Mining
Topological Approaches
Some topological patterns (Number of considered patterns = 73)
Orbit: One of the previous patternsSame orbit frequency→ same function
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 49 / 49