ECE 4502/6502 & CS 6501: Graph Mining

Instructor: Jundong Li
Spring 2020, University of Virginia

Midterm Exam Review

Logistics of the exam

• Format: Online and open book
• Time: Thursday, March 19, 3:30 PM - 4:45 PM
• Where to download/upload exam questions?
  • Collab -> Assignment
• When can I download exam questions?
  • Available to download from 3:10 PM (20 minutes before the exam)
• When do I need to upload my answers?
  • Submission will close at 5:05 PM (20 minutes after the exam)
• How can I upload my answers?
  • (i) Electronically; or (ii) handwritten, then scanned or photographed

The covered topics

• Chapter 1. Introduction
• Chapter 2. Graph Essentials
• Chapter 3. Network Measures
• Chapter 4. Network Models
• Chapter 5. Data Mining Essentials
• Chapter 6. Community Analysis
• Chapter 7. Information Diffusion
• Chapter 8. Recommender System

Organization of the exam

• 1~2 true/false questions and 1 regular question for each chapter (10 T/F questions and 8 regular questions in total)

• For the T/F questions, you need to decide whether the statement is true or not

• For the regular questions, you will be asked to write down your thoughts about a problem or do some simple calculations

Chapter 1: Introduction

Graph mining

• What are graphs/networks?

• Why are graphs important?

• What is graph mining?

• What are interesting applications of graph mining?

Chapter 2: Graph Essentials

Graph basics

• What is a network/graph?
• Nodes and edges
• Undirected networks vs. directed networks
• In-degree and out-degree
• What kind of degree distribution? – power-law
• What is the power-law distribution and how to interpret it?

Graph representation

• Adjacency matrix
• Adjacency list
• Edge list
• Given a graph, can you draw the corresponding adjacency matrix, adjacency list and edge list?
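As a minimal sketch of how the three representations relate, the Python below converts a small edge list into an adjacency matrix and an adjacency list. The four-node graph and the helper names `to_adjacency_matrix` and `to_adjacency_list` are made up for illustration; the code assumes an undirected, unweighted graph whose nodes are numbered from 0.

```python
# Hypothetical undirected example graph with 4 nodes, given as an edge list.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4


def to_adjacency_matrix(edges, n):
    """Symmetric 0/1 matrix: A[i][j] = 1 iff an edge joins i and j."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected: mirror every edge
    return A


def to_adjacency_list(edges, n):
    """Map each node to the sorted list of its neighbors."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {node: sorted(neighbors) for node, neighbors in adj.items()}


adjacency_matrix = to_adjacency_matrix(edge_list, n)
adjacency_list = to_adjacency_list(edge_list, n)
```

For a directed graph, the mirrored assignment would be dropped so that the matrix is no longer forced to be symmetric.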

Connectivity in graphs

• What is a random walk, and how do you perform one on a graph?
• What is a component of an undirected graph, and what are the strongly and weakly connected components of a directed graph?
• What is the shortest path between two nodes?
• What is the diameter of a graph and how to calculate it?

Special subgraphs

• What is a minimum spanning tree (MST)?

• What is a complete graph?
  • All possible edges exist
• What is a regular graph?
  • In a k-regular graph, all nodes have degree k

Graph algorithms

• Depth first search and breadth first search
  • Depth first search – stack structure
  • Breadth first search – queue structure
• How to find the shortest path with Dijkstra’s algorithm?
• How to find the minimum spanning tree with Prim’s algorithm?
• Given a piece of pseudocode, can you get a sense of what it does?
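The stack/queue distinction and Dijkstra's algorithm can be sketched in a few lines of Python. This is an illustrative sketch, not the pseudocode used in the lecture: the function names and both example graphs are hypothetical, and `dijkstra` assumes non-negative edge weights. Prim's algorithm is not shown.

```python
import heapq
from collections import deque


def bfs_order(adj, start):
    """Visit nodes level by level using a FIFO queue."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order


def dfs_order(adj, start):
    """Visit nodes depth first using an explicit LIFO stack."""
    visited, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u in visited:
            continue
        visited.add(u)
        order.append(u)
        # Push in reverse so the first-listed neighbor is explored first.
        for v in reversed(adj[u]):
            if v not in visited:
                stack.append(v)
    return order


def dijkstra(weighted_adj, source):
    """Shortest-path distances from source (non-negative weights assumed)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in weighted_adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical example graphs.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
weighted_adj = {
    "a": [("b", 4), ("c", 1)],
    "b": [("a", 4), ("c", 2), ("d", 5)],
    "c": [("a", 1), ("b", 2), ("d", 8)],
    "d": [("b", 5), ("c", 8)],
}
```

The only structural difference between `bfs_order` and `dfs_order` is which end of the container the next node is taken from, which is the point of the stack-versus-queue bullets above.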

Chapter 3: Network Measures

Centrality

• Degree centrality and normalized version
  • Formulation
• Eigenvector centrality
  • Advantages over degree centrality
  • Formulation
  • Which eigenvalue-eigenvector pair should we choose?
• Katz centrality
  • What are the problems of eigenvector centrality?
  • Formulation – why does the bias term 𝛽 help?
• PageRank
  • What are the problems of the preceding algorithms?
  • Formulation of PageRank
  • How to choose the parameter 𝛼 in PageRank?
  • What is the power method, and why use it?
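To make the power method concrete, here is a hedged Python sketch of PageRank by repeated multiplication. It uses the common random-surfer form r = 𝛼 Pᵀ r + (1 − 𝛼)/n, which may be normalized differently from the formulation used in the lecture. The function name, the three-node graph, and the default `alpha=0.85` are assumptions, and the code assumes every node has at least one out-link.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on a directed graph.

    out_links maps each node to the list of nodes it points to.
    Assumption: no dangling nodes (every node has an out-link).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # start from the uniform vector
    for _ in range(max_iter):
        # Teleportation term, then add the share passed along each edge.
        new_rank = {u: (1.0 - alpha) / n for u in nodes}
        for u in nodes:
            share = alpha * rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += share
        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tol:
            break  # the vector has stopped moving: converged
    return rank


# Hypothetical three-node directed graph.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(out_links)
```

The power method is attractive here because each iteration only touches the edges once, so the dominant eigenvector is approximated without a full eigendecomposition.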

Centrality

• Betweenness centrality and normalized version
  • Formulation and calculation
• Closeness centrality
  • Formulation and calculation

Transitivity and reciprocity

• What is transitivity?
• Clustering coefficient measures transitivity in undirected graphs – how to calculate it?
• Local clustering coefficient measures transitivity at the node level – how to calculate it?
• What is reciprocity?
• Given a directed graph, how to calculate its reciprocity?
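The calculations asked about above can be sketched as follows. The definitions used are common ones and are an assumption about what the lecture intends: the local coefficient is the fraction of a node's neighbor pairs that are connected, the global coefficient is closed triples over all connected triples, and reciprocity is the fraction of directed edges whose reverse edge also exists. Function names and example graphs are hypothetical.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def global_clustering(adj):
    """Closed triples divided by all connected triples."""
    closed = triples = 0
    for v in adj:
        for a, b in combinations(adj[v], 2):
            triples += 1  # a - v - b is a connected triple centered at v
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def reciprocity(directed_edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) also exists."""
    edge_set = set(directed_edges)
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical undirected graph: a triangle 0-1-2 with a pendant node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# Hypothetical directed graph with one mutual pair (0 <-> 1).
directed_edges = [(0, 1), (1, 0), (1, 2), (2, 0)]
```

In this example node 2 has three neighbors but only one connected pair among them, so its local coefficient is 1/3 while nodes 0 and 1 score 1.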

Balance and social status

• What is social balance theory?
  • In which cases is a triangle relationship stable?
• What is social status theory?
  • How to determine if the status theory is violated?

Similarity

• What is structural equivalence?
  • Look at the neighborhood shared by two nodes
  • How to calculate Jaccard and Cosine similarity?
  • How to measure the significance of similarity?
    • Compare the calculated similarity value with its expected value where vertices pick their neighbors at random
• What is regular equivalence?
  • Look at how similar the neighborhoods themselves are
  • How to calculate the regular equivalence?
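As a small worked example of structural equivalence, the sketch below computes Jaccard and cosine similarity from neighbor sets. It assumes the neighborhood-based definitions, |N(u) ∩ N(v)| / |N(u) ∪ N(v)| and |N(u) ∩ N(v)| / sqrt(|N(u)| |N(v)|); the graph and function names are hypothetical.

```python
import math


def jaccard_similarity(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def cosine_similarity(adj, u, v):
    """Shared neighbors divided by the geometric mean of the two degrees."""
    denom = math.sqrt(len(adj[u]) * len(adj[v]))
    return len(adj[u] & adj[v]) / denom if denom else 0.0


# Hypothetical undirected graph stored as neighbor sets.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
```

Here nodes a and b share two neighbors out of three distinct ones, giving a Jaccard similarity of 2/3 and a cosine similarity of 2/sqrt(6).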

Chapter 4: Network Models

Properties of real-world networks

• Degree distribution
  • Power-law degree distribution - scale-free networks
  • How to decide if it follows power-law distribution?
• Clustering coefficient
  • High clustering coefficient
• Average path length
  • The average path length is small

Random graphs

• What is a random graph?
• How to build a random graph?
  • 𝐺(𝑛, 𝑚) model
  • 𝐺(𝑛, 𝑝) model
• When do the first and second phase transitions occur?
• Properties of random graphs compared with real-world networks
  • Degree distribution?
  • Clustering coefficient – what is the expected local and global clustering coefficient?
  • Average path length?

Small-world model

• What is a small-world graph?
• How to construct a small-world graph?
  • Why start from regular (ring) lattice?
  • How to add randomness into the regular lattice?
• Properties of small-world graphs compared with real-world networks
  • Degree distribution?
  • Clustering coefficient? How to calculate the global and local clustering coefficient of regular lattice?
  • Average path length?
• How will the clustering coefficient and average path length change?

Preferential attachment model

• What’s the basic intuition of the preferential attachment model? - The rich get richer
• How to construct a preferential attachment model?
• Properties of preferential attachment model compared with real-world networks
  • Degree distribution?
  • Clustering coefficient?
  • Average path length?
• What are the major differences between the network models?

Chapter 5: Data Mining Essentials

Data mining and data

• What is the KDD process?
• What are the differences between data mining and databases?
• What are typical data types (nominal, ordinal, interval, ratio) and permissible operations?
• Text representation
  • Vector space model
  • TF-IDF – how to obtain the TF-IDF representation?
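One way to obtain a TF-IDF representation is sketched below. The weighting is an assumption: raw term count times log2(N / document frequency). Other variants (normalized counts, smoothed or natural-log IDF) are also in use, so check which one the course materials define. The three toy documents and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term count * log2(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical toy corpus.
documents = ["graph mining graph", "graph theory", "data mining"]
vectors = tf_idf(documents)
```

Terms that appear in few documents ("theory", "data") receive a larger IDF than terms spread across the corpus ("graph", "mining"), and a term present in every document would receive weight 0.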

Data quality and processing

• What data qualities need to be checked before applying data mining algorithms?

• What are the typical data processing techniques and when will they be used?

Supervised learning

• What is the process of supervised learning such as classification and regression?
• Classification models
  • Decision tree classifier
  • K-nearest neighbor classifier
  • Naïve Bayes classifier
  • Classification with network information
• Regression models
  • Linear regression
  • Logistic regression

Evaluation of supervised learning

• Training set and test set
• What is leave-one-out and what is K-fold cross validation?
• Evaluation of classification
  • Accuracy
  • Precision, recall, and F1-measure
• Evaluation of regression
  • RMSE
  • MAE
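The evaluation measures listed above can be computed directly from their standard definitions. The sketch below is illustrative; function names and the convention that label 1 is the positive class are assumptions.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For instance, with true labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 1, 0]` there are two true positives, one false positive and one false negative, so precision, recall and F1 are all 2/3.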

Unsupervised learning

• What is the target of clustering?
• Different distance measures
  • Euclidean distance
  • L1-norm distance
  • Cosine distance
• Clustering algorithm such as k-means
  • How does the algorithm work?
  • When does it stop?
  • What is the objective function?
• Evaluation of clustering results
  • With ground truth
  • Without ground truth
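The three k-means questions (how it works, when it stops, what it minimizes) map onto a short loop. The following is a sketch under stated assumptions: squared Euclidean distance, the first k points as initial centroids (random initialization is more common in practice), and stopping when assignments no longer change. The function names and the six sample points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd-style k-means on a list of equal-length tuples.

    Returns (assignment, centroids, sse), where sse is the objective:
    the sum of squared distances from each point to its centroid.
    """
    centroids = [tuple(p) for p in points[:k]]  # assumed initialization
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping criterion: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
assignment, centroids, sse = kmeans(points, 2)
```

Each of the two steps can only lower (or keep) the sum of squared errors, which is why the loop terminates.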

Chapter 6: Community Analysis

Community detection

• Why analyze communities?

• What are explicit and implicit communities?

• What is community detection?

• What’s the difference between community detection and conventional clustering?

Member-based community detection

• What are member-based methods?
  • Nodes with similar characteristics are in a community
• Node characteristics
  • Degree, e.g., clique percolation method (CPM)
  • Reachability, e.g., k-clique, k-club, and k-clan
  • Similarity, e.g., Jaccard and Cosine similarity

Group-based community detection

• What are group-based methods?
  • The global network information and topology is considered to determine communities
• Balanced communities - spectral clustering
  • Community detection → a minimum cut problem
  • Find a graph partition such that the number of edges between the two sets is minimized
  • What are ratio cut and normalized cut?
  • Formulation of spectral clustering – which eigenvector to use?

Group-based community detection

• Modular communities - modularity maximization
  • Modularity is a measure that defines how likely the community structure found is created at random
  • How to calculate the modularity?
  • How to maximize the modularity?
  • How to obtain the community assignment from the modularity matrix?
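To illustrate the modularity calculation, the sketch below evaluates Q = (1/2m) Σ_ij (A_ij − d_i d_j / 2m) δ(c_i, c_j) for a given community assignment. This is the widely used form for undirected graphs without self-loops; that it matches the lecture's notation exactly is an assumption. The function name and the example graph (two triangles joined by one edge) are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, loop-free graph.

    edges: list of (u, v) pairs; community: dict node -> community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed term: a within-community edge contributes A_uv + A_vu = 2.
    observed = sum(2 for u, v in edges if community[u] == community[v])
    # Expected term: sum of d_i * d_j over ordered same-community pairs
    # equals the squared degree total of each community.
    degree_sum = {}
    for node, d in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    expected = sum(s * s for s in degree_sum.values()) / (2 * m)
    return (observed - expected) / (2 * m)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
```

Splitting the graph at the bridge edge gives Q = 5/14, while putting every node in one community gives Q = 0, which matches the intuition that the two-triangle split is the more meaningful structure.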

Group-based community detection

• Hierarchical communities: hierarchical clustering
• How to build a hierarchical structure of communities?
  • Divisive hierarchical clustering
  • Agglomerative hierarchical clustering

Community evolution

• Evolution of networks

• Interesting patterns in dynamic networks
  • Decreasing probability of new connections between two nodes with increasing distance
  • Many new connections occur as triadic closures
  • Density increases with the network growth
  • Average distance between nodes decreases

Community evaluation

• How to evaluate a community detection assignment?
• Evaluation with ground truth
  • Precision and Recall, or F-Measure
  • Normalized Mutual Information (NMI)
• Evaluation without ground truth
  • Use domain experts or conduct user studies
  • Use multiple community detection methods

Chapter 7: Information Diffusion

Information diffusion

• What is information diffusion?

• Key components of information diffusion
  • Sender(s)
  • Receiver(s)
  • Medium

Herd behavior

• What is herding behavior?
• Under what conditions does herding behavior happen? – global information
• Herd behavior experiment - urn experiment
  • How to use mathematical tools to validate that herd behavior will happen?
• How to interrupt herding behavior?

Information cascade

• What is an information cascade?
• What’s the major difference between information cascade and herding behavior?
• Independent Cascade Model (ICM) - each node has one chance to activate its neighbors
  • How does it work, and which set of nodes will be activated at the end?

Maximizing the spread of cascades

• To trigger a large spread, which set of individuals should be targeted at the very beginning?
• Given a parameter k (budget), find a k-node set S to maximize f(S)
• A constrained optimization problem with f(S) as the objective function
  • What are the key properties of f(S)?
  • How to optimize it? Greedy algorithm
  • How good is the greedy algorithm?
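The ICM and the greedy seed selection can be sketched together. In this hedged example f(S) is estimated as the average number of activated nodes over repeated simulations, every edge shares one activation probability `p`, and the graph, function names and parameter defaults are all hypothetical. For a monotone, submodular f, greedy selection is known to reach at least a (1 − 1/e) fraction of the optimal spread.

```python
import random


def simulate_icm(graph, seeds, p, rng):
    """One run of the Independent Cascade Model; returns the activated set."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                # A newly activated node gets one chance per inactive neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, p, runs=200, seed=0):
    """Monte Carlo estimate of f(S): mean number of activated nodes."""
    rng = random.Random(seed)
    return sum(len(simulate_icm(graph, seeds, p, rng)) for _ in range(runs)) / runs


def greedy_seeds(graph, k, p, runs=200):
    """Pick k seeds, each time adding the node with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        base = estimate_spread(graph, chosen, p, runs) if chosen else 0.0
        best_node, best_gain = None, -1.0
        for v in graph:
            if v in chosen:
                continue
            gain = estimate_spread(graph, chosen + [v], p, runs) - base
            if gain > best_gain:
                best_node, best_gain = v, gain
        chosen.append(best_node)
    return chosen


# Hypothetical directed graph: node -> nodes it can try to activate.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [5], 5: []}
```

With `p = 1.0` the cascade is deterministic and reduces to reachability, so `greedy_seeds(graph, 2, 1.0)` picks node 0 first (it reaches four nodes) and then node 4 (which adds two more).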

Diffusion of innovations and epidemic models

• What are the key characteristics of diffusion of innovations?
• How to model diffusion of innovation?
  • External-influence, internal-influence, and mixed-influence
• What are epidemics? Three components: pathogen, hosts, and spreading mechanism
• What are differences between epidemics and cascades?

Epidemic Models

• SI model
• SIR model
  • Two cases of SIR model
• SIS model
• SIRS model
• How to perform epidemic intervention?
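As an illustration of how these compartment models behave, the sketch below integrates the SIR equations ds/dt = −𝛽si, di/dt = 𝛽si − 𝛾i, dr/dt = 𝛾i with a simple Euler step. It assumes s, i and r are population fractions that sum to 1; the function name, step size and parameter values are hypothetical. One common way to split SIR behavior is by whether 𝛽·s₀ exceeds 𝛾; whether that matches the "two cases" meant in the lecture is an assumption.

```python
def simulate_sir(beta, gamma, s0, i0, steps, dt=0.1):
    """Euler integration of the SIR model on population fractions.

    Returns a list of (s, i, r) tuples, one per time step.
    """
    s, i, r = s0, i0, 1.0 - s0 - i0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt  # susceptible -> infected
        recoveries = gamma * i * dt         # infected -> recovered
        s = s - new_infections
        i = i + new_infections - recoveries
        r = r + recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters: infection outpaces recovery at the start.
outbreak = simulate_sir(beta=0.5, gamma=0.1, s0=0.99, i0=0.01, steps=500)
# Hypothetical parameters: recovery outpaces infection from the start.
die_out = simulate_sir(beta=0.05, gamma=0.1, s0=0.99, i0=0.01, steps=500)
```

Setting `gamma=0` turns the same loop into the SI model, since infected individuals then never leave the infected compartment.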

Chapter 8: Recommendation

Recommendation

• What’s the difference between search and recommendation?

• Challenges of recommender systems
  • The cold start problem
  • Data sparsity
  • Attacks
  • Privacy
  • Explanation

Content-based recommendation

• Assumption: a user’s interests should match the description of the items that the system recommends to that user
• Detailed steps:
  • Describe the items to be recommended
  • Create a profile of the user that describes the types of items the user likes
  • Compare items with the user profile to determine what to recommend

Collaborative filtering

• What is collaborative filtering (CF)?

• What’s the advantage of CF over content-based recommendation?
• Types of collaborative filtering algorithms
  • Memory-based (recommendation is directly based on previous ratings in the stored matrix that describes user-item relations)
  • Model-based (assumes that an underlying model (hypothesis) governs how users rate items)

Memory-based collaborative filtering

• User-based CF
  • Users with similar previous ratings for items are likely to rate future items similarly
• Item-based CF
  • Items that have received similar ratings previously from users are likely to receive similar ratings from future users
• How to measure the similarity? - cosine similarity or Pearson correlation coefficient
• How to get the final ratings in user-based CF and item-based CF?
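A hedged sketch of user-based CF follows. It assumes a commonly used prediction rule: the target user's mean rating plus a similarity-weighted average of how far each neighbor's rating of the item deviates from that neighbor's own mean. Similarity is cosine over co-rated items, which is one of several possible choices. The ratings, user names and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_user_based(ratings, user, item, k=2):
    """Predict ratings[user][item] from the k most similar users who rated item."""
    mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    candidates = [
        (cosine(ratings[user], r), v)
        for v, r in ratings.items()
        if v != user and item in r
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return mean[user]  # no usable neighbor: fall back to the user's mean
    offset = sum(sim * (ratings[v][item] - mean[v]) for sim, v in neighbors)
    return mean[user] + offset / total_sim


# Hypothetical rating data: user -> {item: rating}.
ratings = {
    "ann": {"x": 4, "y": 2},
    "bob": {"x": 4, "y": 2, "z": 5},
    "cat": {"x": 2, "y": 4, "z": 1},
}
prediction = predict_user_based(ratings, "ann", "z", k=2)
```

Item-based CF follows the same pattern with the roles swapped: similarities are computed between item columns, and the prediction averages the target user's own ratings of similar items.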

Model-based collaborative filtering

• We focus on a well-established technique using singular value decomposition (SVD)
• What is SVD on matrix X ∈ ℝ^{m×n}? - Decompose X into 3 matrices (UΣVᵀ)
• Matrices U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n} are orthogonal and matrix Σ ∈ ℝ^{m×n} is diagonal
• The product of these matrices is equivalent to the original matrix – no information loss
• How to use SVD for the recommendation?
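The decomposition and one typical use of it can be checked numerically with NumPy's `numpy.linalg.svd`. The rating matrix below is made up, and keeping only the top k singular values as a low-rank approximation is offered as one common way to use SVD for recommendation, not necessarily the procedure taught in the course.

```python
import numpy as np

# Hypothetical 4 users x 3 items rating matrix.
X = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Full SVD: U is 4x4, Vt is 3x3, singular_values holds the diagonal of Sigma.
U, singular_values, Vt = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros(X.shape)
Sigma[: len(singular_values), : len(singular_values)] = np.diag(singular_values)
X_full = U @ Sigma @ Vt  # reproduces X up to floating-point error


def low_rank(X, k):
    """Rank-k approximation built from the k largest singular values."""
    U_k, s, Vt_k = np.linalg.svd(X, full_matrices=False)
    return U_k[:, :k] @ np.diag(s[:k]) @ Vt_k[:k, :]


X_2 = low_rank(X, 2)
```

The full product recovers X, while the rank-2 version discards the smallest singular value; its Frobenius-norm error equals that discarded value, and the smoothed entries can be read as predicted ratings.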

Recommendation to groups

• Find content of interest to all members of a group of socially acquainted individuals

• Three strategies:
  • Maximizing average satisfaction
  • Least misery
  • Most pleasure
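The three strategies differ only in how individual ratings are aggregated per item: by mean, by minimum, or by maximum. A minimal sketch, with hypothetical ratings and function names:

```python
def group_scores(ratings, strategy):
    """Aggregate each item's ratings across all group members.

    strategy: "average", "least_misery" or "most_pleasure".
    """
    aggregate = {
        "average": lambda xs: sum(xs) / len(xs),
        "least_misery": min,    # an item is only as good as its lowest rating
        "most_pleasure": max,   # an item is as good as its highest rating
    }[strategy]
    # Only consider items that every member has rated.
    items = set.intersection(*(set(r) for r in ratings.values()))
    return {item: aggregate([r[item] for r in ratings.values()]) for item in items}


def recommend(ratings, strategy):
    """Return the item with the best aggregated score."""
    scores = group_scores(ratings, strategy)
    return max(sorted(scores), key=scores.get)


# Hypothetical group of three users rating three items.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 1, "b": 3, "c": 2},
    "u3": {"a": 3, "b": 3, "c": 4},
}
```

On this data the strategies disagree: averaging picks item c, least misery picks item b (nobody rates it below 3), and most pleasure picks item a (one member rates it 5).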

Social recommendation

• Leverage social connections to improve recommendation performance
• Three different ways:
  • Using social context alone – link prediction
  • Extending classical models – matrix factorization and social similarity regularization
  • Constraining with social context - limit the set of individuals that can contribute to the ratings of a user to the set of friends of the user

Evaluating recommender systems

• Predictive accuracy
  • MAE, RMSE – how to calculate them?
• Classification accuracy
  • Precision, Recall – how to calculate them?
• Rank accuracy
  • Spearman’s Rank Correlation, Kendall’s 𝜏 – how to calculate them?
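For the rank accuracy measures, a short sketch using the textbook no-ties formulas: Spearman's ρ = 1 − 6 Σ d² / (n(n² − 1)) and Kendall's 𝜏 = (concordant − discordant) / (n choose 2). Both functions take two lists of ranks for the same items; the function names are hypothetical, and tied ranks would need the corrected variants.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation; assumes no tied ranks."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau; assumes no tied ranks."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For true ranks `[1, 2, 3, 4]` and predicted ranks `[1, 3, 2, 4]`, one of the six item pairs is swapped, giving ρ = 0.8 and 𝜏 = 4/6.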
