A Survey of Heterogeneous Information Network Analysis

A Survey of Heterogeneous Information Network Analysis Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE, and Philip S. Yu, Fellow, IEEE IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015

Upload: so-yeon-kim

Post on 18-Jan-2017


TRANSCRIPT

Page 1: A survey of heterogeneous information network analysis

A Survey of Heterogeneous Information Network Analysis

Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE, and Philip S. Yu, Fellow, IEEE

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015

Page 2: A survey of heterogeneous information network analysis

Introduction

Page 3: A survey of heterogeneous information network analysis

Introduction

An information network models the interacting components of a system as an interconnected network.

Information network analysis has become a hot research topic in data mining and information retrieval over the past decades.

Most information network analyses share a basic assumption: there is only one type of object and one type of link -> homogeneous information network

Page 4: A survey of heterogeneous information network analysis

Introduction

But most real systems consist of a large number of interacting, multi-typed components, and we can model them as heterogeneous information networks (HINs).

Compared to a homogeneous information network, an HIN can effectively fuse more information and contains richer semantics, and thus it forms a new development of data mining.

This paper presents a survey of heterogeneous information networks and their analysis.

Page 5: A survey of heterogeneous information network analysis

Basic concepts and Definitions

Page 6: A survey of heterogeneous information network analysis

Basic definitions(1/4)

•Def 1. Information network

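A directed graph $G = (V, E)$ with

an object type mapping function $\varphi : V \to \mathcal{A}$ and

a link type mapping function $\psi : E \to \mathcal{R}$.

Each object $v \in V$ belongs to one object type in the object type set $\mathcal{A}$: $\varphi(v) \in \mathcal{A}$

Each link $e \in E$ belongs to one link type in the link type set $\mathcal{R}$: $\psi(e) \in \mathcal{R}$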
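Definition 1 can be sketched as a small Python data structure. This is only an illustration: the class name `InformationNetwork`, its methods, and the toy bibliographic objects are hypothetical and do not come from the survey. As a simplification, the sketch stores at most one link per ordered pair of objects.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Directed graph G = (V, E) with type mappings (illustrative only)."""
    object_type: dict = field(default_factory=dict)  # phi: object -> object type
    link_type: dict = field(default_factory=dict)    # psi: (source, target) -> link type

    def add_object(self, v, obj_type):
        self.object_type[v] = obj_type

    def add_link(self, u, v, rel_type):
        # Both endpoints must already be typed objects in V.
        if u not in self.object_type or v not in self.object_type:
            raise KeyError("both endpoints must be added before linking them")
        self.link_type[(u, v)] = rel_type

    def object_types(self):
        """The object type set A."""
        return set(self.object_type.values())

    def link_types(self):
        """The link type set R."""
        return set(self.link_type.values())


# Hypothetical toy data: one author, one paper, one venue.
g = InformationNetwork()
g.add_object("alice", "Author")
g.add_object("p1", "Paper")
g.add_object("KDD", "Venue")
g.add_link("alice", "p1", "writes")
g.add_link("p1", "KDD", "published_in")
```

Keeping the two mappings as plain dictionaries makes the type sets easy to derive from the data rather than declaring them separately.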

Page 7: A survey of heterogeneous information network analysis

Basic definitions(2/4)

•Def 2. Hetero/Homogeneous information network

An information network is a heterogeneous information network if the number of object types $|\mathcal{A}| > 1$ or the number of relation types $|\mathcal{R}| > 1$.

Otherwise, it is a homogeneous information network.
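The test in Definition 2 (more than one object type, or more than one relation type) is easy to express in code. The sketch below assumes the network is given as two plain dictionaries; the function name `is_heterogeneous` and the example data are hypothetical.

```python
def is_heterogeneous(object_type, link_type):
    """Return True if |A| > 1 or |R| > 1, following Definition 2."""
    num_object_types = len(set(object_type.values()))
    num_link_types = len(set(link_type.values()))
    return num_object_types > 1 or num_link_types > 1


# Hypothetical examples: both networks contain only "User" objects.
users = {"u1": "User", "u2": "User"}

# One relation type only: homogeneous.
friendship_links = {("u1", "u2"): "friend"}

# Two relation types over the same object type: heterogeneous
# (a multi-relational network with single-typed objects).
social_links = {("u1", "u2"): "friend", ("u2", "u1"): "follows"}

print(is_heterogeneous(users, friendship_links))
print(is_heterogeneous(users, social_links))
```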

Page 8: A survey of heterogeneous information network analysis

Basic definitions(2/4)

Page 9: A survey of heterogeneous information network analysis

Basic definitions(3/4)

•Def 3. Network schema

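A meta template for an information network $G = (V, E)$, denoted $T_G = (\mathcal{A}, \mathcal{R})$: a directed graph defined over the object types $\mathcal{A}$, with edges as relations from $\mathcal{R}$.

The network schema of a heterogeneous information network specifies type constraints on the sets of objects and on the relationships among the objects.

※ Network instance: an information network that follows a network schema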
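One way to picture the relationship between a schema and a network instance is to represent the schema as a set of (source type, relation, target type) triples and check that every link in the instance matches one of them. This representation, the function names, and the bibliographic example are assumptions made for illustration.

```python
def network_schema(object_type, link_type):
    """Collect the (source type, relation, target type) triples used by an instance."""
    schema = set()
    for (u, v), rel in link_type.items():
        schema.add((object_type[u], rel, object_type[v]))
    return schema


def follows_schema(object_type, link_type, schema):
    """True if every link in the instance is allowed by the schema."""
    return network_schema(object_type, link_type) <= schema


# Hypothetical bibliographic schema.
bib_schema = {
    ("Author", "writes", "Paper"),
    ("Paper", "published_in", "Venue"),
}

object_type = {"alice": "Author", "p1": "Paper", "KDD": "Venue"}
good_links = {("alice", "p1"): "writes", ("p1", "KDD"): "published_in"}
bad_links = {("alice", "KDD"): "writes"}  # Author -> Venue is not in the schema

print(follows_schema(object_type, good_links, bib_schema))
print(follows_schema(object_type, bad_links, bib_schema))
```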

Page 10: A survey of heterogeneous information network analysis

Basic definitions(3/4)

Page 11: A survey of heterogeneous information network analysis

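Basic definitions(4/4)

•Def 4. Meta path

A meta path $P$ is a path defined on a network schema, denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $A_1, A_2, \ldots, A_{l+1}$, where $\circ$ denotes the composition operator on relations.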
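The composition of relations along a meta path can be sketched by counting path instances. The example below assumes each relation is stored as a nested dictionary `{source: {target: count}}` and builds the Author-Paper-Author (APA) meta path from a hypothetical "writes" relation and its inverse. The helper names `compose` and `invert` are illustrative.

```python
from collections import defaultdict


def compose(r1, r2):
    """Compose two relations given as {source: {target: path_count}}."""
    out = defaultdict(lambda: defaultdict(int))
    for a, mids in r1.items():
        for b, n1 in mids.items():
            for c, n2 in r2.get(b, {}).items():
                # Every path a -> b combined with every path b -> c gives a path a -> c.
                out[a][c] += n1 * n2
    return {a: dict(targets) for a, targets in out.items()}


def invert(r):
    """Reverse the direction of a relation."""
    out = defaultdict(dict)
    for a, targets in r.items():
        for b, n in targets.items():
            out[b][a] = n
    return dict(out)


# Hypothetical Author -> Paper relation.
writes = {
    "alice": {"p1": 1, "p2": 1},
    "bob": {"p2": 1},
    "carol": {"p3": 1},
}

# APA = writes composed with its inverse: counts of co-authored papers.
apa = compose(writes, invert(writes))
print(apa["alice"])
```

The same `compose` call can be chained to build longer meta paths such as APVPA, given a Paper -> Venue relation.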

Page 12: A survey of heterogeneous information network analysis

Basic definitions(4/4)

Page 13: A survey of heterogeneous information network analysis

Comparisons with related concepts

• HIN ⊃ Homogeneous network

• HIN ⊃ Multi-relational network

• HIN ⊃ Multi-dimensional/mode network

• HIN ⊃ Composite network

• HIN ≒ Complex network

Page 14: A survey of heterogeneous information network analysis

Example datasets

Three types of data from which an HIN can be constructed

1. Structured data
a. database tables organized with the entity-relation model
b. ex) bibliographic data

2. Semi-structured data
a. XML-format data
b. object -> attribute -> object
c. relation -> connections among attributes

3. Non-structured data
a. any data which have recognizable entities and extractable relations

Page 15: A survey of heterogeneous information network analysis

Example datasets

Widely used HIN examples

1. Multi-relational network with single-typed objects

a. Object type = 1

b. Relation type > 1

c. ex) Facebook, Twitter

Page 16: A survey of heterogeneous information network analysis

Example datasets

Widely used HIN examples

2. Bipartite network

a. Object type = 2

b. Relation type > 1

c. ex) User-item, Document-word

d. can be extended to a k-partite graph

Page 17: A survey of heterogeneous information network analysis

Example datasets

Widely used HIN examples

3. Star-schema network

a. an HIN that uses the target object as a hub node

b. ex) Bibliographic information network, movie data, patent data

Page 18: A survey of heterogeneous information network analysis

Example datasets

Widely used HIN examples

4. Multiple-hub network

a. ex) Bioinformatics data

Page 19: A survey of heterogeneous information network analysis

Example datasets Multiple HINs

Page 20: A survey of heterogeneous information network analysis

Why Heterogeneous Information Network Analysis

•It is a new development of data mining
  Big data analysis is an emerging and important task.
  Many different types of objects are interconnected.
  HIN can be an effective tool to deal with complex big data.

•It is an effective tool to fuse more information
  We can fuse information across multiple social network platforms.

•It contains rich semantics
  Different-typed objects and links coexist and carry different meanings.
  e.g. meta paths APA, APVPA, APV, etc. (A = author, P = paper, V = venue)

Page 21: A survey of heterogeneous information network analysis

Research Developments

Page 22: A survey of heterogeneous information network analysis

Research Developments

Page 23: A survey of heterogeneous information network analysis

Similarity measure

❏ Goal: consider both the structural similarity of two objects and the meta path connecting them (e.g. APA, APVPA, etc.)
❏ Path-based similarity measure
❏ The relevance of different-typed objects
❏ Meta path based relevance search + user preference

Different similarities according to meta paths (different semantic meanings):

image-tag-image (based on common tags)

image-tag-image-group-image-tag-image (further measured by shared groups)

Sun, Yizhou, et al. "PathSim: Meta path-based top-k similarity search in heterogeneous information networks." VLDB’11 (2011).
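As a concrete illustration of a path-based similarity measure, PathSim scores two same-typed objects x and y under a symmetric meta path as twice the number of path instances between x and y, divided by the number of path instances from x to itself plus the number from y to itself. The sketch below assumes the path counts are already available as a nested dictionary; the author names and counts are hypothetical.

```python
def pathsim(counts, x, y):
    """PathSim under a symmetric meta path.

    counts[a][b] is the number of path instances from a to b.
    """
    denom = counts[x].get(x, 0) + counts[y].get(y, 0)
    if denom == 0:
        return 0.0
    return 2.0 * counts[x].get(y, 0) / denom


# Hypothetical APA path counts (author -> paper -> author).
apa_counts = {
    "alice": {"alice": 2, "bob": 1},
    "bob": {"alice": 1, "bob": 1},
    "carol": {"carol": 1},
}

print(pathsim(apa_counts, "alice", "bob"))    # 2 * 1 / (2 + 1)
print(pathsim(apa_counts, "alice", "carol"))  # no shared papers
```

The normalization by self-path counts keeps highly connected objects from dominating the ranking, which plain path counting would not do.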

Page 24: A survey of heterogeneous information network analysis

Clustering

❏ Clustering based on networked data
  ❏ conventional methods are based on a homogeneous network (e.g. normalized cuts, modularity)
  ❏ in an HIN, need to consider multiple types of objects co-existing in a network

Page 25: A survey of heterogeneous information network analysis

Clustering

❏ Integrate the attribute information
  ❏ based on the network structure, the connections in the network, and the vertex attributes
❏ Integrate the text information
  ❏ topic mining - a unified topic model with HIN
  ❏ multiple objects clustering

Boden, B., et al. "Density-Based Subspace Clustering in Heterogeneous Networks." Machine Learning and Knowledge Discovery in Databases (2014).

Deng, Hongbo, et al. "Probabilistic topic models with biased propagation on heterogeneous information networks." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011.

Page 26: A survey of heterogeneous information network analysis

Clustering

❏ Integrate with mining tasks
  ❏ semi-supervised learning - path selection according to user guidance (labeled information)
  ❏ ranking-based clustering on HIN - mutual promotion of clustering and ranking
❏ Outlier detection
  ❏ detect association-based clique outliers in HIN
  ❏ find subnetwork outliers according to different queries and semantics
  ❏ meta-path based outlier mining in HIN

Page 27: A survey of heterogeneous information network analysis

Classification

❏ Classification in HIN
  ❏ classify multiple types of objects simultaneously
  ❏ the label of an object is decided by the effects of different-typed objects along different-typed links
❏ Multi-label classification
  ❏ use multiple types of relationships mined from the linkage structure of the HIN
  ❏ meta paths for feature generation
❏ Ranking-based classification
  ❏ mutually enhance classification and ranking

(Figure: knowledge propagation)

Page 28: A survey of heterogeneous information network analysis

Link prediction

❏ Challenges
  ❏ The links to be predicted are of different types
  ❏ Dependencies exist among multiple types of links
  ➔ collectively predict multiple types of links
  ➔ utilize meta paths
❏ Others
  ❏ Link prediction across multiple aligned heterogeneous networks
  ❏ Dynamic link prediction

(Figure: different link relations)

Page 29: A survey of heterogeneous information network analysis

Ranking

❏ Challenges
  ❏ treating all objects equally mixes different types of objects together
  ❏ results differ under different meta paths (different semantic meanings)
❏ Meta-path based ranking
  ❏ simultaneously evaluate the importance of multiple types of objects and meta paths

Page 30: A survey of heterogeneous information network analysis

Recommendation

❏ Meta path
  ❏ explore the semantics and extract relations among objects
❏ Can effectively fuse all kinds of information
  ❏ utilize different contexts
  ❏ use interest groups
  ❏ unified framework of multiple HIN features
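To make the meta-path idea for recommendation concrete, the sketch below scores unseen items for a user by counting instances of the user-item-user-item meta path. This is a deliberately simple stand-in, not the method of any particular work covered by the survey; the function name `recommend` and the rating data are hypothetical.

```python
from collections import Counter


def recommend(ratings, user, top_k=2):
    """Score unseen items by counting user-item-user-item path instances."""
    items_of = {}
    users_of = {}
    for u, i in ratings:
        items_of.setdefault(u, set()).add(i)
        users_of.setdefault(i, set()).add(u)

    seen = items_of.get(user, set())
    scores = Counter()
    for item in seen:
        for neighbor in users_of[item]:
            if neighbor == user:
                continue
            # Each unseen item of a neighbor closes one path instance.
            for candidate in items_of[neighbor] - seen:
                scores[candidate] += 1

    # Sort by score (descending), then by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_k]]


# Hypothetical (user, movie) interactions.
ratings = [
    ("u1", "m1"), ("u1", "m2"),
    ("u2", "m1"), ("u2", "m2"), ("u2", "m3"),
    ("u3", "m2"), ("u3", "m4"),
]

print(recommend(ratings, "u1"))
```

Other meta paths (for example through interest groups or item attributes) would yield different scores, and such scores could be combined as features in a unified model.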

Page 31: A survey of heterogeneous information network analysis

Information fusion

❏ Across multiple aligned HINs
  ❏ via the shared common information entities
❏ More comprehensive and consistent knowledge about the entities shared by different HINs, using their structures, properties, and activities
❏ Information can reach more users and achieve broader influence
❏ Transferring knowledge between aligned networks
  ❏ e.g. overcome the cold-start problem in recommender systems

Page 32: A survey of heterogeneous information network analysis

Advanced topics

Page 33: A survey of heterogeneous information network analysis

More complex network construction

❏ Easy to construct an HIN when the schema is well defined

❏ From real data?
  ❏ objects and links can be noisy or unreliable
  ❏ duplicated names
  ❏ missing relations
  ❏ ...

❏ High-quality HINs by cleaning
  ❏ integrated with information extraction, NLP, and other techniques

Page 34: A survey of heterogeneous information network analysis

More powerful mining methods

❏ Network structure

Bipartite Star-schema Multiple-hub Weighted

Dynamic Multiple-network Schema-rich

Page 35: A survey of heterogeneous information network analysis

More powerful mining methods

❏ Semantic mining
  ❏ node/link semantics
    ❏ different-typed nodes/links have different semantics
  ❏ meta-path
    ❏ different similarities under different meta paths (e.g. APC, APA)
  ❏ constrained meta-path
    ❏ constraint on node
    ❏ constraint on link
    ❏ e.g. APA|P.L = “Data Mining”, APA|P.L = “Information Retrieval”, ...
  ❏ weighted meta-path
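A constrained meta path such as APA|P.L = "Data Mining" can be read as: follow Author-Paper-Author, but only through papers whose label is "Data Mining". The sketch below counts such path instances. The function name `constrained_apa`, the label dictionary, and the data are hypothetical.

```python
from collections import Counter


def constrained_apa(writes, paper_label, label):
    """Count APA path instances whose middle paper carries the given label."""
    # Group authors by paper, keeping only papers that satisfy the constraint.
    authors_of = {}
    for author, paper in writes:
        if paper_label.get(paper) == label:
            authors_of.setdefault(paper, []).append(author)

    counts = Counter()
    for authors in authors_of.values():
        for a in authors:
            for b in authors:
                counts[(a, b)] += 1
    return counts


# Hypothetical (author, paper) links and paper labels.
writes = [("alice", "p1"), ("bob", "p1"), ("alice", "p2"), ("carol", "p2")]
paper_label = {"p1": "Data Mining", "p2": "Information Retrieval"}

dm = constrained_apa(writes, paper_label, "Data Mining")
ir = constrained_apa(writes, paper_label, "Information Retrieval")

print(dm[("alice", "bob")], dm[("alice", "carol")])
print(ir[("alice", "bob")], ir[("alice", "carol")])
```

The same pair of authors can thus be related in one research area and unrelated in another, which is the finer-grained semantics the constraint is meant to capture.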

Page 36: A survey of heterogeneous information network analysis

More powerful mining methods

Page 37: A survey of heterogeneous information network analysis

Bigger networked data

❏ An HIN can flexibly and effectively integrate varied objects and heterogeneous information

❏ However, real HINs pose many practical technical challenges
  ❏ huge, dynamic, limited memory capacity, ...

❏ Instead of the whole network, hidden but small networks can be mined

❏ Quick/parallel computation strategies have been considered recently

Page 38: A survey of heterogeneous information network analysis

Conclusion

Page 39: A survey of heterogeneous information network analysis

Conclusion

❏ Research on HINs has surged in recent years because of the rich structural and semantic information they contain.

❏ The survey reviews recent developments and future directions of different data mining tasks on HINs.

❏ It provides an understanding of the fundamental issues and a good starting point for working in this field.

Page 40: A survey of heterogeneous information network analysis

Thank you !

Q & A