A Survey of Heterogeneous Information Network Analysis
Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE, and Philip S. Yu, Fellow, IEEE
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015
Introduction
Information networks are made up of interacting components that form interconnected networks.
Information network analysis has become a hot research topic in the data mining and information retrieval fields over the past decades.
Most information network analyses make a basic assumption: there is only one type of object and one type of link -> Homogeneous information network
But most real systems consist of a large number of interacting, multi-typed components, and we can model them as a heterogeneous information network (HIN).
Compared to a homogeneous information network, an HIN can effectively fuse more information and contains richer semantics, and thus it forms a new development of data mining.
In this paper, the authors present a survey of heterogeneous information networks and their analysis.
Basic concepts and Definitions
Basic definitions(1/4)
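•Def 1. Information network
directed graph G = (V,E)
object type mapping function φ : V → A
link type mapping function ψ : E → R
each object v ∈ V belongs to one object type in the object type set A : φ(v) ∈ A
each link e ∈ E belongs to one relation type in the link type set R : ψ(e) ∈ R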
Basic definitions(2/4)
•Def 2. Hetero/Homogeneous information network
An information network is a heterogeneous information network
if the number of object types |A| > 1
or the number of relation types |R| > 1.
Otherwise, it is a homogeneous information network.
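As a rough illustration of Def 1 and Def 2, the sketch below stores the two type mappings as dictionaries and tests whether more than one object type or more than one link type is present. The class name `InfoNetwork`, its methods, and the toy objects are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class InfoNetwork:
    """Directed graph with an object type mapping (phi) and a link type mapping (psi)."""
    phi: dict = field(default_factory=dict)  # object -> object type
    psi: dict = field(default_factory=dict)  # (source, target) -> link type

    def add_object(self, obj, obj_type):
        self.phi[obj] = obj_type

    def add_link(self, src, dst, link_type):
        # Simplifying assumption: at most one link per ordered pair of objects.
        if src not in self.phi or dst not in self.phi:
            raise KeyError("both endpoints must be added as objects first")
        self.psi[(src, dst)] = link_type

    def is_heterogeneous(self):
        # Heterogeneous if more than one object type or more than one link type.
        return len(set(self.phi.values())) > 1 or len(set(self.psi.values())) > 1


bib = InfoNetwork()
bib.add_object("a1", "Author")
bib.add_object("p1", "Paper")
bib.add_link("a1", "p1", "writes")

coauthor = InfoNetwork()
coauthor.add_object("a1", "Author")
coauthor.add_object("a2", "Author")
coauthor.add_link("a1", "a2", "coauthor")

print(bib.is_heterogeneous())       # True
print(coauthor.is_heterogeneous())  # False
```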
Basic definitions(3/4)
•Def 3. Network schema
Meta template for an information network G = (V,E), denoted S = (A,R): a directed graph over the object types A with the relation types R as edges
The network schema of a heterogeneous information network specifies type constraints on the sets of objects and on the relationships among the objects.
※ Network instance: an information network following a network schema
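One way to picture the relationship between a schema and a network instance is to derive the schema from typed links and then check whether another set of links respects it. This is a minimal sketch; the function names `network_schema` and `follows_schema` and the toy bibliographic data are assumptions made for illustration.

```python
def network_schema(phi, links):
    """Return (object types, {(source type, relation, target type)}) for typed links."""
    object_types = set(phi.values())
    relations = {(phi[src], rel, phi[dst]) for src, rel, dst in links}
    return object_types, relations


def follows_schema(phi, links, schema):
    """Check that every object type and every typed link is allowed by the schema."""
    object_types, relations = schema
    return (set(phi.values()) <= object_types
            and all((phi[src], rel, phi[dst]) in relations for src, rel, dst in links))


# Hypothetical bibliographic instance.
phi = {"a1": "Author", "a2": "Author", "p1": "Paper", "p2": "Paper", "v1": "Venue"}
links = [
    ("a1", "writes", "p1"),
    ("a2", "writes", "p1"),
    ("a2", "writes", "p2"),
    ("p1", "published_in", "v1"),
    ("p2", "published_in", "v1"),
]

schema = network_schema(phi, links)
print(sorted(schema[0]))
print(sorted(schema[1]))

# A link from a venue to an author via "writes" violates the type constraints.
bad_links = links + [("v1", "writes", "a1")]
print(follows_schema(phi, links, schema), follows_schema(phi, bad_links, schema))
```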
Basic definitions(3/4)
•Def 4. Meta path
A meta path P is a path defined on a network schema and is denoted in the form A_1 -R_1-> A_2 -R_2-> ... -R_l-> A_(l+1), which defines a composite relation R = R_1 ∘ R_2 ∘ ... ∘ R_l between objects A_1, A_2, ..., A_(l+1),
where ∘ denotes the composition operator on relations.
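The composition of relations along a meta path can be sketched as a sparse matrix product that counts path instances. The example below composes a hypothetical "writes" relation with its inverse to obtain the Author-Paper-Author (APA) relation; the helper names `compose` and `inverse` and the toy data are illustrative assumptions.

```python
def compose(first, second):
    """Compose two relations stored as {source: {target: path count}}."""
    result = {}
    for src, mids in first.items():
        for mid, c1 in mids.items():
            for dst, c2 in second.get(mid, {}).items():
                row = result.setdefault(src, {})
                row[dst] = row.get(dst, 0) + c1 * c2
    return result


def inverse(relation):
    """Reverse the direction of a relation."""
    result = {}
    for src, targets in relation.items():
        for dst, count in targets.items():
            result.setdefault(dst, {})[src] = count
    return result


# Hypothetical Author -> Paper links.
writes = {"a1": {"p1": 1, "p2": 1}, "a2": {"p1": 1}, "a3": {"p3": 1}}

# APA = writes composed with its inverse; entries count shared papers.
apa = compose(writes, inverse(writes))
print(apa["a1"])  # {'a1': 2, 'a2': 1}
```

Longer meta paths such as APVPA follow by chaining further `compose` calls.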
Basic definitions(4/4)
Comparisons with related concepts
• HIN ⊃ Homogeneous network
• HIN ⊃ Multi-relational network
• HIN ⊃ Multi-dimensional/mode network
• HIN ⊃ Composite network
• HIN ≒ Complex network
Example datasets
Three types of data from which an HIN can be constructed
1. Structured data
a. database tables organized with an entity-relation model
b. ex) bibliographic data
2. Semi-structured data
a. XML format data
b. object -> attribute -> object
c. relation -> connections among attributes
3. Non-structured data
a. Any data with recognizable entities and extractable relations
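For the structured-data case, turning table rows into typed objects and links might look like the sketch below. The record layout, the `build_hin` helper, and the identifier prefixes are all hypothetical choices for illustration.

```python
# Hypothetical rows of a bibliographic table.
records = [
    {"paper": "p1", "authors": ["Ann", "Bob"], "venue": "KDD"},
    {"paper": "p2", "authors": ["Bob"], "venue": "VLDB"},
]


def build_hin(rows):
    """Return (object -> type, list of (source, relation, target)) from table rows."""
    object_type = {}
    links = []
    for row in rows:
        # Prefix identifiers so that objects of different types never collide.
        paper = "paper:" + row["paper"]
        venue = "venue:" + row["venue"]
        object_type[paper] = "Paper"
        object_type[venue] = "Venue"
        links.append((paper, "published_in", venue))
        for name in row["authors"]:
            author = "author:" + name
            object_type[author] = "Author"
            links.append((author, "writes", paper))
    return object_type, links


object_type, links = build_hin(records)
print(len(object_type), len(links))  # 6 5
```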
Widely used HIN examples
1. Multi-relational network with single-typed objects
a. Object type = 1
b. Relation type > 1
c. ex) Facebook, Twitter
2. Bipartite network
a. Object type = 2
b. Relation type > 1
c. ex) User-item, Document-word
d. a k-partite graph can also be constructed
3. Star-schema network
a. An HIN that uses the target object as a hub node
b. ex) Bibliographic information network, movie data, patent data
4. Multiple-hub network
a. ex) Bioinformatics data
Multiple HINs
Why Heterogeneous Information Network Analysis
•It is a new development of data mining
  Big data analysis is an emergent yet important task to be studied
  Many different types of objects are interconnected
  HIN can be an effective tool to deal with complex big data.
•It is an effective tool to fuse more information
  We can fuse information across multiple social network platforms
•It contains rich semantics
  Different-typed objects and links coexist and they carry different meanings
APA, APVPA, APV, etc...
Research Developments
Similarity measure
❏ Goal: consider both the structural similarity of two objects and the meta path connecting them (e.g. APA, APVPA, etc)
❏ Path based similarity measure
❏ The relevance of different-typed objects
❏ meta path based relevance search + user preference
different similarities according to meta paths (different semantic meanings)
image-tag-image (based on common tags)
image-tag-image-group-image-tag-image (further measured by shared groups)
Sun, Yizhou, et al. "PathSim: Meta path-based top-k similarity search in heterogeneous information networks." VLDB'11 (2011).
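The PathSim measure cited above scores two same-typed objects x and y under a symmetric meta path as twice the number of path instances between x and y, divided by the number of path instances from x to itself plus the number from y to itself. A minimal sketch for the APVPA meta path follows; the publication counts and author names are hypothetical, chosen only to show that a prolific author is not automatically the most similar one.

```python
# Hypothetical number of papers each author has published in each venue (the A-P-V half path).
author_venue = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
}


def path_count(counts, x, y):
    """Number of A-P-V-P-A path instances between authors x and y."""
    return sum(c * counts[y].get(venue, 0) for venue, c in counts[x].items())


def pathsim(counts, x, y):
    """PathSim: 2 * |paths x~y| / (|paths x~x| + |paths y~y|)."""
    denom = path_count(counts, x, x) + path_count(counts, y, y)
    return 2 * path_count(counts, x, y) / denom if denom else 0.0


print(pathsim(author_venue, "Mike", "Bob"))            # 1.0
print(round(pathsim(author_venue, "Mike", "Jim"), 4))  # 0.0826
```

The normalization by self-path counts is what keeps Jim, who has many more path instances overall, from dominating the ranking for Mike.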
Clustering
❏ Clustering based on networked data
  ❏ based on a homogeneous network (e.g. normalized cuts, modularity)
  ❏ need to consider multiple types of objects co-existing in a network
❏ Integrate the attribute information
  ❏ based on the network structure, connections in the network and the vertex attributes
❏ Integrate the text information
  ❏ topic mining - a unified topic model with HIN
  ❏ multiple objects clustering
Boden, B., et al. "Density-Based Subspace Clustering in Heterogeneous Networks." Machine Learning and Knowledge Discovery in Databases (2014).
Deng, Hongbo, et al. "Probabilistic topic models with biased propagation on heterogeneous information networks." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011.
❏ Integrate with mining tasks
  ❏ semi-supervised learning - path selection according to user guidance (labeled information)
  ❏ ranking-based clustering on HIN - mutual promotion of clustering and ranking
❏ Outlier detection
  ❏ detect association-based clique outliers in HIN
  ❏ find subnetwork outliers according to different queries and semantics
  ❏ a meta-path based outlier mining in HIN
Classification
❏ Classification in HIN
  ❏ classify multiple types of objects simultaneously
  ❏ the label of an object is decided by the effects of different-typed objects along different-typed links
❏ Multi-label classification
  ❏ use multiple types of relationships mined from the linkage structure of HIN
❏ Meta paths for feature generation
❏ Ranking-based classification
  ❏ mutually enhance classification and ranking
knowledge propagation
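The idea that labels spread across different-typed objects along different-typed links can be illustrated with a generic label-propagation loop. This is a simplified stand-in and not one of the specific methods covered by the survey; the function name, the per-link-type weights, and the toy network are assumptions.

```python
def propagate_labels(links, seeds, type_weight, classes, iterations=20):
    """Spread seed labels over typed links; returns the predicted class per object."""
    neighbors = {}
    for u, link_type, v in links:
        w = type_weight[link_type]
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))  # links treated as undirected here

    nodes = set(neighbors) | set(seeds)
    scores = {n: {c: 0.0 for c in classes} for n in nodes}
    for n, c in seeds.items():
        scores[n][c] = 1.0

    for _ in range(iterations):
        updated = {}
        for n in nodes:
            if n in seeds:  # labeled objects stay fixed
                updated[n] = scores[n]
                continue
            total = {c: 0.0 for c in classes}
            for m, w in neighbors.get(n, []):
                for c in classes:
                    total[c] += w * scores[m][c]
            norm = sum(total.values())
            updated[n] = {c: (total[c] / norm if norm else 0.0) for c in classes}
        scores = updated
    return {n: max(classes, key=lambda c: scores[n][c]) for n in nodes}


# Hypothetical network: two labeled papers, unlabeled authors and one unlabeled paper.
links = [
    ("a1", "writes", "p1"),
    ("a1", "writes", "p2"),
    ("a2", "writes", "p3"),
]
seeds = {"p1": "DM", "p3": "IR"}
labels = propagate_labels(links, seeds, {"writes": 1.0}, ["DM", "IR"])
print(labels["a1"], labels["p2"], labels["a2"])  # DM DM IR
```

Authors and papers receive labels in the same pass, which mirrors the point about classifying multiple types of objects simultaneously.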
Link prediction
❏ Challenges
  ❏ The links to be predicted are of different types
  ❏ Dependencies exist among multiple types of links
  ➔ collectively predict multiple types of links
  ➔ utilize meta paths
❏ Others
  ❏ Link prediction across multiple aligned heterogeneous networks
  ❏ Dynamic link prediction
different link relations
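One simple way to use meta paths for link prediction is to compute a path count per meta path for each candidate pair and combine the counts into a score. In the sketch below the weights are fixed by hand purely for illustration (in practice they would be learned from observed links), and the data, the APVPA and APTPA meta paths, and all names are hypothetical.

```python
# Hypothetical half-path counts: papers per venue and papers per term for each author.
author_venue = {"a1": {"KDD": 2, "VLDB": 1}, "a2": {"KDD": 1}, "a3": {"VLDB": 3}, "a4": {"ICML": 1}}
author_term = {"a1": {"network": 2, "mining": 1}, "a2": {"network": 1, "mining": 1},
               "a3": {"query": 2}, "a4": {"network": 1}}


def path_count(half, x, y):
    """Path instances between x and y under the symmetric meta path built from `half`."""
    return sum(c * half[y].get(mid, 0) for mid, c in half[x].items())


def features(x, y):
    # One feature per meta path: APVPA (shared venues) and APTPA (shared terms).
    return [path_count(author_venue, x, y), path_count(author_term, x, y)]


def score(x, y, weights):
    return sum(w * f for w, f in zip(weights, features(x, y)))


weights = [0.5, 1.0]  # assumed values, not learned
candidates = ["a2", "a3", "a4"]
ranking = sorted(candidates, key=lambda y: score("a1", y, weights), reverse=True)
print(ranking)  # ['a2', 'a4', 'a3']
```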
Ranking
❏ Challenges
  ❏ treating all objects equally will mix different types of objects together
  ❏ different results under different meta paths (different semantic meanings)
❏ Meta-path based ranking
  ❏ simultaneously evaluate the importance of multiple types of objects and meta paths
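Ranking two object types at once, without mixing them in a single list, can be sketched with a HITS-style mutual reinforcement: authors score highly if they publish in highly ranked venues, and venues score highly if highly ranked authors publish in them. This is a generic illustration rather than a specific algorithm from the survey; the function name and data are assumptions.

```python
def co_rank(author_venue, iterations=50):
    """Iteratively rank authors and venues from author -> {venue: paper count}."""
    authors = sorted(author_venue)
    venues = sorted({v for counts in author_venue.values() for v in counts})
    a_score = {a: 1.0 / len(authors) for a in authors}
    v_score = {v: 1.0 / len(venues) for v in venues}

    def normalize(scores):
        total = sum(scores.values())
        return {k: s / total for k, s in scores.items()} if total else scores

    for _ in range(iterations):
        # Venue score: weighted sum of the scores of authors publishing there.
        v_score = normalize({v: sum(a_score[a] * author_venue[a].get(v, 0) for a in authors)
                             for v in venues})
        # Author score: weighted sum of the scores of venues they publish in.
        a_score = normalize({a: sum(v_score[v] * c for v, c in author_venue[a].items())
                             for a in authors})
    return a_score, v_score


# Hypothetical publication counts.
data = {"a1": {"KDD": 5, "VLDB": 1}, "a2": {"KDD": 3}, "a3": {"VLDB": 1}}
a_score, v_score = co_rank(data)
print(max(a_score, key=a_score.get), max(v_score, key=v_score.get))  # a1 KDD
```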
Recommendation
❏ Meta path
  ❏ explore the semantics and extract relations among objects
❏ Can effectively fuse all kinds of information
  ❏ utilize different contexts
  ❏ use interest groups
  ❏ unified framework of multiple HIN features
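As a hedged illustration of using interest groups in recommendation, the sketch below predicts a rating as a similarity-weighted average of other users' ratings, with similarity taken from shared groups (a simple stand-in for a User-Group-User meta path). The Jaccard similarity, the function names, and the data are assumptions and not a method described in the survey.

```python
def group_similarity(groups, u, v):
    """Jaccard overlap of interest groups, standing in for a U-G-U meta path similarity."""
    union = groups[u] | groups[v]
    return len(groups[u] & groups[v]) / len(union) if union else 0.0


def predict(ratings, groups, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    numerator = denominator = 0.0
    for other, rated in ratings.items():
        if other == user or item not in rated:
            continue
        weight = group_similarity(groups, user, other)
        numerator += weight * rated[item]
        denominator += weight
    return numerator / denominator if denominator else None


# Hypothetical data: u1 has no ratings yet, but shares a group with u2.
groups = {"u1": {"scifi", "anime"}, "u2": {"scifi"}, "u3": {"romance"}}
ratings = {"u1": {}, "u2": {"m1": 5.0}, "u3": {"m1": 1.0}}
print(predict(ratings, groups, "u1", "m1"))  # 5.0
```

Because the similarity comes from group membership rather than from past ratings, a prediction is still available for a user with no rating history.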
Information fusion
❏ Across multiple aligned HINs
  ❏ via the shared common information entities
❏ More comprehensive and consistent knowledge about the entities shared across different HINs, using their structures, properties, and activities
❏ Information can reach more users and achieve broader influence
❏ Transferring knowledge between aligned networks
  ❏ e.g. overcome the cold start problem in recommendation systems
Advanced topics
More complex network construction
❏ Easy to construct an HIN with a well-defined schema
❏ From real data?
  ❏ objects and links can be noisy or unreliable
  ❏ duplicated names
  ❏ missing relations
  ❏ ...
❏ high-quality HINs by cleaning
  ❏ integrated with information extraction, NLP, and other techniques
More powerful mining methods
❏ Network structure
  Bipartite, Star-schema, Multiple-hub, Weighted, Dynamic, Multiple-network, Schema-rich
❏ Semantic mining
  ❏ node/link semantics
    ❏ different-typed nodes/links have different semantics
  ❏ meta-path (e.g. APC, APA)
    ❏ different similarities under different meta paths
  ❏ constrained meta-path
    ❏ constraint on node
    ❏ constraint on link
    e.g. APA|P.L = “Data Mining”, APA|P.L = “Information Retrieval”, ….
  ❏ weighted meta-path
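A constrained meta path such as APA|P.L = “Data Mining” restricts the path instances to those whose paper carries the given label. The sketch below counts APA path instances with and without such a node constraint; the function name and the toy papers, authors, and labels are hypothetical.

```python
# Hypothetical papers with a label attribute (P.L) and their authors.
paper_label = {"p1": "Data Mining", "p2": "Information Retrieval", "p3": "Data Mining"}
paper_authors = {"p1": ["a1", "a2"], "p2": ["a1", "a3"], "p3": ["a2", "a3"]}


def apa_counts(paper_authors, paper_label, label=None):
    """Count A-P-A path instances, optionally keeping only papers with the given label."""
    counts = {}
    for paper, authors in paper_authors.items():
        if label is not None and paper_label.get(paper) != label:
            continue  # node constraint on P: skip papers outside the label
        for x in authors:
            for y in authors:
                counts[(x, y)] = counts.get((x, y), 0) + 1
    return counts


unconstrained = apa_counts(paper_authors, paper_label)
data_mining = apa_counts(paper_authors, paper_label, label="Data Mining")

# a1 and a3 co-authored only an Information Retrieval paper.
print(unconstrained[("a1", "a3")], data_mining.get(("a1", "a3"), 0))  # 1 0
```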
Bigger networked data
❏ HIN can flexibly and effectively integrate varied objects and heterogeneous information
❏ However, real HINs pose many practical technical challenges
  ❏ huge size, dynamic changes, memory capacity ..
❏ Instead of the whole network, hidden but small networks can be mined
❏ Quick/parallel computation strategies have been considered recently
Conclusion
❏ There has been a surge of research on HIN in recent years because of its rich structural and semantic information.
❏ The survey covers the recent and future developments of different data mining tasks on HIN.
❏ It offers an understanding of the fundamental issues and a good starting point for work in this field.
Thank you !
Q & A