ghost

GHOST: An Effective Graph-based Framework for Name Distinction
Author: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu
Publication: CIKM’08
Presenter: Jhih-Ming Chen

Upload: jhihming

Post on 25-May-2015


Page 1: Ghost

GHOST: An Effective Graph-based Framework for Name Distinction

Author: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu

Publication: CIKM’08

Presenter: Jhih-Ming Chen

1

Page 2: Ghost

Outline
Introduction
The GHOST framework
◦ Graphical View of the Database
◦ Selection of Valid Paths
◦ Computing Similarity
◦ Clustering Strategy
◦ User Feedback
Experimental Results
Conclusion

2

Page 3: Ghost

Introduction
This paper focuses on a problem in digital libraries: distinguishing publications written by different authors who share an identical name.
◦ GHOST (abbr. GrapH-based framewOrk for name diStincTion)
Example:
◦ Query term: “Lei Wang”
◦ About 151 publications are retrieved in DBLP
◦ There are no fewer than 39 different persons with this name

3

Page 4: Ghost

Introduction
Objective of name distinction:
◦ Group the publications written by authors who share an identical name into clusters, so that all publications in each cluster belong to the same author.

4

Page 5: Ghost

5

Page 6: Ghost

Graphical View of the Database
The database of publications D is represented by a graph G = {V, E}
◦ Each node v ∈ V represents an author. Note that authors with an ambiguous name are treated as different nodes.
◦ An undirected edge represents a coauthorship.
◦ The edge between node a and node b has a label S(a, b), which denotes the complete set of publications coauthored by both a and b.
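The graph view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy database, the name "Coauthor X", and the convention of giving each occurrence of the ambiguous name its own node (written here as "Wei Wang#p1") are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(publications):
    """Map each undirected edge {a, b} to its label S(a, b).

    publications: dict of publication id -> list of author nodes.
    Returns a dict keyed by frozenset({a, b}) whose value is the set of
    publication ids coauthored by both a and b.
    """
    edges = defaultdict(set)
    for pub_id, authors in publications.items():
        # Every pair of coauthors on a publication shares an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[frozenset((a, b))].add(pub_id)
    return dict(edges)


# Hypothetical toy database. Each occurrence of the ambiguous name is kept
# as a separate node (an assumed encoding of "treated as different nodes").
publications = {
    "p1": ["Wei Wang#p1", "Jiawei Han"],
    "p2": ["Wei Wang#p2", "Jiawei Han", "Coauthor X"],
    "p3": ["Jiawei Han", "Coauthor X"],
}
graph = build_coauthor_graph(publications)
print(graph[frozenset(("Jiawei Han", "Coauthor X"))])
```

Keying edges by a frozenset keeps them undirected, so S(a, b) and S(b, a) are the same entry.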

6

Page 7: Ghost

Selection of Valid Paths
What GHOST needs to do is discover each triangle-like basic unit and check whether its longer sub-path is invalid.
◦ If an invalid path is found, the longer one is eliminated during the search.
An invalid sub-path emerges if and only if two of the three publication sets consist of only one identical publication.
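One possible reading of the validity rule, sketched in Python. The interpretation is an assumption: in a triangle a, b, c, the two-edge route a-b-c is treated as invalid when both of its edge labels are the same single-publication set, because that one paper already links a and c directly. The helper names and the toy labels are hypothetical.

```python
def is_invalid_subpath(s_ab, s_bc):
    """Assumed rule: both edge labels are the same one-publication set."""
    return len(s_ab) == 1 and s_ab == s_bc


def is_valid_path(path, label):
    """Check every consecutive node triple (a, b, c) on the path.

    path: list of nodes; label(a, b) returns the publication set S(a, b).
    """
    return not any(
        is_invalid_subpath(label(a, b), label(b, c))
        for a, b, c in zip(path, path[1:], path[2:])
    )


# Hypothetical edge labels S(a, b), keyed by unordered node pair.
edge_labels = {
    frozenset(("a", "b")): {"p1"},
    frozenset(("b", "c")): {"p1"},
    frozenset(("c", "d")): {"p2"},
}


def label(x, y):
    return edge_labels[frozenset((x, y))]


print(is_valid_path(["a", "b", "c"], label))  # route reuses the single paper p1
print(is_valid_path(["b", "c", "d"], label))  # edges come from different papers
```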

7

Page 8: Ghost

Computing Similarity
All valid paths are used to compute the similarity of two nodes, based on the following heuristics:
◦ The shortest path is the most indicative valid path.
◦ The more paths there are between two nodes, the more “similar” the two nodes may be.
◦ The notion of “six degrees of separation”.

Suppose there are m(i, j) valid paths linking two nodes a_i and a_j, and the length of the n-th path is l_n. Then

$$Sim(a_i, a_j) = \sum_{n=1}^{m(i,j)} \frac{1}{f(l_n)}, \quad \text{where } f(l_n) = \alpha^{(l_n - 1)} \text{ and } \alpha = 2.1$$
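A small numeric sketch of the similarity heuristic. It assumes the reading Sim(a_i, a_j) = Σ 1/f(l_n) with f(l) = α^(l − 1) and α = 2.1. The constant 2.1 appears on the slide, but the exact form of f is an assumption and should be checked against the paper. The function names are illustrative.

```python
ALPHA = 2.1  # constant from the slide; its exact role in f is assumed


def path_weight(length, alpha=ALPHA):
    """Assumed penalty f(l) = alpha ** (l - 1): longer paths weigh more."""
    return alpha ** (length - 1)


def similarity(path_lengths, alpha=ALPHA):
    """Sum of 1 / f(l_n) over the lengths of all valid paths."""
    return sum(1.0 / path_weight(length, alpha) for length in path_lengths)


# Two nodes linked by one path of length 2 and one path of length 3.
print(similarity([2, 3]))
# A second short path raises the similarity; a longer path lowers it.
print(similarity([2, 2]) > similarity([2]), similarity([2]) > similarity([3]))
```

Under this reading both heuristics hold: each extra valid path adds a positive term, and the contribution of a path shrinks geometrically with its length.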

8

Page 9: Ghost

Clustering Strategy
This paper adopts a powerful new clustering algorithm called Affinity Propagation (AP for short) for GHOST.

9

Page 10: Ghost

Affinity Propagation
Take each data point as a node in the network.
Consider all data points as potential cluster centers.
Start the clustering with a similarity between pairs of data points.
Exchange messages between data points until good cluster centers are found.
◦ Responsibility
◦ Availability

10

Page 11: Ghost

How Affinity Propagation works
[Flowchart: start → Construct Similarity Matrix → Update r(i,k) → Update a(i,k) → Decide exemplar → Change on Decision? (Y: update again; N: end)]

11

Page 12: Ghost

Responsibility r(i, k)
Responsibility: data point → candidate exemplar
◦ How well suited candidate exemplar k is for data point i, compared to all other possible exemplars.
◦ Self-responsibility r(k, k): prior likelihood for k to be chosen as exemplar. It is driven by the preference s(k, k), which is defined by the user and determines the number of clusters. Good choice: $s(k, k) = \mathrm{median}_{i \neq k}\, s(i, k)$

$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \{ a(i, k') + s(i, k') \}$$

12

Page 13: Ghost

Availability a(i, k)
Availability: candidate exemplar → data point
◦ How appropriate candidate k is as exemplar for data point i, taking support from other data points into account.

$$a(i, k) \leftarrow \min\Big\{ 0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\} \Big\}$$

$$a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\}$$

13

Page 14: Ghost

Affinity Propagation Update Rules
r(i, j) = 0; a(i, j) = 0; ∀ i, j
for t := 1 to num_iterations
  $r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \{ a(i, k') + s(i, k') \}$
  $a(i, k) \leftarrow \min\{ 0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\} \}$
  $a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\}$
end;
for all x_k with ( r(k, k) + a(k, k) > 0 )
◦ x_k is exemplar
◦ Assign non-exemplars x_i to closest exemplar under similarity measure s(i, k)
end;
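The update rules can be turned into a compact pure-Python sketch. It follows the responsibility and availability updates and the exemplar test r(k, k) + a(k, k) > 0. Three details are assumptions added for a runnable example and do not appear on the slides: the damping factor (commonly used with AP to avoid oscillation), the fallback when no point passes the exemplar test, and the toy one-dimensional data with a median preference placed on the diagonal s(k, k).

```python
import statistics


def affinity_propagation(s, num_iterations=200, damping=0.5):
    """Cluster n >= 2 points from an n x n similarity matrix (list of lists).

    s[k][k] holds the preference of point k. Returns a list that maps each
    point to the index of its exemplar.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(num_iterations):
        # r(i,k) <- s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][kk] + s[i][kk] for kk in range(n) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        for k in range(n):
            pos = [max(0.0, r[ii][k]) for ii in range(n)]
            total = sum(pos) - pos[k]  # sum over i' != k
            for i in range(n):
                if i == k:
                    new = total  # a(k,k)
                else:
                    # a(i,k): also exclude i itself from the sum
                    new = min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        # Fallback (assumption): take the single best-scoring candidate.
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]
    labels = [max(exemplars, key=lambda k: s[i][k]) for i in range(n)]
    for k in exemplars:
        labels[k] = k  # an exemplar always represents itself
    return labels


# Hypothetical data: two well-separated groups on a line,
# similarity = negative squared distance.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
n = len(points)
s = [[-(points[i] - points[k]) ** 2 for k in range(n)] for i in range(n)]
preference = statistics.median(
    s[i][k] for i in range(n) for k in range(n) if i != k
)
for k in range(n):
    s[k][k] = preference
labels = affinity_propagation(s)
print(labels)
```

Lowering the preference tends to produce fewer clusters and raising it tends to produce more, which is the knob the slides refer to as "preferences".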

14

Page 15: Ghost

15

Page 16: Ghost

How Affinity Propagation works

16

Page 17: Ghost

User Feedback
When we resolve a name and find that at least one direct coauthor is shared by two distinct authors with that name, any author with the name is referred to as a “dense author”.
◦ For example, two different authors named “Wei Wang” have both coauthored with “Jiawei Han”.
To deal with the low performance caused by dense authors, user feedback is adopted to improve the results:
◦ Decrease the number of valid paths.
◦ Increase the depth while searching for valid paths.
◦ Adjust the value of “preferences” in the AP clustering process.

17

Page 18: Ghost

Experimental Results

Identical Name   Publications  Actual Authors  Estimated Clusters  Precision  Recall  F-score
Cheng Chang      14            4               4                   1.00       1.00    1.00
Hui Fang         20            3               3                   1.00       1.00    1.00
Yi Li            40            22              22                  0.78       0.93    0.85
Jim Smith        21            3               5                   1.00       0.85    0.92
Michael Wagner   37            11              13                  1.00       0.55    0.71
Jianyong Wang    45            1               2                   1.00       0.87    0.93
Lei Wang         95            39              39                  0.99       1.00    0.99
Wei Wang         141           14              14                  0.88       0.92    0.90
Bin Yu           58            12              17                  0.97       0.69    0.81
Jing Zhang       60            25              23                  0.94       1.00    0.97
(Average)        -             -               -                   0.96       0.88    0.91

18

Evaluated on the real DBLP dataset
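The f-score column is consistent with the harmonic mean of precision and recall. The short check below assumes that definition, which the slides do not state, and recomputes a few rows from the table.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
rows = {
    "Yi Li": (0.78, 0.93),
    "Michael Wagner": (1.00, 0.55),
    "Bin Yu": (0.97, 0.69),
}
for name, (p, r) in rows.items():
    print(name, round(f_score(p, r), 2))
```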

Page 19: Ghost

Experimental Results
GHOST vs. DISTINCT
◦ DISTINCT is one of the state-of-the-art name distinction algorithms.

19

Page 20: Ghost

Conclusion
In this paper, we have explored the problem of name distinction and developed an effective five-step framework, GHOST, which employs only one type of relationship, namely co-authorship.
Experimental results show that GHOST can achieve both high precision and recall, and outperforms the state-of-the-art approach, DISTINCT.

20