ghost

GHOST: An Effective Graph-based Framework for Name Distinction

Author: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu

Publication: CIKM’08

Presenter: Jhih-Ming Chen

Outline
Introduction
The GHOST framework
◦ Graphical View of the Database
◦ Selection of Valid Paths
◦ Computing Similarity
◦ Clustering Strategy
◦ User Feedback
Experimental Results
Conclusion

Introduction
This paper focuses on the problem, in digital libraries, of distinguishing publications written by authors with identical names.
◦ GHOST (abbr. GrapH-based framewOrk for name diStincTion)
Example:
◦ Query term: “Lei Wang”
◦ About 151 publications are retrieved in DBLP
◦ There are no fewer than 39 different persons with this name

Introduction
Objective of name distinction:
◦ Group the publications written by authors who share an identical name into clusters, so that all publications in each cluster belong to the same author.

Graphical View of the Database
The database of publications D is represented by a graph G = {V, E}.
◦ Each node v ∈ V represents an author. Note that authors with an ambiguous name are treated as different nodes.
◦ An undirected edge represents a coauthorship.
◦ The edge between node a and node b has a label S(a, b), which denotes the complete set of publications coauthored by both a and b.
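A minimal sketch of this graph, for illustration only. It assumes that "treated as different nodes" means one node per publication in which the ambiguous name occurs; the function name build_coauthor_graph, the "name#publication" node naming, and the toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(publications, ambiguous_name):
    """Return {frozenset({a, b}): S(a, b)} for an undirected coauthor graph.

    publications maps a publication id to its list of author names.
    """
    edges = defaultdict(set)
    for pub_id, authors in publications.items():
        # Assumption: every occurrence of the ambiguous name becomes its own
        # node, tagged with the publication it appears in.
        nodes = [
            f"{name}#{pub_id}" if name == ambiguous_name else name
            for name in authors
        ]
        for a, b in combinations(nodes, 2):
            # The edge label S(a, b) collects all publications shared by a and b.
            edges[frozenset((a, b))].add(pub_id)
    return dict(edges)


# Toy data (hypothetical): three publications, one ambiguous name.
pubs = {
    "p1": ["Lei Wang", "A"],
    "p2": ["Lei Wang", "A", "B"],
    "p3": ["A", "B"],
}
graph = build_coauthor_graph(pubs, "Lei Wang")
```

Here the edge between A and B carries the label {p2, p3}, while each "Lei Wang" node is linked only to the coauthors of its own publication.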

Selection of Valid Paths
What GHOST needs to do is discover each triangle-like basic unit and check whether its longer sub-path is invalid.
◦ If an invalid path is found, the longer one is eliminated during the search.
An invalid sub-path emerges if and only if two of the three publication sets consist of only one identical publication.
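The check can be sketched as below, under one reading of the rule: in a triangle a, b, c, the longer sub-path a-b-c is invalid when both of its edge labels S(a, b) and S(b, c) contain exactly one publication and it is the same one. This reading and the function name longer_subpath_is_invalid are assumptions, not the paper's definition.

```python
def longer_subpath_is_invalid(s_ab, s_bc):
    """True when both edges of the sub-path a-b-c carry one identical publication.

    s_ab and s_bc are the publication sets S(a, b) and S(b, c).
    """
    return len(s_ab) == 1 and s_ab == s_bc


# a, b and c wrote a single paper p1 together: a-b-c adds nothing beyond a-c.
redundant = longer_subpath_is_invalid({"p1"}, {"p1"})
# a-b and b-c come from different papers: the path carries real evidence.
informative = longer_subpath_is_invalid({"p1"}, {"p2"})
```

The intuition is that a path built entirely from one shared paper only restates the direct edge it shortcuts, so it should not count as extra evidence.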

Computing Similarity
All valid paths are used to compute the similarity of two nodes, based on the following heuristics:
◦ The shortest path is the most indicative valid path.
◦ The more paths there are between two nodes, the more “similar” the two nodes may be.
◦ The notion of “six degrees of separation”.
Suppose there are m(i, j) valid paths linking two nodes a_i and a_j, and the length of the n-th path is l_n.

$$\mathrm{Sim}(a_i, a_j) = \sum_{n=1}^{m(i,j)} \frac{1}{f(l_n)} \quad (1), \qquad \text{where } f(l_n) = \alpha^{l_n} \text{ and } \alpha = 2.1$$
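A small sketch of how such a path-based similarity behaves. It assumes the similarity is the sum of 1 / f(l_n) over the valid paths, with f(l) = alpha^l and alpha = 2.1; this form is an assumption chosen to match the heuristics above, and the function name path_similarity is hypothetical.

```python
def path_similarity(path_lengths, alpha=2.1):
    """Sum of 1 / alpha**l over the lengths of all valid paths (assumed form)."""
    return sum(1.0 / alpha ** length for length in path_lengths)


one_short_path = path_similarity([2])
one_long_path = path_similarity([3])
two_paths = path_similarity([2, 3])
```

Under this assumption a shorter path contributes more than a longer one, and every additional valid path raises the similarity, which is the behavior the first two heuristics ask for.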

Clustering Strategy
This paper adopts a powerful new clustering algorithm called Affinity Propagation (AP for short) for GHOST.

Affinity Propagation
Take each data point as a node in the network.
Consider all data points as potential cluster centers.
Start the clustering with a similarity between pairs of data points.
Exchange messages between data points until good cluster centers are found.
◦ Responsibility
◦ Availability

How Affinity Propagation works
[Flowchart: start → Construct Similarity Matrix → Update r(i,k) → Update a(i,k) → Decide exemplar → Change on Decision? (Y: repeat the updates; N: end)]

Responsibility r(i, k)
Responsibility: data point → candidate exemplar
◦ How well suited the candidate is as exemplar for the data point, compared to all other possible exemplars.
◦ Self-responsibility r(k, k): prior likelihood for k to be chosen as exemplar. Defined by the user; determines the number of clusters. Good choice: $r(k,k) = \operatorname{median}_{i \neq k}\, s(i,k)$

$$r(i,k) = s(i,k) - \max_{k' \neq k} \{\, a(i,k') + s(i,k') \,\}$$

Availability a(i, k)
Availability: candidate exemplar → data point
◦ How appropriate the candidate is as exemplar for the data point, taking support from other data points into account.

$$a(i,k) = \min\Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Big\}$$

$$a(k,k) = \sum_{i' \neq k} \max\{0,\, r(i',k)\}$$

Affinity Propagation Update Rules
r(i, j) = 0; a(i, j) = 0; ∀ i, j
for i := 1 to num_iterations

$$r(i,k) = s(i,k) - \max_{k' \neq k} \{\, a(i,k') + s(i,k') \,\}$$

$$a(i,k) = \min\Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Big\}$$

$$a(k,k) = \sum_{i' \neq k} \max\{0,\, r(i',k)\}$$

end;
for all x_k with ( r(k, k) + a(k, k) > 0 )
◦ x_k is exemplar
◦ Assign non-exemplars x_i to closest exemplar under similarity measure s(i, k)
end;
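The update rules can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in GHOST. Two things are assumptions beyond the slides: the damping factor (a common stabilizer in AP implementations) and the convention of storing the user-set preference on the diagonal s[k][k]. The function name and the toy one-dimensional data are hypothetical.

```python
from statistics import median


def affinity_propagation(s, num_iterations=200, damping=0.5):
    """s is a square list of lists; s[k][k] holds the preference of point k."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for _ in range(num_iterations):
        # Responsibility: r(i,k) = s(i,k) - max over k' != k of a(i,k') + s(i,k')
        for i in range(n):
            for k in range(n):
                rival = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: positive responsibilities sent to k by the other points
        for i in range(n):
            for k in range(n):
                support = sum(
                    max(0.0, r[ip][k]) for ip in range(n) if ip != i and ip != k
                )
                target = support if i == k else min(0.0, r[k][k] + support)
                a[i][k] = damping * a[i][k] + (1 - damping) * target
    # x_k is an exemplar when r(k,k) + a(k,k) > 0
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    # Non-exemplars go to the closest exemplar under s(i, k)
    labels = [
        i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
        for i in range(n)
    ]
    return exemplars, labels


# Toy data (hypothetical): two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
n = len(points)
s = [[-(points[i] - points[j]) ** 2 for j in range(n)] for i in range(n)]
# Preference set to the median pairwise similarity, as suggested above.
preference = median(s[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    s[k][k] = preference
exemplars, labels = affinity_propagation(s)
```

Lowering the preference tends to produce fewer clusters and raising it produces more, which is the knob the user-feedback step adjusts.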

User Feedback
When we resolve a name and find that at least one direct coauthor is shared by two distinct authors with that name, any author with the name is referred to as a “dense author”.
◦ For example, two different authors named “Wei Wang” have both coauthored with “Jiawei Han”.
To deal with the low performance caused by dense authors, user feedback is adopted to improve the results.
◦ Decrease the number of valid paths.
◦ Increase the depth while searching for valid paths.
◦ Adjust the value of “preferences” in the AP clustering process.

Experimental Results
Evaluated on the real DBLP dataset.

Identical Name   Publications  Actual Authors  Estimated Clusters  Precision  Recall  F-score
Cheng Chang      14            4               4                   1.00       1.00    1.00
Hui Fang         20            3               3                   1.00       1.00    1.00
Yi Li            40            22              22                  0.78       0.93    0.85
Jim Smith        21            3               5                   1.00       0.85    0.92
Michael Wagner   37            11              13                  1.00       0.55    0.71
Jianyong Wang    45            1               2                   1.00       0.87    0.93
Lei Wang         95            39              39                  0.99       1.00    0.99
Wei Wang         141           14              14                  0.88       0.92    0.90
Bin Yu           58            12              17                  0.97       0.69    0.81
Jing Zhang       60            25              23                  0.94       1.00    0.97
(Average)        -             -               -                   0.96       0.88    0.91
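The slides do not define the f-score. The sketch below assumes it is the usual harmonic mean of precision and recall, which reproduces the tabulated values for the rows checked here. The function name f_score is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from three rows of the table.
michael_wagner = round(f_score(1.00, 0.55), 2)
bin_yu = round(f_score(0.97, 0.69), 2)
yi_li = round(f_score(0.78, 0.93), 2)
```

Because the harmonic mean is pulled toward the smaller value, a row such as Michael Wagner (perfect precision, recall 0.55) still ends up with a modest f-score.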

Experimental Results
GHOST vs. DISTINCT
◦ DISTINCT is one of the state-of-the-art name distinction algorithms.

Conclusion
In this paper, we have explored the problem of name distinction and developed an effective five-step framework, GHOST, which employs only one type of relationship, namely co-authorship.
Experimental results show that GHOST achieves both high precision and high recall, and outperforms the state-of-the-art approach, DISTINCT.
