Post on 25-May-2015
GHOST: An Effective Graph-based Framework for Name Distinction
Author: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu
Publication: CIKM’08
Presenter: Jhih-Ming Chen
Outline
Introduction
The GHOST framework
◦ Graphical View of the Database
◦ Selection of Valid Paths
◦ Computing Similarity
◦ Clustering Strategy
◦ User Feedback
Experimental Results
Conclusion
Introduction
This paper focuses on the problem, in digital libraries, of distinguishing publications written by authors with identical names.
◦ GHOST (abbr. GrapH-based framewOrk for name diStincTion)
Example:
◦ Query term: "Lei Wang"
◦ About 151 publications are retrieved in DBLP
◦ There are no fewer than 39 different persons with this name
Introduction
Objective of name distinction:
◦ Group publications coauthored by those with an identical name into clusters so that the elements in each cluster belong to the same author.
Graphical View of the Database
The database of publications D is represented by a graph G = {V, E}.
◦ Each node v ∈ V represents an author. Note that authors with an ambiguous name are treated as different nodes.
◦ An undirected edge represents a coauthorship.
◦ The edge between node a and node b has a label S(a, b), which denotes the complete set of publications coauthored by both a and b.
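This representation can be sketched in a few lines. The (title, authors) tuple input format below is a hypothetical choice for illustration, not the paper's actual input format, and the sketch does not attempt the splitting of ambiguous names into separate nodes:

```python
from collections import defaultdict

def build_graph(publications):
    """Build the edge labels of the coauthorship graph.
    publications: list of (title, [author, ...]) tuples (hypothetical format).
    Returns a mapping from an unordered author pair {a, b} to S(a, b),
    the set of publications coauthored by both a and b."""
    edge_label = defaultdict(set)
    for title, authors in publications:
        for i in range(len(authors)):
            for j in range(i + 1, len(authors)):
                # an undirected edge, keyed by the unordered pair
                edge_label[frozenset((authors[i], authors[j]))].add(title)
    return edge_label
```

The node set V is implicit here: it is the set of all author names appearing in any pair.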
Selection of Valid Paths
What GHOST needs to do is discover the existence of a triangle-like basic unit and check whether the longer sub-path is invalid.
◦ If an invalid path is found, the longer one is eliminated in the process of searching.
An invalid sub-path emerges if and only if two of the three publication sets consist of only one identical publication.
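One reading of this rule, as an illustrative sketch rather than the paper's exact procedure: in a triangle a-b-c with a direct edge a-c, the longer sub-path a→b→c is pruned when its two edge labels reduce to the same single publication, i.e. the indirect evidence comes from one shared paper:

```python
def longer_subpath_invalid(s_ab, s_bc):
    """For a triangle a-b-c with direct edge a-c, decide whether the
    longer sub-path a->b->c is invalid. One reading of the slide's rule:
    invalid iff the two publication sets along it consist of only one
    identical publication."""
    return len(s_ab) == 1 and s_ab == s_bc
```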
Computing Similarity
All valid paths are used to compute the similarity of the two nodes, based on the following heuristics:
◦ The shortest path is the most indicative valid path.
◦ The more paths there exist between two nodes, the more "similar" the two nodes may be.
◦ The notion of "six degrees of separation".
Suppose there are m(i, j) valid paths linking two nodes ai and aj, and the length of the nth path is ln:

Sim(ai, aj) = Σ_{n=1}^{m(i,j)} f(ln), where f(ln) = 1 / 2^(ln − 1)
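A minimal sketch of this computation. The decay f(l) = 1/2^(l−1) is reconstructed from the garbled slide formula, and capping usable path length at 6 is an assumption about how the "six degrees of separation" heuristic is applied:

```python
def path_similarity(path_lengths):
    """Sim(ai, aj) = sum over valid paths of f(ln), with f(l) = 1 / 2**(l - 1)
    (reconstructed from the slide). Paths longer than 6 hops are skipped,
    reflecting the 'six degrees of separation' heuristic (an assumption)."""
    return sum(1 / 2 ** (l - 1) for l in path_lengths if l <= 6)
```

A direct coauthorship (one path of length 1) thus contributes 1.0, and each extra hop halves a path's contribution.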
Clustering Strategy
This paper adopts a powerful new clustering algorithm called Affinity Propagation (AP for short) for GHOST.
Affinity Propagation
Take each data point as a node in the network.
Consider all data points as potential cluster centers.
Start the clustering with a similarity between pairs of data points.
Exchange messages between data points until good cluster centers are found.
◦ Responsibility
◦ Availability
How Affinity Propagation works
[Flowchart] start → Construct Similarity Matrix → Update r(i, k) → Update a(i, k) → Change on decision? — Y: repeat the updates; N: Decide exemplar → end
Responsibility r(i, k)
Responsibility: data point → candidate exemplar
◦ How well suited is k as an exemplar for data point i, compared to all other possible exemplars.
◦ Self-responsibility r(k, k): prior likelihood for k to be chosen as exemplar. Defined by the user; determines the number of clusters. Good choice: r(k, k) = median_{i≠k} s(i, k)

r(i, k) = s(i, k) − max_{k'≠k} { a(i, k') + s(i, k') }
Availability a(i, k)
Availability: candidate exemplar → data point
◦ How appropriate is candidate k as an exemplar for data point i, taking support from other data points into account.

a(i, k) = min { 0, r(k, k) + Σ_{i'∉{i,k}} max{0, r(i', k)} }   (i ≠ k)
a(k, k) = Σ_{i'≠k} max{0, r(i', k)}
Affinity Propagation Update Rules
r(i, j) := 0; a(i, j) := 0, ∀ i, j
for i := 1 to num_iterations
  r(i, k) := s(i, k) − max_{k'≠k} { a(i, k') + s(i, k') }
  a(i, k) := min { 0, r(k, k) + Σ_{i'∉{i,k}} max{0, r(i', k)} }   (i ≠ k)
  a(k, k) := Σ_{i'≠k} max{0, r(i', k)}
end
for all xk with ( r(k, k) + a(k, k) > 0 )
◦ xk is an exemplar
◦ Assign non-exemplars xi to the closest exemplar under similarity measure s(i, k)
end
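The update loop above can be sketched end to end in NumPy. This is a minimal illustration of standard Affinity Propagation with damping added for numerical stability, not the paper's implementation:

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Minimal Affinity Propagation sketch.
    S: n x n similarity matrix; S[k, k] holds the 'preference' of point k."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(max_iter):
        # r(i, k) = s(i, k) - max_{k' != k} { a(i, k') + s(i, k') }
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf          # mask the max to find the runner-up
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i, k) = min{0, r(k, k) + sum_{i' not in {i,k}} max{0, r(i', k)}}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k, k) un-thresholded
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()           # a(k, k) = sum_{i' != k} max{0, r(i', k)}
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    exemplars = np.where(R.diagonal() + A.diagonal() > 0)[0]
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars                # exemplars label themselves
    return exemplars, labels
```

Setting every preference S[k, k] to the median pairwise similarity is the usual default and corresponds to the slide's "good choice" for self-responsibility.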
User Feedback
When we resolve a name and find that there is at least one direct coauthor shared by two distinct authors, any author with that name is referred to as a "dense author".
◦ For example, two different authors named "Wei Wang" have coauthored with "Jiawei Han".
To deal with the low performance caused by dense authors, user feedback is adopted to achieve enhancement:
◦ Decrease the number of valid paths.
◦ Increase the depth while searching for valid paths.
◦ Adjust the value of "preferences" in the AP clustering process.
Experimental Results
Evaluated on the real DBLP dataset.

Identical Name    Publications  Actual Authors  Estimated Clusters  Precision  Recall  F-score
Cheng Chang       14            4               4                   1.00       1.00    1.00
Hui Fang          20            3               3                   1.00       1.00    1.00
Yi Li             40            22              22                  0.78       0.93    0.85
Jim Smith         21            3               5                   1.00       0.85    0.92
Michael Wagner    37            11              13                  1.00       0.55    0.71
Jianyong Wang     45            1               2                   1.00       0.87    0.93
Lei Wang          95            39              39                  0.99       1.00    0.99
Wei Wang          141           14              14                  0.88       0.92    0.90
Bin Yu            58            12              17                  0.97       0.69    0.81
Jing Zhang        60            25              23                  0.94       1.00    0.97
(Average)         -             -               -                   0.96       0.88    0.91
Experimental Results
GHOST vs. DISTINCT
◦ DISTINCT is one of the state-of-the-art name distinction algorithms.
Conclusion
In this paper, we have explored the problem of name distinction and developed an effective five-step framework, GHOST, which employs only one type of relationship, namely coauthorship.
Experimental results show that GHOST achieves both high precision and recall, and outperforms the state-of-the-art approach, DISTINCT.