Efficient Approximate Entity Extraction with Edit Distance Constraints
Wei Wang1, Chuan Xiao1, Xuemin Lin1 and Chengqi Zhang2
1 University of New South Wales and NICTA; 2 University of Technology, Sydney
Named Entity Recognition
Dictionary-based NER
Dictionary of Entities
Isaac Newton, Sigmund Freud
English, Austrian
physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...
Documents
1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
2 Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.
Approximate Entity Extraction
What if the data are not cleaned or standardized (typos, multiple representations, etc.)?
Example, multiple representations of the same entity: al qaeda, al qaida, al-qaeda, al-qa’ida
Using similarity measures
  token-based measures, e.g., Jaccard: two token sets x and y match if $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
  x = {al, qaeda}, y = {al, qaida}: J(x, y) = 1/3 ≈ 0.33
  A threshold of t = 0.33 works well for entities with several tokens, but {al, qaeda} will then also match {al, gore}!
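The Jaccard computation on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `jaccard` and the variable names are assumptions, not part of the original slides.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


# Two spellings of the same entity share one token out of three.
same_entity = jaccard({"al", "qaeda"}, {"al", "qaida"})
# An unrelated entity shares the same single token, so it scores the same.
false_match = jaccard({"al", "qaeda"}, {"al", "gore"})
```

Both pairs score 1/3, so no threshold on the token-level score can accept the first pair and reject the second.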
Using Edit Distance Constraints
Using string-based measures: edit distance
Problem Definition
  Given a document R and a dictionary E of entities, the task of approximate entity extraction with edit distance threshold d is to find all substrings of R that are within edit distance d of at least one entity in E:
  $\{ \langle R[i\,..\,j], E_k \rangle \mid E_k \in E,\ ed(R[i\,..\,j], E_k) \le d \}$
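As a reference point for the problem definition, the following is a naive baseline sketch that enumerates substrings and verifies each one with the standard dynamic-programming edit distance. It is not the method proposed in the talk; the names `edit_distance` and `extract_naive` are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def extract_naive(doc: str, entities: list, d: int) -> list:
    """Return (start, end, entity) for every substring doc[start:end]
    within edit distance d of an entity. Only substring lengths in
    [len(e) - d, len(e) + d] can qualify, so only those are tried."""
    results = []
    for e in entities:
        for length in range(max(1, len(e) - d), len(e) + d + 1):
            for i in range(len(doc) - length + 1):
                if edit_distance(doc[i:i + length], e) <= d:
                    results.append((i, i + length, e))
    return results


matches = extract_naive("al qaida", ["qaeda"], 1)
```

Every candidate substring is verified here, which is what filtering and indexing techniques try to avoid.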
Previous Approaches
q-gram based method
  count filtering: s and t share at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d (at most q*d q-grams are destroyed by d edit operations)
  position filtering: positions of common q-grams should be within d
  length filtering: | len(s) - len(t) | ≤ d
Steps
  index the q-grams of the entities
  probe the index with the q-grams of each substring (query) of the document to form candidates
  verify the candidates
Example: q = 3
  Rhode_Island: Rho hod ode de_ e_I _Is Isl sla lan and
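Count filtering can be sketched as follows. The helper names (`qgrams`, `count_lower_bound`, `passes_count_filter`) are assumptions made for illustration; the bound itself is the LB(s,t) formula from this slide.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s: str, t: str, q: int, d: int) -> int:
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_count_filter(s: str, t: str, q: int, d: int) -> bool:
    """True if s and t share enough q-grams (counted as multisets)
    to possibly be within edit distance d."""
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return common >= count_lower_bound(s, t, q, d)


grams = qgrams("Rhode_Island", 3)
bound = count_lower_bound("Rhode_Island", "Rhode_Izland", 3, 1)
survives = passes_count_filter("Rhode_Island", "Rhode_Izland", 3, 1)
short_bound = count_lower_bound("qaeda", "qaida", 3, 1)
```

For the 12-character example the bound is 7 of 10 q-grams, while for a 5-character pair with q = 3 and d = 1 the bound drops to 0 and the filter prunes nothing.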
Drawbacks of q-gram Based Methods
Entities are short
  we have to use a small q to ensure that the lower bound on matching q-grams is positive
Short q-grams result in poor performance
  short q-grams are frequent: long inverted lists
  the lower bound is low for short entities: large candidate size
Every query length from Lmin - d to Lmax + d has to be tried at every starting position.
Document
1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
Dictionary (Lmin=9, Lmax=43)
1 physicist
2 mathematician
3 Philosophiæ Naturalis Principia Mathematica
FastSS Algorithm [T. Bocek et al. 2007]
Basic Idea: Neighborhood Generation
  generate the variants of each entity and query by enumerating edit operations at every possible position
Steps
  enumerate at most d deletions for each entity; the resulting strings are called its d-variant family and are inserted into an inverted index
  generate the d-variant family of each query, probe the index to form candidates, and then verify them
Example, d = 1
  e = qaeda, Ve = {qaeda, aeda, qeda, qada, qaea, qaed}
  q = qaida, Vq = {qaida, aida, qida, qada, qaia, qaid}
  Ve and Vq share the variant qada, so e becomes a candidate for q
Problem
  the size of the d-variant family of each entity (query) is O(|s|^d): too many variants when entities are long or d is large!
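The deletion-neighborhood idea can be sketched in Python as below. This is a minimal illustration of the steps on this slide, not the FastSS implementation; `deletion_variants`, `build_index`, and `candidates` are hypothetical names, and candidates would still need to be verified with an edit distance computation.

```python
from collections import defaultdict


def deletion_variants(s: str, d: int) -> set:
    """All strings obtained from s by deleting at most d characters."""
    variants = {s}
    frontier = {s}
    for _ in range(d):
        # Delete one more character from every string produced so far.
        frontier = {v[:i] + v[i + 1:] for v in frontier for i in range(len(v))}
        variants |= frontier
    return variants


def build_index(entities: list, d: int) -> dict:
    """Inverted index from each variant to the entities that produce it."""
    index = defaultdict(set)
    for e in entities:
        for v in deletion_variants(e, d):
            index[v].add(e)
    return index


def candidates(query: str, index: dict, d: int) -> set:
    """Entities sharing at least one variant with the query."""
    found = set()
    for v in deletion_variants(query, d):
        found |= index.get(v, set())
    return found


Ve = deletion_variants("qaeda", 1)
Vq = deletion_variants("qaida", 1)
cands = candidates("qaida", build_index(["qaeda", "gore"], 1), 1)
```

With d = 1 the two families intersect only in "qaada"-free form "qada", which is enough to surface "qaeda" as a candidate for the query "qaida".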
Partitioning Scheme
How to reduce the number of variants?
  immediate solution: divide each entity (query) into several partitions and generate d-variants within each partition only; this is guaranteed not to miss any result
Still too many variants?
  pigeon-hole principle: if we consider shifting and scaling, there exists an entity partition and a query partition whose edit distance is at most 1
  generate the 1-variant family of each partition
Partitioning Scheme (continued)
  divide each entity (query) into k = ceil[(d+1)/2] partitions
  shift within the range [-d, d]
  scale within the range [-2, 2] (it can be proved that 2 is enough)
  shifting and scaling are only needed on entities
Special cases
  first partition (always starts from the first character): only need to consider scaling within [-2, 2]
  last partition (always ends with the last character): only need to consider the same amount of shifting and scaling within [-d, d]
Partitioning Scheme - Example
Example, d = 3
e = abcdefgh
q = axxbcdefgyh
Partitioning, k = 2 (each partition is represented in the form <str, partition_id>)
  Pe = {<ab,1>; <abc,1>; <abcd,1>; <abcde,1>; <abcdef,1>; <bcdefgh,2>; <cdefgh,2>; <defgh,2>; <efgh,2>; <fgh,2>; <gh,2>; <h,2>}
  Pq = {<axxbc,1>; <defgyh,2>}
Generating 1-variants
  V{defgh} and V{defgyh} share the common variant 'defgh', so this candidate will be identified
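The example can be reproduced with the sketch below. Two points are assumptions rather than statements from the slides: the base partition boundaries are taken as an even integer split (partition i covers positions i*n//k up to (i+1)*n//k), and middle partitions (only present when k > 2) combine every shift in [-d, d] with every scale in [-2, 2]. All function names are hypothetical.

```python
from math import ceil


def num_partitions(d: int) -> int:
    return ceil((d + 1) / 2)


def base_bounds(n: int, k: int) -> list:
    """Assumed even split of a length-n string into k half-open ranges."""
    return [(i * n // k, (i + 1) * n // k) for i in range(k)]


def query_partitions(q: str, d: int) -> list:
    """Queries are partitioned once, without shifting or scaling."""
    k = num_partitions(d)
    return [(q[a:b], i + 1) for i, (a, b) in enumerate(base_bounds(len(q), k))]


def entity_partitions(e: str, d: int) -> set:
    """Shifted and scaled partitions of an entity as (str, partition_id)."""
    k, n, out = num_partitions(d), len(e), set()
    for i, (a, b) in enumerate(base_bounds(n, k)):
        if k == 1:
            out.add((e, 1))                      # single partition: whole string
        elif i == 0:
            for scale in range(-2, 3):           # fixed start, scaled end
                if 0 < b + scale <= n:
                    out.add((e[:b + scale], 1))
        elif i == k - 1:
            for shift in range(-d, d + 1):       # fixed end, shifted start
                if 0 <= a + shift < n:
                    out.add((e[a + shift:], i + 1))
        else:
            for shift in range(-d, d + 1):
                for scale in range(-2, 3):
                    start, end = a + shift, b + shift + scale
                    if 0 <= start < end <= n:
                        out.add((e[start:end], i + 1))
    return out


def one_variants(s: str) -> set:
    """The string itself plus every single-character deletion."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


Pe = entity_partitions("abcdefgh", 3)
Pq = query_partitions("axxbcdefgyh", 3)
shared = one_variants("defgh") & one_variants("defgyh")
```

Under these assumptions `Pe` and `Pq` coincide with the sets listed above, and the second partitions meet in the variant "defgh".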
Prefix Pruning
What if a partition is still quite long?
  still many 1-variants
  solution: generate the 1-variant family on the prefix only!
Prefix Pruning
  If a partition is longer than a threshold l, we only generate the 1-variant family on its l-prefix.
Example, l = 5
  P = abcdefg: generate the 1-variant family on its 5-prefix
  P[1 .. 5] = abcde
  Vp[1 .. 5] = {abcde, bcde, acde, abde, abce, abcd}
Space complexity (# of variants generated)
  FastSS: O(|s|^d)
  after partitioning and prefix pruning: O(l * d^2)
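Prefix pruning amounts to truncating a partition before generating its variants. A minimal sketch, with hypothetical function names:

```python
def one_variants(s: str) -> set:
    """The string itself plus every single-character deletion."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def prefix_variants(partition: str, l: int) -> set:
    """1-variant family of the l-prefix. Partitions no longer than l
    are used whole, since slicing past the end is harmless."""
    return one_variants(partition[:l])


pruned = prefix_variants("abcdefg", 5)
unpruned = one_variants("abcdefg")
```

For P = abcdefg and l = 5 the pruned family has 6 strings (the prefix abcde and its five single deletions) instead of the 8 generated from the full partition; the saving grows with partition length.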
NGPP Algorithm
Neighborhood Generation + Partitioning + Prefix Pruning
Balance between variant size and selectivity
  different schemes to deal with short and long entities
Index short and long entities
  short: for entities shorter than k*l+d, we index the d-variant family of the l-prefix (prefix pruning only)
  long: for entities no shorter than k*l, we first divide them into k partitions and index the 1-variant family of the l-prefix of each partition (partitioning + prefix pruning)
Scan documents
  for each starting position, enumerate the query length from Lmin - d to l
  generate its d-variant family to search for short entities
  generate its 1-variant family to search for long entities
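The short/long split can be illustrated with the sketch below, using d = 2 and l = 4 as sample parameters. Lengths in the range [k*l, k*l+d) satisfy both conditions; how that overlap is resolved is not stated on the slide, so this sketch simply reports both labels. The function names are hypothetical.

```python
from math import ceil


def ngpp_thresholds(d: int, l: int) -> tuple:
    """Return (short_below, long_from) = (k*l + d, k*l), k = ceil((d+1)/2)."""
    k = ceil((d + 1) / 2)
    return k * l + d, k * l


def classify(entity: str, d: int, l: int) -> set:
    """Labels that apply to an entity under the two length conditions."""
    short_below, long_from = ngpp_thresholds(d, l)
    labels = set()
    if len(entity) < short_below:
        labels.add("short")   # index d-variants of the l-prefix
    if len(entity) >= long_from:
        labels.add("long")    # partition, then index 1-variants of l-prefixes
    return labels


limits = ngpp_thresholds(2, 4)
capital_labels = classify("capital", 2, 4)
providence_labels = classify("Providence", 2, 4)
```

With these parameters the thresholds are 10 and 8, so a 7-character entity is short only and a 10-character entity is long only.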
NGPP Example
d = 2, l = 4 (short < 10, long >= 8)
Entities
  e1 = ‘Providence’ (long)
  e2 = ‘capital’ (short)
Document
  Prowidnce is the kaepital of Rhode Island.
Index
  e1 Providence: generate the 1-variant family of the partitions pr, pro, prov, provi, provid, vidence, idence, dence, ence, nce
  e2 capital: generate the d-variant family
Queries
  Prow, rowi, owid, ...: 1-variant match with e1 Providence
  kaep, ...: d-variant match with e2 capital
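The two matches in this example can be checked with a short sketch. It assumes that strings are lowercased before comparison and that the 4-prefix "capi" of e2 is what gets indexed for the short entity; `deletion_variants` is a hypothetical helper.

```python
def deletion_variants(s: str, d: int) -> set:
    """All strings obtained from s by deleting at most d characters."""
    variants = {s}
    frontier = {s}
    for _ in range(d):
        frontier = {v[:i] + v[i + 1:] for v in frontier for i in range(len(v))}
        variants |= frontier
    return variants


d, l = 2, 4

# Short entity: d-variants of the l-prefix of 'capital' vs. the query 'kaep'.
short_match = deletion_variants("capital"[:l], d) & deletion_variants("kaep", d)

# Long entity: 1-variants of the partition 'prov' vs. the query 'prow'.
long_match = deletion_variants("prov", 1) & deletion_variants("prow", 1)
```

The shared variants are "ap" for the short entity and "pro" for the long one; each shared variant only yields a candidate, which is then verified against the full entity.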
Experiment Settings
Algorithms: NGPP, FastSS, q-gram based method
Measures: number of variants, candidate size, running time
Datasets
  dataset                                # of records   avg. string length
  DBLP   DICT (author)                   108k           14.5
  DBLP   DOC (author, title)             87k            104.7
  GENE   DICT (gene/protein name)        381k           22.4
  GENE   DOC (author, title, abstract)   10k            870.0
  CONLL  DICT (person, location)         8k             12.6
  CONLL  DOC (news article)              19k            819.0
Experiment Results
NGPP vs FastSS (DBLP; d = 2)
  algorithm       # of variants   candidate size   running time
  FastSS          7500M           2.1M             2643s
  NGPP (l = 10)   150M            11M              40s
Experiment Results (continued)
NGPP vs q-gram based method (DBLP; d = 1, 2, 3)
[Figures: candidate size and running time]
Conclusion
Contributions
  an efficient algorithm for approximate entity extraction with edit distance constraints based on neighborhood generation
  two techniques to reduce the number of variants generated, as well as the running time: partitioning and prefix pruning
Future work
  approximate multiple pattern matching
  other similarity measures, e.g., the function used in DNA/protein sequence alignment
Thank you!
Questions?
Related Work
Neighborhood generation approaches
  E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994.
  T. Bocek, E. Hunt, and B. Stiller. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007.
q-gram based approaches
  L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
  C. Xiao, W. Wang, and X. Lin. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.
Alternative: use vgrams instead of q-grams
  C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
  X. Yang, B. Wang, and C. Li. Cost-based variable length gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.