

Experiments

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

Entity Extraction

A Document
An Efficient Filter for Approximate Membership Checking. Venkaee shga Kamunshik kabarati, Dong Xin, Surauijt Chadhuri SIGMOD

Approximate Entity Extraction

#1: Data in the real world is dirty
ed: minimum # of single-character transformations
e.g., Surauijt Chadhuri vs. Surajit Chaudhuri

#2: Improve extraction quality

Problem Definition
Given a dictionary of entities E = {e1, e2, . . . , en}, a document D, and a predefined edit distance threshold τ, approximate entity extraction finds all “similar” pairs <s, ei> such that ED(s, ei) ≤ τ, where s is a substring of D and ei ∈ E.
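To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the poster's method, only a baseline that applies the definition directly; the names edit_distance and naive_extract are hypothetical, and the edit distance is assumed to be the standard Levenshtein distance (insertions, deletions, substitutions).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def naive_extract(entities, document, tau):
    """Return every (start, end, entity) with
    ED(document[start:end], entity) <= tau (hypothetical baseline)."""
    results = []
    for e in entities:
        # A similar substring can differ in length from e by at most tau.
        lo, hi = max(1, len(e) - tau), len(e) + tau
        for start in range(len(document)):
            for length in range(lo, hi + 1):
                end = start + length
                if end > len(document):
                    break
                if edit_distance(document[start:end], e) <= tau:
                    results.append((start, end, e))
    return results
```

This baseline verifies every substring of a compatible length against every entity, which is the cost the filtering techniques on the rest of the poster aim to avoid.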

Dong Deng, Guoliang Li, Jianhua Feng
Department of Computer Science, Tsinghua University, Beijing, China

Trie-based Algorithms

Search-Extension Method

Copyright © 2012, Database Research Group, Tsinghua University

http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste

A Dictionary of Entities
1 Dong Xin
2 Surajit Chaudhuri

Entity Extraction
Locate entities from the document, e.g., Dong Xin

#3: Many real applications: information retrieval, molecular biology, bioinformatics, natural language processing

ID  Entities         Length
1   vancouver        10
2   vanateshe        11
3   surajit_chaudri  8
4   caushit_chaudu   8
5   caushit_chakra   9

Entities Document

An example result with ed threshold 2 <surajit_chaudri, surajit_chaudhuri>

an efficient filter for approximate membershep checking. kaushit chekrabarti, surajit chaudhuri, vankatesh ganti, dong xin. vancouver, canada. sigmod 2008.

Elapsed Time

Implemented in C++ on Ubuntu; Intel Core E5420 2.5 GHz CPU and 4 GB memory

ed=3

Given an entity e partitioned into τ + 1 segments and a substring s, if s is similar to e within threshold τ, then s must contain a substring that exactly matches a segment of e.
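The filter condition above can be sketched in a few lines of Python. This is an illustration only: even_partition and passes_segment_filter are hypothetical names, and the even split is just one of the partition schemes the poster mentions.

```python
def even_partition(entity: str, tau: int):
    """Split entity into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(entity), k)
    segments, pos = [], 0
    for i in range(k):
        # The first `extra` segments get one additional character.
        length = base + (1 if i < extra else 0)
        segments.append(entity[pos:pos + length])
        pos += length
    return segments


def passes_segment_filter(s: str, entity: str, tau: int) -> bool:
    """Necessary condition for ED(s, entity) <= tau: s contains at least
    one segment of entity verbatim. Passing the filter does not prove
    similarity; failing it proves dissimilarity."""
    return any(seg in s for seg in even_partition(entity, tau))
```

For example, even_partition("vanateshe", 2) gives the segments van, ate, she. The reasoning is a pigeonhole argument: τ edit operations can touch at most τ of the τ + 1 segments, so at least one segment survives unchanged.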

Trie-based Framework

1. Partition the entities into segments [Fig 1].

2. Index the segments using a trie structure [Fig 2].

3. From the document, find the matched segments by enumerating all substrings.
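The three framework steps can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the poster's C++ implementation: TrieNode, build_segment_trie, and find_matched_segments are hypothetical names, and each trie node ending a segment is assumed to hold an inverted list of (entity id, segment offset) pairs.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.postings = []   # (entity_id, segment_offset) for segments ending here


def build_segment_trie(partitions):
    """partitions: {entity_id: [segment, ...]} with segments in entity order."""
    root = TrieNode()
    for eid, segments in partitions.items():
        offset = 0
        for seg in segments:
            node = root
            for ch in seg:
                node = node.children.setdefault(ch, TrieNode())
            node.postings.append((eid, offset))
            offset += len(seg)
    return root


def find_matched_segments(root, document):
    """Enumerate document substrings by walking the trie from every start
    position; return (start, end, entity_id, segment_offset) tuples."""
    matches = []
    for start in range(len(document)):
        node = root
        for end in range(start, len(document)):
            node = node.children.get(document[end])
            if node is None:
                break  # no indexed segment has this prefix
            for eid, offset in node.postings:
                matches.append((start, end + 1, eid, offset))
    return matches
```

Walking the trie from each start position stops as soon as no segment shares the current prefix, so most substrings are never materialized.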

Optimizing Partition Scheme

Optimization objective: C = M[τ+1][m], where M[i][j] is the minimum total partition weight to partition the string c1 c2 … cj-1 cj into i segments.

Scalability

Datasets

Taste vs. Faerie & NGPP

1 & 2. The same as in the trie-search method [Fig 1 & 2].

3.1 Search: check whether each substring of the document is a trie leaf node.
3.2 Extension: extend the matched segments to find similar pairs [Fig 3].
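The extension step can be sketched as below, assuming the search step has already reported one matched segment by its position in the document and its offset in the entity. The function name extend_match and its parameters are hypothetical, and the verification here recomputes edit distances from scratch rather than using the poster's optimized extension.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def extend_match(document, entity, tau, seg_start, seg_end, seg_offset):
    """Extend document[seg_start:seg_end], which equals the entity segment
    starting at seg_offset, to all (start, end) whose left and right parts
    are within tau edits of the entity's left and right parts in total."""
    seg_len = seg_end - seg_start
    e_left = entity[:seg_offset]
    e_right = entity[seg_offset + seg_len:]
    results = set()
    for left_len in range(max(0, len(e_left) - tau), len(e_left) + tau + 1):
        start = seg_start - left_len
        if start < 0:
            break
        d_left = edit_distance(document[start:seg_start], e_left)
        if d_left > tau:
            continue
        for right_len in range(max(0, len(e_right) - tau),
                               len(e_right) + tau + 1):
            end = seg_end + right_len
            if end > len(document):
                break
            d_right = edit_distance(document[seg_end:end], e_right)
            if d_left + d_right <= tau:
                results.add((start, end))
    return results
```

Because the matched segment is aligned exactly, the left and right distances add up to an upper bound on ED(s, e), so every reported pair is a true result.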

Search-Extension VS. Sort-Extension

Candidate Number

Even vs. Dict+Doc:

Each of the τ + 1 = 3 segments needs >= 1 edit operation, so the total is >= τ + 1 = 3 edit operations: NOT SIMILAR

Trie-search:

Fig 1: Partition

Fig 2: Trie Structure

Fig 3: Extension

Sort-Extension Method

Fig 4.1 Example 1

1. Sort the inverted list in each leaf node.
2. Share the computation of the longest common prefix while extending the matched segment.
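One way to picture the shared-prefix idea is the sketch below. It assumes the strings being verified (for example, the entity parts to one side of a matched segment) are processed in sorted order, so that consecutive strings often share a prefix and the edit-distance rows for that prefix can be reused. The names next_row and distances_with_shared_prefix are hypothetical, and this is an illustration of the principle rather than the poster's exact procedure.

```python
def next_row(prev_row, ch, query):
    """One dynamic-programming step: extend the string side by character ch.
    prev_row[j] is the distance between the current prefix and query[:j]."""
    row = [prev_row[0] + 1]
    for j, q in enumerate(query, start=1):
        row.append(min(prev_row[j] + 1,          # delete ch
                       row[j - 1] + 1,           # insert q
                       prev_row[j - 1] + (ch != q)))
    return row


def distances_with_shared_prefix(query, strings):
    """Edit distance from query to each string, reusing the DP rows of the
    longest common prefix between consecutive strings in sorted order."""
    result = {}
    rows = [list(range(len(query) + 1))]  # rows[k]: row after k characters
    prev = ""
    for s in sorted(strings):
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        del rows[lcp + 1:]                # keep only rows of the shared prefix
        for ch in s[lcp:]:
            rows.append(next_row(rows[-1], ch, query))
        result[s] = rows[-1][-1]
        prev = s
    return result
```

With the sorted inputs atesh and ateshe, the five rows computed for atesh are kept and only one new row is computed for the final character of ateshe.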

Fig 4.2 Example 2 Fig 4.3 Example 3

Entity: c1 c2 c3 c4 … cm-2 cm-1 cm
Segments: g1 g2 … gτ gτ+1
Appear time in the document: Wg1 Wg2 … Wgτ Wgτ+1

vanateshe = van | ate | she: will extend 5 times
vanateshe = vana | tes | he: will extend 2 times

Observation: Different partition schemes generate candidate sets of different sizes.

Dynamic Programming, the recursive formula:

Weight: build a suffix trie to determine the weight W of each segment ci … cj
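The recursive formula itself did not survive in this transcript. One form consistent with the definition of M[i][j] is M[i][j] = min over p < j of (M[i-1][p] + W(cp+1 … cj)), which is an assumption here, not a quotation from the poster. The Python sketch below implements that recurrence, with two further assumptions: the weight of a segment is taken to be its number of occurrences in the document, and occurrences are counted by direct string search instead of a suffix trie. The names count_occurrences and best_partition are hypothetical.

```python
def count_occurrences(segment: str, document: str) -> int:
    """Number of (possibly overlapping) occurrences of segment in document.
    Stands in for the suffix-trie lookup mentioned on the poster."""
    count, pos = 0, document.find(segment)
    while pos != -1:
        count += 1
        pos = document.find(segment, pos + 1)
    return count


def best_partition(entity: str, document: str, tau: int):
    """Partition entity into tau + 1 non-empty segments with minimum total
    weight; returns (total_weight, segments)."""
    m, k = len(entity), tau + 1
    if m < k:
        raise ValueError("entity is shorter than tau + 1 characters")
    INF = float("inf")
    # M[i][j]: minimum weight to split entity[:j] into i segments.
    M = [[INF] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    M[0][0] = 0
    for i in range(1, k + 1):
        for j in range(i, m + 1):
            for p in range(i - 1, j):          # last segment is entity[p:j]
                if M[i - 1][p] == INF:
                    continue
                cost = M[i - 1][p] + count_occurrences(entity[p:j], document)
                if cost < M[i][j]:
                    M[i][j], back[i][j] = cost, p
    # Walk the back-pointers to recover the chosen segments.
    segments, j = [], m
    for i in range(k, 0, -1):
        p = back[i][j]
        segments.append(entity[p:j])
        j = p
    return M[k][m], segments[::-1]
```

By construction the returned weight is never larger than the weight of the even partition of the same entity, which is why the even-scheme weight can serve as an upper bound during selection.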

Partition Scheme: Even vs. Dict+Doc
1. The Even scheme produces a large candidate set.
2. The elapsed time of the Dict+Doc scheme includes the indexing time.

Accelerating Partition Scheme Selection:
1. Use segment length for pruning.
2. Use the even-scheme weight as an upper bound.
3. Add extra pointers to the suffix trie.