© 2008 ibm corporation regular expression learning for information extraction yunyao li *,...

© 2008 IBM Corporation

Regular Expression Learning for Information Extraction

Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*, Shivakumar Vaithyanathan*, H. V. Jagadish○

*IBM Almaden Research Center

○ University of Michigan

http://www.almaden.ibm.com/cs/projects/avatar/


Outline

Motivation

Regex Learning Problem

Regex Transformations

ReLIE Search Algorithm

Experiments

Summary


Importance of Regular Expression (Regex)

Regex is essential to many information extraction (IE) tasks

Email addresses

Software names

Credit card numbers

Social security numbers

Gene and Protein names

….

But … writing regexes for an IE task is not straightforward

Web collections

Email compliance

bioinformatics


Phone Number Extraction

A simple pattern:

blocks of digits separated by non-word character:

R0 = (\d+\W)+\d+

Identifies valid phone numbers (e.g. 800-865-1125, 725-1234)

Produces invalid matches (e.g. 123-45-6789, 10/19/2002, 1.25 …)

Misses valid phone numbers (e.g. (800) 865-CARE)


Software Name Extraction

A simple pattern:

blocks of capitalized words followed by version number:

R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+

Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)

Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)

Misses valid software names (e.g. Windows XP)


Conventional Regex Writing Process for IE

Regex0

SampleDocuments

Match 1Match 2

…

Good Enough?

N

YRegexfinal

(\d+\W)+\d+(\d+\W)+\d{4}

800-865-1125725-1234…123-45-678910/19/20021.25…

Regex1Regex2Regex3 (\d+[\.\s\-])+\d{4}(\d{3}[\.\s\-])+\d{4}


Our goal - Learning Regexfinal automatically

Regex0

SampleDocuments

Match 1Match 2

…

NegMatch 1…

NegMatch m0

PosMatch 1…

PosMatch n0

Labeled Matches ReLIERegexfinal


Intuition

R0 ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\ [a-zA-Z] {0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,5}\s* (?!(201|…|330))(\w{0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}

…

([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}

…

Compute F-measure

F1

F7

F8

F34

F48

…

((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)

[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4} F35

R’

([A-Z] [a-z] {1,10}\s){1,5}\s*( [a-zA-z] {0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s) {1,2} \s*(\\w{0,2}\d[\.]?){1,4}

(((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)

[A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}

……

([A-Z] [a-z] {1,10}\s){1,5} \s*( \d {0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s){1,5} \s*(\\w{0,2}\d[\.]?){1,3}

……

([A-Z] [a-z] {1,10}\s){1,5}\s* (?!(201|…|330))(\w{0,2}\d[\.]?){1,4}

…………..

……

…

……

…

• Generate candidate regular expressions by modifying current regular expression• Select the “best candidate” R’• If R’ has better than current regular expression, repeat the process


Outline

Motivation




Experiments

Summary



Ideally:

find the best Rf among all possible regexes

How do we define the best?

Highest F-measure over a document collection D.

We can only compute F-measure based on the labeled data

Must limited Rf such that any match of Rf is also a match of R0


Regex Learning as a Search Problem

+ ++

+

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

+++

++

++

+

+

+

+

+

+

++

+

++

+

+ ++

+

++

+

++

++

+

++

+

+

++

--

-

-

-

--

--

-

--

-

--

--

--

--

-

--

--

-

-

-

---

-

-

--

--

--

-

--

-

-

-

--

--

--

---

-- --

- --

- --

--

-

- --

+ +--- --

-

--

-

-

-

---

--M(R0, D)M(Rf, D)

M(R, D): Matches of R over document collection D.


Outline

Motivation




Experiments

Summary


Two Regex Transformations

Drop-disjunct Transformation:

R = Ra(R1| R2|… Ri| Ri+1|…| Rn) Rb R’ = Ra (R1| … Ri|…) Rb

Include-Intersect Transformation

R = RaXRb R’ = Ra(XY) Rb

where Y


(\d+\W)+\d+ (\d{3}\W)+\d+

Applying Drop-Disjunct Transformation

Character Class Restriction

E.g. To restrict the matching of non-word characters

(\d+\W)+\d+ (\d+[\.\s\-])+\d+

Quantifier Restriction

E.g. To restrict the number of digits in a block


Applying Include-Intersect Transformation

Negative Dictionaries

Disallow certain words from matching specific portions of the regex

E.g. a simple pattern for software name extraction:

blocks of capitalized words followed by version number:

R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+

Identifies valid software name (e.g. Eclipse 3.2, Windows 2000)

Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)

([A-Z]\w*\s*)+[Vv]?(\d+\.?)+ ( [A-Z]\w*\s*)+[Vv]?(\d+\.?)+((?! ENGLISH|Room|Chapter)


Outline

Motivation




Experiments

Summary


ReLIE Algorithm

Chara

cter c

lass

restr

iction

s

Quantifierrestrictions

Negativedictionary

R0 ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\ [a-zA-Z] {0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}

([A-Z][a-zA-Z]{1,10}\s){1,5}\s* (?!(201|…|330))(\w{0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}

…

([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}

…

Compute F-measure

F1

F7

F8

F34

F48

…

((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)

[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}

Chara

cter c

lass

restr

iction

s

Quantifierrestrictions

Negativedictionary

F35

R’

([A-Z] [a-z] {1,10}\s){1,5}\s*( [a-zA-z] {0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s) {1,2} \s*(\\w{0,2}\d[\.]?){1,4}

(((?!(Copyright|Page|Physics|Question| · · · |Article|Issue)

[A-Z] [a-z] {1,10}\s){1,5}\s*(\\w{0,2}\d[\.]?){1,4}

……

([A-Z] [a-z] {1,10}\s){1,5} \s*( \d {0,2}\d[\.]?){1,4}

([A-Z] [a-z] {1,10}\s){1,5} \s*(\\w{0,2}\d[\.]?){1,3}

……

([A-Z] [a-z] {1,10}\s){1,5}\s* (?!(201|…|330))(\w{0,2}\d[\.]?){1,4}

…………..

……

…

……

…

• Generate candidate regular expressions by applying a single transformation• Select the “best candidate” R’ based on F-measure on training corpus • If R’ has better F-measure than current regular expression, repeat the process• Use validation set to avoid over-fitting


Outline

Motivation




Experiments

Summary


Experimental Set Up

Data Set

EWeb: 50K web pages from IBM intranet

AWeb: 50K web pages from University of Michigan web site.

• AWeb-S: subset of 10K pages from AWeb

Email: 10K emails from Enron collection

Extraction TasksSoftwareNameTask CourseNumberTask

PhoneNumberTask URLTask

Comparison Study

ReLIE

Conditional Random Fields (CRF):

• Base feature set– matches corresponding to the input regex

– three adjacent words to each side of the matches


Extraction Quality(a) SoftwareNameTask

0.5

0.6

0.7

0.8

0.9

1

10% 40% 80%

Percentage of Data Used for Training

F-M

easu

re

ReLIE CRF

(b) CourseNumberTask

0.5

0.6

0.7

0.8

0.9

1

10% 40% 80%


F-M

easu

re

ReLIE CRF

(c) URLTask

0.5

0.6

0.7

0.8

0.9

1

10% 40% 80%


F-M

easu

re

ReLIE CRF

(d) PhoneNumberTask

0.5

0.6

0.7

0.8

0.9

1

10% 40% 80%


F-M

ea

su

re

ReLIE CRF

ReLIE performs comparably with CRF with a slight edge with limited training data

Program repeatedly failed at training phrase.


(a) SoftwareNameTask training: EWeb, testing: AWeb

0

0.2

0.4

0.6

0.8

1

10% 40% 80%

F-M

easu

re

ReLIE CRF

Cross-domain Evaluation

(c) URLTask training: Aweb-S, testing: Email

0

0.2

0.4

0.6

0.8

1

10% 40% 80%Percentage of Data Used for Training

F-M

ea

su

re

ReLIE CRF

(d) PhoneNumberTask training: Email testing: AWeb

0

0.2

0.4

0.6

0.8

1

10% 40% 80%Percentage of Data Used for Training

F-M

easu

re

ReLIE CRF

ReLIE significantly outperforms CRF for all three tasks

(b) CourseNameTask is not tested, as course names exist only in AWeb.


Performance

ReLIE is an order of magnitude faster than CRF for both training and testing

Average Training/Testing Time (sec)(with 40% data for training)


What has ReLIE learned?

Patterns learned by ReLIE are similar to features manually given to CRF


ReLIE as Feature Extractor for CRF

C+RL: CRF + features learned by ReLIE

• Token level features learned by ReLIE• helpful when the training data is small

• Character level features learned by ReLIE• always helpful


Outline

Motivation




Experiments

Summary


ReLIE

Effective for learning regexes for certain classes of IE

Particularly useful when

cross-domain, or

limited training data

Potentially becoming a powerful feature extractor for CRF and other machine learning algorithms.


http://www.almaden.ibm.com/cs/projects/avatar/