ListReader: Wrapper Induction for Lists in OCRed Documents
DESCRIPTION
ListReader: Wrapper Induction for Lists in OCRed Documents. Thomas Packer, BYU CS DEG, 2012.03.17. We Love Data (In Digital Form). Lots of Paper Documents in the World. Lots of Text in Paper Documents. Lots of Lists in Text. Lots of Data in Lists. Manual Data Entry.
TRANSCRIPT
1
ListReader: Wrapper Induction for Lists in OCRed Documents
Thomas Packer
BYU CS DEG
2012.03.17
2
We Love Data (In Digital Form)
3
Lots of Paper Documents in the World
4
Lots of Text in Paper Documents
5
Lots of Lists in Text
Lots of Data in Lists
6
Manual Data Entry
7
Wrappers: Individualized Extraction Rules
8
Semi-Automatic Wrapper Induction
9
Weakly-supervised Wrapper Induction
Semi-supervised Wrapper Induction
10
Data: Image
11
Data: OCR
12
Data: Hand-Labeled OCR
13
Induced Regex Wrappers
Single-Specific: (i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) …
Single-General: ([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5})([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) …
Multi-General: ([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1})([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …
Character Classes: [a-z], [A-Z], [0-9], [ \t\r\n], <each punct.>
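The three wrapper styles on this slide can be sketched in Python as follows. This is a minimal illustration, not ListReader's actual implementation: the function names, the field segmentation of the record, the width margins (2 below and 4 above the observed field length), and the second record are all assumptions made for the example.

```python
import re

WHITESPACE = " \t\r\n"
SPECIAL = set("\\^]-[.*+?(){}|$")  # characters escaped with a backslash


def esc(ch):
    """Escape one literal character for use in a regex."""
    return "\\" + ch if ch in SPECIAL else ch


def class_of(ch):
    """Return (sort rank, regex fragment) for the slide's class containing ch."""
    if ch in WHITESPACE:
        return (0, " \\t\\r\\n")
    if "a" <= ch <= "z":
        return (2, "a-z")
    if "A" <= ch <= "Z":
        return (3, "A-Z")
    if "0" <= ch <= "9":
        return (4, "0-9")
    return (1, esc(ch))  # each punctuation mark is its own class


def single_specific(fields):
    """One record, literal text: each field becomes an escaped capture group."""
    return "".join("(" + "".join(esc(c) for c in f) + ")" for f in fields)


def single_general(fields, below=2, above=4):
    """One record, generalized to character classes and a widened length range.

    The margins `below` and `above` are assumed values, not taken from the slides.
    """
    groups = []
    for f in fields:
        # Keep first-appearance order within a rank (sorted() is stable).
        classes = sorted(dict.fromkeys(class_of(c) for c in f), key=lambda rc: rc[0])
        body = "".join(fragment for _, fragment in classes)
        low, high = max(1, len(f) - below), len(f) + above
        groups.append("([%s]{%d,%d})" % (body, low, high))
    return "".join(groups)


def multi_general(records):
    """Several aligned records: per field, the observed characters and lengths."""
    groups = []
    for column in zip(*records):
        chars = sorted(set("".join(column)))
        lengths = [len(f) for f in column]
        body = "".join(esc(c) for c in chars)
        groups.append("([%s]{%d,%d})" % (body, min(lengths), max(lengths)))
    return "".join(groups)


# Field segmentation of the slide's example record (visible prefix only).
lydia = ["i", ". ", "Lydia", " ", "Lewis", "*, ", "b. "]
# Hypothetical second record, invented for illustration.
john = ["ii", ". ", "John", " ", "Lewis", ", ", "b. "]

specific = single_specific(lydia)
general = single_general(lydia)
multi = multi_general([lydia, john])

unseen = "ii. Mary Smith, d. "  # hypothetical unseen list entry
specific_hit = re.match(specific, unseen)
general_hit = re.match(general, unseen)
```

With these assumed margins, `single_general(lydia)` yields the same groups as the visible part of the slide's Single-General pattern. The literal Single-Specific pattern fails on the unseen entry while the generalized one matches it, which is the trade-off the three styles explore.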
15
Preliminary Results (Field Label F-measure, Small Dataset)
[Bar chart. Categories: Single-Specific, Single-General, Multi-General. Series: Training, Test. Vertical axis: 0% to 100%.]
Bar labels as extracted: 31%, 36%, 54%; 45%, 46%, 60%
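For reference, F-measure is conventionally the harmonic mean of precision and recall. The slides do not say which variant was used, so the sketch below assumes plain F1 computed over (position, label) pairs; the field label names and the sample data are hypothetical.

```python
def f_measure(gold, predicted):
    """F1 over sets of (position, label) pairs."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical hand-labeled fields and wrapper output for one list entry.
gold = {(0, "GivenName"), (1, "Surname"), (2, "BirthDate")}
predicted = {(0, "GivenName"), (1, "GivenName"), (2, "BirthDate"), (3, "Surname")}

score = f_measure(gold, predicted)  # precision 2/4, recall 2/3
```

Here two of the four predicted labels are correct and two of the three gold labels are recovered, so the score is 4/7, about 57%.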
16
Conclusions and Future Work
[Bar chart. Categories: Single-Specific, Single-General, Multi-General, Transduction. Series: Training, Test. Vertical axis: 0% to 100%.]
Bar labels as extracted: 31%, 36%, 54%, 76%; 45%, 46%, 60%, 79%
17
All done. Suggestions?