ListReader: Wrapper Induction for Lists in OCRed Documents
Thomas Packer, BYU CS DEG, 2012.03.17

TRANSCRIPT

Page 1: ListReader : Wrapper Induction for Lists in OCRed Documents

1

ListReader: Wrapper Induction for Lists in OCRed Documents

Thomas Packer, BYU CS DEG, 2012.03.17

2

We Love Data (In Digital Form)

3

Lots of Paper Documents in the World

4

Lots of Text in Paper Documents

5

Lots of Lists in Text

Lots of Data in Lists

6

Manual Data Entry

7

Wrappers: Individualized Extraction Rules

8

Semi-Automatic Wrapper Induction

9

Weakly-supervised Wrapper Induction

Semi-supervised Wrapper Induction

10

Data: Image

11

Data: OCR

12

Data: Hand-Labeled OCR

13

Induced Regex Wrappers

Single-Specific:
(i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) …

Single-General:
([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5})([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) …

Multi-General:
([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1})([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …

Character Classes: [a-z], [A-Z], [0-9], [ \t\r\n], <each punct.>
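To make the step from a specific wrapper to a general one concrete, here is a minimal Python sketch. It is not ListReader's actual induction procedure. It assumes one hand-labeled record already split into fields, maps each character to one of the character classes listed on the slide, and widens each field's length range by a hypothetical `slack` parameter. The function names and the second record ("ii. Harriet Lewis*, b. ") are invented for illustration.

```python
import re

def char_class(ch):
    """Map one character to a class fragment from the slide's list."""
    if "a" <= ch <= "z":
        return "a-z"
    if "A" <= ch <= "Z":
        return "A-Z"
    if "0" <= ch <= "9":
        return "0-9"
    if ch in " \t\r\n":
        return r" \t\r\n"
    return re.escape(ch)  # each punctuation mark is its own class

def specific_wrapper(fields):
    """One capture group per field, matching the literal text only."""
    return "".join("(%s)" % re.escape(f) for f in fields)

def general_wrapper(fields, slack=2):
    """One capture group per field, generalized to character classes.

    `slack` is an assumed knob: how far a field's length may differ
    from the labeled example.
    """
    groups = []
    for f in fields:
        classes = []
        for ch in f:
            c = char_class(ch)
            if c not in classes:
                classes.append(c)
        lo = max(1, len(f) - slack)
        hi = len(f) + slack
        groups.append("([%s]{%d,%d})" % ("".join(classes), lo, hi))
    return "".join(groups)

# Labeled fields taken from the slide's single-specific wrapper.
fields = ["i", ". ", "Lydia", " ", "Lewis", "*, ", "b. "]
specific = re.compile(specific_wrapper(fields))
general = re.compile(general_wrapper(fields))

seen = "i. Lydia Lewis*, b. "
unseen = "ii. Harriet Lewis*, b. "  # hypothetical second list item

print(specific.fullmatch(seen) is not None)    # literal wrapper fits its own example
print(specific.fullmatch(unseen) is not None)  # but not a new record
print(general.fullmatch(unseen).groups())      # generalized wrapper splits it into fields
```

The specific wrapper only recognizes the record it was built from, while the generalized one also segments the unseen record into the same seven fields. That trade-off between precision and coverage is what the three wrapper variants on the slide explore.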

15

Preliminary Results (Field Label F-measure, Small Dataset)

[Bar chart. Categories: Single-Specific, Single-General, Multi-General. Series: Training, Test. Vertical axis: 0% to 100%. Bar labels in extracted order: 31%, 36%, 54%; 45%, 46%, 60%.]
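The slides report field-label F-measure without defining it. The sketch below assumes the usual balanced F-measure (the harmonic mean of precision and recall) computed over (token position, label) pairs. The label names and the toy data are hypothetical and are not taken from the talk.

```python
def f_measure(gold, predicted):
    """Balanced F-measure over sets of (position, label) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # correct share of what was extracted
    recall = true_positives / len(gold)          # extracted share of what was labeled
    return 2 * precision * recall / (precision + recall)

# Hypothetical hand labels and wrapper output for one list item.
gold = {(0, "GivenName"), (1, "Surname"), (2, "BirthDate")}
predicted = {(0, "GivenName"), (1, "GivenName")}

print(round(f_measure(gold, predicted), 3))
```

Here one of the two predicted labels is correct (precision 1/2) and one of the three gold labels is recovered (recall 1/3).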

16

Conclusions and Future Work

[Bar chart. Categories: Single-Specific, Single-General, Multi-General, Transduction. Series: Training, Test. Vertical axis: 0% to 100%. Bar labels in extracted order: 31%, 36%, 54%, 76%; 45%, 46%, 60%, 79%.]

17

All done. Suggestions?