
Text Correction using Domain Dependent Bigram Models from Web Crawls. Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov


DESCRIPTION

Text Correction using Domain Dependent Bigram Models from Web Crawls. Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov. Two recent goals of text correction: the use of powerful language models, and document centric, adaptive correction.

TRANSCRIPT

Page 1: Text Correction  using  Domain Dependent Bigram Models  from Web Crawls

Text Correction using Domain Dependent Bigram Models from Web Crawls

Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov

Page 6

Two recent goals of text correction

Use of powerful language models
word frequencies, n-gram models, HMMs, probabilistic grammars, etc.
Keenan et al. 91, Srihari 93, Hong & Hull 95, Golding & Schabes 96, ...

Document centric and adaptive text correction
prefer words of the text as correction suggestions for unknown tokens.
Taghva & Stofsky 2001, Nartker et al. 2003, Rong Jin 2003, ...

Here: use of document centric language models (bigrams).

Page 14

Use of document centric bigram models

Idea: in text T = ... Wk-1 Wk Wk+1 ..., the token Wk is ill-formed, and V1, V2, ..., Vn are its correction candidates.

Prefer those correction candidates V where the bigrams Wk-1 V and V Wk+1 "are natural, given the text T".

Problem: How to measure "naturalness of a bigram, given a text"?
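Given some bigram score s(U, V), the preference above amounts to ranking the candidates by the scores of the two bigrams they would form with their neighbours. A minimal sketch (the names `rank_candidates` and `bigram_freq`, and the toy frequencies, are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch: rank correction candidates for an ill-formed token
# by the score of the bigrams they form with the left and right neighbours.
# `bigram_freq` stands in for the score s(U, V) described on the slides.

def rank_candidates(left, candidates, right, bigram_freq):
    """Order candidates V by s(left, V) + s(V, right), highest first."""
    def score(v):
        return bigram_freq.get((left, v), 0) + bigram_freq.get((v, right), 0)
    return sorted(candidates, key=score, reverse=True)

# Toy example with made-up frequencies:
freqs = {("optic", "nerve"): 12, ("nerve", "damage"): 9}
print(rank_candidates("optic", ["never", "nerve"], "damage", freqs))
# → ['nerve', 'never']   ("nerve" scores 12 + 9 = 21, "never" scores 0)
```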

Page 20

How to derive "natural" bigram models for a text?

• Counting bigram frequencies in text T?
Sparseness of bigrams: low chance to find bigrams repeated in T.

• Using a fixed background corpus (British National Corpus, Brown Corpus)?
Sparseness problem partially solved, but the models are not document centric.

Our suggestion: using domain dependent terms from T, crawl a corpus C on the web that reflects the domain and vocabulary of T. Count bigram frequencies in C.
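The counting step of the suggestion above is straightforward once the crawled corpus is tokenized. A sketch under that assumption (the function name and sample text are invented for illustration):

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count adjacent word pairs (U, V) in a token stream."""
    return Counter(zip(tokens, tokens[1:]))

# Toy "crawled corpus": in practice this would be the tokenized crawl C.
corpus = "the optic nerve connects the optic disc".split()
freqs = bigram_frequencies(corpus)
print(freqs[("the", "optic")])  # → 2
```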

Page 27

Correction Experiments

Text T
1. Extract domain specific terms (compounds).
2. Crawl a corpus C that reflects the domain and vocabulary of T.
3. For each pair of dictionary words UV (dictionary D), store the frequency of UV in C as a score s(U,V).

First experiment ("in isolation"): What is the correction accuracy reached when using s(U,V) as the single information for ranking correction suggestions?

Second experiment ("in combination"): Which gain is obtained when adding s(U,V) as a new parameter to a sophisticated correction system using other scores as well?

Page 29

Experiment 1: bigram scores "in isolation"

• Set of ill-formed output tokens of a commercial OCR system.
• Candidate sets for ill-formed tokens: dictionary entries with edit distance < 3.
• Using s(U,V) as the single information for ranking correction suggestions.
• Measured the percentage of correctly top-ranked correction suggestions.
• Comparing bigram scores from web crawls, from the BNC, and from the Brown Corpus.

Texts from 6 domains:

        Neurol.  Fish    Mushr.  Holoc.  Rom     Botany
Crawl   64.5%    43.6%   54.8%   59.5%   48.2%   56.5%
BNC     46.8%    34.7%   41.8%   40.9%   37.5%   28.5%
Brown   38.2%    30.5%   36.4%   40.2%   37.0%   25.5%

Conclusion: crawled bigram frequencies are clearly better than those from static corpora.
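The candidate sets in Experiment 1 are dictionary entries within edit distance < 3 of the ill-formed token. A minimal sketch of that filter using the classic dynamic-programming Levenshtein distance (the helper names and the toy dictionary are assumptions; the paper's system uses more efficient dictionary-based techniques):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]

def candidates(token, dictionary, max_dist=2):
    """Dictionary entries with edit distance < 3 of the ill-formed token."""
    return [w for w in dictionary if levenshtein(token, w) <= max_dist]

print(candidates("nervve", ["nerve", "never", "serve", "carrot"]))
# → ['nerve', 'serve']
```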

Page 30: Text Correction  using  Domain Dependent Bigram Models  from Web Crawls

Experiment 2: adding bigram scores to fully-fledged correction system

• Baseline: correction with length-sensitive Levenshtein distance and crawled word frequencies as two scores.

• Then adding bigram frequencies as a third score.

• Measuring the correction accuracy (percentage of correct tokens) reached with fully automated correction (optimized parameters).

• Corrected output of commercial OCR 1 and open source OCR 2.
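One plausible way to combine the three signals above (Levenshtein distance, crawled word frequency, bigram score) is a weighted sum over log-smoothed frequencies. The weights and the exact combination below are illustrative assumptions, not the paper's tuned parameters:

```python
import math

# Hedged sketch of a combined ranking score; weights are made up.
def combined(dist, word_freq, bigram_freq, w_d=1.0, w_f=0.5, w_b=0.5):
    """Higher is better: penalize edit distance, reward frequencies."""
    return (-w_d * dist
            + w_f * math.log1p(word_freq)
            + w_b * math.log1p(bigram_freq))

def best_candidate(scored):
    """scored: list of (candidate, dist, word_freq, bigram_freq) tuples."""
    return max(scored, key=lambda t: combined(*t[1:]))[0]

# "never" is frequent overall but forms no natural bigrams here;
# "nerve" is closer and fits the context.
print(best_candidate([("never", 3, 900, 0), ("nerve", 1, 400, 21)]))
# → nerve
```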

Pages 31-34

Experiment 2: adding bigram scores to fully-fledged correction system (OCR 1)

              OCR 1 output  Baseline correction  Adding bigram score  Additional gain
Neurology     98.74         99.39                99.44                0.05
Fish          99.23         99.47                99.57                0.10
Mushroom      99.01         99.50                99.55                0.05
Holocaust     98.86         99.03                99.15                0.12
Roman Empire  98.73         98.90                99.00                0.10
Botany        97.19         97.67                97.89                0.22

• OCR 1 output is highly accurate.
• Baseline correction adds a significant improvement.
• Small additional gain by adding the bigram score.

Pages 35-38

Experiment 2: adding bigram scores to fully-fledged correction system (OCR 2)

              OCR 2 output  Baseline correction  Adding bigram score  Additional gain
Neurology     90.13         96.29                96.71                0.42
Fish          93.36         96.71                98.02                1.31
Mushroom      89.26         95.51                96.00                0.49
Holocaust     88.77         94.23                94.61                0.38
Roman Empire  93.11         96.12                96.91                0.79
Botany        91.71         95.41                96.09                0.68

• OCR 2 output accuracy is lower than that of OCR 1.
• Baseline correction adds a drastic improvement.
• Considerable additional gain by adding the bigram score.

Page 39

Additional experiments: comparing language models

Experiment: compare word frequencies in the input text with
1. word frequencies retrieved from "general" standard corpora
2. word frequencies retrieved from crawled domain dependent corpora

Result: using the same large word list (dictionary) D, the top-k segment of D ordered by frequencies of type 2 covers many more tokens of the input text than the top-k segment ordered by frequencies of type 1.
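The coverage measure behind this result can be sketched as follows (function name, toy text, and toy orderings are invented for illustration; the real dictionary D is much larger):

```python
def coverage(text_tokens, dictionary_by_freq, k):
    """Fraction of text tokens that fall in the top-k dictionary entries,
    where the dictionary is sorted by some frequency model."""
    top_k = set(dictionary_by_freq[:k])
    return sum(t in top_k for t in text_tokens) / len(text_tokens)

text = "optic nerve lesion of the optic nerve".split()
crawled_order = ["optic", "nerve", "the", "lesion", "of"]   # domain model
general_order = ["the", "of", "and", "to", "a"]             # general model
print(coverage(text, crawled_order, 3))   # ≈ 0.714 (5 of 7 tokens)
print(coverage(text, general_order, 3))   # ≈ 0.286 (2 of 7 tokens)
```

The domain-ordered top-k covers the content words ("optic", "nerve") that dominate the text, which a general-corpus ordering ranks far down the list.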

Page 40

Additional experiments: comparing language models

[Figure: token and type coverage of the input text under crawled vs. standard frequency orderings]

Page 46

Summing up

• Bigram scores represent a useful additional score for correction systems.

• Bigram scores obtained from text-centered, domain dependent crawled corpora are more valuable than uniform bigram scores from general corpora.

• Sophisticated crawling strategies were developed, along with special techniques for keeping arbitrary bigram scores in main memory (see paper).

• The additional gain in accuracy reached with bigram scores depends on the baseline.

• Language models obtained from text-centered, domain dependent corpora retrieved from the web reflect the language of the input document much more closely than those obtained from general corpora.

Page 47

Thanks for your attention!