Text Boundary Analysis
Eric Mader, Advisory Software Engineer
IBM
Where do I break lines?
The rain in Spain stays mainly on the plain.
ด่ๅแรงฃนึ๓อัตราลกูจา้งใหมใ่ห๓้๕
您有坦率和誠實的聲譽。
Even in English, this can be hard
You owe me $1,234.56... I think.
Word wrapping vs word selection
Word wrapping:
Some characters’ behavior is context-dependent.
Searching by words:
Some characters’ behavior is context-dependent.
Analysis by pairs
[Pair table: first character against second character, over the classes ltr, dgt, sp, pun, -, nbs, and kji; selected cells marked X.]
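The pair-table idea can be sketched in a few lines of Python. This is a minimal sketch, not the table from the slide: the class names ltr, dgt, sp, and pun follow the slide, but the `char_class` helper and the contents of `NO_BREAK` are assumptions chosen only for illustration.

```python
def char_class(ch):
    """Map a character to one of four illustrative classes."""
    if ch.isalpha():
        return "ltr"
    if ch.isdigit():
        return "dgt"
    if ch.isspace():
        return "sp"
    return "pun"

# (first, second) class pairs between which we do NOT break.
# Hypothetical entries: letters and digits stick to each other.
NO_BREAK = {("ltr", "ltr"), ("dgt", "dgt"), ("ltr", "dgt"), ("dgt", "ltr")}

def pair_breaks(text):
    """Return break offsets decided by looking at adjacent pairs only."""
    breaks = [0]
    for i in range(1, len(text)):
        pair = (char_class(text[i - 1]), char_class(text[i]))
        if pair not in NO_BREAK:
            breaks.append(i)
    breaks.append(len(text))
    return breaks

def split_words(text):
    """Cut the text at every offset reported by pair_breaks."""
    if not text:
        return []
    offsets = pair_breaks(text)
    return [text[s:e] for s, e in zip(offsets, offsets[1:])]
```

With this table, `split_words("4.5")` returns three pieces. Adding (dgt, pun) to `NO_BREAK` would keep the number together, but it would also glue the first period of "56..." to the digits, which is the weakness of looking at only two characters.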
Where pairs break down
A break position can depend on more than two characters:
You owe me $1,234.56... I think.
4.5
6..
Sentence boundaries require even more lookahead:
He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”
An example
•If not otherwise mentioned, each character is a “word” unto itself.
•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.
•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.
•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.
The state-machine approach
[State diagram: nodes labeled start, A, ’ ., 0, $, and %, shown with the sample input $1,234.56...]
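A table-driven machine for the word and number rules might look like the following. It is a minimal sketch under stated assumptions: the state names (`word_mid`, `num_mid`, `done`), the simplified character classes in `classify`, and the transition table are all invented here, since the slide's diagram does not survive in the transcript. The one technique worth noticing is that the loop remembers the last accepting position, so it can back up when a punctuation mark turns out not to be followed by a digit or letter.

```python
def classify(ch):
    """Simplified character classes (an assumption, not the slide's set)."""
    if ch.isalpha():
        return "let"
    if ch.isdigit():
        return "dgt"
    if ch in ".,'’":
        return "mid"    # may sit inside a word or a number
    if ch == "$":
        return "pre"    # number prefix
    if ch == "%":
        return "post"   # number suffix
    return "other"

# TRANSITIONS[state][class] -> next state; a missing entry stops the machine.
TRANSITIONS = {
    "start":    {"let": "word", "dgt": "num", "pre": "pre",
                 "mid": "done", "post": "done", "other": "done"},
    "pre":      {"dgt": "num"},
    "word":     {"let": "word", "dgt": "num", "mid": "word_mid"},
    "word_mid": {"let": "word"},
    "num":      {"dgt": "num", "let": "word", "mid": "num_mid", "post": "done"},
    "num_mid":  {"dgt": "num"},
    "done":     {},
}
ACCEPTING = {"word", "num", "pre", "done"}

def next_boundary(text, start):
    """Run the machine from start and return the last accepting offset."""
    state = "start"
    last_accept = start + 1          # fallback: one character is a "word"
    pos = start
    while pos < len(text):
        state = TRANSITIONS[state].get(classify(text[pos]))
        if state is None:
            break
        pos += 1
        if state in ACCEPTING:
            last_accept = pos
    return last_accept

def segments(text):
    """Split text by calling next_boundary repeatedly."""
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out
```

On the slide's input, `segments("$1,234.56...")` keeps `$1,234.56` together and returns each trailing period separately: the machine enters `num_mid` on the first trailing period, finds no digit after it, and falls back to the offset recorded after the 6.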
Limitations
1992–1996
Automatic table building
•If not otherwise mentioned, each character is a “word” unto itself.
•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.
•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.
•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.
let=[:L:];
dgt=[:N:];
mid-word=[[:Pd:]\”\’\.];
mid-num=[\”\’\.\,];
pre-num=[[[:Sc:]\#\.]-[¢]];
post-num=[\%\&¢];
word=({let}+({mid-word}{let}+)*);
number=({dgt}+({mid-num}{dgt}+)*);
{word}?({number}{word})*({number}{post-num}?)?;
{pre-num}({number}{word})*({number}{post-num}?)?;
Automatic table building
•All regular-expression rules have equal precedence
•The “winning” rule is decided using a longest-possible-match algorithm (except in certain well-defined cases)
•Our build algorithm parses the regular expressions, builds the state table, and makes sure it’s deterministic in a single pass
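To see the word rules and the longest-match policy in action, they can be transcribed into Python's `re` module. This is a sketch under several assumptions, not the build algorithm described above: `re` is a backtracking engine rather than a generated deterministic state table, and the classes are approximated (`[^\W\d_]` for `[:L:]`, `\d` for `[:N:]`, a handful of dashes for `[:Pd:]`, and a few currency signs for `[:Sc:]`). The names `RULES` and `tokenize` are invented for the example.

```python
import re

LET = r"[^\W\d_]"                      # approximates [:L:]
DGT = r"\d"                            # approximates [:N:]
MID_WORD = r"[\-\u2010-\u2015”’.]"     # a few [:Pd:] dashes plus ” ’ .
MID_NUM = r"[”’.,]"
PRE_NUM = r"[$£¥€#.]"                  # a few [:Sc:] signs plus # and .
POST_NUM = r"[%&¢]"

WORD = rf"(?:{LET}+(?:{MID_WORD}{LET}+)*)"
NUMBER = rf"(?:{DGT}+(?:{MID_NUM}{DGT}+)*)"
TAIL = rf"(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"

RULES = [
    re.compile(rf"{WORD}?{TAIL}"),     # words, numbers, and mixtures
    re.compile(rf"{PRE_NUM}{TAIL}"),   # number with a prefix character
    re.compile(r".", re.DOTALL),       # default: one character is a "word"
]

def tokenize(text):
    """Apply every rule at each position and keep the longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        matches = (rule.match(text, pos) for rule in RULES)
        end = max(m.end() for m in matches if m)
        tokens.append(text[pos:end])
        pos = end
    return tokens
```

All three rules are tried at every position with equal precedence, and the longest result wins; the single-character rule guarantees progress when the first rule matches the empty string.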
Sentence-break rules
.*?{term}[{term}{period}{end}]*{space}*;
.*?{period}[{period}{end}]*{space}*/{start}*{sent-start};
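The two sentence rules translate fairly directly into Python regular expressions. The sketch below rests on assumptions the slides do not spell out: the contents of each class (`TERM`, `PERIOD`, `END`, `START`, `SENT_START`) are guesses, and the `/` in the second rule is read as "break here, but only if what follows matches", which is rendered as a lookahead. Because the rules begin with a non-greedy `.*?`, the sketch takes the earliest match end rather than the longest.

```python
import re

# Assumed class contents; the slides name the classes but do not define them.
TERM = "!?"              # unambiguous sentence terminators
PERIOD = "."             # ambiguous: may end a sentence or an abbreviation
END = r"”’)\]"           # closing punctuation that may follow a terminator
START = r"“‘(\["         # opening punctuation that may precede a sentence
SENT_START = "A-Z"       # characters that can begin a sentence

RULES = [
    # .*?{term}[{term}{period}{end}]*{space}*
    re.compile(rf".*?[{TERM}][{TERM}{PERIOD}{END}]*\s*", re.DOTALL),
    # .*?{period}[{period}{end}]*{space}* / {start}*{sent-start}
    re.compile(
        rf".*?[{PERIOD}][{PERIOD}{END}]*\s*(?=[{START}]*[{SENT_START}])",
        re.DOTALL,
    ),
]

def sentences(text):
    """Split text at the earliest position where either rule matches."""
    out, pos = [], 0
    while pos < len(text):
        ends = [m.end() for m in (r.match(text, pos) for r in RULES) if m]
        end = min(ends) if ends else len(text)   # no terminator: take the rest
        out.append(text[pos:end])
        pos = end
    return out
```

On the sample text from the earlier slide, the period after "ft" is not treated as a boundary because a lowercase letter follows it, while the period after "tall" is, because the lookahead finds an opening quote and a capital letter.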
Ignore characters
$ignore=[[:Mn:][:Me:][:Cf:]];
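One way to picture the effect of `$ignore` is that a boundary never falls immediately before an ignored character. The sketch below assumes that reading and uses Python's `unicodedata` to test for the three general categories named in the rule; the `clusters` helper is invented for the example and is not part of the rule syntax.

```python
import unicodedata

IGNORE_CATEGORIES = {"Mn", "Me", "Cf"}   # the categories listed in $ignore

def is_ignored(ch):
    """True for non-spacing marks, enclosing marks, and format characters."""
    return unicodedata.category(ch) in IGNORE_CATEGORIES

def clusters(text):
    """Attach each ignored character to the character before it."""
    out = []
    for ch in text:
        if out and is_ignored(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

For example, `clusters("e\u0301a")` keeps the combining acute accent with its base letter, so the other rules only ever see the base characters.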
Surrogate support
kanji=[\u4e00-\u9fff\udb80-\udb83];
$ignore=[[:Mn:][:Me:][:Cf:]\udc00-\udcff];
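One reading of these two rules is that the high surrogate carries the class of the whole character while the low surrogate is ignored, so a pair is never split. The sketch below illustrates that reading on UTF-16 code units. The sample character U+F0000 is an assumption, picked only because its high surrogate falls inside the range on the slide; `utf16_units` and `unit_class` are hypothetical helpers.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def unit_class(unit):
    """Classify one code unit the way the two rules on the slide would."""
    if 0x4E00 <= unit <= 0x9FFF or 0xDB80 <= unit <= 0xDB83:
        return "kanji"
    # Range copied from the slide; the low-surrogate block itself
    # extends to U+DFFF.
    if 0xDC00 <= unit <= 0xDCFF:
        return "ignore"
    return "other"
```

Here `utf16_units("\U000F0000")` gives the pair 0xDB80, 0xDC00: the first unit is classified as kanji and the second is ignored.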
Random-access iteration
You owe me $1,234.56... I think.
!{sent-start}{start}*{space}*{end}*{period};
![{sent-start}{lc}{digit}]{start}*{space}*{end}*{period};
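The general pattern for random access is to back up from the requested offset to a position that is certainly a boundary, then run the forward rules from there. The sketch below shows that pattern for word boundaries rather than the sentence rules on the slide, and it swaps in a simple heuristic for the backward rules: the position just after any whitespace character is assumed safe, which holds for the hypothetical `TOKEN` pattern used here. `safe_start` and `following` are names invented for the example.

```python
import re

# Hypothetical forward rule: a number, a word, or any single character.
TOKEN = re.compile(
    r"[$#]?\d+(?:[.,]\d+)*%?"            # number with optional prefix/suffix
    r"|[^\W\d_]+(?:[’'.\-][^\W\d_]+)*"   # word with internal punctuation
    r"|.",                               # default: one character
    re.DOTALL,
)

def safe_start(text, offset):
    """Back up to just after the previous whitespace, or to offset 0."""
    pos = offset
    while pos > 0 and not text[pos - 1].isspace():
        pos -= 1
    return pos

def following(text, offset):
    """Return the first boundary strictly after offset."""
    pos = safe_start(text, offset)
    while pos <= offset and pos < len(text):
        pos = TOKEN.match(text, pos).end()
    return pos
```

Starting in the middle of the number, at offset 15 of the sample sentence, `safe_start` backs up to the dollar sign and the forward scan then reports the boundary after the 6. Scanning forward from offset 15 itself would not have known it was inside a number with a prefix.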
Dictionary-based iteration
We hold these truths to be self-evident: that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are Life, Liberty, and the Pursuit of Happiness.
Weholdthesetruthstobeself-evident:thatallmenare createdequal,thattheyareendowedbytheirCreatorwith certainunalienablerights,thatamongtheseareLife, Liberty,andthePursuitofHappiness.
$dictionary=[A-Za-z\-\’];
themendinetonight
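Dividing a run such as "themendinetonight" needs a dictionary and a way to recover from wrong guesses. The sketch below tries the longest dictionary word first and backtracks when the remainder cannot be divided; it is one plausible approach, not necessarily the algorithm behind these slides. The word list `WORDS` is hypothetical and deliberately contains tempting wrong turns such as "theme", "them", and "mend".

```python
from functools import lru_cache

# Hypothetical dictionary, including words that lead to dead ends.
WORDS = frozenset({
    "the", "them", "theme", "men", "mend", "end", "din", "dine",
    "in", "on", "to", "ton", "tonight", "night",
})

def segment(text, words=WORDS):
    """Split text into dictionary words, or return None if impossible."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)            # remember dead ends per position
    def solve(pos):
        if pos == len(text):
            return ()
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[pos:end],) + rest
        return None                     # no word fits here: backtrack

    result = solve(0)
    return None if result is None else list(result)
```

With this dictionary, "theme" and "them" are tried first and abandoned, and the run is finally divided as the, men, dine, tonight.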