learning to read by spelling - university of oxfordvgg/publications/2018/gupta... · 2018. 12....
Post on 04-Jun-2021
3 Views
Preview:
TRANSCRIPT
Learning to Read by SpellingTowards Unsupervised Text Recognition
Ankush Gupta Andrea Vedaldi Andrew Zisserman
Visual Geometry Group (VGG)
University of Oxford
ICVGIP 2018, Hyderabad
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Text Recognition
tion of regular fits of the gout, one or more joints
Imaged Text ASCII Text
Assumes localisation is given
• Word / line level bounding boxes
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Let’s solve this
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Text Recognition: Sequence Learning 101
ConvNet
a..no..
<stop>
Sequence Model (e.g. RNNs)
Paired Image / Annotations
tion of regular fits of the gout one or more joints
part of the brain the tunica arachnoides was
rated and of a pale yellow colour and with the
Lots of paired data
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Text Recognition: Paired Data?
Manual labour
• Expensive
• Boring..
Synthetic Data
• New engine for each domain
• Complex pipelines
• Domain gapJaderberg et al., NIPS DLW 2014
Gupta et al., CVPR16, BMVC18
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Can we do without paired data?
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
We leverage this structure for supervision.
Language is highly structured
• There are ~8 billion random strings of length 7 (26 characters)
but only 15K are valid English strings
• The frequency of characters and words, and their co-occurrence
(n-grams etc.) further constrain the output
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Method
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Text Recognition
2 sub-problems
1. Segment text image into characters, and cluster to consistent class
2. Assign each cluster to correct “character” label
à Solve for a |A | x |A | permutation matrix
where, A is the alphabet, e.g.: 26 English letters {a,b,c,…,z}
visual
language
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Unpaired Text RecognitionLearn from unaligned text corpora, and text-images
Fully-Conv Recognition Net
images with <= L characters
|A |“fake” text
Valid Language
Strings
e.g. from: WMT,
NewsGroup etc.
Discriminator
“real” text
Adversarial Loss
L
28:
English letters {a,b,c,…,z}
+ space
+ pad
Softmax
each position
one-hot
visual language
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Unpaired Text Recognition
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
The “recognizer” can generate valid text without “recognizing”
Pitfall: Uncorrelated Predictions
à Fool the discriminator without solving the task
e.g. use “text-image” as noise à learn generator for valid English strings
Fully-Conv Recognition Net
“ this is a valid English string which looks real to the discriminator ”invalid
recognition
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Uncorrelated Predictions: Solution
Local recognizer
restricted
receptive field
(~ 3 characters)
. . . . . tion of regular fits of the gout one or more joints
Global discriminator
checks validity of
the entire sentence Discriminator
No Reconstruction unlike CycleGAN
Because text à image is highly ambiguous
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Experiments
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Synthetic Text Images
• Fixed-width font
• WMT Newscrawl text source (EMNLP datasets)
• Control over nuisance factors à used for analysis
• 100K training sequences, 1K test sequences
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Real Text Images
• Google Books scans
• Non-fixed width font
• Varying word spacing due to alignment
• “See-through” from back
• Different case (small / capital)
• Italics / noise / fading etc.
• 3K training lines, 300 test lines
(no overlapping pages)
• Use Google OCR output as
unaligned “text source”
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Training StrategySample images and valid strings independently
Fully-Conv Recognition Net Valid Language
Strings
Discriminator
Adversarial Loss
softmax
predictionsone-hot
strings
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Results
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Synthetic Text Images
• ~99% character accuracy
• ~95% word accuracy
• Trained on 24-length sequences à test on 3, 5, 7, 9, 11, 24, 32, 48
à generalization to other lengths
test accuracy
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Real Text Images
Why?
Varying spacing / non-fixed width font challenging for
fully-conv. recognizer
45% word accuracy!
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Real Text Images
Why?
Varying spacing / non-fixed width challenging for fully-conv. recognizer
à Let features travel using a “skip-RNN” in the last layer
Conv
RNN
Skip
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Real Text Images
85% character accuracy (now vs. 45% before)!
96.2% character accuracy.
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Real Text Example
the different forms in which this disease ap
pears have rendered it necessary to divide it
into regular and aregular gout in the former
the attacks of which are known by the denomina
tion of regular fits of the got onne o more joints
of the extremities become inflamed painful and
tender and frequently in an exquisite deqree a
symptoneatic fever proportioned to the degree of
pain and inflammation with evening exacerba
tions accompand the other complaints which dis
tress the patient for uncertain perions sometimes
for several weeks wo he the fit goes off the piont
which have been the seat of the disease are always
found to have become rigid and inflexible in pro
portion to the degree in which the disease has
existed in themuffeequently remaining enlarged
and incapable of free motion for a considerable
timer cen he other hand the patient at the sas
time experiences so perfect an exemption from
disease as generally to lead to the opinion that
the fit has occasioned the most salutary changes
in the sten in ■ne■ e erd y i er
in the irregular gout the affection of the joints
is much less confined than in the former some e
times it leaves the joints at fort tttached and
fixes on some distant part■ and sometimes after
harassing the patient by making a circuit in■
Ground Truth Prediction
the different forms in which this disease ap
pears have rendered it necessary to divide it
into regular and irregular gout in the former
the attacks of which are known by the denomina
tion of regular fits of the gout one or more joint
of the extremities become inflamed painful and
tender and frequently in an exquisite degree a
symptomatic fever proportioned to the degree of
pain and inflammation with evening exacerba
tions accompany the other complaints which dis
tress the patient for uncertain periods sometimes
for several weeks when the fit goes off the joints
which have been the seat of the disease are always
found to have become rigid and inflexible in pro
portion to the degree in which the disease has
existed in them■ frequently remaining enlarged
and incapable of free motion for a considerable
time on the other hand the patient at the same
time experiences so perfect an exemption from
disease ag generallyto leadu to the opinion that
the fit has occasioned the most salutary changes
in the system ■ ■■■ ■ ■i■■
in the irregular gout the affection of the joints
is much less confined than in the former yiomeu
timcs it leaves the joints at first attacked and
fixes on some distant part■ and sometimes after
harassing the patient by making a circuit in■
Image
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Analysis
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Effect of Sequence Length on Training Convergence
Training with sequences of
different lengths:
• Short lengths 3-5:
no convergence
• Longer sequence à
faster convergence
13 > 11 > 9 > 7
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Which symbol is learnt first?
learning order
1
2
3
4
train
ing r
un
s i e u a o l t d h g k m y c p r n f b x w v z j q
g e i s n l a u r d o h t k y m c v w x p b j f q z
g i e o a s u p t d l c n r w h k y v m b z q f j x
a e t s d i u l o h r n y m v w b j x f z c g p k q
low high
• Symbols are learnt roughly in the order of their frequency.
(Spearman’s rank correlation coefficient ρ = 0.80, p-value < 1e−5).
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Text Corpus Ablation
• Completely unrelated lexicon (#3): small adverse effect
• Related lexicon (#2): no such effect
88
90
92
94
96
98
100
WMT WMT (No overlap) War & Peacechar word
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Extensions
• Not text-image specific
à apply to any input domain, as long as output is still language
• Examples:
• Speech
• Sign language / gestures
• Lip reading
Learning to Read by SpellingTowards Unsupervised Text Recognition
Ankush Gupta Andrea Vedaldi Andrew Zisserman
Any Questions?
Visual Geometry Group (VGG)
University of Oxford
ICVGIP 2018, Hyderabad
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Which symbol is learnt first?
0 2000 4000 6000 8000 10000
iterations
0.0
0.2
0.4
0.6
0.8
1.0
chara
cte
r accura
cy
a
et
s
d
iulohr n ym v w b
jx fz
cg p
k
q
Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad
Learning Dynamics
tttttttttttttttttttttttttttttttttttttttttttrssssss
ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz
bcorote to whol th ticunthss tidio tiostolonzzzz
trougfht to ferr oy disectins it has dicomered
brought to view by dissection it was discovered
Tra
inin
g
ite
rati
on
s
1. First {space} / segmentation into words is learnt
2. Next, common words like {to, it}
3. Last, less frequent characters like {v, w}.
The last transcription also corresponds to the ground-truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy).
top related