Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish
Handwritten Chinese Characters
Gerald DeJong
Computer Science, University of Illinois at Urbana-Champaign
Qiang Sun, Shiau Hong Lim, Li-Lun Wang
Challenges of Noisy Unstructured Text Data
• Noise – working with real input
– Bottom-up limitations
– Some true noise
– Some self-induced variability
– More reliant on prior structure
• Lack of structure – problem complexity
– Top-down limitations
– Highly structured = little variability
– More reliant on input (noisy or otherwise)
Noise
• True noise
– Missing information
– Extra information
– Random / Normal(?)
• Induced noise
– Imperfect representation
  • Pixelization
  • Staircasing
  • Extra / missing blobs or pixels
– Variability
  • Unmodeled / approximated world dynamics
  • Ignored parameters / covariates
  • Not random
  • Convenient to pretend it is true noise…
Structure vs. Unstructured
Relatively unstructured:
Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation…
Very structured:
Name: Ishmael
Finances: Low
Problem: Bored, Spleen
Date: Recent?
With more structure, less induced noise
Unstructured: Deal with the Noise
• With structure → programming problem
• Without structure → learning problem
• Learn signal from noise via training examples
– Each training example contains little information
– Is there enough information?
– Task dependent
• Difficulty: subtlety of required processing
• Two statistical NLP question types:
– “How large is Brazil?”
– “Will the Fed raise interest rates?”
– Second requires integrating lots of partial evidence
Machine Learning as an Empirically Guided Search through a Hypothesis Space
[Figure: example space X with training set Z (examples labeled + and -), shown beside hypothesis space H]
What Makes a Learning Problem Hard?
• Expressiveness of hypothesis space H
• Large / diverse / complex H:
– More bad hypotheses can masquerade as good
– More training examples are required for desired confidence
• Want high confidence that a learner will produce a good approximation of the true concept
• Cost: more information → more training examples
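The link between the size of H and the number of training examples can be made concrete with the textbook PAC bound for a finite hypothesis space and a consistent learner, m ≥ (ln|H| + ln(1/δ)) / ε. This bound is not on the slide; it is brought in here only as an illustration, and the function name is hypothetical.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite H.

    Textbook PAC bound (an illustration, not taken from the talk):
    m >= (ln|H| + ln(1/delta)) / epsilon.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# A more expressive hypothesis space needs more examples
# for the same error (epsilon) and confidence (1 - delta).
small_h = pac_sample_bound(2 ** 10, epsilon=0.05, delta=0.05)
large_h = pac_sample_bound(2 ** 1000, epsilon=0.05, delta=0.05)
print(small_h, large_h)
```

The bound grows with ln|H|, which is one way to read "more information, more training examples".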
Explanation-Based Learning: Information Beyond Training Examples
• Utilize existing domain knowledge
• Treat training examples as illustrations of a deeper pattern
• Explain how the assigned class label may arise from an example’s properties
• Explanations suggest the deeper patterns
• Calibrate and confirm using other training examples
Two Kinds of Prior Knowledge
• Solution Knowledge is directly relevant to a specific classification task.
– Can be readily used to bias a learning system.
– But it requires the expert to already know the solution and to possess expertise about the machine learner and its bias space.
• Domain Knowledge is more abstract and not tied to any particular classification task.
– “The same pen will leave similar-width strokes.”
– Only indirectly helpful for telling a “3” from a “6”
– Easy for human experts to articulate.
– Difficult to express in a statistical learner’s bias vocabulary
Solution vs. Domain Knowledge
• 3 vs 8
– Right half: little information
– Left half: much more information
• Solution knowledge: “pay attention to the left half”
• Domain knowledge
– Prior idealized stroke representations:
– Conjecture differential information
– Calibrate & verify with training data
• EBL:
– Derive solution knowledge
– Use domain knowledge
– Interacting with training examples
[Figure: handwritten 3 and 8]
The Explanation-Based Learning Approach
Transform Domain Knowledge into Solution Knowledge.
• Conjecture explanations for some training labels using Domain Knowledge.
• Evaluate explanation quality using the rest of the training set.
• Assemble statistically confirmed explanations into Solution Knowledge.
• Adjust the statistical learner’s bias to reflect the new Solution Knowledge.
[Diagram: a Learner containing EBL and an inductive learner; inputs are domain knowledge and training examples; EBL produces solution knowledge for the inductive learner; output is the classifier]
SVM Background (Support Vector Machines)
• Generic: few parameters to manipulate
• Linear AND nonlinear
– Linear in a high dimensional dot product space
– Nonlinear in the input feature space
• Expressiveness: nonlinear
• Cost: linear (+ convex optimization)
• Two cute nuggets:
– Large margin: prefer low capacity / reduce overfitting
– Kernel function (Kernel “trick”): compact, efficient, expressive
Handwritten Digits: an ML success story(?)
• Pixel input, e.g.:
– 32 × 32 pixels, 8 bits each
– x = 1024 dimensions, 256 values
• Multi-class classifiers
– Ten index classifiers 1vAll
– Four Boolean encoders
– All pairs w/ voting
– …
• Generic ANNs work poorly
• Generic SVMs work better
• Specially designed ANNs work well*
* Well: < 0.5% overall error (LeCun et al. ’98; Simard et al. ’03)
We are interested in generic solutions
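The "all pairs w/ voting" scheme can be sketched as follows. The pairwise classifiers here are hypothetical stand-ins (each picks whichever digit is numerically closer to a scalar input), chosen only so the example runs; a real system would plug in trained binary classifiers.

```python
from collections import Counter
from itertools import combinations

def all_pairs_vote(x, classes, pairwise):
    """Predict a class by majority vote over all pairwise classifiers.

    pairwise[(a, b)] is a function that returns either a or b for input x.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

digits = list(range(10))

# Hypothetical stand-in classifiers: each pair picks the digit closer to x.
pairwise = {
    (a, b): (lambda x, a=a, b=b: a if abs(x - a) <= abs(x - b) else b)
    for a, b in combinations(digits, 2)
}

n_classifiers = len(pairwise)          # one classifier per unordered pair
prediction = all_pairs_vote(6.8, digits, pairwise)
print(n_classifiers, prediction)
```

Ten classes give 10 · 9 / 2 = 45 pairwise classifiers, the same count as the forty-five classification problems that arise later from ten Chinese characters.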
Class Information
• Let x be the vector of image pixels: x = {x1, x2, x3, …, x1024}
• Distributed
– No crucial input pixel
– Class c: relations among many pixels
• x is sufficient
– Given the input x, the label is not ambiguous (at least to people)
– Entropy(c | x) ≈ 0
• Separator is a function of the input pixels
• It must be nonlinear: interaction / relation among pixels determines class assignment
What’s the Best Separating Hyperplane?
[Figure, shown in several builds: + and - training points with candidate separating hyperplanes; the final build marks the margin m and the support vectors]
• Margin m
• Can use the radius r of the smallest enclosing sphere
• Capacity is related to $(r/m)^2$
Kernel Methods
• Map to a new higher dimensional space– Can be very high
– Can be infinite
• Kernel functions– Introduce high dimensionality
– Computation is independent of dimensionality
– Defined w/ dot product of input image vectors (information on the cosine between image vectors)
• A kernel function defines a distance metric over space of example images
• Points not linearly separable: soft margin, margin distributions,…
SVMs for Digit Images
• $K(x, y) = (x \cdot y)^3$ or $(x \cdot y + 1)^3$
• Dot product → scalar; cube it. Consider how this works…
• Before: $32^2$ features (about $10^3$)
• Now: ~ $(32^2)^3$ features (about $10^9$)
• New feature = monomial = correlation among three pixels
• VC(lin sep) ~ # dimensions
• Overfitting problem?
– Not if the margin is large
– Monitor number of support vectors
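A small check of "consider how this works": cubing the dot product gives the same number as an ordinary dot product over all ordered three-pixel monomials, without ever building those features. This is a sketch on 4-dimensional toy vectors (values chosen arbitrarily); function names are illustrative.

```python
from itertools import product

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cubic_kernel(x, y):
    """Homogeneous cubic kernel K(x, y) = (x . y)^3."""
    return dot(x, y) ** 3

def monomial_features(x):
    """All ordered degree-3 monomials x_i * x_j * x_k (n^3 features)."""
    return [a * b * c for a, b, c in product(x, repeat=3)]

x = [1.0, 2.0, 0.5, -1.0]
y = [0.5, -1.0, 2.0, 3.0]

explicit = dot(monomial_features(x), monomial_features(y))  # 4^3 = 64 features
implicit = cubic_kernel(x, y)                               # one dot product, cubed
n_features = len(monomial_features(x))
print(explicit, implicit, n_features)
```

For a 32 × 32 image the explicit route would need (32²)³ ≈ 10⁹ monomials per image, while the kernel still costs one 1024-term dot product.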
Mercer’s Condition / Representer Theorem
• <Kernel matrix is positive semidefinite>
• The desired hyperplane can be represented as a linear weighted sum of distances to support vectors:
$$\sum_{i=1}^{m} \alpha_i K(s_i, x)$$
• Kernel defines the distance metric
• The hypothesis space is represented efficiently by using some of the training examples – the support vectors
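The representation above, a weighted sum of kernel evaluations against the support vectors, can be sketched as a decision function. The support vectors, weights, and the bias term are hypothetical, hand-picked values for illustration (not a trained model), and the bias is an assumption that does not appear on the slide.

```python
def poly_kernel(s, x, degree=3):
    """Complete polynomial kernel (s . x + 1)^degree."""
    return (sum(a * b for a, b in zip(s, x)) + 1.0) ** degree

def decision_value(x, support_vectors, coefficients, bias, kernel=poly_kernel):
    """f(x) = sum_i c_i * K(s_i, x) + bias, summed over support vectors only."""
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, coefficients)) + bias

# Hypothetical values, hand-picked for illustration.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
coefficients = [0.5, -0.5]   # signed weights, one per support vector
bias = 0.0                   # assumed; not shown on the slide

value = decision_value([2.0, 0.0], support_vectors, coefficients, bias)
label = 1 if value > 0 else -1
print(value, label)
```

Only the m support vectors enter the sum, so classification cost depends on m rather than on the dimensionality of the kernel's feature space.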
Distinguishing Handwritten Sevens vs. Twos and Eights
[Figure: sample images of handwritten twos, eights, and sevens]
Handwritten 32 x 32 gray scale pixels
Input feature space is inappropriate
Map inputs to a high-dimensional space
Many more features; nonlinear combinations
Linearly separable in the new space
Mercer Kernels
Usually start with a kernel rather than features
$(s \cdot x)^d$: homogeneous polynomials
$(s \cdot x + 1)^d$: complete polynomials
$\exp(-\|s - x\|^2 / (2\sigma^2))$: Gaussian / RBF
Kernels can be combined into new kernels: $K + k$, $cK$, $K + c$, $K k$
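The three kernel families and two of the combination rules can be sketched in a few lines. The check at the end is only a spot check: it evaluates the quadratic form cᵀGc of one Gram matrix for one vector c, which must be nonnegative for a Mercer kernel; it is not a proof of positive semidefiniteness. The points and coefficients are arbitrary.

```python
import math

def dot(s, x):
    return sum(a * b for a, b in zip(s, x))

def homogeneous_poly(s, x, d=3):
    return dot(s, x) ** d

def complete_poly(s, x, d=3):
    return (dot(s, x) + 1.0) ** d

def gaussian_rbf(s, x, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(s, x))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_sum(k1, k2):
    return lambda s, x: k1(s, x) + k2(s, x)

def kernel_product(k1, k2):
    return lambda s, x: k1(s, x) * k2(s, x)

def quadratic_form(kernel, points, c):
    """c^T G c for the Gram matrix G[i][j] = kernel(points[i], points[j])."""
    return sum(ci * cj * kernel(p, q)
               for ci, p in zip(c, points)
               for cj, q in zip(c, points))

points = [[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
combined = kernel_sum(gaussian_rbf, kernel_product(complete_poly, gaussian_rbf))
value = quadratic_form(combined, points, [1.0, -2.0, 0.5])
print(value)
```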
Problems: SVMs & statistical learning generally
• Little information from each training example
– Signal must show through the noise
– Need many training examples
– Thousands are needed for handwritten digits
• Much information is ignored (weak bias vocabulary)
• Compare w/ humans
– Novel simple shape of similar complexity
– Master with several tens (perhaps a hundred) training examples
– Exceedingly small non-fatigue error rate
• Chinese characters are much more difficult than digits
Two Related Classification Problems
Original images:
Learner   No. examples   Error
SVMs      60000          1.2%
Humans    < 100 ?        negligible
Same images after a fixed permutation over pixels:
Learner   No. examples   Error
SVMs      60000          1.2%
Humans    NA             50%
To an SVM these are the same problem. Apparently the SVM ignores information crucial to people.
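Why the two problems look identical to a dot-product kernel can be checked directly: applying one fixed permutation to every image reorders the terms of the dot product but does not change its value. A minimal sketch on random 32 × 32 "images"; the pixel values and the seed are arbitrary.

```python
import random

def cubic_kernel(x, y):
    """Complete cubic polynomial kernel (x . y + 1)^3."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 3

def permute(image, order):
    """Reorder pixels: output position i takes the pixel at index order[i]."""
    return [image[j] for j in order]

rng = random.Random(0)
n_pixels = 32 * 32
img_a = [rng.randint(0, 255) for _ in range(n_pixels)]
img_b = [rng.randint(0, 255) for _ in range(n_pixels)]

order = list(range(n_pixels))
rng.shuffle(order)   # one fixed permutation, applied to every image

before = cubic_kernel(img_a, img_b)
after = cubic_kernel(permute(img_a, order), permute(img_b, order))
print(before == after)
```

Since the whole Gram matrix is unchanged, an SVM trained on scrambled images learns exactly the same classifier, while the stroke structure people rely on is destroyed.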
Strokes Make the Difference
• Explanatory hidden features
– Humans know that strokes mediate between pixels and class labels.
– Statistical machine learners find the pattern using pixel-level inputs alone, without knowing about strokes.
• What can this example tell us?
– Statistical learning algorithms are advanced enough to extract complex patterns from data.
– But simple prior knowledge (e.g., the existence of strokes) may help to find relevant patterns faster and more accurately.
• Inventing latent features is hard for statistics
Domain Knowledge
• What can we say about strokes?
– Within an image they are written by the same person using the same writing instrument…
– They are made by a succession of simple pen movements…
– They give rise to the pixels…
– Much Information! (suppose it did not hold)
• This is not easily captured in the native bias vocabulary (not solution knowledge)
• Knowledge about strokes is imperfect, so building a bottom-up stroke extractor is error-prone.
Primary Domain: Distinguishing Handwritten Chinese Characters
• More complex than digits or Western characters (64x63 pixels).
• Thousands of different characters
• Few training examples available for each (200 labeled images for us).
• Domain knowledge includes an ideal prototype stroke representation for each character.
Handwritten Chinese Characters
• We selected ten characters in three classes:
• Yields forty-five classification problems.
• Classification difficulty varies significantly by classification problem.
Hough Transform
• Old (but good) idea
• <x, y> → <m, b> given y = mx + b
• Hough transform makes a poor line detector
• BUT explaining is easy and reliable (class label determines the ideal prototype stroke representation)
• We know the lines:
– approximate parameters,
– geometric constraints
• Find / hallucinate the Hough peaks to optimize the fit
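A minimal sketch of the slope-intercept Hough idea on toy data: every pixel (x, y) votes for each line b = y - m·x that could pass through it, and peaks in (m, b) space correspond to lines. The candidate slope list and the binning step are assumptions for illustration; this is not the authors' stroke-fitting procedure.

```python
from collections import Counter

def hough_lines(points, slopes, intercept_step=1.0):
    """Vote in (m, b) space: point (x, y) supports b = y - m * x for each candidate m."""
    votes = Counter()
    for x, y in points:
        for m in slopes:
            # Quantize the intercept so nearby lines share an accumulator cell.
            b = round((y - m * x) / intercept_step) * intercept_step
            votes[(m, b)] += 1
    return votes

# Pixels on the line y = 2x + 1, plus two stray pixels.
points = [(x, 2 * x + 1) for x in range(6)] + [(0, 5), (4, 0)]
slopes = [-2, -1, 0, 1, 2, 3]

votes = hough_lines(points, slopes)
(best_m, best_b), count = votes.most_common(1)[0]
print(best_m, best_b, count)
```

Prior knowledge enters through the search range: when the class label supplies a prototype stroke, `slopes` and the admissible intercepts can be limited to a neighborhood of the expected parameters. The (m, b) form cannot represent vertical lines, so practical implementations commonly use a polar parameterization instead.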
Feature Kernel Functions
• Design special-purpose kernel functions
• Adapt “distance” metric to fit the task
• Emphasize expected high-information content pixels
Explaining Chinese Characters
• A pixel is judged to be informative if it is likely to be part of an informative stroke feature.
• Stroke features are informative if they are distinctive between the ideal prototype characters.
• Interaction between training examples and the prior domain knowledge is crucial.
• From domain knowledge, the top and bottom horizontal strokes are unlikely to be informative.
• Explanation: apply a linear Hough transformation to identify lines in the image, and associate pixels in the images with strokes.
• Prototype stroke representations greatly aid in identifying the pixel – stroke correspondence in training examples (but not test examples).
• High information pixels correspond to distinctive stroke-level features
Constructing Explanations
[Figure: explanations constructed for example images of the characters 五 and 互]
What is an Explanation for the Feature Kernel Function Approach?
• An account of where the class information is expected to be found within the input image pixels
• Uniform emphasis over disk of 90% probability mass of the fitted Gaussian
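One way to read "uniform emphasis over the disk of 90% probability mass" is as a per-pixel weight inside a polynomial kernel. The sketch below is an assumption-laden illustration, not the authors' kernel: it assumes an isotropic 2-D Gaussian (so the 90% disk has radius σ·sqrt(2 ln 10), from 1 - exp(-r²/2σ²) = 0.9), and the `emphasis` factor, image size, and center are hypothetical.

```python
import math

def disk_radius(sigma, mass=0.9):
    """Radius holding `mass` of an isotropic 2-D Gaussian: 1 - exp(-r^2 / (2 sigma^2)) = mass."""
    return sigma * math.sqrt(-2.0 * math.log(1.0 - mass))

def emphasis_weights(width, height, center, sigma, emphasis=4.0):
    """Per-pixel weights in row-major order: `emphasis` inside the disk, 1.0 elsewhere."""
    cx, cy = center
    r = disk_radius(sigma)
    return [emphasis if math.hypot(col - cx, row - cy) <= r else 1.0
            for row in range(height) for col in range(width)]

def feature_kernel(x, y, weights, degree=3):
    """Polynomial kernel on re-weighted pixels: (sum_p w_p * x_p * y_p)^degree."""
    return sum(w * a * b for w, a, b in zip(weights, x, y)) ** degree

# Hypothetical 8 x 8 image with an informative region fitted at pixel (2, 2).
weights = emphasis_weights(8, 8, center=(2.0, 2.0), sigma=1.0)
inside = weights[2 * 8 + 2]    # pixel at row 2, col 2
outside = weights[7 * 8 + 7]   # pixel at row 7, col 7
print(disk_radius(1.0), inside, outside)
```

With positive weights this is still a valid kernel, since it equals an ordinary polynomial kernel on pixels rescaled by the square root of their weights.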
Experiments: Feature kernel function (FKF) vs. conventional (cubic polynomial SVM)
• FKF: similar performance with nearly an order of magnitude less training
• Performance by problem (scatter plot for 45 problems)
– All problems improve; FKF never hurts
– Lower slope? (suggests hardest problems are helped most)
• Learning curves by problem difficulty (as judged by SVM accuracy): A) hardest, B) middle, C) easiest third
• For each problem at full training, FKF always uses fewer support vectors
• Interaction between prior knowledge and training examples is crucial
Explanation-Augmented Support Vector Machine
• EA-SVM: another approach
• Previous approach adapted the kernel function
• EA-SVM alters the SVM algorithm; uses standard kernel function
• Explanations are integrated directly as a bias
EA-SVM: What is an Explanation?
• An explanation is a generalization of a training example, a proposed equivalence class of examples.
• Same explanation implies same label for the same reason, and should be treated the same by the classifier.
• For an SVM, examples with the same explanation should have the same margin.
• A perfect explanation is a hyperplane to which the classifier should be parallel
• Explanations are not perfect.
• So prefer a decision surface that is more nearly parallel to confirmed explanations.
• Penalize non-parallelness
[Figure: an example in feature space (x1, x2, x3) with its explanation drawn as a constraint surface]
Formalizing the Constraints Mathematically
• Let an explanation justify the label for a given example x using only a subset e of features. The explained example v is defined as:
$$v_i = \begin{cases} x_i, & \text{if } x_i \in e \\ \text{‘*’}, & \text{otherwise} \end{cases}$$
The special symbol ‘*’ indicates that this feature does not participate in the inner product evaluation. With numerical features one can simply use the value zero.
• The constraints can be expressed as:
$$w \cdot x + b = w \cdot v + b$$
or equally:
$$w \cdot (x - v) = 0$$
• Geometrically, this requires the classifier hyperplane to be parallel to the direction x – v.
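The explained example and the parallel constraint can be sketched with numerical features, using zero for ‘*’ as the slide suggests. The feature subset and the two weight vectors are hypothetical values chosen for illustration.

```python
def explained_example(x, explained_features):
    """Keep features in the explanation's subset e; use 0.0 for the rest (the '*' case)."""
    return [xi if i in explained_features else 0.0 for i, xi in enumerate(x)]

def parallel_violation(w, x, v):
    """|w . (x - v)|: zero when the hyperplane is parallel to the direction x - v."""
    return abs(sum(wi * (xi - vi) for wi, xi, vi in zip(w, x, v)))

x = [3.0, 1.0, 4.0, 2.0]
e = {0, 2}                       # hypothetical explanation: features 0 and 2 only
v = explained_example(x, e)

w_good = [1.0, 0.0, -0.5, 0.0]   # ignores the unexplained features
w_bad = [1.0, 2.0, -0.5, 1.0]    # puts weight on the unexplained features

print(v, parallel_violation(w_good, x, v), parallel_violation(w_bad, x, v))
```

A classifier satisfies the constraint exactly when x and v receive the same score, that is, when the unexplained features contribute nothing to the decision for this example.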
EA-SVMs: Explanation-Augmented Support Vector Machines
• Incorporate high quality explanations into a conventional SVM
• Classifier reflects information from both examples and domain knowledge.
• Optimal classifier blends:
– Maximal conventional margin to training examples
– Maximally parallel to high quality explanations
• We use soft constraints for each.
• Similar analyses using two sets of slack variables.
• Linear blending via cross validation.
The EA-SVM Optimization Problem
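• Perfect knowledge:
$$\begin{aligned} \text{minimize:}\quad & \tfrac{1}{2}\|w\|^2 \\ \text{subject to:}\quad & y_i (w \cdot x_i + b) - 1 \ge 0, \quad \forall i \\ & w \cdot x_i - w \cdot v_i = 0, \quad \forall i \end{aligned}$$
• Imperfect knowledge:
– Introduce new positive slack variables ($\xi_i$):
$$\forall i:\quad w \cdot x_i - w \cdot v_i \le \xi_i, \quad w \cdot v_i - w \cdot x_i \le \xi_i, \quad \xi_i \ge 0$$
– The optimization problem becomes:
$$\begin{aligned} \text{minimize:}\quad & \tfrac{1}{2}\|w\|^2 + K \sum_i \xi_i \\ \text{subject to:}\quad & y_i (w \cdot x_i + b) - 1 \ge 0, \quad \forall i \\ & w \cdot x_i - w \cdot v_i \le \xi_i, \quad \forall i \\ & w \cdot v_i - w \cdot x_i \le \xi_i, \quad \forall i \\ & \xi_i \ge 0, \quad \forall i \end{aligned}$$
– K, the confidence parameter, is determined by cross-validation; it blends empirical and explanation information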
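How the confidence parameter K trades margin against parallelness can be seen by evaluating the objective for two candidate separators. This sketch assumes the imperfect-knowledge objective has the form ½‖w‖² + K·Σᵢξᵢ with ξᵢ = |w·xᵢ - w·vᵢ| and hard margin constraints; it only evaluates the objective and does not solve the optimization. The toy data and weight vectors are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ea_svm_objective(w, b, examples, K):
    """Evaluate 0.5 * ||w||^2 + K * sum_i xi_i using the smallest feasible slacks.

    examples: list of (x, y, v) with label y in {-1, +1} and explained example v.
    Returns None if a margin constraint y * (w . x + b) >= 1 is violated.
    """
    slack_total = 0.0
    for x, y, v in examples:
        if y * (dot(w, x) + b) < 1.0:
            return None
        # Smallest xi_i satisfying |w . x_i - w . v_i| <= xi_i
        slack_total += abs(dot(w, x) - dot(w, v))
    return 0.5 * dot(w, w) + K * slack_total

# Toy data: feature 0 is explained, feature 1 is not (v zeroes it out).
examples = [
    ([2.0, 2.0], +1, [2.0, 0.0]),
    ([-2.0, -2.0], -1, [-2.0, 0.0]),
]
w_parallel = [0.5, 0.0]    # relies only on the explained feature
w_tilted = [0.25, 0.25]    # smaller norm, but leans on the unexplained feature

print(ea_svm_objective(w_tilted, 0.0, examples, K=0.0),
      ea_svm_objective(w_parallel, 0.0, examples, K=0.0))
print(ea_svm_objective(w_tilted, 0.0, examples, K=1.0),
      ea_svm_objective(w_parallel, 0.0, examples, K=1.0))
```

With K = 0 the explanations are ignored and the smaller-norm `w_tilted` wins, as in a standard SVM; with K = 1 the slack penalty makes `w_parallel` the better choice.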
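Solutions for EA-SVM
• With perfect knowledge:
$$w = \sum_i \alpha_i y_i x_i - \sum_i \beta_i (x_i - v_i)$$
where
$$\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0$$
• With imperfect knowledge:
$$w = \sum_i \alpha_i y_i x_i - \sum_i \beta_i (x_i - v_i)$$
where
$$\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0, \quad -K \le \beta_i \le K$$
• When the confidence parameter K goes to infinity, the second solution reduces to the first.
• When K and the $\beta_i$ are 0, the problem ignores the explanations and reduces to a standard SVM.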
Formal Analysis: Why EA-SVM works
• EA-SVM algorithm minimizes the following error bound:
• Interesting symbols in the expression of h:
– $R_v$: the radius of the ball that contains all the explained examples. We expect $R_v < R$.
– D: the penalty incurred when a separator <u, b> violates the parallel constraints imposed by explanations.
– $\Delta$ is determined by cross-validation to minimize h.
A Simple Prediction
• A closer look at h:
$$h = 64.5\,(R_v^2 + \Delta^2)(1 + D^2/\Delta^2)\,/\,\gamma^2$$
• With perfect knowledge, D = 0:
$$h = 64.5\,R_v^2 / \gamma^2$$
• Without knowledge:
$$h = 64.5\,R^2 / \gamma^2$$
• EA-SVM has most to offer when the ratio $R_v/R$ is small, which means explanations use few important features to justify the label. Intuitively, the learning problem is difficult but the domain knowledge is informative.
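The role of the tuning parameter in h can be worked through numerically. The sketch assumes the bound has the form h = 64.5 (R_v² + Δ²)(1 + D²/Δ²) / γ², with Δ the cross-validated parameter and γ the margin; both symbol names are assumptions. Setting the derivative with respect to Δ² to zero gives Δ² = R_v·D for D > 0, and substituting back gives h = 64.5 (R_v + D)² / γ². The numbers are arbitrary.

```python
import math

def capacity_h(r_v, d, delta, gamma):
    """h = 64.5 * (R_v^2 + Delta^2) * (1 + D^2 / Delta^2) / gamma^2 (assumed form)."""
    return 64.5 * (r_v ** 2 + delta ** 2) * (1.0 + d ** 2 / delta ** 2) / gamma ** 2

def best_delta(r_v, d):
    """Delta minimizing h for D > 0: d h / d(Delta^2) = 0 gives Delta^2 = R_v * D."""
    return math.sqrt(r_v * d)

r_v, d, gamma = 2.0, 0.5, 1.0
delta = best_delta(r_v, d)
h_min = capacity_h(r_v, d, delta, gamma)
closed_form = 64.5 * (r_v + d) ** 2 / gamma ** 2
print(h_min, closed_form)
```

As D shrinks toward 0 the minimized value approaches 64.5 R_v² / γ², the perfect-knowledge case, so the benefit over 64.5 R² / γ² is governed by the ratio R_v / R.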
Experiment 1: Does Explanation-Augmentation Help?
[Figure: scatter plot of EA-SVM error (vertical axis) against SVM error (horizontal axis), both from 0 to 0.2]
Results for 45 classifiers on pairs of Chinese characters. Below the line means EA-SVM makes fewer errors than SVM.
Experiment 2: Difficult Problems Benefit More
[Figure: error rate vs. training size (5, 10, 20, 40, 80, 160) for four curves: EA-SVM Easy, SVM Easy, EA-SVM Difficult, SVM Difficult]
[Figure: improvement vs. difficulty]
• EA-SVM vs. SVM
– Easy tasks: similar
– Difficult tasks: EA-SVM wins at all training levels.
• Task difficulty is highly correlated with improvement of EA-SVM over conventional SVM.
Exp 3: Robustness and the Effect of Knowledge Quality
[Figure: three scatter plots of EA-SVM error against SVM error, one each for expert knowledge, random knowledge, and opposite knowledge]
EA-SVM benefits from good knowledge, and is not hurt by incorrect knowledge.
Exp 4: Additional (Non-image) Domains.
• Protein Explanations: only known motif sequences are important for proteins’ categorization.
• Text Explanations: Only words related to the category label are important.
• ROC (protein) and F1 (text) scores show EA-SVM improvement.
[Figure: A. protein, ROC_EA-SVM against ROC_SVM (axes 0.4 to 1); B. text, F1_EA-SVM against F1_SVM (axes 0.6 to 1)]
Previous Work on Incorporating Knowledge into SVMs (Solution Knowledge)
• Incorporating transformation invariance into SVMs.
– Virtual support vector (Schölkopf, 1996)
– Invariant kernel function (Schölkopf, 2002)
– Jittered SVM (DeCoste & Schölkopf, 2002)
– Tangent propagation (Simard 1992, 1998)
• Locally-improved kernel function exploits the spatial locality property (Schölkopf, 1998)
• Convolutional networks (LeCun et al 1998, Simard et al 2003)
• Knowledge-based SVMs and kernels incorporate prior rules. (Fung, Mangasarian & Shavlik, 2002, 2003; Mangasarian, Shavlik & Wild 2004)
• Extracting character high-level features from pixel representation. (Teow 2000, Shi 2003, Kadir 2004…)
Conclusion
• Inductive learning algorithms can benefit from domain knowledge.
• This work illustrates a novel direction of using knowledge by combining EBL ideas into a statistical learner.
• With Domain Knowledge, the expert need not also be expert in the learning algorithms.
• The EBL components are extremely simple; more can be done.
• The role of Domain Knowledge rather than Solution Knowledge demands further study; this is an important and little-explored direction.
• Next step: IJCAI07 poster, “Explanation-Based Feature Construction,” Shiau Hong Lim