IIIT Hyderabad
Writer Identification and Verification
for online Handwriting
Sachin Gupta([email protected])
Advisor
Dr. Anoop M. Namboodiri
Handwriting
Graphical representation of thoughts
• Using predefined symbols
• Still used frequently (e.g., note taking)
An acquired skill
• Years of habituation and practice
Complex generation process
• Neuromuscular perceptual-motor task
• Hand contains some 27 bones and 40 muscles
Handwriting Identification
Handwritten documents have associated identity
Handwriting Identification
• Study of writership of the documents
• Comparison with reference handwritten documents
Applications of handwriting analysis
• Forensic tool
• Crime detection tool
• Social compatibility tool
• Employment tool
• Business tool
• Self development tool
• Genealogy tool
• Scientific research tool
• Graphic tool
• Health tool
Recognition vs. Identification
Handwriting Recognition
• Automatically understand the underlying text in the document
• Design of automated handwritten document reading systems
• Suppress variation due to writer or handwriting style
Handwriting Identification
• Determine the writer of the document
• Enhance the variation due to different handwriting styles
Problem Statement
Writer Identification
• Identify the writer of a questioned document
• Given: a pool of writers
Writer Verification
• Verify whether the claimed identity is correct
• Given: a database of writers
Forensic Document Analysis
• Verify whether two given documents were written by the same person
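Identification
1:N Matching
[Figure: the questioned document ("Who wrote this document?") is compared against every writer in the reference database. Each comparison yields a matching score (35, 50, 65 in the example), and the best match is reported (Result: Writer-3).]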
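The 1:N matching idea can be sketched in a few lines. This is only an illustration, assuming each document has already been reduced to a fixed-length feature vector and that the matching score is a Euclidean distance (lower is better). The slide specifies neither, and the names `identify_writer` and `reference_db` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify_writer(questioned, reference_db):
    """Compare the questioned document with every reference writer (1:N)."""
    scores = {writer: euclidean(questioned, ref)
              for writer, ref in reference_db.items()}
    best = min(scores, key=scores.get)  # smallest distance wins
    return best, scores

# Toy reference database: one (assumed) feature vector per writer.
reference_db = {
    "Writer-1": [0.9, 0.1, 0.3],
    "Writer-2": [0.2, 0.8, 0.5],
    "Writer-3": [0.4, 0.4, 0.9],
}
best, scores = identify_writer([0.5, 0.4, 0.8], reference_db)
print(best, scores)
```

With a similarity score instead of a distance, the only change would be to take the maximum rather than the minimum.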
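Verification
1:1 Matching
[Figure: the reference database holds writers Mayank, Sachin, and Amit. Mayank claims "I wrote this document!" and the comparator measures the distance between the questioned document and Mayank's reference. If Distance < Threshold the answer is Yes, otherwise No.]
Threshold: decided based on the within-writer and between-writer distance distributions of the training documents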
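A minimal sketch of the 1:1 decision follows. The slide only says the threshold comes from the within-writer and between-writer distance distributions. The midpoint rule in `choose_threshold` is an assumption made for illustration, and both function names are hypothetical.

```python
from statistics import mean

def choose_threshold(within, between):
    """Assumed rule: midpoint between the mean within-writer and
    mean between-writer training distances."""
    return (mean(within) + mean(between)) / 2.0

def verify(distance, threshold):
    """Accept the claimed identity when the distance is below the threshold."""
    return distance < threshold

# Toy training distances (illustrative values, not from the experiments).
within = [0.8, 1.0, 1.2, 0.9]
between = [2.5, 3.0, 2.8, 3.2]
threshold = choose_threshold(within, between)

print(threshold, verify(1.1, threshold), verify(2.9, threshold))
```

Moving the threshold toward the between-writer distances accepts more genuine writers at the cost of accepting more impostors.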
Individuality Features
Sub-character and character level
• Shape and size
• Choice of allograph
Word level
• Connections and character spacing
• Aspect ratio
Line level
• Slant and slope
• Word spacing
Paragraph and page level
• Indentations and arrangements of text
• Uniformity of margins
[Figure: character-level individuality, writers W1 and W2]
[Figure: word-level individuality, writers W1 and W2]
Line and Paragraph Level
[Figure: handwriting samples from Writer-1 and Writer-2]
• Slant and slope of lines
• Parallelism of lines
• Word spacing – number of words in a line
• Uniformity of margins
• Overall texture
Challenges
High within-writer variation
• Due to the mood-dependent nature of handwriting
• No two pieces of handwriting by an individual are the same
Low between-writer variation
• Handwriting must be readable
• The degree of variation is low
Online vs. Offline
Offline
• Matrix of integers
• Only shape and size information is available
• Temporal information about how a stroke is drawn is lost
Online
• Sequence of x-y coordinates and pen up/down events
• Shape and size information is available
• Sequencing of points and strokes is available
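To make the online representation concrete, here is a small sketch that groups a stream of pen samples into strokes. The `(x, y, pen_down)` sample format and the names `Stroke` and `strokes_from_events` are assumptions for illustration. Real digitizers use their own formats.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Stroke:
    """Points recorded between one pen-down and the following pen-up."""
    points: List[Tuple[float, float]] = field(default_factory=list)

def strokes_from_events(events):
    """Group (x, y, pen_down) samples into an ordered list of strokes."""
    strokes, current = [], None
    for x, y, pen_down in events:
        if pen_down:
            if current is None:       # pen just touched the surface
                current = Stroke()
            current.points.append((x, y))
        elif current is not None:     # pen lifted: close the stroke
            strokes.append(current)
            current = None
    if current is not None:           # stream ended while the pen was down
        strokes.append(current)
    return strokes

events = [(0, 0, True), (1, 1, True), (2, 1, False),
          (5, 0, True), (6, 2, True), (7, 3, True)]
strokes = strokes_from_events(events)
print(len(strokes), [len(s.points) for s in strokes])
```

The order of strokes and of points within each stroke is exactly the temporal information that an offline image loses.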
Data collection and Annotation
Major hurdle
• Sequential process: devices are needed for online handwriting
• People are reluctant to write
• Standard databases are not available
Online handwriting collection devices are not accurate
Automatic segmentation and annotation
• Research problem
Data collection
• 600 pages of data from around 50 writers in various scripts
State of the Art
Done by handwriting experts
• Mostly manually
• State-of-the-art systems are not available
Using
• Context-dependent information such as origin, type, and condition of the documents
• Difficult to model mathematically
Theme
Identifying consistent features automatically
• To discriminate between writers
Usability of discriminating features
• Preserve discrimination
Major Contributions
Text-independent writer identification
• Designing a codebook of writers
• Automatically identifying and extracting discriminating features
Text-dependent writer verification
• Writer-specific text generation
• Robust to forgery
Forensic document examination
• Repudiation detection in handwritten documents
Text-independent?
Underlying text is not known
• Data is not annotated
• Given: sequence of strokes and x-y coordinate values
Challenges of text-independent identification
• Extract consistent curves (features) from documents
• Compare similar features between two documents
• Design a codebook of individual writers
Theoretical background
Handwriting modeling studies
• A stroke is a combination of different forces
• Handwriting curves become consistent due to habituation
The relative positions of velocity extrema within a stroke are constant for the same writer (empirical result)
[Figure: a stroke from Devanagari script and its velocity profile]
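A velocity profile and its extrema can be computed directly from the sampled points. The sketch below assumes uniformly sampled points (constant time step `dt`) and uses strict local minima and maxima of the speed. It is not the feature extraction used in the thesis, and all function names are hypothetical.

```python
import math

def speed_profile(points, dt=1.0):
    """Speed between consecutive samples, assuming a constant time step."""
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]

def local_extrema(values):
    """Indices of strict local minima and strict local maxima."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
        elif values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
    return minima, maxima

def relative_positions(indices, n):
    """Map extrema indices to positions in [0, 1] along the profile."""
    return [i / (n - 1) for i in indices]

slow = [(0, 0), (1, 0), (3, 0), (6, 0), (7, 0), (7.5, 0), (9, 0)]
fast = [(2 * x, 2 * y) for x, y in slow]   # same shape, written twice as fast

slow_min, slow_max = local_extrema(speed_profile(slow))
fast_min, fast_max = local_extrema(speed_profile(fast))
print(slow_min, slow_max, fast_min, fast_max)
```

In this toy case the extrema fall at the same relative positions even though the absolute speed doubles, which is the kind of invariance the slide refers to.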
Classifier
Summarized framework
[Figure: the curves of the questioned document are grouped into clusters 1, 2, 3, ..., n. Each cluster has its own classifier (NN1, NN2, NN3, ..., NNn) that produces a soft classification, and the combined result is used to classify writers.]
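The slide does not say how the soft outputs are combined. One common choice, assumed here purely for illustration, is the product rule: sum the log-probabilities that each curve's classifier assigns to every writer and pick the writer with the highest total. The name `combine_soft_outputs` is hypothetical.

```python
import math

def combine_soft_outputs(per_curve_probs, eps=1e-9):
    """Combine per-curve writer probabilities with the product rule
    (implemented as a sum of logs for numerical stability)."""
    n_writers = len(per_curve_probs[0])
    scores = [0.0] * n_writers
    for probs in per_curve_probs:
        for w, p in enumerate(probs):
            scores[w] += math.log(p + eps)   # eps guards against log(0)
    best = max(range(n_writers), key=scores.__getitem__)
    return best, scores

# One row per curve of the questioned document, one column per writer.
per_curve_probs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
best, scores = combine_soft_outputs(per_curve_probs)
print(best, scores)
```

Because every curve contributes evidence, adding more curves should sharpen the decision, consistent with accuracy rising with the number of curves.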
Results
Experimented with
• Roman, Hindi, Cyrillic, Arabic and Hebrew
Training data
• Approx. 300-400 curves for Roman
• Approx. 700-800 curves for others
Test data
• 100 curves for Roman
• 200-300 curves for others
Tables and graphs are on the next page.
Varying Number of Curves
• Accuracy increases with the number of curves.
• >85% accuracy is reached with 200 curves (10-12 words).
[Figure: accuracy with 12 words]
Script vs. Accuracy
• ~10 writers for all scripts
• For most scripts, top-2 accuracy is nearly 100%; Chinese is the exception
• Confusion between pairs of writers
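Top-2 accuracy counts a test document as correct when the true writer is among the two best-ranked candidates. A minimal sketch, with toy rankings and the hypothetical helper `top_k_accuracy`:

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)

# Each row lists candidate writers from most to least likely (toy data).
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "A", "B"], ["A", "C", "B"]]
truth = ["A", "A", "B", "C"]

print(top_k_accuracy(ranked, truth, 1), top_k_accuracy(ranked, truth, 2))
```

A large gap between top-1 and top-2 accuracy is what one would expect when errors come mainly from confusion between pairs of writers.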
Related work
• Line level features
 – Word spacing
 – Lower and upper profile
 – Fractal & wavelet features
 – Loops and blobs
• Paragraph level features
 – Image processing
   • Grey scale histogram
   • Run length coding
   • Fractal image compression
 – Texture features
   • Gabor filter, wavelet
   • Contourlet GGD
   • Grey scale covariance matrix
 – Online features
   • Pen pressure, velocity, azimuth
   • Velocity of barycenter
 – Codebook generation
   • Using directional features
• Our approach
 – Codebook design using sub-character features
 – Script-independent framework
 – Online handwriting data
 – Identification with a small amount of data
 – Automatic identification of consistent and discriminating features
Result comparison
Schomaker et al. [28]
• Combination of directional, texture and image processing features
• Identification: accuracy of 87% with 900 writers
• Verification: equal error rate of 3%-8%
• Test data size: 1 page of handwritten data
Our approach [5]
• Using shape-based features
• Identification accuracy of ~85% with 15 writers
• Test data size: 12 words (1 line)
Analysis
Shape- and size-based primitives
• Obtain reasonable accuracy with a simple algorithm
Chinese script
• Most of the strokes are straight line segments
• Features based on inter-stroke relations can be used
To increase accuracy
• Robust clustering and classification algorithms
• Fusion with high-level primitives such as line and paragraph features
Problem Statement
Text-independent systems
• Large amount of data needed
Text-dependent framework
• Higher accuracy
• Small amount of data needed
Problems (text-dependent systems)
• Forgery (due to fixed text known in advance)
• Authentication text not known (usually random text is used)
Signature vs. Text-dependent
Signature and text-dependent handwriting
• Variations are unlimited; a signature need not be readable
• Writer consciously tries to write the same signature
Challenges
• Within-writer and between-writer variation has to be discriminated
• A discriminating distance measure has to be found
System Specification
Empirical finding
• Discriminating power of primitives varies across individuals
• Primitives: sub-characters, characters, words, etc.
System specifications
• Writer-specific text
 – For higher accuracies
 – With limited amount of text
• Varying text across multiple authentications
 – Robust to forgery
Boosting?
Classifier combination method
• Combines weak classifiers to generate an accurate learning algorithm
• Greedy algorithm
Selects a weak classifier at each stage
• Based on previously selected classifiers
Maintains a distribution of weights over training samples
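The points above describe the general boosting recipe. As a hedged illustration, here is a textbook discrete AdaBoost with threshold stumps as weak classifiers. It is not the writer-specific variant used in this work (which adds randomness to the selection), and the function names are hypothetical.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Weak classifier: +polarity if the feature is >= threshold, else -polarity."""
    return polarity if row[feature] >= threshold else -polarity

def best_stump(X, y, w):
    """Greedy step: pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(row, f, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, rounds):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(X)
    w = [1.0 / n] * n                        # distribution over training samples
    ensemble = []
    for _ in range(rounds):
        err, f, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, f, thr, pol))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(a * stump_predict(row, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# No single stump separates this pattern, but three boosted stumps do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])
```

Each round depends on the previous ones only through the sample weights, which is why the procedure is greedy.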
Framework
Verification as a 2-class problem
• Positive samples vs. negative samples
Given
• Set of writers and primitives
• Table of discriminating power
Randomness is included at each stage
• Proportional to the discriminating power of the classifier
• More discriminating: more probable to be accepted
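Selection with probability proportional to discriminating power can be sketched with weighted sampling. The primitive names and power values below are invented for illustration, and `sample_primitives` is a hypothetical helper, not the selection step of the actual framework.

```python
import random

def sample_primitives(discriminating_power, k, rng):
    """Draw k distinct primitives; each draw is weighted by discriminating power."""
    pool = dict(discriminating_power)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]                 # sample without replacement
    return chosen

# Assumed discriminating power of four primitives for one writer.
power = {"the": 0.9, "and": 0.6, "of": 0.1, "ing": 0.4}
text = sample_primitives(power, 3, random.Random(0))
print(text)
```

Because the draw is random, the generated text differs between authentication sessions while still favoring the primitives that discriminate best.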
Text Generation Process
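[Figure: primitives are drawn from a bag of primitives and tested against the list of writers (W1 to W6). For each candidate primitive the process decides whether to select it or not, fixes a threshold, and rejects writers, tracking accuracy.]
• Randomness is included in the selection process.
• The selected threshold is biased toward accepting the writer
 – For lower false rejection rates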
Effect of Boosting
[Figure: probability vs. distance for the within-writer and between-writer distance distributions (marker X1), shown as the number of boosting stages varies]
Dynamic Time Warping
• Time series alignment
• Dynamic programming approach
• Feature vectors of different lengths can be compared
[Figure: naïve alignment of re-sampled series compared with DTW alignment]
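The dynamic programming recurrence behind DTW fits in a few lines. This is the standard unconstrained formulation on scalar sequences with absolute difference as the local cost. The local cost and the absence of a warping window are assumptions, since the slides do not give these details.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # sequences of different length
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

For strokes, the elements would be per-point feature vectors and `dist` a vector distance; the recurrence stays the same.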
Stroke Comparison
Dynamic Time Warping
• Alignment of strokes is done using dynamic programming
Directional features
• Stroke representation: 12 bins of curvature directions
• Curvature angle: difference between adjacent tangent directions
[Figure: example 12-bin histogram of curvature directions over 0 to 360 degrees, with bin counts 1 1 2 3 3 4 3 0 0 0 0 1]
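A 12-bin curvature histogram of this kind can be computed as follows. The sketch assumes the tangent direction is taken between consecutive sample points and that turning angles are wrapped to the range 0 to 360 degrees before binning. The exact conventions of the original system are not given in the slides, and `curvature_histogram` is a hypothetical name.

```python
import math

def curvature_histogram(points, n_bins=12):
    """Histogram of turning angles between adjacent tangent directions."""
    tangents = [math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))
                for p, q in zip(points, points[1:])]
    hist = [0] * n_bins
    width = 360.0 / n_bins
    for t0, t1 in zip(tangents, tangents[1:]):
        angle = (t1 - t0) % 360.0            # wrap to [0, 360)
        hist[int(angle // width) % n_bins] += 1
    return hist

# A stroke that goes straight, then turns left by about 45 degrees twice.
stroke = [(0, 0), (1, 0), (2, 0), (3, 1), (3, 2)]
hist = curvature_histogram(stroke)
print(hist)
```

The histogram has a fixed length regardless of the number of points, so strokes of different lengths can be compared directly.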
Results
Experimented with English script (20 writers) and Hindi script (10 writers)
DTW and directional feature extraction methods are used
Each user wrote about 10-12 words
• 3-fold cross-validation is used
Performance measures
False acceptance rate
• Percentage of forgers who are accepted
• Should be lower for forensic applications
 – Security is the major concern
False rejection rate
• Percentage of genuine users who are rejected
• Should be lower for civilian applications
 – Usability is the major concern
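Both rates follow directly from the comparator distances and the decision threshold. A minimal sketch, assuming a sample is accepted when its distance is below the threshold (the distances are toy values and `far_frr` is a hypothetical name):

```python
def far_frr(genuine_distances, forgery_distances, threshold):
    """Return (FAR, FRR) in percent for a 'distance < threshold' accept rule."""
    false_accepts = sum(1 for d in forgery_distances if d < threshold)
    false_rejects = sum(1 for d in genuine_distances if d >= threshold)
    far = 100.0 * false_accepts / len(forgery_distances)
    frr = 100.0 * false_rejects / len(genuine_distances)
    return far, frr

genuine = [0.5, 0.8, 1.1, 1.6]    # distances for genuine attempts
forgery = [1.4, 2.0, 2.5, 3.1]    # distances for forgery attempts

print(far_frr(genuine, forgery, 1.5))
print(far_frr(genuine, forgery, 1.0))
```

Lowering the threshold trades a lower false acceptance rate for a higher false rejection rate, which is why the operating point depends on whether security or usability matters more.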
Definition
Threshold-1
• Controls the range of variations within writers
• Decided based on positive samples
Threshold-2
• Confidence before rejecting other writers (negative samples)
• Lower threshold-2 == higher confidence
Analysis and Summary
Writer-specific text generation framework
Automatic text generation
Automatic threshold generation
Text is varied
• Robust to forgery
Related work
• Features
 – Character level
   • GSC features
   • Structural features
   • Directional features
 – Word level
   • Word model recognition
   • Shape curvature
   • Shape context
   • Morphological features
• Feature selection
 – Static feature selection
 – PCA-based discriminating power
• Our approach
 – Writer-specific text generation
 – Boosting-based framework
 – Text variation
 – Higher accuracy with limited amount of data
Comparison
Srihari et al. [17]
• Shape context, shape curvature, GSC features, WMR features
• Performance: 42%, 22%, 62% and 28% respectively (1000 writers)
• Test data size: 10 words
Our approach
• Directional features
• Performance: 95% (20 writers)
• Test data size: 5 words
Traditional writer identification vs. QDE (questioned document examination)
Assumption of natural handwriting
Biometrics terms
• Repudiation (negative biometrics)
• Forgery (positive biometrics)
Quantity and quality of data available
Cost factor involved
• Used as expert witness in legal verdicts
Repudiation
The rejection or renunciation of a duty or obligation (as under a contract)
Merriam-Webster's Dictionary of Law
Handwriting repudiation
• A writer deliberately alters his or her natural handwriting to avoid detection
• To deny involvement in the case
Repudiation
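1:1 Matching
[Figure: the questioned document and a reference document are passed to a comparator that calculates the distance between them. Hypothesis testing decides whether the distance is significant, that is, whether the documents were written by the same writer or by different writers. No database of writers is used.]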
Verify whether the given documents were written by the same person or by different persons, without assuming natural handwriting
Challenges
Within-writer variation becomes high
Between-writer variation becomes low in comparison
Learning cannot be done because data is not available
Ray of Hope
One cannot exclude from one's own writing those discriminating elements of which one is not aware
The positions of maximum and minimum velocity points remain the same regardless of absolute velocity
Words have significant overlap at the sub-character level
Framework
• Statistically significant score between two documents
• Utilize whatever online information is available
• No assumptions about the distribution of data
 – Such assumptions may lead to erroneous conclusions
Assumptions
• Questioned and reference documents either have significant overlap or are the same at the word level.
• The reference document is collected in online mode.
Hypothesis Testing
• To calculate the significance of the distance between two distributions.
• According to the Neyman-Pearson paradigm
 – H0: documents written by the same writer (null hypothesis)
 – H1: documents written by different writers (alternative hypothesis)
• Intra-document word distances and inter-document word distances are the two distributions to be compared.
• The distributions are compared to find out whether they are generated from the same population.
Distribution Comparison
• KL divergence test (makes assumptions about the nature of the distribution)
• Kolmogorov-Smirnov test (makes no assumptions about the distribution)
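The two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only that statistic on toy distances. Turning it into a significance level requires the KS distribution, for which a library routine such as SciPy's `scipy.stats.ks_2samp` would normally be used.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_max = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

# Toy word distances: within one document vs. between two documents.
intra = [1, 2, 3, 4]
inter = [3, 4, 5, 6]

print(ks_statistic(intra, inter))   # partial overlap
print(ks_statistic(intra, intra))   # identical samples
```

A statistic near 0 is consistent with both sets of distances coming from the same population (same writer); a value near 1 indicates clearly separated distributions.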
Results
• Data were collected from 23 different users writing in English.
• For each user, 3 pages of normal data and 3 pages of repudiated data were collected.
• Preprocessing
 – Words are segmented using a semi-automatic toolkit for word segmentation.
Analysis of Results
• Semi automatic System
• Used as an aid to expert
• Null Hypothesis is never accepted without expert intervention.
[Figure: scale used by forensic experts, running from Similar to Different (axis ticks -1, 0, 1): strong probability of identification; probable; indications; no conclusion; indications did not; probably did not; strong probability did not]
Conclusion and Future work
A learning-based framework to learn similarity, in spite of discrimination between documents
Can we tell whether a writer is trying to repudiate?
A framework that can learn more features and give independent scores on each feature
Conclusions
Proposed algorithms for automatic identification and extraction of discriminating features for online handwriting
Framework proposed for writer-specific text generation and text variations for text-dependent systems
Introduced the problem of repudiation and proposed a hypothesis testing based framework for the same
Publications
Sachin Gupta and Anoop M. Namboodiri, "Repudiation Detection in Handwritten Documents," Proc. of the 2nd International Conference on Biometrics (ICB'07), pp. 356-365, Seoul, Korea, 27-29 August 2007.
Anoop M. Namboodiri and Sachin Gupta, "Text Independent Writer Identification from Online Handwriting," International Workshop on Frontiers in Handwriting Recognition (IWFHR'06), October 23-26, 2006, La Baule, Centre de Congress Atlantia, France.
Sachin Gupta and Anoop M. Namboodiri, "Text dependent Writer Verification using Boosting," in Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR'08), Montreal, Canada.
Sachin Gupta and Anoop M. Namboodiri, "Text dependent Writer Verification," planned in IEEE Transactions on Information Forensics and Security, 2008.
Future work
Fusion of online and offline features for higher accuracies
Can we automatically detect a person's intention to repudiate or forge?
• Based on a single document
More robust algorithms for feature extraction
• Different from standard feature selection approaches
Face Detection
• Boosting classifiers
• Simple Haar filters were used
• Filters are selected using boosting
[Figure: panels (a) to (d)]
Paul Viola and Michael Jones, "Robust Real-time Face Detection", International Journal of Computer Vision, 2004.
Object classification: learning the invariances
• Use multiple kernel learning framework
[Figure: Shape - 3.94, Color - 0, Texture - 0]
Image Segmentation
• Active contour
 – An initial contour has to be given
 – Contour adjusts itself to the object using external and internal energy
 – Useful in object tracking
• Graph cuts
 – Represents the image as a graph
 – Finds the cut in the graph using minimum cut / maximum flow
 – Cannot diverge outward; only converges inward