document examiner feature extraction: thinned vs skeletonised images

1

Document Examiner Feature Extraction: Thinned vs Skeletonised Images

Vladimir Pervouchine and Graham Leedham

Forensics and Security Laboratory

School of Computer Engineering

Nanyang Technological University

Singapore

2

Outline

• Forensic handwriting examination
• The need for accurate stroke extraction
• Thinning based method
• Vector skeletonisation method
• Feature extraction
– From thinned images
– From vector skeletons
• Writer classification method
• Results
• Conclusions

3

Forensic handwriting examination

Variation of the word “the” written by 8 different writers. Source: Harrison, 1981

4

• Variation of the letters “G” and “R” written by 15 different writers.

Source: Harrison, 1981


5

Example of variation in letter formation styles in 10 letters from 9 different writers.

Source: Harrison, 1981


6

Current Methods used by Forensic Document Examiners

• Examination primarily involves manual extraction and comparison of various global and local visible features.

• Examiners usually compare a “Questioned Document” against a set of “Known Documents”.

• The objective is to determine whether the “Questioned Document” was, or was not, written by a particular individual.

• The “Questioned Document” may be in disguised handwriting.

7

Forgery / Disguise / Alteration

(i) Is the writing GENUINE? (the author is who he claims to be)

(ii) Is the writing FORGED? (the author is not who he claims to be and is attempting to assert the writing is the same as someone else’s) or

(iii) Is the writing DISGUISED? (the author wishes to deny doing the writing at a later date) or

(iv) Is the writing ALTERED? (Has someone modified or altered the original document?)

8

Extraction of handwritten strokes from images

• Forensic document examiners analyse the pen tip trajectory

• The trajectory is not readily available from the grayscale handwriting images

• To mimic extraction of document examiner features it is necessary to approximate pen trajectory

• We need to preserve individual information in character shapes

• Many algorithms have been proposed for a similar problem in offline handwriting recognition, but they do not need to preserve the individual traits of characters

9

Thinning based stroke approximation

• Matlab Image Processing toolbox thinning (Zhang and Suen thinning algorithm) is used for the first approximation

• Post processing is applied to
– remove extra branches
– remove spurious loops
– remove small connected components

• Feature extraction attempts to overcome remaining artifacts

Processing flow (figure): Original image → Binarisation → Thinning → Remove small connected components → Find junction points → Find end points → Correct spurious loops → Prune short branches, with a loop that repeats while changes are made
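The thinning step can be illustrated with a minimal sketch of the textbook Zhang and Suen algorithm in plain Python. This is not the Matlab toolbox implementation used in the slides; the function name `zhang_suen_thin`, the list-of-lists image format (1 = ink) and the assumption of a one-pixel background border are all choices made for this example.

```python
def zhang_suen_thin(image):
    """Thin a binary image (list of lists, 1 = ink) to one-pixel-wide lines.

    Assumption for this sketch: the outermost rows and columns are background.
    """
    img = [row[:] for row in image]  # work on a copy
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9, clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    def transitions(n):
        # Number of 0 -> 1 transitions in the circular sequence P2, ..., P9, P2.
        return sum(1 for a, b in zip(n, n[1:] + n[:1]) if a == 0 and b == 1)

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of the algorithm
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Pixels are removed only after the whole pass, as the algorithm requires.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# Example: a bar three pixels thick is reduced to a single row of pixels.
bar = [[0] * 12 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 10):
        bar[r][c] = 1
thinned = zhang_suen_thin(bar)
```

The raw result of such thinning typically still contains the artifacts listed above (extra branches, spurious loops, small components), which is why the post-processing stages follow.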

10

Thinning based stroke approximation

1. Original image
2. Binarised image
3. Thinned image
4. Corrected image
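Going from the thinned image to the corrected image relies on locating end points and junction points. The slides do not say how these are detected; one common approach, sketched here as an assumption, is the crossing number (the count of background-to-ink transitions around a pixel's 8-neighbourhood). The function names are hypothetical.

```python
def crossing_number(skel, r, c):
    """Count 0 -> 1 transitions around the 8-neighbourhood of pixel (r, c)."""
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

    def at(rr, cc):
        # Pixels outside the image are treated as background.
        inside = 0 <= rr < len(skel) and 0 <= cc < len(skel[0])
        return skel[rr][cc] if inside else 0

    ring = [at(r + dr, c + dc) for dr, dc in offsets]
    return sum(1 for a, b in zip(ring, ring[1:] + ring[:1]) if a == 0 and b == 1)


def find_end_and_junction_points(skel):
    """Return (end_points, junction_points) of a one-pixel-wide skeleton."""
    ends, junctions = [], []
    for r, row in enumerate(skel):
        for c, value in enumerate(row):
            if not value:
                continue
            cn = crossing_number(skel, r, c)
            if cn == 1:
                ends.append((r, c))       # a single branch leaves this pixel
            elif cn >= 3:
                junctions.append((r, c))  # three or more branches meet here
    return ends, junctions


# Example: a plus-shaped skeleton with four arms meeting at (3, 3).
plus = [[0] * 7 for _ in range(7)]
for i in range(1, 6):
    plus[3][i] = 1
    plus[i][3] = 1
ends, junctions = find_end_and_junction_points(plus)
```

With end points and junctions known, a branch can be measured from an end point to the nearest junction and pruned if it is short; the length threshold is not given in the slides.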

11

Vector skeletonisation method

• 1st stage: vectorisation. Spline-approximated skeletal branches are formed

• 2nd stage: minimum cost configuration of branch interconnections is found. Branches are grouped into strokes
– For each retraced segment of a stroke, restoration of the hidden loop is attempted

• 3rd stage: Near-junction and loop spline knots are adjusted to make strokes smoother

Processing flow (figure): Original image → Vectorisation → Binary encoding of junction points configuration → GA optimisation to find configuration with lowest cost → Adjustment of loop and near-junction knots
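The slides do not give the cost function or the details of the genetic algorithm, so the following is only a generic sketch of how a GA can search binary strings for a minimum-cost configuration. The function `ga_minimise`, its parameters, and `toy_cost` are hypothetical placeholders, not the authors' method.

```python
import random


def ga_minimise(cost, n_bits, pop_size=30, generations=50, p_mut=0.05, seed=0):
    """Search bit strings of length n_bits for a low value of cost(bits)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = min(pop, key=cost)

    def pick():
        # Binary tournament selection: the cheaper of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if cost(a) <= cost(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best configuration always survives
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_bits - 1) if n_bits > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = children
        best = min(pop, key=cost)
    return best, cost(best)


def toy_cost(bits):
    # Placeholder cost (assumption): in the real method the cost would score
    # how plausible a given configuration of branch interconnections is.
    return sum(bits)


best_bits, best_cost = ga_minimise(toy_cost, n_bits=8)
```

Because of elitism the best cost never increases from one generation to the next, but a GA gives no guarantee of reaching the global minimum.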

12

Vector skeletonisation method

1. Original image
2. Skeletal branches
3. Strokes with retraced segments and loops
4. Adjusted skeleton

13

Feature extraction: list of features

• Features extracted from both raster and vector skeletons

1. Height
2. Width
3. Height to width ratio
4. Distance HC
5. Distance TC
6. Distance TH
7. Angle between TH and TC
8. Slant of stem of t
9. Slant of stem of h
10. Position of t-bar
11. Connected/disconnected t and h
12. Average stroke width
13. Average pseudo-pressure
14. Standard deviation of average pseudo-pressure

• Features extracted from vector skeleton only

15. Standard deviation of stroke width
16. Number of strokes
17. Number of loops and retraced branches
18. Straightness of t-stem
19. Straightness of t-bar
20. Straightness of h-stem
21. Presence of loop at top of t-stem
22. Presence of loop at top of h-stem
23. Maximum curvature of h-knee
24. Average curvature of h-knee
25. Relative size (diameter) of h-knee
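As an illustration of the first three features, here is a minimal sketch. The slides do not state how height and width are measured; this example assumes the axis-aligned bounding box of the skeleton pixels, measured in pixels, and the function name is hypothetical.

```python
def bounding_box_features(pixels):
    """Return (height, width, height/width) for a list of (row, col) skeleton pixels.

    Assumption: the size is taken from the axis-aligned bounding box, in pixels.
    """
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return height, width, height / width


# Example with three skeleton pixels spanning rows 2..5 and columns 3..9.
height, width, ratio = bounding_box_features([(2, 3), (5, 3), (5, 9)])
```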

14

Feature extraction

• Position of t-bar is a binary feature: 1 if the t-bar crosses the stem, 0 if it only touches the stem, is separated from it, or is missing

• Size of h-knee is measured parallel to a horizontal line

• Pseudo-pressure is measured as the gray level normalised to 1.

• Straightness is measured as the ratio of the stroke length to the distance between its ends

Figure labels: t-stem, t-bar, h-stem, h-knee
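The straightness and pseudo-pressure definitions above can be sketched as follows. The stroke is assumed to be available as a polyline of (x, y) points sampled along the skeleton, and the gray levels are assumed to be 8-bit values (maximum 255); both are assumptions of this example, as are the function names.

```python
import math


def straightness(points):
    """Ratio of stroke length to the distance between its two ends.

    A perfectly straight stroke gives 1.0; larger values mean more curvature.
    """
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    chord = math.dist(points[0], points[-1])
    if chord == 0:
        raise ValueError("stroke ends coincide; straightness is undefined")
    return length / chord


def pseudo_pressure(gray_levels, max_level=255):
    """Normalise gray levels to the range 0..1 (assumes 8-bit input by default)."""
    return [g / max_level for g in gray_levels]


straight = straightness([(0, 0), (1, 0), (2, 0)])  # collinear points
bent = straightness([(0, 0), (3, 0), (3, 4)])      # length 7, end-to-end distance 5
pressure = pseudo_pressure([0, 255])
```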

15

Writer classification scheme

• Constructive ANN with spherical threshold units (DistAl) was used as classifier

• 100 samples of grapheme “th” drawn from 20 different writers

• 5-fold cross-validation method is used to evaluate classification accuracy

• Three experiments:
– Original feature set (features 1-14), features extracted using raster skeleton
– Original feature set, features extracted using vector skeleton
– Extended feature set (features 1-25), features extracted from vector skeleton

• Additionally, accuracy of feature extraction was measured
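The 5-fold cross-validation protocol can be sketched as an index-splitting routine. The DistAl classifier itself is not reproduced here; the function name, the shuffling, and the seed are assumptions of this example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for repeatability
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# With 100 samples, each of the 5 rounds trains on 80 and tests on 20.
splits = list(k_fold_indices(100, k=5))
```

The reported classification accuracy would then be the average of the accuracies measured on the five held-out folds.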

16

Results: accuracy of feature extraction

• Extraction software performed analysis of shape to detect various parts of character

• Analysis was performed step by step

• At each step some feature was extracted

• If at least one feature was not extracted or extracted incorrectly, the sample was counted as “failure”

Method    Accuracy, %
Raster    87
Vector    94

Processing flow (figure): Input: original image, binarised image, skeleton → Height, width, height to width ratio → Analysis of branches originating from top end points → Stem features → Search for t-bar → … → Feature vector
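The failure rule on this slide (a sample fails if any single feature is missing or wrong) can be written down as a short sketch. The data layout, one list of per-feature success flags per sample, and the toy values are assumptions of this example, not data from the study.

```python
def extraction_accuracy(results):
    """Percentage of samples for which every feature was extracted correctly.

    results: one list of booleans per sample, True where the feature is correct.
    """
    successes = sum(1 for flags in results if all(flags))
    return 100.0 * successes / len(results)


# Illustrative toy data: the second sample has one bad feature, so it fails.
toy_results = [[True, True], [True, False], [True, True], [True, True]]
accuracy = extraction_accuracy(toy_results)
```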

17

Results: accuracy of writer classification

Method                                    Writer classification accuracy, %
Original feature set + raster skeleton    73
Original feature set + vector skeleton    87
Extended feature set + vector skeleton    98

Conclusions

• Use of vector skeleton results in fewer feature extraction failures

• Use of vector skeleton produces higher writer classification accuracy even on the same feature set, which indicates that feature values are measured more accurately

• Vector skeletonisation enables extraction of more structural features, which, in turn, increases writer classification accuracy
