
Page 1:

Nearest Neighbor and Locality-Sensitive Hashing

Yaniv Masler

IDC - 16.03.08

Tell me who your neighbors are, and I'll know who you are

Page 2:

Lecture Outline

• Variants of NN
• Motivation
• Algorithms:
  – Linear scan
  – Quad-trees
  – kd-trees
  – Locality-Sensitive Hashing
  – R-tree (and its variants)
  – VA-file
• Examples:
  – Colorization by Example
  – Medical pattern recognition
  – Handwritten digit recognition

Page 3:

Nearest Neighbor Search

• Given: a set P of n points in R^d

• Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

[Figure: query point q and its nearest neighbor p]

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 4:

Nearest Neighbor Search

Problem: what's the nearest restaurant to my hotel?

Page 5:

Near neighbor (range search)

Or: find all restaurants up to 400m from my hotel

Problem: find one/all points in P within distance r from q

Page 6:

Approximate Near neighbor

Or: find a restaurant that is near my hotel

Problem: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Page 7:

K-Nearest-Neighbor

Or: find the 4 closest restaurants to my hotel

Problem: find the K points nearest q

Page 8:

Spatial join

Or: find pairs of hotels and shopping malls that are at most 100m apart

Problem: given two sets P,Q, find all pairs p in P, q in Q, such that p is within distance r from q

Page 9:

Nearest Neighbour Rule

Non-parametric pattern classification. Consider a two-class problem where each sample consists of two measurements (x, y).

k = 1: for a given query point q, assign the class of the nearest neighbour.

k = 3: compute the k nearest neighbours and assign the class by majority vote.

http://www.robots.ox.ac.uk/~dclaus/cameraloc/samples/nearestneighbour.ppt
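A minimal sketch of the two rules above, in Python with NumPy (the function and data names here are illustrative, not from the slides):

    import numpy as np
    from collections import Counter

    def knn_classify(query, points, labels, k=1):
        """Assign a class to `query` by majority vote among its k nearest points."""
        dists = np.linalg.norm(points - query, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]                  # indices of the k closest points
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    # Two-class toy problem, each sample is two measurements (x, y):
    points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    labels = ["class A", "class A", "class B", "class B"]
    print(knn_classify(np.array([0.2, 0.1]), points, labels, k=1))  # -> class A
    print(knn_classify(np.array([0.8, 0.9]), points, labels, k=3))  # -> class B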

Page 10:

Motivation

The nearest neighbor search problem arises in numerous fields of application, including:

• Pattern recognition
• Statistical classification
• Computer vision
• Databases
• Coding theory
• Data compression
• Internet marketing
• DNA sequencing
• Spell checking
• Plagiarism detection
• Copyright violation detection

and many more.

Page 11:

Algorithms

• Main memory (Computational Geometry):
  – linear scan
  – tree-based: quadtree, kd-tree
  – hashing-based: Locality-Sensitive Hashing

• Secondary storage (Databases):
  – R-tree (and numerous variants)
  – Vector Approximation File (VA-file)

Page 12:

Linear scan (naïve approach)

• The simplest solution to the NNS problem.

• Compute the distance from the query point to every other point in the database, keeping track of the "best so far".

• This algorithm works for small databases but quickly becomes intractable as either the size or the dimensionality of the problem becomes large.

• Running time is O(Nd).
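As a sketch, the scan is a few lines of Python (illustrative names; assumes NumPy arrays):

    import numpy as np

    def linear_scan_nn(q, P):
        """O(Nd) scan: distance from q to every point, keeping the best so far."""
        best_idx, best_dist = -1, float("inf")
        for i, p in enumerate(P):
            d = np.linalg.norm(q - p)      # Euclidean distance
            if d < best_dist:              # new "best so far"
                best_idx, best_dist = i, d
        return best_idx, best_dist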

Page 13:

Quad-tree

A simple data structure: split the space into 2^d equal subsquares.

Repeat until done:
• only one pixel left
• only one point left
• only a few points left

Page 14:

Range search

• Near neighbor (range search):
  – put the root on the stack
  – repeat:
    • pop the next node T from the stack
    • for each child C of T:
      – if C is a leaf, examine the point(s) in C
      – if C intersects the ball of radius r around q, add C to the stack

Algorithms for Nearest Neighbor Search / Piotr Indyk
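A direct transcription of this loop, assuming a hypothetical node type with `children`, `points`, an `is_leaf()` test, and a `box_intersects_ball(q, r)` test (none of these names come from the slides):

    import math

    def range_search(root, q, r):
        """Return all points of the tree within distance r of q."""
        result, stack = [], [root]
        while stack:
            T = stack.pop()                           # pop the next node T
            for C in T.children:
                if C.is_leaf():                       # examine point(s) in C
                    result += [p for p in C.points if math.dist(p, q) <= r]
                elif C.box_intersects_ball(q, r):     # cell touches the query ball
                    stack.append(C)                   # add C to the stack
        return result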

Page 15:

Quad-tree - structure

[Figure: the split point (X1, Y1) divides the plane into four children: (P<X1, P<Y1), (P≥X1, P<Y1), (P<X1, P≥Y1), (P≥X1, P≥Y1).]

http://www.wisdom.weizmann.ac.il/~mica/CVspring06/presentations/Dan_Tomer.ppt

Page 16:

Quad-tree - Query

[Figure: a query on the same structure; the search descends into the quadrant containing the query point.]

http://www.wisdom.weizmann.ac.il/~mica/CVspring06/presentations/Dan_Tomer.ppt

Page 17:

Quad-tree

• Simple data structure

• What's the downside?

Page 18:

Quad-tree – Pitfall 1

[Figure: quad-tree subdivision around the split point (X1, Y1).]

http://www.wisdom.weizmann.ac.il/~mica/CVspring06/presentations/Dan_Tomer.ppt

Page 19:

Quad-tree – pitfall 2

X

Y

Running Time: O(2d)

Space and Time Exponential in dimensionshttp://www.wisdom.weizmann.ac.il/~mica/CVspring06/presentations/Dan_Tomer.ppt

Page 20:

Kd-trees [Bentley'75]

• Main ideas:
  – only one-dimensional splits
  – instead of splitting in the middle, choose the split "carefully" (many variations)
  – near(est) neighbor queries: as for quad-trees

Algorithms for Nearest Neighbor Search / Piotr Indyk
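A compact sketch of these ideas: one-dimensional splits that cycle through the coordinates, with the median as the "careful" split choice (a toy implementation, not the one from the lecture):

    import math

    def build_kdtree(points, depth=0):
        """One-dimensional split per level; the median is the split point."""
        if not points:
            return None
        axis = depth % len(points[0])                 # cycle through coordinates
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {"point": points[mid], "axis": axis,
                "left": build_kdtree(points[:mid], depth + 1),
                "right": build_kdtree(points[mid + 1:], depth + 1)}

    def nearest(node, q, best=None):
        """Descend toward q; enter the far subtree only if its splitting plane
        is closer than the best distance found so far."""
        if node is None:
            return best
        if best is None or math.dist(q, node["point"]) < math.dist(q, best):
            best = node["point"]
        axis = node["axis"]
        near, far = (("left", "right") if q[axis] < node["point"][axis]
                     else ("right", "left"))
        best = nearest(node[near], q, best)
        if abs(q[axis] - node["point"][axis]) < math.dist(q, best):
            best = nearest(node[far], q, best)        # backtrack across the split
        return best

    tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(nearest(tree, (9, 5)))                      # -> (9, 6)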

Page 21:

Kd-Trees Construction

[Figure: points 1–11 in the plane, partitioned by splitting lines l1–l10, and the corresponding binary tree with the lines as internal nodes and the points as leaves.]

http://www.wisdom.weizmann.ac.il/~deniss/vision_spring04/files/approx_nn_theory/Nearest_Neighbor_Theory.ppt

Page 22:

Kd-Trees Query

[Figure: the same kd-tree with a query point q; the search locates q's cell and then examines neighboring cells through the tree.]

http://www.wisdom.weizmann.ac.il/~deniss/vision_spring04/files/approx_nn_theory/Nearest_Neighbor_Theory.ppt

Page 23:

Kd-trees

• Advantages:
  – no (or fewer) empty spaces
  – only linear space

• Exponential query time is still possible.
  – However, if we don't do something really stupid, query time is at most O(dn).
  – This is still quite bad, though, when the dimension is around 20-30.

Page 24:

Approximate nearest neighbor

• Can be done with kd-trees, by interrupting the search earlier [Arya et al.'94].

• Basically: after each search step, check whether you are close enough; if so, stop.

• Not good for exact queries.

• What about a different approach: can we adapt hashing to nearest neighbor search?

Page 25:

Locality-Sensitive Hashing [Indyk-Motwani'98]

Key idea:

• Preprocessing: hash the data points using several LSH functions, so that the probability of collision is higher for closer objects.

• Querying: hash the query point and retrieve the elements in the buckets containing the query point.

Page 26:

Locality-Sensitive Hashing

• Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
  – Pr[h(p)=h(q)] is "high" if p is "close" to q
  – Pr[h(p)=h(q)] is "low" if p is "far" from q

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 27:

Do such functions exist?

• Consider the hypercube, i.e.:
  – points from {0,1}^d
  – Hamming distance D(p,q) = the number of positions on which p and q differ

• Define the hash function h by choosing a set S of k random coordinates, and setting

  h(p) = projection of p on S

[Photo: Richard Hamming]

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 28:

Example

Take d = 12, p = 010111001011, k = 3, S = {2, 5, 10}.

The hash function h() projects p onto the coordinates in S:

  p = 010111001011
  h(p) = 110

Store p in the matching bucket (one of the 2^k buckets).
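The projection hash is tiny in code; this sketch reproduces the example above (coordinates are 1-indexed here to match the slide):

    def h(p, S):
        """Project the bit-string p onto the coordinate set S (1-indexed)."""
        return "".join(p[i - 1] for i in sorted(S))

    p = "010111001011"      # d = 12
    S = {2, 5, 10}          # k = 3 random coordinates
    print(h(p, S))          # -> "110": the bucket (one of 2**3 = 8) that stores p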

Page 29:

h's are locality-sensitive

• Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k

• We can vary the probability by changing k.

[Plots: Pr[h(p)=h(q)] vs. distance for k=1 and k=2; a larger k makes the probability fall off faster.]

Algorithms for Nearest Neighbor Search / Piotr Indyk
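A quick numeric check of the formula (the values of d and D below are arbitrary examples):

    # Pr[h(p) = h(q)] = (1 - D(p, q)/d) ** k for the k-coordinate projection hash
    d = 12
    for k in (1, 2, 4):
        probs = [round((1 - D / d) ** k, 3) for D in (0, 2, 4, 8)]
        print(f"k={k}: {probs}")  # collisions get rarer with distance, faster for larger k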

Page 30:

How can we use LSH?

• Choose several functions h1, ..., hl.

• Initialize a hash array for each hi.

• Store each point p in the bucket hi(p) of the i-th hash array, i = 1...l.

• To answer a query q:
  – for each i = 1...l, retrieve the points in bucket hi(q)
  – return the closest point found

Algorithms for Nearest Neighbor Search / Piotr Indyk
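A sketch of this scheme for Hamming data, with l independent tables each keyed by a random k-coordinate projection (class and method names are illustrative; coordinates are 0-indexed here):

    import random
    from collections import defaultdict

    def hamming(p, q):
        return sum(a != b for a, b in zip(p, q))

    class LSHIndex:
        def __init__(self, d, k, l):
            # choose several h_1..h_l: each is a random set of k coordinates
            self.funcs = [random.sample(range(d), k) for _ in range(l)]
            self.tables = [defaultdict(list) for _ in range(l)]

        def _h(self, p, S):
            return "".join(p[i] for i in S)

        def insert(self, p):
            # store p in the bucket h_i(p) of the i-th hash array
            for S, table in zip(self.funcs, self.tables):
                table[self._h(p, S)].append(p)

        def query(self, q):
            # retrieve candidates from the buckets h_i(q), return the closest
            cands = {p for S, t in zip(self.funcs, self.tables)
                     for p in t.get(self._h(q, S), [])}
            return min(cands, key=lambda p: hamming(p, q), default=None)

    idx = LSHIndex(d=12, k=3, l=5)
    idx.insert("010111001011")
    print(idx.query("010111001111"))  # likely finds the stored point (probabilistic)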

Page 31:

LSH - Algorithm

[Figure: each point pi of the set P is hashed by h1, ..., hL into buckets of the hash tables T1, ..., TL.]

http://www.cmpe.boun.edu.tr/courses/cmpe521/fall2002/Similarty_Search_in_High_Dimensions_via_Hashing.ppt

Page 32:

What does this algorithm do?

• By a proper choice of the parameters k and l we can make, for any p, the probability that hi(p)=hi(q) for some i look like this:

[Plot: collision probability vs. distance, staying high up to some distance and then dropping off]

• We can control:
  – the position of the slope
  – how steep it is

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 33:

The LSH algorithm

• Therefore, we can solve (approximately) the near neighbor problem with a given parameter r.

• Worst-case analysis guarantees O(dn^(1/(1+ε))) query time.

• Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00].

• Drawbacks:
  – works best for Hamming distance (although it can be generalized to Euclidean space)
  – requires the radius r to be fixed in advance

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 34:

Secondary storage

• As mentioned on the Motivation slide, NN has many applications.

• Some of these applications store large datasets that need secondary storage.

Page 35:

Secondary storage

• Grouping the data is crucial.

• A different approach is required:
  – in main memory, any reduction in the number of inspected points was good
  – on disk, this is not the case!

Page 36:

Disk-based algorithms

• R-tree [Guttman'84]
  – the departing point for many variations
  – over 600 citations! (according to CiteSeer)
  – "optimistic" approach: try to answer queries in logarithmic time

• Vector Approximation File [WSB'98]
  – "pessimistic" approach: if we need to scan the whole data set, we had better do it fast

• LSH also works on disk

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 37:

R-tree

• "Bottom-up" approach (the kd-tree was "top-down"):
  – start with a set of points/rectangles
  – partition the set into groups of small cardinality
  – for each group, find the minimum rectangle containing the objects from this group
  – repeat

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 38:

R-tree

[Figure: nested minimum bounding rectangles]

Page 39:

R-tree

• Advantages:
  – supports near(est) neighbor search (similarly to before)
  – works for points and rectangles
  – avoids empty spaces
  – many variants: X-tree, SS-tree, SR-tree, etc.
  – works well for low dimensions

• Not so great for high dimensions

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 40:

VA-file [Weber, Schek, Blott'98]

• Approach:
  – In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves.
  – If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether.
  – 1 seek = the transfer of a few hundred KB.

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 41:

VA-file

• Natural question: how can we speed up the linear scan?

• Answer: use approximation.
  – Use only i bits per dimension (and speed up the scan by a factor of 32/i).
  – Identify all points which could be returned as an answer.
  – Verify those points using the original data set.

Algorithms for Nearest Neighbor Search / Piotr Indyk

Page 42:

VA-file

• Tile the d-dimensional data space uniformly into 2^b rectangular cells.

• b bits for each approximation.
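A sketch of the idea with NumPy, assuming data scaled to [0,1]: quantize each coordinate to b bits, filter with a lower bound computed from the cells, then verify the survivors against the full vectors (all names are illustrative):

    import numpy as np

    def build_va_file(X, b=4):
        """Approximate each coordinate of X (values in [0,1]) by a b-bit cell index."""
        return np.minimum((X * (1 << b)).astype(np.uint8), (1 << b) - 1)

    def va_range_query(q, X, va, b=4, r=0.3):
        cell = 1.0 / (1 << b)                       # width of one cell
        centers = va * cell + cell / 2
        # per-dimension lower bound on |x_i - q_i|: never overestimates,
        # so no true answer is filtered out
        lower = np.maximum(np.abs(centers - q) - cell / 2, 0.0)
        candidates = np.where(np.linalg.norm(lower, axis=1) <= r)[0]
        # verify the candidates against the original data set
        return [i for i in candidates if np.linalg.norm(X[i] - q) <= r]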

Page 43:
Page 44:

Where's Waldo?

Page 45:
Page 46:

Colorization by Example

R. Irony, D. Cohen-Or, D. Lischinski

Page 47:

Motivation

Colorization is the process of adding color to monochrome images and video.

Colorization typically involves segmentation plus tracking regions across frames. Neither can be done reliably, so user intervention is required, which is expensive and time consuming.

Colorization by example: no need for accurate segmentation or region tracking.

Page 48:

The method

• Colorize a grayscale image based on a user-provided reference.

[Figure: reference image]

Page 49:

Naïve Method

Transferring color to grayscale images [Welsh, Ashikhmin, Mueller 2002]:

• Find a good match between a pixel and its neighborhood in the grayscale image and in the reference image.

Page 50:

By Example Method

Page 51:

Overview

1. training
2. classification
3. color transfer

Page 52:

Training stage

Input:
1. the luminance channel of the reference image
2. the accompanying partial segmentation

Construct a low-dimensional feature space in which it is easy to discriminate between pixels belonging to differently labeled regions, based on a small (grayscale) neighborhood around each pixel.

Page 53:

Training stage

Create the feature space (get the DCT coefficients).

Page 54:

Classification stage

For each grayscale image pixel, determine which region should be used as a color reference for this pixel.

One way: the K-Nearest-Neighbor rule.

A better way: KNN in a discriminating subspace.

Page 55:

KNN in discriminating subspace

Originally, the sample point has a majority of magenta neighbors.

Page 56:

KNN in discriminating subspace

Rotate the axes in the direction of the intra-difference vector.

Page 57:

KNN in discriminating subspace

Project the points onto the axis of the inter-difference vector. The nearest neighbors are now cyan.

Page 58:

KNN Differences

[Figures: simple KNN vs. discriminating KNN; the discriminating KNN gives a classification that matches the regions.]

Use a median filter to create a cleaner classification.

Page 59:

Color transfer

Using the YUV color space.

Page 60:

Final Results

Page 61:

Medical Pattern Recognition

• Pattern recognition is the application of (statistical) techniques with the objective of classifying a set of objects into a number of distinct classes. Pattern recognition is applied in virtually all branches of science. Medical examples are as follows:

• Pattern recognition methods exploit the similarities of objects belonging to the same class and the dissimilarities of objects belonging to different classes.

Field          Objects            Objective
Cytology       Cells              Detection of carcinomas
Genetics       Chromosomes        Karyotyping
Cardiology     ECGs               Detection of coronary diseases
Neurology      EEGs               Detection of neurological conditions
Pharmacology   Drugs              Monitoring of medication
Diagnostics    Disease patterns   Computer-assisted decisions

HINF 2502 (Clinical Processes and Decision Making) © Hadi Kharrazi, Dalhousie University

Page 62:

Syntactic Pattern Recognition

• In syntactic or linguistic pattern recognition, objects are described as a set of primitives. A primitive is an elementary component of an object. The object is then recognized by the sequence in which the primitives appear in the object description.

• A simple example of a set of primitives is the Morse alphabet. The objects are the individual characters and the spaces between words. A grammar describes the sequence in which these primitives constitute the various characters.

• A medical example of syntactic pattern recognition is karyotyping, where similar chromosomes are to be grouped. In this case, the set of primitives describing a contour may be: {convexity (a), straight part (b), deep concavity (c), shallow concavity (d)}.

Ctn. Pattern Recognition

HINF 2502 (Clinical Processes and Decision Making) © Hadi Kharrazi, Dalhousie University

Page 63:

Syntactic Pattern Recognition

• A medical example of syntactic pattern recognition is karyotyping, where similar chromosomes are to be grouped.

• A karyotype is the characteristic chromosome complement of a eukaryote species (Wikipedia).

Page 64:

Ctn. Pattern Recognition

Syntactic description of a submedian and a median chromosome in terms of primitives.

HINF 2502 (Clinical Processes and Decision Making)© Hadi Kharrazi, Dalhousie University

Page 65:

Statistical Pattern Recognition

• In statistical pattern recognition, objects are described by numerical features. This method is categorized into supervised and unsupervised techniques.

• In supervised techniques, the number of distinct classes is known and a set of example objects is available. These objects are labeled with their class membership. The problem is to assign a new, unclassified object to one of the classes.

• In unsupervised techniques (such as clustering), a collection of observations is given and the problem is to establish whether these observations naturally divide into two or more different classes.

Ctn. Pattern Recognition

HINF 2502 (Clinical Processes and Decision Making)© Hadi Kharrazi, Dalhousie University

Page 66:

Supervised Pattern Recognition

• In supervised pattern recognition, class recognition is based on the differences between the statistical distributions of the features in the various classes. The development of supervised classification rules normally proceeds in two steps:

• Learning phase: the classification rule is designed on the basis of class properties, as derived from a collection of class-labeled objects called the design (training) set.

• Validation phase: another collection of class-labeled objects, called the test set, is classified by the rule obtained in the learning phase. The proportion of correct classifications obtained by the rule can thus be calculated.

Ctn. Pattern Recognition

HINF 2502 (Clinical Processes and Decision Making)© Hadi Kharrazi, Dalhousie University

Page 67:

• 1-Nearest-Neighbor Rule: in the simplest form, to classify an unknown object the nearest object from the learning set is identified. The unknown object is then assigned to the class to which its nearest neighbor belongs.

• q-Nearest-Neighbor Rule: rather than deciding on class membership on the basis of a single nearest neighbor, a quorum of q nearest neighbors is inspected. The class membership of the unknown object is then established on the basis of the majority of the class memberships of these q nearest neighbors.

• The problem with NN rules is that they are justifiable only with large learning sets, and this increases the computational time.

Ctn. Pattern Recognition

HINF 2502 (Clinical Processes and Decision Making)© Hadi Kharrazi, Dalhousie University

Page 68:

Ctn. Pattern Recognition

Illustration of nearest-neighbor classification. The learning set consists of objects belonging to three different classes: class 1 (blue), class 2 (red) and class 3 (black). Using one neighbor only, the 1-NN rule assigns the unknown object (yellow) to class 1. The 5-NN rule assigns the object to class 3, whereas the (5,4)-NN rule leaves the object unassigned.

HINF 2502 (Clinical Processes and Decision Making)© Hadi Kharrazi, Dalhousie University

Page 69:

Back To Computer Science

How do we use Nearest Neighbor for OCR?

Classification techniques for Hand-Written Digit Recognition

Venkat Raghavan N. S., Saneej B. C., and Karteek Popuri
Department of Chemical and Materials Engineering, University of Alberta, Canada.

Page 70:

Sample Data

• Normalize each sample character to a 16×16 grayscale image, with pixel values x_ij ∈ [0,1].

• We now have 256 pixels we can use as a character's feature vector: X_i = [x_i1, x_i2, ..., x_i256]

• Collect many samples (dataset size: n × 256).

[Figure: a 16×16 grayscale digit image]

Page 71:

Let's reduce dimensions

How? PCA: Principal Component Analysis
(we skipped this lecture)

Page 72:

Principal Components Analysis

The basic principle: PCA transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components.

The objective: discovering the "true dimension" of the data. It may be that p-dimensional data can be represented in q < p dimensions without losing much information.

Samples can be found at: http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html

Page 73:

Dimension reduction - PCA

• PCA is done on the mean-centered images.

• The larger an eigenvalue, the more important the corresponding eigendigit.

• Based on the eigenvalues, the first 64 PCs were found to be significant.

• Any image is now represented by its PCs: Y = [y1 y2 ... y64]

• Each sample now has 64 variables (dataset size: n × 64).

[Figures: the average digit image and the leading eigendigits.]
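A sketch of this reduction with NumPy: mean-center the n × 256 image matrix, take the top 64 right singular vectors as the eigendigits, and project (names are illustrative; the digit data itself is assumed to come from elsewhere):

    import numpy as np

    def pca_project(X, n_components=64):
        """X: n x 256 matrix of flattened 16x16 digits -> n x 64 projections."""
        mean = X.mean(axis=0)                     # the "average digit"
        Xc = X - mean                             # mean-centered images
        # rows of Vt are the eigendigits, ordered by decreasing singular value
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        basis = Vt[:n_components]
        return Xc @ basis.T, basis, mean          # Y = [y1 ... y64] per sample

    # A new image x is reduced the same way: y = (x - mean) @ basis.T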

Page 74:

Interpreting the PCs as Image Features

• Basically, the eigenvectors are a rotation of the original axes to more meaningful directions.

• The PCs are the projections of the data onto each of these new axes.

• This is similar to what we did in 'Colorization by Example'.

Page 75:

Nearest Neighbour Classifier

• No assumption about the distribution of the data.

• Euclidean distance is used to find the nearest neighbour.

• Find the nearest neighbour from the training set to the test image and assign its label to the test image.

[Figure: a test point between the two classes is assigned to Class 2.]

http://www.ualberta.ca/~slshah/files/Handwritten%20Digit%20Recognition.ppt

Page 76:

K-Nearest Neighbour Classifier (KNN)

• Compute the k nearest neighbours and assign the class by majority vote.

[Figure: k = 3; the test point is assigned to Class 1 (Class 1: 2 votes, Class 2: 1 vote).]

http://www.ualberta.ca/~slshah/files/Handwritten%20Digit%20Recognition.ppt
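Tying the pieces together, a hedged sketch of the digit pipeline: classify a PCA-reduced test digit by a k-nearest-neighbour vote in the 64-dimensional PC space (reuses the hypothetical pca_project from the earlier sketch):

    import numpy as np
    from collections import Counter

    def knn_digit(train_Y, train_labels, test_y, k=1):
        """Majority vote among the k nearest training digits in PC space."""
        dists = np.linalg.norm(train_Y - test_y, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

    # Usage (assuming X_train, labels, and a test image x_test exist):
    #   Y, basis, mean = pca_project(X_train, 64)
    #   digit = knn_digit(Y, labels, (x_test - mean) @ basis.T, k=1)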

Page 77:

1-NN Classification Results:

No. of PCs   256    150    64
AER %        7.09   7.01   6.45

Using 64 PCs gives better results.

Using higher k's does not show improvement in the recognition rate.

http://www.ualberta.ca/~slshah/files/Handwritten%20Digit%20Recognition.ppt

Page 78:

Misclassification in NN:

Confusion matrix (rows: actual digit; columns: recognised as):

          0     1     2     3     4     5     6     7     8     9
   0   1376     0     4     2     0     5    12     2     0     0
   1      0  1113     1     0     1     0     2     0     2     0
   2     22     9   728    17     4     4     6    16    18     2
   3      4     0     4   690     2    26     0     4     6     3
   4      3    15     9     0   687     0     7     2     4    32
   5      9     3    12    37     5   517    32     0    23     9
   6     10     3     5     0     3     2   714     0     3     2
   7      0     6     1     0    19     0     0   657     1    20
   8      8    11     1    26     7     7     8     5   547    13
   9      6     1     2     0    23     0     0    32     0   664

Euclidean distances between transformed images of same class can be very high

http://www.ualberta.ca/~slshah/files/Handwritten%20Digit%20Recognition.ppt

Page 79:

Issues in NN:

• Expensive: to determine the nearest neighbour of a test image, we must compute the distance to all N training examples.

• Storage requirements: we must store all the training data.

http://www.ualberta.ca/~slshah/files/Handwritten%20Digit%20Recognition.ppt