Estimating the reliability of the kNN classification

Maxim Tsypin and Heinrich Röder

Biodesix, Steamboat Springs, CO

1. kNN classification

2. Problems with kNN

3. Estimating the reliability of the kNN classification


k-Nearest Neighbor (kNN) classification

• Two classes

• Training set: N1 instances of class 1, N2 instances of class 2

• Each instance is characterized by d values and is represented by a point x = (x1, ..., xd) in d-dimensional space

• k nearest neighbors of the test instance:

k1 instances of class 1

k2 instances of class 2

[Figure: test instances A, B, and C among the training points in the (x1, x2) plane. For k = 5, the neighbor counts are A: 5:0, B: 3:2, C: 0:5.]
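The counting step described on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function names `neighbor_counts` and `classify`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
from math import dist


def neighbor_counts(train, labels, x, k):
    """Return (k1, k2): how many of the k nearest training points are in class 1 and class 2."""
    # Rank training indices by Euclidean distance to the test point x.
    order = sorted(range(len(train)), key=lambda i: dist(train[i], x))
    nearest = [labels[i] for i in order[:k]]
    k1 = nearest.count(1)
    return k1, k - k1  # labels are assumed to be 1 or 2 only


def classify(k1, k2):
    """Simple majority vote; k should be odd so that ties cannot occur."""
    return 1 if k1 > k2 else 2


if __name__ == "__main__":
    # Made-up toy data: five points per class in d = 2 dimensions.
    train = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
             (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
    labels = [1] * 5 + [2] * 5
    for x in [(0.4, 0.4), (2.8, 2.8), (5.6, 5.6)]:
        k1, k2 = neighbor_counts(train, labels, x, k=5)
        print(x, f"{k1}:{k2}", "-> class", classify(k1, k2))
```

The vote returns only a class label; the counts k1 and k2 themselves are discarded, even though they carry information about how clear-cut the decision was.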


Problems of simple kNN

• Works properly only when N1 = N2. Adding more instances of a given class to the training set biases the classification results in favor of that class.

• No information on the confidence of class assignment for the individual test instances. Intuitively, the confidence of class assignment in the 5:0 case should be greater than in the 3:2 case.



The question

• Two classes

• Training set: N1 instances of class 1, N2 instances of class 2

• k nearest neighbors of a given test instance:

k1 instances of class 1

k2 instances of class 2

k = k1 + k2

Given N1, N2, k1, k2, what is the probability that this test instance belongs to class 1?


The answer

Two derivations:

1) within the kernel density estimation framework: a fixed vicinity of the test instance determines the number of neighbors.

2) within the kNN framework: a fixed number of neighbors determines the size of the vicinity.

Both approaches lead to the same result:

$$P(\text{class 1}) = \frac{k_1 + 1}{k_1 + k_2 + 2}\; {}_2F_1\!\left(1,\; k_2 + 1;\; k_1 + k_2 + 3;\; 1 - \frac{N_1}{N_2}\right).$$

For N1 = N2, this simplifies to:

$$P(\text{class 1}) = \frac{k_1 + 1}{k_1 + k_2 + 2}, \qquad \frac{P(\text{class 1})}{P(\text{class 2})} = \frac{k_1 + 1}{k_2 + 1}.$$

• Quantifies the reliability of class assignment for each individual test instance, depending only on the (known) training set data.

• Properly accounts for complications arising when the numbers of training instances in the two classes are different, i.e. N1 ≠ N2.
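As a rough numerical check of the result on this slide, the sketch below evaluates P(class 1) from k1, k2, N1, and N2 in plain Python. Instead of calling a hypergeometric routine, it integrates Euler's integral representation of the Gauss hypergeometric function with Simpson's rule. That choice, the function name `p_class1`, and the `steps` parameter are assumptions made for this illustration and are not taken from the presentation.

```python
from math import comb


def p_class1(k1, k2, n1, n2, steps=2000):
    """Approximate P(class 1) for neighbor counts k1, k2 and class sizes n1, n2.

    Evaluates Euler's integral form of the hypergeometric expression,
        (1 / B(k1+1, k2+1)) * integral_0^1 u^k1 (1-u)^k2 * u / (u + r (1-u)) du,
    with r = n1 / n2, by composite Simpson integration (an illustrative choice).
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    r = n1 / n2
    # 1 / B(k1+1, k2+1) = (k1 + k2 + 1) * C(k1 + k2, k1)
    norm = (k1 + k2 + 1) * comb(k1 + k2, k1)

    def f(u):
        return u**k1 * (1.0 - u)**k2 * u / (u + r * (1.0 - u))

    h = 1.0 / steps
    total = f(0.0) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return norm * total * h / 3.0


if __name__ == "__main__":
    # Equal class sizes: should agree with (k1 + 1) / (k1 + k2 + 2).
    for k1, k2 in [(5, 0), (3, 2), (0, 5)]:
        print(f"{k1}:{k2}", round(p_class1(k1, k2, 100, 100), 3))
    # Unequal class sizes: class 1 is three times larger in the training set.
    print("3:2 with N1 = 3 * N2:", round(p_class1(3, 2, 300, 100), 3))
```

For equal class sizes the values reduce to 6/7, 4/7, and 1/7 for the 5:0, 3:2, and 0:5 cases. When class 1 is three times larger than class 2 in the training set, a 3:2 vote no longer favors class 1: the estimate falls below 1/2, which is the bias correction that a plain majority vote lacks.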
