Learning and Using Taxonomies for Visual and Olfactory Category Recognition
Thesis by
Greg Griffin
In Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
California Institute of Technology
Pasadena, California
2013
(Defended March 2013)
© 2013
Greg Griffin
All Rights Reserved
For Leslie
Acknowledgements
Abstract
Dominant techniques for image classification over the last decade have mostly followed
a three-step process. In the first step, local features (such as SIFT) are extracted. In the
second step, some statistic is generated which compares the features in each new image to
the features in a set of training images. Finally, standard learning models (such as SVMs)
use these statistics to decide which class the image is most likely to belong to. Popular
models such as Spatial Pyramid Matching and Naive Bayes Nearest Neighbor adhere to
the same general linear 3-step process even while they differ in their choice of statistic and
learning model.
In this thesis I demonstrate a hierarchical approach which allows the above 3 steps to in-
teract more efficiently. Just as a child playing “20 questions” seeks to identify the unknown
object with the fewest number of queries, our learning model seeks to classify unlabeled
data with a minimal set of feature queries. The more costly feature extraction is, the more
beneficial this approach can be.
We apply this approach to three different applications: photography, microscopy and
olfaction. Image —. Microscope slide —. Finally, — both efficiency and accuracy. In
our tests, the resulting system functions well under real-world outdoor conditions and can
recognize a wide variety of real-world odors. It has the additional advantage of scaling well
to large numbers of categories for future experiments.
A small array of 10 conventional carbon black polymer composite sensors can be used
to classify a variety of common household odors under wide-ranging environmental con-
ditions. We test a total of 130 odorants of which 90 are common household items and
40 are from the UPSIT, a widely used scratch-and-sniff odor identification test. Airflow
over the sensors is modulated using a simple and inexpensive instrument which switches
between odorants and reference air over a broad range of computer-controlled frequencies.
Sensor data is then reduced to a feature which measures response over multiple harmonics
of the fundamental sniffing frequency. Using randomized sets of 4 odor categories, over-
all classification performance on the UPSIT exam is 73.8% indoors, 67.8% outdoors and
57.8% using indoor and outdoor datasets taken a month apart for training and testing. For
comparison, human performance on the same exam is 88-95% for subjects 10-59 years of
age. Using a variety of household odors the same testing procedure yields 91.3%, 72.4%
and 71.8% performance, respectively. We then examine how classification error changes
as a function of sniffing frequencies, number of sensors and number of odor categories.
Rapid sniffing outperforms low-frequency sniffing in outdoor environments, while addi-
tional sensors simultaneously reduce classification error and improve background noise
rejection. Moreover classifiers learned using data taken indoors with ordinary household
odorants seem to generalize well to data acquired in a more variable outdoor environment.
Finally we demonstrate a framework for automatically building olfactory hierarchies to
facilitate coarse-to-fine category detection.
Contents

Acknowledgements
Abstract
List of Figures

1 Introduction
  1.1 Visual Classification
  1.2 Olfactory Classification
  1.3 Moving Into The Wild: Face Recognition
  1.4 Hierarchical Approach

2 The Caltech 256
  2.1 Introduction
  2.2 Collection Procedure
    2.2.1 Image Relevance
    2.2.2 Categories
    2.2.3 Taxonomy
    2.2.4 Background
  2.3 Benchmarks
    2.3.1 Performance
    2.3.2 Localization and Segmentation
    2.3.3 Generality
    2.3.4 Background
  2.4 Results
    2.4.1 Size Classifier
    2.4.2 Correlation Classifier
    2.4.3 Spatial Pyramid Matching
    2.4.4 Generality
    2.4.5 Background
  2.5 Conclusion

3 Visual Hierarchies
  3.1 Introduction
  3.2 Experimental Setup
    3.2.1 Training and Testing Data
    3.2.2 Spatial Pyramid Matching
    3.2.3 Measuring Performance
    3.2.4 Hierarchical Approach
  3.3 Building Taxonomies
  3.4 Top-Down Classification Algorithm
  3.5 Results
  3.6 Conclusions

4 Pollen Counting
  4.1 Basic Method
  4.2 Results
  4.3 Comparison To Humans
  4.4 Conclusions

5 Machine Olfaction: Introduction

6 Machine Olfaction: Methods
  6.1 Instrument
  6.2 Sampling and Measurements
  6.3 Datasets and Environment

7 Machine Olfaction: Results
  7.1 Subsniff Frequency
  7.2 Number of Sensors
  7.3 Feature Performance
  7.4 Feature Consistency
    7.4.1 Top-Down Category Recognition

8 Machine Olfaction: Discussion

Bibliography
List of Figures
2.1 Examples of a 1, 2 and 3 rating for images downloaded using the keyword dice.

2.2 Summary of Caltech image datasets. There are actually 102 and 257 categories if the clutter categories in each set are included.

2.3 Distribution of image sizes as measured by √(width · height), and aspect ratios as measured by width/height. Some common image sizes and aspect ratios that are over-represented are labeled above the histograms. Overall in Caltech-256 the mean image size is 351 pixels while the mean aspect ratio is 1.17.

2.4 Histogram showing number of images per category. Caltech-101's largest categories faces-easy (435), motorbikes (798) and airplanes (800) are shared with Caltech-256. An additional large category t-shirt (358) has been added. The clutter categories for Caltech-101 (467) and 256 (827) are identified with arrows. This figure should be viewed in color.

2.5 Precision of images returned by Google. This is defined as the total number of images rated good divided by the total number of images downloaded (averaged over many categories). As more images are downloaded, it becomes progressively more difficult to gather large numbers of images per object category. For example, to gather 40 good images per category it is necessary to collect 120 images and discard 2/3 of them. To gather 160 good images, expect to collect about 640 images and discard 3/4 of them.

2.6 A taxonomy of Caltech-256 categories created by hand. At the top level these are divided into animate and inanimate objects. Green categories contain images that were borrowed from Caltech-101. A category is colored red if it overlaps with some other category (such as dog and greyhound).

2.7 Examples of clutter generated by cropping the photographs of Stephen Shore [Sho04, Sho05].

2.8 Performance of all 256 object categories using a typical pyramid match kernel [LSP06a] in a multi-class setting with Ntrain = 30. This performance corresponds to the diagonal entries of the confusion matrix, here sorted from largest to smallest. The ten best performing categories are shown in blue at the top left. The ten worst performing categories are shown in red at the bottom left. Vertical dashed lines indicate the mean performance.

2.9 The mean of all images in five randomly chosen categories, as compared to the mean clutter image. Four categories show some degree of concentration towards the center while refrigerator and clutter do not.

2.10 The 256 × 256 matrix M for the correlation classifier described in subsection 2.4.2. This is the mean of 10 separate confusion matrices generated for Ntrain = 30. A log scale is used to make it easier to see off-diagonal elements. For clarity we isolate the diagonal and row 82 (galaxy) and describe their meaning in Figure 2.11.

2.11 A more detailed look at the confusion matrix M from figure 2.10. Top: row 82 shows which categories were most likely to be confused with galaxy. These are: galaxy, saturn, fireworks, comet and mars (in order of greatest to least confusion). Bottom: the largest diagonal elements represent the categories that are easiest to classify with the correlation algorithm. These are: self-propelled-lawn-mower, motorbikes-101, trilobite-101, guitar-pick and saturn. All of these categories tend to have objects that are located consistently between images.

2.12 Performance as a function of Ntrain for Caltech-101 and Caltech-256 using the 3 algorithms discussed in the text. The spatial pyramid matching algorithm is that of Lazebnik, Schmid and Ponce [LSP06a]. We compare our own implementation with their published results, as well as the SVM-KNN approach of Zhang, Berg, Maire and Malik [ZBMM06a].

2.13 Selected rows and columns of the 256 × 256 confusion matrix M for spatial pyramid matching [LSP06a] and Ntrain = 30. Matrix elements containing 0.0 have been left blank. The first 6 categories are chosen because they are likely to be confounded with the last 6 categories. The main diagonal shows the performance for just these 12 categories. The diagonals of the other 2 quadrants show whether the algorithm can detect categories which are similar but not exact.

2.14 ROC curve for three different interest classifiers described in section 2.4.5. These classifiers are designed to focus the attention of the multi-category detectors benchmarked in Figure 3.2. Because Detector B is roughly 200 times faster than A or C, it represents the best tradeoff between performance and speed. This detector can accurately detect 38.2% of the interesting (non-clutter) images with a 0.1% rate of false detections. In other words, 1 in 1000 of the images classified as interesting will instead contain clutter (solid red line). If a 1 in 100 rate of false detections is acceptable, the accuracy increases to 58.6% (dashed red line).

2.15 In general the Caltech-256 images are more difficult to classify than the Caltech-101 images. Here we plot performance of the two datasets over a random mix of Ncategories from each dataset. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncategories = 100 the performance is ∼60% lower.

3.1 A typical one-vs-all multi-class classifier (top) exhaustively tests each image against every possible visual category, requiring Ncat decisions per image. This method does not scale well to hundreds or thousands of categories. Our hierarchical approach uses the training data to construct a taxonomy of categories which corresponds to a tree of classifiers (bottom). In principle each image can now be classified with as few as log2 Ncat decisions. The above example illustrates this for an unlabeled test image and Ncat = 8. The tree we actually employ has slightly more flexibility, as shown in Fig. 3.4.

3.2 Performance comparison between Caltech-101 and Caltech-256 datasets using the Spatial Pyramid Matching algorithm of Lazebnik et al. [LSP06b]. The performance of our implementation is almost identical to that reported by the original authors; any performance difference may be attributed to a denser grid used to sample SIFT features. This illustrates a standard non-hierarchical approach where authors mainly present the number of training examples and the classification performance, without also plotting classification speed.

3.3 In general the Caltech-256 [GHP06] images are more difficult to classify than the Caltech-101 images. Here we fix Ntrain = 30 and plot performance of the two datasets over a random mix of Ncat categories chosen from each dataset. The solid region represents a range of performance values for 10 randomized subsets. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncat = 100 the performance is ∼60% lower (dashed red line).

3.4 A simple hierarchical cascade of classifiers (limited to two levels and four categories for simplicity of illustration). We call A, B, C and D four sets of categories as illustrated in Fig 3.5. Each white square represents a binary branch classifier. Test images are fed into the top node of the tree where a classifier assigns them to either the set A ∪ B or the set C ∪ D (white square at the center-top). Depending on the classification, the image is further classified into either A or B, or C or D. Test images ultimately terminate in one of the 7 red octagonal nodes where a conventional multi-class node classifier makes the final decision. For a two-level ℓ = 2 tree, images terminate in one of the 4 lower octagonal nodes. If ℓ = 0 then all images terminate in the top octagonal node, which is equivalent to conventional non-hierarchical classification. The tree is not necessarily perfectly balanced: A, B, C and D may have different cardinality. Each branch or node classifier is trained exclusively on images extracted from the sets that the classifier is discriminating. See Sec. 3.4 for details.

3.5 Top-down grouping as described in Sec. 3.3. Our underlying assumption is that categories that are easily confused should be grouped together in order to build the branch classifiers in Fig 3.4. First we estimate a confusion matrix using the training set and a leave-one-out procedure. Shown here is the confusion matrix for Ntrain = 10, with diagonal elements removed to make the off-diagonal terms easier to see.

3.6 Taxonomy discovered automatically by the computer, using only a limited subset of Caltech-256 training images and their labels. Aside from these labels there is no other human supervision; branch membership is not hand-tuned in any way. The taxonomy is created by first generating a confusion matrix for Ntrain = 10 and recursively dividing it by spectral clustering. Branches and their categories are determined solely on the basis of the confusion between categories, which in turn is based on the feature-matching procedure of Spatial Pyramid Matching. To compare this with some recognizably human categories we color code all the insects (red), birds (yellow), land mammals (green) and aquatic mammals (blue). Notice that the computer's hierarchy usually begins with a split that puts all the plant and animal categories together in one branch. This split is found automatically with such consistency that in a third of all randomized training sets not a single category of living thing ends up on the opposite branch.

3.7 The taxonomy from Fig. 3.6 is reproduced here to illustrate how classification performance can be traded for classification speed. Node A represents an ordinary non-hierarchical one-vs-all classifier implemented using an SVM. This is accurate but slow because of the large combined set of support vectors in Ncats = 256 individual binary classifiers. At the other extreme, each test image passes through a series of inexpensive binary branch classifiers until it reaches 1 of the 256 leaves, collectively labeled C above. A compromise solution B invokes a finite set of branch classifiers prior to final multi-class classification in one of 7 terminal nodes.

3.8 Comparison of three different methods for generating taxonomies. For each taxonomy we vary the number of branch comparisons prior to final classification, as illustrated in Fig. 3.4. This results in a tradeoff between performance and speed as one moves between two extremes A and C. Randomly generated hierarchies result in poor cascade performance. Of the three methods, taxonomies based on Spectral Clustering yield marginally better performance. All three curves measure performance vs. speed for Ncat = 256 and Ntrain = 10.

3.9 Cascade performance / speed trade-off as a function of Ntrain. Values of Ntrain = 10 and Ntrain = 50 result in a 5-fold and 20-fold speed increase (respectively) for a fixed 10% performance drop.

4.1

6.1 A fan draws air from 1 of 4 odorant chambers or an empty reference chamber, depending on the state of the computer-controlled solenoid valve. The valve control signal can then be compared to the resistance changes recorded from an array of 10 individual sensors as shown in Fig. 6.2.

6.2 (a) A sniff consists of 7 individual subsniffs s1...s7 of sensor data taken as the valve switches between a single odorant and reference air. From this data a 7 × 4 = 28 size feature m is generated representing the measured power in each of the 7 subsniffs i over 4 fundamental harmonics j. For comparison purposes a simple amplitude feature differences the top and bottom 5% quartiles of ∆R/R in each subsniff. (b) As the switching frequency f increases by powers of 2 so does the number of pulses, so that the time period T is constant for all but the first subsniff. (c) To illustrate how m is measured we show the harmonic decomposition of just s4, highlighted in (a). The corresponding measurements m4j are the integrated spectral power for each of 4 harmonics. Higher-order harmonics suffer from attenuation due to the limited time-constant of the sensors but have the advantage of being less susceptible to slow signal drift. Fitting a 1/f^n noise spectrum to the average indoor and outdoor frequency response of our sensors in the absence of any odorants illustrates why higher-frequency switching and higher-order harmonics may be especially advantageous in outdoor environments.

6.3 Visual representation of the harmonic decomposition feature m for 2 wines, 2 lemon parts and 2 teas from the Common Household Odors Dataset. Each odorant is sampled 4 times on 2 different days in 2 separate environments. Each box represents one complete 400 s sniff reduced to a 280-dimensional feature vector. Within each box, the 10 rows (y axis) show the response of different sensors over 28 frequencies (x axis) corresponding to 7 subsniffs and 4 harmonics. For visual clarity the columns are sorted by frequency and rows are sorted so that adjacent sensors are maximally correlated.

7.1 Classification performance for the University of Pittsburgh Smell Identification Test (UPSIT) and Common Household Odors Dataset (CHOD) for different sniff subsets using 4 and 16 categories for training and testing. For control purposes data is also acquired with empty odorant chambers. Compared with using the entire sniff (top), the high-frequency subsniffs (2nd row) outperform the low-frequency subsniffs (bottom), especially for Ncat = 16. Dotted lines show performance at the level of random guessing.

7.2 Classification error for all three datasets taken indoors and outdoors while varying the number of sensors and the number of categories used for training and testing. Each dotted colored line represents the mean performance over randomized subsets of 2, 4, 6 and 8 sensors out of the available 10. To illustrate this for a single value of Ncat, gray vertical lines mark the error averaged over randomized sets of 16 odor categories for the indoor and outdoor datasets. When the number of sensors increased from 4 to 10, the indoor error (left line) decreased by < 2% for the CHOD and UPSIT while the outdoor error (right line) decreased by 4-7%. The Control error is also important because deviations from random chance when no odor categories are present may suggest sensitivity to environmental factors such as water vapor. Here the indoor error for both 4 and 10 sensors remains consistent with 93.75% random chance while the outdoor error rises from 85.9% to 91.7%.

7.3 Classification error using features based on sensor response amplitude and harmonic decomposition. For comparison we also show the UPSIT testing error [DSKa84] for human test subjects 10-59 years of age (who perform better than our instrument) and 70-79 years of age (who perform roughly the same). The combined Indoor/Outdoor dataset uses data taken indoors and outdoors as separate training and testing sets.

7.4 The confusion matrix for the Indoor Common Household Odor Dataset is used to automatically generate a top-down hierarchy of odor categories. Branches in the tree represent splits in the confusion matrix that minimize intercluster confusion. As the depth of the tree increases with successive splits, the categories in each branch become more and more difficult for the electronic nose to distinguish. The color of each branch node represents the classification performance when determining whether an odorant belongs to that branch. This helps characterize the instrument by showing which odor categories and super-categories are readily detectable and which are not. Highlighted categories show the relationships discovered between the wine, lemon and tea categories whose features are shown in Fig. 6.3. The occurrence of wine and citrus categories in the same top-level branch indicates that these odor categories are harder to distinguish from one another than they are from tea.
List of Tables
Chapter 1
Introduction
1.1 Visual Classification
1.2 Olfactory Classification
1.3 Moving Into The Wild: Face Recognition
1.4 Hierarchical Approach
Chapter 2
The Caltech 256
We introduce a challenging set of 256 object categories containing a total of 30607 images.
The original Caltech-101 [LFP04a] was collected by choosing a set of object categories,
downloading examples from Google Images and then manually screening out all images
that did not fit the category. Caltech-256 is collected in a similar manner with several
improvements: a) the number of categories is more than doubled, b) the minimum number
of images in any category is increased from 31 to 80, c) artifacts due to image rotation
are avoided and d) a new and larger clutter category is introduced for testing background
rejection. We suggest several testing paradigms to measure classification performance,
then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial
pyramid matching [LSP06a] algorithm. Finally we use the clutter category to train an
interest detector which rejects uninformative background regions.
2.1 Introduction
Figure 2.1: Examples of a 1, 2 and 3 rating for images downloaded using the keyword dice.

Recent years have seen an explosion of work in the area of object recognition [LFP04a,
LSP06a, ZBMM06a, ML06a, Fer05, ABM04]. Several datasets have emerged as stan-
dards for the community, including the Coil [NNM96], MIT-CSAIL [TMF04], PASCAL
VOC [Eea06], Caltech-6 and Caltech-101 [LFP04a] and Graz [OP05] datasets. These
datasets have become progressively more challenging as existing algorithms consistently
saturated performance. The Coil set contains objects placed on a black background with
no clutter. The Caltech-6 consists of 3738 images of cars, motorcycles, airplanes, faces
and leaves. The Caltech-101 is similar in spirit to the Caltech-6 but has many more object
categories, as well as hand-clicked silhouettes of each object. The MIT-CSAIL database
contains more than 77,000 objects labeled within 23,000 images that are shown in a variety
of environments. The number of labeled objects, object categories and region categories
increases over time thanks to a publicly available LabelMe [RTMF05] annotation tool. The
PASCAL VOC 2006 database contains 5,304 images where 10 categories are fully anno-
tated. Finally, the Graz set contains three object categories in difficult viewing conditions.
These and other standardized sets of categories allow users to compare the performance of
their algorithms in a consistent manner.
Here we introduce the Caltech-256. Each category has a minimum of 80 images (com-
pared to the Caltech-101 where some classes have as few as 31 images). In addition we do
not left-right align the object categories as was done with the Caltech-101, resulting in a
more formidable set of categories.
Because Caltech-256 images are harvested from two popular online image databases,
they represent a diverse set of lighting conditions, poses, backgrounds, image sizes and
camera systematics. The categories were hand-picked by the authors to represent a wide
variety of natural and artificial objects in various settings. The organization is simple and
the images are ready to use, without the need for cropping or other processing. In most
cases the object of interest is prominent with a small or medium degree of background
clutter.
Dataset        Released   Categories   Total Images   Images Per Category
                                                      Min    Med    Mean   Max
Caltech-101    2003       102          9144           31     59     90     800
Caltech-256    2006       257          30607          80     100    119    827

Figure 2.2: Summary of Caltech image datasets. There are actually 102 and 257 categories if the clutter categories in each set are included.
In Section 2.2 we describe the collection procedures for the dataset. In Section 2.3
we give paradigms for testing recognition algorithms, including the use of the background
clutter class. Example experiments are provided in Section 2.4. Finally in Section 2.5 we
conclude with a general discussion of advantages and disadvantages of the set.
2.2 Collection Procedure
The object categories were assembled in a similar manner to the Caltech-101. A small
group of vision dataset users were asked to supply the names of roughly 300 object
categories. Images from each category were downloaded from both Google and PicSearch
using scripts. We required that the minimum size in either aspect be 100 pixels, with no
upper limit. Typically this procedure resulted in about 400-600 images from each category.
Duplicates were removed by detecting images which contained over 15 similar SIFT
descriptors [Low04].

Figure 2.3: Distribution of image sizes as measured by √(width · height), and aspect ratios as measured by width/height. Some common image sizes and aspect ratios that are over-represented are labeled above the histograms. Overall in Caltech-256 the mean image size is 351 pixels while the mean aspect ratio is 1.17.
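The duplicate check just described can be sketched as follows. This is an illustrative reconstruction rather than the code used to build the dataset: OpenCV's SIFT stands in for [Low04], the >15-descriptor threshold comes from the text, and the 0.75 ratio-test value is an assumed default.

```python
# Sketch (not the thesis code) of SIFT-based duplicate detection.
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

def is_duplicate(path_a, path_b, min_matches=15, ratio=0.75):
    da, db = descriptors(path_a), descriptors(path_b)
    if da is None or db is None:
        return False
    good = 0
    for pair in matcher.knnMatch(da, db, k=2):
        # Lowe's ratio test: keep a match only if it clearly beats
        # the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good > min_matches
```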
The images obtained were of varying quality. We asked 4 different subjects to rate these
images using the following criteria:
1. Good: A clear example of the visual category
2. Bad: A confusing, occluded, cluttered or artistic example
3. Not Applicable: Not an example of the object category
Sorters were instructed to label the image bad if either: (1) the image was very cluttered,
(2) the image was a line drawing, (3) the image was an abstract artistic representation, or
(4) the object within the image occupied only a small fraction of the image. If the image
contained no examples of the visual category it was labeled not applicable. Examples of
each of the 3 ratings are shown in Figure 2.1.

Figure 2.4: Histogram showing number of images per category. Caltech-101's largest categories faces-easy (435), motorbikes (798) and airplanes (800) are shared with Caltech-256. An additional large category t-shirt (358) has been added. The clutter categories for Caltech-101 (467) and 256 (827) are identified with arrows. This figure should be viewed in color.
The final set of images included in Caltech-256 are the ones that passed our size and
duplicate checks and were also rated good. Out of 304 original categories 48 had fewer
than 80 good images and were dropped, leaving 256 categories. Figure 2.3 shows the
distribution of the sizes of these final images.

In Caltech-101, categories such as minaret had a large number of images that were ar-
tificially rotated, resulting in large black borders around the image. This rotation created
artifacts which certain recognition systems exploited, resulting in deceptively high perfor-
mance and making such categories artificially easy to identify. We have not introduced
such artifacts into this set, instead collecting an entirely new minaret category which was
not artificially rotated.
In addition we did not consistently right-left align the object categories as was done in
Caltech-101. For example airplanes may now be facing in either the left or right direction.
This gives a better idea of what categorization performance would be like under realistic
conditions, unlike the Caltech-101 airplanes which are all facing right.
2.2.1 Image Relevance
We compiled statistics on the downloaded images to examine the typical yield of good
images. Figure 2.5 summarizes the results for images returned by Google. As expected,
the relevance of the images decreases as more images are returned. Some categories return
more pertinent results than others. In particular, certain categories carry dual semantic
meanings. For example the category pawn yields both the chess piece and also images of
pawn shops. The category egg is too ambiguous, because it yields images of whole eggs,
egg yolks, Faberge eggs, etc. which are not in the same visual category. These ambiguities
were often removed with a more specific keyword search, such as fried-egg.
When using Google images alone, 25.6% of the images downloaded were found to be
good. To increase the precision of image downloading we augmented the Google search
with PicSearch.
Since both search engines return largely non-overlapping sets of images, the overall
precision for the initial set of downloaded images increased, as both returned a high fraction
of good images initially. Now 44.4% of the images were usable. The true overall precision
was slightly lower as there was some overlap between the Google and PicSearch images.
A total of 9104 good images were gathered from PicSearch and 20677 from Google, out
of a total of 92652 downloaded images. Thus the overall sorting efficiency was 32.1%.

Figure 2.5: Precision of images returned by Google. This is defined as the total number of images rated good divided by the total number of images downloaded (averaged over many categories). As more images are downloaded, it becomes progressively more difficult to gather large numbers of images per object category. For example, to gather 40 good images per category it is necessary to collect 120 images and discard 2/3 of them. To gather 160 good images, expect to collect about 640 images and discard 3/4 of them.
2.2.2 Categories
The category numbering provides some insight into which categories are similar to an
existing category. Categories C1...C250 are relatively independent of one another, whereas
categories C251...C256 are closely related to other categories. These are airplane-101, car-
side-101, faces-easy-101, greyhound, tennis-shoe and toad, which are closely related to
fighter-jet, car-tire, people, dog, sneaker and frog respectively. We felt these 6 category
pairs would be the most likely to be confounded with one another, so it would be best to
remove one of each pair from the confusion matrix, at least for the standard benchmarking
procedure¹.
2.2.3 Taxonomy
Figure 2.6 shows a taxonomy of the final categories, grouped by animate and inanimate
and other finer distinctions. This taxonomy was compiled by the authors and is somewhat
arbitrary; other equally valid hierarchies can be constructed. The largest 30 categories from
Caltech-101 (shown in green) were included in Caltech-256, with additional images added
as needed to boost the number of images in each category to at least 80. Animate objects
- 69 categories in all - tend to be more cluttered than the inanimate objects, and harder to
identify. A total of 12 categories are marked in red to denote a possible relation with some
other visual category.
2.2.4 Background
Category C257 is clutter². For several reasons (see subsection 2.3.4) it is useful to have
such a background category, but the exact nature of this category will vary from set to set.
Different backgrounds may be appropriate for different applications, and the statistics of a
given background category can affect the performance of the classifier [HWP05].

¹While horseshoe-crab may seem to be a specific case of crab, the images themselves involve two entirely different sub-phyla of Arthropoda, which have clear differences in morphology. We find these easy to tell apart whereas frog and toad differences can be more subtle (none of our sorters were herpetologists). Likewise we feel that knife and swiss-army-knife are not confounding, even though they share some characteristics such as blades.

²For purposes here we will use the terms background and clutter interchangeably to indicate the absence or near-absence of any object categories.

Figure 2.6: A taxonomy of Caltech-256 categories created by hand. At the top level these are divided into animate and inanimate objects. Green categories contain images that were borrowed from Caltech-101. A category is colored red if it overlaps with some other category (such as dog and greyhound).

Figure 2.7: Examples of clutter generated by cropping the photographs of Stephen Shore [Sho04, Sho05].

For instance Caltech-6 contains a background set which consists of random pictures
taken around Caltech. The image statistics are no doubt biased by their specific choice
of location. The Caltech-101 contains a set of background images obtained by typing the
keyword “things” into Google. This can turn up a wide variety of objects not in Caltech-
101. However these images may or may not contain objects of interest that the user would
wish to classify.
Here we choose a different approach. The clutter category in Caltech-256 is derived
by cropping 947 images from the pictures of photographer Stephen Shore [Sho04, Sho05].
Images were cropped such that the final image sizes in the clutter category are representa-
tive of the distribution of image sizes found in all the other categories (figure 2.3). Those
cropped images which contained Caltech-256 categories (such as people and cars) were
manually removed, with a total of 827 clutter images remaining. Examples are shown in
Figure 2.7.
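The cropping step might look like the following sketch, assuming a list of (width, height) pairs sampled from the dataset's empirical size distribution; the function and argument names are illustrative, not taken from the thesis code.

```python
# Sketch of clutter generation: crop a region from a source photograph
# whose size is drawn from the size distribution of the other categories
# (figure 2.3).
import random
from PIL import Image

def make_clutter_crop(photo_path, category_sizes):
    photo = Image.open(photo_path)
    # Draw a (width, height) from the empirical dataset distribution.
    w, h = random.choice(category_sizes)
    if photo.width < w or photo.height < h:
        return None  # source photo too small for this draw
    left = random.randint(0, photo.width - w)
    top = random.randint(0, photo.height - h)
    return photo.crop((left, top, left + w, top + h))
```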
Figure 2.8: Performance of all 256 object categories using a typical pyramid match kernel [LSP06a] in a multi-class setting with Ntrain = 30. This performance corresponds to the diagonal entries of the confusion matrix, here sorted from largest to smallest. The ten best performing categories are shown in blue at the top left. The ten worst performing categories are shown in red at the bottom left. Vertical dashed lines indicate the mean performance.
We feel that this is an improvement over our previous clutter categories, since the im-
ages contain clutter in a variety of indoor and outdoor scenes. However it is still far from
perfect. For example some visual categories such as grass, brick and clouds appear to be
over-represented.
2.3 Benchmarks
Previous datasets suffered from non-standard testing and training paradigms, making di-
rect comparisons of certain algorithms difficult. For instance, results reported by Grau-
man [GD05a] and Berg [BBM05] were not directly comparable, as Berg used only 15
training examples while Grauman used 30³. Some authors used the same number of test
examples for each category, while others did not. This can be confusing if the results are
not normalized in a consistent way. For consistent comparisons between different
classification algorithms, it is useful to adopt standardized training and testing procedures.
2.3.1 Performance
First we select Ntrain and Ntest images from each class to train and test the classifier. Specif-
ically Ntrain = 5, 10, 15, 20, 25, 30, 40 and Ntest = 25.

Each test image is assigned to a particular class by the classifier. Performance of each
class C can be measured by determining the fraction of test examples for class C which
are correctly classified as belonging to class C. The cumulative performance is calculated
by counting the total number of correctly classified test images Ntest within each of Nclass
classes. It is of course important to weight each class equally in this metric. The easiest
way to guarantee this is to use the same number of test images for each class. Finally, better
statistics are obtained by averaging the above procedure multiple times (ideally at least 10
times) to reduce uncertainty.

³It should be noted that Grauman achieved results surpassing those of Berg in experiments conducted later.
The exact value of Ntest is not important. For Caltech-101 values higher than Ntrain =
30 are impossible since some categories contain only 31 images. However Caltech-256 has
at least 80 images in all categories. Even a training set size of Ntrain = 75 leaves Ntest ≥ 5
available for testing in all categories.
The confusion matrix Mij illustrates classification performance. It is a table where
each element i, j stores the fraction of the test images from category Ci that were classified
as belonging to Cj. Note that perfect classification would result in a table with ones along
the main diagonal. Even if such a classification method existed, this ideal performance
would not be reached for several reasons. Images in most categories contain instances of
other categories, which is a built-in source of confusion. Also our sorting procedure is
never perfect; there are bound to be some small fraction of incorrectly classified images in
a dataset of this size.
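A compact sketch of this protocol is given below: equal Ntest per class, a confusion matrix M whose rows sum to 1, and performance reported as the mean of the diagonal averaged over several randomized splits. The stand-in train_fn (any of the classifiers in this chapter) is assumed to return a model with a predict method; it is not defined here.

```python
# Sketch of the suggested benchmarking protocol, not the thesis code.
import numpy as np

def benchmark(train_fn, images_by_class, n_train=30, n_test=25, n_trials=10):
    n_class = len(images_by_class)
    scores = []
    for _ in range(n_trials):
        train_sets, test_sets = [], []
        for images in images_by_class:
            order = np.random.permutation(len(images))
            train_sets.append([images[k] for k in order[:n_train]])
            test_sets.append([images[k] for k in order[n_train:n_train + n_test]])
        model = train_fn(train_sets)          # model.predict(img) -> class index
        M = np.zeros((n_class, n_class))
        for i, tests in enumerate(test_sets):
            for img in tests:
                M[i, model.predict(img)] += 1.0 / len(tests)  # row i sums to 1
        scores.append(np.diag(M).mean())      # mean of the diagonal
    return np.mean(scores), np.std(scores)
```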
Since the last 6 categories are redundant with existing categories, and clutter indicates
the absence of any category, one might argue that only categories C1...C250 are appropriate
for generating performance benchmarks. Another justification for removing these last 6
categories when measuring overall performance is that they are among the easiest to iden-
tify. Thus removing them makes the detection task more challenging⁴.

⁴As shown in figure 2.13, categories C251, C252 and C253 each yield performance above 90%.

Figure 2.9: The mean of all images in five randomly chosen categories, as compared to the mean clutter image. Four categories show some degree of concentration towards the center while refrigerator and clutter do not.

However for better clarity and consistency, we suggest that authors remove only the
clutter category, generate a 256x256 confusion matrix with the remaining categories, and
report their performance results directly from the diagonal of this matrix⁵. It is also useful
for authors to post the confusion matrix itself, not just the mean of the diagonal.
2.3.2 Localization and Segmentation
Both Caltech-101 and the Caltech-256 contain categories in which the object may tend to
be centered (Figure 2.9). Thus, neither set is appropriate for localization experiments, in
which the algorithm must not only identify what object is present in the image but also
where the object is.
Furthermore we have not manually annotated the images in Caltech-256 so there is
presently no ground truth for testing segmentation algorithms.
⁵The difference in performance between the 250x250 and 256x256 matrix is typically less than a percent.
2.3.3 Generality
Why not remove the last 6 categories from the dataset altogether? Closely related categories
can provide useful information that is not captured by the standard performance metric. Is
a certain greyhound classifier also good at identifying dog, or does it only detect specific
breeds? Does a sneaker detector also detect images from tennis-shoe, a word which means
essentially the same thing? If it does not, one might worry that the algorithm is over-
training on specific features of the dataset which do not generalize to visual categories in
the real world.

For this reason we plot rows 251...256 of the confusion matrix along with the categories
which are most similar to these, and discuss the results in section 2.4.4.
2.3.4 Background
Consider the example of a Mars rover that moves around in its environment while taking
pictures. Raw performance only tells us the accuracy with which objects are identified. Just
as important is the ability to identify where there is an object of interest and where there
is only uninteresting background. The rover cannot begin to understand its environment if
background is constantly misidentified as an object.

The rover example also illustrates how the meaning of the word background is strongly
dependent on the environment and the application. Our choice of background images for
Caltech-256, as described in 2.2.4, is meant to reflect a variety of common (terrestrial)
environments.
Here we generate an ROC curve that tests the ability of the classification algorithm
to identify regions of interest. An ROC curve plots the rate of true positives against the
rate of false positives. In single-category detection the meaning of true positive and false
positive is unambiguous. Imagine that a search window of varied size scans across an
image employing some sort of bird classifier. Each true positive marks a successful
detection of a bird inside the scan window while each false positive indicates an erroneous
detection.

What do positive and negative mean in the context of multi-class classification? Con-
sider a two-step process in which each search window is evaluated by a cascade [VJ01a]
of two classifiers. The first classifier is an interest detector that decides whether a given
window contains an object category or background. Background regions are discarded to
save time, while all other images are passed to the second classifier. This more expensive
multi-class classifier now attempts to identify which of the remaining 256 object categories
best matches the region as described in 2.3.1.

Our ROC curve measures the performance of several interest classifiers. A false posi-
tive is any clutter image which is misclassified as containing an object of interest. Likewise
true positive refers to an object of interest that is correctly identified. Here “object of inter-
est” means any classification besides clutter.
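Given per-image SVM decision values, the curve itself can be traced as in the following sketch; the input array names are illustrative, and this is a generic threshold sweep rather than our exact procedure.

```python
# Sketch of the ROC computation for an interest detector: sweep the margin
# threshold, with positives = non-clutter ("interesting") images and
# negatives = clutter images.
import numpy as np

def interest_roc(margins, is_clutter):
    order = np.argsort(-np.asarray(margins))        # most confident first
    clutter = np.asarray(is_clutter, dtype=bool)[order]
    # Cumulative rates as the threshold is lowered past each example.
    false_pos = np.cumsum(clutter) / clutter.sum()       # clutter flagged interesting
    true_pos = np.cumsum(~clutter) / (~clutter).sum()    # objects correctly flagged
    return false_pos, true_pos
```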
2.4 Results
In this section we describe two simple classification algorithms as well as the more so-
phisticated spatial pyramid matching algorithm of Lazebnik, Schmid and Ponce [LSP06a].
Performance, generality and background rejection benchmarks are presented as examples
for discussion.
Figure 2.10: The 256 × 256 matrix M for the correlation classifier described in subsection 2.4.2. This is the mean of 10 separate confusion matrices generated for Ntrain = 30. A log scale is used to make it easier to see off-diagonal elements. For clarity we isolate the diagonal and row 82 (galaxy) and describe their meaning in Figure 2.11.
2.4.1 Size Classifier
Our first classifier used only the width and height of each image as features. During the
training phase, the width and height of all 256 · Ntrain images are stored in a 2-dimensional
space. Each test image is classified in a KNN fashion by voting among the 10 nearest neigh-
bors to each image. The 1-norm Manhattan distance yields slightly better performance than
the 2-norm Euclidean distance. As shown in Figure 3.2, this algorithm identifies the correct
category for an image 3.7 ± 0.6% of the time when Ntrain = 30.
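A minimal sketch of this baseline follows, with assumed array inputs (an (N, 2) array of training widths and heights, and integer class labels):

```python
# Sketch of the size-only baseline: a 10-nearest-neighbor vote in the
# 2-D (width, height) space under the 1-norm, as described above.
import numpy as np

def classify_by_size(test_wh, train_wh, train_labels, k=10):
    # Manhattan (1-norm) distance between the test image's (width, height)
    # and every training image's (width, height).
    dist = np.abs(np.asarray(train_wh) - np.asarray(test_wh)).sum(axis=1)
    votes = np.asarray(train_labels)[np.argsort(dist)[:k]]
    return np.bincount(votes).argmax()   # majority vote among the k neighbors
```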
Although identifying the correct object category 3.7% of the time seems like paltry per-
formance, we note that baseline (random guessing) would result in a performance of only
about 0.4% (1 in 256). This illustrates a danger inherent in many recognition datasets: the algorithm
Figure 2.11: A more detailed look at the confusion matrix M from figure 2.10. Top: row 82 shows which categories were most likely to be confused with galaxy. These are: galaxy, saturn, fireworks, comet and mars (in order of greatest to least confusion). Bottom: the largest diagonal elements represent the categories that are easiest to classify with the correlation algorithm. These are: self-propelled-lawn-mower, motorbikes-101, trilobite-101, guitar-pick and saturn. All of these categories tend to have objects that are located consistently between images.
can learn on ancillary features of the dataset instead of features intrinsic to the object cate-
gories. Such an algorithm will fail to identify categories if the images come from another
dataset with different statistics.
2.4.2 Correlation Classifier
The next classifier we employed was a correlation-based classifier. All images were resized
to Ndim × Ndim, desaturated and normalized to zero mean and unit variance. The nearest
neighbor was computed in the Ndim²-dimensional space of pixel intensities. This is
equivalent to finding the training image that correlates best with the test image, since

⟨(X − Y)²⟩ = ⟨X²⟩ + ⟨Y²⟩ − 2⟨XY⟩ = 2 − 2⟨XY⟩

for images X, Y normalized in this way, so minimizing the distance maximizes the
correlation ⟨XY⟩. Again we use the 1-norm instead of the 2-norm because it is faster to
compute and yields better classification performance.

Performance of 7.6 ± 0.7% at Ntrain = 30 is computed by taking the mean of the
diagonal of the confusion matrix in Figure 2.10.
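A sketch of this classifier is shown below, under the assumption that images have already been resized to Ndim × Ndim grayscale arrays; the helper names are illustrative.

```python
# Sketch of the correlation classifier: normalize each image, then take
# the nearest training image. The text uses the 1-norm in practice; with
# normalized vectors, the 2-norm distance would be equivalent (up to a
# constant) to maximizing the correlation <XY>.
import numpy as np

def normalize(img):
    # img: 2-D grayscale array, already resized to Ndim x Ndim.
    v = img.astype(float).ravel()
    return (v - v.mean()) / v.std()     # zero mean, unit variance

def classify_by_correlation(test_vec, train_vecs, train_labels):
    dist = np.abs(train_vecs - test_vec).sum(axis=1)   # 1-norm distance
    return train_labels[int(np.argmin(dist))]
```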
2.4.3 Spatial Pyramid Matching
As a final test we re-implement the spatial pyramid matching algorithm of Lazebnik, Schmid
and Ponce [LSP06a] as faithfully as possible. In this procedure an SVM kernel is generated
from matching scores between a set of training images. Their published Caltech-101
performance at Ntrain = 30 was 64.6 ± 0.8%. Our own performance is practically the same.
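The matching score can be sketched as below, using the standard pyramid weights from the literature; the feature format (normalized (x, y, visual word) triples), vocabulary size and pyramid depth are assumptions, and this is not our actual implementation.

```python
# Sketch of a spatial pyramid matching score in the spirit of [LSP06a]:
# histogram intersection of visual-word histograms over a pyramid of
# spatial grids.
import numpy as np

def pyramid_histograms(features, n_words=200, n_levels=3):
    # features: iterable of (x, y, word) with x, y in [0, 1).
    hists = []
    for level in range(n_levels):
        cells = 2 ** level
        h = np.zeros((cells, cells, n_words))
        for x, y, w in features:
            h[int(y * cells), int(x * cells), w] += 1
        hists.append(h.ravel())
    return hists

def spm_score(ha, hb):
    L = len(ha) - 1
    # Coarsest level weighted 1/2^L; level l >= 1 weighted 1/2^(L-l+1),
    # so finer grids count more.
    score = np.minimum(ha[0], hb[0]).sum() / 2 ** L
    for l in range(1, L + 1):
        score += np.minimum(ha[l], hb[l]).sum() / 2 ** (L - l + 1)
    return score
```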
Figure 2.12: Performance as a function of Ntrain for Caltech-101 and Caltech-256 using the 3 algorithms discussed in the text. The spatial pyramid matching algorithm is that of Lazebnik, Schmid and Ponce [LSP06a]. We compare our own implementation with their published results, as well as the SVM-KNN approach of Zhang, Berg, Maire and Malik [ZBMM06a].
As shown in Figure 3.2, performance on Caltech-256 is roughly half the performance
achieved on Caltech-101. For example at Ntrain = 30 our Caltech-101 and Caltech-256
performance are 67.6 ± 1.4% and 34.1 ± 0.2% respectively.
2.4.4 Generality
Figure 2.13 shows the confusion between six categories and their six confounding cate-
gories. We define the generality as the mean of the off-quadrant diagonals divided by the
mean of the main diagonal. In this case, for Ntrain = 30, the generality is g = 0.145.
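Computed directly from the 12 × 12 block of Figure 2.13 (the 6 categories followed by their 6 confounders), g might be sketched as follows; the function name is illustrative.

```python
# Sketch of the generality measure g: mean of the two off-quadrant
# diagonals divided by the mean of the main diagonal.
import numpy as np

def generality(M12):
    upper = np.diag(M12[:6, 6:])     # e.g. fighter-jet classified as airplanes
    lower = np.diag(M12[6:, :6])     # e.g. airplanes classified as fighter-jet
    off = np.concatenate([upper, lower]).mean()
    return off / np.diag(M12).mean()
```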
What does g signify? Consider two extreme cases. If g = 0.0 then there is absolutely no
confusion between any of the similar categories, including tennis-shoe and sneaker.
Figure 2.13: Selected rows and columns of the 256 × 256 confusion matrix M for spatial pyramid matching [LSP06a] and Ntrain = 30. Matrix elements containing 0.0 have been left blank. The first 6 categories are chosen because they are likely to be confounded with the last 6 categories. The main diagonal shows the performance for just these 12 categories. The diagonals of the other 2 quadrants show whether the algorithm can detect categories which are similar but not exact.
This would be suspicious since it means the categorization algorithm is splitting hairs, i.e. finding
significant differences where none should exist. Perhaps the classifier is training on some
inconsequential artifact of the dataset. At the other extreme g = 1.0 suggests that the two
confounding sets of six categories were completely indistinguishable. Such a classifier is
not discriminating enough to differentiate between airplanes and the more specific category
fighter-jet, or between people and their faces. In other words, the classifier generalizes so
well about similar object classes that it may be considered too sloppy for some applications.
In practice the desired value of g depends on the needs of the customer. Lower values
of g denote fine discrimination between similar categories or sub-categories. This would
be particularly desirable in situations that require the exact identification of a particular
species of mammal. A more inclusive classifier tends toward higher values of g. Such a
classifier would presumably be better at identifying a mammal it has never seen before,
based on general features shared by a large class of mammals.
As shown in Figure 2.13, a spatial pyramid matching classifier does indeed confuse
tennis-shoes and sneakers the most. This is a reassuring sanity check. To a lesser extent
the object categories frog/toad, dog/greyhound, fighter-jet/airplanes and people/faces-easy
are also confused.
Confusion between car-tire and car-side is entirely absent. This seems surprising since tires are such a conspicuous feature of cars when viewed from the side. However the tires pictured in car-tire tend to be much larger in scale than those found in car-side. One reasonable hypothesis is that the classifier has limited scale-invariance: objects or pieces of objects are no longer recognized if their size changes by an order of magnitude. This characteristic of the classifier may or may not be important, depending on the application. Another hypothesis is that the classifier relies not just on the presence of individual parts, but on their relationship to one another.
In short, generality defines a trade-off between classifier precision and robustness. Our metric for generating g is admittedly crude because it uses only six pairs of similar categories. Nonetheless generating a confusion matrix like the one shown in Figure 2.13 can provide a useful sanity check, while exposing features of a particular classifier that are not apparent from the raw performance benchmark.
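As a concrete illustration, the sketch below computes g as defined above, assuming a confusion matrix ordered as in Figure 2.13 (the six base categories followed by their six confounders); the function name and the random example data are ours, for illustration only.

```python
import numpy as np

def generality(C):
    """Generality g of a 2k x 2k confusion matrix C whose first k rows/columns
    are base categories and whose last k are their confounders:
    g = mean of the off-quadrant diagonals / mean of the main diagonal."""
    k = C.shape[0] // 2
    main = np.diag(C).mean()              # correct classifications
    upper = np.diag(C[:k, k:]).mean()     # base confused as confounder
    lower = np.diag(C[k:, :k]).mean()     # confounder confused as base
    return ((upper + lower) / 2.0) / main

# Hypothetical usage with a random 12-category confusion matrix (rows sum to 1)
C = np.random.dirichlet(np.ones(12), size=12)
print(generality(C))
```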
2.4.5 Background
Returning to the example of a Mars rover, suppose that a smallcamera window is used
to scan across the surface of the planet. Because there may beonly one interesting object
in 100 or 1000 images, the interest detector must have a low rate of false detections in
order to be effective. As illustrated in figure 2.14 this is a challenging problem, particularly
when the detector must accommodate hundreds of different object categories that are all
consideredinteresting.
In the spirit of the attentional cascade [VJ01a] we train interest classifiers to discover which regions are worthy of detailed classification and which are not. These detectors are summarized below. As before the classifier is an SVM with a spatial pyramid matching kernel [LSP06a]. The margin threshold is adjusted in order to trace out a full ROC curve.6
Interest    Ntrain                Speed
Detector    C1...C256    C257     (images/sec)   Description
A           30           512      24             Modified 257-category classifier
B           2            512      4600           Fast two-category classifier
C           30           30       25             Ordinary 257-category classifier
First let us consider Interest Detector C. This is the same detector that was employed for recognizing object categories in section 2.4.3. The only difference is that 257 categories are used instead of 256. Performance is poor because only 30 clutter images are used during training. In other words, clutter is treated exactly like any other category.
Interest Detector A corrects the above problem by using 512 training images from the clutter category. Performance improves because there is now a balance between the number of positive and negative examples. However the detector is still slow because it attempts to recognize 257 different object categories in every single image or camera region. This is wasteful if we expect the vast majority of regions to contain irrelevant clutter which is not worth classifying. In fact this detector only classifies about 25 images per second on a 3 GHz Pentium-based PC.
6 When measuring speed, training time is ignored because it is a one-time expense.
Interest Detector B trains on 512 clutter images and 512 images taken from the other 256 object categories. These two groups of images are assigned to the categories uninteresting and interesting, respectively. This B classifier is extremely fast because it combines all the interesting images into a single category instead of treating them as 256 separate categories. On a typical 3 GHz Pentium processor this classifier can evaluate 4600 images (or scan regions) per second.
It may seem counter-intuitive to group two images from each category C1...C256 into a huge meta-category, as is done with Interest Detector B. What exactly is the classifier training on? What makes an image interesting? What if we have merely created a classifier that detects the photographic style of Stephen Shore? For these reasons any classifier which implements attention should be verified on a variety of background images, not just those in C257. For example the Caltech-6 provides 550 background images with very different statistics.
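For concreteness, a minimal sketch of training such a binary interest detector on a precomputed matching kernel follows. The data, the kernel values, and the use of scikit-learn's SVC are all our own illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: K_train is a precomputed (n x n) spatial pyramid
# matching kernel; the first 512 training images are clutter, the rest
# are 2 images from each of the 256 object categories.
n_clutter, n_interesting = 512, 512
n = n_clutter + n_interesting
K_train = np.random.rand(n, n)
K_train = (K_train + K_train.T) / 2 + n * np.eye(n)     # keep it PSD
y = np.array([0] * n_clutter + [1] * n_interesting)     # 0 = uninteresting

clf = SVC(kernel="precomputed")
clf.fit(K_train, y)

# Sweeping a threshold on decision_function(K_test) would trace out the
# ROC curve described in the text (K_test: kernel rows for test images).
```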
Figure 2.14: ROC curve for three different interest classifiers described in section 2.4.5. These classifiers are designed to focus the attention of the multi-category detectors benchmarked in Figure 3.2. Because Detector B is roughly 200 times faster than A or C, it represents the best tradeoff between performance and speed. This detector can accurately detect 38.2% of the interesting (non-clutter) images with a 0.1% rate of false detections. In other words, 1 in 1000 of the images classified as interesting will instead contain clutter (solid red line). If a 1 in 100 rate of false detections is acceptable, the accuracy increases to 58.6% (dashed red line).
2.5 Conclusion
Thanks to rapid advances in the vision community over the last few years, performance over
60% on the Caltech-101 has become commonplace. Here we present a new Caltech-256
image dataset, the largest set of object categories available to our knowledge. Our intent
is to provide a freely available set of visual categories that does a better job of challenging
today’s state-of-the-art classification algorithms.
For example, spatial pyramid matching [LSP06a] with Ntrain = 30 achieves performance of 67.6% on the Caltech-101 as compared to 34.1% on Caltech-256. The standard practice among authors in the vision community is to benchmark raw classification
Figure 2.15: In general the Caltech-256 images are more difficult to classify than the Caltech-101 images. Here we plot performance of the two datasets over a random mix of Ncategories from each dataset. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncategories = 100 the performance is ∼ 60% lower.
performance as a function of training examples. As classification performance continues to improve, however, new benchmarks will be needed to reflect the performance of algorithms under realistic conditions. Beyond raw performance, we argue that a successful algorithm should also be able to
• Generalize beyond a specific set of images or categories
• Identify which images or image regions are worth classifying
In order to evaluate these characteristics we test two new benchmarks in the context of Caltech-256. No doubt there are other equally relevant benchmarks that we have not considered. We invite researchers to devise suitable benchmarks and share them with the community at large.
If you would like to share performance results as well as your confusion matrix, please send them to [email protected]. We will try to keep our comparison of performance as up-to-date as possible. For more details see
Chapter 3
Visual Hierarchies
The computational complexity of current visual categorization algorithms scales linearly at best with the number of categories. The goal of classifying simultaneously Ncat = 10^4 - 10^5 visual categories requires sub-linear classification costs. We explore algorithms for automatically building classification trees which have, in principle, log Ncat complexity. We find that a greedy algorithm that recursively splits the set of categories into the two minimally confused subsets achieves 5-20 fold speedups at a small cost in classification performance. Our approach is independent of the specific classification algorithm used. A welcome by-product of our algorithm is a very reasonable taxonomy of the Caltech-256 dataset.
3.1 Introduction
Much progress has been made during the past 10 years in approaching the problem of visual recognition. The literature shows a quick growth in the scope of automatic classification experiments: from learning and recognizing one category at a time until year 2000 [BP96, VJ01b] to a handful around year 2003 [WWP00, FPZ03, LS04] to ∼ 100 in 2006 [GD06, GD05b, E+05, ML06b, SWP05, ZBMM06b]. While some algorithms are remarkably fast [FG01, VJ01b, GD05b] the cost of classification is still at best linear in the number of categories; in most cases it is in fact quadratic since one-vs-one discriminative classification is used in most approaches. There is one exception: cost is logarithmic in the number of models for Lowe [Low04]. However Lowe's algorithm was developed to recognize specific objects rather than categories. Its speed hinges on the observation that local features are highly distinctive, so that one may index image features directly into a database of models which is organized like a tree [BL97]. In the more general case of visual category recognition, local features are not very distinctive, hence one cannot take advantage of this insight.
Humans can recognize between 10^4 and 10^5 object categories [Bie87] and this is a worthwhile and practical goal for machines as well. It is therefore important to understand how to scale classification costs sub-linearly with respect to the number of categories to be recognized. It is quite intuitive that this is possible: when we see a dog we are not for a moment considering the possibility that it might be classified as either a jet-liner or an ice cream cone. It is reasonable to assume that, once an appropriate hierarchical taxonomy is developed for the categories in our visual world, we may be able to recognize objects by descending the branches of this taxonomy and avoid considering irrelevant possibilities. Thus, tree-like algorithms appear to be a possibility worth considering, although formulations need to be found that are more 'holistic' than Beis and Lowe's feature-based indexing [BL97].
Here we explore one such formulation. We start by considering the confusion matrix
Figure 3.1: A typical one-vs-all multi-class classifier (top) exhaustively tests each image against every possible visual category requiring Ncat decisions per image. This method does not scale well to hundreds or thousands of categories. Our hierarchical approach uses the training data to construct a taxonomy of categories which corresponds to a tree of classifiers (bottom). In principle each image can now be classified with as few as log2 Ncat decisions. The above example illustrates this for an unlabeled test image and Ncat = 8. The tree we actually employ has slightly more flexibility as shown in Fig. 3.4.
that arises in one-vs-all discriminative classification of object categories. We postulate that the structure of this matrix may reveal which categories are more strongly related. In Sec. 3.3 we flesh out this heuristic to produce taxonomies. In Sec. 3.4 we propose a mechanism for automatically splitting large sets of categories into cleanly separated subsets, an operation which may be repeated to obtain a tree-like hierarchy of classifiers. We explore experimentally the implications of this strategy, both in terms of classification quality and in terms of computational cost. We conclude with a discussion in Sec. 3.5.
3.2 Experimental Setup
The goal of our experiment is to compare classification performance and computational costs when a given classification algorithm is used in the conventional one-vs-many configuration vs. our proposed hierarchical cascade (see Fig. 3.1).
3.2.1 Training and Testing Data
The choice of the image classifier is somewhat arbitrary for the purposes of this study. We decided to use the popular Spatial Pyramid Matching technique of Lazebnik et al. [LSP06b] because of its high performance and ease of implementation. We summarize our implementation in Sec. 3.2.2. Our implementation performs as reported by the original authors on Caltech-101. As expected, typical performance on Caltech-256 [GHP06] is lower than on Caltech-101 [LFP04b] (see Fig. 3.2). This is due to two factors: the larger number of categories and the more challenging nature of the pictures themselves. For example some of the Caltech-101 pictures are left-right aligned whereas the Caltech-256 pictures are not. On average a random subset of Ncat categories from the Caltech-256 is harder to classify than a random subset of the same number of categories from the Caltech-101 (see Fig. 3.3).
Other authors have achieved higher performance on the Caltech-256 than we report here, for example, by using a linear combination of multiple kernels [VR07]. Our goal here is not to achieve the best possible performance but to illustrate how a typical algorithm can be accelerated using a hierarchical set of classifiers.
The Caltech-256 image set is used for testing and training. We remove the clutter category from Caltech-256 leaving a total of Ncat = 256 categories.
Figure 3.2: Performance comparison between Caltech-101 and Caltech-256 datasets using the Spatial Pyramid Matching algorithm of Lazebnik et al. [LSP06b]. The performance of our implementation is almost identical to that reported by the original authors; any performance difference may be attributed to a denser grid used to sample SIFT features. This illustrates a standard non-hierarchical approach where authors mainly present the number of training examples and the classification performance, without also plotting classification speed.
3.2.2 Spatial Pyramid Matching
First each image is desaturated, removing all color information. For each of these black-and-white images, SIFT features [Low04] are extracted along a uniform 72x72 grid using software that is publicly available [MTS+05]. An M-word feature vocabulary is formed by fitting a Gaussian mixture model to 10,000 features chosen at random from the training set. This model maps each 128-dimensional SIFT feature vector to a scalar integer m = 1..M where M = 200 is the total number of Gaussians. The choice of clustering algorithm does not seem to affect the results significantly, but the choice of M does. The original authors [LSP06b] find that 200 visual words are adequate.
At this stage every image has been reduced to a 72x72 matrix of visual words. This representation is reduced still further by histogramming over a coarse 4x4 spatial grid. The resulting 4x4xM histogram counts the number of times each word 1..M appears in each of the 16 spatial bins. Unlike a bag-of-words approach [GD06], coarse-grained position information is retained as the features are counted.
The matching kernel proposed by Lazebnik et al. finds the intersection between each pair of 4x4xM histograms by counting the number of common elements in any two bins. Matches in nearby bins are weighed more strongly than matches in far-away bins, resulting in a single match score for each word. The scores for each word are then summed to get the final overall score. We follow this same procedure resulting in a kernel K that satisfies Mercer's condition [GD06] and is suitable for training an SVM.
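As a rough sketch of the histogramming and matching steps (not the thesis code), the following quantizes a 72x72 grid of visual-word indices into a 4x4xM histogram and computes a plain intersection score between two such histograms. The function names are ours, and the distance-dependent weighting of the full pyramid kernel is deliberately omitted.

```python
import numpy as np

M = 200  # vocabulary size, as in the text

def spatial_histogram(words, bins=4, M=M):
    """Reduce a 72x72 grid of visual-word indices (values in 0..M-1)
    to a bins x bins x M histogram of word counts."""
    H = np.zeros((bins, bins, M))
    rows = np.array_split(np.arange(words.shape[0]), bins)
    cols = np.array_split(np.arange(words.shape[1]), bins)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            block = words[np.ix_(r, c)]
            H[i, j] = np.bincount(block.ravel(), minlength=M)
    return H

def intersection_score(H1, H2):
    """Histogram intersection: count the common elements bin by bin.
    (The kernel of Lazebnik et al. additionally weights matches found
    at coarser pyramid levels; that weighting is omitted here.)"""
    return np.minimum(H1, H2).sum()

# Hypothetical usage with two random word grids
w1 = np.random.randint(0, M, (72, 72))
w2 = np.random.randint(0, M, (72, 72))
K12 = intersection_score(spatial_histogram(w1), spatial_histogram(w2))
```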
3.2.3 Measuring Performance
Classification performance is measured as a function of the number of training examples. First we select a random but disjoint set of Ntrain and Ntest training and testing images from each class. All categories are sampled equally, i.e., Ntrain and Ntest do not vary from class to class.
Like Lazebnik et al. [LSP06b] we use a standard multi-class method consisting of a Support Vector Machine (SVM) trained on the Spatial Pyramid Matching kernel in a
Figure 3.3: In general the Caltech-256 [GHP06] images are more difficult to classify than the Caltech-101 images. Here we fix Ntrain = 30 and plot performance of the two datasets over a random mix of Ncat categories chosen from each dataset. The solid region represents a range of performance values for 10 randomized subsets. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncat = 100 the performance is ∼ 60% lower (dashed red line).
one-vs-all classification scheme. The training kernel has dimensions Ncat · Ntrain along each side. Once the classifier has been trained, each test image is assigned to exactly one visual category by selecting the one-vs-all classifier which maximizes the margin.
The confusion matrix Cij counts the fraction of test examples from class i which were classified as belonging to class j. Correct classifications lie along the diagonal Cii so that the cumulative performance is the mean of the diagonal elements. To reduce uncertainty we average the matrix obtained over 10 experiments using different randomized training
Figure 3.4: A simple hierarchical cascade of classifiers (limited to two levels and four categories for simplicity of illustration). We call A, B, C and D four sets of categories as illustrated in Fig. 3.5. Each white square represents a binary branch classifier. Test images are fed into the top node of the tree where a classifier assigns them to either the set A ∪ B or the set C ∪ D (white square at the center-top). Depending on the classification, the image is further classified into either A or B, or C or D. Test images ultimately terminate in one of the 7 red octagonal nodes where a conventional multi-class node classifier makes the final decision. For a two-level ℓ = 2 tree, images terminate in one of the 4 lower octagonal nodes. If ℓ = 0 then all images terminate in the top octagonal node, which is equivalent to conventional non-hierarchical classification. The tree is not necessarily perfectly balanced: A, B, C and D may have different cardinality. Each branch or node classifier is trained exclusively on images extracted from the sets that the classifier is discriminating. See Sec. 3.4 for details.
and testing sets. By inspecting the off-diagonal elements of the confusion matrix it is clear that some categories are more difficult to discriminate than others. Upon this observation we build a heuristic that creates an efficient hierarchy of classifiers.
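A minimal sketch of this bookkeeping follows; the helper name and the synthetic labels are ours, for illustration under the assumption of integer ground-truth and predicted labels.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_cat):
    """C[i, j] = fraction of test examples of class i classified as class j."""
    C = np.zeros((n_cat, n_cat))
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    return C / C.sum(axis=1, keepdims=True)  # rows sum to 1

# Hypothetical usage: cumulative performance is the mean of the diagonal
rng = np.random.default_rng(0)
true = rng.integers(0, 256, size=256 * 25)
pred = np.where(rng.random(true.size) < 0.3, true,
                rng.integers(0, 256, true.size))   # ~30% correct, rest random
C = confusion_matrix(true, pred, 256)
print("performance:", np.diag(C).mean())
```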
3.2.4 Hierarchical Approach
Our hierarchical classification architecture is shown in Fig. 3.4. The principle behind the architecture is simple: rather than a single one-vs-all classifier, we achieve classification by recursively splitting the set of possible labels into two roughly equal subsets. This divide-and-conquer strategy is familiar to anyone who has played the game of 20 questions.
This method is faster because the binary branch classifiers are less complex than the one-vs-all node classifiers. For example the 1-vs-N node classifier at the top of Fig. 3.1 actually consists of N=8 separate binary classifiers, each with its own set Si of support vectors. During classification each test image must now be compared with the union of training images
$$S_{\text{node}} = \bigcup_{i=1}^{N} S_i$$
Unless the sets Si happen to be the same (which is highly unlikely) the size of Snode will increase with N.
Our procedure works as follows. In the first stage of classification, each test image reaches its terminal node via a series of ℓ inexpensive branch comparisons. By the time the test image arrives at its terminal node there are only ∼ Ncat/2^ℓ categories left to consider instead of Ncat. The greater the number of levels ℓ in the hierarchy, the fewer categories there are to consider at the expensive final stage, with correspondingly fewer support vectors overall.
The main decision to be taken in building such a hierarchical classification tree is how to choose the sets into which each branch divides the remaining categories. The key intuition which guides our architecture is that decisions between categories that are more easily
Figure 3.5: Top-down grouping as described in Sec. 3.3. Our underlying assumption is that categories that are easily confused should be grouped together in order to build the branch classifiers in Fig. 3.4. First we estimate a confusion matrix using the training set and a leave-one-out procedure. Shown here is the confusion matrix for Ntrain = 10, with diagonal elements removed to make the off-diagonal terms easier to see.
confused should be taken later in the decision tree, i.e. at the lower nodes where fewer categories are involved. With this in mind we start the training phase by constructing a confusion matrix C′ij from the training set alone using a leave-one-out validation procedure. This matrix (see Fig. 3.5) is used to estimate the affinity between categories. This should be distinguished from the standard confusion matrix Cij which measures the confusion between categories during the testing phase.
3.3 Building Taxonomies
Next, we compare two different methods for generating taxonomies automatically based on the confusion matrix C′ij.
The first method splits the confusion matrix into two groups using Self-Tuning Spectral Clustering [ZMP04]. This is a variant of the Spectral Clustering algorithm which automatically chooses an appropriate scale for analysis. Because our cascade is a binary tree we always choose two for the number of clusters. Fig. 3.4 shows only the first two levels of splits while Fig. 3.6 repeats the process until the leaves of the tree contain individual categories.
The second method builds the tree from the bottom-up. At each step the two groups of categories with the largest mutual confusion are joined while their confusion matrix rows/columns are averaged. This greedy process continues until there is only a single super-group containing all 256 categories. Finally, we generate a random hierarchy as a control.
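A minimal sketch of this bottom-up agglomeration, assuming a symmetrized affinity matrix as input (the merge bookkeeping is our own illustration, not the thesis code):

```python
import numpy as np

def bottom_up_tree(C):
    """Greedily merge the two groups with the largest mutual confusion.
    C is a symmetric affinity matrix, e.g. (C' + C'.T)/2 with the diagonal
    removed. Returns the merges as pairs of category-index tuples."""
    groups = [(i,) for i in range(C.shape[0])]
    A = C.astype(float).copy()
    np.fill_diagonal(A, -np.inf)          # never merge a group with itself
    merges = []
    while len(groups) > 1:
        i, j = np.unravel_index(np.argmax(A), A.shape)
        i, j = min(i, j), max(i, j)
        merges.append((groups[i], groups[j]))
        A[i, :] = (A[i, :] + A[j, :]) / 2.0   # average rows/columns
        A[:, i] = A[i, :]
        A[i, i] = -np.inf
        A = np.delete(np.delete(A, j, axis=0), j, axis=1)
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return merges

# Hypothetical usage on a random symmetric 8-category affinity matrix
rng = np.random.default_rng(1)
C = rng.random((8, 8)); C = (C + C.T) / 2
print(bottom_up_tree(C)[:3])
```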
3.4 Top-Down Classification Algorithm
Once a taxonomy of classes is discovered, we now seek to exploit this taxonomy for efficient top-down classification. The problem of multi-stage classification has been studied in many different contexts [AGF04, Fan05, LYW+05, Li05]. For example, Viola and Jones [VJ04] use an attentional cascade to quickly exclude areas of their image that are unlikely to contain a face. Instead of using a tree, however, they use a linear cascade of
Figure 3.6: Taxonomy discovered automatically by the computer, using only a limited subset of Caltech-256 training images and their labels. Aside from these labels there is no other human supervision; branch membership is not hand-tuned in any way. The taxonomy is created by first generating a confusion matrix for Ntrain = 10 and recursively dividing it by spectral clustering. Branches and their categories are determined solely on the basis of the confusion between categories, which in turn is based on the feature-matching procedure of Spatial Pyramid Matching. To compare this with some recognizably human categories we color code all the insects (red), birds (yellow), land mammals (green) and aquatic mammals (blue). Notice that the computer's hierarchy usually begins with a split that puts all the plant and animal categories together in one branch. This split is found automatically with such consistency that in a third of all randomized training sets not a single category of living thing ends up on the opposite branch.
classifiers that are progressively more complex and computationally intensive. Fleuret and Geman [FG01] demonstrate a hierarchy of increasingly discriminative classifiers which detect faces while also estimating pose.
Our strategy is illustrated in Fig. 3.4 and described in its caption. We represent the taxonomy of categories as a binary tree, taking the two largest branches at the root of the tree and calling these classes A ∪ B and C ∪ D. Now take a random subsample of Ftrain of the training images in each of the two branches and label them as being in either class 1 or 2. An SVM is trained using the Spatial Pyramid Matching kernel as before except that there are now two classes instead of Ncat. Empirically we find that Ftrain = 10% significantly reduces the number of support vectors in each branch classifier with little or no performance degradation.
If the branch classifier passes a test image down to the left branch, we assume that it cannot belong to any of the classes in the right branch. This continues until the test image arrives at a terminal node. Based on the above assumption, for each node at depth ℓ, the final multi-class classifier can ignore roughly 1 − 2^−ℓ of the training classes. The exact fraction varies depending on how balanced the tree is.
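A minimal sketch of this top-down descent follows; the node structure and classifier interface are assumptions for illustration, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # Internal nodes carry a binary branch classifier; terminal nodes carry
    # a multi-class node classifier over their remaining category subset.
    branch_clf: Optional[Callable] = None   # x -> 0 (left) or 1 (right)
    node_clf: Optional[Callable] = None     # x -> category label
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def classify_top_down(x, node):
    """Descend the taxonomy with cheap binary decisions, then make the
    expensive multi-class decision among the ~Ncat / 2**l surviving classes."""
    while node.node_clf is None:
        node = node.left if node.branch_clf(x) == 0 else node.right
    return node.node_clf(x)

# Hypothetical 2-category toy tree: negative x goes left, positive right
tree = Node(branch_clf=lambda x: 0 if x < 0 else 1,
            left=Node(node_clf=lambda x: "class A"),
            right=Node(node_clf=lambda x: "class B"))
print(classify_top_down(-3.0, tree))   # -> "class A"
```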
The overall speed per test image is found by taking a union of all the support vectors required at each level of classification. This includes all the branch and node classifiers which the test image encounters prior to final classification. Each support vector corresponds to a training image whose matching score must be computed, at a cost of 0.4 ms per support vector on a Pentium 3 GHz machine. As already noted, the multi-class node classifiers require many more support vectors than the branch classifiers. Thus increasing
Figure 3.7: The taxonomy from Fig. 3.6 is reproduced here to illustrate how classification performance can be traded for classification speed. Node A represents an ordinary non-hierarchical one-vs-all classifier implemented using an SVM. This is accurate but slow because of the large combined set of support vectors in Ncats = 256 individual binary classifiers. At the other extreme, each test image passes through a series of inexpensive binary branch classifiers until it reaches 1 of the 256 leaves, collectively labeled C above. A compromise solution B invokes a finite set of branch classifiers prior to final multi-class classification in one of 7 terminal nodes.
Figure 3.8: Comparison of three different methods for generating taxonomies. For each taxonomy we vary the number of branch comparisons prior to final classification, as illustrated in Fig. 3.4. This results in a tradeoff between performance and speed as one moves between two extremes A and C. Randomly generated hierarchies result in poor cascade performance. Of the three methods, taxonomies based on Spectral Clustering yield marginally better performance. All three curves measure performance vs. speed for Ncat = 256 and Ntrain = 10.
the number of branch classifier levels decreases the overall number of support vectors and increases the classification speed, but at a performance cost.
3.5 Results
As shown in Fig. 3.8, our top-down and bottom-up methods give comparable performance at Ntrain = 10. Classification speed increases 5-fold with a corresponding 10% decrease in performance. In Fig. 3.9 we try a range of values for Ntrain. At Ntrain = 50 there is a 20-fold speed increase for the same drop in performance.
3.6 Conclusions
Learning hierarchical relationships between categories of objects is an essential part of how humans understand and analyze the world around them. Someone playing the game of "20 Questions" must make use of some preconceived hierarchy in order to guess the unknown object using the fewest number of queries. Computers face the same dilemma: without some knowledge of the taxonomy of visual categories, classifying thousands of categories is reduced to blind guessing. This becomes prohibitively inefficient as computation time scales linearly with the number of categories.
To break this linear bottleneck, we attack two separate problems. How can computers automatically generate useful taxonomies, and how can these be applied to the task of classification? The first problem is critical. Taxonomies built by hand have been applied to the task of visual classification [ZW07] for a small number of categories, but this method
Figure 3.9: Cascade performance / speed trade-off as a function of Ntrain. Values of Ntrain = 10 and Ntrain = 50 result in a 5-fold and 20-fold speed increase (respectively) for a fixed 10% performance drop.
does not scale well. It would be tedious, if not impossible, for a human operator to generate detailed visual taxonomies for the computer, updating them for each new environment that the computer might encounter. Another problem for this approach is consistency: any two operators are likely to construct entirely different trees. A more consistent approach is to use an existing taxonomy such as WordNet [Fel98] and apply it to the task of visual classification [MS07]. One caveat is that lexical relationships may not be optimal for certain visual classification tasks. The word lemon refers to an unreliable car, but the visual categories lemon and car are not at all similar.
Our experiments suggest that plausible taxonomies of object categories can be created automatically using a classifier (in this case, Spatial Pyramid Matching) coupled to a learning phase which estimates inter-category confusion. The only input used for this process is a set of training images and their labels. Taxonomies such as the one shown in Fig. 3.6 seem to consistently discover broader categories which are naturally recognizable to humans, such as the distinction between animate and inanimate objects.
How should we compare one hierarchy to another? It is difficult to quantify such a comparison without a specific goal in mind. To this end we benchmark a cascade of classifiers based on our hierarchy and demonstrate significant speed improvements. In particular, top-down and bottom-up recursive clustering processes both result in better performance than a randomly generated control tree.
Chapter 4
Pollen Counting
4.1 Basic Method
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
4.2 Results
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
4.3 Comparison To Humans
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
!"#$"%
&"'(%)
!"#$%&'
(&)&*+,"-
.,-&
(&)&*+,"-
/0$&%0"12%
3345#%&2
60#7&8'
9$,:0+-&%%'
;"2&1
!1<++&$
.,-#1
!1#%%,=*#+,"-
6>;86?;
@77&#$#-*&
;"2&1
*$+,-"+./
&"'(%)
60#7&
.&#+<$&%'ABC
!$"77&2
?"11&-!"-+"<$%
D#5&1EFG
6+#:&'H
6+#:&'I
6+#:&'J
6+#:&'K
!"-L"1L&
60#7&89$,:0+-&%%
.&#+<$&%'AMHC
6G./
.&#+<$&%'
AMHNC
!#-2,2#+&%
.,-#1,%+%
!"#$%&'''''''''''
9$,:0+-&%%
.&#+<$&%'ABC
6+#:&'M;,*$"%*"7&'
GO#:&
Figure 4.1:
blah blah blah blah blah blah blah blah blah blah blah blah
4.4 Conclusions
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
Chapter 5
Machine Olfaction: Introduction
Electronic noses have been used successfully in a wide variety of applications [RB08, WB09] ranging from safety [DNPa07, GSHD00, ZEA+04, Ga01] and detection of explosives [AMa01, KWTT03, Gar04] to medical diagnosis [NMMa99, GSH00, TM04, Phi05, DDNPa08], food quality assessment [ZZX+06, GVMSB09, BO04, AB03] and discrimination of beverages like coffee [PFSQ01, Pa02], tea [Yu07, DHG+03, LSH05] and wine [DNDD96, LSH05, RSCC+08, SCA+11]. These applications typically involve a limited variety of odor categories with tens or even hundreds of training examples available for each odorant.
Human test subjects, on the other hand, are capable of distinguishing thousands of different odors [Lan86, Ree90, SR95] and can recognize new odors with only a few training examples [Dav75, CdWL+98]. How we organize individual odor categories into broader classes, and how many classes are required, is still a matter of active debate in the psychophysics community. One recent study of 881 perfume materials found that human test subjects group the vast majority of these odors into 17 distinct classes [ZS06]. Another comprehensive study of 146-dimensional odorant responses obtained by Dravnieks [Dra85] showed that most of the variability in the responses can be explained using only 6-10 parameters [KKER11].
Results such as these suggest that the bulk of the variability in human odor perception can be represented in a relatively low-dimensional space. What is less clear is whether this low dimensionality is an intrinsic quality of the odorants themselves or a feature of our olfactory perception. Measuring a large variety of odorants electronically provides an opportunity to compare how machines and humans organize their olfactory environment. One way to represent this is to construct a taxonomy of odor categories and super-categories with closely related categories residing in nearby branches of the tree.
Over the last 10 years hierarchical organization tools have proven increasingly useful in the field of computer vision as image classification techniques have been scaled to larger image datasets. Examples include PASCAL [EZWVG06], Caltech-101 [FFFP04], Caltech-256 [GHP07] and, more recently, the SUN [CLTW10], LabelMe [TRY10] and ImageNet [DDS+09] datasets with over a thousand categories each. These datasets are challenging not just because they include a larger number of categories but because the objects themselves are photographed in a variety of poses and lighting conditions with varying degrees of background clutter.
While it is possible to borrow taxonomies such as WordNet [SR98] and apply them to machine classification tasks, lexical relationships are at best an imperfect approximation of visual or olfactory relationships. It is therefore useful to automatically discover taxonomies that are directly relevant to the specific task at hand. One straightforward greedy approach involves clustering the confusion matrix created with a conventional one-vs-all multi-class classifier. This results in a top-down arrangement of classifiers where simple, inexpensive decisions are made first in order to reduce the available solution space. Such an approach yields faster terrain recognition for autonomous navigation [AMHP07] as well as more computationally efficient classification of images containing hundreds of visual categories [GP08]. One way to improve overall classification accuracy is to identify categories which cannot be excluded early and include them on multiple hierarchy branches [MS08]. Binder et al. show that taxonomy-based classification can improve both speed and accuracy at the same time [BMK11].
In addition to larger more challenging datasets and hierarchical classification approaches that scale well, machine vision has benefitted from discriminative features like SIFT and GLOH that are relatively invariant to changes in illumination, viewpoint and pose [MS05]. Such features are not dissimilar from those extracted from astronomical data by switching between fixed locations on the sky [DRNP94, TTA+06]. The resulting measurements can reject slowly-varying atmospheric contamination while retaining extremely faint cosmological signals that are O(10^6) times smaller.
Motivated by this approach, we construct a portable apparatus capable of sniffing at a range of frequencies to explore how well a small array of 10 sensors can classify hundreds of odors in indoor and outdoor environments. We evaluate this swept-frequency approach by sampling 90 common household odors as well as 40 odors in the University of Pittsburgh Smell Identification Test. Reference data with no odorants is also gathered in order to model and remove any systematic errors that remain after feature extraction.
The sensors themselves are carbon black-polymer composite thin-film chemiresistors. Using controlled concentrations of analytes under laboratory conditions, these sensors have been shown to exhibit steady-state and time-dependent resistance profiles that are highly sensitive to inorganic gases as well as organic vapors [BL03] and can be used for classifying both [SL03, MGBW+08]. A challenge when operating outdoors is that variability in water vapor concentrations masks the response of other analytes. Our approach focuses on extracting features that are insensitive to background analytes whose concentrations change more slowly than the sniffing frequency. This strategy exploits the linearity of the sensor response and the slowly-varying nature of ambient changes in temperature and humidity.
From an instrument design perspective, we would like to discover how the choice of sniffing frequencies, number of sensors and feature reduction method all contribute to the final indoor/outdoor classification performance. Next we construct a top-down classification framework which aggregates odor categories that cannot be easily distinguished from one another. Such a framework quantitatively addresses questions like: what sorts of odor groupings can be readily classified by the instrument, and with what specificity?
Chapter 6
Machine Olfaction: Methods
6.1 Instrument
The apparatus is a common design [NFM90, GS92, LEZP99] which draws air through a small sensor chamber while controlling the source of the air via a manifold mixing solenoid valve. A small fan draws the air through a computer-controlled valve with five inlets. Such elements as flowmeters, gas cylinders, air dryers and other filters are absent. Four sample chambers hold the odorants to be tested while one empty chamber serves as a reference as shown in Fig. 6.1.
One approach to the problem of removing confusing background factors is to filter them explicitly from the odor stream. When discriminating between alcoholic beverages for example, alcohol and water may be filtered to improve performance [LSH05]. We avoid this approach here to keep the instrument as simple and portable as possible for acquiring data in both indoor and outdoor environments. The sensor chamber, sample chambers, solenoid valve, computer and electronics are light enough to carry, and all electronic components run on battery power.
Figure 6.1: A fan draws air from 1 of 4 odorant chambers or an empty reference chamber, depending on the state of the computer-controlled solenoid valve. The valve control signal can then be compared to the resistance changes recorded from an array of 10 individual sensors as shown in Fig. 6.2.
6.2 Sampling and Measurements
Ideally the sniffing frequencies used should be high enough to reject unwanted environmental noise but low enough that the time-constant of the sensors does not attenuate the signal. Between these two constraints we find a range of usable frequencies between 1/64 and 1 Hz. This choice of frequencies is examined in detail in the next section.
The sniffing scheme is illustrated in Fig. 6.2a. The computer first chooses a single odor and samples 7 frequencies in 7.5 minutes. During this span of time 400 s are spent switching between a single odorant and the reference chamber while the remaining 50 s are spent purging the chamber with reference air. Henceforward we refer to this complete sampling pattern as a "sniff" and each of the 7 individual frequency modulations as "subsniffs".
The sniff is repeated 4 times for each of 4 odorants for a total of 2 hours per "run". Within each run the odorants are randomly selected and the order in which they are presented is also randomized. To avoid residual odors each run starts with a 1-hour period during which the sensor chamber is purged with reference air, odorant chambers are replaced and the tubing that leads to the chambers is washed and dried.
The resistance change ∆R/R of each sensor is sampled at 50 Hz while the valve modulates the incoming odor streams. From this time-series data each individual sniff is reduced to a feature vector of measurements representing the band power of the sensor resistance integrated over subsniffs i = 1..7 and frequency bands j = 1..4. Fig. 6.2c shows what this filtering looks like for the case of i = 4. In this subsniff the valve switches 4 times between odorant and reference at a frequency of 1/8 Hz. Integrating each portion of the Fourier transform of the signal

$$S_i(f) = \int s_i(t)\, e^{-2\pi i f t}\, dt$$

weighted by four different window functions results in 7 × 4 = 28 measurements

$$m_{ij} = \int_0^{f_{\max}} H_{ij}(f)\, df, \qquad H_{ij}(f) = S_i(f)\, W_{ij}(f)$$

where f_max = 25 Hz is the Nyquist frequency.
The modulation of the odorant in the ith subsniff can be thought of as the product of a single 64 s pulse and progressively faster square waves of frequency f_i = 2^(i−7) Hz. Thus the first window function j = 1 in each subsniff is centered around f_1 = 1/64 Hz while window functions for j = 2..4 are centered at the odd harmonics f_i, 3f_i and 5f_i where the square-wave modulation has maximal power. Repeating this procedure for each sensor k = 1..10 gives a final feature m_ijk of size 7 × 4 × 10 = 280 which is normalized to unit length.
For comparison a second feature m_i of size 7 × 10 = 70 is generated by simply differencing the top and bottom 5% quantiles of ∆R/R within each subsniff. This sort of amplitude estimate is comparable to the so-called sensorial odor perception (SOP) feature commonly used in machine olfaction experiments [DVHV01] and is similar to m_i1k in that it ignores harmonics with frequencies higher than 1/64 Hz within each subsniff.
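A simplified sketch of this feature extraction for one sensor and one subsniff follows. The boxcar band shape and the bandwidth parameter are our own illustrative choices; the text does not specify the exact window functions W_ij.

```python
import numpy as np

FS = 50.0  # sampling rate in Hz, as in the text

def subsniff_feature(signal, f_i, rel_bw=0.2):
    """Band power of a subsniff's dR/R time series at f_1 = 1/64 Hz and at
    the fundamental f_i and its odd harmonics 3*f_i and 5*f_i (j = 1..4).
    A simple boxcar band of relative width rel_bw stands in for W_ij."""
    S = np.abs(np.fft.rfft(signal))**2                 # power spectrum
    f = np.fft.rfftfreq(signal.size, d=1.0 / FS)      # frequency axis
    centers = [1.0 / 64, f_i, 3 * f_i, 5 * f_i]
    return np.array([S[(f >= c * (1 - rel_bw)) & (f <= c * (1 + rel_bw))].sum()
                     for c in centers])

# Hypothetical usage for subsniff i = 4 (f_4 = 1/8 Hz, 32 s of data)
t = np.arange(0, 32, 1 / FS)
sig = np.sign(np.sin(2 * np.pi * (1 / 8) * t)) * 1e-4  # idealized square response
print(subsniff_feature(sig, f_i=1 / 8))
```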
6.3 Datasets and Environment
Three separate datasets are used for training and testing.
The University of Pittsburgh Smell Identification Test (UPSIT) consists of 40 microencapsulated odorants chosen to be relatively recognizable and to span known odor classes [DSa84a, DSKa84]. The test is administered as a booklet of 4-item multiple choice questions with an accompanying scratch-and-sniff patch for each question. It is an especially useful standard because of the wealth of psychophysical data gathered on a variety of test subjects since it was introduced in 1984 [DSA+84b, DAZa85, Dot97, EFL+05].
In order to sample realistic real-world odors we introduce a Caltech Common Household Odors Dataset (CHOD) with 90 common food products and household items. Items were chosen to be as diverse as possible while remaining readily available. Odor categories for both the CHOD and UPSIT are listed in the appendix. Of these, 78 were sampled indoors, 40 were sampled outdoors and 32 were sampled in both locations.
Lastly a Control Dataset was acquired in the same manner as the other two sets but with empty odorant chambers. The purpose of this set is to model and remove environmental components which are not associated with an odor class. In this sense it is roughly analogous to the clutter category present in some image datasets. Control data was taken on a semi-weekly basis over the entire 80-day period during which the other 2 datasets were acquired. This was done in order to capture as much environmental variation as possible. Half of the control data is used for modeling while the other half is used for verification purposes.
Together these 3 sets contain 250 hours of data spanning 130 odor categories and 2 environments. The first 130 hours of data were acquired over a 40-day period in a typical laboratory environment with temperatures of 22.0-25.0 °C and 36-52% humidity. Over the subsequent 40 days the remaining 120 hours were acquired on a rooftop and balcony with temperatures ranging from 10.1-24.7 °C and 29-81% humidity. On time-scales of 2 hours the average change in temperature and humidity were 0.11 °C and 0.4% in the laboratory and 0.61 °C and 1.9% outdoors. Thus the environmental variation outdoors was roughly 5 times greater than indoors.
Figure 6.2: (a) A sniff consists of 7 individual subsniffs s1...s7 of sensor data taken as the valve switches between a single odorant and reference air. From this data a 7 × 4 = 28 size feature m is generated representing the measured power in each of the 7 subsniffs i over 4 fundamental harmonics j. For comparison purposes a simple amplitude feature differences the top and bottom 5% quantiles of ∆R/R in each subsniff. (b) As the switching frequency f increases by powers of 2 so does the number of pulses, so that the time period T is constant for all but the first subsniff:

si   fi        Ti     Ni   Ttot
s1   1/64 Hz   64 s   1    64 s
s2   1/32      32     1    32
s3   1/16      16     2    32
s4   1/8       8      4    32
s5   1/4       4      8    32
s6   1/2       2      16   32
s7   1         1      32   32

(c) To illustrate how m is measured we show the harmonic decomposition of just s4, highlighted in (a). The corresponding measurements m4j are the integrated spectral power for each of 4 harmonics. Higher-order harmonics suffer from attenuation due to the limited time-constant of the sensors but have the advantage of being less susceptible to slow signal drift. Fitting a 1/f^n noise spectrum to the average indoor and outdoor frequency response of our sensors in the absence of any odorants illustrates why higher-frequency switching and higher-order harmonics may be especially advantageous in outdoor environments.
Figure 6.3: Visual representation of the harmonic decomposition feature m for 2 wines, 2 lemon parts and 2 teas from the Common Household Odors Dataset. Each odorant is sampled 4 times on 2 different days in 2 separate environments. Each box represents one complete 400 s sniff reduced to a 280-dimensional feature vector. Within each box, the 10 rows (y axis) show the response of different sensors over 28 frequencies (x axis) corresponding to 7 subsniffs and 4 harmonics. For visual clarity the columns are sorted by frequency and rows are sorted so that adjacent sensors are maximally correlated.
Chapter 7
Machine Olfaction: Results
The first four experiments described below are designed to test the effect of sniffing frequency, sensor array size, feature extraction and sensor stability on classification performance over a broad range of odor categories. For instrument design purposes it is useful to isolate those features and feature subsets which are most effective for a given classification task in a discriminative machine-learning framework.
Fig. 6.3 shows features extracted for 6 specific odorants in 3 broader odor categories: wine, lemon and tea. Different kinds of tea seem to be easily distinguishable from wine and lemon but less distinguishable from one another. A fifth experiment seeks to quantify this intuition that certain odor categories are more readily differentiated than others and incorporate this into a learning framework.
In each experiment 4 presentations per odor category are separated into randomly selected training and testing sets of 2 sniffs each. Each sniff is reduced to a feature vector m and an SVM1 is used for final classification. In several experiments feature vectors m_ik are also generated for comparison purposes. Both features are pre-processed by normalizing them to unit length and projecting out the first two principal components of the control data, which together account for 83% of the feature variability when no odorants are present.
1 Specifically the LIBLINEAR package of [HCLK08, FCHW08].
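A minimal sketch of this pre-processing step (unit normalization plus projecting out the leading control-data components), using a plain SVD in place of whatever PCA implementation was actually used:

```python
import numpy as np

def preprocess(features, control, n_components=2):
    """Normalize feature vectors to unit length, then project out the first
    n_components principal components of the control (no-odorant) data."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    C = control - control.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    P = Vt[:n_components]                    # top principal directions
    return F - (F @ P.T) @ P                 # remove those directions

# Hypothetical usage with random 280-dimensional features
rng = np.random.default_rng(2)
X = preprocess(rng.normal(size=(100, 280)), rng.normal(size=(40, 280)))
```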
Performance is averaged over randomized data subsets of Ncat = 2, 4, ... odor categories up to the maximum number of categories in the set. Classification error naturally increases with Ncat as the task becomes more challenging and the probability of randomly guessing the correct odorant decreases. In addition to random category groupings, the final test clusters odorants to examine classification performance for top-down category groupings.
7.1 Subsniff Frequency
As a practical matter, two fundamental limiting factors in our experiments are the amount of time required to prepare odorant chambers and to sample their contents. An unnecessarily long sampling procedure limits the usefulness of machine olfaction in many real-world applications. Reducing the duration of a sniff is thus highly worthwhile if it does not significantly impact classification accuracy. Our first test measures performance using only a limited number of subsniff frequencies. The results for each dataset in both indoor and outdoor environments are presented in Fig. 7.1.
A complete sniff is divided into 4 overlapping 200 s time segments. Each covers a different range of modulation frequencies from 1 - 1/8 Hz for the fastest segment to 1/16 - 1/64 Hz for the slowest. Fig. 7.1 compares classification results using features constructed from each time segment as well as the entire 400 s sniff.
Averaging the CHOD and UPSIT results in both environments, overall performance for Ncat = 4 drops by 5.6%, 5.1%, 8.3% and 24.4% when using 200 s of data in a progressively slower range of modulation frequencies. For Ncat = 16 this drop is more significant:
Figure 7.1: Classification performance for the University of Pittsburgh Smell Identification Test (UPSIT) and Common Household Odors Dataset (CHOD) for different sniff subsets using 4 and 16 categories for training and testing. For control purposes data is also acquired with empty odorant chambers. Compared with using the entire sniff (top), the high-frequency subsniffs (2nd row) outperform the low-frequency subsniffs (bottom) especially for Ncat = 16. Dotted lines show performance at the level of random guessing.
9.5%, 10.6%, 17.2% and 41.2% respectively. At least for the polymer composite sensors we are using, this suggests that the low-frequency subsniffs contribute relatively little to classification performance.
This is perhaps not surprising considering that the mean spectrum of background noise in the control data is skewed towards lower frequencies (Fig. 6.2c). While this noise spectrum depends partially on the type of sensor used, it is also symptomatic of the kinds of slow linear drifts observed in both temperature and humidity throughout the tests. Other sensors that are sensitive to such drifts may also benefit from faster switching so long as the switching frequency does not far exceed the cutoff imposed by sensor time constants. In our case these time constants range from 0.1 s for the fastest sensor to 1 s for the slowest.
7.2 Number of Sensors
Another important design consideration is the number and variety of sensors required for a given classification task. Our second test measures the classification error while gradually increasing the number of sensors from 2 up to the full array of 10.
As shown in Fig. 7.2, the marginal utility of adding additional sensors depended on the difficulty of the task: performance in outdoor conditions or with a large number of odor categories showed the most improvement. Across the board however the Control data classification error consistently increased as sensors were added, approaching closer to the expected level of random chance. Averaged over all values of Ncat the Outdoor Control error was 17% less than what would be expected from random chance when 10 sensors were used, compared to 58% less using only 2 of the available sensors. The positive detection of
Figure 7.2: Classification error for all three datasets taken indoors and outdoors while varying the number of sensors and the number of categories used for training and testing. Each dotted colored line represents the mean performance over randomized subsets of 2, 4, 6 and 8 sensors out of the available 10. To illustrate this for a single value of Ncat, gray vertical lines mark the error averaged over randomized sets of 16 odor categories for the indoor and outdoor datasets. When the number of sensors increased from 4 to 10, the indoor error (left line) decreased by < 2% for the CHOD and UPSIT while the outdoor error (right line) decreased by 4-7%. The Control error is also important because deviations from random chance when no odor categories are present may suggest sensitivity to environmental factors such as water vapor. Here the indoor error for both 4 and 10 sensors remains consistent with 93.75% random chance while the outdoor error rises from 85.9% to 91.7%.
distinct odor categories where none are present may suggest either overfitting or sensitivity to extraneous environmental factors such as water vapor. Thus the use of additional sensors seems to be important for background rejection in outdoor environments even when the reduction in classification error for the other datasets is only marginal.
7.3 Feature Performance
For each individual sensor, the feature extraction process converts 400 s or 20,000 samples of time-stream data per sniff into a compact array m_ij of 28 values representing the total spectral response over multiple harmonics of the sniffing frequency. An even smaller feature m_i measures only the amplitude of the sensor response within each of the 7 subsniffs. Our third test compares classification accuracy for both features to determine whether measuring the spectral response of the sensor over a broad range of harmonics yields any compelling benefits.
For Ncat = 4 the CHOD and UPSIT classification errors are 8.7% and 26.2% indoors and 27.6% and 32.2% outdoors using the spectral response feature m_ij. When the amplitude-based feature m_i is used these errors increase to 27.3% and 31.9% indoors and 36.8% and 51.3% outdoors respectively. As shown in Fig. 7.3 the amplitude-based feature continues to underperform the spectral response feature across all values of Ncat. It also shows a greater tendency to make spurious classifications in the absence of odorants, with detection rates on the Control Dataset that are 30-75% higher than random chance.
Relative to human performance on the UPSIT, the electronic nose's indoor error of 26-32%
is comparable to that of test subjects 70-79 years of age. Subjects 10-59 years of age
outperform the electronic nose with only 4-18% error, whereas those over 80 years show
mean error rates in excess of 36% [DSKa84].
[Figure 7.3 appears here: Indoor, Outdoor and Indoor/Outdoor panels plotting classification error (%) versus Ncats for the harmonic and amplitude features on the CHOD, UPSIT and Control datasets, along with human UPSIT error for subjects aged 10-59 and 70-79 and the random-chance level.]
Figure 7.3: Classification error using features based on sensor response amplitude and harmonic decomposition. For comparison we also show the UPSIT testing error [DSKa84] for human test subjects 10-59 years of age (who perform better than our instrument) and 70-79 years of age (who perform roughly the same). The combined Indoor/Outdoor dataset uses data taken indoors and outdoors as separate training and testing sets.
7.4 Feature Consistency
Next we would like to test whether the spectral response features are reproducible enough to
be used for classification across different environments over timescales of several months.
The Indoor/Outdoor dataset featured in the rightmost plot of Fig. 7.3 uses a classifier
trained on data taken indoors between October 3 and November 18 to test data taken
outdoors between November 19 and December 26. For comparison, the center plot uses the
outdoor dataset for both training and testing. Classification errors for the
Indoor/Outdoor CHOD are 8-14% higher than for the Outdoor CHOD, while those for the
UPSIT are 3-25% higher.
From this test alone it is unclear how much of this increase in classification error is
due to the change in environment and how much is due to sensor degradation. However,
polymer-carbon sensor arrays similar to ours have been shown to exhibit response changes
of less than 10% over time periods of 15-18 months [RMG+10]. What is less well understood
is how classification error is affected when training data acquired in an indoor laboratory
environment is used for testing in an uncontrolled outdoor environment. This is roughly
analogous to the visual classification task of using images taken under controlled lighting
conditions in a relatively clutter-free environment to classify object categories in more
complex outdoor scenes with variable lighting, occlusion, etc.
Compared with the amplitude response feature (dotted lines), the full spectral response
of the sensor provides a feature that is significantly more accurate and more robust for
classification across indoor and outdoor environments.
7.4.1 Top-Down Category Recognition
The results presented so far are averaged over randomized subsets of Ncat categories. This
is an especially relevant benchmark when one does not know in advance what categories
to expect during testing. Such a benchmark does not, however, reveal how classification
performance changes from category to category, or how a given category classification may
be refined.
Broadly speaking the odor categories in the CHOD can be divided into four main
groups: food items, beverages, vegetation and miscellaneous household items. One may
of course make finer distinctions within each category, such as food items that are cheeses
or fruits, but such distinctions are inherently arbitrary and vary significantly according to
personal bias. Even a taxonomy such as WordNet [SR98], which groups words by meaning,
may or may not be relevant to the olfactory classification task. The fact that coffee and tea
are both in the "beverages" category, for example, does not provide any real insight into
whether they will emit similar odors.
A more experimentally meaningful taxonomy may be created using the inter-category
confusion observed during classification. This is represented as a matrix C_ij which counts
how often a member of category i is classified as belonging to category j. Diagonal
elements record the rate of correct classifications for each category while off-diagonal
elements count misclassifications. Hierarchically clustering this matrix results in a
taxonomy where successive branches represent increasingly difficult classification tasks.
As this process continues, categories that are most often confused ideally end up as
adjacent leaves on the tree.
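A minimal sketch of this recursive splitting follows, with ordinary spectral clustering standing in for the self-tuning variant introduced below; build_taxonomy is a hypothetical helper, and the symmetrization of C into an affinity matrix is one reasonable choice rather than the thesis's exact recipe.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def build_taxonomy(C, cats):
        # Recursively split a set of categories in two, grouping those that
        # the classifier confuses most often. C is the confusion matrix
        # restricted to cats (rows = true category, columns = predicted).
        if len(cats) <= 1:
            return cats[0]
        A = (C + C.T) / 2.0              # symmetric affinity: mutual confusion
        np.fill_diagonal(A, 0.0)
        labels = SpectralClustering(
            n_clusters=2, affinity='precomputed').fit_predict(A + 1e-9)
        left = np.flatnonzero(labels == 0)
        right = np.flatnonzero(labels == 1)
        if len(left) == 0 or len(right) == 0:
            return list(cats)            # degenerate split: stop recursing
        return (build_taxonomy(C[np.ix_(left, left)], [cats[i] for i in left]),
                build_taxonomy(C[np.ix_(right, right)], [cats[i] for i in right]))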
Following our work with the Caltech-256 image dataset [GP08] we create a taxonomy
of odor categories by recursively clustering the olfactory confusion matrix via self-tuning
spectral clustering [PZM04]. Results for the Indoor CHOD are shown in Fig. 7.4. A binary
classifier is generated to evaluate membership in each branch of the tree. This is done
by randomly selecting 2 training examples per category and assigning positive or negative
labels depending on whether the category belongs to the branch. The remaining examples
are then used to evaluate the performance of each classifier.
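The per-branch evaluation could then be sketched as follows, with a linear SVM standing in for whatever classifier is actually used; branch_accuracy, X, y and branch_cats are hypothetical names.

    import numpy as np
    from sklearn.svm import LinearSVC

    def branch_accuracy(X, y, branch_cats, n_train=2, seed=0):
        # Train a binary "does this odorant belong to the branch?" classifier
        # on n_train randomly chosen examples per category, then score it on
        # all remaining examples.
        rng = np.random.default_rng(seed)
        train = []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            train.extend(rng.permutation(idx)[:n_train])
        train = np.array(train)
        test = np.setdiff1d(np.arange(len(y)), train)
        labels = np.isin(y, list(branch_cats))   # positive iff category in branch
        clf = LinearSVC().fit(X[train], labels[train])
        return clf.score(X[test], labels[test])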
With branch nodes color-coded by performance, the taxonomy reveals what randomized
subsets do not, namely, which individual categories and super-categories are detectable by
the instrument at a given performance threshold. The clustering process is prone to errors
in part because of uncertainty in the individual elements of the confusion matrix. Some
odorants, such as individual flowers and cheeses, were practically undetectable with our
instrument, making it impossible to establish their taxonomic relationships with any
certainty. Other odorants with low individual detection rates showed relatively high
inter-category confusion. For example, all the spices except mustard are located on a
single sub-branch which can be detected with 42% accuracy even though the individual spice
categories in that branch all have detection rates below 5%.
Thus while it is possible to make refined guesses for some categories, other "undetectable"
categories are in fact detectable only when pooled into meaningful super-categories.
Building a top-down classification taxonomy for a given instrument provides the flexibility
to exchange classifier performance for specificity depending on the odor categories and
application requirements.
[Figure 7.4 appears here: the automatically generated odor taxonomy, a binary tree whose leaves are the CHOD categories (e.g., beverage:alcoholic:wine:Chardonnay, food:fruit:rosids:citrus:lemon:juice, vegetation:herbs:mint) and whose branch nodes are colored by classification performance on a 0-100% scale.]
Figure 7.4: The confusion matrix for the Indoor Common Household Odor Dataset is used to automatically generate a top-down hierarchy of odor categories. Branches in the tree represent splits in the confusion matrix that minimize intercluster confusion. As the depth of the tree increases with successive splits, the categories in each branch become more and more difficult for the electronic nose to distinguish. The color of each branch node represents the classification performance when determining whether an odorant belongs to that branch. This helps characterize the instrument by showing which odor categories and super-categories are readily detectable and which are not. Highlighted categories show the relationships discovered between the wine, lemon and tea categories whose features are shown in Fig. 3. The occurrence of wine and citrus categories in the same top-level branch indicates that these odor categories are harder to distinguish from one another than they are from tea.
Chapter 8
Machine Olfaction: Discussion
In this chapter we explore several design parameters for an electronic nose with the goal of
optimizing performance while minimizing environmental sensitivity. The spectral response
profiles of a set of 10 carbon black-polymer composite thin-film resistors were directly
measured using a portable apparatus which switches between reference air and odorants
over a range of frequencies. Compared to a feature based only on the fractional change in
sensor resistance, the spectral feature gives significantly better classification performance
while remaining relatively invariant to water vapor fluctuations and other environmental
systematics.
After acquiring two 400 s sniffs of every odorant in a set of 90 common household odor
categories, the instrument is presented with unlabeled odorants, each of which it also
sniffs twice. The features extracted from these sniffs are used to select the most likely
category label out of Ncat options. Given Ncat = 4 possible choices and an indoor training
set, the correct label was found 91% of the time indoors and 72% of the time outdoors
(compared to 25% for random guessing). Fig. 7.3 shows how the classification error
increases with Ncat as the task becomes more difficult. The instrument's score on the
UPSIT is roughly comparable to scores obtained from elderly human subjects.
Sampling 130 different odor categories in both indoor and outdoor environments required
250 hours of data acquisition and roughly an equal amount of time purging, cleaning
and preparing sample chambers. Fortunately, high-frequency subsniffs in the 1-1/8 Hz
range provided 50% better olfactory classification performance than an equal time-segment
of relatively low-frequency subsniffs in the 1/16-1/64 Hz range. By focusing on higher
frequencies, sniff time can be cut in half with only a marginal 5-10% drop in overall
performance.
Judging from progress in the fields of machine vision and olfactory psychophysics, it is
reasonable to expect that the number and variety of odorants used in electronic nose
experiments will only increase with time. Hierarchical classification frameworks scale well
to large numbers of categories and provide error rates for specific categories as well as
super-categories. There are many potential advantages to such an approach, including the
ability to predict category performance at different levels of specificity. Identifying
easily confused category groupings and sub-groupings may also reveal instrumental "blind
spots" that can then be addressed by employing different sensor technologies, sniffing
techniques or feature extraction algorithms.
[Samples]
UPSIT Categories: pizza, bubble gum, menthol, cherry, motor oil, mint, banana, clove,
leather, coconut, onion, fruit punch, licorice, cheddar cheese, cinnamon, gasoline,
strawberry, cedar, chocolate, ginger, lilac, turpentine, peach, root beer, dill pickle,
pineapple, lime, orange, wintergreen, watermelon, paint thinner, grass, smoke, pine, grape,
lemon, soap, natural gas, rose, peanut
CHOD Categories: alcohol, allspice, apple, apple juice, avocado, banana, basil, bay
leaves, beer (Guinness Extra Stout), bleach (regular, chlorine-free and lavender),
cardboard, cayenne pepper, cheese (cheddar, provolone, swiss), chili powder, chlorinated
water, chocolate (milk and dark), cilantro, cinnamon, cloves, coffee (Lavazza, Trader Joe's
house blend dark), espresso (Lavazza, Trader Joe's house blend dark), cottage cheese,
Equal, flowers (Rosa Rosideae, Ageratum houstonianum, Achillea millefolium), gasoline,
Gatorade (orange), grapes, grass, honeydew melon, hydrogen peroxide, kiwi fruit, lavender,
lemon (juice, pulp, peel), lime (slice, peel only, pulp only), mango, melon, milk (2%),
mint, mouth rinse, mustard (powder and French's yellow), orange juice, paint thinner,
parsley, peanut butter, pine, pineapple, raspberries, red pepper, rice, rosemary, salt,
soy milk (regular and vanilla), soy sauce, strawberry, sugar, tea (Genmai Cha, English
Breakfast, Irish Breakfast, Russian Caravan), toasted sesame oil, tomato, tuna, vanilla
cookie fragrance oil, vanilla extract, vinegar (apple, distilled, red wine, rice), Windex
(regular and vinegar), wine (Cabernet Sauvignon, Chardonnay, Moscato, White Zinfandel)
Bibliography
[AB03] S Ampuero and J O Bosset. The electronic nose applied to dairy products: a review. Sensors and Actuators B: Chemical, 94(1):1–12, August 2003.
[ABM04] Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. Technical Report UCB/CSD-04-1366, EECS Department, University of California, Berkeley, 2004.
[AGF04] Yali Amit, Donald Geman, and Xiaodong Fan. A coarse-to-fine strategy for multiclass shape detection. IEEE Trans. Pattern Anal. Mach. Intell., 26(12):1606–1621, 2004.
[AMa01] KJ Albert, ML Myrick, et al. Field-deployable sniffer for 2,4-dinitrotoluene detection. Environmental . . ., 2001.
[AMHP07] Anelia Angelova, Larry Matthies, Daniel Helmick, and Pietro Perona. Learning slip behavior using automatic mechanical supervision. In 2007 IEEE International Conference on Robotics and Automation, pages 1741–1748. IEEE, 2007.
[BBM05] Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. CVPR, 1:26–33, 2005.
[Bie87] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[BL97] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In CVPR, pages 1000–1006, 1997.
[BL03] Shawn M Briglin and Nathan S. Lewis. Characterization of the Temporal Response Profile of Carbon Black-Polymer Composite Detectors to Volatile Organic Vapors. J. Phys. Chem. B, 107(40):11031–11042, October 2003.
[BMK11] Alexander Binder, Klaus-Robert Müller, and Motoaki Kawanabe. On Taxonomies for Multi-class Image Categorization. International Journal of Computer Vision, 99(3):281–301, January 2011.
[BO04] K Brudzewski and S Osowski. Classification of milk by means of an electronic nose and SVM neural network. Sensors and Actuators B: Chemical, 2004.
[BP96] Michael C. Burl and Pietro Perona. Recognition of planar object classes. In CVPR, pages 223–230, 1996.
[CdWL+98] W S Cain, R de Wijk, C Lulejian, F Schiet, et al. Odor identification: perceptual and semantic dimensions. Chemical Senses, 23(3):309–326, June 1998.
[CLTW10] M J Choi, J J Lim, A Torralba, and A S Willsky. Exploiting hierarchical context on a large database of object categories. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 129–136, 2010.
[Dav75] R G Davis. Acquisition of Verbal Associations to Olfactory Stimuli of Varying Familiarity and to Abstract Visual Stimuli. Journal of Experimental Psychology: Human Learning . . ., 1975.
[DAZa85] R L Doty, S Applebaum, H Zusho, et al. Sex differences in odor identification ability: a cross-cultural analysis. Neuropsychologia, 1985.
[DDNPa08] A D'Amico, C Di Natale, Roberto Paolesse, et al. Olfactory systems for medical applications. Sensors and Actuators B: Chemical, 2008.
[DDS+09] Jia Deng, Wei Dong, R Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pages 248–255. IEEE, 2009.
[DHG+03] Ritaban Dutta, E L Hines, J W Gardner, K R Kashwan, and M Bhuyan. Tea quality prediction using a tin oxide-based electronic nose: an artificial intelligence approach. Sensors and Actuators B: Chemical, 94(2):228–237, September 2003.
[DNDD96] C Di Natale, FAM Davide, and A D'Amico. An electronic nose for the recognition of the vineyard of a red wine. Sensors and Actuators B: Chemical, 1996.
[DNPa07] C Di Natale, Roberto Paolesse, et al. Metalloporphyrins based artificial olfactory receptors. Sensors and Actuators B: Chemical, 2007.
[Dot97] Richard L Doty. Studies of Human Olfaction from the University of Pennsylvania Smell and Taste Center. Chemical Senses, 22(5):565–586, 1997.
[Dra85] A Dravnieks. Atlas of odor character profiles. ASTM data series, ed. ASTM Committee E-18 on Sensory Evaluation of Materials and Products, Section E-18.04.12 on Odor Profiling, page 354, 1985.
[DRNP94] M Dragovan, J E Ruhl, G Novak, and S R Platt. The Astrophysical Journal, 427:L67, 1994.
[DSa84a] R L Doty, P Shaman, et al. Development of the University of Pennsylvania Smell Identification Test: a standardized microencapsulated test of olfactory function. Physiology & Behavior, 1984.
[DSA+84b] R L Doty, P Shaman, S L Applebaum, R Giberson, L Siksorski, and L Rosenberg. Smell identification ability: changes with age. Science, 226(4681):1441–1443, 1984.
[DSKa84] R L Doty, P Shaman, C P Kimmelman, et al. University of Pennsylvania Smell Identification Test: a rapid quantitative olfactory function test for the clinic. The Laryngoscope, 1984.
[DVHV01] T Dewettinck, K Van Hege, and W Verstraete. The electronic nose as a rapid sensor for volatile compounds in treated domestic wastewater. Water Research, 35(10):2475–2483, 2001.
[E+05] Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In MLCW, pages 117–176, 2005.
[Eea06] Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (PASCAL Workshop 05), number 3944 in Lecture Notes in Artificial Intelligence, pages 117–176, Southampton, UK, 2006.
[EFL+05] A Eibenstein, A B Fioretti, C Lena, N Rosati, G Amabile, and M Fusetti. Modern psychophysical tests to assess olfactory function. Neurological Sciences, 26(3):147–155, July 2005.
[EZWVG06] M Everingham, A Zisserman, C Williams, and L Van Gool. The PASCAL visual object classes challenge 2006 (VOC 2006) results, 2006.
[Fan05] Xiaodong Fan. Efficient multiclass object detection by a hierarchy of classifiers. In CVPR (1), pages 716–723, 2005.
[FCHW08] R E Fan, K W Chang, C J Hsieh, and X R Wang. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 2008.
[Fel98] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May 1998.
[Fer05] R. Fergus. Visual Object Category Recognition. PhD thesis, University of Oxford, 2005.
[FFFP04] Li Fei-Fei, R Fergus, and P Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference on, page 178, 2004.
[FG01] Francois Fleuret and Donald Geman. Coarse-to-fine face detection. International Journal of Computer Vision, 41(1/2):85–107, 2001.
[FPZ03] Robert Fergus, Pietro Perona, and Andrew Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[Ga01] P Gostelow et al. Odour measurements for sewage treatment works. Water Research, 2001.
[Gar04] J W Gardner. Electronic noses & sensors for the detection of . . ., 2004.
[GD05a] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, pages 1458–1465, 2005.
[GD05b] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, pages 1458–1465, 2005.
[GD06] Kristen Grauman and Trevor Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR (1), pages 19–25, 2006.
[GHP06] Greg Griffin, Alex Holub, and Pietro Perona. Caltech-256 image dataset. Technical report, Computation and Neural Systems Dept., 2006.
[GHP07] G Griffin, A Holub, and P Perona. Caltech-256 object category dataset, 2007.
[GP08] G Griffin and P Perona. Learning and using taxonomies for fast visual categorization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008.
[GS92] JW Gardner and HV Shurmer. Application of an electronic nose to the discrimination of coffees. Sensors and Actuators B: Chemical, 1992.
[GSH00] J W Gardner, H W Shin, and E L Hines. An electronic nose system to diagnose illness. Sensors and Actuators B: Chemical, 2000.
[GSHD00] Julian W Gardner, Hyun Woo Shin, Evor L Hines, and Crawford S Dow. An electronic nose system for monitoring the quality of potable water. Sensors and Actuators B: Chemical, 69(3):336–341, October 2000.
[GVMSB09] Mahdi Ghasemi-Varnamkhasti, Seyed Saeid Mohtasebi, Maryam Siadat, and Sundar Balasubramanian. Meat Quality Assessment by Electronic Nose (Machine Olfaction Technology). Sensors, 9(8):6058–6083, August 2009.
[HCLK08] C J Hsieh, K W Chang, C J Lin, and S S Keerthi. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[HWP05] Alex Holub, Max Welling, and Pietro Perona. Combining generative models and Fisher kernels for object recognition. In ICCV, pages 136–143, 2005.
[KKER11] Alexei A Koulakov, Brian E Kolterman, Armen G Enikolopov, and Dmitry Rinberg. In search of the structure of human olfactory space. Frontiers in Systems Neuroscience, 5:65, January 2011.
[KWTT03] John Kauer, Joel White, Timothy Turner, and Barbara Talamo. Principles of Odor Recognition by the Olfactory System Applied to Detection of Low-Concentration Explosives. January 2003.
[Lan86] D Lancet. Vertebrate olfactory reception. Annual Review of Neuroscience, 1986.
[LEZP99] T S Lorig, D G Elmes, D H Zald, and J V Pardo. A computer-controlled olfactometer for fMRI and electrophysiological studies of olfaction. Behavior Research Methods, 31(2):370–375, 1999.
[LFP04a] F.F. Li, R. Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision (WGMBV), 2004.
[LFP04b] F.F. Li, R. Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR (1), 2004.
[Li05] Tao Li. Hierarchical document classification using automatically generated hierarchy. In SDM, 2005.
[Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
[LS04] Bastian Leibe and Bernt Schiele. Scale-invariant object categorization using a scale-adaptive mean-shift search. In DAGM-Symposium, pages 145–153, 2004.
[LSH05] J Lozano, J Santos, and M Horrillo. Classification of white wine aromas with an electronic nose. Talanta, 67(3):610–616, September 2005.
[LSP06a] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[LSP06b] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR (2), pages 2169–2178, 2006.
[LYW+05] Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 7(1):36–43, 2005.
[MGBW+08] S Maldonado, E García-Berríos, MD Woodka, BS Brunschwig, and NS Lewis. Detection of organic vapors and NH3 (g) using thin-film carbon black-metallophthalocyanine composite chemiresistors. Sensors and Actuators B: Chemical, 134(2):521–531, 2008.
[ML06a] Jim Mutch and David G. Lowe. Multiclass object recognition with sparse, localized features. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[ML06b] Jim Mutch and David G. Lowe. Multiclass object recognition with sparse, localized features. In CVPR (1), pages 11–18, 2006.
[MS05] K Mikolajczyk and C Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, October 2005.
[MS07] Marcin Marszałek and Cordelia Schmid. Semantic hierarchies for visual object recognition. In IEEE Conference on Computer Vision & Pattern Recognition, June 2007.
[MS08] Marcin Marszałek and Cordelia Schmid. Constructing Category Hierarchies for Visual Recognition. In Computer Vision - ECCV 2008, pages 479–491. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[MTS+05] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A Comparison of Affine Region Detectors. International Journal of Computer Vision, 2005.
[NFM90] T Nakamoto, K Fukunishi, and T Moriizumi. Identification capability of odor sensor using quartz-resonator array and neural-network pattern recognition. Sensors and Actuators B: Chemical, 1990.
[NMMa99] C Di Natale, A Mantini, Antonella Macagnano, et al. Electronic nose analysis of urine samples containing blood. Physiological . . ., 1999.
[NNM96] S. Nene, S. Nayar, and H. Murase. Columbia Object Image Library (COIL), 1996.
[OP05] A. Opelt and A. Pinz. Object localization with boosting and weak supervision for generic object recognition. In Proceedings of the 14th Scandinavian Conference on Image Analysis (SCIA), 2005.
[Pa02] M Pardo et al. Coffee analysis with an electronic nose. Instrumentation and Measurement, 2002.
[PFSQ01] M Pardo, G Faglia, G Sberveglieri, and L Quercia. In IMTC 2001. Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics, pages 123–127. IEEE, 2001.
[Phi05] M Phillips. Can the Electronic Nose Really Sniff out Lung Cancer? American Journal of Respiratory and Critical Care Medicine, 2005.
[PZM04] P Perona and L Zelnik-Manor. Self-Tuning Spectral Clustering. Advances in Neural Information Processing Systems, 2004.
[RB08] F Rock and N Barsan. Electronic nose: current status and future trends. Chemical Reviews, 2008.
[Ree90] Randall R Reed. How does the nose know? Cell, 60(1):1–2, January 1990.
[RMG+10] M A Ryan, K S Manatt, S Gluck, A V Shevade, A K Kisor, H Zhou, L M Lara, and M L Homer. The JPL electronic nose: Monitoring air in the US Lab on the International Space Station. pages 1242–1247, 2010.
[RSCC+08] J A Ragazzo-Sanchez, P Chalier, D Chevalier, M Calderon-Santoyo, and C Ghommidh. Identification of different alcoholic beverages by electronic nose coupled to GC. Sensors and Actuators B: Chemical, 134(1):43–48, 2008.
[RTMF05] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. MIT AI Lab Memo AIM-2005-025, 2005.
[SCA+11] J P Santos, J M Cabellos, T Arroyo, M C Horrillo, and Perena Gouma. In Proceedings of the 14th International Symposium on Olfaction and the Electronic Nose, AIP Conference Proceedings, pages 171–173. AIP, September 2011.
[Sho04] Stephen Shore. Uncommon Places: The Complete Works. Aperture, 2004.
[Sho05] Stephen Shore. American Surfaces. Phaidon Press, 2005.
[SL03] BC Sisk and NS Lewis. Estimation of chemical and physical characteristics of analyte vapors through analysis of the response data of arrays of polymer-carbon black composite vapor detectors. Sensors and Actuators B: Chemical, 96(1-2):268–282, 2003.
[SR95] S L Sullivan and K J Ressler. Spatial patterning and information coding in the olfactory system. Current Opinion in Genetics & Development, 1995.
[SR98] Michael M Stark and Richard F Riesenfeld. WordNet: An electronic lexical database, 1998.
[SWP05] Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. In CVPR (2), pages 994–1000, 2005.
[TM04] Anthony P F Turner and Naresh Magan. Innovation: Electronic noses and disease diagnostics. Nature Reviews Microbiology, 2(2):161–166, February 2004.
[TMF04] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 762–769, Washington, DC, June 2004.
[TRY10] A Torralba, B C Russell, and J Yuen. LabelMe: Online Image Annotation and Applications. In Proceedings of the IEEE, pages 1467–1484, 2010.
[TTA+06] P T Timbie, G S Tucker, P A R Ade, S Ali, E Bierman, E F Bunn, C Calderon, A C Gault, P O Hyland, B G Keating, J Kim, A Korotkov, S S Malu, P Mauskopf, J A Murphy, C O'Sullivan, L Piccirillo, and B D Wandelt. The Einstein polarization interferometer for cosmology (EPIC) and the millimeter-wave bolometric interferometer (MBI). New Astronomy Reviews, 50(11-12):999–1008, December 2006.
[VJ01a] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, 2001.
[VJ01b] Paul A. Viola and Michael J. Jones. Robust real-time face detection. In ICCV, page 747, 2001.
[VJ04] Paul A. Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[VR07] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proceedings of the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007.
[WB09] A D Wilson and M Baietto. Applications and advances in electronic-nose technologies. Sensors, 9(7):5099–5148, 2009.
[WWP00] Markus Weber, Max Welling, and Pietro Perona. Unsupervised learning of models for recognition. In ECCV (1), pages 18–32, 2000.
[Yu07] H Yu. Discrimination of LongJing green-tea grade by electronic nose. Sensors and Actuators B: Chemical, 2007.
[ZBMM06a] Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[ZBMM06b] Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In CVPR (2), pages 2126–2136, 2006.
[ZEA+04] S Zampolli, I Elmi, F Ahmed, M Passini, GC Cardinali, S. Nicoletti, and L. Dori. An electronic nose based on solid state sensor arrays for low-cost indoor air quality monitoring applications. Sensors and Actuators B: Chemical, 101(1):39–46, 2004.
[ZMP04] Lihi Zelnik-Manor and P. Perona. Self-tuning spectral clustering. Eighteenth Annual Conference on Neural Information Processing Systems (NIPS), 2004.
[ZS06] M Zarzo and D T Stanton. Identification of latent variables in a semantic odor profile database using principal component analysis. Chemical Senses, 31(8):713–724, 2006.
[ZW07] Alon Zweig and Daphna Weinshall. Exploiting object hierarchy: Combining models from different category levels. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8, 14-21 Oct. 2007.
[ZZX+06] Qinyi Zhang, Shunping Zhang, Changsheng Xie, Dawen Zeng, Chaoqun Fan, Dengfeng Li, and Zikui Bai. Characterization of Chinese vinegars by electronic nose. Sensors and Actuators B: Chemical, 119(2):538–546, December 2006.