Learning and Using Taxonomies for Visual and Olfactory Category Recognition
Thesis by
Greg Griffin
In Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
California Institute of Technology
Pasadena, California
2013
(Defended March 2013)
© 2013
Greg Griffin
All Rights Reserved
For Leslie
Acknowledgements
Abstract
Dominant techniques for image classification over the last decade have mostly followed
a three-step process. In the first step, local features (such as SIFT) are extracted. In the
second step, some statistic is generated which compares the features in each new image to
the features in a set of training images. Finally, standard learning models (such as SVMs)
use these statistics to decide which class the image is most likely to belong to. Popular
models such as Spatial Pyramid Matching and Naive Bayes Nearest Neighbor adhere to
the same general linear 3-step process even while they differ in their choice of statistic and
learning model.
In this thesis I demonstrate a hierarchical approach which allows the above 3 steps to in-
teract more efficiently. Just as a child playing “20 questions” seeks to identify the unknown
object with the fewest number of queries, our learning model seeks to classify unlabeled
data with a minimal set of feature queries. The more costly feature extraction is, the more
beneficial this approach can be.
We apply this approach to three different applications: photography, microscopy and
olfaction. Image —. Microscope slide —. Finally, — both efficiency and accuracy. In
our tests, the resulting system functions well under real-world outdoor conditions and can
recognize a wide variety of real-world odors. It has the additional advantage of scaling well
to large numbers of categories for future experiments.
A small array of 10 conventional carbon black polymer composite sensors can be used
to classify a variety of common household odors under wide-ranging environmental con-
ditions. We test a total of 130 odorants of which 90 are common household items and
40 are from the UPSIT, a widely used scratch-and-sniff odor identification test. Airflow
over the sensors is modulated using a simple and inexpensive instrument which switches
between odorants and reference air over a broad range of computer-controlled frequencies.
Sensor data is then reduced to a feature which measures response over multiple harmonics
of the fundamental sniffing frequency. Using randomized sets of 4 odor categories, over-
all classification performance on the UPSIT exam is 73.8% indoors, 67.8% outdoors and
57.8% using indoor and outdoor datasets taken a month apart for training and testing. For
comparison, human performance on the same exam is 88-95% for subjects 10-59 years of
age. Using a variety of household odors the same testing procedure yields 91.3%, 72.4%
and 71.8% performance, respectively. We then examine how classification error changes
as a function of sniffing frequencies, number of sensors and number of odor categories.
Rapid sniffing outperforms low-frequency sniffing in outdoor environments, while addi-
tional sensors simultaneously reduce classification error and improve background noise
rejection. Moreover classifiers learned using data taken indoors with ordinary household
odorants seem to generalize well to data acquired in a more variable outdoor environment.
Finally we demonstrate a framework for automatically building olfactory hierarchies to
facilitate coarse-to-fine category detection.
Contents

Acknowledgements
Abstract
List of Figures

1 Introduction
  1.1 Visual Classification
  1.2 Olfactory Classification
  1.3 Moving Into The Wild: Face Recognition
  1.4 Hierarchical Approach

2 The Caltech 256
  2.1 Introduction
  2.2 Collection Procedure
    2.2.1 Image Relevance
    2.2.2 Categories
    2.2.3 Taxonomy
    2.2.4 Background
  2.3 Benchmarks
    2.3.1 Performance
    2.3.2 Localization and Segmentation
    2.3.3 Generality
    2.3.4 Background
  2.4 Results
    2.4.1 Size Classifier
    2.4.2 Correlation Classifier
    2.4.3 Spatial Pyramid Matching
    2.4.4 Generality
    2.4.5 Background
  2.5 Conclusion

3 Visual Hierarchies
  3.1 Introduction
  3.2 Experimental Setup
    3.2.1 Training and Testing Data
    3.2.2 Spatial Pyramid Matching
    3.2.3 Measuring Performance
    3.2.4 Hierarchical Approach
  3.3 Building Taxonomies
  3.4 Top-Down Classification Algorithm
  3.5 Results
  3.6 Conclusions

4 Pollen Counting
  4.1 Basic Method
  4.2 Results
  4.3 Comparison To Humans
  4.4 Conclusions

5 Machine Olfaction: Introduction

6 Machine Olfaction: Methods
  6.1 Instrument
  6.2 Sampling and Measurements
  6.3 Datasets and Environment

7 Machine Olfaction: Results
  7.1 Subsniff Frequency
  7.2 Number of Sensors
  7.3 Feature Performance
  7.4 Feature Consistency
    7.4.1 Top-Down Category Recognition

8 Machine Olfaction: Discussion

Bibliography
List of Figures
2.1 Examples of a 1, 2 and 3 rating for images downloaded using the keyword dice.

2.2 Summary of Caltech image datasets. There are actually 102 and 257 categories if the clutter categories in each set are included.

2.3 Distribution of image sizes as measured by √(width · height), and aspect ratios as measured by width/height. Some common image sizes and aspect ratios that are over-represented are labeled above the histograms. Overall in Caltech-256 the mean image size is 351 pixels while the mean aspect ratio is 1.17.

2.4 Histogram showing number of images per category. Caltech-101's largest categories faces-easy (435), motorbikes (798) and airplanes (800) are shared with Caltech-256. An additional large category t-shirt (358) has been added. The clutter categories for Caltech-101 (467) and 256 (827) are identified with arrows. This figure should be viewed in color.

2.5 Precision of images returned by Google. This is defined as the total number of images rated good divided by the total number of images downloaded (averaged over many categories). As more images are downloaded, it becomes progressively more difficult to gather large numbers of images per object category. For example, to gather 40 good images per category it is necessary to collect 120 images and discard 2/3 of them. To gather 160 good images, expect to collect about 640 images and discard 3/4 of them.

2.6 A taxonomy of Caltech-256 categories created by hand. At the top level these are divided into animate and inanimate objects. Green categories contain images that were borrowed from Caltech-101. A category is colored red if it overlaps with some other category (such as dog and greyhound).

2.7 Examples of clutter generated by cropping the photographs of Stephen Shore [Sho04, Sho05].

2.8 Performance of all 256 object categories using a typical pyramid match kernel [LSP06a] in a multi-class setting with Ntrain = 30. This performance corresponds to the diagonal entries of the confusion matrix, here sorted from largest to smallest. The ten best performing categories are shown in blue at the top left. The ten worst performing categories are shown in red at the bottom left. Vertical dashed lines indicate the mean performance.

2.9 The mean of all images in five randomly chosen categories, as compared to the mean clutter image. Four categories show some degree of concentration towards the center while refrigerator and clutter do not.

2.10 The 256 × 256 matrix M for the correlation classifier described in subsection 2.4.2. This is the mean of 10 separate confusion matrices generated for Ntrain = 30. A log scale is used to make it easier to see off-diagonal elements. For clarity we isolate the diagonal and row 82 (galaxy) and describe their meaning in Figure 2.11.

2.11 A more detailed look at the confusion matrix M from figure 2.10. Top: row 82 shows which categories were most likely to be confused with galaxy. These are: galaxy, saturn, fireworks, comet and mars (in order of greatest to least confusion). Bottom: the largest diagonal elements represent the categories that are easiest to classify with the correlation algorithm. These are: self-propelled-lawn-mower, motorbikes-101, trilobite-101, guitar-pick and saturn. All of these categories tend to have objects that are located consistently between images.

2.12 Performance as a function of Ntrain for Caltech-101 and Caltech-256 using the 3 algorithms discussed in the text. The spatial pyramid matching algorithm is that of Lazebnik, Schmid and Ponce [LSP06a]. We compare our own implementation with their published results, as well as the SVM-KNN approach of Zhang, Berg, Maire and Malik [ZBMM06a].

2.13 Selected rows and columns of the 256 × 256 confusion matrix M for spatial pyramid matching [LSP06a] and Ntrain = 30. Matrix elements containing 0.0 have been left blank. The first 6 categories are chosen because they are likely to be confounded with the last 6 categories. The main diagonal shows the performance for just these 12 categories. The diagonals of the other 2 quadrants show whether the algorithm can detect categories which are similar but not exact.

2.14 ROC curve for three different interest classifiers described in section 2.4.5. These classifiers are designed to focus the attention of the multi-category detectors benchmarked in Figure 3.2. Because Detector B is roughly 200 times faster than A or C, it represents the best tradeoff between performance and speed. This detector can accurately detect 38.2% of the interesting (non-clutter) images with a 0.1% rate of false detections. In other words, 1 in 1000 of the images classified as interesting will instead contain clutter (solid red line). If a 1 in 100 rate of false detections is acceptable, the accuracy increases to 58.6% (dashed red line).

2.15 In general the Caltech-256 images are more difficult to classify than the Caltech-101 images. Here we plot performance of the two datasets over a random mix of Ncategories from each dataset. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncategories = 100 the performance is ∼60% lower.

3.1 A typical one-vs-all multi-class classifier (top) exhaustively tests each image against every possible visual category, requiring Ncat decisions per image. This method does not scale well to hundreds or thousands of categories. Our hierarchical approach uses the training data to construct a taxonomy of categories which corresponds to a tree of classifiers (bottom). In principle each image can now be classified with as few as log2 Ncat decisions. The above example illustrates this for an unlabeled test image and Ncat = 8. The tree we actually employ has slightly more flexibility, as shown in Fig. 3.4.

3.2 Performance comparison between Caltech-101 and Caltech-256 datasets using the Spatial Pyramid Matching algorithm of Lazebnik et al. [LSP06b]. The performance of our implementation is almost identical to that reported by the original authors; any performance difference may be attributed to a denser grid used to sample SIFT features. This illustrates a standard non-hierarchical approach where authors mainly present the number of training examples and the classification performance, without also plotting classification speed.

3.3 In general the Caltech-256 [GHP06] images are more difficult to classify than the Caltech-101 images. Here we fix Ntrain = 30 and plot performance of the two datasets over a random mix of Ncat categories chosen from each dataset. The solid region represents a range of performance values for 10 randomized subsets. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncat = 100 the performance is ∼60% lower (dashed red line).

3.4 A simple hierarchical cascade of classifiers (limited to two levels and four categories for simplicity of illustration). We call A, B, C and D four sets of categories as illustrated in Fig 3.5. Each white square represents a binary branch classifier. Test images are fed into the top node of the tree where a classifier assigns them to either the set A ∪ B or the set C ∪ D (white square at the center-top). Depending on the classification, the image is further classified into either A or B, or C or D. Test images ultimately terminate in one of the 7 red octagonal nodes where a conventional multi-class node classifier makes the final decision. For a two-level ℓ = 2 tree, images terminate in one of the 4 lower octagonal nodes. If ℓ = 0 then all images terminate in the top octagonal node, which is equivalent to conventional non-hierarchical classification. The tree is not necessarily perfectly balanced: A, B, C and D may have different cardinality. Each branch or node classifier is trained exclusively on images extracted from the sets that the classifier is discriminating. See Sec. 3.4 for details.

3.5 Top-down grouping as described in Sec. 3.3. Our underlying assumption is that categories that are easily confused should be grouped together in order to build the branch classifiers in Fig 3.4. First we estimate a confusion matrix using the training set and a leave-one-out procedure. Shown here is the confusion matrix for Ntrain = 10, with diagonal elements removed to make the off-diagonal terms easier to see.

3.6 Taxonomy discovered automatically by the computer, using only a limited subset of Caltech-256 training images and their labels. Aside from these labels there is no other human supervision; branch membership is not hand-tuned in any way. The taxonomy is created by first generating a confusion matrix for Ntrain = 10 and recursively dividing it by spectral clustering. Branches and their categories are determined solely on the basis of the confusion between categories, which in turn is based on the feature-matching procedure of Spatial Pyramid Matching. To compare this with some recognizably human categories we color code all the insects (red), birds (yellow), land mammals (green) and aquatic mammals (blue). Notice that the computer's hierarchy usually begins with a split that puts all the plant and animal categories together in one branch. This split is found automatically with such consistency that in a third of all randomized training sets not a single category of living thing ends up on the opposite branch.

3.7 The taxonomy from Fig. 3.6 is reproduced here to illustrate how classification performance can be traded for classification speed. Node A represents an ordinary non-hierarchical one-vs-all classifier implemented using an SVM. This is accurate but slow because of the large combined set of support vectors in Ncats = 256 individual binary classifiers. At the other extreme, each test image passes through a series of inexpensive binary branch classifiers until it reaches 1 of the 256 leaves, collectively labeled C above. A compromise solution B invokes a finite set of branch classifiers prior to final multi-class classification in one of 7 terminal nodes.

3.8 Comparison of three different methods for generating taxonomies. For each taxonomy we vary the number of branch comparisons prior to final classification, as illustrated in Fig. 3.4. This results in a tradeoff between performance and speed as one moves between two extremes A and C. Randomly generated hierarchies result in poor cascade performance. Of the three methods, taxonomies based on Spectral Clustering yield marginally better performance. All three curves measure performance vs. speed for Ncat = 256 and Ntrain = 10.

3.9 Cascade performance / speed trade-off as a function of Ntrain. Values of Ntrain = 10 and Ntrain = 50 result in a 5-fold and 20-fold speed increase (respectively) for a fixed 10% performance drop.

4.1

6.1 A fan draws air from 1 of 4 odorant chambers or an empty reference chamber, depending on the state of the computer-controlled solenoid valve. The valve control signal can then be compared to the resistance changes recorded from an array of 10 individual sensors as shown in Fig. 6.2.

6.2 (a) A sniff consists of 7 individual subsniffs s1...s7 of sensor data taken as the valve switches between a single odorant and reference air. From this data a 7 × 4 = 28 size feature m is generated representing the measured power in each of the 7 subsniffs i over 4 fundamental harmonics j. For comparison purposes a simple amplitude feature differences the top and bottom 5% quartiles of ∆R/R in each subsniff. (b) As the switching frequency f increases by powers of 2 so does the number of pulses, so that the time period T is constant for all but the first subsniff. (c) To illustrate how m is measured we show the harmonic decomposition of just s4, highlighted in (a). The corresponding measurements m4j are the integrated spectral power for each of 4 harmonics. Higher-order harmonics suffer from attenuation due to the limited time-constant of the sensors but have the advantage of being less susceptible to slow signal drift. Fitting a 1/f^n noise spectrum to the average indoor and outdoor frequency response of our sensors in the absence of any odorants illustrates why higher-frequency switching and higher-order harmonics may be especially advantageous in outdoor environments.

6.3 Visual representation of the harmonic decomposition feature m for 2 wines, 2 lemon parts and 2 teas from the Common Household Odors Dataset. Each odorant is sampled 4 times on 2 different days in 2 separate environments. Each box represents one complete 400 s sniff reduced to a 280-dimensional feature vector. Within each box, the 10 rows (y axis) show the response of different sensors over 28 frequencies (x axis) corresponding to 7 subsniffs and 4 harmonics. For visual clarity the columns are sorted by frequency and rows are sorted so that adjacent sensors are maximally correlated.

7.1 Classification performance for the University of Pittsburgh Smell Identification Test (UPSIT) and Common Household Odors Dataset (CHOD) for different sniff subsets using 4 and 16 categories for training and testing. For control purposes data is also acquired with empty odorant chambers. Compared with using the entire sniff (top), the high-frequency subsniffs (2nd row) outperform the low-frequency subsniffs (bottom), especially for Ncat = 16. Dotted lines show performance at the level of random guessing.

7.2 Classification error for all three datasets taken indoors and outdoors while varying the number of sensors and the number of categories used for training and testing. Each dotted colored line represents the mean performance over randomized subsets of 2, 4, 6 and 8 sensors out of the available 10. To illustrate this for a single value of Ncat, gray vertical lines mark the error averaged over randomized sets of 16 odor categories for the indoor and outdoor datasets. When the number of sensors increased from 4 to 10, the indoor error (left line) decreased by < 2% for the CHOD and UPSIT while the outdoor error (right line) decreased by 4-7%. The Control error is also important because deviations from random chance when no odor categories are present may suggest sensitivity to environmental factors such as water vapor. Here the indoor error for both 4 and 10 sensors remains consistent with 93.75% random chance while the outdoor error rises from 85.9% to 91.7%.

7.3 Classification error using features based on sensor response amplitude and harmonic decomposition. For comparison we also show the UPSIT testing error [DSKa84] for human test subjects 10-59 years of age (who perform better than our instrument) and 70-79 years of age (who perform roughly the same). The combined Indoor/Outdoor dataset uses data taken indoors and outdoors as separate training and testing sets.

7.4 The confusion matrix for the Indoor Common Household Odor Dataset is used to automatically generate a top-down hierarchy of odor categories. Branches in the tree represent splits in the confusion matrix that minimize intercluster confusion. As the depth of the tree increases with successive splits, the categories in each branch become more and more difficult for the electronic nose to distinguish. The color of each branch node represents the classification performance when determining whether an odorant belongs to that branch. This helps characterize the instrument by showing which odor categories and super-categories are readily detectable and which are not. Highlighted categories show the relationships discovered between the wine, lemon and tea categories whose features are shown in Fig. 6.3. The occurrence of wine and citrus categories in the same top-level branch indicates that these odor categories are harder to distinguish from one another than they are from tea.
List of Tables
Chapter 1
Introduction
1.1 Visual Classification
1.2 Olfactory Classification
1.3 Moving Into The Wild: Face Recognition
1.4 Hierarchical Approach
Chapter 2
The Caltech 256
We introduce a challenging set of 256 object categories containing a total of 30607 images.
The original Caltech-101 [LFP04a] was collected by choosing a set of object categories,
downloading examples from Google Images and then manually screening out all images
that did not fit the category. Caltech-256 is collected in a similar manner with several
improvements: a) the number of categories is more than doubled, b) the minimum number
of images in any category is increased from 31 to 80, c) artifacts due to image rotation
are avoided and d) a new and larger clutter category is introduced for testing background
rejection. We suggest several testing paradigms to measure classification performance,
then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial
pyramid matching [LSP06a] algorithm. Finally we use the clutter category to train an
interest detector which rejects uninformative background regions.
2.1 Introduction
Figure 2.1: Examples of a 1, 2 and 3 rating for images downloaded using the keyword dice.

Recent years have seen an explosion of work in the area of object recognition [LFP04a,
LSP06a, ZBMM06a, ML06a, Fer05, ABM04]. Several datasets have emerged as stan-
dards for the community, including the Coil [NNM96], MIT-CSAIL [TMF04], PASCAL
VOC [Eea06], Caltech-6 and Caltech-101 [LFP04a] and Graz [OP05] datasets. These
datasets have become progressively more challenging as existing algorithms consistently
saturated performance. The Coil set contains objects placed on a black background with
no clutter. The Caltech-6 consists of 3738 images of cars, motorcycles, airplanes, faces
and leaves. The Caltech-101 is similar in spirit to the Caltech-6 but has many more object
categories, as well as hand-clicked silhouettes of each object. The MIT-CSAIL database
contains more than 77,000 objects labeled within 23,000 images that are shown in a variety
of environments. The number of labeled objects, object categories and region categories
increases over time thanks to a publicly available LabelMe [RTMF05] annotation tool. The
PASCAL VOC 2006 database contains 5,304 images where 10 categories are fully anno-
tated. Finally, the Graz set contains three object categories in difficult viewing conditions.
These and other standardized sets of categories allow users to compare the performance of
their algorithms in a consistent manner.
Here we introduce the Caltech-256. Each category has a minimum of 80 images (com-
pared to the Caltech-101 where some classes have as few as 31 images). In addition we do
not left-right align the object categories as was done with the Caltech-101, resulting in a
more formidable set of categories.
Because Caltech-256 images are harvested from two popular online image databases,
they represent a diverse set of lighting conditions, poses, backgrounds, image sizes and
camera systematics. The categories were hand-picked by the authors to represent a wide
variety of natural and artificial objects in various settings. The organization is simple and
the images are ready to use, without the need for cropping or other processing. In most
cases the object of interest is prominent with a small or medium degree of background
clutter.
Dataset        Released   Categories   Total Images   Images Per Category
                                                      Min    Med    Mean   Max
Caltech-101    2003       102          9144           31     59     90     800
Caltech-256    2006       257          30607          80     100    119    827

Figure 2.2: Summary of Caltech image datasets. There are actually 102 and 257 categories if the clutter categories in each set are included.
In Section 2.2 we describe the collection procedures for the dataset. In Section 2.3
we give paradigms for testing recognition algorithms, including the use of the background
clutter class. Example experiments are provided in Section 2.4. Finally in Section 2.5 we
conclude with a general discussion of advantages and disadvantages of the set.
2.2 Collection Procedure
The object categories were assembled in a similar manner to the Caltech-101. A small
group of vision dataset users were asked to supply the names of roughly 300 object
categories. Images from each category were downloaded from both Google and PicSearch
using scripts. We required that the minimum size in either aspect be 100 pixels, with no
upper limit. Typically this procedure resulted in about 400-600 images from each category.
Duplicates were removed by detecting images which contained over 15 similar SIFT
descriptors [Low04].

Figure 2.3: Distribution of image sizes as measured by √(width · height), and aspect ratios as measured by width/height. Some common image sizes and aspect ratios that are over-represented are labeled above the histograms. Overall in Caltech-256 the mean image size is 351 pixels while the mean aspect ratio is 1.17.
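The duplicate check just described can be sketched as follows. This is an illustrative reconstruction rather than the code used to build the dataset: OpenCV's SIFT stands in for [Low04], the >15-descriptor threshold comes from the text, and the 0.75 ratio-test value is an assumed default.

```python
# Sketch (not the thesis code) of SIFT-based duplicate detection.
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

def is_duplicate(path_a, path_b, min_matches=15, ratio=0.75):
    da, db = descriptors(path_a), descriptors(path_b)
    if da is None or db is None:
        return False
    good = 0
    for pair in matcher.knnMatch(da, db, k=2):
        # Lowe's ratio test: keep a match only if it clearly beats
        # the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good > min_matches
```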
The images obtained were of varying quality. We asked 4 different subjects to rate these
images using the following criteria:
1. Good: A clear example of the visual category
2. Bad: A confusing, occluded, cluttered or artistic example
3. Not Applicable: Not an example of the object category
Sorters were instructed to label the image bad if either: (1) the image was very cluttered,
(2) the image was a line drawing, (3) the image was an abstract artistic representation, or
(4) the object within the image occupied only a small fraction of the image. If the image
contained no examples of the visual category it was labeled not applicable. Examples of
each of the 3 ratings are shown in Figure 2.1.

Figure 2.4: Histogram showing number of images per category. Caltech-101's largest categories faces-easy (435), motorbikes (798) and airplanes (800) are shared with Caltech-256. An additional large category t-shirt (358) has been added. The clutter categories for Caltech-101 (467) and 256 (827) are identified with arrows. This figure should be viewed in color.
The final set of images included in Caltech-256 are the ones that passed our size and
duplicate checks and were also rated good. Out of 304 original categories 48 had fewer
than 80 good images and were dropped, leaving 256 categories. Figure 2.3 shows the
distribution of the sizes of these final images.

In Caltech-101, categories such as minaret had a large number of images that were ar-
tificially rotated, resulting in large black borders around the image. This rotation created
artifacts which certain recognition systems exploited, resulting in deceptively high perfor-
mance and making such categories artificially easy to identify. We have not introduced
such artifacts into this set, instead collecting an entirely new minaret category which was
not artificially rotated.
In addition we did not consistently right-left align the object categories as was done in
Caltech-101. For example airplanes may now be facing in either the left or right direction.
This gives a better idea of what categorization performance would be like under realistic
conditions, unlike the Caltech-101 airplanes which are all facing right.
2.2.1 Image Relevance
We compiled statistics on the downloaded images to examine the typical yield of good
images. Figure 2.5 summarizes the results for images returned by Google. As expected,
the relevance of the images decreases as more images are returned. Some categories return
more pertinent results than others. In particular, certain categories carry dual semantic
meanings. For example the category pawn yields both the chess piece and also images of
pawn shops. The category egg is too ambiguous, because it yields images of whole eggs,
egg yolks, Faberge eggs, etc. which are not in the same visual category. These ambiguities
were often removed with a more specific keyword search, such as fried-egg.
When using Google images alone, 25.6% of the images downloaded were found to be
good. To increase the precision of image downloading we augmented the Google search
with PicSearch.
Since both search engines return largely non-overlapping sets of images, the overall
precision for the initial set of downloaded images increased, as both returned a high fraction
of good images initially. Now 44.4% of the images were usable. The true overall precision
was slightly lower as there was some overlap between the Google and PicSearch images.
A total of 9104 good images were gathered from PicSearch and 20677 from Google, out
of a total of 92652 downloaded images. Thus the overall sorting efficiency was 32.1%.

Figure 2.5: Precision of images returned by Google. This is defined as the total number of images rated good divided by the total number of images downloaded (averaged over many categories). As more images are downloaded, it becomes progressively more difficult to gather large numbers of images per object category. For example, to gather 40 good images per category it is necessary to collect 120 images and discard 2/3 of them. To gather 160 good images, expect to collect about 640 images and discard 3/4 of them.
2.2.2 Categories
The category numbering provides some insight into which categories are similar to an
existing category. Categories C1...C250 are relatively independent of one another, whereas
categories C251...C256 are closely related to other categories. These are airplane-101, car-
side-101, faces-easy-101, greyhound, tennis-shoe and toad, which are closely related to
fighter-jet, car-tire, people, dog, sneaker and frog respectively. We felt these 6 category
pairs would be the most likely to be confounded with one another, so it would be best to
remove one of each pair from the confusion matrix, at least for the standard benchmarking
procedure¹.
2.2.3 Taxonomy
Figure 2.6 shows a taxonomy of the final categories, grouped by animate and inanimate
and other finer distinctions. This taxonomy was compiled by the authors and is somewhat
arbitrary; other equally valid hierarchies can be constructed. The largest 30 categories from
Caltech-101 (shown in green) were included in Caltech-256, with additional images added
as needed to boost the number of images in each category to at least 80. Animate objects
- 69 categories in all - tend to be more cluttered than the inanimate objects, and harder to
identify. A total of 12 categories are marked in red to denote a possible relation with some
other visual category.
2.2.4 Background
Category C257 is clutter². For several reasons (see subsection 2.3.4) it is useful to have
such a background category, but the exact nature of this category will vary from set to set.
Different backgrounds may be appropriate for different applications, and the statistics of a
given background category can affect the performance of the classifier [HWP05].

¹While horseshoe-crab may seem to be a specific case of crab, the images themselves involve two entirely different sub-phyla of Arthropoda, which have clear differences in morphology. We find these easy to tell apart whereas frog and toad differences can be more subtle (none of our sorters were herpetologists). Likewise we feel that knife and swiss-army-knife are not confounding, even though they share some characteristics such as blades.

²For purposes here we will use the terms background and clutter interchangeably to indicate the absence or near-absence of any object categories.

Figure 2.6: A taxonomy of Caltech-256 categories created by hand. At the top level these are divided into animate and inanimate objects. Green categories contain images that were borrowed from Caltech-101. A category is colored red if it overlaps with some other category (such as dog and greyhound).

Figure 2.7: Examples of clutter generated by cropping the photographs of Stephen Shore [Sho04, Sho05].

For instance Caltech-6 contains a background set which consists of random pictures
taken around Caltech. The image statistics are no doubt biased by their specific choice
of location. The Caltech-101 contains a set of background images obtained by typing the
keyword “things” into Google. This can turn up a wide variety of objects not in Caltech-
101. However these images may or may not contain objects of interest that the user would
wish to classify.
Here we choose a different approach. The clutter category in Caltech-256 is derived
by cropping 947 images from the pictures of photographer Stephen Shore [Sho04, Sho05].
Images were cropped such that the final image sizes in the clutter category are representa-
tive of the distribution of image sizes found in all the other categories (figure 2.3). Those
cropped images which contained Caltech-256 categories (such as people and cars) were
manually removed, with a total of 827 clutter images remaining. Examples are shown in
Figure 2.7.
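The cropping step might look like the following sketch, assuming a list of (width, height) pairs sampled from the dataset's empirical size distribution; the function and argument names are illustrative, not taken from the thesis code.

```python
# Sketch of clutter generation: crop a region from a source photograph
# whose size is drawn from the size distribution of the other categories
# (figure 2.3).
import random
from PIL import Image

def make_clutter_crop(photo_path, category_sizes):
    photo = Image.open(photo_path)
    # Draw a (width, height) from the empirical dataset distribution.
    w, h = random.choice(category_sizes)
    if photo.width < w or photo.height < h:
        return None  # source photo too small for this draw
    left = random.randint(0, photo.width - w)
    top = random.randint(0, photo.height - h)
    return photo.crop((left, top, left + w, top + h))
```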
Figure 2.8: Performance of all 256 object categories using a typical pyramid match kernel [LSP06a] in a multi-class setting with Ntrain = 30. This performance corresponds to the diagonal entries of the confusion matrix, here sorted from largest to smallest. The ten best performing categories are shown in blue at the top left. The ten worst performing categories are shown in red at the bottom left. Vertical dashed lines indicate the mean performance.
We feel that this is an improvement over our previous clutter categories, since the im-
ages contain clutter in a variety of indoor and outdoor scenes. However it is still far from
perfect. For example some visual categories such as grass, brick and clouds appear to be
over-represented.
2.3 Benchmarks
Previous datasets suffered from non-standard testing and training paradigms, making di-
rect comparisons of certain algorithms difficult. For instance, results reported by Grau-
man [GD05a] and Berg [BBM05] were not directly comparable, as Berg used only 15
training examples while Grauman used 30³. Some authors used the same number of test
examples for each category, while others did not. This can be confusing if the results are
not normalized in a consistent way. For consistent comparisons between different
classification algorithms, it is useful to adopt standardized training and testing procedures.
2.3.1 Performance
First we select Ntrain and Ntest images from each class to train and test the classifier. Specif-
ically Ntrain = 5, 10, 15, 20, 25, 30, 40 and Ntest = 25.

Each test image is assigned to a particular class by the classifier. Performance of each
class C can be measured by determining the fraction of test examples for class C which
are correctly classified as belonging to class C. The cumulative performance is calculated
by counting the total number of correctly classified test images Ntest within each of Nclass
classes. It is of course important to weight each class equally in this metric. The easiest
way to guarantee this is to use the same number of test images for each class. Finally, better
statistics are obtained by averaging the above procedure multiple times (ideally at least 10
times) to reduce uncertainty.

³It should be noted that Grauman achieved results surpassing those of Berg in experiments conducted later.
The exact value of Ntest is not important. For Caltech-101 values higher than Ntrain =
30 are impossible since some categories contain only 31 images. However Caltech-256 has
at least 80 images in all categories. Even a training set size of Ntrain = 75 leaves Ntest ≥ 5
available for testing in all categories.
The confusion matrix Mij illustrates classification performance. It is a table where
each element i, j stores the fraction of the test images from category Ci that were classified
as belonging to Cj. Note that perfect classification would result in a table with ones along
the main diagonal. Even if such a classification method existed, this ideal performance
would not be reached for several reasons. Images in most categories contain instances of
other categories, which is a built-in source of confusion. Also our sorting procedure is
never perfect; there are bound to be some small fraction of incorrectly classified images in
a dataset of this size.
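A compact sketch of this protocol is given below: equal Ntest per class, a confusion matrix M whose rows sum to 1, and performance reported as the mean of the diagonal averaged over several randomized splits. The stand-in train_fn (any of the classifiers in this chapter) is assumed to return a model with a predict method; it is not defined here.

```python
# Sketch of the suggested benchmarking protocol, not the thesis code.
import numpy as np

def benchmark(train_fn, images_by_class, n_train=30, n_test=25, n_trials=10):
    n_class = len(images_by_class)
    scores = []
    for _ in range(n_trials):
        train_sets, test_sets = [], []
        for images in images_by_class:
            order = np.random.permutation(len(images))
            train_sets.append([images[k] for k in order[:n_train]])
            test_sets.append([images[k] for k in order[n_train:n_train + n_test]])
        model = train_fn(train_sets)          # model.predict(img) -> class index
        M = np.zeros((n_class, n_class))
        for i, tests in enumerate(test_sets):
            for img in tests:
                M[i, model.predict(img)] += 1.0 / len(tests)  # row i sums to 1
        scores.append(np.diag(M).mean())      # mean of the diagonal
    return np.mean(scores), np.std(scores)
```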
Since the last 6 categories are redundant with existing categories, and clutter indicates
the absence of any category, one might argue that only categories C1...C250 are appropriate
for generating performance benchmarks. Another justification for removing these last 6
categories when measuring overall performance is that they are among the easiest to iden-
tify. Thus removing them makes the detection task more challenging⁴.

⁴As shown in figure 2.13, categories C251, C252 and C253 each yield performance above 90%.

Figure 2.9: The mean of all images in five randomly chosen categories, as compared to the mean clutter image. Four categories show some degree of concentration towards the center while refrigerator and clutter do not.

However for better clarity and consistency, we suggest that authors remove only the
clutter category, generate a 256x256 confusion matrix with the remaining categories, and
report their performance results directly from the diagonal of this matrix⁵. It is also useful
for authors to post the confusion matrix itself, not just the mean of the diagonal.
2.3.2 Localization and Segmentation
Both Caltech-101 and the Caltech-256 contain categories in which the object may tend to
be centered (Figure 2.9). Thus, neither set is appropriate for localization experiments, in
which the algorithm must not only identify what object is present in the image but also
where the object is.
Furthermore we have not manually annotated the images in Caltech-256 so there is
presently no ground truth for testing segmentation algorithms.
⁵The difference in performance between the 250x250 and 256x256 matrix is typically less than a percent.
2.3.3 Generality
Why not remove the last 6 categories from the dataset altogether? Closely related categories
can provide useful information that is not captured by the standard performance metric. Is
a certain greyhound classifier also good at identifying dog, or does it only detect specific
breeds? Does a sneaker detector also detect images from tennis-shoe, a word which means
essentially the same thing? If it does not, one might worry that the algorithm is over-
training on specific features of the dataset which do not generalize to visual categories in
the real world.

For this reason we plot rows 251...256 of the confusion matrix along with the categories
which are most similar to these, and discuss the results in section 2.4.4.
2.3.4 Background
Consider the example of a Mars rover that moves around in its environment while taking
pictures. Raw performance only tells us the accuracy with which objects are identified. Just
as important is the ability to identify where there is an object of interest and where there
is only uninteresting background. The rover cannot begin to understand its environment if
background is constantly misidentified as an object.

The rover example also illustrates how the meaning of the word background is strongly
dependent on the environment and the application. Our choice of background images for
Caltech-256, as described in 2.2.4, is meant to reflect a variety of common (terrestrial)
environments.
Here we generate an ROC curve that tests the ability of the classification algorithm
to identify regions of interest. An ROC curve plots the rate of true positives against the
rate of false positives. In single-category detection the meaning of true positive and false
positive is unambiguous. Imagine that a search window of varied size scans across an
image employing some sort of bird classifier. Each true positive marks a successful
detection of a bird inside the scan window while each false positive indicates an erroneous
detection.

What do positive and negative mean in the context of multi-class classification? Con-
sider a two-step process in which each search window is evaluated by a cascade [VJ01a]
of two classifiers. The first classifier is an interest detector that decides whether a given
window contains an object category or background. Background regions are discarded to
save time, while all other images are passed to the second classifier. This more expensive
multi-class classifier now attempts to identify which of the remaining 256 object categories
best matches the region as described in 2.3.1.

Our ROC curve measures the performance of several interest classifiers. A false posi-
tive is any clutter image which is misclassified as containing an object of interest. Likewise
true positive refers to an object of interest that is correctly identified. Here “object of inter-
est” means any classification besides clutter.
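Given per-image SVM decision values, the curve itself can be traced as in the following sketch; the input array names are illustrative, and this is a generic threshold sweep rather than our exact procedure.

```python
# Sketch of the ROC computation for an interest detector: sweep the margin
# threshold, with positives = non-clutter ("interesting") images and
# negatives = clutter images.
import numpy as np

def interest_roc(margins, is_clutter):
    order = np.argsort(-np.asarray(margins))        # most confident first
    clutter = np.asarray(is_clutter, dtype=bool)[order]
    # Cumulative rates as the threshold is lowered past each example.
    false_pos = np.cumsum(clutter) / clutter.sum()       # clutter flagged interesting
    true_pos = np.cumsum(~clutter) / (~clutter).sum()    # objects correctly flagged
    return false_pos, true_pos
```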
2.4 Results
In this section we describe two simple classification algorithms as well as the more so-
phisticated spatial pyramid matching algorithm of Lazebnik, Schmid and Ponce [LSP06a].
Performance, generality and background rejection benchmarks are presented as examples
for discussion.
Figure 2.10: The 256 × 256 matrix M for the correlation classifier described in subsection 2.4.2. This is the mean of 10 separate confusion matrices generated for Ntrain = 30. A log scale is used to make it easier to see off-diagonal elements. For clarity we isolate the diagonal and row 82 (galaxy) and describe their meaning in Figure 2.11.
2.4.1 Size Classifier
Our first classifier used only the width and height of each image as features. During the
training phase, the width and height of all 256 · Ntrain images are stored in a 2-dimensional
space. Each test image is classified in a KNN fashion by voting among the 10 nearest neigh-
bors to each image. The 1-norm Manhattan distance yields slightly better performance than
the 2-norm Euclidean distance. As shown in Figure 3.2, this algorithm identifies the correct
category for an image 3.7 ± 0.6% of the time when Ntrain = 30.
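A minimal sketch of this baseline follows, with assumed array inputs (an (N, 2) array of training widths and heights, and integer class labels):

```python
# Sketch of the size-only baseline: a 10-nearest-neighbor vote in the
# 2-D (width, height) space under the 1-norm, as described above.
import numpy as np

def classify_by_size(test_wh, train_wh, train_labels, k=10):
    # Manhattan (1-norm) distance between the test image's (width, height)
    # and every training image's (width, height).
    dist = np.abs(np.asarray(train_wh) - np.asarray(test_wh)).sum(axis=1)
    votes = np.asarray(train_labels)[np.argsort(dist)[:k]]
    return np.bincount(votes).argmax()   # majority vote among the k neighbors
```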
Although identifying the correct object category 3.7% of the time seems like paltry per-
formance, we note that baseline (random guessing) would result in a performance of only
about 0.4% (1 in 256). This illustrates a danger inherent in many recognition datasets: the algorithm
Figure 2.11: A more detailed look at the confusion matrix M from figure 2.10. Top: row 82 shows which categories were most likely to be confused with galaxy. These are: galaxy, saturn, fireworks, comet and mars (in order of greatest to least confusion). Bottom: the largest diagonal elements represent the categories that are easiest to classify with the correlation algorithm. These are: self-propelled-lawn-mower, motorbikes-101, trilobite-101, guitar-pick and saturn. All of these categories tend to have objects that are located consistently between images.
can learn on ancillary features of the dataset instead of features intrinsic to the object cate-
gories. Such an algorithm will fail to identify categories if the images come from another
dataset with different statistics.
2.4.2 Correlation Classifier
The next classifier we employed was a correlation-based classifier. All images were resized
to Ndim × Ndim, desaturated and normalized to zero mean and unit variance. The nearest
neighbor was computed in the Ndim²-dimensional space of pixel intensities. This is
equivalent to finding the training image that correlates best with the test image, since

⟨(X − Y)²⟩ = ⟨X²⟩ + ⟨Y²⟩ − 2⟨XY⟩ = 2 − 2⟨XY⟩

for images X, Y normalized in this way, so minimizing the distance maximizes the
correlation ⟨XY⟩. Again we use the 1-norm instead of the 2-norm because it is faster to
compute and yields better classification performance.

Performance of 7.6 ± 0.7% at Ntrain = 30 is computed by taking the mean of the
diagonal of the confusion matrix in Figure 2.10.
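A sketch of this classifier is shown below, under the assumption that images have already been resized to Ndim × Ndim grayscale arrays; the helper names are illustrative.

```python
# Sketch of the correlation classifier: normalize each image, then take
# the nearest training image. The text uses the 1-norm in practice; with
# normalized vectors, the 2-norm distance would be equivalent (up to a
# constant) to maximizing the correlation <XY>.
import numpy as np

def normalize(img):
    # img: 2-D grayscale array, already resized to Ndim x Ndim.
    v = img.astype(float).ravel()
    return (v - v.mean()) / v.std()     # zero mean, unit variance

def classify_by_correlation(test_vec, train_vecs, train_labels):
    dist = np.abs(train_vecs - test_vec).sum(axis=1)   # 1-norm distance
    return train_labels[int(np.argmin(dist))]
```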
2.4.3 Spatial Pyramid Matching
As a final test we re-implement the spatial pyramid matching algorithm of Lazebnik, Schmid
and Ponce [LSP06a] as faithfully as possible. In this procedure an SVM kernel is generated
from matching scores between a set of training images. Their published Caltech-101
performance at Ntrain = 30 was 64.6 ± 0.8%. Our own performance is practically the same.
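The matching score can be sketched as below, using the standard pyramid weights from the literature; the feature format (normalized (x, y, visual word) triples), vocabulary size and pyramid depth are assumptions, and this is not our actual implementation.

```python
# Sketch of a spatial pyramid matching score in the spirit of [LSP06a]:
# histogram intersection of visual-word histograms over a pyramid of
# spatial grids.
import numpy as np

def pyramid_histograms(features, n_words=200, n_levels=3):
    # features: iterable of (x, y, word) with x, y in [0, 1).
    hists = []
    for level in range(n_levels):
        cells = 2 ** level
        h = np.zeros((cells, cells, n_words))
        for x, y, w in features:
            h[int(y * cells), int(x * cells), w] += 1
        hists.append(h.ravel())
    return hists

def spm_score(ha, hb):
    L = len(ha) - 1
    # Coarsest level weighted 1/2^L; level l >= 1 weighted 1/2^(L-l+1),
    # so finer grids count more.
    score = np.minimum(ha[0], hb[0]).sum() / 2 ** L
    for l in range(1, L + 1):
        score += np.minimum(ha[l], hb[l]).sum() / 2 ** (L - l + 1)
    return score
```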
Figure 2.12: Performance as a function of Ntrain for Caltech-101 and Caltech-256 using the 3 algorithms discussed in the text. The spatial pyramid matching algorithm is that of Lazebnik, Schmid and Ponce [LSP06a]. We compare our own implementation with their published results, as well as the SVM-KNN approach of Zhang, Berg, Maire and Malik [ZBMM06a].
As shown in Figure 3.2, performance on Caltech-256 is roughly half the performance
achieved on Caltech-101. For example at Ntrain = 30 our Caltech-101 and Caltech-256
performance are 67.6 ± 1.4% and 34.1 ± 0.2% respectively.
2.4.4 Generality
Figure 2.13 shows the confusion between six categories and their six confounding cate-
gories. We define the generality as the mean of the off-quadrant diagonals divided by the
mean of the main diagonal. In this case, for Ntrain = 30, the generality is g = 0.145.
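Computed directly from the 12 × 12 block of Figure 2.13 (the 6 categories followed by their 6 confounders), g might be sketched as follows; the function name is illustrative.

```python
# Sketch of the generality measure g: mean of the two off-quadrant
# diagonals divided by the mean of the main diagonal.
import numpy as np

def generality(M12):
    upper = np.diag(M12[:6, 6:])     # e.g. fighter-jet classified as airplanes
    lower = np.diag(M12[6:, :6])     # e.g. airplanes classified as fighter-jet
    off = np.concatenate([upper, lower]).mean()
    return off / np.diag(M12).mean()
```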
What does g signify? Consider two extreme cases. If g = 0.0 then there is absolutely no
confusion between any of the similar categories, including tennis-shoe and sneaker.
Figure 2.13: Selected rows and columns of the 256 × 256 confusion matrix M for spatial pyramid matching [LSP06a] and Ntrain = 30. Matrix elements containing 0.0 have been left blank. The first 6 categories are chosen because they are likely to be confounded with the last 6 categories. The main diagonal shows the performance for just these 12 categories. The diagonals of the other 2 quadrants show whether the algorithm can detect categories which are similar but not exact.
This would be suspicious since it means the categorization algorithm is splitting hairs, i.e. finding
significant differences where none should exist. Perhaps the classifier is training on some
inconsequential artifact of the dataset. At the other extreme g = 1.0 suggests that the two
confounding sets of six categories were completely indistinguishable. Such a classifier is
not discriminating enough to differentiate between airplanes and the more specific category
fighter-jet, or between people and their faces. In other words, the classifier generalizes so
well about similar object classes that it may be considered too sloppy for some applications.
In practice the desired value of g depends on the needs of the customer. Lower values
of g denote fine discrimination between similar categories or sub-categories. This would
be particularly desirable in situations that require the exact identification of a particular
species of mammal. A more inclusive classifier tends toward higher values of g. Such a
classifier would presumably be better at identifying a mammal it has never seen before,
based on general features shared by a large class of mammals.
As shown in Figure 2.13, a spatial pyramid matching classifier does indeed confuse
tennis-shoes and sneakers the most. This is a reassuring sanity check. To a lesser extent
the object categories frog/toad, dog/greyhound, fighter-jet/airplanes and people/faces-easy
are also confused.
Confusion between car-tire and car-side is entirely absent. This seems surprising since tires are such a conspicuous feature of cars when viewed from the side. However the tires pictured in car-tire tend to be much larger in scale than those found in car-side. One reasonable hypothesis is that the classifier has limited scale-invariance: objects or pieces of objects are no longer recognized if their size changes by an order of magnitude. This characteristic of the classifier may or may not be important, depending on the application. Another hypothesis is that the classifier relies not just on the presence of individual parts, but on their relationship to one another.
In short, generality defines a trade-off between classifier precision and robustness. Our metric for generating g is admittedly crude because it uses only six pairs of similar categories. Nonetheless generating a confusion matrix like the one shown in Figure 2.13 can provide a useful sanity check, while exposing features of a particular classifier that are not apparent from the raw performance benchmark.
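As a concrete illustration, the sketch below computes g as defined above, assuming a confusion matrix ordered as in Figure 2.13 (the six base categories followed by their six confounders); the function name and the random example data are ours, for illustration only.

```python
import numpy as np

def generality(C):
    """Generality g of a 2k x 2k confusion matrix C whose first k rows/columns
    are base categories and whose last k are their confounders:
    g = mean of the off-quadrant diagonals / mean of the main diagonal."""
    k = C.shape[0] // 2
    main = np.diag(C).mean()              # correct classifications
    upper = np.diag(C[:k, k:]).mean()     # base confused as confounder
    lower = np.diag(C[k:, :k]).mean()     # confounder confused as base
    return ((upper + lower) / 2.0) / main

# Hypothetical usage with a random 12-category confusion matrix (rows sum to 1)
C = np.random.dirichlet(np.ones(12), size=12)
print(generality(C))
```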
2.4.5 Background
Returning to the example of a Mars rover, suppose that a smallcamera window is used
to scan across the surface of the planet. Because there may beonly one interesting object
in 100 or 1000 images, the interest detector must have a low rate of false detections in
order to be effective. As illustrated in figure 2.14 this is a challenging problem, particularly
when the detector must accommodate hundreds of different object categories that are all
consideredinteresting.
In the spirit of the attentional cascade [VJ01a] we train interest classifiers to discover which regions are worthy of detailed classification and which are not. These detectors are summarized below. As before the classifier is an SVM with a spatial pyramid matching kernel [LSP06a]. The margin threshold is adjusted in order to trace out a full ROC curve.6
Interest    Ntrain                Speed
Detector    C1...C256    C257     (images/sec)   Description
A           30           512      24             Modified 257-category classifier
B           2            512      4600           Fast two-category classifier
C           30           30       25             Ordinary 257-category classifier
First let us consider Interest Detector C. This is the same detector that was employed for recognizing object categories in section 2.4.3. The only difference is that 257 categories are used instead of 256. Performance is poor because only 30 clutter images are used during training. In other words, clutter is treated exactly like any other category.
Interest Detector A corrects the above problem by using 512 training images from the clutter category. Performance improves because there is now a balance between the number of positive and negative examples. However the detector is still slow because it attempts to recognize 257 different object categories in every single image or camera region. This is wasteful if we expect the vast majority of regions to contain irrelevant clutter which is not worth classifying. In fact this detector only classifies about 25 images per second on a 3 GHz Pentium-based PC.
6 When measuring speed, training time is ignored because it is a one-time expense.
Interest Detector B trains on 512 clutter images and 512 images taken from the other 256 object categories. These two groups of images are assigned to the categories uninteresting and interesting, respectively. This B classifier is extremely fast because it combines all the interesting images into a single category instead of treating them as 256 separate categories. On a typical 3 GHz Pentium processor this classifier can evaluate 4600 images (or scan regions) per second.
It may seem counter-intuitive to group two images from each category C1...C256 into a huge meta-category, as is done with Interest Detector B. What exactly is the classifier training on? What makes an image interesting? What if we have merely created a classifier that detects the photographic style of Stephen Shore? For these reasons any classifier which implements attention should be verified on a variety of background images, not just those in C257. For example the Caltech-6 provides 550 background images with very different statistics.
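For concreteness, a minimal sketch of training such a binary interest detector on a precomputed matching kernel follows. The data, the kernel values, and the use of scikit-learn's SVC are all our own illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical inputs: K_train is a precomputed (n x n) spatial pyramid
# matching kernel; the first 512 training images are clutter, the rest
# are 2 images from each of the 256 object categories.
n_clutter, n_interesting = 512, 512
n = n_clutter + n_interesting
K_train = np.random.rand(n, n)
K_train = (K_train + K_train.T) / 2 + n * np.eye(n)     # keep it PSD
y = np.array([0] * n_clutter + [1] * n_interesting)     # 0 = uninteresting

clf = SVC(kernel="precomputed")
clf.fit(K_train, y)

# Sweeping a threshold on decision_function(K_test) would trace out the
# ROC curve described in the text (K_test: kernel rows for test images).
```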
Figure 2.14: ROC curve for three different interest classifiers described in section 2.4.5. These classifiers are designed to focus the attention of the multi-category detectors benchmarked in Figure 3.2. Because Detector B is roughly 200 times faster than A or C, it represents the best tradeoff between performance and speed. This detector can accurately detect 38.2% of the interesting (non-clutter) images with a 0.1% rate of false detections. In other words, 1 in 1000 of the images classified as interesting will instead contain clutter (solid red line). If a 1 in 100 rate of false detections is acceptable, the accuracy increases to 58.6% (dashed red line).
2.5 Conclusion
Thanks to rapid advances in the vision community over the last few years, performance over
60% on the Caltech-101 has become commonplace. Here we present a new Caltech-256
image dataset, the largest set of object categories available to our knowledge. Our intent
is to provide a freely available set of visual categories that does a better job of challenging
today’s state-of-the-art classification algorithms.
For example, spatial pyramid matching [LSP06a] with Ntrain = 30 achieves performance of 67.6% on the Caltech-101 as compared to 34.1% on Caltech-256. The standard practice among authors in the vision community is to benchmark raw classification
Figure 2.15: In general the Caltech-256 images are more difficult to classify than the Caltech-101 images. Here we plot performance of the two datasets over a random mix of Ncategories from each dataset. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncategories = 100 the performance is ∼ 60% lower.
performance as a function of training examples. As classification performance continues to improve, however, new benchmarks will be needed to reflect the performance of algorithms under realistic conditions. Beyond raw performance, we argue that a successful algorithm should also be able to
• Generalize beyond a specific set of images or categories
• Identify which images or image regions are worth classifying
In order to evaluate these characteristics we test two new benchmarks in the context of Caltech-256. No doubt there are other equally relevant benchmarks that we have not considered. We invite researchers to devise suitable benchmarks and share them with the community at large.
If you would like to share performance results as well as your confusion matrix, please send them to [email protected]. We will try to keep our comparison of performance as up-to-date as possible. For more details see
Chapter 3
Visual Hierarchies
The computational complexity of current visual categorization algorithms scales linearly at best with the number of categories. The goal of classifying simultaneously Ncat = 10^4 - 10^5 visual categories requires sub-linear classification costs. We explore algorithms for automatically building classification trees which have, in principle, log Ncat complexity. We find that a greedy algorithm that recursively splits the set of categories into the two minimally confused subsets achieves 5-20 fold speedups at a small cost in classification performance. Our approach is independent of the specific classification algorithm used. A welcome by-product of our algorithm is a very reasonable taxonomy of the Caltech-256 dataset.
3.1 Introduction
Much progress has been made during the past 10 years in approaching the problem of visual recognition. The literature shows a quick growth in the scope of automatic classification experiments: from learning and recognizing one category at a time until year 2000 [BP96, VJ01b] to a handful around year 2003 [WWP00, FPZ03, LS04] to ∼ 100 in 2006 [GD06, GD05b, E+05, ML06b, SWP05, ZBMM06b]. While some algorithms are remarkably fast [FG01, VJ01b, GD05b] the cost of classification is still at best linear in the number of categories; in most cases it is in fact quadratic since one-vs-one discriminative classification is used in most approaches. There is one exception: cost is logarithmic in the number of models for Lowe [Low04]. However Lowe's algorithm was developed to recognize specific objects rather than categories. Its speed hinges on the observation that local features are highly distinctive, so that one may index image features directly into a database of models which is organized like a tree [BL97]. In the more general case of visual category recognition, local features are not very distinctive, hence one cannot take advantage of this insight.
Humans can recognize between 10^4 and 10^5 object categories [Bie87] and this is a worthwhile and practical goal for machines as well. It is therefore important to understand how to scale classification costs sub-linearly with respect to the number of categories to be recognized. It is quite intuitive that this is possible: when we see a dog we are not for a moment considering the possibility that it might be classified as either a jet-liner or an ice cream cone. It is reasonable to assume that, once an appropriate hierarchical taxonomy is developed for the categories in our visual world, we may be able to recognize objects by descending the branches of this taxonomy and avoid considering irrelevant possibilities. Thus, tree-like algorithms appear to be a possibility worth considering, although formulations need to be found that are more 'holistic' than Beis and Lowe's feature-based indexing [BL97].
Here we explore one such formulation. We start by considering the confusion matrix
Figure 3.1: A typical one-vs-all multi-class classifier (top) exhaustively tests each image against every possible visual category requiring Ncat decisions per image. This method does not scale well to hundreds or thousands of categories. Our hierarchical approach uses the training data to construct a taxonomy of categories which corresponds to a tree of classifiers (bottom). In principle each image can now be classified with as few as log2 Ncat decisions. The above example illustrates this for an unlabeled test image and Ncat = 8. The tree we actually employ has slightly more flexibility as shown in Fig. 3.4.
that arises in one-vs-all discriminative classification of object categories. We postulate that the structure of this matrix may reveal which categories are more strongly related. In Sec. 3.3 we flesh out this heuristic to produce taxonomies. In Sec. 3.4 we propose a mechanism for automatically splitting large sets of categories into cleanly separated subsets, an operation which may be repeated to obtain a tree-like hierarchy of classifiers. We explore experimentally the implications of this strategy, both in terms of classification quality and in terms of computational cost. We conclude with a discussion in Sec. 3.5.
3.2 Experimental Setup
The goal of our experiment is to compare classification performance and computational costs when a given classification algorithm is used in the conventional one-vs-many configuration vs. our proposed hierarchical cascade (see Fig. 3.1).
3.2.1 Training and Testing Data
The choice of the image classifier is somewhat arbitrary for the purposes of this study. We decided to use the popular Spatial Pyramid Matching technique of Lazebnik et al. [LSP06b] because of its high performance and ease of implementation. We summarize our implementation in Sec. 3.2.2. Our implementation performs as reported by the original authors on Caltech-101. As expected, typical performance on Caltech-256 [GHP06] is lower than on Caltech-101 [LFP04b] (see Fig. 3.2). This is due to two factors: the larger number of categories and the more challenging nature of the pictures themselves. For example some of the Caltech-101 pictures are left-right aligned whereas the Caltech-256 pictures are not. On average a random subset of Ncat categories from the Caltech-256 is harder to classify than a random subset of the same number of categories from the Caltech-101 (see Fig. 3.3).
Other authors have achieved higher performance on the Caltech-256 than we report here, for example, by using a linear combination of multiple kernels [VR07]. Our goal here is not to achieve the best possible performance but to illustrate how a typical algorithm can be accelerated using a hierarchical set of classifiers.
The Caltech-256 image set is used for testing and training. We remove the clutter category from Caltech-256 leaving a total of Ncat = 256 categories.
Figure 3.2: Performance comparison between Caltech-101 and Caltech-256 datasets using the Spatial Pyramid Matching algorithm of Lazebnik et al. [LSP06b]. The performance of our implementation is almost identical to that reported by the original authors; any performance difference may be attributed to a denser grid used to sample SIFT features. This illustrates a standard non-hierarchical approach where authors mainly present the number of training examples and the classification performance, without also plotting classification speed.
3.2.2 Spatial Pyramid Matching
First each image is desaturated, removing all color information. For each of these black-and-white images, SIFT features [Low04] are extracted along a uniform 72x72 grid using software that is publicly available [MTS+05]. An M-word feature vocabulary is formed by fitting a Gaussian mixture model to 10,000 features chosen at random from the training set. This model maps each 128-dimensional SIFT feature vector to a scalar integer m = 1..M where M = 200 is the total number of Gaussians. The choice of clustering algorithm does not seem to affect the results significantly, but the choice of M does. The original authors [LSP06b] find that 200 visual words are adequate.
At this stage every image has been reduced to a 72x72 matrix of visual words. This representation is reduced still further by histogramming over a coarse 4x4 spatial grid. The resulting 4x4xM histogram counts the number of times each word 1..M appears in each of the 16 spatial bins. Unlike a bag-of-words approach [GD06], coarse-grained position information is retained as the features are counted.
The matching kernel proposed by Lazebnik et al. finds the intersection between each pair of 4x4xM histograms by counting the number of common elements in any two bins. Matches in nearby bins are weighed more strongly than matches in far-away bins, resulting in a single match score for each word. The scores for each word are then summed to get the final overall score. We follow this same procedure resulting in a kernel K that satisfies Mercer's condition [GD06] and is suitable for training an SVM.
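As a rough sketch of the histogramming and matching steps (not the thesis code), the following quantizes a 72x72 grid of visual-word indices into a 4x4xM histogram and computes a plain intersection score between two such histograms. The function names are ours, and the distance-dependent weighting of the full pyramid kernel is deliberately omitted.

```python
import numpy as np

M = 200  # vocabulary size, as in the text

def spatial_histogram(words, bins=4, M=M):
    """Reduce a 72x72 grid of visual-word indices (values in 0..M-1)
    to a bins x bins x M histogram of word counts."""
    H = np.zeros((bins, bins, M))
    rows = np.array_split(np.arange(words.shape[0]), bins)
    cols = np.array_split(np.arange(words.shape[1]), bins)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            block = words[np.ix_(r, c)]
            H[i, j] = np.bincount(block.ravel(), minlength=M)
    return H

def intersection_score(H1, H2):
    """Histogram intersection: count the common elements bin by bin.
    (The kernel of Lazebnik et al. additionally weights matches found
    at coarser pyramid levels; that weighting is omitted here.)"""
    return np.minimum(H1, H2).sum()

# Hypothetical usage with two random word grids
w1 = np.random.randint(0, M, (72, 72))
w2 = np.random.randint(0, M, (72, 72))
K12 = intersection_score(spatial_histogram(w1), spatial_histogram(w2))
```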
3.2.3 Measuring Performance
Classification performance is measured as a function of the number of training examples. First we select a random but disjoint set of Ntrain and Ntest training and testing images from each class. All categories are sampled equally, i.e., Ntrain and Ntest do not vary from class to class.
Like Lazebnik et al. [LSP06b] we use a standard multi-class method consisting of a Support Vector Machine (SVM) trained on the Spatial Pyramid Matching kernel in a
Figure 3.3: In general the Caltech-256 [GHP06] images are more difficult to classify than the Caltech-101 images. Here we fix Ntrain = 30 and plot performance of the two datasets over a random mix of Ncat categories chosen from each dataset. The solid region represents a range of performance values for 10 randomized subsets. Even when the number of categories remains the same, the Caltech-256 performance is lower. For example at Ncat = 100 the performance is ∼ 60% lower (dashed red line).
one-vs-all classification scheme. The training kernel has dimensions Ncat · Ntrain along each side. Once the classifier has been trained, each test image is assigned to exactly one visual category by selecting the one-vs-all classifier which maximizes the margin.
The confusion matrix Cij counts the fraction of test examples from class i which were classified as belonging to class j. Correct classifications lie along the diagonal Cii so that the cumulative performance is the mean of the diagonal elements. To reduce uncertainty we average the matrix obtained over 10 experiments using different randomized training
Figure 3.4: A simple hierarchical cascade of classifiers (limited to two levels and four categories for simplicity of illustration). We call A, B, C and D four sets of categories as illustrated in Fig. 3.5. Each white square represents a binary branch classifier. Test images are fed into the top node of the tree where a classifier assigns them to either the set A ∪ B or the set C ∪ D (white square at the center-top). Depending on the classification, the image is further classified into either A or B, or C or D. Test images ultimately terminate in one of the 7 red octagonal nodes where a conventional multi-class node classifier makes the final decision. For a two-level ℓ = 2 tree, images terminate in one of the 4 lower octagonal nodes. If ℓ = 0 then all images terminate in the top octagonal node, which is equivalent to conventional non-hierarchical classification. The tree is not necessarily perfectly balanced: A, B, C and D may have different cardinality. Each branch or node classifier is trained exclusively on images extracted from the sets that the classifier is discriminating. See Sec. 3.4 for details.
and testing sets. By inspecting the off-diagonal elements of the confusion matrix it is clear that some categories are more difficult to discriminate than others. Upon this observation we build a heuristic that creates an efficient hierarchy of classifiers.
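A minimal sketch of this bookkeeping follows; the helper name and the synthetic labels are ours, for illustration under the assumption of integer ground-truth and predicted labels.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_cat):
    """C[i, j] = fraction of test examples of class i classified as class j."""
    C = np.zeros((n_cat, n_cat))
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    return C / C.sum(axis=1, keepdims=True)  # rows sum to 1

# Hypothetical usage: cumulative performance is the mean of the diagonal
rng = np.random.default_rng(0)
true = rng.integers(0, 256, size=256 * 25)
pred = np.where(rng.random(true.size) < 0.3, true,
                rng.integers(0, 256, true.size))   # ~30% correct, rest random
C = confusion_matrix(true, pred, 256)
print("performance:", np.diag(C).mean())
```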
3.2.4 Hierarchical Approach
Our hierarchical classification architecture is shown in Fig. 3.4. The principle behind the architecture is simple: rather than a single one-vs-all classifier, we achieve classification by recursively splitting the set of possible labels into two roughly equal subsets. This divide-and-conquer strategy is familiar to anyone who has played the game of 20 questions.
This method is faster because the binary branch classifiers are less complex than the one-vs-all node classifiers. For example the 1-vs-N node classifier at the top of Fig. 3.1 actually consists of N=8 separate binary classifiers, each with its own set Si of support vectors. During classification each test image must now be compared with the union of training images
$$S_{\text{node}} = \bigcup_{i=1}^{N} S_i$$
Unless the sets Si happen to be the same (which is highly unlikely) the size of Snode will increase with N.
Our procedure works as follows. In the first stage of classification, each test image reaches its terminal node via a series of ℓ inexpensive branch comparisons. By the time the test image arrives at its terminal node there are only ∼ Ncat/2^ℓ categories left to consider instead of Ncat. The greater the number of levels ℓ in the hierarchy, the fewer categories there are to consider at the expensive final stage, with correspondingly fewer support vectors overall.
The main decision to be taken in building such a hierarchical classification tree is how to choose the sets into which each branch divides the remaining categories. The key intuition which guides our architecture is that decisions between categories that are more easily
Figure 3.5: Top-down grouping as described in Sec. 3.3. Our underlying assumption is that categories that are easily confused should be grouped together in order to build the branch classifiers in Fig. 3.4. First we estimate a confusion matrix using the training set and a leave-one-out procedure. Shown here is the confusion matrix for Ntrain = 10, with diagonal elements removed to make the off-diagonal terms easier to see.
confused should be taken later in the decision tree, i.e. at the lower nodes where fewer categories are involved. With this in mind we start the training phase by constructing a confusion matrix C′ij from the training set alone using a leave-one-out validation procedure. This matrix (see Fig. 3.5) is used to estimate the affinity between categories. This should be distinguished from the standard confusion matrix Cij which measures the confusion between categories during the testing phase.
3.3 Building Taxonomies
Next, we compare two different methods for generating taxonomies automatically based on the confusion matrix C′ij.
The first method splits the confusion matrix into two groups using Self-Tuning Spectral Clustering [ZMP04]. This is a variant of the Spectral Clustering algorithm which automatically chooses an appropriate scale for analysis. Because our cascade is a binary tree we always choose two for the number of clusters. Fig. 3.4 shows only the first two levels of splits while Fig. 3.6 repeats the process until the leaves of the tree contain individual categories.
The second method builds the tree from the bottom-up. At each step the two groups of categories with the largest mutual confusion are joined while their confusion matrix rows/columns are averaged. This greedy process continues until there is only a single super-group containing all 256 categories. Finally, we generate a random hierarchy as a control.
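A minimal sketch of this bottom-up agglomeration, assuming a symmetrized affinity matrix as input (the merge bookkeeping is our own illustration, not the thesis code):

```python
import numpy as np

def bottom_up_tree(C):
    """Greedily merge the two groups with the largest mutual confusion.
    C is a symmetric affinity matrix, e.g. (C' + C'.T)/2 with the diagonal
    removed. Returns the merges as pairs of category-index tuples."""
    groups = [(i,) for i in range(C.shape[0])]
    A = C.astype(float).copy()
    np.fill_diagonal(A, -np.inf)          # never merge a group with itself
    merges = []
    while len(groups) > 1:
        i, j = np.unravel_index(np.argmax(A), A.shape)
        i, j = min(i, j), max(i, j)
        merges.append((groups[i], groups[j]))
        A[i, :] = (A[i, :] + A[j, :]) / 2.0   # average rows/columns
        A[:, i] = A[i, :]
        A[i, i] = -np.inf
        A = np.delete(np.delete(A, j, axis=0), j, axis=1)
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return merges

# Hypothetical usage on a random symmetric 8-category affinity matrix
rng = np.random.default_rng(1)
C = rng.random((8, 8)); C = (C + C.T) / 2
print(bottom_up_tree(C)[:3])
```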
3.4 Top-Down Classification Algorithm
Once a taxonomy of classes is discovered, we now seek to exploit this taxonomy for efficient top-down classification. The problem of multi-stage classification has been studied in many different contexts [AGF04, Fan05, LYW+05, Li05]. For example, Viola and Jones [VJ04] use an attentional cascade to quickly exclude areas of their image that are unlikely to contain a face. Instead of using a tree, however, they use a linear cascade of
Figure 3.6: Taxonomy discovered automatically by the computer, using only a limited subset of Caltech-256 training images and their labels. Aside from these labels there is no other human supervision; branch membership is not hand-tuned in any way. The taxonomy is created by first generating a confusion matrix for Ntrain = 10 and recursively dividing it by spectral clustering. Branches and their categories are determined solely on the basis of the confusion between categories, which in turn is based on the feature-matching procedure of Spatial Pyramid Matching. To compare this with some recognizably human categories we color code all the insects (red), birds (yellow), land mammals (green) and aquatic mammals (blue). Notice that the computer's hierarchy usually begins with a split that puts all the plant and animal categories together in one branch. This split is found automatically with such consistency that in a third of all randomized training sets not a single category of living thing ends up on the opposite branch.
classifiers that are progressively more complex and computationally intensive. Fleuret and Geman [FG01] demonstrate a hierarchy of increasingly discriminative classifiers which detect faces while also estimating pose.
Our strategy is illustrated in Fig. 3.4 and described in its caption. We represent the taxonomy of categories as a binary tree, taking the two largest branches at the root of the tree and calling these classes A ∪ B and C ∪ D. Now take a random subsample of Ftrain of the training images in each of the two branches and label them as being in either class 1 or 2. An SVM is trained using the Spatial Pyramid Matching kernel as before except that there are now two classes instead of Ncat. Empirically we find that Ftrain = 10% significantly reduces the number of support vectors in each branch classifier with little or no performance degradation.
If the branch classifier passes a test image down to the left branch, we assume that it cannot belong to any of the classes in the right branch. This continues until the test image arrives at a terminal node. Based on the above assumption, for each node at depth ℓ, the final multi-class classifier can ignore roughly 1 − 2^−ℓ of the training classes. The exact fraction varies depending on how balanced the tree is.
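A minimal sketch of this top-down descent follows; the node structure and classifier interface are assumptions for illustration, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # Internal nodes carry a binary branch classifier; terminal nodes carry
    # a multi-class node classifier over their remaining category subset.
    branch_clf: Optional[Callable] = None   # x -> 0 (left) or 1 (right)
    node_clf: Optional[Callable] = None     # x -> category label
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def classify_top_down(x, node):
    """Descend the taxonomy with cheap binary decisions, then make the
    expensive multi-class decision among the ~Ncat / 2**l surviving classes."""
    while node.node_clf is None:
        node = node.left if node.branch_clf(x) == 0 else node.right
    return node.node_clf(x)

# Hypothetical 2-category toy tree: negative x goes left, positive right
tree = Node(branch_clf=lambda x: 0 if x < 0 else 1,
            left=Node(node_clf=lambda x: "class A"),
            right=Node(node_clf=lambda x: "class B"))
print(classify_top_down(-3.0, tree))   # -> "class A"
```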
The overall speed per test image is found by taking a union of all the support vectors required at each level of classification. This includes all the branch and node classifiers which the test image encounters prior to final classification. Each support vector corresponds to a training image whose matching score must be computed, at a cost of 0.4 ms per support vector on a Pentium 3 GHz machine. As already noted, the multi-class node classifiers require many more support vectors than the branch classifiers. Thus increasing
Figure 3.7: The taxonomy from Fig. 3.6 is reproduced here to illustrate how classification performance can be traded for classification speed. Node A represents an ordinary non-hierarchical one-vs-all classifier implemented using an SVM. This is accurate but slow because of the large combined set of support vectors in Ncats = 256 individual binary classifiers. At the other extreme, each test image passes through a series of inexpensive binary branch classifiers until it reaches 1 of the 256 leaves, collectively labeled C above. A compromise solution B invokes a finite set of branch classifiers prior to final multi-class classification in one of 7 terminal nodes.
Figure 3.8: Comparison of three different methods for generating taxonomies. For each taxonomy we vary the number of branch comparisons prior to final classification, as illustrated in Fig. 3.4. This results in a tradeoff between performance and speed as one moves between two extremes A and C. Randomly generated hierarchies result in poor cascade performance. Of the three methods, taxonomies based on Spectral Clustering yield marginally better performance. All three curves measure performance vs. speed for Ncat = 256 and Ntrain = 10.
the number of branch classifier levels decreases the overall number of support vectors and increases the classification speed, but at a performance cost.
3.5 Results
As shown in Fig. 3.8, our top-down and bottom-up methods give comparable performance at Ntrain = 10. Classification speed increases 5-fold with a corresponding 10% decrease in performance. In Fig. 3.9 we try a range of values for Ntrain. At Ntrain = 50 there is a 20-fold speed increase for the same drop in performance.
3.6 Conclusions
Learning hierarchical relationships between categories of objects is an essential part of how humans understand and analyze the world around them. Someone playing the game of "20 Questions" must make use of some preconceived hierarchy in order to guess the unknown object using the fewest number of queries. Computers face the same dilemma: without some knowledge of the taxonomy of visual categories, classifying thousands of categories is reduced to blind guessing. This becomes prohibitively inefficient as computation time scales linearly with the number of categories.
To break this linear bottleneck, we attack two separate problems. How can computers automatically generate useful taxonomies, and how can these be applied to the task of classification? The first problem is critical. Taxonomies built by hand have been applied to the task of visual classification [ZW07] for a small number of categories, but this method
Figure 3.9: Cascade performance / speed trade-off as a function of Ntrain. Values of Ntrain = 10 and Ntrain = 50 result in a 5-fold and 20-fold speed increase (respectively) for a fixed 10% performance drop.
does not scale well. It would be tedious, if not impossible, for a human operator to generate detailed visual taxonomies for the computer, updating them for each new environment that the computer might encounter. Another problem for this approach is consistency: any two operators are likely to construct entirely different trees. A more consistent approach is to use an existing taxonomy such as WordNet [Fel98] and apply it to the task of visual classification [MS07]. One caveat is that lexical relationships may not be optimal for certain visual classification tasks. The word lemon refers to an unreliable car, but the visual categories lemon and car are not at all similar.
Our experiments suggest that plausible taxonomies of object categories can be created automatically using a classifier (in this case, Spatial Pyramid Matching) coupled to a learning phase which estimates inter-category confusion. The only input used for this process is a set of training images and their labels. Taxonomies such as the one shown in Fig. 3.6 seem to consistently discover broader categories which are naturally recognizable to humans, such as the distinction between animate and inanimate objects.
How should we compare one hierarchy to another? It is difficult to quantify such a comparison without a specific goal in mind. To this end we benchmark a cascade of classifiers based on our hierarchy and demonstrate significant speed improvements. In particular, top-down and bottom-up recursive clustering processes both result in better performance than a randomly generated control tree.
Chapter 4
Pollen Counting
4.1 Basic Method
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
4.2 Results
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
4.3 Comparison To Humans
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
!"#$"%
&"'(%)
!"#$%&'
(&)&*+,"-
.,-&
(&)&*+,"-
/0$&%0"12%
3345#%&2
60#7&8'
9$,:0+-&%%'
;"2&1
!1<++&$
.,-#1
!1#%%,=*#+,"-
6>;86?;
@77&#$#-*&
;"2&1
*$+,-"+./
&"'(%)
60#7&
.&#+<$&%'ABC
!$"77&2
?"11&-!"-+"<$%
D#5&1EFG
6+#:&'H
6+#:&'I
6+#:&'J
6+#:&'K
!"-L"1L&
60#7&89$,:0+-&%%
.&#+<$&%'AMHC
6G./
.&#+<$&%'
AMHNC
!#-2,2#+&%
.,-#1,%+%
!"#$%&'''''''''''
9$,:0+-&%%
.&#+<$&%'ABC
6+#:&'M;,*$"%*"7&'
GO#:&
Figure 4.1:
blah blah blah blah blah blah blah blah blah blah blah blah
4.4 Conclusions
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah blah
Chapter 5
Machine Olfaction: Introduction
Electronic noses have been used successfully in a wide variety of applications [RB08, WB09] ranging from safety [DNPa07, GSHD00, ZEA+04, Ga01] and detection of explosives [AMa01, KWTT03, Gar04] to medical diagnosis [NMMa99, GSH00, TM04, Phi05, DDNPa08], food quality assessment [ZZX+06, GVMSB09, BO04, AB03] and discrimination of beverages like coffee [PFSQ01, Pa02], tea [Yu07, DHG+03, LSH05] and wine [DNDD96, LSH05, RSCC+08, SCA+11]. These applications typically involve a limited variety of odor categories with tens or even hundreds of training examples available for each odorant.
Human test subjects, on the other hand, are capable of distinguishing thousands of different odors [Lan86, Ree90, SR95] and can recognize new odors with only a few training examples [Dav75, CdWL+98]. How we organize individual odor categories into broader classes, and how many classes are required, is still a matter of active debate in the psychophysics community. One recent study of 881 perfume materials found that human test subjects group the vast majority of these odors into 17 distinct classes [ZS06]. Another comprehensive study of 146-dimensional odorant responses obtained by Dravnieks [Dra85] showed that most of the variability in the responses can be explained using only 6-10 parameters [KKER11].
Results such as these suggest that the bulk of the variability in human odor perception can be represented in a relatively low-dimensional space. What is less clear is whether this low dimensionality is an intrinsic quality of the odorants themselves or a feature of our olfactory perception. Measuring a large variety of odorants electronically provides an opportunity to compare how machines and humans organize their olfactory environment. One way to represent this is to construct a taxonomy of odor categories and super-categories with closely related categories residing in nearby branches of the tree.
Over the last 10 years hierarchical organization tools have proven increasingly useful in the field of computer vision as image classification techniques have been scaled to larger image datasets. Examples include PASCAL [EZWVG06], Caltech-101 [FFFP04], Caltech-256 [GHP07] and, more recently, the SUN [CLTW10], LabelMe [TRY10] and ImageNet [DDS+09] datasets with over a thousand categories each. These datasets are challenging not just because they include a larger number of categories but because the objects themselves are photographed in a variety of poses and lighting conditions with varying degrees of background clutter.
While it is possible to borrow taxonomies such as WordNet [SR98] and apply them to machine classification tasks, lexical relationships are at best an imperfect approximation of visual or olfactory relationships. It is therefore useful to automatically discover taxonomies that are directly relevant to the specific task at hand. One straightforward greedy approach involves clustering the confusion matrix created with a conventional one-vs-all multi-class classifier. This results in a top-down arrangement of classifiers where simple, inexpensive decisions are made first in order to reduce the available solution space. Such an approach yields faster terrain recognition for autonomous navigation [AMHP07] as well as more computationally efficient classification of images containing hundreds of visual categories [GP08]. One way to improve overall classification accuracy is to identify categories which cannot be excluded early and include them on multiple hierarchy branches [MS08]. Binder et al. show that taxonomy-based classification can improve both speed and accuracy at the same time [BMK11].
In addition to larger more challenging datasets and hierarchical classification approaches that scale well, machine vision has benefitted from discriminative features like SIFT and GLOH that are relatively invariant to changes in illumination, viewpoint and pose [MS05]. Such features are not dissimilar from those extracted from astronomical data by switching between fixed locations on the sky [DRNP94, TTA+06]. The resulting measurements can reject slowly-varying atmospheric contamination while retaining extremely faint cosmological signals that are O(10^6) times smaller.
Motivated by this approach, we construct a portable apparatus capable of sniffing at a range of frequencies to explore how well a small array of 10 sensors can classify hundreds of odors in indoor and outdoor environments. We evaluate this swept-frequency approach by sampling 90 common household odors as well as 40 odors in the University of Pittsburgh Smell Identification Test. Reference data with no odorants is also gathered in order to model and remove any systematic errors that remain after feature extraction.
The sensors themselves are carbon black-polymer composite thin-film chemiresistors. Using controlled concentrations of analytes under laboratory conditions, these sensors have been shown to exhibit steady-state and time-dependent resistance profiles that are highly sensitive to inorganic gases as well as organic vapors [BL03] and can be used for classifying both [SL03, MGBW+08]. A challenge when operating outdoors is that variability in water vapor concentrations masks the response of other analytes. Our approach focuses on extracting features that are insensitive to background analytes whose concentrations change more slowly than the sniffing frequency. This strategy exploits the linearity of the sensor response and the slowly-varying nature of ambient changes in temperature and humidity.
From an instrument design perspective, we would like to discover how the choice of sniffing frequencies, number of sensors and feature reduction method all contribute to the final indoor/outdoor classification performance. Next we construct a top-down classification framework which aggregates odor categories that cannot be easily distinguished from one another. Such a framework quantitatively addresses questions like: what sorts of odor groupings can be readily classified by the instrument, and with what specificity?
Chapter 6
Machine Olfaction: Methods
6.1 Instrument
The apparatus is a common design [NFM90, GS92, LEZP99] which draws air through a small sensor chamber while controlling the source of the air via a manifold mixing solenoid valve. A small fan draws the air through a computer-controlled valve with five inlets. Such elements as flowmeters, gas cylinders, air dryers and other filters are absent. Four sample chambers hold the odorants to be tested while one empty chamber serves as a reference as shown in Fig. 6.1.
One approach to the problem of removing confusing background factors is to filter them explicitly from the odor stream. When discriminating between alcoholic beverages for example, alcohol and water may be filtered to improve performance [LSH05]. We avoid this approach here to keep the instrument as simple and portable as possible for acquiring data in both indoor and outdoor environments. The sensor chamber, sample chambers, solenoid valve, computer and electronics are light enough to carry, and all electronic components run on battery power.
Figure 6.1: A fan draws air from 1 of 4 odorant chambers or an empty reference chamber, depending on the state of the computer-controlled solenoid valve. The valve control signal can then be compared to the resistance changes recorded from an array of 10 individual sensors as shown in Fig. 6.2.
6.2 Sampling and Measurements
Ideally the sniffing frequencies used should be high enough to reject unwanted environmental noise but low enough that the time-constant of the sensors does not attenuate the signal. Between these two constraints we find a range of usable frequencies between 1/64 and 1 Hz. This choice of frequencies is examined in detail in the next section.
The sniffing scheme is illustrated in Fig. 6.2a. The computer first chooses a single odor and samples 7 frequencies in 7.5 minutes. During this span of time 400 s are spent switching between a single odorant and the reference chamber while the remaining 50 s are spent purging the chamber with reference air. Henceforward we refer to this complete sampling pattern as a "sniff" and each of the 7 individual frequency modulations as "subsniffs".
The sniff is repeated 4 times for each of 4 odorants for a total of 2 hours per "run". Within each run the odorants are randomly selected and the order in which they are presented is also randomized. To avoid residual odors each run starts with a 1-hour period during which the sensor chamber is purged with reference air, odorant chambers are replaced and the tubing that leads to the chambers is washed and dried.
The resistance change ∆R/R of each sensor is sampled at 50 Hz while the valve modulates the incoming odor streams. From this time-series data each individual sniff is reduced to a feature vector of measurements representing the band power of the sensor resistance integrated over subsniffs i = 1..7 and frequency bands j = 1..4. Fig. 6.2c shows what this filtering looks like for the case of i = 4. In this subsniff the valve switches 4 times between odorant and reference at a frequency of 1/8 Hz. Integrating each portion of the Fourier transform of the signal

$$S_i(f) = \int s_i(t)\, e^{-2\pi i f t}\, dt$$

weighted by four different window functions results in 7 × 4 = 28 measurements

$$m_{ij} = \int_0^{f_{\max}} H_{ij}(f)\, df, \qquad H_{ij}(f) = S_i(f)\, W_{ij}(f)$$

where f_max = 25 Hz is the Nyquist frequency.
The modulation of the odorant in the ith subsniff can be thought of as the product of a single 64 s pulse and progressively faster square waves of frequency f_i = 2^(i−7) Hz. Thus the first window function j = 1 in each subsniff is centered around f_1 = 1/64 Hz while window functions for j = 2..4 are centered at the odd harmonics f_i, 3f_i and 5f_i where the square-wave modulation has maximal power. Repeating this procedure for each sensor k = 1..10 gives a final feature m_ijk of size 7 × 4 × 10 = 280 which is normalized to unit length.
For comparison a second feature m_i of size 7 × 10 = 70 is generated by simply differencing the top and bottom 5% quantiles of ∆R/R within each subsniff. This sort of amplitude estimate is comparable to the so-called sensorial odor perception (SOP) feature commonly used in machine olfaction experiments [DVHV01] and is similar to m_i1k in that it ignores harmonics with frequencies higher than 1/64 Hz within each subsniff.
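A simplified sketch of this feature extraction for one sensor and one subsniff follows. The boxcar band shape and the bandwidth parameter are our own illustrative choices; the text does not specify the exact window functions W_ij.

```python
import numpy as np

FS = 50.0  # sampling rate in Hz, as in the text

def subsniff_feature(signal, f_i, rel_bw=0.2):
    """Band power of a subsniff's dR/R time series at f_1 = 1/64 Hz and at
    the fundamental f_i and its odd harmonics 3*f_i and 5*f_i (j = 1..4).
    A simple boxcar band of relative width rel_bw stands in for W_ij."""
    S = np.abs(np.fft.rfft(signal))**2                 # power spectrum
    f = np.fft.rfftfreq(signal.size, d=1.0 / FS)      # frequency axis
    centers = [1.0 / 64, f_i, 3 * f_i, 5 * f_i]
    return np.array([S[(f >= c * (1 - rel_bw)) & (f <= c * (1 + rel_bw))].sum()
                     for c in centers])

# Hypothetical usage for subsniff i = 4 (f_4 = 1/8 Hz, 32 s of data)
t = np.arange(0, 32, 1 / FS)
sig = np.sign(np.sin(2 * np.pi * (1 / 8) * t)) * 1e-4  # idealized square response
print(subsniff_feature(sig, f_i=1 / 8))
```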
6.3 Datasets and Environment
Three separate datasets are used for training and testing.
The University of Pittsburgh Smell Identification Test (UPSIT) consists of 40 microencapsulated odorants chosen to be relatively recognizable and to span known odor classes [DSa84a, DSKa84]. The test is administered as a booklet of 4-item multiple choice questions with an accompanying scratch-and-sniff patch for each question. It is an especially useful standard because of the wealth of psychophysical data gathered on a variety of test subjects since it was introduced in 1984 [DSA+84b, DAZa85, Dot97, EFL+05].
In order to sample realistic real-world odors we introduce a Caltech Common Household Odors Dataset (CHOD) with 90 common food products and household items. Items were chosen to be as diverse as possible while remaining readily available. Odor categories for both the CHOD and UPSIT are listed in the appendix. Of these, 78 were sampled indoors, 40 were sampled outdoors and 32 were sampled in both locations.
Lastly a Control Dataset was acquired in the same manner as the other two sets but with empty odorant chambers. The purpose of this set is to model and remove environmental components which are not associated with an odor class. In this sense it is roughly analogous to the clutter category present in some image datasets. Control data was taken on a semi-weekly basis over the entire 80-day period during which the other 2 datasets were acquired. This was done in order to capture as much environmental variation as possible. Half of the control data is used for modeling while the other half is used for verification purposes.
Together these 3 sets contain 250 hours of data spanning 130 odor categories and 2 environments. The first 130 hours of data were acquired over a 40-day period in a typical laboratory environment with temperatures of 22.0-25.0 °C and 36-52% humidity. Over the subsequent 40 days the remaining 120 hours were acquired on a rooftop and balcony with temperatures ranging from 10.1-24.7 °C and 29-81% humidity. On time-scales of 2 hours the average change in temperature and humidity were 0.11 °C and 0.4% in the laboratory and 0.61 °C and 1.9% outdoors. Thus the environmental variation outdoors was roughly 5 times greater than indoors.
Figure 6.2: (a) A sniff consists of 7 individual subsniffs s1...s7 of sensor data taken as the valve switches between a single odorant and reference air. From this data a 7 × 4 = 28 size feature m is generated representing the measured power in each of the 7 subsniffs i over 4 fundamental harmonics j. For comparison purposes a simple amplitude feature differences the top and bottom 5% quantiles of ∆R/R in each subsniff. (b) As the switching frequency f increases by powers of 2 so does the number of pulses, so that the time period T is constant for all but the first subsniff:

si   fi        Ti     Ni   Ttot
s1   1/64 Hz   64 s   1    64 s
s2   1/32      32     1    32
s3   1/16      16     2    32
s4   1/8       8      4    32
s5   1/4       4      8    32
s6   1/2       2      16   32
s7   1         1      32   32

(c) To illustrate how m is measured we show the harmonic decomposition of just s4, highlighted in (a). The corresponding measurements m4j are the integrated spectral power for each of 4 harmonics. Higher-order harmonics suffer from attenuation due to the limited time-constant of the sensors but have the advantage of being less susceptible to slow signal drift. Fitting a 1/f^n noise spectrum to the average indoor and outdoor frequency response of our sensors in the absence of any odorants illustrates why higher-frequency switching and higher-order harmonics may be especially advantageous in outdoor environments.
Figure 6.3: Visual representation of the harmonic decomposition feature m for 2 wines, 2 lemon parts and 2 teas from the Common Household Odors Dataset. Each odorant is sampled 4 times on 2 different days in 2 separate environments. Each box represents one complete 400 s sniff reduced to a 280-dimensional feature vector. Within each box, the 10 rows (y axis) show the response of different sensors over 28 frequencies (x axis) corresponding to 7 subsniffs and 4 harmonics. For visual clarity the columns are sorted by frequency and rows are sorted so that adjacent sensors are maximally correlated.
Chapter 7
Machine Olfaction: Results
The first four experiments described below are designed to test the effect of sniffing frequency, sensor array size, feature extraction and sensor stability on classification performance over a broad range of odor categories. For instrument design purposes it is useful to isolate those features and feature subsets which are most effective for a given classification task in a discriminative machine-learning framework.
Fig. 6.3 shows features extracted for 6 specific odorants in 3 broader odor categories: wine, lemon and tea. Different kinds of tea seem to be easily distinguishable from wine and lemon but less distinguishable from one another. A fifth experiment seeks to quantify this intuition that certain odor categories are more readily differentiated than others and incorporate this into a learning framework.
In each experiment 4 presentations per odor category are separated into randomly selected training and testing sets of 2 sniffs each. Each sniff is reduced to a feature vector m and an SVM1 is used for final classification. In several experiments feature vectors m_ik are also generated for comparison purposes. Both features are pre-processed by normalizing them to unit length and projecting out the first two principal components of the control data, which together account for 83% of the feature variability when no odorants are present.
1 Specifically the LIBLINEAR package of [HCLK08, FCHW08].
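A minimal sketch of this pre-processing step (unit normalization plus projecting out the leading control-data components), using a plain SVD in place of whatever PCA implementation was actually used:

```python
import numpy as np

def preprocess(features, control, n_components=2):
    """Normalize feature vectors to unit length, then project out the first
    n_components principal components of the control (no-odorant) data."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    C = control - control.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    P = Vt[:n_components]                    # top principal directions
    return F - (F @ P.T) @ P                 # remove those directions

# Hypothetical usage with random 280-dimensional features
rng = np.random.default_rng(2)
X = preprocess(rng.normal(size=(100, 280)), rng.normal(size=(40, 280)))
```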
Performance is averaged over randomized data subsets of Ncat = 2, 4, ... odor categories up to the maximum number of categories in the set. Classification error naturally increases with Ncat as the task becomes more challenging and the probability of randomly guessing the correct odorant decreases. In addition to random category groupings, the final test clusters odorants to examine classification performance for top-down category groupings.
7.1 Subsniff Frequency
As a practical matter, two fundamental limiting factors in our experiments are the amount of time required to prepare odorant chambers and to sample their contents. An unnecessarily long sampling procedure limits the usefulness of machine olfaction in many real-world applications. Reducing the duration of a sniff is thus highly worthwhile if it does not significantly impact classification accuracy. Our first test measures performance using only a limited number of subsniff frequencies. The results for each dataset in both indoor and outdoor environments are presented in Fig. 7.1.
A complete sniff is divided into 4 overlapping 200 s time segments. Each covers a different range of modulation frequencies from 1 - 1/8 Hz for the fastest segment to 1/16 - 1/64 Hz for the slowest. Fig. 7.1 compares classification results using features constructed from each time segment as well as the entire 400 s sniff.
Averaging the CHOD and UPSIT results in both environments, overall performance for Ncat = 4 drops by 5.6%, 5.1%, 8.3% and 24.4% when using 200 s of data in a progressively slower range of modulation frequencies. For Ncat = 16 this drop is more significant:
Figure 7.1: Classification performance for the University of Pittsburgh Smell Identification Test (UPSIT) and Common Household Odors Dataset (CHOD) for different sniff subsets using 4 and 16 categories for training and testing. For control purposes data is also acquired with empty odorant chambers. Compared with using the entire sniff (top), the high-frequency subsniffs (2nd row) outperform the low-frequency subsniffs (bottom) especially for Ncat = 16. Dotted lines show performance at the level of random guessing.
9.5%, 10.6%, 17.2% and 41.2% respectively. At least for the polymer composite sensors we are using, this suggests that the low-frequency subsniffs contribute relatively little to classification performance.
This is perhaps not surprising considering that the mean spectrum of background noise in the control data is skewed towards lower frequencies (Fig. 6.2c). While this noise spectrum depends partially on the type of sensor used, it is also symptomatic of the kinds of slow linear drifts observed in both temperature and humidity throughout the tests. Other sensors that are sensitive to such drifts may also benefit from faster switching so long as the switching frequency does not far exceed the cutoff imposed by sensor time constants. In our case these time constants range from 0.1 s for the fastest sensor to 1 s for the slowest.
7.2 Number of Sensors
Another important design consideration is the number and variety of sensors required for a given classification task. Our second test measures the classification error while gradually increasing the number of sensors from 2 up to the full array of 10.
As shown in Fig. 7.2, the marginal utility of adding additional sensors depended on the difficulty of the task: performance in outdoor conditions or with a large number of odor categories showed the most improvement. Across the board however the Control data classification error consistently increased as sensors were added, approaching closer to the expected level of random chance. Averaged over all values of Ncat the Outdoor Control error was 17% less than what would be expected from random chance when 10 sensors were used, compared to 58% less using only 2 of the available sensors. The positive detection of
Figure 7.2: Classification error for all three datasets taken indoors and outdoors while varying the number of sensors and the number of categories used for training and testing. Each dotted colored line represents the mean performance over randomized subsets of 2, 4, 6 and 8 sensors out of the available 10. To illustrate this for a single value of Ncat, gray vertical lines mark the error averaged over randomized sets of 16 odor categories for the indoor and outdoor datasets. When the number of sensors increased from 4 to 10, the indoor error (left line) decreased by < 2% for the CHOD and UPSIT while the outdoor error (right line) decreased by 4-7%. The Control error is also important because deviations from random chance when no odor categories are present may suggest sensitivity to environmental factors such as water vapor. Here the indoor error for both 4 and 10 sensors remains consistent with 93.75% random chance while the outdoor error rises from 85.9% to 91.7%.
distinct odor categories where none are present may suggest either overfitting or sensitivity to extraneous environmental factors such as water vapor. Thus the use of additional sensors seems to be important for background rejection in outdoor environments even when the reduction in classification error for the other datasets is only marginal.
7.3 Feature Performance
For each individual sensor, the feature extraction process converts 400 s or 20,000 samples of time-stream data per sniff into a compact array m_ij of 28 values representing the total spectral response over multiple harmonics of the sniffing frequency. An even smaller feature m_i measures only the amplitude of the sensor response within each of the 7 subsniffs. Our third test compares classification accuracy for both features to determine whether measuring the spectral response of the sensor over a broad range of harmonics yields any compelling benefits.
For Ncat = 4 the CHOD and UPSIT classification errors are 8.7% and 26.2% indoors and 27.6% and 32.2% outdoors using the spectral response feature m_ij. When the amplitude-based feature m_i is used these errors increase to 27.3% and 31.9% indoors and 36.8% and 51.3% outdoors respectively. As shown in Fig. 7.3 the amplitude-based feature continues to underperform the spectral response feature across all values of Ncat. It also shows a greater tendency to make spurious classifications in the absence of odorants, with detection rates on the Control Dataset that are 30-75% higher than random chance.
Relative to human performance on the UPSIT, the electronic nose's indoor error of 26-32%
is comparable to that of test subjects 70-79 years of age. Subjects 10-59 years of age
outperform the electronic nose with only 4-18% error, whereas those over 80 years show
mean error rates in excess of 36% [DSKa84].
[Figure 7.3 appears here: Indoor, Outdoor and Indoor/Outdoor panels plotting classification error (%) versus Ncats for the harmonic and amplitude features on the CHOD, UPSIT and Control datasets, along with human UPSIT error for subjects aged 10-59 and 70-79 and the random-chance level.]
Figure 7.3: Classification error using features based on sensor response amplitude and harmonic decomposition. For comparison we also show the UPSIT testing error [DSKa84] for human test subjects 10-59 years of age (who perform better than our instrument) and 70-79 years of age (who perform roughly the same). The combined Indoor/Outdoor dataset uses data taken indoors and outdoors as separate training and testing sets.
7.4 Feature Consistency
Next we would like to test whether the spectral response features are reproducible enough to
be used for classification across different environments over timescales of several months.
The Indoor/Outdoor dataset featured in the rightmost plot of Fig. 7.3 uses a classifier
trained on data taken indoors between October 3 and November 18 to test data taken
outdoors between November 19 and December 26. For comparison, the center plot uses the
outdoor dataset for both training and testing. Classification errors for the
Indoor/Outdoor CHOD are 8-14% higher than for the Outdoor CHOD, while those for the
UPSIT are 3-25% higher.
From this test alone it is unclear how much of this increase in classification error is
due to the change in environment and how much is due to sensor degradation. However,
polymer-carbon sensor arrays similar to ours have been shown to exhibit response changes
of less than 10% over time periods of 15-18 months [RMG+10]. What is less well understood
is how classification error is affected when training data acquired in an indoor laboratory
environment is used for testing in an uncontrolled outdoor environment. This is roughly
analogous to the visual classification task of using images taken under controlled lighting
conditions in a relatively clutter-free environment to classify object categories in more
complex outdoor scenes with variable lighting, occlusion, etc.
Compared with the amplitude response feature (dotted lines), the full spectral response
of the sensor provides a feature that is significantly more accurate and more robust for
classification across indoor and outdoor environments.
7.4.1 Top-Down Category Recognition
The results presented so far are averaged over randomized subsets of Ncat categories. This
is an especially relevant benchmark when one does not know in advance what categories
to expect during testing. Such a benchmark does not, however, reveal how classification
performance changes from category to category, or how a given category classification may
be refined.
Broadly speaking the odor categories in the CHOD can be divided into four main
groups: food items, beverages, vegetation and miscellaneous household items. One may
of course make finer distinctions within each category, such as food items that are cheeses
or fruits, but such distinctions are inherently arbitrary and vary significantly according to
personal bias. Even a taxonomy such as WordNet [SR98], which groups words by meaning,
may or may not be relevant to the olfactory classification task. The fact that coffee and tea
are both in the "beverages" category, for example, does not provide any real insight into
whether they will emit similar odors.
A more experimentally meaningful taxonomy may be created using the inter-category
confusion observed during classification. This is represented as a matrix C_ij which counts
how often a member of category i is classified as belonging to category j. Diagonal
elements record the rate of correct classifications for each category while off-diagonal
elements count misclassifications. Hierarchically clustering this matrix results in a
taxonomy where successive branches represent increasingly difficult classification tasks.
As this process continues, categories that are most often confused ideally end up as
adjacent leaves on the tree.
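A minimal sketch of this recursive splitting follows, with ordinary spectral clustering standing in for the self-tuning variant introduced below; build_taxonomy is a hypothetical helper, and the symmetrization of C into an affinity matrix is one reasonable choice rather than the thesis's exact recipe.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def build_taxonomy(C, cats):
        # Recursively split a set of categories in two, grouping those that
        # the classifier confuses most often. C is the confusion matrix
        # restricted to cats (rows = true category, columns = predicted).
        if len(cats) <= 1:
            return cats[0]
        A = (C + C.T) / 2.0              # symmetric affinity: mutual confusion
        np.fill_diagonal(A, 0.0)
        labels = SpectralClustering(
            n_clusters=2, affinity='precomputed').fit_predict(A + 1e-9)
        left = np.flatnonzero(labels == 0)
        right = np.flatnonzero(labels == 1)
        if len(left) == 0 or len(right) == 0:
            return list(cats)            # degenerate split: stop recursing
        return (build_taxonomy(C[np.ix_(left, left)], [cats[i] for i in left]),
                build_taxonomy(C[np.ix_(right, right)], [cats[i] for i in right]))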
Following our work with the Caltech-256 image dataset [GP08] we create a taxonomy
of odor categories by recursively clustering the olfactory confusion matrix via self-tuning
spectral clustering [PZM04]. Results for the Indoor CHOD are shown in Fig. 7.4. A binary
classifier is generated to evaluate membership in each branch of the tree. This is done
by randomly selecting 2 training examples per category and assigning positive or negative
labels depending on whether the category belongs to the branch. The remaining examples
are then used to evaluate the performance of each classifier.
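The per-branch evaluation could then be sketched as follows, with a linear SVM standing in for whatever classifier is actually used; branch_accuracy, X, y and branch_cats are hypothetical names.

    import numpy as np
    from sklearn.svm import LinearSVC

    def branch_accuracy(X, y, branch_cats, n_train=2, seed=0):
        # Train a binary "does this odorant belong to the branch?" classifier
        # on n_train randomly chosen examples per category, then score it on
        # all remaining examples.
        rng = np.random.default_rng(seed)
        train = []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            train.extend(rng.permutation(idx)[:n_train])
        train = np.array(train)
        test = np.setdiff1d(np.arange(len(y)), train)
        labels = np.isin(y, list(branch_cats))   # positive iff category in branch
        clf = LinearSVC().fit(X[train], labels[train])
        return clf.score(X[test], labels[test])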
With branch nodes color-coded by performance, the taxonomy reveals what randomized
subsets do not, namely, which individual categories and super-categories are detectable by
the instrument at a given performance threshold. The clustering process is prone to errors
in part because of uncertainty in the individual elements of the confusion matrix. Some
odorants, such as individual flowers and cheeses, were practically undetectable with our
instrument, making it impossible to establish their taxonomic relationships with any
certainty. Other odorants with low individual detection rates showed relatively high
inter-category confusion. For example, all the spices except mustard are located on a
single sub-branch which can be detected with 42% accuracy even though the individual spice
categories in that branch all have detection rates below 5%.
Thus while it is possible to make refined guesses for some categories, other "undetectable"
categories are in fact detectable only when pooled into meaningful super-categories.
Building a top-down classification taxonomy for a given instrument provides the flexibility
to exchange classifier performance for specificity depending on the odor categories and
application requirements.
[Figure 7.4 appears here: the automatically generated odor taxonomy, a binary tree whose leaves are the CHOD categories (e.g., beverage:alcoholic:wine:Chardonnay, food:fruit:rosids:citrus:lemon:juice, vegetation:herbs:mint) and whose branch nodes are colored by classification performance on a 0-100% scale.]
Figure 7.4: The confusion matrix for the Indoor Common Household Odor Dataset is used to automatically generate a top-down hierarchy of odor categories. Branches in the tree represent splits in the confusion matrix that minimize intercluster confusion. As the depth of the tree increases with successive splits, the categories in each branch become more and more difficult for the electronic nose to distinguish. The color of each branch node represents the classification performance when determining whether an odorant belongs to that branch. This helps characterize the instrument by showing which odor categories and super-categories are readily detectable and which are not. Highlighted categories show the relationships discovered between the wine, lemon and tea categories whose features are shown in Fig. 3. The occurrence of wine and citrus categories in the same top-level branch indicates that these odor categories are harder to distinguish from one another than they are from tea.
Chapter 8
Machine Olfaction: Discussion
In this chapter we explore several design parameters for an electronic nose with the goal of
optimizing performance while minimizing environmental sensitivity. The spectral response
profiles of a set of 10 carbon black-polymer composite thin-film resistors were directly
measured using a portable apparatus which switches between reference air and odorants
over a range of frequencies. Compared to a feature based only on the fractional change in
sensor resistance, the spectral feature gives significantly better classification performance
while remaining relatively invariant to water vapor fluctuations and other environmental
systematics.
After acquiring two 400 s sniffs of every odorant in a set of 90 common household odor
categories, the instrument is presented with unlabeled odorants, each of which it also
sniffs twice. The features extracted from these sniffs are used to select the most likely
category label out of Ncat options. Given Ncat = 4 possible choices and an indoor training
set, the correct label was found 91% of the time indoors and 72% of the time outdoors
(compared to 25% for random guessing). Fig. 7.3 shows how the classification error
increases with Ncat as the task becomes more difficult. The instrument's score on the
UPSIT is roughly comparable to scores obtained from elderly human subjects.
Sampling 130 different odor categories in both indoor and outdoor environments required
250 hours of data acquisition and roughly an equal amount of time purging, cleaning
and preparing sample chambers. Fortunately, high-frequency subsniffs in the 1-1/8 Hz
range provided 50% better olfactory classification performance than an equal time-segment
of relatively low-frequency subsniffs in the 1/16-1/64 Hz range. By focusing on higher
frequencies, sniff time can be cut in half with only a marginal 5-10% drop in overall
performance.
Judging from progress in the fields of machine vision and olfactory psychophysics, it is
reasonable to expect that the number and variety of odorants used in electronic nose
experiments will only increase with time. Hierarchical classification frameworks scale well
to large numbers of categories and provide error rates for specific categories as well as
super-categories. There are many potential advantages to such an approach, including the
ability to predict category performance at different levels of specificity. Identifying
easily confused category groupings and sub-groupings may also reveal instrumental "blind
spots" that can then be addressed by employing different sensor technologies, sniffing
techniques or feature extraction algorithms.
[Samples]
UPSIT Categories: pizza, bubble gum, menthol, cherry, motor oil, mint, banana, clove,
leather, coconut, onion, fruit punch, licorice, cheddar cheese, cinnamon, gasoline,
strawberry, cedar, chocolate, ginger, lilac, turpentine, peach, root beer, dill pickle,
pineapple, lime, orange, wintergreen, watermelon, paint thinner, grass, smoke, pine, grape,
lemon, soap, natural gas, rose, peanut
CHOD Categories: alcohol, allspice, apple, apple juice, avocado, banana, basil, bay
leaves, beer (Guinness Extra Stout), bleach (regular, chlorine-free and lavender),
cardboard, cayenne pepper, cheese (cheddar, provolone, swiss), chili powder, chlorinated
water, chocolate (milk and dark), cilantro, cinnamon, cloves, coffee (Lavazza, Trader Joe's
house blend dark), espresso (Lavazza, Trader Joe's house blend dark), cottage cheese,
Equal, flowers (Rosa Rosideae, Ageratum houstonianum, Achillea millefolium), gasoline,
Gatorade (orange), grapes, grass, honeydew melon, hydrogen peroxide, kiwi fruit, lavender,
lemon (juice, pulp, peel), lime (slice, peel only, pulp only), mango, melon, milk (2%),
mint, mouth rinse, mustard (powder and French's yellow), orange juice, paint thinner,
parsley, peanut butter, pine, pineapple, raspberries, red pepper, rice, rosemary, salt,
soy milk (regular and vanilla), soy sauce, strawberry, sugar, tea (Genmai Cha, English
Breakfast, Irish Breakfast, Russian Caravan), toasted sesame oil, tomato, tuna, vanilla
cookie fragrance oil, vanilla extract, vinegar (apple, distilled, red wine, rice), Windex
(regular and vinegar), wine (Cabernet Sauvignon, Chardonnay, Moscato, White Zinfandel)
Bibliography
[AB03] S Ampuero and J O Bosset. The electronic nose applied to dairy products: a review. Sensors and Actuators B: Chemical, 94(1):1–12, August 2003.
[ABM04] Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. Technical Report UCB/CSD-04-1366, EECS Department, University of California, Berkeley, 2004.
[AGF04] Yali Amit, Donald Geman, and Xiaodong Fan. A coarse-to-fine strategy for multiclass shape detection. IEEE Trans. Pattern Anal. Mach. Intell., 26(12):1606–1621, 2004.
[AMa01] KJ Albert, ML Myrick, et al. Field-deployable sniffer for 2,4-dinitrotoluene detection. Environmental . . ., 2001.
[AMHP07] Anelia Angelova, Larry Matthies, Daniel Helmick, and Pietro Perona. Learning slip behavior using automatic mechanical supervision. In 2007 IEEE International Conference on Robotics and Automation, pages 1741–1748. IEEE, 2007.
[BBM05] Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. CVPR, 1:26–33, 2005.
[Bie87] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[BL97] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In CVPR, pages 1000–1006, 1997.
[BL03] Shawn M Briglin and Nathan S. Lewis. Characterization of the Temporal Response Profile of Carbon Black-Polymer Composite Detectors to Volatile Organic Vapors. J. Phys. Chem. B, 107(40):11031–11042, October 2003.
[BMK11] Alexander Binder, Klaus-Robert Müller, and Motoaki Kawanabe. On Taxonomies for Multi-class Image Categorization. International Journal of Computer Vision, 99(3):281–301, January 2011.
[BO04] K Brudzewski and S Osowski. Classification of milk by means of an electronic nose and SVM neural network. Sensors and Actuators B: Chemical, 2004.
[BP96] Michael C. Burl and Pietro Perona. Recognition of planar object classes. In CVPR, pages 223–230, 1996.
[CdWL+98] W S Cain, R de Wijk, C Lulejian, F Schiet, et al. Odor identification: perceptual and semantic dimensions. Chemical Senses, 23(3):309–326, June 1998.
[CLTW10] M J Choi, J J Lim, A Torralba, and A S Willsky. Exploiting hierarchical context on a large database of object categories. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 129–136, 2010.
[Dav75] R G Davis. Acquisition of Verbal Associations to Olfactory Stimuli of Varying Familiarity and to Abstract Visual Stimuli. Journal of Experimental Psychology: Human Learning . . ., 1975.
[DAZa85] R L Doty, S Applebaum, H Zusho, et al. Sex differences in odor identification ability: a cross-cultural analysis. Neuropsychologia, 1985.
[DDNPa08] A D'Amico, C Di Natale, Roberto Paolesse, et al. Olfactory systems for medical applications. Sensors and Actuators B: Chemical, 2008.
[DDS+09] Jia Deng, Wei Dong, R Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pages 248–255. IEEE, 2009.
[DHG+03] Ritaban Dutta, E L Hines, J W Gardner, K R Kashwan, and M Bhuyan. Tea quality prediction using a tin oxide-based electronic nose: an artificial intelligence approach. Sensors and Actuators B: Chemical, 94(2):228–237, September 2003.
[DNDD96] C Di Natale, FAM Davide, and A D'Amico. An electronic nose for the recognition of the vineyard of a red wine. Sensors and Actuators B: Chemical, 1996.
[DNPa07] C Di Natale, Roberto Paolesse, et al. Metalloporphyrins based artificial olfactory receptors. Sensors and Actuators B: Chemical, 2007.
[Dot97] Richard L Doty. Studies of Human Olfaction from the University of Pennsylvania Smell and Taste Center. Chemical Senses, 22(5):565–586, 1997.
[Dra85] A Dravnieks. Atlas of odor character profiles. ASTM data series, ed. ASTM Committee E-18 on Sensory Evaluation of Materials and Products, Section E-18.04.12 on Odor Profiling, page 354, 1985.
[DRNP94] M Dragovan, J E Ruhl, G Novak, and S R Platt. The Astrophysical Journal, 427:L67, 1994.
[DSa84a] R L Doty, P Shaman, et al. Development of the University of Pennsylvania Smell Identification Test: a standardized microencapsulated test of olfactory function. Physiology & Behavior, 1984.
[DSA+84b] R L Doty, P Shaman, S L Applebaum, R Giberson, L Siksorski, and L Rosenberg. Smell identification ability: changes with age. Science, 226(4681):1441–1443, 1984.
[DSKa84] R L Doty, P Shaman, C P Kimmelman, et al. University of Pennsylvania Smell Identification Test: a rapid quantitative olfactory function test for the clinic. The Laryngoscope, 1984.
[DVHV01] T Dewettinck, K Van Hege, and W Verstraete. The electronic nose as a rapid sensor for volatile compounds in treated domestic wastewater. Water Research, 35(10):2475–2483, 2001.
[E+05] Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In MLCW, pages 117–176, 2005.
[Eea06] Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (PASCAL Workshop 05), number 3944 in Lecture Notes in Artificial Intelligence, pages 117–176, Southampton, UK, 2006.
[EFL+05] A Eibenstein, A B Fioretti, C Lena, N Rosati, G Amabile, and M Fusetti. Modern psychophysical tests to assess olfactory function. Neurological Sciences, 26(3):147–155, July 2005.
[EZWVG06] M Everingham, A Zisserman, C Williams, and L Van Gool. The PASCAL visual object classes challenge 2006 (VOC 2006) results, 2006.
[Fan05] Xiaodong Fan. Efficient multiclass object detection by a hierarchy of classifiers. In CVPR (1), pages 716–723, 2005.
[FCHW08] R E Fan, K W Chang, C J Hsieh, and X R Wang. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 2008.
[Fel98] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May 1998.
[Fer05] R. Fergus. Visual Object Category Recognition. PhD thesis, University of Oxford, 2005.
[FFFP04] Li Fei-Fei, R Fergus, and P Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference on, page 178, 2004.
[FG01] Francois Fleuret and Donald Geman. Coarse-to-fine face detection. International Journal of Computer Vision, 41(1/2):85–107, 2001.
[FPZ03] Robert Fergus, Pietro Perona, and Andrew Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[Ga01] P Gostelow et al. Odour measurements for sewage treatment works. Water Research, 2001.
[Gar04] J W Gardner. Electronic noses & sensors for the detection of . . ., 2004.
[GD05a] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, pages 1458–1465, 2005.
[GD05b] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, pages 1458–1465, 2005.
[GD06] Kristen Grauman and Trevor Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR (1), pages 19–25, 2006.
[GHP06] Greg Griffin, Alex Holub, and Pietro Perona. Caltech-256 image dataset. Technical report, Computation and Neural Systems Dept., 2006.
[GHP07] G Griffin, A Holub, and P Perona. Caltech-256 object category dataset, 2007.
[GP08] G Griffin and P Perona. Learning and using taxonomies for fast visual categorization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008.
[GS92] JW Gardner and HV Shurmer. Application of an electronic nose to the discrimination of coffees. Sensors and Actuators B: Chemical, 1992.
[GSH00] J W Gardner, H W Shin, and E L Hines. An electronic nose system to diagnose illness. Sensors and Actuators B: Chemical, 2000.
[GSHD00] Julian W Gardner, Hyun Woo Shin, Evor L Hines, and Crawford S Dow. An electronic nose system for monitoring the quality of potable water. Sensors and Actuators B: Chemical, 69(3):336–341, October 2000.
[GVMSB09] Mahdi Ghasemi-Varnamkhasti, Seyed Saeid Mohtasebi, Maryam Siadat, and Sundar Balasubramanian. Meat Quality Assessment by Electronic Nose (Machine Olfaction Technology). Sensors, 9(8):6058–6083, August 2009.
[HCLK08] C J Hsieh, K W Chang, C J Lin, and S S Keerthi. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[HWP05] Alex Holub, Max Welling, and Pietro Perona. Combining generative models and Fisher kernels for object recognition. In ICCV, pages 136–143, 2005.
[KKER11] Alexei A Koulakov, Brian E Kolterman, Armen G Enikolopov, and Dmitry Rinberg. In search of the structure of human olfactory space. Frontiers in Systems Neuroscience, 5:65, January 2011.
[KWTT03] John Kauer, Joel White, Timothy Turner, and Barbara Talamo. Principles of Odor Recognition by the Olfactory System Applied to Detection of Low-Concentration Explosives. January 2003.
[Lan86] D Lancet. Vertebrate olfactory reception. Annual Review of Neuroscience, 1986.
[LEZP99] T S Lorig, D G Elmes, D H Zald, and J V Pardo. A computer-controlled olfactometer for fMRI and electrophysiological studies of olfaction. Behavior Research Methods, 31(2):370–375, 1999.
[LFP04a] F.F. Li, R. Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision (WGMBV), 2004.
[LFP04b] F.F. Li, R. Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR (1), 2004.
[Li05] Tao Li. Hierarchical document classification using automatically generated hierarchy. In SDM, 2005.
[Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
[LS04] Bastian Leibe and Bernt Schiele. Scale-invariant object categorization using a scale-adaptive mean-shift search. In DAGM-Symposium, pages 145–153, 2004.
[LSH05] J Lozano, J Santos, and M Horrillo. Classification of white wine aromas with an electronic nose. Talanta, 67(3):610–616, September 2005.
[LSP06a] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[LSP06b] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR (2), pages 2169–2178, 2006.
[LYW+05] Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 7(1):36–43, 2005.
[MGBW+08] S Maldonado, E García-Berríos, MD Woodka, BS Brunschwig, and NS Lewis. Detection of organic vapors and NH3 (g) using thin-film carbon black-metallophthalocyanine composite chemiresistors. Sensors and Actuators B: Chemical, 134(2):521–531, 2008.
[ML06a] Jim Mutch and David G. Lowe. Multiclass object recognition with sparse, localized features. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[ML06b] Jim Mutch and David G. Lowe. Multiclass object recognition with sparse, localized features. In CVPR (1), pages 11–18, 2006.
[MS05] K Mikolajczyk and C Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, October 2005.
[MS07] Marcin Marszałek and Cordelia Schmid. Semantic hierarchies for visual object recognition. In IEEE Conference on Computer Vision & Pattern Recognition, June 2007.
[MS08] Marcin Marszałek and Cordelia Schmid. Constructing Category Hierarchies for Visual Recognition. In Computer Vision - ECCV 2008, pages 479–491. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[MTS+05] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A Comparison of Affine Region Detectors. International Journal of Computer Vision, 2005.
[NFM90] T Nakamoto, K Fukunishi, and T Moriizumi. Identification capability of odor sensor using quartz-resonator array and neural-network pattern recognition. Sensors and Actuators B: Chemical, 1990.
[NMMa99] C Di Natale, A Mantini, Antonella Macagnano, et al. Electronic nose analysis of urine samples containing blood. Physiological . . ., 1999.
[NNM96] S. Nene, S. Nayar, and H. Murase. Columbia Object Image Library (COIL), 1996.
[OP05] A. Opelt and A. Pinz. Object localization with boosting and weak supervision for generic object recognition. In Proceedings of the 14th Scandinavian Conference on Image Analysis (SCIA), 2005.
[Pa02] M Pardo et al. Coffee analysis with an electronic nose. Instrumentation and Measurement, 2002.
[PFSQ01] M Pardo, G Faglia, G Sberveglieri, and L Quercia. In IMTC 2001. Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics, pages 123–127. IEEE, 2001.
[Phi05] M Phillips. Can the Electronic Nose Really Sniff out Lung Cancer? American Journal of Respiratory and Critical Care Medicine, 2005.
[PZM04] P Perona and L Zelnik-Manor. Self-Tuning Spectral Clustering. Advances in Neural Information Processing Systems, 2004.
[RB08] F Rock and N Barsan. Electronic nose: current status and future trends. Chemical Reviews, 2008.
[Ree90] Randall R Reed. How does the nose know? Cell, 60(1):1–2, January 1990.
[RMG+10] M A Ryan, K S Manatt, S Gluck, A V Shevade, A K Kisor, H Zhou, L M Lara, and M L Homer. The JPL electronic nose: Monitoring air in the US Lab on the International Space Station. pages 1242–1247, 2010.
[RSCC+08] J A Ragazzo-Sanchez, P Chalier, D Chevalier, M Calderon-Santoyo, and C Ghommidh. Identification of different alcoholic beverages by electronic nose coupled to GC. Sensors and Actuators B: Chemical, 134(1):43–48, 2008.
[RTMF05] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. MIT AI Lab Memo AIM-2005-025, 2005.
[SCA+11] J P Santos, J M Cabellos, T Arroyo, M C Horrillo, and Perena Gouma. In Proceedings of the 14th International Symposium on Olfaction and the Electronic Nose, AIP Conference Proceedings, pages 171–173. AIP, September 2011.
[Sho04] Stephen Shore. Uncommon Places: The Complete Works. Aperture, 2004.
[Sho05] Stephen Shore. American Surfaces. Phaidon Press, 2005.
[SL03] BC Sisk and NS Lewis. Estimation of chemical and physical characteristics of analyte vapors through analysis of the response data of arrays of polymer-carbon black composite vapor detectors. Sensors and Actuators B: Chemical, 96(1-2):268–282, 2003.
[SR95] S L Sullivan and K J Ressler. Spatial patterning and information coding in the olfactory system. Current Opinion in Genetics & Development, 1995.
[SR98] Michael M Stark and Richard F Riesenfeld. WordNet: An electronic lexical database, 1998.
[SWP05] Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. In CVPR (2), pages 994–1000, 2005.
[TM04] Anthony P F Turner and Naresh Magan. Innovation: Electronic noses and disease diagnostics. Nature Reviews Microbiology, 2(2):161–166, February 2004.
[TMF04] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 762–769, Washington, DC, June 2004.
[TRY10] A Torralba, B C Russell, and J Yuen. LabelMe: Online Image Annotation and Applications. In Proceedings of the IEEE, pages 1467–1484, 2010.
[TTA+06] P T Timbie, G S Tucker, P A R Ade, S Ali, E Bierman, E F Bunn, C Calderon, A C Gault, P O Hyland, B G Keating, J Kim, A Korotkov, S S Malu, P Mauskopf, J A Murphy, C O'Sullivan, L Piccirillo, and B D Wandelt. The Einstein polarization interferometer for cosmology (EPIC) and the millimeter-wave bolometric interferometer (MBI). New Astronomy Reviews, 50(11-12):999–1008, December 2006.
[VJ01a] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, 2001.
[VJ01b] Paul A. Viola and Michael J. Jones. Robust real-time face detection. In ICCV, page 747, 2001.
[VJ04] Paul A. Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[VR07] M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proceedings of the IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007.
[WB09] A D Wilson and M Baietto. Applications and advances in electronic-nose technologies. Sensors, 9(7):5099–5148, 2009.
[WWP00] Markus Weber, Max Welling, and Pietro Perona. Unsupervised learning of models for recognition. In ECCV (1), pages 18–32, 2000.
[Yu07] H Yu. Discrimination of LongJing green-tea grade by electronic nose. Sensors and Actuators B: Chemical, 2007.
[ZBMM06a] Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In IEEE Conference on Computer Vision & Pattern Recognition, 2006.
[ZBMM06b] Hao Zhang, Alexander C. Berg, Michael Maire, and Jitendra Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In CVPR (2), pages 2126–2136, 2006.
[ZEA+04] S Zampolli, I Elmi, F Ahmed, M Passini, GC Cardinali, S. Nicoletti, and L. Dori. An electronic nose based on solid state sensor arrays for low-cost indoor air quality monitoring applications. Sensors and Actuators B: Chemical, 101(1):39–46, 2004.
[ZMP04] Lihi Zelnik-Manor and P. Perona. Self-tuning spectral clustering. Eighteenth Annual Conference on Neural Information Processing Systems (NIPS), 2004.
[ZS06] M Zarzo and D T Stanton. Identification of latent variables in a semantic odor profile database using principal component analysis. Chemical Senses, 31(8):713–724, 2006.
[ZW07] Alon Zweig and Daphna Weinshall. Exploiting object hierarchy: Combining models from different category levels. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8, 14-21 Oct. 2007.
[ZZX+06] Qinyi Zhang, Shunping Zhang, Changsheng Xie, Dawen Zeng, Chaoqun Fan, Dengfeng Li, and Zikui Bai. Characterization of Chinese vinegars by electronic nose. Sensors and Actuators B: Chemical, 119(2):538–546, December 2006.