(Rare) Category Detection Using Hierarchical Mean Shift
Pavan Vatturi ([email protected])
Weng-Keen Wong ([email protected])
1. Introduction
• Applications in surveillance, monitoring, scientific discovery, and data cleaning require anomaly detection
• Anomalies are often identified as statistically unusual data points
• Many anomalies, however, are simply irrelevant or correspond to known sources of noise
1. Introduction
Known objects (99.9% of the data)
Anomalies (0.1% of the data)
Uninteresting (99% of anomalies)
Interesting (1% of anomalies)
Pictures from: Sloan Digital Sky Survey (http://www.sdss.org/iotw/archive.html)
Pelleg, D. (2004). Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD Thesis, Carnegie Mellon University.
1. Introduction
Category Detection [Pelleg and Moore 2004]: human-in-the-loop exploratory data analysis
Data Set → Build Model → Spot Interesting Data Points → Ask User to Label Categories of Interesting Data Points → Update Model with Labels (loop)
1. Introduction
Data Set → Build Model → Spot Interesting Data Points → Ask User to Label Categories of Interesting Data Points → Update Model with Labels (loop)
User can:
• Label a query data point under an existing category
• Or declare that the data point belongs to a previously undeclared category
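The interactive protocol can be sketched in code. This is a minimal simulation, not the authors' system: `category_detection_loop` and `farthest_first` are illustrative names, and the "user" is simulated by ground-truth labels standing in for the human who names each queried point's category.

```python
import numpy as np

def category_detection_loop(X, labels, rank_next, max_queries=50):
    """Run the human-in-the-loop protocol with a simulated user: the
    ground-truth labels stand in for the human labeling each query."""
    queried, discovered = [], set()
    for _ in range(max_queries):
        i = rank_next(X, queried)          # spot an interesting point
        queried.append(i)
        discovered.add(int(labels[i]))     # user names its category
        if len(discovered) == len(set(labels.tolist())):
            break                          # every category seen at least once
    return queried, discovered

def farthest_first(X, queried):
    """Toy stand-in ranking: query the point farthest from all queried ones."""
    if not queried:
        return 0
    d = np.min(np.linalg.norm(X[:, None] - X[queried], axis=2), axis=1)
    d[queried] = -1.0                      # never re-query a point
    return int(np.argmax(d))

np.random.seed(0)
X = np.vstack([np.random.randn(50, 2),            # common class 0
               np.random.randn(50, 2) + 8,        # common class 1
               [[20.0, 20.0]]])                   # rare class 2 (one point)
labels = np.array([0] * 50 + [1] * 50 + [2])
queries, found = category_detection_loop(X, labels, farthest_first)
```

Even this naive ranking finds the far-away rare point quickly on toy data; the point of the slides is that a principled ranking is needed when rare classes are not so conveniently isolated.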
1. Introduction
• Goal: present the user with a single instance from each category in as few queries as possible
• Rare categories are difficult to detect when the class imbalance is severe
• For anomaly detection, the rare categories are the ones of interest
Outline
1. Introduction
2. Related Work
3. Background
4. Methodology
5. Results
6. Conclusion / Future Work
2. Related Work
• Interleave [Pelleg and Moore 2004]
• Nearest-Neighbor-based active learning for rare-category detection for multiple classes [He and Carbonell 2008]
• Multiple output identification [Fine and Mansour 2006]
3. Background: Mean Shift [Fukunaga and Hostetler 1975]
[Figure: a query point is shifted toward the center of mass of the reference data set; the mean shift vector follows the density gradient]
Mean shift vector with kernel profile k:

m(x) = \frac{\sum_{i=1}^{n} x_i \, k'\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} k'\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x
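One way to read the formula: the next position is the kernel-weighted center of mass of the reference set, and the mean shift vector is that center minus the current query point. Below is a minimal sketch using a Gaussian kernel (for which the derivative profile k' is proportional to the kernel itself, so constants cancel in the ratio); `mean_shift_vector` is an illustrative name, not the authors' code.

```python
import numpy as np

def mean_shift_vector(x, X, h):
    """One mean shift step with a Gaussian kernel: the kernel-weighted
    center of mass of the reference set X, minus the query point x."""
    w = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))  # kernel weights
    return (w[:, None] * X).sum(axis=0) / w.sum() - x

np.random.seed(0)
X = np.random.randn(200, 2)        # one cluster centered near the origin
x = np.array([2.0, 2.0])           # query point off to the side
for _ in range(50):                # follow the density gradient uphill
    x = x + mean_shift_vector(x, X, h=1.0)
```

After the iterations the query point sits near the density mode of the cluster, illustrating the convergence shown on the next slide.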
3. Background: Mean Shift [Fukunaga and Hostetler 1975]
[Figure: repeated mean shift steps move the query point through successive centers of mass of the reference data set until it converges to a cluster center]
3. Background: Mean Shift Blurring
[Figure: mean shift blurring applied to the reference data set]
Blurring:
• The query points are the same as the reference data set
• Each iteration progressively blurs the original data set
3. Background: Mean Shift
End result of applying mean shift to a synthetic data set
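Blurring differs from ordinary mean shift in that every data point is simultaneously a query point, so the whole data set contracts toward its modes. A minimal sketch (illustrative names, O(n²) per step, no speedups):

```python
import numpy as np

def blur_step(X, h):
    """One mean shift blurring step: the query set IS the reference set,
    so every point moves to the Gaussian-weighted center of mass of the
    current data, progressively smoothing it."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-0.5 * D2 / h ** 2)
    return W @ X / W.sum(axis=1, keepdims=True)

np.random.seed(1)
X = np.vstack([np.random.randn(30, 2),            # cluster near the origin
               np.random.randn(30, 2) + 6])       # cluster near (6, 6)
for _ in range(20):
    X = blur_step(X, h=1.0)
```

With a bandwidth comparable to the cluster width, each cluster collapses to (essentially) a single point while the two clusters stay apart, matching the synthetic-data picture described above.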
4. Methodology: Overview
1. Sphere the data
2. Hierarchical Mean Shift
3. Query user
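The slides do not spell out step 1; a common choice for sphering (assumed here, not confirmed by the deck) is to whiten the data to zero mean and identity covariance, so that a single bandwidth h is meaningful across dimensions with very different scales. `sphere` is an illustrative name.

```python
import numpy as np

def sphere(X, eps=1e-8):
    """Sphere (whiten) the data: subtract the mean and rotate/rescale so
    the sample covariance becomes the identity matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # cov = V diag(vals) V^T
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

np.random.seed(2)
X = np.random.randn(500, 3) * np.array([10.0, 1.0, 0.1])  # wildly different scales
Z = sphere(X)
```

Without this step, a single h would over-smooth the small-scale dimensions and under-smooth the large-scale ones.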
4. Methodology: Hierarchical Mean Shift
Repeatedly blur the data using mean shift with an increasing bandwidth: h_new = k · h_old
4. Methodology: Hierarchical Mean Shift
• Mean shift complexity is O(n²dm), where
  – n = # of data points
  – d = dimensionality of data points
  – m = # of mean shift iterations
• A single kd-tree optimization is used to speed up Hierarchical Mean Shift
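Putting the pieces together, the hierarchy can be sketched as repeated blurring with a bandwidth that grows by a factor k at each level. This naive version has the O(n²dm) cost stated above; the kd-tree speedup is not included, and `hms` / `count_clusters` are illustrative names, not the authors' implementation.

```python
import numpy as np

def blur(X, h, tol=1e-3, max_iter=30):
    """Blur the data set at bandwidth h until points stop moving."""
    for _ in range(max_iter):
        D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        W = np.exp(-0.5 * D2 / h ** 2)
        X_new = W @ X / W.sum(axis=1, keepdims=True)
        moved = np.abs(X_new - X).max()
        X = X_new
        if moved < tol:
            break
    return X

def count_clusters(X, tol=1e-2):
    """Count distinct cluster centers (points within tol are merged)."""
    centers = []
    for x in X:
        if not any(np.linalg.norm(x - c) < tol for c in centers):
            centers.append(x)
    return len(centers)

def hms(X, h0=0.5, k=1.5, levels=12):
    """Hierarchical mean shift sketch: repeatedly blur the already-blurred
    data with an increasing bandwidth h_new = k * h_old, recording the
    number of surviving cluster centers at each level."""
    X, h, counts = X.copy(), h0, []
    for _ in range(levels):
        X = blur(X, h)
        counts.append(count_clusters(X))
        h *= k
    return counts

np.random.seed(3)
X = np.vstack([np.random.randn(20, 2) * 0.3,          # one dense cluster
               np.random.randn(20, 2) * 0.3 + 5])     # a second cluster
counts = hms(X)
```

As the bandwidth grows, clusters merge until only one remains; the sequence of merges defines the hierarchy that the ranking heuristics below operate on.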
4. Methodology: Querying the User
Rank cluster centers for querying the user.
1. Outlierness [Leung et al. 2000] for cluster C_i:

\mathrm{Outlierness}(C_i) = \frac{\mathrm{Lifetime}(C_i)}{\text{number of data points in } C_i}

Lifetime of C_i = log(bandwidth at which C_i merges with other clusters) − log(bandwidth at which C_i is formed)
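The score is a simple ratio, shown here with illustrative bandwidth values (not figures from the paper): a small cluster that survives many bandwidth levels before merging scores far higher than a large cluster absorbed quickly.

```python
import numpy as np

def outlierness(h_formed, h_merged, n_points):
    """Outlierness of a cluster: its lifetime on the log-bandwidth scale
    divided by its size, so small, long-lived clusters score highest."""
    lifetime = np.log(h_merged) - np.log(h_formed)
    return lifetime / n_points

# Illustrative values: a 2-point cluster surviving from h=0.5 to h=8,
# vs. a 500-point cluster absorbed by h=1.
rare = outlierness(h_formed=0.5, h_merged=8.0, n_points=2)
big = outlierness(h_formed=0.5, h_merged=1.0, n_points=500)
```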
4. Methodology: Querying the User
Rank cluster centers for querying the user.
2. Compactness + Isolation [Leung et al. 2000] for cluster C_i with center p_i:

\mathrm{Compactness}(C_i) = \frac{\sum_{x \in C_i} e^{-\|x - p_i\|^2 / 2h^2}}{\sum_{j} \sum_{x \in C_j} e^{-\|x - p_j\|^2 / 2h^2}}

\mathrm{Isolation}(C_i) = \frac{\sum_{x \in C_i} e^{-\|x - p_i\|^2 / 2h^2}}{\sum_{x} e^{-\|x - p_i\|^2 / 2h^2}}
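A small sketch of these per-cluster scores as reconstructed here (the function name and toy data are illustrative): isolation is the fraction of the kernel mass at a cluster's center contributed by its own points, and compactness is that cluster's own kernel mass relative to all clusters' own masses.

```python
import numpy as np

def compactness_isolation(X, assign, centers, h):
    """Per-cluster compactness and isolation with a Gaussian kernel.
    isolation[i]   = own kernel mass at center p_i / total mass at p_i
    compactness[i] = own kernel mass at p_i / sum over clusters j of
                     their own kernel mass at p_j"""
    own_mass, iso = [], []
    for i, p in enumerate(centers):
        w = np.exp(-np.sum((X - p) ** 2, axis=1) / (2.0 * h ** 2))
        own = w[assign == i].sum()
        own_mass.append(own)
        iso.append(own / w.sum())
    comp = np.array(own_mass) / np.sum(own_mass)
    return comp, np.array(iso)

np.random.seed(4)
X = np.vstack([np.random.randn(100, 2) * 0.5,        # large cluster
               np.random.randn(5, 2) * 0.2 + 10])    # tight, isolated rare cluster
assign = np.array([0] * 100 + [1] * 5)
centers = np.array([X[:100].mean(axis=0), X[100:].mean(axis=0)])
comp, iso = compactness_isolation(X, assign, centers, h=1.0)
```

Both toy clusters are well separated, so both are highly isolated; the rare cluster's low compactness share is exactly what makes it stand out for querying.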
4. Methodology: Tiebreaker
• Ties may occur in Outlierness or Compactness/Isolation values.
• Highest average distance heuristic: choose cluster center with highest average distance from user-labeled points.
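The tiebreaker is a one-liner in practice; `break_ties` and the toy coordinates below are illustrative:

```python
import numpy as np

def break_ties(tied_centers, labeled_points):
    """Highest-average-distance tiebreaker: among tied cluster centers,
    pick the one with the highest average distance to all points the
    user has already labeled."""
    d = np.linalg.norm(tied_centers[:, None] - labeled_points[None, :], axis=2)
    return int(np.argmax(d.mean(axis=1)))

tied = np.array([[0.0, 0.0], [10.0, 10.0]])       # two tied cluster centers
labeled = np.array([[0.5, 0.5], [1.0, 0.0]])      # points already labeled
pick = break_ties(tied, labeled)                  # far-away center is chosen
```

The intuition: a center far from everything already labeled is more likely to sit in an unexplored region, and hence in an undiscovered category.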
5. Results
Name        Dims  Records  Classes  Smallest Class  Largest Class
Abalone        7     4177       20       0.34%          16%
Shuttle        8     4000        7       0.02%          64.2%
OptDigits     64     1040       10       0.77%          50%
OptLetters    16     2128       26       0.37%          24%
Statlog       19      512        7       1.5%           50%
Yeast          8     1484       10       0.33%          31.68%
Shuttle, OptDigits, OptLetters, and Statlog were subsampled to simulate class imbalance.
Data sets used in experiments
5. Results (Yeast)
Category detection metric: # queries before user presented with at least one example from all categories
5. Results (Statlog)
5. Results (OptLetters)
5. Results (OptDigits)
5. Results (Shuttle)
5. Results (Abalone)
5. Results
Dataset     HMS-CI  HMS-CI+HAD  HMS-Out  HMS-Out+HAD  NNDM  Interleave
Abalone       1195          93      603          385    124         193
Shuttle         44          32       36           28    162          35
OptDigits      100         100      160          118    576         117
OptLetters     133         133      161          182    420         489
Statlog         18          20       34          124    228          54
Yeast           73          91      103           77     88         111
Number of queries needed to discover all classes
5. Results
Dataset     HMS-CI  HMS-CI+HAD  HMS-Out   NNDM  Interleave
Abalone      0.835       0.873    0.837  0.846       0.840
Shuttle      0.925       0.929    0.917  0.480       0.905
OptDigits    0.855       0.855    0.840  0.199       0.808
OptLetters   0.936       0.936    0.917  0.573       0.765
Statlog      0.956       0.958    0.944  0.472       0.934
Yeast        0.821       0.805    0.793  0.838       0.778
Area under the category detection curve
6. Conclusion / Future Work
Conclusions:
– HMS-based methods consistently discover more categories in fewer queries than existing methods
– They do not require a priori knowledge of data set properties, e.g. the total number of classes
6. Conclusion / Future Work
Future Work
• Better use of user feedback
• Presentation of an entire cluster to the user instead of a representative data point
• Improved computational efficiency
• Theoretical analysis