
REMIX: Automated Exploration for Interactive Outlier Detection

Yanjie Fu, Missouri U. of Science & Technology, Missouri, USA ([email protected])
Charu Aggarwal, IBM T. J. Watson Research Center, NY, USA ([email protected])
Srinivasan Parthasarathy, IBM T. J. Watson Research Center, NY, USA ([email protected])
Deepak S. Turaga, IBM T. J. Watson Research Center, NY, USA ([email protected])
Hui Xiong, Rutgers University, NJ, USA ([email protected])

ABSTRACT

Outlier detection is the identification of points in a dataset that do not conform to the norm. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm. Extracting domain-relevant insights from outliers needs systematic exploration of these choices since diverse outlier sets could lead to complementary insights. This challenge is especially acute in an interactive setting, where the choices must be explored in a time-constrained manner.

In this work, we present REMIX, the first system to address the problem of outlier detection in an interactive setting. REMIX uses a novel mixed integer programming (MIP) formulation for automatically selecting and executing a diverse set of outlier detectors within a time limit. This formulation incorporates multiple aspects such as (i) an upper limit on the total execution time of detectors, (ii) diversity in the space of algorithms and features, and (iii) meta-learning for evaluating the cost and utility of detectors. REMIX provides two distinct ways for the analyst to consume its results: (i) a partitioning of the detectors explored by REMIX into perspectives through low-rank non-negative matrix factorization; each perspective can be easily visualized as an intuitive heatmap of experiments versus outliers, and (ii) an ensembled set of outliers which combines outlier scores from all detectors. We demonstrate the benefits of REMIX through extensive empirical validation on real-world data.

ACM Reference format:
Yanjie Fu, Charu Aggarwal, Srinivasan Parthasarathy, Deepak S. Turaga, and Hui Xiong. 2017. REMIX: Automated Exploration for Interactive Outlier Detection. In Proceedings of KDD'17, August 13–17, 2017, Halifax, NS, Canada, 9 pages.
DOI: http://dx.doi.org/10.1145/3097983.3098154

1 INTRODUCTION

Outlier detection is the identification of points in a dataset that do not conform to the norm. This is a critical task in data analysis and


is widely used in many applications such as financial fraud detection, Internet traffic monitoring, and cyber security [4]. Outlier detection is highly sensitive to the choice of the detection algorithm and the feature subspace used by the algorithm [4, 5, 35]. Further, outlier detection is often performed on high dimensional data in an unsupervised manner without data labels; distinct sets of outliers discovered through different algorithmic choices could reveal complementary insights about the application domain. Thus, unsupervised outlier detection is a data analysis task which inherently requires a principled exploration of the diverse algorithmic choices that are available.

Recent advances in interactive data exploration [10, 17, 33, 34] show much promise towards automated discovery of advanced statistical insights from complex datasets while minimizing the burden of exploration for the analyst. We study unsupervised outlier detection in such an interactive setting and consider three practical design requirements: (1) Automated exploration: the system should automatically enumerate, assess, select, and execute a diverse set of outlier detectors; the exploration strategy should guarantee coverage in the space of features and algorithm parameters. (2) Predictable response time: the system should conduct its exploration within a specified time limit. This implies an exploration strategy that is sensitive to the execution time (cost) of the detectors. (3) Visual interpretability: the system should enable the user to easily navigate the results of automated exploration by grouping together detectors that are similar in terms of the data points they identify as outliers.

1.1 Key Contributions

We present REMIX, a modular framework for automated outlier exploration and the first to address the outlier detection problem in an interactive setting. To the best of our knowledge, none of the existing outlier detection systems support interactivity in the manner outlined above – in particular, we are not aware of any system which automatically explores the diverse algorithmic choices in outlier detection within a given time limit. The following are the key contributions of our work.

1.1.1 MIP based Automated Exploration. REMIX systematically enumerates candidate outlier detectors and formulates the exploration problem as a mixed integer program (MIP). The solution to this MIP yields a subset of candidates which are executed by REMIX. The MIP maximizes a novel aggregate utility measure which trades off between the total utility of the top-k vs. all selected candidates; the MIP also enforces (i) an upper limit on the total cost (budget) of the selected candidates, and (ii) a set of diversity constraints which ensure that each detection algorithm and certain prioritized feature subspaces get at least a certain minimum share of the exploration budget.

Figure 1: Factorization of an outlier matrix into two perspectives ((a) Perspective 1; (b) Perspective 2; x-axis: Data Point, y-axis: Exploratory Action). Each perspective is a heatmapped matrix whose columns are data points and rows are detectors. The intensity of a cell (s, p) in a perspective corresponds to the extent to which detector s identifies point p as an outlier. Each perspective clearly identifies a distinct set of outliers.

1.1.2 Meta-Learning for Cost and Utility. The REMIX MIP requires an estimate of cost and utility for each candidate detector. In order to estimate cost, REMIX trains a meta-learning model for each algorithm which uses the number of data points, the size of the feature subspace, and various product terms derived from them as meta-features. It is significantly harder to estimate or even define utility. REMIX handles this by defining the utility of a detector as a proxy for its accuracy on the given data. REMIX estimates this by training a meta-learning model for each algorithm that uses various statistical descriptors of a detector's feature subspace as meta-features; this model is trained on a corpus of outlier detection datasets that are labeled by domain experts [2].

1.1.3 Perspective Factorization. The diversity of feature subspaces and algorithms explored by REMIX will result in different detectors marking distinct sets of data points as outliers. REMIX provides a succinct way for the data analyst to visualize these results as heatmaps. Consider the outlier matrix ∆ where ∆_{s,p} is the normalized ([0, 1]-ranging) outlier score assigned by detector s to data point p. REMIX uses a low rank non-negative matrix factorization scheme to bi-cluster ∆ into a small user-specified number of perspectives such that there is consensus among the detectors within a perspective. The idea of outlier perspectives is a generalization of the idea of outlier ensembles. A notable special case occurs when the number of perspectives equals one: here, the results from all the detectors are ensembled into a single set of outliers.

REMIX can be used in two modes. (i) Simple: An analyst can gain rapid visual understanding of the outlier space by providing two simple inputs: an exploration budget and the number of perspectives. This yields bi-clustered heatmaps of outliers that are easily interpretable, as in Figure 1. (ii) Advanced: As discussed in Section 4.3.1, an advanced user can also re-configure specific modules within REMIX such as utility estimation or the enumeration of prioritized feature subspaces. Such re-configuration would steer the REMIX MIP towards alternate optimization goals while still guaranteeing exploration within a given budget and providing factorized heatmap visualizations of the diverse outlier sets.

The rest of the paper is organized as follows. We survey related work in Section 2, and provide background definitions and an overview in Section 3. In Sections 4.1–4.5, we present the details of feature subspace and candidate detector enumeration, cost and utility estimation, the MIP for automated exploration, and perspective factorization. We present an evaluation of REMIX on real-world datasets in Section 5 and conclude in Section 6.

2 RELATED WORK

Outlier ensembles [4, 5, 23, 31] combine multiple outlier results to obtain a more robust set of outliers. Model-centered ensembles combine results from different base detectors while data-centered ensembles explore different horizontal or vertical samples of the dataset and combine their results [4]. Research in the area of subspace mining [8, 18, 19, 23, 30] focuses on exploring a diverse family of feature subspaces which are interesting in terms of their ability to reveal outliers. The work in [28] introduces a new ensemble model that uses detector explanations to choose base detectors selectively and remain robust to errors in those detectors.

REMIX is related to ensemble outlier detection since setting the number of perspectives to one in REMIX leads to ensembling of results from the base detectors. However, REMIX has some notable distinctions: the idea of perspectives in REMIX generalizes the notion of ensembles; setting the number of perspectives to a number greater than one is possible in REMIX and results in complementary views of the outlier space which can be visualized as heatmaps; further, REMIX seeks to guarantee coverage not just in the space of features or feature subspaces, but also across the available algorithms – subject to a budget constraint on the total time available for exploration. None of the existing approaches in the literature provide this guarantee on the exploration time.

Multiview outlier detection [9, 13, 27, 29] deals with a setting where the input consists of multiple datasets (views), and each view provides a distinct set of features for characterizing the objects in the domain. Algorithms in this setting aim to detect objects that are outliers in each view or objects that are normal but show inconsistent class or clustering characteristics across different views. REMIX is related to multiview outlier detection through non-negative matrix factorization (NMF), which is often used here for creating a combined outlier score of objects. In REMIX, NMF is used not just for ensembling detector scores but also for factorizing detector results into multiple heatmap visualizations.

Automated exploration is a growing trend in the world of commercial data science systems [1, 3] as well as machine learning and statistical research [6, 12, 14, 20, 25, 32]. While [14] focuses on model selection, [12, 25] focus on exploration for non-parametric regression models, [1, 3, 6, 32] deal with algorithm exploration for classification models, while [20] deals with automated feature generation for classification and regression. Budgeted allocations and recommendations have also been studied in the context of problems other than outlier analysis [24, 26].

3 PRELIMINARIES

3.1 Baseline Algorithms

Our implementation of REMIX uses the following set A of five well-known baseline outlier detection algorithms. 1) Local outlier factor: LOF [7] finds outliers by measuring the local deviation of a point from its neighbors. 2) Mahalanobis distance: MD [16] detects outliers by computing the Mahalanobis distance between a point and the center of the entire dataset as the outlier score. 3) Angle-based outlier detection: ABOD [22] identifies outliers by considering the variances of the angles between the difference vectors of data points, which is more robust than distance in high-dimensional space. ABOD is robust yet time-consuming. 4) Feature-bagging outlier detection: FBOD [23] is an ensemble method based on the local outlier factor (LOF). During each iteration, a random feature subspace is selected and LOF is applied to calculate scores on the selected data subset. The final FBOD score is the cumulative sum of the scores across iterations. 5) Subspace outlier detection: SOD [21] aims to detect outliers in varying subspaces of a high dimensional feature space. Specifically, for each point in the dataset, SOD explores the axis-parallel subspace spanned by its neighbors and determines how much the point deviates from the neighbors in this subspace.
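To make the MD baseline concrete, the following minimal Python sketch (our own illustration, not code from the REMIX implementation) scores each point by its Mahalanobis distance to the dataset center:

```python
import numpy as np

def mahalanobis_scores(A):
    """Score each row of A by its Mahalanobis distance to the dataset center.

    A minimal sketch of the MD baseline [16]; higher scores suggest outliers.
    """
    mu = A.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(A, rowvar=False))  # pinv guards against singular covariance
    diff = A - mu
    # Quadratic form diff[i] @ inv_cov @ diff[i] for every row i.
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))
```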

3.2 Interactive Outlier Exploration

A dataset in REMIX is a real-valued matrix A with m columns and n rows. A feature subspace is a subset of columns in A. An outlier detector D_{a,f} is simply a combination of an algorithm a ∈ A and a feature subspace f of A. The cost and utility values c_{a,f} and u_{a,f} are positive values associated with D_{a,f} which are intended to be estimates of the execution cost and accuracy of D_{a,f}, respectively. REMIX enumerates candidate outlier detectors based on a family of prioritized feature subspaces F_p and a family of randomly constructed feature subspaces F_r.
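For concreteness, a candidate detector can be represented as a small record; the field names in this sketch are our own illustration of the definition above:

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Detector:
    """A candidate detector D_{a,f}: an algorithm a paired with a feature
    subspace f, annotated with estimated cost c_{a,f} and utility u_{a,f}."""
    algorithm: str            # one of {'LOF', 'MD', 'ABOD', 'FBOD', 'SOD'}
    features: FrozenSet[int]  # column indices of A forming the subspace f
    cost: float               # estimated execution cost (Section 4.2)
    utility: float            # estimated utility (Section 4.3)
```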

The interactive outlier exploration problem is a budgeted optimization problem which selects a subset of detectors with a maximization objective that is a linear combination of two quantities: (i) the total utility of all the selected detectors, and (ii) the total utility of the top-k selected detectors, subject to the following budget and diversity constraints: (i) the total cost of the selected detectors does not exceed the budget T_total, (ii) each algorithm gets a guaranteed share of the exploration budget, and (iii) each prioritized feature gets a guaranteed share of the exploration budget.

Figure 2: REMIX Components (1: Feature Subspace Enumeration; 2: Candidate Detector Enumeration; 3: Cost and Utility Estimation; 4: Mixed Integer Program based Exploration; 5: Perspective Factorization and Ensembling). REMIX starts by enumerating multiple feature subspaces from the given dataset, which in turn are used to enumerate candidate detectors. Next, REMIX evaluates the cost and utility of all the enumerated detectors (the meta-learning components for training the cost and utility models are not shown in this figure). The cost and utility values are used as part of a mixed integer program (MIP) which selects a subset of candidates for execution based on a utility maximization objective and budget and diversity constraints. The outlier results from the detectors executed by REMIX are factorized into perspectives, and also ensembled into a single set of results.

4 THE REMIX FRAMEWORK

We now describe the five components of REMIX shown in Figure 2, starting with feature subspace enumeration.

4.1 Feature Subspace and Candidate Detector Enumeration

Algorithm 1 describes feature subspace enumeration and has three parts: (i) creating a non-redundant feature bag F_nr (lines 1–13), (ii) creating a prioritized family of subspaces F_p (lines 14–18) using a feature ranking approach, and (iii) creating a randomized family of subspaces F_r (lines 19–23) used for maximizing coverage and diversity during exploration. Redundant features, i.e., those not in F_nr, are not part of any subspace in F_p or F_r.

Algorithm 1 begins by initializing F_nr to all features in A (line 1). It then iteratively looks for a member of F_nr which can be considered redundant and hence dropped from F_nr. In order for a feature A_{∗,p} ∈ F_nr to be considered redundant, (i) it must belong to a maximally correlated feature pair {A_{∗,p}, A_{∗,q}} ⊆ F_nr (line 4), (ii) the correlation coefficient σ_{p,q} must be at or above REMIX's redundancy threshold α (line 5), and (iii) of the two features in the pair, A_{∗,p} must have a mean correlation with the other features in F_nr that is greater than or equal to that of A_{∗,q} (line 7). We set the default value of α in REMIX to 0.9 based on our experimental evaluation. It is easy to see that at the end of line 13, Algorithm 1 yields a non-redundant feature bag F_nr with the following property.


Algorithm 1 Feature Subspace Enumeration
Input: Data matrix A
Output: Feature subspace families F_p and F_r
 1: F_nr ← ⋃_{j=1}^{m} A_{∗,j}  ▷ Initialize non-redundant feature bag
 2: while |F_nr| ≥ 2 do  ▷ Bag has at least 2 features
 3:   ∀A_{∗,j} ∈ F_nr: σ̂_j ← (∑_{(i≠j) ∧ (A_{∗,i} ∈ F_nr)} σ_{i,j}) / (|F_nr| − 1)  ▷ Mean correlation
 4:   (p, q) ← argmax_{(i≠j) ∧ (A_{∗,i} ∈ F_nr) ∧ (A_{∗,j} ∈ F_nr)} σ_{i,j}  ▷ Maximally correlated pair
 5:   if σ_{p,q} ≥ α then  ▷ High max correlation?
 6:     if σ̂_p ≥ σ̂_q then  ▷ Greater average correlation?
 7:       F_nr ← F_nr \ {A_{∗,p}}  ▷ Drop A_{∗,p}
 8:     else
 9:       F_nr ← F_nr \ {A_{∗,q}}  ▷ Drop A_{∗,q}
10:     end if
11:   else break  ▷ Break out of while loop
12:   end if
13: end while
14: F_p ← ∅  ▷ Initialize to empty family
15: for ℓ ← 1, |F_nr| do
16:   S_ℓ ← top-ℓ features in F_nr ranked by their Laplacian scores
17:   F_p ← F_p ∪ {S_ℓ}  ▷ Add prioritized subspace
18: end for
19: F_r ← ∅  ▷ Initialize to empty family
20: for i ← 1, Γ do
21:   T_i ← random subspace with each feature sampled independently at random without replacement from F_nr with probability 1/2
22:   F_r ← F_r ∪ {T_i}  ▷ Add random subspace
23: end for

Observation 4.1. For any pair of features {A_{∗,p}, A_{∗,q}} ⊆ F_nr, σ_{p,q} < α. For any feature A_{∗,r} ∉ F_nr, there exists A_{∗,s} ∈ F_nr with σ_{r,s} ≥ α.
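A direct Python rendering of lines 1–13 follows; it is a sketch under our own assumptions (in particular, we take σ to be the absolute Pearson correlation, which the paper leaves implicit):

```python
import numpy as np

def non_redundant_bag(A, alpha=0.9):
    """Lines 1-13 of Algorithm 1: iteratively drop the more redundant member
    of the maximally correlated feature pair until no pair reaches alpha.
    Returns the column indices of the non-redundant feature bag F_nr."""
    F = list(range(A.shape[1]))
    while len(F) >= 2:
        corr = np.abs(np.corrcoef(A[:, F], rowvar=False))   # assumption: sigma = |Pearson|
        np.fill_diagonal(corr, 0.0)
        mean_corr = corr.sum(axis=1) / (len(F) - 1)         # line 3: mean correlation
        p, q = np.unravel_index(np.argmax(corr), corr.shape)  # line 4: max-correlated pair
        if corr[p, q] < alpha:                              # lines 5/11: stop below threshold
            break
        F.pop(p if mean_corr[p] >= mean_corr[q] else q)     # lines 6-9: drop one feature
    return F
```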

The family of randomized feature subspaces F_r is created for the purpose of guaranteeing feature coverage and diversity during exploration. In particular, we select Γ subspaces, where each subspace consists of features selected independently at random without replacement with probability 1/2 from F_nr. We set the default value of Γ in REMIX to |F_p|/2 to balance the sizes of the two families.

The family of prioritized feature subspaces F_p is created as follows. The features in F_nr are first sorted according to their Laplacian scores [15]. We add |F_nr| subspaces to F_p, where the ℓ-th subspace is the set of top-ℓ features in F_nr ranked by their Laplacian scores. We now provide a brief justification for our use of the Laplacian score. Due to lack of space, we refer the reader to [15] for the exact details of the Laplacian computation. The Laplacian score is computed as a way to reflect a feature's ability to preserve locality. In particular, consider the projection A′ of all points in A onto the subspace F_nr; now, consider the r-nearest neighborhood of points in A′. Features that respect this r-nearest neighborhood provide a better separation of the inlier class from the outlier class within the data. We note that our experimental evaluation in Section 5 consistently demonstrates improved outlier detection accuracy with F_p and F_r as opposed to purely F_r. We also discuss potential alternatives for prioritized feature subspace construction in Section 4.3.1.

We enumerate candidate detectors by taking the Cartesian product of F_p ∪ F_r and the set of baseline detection algorithms A.
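This enumeration is mechanical; a sketch (algorithm names as strings and subspaces as the frozensets built above are our own representational choices):

```python
from itertools import product

ALGOS = ['LOF', 'MD', 'ABOD', 'FBOD', 'SOD']

def enumerate_candidates(F_p, F_r):
    """Candidate detectors: the Cartesian product of algorithms and subspaces."""
    return [(a, f) for a, f in product(ALGOS, F_p + F_r)]
```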

4.2 Cost Estimation

The cost of a candidate detector can be modeled as a function of its algorithm as well as the size of its input (the number of data points n and the feature subspace size |f|). The runtime complexity of algorithms in the 'big-O' notation provides an asymptotic relationship between cost and input size; however, in REMIX, we seek a more refined model which accounts for lower order terms and the constants hidden by 'big-O'. We do this by training a multivariate linear regression model for each algorithm. Specifically, consider the following polynomial: (1 + |f| + n + log|f| + log n)³ − 1. There are $\binom{7}{3} - 1 = 34$ distinct terms in the expansion of this polynomial, which can be derived exactly given a feature subspace. These terms form the explanatory variables while cost is the dependent variable in the linear regression model. (The exponent 3 suffices to model the cost of most known outlier detection algorithms.)
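The meta-feature expansion is easy to reproduce; the sketch below enumerates the $\binom{7}{3} - 1 = 34$ monomials (our own rendering; any least-squares routine could then fit the per-algorithm regression):

```python
import numpy as np
from itertools import combinations_with_replacement

def cost_meta_features(n, f_size):
    """All distinct terms of (1 + |f| + n + log|f| + log n)^3, minus the constant.

    Yields the 34 explanatory variables of the per-algorithm cost regression."""
    base = [1.0, float(f_size), float(n), np.log(f_size), np.log(n)]
    combos = list(combinations_with_replacement(range(5), 3))  # 35 degree<=3 monomials
    combos.remove((0, 0, 0))                                   # the "-1": drop the constant
    return np.array([base[i] * base[j] * base[k] for i, j, k in combos])
```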

4.3 Utility Estimation

Given a detector D_{a,f}, the utility estimation algorithm (Algorithm 2) first normalizes the features in f, and computes a variety of feature-level statistics for each feature ψ ∈ f. These statistics include the Laplacian score, which is a useful measure for unsupervised feature selection (Section 4.1); the standard deviation; skewness, which measures the asymmetry of the feature distribution; kurtosis, which measures the extent to which the feature distribution is heavy-tailed; and entropy, which is a measure of the information content in the feature. We note that these computations are done once for each feature in F_nr and are reused for any given detector. Next, the algorithm computes the meta-feature vector MFV(f) of feature-subspace-level statistics by combining feature-level statistics. For instance, consider the Laplacian scores of all the features in f; the mean, median, median absolute deviation, min, max, and standard deviation of all the Laplacian scores provide 6 of the 30 distinct components of MFV(f) in this step. Finally, it uses an algorithm-specific utility model U_a(MFV(f)) to estimate the utility u_{a,f} of D_{a,f}. REMIX trains five distinct linear regression models U_LOF, U_MD, U_ABOD, U_FBOD, and U_SOD corresponding to each algorithm. These models are trained on distinct expert-labeled datasets from the outlier dataset repository [2]. The explanatory variables for this linear regression are the feature-subspace-level statistics described in Algorithm 2. The dependent variable is the detection accuracy, measured as the fraction of the data points on which both the detector and the expert-labeled ground truth agree on the outlier characterization.

Algorithm 2 Utility Estimation
Input: A candidate detector D_{a,f}
Output: The utility u_{a,f} of the detector
1: ∀ψ ∈ f, normalize ψ  ▷ One-time procedure applied to F_nr
2: ∀ψ ∈ f, extract the following 5 feature-level statistics: Laplacian score, standard deviation, skewness, kurtosis, and entropy  ▷ One-time procedure applied to F_nr
3: Extract the following 30 feature-subspace-level statistics: mean, median, median absolute deviation, min, max, and standard deviation for each of the 5 feature-level statistics extracted in Step 2. Let MFV(f) contain these feature-subspace-level statistics.
4: Return u_{a,f} = U_a(MFV(f)), where U_a is the utility estimation model learnt for algorithm a

4.3.1 Discussion. Estimating or even defining the utility of a detector is significantly harder than estimating its cost. The goal of utility estimation in REMIX is not to learn a perfect model; rather, the goal is merely to learn a utility model which can effectively steer the solution of the mixed integer program (MIP) used by REMIX for exploration (Section 4.4). Our experiments in Section 5 demonstrate that this is indeed the case with REMIX. Further, REMIX is intended to be a flexible framework where alternative mechanisms for utility estimation can be plugged in. For instance, consider the Cumulative Mutual Information (CMI) metric and the Apriori-style feature subspace search algorithm presented in [8] for subspace outlier detection. REMIX can use CMI as the detector utility, and this Apriori-style algorithm as an alternative for enumerating prioritized feature subspaces. We chose the meta-learning approach in our implementation since this approach is algorithm agnostic and hence can be generalized easily.
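Steps 2–3 of Algorithm 2 can be sketched as follows; the histogram-based entropy estimate and its bin count are our assumptions, and the Laplacian score is assumed precomputed as in Section 4.1:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def feature_stats(x, laplacian_score, bins=20):
    """Step 2: the 5 feature-level statistics for one (normalized) feature."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log(p))        # histogram entropy (an assumption)
    return [laplacian_score, np.std(x), skew(x), kurtosis(x), entropy]

def meta_feature_vector(per_feature):
    """Step 3: aggregate the (|f| x 5) matrix of feature-level statistics into
    the 30-dimensional MFV(f): 6 aggregates for each of the 5 statistics."""
    S = np.asarray(per_feature)
    mad = lambda v: np.median(np.abs(v - np.median(v)))  # median absolute deviation
    aggs = [np.mean, np.median, mad, np.min, np.max, np.std]
    return np.array([agg(S[:, j]) for j in range(S.shape[1]) for agg in aggs])
```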

4.4 MIP based Exploration

REMIX executes only a subset of the detectors enumerated by its detector enumeration component (Section 4.1). This subset is determined by solving the following mixed integer program (MIP).

$$\max \;\; \underbrace{\sum_{a \in \mathcal{A}} \sum_{f \in F_p \cup F_r} z_{a,f}\,u_{a,f}}_{\text{utility of top-}k\text{ detectors}} \;+\; \lambda \underbrace{\sum_{a \in \mathcal{A}} \sum_{f \in F_p \cup F_r} y_{a,f}\,u_{a,f}}_{\text{utility of all detectors}} \qquad (1)$$

subject to:

$$\sum_{a \in \mathcal{A}} \sum_{f \in F_p \cup F_r} y_{a,f}\,c_{a,f} \;\le\; T_{total} \qquad (2)$$

$$\forall a \in \mathcal{A}: \;\; \sum_{f \in F_p \cup F_r} y_{a,f}\,c_{a,f} \;\ge\; \frac{T_{total}}{2 \cdot |\mathcal{A}|} \qquad (3)$$

$$\forall f \in F_p: \;\; \sum_{a \in \mathcal{A}} y_{a,f}\,c_{a,f} \;\ge\; \frac{T_{total}}{2 \cdot |F_p|} \qquad (4)$$

$$\forall a \in \mathcal{A},\; \forall f \in F_p \cup F_r: \;\; z_{a,f} \le y_{a,f} \qquad (5)$$

$$\sum_{a \in \mathcal{A}} \sum_{f \in F_p \cup F_r} z_{a,f} \;\le\; k \qquad (6)$$

$$\forall a \in \mathcal{A},\; \forall f \in F_p \cup F_r: \;\; y_{a,f} \in \{0, 1\} \qquad (7)$$

$$\forall a \in \mathcal{A},\; \forall f \in F_p \cup F_r: \;\; z_{a,f} \in \{0, 1\} \qquad (8)$$

Given an algorithm a and a feature subspace f, y_{a,f} is the binary indicator variable in the MIP which determines whether the detector D_{a,f} is chosen for execution in the MIP solution. Recall that c_{a,f} denotes the estimated cost of D_{a,f}. We observe the following.

Observation 4.2. Constraint 2 guarantees that in any feasible solution to the MIP, the total cost of the selected detectors is ≤ T_total.

The value of the total exploration budget T_total is provided by the analyst as part of her interaction with REMIX.

Observation 4.3. Constraint 3 guarantees that in any feasible solution to the MIP, each algorithm is explored for an estimated duration of time which is ≥ T_total/(2·|A|).

Observation 4.4. Constraint 4 guarantees that in any feasible solution to the MIP, each prioritized feature is explored for an estimated duration of time which is ≥ T_total/(2·|F_p|). This also implies that the exploration focuses at least half its total time on prioritized features.

Consider the binary indicator variable z_{a,f} corresponding to the detector D_{a,f}. Constraint 5 ensures that in any feasible solution to the MIP, z_{a,f} can be 1 only if the detector is chosen in the solution (i.e., y_{a,f} = 1). Constraint 6 ensures that at most k of the detectors chosen by the solution have their z-values set to 1. Now consider an optimal solution to the MIP. Since utility values are non-negative for all detectors, the first part of the objective function, $\sum_{a \in \mathcal{A}} \sum_{f \in F_p \cup F_r} z_{a,f}\,u_{a,f}$, is maximized when exactly k detectors (and not fewer) have their z-values set to 1, and when the detectors whose z-values are set to 1 are the ones with the highest utilities amongst the selected detectors. This leads us to the following guarantee.

Theorem 4.5. The optimal solution to the MIP maximizes the sum of the utilities of the top-k detectors with the highest utilities plus the total utility of all the selected detectors scaled by a factor λ.

We set k = 10 and λ = 1 in our implementation of REMIX, which balances the utility of the top-10 detectors against the total utility of all the selected detectors.
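For illustration, the MIP can be handed to an off-the-shelf solver; the sketch below uses PuLP, which is our choice (the paper does not name a solver). Detectors are keyed by (algorithm, subspace-id) pairs, and cost and util hold the estimates from Sections 4.2 and 4.3:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def solve_remix_mip(detectors, cost, util, algos, F_p, T_total, k=10, lam=1.0):
    """Sketch of MIP (1)-(8). detectors: list of (a, f) keys; cost/util: dicts
    keyed by detector; algos: the set A; F_p: ids of prioritized subspaces."""
    prob = LpProblem("remix", LpMaximize)
    y = LpVariable.dicts("y", detectors, cat=LpBinary)  # detector selected for execution
    z = LpVariable.dicts("z", detectors, cat=LpBinary)  # detector counted in the top-k term
    # Objective (1): top-k utility plus lambda times total utility.
    prob += lpSum(util[d] * z[d] for d in detectors) + \
            lam * lpSum(util[d] * y[d] for d in detectors)
    # (2): total estimated cost stays within the exploration budget.
    prob += lpSum(cost[d] * y[d] for d in detectors) <= T_total
    # (3): each algorithm gets at least T_total / (2|A|) of the budget.
    for a in algos:
        prob += lpSum(cost[d] * y[d] for d in detectors if d[0] == a) \
                >= T_total / (2 * len(algos))
    # (4): each prioritized subspace gets at least T_total / (2|F_p|).
    for f in F_p:
        prob += lpSum(cost[d] * y[d] for d in detectors if d[1] == f) \
                >= T_total / (2 * len(F_p))
    # (5) and (6): z_{a,f} <= y_{a,f}; at most k detectors counted as top-k.
    for d in detectors:
        prob += z[d] <= y[d]
    prob += lpSum(z[d] for d in detectors) <= k
    prob.solve()
    return [d for d in detectors if y[d].value() == 1]
```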

4.5 Perspective Factorization

Each detector executed by REMIX provides an outlier score for each data point, and the results from different detectors could be potentially divergent. We now present a factorization technique called NMFE (non-negative matrix factorization and ensembling) for bi-clustering the detection results into a few succinct perspectives. All detectors within a perspective agree on how they characterize outliers, although there could be disagreement across perspectives.

Let ∆_{s,p} be the outlier score assigned by detector s to data point p, normalized across data points to be in the range [0, 1]. Consider the matrix of outlier scores ∆ ∈ [0, 1]^{t×n}, where t is the number of detectors executed by REMIX and n is the number of data points. We perform a rank-g non-negative matrix factorization ∆ ≈ ΛΩ⊤, where Λ ∈ R^{t×g} and Ω ∈ R^{n×g}, by minimizing the Kullback-Leibler divergence between ∆ and ΛΩ⊤ [11]:

$$\min_{\Lambda,\,\Omega \ge 0} \; \sum_{s,p} \left( \Delta_{s,p} \log \frac{\Delta_{s,p}}{\Lambda_{s,*}\,\Omega_{p,*}} \;-\; \Delta_{s,p} \;+\; \Lambda_{s,*}\,\Omega_{p,*} \right) \qquad (9)$$

The matrix ΛΩ⊤ by definition can be expressed as a sum of g rank-1 matrices whose rows and columns correspond to detectors and data points respectively. In any of these rank-1 matrices, every row (column) is a scaled multiple of any other non-zero row (column), and every entry is between 0 and 1. These properties make it possible for the rank-1 matrices to be visualized as perspectives, or heatmaps where the intensity of a heatmap cell is the value of the corresponding entry in the perspective, as shown in Figure 1. The number of perspectives g is specified by the user as an input to REMIX. Setting g = 1 simply results in a direct averaging (ensembling) of all the detector results into a single perspective.

5 EXPERIMENTS

5.1 Data Collection

We chose 98 datasets for our study from the outlier dataset repository [2] of the Delft University of Technology. This corpus contains outlier detection datasets labeled by domain experts. These datasets span varied domains such as flowers, breast cancer, heart disease, sonar signals, genetic diseases, arrhythmia abnormalities, hepatitis, diabetes, leukemia, vehicles, housing, and satellite images. Each dataset has benchmark outlier labels, with an entry 1 representing outliers and 0 representing normal points.

Figure 3 illustrates some statistics of the 98 datasets. Specifically, Figure 3(a) shows the number of features for each dataset, sorted in descending order. We observe that most of the datasets contain fewer than 100 features while only a small portion have more than 1000 features. Figure 3(b) shows the ratio of outliers to the total number of data points for each dataset, sorted in descending order. We find that for more than 50% of the datasets the outlier ratio is less than 23.2%.

Figure 3: Statistics of the experimental datasets. (a) Distribution of feature sizes: datasets ranked by number of features (log-scale y-axis, ranging from 1 to 4970). (b) Distribution of outlier ratios: datasets ranked by outlier ratio (y-axis from 0 to 0.5).

5.2 Cost and Utility Estimation

Among the 98 datasets, we used 96 datasets to train and test the cost and utility estimation models for each of the 5 baseline algorithms in our REMIX implementation (with a 70%-30% split for train vs. test). Figures 4(a) and 4(b) present the performance of the cost and utility models for the LOF algorithm from a specific run.

Recall that the utility u_{a,f} estimates the fraction of the data points on which the detector D_{a,f} and the expert labels agree on the outlier characterization. Consider the Hamming distance between the outlier bit vector (1 if outlier, 0 otherwise) created by the detector and the outlier bit vector created by the expert labels. Clearly, this Hamming distance equals n·(1 − u_{a,f}), where n is the number of data points. Figure 4(b) plots the estimated Hamming distance on the y-axis and the actual distance as observed by running the detector on the x-axis. The red lines in Figures 4(a) and 4(b) represent ideal predictors. As expected, the cost estimator clearly performs better than the utility estimator. However, as mentioned in Section 4.3.1, our goal in utility estimation is not a perfect predictive model but merely to steer the MIP towards better solutions. We demonstrate this to be the case in our next set of experiments.

Figure 4: Cost and Utility Estimation. (a) Cost Estimation; (b) Utility Estimation. Each panel plots the prediction (y-axis) against the ground truth (x-axis).

5.3 Detection Accuracy and Cost

We now present a detailed study of the detection accuracy vs. cost of REMIX on three datasets from the corpus. The first dataset is cardiac arrhythmia (CA). Irregularity in heart beat may be harmless or life threatening. The dataset contains medical records such as age, weight, and electrocardiograph-related data of arrhythmia patients and of outlier healthy people. The task is to spot the outlier healthy people among the arrhythmia patients. The second dataset is a skewed subsample of CA (SkewCA) which we created for the sake of diversity by subsampling from CA: 90% of the points in SkewCA were points that were considered normal and the remaining 10% were considered outliers by expert labels. The third dataset is about sonar signals (SONAR) bounced off a metal cylinder. The dataset contains outlier sonar signals bounced off a roughly cylindrical rock. The task is to separate the outlier rock-related sonar signals from the cylinder-related sonar signals.

5.3.1 Evaluation Metrics. Recall that the REMIX perspective factorization scheme can be used as an outlier ensembling technique simply by setting the number of perspectives to 1. We use REMIX in this ensembling mode for the rest of our experiments. We now define Precision@N, Recall@N, and F-measure@N, which we use to compare the detection accuracy of REMIX with various other approaches. Let 1 denote the outlier label and 0 denote the label for normal points in the expert labeled data.

Precision@N. Given the top-N list of data points $L_N$ sorted in descending order of the predicted outlier scores, precision is defined as $Precision@N = \frac{|L_N \cap L_{=1}|}{N}$, where $L_{=1}$ is the set of data points with expert outlier label 1.

Recall@N. Given the top-N list of data points $L_N$ sorted in descending order of the predicted outlier scores, recall is defined as $Recall@N = \frac{|L_N \cap L_{=1}|}{|L_{=1}|}$.

F-measure@N. F-measure@N incorporates both precision and recall in a single metric by taking their harmonic mean: $F@N = \frac{2 \times Precision@N \times Recall@N}{Precision@N + Recall@N}$.
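All three metrics take only a few lines to compute from a score vector and the expert labels; a minimal sketch:

```python
import numpy as np

def metrics_at_n(scores, labels, N):
    """Precision@N, Recall@N, and F-measure@N. scores: predicted outlier
    scores; labels: expert labels (1 = outlier, 0 = normal)."""
    top_n = np.argsort(scores)[::-1][:N]   # L_N: indices of the N highest scores
    hits = labels[top_n].sum()             # |L_N intersected with L_{=1}|
    precision = hits / N
    recall = hits / labels.sum()
    f = 0.0 if precision + recall == 0 else \
        2 * precision * recall / (precision + recall)
    return precision, recall, f
```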

5.3.2 Baseline Algorithms. We report the performance comparison of REMIX vs. the baseline algorithms in terms of Precision, Recall, and F-measure. In this experiment, we provide sufficient time for the baseline algorithms in {LOF, MD, ABOD, FBOD, SOD} to complete their executions. Meanwhile, we set a limited exploration time budget of 0.5 seconds for REMIX.


  • @10 @13 @15 @17 @200

    0.2

    0.4

    0.6

    0.8

    1

    1.2 LOFMDABODFBODSODOur

    (a) Precision@N

    @10 @13 @15 @17 @200

    0.005

    0.01

    0.015

    0.02

    0.025

    0.03

    0.035

    0.04LOFMDABODFBODSODOur

    (b) Recall@N

    @10 @13 @15 @17 @200

    0.01

    0.02

    0.03

    0.04

    0.05

    0.06

    0.07

    0.08LOFMDABODFBODSODOur

    (c) Fmeasure@N

    Figure 5: Comparison of REMIX with baseline algorithms on the CA dataset

Figure 6: Comparison of REMIX with baseline algorithms (LOF, MD, ABOD, FBOD, SOD, Our) on the SONAR dataset. (a) Precision@N; (b) Recall@N; (c) F-measure@N, for N = 10, 13, 15, 17, 20.

Figure 7: Comparison of REMIX with other exploration and ensemble strategies (EE, RSR, RS1, RS1R, NMFE) on the SkewCA dataset. (a) Precision@N; (b) Recall@N; (c) F-measure@N, for N = 10, 13, 15, 17, 20.

Figure 8: Comparison of REMIX with other exploration and ensemble strategies (EE, RSR, RS1, RS1R, NMFE) on the SONAR dataset. (a) Precision@N; (b) Recall@N; (c) F-measure@N, for N = 10, 13, 15, 17, 20.


Figure 9: Comparison of execution costs (in seconds, log-scale y-axis) of REMIX and baseline algorithms. (a) CA dataset: LOF 2.8, MD 0.043, ABOD 31.769, FBOD 23.429, SOD 31.369, Our 0.481. (b) SONAR dataset: LOF 0.158, MD 0.00184, ABOD 46.02, FBOD 0.859, SOD 4.398, Our 0.478.

Figure 10: Comparison of execution costs (in seconds, log-scale y-axis) of REMIX and other exploration strategies. (a) CA dataset: EE 5931.62, RSR 1336.612, RS1 0.313311, RS1R 0.0037129, NMFE 0.471545. (b) SONAR dataset: EE 7097.053, RSR 1597.939, RS1 11.02027, RS1R 0.003299873, NMFE 0.4784815.

Results on Effectiveness Comparison. Figure 5 shows this on the CA dataset. REMIX outperforms the five baseline algorithms in terms of Precision@N, Recall@N, and Fmeasure@N (N = 10, 13, 15, 17, 20). Figure 6 shows this on the SONAR dataset. REMIX is consistently better than the baseline algorithms in terms of Precision@N, Recall@N, and Fmeasure@N (N = 10, 13, 15, 17, 20).

Results on Efficiency Comparison. Figure 9 shows this jointly on the CA and SONAR datasets. Mahalanobis distance (MD) takes the least time; angle-based outlier detection (ABOD) takes the most time, as angles are expensive to compute; REMIX falls in the middle of this cost spectrum.

5.3.3 The Effectiveness and Efficiency of REMIX NMFE. We study the effectiveness of the REMIX NMFE ensembling strategy, in which we use rank-1 NMF to factorize the outlier score matrix ∆ ≈ ΛΩ⊤ and treat Ω as the predicted ensemble outlier scores. Let t denote the number of detectors selected in the REMIX MIP solution. Let C be the set of all candidate detectors enumerated by REMIX. We compared REMIX with the following exploration and ensembling strategies: (1) Exhaustive Ensemble (EE): we execute all the detectors in C and average all the outlier scores; (2) Randomly select t detectors (RSR): we randomly select and execute t detectors from C, and then average the outlier scores of these detectors; (3) Randomly select 1 detector (RS1): we randomly select one detector and use its results; (4) Randomly select 1 detector in the MIP solution (RS1R): we randomly select one detector in the MIP solution and use its results. In this experiment, we provide sufficient time for all strategies to complete their executions. We set a limited time budget of 0.5 seconds for REMIX.

Results on Effectiveness Comparison. Figure 7 shows that on the SkewCA dataset our strategy outperforms the other exploration and ensembling strategies in terms of Precision@N, Recall@N, and Fmeasure@N (N = 10, 13, 15, 17, 20). Figure 8 shows that on the sonar signal dataset, REMIX is consistently better than the other strategies in terms of Precision@N, Recall@N, and Fmeasure@N (N = 10, 13, 15, 17, 20).

Results on Efficiency Comparison. Figure 10(a) shows that on the CA dataset REMIX takes 0.48 seconds, while EE, RSR, and RS1 require much more time. While RS1R takes only 0.0037 seconds, its detection accuracy is lower than ours. Figure 10(b) shows that on the SONAR dataset, our strategy takes only 0.49 seconds, which is much less than the time costs of EE, RSR, and RS1. While RS1R takes only 0.0033 seconds, our method outperforms RS1R with respect to detection accuracy.

6 CONCLUSIONS AND FUTURE DIRECTIONS

In this paper, we presented REMIX, a modular framework for outlier exploration which is the first to study the problem of outlier detection in an interactive setting. At the heart of REMIX is an optimization approach which systematically steers the selection of base outlier detectors in a manner that is sensitive to their execution costs, while maximizing an aggregate utility function of the solution and also ensuring diversity across the algorithmic choice points present in exploration. Data analysts are naturally interested in extracting and interpreting outliers through multiple detection mechanisms since distinct outliers could lead to distinct actionable insights within the application domain. REMIX facilitates this understanding in a practical manner by shifting the burden of exploration away from the analyst through automation, and by summarizing the results of automated exploration into a few coherent heatmap visualizations called perspectives.

We believe many of the techniques presented in this paper could be of independent interest to other machine learning problems. We are interested in extending the REMIX exploratory approach beyond outlier detection to clustering of high-dimensional data. Another interesting direction of research is sequential recommendations for visual outlier exploration: in particular, we are interested in approaches for presenting outlier insights approximately through a sequence of visualizations (like heatmaps) such that both the length of this sequence as well as the perceptual error involved across the visualizations is minimized while coverage across the various exploratory choice points is maximized. Also, the study of outlier aspect mining in an interactive setting – which deals with the inverse problem of finding explanatory features which characterize a given set of points as outliers in a cost-sensitive manner – presents a new and interesting direction of research.

Acknowledgment. This research was partially supported by the National Science Foundation (NSF) via grant number IIS-1648664.


REFERENCES

[1] 2017. DataRobot. http://www.datarobot.com. (2017).
[2] 2017. Outlier Detection Datasets. http://homepage.tudelft.nl/n9d04/occ/index.html. (2017).
[3] 2017. Skytree. http://www.skytree.net. (2017).
[4] Charu C. Aggarwal. 2013. Outlier Analysis. Springer Science & Business Media.
[5] Charu C. Aggarwal and Saket Sathe. 2015. Theoretical Foundations and Algorithms for Outlier Ensembles. ACM SIGKDD Explorations Newsletter 17, 1 (2015), 24–47.
[6] Alain Biem, Maria Butrico, Mark Feblowitz, Tim Klinger, Yuri Malitsky, Kenney Ng, Adam Perer, Chandra Reddy, Anton Riabov, Horst Samulowitz, Daby M. Sow, Gerald Tesauro, and Deepak S. Turaga. 2015. Towards Cognitive Automation of Data Science. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. 4268–4269. http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9721
[7] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. In ACM SIGMOD Record, Vol. 29. ACM, 93–104.
[8] Klemens Böhm, Fabian Keller, Emmanuel Müller, Hoang Vu Nguyen, and Jilles Vreeken. 2013. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In SDM. SIAM, 198–206. http://dblp.uni-trier.de/db/conf/sdm/sdm2013.html#BohmKMNV13
[9] Santanu Das, Bryan L. Matthews, Ashok N. Srivastava, and Nikunj C. Oza. 2010. Multiple kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case study. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 47–56.
[10] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2014. Explore-by-example: An Automatic Query Steering Framework for Interactive Data Exploration. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 517–528. DOI: http://dx.doi.org/10.1145/2588555.2610523
[11] Chris H. Q. Ding, Tao Li, and Michael I. Jordan. 2010. Convex and Semi-Nonnegative Matrix Factorizations. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1 (Jan. 2010), 45–55. DOI: http://dx.doi.org/10.1109/TPAMI.2008.277
[12] David Duvenaud, James Robert Lloyd, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. 2013. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. In Proceedings of the 30th International Conference on Machine Learning.
[13] Jing Gao, Wei Fan, Deepak Turaga, Srinivasan Parthasarathy, and Jiawei Han. 2011. A spectral framework for detecting inconsistency across multi-source object relationships. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 1050–1055.
[14] R. B. Grosse, R. Salakhutdinov, W. T. Freeman, and J. B. Tenenbaum. 2012. Exploiting compositionality to explore a large space of model structures. In Uncertainty in Artificial Intelligence.
[15] Xiaofei He, Deng Cai, and Partha Niyogi. 2005. Laplacian score for feature selection. In Advances in Neural Information Processing Systems. 507–514.
[16] Victoria J. Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.
[17] Alexander Kalinin, Ugur Cetintemel, and Stan Zdonik. 2014. Interactive Data Exploration Using Semantic Windows. In SIGMOD Conference. http://database.cs.brown.edu/papers/sw-paper.pdf
[18] Fabian Keller, Emmanuel Müller, and Klemens Böhm. 2012. HiCS: high contrast subspaces for density-based outlier ranking. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 1037–1048.
[19] Fabian Keller, Emmanuel Müller, Andreas Wixler, and Klemens Böhm. 2013. Flexible and adaptive subspace search for outlier analysis. In CIKM, Qi He, Arun Iyengar, Wolfgang Nejdl, Jian Pei, and Rajeev Rastogi (Eds.). ACM, 1381–1390. http://dblp.uni-trier.de/db/conf/cikm/cikm2013.html#KellerMWB13
[20] Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasarathy. 2016. Cognito: Automated Feature Engineering for Supervised Learning. Proceedings of the International Conference on Data Mining (2016).
[21] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. 2009. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 831–838.
[22] Hans-Peter Kriegel, Arthur Zimek, and others. 2008. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 444–452.
[23] Aleksandar Lazarevic and Vipin Kumar. 2005. Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 157–166.
[24] Lei Li, Ding-Ding Wang, Shun-Zhi Zhu, and Tao Li. 2011. Personalized news recommendation: a review and an experimental investigation. Journal of Computer Science and Technology 26, 5 (2011), 754–766.
[25] James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua B. Tenenbaum, and Zoubin Ghahramani. 2014. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In Association for the Advancement of Artificial Intelligence (AAAI).
[26] Tyler Lu and Craig Boutilier. 2011. Budgeted social choice: From consensus to personalized decision making. In IJCAI, Vol. 11. 280–286.
[27] Alejandro Marcos Alvarez, Makoto Yamada, Akisato Kimura, and Tomoharu Iwata. 2013. Clustering-based anomaly detection in multi-view data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1545–1548.
[28] Alex Memory and Ted Senator. 2015. Towards Robust Anomaly Detection Ensembles using Explanations. In the ODDx3 Workshop of KDD 2015.
[29] Emmanuel Müller, Ira Assent, Patricia Iglesias, Yvonne Mulle, and Klemens Böhm. 2012. Outlier ranking via subspace analysis in multiple views of the data. In Data Mining (ICDM), 2012 IEEE 12th International Conference on. IEEE, 529–538.
[30] Emmanuel Müller, Matthias Schiffer, and Thomas Seidl. 2011. Statistical selection of relevant subspace projections for outlier ranking. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 434–445.
[31] Shebuti Rayana and Leman Akoglu. 2015. Less is More: Building Selective Anomaly Ensembles. arXiv preprint arXiv:1501.01924 (2015).
[32] Ashish Sabharwal, Horst Samulowitz, and Gerald Tesauro. 2016. Selecting Near-Optimal Learners via Incremental Data Allocation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2007–2015. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12524
[33] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya Parameswaran. 2016. Effortless Data Exploration with Zenvisage: An Expressive and Interactive Visual Analytics System. Proc. VLDB Endow. 10, 4 (Nov. 2016), 457–468. DOI: http://dx.doi.org/10.14778/3025111.3025126
[34] Abdul Wasay, Manos Athanassoulis, and Stratos Idreos. 2015. Queriosity: Automated Data Exploration. In BigData Congress, Barbara Carminati and Latifur Khan (Eds.). IEEE, 716–719. http://dblp.uni-trier.de/db/conf/bigdata/bigdata2015.html#WasayAI15
[35] Arthur Zimek, Ricardo J. G. B. Campello, and Jörg Sander. 2014. Ensembles for Unsupervised Outlier Detection: Challenges and Research Questions. SIGKDD Explor. Newsl. 15, 1 (March 2014), 11–22. DOI: http://dx.doi.org/10.1145/2594473.2594476
