Breaking the curse of dimensionality with Isolation Kernel

Kai Ming Ting, Nanjing University, [email protected]
Takashi Washio, Osaka University
Ye Zhu, Deakin University, [email protected]
Yang Xu, Nanjing University
ABSTRACT

The curse of dimensionality has been studied in different aspects. However, breaking the curse has been elusive. We show for the first time that it is possible to break the curse using the recently introduced Isolation Kernel. We show that only Isolation Kernel performs consistently well in indexed search, spectral & density peaks clustering, SVM classification and t-SNE visualization in both low and high dimensions, compared with distance, Gaussian and linear kernels. This is also supported by our theoretical analyses that Isolation Kernel is the only kernel that has the provable ability to break the curse, compared with existing metric-based Lipschitz continuous kernels.
1 INTRODUCTION

In the last few decades, a significant amount of research has been invested in studying the curse of dimensionality of a distance measure. We now have a good understanding of the curse in terms of (i) concentration of measure [39] and its consequence of being unable to find the nearest neighbor [5], and (ii) hubness [34].
Despite this advancement, breaking the curse is still elusive. The central question remains open: is it possible to find the single nearest neighbor of a query in high dimensions?

The answer to this question is negative, based on the current analyses on distance [4, 5, 32]. Rather than finding the single nearest neighbor of a query, the issue has been diverted to finding all data points belonging to the same cluster as the query, in a distribution with well-separated clusters [4, 5]. These nearest neighbors are "meaningful" for the purpose of separating different clusters in a dataset. But the (single) nearest neighbor, in the true sense of the term, cannot be obtained using a distance measure, even in the special distribution of well-separated clusters described above [5].

Unlike many existing measures, the recently introduced Isolation Kernel [33, 43] is derived directly from a finite dataset without learning and has no closed-form expression. We show here that Isolation Kernel is not affected by the concentration effect; it works well in high dimensions, and it does not need to rely on low intrinsic dimensions to explain why it works well in high dimensions.

Our theoretical analysis focuses on the ability of a feature map, produced by kernel evaluation using a finite dataset, to distinguish two distinct points in high dimensions (distinguishability or indistinguishability hereafter). A kernel is said to have indistinguishability if its feature map is unable to distinguish two distinct points. This is manifested as being unable to find the single nearest neighbor of a query, because there are many nearest neighbors which cannot be distinguished from each other, especially in high dimensions.
Our contributions are:
I. Revealing that Isolation Kernel (IK) is the only measure that is provably free from the curse of dimensionality, compared with existing metric-based Lipschitz continuous kernels.

II. Determining three key factors in IK that lead to breaking the curse of dimensionality: (a) space partitioning using an isolation mechanism is the key ingredient in measuring similarity between two points in IK; (b) the probability of a point falling into a partition is independent of the data distribution, the distance measure used to create the partitions, and the data dimensions; (c) the unique feature map of IK has its dimensionality linked to a concatenation of these partitionings.

III. Proving that IK's unique feature map produced by a finite dataset has distinguishability, independent of the number of data dimensions and the data distribution. Our theorem suggests that increasing the number of partitionings (i.e., the dimensionality of the feature map) leads to increased distinguishability.

IV. Verifying that IK breaks the curse of dimensionality in two sets of experiments. First, IK finds the single nearest neighbor in high dimensions, while many existing measures fail under the same condition. Second, IK enables consistently effective indexing, spectral & density peaks clustering, SVM classification and t-SNE visualization in both low and high dimensions.

V. Showing that Euclidean distance, Gaussian and linear kernels have a mixed bag of poor and good results in high dimensions, reflecting the current understanding of existing measures.
2 RELATED WORK

2.1 The curse of dimensionality

One of the early investigations of the curse of dimensionality is in terms of concentration of measure [28, 39]. Talagrand [39] summarises the intuition behind the concentration of measure as follows: "a random variable that depends (in a 'smooth' way) on the influence of many independent random variables (but not too much on any of them) is essentially constant." Translating this statement into the context of a high dimensional space: a distance measure which depends on many independent dimensions (but not too much on any of them) is essentially constant.

Beyer et al. [5] analyze the conditions under which the nearest neighbor is "meaningful" when a distance measure is employed in high dimensions. A brief summary of the conditions in high dimensions is given as follows:
arXiv:2109.14198v1 [cs.LG] 29 Sep 2021
a) Finding the (single) nearest neighbor of a query is not meaningful because every point from the same cluster in the database has approximately the same distance to the query. In other words, the variance of the distance distribution within a cluster approaches zero as the number of dimensions increases to infinity.

b) Despite this fact, finding the exact match is still meaningful.

c) The task of retrieving all points belonging to the same cluster as the query is meaningful because, although the variance of distances of all points within the cluster is zero, the distances of this cluster to the query are different from those of another cluster, making a distinction between clusters possible.

d) The task of retrieving the nearest neighbor of a query which does not belong to any of the clusters in the database is not meaningful, for the same reason given in (a).

Beyer et al. [5] reveal that finding (non-exact match) nearest neighbors is "meaningful" only if the dataset has clusters (or structure), where the nearest neighbors refer to any/all points in a cluster rather than a specific point which is the nearest among all points in the dataset. In short, a distance measure cannot produce the "nearest neighbor" in high dimensions, in the true sense of the term. Hereafter the term "nearest neighbor" refers to the single nearest neighbor in a dataset, not the many nearest neighbors in a cluster.
Beyer et al. [5] describe the notion of instability of a distance measure as follows: "A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor."
Current analyses of the curse of dimensionality focus on distance measures [1, 5, 16, 34]. While random projection preserves structure in the data, it still suffers from the concentration effect [20]. Current methods dealing with high dimensions are: dimension reduction [19, 44]; suggestions to use non-metric distances, e.g., fractional distances [1]; and angle-based measures, e.g., [23]. None has been shown to break the curse.

Yet, the influence of the concentration effect in practice is less severe than suggested mathematically [12]. For example, the k-nearest neighbor (kNN) classifier still yields reasonably high accuracy in high dimensions [3], whereas the concentration effect would have rendered kNN's predictions random. One possible explanation of this phenomenon is that the high dimensional datasets employed have low intrinsic dimensions (e.g., [12, 22]).

In a nutshell, existing works center around reducing the effect of concentration of measure, knowing that the nearest neighbor could not be obtained in high dimensions. A saving grace is that many high dimensional datasets used in practice have low intrinsic dimensions, leading to a significantly reduced concentration effect.

Our work shows that the concentration effect can be eliminated to get the nearest neighbor in high dimensions using a recently introduced measure called Isolation Kernel [33, 43], without the need to rely on low intrinsic dimensions. While three existing measures have a mixed bag of poor and good results in high dimensions, Isolation Kernel defined by a finite dataset yields consistently good results in both low and high dimensions.
2.2 Isolation Kernel

Unlike commonly used distances or kernels, Isolation Kernel [33, 43] has no closed-form expression; it is derived directly from a given dataset without learning, and the similarity between two points is computed based on the partitions created in data space.

The key requirement of Isolation Kernel (IK) is a space partitioning mechanism which isolates a point from the rest of the points in a sample set. This can be achieved in several ways, e.g., isolation forest [26, 43], Voronoi diagram [33] and isolating hyperspheres [41]. Isolation Kernel [33, 43] is defined as follows.

Let ℋ_ψ(D) denote the set of all admissible partitionings H derived from a given finite dataset D, where each H is derived from a subset 𝒟 ⊂ D; each point in 𝒟 has equal probability of being selected from D; and |𝒟| = ψ. Each partition θ[z] ∈ H isolates one point z ∈ 𝒟 from the rest of the points in 𝒟. The union of all partitions of each partitioning H covers the entire space.

Definition 1. The Isolation Kernel of any two points x, y ∈ ℝ^d is defined to be the expectation, taken over the probability distribution on all partitionings H ∈ ℋ_ψ(D), that both x and y fall into the same isolating partition θ[z] ∈ H, where z ∈ 𝒟 ⊂ D and ψ = |𝒟|:

    K_ψ(x, y | D) = E_{ℋ_ψ(D)}[ 𝟙(x, y ∈ θ[z] | θ[z] ∈ H) ],   (1)
where 𝟙(·) is an indicator function.

In practice, Isolation Kernel K_ψ is constructed using a finite number of partitionings H_i, i = 1, …, t, where each H_i is created using a randomly subsampled 𝒟_i ⊂ D; and θ is a shorthand for θ[z]:

    K_ψ(x, y | D) ≈ (1/t) Σ_{i=1}^{t} 𝟙(x, y ∈ θ | θ ∈ H_i)
                  = (1/t) Σ_{i=1}^{t} Σ_{θ ∈ H_i} 𝟙(x ∈ θ) 𝟙(y ∈ θ).   (2)

This gives a good approximation of K_ψ(x, y | D) when |D| and t are sufficiently large to ensure that the ensemble is obtained from a sufficient number of mutually independent 𝒟_i, i = 1, …, t.
Given H_i, let Φ_i(x) be a ψ-dimensional binary column (one-hot) vector representing all θ_j ∈ H_i, j = 1, …, ψ, where x must fall into only one of the ψ partitions. The j-th component of the vector is Φ_{ij}(x) = 𝟙(x ∈ θ_j | θ_j ∈ H_i). Given t partitionings, Φ(x) is the concatenation of Φ_1(x), …, Φ_t(x).
Proposition 1 (Feature map of Isolation Kernel). The feature mapping Φ : x → {0, 1}^{t×ψ} of K_ψ, for a point x ∈ ℝ^d, is a vector that represents the partitions in all the partitionings H_i ∈ ℋ_ψ(D), i = 1, …, t, that contain x, where x falls into only one of the ψ partitions in each partitioning H_i. Then ‖Φ(x)‖ = √t holds, and Isolation Kernel always computes its similarity using its feature map as:

    K_ψ(x, y | D) ≈ (1/t) ⟨Φ(x|D), Φ(y|D)⟩.

Let 𝟏 be a shorthand for a Φ_i(x) such that Φ_{ij}(x) = 1 and Φ_{ik}(x) = 0, ∀k ≠ j, for some j ∈ [1, ψ].

The feature map Φ(x) is sparse because exactly t of its tψ elements are 1 and the rest are zeros. This gives ‖Φ(x)‖ = √t. Note that while every point has ‖Φ(x)‖ = √t and Φ_i(x) = 𝟏 for all i ∈ [1, t], the feature maps of distinct points are not all the same. Note also that Isolation Kernel is a positive definite kernel with probability 1 in the limit of t → ∞, because its Gram matrix is full rank, as Φ(x) for all points x ∈ D are mutually independent (see [41] for details). In other words, the feature map coincides with a reproducing kernel Hilbert space (RKHS) associated with the Isolation Kernel.
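As a concrete illustration of Definition 1 and Proposition 1, the following is a minimal NumPy sketch (our illustration, not the authors' implementation) of the Voronoi-based feature map and the kernel it induces; the choices of ψ = 16, t = 200 and the toy data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def ik_feature_map(X, D, psi=16, t=200):
    """Voronoi-based IK feature map: for each of t partitionings, draw psi
    points from the reference set D; a point's one-hot block marks the
    Voronoi cell (nearest drawn point) it falls into.  Shape: (n, t*psi)."""
    n = X.shape[0]
    Phi = np.zeros((n, t * psi))
    for i in range(t):
        centers = D[rng.choice(len(D), size=psi, replace=False)]
        cell = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Phi[np.arange(n), i * psi + cell] = 1.0
    return Phi

def ik_similarity(phi_x, phi_y, t):
    # Proposition 1: K_psi(x, y | D) ~ <Phi(x), Phi(y)> / t
    return float(phi_x @ phi_y) / t

D = rng.normal(size=(1000, 8))
Phi = ik_feature_map(D[:5], D)
print(ik_similarity(Phi[0], Phi[0], 200))  # self-similarity is exactly 1.0
```

Each row has exactly t ones, so ‖Φ(x)‖ = √t holds by construction, as the proposition states.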
This finite-dimensional feature map has been shown to be the key to fast runtime in kernel-based anomaly detectors [41] and to enabling the use of fast linear SVM [40, 48].

IK has been shown to be better than Gaussian and Laplacian kernels in SVM classification [43], better than Euclidean distance in density-based clustering [33], and better than the Gaussian kernel in kernel-based anomaly detection [41].

IK's superiority has been attributed to its data dependent similarity, given in the following lemma.

Let D ⊂ X_D ⊂ ℝ^d be a dataset sampled from an unknown distribution P_D, where X_D is the support of P_D on ℝ^d; let ρ_D(x) denote the density of P_D at point x ∈ X_D; and let ℓ_p be the ℓ_p-norm. The unique characteristic of Isolation Kernel [33, 43] is:

Lemma 1. Isolation Kernel K_ψ has the data dependent characteristic:

    K_ψ(x, y | D) > K_ψ(x′, y′ | D)

for ℓ_p(x − y) = ℓ_p(x′ − y′), ∀x, y ∈ X_S and ∀x′, y′ ∈ X_T, subject to ∀z ∈ X_S, z′ ∈ X_T, ρ_D(z) < ρ_D(z′).

As a result, K_ψ(x, y | D) is not translation invariant, unlike the data independent Gaussian and Laplacian kernels.
We show in Section 3.2 that this feature map is instrumental in making IK the only measure that breaks the curse, compared with existing metric-based Lipschitz continuous kernels.

3 ANALYZING THE CURSE OF DIMENSIONALITY IN TERMS OF (IN)DISTINGUISHABILITY

In this section, we theoretically show that IK does not suffer from the concentration effect in any high dimensional data space, while metric-based Lipschitz continuous kernels, including the Gaussian and Laplacian kernels, do.

We introduce the notion of indistinguishability of two distinct points in a feature space produced by kernel evaluation using a finite dataset to analyze the curse of dimensionality. Let Ω ⊂ ℝ^d be a data space, D ⊂ Ω be a finite dataset, and P_D be the probability distribution that generates D. The support of P_D, denoted X_D, is an example of Ω.
Given two distinct points x_i, x_j ∈ Ω, their feature vectors φ(x_i), φ(x_j) ∈ H_κ produced by a kernel κ and a finite dataset D = {x_k ∈ ℝ^d | k = 1, …, n} are φ(x_i) = [κ(x_i, x_1), …, κ(x_i, x_n)]^⊤ and φ(x_j) = [κ(x_j, x_1), …, κ(x_j, x_n)]^⊤.

Definition 2. H_κ has indistinguishability if there exists some δ_L ∈ (0, 1) such that the probability of φ(x_i) = φ(x_j) is lower bounded by δ_L as follows:

    P(φ(x_i) = φ(x_j)) ≥ δ_L.   (3)

On the other hand, consider two feature vectors Φ(x_i), Φ(x_j) ∈ H_K produced by a kernel K and a finite dataset D.

Definition 3. H_K has distinguishability if there exists some δ_U ∈ (0, 1) such that P(Φ(x_i) = Φ(x_j)) is upper bounded by δ_U as follows:

    P(Φ(x_i) = Φ(x_j)) ≤ δ_U.   (4)
Note that these definitions of indistinguishability and distinguishability are rather weak. In practice, δ_L should be close to 1 for strong indistinguishability, and δ_U should be far less than 1 for strong distinguishability.

In the following, we first show in Section 3.1 that a feature space H_κ produced by any metric-based Lipschitz continuous kernel κ and a finite dataset D has indistinguishability when the number of dimensions d of the data space is very large. Then, in Section 3.2, we show that a feature space H_K produced by Isolation Kernel K implemented using the Voronoi diagram [33] has distinguishability for any d.
3.1 H_κ of a metric-based Lipschitz continuous kernel κ has indistinguishability

Previous studies have analyzed the concentration effect in high dimensions [14, 24, 32]. Let a triplet (Ω, m, F) be a probability metric space, where m : Ω × Ω ↦ ℝ is a metric and F : 2^Ω ↦ ℝ is a probability measure. The ℓ_p-norm is an example of m. Ω is the support of F, and thus F is assumed to have non-negligible probabilities over the entire Ω, i.e., points drawn from F are widely distributed over the entire data space.

Further, let f : Ω ↦ ℝ be a Lipschitz continuous function, i.e., |f(x_i) − f(x_j)| ≤ m(x_i, x_j), ∀x_i, x_j ∈ Ω; and let M be a median of f defined as:

    F({x ∈ Ω | f(x) ≤ M}) = F({x ∈ Ω | f(x) ≥ M}).

Then, the following proposition holds [14, 32].

Proposition 2. F({x ∈ Ω | f(x) ∈ [M − ε, M + ε]}) ≥ 1 − 2α(ε) holds for any ε > 0, where α(ε) = C₁ e^(−C₂ε²d) with two constants 0 < C₁ ≤ 1/2 and C₂ > 0.
For our analysis of the indistinguishability of a feature space H_κ associated with any metric-based Lipschitz continuous kernel, we reformulate this proposition by replacing f(x) and M with m(x, y) and M(y), respectively, for a given y ∈ Ω. That is:

    |m(x_i, y) − m(x_j, y)| ≤ m(x_i, x_j), ∀x_i, x_j ∈ Ω, and
    F({x ∈ Ω | m(x, y) ≤ M(y)}) = F({x ∈ Ω | m(x, y) ≥ M(y)}).
The inequality in the first formula always holds because of the triangle inequality of m. Then, we obtain the following corollary:

Corollary 1. Under the same condition as in Proposition 2, the inequality F(A_ε(y)) ≥ 1 − 2α(ε) holds for any y ∈ Ω and any ε > 0, where A_ε(y) = {x ∈ Ω | m(x, y) ∈ [M(y) − ε, M(y) + ε]}.

This corollary represents the concentration effect, where almost all of the probability mass of the data is concentrated in the limited area A_ε(y). It provides an important fact on the indistinguishability of two distinct points using a class of metric-based Lipschitz continuous kernels, such as the Gaussian kernel:

    κ(x, y) = g_κ(m(x, y)),

where g_κ is Lipschitz continuous and monotonically decreasing in m(x, y).
Lemma 2. Given two distinct points x_i, x_j i.i.d. drawn from F on Ω, and a dataset D = {x_k ∈ ℝ^d | k = 1, …, n} ∼ G^n, where every x_k is i.i.d. drawn from any probability distribution G on Ω, let the feature vectors of x_i and x_j be φ(x_i) = [κ(x_i, x_1), …, κ(x_i, x_n)]^⊤ and φ(x_j) = [κ(x_j, x_1), …, κ(x_j, x_n)]^⊤ in a feature space H_κ. The following holds for any ε > 0:

    P( ℓ_p(φ(x_i) − φ(x_j)) ≤ 2Lεn^(1/p) ) ≥ (1 − 2α(ε))^(2n),

where ℓ_p is an ℓ_p-norm on H_κ, and L is the Lipschitz constant of g_κ.

Theorem 1. Under the same condition as Lemma 2, the feature space H_κ has indistinguishability in the limit of d → ∞.

This result implies that the feature space H_κ produced by any finite dataset and any metric-based Lipschitz continuous kernel, such as the Gaussian and Laplacian kernels, has indistinguishability when the data space has a high number of dimensions d. In other words, these kernels suffer from the concentration effect.
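The concentration effect behind Theorem 1 is easy to observe numerically. The sketch below (our illustration, not from the paper) measures the relative spread (std/mean) of Euclidean distances from a random query to uniform data; it shrinks roughly as 1/√d as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_spread(d, n=1000):
    """std/mean of distances from one query to n points uniform in [0,1]^d.
    Concentration of measure drives this ratio toward zero as d grows."""
    X = rng.random((n, d))
    q = rng.random(d)
    dist = np.linalg.norm(X - q, axis=1)
    return dist.std() / dist.mean()

for d in (10, 100, 1000, 10000):
    print(d, round(relative_spread(d), 4))
```

All distances become "essentially constant" relative to their mean, which is precisely why a single nearest neighbor becomes meaningless for such measures.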
3.2 H_K of Isolation Kernel K has distinguishability

We show that the feature map of Isolation Kernel implemented using the Voronoi diagram [33], which is not a metric-based Lipschitz continuous kernel, has distinguishability.
We derive Lemma 3 from the fact given in [13], before providingour theorem.
Lemma 3. Given x ∈ Ω and a data set 𝒟 = {z_1, …, z_ψ} ∼ G^ψ, where every z_j is i.i.d. drawn from any probability distribution G on Ω and forms its Voronoi partition θ_j ⊂ Ω, the probability that z_j is the nearest neighbor of x in 𝒟 is given as P(x ∈ θ_j) = 1/ψ, for every j = 1, …, ψ.

This lemma points to a simple but nontrivial result: P(x ∈ θ_j) is independent of G, of the metric m(x, y), and of the data dimension d.
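A quick Monte Carlo check of Lemma 3 (our sketch; the skewed exponential distribution for G and the sizes are arbitrary choices): because the ψ sampled points are exchangeable, each is equally likely to be the nearest neighbour of x, regardless of distribution or dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
psi, d, trials = 8, 50, 20000

hits = np.zeros(psi)
for _ in range(trials):
    D = rng.exponential(size=(psi, d))   # cell "centres" z_1..z_psi from a skewed G
    x = rng.exponential(size=d)          # test point
    hits[np.linalg.norm(D - x, axis=1).argmin()] += 1  # which z_j is nearest?

print(np.round(hits / trials, 3))  # every entry is close to 1/psi = 0.125
```

Repeating this with any other distribution or any d gives the same uniform 1/ψ frequencies, which is the distribution- and dimension-independence the lemma asserts.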
Theorem 2. Given two distinct points x_i, x_j i.i.d. drawn from F on Ω, and 𝒟_i = {z_1, …, z_ψ} ∼ G^ψ as defined in Lemma 3, let the feature vectors of x_i and x_j be Φ(x_i) and Φ(x_j) in the feature space H_K associated with Isolation Kernel K, implemented using the Voronoi diagram, as given in Proposition 1. Then the following holds:

    P(Φ(x_i) = Φ(x_j)) ≤ 1/ψ^t,

and H_K always has strong distinguishability for large ψ and t.
Proofs of Lemmas 2 & 3 and Theorems 1 & 2 are given in Appendix A.

In summary, IK implemented using the Voronoi diagram is free from the curse of dimensionality for any probability distribution.
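Theorem 2's message, that feature-map collisions decay geometrically as t grows, can be illustrated with a small simulation (ours, with arbitrary sizes; the formal bound is proved in Appendix A). Two points collide only if they share a Voronoi cell in every one of the t partitionings:

```python
import numpy as np

rng = np.random.default_rng(3)

def collision_rate(psi, t, trials=2000, d=5):
    """Fraction of i.i.d. point pairs whose IK feature maps coincide,
    i.e. the pair shares a Voronoi cell in all t partitionings."""
    hits = 0
    for _ in range(trials):
        xi, xj = rng.random(d), rng.random(d)
        for _ in range(t):
            Z = rng.random((psi, d))  # psi cell centres of one partitioning
            if (np.linalg.norm(Z - xi, axis=1).argmin()
                    != np.linalg.norm(Z - xj, axis=1).argmin()):
                break                 # different cells: feature maps differ
        else:
            hits += 1                 # same cell in every partitioning
    return hits / trials

for t in (1, 2, 4):
    print(t, collision_rate(psi=4, t=t))
```

Even for a tiny ψ, the collision rate drops rapidly with each additional partitioning, matching the theorem's claim that increasing t increases distinguishability.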
3.3 The roles of ψ & t and the time complexity of Isolation Kernel

The sample size ψ (i.e., the number of partitions in one partitioning) in IK has a function similar to the bandwidth parameter in the Gaussian kernel: the higher ψ is, the sharper the IK distribution [43] (just as the smaller the bandwidth, the sharper the Gaussian kernel distribution). In other words, ψ needs to be tuned for a specific dataset to produce good task-specific performance.
The number of partitionings t has three roles. First, the higher t is, the better Eq. (2) estimates the expectation described in Eq. (1); a high t also leads to a low variance of the estimate. Second, through its feature map, t can be viewed as IK's user-definable effective number of dimensions in Hilbert space. Third, Theorem 2 reveals that increasing t leads to increased distinguishability. The first two roles were uncovered in previous studies [33, 41, 43]; we reveal the last role here in Theorem 2. These effects can be seen in the experiment reported in Appendix D.

In summary, IK's feature map, which has its dimensionality linked to distinguishability, is unique among existing measures. In contrast, the Gaussian kernel (and many existing metric-based kernels) has a feature map with intractable dimensionality, and yet it has indistinguishability.
Time complexity: IK, implemented using the Voronoi diagram, has time complexity O(nψtd) to convert n points in data space to n vectors in its feature space. Note that each 𝒟 implemented using one-nearest-neighbor produces a Voronoi diagram implicitly; no additional computation is required to build the Voronoi diagram explicitly (see Qin et al. [33] for details). In addition, as this implementation is amenable to acceleration using parallel computing, the cost can be reduced by a factor of w parallel workers to O(nψtd/w). The readily available finite-dimensional feature map of IK enables a fast linear SVM to be used. In contrast, the Gaussian kernel (GK) must use a slow nonlinear SVM. Even including IK's feature mapping time, SVM with IK still runs up to two orders of magnitude faster than SVM with GK. See Section 4.2 for details.
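To make the pipeline concrete, here is a sketch (ours, not the paper's implementation) of the O(nψtd) mapping followed by a simple linear classifier; a perceptron stands in for the linear SVM, and the two-Gaussians data, ψ and t are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def ik_map(X, D, psi, t):
    """O(n*psi*t*d): t rounds of psi-centre Voronoi (1-NN) assignment."""
    n = X.shape[0]
    Phi = np.zeros((n, t * psi))
    for i in range(t):
        centers = D[rng.choice(len(D), psi, replace=False)]
        cell = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        Phi[np.arange(n), i * psi + cell] = 1.0
    return Phi

# Two well-separated Gaussian classes; a linear model trained on Phi is
# then a kernel machine with IK.
X = np.vstack([rng.normal(0, 1, (200, 20)), rng.normal(4, 1, (200, 20))])
y = np.array([-1.0] * 200 + [1.0] * 200)
Phi = ik_map(X, X, psi=8, t=100)

w = np.zeros(Phi.shape[1])              # perceptron in place of linear SVM
for _ in range(20):
    for i in rng.permutation(len(y)):
        if y[i] * (Phi[i] @ w) <= 0:
            w += y[i] * Phi[i]

print(((Phi @ w > 0) == (y > 0)).mean())  # training accuracy
```

Because the feature map is finite-dimensional and sparse, any linear learner can consume it directly; this is the property that lets the paper substitute a fast linear SVM for a slow nonlinear one.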
4 EMPIRICAL VERIFICATION

We verify that IK breaks the curse of dimensionality by conducting two experiments. In Section 4.1, we verify the theoretical results in terms of the instability of a measure, as used in a previous study of the concentration effect [5]. This examines the (in)distinguishability of linear, Gaussian and Isolation kernels. In the second experiment, in Section 4.2, we explore the impact of IK having distinguishability in tasks including ball tree indexing [30], SVM classification [6] and t-SNE visualization [44].
4.1 Instability of a measure

The concentration effect is assessed in terms of the variance of the measure distribution F_{m,d} with respect to the mean of the measure, i.e., var(F_{m,d}/E[F_{m,d}]), as the number of dimensions d increases, based on a measure m(x, y) = 1 − κ(x, y). A measure is said to have the concentration effect if the variance of the measure distribution in dataset D approaches zero as d → ∞.
Following [5], we use the same notion of instability of a measure, as a result of a nearest neighbor query, to show the concentration effect. Let m(q|D) = min_{y∈D} [1 − κ(q, y)] be the distance, as measured by κ, from query q to its nearest neighbor in dataset D; and let N_ε be the number of points in D having distances wrt q less than (1 + ε)·m(q|D). When there are many close nearest neighbors for any small ε > 0, i.e., N_ε is large, then the nearest neighbor query is said to be unstable. An unstable query is a direct consequence of the concentration effect. Here we use N_ε as a proxy for indistinguishability: for small ε, a high N_ε signifies indistinguishability; N_ε = 1 or close to 1 denotes distinguishability.

Figure 1: The instability of a measure κ: Gaussian kernel (GK) and linear kernel (LK) versus IK. Subfigures: (a) the dataset used; (b) the concentration effect; (c) a query between the 2 clusters; (d) a query at the sparse cluster center. The dataset has 1,000 data points per cluster; the number of dimensions is increased from 10 to 10,000. The query of (c) is used in subfigure (b). ε = 0.005 is used in subfigures (c) & (d).

Table 1: The instability of a measure. The dataset shown in Figure 1(a) is used here; ε = 0.005. Subfigures: SNN, AG and IK, and ℓ_p-norm for p = 0.1, 0.5 & 2, each for a query between the 2 clusters and a query at the sparse cluster center.
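The instability proxy N_ε is straightforward to compute. The sketch below (our code, on uniform toy data, averaged over a few queries) shows it exploding for Euclidean distance as d grows, mirroring the GK curves in Figure 1:

```python
import numpy as np

rng = np.random.default_rng(5)

def n_eps(q, X, eps=0.005):
    """Number of points within (1 + eps) of the query's nearest-neighbour
    Euclidean distance.  N_eps close to 1 means the query is stable."""
    dist = np.linalg.norm(X - q, axis=1)
    return int((dist <= (1.0 + eps) * dist.min()).sum())

results = {}
for dim in (10, 1000, 10000):
    X = rng.random((1000, dim))
    results[dim] = np.mean([n_eps(rng.random(dim), X) for _ in range(5)])
    print(dim, results[dim])
```

In low dimensions the average N_ε stays near 1 (the nearest neighbor is unique); in high dimensions many points crowd into the (1 + ε) band, i.e., the query becomes unstable.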
Figure 1 shows the verification outcomes using the dataset shown in Figure 1(a). Figure 1(b) shows the concentration effect, i.e., var(F_{m,d}/E[F_{m,d}]) → 0 as d → ∞ for the Gaussian kernel (GK). Yet, using Isolation Kernel (IK), the variance approaches a non-zero constant under the same condition.

Figures 1(c) & 1(d) show that N_ε increases as d increases for GK for two different queries, and LK is unstable for one query. Yet IK maintained an almost constant N_ε (= 1) as d increases.
SNN, AG, linear kernel and fractional distances: Here we examine the stability of Shared Nearest Neighbor (SNN) [17, 18], Adaptive Gaussian kernel (AG) [49] and fractional distances [1].

Our experiment using SNN has shown that it has query stability if the query is within a cluster; but it has the worst query instability when the query is outside any cluster and k is set to less than the cluster size (1,000 in this case), even in low dimensions! This result is shown in the first two subfigures in Table 1. AG (k = 200) has a similar behaviour to SNN using k > 1000.

The last two subfigures in Table 1 show a comparison between Euclidean distance and fractional distances (ℓ_p, p = 0.1, 0.5). It is true that the fractional distances are better than Euclidean distance at delaying the concentration effect up to d = 1000, but they could not resist the concentration effect in higher dimensions.
Summary. Many existing measures have indistinguishability in high dimensions, which has prevented them from finding the nearest neighbor when the query is outside of any cluster. Only IK has distinguishability. Note that IK uses distance to determine the (Voronoi) partition into which a point falls, but IK's similarity does not directly rely on distance. IK has a unique t-dimensional feature map whose distinguishability increases as t increases, as described in Section 3.2. No existing metric-based Lipschitz continuous kernel has a feature map with this property.
4.2 The impact of (in)distinguishability on four traditional tasks

We use the same ten datasets in the following four tasks. All have been used in previous studies (see Appendix B), except two synthetic datasets: Gaussians and w-Gaussians. The former has two well-separated Gaussians, both on the same 10,000 dimensions; the latter has two w-dimensional Gaussians which overlap at the origin only, as shown in Figure 2. Notice below that algorithms which employ distance or Gaussian kernel have problems with the w-Gaussians dataset in all four tasks, but they have no problems with the Gaussians dataset.
4.2.1 Exact nearest neighbor search using ball tree indexing. Existing indexing techniques are sensitive to the dimensionality, data size and structure of the dataset [21, 38]. They only work on low dimensional datasets of a moderate size where the dataset has clusters.
Table 2 shows the comparison between the brute force search and an indexed search based on a ball tree [30], in terms of runtime, using Euclidean distance, IK and linear kernel (LK). Using LK, the ball tree index is always worse than the brute force search on all datasets except the lowest dimensional one. Using Euclidean distance, we have a surprising result: the ball tree index ran faster than brute force on one high dimensional dataset. However, this result cannot be "explained away" by arguing that the datasets may have low intrinsic dimensions (IDs), because the Gaussians dataset has 10,000 (theoretically true) IDs, while the indexed search ran slower on all other real-world high dimensional datasets, each of which has a median ID of less than 25, as estimated using a recent local ID estimator [2] (see the full result in Appendix C).

Table 2: Runtimes of exact 5-nearest-neighbors search: brute force vs ball tree index. Boldface indicates the faster runtime between brute force and index, or the better precision. The experimental details are in Appendix B. The Gaussians dataset is as used in Figure 1; w-Gaussians (where w = 5,000) is as shown in Table 5. The last two columns show the retrieval precision of the 5 nearest neighbours. Every point in a dataset is used as a query; the reported result is an average over all queries in the dataset.

Dataset      #points  #dims      #class | Distance: Brute  Balltree | IK: Brute  Balltree  Map | LK: Brute  Balltree | P@5: Distance  IK
Url          3,000    3,231,961  2      | 38,702   41,508          | 112     109      9       | 39,220  43,620      | .88   .95
News20       10,000   1,355,191  2      | 101,243  113,445         | 3,475   2,257    16      | 84,986  93,173      | .63   .88
Rcv1         10,000   47,236     2      | 3,615    4,037           | 739     578      11      | 5,699   6,272       | .90   .94
Real-sim     10,000   20,958     2      | 2,541    2,863           | 4,824   4,558    21      | 2,533   2,853       | .50   .63
Gaussians    2,000    10,000     2      | 46       41              | 229     213      46      | 45      54          | 1.00  1.00
w-Gaussians  2,000    10,000     2      | 53       77              | 210     205      45      | 56      73          | .90   1.00
Cifar-10     10,000   3,072      10     | 340      398             | 7,538   7,169    67      | 367     439         | .25   .27
Mnist        10,000   780        10     | 58       72              | 1,742   1,731    2       | 87      106         | .93   .93
A9a          10,000   122        2      | 10       13              | 5,707   5,549    1       | 12      16          | .79   .79
Ijcnn1       10,000   22         2      | 2.3      1.3             | 706     654      1       | 2.2     1.8         | .97   .96
Average                                 |                          |                          |                     | .77   .84
Yet, with an appropriate value of ψ, IK yielded a faster search using the ball tree index than the brute force search on all datasets, without exception. Note that the comparison using IK is independent of the feature mapping time of IK, because both the brute force and the ball tree indexing require the same feature mapping time (shown in the "Map" column in Table 2).

It is interesting to note that the feature map of IK uses an effective number of dimensions of t = 200 in all experiments. While one may be tempted to view this as a dimension reduction method, it is a side effect of the kernel, not by design. The key to IK's success in enabling efficient indexing is the kernel's characteristic in high dimensions (in Hilbert space). IK is not designed for dimension reduction.

Note that IK works not because it has used a low-dimensional feature map. Our result in Table 2 is consistent with the previous evaluation (up to 80 dimensions only [21]), i.e., indexing methods ran slower than brute force when the number of dimensions is more than 16. If IK's result were due to a low dimensional feature map, the map would have to be about 20 dimensions or fewer for IK to work. But all IK results in Table 2 have used t = 200.

Nevertheless, the use of t = 200 effective dimensions has an impact on the absolute runtime. Notice that, when using the brute force method, the actual runtime of IK was one to two orders of magnitude shorter than that of distance for d > 40,000, and the reverse is true for d < 40,000. We thus recommend using IK on high dimensional (d > 40,000) datasets, not only because its indexing runs faster than brute force, but also because IK runs significantly faster than distance. On d ≤ 40,000 dimensional datasets, distance is preferred over IK because it runs faster.
The last two columns in Table 2 show the retrieval result in terms of precision of the 5 nearest neighbors. It is interesting to note that IK has better retrieval outcomes than distance in all cases; large differences can be found on News20, Real-sim and w-Gaussians. The only exception is Ijcnn1, which has the lowest number of dimensions, and there the difference in precision is small.

This shows that the concentration effect influences not only the indexing runtime but also the precision of retrieval outcomes.
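Why can a metric index such as a ball tree search IK's feature space at all? Because ‖Φ(x)‖ = √t for every point, ranking by IK similarity coincides with ranking by Euclidean distance between feature maps. A small sketch (ours, with arbitrary sizes) checks this equivalence:

```python
import numpy as np

rng = np.random.default_rng(6)

def ik_map(X, D, psi, t):
    """Voronoi-based IK feature map (one-hot cell per partitioning)."""
    Phi = np.zeros((X.shape[0], t * psi))
    for i in range(t):
        c = D[rng.choice(len(D), psi, replace=False)]
        cell = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        Phi[np.arange(X.shape[0]), i * psi + cell] = 1.0
    return Phi

X = rng.normal(size=(300, 100))
Phi = ik_map(X, X, psi=16, t=200)
q = Phi[0]

# All rows have norm sqrt(t), so ||Phi_x - q||^2 = 2t - 2<Phi_x, q>:
# the most similar point under IK is also the Euclidean nearest neighbour
# in feature space -- exactly what a ball tree built on Phi exploits.
nn_by_sim = int((Phi @ q).argmax())
nn_by_dist = int(np.linalg.norm(Phi - q, axis=1).argmin())
print(nn_by_sim == nn_by_dist)
```

In other words, a standard metric index needs no modification: it is simply built on the t·ψ-dimensional binary vectors instead of the raw data.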
4.2.2 Clustering. In this section, we examine the effect of IK versus GK/distance using a scalable spectral clustering algorithm (SC) [7] and Density-Peak clustering (DP) [36]. We report the average clustering performance, in terms of AMI (Adjusted Mutual Information) [45], of SC over 10 trials on each dataset because SC is a randomized algorithm; only a single trial is required for DP, which is deterministic.
Table 3: Best AMI scores of Desity-Peak clustering (DP)and Spectral Clustering (SC) using Distance/Gaussian ker-nel (GK), Adaptive Gaussian kernel (AG) and IK. Note thatevery cluster in Cifar10 and Ijcnn1 has mixed classes. Thatis why all clustering algorithms have very low AMI.
Dataset      |  DP: Distance   AG    IK  |  SC: GK    AG    IK
Url          |      .07       .19   .16  |     .04   .04   .07
News20       |      .02       .05   .27  |     .16   .15   .21
Rcv1         |      .18       .19   .32  |     .19   .19   .19
Real-sim     |      .02       .03   .04  |     .04   .02   .05
Gaussians    |     1.00      1.00  1.00  |    1.00  1.00  1.00
w-Gaussians  |      .01       .01   .68  |     .24   .25  1.00
Cifar-10     |      .08       .09   .10  |     .11   .11   .10
Mnist        |      .42       .73   .69  |     .49   .50   .69
A9a          |      .08       .17   .18  |     .15   .14   .17
Ijcnn1       |      .00       .00   .07  |     .00   .00   .00
Average      |      .19       .25   .35  |     .24   .24   .35
Table 3 shows the result. With SC, IK improves over GK or AG [49] on almost all datasets. Large improvements are found on w-Gaussians and Mnist. The only dataset on which distance and GK did well is the Gaussians dataset, which has two well-separated clusters.
Table 4: SVM classification accuracy & runtime (CPU seconds). Gaussian kernel (GK), IK and Linear kernel (LK). nnz% = #nonzero_values / ((#train + #test) × #dimensions) × 100. IK ran five trials to produce [mean]±[standard error]. The runtime of SVM with IK includes the IK mapping time. The separate runtimes of IK mapping and SVM are shown in Appendix B.2.

Dataset      #train   #test     #dim      nnz%  | Acc: LK    IK       GK  | Time: LK     IK      GK
Url           2,000   1,000  3,231,961   .0036  |   .98   .98±.001   .98  |      2        3       14
News20       15,997   3,999  1,355,191   .03    |   .85   .92±.007   .84  |     38       60      528
Rcv1         20,242 677,399     47,236   .16    |   .96   .96±.013   .97  |    111      420      673
Real-sim     57,848  14,461     20,958   .24    |   .98   .98±.010   .98  |     49       57    2,114
Gaussians     1,000   1,000     10,000  100.0   |  1.00  1.00±.000  1.00  |     14        9       78
w-Gaussians   1,000   1,000     10,000  100.0   |   .49  1.00±.000   .62  |     20        8       79
Cifar-10     50,000  10,000      3,072   99.8   |   .37   .56±.022   .54  |  3,808    1,087   29,322
Mnist        60,000  10,000        780   19.3   |   .92   .96±.006   .98  |    122       48      598
A9a          32,561  16,281        123   11.3   |   .85   .85±.012   .85  |      1       31      100
Ijcnn1       49,990  91,701         22   59.1   |   .92   .96±.006   .98  |      5       66       95
AG was previously proposed to "correct" the bias of GK [49]. However, our result shows that AG is not better than GK on almost all datasets we have used. There are two reasons. First, we found that previous evaluations searched only a small range of σ for GK (e.g., [7]). Especially on high dimensional datasets, σ shall be searched over a wide range (see Table 7 in Appendix B). Second, spectral clustering is weak on some data distributions even in low dimensional datasets; in this case, neither GK nor AG makes a difference. This is shown in the following example using w-Gaussians, where w = 1, which contains two 1-dimensional Gaussian clusters that overlap only at the origin of the 2-dimensional space.
Figure 2 shows the best spectral clustering results with IK and AG. When the two Gaussians in the w-Gaussians dataset have the same variance, spectral clustering with AG (or GK) fails to separate the two clusters, but spectral clustering with IK succeeds.
(a) SC with IK (AMI=0.91) (b) SC with AG (AMI=0.29)
Figure 2: The spectral clustering results with IK and AG on the w-Gaussians dataset (where w = 1) that contains two 1-dimensional subspace clusters. Each cluster has 500 points, sampled from a 1-dimensional Gaussian distribution N(0, 1). The density distribution of each cluster is shown in (a) wrt the corresponding 1-dimensional subspace.
When the variances of the two subspace clusters are substantially different, spectral clustering with AG yields a result comparably good to that with IK. Note that spectral clustering with GK cannot separate the two subspace clusters regardless. This is the specific condition used previously [49] to show that AG is better than GK in spectral clustering.
In a nutshell, IK generally performs better than GK and AG in spectral clustering on both low and high dimensional datasets.
This clustering result is consistent with the previous result [33] obtained using DBSCAN [9] on both low and high dimensional datasets, comparing distance and IK.
Similar relative results with DP [36] are shown in Table 3, where large differences in AMI in favor of IK over distance/AG can be found on News20 and w-Gaussians.
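As a sketch of how IK plugs into spectral clustering (our own illustration with hypothetical names, not the paper's Matlab implementation, which follows the SC of [7]), one can compute the IK similarity matrix, i.e., the fraction of the t Voronoi partitionings in which two points share a cell, and hand it to a spectral clustering routine as a precomputed affinity. The toy data mimics the w-Gaussians (w = 1) example above:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)

def ik_similarity(X, t=100, psi=8):
    # IK similarity: fraction of the t partitionings in which two points
    # fall into the same Voronoi cell (cells induced by psi sampled centres)
    n = len(X)
    S = np.zeros((n, n))
    for _ in range(t):
        centers = X[rng.choice(n, size=psi, replace=False)]
        cell = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        S += cell[:, None] == cell[None, :]
    return S / t

# two 1-dimensional subspace clusters in 2-d, as in w-Gaussians (w = 1)
a = np.c_[rng.normal(size=200), np.zeros(200)]
b = np.c_[np.zeros(200), rng.normal(size=200)]
X = np.vstack([a, b])

S = ik_similarity(X)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
```

Because S is an average of co-occurrence indicator matrices, it is symmetric and positive semi-definite, so it is a valid precomputed affinity.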
4.2.3 SVM classification. Table 4 shows the comparison between linear kernel, IK and Gaussian kernel using an SVM classifier in terms of accuracy and runtime. The experimental details are in Appendix B.
It is interesting to note that IK produced consistently high (or the highest) accuracy on all datasets. In contrast, both linear and Gaussian kernels have a mixed bag of low and high accuracies. For example, they both perform substantially poorer than IK on News20 and w-Gaussians (where w = 5,000). Each w-Gaussians dataset has two w-dimensional subspace clusters. The w = 1 version of the dataset is shown in the first column of Table 5.
The runtime result shows that SVM using IK runs comparably to or faster than SVM using linear kernel, i.e., they are of the same order, especially in high dimensions. Because both employ LIBLINEAR [10], they are up to two orders of magnitude faster on large datasets than SVM with GK, which must employ the slower nonlinear LIBSVM [6].
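The pipeline behind these SVM numbers can be sketched as follows (our own illustration under stated assumptions: the feature-map construction and all names are ours; for brevity the map is built once over train and test together, whereas a production setup would fix the sampled centres from the training data only). Mapping into IK feature space reduces kernel SVM training to a linear SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

def ik_feature_map(X, t=100, psi=8):
    # Voronoi-based IK feature map: concatenation of t one-hot blocks,
    # with cell centres sampled from X itself
    blocks = []
    for _ in range(t):
        centers = X[rng.choice(len(X), size=psi, replace=False)]
        cell = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        block = np.zeros((len(X), psi))
        block[np.arange(len(X)), cell] = 1.0
        blocks.append(block)
    return np.hstack(blocks) / np.sqrt(t)

# two w-dimensional subspace clusters (w = 2 here), labelled 0 and 1
w, m = 2, 400
A = np.hstack([rng.normal(size=(m, w)), np.zeros((m, w))])
B = np.hstack([np.zeros((m, w)), rng.normal(size=(m, w))])
X = np.vstack([A, B])
y = np.r_[np.zeros(m), np.ones(m)]

Z = ik_feature_map(X)                    # one shared set of partitionings
tr = rng.permutation(2 * m)[:600]        # train/test split
te = np.setdiff1d(np.arange(2 * m), tr)
clf = LinearSVC().fit(Z[tr], y[tr])      # linear SVM in IK feature space
acc = clf.score(Z[te], y[te])
```

The linear SVM operates on sparse one-hot features, which is why the runtime stays in the LIBLINEAR regime rather than the LIBSVM regime.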
4.2.4 Visualization using t-SNE. Studies in visualization [44, 46] often ignore the curse of dimensionality, which raises doubt about the assumptions made by visualization methods. For example, t-SNE [44] employs Gaussian kernel as a means to measure similarity in the high dimensional data space. No studies have examined the effect of its use in the presence of the curse of dimensionality, as far as we know. Here we show one example misrepresentation from t-SNE, due to the use of GK to measure similarity in the high dimensional data space, on the w-Gaussians datasets.
The t-SNE visualization results comparing Gaussian kernel and Isolation Kernel are shown in the middle two columns of Table 5. While t-SNE using GK has preserved the structure in the low dimensional (w = 2) data space, it fails completely to separate the two
Table 5: t-SNE: Gaussian Kernel versus Isolation Kernel on the w-Gaussians datasets having two w-dimensional subspace clusters (shown in the first column). Both clusters are generated using N(0, 1). '×' is the origin, the only place in which the two clusters overlap in the data space.
[Columns: the w = 1 dataset, t-SNE(GK), t-SNE(IK), FIt-SNE, q-SNE; rows: w = 2 and w = 5,000. Images not reproduced in this transcript.]
Table 6: t-SNE: GK versus IK on w-Gaussians datasets having two w-dimensional subspace clusters (different variances). The red and blue clusters are generated using N(0, 1) and N(0, 3²), respectively. '×' is the origin, the only place in which the two clusters overlap in the data space.
[Columns: the w = 1 dataset, t-SNE(GK), t-SNE(IK), FIt-SNE, q-SNE; rows: w = 2 and w = 100. Images not reproduced in this transcript.]
clusters in the high dimensional (w = 5,000) data space. In contrast, t-SNE with IK correctly separates the two clusters in both low and high dimensional data spaces.
Recent improvements on t-SNE, e.g., FIt-SNE [25] and q-SNE [15], explore better local structure resolution or efficient approximation. As they employ Gaussian kernel, they also produce misrepresented structures, as shown in Table 5.
Table 6 shows the results on a variant w-Gaussians dataset where the two clusters have different variances. Here t-SNE using GK produces a misrepresented structure in both low and high dimensions. It has misrepresented the origin in the data space as belonging to the red cluster only, i.e., no connection between the two clusters. Note that, for GK, all points in the red cluster are concentrated at the origin when w = 100. There is no such issue with IK.
In low dimensional data space, this misrepresentation is due solely to the use of a data independent kernel. To make GK adaptive to local density in t-SNE, the bandwidth is learned locally for each point. As a result, the only overlap between the two subspace clusters, i.e., the origin, is assigned to the dense cluster only. The advantage of IK in low dimensional data space is due to the data dependency stated in Lemma 1. This advantage in various tasks has been studied previously [33, 41, 43, 48].
In high dimensional data space, the curse of dimensionality obscures the similarity measurements. Both this effect and the data independence effect collude in high dimensional data space when distance-based measures are used, leading to poor results in the transformed space.
4.2.5 Further investigation using the low dimensional w-Gaussians (w = 2). It is interesting to note that (i) in SVM classification, both GK and IK could yield 100% accuracy while LK still performed poorly at 51% accuracy; and (ii) all three measures enable faster indexed search than brute-force search and have equally high precision on this low dimensional dataset. But only IK does well on the high dimensional w-Gaussians (w = 5,000) in all four tasks, as shown in Tables 2 to 6.
The w-Gaussians datasets provide an example condition in which GK/distance performed well in low dimensions but poorly in high
dimensions in indexed search, SVM classification and t-SNE visualization. This is a manifestation of the indistinguishability that GK/distance, as used in these algorithms, exhibits in high dimensions.

Summary:
• IK is the only measure that consistently (a) enabled faster indexed search than brute-force search, (b) provided good clustering outcomes with SC and DP, (c) produced high accuracy in SVM classification, and (d) yielded matching structures in t-SNE visualization on both low and high dimensional datasets.
• Euclidean distance or GK performed poorly on most high dimensional datasets in all tasks except SVM classification. We show one condition, using the artificial w-Gaussians datasets, in which they did well in low dimensions but performed poorly in high dimensions in all four tasks. The Gaussians dataset is one condition in which Euclidean distance or GK can do well in high dimensions.
• Though low intrinsic dimensionality may explain why GK/LK can do well on some high dimensional datasets in SVM classification, this explanation fails completely for all the slower indexed searches in high dimensions, and for the only faster indexed search, on the Gaussians dataset, since it has high intrinsic dimensionality.
5 DISCUSSION

We have limited our discussion of the curse of dimensionality to nearest neighbors, i.e., their (in)distinguishability. But the effect of the curse is broader than this. For example, it affects the consistency of estimators, an aspect not discussed in this article.
In addition to the concentration effect [5, 39], the curse of dimensionality has also been studied via other phenomena, e.g., the correlation between attributes [8] and hubness [34]. It is interesting to analyze whether IK can deal effectively with these issues, beyond the concentration effect, in high dimensions.
Our results in Tables 2 to 4 show that some existing measures may perform well in high dimensions under special conditions. However, we are not aware of analyses that examine whether any existing measure can break the curse of dimensionality.
Previous work on intrinsic dimensionality (e.g., [22]) has shown that nonparametric distance-based regressors can escape the curse of dimensionality if a high dimensional dataset has low intrinsic dimensionality. This is consistent with the finding that the concentration effect depends more on the intrinsic dimension $I_d$ than on the input dimension $d > I_d$ [12]. Our result in Table 2 on the Gaussians dataset adds another element that needs an explanation for existing measures: why does distance perform well on datasets with high intrinsic dimensionality?
Because our analysis of IK's distinguishability is independent of dimensionality and data distribution, IK does not rely on low intrinsic dimensionality to perform well, as shown on the artificial datasets: both Gaussians and w-Gaussians have high (≥ 5,000 true) intrinsic dimensions in Tables 2 to 6.
Our applications in ball tree indexing, two clustering algorithms, SVM classification and t-SNE here, and previously in DBSCAN clustering [33] and multi-instance learning [48], suggest that many existing algorithms can gain performance improvements by simply replacing distance/kernel with IK. However, a caveat is in order: not all existing algorithms can achieve that. For example, OCSVM [37] and OCSMM [29] have been shown to work poorly with IK [42]. This is because the learning in these two algorithms is designed with a certain geometry in Hilbert space in mind, and IK does not have such a geometry (see details in [42]).
The isolating mechanism used has a direct influence on IK's distinguishability. Note that the proofs of Lemma 3 and Theorem 2 rely on the isolating partitions being a Voronoi diagram. It remains an open question whether IK with a different implementation can be proven to have distinguishability.
Kernel functional approximation is an influential approach to obtain an approximate finite-dimensional feature map from the infinite-dimensional feature map of a kernel such as the Gaussian kernel. Its representative methods are Nyström [47] and random features [11, 35]. The former is often dubbed data dependent and the latter data independent. These terms refer to the use (or non-use) of data samples to derive the approximate feature map. In either case, the kernel employed is data independent and translation invariant, unlike the data dependent IK, whose similarity depends on the local data distribution and which is not translation invariant.
VP-SVM [27] employs a Voronoi diagram to split the input space and then learns a local SVM with a local bandwidth of the Gaussian kernel in each Voronoi cell. It aims only to reduce the training time of SVM, and does not address the curse of dimensionality at all.
A recent work [50] has already shown that IK improves the effectiveness and efficiency of t-SNE by simply replacing the Gaussian kernel. However, it did not address the curse of dimensionality, which is the focus of this paper.
The result of an additional investigation on the hubness effect[34] can be found in Appendix E.
6 CONCLUSIONS

We show for the first time that the curse of dimensionality can be broken using Isolation Kernel (IK). This is possible because (a) IK measures the similarity between two points based on the space partitionings produced by an isolation mechanism; (b) the probability of a point falling into a partition of the Voronoi-diagram isolation partitioning is independent of the data distribution, the distance measure used to create the Voronoi partitions and the data dimension, as implied by Lemma 3; and (c) IK's unique feature map has its dimensionality linked to a concatenation of these partitionings. Theorem 2 suggests that increasing the number of partitionings t (i.e., the dimensionality of the feature map) leads to increased distinguishability, independent of the number of dimensions and the distribution in data space.
Isolation Kernel, with its feature map having the distinguishability stated in Theorem 2, is the key to consistently producing (i) faster indexed search than brute-force search, and high retrieval precision; (ii) good clustering outcomes with SC and DP; (iii) high accuracy in SVM classification; and (iv) matching structures in t-SNE visualization in both low and high dimensional data spaces.
Euclidean distance, Gaussian and linear kernels have a mixed bag of poor and good results in high dimensions in our experiments. This is not a surprising outcome of the curse of dimensionality, and it echoes the current state of understanding of these measures, with or without the aid of intrinsic dimensionality.
A PROOFS OF LEMMAS AND THEOREMS

Lemma 2. Given two distinct points $x_i, x_j$ i.i.d. drawn from $F$ on $\Omega$ and a data set $D = \{x_k \in \mathbb{R}^d \mid k = 1, \dots, n\} \sim G^n$ where every $x_k$ is i.i.d. drawn from any probability distribution $G$ on $\Omega$, let the feature vectors of $x_i$ and $x_j$ be $\phi(x_i) = [\hat\kappa(x_i, x_1), \dots, \hat\kappa(x_i, x_n)]^\top$ and $\phi(x_j) = [\hat\kappa(x_j, x_1), \dots, \hat\kappa(x_j, x_n)]^\top$ in a feature space $H_{\hat\kappa}$. The following inequality holds:

$$P\big(\ell_p(\phi(x_i) - \phi(x_j)) \le 2L\epsilon\, n^{1/p}\big) \ge (1 - 2\alpha(\epsilon))^{2n},$$

for any $\epsilon > 0$, where $\ell_p$ is an $\ell_p$-norm on $H_{\hat\kappa}$, and $L$ is the Lipschitz constant of $\hat\kappa$.
Proof. Since $\hat\kappa$ is continuous and monotonically decreasing in $\ell(x, y)$, the following holds for every $x_k$:

$$A_\epsilon(x_k) = \{x \in \Omega \mid \hat\kappa(\ell(x, x_k)) \in [\hat\kappa(m(x_k) + \epsilon), \hat\kappa(m(x_k) - \epsilon)]\} = \{x \in \Omega \mid \ell(x, x_k) \in [m(x_k) - \epsilon,\, m(x_k) + \epsilon]\}.$$

Thus, $F(A_\epsilon(x_k)) \ge 1 - 2\alpha(\epsilon)$ holds for every $x_k$ from Corollary 1. Further, $P(x_i, x_j \in A_\epsilon(x_k)) = F(A_\epsilon(x_k))^2$ holds for every $x_k$, since $x_i$ and $x_j$ are i.i.d. drawn from $F$, and $x_k$ is i.i.d. drawn from $G$. Therefore, the following holds:

$$P(x_i, x_j \in A_\epsilon(x_k) \text{ for all } k = 1, \dots, n) = \prod_{k=1}^{n} P(x_i, x_j \in A_\epsilon(x_k)) = \prod_{k=1}^{n} F(A_\epsilon(x_k))^2 \ge (1 - 2\alpha(\epsilon))^{2n}.$$

Accordingly, the following holds:

$$P\left(\ell_p(\phi(x_i) - \phi(x_j)) \le \Big(\sum_{k=1}^{n} \big|\hat\kappa(m(x_k) - \epsilon) - \hat\kappa(m(x_k) + \epsilon)\big|^p\Big)^{1/p}\right) \ge (1 - 2\alpha(\epsilon))^{2n}.$$

Moreover, the following inequality holds from the Lipschitz continuity of $\hat\kappa$:

$$\big|\hat\kappa(m(x_k) - \epsilon) - \hat\kappa(m(x_k) + \epsilon)\big| \le L\,\big|(m(x_k) + \epsilon) - (m(x_k) - \epsilon)\big| = 2L\epsilon.$$

The last two inequalities derive Lemma 2. □
Theorem 1. Under the same condition as Lemma 2, the feature space $H_{\hat\kappa}$ has indistinguishability in the limit of $d \to \infty$.

Proof. By choosing $\epsilon$ such that $d \ge \beta/\epsilon^2$ ($\beta > 0$), we obtain the following inequalities from Lemma 2:

$$\alpha(\epsilon) = C_1 e^{-C_2 \epsilon^2 d} \le C_1 e^{-C_2 \beta}, \quad \text{and}$$

$$P\big(\ell_p(\phi(x_i) - \phi(x_j)) \le 2L\epsilon\, n^{1/p}\big) \ge (1 - 2\alpha(\epsilon))^{2n} \ge \delta,$$

where $\delta = (1 - 2C_1 e^{-C_2 \beta})^{2n} \in (0, 1)$. In the limit of $\epsilon \to 0$, i.e., $d \to \infty$,

$$P\big(\ell_p(\phi(x_i) - \phi(x_j)) = 0\big) = P(\phi(x_i) = \phi(x_j)) \ge \delta.$$

Since this holds for any two distinct points $x_i, x_j \in \Omega$ and $\phi(x_i), \phi(x_j) \in H_{\hat\kappa}$, $H_{\hat\kappa}$ has indistinguishability based on Eq 3. □
Lemma 3. Given $x \in \Omega$ and a data set $D = \{z_1, \dots, z_\psi\} \sim G^\psi$ where every $z_j$ is i.i.d. drawn from any probability distribution $G$ on $\Omega$ and forms its Voronoi partition $\theta_j \subset \Omega$, the probability that $z_j$ is the nearest neighbor of $x$ in $D$ is given as $P(x \in \theta_j) = 1/\psi$, for every $j = 1, \dots, \psi$.
Proof. Let $B_r(x) = \{y \in \Omega \mid \ell(x, y) \le r\}$, $\Delta B_r(x) = \{y \in \Omega \mid r \le \ell(x, y) \le r + \Delta r\}$ for $\Delta r > 0$, $u = \int_{B_r(x)} p_G(y)\,dy$ and $\Delta u = \int_{\Delta B_r(x)} p_G(y)\,dy$, where $p_G$ is the probability density of $G$. Further, let two events $E$ and $Q(z_j)$ be as follows:

$E \equiv z_i \notin B_r(x)$ for all $z_i \in D$ ($i = 1, \dots, \psi$), and

$Q(z_j) \equiv \{\, z_j \in \Delta B_r(x)$ for $z_j \in D$, and $z_i \notin \Delta B_r(x)$ for all $z_i \in D$ ($i = 1, \dots, \psi$, $i \ne j$) $\}$.

Then the probability that $z_j$ is in $\Delta B_r(x)$ and is the nearest neighbor of $x$ in $D$, i.e., $x \in \theta_j$, is $P(E \wedge Q(z_j)) = P(E)\, P(Q(z_j) \mid E)$, where

$$P(E) = (1 - u)^\psi, \quad \text{and} \quad P(Q(z_j) \mid E) = \frac{\Delta u}{1 - u}\left\{1 - \frac{\Delta u}{1 - u}\right\}^{\psi - 1}.$$

By letting $\Delta u$ be the infinitesimal $du$, $\Delta u/(1 - u) \to du/(1 - u)$ and $\{1 - \Delta u/(1 - u)\} \to 1$. Then we obtain the following total probability that $z_j$ is the nearest neighbor of $x$ in $D$, i.e., $x \in \theta_j$, by integrating $P(E \wedge Q(z_j))$ over $u \in [0, 1]$, for every $j = 1, \dots, \psi$:

$$P(x \in \theta_j) = \int_0^1 (1 - u)^\psi\, \frac{du}{1 - u} = \frac{1}{\psi}. \qquad \square$$
This lemma points to a simple but nontrivial result: $P(x \in \theta_j)$ is independent of $G$, of the distance $\ell(x, y)$ and of the data dimension $d$.
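Lemma 3 can be checked numerically: by the symmetry of the i.i.d. draws, which $z_j$ ends up as the nearest neighbor of $x$ is uniform over the $\psi$ indices, regardless of $G$, $F$, the distance or the dimension. A small Monte Carlo sketch (our own illustration; the particular distributions are arbitrary choices, deliberately different from each other and far from uniform):

```python
import numpy as np

rng = np.random.default_rng(3)
psi, d, trials = 8, 50, 20000

counts = np.zeros(psi)
for _ in range(trials):
    Z = rng.exponential(size=(psi, d))   # G: a skewed distribution
    x = rng.normal(size=d)               # F: a different distribution
    j = ((Z - x) ** 2).sum(1).argmin()   # index of x's nearest neighbour
    counts[j] += 1

freq = counts / trials                   # each entry should be close to 1/psi
```

Each empirical frequency lands near 1/ψ = 0.125 up to Monte Carlo noise, illustrating that the result does not depend on the data distribution or the dimension.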
Theorem 2. Given two distinct points $x_i, x_j$ i.i.d. drawn from $F$ on $\Omega$ and $D_l = \{z_1, \dots, z_\psi\} \sim G^\psi$ defined in Lemma 3, let the feature vectors of $x_i$ and $x_j$ be $\Phi(x_i)$ and $\Phi(x_j)$ in a feature space $H_K$ associated with Isolation Kernel $K$, implemented using a Voronoi diagram, as given in Proposition 1. Then the following holds:

$$P(\Phi(x_i) = \Phi(x_j)) \le \frac{1}{\psi^t},$$

and $H_K$ always has strong distinguishability for large $\psi$ and $t$.
Proof. Let $\mathbb{1}(\cdot) \in \{0, 1\}$ be an indicator function. The following holds for every $\theta_j \in H$, where $H$ is derived from $D$ ($j = 1, \dots, \psi$) used to implement Isolation Kernel, and $p_F$ and $p_{G^\psi}$ are the probability densities of $F$ and $G^\psi$, respectively:

$$P(x_i, x_j \in \theta_j) = \iiint \mathbb{1}(x_i, x_j \in \theta_j)\, p_{G^\psi}(D)\, p_F(x_i)\, p_F(x_j)\, dD\, dx_i\, dx_j$$
$$= \iint \left(\int \mathbb{1}(x_i \in \theta_j)\, \mathbb{1}(x_j \in \theta_j)\, p_{G^\psi}(D)\, dD\right) p_F(x_i)\, p_F(x_j)\, dx_i\, dx_j$$
$$\le \iint \left(\sqrt{\int \mathbb{1}(x_i \in \theta_j)\, p_{G^\psi}(D)\, dD}\; \sqrt{\int \mathbb{1}(x_j \in \theta_j)\, p_{G^\psi}(D)\, dD}\right) p_F(x_i)\, p_F(x_j)\, dx_i\, dx_j$$
$$= \iint \sqrt{P(x_i \in \theta_j \mid G^\psi)}\; \sqrt{P(x_j \in \theta_j \mid G^\psi)}\; p_F(x_i)\, p_F(x_j)\, dx_i\, dx_j$$
$$= \iint \frac{1}{\psi}\, p_F(x_i)\, p_F(x_j)\, dx_i\, dx_j = \frac{1}{\psi},$$

where the inequality is from the Cauchy–Schwarz inequality and the last line is from Lemma 3. Accordingly, the following holds, because $D_l$ is i.i.d. drawn from $G^\psi$ over $l = 1, \dots, t$:

$$P(\Phi(x_i) = \Phi(x_j)) = \prod_{l=1}^{t} P(x_i, x_j \in \theta_{j_l}) \le \frac{1}{\psi^t},$$

where $\theta_{j_l} \in H_l$, and $H_l$ is derived from $D_l$. Letting $\delta = 1/\psi^t$, Eq 4 holds for any two distinct points $x_i, x_j \in \Omega$. Thus $H_K$ has distinguishability. In particular, $H_K$ has strong distinguishability if $\psi$ and $t$ are large such that $1/\psi^t \ll 1$ holds. □
B EXPERIMENTAL SETTINGS

We implement Isolation Kernel in Matlab R2021a. The parameter t = 200 is used for all experiments using Isolation Kernel (IK).
For the experiments on the instability of a measure (reported in Section 4.1), the similarity scores are normalized to [0,1] for all measures.
In the ball tree search experiments, the IK feature map is used to map each point in the given dataset into a point in Hilbert space; for linear kernel (LK), each point is normalized by the square root of its self-similarity under LK. Then exactly the same ball tree index using Euclidean distance is employed to perform the indexing for LK, IK and distance. The parameter search range for IK is ψ ∈ {3, 5, ..., 250}.
We used the package 'sklearn.neighbors' from Python to conduct the indexing task. The leaf size is 15 and we query the 5 nearest neighbors of each point in the given dataset using brute-force and ball tree indexing. The dataset sizes have been limited to 10,000 or fewer because of the memory requirements of the indexing algorithm. We have also omitted the comparison with AG and SNN [18] because they require a k-nearest neighbor (kNN) search. It does not make sense to perform indexing with AG/SNN to speed up an NN search, because AG/SNN itself requires a kNN search (which in turn requires an index to speed up the search).
Table 7 shows the search ranges of the parameters for the two clustering algorithms and kernels. All datasets are normalised using min-max normalisation so that each attribute is in [0,1] before clustering begins.
Table 7: Parameters and their search ranges in spectral clustering (SC) and density-peaks (DP) clustering. q is the maximum pairwise distance in the dataset.

Algorithm     Parameters and their search range
IK            t = 200; ψ ∈ {2^1, 2^2, ..., 2^10}
GK            σ = q × μ; μ ∈ {2^-5, 2^-4, ..., 2^5}
Adaptive GK   k ∈ {0.05n, 0.1n, 0.15n, ..., 0.45n, 0.5n}
DP            k = #Clusters; ϵ ∈ {1%n, 2%n, ..., 99%n}
SC            k = #Clusters
In Section 4.2, the SVM classifiers in scikit-learn [31] are used with IK/LK and Gaussian kernel. They are based on LIBLINEAR [10] and LIBSVM [6], respectively.
Table 8 shows the search ranges of the kernel parameters for the SVM classifier. A 5-fold cross validation on the training set is used to determine the best parameter. The reported accuracy in Table 4 is obtained on the test set after the final model is trained on the training set with the best parameter determined by the 5-fold CV.
Table 8: Kernel parameters and their search ranges in SVM.

      Parameters and their search range
IK    ψ ∈ {2^q | q = 2, 3, ..., 12}
GK    σ = q × μ; μ ∈ {2^-5, 2^-4, ..., 2^5} (dense datasets)
      σ = μ; μ ∈ {2^-5, 2^-4, ..., 2^5} (sparse datasets: nnz% < 1)
For the t-SNE [44] experiment, we set tolerance = 0.00005. We report the best visualised results from a search over [5, 20, 40, 60, 80, 100, 250, 500, 800] for both ψ (using IK) and perplexity (using GK). When using IK in t-SNE visualization, we replace the similarity matrix calculated by GK with the similarity matrix calculated by IK. For both FIt-SNE [25] and q-SNE [15], we report the best visualization result from the same perplexity search range and fix all other parameters to their default values.
All datasets used are obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, except Gaussians and w-Gaussians, which are our creation.
The machine used in the experiments has one Intel E5-2682 v4 @ 2.50GHz 16-core CPU with 256GB memory.
B.1 A guide for parameter settings of IK

Some advice on parameter settings is in order when using IK in practice. First, finding the 'right' parameters for IK may not be an easy task for some applications such as clustering (and likewise indexing). This is a general problem in unsupervised learning, not specific to the use of IK. For any kernel employed, when no or insufficient labels are available in a given dataset, it is unclear how an appropriate setting can be found in practice.
Second, t can often be set to a default of 200 initially. Then, search for the 'right' ψ for the dataset; this is equivalent to searching for the bandwidth parameter of a Gaussian kernel. Once the ψ setting has been determined, one may increase t to examine whether higher accuracy can be achieved (not attempted in our experiments).
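This advice can be followed mechanically with a pipeline that treats ψ as a tunable hyperparameter (a sketch under stated assumptions: the `IKMap` class and all names are ours, the paper's experiments use Matlab rather than scikit-learn, and we use t = 50 and a tiny ψ grid here only for speed):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

class IKMap(BaseEstimator, TransformerMixin):
    """Voronoi-based IK feature map with t partitionings of psi cells each."""

    def __init__(self, t=50, psi=4, seed=0):
        self.t, self.psi, self.seed = t, psi, seed

    def fit(self, X, y=None):
        # sample the cell centres of every partitioning from the training data
        rng = np.random.default_rng(self.seed)
        self.centers_ = [X[rng.choice(len(X), self.psi, replace=False)]
                         for _ in range(self.t)]
        return self

    def transform(self, X):
        blocks = []
        for c in self.centers_:
            cell = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
            b = np.zeros((len(X), self.psi))
            b[np.arange(len(X)), cell] = 1.0
            blocks.append(b)
        return np.hstack(blocks) / np.sqrt(self.t)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(150, 10)),
               rng.normal(4, 1, size=(150, 10))])
y = np.r_[np.zeros(150), np.ones(150)]

# fix t, then grid-search psi with cross validation, as suggested above
search = GridSearchCV(Pipeline([("ik", IKMap()), ("svm", LinearSVC())]),
                      {"ik__psi": [2, 4, 8, 16]}, cv=3)
search.fit(X, y)
```

After `search.fit`, `search.best_params_` holds the selected ψ; one could then increase t and refit to check for further gains.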
B.2 IK feature mapping time and SVM runtime

Table 9 presents the separate runtimes of IK feature mapping and SVM used to complete the SVM classification experiments.
C ESTIMATED INTRINSIC DIMENSIONS USING TLE

Figure 3 shows the intrinsic dimensions (ID) estimated by a recent local ID estimator, TLE [2].
It is interesting to note that TLE has significantly underestimated the IDs of the Gaussians and w-Gaussians datasets, which have true IDs of 10,000 and 5,000, respectively. This is likely due to the estimator's inability to deal with high IDs. Most papers used a low dimensional Gaussian distribution as ground truth in their experiments, e.g., [2].
Table 9: Runtimes of IK feature mapping and SVM in CPU seconds.

Dataset       ψ    Mapping   SVM
Url            32       2      1
News20         64      41     19
Rcv1           64     394     26
Real-sim      128      44     13
Gaussians       4       8    0.5
w-Gaussians     4       7    0.5
Cifar-10      128     594    493
Mnist          64      31     17
A9a            64       9     22
Ijcnn1        128      26     40
Figure 3: Estimated intrinsic dimensions (ID) for 10 datasets using TLE with k = 50.
Given these underestimations, one might use the typically high estimated IDs to explain the low accuracy of SVM using Gaussian kernel on the w-Gaussians dataset (as TLE estimated it to have more than 75 IDs). But this could not explain SVM's high accuracy on the Gaussians dataset (as it was estimated to have more than 300 IDs).
D THE INFLUENCE OF PARTITIONINGS ON DISTINGUISHABILITY

This section investigates two influences of the partitionings on the $P_\epsilon$ measure used in Section 4.1: (i) the number of partitionings t; and (ii) partitions generated using a dataset different from the given dataset.
Figure 4(a) shows that $P_\epsilon$ decreases as t increases. This effect is exactly as predicted by Theorem 2, i.e., increasing t leads to increased distinguishability.
It is possible to derive an IK from a dataset of uniform distribution (which is different from the given dataset). Figure 4(b) shows the outcome: it exhibits the same phenomenon as the IK derived from the given dataset, except that small t leads to poorer distinguishability (higher $P_\epsilon$ and higher variance).
At high t, there is virtually no difference between the two versions of IK. This is a direct outcome of Lemma 3: the probability of any point in the data space falling into one of the ψ partitions is independent of the dataset used to generate the partitions.
(a) IK (given dataset) (b) IK (uniform data distribution)
Figure 4: $P_\epsilon$ ($\epsilon$ = 0.005) as a result of varying t, with d = 10000. The same dataset as in Figure 1(a) is used. The result for each t value is the average and standard error over 10 trials.
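The qualitative effect in Figure 4(a), that more partitionings yield fewer feature-map collisions, can be reproduced in a few lines (our own Monte Carlo sketch, not the paper's experiment; it estimates the probability that two i.i.d. points share the same Voronoi cell in every one of t partitionings, for t = 1 versus t = 4):

```python
import numpy as np

rng = np.random.default_rng(5)
psi, d, pairs = 4, 10, 5000

def collision_rate(t):
    # fraction of point pairs with identical IK feature vectors,
    # i.e., pairs sharing a cell in all t partitionings
    coll = 0
    for _ in range(pairs):
        x = rng.normal(size=(2, d))
        same = True
        for _ in range(t):
            Z = rng.normal(size=(psi, d))            # psi cell centres
            cell = ((x[:, None, :] - Z[None, :, :]) ** 2).sum(-1).argmin(1)
            if cell[0] != cell[1]:
                same = False
                break
        coll += same
    return coll / pairs

r1, r4 = collision_rate(1), collision_rate(4)        # expect r4 << r1
```

Since a collision requires agreement in every partitioning, the rate shrinks roughly geometrically in t, which is the distinguishability trend the figure reports.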
E HUBNESS EFFECT

Radovanović et al. [34] attribute the hubness effect to high (intrinsic) dimensionality of the data, and not to factors such as sparsity or skewness of the distribution. However, it is unclear why hubness occurs only in the k-nearest neighborhood, and not in the ϵ-neighborhood, on the same dataset.
In the context of the k nearest neighborhood, it has been shown that a few points are the nearest neighbors of many points in a high dimensional dataset [34]. Let $NN_k(y)$ be the set of k nearest neighbors of y; and let the k-occurrences of x, $N_k(x) = |\{y : x \in NN_k(y)\}|$, be the number of other points in the given dataset that have x as one of their k nearest neighbors. As the number of dimensions increases, the distribution of $N_k(x)$ becomes considerably skewed (i.e., there are many points with zero or small $N_k$ and only a few points with large $N_k(\cdot)$) for many widely used distance measures [34]. The points with large $N_k(\cdot)$ are considered 'hubs', i.e., the popular nearest neighbors.
Figure 5 shows the result of $N_5$ vs $P(N_5)$ comparing Gaussian kernel and IK, where $P(N_k(x)) = |\{y \in D \mid N_k(y) = N_k(x)\}| / |D|$. Consistent with the result shown by Radovanović et al. [34] for distance measures, this result shows that the Gaussian kernel suffers from the hubness effect as the number of dimensions increases. In contrast, IK is not severely affected by the hubness effect.
(a) d = 3 (b) d = 20 (c) d = 100

Figure 5: The effect of hubness in k-nearest neighbors: GK vs IK. The experimental setting is as used by [34]: a random dataset is drawn uniformly from the unit hypercube $[0, 1]^d$. The parameter settings used are: ψ = 32 for IK; σ = 5 for GK.
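The skew of $N_k$ under Euclidean distance is easy to reproduce (our own sketch of the [34]-style setting described in the caption: uniform data in the unit hypercube; `k_occurrences` is a name of ours):

```python
import numpy as np

rng = np.random.default_rng(6)

def k_occurrences(X, k=5):
    # N_k(x): number of points that have x among their k nearest neighbours
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    np.fill_diagonal(d2, np.inf)                   # exclude self-neighbours
    knn = np.argsort(d2, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=len(X))

Nk_low = k_occurrences(rng.uniform(size=(500, 3)))     # d = 3
Nk_high = k_occurrences(rng.uniform(size=(500, 100)))  # d = 100
# the N_k distribution is far more skewed in 100 dimensions: a few hubs
# accumulate large N_k while many points have N_k near zero
```

Replacing the Euclidean distances here with IK similarities would, per Figure 5, flatten the high dimensional $N_k$ distribution back towards the low dimensional one.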
REFERENCES

[1] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the 8th International Conference on Database Theory. 420–434.
[2] Laurent Amsaleg, Oussama Chelly, Michael E. Houle, Ken-ichi Kawarabayashi, Milos Radovanovic, and Weeris Treeratanajaru. 2019. Intrinsic Dimensionality Estimation within Tight Localities. In Proceedings of the 2019 SIAM International Conference on Data Mining. 181–189.
[3] Sunil Aryal, Kai Ming Ting, Takashi Washio, and Gholamreza Haffari. 2017. Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowledge and Information Systems 53, 2 (01 Nov 2017), 479–506.
[4] Kristin P. Bennett, Usama Fayyad, and Dan Geiger. 1999. Density-Based Indexing for Approximate Nearest-Neighbor Queries. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 233–243.
[5] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When Is "Nearest Neighbor" Meaningful?. In Proceedings of the 7th International Conference on Database Theory. Springer-Verlag, London, UK, 217–235.
[6] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.
[7] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang. 2010. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 3 (2010), 568–586.
[8] Robert J. Durrant and Ata Kabán. 2009. When is "nearest Neighbour" Meaningful: A Converse Theorem and Implications. Journal of Complexity 25, 4 (2009), 385–397.
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. 226–231.
[10] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (2008), 1871–1874.
[11] X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. 2016. Orthogonal random features. In Advances in Neural Information Processing Systems. 1975–1983.
[12] Damien Francois, Vincent Wertz, and Michel Verleysen. 2007. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering 19, 7 (2007), 873–886.
[13] Keinosuke Fukunaga. 2013. Introduction to Statistical Pattern Recognition (2nd ed.). Academic Press, Chapter 6, Section 6.2.
[14] Mikhael Gromov and Vitali Milman. 1983. A topological application of the isoperimetric inequality. American Journal of Mathematics 105, 4 (1983), 843–854.
[15] Antti Häkkinen, Juha Koiranen, Julia Casado, Katja Kaipio, Oskari Lehtonen, Eleonora Petrucci, Johanna Hynninen, Sakari Hietanen, Olli Carpén, Luca Pasquini, et al. 2020. qSNE: quadratic rate t-SNE optimizer with automatic parameter tuning for large datasets. Bioinformatics 36, 20 (2020), 5086–5092.
[16] Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. 2000. What Is the Nearest Neighbor in High Dimensional Spaces?. In Proceedings of the 26th International Conference on Very Large Data Bases. 506–515.
[17] Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. 2010. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?. In Proceedings of the International Conference on Scientific and Statistical Database Management. 482–500.
[18] Raymond A. Jarvis and Edward A. Patrick. 1973. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. 100, 11 (1973), 1025–1034.
[19] I. T. Jolliffe. 2002. Principal Component Analysis. Springer Series in Statistics. New York: Springer-Verlag.
[20] Ata Kabรกn. 2011. On the Distance Concentration Awareness of Certain DataReduction Techniques. Pattern Recognition 44, 2 (2011), 265โ277.
[21] Ashraf M. Kibriya and Eibe Frank. 2007. An Empirical Comparison of Exact Near-est Neighbour Algorithms. In Proceedings of European Conference on Principles ofData Mining and Knowledge Discovery. 140โ151.
[22] Samory Kpotufe. 2011. K-NN Regression Adapts to Local Intrinsic Dimension. InProceedings of the 24th International Conference on Neural Information ProcessingSystems. 729โ737.
[23] Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. 2008. Angle-BasedOutlier Detection in High-Dimensional Data. In Proceedings of the 14th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining. Asso-ciation for Computing Machinery, 444โ452.
[24] Michel Ledoux. 2001. The Concentration of Measure Phenomenon. MathematicalSurveys & Monographs, The American Mathematical Society.
[25] George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, and Yuval Kluger. 2019. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods 16, 3 (2019), 243–245.
[26] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In Proceedings of the IEEE International Conference on Data Mining. 413–422.
[27] Mona Meister and Ingo Steinwart. 2016. Optimal Learning Rates for Localized SVMs. Journal of Machine Learning Research 17, 1 (2016), 6722–6765.
[28] Vitali D. Mil'man. 1972. New proof of the theorem of A. Dvoretzky on intersections of convex bodies. Functional Analysis and Its Applications 5, 4 (1972), 288–295.
[29] Krikamol Muandet and Bernhard Schölkopf. 2013. One-class Support Measure Machines for Group Anomaly Detection. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence. 449–458.
[30] Stephen M. Omohundro. 1989. Five balltree construction algorithms. International Computer Science Institute, Berkeley.
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[32] Vladimir Pestov. 2000. On the geometry of similarity search: Dimensionality curse and concentration of measure. Inform. Process. Lett. 73, 1 (2000), 47–51.
[33] Xiaoyu Qin, Kai Ming Ting, Ye Zhu, and Vincent Cheng Siong Lee. 2019. Nearest-Neighbour-Induced Isolation Similarity and Its Impact on Density-Based Clustering. In Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence. 4755–4762.
[34] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. Journal of Machine Learning Research 11, 86 (2010), 2487–2531.
[35] Ali Rahimi and Benjamin Recht. 2007. Random Features for Large-scale Kernel Machines. In Advances in Neural Information Processing Systems. 1177–1184.
[36] Alex Rodriguez and Alessandro Laio. 2014. Clustering by fast search and find of density peaks. Science 344, 6191 (2014), 1492–1496.
[37] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Computation 13, 7 (2001), 1443–1471.
[38] Uri Shaft and Raghu Ramakrishnan. 2006. Theory of Nearest Neighbors Indexability. ACM Transactions on Database Systems 31, 3 (2006), 814–838.
[39] Michel Talagrand. 1996. A New Look at Independence. The Annals of Probability 24, 1 (1996), 1–34.
[40] Kai Ming Ting, Jonathan R. Wells, and Takashi Washio. 2021. Isolation Kernel: The X Factor in Efficient and Effective Large Scale Online Kernel Learning. Data Mining and Knowledge Discovery (2021). https://doi.org/10.1007/s10618-021-00785-1
[41] Kai Ming Ting, Bi-Cun Xu, Takashi Washio, and Zhi-Hua Zhou. 2020. Isolation Distributional Kernel: A New Tool for Kernel Based Anomaly Detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 198–206.
[42] Kai Ming Ting, Bi-Cun Xu, Takashi Washio, and Zhi-Hua Zhou. 2020. Isolation Distributional Kernel: A New Tool for Point and Group Anomaly Detection. CoRR (2020). https://arxiv.org/abs/2009.12196
[43] Kai Ming Ting, Yue Zhu, and Zhi-Hua Zhou. 2018. Isolation Kernel and its effect on SVM. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2329–2337.
[44] Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9, 2 (2008), 2579–2605.
[45] Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research 11 (2010), 2837–2854.
[46] Florian Wickelmaier. 2003. An introduction to MDS. Sound Quality Research Unit, Aalborg University (2003).
[47] Christopher K. I. Williams and Matthias Seeger. 2001. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.). MIT Press, 682–688.
[48] Bi-Cun Xu, Kai Ming Ting, and Zhi-Hua Zhou. 2019. Isolation Set-Kernel and its application to Multi-Instance Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 941–949.
[49] Lihi Zelnik-Manor and Pietro Perona. 2005. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems. 1601–1608.
[50] Ye Zhu and Kai Ming Ting. 2021. Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel. Journal of Artificial Intelligence Research 71 (2021), 667–695.