
Upload: ralf-evans

Post on 20-Jan-2016


From: Mark Silverman [mailto:[email protected]] Sent: Wed, May 28, 2014 11:48 PM

Hypothesis: How do you algorithmically resolve conflicts between SPTSs? You place the point in the largest found cluster where the density is greater than the threshold. So point 4 may not be a singleton, depending on density. It's a working theory. I will work through some examples, but I think this approach may resolve my concerns about sparse, high-dimensional data sets.

This first example shows that recursion is important. A slightly different example shows that recursion order is also important:

On May 29, 2014, at 10:31AM [email protected]: One point of view is: the dot product functional is distance dominated, meaning the distance on the d-line between Ld(x) = x o d and Ld(y) = y o d is always less than or equal to distance(x,y)

(so any gap in the projected Ld values IS a gap, probably bigger, in the space; but many gaps in the space may not show up as gaps on a projection d-line).
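The distance-dominated property follows from the Cauchy-Schwarz inequality. A short derivation, assuming d is a unit vector as it is elsewhere in these notes:

```latex
\begin{align*}
|L_d(x) - L_d(y)| &= |x \circ d - y \circ d| = |(x - y) \circ d| \\
                  &\le \|x - y\| \, \|d\| = \|x - y\| \qquad (\|d\| = 1).
\end{align*}
```

Equality holds only when x - y is parallel to d, which is why a projected gap can understate the true separation but never overstate it.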

Therefore there's never a conflict between spts gap-enclosed clusters. Consecutive spts gaps enclose cluster(s), but not necessarily vice versa (a PCI followed by a PCD also encloses a cluster, but there can be ambiguous nesting).

Each spts reveals only some of the gaps, and the spts gap sizes are conservative.

That's why recursion is used. E.g., the gap of 110 between point 4 and its complement is revealed by the spts Attr1. We use the pTree mask restricting to points {1,2,3,5,6,7,8}, and there Attr2 reveals 2 more gaps (missed by Attr1): gap2=100 between {1,3,8} and {2,5,6}, and gap3=22 between {2,5,6} and {7}. But all projection gaps are legit and there are no conflicts.

X
Row  Attr1  Attr2
1    0      0
2    0      100
3    0      0
4    110    110
5    0      114
6    0      123
7    0      145
8    0      0

[Figure: scatter plot of the eight points; points 1, 3 and 8 coincide.]

Row  Attr1  Attr2
1    0      0
2    0      25
3    0      50
4    0      75
5    100    100
6    0      125
7    0      150
8    0      175

[Figure: scatter plot of the eight points.]

Here, Le2=Attr2 reveals no gaps.

So then Le1=Attr1 is applied to all of X and reveals a gap of at least 100 for point 5.

The substantial gap of 50 between {1,2,3,4} and {6,7,8} is missed, whereas if it were done in the other order:

Le1=Attr1 is applied to all of X and reveals a gap of at least 100 for point 5 (note the gap is actually 103.078 = sqrt(100^2 + 25^2)). {1,2,3,4,6,7,8} and {5} are split (and {5} is declared an outlier and finished).

Le2=Attr2 is applied to X-{5}={1,2,3,4,6,7,8} and reveals a gap of at least 50 between {1,2,3,4} and {6,7,8}.

StD: Attr1 = 33, Attr2 = 57.2. So StD doesn't always reveal the best order!
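The order dependence can be reproduced with a minimal Python sketch. It is not the pTree implementation: rows are plain tuples, each attribute is applied once to every current sub-cluster, and the gap threshold of 40 is an assumed value chosen only so that the gaps of 50 and 100 both count. The names `gap_split` and `cluster` are hypothetical.

```python
def gap_split(points, ids, attr, threshold):
    """Split ids wherever consecutive projected values on attribute
    `attr` differ by more than `threshold`."""
    ordered = sorted(ids, key=lambda i: points[i][attr])
    groups, current = [], [ordered[0]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if points[nxt][attr] - points[prev][attr] > threshold:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups


def cluster(points, attr_order, threshold):
    """Apply each attribute once, in the given order, to every sub-cluster."""
    clusters = [sorted(points)]
    for attr in attr_order:
        clusters = [g for c in clusters
                    for g in gap_split(points, c, attr, threshold)]
    return sorted(sorted(c) for c in clusters)


# Second example table: row -> (Attr1, Attr2)
X = {1: (0, 0), 2: (0, 25), 3: (0, 50), 4: (0, 75),
     5: (100, 100), 6: (0, 125), 7: (0, 150), 8: (0, 175)}

attr2_first = cluster(X, [1, 0], threshold=40)  # assumed threshold
attr1_first = cluster(X, [0, 1], threshold=40)
print(attr2_first)
print(attr1_first)
```

Applying Attr2 first leaves {1,2,3,4} and {6,7,8} merged, because point 5 hides the gap on the Attr2 line; applying Attr1 first removes point 5 and exposes it.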

From [email protected] Sent Thur, May 29, 2014 12:21

I see the confusion. You start with any spts, then recurse through the unclustered points {recurse on each subcluster set produced by that SPTS}.

I'm thinking about how to do this in Hadoop in parallel. I have different spts arriving at different processors, so my issue is what happens if these processors arrive at different conclusions.

Good question! Possible answer: parallelize incrementally (as the recursion progresses). Let's assume the entire dataset is replicated to all nodes. (If not, modifications need to be made.) Do the first dot product spts splitting. We get a mask pTree for each gap-enclosed sub-cluster (a mask pTree in Hadoop is a horizontal array of bits specifying which documents are in the sub-cluster). In parallel, send each of those mask pTrees to a different node (also send all of those mask pTrees to the designated "dendrogram building" node).

Second issue is that the entire dataset cannot fit into memory, so getting to density is a challenge.

If we deal strictly with gaps, density may not be necessary. In any case, Barrel Density could give a good approximation. Once the linear distribution is done, compute the distribution of radial reach distances from the d-line. The max (or last PCD) of those numbers should give you a good barrel radius. From here one could simply take the max of the max radial reach (mrr) and the linear projection radius (lpr) as a radius, r. The volume would be roughly r^n, to be divided into the count for density. If that proves too rough, the actual barrel volume is roughly mrr^(n-1) * lpr.

From: Mark Silverman Sent: Thursday, May 29, 2014 1:09

Each processor sees only some of the attributes, not all. I'm thinking we have one pass through the attributes to get stats, and we can also pull the radius. Then we can process sequentially. The only issue now is cluster merge, which I don't see a way around, but there is a huge benefit if we can figure it out, which we will :)

Sparseness is a major issue. For example, in a million docs the word "arrrgghhh" in a single document could cluster.

We might want it to be an outlier (2 emails with "arrrgghhh" probably came from the same sender?). Or we might make a pass through our vocab, flagging those words that we don't want to trigger a cut. Then we would not use those columns as spts's.

Clearly we need to apply some sort of relevance tuning, probably better than stddev.

Agree. Previous slide, last example shows that stddev can fail.

Sparseness breaks kmeans because the average of any attribute will tend to be pretty close to zero. So I'm trying to keep that in mind. The other issue is that we cannot have a single processor manage every attribute, so we need this notion of merging clusters somehow. That's why I've been thinking about the issue where two processors make two different decisions. I have to think about that more.

We could think about those decisions as being separate collections of mask pTrees, then we just AND the collections for the combined dendrogram level?

Consider also streaming data, where later-arriving data may need to be either placed in an existing cluster or a new one. I think it's the same problem?

We should always save the spts cut_points and use those to put new data in the proper cluster. If a new data point, y, is in the middle of a previous gap, is it the start of a new cluster? We could just build dis2(X,y) and calculate the minimum to decide.

I'm going to assume:
1. Gap-based clustering (or splitting) will suffice in text mining (no need to split at PCCs also). Gap-based never requires a dendrogram and therefore never requires density nor fusion. When I use the term "gap" I will mean gap > Gap_Threshold for some chosen GT.
2. The initial dispersal of attributes to nodes is a partition (mutually exclusive). Then for d, to get the "X dot d" spts, use a master-slave arrangement (slaves compute their parts and send their partial spts to the master to be added onto the final spts). Of course, if d=ek the spts is just the given kth column and is available without any required computation at one node.

Thought: A cluster subset is always expressed as a mask pTree (a row of bits in Hadoop). Those results can be sent to a master node to build the full clustering (by ANDing each pair coming from different sites). I think such a master could keep up with the slaves working in parallel. We just have to AND all pairs of mask pTrees, one coming from each of two different sites. The result is one clustering of the entire dataset (the one our gap-based FAUST Oblique algorithm produces for the chosen Minimum Gap Threshold). In the text mining case I am going to guess using only the ek's will suffice (we won't even need combos of them).

Said another way: A gap in any "X dot d" spts gives us two disjoint mask pTrees that split the space in two, and we know that no bona fide cluster subset can span that divide (so there should never be a need to fuse). Therefore it doesn't matter where cluster subsets (or splits) are revealed: at one site or by ANDing pairs of mask pTrees from different sites. From this point of view, a merge step may be unnecessary.
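The pairwise AND can be illustrated with a small Python sketch. Here each mask pTree is modeled as a plain integer bit vector (bit 0 = row 1), which is an assumption for illustration and not the Hadoop representation; the two site clusterings are hypothetical, and the helper names are invented for the sketch.

```python
from itertools import product


def combine_clusterings(site_a, site_b):
    """AND every mask from site A with every mask from site B and keep
    the non-empty results. Masks are ints used as bit vectors."""
    return sorted(a & b for a, b in product(site_a, site_b) if a & b)


def mask_of(rows):
    """Build a mask from 1-based row numbers."""
    return sum(1 << (r - 1) for r in rows)


def members(mask, n_rows):
    """List the 1-based row numbers whose bit is set."""
    return [r + 1 for r in range(n_rows) if (mask >> r) & 1]


# Hypothetical gap splits of 8 rows, found independently at two sites.
site_a = [mask_of([1, 2, 3, 4, 6, 7, 8]), mask_of([5])]
site_b = [mask_of([1, 2, 3, 4, 5]), mask_of([6, 7, 8])]

combined = [members(m, 8) for m in combine_clusterings(site_a, site_b)]
print(combined)
```

Because each site's masks partition the rows, the non-empty pairwise ANDs again partition the rows, so no separate merge or fusion step appears in the sketch.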

Disclaimer: In order to be guaranteed we have gotten all linear gaps, we may need to employ all unit vectors, d (a full partitioning of the unit n-1_half_sphere). And there are gaps that are not linear (not hyperplanar).

Machine Learning = moving data up some concept hierarchy (increasing information and/or/by reducing volume). ML takes two forms: clustering and classification (unsupervised and supervised). Clustering groups similar objects into a single higher-level object, or cluster. Classification does the same but is supervised by an existing class assignment function on a Training Set, f: TS → {Classes}.

1. Clustering can be done for Anomaly Detection (detecting those objects that are dissimilar from the rest, which boils down to finding singleton [and/or doubleton...] clusters).

2. Clustering can be done to develop a Training Set for classification of future unclassified objects.

NCL: (k is discovered, not specified.) Assign each (object, class) a ClassWeight, CW ∈ Reals (could be <0). Classes "take the next ticket" as they're discovered (tickets are 1,2,...). Initially, all classes are empty; all CWs=0.

Do for next d, compute Ld = X o d, until, after masking off a new cluster, the count is too high (doesn't drop enough):

For the next PCI in Ld (next-larger, starting from the smallest):

If followed by a PCD, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over Ld^-1[PCI, PCD]. Mask off this ticketed new Class_k and continue.

If followed by a PCI, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over Ld^-1[(3PCI1+PCI2)/4, PCI2). Mask off this ticketed new Class_k and continue.

For the next-smaller PCI (starting from the largest) in Ld:

If preceded by a PCD, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over Ld^-1[PCD, PCI]. Mask off this ticketed new Class_k and continue.

If preceded by a PCI, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over Ld^-1(PCI2, (3PCI1+PCI2)/4]. Mask off this ticketed new Class_k and continue.

Machine Learning moves data up a concept hierarchy according to some criterion, which we call a similarity. A similarity is usually a function, s: X×X → OrdinalSet, s.t. ∀x∈X, s(x,x) ≥ s(x,y) ∀y∈X (every x must be at least as similar to itself as it is to any other object) and s(x,y)=s(y,x). OrdinalSet is usually a subset of {0,1,...} (e.g., binary {0,1}={No,Yes}).

Classification is binary clustering: s(x,y)=1 iff f(x)=f(y), using that part of f which is known to predict that which is unknown.

Pillar pk-means clusterer (k is not specified; it reveals itself.)

1a. Choose m1 maximizing D1=Dis(X,avgX).
1b. Check if m1 is an outlier with S_m1.
1c. Repeat until m1 is a non-outlier.

2a. Choose m2 maximizing D2=Dis(X,m1).
2b. Check if m2 is an outlier with S_m2.
2c. Repeat until m2 is a non-outlier.
2d. Compute M1,2 = P(m1 ≥ m2), M2,1 = P(m1 < m2).

3a. Choose m3 maximizing D3=D2+Dis(X,m2).
3b. Check if m3 is an outlier with S_m3.
3c. Repeat until m3 is a non-outlier.
3d. Compute Mi,3 = P(mi ≥ m3), M3,i = P(mi < m3), for i<3.

4a. Choose m4 maximizing D4=D3+Dis(X,m3).
4b. Check if m4 is an outlier with S_m4.
4c. Repeat until m4 is a non-outlier.
4d. Compute Mi,4 = P(mi ≥ m4), M4,i = P(mi < m4), for i<4.

... Do until MinDist(mh,mk), k<h, is < Threshold.

Mj = &(h≠j) Mj,h are the mask pTrees of the k first-round clusters.

Apply pk-means from here on.

[Figure: data set with pillar points m1, m2, m3, m4 marked.]
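The pillar selection steps (1a, 2a, 3a, ...) and the stopping rule can be sketched in Python as follows. This is an illustrative sketch only: it works on plain tuples rather than pTrees, it omits the outlier checks (steps b and c), and the function name, the `max_k` safety cap and the sample data are all assumptions.

```python
import math


def pillar_points(X, threshold, max_k=10):
    """Pick pillars: the first is farthest from the mean; each next one
    maximizes the accumulated distance to all earlier pillars. Stop when
    a candidate comes closer than `threshold` to an existing pillar."""
    n = len(X[0])
    mean = tuple(sum(x[j] for x in X) / len(X) for j in range(n))
    pillars = [max(X, key=lambda x: math.dist(x, mean))]   # step 1a
    acc = [0.0] * len(X)                                    # running D_h
    while len(pillars) < max_k:
        last = pillars[-1]
        acc = [a + math.dist(x, last) for a, x in zip(acc, X)]
        cand = X[max(range(len(X)), key=lambda i: acc[i])]
        if min(math.dist(cand, m) for m in pillars) < threshold:
            break                                           # MinDist < Threshold
        pillars.append(cand)
    return pillars


# Hypothetical 2-D data with three well separated groups.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 20), (1, 20)]
pillars = pillar_points(X, threshold=5)
print(pillars)
```

Accumulating distances (D3 = D2 + Dis(X,m2), and so on) pushes each new pillar away from all earlier ones, so one pillar tends to land in each group before the threshold test stops the loop.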

FAUST Oblique Analytics: X(X1..Xn) ⊆ R^n, |X|=N; Classes={C1..CK}; d=(d1..dn), |d|=1; p=(p1..pn) ∈ R^n. Functionals:

FAUST CountChange clusterer: If DensThres unrealized, cut C at PCCs ∈ L_d,p & C with next (d,p) ∈ dpSet.

FAUST TopKOutliers: Use D2NN = SqDist(x, X') = rank2 S_x for the TopKOutlier-slider.

FAUST Linear classifier: y ∈ C_k iff y ∈ LH_k ≡ {z | Lmin_d,p,k ≤ (z-p) o d ≤ Lmax_d,p,k ∀(d,p) ∈ dpSet}. LH_k is a hull around C_k. dpSet is a set of (d,p) pairs, e.g., (Diag, DiagStartPt).

Rk_i Ptr(x, Ptr Rank_i S_x). Rk_i SD(x, rank_i S^2), ordered desc on rank_i S_x as constructed.

Pre-compute what? 1. col stats (min, avg, max, std, ...); 2. X o X; X o p, p=class_Avg/Med; 3. X o d, d=interclass_Avg/Med_UnitVector; 4. X o x, d2(X,x), Rk_i d2(X,x), ∀x∈X, i=2..; 5. L_d,p and R_d,p ∀(d,p) ∈ dpSet.

FAUST LinearSphericalRadial classifier: y ∈ C_k iff y ∈ LSRH_k ≡ {z | Tmin_d,p,k ≤ T_d,p(z) ≤ Tmax_d,p,k ∀(d,p) ∈ dpSet, T = L|S|R}.

L_d,p ≡ (X-p) o d = X o d - p o d = L_d - p o d;  Lmin_d,p,k = min(L_d,p & C_k);  Lmax_d,p,k = max(L_d,p & C_k)

S_p ≡ (X-p) o (X-p) = X o X + X o (-2p) + p o p;  Smin_p,k = min(S_p & C_k);  Smax_p,k = max(S_p & C_k)

R_d,p ≡ S_p - (L_d,p)^2 = (X-p) o (X-p) - [(X-p) o d]^2
      = X o X + X o (-2p) + p o p - [(X o d)^2 - 2(p o d)(X o d) + (p o d)^2]
      = L_(-2p + 2(p o d)d) + p o p - (p o d)^2 + X o X - (L_d)^2
Rmin_d,p,k = min(R_d,p & C_k);  Rmax_d,p,k = max(R_d,p & C_k)

[Figure: points p and x and direction d. (x-p) o (x-p) is the squared distance from p to x; (x-p) o d = |x-p| cos θ is the projection onto the d-line; (x-p) o (x-p) - ((x-p) o d)^2 is the squared radial distance of x from the d-line.]

FAUST Oblique LSR Classification IRIS150

Class averages: AvgS = (50, 34, 15, 2); AvgE = (59, 28, 43, 13); AvgI = (66, 30, 55, 20).

L_d ranges with p=origin (low, high per class):
d      S         E          I
1000   43, 59    49, 71     49, 80
0100   23, 44    20, 34.1   22, 38
0010   10, 19    30, 51     18, 69
0001   1, 6      10, 18     14, 26

[Figure: the remaining per-interval numeric annotations on this slide (for p=origin, p=AvgS, p=AvgE and p=AvgI) are scrambled in the transcript and are not reproduced.]

LSR IRIS150. Consider all 3 functionals, L, S and R. What's the most efficient way to calculate all 3?

o=origin; p ∈ R^n; d ∈ R^n, |d|=1; {C_k}, k=1..K, are the classes. An operation enclosed in a parallelogram means it is a pTree op, not a scalar operation (on just numeric operands).

L_p,d ≡ (X - p) o d = L_o,d - [p o d]
  minL_p,d,k = min[L_p,d & C_k] = [minL_o,d,k] - p o d = min(X o d & C_k) - p o d  OR  = min(X & C_k) o d - p o d
  maxL_p,d,k = max[L_p,d & C_k] = [maxL_o,d,k] - p o d = max(X o d & C_k) - p o d  OR  = max(X & C_k) o d - p o d

S_p = (X - p) o (X - p) = -2 X o p + S_o + p o p = L_o,-2p + (S_o + p o p)
  minS_p,k = min[S_p & C_k] = min[(X o (-2p) + X o X) & C_k] + p o p
  maxS_p,k = max[S_p & C_k] = max[(X o (-2p) + X o X) & C_k] + p o p

R_p,d ≡ S_p - (L_p,d)^2
  minR_p,d,k = min[R_p,d & C_k];  maxR_p,d,k = max[R_p,d & C_k]

I suggest that we use each of the functionals with each of the pairs (p,d) that we select for application (since, to get R, we need to compute L and S anyway). So it would make sense to develop an optimal (minimum work and time) procedure to create L, S and R for any (p,d) in the set.
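Since R is built from L and S, all three functionals can come out of one pass over the data. The following Python sketch shows that dependency on ordinary row tuples; it is not a pTree implementation, and the function names, sample rows and class labels are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lsr(X, p, d):
    """Return per-row L, S and R values for one (p, d) pair.
    d is assumed to be a unit vector."""
    L, S, R = [], [], []
    for x in X:
        diff = [xi - pi for xi, pi in zip(x, p)]
        l = dot(diff, d)        # linear: projection onto the d-line
        s = dot(diff, diff)     # spherical: squared distance to p
        L.append(l)
        S.append(s)
        R.append(s - l * l)     # radial: squared distance to the d-line
    return L, S, R


def class_ranges(values, labels):
    """min/max of a functional restricted to each class (the '& C_k' step)."""
    out = {}
    for v, k in zip(values, labels):
        lo, hi = out.get(k, (v, v))
        out[k] = (min(lo, v), max(hi, v))
    return out


X = [(1, 1), (3, 0), (2, 1), (5, 4)]   # hypothetical rows
labels = ["a", "a", "b", "b"]          # hypothetical classes
L, S, R = lsr(X, p=(1, 0), d=(1, 0))
ranges = class_ranges(L, labels)
print(L, S, R, ranges)
```

The per-class (min, max) pairs returned by `class_ranges` play the role of Lmin/Lmax (or Smin/Smax, Rmin/Rmax) in the hull definitions.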

APPENDIX

LSR on IRIS150

Dse = (9, -6, 27, 10)
       L      H
  S    495    802
  I    1061   2725
  E    1270   2010
y isa OTHER if y o Dse ∈ (-∞,495) ∪ (802,1061) ∪ (2725,∞)
y isa OTHER or S if y o Dse ∈ C1,1 ≡ [495, 802]
y isa OTHER or I if y o Dse ∈ C1,2 ≡ [1061, 1270]
y isa OTHER or E or I if y o Dse ∈ C1,3 ≡ [1270, 2010]
y isa OTHER or I if y o Dse ∈ C1,4 ≡ [2010, 2725]

C1,3: 0 s, 49 e, 11 i

Dei = (-3, -2, 3, 3)
       L      H
  E    -117   -44
  I    -62    -3
y isa O if y o Dei ∈ (-∞,-117) ∪ (-3,∞)
y isa O or E or I if y o Dei ∈ C2,1 ≡ [-62, -44]
y isa O or I if y o Dei ∈ C2,2 ≡ [-44, -3]

C2,1: 2 e, 4 i

Dei = (6, -2, 3, 1)
       L      H
  E    420    459
  I    480    501
y isa O if y o Dei ∈ (-∞,420) ∪ (459,480) ∪ (501,∞)
y isa O or E if y o Dei ∈ C3,1 ≡ [420, 459]
y isa O or I if y o Dei ∈ C3,2 ≡ [480, 501]

Continue this on clusters with OTHER + one class, so the hull fits tightly (reducing false positives), using diagonals?

For each cluster below, L and H are the low and high cut points of y o D: y isa O if y o D ∈ (-∞,L) ∪ (H,∞), and y isa O|S if y o D falls in the next cluster, [L,H].

Cluster  D       L    H     Next cluster
C1,1     1000    43   58    C2,3 [43,58]
C2,3     0100    23   44    C3,3 [23,44]
C3,3     0010    10   19    C4,1 [10,19]
C4,1     0001    1    6     C5,1 [1,6]
C5,1     1100    68   117   C6,1 [68,117]
C6,1     1010    54   146   C7,1 [54,146]
C7,1     1001    44   100   C8,1 [44,100]
C8,1     0110    36   105   C9,1 [36,105]
C9,1     0101    26   61    Ca,1 [26,61]
Ca,1     0011    12   91    Cb,1 [12,91]
Cb,1     1110    81   182   Cc,1 [81,182]
Cc,1     1101    71   137   Cd,1 [71,137]
Cd,1     1011    55   169   Ce,1 [55,169]
Ce,1     0111    39   127   Cf,1 [39,127]
Cf,1     1111    84   204   Cg,1 [84,204]
Cg,1     1-100   10   22    Ch,1 [10,22]
Ch,1     10-10   3    46    Ci,1 [3,46]

The amount of work yet to be done, even for only 4 attributes, is immense. For each D, we should fit boundaries for each class, not just one class.

For every D, not only cut at minCoD and maxCoD but also limit the radial reach for each class (barrel analytics)? Note, limiting the radial reach limits all other directions [other than the D direction] in one step and therefore by the same amount (i.e., it limits all directions assuming perfectly round clusters). Think about Enron: some words (columns) have high counts and others have low counts. Our radial reach threshold would be based on the highest count and would therefore admit many false positives. Can we cluster directions (words) by count and limit radial reach differently for different clusters?

For 4 attributes, I count 77 diagonals*3 classes = 231 cases. How many in the Enron email case with 10,000 columns? Too many for sure!!

Dot Product SPTS computation: X o D = Σ(k=1..n) X_k D_k

/* Calc P_XoD,i after P_XoD,i-1. CarrySet=CAR_i-1,i; RawSet=RS_i */
INPUT: CAR_i-1,i, RS_i
ROUTINE: P_XoD,i = RS_i ⊕ CAR_i-1,i;  CAR_i,i+1 = RS_i & CAR_i-1,i
OUTPUT: P_XoD,i, CAR_i,i+1

With 2-bit columns, X_k = 2^1 p_k,1 + 2^0 p_k,0 and D_k = 2^1 D_k,1 + 2^0 D_k,0, so for n=2:

X o D = 2^2 (D_1,1 p_1,1 + D_2,1 p_2,1) + 2^1 (D_1,1 p_1,0 + D_1,0 p_1,1 + D_2,1 p_2,0 + D_2,0 p_2,1) + 2^0 (D_1,0 p_1,0 + D_2,0 p_2,0)

Example: D = (3,3), so D_1,1 = D_1,0 = D_2,1 = D_2,0 = 1.

Row  X1  X2  p11 p10  p21 p20  XoD  p_XoD,3 p_XoD,2 p_XoD,1 p_XoD,0
1    1   1   0   1    0   1    6    0       1       1       0
2    3   0   1   1    0   0    9    1       0       0       1
3    2   1   1   0    0   1    9    1       0       0       1

Different data, same D = (3,3):

Row  X1  X2  p11 p10  p21 p20  XoD  p_XoD,4 p_XoD,3 p_XoD,2 p_XoD,1 p_XoD,0
1    1   1   0   1    0   1    6    0       0       1       1       0
2    3   3   1   1    1   1    18   1       0       0       1       0
3    2   1   1   0    0   1    9    0       1       0       0       1

[Figure: for both examples the slide steps through the raw sets RS_i and carries CAR_i,i+1, bit slice by bit slice.]

We have extended the Galois field, GF(2)={0,1}, XOR=add, AND=mult, to pTrees.
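A minimal Python sketch of this idea: each bit slice is stored as an integer whose bit r belongs to row r, and the dot product is built by shift-and-add using only XOR and AND per slice. The layout, the function names and the restriction to non-negative integer D are assumptions for illustration; the slide's own routine processes raw sets and carry sets slice by slice.

```python
def to_slices(column, width):
    """Vertical bit slices of an integer column, least significant first.
    Bit r of slices[i] is bit i of column[r]."""
    return [sum(((v >> i) & 1) << r for r, v in enumerate(column))
            for i in range(width)]


def from_slices(slices, n_rows):
    """Rebuild the integer column from its bit slices."""
    return [sum(((s >> r) & 1) << i for i, s in enumerate(slices))
            for r in range(n_rows)]


def add_spts(a, b):
    """Add two bit-sliced columns with a ripple carry made of XOR and AND."""
    width = max(len(a), len(b))
    a = a + [0] * (width - len(a))
    b = b + [0] * (width - len(b))
    out, carry = [], 0
    for ai, bi in zip(a, b):
        out.append(ai ^ bi ^ carry)
        # the two carry terms are never both 1, so XOR acts as OR here
        carry = (ai & bi) ^ (carry & (ai ^ bi))
    out.append(carry)
    return out


def dot_spts(columns, D, width):
    """X o D by shift-and-add: for every 1-bit j of D_k, add column k
    shifted up by j slices. D_k are assumed non-negative integers."""
    total = [0]
    for col, dk in zip(columns, D):
        s = to_slices(col, width)
        j = 0
        while dk >> j:
            if (dk >> j) & 1:
                total = add_spts(total, [0] * j + s)
            j += 1
    return total


X1, X2 = [1, 3, 2], [1, 0, 1]
result = from_slices(dot_spts([X1, X2], D=(3, 3), width=2), n_rows=3)
print(result)
```

Every operation touches whole slices at once, so the cost grows with the bit width and the number of 1-bits in D rather than with the number of rows handled per machine word.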

SPTS multiplication: (Note, pTree multiplication = &)

X1*X2 = (2^1 p_1,1 + 2^0 p_1,0)(2^1 p_2,1 + 2^0 p_2,0) = 2^2 p_1,1 p_2,1 + 2^1 (p_1,1 p_2,0 + p_2,1 p_1,0) + 2^0 p_1,0 p_2,0

Row  X1  X2  p11 p10  p21 p20  X1*X2  p_X1*X2,3 p_X1*X2,2 p_X1*X2,1 p_X1*X2,0
1    1   1   0   1    0   1    1      0         0         0         1
2    3   3   1   1    1   1    9      1         0         0         1
3    2   1   1   0    0   1    2      0         0         1         0

[Figure: the slide steps through the ANDs and carries that produce each result bit slice.]

Rank_N-1(X o D) = Rank_2(X o D)

RankK: p is what's left of K yet to be counted, initially p=K. V is the RankK value, initially 0.
For i = bitwidth-1 down to 0: if Count(P & p_i) ≥ p { V = V + 2^i; P = P & p_i } else /* < p */ { p = p - Count(P & p_i); P = P & p'_i }

The examples use X1 = (1, 3, 2), X2 = (1, 0, 1), N=3, K=2. Bit slices are written row 1, row 2, row 3.

D = x1 = (1,1): X o D = (2, 3, 3); p1 = 111, p0 = 011.
  n=1, p=2: Count(P & p1) = 3 ≥ 2, so V = 1*2^1 and P = P & p1 = 111.
  n=0, p=2: Count(P & p0) = 2 ≥ 2, so V = 1*2^1 + 1*2^0 = 3 and P = p0 & P = 011.
  So -2 x1 o X = -6.

D = x2 = (3,0): X o D = (3, 9, 2); p3 = 010, p2 = 000, p1 = 101, p0 = 110.
  n=3, p=2: Count(P & p3) = 1 < 2, so p = 2-1 = 1 and P = P & p'3 = 101. V = 0*2^3.
  n=2, p=1: Count(P & p2) = 0 < 1, so p = 1-0 = 1 and P = p'2 & P = 101. V = 0*2^3 + 0*2^2.
  n=1, p=1: Count(P & p1) = 2 ≥ 1, so V = 0*2^3 + 0*2^2 + 1*2^1 and P = p1 & P = 101.
  n=0, p=1: Count(P & p0) = 1 ≥ 1, so V = 0*2^3 + 0*2^2 + 1*2^1 + 1*2^0 = 3 and P = p0 & P = 100.
  So -2 x2 o X = -6.

D = x3 = (2,1): X o D = (3, 6, 5); p2 = 011, p1 = 110, p0 = 101.
  n=2, p=2: Count(P & p2) = 2 ≥ 2, so V = 1*2^2 and P = P & p2 = 011.
  n=1, p=2: Count(P & p1) = 1 < 2, so p = 2-1 = 1 and P = p'1 & P = 001. V = 1*2^2 + 0*2^1.
  n=0, p=1: Count(P & p0) = 1 ≥ 1, so V = 1*2^2 + 0*2^1 + 1*2^0 = 5 and P = p0 & P = 001.
  So -2 x3 o X = -10.

Example: FAUST Oblique: X o D is used in CCC, TKO, PLC and LARC, and (x-X) o (x-X) = -2 X o x + x o x + X o X is used in TKO.

So in FAUST, we need to construct lots of SPTSs of the type X dotted with a fixed vector, a costly pTree calculation. (Note that X o X is costly too, but it is a 1-time calculation (a pre-calculation?). x o x is calculated for each individual x, but it's a scalar calculation and just a read-off of a row of X o X, once X o X is calculated.) Thus, we should optimize the living he__ out of the X o D calculation!!! The methods on the previous slides seem efficient. Is there a better method? Then for TKO we need to compute ranks:

pTree Rank(K) computation: (Rank(N-1) gives the 2nd smallest, which is very useful in outlier analysis?)

X    P4,3  P4,2  P4,1  P4,0
10   1     0     1     0
5    0     1     0     1
6    0     1     1     0
7    0     1     1     1
11   1     0     1     1
9    1     0     0     1
3    0     0     1     1

RankKval=0; p=K; c=0; P=Pure1; /* Note: n=bitwidth-1. The RankK points are returned as the resulting pTree, P */
For i=n to 0 { c=Count(P&Pi); if (c>=p) { RankKval=RankKval+2^i; P=P&Pi } else { p=p-c; P=P&P'i } };
return RankKval, P;
/* Above, K=7-1=6 (looking for the Rank6 or 6th highest value, which is also the 2nd lowest value) */

Cross out the 0-positions of P each step.

(n=3) c=Count(P&P4,3)=3 < 6, so p=6-3=3; P=P&P'4,3 masks off the highest 3 (val ≥ 8). Result bit: {0}
(n=2) c=Count(P&P4,2)=3 >= 3, so P=P&P4,2 masks off the lowest 1 (val < 4). Result bit: {1}
(n=1) c=Count(P&P4,1)=2 < 3, so p=3-2=1; P=P&P'4,1 masks off the highest 2 (val ≥ 6). Result bit: {0}
(n=0) c=Count(P&P4,0)=1 >= 1, so P=P&P4,0. Result bit: {1}

RankKval = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5. P = MapRankKPts = 0100000; ListRankKPts = {2}.
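The RankK loop can be checked with a short Python sketch. Bit slices are modeled as integers (bit r = row r+1) and Count as a population count; this representation and the function name `rank_k` are assumptions for illustration, not the pTree library.

```python
def rank_k(values, k, bitwidth):
    """Return (k-th largest value, mask of the rows holding it) using
    only bit-slice AND, complement and 1-counts."""
    n_rows = len(values)
    pure1 = (1 << n_rows) - 1
    slices = [sum(((v >> i) & 1) << r for r, v in enumerate(values))
              for i in range(bitwidth)]
    P, p, rank_val = pure1, k, 0
    for i in range(bitwidth - 1, -1, -1):
        hit = P & slices[i]
        c = bin(hit).count("1")
        if c >= p:                    # the k-th largest has bit i set
            rank_val += 1 << i
            P = hit
        else:                         # skip the c larger values, clear bit i
            p -= c
            P &= ~slices[i] & pure1
    return rank_val, P


X = [10, 5, 6, 7, 11, 9, 3]
val, mask = rank_k(X, k=6, bitwidth=4)
rows = [r + 1 for r in range(len(X)) if (mask >> r) & 1]
print(val, rows)
```

For the seven values above with K=6, the sketch returns the value 5 held by row 2, matching the worked steps.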

Rank_N-1(X o D) = Rank_2(X o D) with D = (3,3), X1 = (1, 3, 2), X2 = (1, 0, 1): X o D = (6, 9, 9); p3 = 011, p2 = 100, p1 = 100, p0 = 011.
  n=3, p=2: Count(P & p3) = 2 ≥ 2, so V = 1*2^3 and P = P & p3 = 011.
  n=2, p=2: Count(P & p2) = 0 < 2, so p = 2-0 = 2 and P = p'2 & P = 011. V = 1*2^3 + 0*2^2.
  n=1, p=2: Count(P & p1) = 0 < 2, so p = 2-0 = 2 and P = p'1 & P = 011. V = 1*2^3 + 0*2^2 + 0*2^1.
  n=0, p=2: Count(P & p0) = 2 ≥ 2, so V = 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0 = 9 and P = p0 & P = 011.

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 05/64 [0,64)

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 110/64 [64,128)

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

Y y1 y2y1 1 1y2 3 1y3 2 2y4 3 3y5 6 2y6 9 3y7 15 1y8 14 2y9 15 3ya 13 4pb 10 9yc 11 10yd 9 11ye 11 11yf 7 8

yofM 11 27 23 34 53 80118114125114110121109125 83

p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

p2 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0

p1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1

p0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1

p6' 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

p2' 1 1 0 1 0 1 0 1 0 1 0 1 0 0 1

p1' 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0

p0' 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

0[0,8)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

1[8,16)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

1[16,24)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

1[24,32)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

1[32,40)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

0[40,48)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

1[48,56)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

0[56,64)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

2[80,88)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

0[88,96)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

0[96,104)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

2[194,112)

p3' 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1

3[112,120)

p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0

3[120,128)

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

1/16[0,16)

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

2/16[16,32)

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

1[32,48)

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

1[48,64)

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

0[64,80)

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

2[80,96)

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

2[96,112)

p4' 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

6[112,128)

p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

3/32[0,32)

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

2/32[64,96)

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5' 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

2/32[32,64)

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

8/32[96,128)

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

f=

UDR Univariate Distribution Revealer (on Spaeth:)

Pre-compute and enter into the ToC all DT(Y_k), plus those for selected Linear Functionals (e.g., d=main diagonals, ModeVector).

Suggestion: In our pTree-base, every pTree (basic, mask,...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCts should be repeated everywhere (e.g., in every DT). The reason is that these OneCts help us in selecting the pertinent pTrees to access, and in fact are often all we need to know about the pTree to get the answers we are after.

DT(S) 1-counts by level (deepest level first):
0 1 1 1 1 0 1 0 0 0 2 0 0 2 3 3
1 2 1 1 0 2 2 6
3 2 2 8
5 10

UDR applied to S, a column of numbers in bitslice format (an SpTS), will produce the DistributionTree of S, DT(S).

depthDT(S) ≡ b ≡ BitWidth(S); h = depth of a node; k = node offset.
Node_{h,k} has a ptr to pTree{x∈S | F(x) ∈ [k·2^{b-h+1}, (k+1)·2^{b-h+1})} and its 1count.

Root: 15 (depth=h=0); next level: depth=h=1; e.g., node_{2,3} covers [96,128).
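The distribution tree above can be reproduced with a short sketch. It is only an illustration under stated assumptions: bit slices are plain Python lists rather than compressed pTrees, the function names `bit_slices` and `distribution_tree` are hypothetical, and the bit width is taken as 7 (values in [0,128)), so a node at depth h covers an interval of width 2^(7-h).

```python
# yofM values from the Spaeth example above
Y_O_M = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]

def bit_slices(values, width):
    """slices[b][i] is bit b of values[i] (uncompressed stand-in for a pTree)."""
    return [[(v >> b) & 1 for v in values] for b in range(width)]

def distribution_tree(values, width, max_depth):
    """Return, per depth h, a list of (lo, hi, one_count) for each node k."""
    slices = bit_slices(values, width)
    n = len(values)
    tree = []
    for h in range(max_depth + 1):
        level = []
        for k in range(2 ** h):
            mask = [1] * n
            for j in range(h):
                b = width - 1 - j              # high-order bit used at step j
                want = (k >> (h - 1 - j)) & 1  # bit of k selects p_b or p_b'
                mask = [m & (s if want else 1 - s)
                        for m, s in zip(mask, slices[b])]
            lo = k * 2 ** (width - h)
            level.append((lo, lo + 2 ** (width - h), sum(mask)))
        tree.append(level)
    return tree

tree = distribution_tree(Y_O_M, 7, 4)
counts = [[c for _, _, c in level] for level in tree]
```

Each node mask is an AND of bit slices or their complements, so only the 1-counts are needed to read off the distribution.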

So let us look at ways of doing the work to calculate XoD. As we recall from the below, the task is to ADD bitslices, giving a result bitslice and a set of carry bitslices to carry forward.

XoD = Σ_{k=1..n} X_k*D_k

Example: D = (3, 3), so D_{1,1} D_{1,0} = 1 1 and D_{2,1} D_{2,0} = 1 1.

X_1 = (1, 3, 2), X_2 = (1, 0, 1), with pTrees
p_{1,1} = 0 1 1    p_{1,0} = 1 1 0    p_{2,1} = 0 0 0    p_{2,0} = 1 0 1

XoD = 2^2 (1·p_{1,1} + 1·p_{2,1}) + 2^1 (1·p_{1,0} + 1·p_{1,1} + 1·p_{2,0} + 1·p_{2,1}) + 2^0 (1·p_{1,0} + 1·p_{2,0})

I believe we add by successive XORs and the carry set is the raw set with one 1-bit turned off iff the sum at that bit is a 1-bit. Or we can characterize the carry as the raw set minus the result (always carry forward a set of pTrees plus one negative one).

We want a routine that constructs the result pTree from a positive set of pTrees plus a negative set always consisting of 1 pTree. The routine is: successive XORs across the positive set, then XOR with the negative set pTree (because the successive pset XOR gives us the odd values, and if you subtract one pTree, the 1-bits of it change odd to even and vice versa):

/* For P_{XoD,i} (after P_{XoD,i-1}). CarrySetPos=CSP_{i-1,i}  CarrySetNeg=CSN_{i-1,i}  RawSet=RS_i  CSP_{-1}=CSN_{-1}=∅ */

INPUT: CSP_{i-1}, CSN_{i-1}, RS_i

ROUTINE: P_{XoD,i} = RS_i CSP_{i-1,i} CSN_{i-1,i};  CSN_{i,i+1} = CSN_{i-1,i} P_{XoD,i};  CSP_{i,i+1} = CSP_{i-1,i} RS_{i-1};

OUTPUT: P_{XoD,i}, CSN_{i,i+1}, CSP_{i,i+1}

111100

110011

RSRS00

001111

==

669999

XXooDD

PPXoD,0XoD,0

CSPCSP-1,0-1,0=CSN=CSN-1,0-1,0==RSRS11

CSNCSN0,10,1== CSNCSN-1.0-1.0PPXoD,0XoD,0

000000

==

PPXoD,1XoD,1

111100

001111

110011

001111

110000

110000

001111

001111

110011

000000

001111

111100

110011

CSPCSP0,10,1== CSPCSP-1,0-1,0RSRS00

110011

000000

110000

111111

001111

XoD = Σ_{k=1..n} X_k*D_k
= 2^{2B}   Σ_{k=1..n} ( D_{k,B} p_{k,B} )
+ 2^{2B-1} Σ_{k=1..n} ( D_{k,B} p_{k,B-1} + D_{k,B-1} p_{k,B} )
+ 2^{2B-2} Σ_{k=1..n} ( D_{k,B} p_{k,B-2} + D_{k,B-1} p_{k,B-1} + D_{k,B-2} p_{k,B} )
+ 2^{2B-3} Σ_{k=1..n} ( D_{k,B} p_{k,B-3} + D_{k,B-1} p_{k,B-2} + D_{k,B-2} p_{k,B-1} + D_{k,B-3} p_{k,B} )
. . .
+ 2^3 Σ_{k=1..n} ( D_{k,3} p_{k,0} + D_{k,2} p_{k,1} + D_{k,1} p_{k,2} + D_{k,0} p_{k,3} )
+ 2^2 Σ_{k=1..n} ( D_{k,2} p_{k,0} + D_{k,1} p_{k,1} + D_{k,0} p_{k,2} )
+ 2^1 Σ_{k=1..n} ( D_{k,1} p_{k,0} + D_{k,0} p_{k,1} )
+ 2^0 Σ_{k=1..n} ( D_{k,0} p_{k,0} )

XXooD=D=k=1,2k=1,2XXkk*D*Dk k with pTrees: qwith pTrees: qNN..q..q00, ,

N=2N=22B+roof(log2B+roof(log22n)+2B+1n)+2B+1k=1..2k=1..2 ( (= 2= 222

+ 2+ 211 DDk,1k,1 p pk,0k,0 + D+ Dk,0k,0 p pk,1k,1

+ 2+ 200 DDk,0k,0 p pk,0k,0

DDk,1k,1 p pk,1k,1

k=1..2k=1..2 ( (

k=1..2k=1..2 ( (

113322

110011

XX pTreespTrees001111

111100

110011

000000

1 2 1 2 DD DD1,11,1 DD1,01,0

0 10 1DD2,12,1 DD2,02,01 01 0B=1B=1

((= 2= 222 + 2+ 211 DD1,11,1 pp1,01,0 + D+ D1,01,0 pp1111 + 2+ 200 DD1,01,0 pp1,01,0DD1,11,1 pp1,11,1 (( ((+ D+ D2,12,1 pp2,1 2,1 )) + D+ D2,12,1 pp2,02,0 + D+ D2,02,0 pp2,12,1 )) + D+ D2,02,0 pp2,02,0 ))

((= 2= 222 + 2+ 211 DD1,11,1 p p1,01,0 ++ DD1,01,0 p p1111 + 2+ 200 DD1,01,0 p p1,01,0DD1,11,1 p p1,11,1 (( ((+ D+ D2,12,1 p p2,1 2,1 )) + + DD2,12,1 p p2,02,0 ++ DD2,02,0 p p2,12,1 )) + D+ D2,02,0 p p2,02,0 ))000000

001111

110011

111100

qq0 0 = p= p1,0 1,0 = = no carryno carry111100

qq11= = carrycarry11= =

111100

000011

qq22=carry=carry11= = no carryno carry000011

3 3 3 3 DD DD1,11,1 DD1,01,0

1 11 1DD2,12,1 DD2,02,01 11 1

qq0 0 = = carrycarry00==001111

110000

((= 2= 222 + 2+ 211 1 p1 p1,01,0 + 1 p+ 1 p1111 + 2+ 200 1 p1 p1,01,01 p1 p1,11,1 (( ((+ 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,02,0 + 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,0 2,0 ))001111

111100

001111

111100

110011

000000

110011

qq11=carry=carry00+raw+raw11= = carrycarry11==111111

221111

A carryTree is a valueTree or vTree, as is the rawTree at each level (rawTree = valueTree before carry is included). In what form is it best to carry the carryTree over (for speediest processing)?
1. Multiple pTrees added at next level? (since the pTrees at the next level are in that form and need to be added)
2. carryTree as a SPTS, s_1? (next level rawTree = SPTS, s_2, then s_{10} & s_{20} = q_{next_level} and carry_{next_level}?)

qq22=carry=carry11+raw+raw22= = carrycarry22==111111

111111

qq33=carry=carry22 = = carrycarry33==111111

CCC Clusterer: If DT (and/or DUT) not exceeded at C, partition C further by cutting at each gap and PCC in CoD.

For a table X(X_1...X_n), the SPTS X_k*D_k is the column of numbers x_k*D_k. XoD is the sum of those SPTSs, Σ_{k=1..n} X_k*D_k.

X_k*D_k = D_k Σ_b 2^b p_{k,b} = 2^B D_k p_{k,B} +..+ 2^0 D_k p_{k,0}
= D_k (2^B p_{k,B} +..+ 2^0 p_{k,0}) = (2^B p_{k,B} +..+ 2^0 p_{k,0}) (2^B D_{k,B} +..+ 2^0 D_{k,0})
= 2^{2B} (D_{k,B} p_{k,B}) + 2^{2B-1} (D_{k,B-1} p_{k,B} + D_{k,B} p_{k,B-1}) +..+ 2^0 D_{k,0} p_{k,0}

So, DotProduct involves just multi-operand pTree addition (no SPTSs and no multiplications). Engineering shortcut tricks would be huge!!!
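As a rough illustration of "dot product by pTree addition only", the sketch below adds bit slices with a standard ripple-carry full adder built from AND/OR/XOR. This is a swapped-in textbook technique, not the CSP/CSN carry-set routine discussed in these slides. The helper names are hypothetical, each bit slice is an uncompressed Python int (bit r belongs to row r), and D is assumed to have non-negative integer components.

```python
def to_slices(column, width):
    """Vertical bit slices: slices[b] has bit r set iff bit b of column[r] is 1."""
    return [sum(((v >> b) & 1) << r for r, v in enumerate(column))
            for b in range(width)]

def add_slices(a, b):
    """Ripple-carry add of two bit-sliced columns using only AND/OR/XOR."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        out.append(x ^ y ^ carry)               # sum bit = parity
        carry = (x & y) | (carry & (x ^ y))     # carry to the next slice
    out.append(carry)
    return out

def dot_product_slices(columns, d, width):
    """Bit slices of XoD; multiplying by D_k is done by shifted additions."""
    total = []
    for col, dk in zip(columns, d):
        slices = to_slices(col, width)
        j = 0
        while dk >> j:
            if (dk >> j) & 1:                   # D_{k,j} = 1: add X_k shifted by j
                total = add_slices(total, [0] * j + slices)
            j += 1
    return total

def from_slices(slices, n_rows):
    """Decode bit slices back to a column of integers."""
    return [sum(((s >> r) & 1) << b for b, s in enumerate(slices))
            for r in range(n_rows)]

X1, X2 = [1, 3, 2], [1, 0, 1]                   # the small example table X
xod_33 = from_slices(dot_product_slices([X1, X2], [3, 3], 2), 3)
xod_12 = from_slices(dot_product_slices([X1, X2], [1, 2], 2), 3)
```

No row value is ever multiplied; every step is a bitwise operation over whole slices, which is the property the slide is after.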

Question: Which primitives are needed and how do we compute them?

X(X_1...X_n): D2NN yields a 1.a-type outlier detector (top k objects, x, dissimilarity from X-{x}). D2NN = each min[D2NN(x)]

(x-X)o(x-X) = Σ_{k=1..n} (x_k - X_k)(x_k - X_k)
= Σ_{k=1..n} ( Σ_{b=B..0} (2^b x_{k,b} - 2^b p_{k,b}) ) ( Σ_{b=B..0} (2^b x_{k,b} - 2^b p_{k,b}) )
= Σ_{k=1..n} ( Σ_{b=B..0} 2^b (x_{k,b} - p_{k,b}) ) ( Σ_{b=B..0} 2^b (x_{k,b} - p_{k,b}) ),   where a_{k,b} ≡ x_{k,b} - p_{k,b}
= Σ_k (2^B a_{k,B} + 2^{B-1} a_{k,B-1} +..+ 2^1 a_{k,1} + 2^0 a_{k,0}) (2^B a_{k,B} + 2^{B-1} a_{k,B-1} +..+ 2^1 a_{k,1} + 2^0 a_{k,0})
= Σ_k ( 2^{2B} a_{k,B} a_{k,B}
  + 2^{2B-1} ( a_{k,B} a_{k,B-1} + a_{k,B-1} a_{k,B} )      { = 2^{2B} a_{k,B} a_{k,B-1} }
  + 2^{2B-2} ( a_{k,B} a_{k,B-2} + a_{k,B-1} a_{k,B-1} + a_{k,B-2} a_{k,B} )      { = 2^{2B-1} a_{k,B} a_{k,B-2} + 2^{2B-2} a_{k,B-1}^2 }
  + 2^{2B-3} ( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} + a_{k,B-2} a_{k,B-1} + a_{k,B-3} a_{k,B} )      { = 2^{2B-2} ( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} ) }
  + 2^{2B-4} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} + a_{k,B-2} a_{k,B-2} + a_{k,B-3} a_{k,B-1} + a_{k,B-4} a_{k,B} ) ...      { = 2^{2B-3} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} ) + 2^{2B-4} a_{k,B-2}^2 } )
= Σ_k ( 2^{2B} ( a_{k,B}^2 + a_{k,B} a_{k,B-1} )
  + 2^{2B-1} ( a_{k,B} a_{k,B-2} )
  + 2^{2B-2} ( a_{k,B-1}^2 + a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} )
  + 2^{2B-3} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} )
  + 2^{2B-4} a_{k,B-2}^2 ... )

D2NN = multi-op pTree adds? When x_{k,b}=1, a_{k,b} = p'_{k,b}, and when x_{k,b}=0, a_{k,b} = -p_{k,b}.

So D2NN is just multi-op pTree mults/adds/subtrs?

Each D2NN row (each x∈X) is a separate calc.

Should we pre-compute all p_{k,i}*p_{k,j}, p'_{k,i}*p'_{k,j}, p_{k,i}*p'_{k,j}?

ANOTHER TRY! X(X_1...X_n): RKN (Rank K Nbr), K=|X|-1, yields a 1.a_outlier_detector (top y dissimilarity from X-{x}).

Install in RKN each RankK(D2NN(x)) (1-time construct, but for, e.g., 1 trillion x's? |X|=N=1T, slow. Parallelization?)

∀x∈X, the square distance from x to its neighbors (near and far) is the column of numbers (vTree or SPTS)
d^2(x,X) = (x-X)o(x-X) = Σ_{k=1..n} |x_k - X_k|^2 = Σ_{k=1..n} (x_k - X_k)(x_k - X_k) = Σ_{k=1..n} (x_k^2 - 2 x_k X_k + X_k^2)
= -2 Σ_k x_k X_k + Σ_k x_k^2 + Σ_k X_k^2
= -2 xoX + xox + XoX

XoX = Σ_{k=1..n} Σ_{i=B..0, j=B..0} 2^{i+j} p_{k,i} p_{k,j} = Σ_{i,j} 2^{i+j} Σ_k p_{k,i} p_{k,j}

1. Precompute pTree products within each k.

2. Calculate this sum one time (independent of the x).

3. Pick this from XoX for each x and add to 2.

5. Add 3 to this.

-2xoX cost is linear in |X|=N. xox cost is ~zero. XoX is 1-time, amortized over x∈X (i.e., =1/N), or precomputed. The addition cost, -2xoX + xox + XoX, is linear in |X|=N. So, overall, the cost is linear in |X|=N.
Data parallelization? No! (Need all of X at each site.) Code parallelization? Yes! (After replicating X to all sites, each site creates/saves D2NN for its partition of X, then sends the requested number(s) (e.g., RKN(x)) back.)
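The decomposition d^2(x,X) = -2xoX + xox + XoX can be checked with a small sketch. It uses plain Python lists of rows rather than pTrees, and the function names and the toy data are hypothetical; the point is only that XoX is computed once and reused for every x.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def precompute_XoX(X):
    """One-time column of squared lengths, reused for every x."""
    return [dot(r, r) for r in X]

def sq_dist_column(x, X, XoX):
    """d^2(x, X) = -2 xoX + xox + XoX, as a column over the rows of X."""
    xox = dot(x, x)
    return [-2 * dot(x, r) + xox + rr for r, rr in zip(X, XoX)]

def d2nn(X):
    """Square distance from each x to its nearest neighbor in X - {x}."""
    XoX = precompute_XoX(X)
    result = []
    for i, x in enumerate(X):
        col = sq_dist_column(x, X, XoX)
        result.append(min(v for j, v in enumerate(col) if j != i))
    return result

X = [(0, 0), (3, 4), (6, 8), (100, 100)]   # toy data, last row is an outlier
nearest = d2nn(X)
outlier_row = max(range(len(X)), key=lambda i: nearest[i])
```

Each call to `sq_dist_column` is one pass over X, which matches the linear-in-N cost per x stated above.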

LSR on IRIS150-3. Here we use the diagonals.

d=e1 p=AVGs, L=(X-p)od43 58 S 49 70 E 49 79 I

R(p,d,X)

S E I0

128 270 393

1558

3444

[43,49)S(16)

0

128

[49,58)E(24)I(6)

0 S(34)99

393

10961217

1825

[70,79]I(12)

2081

3444

[58,70)E(26)I(32)

270

792

1558

2567

Only overlap: L=[58,70), R[792,1557] (E(26), I(5)).
With just d=e1, we get good hulls using LARC:

While I_{p,d} contains >1 class, for next (d,p) create L(p,d) ≡ Xod - pod, R(p,d) ≡ XoX + pop - 2Xop - L^2

1. MnCls(L), MxCls(L): create a linear boundary.
2. MnCls(R), MxCls(R): create a radial boundary.
3. Use R&Ck to create intra-Ck radial boundaries.
Hk = {I | L_{p,d} includes Ck}
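A minimal sketch of the L and R functionals and the per-class min/max (the MnCls/MxCls numbers) follows. It assumes d is a unit vector, works on plain Python rows instead of pTrees, and uses hypothetical function names and toy data.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_L(X, p, d):
    """L(p,d) = Xod - pod for every row of X."""
    pod = dot(p, d)
    return [dot(x, d) - pod for x in X]

def radial_R(X, p, d):
    """R(p,d) = XoX + pop - 2Xop - L^2 (squared distance to the d-line, |d|=1)."""
    L = linear_L(X, p, d)
    pop = dot(p, p)
    return [dot(x, x) + pop - 2 * dot(x, p) - l * l for x, l in zip(X, L)]

def class_limits(values, labels):
    """Per-class (min, max) of a functional: the linear or radial cut points."""
    limits = {}
    for v, c in zip(values, labels):
        lo, hi = limits.get(c, (v, v))
        limits[c] = (min(lo, v), max(hi, v))
    return limits

X = [(3, 4), (1, 0), (5, 12)]          # toy rows
labels = ["a", "a", "b"]               # toy class labels
p, d = (0, 0), (1, 0)
L = linear_L(X, p, d)
R = radial_R(X, p, d)
L_limits = class_limits(L, labels)
R_limits = class_limits(R, labels)
```

The linear limits give the bookends along the d-line; the radial limits bound how far each class reaches away from that line.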

R & L

I(1)

I(42)

E(50) I(7)

49 49 (36,7) 63

70 (11)

d=e1 p=AS L=(X-p)od (-pod=-50.06)-7.06 7.94 S&L -1;06 19.94 E&L -1.06 28.94 I&L

-8,-216

[-2,8)34, 24, 6099 393 1096 1217 1825

[20,29]12

[8,20) wp=AvgE26, 321.9 51.878.6 633

<--E=6 I=4 p=AvgE

d=e1 p=AvgS, L=Xod43 58 S&L 49 70 E&L 49 79 I&L

Here we try using other p points for the R step (other than the one used for the L step).

d=e1 p=AS L=(X-p)od (-pod=-50.06)-7.06 7.94 S&L -1;06 19.94 E&L -1.06 28.94 I&L

-8,-216

[-2,8)34, 24, 6099 393 1096 1217 1825

[20,29]12

[8,20)26, 32 2707921558 2567

E=26 I=5p=AvgS

30ambigs, 5 errs

d=e1 p=AS L=(X-p)od (-pod=-50.06)-7.06 7.94 S&L -1;06 19.94 E&L -1.06 28.94 I&L

-8,-216

[-2,8)34, 24, 6099 393 1096 1217 1825

[20,29]12

[8,20) wp=AvgI26, 32 0.6234.9387.8 1369

<--E=25 I=10 p=AvgI

There is a best choice of p for the R step (p=AvgE) but how would we decide that ahead of time?

d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L

-2,4)50

[7,11) 28

[16,23]I=34

[11,16) 22, 16 127.5648.71554.7 2892

E=22I=7p=AvgS

d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L

-2,4)50

[7,11) 28

[16,23]I=34

[11,16) 22, 165.7 36.2151.06 611

E=17I=7p=AvgE

d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L

-2,4)50

[7,11) 28

[16,23]I=34[11,16)

22, 16 127.51555 2892

E=22I=8p=AvgI

For e4, the best choice of p for the R step is also p=AvgE.(There are mistakes in this column on the previous slide!)

LSR on IRIS150

Dse 9 -6 27 10; xoDes: -184 123 S, 590 1331 E, 381 2046 I

y isa O if yoD ∈ (-∞,-184)∪(123,381)∪(2046,∞)
y isa O or S(50) if yoD ∈ C1,1 ≡ [-184, 123]
y isa O or I(1) if yoD ∈ C1,2 ≡ [381, 590]
y isa O or E(50) or I(11) if yoD ∈ C1,3 ≡ [590, 1331]
y isa O or I(38) if yoD ∈ C1,4 ≡ [1331, 2046]

SRR(AVGs,dse) on C1,1: 0 154 S
y isa O if y isa C1,1 AND SRR(AVGs,Dse) ∈ (154,∞)
y isa O or S(50) if y isa C1,1 AND SRR(AVGs,DSE) ∈ [0,154]

SRR(AVGs,dse) on C1,2: only one such I

SRR(AVGs,dse) on C1,3: 2 137 E, 7 143 I
y isa O if y isa C1,3 AND SRR(AVGs,Dse) ∈ (-∞,2)∪(143,∞)
y isa O or E(10) if y isa C1,3 AND SRR in [2,7)
y isa O or E(40) or I(10) if y isa C1,3 AND SRR in [7,137) = C2,1
y isa O or I(1) if y isa C1,3 AND SRR in [137,143]
etc.

We use the Radial steps to remove false positives from gaps and ends. We are effectively projecting onto a 2-dim range, generated by the D-line and the D⊥-line (which measures the perpendicular radial reach from the D-line). In the D projections, we can attempt to cluster directions into "similar" clusters in some way and limit the domain of our projections to one of these clusters at a time, accommodating "oval" shaped or elongated clusters and giving a better hull fit. E.g., in the Enron email case the dimensions would be words that have about the same count, reducing false positives.

Dei 1 .7 -7 -4; xoDei on C2,1: 1.4 19 E, -2 3 I

y isa O if yoD ∈ (-∞,-2)∪(19,∞)
y isa O or I(8) if yoD ∈ [-2, 1.4]
y isa O or E(40) or I(2) if yoD ∈ C3,1 ≡ [1.4, 19]

SRR(AVGe,dei) on C3,1: 2 370 E, 8 106 I

y isa O if y isa C3,1 AND SRR(AVGs,Dei) ∈ [0,2)∪(370,∞)
y isa O or E(4) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [2,8)
y isa O or E(27) or I(2) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [8,106)
y isa O or E(9) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [106,370]

LSR on IRIS150-2. We use the diagonals. Also we set a MinGapThres=2, which will mean we stay 2 units away from any cut.

d=e1=1000; the xod limits: 43 58 S, 49 70 E, 49 79 I

y isa O if yoD ∈ (-∞,43)∪(79,∞)
y isa O or S( 9) if yoD ∈ [43,47]
y isa O or S(41) or E(26) or I( 7) if yoD ∈ (47,60) (y∈C1,2)
y isa O or E(24) or I(32) if yoD ∈ [60,72] (y∈C1,3)
y isa O if yoD ∈ [43,47] & SRR ∈ (-∞,52)∪(60,∞)
y isa O or I(11) if yoD ∈ (72,79]
y isa O if yoD ∈ [72,79] & SRR ∈ (-∞,49)∪(78,∞)

d=e3=0010 on C2,2; xod lims: 30 33 S, 28 32 E, 28 30 I

y isa O if yoD ∈ (-∞,28)∪(33,∞)
y isa O or S(13) or E(10) or I(3) if yoD ∈ [28,33]

d=e4=0001; xod lims: 12 18 E, 18 24 I

y isa O or S(13) if yoD ∈ [1,5]
y isa O if yoD ∈ (-∞,1)∪(5,12)∪(24,∞)
y isa O or E( 9) if yoD ∈ [12,16)
y isa O or E( 1) or I( 3) if yoD ∈ [16,24)
y isa O if yoD ∈ [12,16) & SRR ∈ [0,208)∪(558,∞)
y isa O if yoD ∈ [16,24) & SRR ∈ [0,1198)∪(1199,1254)∪(1424,∞)
y isa O or E(1) if yoD ∈ [16,24) & SRR ∈ [1198,1199]
y isa O or I(3) if yoD ∈ [16,24) & SRR ∈ [1254,1424]

y isa O or E( 3) if yoD ∈ [18,23)
y isa O if yoD ∈ (-∞,18)∪(46,∞)
y isa O or E(13) or I( 4) if yoD ∈ [23,28) (y∈C2,1)
y isa O or S(13) or E(10) or I( 3) if yoD ∈ [28,34) (y∈C2,2)
y isa O or S(28) if yoD ∈ [34,46]
y isa O if yoD ∈ [18,23) & SRR ∈ [0,21)
y isa O if yoD ∈ [34,46] & SRR ∈ [0,32]∪[46,∞)

d=e2=0100 on C1,2; xod lims: 30 44 S, 20 32 E, 25 30 I

d=e2=0100 on C1,3; xod lims: 22 34 E, 22 34 I (zero differentiation!)

y isa O or E(17) if yoD ∈ [60,72] & SRR ∈ [1.2,20]
y isa O or I(25) if yoD ∈ [60,72] & SRR ∈ [66,799]
y isa O or E( 7) or I( 7) if yoD ∈ [60,72] & SRR ∈ [20,66]
y isa O if yoD ∈ [0,1.2)∪(799,∞)

LSR IRIS150.

d=e1 p=AS L=(X-p)od (-pod=-50.06)-7.06 7.94 S&L -1;06 19.94 E&L -1.06 28.94 I&L

-8,-216

[-2,8)34, 24, 6099 393 1096 1217 1825

[20,29]12

[8,20)26, 32 2707921558 2567

E=26 I=5

30ambigs, 5 errs

d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L-2,4)50

[7,11) 28

[16,23]I=34

[11,16) 22, 1611 1611 16 E=22

I=16

38ambigs 16errs

d=e3 p=AvgE, L=(X-p)od-32 -24 S&L -12 9 E&L -25 27 I&L,-25)48

-25,-122 1 1

[9,27] I=34

[-12,9)49, 152(17) 16158 199

E=32 I=14

d=e4 p=AvgE, L=(X-p)od-13 -7 S&L -3 5 E&L 1 12 I&L-7]50

[-3,1) 21

[5,12] 34

[1,5)22, 16.7 .74.8 4.8 E=22

I=16

d=e2 p=AvgS, L=(X-p)od -11 10 S&L-14 0 E&L -13 4 I&L,-13)1

-13,-110, 2, 1all=-11

[0,4) [4,15 3 6

-11,029,47,46066 310 352 1749 4104

1, 1 46,11

2, 1 9, 3

d=e3 p=AvgS, L=(X-p)od-5 5 S&L 15 37 E&L 4 55 I&L-5,4)47

[4,15)3 1

[37,55]I=34

[15,37)50, 15157 297536 792

E=18 I=12

3, 1

d=e1 p=AE L=(X-p)od (-pod=-59.36)-17 -1 S&L -11 11 E&L -11 20 I&L

-17-1116

[-11,-1)33, 21, 3 0 27 107 1727481150

[11,20]I12

[-1,11)26, 321 5179 633

E=7 I=4

E=5I=3

d=e2 p=AvgE, L=(X-p)od -5 `17 S&L-8 7 E&L -6 11 I&L,-6)1

[-6, -5)0, 2, 1 15 18 58 59

[7,11) [11,15 3 61 err

[-5,7)29,47, 46 3 58 234793 11031417

13, 21

21, 3

d=e1 p=AI L=(X-p)od (-pod=-65.88)-22 -8 S&L -17 4 E&L -17 14 I&L

[-17,-8)33, 21, 3 38 126 132 73016222181

[-8,4)26, 32 034 1368730

E=26 I=11

E=2 I=1

d=e2 p=AvgI, L=(X-p)od -7 `15 S&L-10 4 E&L -8 9 I&L,-6)1

[6,11) [11,15 3 6

[-7, 4)29,46,46 5 36 929 140318932823

[-8, -7)2, 1allsame

E=2 I=1

E=47 I=22

[5, 9]9, 2, 1allsameS=9

E=2 I=1

d=e3 p=AvgI, L=(X-p)od-44 -36 S&L -25 -4 E&L -37 14 I&L,-25)48

-25,-122 1 1

[9,27] I=34

[-25,-4)50, 15 511 318453

E=32 I=14E=46 I=14

d=e4 p=AvgI, L=(X-p)od-19 -14 S&L -10 -3 E&L -6 5 I&L

[5,12] 34

[-6,-3)22, 16same range

E=22 I=16

d=AvgEAvgI p=AvgE, L=(X-p)od -36 -25 S -14 11 E -17 33 I

R(p,d,X)

S E I 0 232 76 357514

[-17,-14)]I(1)

[-14,11) (50, 13)0 2.876 134

[11,33] I(36)

E=47I=12

R(p,d,X)

S E I.3 .9 4.7150 204 213

[12,17.5)]I(1)

d=AvgSAvgI p=AvgS, L=(X-p)od -6 5 S 17.5 42 E 12 65 I

[17.5,42)(50,12)4.7 6 192 205

[11,33]I(37)

E=45 I=12

d=AvgSAvgE p=AvgS, L=(X-p)od -6 4 S 18 42 E 11 64 I

R(p,d,X)

S E I0 2 6 137154 393

[11,18)]I(1)

[18,42) (50,11)2 6.92 133137

[42,64] 38

E=39 I=11

d=e1 p=AvgS, L=Xod43 58 S&L 49 70 E&L 49 79 I&L

Note that each L=(X-p)od is just a shift of Xod by -pod (for a given d).

Next, we examine: for a fixed d, the SPTS L_{p,d} is just a shift of L_d ≡ L_{origin,d} by -pod, so we get the same intervals to apply R to, independent of p (shifted by -pod).

Thus, we calculate once ll_d = min Xod and hl_d = max Xod; then for each different p we shift these interval limit numbers by -pod, since these numbers are really all we need for our hulls (rather than going through the SPTS calculation of (X-p)od anew for each new p).
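The shift can be sketched in a few lines. The limits and class averages below are the IRIS150 numbers quoted in these slides for d=1000; the function name `shift_limits` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shift_limits(limits, p, d):
    """Shift precomputed per-class (min, max) of Xod by -pod; no pTree work needed."""
    pod = dot(p, d)
    return {k: (lo - pod, hi - pod) for k, (lo, hi) in limits.items()}

# d = e1 = 1000, p = origin: per-class (min, max) of Xod
LIMITS_E1 = {"S": (43, 58), "E": (49, 70), "I": (49, 79)}
D_E1 = (1, 0, 0, 0)

AS = (50, 34, 15, 2)     # AvgS
AE = (59, 28, 43, 13)    # AvgE
AI = (66, 30, 55, 20)    # AvgI

shifted_AS = shift_limits(LIMITS_E1, AS, D_E1)
shifted_AE = shift_limits(LIMITS_E1, AE, D_E1)
shifted_AI = shift_limits(LIMITS_E1, AI, D_E1)
```

Only scalar subtractions are performed per p, so trying many p points costs almost nothing once min and max of Xod are stored.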

There is no reason we have to use the same p on each of those intervals either.

So on the next slide, we consider all 3 functionals, L, S and R. E.g., Why not apply S first to limit the spherical reach (eliminate FPs). S is calc'ed anyway?

LSR IRIS150 e2

Ld d=0100 p=origin Setosa 23 44 vErsicolor 20 34vIrginica 22 38

d=0100 p=AS=(50 34 15 2) -11 10 -14 0 -12 4

d=0100 p=AE=(59 28 43 13) -5 16 -8 6 -6 10

d=0100 p=AI=(66 30 55 20) -7 14 -10 4 -8 8

1 2 1 all -7.7

6 29 47 46 5 36 47 22 929 140318922823

15 3

XoX402635013406330639964742347738852977358845143720340128715112542646224031499142794365421135164004382536603928415840603493352543134611498935883672442335883009398639032732313340174422340943053340440737898329737083485323735062277523416674825150422563705784696754427582628658746578540371336274722068586955

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475

736679088178669152505166507057587186606670377884660358865419578169335784421857986057602367034247 5883 9283 7055 9863 8270 897311473 534010463 880210826 8250 7995 8990 6774 7325 8458 84741234611895 6809 9563 672111602 7423 926810132 7256 7346 8457 97041034212181 8500 7579 772911079 8837 8406 514890799162885270559658945286227455822984457306

767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146146148149150

FAUST Oblique, LSR: Linear, Spherical, Radial classifier

∀p (pre-compute?):

L_{d,p} ≡ (X-p)od = L_d - pod
n_{k,L,d,p} ≡ min(C_k & L_{d,p}) = n_{k,L,d} - pod
x_{k,L,d,p} ≡ max(C_k & L_{d,p}) = x_{k,L,d} - pod

On IRIS150, ∀d, precompute! XoX, L_d = Xod, n_{k,L,d} ≡ min(C_k & L_d), x_{k,L,d} ≡ max(C_k & L_d)

d=1000 p=AS=(50 34 15 2) -7 8 -1 20 -1 29

d=1000 p=AE=(59 28 43 13) -16 -1 -10 11 -10 20

d=1000 p=AI=(66 30 55 20) -23 -8 -17 4 -17 13

p=AvgS p=AvgE p=AvgI

d=0100 p=AS=(50 34 15 2) -11 10 -14 0 -12 4

d=0100 p=AE=(59 28 43 13) -5 16 -8 6 -6 10

d=0100 p=AI=(66 30 55 20) -7 14-10 4 -8 8

d=0010 p=AS=(50 34 15 2) -5 4 15 36 3 54

d=0010 p=AE=(59 28 43 13) -33 -24 -13 8 -25 26

d=0010 p=AI=(66 30 55 20)-45 -36 -25 -4 -37 14

d=0001 p=AS=(50 34 15 2) -1 4 8 16 12 23

d=0001 p=AE=(59 28 43 13)-12 -7 -3 5 1 12

d=0001 p=AI=(66 30 55 20) -19 -14 -10 -2 -6 5

We have introduced 36 pairs of linear bookends to the class hulls, 1 pair for each of the 4 d's, 3 p's and 3 classes. For fixed d and Ck, the pTree mask is the same over the 3 p's. However, we need to differentiate anyway to calculate R correctly.

That is, for each d-line we get the same set of intervals for every p (just shifted by -pod). The only reason we need to have them all is to accurately compute R on each min-max interval. In fact, we compute R on all intervals (even those where a single class has been isolated) to eliminate False Positives (if FPs are possible; sometimes they are not, e.g., if we are to classify IRIS samples known to be Setosa, vErsicolor or vIrginica, then there is no "other").

Assuming L_d, n_{k,L,d} and x_{k,L,d} have been pre-computed and stored, the cut-pt pairs (n_{k,L,d,p}; x_{k,L,d,p}) are computed without further pTree processing, by the scalar computations: n_{k,L,d,p} = n_{k,L,d} - pod, x_{k,L,d,p} = x_{k,L,d} - pod.

Ld514947465054465044495448484358575451575154514651485050525247485452554950554944515045445051485146535070646955655763496652505960615667565862565961636164

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475

d=1000666867605755555860546067635655556158505657576251576358716365764973677265646857586465777760695677636772626164727479646361776364606967695868676763656259

767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146146148149150

L_{p,d} = L_d - pod

Per-class limits for p=0000 (each pair is n_{k,L,d} x_{k,L,d}):

d=e1=1000 p=0000   S 43 58   E 49 70   I 49 79
d=e2=0100 p=0000   S 23 44   E 20 34   I 22 38
d=e3=0010 p=0000   S 10 19   E 30 51   I 18 69
d=e4=0001 p=0000   S  1  6   E 10 18   I 14 25

Form Class Hulls using linear d boundaries thru min and max of L_{k,d,p} = (C_k & (X-p))od.
On every interval I_{k,p,d} ≡ {[ep_i, ep_{i+1}) | ep_j = min L_{k,p,d} or max L_{k,p,d} for some k,p,d}, add spherical and barrel boundaries with S_{k,p} and R_{k,p,d} similarly (use enough (p,d) pairs so that no 2 class hulls overlap).
Points outside all hulls are declared as "other".
Σ_{all p,d} dis(y, I_{k,p,d}) = unfitness of y being classed in k. Fitness of y in k is f(y,k) = 1/(1-uf(y,k)).
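A minimal sketch of the linear part of such hulls follows. It uses only the d=e1 and d=e4 limits quoted in these slides (p = origin), omits the spherical and radial boundaries, and treats unfitness as the summed distance from each projection to the class interval. The names `HULLS`, `unfitness` and `classify` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Per-direction, per-class (min, max) of Xod on IRIS150, p = origin
HULLS = {
    (1, 0, 0, 0): {"S": (43, 58), "E": (49, 70), "I": (49, 79)},
    (0, 0, 0, 1): {"S": (1, 6), "E": (10, 18), "I": (14, 25)},
}

def interval_distance(v, lo, hi):
    """0 inside [lo, hi], otherwise the distance to the nearer end."""
    return max(lo - v, 0, v - hi)

def unfitness(y, k, hulls):
    """Sum over directions of the distance from yod to class k's interval."""
    return sum(interval_distance(dot(y, d), *limits[k])
               for d, limits in hulls.items())

def classify(y, hulls, classes=("S", "E", "I")):
    """Classes whose hull contains y on every direction, else 'other'."""
    fits = [k for k in classes if unfitness(y, k, hulls) == 0]
    return fits or ["other"]

in_setosa = classify((50, 34, 15, 2), HULLS)
ambiguous = classify((60, 30, 45, 15), HULLS)
outside = classify((100, 30, 50, 50), HULLS)
```

A point that still fits several classes (as `ambiguous` does here) is exactly the case where further (p,d) pairs or the R and S boundaries are needed.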

LSR IRIS150, e1 only

S_p ≡ (X-p)o(X-p) = XoX + L_{-2p} + pop;   n_{k,S,p} ≡ min(C_k & S_p),   x_{k,S,p} ≡ max(C_k & S_p)

R_{p,d} ≡ S_p - L_{p,d}^2 = L_{-2p+(2pod)d} + pop - (pod)^2 + XoX - L_d^2;   n_{k,R,p,d} ≡ min(C_k & R_{p,d}),   x_{k,R,p,d} ≡ max(C_k & R_{p,d})
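Expanding S_p - L_{p,d}^2 with L_{p,d} = Xod - pod gives R_{p,d} = Xo(-2p + 2(pod)d) + pop - (pod)^2 + XoX - (Xod)^2, so once XoX and L_d = Xod are stored, a new p needs only one extra linear functional plus scalars. The sketch below checks that expansion against the direct definition. It assumes |d| = 1, uses plain Python rows, and the function names and the three sample rows are illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def radial_direct(X, p, d):
    """R = (x-p)o(x-p) - ((x-p)od)^2, computed row by row."""
    out = []
    for x in X:
        diff = [a - b for a, b in zip(x, p)]
        l = dot(diff, d)
        out.append(dot(diff, diff) - l * l)
    return out

def radial_from_precomputed(X, XoX, Ld, p, d):
    """R from stored XoX and Ld = Xod plus one new linear functional in X."""
    pod, pop = dot(p, d), dot(p, p)
    w = [-2 * pi + 2 * pod * di for pi, di in zip(p, d)]   # -2p + 2(pod)d
    Lw = [dot(x, w) for x in X]
    return [lw + pop - pod * pod + xx - l * l
            for lw, xx, l in zip(Lw, XoX, Ld)]

X = [(51, 35, 14, 2), (70, 32, 47, 14), (63, 33, 60, 25)]   # sample rows
d = (1, 0, 0, 0)
p = (50, 34, 15, 2)                                          # AvgS
XoX = [dot(x, x) for x in X]
Ld = [dot(x, d) for x in X]

R_fast = radial_from_precomputed(X, XoX, Ld, p, d)
R_slow = radial_direct(X, p, d)
```

With integer inputs the two results agree exactly, which is a quick way to catch a sign slip in the expansion.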

Analyze R: R^n → R^1 (and S: R^n → R^1?) projections on each interval formed by consecutive L: R^n → R^1 cut-pts.

d=1000 p=AS=(50 34 15 2)-7 8 -1 20 -1 29

d=1000 p=AE=(59 28 43 13)-16 -1 -10 11 -10 20

d=1000 p=AI=(66 30 55 20)-23 -8 -17 4 -17 13

Ld d=1000 p=origin Setosa 43 58vErsicolor 49 70vIrginica 49 79

26 32 270792 26 51558 2568

16 0128

24 634 099 393 1096 1217 1826

12 20813445

26 321 517,479 633

16 7231258

34 24 6 0 279 5 171 186748998

12 249794

16 16412391

34 24 6 24 126 2 1 132 73016222281

12 17220

26 32 0 3426 10 388 1369

If we have computed S: R^n → R^1, how can we utilize it? We can, of course, simply put spherical hull boundaries by centering on the class Avgs, e.g., S_p with p=AvgS:

Setosa 0 154 E=50 I=11vErsicolor 394 1767vIrginica 369 4171

with AI 17220

with AE 1 517,478 633

eliminates FPs better?

What is the cost for these additional cuts (at new p-values in an L-interval)? It looks like: make the one additional calculation, L_{-2p+(2pod)d}, then AND the interval masks, then AND the class masks? (Or if we already have all interval-class masks, only one mask AND step.)

Recursion works wonderfully on IRIS: The only hull overlaps after only d=1000 areAnd the 4 i's common to both are {i24 i27 i28 i34}. We could call those "errors".

If on the L 1000,avgE interval, [-1, 11) we recurse using SavgI we get7 436 540,4 72170

Thus, for IRIS at least, with only d=e1=(1000), with only the 3 ps avgS, avgE, avgI, using full linear rounds, 1 R round on each resulting interval and 1 S, the hulls end up completely disjoint. That's pretty good news!

There is a lot of interesting and potentially productive (career building) engineering to do here. What is precisely the best way to intermingle p, d, L, R, S (minimizing time and False Positives)?

A pTree Pillar k-means clustering method (The k is not specified - it reveals itself.)

m1

m2

m3

m4

Choose m1 as a pt that maximizes Distance(X, avgX)
Choose m2 as a pt that maximizes Distance(X, m1)
Choose m3 as a pt that maximizes Σ_{h=1..2} Distance(X, m_h)
Choose m4 as a pt that maximizes Σ_{h=1..3} Distance(X, m_h)

Do until minimum_{h=1..k} Distance(X, m_h) < Threshold

This gives k. Apply pk-means. (Note we already have all Dis(X,m_h)s for the first round.)
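The seed selection can be sketched as below. Two assumptions are made that the slide does not spell out: each new seed maximizes the accumulated (summed) distance to the seeds chosen so far, and the loop stops when the next candidate lies closer than Threshold to an existing seed. The function name `pillar_seeds` and the toy points are hypothetical, and the data are plain tuples rather than pTrees.

```python
import math

def pillar_seeds(points, threshold):
    """Pick far-apart seed points; k is the number of seeds found."""
    n, dim = len(points), len(points[0])
    avg = tuple(sum(pt[i] for pt in points) / n for i in range(dim))

    # m1: the point farthest from the overall average
    first = max(points, key=lambda x: math.dist(x, avg))
    seeds = [first]
    acc = [math.dist(x, first) for x in points]     # accumulated distances

    while True:
        # candidate: the point with the largest accumulated distance to all seeds
        cand = points[max(range(n), key=lambda i: acc[i])]
        # stop when the candidate is too close to an existing seed
        if min(math.dist(cand, m) for m in seeds) < threshold:
            break
        seeds.append(cand)
        acc = [a + math.dist(x, cand) for a, x in zip(acc, points)]
    return seeds

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
seeds = pillar_seeds(points, threshold=5)
k = len(seeds)
```

The accumulated distances kept in `acc` are the same Dis(X, m_h) columns that the first k-means round would need, so they can be reused rather than recomputed.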

Note: D = m1m2 line. Treat PCCs like parentheses: "(" corresponds to a PCI and ")" corresponds to a PCD. Each matched pair should indicate a cluster somewhere in that slice. Where? One could take the VoM as the best-guess centroid, then proceed by restricting to that slice. Or first apply R and do PCC parenthesizing on the R values to identify the radial slice where the cluster occurs, and take the VoM of that combo slice (linear and radial) as the centroid. Apply S to confirm.

Note: A possible clustering method for identifying density clusters (as opposed to round or convex clusters)

(Treating PCCs like parentheses)

d-linePCI PCD PCI PCD

(or Do until mk < Threshold)

Clustering:
1. For Anomaly Detection
2. To develop Classes against which future unclassified objects are classified.

(Classification = moving up a concept hierarchy using a class assignment function, caf: X → {Classes})

NewClu (k is discovered, not specified). Assign each (object,class) a ClassWeight, CW∈Reals (could be <0). Classes "take next ticket" as they're discovered (tickets are 1,2,...). Initially, all classes are empty; all CWs=0.

Do for next d, compute L_d = Xod, until after masking off new cluster, count is too high (doesn't drop enough):

For the next PCI in L_d (next-larger, starting from smallest):

If followed by a PCD, declare next Class_k and define it to be the set spherically gapped (or PCDed) around centroid C_k = Avg or VoM_k over L_d^{-1}[PCI, PCD]. Mask off this ticketed new Class_k and continue.

If followed by a PCI, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^{-1}[(3PCI1+PCI2)/4, PCI2). Mask off this ticketed new Class_k and continue.

For the next-smaller PCI (starting from largest) in L_d:

If preceded by a PCD, declare next Class_k and define it to be the set spherically gapped (or PCDed) around centroid C_k = Avg or VoM_k over L_d^{-1}[PCD, PCI]. Mask off this ticketed new Class_k and continue.

If preceded by a PCI, declare next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid C_k = Avg or VoM_k over L_d^{-1}(PCI2, (3PCI1+PCI2)/4]. Mask off this ticketed new Class_k and continue.

When is it important not to over partition? Sometimes it is, but sometimes it is not. In 2. it usually isn't. With gap clustering we don't ever over partition, but with PCC based clustering we can. If it is important that each cluster be whole, when using a k-means type clusterer, each round we can fuse C_i and C_j iff on L_{m_i m_j} their projections touch or overlap.