TRANSCRIPT
From: Mark Silverman [mailto:[email protected]] Sent: Wed, May 28, 2014 11:48 PM
Hypothesis: How do you algorithmically resolve conflicts between spts's? You place the point in the largest found cluster where the density is greater than a threshold, so point 4 may not be a singleton, depending on density. It's a working theory. I will work through some examples, but I think this may resolve my concerns about sparse, high-dimensional data sets.
This first example shows that recursion is important. A slightly different example shows that recursion order is also important:
On May 29, 2014, at 10:31 AM, [email protected] wrote: One point of view is: the dot product functional is distance dominated, meaning the distance on the d-line between Ld(x)=xod and Ld(y)=yod is always less than or equal to distance(x,y). So any gap in the projected Ld values IS a gap, probably bigger, in the space; but many gaps in the space may not show up as gaps on a projection d-line.
Therefore there's never a conflict between spts gap-enclosed clusters. Consecutive spts gaps imply cluster(s), but not necessarily vice versa (while a PCI followed by a PCD implies a cluster, there can be ambiguous nesting).
Each spts reveals only some of the gaps, and the spts gap sizes are conservative.
That’s why recursion is used. E.g., the gap of 110 between point 4 and its complement is revealed by the spts Attr1. We then use the pTree mask restricting to points {123 5678}, and there Attr2 reveals 2 more gaps (missed by Attr1): gap2=100 between {138} and {256}, and gap3=22 between {256} and {7}. All projection gaps are legitimate and there are no conflicts.
X:
Row  Attr1  Attr2
1    0      0
2    0      100
3    0      0
4    110    110
5    0      114
6    0      123
7    0      145
8    0      0
[Figure: scatter plot of X with point labels {1, 3, 8}, 4, 2, 6, 5, 7]
Row  Attr1  Attr2
1    0      0
2    0      25
3    0      50
4    0      75
5    100    100
6    0      125
7    0      150
8    0      175
[Figure: scatter plot of the eight points, with the distances 100, 25 and 103.078 marked at point 5]
Here, Le2=Attr2 reveals no gaps.
So then Le1=Attr1 is applied to all of X and reveals a gap of at least 100 for point 5.
The substantial gap of 50 between {1234} and {678} is missed. If it were done in the other order:
Le1=Attr1 is applied to all of X and reveals a gap of at least 100 for point 5 (note the gap is actually 103.078). {1234 678} and {5} are split, and {5} is declared an outlier and declared finished.
Le2=Attr2 is applied to X-{5}={1234 678} and reveals a gap of at least 50 between {1234} and {678}.
StD: Attr1=33, Attr2=57.2. So StD doesn't always reveal the best order!
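The two slide examples can be replayed with a minimal sketch. The function names, the rule that each attribute is applied once in the given order, and the gap thresholds (20 and 40) are assumptions made for illustration; they are not part of the original slides.

```python
def gap_split(points, ids, attr, gap_threshold):
    """Split ids wherever consecutive sorted values of one attribute
    differ by more than gap_threshold."""
    ordered = sorted(ids, key=lambda i: points[i][attr])
    groups, current = [], [ordered[0]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if points[nxt][attr] - points[prev][attr] > gap_threshold:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups


def cluster(points, ids, attr_order, gap_threshold):
    """Apply the attributes in the given order, recursing on every
    subcluster with the attributes that have not been used yet."""
    if not attr_order or len(ids) <= 1:
        return [sorted(ids)]
    result = []
    for group in gap_split(points, ids, attr_order[0], gap_threshold):
        result.extend(cluster(points, group, attr_order[1:], gap_threshold))
    return result


# First example: (Attr1, Attr2) per row.
EX1 = {1: (0, 0), 2: (0, 100), 3: (0, 0), 4: (110, 110),
       5: (0, 114), 6: (0, 123), 7: (0, 145), 8: (0, 0)}
# Second example: only point 5 stands out on Attr1.
EX2 = {i: (100 if i == 5 else 0, 25 * (i - 1)) for i in range(1, 9)}

ALL_IDS = list(range(1, 9))
recursed = cluster(EX1, ALL_IDS, (0, 1), 20)       # Attr1, then Attr2
attr2_first = cluster(EX2, ALL_IDS, (1, 0), 40)    # Attr2, then Attr1
attr1_first = cluster(EX2, ALL_IDS, (0, 1), 40)    # Attr1, then Attr2
```

With these assumed thresholds, `recursed` separates {1,3,8}, {2,5,6}, {7} and {4}; `attr2_first` only isolates point 5, while `attr1_first` also finds the gap of 50 between {1,2,3,4} and {6,7,8}.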
From [email protected], sent Thur, May 29, 2014 12:21:
I see the confusion. You start with any spts, then recurse through the unclustered points (recurse on each subcluster set produced by that spts).
I'm thinking about how to do this in Hadoop in parallel. I have different spts arriving at different processors, so my issue is what happens if these processors arrive at different conclusions.
Good question! Possible answer: parallelize incrementally (as the recursion progresses). Let's assume the entire dataset is replicated to all nodes (if not, modifications need to be made). Do the first dot product spts splitting. We get a mask pTree for each gap-enclosed sub-cluster (a mask pTree in Hadoop is a horizontal array of bits specifying which documents are in the sub-cluster). In parallel, send each of those mask pTrees to a different node (also send all of those mask pTrees to the designated "dendrogram building" node).
Second issue is that the entire dataset cannot fit into memory, so getting to density is a challenge.
If we deal strictly with gaps, density may not be necessary. In any case, Barrel Density could give a good approximation. Once the linear distribution is done, compute the distribution of radial reach distances from the d-line. The max (or last PCD) of those numbers should give you a good barrel radius. From here one could simply take the max of the max radial reach (mrr) and the linear projection radius (lpr) as a radius, r. The volume would be roughly r^n, to be divided into the count for density. If that proves too rough, the actual barrel volume is roughly mrr^(n-1) * lpr.
From: Mark Silverman, sent Thursday, May 29, 2014 1:09:
Each processor sees only some of the attributes, not all. I'm thinking we have one pass through the attributes to get stats, and we can also pull the radius. Then we can process sequentially. The only issue now is cluster merge, which I don't see a way around, but there is a huge benefit if we can figure it out, which we will :)
Sparseness is a major issue. For example, in a million docs the word "arrrgghhh" in a single document could cluster.
We might want it to be an outlier (2 emails with “arrrgghhh” probably came from the same sender?). Or we might make a pass through our vocab, flagging those words that we don't want to trigger a cut. Then we would not use those columns as spts's.
Clearly we need to apply some sort of relevance tuning, probably better than stddev.
Agree. Previous slide, last example shows that stddev can fail.
Sparseness breaks kmeans because the average of any attribute will tend to be pretty close to zero. So I'm trying to keep that in mind. The other issue is that we cannot have a single processor manage every attribute, so we need this notion of merging clusters somehow. That's why I've been thinking about the issue where two processors make two different decisions. I have to think about that more.
We could think about those decisions as being separate collections of mask pTrees, then we just AND the collections for the combined dendrogram level?
Consider also streaming data, where later-arriving data may need to be placed either in an existing cluster or in a new one. I think it's the same problem?
We should always save the spts cut_points and use those to put new data in the proper cluster. If a new data point, y, is in the middle of a previous gap, it is the start of a new cluster? We could just build dis2(X,y) and calculate the minimum to decide.
I’m going to assume:
1. Gap-based clustering (or splitting) will suffice in text mining (no need to split at PCCs also). Gap-based never requires a dendrogram and therefore never requires density nor fusion. When I use the term “gap” I will mean gap > Gap_Threshold for some chosen GT.
2. The initial dispersal of attributes to nodes is a partition (mutually exclusive).
Then for d, to get the “X dot d” spts, use a master-slave arrangement (slaves compute their parts and send their partial spts to the master to be added into the final spts). Of course, if d=ek the spts is just the given kth column and is available without any required computation at one node.
Thought: A cluster subset is always expressed as a mask pTree (a row of bits in Hadoop). Those results can be sent to a master node to build the full clustering (by ANDing each pair coming from different sites). I think such a master could keep up with the slaves working in parallel. We just have to AND all pairs of mask pTrees, one coming from each of two different sites. The result is one clustering of the entire dataset (the one our gap-based FAUST Oblique algorithm produces for the chosen Minimum Gap Threshold). In the text mining case I am going to guess that using only the ek’s will suffice (we won’t even need combos of them).
Said another way: A gap in any “X dot d” spts gives us two disjoint mask pTrees that split the space in two, and we know that no bona fide cluster subset can span that divide (so there should never be a need to fuse). Therefore it doesn’t matter where cluster subsets (or splits) are revealed: at one site or by ANDing pairs of mask pTrees from different sites. From this point of view, a merge step may be unnecessary.
Disclaimer: In order to be guaranteed we have gotten all linear gaps, we may need to employ all unit vectors, d (a full partitioning of the unit (n-1)-dimensional half-sphere). And there are gaps that are not linear (not hyperplanar).
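The "AND all pairs of mask pTrees" idea can be sketched as follows. This is only an illustration: masks are modeled as Python integers (bit r-1 set means row r is in the sub-cluster), the helper names are made up, and the two sites' masks are taken from the first 8-row example with an assumed gap threshold of 20.

```python
from itertools import product


def mask(rows):
    """Build a bit mask from 1-indexed row numbers (hypothetical helper)."""
    m = 0
    for r in rows:
        m |= 1 << (r - 1)
    return m


def rows_of(m):
    """Decode a bit mask back into its 1-indexed row numbers."""
    return [i + 1 for i in range(m.bit_length()) if (m >> i) & 1]


def and_all_pairs(site_a_masks, site_b_masks):
    """AND every mask from site A with every mask from site B and
    keep only the non-empty intersections."""
    return [a & b for a, b in product(site_a_masks, site_b_masks) if a & b]


# Site A holds Attr1: its gaps split X into {4} and everything else.
site_a = [mask([4]), mask([1, 2, 3, 5, 6, 7, 8])]
# Site B holds Attr2: its gaps split X into {1,3,8}, {2,4,5,6} and {7}.
site_b = [mask([1, 3, 8]), mask([2, 4, 5, 6]), mask([7])]

combined = [rows_of(m) for m in and_all_pairs(site_a, site_b)]
```

Here the combined clustering is {4}, {1,3,8}, {2,5,6}, {7}, the same result the recursive single-site approach gives on that example, and no merge step was needed.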
Machine Learning = moving data up some concept hierarchy (increasing information and/or/by reducing volume). ML takes two forms: clustering and classification (unsupervised and supervised). Clustering groups similar objects into a single higher-level object, or cluster. Classification does the same but is supervised by an existing class assignment function on a Training Set, f: TS → {Classes}.
1. Clustering can be done for Anomaly Detection (detecting those objects that are dissimilar from the rest), which boils down to finding singleton [and/or doubleton...] clusters.
2. Clustering can be done to develop a Training Set for classification of future unclassified objects.
NCL: (k is discovered, not specified.) Assign each (object, class) pair a ClassWeight, CW ∈ Reals (could be <0). Classes "take the next ticket" as they're discovered (tickets are 1, 2, ...). Initially, all classes are empty and all CWs=0.
Do for the next d: compute Ld = Xod, until after masking off the new cluster the count is too high (doesn't drop enough).
  For the next PCI in Ld (next-larger, starting from the smallest):
    If followed by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, Ck = Avg or VoM_k over Ld^-1[PCI, PCD]. Mask off this ticketed new Class_k and continue.
    If followed by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, Ck = Avg or VoM_k over Ld^-1[(3PCI1+PCI2)/4, PCI2). Mask off this ticketed new Class_k and continue.
  For the next-smaller PCI (starting from the largest) in Ld:
    If preceded by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, Ck = Avg or VoM_k over Ld^-1[PCD, PCI]. Mask off this ticketed new Class_k and continue.
    If preceded by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, Ck = Avg or VoM_k over Ld^-1(PCI2, (3PCI1+PCI2)/4]. Mask off this ticketed new Class_k and continue.
Machine Learning moves data up a concept hierarchy according to some criterion, which we call a similarity. A similarity is usually a function, s: X×X → OrdinalSet, such that ∀x∈X, s(x,x) ≥ s(x,y) ∀y∈X (every x must be at least as similar to itself as it is to any other object) and s(x,y)=s(y,x). OrdinalSet is usually a subset of {0,1,...} (e.g., binary {0,1}={No,Yes}).
Classification is binary clustering: s(x,y)=1 iff f(x)=f(y), using that part of f which is known to predict that which is unknown.
Pillar pk-means clusterer (k is not specified; it reveals itself.)
[Figure: point set with the pillars m1, m2, m3 and m4 marked]
1a. Choose m1 maximizing D1=Dis(X,avgX).
1b. Check if m1 is an outlier with S_m1.
1c. Repeat until m1 is a non-outlier.
2a. Choose m2 maximizing D2=Dis(X,m1).
2b. Check if m2 is an outlier with S_m2.
2c. Repeat until m2 is a non-outlier.
2d. Compute M1,2=P(m1≥m2), M2,1=P(m1<m2).
3a. Choose m3 maximizing D3=D2+Dis(X,m2).
3b. Check if m3 is an outlier with S_m3.
3c. Repeat until m3 is a non-outlier.
3d. Compute Mi,3=P(mi≥m3), M3,i=P(mi<m3), i<3.
4a. Choose m4 maximizing D4=D3+Dis(X,m3).
4b. Check if m4 is an outlier with S_m4.
4c. Repeat until m4 is a non-outlier.
4d. Compute Mi,4=P(mi≥m4), M4,i=P(mi<m4), i<4.
... Do until MinDist(mh,mk), k<h, < Threshold.
Mj = &(h≠j) Mj,h are the mask pTrees of the k first-round clusters.
Apply pk-means from here on.
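The "a" steps of the pillar selection (choose each new pillar to maximize the accumulated distance from the pillars found so far, and stop once a candidate lands within Threshold of an earlier pillar) can be sketched as below. This is a rough illustration on horizontal data, not the pTree implementation: the outlier checks (steps b and c) and the mask construction (steps d) are omitted, and the function name, sample points and threshold are assumptions.

```python
import math


def choose_pillars(points, threshold):
    """Pick pillars m1, m2, ... by accumulated distance (steps 1a, 2a, 3a, ...)."""
    n = len(points)
    mean = tuple(sum(coord) / n for coord in zip(*points))
    # 1a: m1 maximizes the distance to the average of X.
    m1 = max(points, key=lambda x: math.dist(x, mean))
    pillars = [m1]
    # D2 = Dis(X, m1); later D(h+1) = D(h) + Dis(X, m_h).
    score = [math.dist(x, m1) for x in points]
    while True:
        candidate = points[max(range(n), key=score.__getitem__)]
        # Stop when the candidate is too close to an earlier pillar.
        if min(math.dist(candidate, m) for m in pillars) < threshold:
            return pillars
        pillars.append(candidate)
        score = [s + math.dist(x, candidate) for s, x in zip(score, points)]


# Three loose groups of points (made-up data).
SAMPLE = [(0, 0), (1, 0), (0, 1), (20, 0), (21, 1), (0, 12), (1, 13)]
pillars = choose_pillars(SAMPLE, threshold=5.0)
```

On this sample, one pillar is found per group, so k=3 "reveals itself" before the fourth candidate falls within the threshold of m1.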
FAUST Oblique Analytics
X(X1..Xn) ⊆ R^n, |X|=N; Classes={C1..CK}; d=(d1..dn), |d|=1; p=(p1..pn) ∈ R^n. Functionals: L, S and R, defined below.
FAUST CountChange clusterer: If DensThres is unrealized, cut C at PCCs ∈ L_{d,p}&C with the next (d,p) ∈ dpSet.
FAUST TopKOutliers: Use D2NN=SqDist(x, X')=rank2 S_x for the TopKOutlier-slider.
FAUST Linear classifier: y ∈ Ck iff y ∈ LH_k ≡ {z | Lmin_{d,p,k} ≤ (z-p)od ≤ Lmax_{d,p,k} ∀(d,p) ∈ dpSet}
LH_k is a hull around Ck. dpSet is a set of (d,p) pairs, e.g., (Diag, DiagStartPt).
RkiPtr(x, PtrRankiSx). RkiSD(x, rankiS2), ordered desc on rankiSx as constructed.
Pre-compute what? 1. col stats (min, avg, max, std, ...); 2. XoX; Xop, p=class_Avg/Med; 3. Xod, d=interclass_Avg/Med_UnitVector; 4. Xox, d2(X,x), Rkid2(X,x), x∈X, i=2..; 5. L_{d,p} and R_{d,p} ∀(d,p) ∈ dpSet
FAUST LinearSphericalRadial classifier: y ∈ Ck iff y ∈ LSRH_k ≡ {z | Tmin_{d,p,k} ≤ (z-p)od ≤ Tmax_{d,p,k} ∀(d,p) ∈ dpSet, T=L|S|R}
L_{d,p} ≡ (X-p)od = Xod - pod = L_d - pod;   Lmin_{d,p,k} = min(L_{d,p} & Ck);   Lmax_{d,p,k} = max(L_{d,p} & Ck)
S_p ≡ (X-p)o(X-p) = XoX + Xo(-2p) + pop;   Smin_{p,k} = min(S_p & Ck);   Smax_{p,k} = max(S_p & Ck)
R_{d,p} ≡ S_p - L_{d,p}^2;   Rmin_{d,p,k} = min(R_{d,p} & Ck);   Rmax_{d,p,k} = max(R_{d,p} & Ck)
Expanding R_{d,p}:
  R_{d,p} = (X-p)o(X-p) - [(X-p)od]^2
          = XoX + pop - 2Xop - [Xod - pod]^2
          = XoX + pop - 2Xop - (Xod)^2 + 2(pod)(Xod) - (pod)^2
          = L_{-2p+2(pod)d} + XoX - L_d^2 + pop - (pod)^2
[Figure: point x, point p and unit vector d. (x-p)o(x-p) is the squared distance from p to x; (x-p)od = |x-p| cos θ is the projection onto the d-line; (x-p)o(x-p) - [(x-p)od]^2 is the squared radial distance from the d-line.]
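As a quick numeric check of the three functionals, here is a small sketch on horizontal (row) data rather than pTrees. The function names and the sample points are illustrative assumptions; the formulas follow the definitions of L, S and R above, with R computed as S minus L squared.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lsr(X, p, d):
    """Return the L, S and R columns for rows X, point p and unit vector d."""
    L = [dot(x, d) - dot(p, d) for x in X]                      # (X-p)od
    S = [dot(x, x) - 2 * dot(x, p) + dot(p, p) for x in X]      # (X-p)o(X-p)
    R = [s - l * l for s, l in zip(S, L)]                       # S - L^2
    return L, S, R


X_ROWS = [(3, 4), (1, 0)]
L0, S0, R0 = lsr(X_ROWS, p=(0, 0), d=(1, 0))
L1, S1, R1 = lsr(X_ROWS, p=(1, 1), d=(0, 1))
```

For p at the origin and d along the first axis, the row (3, 4) gives L=3, S=25 and R=16, i.e., R is the squared distance from the d-line, as in the figure.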
FAUST Oblique LSR Classification IRIS150
Class averages: p=AvgS = (50, 34, 15, 2); p=AvgE = (59, 28, 43, 13); p=AvgI = (66, 30, 55, 20)
Ld class intervals with p=origin:
d      S        E          I
0001   1  6     10 18      14 26
1000   43 59    49 71      49 80
0010   10 19    30 51      18 69
0100   23 44    20 34.1    22 38
[Figure: per-class interval charts for each d, with p=origin, p=AvgS, p=AvgE and p=AvgI]
LSR on IRIS150. Consider all 3 functionals, L, S and R. What's the most efficient way to calculate all 3?
L_{p,d} ≡ (X-p)od = L_{o,d} - pod
  minL_{p,d,k} = min[L_{p,d} & Ck] = minL_{o,d,k} - pod = min(Xod & Ck) - pod, OR = min(X&Ck)od - pod
  maxL_{p,d,k} = max[L_{p,d} & Ck] = maxL_{o,d,k} - pod = max(Xod & Ck) - pod, OR = max(X&Ck)od - pod
S_p = (X-p)o(X-p) = -2Xop + S_o + pop = L_{o,-2p} + (S_o + pop)
  minS_{p,k} = min[S_p & Ck] = min[(Xo(-2p)) & Ck] + (XoX + pop), OR = min[(X&Ck)o(-2p)] + (XoX + pop)
  maxS_{p,k} = max[S_p & Ck] = max[(Xo(-2p)) & Ck] + (XoX + pop), OR = max[(X&Ck)o(-2p)] + (XoX + pop)
R_{p,d} ≡ S_p - L_{p,d}^2
  minR_{p,d,k} = min[R_{p,d} & Ck];   maxR_{p,d,k} = max[R_{p,d} & Ck]
o=origin; p∈R^n; d∈R^n, |d|=1; {Ck}k=1..K are the classes. An operation enclosed in a parallelogram means it is a pTree op, not a scalar operation (on just numeric operands).
I suggest that we use each of the functionals with each of the pairs (p,d) that we select for application (since, to get R, we need to compute L and S anyway). So it would make sense to develop an optimal (minimum work and time) procedure to create L, S and R for any (p,d) in the set.
APPENDIX
LSR on IRIS150
Dse = (9, -6, 27, 10). Class intervals [L, H] of yoDse: S [495, 802]; E [1270, 2010]; I [1061, 2725].
  y isa OTHER if yoDse ∈ (-∞,495) ∪ (802,1061) ∪ (2725,∞)
  y isa OTHER or S if yoDse ∈ C1,1 ≡ [495, 802]
  y isa OTHER or I if yoDse ∈ C1,2 ≡ [1061, 1270]
  y isa OTHER or E or I if yoDse ∈ C1,3 ≡ [1270, 2010]
  y isa OTHER or I if yoDse ∈ C1,4 ≡ [2010, 2725]
C1,3: 0 s, 49 e, 11 i
Dei = (-3, -2, 3, 3). Class intervals [L, H] of yoDei: E [-117, -44]; I [-62, -3].
  y isa O if yoDei ∈ (-∞,-117) ∪ (-3,∞)
  y isa O or E or I if yoDei ∈ C2,1 ≡ [-62, -44]
  y isa O or I if yoDei ∈ C2,2 ≡ [-44, -3]
C2,1: 2 e, 4 i
Dei = (6, -2, 3, 1). Class intervals [L, H] of yoDei: E [420, 459]; I [480, 501].
  y isa O if yoDei ∈ (-∞,420) ∪ (459,480) ∪ (501,∞)
  y isa O or E if yoDei ∈ C3,1 ≡ [420, 459]
  y isa O or I if yoDei ∈ C3,2 ≡ [480, 501]
Continue this on clusters with OTHER + one class, so the hull fits tightly (reducing false positives), using diagonals?
For each row: y isa O if yoD ∈ (-∞,L) ∪ (H,∞); y isa O|S if yoD ∈ [L,H], which defines the next cluster.
Cluster  D       L    H     Next cluster
C1,1     1000    43   58    C2,3 ≡ [43,58]
C2,3     0100    23   44    C3,3 ≡ [23,44]
C3,3     0010    10   19    C4,1 ≡ [10,19]
C4,1     0001    1    6     C5,1 ≡ [1,6]
C5,1     1100    68   117   C6,1 ≡ [68,117]
C6,1     1010    54   146   C7,1 ≡ [54,146]
C7,1     1001    44   100   C8,1 ≡ [44,100]
C8,1     0110    36   105   C9,1 ≡ [36,105]
C9,1     0101    26   61    Ca,1 ≡ [26,61]
Ca,1     0011    12   91    Cb,1 ≡ [12,91]
Cb,1     1110    81   182   Cc,1 ≡ [81,182]
Cc,1     1101    71   137   Cd,1 ≡ [71,137]
Cd,1     1011    55   169   Ce,1 ≡ [55,169]
Ce,1     0111    39   127   Cf,1 ≡ [39,127]
Cf,1     1111    84   204   Cg,1 ≡ [84,204]
Cg,1     1-100   10   22    Ch,1 ≡ [10,22]
Ch,1     10-10   3    46    Ci,1 ≡ [3,46]
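The chain of cuts above amounts to an interval-hull test: a sample stays "O|S" only if its projection on every D falls inside the corresponding [L, H]. A minimal sketch, with the (D, L, H) rows copied from the table and the function names chosen only for illustration:

```python
# (D, L, H) rows from the table; "1-100" is read as D=(1,-1,0,0)
# and "10-10" as D=(1,0,-1,0).
HULL = [
    ((1, 0, 0, 0), 43, 58), ((0, 1, 0, 0), 23, 44),
    ((0, 0, 1, 0), 10, 19), ((0, 0, 0, 1), 1, 6),
    ((1, 1, 0, 0), 68, 117), ((1, 0, 1, 0), 54, 146),
    ((1, 0, 0, 1), 44, 100), ((0, 1, 1, 0), 36, 105),
    ((0, 1, 0, 1), 26, 61), ((0, 0, 1, 1), 12, 91),
    ((1, 1, 1, 0), 81, 182), ((1, 1, 0, 1), 71, 137),
    ((1, 0, 1, 1), 55, 169), ((0, 1, 1, 1), 39, 127),
    ((1, 1, 1, 1), 84, 204), ((1, -1, 0, 0), 10, 22),
    ((1, 0, -1, 0), 3, 46),
]


def classify(y, hull=HULL):
    """Return "O" as soon as one projection yoD leaves [L, H], else "O|S"."""
    for d, low, high in hull:
        projection = sum(yi * di for yi, di in zip(y, d))
        if projection < low or projection > high:
            return "O"
    return "O|S"


label_avg_s = classify((50, 34, 15, 2))    # the AvgS point from the IRIS150 slide
label_avg_i = classify((66, 30, 55, 20))   # the AvgI point from the IRIS150 slide
```

The AvgS point passes all 17 cuts, while the AvgI point is already rejected by the first one (66 lies above 58 on D=1000).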
The amount of work yet to be done, even for only 4 attributes, is immense. For each D, we should fit boundaries for each class, not just one class.
For each D, not only cut at minCoD and maxCoD but also limit the radial reach for each class (barrel analytics)? Note that limiting the radial reach limits all other directions (other than the D direction) in one step and therefore by the same amount, i.e., it limits all directions assuming perfectly round clusters. Think about Enron: some words (columns) have high counts and others have low counts. Our radial reach threshold would be based on the highest count and would therefore admit many false positives. Can we cluster directions (words) by count and limit radial reach differently for different clusters?
For 4 attributes, I count 77 diagonals * 3 classes = 231 cases. How many in the Enron email case with 10,000 columns? Too many for sure!
Dot Product SPTS computation: XoD = Σ(k=1..n) Xk*Dk

/* Calc P_{XoD,i} after P_{XoD,i-1}. CarrySet=CAR_{i-1,i}; RawSet=RS_i */
INPUT: CAR_{i-1,i}, RS_i
ROUTINE: P_{XoD,i} = RS_i ⊕ CAR_{i-1,i};   CAR_{i,i+1} = RS_i & CAR_{i-1,i}
OUTPUT: P_{XoD,i}, CAR_{i,i+1}

Example: D = (3, 3), i.e., D1,1 D1,0 = 1 1 and D2,1 D2,0 = 1 1.
X1 X2 | p1,1 p1,0 p2,1 p2,0 | XoD | p_{XoD,3} p_{XoD,2} p_{XoD,1} p_{XoD,0}
1  1  | 0    1    0    1    | 6   | 0         1         1         0
3  0  | 1    1    0    0    | 9   | 1         0         0         1
2  1  | 1    0    0    1    | 9   | 1         0         0         1

XoD = 2^2 (D1,1 p1,1 + D2,1 p2,1) + 2^1 (D1,1 p1,0 + D1,0 p1,1 + D2,1 p2,0 + D2,0 p2,1) + 2^0 (D1,0 p1,0 + D2,0 p2,0)
    = 2^2 (1 p1,1 + 1 p2,1) + 2^1 (1 p1,0 + 1 p1,1 + 1 p2,0 + 1 p2,1) + 2^0 (1 p1,0 + 1 p2,0)
[Figure: bit-column worksheet showing the raw sets, the carries CAR_{0,1}, CAR_{1,2}, CAR_{2,3}, CAR_{3,4} and the result slices P_{XoD,0} .. P_{XoD,4}]
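One way to carry out this kind of bit-slice addition with only XOR, AND and OR is an ordinary ripple-carry adder applied to whole slices. The sketch below is an assumption-laden illustration, not the routine from the slide: each slice is modeled as a Python integer (bit r is row r), the carry is a single slice per step rather than a carry set, and all function names are made up.

```python
def to_slices(values, width):
    """slices[i] has bit r set iff bit i of values[r] is 1."""
    return [sum(((v >> i) & 1) << r for r, v in enumerate(values))
            for i in range(width)]


def from_slices(slices, n_rows):
    """Inverse of to_slices: rebuild the column of numbers."""
    return [sum(((s >> r) & 1) << i for i, s in enumerate(slices))
            for r in range(n_rows)]


def add_slices(a, b):
    """Ripple-carry addition of two bit-sliced columns."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        out.append(x ^ y ^ carry)                  # result slice
        carry = (x & y) | (carry & (x ^ y))        # carry slice
    out.append(carry)
    return out


def scale_slices(slices, d):
    """Multiply a bit-sliced column by a non-negative integer d (shift and add)."""
    total, shift = [0], 0
    while d:
        if d & 1:
            total = add_slices(total, [0] * shift + slices)
        d >>= 1
        shift += 1
    return total


def dot_slices(columns, d_vec):
    """XoD as a bit-sliced column: sum over k of Xk * Dk."""
    total = [0]
    for column, d in zip(columns, d_vec):
        total = add_slices(total, scale_slices(column, d))
    return total


columns = [to_slices([1, 3, 2], 2), to_slices([1, 0, 1], 2)]
x_dot_d = from_slices(dot_slices(columns, (3, 3)), 3)
```

With X1=(1,3,2), X2=(1,0,1) and D=(3,3) this reproduces the XoD column (6, 9, 9) from the example.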
Different data: D = (3, 3), i.e., D1,1 D1,0 = 1 1 and D2,1 D2,0 = 1 1.
X1 X2 | p1,1 p1,0 p2,1 p2,0 | XoD | p_{XoD,4} p_{XoD,3} p_{XoD,2} p_{XoD,1} p_{XoD,0}
1  1  | 0    1    0    1    | 6   | 0         0         1         1         0
3  3  | 1    1    1    1    | 18  | 1         0         0         1         0
2  1  | 1    0    0    1    | 9   | 0         1         0         0         1

XoD = 2^2 (1 p1,1 + 1 p2,1) + 2^1 (1 p1,0 + 1 p1,1 + 1 p2,0 + 1 p2,1) + 2^0 (1 p1,0 + 1 p2,0)
We have extended the Galois field, GF(2)={0,1}, XOR=add, AND=mult, to pTrees.
[Figure: bit-column worksheet showing the raw sets, the carries CAR_{0,1}, CAR_{1,2}, CAR_{2,3} and the result slices P_{XoD,0} .. P_{XoD,2}]
SPTS multiplication (note: pTree multiplication = &):
X1*X2 = (2^1 p1,1 + 2^0 p1,0)(2^1 p2,1 + 2^0 p2,0) = 2^2 p1,1 p2,1 + 2^1 (p1,1 p2,0 + p2,1 p1,0) + 2^0 p1,0 p2,0
X1 X2 | p1,1 p1,0 p2,1 p2,0 | X1*X2 | p_{X1*X2,3} p_{X1*X2,2} p_{X1*X2,1} p_{X1*X2,0}
1  1  | 0    1    0    1    | 1     | 0           0           0           1
3  3  | 1    1    1    1    | 9     | 1           0           0           1
2  1  | 1    0    0    1    | 2     | 0           0           1           0
[Figure: bit-column worksheet of the ANDed slice pairs and carries for X1*X2]
Rank_{N-1}(XoD) = Rank_2(XoD) examples (N=3).
RankK: p is what's left of K yet to be counted, initially p=K. KVal is the RankK value, initially 0.
For i = bitwidth-1 down to 0:
  if Count(P&P_i) ≥ p { KVal = KVal + 2^i; P = P&P_i }
  else /* < p */ { p = p - Count(P&P_i); P = P&P'_i }

D=x1=(1,1): XoD = (2, 3, 3)
  n=1, p=2: Count(P&p1)=3 ≥ 2; P=P&p1; value so far 1*2^1
  n=0, p=2: Count(P&p0)=2 ≥ 2; P=p0&P; value 1*2^1 + 1*2^0 = 3, so -2x1oX = -6
D=x2=(3,0): XoD = (3, 9, 2)
  n=3, p=2: Count(P&p3)=1 < 2; p=2-1=1; P=P&p'3; value so far 0*2^3
  n=2, p=1: Count(P&p2)=0 < 1; p=1-0=1; P=p'2&P; value so far 0*2^3 + 0*2^2
  n=1, p=1: Count(P&p1)=2 ≥ 1; P=p1&P; value so far 0*2^3 + 0*2^2 + 1*2^1
  n=0, p=1: Count(P&p0)=1 ≥ 1; P=p0&P; value 0*2^3 + 0*2^2 + 1*2^1 + 1*2^0 = 3, so -2x2oX = -6
D=x3=(2,1): XoD = (3, 6, 5)
  n=2, p=2: Count(P&p2)=2 ≥ 2; P=P&p2; value so far 1*2^2
  n=1, p=2: Count(P&p1)=1 < 2; p=2-1=1; P=p'1&P; value so far 1*2^2 + 0*2^1
  n=0, p=1: Count(P&p0)=1 ≥ 1; P=p0&P; value 1*2^2 + 0*2^1 + 1*2^0 = 5, so -2x3oX = -10
Example: FAUST Oblique: XoD is used in CCC, TKO, PLC and LARC, and (x-X)o(x-X) = -2Xox + xox + XoX is used in TKO.
So in FAUST, we need to construct lots of SPTSs of the type X dotted with a fixed vector, a costly pTree calculation. (Note that XoX is costly too, but it is a one-time calculation (a pre-calculation?). xox is calculated for each individual x, but it's a scalar calculation and just a read-off of a row of XoX, once XoX is calculated.) Thus, we should optimize the living he__ out of the XoD calculation! The methods on the previous slide seem efficient. Is there a better method? Then for TKO we need to compute ranks:
pTree Rank(K) computation: (Rank(N-1) gives the 2nd smallest, which is very useful in outlier analysis?)
X    P4,3 P4,2 P4,1 P4,0
10   1    0    1    0
5    0    1    0    1
6    0    1    1    0
7    0    1    1    1
11   1    0    1    1
9    1    0    0    1
3    0    0    1    1

K=7-1=6 (looking for the Rank6, or 6th highest value, which is also the 2nd lowest value). Cross out the 0-positions of P at each step.
(n=3) c=Count(P&P4,3)=3 < 6; p=6-3=3; P=P&P'4,3 masks off the highest 3 (val 8); result bit {0}
(n=2) c=Count(P&P4,2)=3 >= 3; P=P&P4,2 masks off the lowest 1 (val 4); result bit {1}
(n=1) c=Count(P&P4,1)=2 < 3; p=3-2=1; P=P&P'4,1 masks off the highest 2 (val 8-2=6); result bit {0}
(n=0) c=Count(P&P4,0)=1 >= 1; P=P&P4,0; result bit {1}

RankKval=0; p=K; c=0; P=Pure1;  /* Note: n=bitwidth-1. The RankK points are returned as the resulting pTree, P */
For i=n to 0 {
  c=Count(P&Pi);
  If (c>=p) { RankKval=RankKval+2^i; P=P&Pi }
  else { p=p-c; P=P&P'i }
};
return RankKval, P;

RankKval = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5; P = MapRankKPts = 0100000; ListRankKPts = {2}
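The RankK loop translates almost line for line into Python if each bit slice is modeled as an integer bit mask over the rows. This is only a sketch under that assumption; `build_slices` and `rank_k` are illustrative names, and rows are 0-indexed here (so the slide's point {2} is row 1).

```python
def build_slices(values, width):
    """slices[i] has bit r set iff bit i of values[r] is 1."""
    return [sum(((v >> i) & 1) << r for r, v in enumerate(values))
            for i in range(width)]


def rank_k(slices, n_rows, k):
    """Return (k-th highest value, mask of the rows holding that value)."""
    P = (1 << n_rows) - 1          # Pure1: every row is still a candidate
    p = k                          # what is left of K yet to be counted
    value = 0
    for i in range(len(slices) - 1, -1, -1):
        c = bin(P & slices[i]).count("1")
        if c >= p:
            value += 1 << i        # bit i of the answer is 1
            P &= slices[i]
        else:
            p -= c                 # skip the c rows that are larger
            P &= ~slices[i]
    return value, P


VALUES = [10, 5, 6, 7, 11, 9, 3]
rank6_value, rank6_mask = rank_k(build_slices(VALUES, 4), len(VALUES), 6)
```

On the seven values from the table this returns 5 with only row 1 set in the mask, matching RankKval=5 and ListRankKPts={2}.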
Rank_{N-1}(XoD) = Rank_2(XoD), D = (3, 3):
  n=3, p=2: Count(P&p3)=2 ≥ 2; P=P&p3; value so far 1*2^3
  n=2, p=2: Count(P&p2)=0 < 2; p=2-0=2; P=p'2&P; value so far 1*2^3 + 0*2^2
  n=1, p=2: Count(P&p1)=0 < 2; p=2-0=2; P=p'1&P; value so far 1*2^3 + 0*2^2 + 0*2^1
  n=0, p=2: Count(P&p0)=2 ≥ 2; P=p0&P; value 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0 = 9
UDR: Univariate Distribution Revealer (on Spaeth)
UDR, applied to S, a column of numbers in bitslice format (an SpTS), will produce the DistributionTree of S, DT(S).
depth DT(S) = b ≡ BitWidth(S); h = depth of a node; k = node offset. Node_{h,k} has a ptr to pTree{x∈S | F(x) ∈ [k*2^(b-h+1), (k+1)*2^(b-h+1))} and its 1count.

Y    y1  y2  yofM
y1   1   1   11
y2   3   1   27
y3   2   2   23
y4   3   3   34
y5   6   2   53
y6   9   3   80
y7   15  1   118
y8   14  2   114
y9   15  3   125
ya   13  4   114
yb   10  9   110
yc   11  10  121
yd   9   11  109
ye   11  11  125
yf   7   8   83

Bit slices of yofM (one column per point, y1..yf); the complements p6'..p0' are the bitwise negations:
p6 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
p5 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0
p4 0 1 1 0 1 1 1 1 1 1 0 1 0 1 1
p3 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0
p2 0 0 1 0 1 0 1 0 1 0 1 0 1 1 0
p1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1
p0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1

DistributionTree counts (count/interval width, interval):
depth h=0: 15
depth h=1: 5/64 [0,64); 10/64 [64,128)
depth h=2: 3/32 [0,32); 2/32 [32,64); 2/32 [64,96); 8/32 [96,128) (node_{2,3})
depth h=3: 1/16 [0,16); 2/16 [16,32); 1 [32,48); 1 [48,64); 0 [64,80); 2 [80,96); 2 [96,112); 6 [112,128)
depth h=4: 0 [0,8); 1 [8,16); 1 [16,24); 1 [24,32); 1 [32,40); 0 [40,48); 1 [48,56); 0 [56,64); 0 [64,72); 0 [72,80); 2 [80,88); 0 [88,96); 0 [96,104); 2 [104,112); 3 [112,120); 3 [120,128)

Pre-compute and enter into the ToC all DT(Yk), plus those for selected Linear Functionals (e.g., d=main diagonals, ModeVector).
Suggestion: In our pTree-base, every pTree (basic, mask, ...) should be referenced in ToC(pTree, pTreeLocationPointer, pTreeOneCount), and these OneCts should be repeated everywhere (e.g., in every DT). The reason is that these OneCts help us in selecting the pertinent pTrees to access, and in fact are often all we need to know about the pTree to get the answers we are after.
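The distribution tree counts can be reproduced by repeatedly ANDing each node's mask with the next bit slice and with its complement. The following is a sketch only: slices are modeled as Python integers, `distribution_tree` and its `max_depth` parameter are illustrative names, and the yofM values are those listed on the slide.

```python
def distribution_tree(values, bit_width, max_depth):
    """Return the 1-counts per depth; depth h splits on bit (bit_width - h)."""
    n = len(values)
    slices = [sum(((v >> i) & 1) << r for r, v in enumerate(values))
              for i in range(bit_width)]
    levels = [[(1 << n) - 1]]                 # depth 0: all rows
    for h in range(1, max_depth + 1):
        s = slices[bit_width - h]
        next_level = []
        for node_mask in levels[-1]:
            next_level.append(node_mask & ~s)  # lower half of the interval
            next_level.append(node_mask & s)   # upper half of the interval
        levels.append(next_level)
    return [[bin(m).count("1") for m in level] for level in levels]


YOFM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]
counts = distribution_tree(YOFM, bit_width=7, max_depth=3)
```

Each count is obtained from AND and Count only, so on pTrees the same walk never needs to touch the horizontal data; `counts` here holds the depth 0 through depth 3 rows of the tree.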
So let us look at ways of doing the work to calculate As we recall from the below, So let us look at ways of doing the work to calculate As we recall from the below, the task is to ADD bitslices giving a result bitslice and a set of carry bitslices to carry forwardthe task is to ADD bitslices giving a result bitslice and a set of carry bitslices to carry forward
XXooD = D = k=1..nk=1..nXXkk*D*Dkk
3 3 3 3 DD DD1,11,1 DD1,01,0
1 11 1DD2,12,1 DD2,02,01 11 1
((= 2= 222 + 2+ 211 1 p1 p1,01,0 + 1 p+ 1 p1111 + 2+ 200 1 p1 p1,01,01 p1 p1,11,1 (( ((+ 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,02,0 + 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,0 2,0 ))001111
111100
001111
111100
110011
110011
113322
110011
XX pTreespTrees001111
111100
110011
000000
((= 2= 222 + 2+ 211 1 p1 p1,01,0 + 1 p+ 1 p1111 + 2+ 200 1 p1 p1,01,01 p1 p1,11,1 (( ((+ 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,02,0 + 1 p+ 1 p2,1 2,1 )) + 1 p+ 1 p2,0 2,0 ))001111
111100
001111
000000
110011
111100
110011
I believe we add by successive XORs, and the carry set is the raw set with one 1-bit turned off iff the sum at that bit is a 1-bit. Or we can characterize the carry as the raw set minus the result (always carry forward a set of pTrees plus one negative one). We want a routine that constructs the result pTree from a positive set of pTrees plus a negative set always consisting of 1 pTree. The routine is: successive XORs across the positive set, then XOR with the negative set pTree (because the successive pset XOR gives us the odd values, and if you subtract one pTree, the 1-bits of it change odd to even and vice versa):
/* For P_{XoD,i} (after P_{XoD,i-1}). CarrySetPos = CSP_{i-1,i}, CarrySetNeg = CSN_{i-1,i}, RawSet = RS_i, CSP_{-1} = CSN_{-1} = ∅ */
INPUT: CSP_{i-1}, CSN_{i-1}, RS_i
ROUTINE: P_{XoD,i} = RS_i ⊕ CSP_{i-1,i} ⊕ CSN_{i-1,i}; CSN_{i,i+1} = CSN_{i-1,i} P_{XoD,i}; CSP_{i,i+1} = CSP_{i-1,i} RS_{i-1};
OUTPUT: P_{XoD,i}, CSN_{i,i+1}, CSP_{i,i+1}
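The addition of bitslices can be sketched as follows. This is a minimal illustration, not the CSP/CSN routine above: it assumes the carry is kept as a per-row integer count (a vTree in the slides' terms) instead of a positive and a negative set of pTrees, and the names `bit_slices`, `dot_product_slices` and `slices_to_values` are hypothetical. The result bit at each position is the parity of the 1-counts (what successive XORs produce), and the carry is (count - result) / 2.

```python
def bit_slices(column, nbits):
    """Vertical bit slices of a column: slices[b][r] is bit b of column[r]."""
    return [[(v >> b) & 1 for v in column] for b in range(nbits)]


def dot_product_slices(X_cols, D, nbits):
    """Bit slices q_0, q_1, ... of X o D, built only by adding pTrees."""
    nrows = len(X_cols[0])
    # raw[i] collects every pTree p_{k,b} with D_{k,j} = 1 and b + j = i,
    # i.e. every pTree that contributes at weight 2^i.
    raw = [[] for _ in range(2 * nbits - 1)]
    for col, dk in zip(X_cols, D):
        p = bit_slices(col, nbits)
        for j in range(nbits):
            if (dk >> j) & 1:
                for b in range(nbits):
                    raw[b + j].append(p[b])
    result, carry = [], [0] * nrows
    i = 0
    while i < len(raw) or any(carry):
        level = raw[i] if i < len(raw) else []
        counts = [carry[r] + sum(s[r] for s in level) for r in range(nrows)]
        q = [c & 1 for c in counts]                        # parity of the counts
        carry = [(c - b) // 2 for c, b in zip(counts, q)]  # carried to 2^(i+1)
        result.append(q)
        i += 1
    return result


def slices_to_values(slices):
    """Reassemble the integer column from its bit slices (for checking only)."""
    nrows = len(slices[0])
    return [sum(s[r] << b for b, s in enumerate(slices)) for r in range(nrows)]


X_cols = [[1, 3, 2], [1, 0, 1]]   # columns X_1, X_2 of the slide example
q = dot_product_slices(X_cols, (3, 3), nbits=2)
print(slices_to_values(q))
```

With the slide's X and D = (3, 3), the lowest result slice is 011 and the first carry is 100, which agrees with the q_0 and carry_0 values shown for that example.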
[Figure: worked example for X∘D = (6, 9, 9). Bit columns are shown for RS_0, RS_1, P_{XoD,0} and P_{XoD,1}, starting from CSP_{-1,0} = CSN_{-1,0} = ∅, with CSN_{0,1} = CSN_{-1,0} P_{XoD,0} and CSP_{0,1} = CSP_{-1,0} RS_0.]
$$
\begin{aligned}
X \circ D = \sum_{k=1}^{n} X_k D_k
={}& 2^{2B} \sum_{k=1}^{n} D_{k,B}\,p_{k,B} \\
&+ 2^{2B-1} \sum_{k=1}^{n} \big( D_{k,B}\,p_{k,B-1} + D_{k,B-1}\,p_{k,B} \big) \\
&+ 2^{2B-2} \sum_{k=1}^{n} \big( D_{k,B}\,p_{k,B-2} + D_{k,B-1}\,p_{k,B-1} + D_{k,B-2}\,p_{k,B} \big) \\
&+ 2^{2B-3} \sum_{k=1}^{n} \big( D_{k,B}\,p_{k,B-3} + D_{k,B-1}\,p_{k,B-2} + D_{k,B-2}\,p_{k,B-1} + D_{k,B-3}\,p_{k,B} \big) \\
&+ \dots \\
&+ 2^{3} \sum_{k=1}^{n} \big( D_{k,3}\,p_{k,0} + D_{k,2}\,p_{k,1} + D_{k,1}\,p_{k,2} + D_{k,0}\,p_{k,3} \big) \\
&+ 2^{2} \sum_{k=1}^{n} \big( D_{k,2}\,p_{k,0} + D_{k,1}\,p_{k,1} + D_{k,0}\,p_{k,2} \big) \\
&+ 2^{1} \sum_{k=1}^{n} \big( D_{k,1}\,p_{k,0} + D_{k,0}\,p_{k,1} \big) \\
&+ 2^{0} \sum_{k=1}^{n} D_{k,0}\,p_{k,0}
\end{aligned}
$$
X∘D = Σ_{k=1,2} X_k*D_k with pTrees q_N..q_0, N = 2B + roof(log_2 n) + 1. Here n=2, B=1:
$$X \circ D = \sum_{k=1}^{2} \Big[ 2^2\,D_{k,1}\,p_{k,1} + 2^1\,\big(D_{k,1}\,p_{k,0} + D_{k,0}\,p_{k,1}\big) + 2^0\,D_{k,0}\,p_{k,0} \Big]$$
X has columns X_1 = (1, 3, 2) and X_2 = (1, 0, 1), with pTrees p_{1,1} = 011, p_{1,0} = 110, p_{2,1} = 000, p_{2,0} = 101.

Example 1: D = (1, 2), i.e. D_{1,1} D_{1,0} = 0 1 and D_{2,1} D_{2,0} = 1 0.
q_0 = p_{1,0} = 110, no carry.
q_1 = p_{1,1} + p_{2,0}: result 110, carry_1 = 001.
q_2 = carry_1 = 001, no carry.

Example 2: D = (3, 3), i.e. D_{1,1} D_{1,0} = 1 1 and D_{2,1} D_{2,0} = 1 1.
q_0 = p_{1,0} + p_{2,0}: result 011, carry_0 = 100.
q_1 = carry_0 + raw_1, producing carry_1.
q_2 = carry_1 + raw_2, producing carry_2.
q_3 = carry_2, producing carry_3.

A carryTree is a valueTree or vTree, as is the rawTree at each level (rawTree = valueTree before the carry is included). In what form is it best to carry the carryTree over (for speediest processing)?
1. Multiple pTrees added at the next level? (since the pTrees at the next level are in that form and need to be added)
2. carryTree as a SPTS, s_1? (next level rawTree = SPTS, s_2, then s_{1,0} & s_{2,0} = q_next_level and carry_next_level?)
CCC Clusterer: If DT (and/or DUT) not exceeded at C, partition C further by cutting at each gap and PCC in C∘D.
For a table X(X_1...X_n), the SPTS X_k*D_k is the column of numbers x_k*D_k. X∘D is the sum of those SPTSs, Σ_{k=1..n} X_k*D_k.
$$
\begin{aligned}
X_k D_k &= D_k \sum_{b} 2^{b} p_{k,b} = 2^{B} D_k\,p_{k,B} + \dots + 2^{0} D_k\,p_{k,0} \\
&= D_k \big( 2^{B} p_{k,B} + \dots + 2^{0} p_{k,0} \big) = \big( 2^{B} p_{k,B} + \dots + 2^{0} p_{k,0} \big)\big( 2^{B} D_{k,B} + \dots + 2^{0} D_{k,0} \big) \\
&= 2^{2B} \big( D_{k,B}\,p_{k,B} \big) + 2^{2B-1} \big( D_{k,B-1}\,p_{k,B} + D_{k,B}\,p_{k,B-1} \big) + \dots + 2^{0} D_{k,0}\,p_{k,0}
\end{aligned}
$$
So, DotProduct involves just multi-operand pTree addition (no SPTSs and no multiplications). Engineering shortcut tricks would be huge!!!
Question: Which primitives are needed and how do we compute them?
X(X_1...X_n) D2NN yields a 1.a-type outlier detector (top k objects, x, dissimilarity from X-{x}). D2NN = each min[D2NN(x)].
With a_{k,b} = x_{k,b} - p_{k,b}:
$$
\begin{aligned}
(x-X)\circ(x-X) &= \sum_{k=1}^{n} (x_k - X_k)(x_k - X_k)
= \sum_{k=1}^{n} \Big( \sum_{b=B}^{0} 2^{b} (x_{k,b} - p_{k,b}) \Big)\Big( \sum_{b=B}^{0} 2^{b} (x_{k,b} - p_{k,b}) \Big) \\
&= \sum_{k} \big( 2^{B} a_{k,B} + 2^{B-1} a_{k,B-1} + \dots + 2^{1} a_{k,1} + 2^{0} a_{k,0} \big)\big( 2^{B} a_{k,B} + 2^{B-1} a_{k,B-1} + \dots + 2^{1} a_{k,1} + 2^{0} a_{k,0} \big) \\
&= \sum_{k} \Big( 2^{2B} a_{k,B} a_{k,B} \\
&\quad + 2^{2B-1} \big( a_{k,B} a_{k,B-1} + a_{k,B-1} a_{k,B} \big) && \{\, 2^{2B} a_{k,B} a_{k,B-1} \,\} \\
&\quad + 2^{2B-2} \big( a_{k,B} a_{k,B-2} + a_{k,B-1} a_{k,B-1} + a_{k,B-2} a_{k,B} \big) && \{\, 2^{2B-1} a_{k,B} a_{k,B-2} + 2^{2B-2} a_{k,B-1}^{2} \,\} \\
&\quad + 2^{2B-3} \big( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} + a_{k,B-2} a_{k,B-1} + a_{k,B-3} a_{k,B} \big) && \{\, 2^{2B-2} ( a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} ) \,\} \\
&\quad + 2^{2B-4} \big( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} + a_{k,B-2} a_{k,B-2} + a_{k,B-3} a_{k,B-1} + a_{k,B-4} a_{k,B} \big) + \dots && \{\, 2^{2B-3} ( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} ) + 2^{2B-4} a_{k,B-2}^{2} \,\} \Big) \\
&= \sum_{k} \Big( 2^{2B} \big( a_{k,B}^{2} + a_{k,B} a_{k,B-1} \big) + 2^{2B-1} \big( a_{k,B} a_{k,B-2} \big) + 2^{2B-2} \big( a_{k,B-1}^{2} + a_{k,B} a_{k,B-3} + a_{k,B-1} a_{k,B-2} \big) \\
&\qquad + 2^{2B-3} \big( a_{k,B} a_{k,B-4} + a_{k,B-1} a_{k,B-3} \big) + 2^{2B-4} a_{k,B-2}^{2} + \dots \Big)
\end{aligned}
$$
D2NN = multi-op pTree adds? When x_{k,b}=1, a_{k,b} = p'_{k,b}, and when x_{k,b}=0, a_{k,b} = -p_{k,b}.
So D2NN is just multi-op pTree mults/adds/subtrs?
Each D2NN row (each x∈X) is a separate calc.
Should we pre-compute all p_{k,i}*p_{k,j}, p'_{k,i}*p'_{k,j}, p_{k,i}*p'_{k,j}?
ANOTHER TRY! X(X_1...X_n) RKN (Rank K Nbr), K=|X|-1, yields a 1.a outlier detector (top y dissimilarity from X-{x}).
Install in RKN each RankK(D2NN(x)) (1-time construct, but for, e.g., 1 trillion x's? |X|=N=1T, slow. Parallelization?)
∀x∈X, the square distance from x to its neighbors (near and far) is the column of numbers (vTree or SPTS)
$$
\begin{aligned}
d^2(x,X) = (x-X)\circ(x-X) &= \sum_{k=1}^{n} |x_k - X_k|^2 = \sum_{k=1}^{n} (x_k - X_k)(x_k - X_k) = \sum_{k=1}^{n} \big( x_k^2 - 2 x_k X_k + X_k^2 \big) \\
&= -2 \sum_k x_k X_k + \sum_k x_k^2 + \sum_k X_k^2 \\
&= -2\, x \circ X + x \circ x + X \circ X
\end{aligned}
$$
$$X \circ X = \sum_{k=1}^{n} \sum_{i=B..0,\ j=B..0} 2^{i+j}\, p_{k,i}\, p_{k,j} = \sum_{i,j} 2^{i+j} \sum_{k} p_{k,i}\, p_{k,j}$$
1. precompute pTree products within each k1. precompute pTree products within each k
2. Calculate this sum one time (independent of the x)2. Calculate this sum one time (independent of the x)
3. Pick this from XoX for each x and add to 2.3. Pick this from XoX for each x and add to 2.
5. Add 3 to this5. Add 3 to this
The -2x∘X cost is linear in |X|=N. The x∘x cost is ~zero. X∘X is 1-time, amortized over x∈X (i.e., =1/N), or precomputed.
The addition cost, -2x∘X + x∘x + X∘X, is linear in |X|=N. So, overall, the cost is linear in |X|=N.
Data parallelization? No! (Need all of X at each site.) Code parallelization? Yes! (After replicating X to all sites, each site creates/saves D2NN for its partition of X, then sends requested number(s), e.g., RKN(x), back.)
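The decomposition d²(x,X) = -2x∘X + x∘x + X∘X can be checked numerically. This is a small sketch using NumPy arrays in place of pTrees or SPTSs; the function name `sq_dists_to_all` and the sample data are hypothetical. X∘X is computed once and reused for every x, which is the amortization described above.

```python
import numpy as np


def sq_dists_to_all(X, x, XoX=None):
    """Column of squared distances from x to every row of X via
    -2 x.X + x.x + X.X, reusing a precomputed X.X column when given."""
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    if XoX is None:
        XoX = np.einsum("ij,ij->i", X, X)   # row-wise X o X, a one-time cost
    return -2.0 * (X @ x) + x @ x + XoX


X = np.array([[1, 2], [4, 6], [0, 0], [3, 3]])
XoX = np.einsum("ij,ij->i", X.astype(float), X.astype(float))  # precompute once
for x in X:
    direct = ((X - x) ** 2).sum(axis=1)     # the definition, for comparison
    assert np.allclose(sq_dists_to_all(X, x, XoX), direct)
```

Only the `X @ x` term depends on x in a way that touches all of X, which matches the claim that the per-x cost is linear in N.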
LSR on IRIS150-3. Here we use the diagonals.
d=e1 p=AVGs, L=(X-p)od43 58 S 49 70 E 49 79 I
R(p,d,X)
S E I0
128 270 393
1558
3444
[43,49)S(16)
0
128
[49,58)E(24)I(6)
0 S(34)99
393
10961217
1825
[70,79]I(12)
2081
3444
[58,70)E(26)I(32)
270
792
1558
2567
Only overlap: L=[58,70), R[792,1557] (E(26), I(5)). With just d=e1, we get good hulls using LARC:
While I_{p,d} contains >1 class, for the next (d,p) create L(p,d) ≡ X∘d - p∘d and R(p,d) ≡ X∘X + p∘p - 2X∘p - L².
1. MnCls(L), MxCls(L): create a linear boundary.
2. MnCls(R), MxCls(R): create a radial boundary.
3. Use R&C_k to create intra-C_k radial boundaries. H_k = {I | L_{p,d} includes C_k}
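The L and R functionals and the per-class limits used for the boundaries can be sketched as below. This is an illustration with NumPy arrays standing in for SPTSs; it assumes d is a unit vector (so that R is the squared perpendicular distance from the d-line through p), and the names `linear_and_radial` and `class_limits` are hypothetical.

```python
import numpy as np


def linear_and_radial(X, p, d):
    """L = (X - p) o d and R = (X - p) o (X - p) - L^2 for a unit vector d."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d)               # assumption: normalize d
    diff = X - p
    L = diff @ d
    S = np.einsum("ij,ij->i", diff, diff)   # X o X + p o p - 2 X o p
    return L, S - L ** 2


def class_limits(values, labels):
    """Per-class (min, max) of a functional: the MnCls / MxCls pair."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    return {str(c): (float(values[labels == c].min()),
                     float(values[labels == c].max()))
            for c in np.unique(labels)}


X = [[3, 4], [1, 0], [0, 2]]
labels = ["a", "a", "b"]
L, R = linear_and_radial(X, p=[0, 0], d=[1, 0])
print(class_limits(L, labels), class_limits(R, labels))
```

The linear boundaries are the class limits of L, and the radial boundaries are the class limits of R within each L-interval.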
R & L
I(1)
I(42)
E(50) I(7)
49 49 (36,7) 63
70 (11)
d=e1 p=AS L=(X-p)od (-pod=-50.06): -7.06 7.94 S&L; -1.06 19.94 E&L; -1.06 28.94 I&L
-8,-216
[-2,8)34, 24, 6099 393 1096 1217 1825
[20,29]12
[8,20) wp=AvgE26, 321.9 51.878.6 633
<--E=6 I=4 p=AvgE
d=e1 p=AvgS, L=Xod43 58 S&L 49 70 E&L 49 79 I&L
Here we try using other p points for the R step (other than the one used for the L step).
d=e1 p=AS L=(X-p)od (-pod=-50.06): -7.06 7.94 S&L; -1.06 19.94 E&L; -1.06 28.94 I&L
-8,-216
[-2,8)34, 24, 6099 393 1096 1217 1825
[20,29]12
[8,20)26, 32 2707921558 2567
E=26 I=5p=AvgS
30ambigs, 5 errs
d=e1 p=AS L=(X-p)od (-pod=-50.06): -7.06 7.94 S&L; -1.06 19.94 E&L; -1.06 28.94 I&L
-8,-216
[-2,8)34, 24, 6099 393 1096 1217 1825
[20,29]12
[8,20) wp=AvgI26, 32 0.6234.9387.8 1369
<--E=25 I=10 p=AvgI
There is a best choice of p for the R step (p=AvgE) but how would we decide that ahead of time?
d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L
-2,4)50
[7,11) 28
[16,23]I=34
[11,16) 22, 16 127.5648.71554.7 2892
E=22I=7p=AvgS
d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L
-2,4)50
[7,11) 28
[16,23]I=34
[11,16) 22, 165.7 36.2151.06 611
E=17I=7p=AvgE
d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L
-2,4)50
[7,11) 28
[16,23]I=34[11,16)
22, 16 127.51555 2892
E=22I=8p=AvgI
For e4, the best choice of p for the R step is also p=AvgE. (There are mistakes in this column on the previous slide!)
LSR on IRIS150
Dse = (9, -6, 27, 10); x∘Dse limits: -184 123 S; 590 1331 E; 381 2046 I
y isa O if y∘D ∈ (-∞,-184) ∪ (123,381) ∪ (2046,∞)
y isa O or S(50) if y∘D ∈ C1,1 ≡ [-184, 123]
y isa O or I(1) if y∘D ∈ C1,2 ≡ [381, 590]
y isa O or E(50) or I(11) if y∘D ∈ C1,3 ≡ [590, 1331]
y isa O or I(38) if y∘D ∈ C1,4 ≡ [1331, 2046]
SRR(AVGs,Dse) on C1,1: 0 154 S
y isa O if y isa C1,1 AND SRR(AVGs,Dse) ∈ (154,∞)
y isa O or S(50) if y isa C1,1 AND SRR(AVGs,Dse) ∈ [0,154]
SRR(AVGs,Dse) on C1,2: only one such I
SRR(AVGs,Dse) on C1,3: 2 137 E; 7 143 I
y isa O if y isa C1,3 AND SRR(AVGs,Dse) ∈ (-∞,2) ∪ (143,∞)
y isa O or E(10) if y isa C1,3 AND SRR ∈ [2,7)
y isa O or E(40) or I(10) if y isa C1,3 AND SRR ∈ [7,137) = C2,1
y isa O or I(1) if y isa C1,3 AND SRR ∈ [137,143]
etc.
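The first round of rules above amounts to an interval lookup on the projection value. A minimal sketch, assuming closed [min, max] intervals per class and a hypothetical helper name `candidate_classes`; the limits are the x∘Dse limits listed above, and "O" stands for "other".

```python
def candidate_classes(value, class_intervals):
    """Classes whose [min, max] projection interval contains value, else ["O"]."""
    hits = [c for c, (lo, hi) in class_intervals.items() if lo <= value <= hi]
    return hits or ["O"]


# x o Dse limits for Setosa, vErsicolor, vIrginica
dse_limits = {"S": (-184, 123), "E": (590, 1331), "I": (381, 2046)}

for y_proj in (0, 200, 400, 700):
    print(y_proj, candidate_classes(y_proj, dse_limits))
```

A value that falls in more than one class interval (for example 700, which lies in both the E and I ranges) is the kind of ambiguity the subsequent SRR rounds are meant to resolve.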
We use the Radial steps to remove false positives from gaps and ends. We are effectively projecting onto a 2-dim range, generated by the D-line and the D⊥-line (which measures the perpendicular radial reach from the D-line). In the D projections, we can attempt to cluster directions into "similar" clusters in some way and limit the domain of our projections to one of these clusters at a time, accommodating "oval" shaped or elongated clusters and giving a better hull fit. E.g., in the Enron email case the dimensions would be words that have about the same count, reducing false positives.
Dei = (1, .7, -7, -4); x∘Dei on C2,1: 1.4 19 E; -2 3 I
y isa O if y∘D ∈ (-∞,-2) ∪ (19,∞)
y isa O or I(8) if y∘D ∈ [-2, 1.4]
y isa O or E(40) or I(2) if y∘D ∈ C3,1 ≡ [1.4, 19]
SRR(AVGe,Dei) on C3,1: 2 370 E; 8 106 I
y isa O if y isa C3,1 AND SRR(AVGs,Dei) ∈ [0,2) ∪ (370,∞)
y isa O or E(4) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [2,8)
y isa O or E(27) or I(2) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [8,106)
y isa O or E(9) if y isa C3,1 AND SRR(AVGs,Dei) ∈ [106,370]
LSR on IRIS150-2. We use the diagonals. Also we set a MinGapThres=2, which will mean we stay 2 units away from any cut.
d=e1=1000; the x∘d limits: 43 58 S; 49 70 E; 49 79 I
y isa O if y∘D ∈ (-∞,43) ∪ (79,∞)
y isa O or S(9) if y∘D ∈ [43,47]
y isa O or S(41) or E(26) or I(7) if y∘D ∈ (47,60) (y∈C1,2)
y isa O or E(24) or I(32) if y∘D ∈ [60,72] (y∈C1,3)
y isa O if y∘D ∈ [43,47] & SRR ∈ (-∞,52) ∪ (60,∞)
y isa O or I(11) if y∘D ∈ (72,79]
y isa O if y∘D ∈ [72,79] & SRR ∈ (-∞,49) ∪ (78,∞)
d=e3=0010 on C2,2; x∘d lims: 30 33 S; 28 32 E; 28 30 I
y isa O if y∘D ∈ (-∞,28) ∪ (33,∞)
y isa O or S(13) or E(10) or I(3) if y∘D ∈ [28,33]
d=e4=0001; x∘d lims: 12 18 E; 18 24 I
y isa O or S(13) if y∘D ∈ [1,5]
y isa O if y∘D ∈ (-∞,1) ∪ (5,12) ∪ (24,∞)
y isa O or E(9) if y∘D ∈ [12,16)
y isa O or E(1) or I(3) if y∘D ∈ [16,24)
y isa O if y∘D ∈ [12,16) & SRR ∈ [0,208) ∪ (558,∞)
y isa O if y∘D ∈ [16,24) & SRR ∈ [0,1198) ∪ (1199,1254) ∪ (1424,∞)
y isa O or E(1) if y∘D ∈ [16,24) & SRR ∈ [1198,1199]
y isa O or I(3) if y∘D ∈ [16,24) & SRR ∈ [1254,1424]
y isa O or E(3) if y∘D ∈ [18,23)
y isa O if y∘D ∈ (-∞,18) ∪ (46,∞)
y isa O or E(13) or I(4) if y∘D ∈ [23,28) (y∈C2,1)
y isa O or S(13) or E(10) or I(3) if y∘D ∈ [28,34) (y∈C2,2)
y isa O or S(28) if y∘D ∈ [34,46]
y isa O if y∘D ∈ [18,23) & SRR ∈ [0,21)
y isa O if y∘D ∈ [34,46] & SRR ∈ [0,32] ∪ [46,∞)
d=e2=0100 on C1,2; x∘d lims: 30 44 S; 20 32 E; 25 30 I
d=e2=0100 on C1,3; x∘d lims: 22 34 E; 22 34 I (zero differentiation!)
y isa O or E(17) if y∘D ∈ [60,72] & SRR ∈ [1.2,20]
y isa O or I(25) if y∘D ∈ [60,72] & SRR ∈ [66,799]
y isa O or E(7) or I(7) if y∘D ∈ [60,72] & SRR ∈ [20,66]
y isa O if y∘D ∈ [0,1.2) ∪ (799,∞)
LSR IRIS150.
d=e1 p=AS L=(X-p)od (-pod=-50.06): -7.06 7.94 S&L; -1.06 19.94 E&L; -1.06 28.94 I&L
-8,-216
[-2,8)34, 24, 6099 393 1096 1217 1825
[20,29]12
[8,20)26, 32 2707921558 2567
E=26 I=5
30ambigs, 5 errs
d=e4 p=AvgS, L=(X-p)od-2 4 S&L 7 16 E&L 11 23 I&L-2,4)50
[7,11) 28
[16,23]I=34
[11,16) 22, 1611 1611 16 E=22
I=16
38ambigs 16errs
d=e3 p=AvgE, L=(X-p)od-32 -24 S&L -12 9 E&L -25 27 I&L,-25)48
-25,-122 1 1
[9,27] I=34
[-12,9)49, 152(17) 16158 199
E=32 I=14
d=e4 p=AvgE, L=(X-p)od-13 -7 S&L -3 5 E&L 1 12 I&L-7]50
[-3,1) 21
[5,12] 34
[1,5)22, 16.7 .74.8 4.8 E=22
I=16
d=e2 p=AvgS, L=(X-p)od -11 10 S&L-14 0 E&L -13 4 I&L,-13)1
-13,-110, 2, 1all=-11
[0,4) [4,15 3 6
-11,029,47,46066 310 352 1749 4104
1, 1 46,11
2, 1 9, 3
d=e3 p=AvgS, L=(X-p)od-5 5 S&L 15 37 E&L 4 55 I&L-5,4)47
[4,15)3 1
[37,55]I=34
[15,37)50, 15157 297536 792
E=18 I=12
3, 1
d=e1 p=AE L=(X-p)od (-pod=-59.36)-17 -1 S&L -11 11 E&L -11 20 I&L
-17-1116
[-11,-1)33, 21, 3 0 27 107 1727481150
[11,20]I12
[-1,11)26, 321 5179 633
E=7 I=4
E=5I=3
d=e2 p=AvgE, L=(X-p)od -5 `17 S&L-8 7 E&L -6 11 I&L,-6)1
[-6, -5)0, 2, 1 15 18 58 59
[7,11) [11,15 3 61 err
[-5,7)29,47, 46 3 58 234793 11031417
13, 21
21, 3
d=e1 p=AI L=(X-p)od (-pod=-65.88)-22 -8 S&L -17 4 E&L -17 14 I&L
[-17,-8)33, 21, 3 38 126 132 73016222181
[-8,4)26, 32 034 1368730
E=26 I=11
E=2 I=1
d=e2 p=AvgI, L=(X-p)od -7 `15 S&L-10 4 E&L -8 9 I&L,-6)1
[6,11) [11,15 3 6
[-7, 4)29,46,46 5 36 929 140318932823
[-8, -7)2, 1allsame
E=2 I=1
E=47 I=22
[5, 9]9, 2, 1allsameS=9
E=2 I=1
d=e3 p=AvgI, L=(X-p)od-44 -36 S&L -25 -4 E&L -37 14 I&L,-25)48
-25,-122 1 1
[9,27] I=34
[-25,-4)50, 15 511 318453
E=32 I=14E=46 I=14
d=e4 p=AvgI, L=(X-p)od-19 -14 S&L -10 -3 E&L -6 5 I&L
[5,12] 34
[-6,-3)22, 16same range
E=22 I=16
d=AvgEAvgI p=AvgE, L=(X-p)od -36 -25 S -14 11 E -17 33 I
R(p,d,X)
S E I 0 232 76 357514
[-17,-14)]I(1)
[-14,11) (50, 13)0 2.876 134
[11,33] I(36)
E=47I=12
R(p,d,X)
S E I.3 .9 4.7150 204 213
[12,17.5)]I(1)
d=AvgSAvgI p=AvgS, L=(X-p)od -6 5 S 17.5 42 E 12 65 I
[17.5,42)(50,12)4.7 6 192 205
[11,33]I(37)
E=45 I=12
d=AvgSAvgE p=AvgS, L=(X-p)od -6 4 S 18 42 E 11 64 I
R(p,d,X)
S E I0 2 6 137154 393
[11,18)]I(1)
[18,42) (50,11)2 6.92 133137
[42,64] 38
E=39 I=11
d=e1 p=AvgS, L=Xod43 58 S&L 49 70 E&L 49 79 I&L
Note that each L=(X-p)∘d is just a shift of X∘d by -p∘d (for a given d).
Next, we examine: for a fixed d, the SPTS L_{p,d} is just a shift of L_d ≡ L_{origin,d} by -p∘d, so we get the same intervals to apply R to, independent of p (shifted by -p∘d).
Thus, we calculate once ll_d = min X∘d and hl_d = max X∘d; then for each different p we shift these interval limit numbers by -p∘d, since these numbers are really all we need for our hulls (rather than going through the SPTS calculation of (X-p)∘d anew for each new p).
There is no reason we have to use the same p on each of those intervals either.
So on the next slide, we consider all 3 functionals, L, S and R. E.g., why not apply S first to limit the spherical reach (eliminate FPs)? S is calc'ed anyway.
LSR IRIS150 e2
Ld d=0100 p=origin Setosa 23 44 vErsicolor 20 34vIrginica 22 38
d=0100 p=AS=(50 34 15 2) -11 10 -14 0 -12 4
d=0100 p=AE=(59 28 43 13) -5 16 -8 6 -6 10
d=0100 p=AI=(66 30 55 20) -7 14 -10 4 -8 8
1 2 1 all -7.7
6 29 47 46 5 36 47 22 929 140318922823
15 3
XoX402635013406330639964742347738852977358845143720340128715112542646224031499142794365421135164004382536603928415840603493352543134611498935883672442335883009398639032732313340174422340943053340440737898329737083485323735062277523416674825150422563705784696754427582628658746578540371336274722068586955
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475
736679088178669152505166507057587186606670377884660358865419578169335784421857986057602367034247 5883 9283 7055 9863 8270 897311473 534010463 880210826 8250 7995 8990 6774 7325 8458 84741234611895 6809 9563 672111602 7423 926810132 7256 7346 8457 97041034212181 8500 7579 772911079 8837 8406 514890799162885270559658945286227455822984457306
767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146146148149150
FAUST Oblique, LSR: Linear, Spherical, Radial classifier
For each p, d (pre-compute?):
L_{d,p} ≡ (X-p)∘d = L_d - p∘d; n_{k,L,d,p} ≡ min(C_k & L_{d,p}) = n_{k,L,d} - p∘d; x_{k,L,d,p} ≡ max(C_k & L_{d,p}) = x_{k,L,d} - p∘d
On IRIS150, for each d, precompute: X∘X, L_d = X∘d, n_{k,L,d} ≡ min(C_k & L_d), x_{k,L,d} ≡ max(C_k & L_d)
Shifted limits n_{k,L,d,p}, x_{k,L,d,p} for p=AvgS, p=AvgE, p=AvgI:
d      p                     S min  S max   E min  E max   I min  I max
1000   AS=(50 34 15 2)         -7      8      -1     20      -1     29
1000   AE=(59 28 43 13)       -16     -1     -10     11     -10     20
1000   AI=(66 30 55 20)       -23     -8     -17      4     -17     13
0100   AS=(50 34 15 2)        -11     10     -14      0     -12      4
0100   AE=(59 28 43 13)        -5     16      -8      6      -6     10
0100   AI=(66 30 55 20)        -7     14     -10      4      -8      8
0010   AS=(50 34 15 2)         -5      4      15     36       3     54
0010   AE=(59 28 43 13)       -33    -24     -13      8     -25     26
0010   AI=(66 30 55 20)       -45    -36     -25     -4     -37     14
0001   AS=(50 34 15 2)         -1      4       8     16      12     23
0001   AE=(59 28 43 13)       -12     -7      -3      5       1     12
0001   AI=(66 30 55 20)       -19    -14     -10     -2      -6      5
We have introduced 36 pairs of linear bookends to the class hulls, 1 pair for each of 4 d's, 3 p's and 3 classes. For fixed d and C_k, the pTree mask is the same over the 3 p's. However, we need to differentiate anyway to calculate R correctly.
That is, for each d-line we get the same set of intervals for every p (just shifted by -p∘d). The only reason we need to have them all is to accurately compute R on each min-max interval. In fact, we compute R on all intervals (even those where a single class has been isolated) to eliminate False Positives (if FPs are possible; sometimes they are not, e.g., if we are to classify IRIS samples known to be Setosa, vErsicolor or vIrginica, then there is no "other").
Assuming L_d, n_{k,L,d} and x_{k,L,d} have been pre-computed and stored, the cut-pt pairs (n_{k,L,d,p}; x_{k,L,d,p}) are computed without further pTree processing, by the scalar computations: n_{k,L,d,p} = n_{k,L,d} - p∘d and x_{k,L,d,p} = x_{k,L,d} - p∘d.
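The scalar shift can be illustrated directly. A minimal sketch with a hypothetical helper `shifted_limits`; the input limits are the d=1000, p=origin values for IRIS150 and the p vectors are the class averages AS, AE, AI given in the slides.

```python
def shifted_limits(limits, p, d):
    """Shift precomputed per-class (min, max) of X o d by -p o d.
    No pass over the data (no pTree processing) is needed."""
    shift = sum(pi * di for pi, di in zip(p, d))   # the scalar p o d
    return {c: (lo - shift, hi - shift) for c, (lo, hi) in limits.items()}


e1 = (1, 0, 0, 0)
limits_e1 = {"S": (43, 58), "E": (49, 70), "I": (49, 79)}   # p = origin
for name, p in {"AS": (50, 34, 15, 2),
                "AE": (59, 28, 43, 13),
                "AI": (66, 30, 55, 20)}.items():
    print(name, shifted_limits(limits_e1, p, e1))
```

For d=1000 this gives the same -7 8, -1 20, -1 29 limits as the p=AS row and likewise matches the p=AE and p=AI rows.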
L_d for d=1000, rows 1-75:
51 49 47 46 50 54 46 50 44 49 54 48 48 43 58 57 54 51 57 51 54 51 46 51 48
50 50 52 52 47 48 54 52 55 49 50 55 49 44 51 50 45 44 50 51 48 51 46 53 50
70 64 69 55 65 57 63 49 66 52 50 59 60 61 56 67 56 58 62 56 59 61 63 61 64
L_d for d=1000, rows 76-150:
66 68 67 60 57 55 55 58 60 54 60 67 63 56 55 55 61 58 50 56 57 57 62 51 57
63 58 71 63 65 76 49 73 67 72 65 64 68 57 58 64 65 77 77 60 69 56 77 63 67
72 62 61 64 72 74 79 64 63 61 77 63 64 60 69 67 69 58 68 67 67 63 65 62 59
L_{p,d} = L_d - p∘d, for d=e1, d=e2, d=e3, d=e4. Limits n_{k,L,d}, x_{k,L,d} at p=0000:
d      S min  S max   E min  E max   I min  I max
1000     43     58      49     70      49     79
0100     23     44      20     34      22     38
0010     10     19      30     51      18     69
0001      1      6      10     18      14     25
Form Class Hulls using linear d boundaries through the min and max of L_{k,d,p} = (C_k & (X-p))∘d. On every interval I_{k,p,d} ≡ {[ep_i, ep_{i+1}) | ep_j = min L_{k,p,d} or max L_{k,p,d} for some k,p,d}, add spherical and barrel boundaries with S_{k,p} and R_{k,p,d} similarly (use enough (p,d) pairs so that no 2 class hulls overlap). Points outside all hulls are declared as "other". Σ_{all p,d} dis(y, I_{k,p,d}) = unfitness of y being classed in k. Fitness of y in k is f(y,k) = 1/(1-uf(y,k)).
LSR IRIS150, e1 only
With L_v denoting X∘v:
$$S_p \equiv (X-p)\circ(X-p) = X\circ X + L_{-2p} + p\circ p, \qquad n_{k,S,p} = \min(C_k \,\&\, S_p), \quad x_{k,S,p} = \max(C_k \,\&\, S_p)$$
$$R_{p,d} \equiv S_p - L_{p,d}^2 = L_{-2p + 2(p\circ d)d} + p\circ p - (p\circ d)^2 + X\circ X - L_d^2, \qquad n_{k,R,p,d} = \min(C_k \,\&\, R_{p,d}), \quad x_{k,R,p,d} = \max(C_k \,\&\, R_{p,d})$$
Analyze R: Rn→R1 (and S: Rn→R1?) projections on each interval formed by consecutive L: Rn→R1 cut-pts.
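Expanding S_p - L_{p,d}² term by term, with L_{p,d} = X∘d - p∘d, gives X∘X - 2X∘p + p∘p - (X∘d)² + 2(p∘d)(X∘d) - (p∘d)², so R needs only the stored X∘X and L_d columns plus one extra projection of X. The sketch below checks that numerically; NumPy arrays stand in for SPTSs, and the name `radial_from_precomputed` and the sample data are hypothetical.

```python
import numpy as np


def radial_from_precomputed(X, XoX, Ld, p, d):
    """R_{p,d} from stored X o X and L_d = X o d plus one new projection
    of X onto v = -2p + 2(p o d) d, with scalar corrections."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    d = np.asarray(d, dtype=float)
    pod = p @ d
    v = -2.0 * p + 2.0 * pod * d            # the one additional direction
    return X @ v + p @ p - pod ** 2 + XoX - Ld ** 2


X = np.array([[51, 35, 14, 2], [70, 32, 47, 14], [63, 33, 60, 25]], dtype=float)
d = np.array([1.0, 0.0, 0.0, 0.0])
XoX = np.einsum("ij,ij->i", X, X)           # precomputed once
Ld = X @ d                                  # precomputed once per d
for p in ([50, 34, 15, 2], [59, 28, 43, 13], [66, 30, 55, 20]):
    p = np.asarray(p, dtype=float)
    direct = np.einsum("ij,ij->i", X - p, X - p) - ((X - p) @ d) ** 2
    assert np.allclose(radial_from_precomputed(X, XoX, Ld, p, d), direct)
```

The identity holds for any d; interpreting R as the squared perpendicular distance from the d-line additionally assumes d is a unit vector.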
d=1000 p=AS=(50 34 15 2)-7 8 -1 20 -1 29
d=1000 p=AE=(59 28 43 13)-16 -1 -10 11 -10 20
d=1000 p=AI=(66 30 55 20)-23 -8 -17 4 -17 13
Ld d=1000 p=origin Setosa 43 58vErsicolor 49 70vIrginica 49 79
26 32 270792 26 51558 2568
16 0128
24 634 099 393 1096 1217 1826
12 20813445
26 321 517,479 633
16 7231258
34 24 6 0 279 5 171 186748998
12 249794
16 16412391
34 24 6 24 126 2 1 132 73016222281
12 17220
26 32 0 3426 10 388 1369
If we have computed S: Rn→R1, how can we utilize it? We can, of course, simply put spherical hull boundaries by centering on the class Avgs, e.g., S_p with p=AvgS.
Setosa 0 154 E=50 I=11vErsicolor 394 1767vIrginica 369 4171
with AI 17220
with AE 1 517,478 633
eliminates FPs better?
What is the cost for these additional cuts (at new p-values in an L-interval)? It looks like: make the one additional calculation, L_{-2p + 2(p∘d)d}, then AND the interval masks, then AND the class masks. (Or, if we already have all interval-class masks, only one mask AND step.)
Recursion works wonderfully on IRIS: The only hull overlaps after only d=1000 areAnd the 4 i's common to both are {i24 i27 i28 i34}. We could call those "errors".
If on the L 1000,avgE interval, [-1, 11) we recurse using SavgI we get7 436 540,4 72170
Thus, for IRIS at least, with only d=e1=(1000), with only the 3 ps avgS, avgE, avgI, using full linear rounds, 1 R round on each resulting interval and 1 S, the hulls end up completely disjoint. That's pretty good news!
There is a lot of interesting and potentially productive (career building) engineering to do here. What is precisely the best way to intermingle p, d, L, R, S (minimizing time and False Positives)?
A pTree Pillar k-means clustering method (The k is not specified - it reveals itself.)
m1
m2
m3
m4
Choose m1 as a pt that maximizes Distance(X, avgX).
Choose m2 as a pt that maximizes Distance(X, m1).
Choose m3 as a pt that maximizes Σ_{h=1..2} Distance(X, m_h).
Choose m4 as a pt that maximizes Σ_{h=1..3} Distance(X, m_h).
Do until minimum_{h=1..k} Distance(X, m_h) < Threshold.
This gives k. Apply pk-means. (Note we already have all Dis(X, m_h)s for the first round.)
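The seeding steps above can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes Euclidean distance, it interprets the stopping rule as "stop when the next candidate lies within Threshold of an already chosen m_h", and it skips points that coincide with a chosen seed. The function name `pillar_seeds` and the sample data are hypothetical.

```python
import numpy as np


def pillar_seeds(X, threshold):
    """Pick seeds m1, m2, ... by maximizing accumulated distance to the
    seeds chosen so far; k is the number of seeds found before stopping."""
    X = np.asarray(X, dtype=float)
    # m1: the point farthest from the overall mean
    seeds = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    acc = np.zeros(len(X))               # sum over h of Distance(X, m_h)
    nearest = np.full(len(X), np.inf)    # min over h of Distance(X, m_h)
    while True:
        dist = np.linalg.norm(X - X[seeds[-1]], axis=1)
        acc += dist
        nearest = np.minimum(nearest, dist)
        # candidate: largest accumulated distance, excluding chosen seeds
        cand = int(np.argmax(np.where(nearest > 0, acc, -np.inf)))
        if nearest[cand] < threshold:    # assumed stopping rule
            break
        seeds.append(cand)
    return seeds


X = [[0, 0], [0.5, 0], [10, 0], [10, 0.5], [0, 10], [0.5, 10]]
seeds = pillar_seeds(X, threshold=2.0)
print(len(seeds), sorted(seeds))
```

The distances computed while seeding are exactly the Dis(X, m_h) columns needed for the first k-means assignment round, so they could be kept rather than recomputed.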
Note: D=m1m2 line. Treat PCCs like parentheses: ( corresponds to a PCI and ) corresponds to a PCD. Each matched pair should indicate a cluster somewhere in that slice. Where? One could take the VoM as the best-guess centroid, then proceed by restricting to that slice. Or first apply R and do PCC parenthesizing on R values to identify the radial slice where the cluster occurs, then take the VoM of that combo slice (linear and radial) as the centroid. Apply S to confirm.
Note: A possible clustering method for identifying density clusters (as opposed to round or convex clusters)
(Treating PCCs like parentheses)
d-linePCI PCD PCI PCD
(or Do until mk < Threshold)
Clustering: 1. For Anomaly Detection. 2. To develop Classes against which future unclassified objects are classified.
( Classification = moving up a concept hierarchy using a class assignment function, caf: X→{Classes} )
NewClu (k is discovered, not specified). Assign each (object,class) a ClassWeight, CW∈Reals (could be <0). Classes "take next ticket" as they're discovered (tickets are 1,2,...). Initially, all classes are empty; all CWs=0.
Do for next d, compute L_d = X∘d, until after masking off a new cluster the count is too high (doesn't drop enough).
For the next PCI in L_d (next-larger, starting from smallest):
If followed by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over L_d^{-1}[PCI, PCD]. Mask off this ticketed new Class_k and continue.
If followed by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over L_d^{-1}[(3PCI1+PCI2)/4, PCI2). Mask off this ticketed new Class_k and continue.
For the next-smaller PCI (starting from largest) in L_d:
If preceded by a PCD, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over L_d^{-1}[PCD, PCI]. Mask off this ticketed new Class_k and continue.
If preceded by a PCI, declare the next Class_k and define it to be the set spherically gapped (or PCDed) around the centroid, C_k = Avg or VoM_k over L_d^{-1}(PCI2, (3PCI1+PCI2)/4]. Mask off this ticketed new Class_k and continue.
When is it important not to over-partition? Sometimes it is but sometimes it is not. In 2. it usually isn't. With gap clustering we don't ever over-partition, but with PCC-based clustering we can. If it is important that each cluster be whole, when using a k-means type clusterer, each round we can fuse C_i and C_j iff on L_{m_i m_j} their projections touch or overlap.
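The fuse test can be sketched as an interval-overlap check on the line through the two cluster means. This is a minimal illustration that assumes clusters are given as point arrays and that m_i, m_j are their means; the function name `should_fuse` is hypothetical.

```python
import numpy as np


def should_fuse(Ci, Cj):
    """True iff the projections of Ci and Cj onto the m_i m_j line
    touch or overlap."""
    Ci = np.asarray(Ci, dtype=float)
    Cj = np.asarray(Cj, dtype=float)
    d = Cj.mean(axis=0) - Ci.mean(axis=0)
    norm = np.linalg.norm(d)
    if norm == 0:                        # identical means: treat as overlapping
        return True
    d = d / norm
    li, lj = Ci @ d, Cj @ d              # projections onto the m_i m_j line
    return bool(li.max() >= lj.min() and lj.max() >= li.min())


print(should_fuse([[0, 0], [1, 0]], [[5, 0], [6, 0]]))   # separated by a gap
print(should_fuse([[0, 0], [1, 0]], [[1, 0], [3, 0]]))   # projections touch
```

Only the min and max of each projected cluster are needed, so with precomputed class limits on that line the test is a pair of scalar comparisons.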