bcolz groupby discussion document
DESCRIPTION
A document to facilitate the discussion on creating an effective groupby mechanism for bcolz
TRANSCRIPT
Groupby Thoughts
Imagine you have a table like this:
A1   A2   A3   M1    M2
A    Y    3    100   30.0
C    Z    2    50    22.34
A    X    3    25    10.0
A    X    4    12    2.0
C    X    1    98    5.45
B    Z    2    150   20.12
A    Z    3    200   30.45
C    Y    2    225   20.0
B    Z    4    203   34.5
Etc.
And we want to aggregate it
So basically our input can look like this (will use this for the example):
• Group by columns list
  • Eg: ['A1', 'A2', 'A3']
• Measure columns set
  • Eg: {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}
• Where statement or boolarray to filter the existing rows
  • If None, the entire table should be scanned, else only the selected rows
• Rootdir option is also needed to specify an in-core or out-of-core result
  • NB: intermediate results (factorized & count sort results) should perhaps not follow the specified output location but instead follow whether the input ctable is in-memory or on-disk
• NB: stuff like factor caching and parallelism is mostly meant as ideas for the future (might greatly accelerate the groupby though)
• I have left sorting the end result out too for now
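The input described above could be written down roughly as follows. This is only a sketch: the variable names and the helpers selected_rows and parse_measures are hypothetical and not part of bcolz, and the set of supported operations is limited to the two used in the example.

```python
# Hypothetical input specification; all names here are illustrative only.
groupby_cols = ['A1', 'A2', 'A3']
measure_cols = {
    'M1_sum': ['M1', 'sum'],
    'M1_avg': ['M1', 'avg'],
    'M2_sum': ['M2', 'sum'],
}
where = None      # or a list of booleans, one per row
rootdir = None    # assumed meaning: None for in-memory, a path for on-disk


def selected_rows(nrows, where=None):
    """Return the row numbers to scan: all rows if where is None."""
    if where is None:
        return list(range(nrows))
    if len(where) != nrows:
        raise ValueError("boolean filter must have one entry per row")
    return [i for i, keep in enumerate(where) if keep]


def parse_measures(measure_cols):
    """Turn the measure dict into (output name, input column, operation) tuples."""
    supported = {'sum', 'avg'}
    parsed = []
    for out_name, (in_col, op) in sorted(measure_cols.items()):
        if op not in supported:
            raise ValueError("unsupported aggregation: %r" % op)
        parsed.append((out_name, in_col, op))
    return parsed
```

Keeping the output name separate from the input column is what allows two measures (M1_sum and M1_avg) to be derived from the same carray.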
Logic pipeline
Factorize A1, A2 and A3 (giving A1 Factorized, A2 Factorized, A3 Factorized)
• Factorizing of each carray can be parallel / multi-threaded
• Factorizations of carrays can potentially be cached next to the original carray until the next carray delete / update / insert
• Worst-case cost of the cache would be tripling the size (in case of unique integer columns ;)
Combine individual indexes into a unique new one (A1/A2/A3 Combined), then factorize it (A1/A2/A3 Factorized)
• Combination step is only needed in case of groupby over multiple columns, else take the factorized carray directly
Empty ctable
• Ctable can be based on the length from the combined factor + the dtypes of the input
• Can be done in parallel
Counted Sort (A1/A2/A3 Index)
Create the output columns (groupby columns, M1_Sum, M1_Avg, M2_Sum)
• Can run in parallel
• Groupby columns have to be filled by mapping A1/A2/A3 Factorized back to the original values via lookup
• Measure columns have to use the index to select from the original measure carray and perform the aggregation for each A1/A2/A3 combination
• You can also parallelize aggregations!
So we first factorize the groupby columns
A1 is factorized into A1 Values + A1 Index:
A1     A1 Index
A      0
C      1
A      0
A      0
C      1
B      2
A      0
C      1
B      2
Etc.   Etc.

A1 Values: A, C, B
• While factorizing, you do not yet know how many unique values you will get (the entire column might be unique), so you start out with 2 carrays of equal length to the input
• The hashing is done in-memory (klib), but this should be okay for almost all cases (memory usage is limited to the number of unique values)
• At the end you can resize the Values carray to its actual size
• In case of WORM (write once, read many) it can be very beneficial to cache this result alongside the carray (meaning we end up with three carrays on-disk)
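The factorization step described in these bullets can be sketched in plain Python. This is an illustration under assumptions, not the bcolz implementation: a dict stands in for the klib hash table, lists stand in for carrays, and the function name factorize is chosen here for the example.

```python
def factorize(column):
    """Factorize a column into (values, index) in order of first appearance."""
    n = len(column)
    values = [None] * n      # pre-allocated: the whole column might be unique
    index = [0] * n          # one code per input row
    seen = {}                # value -> code; grows with unique values only
    for row, item in enumerate(column):
        code = seen.get(item)
        if code is None:
            code = len(seen)
            seen[item] = code
            values[code] = item
        index[row] = code
    del values[len(seen):]   # "resize" the values to their actual size
    return values, index


a1 = ['A', 'C', 'A', 'A', 'C', 'B', 'A', 'C', 'B']
a1_values, a1_index = factorize(a1)
print(a1_values)  # ['A', 'C', 'B']
print(a1_index)   # [0, 1, 0, 0, 1, 2, 0, 1, 2]
```

Only the dict lives in memory for the whole pass, which is why memory use follows the number of unique values rather than the number of rows.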
So we end up with 3 factor results
A1 Index   A2 Index   A3 Index
0          0          0
1          1          1
0          2          0
0          2          2
1          2          3
2          1          1
0          1          0
1          0          1
2          1          2
Etc.       Etc.       Etc.

A1 Values: A, C, B (3 Unique Values)
A2 Values: Y, Z, X (3 Unique Values)
A3 Values: 3, 2, 4, 1 (4 Unique Values)
• The number of unique values is important for the next step, which is combining the indexes into one
• If we group by only one column, no additional step is needed
How to combine the factorized carrays into unique values
• So we have 3 * 3 * 4 = 36 unique combinations; any combined value falls in the range 0 to 35
• We can create this range by calculating a multiplier for each column: start at the last column with a multiplier of 1, and for each preceding column multiply the previous multiplier by the number of unique values of the column that multiplier belongs to (A3: 1, A2: 1 * 4 = 4, A1: 4 * 3 = 12)
• So the largest indexes give 2*12 + 2*4 + 3*1 = 35, and a random example gives 2*12 + 1*4 + 3*1 = 31
• We calculate this for each row and end up with a new carray that contains all combined values
• You can also calculate this back (for instance for 31) by doing:
  • Val1 = floor(31/12) = 2
  • Val2 = floor((31 - Val1*12)/4) = 1
  • Val3 = floor((31 - Val1*12 - Val2*4)/1) = 3

Column     # of values   Multiplier   Index example "start"   Index example "end"   Index example second value of all   Index example random
A1         3             12           0                       2                     1                                   2
A2         3             4            0                       2                     1                                   1
A3         4             1            0                       3                     1                                   3
Combined                              0                       35                    17                                  31
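The multiplier arithmetic can be checked with a small sketch. The function names multipliers, combine and split are made up for this example; the logic follows the description above, with the last column getting multiplier 1.

```python
def multipliers(n_unique):
    """Multiplier per column: the last column gets 1, each earlier column
    gets the next column's multiplier times the next column's unique count."""
    result = [1] * len(n_unique)
    for i in range(len(n_unique) - 2, -1, -1):
        result[i] = result[i + 1] * n_unique[i + 1]
    return result


def combine(codes, mults):
    """Combine one row's per-column codes into a single value."""
    return sum(c * m for c, m in zip(codes, mults))


def split(combined, mults):
    """Calculate a combined value back into its per-column codes."""
    codes = []
    for m in mults:
        code, combined = divmod(combined, m)
        codes.append(code)
    return codes


mults = multipliers([3, 3, 4])
print(mults)                      # [12, 4, 1]
print(combine([2, 1, 3], mults))  # 31
print(split(31, mults))           # [2, 1, 3]
```

Because the codes are 0-based, the 36 possible combinations map exactly onto the values 0 to 35, so no two combinations collide.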
So we create a groupby index & values like this
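A1 Index   A2 Index   A3 Index   GroupbyInput   GroupbyIndex
(* 12)     (* 4)      (* 1)
0          0          0          0              0
1          1          1          17             1
0          2          0          8              2
0          2          2          10             3
1          2          3          23             4
2          1          1          29             5
0          1          0          4              6
1          0          1          13             7
2          1          2          30             8
Etc.       Etc.       Etc.       Etc.           Etc.

GroupbyValues: 0, 17, 8, 10, 23, 29, 4, 13, 30

Calculate: GroupbyInput = A1 Index * 12 + A2 Index * 4 + A3 Index * 1 (numexpr can do this very nicely)
Factorize: GroupbyInput is factorized into GroupbyIndex + GroupbyValues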
(Okay, slightly crappy example as everything is unique here ;)
The length of the groupbyvalues is the length of the ctable output!
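As a sketch of this step, the GroupbyInput, GroupbyIndex and GroupbyValues of the example can be reproduced in plain Python. numexpr is suggested above for the calculation; a list comprehension is used here only to keep the example free of dependencies, and lists stand in for carrays.

```python
a1_index = [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_index = [0, 1, 2, 2, 2, 1, 1, 0, 1]
a3_index = [0, 1, 0, 2, 3, 1, 0, 1, 2]

# Combined value per row, using the multipliers 12, 4 and 1.
groupby_input = [a * 12 + b * 4 + c * 1
                 for a, b, c in zip(a1_index, a2_index, a3_index)]

# Factorize the combined values in order of first appearance.
codes = {}
groupby_index = [codes.setdefault(v, len(codes)) for v in groupby_input]
groupby_values = list(codes)   # dicts keep insertion order (Python 3.7+)

print(groupby_input)        # [0, 17, 8, 10, 23, 29, 4, 13, 30]
print(len(groupby_values))  # 9, the number of rows in the output ctable
```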
Create the new ctable
• @Valentin: it’s probably better to just create the carrays on the go from iterations right? (no need to first create an empty one)
• We know the length from the groupby values carray size and the dtypes from the input carrays
Sort
Row   GroupbyIndex
0     0
1     1
2     2
3     2
4     0
5     3
6     0
7     1
8     2
9     3

GroupbyValues: 0, 17, 8, 10

Counted sort gives a count per value and a sorted carray that holds the row indices (NB: we already have this Cython function through Pandas)
I changed the example from slide 8 to make it more understandable ;)

GroupbyRow Index: 0, 4, 6, 1, 7, 2, 3, 8, 5, 9
GroupbyValue Count: 3, 2, 3, 2
So now you can select rows from the original carrays using index lookups
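A counted sort over the GroupbyIndex of this slide can be sketched as follows. This is a plain-Python illustration written for this document, not the Pandas Cython routine mentioned above; the function name counted_sort is an assumption.

```python
def counted_sort(group_index, n_groups):
    """Counting sort of row numbers by group code.

    Returns (counts, rows): counts[g] is the number of rows in group g, and
    rows lists the row numbers ordered by group, keeping the original row
    order within each group.
    """
    counts = [0] * n_groups
    for code in group_index:
        counts[code] += 1

    # Start offset of each group inside the sorted row array.
    position = [0] * n_groups
    for g in range(1, n_groups):
        position[g] = position[g - 1] + counts[g - 1]

    rows = [0] * len(group_index)
    for row, code in enumerate(group_index):
        rows[position[code]] = row
        position[code] += 1
    return counts, rows


groupby_index = [0, 1, 2, 2, 0, 3, 0, 1, 2, 3]
counts, rows = counted_sort(groupby_index, 4)
print(counts)  # [3, 2, 3, 2]
print(rows)    # [0, 4, 6, 1, 7, 2, 3, 8, 5, 9]
```

The sort needs two passes over the index and no comparisons, so its cost is linear in the number of rows plus the number of groups.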
Create groupby & measure columns
• Don’t have time for this slide anymore but using the previous slides we should be okay I hope ;)
• Basically, create the groupby columns by deriving the per-column indexes from each groupby value and looking up the correct value in the values carrays
• Create each measure column by index-selecting the measure values for each groupby value and applying the aggregation
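These last two steps could look roughly like the sketch below. It assumes the counted-sort output (a count per group and the row numbers ordered by group) and the multipliers 12, 4 and 1 from the earlier slides. The function names and the measure data m1 are made up for illustration.

```python
def aggregate(measure, counts, rows, op):
    """Aggregate one measure column per group.

    rows holds the row numbers ordered by group and counts holds the number
    of rows in each group, as produced by a counted sort.
    """
    result = []
    start = 0
    for count in counts:
        selected = [measure[r] for r in rows[start:start + count]]
        total = sum(selected)
        if op == 'sum':
            result.append(total)
        elif op == 'avg':
            result.append(total / count)
        else:
            raise ValueError("unsupported aggregation: %r" % op)
        start += count
    return result


def groupby_columns(groupby_values, mults, values_per_column):
    """Split each combined value into per-column codes and look up the originals."""
    columns = [[] for _ in values_per_column]
    for combined in groupby_values:
        for col, (m, values) in enumerate(zip(mults, values_per_column)):
            code, combined = divmod(combined, m)
            columns[col].append(values[code])
    return columns


# Counted-sort output for the 10-row example; m1 is made-up measure data.
counts = [3, 2, 3, 2]
rows = [0, 4, 6, 1, 7, 2, 3, 8, 5, 9]
m1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(aggregate(m1, counts, rows, 'sum'))  # [130, 100, 160, 160]
print(groupby_columns([0, 17, 8, 10], [12, 4, 1],
                      [['A', 'C', 'B'], ['Y', 'Z', 'X'], [3, 2, 4, 1]]))
# [['A', 'C', 'A', 'A'], ['Y', 'Z', 'X', 'X'], [3, 2, 3, 4]]
```

Each group and each measure is computed independently here, which is what makes the parallel aggregation mentioned earlier possible.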