bcolz groupby discussion document

Groupby Thoughts

Imagine you have a table like this:

A1  A2  A3  M1   M2
A   Y   3   100  30.0
C   Z   2   50   22.34
A   X   3   25   10.0
A   X   4   12   2.0
C   X   1   98   5.45
B   Z   2   150  20.12
A   Z   3   200  30.45
C   Y   2   225  20.0
B   Z   4   203  34.5
Etc.

And we want to aggregate it.

So basically our input can look like this (will use this for the example):

• Group by columns list

• Eg: ['A1', 'A2', 'A3']

• Measure columns set

• Eg: {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg'], 'M2_sum': ['M2', 'sum']}

• Where statement or boolarray to filter the existing rows

• If None, then the entire table should be scanned, else the selected rows only

• Rootdir option is also needed to specify an in-core or out-of-core result; NB: intermediate results (factorized & count sort results) should perhaps not follow the specified output location but instead follow whether the input ctable is in-mem or on-disk

• NB: stuff like factor caching and parallelism is mostly meant as ideas for the future (might greatly accelerate the groupby though)

• I have left sorting the end-result out too for now
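As a concrete reading of the input description above, the inputs could be written as plain Python values. This is only a sketch: the helper `validate_inputs` and all variable names are hypothetical and are not part of any existing bcolz API, and the set of known aggregations is assumed to be just the two used in the example.

```python
# Hypothetical input values for the example; all names are illustrative only.
groupby_cols = ['A1', 'A2', 'A3']

# output column name -> [input column, aggregation]
measure_cols = {
    'M1_sum': ['M1', 'sum'],
    'M1_avg': ['M1', 'avg'],
    'M2_sum': ['M2', 'sum'],
}

where = None     # None: scan the entire table; otherwise a boolean array
rootdir = None   # assumed here: None for an in-memory result, a path for on-disk


def validate_inputs(table_cols, groupby_cols, measure_cols):
    """Check that every referenced column exists and every aggregation is known."""
    known_aggs = {'sum', 'avg'}  # assumption: only the two used in the example
    for col in groupby_cols:
        if col not in table_cols:
            raise KeyError(col)
    for out_name, (in_col, agg) in measure_cols.items():
        if in_col not in table_cols:
            raise KeyError(in_col)
        if agg not in known_aggs:
            raise ValueError(agg)
    return True


print(validate_inputs(['A1', 'A2', 'A3', 'M1', 'M2'], groupby_cols, measure_cols))
```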

Logic pipeline

[Diagram: A1, A2, A3 → Factorize (one per column) → A1 Factorized, A2 Factorized, A3 Factorized → Combine individual indexes into a unique new one → A1/A2/A3 Combined]

• Factorizing of each carray can be parallel / multi-threaded

• Factorizations of carrays can potentially be cached next to the original carray until the next carray delete / update / insert

• Worst-case cost of the cache would be tripling the size (in case of unique integer columns ;)

[Diagram: A1/A2/A3 Combined → Factorize → A1/A2/A3 Factorized]

• Combination step is only needed in case of groupby over multiple columns, else take the factorized carray directly

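[Diagram: A1/A2/A3 Factorized → Counted Sort → A1/A2/A3 Index]

[Diagram: Create → Empty ctable with the groupby columns (A1, A2, A3) and the measure columns M1_Sum, M1_Avg, M2_Sum]

• Ctable can be based on length from combined factor + dtypes input

• Can be done in parallel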

• Can run in parallel

• Groupby columns have to be filled by mapping A1/A2/A3 Factorized back into the original values via lookup

• Measure columns have to use the index to filter the original measure carray and perform the aggregation for each A1/A2/A3 combination

• You can also parallelize aggregations!

So we first factorize the groupby columns

A1 is factorized into A1 Values + A1 Index:

A1    A1 Index
A     0
C     1
A     0
A     0
C     1
B     2
A     0
C     1
B     2
Etc.  Etc.

A1 Values: A, C, B

• While factorizing, you do not yet know how many unique values you will get (the entire column might be unique), so you start out with 2 carrays of the same length as the input

• The hashing is done in-memory (klib) but this should be okay for almost all cases (memory usage is bounded by the number of unique values)

• At the end you can resize the Values carray to its actual size

• In case of WORM (write once, read many) it can be very beneficial to cache this result alongside the carray (meaning we end up with three carrays on-disk)
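The factorization described above can be sketched in pure Python. This is a minimal illustration, not bcolz's implementation: a `dict` stands in for the klib hash table, plain lists stand in for the carrays, and the function name `factorize` is chosen here for the example.

```python
def factorize(column):
    """Return (values, index): the unique values in order of first appearance,
    and for every row the position of its value in `values`."""
    lookup = {}   # value -> position in `values` (the in-memory hash table)
    values = []   # grows only to the number of unique values
    index = []    # same length as the input
    for item in column:
        pos = lookup.get(item)
        if pos is None:
            # first time we see this value: give it the next free position
            pos = len(values)
            lookup[item] = pos
            values.append(item)
        index.append(pos)
    return values, index


a1 = ['A', 'C', 'A', 'A', 'C', 'B', 'A', 'C', 'B']
a1_values, a1_index = factorize(a1)
print(a1_values)  # ['A', 'C', 'B']
print(a1_index)   # [0, 1, 0, 0, 1, 2, 0, 1, 2]
```

Because `values` is a growing list here, the "allocate at input length, then resize" step from the bullets is not needed in this sketch; it matters for fixed-size, chunked containers.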

So we end up with 3 factor results

A1 Values: A, C, B     (3 Unique Values)
A2 Values: Y, Z, X     (3 Unique Values)
A3 Values: 3, 2, 4, 1  (4 Unique Values)

A1 Index  A2 Index  A3 Index
0         0         0
1         1         1
0         2         0
0         2         2
1         2         3
2         1         1
0         1         0
1         0         1
2         1         2
Etc.      Etc.      Etc.

• The number of unique values is important for the next step, which is combining the indexes into one

• If we group by only one column, no additional step is needed

How to combine the factorized carrays into unique values

• So we have 3 * 3 * 4 = 36 unique combinations; every row's combined value falls in the range 0 to 35

• We can create this range by calculating a multiplier for each column: start at the last column with a multiplier of 1, and then for each preceding column multiply the previous multiplier by the number of unique values of the column that multiplier belongs to (A3: 1, A2: 1 * 4 = 4, A1: 4 * 3 = 12)

• Because the factor indexes start at 0, the highest combination is 2*12 + 2*4 + 3*1 = 35; a random one is 2*12 + 1*4 + 3*1 = 31

• We calculate this for each row and end up with a new carray that contains all combined values

• You can also calculate this back (for instance for 31) by doing:

• Val1 = floor(31/12) = 2

• Val2 = floor((31 - val1*12)/4) = 1

• Val3 = floor((31 - val1*12 - val2*4)/1) = 3

                # of values  multiplier  "start"  "end"  second value of all  random
A1              3            12          0        2      1                    2
A2              3            4           0        2      1                    1
A3              4            1           0        3      1                    3
Combined value                           0        35     17                   31
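The multiplier scheme can be sketched as follows. This is a pure-Python illustration under the assumption that factor indexes start at 0; the function names `multipliers`, `combine` and `split_combined` are made up for this example.

```python
def multipliers(n_unique):
    """Last column gets multiplier 1; each preceding column gets the next
    column's multiplier times the next column's number of unique values."""
    result = [1] * len(n_unique)
    for i in range(len(n_unique) - 2, -1, -1):
        result[i] = result[i + 1] * n_unique[i + 1]
    return result


def combine(indexes, mults):
    """Combine one row's per-column factor indexes into a single integer."""
    return sum(i * m for i, m in zip(indexes, mults))


def split_combined(combined, mults):
    """Recover the per-column factor indexes from a combined value."""
    indexes = []
    for m in mults:
        indexes.append(combined // m)
        combined %= m
    return indexes


mults = multipliers([3, 3, 4])
print(mults)                       # [12, 4, 1]
print(combine([2, 1, 3], mults))   # 31
print(split_combined(31, mults))   # [2, 1, 3]

# Applied row by row to the factor indexes of the example table:
a1_index = [0, 1, 0, 0, 1, 2, 0, 1, 2]
a2_index = [0, 1, 2, 2, 2, 1, 1, 0, 1]
a3_index = [0, 1, 0, 2, 3, 1, 0, 1, 2]
groupby_input = [combine(row, mults)
                 for row in zip(a1_index, a2_index, a3_index)]
print(groupby_input)  # [0, 17, 8, 10, 23, 29, 4, 13, 30]
```

The scheme is a mixed-radix encoding, so every combination of indexes maps to a distinct integer and can be decoded again with integer division and remainder.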

So we create a groupby index & values like this

A1 Index  A2 Index  A3 Index  GroupbyInput  GroupbyIndex
(* 12)    (* 4)     (* 1)
0         0         0         0             0
1         1         1         17            1
0         2         0         8             2
0         2         2         10            3
1         2         3         23            4
2         1         1         29            5
0         1         0         4             6
1         0         1         13            7
2         1         2         30            8
Etc.      Etc.      Etc.      Etc.          Etc.

GroupbyValues: 0, 17, 8, 10, 23, 29, 4, 13, 30

Calculate: GroupbyInput = A1 Index * 12 + A2 Index * 4 + A3 Index * 1 (numexpr can do this very nicely)

Factorize: GroupbyInput → GroupbyIndex + GroupbyValues

(Okay, slightly crappy example as everything is unique here ;)

The length of the groupbyvalues is the length of the ctable output!

Create the new ctable

• @Valentin: it’s probably better to just create the carrays on the go from iterations right? (no need to first create an empty one)

• We know the length from the groupby values carray size and the dtypes from the input carrays

Sort

GroupbyIndex: 0, 1, 2, 2, 0, 3, 0, 1, 2, 3

GroupbyValues: 0, 17, 8, 10

Counted sort gives, per value, a count and a sorted carray which gives the row indices (NB: we have this cython function already through Pandas)

I changed the example from slide 8 to make it more understandable ;)

GroupbyValue Count: 3, 2, 3, 2

GroupbyRow Index: 0, 4, 6, 1, 7, Etc.

So now you can select rows from the original carrays using index lookups
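A counted sort of this kind can be sketched in pure Python. This is an illustration of the general technique, not the Pandas cython function mentioned above; the function name `counted_sort` and the small input list are made up for this example.

```python
def counted_sort(groupby_index, n_groups):
    """Return (counts, row_index): the number of rows per group, and the
    row numbers ordered by group (original order kept within each group)."""
    counts = [0] * n_groups
    for g in groupby_index:
        counts[g] += 1

    # starting offset of each group in the sorted output
    offsets = [0] * n_groups
    for g in range(1, n_groups):
        offsets[g] = offsets[g - 1] + counts[g - 1]

    row_index = [0] * len(groupby_index)
    next_slot = list(offsets)
    for row, g in enumerate(groupby_index):
        row_index[next_slot[g]] = row
        next_slot[g] += 1
    return counts, row_index


groupby_index = [1, 0, 1, 2, 0]   # made-up group number per row
counts, row_index = counted_sort(groupby_index, n_groups=3)
print(counts)     # [2, 2, 1]
print(row_index)  # [1, 4, 0, 2, 3]
```

The rows of group `g` are then the slice of `row_index` that starts at the sum of the counts of all earlier groups and has length `counts[g]`.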

Create groupby & measure columns

• Don’t have time for this slide anymore but using the previous slides we should be okay I hope ;)

• Basically create the groupby columns by looking up the correct value in the Values carrays, deriving the per-column indexes from the groupby input

• Create the measure column by index selecting the values for each groupby value and applying the aggregation
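The last step can be sketched for the simplest case, a groupby on A1 alone, where the groupby column is just the Values list from the factorization. This is a pure-Python illustration with plain lists in place of carrays; the names `aggregations` and `aggregate` are hypothetical, and the counts and row indices are written out by hand as the assumed result of the earlier steps.

```python
# Measure column M1 from the example table.
m1 = [100, 50, 25, 12, 98, 150, 200, 225, 203]

# Assumed results of the earlier steps for a groupby on A1 alone:
a1_values = ['A', 'C', 'B']               # factorized values
counts = [4, 3, 2]                        # rows per group (counted sort)
row_index = [0, 2, 3, 6, 1, 4, 7, 5, 8]   # row numbers ordered by group

aggregations = {
    'sum': sum,
    'avg': lambda xs: sum(xs) / len(xs),
}
measure_cols = {'M1_sum': ['M1', 'sum'], 'M1_avg': ['M1', 'avg']}
measures = {'M1': m1}


def aggregate(measure_cols, measures, counts, row_index):
    """Build one output list per measure column, one entry per group."""
    out = {name: [] for name in measure_cols}
    start = 0
    for count in counts:
        rows = row_index[start:start + count]   # rows of the current group
        start += count
        for name, (col, agg) in measure_cols.items():
            selected = [measures[col][r] for r in rows]
            out[name].append(aggregations[agg](selected))
    return out


result = {'A1': list(a1_values)}
result.update(aggregate(measure_cols, measures, counts, row_index))
print(result['A1'])      # ['A', 'C', 'B']
print(result['M1_sum'])  # [337, 373, 353]
```

For a groupby over several columns, each output groupby column would instead be filled by splitting every combined groupby value back into per-column indexes and looking those up in the corresponding Values list.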


Data & Analytics