software + babies
TRANSCRIPT
Software + Babies
How to design software and APIs for parallelism on modern hardware
Richard Parker
Mathematician and freelance computer programmer in Cambridge, England
Cologne, 20.01.2016
Maths research 2011-2014.
• My original (hobby) project was to work with matrices mod 2, obsessively fast.
• I considered even a 5% speedup compulsory.
• Highly optimized programs, some up to 30 years old, have been in use for many years.
• Mine were ~300 times faster in the end.
I learned a lot of tricks.
• I tried to consider every possible trick.
• I had to understand how all the parts of the (x86) microprocessor worked.
• And Intel AVX-512F when it comes.
• I have been blogging at meataxe64 (on WordPress) if anyone is interested.
Is this useful commercially?
• I felt there must be situations where performance improvements of x10 or x100 would be useful in the real world.
• I have recently been applying these ideas in the context of the ArangoDB database.
• My conclusion is that there are three main software considerations which can have a major (x10) influence on performance.
3 pillars of performance.
• Multi-Core. If you use a lot of cores, you can get a lot more done per second.
• Use Cache. Fetches from real memory take x100 as long as from cache. Real-memory bandwidth matters too.
• Single-thread parallelism. Even one processor can do many things at once.
Multi-Core
• 16 cores can do a lot more work per second than one core.
• Using them is not always easy, and the results are often disappointing. . .
• because the memory system often gets saturated, perhaps even with 4 cores, and using more cores doesn't help.
Make better use of cache.
• This often requires a new algorithm.
• This work is still in its infancy, but it is gradually becoming mainstream.
• Two algorithms with the same “complexity”, such as quicksort and heapsort, can easily differ by a factor of 10 because of cache.
Single Thread Parallelism
• This Haswell processor is designed to issue four instructions every clock cycle.
• Three of those four can be “vector” instructions, operating on the 256-bit vector registers.
• The result is that, roughly speaking, the processor can do 20-40 similar things in the same time it takes to do one.
Higher level implications.
• In a sense, these are all low-level tricks.
• But to make good use of all these low-level tricks, the high-level software must also make changes. . . and one is critical:
• any subroutine call should do a batch of stuff, not just one thing.
Fred Brooks
• In “The Mythical Man-Month”, Brooks wrote "nine women can't make a baby in one month".
• He was talking about writing programs, not running them, but the same principle is important, and getting more important, for program execution.
Ask for more Babies.
• In its simplest form, don't write
for(i=0;i<20;i++) MakeOneBaby()
• instead write
MakeSomeBabies(20)
• The first version takes 15 years to execute.
• It must surely be clear that the second version can be done faster, depending on the resources available. :-)
Example - Cosine
• The computation of the cosine of an angle takes about 50 clock cycles.
double cosine(double ang)
• The “babies” version can manage about one cosine every 5 clock cycles once it gets going. Ten times faster.
void cosines(int ct,
double * ang, double * a)
Why is Babies cosine faster?
• 4 identical operations can be done in the vector registers at the same speed as one. This gives an immediate factor of 4.
• The hardware can start another cosine rather than wait for an intermediate result.
• Duplicated work (in this case loading the coefficients of the series expansion) can be done just once.
Even if count is 1 . . .
• With just a little care, the babies version with a count of 1 is no slower than the non-babies version.
• For example, the routine could start by testing whether the count is 1, and branching to new code if the count is not 1.
• The (not-taken) branch will be correctly predicted if the count always is 1, and usually no time at all will be lost.
Throughput, not Latency.
• for(i=0;i<20;i++) makeonebaby()
This depends on the Latency - the time from starting one to finishing that one.
• makesomebabies(20)
This depends on the Throughput - the number that can be done on average in unit time.
Latency constant. Throughput increasing.
• We seem to have hit laws of physics with latency. It is quite hard to get electronics to (say) do a double-precision multiply and get the answer faster than ~2 ns.
• But there is little to prevent it doing a lot at once. Indeed it can do forty since Sandy Bridge.
Think of x86 as many units.
• Each single x86 core . . .
• Has a huge number of execution units just waiting for the chance to do something useful for the program.
• If you can use a lot more of them at once, you can get a lot more done per unit time!
Not just arithmetic
• For example, fetches from L1 cache take two clock cycles . . .
• But fetching 32 bytes is no slower than fetching 8 (or less), and you can issue two of these every clock cycle
• So the throughput of loading 8-byte data is sixteen times what the latency alone would suggest.
Much the same with memory
• The fetch of a (64-byte) cache-line on this computer has a 180 clock-cycle latency.
• But it can usually do three at once
• and then you have 180 clock cycles to be getting on with other work that does not depend on those fetches.
• If, that is, you have something to do.
Example - database updates.
• When applying updates to a database, particularly if it is distant (i.e. has a high ping-time), it can be important to batch them up and send off the whole batch at once.
• This is an example of the “babies” concept that is already mainstream - DTO.
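A hypothetical sketch of the batching idea (not ArangoDB's actual API): accumulate updates locally and pay the round-trip ping-time once per buffer, not once per update.

```c
#include <stddef.h>

#define BATCH_MAX 64   /* updates per round trip (illustrative) */

typedef struct { int key; int value; } update_t;

typedef struct {
    update_t pending[BATCH_MAX];
    size_t   count;    /* updates buffered so far */
    size_t   flushes;  /* round trips actually made */
} batch_t;

/* Network send elided - this is where the one ping-time is paid. */
static void send_batch(const update_t *ups, size_t n) { (void)ups; (void)n; }

void batch_add(batch_t *b, int key, int value)
{
    b->pending[b->count++] = (update_t){ key, value };
    if (b->count == BATCH_MAX) {   /* buffer full: one round trip */
        send_batch(b->pending, b->count);
        b->count = 0;
        b->flushes++;
    }
}

void batch_flush(batch_t *b)      /* send whatever is left */
{
    if (b->count) {
        send_batch(b->pending, b->count);
        b->count = 0;
        b->flushes++;
    }
}
```

With this shape, 130 updates cost 3 round trips instead of 130.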
Example - heap
• A heap allows you to put a key/value pair in, and take out the one with the smallest key.
• I could tell you about a version that runs considerably faster. You need to put a few pairs in, and take a few pairs (sorted) out at each step.
• But even if you don't know (yet) how to do that, it is sensible to use the “babies” interface anyway.
• Perhaps you'll think of something later.
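A minimal sketch of the batched heap interface (hypothetical names, plain C, and without the cleverer internal algorithm hinted at above - a standard binary min-heap with the "babies" shape bolted on):

```c
#include <stddef.h>

#define HEAP_MAX 256

typedef struct { int key; int value; } pair_t;
typedef struct { pair_t a[HEAP_MAX]; size_t n; } heap_t;

static void sift_up(heap_t *h, size_t i)
{
    while (i > 0) {
        size_t p = (i - 1) / 2;
        if (h->a[p].key <= h->a[i].key) break;
        pair_t t = h->a[p]; h->a[p] = h->a[i]; h->a[i] = t;
        i = p;
    }
}

static void sift_down(heap_t *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->n && h->a[l].key < h->a[m].key) m = l;
        if (r < h->n && h->a[r].key < h->a[m].key) m = r;
        if (m == i) break;
        pair_t t = h->a[m]; h->a[m] = h->a[i]; h->a[i] = t;
        i = m;
    }
}

/* "Babies" interface: several pairs in per call... */
void heap_put_many(heap_t *h, const pair_t *ps, size_t ct)
{
    for (size_t i = 0; i < ct; i++) {
        h->a[h->n] = ps[i];
        sift_up(h, h->n++);
    }
}

/* ...and several pairs out per call, smallest key first. */
size_t heap_take_many(heap_t *h, pair_t *out, size_t ct)
{
    size_t got = 0;
    while (got < ct && h->n > 0) {
        out[got++] = h->a[0];
        h->a[0] = h->a[--h->n];
        sift_down(h, 0);
    }
    return got;
}
```

The internals are the ordinary one-at-a-time algorithm; the point is that callers already speak the batched interface, so a faster implementation can replace the body later.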
Example - Hash table
• Here the dead time is waiting for a memory fetch.
• It should be clear what the interface should be. . .
• You look-up (insert, delete, whatever) many things in one call. Give it a chance to be clever and fast.
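One way the batched call can be clever, sketched below with a hypothetical open-addressing table: issue prefetches for all the probe slots first (here via `__builtin_prefetch`, a GCC/Clang builtin), then do the probes, so the slow memory fetches overlap instead of serializing.

```c
#include <stddef.h>

#define TAB_SIZE 1024u   /* power of two, illustrative size */

typedef struct { unsigned key; int value; int used; } slot_t;

/* Fibonacci-style multiplicative hash into the table. */
static size_t slot_of(unsigned key) { return (key * 2654435761u) & (TAB_SIZE - 1); }

void table_insert(slot_t *tab, unsigned key, int value)
{
    size_t i = slot_of(key);
    while (tab[i].used && tab[i].key != key)
        i = (i + 1) & (TAB_SIZE - 1);     /* linear probing */
    tab[i].key = key; tab[i].value = value; tab[i].used = 1;
}

/* Batched lookup: found[i] set to 1 and vals[i] filled where the key exists. */
void table_lookup_many(const slot_t *tab, const unsigned *keys,
                       int *vals, int *found, size_t ct)
{
    for (size_t i = 0; i < ct; i++)       /* start all the fetches early */
        __builtin_prefetch(&tab[slot_of(keys[i])]);
    for (size_t i = 0; i < ct; i++) {     /* by now the lines may be in cache */
        size_t j = slot_of(keys[i]);
        found[i] = 0;
        while (tab[j].used) {
            if (tab[j].key == keys[i]) { vals[i] = tab[j].value; found[i] = 1; break; }
            j = (j + 1) & (TAB_SIZE - 1);
        }
    }
}
```

A one-at-a-time lookup interface makes this overlap impossible; the batched one makes it almost free.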
Example - GeoIndex
• I have very recently looked at how one can find the nearest points in a GeoIndex.
• There are several slowish steps - memory fetch, trig functions, distance computations, sorting the results etc.
• Every one of these steps can benefit greatly by handling multiple points rather than just one.
Compilers good without babies.
• If one implements, say, the cosine function (one-at-a-time interface), the code is dependency bound anyway. The exact algorithm wanted must be carefully coded to minimize the dependency chain, but if this is done in (say) C, it will run well.
• An assembler programmer will be hard pressed to get more than a few percent improvement.
“Babies” changes everything.
• If you want to write a “babies” version of cosine, it is an uphill struggle to get a compiler to generate good code.
• The use of vector instructions, the partial loop unrolling, the use of conditional moves rather than branches, and so on, make the challenge too hard for even a modern compiler.
Compilers may get better.
• Once programmers are asking more of their compilers (by writing babies versions) this may give a wake-up call.
• I would love to see a language which helps me write top-performance code!
• As of today, however, the answer is often to write the “leaf” babies routines in assembler code.
“Ninja Assembler”
• Today's assemblers have changed little in 50 years. . .
• But it is not hard to imagine a language designed for writing high-performance x86 code, but still offering many of the luxuries of a high-level language.
• One issue is portability. Portability is a big obstacle to making best use of the machine you actually have!
Portability problems.
• Laying out data for the vector instructions.
• Aligning parts of a data structure.
• Specializing types for vector use.
• When is a branch unpredictable?
• What is the width of the vector registers?
• When may a data fetch miss cache?
• Cramming data into a single cache-line.
• At the 5% level there are many more.
The 80/20 rule.
• Even for “peak performance” code, ~80% of the time is spent in ~20% of the code.
• We accept writing the 20% in assembler
• But this determines the data layout.
• So it is very hard, at the moment anyway, to write the 80% in a high-level language.
• Mainly because of portability!
Strategic view
• As of today, it seems to me that a properly programmed x86 can outperform anything.
• What I am less clear about is whether the software community can afford the extra time and cost needed to program it properly.
• My aim is to try to reduce that cost.
Summary
• Try to use “babies” interfaces at all times - even if you can't immediately see how or why you might want to make it faster.
• As of today, a small, assembler, babies routine can make a big difference to the speed of a whole system.
• And we can hope for better software development tools in future.