software + babies
TRANSCRIPT
Software + Babies
How to design software and APIs for parallelism on modern hardware
Richard Parker
Mathematician and freelance computer programmer in Cambridge, England
Cologne, 20.01.2016
Maths research 2011-2014.
• My original (hobby) project was to work with matrices mod 2, obsessively fast.
• I considered even a 5% speedup compulsory.
• Highly optimized programs, some up to 30 years old, have been in use for many years.
• Mine were ~300 times faster in the end.
I learned a lot of tricks.
• I tried to consider every possible trick.
• I had to understand how all the parts of the (x86) microprocessor worked.
• And Intel AVX-512F when it comes.
• I have been blogging at meataxe64 (on WordPress) if anyone is interested.
Is this useful commercially?
• I felt there must be situations where performance improvements of x10 or x100 would be useful in the real world.
• I have recently been applying these ideas in the context of the ArangoDB database.
• My conclusion is that there are three main software considerations which can have a major (x10) influence on performance.
3 pillars of performance.
• Multi-Core. If you use a lot of cores, you can get a lot more done per second.
• Use Cache. Fetches from real memory take x100 as long as from cache. Real-memory bandwidth matters too.
• Single-thread parallelism. Even one processor can do many things at once.
Multi-Core
• 16 cores can do a lot more work per second than one core.
• Using them is not always easy, and the results are often disappointing. . .
• because the memory system often gets saturated, perhaps even with 4 cores, and using more cores doesn't help.
Make better use of cache.
• This often requires a new algorithm.
• This work is still in its infancy, but it is gradually becoming mainstream.
• Two algorithms with the same “complexity”, such as quicksort and heapsort, can easily differ by a factor of 10 because of cache.
Single Thread Parallelism
• This Haswell processor is designed to issue four instructions every clock cycle.
• Three of those four can be “vector” instructions, operating on the 256-bit vector registers.
• The result is that, roughly speaking, the processor can do 20-40 similar things in the same time it takes to do one.
Higher level implications.
• In a sense, these are all low-level tricks.
• But to make good use of all these low-level tricks, the high-level software must also make changes. . . and one is critical:
• any subroutine call should do a batch of stuff, not just one thing.
Fred Brooks
• In “The Mythical Man-Month”, Brooks wrote "nine women can't make a baby in one month".
• He was talking about writing programs, not running them, but the same principle is important, and getting more important, for program execution.
Ask for more Babies.
• In its simplest form, don't write
for(i=0;i<20;i++) MakeOneBaby()
• instead write
MakeSomeBabies(20)
• The first version takes 15 years to execute.
• It must surely be clear that the second version can be done faster, depending on the resources available. :-)
Example - Cosine
• The computation of the cosine of an angle takes about 50 clock cycles.
double cosine(double ang)
• The “babies” version can manage about one cosine every 5 clock cycles once it gets going. Ten times faster.
void cosines(int ct,
double * ang, double * a)
Why is Babies cosine faster?
• 4 identical operations can be done in the vector registers at the same speed as one. This gives an immediate factor of 4.
• The hardware can start another cosine rather than wait for an intermediate result.
• Duplicated work (in this case loading the coefficients of the series expansion) can be done just once.
Even if count is 1 . . .
• With just a little care, the babies version with a count of 1 is no slower than the non-babies version.
• For example, the routine could start by testing whether the count is 1, and branching to new code if the count is not 1.
• The (not-taken) branch will be correctly predicted if the count always is 1, and usually no time at all will be lost.
Throughput, not Latency.
• for(i=0;i<20;i++) makeonebaby()
This depends on the Latency - the time from starting one to finishing that one.
• makesomebabies(20)
This depends on the Throughput - the number that can be done on average in unit time.
Latency constant. Throughput increasing.
• We seem to have hit laws of physics with latency. It is quite hard to get electronics to (say) do a double-precision multiply and get the answer faster than ~2 ns.
• But there is little to prevent it doing a lot at once. Indeed it can do forty since Sandy Bridge.
Think of x86 as many units.
• Each single x86 core . . .
• Has a huge number of execution units just waiting for the chance to do something useful for the program.
• If you can use a lot more of them at once, you can get a lot more done per unit time!
Not just arithmetic
• For example, fetches from L1 cache take two clock cycles . . .
• But fetching 32 bytes is no slower than fetching 8 (or less), and you can issue two of these every clock cycle
• So the throughput of loading 8-byte data is sixteen times what the latency alone would suggest.
Much the same with memory
• The fetch of a (64-byte) cache-line on this computer has a 180 clock-cycle latency.
• But it can usually do three at once
• and then you have 180 clock cycles to be getting on with other work that does not depend on those fetches.
• If, that is, you have something to do.
Example - database updates.
• When applying updates to a database, particularly if it is distant (i.e. has a high ping-time), it can be important to batch them up and send off the whole batch at once.
• This is an example of the “babies” concept that is already mainstream - DTO.
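A hypothetical sketch of the batching idea (not ArangoDB's actual API): accumulate updates locally and pay the round-trip ping-time once per buffer, not once per update.

```c
#include <stddef.h>

#define BATCH_MAX 64   /* updates per round trip (illustrative) */

typedef struct { int key; int value; } update_t;

typedef struct {
    update_t pending[BATCH_MAX];
    size_t   count;    /* updates buffered so far */
    size_t   flushes;  /* round trips actually made */
} batch_t;

/* Network send elided - this is where the one ping-time is paid. */
static void send_batch(const update_t *ups, size_t n) { (void)ups; (void)n; }

void batch_add(batch_t *b, int key, int value)
{
    b->pending[b->count++] = (update_t){ key, value };
    if (b->count == BATCH_MAX) {   /* buffer full: one round trip */
        send_batch(b->pending, b->count);
        b->count = 0;
        b->flushes++;
    }
}

void batch_flush(batch_t *b)      /* send whatever is left */
{
    if (b->count) {
        send_batch(b->pending, b->count);
        b->count = 0;
        b->flushes++;
    }
}
```

With this shape, 130 updates cost 3 round trips instead of 130.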
Example - heap
• A heap allows you to put a key/value pair in, and take out the one with the smallest key.
• I could tell you about a version that runs considerably faster. You need to put a few pairs in, and take a few pairs (sorted) out at each step.
• But even if you don't know (yet) how to do that, it is sensible to use the “babies” interface anyway.
• Perhaps you'll think of something later.
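A minimal sketch of the batched heap interface (hypothetical names, plain C, and without the cleverer internal algorithm hinted at above - a standard binary min-heap with the "babies" shape bolted on):

```c
#include <stddef.h>

#define HEAP_MAX 256

typedef struct { int key; int value; } pair_t;
typedef struct { pair_t a[HEAP_MAX]; size_t n; } heap_t;

static void sift_up(heap_t *h, size_t i)
{
    while (i > 0) {
        size_t p = (i - 1) / 2;
        if (h->a[p].key <= h->a[i].key) break;
        pair_t t = h->a[p]; h->a[p] = h->a[i]; h->a[i] = t;
        i = p;
    }
}

static void sift_down(heap_t *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->n && h->a[l].key < h->a[m].key) m = l;
        if (r < h->n && h->a[r].key < h->a[m].key) m = r;
        if (m == i) break;
        pair_t t = h->a[m]; h->a[m] = h->a[i]; h->a[i] = t;
        i = m;
    }
}

/* "Babies" interface: several pairs in per call... */
void heap_put_many(heap_t *h, const pair_t *ps, size_t ct)
{
    for (size_t i = 0; i < ct; i++) {
        h->a[h->n] = ps[i];
        sift_up(h, h->n++);
    }
}

/* ...and several pairs out per call, smallest key first. */
size_t heap_take_many(heap_t *h, pair_t *out, size_t ct)
{
    size_t got = 0;
    while (got < ct && h->n > 0) {
        out[got++] = h->a[0];
        h->a[0] = h->a[--h->n];
        sift_down(h, 0);
    }
    return got;
}
```

The internals are the ordinary one-at-a-time algorithm; the point is that callers already speak the batched interface, so a faster implementation can replace the body later.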
Example - Hash table
• Here the dead time is waiting for a memory fetch.
• It should be clear what the interface should be. . .
• You look-up (insert, delete, whatever) many things in one call. Give it a chance to be clever and fast.
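One way the batched call can be clever, sketched below with a hypothetical open-addressing table: issue prefetches for all the probe slots first (here via `__builtin_prefetch`, a GCC/Clang builtin), then do the probes, so the slow memory fetches overlap instead of serializing.

```c
#include <stddef.h>

#define TAB_SIZE 1024u   /* power of two, illustrative size */

typedef struct { unsigned key; int value; int used; } slot_t;

/* Fibonacci-style multiplicative hash into the table. */
static size_t slot_of(unsigned key) { return (key * 2654435761u) & (TAB_SIZE - 1); }

void table_insert(slot_t *tab, unsigned key, int value)
{
    size_t i = slot_of(key);
    while (tab[i].used && tab[i].key != key)
        i = (i + 1) & (TAB_SIZE - 1);     /* linear probing */
    tab[i].key = key; tab[i].value = value; tab[i].used = 1;
}

/* Batched lookup: found[i] set to 1 and vals[i] filled where the key exists. */
void table_lookup_many(const slot_t *tab, const unsigned *keys,
                       int *vals, int *found, size_t ct)
{
    for (size_t i = 0; i < ct; i++)       /* start all the fetches early */
        __builtin_prefetch(&tab[slot_of(keys[i])]);
    for (size_t i = 0; i < ct; i++) {     /* by now the lines may be in cache */
        size_t j = slot_of(keys[i]);
        found[i] = 0;
        while (tab[j].used) {
            if (tab[j].key == keys[i]) { vals[i] = tab[j].value; found[i] = 1; break; }
            j = (j + 1) & (TAB_SIZE - 1);
        }
    }
}
```

A one-at-a-time lookup interface makes this overlap impossible; the batched one makes it almost free.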
Example - GeoIndex
• I have very recently looked at how one can find the nearest points in a GeoIndex.
• There are several slowish steps - memory fetch, trig functions, distance computations, sorting the results etc.
• Every one of these steps can benefit greatly by handling multiple points rather than just one.
Compilers good without babies.
• If one implements, say, the cosine function (one-at-a-time interface), the code is dependency bound anyway. The exact algorithm wanted must be carefully coded to minimize the dependency chain, but if this is done in (say) C, it will run well.
• An assembler programmer will be hard pressed to get more than a few percent improvement.
“Babies” changes everything.
• If you want to write a “babies” version of cosine, it is an uphill struggle to get a compiler to generate good code.
• The use of vector instructions, the partial loop unrolling, the use of conditional moves rather than branches, and so on, make the challenge too hard for even a modern compiler.
Compilers may get better.
• Once programmers are asking more of their compilers (by writing babies versions) this may give a wake-up call.
• I would love to see a language which helps me write top-performance code!
• As of today, however, the answer is often to write the “leaf” babies routines in assembler code.
“Ninja Assembler”
• Today's assemblers have changed little in 50 years. . .
• But it is not hard to imagine a language designed for writing high-performance x86 code, but still offering many of the luxuries of a high-level language.
• One issue is portability. Portability is a big obstacle to making best use of the machine you actually have!
Portability problems.
• Laying out data for the vector instructions.
• Aligning parts of a data structure.
• Specializing types for vector use.
• When is a branch unpredictable?
• What is the width of the vector registers?
• When may a data fetch miss cache?
• Cramming data into a single cache-line.
• At the 5% level there are many more.
The 80/20 rule.
• Even for “peak performance” code, ~80% of the time is spent in ~20% of the code.
• We accept writing the 20% in assembler
• But this determines the data layout.
• So it is very hard, at the moment anyway, to write the 80% in a high-level language.
• Mainly because of portability!
Strategic view
• As of today, it seems to me that a properly programmed x86 can outperform anything.
• What I am less clear about is whether the software community can afford the extra time and cost needed to program it properly.
• My aim is to try to reduce that cost.
Summary
• Try to use “babies” interfaces at all times - even if you can't immediately see how or why you might want to make it faster.
• As of today, a small, assembler, babies routine can make a big difference to the speed of a whole system.
• And we can hope for better software development tools in future.