6.172 Lecture Notes
Performance Engineering of Software Systems

Wanlin Li

Fall 2020

1 September 3

• Definition. Work: sum total of all operations executed by a program

• Bentley rules: data structures, logic, loops, functions

• Data structure ideas:

– Packing and encoding: store more than one data value in a machine word and use fewer bits; e.g. structs

– Augmentation: add information to a structure so that common operations are faster; e.g. tail pointer for linked list appending

– Caching: store recent results to avoid recomputing

– Precomputation: perform calculations in advance to avoid doing them at critical times; e.g. precompute binomial coefficients and look them up in a table later

– Compile-time initialization: store values of constants during compile time rather than runtime

– Sparsity: avoid storing and computing on zeroes (avoid computation in the first place)

• Example. Encoding dates: could use strings or store just the year, month, and date, depending on how often the data is moved around
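A minimal sketch of the packing idea with C bit fields (the field widths are illustrative assumptions, not from the lecture):

    // Pack a date into 21 bits instead of a string.
    // Assumed ranges: year 0-4095, month 1-12, day 1-31.
    typedef struct {
        unsigned year  : 12;
        unsigned month : 4;
        unsigned day   : 5;
    } Date;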

• Example. Compressed sparse rows (CSR) for matrix representation: store the nonzero values in one array, store the corresponding column index of each value in a second array, and store the index of each row's first nonzero in a third array

• This gives storage O(n + #nonzero) instead of O(n²)

• Same trick can be used for sparse graphs
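A rough sketch of a CSR structure in C (names are illustrative):

    // Compressed sparse rows for an n x n matrix with nnz nonzeros.
    typedef struct {
        int n, nnz;
        double *vals; // length nnz: nonzero values, row by row
        int *cols;    // length nnz: column index of each nonzero
        int *rows;    // length n+1: rows[i] = index in vals of row i's first nonzero
    } SparseMatrix;

    // Row i's nonzeros are vals[rows[i]] .. vals[rows[i+1] - 1].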


• Compiler flags -O0 through -O3 indicate level of compiler optimization

• Logic rules (generally for compiler):

– Constant folding and propagation: evaluate and substitute constant expressions during compilation

– Common-subexpression elimination: avoid recomputing the same expression by reusing the result

– Algebraic identities: simplify whenever possible but beware of floating point error

– Creating a fast path: check for easy conditions first and only do expensive computation when needed

– Short circuiting: when testing, stop evaluating as soon as the answer is determined

– Ordering tests: perform more successful tests first (e.g. ||, &&)

– Combining tests: replace a sequence of tests with one test/switch

• Loops:

– Hoisting: pull out loop-invariant code from the loop interior

– Sentinels: dummy values in a data structure to simplify boundary conditions, e.g. loop exit testing

– Loop unrolling: reduce the number of loop iterations by doing more work per loop (see the sketch after this list)

– Loop fusion/jamming: combine multiple loops over the same index range

– Eliminating wasted iterations: modify loop bounds to avoid empty loop bodies
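A small sketch combining hoisting and 4-way unrolling (assumes n is a multiple of 4 to keep the example short):

    void scale(double *y, const double *x, int n, double a, double b) {
        double c = a * b;                // hoisted: loop-invariant expression
        for (int i = 0; i < n; i += 4) { // unrolled: 4 iterations' work per pass
            y[i]     = c * x[i];
            y[i + 1] = c * x[i + 1];
            y[i + 2] = c * x[i + 2];
            y[i + 3] = c * x[i + 3];
        }
    }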

• Functions:

– Inlining: avoid overhead of a function call by replacing the call with the function body (ideally convince the compiler to do it)

– Tail-recursion elimination: remove overhead of a recursive call occurring at the end of a function to save local storage

– Coarsening recursion: increase the size of the base case with more efficient code


2 September 8: Bit Hacks

• Let x = ⟨x_{w−1} … x_0⟩ be a w-bit computer word with unsigned integer value x = ∑_{k=0}^{w−1} x_k 2^k

• -x = ∼x + 1: ∼x is the one's complement and adding 1 gives the two's complement; x and -x are the same from bit 0 up through the least significant 1

• Four bits is a nibble

• Bitwise operators in C: &, |, ˆ, ∼, <<, >> are bitwise and, or, xor, not, shift left, shift right

• Left shift is undefined for signed negative numbers

• Example. Set the kth bit in a word x to 1 by x | (1 << k)

• Example. Clear the kth bit in a word x by x & ∼(1 << k)

• Example. Flip the kth bit by x ˆ (1 << k)

• Extract a bit field from x by (x & mask) >> shift

• To set a bit field in x to y, (x & ∼mask) | ((y << shift) & mask), where the first term clears the corresponding bits of x

• Swap x, y without temporary using x = x ˆ y; y = x ˆ y; x = x ˆ y

• Processor attempts to predict branches but may be wrong, which empties the processor pipeline −→ avoid unpredictable branches

• Compute the minimum without using a branch by y ˆ ((x ˆ y) & -(x < y)) because C represents the booleans true and false as 1, 0
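Wrapped as a function, the trick looks like this (a sketch):

    // Branchless minimum: (x < y) evaluates to 1 or 0, so the mask -(x < y)
    // is all ones (select x) or all zeros (select y).
    static inline int min_branchless(int x, int y) {
        return y ^ ((x ^ y) & -(x < y));
    }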

• Unpredictable branches are extremely expensive, but in some cases the compiler does a better job than manually rewriting to make the code branchless

• For branchless modular addition (division is expensive), z = x + y; r = z - (n & -(z >= n)); but the compiler can do this

• Compute 2^⌈lg n⌉ by n--; n |= n >> 1; ... n |= n >> 32; n++;

• Compute the least significant 1 by r = x & (-x);

• Count the number of trailing zeros, i.e. lg r for the power of 2 r = x & (-x), using magic!

• Definition. deBruijn sequence: (cyclic) string s of length 2^k such that each of the 2^k length-k bitstrings occurs exactly once as a substring of s
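The "magic" is a deBruijn multiply-and-lookup; a well-known 32-bit version (constant and table are from the standard bit-twiddling literature, not from the lecture) looks like:

    #include <stdint.h>

    // Count trailing zeros of x != 0: isolate the least significant 1,
    // multiply by a deBruijn constant, use the top 5 bits as a table index.
    static const int debruijn_table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    int ctz32(uint32_t x) {
        return debruijn_table[((x & -x) * 0x077CB531u) >> 27];
    }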


• Count the number of 1 bits in a word x: x &= x - 1 eliminates the least significant 1, but iterating this is no better than the naive algorithm

• Instead use parallel divide and conquer for consecutive blocks
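One standard form of the divide-and-conquer popcount (a sketch; real code would just call __builtin_popcountll):

    #include <stdint.h>

    // Sum bits in parallel over consecutive blocks: 2-bit sums, then 4-bit,
    // then 8-bit, then add all bytes with one multiply.
    int popcount64(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        return (x * 0x0101010101010101ULL) >> 56;
    }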

• Solve the n-queens problem (count the number of non-attacking queen arrangements) by backtracking search and optimize using a sparse representation of the board

3 September 10

• Four stages of compilation: preprocessing, compiling, assembling, linking

• Binary executable with debug symbols can be disassembled/reinterpreted as assembly

• Assembly allows reverse engineering the code with only access to the binary

• Instruction set architecture (ISA) specifies syntax and semantics of assembly: registers, instructions, data types, memory addressing

• x86-64 has 16 general purpose registers which are aliased to allow 32-bit, 16-bit, etc. accesses within the register

• In AT&T syntax, <op> A,B means B <- B <op> A

• Opcodes might have suffix describing data type of operation or condition code

• Arithmetic and logic operations update status flags in RFLAGS register

• Hardware compares integer operands using subtraction

• Instruction operands have direct addressing modes (immediate, register, direct memory) and indirect addressing (register indirect, register indexed)

• Definition. Base indexed scale displacement: address base + index * scale + displacement, which is good for reading arrays

• No-ops mostly serve to optimize instruction memory

• SSE opcodes for floating point values and vectors

• Vector hardware operates on k vector lanes in lock-step, usually in an element-wise fashion

• Improve processor performance by exploiting parallelism or locality

• Instruction level parallelism can be stalled by various issues that prevent execution during the designated cycle: hazards (structural, control, data)


• Decrease structural hazard latency by separating functional units for complex operations

• Then further increase ILP by fetching and decoding multiple instructions at once to keep functional units full

• For a control hazard, either stall the processor at the branch or speculate execution with branch predictors to increase effectiveness

• Simple branch predictor maintains a table mapping instruction addresses to predictions by tracking the difference in the number of branches taken vs not taken

4 September 15

• Compiler has to figure out how to choose correct assembly, assign registers, coordinate function calls

• C → LLVM IR → Assembly

• LLVM IR function syntax is similar to C, has similar instructions and instruction format to assembly, distinguishes registers by name but allows an infinite number of registers without aliasing (more like C variables)

• Assembly tracks datatypes at the end of the opcode but LLVM IR has simple opcodes with C-like datatype declarations

• LLVM IR is split into basic blocks which are executed from the beginning of the block to the end of the block without control flow; branches move among basic blocks

• Basic blocks form control flow graph

• LLVM IR uses static single assignment which is helpful for optimization

• Compiler also adds attributes to registers (e.g. nocapture, readonly)

“We have to type correctly in order to do that.”— T. Schardl, live demo

• LLVM IR solves structural problems of translating C into basic blocks with control flow

• Compiler still has to figure out translation into assembly, register allocation, and coordinating function calls


• Virtual memory in a program is organized into segments: stack (grows down toward lower addresses), heap (grows up), bss, data, text

• Stack stores return address, register state, any function arguments and local variables that don't fit in registers

• Linux x86-64 calling convention organizes the stack into frames so each function gets its own frame

• Base/frame pointer points to the top of the current stack frame and the stack pointer points to the bottom

• Different registers are caller or callee saved to avoid wasting work saving unused registers

• When a function is called, the linkage block holding non-register arguments is pushed onto the stack, then the return address, then any local variables, etc.

5 September 17: Compiler Optimizations

• LLVM supports many front ends and back ends

• Compiler performs a sequence of transformation passes on code to analyze and edit code for the sake of performance optimization

• Passes run in a predetermined order that is experimentally determined to work well most of the time

• Different compiler flags can produce reports for transformation passes

• For C, compiler optimizations focus on loops, logic, and functions rather than data structures

• Compiler is good at promoting values to registers

• Many optimizations happen on intermediate representation (IR) but not all

• Compiler can use dedicated memory-addressing hardware to speed up multiplication by small powers of 2

• For division, use fixed point numbers (fixed radix point location, as opposed to floating point)

• Compiling at -O1 is probably more useful and readable than -O0; the latter tends to be overly conservative and uses a lot of stack space


• Everywhere a local variable is pulled from the stack, check to see if it can be replaced with a register and remove any dead code

• Structures are harder to optimize because code operates on individual fields: instead bust aggregates and separate the fields; if these fields are set from registers, take the register values instead of computing the memory address of the fields

• LLVM IR can have struct-type registers for some extra optimization

• A lot of compiler potential comes from combining optimizations

• To reduce function call overhead, inline and directly copy the code over in LLVM IR

• Function inlining and extra transformations eliminate the cost of function abstraction

• Not every function can be inlined (e.g. recursive calls, functions defined elsewhere) and inlining increases code size

• If a loop doesn't do much work per iteration, coarsen the loop body because loops have overhead as well

6 September 22: Multicore Programming

• Moore's Law is no longer holding for transistors or clock speed

• Instead the number of processor cores is increasing

• Cores connected by some sort of network with a large shared memory

• Each core may have its own cache memory

• Cache coherence: maintain shared memory state that makes sense to all processors, for example if one processor changes a value in its local cache

• Value may be loaded from local on-chip cache rather than from DRAM memory

• Cache coherence gives the illusion of shared memory state with the performance boost of caching

• MSI protocol: each cache line is labeled with a state indicating whether the cache block has been modified (M), other caches are sharing the block (S), or the cache block is invalid and would result in a cache miss (I)


• A cache block marked as M may not have other caches holding this block in M or S states

• Before a cache modifies a location, it must invalidate all other copies, which is more efficient than constantly broadcasting the value (since usually only one processor needs the value at a time)

• This is usually done on the level of cache lines

• Concurrency platform abstracts processor cores to handle synchronization/communication protocols, avoiding directly interacting with processor cores

• Pthreads: DIY concurrency platform, each thread representing an abstraction of a processor

• Inputs to a pthread must be packed into a single struct, and pthreads have very high overhead

“1957 - when even Professor Leiserson was too young to code.”— J. Ragan-Kelley

• Threading building blocks: C++ library on top of native threads, where the program specifies tasks rather than threads

• Tasks are automatically balanced across threads using work-stealing

• OpenMP is another framework that supports loop/task/pipeline parallelism indicated by pragma directives

• Cilk: provides linguistic extension for fork-join parallelism with work-stealing

• Cilk keywords don’t actually force parallel execution, they just allow it

• Cilk scheduler maps program to processor cores at runtime
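The canonical fork-join example, as a sketch using OpenCilk-style keywords:

    #include <cilk/cilk.h>

    // cilk_spawn allows (does not force) the child to run in parallel with
    // the continuation; cilk_sync waits for all spawned children.
    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);
        long y = fib(n - 2);
        cilk_sync;
        return x + y;
    }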

7 September 24

“The slides are faster than your face.”— J. Ragan-Kelley, technical difficulties

• In Cilk, the parent forks a child process and control cannot pass cilk_sync until all spawned children have returned

• Pointer to stack space can be passed from parent to child but not the other way around


• Cactus stack supports multiple views in parallel, where one child cannot see its siblings' stack space

“From history books - sorry, I really am dating myself - I mean Wikipedia”— C. Leiserson

• Definition. Determinacy race: two logically parallel instructions access the same memory location and at least one of them performs a write

• Definition. Trace: DAG representing dependencies of executed instructions at runtime

“Then one day, the bug goes BAM. Nasty, nasty, nasty.”— C. Leiserson

• Two sections of code are independent if they don't result in determinacy races (read race, write race)

• Arguments to spawned function are evaluated before spawn occurs

• Updating values in packed data structures may cause races depending on the compiler (safe on x86-64)

• Cilksan -fsanitize=cilk detects races that could cause differences with the serial projection on a particular input

“Magenta, chartreuse, cerulean... Or in terms programmers can understand, pink, green, and blue.”

— C. Leiserson

• Each node of the trace DAG is a strand (sequence of instructions not containing spawn, sync, or return from spawn)

• Each edge is a spawn (down), call (down), return (up), or continue (horizontal)

“I feel like I’m giving a history lesson here, but I guess I’m qualified.”— C. Leiserson

Amdahl’s Law. If a fraction α of the code must be run serially, the speedup can be at most 1/α regardless of the number of parallel processors (e.g. if 10% of the code is serial, speedup is capped at 10).

• This is a very weak bound


• Define T_P to be the execution time on P processors

• Natural performance measures: T_1 is the work of the program and T_∞ is the span (also critical-path length or computational depth)

• T_P ≥ T_1/P and T_P ≥ T_∞

• T_1/T_P is the speedup on P processors; a speedup of P is perfect linear speedup, but usually sublinear speedup is achieved

• Superlinear speedup may be obtained due to extra cache memory

• Definition. Parallelism: maximum possible speedup T_1/T_∞

• A strand is ready when its predecessors have all executed

• In a complete scheduling step, there are ≥ P strands ready and a greedy scheduler can run any P

• In an incomplete step, all < P ready strands are run

• Any greedy scheduler achieves T_P ≤ T_1/P + T_∞, which is optimal up to a constant factor

• The number of complete steps is at most T_1/P, and each incomplete step reduces the span of the remaining DAG by 1

8 September 29

• Use matrix transpose as an example for parallel loops

“That was the first time I learned that if you ask for help, you look stupid. But if you don’t ask for help, you are stupid.”

— C. Leiserson

• Cilk loops are implemented using divide and conquer spawns and syncs

“You see the lavender part - that’s a fancy word for purple.”— C. Leiserson

• Results in complete binary tree as computation DAG

• Total work including loop overhead is Θ(n²) and in general the loop overhead is proportional to the number of iterations
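The running example might look like this with only the outer loop parallel (a sketch):

    #include <cilk/cilk.h>

    // In-place transpose; the cilk_for is compiled into divide-and-conquer
    // spawns and syncs over the iteration range.
    void transpose(double *A, int n) {
        cilk_for (int i = 1; i < n; ++i) {
            for (int j = 0; j < i; ++j) { // serial inner loop
                double t = A[i * n + j];
                A[i * n + j] = A[j * n + i];
                A[j * n + i] = t;
            }
        }
    }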


• When parallelizing just the outer loop in matrix transpose, the span of the loop control is Θ(lg n) and the max span of the body is Θ(n), so the span is Θ(n)

• When both loops are parallelized, the spans of the outer and inner loop controls are Θ(lg n) and the span of the body is O(1)

• Spans of nested loops add, but in the matrix transpose case there is a lot of loop spawn overhead because there are Θ(n²) spawns and each body operation is very fast

• Can reduce function overhead by coarsening recursion with #pragma cilk grainsize G

• With a grainsize of G, I work per iteration, and S work per spawn, the total work is nI + (n/G − 1)S because the number of internal nodes in the computation DAG is n/G − 1

• Span is GI + S lg(n/G)

• Work wants large G and span wants small G; a good balance point is G > S/I with G small, giving a parallelism of Θ((n/lg n)(I/S))

• Automatic Cilk scheme usually does reasonably well at picking grain size

• Definition. Puny parallelism: parallelism value of 1

• Goal: generate 10 times more parallelism (minimize span) than processors for near-perfect linear speedup

• C preprocessor has token-pasting operator ## which allows access to variable names

9 October 1

“If someone asks you for an algorithmic strategy, just guess divide and conquer even if you don’t understand the question. Either that or dynamic programming, but guess one of those. Just listen to me and you’ll get far - look where I am! I’m a professor at MIT.”

— C. Leiserson

• Parallel merge in mergesort finds the median of the longer array and binary searches in the other array for that median value

• This guarantees that the recursive call has at most 3/4 of the original elements


• Definition. Work efficiency: if T_S(n) is the running time of the best serial algorithm on an input of size n and T_1(n) is the work of the parallel program (on one processor) on the same input, the work overhead is the worst-case ratio λ(n) = T_1(n)/T_S(n) ≥ 1; a program is work efficient if T_1(n) ≈ T_S(n) and asymptotically work efficient if T_1 = Θ(T_S)

• If λ = T_1/T_S is the worst-case work overhead, then T_P ≥ λT_S/P

• If work overhead is high, it is impossible to get near-perfect linear speedup over serial code no matter the number of processors, and processing power is being wasted

• Definition. Hairy recurrence: recurrence that cannot be solved using the Master Theorem

10 October 6

• Measuring performance is important so that small optimizations can add up

• Definition. Dynamic voltage and frequency scaling: dynamically trade power for performance by adjusting the clock frequency and supply voltage to transistors

• Example. Reduce operating frequency if the chip is too hot or running on battery power

• Turbo boost increases frequency if the chip is cool

• Power is proportional to CV²f where C is dynamic capacitance, V is supply voltage, f is clock frequency

• This can influence performance measurements

• Different kinds of summary statistics to optimize

• To account for noise, take many head-to-head comparisons between programs and evaluate them statistically

• For deterministic programs, the raw performance of software is best measured by the minimum runtime

• Unix time command outputs real, user, sys times

• Real is wall-clock time, user is processor time spent in user-mode code outside the kernel, sys is processor time spent in the kernel on behalf of the process

• clock_gettime(CLOCK_MONOTONIC) never runs backwards and is usually faster than a system call


• Probe effect: program instrumentation (timing calls) alters program behavior in unintended ways

• Profile by sampling: randomly stop the program a large number of times and determine which function is running, which gives a rough profiling report

• Instead of runtime, performance surrogates include work (number of executed instructions), processor cycles, memory accesses, system calls, span

• Some systems support hardware counters but they may be nondeterministic and can have accuracy or performance side-effects

• Simulators (e.g. cachegrind) are robust to noise and the probe effect but often run much slower than real time

• Reducing variability makes it easier to account for noise

11 October 8

• Cilk concurrency platform allows expression of logical parallelism and the scheduler then uses a work-stealing method to map calls to cores

• Parallel speedup is given by T_S/T_P where T_S is the work of the serial program

• Each worker maintains a work deque of strands that are ready to be called and manipulates the bottom of the deque like a stack

• When a worker runs out of work, it steals from the top of another random deque and must steal one whole spawn/call unit at a time

• Cilk work-stealing achieves expected running time T_P ≈ T_1/P + O(T_∞)

• The first term corresponds roughly to time workers spend working and the second term corresponds to time spent stealing

• If the program has a lot of parallelism, T_1/T_∞ ≫ P, and if the program is work efficient then T_S/T_1 ≈ 1

• Definition. Work-first principle: optimize for ordinary serial execution at the expense of overhead in steals, in programs with sufficient parallelism

• Each worker sees a different view of the cactus stack

• Upon a successful steal, a worker can resume the stolen function mid-execution; after syncing, the worker needs to know if there is any spawned subroutine still executing on another worker


• Each worker has a work deque that stores functions ready to execute, a Cilk stack frame structure to represent each spawned function, and a full frame tree to represent stolen function instances

• This is a collaborative process between compiler and runtime: the compiler manages small data structures and optimized fast paths for functions with no stealing, while the runtime library handles steals and slow paths

• Processor continues with the spawned portion and the continuation of the parent call is pushed onto the deque, available for stealing

• On a spawn, the current frame is pushed onto the bottom of the deque, and the frame is popped when returning from the spawn (only possible when the frame was not yet stolen)

• Work thief steals from the top of the deque, so synchronization is needed between worker and thief so that they do not fight over the same frame

• If the worker's pop succeeds, it continues as normal and otherwise becomes a thief

• Optimize for the worker's successful pop; synchronization between worker and thief is only needed when the deque is almost empty

• Definition. Tail-head-exception protocol: workers pop the deque optimistically (without grabbing the lock) and only grab the lock if the deque appears empty; a thief always grabs the lock

• After a successful steal, processor state must be restored

• cilk_sync waits on child frames and not just workers

• Full-frame tree keeps track of a join counter of the number of un-synced child frames, references to parent and child full frames, and references into the cactus stack

• Victim's new full frame is a child of the stolen full frame

• Frame suspends at sync if it still has active child frames; with enough parallelism, sync is usually expected to succeed because there are not many steals (usually follows the fast path)

• Cilk stack frame maintains a flag indicating frame status, which gets set when stolen (infrequent)

• On a sync, the frame suspends and processor state is saved if the flag is set


12 October 20: Storage Allocation

• Kinds of storage management: stack, heap, garbage collection

• C call stack stores local variables for each function instantiation, pushing frames onto the stack when a function is called and popping on return

• Heap provides memory space to the programmer that can be allocated and deallocated without constraint, e.g. malloc and free for C or new and delete for C++

• Heap storage must be freed explicitly to avoid memory leaks, and improper usage can result in dangling pointers or double freeing

• Garbage collection is provided in most higher level languages and storage does not need to be explicitly freed, but this has performance costs because allocated storage is usually not in the L1-cache

• Stack can be implemented using an array with a pointer, with everything from the start of the array to the pointer being used and everything to the end of the array being unused

• Reading or writing past the end of the stack (stack overflow) results in a segfault

• Allocating and freeing a stack takes O(1) time with the single constraint that freeing must be consistent with the stack discipline of LIFO

• Bitmap allocator for fixed-size heap allocation uses a bitmap to keep track of which blocks of an array are free and which are used, which allows block sizes to be arbitrarily small

• Searching for a free block is not very scalable with a bitmap allocator, so a multilayer hierarchy is more helpful

• Instead of a bitmap, a free list keeps a linked list of fixed-size unused blocks where each block is at least as large as a pointer, so that free list pointers can be stored in the unused block and overwritten on allocation

• When freeing an object x, set the next pointer of x to free and set free to x
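A minimal free-list sketch (assumes fixed-size blocks at least as large as a pointer):

    typedef struct Block { struct Block *next; } Block;
    static Block *free_list = NULL; // head of the list of unused blocks

    void *alloc_block(void) {
        if (free_list == NULL) return NULL; // real code would get more memory
        Block *b = free_list;
        free_list = b->next; // pop: the stored pointer is overwritten by the user
        return b;
    }

    void free_block(void *x) {
        Block *b = (Block *)x;
        b->next = free_list; // set the next pointer of x to free
        free_list = b;       // set free to x
    }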

• This has good temporal locality (operating like a stack) but poor spatial locality due to external fragmentation, wasting space between blocks

• Mitigate external fragmentation by keeping a free list or bitmap per disk page and allocating from the free list for the fullest page that is not yet full, which incentivizes the system to keep information in complete pages and improves spatial locality


• For variable size allocation, keep binned free lists where bin k holds memory blocks of size B_k, allowing a bounded amount of internal fragmentation (block is larger than the allocated data)

• To allocate x bytes, check to see if the appropriate bin has space available and otherwise find a block in the next larger nonempty bin and split it into smaller blocks, distributing the pieces

• If all larger bins are empty, ask the OS to allocate more memory

• In practice the stack never grows into the heap

• If virtual addresses are continuously allocated and never freed, the external fragmentation would be terrible

Theorem. Suppose the maximum amount of heap memory in use at any time is M. The virtual memory consumed by heap storage with a binned free list manager is O(M lg M).

• Binned free list is constant factor competitive

• Coalescing combines adjacent smaller blocks into larger block

• Garbage collector identifies and recycles objects that are no longer accessed

• Definition. Root: object directly accessible by the program (global variables, stack variables, etc.)

• Definition. Live objects: all objects reachable from roots by following pointers

• Definition. Dead objects: inaccessible objects that can be recycled

• Garbage collection identifies pointers by strong typing (pointers known at compile time) or by prohibiting pointer arithmetic (may slow down programs)

• Reference counting keeps a count of the number of pointers referencing each object, freeing when the count reaches 0

• However, a cycle is never garbage collected in reference counting schemes

“Uncollected garbage stinks!”— J. Ragan-Kelley

• Graph abstraction uses BFS from roots to find and mark all live objects


• Mark and sweep algorithm has a mark stage (BFS and mark live objects) and a sweep stage (scan over memory to free unmarked objects)

• This does not handle fragmentation

• Definition. Stop and copy garbage collection: as objects are allocated and freed, the free list contains some live, dead, and unused blocks; when the free list is full, use mark-and-sweep to mark live objects while using a second free list as the BFS queue, which copies all live objects to contiguous storage in the queue

• The queue implementation does not erase objects on pop, only updating pointers

• When an object is copied to the second free list, store a forwarding pointer in the initial list that marks it as moved, and update all pointers of objects as they are popped off the queue

• This helps de-fragment memory, giving a tradeoff between the overhead of garbage collection and the performance benefit

• Determination of when a free list is full operates like a dynamic list

13 October 22

• malloc(size_t s) returns a void* pointer to a block of memory containing at least s bytes

• C is only weakly typed and the malloc return value should be immediately cast to the appropriate pointer type

• Aligned allocation from void* memalign(size_t a, size_t s), which returns a pointer to at least s bytes aligned to a multiple of a, which must be a power of 2

• void free(void *p) deallocates a block starting at p (which has to have been allocated using malloc or memalign)

• mmap system call allocates virtual memory by finding a sufficiently large contiguous unused region in the address space, modifying the page table, and managing the virtual memory structure correctly so that accesses to this area do not segfault

• This call is lazy and does not immediately allocate physical memory, instead pointing the page table to a special read-only zero page

• The first write into this page causes a page fault, forcing the OS to allocate a physical page and modify the page table correctly
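A usage sketch (the size is illustrative); no physical pages are installed until first touched:

    #include <sys/mman.h>

    // Reserve 1 GiB of anonymous virtual memory. The OS installs physical
    // pages lazily, on the first write to each page.
    void *reserve_gig(void) {
        void *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }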


• malloc first tries to reuse freed memory when possible, and mmap requests more overall memory

• Virtual address consists of a virtual page number along with an offset, which gets translated to a physical address with a physical frame index and the same offset

• If the virtual page is not in physical memory, a page fault occurs

• Page table lookups are costly, so hardware has a translation lookaside buffer (TLB) to cache recent page table lookups

• Stack managed memory allows a parent to pass its stack variable pointers to children but not the other way around; passing memory from child to parent must be on the heap

• In serial code, a normal stack suffices; in parallel, a cactus stack allows multiple views of the stack at the same time

• The simplest way to implement this is a heap-based cactus stack which allocates frames off the heap using malloc and free

• Each of these frames must explicitly carry a pointer to the parent frame

• This doesn't work well with legacy/third-party serial binaries, so Cilk uses a pool of linear stacks that is less space efficient but more versatile

• If S_1 is the stack space required for the serial execution, a P-worker heap-based cactus stack execution requires space S_P ≤ P·S_1, which is due to the busy-leaves property of the work-stealing algorithm (every active leaf frame has a worker executing it)

• Allocator speed is the number of allocations/deallocations that can be sustained per second; optimizing for smaller blocks is more important

• Definition. User footprint: maximum number of bytes M in use over the lifetime of the program; allocator footprint: maximum number H of bytes provided to the allocator by the OS; fragmentation F = H/M and space utilization M/H

• H tends to grow monotonically with time regardless of whether memory is properly freed

• Fragmentation for binned free lists is F = O(lg M)

• Definition. Space overhead: space used by allocator for bookkeeping

• Definition. Internal fragmentation: waste due to allocation of larger blocks than requested


• Definition. External fragmentation: waste due to inability to use storage because contiguous blocks are insufficiently large

• Definition. Parallel blowup: additional space required beyond that of a serial allocator

• Default C allocator uses a single global heap with accesses mediated by a mutex to preserve atomic operations; this has blowup 1 but every thread needs to obtain the lock, which is slow

• Allocator should scale well with the number of threads/processors

• Poor scalability is often due to lock contention, which again affects small blocks the most

• Instead, each thread could allocate out of its own heap to avoid locking, which is fast, but it has an issue of memory drift when a block is allocated by one thread and freed by another heap/thread, leading to unbounded blowup

• Memory is taken from one heap and freed into another heap, so the other heaps keep growing in size while the allocating heap keeps requesting more fresh memory

• Yet another strategy is to label each object with its owner so that freed objects are returned to the owner's heap

• This is fast for allocating and freeing local objects, but freeing remote objects requires synchronization

• Blowup is at most P and this method is resilient to false sharing

• True sharing refers to an actual shared memory location; false sharing involves different memory locations on the same cache line, when different threads process nearby objects

• Allocator could actively induce false sharing by satisfying memory requests from different threads using the same cache block

• Hoard allocator has P local heaps and one global heap where memory is organized into superblocks of size S, where only superblocks are moved between local heaps and the global heap

“Heap, heap, array!”— J. Ragan-Kelley

19

Page 20: 6.172 Lecture Notes

6.172 Notes Wanlin Li

• When mallocing on thread i, first check the fullest non-full superblock in the local heap and otherwise pull from the global heap; if the global heap has no memory left, request more from the OS

• If m_i is storage used by heap i and h_i is storage owned by that heap, then m_i ≥ min(h_i − 2S, h_i/2) where S is the superblock size, by augmenting the free function to yield memory from the local heap to the global heap

14 October 27

• Multicore cache hierarchy first checks for contents in L1 (closest to processor), then L2, L3, etc. all the way to DRAM, and moves the memory back to the processor

• Retrieving a cache block from DRAM costs around 20,000 normal instructions in terms of time

“Back then, one gigabyte of memory cost $2 billion. As a student, I couldn’t afford that.”

— C. Leiserson

• Fully associative cache allows a cached block to reside anywhere in the cache, so every line in the cache must be searched for the tag

• Cache block is the data that is moved around and cache line is the physical hardware storing the data (usually used interchangeably)

• If the virtual address space is w bits and the block/line size is B, the byte address tag is w − lg B bits and the offset is lg B bits (offset not stored in cache)

• Direct mapped cache splits the virtual address space into sets, where a cache block's set uniquely determines its cache line

• This makes checking for a block in the cache very fast and it has high spatial locality (e.g. reading variables off a stack frame)

• If the cache size is M, the byte address consists of a tag of length w − lg M, a set index of length lg(M/B), and an offset of length lg B

• Set-associative cache maps each set to k cache lines (e.g. M/B-way associativity is fully associative)
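A worked decomposition for assumed sizes (direct-mapped, B = 64-byte lines, M = 32 KiB, so M/B = 512 sets; offset = lg 64 = 6 bits, set index = lg 512 = 9 bits, tag = the remaining w − 15 bits):

    #include <stdint.h>

    // Split a byte address for a direct-mapped 32 KiB cache with 64-byte lines.
    void split_address(uint64_t addr) {
        uint64_t offset = addr & 63;         // low lg B = 6 bits
        uint64_t set    = (addr >> 6) & 511; // next lg(M/B) = 9 bits
        uint64_t tag    = addr >> 15;        // remaining w - lg M bits
        (void)offset; (void)set; (void)tag;
    }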

• Definition. Cold miss: first time the cache block is accessed (may be mitigated by pre-fetching)


• Definition. Capacity miss: previously cached copy would have been evicted even with a fully associative cache

• Definition. Conflict miss: eviction due to too many blocks from the same set in the cache

• Definition. Sharing miss: in a multicore system, when another processor has exclusive access to the cache block (may be true-sharing or false-sharing)

• Conflict misses are the main bottleneck in serial code and are problematic for caches with limited associativity

• Definition. Ideal cache model: two-level hierarchy with a fully associative cache of size M bytes and line length B bytes; uses optimal and omniscient replacement (oracle access)

• Performance measures: work (ordinary running time T) and cache misses Q

LRU Lemma. Suppose an algorithm incurs Q cache misses on an ideal cache of size M. Then a fully associative cache of size 2M using LRU replacement incurs at most 2Q cache misses.

• First design a theoretically good algorithm, then engineer the performance details (set associativity, anomalies)

Segment Caching Lemma. Suppose a program reads a set of r data segments, where the ith segment consists of s_i contiguous bytes. Suppose further that ∑_{i=1}^{r} s_i = N < M/3 and N/r ≥ B. Then all segments fit into cache, and reading all of them requires at most 3N/B cache misses.

• Note that the minimum number of misses is N/B

• A single segment incurs at most s_i/B + 2 misses

• Tall cache assumption: B² < cM for some sufficiently small constant c ≤ 1

Submatrix Caching Lemma. Suppose that an n×n submatrix A is read into a tall cache with B² < cM for some constant c ≤ 1, and further suppose that cM ≤ n² < M/3. Then A fits into the cache, with at most 3n²/B cache misses to read all elements of A.


• Tiling: break matrix into submatrices that fit into cache

• This does not scale too well; instead use divide and conquer for a cache-oblivious algorithm with the same results

“All our logs are base 2 because we’re computer scientists, not electrical engineers.”

— C. Leiserson

15 October 29: Cache-Oblivious Algorithms

• Simulate heat diffusion ∂u/∂t = α(∂²u/∂x² + ∂²u/∂y²) from some starting state, where α is the thermal diffusivity

• In 1D with boundary conditions, the steady state has ∂u/∂x > 0, ∂²u/∂x² = 0, ∂u/∂t = 0

• Use a finite difference approximation of ∂²u/∂x², using the central finite difference for both ∂u(t, x)/∂x and the second derivative

• Stencil computation updates each point in the array by a fixed stencil pattern

• Update the 2D array using a 3-point stencil with some ∆t, ∆x and a starting condition:

    u[t+1][x] = u[t][x] + ALPHA * (u[t][x+1] - 2 * u[t][x] + u[t][x-1])

• Save space by storing only two time steps at once
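A sketch of the update loop with the two-row trick (array size and names are assumptions):

    #define N 100000
    static double u[2][N]; // only two time steps stored, indexed by t % 2

    void run_stencil(int T, double alpha) {
        for (int t = 0; t < T; ++t)
            for (int x = 1; x < N - 1; ++x)
                u[(t + 1) % 2][x] = u[t % 2][x]
                    + alpha * (u[t % 2][x + 1] - 2 * u[t % 2][x] + u[t % 2][x - 1]);
    }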

“I last had a physics class in 1972.”— C. Leiserson

• Ideal cache model is fully associative and has a cache size of M bytes and a cache line length of B bytes

• Time loop outside space loop should vectorize and prefetch well

• Assume length N > M, use LRU replacement, and consider just the data reads

• At the beginning, there is one miss per cache block, for N/B cold misses

“Right off the bat...”— C. Leiserson, as a bat flies across the screen

• Reading each subsequent row incurs N/B capacity misses, for a total of Θ(NT/B)


• Cache-oblivious 3-point stencil recursively traverses trapezoidal regions of space-time points (t, x) such that t0 ≤ t < t1, x0 + dx0·(t − t0) ≤ x < x1 + dx1·(t − t0), with dx0, dx1 ∈ {−1, 0, 1}

• If the width (average of base lengths) is at least twice the height (squat trapezoid), the trapezoid is cut with a line of slope −1 through the center to recurse on left and right

• The left trapezoid must be computed before the right parallelogram

• In the case of a tall trapezoid, use a time cut as a horizontal line through the center and compute the lower trapezoid first

• This structure can also be implemented with only two arrays

• Each recursive step takes constant bookkeeping and the bulk of the work is at the leaves

• The bottom of a leaf trapezoid fits into the cache, with width w = Θ(M) and Θ(hw) = Θ(w²) points for Θ(w²) work

• At a leaf, there are Θ(w/B) cache misses (recall only two arrays are needed)

• There are Θ(NT/(hw)) leaves, and the internal nodes make a minor contribution to both work and cache misses

• This accounts for Θ(NT/(BM)) cache misses and Θ(NT) work

• Cache-oblivious trapezoidal decomposition is not that much faster despite having many fewer cache misses, due to prefetching and hardware support

Theorem. Let Q_P be the number of cache misses in a deterministic Cilk computation on P processors, each having a private cache of size M, and let S_P be the number of successful steals during the computation. In the ideal cache model, Q_P = Q_1 + O(S_P·M/B).

• Space cuts can be parallelized by first computing the common portions needed for the trapezoids and then spawning off each half

• Four impediments to speedup: insufficient parallelism, scheduling overhead, lack of memory bandwidth, contention (locking/sharing)

• First two problems can be detected by Cilkscale


16 November 3

• Definition. Deterministic program: every memory location is updated with the same sequence of values in every execution on the same input

• Deterministic program may update two different memory locations in different orders, but each location sees the same sequence of updates

• Nondeterministic parallel programs can exhibit anomalous behaviors and they are ridiculously hard to debug

• However, determinism may inhibit performance; include testing strategies that turn off nondeterminism, encapsulate nondeterminism, substitute deterministic alternatives, or use analysis tools

• Definition. Atomicity: a sequence of instructions is atomic if the rest of the system cannot ever view it as partially executed

• Definition. Critical section: piece of code that accesses a shared data structure that must not be accessed by two threads at the same time

• Definition. Mutual exclusion: each thread claims exclusive control over its held resources

• Definition. Mutex: object with lock and unlock functions; a thread that tries to lock an already locked mutex blocks until the mutex is unlocked

• Example. In parallel insertions to the same hash table, make each slot a struct with a mutex that must be locked and unlocked to implement atomicity
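A sketch of the slot structure with pthreads (the Node type is illustrative):

    #include <pthread.h>

    typedef struct Node { struct Node *next; /* key, value... */ } Node;
    typedef struct {
        pthread_mutex_t lock;
        Node *head;
    } Slot;

    // The list update is a critical section protected by the slot's mutex.
    void slot_insert(Slot *s, Node *n) {
        pthread_mutex_lock(&s->lock);
        n->next = s->head;
        s->head = n;
        pthread_mutex_unlock(&s->lock);
    }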

• Cilksan only guarantees to find determinacy races in a program with no mutexes

• Locks prevent data races, not determinacy races

• Definition. Data race: two logically parallel instructions holding no common locks access the same memory location, and at least one instruction performs a write

• Data-race-free programs obey atomicity but can still be nondeterministic due to different orderings of obtaining the lock

• Absence of data races does not mean absence of bugs, and presence of data races does not imply presence of bugs

• Example. Benign races in identifying the set of digits in an array by keeping a bitmap of whether digits have been seen, assuming hardware writes array elements atomically


• Cilksan race detection may be turned off for intentional races, but other/better solutions exist, e.g. fake locks

• Yielding mutex returns control to the OS when it blocks; spinning mutex consumes processor cycles while blocked

• Reentrant mutex allows a thread to acquire a lock it already holds, while a nonreentrant mutex deadlocks if a thread tries to reacquire a mutex it already holds

• Fair mutex puts blocked threads on a FIFO queue while an unfair mutex lets any blocked thread go next

• Reentrant mutexes are actually hard to use correctly; nonreentrant are more common

• Assembly can have atomic code in the implementation of a mutex, with a pause instruction to unconfuse the speculating pipeline

• Spinning mutex does an atomic exchange between a local register and the mutex value and tests to see if the local register indicates that the exchange was successful (another process may have acquired the mutex while this thread was trying to acquire it), returning to the spinning state again if unsuccessful
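In C11 atomics the same loop might look like this (a sketch; the inner read-only spin is a common refinement to cut coherence traffic, not from the lecture):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef atomic_bool SpinMutex;

    void spin_lock(SpinMutex *m) {
        // Exchange true into the mutex; if the old value was already true,
        // another thread holds it, so go back to spinning.
        while (atomic_exchange(m, true)) {
            while (atomic_load(m))
                ; // spin on reads until the lock looks free
        }
    }

    void spin_unlock(SpinMutex *m) {
        atomic_store(m, false);
    }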

• Yielding mutex implementation uses a call to pthread_yield (give control to the OS and put this thread to sleep) instead of pause

• pthread_yield is expensive and spinning may be faster even though it may waste a few cycles

• Opposing goals of claiming the mutex soon after release and minimizing cycle wastage

• Competitive mutex spins for some time (as long as the time of a context switch) and then yields, so that threads never wait longer than twice the optimal time

• Clever randomized algorithm can achieve an e/(e−1) bound (see 6.854 pset)

• Holding multiple locks at once can lead to deadlock

• Conditions for deadlock: mutual exclusion, nonpreemption (thread does not release held resources until use is completed), circular waiting (cycle of threads in which each thread is blocked waiting for resources held by the next thread)

“So instead of dining philosophers, we have starving philosophers.”— J. Ragan-Kelley

• Assume that mutexes can be linearly ordered so that each thread only attempts to acquire locks in increasing order


• However, Cilk can be deadlocked with just one lock by waiting for a sync, so mutexes should generally not be held across strands

• Definition. Lock convoy: multiple threads of equal priority repeatedly contend for the same lock and move forward slowly even though the threads do not deadlock

• Example. In an early Cilk implementation, every worker attempted to steal from a single worker, causing almost all thieves to block

• Solve the convoying problem by using try_lock instead of lock to avoid blocking on an unsuccessful attempt

• Definition. Contention: parallel threads attempt to access the same lock, which must occur in serial, effectively killing parallelism

• With a greedy scheduler, T_P ≤ T_1/P + T_∞ + B where B is the total time of all critical sections (bondage)

• This bound is weak, e.g. if many small mutexes each protect different critical regions which may still occur in parallel

• Transactional memory forces atomicity of regions of code:

– On transaction commit, all memory updates in the critical region appear to take effect simultaneously

– On transaction abort, no memory updates appear to take effect and the transaction is restarted

– Restarted transaction may take a different code path

• Transactional memory concerns: conflict (multiple transactions attempt to access the same transactional memory location concurrently), contention resolution (deciding which of the conflicting transactions to wait/abort/restart), forward progress (avoiding deadlock, livelock, starvation), throughput

17 November 5: Synchronization Without Locks

• Definition. Sequential consistency: the sequences of instructions defined by each processor's program are interleaved with the instruction sequences of the other processors to produce a global linear order of instructions

• All stores and loads in a program must behave as though they occur in some linear order, and the load for some location reads the value written by the most recent store


• Most implementations of mutual exclusion employ an atomic read-modify-write instruction (usually for a lock)

• Mutual exclusion can be implemented with loads and stores as the only memory operations (without atomic operations)

“Alice wants to frob the widget and Bob wants to borf it, and the widget cannot be frobbed and borfed at the same time.”

— J. Ragan-Kelley

Peterson’s Algorithm: The following protocol achieves mutual exclusion on the critical section.

Alice:
    A_wants = true;
    turn = B;
    while (B_wants && turn == B);
    frob(&x); // critical section
    A_wants = false;

Bob:
    B_wants = true;
    turn = A;
    while (A_wants && turn == A);
    borf(&x); // critical section
    B_wants = false;

This also guarantees starvation freedom, in which Bob cannot execute his critical section twice in a row while Alice is trying to execute the critical section.

• If both parties try to enter the critical section, whoever last writes to the turn variable spins and the other progresses

• Modern memory models do not actually use sequential consistency

• Hardware actively reorders instructions and the compiler may also reorder instructions for performance gains (namely maximizing instruction level parallelism by decreasing load latency)

• Instruction reordering of loads and stores is safe when the load/store requests are to different locations and there is no concurrency

• Stores can be issued faster than they are handled and they get put into a store buffer; loads bypass the store buffer and take priority because they could potentially stall the processor

• If a load address matches an address in the store buffer, the store buffer returns the result, and so loads can bypass stores to different addresses

• Definition. Total store order: loads are not reordered with loads, stores are not reordered with stores or with prior loads, loads are not reordered with prior stores to the same address, and neither stores nor loads are reordered with locks; globally, stores to the same location must obey a total order, locks must respect a global total order, and the memory ordering must preserve transitive visibility (causality)

• Instruction reordering violates sequential consistency and breaks Peterson's algorithm because the loads of B_wants and A_wants might be reordered before the respective stores of A_wants, B_wants

• Definition. Memory fence: hardware action that enforces an ordering constraint between instructions before and after the fence; may be issued explicitly as an assembly instruction or implicitly through locking/synchronizing instructions

• Memory fence cost is comparable to an L2-cache access, and compiler fences are also needed to fully fix Peterson's algorithm
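A sketch of where the fence goes in Alice's side, using C11 atomics (which also act as compiler fences; variable names follow the code above, and frob is the hypothetical critical section):

    #include <stdatomic.h>
    #include <stdbool.h>

    enum { A = 0, B = 1 };
    atomic_bool A_wants, B_wants;
    atomic_int turn;

    void alice(void) {
        atomic_store(&A_wants, true);
        atomic_store(&turn, B);
        // Fence: the stores above must be visible before the loads below.
        atomic_thread_fence(memory_order_seq_cst);
        while (atomic_load(&B_wants) && atomic_load(&turn) == B)
            ;
        // frob(&x); // critical section
        atomic_store(&A_wants, false);
    }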

Theorem. Any n-thread deadlock-free mutual exclusion algorithm using only load and store memory operations requires Ω(n) space.

Theorem. Any n-thread deadlock-free mutual exclusion algorithm on a modern machine must use an expensive operation such as a memory fence or an atomic compare-and-swap.

• Atomic memory operations for lock-free synchronization: load, store, CAS (compare-and-swap)

• CAS semantics:

    bool CAS(T *x, T old, T new) {
        if (*x == old) {
            *x = new;
            return true;
        }
        return false;
    }

Theorem. An n-thread deadlock-free mutual exclusion algorithm using CAS can be implemented using Θ(1) space.

• The lock operation will spin on !CAS(&lock, false, true), which spins until the lock is released (set to false) and this thread atomically sets it to true

• Another way of using CAS is updating a variable only if its value has not changed since it was read

• Example. Lock-free push to a linked list


    void push(Node *node) {
        do {
            node->next = head;
        } while (!CAS(&head, node->next, node));
    }

• Compare-and-swap acquires the cache line in exclusive mode and invalidates the cache line in other caches, which may cause high contention; add logically redundant checks to improve performance

• Transactional memory allows executing an entire block of code atomically

ABA Problem: A stack begins to pop the head node but stalls after reading current->next. A second thread successfully pops the head and its successor and then pushes another node to the stack while reusing the memory that was freed by the head pointer. Then if the first thread resumes, its CAS succeeds and it removes the head, but it inserts garbage onto the list.

• There are stronger methods than CAS

18 November 10

• To speed up networks (internet), first avoid using the network; then avoid copying data

• Data packet is copied from the network interface card to kernel space and then to user space; recently work is being pushed more towards the NIC to avoid extra copies

• Set the socket to be non-blocking so that other useful work can be performed while waiting, or spawn a thread for every connection so that other threads can continue working

• Internet service makes its best effort to deliver a packet to the destination address stored in the header, and TCP (transmission control protocol) may step in to assist

• TCP three-way handshake to set up a connection and prevent spoofing of the return address

“It’s like driving down the street and sticking mail to be sent in other people’s mailboxes. You normally don’t do that.”

— B. Maggs

• Definition. Secure socket layer: standard web protocol with handshake (including verifications and key exchange) followed by data exchange; replaced by TLS

29

Page 30: 6.172 Lecture Notes

6.172 Notes Wanlin Li

• These protocols require several costly round-trips between server and client, which can be sped up by decreasing the distance between server and client

• DNS resolution has many queries and round-trips but DNS caching helps reduce latency

• Internet is much slower than the speed of light because there are many round trips for accomplishing basic tasks before data can be sent

“If you decide to go to grad school, you’ll look more and more like your advisor. Look at me and look at Charles. Pick your advisor wisely.”

— B. Maggs

19 November 17

“We sized the project so that everyone can do well working normal hours without overtime. Not that any of you listen to me.”

— C. Leiserson

• Speculative parallelism if extra potential performance is needed

• Writes to int* should be atomic as opposed to writing to something smaller

• Definition. Speculative parallelism: program spawns some parallel work that would not be performed by the serial projection

• Speculative parallelism for alpha-beta search: α is the best guaranteed score and β is the maximum score allowed to this player by the opponent

• Move ordering greatly affects performance due to short-circuiting on beta cutoffs

• Usually consider a game tree with branching factor b and depth d with O(b^d) nodes

Theorem. For a game tree with branching factor b and depth d, an alpha-beta search with moves searched in best-first order examines exactly b^⌈k/2⌉ + b^⌊k/2⌋ − 1 nodes at ply k. This accounts for approximately 2b^{d/2} nodes overall.

“They do a lot of complicated things later in the paper that nobody reads, but this part I think they read.”

— C. Leiserson, on Knuth and Moore alpha-beta search analysis

• Thus alpha-beta doubles the search depth, square roots the work, and square roots the branching factor


• In a best-ordered tree, the degree searched at every node is either 1 or b

• Young siblings wait: speculative parallelism of alpha-beta search first checks if the first (hopefully best) child fails to generate a beta cutoff and then searches the remaining children in parallel, ideally wasting little work

• The span would be analogous to the work of searching a tree with branching factor 2, which is 2·2^{d/2}, giving a parallelism of (b/2)^{d/2}

• This would waste work if the best child is not first, since game trees are usually not best ordered

“These are younger siblings, so they have to wait. The first one has priority.”— C. Leiserson

• One idea for aborting unnecessary work is to propagate scores up the tree to keep alpha/beta values current, polling up the tree and aborting if possible, but this is difficult to implement correctly

“I don’t recommend it. It’s hard to implement and you will tear your hair out, unless you’re like me and don’t have hair to tear out.”

— C. Leiserson

• Zero-window search prunes more aggressively than full-window search, checking only for comparison to α with β = α + 1

• Jamboree search uses zero-window search first and then a full-window search only when necessary

• Simple parallelization introduces races in the best-move history table, killer table, and transposition table

20 December 1: TSP Walkthrough

“If your audience is literate and you’re giving a power point presentation, you can assume they can read.”

— J. Bentley

“A famous computer programmer back in the day, Abraham Lincoln, faced this problem when he was doing a little lawyering on the side.”

— J. Bentley


“It is sort of disconcerting standing here yelling into the void, but I have raised a teenager.”

— J. Bentley

• Brute force permutation approach to TSP improves by a factor of 20 using just compiler optimizations

• Precompute distances, fix the last city in the permutation, keep partial sums of distances to reduce the number of additions to (1 + e) · (n − 1)!

“This is why when all the turkeys in high school were going to prom, you were memorizing the digits of e and π. You’re a better person for it now.”

— J. Bentley

• Prune by checking the current distance against the minimum distance, which results in a huge speedup

• This bound can be improved using various heuristics, of which the MST distance of the remaining points is extremely effective for its cost

• As a result the runtimes may not be monotonically increasing; most of the time is spent building MSTs

• Cache results of MST in a hashtable to get speedup without a terrible space cost

• Sort the cities to first visit the closest city, then second closest, etc. or use an approximate sort to reduce sorting time

• Heuristically start at an extreme city rather than a central city

• Start with a better initial tour (e.g. from 2-OPT)

“The difference between me and an addict is that I can stop whenever I want to. But I’ve got to go now because I have to prepare for next year.”

— J. Bentley

21 December 3

• Performance and efficiency of visual computing applications depend on the amount of work, parallelism, and locality

• Nowadays communication is the dominant factor in both energy and time, which makes locality of critical importance


• For the sake of locality, it might make sense to recompute values rather than storing them

• Program has components of algorithm and organization of computations, both of which affect performance somewhat independently

• Breadth-first organization (e.g. rasterize everything first and then render) is highly parallel but takes up space, sacrificing locality

• Depth-first organization (e.g. rasterize, render, move on to the next object) is more local, but there may be dependencies which must be handled first

• Optimizing arbitrary organization of arbitrary programs is intractable, meaning compilers cannot automatically perform these optimizations and they must be done manually

• Use domain-specific knowledge to abstract the dependency patterns and model them, decreasing the space of possible organizations and finding structure to identify the profitable ones

“This kind of optimization is hard, both for people and compilers.”— J. Ragan-Kelley

• In a traditional program, reorganization may break modularity and combine multiple operations together

• Halide seeks to decouple the algorithm (what is computed) from the organization schedule (where/when things are computed) using functional programming, where pure functions may be called anywhere and their interleaving determines performance

• Restricting to only feed-forward pipelines and bounded-depth iterations makes such languages not Turing-complete

• Locality is a function of reuse distance between computation and usage of value

• Determine the interleaving of a pipeline based on the ordering of computations for each stage along with when they should be computed

• Combined with ML, a learned scheduler can outperform human experts

“The goal of this class is to put you on the crank in place of Moore’s Law. Hopefully you can all go out into the world with these skills and print your own money.”

— J. Ragan-Kelley
