Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. Presented by Dennis Grishin


TRANSCRIPT

Page 1

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Presented by Dennis Grishin

Page 2

What is the problem?

Efficient computation requires distributing processing across many cores and their associated memory.

Page 3

Why is it important?

Rise of the multi-core CPU architecture

ABSTRACT

NUMA refers to the computer memory design choice available for multiprocessors. NUMA means that it will take longer to access some regions of memory than others. This work aims at explaining what NUMA is, the background developments, and how the memory access time depends on the memory location relative to a processor. First, we present a background of multiprocessor architectures, and some trends in hardware that exist along with NUMA. We then briefly discuss the changes NUMA demands in two key areas. One is the policies the Operating System should implement for scheduling and the run-time memory allocation scheme used for threads; the other is the programming approach programmers should take in order to harness NUMA’s full potential. In the end we also present some numbers comparing UMA vs. NUMA performance.
Keywords: NUMA, Intel i7, NUMA Awareness, NUMA Distance

SECTIONS

In the following sections we first describe the background, hardware trends, Operating System’s goals, changes in programming paradigms, and then we conclude after giving some numbers for comparison.

Background

Hardware Goals / Performance Criteria
There are 3 criteria on which the performance of a multiprocessor system can be judged, viz. scalability, latency and bandwidth. Scalability is the ability of a system to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Latency is the time taken in sending a message from node A to node B, while bandwidth is the amount of data that can be communicated per unit of time. So, the goal of a multiprocessor system is to achieve a highly scalable, low-latency, high-bandwidth system.
Parallel Architectures
Typically, there are 2 major types of parallel architectures that are prevalent in the industry: Shared Memory Architecture and Distributed Memory Architecture. Shared Memory Architecture, again, is of 2 types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
Shared Memory Architecture
As seen from Figure 1 (more details shown in the “Hardware Trends” section), all processors share the same memory and treat it as a global address space. The major challenge to overcome in such an architecture is the issue of cache coherency (i.e., every read must

Figure 1 Shared Memory Architecture (from [1])

reflect the latest write). Such an architecture is usually adopted in the hardware model of general-purpose CPUs in laptops and desktops.
Distributed Memory Architecture

In the type of architecture shown in Figure 2 (more details in the “Hardware Trends” section), all the processors have their own local memory, and there is no mapping of memory addresses across processors. So, we don’t have any concept of a global address space or cache coherency. To access data in another processor, processors use explicit communication. One example where this architecture is used is a cluster, with different nodes connected over a network such as the Internet.
Shared Memory Architecture – UMA
Shared Memory Architecture, again, is of 2 distinct types: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).

Figure 2 Distributed Memory (from [1])

Figure 3 UMA Architecture Layout (from [3])

Non-Uniform Memory Access (NUMA)
Nakul Manchanda and Karan Anand
New York University, {nm1157, ka804}@cs.nyu.edu


Uniform Memory Access (UMA)

Non-Uniform Memory Access (NUMA)

Rise of the NUMA architecture

Page 4

Why is it hard?

How to distribute work evenly between many out-of-order cores?

How to maximize NUMA-local execution?

Page 5

Why do existing solutions not work?

Plan-driven parallelism: query fragmentation at compile time into big fragments and initiation of a static number of threads.

Insufficient load-balancing due to the hard-to-predict performance of out-of-order CPUs.

Page 6

What is the core intuition of the solution?

Morsel-driven parallelism: query fragmentation at runtime into small fragments and dynamic scheduling of threads.

Runtime scheduling is elastic and achieves perfect load-balancing.

Page 7

Solution I – three-way join

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Viktor Leis*, Peter Boncz†, Alfons Kemper*, Thomas Neumann*

* Technische Universität München, † CWI
* {leis,kemper,neumann}@in.tum.de, † [email protected]

ABSTRACT
With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for “plan-driven” parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA).

In response, we present the “morsel-driven” query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (“morsels”) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.

Categories and Subject Descriptors
H.2.4 [Systems]: Query processing

Keywords
Morsel-driven parallelism; NUMA-awareness

1. INTRODUCTION
The main impetus of hardware performance improvement nowadays comes from increasing multi-core parallelism rather than from speeding up single-threaded performance [2]. By SIGMOD 2014


[Figure 1 contents: the dispatcher assigns morsels of input R to worker threads, which probe the hash tables HT(S) and HT(T) and store their results.]

Figure 1: Idea of morsel-driven parallelism: R ⋈A S ⋈B T

Intel’s forthcoming mainstream server Ivy Bridge EX, which can run 120 concurrent threads, will be available. We use the term many-core for such architectures with tens or hundreds of cores.

At the same time, increasing main memory capacities of up to several TB per server have led to the development of main-memory database systems. In these systems query processing is no longer I/O bound, and the huge parallel compute resources of many-cores can be truly exploited. Unfortunately, the trend to move memory controllers into the chip and hence the decentralization of memory access, which was needed to scale throughput to huge memories, leads to non-uniform memory access (NUMA). In essence, the computer has become a network in itself as the access costs of data items varies depending on which chip the data and the accessing thread are located. Therefore, many-core parallelization needs to take RAM and cache hierarchies into account. In particular, the NUMA division of the RAM has to be considered carefully to ensure that threads work (mostly) on NUMA-local data.

Abundant research in the 1990s into parallel processing led the majority of database systems to adopt a form of parallelism inspired by the Volcano [12] model, where operators are kept largely unaware of parallelism. Parallelism is encapsulated by so-called “exchange” operators that route tuple streams between multiple threads each executing identical pipelined segments of the query plan. Such implementations of the Volcano model can be called plan-driven: the optimizer statically determines at query compile-time how many threads should run, instantiates one query operator plan for each thread, and connects these with exchange operators.

In this paper we present the adaptive morsel-driven query execution framework, which we designed for our main-memory database system HyPer [16]. Our approach is sketched in Figure 1 for the three-way-join query R ⋈A S ⋈B T. Parallelism is achieved

select * from R, S, T where R.A = S.A and S.B = T.B

Page 8

Solution II – build-phase

Figure 2: Parallelizing the three pipelines of the sample query plan: (left) algebraic evaluation plan; (right) three- respectively four-way parallel processing of each pipeline

[Figure 3 labels: global hash table HT(T); Phase 1: process T morsel-wise and store NUMA-locally; Phase 2: scan NUMA-local storage area and insert pointers into HT; storage areas of the red, green, and blue cores; next morsel.]

Figure 3: NUMA-aware processing of the build-phase

morsels; this way the succeeding pipelines start with new homogeneously sized morsels instead of retaining morsel boundaries across pipelines which could easily result in skewed morsel sizes. The number of parallel threads working on any pipeline at any time is bounded by the number of hardware threads of the processor. In order to write NUMA-locally and to avoid synchronization while writing intermediate results the QEPobject allocates a storage area for each such thread/core for each executable pipeline.

The parallel processing of the pipeline for filtering T and building the hash table HT(T) is shown in Figure 3. Let us concentrate on the processing of the first phase of the pipeline that filters input T and stores the “surviving” tuples in temporary storage areas.

In our figure three parallel threads are shown, each of which operates on one morsel at a time. As our base relation T is stored “morsel-wise” across a NUMA-organized memory, the scheduler assigns, whenever possible, a morsel located on the same socket where the thread is executed. This is indicated by the coloring in the figure: The red thread that runs on a core of the red socket is assigned the task to process a red-colored morsel, i.e., a small fragment of the base relation T that is located on the red socket. Once the thread has finished processing the assigned morsel it can either be delegated (dispatched) to a different task or it obtains another morsel (of the same color) as its next task. As the threads process one morsel at a time the system is fully elastic. The degree of parallelism (MPL) can be reduced or increased at any point (more precisely, at morsel boundaries) while processing a query.

The logical algebraic pipeline of (1) scanning/filtering the input T and (2) building the hash table is actually broken up into two physical processing pipelines marked as phases on the left-hand side of the figure. In the first phase the filtered tuples are inserted into NUMA-local storage areas, i.e., for each core there is a separate storage area in order to avoid synchronization. To preserve NUMA-locality in further processing stages, the storage area of a particular core is locally allocated on the same socket.

After all base table morsels have been scanned and filtered, in the second phase these storage areas are scanned – again by threads located on the corresponding cores – and pointers are inserted into the hash table. Segmenting the logical hash table building pipeline into two phases enables perfect sizing of the global hash table because after the first phase is complete, the exact number of “surviving” objects is known. This (perfectly sized) global hash table will be probed by threads located on various sockets of a NUMA system; thus, to avoid contention, it should not reside in a particular NUMA-area and is therefore interleaved (spread) across all sockets. As many parallel threads compete to insert data into this hash table, a lock-free implementation is essential. The implementation details of the hash table are described in Section 4.2.

After both hash tables have been constructed, the probing pipeline can be scheduled. The detailed processing of the probe pipeline is shown in Figure 4. Again, a thread requests work from the dispatcher which assigns a morsel in the corresponding NUMA partition. That is, a thread located on a core in the red NUMA partition is assigned a morsel of the base relation R that is located on the cor-

NUMA-aware hash table creation

• Table partitioning on the join key -> matching tuples usually on the same socket -> less cross-socket communication for joins

worst case a too large morsel size results in underutilized threads but does not affect throughput of the system if enough concurrent queries are being executed.

4. PARALLEL OPERATOR DETAILS
In order to be able to completely parallelize each pipeline, each operator must be capable of accepting tuples in parallel (e.g., by synchronizing shared data structures) and, for operators that start a new pipeline, of producing tuples in parallel. In this section we discuss the implementation of the most important parallel operators.

4.1 Hash Join
As discussed in Section 2 and shown in Figure 3, the hash table construction of our hash join consists of two phases. In the first phase, the build input tuples are materialized into a thread-local storage area²; this requires no synchronization. Once all input tuples have been consumed that way, an empty hash table is created with the perfect size, because the input size is now known precisely. This is much more efficient than dynamically growing hash tables, which incur a high overhead in a parallel setting. In the second phase of the parallel build phase each thread scans its storage area and inserts pointers to its tuples using the atomic compare-and-swap instruction. The details are explained in Section 4.2.

Outer join is a minor variation of the described algorithm. In each tuple a marker is additionally allocated that indicates if this tuple had a match. In the probe phase the marker is set indicating that a match occurred. Before setting the marker it is advantageous to first check that the marker is not yet set, to avoid unnecessary contention. Semi and anti joins are implemented similarly.
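A sketch of this check-before-set marker update (the matched field and markMatch name are illustrative assumptions):

#include <atomic>

struct BuildTuple { std::atomic<bool> matched{false}; /* key, payload, next ... */ };

// Called whenever a probe tuple finds this build tuple as a join partner.
void markMatch(BuildTuple& t) {
    // Read first: tuples that are already marked generate no write traffic and no
    // cache-line invalidations, which avoids contention on frequently hit keys.
    if (!t.matched.load(std::memory_order_relaxed))
        t.matched.store(true, std::memory_order_relaxed);
}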

Using a number of single-operation benchmarks, Balkesen et al. showed that a highly-optimized radix join can achieve higher performance than a single-table join [5]. However, in comparison with radix join our single-table hash join

• is fully pipelined for the larger input relation, thus uses less space as the probe input can be processed in place,

• is a “good team player” meaning that multiple small (dimension) tables can be joined as a team by a probe pipeline of the large (fact) table through all these dimension hash tables,

• is very efficient if the two input cardinalities differ strongly, as is very often the case in practice,

• can benefit from skewed key distributions³ [7],
• is insensitive to tuple size, and
• has no hardware-specific parameters.

Because of these practical advantages, a single-table hash join is often preferable to radix join in complex query processing. For example, in the TPC-H benchmark, 97.4% of all joined tuples arrive at the probe side, and therefore the hash table often fits into cache. This effect is even more pronounced with the Star Schema Benchmark where 99.5% of the joined tuples arrive at the probe side. Therefore, we concentrated on a single-table hash join which has the advantage of having no hardware-specific parameters and not relying on query optimizer estimates while providing very good (if the table fits into cache) or at least decent (if the table is larger than cache) performance. We left the radix join implementation, which is beneficial in some scenarios due to higher locality, for future enhancement of our query engine.
² We also reserve space for a next pointer within each tuple for handling hash collisions.
³ One example that occurs in TPC-H is positional skew, i.e., in a 1:n join all join partners occur in close proximity which improves cache locality.

[Figure 7 (top): each hash table slot stores a 16-bit tag for early filtering and a 48-bit pointer to the chain of entries (d, e, f).]

1  insert(entry) {
2    // determine slot in hash table
3    slot = entry->hash >> hashTableShift
4    do {
5      old = hashTable[slot]
6      // set next to old entry without tag
7      entry->next = removeTag(old)
8      // add old and new tag
9      new = entry | (old & tagMask) | tag(entry->hash)
10     // try to set new value, repeat on failure
11   } while (!CAS(hashTable[slot], old, new))
12 }

Figure 7: Lock-free insertion into tagged hash table

4.2 Lock-Free Tagged Hash Table
The hash table that we use for the hash join operator has an early-filtering optimization, which improves performance of selective joins, which are quite common. The key idea is to tag a hash bucket list with a small filter into which all elements of that particular list are “hashed” to set their 1-bit. For selective probes, i.e., probes that would not find a match by traversing the list, the filter usually reduces the number of cache misses to 1 by skipping the list traversal after checking the tag. As shown in Figure 7 (top), we encode a tag directly into 16 bits of each pointer in the hash table. This saves space and, more importantly, allows to update both the pointer and the tag using a single atomic compare-and-swap operation.

For low-cost synchronization we exploit the fact that in a join the hash table is insert-only and lookups occur only after all inserts are completed. Figure 7 (bottom) shows the pseudo code for inserting a new entry into the hash table. In line 11, the pointer to the new element (e.g., “f” in the picture) is set using compare-and-swap (CAS). This pointer is augmented by the new tag, which is computed from the old and the new tag (line 9). If the CAS failed (because another insert occurred simultaneously), the process is repeated.
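The excerpt shows only the insert side. A hypothetical lookup-side filter, assuming the same slot encoding (16-bit tag in the upper bits, 48-bit pointer below) and some fixed rule for picking a tag bit from the hash, might look like this:

#include <cstdint>

constexpr uint64_t kPointerMask = (1ull << 48) - 1;   // low 48 bits hold the chain pointer

// Pick one of the 16 tag bits from the hash (the exact bit-selection rule is an assumption).
inline uint64_t tagBit(uint64_t hash) { return 1ull << (48 + (hash >> 60)); }

// Early filter: if the element's tag bit is not set in the slot word, no entry of the
// bucket list can match, so the (cache-miss-heavy) list traversal is skipped entirely.
bool mayContain(uint64_t slotWord, uint64_t hash) {
    return (slotWord & tagBit(hash)) != 0;
}

// Strip the tag to recover the pointer to the first chain entry.
inline const void* chainHead(uint64_t slotWord) {
    return reinterpret_cast<const void*>(slotWord & kPointerMask);
}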

Our tagging technique has a number of advantages in comparison to Bloom filters, which can be used similarly and are, for example, used in Vectorwise [8], SQL Server [21], and BLU [31]. First, a Bloom filter is an additional data structure that incurs multiple reads. And for large tables, the Bloom filter may not fit into cache (or only relatively slow last-level cache), as the Bloom filter size must be proportional to the hash table size to be effective. Therefore, the overhead can be quite high, although Bloom filters can certainly be a very good optimization due to their small size. In our approach no unnecessary memory accesses are performed, only a small number of cheap bitwise operations. Therefore, hash tagging has very low overhead and can always be used, without relying on the query optimizer to estimate selectivities. Besides join, tagging is also very beneficial during aggregation when most keys are unique.

The hash table array only stores pointers, and not the tuples themselves, i.e., we do not use open addressing. There are a number of reasons for this: Since the tuples are usually much larger than pointers, the hash table can be sized quite generously to at least twice the size of the input. This reduces the number of collisions
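For concreteness, the slot computation in Figure 7 (slot = entry->hash >> hashTableShift) suggests a power-of-two directory indexed by the upper hash bits. A sketch of how the shift could be derived from the known build-input size (an assumption for illustration; the paper does not show this code):

#include <algorithm>
#include <bit>        // std::bit_ceil, std::countr_zero (C++20)
#include <cstdint>

struct TableSizing { uint64_t numSlots; unsigned hashTableShift; };

TableSizing sizeFor(uint64_t inputSize) {
    // At least twice the input size, rounded up to a power of two.
    uint64_t slots = std::bit_ceil(std::max<uint64_t>(2 * inputSize, 2));
    unsigned bits = std::countr_zero(slots);          // log2(numSlots)
    return { slots, 64u - bits };                      // upper `bits` of a 64-bit hash pick the slot
}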

• Tagging of hash bucket lists -> list traversal skipped for non-matching probes -> number of cache misses reduced to 1

Page 9

Solution III – probe-phase

[Figure 4 labels: morsels of R probed through HT(T) and HT(S); storage areas of the red, green, and blue cores; next morsel.]

Figure 4: Morsel-wise processing of the probe phase

responding “red” NUMA socket. The result of the probe pipeline is again stored in NUMA local storage areas in order to preserve NUMA locality for further processing (not present in our sample query plan).

In all, morsel-driven parallelism executes multiple pipelines in parallel, which is similar to typical implementations of the Volcano model. Different from Volcano, however, is the fact that the pipelines are not independent. That is, they share data structures and the operators are aware of parallel execution and must perform synchronization (through efficient lock-free mechanisms – see later). A further difference is that the number of threads executing the plan is fully elastic. That is, the number may differ not only between different pipeline segments, as shown in Figure 2, but also inside the same pipeline segment during query execution – as described in the following.


Morsel-wise probing

Page 10

Solution III – dispatcher


3. DISPATCHER: SCHEDULING PARALLEL PIPELINE TASKS

The dispatcher is controlling and assigning the compute resources to the parallel pipelines. This is done by assigning tasks to worker threads. We (pre-)create one worker thread for each hardware thread that the machine provides and permanently bind each worker to it. Thus, the level of parallelism of a particular query is not controlled by creating or terminating threads, but rather by assigning them particular tasks of possibly different queries. A task that is assigned to such a worker thread consists of a pipeline job and a particular morsel on which the pipeline has to be executed. Preemption of a task occurs at morsel boundaries – thereby eliminating potentially costly interrupt mechanisms. We experimentally determined that a morsel size of about 100,000 tuples yields a good tradeoff between instant elasticity adjustment, load balancing and low maintenance overhead.
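A minimal, self-contained sketch of this worker/dispatcher interaction (illustrative only; the real dispatcher additionally tracks pipeline-job dependencies, priorities, and per-query state):

#include <atomic>
#include <cstdio>
#include <vector>

struct Morsel { size_t begin, end; };          // roughly 100,000 tuples per morsel in the paper

// One morsel list per socket so that a worker is handed NUMA-local work first.
struct Dispatcher {
    std::vector<std::vector<Morsel>> perSocket;
    std::vector<std::atomic<size_t>> next;
    explicit Dispatcher(size_t sockets) : perSocket(sockets), next(sockets) {
        for (auto& n : next) n.store(0);
    }
    bool nextTask(size_t mySocket, Morsel& out) {
        // Prefer the local socket; fall back to "stealing" from remote sockets.
        for (size_t i = 0; i < perSocket.size(); ++i) {
            size_t s = (mySocket + i) % perSocket.size();
            size_t idx = next[s].fetch_add(1);
            if (idx < perSocket[s].size()) { out = perSocket[s][idx]; return true; }
        }
        return false;
    }
};

// A worker is pinned to one hardware thread and simply loops: request a task,
// run the whole pipeline on that morsel, repeat. Preemption happens only here,
// at morsel boundaries.
void workerLoop(Dispatcher& d, size_t mySocket) {
    Morsel m;
    while (d.nextTask(mySocket, m))
        std::printf("socket %zu: pipeline on tuples [%zu, %zu)\n", mySocket, m.begin, m.end);
}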

There are three main goals for assigning tasks to threads that run on particular cores:

1. Preserving (NUMA-)locality by assigning data morsels to cores on which the morsels are allocated

2. Full elasticity concerning the level of parallelism of a particular query

[Figure 5 labels: dispatcher code runs dispatch(0), returning pipeline-job J1 on morsel Mr1 located on the (red) socket of Core 0; lock-free dispatcher data structures hold a list of pending pipeline-jobs (possibly belonging to different queries) and, per job, (virtual) lists of morsels to be processed, where colors indicate on which socket/core a morsel is located; example NUMA multi-core server with 4 sockets and 32 cores.]

Figure 5: Dispatcher assigns pipeline-jobs on morsels to threads depending on the core

3. Load balancing requires that all cores participating in a query pipeline finish their work at the same time in order to prevent (fast) cores from waiting for other (slow) cores¹.

In Figure 5 the architecture of the dispatcher is sketched. It maintains a list of pending pipeline jobs. This list only contains pipeline jobs whose prerequisites have already been processed. E.g., for our running example query the build input pipelines are first inserted into the list of pending jobs. The probe pipeline is only inserted after these two build pipelines have been finished. As described before, each of the active queries is controlled by a QEPobject which is responsible for transferring executable pipelines to the dispatcher. Thus, the dispatcher maintains only lists of pipeline jobs for which all dependent pipelines were already processed. In general, the dispatcher queue will contain pending pipeline jobs of different queries that are executed in parallel to accommodate inter-query parallelism.

3.1 Elasticity
The fully elastic parallelism, which is achieved by dispatching jobs “a morsel at a time”, allows for intelligent scheduling of these inter-query parallel pipeline jobs depending on a quality of service model. It enables to gracefully decrease the degree of parallelism of, say, a long-running query Ql at any stage of processing in order to prioritize a possibly more important interactive query Q+. Once the higher prioritized query Q+ is finished, the pendulum swings back to the long running query by dispatching all or most cores to tasks of the long running query Ql. In Section 5.4 we demonstrate this dynamic elasticity experimentally. In our current implementation all queries have the same priority, so threads are distributed
¹ This assumes that the goal is to minimize the response time of a particular query. Of course, an idle thread could start working on another query otherwise.

• Dispatcher implemented as a lock-free data structure and executed by work-requesting threads

• Maintains a list of pending pipeline jobs whose prerequisites have already been processed

• Segmentation of queries into morsels upon request by the processing thread

• NUMA-locality awareness

• “Work stealing” if necessary

Page 11

Solution IV – morsel size

equally over all active queries. A priority-based scheduling component is under development but beyond the scope of this paper.

For each pipeline job the dispatcher maintains lists of pending morsels on which the pipeline job has still to be executed. For each core a separate list exists to ensure that a work request of, say, Core 0 returns a morsel that is allocated on the same socket as Core 0. This is indicated by different colors in our architectural sketch. As soon as Core 0 finishes processing the assigned morsel, it requests a new task, which may or may not stem from the same pipeline job. This depends on the prioritization of the different pipeline jobs that originate from different queries being executed. If a high-priority query enters the system it may lead to a decreased parallelism degree for the current query. Morsel-wise processing allows to re-assign cores to different pipeline jobs without any drastic interrupt mechanism.

3.2 Implementation Overview
For illustration purposes we showed a (long) linked list of morsels for each core in Figure 5. In reality (i.e., in our implementation) we maintain storage area boundaries for each core/socket and segment these large storage areas into morsels on demand; that is, when a core requests a task from the dispatcher the next morsel of the pipeline argument’s storage area on the particular socket is “cut out”. Furthermore, in Figure 5 the Dispatcher appears like a separate thread. This, however, would incur two disadvantages: (1) the dispatcher itself would need a core to run on or might preempt query evaluation threads and (2) it could become a source of contention, in particular if the morsel size was configured quite small. Therefore, the dispatcher is implemented as a lock-free data structure only. The dispatcher’s code is then executed by the work-requesting query evaluation thread itself. Thus, the dispatcher is automatically executed on the (otherwise unused) core of this worker thread. Relying on lock-free data structures (i.e., the pipeline job queue as well as the associated morsel queues) reduces contention even if multiple query evaluation threads request new tasks at the same time. Analogously, the QEPobject that triggers the progress of a particular query by observing data dependencies (e.g., building hash tables before executing the probe pipeline) is implemented as a passive state machine. The code is invoked by the dispatcher whenever a pipeline job is fully executed as observed by not being able to find a new morsel upon a work request. Again, this state machine is executed on the otherwise unused core of the worker thread that originally requested a new task from the dispatcher.
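A sketch of this on-demand “cutting out” of morsels from a socket’s storage area (illustrative; the constant, alignment, and field names are assumptions based on the description here and on the work-stealing discussion later):

#include <algorithm>
#include <atomic>
#include <cstddef>

constexpr size_t kMorselTuples = 100000;   // order of magnitude used in the paper

// Boundaries of one socket's storage area for a pipeline argument. The structure is
// cache-line aligned so that concurrent cursor updates do not cause false sharing.
struct alignas(64) StorageArea {
    std::atomic<size_t> cursor{0};         // next tuple index to hand out
    size_t end{0};                          // exclusive end of the area
};

// Cut the next morsel out of the area; returns false once the area is exhausted
// (the caller then asks the dispatcher to steal work from another socket).
bool cutMorsel(StorageArea& a, size_t& begin, size_t& endOut) {
    begin = a.cursor.fetch_add(kMorselTuples);
    if (begin >= a.end) return false;
    endOut = std::min(begin + kMorselTuples, a.end);
    return true;
}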

Besides the ability to assign a core to a different query at any time – called elasticity – the morsel-wise processing also guarantees load balancing and skew resistance. All threads working on the same pipeline job run to completion in a “photo finish”: they are guaranteed to reach the finish line within the time period it takes to process a single morsel. If, for some reason, a core finishes processing all morsels on its particular socket, the dispatcher will “steal work” from another core, i.e., it will assign morsels on a different socket. On some NUMA systems, not all sockets are directly connected with each other; here it pays off to steal from closer sockets first. Under normal circumstances, work-stealing from remote sockets happens very infrequently; nevertheless it is necessary to avoid idle threads. And the writing into temporary storage will be done into NUMA local storage areas anyway (that is, a red morsel turns blue if it was processed by a blue core in the process of stealing work from the core(s) on the red socket).

So far, we have discussed intra-pipeline parallelism. Our parallelization scheme can also support bushy parallelism, e.g., the pipelines “filtering and building the hash table of T” and “filtering and building the hash table of S” of our example are independent

[Figure 6 axes: morsel size from 100 to 10M on the x-axis, time in seconds from 0.0 to 0.8 on the y-axis.]

Figure 6: Effect of morsel size on query execution

and could therefore be executed in parallel. However, the usefulness of this form of parallelism is limited. The number of independent pipelines is usually much smaller than the number of cores, and the amount of work in each pipeline generally differs. Furthermore, bushy parallelism can decrease performance by reducing cache locality. Therefore, we currently avoid executing multiple pipelines from one query in parallel; in our example, we first execute pipeline T, and only after T is finished, the job for pipeline S is added to the list of pipeline jobs.

Besides elasticity, morsel-driven processing also allows for a simple and elegant implementation of query canceling. A user may have aborted her query request, an exception happened in a query (e.g., a numeric overflow), or the system is running out of RAM. If any of these events happen, the involved query is marked in the dispatcher. The marker is checked whenever a morsel of that query is finished, therefore, very soon all worker threads will stop working on this query. In contrast to forcing the operating system to kill threads, this approach allows each thread to clean up (e.g., free allocated memory).
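A sketch of such a cancellation marker consulted only at morsel boundaries (names are illustrative):

#include <atomic>

// Per-query cancellation marker. Workers check it only between morsels, so no
// thread is killed and each thread can release its resources before stopping.
struct QueryState { std::atomic<bool> cancelled{false}; };

bool shouldContinue(const QueryState& q) {
    return !q.cancelled.load(std::memory_order_relaxed);
}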

3.3 Morsel Size
In contrast to systems like Vectorwise [9] and IBM’s BLU [31], which use vectors/strides to pass data between operators, there is no performance penalty if a morsel does not fit into cache. Morsels are used to break a large task into small, constant-sized work units to facilitate work-stealing and preemption. Consequently, the morsel size is not very critical for performance, it only needs to be large enough to amortize scheduling overhead while providing good response times. To show the effect of morsel size on query performance we measured the performance for the query select min(a) from R using 64 threads on a Nehalem EX system, which is described in Section 5. This query is very simple, so it stresses the work-stealing data structure as much as possible. Figure 6 shows that the morsel size should be set to the smallest possible value where the overhead is negligible, in this case to a value above 10,000. The optimal setting depends on the hardware, but can easily be determined experimentally.

On many-core systems, any shared data structure, even if lock-free, can eventually become a bottleneck. In the case of our work-stealing data structure, however, there are a number of aspects that prevent it from becoming a scalability problem. First, in our implementation the total work is initially split between all threads, such that each thread temporarily owns a local range. Because we cache line align each range, conflicts at the cache line level are unlikely. Only when this local range is exhausted, a thread tries to steal work from another range. Second, if more than one query is executed concurrently, the pressure on the data structure is further reduced. Finally, it is always possible to increase the morsel size. This results in fewer accesses to the work-stealing data structure. In the

• Morsels should be large enough to amortize scheduling overhead while providing a good response time

Execution time dependency on morsel size

Page 12

Experiment results I – speedup

Processing of 22 TPC-H queries

Page 13

Experiment results II – elasticity

Intra- and inter-query parallelism

Morsel-wise processing

Page 14

Does the paper prove its claims?

Yes.

Page 15

Are there any gaps in the logic?

• Can the compilation at runtime cause a significant overhead?

• What is the overhead caused by the dispatcher in morsel-driven parallelism?

• Can all types of queries be easily broken into morsels?

Page 16

Possible next steps?

• Priority-based scheduling.

• Hardware-specific optimization.

• Real-world testing.