
University of Warsaw

Faculty of Mathematics, Computer Science and Mechanics

VU University Amsterdam

Faculty of Sciences

Juliusz Sompolski

Student no. 248396(UW), 2128039(VU)

Just-in-time Compilation in Vectorized Query Execution

Master thesis

in COMPUTER SCIENCE

Supervisor:

Marcin Żukowski

VectorWise B.V.

First reader: Peter Boncz

Dept. of Computer Science, VU University Amsterdam

Centrum Wiskunde & Informatica

Second reader: Henri Bal

Dept. of Computer Science, VU University Amsterdam

August 2011


Supervisor’s statement

Hereby I confirm that the present thesis was prepared under my supervision and that it fulfils the requirements for the degree of Master of Computer Science.

Date Supervisor’s signature

Author’s statement

Hereby I declare that the present thesis was prepared by me and none of its contents was obtained by means that are against the law.

The thesis has never before been a subject of any procedure of obtaining an academic degree.

Moreover, I declare that the present version of the thesis is identical to the attached electronic version.

Date Author’s signature


Abstract

Database management systems face high interpretation overhead when supporting flexible declarative query languages such as SQL. One approach to solving this problem is to generate and compile, on the fly, specialized programs for evaluating each query. Another, used by the VectorWise DBMS, is to process data in bigger blocks at once, which amortizes interpretation overhead over many database records and enables modern CPUs to e.g. use SIMD instructions. Both of these approaches provide an order of magnitude performance improvement over traditional DBMSes. In this master thesis we ask whether combining these two techniques can yield significantly better results than using each of them alone. Using a proof-of-concept system we perform micro-benchmarks and identify cases where compilation should be combined with block-wise query execution. We implement a JIT query execution infrastructure in the VectorWise DBMS. Finally, using our implementation, we confirm our findings in benchmarks inside the VectorWise system.

Keywords

databases, query execution, JIT compilation, code generation

Thesis domain (Socrates-Erasmus subject area codes)

11.3 Informatics, Computer Science

Subject classification

H. Software
H.2 DATABASE MANAGEMENT
H.2.4 Systems: Query processing


Contents

1. Introduction
   1.1. Motivation
        1.1.1. Research questions
        1.1.2. Contributions
   1.2. Outline

2. Preliminaries
   2.1. Relational database systems
   2.2. VectorWise vectorized execution
        2.2.1. Architecture of Ingres VectorWise system
        2.2.2. Expressions and primitives
        2.2.3. VectorWise performance
   2.3. Modern hardware architecture
        2.3.1. CPU
        2.3.2. Pipelined execution
        2.3.3. Hazards and how to deal with them
        2.3.4. CPU caches
        2.3.5. SIMD instructions
        2.3.6. Virtual memory and TLB cache
   2.4. Related work on JIT query compilation
        2.4.1. Origins
        2.4.2. HIQUE
        2.4.3. Hyper
        2.4.4. JAMDB
        2.4.5. Other
   2.5. Summary

3. Vectorization vs. Compilation
   3.1. Experiments overview
        3.1.1. Experimental setup
   3.2. Case study: Project
        3.2.1. Vectorized Execution
        3.2.2. Loop-compiled Execution
        3.2.3. SIMDization using SSE
        3.2.4. Tuple-at-a-time compilation
        3.2.5. Effect of vector length
        3.2.6. Project under Select and compute-all optimization
   3.3. Case study: Conjunctive select
        3.3.1. Vectorized execution
        3.3.2. Loop-compiled execution
   3.4. Case study: Disjunctive select
        3.4.1. Vectorized execution
        3.4.2. Loop-compiled execution
   3.5. Synthesis: Selection and Projection
   3.6. Case Study: Hash Join
        3.6.1. Vectorized implementation
        3.6.2. Loop-compiled implementation
        3.6.3. Build phase
        3.6.4. Probe lookup: bucket chain traversal
        3.6.5. Probe fetching
        3.6.6. Reduced vector length
        3.6.7. Partial Compilation
   3.7. Software engineering aspects of Vectorization and Compilation
        3.7.1. System complexity
        3.7.2. Compilation overheads
        3.7.3. Profiling and runtime performance optimization
   3.8. Summary

4. JIT Compilation infrastructure
   4.1. Compilers
        4.1.1. GNU Compiler Collection – gcc
        4.1.2. Intel Compiler – icc
        4.1.3. LLVM and clang
   4.2. VectorWise compilation
        4.2.1. Build time
        4.2.2. TPC-H performance
   4.3. Implementation
        4.3.1. External compiler and dynamic linking
        4.3.2. Using LLVM and clang internally
        4.3.3. JIT compilation infrastructure API
        4.3.4. Evaluation
   4.4. Summary

5. JIT primitive generation and management
   5.1. Source templates
        5.1.1. Primitive templates
        5.1.2. Source preamble
   5.2. JIT opportunities detection and injection
        5.2.1. Fragments tagging pass
        5.2.2. Fragments generation and collapsing pass
        5.2.3. Compilation pass
        5.2.4. Limitations
   5.3. Source code generation
        5.3.1. snode trees
        5.3.2. JitContext
        5.3.3. Code generation
   5.4. Primitives storage and maintenance
   5.5. Summary

6. JIT primitives evaluation
   6.1. TPC-H Query 1
   6.2. TPC-H Query 6
   6.3. TPC-H Query 12
   6.4. TPC-H Query 19
   6.5. Summary

7. Conclusions and future work
   7.1. Contributions
   7.2. Answers to research questions
   7.3. Future work


Chapter 1

Introduction

This thesis was written at the VectorWise company, part of the Ingres Corporation. The research concentrated on the VectorWise database engine, originating from the MonetDB/X100 research project at the CWI in Amsterdam [Zuk09]. The aim was to investigate the possibilities of introducing Just-In-Time (JIT) source code generation and compilation of functions performing query execution, and to evaluate and integrate these techniques in the framework of the existing VectorWise query execution engine. Along the way, other questions arose and were answered, concerning various aspects of combining vectorization and compilation in a database engine. This research should be valuable for database system architects, both from the perspective of extending a vectorized database with query compilation and from the perspective of introducing vectorized processing into a JIT compiled system.

1.1. Motivation

Database systems provide many useful abstractions such as data independence, ACID properties, and the possibility to pose declarative complex ad-hoc queries over large amounts of data. This flexibility implies that a database server has no advance knowledge of the queries until run-time, which leads most systems to implement their query evaluators using an interpretation engine. Such an engine evaluates plans consisting of algebraic operators, such as Scan, Join, Project, Aggregation and Select. The operators internally include expressions, such as boolean conditions used in Joins and Select, calculations used to introduce new columns in Project, and functions like MIN, MAX and SUM used in Aggregation. Most query interpreters follow the so-called iterator model (as described in Volcano [Gra94]), in which each operator implements an API that consists of open(), next() and close() methods. Each next() call produces one new tuple, and query evaluation follows a “pull” model in which next() is called recursively to traverse the operator tree from the root downwards, with the result tuples being pulled upwards.

It has been observed that the tuple-at-a-time model leads to interpretation overhead: the situation that much more time is spent in evaluating the query plan than in actually calculating the query result. Additionally, this tuple-at-a-time interpretation particularly affects high performance features introduced in modern CPUs [Zuk09]. For instance, the fact that units of actual work are hidden in the stream of interpreting code and function calls prevents compilers and modern CPUs from getting the benefits of deep CPU pipelining and SIMD instructions, because for these the “work” instructions should be adjacent in the instruction stream and independent of each other [ZNB08, Zuk09].


MonetDB [Bon02] reduced interpretation overhead by using bulk processing, where each operator fully processes its input before the next execution stage is invoked. This idea has been further improved in the X100 project [BZN05], later evolving into VectorWise, with vectorized execution. It is a form of block-oriented query processing [PMAJ01], where the next() method produces a block of tuples (typically 100–10000) rather than a single tuple. In the vectorized model, data is represented as small single-dimensional arrays (vectors), easily accessible to CPUs. The effect is that the percentage of instructions spent in interpretation logic is reduced by a factor equal to the vector size. It was shown that vectorized execution can improve the performance of data-intensive (OLAP) queries by a factor of 50 over interpreted tuple-at-a-time systems [ZBNH05].

Just-In-Time (JIT) query compilation has been considered as an alternative method for eliminating the ill effects of interpretation overhead. On receiving a query for the first time, the query processor compiles (a part of) the query into a routine that is subsequently executed. Query compilation removes the interpretation overhead altogether: compiled code can always be made faster than interpreted code. In analytic workloads with long-running queries the compilation time is small compared to the query execution time, and the compiled code can be reused, because many queries are repeated with different parameters. Therefore, query compilation has often been considered the ultimate solution to query execution.

In state-of-the-art database engines, applying JIT compilation to query execution still fits the tuple-at-a-time model. In place of a tree of interpreted operators, with one tuple being passed through the next() interface, operations are compiled together, but they still process one tuple at a time. Such systems observe an order of magnitude improvement in query performance over interpreted tuple-at-a-time systems, just as vectorized systems did.

1.1.1. Research questions

In both cases much of the gains came from removing the interpretation overhead. The research question arising from the perspective of VectorWise is:

• Since a vectorized system already removes most of this overhead, would it still be beneficial to introduce JIT query compilation to it?

A symmetrical research question can be asked from the perspective of an architect of a database engine applying JIT query compilation:

• Can the benefits of vectorization still improve a tuple-at-a-time compiled system?

When combining vectorization with JIT compilation, the focus has to turn away from the aspect of reducing interpretation overhead, since each of these techniques alone already achieves that. Other potential benefits have to be found and explored.

In vectorized execution, the functions that perform work typically process an array of values in a tight loop. Such tight loops can be optimized well by compilers, e.g. unrolled when beneficial, and, depending on the storage format, enable compilers to generate SIMD instructions automatically. Modern CPUs also do well on such loops because function calls are eliminated, branches get more predictable, and out-of-order execution in CPUs often takes multiple loop iterations into execution concurrently, exploiting the deeply pipelined resources of modern CPUs. A tuple-at-a-time compiled system has just one tuple available at a given time, therefore it cannot exploit any inter-tuple micro-parallelism using SIMD. It performs many different subsequent operations on this one tuple, which are often dependent on each other. Control and data dependencies hinder the possibilities of deep


CPU pipelining. These are therefore the potential reasons why vectorized execution could improve a compiled system.

There are, however, many operations which are hard to express in a vectorized way. Vectorized execution needs additional materialization of intermediate results – vectors of data have to be written in each basic step. Even though systems like VectorWise go to great lengths to ensure that these arrays are CPU cache-resident, this materialization constitutes extra memory load and store work. Compilation can avoid this work by keeping results in CPU registers all the time as they flow from one expression to the other. The memory access pattern of vectorized execution is sometimes sub-optimal, for example when an operation involving multiple columns with values accessed from random positions (e.g. a hash table) has to be executed as many basic vectorized operations, each accessing a vector of these random positions anew. Vectorized execution is still better than interpreted tuple-at-a-time execution because of the latter's overhead, but switching to compiled tuple-at-a-time execution could be beneficial. Just-in-time compilation can be used in a vectorized system to change an operation consisting of multiple vectorized steps into one compiled function, executing a more complex operation in a single loop over tuples.

Introducing both vectorization and compilation into a DBMS also has a software engineering aspect. Implementing additional subsystems makes the system more complex and requires additional maintenance. Vectorization has the advantage of allowing very fine-grained profiling and error detection. Each basic vectorized function is a big enough block of work that it can be measured individually during execution – the measurement overhead is small compared to an operation working on a whole vector of tuples. The same cannot be said about tuple-at-a-time processing – starting and stopping counters around individual operations would be more expensive than the operations themselves. JIT compiling the whole query together makes it hard to later trace which parts of it take how much processing time. Combining vectorization with JIT compilation is a middle ground, because we only generate code for small fragments of query execution, and we can directly compare their performance to that of the vectorized functions they replaced.

1.1.2. Contributions

Combining vectorized processing with JIT compilation is therefore a matter of switching between vector-at-a-time and tuple-at-a-time processing of combined instructions, for the sake of optimally utilizing the characteristics of modern CPUs. We identify the situations in which each mode is preferable.

We study which operations are better performed in basic vectorized loops, and which benefit from being compiled into a single loop. Our findings are that in many cases neither pure vectorization nor blind single-loop compilation is the best solution, and the two have to be carefully combined. The results of our work were published as a paper at the DaMoN workshop in June 2011 in Athens [SZB11].

The experiments were made from the perspective of VectorWise, a vectorized system in which we introduce JIT compilation, but the findings can also be applied to introducing vector-at-a-time processing into compiled systems. Until now, in such systems processing was done tuple-at-a-time and intermediate materialization was always treated as an overhead. We show that this is not always true, and switching to vector-at-a-time processing can also improve parts of compiled query execution.


1.2. Outline

The outline of this thesis is as follows:

• Chapter 2 provides an introduction to database systems, focusing on the VectorWise DBMS and its aspects relevant to this thesis.

• Chapter 3 evaluates the benefits of introducing Just-in-time query compilation into a vectorized database system and vice-versa. The material from this chapter is the main scientific contribution of this thesis.

• Chapter 4 discusses an infrastructure for Just-in-time compilation, implemented inside the VectorWise database system.

• Chapter 5 describes the implementation of source generation for fragments of query execution in VectorWise.

• Chapter 6 shows an evaluation of the actual implementation that was made in the VectorWise system on the TPC-H industrial benchmark. The results are compared to the conclusions of the experiments from Chapter 3.

• We conclude and sketch questions and directions for future work in Chapter 7.


Chapter 2

Preliminaries

This chapter introduces concepts from the field of database systems that will be used throughout this thesis. It starts with a general introduction to databases in Section 2.1, followed by an introduction to the VectorWise DBMS architecture and vectorized query execution model in Section 2.2. Implementation details that will be relevant to the task of integrating this DBMS with a JIT query compilation infrastructure are highlighted. The reasoning behind many of the methods used in this thesis lies in the hardware architecture, so an introduction to modern hardware is provided in Section 2.3. An overview of what has to be done to implement JIT query compilation, as well as previous research and existing systems that already apply JIT, is given in Section 2.4.

2.1. Relational database systems

A database management system (DBMS) is a software package of programs that provide services for users and applications to store and access data. It allows multiple concurrent users to query and modify data using a high-level query language such as SQL.

The relational database management system (RDBMS; we will often use just DBMS) is the most common type of DBMS. It uses relations as its data model. A relation is a set of tuples, best viewed as a table, with rows being tuples and columns being attributes. In contrast to mathematical relations, the attributes are named and typed.

Database

A database is a collection of relations/tables stored and managed by the DBMS. The database is described by a schema. The schema specifies what tables are in the database, what attributes are in each table, and how the tables reference one another. An example of database tables is shown in Figure 2.1.

Structured Query Language and relational algebra

Structured Query Language (SQL) can be used to express querying operations on a database. It enables selecting tuples fulfilling given conditions, joining data from multiple relations together, calculating derived values, and computing aggregates grouped by certain attributes. SQL has an understandable syntax, resembling a query statement in the English language.

SQL queries correspond to relational algebra. The algebra consists of operators, such as:


Employees

id  name   age  dept_id
1   Adam   39   1
2   Eva    25   1
3   Tom    31   1
4   Mark   60   2
5   Chris  42   2

Departments

id  name        mngr_id
1   Production  1
2   Sales       4

SELECT Emp.name AS employee,
       (Emp.age - 30) * 50 AS bonus,
       Departments.name AS dep_name,
       Mng.name AS dep_manager
FROM Employees AS Emp
JOIN Departments ON Emp.dept_id = Departments.id
JOIN Employees AS Mng ON Mng.id = Departments.mngr_id
WHERE Emp.age > 30

Π employee=Emp.name, bonus=(Emp.age−30)∗50, dep_name=Departments.name, dep_manager=Mng.name
  ((σ Emp.age>30 (ρ Emp/Employees (Employees)) ⋈ Emp.dept_id=Departments.id Departments)
   ⋈ Mng.id=Departments.mngr_id ρ Mng/Employees (Employees))

Volcano operator tree (next() calls pull data up the tree):

Project [employee=Emp.name, bonus=(Emp.age-30)*50,
         dep_name=Departments.name, dep_manager=Mng.name]
  HashJoin [Departments.mngr_id = Mng.id]
    MergeJoin [Emp.dept_id = Departments.id]
      Select [Emp.age > 30]
        Scan [Employees AS Emp]
      Scan [Departments]
    Scan [Employees AS Mng]

Query result

employee  bonus  dep_name    dep_manager
Adam      450    Production  Adam
Tom       50     Production  Adam
Mark      1500   Sales       Mark
Chris     600    Sales       Mark

Figure 2.1: Example of database tables, a SQL query, the corresponding operations in relational algebra, and a Volcano model operator tree.

• Projection (Π) and Rename (ρ), which are used to choose and maintain the subset of attributes to be passed on, as well as to introduce new attributes derived from the existing ones by e.g. arithmetic calculations.

• Selection (σ), which is used to select tuples that match a certain boolean condition.

• Join (⋈), used to combine tuples from two tables (left and right) using a given condition. The condition is most commonly an equality of an attribute from the left table to an attribute from the right table (equi-join). There are many different types of join, depending on whether the result should contain only tuples matched from both operands (inner join), or whether unmatched tuples from the left, right or both tables should also be produced, extended with empty (NULL) values on the unmatched side (left, right or full outer join).

• Aggregation (G), used to compute an aggregate function such as the sum, average, minimum or maximum of an attribute, or a count of tuples, grouped by the value of other attribute(s).

An example of a SQL query and the corresponding relational algebra expression is shown in Figure 2.1.

Query optimization

A parsed SQL query in a database system resembles a tree of relational algebra operators. It is the work of the query optimizer to match these operators with their implementations in the system. There may be multiple concrete operators for a given abstract algebraic operator; e.g. a join can be implemented either as a HashJoin, with one relation put in a hash table, or a MergeJoin, when both relations are stored sorted on the join attributes. Different join algorithms can also be used if we know whether to expect at most one match (Join01), exactly one match (Join1) or more matches (JoinN) per tuple from one of the relations. The optimizer decides which of the implementations to use based on the metadata, statistics and value distributions it collects about the data.

For example, in Figure 2.1 a MergeJoin was used for one of the joins, because the data was sorted on the department id, and a HashJoin was used for the other join.

Volcano operator model

Concrete operators in the DBMS are usually implemented following the model first introduced in Volcano [Gra94]. The operators form a tree and implement an iterator API, with open(), close() and next() methods. The open() and close() methods are used for initialization and finalization. The next() method makes the operator produce output. In most database systems this is done tuple-at-a-time: exactly one tuple of output will be returned by one call to the next() method.
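As a minimal sketch of this API in C (with hypothetical names and signatures; actual systems define richer interfaces), an operator can be represented as a struct of function pointers:

/* Sketch of a Volcano-style iterator interface. Each operator exposes
   open(), next() and close(); next() returns one tuple, or NULL when
   the input is exhausted. */
typedef struct Tuple Tuple;            /* opaque tuple representation */
typedef struct Operator Operator;
struct Operator {
    void   (*open)(Operator *self);   /* initialize state, open children */
    Tuple *(*next)(Operator *self);   /* produce one output tuple or NULL */
    void   (*close)(Operator *self);  /* release state, close children */
    Operator *child;                  /* input operator, pulled from via next() */
};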

Data is pulled by the operators upwards in the tree. The root operator produces query output, calling next() on its children, which in turn call their children, etc., down to the Scan operators at the leaves of the query execution tree, which interact with the storage layer and read the data from the database.

For example, when the next() method is called on a Project operator, it calls the next() method on its child once, computes the projections on the delivered tuple, and returns the result. A Select operator calls next() on its child possibly multiple times, evaluating the filtering condition until a tuple from the child satisfies it and can be returned. More complex operators follow even more complicated logic. When the next() method of a HashJoin with a memory-fitting hash table is called for the first time, it iterates through all output of the child operator of the smaller of the joined relations and puts all resulting tuples into the hash table. Then, upon subsequent next() calls, it calls the next() method of the bigger of the joined relations and tries to match its tuples against the tuples in the hash table. There are many variants of when and what actually will be returned, depending on whether it is an inner or outer join and a “01”, “1”, or “N” join.

An example Volcano-style operator tree is presented in Figure 2.1.


[Figure: a client application sends an SQL query to the Ingres server, which parses and optimizes it; the resulting query plan (in VectorWise algebra) is passed to VectorWise, where the Rewriter (operating on the xnode tree) and the Builder (producing the operator and expression tree) prepare vectorized query execution, which requests data from the buffer manager / storage and returns the query result.]

Figure 2.2: Architecture of the Ingres VectorWise system.

2.2. VectorWise vectorized execution

VectorWise [Vec] is a database management system mostly intended for analytical workloads in applications such as online analytical processing, business intelligence, decision support systems and data mining. Such a workload is characterized by being read-intensive, having complex, compute-intensive queries, and database tables with a lot of attributes. Queries may often access only a few attributes, but touch a large fraction of all rows. VectorWise roots from the MonetDB/X100 project, developed at CWI in The Netherlands [BZN05, Zuk09].

2.2.1. Architecture of Ingres VectorWise system

VectorWise is integrated with the Ingres DBMS. Ingres is responsible for the top layers of the database management system stack, while VectorWise implements the bottom layers – query execution, buffer manager and storage. The architecture is shown in Figure 2.2.

The Ingres server maintains connections with users and parses their SQL queries. Ingres keeps track of all schemas, metadata and statistics (e.g. cardinality, distribution) and based on that decides on an optimal query execution plan.

VectorWise algebra

The plan generated by the Ingres optimizer is a query execution tree of algebraic operators implemented in VectorWise, expressed in the VectorWise algebra language. Its syntax is simple and can be seen in Figures 2.3 and 2.4. We describe the operators that are used in this thesis:


Aggr(
  Project(
    MScan('lineitem',
      [ 'l_extendedprice',
        'l_tax',
        'l_returnflag' ]),
    [ resprice =
        *(lineitem.l_extendedprice,
          +(sint('1'), lineitem.l_tax)),
      flag = lineitem.l_returnflag ]
  ),
  [ flag ],
  [ sumprice = sum(resprice) ]
);

<AGGR>:
|_<PROJECT>:
| |_<MSCAN>:
| | |_<STRING>: str('_lineitem')
| | |_<LIST>:
| | | |_<STRING>: str('_l_extendedprice')
| | | |_<STRING>: str('_l_tax')
| | | \_<STRING>: str('_l_returnflag')
| | \_<LIST>:
| \_<LIST>:
|   |_<ASSIGNMENT>: resprice
|   | \_<OPERATOR>: str('*')
|   |   |_<IDENT>: _lineitem._l_extendedprice
|   |   \_<OPERATOR>: str('+')
|   |     |_<CALL>: str('sint')
|   |     | \_<STRING>: str('1')
|   |     \_<IDENT>: _lineitem._l_tax
|   \_<ASSIGNMENT>: flag
|     \_<IDENT>: _lineitem._l_returnflag
|_<LIST>:
| \_<IDENT>: flag
|_<LIST>:
  \_<ASSIGNMENT>: sumprice
    \_<AGGROPERATOR>: str('sum')
      \_<IDENT>: resprice

Figure 2.3: An expression in VectorWise algebra and the corresponding xnode tree, as pretty-printed from the system log. The xnode tree corresponds to the parsed algebra.

• MScan. The MScan operator is the bridge between the storage layer and query execution. It is situated at the leaves of the query tree and is responsible for providing data read from the database. Its parameters are a table name and a list of columns to be read. As a result of a call to its next() method it produces a vector of values from each of the requested attributes in the given table.

• Select. The Select operator corresponds to a relational algebra Selection. It takes the input operator and a boolean filtering condition as parameters. A call to its next() method produces a vector of indexes in the input vectors for which the condition evaluated to true (a selection vector; see Section 2.2.2 for a description of how execution is performed with selection vectors).

• Project. The Project operator corresponds to a relational algebra Projection and Rename. Its parameters are the input operator and a list of projection expressions. The next() method produces vectors with the results of these expressions.

• Aggr. The Aggr operator corresponds to a relational algebra Aggregation. It takes an input operator, a list of key attributes and a list of aggregate functions. If the list of keys is empty, it performs aggregation into one global group. If the list of aggregate functions is empty, Aggr does duplicate elimination. A call to the next() method produces vectors for key attributes, and vectors with the results of aggregation for the corresponding group. Internally the Aggr operator can choose to work either directly, using the values of keys as group identifiers when their domain is small enough (DirectAggr), or with a hash table otherwise (HashAggr). If the input is sorted on the key attributes, we can also perform ordered aggregation, where the aggregates are computed one group at a time, because all values from one group come sequentially (OrdAggr). The operator can decide which method to use based on the metadata.

• HashJoin family. HashJoin is a Join relational operator implementation using a hash table to store the smaller of the two joined relations (the build relation). There is a whole family of HashJoin operators, such as HashJoinN, HashJoin1 and HashJoin01, with usage depending on the expected cardinality of matches on both sides. It takes two input operators for the two joined relations, from each of them a list of keys to be matched and a list of other attributes to project, and a flag specifying the matching behaviour (inner or outer; left, right or full). The next() method produces vectors of values from the bigger relation (the probe relation) and the corresponding vectors of matching values from the build relation.

• MergeJoin family. The MergeJoin family of operators is an alternative family of Join relational operators, used when both input relations are sorted on the key attributes and the join can be performed by simultaneously progressing through both inputs and producing matches in a merging process.

Rewriter

The Rewriter performs VectorWise-specific semantic analysis on the query plan. It assigns datatypes and other helpful metadata. It also does some optimizations that were not performed by Ingres – e.g. it introduces parallelism, rewrites conditions to use lazy evaluation and eliminates dead code. The Rewriter is implemented as a set of rules that are applied to a tree of xnode structures, which represent the plan. Figure 2.3 shows an example tree of xnodes created from a VectorWise algebra expression. Each Rewriter rule traverses the xnode tree in a top-down or bottom-up walk and either transforms it or annotates the nodes with additional meta information (called properties). The Rewriter is easily extensible with new rules. More information on the Rewriter can be found in Section 5.2, where we describe the new Rewriter rules introduced by JIT compilation and their limitations.

Builder

The Builder constructs the actual query execution structures out of the annotated xnode tree. These structures are Operators and Expressions. Operators in VectorWise correspond to the algebraic operators of VectorWise algebra, like Select, Project, Aggr etc. They form a tree implementing the Volcano model iterator API.

Expressions perform the actual basic operations. For example, they do the arithmetic operations in the Project operator, evaluate the conditions in the Select operator, compute hash values etc. Expressions often do not have corresponding nodes in the xnode tree. They form DAGs rather than trees, and the logic of what expressions are needed and how they are created and connected is operator-specific, implemented in the operator builder. Therefore, the annotated xnode tree after the Rewriter passes does not contain all the information and does not correspond exactly to the final query execution tree.

Since expressions are the structures that perform the actual work, they will be the object of JIT code generation in this thesis. Therefore, we provide a more detailed introduction to expressions in Section 2.2.2.

Vectorized query execution

When data is processed and transferred one tuple at a time, instructions related to query interpretation and tuple manipulation outnumber the instructions for the actual operation. Moreover, generic operator interfaces require virtual function calls, which further reduce performance.


[Figure: a tree of vectorized operators (Scan of the Employees table, Select with condition age > 30 producing a selection vector, Project computing the bonus column), with vectors of id, name and age values flowing between the operators via next() calls.]

SQL:

SELECT id, name, age, (age-30)*50 AS bonus
FROM people
WHERE age > 30

VectorWise algebra:

Project(
  Select(
    MScan('people',
      ['id', 'name', 'age']),
    People.age > sint('30')
  ),
  [people.id,
   people.name,
   people.age,
   bonus =
     *(-(age,sint('30')),50)]
)

Figure 2.4: Example of translation from SQL query into VectorWise algebra and then a tree of vectorized operators.

The difference between VectorWise execution and most other databases implementing the Volcano model is that the next() method of the iterator API returns a vector of results instead of a single tuple. Each call to the next() method produces a block of 100–10000 tuples (default: 1024). This reduces the interpretation overhead of the Volcano model, because the interpretation instructions are amortized across multiple tuples, and the actual computations are performed in tight loops.
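To illustrate the contrast with the tuple-at-a-time example above, a vector-at-a-time Project next() can be sketched in C as follows (hypothetical names and signatures, not the actual VectorWise API; the computation mirrors the bonus expression of Figure 2.4):

enum { VECTOR_SIZE = 1024 };  /* default block size */

/* Hypothetical child interface: fills 'age' with up to VECTOR_SIZE
   values and returns how many it produced (0 at end of stream). */
typedef int (*next_vector_fn)(void *child_state, int *age);

/* One next() call processes a whole block: the interpretation logic
   (function call, state checks) runs once per block, while the actual
   computation happens in a tight loop over the vector. */
int project_next(void *child_state, next_vector_fn child_next,
                 int *age, int *bonus) {
    int n = child_next(child_state, age);   /* pull one block of input */
    for (int i = 0; i < n; i++)
        bonus[i] = (age[i] - 30) * 50;      /* bonus = (age - 30) * 50 */
    return n;
}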

Vectorized processing will be described in more detail in Section 2.2.2.

Column storage and vectors

Most database systems use row-wise storage, with all attributes from one tuple stored together, called the N-ary Storage Model (NSM). Data in the VectorWise database is stored column-wise. One model is called the Decomposition Storage Model (DSM) [CK85], in which a disk page contains consecutive values belonging to just one column of a certain table. A middle ground between DSM and NSM that can also be used in VectorWise is a model called Partition Attributes Across (PAX). In this model a disk page stores data from all columns of a table (i.e. complete tuples), but within the page data is stored column-wise.

DSM has several advantages for analytical workloads. As database tables have many attributes, often only a few of which are used in a given query, reading only the needed columns instead of whole table rows greatly reduces the amount of transferred data. Data from a single column can also be compressed efficiently and easily using simple compression schemes such as dictionary encoding or run-length encoding, as the data often comes from a limited or clustered domain. This thesis focuses on query execution and is mostly indifferent to the storage layer. More details about VectorWise storage can be found in [Zuk09].

During execution data is also processed in a column-wise way. Expressions operate on vectors, which are arrays that contain data from a single column. Vectors are allocated by the Builder and data is transferred between operations in them.
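As a quick illustration of the difference between the layouts, NSM stores complete records contiguously, while DSM keeps each attribute in its own array (a schematic sketch, not VectorWise's actual page format):

/* NSM (row-wise): one array of complete tuples. */
struct person { int id; char name[16]; int age; };
struct person nsm_page[1024];

/* DSM (column-wise): one array per attribute; the vectors used during
   execution are slices of such columnar arrays. */
int  id_col[1024];
char name_col[1024][16];
int  age_col[1024];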

2.2.2. Expressions and primitives

In the query execution layer of VectorWise, the actual worker functions are called primitives. They are short functions processing arrays of values in a tight loop. They usually perform just one operation, such as an arithmetic calculation, evaluation of a boolean condition, fetching from memory locations pointed to by a vector of indexes etc. They take a number of vectors (or scalars) as parameters and output a vector of results, which in turn can be the input of another operation.


Algorithm 1 Examples of templates of primitive functions. The map primitives perform calculations on input vectors. The selection primitives evaluate a boolean condition on the input vectors and write the positions of tuples for which the condition was true to a resulting selection vector. Any primitive can work with or without an input selection vector. If a selection vector is present, the calculation is performed only on tuples from the selected positions. A NULL selection vector means that the operation shall be performed on all elements of the vectors. The primitives return the number of elements they have written to the resulting vector.

// Template of a binary map primitive.
uidx map_OP_T_col_T_col(uidx n, T* res, T* col1, T* col2, uidx* sel) {
    if (sel) {
        for (uidx i = 0; i < n; i++) res[sel[i]] = OP(col1[sel[i]], col2[sel[i]]);
    } else {
        for (uidx i = 0; i < n; i++) res[i] = OP(col1[i], col2[i]);
    }
    return n;
}

// Template of a binary map primitive with the first parameter being a constant.
uidx map_OP_T_val_T_col(uidx n, T* res, T* val1, T* col2, uidx* sel) {
    if (sel) {
        for (uidx i = 0; i < n; i++) res[sel[i]] = OP(*val1, col2[sel[i]]);
    } else {
        for (uidx i = 0; i < n; i++) res[i] = OP(*val1, col2[i]);
    }
    return n;
}

// Template of an unary map primitive.
uidx map_OP_T_col(uidx n, T* res, T* col1, uidx* sel) {
    if (sel) {
        for (uidx i = 0; i < n; i++) res[sel[i]] = OP(col1[sel[i]]);
    } else {
        for (uidx i = 0; i < n; i++) res[i] = OP(col1[i]);
    }
    return n;
}

// Template of a binary selection primitive.
uidx sel_COND_T_col_T_col_branching(uidx n, uidx* res_sel, T* col1, T* col2, uidx* sel) {
    uidx ret = 0;
    if (sel) {
        for (uidx i = 0; i < n; i++)
            if (COND(col1[sel[i]], col2[sel[i]])) res_sel[ret++] = sel[i];
    } else {
        for (uidx i = 0; i < n; i++)
            if (COND(col1[i], col2[i])) res_sel[ret++] = i;
    }
    return ret;
}

// Alternative template of a binary selection primitive, without a branch.
uidx sel_COND_T_col_T_col_datadep(uidx n, uidx* res_sel, T* col1, T* col2, uidx* sel) {
    uidx ret = 0;
    if (sel) {
        for (uidx i = 0; i < n; i++) {
            res_sel[ret] = sel[i]; ret += COND(col1[sel[i]], col2[sel[i]]);
        }
    } else {
        for (uidx i = 0; i < n; i++) {
            res_sel[ret] = i; ret += COND(col1[i], col2[i]);
        }
    }
    return ret;
}


The primitives are arranged into a tree (or DAG) of structures called expressions and evaluated in left-deep recursive order, with output from the child expressions fed as input to the parents. The way expressions are linked together and evaluated is controlled by operators, which use them according to their more complex logic. The DAG of expressions builds composite operations out of the basic primitives. The vectors that are passed between the expressions introduce some materialization overhead, but it is minor, because the vector size is chosen such that the currently processed vectors fit in the CPU cache.

Algorithm 1 shows examples of code templates of different types of primitives. A col suffix in the primitive name indicates a vector (columnar) parameter, while a val suffix is used for a constant parameter (scalar). Type T can be substituted with different types, with uchr, usht, uint, ulng representing internal types for 1-, 2-, 4- and 8-byte unsigned integers, and schr, ssht, sint, slng being their signed counterparts. uidx is an unsigned integer type used for indexes and positions. Each primitive can either work on all elements of the input vectors, or only on selected positions, provided in an input selection vector. This is how VectorWise handles selections: a Select operator only has to output a vector of selected positions in a flow-through stream of tuples, instead of copying the selected values from all columns of a table into new densely-filled vectors. Of course, if the selection vectors become short, so that operations are performed on only a few tuples in a block, a special operator called Chunk can be used, which packs the sparse values into new vectors, dropping the need for a selection vector and allowing better amortization of interpretation overhead higher up in the query plan. There may be more than one primitive implementation for a single operation, for example the two different templates for selection primitives in Algorithm 1. Different versions of a primitive might be chosen depending on some additional logic in the operator.
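For illustration, a hypothetical expansion of a binary map template, with OP being integer addition and T being sint, might look as follows (the macro-generated VectorWise code may differ in detail):

typedef unsigned int uidx;  /* index/position type, as described above */
typedef int sint;           /* 4-byte signed integer internal type */

/* Binary map template instantiated with OP = '+' and T = sint. */
uidx map_add_sint_col_sint_col(uidx n, sint *res,
                               sint *col1, sint *col2, uidx *sel) {
    if (sel) {  /* compute only at the selected positions */
        for (uidx i = 0; i < n; i++)
            res[sel[i]] = col1[sel[i]] + col2[sel[i]];
    } else {    /* compute on all n elements */
        for (uidx i = 0; i < n; i++)
            res[i] = col1[i] + col2[i];
    }
    return n;
}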

VectorWise pre-generates primitives for all needed combinations of operations, types and parameter modes. All functions needed to support SQL fit in ca. 9000 lines of macro-code, which expands into roughly 3000 different primitive functions, producing 200K lines of code and a 5 MB binary.

2.2.3. VectorWise performance

VectorWise achieves very high performance, with over 50 times speedup over interpreted tuple-at-a-time database systems [ZBNH05]. On the one hand, this is the effect of optimized storage, which is able to deliver data for execution at a very high rate thanks to its column-oriented architecture, light-weight compression schemes and in-cache decompression (more information can be found in [Zuk09]). In this thesis we concentrate more on the execution layer. Thanks to vectorization we get benefits such as amortizing the interpretation overhead over multiple tuples and better possibilities of SIMD processing using SSE.

VectorWise performance is confirmed by the TPC-H industrial benchmark [Tra02]. TPC-H is an ad-hoc decision support benchmark. The VectorWise DBMS is currently the fastest non-clustered database management system benchmarked in the categories of 100 GB, 300 GB and 1 TB databases.

By introducing JIT compilation we try to improve this performance even further.

2.3. Modern hardware architecture

The reasoning behind many of the methods used in this thesis lies in the architecture of modern hardware. We describe some of the features of modern CPUs that we take advantage of. Detailed information is taken from [Fog11].


[Figure: timeline of instruction stages (Instruction Fetch, Instruction Decode, Execute, Write-back). In sequential execution each instruction completes all four stages before the next one starts; in pipelined execution the stages of consecutive instructions overlap, with a new instruction entering the pipeline every CPU cycle.]

Figure 2.5: Sequential execution vs. pipelined execution. From [Zuk09].

2.3.1. CPU

The central processing unit (CPU) is the part of a computer system that executes the instructions of a computer program. It carries out each instruction of the program to perform the basic arithmetical, logical, and input/output operations of the system. The CPU manipulates data, which can be stored in a number of places: registers, caches, and RAM memory (from fastest to slowest). The instructions that it executes are also stored in memory or cache. The CPU operates in cycles issued by a clock, with a frequency on the order of a few GHz in modern processors. In most of our experiments we use cycles instead of time as a metric to measure performance. The elapsed cycles can be read using hardware counters.

2.3.2. Pipelined execution

Carrying out an instruction takes more than one CPU cycle. The execution consists of various stages:

• Instruction Fetch (IF) – get the next instruction to execute.

• Instruction Decode (ID) – decode the instruction from the binary operation code.

• Execute (EX) – perform the requested task. This can involve multiple different devices such as arithmetic logic units (ALU), floating-point units (FPU) or load-store units (SPU).

• Write-back (WB) – save the result.

These stages are executed by different components of the CPU. The execution of different stages of different instructions can therefore be overlapped. This way the CPU can accept a new instruction every cycle.

The first Pentium processors were able to execute two paired instructions simultaneously. From that point the pipeline depth kept increasing, until it reached 31 stages in the Pentium 4 Prescott processors.


[Figure: pipeline timeline in which the instructions speculatively fetched after a mispredicted branch are flushed, leaving a bubble in the pipeline before the correct instructions enter it.]

Figure 2.6: CPU bubble due to branch misprediction. From [Zuk09].

Due to the problems with such deep pipelining, described in Section 2.3.3, the pipelines got a little shorter in newer architectures. The Nehalem processors have a 16-stage pipeline. A comparison of pipelined execution and sequential execution is shown in Figure 2.5.

2.3.3. Hazards and how to deal with them

The problem with deeply pipelined execution is that the instructions taken into the pipeline have to be independent. If the next instruction depends on the result of a previous one, the CPU has to wait, cannot fully utilize the pipeline, and performance is sub-optimal. This can be due to two types of situations called hazards: data hazards and control hazards.

A data hazard occurs when an instruction cannot be executed because it operates on an input that is not ready. For example, in

c = a + b;

e = c + d;

the first addition has to finish before the second one can start.

A control hazard is a result of conditional jumps and branching in the code. To fill the pipeline, the CPU needs to know which instructions will be executed next. If the code has conditional jumps (“if” statements), it might be impossible to schedule the next instructions, causing stalls in the pipeline.

Out-of-order execution

To deal with the problem of instruction independence, a CPU can execute instructions out of order. Even though a program is a sequence of instructions, when reaching an instruction that cannot be put into execution due to a hazard, the CPU can look ahead and detect which of the upcoming instructions are independent and can be executed immediately. The reorder buffer in a Nehalem processor can contain 128 entries.

Branch prediction

To avoid waiting for the calculation of the outcome of a conditional jump, processors try to predict the outcome and speculate ahead, taking instructions from one of the branches into execution. The CPU bases its decision on a history of previous results of the conditional jump.

On one hand, branch prediction helps performance if the prediction is correct; but if the branch is mispredicted, the speculated instructions have to be undone, which imposes a penalty. In Nehalem processors this penalty is at least 17 cycles. If the pattern of whether the branch is taken or not is irregular, with a selectivity close to 50%, no sophisticated branch prediction algorithm can help and performance is severely degraded. Therefore, such situations have to be avoided (compare the two selection primitive templates in Algorithm 1, the second of which avoids a data-dependent branch).

In our experiments we measure the number of mispredicted branches with hardware counters.

2.3.4. CPU caches

The memory system in a modern computer is hierarchical.

Registers can be called the fastest type of memory. There are 16 general purpose registers in an x86_64 CPU that are named and manipulated explicitly in an assembly program. In fact the CPU has more registers, which it can use to execute more instructions simultaneously in a pipeline. The access latency of a register is about 1-3 cycles.

Main memory is also accessed explicitly (see more in Section 2.3.6) and its capacity is in gigabytes, but the access latency is much higher, on the order of hundreds of CPU cycles.

Because of that, there is a number of cache layers between the registers and main memory. These caches work implicitly and cannot be controlled by the programmer. The accessed data and instructions are automatically put into and evicted from the caches.

A Nehalem processor has three levels of cache. The L1 cache is split into an instruction cache and a data cache, each with a capacity of 32 kB per CPU core. The latency for accessing the L1 cache is 4 cycles, so it is almost as fast as registers. The L2 cache has 256 kB per CPU core and an access latency of 11 cycles. There is also an L3 cache with a capacity of 8 MB shared by the whole processor, but its access latency goes up to 38 cycles. Data is loaded into and stored from the caches in 64-byte cache lines.

In VectorWise we try to ensure that the data vectors we currently operate on fit in, and are permanently kept in, the L2 cache.

2.3.5. SIMD instructions

We often encounter a situation where we execute the same operation on multiple data elements. This is especially the case in vectorized execution, where we perform basic operations on whole vectors of data in tight loops.

Processors are equipped with special instructions that can perform one operation on multiple data elements at once. This is called the Single Instruction, Multiple Data (SIMD) model. It started with the MMX instruction set in the Pentium 1 processors, which shared registers with the floating point operations. More recent processors have the SSE instruction set, with its own set of registers: 8 new 128-bit registers (xmm0 to xmm7). Nehalem CPUs support SSE version 4.2, with 54 instructions. Using it we can, for example, add and subtract 16 single-byte values, multiply 8 single-byte integers into 2-byte results etc. in a single instruction.

Using SSE in vectorized operations is possible thanks to the column-wise orientation of vectors. Multiple values from one attribute are adjacent in memory, so they can be moved into an SSE register with a single instruction. If the tuples were stored together row-wise, we would have to gather the values of different tuples from separate memory locations, which would make it an unprofitable endeavour.
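As an illustration, the following minimal sketch (our own example, not VectorWise code) uses SSE2 intrinsics to add 16 single-byte values from two column fragments in one instruction:

#include <emmintrin.h> /* SSE2 intrinsics */

/* Sketch: add 16 single-byte values from two adjacent column
   fragments a and b using a single SIMD instruction. */
void add16(unsigned char *res, const unsigned char *a, const unsigned char *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* 16 bytes of column a */
    __m128i vb = _mm_loadu_si128((const __m128i *)b); /* 16 bytes of column b */
    _mm_storeu_si128((__m128i *)res, _mm_add_epi8(va, vb));
}

The column-wise layout is what makes the two loads possible: the 16 inputs of each operand are contiguous in memory.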

2.3.6. Virtual memory and TLB cache

Unfortunately, not all database operations can be executed on data that fits in the processor caches. Sometimes we need to access random memory locations in very big data structures, for example while probing a hash table. Memory accesses to RAM are much more expensive, and the performance penalty can be even bigger because of the way virtual memory is implemented.

A process's memory is addressed with virtual addresses instead of physical ones. This way the process sees its memory as a contiguous and homogeneous address space, while in fact it can have multiple fragments of physical memory assigned by the operating system, or even have part of its memory swapped to disk.

Virtual memory consists of pages which are usually 4 kB in size (or 2 MB if big pages are enabled). A virtual memory address consists of the page address and an offset in the page. The page address has to be translated into an address in physical memory. This is done in hardware using a Translation Lookaside Buffer (TLB), which holds the translations for recently accessed pages.
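For illustration, with 4 kB pages the split of a virtual address into page number and offset is a simple bit operation (a sketch of ours, not VectorWise code):

#include <stdint.h>

/* Sketch: decompose a virtual address for 4 kB (2^12 byte) pages.
   The low 12 bits are the offset within the page; the remaining bits
   form the virtual page number that the TLB has to translate. */
uintptr_t page_offset(uintptr_t addr) { return addr & 0xFFFu; }
uintptr_t page_number(uintptr_t addr) { return addr >> 12; }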

This buffer’s capacity is limited, so when we access random memory locations, we often hit virtual memory pages for which translations are not in the TLB, and we cause a TLB miss. To handle a TLB miss, the translation has to be read from a directory page. A directory page is itself a normal 4 kB virtual memory page and it can hold 512 page translation entries, covering a span of 2 MB of memory. If the directory page's translation is not in the cache either, we can hit another TLB miss while accessing it. Then we have to access a higher level directory page, which holds translations for 512 lower level directory pages, corresponding to a span of 1 GB of memory. Accessing a random location in a huge span of memory can therefore force us to perform multiple memory accesses for handling TLB misses, causing very high latency. We can either try to avoid such situations or provide other operations that can be executed out of order to hide that latency.

In recent architectures we observe a significant increase in TLB cache sizes. A Nehalem CPU has 2 levels of TLB: the first level can hold 64 entries for 4 kB data pages and 64 entries for 4 kB instruction pages, and the unified second level holds 512 entries for 4 kB pages.

The number of TLB misses can be measured using hardware counters.

2.4. Related work on JIT query compilation

Just-In-Time (JIT) query compilation has so far been mostly considered as an alternative method for eliminating the ill effects of interpretation overhead. On receiving a query for the first time, the query processor compiles (a part of) the query into a routine that gets subsequently executed. We will generalise the current state-of-the-art JIT techniques using the term “single-loop” strategies, as these try to compile the core of the query into a single loop that iterates over tuples. This is in contrast with vectorized execution, which decomposes operators into multiple basic steps, and executes a separate loop for each of them (“multi-loop”). We will discuss some of the existing database systems employing JIT query compilation in the remainder of this section.

It will be the goal of this thesis to generate on the fly new, specialized primitives for combinations of operations, reducing a tree (DAG) of interpreted expressions into a single compiled function. The proposed model is to perform many operations in a single loop, operating on one tuple in one iteration, but leave it in the framework of the vectorized engine. The evaluation in Chapter 3 shows that such a mixed approach is better than compiling everything using a single-loop strategy.

The following sections briefly describe other academic and commercial database management systems that use JIT query compilation.


2.4.1. Origins

System R originally generated assembly directly, but the non-portability of that approach led to its abandonment [D. 81].

2.4.2. HIQUE

HIQUE is an academic system developed at the University of Edinburgh and described in [KVC10, Kri10]. It coins the term holistic compilation. It generates and compiles C code for whole queries at once, using an external call to the gcc compiler and dynamic linking. Compilation time of a whole TPC-H query is in the order of hundreds of milliseconds.

At the simplest level, all primitive operations such as comparisons, casts, data fetching etc. are generated and inlined, tailored for the specific data types and tuple layout. HIQUE code generation strives to achieve good cache locality and linear memory access patterns. It employs elements of block-oriented processing by using units of work fitting in the L2 cache, which the authors call pages. The prototype system is limited to queries that can be executed with an Aggregation on top of Joins on top of a Selection over a Scan. All Selections have to be conjunctive and without negation, so they can be pushed down below the joins. The first group of generated functions performs data input and selections. Storage uses the NSM model, but unnecessary attributes are dropped during the scan, to produce thinner tuples. HIQUE immediately calculates the selections during the scan. Joins are hash-partitioned into L2 cache-fitting partitions on both sides of the join. Whole tuples, not references, are copied into the partitions so that further memory accesses are local. Then either a nested loop join is used, or the cache-fitting partitions are sorted and a merge join is performed, to achieve a linear memory access pattern. JIT code generation allows HIQUE to generate specialized code for multi-way joins on a common key, which are common in analytical workloads. Such multi-way joins implemented as join teams can be very efficient [GBC98]. In aggregation HIQUE also prefers ordered aggregation, hash-partitioned into L2 cache-fitting chunks. With dynamically generated code many aggregate functions can be calculated at once.

In general, HIQUE performs a lot of in-memory intermediate materialization and data staging to use algorithms with linear memory access patterns and work on L2 cache-fitting chunks. This can be seen as elements of vectorization.

The authors of HIQUE observe that the development process of implementing C code generation is relatively easy. Errors in code generation can be easily spotted by analyzing the generated C code. This way, the number of development and test iterations required until code generation for a new algorithm is fully supported is reduced.

HIQUE claims TPC-H performance comparable to the MonetDB/X100 system in [KVC10], not based on a direct benchmark but by comparing its execution times with times given in [ZNB08].

2.4.3. Hyper

The recently announced Hyper system is developed at the Technical University of Munich [KN10].

Contrary to HIQUE, the algorithms Hyper uses are intended to avoid any intermediate materialization as much as possible. The generated code breaks operator boundaries, combining as much code as possible in a single loop and keeping the intermediate results in registers. Materialization is done only when strictly necessary, when a pipeline breaker is encountered, e.g. if one side of the join has to be put into a hash table or an aggregation result has to be calculated. Each generated code fragment performs all actions that can be done within one part of the execution pipeline before materializing the result into the next pipeline breaker.


Another difference is that query execution follows a “push” model instead of “pull”, with data being pushed towards the root of the query execution tree, instead of being pulled from the leaves.

What makes Hyper different from other systems is that it does not generate code in a high-level language but in the LLVM assembly-level intermediate representation (IR) language. LLVM is a collection of modular and reusable compiler-construction tools that will be described in Section 4.1. The LLVM IR is architecture-independent, but it uses assembly-level instructions operating on virtual registers. Code generation may be harder than in a higher-level language, but provides more precise control over the result. The LLVM code optimization library is used to optimize the assembly code. It provides optimization passes just as different compilers do, but using LLVM directly Hyper can control which passes are performed and in what order, or add its own custom optimization passes, adapted to the specific code it generates.

2.4.4. JAMDB

The JAMDB system comes from the IBM Almaden Research Center [RPML06]. The major characteristic that distinguishes it from others is that it is implemented in Java and uses the reflection and JIT present in Java Virtual Machines.

The system has no real storage layer; all data is Java objects in memory. It does not deal with the problems of ensuring ACID properties. A database table schema corresponds simply to a Java class and one object is a row of the table. Foreign keys are just references to other Java objects in memory and indexes are Java HashMaps. Joins are supported only on these explicitly created join indexes. JAMDB depends on the Java Virtual Machine for buffers and memory management.

Code generation is therefore relatively simple. Almost nothing is needed for joins, as they are just reference following or Java HashMap lookups. For selections JAMDB reorders the predicates based on predicted selectivity, to get the most filtering condition first. Nothing is however said about reordering them dynamically at runtime if the observed selectivity differs from the expected one. Most of the code generation is actually just inlining the correct comparison, casting and fetching methods for different datatypes.

Their main claim is that using the Java Virtual Machine and generating Java classes lets them take advantage of the JVM's built-in JIT compilation capabilities. They generate new Java class code and load it directly using Java's class loader. When compiling operations together JAMDB avoids virtual function calls and reflection in Java. The authors claim that the JVM JIT compiler will analyze and optimize the executed code at runtime, which will result in performance comparable to a hard-coded C system.

The provided system evaluation is very limited. JAMDB is compared only against DB2, achieving a speedup of no more than 3x over an interpreted system.

2.4.5. Other

AT&T developed a system called Daytona. In [Gre99] not much is written about how and what code is generated and about the overall infrastructure. It rather describes the syntax and capabilities of a custom query language called Cymbal. In the Daytona system there is no database server. Each query template is compiled into a standalone program that accesses the data directly. A query template is a query with some parameter placeholders, and the generated program is called with these parameters filled in to perform the actual query. The database itself is stored as disk files directly accessed by the query programs. All concurrency control, scheduling, locking etc. is left to the operating system file locking and scheduling mechanisms. The system does not deal with ACID properties itself. No discussion or measurements are provided about Daytona's performance.

Another system known to use compilation is the ParAccel [Par10] commercial analytical database system. No details on its inner workings are publicly released, however.

2.5. Summary

In this chapter we introduced all concepts necessary for understanding the rest of this thesis. A description of database management system construction in general, and the VectorWise DBMS in particular, has been provided. In addition to VectorWise, which is a representative of a vectorized system, we analyzed and described existing systems that implement JIT compilation.

The reasoning behind many of the algorithms and methods used in this thesis lies in the architecture of modern hardware, so an introduction to some hardware concepts was also needed.


Chapter 3

Vectorization vs. Compilation

Vectorization and JIT compilation have been proposed as two different techniques for reducing the interpretation overhead of a database management system. In this chapter we try to shed light on the question of how state-of-the-art compilation strategies relate to vectorized execution for analytical database workloads on modern CPUs, and whether it is worth combining them. For this purpose, we carefully investigate the behavior of vectorized and compiled strategies in three use cases: Project, Select and Hash Join. One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where “loop-compilation” strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or by using query compilation to create building blocks for vectorized processing.¹

3.1. Experiments overview

We have chosen three case studies which show how to exploit the capabilities of modern processors by combining vectorization with loop-compilation. These case studies cover operators important for database performance, such as Project, Select and hash-based operators (using the example of HashJoin). Improving their performance would result in a significant speedup of the database engine as a whole.

As our first case, Section 3.2 shows how in Project expression calculations loop-compilation tends to provide the best results, but that this hinges on using block-oriented processing and SIMD. Thus, compiling expressions in a tuple-at-a-time engine may improve performance somewhat, but falls short of the gains that are possible. This case study shows the problems with effectively using the SIMD capabilities of modern processors.

In Section 3.3, our second case is Select, where we show that branch mispredictions hurt loop-compilation when evaluating conjunctive predicates. In contrast, the vectorized approach avoids this problem as it can transform control-dependencies into data-dependencies for evaluating booleans (along the lines of [Ros02]). In Section 3.4 we consider a Select with disjunctive predicates, which is a little more complicated for the vectorized approach, but it still outperforms loop-compilation. These case studies show the problems of maintaining deeply pipelined execution and avoiding branches. Section 3.5 gives a synthesis of the Project and Select cases.

¹Part of the material from this chapter was published as a paper at the DaMoN 2011 workshop in Athens [SZB11].


The third case, described in Section 3.6, concerns building and probing large hash-tables, using a HashJoin as an example. Here, loop-compilation gets CPU cache miss stalled while processing linked lists (i.e. hash bucket chains). We show that a mixed approach that uses vectorization for chain-following is fastest and robust to the various parameters of the key lookup problem. This case study shows the importance of optimal memory access patterns and of exploiting the processor's ability to execute out of order. Since hash tables are used not only in HashJoin but also in other operators, the approach can be applied to e.g. Aggregation as well.

In Section 3.7 we discuss aspects of vectorization and compilation other than performance – software engineering, system complexity and maintainability.

These findings lead to a set of conclusions which we present in Section 3.8.

3.1.1. Experimental setup

We used the PAPI library [TJYD09] to collect the actual number of clock cycles, retired instructions, branch mispredictions, and TLB misses in our microbenchmarks. Between 100 and 1000 repetitions of each measurement have been performed, to make sure that the gathered results are stable and smooth. We used a 2.67GHz Nehalem core, on a 64-bit Linux machine with 12GB of RAM. The gcc 4.4.2 compiler was used for most of the experiments. We explicitly mention places where we also used icc 11.0 or clang with LLVM 2.9.

3.2. Case study: Project

Inspired by the expressions in Q1 of TPC-H we focus on the following simple Scan-Project query as a micro-benchmark:

SELECT l_extprice*(1-l_discount)*(1+l_tax) FROM lineitem

The scanned columns are all decimals with precision two. VectorWise represents these internally as integers, using the value multiplied by 100 (e.g. 19.98 is stored as the integer 1998). After scanning and decompression it chooses the smallest integer type that, given the actual value domain, can represent the numbers. The same happens for calculation expressions, where the destination type is chosen to be the minimal-width integer type such that overflow is prevented. In the TPC-H case, the price column is a 4-byte integer and the other two are single-byte columns. The addition and subtraction again produce single bytes, their multiplication a 2-byte integer. The final multiplication multiplies a 4-byte with a 2-byte integer, creating a 4-byte result.

We have run the query using different vector lengths, from 1 to 4M tuples. The results presented in Figure 3.1 show the cost of calculating the whole expression per tuple.

3.2.1. Vectorized Execution

The expression is calculated using the map primitives shown in the upper part of Algorithm 2. In Section 2.2 it was discussed that primitives are generated for combinations of operations and types. A question is: how many such combinations should be considered?

Currently, VectorWise pre-generates map primitives where the input vectors and the results all have the same type. Focusing on integer types (1-, 2-, 4-, 8-byte; signed/unsigned) there are 8 combinations of functions to generate for each operation (which is further multiplied by the col and val combinations). However, such functions do not suit all needs of the benchmarked expression: at one point we want to multiply 1-byte integers into a 2-byte result, and then multiply 4- and 2-byte integers together.


[Figure 3.1 plot: three panels against vector size (1 to 4M tuples) – cycles per tuple (log scale), a close-up of cycles per tuple, and L1 cache accesses per tuple (log scale). Lines: A. Vectorized, explicit casts (gcc); B. Vectorized, optimal (gcc); C. Compiled, gcc; D. Compiled, icc; E. Vectorized, optimal, no-SIMD clang; F. Compiled, no-SIMD clang. Markers show the tuple-at-a-time “Compiled per tuple” and “Interpreted” results.]

Figure 3.1: Project micro-benchmark: vectorized and compiled implementations of an arithmetic expression. Different compilers are used to show differences in SIMDization capabilities. Two vectorized implementations, optimal and with forced type coercions, show the trade-off between performance and the amount of necessary pregenerated code.


Algorithm 2 Implementation of an example query using vectorized and compiled modes. Dynamically compiled primitives, such as map_compiled(), follow the same template as pre-generated vectorized primitives, but may take arbitrarily complex expressions as OP. Handling of the selection vector is omitted for compactness.

const uidx VECLENGTH = 1024;
// input
uint l_extprice[VECLENGTH]; uchr l_discount[VECLENGTH], l_tax[VECLENGTH];
uchr v1 = 100, v2 = 100;
// output
uint res[VECLENGTH];
// intermediates
uchr tmpuchr1[VECLENGTH], tmpuchr2[VECLENGTH];
usht tmpusht1[VECLENGTH], tmpusht2[VECLENGTH], tmpusht3[VECLENGTH];
uint tmpuint1[VECLENGTH];

// Vectorized code with need for explicit casting:
map_uchr_add_uchr_col_uchr_val(VECLENGTH, tmpuchr1, l_tax, &v1, sel);
map_uchr_sub_uchr_val_uchr_col(VECLENGTH, tmpuchr2, &v2, l_discount, sel);
map_uchr2usht_uchr_col(VECLENGTH, tmpusht1, tmpuchr1, sel);
map_uchr2usht_uchr_col(VECLENGTH, tmpusht2, tmpuchr2, sel);
map_usht_mul_usht_col_usht_col(VECLENGTH, tmpusht3, tmpusht1, tmpusht2, sel);
map_usht2uint_usht_col(VECLENGTH, tmpuint1, tmpusht3, sel);
map_uint_mul_uint_col_uint_col(VECLENGTH, res, l_extprice, tmpuint1, sel);

// Optimal vectorized code with primitives for more combinations of types:
map_uchr_add_uchr_val_uchr_col(VECLENGTH, tmpuchr1, &v1, l_tax, sel);
map_uchr_sub_uchr_val_uchr_col(VECLENGTH, tmpuchr2, &v2, l_discount, sel);
map_usht_mul_uchr_col_uchr_col(VECLENGTH, tmpusht1, tmpuchr1, tmpuchr2, sel);
map_uint_mul_uint_col_usht_col(VECLENGTH, res, l_extprice, tmpusht1, sel);

// Compiled equivalent of this expression:
uidx map_compiled(uidx n, uint *res, uint *col1, uchr *col2, uchr *col3 /*, uidx *sel */) {
    for (uidx i = 0; i < n; i++)
        res[i] = col1[i] * ((100 - col2[i]) * (100 + col3[i]));
    return n;
}

We need additional primitives, which cast a vector into a vector of another type, to coerce it into being suitable for use with a proper arithmetic map primitive. This not only constitutes a redundant additional operation, but also uses more memory, as an additional intermediate vector has to be used to save the result of casting. This may make the operation's working-space memory needs grow beyond the CPU cache, causing performance degradation.

If the map primitives were able to accept parameters of different types, casting would be zero-cost, performed while moving the operands from memory to registers, and fewer intermediate values would be materialized. The performance can be compared by looking at the “explicit casts” and “optimal” lines in Figure 3.1.

However, such flexible behaviour comes at the cost of having to pregenerate an order of magnitude more combinations of primitives. Consider a binary operation on integer types. Originally there were 8 variants generated because of type combinations. Now we would like the operation to also be able to take parameters of shorter types and to convert between signed and unsigned:

• An operation producing a signed/unsigned 1-byte result will have to accept two parameters, each of which could be signed/unsigned 1-byte – 2² combinations for the two parameters.

• An operation producing a signed/unsigned 2-byte result will need support for signed/unsigned 1- and 2-byte parameters – 4² combinations.

• Similarly for 4- and 8-byte datatypes – 6² and 8² combinations.

As a result, instead of 8 combinations there are

2 · 2² + 2 · 4² + 2 · 6² + 2 · 8² = 240

This is a 30-fold increase in the size of code that needs to be generated for binary operations. The results in Figure 3.1 show that this approach can lead to significant performance gains, but such an increase in executable size might be unacceptable. This alone, if nothing else, can be a reason for using JIT compilation to generate simple primitives with the required combination of types.
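The count above can be verified with a small calculation (a sketch of ours, independent of any VectorWise code):

/* Sketch: count the primitive variants needed when a binary operation
   may take parameters of any width up to the result width. */
#include <stdio.h>

int main(void) {
    int total = 0;
    /* w indexes the four result widths 1, 2, 4, 8 bytes */
    for (int w = 1; w <= 4; w++) {
        int param_types = 2 * w;                /* signed/unsigned x widths <= result */
        total += 2 * param_types * param_types; /* 2 result signs x parameter combos */
    }
    printf("%d\n", total); /* prints 240 */
    return 0;
}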

3.2.2. Loop-compiled Execution

The lower part of Algorithm 2 shows the compiled code that a modified version of VectorWise can now generate on-the-fly: it combines vectorization with compilation. Such a combination in itself is not new (“compound primitives” [BZN05]), and the end result is similar to what a holistic query compiler like HIQUE [KVC10] generates for this Scan-Project plan, though it would also add Scan code. However, if we assume a HIQUE with a simple main-memory storage system and take l_tax, etc. to be pointers into a column-wise stored table, then map_compiled() would be the exact product of a “loop-compilation” strategy.

The most obvious reason for using JIT compilation in the context of compound operations in a projection is to avoid materialization of intermediate results and the necessity of type coercions.

3.2.3. SIMDization using SSE

Simple independent arithmetic operations on columns of data, such as those performed in map primitives, are easy to SIMDize in the vectorized implementation. With a single SSE instruction, modern CPUs can add and subtract 16 single-byte values, multiply 8 single-byte integers into 2-byte results, or multiply two 4-byte integers into 8-byte results (cast back into a 4-byte type) [sse]. It is reasonable to expect that a compiler that claims support for SIMD should be able to vectorize the trivial loop in the map functions.

Thus, 16 tuples in the analyzed expression could be processed by the vectorized implementation with 12 calculations:

• 1 for single-byte addition,

• 1 for single-byte subtraction,

• 2 for multiplications of single-bytes into a 2-byte results,

• 8 for multiplications of 2-byte and 4-byte values into 4-byte results.

The input for 16 tuples is 16 · (1 + 1 + 4) = 96 bytes of memory. The intermediate results of vectorized execution are 16 · (1 + 1 + 2) = 64 bytes of memory. The final result is 16 · 4 = 64 bytes of memory. We therefore need to load 96 + 64 = 160 bytes and store 64 + 64 = 128 bytes of memory for every 16 tuples. Using SSE we load or store 16 bytes into an SSE register in one memory operation, so we need 18 operations.

The vectorized implementation with explicit casting needs to cast the 1-byte results of 100 + l_tax and 100 - l_discount into a 2-byte type, and cast the 2-byte results of (100+l_tax) * (100-l_discount) into a 4-byte type.


[Figure 3.2 diagram: data flow of the expression – 16-byte loads of l_tax and l_discount, 1 add and 1 sub producing 100+l_tax and 100-l_discount, 2 multiplications producing (100+l_tax)*(100-l_discount), a 64-byte load of l_extprice, 8 multiplications producing the final result, and a 64-byte store.]

Figure 3.2: Calculation of l_extprice*(100+l_tax)*(100-l_discount) with SSE. The diagram shows the number of operations that have to be performed per 16 tuples. In the “loop-compiled” implementation only the input attributes and the final result have to be loaded from/stored to memory. A vectorized implementation additionally has to store all intermediate results in memory.

For the additional intermediate results we need 16 + 16 + 16 · 2 = 64 bytes per 16 tuples that have to be loaded and stored. The total memory traffic is therefore 224 bytes loaded and 192 bytes stored, which is 26 memory operations.

The loop in the “loop-compiled” implementation is more complicated to vectorize in an optimal way, because different operations have to be vectorized with different block sizes. The loop has to be decomposed to process one 16-tuple single-byte addition and subtraction, then two 8-tuple 2-byte multiplications and finally eight 4-byte multiplications, storing intermediate results in SSE registers. This requires more padding and border-case handling, but dramatically reduces the number of memory operations. We only need to load the input parameters and store the final result. This is 96 bytes to load and 64 bytes to store, which can be done in 10 memory operations for 16 tuples.
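A hand-written sketch of such a decomposition could look as follows (our own illustration using SSE intrinsics, assuming unsigned values and SSE4.1 for the 4-byte multiply; in the thesis this code is produced by the auto-vectorizing compiler, and the exact instruction counts may differ from Figure 3.2):

#include <emmintrin.h> /* SSE2 */
#include <smmintrin.h> /* SSE4.1: _mm_mullo_epi32 */

/* Sketch: process 16 tuples of res[i] = price[i]*((100+tax[i])*(100-disc[i]))
   keeping all intermediates in SSE registers. */
void project16(unsigned *res, const unsigned *price,
               const unsigned char *tax, const unsigned char *disc) {
    const __m128i c100 = _mm_set1_epi8(100);
    const __m128i zero = _mm_setzero_si128();
    /* one 16-tuple single-byte add and sub */
    __m128i t = _mm_add_epi8(c100, _mm_loadu_si128((const __m128i*)tax));
    __m128i d = _mm_sub_epi8(c100, _mm_loadu_si128((const __m128i*)disc));
    /* two 8-tuple 2-byte multiplications (bytes zero-extended to 16 bits) */
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(t, zero), _mm_unpacklo_epi8(d, zero));
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(t, zero), _mm_unpackhi_epi8(d, zero));
    /* 4-byte multiplications with the price column, 4 tuples at a time */
    __m128i m[4] = { _mm_unpacklo_epi16(lo, zero), _mm_unpackhi_epi16(lo, zero),
                     _mm_unpacklo_epi16(hi, zero), _mm_unpackhi_epi16(hi, zero) };
    for (int g = 0; g < 4; g++) {
        __m128i p = _mm_loadu_si128((const __m128i*)(price + 4*g));
        _mm_storeu_si128((__m128i*)(res + 4*g), _mm_mullo_epi32(p, m[g]));
    }
}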

Without SIMD, each input attribute, intermediate result and final result has to be loaded or stored separately for each tuple. This requires 3 + 1 stores (3 intermediate results, 1 final result) and 3 + 3 loads (3 input attributes, 3 intermediate results), for a total of 10 memory operations per tuple. The compiled implementation without SIMD only needs 1 + 3 = 4 memory operations per tuple, because it does not produce intermediate results.

This reasoning about the number of memory and calculation operations is presented in a diagram in Figure 3.2. The bottom graph in Figure 3.1 shows the number of L1 cache accesses per tuple made by the different implementations, which is roughly in line with this discussion.

It is a question whether the compiler will also be able to automatically vectorize the “loop-compiled” compound operation. We performed our experiments on three compilers:

• gcc 4.4.2 with options -O3 -msse4 -ftree-vectorize


• icc 11.0 with options -O3 -xSSE4.2

• clang/llvm 2.8 with options -O3

Currently, LLVM and clang do not perform auto-SIMDization using SSE. However, there is a project underway to add this functionality [GZA+11]. Hopefully, when it is finished, it will work as well as in gcc. For now, the clang run shows the performance of non-SIMDized code.

Only the gcc line is shown in Figure 3.1 for the vectorized implementations (lines “A” and “B”), because there was no difference between gcc and icc in this case. However, in the course of those experiments it turned out that in the general case gcc sometimes refuses to vectorize the trivial loops in map primitives. It is some compiler sub-optimality, as tweaking little things in the code, like changing the loop iterator variable from an unsigned to a signed type, or using temporary local variables to store intermediates instead of inlining operations, helped gcc to improve auto-vectorization. It was made sure that the gcc code used for the results was SIMDized by the compiler properly, but it required some manual tweaking. icc was indifferent to those changes and was able to auto-vectorize the map primitives in all cases. Those results warn that SIMDization of the simple vectorized primitives also has its pitfalls. However, as this code is compiled statically, it can be inspected to make sure that it works as expected before releasing it. It is harder to make such guarantees for dynamically generated and compiled code.

gcc (line “C”) performs much better than icc (line “D”) in the “loop-compiled” implementation. Looking at the assembly revealed that gcc was able to vectorize the more complex loop in the optimal way described above, while icc chooses in its SIMD generation to align all calculations to the widest datatype (here: 4-byte integers). Hence, the opportunities for 1-byte and 2-byte SIMD operations working on more tuples at once are lost. Together with the need for additional padding and zero-extending of the shorter data types, this makes the icc implementation only slightly faster than the one not using any SIMD instructions.

If the operations are compiled together in an optimal way, the performance gain might be as much as 50% over the best possible vectorized version, and almost a factor of 3 over the implementation with explicit casting and coercions, which requires fewer combinations of statically pre-compiled vectorized primitives. This however depends on the compiler being able to SIMDize the operation properly. Worse yet, such a sub-optimality might be much harder to notice. As the code is generated at run-time, it cannot all be inspected. Sub-optimal behaviour for some combinations of operations might easily go unnoticed.

Until the LLVM auto-SIMDization project is finished and SSE generation works as well in clang as it does in gcc, loop-compilation of expressions working on full vectors of data using clang (line “E”) as the JIT compiler will perform worse than vectorized code using SSE. The vectorized code compiled with clang without SSE (line “F”) works over ten times slower than the gcc one with SSE (line “B”). Care has to be taken as to how well each compiler handles auto-vectorization, but the gcc results for loop compilation are very promising.

3.2.4. Tuple-at-a-time compilation

Much of the performance of map operations has to be attributed to the ability to use SSE SIMD operations on many tuples at a time. This however is possible only thanks to the overall vectorized architecture of the DBMS. An engine like MySQL, whose whole iterator interface is tuple-at-a-time, cannot exploit it, as it has just one tuple to operate on at any moment. We have therefore also measured the possible gains of compiling map operations in a DBMS working tuple-at-a-time, to see what the potential gains from compilation would be in such a system.

The black star and diamond in Figure 3.1 correspond to situations where primitive functions work tuple-at-a-time. The non-compiled strategy is called “interpreted” here. Tuple-at-a-time primitives are conceptually very close to the functions in Algorithm 2 with vector-size n=1, but lack the for-loop. We implemented them separately for fairness, because these for-loops introduce loop overhead. This experiment shows that if one would contemplate introducing expression compilation in an engine like MySQL, the gain in performance for expression calculation could be something like a factor of 3 (23 vs 7 cycles/tuple). The absolute performance is still much worse than what block-wise query processing offers (7 vs 1.2 cycles/tuple), mainly due to missed SIMD opportunities, but also because the virtual method call for every tuple inhibits speculative execution across tuples in the CPU. Even worse, in tuple-at-a-time query evaluation, function primitives in OLAP queries only make up a small percentage (< 5%) of overall time [BZN05], because most effort goes into the tuple-at-a-time operator APIs. When considering expression compilation, the overall gain without changing the tuple-at-a-time nature of a query engine can therefore at most be a few percent, making such an endeavour questionable.
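For illustration, such a tuple-at-a-time compiled primitive could look like the following sketch (our own example with a hypothetical function name, reusing the uint/uchr typedefs of the algorithm listings) – the body of map_compiled from Algorithm 2 without the loop:

/* Sketch: compiled expression for a single tuple; there is no loop to
   amortize the per-call overhead over, and no opportunity for SIMD. */
uint map_compiled_1tuple(uint col1, uchr col2, uchr col3) {
    return col1 * ((100 - col2) * (100 + col3));
}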

3.2.5. Effect of vector length

The general trend of decreasing interpretation overhead with increasing vector-size until around one thousand tuples, and of performance deteriorating due to cache misses once vectors start exceeding the L1 and L2 caches, has been described in detail in [Zuk09, BZN05].

A final observation is that compiled map primitives are less sensitive to cache size, since they use fewer intermediate results, so a hybrid vectorized/compiled engine can use larger vector-sizes, because larger but fewer vectors will still fit in the cache.

3.2.6. Project under Select and compute-all optimization

Map operations in Project were very fast, because they were heavily SIMDized when working on full vectors of data. Suppose now that the calculated expression is put under a selection condition, as in the following query:

SELECT l_extprice*(1-l_discount)*(1+l_tax) FROM lineitem

WHERE ...

We will deal with selections in Sections 3.3 and 3.4 and then provide a synthesis in Section 3.5. For now we need to know that VectorWise handles selections with a vector of indexes of selected tuples that is passed on to the map operations, which in turn perform operations only on the selected elements instead of full vectors.
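A sketch of how a map primitive might consult such a selection vector (our own illustration; Algorithm 2 omits this handling for compactness, and the actual primitive may differ):

/* Sketch: multiply two uint columns, either over the positions listed
   in the selection vector or, if sel is NULL, over the full vector. */
uidx map_mul_uint_col_uint_col(uidx n, uint *res, const uint *a, const uint *b, const uidx *sel) {
    if (sel != NULL) {
        /* indirect accesses through sel; not SIMDizable */
        for (uidx i = 0; i < n; i++) res[sel[i]] = a[sel[i]] * b[sel[i]];
    } else {
        /* contiguous accesses; trivially SIMDizable */
        for (uidx i = 0; i < n; i++) res[i] = a[i] * b[i];
    }
    return n;
}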

Working with a selection vector slows execution down, because

1. less importantly, there are more memory accesses, as the elements of the vectors are not accessed directly, but through indexes stored in the selection vector.

2. more importantly, SSE SIMD instructions cannot be used, as the operations are not performed on contiguous fragments of memory.

Because of that, ignoring the selection vector and performing the computation on all tuples might be faster, and it is safe for most operations (when the operations do not have any unwanted side effects, unlike e.g. divisions, where the possibility of a division-by-zero error has to be taken into account).


[Figure 3.3 plot: total cycles against selection vector fullness (0.000–0.600). Lines: A. Vec., expl. casts (gcc); B. Vec., optimal (gcc); C. Compiled, gcc; D. Compiled, clang; horizontal reference lines show the compute-all variants of A, B and C.]

Figure 3.3: Project under Select. If the number of tuples in the selection vector exceeds a certain threshold, it is more profitable to perform the computation on all tuples. The length of a full vector is 1024.

The threshold selection vector fullness can be estimated with a simple heuristic. If there are V elements in a full vector, N of them are selected, and the operation can be SIMDized to work on k tuples in one instruction, then if N/V > 1/k it should be faster to drop the selection vector for the reason of SSE alone. For example, with V = 1024 and k = 16, dropping the selection vector should pay off once more than 64 tuples are selected. This optimization can be decided at run time for each vector of data independently and is called “compute-all”.

The results in Figure 3.3 show that it is beneficial to perform compute-all on the vectorized implementations (line “A” with explicit casts, and line “B” without) for as few as around 10% selected tuples. An important aspect to consider when using clang for JIT compilation of compound primitives is that, since the current (2.9) version does not support auto-SIMDization, the biggest advantage of compute-all is lost. Because of that, we do not show a “compute-all” line for the clang compiled implementation (line “D”). The vectorized implementation using compute-all will outperform the clang compiled one when the fraction of selected tuples is high enough. The gcc compiled implementation using compute-all (line “C”) shows how the JIT compiled primitives work when compiled with auto-SIMDization.

Even without SSE, the compiled implementation is faster than the vectorized one that would currently be used in VectorWise when the selection vectors are less than half full. It is profitable to use JIT compilation for compound map calculations under a selection vector in such cases.


Algorithm 3 Implementations of conjunctive selection. The vectorized implementation evaluates the three conditions lazily. Each of the predicates produces a selection vector, with which the next predicate is evaluated. The resulting selection vector is overwritten at each step with a new, shorter one. In the end the selection vector contains only positions for which all three conditions evaluated to true. For mid selectivities, branching instructions lead to branch mispredictions. Either the branching or the datadep version of the selection primitives is dynamically selected by VectorWise, depending on the observed selectivity. Branching cannot be avoided in loop-compilation, which combines selection with other operations, without executing these operations eagerly. There are four alternative implementations of the loop body, with different trade-offs between branching and eager calculation.

/* Vectorized implementation. */
const uidx LENGTH = 1024;
// Input
uint col1[LENGTH], col2[LENGTH], col3[LENGTH], v1, v2, v3;
// Resulting selection vector.
uidx res[LENGTH];
uidx n1   = sel_lt_uint_col_uint_val(LENGTH, res, col1, &v1, NULL);
uidx n2   = sel_lt_uint_col_uint_val(n1, res, col2, &v2, res);
uidx retn = sel_lt_uint_col_uint_val(n2, res, col3, &v3, res);

/* Compiled version: four implementations of the above. */
// (C.) all predicates branching ("lazy")
uidx c0001(uidx n, uidx* res, uint* col1, uint* col2, uint* col3, uint* v1, uint* v2, uint* v3) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++)
        if (col1[i] < *v1 && col2[i] < *v2 && col3[i] < *v3)
            res[j++] = i;
    return j; // return number of selected items.
}
// (D.) branching 1,2, non-br. 3
uidx c0002(uidx n, uidx* res, uint* col1, uint* col2, uint* col3, uint* v1, uint* v2, uint* v3) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++)
        if (col1[i] < *v1 && col2[i] < *v2) {
            res[j] = i; j += col3[i] < *v3;
        }
    return j;
}
// (E.) branching 1, non-br. 2,3
uidx c0003(uidx n, uidx* res, uint* col1, uint* col2, uint* col3, uint* v1, uint* v2, uint* v3) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++)
        if (col1[i] < *v1) {
            res[j] = i; j += (col2[i] < *v2) & (col3[i] < *v3);
        }
    return j;
}
// (F.) non-branching 1,2,3 ("compute-all")
uidx c0004(uidx n, uidx* res, uint* col1, uint* col2, uint* col3, uint* v1, uint* v2, uint* v3) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++) {
        res[j] = i;
        j += (col1[i] < *v1) & (col2[i] < *v2) & (col3[i] < *v3);
    }
    return j;
}

3.3. Case study: Conjunctive select

We now turn our attention to a micro-benchmark that tests conjunctive selections:

WHERE col1 < v1 AND col2 < v2 AND col3 < v3


[Figure 3.4 plot: two panels against selectivity (0.0–1.0) – cycles per tuple and total branch mispredictions. Lines: A. Vectorized, branching; B. Vectorized, non-branching; C. Comp., if(1 && 2 && 3){} (lazy); D. Comp., if(1 && 2) { 3 non-br }; E. Comp., if(1) { 2&3 non-br }; F. Comp., 1&2&3 non-br. (compute all).]

Figure 3.4: Conjunction of selection conditions: total cycles and branch mispredictions vs. selectivity.

In this experiment each of the columns col1, col2, col3 is an integer column, and the values v1, v2 and v3 are constants, adjusted to control the selectivity of each condition. The selectivity x on the x-axis in Figure 3.4 is the approximate fraction of input values for which each of the three conditions is true. Therefore, the overall selectivity of the selection is x³. We performed the experiment on vectors of 1K input tuples.

3.3.1. Vectorized execution

With the way VectorWise uses selection vectors to handle selections, it is very easy to evaluate a conjunctive condition in a lazy way. A primitive for the first predicate outputs a selection vector with the positions for which it evaluated to true. Then the second primitive can be called with this selection vector, to evaluate lazily only on those positions which passed the first condition, and so on. As the resulting selection vector gets shorter with each step, one vector can be overwritten in each subsequent operation. No intermediate vectors are needed.

It was already mentioned in Algorithm 1 in Chapter 2 that VectorWise uses two alternative implementations of a selection primitive. The naive “branching” implementation evaluates the condition and saves the position into a selection vector only if the predicate is true. If the selectivity of conditions is neither very low nor very high, CPU branch predictors are unable to correctly guess the branch outcome. This prevents the CPU from filling its pipelines with useful future code and hinders performance. In [Ros02] it was shown that a branch (control-dependency) in the selection code can be transformed into a data dependency for better performance. In that implementation the position is always saved into the resulting selection vector, but the counter of selected positions is incremented by the outcome of the calculated condition – so the counter moves to the next position only if the condition evaluated to true. This approach needs to execute more instructions, but there are no branches, so for medium selectivities the pipeline does not get stalled by branch mispredictions. The VectorWise implementation of conjunctive selections uses a mechanism that chooses either the branching or the non-branching strategy depending on the observed selectivity. It even dynamically re-orders the conjunctive predicates such that the most selective is evaluated first. As such, its performance achieves the minimum of the vectorized branching and non-branching lines in Figure 3.4.
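The two variants can be sketched as follows (our own illustration of the idea from [Ros02]; the actual primitives are the pre-generated ones of Algorithm 1 in Chapter 2):

// branching: a position is written only if the predicate holds
uidx sel_lt_branching(uidx n, uidx *res, const uint *col, uint val) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++)
        if (col[i] < val) res[j++] = i;
    return j;
}
// non-branching: always write, advance the counter by the predicate outcome
uidx sel_lt_datadep(uidx n, uidx *res, const uint *col, uint val) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++) {
        res[j] = i;
        j += (col[i] < val); // data dependency instead of a control dependency
    }
    return j;
}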

3.3.2. Loop-compiled execution

Compiled implementations of conjunctive selection are shown in Algorithm 3. The gist of the problem is that the trick of converting all control dependencies into data dependencies while still avoiding unnecessary evaluation cannot be achieved in a single loop. If one avoids all branches (the “compute-all” approach in Algorithm 3), all conditions always get evaluated eagerly, wasting resources if a prior condition already failed. One can try mixed approaches, branching on the first predicates and using a data dependency on the remaining ones. They perform better in some selectivity ranges, but maintain the basic problems – their worst behavior is when the selectivity is around 50%.

Figure 3.4 shows that JIT compilation of a conjunctive Select is inferior to the pure vectorized approach. The lazy compiled program (line “C”) does slightly outperform vectorized branching (line “A”), but for medium selectivities the non-branching vectorized implementation (line “B”) is by far the best choice. The other variants of the compiled implementation (lines “D”, “E” and “F”), which balance between being lazy but branching and avoiding branches but evaluating the conditions eagerly, are not much better.

Comparing the best vectorized implementation with the best compiled program for a given selectivity, loop compilation was better only when the selectivity of the condition was below 10% or above 90%. Also, having each operation as a separate vectorized primitive makes it possible to dynamically adapt during runtime based on changes in the selectivity of individual conditions. We can switch between the branching and non-branching implementations, or reorder the conditions to always evaluate the one with the lowest selectivity first. If we compile the whole selection into a single function evaluating the whole condition in a single loop iteration, we lose this flexibility.

Therefore, applying JIT compilation to conjunctive selection conditions has limited usefulness.

3.4. Case study: Disjunctive select

The next test case is very similar to the previous one from Section 3.3, but with a disjunction of conditions instead of a conjunction.

SELECT ...

WHERE col1 >= v1 OR col2 >= v2 OR col3 >= v3


[Figure 3.5 plot: two panels against selectivity (0.0–1.0) – cycles per tuple and total branch mispredictions. Lines: A. Vectorized, branching; B. Vectorized, non-branching; C. Vec., br., map+sel_zero; D. Vec., non-br., map+sel_zero; E. Compiled.]

Figure 3.5: Disjunction of selection conditions: total cycles and branch mispredictions vs. selectivity.

3.4.1. Vectorized execution

A disjunction of conditions is more complicated for the vectorized computation than a conjunction. If a disjunction is to be performed lazily, the second condition should be evaluated only on those tuples for which the first was false, and the third only on those where both of the first two conditions were false. This would require two selection vectors: one for the selected and one for the not-selected tuples of the previous conditions. Later, a union of the selected tuples would have to be computed. A simpler solution is to invert the condition using de Morgan's law:

a ∨ b ∨ c = ¬(¬a ∧ ¬b ∧ ¬c)

This way the selection is actually evaluated as a negated conjunction, equivalent to

SELECT ...

WHERE NOT (col1 < v1 AND col2 < v2 AND col3 < v3)

The selectivity in Figure 3.5 is the selectivity of each of the inverted conditions (i.e. the ones that are actually calculated as a conjunction). So when the selectivity in the figure is x, the total selectivity of the disjunction is 1 − x³.


Algorithm 4 Implementations of disjunctive select. The vectorized implementation inverts the disjunction into a conjunction in order to evaluate it lazily. Therefore in the end it has to invert its result. The first approach, used in VectorWise, inverts the final selection vector using a special diff_selected primitive, which is prone to branch mispredictions. An alternative method has been implemented, in which the last predicate is evaluated as a map condition, creating a bitmap of zeroes and ones. Then, a selection vector with the positions of zeroes in the bitmap can be created while avoiding branches. The compiled implementation just straightforwardly checks the disjunction.

const uidx LENGTH = 1024;
uint col1[LENGTH], col2[LENGTH], col3[LENGTH], v1, v2, v3; // Input
uidx tmpsel[LENGTH]; uchr tmpuchr[LENGTH]; // Intermediates
uidx res[LENGTH]; // Resulting selection vector.

/* Special primitive for inverting a selection vector */
uidx diff_selected(uidx n, uidx* src, uidx m, uidx* dst /*, uidx* sel */) {
    uidx i, j;
    for (j = i = 0; i < m; i++, j++)
        while (j < src[i]) *dst++ = j++;
    while (j < n) *dst++ = j++;
    return n - m;
}

/* Vectorized implementation, inverting the selection vector at the end (lines "A" and "B"). */
uidx n1 = sel_lt_sint_col_sint_val(LENGTH, tmpsel, col1, &v1, NULL);
uidx n2 = sel_lt_sint_col_sint_val(n1, tmpsel, col2, &v2, tmpsel);
uidx n3 = sel_lt_sint_col_sint_val(n2, tmpsel, col3, &v3, tmpsel);
uidx retn = diff_selected(LENGTH, tmpsel, n3, res);

/* Vectorized implementation, performing the last selection as a map (lines "C" and "D"). */
n1 = sel_lt_sint_col_sint_val(LENGTH, tmpsel, col1, &v1, NULL);
n2 = sel_lt_sint_col_sint_val(n1, tmpsel, col2, &v2, tmpsel);
memset(tmpuchr, 0, LENGTH * sizeof(uchr)); /* Have to zero */
map_lt_sint_col_sint_val(n2, tmpuchr, col3, &v3, tmpsel);
uidx retn2 = sel_zero_uchr_col(LENGTH, res, tmpuchr, NULL);

/* Compiled implementation (line "E"). */
uidx sel_compiled(uidx n, uidx* res, uint* col1, uint* col2, uint* col3,
                  uint* v1, uint* v2, uint* v3) {
    uidx j = 0;
    // all predicates branching ("lazy")
    for (uidx i = 0; i < n; i++)
        if (col1[i] >= *v1 || col2[i] >= *v2 || col3[i] >= *v3)
            res[j++] = i;
    return j;
}

The result of the conjunction from the previous example is a selection vector. To negate it, a vector of the indexes that were not in the resulting selection vector has to be constructed. The vector has to be inverted.

The way VectorWise does it is shown at the top of Algorithm 4. A special primitive diff_selected is called at the end, which iterates over the selection vector from the conjunction and, for every gap between the indexes in it, outputs the list of indexes in that gap. When the selection vector is nearly full, so the gaps are small, this results in many very short executions of inner loops. This causes bad performance due to branch mispredictions. It can be seen in Figure 3.5 that the vectorized implementation using a data dependency (non-branch) suffers a steep rise in execution cycles when the selectivity of the conditions gets higher (line “B”). This slowdown and rising number of branch mispredictions is the fault of this special diff_selected primitive.


Algorithm 5 All-compiled Selection and Projection.

uidx selmap_compiled(uidx n, uint* res, uidx* res_sel, uint* price, uchr* tax, uchr* discount,
                     uint* col1, uint* v1) {
    uidx j = 0;
    for (uidx i = 0; i < n; i++)
        if (col1[i] < *v1) {
            res[i] = price[i] * ((100 - discount[i]) * (100 + tax[i]));
            res_sel[j++] = i;
        }
    return j;
}

An alternative way is to calculate the very last operation using a map operation instead of a selection, on a previously zero-initialized vector, as is also presented in Algorithm 4. The map operation works on the selection vector resulting from the conjunction of the previous conditions, and it puts “ones” in the map result vector only where the last condition is also true. Then zeroes can be selected from the map result, which can be done without branching with a standard sel_zero primitive. This implementation does not suffer from branch mispredictions at high selectivities, and its data dependency variant is the best on a very wide range of selectivities and only slightly outperformed on others (lines “C” and “D”).

As previously, VectorWise chooses either the branch or non-branch strategy depending on the observed selectivity. Therefore it can achieve the minimum of the branching and non-branching lines in Figure 3.5.

3.4.2. Loop-compiled execution

The compiled implementation of a disjunctive Select is straightforward, evaluating the conditions in a branching if. The performance of the compiled implementation does not differ much from that in the experiment in Section 3.3. Since a disjunction is harder for the vectorized implementations, JIT compilation manages to compete better with them. Nevertheless, the range of selectivities at which it is better than a vectorized implementation is only a little wider than that for a conjunctive selection. The conclusion from Section 3.3 and this one is that in most cases it is not profitable to introduce JIT compilation of selections in VectorWise.

3.5. Synthesis: Selection and Projection

We discussed Projections in Section 3.2 and Selections in Sections 3.3 and 3.4. We can now demonstrate a synthesis of both cases, analyzing the following query:

SELECT l_extprice*(1-l_discount)*(1+l_tax)

FROM lineitem

WHERE col1 < val1;

The all-compiled generated code shown in Algorithm 5 has all of the single-loop compilation problems described in previous sections: evaluation of the condition causes branch mispredictions, and the conditional computation of the projection prevents it from being SIMDized.

In Figure 3.6 we show how applying the various optimizations described above makes the vectorized implementation superior to the all-compiled implementation, shown as line “A”. Not surprisingly, a plain implementation using branching selection and computing the projection with a selection vector (line “B”) is the slowest.


[Figure 3.6 plot: total cycles (y-axis) against selectivity 0.00–1.00 (x-axis); lines: A. All compiled; B. Vectorized, branching sel; C. Vec., + best sel; D. Vec., + compute-all; E. Mixed, + compiled map]

Figure 3.6: Selection and projection. The all-compiled implementation is compared to vectorized implementations with the different optimizations described in previous sections.

However, choosing between branching and non-branching selection (line “C”), as described in Section 3.3.1, already makes the vectorized implementation competitive with the all-compiled one. Adding the compute-all optimization from Section 3.2.6 (line “D”) makes the vectorized implementation better than the all-compiled one for almost all selectivities. The final optimization (line “E”) is to compile the Projection into a single loop. With an auto-SIMDizing compiler, that loop gets optimized using SSE and the compute-all optimization can be applied to it; Section 3.2.6 has shown this to be the fastest Projection implementation. A sketch of such a compiled map primitive follows.
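As an illustration only, the JIT-generated map primitive of variant “E” might look roughly as follows; the name and exact signature are assumptions, and the key property is the single branch-free compute-all loop that a compiler can auto-SIMDize:

typedef unsigned int  uidx;
typedef unsigned int  uint;
typedef unsigned char uchr;

/* Compute the projection for all n positions, ignoring the selection vector
   (compute-all): the loop has no branches and no gathers, so it SIMDizes. */
void map_project_compiled(uidx n, uint* res, uint* price,
                          uchr* discount, uchr* tax) {
    for (uidx i = 0; i < n; i++)
        res[i] = price[i] * ((100 - discount[i]) * (100 + tax[i]));
}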

This example demonstrated a case where single-loop compilation is beneficial, but compiling too much can hurt performance rather than improve it, so partial compilation is the best answer.

3.6. Case Study: Hash Join

Our last micro-benchmark is related to Hash Joins. We use the following SQL query:

SELECT build.col1, build.col2, build.col3

FROM probe, build

WHERE probe.key1 = build.key1 AND probe.key2 = build.key2

We focus on an equi-join condition involving keys consisting of two integer columns, because such composite keys are more challenging for vectorized execution. This discussion assumes simple bucket-chaining such as used in VectorWise, presented in Figure 3.7.


key1 key2 val1 val2 val3

V (DSM)

H B next

hash value computation

hashvalues

bucketheads

3

2

3

0

2

3

4

0

1

2

3

0

x

103

102

203

x

1002

2003

1000

x

73

736

124

x

a

v

w

x

x

Apr

Oct

Jan100

1003 46 May0

1

2

3

4

x

0

0

1

0

Figure 3.7: Bucket-chain hash table as used in VectorWise. The value space V presented in the figure is in DSM format, with a separate array for each attribute. It can also be implemented in NSM, with data stored tuple-wise. From [Zuk09].

This means that keys are hashed into buckets in a bucket array B, whose size N is a power of two. A bucket contains the offset of a tuple in a value space V. This space can be organized using either the DSM or the NSM layout; VectorWise supports both [ZNB08]. It contains the values of the build relation, as well as a next-offset, which implements the linked list. The value space is therefore either a collection of arrays, one for each column of the build relation plus the next-offset (DSM), or one big array of tuples (NSM). A bucket may have a chain of length greater than one either due to hash collisions, or because there are multiple tuples in the build relation with the same keys.

The bottleneck in hash table operations is the random accesses to huge arrays in memory. Those accesses not only cause cache misses but also, which is often overlooked, can cause a mayhem of TLB misses, making performance an order of magnitude slower than in the experiments from previous sections.

We measured the performance of both the building and the probing phase of a hash join 0-1 (with at most one tuple in the build relation matching a tuple in the probe relation). In three experiments, we vary the hash table value space V size (the build relation size), the selectivity (fraction of probe keys that match something), and the bucket array B size (which influences the number of collisions and therefore the length of bucket chains). The default values are 16 M tuples in the value space, a number of buckets equal to the value space size, and a match rate of 100% – in each experiment we vary one of the values while the other two remain fixed. The size of the value space was changed from 4 K to 64 M in the first experiment, the match rate from 0% to 100% in the second, and the number of buckets from 2 times the value space size down to 1/32 of the value space size in the third.

Figure 3.8 shows the building phase of the hash join. Figure 3.9 shows the probing phase. The left graph shows that when the hash table size grows, performance deteriorates; it is well understood that cache and TLB misses are to blame, and DSM achieves less locality than NSM. The middle graph shows that with increasing match rate the cost goes up, which is mainly caused by more tuples needing to be fetched. The compiled NSM fetch performs best, as will be explained in Section 3.6.5. The right graph shows what happens with increasing chain length. The fully compiled variant suffers most, as it gets no parallel memory access, as will be explained in Section 3.6.4.


[Figure 3.8 plots: cycles per tuple; left panel varying hash table size (4K–64M tuples), lines: Vectorized DSM, Vectorized NSM, Compiled DSM, Compiled NSM; right panel varying number of buckets relative to a 16M-tuple hash table (2x–1/32x); phase-split bars for Hashing, Filling buckets, Putting values; left bars DSM, right bars NSM]

Figure 3.8: Hash table building. Left: different sizes of the hash table value space V. Right: different sizes of the bucket array B w.r.t. value space V (different chain lengths). Fixed parameters are: V size 16M tuples (448 MB), selectivity 100%, B array size equal to V size.

3.6.1. Vectorized implementation

Algorithm 7 contains high-level pseudocode of hash table building and probing. Additionally, Algorithm 6 shows (simplified) C code for the primitives that are used in Algorithm 7.

Building starts with vectorized computation of the hash number from the key in a column-by-column fashion using map-primitives. A map_hash primitive first hashes each key of type T onto a long integer (lng). If the key is composite, we iteratively refine the hash value using a map_rehash primitive, passing in the hash values previously computed and the next key column.


[Figure 3.9 plots: cycles per tuple; left panel varying hash table size (4K–64M tuples), middle panel varying selectivity of matching (0.0–1.0), right panel varying number of buckets relative to a 16M-tuple hash table (2x–1/32x); lines: Vectorized DSM, Vectorized NSM, Vectorized NSM 2-pass fetch, Compiled DSM, Compiled NSM, Mixed (NSM, partly compiled); phase-split bars for Hashing, Initial lookup, Checking, Next lookups, Fetch; left bars DSM, middle NSM, right Mixed (NSM, partly compiled)]

Figure 3.9: Hash table probing. Left: different sizes of the hash table value space V. Middle: different match rates of the probe. Right: different sizes of the bucket array B w.r.t. value space V (different chain lengths). For the partially compiled implementation, hashing and lookup initial are together as lookup initial, and checking with lookup next and fetching are together as checking. Fixed parameters are: V size 16M tuples (448 MB), selectivity 100%, B array size equal to V size.

A bitwise-and map-primitive is used to compute a bucket number from the hash values: H&(N−1); since N is a power of two, this is equivalent to H mod N.

Sequential free slots at the back of the value space V are allocated for the inserted tuples. A tuple is made the new head of the chain in the computed bucket of B, and the previous head is saved in the “next” field of V. This is done by a special ht_insert primitive.

Probing starts with vectorized computation of a hash number and bucket number exactly like in building. To read the positions of the heads of chains for the calculated buckets we use a special primitive ht_lookup_initial. It behaves like a selection primitive, creating a selection vector ~Match of positions in the bucket positions vector ~H for which a match was found in the bucket array B. Also, it fills the ~Pos vector with the positions of the candidate matching tuples in the hash table.


Algorithm 6 Primitives used by the vectorized implementation of hash table operations. The exact format of the primitive parameters may vary, depending on whether the value space is in DSM or NSM storage.

// Primitive for hashing, first column
map_hash_T_col(uidx n, ulng* res, T* col1) {
    for (uidx i=0; i<n; i++)
        res[i] = HASH(col1[i]);
}

// Primitive for hashing, subsequent columns
map_rehash_ulng_col_T_col(uidx n, ulng* res, ulng* col1, T* col2) {
    for (uidx i=0; i<n; i++)
        res[i] = col1[i] ^ HASH(col2[i]);
}

// Primitive for fetching values.
map_fetch_uidx_col_T_col(uidx n, T* res, uidx* col1, T* base, uidx* sel) {
    if (sel) {
        for (uidx i=0; i<n; i++)
            res[sel[i]] = base[col1[sel[i]]];
    } else { /* sel == NULL, omitted */ }
}

// Primitive for fetching and checking a key, first column.
map_check_T_col_idx_col_col_T_col(uidx n, chr* res, T* keys, T* base, uidx* pos, uidx* sel) {
    if (sel) {
        for (uidx i=0; i<n; i++)
            res[sel[i]] = (keys[sel[i]] != base[pos[sel[i]]]);
    } else { /* sel == NULL, omitted */ }
}

// Primitive for fetching and checking a key, subsequent columns.
map_recheck_chr_col_T_col_idx_col_T_col(uidx n, chr* res, chr* col1,
                                        T* keys, T* base, uidx* pos, uidx* sel) {
    if (sel) {
        for (uidx i=0; i<n; i++)
            res[sel[i]] = col1[sel[i]] || (keys[sel[i]] != base[pos[sel[i]]]);
    } else { /* sel == NULL, omitted */ }
}

// Primitive for inserting into the hash table.
ht_insert(uidx n, uidx* H, uidx* B, ValueSpace* V) {
    for (uidx i=0; i<n; i++) {
        V->next[V->first_empty] = B[H[i]]; /* moving old chain head to “next” */
        B[H[i]] = V->first_empty;          /* saving new chain head */
        V->first_empty++;
    }
}

// Primitive for looking up the head of a bucket chain.
ht_lookup_initial(uidx n, uidx* pos, uidx* match, uidx* H, uidx* B) {
    for (uidx i=0, k=0; i<n; i++) {
        pos[i] = B[H[i]];                 /* saving found chain head position in V */
        match[k] = i; k += (pos[i] != 0); /* saving to a sel. vector if non-zero */
    }
}

// Primitive for advancing to the next element in a bucket chain.
ht_lookup_next(uidx n, uidx* pos, uidx* match, ValueSpace* V) {
    for (uidx i=0, k=0; i<n; i++) {
        pos[match[i]] = V->next[pos[match[i]]];         /* advancing to next in chain */
        match[k] = match[i]; k += (pos[match[i]] != 0); /* to a sel. vec. if non-zero */
    }
}


Algorithm 7 Pseudocode for the operator-level logic of hash table operations. Primitives from Algorithm 6 are used.

procedure HTbuild(V, B[0..N-1], ~K1..k, ~V1..v)
    ~H ← map_hash(~K1)                                    ▷ Hashing
    for each remaining key vector ~Ki do
        ~H ← map_rehash(~H, ~Ki)
    ~H ← map_bitwise_and(~H, N-1)
    first_put_pos ← V.first_empty                         ▷ Save position where to start putting
    ht_insert(~H, B, V)                                   ▷ Filling buckets with new chain heads
    for each ~x in ~Ki and ~Vi do                         ▷ Putting
        put ~x into V starting at first_put_pos

procedure HTprobe(V, B[0..N-1], ~K1..k (in), ~R1..v (out), ~Hits (out))
    ~H ← map_hash(~K1)                                    ▷ Hashing
    for each remaining key vector ~Ki do
        ~H ← map_rehash(~H, ~Ki)
    ~H ← map_bitwise_and(~H, N-1)
    ~Pos, ~Match ← ht_lookup_initial(~H, B)               ▷ Initial Lookup
    while ~Match not empty do
        ~Check ← map_check(~K1, V_key1, ~Pos, ~Match)     ▷ Checking
        for each remaining key vector ~Ki do
            ~Check ← map_recheck(~Check, ~Ki, V_keyi, ~Pos, ~Match)
        ~Match ← sel_nonzero(~Check, ~Match)
        ~Pos, ~Match ← ht_lookup_next(~Pos, ~Match, V_next)  ▷ Chain Following
    ~Hits ← sel_nonzero(~Pos)
    for each result vector ~Ri do                         ▷ Result Fetching
        ~Ri ← map_fetch(~Pos, V_valuei, ~Hits)

If the value (offset) in the bucket is 0, there is no key in the hash table – these tuples store 0 in ~Pos and are not part of ~Match.

Having identified the positions of possible matching tuples, the next task is to “check” if the key values actually match. This is implemented using a specialized map primitive that combines fetching a value from the given offset of the value space V with testing for non-equality: map_check. Similar to hashing, composite keys are supported using a map_recheck primitive which gets the boolean output of the previous check as an extra parameter. The resulting booleans mark positions for which the check failed. These positions can then be selected using a select primitive, overwriting the selection vector ~Match with the positions for which probing should advance to the “next” position in the chain. Moving on in the linked lists is done by a special primitive ht_lookup_next, which for each probe tuple in ~Match fills ~Pos with the next position in the bucket-chain of V. It also guards the ends of chains by saving in ~Match only the subset of positions for which the resulting position in ~Pos was non-zero.

The loop finishes when the ~Match selection vector becomes empty, either because of reaching the end of a chain (element in ~Pos equals 0, a miss) or because checking succeeded (element in ~Pos pointing to a position in V, a match).

Matches can be found by selecting the elements of ~Pos which ultimately produced a hit with a sel_nonzero primitive. ~Pos with selection vector ~Hits becomes a pivot vector for fetching. This pivot is subsequently used to fetch (non-key) result values from the build value area into the hash join result vectors, using one fetch primitive for each result column. A basic fetching primitive is shown in Algorithm 6, but an in-depth description of fetching will be provided in Section 3.6.5.

The primitives map_recheck, ht_lookup_initial and ht_lookup_next use branching instructions that could be avoided with a data dependency. However, their performance is dominated by the TLB misses they cause, not by branch mispredictions.


Algorithm 8 Fully loop-compiled hash operations, working just like the vectorized implementation would for vectors of size 1.

HTbuild_comp(uidx n, ValueSpace* V, uidx* B, T* key1, T* key2, T* val1, T* val2, T* val3) {
    for (uidx i = 0; i < n; i++) {
        uidx pos = V->first_empty, hash = (HASH(key1[i])^HASH(key2[i]))&(N-1);
        V->next[pos] = B[hash]; B[hash] = pos;
        V->key1[pos] = key1[i]; V->key2[pos] = key2[i];
        V->col1[pos] = val1[i]; V->col2[pos] = val2[i]; V->col3[pos] = val3[i];
        V->first_empty++;
    }
}

HTprobe_comp(uidx n, ValueSpace* V, uidx* B, T* key1, T* key2, T* res1, T* res2, T* res3) {
    for (uidx i = 0; i < n; i++) {
        uidx pos, hash = HASH(key1[i])^HASH(key2[i]);
        if ((pos = B[hash&(N-1)])) do {
            if (key1[i] == V->key1[pos] && key2[i] == V->key2[pos]) {
                res1[i] = V->col1[pos]; res2[i] = V->col2[pos];
                res3[i] = V->col3[pos]; break; // found match
            }
        } while ((pos = V->next[pos])); // next in chain
    }
}

It was tested that changing those branches into a data dependency made no difference to the results.

Figures 3.8 and 3.9 feature bar graphs splitting the execution time of the vectorized implementation into the phases annotated on the right margin of Algorithm 7. Such fine-grained profiling is possible because those steps are performed for whole vectors of data. In the compiled implementations, the overhead of starting and stopping the counters around operations on single tuples would be too big to get any meaningful measurements.

3.6.2. Loop-compiled implementation

It is possible to create a loop-compiled version of the whole join, as proposed in e.g. HIQUE [KVC10]. Both the build and the probe phase work much like the vectorized implementation would for vectors of size 1. Algorithm 8 shows the code of such an implementation.

It can be seen from Figure 3.9 that overall the compiled implementation using an NSM hash table is faster than the vectorized ones. However, as will be discussed in the following sections, it has its shortcomings.

3.6.3. Build phase

Figure 3.8 shows that the different implementations of hash table building have similar performance. All implementations of the build phase have sequential access patterns to the value space V. The only difference between the vectorized or compiled implementation and the DSM or NSM layout of the value space V is either fully sequential access to one array, or a series of sequential accesses with a stride, or many interleaved sequential cursors. All these methods nevertheless operate with good locality and linearity on a small portion of the value space V, which should become cache resident and not cause cache or TLB misses.

The phase-split part of Figure 3.8 shows that the dominant part of building the hash table is filling the bucket array B. This is because it involves accesses to random positions in the array, therefore causing TLB misses.


[Figure 3.10 plot: cycles per access (left y-axis) and speedup (right y-axis, 1–5) against table size in bytes (4K–4G); lines: No dependency, Dependent, Speedup]

Figure 3.10: Dependent and independent memory accesses

3.6.4. Probe lookup: bucket chain traversal

The right graph of Figure 3.9 shows an experiment in which we decreased the number of buckets (the size of array B), which results in more hash collisions, and therefore longer chains that have to be followed during lookup. The results show that the performance of the compiled implementations degrades dramatically when the number of buckets decreases (and so the collision lists become longer). This leads to the conclusion that the bucket chain traversal phase in the compiled implementation must be slower than in the vectorized one. To explain it, we investigate how memory accesses work on modern processors.

Because memory latency is not improving much (∼50 ns), and cache line granularities must remain limited (64 bytes), memory bandwidth on modern hardware is no longer simply the quotient of these two (64 bytes per 50 ns would be only 1.28 GB/s). A single Nehalem core can achieve a factor 10 more than this 1.28 GB/s thanks to automatic CPU prefetching on sequential access. Performance thus crucially depends on having multiple outstanding memory requests at all times. For random access, this is hard to achieve, but the deeply pipelined out-of-order nature of modern CPUs does help. That is, if a load stalls, the CPU might be able to speculate ahead into upstream instructions and reach more loads. Speculation-friendly code is therefore more effective than manual prefetching, which tends to give only minor improvement, and is hard to tune and maintain for multiple platforms. Success is not guaranteed, since the instruction speculation window of a CPU is limited, depends on branch prediction, and only independent upstream instructions can be taken into execution.

A small experiment demonstrates the difference. Two code snippets are shown in Algorithm 9. In the first case, the memory accesses are independent. In the second one, the previous memory access has to be completed to get the new value of k before the next one can be issued. The input is prepared in such a way that t[in[k]] = k+1, so that always k = i, and both procedures perform the same sequence of reads from a big array in memory – t[in[0]], t[in[1]], t[in[2]], ....

Despite that, Figure 3.10 shows that the procedure with independent accesses works up to 5 times faster, because the processor can dispatch memory accesses and handle the misses they trigger in parallel, while the dependent procedure can perform and complete only one memory access at a time.


Algorithm 9 Dependent and independent memory accesses. The accessed memory positions are exactly the same, but in the first example they are independent, while in the second one the previous access has to be completed before proceeding to the next one.

// in: random positions in t.
void nodep(int n, int* r, int* in, int* t) {
    for (int i = 0; i < n; i++)
        r[i] = t[in[i]];
}

void dep(int n, int* r, int* in, int* t) {
    // (1.) t[in[k]] == k+1
    for (int i = 0, k = 0; i < n; i++)
        // (2.) k == i always, because of (1.)
        k = r[i] = t[in[k]];
}

The crux here is that the vectorized primitive trivially achieves whatever maximum number of outstanding loads the CPU supports, as it is a tight loop of independent loads. The fully compiled hash probe from Algorithm 8, however, can run aground while following the bucket chain. Its performance is only good if the CPU can speculate ahead across multiple probe tuples (execute concurrently instructions from multiple for-loop iterations for different i). That depends on the branch predictor predicting the while(pos) to be false, which will happen for join key distributions where there are no collisions and the chains stay short. If however the build relation has long chains of tuples in one bucket, the CPU will stall with a single outstanding memory request (V[pos]), because the branch predictor will make it stay in the while-loop and it will be unable to proceed, as the value of pos = V.next[pos] is unknown until the resolution of the current cache/TLB miss. This causes fully compiled hashing to suffer a dramatic slowdown when the bucket chains are long, as shown by lines “D” and “E” in Figure 3.9.

3.6.5. Probe fetching

The phase of hash table probing in which we fetch the values of matched tuples into output vectors deserves a separate discussion. The basic vectorized fetching implementation was shown in Algorithm 6. A vectorized fetching operation takes a pivot vector ~Pos of indexes of the matched tuples in the hash table value space V. The vectorized operation fetches values one column at a time. Since ~Pos contains random positions in a huge data array, each access is expected to cause a TLB miss.

If V is stored in DSM, then the different columns are stored in separate arrays and nothing can be done – fetching the pivot positions from the different columns again triggers TLB misses for those new memory regions. However, if V is in NSM, the values from different columns are adjacent to each other. Yet, after a primitive finishes fetching a whole vector of values for the first column and proceeds to the next one, the tuples in V at positions from the beginning of the pivot vector would already be evicted from the cache and TLB. Figure 3.12 demonstrates this. The TLB size on Nehalem processors has been significantly increased compared to older architectures, with 512 entries on the second level of the TLB. As soon as the pivot vector gets longer than this, performance drops dramatically and an increased number of TLB misses is observed.

Obviously, a compiled solution, in which a primitive is generated for a given combination of attributes, fetching them all at once from an NSM tuple directly into a number of output vectors, will be superior.


[Figure 3.11 plots: cycles per tuple and TLB misses per tuple against the number of fetched columns (1–10); lines: A. Vectorized DSM; B. Vectorized NSM; C. Vectorized NSM, inner loop / dynamic width memcpy; D. Vectorized NSM, inner loop / fixed width memcpy; E. Vectorized NSM, two pass; F. Compiled NSM; G. Compiled DSM]

Figure 3.11: Fetching from 1 to 10 columns of data from a hash table value space using different methods: cycles per tuple and total TLB misses.

The reason for using JIT is that the number of possible combinations of column types and column counts is too high, so specialized primitives could not be statically pre-generated for all of them.

There are however other ways in which the memory access pattern of a vectorized primitive can be improved without using compilation. Algorithm 10 shows different vectorized methods of fetching data from a hash table. The first two are the already discussed simple fetching from a DSM (line “A”) and an NSM (line “B”) value space. The third method tries to achieve the flexibility of a compiled solution in an interpreted way. The “inner loop” fetch primitive takes the description of the columns and their offsets in an NSM space as parameters, and uses a nested inner loop to copy out all columns of a tuple at once into the given output vectors. All types of data can be copied out using the memcpy library call, knowing only the datatype width. When the widths of datatypes are given as runtime parameters to memcpy, such a call constitutes an additional cost.


[Figure 3.12 plot: cycles per fetched tuple (left y-axis) and TLB misses per tuple (right y-axis) against vector size (16–32K)]

Figure 3.12: Fetching from an NSM value space column by column with different vector lengths.

However, if the width of the memcpy is known at compile time, the compiler is able to inline and optimize it, often into a simple memory-register-memory move.

If it is safe to write a few bytes of memory past the end of a vector (e.g. by overallocating), a width greater than the actual width of the data type can safely be used. We are able to statically pre-generate primitives that always use memcpy with a compile-time constant value of WIDTH, with the widths parameter used only for calculating offsets. The static primitive with WIDTH equal to the width of the widest attribute in the tuple may then be used, as the thinner attributes can safely be copied with a few bytes of redundant padding. It can be seen in Figure 3.11 that while the implementation with a dynamic memcpy width (line “C”) does not give much improvement compared to the basic vectorized fetching, the fixed-width memcpy (line “D”) performs significantly better. The cost is sometimes copying more than strictly necessary, and the need for a few bytes of additional space overallocated after the end of the vectors.

The “inner loop” version introduced more complicated code, and inner-loop and interpretation overhead. The last, “two pass”, implementation (line “E”) uses a different approach. It is more in the “vectorized spirit”: use intermediate results that allow code with simple tight loops. The method works in two passes: the first copies whole tuples from an NSM value space into an intermediate contiguous NSM space, and the second converts the local NSM buffer into DSM output, avoiding TLB and cache misses thanks to locality. As the results in Figure 3.11 show, even though additional memory traffic is needed, the simpler construction of this implementation makes it perform better than “inner loop”.

The compiled implementation is still the fastest one, but when the optimized vectorized implementations are taken into account, the difference is not so big. While the compiled NSM implementation (line “F”) is around 5 times faster than the basic vectorized one used in VectorWise when fetching 10 columns of data, the speedup over the best vectorized “two pass” approach is only 28%. A vectorized implementation of hash join probing using the “two pass” approach has also been included in Figure 3.9, and it has also been used for the NSM fetching phase in the phase-split bar graph.


Algorithm 10 Simplified code of different vectorized implementations of fetching values from a hash table value space V. The stride parameter is the width of the tuple in an NSM value space. By using a stride and an offset we can access individual attributes in the tuples.

// 1. Simple fetch from a DSM value space.
map_fetchDSMtoDSM_uidx_col_T_val(uidx n, T* res, uidx* pivot, T* base) {
    for (uidx i=0; i<n; i++)
        res[i] = base[pivot[i]];
}

// 2. Simple fetch from an NSM value space.
map_fetchNSMtoDSM_uidx_col_T_val_uidx_val(uidx n, T* res, uidx* pivot, T* base, uidx stride) {
    for (uidx i=0; i<n; i++)
        res[i] = base[stride*pivot[i]];
}

// 3. Fetching all columns at once from NSM in an inner loop, two variants.
map_fetchINNERLOOP(uidx n, uidx ncols, uidx stride, uidx* widths, void** res,
                   uidx* pivot, void* base) {
    for (uidx i=0; i<n; i++) {
        void* source = base + pivot[i]*stride;
        for (uidx k = 0; k < ncols; k++) {
            // 3a. dynamic width memcpy:
            memcpy(res[k] + widths[k]*i, source, widths[k]);
            // OR
            // 3b. compile-time fixed width memcpy:
            memcpy(res[k] + widths[k]*i, source, WIDTH);
            source += widths[k]; // next offset
        }
    }
}

// 4. Two pass fetching.
map_fetchNSMtoNSM(uidx n, void* res, uidx* pivot, void* base, uidx stride) {
    for (uidx i=0; i<n; i++)
        memcpy(res + i*stride, base + pivot[i]*stride, stride);
}
map_fetchDIRECT_T_val_uidx_val(uidx n, T* res, T* base, uidx stride) {
    for (uidx i=0; i<n; i++)
        res[i] = base[stride*i];
}
fetchTWOPASS(uidx n, uidx ncols, uidx* widths, void** res, uidx* pivot, void* base) {
    /* First pass: copy the pivot tuples into an intermediate contiguous NSM
       space tmp (a pre-allocated buffer). */
    map_fetchNSMtoNSM(n, tmp, pivot, base, SUM(widths));
    /* Second pass: copy the columns from the temporary space into the output vectors. */
    for (uidx i = 0; i < ncols; i++) {
        // a map_fetchDIRECT primitive of the correct type would be used.
        map_fetchDIRECT_T_val_uidx_val(n, res[i], tmp, SUM(widths));
        tmp += widths[i]; // next offset
    }
}

To get the full picture, a compiled DSM line (“G”) has also been included in Figure 3.11, showing that it performs even worse than all the vectorized ones. This is to be expected, because by compiling fetching for a DSM value space we do not gain anything: with a DSM value space, a vectorized primitive accesses random locations in an array for just one attribute, while a compiled function accesses values from different attributes at once, so the random access pattern is scattered over an even bigger range of memory.


[Figure 3.13 plots: cycles per tuple and TLB misses per tuple against vector length (16–1K); lines: Vect., Comp. and Mixed, each for the 1M-tuple and the 128M-tuple hash table]

Figure 3.13: Probing of two NSM hash tables: a smaller one with 1 M tuples (24 MB) and a bigger one with 128 M tuples (3 GB), using different lengths of vectors.

3.6.6. Reduced vector length

Section 3.6.5 has shown that the vectorized implementation suffers from TLB thrashing as soon as the vector lengths exceed the size of the TLB. We check if reducing the vector length can improve the overall performance of probing, just as it improved the performance of the fetch phase.

Figure 3.13 shows the results of an experiment in which two hash tables, a small and a large one, were probed using varying vector lengths. The parameters were like in the other experiments: 100% match rate and the number of buckets equal to the number of tuples. For that combination of parameters the loop-compiled implementation had the biggest advantage over the vectorized one. The length of the vectors does not affect the compiled implementation much, as it is a monolithic function, looping over the vector only once while processing it.

On the smaller hash table, the compiled implementation causes on average a little over 2 TLB misses per tuple. This corresponds to one miss while accessing the bucket array B, another one while accessing the tuple, and a fraction corresponding to the odd case of a hash collision and having to advance to the next tuple in the linked list. For small vector lengths, the vectorized implementation performs similarly. For vectors of length 256, when there is still no TLB thrashing, it reaches its maximal performance, which slightly beats the loop-compiled implementation. However, as the vectors get longer, the number of TLB misses gets higher, reaching an average of over 6 misses per tuple. These 4 additional misses correspond to a TLB miss for the vectorized operation on every further column of the hash table: checking of the second key column, and fetching of the three value columns. Each of these operations accesses the same elements anew, but with long vectors they have already been evicted from the TLB.


Algorithm 11 Partially compiled hash probing. ValueSpace V is in NSM.

comp_lookup_initial(uidx n, uidx* pos, uidx* match, T* key1, T* key2, uidx* B, uidx N) {
    uidx k = 0, i;
    for (i = 0; i < n; i++) {
        uidx bucket = (HASH(key1[i])^HASH(key2[i]))&(N-1);    /* find bucket */
        pos[i] = B[bucket]; match[k] = i; k += (pos[i] != 0); /* as in ht_lookup_initial */
    }
    return k;
}

comp_check_fetch_next(uidx n, ValueSpace* V, uidx* pos, uidx* match,
                      T* key1, T* key2, T* res1, T* res2, T* res3) {
    uidx k = 0, i;
    for (i = 0; i < n; i++) {
        if (key1[match[i]] != V[pos[match[i]]].key1 | key2[match[i]] != V[pos[match[i]]].key2) {
            /* as in ht_lookup_next */
            pos[match[i]] = V[pos[match[i]]].next;
            match[k] = match[i];
            k += (pos[match[i]] != 0);  /* k <= i, does not overwrite ahead */
        } else {
            /* as in fetch */
            res1[match[i]] = V[pos[match[i]]].val1; res2[match[i]] = V[pos[match[i]]].val2;
            res3[match[i]] = V[pos[match[i]]].val3;
        }
    }
    return k;
}

procedure HTprobe(V, B[0..N-1], ~K1..k (in), ~R1..v (out), ~Hits (out))
    ~Pos, ~Match ← comp_hash_rehash_bitand(~K1..k, B)  ▷ Hashing and initial lookup
    while ~Match not empty do                          ▷ Checking and fetching, or chain following, combined
        ~Pos, ~Match, ~R1..v ← comp_check_fetch_next(V, ~Pos, ~Match)
    ~Hits ← sel_nonzero(~Pos)                          ▷ Fetching already done, just save the matched positions

The performance in Figure 3.13 for vectors of length 1024 corresponds to that in Figure 3.9, since these are the same parameters.

Probing the bigger hash table is slower and all algorithms make more TLB misses. Since the hash table is much bigger than 2 MB, a TLB miss occurs not only on the pages themselves, but also on the directory pages holding the translations, as was explained in Section 2.3.6. When accessing a position in the bucket array B or the value space V, two TLB misses occur: one for the virtual memory page, and one for the directory page. That is why there are on average a little over 4 TLB misses per tuple instead of 2.

This scenario shows that a purely vectorized approach can be competitive with the single-loop compiled one when the vector lengths are reduced to prevent TLB thrashing. Both of them are however always surpassed by a hybrid, partially compiled implementation, presented in Section 3.6.7.

3.6.7. Partial Compilation

We now propose the hybrid, partially compiled implementation shown in Algorithm 11, which combines the advantages of the vectorized and compiled approaches. As shown in Section 3.6.4, the bucket chain following part of the lookup is best left vectorized. In the other parts, there are opportunities to apply compilation.

The first idea is to compile the full sequence of hash/rehash/bitwise-and and the initial lookup into a single primitive. The hashing/rehashing/bitwise-and part reiterates the compilation benefits of Project primitives as discussed in Section 3.2. The initial lookup requires a random access to the bucket array B. Interleaving it with more computations helps to hide the latency of this memory access, especially when the selection percentage is high.


The second step combines all operations that can be done on a single tuple in the hash table after it is accessed.

When the hash table is in NSM, this means at most one TLB miss, compared to one miss per attribute in the vectorized implementation. Therefore, the check and the iterative re-checks are combined. Then, if the check was successful, fetching the values into the result vectors is performed at once. If not, the position of the next tuple in the bucket chain is saved in the ~Pos vector.

The only operation left to be done at vector granularity is to advance to the next tuple for all entries for which neither the check succeeded (with the values already fetched) nor the chain ended. After the loop, only the hits have to be selected into the ~Hits vector. Code for the primitives and pseudocode of the operator-level logic are shown in Algorithm 11. Since the hashing and initial lookup phases, as well as the checking, fetching and next-lookup phases, are joined into one and could not be measured at a finer granularity, they are shown in Figure 3.9 merged as lookup initial and checking. The split-time bar graphs of Figure 3.9 confirm that the phases we compile perform better than their vectorized counterparts.

Partially compiled NSM (line “F”) is the best solution, thanks to its efficient compiled multi-column checking and fetching, compiled hashing, and its vectorized independent memory accesses during chain-following.

3.7. Software engineering aspects of Vectorization and Compilation

So far in this chapter we discussed the performance aspects of query execution routines. It was shown that combining the two techniques of vectorization and JIT compilation allows choosing the best-performing variant. This section discusses other aspects, connected with general system complexity and maintenance.

Figure 5.1 in [Zuk09] summarizes the comparison of a tuple-at-a-time interpreted system, the column-at-a-time MonetDB, and the vector-at-a-time MonetDB/X100, which became VectorWise. Inspired by it, we present Table 3.1, similarly comparing a vectorized system, a single-loop compiled system, and the multi-loop compiled hybrid solution we propose. The first part of the table summarizes the performance aspects discussed in previous sections, while this section discusses the other aspects.

3.7.1. System complexity

Having both vectorization and JIT code generation in a database system obviously requires increased system complexity. It seems however that implementing elements of one into a system based on the other can be done relatively easily, so that the complexity of implementing such a combined system is definitely less than the “sum” of the complexities, as suggested with a question mark in Table 3.1.

Vectorization in a Compiled system

In a single-loop compiled system, introducing vectorization boils down to adding additional materialization points in the execution algorithms. This is in opposition to what the Hyper system tries to achieve [KN10]. It can be done by introducing new code generation patterns that implement algorithms which, instead of processing one tuple in a loop body, accumulate a vector of input coming from the previous iterations and perform some operations as inner loops working on the accumulated vectors, one operation vector-at-a-time.


                                  Vector-at-a-time     Single-loop comp.     Hybrid
                                                       tuple-at-a-time
Instruction independence          very good            poor                  very good
SIMD opportunities                very good            poor                  very good
Branch mispredictions             can be avoided       unavoidable           can be avoided
Data access pattern               sometimes poor       sometimes poor        better
Function calls                    few                  extremely few         very few
Interpretation during execution   small                none                  very small
System complexity                 some                 some                  sum?
Compiled code complexity          static, simple code  high                  medium
Compilation time                  not applicable       big                   smaller
Compilation reusability           not applicable       hard/none             some
Optimization and profiling        simple               hard                  simple
Error detection and reporting     simple               hard                  medium

Table 3.1: Comparison of various aspects of a purely vectorized interpreted system, a single-loop compiled system, and the proposed hybrid solution.

As described in Section 2.4.2, the HIQUE system authors claim that adding new algorithms to code generation in a compiled system is relatively easy [KVC10].

Everything that can be done in an interpreted system can also be done in a compiled system – it is just a matter of increased complexity of the code generation engine, just like everything that can be implemented in a high-level language can of course be implemented in assembly. It results in higher complexity of the compiled code. The question is when the added complexity is worth it.

Compilation in a Vectorized system

From our experience of actually using this approach, it seems that it is more work to add JIT query compilation to a vectorized system than the other way around. Completely new components have to be added to support source code generation, integration with a compiler, management of the generated machine code etc. The good thing about these components is that they are mostly independent of the existing system, so they can be designed and implemented from scratch, without dealing with any existing system limitations.

The only interconnection with existing query execution is first to analyze the query plan to seize compilation opportunities, and then to plug the generated and compiled functions into execution. This can be done easily in VectorWise, by making the JIT functions act like existing primitive functions, wrapped into the Expression execution framework, as sketched below. More work is required for the proper detection and generation of these functions; this will be described in Chapter 5.
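As a minimal illustration of this wrapping (the struct layout and names are assumptions, not the actual VectorWise API), a JIT-compiled function can simply occupy the same function-pointer slot as a statically pre-generated primitive:

typedef unsigned int uidx;
typedef uidx (*prim_fn)(uidx n, void** args);   /* common primitive signature */

typedef struct Expr {
    prim_fn fn;     /* either a pre-generated primitive or JIT-compiled code */
    void**  args;   /* input/output vectors, bound at query initialization */
} Expr;

/* The expression interpreter cannot tell the difference: one indirect
   call per vector of tuples, exactly as for built-in primitives. */
static inline uidx expr_eval(Expr* e, uidx n) {
    return e->fn(n, e->args);
}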

3.7.2. Compilation overheads

Another source of overheads in a system that uses JIT compilation is the cost of compilation itself. A system that compiles whole queries has two significant disadvantages over a hybrid approach.

• Compiling whole queries means more generated code, and therefore more compilation time required, compared to compiling only fragments.


• Furthermore, when whole queries are compiled monolithically, there is less possibility of reusing the compiled code. In a given workload, many fragments of different queries may repeat. If the system compiles the whole query, it is unable to reuse the compiled code for another query which shares some fragments of the query plan. If fragments are compiled separately, they can be buffered and reused in a much more flexible way (see the sketch below).
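A minimal sketch of such fragment reuse follows; the fixed-size cache, the djb2 string hash and the jit_compile() hook are illustrative assumptions, not the VectorWise implementation:

#include <string.h>
#include <stdlib.h>

typedef unsigned int uidx;
typedef uidx (*prim_fn)(uidx n, void** args);

/* jit_compile() stands for the compilation infrastructure of Chapter 4 (assumed). */
extern prim_fn jit_compile(const char* source);

#define CACHE_SLOTS 64
static struct { char* src; prim_fn fn; } cache[CACHE_SLOTS];

/* Cache compiled fragments keyed by the generated source text, so a fragment
   shared by several queries is compiled only once. */
prim_fn jit_get_fragment(const char* source) {
    unsigned long h = 5381;                       /* djb2 string hash */
    for (const char* p = source; *p; p++) h = h * 33 + (unsigned char)*p;
    unsigned slot = h % CACHE_SLOTS;
    if (cache[slot].src && strcmp(cache[slot].src, source) == 0)
        return cache[slot].fn;                    /* hit: reuse compiled code */
    free(cache[slot].src);                        /* miss: evict and recompile */
    cache[slot].src = strdup(source);
    return cache[slot].fn = jit_compile(source);
}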

We will discuss the compilation infrastructure in detail in Chapter 4, showing that our JIT compilation of only fragments of execution is very fast.

3.7.3. Profiling and runtime performance optimization

Another interesting aspect of a vectorized system is that its performance can be very easily profiled. Each primitive is a separate function operating on a vector of values. This granularity of function calls is coarse enough to allow measuring the actual performance of each primitive, including rich, detailed information such as CPU hardware counters, without introducing significant measurement overhead.
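For illustration, here is a sketch of such vector-granularity measurement using the x86 time-stamp counter; the names are assumptions, and a real system would also read other hardware counters:

#include <stdint.h>

typedef unsigned int uidx;
typedef uidx (*prim_fn)(uidx n, void** args);

/* Read the x86 time-stamp counter (x86-specific inline assembly). */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* One counter read before and after a whole vector of work: the measurement
   overhead is amortized over n tuples, which per-tuple timing cannot do. */
static uidx profiled_call(prim_fn fn, uidx n, void** args, uint64_t* cycles) {
    uint64_t start = rdtsc();
    uidx ret = fn(n, args);
    *cycles += rdtsc() - start;
    return ret;
}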

This way, a vectorized system can obtain detailed and highly accurate information about the performance of each part of query execution. Such information can be used for runtime adaptation of the query plan and for providing feedback to the query optimizer. Different implementations of the same primitive can be interchanged at runtime based on their performance, e.g. the different selection implementations in Section 3.3, as sketched below. Changing between alternative methods can be done at vector-level granularity during execution. In long running analytical queries different approaches can be tried and measured at the beginning of query execution, before settling on the best methods. Then this choice can be periodically reevaluated.
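A minimal sketch of such vector-granularity adaptation between the branching and data-dependency selection variants; the thresholds and the running selectivity estimate are assumptions for illustration:

typedef unsigned int uidx;
typedef unsigned int uint;

/* branching ("lazy") variant: fast at very low or very high selectivities */
static uidx sel_lt_branching(uidx n, uidx* res, uint* col, uint val) {
    uidx k = 0;
    for (uidx i = 0; i < n; i++)
        if (col[i] < val) res[k++] = i;
    return k;
}

/* data-dependency variant: no branch mispredictions at medium selectivities */
static uidx sel_lt_branchfree(uidx n, uidx* res, uint* col, uint val) {
    uidx k = 0;
    for (uidx i = 0; i < n; i++) { res[k] = i; k += (col[i] < val); }
    return k;
}

/* choose per vector, re-evaluating the choice from the observed selectivity */
uidx sel_lt_adaptive(uidx n, uidx* res, uint* col, uint val) {
    static double sel_est = 0.5;                /* running selectivity estimate */
    uidx k = (sel_est > 0.05 && sel_est < 0.95)
           ? sel_lt_branchfree(n, res, col, val)
           : sel_lt_branching(n, res, col, val);
    if (n > 0) sel_est = 0.9 * sel_est + 0.1 * ((double)k / n);
    return k;
}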

With dynamically compiled fragments of queries there is even more room for runtime evaluation and adaptation. The generated functions may be directly compared with the primitives that they replaced.

The finer granularity of working on fragments of queries may also help other maintainability aspects, such as detecting and pinpointing errors.

The possibility of fine-grained profiling helped us in our experiments. In Figures 3.8 and 3.9 in Section 3.6 we were able to measure the split performance of the various phases of the vectorized implementations. We could not do that for the fully compiled implementation.

When operations are compiled together monolithically and performed tuple-at-a-time, they cannot be accurately profiled, because starting and stopping counters around simple operations on single tuples would be more costly than the operations themselves. Some information could be derived from sampling methods, but it would not be accurate, and when working on code compiled from dynamically generated source, it is more difficult to map the profile samples in the machine code back to the corresponding operation templates from which the code was generated.

Although it is relatively easy to implement new code generation patterns and find basic bugs in them, it is really hard to find more obscure errors at runtime in the generated JIT code. Such code cannot be debugged with debuggers like GDB, since they cannot link it to symbols in existing source code. Even when an error in the generated code is found, it still has to be traced back to the correct source generation pattern. Altering the pattern can result in errors in other unforeseen situations when a different dynamic combination of compiled operations occurs. Of course, there is less room for errors and they are easier to pinpoint when the compiled fragments are smaller.


Summarizing, “holistic” compilation such as proposed in HIQUE [KVC10], which generates monolithic code for whole queries, misses some opportunities for flexible runtime adaptation and optimization.

3.8. Summary

For database architects seeking a way to increase the computational performance of a database engine, there might seem to be a choice between vectorizing the expression engine and introducing expression compilation. Vectorization is a form of block-oriented processing, and if a system already has a tuple-at-a-time operator API, many changes will be needed beyond expression calculation, notably in all query operators as well as in the storage layer. If high computational performance is the goal, such deep changes cannot be avoided, as we have shown that if one keeps adhering to a tuple-at-a-time operator API, expression compilation alone provides only marginal improvement.

Our main message is that one does not need to choose between compilation and vectorization, as we show that the best results are obtained if the two are combined. As to what this combining entails, we have shown that the “loop-compilation” techniques proposed recently can be inferior to plain vectorization, due to the latter's better (i) SIMD alignment, (ii) ability to avoid branch mispredictions and (iii) parallel memory accesses. Thus, in such cases, compilation should better be split into multiple loops, materializing intermediate vectorized results. Also, we have signaled cases where an interpreted (but vectorized) evaluation strategy provides optimization opportunities which are very hard with compilation, like dynamic selection of a predicate evaluation method or predicate evaluation order.

Thus, a simple compilation strategy is not enough; state-of-the-art algorithmic methods may use certain complex transformations of the problem at hand, sometimes require run-time adaptivity, and always benefit from careful tuning. To reach the same level of sophistication, compilation-based query engines would require significant added complexity, possibly even higher than that of interpreted engines. It also shows that vectorized execution, which is an evolution of the iterator model, when enhanced with compilation, further evolves into an even more efficient and more flexible solution without dramatic changes to the DBMS architecture. It obtains very good performance while maintaining clear modularization, simplified testing and easy performance- and quality-tracking, which are key properties of a software product.

From the perspective of introducing JIT compilation into the VectorWise system, the following areas turned out to be the most interesting:

• combining arithmetic calculations in projections;

• revising the VectorWise hash table implementation, introducing JIT compilation for parts of the hash table probing phase.

However, we believe that investigation of other elements of the system will reveal many additional opportunities for JIT application.


Chapter 4

JIT Compilation infrastructure

To be able to support JIT compilation, a system like VectorWise needs an infrastructure that allows it to connect to a compiler and to dynamically link the JIT-generated code to itself. In this chapter we study the suitability of different compilers as a JIT compiler inside the VectorWise system. The experiments in Chapter 3 used mostly gcc, but also revealed some fundamental differences between compilers. Also, the overhead of invoking the compiler was not taken into account; only the execution time of the already compiled code was measured.

First, in Section 4.1 we introduce the three considered compilers – gcc, icc and clang. In Section 4.2 we evaluate the compilers by compiling the whole VectorWise system itself, measuring both the compilation time and the compiled system's performance on the TPC-H benchmark. Different proof-of-concept implementations of a JIT compiler are discussed and tested in Section 4.3. The proofs-of-concept were afterwards turned into a JIT compilation infrastructure inside the VectorWise system.

4.1. Compilers

For our JIT compilation we want to be able to compile code generated in C, as opposed to some systems like System R [D. 81] or Hyper [KN10], which generate assembly or an assembly-level low-level language directly. Therefore we depend on the compiler's code generation and low-level optimization abilities. This gives two important benefits:

1. Code generation complexity. It is much easier to generate source code in a high-level language without the need for low-level optimizations. Also, snippets of existing code from the system can be reused as templates and stubs for the code generator.

2. Portability and architecture independence. Source generated in a high-level language like C is translated into hardware-specific machine code by the compiler.

These benefits come at the cost of a bigger compilation overhead with a general-purpose compiler than would be possible by generating specialized pre-optimized code, and of slightly reduced control over what actually comes out at the end of the process.
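To make the first benefit concrete, a code generator working at the C level can emit a specialized primitive such as the following and pass it to any of the compilers introduced below; the name and the fused expression are hypothetical, for illustration only:

typedef unsigned int uidx;
typedef unsigned int uint;

/* Hypothetical generated source: the query-specific expression is inlined
   into a simple loop that a general-purpose compiler can fully optimize. */
uidx map_mul_add_uint_col(uidx n, uint* res, uint* a, uint* b, uint* c) {
    for (uidx i = 0; i < n; i++)
        res[i] = a[i] * b[i] + c[i];
    return n;
}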

4.1.1. GNU Compiler Collection – gcc

GNU Compiler Collection (gcc) is an open-source compiler system produced by the GNU Project. It has been adopted as the standard compiler by most modern Unix-like computer operating systems and is also widely deployed as a tool in commercial, proprietary and closed-source software development environments.


Gcc was used as the default compiler in the VectorWise build system, and most of the experiments in Chapter 3 were conducted using it. We used gcc 4.4.2.

4.1.2. Intel Compiler – icc

Intel C++ Compiler (icc) is a group of proprietary C and C++ compilers from Intel Corporation. As such, it is marketed as being especially suitable for generating code for Intel processors, including an automatic vectorizer that can detect vectorizable loops and generate all flavours of SSE instructions. However, we observed in Section 3.2 of Chapter 3 that it sometimes makes sub-optimal decisions about vectorization in more complex situations. On the other hand, it never fails to vectorize simple loops where possible.

The VectorWise build system was already configured to use icc as an alternative compiler for the system. We used icc 11.0.

4.1.3. LLVM and clang

The Low Level Virtual Machine (LLVM) [LLV], despite its name, has little to do with virtual machines; it is a collection of modules and libraries for building compilers. These libraries operate on an intermediate code representation known as the LLVM intermediate representation ("LLVM IR"), a target-independent assembly-like language. The LLVM libraries provide a modern target-independent optimizer working on LLVM IR, along with assembly and machine code generation support for various CPUs. LLVM has been developed since 2003, making it a relatively mature set of libraries. Hyper [KN10] generates code for its JIT compilation directly in LLVM IR, thereby keeping the benefit of architecture independence while still maintaining tight control over the generated code. Hyper can also use the LLVM optimizer passes for general optimizations, while implementing some specific optimizations itself.

LLVM is also suitable for creating compiler frontends for different languages: a framework is provided to help construct abstract syntax trees and generate code into LLVM IR. One such frontend is the "clang" subproject of LLVM [cla]. Developed since 2007, it is quite a young compiler, but it depends on the much more mature set of LLVM libraries for the compiler back-end. Clang claims to be significantly faster than gcc, which holds true, as we demonstrate in Section 4.2. One known deficiency of LLVM and clang is that they do not yet perform automatic SIMDization of loops. There is however work underway towards this functionality in the Polly project [Gro10], which will hopefully be included in future releases. We used the then most recent LLVM+clang release, 2.9.

A strong reason to consider using LLVM and clang as a JIT compiler is their modular design as a set of libraries with a relatively well documented API. This way, a JIT compiler can be constructed by linking directly with the clang libraries and using their internal API. This is discussed in detail in Section 4.3.2.

4.2. VectorWise compilation

The VectorWise build system was already configured to enable compilation using gcc and icc. Adding clang to the suite proved to be very easy, as it was designed to be a drop-in replacement for gcc, with a highly compatible command line parameter format.


         debug build   optimized build
clang    145 s         245 s
gcc       87 s         490 s
icc      311 s         449 s

Table 4.1: VectorWise compilation time using different compilers. The debug build is without optimizations, while the optimized build is a production build.

4.2.1. Build time

For each of the three compilers, two VectorWise builds were created: a debug and an optimized build. The debug build turned off all optimizations, leaving only the -g option for debugging symbols. Optimized builds use different optimization flags, depending on the compiler:

• gcc: -m64 -O6 -fomit-frame-pointer -finline-functions -malign-loops=4 -malign-jumps=4 -malign-functions=4 -fexpensive-optimizations -funroll-all-loops -funroll-loops -frerun-cse-after-loop -frerun-loop-opt

• icc: -no-gcc -mp1 -O3 -restrict -unroll -axWPT -axSSE4.2

• clang: -m64 -O3 -finline-functions -funroll-loops

Table 4.1 shows that the LLVM/clang claim of being faster than gcc holds true for VectorWise compilation. While gcc is the fastest compiler for a non-optimized build, clang is able to compile with optimizations in half the time of gcc. The Intel compiler is much slower when delivering a debug build, but its compilation speed with optimizations is comparable to that of gcc.

4.2.2. TPC-H performance

The performance of the created builds was tested on a realistic workload for an analytical database: the whole set of TPC-H benchmark [Tra02] queries on a scale factor 10 database. The measured times are of "hot" runs of the queries: the database size was 10 GB, so on a machine with 12 GB of RAM all data can be buffered in the buffer pool, leading to in-memory execution. Two runs were conducted, with and without intra-query parallelism.

The results of the benchmarks are shown in Table 4.2. In both runs, the gcc optimized build was the fastest, but the results were very close to each other. The clang build is slower by 5% in the single-threaded run. However, some known deficiencies of clang, such as the lack of automatic SSE generation, are currently being worked on. In the multi-threaded run all compilers are within 1.2% of each other.

The difference between a SIMDizing and a non-SIMDizing compiler is smaller than might be expected, as one of the first things that comes to mind as a benefit of vectorized execution (in the sense of operating on vectors of data) is the possibility to vectorize the code (in the sense of using SIMD machine instructions). However, the results show that it does not really have such a high impact. The reason is that the easily SSE-able operations, like map calculations and computing hash functions, do not take up much of the execution time. Microbenchmarks in Chapter 3 showed results of a few cycles per tuple in the Project case study in Section 3.2, which involved SIMD opportunities, compared to hundreds of cycles in hash table operations involving random memory accesses in Section 3.6.


                Single thread                  Eight threads
Query     clang     gcc       icc        clang     gcc       icc
1         2.19 s    1.96 s    1.95 s     0.56 s    0.53 s    0.54 s
2         0.24 s    0.22 s    0.20 s     0.18 s    0.18 s    0.17 s
3         0.18 s    0.17 s    0.17 s     0.17 s    0.17 s    0.17 s
4         0.16 s    0.17 s    0.18 s     0.12 s    0.11 s    0.11 s
5         0.57 s    0.57 s    0.57 s     0.49 s    0.49 s    0.50 s
6         0.15 s    0.15 s    0.23 s     0.10 s    0.10 s    0.11 s
7         0.58 s    0.61 s    0.58 s     0.29 s    0.28 s    0.28 s
8         0.65 s    0.66 s    0.63 s     0.37 s    0.39 s    0.41 s
9         4.11 s    3.69 s    3.93 s     1.48 s    1.46 s    1.44 s
10        0.77 s    0.76 s    0.76 s     0.46 s    0.46 s    0.47 s
11        0.25 s    0.24 s    0.24 s     0.17 s    0.18 s    0.18 s
12        0.49 s    0.50 s    0.51 s     0.20 s    0.21 s    0.21 s
13        3.17 s    3.06 s    3.02 s     1.18 s    1.16 s    1.15 s
14        0.35 s    0.33 s    0.35 s     0.23 s    0.22 s    0.23 s
15        0.16 s    0.16 s    0.18 s     0.14 s    0.15 s    0.14 s
16        0.92 s    0.91 s    1.01 s     0.48 s    0.47 s    0.49 s
17        0.56 s    0.56 s    0.54 s     0.26 s    0.24 s    0.25 s
18        1.93 s    1.65 s    1.80 s     0.73 s    0.69 s    0.72 s
19        1.35 s    1.38 s    1.34 s     0.40 s    0.42 s    0.40 s
20        1.75 s    1.73 s    1.69 s     0.67 s    0.67 s    0.65 s
21        2.16 s    2.21 s    2.44 s     0.88 s    0.87 s    0.90 s
22        0.76 s    0.69 s    0.65 s     0.38 s    0.37 s    0.37 s
Total:    23.45 s   22.38 s   22.97 s    9.94 s    9.82 s    9.89 s
Power:    54077.8   55561.2   53882.7    103715.9  104527.5  103669.0
Relative: (104.8%)  (BEST)    (102.6%)   (101.2%)  (BEST)    (100.7%)

Table 4.2: Performance of VectorWise compiled by different compilers on the TPC-H benchmark, scale factor 10, on a 2.67 GHz Intel Nehalem 920 with 4 physical (8 virtual) cores. Because the total time is influenced more by the differences in long-running queries, an alternative metric called "Power" is also presented, which is proportional to the geometric mean of the query times and is more objective.

Another reason is that even for those simple map primitives, the compilers claiming auto-vectorization are often not able to SIMDize the loops correctly. Closer investigation of the generated assembly for individual primitives shows that both gcc and icc do not generate SSE code for some trivially SIMD-izable loops in VectorWise primitives. The pattern of whether they succeed or fail seems random. The VectorWise build system could be improved with automatic means of inspecting and reporting the quality of compiled primitives, so that such imperfections would be automatically detected and corrected.

4.3. Implementation

In this section we discuss the objective of building a generic JIT compiler infrastructure, with an API capable of compiling arbitrary code provided as strings in memory and extracting pointers to functions given their names.


Algorithm 12 A short snippet demonstrating how simple a JIT compiler can be that compiles arbitrary code from a string by invoking an external compiler and dynamically linking to the created library. The actual system provides additional functionality omitted in this snippet, such as error checking of system calls, checking for compilation errors, platform independence, more advanced tracking of used memory, freeing of no longer needed libraries, and the ability to use different compilers with different command line options. Still, a full implementation fits in under 200 lines of code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <dlfcn.h>

/* JIT compiles the source from a string into a library and loads it */
void *compile_lib(char *src) {
    // Write generated source to temporary file.
    FILE *f = fopen("/tmp/jitsrc.c", "w");
    fprintf(f, "%s", src);
    fclose(f);
    // Prepare target library filename.
    char libname[15];
    strcpy(libname, "/tmp/libXXXXXX");
    mktemp(libname);

    int pid = fork();
    if (!pid) { // child calls the compiler
        execl("/usr/bin/gcc", "gcc", "-O3", "/tmp/jitsrc.c",
              "-fPIC", "-shared", "-o", libname, NULL);
    }
    // parent loads the library.
    waitpid(pid, NULL, 0); // wait for compiler to finish.
    unlink("/tmp/jitsrc.c");
    return dlopen(libname, RTLD_NOW | RTLD_GLOBAL);
}

/* gets a function from the JIT compiled library */
void *get_function(void *libhandle, char *func) {
    return dlsym(libhandle, func);
}

/* free a no longer needed library */
void free_lib(void *libhandle) {
    dlclose(libhandle);
}

4.3.1. External compiler and dynamic linking

Most other database systems featuring JIT compilation write the generated source to a file and run the compiler externally on it. This can be done by forking a new process which invokes a C compiler and, after waiting for it to finish, dynamically linking to the created library, as is done in the HIQUE system [KVC10]. The JAMDB system [RPML06] is written in Java, so it can use the built-in JIT capabilities of Java, simply writing a class into a file and using the Java ClassLoader to load it. AT&T's Daytona system [Gre99] compiles a standalone program specialized in executing one query (possibly with wildcard parameters) in an architecture without a database server, and the client runs it directly.

Algorithm 12 demonstrates the simplicity of the external approach. To interface with the compiler, the generated source has to be written to a disk file, but as the compiler will access it directly afterwards, it will most probably read it from the in-memory disk caches instead of actually using the hard drive. If the source file is removed afterwards, it may actually never be synced to disk. The Linux functions fork and the exec family, dlopen, dlsym and dlclose have Windows counterparts spawn, LoadLibrary, GetProcAddress and FreeLibrary, enabling this approach to be ported to the Windows version of the system. The approach is compiler independent: any compiler can be used, if provided with the correct path and command line options.
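To make the interface concrete, the following hypothetical usage of the snippet in Algorithm 12 compiles a trivial function from a string and calls it; the function name add_one is illustrative only and error checks are omitted, as in the snippet itself.

#include <stdio.h>

int main(void) {
    const char *src = "int add_one(int x) { return x + 1; }\n";
    void *lib = compile_lib((char *) src);            /* fork gcc, dlopen result */
    int (*add_one)(int) =
        (int (*)(int)) get_function(lib, "add_one");  /* dlsym lookup by name */
    printf("%d\n", add_one(41));                      /* prints 42 */
    free_lib(lib);                                    /* dlclose */
    return 0;
}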

4.3.2. Using LLVM and clang internally

The modular design of LLVM and clang, with a well defined API, makes it possible to use them internally, linking directly to the LLVM libraries. The LLVM suite features a library designed explicitly for JIT compilation, the ExecutionEngine [LLV].

We used two presentations from LLVM Developers' Meetings as examples of how to implement such a JIT compiler using LLVM. The first one [Beg08] discussed the LLVM JIT API in general. The second one discussed Cling [Nau10], a prompt-based C interpreter developed at CERN, extending the language to enable interactive writing and execution of C-like code as in e.g. Python, OCaml, R etc. We also based our implementation on the "clang-interpreter" example in the clang code repository.

The goal was to make the interaction with the LLVM objects as simple as possible, while avoiding overheads:

• do not create external processes,

• have a “warm” pre-initialized compiler ready to accept work,

• read source directly from memory and generate executable code also directly into memory.

The source code for achieving these goals is longer than that of the external approach and would not fit in a single snippet, but conceptually it is simple. On a high level, the following steps have to be performed:

1. The input location has to be overridden to read source from memory instead of a file.

2. A clang instance has to be invoked to parse the source and generate a Module with LLVM IR, and to apply LLVM optimization passes to it.

3. The resulting Module can be put into JIT ExecutionEngine.

4. The ExecutionEngine can be asked to generate machine code for given symbols from the Module and to return function pointers.

Both the LLVM ExecutionEngine object and the clang CompilerInstance object, together with a number of helper objects, can be created once and remain ready, waiting for incoming compilations.
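As a rough illustration of steps 3 and 4, the LLVM C bindings of that era exposed the ExecutionEngine directly. The sketch below assumes a Module has already been produced by clang (steps 1 and 2 have no C API and go through the C++ CompilerInstance), so it is only a partial, hypothetical picture of the internal approach.

#include <llvm-c/Core.h>
#include <llvm-c/Target.h>
#include <llvm-c/ExecutionEngine.h>

/* Steps 3-4: put a ready Module into a JIT ExecutionEngine and
 * ask for a function pointer by symbol name. */
void *jit_function_from_module(LLVMModuleRef mod, const char *name) {
    LLVMExecutionEngineRef ee;
    char *error = NULL;
    LLVMLinkInJIT();                  /* pull the JIT into the binary */
    LLVMInitializeNativeTarget();     /* set up codegen for this CPU */
    if (LLVMCreateJITCompilerForModule(&ee, mod, 3 /* opt level */, &error))
        return NULL;                  /* error describes the failure */
    LLVMValueRef fn = LLVMGetNamedFunction(mod, name);
    return fn ? LLVMGetPointerToGlobal(ee, fn) : NULL;
}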

4.3.3. JIT compilation infrastructure API

We abstracted the JIT compiler from the other layers of the system by using a uniform API with which all the compilers have to comply.

A global function jit_exclusive_get is used by a JitContext (to be described in Section 5.3.2) to request the use of a compiler (and afterwards to release it using jit_release). A handle is returned, with pointers to the functions of the API. Such a global handle-request method is also used to control concurrency problems when multiple compilations are performed at once. For example, the LLVM libraries and clang are not adapted to being used in a multi-threaded environment with multiple compilations going on at the same moment, so locking has to be done when using the internal LLVM implementation of the JIT compiler.
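A minimal sketch of this handle pattern, with a hypothetical struct layout and locking (the actual VectorWise code differs):

#include <pthread.h>

typedef struct JitCompilerHandle {
    void *(*jit_compile)(char *source);                      /* see below */
    void *(*jit_get_function)(void *library, char *symbol);
    void  (*jit_clear)(void);
} JitCompilerHandle;

static pthread_mutex_t jit_lock = PTHREAD_MUTEX_INITIALIZER;
static JitCompilerHandle jit_compiler; /* filled in at system startup */

/* Serializes access: e.g. the internal LLVM/clang compiler must not
 * run two compilations concurrently. */
JitCompilerHandle *jit_exclusive_get(void) {
    pthread_mutex_lock(&jit_lock);
    return &jit_compiler;
}

void jit_release(JitCompilerHandle *h) {
    (void) h;
    pthread_mutex_unlock(&jit_lock);
}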


The strategy for choosing the compiler has not been finalized. We could always use one compiler, or switch between different ones for different queries. Different paths and command line arguments for externally called compilers can also be used. These options could be exposed as VectorWise configuration options to the database user, which would enable the user to specify any compiler, any path and any set of arguments for the C compiler. In the scope of this thesis we implemented a simple code switch that let us test different JIT compilers.

The API of a concrete compiler is actually very simple and consists of only three functions:

• void* jit_compile(char* source) compiles and loads the C code saved in a string. The result is a handle to the loaded library.

• void* jit_get_function(void* library, char* symbol) retrieves a function with the given symbol name from the previously compiled library.

• A global method jit_clear to unload all JIT compiled code.

In the case of an externally used compiler, the jit_compile method differs only in using different paths and command line arguments for the different compilers. The jit_get_function method is the same, since it just deals with libraries dynamically loaded with dlopen, no matter which compiler was used. The internal clang compiler implementation uses the LLVM infrastructure, and therefore its implementation of jit_compile and jit_get_function has no similarities to the external one. The jit_clear method of the LLVM compiler clears the ExecutionEngine object; for external compilers it closes all the loaded libraries with dlclose.

It would be useful to have a method for controlling the memory usage and for unloading JIT primitives more selectively. There are however a few problems with that:

• The ExecutionEngine object in LLVM is able to report the amount of memory occupied by compiled code, and we can even replace the JITMemoryManager with our own implementation. It is not so easy, however, with the dynamically loaded libraries of the external compiler implementation. There is no portable way to ask how much memory a library loaded with dlopen is actually taking up. A possible estimate would be to check the size of the library's .so file. On Linux, the /proc/PID/maps file could be queried to find where the library is mapped into the process memory.

• We compile multiple functions in a single compilation to save overhead. Again, while LLVM has a method to free the machine code of just a single symbol, with dlclose we can only unload whole libraries. A hypothetical memory manager would have to maintain the relations between functions coming from the same library and only unload the library after the unloading of all its symbols has been requested.

Therefore, a more advanced implementation of JIT code memory management and an API for it remain future work.

4.3.4. Evaluation

A proof-of-concept implementation was written as a microbenchmark of the described solutions. We compared the compilation speed of the external approach using all three compilers, and of using clang and LLVM internally, both in non-optimized (-O0) and optimized (-O3) compilation.

The execution time of jit_compile and of subsequent calls to jit_get_function for all functions in the compiled code was measured. The benchmark was performed on three source files:


Algorithm 13 Queries 1 and 19 of the TPC-H benchmark. The version of our code generator used when evaluating the different JIT compilers was able to recognize and generate code of primitives for projection calculations like l_extendedprice*(1-l_discount)*(1+l_tax) in Query 1, and for the complex selection conditions in Query 19, together with the necessary declarations, producing 10 kB and 34 kB of C source code to be JIT compiled.

# Query 1
select
    l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '90' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

# Query 19
select sum(l_extendedprice * (1 - l_discount)) as revenue
from lineitem, part
where
    (p_partkey = l_partkey
     and p_brand = '[BRAND1]'
     and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
     and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
     and p_size between 1 and 5
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON')
or
    (p_partkey = l_partkey
     and p_brand = '[BRAND2]'
     and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
     and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
     and p_size between 1 and 10
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON')
or
    (p_partkey = l_partkey
     and p_brand = '[BRAND3]'
     and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
     and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
     and p_size between 1 and 15
     and l_shipmode in ('AIR', 'AIR REG')
     and l_shipinstruct = 'DELIVER IN PERSON');

• Hello World – a simple source with one function printing "Hello World!". This tests the plain overhead of starting the compiler, loading the library, linking and resolving symbol references (a reference to the standard library function printf).

• Query 1 of TPC-H. The queries are presented in Algorithm 13. The JIT compiled source is not a program for evaluating the whole query, but a small portion of it that an early version of our source generator was able to produce. It compiles JIT primitives for calculating Project expressions like l_extendedprice * (1-l_discount) * (1+l_tax) from the query plan in Figure 4.1. These amount to 10 kB of C source code, which also includes a preamble with all declarations and type definitions necessary for using this code in VectorWise after JIT compilation. The code itself is quite bloated, as the C code generator produces each operation in a new statement, declaring and using a new local variable, similar to a Static Single Assignment (SSA) form [ASU86]. Therefore, the code size gets quite big, even for those few simple operations.


Query 1:          Query 19:
Sort              Project
 Project           Aggr
  Aggr              Project
   Project           Select
    Select            HashJoin
     Scan             |_Select
                      | \_Scan
                      \_Select
                        \_Scan

Figure 4.1: Query plans of Query 1 and Query 19 of TPC-H. The operators for whose expressions source code was used in the benchmark are written in bold.

Test case / JIT compiler         gcc        icc        clang      clang-internal
Hello World       -O0            23.4 ms    139.7 ms   25.8 ms    0.9 ms
                  -O3            24.9 ms    145.3 ms   27.4 ms    1.5 ms
Query 1 (10 kB)   -O0            38.3 ms    150.2 ms   38.0 ms    15.9 ms
                  -O3            56.6 ms    193.9 ms   44.0 ms    16.9 ms
Query 19 (34 kB)  -O0            107.5 ms   205.7 ms   95.4 ms    104.8 ms
                  -O3, no vec.   757.4 ms   855.6 ms   -          -
                  -O3            754.8 ms   888.9 ms   338.2 ms   301.4 ms

Table 4.3: Compilation speed of three test cases using different JIT compilers. The fast startup advantage of clang-internal quickly diminishes as the size and complexity of the JIT compiled source increases. Clang once again confirms its fast optimization capabilities. Icc has the slowest startup of all compilers.

• Query 19 of TPC-H. The query in Algorithm 13 has a very complicated selection condition on two tables. In the query plan in Figure 4.1, this selection gets separated into three Select operators: two below the join, for conditions concerning only one table, and one after the join. In each of these Selects code is generated to evaluate the complex condition, producing 34 kB of source code in total.

The results are shown in Table 4.3. The internal clang implementation offers very fast startup latency, being able to deliver a compiled "Hello World" function in under a millisecond, 25x faster than calling gcc or clang externally, and over 150x faster than icc. Icc has a particularly slow startup time, but catches up with gcc's speed once bigger pieces of code are compiled.

However, this advantage of the internal clang implementation in fast startup time becomes less significant as the source becomes more complicated and the compilation itself becomes the dominating part. Nevertheless, while without optimizations clang is on par with gcc, it beats the other compilers by over a factor of 2 when optimizations are enabled, while producing code of comparable quality, as shown in Section 4.2. This does not account for clang not performing auto-vectorization optimizations, so, to be absolutely fair, we reran the last test on gcc with the -fno-tree-vectorize flag and on icc with the -no-vec flag to explicitly disable these optimizations. That however did not change anything for gcc and shaved less than 4% off icc's time.

The conclusion is that if bigger pieces of generated code are compiled at once, there is no advantage in linking directly with the compiler and using its internal API. The external approach avoids complication, library dependencies, and having to adapt the JIT compiler each time a new version of the libraries changes the internal APIs. It also offers more flexibility, as any compiler can be chosen. If, however, a model were chosen in which we often (multiple times per query) compile small pieces of code, the internal compiler would be the best choice.

4.4. Summary

The comparison of VectorWise performance when compiled by different compilers has shown that the differences are very small. On the other hand, the difference in how fast the compilers themselves run is huge, with clang being over twice as fast as any other compiler. As compilation overhead is very important in JIT compilation, clang should be the first choice.

Linking with LLVM and clang and using them internally promised a low-latency, always-ready JIT compiler, but that turned out not to be such a big advantage, as realistic generated code is complex enough that compiler startup latency is not the dominating factor. Moreover, maintaining the dependency on the LLVM libraries binds the system not only to one compiler, but also to its specific version, as minor changes in the API prevent simply linking against a different release of LLVM. Using the compiler as an external program allows full flexibility in configuring which compiler to use.

The perhaps surprising conclusion is to use the simplest solution, because it works best. Using the less known clang rather than gcc may result in reduced overheads. The door always remains open for any different configuration.


Chapter 5

JIT primitive generation and management

This chapter contains practical details of how the detection of JIT opportunities, the source code generation and the management of the compiled code were implemented in VectorWise.

Chapter 4 already described the integration with a C compiler, so we skip that part of the infrastructure in this chapter. In Section 5.1 we explain how the templates of operations for which we perform code generation are extracted from VectorWise's own source code and maintained. In Section 5.2 we describe how we find opportunities for JIT compilation in the query execution tree during the Rewriter phase. Section 5.3 discusses source code generation in more detail. We explain how the object code of compiled primitives is stored and maintained for future reuse in Section 5.4.

Not all of the ideas from Chapter 3 were implemented in VectorWise in the scope of this thesis. Future enhancements are described in Chapter 7.

5.1. Source templates

In Section 2.2.2 we introduced primitives, which are the workhorse functions during query execution. In JIT query compilation we would like to compile many such primitives together into a single function. It would be good to be able to reuse parts of the source code of the original primitives when generating JIT source for new ones.

5.1.1. Primitive templates

Usually a primitive is a little more complicated than was shown in Algorithm 1 on page 18. In addition to the operation performed in the loop, there is often some initialization prologue before the loop and an epilogue after it. These are used for checking and returning execution errors such as overflows, division by zero etc. There are therefore three things that we would like to extract from the source of a primitive for the code generation of its dynamic counterpart: the prologue, the in-loop operation itself, and the epilogue.

Help comes from the fact that primitives are generated using macros. The macro language Mx (developed internally at CWI for the MonetDB database [Bon02]) and the C preprocessor are used to expand combinations of a primitive operation in different dimensions: types of arguments and result, working on vectors or scalars, with or without a selection vector. An Mx macro is used to expand the different combinations and provide a template for the main primitive loop, while the operation itself is wrapped into a C preprocessor function.


Algorithm 14 A (simplified) example of how the Mx and C preprocessor macro languages are used to expand primitive functions. The impl_ and decl_ macros are used to generate the implementations of all primitives and an array with pointers to all of them.

/* Operations defined as preprocessor macros */
#define sint_div(res, a, b) \
  { int divbyzero = ((b) == 0); res = (a) / ((b) + divbyzero); err |= divbyzero; }
#define sint_div_ret \
  if (err) return x100_err(e, VWLOG_FACILITY, X100_ERR_INTDIVBYZERO);

/* Mx macros */
@= decl_mapop_colcol
{ "@4", "map_@1_@2_col_@3_col", (Primitive) map_@1_@2_col_@3_col, },
@= impl_mapop_colcol
uidx map_@1_@2_col_@3_col
(uidx n, @4 *restrict res, @2 *restrict col1, @3 *restrict col2, uidx *restrict sel, Expr *e) {
  ulng err = 0; /* all primitives use a ulng err variable */
#ifdef @1_prologue
  @1_prologue /* would expand a sint_div_prologue macro if it existed */
#endif
  /* Application of a macro that creates the "for" loop,
     for sel. vec. and no sel. vec. */
  @:impl1(@1(res[i], col1[i], col2[i]))@
#ifdef @1_ret
  @1_ret /* @1 will expand to sint_div.
            The C preprocessor will expand the sint_div_ret macro */
#else
  generic_ret /* Generic epilogue macro */
#endif
  return n;
}
@= implsel1   { uidx i = 0, j = 0; for (j = 0; j < n; j++) { i = sel[j]; @1; } }
@= implnosel1 { uidx i = 0; for (i = 0; i < n; i++) { @1; } }
@= impl1      { if (sel) { @:implsel1(@1)@ } else { @:implnosel1(@1)@ } }

/* Application of the macros to create a sint_div primitive */
@:decl_mapop_colcol(sint_div, sint, sint)@
@:impl_mapop_colcol(sint_div, sint, sint)@

There are a few top-level Mx macros that accept an operation name as one of their arguments. If the operation name is OP (for example sint_div), then they expect that there is a C preprocessor macro OP defined for this operation, and optionally a macro for the prologue, OP_prologue, and for the epilogue, OP_ret. Algorithm 14 shows an example of how Mx and the C preprocessor are used to expand primitive functions. Mx uses the @= keyword for macro definitions, @: for macro calls and @1, @2, ... for macro argument expansion. The snippet is simplified; in fact more levels of macro expansion are used to avoid code duplication, there is more expansion for vector and scalar attributes etc. The important fact is that just one macro is called for a very broad class of operations, such as all binary map operations. It is passed the operation itself as one of its parameters (here: @1 as sint_div, further expanded by the C preprocessor). The impl_ macros expand to primitive implementations. The decl_ macros create an array of definitions, which is later turned into a hash table used to look up primitives by name.

Therefore, intercepting the operations and saving them as strings requires modifying just a few of the top-level Mx macros. We take advantage of the fact that the C preprocessor is able to turn and properly escape the argument of a macro into a string using #define QUOTE(x) #x, and use it to save OP, OP_prologue and OP_ret into global strings in the impl_ macro.

Algorithm 15 Extracting source templates from primitives using macro tricks.

#define QUOTE_(x) #x
#define QUOTE(x) QUOTE_(x) /* two levels, so macros inside x get expanded */

@= impl_FOO
...
#define err $0
const char *_@1_@2_@3 = QUOTE(@1($1, $2, $3));
const char *_@1_@2_@3_prologue =
#ifdef @1_prologue
    QUOTE(@1_prologue;);
#else
    "";
#endif
const char *_@1_@2_@3_ret =
#ifdef @1_ret
    QUOTE(@1_ret);
#else
    QUOTE(generic_ret);
#endif
#undef err

@= declsrc_op_col_val
{ "@5", "_@1_@2_@3", 2, &_@1_@2_@3, &_@1_@2_@3_prologue, &_@1_@2_@3_ret },

In these strings we use $0, $1, ... to mark placeholders for the error variable ($0, taking advantage of the fact that all primitives in the system use err as the error variable name), the result ($1) and the operands. Later, when generating source, they are substituted with the correct variable names.
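A sketch of how such placeholder substitution could look; the fill_template helper and the concrete template string are hypothetical, but follow the $N convention just described.

#include <stdio.h>

/* Replace $0..$9 in a template with the given names ($0 = error
 * variable, $1 = result, $2, $3, ... = operands). */
void fill_template(char *out, const char *tpl, const char *names[]) {
    while (*tpl) {
        if (tpl[0] == '$' && tpl[1] >= '0' && tpl[1] <= '9') {
            const char *s = names[tpl[1] - '0'];
            while (*s) *out++ = *s++;
            tpl += 2;
        } else {
            *out++ = *tpl++;
        }
    }
    *out = '\0';
}

/* Example: instantiating a sint_div-style template with the variable
 * names generated for one snode. */
int main(void) {
    const char *tpl =
        "{ int divbyzero=(($3)==0); $1=($2)/(($3)+divbyzero); $0|=divbyzero; }";
    const char *names[] = { "e3", "r2", "r4", "r6" };
    char stmt[256];
    fill_template(stmt, tpl, names);
    puts(stmt); /* e3 and r2..r6 substituted for $0..$3 */
    return 0;
}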

An array with the declarations of the operations saved as strings can be created in the same way as the array of primitives, using another series of macros, declsrc_, which are expanded like impl_ and decl_. The resulting source templates are then put into a hash table, similar to the one for primitives, which can be used to look up whether for a given primitive we also have an extracted template.

A code snippet showing how the operations' source code is saved into strings, in parallel with implementing them in the system, is shown in Algorithm 15.

Templates for code generation can thus be extracted from the existing primitives' source code with small modifications in only a few places of the macro expansion. There are some exceptions, and some things have to be done in a more complicated way, but in general the implementation is clean and non-invasive.

There is a system table in the VectorWise system that can be queried to view all available source templates. It can be done using:

SysScan(’jit_templates’, [’name’, ’return_type’, ’src’, ’prologue’, ’ret’]);

5.1.2. Source preamble

There are some global definitions and declarations that are needed and used by the primitives and therefore also have to be included in the generated code of every compilation. We achieve this by creating a preamble that contains all the necessary definitions, which is then pasted at the beginning of each generated compilation unit.

It contains type definitions of the different types (uidx, ulng, slng, uint, sint etc.) compatible with the system platform and configuration, enums for status and error reporting, and some structs and function declarations. It is all forwarded into a global string using some more C preprocessor tricks.
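One such trick is stringification again. A hypothetical fragment of how the preamble string could be assembled; the exact typedefs are platform dependent and shown here only as assumptions:

#define QUOTE(x) #x

/* Adjacent string literals concatenate, so the whole preamble ends up
 * in one global string pasted in front of every compilation unit. */
const char *jit_preamble =
    QUOTE(typedef unsigned int uidx;) "\n"
    QUOTE(typedef signed int sint;) "\n"
    QUOTE(typedef signed char schr;) "\n"
    QUOTE(typedef unsigned long long ulng;) "\n";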


5.2. JIT opportunities detection and injection

We implemented the detection of opportunities for dynamic compilation and the transformation of the query plan in the Rewriter phase, which was briefly introduced in Section 2.2.1 on page 16. For tree manipulations the Rewriter uses the Tom [Tom] libraries, an extension language that can be used together with C, designed to support manipulations on tree structures.

We implemented three additional Rewriter rules, applied at the end of the Rewriter phase. Each of them traverses and transforms the xnode tree, as described in Section 2.2.1. An example of an xnode tree with additional information annotated by the Rewriter is shown in Figure 5.1.

5.2.1. Fragments tagging pass

The first pass traverses the tree and tags the fragments that we are able to generate code for and compile. In the previous Rewriter passes, the nodes were annotated with a hint about which primitive will be used for the operation in that node. As described in Section 5.1.1, the source templates extracted from primitives are named using a scheme corresponding to the names of the primitives and put into a hash table. We can therefore look up whether we have a source template for the primitive annotated to a given node, and if so, annotate the node with a hint about this template. Depending on the context we can also tag the node with a hint about what kind of operation will be required: a selection primitive, a map primitive etc.

5.2.2. Fragments generation and collapsing pass

In the second pass we traverse the tree again to find the fragments that we will actually compile. In the scope of this thesis we implemented a naive rule, where we always compile the biggest possible fragments.

Therefore, this rule finds the connected components of nodes that have been tagged by the previous pass. For each such component it creates a tree of new structures called snodes, which contain all the information necessary for code generation. It collapses the fragment into a single xnode, rearranging the children of all leaves of the fragment to be children of this xnode. It might happen that the subtree used the same input parameter multiple times (e.g. many conditions concerning the same attribute). In such a situation the duplicates are detected and the aliased inputs are collapsed into one. The snode tree is attached to the xnode for future code generation.

It may also be detected that such a fragment of the tree was already compiled for some previous query, and that the resulting JIT primitive may be reused. In this case, the buffered JIT primitive is retrieved and attached to the xnode directly, instead of an snode. Another possibility is that an identical fragment is present in a different part of the tree of the same query. This is also detected, and the same snode is attached rather than generating the same code twice. Maintaining and reusing the generated primitives is discussed further in Section 5.4.

5.2.3. Compilation pass

The last pass performs the compilation and attaches the compiled functions as primitives to the xnodes. Compilation is performed only once per query: all code is generated into one string buffer, with the preamble at the front. This way the compilation overhead is minimized, since the compiler only needs to be called once.


<AGGR>: props: $$usedcols(flag sumprice) columns(flag sumprice)
|_<PROJECT>: props: $$usedcols(flag resprice) columns(resprice flag)
| |_<MSCAN>: props: $$usedcols(_lineitem._l_extendedprice _lineitem._l_tax _lineitem._l_returnflag)
| | |        columns(_lineitem._l_extendedprice _lineitem._l_tax _lineitem._l_returnflag)
| | |_<ASSIGNMENT>: _lineitem
| | | \_<STRING>: str('_lineitem')
| | \_<LIST>: COLGEN(Raw) props: $varnr(0)
| |   |_<STRING>: str('_l_extendedprice') props: isconst(0) dmax(1.0495e+07) dmin(90091)
| |   |          type(sint) ltype(decimal(8,2))
| |   |_<STRING>: str('_l_tax') props: isconst(0) dmax(8) dmin(0) type(schr) ltype(decimal(2,2))
| |   \_<STRING>: str('_l_returnflag') props: isconst(0) dmax(82) dmin(65) type(sint) ltype(char(1))
| \_<LIST>: COLDEF(Project)
|   |_<ASSIGNMENT>: resprice props: type(sint) isconst(0) dmax(1.13345e+09) dmin(9.0091e+06)
|   | |            ltype(decimal(21,4)) $varnr(0)
|   | \_<OPERATOR>: str('*') props: $primitive(map_*_sint_col_sint_col) type(sint) isconst(0)
|   |   |           dmax(1.13345e+09) dmin(9.0091e+06) ltype(decimal(21,4)) $varnr(0)
|   |   |_<IDENT>: _lineitem._l_extendedprice props: isconst(0) dmax(1.0495e+07) dmin(90091)
|   |   |          type(sint) ltype(decimal(8,2))
|   |   \_<CALL>: str('sint') props: $primitive(map_sint_schr_col) type(sint) dmax(108) dmin(100)
|   |     |       ltype(decimal(13,2)) isconst(0) $varnr(0)
|   |     \_<OPERATOR>: str('+') props: $primitive(map_+_schr_val_schr_col) type(schr) isconst(0)
|   |       |           dmax(108) dmin(100) ltype(decimal(13,2)) $varnr(0)
|   |       |_<OPERATOR>: str('*') props: $primitive(map_*_schr_val_schr_val) type(schr) isconst(1)
|   |       | |           dmax(100) dmin(100) ltype(decimal(13,2)) $varnr(0) constval(100)
|   |       | |_<CALL>: str('schr') props: $primitive(map_schr_str_val) type(schr) dmax(1) dmin(1)
|   |       | | |       isconst(1) ltype(sint) $varnr(0) constval(1)
|   |       | | \_<STRING>: str('1') props: dmax(1) dmin(1) type(str) ltype(varchar(1))
|   |       | |             isconst(1) constval(1)
|   |       | \_<CALL>: str('schr') props: dmax(100) dmin(100) type(schr) ltype(slng) isconst(1)
|   |       |   |       $primitive(map_schr_str_val) $varnr(0) constval(100)
|   |       |   \_<STRING>: str('100') props: dmax(100) dmin(100) isconst(1) type(str) ltype(str) $varnr(0)
|   |       \_<IDENT>: _lineitem._l_tax props: isconst(0) dmax(8) dmin(0) type(schr) ltype(decimal(2,2))
|   \_<ASSIGNMENT>: flag props: isconst(0) dmax(82) dmin(65) type(sint) ltype(char(1))
|     \_<IDENT>: _lineitem._l_returnflag props: isconst(0) dmax(82) dmin(65) type(sint) ltype(char(1))
|_<LIST>:
| \_<IDENT>: flag props: isconst(0) dmax(82) dmin(65) type(sint) ltype(char(1))
|_<LIST>: COLDEF(Aggr)
  \_<ASSIGNMENT>: sumprice props: type(s128) isconst(0) dmax(4.98499e+21) dmin(0) ltype(decimal(22,4))
    \_<CALL>: str('sum128') props: type(s128) isconst(0) dmax(4.98499e+21) dmin(0) ltype(decimal(22,4))
      \_<IDENT>: resprice props: type(sint) isconst(0) dmax(1.13345e+09) dmin(9.0091e+06)
                 ltype(decimal(21,4)) $varnr(0)

Figure 5.1: An xnode tree with attached properties, pretty-printed from the system log.

The pass through the tree finds all xnodes with attached snodes and finds for them the resulting compiled JIT primitive, which is attached to them just as normal primitives were attached to xnodes previously. Since they work in the same way, they can be put into execution with almost no changes in the Builder and later phases.

5.2.4. Limitations

The Rewriter was chosen as the phase in which to implement JIT opportunity detection and code generation because operating on the xnode tree with new pluggable rules was the least intrusive for the system and the easiest to do. However, the xnode tree in the Rewriter phase corresponds only to the parsed VectorWise algebra, and as such does not contain all the information about query execution.

In Figure 5.1 we can see that the Rewriter semantic analysis rules attach a lot of meta information to the xnodes: value ranges, types and the primitives that will be used. However, the Aggr node has only three children: its input (the Project), a list of "group by" keys and a list of aggregate functions.


[SNODE_MAPPRIM]: res_type: sint, res_var: r0, err_var: e1, const: 0 value: ptr(7f921c03a770)
\_[SNODE_MAPOP]: res_type: sint, res_var: r2, err_var: e3, const: 0 value: prim(_*_sint_sint)
  |_[SNODE_INPUT]: res_type: sint, res_var: r4, err_var: e5, const: 0 value: int(0)
  \_[SNODE_MAPOP]: res_type: sint, res_var: r6, err_var: e7, const: 0 value: prim(_sint_schr)
    \_[SNODE_MAPOP]: res_type: schr, res_var: r8, err_var: e9, const: 0 value: prim(_+_schr_schr)
      |_[SNODE_CONST]: res_type: schr, res_var: r10, err_var: e11, const: 1 value: str(100)
      \_[SNODE_INPUT]: res_type: schr, res_var: r12, err_var: e13, const: 0 value: int(1)

Figure 5.2: An snode tree used to generate code for the *(l_extendedprice, +(1, l_tax)) expression. The snodes contain information such as the return type and variable names. Depending on their type, they contain additional specialized information: a pointer to the correct primitive template, the index of an input parameter, or, for constants, their constant value. The pointer in the root node's value is a shortcut to an array of all input parameters.

From that we cannot see e.g. what form of Aggr will be used. There are no xnodes corresponding to operations such as computing hashes, hash table lookups and inserts, or shifting the bits of the key values to create a direct aggregation index. Direct aggregation is an optimized aggregation for small key domains, where the keys are used directly as indexes, without a hash table. The aggregate functions in the tree are not connected in any way to the operations calculating the group positions to which a given value should be added. All these operations are created and resolved in the Builder phase rather than in the Rewriter. It is the logic of the Aggregation operator allocator, at operator build time, that decides whether to use hash or direct aggregation, based on the information about the ranges of the key attributes. It creates the expressions that calculate the group positions and connects them as inputs to the aggregate functions.

Therefore, as already mentioned in Section 2.2.2, the final execution tree cannot be derived from the xnode tree in the Rewriter phase. JIT detection and generation in the Rewriter phase is sufficient for compiling together arithmetic operations in projections or selection conditions, but for more complicated operations the logic would need to be moved to the Builder phase. There, each operator allocator would be responsible for generating the operations specific to that operator. Unarguably, that implementation would be much more intrusive to the system, as pieces of JIT generation and compilation would need to be scattered across each of the affected operators.

5.3. Source code generation

In this section we describe in more detail how the source code of JIT primitives is generated from the snode trees.

5.3.1. snode trees

The tree of snodes is a syntax tree for the to-be-generated code. An snode is created not only for operations, but also for constants, function inputs and function root nodes. Each snode has a type and, depending on the type, additional information necessary to generate code for the operation it represents:

• All nodes have a unique name for the local variable in which to output the result of their operation. It can be the local variable in which to write the result of some computation, or the variable in which to save the currently processed value from an input vector.

• All nodes hold the result type of the operation they represent.


• All nodes have a unique name for the local variable in which to store an error flag. It is a substitute for the err variable of the original primitives. Since there are now multiple operations in one primitive, we need multiple error flags, one for each of them. Not all operations use the flag, but it is impossible to tell which do from the preprocessor macros when extracting the source templates.

• Nodes representing primitive operations hold a reference to the proper source template.

• Nodes representing input parameters hold the name of the parameter they have to readfrom.

• Nodes representing constants hold the constant value they contain.

An snode tree for the example expression *(l_extendedprice, +(1, l_tax)) is shown in Figure 5.2.
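Based on the fields listed above and on Figure 5.2, an snode could be declared roughly as follows; this is a hypothetical layout, and the actual VectorWise definition differs in detail.

typedef struct SourceTemplate {
    const char *name;     /* e.g. "_*_sint_sint" */
    const char *src;      /* in-loop operation with $N placeholders */
    const char *prologue; /* code before the loop, may be empty */
    const char *ret;      /* epilogue: error checking after the loop */
} SourceTemplate;

typedef enum {
    SNODE_MAPPRIM,  /* root of a JIT map primitive */
    SNODE_MAPOP,    /* a primitive operation */
    SNODE_INPUT,    /* an input parameter */
    SNODE_CONST     /* a constant */
} SnodeType;

typedef struct Snode {
    SnodeType type;
    const char *res_type;          /* result type, e.g. "sint" */
    char res_var[8];               /* unique result variable, e.g. "r2" */
    char err_var[8];               /* unique error flag, e.g. "e3" */
    int is_const;
    union {
        const SourceTemplate *tpl; /* SNODE_MAPOP: source template */
        int param_idx;             /* SNODE_INPUT: input parameter index */
        const char *const_val;     /* SNODE_CONST: the constant value */
    } u;
    int nchildren;
    struct Snode **children;
} Snode;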

5.3.2. JitContext

As mentioned, one compilation is performed per query. This compilation is managed by a JitContext object.

JitContext provides compilation-global utilities such as unique name generation. It holds all the roots of the snode trees for which code will be generated. In a later phase, it is the JitContext that interfaces with the compilation infrastructure and orders the compilation of the string buffer it manages. In the end, it holds the compiled module's object code and is queried for pointers to the compiled functions for the xnodes.

JitContext also contains the string buffer into which the source code is generated. The string buffer is a simple utility which starts out with the source preamble and its various definitions. It provides the basic printing and placeholder-filling functionality necessary for source generation.

5.3.3. Code generation

The code is generated by walking over the tree of snodes and writing to the string buffer in the JitContext. A usual syntax tree corresponds directly to the code, and a compiler pretty-printer needs to walk it only once during generation. The snode tree, however, has to be walked multiple times in order to create the JIT primitive function:

• The first walk organizes the input parameters. When many operations are combined, some might have the same input parameters (e.g. selection conditions on the same attribute, or complex arithmetic calculations involving some attribute in more than one place). We detect and combine such aliased inputs.

• The second walk of the tree generates the operation prologues before the loop.

• Two walks of the tree are done to generate the code of the loop bodies: once with a selection vector, and once without.

• A fifth walk of the tree generates the operation epilogues (error checking) after the loops.

The walks are post-order, so code for the children operations is generated before their parent's.
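A condensed sketch of one of the two loop-body walks, reusing the hypothetical Snode layout and fill_template() helper from the earlier sketches and, for simplicity, emitting to a FILE* instead of the JitContext string buffer:

#include <stdio.h>

void gen_loop_body(Snode *n, FILE *out, int with_sel) {
    const char *idx = with_sel ? "sel[i]" : "i";
    int c;
    for (c = 0; c < n->nchildren; c++)           /* post-order: children first */
        gen_loop_body(n->children[c], out, with_sel);
    switch (n->type) {
    case SNODE_INPUT:   /* read the current value from the input vector */
        fprintf(out, "%s = v%d[%s];\n", n->res_var, n->u.param_idx, idx);
        break;
    case SNODE_CONST:   /* constants were assigned before the loop */
        break;
    case SNODE_MAPOP: { /* instantiate the operation's source template */
        const char *names[4] = { n->err_var, n->res_var,
                                 n->children[0]->res_var,
                                 n->nchildren > 1 ? n->children[1]->res_var : "" };
        char stmt[512];
        fill_template(stmt, n->u.tpl->src, names);
        fprintf(out, "%s\n", stmt);
        break;
    }
    case SNODE_MAPPRIM: /* root: store into the result vector */
        fprintf(out, "res[%s] = %s;\n", idx, n->children[0]->res_var);
        break;
    }
}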


Algorithm 16 The JIT primitive for the *(l_extendedprice, +(1, l_tax)) expression, generated from the snode tree in Figure 5.2. Code redundancies are dealt with by the C compiler optimizer. Indentation and comments are added for clarity.

uidx r0(uidx n, sint *restrict res, void **params, uidx *restrict sel, void *e) {
    uidx i;
    sint r4; ulng e5 = 0;   /* temporary result variable and error flag */
    sint *const v14 = (sint *) params[0];  /* params[0] is a vector of l_extendedprice */
    schr r10; ulng e11 = 0; /* temporary result variable and error flag */
    r10 = 100;              /* the constant 100 */
    schr r12; ulng e13 = 0; /* temporary result variable and error flag */
    schr *const v15 = (schr *) params[1];  /* params[1] is a vector of l_tax */
    schr r8; ulng e9 = 0;   /* temporary result variable and error flag */
    sint r6; ulng e7 = 0;   /* temporary result variable and error flag */
    sint r2; ulng e3 = 0;   /* temporary result variable and error flag */
    if (sel) { /* variant with a selection vector */
        for (i = 0; i < n; i++) {
            r4 = v14[sel[i]]; r12 = v15[sel[i]]; /* read from input vectors */
            r8 = ((r10) + (r12)); /* 100 + l_tax */
            r6 = ((sint) (r8));   /* cast to 4-byte integer */
            r2 = ((r4) * (r6));   /* multiply by l_extendedprice */
            res[sel[i]] = r2;     /* save to result vector */
        }
    } else { /* variant without a selection vector */
        for (i = 0; i < n; i++) { /* same operations as above */
            r4 = v14[i]; r12 = v15[i]; /* read from input vectors */
            r8 = ((r10) + (r12)); /* 100 + l_tax */
            r6 = ((sint) (r8));   /* cast to 4-byte integer */
            r2 = ((r4) * (r6));   /* multiply by l_extendedprice */
            res[i] = r2;          /* save to result vector */
        }
    }
    /* Error checking (even though these operations don't produce errors) */
    if (e9) return x100_err(e, VWLOG_FACILITY, X100_ERR_GENERIC);
    if (e7) return x100_err(e, VWLOG_FACILITY, X100_ERR_GENERIC);
    if (e3) return x100_err(e, VWLOG_FACILITY, X100_ERR_GENERIC);
    return n; /* return number of processed tuples */
}

Algorithm 16 shows the JIT primitive for the example expression *(l_extendedprice, +(1, l_tax)), generated from the snode tree in Figure 5.2. It can be seen that some redundant code is generated. For example, the error variables are always checked, even if the operation did not use them. The same, however, happens in the normal primitives, and there is no way to detect it when extracting the source template. Another example is that a lot of redundant local variables are created. A value from an array is first read into a local variable before being processed. Even simple operations like casts are very explicit, assigned into a new local variable. This is however because they also had to be explicit in the query execution tree, where whole vectors were cast from one type into another.

The C compiler is able to detect and eliminate all these redundancies in simple optimization passes. Undoubtedly, this introduces more compilation overhead, both because of the bigger volume of C code that needs to be parsed and because of the extra work for the C compiler optimizer. However, we have already shown in Chapter 4 that the compilation overhead is minor, and such an implementation lets us simplify code generation significantly.


5.4. Primitives storage and maintenance

After being compiled, the JIT primitives are stored in a hash table for future reuse. Later, if an snode tree is built for the same combination of operations, the old primitive can be looked up and reused instead of generating source and compiling it again.

To be able to look up a JIT primitive in a hash table, it needs a string name that unambiguously describes it. We create such a name by walking the snode tree and outputting the names of the individual nodes pre-order (left-deep). The names of individual nodes come from the names of the source templates for operation nodes, the types of input parameters for input nodes, the constant values and types for constants etc. This leads to an unambiguous naming scheme that may not be beautiful, but serves the purpose of naming the primitives so that they can be compared and looked up in a hash table.

The JIT primitive from Algorithm 16 has the name:
jit_map_*_sint_sint(sint_col0, sint_schr(_+_schr_schr(schr_100, schr_col1)))
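A sketch of the pre-order naming walk, again using the hypothetical Snode layout sketched in Section 5.3.1; the exact punctuation of the real scheme may differ slightly:

#include <stdio.h>

void append_name(Snode *n, FILE *out) {
    int c;
    switch (n->type) {
    case SNODE_MAPPRIM:  /* root: prefix, then the expression */
        fprintf(out, "jit_map");
        append_name(n->children[0], out);
        return;
    case SNODE_INPUT:    /* type and input parameter index */
        fprintf(out, "%s_col%d", n->res_type, n->u.param_idx);
        return;
    case SNODE_CONST:    /* type and constant value */
        fprintf(out, "%s_%s", n->res_type, n->u.const_val);
        return;
    case SNODE_MAPOP:    /* template name, then the children */
        fprintf(out, "%s(", n->u.tpl->name);
        for (c = 0; c < n->nchildren; c++) {
            if (c) fputc(',', out);
            append_name(n->children[c], out);
        }
        fputc(')', out);
        return;
    }
}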

There is a system table that can be queried to see all JIT primitives that are present in the system. It can be done using:

SysScan(’jit_primitives’, [’name’, ’return_type’, ’ptr’]);

There is also a system call in VectorWise that flushes all existing buffered JIT primitives from the system. It unloads all dynamically loaded libraries and frees the object code memory. It can be done using:

Sys(clear_jit);

5.5. Summary

In this chapter we have shown how an implementation of the JIT compilation infrastructure and of code generation for compiled execution of parts of a query can be done in a way that does not require big modifications to an existing system. With relatively little code we are able to generate code covering a large class of operations. Still, in the scope of this thesis we were not able to implement all our ideas, and further extensions remain future work.


Chapter 6

JIT primitives evaluation

In this chapter we present benchmarks of how the implemented JIT compilation affects the query execution performance of a working VectorWise system, and compare the results to the theoretical benefits and hard-coded benchmarks discussed in Chapter 3.

We evaluated the implementation of JIT compiled primitives on the TPC-H benchmark. We used a computer with a 2.67 GHz Nehalem processor and 12 GB of RAM, and TPC-H scale factor 10, which corresponds to a 10 GB database. We ran the queries without parallelism and measured "hot" runs, with the database loaded into the buffer pool, so that data is read from RAM instead of disk and the queries are more CPU intensive and less I/O intensive.

Due to the limitations of the Rewriter-based implementation of the JIT compilation logic, the change in system performance on the TPC-H benchmark as a whole is limited. We are able to detect and compile together:

• Arithmetic calculations that were evaluated using map primitives. These are mostly seen in the Project operator, but are also possible e.g. in complicated selection conditions.

• Selection conditions generating a selection vector for the Select operator. Code can be generated for both eager and lazy evaluation of the conditions. Multiple Select operators can be combined into one, with just one big compiled condition-evaluation function.

This has an impact on a few of the TPC-H queries: Query 1, Query 6, Query 12 and Query 19. An overview of the effect of JIT compilation on the whole queries is shown in Table 6.1. In the following sections we discuss each of them in detail.

TPC-H Scale Factor 10
Query   Without JIT   With JIT
1       1.93 s        1.79 s  (-7.2%)
6       0.15 s        0.21 s  (+40%)
12      0.50 s        0.43 s  (-14%)
19      1.36 s        1.05 s  (-22.8%)

Table 6.1: Overview of TPC-H Queries 1, 6, 12 and 19 at scale factor 10, with and without JIT compilation.



TPC-H Scale Factor 10, Query 1

Projection   Tot. cycles (vect.)   Tot. cycles (JIT)
Total:       1,167,793,680         781,728,432  (-33.1%)

Table 6.2: Compiled projection in TPC-H Query 1.

TPC-H Scale Factor 10, Query 6

Selection                          Sel.     Tot. cycles (vect.)   Tot. cycles (JIT)
l_shipdate < date('1995-01-01')    87.43%   32,880,796
l_quantity < 24                    45.97%   58,692,544
l_discount >= 0.05                 54.53%   28,413,788
l_shipdate >= date('1994-01-01')   85.85%   12,343,920
l_discount < 0.07                  49.94%   15,621,912
Total:                             9.40%    147,952,960           296,671,052  (+100.5%)

Table 6.3: Compiled selections in TPC-H Query 6.

6.1. TPC-H Query 1

This query was already presented in Algorithm 13 on page 68. In this query we are able to compile together the computation of arithmetic expressions in a projection. These expressions were the inspiration for the Project benchmark in Section 3.2 of Chapter 3.

Since the expression l_extendedprice * (1 - l_discount) is reused by two aggregate computations, we compile this expression separately and then compile the expression res * (1 + l_tax). It would be more effective if primitives returning more than one result vector were implemented, because then the two expressions could be compiled together.
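
For illustration, a minimal sketch of what such a fused map primitive could look like is given below; the function and type names are assumptions, and decimals are treated as plain scaled integers for brevity.

    typedef unsigned int uidx;
    typedef long long sint;

    /* Hypothetical fused primitive for l_extendedprice * (1 - l_discount).
     * The vectorized interpreter would instead run two map primitives,
     * materializing the intermediate (1 - l_discount) vector. */
    void jit_map_discountprice(uidx n, sint *res,
                               const sint *l_extendedprice,
                               const sint *l_discount, sint one)
    {
        uidx i;
        for (i = 0; i < n; i++)
            res[i] = l_extendedprice[i] * (one - l_discount[i]);
    }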

Overall, JIT compilation speeds up the Project operator in Query 1 by 33%, as shown in Table 6.2. However, the most time-consuming part of this query is the aggregation, so this does not translate into an equally big speedup of the whole query, which is just over 7%.

6.2. TPC-H Query 6

This query is presented in Algorithm 17. It demonstrates well the results of the Conjunctive Selection benchmark from Section 3.3. The query consists of a single aggregate over the lineitem table filtered with four simple selection conditions, each implemented as an integer comparison. We focus on those selections, which together take 57.75% of query execution time; another 30% is spent in the Scan, and only a few percent in Aggregation.

In vanilla VectorWise, the selections are implemented by four separate Select operators stacked on top of each other. The JIT compilation rules combine them all into one operator evaluating all the conditions with a single primitive. The generated JIT function evaluates the condition lazily with nested "if" branches, as in the benchmarks from Section 3.3. The result is that it actually makes the selection twice as slow, due to branch mispredictions! This is shown in Table 6.3. All of the filter conditions have medium selectivities and the conditions themselves are computationally cheap, so this result confirms the previous hard-coded benchmarks. Since the selection corresponds to a substantial part of query execution time, this results in a 40% slowdown of the whole query.
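
The following sketch shows the shape of such a lazily evaluated compiled conjunction; the names are hypothetical and dates and decimals are assumed to arrive as pre-converted integers. Each tuple takes a data-dependent branch path, which is what causes the mispredictions.

    typedef unsigned int uidx;
    typedef int sint;

    /* Hypothetical compiled selection for Query 6. With pass rates between
     * 46% and 87% per condition (Table 6.3), the nested branches are hard
     * to predict. Returns the number of selected tuples. */
    uidx jit_select_q6(uidx n, uidx *res,
                       const sint *l_shipdate, const sint *l_quantity,
                       const sint *l_discount,
                       sint date_lo, sint date_hi, sint disc_lo, sint disc_hi)
    {
        uidx i, found = 0;
        for (i = 0; i < n; i++) {
            if (l_shipdate[i] < date_hi)
                if (l_quantity[i] < 24)
                    if (l_discount[i] >= disc_lo)
                        if (l_shipdate[i] >= date_lo)
                            if (l_discount[i] < disc_hi)
                                res[found++] = i;  /* selection vector entry */
        }
        return found;
    }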

This is one of the faster-running queries, so this does not have a big impact on the total TPC-H benchmark result. However, it signals the need for a smarter criterion for evaluating the feasibility of JIT compilation, instead of naively compiling as much as possible.



TPC-H Scale Factor 10, Query 12

Selection                               Sel.     Tot. cycles (vect.)   Tot. cycles (JIT)
l_shipmode in ('MAIL', 'SHIP')          28.58%   500,450,608
  !=_str_col_str_val()                           180,805,384
  !=_str_col_str_val()                           180,971,184
  ||_lazy_adaptive_uidx_col_uidx_col()           132,944,656
l_receiptdate < date '1995-01-01'       85.22%   18,080,888
l_receiptdate >= date '1994-01-01'      82.70%   13,991,616
l_shipdate < l_commitdate               48.72%   18,023,004
l_commitdate >= l_receiptdate           24.54%   9,831,664
Total:                                  2.41%    560,377,780           361,024,540  (-35.6%)

Table 6.4: Compiled selections in TPC-H Query 12.

TPC-H Scale Factor 10, Query 19

Selection                     Sel.      Tot. cycles (vect.)   Tot. cycles (JIT)
l_shipmode in ...             14.28%    1,653,803,608
l_shipinstruct == ...         25.00%    197,060,672
l_quantity between 1 and 30   59.96%    252,331,068
Total:                        2.41%     2,103,195,348         1,308,809,160  (-37.8%)

p_brand in ...                11.99%    104,048,228
p_container in ...            30.02%    53,450,032
p_size < 15                   29.98%    3,657,288
p_size >= 1                   100.00%   305,680
Total:                        2.41%     161,461,228           232,642,600  (+43.4%)

Selection above join:         8.17%     110,356,512           12,336,860  (-88.8%)

Total:                                  2,375,013,088         1,553,788,620  (-34.6%)

Table 6.5: Compiled selections in TPC-H Query 19.


6.3. TPC-H Query 12

This query is presented in Algorithm 17. Here we also have a stack of selection conditions; however, the first Select operator, executing l_shipmode in ('MAIL', 'SHIP'), dominates the whole execution time. Because of that, in Table 6.4 we further split the execution profile of this operator into individual primitives: the two string comparison primitives, and the primitive that controls lazy execution and negates the selection vector of a disjunction (as discussed for the vectorized lazy disjunction implementation in Section 3.4).

The first Select operator is a disjunction of two string comparisons: l_shipmode = 'MAIL' or l_shipmode = 'SHIP'. It processes 13 million tuples from the lineitem table. There are two reasons why such a disjunction can work better compiled together as a JIT primitive:



Algorithm 17 Queries 6 and 12 of the TPC-H benchmark.

# Query 6
select
    sum(l_extendedprice * l_discount) as revenue
from lineitem
where l_shipdate >= date '1994-01-01'
  and l_shipdate < date '1994-01-01' + interval '1' year
  and l_discount between 0.06 - 0.01 and 0.06 + 0.01
  and l_quantity < 24;

# Query 12
select
    l_shipmode,
    sum(case when o_orderpriority = '1-URGENT'
              or o_orderpriority = '2-HIGH'
             then 1
             else 0
        end) as high_line_count,
    sum(case when o_orderpriority <> '1-URGENT'
             and o_orderpriority <> '2-HIGH'
             then 1
             else 0
        end) as low_line_count
from orders, lineitem
where o_orderkey = l_orderkey and l_shipmode in ('MAIL', 'SHIP')
  and l_commitdate < l_receiptdate and l_shipdate < l_commitdate
  and l_receiptdate >= date '1994-01-01'
  and l_receiptdate < date '1994-01-01' + interval '1' year
group by l_shipmode
order by l_shipmode;

• The evaluation of a string comparison condition is expensive in itself. It involves looping over strings and breaking out of the loop after finding a mismatching character. This causes branch mispredictions and suboptimal processor utilization even in the pure vectorized implementation, so compiling the comparisons together should not bring further damage.

• The implementation of lazy disjunctions requires executing them as negated conjunctions. Negating a condition involves creating an inverted selection vector, which is a branch-heavy and expensive operation, as was discussed in Section 3.4. In this particular example we can see from the profile in Table 6.4 that the part of the operator controlling lazy execution and inverting the selection vectors takes 26% of the operator's execution time. A compiled implementation can evaluate the disjunction directly and save all this overhead, as in the sketch below.
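
A minimal sketch of such a directly compiled disjunction follows; the function name and the use of plain C strings are assumptions made for this example.

    #include <string.h>

    typedef unsigned int uidx;

    /* Hypothetical compiled selection for l_shipmode in ('MAIL', 'SHIP').
     * The disjunction is evaluated directly: no negated conjunctions and no
     * intermediate inverted selection vectors are needed. */
    uidx jit_select_shipmode(uidx n, uidx *res, const char **l_shipmode)
    {
        uidx i, found = 0;
        for (i = 0; i < n; i++) {
            if (strcmp(l_shipmode[i], "MAIL") == 0 ||
                strcmp(l_shipmode[i], "SHIP") == 0)
                res[found++] = i;
        }
        return found;
    }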

A single JIT primitive that compiles together all the selection conditions of this query is as fast as the string comparison primitives alone were in the vanilla vectorized execution. Overall, it speeds up the Select operators by over 35% and the whole query by 14%.

6.4. TPC-H Query 19

This query was already presented in Algorithm 13 on page 68. It is another query with a lot of disjunctive selection conditions on strings: p_brand, p_container and l_shipmode are each compared to a number of string constants.

The selection conditions relate to attributes from two joined tables: lineitem and part. In query execution they have to be separated into three selections: a pre-selection on each side of the join, each relating to one table, and a post-selection after the join with conditions on attributes from both tables. In Table 6.5 we show each of these operators separately.

The pre-selection on the lineitem side of the join dominates the overall execution time. It processes almost 6 million tuples from that table. The result here is very similar to that from Query 12: the combined JIT primitive is able to speed up the disjunctive string comparison selection on l_shipmode, which speeds the whole operator up by 38%.

However, even though the situation on the other side of the join seems similar, we get a slowdown instead of a speedup with JIT compilation there. From the part table, 2 million tuples are processed. Two disjunctive string selections, on p_brand and p_container, dominate execution time on this side of the join. In this case, however, compiling them together into one JIT primitive turned out to be 40% slower than executing them in vanilla VectorWise.

The Select operator above the join contains the whole huge condition on both tables. In vanilla VectorWise it uses 45 separate primitives to evaluate it. JIT compilation is able to combine them into one gigantic JIT primitive. We can see that it speeds up the execution of this operator by 88%! It does not have an impact on the whole query, however, as only several thousand tuples are processed above the join.

The three Select operators together are faster by almost 35%, even though the pre-selection on the part side is significantly slower. The whole query executes almost 23% faster with JIT compilation.

6.5. Summary

The evaluation confirmed our previous experiments and suspicions that naive compilation of everything we can generate code for is not always the best answer. Cost models will have to be carefully developed to make the VectorWise system able to automatically decide which strategy is best. The impact of JIT compilation on the full TPC-H benchmark was limited, since the implementation is still basic. More possibilities are promised by the experiments from Chapter 3. Still, some of the results we obtained are already promising.



Chapter 7

Conclusions and future work

7.1. Contributions

In this master thesis we conducted an analysis of the potential benefits of combining the techniques of vectorization and compilation in a database query execution engine. We identified situations where such a combination shows substantial improvements. Existing database engines, whether scientific or commercial, have concentrated on only one of these two methods. Our main message is that one does not need to choose between compilation and vectorization. The study in Chapter 3 described concrete examples and use cases that directly show this.

We applied our findings to the VectorWise DBMS by adding a JIT query compilation infrastructure to the system. This infrastructure forms a solid base for JIT compilation experiments and evaluation. In the scope of the thesis we were able to implement detection and code generation for simple operations such as arithmetic calculations in projections and condition evaluation in selections.

7.2. Answers to research questions

Our research questions were:

• Since a vectorized system already removes most of the interpretation overhead, would it still be beneficial to introduce JIT query compilation to it?

• Can the benefits of vectorization still bring an improvement to a compiled tuple-at-a-time system?

Our study answered both of these questions with "yes". We observed speedups of 7% to 23% on TPC-H Queries 1, 12 and 19 by compiling only limited, small fragments of them. When counting only the fragments that we actually compiled, the speedups were as high as 30% to 35%. While the difference is not as stunning as when introducing vectorization or compilation to a plain tuple-at-a-time system, this was achieved with a very simple implementation. Chapter 3 suggests more opportunities, e.g. in hash operations. We anticipate that speedups of as much as 50% can still be gained in some situations.

The thesis has also shown some very interesting results concerning differences between compilers. These differences are easy to overlook when examining the performance of entire queries: the study in Chapter 4 has shown that the overall TPC-H benchmark result of the VectorWise DBMS compiled by clang, gcc or icc differs by no more than 5%. However, the focus of JIT compilation is on small fragments of operators. When examining those in detail, as we have done in Chapter 3, we can see that the compilation results of different compilers can vary significantly. While the choice of compiler for the VectorWise system as a whole does not matter that much, additional benefits can be gained by choosing the best compiler for the given situation at a fine granularity.

7.3. Future work

Work on JIT compilation in the VectorWise system will continue. An important next step is to expand the capabilities of JIT opportunity detection. As mentioned in Section 5.2.4, this will require an alternative implementation based on the Builder phase of query processing instead of the Rewriter. Adopting this will be the first major upcoming goal. Foreseeing this, the current implementation separates this phase from code generation, so only the snode tree construction will have to be adjusted.

With a new JIT compilation detection implementation in place, we can work on providing code generation for other ideas from Chapter 3:

• Compiled hash tables. This will affect not only HashJoin, as described in Section 3.6, but also other operators that use hashing, such as Aggregation. Introducing compiled hash tables would require some intrusive changes, as right now the hash table logic is split between the hash table implementation itself and the operators using it.

• For Aggregation, we can combine the computation of multiple aggregates. This way, we would ease the data dependency that arises when the same group is accessed in subsequent iterations. That would require implementing Expressions that can return multiple result vectors (see the sketch after this list).

• Operators like sorting could benefit from compilation. Right now there is no block-oriented way of doing sorting, and query execution has to fall back to interpreted tuple-at-a-time processing.
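
For illustration, a minimal sketch of such a fused multi-aggregate primitive is given below; the names are hypothetical and the grouping logic is greatly simplified.

    typedef unsigned int uidx;
    typedef long long sint;

    /* Hypothetical fused update of sum(a) and sum(b): one pass updates both
     * aggregates, so between two updates of the same group location there is
     * independent work, easing the dependency chain compared to two separate
     * single-aggregate passes. */
    void jit_aggr_sum2(uidx n, const uidx *groupids,
                       const sint *a, const sint *b,
                       sint *sum_a, sint *sum_b)
    {
        uidx i;
        for (i = 0; i < n; i++) {
            uidx g = groupids[i];
            sum_a[g] += a[i];
            sum_b[g] += b[i];
        }
    }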

Another field for further work is establishing a good heuristic or cost model for deciding when compilation is beneficial. It was shown in Chapter 3 that compilation is not always the best answer, and it was seen in the live-system examples in Chapter 6 that the naive greedy algorithm of always compiling the largest possible component can bring damage instead of benefit. A cost model should take into account not only the expected performance of a compiled function versus that of pure vectorized primitives, but also other factors like compilation overhead (significant for fast-running queries) or reuse potential (it may be better to compile a few smaller fragments that could repeat in future queries than one monolithic function).
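
As a purely hypothetical illustration of the kind of decision function such a cost model might expose (none of these names, inputs or formulas exist in VectorWise), consider:

    /* Inputs a hypothetical cost model could weigh before deciding to JIT. */
    typedef struct {
        double vect_cycles_per_tuple;  /* estimate for vectorized primitives */
        double jit_cycles_per_tuple;   /* estimate for the fused function    */
        double compile_overhead;       /* one-time JIT compilation cost      */
        double expected_tuples;        /* tuples the fragment will process   */
        double reuse_probability;      /* chance the primitive is reused     */
    } jit_cost_inputs;

    int should_jit_compile(const jit_cost_inputs *c)
    {
        double saved = (c->vect_cycles_per_tuple - c->jit_cycles_per_tuple)
                       * c->expected_tuples;
        /* A primitive likely to be reused by future queries effectively
         * pays only part of its compilation cost now. */
        double overhead = c->compile_overhead * (1.0 - c->reuse_probability);
        return saved > overhead;
    }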

We could also think about a buffering policy for JIT primitives. Right now we are only able to flush all the JIT object code at once. We could count how often the primitives are reused and evict them selectively. This may, however, prove to be a hard task, because primitives come from dynamically loaded libraries, one compilation may produce multiple primitives in one library, and there is no reliable way to estimate the memory usage of individual primitives.


Acknowledgments

This master thesis was made possible thanks to Marcin Zukowski and the VectorWise company, where I was offered an internship.

I would like to thank Marcin Zukowski and Peter Boncz, who were my supervisors and supported me daily. With my stubborn nature, listening to their remarks, comments and ideas was one of the rare moments in my life when I really felt I was truly and directly learning a lot. They pushed my ambitions and encouraged me to believe in my skills not only in programming, but also in writing and giving talks. Thanks to them, this thesis resulted in a conference publication and I could enjoy a week-long trip to Athens.

I would like to thank my second reader, Henri Bal, for contributing to the final shape of this thesis.

Michał S. offered me a lot of help when I was starting the project and getting to know the VectorWise system. His patience and answers to all my questions were really of great help. Thanks to him for that, and for some great time after work.

I have been working in a very friendly environment and I have learned a lot here. I would like to thank Gosia, Giel, Hui, Sandor, Willem and Irish for contributing to the great time I spent in the VectorWise office.

Michał S., my fellow co-intern, shared the office with me the whole time and was a sounding board for all my everyday problems and ideas. Thank you for coping with that.

I would like to thank Kamil, who did his master thesis at VectorWise last year and continued working here, and from whom I learned about this company.

It is Ala without whom I would not even be in the Netherlands. She motivated me to apply for the Short Track Masters programme and come to Vrije Universiteit Amsterdam for my final year of studies. It was her idea, and it turned out well for both of us. We spent this year almost 24 hours a day together: living in one room, attending the same university courses, both working on our theses at VectorWise. I thank her for being with me all this time.

Last but not least, I would like to thank my family for their everlasting support.



Bibliography

[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.

[Beg08] Nate Begeman. Building an Efficient JIT. In LLVM Developers' Meeting, 2008.

[Bon02] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. Ph.D. thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[BZN05] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In Proc. CIDR, Asilomar, CA, USA, 2005.

[CK85] G. Copeland and S. Khoshafian. A Decomposition Storage Model. In Proc. SIGMOD, Austin, TX, USA, 1985.

[cla] Clang. http://clang.llvm.org.

[D. 81] D. Chamberlin et al. A history and evaluation of System R. Commun. ACM, 24(10):632–646, 1981.

[Fog11] Agner Fog. The microarchitecture of Intel, AMD and VIA CPUs. An optimization guide for assembly programmers and compiler makers, 2011.

[GBC98] Goetz Graefe, Ross Bunker, and Shaun Cooper. Hash Joins and Hash Teams in Microsoft SQL Server. In VLDB, pages 86–97, 1998.

[Gra94] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. IEEE TKDE, 6(1):120–135, 1994.

[Gre99] Rick Greer. Daytona and the fourth-generation language Cymbal. In SIGMOD, 1999.

[Gro10] Tobias Grosser. Polly - Polyhedral optimizations in LLVM. In LLVM Developers' Meeting, 2010.

[GZA+11] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly - Polyhedral optimization in LLVM. In IMPACT, 2011.

[KN10] Alfons Kemper and Thomas Neumann. HyPer: Hybrid OLTP and OLAP High Performance Database System. Technical report, Technical Univ. Munich, TUM-I1010, May 2010.

[Kri10] Konstantinos Krikellas. The case for holistic query evaluation. 2010.

[KVC10] Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. Generating code for holistic query evaluation. In ICDE, pages 613–624, 2010.

[LLV] LLVM. The LLVM Compiler Infrastructure. http://llvm.org.

[Nau10] Axel Naumann. Creating cling, an interactive interpreter interface for clang. In LLVM Developers' Meeting, 2010.

[Par10] ParAccel Inc. Whitepaper. The ParAccel Analytical Database: A Technical Overview, Feb 2010. http://www.paraccel.com.

[PMAJ01] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. In Proc. ICDE, Heidelberg, Germany, 2001.

[Ros02] Kenneth A. Ross. Conjunctive selection conditions in main memory. In Proc. PODS, Washington, DC, USA, 2002.

[RPML06] Jun Rao, Hamid Pirahesh, C. Mohan, and Guy M. Lohman. Compiled Query Execution Engine using JVM. In Proc. ICDE, Atlanta, GA, USA, 2006.

[sse] Intel SSE4 Programming Reference. http://software.intel.com/file/18187/.

[SZB11] Juliusz Sompolski, Marcin Zukowski, and Peter A. Boncz. Vectorization vs. Compilation in query execution. In Proceedings of the Seventh International Workshop on Data Management on New Hardware, DaMoN 2011, pages 33–40, 2011.

[TJYD09] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. Tools for High Performance Computing, pages 157–173, 2009.

[Tom] Tom. http://tom.loria.fr/.

[Tra02] Transaction Processing Performance Council. TPC Benchmark H version 2.1.0, 2002.

[Vec] VectorWise. http://www.vectorwise.com.

[ZBNH05] M. Zukowski, P. A. Boncz, N. J. Nes, and S. Heman. MonetDB/X100 - A DBMS In The CPU Cache. IEEE Data Engineering Bulletin, 28(2):17–22, June 2005.

[ZNB08] Marcin Zukowski, Niels Nes, and Peter Boncz. DSM vs. NSM: CPU Performance Tradeoffs in Block-Oriented Query Processing. 2008.

[Zuk09] Marcin Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. Ph.D. thesis, Universiteit van Amsterdam, Sep 2009.
