4 best paper - multi threaded architectures and the sort benchmark

8/4/2019 4 Best Paper - Multi Threaded Architectures and the Sort Benchmark

1/24

Multithreaded Architectures and

The Sort Benchmark Phil GarciaHank Korth

Dept. of Computer Science and EngineeringLehigh University


2/24

About our Sort Benchmark

Based on the benchmark proposed in Ameasure of transaction processing power (Anonymous et al).

Sorts 100 byte records containing 10 bytekeys.

Modified to run in main-memory. Modified to sort 250MB of records (instead

of 100MB).


3/24

Results 2-way SMT can result in speedups of over

60%. SMT can tolerate cache misses.

Gains increase as the processor/memory gapwidens.

The order of threads actions significantlyaffects speed.

Merge sort can be more efficient than

selection trees.


4/24

Test Platform

Xeon dual 3.0GHz. 2-way SMT 512KB L2 cache 1MB L3 cache.

2GB of RAM 533MHz Bus

Pentium 4 2.8GHz

2-way SMT 2GB of RAM 1MB L2 cache 800 MHz Bus

Debian GNU/LinuxKernel 2.6.6gcc v3.3

Optimized for test

architecture.


5/24

Algorithm Design

Based on Alphasort (Nyberg et al.)

For Each SetExtract (key, pointer) pairsQuicksort on keys

Mergesort 2 sets at a time until doneFinal merge materializes output.


6/24

Single Threaded Breakdown

0

2

4

6

8

10

12

14

9 8

1 9 6

3 9 2

7 8 4

1 , 5 6 8

3 ,1 6 4

6 , 3 5 6

1 2

,7 8 2

2 5 ,7 1 8

5 1

,7 3 0

1 0 4

, 0 4 8

2 0 9 ,2 8 6

B i l l i o

n s

Set Size (Bytes)

TotalMergesort

Quicksort

Xeon single processor


7/24

Mergesort vs.Selection Tree

Selection tree requires large memory footprint. Results in many cache misses per traversal.

Mergesort has a smaller overall runtime (for larger

sorts) Mergesort is limited by memory bandwidth

because hardware prefetching hides memory

latency.


8/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108

dim data0car data1cat data2

bat data3for data4dog data5aaa data6egg data7fog data8hog data9

key data012

3456789

Set 1 Set 2 Unsorted Input


9/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789



10/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789

aaa data6



11/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789

aaa data6



12/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789

aaa data6bat data3



13/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789

aaa data6bat data3



14/24

The Final Merge

aaacatdogegg

6257

batcardimfog

3108



key data012

3456789

aaa data6bat data3car data1



15/24

Final Merge Comparison Takes a significant

portion of runtime. Cache thrashing

Propose not

dereferencingpointers. Could be useful if

the sort was just oneoperation within aquery pipeline.0

2

4

6

8

10

12

14

9 8

1 6 8

2 9 4

5 0 4

8 9 6

1 ,

5 6 8

2 ,

7 4 4

4 ,

8 0 2

8 ,

4 0 0

1 4

, 7 0 0

2 5

, 7 1 8

4 4

, 9 8 2

7 8

, 6 8 0

1 3 7

, 6 0 6

2 4 0

, 6 8 8

B i l l i o n s

Set Size

With final mergeWith final merge

Without final mergeWithout final merge


16/24

Multithreading

Partitioned data among threads based on anestimated median value (Lyer et al.)

Multiple threads sort simultaneously. Ran for both SMT and SMP for two threads.


17/24

0

2

4

68

10

12

14

9 8 2 5 2

6 7 2

1 , 8 0

6 4 , 8 0

2

1 2 , 7 8 2

3 4 , 0 0 6

9 0 , 4 8 2

2 4 0 ,

6 8 8

B i l l i o n s

Set Size (Bytes)

Multithreading (continued)

With final merge Without final merge

SingleSingle

SMTSMT

SingleSingle

SMTSMTSMPSMP

SMPSMP

Total runtimes on Xeon Processor

0

2

4

68

10

12

14

9 8 2 5 2

6 7 2

1 , 8 0

6 4 , 8 0

2

1 2 , 7 8 2

3 4 , 0 0 6

9 0 , 4 8 2

2 4 0 ,

6 8 8

B i l l i o n s

Set Size (Bytes)


18/24

Final Merge (detailed) For the final

merge itself wesee extremelylarge speedup.

SMT speedupsimilar to thatachieved bySMP.1

1.2

1.4

1.6

1.8

2

2.2

2 4 9,

9 9 8

4 8 8,

2 7 8

9 5 3,

6 6 6

1, 8 6

2, 6 3

0

Set Size (Bytes)

SMP speedupSMP speedup

SMT speedupSMT speedup


19/24

Memory/Processor Gap

As the memory/processor gap widens so does thespeedups obtainable through SMT.

Ran on both Xeon and P4

Xeon showed overall speedup of 47% P4 showed overall speedup of 33%

Mostly due to Pentium 4s faster memory and

slower clock Enabled a single thread to better utilize processor

resources.


20/24

Semaphores For Speed,Not Correctness

Without

Semaphores

With

Semaphores

P1 P2 P1 P2

Memoryaccess

Quicksorting


21/24

Semaphores (continued)

Memory bandwidth does not scale with thenumber of processors using it.

Therefore whenever possible: Coordinate threads to share resources. Simple synchronization methods (such as

semaphores) work well.

Large performance gains possible onmultiprocessor.


22/24

Further Improving Sort

Sort key-prefixes rather than the full key. Enable more threads to speedup the sort

2 processors each running 2 threads.

Optimize memcpy . Using multithreaded sort within a query

pipeline.


23/24

Future Work

Impact of future processors: Chip Multiprocessors (CMP) Massively Parallel (Sun Niagara/Rock)

Database pipelines: How best to utilize processor resources.

Impact on vertically partitioned databases(Manegold, Boncz et al.)


24/24

Contact Information

Philip [email protected]

Henry F. [email protected]

Dept. of Computer Science and EngineeringPackard Lab

19 Memorial Dr. WestBethlehem, PA 18015

4 best paper - multi threaded architectures and the sort benchmark

Documents