8/4/2019 4 Best Paper - Multi Threaded Architectures and the Sort Benchmark
Multithreaded Architectures and the Sort Benchmark
Phil Garcia, Hank Korth
Dept. of Computer Science and Engineering, Lehigh University
About our Sort Benchmark
Based on the benchmark proposed in "A measure of transaction processing power" (Anonymous et al.).
Sorts 100-byte records containing 10-byte keys.
Modified to run in main memory.
Modified to sort 250 MB of records (instead of 100 MB).
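The record layout above can be sketched as follows. This is a minimal illustration, not the authors' generator: placing the key in the first 10 bytes, drawing keys at random, and leaving the payload as zero bytes are all assumptions, and `make_records`, `key_at`, and `is_sorted` are hypothetical helper names.

```python
import os

RECORD_SIZE = 100  # bytes per record, as stated on the slide
KEY_SIZE = 10      # bytes per key, as stated on the slide


def make_records(count):
    """Return a bytearray holding `count` fixed-size records back to back."""
    buf = bytearray(count * RECORD_SIZE)
    for i in range(count):
        start = i * RECORD_SIZE
        # Assumption: the key occupies the first KEY_SIZE bytes of the record.
        # Assumption: keys are random; the payload stays as zero bytes.
        buf[start:start + KEY_SIZE] = os.urandom(KEY_SIZE)
    return buf


def key_at(buf, i):
    """Return the key of record i as an immutable bytes object."""
    start = i * RECORD_SIZE
    return bytes(buf[start:start + KEY_SIZE])


def is_sorted(buf):
    """Check that the records appear in nondecreasing key order."""
    count = len(buf) // RECORD_SIZE
    return all(key_at(buf, i) <= key_at(buf, i + 1) for i in range(count - 1))
```

A checker such as `is_sorted` is handy for validating any of the sort variants discussed later against the same input buffer.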
Results
2-way SMT can result in speedups of over 60%.
SMT can tolerate cache misses.
Gains increase as the processor/memory gap widens.
The order of threads' actions significantly affects speed.
Merge sort can be more efficient than selection trees.
Test Platform
Xeon: dual 3.0 GHz, 2-way SMT, 512 KB L2 cache, 1 MB L3 cache, 2 GB of RAM, 533 MHz bus
Pentium 4: 2.8 GHz, 2-way SMT, 1 MB L2 cache, 2 GB of RAM, 800 MHz bus
Debian GNU/Linux, kernel 2.6.6
gcc v3.3
Optimized for test architecture.
Algorithm Design
Based on AlphaSort (Nyberg et al.).
For each set:
  Extract (key, pointer) pairs.
  Quicksort on keys.
Mergesort 2 sets at a time until done.
Final merge materializes output.
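The steps on this slide can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the key is assumed to sit in the first 10 bytes of each 100-byte record, the pointer is simply the record index, Python's built-in `list.sort()` stands in for the quicksort named on the slide, and `extract_pairs`, `merge_iter`, and `sort_records` are hypothetical names.

```python
RECORD_SIZE = 100
KEY_SIZE = 10


def extract_pairs(buf, first, last):
    """Build (key, pointer) pairs for records first..last-1."""
    return [(bytes(buf[i * RECORD_SIZE:i * RECORD_SIZE + KEY_SIZE]), i)
            for i in range(first, last)]


def merge_iter(a, b):
    """Yield the two-way merge of two sorted lists of (key, pointer) pairs."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            yield a[i]
            i += 1
        else:
            yield b[j]
            j += 1
    yield from a[i:]
    yield from b[j:]


def sort_records(buf, set_size):
    """Sort fixed-size records in `buf`, working in sets of `set_size` records."""
    count = len(buf) // RECORD_SIZE
    # Phase 1: per set, extract key-pointer pairs and sort them by key.
    # Assumption: list.sort() stands in for quicksort here.
    runs = []
    for first in range(0, count, set_size):
        pairs = extract_pairs(buf, first, min(first + set_size, count))
        pairs.sort()
        runs.append(pairs)
    # Phase 2: merge sets two at a time until at most two remain.
    while len(runs) > 2:
        merged = [list(merge_iter(runs[k], runs[k + 1]))
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd set out is carried to the next pass
        runs = merged
    while len(runs) < 2:
        runs.append([])
    # Phase 3: the final merge follows each pointer and writes the full record.
    out = bytearray()
    for _key, ptr in merge_iter(runs[0], runs[1]):
        out += buf[ptr * RECORD_SIZE:(ptr + 1) * RECORD_SIZE]
    return out
```

Only the final pass touches the 100-byte records; every earlier pass moves small key-pointer pairs, which is the point of the key-pointer design.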
Single Threaded Breakdown
[Figure: Xeon single processor. y-axis in billions (unit not recoverable from the extracted text), x-axis: set size (bytes). Series: Total, Mergesort, Quicksort.]
Mergesort vs. Selection Tree
Selection tree requires a large memory footprint.
  Results in many cache misses per traversal.
Mergesort has a smaller overall runtime (for larger sorts).
  Mergesort is limited by memory bandwidth because hardware prefetching hides memory latency.
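To make the "traversal" concrete, here is a minimal sketch of a k-way merge driven by a selection tree. It uses the winner-tree variant stored in an array, which is an assumption; the slides do not say which variant was measured, and `selection_tree_merge` is a hypothetical name. Items in the runs are assumed never to be `None`.

```python
def selection_tree_merge(runs):
    """Merge k sorted lists with a winner tree stored in an array."""
    k = len(runs)
    if k == 0:
        return []
    pos = [0] * k  # next unread position in each run

    def head(r):
        # Smallest unread item of run r, or None if r is missing or exhausted.
        if r is None or pos[r] >= len(runs[r]):
            return None
        return runs[r][pos[r]]

    def winner(x, y):
        # Index of the run whose head is smaller; None if both are exhausted.
        hx, hy = head(x), head(y)
        if hx is None:
            return y if hy is not None else None
        if hy is None or hx <= hy:
            return x
        return y

    # Nodes 1..k-1 are internal; nodes k..2k-1 are leaves, one per run.
    tree = [None] * (2 * k)
    for r in range(k):
        tree[k + r] = r
    for n in range(k - 1, 0, -1):
        tree[n] = winner(tree[2 * n], tree[2 * n + 1])

    out = []
    while head(tree[1]) is not None:
        r = tree[1]
        out.append(runs[r][pos[r]])
        pos[r] += 1
        n = (k + r) // 2
        while n >= 1:  # replay the matches on the path from leaf r to the root
            tree[n] = winner(tree[2 * n], tree[2 * n + 1])
            n //= 2
    return out
```

Every output item replays one leaf-to-root path, and each match on that path reads the heads of two runs that may lie far apart in memory. A two-way merge pass instead reads two sequential streams, which is the access pattern hardware prefetchers handle well.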
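The Final Merge
Set 1 (key, pointer): (aaa, 6), (cat, 2), (dog, 5), (egg, 7)
Set 2 (key, pointer): (bat, 3), (car, 1), (dim, 0), (fog, 8)
Unsorted input (index: key data): 0: dim data0; 1: car data1; 2: cat data2; 3: bat data3; 4: for data4; 5: dog data5; 6: aaa data6; 7: egg data7; 8: fog data8; 9: hog data9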
The Final Merge (continued)
Set 1, Set 2, and the unsorted input are unchanged.
Output so far: aaa data6
The Final Merge (continued)
Set 1, Set 2, and the unsorted input are unchanged.
Output so far: aaa data6, bat data3
The Final Merge (continued)
Set 1, Set 2, and the unsorted input are unchanged.
Output so far: aaa data6, bat data3, car data1
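The walk-through on these slides can be reproduced with a short sketch that uses the slide's own data. The function name `final_merge` and the tuple representation are assumptions made for illustration. Records 4 ("for") and 9 ("hog") appear in the unsorted input but in neither set on the slide, so they do not appear in the merged output here.

```python
# Unsorted input from the slide; a record's list position is its pointer.
records = [
    ("dim", "data0"), ("car", "data1"), ("cat", "data2"), ("bat", "data3"),
    ("for", "data4"), ("dog", "data5"), ("aaa", "data6"), ("egg", "data7"),
    ("fog", "data8"), ("hog", "data9"),
]
# Sorted (key, pointer) sets from the slide.
set1 = [("aaa", 6), ("cat", 2), ("dog", 5), ("egg", 7)]
set2 = [("bat", 3), ("car", 1), ("dim", 0), ("fog", 8)]


def final_merge(a, b, records):
    """Merge two sorted key-pointer sets and materialize the full records."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take from `a` when `b` is exhausted or a's current key is not larger.
        take_a = j >= len(b) or (i < len(a) and a[i][0] <= b[j][0])
        if take_a:
            ptr = a[i][1]
            i += 1
        else:
            ptr = b[j][1]
            j += 1
        # Dereference the pointer: a random access into the unsorted input.
        out.append(records[ptr])
    return out


merged = final_merge(set1, set2, records)
# First three outputs match the slide: aaa data6, bat data3, car data1.
```

The keys are read sequentially, but consecutive pointers (6, 3, 1, 2, 0, ...) jump around the unsorted input, so each materialized record is likely to touch a different cache line.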
Final Merge Comparison
Takes a significant portion of runtime.
  Cache thrashing.
Propose not dereferencing pointers.
  Could be useful if the sort was just one operation within a query pipeline.
[Figure: y-axis in billions (unit not recoverable from the extracted text), x-axis: set size. Series: with final merge, without final merge.]
Multithreading
Partitioned data among threads based on an estimated median value (Lyer et al.).
Multiple threads sort simultaneously.
Ran for both SMT and SMP for two threads.
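A minimal sketch of the two-thread structure described above. The sampling scheme in `estimate_median` is an assumption and is not the method of the cited work; both function names are hypothetical. In CPython the global interpreter lock generally prevents the two sorts from running truly in parallel, so this shows the partitioning logic only, not the speedup.

```python
import random
import threading


def estimate_median(keys, sample_size=101, seed=0):
    """Estimate the median from a random sample (assumed scheme)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return sample[len(sample) // 2]


def two_thread_sort(keys):
    """Split keys around an estimated median and sort each half in a thread."""
    if not keys:
        return []
    pivot = estimate_median(keys)
    low = [k for k in keys if k < pivot]
    high = [k for k in keys if k >= pivot]
    # Each thread sorts its own partition in place; no data is shared.
    threads = [threading.Thread(target=part.sort) for part in (low, high)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every key in `low` is below every key in `high`, so no merge is needed.
    return low + high
```

Because the partitions are disjoint key ranges, the threads need no coordination for correctness; the quality of the median estimate only affects how evenly the work is split.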
Multithreading (continued)
[Figure: total runtimes on Xeon processor. y-axis in billions (unit not recoverable from the extracted text), x-axis: set size (bytes). Two panels: with final merge and without final merge. Series in each panel: Single, SMT, SMP.]
Final Merge (detailed)
For the final merge itself we see extremely large speedup.
SMT speedup similar to that achieved by SMP.
[Figure: SMP speedup and SMT speedup (y-axis from 1 to 2.2) versus set size (bytes).]
Memory/Processor Gap
As the memory/processor gap widens, so do the speedups obtainable through SMT.
Ran on both Xeon and P4.
  Xeon showed overall speedup of 47%.
  P4 showed overall speedup of 33%.
Mostly due to the Pentium 4's faster memory and slower clock.
  Enabled a single thread to better utilize processor resources.
Semaphores for Speed, Not Correctness
[Figure: timelines for processors P1 and P2, without semaphores and with semaphores, showing the memory access and quicksorting phases of each.]
Semaphores (continued)
Memory bandwidth does not scale with the number of processors using it.
Therefore, whenever possible:
  Coordinate threads to share resources.
  Simple synchronization methods (such as semaphores) work well.
Large performance gains possible on multiprocessor.
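One way to read the coordination idea is sketched below: a semaphore admits one thread at a time into the memory-bound phase, so the other thread's compute-bound quicksort phase tends to overlap with it instead of both threads competing for the bus. The phase boundaries, the name `run_staggered`, and the peak counter are assumptions for illustration; `list.sort()` again stands in for quicksort, and the slides do not show the authors' code.

```python
import threading


def run_staggered(work, num_slots=1):
    """work[t] is the list of key sets assigned to thread t."""
    memory_phase = threading.Semaphore(num_slots)  # guards the memory-bound phase
    lock = threading.Lock()                        # protects the bookkeeping only
    stats = {"inside": 0, "peak": 0}
    results = [None] * len(work)

    def worker(t):
        sorted_sets = []
        for keys in work[t]:
            with memory_phase:
                with lock:
                    stats["inside"] += 1
                    stats["peak"] = max(stats["peak"], stats["inside"])
                # Memory-bound phase: stream through the input building pairs.
                pairs = [(key, ptr) for ptr, key in enumerate(keys)]
                with lock:
                    stats["inside"] -= 1
            # Compute-bound phase runs outside the semaphore.
            pairs.sort()
            sorted_sets.append(pairs)
        results[t] = sorted_sets

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(len(work))]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results, stats["peak"]
```

The output would be correct without the semaphore, since the threads share no data; it exists only to stagger the phases, which is the sense of "for speed, not correctness". The returned peak confirms that no more than `num_slots` threads were ever in the memory-bound phase together.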
Further Improving Sort
Sort key-prefixes rather than the full key.
Enable more threads to speed up the sort.
  2 processors each running 2 threads.
Optimize memcpy.
Using multithreaded sort within a query pipeline.
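The key-prefix idea can be sketched as follows: sort on a short integer prefix and fall back to the full key only to break ties. The 4-byte prefix length and the names `prefix_of` and `sort_by_prefix` are assumptions chosen for illustration, not values from the slides.

```python
PREFIX_LEN = 4  # bytes of key kept in each sort entry (assumed value)


def prefix_of(key):
    """Pack the first PREFIX_LEN key bytes into an int that orders like the bytes."""
    return int.from_bytes(key[:PREFIX_LEN].ljust(PREFIX_LEN, b"\0"), "big")


def sort_by_prefix(keys):
    """Return record indices in key order, reading full keys only on prefix ties."""
    entries = [(prefix_of(k), i) for i, k in enumerate(keys)]
    entries.sort(key=lambda e: e[0])  # cheap integer comparisons
    order = []
    start = 0
    while start < len(entries):
        # Find the run of entries that share the same prefix.
        end = start + 1
        while end < len(entries) and entries[end][0] == entries[start][0]:
            end += 1
        run = [i for _, i in entries[start:end]]
        if len(run) > 1:
            run.sort(key=lambda i: keys[i])  # tie-break on the full key
        order.extend(run)
        start = end
    return order
```

With well-distributed keys most comparisons are resolved by the prefix alone, so the sort entries stay small and the full keys are rarely fetched.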
Future Work
Impact of future processors:
  Chip Multiprocessors (CMP)
  Massively Parallel (Sun Niagara/Rock)
Database pipelines:
  How best to utilize processor resources.
Impact on vertically partitioned databases (Manegold, Boncz et al.)
Contact Information
Philip Garcia: [email protected]
Henry F. Korth: [email protected]
Dept. of Computer Science and Engineering, Packard Lab
19 Memorial Dr. West, Bethlehem, PA 18015