4 best paper - multi threaded architectures and the sort benchmark

Upload: snehal-patel

Post on 07-Apr-2018


  • 8/4/2019 4 Best Paper - Multi Threaded Architectures and the Sort Benchmark

    1/24

    Multithreaded Architectures and the Sort Benchmark

    Phil Garcia, Hank Korth

    Dept. of Computer Science and Engineering, Lehigh University

    2/24

    About our Sort Benchmark

    Based on the benchmark proposed in "A Measure of Transaction Processing Power" (Anonymous et al.).

    Sorts 100-byte records containing 10-byte keys.

    Modified to run in main memory.

    Modified to sort 250MB of records (instead of 100MB).
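
    As a rough illustration of the record layout described above, the sketch below builds 100-byte records with 10-byte keys in memory. The helper name `make_records`, the random printable keys, and the zero-filled payload are assumptions made for illustration; they are not details of the original benchmark generator.

```python
import random

RECORD_SIZE = 100  # bytes per record (from the benchmark description)
KEY_SIZE = 10      # bytes per key (from the benchmark description)


def make_records(count, seed=0):
    """Return `count` records of 100 bytes, each starting with a 10-byte key.

    Hypothetical helper: keys are random printable bytes and the payload
    is zero-filled filler, not real benchmark data.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(count):
        key = bytes(rng.randrange(32, 127) for _ in range(KEY_SIZE))
        payload = bytes(RECORD_SIZE - KEY_SIZE)
        records.append(key + payload)
    return records


# 250 MB of 100-byte records is 2.5 million records (taking 1 MB = 10**6 bytes).
records_in_250mb = 250 * 10**6 // RECORD_SIZE
```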

    3/24

    Results

    2-way SMT can result in speedups of over 60%.

    SMT can tolerate cache misses.

    Gains increase as the processor/memory gap widens.

    The order of threads' actions significantly affects speed.

    Merge sort can be more efficient than selection trees.

    4/24

    Test Platform

    Xeon dual 3.0GHz: 2-way SMT, 512KB L2 cache, 1MB L3 cache, 2GB of RAM, 533MHz bus.

    Pentium 4 2.8GHz: 2-way SMT, 2GB of RAM, 1MB L2 cache, 800MHz bus.

    Debian GNU/Linux, kernel 2.6.6, gcc v3.3, optimized for the test architecture.

    5/24

    Algorithm Design

    Based on AlphaSort (Nyberg et al.)

    For each set:
        Extract (key, pointer) pairs
        Quicksort on keys

    Mergesort 2 sets at a time until done.

    Final merge materializes output.
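
    The steps on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a "pointer" is simply a list index, Python's built-in sort stands in for quicksort, and the output is materialized in a separate step after the last merge rather than during it.

```python
def extract_pairs(records, start, stop, key_size=10):
    """Build (key, pointer) pairs; the pointer here is just a list index."""
    return [(records[i][:key_size], i) for i in range(start, stop)]


def merge_two(left, right):
    """Merge two sorted lists of (key, pointer) pairs into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def alphasort_like(records, set_size):
    """Sort records via per-set key sorts followed by pairwise merges."""
    # Phase 1: per set, extract (key, pointer) pairs and sort them.
    # The built-in sort is a stand-in for quicksort in this sketch.
    runs = [sorted(extract_pairs(records, s, min(s + set_size, len(records))))
            for s in range(0, len(records), set_size)]
    # Phase 2: merge runs two at a time until a single run remains.
    while len(runs) > 1:
        runs = [merge_two(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    # Phase 3: follow each pointer to materialize the sorted output.
    return [records[ptr] for _, ptr in (runs[0] if runs else [])]
```

    Sorting small (key, pointer) pairs instead of whole records keeps the data moved during the sort phases small; full records are copied only once, at the end.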

    6/24

    Single Threaded Breakdown

    [Figure: single-threaded runtime breakdown on a single Xeon processor. Series: Total, Mergesort, Quicksort. x-axis: Set Size (Bytes), from 98 to 209,286. y-axis: 0 to 14, labeled "Billions".]

    7/24

    Mergesort vs. Selection Tree

    Selection tree requires a large memory footprint.
        Results in many cache misses per traversal.

    Mergesort has a smaller overall runtime (for larger sorts).

    Mergesort is limited by memory bandwidth because hardware prefetching hides memory latency.
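
    To make the two merge strategies concrete, the sketch below merges the same sorted runs both ways. It is only an illustration of structure: `heapq.merge` uses a binary heap, which is a swapped-in stand-in for a selection (tournament) tree, the function names are hypothetical, and nothing here measures cache behavior or runtime.

```python
import heapq


def merge_pairwise(runs):
    """Merge sorted runs two at a time; each pass scans its inputs sequentially."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                # With exactly two inputs, heapq.merge acts as a two-way merge.
                next_runs.append(list(heapq.merge(runs[k], runs[k + 1])))
            else:
                next_runs.append(runs[k])
        runs = next_runs
    return runs[0] if runs else []


def merge_kway(runs):
    """Merge all runs in one pass, with a heap standing in for a selection tree."""
    return list(heapq.merge(*runs))
```

    Both functions return the same ordering; they differ in access pattern. The pairwise version makes several sequential passes, while the single-pass version keeps one cursor per run and hops between runs on every output element.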

    8/24

    The Final Merge

    Set 1 (sorted key, pointer):
        aaa  6
        cat  2
        dog  5
        egg  7

    Set 2 (sorted key, pointer):
        bat  3
        car  1
        dim  0
        fog  8

    Unsorted Input (index, key, data):
        0  dim  data0
        1  car  data1
        2  cat  data2
        3  bat  data3
        4  for  data4
        5  dog  data5
        6  aaa  data6
        7  egg  data7
        8  fog  data8
        9  hog  data9

    Output, built up one record at a time across slides 8/24 to 14/24:
        aaa  data6
        bat  data3
        car  data1
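
    The final merge shown on these slides can be traced with a short sketch. The data is transcribed from the slide; the function name `final_merge` and the tuple representation of records are assumptions made for illustration.

```python
def final_merge(set1, set2, unsorted_input):
    """Merge two sorted (key, pointer) sets and copy each record to the output."""
    output, i, j = [], 0, 0
    while i < len(set1) or j < len(set2):
        take_first = j >= len(set2) or (i < len(set1) and set1[i][0] <= set2[j][0])
        if take_first:
            _, ptr = set1[i]
            i += 1
        else:
            _, ptr = set2[j]
            j += 1
        # Dereferencing the pointer jumps to an arbitrary position in the
        # unsorted input, so consecutive outputs touch scattered memory.
        output.append(unsorted_input[ptr])
    return output


# Data transcribed from the slide; "for" and "hog" appear in neither set shown.
unsorted_input = [("dim", "data0"), ("car", "data1"), ("cat", "data2"),
                  ("bat", "data3"), ("for", "data4"), ("dog", "data5"),
                  ("aaa", "data6"), ("egg", "data7"), ("fog", "data8"),
                  ("hog", "data9")]
set1 = [("aaa", 6), ("cat", 2), ("dog", 5), ("egg", 7)]
set2 = [("bat", 3), ("car", 1), ("dim", 0), ("fog", 8)]

merged = final_merge(set1, set2, unsorted_input)
# The first three outputs match the slide frames: aaa data6, bat data3, car data1.
```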

    15/24

    Final Merge Comparison

    Takes a significant portion of runtime.
        Cache thrashing

    Propose not dereferencing pointers.
        Could be useful if the sort was just one operation within a query pipeline.

    [Figure: runtime with and without the final merge. Series: With final merge, Without final merge. x-axis: Set Size, from 98 to 240,688. y-axis: 0 to 14, labeled "Billions".]
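
    One way to picture the proposal of not dereferencing pointers is a sort operator that hands sorted pointers to the next operator in a pipeline, which then fetches only the records it needs. The sketch below is an assumption-laden illustration: both function names are hypothetical, pointers are list indices, and the downstream "first n" operator is an invented example of a pipeline consumer.

```python
from itertools import islice


def sorted_pointers(records, key_size=10):
    """Yield pointers (list indices) in key order without copying any record."""
    pairs = sorted((rec[:key_size], i) for i, rec in enumerate(records))
    for _, ptr in pairs:
        yield ptr


def first_n_records(records, n):
    """Hypothetical downstream operator: dereference only the first n pointers."""
    return [records[ptr] for ptr in islice(sorted_pointers(records), n)]
```

    In this arrangement the scattered record accesses of the final merge are paid only for records the consumer actually touches.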

    16/24

    Multithreading

    Partitioned data among threads based on an estimated median value (Lyer et al.)

    Multiple threads sort simultaneously.

    Ran for both SMT and SMP for two threads.
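
    A minimal sketch of median-based partitioning for two threads follows. The sampling estimator is an assumption (the slides do not describe how the median is estimated), the function names are hypothetical, and in CPython the global interpreter lock prevents the two sorts from truly running in parallel, so this shows the structure rather than the speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def estimate_median(keys, sample_size=101, seed=0):
    """Estimate the median from a random sample (an assumed estimator)."""
    if not keys:
        raise ValueError("cannot estimate the median of an empty input")
    rng = random.Random(seed)
    sample = sorted(rng.choice(keys) for _ in range(min(sample_size, len(keys))))
    return sample[len(sample) // 2]


def two_thread_sort(keys):
    """Partition around an estimated median, then sort both parts concurrently."""
    if len(keys) < 2:
        return list(keys)
    pivot = estimate_median(keys)
    low = [k for k in keys if k < pivot]
    high = [k for k in keys if k >= pivot]
    with ThreadPoolExecutor(max_workers=2) as pool:
        sorted_low, sorted_high = pool.map(sorted, [low, high])
    # Every key in `low` is smaller than every key in `high`, so
    # concatenation yields the sorted result with no merge between threads.
    return sorted_low + sorted_high
```

    A good median estimate matters because it balances the work: a skewed pivot leaves one thread with most of the data.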

    17/24

    Multithreading (continued)

    [Figure: total runtimes on Xeon processor, in two panels: "With final merge" and "Without final merge". Series in each panel: Single, SMT, SMP. x-axis: Set Size (Bytes), from 98 to 240,688. y-axis: 0 to 14, labeled "Billions".]

    18/24

    Final Merge (detailed)

    For the final merge itself we see extremely large speedup.

    SMT speedup similar to that achieved by SMP.

    [Figure: speedup of the final merge. Series: SMP speedup, SMT speedup. x-axis: Set Size (Bytes), from 249,998 to 1,862,630. y-axis: speedup, 1 to 2.2.]

    19/24

    Memory/Processor Gap

    As the memory/processor gap widens, so do the speedups obtainable through SMT.

    Ran on both Xeon and P4:
        Xeon showed overall speedup of 47%
        P4 showed overall speedup of 33%

    Mostly due to Pentium 4's faster memory and slower clock.
        Enabled a single thread to better utilize processor resources.

    20/24

    Semaphores For Speed, Not Correctness

    [Figure: timelines for processors P1 and P2, shown without semaphores and with semaphores. Legend: Memory access, Quicksorting.]

    21/24

    Semaphores (continued)

    Memory bandwidth does not scale with the number of processors using it.

    Therefore, whenever possible:
        Coordinate threads to share resources.
        Simple synchronization methods (such as semaphores) work well.

    Large performance gains possible on multiprocessor.
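
    The coordination idea can be sketched with a semaphore that lets only one thread at a time into its memory-bound phase, so one thread's memory traffic overlaps with the other's sorting. Everything below is an illustrative assumption: the function names, the split into a key-extraction phase and a sort phase, and the alternating assignment of sets to threads. In CPython the global interpreter lock limits real parallelism, so this shows the synchronization pattern only.

```python
import threading

# At most one thread may be in the memory-bound phase at any time.
memory_slot = threading.Semaphore(1)


def sort_sets(records, set_bounds, results, slot):
    """Sort several sets; only the key-extraction phase holds the semaphore."""
    runs = []
    for start, stop in set_bounds:
        with memory_slot:
            # Memory-bound phase: stream through records to pull out keys.
            pairs = [(records[i][:10], i) for i in range(start, stop)]
        # Compute-bound phase: runs while the other thread may use memory.
        pairs.sort()
        runs.append(pairs)
    results[slot] = runs


def run_two_threads(records, set_size):
    """Split the sets between two threads, alternating set by set."""
    bounds = [(s, min(s + set_size, len(records)))
              for s in range(0, len(records), set_size)]
    results = [None, None]
    threads = [threading.Thread(target=sort_sets,
                                args=(records, bounds[t::2], results, t))
               for t in range(2)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

    The semaphore is not needed for correctness here, since the threads work on disjoint sets; it only staggers the phases so that both threads do not compete for memory bandwidth at once.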

    22/24

    Further Improving Sort

    Sort key-prefixes rather than the full key.

    Enable more threads to speed up the sort.
        2 processors each running 2 threads.

    Optimize memcpy.

    Using multithreaded sort within a query pipeline.
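
    As an illustration of the key-prefix idea, the sketch below sorts on a short integer prefix and consults the full key only to break ties. The function name, the 4-byte prefix length, and the tie-breaking strategy are assumptions; the sketch also assumes every record is at least `prefix_size` bytes long, which holds for the benchmark's 100-byte records.

```python
def prefix_sort(records, key_size=10, prefix_size=4):
    """Sort by a short integer key prefix, using the full key only for ties."""
    # Pack the first prefix_size key bytes into one integer per record.
    # Big-endian packing preserves byte-wise order for equal-length prefixes.
    pairs = [(int.from_bytes(rec[:prefix_size], "big"), i)
             for i, rec in enumerate(records)]
    pairs.sort(key=lambda p: p[0])  # integer comparisons only
    # Resolve each run of equal prefixes by comparing the full keys.
    order, start = [], 0
    while start < len(pairs):
        stop = start
        while stop < len(pairs) and pairs[stop][0] == pairs[start][0]:
            stop += 1
        tied = [ptr for _, ptr in pairs[start:stop]]
        tied.sort(key=lambda ptr: records[ptr][:key_size])
        order.extend(tied)
        start = stop
    return [records[ptr] for ptr in order]
```

    When prefixes are mostly distinct, nearly all comparisons touch only the packed integers and the full keys in the records are rarely read.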

    23/24

    Future Work

    Impact of future processors:
        Chip Multiprocessors (CMP)
        Massively Parallel (Sun Niagara/Rock)

    Database pipelines:
        How best to utilize processor resources.

    Impact on vertically partitioned databases (Manegold, Boncz et al.)

    24/24

    Contact Information

    Philip Garcia, [email protected]

    Henry F. Korth, [email protected]

    Dept. of Computer Science and Engineering, Packard Lab

    19 Memorial Dr. West, Bethlehem, PA 18015