Universal Scalable Matrix Multiplication

Upload: soumitry-j-ray

Post on 05-Apr-2018


  • 7/31/2019 Universal Scalable Matrix Multiplication


    $r$, $c$; $M \times K$, $K \times N$; $a_r$, $a_c$, $b_r$, $b_c$; $a_r \times K$, $K \times b_c$


    $p_b$; $a_r \times p_b$, $p_b \times b_c$


    $a_r$, $b_c$; $k_{\mathrm{iter}}^{\mathrm{th}}$; $1 \le k_{\mathrm{iter}} \le K$


    $p_b = 1, 4, 16$

    $p_b = 1$

    [Figure: MPI (only). x-axis: #processors (0 to 70); y-axis: time per iteration (0 to 120); legend: 128, 256, 512, 1024, 2048, 4096.]


    [Figure: MPI+OpenMP (#threads 2). x-axis: #processors (0 to 70); y-axis: time per iteration (0 to 700); legend: 128, 256, 512, 1024, 2048, 4096.]

    [Figure: MPI+OpenMP (#threads 4). x-axis: #processors (0 to 70); y-axis: time per iteration (0 to 500); legend: 128, 256, 512, 1024, 2048, 4096.]

    [Figure: MPI+OpenMP (#threads 6). x-axis: #processors (0 to 70); y-axis: time per iteration (0 to 450); legend: 128, 256, 512, 1024, 2048, 4096.]


    [Figure: MPI+CUDA (block size 2). x-axis: #processors (0 to 80); y-axis: time per iteration (0 to 60); legend: 128, 256, 512, 1024, 2048, 4096.]

    [Figure: MPI+CUDA (block size 4). x-axis: #processors (0 to 80); y-axis: time per iteration (0 to 35); legend: 128, 256, 512, 1024, 2048, 4096.]

    [Figure: MPI+CUDA (block size 16). x-axis: #processors (0 to 80); y-axis: time per iteration (0 to 14); legend: 128, 256, 512, 1024, 2048, 4096.]
