A comparison of CC-SAS, MP and SHMEM on SGI Origin2000

Three Programming Models

CC-SAS
– Linear address space for shared memory

MP
– Communicate with other processes explicitly via message passing interface

SHMEM
– Via get and put primitives
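To make the contrast concrete, here is a toy, single-process sketch of the three styles. It is not real CC-SAS, MPI or SHMEM code: the `Node` class and every function name are hypothetical, and the "processes" are plain Python objects. It only illustrates who takes part in a transfer under each model.

```python
# Toy, single-process model of the three programming styles.
# All names here are hypothetical; no real MPI or SHMEM library is used.


class Node:
    """One simulated process with private memory and a message inbox."""

    def __init__(self, rank, size):
        self.rank = rank
        self.local = [rank * 10 + i for i in range(size)]  # private data
        self.inbox = []  # used only by the message-passing style


def cc_sas_read(shared, index):
    # CC-SAS: any process loads directly from one linear shared address space.
    return shared[index]


def mp_send(nodes, src, dst, start, count):
    # MP: the sender copies data into a message addressed to the receiver...
    message = list(nodes[src].local[start:start + count])
    nodes[dst].inbox.append((src, message))


def mp_recv(nodes, dst):
    # ...and the receiver must explicitly take part in the transfer.
    return nodes[dst].inbox.pop(0)


def shmem_get(nodes, src, start, count):
    # SHMEM: one-sided get; the owner of the data does not participate.
    return list(nodes[src].local[start:start + count])


def shmem_put(nodes, dst, start, values):
    # SHMEM: one-sided put into the remote node's memory.
    nodes[dst].local[start:start + len(values)] = values


nodes = [Node(rank, 4) for rank in range(2)]
shared = [x for node in nodes for x in node.local]  # flat view for CC-SAS

print(cc_sas_read(shared, 5))
mp_send(nodes, 1, 0, 0, 2)
print(mp_recv(nodes, 0))
print(shmem_get(nodes, 1, 2, 2))
shmem_put(nodes, 0, 0, [99, 98])
print(nodes[0].local)
```

In this sketch only the MP path needs a matching call on both sides; the CC-SAS and SHMEM paths are driven entirely by the process that wants the data.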

Platforms

Tightly-coupled multiprocessors
– SGI Origin2000: a cache-coherent distributed shared memory machine

Less tightly-coupled clusters
– A cluster of workstations connected by Ethernet

Purpose

Compare the three programming models on Origin2000, a modern 64-processor hardware cache-coherent machine
– We focus on scientific applications that access data regularly or predictably.

Questions to be answered

Can parallel algorithms be structured in the same way for good performance in all three models?

If there are substantial differences in performance under the three models, where are the key bottlenecks?

Do we need to change the data structures or algorithms substantially to solve those bottlenecks?

Applications and Algorithms

FFT
– All-to-all communication (regular)

Ocean
– Nearest-neighbor communication

Radix
– All-to-all communication (irregular)

LU
– One-to-many communication
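The four communication patterns can be sketched as functions that list which processes a given rank exchanges data with. This is a simplified illustration, not the applications' actual partitioning: the 2D process grid, the use of `key % nprocs` as the Radix destination, and all function names are assumptions made for the example.

```python
def all_to_all_partners(rank, nprocs):
    # FFT-style regular all-to-all: every other process, equal-sized blocks.
    return [p for p in range(nprocs) if p != rank]


def nearest_neighbor_partners(rank, rows, cols):
    # Ocean-style: up/down/left/right neighbors on a non-periodic 2D grid.
    r, c = divmod(rank, cols)
    partners = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            partners.append(nr * cols + nc)
    return partners


def radix_send_counts(keys, nprocs):
    # Radix-style irregular all-to-all: how many keys go to each destination
    # depends on the key values (simplified here to key % nprocs).
    counts = [0] * nprocs
    for key in keys:
        counts[key % nprocs] += 1
    return counts


def one_to_many_partners(owner, nprocs):
    # LU-style: one owner sends the same block to many receivers.
    return {owner: [p for p in range(nprocs) if p != owner]}


print(all_to_all_partners(1, 4))
print(nearest_neighbor_partners(5, 4, 4))
print(radix_send_counts([0, 4, 8, 1, 2, 12], 4))
print(one_to_many_partners(0, 4))
```

The Radix counts differ per destination and change with the input, which is what makes that all-to-all exchange irregular compared with the fixed-size FFT exchange.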

Performance Result

Question: Why is MP much worse than CC-SAS and SHMEM?

Analysis: Execution time = BUSY + LMEM + RMEM + SYNC, where

BUSY: CPU computation time
LMEM: CPU stall time for local cache misses
RMEM: CPU stall time for sending/receiving remote data
SYNC: CPU time spent at synchronization events
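As a small worked example of this decomposition, the sketch below sums the four components and reports each one's share of the total. The function name and the sample timings are hypothetical and are not measurements from the slides.

```python
def breakdown(busy, lmem, rmem, sync):
    """Return total execution time and each component's share in percent."""
    total = busy + lmem + rmem + sync
    parts = {"BUSY": busy, "LMEM": lmem, "RMEM": rmem, "SYNC": sync}
    shares = {name: 100.0 * t / total for name, t in parts.items()}
    return total, shares


# Hypothetical per-processor timings in seconds, for illustration only.
total, shares = breakdown(busy=4.0, lmem=1.0, rmem=3.0, sync=2.0)
print(total)
for name, pct in shares.items():
    print(name, pct)
```

Comparing such shares across the three models is one way to see which component (for example RMEM or SYNC) accounts for a slowdown.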

Where does the time go in MP?

Improving MP performance

Remove extra data copy
– Allocate all data involved in communication in the shared address space

Reduce SYNC time
– Use lock-free queue management for communication instead
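One common form of lock-free queue management is a single-producer, single-consumer ring buffer, sketched below. The slides do not say which scheme was used, so this is an assumed stand-in rather than the authors' implementation. The class name `SpscRing` is hypothetical, and the sketch relies on CPython executing these simple assignments in program order; a real implementation would need atomic operations and memory barriers.

```python
import threading
import time


class SpscRing:
    """Single-producer/single-consumer ring buffer with no locks.

    Only the producer writes `tail`; only the consumer writes `head`.
    One slot is always left empty to tell "full" apart from "empty".
    """

    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)
        self.head = 0  # next slot to read (consumer-owned)
        self.tail = 0  # next slot to write (producer-owned)

    def try_push(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False  # queue is full
        self.buf[self.tail] = item
        self.tail = nxt  # publish only after the slot is filled
        return True

    def try_pop(self):
        if self.head == self.tail:
            return False, None  # queue is empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return True, item


def run_demo(n_items=200, capacity=8):
    ring = SpscRing(capacity)
    received = []

    def producer():
        for i in range(n_items):
            while not ring.try_push(i):
                time.sleep(0)  # full: yield and retry

    def consumer():
        while len(received) < n_items:
            ok, item = ring.try_pop()
            if ok:
                received.append(item)
            else:
                time.sleep(0)  # empty: yield and retry

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


received = run_demo()
print(received[:5], len(received))
```

Because each index has exactly one writer, neither side ever waits on a lock held by the other, which is the property that would cut time spent in SYNC.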

Speedups under Improved MP

Why does CC-SAS perform best?

Extra packing/unpacking operations in MP and SHMEM

Extra packet queue management in MP

…
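The packing/unpacking overhead can be illustrated with a matrix transpose, the communication step in FFT. The sketch below counts element copies in a direct, load/store style versus a pack, transfer, unpack style. Both functions are hypothetical toy models, and the copy counts are properties of this sketch, not figures from the slides.

```python
def transpose_direct(matrix):
    # Shared-address-space style: each destination element is loaded
    # straight from its source location, one copy per element.
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    copies = 0
    for i in range(n):
        for j in range(n):
            out[i][j] = matrix[j][i]
            copies += 1
    return out, copies


def transpose_packed(matrix):
    # Message style: strided column data is packed into a contiguous
    # buffer, copied as a message, then unpacked at the destination.
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    copies = 0
    for i in range(n):  # destination row i needs source column i
        packed = []
        for j in range(n):
            packed.append(matrix[j][i])  # pack
            copies += 1
        message = list(packed)  # transfer copy through the message layer
        copies += n
        for j in range(n):
            out[i][j] = message[j]  # unpack
            copies += 1
    return out, copies


m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
direct, direct_copies = transpose_direct(m)
packed, packed_copies = transpose_packed(m)
print(direct == packed, direct_copies, packed_copies)
```

Both versions produce the same result; the packed version simply touches each element more often, which is the kind of extra work the slide attributes to MP and SHMEM.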

Speedups for Ocean

Speedups for Radix

Speedups for LU

Conclusions

Good algorithm structures are portable among programming models.

MP is much worse than CC-SAS and SHMEM on a hardware-coherent machine. However, we can achieve similar performance if the extra data copy and queue synchronization are handled well.

Something about programmability

Future work

How about applications that have irregular, unpredictable and naturally fine-grained data access and communication patterns?

How about software-based coherent machines (i.e. clusters)?
