![Page 1: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/1.jpg)
A comparison of CC-SAS, MP and SHMEM on SGI Origin2000
![Page 2: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/2.jpg)
Three Programming Models
CC-SAS
– Linear address space for shared memory
MP
– Communicate with other processes explicitly via a message-passing interface
SHMEM
– Communicate via get and put primitives
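The difference between the three models can be sketched with a toy, single-process simulation. This is not real CC-SAS, MPI, or SHMEM code; the `Proc` class and every function name below are hypothetical and only illustrate who takes part in each kind of transfer.

```python
from collections import deque


class Proc:
    """Toy stand-in for one process (hypothetical, for illustration only)."""

    def __init__(self, data):
        self.data = list(data)  # memory owned by this process
        self.inbox = deque()    # used only by the message-passing style


def sas_read(owner, i):
    # CC-SAS style: an ordinary load from the shared address space.
    return owner.data[i]


def mp_send(dst, values):
    # MP style: the sender explicitly ships a copy of the data.
    dst.inbox.append(list(values))


def mp_recv(me):
    # MP style: the receiver must issue a matching receive.
    return me.inbox.popleft()


def shmem_put(dst, offset, values):
    # SHMEM style: one-sided write into the target's memory.
    dst.data[offset:offset + len(values)] = values


def shmem_get(src, offset, n):
    # SHMEM style: one-sided read from the source's memory.
    return src.data[offset:offset + n]
```

In this sketch only the MP style needs both sides to act (a send and a matching receive); the shared-address-space read and the put/get calls are driven by one side alone.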
![Page 3: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/3.jpg)
Platforms:
Tightly-coupled multiprocessors
– SGI Origin2000: a cache-coherent distributed shared memory machine
Less tightly-coupled clusters
– A cluster of workstations connected by Ethernet
![Page 4: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/4.jpg)
Purpose
Compare the three programming models on Origin2000, a modern 64-processor hardware cache-coherent machine
– We focus on scientific applications that access data regularly or predictably.
![Page 5: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/5.jpg)
Questions to be answered
Can parallel algorithms be structured in the same way for good performance in all three models?
If there are substantial differences in performance under the three models, where are the key bottlenecks?
Do we need to change the data structures or algorithms substantially to resolve those bottlenecks?
![Page 6: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/6.jpg)
Applications and Algorithms
FFT
– All-to-all communication (regular)
Ocean
– Nearest-neighbor communication
Radix
– All-to-all communication (irregular)
LU
– One-to-many communication
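The communication patterns listed above can be sketched as functions that return the set of partner ranks for one process. The function names, the 2D process grid for the nearest-neighbor case, and the "send to every other process" reading of one-to-many are assumptions made for illustration; the slides do not specify these details.

```python
def all_to_all(rank, nprocs):
    # Every process exchanges data with every other process.
    return [r for r in range(nprocs) if r != rank]


def nearest_neighbors(rank, rows, cols):
    # Assumes processes are laid out row-major on a rows x cols grid and
    # each one talks to its up, down, left, and right neighbors.
    r, c = divmod(rank, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * cols + cc for rr, cc in candidates
            if 0 <= rr < rows and 0 <= cc < cols]


def one_to_many(root, nprocs):
    # One process sends to the others; a real LU would likely target
    # only a subset, so this is a deliberate simplification.
    return [r for r in range(nprocs) if r != root]
```

The regular and irregular all-to-all cases share the same partner set; what differs is that in the irregular case the amount and destination of each piece of data depend on the input values rather than on a fixed layout.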
![Page 7: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/7.jpg)
Performance Result
![Page 8: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/8.jpg)
Question: Why is MP much worse than CC-SAS and SHMEM?
![Page 9: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/9.jpg)
Analysis:
Execution time = BUSY + LMEM + RMEM + SYNC
where
– BUSY: CPU computation time
– LMEM: CPU stall time for local cache misses
– RMEM: CPU stall time for sending/receiving remote data
– SYNC: CPU time spent at synchronization events
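As a small worked example of this breakdown, the sketch below sums the four categories and reports each one's share of the total. The helper name `breakdown` and the input numbers are invented for illustration; they are not measurements from the study.

```python
def breakdown(busy, lmem, rmem, sync):
    """Return total execution time and each category's share of it."""
    parts = {"BUSY": busy, "LMEM": lmem, "RMEM": rmem, "SYNC": sync}
    total = sum(parts.values())
    if total <= 0:
        raise ValueError("total execution time must be positive")
    shares = {name: t / total for name, t in parts.items()}
    return total, shares


# Made-up numbers, for illustration only.
total, shares = breakdown(busy=4.0, lmem=1.0, rmem=2.0, sync=1.0)
```

Comparing the shares across programming models is one way to see which category accounts for a slowdown.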
![Page 10: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/10.jpg)
Where does the time go in MP?
![Page 11: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/11.jpg)
Improving MP performance
Remove extra data copy
– Allocate all data involved in communication in the shared address space
Reduce SYNC time
– Use lock-free queue management for communication instead
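The slides do not describe the queue design, so the following is only one common way to manage a queue without locks: a single-producer, single-consumer ring buffer in which each index is written by exactly one side. The class name `SpscQueue` and its methods are hypothetical.

```python
class SpscQueue:
    """Single-producer, single-consumer ring buffer that uses no lock.

    Only the producer writes `tail`; only the consumer writes `head`.
    `None` is reserved to mean "empty" and must not be stored.
    """

    def __init__(self, capacity):
        # One slot stays unused so that full and empty can be told apart.
        self.buf = [None] * (capacity + 1)
        self.head = 0  # next slot to read (consumer-owned)
        self.tail = 0  # next slot to write (producer-owned)

    def try_push(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False  # full: the caller retries instead of blocking
        self.buf[self.tail] = item
        self.tail = nxt   # publish only after the slot has been written
        return True

    def try_pop(self):
        if self.head == self.tail:
            return None   # empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

In a compiled language on real hardware, the "write the slot, then publish the index" ordering would also need appropriate memory barriers; the Python version only shows the index discipline.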
![Page 12: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/12.jpg)
Speedups under Improved MP
![Page 13: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/13.jpg)
Why does CC-SAS perform best?
![Page 14: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/14.jpg)
Why does CC-SAS perform best?
Extra packing/unpacking operations in MP and SHMEM
Extra packet queue management in MP
…
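To illustrate the packing/unpacking overhead, the sketch below moves one column of a row-major matrix in two ways. The choice of a matrix column as the non-contiguous data, and all function names, are assumptions for illustration; the slides do not say which data the applications pack.

```python
def pack_column(matrix, col):
    # MP/SHMEM style: gather a strided column into a contiguous buffer
    # before it is transferred.
    return [row[col] for row in matrix]


def unpack_column(matrix, col, buffer):
    # Receiver side: scatter the contiguous buffer into the local layout.
    for row, value in zip(matrix, buffer):
        row[col] = value


def sas_copy_column(src, dst, col):
    # CC-SAS style: read the remote elements directly with ordinary
    # loads, with no intermediate buffer.
    for s_row, d_row in zip(src, dst):
        d_row[col] = s_row[col]
```

The packed path touches every element twice (once to pack, once to unpack) in addition to the transfer itself, while the shared-address-space path touches each element once.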
![Page 15: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/15.jpg)
Speedups for Ocean
![Page 16: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/16.jpg)
Speedups for Radix
![Page 17: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/17.jpg)
Speedups for LU
![Page 18: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/18.jpg)
Conclusions
Good algorithm structures are portable among programming models.
MP is much worse than CC-SAS and SHMEM on a hardware-coherent machine. However, we can achieve similar performance if the extra data copy and queue synchronization are addressed.
Something about programmability
![Page 19: A comparison of CC-SAS, MP and SHMEM on SGI Origin2000](https://reader034.vdocuments.net/reader034/viewer/2022042901/56814c1a550346895db91a60/html5/thumbnails/19.jpg)
Future work
How about applications that have irregular, unpredictable and naturally fine-grained data access and communication patterns?
How about software-based coherent machines (i.e. clusters)?