
Page 1

Parallel Computing

Benson Muite

[email protected]

http://math.ut.ee/~benson

https://courses.cs.ut.ee/2016/paralleel/fall/Main/HomePage

19 September 2016

https://www2.eecs.berkeley.edu/bears/CS_Anniversary/karp-talk.html [1] https://en.wikipedia.org/wiki/Berkeley_RISC [2]. 1 / 25

Page 2

Parallel Computer Architecture

- Chip Architecture Review
- Accelerators
- Graphics Cards
- Intel Xeon Phi
- Parallel Computer Networking
- CM-5
- The Earth Simulator
- IBM Blue Gene
- K computer
- Titan
- Tianhe II
- Sunway TaihuLight


Page 3

Chip Architecture Review

- A typical chip today has multiple cores
- Data may need to be obtained from a hard disk, RAM or cache before being processed
- For many applications, getting the data can be more of a constraint than the computation itself
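Whether a kernel is limited by getting data or by computing with it can be estimated with a simple roofline model. The sketch below is illustrative: the peak, bandwidth and intensity numbers are assumptions, not measurements of any particular chip.

```python
# Roofline sketch: attainable performance is capped either by the floating
# point units or by how fast memory can feed them, whichever is slower.
# All numbers here are illustrative assumptions, not measurements.
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A kernel doing 2 flops per 8-byte double loaded (0.25 flop/byte) on a
# hypothetical 176 Gflop/s, 60 GB/s chip is limited by memory, not compute:
print(attainable_gflops(176.0, 60.0, 0.25))  # 15.0 Gflop/s, far below peak
```

With an arithmetic intensity this low, faster floating point units would not help; reorganizing data access would.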


Page 4

Example HPC Chip Architectures

- Intel Haswell
- AMD Opteron
- SPARC64 XIfx
- NEC SX-ACE
- IBM Power 8
- IBM PowerPC A2
- Hotchips (http://www.hotchips.org/), Coolchips (http://www.coolchips.org/2016/)


Page 5

Accelerators

- External specialized device for floating point operations
- Typically good at doing many simplified instructions in parallel
- High latency is compensated by high bandwidth
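A toy cost model makes the latency/bandwidth trade-off concrete. The latency and bandwidth figures below are assumptions for illustration, not taken from any specific accelerator.

```python
# Toy transfer-time model: fixed startup latency plus bytes / bandwidth.
# latency_s and bandwidth_bytes_s are illustrative assumptions.
def transfer_time_s(nbytes, latency_s=10e-6, bandwidth_bytes_s=8e9):
    return latency_s + nbytes / bandwidth_bytes_s

# Sending 10 MB as 1000 small messages pays the latency 1000 times;
# one large message pays it once, so high bandwidth can hide high latency.
many_small = 1000 * transfer_time_s(10_000)
one_large = transfer_time_s(10_000_000)
print(many_small, one_large)
```

This is why accelerator codes usually batch data movement into a few large transfers rather than many small ones.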


Page 6

Graphics Cards and General Purpose Computing on Graphics Cards

- Nvidia – many simple cores; CUDA, CUDA Fortran, OpenACC, OpenCL and OpenGL application programming interfaces; strong support of the academic community
- AMD – many simple cores; OpenCL and OpenGL. Have launched the APU (Accelerated Processing Unit), which combines CPU and GPU
- Embedded graphics cards in AMD APUs and cell phone chips, such as the Qualcomm Snapdragon


Page 7

Intel Xeon Phi

- 1 Tflop of performance
- Mini-supercomputer in a compute card
- Simplified x86 cores
- Typically easy to get code to run, more difficult to get code to run efficiently


Page 8

Parallel Computer Networks

- Bus – simple, cheap, poor communication performance
- Ring – simple, cheap, poor communication performance
- Mesh – simple, more expensive than a ring, better communication performance than a ring
- Hypercube – good communication performance, expensive at large scale
- Torus (2D, 3D, 4D, 6D) – good communication performance
- Fat tree – commonly used, not quite as good performance as a torus, but cheaper
- Which topology is cost effective for a Monte Carlo simulation?
- What is the topology of Rocket?
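One way to compare these topologies is by their diameter, the worst-case number of hops between two nodes. The sketch below uses the standard textbook formulas, evaluated for a 64-node machine.

```python
import math

# Diameter (worst-case hop count) for common topologies.
def ring_diameter(p):
    return p // 2                 # go the short way around the ring

def mesh_2d_diameter(side):
    return 2 * (side - 1)         # corner to opposite corner, no wraparound

def torus_2d_diameter(side):
    return 2 * (side // 2)        # wraparound links halve each dimension

def hypercube_diameter(p):
    return int(math.log2(p))      # p must be a power of two

# 64 nodes: the hypercube needs far fewer hops than the ring.
print(ring_diameter(64), mesh_2d_diameter(8), torus_2d_diameter(8),
      hypercube_diameter(64))  # 32 14 8 6
```

Diameter is only one metric; cost, bisection bandwidth and ease of wiring are why fat trees and tori dominate in practice.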


Page 10

CM-5

http://people.csail.mit.edu/bradley/cm5/,

https://en.wikipedia.org/wiki/Connection_Machine

NAS Thinking Machines CM-5, photographer: Tom Trower, 1993 (this is probably a 256 processor machine)

- 131 Gflops on 1024 processors
- Fastest computer on the Top 500 list in June 1993
- Fat tree topology network
- Thinking Machines grew out of Danny Hillis's doctoral research, but is no longer producing supercomputers


Page 11

The Earth Simulator

https://en.wikipedia.org/wiki/Earth_Simulator

http://www.jamstec.go.jp/ceist/avcrg/index.en.html

Photos: the old Earth Simulator and Earth Simulator 2

- 35.86 Tflops on 5120 processors
- Fastest computer on the Top 500 list between March 2002 and November 2004
- Vector processors
- Five times faster than the previous number one computer on the Top 500


Page 12

IBM Blue Gene/L

https://en.wikipedia.org/wiki/Blue_Gene#Blue_Gene.2FL

https://asc.llnl.gov/computing_resources/bluegenel/photogallery.html

Adam Bertsch next to a Blue Gene/L system at Lawrence Livermore National Laboratory

- 596 Tflops on 106,496 dual core processors
- Fastest computer on the Top 500 list between November 2004 and November 2007
- 3D torus and many not so fast cores
- More at https://asc.llnl.gov/computing_resources/bluegenel/configuration.html


Page 13

K computer

https://en.wikipedia.org/wiki/K_computer

http://www.aics.riken.jp/en/outreach/photo-gallery/

- Currently 10.5 Pflops on 88,128 SPARC64 VIIIfx processors with 8 cores per processor
- Fastest computer on the Top 500 list between June 2011 and June 2012
- Fastest computer on the Graph 500 list from June 2011 to the present
- 6D mesh/torus network and many fast and smart cores
- More at http://www.aics.riken.jp/en/k-computer/system


Page 14

Titan

https://en.wikipedia.org/wiki/Titan_%28supercomputer%29

https://www.olcf.ornl.gov/

Titan Supercomputer at Oak Ridge National Laboratory

- 27 Pflops on 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs
- Fastest computer on the Top 500 list between November 2012 and June 2013
- More at https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/


Page 15

Tianhe II

https://en.wikipedia.org/wiki/Tianhe-2

https://duckduckgo.com/?q=tianhe+II+pictures

- 33.86 Pflops on 32,000 Intel Xeon E5-2692 chips with 48,000 Xeon Phi 31S1P coprocessors
- Fat tree topology; American chips, but the interconnect is made in China
- Fastest computer on the Top 500 list between November 2013 and June 2016
- More at www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf


Page 16

TaihuLight

https://en.wikipedia.org/wiki/Sunway_TaihuLight

http://demo.wxmax.cn/wxc/index.php

http://demo.wxmax.cn/wxc/soft1.php?word=soft&i=46

- 93 Pflops on 40,960 Sunway SW26010 260C chips
- Fastest computer on the Top 500 list from June 2016 to the present
- More at http://engine.scichina.com/downloadPdf/rdbMLbAMd6ZiKwDco and http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf


Page 17

Hardware on Rocket Cluster

- Intel® Xeon® CPU E5-2660 v2 @ 2.20 GHz
- Intel Xeon Phi coprocessor 5100


Page 18

Intel® Xeon® CPU E5-2660 v2

- 10 cores, 2.20 GHz nominal frequency, 4 double precision fused add and multiply operations per cycle per core
- 10 × 2.20 × 8 × 10^9 = 176 × 10^9 floating point operations per second
- Level 1 cache: 10 × 32 KB instruction cache, 10 × 32 KB data cache
- Level 2 cache: 10 × 256 KB
- Level 3 cache: 25 MB (shared)
- http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2660%20v2.html
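The peak figures for the two Rocket chips follow the same formula, cores × clock rate (GHz) × flops per cycle per core; a quick check of the arithmetic:

```python
# Peak double precision rate = cores x clock (GHz) x flops per cycle per
# core, giving Gflop/s. Values are the ones stated on the slides.
def peak_gflops(cores, ghz, flops_per_cycle):
    return cores * ghz * flops_per_cycle

# Xeon E5-2660 v2: 4 fused multiply-adds = 8 flops per cycle per core.
print(peak_gflops(10, 2.20, 8))    # about 176 Gflop/s
# Xeon Phi 5110P: 8 fused multiply-adds = 16 flops per cycle per core.
print(peak_gflops(60, 1.053, 16))  # about 1011 Gflop/s
```

Note that each fused multiply-add counts as two floating point operations, which is where the factors of 8 and 16 come from.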

http://www.slideshare.net/IntelSoftwareBR/hpc-update-istep2014

Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1


Page 19

Intel Xeon Phi coprocessor 5110P

- 60 cores, 1.053 GHz nominal frequency, 8 double precision fused add and multiply floating point operations per cycle per core
- 60 × 16 × 1.053 = 1010.88 × 10^9 floating point operations per second
- Level 1 cache: 60 × 32 KB instruction cache, 60 × 32 KB data cache
- Level 2 cache: 60 × 512 KB
- http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-theoretical-maximums.html

Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
http://www.hpc.ut.ee/user_guides/xeonphi


Page 20

Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator

Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1


Page 21

Neural networks on the Xeon Phi

http://colfaxresearch.com/isc16-neuraltalk/

Code: https://github.com/xhzhao/Optimized-Torch


Page 22

Typical code optimizations

- Simplified algorithm
- Optimizing memory access
- Cache blocking
- Vector directives, intrinsics, inline assembly
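As a minimal sketch of one of these techniques, cache blocking, here is a blocked matrix transpose. The block size is an illustrative choice; in practice it is tuned to the cache sizes listed on the earlier slides.

```python
# Cache blocking sketch: transpose an n x n matrix in b x b tiles so that
# the working set of each tile (source and destination) stays cache
# resident, instead of striding through a whole row or column at a time.
def transpose_blocked(a, n, b=4):
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            # min() handles the edge tiles when b does not divide n
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    out[j][i] = a[i][j]
    return out

a = [[i * 8 + j for j in range(8)] for i in range(8)]
t = transpose_blocked(a, 8)
assert all(t[j][i] == a[i][j] for i in range(8) for j in range(8))
```

In Python the payoff is invisible, but the same loop structure in C or Fortran is what turns a memory-bound transpose into a cache-friendly one.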


Page 23

Summary

- Supercomputer architectures are still evolving
- Depending on the problem you are solving, the best choice of computer architecture and algorithm should be made if possible
- In many cases you have no choice in the computer architecture of a supercomputer, but do have some choice in the algorithm
- Sometimes you are lucky and can choose both, but may need to write a lot of code
- Need to consider both peak floating point performance and memory bandwidth to determine the serial speed of your application
- Consider re-organizing your work to make the code efficient and well suited to the architecture you are running on


Page 24

New Key Concepts and References

Parallel Computer Architecture; RR 2.1-2.4, 2.7, 3.4, 3.5
Hoffman, J., Treibig, J., Hager, G., Wellein, G., "Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator", http://arxiv.org/abs/1401.3615

T. Hoefler, "Networking and Computer Architecture", http://htor.inf.ethz.ch/teaching/CS498/

A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison Wesley (2003)
Patterson, D.A., Hennessy, J.L., Computer Organization and Design: The Hardware and Software Interface, 5th Ed., Morgan Kaufmann (2014)


Page 25

New Key Concepts and References

Rahman, R., Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress Open (2013), $0.35 on Amazon
Solnushkin, K., http://clusterdesign.org/fat-trees/

Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y., High-Performance Computing on the Intel® Xeon Phi™, Springer (2014) http://www.springer.com/computer/communication+networks/book/978-3-319-06485-7?otherVersion=978-3-319-06486-4
