TRANSCRIPT
Parallel Computing
Benson Muite
[email protected]
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2016/paralleel/fall/Main/HomePage
19 September 2016
https://www2.eecs.berkeley.edu/bears/CS_Anniversary/karp-talk.html [1]
https://en.wikipedia.org/wiki/Berkeley_RISC [2]
Parallel Computer Architecture
- Chip Architecture Review
- Accelerators
- Graphics Cards
- Intel Xeon Phi
- Parallel Computer Networking
- CM-5
- The Earth Simulator
- IBM Blue Gene
- K computer
- Titan
- Tianhe II
- Sunway TaihuLight
Chip Architecture Review
- A typical chip today has multiple cores
- Data may need to be obtained from a hard disk, RAM or cache before being processed
- For many applications, getting the data can be more of a constraint than computing with it
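The data-movement point above can be made quantitative with arithmetic intensity: flops performed per byte moved. A minimal sketch — the 176 Gflop/s and 50 GB/s machine figures below are illustrative assumptions, not measurements from these slides:

```python
# Sketch (assumptions mine): is a kernel compute-bound or memory-bound?
# Compare its arithmetic intensity against the machine balance.

def arithmetic_intensity(flops, bytes_moved):
    """Floating point operations performed per byte moved from memory."""
    return flops / bytes_moved

# daxpy: y = a*x + y over n doubles
# -> 2n flops, 24n bytes (read x and y, write y, 8 bytes each)
n = 10**6
daxpy_ai = arithmetic_intensity(2 * n, 24 * n)

# Assumed machine: 176 Gflop/s peak, 50 GB/s memory bandwidth
machine_balance = 176e9 / 50e9   # flops per byte needed to be compute-bound

print(f"daxpy intensity:  {daxpy_ai:.3f} flops/byte")
print(f"machine balance:  {machine_balance:.2f} flops/byte")
print("memory-bound" if daxpy_ai < machine_balance else "compute-bound")
```

With an intensity of about 0.08 flops/byte against a balance above 3, daxpy is firmly memory-bound, which is why getting the data dominates computing with it.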
Example HPC Chip Architectures
- Intel Haswell
- AMD Opteron
- SPARC64 XIfx
- NEC SX-ACE
- IBM Power 8
- IBM PowerPC A2
Hot Chips (http://www.hotchips.org/), Cool Chips (http://www.coolchips.org/2016/)
Accelerators
- External specialized device for floating point operations
- Typically good at doing many simplified instructions in parallel
- High latency is compensated by high bandwidth
Graphics Cards and General Purpose Computing on Graphics Cards
- Nvidia – many simple cores; CUDA, CUDA Fortran, OpenACC, OpenCL and OpenGL application programming interfaces; strong support of the academic community
- AMD – many simple cores; OpenCL and OpenGL. Has launched the APU (Accelerated Processing Unit), which combines a CPU and a GPU
- Embedded graphics cards in the AMD APU and in cell phone chips such as the Qualcomm Snapdragon
Intel Xeon Phi
- 1 Tflop of performance
- Mini-supercomputer in a compute card
- Simplified x86 cores
- Typically easy to get code to run, more difficult to get code to run efficiently
Parallel Computer Networks
- Bus – simple, cheap, poor communication performance
- Ring – simple, cheap, poor communication performance
- Mesh – simple, more expensive than a ring, better communication performance than a ring
- Hypercube – good communication performance, expensive at large scale
- Torus (2D, 3D, 4D, 6D) – good communication performance
- Fat tree – commonly used; not quite as good performance as a torus, but cheaper
Which topology is cost effective for a Monte Carlo simulation?
What is the topology of Rocket?
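The cost/performance trade-offs in the list above can be sketched with the standard diameter (worst-case hop count) and link-count (a rough cost proxy) formulas. A small illustration, assuming a square non-toroidal mesh and a power-of-two node count for the hypercube:

```python
import math

# Sketch: diameter and link count for three topologies on p nodes.
# Lower diameter = better worst-case latency; more links = more cost.

def ring(p):
    return {"diameter": p // 2, "links": p}

def mesh_2d(p):
    side = int(math.isqrt(p))        # assumes p is a perfect square
    return {"diameter": 2 * (side - 1), "links": 2 * side * (side - 1)}

def hypercube(p):
    d = p.bit_length() - 1           # assumes p is a power of two
    return {"diameter": d, "links": p * d // 2}

for name, f in [("ring", ring), ("2D mesh", mesh_2d), ("hypercube", hypercube)]:
    m = f(64)
    print(f"{name:10s} diameter={m['diameter']:3d} links={m['links']:4d}")
```

At 64 nodes the hypercube's diameter is 6 against the ring's 32, but it needs three times as many links — the expense-at-scale point from the list.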
Parallel Computer Networks
http://prism-pub.hpc.ut.ee/infiniband_topology/topo.html?topofile=rocket_topo.json
http://htor.inf.ethz.ch/research/topologies/
CM-5
http://people.csail.mit.edu/bradley/cm5/
https://en.wikipedia.org/wiki/Connection_Machine
NAS Thinking Machines CM-5, photographer: Tom Trower, 1993 (this is probably a 256 processor machine)
- 131 Gflops on 1024 processors
- Fastest computer on the Top 500 list in June 1993
- Fat tree topology network
- Thinking Machines grew out of Danny Hillis's doctoral research, but is no longer producing supercomputers
The Earth Simulator
https://en.wikipedia.org/wiki/Earth_Simulator
http://www.jamstec.go.jp/ceist/avcrg/index.en.html
Photos: the old Earth Simulator and the Earth Simulator 2
- 35.86 Tflops on 5120 processors
- Fastest computer on the Top 500 list between March 2002 and November 2004
- Vector processors
- Five times faster than the previous number one computer on the Top 500
IBM Blue Gene L
https://en.wikipedia.org/wiki/Blue_Gene#Blue_Gene.2FL
https://asc.llnl.gov/computing_resources/bluegenel/photogallery.html
Adam Bertsch next to a Blue Gene L system at Lawrence Livermore National Laboratory
- 596 Tflops on 106,496 dual core processors
- Fastest computer on the Top 500 list between November 2004 and November 2007
- 3D torus and many not so fast cores
- More at https://asc.llnl.gov/computing_resources/bluegenel/configuration.html
K computer
https://en.wikipedia.org/wiki/K_computer
http://www.aics.riken.jp/en/outreach/photo-gallery/
- Currently 10.5 Pflops on 88,128 SPARC64 VIIIfx processors with 8 cores per processor
- Fastest computer on the Top 500 list between June 2011 and June 2012
- Fastest computer on the Graph 500 list from June 2011 to the present
- 6D "mesh/torus" network and many fast and smart cores
- More at http://www.aics.riken.jp/en/k-computer/system
Titan
https://en.wikipedia.org/wiki/Titan_%28supercomputer%29
https://www.olcf.ornl.gov/
Titan Supercomputer at Oak Ridge National Laboratory
- 27 Pflops on 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs
- Fastest computer on the Top 500 list between November 2012 and June 2013
- More at https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/
Tianhe II
https://en.wikipedia.org/wiki/Tianhe-2
https://duckduckgo.com/?q=tianhe+II+pictures
- 33.86 Pflops on 32,000 Intel Xeon E5-2692 chips with 48,000 Xeon Phi 31S1P coprocessors
- Fat tree topology; American chips, but the interconnect is made in China
- Fastest computer on the Top 500 list between November 2013 and June 2016
- More at www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
TaihuLight
https://en.wikipedia.org/wiki/Sunway_TaihuLight
http://demo.wxmax.cn/wxc/index.php
http://demo.wxmax.cn/wxc/soft1.php?word=soft&i=46
- 93 Pflops on 40,960 Sunway SW26010 260C chips
- Fastest computer on the Top 500 list between June 2016 and now
- More at http://engine.scichina.com/downloadPdf/rdbMLbAMd6ZiKwDco
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
Hardware on Rocket Cluster
- Intel® Xeon® CPU E5-2660 v2 @ 2.20 GHz
- Intel Xeon Phi coprocessor 5100
Intel® Xeon® CPU E5-2660 v2
- 10 cores, 2.20 GHz nominal frequency, 4 double precision fused add and multiply operations (8 flops) per cycle per core
- 10 × 2.20 × 8 × 10^9 = 176 × 10^9 floating point operations per second
- Level 1 cache: 10 × 32 KB instruction cache, 10 × 32 KB data cache
- Level 2 cache: 10 × 256 KB
- Level 3 cache: 25 MB (shared)
http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2660%20v2.html
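The peak-performance arithmetic above is just cores × clock × flops per cycle. A small sketch that reproduces both this chip's figure and the Xeon Phi 5110P figure from a later slide:

```python
# Sketch of the theoretical peak computation from the slides.
# Real codes rarely come close to this number.

def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical double precision peak in Gflop/s."""
    return cores * ghz * flops_per_cycle

# Xeon E5-2660 v2: 10 cores, 2.20 GHz, 4 FMAs = 8 flops per cycle
print(round(peak_gflops(10, 2.20, 8), 2))      # 176.0 Gflop/s

# Xeon Phi 5110P: 60 cores, 1.053 GHz, 8 FMAs = 16 flops per cycle
print(round(peak_gflops(60, 1.053, 16), 2))    # 1010.88 Gflop/s
```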
http://www.slideshare.net/IntelSoftwareBR/hpc-update-istep2014
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
Intel Xeon Phi coprocessor 5110P
- 60 cores, 1.053 GHz nominal frequency, 8 double precision fused add and multiply operations (16 flops) per cycle per core
- 60 × 16 × 1.053 × 10^9 = 1010.88 × 10^9 floating point operations per second
- Level 1 cache: 60 × 32 KB instruction cache, 60 × 32 KB data cache
- Level 2 cache: 60 × 512 KB
http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-theoretical-maximums.html
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
http://www.hpc.ut.ee/user_guides/xeonphi
Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
Neural networks on the Xeon Phi
http://colfaxresearch.com/isc16-neuraltalk/
Code: https://github.com/xhzhao/Optimized-Torch
Typical code optimizations
- Simplified algorithm
- Optimizing memory access
- Cache blocking
- Vector directives, intrinsics, inline assembly
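Cache blocking, for example, can be illustrated with a tiled matrix transpose. A Python sketch for clarity — in practice this is written in C or Fortran with the tile size tuned to the cache, but the loop structure is the same:

```python
# Sketch (mine) of cache blocking: process a matrix tile by tile so the
# working set of each tile fits in cache, instead of striding across
# the whole matrix and missing on every access.

def transpose_blocked(a, block=64):
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, n, block):
        for jj in range(0, m, block):
            # work on one block x block tile at a time
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, m)):
                    out[j][i] = a[i][j]
    return out

a = [[1, 2, 3], [4, 5, 6]]
print(transpose_blocked(a, block=2))   # [[1, 4], [2, 5], [3, 6]]
```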
Summary
- Supercomputer architectures are still evolving
- Depending on the problem you are solving, the best choice of computer architecture and algorithm should be made if possible
- In many cases you have no choice in the computer architecture of a supercomputer, but do have some choice in the algorithm
- Sometimes you are lucky and can choose both, but may need to write a lot of code
- You need to consider both peak floating point performance and memory bandwidth to determine the serial speed of your application
- Consider reorganizing your work to make the code efficient and well suited to the architecture you are running on
New Key Concepts and References
Parallel Computer Architecture; RR 2.1–2.4, 2.7, 3.4, 3.5
Hoffman, J., Treibig, J., Hager, G., Wellein, G., "Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator", http://arxiv.org/abs/1401.3615
T. Hoefler, "Networking and Computer Architecture", http://htor.inf.ethz.ch/teaching/CS498/
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison Wesley (2003)
Patterson, D.A., Hennessy, J.L., Computer Organization and Design: The Hardware and Software Interface, 5th Ed., Morgan Kaufmann (2014)
New Key Concepts and References
Rahman, R., Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress Open (2013) — $0.35 on Amazon
Solnushkin, K., http://clusterdesign.org/fat-trees/
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y., High-Performance Computing on the Intel® Xeon Phi™, Springer (2014), http://www.springer.com/computer/communication+networks/book/978-3-319-06485-7?otherVersion=978-3-319-06486-4