TRANSCRIPT
Parallel Computing
Benson Muite
[email protected]
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2016/paralleel/fall/Main/HomePage
19 September 2016
https://www2.eecs.berkeley.edu/bears/CS_Anniversary/karp-talk.html [1]
https://en.wikipedia.org/wiki/Berkeley_RISC [2]
Parallel Computer Architecture
- Chip Architecture Review
- Accelerators
- Graphics Cards
- Intel Xeon Phi
- Parallel Computer Networking
- CM-5
- The Earth Simulator
- IBM Blue Gene
- K computer
- Titan
- Tianhe II
- Sunway TaihuLight
Chip Architecture Review
- A typical chip today has multiple cores
- Data may need to be obtained from a hard disk, RAM or cache before being processed
- For many applications, getting the data can be more of a constraint than computing with it
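The data-movement point above can be made quantitative with arithmetic intensity: flops performed per byte moved. A minimal sketch — the 176 Gflop/s and 50 GB/s machine figures below are illustrative assumptions, not measurements from these slides:

```python
# Sketch (assumptions mine): is a kernel compute-bound or memory-bound?
# Compare its arithmetic intensity against the machine balance.

def arithmetic_intensity(flops, bytes_moved):
    """Floating point operations performed per byte moved from memory."""
    return flops / bytes_moved

# daxpy: y = a*x + y over n doubles
# -> 2n flops, 24n bytes (read x and y, write y, 8 bytes each)
n = 10**6
daxpy_ai = arithmetic_intensity(2 * n, 24 * n)

# Assumed machine: 176 Gflop/s peak, 50 GB/s memory bandwidth
machine_balance = 176e9 / 50e9   # flops per byte needed to be compute-bound

print(f"daxpy intensity:  {daxpy_ai:.3f} flops/byte")
print(f"machine balance:  {machine_balance:.2f} flops/byte")
print("memory-bound" if daxpy_ai < machine_balance else "compute-bound")
```

With an intensity of about 0.08 flops/byte against a balance above 3, daxpy is firmly memory-bound, which is why getting the data dominates computing with it.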
Example HPC Chip Architectures
- Intel Haswell
- AMD Opteron
- SPARC64 XIfx
- NEC SX-ACE
- IBM Power 8
- IBM PowerPC A2
Hot Chips (http://www.hotchips.org/), Cool Chips (http://www.coolchips.org/2016/)
Accelerators
- External specialized device for floating point operations
- Typically good at doing many simplified instructions in parallel
- High latency is compensated by high bandwidth
Graphics Cards and General Purpose Computing on Graphics Cards
- Nvidia – many simple cores; CUDA, CUDA Fortran, OpenACC, OpenCL and OpenGL application programming interfaces; strong support of the academic community
- AMD – many simple cores; OpenCL and OpenGL. Has launched the APU (Accelerated Processing Unit), which combines a CPU and a GPU
- Embedded graphics cards in the AMD APU and in cell phone chips such as the Qualcomm Snapdragon
Intel Xeon Phi
- 1 Tflop of performance
- Mini-supercomputer in a compute card
- Simplified x86 cores
- Typically easy to get code to run, more difficult to get code to run efficiently
Parallel Computer Networks
- Bus – simple, cheap, poor communication performance
- Ring – simple, cheap, poor communication performance
- Mesh – simple, more expensive than a ring, better communication performance than a ring
- Hypercube – good communication performance, expensive at large scale
- Torus (2D, 3D, 4D, 6D) – good communication performance
- Fat tree – commonly used; not quite as good performance as a torus, but cheaper
Which topology is cost effective for a Monte Carlo simulation?
What is the topology of Rocket?
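The cost/performance trade-offs in the list above can be sketched with the standard diameter (worst-case hop count) and link-count (a rough cost proxy) formulas. A small illustration, assuming a square non-toroidal mesh and a power-of-two node count for the hypercube:

```python
import math

# Sketch: diameter and link count for three topologies on p nodes.
# Lower diameter = better worst-case latency; more links = more cost.

def ring(p):
    return {"diameter": p // 2, "links": p}

def mesh_2d(p):
    side = int(math.isqrt(p))        # assumes p is a perfect square
    return {"diameter": 2 * (side - 1), "links": 2 * side * (side - 1)}

def hypercube(p):
    d = p.bit_length() - 1           # assumes p is a power of two
    return {"diameter": d, "links": p * d // 2}

for name, f in [("ring", ring), ("2D mesh", mesh_2d), ("hypercube", hypercube)]:
    m = f(64)
    print(f"{name:10s} diameter={m['diameter']:3d} links={m['links']:4d}")
```

At 64 nodes the hypercube's diameter is 6 against the ring's 32, but it needs three times as many links — the expense-at-scale point from the list.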
Parallel Computer Networks
http://prism-pub.hpc.ut.ee/infiniband_topology/topo.html?topofile=rocket_topo.json
http://htor.inf.ethz.ch/research/topologies/
CM-5
http://people.csail.mit.edu/bradley/cm5/
https://en.wikipedia.org/wiki/Connection_Machine
NAS Thinking Machines CM-5, photographer: Tom Trower, 1993 (this is probably a 256 processor machine)
- 131 Gflops on 1024 processors
- Fastest computer on the Top 500 list in June 1993
- Fat tree topology network
- Thinking Machines grew out of Danny Hillis's doctoral research, but is no longer producing supercomputers
The Earth Simulator
https://en.wikipedia.org/wiki/Earth_Simulator
http://www.jamstec.go.jp/ceist/avcrg/index.en.html
Photos: the old Earth Simulator and the Earth Simulator 2
- 35.86 Tflops on 5120 processors
- Fastest computer on the Top 500 list between March 2002 and November 2004
- Vector processors
- Five times faster than the previous number one computer on the Top 500
IBM Blue Gene L
https://en.wikipedia.org/wiki/Blue_Gene#Blue_Gene.2FL
https://asc.llnl.gov/computing_resources/bluegenel/photogallery.html
Adam Bertsch next to a Blue Gene L system at Lawrence Livermore National Laboratory
- 596 Tflops on 106,496 dual core processors
- Fastest computer on the Top 500 list between November 2004 and November 2007
- 3D torus and many not so fast cores
- More at https://asc.llnl.gov/computing_resources/bluegenel/configuration.html
K computer
https://en.wikipedia.org/wiki/K_computer
http://www.aics.riken.jp/en/outreach/photo-gallery/
- Currently 10.5 Pflops on 88,128 SPARC64 VIIIfx processors with 8 cores per processor
- Fastest computer on the Top 500 list between June 2011 and June 2012
- Fastest computer on the Graph 500 list from June 2011 to the present
- 6D "mesh/torus" network and many fast and smart cores
- More at http://www.aics.riken.jp/en/k-computer/system
Titan
https://en.wikipedia.org/wiki/Titan_%28supercomputer%29
https://www.olcf.ornl.gov/
Titan Supercomputer at Oak Ridge National Laboratory
- 27 Pflops on 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs
- Fastest computer on the Top 500 list between November 2012 and June 2013
- More at https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/
Tianhe II
https://en.wikipedia.org/wiki/Tianhe-2
https://duckduckgo.com/?q=tianhe+II+pictures
- 33.86 Pflops on 32,000 Intel Xeon E5-2692 chips with 48,000 Xeon Phi 31S1P coprocessors
- Fat tree topology; American chips, but the interconnect is made in China
- Fastest computer on the Top 500 list between November 2013 and June 2016
- More at www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
TaihuLight
https://en.wikipedia.org/wiki/Sunway_TaihuLight
http://demo.wxmax.cn/wxc/index.php
http://demo.wxmax.cn/wxc/soft1.php?word=soft&i=46
- 93 Pflops on 40,960 Sunway SW26010 260C chips
- Fastest computer on the Top 500 list between June 2016 and now
- More at http://engine.scichina.com/downloadPdf/rdbMLbAMd6ZiKwDco
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
Hardware on Rocket Cluster
- Intel® Xeon® CPU E5-2660 v2 @ 2.20 GHz
- Intel Xeon Phi coprocessor 5100
Intel® Xeon® CPU E5-2660 v2
- 10 cores, 2.20 GHz nominal frequency, 4 double precision fused add and multiply operations (8 flops) per cycle per core
- 10 × 2.20 × 8 × 10^9 = 176 × 10^9 floating point operations per second
- Level 1 cache: 10 × 32 KB instruction cache, 10 × 32 KB data cache
- Level 2 cache: 10 × 256 KB
- Level 3 cache: 25 MB (shared)
http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2660%20v2.html
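The peak-performance arithmetic above is just cores × clock × flops per cycle. A small sketch that reproduces both this chip's figure and the Xeon Phi 5110P figure from a later slide:

```python
# Sketch of the theoretical peak computation from the slides.
# Real codes rarely come close to this number.

def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical double precision peak in Gflop/s."""
    return cores * ghz * flops_per_cycle

# Xeon E5-2660 v2: 10 cores, 2.20 GHz, 4 FMAs = 8 flops per cycle
print(round(peak_gflops(10, 2.20, 8), 2))      # 176.0 Gflop/s

# Xeon Phi 5110P: 60 cores, 1.053 GHz, 8 FMAs = 16 flops per cycle
print(round(peak_gflops(60, 1.053, 16), 2))    # 1010.88 Gflop/s
```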
http://www.slideshare.net/IntelSoftwareBR/hpc-update-istep2014
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
Intel Xeon Phi coprocessor 5110P
- 60 cores, 1.053 GHz nominal frequency, 8 double precision fused add and multiply operations (16 flops) per cycle per core
- 60 × 16 × 1.053 × 10^9 = 1010.88 × 10^9 floating point operations per second
- Level 1 cache: 60 × 32 KB instruction cache, 60 × 32 KB data cache
- Level 2 cache: 60 × 512 KB
http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-theoretical-maximums.html
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
http://www.hpc.ut.ee/user_guides/xeonphi
Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator
Hoffman, Treibig, Hager and Wellein, arXiv:1401.3615v1
Neural networks on the Xeon Phi
http://colfaxresearch.com/isc16-neuraltalk/
Code: https://github.com/xhzhao/Optimized-Torch
Typical code optimizations
- Simplified algorithm
- Optimizing memory access
- Cache blocking
- Vector directives, intrinsics, inline assembly
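Cache blocking, for example, can be illustrated with a tiled matrix transpose. A Python sketch for clarity — in practice this is written in C or Fortran with the tile size tuned to the cache, but the loop structure is the same:

```python
# Sketch (mine) of cache blocking: process a matrix tile by tile so the
# working set of each tile fits in cache, instead of striding across
# the whole matrix and missing on every access.

def transpose_blocked(a, block=64):
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, n, block):
        for jj in range(0, m, block):
            # work on one block x block tile at a time
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, m)):
                    out[j][i] = a[i][j]
    return out

a = [[1, 2, 3], [4, 5, 6]]
print(transpose_blocked(a, block=2))   # [[1, 4], [2, 5], [3, 6]]
```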
Summary
- Supercomputer architectures are still evolving
- Depending on the problem you are solving, the best choice of computer architecture and algorithm should be made if possible
- In many cases you have no choice in the computer architecture of a supercomputer, but do have some choice in the algorithm
- Sometimes you are lucky and can choose both, but may need to write a lot of code
- You need to consider both peak floating point performance and memory bandwidth to determine the serial speed of your application
- Consider reorganizing your work to make the code efficient and well suited to the architecture you are running on
New Key Concepts and References
Parallel Computer Architecture; RR 2.1–2.4, 2.7, 3.4, 3.5
Hoffman, J., Treibig, J., Hager, G., Wellein, G., "Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator", http://arxiv.org/abs/1401.3615
T. Hoefler, "Networking and Computer Architecture", http://htor.inf.ethz.ch/teaching/CS498/
A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison Wesley (2003)
Patterson, D.A., Hennessy, J.L., Computer Organization and Design: The Hardware and Software Interface, 5th Ed., Morgan Kaufmann (2014)
New Key Concepts and References
Rahman, R., Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress Open (2013) — $0.35 on Amazon
Solnushkin, K., http://clusterdesign.org/fat-trees/
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y., High-Performance Computing on the Intel® Xeon Phi™, Springer (2014), http://www.springer.com/computer/communication+networks/book/978-3-319-06485-7?otherVersion=978-3-319-06486-4