Post on 05-Aug-2018
Porting industrial application on Intel Xeon Phi:
Altair RADIOSS case study
Developer feedback and outlook
Eric Lequiniou, Director, HPC
November 2016
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Altair HyperWorks: Simulation-driven Innovation
Getting to the right design
Saving time in the process
Access to the latest technologies
Modern, open architecture CAE simulation platform, offering the best technologies to design and optimize high performance, weight efficient and innovative products.
Learn more at: altairhyperworks.com
Altair Solver Technology
Multiphysics Analysis and Optimization
Structural Analysis
Manufacturing Simulation
Systems Simulation
Fluid Dynamics
Thermal Analysis
Crash, Safety, Impact & Blast
Electro-Magnetics
Digital Materials
Altair Solver Brands
Solvers: OptiStruct, RADIOSS, MotionSolve, AcuSolve, nanoFluidX, HyperStudy, FEKO, Flux
● Implicit: Durability, Vibrations, Acoustics, Buckling
● Heat Transfer
● Explicit: Crash, Safety, Forming, Blast, Gravity, Springback
● Multi-body Dynamics
● CFD and Thermal
● Electro-Magnetics
● Design and Optimization
RADIOSS – Crash, Safety & Impact
Altair RADIOSS is a leading structural analysis solver for non-linear problems under dynamic loadings.
It is highly differentiated for scalability, quality, robustness, and consists of features for multiphysics simulation and advanced materials such as composites.
RADIOSS is used across many industries to improve the crashworthiness, safety, and manufacturability of structural designs.
Learn more at altairhyperworks.com/radioss
CPU Challenges in CAE & Crash Simulation
No more “free lunch” on the CPU side: CPU frequency and intrinsic performance tend to flatten
Increase parallel scalability to meet growing CAE computing needs
● Increased number of products and simulation load cases
  - Growth in product portfolio
  - Simulation load cases increase due to regulation requirements: ~30 safety load cases for crash tests
● Requirement for increasing accuracy to answer the CO2 reduction challenge
  - Fracture prediction and correlation leading to finer element meshes
  - Manufacturing process as initial conditions of crash (initialization)
● Stochastic/robustness analysis
  - Inherent to the sensitivity of the underlying physics and bifurcations in real tests
  - Need to run hundreds of variants to get confidence in results (corridor/worst case)
● Design optimization
  - Numerous iterations to automatically improve product performance
Key Technologies – RADIOSS Hybrid MPI OpenMP
● Enhanced performance
  - High efficiency on large HPC clusters
  - Flexibility: easy tuning of MPI & OpenMP
  - Unique proven method for rich scalability over thousands of cores for FEA
  - Double precision as default; Extended Single Precision ~1.5x faster
● Robustness
  - Parallel arithmetic option allows perfect repeatability in parallel
● Highly parallel code with hybrid model
  - Domain decomposition with MPI
  - OpenMP parallelization
    - Explicit multitasking
    - Loop auto-parallelization
Intel and Altair – Partners in HPC
A history of collaboration…
● Cluster Management: PBS-Intel integrations
  - MPI integration
  - Intel® Cluster Checker
● Certifications
  - Intel Cluster-Ready & Intel Scalable System Framework (SSF)
  - PBS Professional
  - Solvers (RADIOSS, OptiStruct, AcuSolve, FEKO)
● Application Integration: Use of Intel tools and technologies
  - Intel® MPI library, Intel® Fortran & C++ compilers, Intel® MKL Library, Intel® VTune™ Amplifier XE, Intel® Advisor, Intel® Trace Analyzer & Collector
  - Benchmarking activities on large cluster configurations
● Professional Support: Close collaboration among technical personnel
  - Access to Intel hardware resources: SDP systems, large cluster
  - Intel technical expertise helps us to optimize our software on Intel systems
Motivation to Port on Xeon Phi
• Many-core processor architecture
  - Aimed at large power-efficient clusters and supercomputers
  - Price-performance ratio
• New generation of Xeon Phi looks really promising
  - Faster CPU based on Atom
  - Faster MCDRAM memory
  - New AVX512 vector instruction set
  - Future KNL-F coming with Omni-Path
• Assess the potential of the Xeon Phi
  - RADIOSS was already ported to KNC
  - Hybrid MPI + OpenMP parallelism fits well with KNL architecture
  - Need to prepare for AVX512
  - Port additional solvers in a second step
Intel Xeon Phi – System Configuration
• Intel Knights Landing 7250
  - 68 cores / 272 threads
  - 1.4 GHz clock speed
  - 16 GB MCDRAM
  - 96 GB DDR4 2400
  - CentOS Linux 7
• Default Configuration
  - Cache mode
  - Quadrant
  - KMP_AFFINITY=scatter
RADIOSS Baseline Version
• Hybrid MPI OpenMP standard version running on Xeon
  - Compiled using ifort and icc: several million lines, mostly Fortran
  - SSE3 only: AVX not supported
  - Constraints regarding reproducibility, specific flags: -fp-model precise
  - Intel MPI for communication between nodes (distributed memory)
  - MPI and OpenMP setup optimized versus number of sockets and number of cores
  - Double precision (default)
• Aimed to run without modification on Xeon Phi KNL
  - Backward compatibility between Xeon code and Xeon Phi
  - AVX512 and AVX performance missing
Xeon Phi Programming Environment
• Compilation of the code
  - Intel Parallel Studio XE 2016 & 2017
    - Compilers: ifort, icc
    - MPI library
  - AVX512 support
    - -xCOMMON-AVX512: common between KNL and future Xeon Skylake (AVX512-F & AVX512-CD)
  - Restriction to keep parallel arithmetic
    - -no-fma
    - -fp-model precise
• Debugging & performance optimization
  - Intel tools: VTune Amplifier, Advisor, ITAC
RADIOSS Benchmarks
• Initial benchmark: Neon 1 million elements front crash
  - Big enough to test scalability up to 68 cores / 272 threads
  - Small enough to fit in 16 GB MCDRAM
  - 80 ms full run reduced to 8 ms for initial performance analysis
• Additional QA tests and customer models
• Larger benchmark: Taurus refined with 10 million elements
First Test with compilers v16.0 – NEON 1M 8ms
MPI  OpenMP  Threads  Elapsed (s)
68   1       68       816
68   2       136      630
68   4       272      795
34   2       68       789
34   4       136      658
4    17      68       822
4    34      136      848
8    16      128      763
68   3       204      775
Best configuration with 68 MPIs and 136 threads: 1.23x faster than baseline
Baseline reference
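The table above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the tuples are copied from the slide, and the variable names are invented here. It verifies that the Threads column is the product of MPI ranks and OpenMP threads per rank, then ranks the configurations by elapsed time.

```python
# Rows copied from the NEON 1M 8 ms table:
# (MPI ranks, OpenMP threads per rank, total threads, elapsed seconds)
runs = [
    (68, 1, 68, 816),
    (68, 2, 136, 630),
    (68, 4, 272, 795),
    (34, 2, 68, 789),
    (34, 4, 136, 658),
    (4, 17, 68, 822),
    (4, 34, 136, 848),
    (8, 16, 128, 763),
    (68, 3, 204, 775),
]

# In a hybrid run the total thread count is ranks times threads per rank.
for mpi, omp, threads, _ in runs:
    assert mpi * omp == threads

# Rank configurations from fastest to slowest, relative to the best time.
best = min(runs, key=lambda r: r[3])
for mpi, omp, threads, elapsed in sorted(runs, key=lambda r: r[3]):
    print(f"{mpi:>3} MPI x {omp:>2} OMP ({threads:>3} threads): "
          f"{elapsed} s, {elapsed / best[3]:.2f}x the best time")
```

The listing makes the plateau visible: configurations using all 68 cores with 2 hardware threads per core come out ahead, while 4 threads per core do not help further.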
First Profiling using Intel VTune Amplifier
• Single thread profiling
  - Typical profiling of RADIOSS
  - Except very high cbilan
• Multi threads
  - Per routine CPU time x3 ~ x4
  - Explains the limited speed-up from 1 to 2, 3, and 4 threads achieved with HyperThreading
• Memory speed limiting factor?
  - Code performance limited by memory communication speed rather than flops
  - Lots of vector-based operations
  - Little memory reuse
[Figures: 68 MPI x 1 OMP profile; 68 MPI x 4 OMP profile]
Checking Vectorization with Intel Advisor
Using Intel Advisor for Code Optimization
cbilan example
Indirections slowed down efficiency
Code rewritten to gather global array into local vectors before compute
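The gather-before-compute rewrite mentioned for cbilan can be illustrated with a small sketch. RADIOSS is mostly Fortran and the actual body of cbilan is not shown in the slides, so everything below (function names, data, block size) is hypothetical and only demonstrates the general pattern: indirect reads are isolated in a short gather loop, and the arithmetic then runs over a contiguous local vector.

```python
def accumulate_indirect(values, index, weights):
    """Original pattern: every access goes through the index array."""
    total = 0.0
    for i in range(len(index)):
        total += values[index[i]] * weights[i]
    return total


def accumulate_gathered(values, index, weights, block=4):
    """Rewritten pattern: gather a block into a local vector, then compute."""
    total = 0.0
    for start in range(0, len(index), block):
        stop = min(start + block, len(index))
        # Gather step: the indirect reads are confined to this short loop.
        local = [values[index[i]] for i in range(start, stop)]
        # Compute step: unit-stride access only.
        for j, v in enumerate(local):
            total += v * weights[start + j]
    return total


# Hypothetical data: a global array addressed through a connectivity list.
global_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
element_nodes = [5, 0, 3, 3, 1, 4, 2]
element_weights = [0.5, 1.0, 2.0, 0.25, 1.5, 1.0, 3.0]

a = accumulate_indirect(global_values, element_nodes, element_weights)
b = accumulate_gathered(global_values, element_nodes, element_weights)
print(a, b)
```

Python itself gains nothing from this restructuring; in a compiled language the unit-stride compute loop is the part a compiler can vectorize efficiently, which is presumably what Intel Advisor pointed to.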
Compilers v16.0 vs Compiler 17 – NEON 1M 8ms
MPI  OpenMP  Threads  Compiler 16 Elapsed (s)  Compiler 17 Elapsed (s)  Gain
68   1       68       816                      705                      -14%
68   2       136      630                      624                      -1%
68   4       272      795                      647                      -19%
34   2       68       789                      626                      -21%
34   4       136      658                      611                      -7%
Compiler 17 (beta) always better than compiler 16
Best configuration using 34 MPI x 4 threads
Best elapsed time (s): 630 → 611
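The Gain column can be reproduced from the two elapsed-time columns. The sketch below assumes the gain is the relative change in elapsed time rounded to the nearest percent, which matches the figures on the slide; the helper name is made up for illustration.

```python
# (MPI, OMP, elapsed with compiler 16, elapsed with compiler 17), from the slide
rows = [
    (68, 1, 816, 705),
    (68, 2, 630, 624),
    (68, 4, 795, 647),
    (34, 2, 789, 626),
    (34, 4, 658, 611),
]


def gain_percent(old, new):
    """Relative change in elapsed time; negative means the new run is faster."""
    return round(100.0 * (new - old) / old)


gains = [gain_percent(c16, c17) for _, _, c16, c17 in rows]
for (mpi, omp, c16, c17), g in zip(rows, gains):
    print(f"{mpi} MPI x {omp} OMP: {c16} s -> {c17} s ({g}%)")
```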
Arithmetic Flags – NEON 1M 8ms
MPI  OMP  Threads  fp-model=precise no-fma  fp-model=consistent no-fma  fp-model=precise fma  fp-model=fast fma
68   1    68       705                      720                         -                     -
68   2    136      624                      629                         620                   610
68   4    272      647                      647                         631                   629
34   2    68       626                      654                         605                   588
34   4    136      611                      612                         631                   614
• fp-model=precise | consistent required for correctness
• consistent does not bring improvement versus precise
• Acceptable penalty to not use fma and fp-model=fast: ~3% at most
no consistency!
-xCOMMON-AVX512 vs -xMIC-AVX512 vs SSE3
• Gain between COMMON-AVX512 and SSE3 is significant: > 20%
• Gain between COMMON-AVX512 and MIC-AVX512 remains limited: < 5%
MPI  OMP  Threads  xSSE3  xCOMMON-AVX512  xMIC-AVX512
68   1    68       1070   705             688
68   2    136      947    624             611
68   4    272      799    647             608
34   2    68       1153   626             596
34   4    136      998    611             589
Profiling SSE3 vs AVX512 on KNL 1/3
[Figure: SSE3 profile vs AVX512 profile]
AVX512 efficient for computational routines
Profiling SSE3 vs AVX512 on KNL 2/3
[Figure: SSE3 profile vs AVX512 profile]
No improvement for gather/scatter routines
Profiling SSE3 vs AVX512 on KNL 3/3
[Figure: SSE3 profile vs AVX512 profile]
Specific issue?
Synthesis of First Optimization Work
• xCOMMON-AVX512 kept as default flag
  - Good performance on KNL
  - Ready to support Skylake
  - Compilation time is a concern when using too many architecture flags
  - A few routines still compiled with SSE3
• Advanced optimizations
  - Some specific tuning required, as in routine cbilan and a few others
• Reproducibility of results requirements
  - -no-fma
  - -fp-model precise
• Compiler updates
  - Compiler 16
  - Compiler 17 beta
  - Compiler 17 final upgrade
Additional Tests of Advanced Features
• Memory Modes
  - Cache mode: easiest mode, transparent to the application, but cache miss if data not in MCDRAM!
  - Flat mode: both types of memory available, may require additional programming
  - Hybrid: a percentage of MCDRAM reserved for cache and the rest for flat memory
• Cluster Modes
  - All-to-all (A2A): basic mode
  - Quadrant: tiles split into 4 parts (or 2 parts for hemisphere), each associated with a different memory controller; L2 cache miss latency reduced compared to A2A
  - Sub-NUMA Clustering: tiles split into 4 (SNC4) or 2 (SNC2) NUMA nodes; lowest latency for NUMA-aware applications
BIOS 10R02: Advanced → Uncore Configuration → Memory Mode → Cluster Mode
Final* Tests With Compiler 17 – Memory Mode
MPI  OMP  Threads  Cache  Flat
68   1    68       613    -
68   2    136      601    629
68   3    204      588    588
68   4    272      609    -
34   2    68       606    622
34   4    136      598    590
34   6    204      599    -
4    17   68       739    -
4    34   136      749    769
8    17   136      680    675
8    34   272      751    782
Cache and Flat modes deliver comparable performance for this moderate size model (under Quadrant cluster mode)
New best configuration with 68 MPIs x 3 OMP and 204 threads
Best elapsed time (s): 630 → 611 → 588
* Compiler 17 final release + all optimization changes implemented
Final Tests With Compiler 17 – Cluster Mode
MPI  OMP  Threads  Quadrant  SNC4
68   1    68       613       -
68   2    136      601       617
68   3    204      588       580
68   4    272      609       599
34   2    68       606       -
34   4    136      598       623
34   6    204      599       585
4    17   68       739       -
4    34   136      749       772
8    17   136      680       722
8    34   272      751       728
Quadrant and SNC4 perform similarly, with a tiny advantage for SNC4 (under Cache mode)
RADIOSS Hybrid MPI OpenMP is NUMA aware
New best elapsed time with 68 MPIs x 3 OMP and 204 threads
Best elapsed time (s): 630 → 588 → 580
Control of NUMA memory access
Cache / SNC4 example

[fr-piano]$ numastat
                node0    node1    node2    node3
numa_hit        454046   245362   234920   452032
numa_miss       0        0        0        0
numa_foreign    0        0        0        0
interleave_hit  18587    18416    18586    18419
local_node      450820   224056   213597   430684
other_node      3226     21306    21323    21348

[fr-piano 1M]$ numastat
                node0    node1    node2    node3
numa_hit        1151229  837142   733471   973063
numa_miss       0        0        0        0
numa_foreign    0        0        0        0
interleave_hit  18587    18416    18586    18419
local_node      1147957  815789   712098   951698
other_node      3272     21353    21373    21365

Good memory locality
No NUMA miss during the run
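The "good memory locality" reading of the numastat output can be quantified with a short sketch. The counters are copied from the second numastat listing on the slide; the locality metric (local_node divided by numa_hit) is an illustrative choice made here, not something the slides define.

```python
# Counters copied from the second numastat output, nodes 0 to 3.
numa_hit = [1151229, 837142, 733471, 973063]
numa_miss = [0, 0, 0, 0]
local_node = [1147957, 815789, 712098, 951698]
other_node = [3272, 21353, 21373, 21365]

local_fraction = []
for node in range(4):
    # With no misses, hits split exactly into local and remote requesters.
    assert local_node[node] + other_node[node] == numa_hit[node]
    local_fraction.append(local_node[node] / numa_hit[node])

for node, frac in enumerate(local_fraction):
    print(f"node{node}: {frac:.1%} of allocations requested by local processes")
print("total numa_miss:", sum(numa_miss))
```

Every node stays above 97% local allocations with zero misses, which is consistent with the slide's conclusion for the Cache / SNC4 configuration.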
Process Pinning under SNC4 – NEON 1M 8ms
Cache / SNC4 example
Elapsed (s) for three pinning settings:
(a) KMP_AFFINITY=scatter, I_MPI_PIN_DOMAIN=auto
(b) KMP_AFFINITY=compact,1,0,granularity=fine, I_MPI_PIN_DOMAIN=auto
(c) KMP_AFFINITY=scatter, I_MPI_PIN_DOMAIN=omp
MPI  OMP  Threads  (a)   (b)   (c)
68   1    68       -     -     -
68   2    136      617   595   949
68   3    204      580   582   668
68   4    272      599   599   -
34   4    136      623   599   939
34   6    204      585   586   674
4    34   136      772   745   1138
8    17   136      722   692   1018
8    34   272      728   752   732
KMP_AFFINITY=scatter and compact,1,0,granularity=fine are almost equivalent
I_MPI_PIN_DOMAIN must be set to auto, not omp
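The recommended pinning settings could be collected in a small launch helper such as the sketch below. The environment variable names come from the slides, except OMP_NUM_THREADS, which is the standard OpenMP variable and is assumed here; the executable name is a placeholder, and nothing is actually launched.

```python
def launch_settings(mpi_ranks, omp_threads, executable="./engine_placeholder"):
    """Return (environment variables, command line) for one hybrid run.

    The executable name and the use of OMP_NUM_THREADS are assumptions made
    for this sketch; the slides only name the pinning and debug variables.
    """
    env = {
        "I_MPI_PIN_DOMAIN": "auto",  # 'omp' was markedly slower in most rows
        "KMP_AFFINITY": "scatter",   # the compact variant performed about the same
        "OMP_NUM_THREADS": str(omp_threads),
        "I_MPI_DEBUG": "5",          # per the slides, reports pinning information
    }
    cmd = ["mpirun", "-np", str(mpi_ranks), executable]
    return env, cmd


# Best single-node configuration reported in the slides: 68 MPI x 3 OMP.
env, cmd = launch_settings(68, 3)
print(" ".join(f"{k}={v}" for k, v in env.items()), " ".join(cmd))
```

Keeping the settings in one function makes it easy to sweep the MPI x OMP combinations from the tables while holding the pinning policy fixed.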
Process Pinning – Details
• I_MPI_PIN_DOMAIN=auto (34 MPI x 4 OMP)
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 63170 fr-piano.europe.altair.com {0,1,68,69,136,137,204,205}
Uses 2 physical cores sharing L2 (1 MB) cache
Core 0: Thread 0, 1, 2, 3
Core 1: Thread 0, 1, 2, 3
• I_MPI_PIN_DOMAIN=omp (34 MPI x 4 OMP)
[0] MPI startup(): 0 80504 fr-piano.europe.altair.com {0,68,136,204}
Uses a single physical core and 4 threads sharing L1 (32 KB) cache
Core 0: Thread 0, 1, 2, 3
Note: use cpuinfo from Intel MPI to get the processor configuration and I_MPI_DEBUG=5 for pinning info
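The pin sets printed by Intel MPI can be decoded into physical cores and hardware threads. The sketch below assumes that logical CPU ids follow the pattern core + 68 x hardware thread, which is what the two pin sets above suggest for this 68-core system; on another machine the numbering should be checked with cpuinfo. The function name is invented for illustration.

```python
CORES = 68  # physical cores on the KNL 7250 used in the slides


def decode_pin_set(cpus, cores=CORES):
    """Group logical CPU ids by physical core.

    Assumes logical id = core + cores * hardware_thread, the numbering
    suggested by the pin sets in the slides.
    """
    by_core = {}
    for cpu in sorted(cpus):
        core, hw_thread = cpu % cores, cpu // cores
        by_core.setdefault(core, []).append(hw_thread)
    return by_core


auto_domain = decode_pin_set({0, 1, 68, 69, 136, 137, 204, 205})
omp_domain = decode_pin_set({0, 68, 136, 204})
print("auto:", auto_domain)
print("omp: ", omp_domain)
```

Under this numbering the auto domain spans two cores with four hardware threads each, while the omp domain packs all four OpenMP threads onto one core, which matches the slower timings reported for I_MPI_PIN_DOMAIN=omp.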
Quality Assurance
• QA Data Base
  - 2000+ regression tests
  - 60 customer models
• RADIOSS QA on KNL
  - Original validation of the baseline Xeon executable (SSE3)
    - No issue, backward compatibility verified
  - Validation of the AVX512 dedicated version
    - A few compiler issues detected at -O3
    - Workaround: lower the optimization level to -O2 (SSE3) or -O1 (AVX512)
  - OpenMP issues
    - Some calls to omp_set_lock crashed (SEGV inside)
    - Workaround: use a critical section instead
• Duration of the QA on KNL
  - Starter program to read, prepare and decompose the input deck is mostly serial (OpenMP)
  - Small tests take more time than on Xeon: too small to benefit from the many cores of KNL
Performance Comparison – NEON 1M Full Run
• KNL ~3 times faster than KNC
• AVX512 binary: ~30% perf improvement versus baseline executable (SSE3)
• KNL performance close to dual Xeon E5: equivalent to 2P E5-2698 v3, 32C, 2.3 GHz
[Bar chart: RADIOSS Performance – Elapsed Time (s), lower is better. Bars: KNC Reference, KNL Baseline, KNL Optimized, Xeon E5-2698 v3. Bar values: 6384, 18480, 8941, 6464. Configuration labels: 4 MPI x 8 OMP, 30 MPI x 6 OMP, 68 MPI x 3 OMP.]
KNL best configuration: Cache / SNC4 / scatter, 68 MPI x 3 OpenMP
First Cluster Tests (OPA) – NEON 1M Full Run
[Chart: RADIOSS Performance – Elapsed (s), from 0 to 7000 s, for 1, 2, 4 and 8 nodes. Series: E5-2698 v3 (4 MPI x 8 OMP), KNL 7250 (34 MPI x 6 OMP), Ratio E5 v3 / KNL]
Nodes  Ratio E5 v3 / KNL
1      0.98
2      0.83
4      0.63
8      0.39
Chart annotations: 272 MPIs, 1632 threads; 32 MPIs, 256 threads
Large Benchmark – Taurus 10 M
• 10-million-element Ford Taurus refined model
  - 500K solids
  - 9550K shells
  - 5K 1D elements
• Scalability study reduced to 10 ms
First Cluster Tests (OPA) – Taurus 10M, 10ms run
[Chart: RADIOSS Performance – Elapsed (s), 1 node vs 16 nodes, with ratio E5 v4 / KNL on the secondary axis]
Nodes  E5-2697 v4, 4 MPI x 9 OMP (s)  KNL 7250, 34 MPI x 6 OMP (s)  Ratio E5 v4 / KNL
1      44258                          58521                         0.76
16     4450                           6491                          0.69
16 nodes, E5-2697 v4: 64 MPIs, 576 threads, Speedup=10/16
16 nodes, KNL 7250: 544 MPIs, 3264 threads, Speedup=9/16
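The speedup and ratio figures on this slide follow from the four elapsed times. In the sketch below, the assignment of the elapsed values to the two platforms is inferred from the plotted ratios (0.76 and 0.69) and from the rank counts, so it should be read as an assumption; parallel efficiency is an extra metric computed here for illustration.

```python
# Elapsed times (s) for the Taurus 10M, 10 ms run, read from the chart.
# Which value belongs to which platform is inferred from the plotted ratios.
elapsed = {
    "E5-2697 v4": {1: 44258, 16: 4450},
    "KNL 7250": {1: 58521, 16: 6491},
}


def speedup(times, nodes):
    """Speedup of the multi-node run relative to the single-node run."""
    return times[1] / times[nodes]


for name, times in elapsed.items():
    s = speedup(times, 16)
    print(f"{name}: speedup {s:.1f} on 16 nodes, efficiency {s / 16:.0%}")

# Ratio of Xeon elapsed time to KNL elapsed time, as plotted on the slide.
ratio_1 = elapsed["E5-2697 v4"][1] / elapsed["KNL 7250"][1]
ratio_16 = elapsed["E5-2697 v4"][16] / elapsed["KNL 7250"][16]
print(f"ratio E5 v4 / KNL: {ratio_1:.2f} (1 node), {ratio_16:.2f} (16 nodes)")
```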
Concluding Remarks
• Xeon Phi many-core architecture
  - Offers more parallelism than any other Intel CPU: good scalability is crucial
  - RADIOSS performance on single KNL 7250 close to dual Xeon E5 v3
  - Consider performance/Watt and per $ when comparing to high-end Xeon E5 v4
  - KNL-F with integrated Omni-Path fabric to sustain performance on cluster
• AVX512 RADIOSS optimized version
  - Up to 30% performance improvement versus non-AVX binary on Xeon Phi processor
  - Future tests on Xeon Skylake
  - RADIOSS beta version available, official version to be released with HyperWorks 2017
• Altair leadership in solver performance
  - Highly parallel solver technologies based on hybrid MPI OpenMP
  - HyperWorks “Unlimited Solver Node” licensing leveraging customer’s ROI on HPC
• Fruitful long-term collaboration with Intel is very helpful
Visit us at Supercomputing 2016
Join Altair at SC’16, November 14-17
Booth #1811
Free workshops, technical briefings, talks, demos… and much more!
Thank you for your attention!
Eric Lequiniou | HPC Director | elequiniou@altair.com | altairhyperworks.com