porting industrial application on intel® xeon phi™: … · altair radioss case study developer...

Porting industrial application on Intel Xeon Phi:

Altair RADIOSS case study

Developer feedback and outlook

Eric Lequiniou, Director, HPC

November 2016

Altair HyperWorks: Simulation-driven Innovation

Getting to the right design

Saving time in the process

Access to the latest technologies

Modern, open architecture CAE simulation platform, offering the best technologies to design and optimize high performance, weight efficient and innovative products.

Learn more at: altairhyperworks.com

Altair Solver Technology

Multiphysics Analysis and Optimization

Structural Analysis

Manufacturing Simulation

Systems Simulation

Fluid Dynamics

Thermal Analysis

Crash, Safety, Impact & Blast

Electro-Magnetics

Digital Materials

Altair Solver Brands

CFD and Thermal

Explicit

Crash, Safety

Forming, Blast

Gravity, Springback

Multi-body Dynamics

OptiStruct, RADIOSS, MotionSolve, AcuSolve, nanoFluidX

Design and Optimization

HyperStudy

FEKO, Flux

Electro-Magnetics

Implicit

Durability, Vibrations, Acoustics, Buckling

Heat Transfer

RADIOSS – Crash, Safety & Impact

Altair RADIOSS is a leading structural analysis solver for non-linear problems under dynamic loadings.

It is highly differentiated for scalability, quality, robustness, and consists of features for multiphysics simulation and advanced materials such as composites.

RADIOSS is used across many industries to improve the crashworthiness, safety, and manufacturability of structural designs.

Learn more at altairhyperworks.com/radioss

CPU Challenges in CAE & Crash Simulation

No more “free lunch” on the CPU side: CPU frequency and intrinsic performance tend to flatten

Increase parallel scalability to meet growing CAE computing needs

● Increased number of products and simulation load cases
  ● Growth in product portfolio
  ● Simulation load cases increase due to regulation requirements: ~30 safety load cases for crash tests
● Requirement for increasing accuracy to answer the CO2 reduction challenge
  ● Fracture prediction and correlation leading to finer element meshes
  ● Manufacturing process as initial conditions of crash (initialization)
● Stochastic/robustness analysis
  ● Inherent to the sensitivity of the underlying physics and bifurcations in real tests
  ● Need to run hundreds of variants to get confidence in results (corridor/worst case)
● Design Optimization
  ● Numerous iterations to automatically improve product performance

Key Technologies – RADIOSS Hybrid MPI OpenMP

● Enhanced performance
  ● High efficiency on large HPC clusters
  ● Flexibility – easy tuning of MPI & OpenMP
  ● Unique proven method for rich scalability over thousands of cores for FEA
  ● Double Precision as default – Extended Single Precision ~ 1.5X faster
● Robustness
  ● Parallel arithmetic option allows perfect repeatability in parallel
● Highly parallel code with Hybrid model
  ● Domain decomposition with MPI
  ● OpenMP parallelization
  ● Explicit multitasking
  ● Loop auto-parallelization

Intel and Altair – Partners in HPC

A history of collaboration…

● Cluster Management: PBS-Intel integrations
  ● MPI integration
  ● Intel® Cluster Checker
● Certifications
  ● Intel Cluster-Ready & Intel Scalable System Framework (SSF)
  ● PBS Professional
  ● Solvers (RADIOSS, OptiStruct, AcuSolve, FEKO)
● Application Integration: Use of Intel tools and technologies
  ● Intel® MPI library, Intel® Fortran & C++ compilers, Intel® MKL Library, Intel® VTune™ Amplifier XE, Intel® Advisor, Intel® Trace Analyzer & Collector
  ● Benchmarking activities on large cluster configurations
● Professional Support: Close collaboration among technical personnel
  ● Access to Intel hardware resources: SDP systems, large clusters
  ● Intel technical expertise helps us to optimize our software on Intel systems

Motivation to Port on Xeon Phi

• Many-core processor architecture
  • Aimed at large power-efficient clusters and supercomputers
  • Price-performance ratio
• New generation of Xeon Phi looks really promising
  • Faster CPU based on Atom
  • Faster MCDRAM memory
  • New AVX512 vector instruction set
  • Future KNL-F coming with Omni-Path
• Assess the potential of the Xeon Phi
  • RADIOSS was already ported to KNC
  • Hybrid MPI + OpenMP parallelism fits well with KNL architecture
  • Need to prepare for AVX512
  • Port additional solvers in a second step

Intel Xeon Phi – System Configuration

• Intel Knights Landing 7250
  • 68 cores / 272 threads
  • 1.4 GHz clock speed
  • 16 GB MCDRAM
  • 96 GB DDR4 2400
  • CentOS Linux 7
• Default Configuration
  • Cache mode
  • Quadrant
  • KMP_AFFINITY=scatter

RADIOSS Baseline Version

• Hybrid MPI OpenMP standard version running on Xeon
  • Compiled using ifort and icc – several million lines, mostly Fortran
  • SSE3 only – AVX not supported
  • Constraints regarding reproducibility – specific flags: -fp-model precise
  • Intel MPI for communication between nodes (distributed memory)
  • MPI and OpenMP setup optimized versus number of sockets and number of cores
  • Double precision (default)
• Aimed to run without modification on Xeon Phi KNL
  • Backward compatibility between Xeon code and Xeon Phi
  • AVX512 and AVX performance missing

Xeon Phi Programming Environment

• Compilation of the code
  • Intel Parallel Studio XE 2016 & 2017
    • Compilers: ifort, icc
    • MPI library
  • AVX512 support
    • -xCOMMON-AVX512: common between KNL and future Xeon Skylake (AVX512-F & AVX512-CD)
  • Restriction to keep parallel arithmetic
    • -no-fma
    • -fp-model precise
• Debugging & performance optimization
  • Intel tools: VTune Amplifier, Advisor, ITAC

RADIOSS Benchmarks

• Initial benchmark: Neon 1 million elements front crash
  • Big enough to test scalability up to 68 cores / 272 threads
  • Small enough to fit in 16 GB MCDRAM
  • 80 ms full run reduced to 8 ms for initial performance analysis
• Additional QA tests and customer models
• Larger benchmark: Taurus refined with 10 million elements

First Test with compilers v16.0 – NEON 1M 8ms

MPI   OpenMP   Threads   Elapsed (s)
68    1        68        816
68    2        136       630
68    4        272       795
34    2        68        789
34    4        136       658
4     17       68        822
4     34       136       848
8     16       128       763
68    3        204       775

Best configuration with 68 MPIs and 136 threads: 1.23x faster than baseline

Baseline reference
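Each row of the table is a hybrid layout whose thread count is the product MPI x OpenMP. A minimal sketch of that bookkeeping, assuming the 68 cores / 272 hardware threads of the KNL 7250 described earlier (the helper name `describe` is only illustrative):

```python
# Hardware figures taken from the slides: one KNL 7250 has 68 cores / 272 hardware threads.
CORES = 68
HW_THREADS = 272


def describe(mpi_ranks: int, omp_threads: int) -> dict:
    """Return total threads and average threads per core for one hybrid layout."""
    total = mpi_ranks * omp_threads
    if total > HW_THREADS:
        raise ValueError(f"{total} threads exceed the {HW_THREADS} hardware threads")
    return {"threads": total, "threads_per_core": total / CORES}


# (MPI ranks, OpenMP threads per rank) for the rows of the table above.
configs = [(68, 1), (68, 2), (68, 4), (34, 2), (34, 4), (4, 17), (4, 34), (8, 16), (68, 3)]
summary = {cfg: describe(*cfg) for cfg in configs}

for (mpi, omp), info in summary.items():
    print(f"{mpi:>3} MPI x {omp:>2} OMP -> {info['threads']:>3} threads, "
          f"{info['threads_per_core']:.2f} per core")
```

Layouts above one thread per core rely on HyperThreading, which is where the table shows the gains becoming uneven.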

First Profiling using Intel VTune Amplifier

• Single-thread profiling
  - Typical profiling of RADIOSS
  - Except very high cbilan
• Multiple threads
  - Per-routine CPU time x3 ~ x4
  - Explains the limited speed-up from 1 to 2, 3, and 4 threads achieved with HyperThreading
• Memory speed limiting factor?
  - Code performance limited by memory communication speed rather than flops
  - Lots of vector-based operations
  - Little memory reuse

[Figures: 68 MPI x 1 OMP Profile; 68 MPI x 4 OMP Profile]

Checking Vectorization with Intel Advisor

Using Intel Advisor for Code Optimization

Indirections slowed down efficiency. Code rewritten to gather global arrays into local vectors before compute.

cbilan example
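The gather-before-compute rewrite can be sketched as follows. This is not the cbilan routine: the arrays, the index list and the quantity computed are made-up stand-ins, and Python only shows the shape of the change. In compiled Fortran, the point would be that the second loop runs over contiguous local vectors that a vectorizer handles more easily.

```python
def energy_indirect(mass, velocity, node_ids):
    """Compute loop reads the global arrays through the index list at every step."""
    total = 0.0
    for n in node_ids:
        total += 0.5 * mass[n] * velocity[n] * velocity[n]
    return total


def energy_gathered(mass, velocity, node_ids):
    """Gather into contiguous local vectors first, then compute on those."""
    local_m = [mass[n] for n in node_ids]      # gather step (indirect reads)
    local_v = [velocity[n] for n in node_ids]  # gather step (indirect reads)
    total = 0.0
    for m, v in zip(local_m, local_v):         # compute loop, unit stride only
        total += 0.5 * m * v * v
    return total


# Hypothetical global arrays and an element's node list.
mass = [1.0, 2.0, 3.0, 4.0]
velocity = [0.5, 1.0, 1.5, 2.0]
node_ids = [3, 0, 2]

result_indirect = energy_indirect(mass, velocity, node_ids)
result_gathered = energy_gathered(mass, velocity, node_ids)
print(result_indirect, result_gathered)
```

Both versions perform the same operations in the same order, so the rewrite changes memory access only and leaves the result untouched.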

Compiler v16.0 vs Compiler 17 – NEON 1M 8ms

MPI   OpenMP   Threads   Compiler 16 Elapsed (s)   Compiler 17 Elapsed (s)   Change
68    1        68        816                       705                       -14%
68    2        136       630                       624                       -1%
68    4        272       795                       647                       -19%
34    2        68        789                       626                       -21%
34    4        136       658                       611                       -7%

Compiler 17 (beta) always better than compiler 16

Best configuration using 34 MPI x 4 threads

Best elapsed time: 630 s → 611 s
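The percentage column can be reproduced from the two elapsed-time columns. A small sketch using the table's numbers, where the change is taken as (new - old) / old and rounded to the nearest percent:

```python
# (MPI ranks, OpenMP threads, compiler 16 elapsed s, compiler 17 elapsed s)
rows = [
    (68, 1, 816, 705),
    (68, 2, 630, 624),
    (68, 4, 795, 647),
    (34, 2, 789, 626),
    (34, 4, 658, 611),
]


def percent_change(old: float, new: float) -> float:
    """Relative change of the elapsed time, in percent (negative means faster)."""
    return 100.0 * (new - old) / old


changes = {(mpi, omp): round(percent_change(c16, c17)) for mpi, omp, c16, c17 in rows}
best = min(rows, key=lambda row: row[3])  # lowest compiler 17 elapsed time

for (mpi, omp), change in changes.items():
    print(f"{mpi} MPI x {omp} OMP: {change:+d}%")
print("best with compiler 17:", best[0], "MPI x", best[1], "OMP,", best[3], "s")
```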

Arithmetic Flags – NEON 1M 8ms

MPI   OMP   Threads   fp-model=precise   fp-model=consistent   fp-model=precise   fp-model=fast
                      no-fma             no-fma                fma                fma
68    1     68        705                720                   -                  -
68    2     136       624                629                   620                610
68    4     272       647                647                   631                629
34    2     68        626                654                   605                588
34    4     136       611                612                   631                614

fma columns: no consistency!

• fp-model=precise | consistent required for correctness
• consistent does not bring improvement versus precise
• Acceptable penalty to not use fma and fp-model=fast: ~3% at most
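The reason these flags matter for repeatability is that floating-point addition is not associative, so any change in evaluation order (or in rounding, as with fused multiply-add, which rounds once instead of twice) can change the last bits of a result. A minimal illustration follows. `math.fsum` is used here only as a stand-in for an order-independent summation; it is not how the RADIOSS parallel arithmetic option is implemented.

```python
import math
import random

# Same three numbers, two evaluation orders, two different doubles.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# An order-independent sum: math.fsum returns the correctly rounded total,
# so shuffling the inputs cannot change the result.
random.seed(0)
values = [random.uniform(-1.0e6, 1.0e6) for _ in range(10_000)]
shuffled = values[:]
random.shuffle(shuffled)

print(math.fsum(values) == math.fsum(shuffled))
```

A compiler allowed to reassociate sums or contract multiply-add pairs can produce either ordering, which is the kind of variation the no-fma and precise settings rule out.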

-xCOMMON-AVX512 vs -xMIC-AVX512 vs SSE3

MPI   OMP   Threads   xSSE3   xCOMMON-AVX512   xMIC-AVX512
68    1     68        1070    705              688
68    2     136       947     624              611
68    4     272       799     647              608
34    2     68        1153    626              596
34    4     136       998     611              589

• Gain between COMMON-AVX512 and SSE3 is significant: > 20%
• Gain between COMMON-AVX512 and MIC-AVX512 remains limited: < 5%
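The per-row gains can be recomputed from the table. The sketch below reads "gain" as the ratio of elapsed times (slower / faster), which is an assumption about how the slide defines it:

```python
# (MPI ranks, OpenMP threads) -> elapsed seconds for (xSSE3, xCOMMON-AVX512, xMIC-AVX512)
rows = {
    (68, 1): (1070, 705, 688),
    (68, 2): (947, 624, 611),
    (68, 4): (799, 647, 608),
    (34, 2): (1153, 626, 596),
    (34, 4): (998, 611, 589),
}


def speedup(slow: float, fast: float) -> float:
    """Elapsed-time ratio: values above 1.0 mean the second build is faster."""
    return slow / fast


common_vs_sse3 = {cfg: speedup(sse3, common) for cfg, (sse3, common, _) in rows.items()}
mic_vs_common = {cfg: speedup(common, mic) for cfg, (_, common, mic) in rows.items()}

for cfg in rows:
    print(f"{cfg[0]} MPI x {cfg[1]} OMP: "
          f"COMMON vs SSE3 {100 * (common_vs_sse3[cfg] - 1):.1f}%, "
          f"MIC vs COMMON {100 * (mic_vs_common[cfg] - 1):.1f}%")
```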

Profiling SSE3 vs AVX512 on KNL 1/3

[Figure: profile panel labeled AVX512]

AVX512 efficient for computational routines

Profiling SSE3 vs AVX512 on KNL 2/3

[Figure: profile panels labeled SSE3 and AVX512]

No improvement for gather/scatter routines

Profiling SSE3 vs AVX512 on KNL 3/3

[Figure: profile panel labeled AVX512]

Specific issue?

Synthesis of First Optimization Work

• xCOMMON-AVX512 kept as default flag
  • Good performance on KNL
  • Ready to support Skylake
  • Compilation time is a concern when using too many architecture flags
  • A few routines still compiled with SSE3
• Advanced optimizations
  • Some specific tuning required, as in routine cbilan and a few others
• Reproducibility of results requirements
  • -no-fma
  • -fp-model precise
• Compiler updates
  • Compiler 16
  • Compiler 17 beta
  • Compiler 17 final upgrade

Additional Tests of Advanced Features

• Memory Modes
  • Cache mode: easiest mode, transparent to the application, but cache miss if data not in MCDRAM!
  • Flat mode: both types of memory available, may require additional programming
  • Hybrid: a percentage of MCDRAM reserved for cache and the rest for flat memory
• Cluster Modes
  • All-to-all (A2A): basic mode
  • Quadrant: tiles split into 4 parts (or 2 parts for hemisphere), each associated with a different memory controller; L2 cache miss latency reduced compared to A2A
  • Sub-NUMA Clustering: tiles split into 4 (SNC4) or 2 (SNC2) NUMA nodes; lowest latency for NUMA-aware applications

BIOS 10R02: Advanced → Uncore Configuration → Memory Mode → Cluster Mode

Final* Tests With Compiler 17 – Memory Mode

MPI   OMP   Threads   Cache   Flat
68    1     68        613     -
68    2     136       601     629
68    3     204       588     588
68    4     272       609     -
34    2     68        606     622
34    4     136       598     590
34    6     204       599     -
4     17    68        739     -
4     34    136       749     769
8     17    136       680     675
8     34    272       751     782

Cache and Flat modes deliver comparable performance for this moderate size model (under Quadrant cluster mode)

New best configuration with 68 MPIs x 3 OMP and 204 threads

Best elapsed time: 630 s → 611 s → 588 s

* Compiler 17 final release + all optimization changes implemented

Final Tests With Compiler 17 – Cluster Mode

MPI   OMP   Threads   Quadrant   SNC4
68    1     68        613        -
68    2     136       601        617
68    3     204       588        580
68    4     272       609        599
34    2     68        606        -
34    4     136       598        623
34    6     204       599        585
4     17    68        739        -
4     34    136       749        772
8     17    136       680        722
8     34    272       751        728

Quadrant and SNC4 perform similarly, with a tiny advantage for SNC4 (under Cache mode)

RADIOSS Hybrid MPI OpenMP NUMA aware

New best elapsed time with 68 MPIs x 3 OMP and 204 threads

Best elapsed time: 630 s → 588 s → 580 s

Control of NUMA memory access

Cache / SNC4 example

[fr-piano]$ numastat
                 node0     node1     node2     node3
numa_hit         454046    245362    234920    452032
numa_miss        0         0         0         0
numa_foreign     0         0         0         0
interleave_hit   18587     18416     18586     18419
local_node       450820    224056    213597    430684
other_node       3226      21306     21323     21348

[fr-piano 1M]$ numastat
                 node0     node1     node2     node3
numa_hit         1151229   837142    733471    973063
numa_miss        0         0         0         0
numa_foreign     0         0         0         0
interleave_hit   18587     18416     18586     18419
local_node       1147957   815789    712098    951698
other_node       3272      21353     21373     21365

Good memory locality. No NUMA miss during the run.
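The locality claim can be checked by differencing the two numastat snapshots. The sketch below assumes the first snapshot was taken before the 1M run and the second after it, which the prompts suggest but the slide does not state:

```python
NODES = ("node0", "node1", "node2", "node3")

# Counters copied from the two numastat listings above.
before = {
    "numa_hit": (454046, 245362, 234920, 452032),
    "numa_miss": (0, 0, 0, 0),
    "other_node": (3226, 21306, 21323, 21348),
}
after = {
    "numa_hit": (1151229, 837142, 733471, 973063),
    "numa_miss": (0, 0, 0, 0),
    "other_node": (3272, 21353, 21373, 21365),
}


def delta(counter: str) -> tuple:
    """Per-node growth of one counter between the two snapshots."""
    return tuple(a - b for a, b in zip(after[counter], before[counter]))


hits = delta("numa_hit")
misses = delta("numa_miss")
remote = delta("other_node")

for node, hit, miss, rem in zip(NODES, hits, misses, remote):
    print(f"{node}: +{hit} hits, +{miss} misses, +{rem} allocations from other nodes")
```

Under that assumption, each node gains several hundred thousand hits while the other_node counters grow by only a few dozen.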

Process Pinning under SNC4 – NEON 1M 8ms

Cache / SNC4 example

MPI   OMP   Threads   scatter   compact,1,0,granularity=fine   scatter, I_MPI_PIN_DOMAIN=omp
68    1     68        -         -                              -
68    2     136       617       595                            949
68    3     204       580       582                            668
68    4     272       599       599 *                          -
34    4     136       623       599                            939
34    6     204       585       586                            674
4     34    136       772       745                            1138
8     17    136       722       692                            1018
8     34    272       728       752                            732

* Only two values survive for this row; the second may belong to either of the last two columns.

KMP_AFFINITY=scatter and KMP_AFFINITY=compact,1,0,granularity=fine are almost equivalent

I_MPI_PIN_DOMAIN must be set to auto, not omp
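One way to pass these settings to a run is through the Intel MPI launcher's `-genv` option. The sketch below only builds the command line and does not execute it. The executable name `./radioss_engine` is hypothetical, and a real RADIOSS launch needs further arguments that the slides do not show.

```python
def launch_command(mpi_ranks: int, omp_threads: int, executable: str,
                   affinity: str = "scatter") -> list:
    """Build an Intel MPI command line with the pinning settings discussed above."""
    return [
        "mpirun", "-n", str(mpi_ranks),
        "-genv", "OMP_NUM_THREADS", str(omp_threads),
        "-genv", "KMP_AFFINITY", affinity,
        "-genv", "I_MPI_PIN_DOMAIN", "auto",   # auto, not omp, per the table above
        "-genv", "I_MPI_DEBUG", "5",           # prints the pinning map at startup
        executable,
    ]


# Best layout reported so far: 68 MPI x 3 OMP.
cmd = launch_command(68, 3, "./radioss_engine")  # hypothetical executable name
print(" ".join(cmd))
```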

Process Pinning – Details

• I_MPI_PIN_DOMAIN=auto (34 MPI x 4 OMP)

[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 63170 fr-piano.europe.altair.com {0,1,68,69,136,137,204,205}

Uses 2 physical cores sharing L2 (1 MB) cache

Core 0: Thread 0, 1, 2, 3
Core 1: Thread 0, 1, 2, 3

• I_MPI_PIN_DOMAIN=omp (34 MPI x 4 OMP)

[0] MPI startup(): 0 80504 fr-piano.europe.altair.com {0,68,136,204}

Uses a single physical core and 4 threads sharing L1 (32 KB) cache

Core 0: Thread 0, 1, 2, 3

Note: use cpuinfo from Intel MPI to get the processor configuration and I_MPI_DEBUG=5 for pinning info
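The pin lists can be turned into physical cores with a few lines. This sketch assumes that on this 68-core machine logical CPUs n, n+68, n+136 and n+204 are the four hardware threads of core n, which matches how the slide reads the two listings:

```python
import re

CORES = 68  # physical cores on the KNL 7250


def pinned_cores(startup_line: str) -> list:
    """Extract the {…} CPU list from an 'MPI startup()' line and map it to cores."""
    match = re.search(r"\{([\d,]+)\}", startup_line)
    if match is None:
        raise ValueError("no pin list found in line")
    cpus = [int(cpu) for cpu in match.group(1).split(",")]
    return sorted({cpu % CORES for cpu in cpus})


auto_line = "[0] MPI startup(): 0 63170 fr-piano.europe.altair.com {0,1,68,69,136,137,204,205}"
omp_line = "[0] MPI startup(): 0 80504 fr-piano.europe.altair.com {0,68,136,204}"

auto_cores = pinned_cores(auto_line)
omp_cores = pinned_cores(omp_line)
print("auto ->", auto_cores)
print("omp  ->", omp_cores)
```

With auto, rank 0 gets two cores for its 4 OpenMP threads. With omp, all 4 threads land on one core.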

Quality Assurance

• QA Data Base
  • 2000+ regression tests
  • 60 customer models
• RADIOSS QA on KNL
  • Original validation of the baseline Xeon executable (SSE3)
    • No issue, backward compatibility verified
  • Validation of the AVX512 dedicated version
    • A few compiler issues detected at -O3
    • Workaround: reduce optimization to -O2 (SSE3) or -O1 (AVX512)
  • OpenMP issues
    • Some calls to omp_set_lock crashed (SEGV inside)
    • Workaround: use a critical section instead
• Duration of the QA on KNL
  • Starter program to read, prepare and decompose the input deck is mostly serial (OpenMP)
  • Small tests take more time than on Xeon – too small to benefit from KNL many cores

Performance Comparison – NEON 1M Full Run

• KNL ~ 3 times faster than KNC
• AVX512 binary: ~ 30% perf improvement versus baseline executable (SSE3)
• KNL performance close to dual Xeon E5 – equivalent to 2P E5-2698 v3, 32 cores, 2.3 GHz

[Figure: bar chart, RADIOSS Performance – Elapsed Time (s). Bars: KNC Reference, KNL Baseline, KNL Optimized, Xeon E5-2698 v3. Configuration labels: 4 MPI x 8 OMP, 30 MPI x 6 OMP, 68 MPI x 3 OMP. Bar values as extracted: 89416464]

KNL best configuration: Cache / SNC4 / scatter, 68 MPI x 3 OpenMP

First Cluster Tests (OPA) – NEON 1M Full Run

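[Figure: bar chart, RADIOSS Performance – Elapsed (s), on 1, 2, 4 and 8 nodes. Series: E5-2698 v3 (4 MPI x 8 OMP per node), KNL 7250 (34 MPI x 6 OMP per node), Ratio E5 v3 / KNL. At 8 nodes: KNL 7250 with 272 MPIs and 1632 threads; E5-2698 v3 with 32 MPIs and 256 threads.]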

Large Benchmark – Taurus 10 M

• 10 million element FORD Taurus refined model
  • 500K solids
  • 9550K shells
  • 5K 1D elements
• Scalability study reduced to 10 ms

First Cluster Tests (OPA) – Taurus 10M, 10ms run

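[Figure: bar chart, RADIOSS Performance – Elapsed (s), on 1 node and 16 nodes. Series: E5-2697 v4 (4 MPI x 9 OMP per node), KNL 7250 (34 MPI x 6 OMP per node), Ratio E5 v4 / KNL: 0.76 (1 node), 0.69 (16 nodes). At 16 nodes: E5-2697 v4 with 64 MPIs and 576 threads, speedup = 10/16; KNL 7250 with 544 MPIs and 3264 threads, speedup = 9/16.]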
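The rank and thread totals on the cluster slides follow from the per-node layout, and the speedup labels translate into a parallel efficiency. A small sketch, reading "Speedup=10/16" as a speedup of 10 on 16 nodes (an interpretation, not something the slide spells out):

```python
def cluster_layout(nodes: int, mpi_per_node: int, omp_per_rank: int) -> tuple:
    """Total MPI ranks and total threads for a given per-node hybrid layout."""
    ranks = nodes * mpi_per_node
    return ranks, ranks * omp_per_rank


def efficiency(speedup: float, nodes: int) -> float:
    """Parallel efficiency relative to a single node."""
    return speedup / nodes


xeon_16 = cluster_layout(16, 4, 9)    # E5-2697 v4, 4 MPI x 9 OMP per node
knl_16 = cluster_layout(16, 34, 6)    # KNL 7250, 34 MPI x 6 OMP per node
knl_8 = cluster_layout(8, 34, 6)      # NEON 1M test on 8 KNL nodes

print("Xeon, 16 nodes:", xeon_16, f"efficiency {efficiency(10, 16):.1%}")
print("KNL, 16 nodes: ", knl_16, f"efficiency {efficiency(9, 16):.1%}")

# Cross-check: the 1-node time ratio scaled by the two speedups.
ratio_16 = 0.76 * 9 / 10
print(f"estimated E5 v4 / KNL ratio at 16 nodes: {ratio_16:.2f}")
```

The estimated 16-node ratio comes out at about 0.68, in line with the 0.69 shown on the chart given that the inputs are rounded.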

Concluding Remarks

• Xeon Phi many-core architecture
  • Offers more parallelism than any other Intel CPU – good scalability is crucial
  • RADIOSS performance on a single KNL 7250 is close to dual Xeon E5 v3
  • Consider performance per Watt and per $ when comparing to high-end Xeon E5 v4
  • KNL-F with integrated Omni-Path fabric to sustain performance on clusters
• AVX512 RADIOSS optimized version
  • Up to 30% performance improvement versus non-AVX binary on Xeon Phi processor
  • Future tests on Xeon Skylake
  • RADIOSS Beta version available, official version to be released with HyperWorks 2017
• Altair leadership in solver performance
  • Highly parallel solver technologies based on hybrid MPI OpenMP
  • HyperWorks “Unlimited Solver Node” licensing leveraging customer’s ROI on HPC
• Fruitful long-term collaboration with Intel is very helpful

Visit us at Supercomputing 2016

Join Altair at SC’16, November 14-17

Booth #1811

Free workshops, technical briefings, talks, demos… and much more!

Thank you for your attention!

Eric Lequiniou | HPC Director | elequiniou@altair.com | altairhyperworks.com
