



HPC and IT Issues
Session Agenda

Deployment of Simulation (Trends and Issues Impacting IT)
– Discussion

Mapping HPC to Performance (Scaling, Technology Advances)
– Discussion

Optimizing IT for Remote Access (Private / Public Cloud Computing)
– Discussion

Break

Let’s Talk about Hardware (Specifying HW for ANSYS)
– Discussion

ANSYS Roadmap (Toward aligned IT Planning)

Tomorrow
1:1 Meeting Time


Mapping HPC to ANSYS Performance
Technology for Simulation Excellence

Barbara [email protected]


ANSYS HPC Leadership
► 15% spent on R&D
► 570 software developers
► Partner relationships

A History of HPC Performance
1980s ► Vector processing on mainframes
1990 ► Shared memory multiprocessing for structural simulations
1993 ► 1st general-purpose parallel CFD with interactive client-server user environment
1994 ► Iterative PCG solver introduced for large structural analysis
1994 - 1995 ► Parallel dynamic mesh refinement and coarsening ► Dynamic load balancing
1998 - 1999 ► Integration with load management systems ► Support for Linux clusters, low-latency interconnects ► 10M cell fluids simulations, 128 processors
1999 - 2000 ► 64-bit large memory addressing ► Shared memory multiprocessing (HFSS 7)
2001 - 2003 ► Parallel dynamic moving/deforming mesh ► Distributed memory particle tracking
2004 ► 1st company to solve 100M structural DOF
2005 - 2006 ► Parallel meshing (fluids) ► Support for clusters using Windows HPC
2005 - 2007 ► Distributed sparse solver ► Distributed PCG solver ► Variational Technology ► DANSYS released ► Distributed Solve (DSO), HFSS 10
2007 - 2008 ► Optimized performance on multicore processors ► 1st one billion cell fluids simulation
2009 ► Ideal scaling to 2048 cores (fluids) ► Teraflop performance at 512 cores (structures) ► Parallel I/O (fluids) ► Domain decomposition introduced (HFSS 12)
2010 ► Hybrid parallel for sustained multi-core performance (fluids) ► GPU acceleration (structures)

Today’s multi-core / many-core hardware evolution makes HPC a software development imperative. ANSYS is committed to maintaining performance leadership.


HPC – A Software Development Imperative

• Clock Speed – Leveling off

• Core Counts – Growing
• Exploding (GPUs)

• Future performance depends on highly scalable parallel software

Source: http://www.lanl.gov/news/index.php/fuseaction/1663.article/d/20085/id/13277


ANSYS FLUENT Scaling Achievement

[Charts: FLUENT rating vs. number of cores. 2008 hardware (Intel Harpertown, DDR IB): releases 6.3.35 and 12.0.5 vs. ideal, up to ~1200 cores. 2010 hardware (Intel Westmere, QDR IB): releases 12.1.0 and 13.0.0 vs. ideal, up to 1536 cores.]

Systems keep improving: faster processors, more cores
• Ideal rating (speed) doubled in two years!

Memory bandwidth per core and network latency/BW stress scalability
• 2008 release (12.0) re-architected MPI – huge scaling improvement, for a while…
• 2010 release (13.0) introduces hybrid parallelism – and scaling continues!
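A note on the vertical axis of the charts: the FLUENT benchmark "rating" is a throughput measure. Assuming the usual convention of benchmark jobs per day (treat the exact constant here as an assumption), rating, speedup, and parallel efficiency on p cores relate as:

```latex
% Rating as throughput (benchmark jobs per day) -- assumes the usual
% FLUENT benchmark convention of an 86,400-second day; T_p is the
% wall-clock time of one benchmark job on p cores.
\[
  \mathrm{Rating}(p) = \frac{86400}{T_p}, \qquad
  S(p) = \frac{T_1}{T_p} = \frac{\mathrm{Rating}(p)}{\mathrm{Rating}(1)}, \qquad
  E(p) = \frac{S(p)}{p}.
\]
```

The "IDEAL" curves correspond to E(p) = 1, i.e. rating growing linearly with core count.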


Extreme CFD Scaling - 1000’s of cores

Enabled by ongoing software innovation

Hybrid parallel: fast shared memory communication (OpenMP) within a machine to speed up overall solver performance; distributed memory (MPI) between machines
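As a concrete, deliberately generic illustration of that hybrid model (distributed-memory MPI between machines, shared-memory OpenMP threads within each machine), a minimal sketch in C might look like the following. It is not ANSYS code; the loop body is just a stand-in for per-cell solver work.

```c
/* Minimal hybrid MPI + OpenMP sketch: one MPI rank per machine,
 * OpenMP threads across the cores inside that machine.
 * Build (typical): mpicc -fopenmp hybrid.c -o hybrid
 * Run   (typical): mpirun -np <machines> ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Ask MPI for thread support so OpenMP regions are safe alongside MPI. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Distributed-memory part: each rank owns a slice of the work. */
    const long n_local = 1000000;
    double local_sum = 0.0;

    /* Shared-memory part: OpenMP threads split the local slice. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = 0; i < n_local; ++i)
        local_sum += 1.0;           /* stand-in for per-cell solver work */

    /* Inter-machine communication happens only at the MPI level. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d total=%.0f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

The point of the pattern is that message passing stops at the machine boundary, so the interconnect carries far fewer, larger messages while the cores inside each box share memory directly.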


Parallel Scaling ANSYS Mechanical

[Charts: ANSYS Mechanical solution rating vs. number of cores, R12.1 vs. R13.0. Sparse Solver (parallel re-ordering) up to 256 cores; PCG Solver (pre-conditioner scaling) up to 64 cores.]

Focus on bottlenecks in the distributed memory solvers (DANSYS)

● Sparse Solver
• Parallelized equation ordering
• 40% faster with updated Intel MKL

● Preconditioned Conjugate Gradient (PCG) Solver
• Parallelized preconditioning step (see the PCG sketch below)
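For readers less familiar with PCG, the sketch below shows where that preconditioning step sits: it is applied to the residual in every iteration, so its parallel scaling matters as much as the matrix-vector product's. This is a generic textbook PCG with a simple Jacobi (diagonal) preconditioner, not the DANSYS implementation.

```c
/* Generic preconditioned conjugate gradient (PCG) sketch with a Jacobi
 * (diagonal) preconditioner, for a dense symmetric positive-definite A.
 * Not the DANSYS solver -- it only shows where the preconditioning step
 * (the part parallelized in R13) appears in each iteration.
 */
#include <math.h>
#include <stdlib.h>

static void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

void pcg(int n, const double *A, const double *b, double *x,
         int max_it, double tol)
{
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

    for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; }

    /* Preconditioning step: z = M^{-1} r (Jacobi: divide by diag(A)).
     * In a distributed solver this is the step whose scaling matters. */
    for (int i = 0; i < n; ++i) z[i] = r[i] / A[i * n + i];
    for (int i = 0; i < n; ++i) p[i] = z[i];

    double rz = dot(n, r, z);
    for (int it = 0; it < max_it && sqrt(dot(n, r, r)) > tol; ++it) {
        matvec(n, A, p, q);
        double alpha = rz / dot(n, p, q);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }

        for (int i = 0; i < n; ++i) z[i] = r[i] / A[i * n + i];  /* precondition */
        double rz_new = dot(n, r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(q);
}
```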


Architecture-Aware Partitioning

Original partitions are remapped to the cluster considering the network topology and latencies

Minimizes inter-machine traffic, reducing load on network switches

Improves performance, particularly on slow interconnects and/or large clusters

[Figure: partition graph on 3 machines, 8 cores each (colors indicate machines); original mapping vs. new mapping]
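One way to picture the remapping (purely illustrative, not the ANSYS algorithm): treat inter-partition communication volumes as weighted edges and greedily co-locate the heaviest-communicating partition pairs on the same machine, subject to per-machine core counts.

```c
/* Illustrative greedy remapping of mesh partitions onto machines so that
 * partition pairs with heavy mutual communication land on the same machine.
 * NOT the ANSYS algorithm -- just a sketch of topology-aware placement
 * (keep chatty partitions local, spare the switches).
 */
#include <stdlib.h>

typedef struct { int a, b; double volume; } Pair;   /* comm volume a <-> b */

static int by_volume_desc(const void *x, const void *y)
{
    double d = ((const Pair *)y)->volume - ((const Pair *)x)->volume;
    return (d > 0) - (d < 0);
}

/* nparts partitions, nmach machines, each machine holds `cap` partitions.
 * pairs[npairs] lists communication volumes; result in machine_of[nparts]. */
void remap_partitions(int nparts, int nmach, int cap,
                      Pair *pairs, int npairs, int *machine_of)
{
    int *load = calloc(nmach, sizeof *load);
    for (int p = 0; p < nparts; ++p) machine_of[p] = -1;

    qsort(pairs, npairs, sizeof *pairs, by_volume_desc);

    for (int k = 0; k < npairs; ++k) {
        int a = pairs[k].a, b = pairs[k].b;
        if (machine_of[a] < 0 && machine_of[b] < 0) {
            /* place the pair together on the least-loaded machine with room */
            int best = -1;
            for (int m = 0; m < nmach; ++m)
                if (load[m] + 2 <= cap && (best < 0 || load[m] < load[best]))
                    best = m;
            if (best >= 0) { machine_of[a] = machine_of[b] = best; load[best] += 2; }
        } else if (machine_of[a] < 0 || machine_of[b] < 0) {
            /* pull the unplaced partition onto its partner's machine if it fits */
            int placed = machine_of[a] < 0 ? b : a;
            int other  = machine_of[a] < 0 ? a : b;
            int m = machine_of[placed];
            if (load[m] < cap) { machine_of[other] = m; load[m]++; }
        }
        /* if both are already placed, their edge may stay inter-machine */
    }

    /* any partition still unplaced goes to the least-loaded machine */
    for (int p = 0; p < nparts; ++p) {
        if (machine_of[p] >= 0) continue;
        int best = 0;
        for (int m = 1; m < nmach; ++m) if (load[m] < load[best]) best = m;
        machine_of[p] = best;
        load[best]++;
    }
    free(load);
}
```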


File I/O Performance

Case file I/O
• Both read and write significantly faster in R13
• A combination of serial-I/O optimizations and, where available, parallel-I/O techniques

Parallel I/O (.pdat)
• Significant speedup of parallel I/O, particularly for cases with a large number of zones (see the MPI-IO sketch below)
• Support added for Lustre, EMC/MPFS, and AIX/GPFS file systems

Data file I/O (.dat)
• Performance was already highly optimized in R12; further incremental improvements in R13

Parallel data write, R12 vs. R13:
BMW -68%, FL5L2 4M -63%, Circuit -97%, Truck 14M -64%

truck_14m, case read (12.1.0 vs. 13.0.0):
Cores:   48    96    192   384
12.1.0:  91.2  79.5  90.4  94.4
13.0.0:  37.3  35.2  39.6  45.3
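The "parallel-I/O techniques, where available" above typically mean each rank writing its portion of a shared file concurrently through MPI-IO on a parallel file system such as Lustre or GPFS. A minimal, generic sketch of that pattern (not the FLUENT .pdat writer) follows.

```c
/* Minimal collective parallel write with MPI-IO: every rank writes its own
 * contiguous block of a shared file at a rank-dependent offset.
 * Generic sketch only -- not the FLUENT .pdat format or writer.
 * Build (typical): mpicc pio.c -o pio    Run: mpirun -np 4 ./pio
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns n_local cell values (stand-in for its mesh partition). */
    const int n_local = 1 << 20;
    double *cells = malloc(n_local * sizeof *cells);
    for (int i = 0; i < n_local; ++i) cells[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "case_demo.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r writes at byte offset r * n_local * sizeof(double); the
     * collective call lets the MPI library and the parallel file system
     * (Lustre, GPFS, ...) aggregate and stripe the requests. */
    MPI_Offset offset = (MPI_Offset)rank * n_local * sizeof(double);
    MPI_File_write_at_all(fh, offset, cells, n_local, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(cells);
    MPI_Finalize();
    return 0;
}
```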


What about GPU Computing?

CPUs and GPUs work in a collaborative fashion

Multi-core processors (CPU)
• Typically 4-6 cores
• Powerful, general purpose

Many-core processors (GPU)
• Typically hundreds of cores
• Great for highly parallel code, within memory constraints

[Diagram: CPU and GPU connected by a PCI Express channel]
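To make the division of labor in that diagram concrete, the sketch below shows the generic offload pattern in C using the CUDA runtime and cuBLAS: the CPU sets up the problem, data crosses the PCI Express channel, the GPU's many cores run the dense math, and the result comes back. This is only the general pattern, not how the ANSYS solvers drive the GPU.

```c
/* Generic CPU + GPU offload pattern: copy data across PCIe, run a dense
 * kernel (here a cuBLAS matrix multiply) on the GPU, copy the result back.
 * Illustrative only -- not the ANSYS solver's GPU path.
 * Build (typical): gcc offload.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 \
 *                  -lcublas -lcudart -o offload
 */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdlib.h>

int main(void)
{
    const int n = 2048;                    /* n x n dense matrices */
    size_t bytes = (size_t)n * n * sizeof(double);

    /* Host (CPU) side: general-purpose setup work. */
    double *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    for (size_t i = 0; i < (size_t)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }

    /* Device (GPU) side: allocate and ship data across the PCIe channel. */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    /* Hundreds of GPU cores chew through the highly parallel dense math. */
    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    /* Result returns over PCIe; the CPU carries on with the serial parts. */
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}
```

The PCIe transfers in this pattern are exactly the "memory constraints" caveat above: the offload pays off only when the GPU-side work outweighs the cost of moving the data.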


ANSYS Mechanical SMP – GPU Speedup
Tesla C2050 and Intel Xeon 5560

[Chart: solver kernel speedups and overall speedups]

From NAFEMS World Congress, May 2011, Boston, MA, USA: “Accelerate FEA Simulations with a GPU” by Jeff Beisheim, ANSYS


R14: GPU Acceleration for DANSYS

R14 Distributed ANSYS total simulation speedups for the R13 benchmark set: 4 CPU cores vs. 4 CPU cores + 1 GPU
• Benchmarks: V13cg-1 (JCG, 1100k), V13sp-1 (sparse, 430k), V13sp-2 (sparse, 500k), V13sp-3 (sparse, 2400k), V13sp-4 (sparse, 1000k), V13sp-5 (sparse, 2100k)
• GPU speedups range from 1.16x to 2.24x across the set

Windows workstation: two Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, NVIDIA Tesla C2070, Windows 7, TCC driver mode


ANSYS Mechanical – Multi-Node GPU

• Solder Joint Benchmark (4 MDOF, creep strain analysis): mold, PCB, and solder balls
• Results courtesy of MicroConsult Engineering, GmbH

[Chart: R14 Distributed ANSYS total speedup, with and without GPU, at 16, 32, and 64 cores; reported speedups range from 1.7x to 4.4x]

Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, NVIDIA Tesla M2070, InfiniBand


GPU Acceleration for CFD

First capability for “specialty physics”
– view factors, ray tracing, reaction rates, etc.

R&D focus on linear solvers, smoothers – but potential limited by Amdahl’s Law

[Image: radiation view factor calculation (ANSYS FLUENT 14 beta)]
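The Amdahl’s Law caveat can be quantified: if a fraction f of total wall-clock time is spent in the GPU-accelerated kernels and those kernels speed up by a factor s, the overall speedup is bounded by 1/(1-f). The numbers below are illustrative only, not measured results:

```latex
% Amdahl's Law bound on overall speedup from accelerating a fraction f
% of the runtime by a factor s (illustrative values, not measurements).
\[
  S_\text{overall} = \frac{1}{(1-f) + \dfrac{f}{s}},
  \qquad\text{e.g. } f = 0.5,\; s = 5 \;\Rightarrow\;
  S_\text{overall} = \frac{1}{0.5 + 0.1} \approx 1.67,
  \qquad \lim_{s\to\infty} S_\text{overall} = \frac{1}{1-f} = 2.
\]
```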


Case Study

HPC for High Fidelity CFD

http://www.ansys.com/About+ANSYS/ANSYS+Advantage+Magazine/Current+Issue

• 8M to 12M element turbocharger models (ANSYS CFX)

• Previous practice (HPC on 8 nodes)

● Full stage compressor runs 36-48 hours

● Turbine simulations up to 72 hours

• Current practice (160 nodes)

● 32 nodes per simulation

● Full stage compressor 4 hours

● Turbine simulations 5-6 hours

● Simultaneous consideration of 5 ideas

● Ability to address design uncertainty – clearance tolerance

“ANSYS HPC technology is enabling Cummins to use larger models with greater geometric details and more-realistic treatment of physical phenomena.”


Microconsult GmbH

Solder joint failure analysis

• Thermal stress 7.8 MDOF

• Creep strain 5.5 MDOF

Simulation time reduced from 2 weeks to 1 day

• From 8 – 26 cores (past) to 128 cores (present)

“HPC is an important competitive advantage for companies looking to optimize the performance of their products and reduce time to market.”

Case Study

HPC for High Fidelity Mechanical


Case Study
HPC for High Fidelity CFD

EuroCFD
• Model sizes up to 100M cells (ANSYS FLUENT)
• 2011 cluster of 700 cores
– 64-256 cores per simulation

Model size vs. turnaround: 3 million cells (6 days), 10 million (5 days), 25 million (4 days), 50 million (2 days)

Increase of:
• Spatial-temporal accuracy
• Complexity of physical phenomena: supersonic, multiphase, radiation, compressibility, conduction/convection, transient, optimisation / DOE, dynamic mesh, LES combustion, aeroacoustics, fluid-structure interaction


Case Study

HPC for Desktop Productivity

• Cognity Limited – steerable conductors for oil recovery

• ANSYS Mechanical simulations to determine load carrying capacity

• 750K elements, many contacts
• 12-core workstations / 24 GB RAM
• 6X speedup / results in 1 hour or less
• 5-10 design iterations per day

“Parallel processing makes it possible to evaluate five to 10 design iterations per day, enabling Cognity to rapidly improve their design.”

http://www.ansys.com/About+ANSYS/ANSYS+Advantage+Magazine/Current+Issue


NVIDIA - Case study on the value of HW refresh and SW best-practice

Deflection and bending of 3D glasses

• ANSYS Mechanical – 1M DOF models

Optimization of:

• Solver selection (direct vs iterative)

• Machine memory (in core execution)

• Multicore (8-way) parallel with GPU acceleration

Before/After:

77x speedup – from 60 hours per simulation to 47 minutes.

Most importantly: HPC tuning added scope for design exploration and optimization.

Case Study

Desktop Productivity Cautionary Tale


“Take Home” Points / Discussion

ANSYS HPC performance enables scaling for high fidelity
● What could you learn from a 10M (or 100M) cell / DOF model?

● What could you learn if you had time to consider 10 x more design ideas?

● Scaling applies to “all physics”, “all hardware” (desktop and cluster)

ANSYS continually invests in software development for HPC
• Maximized value from your HPC investment

• This creates differentiated competitive advantage for ANSYS users

Comments / Questions / Discussion