
Lawrence Livermore National Laboratory

LLNL-PRES-657922

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Dissecting On-node Memory Performance with MemAxes

Petascale Tools Workshop 2014

Alfredo Gimenez*, Todd Gamblin†, Martin Schulz†, Peer-Timo Bremer†, Barry Rountree†, Abhinav Bhatele†, Ilir Jusufi*, and Bernd Hamann*

Madison, WI
August 4-7, 2014

† LLNL
* UC Davis


Memory Access Sampling

• Recent hardware additions allow us to precisely sample events, including memory accesses
  – Intel PEBS, AMD IBS

• Memory access samples contain:
  – The instruction pointer
  – The address accessed
  – How many core clock cycles elapsed during the access
  – Where in the memory hierarchy the address was resolved (e.g. L1 cache, local RAM, remote RAM)

• We need a way to meaningfully interpret these samples
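To make the sample contents concrete, here is a minimal sketch of one sample as a record, plus one simple way to start interpreting a batch of samples: summing access cycles per memory-hierarchy level. The class name, field names, and all sample values are hypothetical and chosen only for illustration; they are not the PEBS or IBS record layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSample:
    """One memory access sample (hypothetical field names)."""
    ip: int        # instruction pointer of the sampled access
    addr: int      # data address accessed
    cycles: int    # core clock cycles elapsed during the access
    source: str    # where the address was resolved, e.g. "L1", "Local RAM"


def cycles_by_source(samples):
    """Sum access cycles per memory-hierarchy level."""
    totals = defaultdict(int)
    for s in samples:
        totals[s.source] += s.cycles
    return dict(totals)


# Made-up samples, for illustration only.
samples = [
    MemSample(ip=0x400A10, addr=0x7F0010, cycles=4, source="L1"),
    MemSample(ip=0x400A10, addr=0x7F0050, cycles=4, source="L1"),
    MemSample(ip=0x400B24, addr=0x7F8000, cycles=210, source="Local RAM"),
    MemSample(ip=0x400B24, addr=0x9F8000, cycles=380, source="Remote RAM"),
]
totals = cycles_by_source(samples)
```

Grouping by `ip` or by address range instead of `source` follows the same pattern.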


Adding Context

• Can better understand memory references with appropriate context

• Contexts include:
  – The code
  – The node hardware topology
  – Calling context (call path)
  – The application (e.g. fluid dynamics)

[Slide annotations on the context list: "Can get these from tools"; "Need help from app"]

• Other work by Liu & Mellor-Crummey has looked at mapping latency to particular variables, call paths, and access patterns.


We can already get coarse-grained application context for some codes

• Physics data is available in data structures

• Time steps are easy to mark in the code

• Per-process performance
  – Easy to get
  – Just turn on counters at the beginning of the run
  – Read them periodically

• What if we want finer-grained attribution?
  – How to tie measurements to data structures?
  – How to slice and dice the data?

[Figure labels: Aluminum; FLOP/s per MPI process]


Node topology is easy to get, but not shown clearly.

• PEBS provides metadata for node topology

• Want to highlight connections clearly to show:
  – Load distribution
  – Bandwidth
  – Resource contention

• Existing visualization from hwloc (right)
  – Does not scale
  – Clutters connections between components


We have developed a measurement tool for collecting detailed context

* SMT (Semantic Memory Tree): data structure used to map sampled instruction operands

[Figure label: callbacks]

• Use PEBS sampling for hardware information
• Supplement with application instrumentation for mapping addresses to physical coordinates


Currently the developer has to instrument the application manually

• Add calls to get metadata for allocated objects:
  1. Label string
  2. Start and end addresses
  3. Size of each element
  4. Number of elements
  5. Callback to map address to physical coordinates

• Metadata must be provided by the programmer
  – Could easily be implemented in libraries
  – Lots of common mesh libraries would be interesting for this.
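The five metadata items can be pictured as one record per allocated object. The following is a minimal sketch under stated assumptions, not the tool's actual interface: the `DataObject` class, the `make_grid_object` helper, the row-major 2D layout, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class DataObject:
    """Metadata for one allocated object (hypothetical layout)."""
    label: str                                   # 1. label string
    start: int                                   # 2. start address
    end: int                                     # 2. end address (last byte, inclusive)
    elem_size: int                               # 3. size of each element in bytes
    num_elems: int                               # 4. number of elements
    to_coords: Callable[[int], Tuple[int, ...]]  # 5. address -> physical coordinates

    def contains(self, addr: int) -> bool:
        return self.start <= addr <= self.end


def make_grid_object(label, start, elem_size, nx, ny):
    """Describe a row-major nx-by-ny array (an assumed layout)."""
    num_elems = nx * ny

    def to_coords(addr):
        # Any byte inside an element maps to that element's (x, y) cell.
        index = (addr - start) // elem_size
        return (index % nx, index // nx)

    end = start + num_elems * elem_size - 1
    return DataObject(label, start, end, elem_size, num_elems, to_coords)


# Illustrative values only: a 4x3 grid of 8-byte elements at 0x1000.
pressure = make_grid_object("Pressure", start=0x1000, elem_size=8, nx=4, ny=3)
sampled_addr = 0x1000 + 6 * 8 + 3   # a byte inside element 6
coords = pressure.to_coords(sampled_addr)
```

Because the callback closes over the layout, a mesh library could supply it once for every array it allocates, which is the library-level instrumentation the slide suggests.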


Instrumentation

Specify DataObjects

Add additional semantic attributes and define attribution function (optional)


Semantic Memory Tree

[Figure: binary search tree over address ranges; the leaves are semantic memory ranges]

Binary Search Tree

Address Ranges (tree levels, root first):
0x0F 0xF6
0x0F 0x80 0xA2 0xF6
0x0F 0x20 0x40 0x80 0xA2 0xC2 0xE0 0xF6

Semantic Memory Ranges: Velocity, Pressure, Temp, Density
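A lookup from a sampled address to its semantic range can be sketched as follows. Two assumptions are made here: the eight leaf addresses on the slide are read as four inclusive (start, end) pairs, one per label in order, and a sorted array with binary search stands in for the explicit tree. The class name is hypothetical.

```python
from bisect import bisect_right


class SemanticMemoryRanges:
    """Maps addresses to labels over non-overlapping inclusive ranges."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end, label) tuples, end inclusive
        self._ranges = sorted(ranges)
        self._starts = [start for start, _, _ in self._ranges]

    def lookup(self, addr):
        """Return the label whose range holds addr, or None."""
        # Find the last range starting at or before addr.
        i = bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        _, end, label = self._ranges[i]
        return label if addr <= end else None


# Pairing of addresses to labels is an assumption based on slide order.
smt = SemanticMemoryRanges([
    (0x0F, 0x20, "Velocity"),
    (0x40, 0x80, "Pressure"),
    (0xA2, 0xC2, "Temp"),
    (0xE0, 0xF6, "Density"),
])
```

Addresses that fall in the gaps between registered objects return `None`, so samples touching uninstrumented memory can be counted separately.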


Lagrangian Hydrodynamics: LULESH

[Figure panels: 2D; 3D; 3D with mapped performance data]


We have developed MemAxes, a tool for analyzing on-node memory performance

• Measurement component samples memory instructions
• We map latency information onto A) source code, B) node topology
• C) Pie chart shows percent of total latency selected
• D) Parallel coordinates view allows exploration of correlations
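The quantity behind the pie chart, the percent of total latency covered by a selection, can be sketched as below. The dictionary keys, the predicate-based selection, and the sample values are hypothetical; this is one plausible way to compute the number, not a description of how MemAxes computes it.

```python
def latency_share(samples, selected):
    """Percent of total access cycles in samples matching `selected`."""
    total = sum(s["cycles"] for s in samples)
    if total == 0:
        return 0.0
    chosen = sum(s["cycles"] for s in samples if selected(s))
    return 100.0 * chosen / total


# Made-up samples, for illustration only.
samples = [
    {"core": 0, "source": "L1", "cycles": 10},
    {"core": 1, "source": "Local RAM", "cycles": 200},
    {"core": 2, "source": "Remote RAM", "cycles": 390},
    {"core": 3, "source": "Remote RAM", "cycles": 400},
]
share = latency_share(samples, lambda s: s["source"] == "Remote RAM")
```

Any predicate works as a selection, for example a source line, a core id, or an array index range.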


Linked views clearly show on-node locality problems

PIPER

• Parallel coordinates view shows correlation between array index and core id in LULESH

• Linked node topology view shows data motion for highlighted memory operations

• A contiguous chunk of an array is initially split between threads on four cores

• Using an optimized affinity scheme, we improve locality

• Performance improved by 10%

Default thread affinity with poor locality

Optimized thread affinity with good locality


Hyperion Thread/Core Binding

Improved cache usage
44% fewer access cycles
10% total speedup


Future work

• Back-port perf_events API to production TOSS 2 kernel
  – Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
  – Affects some Intel thread tools as well

• More detailed architecture mapping
  – Sandy Bridge LLC ring interconnect information?
  – Other node architecture features?

• Instrument AMR libraries for proper context attribution
  – Study per-patch memory behavior
  – Study blocking behavior of solvers

• How to query large instruction traces effectively?