MulticoreWare

Parallel Path Analyzer (PPA)

Hui Huang, Chunpeng Zhang, Yao Wang, Lihua Zhang

6/13/11 Copyright (C) 2011 MulticoreWare Inc

• What is Parallel Path Analyzer?

• PPA overview

• Developer Library overview

• Story 1: Instrument code with the developer library

• Story 2: Data collection and viewing

• Story 3: Data post-processing

• Performance optimization example with PPA

• Other outlined features & future enhancements

• Q&A

Agenda

• A platform-independent, standalone tool for analyzing applications on a heterogeneous computing system

• Identifies system-wide performance bottlenecks and finds the critical paths of applications, whether they lie in the CPU, GPU, or I/O

• Targets both OpenCL and non-OpenCL applications, on traditional discrete CPU/GPU systems as well as APU-accelerated ones

• Visualizes profiling data in intuitive graphs and generates meaningful numerical results

• Integrates seamlessly with other MulticoreWare tools to provide a comprehensive toolset that eases complex heterogeneous computing development

What is Parallel Path Analyzer?

• Provides a user-level library that lets developers instrument CPU code either manually or automatically

• Automatically captures a full trace of OpenCL API calls and commands

– Common trace format with AMD tools

• Uses a virtual global system clock mechanism to create a time-synchronized performance view across CPUs/GPUs

• Supports debug & statistics events

• User-friendly GUI and comprehensive data views help developers understand the behavior of their applications

PPA Overview – Key Features
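
The slides do not say how the virtual global system clock is implemented. As a minimal sketch of one common approach, assuming a single paired reading of the host and device clocks and no drift between them, device timestamps can be shifted onto the host timeline with a constant offset. `ClockSync` and `deviceToHostNs` are hypothetical names, not PPA APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types; not part of the PPA library.
// One paired reading of both clocks, taken at (nearly) the same instant.
struct ClockSync {
    std::int64_t hostNs;    // host clock reading
    std::int64_t deviceNs;  // device clock reading at the same moment
};

// Shift a device timestamp onto the host timeline using a constant offset.
// Assumes both clocks tick at the same rate (no drift correction).
inline std::int64_t deviceToHostNs(const ClockSync& sync,
                                   std::int64_t deviceTimestampNs) {
    assert(sync.hostNs >= 0 && sync.deviceNs >= 0);
    return deviceTimestampNs + (sync.hostNs - sync.deviceNs);
}
```

With a paired reading of host = 1000 ns and device = 400 ns, a device timestamp of 450 ns lands at 1050 ns on the host timeline, so CPU and GPU events can be drawn on one axis.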

• Simultaneous profiling of multiple applications

– For complex apps built from independently developed but interdependent modules, possibly written by different developers or even different vendors

– An app based on DirectShow filters is one such example

• Exclusive AMD OpenCL sub-kernel profiling & debugging capability (to be available soon)

– Identify load-balancing issues between workgroups/wavefronts

– Identify the critical path within a kernel (which segments of code are the bottleneck)

– Debug events are allowed, to help run-time debugging

• Fusion support

– Power & bandwidth measurement (to be available soon)

PPA Overview – Key Features

• Developer / User Level Library

– PPA initialization APIs & profiling APIs

– Runtime DLL

• GUI

– Provides a user-friendly interface for all major operations

– Provides comprehensive & intuitive graph views to visualize profiling data

PPA Overview – Main Components

PPA Overview – Main GUI

• Event – A profiling event in PPA is a data structure that has

• A unique event name / ID to identify itself

• An event type to identify its meaning, e.g. the start or stop of a CPU event, or a debug event

• A globalized timestamp to record when it occurred

• Other info depending on its type, e.g.,

– CPU event: start time, stop time, core ID, thread ID & priority

– GPU event: device ID, plus the queued, submit, start, and stop times of the CL command

– The essential element to measure: it marks something with performance meaning, such as the start and stop of a CPU subroutine the user is interested in

– The essential element drawn in the PPA viewer

Some PPA Terms
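
The fields listed for an event can be pictured as a plain C++ structure. This is a minimal sketch only: the slides do not show PPA's actual event layout, so every type, field, and function name below is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical layout; the real PPA event structure is not shown in the slides.
enum class EventType { CpuStart, CpuStop, GpuCommand, Debug };

struct CpuInfo {
    std::uint32_t coreId = 0;
    std::uint32_t threadId = 0;
    int priority = 0;
};

struct GpuInfo {
    std::uint32_t deviceId = 0;
    std::uint64_t queuedNs = 0;  // command placed in the queue
    std::uint64_t submitNs = 0;  // command submitted to the device
    std::uint64_t startNs = 0;   // execution started
    std::uint64_t stopNs = 0;    // execution finished
};

struct Event {
    std::string name;           // unique event name / ID
    EventType type;             // what the event means
    std::uint64_t timestampNs;  // occurrence time on the global clock
    CpuInfo cpu;                // meaningful for CPU events
    GpuInfo gpu;                // meaningful for GPU events
};

// Time a CL command waited between being queued and starting to execute.
inline std::uint64_t waitBeforeExecutionNs(const GpuInfo& g) {
    assert(g.queuedNs <= g.startNs);
    return g.startNs - g.queuedNs;
}

// Time the CL command actually spent executing on the device.
inline std::uint64_t executionNs(const GpuInfo& g) {
    assert(g.startNs <= g.stopNs);
    return g.stopNs - g.startNs;
}
```

The two helpers show the kind of numbers that fall out of the four GPU timestamps: a long wait before execution points at queueing or host-side stalls, while a long execution time points at the kernel itself.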

• Frame

– By default, all events are drawn in a single view window

– For Bullet-type apps, which are timestep based and repeat the same pipeline every timestep, each timestep can be treated as a frame

– Performance varies between frames; the target is constant performance across frames

• Session

– A session records the results of a single data collection operation, including the raw profiling data dumped from the app runtime as well as post-processing results

– PPA automatically generates a new session for each data collection operation, and the user can quickly navigate between sessions to check or process historical data

Some PPA Terms
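
One way to picture framing, assuming frames are delimited by successive occurrences of a chosen marker event (the slides do not describe PPA's actual framing rule), is to measure the time between consecutive starts of that marker. `Interval` and `frameDurationsNs` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one measured event: name plus start/stop times.
struct Interval {
    std::string name;
    std::uint64_t startNs;
    std::uint64_t stopNs;
};

// Returns the length of each frame, where a frame runs from one start of
// the marker event to the next. Events must be sorted by start time.
std::vector<std::uint64_t> frameDurationsNs(const std::vector<Interval>& events,
                                            const std::string& marker) {
    std::vector<std::uint64_t> starts;
    for (const Interval& e : events) {
        if (e.name == marker) {
            starts.push_back(e.startNs);
        }
    }
    std::vector<std::uint64_t> durations;
    for (std::size_t i = 1; i < starts.size(); ++i) {
        assert(starts[i] >= starts[i - 1]);  // input must be time-ordered
        durations.push_back(starts[i] - starts[i - 1]);
    }
    return durations;
}
```

Comparing the resulting durations across frames makes the "constant performance over frames" target measurable: a frame that is much longer than its neighbours is the one to inspect.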

• Bullet

– An open source project for game physics simulation

– Its core feature is rigid body dynamics, but it also supports effect physics such as particles & soft bodies

– We'll use the AMD OpenCL version of the Bullet particle demo to introduce PPA and to demonstrate how PPA can be used to optimize a heterogeneous app, because:

• It uses both the CPU & GPU, with dependencies between them

• The GPU is used for both simulation & rendering

• It is a naturally frame-based application

Bullet as Sample

Library files

• ppa.dll

– PPA developer library (user-level runtime)

• ppa.h

– Declarations of PPA_INIT and PPA_END

– Other user-level profiling APIs, like PPAStopCpuEventFunc(e)

• ppa.cpp

– Implementation of PPA initialization

• ppaEventDefs.h

– User-defined events

Developer Library Overview

• Define the events

PPA_DEFINE_EVENT(your-event-name)

• Declare & initialize the lib

PPA_DECLARE_EVENT;

PPA_INIT();

• Instrument the application code using paired routines

//beginning of the code block

PPAStartCpuEventFunc(your-event-name)

PPAStopCpuEventFunc(your-event-name)

//end of the code block

• Release the lib

PPA_END();

Developer Library Overview

#include "ppa.h"

PPA_DECLARE_EVENT;

int main(int argc,char** argv)

{

// Call PPA_INIT() to initialize ppaUtil

PPA_INIT();

ParticlesDemo pDemo(argc, argv);

pDemo.initPhysics();

pDemo.getDynamicsWorld()->setDebugDrawer(&gDebugDrawer);

glutmain(argc, argv,640,480,"Bullet Physics Demo. http://bulletphysics.com", &pDemo);

// Call PPA_END() to terminate ppaUtil

PPA_END();

return 0;

}

Developer Library Use in Bullet

Event declaration in ppaEventDefs.h

PPA_DEFINE_EVENT(btDemo_renderme)

:

PPA_DEFINE_EVENT(btPart_runIntegrateMotionKernel)

PPA_DEFINE_EVENT(btPart_runCollideParticlesKernel)

:

Profiling calls in CPU code:

void ParticlesDemo::renderme()

{

PPAStartCpuEventFunc(btDemo_renderme);

glColor3f(1.0, 1.0, 1.0);

:

PPAStopCpuEventFunc(btDemo_renderme);

}

Developer Library Use in Bullet
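
Paired start/stop calls are easy to unbalance when a function has early returns or throws. A possible way to guard against that, sketched here as a suggestion rather than anything the PPA library is known to provide, is a small RAII wrapper. The functions `startCpuEvent` and `stopCpuEvent` are stand-ins so the sketch compiles on its own; in real code they would be replaced by PPAStartCpuEventFunc and PPAStopCpuEventFunc, whose exact signatures are not shown in the slides. `ScopedCpuEvent`, `traceLog`, and `renderStep` are likewise hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for the PPA start/stop calls; they only log to memory here.
std::vector<std::string>& traceLog() {
    static std::vector<std::string> log;
    return log;
}
void startCpuEvent(const char* name) {
    traceLog().push_back(std::string("start:") + name);
}
void stopCpuEvent(const char* name) {
    traceLog().push_back(std::string("stop:") + name);
}

// Emits the start event on construction and the stop event on destruction,
// so every exit path from the enclosing scope records a matching stop.
class ScopedCpuEvent {
public:
    explicit ScopedCpuEvent(const char* name) : name_(name) {
        assert(name_ != nullptr);
        startCpuEvent(name_);
    }
    ~ScopedCpuEvent() { stopCpuEvent(name_); }
    ScopedCpuEvent(const ScopedCpuEvent&) = delete;
    ScopedCpuEvent& operator=(const ScopedCpuEvent&) = delete;

private:
    const char* name_;
};

// Example body standing in for a function such as renderme().
int renderStep(bool skip) {
    ScopedCpuEvent scope("btDemo_renderme");
    if (skip) {
        return 0;  // early return still records the stop event
    }
    return 1;
}
```

The trade-off is that the measured region always equals the enclosing scope; when only part of a function should be timed, the explicit paired calls shown on the slide remain the simpler choice.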

PPA Data Collection

PPA Data Viewer

Data Post-processing – Framing

Data Post-processing – Framing

Data Post-processing – Extractor

Data Post-processing – Extractor

• The particle pipeline has 5 stages for each timestep (i.e. frame)

• Each stage has one or more kernel launches as well as data copies

Bullet capture before opt

• All clFinish calls are removed except the one in the runCollideParticlesKernel stage

• The host is still blocked by the last clFinish() and by clEnqueueReadBuffer for CLInterop

Bullet capture after initial try

• In fact, no clFinish is needed, since no data read is required until CLInterop

• The host is not blocked except for CLInterop; the main gaps are removed!

Bullet capture after opt
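
A toy timeline model can show why removing the intermediate clFinish calls closes the gaps. It assumes an in-order command queue, non-blocking enqueues, and a single wait at the end of the frame. The stage costs used with it are made-up numbers for illustration, not measurements from the Bullet capture, and `Stage` and both functions are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical cost of one pipeline stage.
struct Stage {
    double hostEnqueueMs;  // host time spent issuing the stage's commands
    double gpuMs;          // GPU time spent executing them
};

// Frame time when the host waits for the GPU after every stage:
// host and GPU never overlap, so the costs simply add up.
double frameTimeWithSyncPerStage(const std::vector<Stage>& stages) {
    double total = 0.0;
    for (const Stage& s : stages) {
        assert(s.hostEnqueueMs >= 0.0 && s.gpuMs >= 0.0);
        total += s.hostEnqueueMs + s.gpuMs;
    }
    return total;
}

// Frame time when the host only waits once, at the end of the frame.
// The GPU starts a stage as soon as it is enqueued and the previous
// stage has finished, so enqueueing overlaps with execution.
double frameTimeWithSingleSync(const std::vector<Stage>& stages) {
    double hostTime = 0.0;  // when the host finishes enqueueing so far
    double gpuFree = 0.0;   // when the GPU finishes its work so far
    for (const Stage& s : stages) {
        assert(s.hostEnqueueMs >= 0.0 && s.gpuMs >= 0.0);
        hostTime += s.hostEnqueueMs;
        gpuFree = std::max(gpuFree, hostTime) + s.gpuMs;
    }
    return std::max(hostTime, gpuFree);
}
```

For three stages that each cost 1 ms to enqueue and 2 ms on the GPU, the per-stage-sync model gives 9 ms and the single-sync model gives 7 ms: the GPU no longer idles while the host issues the next stage.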

Bullet perf before & after opt

• Removal of special CL API calls, to address profiling performance concerns

Other outlined features – duplicated events removal

• Pause & resume profiling to profile interesting spots only

– Reduces captured data size as well

Other outstanding features – pause & resume profiling
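
Pause and resume can be pictured as a flag that gates event recording. The sketch below is an assumption about the general idea, not PPA's implementation, and `Recorder` is a hypothetical class. The flag is atomic so that another thread could toggle it; the event list itself is not synchronized, so this version records from one thread only.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder that drops events while profiling is paused.
class Recorder {
public:
    void pause() { enabled_.store(false, std::memory_order_relaxed); }
    void resume() { enabled_.store(true, std::memory_order_relaxed); }

    // Stores the event only while profiling is enabled.
    void record(const std::string& name) {
        assert(!name.empty());
        if (!enabled_.load(std::memory_order_relaxed)) {
            return;  // paused: nothing is captured, so the trace stays small
        }
        events_.push_back(name);
    }

    const std::vector<std::string>& events() const { return events_; }

private:
    std::atomic<bool> enabled_{true};
    std::vector<std::string> events_;
};
```

Because paused events are never stored, the captured data shrinks in proportion to the time spent paused, which matches the size reduction mentioned on the slide.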

• Allows profiling of complex applications that consist of independent modules

Other outstanding features – multiple app profiling capability

Other outstanding features – System Monitor

• Interoperability with TM and GMAC

– Visualizing task scheduling, load balancing/migration, and data transfer in TM and GMAC in an intuitive way

• Automatic CPU profiling events insertion

• Automatic critical path analysis

• GPU rendering profiling capability

Future enhancements

• Visit http://www.multicorewareinc.com to request access to the closed beta

• Contact email: [email protected]

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.