tuning parallel code on the solaris™ operating system

56
Abstract This white paper is targeted at users, administrators, and developers of parallel applications on the Solaris™ OS. It describes the current challenges of performance analysis and tuning of complex applications, and the tools and techniques available to overcome these challenges. Several examples are provided, from a range of application types, environments, and business contexts, with a focus on the use of the Solaris Dynamic Tracing (DTrace) utility. The examples cover virtualized environments, multicore platforms, I/O and memory bottlenecks, Open Multi-Processing (MP) and Message Passing Interface (MPI). TUNING PARALLEL CODE ON THE SOLARIS™ OPERATING SYSTEM — LESSONS LEARNED FROM HPC White Paper September 2009

Upload: others

Post on 13-Jan-2022

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Tuning Parallel Code on the Solaris™ Operating System

AbstractThis white paper is targeted at users, administrators, and developers of parallel applications on the Solaris™

OS. It describes the current challenges of performance analysis and tuning of complex applications, and the

tools and techniques available to overcome these challenges. Several examples are provided, from a range of

application types, environments, and business contexts, with a focus on the use of the Solaris Dynamic Tracing

(DTrace) utility. The examples cover virtualized environments, multicore platforms, I/O and memory bottlenecks,

Open Multi-Processing (MP) and Message Passing Interface (MPI).

TUNING PARALLEL CODE ON THE SOLARIS™ OPERATING SYSTEM — LESSONS LEARNED FROM HPCWhite PaperSeptember 2009

Page 2: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.

Table of Contents

Introduction...................................................................................................1

Challenges of Tuning Parallel Code .......................................................................1

Some Solaris™ OS Performance Analysis Tools .......................................................2

DTrace ............................................................................................................2

System Information Tools ................................................................................2

Sun™ Studio Performance Analyzer ...................................................................3

Combining DTrace with Friends ........................................................................3

Purpose of this Paper ..........................................................................................4

DTrace and its Visualization Tools .....................................................................5

Chime ................................................................................................................5

DLight ................................................................................................................6

gnuplot ..............................................................................................................7

Performance Analysis Examples .......................................................................8

Analyzing Memory Bottlenecks with !"#$%&!', (#))$&$, and the Sun Studio

Performance Analyzer .........................................................................................8

Improving Performance by Reducing Data Cache Misses .................................... 8

Improving Scalability by Avoiding Memory Stalls Using the Sun Studio Analyzer 11

Memory Placement Optimization with OpenMP .............................................. 16

Analyzing I/O Performance Problems in a Virtualized Environment....................... 19

Using DTrace to Analyze Thread Scheduling with OpenMP ................................... 21

Using DTrace with MPI ...................................................................................... 27

Setting the Correct *"+%#, Privileges ........................................................... 28

A Simple Script for MPI Tracing ...................................................................... 28

Tracking Down MPI Memory Leaks ................................................................. 30

Using the DTrace *"+"-%#)- Provider ......................................................... 33

A Few Simple *"+"-%#)- Usage Examples ................................................... 35

Conclusion ...................................................................................................37

Acknowledgments ............................................................................................ 37

References ...................................................................................................38

Page 3: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.

Appendix .....................................................................................................39

Chime Screenshots ............................................................................................ 39

Listings for Matrix Addition !"#$%&!' Example ................................................ 40

gnuplot Program to Graph the Data Generated by !"#$%&!' ........................ 40

!"#$%&!'.&//.%012/&$& ................................................................. 40

!"#$%&!'.&//.!032/&$& ................................................................. 40

Listings for Matrix Addition (#))$&$ Example .................................................. 42

gnuplot Program to Graph the Data Generated by (#))$&$ .......................... 42

(#))$&$.&//.%012/&$& ................................................................... 42

(#))$&$.&//.!032/&$& ................................................................... 42

Sun Studio Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor 44

Analyzing I/O Performance Problems in a Virtualized Environment — a Complete

Description ....................................................................................................... 45

Tracking MPI Requests with DTrace — *"+)$&$2/ ........................................... 51

Page 4: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.1 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

1. In the Solaris™ Operating System (OS), these include various analysis and debugging tools such as truss(1), apptrace(1) and mdb(1). Other operating environments have similar tools.

Chapter 1

Introduction

Most computers today are equipped with multicore processors and manufacturers

are increasing core and thread counts, as the most cost effective and energy efficient

route to increased processing throughput. While the next level of parallelism — grid

computers with fast interconnects — is still very much the exclusive domain of high

performance computing (HPC) applications, mainstream computing is developing

ever increasing levels of parallelism. This trend is creating an environment in

which the optimization of parallel code is becoming a core requirement of tuning

performance in a range of application environments. However, tuning parallel

applications brings with it complex challenges that can only be overcome with

increasingly advanced tools and techniques.

Challenges of Tuning Parallel CodeSingle threaded code, or code with a relatively low level of parallelism, can be tuned

by analyzing its performance in test environments, and improvements can be made

by identifying and resolving bottlenecks. Developers tuning in this manner are

confident that any results they obtain in testing can result in improvements to the

corresponding production environment.

As the level of parallelism grows, due to interaction between various concurrent

events typical of complex production environments, it becomes more difficult to

create a test environment that closely matches the production environment. In turn,

this means that analysis and tuning of highly multithreaded, parallel code in test

environments is less likely to improve its performance in production environments.

Consequentially, if the performance of parallel applications is to be improved, they

must be analyzed in production environments. At the same time, the complex

interactions typical of multithreaded, parallel code are far more difficult to analyze

and tune using simple analytical tools, and advanced tools and techniques are

needed.

Conventional tools1 used for performance analysis were developed with low levels

of parallelism in mind, are not dynamic, and do not uncover problems relating to

timing or transient errors. Furthermore, these tools do not provide features that

allow the developer to easily combine data from different processes and to execute

an integrated analysis of user and kernel activity. More significantly, when these

tools are used, there is a risk that the act of starting and stopping, or tracing can

affect and obscure the problems that require investigation, thus rendering any

results found suspect and creating a potential risk to business processing.

Page 5: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.2 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

2. See discussion of the proposed cpc DTrace provider and the DTrace limitations page in the Solaris internals wiki through links in “References” on page 38

Some Solaris™ OS Performance Analysis ToolsIn addition to performance and scalability, the designers of the Solaris OS set

observability as a core design goal. Observability is achieved by providing users

with tools that allow them to easily observe the inner workings of the operating

system and applications, and analyze, debug, and optimize them. The set of tools

needed to achieve this goal was expanded continuously in each successive version

of the Solaris OS, culminating in the Solaris 10 OS with DTrace — arguably the most

advanced observability tool available today in any operating environment.

DTraceDTrace provides the Solaris 10 OS user with the ultimate observability tool — a

framework that allows the dynamic instrumentation of both kernel and user

level code. DTrace permits users to trace system data safely without affecting

performance. Users can write programs in D — a dynamically interpreted scripting

language — to execute arbitrary actions predicated on the state of specific data

exposed by the system or user component under observation. However, currently

DTrace has certain limitations — most significantly, it is unable to read hardware

counters, although this might change in the near future2.

System Information ToolsThere are many other tools that are essential supplements to the capabilities of

DTrace that can help developers directly observe system and hardware performance

characteristics in additional ways. These tools and their use are described in

this document through several examples. There are other tools that can be used

to provide an overview of the system more simply and quickly than doing the

equivalent in DTrace. See Table 1 for a non-exhaustive list of these tools.

Table 1. Some Solaris system information tools

Command Purpose

+,$%)$&$4567 Gathers and displays run-time interrupt statistics per device and processor or processor set

(#))$&$4567 Report memory bus related performance statistics

!"#$%&!'456789!"#)$&$4567

Monitor system and/or application performance using CPU hardware counters

$%&")$&$4567 Reports trap statistic per processor or processor set

"%)$&$4567 Report active process statistics

:*)$&$4567 Report virtual memory statistics

Page 6: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.3 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Sun™ Studio Performance AnalyzerThe Sun Studio Performance Analyzer and the Sun Studio Collector are components

of the Sun Studio integrated development environment (IDE). The collector is used

to run experiments to collect performance related data for an application, and the

analyzer is used to analyze and display this data with an advanced GUI.

The Sun Studio Performance Analyzer and Collector can perform a variety of

measurements on production codes without special compilation, including clock- and

hardware-counter profiling, memory allocation tracing, Message Passing Interface

(MPI) tracing, and synchronization tracing. The tool provides a rich graphical user

interface, supports the two common programming models for HPC — OpenMP and

MPI — and can collect data on lengthy production runs.

Combining DTrace with FriendsWhile DTrace is an extremely powerful tool, there are other tools that can be

executed from the command line, which makes them more appropriate to view a

system’s performance. These tools can be better when a quick, initial diagnosis is

needed to answer high level questions. A few examples of this type of question are:

Are there a significant number of cache misses?

Is a CPU accessing local memory or is it accessing memory controlled by another

CPU?

How much time is spent in user versus system mode?

Is the system short on memory or other critical resources?

Is the system running at high interrupt rates and how are they assigned to

different processors?

What are the system’s I/O characteristics?

An excellent example is the use of the "%)$&$ utility which, when invoked with the

;<* option, provides microstates per thread that offer a wide view of the system.

This view can help identify what type of further investigation is required, using

DTrace or other more specific tools, as described in Table 2.

Page 7: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.4 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Table 2. Using the "%)$&$ command for initial analysis

!"#$%$ column

Meaning Investigate if high values are seen

=>? % of time in user mode Profile user mode with DTrace using either "+/ or !"#$%& providers

>@> % of time in system mode Profile the kernel

<AB % of time waiting for locks Use the "30!')$&$ DTrace provider or the "30!')$&$4567 utility to see which user locks are used extensively

><C % of time sleeping Use the )!D-/ DTrace provider and view call stacks with DTrace to see why the threads are sleeping

EF< or GF<

% of time processing text or data page faults

Use the :*+,H0 DTrace provider to identify the source of the page faults

Once the various tools are used to develop an initial diagnosis of a problem, it is

possible to use DTrace, Sun Studio Performance Analyzer, and other more specific

tools for an in-depth analysis. This approach is described in further detail in the

examples to follow.

Purpose of this PaperThis white paper is targeted at users, administrators, and developers of parallel

applications. The lessons learned in the course of performance tuning efforts are

applicable to a wide range of similar scenarios common in today’s computing

environments in different public sector, business, and academic settings.

Page 8: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.5 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Chapter 2

DTrace and its Visualization Tools

DTrace is implemented through thousands of probes (also known as tracing points)

embedded in the Solaris 10 kernel, utilities, and in other software components that

run on the Solaris OS. The kernel probes are interpreted in kernel context, providing

detailed insight into the inner workings of the kernel. The information exposed by

the probes is accessed through scripts written in the D programming language.

Administrators can monitor system resources to analyze and debug problems,

developers can use it to help debugging and performance tuning, and end-users can

analyze applications for performance and logic problems.

The D language provides primitives to print text to a file or the standard-output of

a process. While suitable for brief, simple analysis sessions, direct interpretation of

DTrace output in more complex scenarios can be quite difficult and lengthy. A more

intuitive interface that allows the user to easily analyze DTrace output is provided by

several tools — Chime, DLight, and gnuplot — that are covered in this paper.

For detailed tutorials and documentation on DTrace and D, see links in “References”

on page 38.

ChimeChime is a standalone visualization tool for DTrace and is included in the NetBeans

DTrace GUI plug-in. It is written in the Java™ programming language, supports the

Python and Ruby programming languages, includes predefined scripts and examples,

and uses XML to define how to visualize the results of a DTrace script. The Chime

GUI provides an interface that allows the user to select different views, to easily drill

down into the data, and to access the underlying DTrace code.

Chime includes the following additional features and capabilities:

Graphically displays aggregations from arbitrary DTrace programs

Displays moving averages

Supports record and playback capabilities

Provides an XML based configuration interface for predefined plots

When Chime is started it displays a welcome screen that allows the user to select

from a number of functional groups of displays through a pull-down menu, with

Initial Displays the default group shown. The Initial Displays group contains various

general displays including, for example, System Calls, that when selected brings up

a graphical view showing the application executing the most system calls, updated

Page 9: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.6 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

every second (Figure 8, in the section “Chime Screenshots” on page 39). The time

interval can be changed with the pull-down menu at the bottom left corner of the

window.

Figure 1. File I/O displayed with the rfilio script from the DTraceToolkit

Another useful choice available from the main window allows access to the

probes based on the DTraceToolkit, that are much more specific than those in

Initial Displays. For example, selecting rfileio brings up a window (Figure 1) with

information about file or block read I/O. The output can be sorted by the number

of reads, helping to identify the file that is the target of most of the system read

operations at this time — in this case, a raw device. Through the context menu of

the suspect entry, it is possible to bring up a window with a plotting of the reads

from this device over time (Figure 9, on page 39).

For further details on Chime, see links in “References” on page 38.

DLightDLight is a GUI for DTrace, built into the Sun Studio 12 IDE as a plug-in. DLight unifies

application and system profiling and introduces a simple drag-and-drop interface

that interactively displays how an application is performing. DLight includes a set of

DTrace scripts embedded in XML files. While DLight is not as powerful as the DTrace

Page 10: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.7 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

3. According to the developers of gnuplot (http://www.gnuplot.info/faq/faq.html#SECTION00032000000000000000) The real name of the program is “gnuplot” and is never capitalized.

command line interface, it does provide a useful way to see a quick overview of

standard system activities and can be applied directly to binaries using the DTrace

scripts included with it.

After launching DLight, the user selects a Sun Studio project, a DTrace script, or

a binary executable to apply DLight to. For the purpose of this discussion, the tar

utility was used to create a load on the system and DLight was used to observe it.

The output of the tar command is shown in the bottom right section of the display

(Figure 2). The time-line for the selected view (in this case File System Activity) is

shown in the middle panel of the display. Clicking on a sample prints the details

associated with it in the lower left panel of the display. The panels can be detached

from the main window and their size adjusted.

Figure 2. DLight main user interface window

For further details on DLight, see links in “References” on page 38.

gnuplotgnuplot3 is a powerful, general purpose, open-source, graphing tool that can help

visualize DTrace output. gnuplot can generate two- and three-dimensional plots of

DTrace output in many graphic formats, using scripts written in gnuplot’s own text-

based command syntax. An example using gnuplot with the output of the !"#$%&!'

and (#))$&$ commands is provided in the section “Analyzing Memory Bottlenecks

with cputrack, busstat, and the Sun Studio Performance Analyzer” on page 8, in the

“Performance Analysis Examples” chapter.

Page 11: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.8 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Chapter 3

Performance Analysis Examples

In the following sections, several detailed examples of using DTrace and other Solaris

observability tools to analyze the performance of Solaris applications are provided.

These use-cases include the following:

1. Use (#))$&$, !"#$%&!', and the Sun Studio Performance Analyzer to explain

performance of different SPARC® processors and scalability issues resulting from

data cache misses and memory bank stalls

2. Use !"#)$&$ to analyze memory placement optimization with OpenMP code on

non-uniform memory access (NUMA) architectures

3. Use DTrace to resolve I/O related issues

4. Use DTrace to analyze thread scheduling with Open Multi-Processing (MP)

5. Use DTrace to analyze MPI performance

Analyzing Memory Bottlenecks with !"#$%&!', (#))$&$, and the Sun Studio Performance AnalyzerMost applications require fast memory access to perform optimally. In the following

sections, several use-cases are described where memory bottlenecks cause sub-

optimal application performance, resulting in sluggishness and scalability problems.

Improving Performance by Reducing Data Cache MissesThe !"#)$&$4567 and !"#$%&!'4567 commands provide access to CPU hardware

counters. (#))$&$4567 reports system-wide bus-related performance statistics.

These commands are passed sets of platform dependent event-counters to monitor.

In this example, the performance of two functionally identical instances of a small

program that sums two matrices are analyzed — &//.!03 and &//.%01. &//.!03

uses the data-cache inefficiently, with many cache misses, while &//.%01 uses the

data-cache efficiently. The insights provided by these examples are applicable in a

wide range of contexts.

Page 12: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.9 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

The following listing, &//.!032!, shows the implementation of &//.!03:

59I+,!3#/-9J)$/3+(2DKL9'(&$)&*+,-.****/01112M9(#34%&*56+,-.76+,-.78*46+,-.76+,-.78*96+,-.76+,-.7:N9;)<*=5;)>;)<*5"?98*9@5"*A5"?B67C*DO9**E;F&G<*;8*H:P9**I#"*>;*J*1:*;*K*+,-.:*;LLCQ9****I#"*>H*J*1:*H*K*+,-.:*HLLCR9******96H76;7*J*56H76;7*L*46H76;7:S9T

&//.%012! is identical to &//.!032!, except that line 8 — the line that executes

the addition of the matrix cells, runs across rows and takes the following form:

M*******96;76H7*J*56;76H7*L*46;76H7:

The programs are compiled without optimization or prefetching, to side-step the

compiler’s ability to optimize memory access (that would make the differences

between the programs invisible) and with large pages to avoid performance

problems resulting from the relatively small TLB cache available to the UltraSPARC

T1® processor and the UltraSPARC T2® processor:

I9!!9;U"&V-)+W-XLOP*9;U"%-H-$!DX,09;09&//.%019&//.%012!I9!!9;U"&V-)+W-XLOP*9;U"%-H-$!DX,09;09&//.!039&//.!032!

Using !"#$%&!' to Count Data Cache Misses

The programs are first run with !"#$%&!', while tracing data cache misses as

reported by the CPU’s hardware counters. The output is piped through the sed

command to remove the last line in the file, since it contains the summary for the

run and should not be plotted, and directed to !"#$%&!'.&//.!032/&$& and

!"#$%&!'.&//.%012/&$&, as appropriate:

'*9!3<"59N*O)*O&I*O9*!;91JPQG=;EE8!;9/J,)E<"G9)<*RS5((G"#T*U*V***********************E&(*WX(Y*Z*9!3<"59NG5((G"#TR(5<5'*9!3<"59N*O)*O&I*O9*!;91JPQG=;EE8!;9/J,)E<"G9)<*RS5((G9#%*U*V***********************E&(*WX(Y*Z*9!3<"59NG5((G9#%R(5<5

The ;, option is used to prevent the output of column headers, ;- instructs

!"#$%&!' to follow all -U-!4L7, or -U-!:-4L7 system calls, ;H instructs

!"#$%&!' to follow all child processes created by the H0%'4L7, H0%'54L7, or

:H0%'4L7 system calls. The ;! option is used to specify the set of events for the CPU

performance instrumentation counters (PIC) to monitor:

!;91JPQG=;EE — counts the number of data-cache misses

!;9/J,)E<"G9)< — counts the number of instructions executed by the CPU

Plotting the results of the two invocations with V,#"30$ (Figure 3) shows that the

number of cache-misses in the column-wise matrix addition is significantly higher

Page 13: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.10 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

4. The V,#"30$ program used to generate the graph in Figure 3 is listed in the section “Listings for Matrix Addition cputrack Example” on page 40, followed by the listings of !"#$%&!'.&//.!032/&$&, and !"#$%&!'.&//.%012/&$&.

than in the row-wise matrix addition, resulting in a much lower instruction count per

second and a much longer execution time for the column-wise addition4.

0

1000000

2000000

3000000

4000000

5000000

6000000

7000000

0 10 20 30 40 50 60

L1 D

-cac

he m

isse

s /s

Inst

ruct

ion

coun

t /s

time [s]

D-cache misses colD-cache misses rowInstruction count col

Instruction count row

Figure 3. Comparison of data-cache misses in row-wise and column-wise matrix addition

Using (#))$&$ to Monitor Memory Bus Use

The (#))$&$ command is only able to trace the system as a whole, so it is necessary

to invoke it in the background, run the program that is to be measured, wait for a

period of time, and terminate it. This must be done on a system that is not running

a significant workload at the same time, since it is not possible to isolate the data

generated by the test program from the other processes. The following short shell-

script, %#,.(#))$&$2)D, is used to run the test:

59'[S4;)SE@L943EE<5<*O)*OT*("5=18!;91J=&=G"&5(E*/*\M9!;(JX[N9&B5%*X]O9)3--"95P9N;%%*X!;(

The command to run — in this case &//.%01 or &//.!03 — is passed to the script

as a command line parameter and the script performs the following steps:

Runs (#))$&$ in the background in line 2

Saves (#))$&$’s process ID in line 3

Runs the command in line 4

After the command exits, the script waits for one second in line 5

Terminates (#))$&$ in line 6

Page 14: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.11 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

5. The V,#"30$ program used to generate the graph in Figure 4, is listed in the section “Listings for Matrix Addition busstat Example” on page 42, followed by the listings of (#))$&$.&//.!032/&$&, and (#))$&$.&//.%012/&$&.

The ;, (#))$&$ option is used to prevent the output of column headers, ;1

specifies the device to be monitored — in this case the memory bus (("5=1) and

!;91J*-*.%-&/) defines the counter to count the number of memory reads

performed by DRAM controller #0 over the bus. The number 5 at the end of the

command instructs (#))$&$ to print out the accumulated information every second.

The %#,.(#))$&$2)D script is used to run &//.!03 and &//.%01, directing the

output to (#))$&$.&//.!032/&$& and (#))$&$.&//.%012/&$&, as appropriate:

'*RS"3)G43EE<5<RE@*5((G9#%*Z*(#))$&$.&//.!032/&$&'*RS"3)G43EE<5<RE@*5((G"#T*Z*43EE<5<G5((G"#TR(5<5

Plotting the results of the two invocations with V,#"30$ (Figure 4) shows that the

number of cache-misses in the column-wise addition seen by !"#$%&!' is reflected

in the results reported by (#))$&$, that shows a much larger number of reads over

the memory bus5.

400000

600000

800000

1000000

1200000

1400000

1600000

1800000

2000000

2200000

2400000

2600000

0 10 20 30 40 50 60

read

s/s

time [s]

reads colreads row

Figure 4. Comparison of memory bus performance in row-wise and column-wise matrix addition

Improving Scalability by Avoiding Memory Stalls Using the Sun Studio AnalyzerIn this example, a Sun Fire™ T2000 server with a 1 GHz UltraSPARC T1 processor, with

eight CPU cores and four threads per core is compared with a Sun Fire V100 server

with a 550 MHz UltraSPARC-IIe single core processor. The Sun Fire V100 server was

chosen since its core design is similar to the Sun Fire T2000 server. The software used

Page 15: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.12 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

for the comparison is John the Ripper — a password evaluation tool that includes a

data encryption standard (DES) benchmark capability, has a small memory footprint,

and performs no floating point operations. Multiple instances of John the Ripper are

run in parallel to achieve any level of parallelism required.

The benchmark is compiled with the Sun Studio C compiler using the YH&)$ and

YU&%!DX,&$+:-PN options, and a single instance is executed on each of the servers.

Due to its higher clock-rate, it may seem reasonable to expect the Sun Fire T2000

server to perform twice as fast as the Sun Fire V100 server for a single invocation,

however, the UltraSPARC T1 processor is designed for maximum multithreaded

throughput so this expectation might not be valid. To demonstrate the reason for

this result it is necessary to first quantify it in further detail. With the !"#$%&!'

command it is possible to find out the number of instructions per cycle:

<0111Z*9!3<"59N*O^*_*O<*O9*!;9/J,)E<"G9)<*RSH#@)*O<&E<*OI#"=5<`P.+

The !"#$%&!' command is used to monitor the instruction count by defining

!;9/J,)E<"G9)<, and the ;E option sets the sampling interval to five seconds. The

output generated by this invocation indicates an average ratio of 0.58 instructions

per CPU cycle, calculated by dividing the total instruction count from the bottom cell

of the "+!5 column by the number of cycles in the bottom cell of the Z$+!' column:

$+*- 31" -:-,$ Z$+!' "+!5

_R1/a 5 $+!' _1/b00cM/0 bd1/edda//

/1R10a 5 $+!' PN5LM5NPQL b/M0de1/_M

/1R1b0 5 -U+$ 55NLPL5QQML c_Mdd0_//1

When the same command is run on the Sun Fire V100 server, the average

performance per core of the UltraSPARC-IIe architecture is higher, resulting in a ratio

of 2.05 instructions per CPU cycle.

The lower single threaded performance of the UltraSPARC T1 processor can be

attributed to the differing architectures — the single in-order issue pipeline of

each core of the UltraSPARC T1 processor is designed for lower single threaded

performance than the four-way superscalar architecture of the UltraSPARC IIe

processor. In this case, it is the different CPU designs that explain the results seen.

Next, the benchmark is run on the Sun Fire T2000 server in multiple process

invocations to check its scalability. While one to eight parallel processes scale

linearly as expected, at 12 processes performance is flat, with 16 processes, the

throughput goes down to a level comparable with that of four processes, and at 32

processes, the total throughput is similar to that of a dual process invocation.

Page 16: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.13 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

6. –g instructs the compiler to produce additional symbol table information needed by the analyzer, and –xhwcprof, in conjunction with –g, instructs the compiler to generate information that helps tools associate profiled load and store instructions with data-types and structure members.

Due to the fact that the UltraSPARC T1 processor has a 12-way L2 cache, it seems

reasonable to suspect that the scaling problem, where throughput drops rapidly

for more than 12 parallel invocations, is a result of L2 cache misses. To validate this

assumption, an analysis of the memory access patterns is needed to see if there is a

significant increase in memory access over the memory bus as the application scales.

This analysis can take advantage of the superior user interface of the Sun Studio

Performance Analyzer. To use it, John the Ripper is first recompiled, adding the ;V

and ;UD1!"%0H options to each compilation and link6. These options instruct the

collector to trace hardware counters and the analyzer to use them in analysis. In

addition, the analyzer requires UltraSPARC T1 processor specific hints to be added to

one of the .er.rc files, that instruct the collector to sample additional information,

including max memory stalls on a specific function. See the section “Sun Studio

Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor” on page 44 in

the Appendix for a list of these hints.

Next John the Ripper is run with the collector:

<0111Z*9#%%&9<*O!*L#)*Of*9#!g*Oh*#)*RSH#@)*OO<&E<*OOI#"=5<`P.+

The O!*L#) option triggers the collection of clock-based profiling data with the

default profiling interval of approximately 10 milliseconds, the Of*9#!g option

requests that the load objects used by the target process are copied into the

recorded experiment, and the ;F90, option instructs the collector to record the data

on descendant processes that are created by H0%' and -U-! system call families.

Figure 5. Profile of John the Ripper on a Sun Fire T2000 server

Once the performance data is collected, the analyzer is invoked and the data

collected for John the Ripper selected. The profile information (Figure 5) shows that

Page 17: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.14 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

7. To improve performance, the Solaris OS uses a page coloring algorithm which allocates pages to a virtual address space with a specific predetermined relationship between the virtual address to which they are mapped and their underlying physical address.

of the total of 9.897 seconds the process has run on the CPU, it stalled on memory

access for 9.387 seconds, or nearly 95% of the time. Furthermore, a single function

— DES_bs_crypt_25 — is responsible for almost 99% of the stalls.

When the hardware counter data related to the memory/cache subsystem is

examined (Figure 6) it is seen that of the four UST1_Bank Memory Objects, stalls

were similar on objects 0, 1 and 2 at 1.511, 1.561, and 1.321 seconds respectively

while on object 3 it was 4.993 seconds, or 53% of the total.

Figure 6. UST1_Bank memory objects in profile of John the Ripper on a Sun Fire T2000 server

Note: When hardware-counter based memory access profiling is used, the Sun Studio Analyzer can identify the specific instruction that accessed the memory more reliably than in the case of clock-based memory access profiling.

Further analysis (Figure 7) shows that 85% of the memory stalls are on a single page,

with half of all the stalls associated with a single memory bank. Note that the Solaris

OS uses larger pages for the sun4v architecture to help small TLB caches. This can

create memory hot spots, as illustrated here, since it limits cache mapping flexibility

and when the default page coloring7 is retained, it can cause unbalanced L2 cache

use. In the case of the Sun Fire T2000 server, this behavior is changed by adding a

)-$9!0,)+)$-,$.!030%+,VXL in S&<9SEgE<&= or by applying patch 118833-03 or

later. Note that this patch can be installed as part of a recommended patch set or a

Solaris update.

Page 18: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.15 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Figure 7. 64K memory pages in profile of John the Ripper on a Sun Fire T2000 server

Note: While the change in the !0,)+)$-,$.!030%+,V parameter in S&<9SEgE<&= improves the performance of John the Ripper, it can adversely affect the performance of other applications.

The (#))$&$4567 command can be used to demonstrate the effect of the change

in the !0,)+)$-,$.!030%+,V parameter by measuring the memory stalls before

and after modifying the page coloring, when running John the Ripper with 1 to 32

invocations:

'*43EE<5<*OT*("5=18!;91J45)NG43EgGE<5%%E8!;9/J=&=G"&5(GT";<&*0

The ;1 option is used to instruct (#))$&$ to monitor the memory bus — ("5=18

while !;91J45)NG43EgGE<5%%E assigns the first counter to count the number of

memory stalls, and "+!5X*-*.%-&/.1%+$- assigns the second counter to count

the number of memory operations over the bus. The number L at the end of the

command instructs (#))$&$ to print out the accumulated information every two

seconds. The total number of memory accesses across the memory bus seen in the

results is 337 million, with 904 million memory stall events, indicating a very low L2

cache hit-rate.

After the !0,)+)$-,$.!030%+,V setting is modified to L8 John the Ripper

executed, and memory access measured again with the (#))$&$ command,

the total number of memory accesses across the memory bus goes down to 10.6

million, with 17.7 million memory stall events. This reduction of 97% in the number

of memory accesses over the memory bus indicates a huge improvement in the

L2 cache hit-rate. Running the multiple invocations of John the Ripper, scalability

Page 19: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.16 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

improves drastically, with 32 parallel invocations resulting in 1.7 million DES

encryptions/second, compared to less than 400,000 DES encryptions/second before

the !0,)+)$-,$.!030%+,V setting was changed.

Memory Placement Optimization with OpenMPThe Solaris OS optimizes memory allocation for NUMA architectures. This is currently

relevant to UltraSPARC IV+, UltraSPARC T2+, SPARC64-VII®, AMD Opteron™, and Intel®

Xeon® processor 5500 series multiprocessor systems. Each CPU has its own memory

controller and through it — its own memory.

On the systems with NUMA CPUs running the Solaris OS, by default, the first time

uninitialized memory is accessed or when memory is allocated due to a page fault,

it is allocated on the memory local to the CPU that runs the thread accessing the

memory. This mechanism is called memory placement optimization. However, if

the first thread accessing the memory is moved to another CPU, or if the memory

is accessed by a thread running on another CPU, this can result in higher memory

latency.

The program used to test this behavior implements a matrix by vector multiplication

that scales almost linearly with the number of CPUs, for large working sets. The

benefits of first-touch optimization will show in the working routines, where the

data is accessed repeatedly. However, the optimization itself must be applied when

the data is initialized, since that is when the memory is allocated. By making use

of the Solaris MPO the code can ensure that data gets initialized by the thread that

processes the data later. This results in reduced memory latency and improved

memory bandwidth. Following is the initialization code that scales almost linearly

for large matrix and vector sizes:

599+,+$./&$94+,$9*89+,$9,89/0#(3-9[:589/0#(3-9[[*589/0#(3-9[:L7L9*DM9***;)<*;8*H:N9*'!"5?=5*#=!*!5"5%%&%*(&I53%<>)#)&C*VO99999999999)D&%-/4*8,8:58*58:L79"%+:&$-4+8\7P9***DQ99I"%&V*&90*"9H0%R9*****I#"*>HJ1:*HK):*HLLCS9*******B06H7*J*/R1:/1*I"%&V*&90*"9H0%559****I#"*>;J1:*;K=:*;LLC*D5L9******B/6;7*J*O/a_eR1:5M9******I#"*>HJ1:*HK):*HLLC5N9********=/6;76H7*J*;:5O9****i*SAOO*.)(*#I*#=!*I#"*OOAS5P9**i*SAOO*.)(*#I*!5"5%%&%*"&?;#)*OOAS5Q9T

Page 20: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.17 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

8. The OpenMP API supports multiplatform shared-memory parallel programming in C/C++ and Fortran. OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface to develop parallel applications.

This code implements two top level loops:

Lines 8 and 9 write sequentially to floating point vector :L

Lines 11 to 15 implement an outer loop, in which the floating point elements of

vector :5 are set to -1957.0 (line 12)

In the inner loop, for every iteration of the external loop, a sequence of elements

of the matrix *5 is set to the value of the outer loop’s index (lines 13 and 14)

The three I"%&V*&90*" compile time directives instruct the compiler to use the

OpenMP8 API to generate code that executes the top level loops in parallel. OpenMP

generates code that spreads the loop workload over several threads. In this case,

the initialization is the first time the memory written to in these loops is accessed,

and first-touch attaches the memory the threads use to the CPU they run on. Here

the code is executed with one or two threads on a four-way dual-core AMD Opteron

system, with two cores —1 and 2 — on different processors defined as a processor

set using the ")%)-$4567 command. This setup helps ensure almost undisturbed

execution and makes the !"#)$&$4567 command output easy to parse, while

optimizing memory allocation. The !"#)$&$ command is invoked as follows:

9!3E<5<*O9*W!;91JjkG=&=G9<"%"G!5?&G599&EE83=5EN1J1l1/8V***!;9/JjkG=&=G9<"%"G!5?&G599&EE83=5EN/J1l108V***!;90JjkG=&=G9<"%"G!5?&G599&EE83=5EN0J1l1d8EgEY*0

The command line parameters instruct !"#)$&$ to read the AMD Opteron CPU’s

hardware performance counters to monitor memory accesses broken down

by internal latency. This is based on the #*&)' value, which associates each

performance instrumentation counter (PIC) with the L1 cache, L2 cache, and main

memory, respectively. The EgE token directs cpustat to execute in system mode, and

the sampling interval is set to two seconds.

Note: Many PICs are available for the different processors that run the Solaris OS and are documented in the processor developers’ guides. See “References” on page 32 for a link to the developer’s guide for the AMD Opteron processor relevant to this section.

To illustrate the affect of NUMA and memory placement on processing, the code is

executed in the following configurations:

One thread without first-touch optimization — achieved by setting the NCPUS

environment variable to 1

Two threads without first-touch optimization — achieved by removing all of the

I"%&V*&90*" directives from the code, resulting in code where a single thread

initializes and allocates all memory but both threads work on the working set

Two threads, with first-touch optimization — achieved by setting the NCPUS

environment variable to 2 and retaining all three I"%&V*&90*" directives

Page 21: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.18 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Following are sample results of the invocations with one thread without first-touch

optimization, that show that the vast majority of memory accesses is handled by

CPU #2:

$+*- !"# -:-,$ !;91 "+!5 "+!L

0R11M 5 $+!' PLSRN 5M55N /1bM

0R11M L $+!' /b_b_d11 /a101a0e bcMe1dc

222 222 222 222 222 222

/1R11M 5 $+!' SPR5R /add1 5LMM

/1R11M L $+!' 5MLRPRLO /a1cd0ab bcaca10

Following are sample results of the invocation with two threads without first-touch

optimization. The results show that even though two threads are used, only one of

these threads had allocated the memory, so it is local to a single chip resulting in a

memory access pattern similar to the single thread example above:

$+*- !"# -:-,$ !;91 "+!5 "+!L

0R11b 5 $+!' 011M/M/e eM1eMM/ 0b01M0/

0R11b L $+!' daa1_ ONSS bM1

222 222 222 222 222 222

/1R11b 5 $+!' /111b_M0e LSOQOSL /M1e00_

/1R11b L $+!' QOL55 /01/_M QL5Q

/0R11b 5 $+!' O5QLQ d11a LSS

/0R11b L $+!' _11c_ MR5N MQ5

Finally, the sample results of the invocations with two threads with first-touch

optimization show that memory access is balanced over both controllers as each

thread has initialized it’s own working set:

$+*- !"# -:-,$ !;91 "+!5 "+!L

0R11a 5 $+!' _b1c/11b 5MN5RNNS /10b_cce

0R11a L $+!' _b010d/c /b0d1c1c /1/cMd1/

222 222 222 222 222 222

/1R11a 5 $+!' dcd/1/c0 aM10b1 5RSPPM

/1R11a L $+!' NPMSPMPR SQNM5N 5SPMLL

/0R11a 5 $+!' _/1Me MMS5 L5N

/0R11a L $+!' NSPRL M5PQ MSS

When the memory is initialized from one core and accessed from another

performance degrades, since access is through the hyperlink, which is slower than

direct access by the on-chip memory controller. The first-touch capability enables

each of the threads to access its data directly.

Page 22: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.19 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

For a large memory footprint the performance is about 800 MFLOPS for a single

thread, about 1050 MFLOPS for two threads without first-touch optimization, and

close to 1600 MFLOPS — which represents perfect scaling — for two threads with

first-touch optimization.

Analyzing I/O Performance Problems in a Virtualized EnvironmentA common performance problem is system sluggishness due to a high rate of I/O

system calls. To identify the cause it is necessary to determine which I/O system calls

are called and in what frequency, by which process, and to determine why. The first

step in the analysis is to narrow down the number of suspected processes, based on

the ratio of the time each process runs in system context versus user context since

processes that spend a high proportion of their running time in system context, can

be assumed to be requesting a lot of I/O.

Good tools to initiate the analysis of this type of problem are the :*)$&$4567 and

"%)$&$4567 commands, which examine all active processes on a system and report

statistics based on the selected output mode and sort order for specific processes,

users, zones, and processor-sets.

In the example described here and in further detail in the Appendix (“Analyzing I/O

Performance Problems in a Virtualized Environment — a Complete Description” on

page 45), a Windows 2008 server is virtualized on the OpenSolaris™ OS using the Sun™

xVM hypervisor for x86 and runs fine. When the system is activated as an Active

Directory domain controller, it becomes extremely sluggish.

As a first step in diagnosing this problem, the system is examined using the

:*)$&$9O command, which prints out a high-level summary of system activity every

five seconds, with the following results:

Examining these results shows that the number of system calls reported in the

Eg column increases rapidly as soon as the affected virtual machine is booted,

and remains quite high even though the CPU is constantly more than 79% idle,

as reported in the +/ column. While it is known from past experience that a CPU-

'$D% =&=#"g "&V- /+)' H&#3$) !"#% ( 1 )1&" H%-- %- *H "+"0 H% /- )% =1 *5 *L *M+, Eg !) #) Eg +/1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 M M 1 1 SSN NN5 Q5Q 1 L SR1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 1 1 1 1 SP5 N5P Q5M 1 1 /111 1 1 5QPM5NNR d1a__0M QS NPO 1 1 1 1 1 1 1 1 1 /1ed a01_ 5NLR 5 L SQ1 1 1 /ec1d_0d d1e0/dM d1e NOON 1 5 5 1 1 P P 1 1 /1__M QLQRM 010/b N 5Q QS1 1 1 5QOSORLR d1c0bc1 /10 RLR 1 1 1 1 1 M M 1 1 MNN5 NNQNQ /1_01 5 5N RO1 1 1 5QOSRNSL d1cdc0M L L 1 1 1 1 1 5 5 1 1 OMPM 0M_1M RQOL L M SO1 1 1 5QOSRN5L d1c_1cM 1 1 1 1 1 1 1 01 01 1 1 /e1ee Mb10d b1cbb O Q RR1 1 1 /e_aM/1M d1c_/bc 1 1 1 1 1 1 1 1 1 1 1 RSO5 NPNOP /c/d1 L N SM

Page 23: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.20 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

bound workload on this system normally generates less than 5 thousand calls per

five second interval, here the number of calls is constantly more than 9 thousand

from the third interval and on, ranging as high as 83 thousand. Clearly something is

creating a very high system call load.

In the next step in the analysis, the processes are examined with "%)$&$9;<9;*. The

;< option instructs the "%)$&$ command to report statistics for each thread

separately, and the thread ID is appended to the executable name. The ;* option

instructs the "%)$&$ command to report information such as the percentage of time

the process has spent processing system traps, page faults, latency time, waiting for

user locks, and waiting for CPU, while the virtual machine is booted:

The first line of the output indicates that the ]-*#;/* process executes a very large

number of system calls — 200,000 — as shown in the >A< column. A factor of 100

more than U-,)$0%-/, the process with the second largest number of system calls.

At this stage, while it is known that the ]-*#;/* process is at fault, to solve the

problem it is necessary to identify which system calls are called. This is achieved with

the help of a D-script, 9#3)<GEgE95%%ER(, which prints system call rates for the

top-ten processes/system call combinations every five seconds (for the listing and

results of invocation see the Appendix).

When 9#3)<GEgE95%%ER( is run, of the top 10 process/system call pairs that are

printed, seven are executed by ]-*#;/*, with the number of executions of 3)--'

and 1%+$- the most prevalent by far.

The next stage of the investigation is to identify why ]-*#;/* calls these I/O

system calls. To answer this, an analysis of the I/O is implemented with another

D-script — ]-*#;)$&$2/, which collects the statistics of the I/O calls performed by

]-*#;/*, while focusing on 3)--' and 1%+$-9(see listing and results of invocation

in the Appendix).

The results of this invocation show that all calls to 3)--' with an absolute position

and most of the calls to 1%+$- target a single file. The vast majority of the calls to

3)--' move the file pointer of this file to an absolute position that is one byte away

m,P n+.ojfp. =>? >@> E?C EF< GF< <AB ><C 2f^ ^A_ ,Qq >A< +,Q morQ.++S2sm,P/cdM1 U:* P2S S2R 1R1 1R1 1R1 LQ ON 52R b1t 55N 2L6 1 u&=3O(=SbMPM U:* 1R/ 1R0 1R1 1R1 1R1 1R1 /11 1R1 N 5 LB 1 l&)E<#"&(S/5PMQN %00$ 1R1 1R/ 1R1 1R1 1R1 /11 1R1 1R1 /1 1 5B 1 (<"59&S/5PNN U:* 1R/ 1R/ 1R1 1R1 1R1 MM PP 1R1 OPS Q RMO 1 u&=3O(=SbLMSS %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NS 1 MRR 1 EE@(S/5PMQP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 MR 1 LSQ 1 !"E<5<S///e1_ U:* 1R1 1R/ 1R1 1R1 1R1 _1 _1 1R1 OQP 5O ROR 1 u&=3O(=Sd5POMP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NR 1 LRP 1 B)9B;&T&"S/222^#<5%`*bc*!"#9&EE&E8*/0a*%T!E8*%#5(*5B&"5?&E`*1Rcd8*1Rbe8*1Rb/

Page 24: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.21 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

9. See http://support.microsoft.com/kb/888794

from the previous position, and the vast majority of the calls to the 1%+$- system

call write a single byte. In other words, ]-*#;/* writes a data stream as single

bytes, without any buffering — an access pattern that is inherently slow.

Next, the !$%&E command is used to identify the file accessed by ]-*#;/* through

file descriptor O as the virtual Windows system disk:

'*!$%&E*/cdM1222_`*+G,ho.v*=#(&`1c11*(&B`/M08c__db*;)#`0c*3;(`c1*?;(`1*E;F&`//c0ba0be/0******rGoPsoUrG2fov.h,2.******SlB=S@&"=;5S(;ENG9SB(;ENRB=(N222

To further analyze this problem, it must be seen where the calls to 3)--' originate

by viewing the call stack. This is performed with the ]-*#;!&33)$&!'2/ script,

which prints the three most common call stacks for the 3)--' and 1%+$- system

calls every five seconds (see listing and invocation in the Appendix).

By examining the 1%+$- stack trace, it seems that the virtual machine is flushing the

disk cache very often, apparently for every byte. This could be the result of a disabled

disk cache. Further investigation uncovers the fact that if a Microsoft Windows server

acts as an Active Directory domain controller, the Active Directory directory service

performs unbuffered writes and tries to disable the disk write cache on volumes

hosting the Active Directory database and log files9. Active Directory also works in

this manner when it runs in a virtual hosting environment. Clearly this issue can only

be solved by modifying the behavior of the Microsoft Windows 2008 virtual server.

Using DTrace to Analyze Thread Scheduling with OpenMPTo achieve the best performance on a multithreaded application the workload needs

to be well balanced for all threads. When tuning a multithreaded application, the

analysis of the scheduling of the threads on the different processing elements —

whether CPUs, cores, or hardware threads — can provide significant insight into the

performance characteristics of the application.

Note: While this example focuses on the use of DTrace, many other issues important to profiling applications that rely on OpenMP for parallelization are addressed by the Sun Studio Performance Analyzer. For details see the Sun Studio Technical Articles Web site linked from “References” on page 38.

The following program — "&%$-)$2! — is includes a matrix initialization loop,

followed by a loop that multiplies two matrices. The program is compiled with the

;U&#$0"&% compiler and linker option, which instructs the compiler to generate

Page 25: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.22 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

10. If the program is executed on a single core system, it is likely to run slower, since it runs the code for multithreaded execution without benefiting from the performance improvement of running on multiple cores.

code that uses the OpenMP API to enable the program to run in several threads in

parallel on different cores, accelerating it on multicore hardware10:

;)<*=5;)>;)<*5"?98*9@5"*A5"?B67C*D****%#)?*;8*H:****I#"*>;*J*1:*;*K*,^.o:*;LLC* 56;7*J*46;7*J*96;7*J*;:****!3<E>w2rrm0xC:****I#"*>H*J*1:*H*K*o.m.f^:*HLLC* I#"*>;*J*1:*;*K*,^.o:*;LLC* ****96;7*LJ*56;7*A*46;7:T

This code lacks calls to library functions or system calls and is thus fully CPU bound.

As a result, it is only affected by CPU scheduling considerations.

The $D%-&/)!D-/2/ DTrace script that uses the schedule provider )!D-/ is

implemented to analyze the execution of "&%$-)$. This script can be used to

analyze the scheduling behavior of any multithreaded program.

$D%-&/)!D-/2/ uses the following variables:

Variable Function

(&)-3+,- Used to save the initial time-stamp when the script starts

!#%!"#;K!"#.+/ Current CPU ID

!#%!"#;K!"#.3V%" Current CPU locality group

)-3H;K/-3$& Thread local variable that measures the time elapsed between two events on a single thread

"+/ Process ID

)!&3- Scaling factor of time-stamps from nanoseconds to milliseconds

)-3H;K3&)$!"# Thread local ID of previous CPU the thread ran on

)-3H;K3&)$3V%" Thread local ID of previous locality group the thread ran on

)-3H;K)$&*" Thread local time-stamp which is also used as an initialization flag

$+/ Thread ID

59'[S3E"SE4;)S(<"59&*OEL9I"%&V*&9G90"$+0,9]#+-$M9k.v,jN9DO9**45E&%;)&*J*T5%%<;=&E<5=!:P9**E95%&*J*/111111:Q9T

The k.v,j probe fires when the script starts and is used to initialize the baseline

timestamp employed to measure all other times from the 1&33$+*-)$&*"

built-in DTrace variable. 1&33$+*-)$&*" and all other DTrace timestamps are

measured in nanoseconds, which is a much higher resolution than needed, so the

measurement is scaled down by a factor of 1 million to milliseconds. The scaling

factor is saved in the variable )!&3-.

Page 26: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.23 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

R9*E9@&(```#)O9!3S9*S*!;(*JJ*X<5"?&<*\\*[E&%IOZE<5=!*S/1*D559**E&%IOZE<5=!*J*T5%%<;=&E<5=!:5L9**E&%IOZ%5E<9!3*J*93"9!3OZ9!3G;(:5M9**E&%IOZ%5E<%?"!*J*93"9!3OZ9!3G%?"!:5N9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:5O9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(C*9"&5<&(V)x85P9**E&%IOZE<5=!8*18*<;(8*93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:5Q9T

The E9@&(``#)O9!3 probe fires whenever a thread is scheduled for running.

The first part of the predicate on line 9 — !;(*JJ*X<5"?&< — is common to all of

the probes in this example and ensures that the probe only fires for processes that

are controlled by this script.

DTrace initializes variables to zero, so the fact that the )-3H;K)$&*" thread-local

variable is still zero indicates that the thread is running for the first time and

)-3H;K)$&*" and the other thread-local variables are initialized in lines 11-14.

5R9E9@&(```#)O9!35S9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*\\*E&%IOZ%5E<9!3*V01***************************************[J*93"9!3OZ9!3G;(*SL59DLL9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*OE&%IOZE<5=!CSE95%&:LM9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:LN9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:LO9**!";)<I>wya(`yOa(*^,P*yb(*I"#=OQmn*y(>y(C*w8*E&%IOZE<5=!8LP9**********E&%IOZ(&%<58*<;(8*E&%IOZ%5E<9!38*E&%IOZ%5E<%?"!C:LQ9**!";)<I>w<#OQmn*y(>y(C*Qmn*=;?"5<;#)V)x8LR9*******93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:LS9**E&%IOZ%5E<9!3*J*93"9!3OZ9!3G;(:b1***E&%IOZ%5E<%?"!*J*93"9!3OZ9!3G%?"!:M59T

This code runs when a thread switches from one CPU to another.

The timestamp, the time the thread spent on the previous CPU, the last CPU this

thread ran on, and the CPU it is switching to are printed in lines 25-28.

ML9E9@&(```#)O9!3MM9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*\\*E&%IOZ%5E<9!3*JJ*VMN9******************************************93"9!3OZ9!3G;(*SMO9DMP9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*S*E95%&:MQ9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:MR9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:MS9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(Cx8*E&%IOZE<5=!8d1********E&%IOZ(&%<58*<;(8*93"9!3OZ9!3G;(8*93"9!3OZ9!3G%?"!C:N59**!";)<I>w"&E<5"<&(*#)*E5=&*QmnV)x8NL9T

This code runs when the thread is rescheduled to run on the same CPU it ran on

the previous time it was scheduled to run.

Page 27: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.24 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

The timestamp is updated in line 37 and a message that the thread is rescheduled

to run on the same CPU is printed together with the timestamp, in lines 39-41.

NM9E9@&(```#IIO9!3NN9S*!;(*JJ*X<5"?&<*\\*E&%IOZE<5=!*SNO9DNP9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*S*E95%&:NQ9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:NR9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:NS9**!";)<I>wya(`yOa(*^,P*yb(*Qmn*yb(>y(C*!"&&=!<&(V)x8*_1*9999999999)-3H;K)$&*"89)-3H;K/-3$&89$+/89!#%!"#;K!"#.+/89O59**********93"9!3OZ9!3G%?"!C:OL9T

The E9@&(``#IIO9!3 probe fires whenever a thread is about to be stopped by

the scheduler.

OM9E9@&(```E%&&!ON9S*!;(*JJ*X<5"?&<*SOO9DOP9**E&%IOZE#4H*J*>93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGpn^.q*{OQ9**wN&")&%*=3<&lx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGos2rQt*{OR9**wN&")&%*os*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGQ|*{OS9**w9#)(*B5"x*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzG+.pf*{c1***wN&")&%*E&=5!@#"&x*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.o*{P59**w3E&"O%&B&%*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.oGm,*{*PL9**w3E&"O%&B&%*m,*%#9Nx*`*93"%T!E;)I#OZ!"GE<g!&*JJ*PM9**********************+rkzG+}n^^2.*{*wE@3<<%&x*`*w3)N)#T)xC:PN9**E&%IOZ(&%<5*J*>T5%%<;=&E<5=!*O*E&%IOZE<5=!C*SE95%&:PO9**E&%IOZE<5=!*J*T5%%<;=&E<5=!:PP9**E&%IOZE<5=!*J*>T5%%<;=&E<5=!*O*45E&%;)&C*S*E95%&:PQ9**!";)<I>wya(`yOa(*^,P*yb(*E%&&!;)?*#)*WyEYV)x8PR9******************E&%IOZE<5=!8*E&%IOZ(&%<58*<;(8*E&%IOZE#4HC:PS9T

The E9@&(```E%&&! probe fires before the thread sleeps on a synchronization

object.

The type of synchronization object is printed in lines 67-68.

e1*E9@&(```E%&&!Q59S*!;(*JJ*X<5"?&<*\\*>*93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGQ|*UUQL9**93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.o*UUQM9**93"%T!E;)I#OZ!"GE<g!&*JJ*+rkzGn+.oGm,C*SQN9DQO9****3E<59N>C:QP9T

The second E9@&(```E%&&! probe fires when a thread is put to sleep on a

condition variable or user-level lock, which are typically caused by the application

itself, and prints the call-stack.

Page 28: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.25 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

11. For compatibility with legacy programs, setting the mfof22.2 environment variable has the same effect as setting rpmGjnpG^}o.fP+. However, if they are both set to different values, the runtime library issues an error message.

The ")%)-$ command is used to set up a processor set with two CPUs (0, 4) to

simulate CPU over-commitment:

@#E<'*!E"E&<*O9*1*d

Note: The numbering of the cores for the ")%)-$ command is system dependent.

The number of threads is set to three with the rpmGjnpG^}o.fP+11 environment

variable and $D%-&/)!D-/2/ is executed with "&%$-)$:

@#E<'*rpmGjnpG^}o.fP+Jb*RS<@"&5(E9@&(R(*O9*RS!5"<&E<

The output first shows the startup of the main thread (lines 1 to 5). The second

thread first runs at line 6 and the third at line 12:

59 1 ` 1 ^,P 59AC= 1>1C*9"&5<&(L9 1 ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnM9 1 ` 1 ^,P 59AC= 1>1C*!"&&=!<&(N9 1 ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnO9 1 ` 1 ^,P 59AC= 1>1C*!"&&=!<&(*P9 NS ` 1 ^,P L9AC= 1>1C*9"&5<&(Q9 NS ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnR9 NS ` 1 ^,P L9AC= 1>1C*!"&&=!<&(S9 NS ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn/19 NS ` 1 ^,P 0*E%&&!;)?*#)*W3E&"O%&B&%*%#9NY559 NS ` 1 ^,P L9AC= 1>1C*!"&&=!<&(5L9 NS ` 1 ^,P M9AC= 1>1C*9"&5<&(5M9 NS ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn

5N9 d01 ` be1 ^,P M9AC= 1>1C*!"&&=!<&(5O9 222

As the number of available CPUs is set to two, only two of the three threads can

run simultaneously resulting in many thread migrations between CPUs, as seen in

lines 19, 21, 36, 39, 43, and 46. At lines 24, 33, and 53 respectively, each of the three

threads goes to sleep on a condition variable:

Note: OpenMP generates code to synchronize threads which cause them to sleep. This is not related to threads migrating between CPUs resulting from a shortage of available cores, as demonstrated here.

5P9 <``CL5Q9 /ec10d ` /111 ^,P L9AC= 1>1C*!"&&=!<&(5R9 /ec10d ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*Qmn5S9 /ecM1d ` 1 ^,P b*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)019 /ecM1d ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnL59 /ecM1d ` 1 ^,P /*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)LL9 /ecM1d ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnLM9 /ec10d ` 1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*Qmn

Page 29: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.26 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

LN9 /ec/1d ` M1 ^,P b*E%&&!;)?*#)*W9#)(*B5"YLO9 /ec/1d ` 1 ^,P M9AC= d>1C*!"&&=!<&(LP9 5QPNRN ` bM1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnLQ9 5QPNRN ` 1 ^,P M9AC= d>1C*!"&&=!<&(LR9 5QPNRN ` b__1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnLS9 5QPPLN ` /d1 ^,P 59AC= d>1C*!"&&=!<&(b19 5QPPLN ` /d1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnM59 5R5LPP ` /M1 ^,P M9AC= d>1C*!"&&=!<&(ML9 5R5LPP ` /M1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMM9 5R5LPP ` 1 ^,P /*E%&&!;)?*#)*W9#)(*B5"YMN9 /M_1ed ` /c1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMO9 5RO5PN ` a1 ^,P 59AC= d>1C*!"&&=!<&(MP9 5RO5PN ` a1 ^,P 0*I"#=OQmn*1>1C*<#OQmn*d>1C*Qmn*=;?"5<;#)MQ9 5RO5PN ` 1 ^,P L9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnMR9 5ROS5N ` e_1 ^,P L9AC= d>1C*!"&&=!<&(MS9 5ROS5N ` /01 ^,P /*I"#=OQmn*1>1C*<#OQmn*d>1C*Qmn*=;?"5<;#)d19 5ROS5N ` 1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnN59 /Mc1bd ` 1 ^,P M9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNL9 /Mc1ed ` d1 ^,P M9AC= 1>1C*!"&&=!<&(NM9 /Mc1ed ` 1 ^,P /*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)NN9 /Mc1ed ` 1 ^,P 59AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNO9 5RPLQO ` 011 ^,P 59AC= 1>1C*!"&&=!<&(*NP9 5RPLQO ` /_1 ^,P 0*I"#=OQmn*d>1C*<#OQmn*1>1C*Qmn*=;?"5<;#)NQ9 5RPLQO ` 1 ^,P L9AC= 1>1C*"&E<5"<&(*#)*E5=&*QmnNR9 222NS9 01/bc0 ` 1 ^,P 59AC= d>1C*!"&&=!<&(_19 01/bc0 ` 1 ^,P M9AC= d>1C*"&E<5"<&(*#)*E5=&*QmnO59 01/bc0 ` 1 ^,P M9AC= d>1C*!"&&=!<&(OL9 01/bc0 ` 1 ^,P 59AC= d>1C*"&E<5"<&(*#)*E5=&*QmnOM9 01/bc0 ` 1 ^,P /*E%&&!;)?*#)*W9#)(*B5"Y

From line 54, the call stack dump shows that the last function called is $D%".\0+,,

which indicates the end of a parallelized section of the program with all threads

concluding their processing and only the main thread of the process remaining:

ON9 %;49RE#R/~GG%T!GT5;<L1ldOO9 %;49RE#R/~G<@"!GH#;)L1lbMOP9 %;4=<ENRE#R/~<@"&5(EG$);L1l/eMOQ9 %;4=<ENRE#R/~%;4=<ENG$);L1l/9OR9 %;4=<ENRE#R/~95%%G5""5gL1l51OS9 %;4=<ENRE#R/~95%%G$);L1l41c19 %;4=<ENRE#R/~5<&l;<G$);L1lM1P59 %;49RE#R/~G&l;<@5)(%&L1lddPL9 %;49RE#R/~&l;<L1ldPM9 !5"<&E<~GE<5"<L1l/Md

Page 30: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.27 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Using DTrace with MPIThe Message Passing Interface (MPI) de facto standard is a specification for

an API that allows many computers to communicate with one another. Sun

HPC ClusterTools™ software is based directly on the Open MPI open-source

implementation of the MPI standard, and reaps the benefits of that community

initiative, which has deployments and demonstrated scalability on very large scale

systems. Sun HPC ClusterTools software is the default MPI distribution in the Sun

HPC Software. It is fully tested and supported by Sun on the wide spectrum of Sun

x86 and SPARC processor-based systems. Sun HPC ClusterTools provides the libraries

and run-time environment needed for creating and running MPI applications and

includes DTrace providers, enabling additional profiling capabilities.

Note: While this example focuses on the use of DTrace with MPI, alternative, complementary, and supplementary ways to profile MPI based applications are provided by the Sun Studio Performance Analyzer. The Analyzer directly supports the profiling of MPI, and provides features that help understand message transmission issues and MPI stalls. For details see the Sun Studio Technical Articles Web site linked from “References” on page 38.

The MPI standard states that MPI profiling should be implemented with wrapper

functions. The wrapper function performs the required profiling tasks, and the real

MPI function is called inside the wrapper through a profiling MPI (PMPI) interface.

However, using DTrace for profiling has a number of advantages over the standard

approach:

The PMPI interface is compiled code that requires restarting a job every time a

library is changed. DTrace is dynamic and profiling code can be implemented in D

and attached to a running process.

MPI profiling changes the target system, resulting in differences between the

behavior of profiling and non-profiling code. DTrace enables the testing of an

actual production system with, at worst, a negligible affect on the system under

test.

DTrace allows the user to define probes that capture MPI tracing information with

a very powerful and concise syntax.

The profiling interface itself is implemented in D, which includes built-in

mechanisms that allow for safe, dynamic, and flexible tracing on production

systems.

When DTrace is used in conjunction with MPI, DTrace provides an easy way to

identify the potentially problematic function and the desired job, process, and

rank that require further scrutiny.

D has several built-in functions that help in analyzing and trouble shooting

problematic programs.

Page 31: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.28 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Setting the Correct *"+%#, PrivilegesThe *"+%#, command controls several aspects of program execution and uses

the Open Run-Time Environment (ORTE) to launch jobs. /$%&!-."%0! and

/$%&!-.#)-% privileges are needed to run a script with the *"+%#, command,

otherwise DTrace fails and reports an insufficient privileges error. To determine

whether the correct privileges are assigned, the following shell script, *"""%+:2)D,

must be executed by *"+%#,:

'[S4;)SE@I9*"""%+:2)D9;9%#,9""%+:4579#,/-%9&9)D-339$09)--9$D-9I9"%+:+3-V-)90H9$D-9"%0!-))9$D&$9*"+%#,9!%-&$-)!!";B*XX

Execute the following command to determine whether the correct privileges are

assigned for a cluster consisting of host1 and host2:

Z9*"+%#,9;,"9L9;;D0)$9D0)$58D0)$L9*"""%+:2)D

The privileges can be set by the system administrator. Once they are set,

/$%&!-.#)-% and /$%&!-."%0! privileges are reported by *"""%+:2)D as follows:

d1M_`*O9E@****�5?E*J*K)#)&Z****.`45E;98(<"59&G!"#98(<"59&G3E&"****,`45E;98(<"59&G!"#98(<"59&G3E&"****m`45E;98(<"59&G!"#98(<"59&G3E&"****2`*5%%

A Simple Script for MPI TracingThe following script, *"+$%&!-2/, is used to trace the entry into and exit from all of

the MPI API calls:

!;(X<5"?&<`%;4=!;`pm,GA`&)<"gD****!";)<I>w.)<&"&(*yERRRx8*!"#4&I3)9C:T!;(X<5"?&<`%;4=!;`pm,GA`"&<3")D****!";)<I>w&l;<;)?8*"&<3")*B5%3&*J*y(V)x8*5"?/C:T

To trace the entry and exit of MPI calls in four instances of an MPI application,

*"+&""8 running under dtrace and using the *"+$%&!-2/ script, it seems you

should execute the following command:

Z9*"+%#,9;,"9N9/$%&!-9;09*"+&""2$%&!-9;)9*"+$%&!-2/9;!9*"+&""

Page 32: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.29 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Where:

;,"9N instructs the *"+%#, command to invoke four copies of /$%&!-

;)9*"+$%&!-2/ instructs /$%&!- to run the *"+$%&!-2/ D-script

;!9*"+&"" instructs each instance of the /$%&!- command to run *"+&""

In this invocation, the output for all processes is directed to the standard-output and

the different output from different processes is mixed up in the single output file

*"+&""2$%&!-. To direct the output of each invocation of the /$%&!- command to

a separate file, invoke it using the following "&%$%&!-2)D script:

'[S4;)SE@(<"59&*OE*X/*O9*X0*O#*X0RXrpm,GQrppGsro2PGofjtR<"59&

Where:

OE*X/ instructs /$%&!- to run the D-script passed to it as the first argument

O9*X0 instructs /$%&!- to run the application named in the second argument

O#*X0RXrpm,GQrppGsro2PGofjtR<"59& directs the trace-output to the file

named from the concatenation of the second argument, the MPI-rank of the shell

and the string trace. If this file exists, the output is appended to it.

Now invoke four copies of the "&%$%&!-2)D script that invokes *"+&"":

Z9*"+%#,9;,"9N9"&%$%&!-2)D9*"+$%&!-2/9*"+&""

Note: The status of the OMPI_COMM_WORLD_RANK shell variable is unstable and subject to change. The user may need to change it to its replacement.

To attach the /$%&!- command with the same *"+$%&!-2/ D-script to a running

instance of the MPI program *"+&"" with process id 24768 run the following

command:

Z9/$%&!-9;"9LNQPR9;)9*"+$%&!-2/

When DTrace running the *"+$%&!-2/ script is attached to a job that performs )-,/

and %-!: operations, the output looks similar to the following:

y*(<"59&*Ou*O!*0dee1*OE*=!;<"59&R(.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,Go&9BRRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,Go&9BRRR&l;<;)?8*"&<3")*B5%3&*J*1.)<&"&(*pm,G+&)(RRR&l;<;)?8*"&<3")*B5%3&*J*1

Page 33: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.30 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

12. In MPI, a communicator is an opaque object with a number of attributes together with simple rules that govern its creation, use, and destruction. Communicators are the basic data types of MPI.

The *"+$%&!-2/ script can be easily modified to include an argument list. The

resulting output resembles output of the $%#)) command:

59*!;(X<5"?&<`%;4=!;`pm,G+&)(`&)<"g8L9*!;(X<5"?&<`%;4=!;`pm,GAE&)(`&)<"g8M9*!;(X<5"?&<`%;4=!;`pm,Go&9B`&)<"g8N9*!;(X<5"?&<`%;4=!;`pm,GA"&9B`&)<"gO9*DP9***!";)<I>wyE>1lyl8*y(8*1lyl8*y(8*y(8*1lylCx8!"#4&I3)98Q9************************5"?18*5"?/8*5"?08*5"?b8*5"?d8*5"?_C:R99TS9*!;(X<5"?&<`%;4=!;`pm,G+&)(`"&<3")8/1*!;(X<5"?&<`%;4=!;`pm,GAE&)(`"&<3")8559!;(X<5"?&<`%;4=!;`pm,Go&9B`"&<3")85L9!;(X<5"?&<`%;4=!;`pm,GA"&9B`"&<3")5M9D5N9**!";)<I>wV<V<*J*y(V)x8*5"?/C:5O9T

The *"+$%#))2/ script demonstrates the use of wildcard names to match all send

and receive type function calls in the MPI library. The first probe shows the usage of

the built-in &%V variables to print out the argument list of the traced function. The

following example shows a sample invocation of the *"+$%#))2/ script:

y*(<"59&*Ou*O!*0dee1*OE*=!;<"3EER(pm,G+&)(>1lM1de1418*/8*1lM1c1IdM8*18*/8*1lM1c1(dMC*J*1pm,Go&9B>1lM1de15M8*/8*1lM1c1IdM8*18*18*1lM1c1(dMC*J*1pm,G+&)(>1lM1de1418*/8*1lM1c1IdM8*18*/8*1lM1c1(dMC*J*1pm,Go&9B>1lM1de15M8*/8*1lM1c1IdM8*18*18*1lM1c1(dMC*J*1

Tracking Down MPI Memory LeaksMPI communicators12 are allocated by the MPI middleware at the request of

application code, making it difficult to identify memory leaks. The following

D-script — *"+!0**!D-!'2/9a9demonstrates how DTrace can help overcome this

challenge.

*"+!0**!D-!'2/ probes for MPI functions that allocate and deallocate

communicators, and keeps track of the call-stack for each function. Every 10 seconds,

the script dumps out the count of MPI communicator calls and the total number of

instances of communicator that were allocated and freed. At the end of the DTrace

session, the script prints the total count and the different stack traces, as well as the

number of times those stack traces were seen.

59k.v,jL9DM9**5%%#9E*J*1:*(&5%%#9E*J*1:*E&9#)(E*J*1:N9T

The k.v,j probe fires when the script starts and initializes the count of

allocations, deallocations, and seconds counter.

Page 34: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.31 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

O9*!;(X<5"?&<`%;4=!;`pm,GQ#==G9"&5<&`&)<"g8P9*!;(X<5"?&<`%;4=!;`pm,GQ#==G(3!`&)<"g8Q9*!;(X<5"?&<`%;4=!;`pm,GQ#==GE!%;<`&)<"gR9*DS9***LL5%%#9E:/1***]9#3)<E6!"#4&I3)97*J*9#3)<>C:559**]E<59NE63E<59N>C7*J*9#3)<>C:5L9T

This code section implements the communicator constructor probes —

pm,GQ#==G9"&5<&, pm,GQ#==G(3!, or pm,GQ#==GE!%;<.

Line 9 increments the allocation counter, line 10 saves the count of calls to the

specific function, and line 11 saves the count of the stack frames.

5M9!;(X<5"?&<`%;4=!;`pm,GQ#==GI"&&`&)<"g5N9D5O9**LL(&5%%#9E:5P9**]9#3)<E6!"#4&I3)97*J*9#3)<>C:5Q9**]E<59NE63E<59N>C7*J*9#3)<>C:5R9T

This code section implements the communicator destructor probe —

pm,GQ#==GI"&&R

Line 15 increments the deallocation counter, line 16 saves the count of calls to

pm,GQ#==GI"&&,9and line 17 saves the count of the stack frames.

5S9!"#$%&```<;9NO/1E&901*DL59**!";)<I>wJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJxC:LL9**!";)<5>]9#3)<EC:LM9**!";)<I>wQ#==3);95<#"*5%%#9E*J*y(V)x8*5%%#9EC:LN9**!";)<I>wQ#==3);95<#"*(&5%%#9E*J*y(V)x8*(&5%%#9EC:LO9**E&9#)(E*J*1:LP9T

The code section runs every 10 seconds and prints the counters.

LQ9.jPLR9DLS9**!";)<I>wQ#==*5%%#9E*J*y(8*Q#==*(&5%%#9E*J*y(V)x8b1*******5%%#9E8*(&5%%#9EC:M59T

The .jP probe fires when the DTrace session exits.

The count of the number of times each call stack was seen is printed in line 29.

DTrace prints aggregations on exit by default, so the three communicator

constructors and the communicator destructor call stacks are printed.

When this script is run, a growing difference between the total number of allocations

and deallocations can indicate the existence of a memory leak. Using the stack

traces dumped, it is possible to determine where in the application code the objects

Page 35: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.32 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

are allocated and deallocated to help diagnose the issue. The following example

demonstrates such a diagnosis.

The *"+!0**3-&' MPI application performs three pm,GQ#==G(3! operations and

two pm,GQ#==GI"&& operations. The program thus allocates one communicator

more than it frees with each iteration of a specific loop. Note that this in itself does

not necessarily mean that the extra allocation is in fact a memory leak, since the

extra object might be referred to by code that is external to the loop in question.

When DTrace is attached to *"+!0**3-&' using *"+!0**!D-!'2/, the count

of allocations and deallocations is printed every 10 seconds. In this example, this

output shows that the count of the allocated communicators is growing faster than

the count of deallocations.

When the DTrace session exits, the .jP code outputs a total of five stack traces

— three for the constructor and two for the destructor call stacks, as well as the

number of times each call stack was encountered.

The following command attaches the *"+!0**!D-!'2/9D-script to process 24952:

Z9/$%&!-9;]9;"9LNSOL9;)9*"+!0**!D-!'2/

In the following output, the difference between the number of communicators

allocated and freed grows continuously, indicating a potential memory leak:

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

pm,GQ#==GI"&& N

pm,GQ#==G(3! P

Q#==3);95<#"*f%%#95<;#)E*J*c

A0**#,+!&$0%9G-&330!&$+0,)9X9N

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

pm,GQ#==GI"&& R

pm,GQ#==G(3! 5L

Q#==3);95<#"*f%%#95<;#)E*J*/0

A0**#,+!&$0%9G-&330!&$+0,)9X9R

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

pm,GQ#==GI"&& 5L

pm,GQ#==G(3! 5R

Q#==3);95<#"*f%%#95<;#)E*J*/M

A0**#,+!&$0%9G-&330!&$+0,)9X95L

Page 36: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.33 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

After the D-script is halted (by the user interrupting DTrace), the final count of

allocations and deallocations, together with the table of stack traces, is printed:

Q#==3);95<#"*f%%#95<;#)E*J*0/8*Q#==3);95<#"*P&5%%#95<;#)E*J*/d

%;4=!;RE#R1R1R1Ypm,GQ#==GI"&&

=!;9#==%&5NY(&5%%#95<&G9#==EL1l/a

=!;9#==%&5NY=5;)L1lc(

=!;9#==%&5NY1lM1_1M/5

Q

%;4=!;RE#R1R1R1Ypm,GQ#==GI"&&

=!;9#==%&5NY(&5%%#95<&G9#==EL1l0c

=!;9#==%&5NY=5;)L1lc(

=!;9#==%&5NY1lM1_1M/5

Q

%;4=!;RE#R1R1R1Ypm,GQ#==G(3!

=!;9#==%&5NY5%%#95<&G9#==EL1l/&

=!;9#==%&5NY=5;)L1l_4

=!;9#==%&5NY1lM1_1M/5

Q

%;4=!;RE#R1R1R1Ypm,GQ#==G(3!

=!;9#==%&5NY5%%#95<&G9#==EL1lb1

=!;9#==%&5NY=5;)L1l_4

=!;9#==%&5NY1lM1_1M/5

Q

%;4=!;RE#R1R1R1Ypm,GQ#==G(3!

=!;9#==%&5NY5%%#95<&G9#==EL1ld0

=!;9#==%&5NY=5;)L1l_4

=!;9#==%&5NY1lM1_1M/5

Q

In analyzing the output of the script, it is clear that 50% more communicators are

allocated than are deallocated, indicating a potential memory leak. The five stack

traces, each occurring seven times, point to the code that called the communicator

constructors and destructors. By analyzing the handling of the communicator

objects by the code executed following the calls to their constructors, it is possible to

identify whether there is in fact a memory leak, and its cause.

Using the DTrace *"+"-%#)- ProviderPERUSE is an extension interface to MPI that exposes performance related processes

and interactions between application software, system software, and MPI message-

passing middleware. PERUSE provides a more detailed view of MPI performance

Page 37: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.34 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

compared to the standard MPI profiling interface (PMPI). Sun HPC ClusterTools

includes a DTrace provider named *"+"-%#)- that supports DTrace probes into the

MPI library.

The DTrace *"+"-%#)- provider exposes the events defined in the MPI PERUSE

specification that track the MPI requests. To use an *"+"-%#)- probe, add the

string =!;!&"3E&X<5"?&<``` before the probe-name, remove the PERUSE_ prefix,

convert the letters into lower-case, and replace the underscore characters with

dashes. For example, to specify a probe to capture a PERUSE_COMM_REQ_ACTIVATE

event, use =!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&R*If the optional

object and function fields in the probe description are omitted, all of the

!0**;%-];&!$+:&$- probes in the MPI library and all of its plug-ins will fire.

All of the *"+"-%#)- probes recognize the following four arguments:

Argument Function

*"+!0**+,H0.$9[+ Provides a basic source and destination for the request and which protocol is expected to be used for the transfer.

#+,$"$%.$9#+/ The PERUSE unique ID for the request that fired the probe (as defined by the PERUSE specifications). For OMPI this is the address of the actual request.

#+,$.$90" Indicates whether the probe is for a send or receive request.

*"+!0**.)"-!.$9[!) Mimics the )"-! structure, as defined in the PERUSE specification.

To run D-scripts with *"+"-%#)- probes, use the O- (upper case) switch with the

/$%&!- command since the *"+"-%#)- probes do not exist at initial load time:

'[S4;)SE@'*!5"<"59&RE@*O*5*@&%!&"*E9";!<*<#*(<"59&*pm,*H#4E*I"#=I9$D-9)$&%$90H9$D-9\0(2(<"59&*O-*OE*X/*O9*X0*O#*X0RXrpm,GQrppGsro2PGofjtR<"59&

Tracking MPI Message Queues with DTrace

Sun HPC ClusterTools™ software includes an example of using *"+"-%#)- probes

on the Solaris OS that allows users to see the activities of the MPI message queues.

To run the script, *"+)$&$2/, a compiled MPI program that is executed with the

*"+%#, command is needed to create two MPI processes. In this example, the

process ID of the traced process is 5MMQ5.

Run the *"+)$&$2/ D-script:

Z9/$%&!-9;]9;"95MMQ59;)9*"+)$&$2/

Page 38: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.35 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

The *"+)$&$2/ D-script prints out the following columns of information on the MPI

message queues every second:

Col PERUSE event probe Meaning

1 o.�Gqh.oGQrj^,jn. Count of bytes received

2 o.�GfQ^,|f^. Number of message transfers activated

3 o.�G,j+.o^G,jGmr+^.PG� Number of messages inserted in the posted request queue

4 p+vG,j+.o^G,jGnj.qG� Number of messages inserted into the unexpected queue

5 p+vGpf^Q}Gmr+^.PGo.� Number of incoming messages matched to a posted request

6 o.�Gpf^Q}Gnj.q Number of user requests matched to an unexpected message

7 o.�Gqh.oG.jP Count of bytes that have been scheduled for transfer

8 o.�GfQ^,|f^. Number of times MPI has initiated the transfer of a message

A Few Simple *"+"-%#)- Usage ExamplesThe examples in this section show how to perform the described DTrace operations

from the command line. In each example, the process ID shown (254) must be

substituted with the process ID of the process to be monitored.

Count the number of messages to or from a host:

y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G"&=#<&7*J*9#3)<>C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(W**&$!D-/95Q9"%0(-)bAD0)$L9%-!:9MD0)$L9)-,/9M

Count the number of messages to or from specific MPI byte transfer layers (BTLs):

y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G!"#<#9#%7*J*9#3)<>C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95Q9"%0(-)bAE=*c1

Page 39: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.36 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Obtain distribution plots of message sizes sent or received from a host:

y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V9**]65"?E617OZ9;G"&=#<&7*J*u35)<;F&>5"?E6b7OZ=9EG9#3)<C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95Q9"%0(-)bA=g@#E<

:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$

L U 1

N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] N

R U 1

Create distribution plots of message sizes by communicator, rank, and send/receive:

y*(<"59&*O!*0_d*O)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*D*V*9**]65"?E6b7OZ=9EG9#==8*5"?E6b7OZ=9EG!&&"8*5"?E6b7OZ=9EG#!7*V9**J*u35)<;F&>5"?E6b7OZ=9EG9#3)<C:iY(<"59&`*(&E9";!<;#)*W=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(*W**&$!D-/95S9"%0(-)bA5MNP5NRPN9 59%-!:

:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$

L U 1

N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] S

R U 15MNP5NRPN9 59)-,/

:&3#- ;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;; !0#,$

L U 1

N U]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] S

R U 1

Page 40: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.37 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Chapter 4

Conclusion

The increased complexity of modern computing and storage architectures with

multiple CPUs, cores, threads, virtualized operating systems, networking, and

storage devices present serious challenges to system architects, administrators,

developers, and users. The requirements of high availability and reliability,

together with the increasing pressure on datacenter budgets and resources,

create an environment where it is necessary to maintain systems at a high level of

performance — without the ability to simply add resources to resolve performance

issues.

To achieve these seemingly conflicting objectives, Sun Microsystems provides a

comprehensive set of tools on the Solaris 10 OS with DTrace at their core, that

enable unprecedented levels of observability and insight into the workings of the

operating system and the applications running on it. These tools allow datacenter

professionals to quickly analyze and diagnose issues in the systems they maintain

without increasing operational risk, making the promise of observability as a driver

of consistent system performance and stability a reality.

AcknowledgmentsThomas Nau of the Infrastructure Department, Ulm University, provided the content

for the chapters on DTrace and its Visualization Tools and Performance Analysis

Examples, except for the section on Using DTrace with MPI.

The section on Using DTrace with MPI is based on the article Using the Solaris DTrace

Utility With Open MPI Applications, by Terry Dontje, Karen Norteman, and Rolf vande

Vaart of Sun Microsystems.

Page 41: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.38 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Chapter 5

References

Table 3. Web sites for more information

Description URL

DLight tutorial http://blogs.sun.com/solarisdev/entry/project_d_light_tutorial

Another DLight tutorial http://developers.sun.com/sunstudio/documentation/tutorials/d_light_tutorial/index.html

Yet another DLight tutorial http://developers.sun.com/sunstudio/documentation/tutorials/dlight/

Chime Visualization Tool for DTrace http://opensolaris.org/os/project/dtrace-chime/

The NetBeans DTrace GUI Plugin Tutorial (includes Chime)

http://netbeans.org/kb/docs/ide/NetBeans_DTrace_GUI_Plugin_0_4.html

DTraceToolkit http://opensolaris.org/os/community/dtrace/dtracetoolkit/

Proposed cpc DTrace provider http://wikis.sun.com/display/DTrace/cpc+Provider

DTrace limitations page in the Solaris internals wiki

http://solarisinternals.com/wiki/index.php/DTrace_Topics_Limitations

Sun Studio documentation http://developers.sun.com/sunstudio/documentation/index.jsp

Sun Studio Technical Articles http://developers.sun.com/sunstudio/documentation/techart/

Sun HPC ClusterTools http://sun.com/clustertools

Sun HPC ClusterTools 8 Documentation

http://docs.sun.com/app/docs/prod/hpc.cluster80~hpc-clustertools8?l=en#hic

MPO/lgroup observability tools http://opensolaris.org/os/community/performance/numa/observability/tools/

John the Ripper password evaluation tool

http://openwall.com/john/

BIOS and Kernel Developer’s Guide for the AMD Athlon 64 and AMD Opteron Processors AMD publication #26094

http://support.amd.com/us/Processor_TechDocs/26094.PDF

Open MPI: Open-source High Performance Computing Web site

http://open-mpi.org

Using the Solaris DTrace Utility With Open MPI Applications

http://developers.sun.com/solaris/articles/dtrace_article.html

Page 42: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.39 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Chapter 6

Appendix

Chime Screenshots

Figure 8. Processes sorted by system call load

Figure 9. Plotting of reads over time with the rfilio DTraceToolkit script

Page 43: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.40 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Listings for Matrix Addition !"#$%&!' Example

gnuplot Program to Graph the Data Generated by !"#$%&!'

)-$9$-%*+,&39"0)$9-")9!030%9-,D)-$90#$"#$9c!"#$%&!'.&//.%-&/)2-")dE&<*l%54&%*w<;=&*6E7xE&<*g%54&%*w2/*PO959@&*=;EE&ESExE&<*g0%54&%*w,)E<"39<;#)*9#3)<SExE&<*I#"=5<*g**wyR/0?xE&<*I#"=5<*g0*wyRb?xE&<*N&g*";?@<*9&)<&"!%#<*w9!3<"59NG5((G9#%R(5<5x*3E;)?*_*<;<%&*wPO959@&*=;EE&E*9#%xV*T;<@*%;)&E*%<*/8*w9!3<"59NG5((G"#TR(5<5x*3E;)?*_*<;<%&*V*wPO959@&*=;EE&E*"#Tx*T;<@*%;)&E*%<*08w9!3<"59NG5((G9#%R(5<5xV9*3E;)?*c*<;<%&*w,)E<"39<;#)*9#3)<*9#%x*5l&E*l/g0*T;<@*%;)&E8Vw9!3<"59NG5((G"#TR(5<5x*3E;)?*c*<;<%&*w,)E<"39<;#)*9#3)<*"#TxV*5l&E*l/g0*T;<@*%;)&E

!"#$%&!'.&//.%012/&$& /R/ba* 01d1e* /* <;9N* _dbMec* Me/_1e_/0R/ba* 01d1e* /* <;9N* /1M1_/_* /e0M_c01dbR/ba* 01d1e* /* <;9N* e_/ddM* /01/a1cbbdR/Ma* 01d1e* /* <;9N* M11cd1* /0M1ddec__Rdbc* 01d1e* /* <;9N* /1/ca0b* /c0e1/abdcR10a* 01d1e* /* <;9N* /b01ddd* /1_cd/00/eRb0/* 01d1e* /* <;9N* 0/ee1b1* /ed/e/e10MR//a* 01d1e* /* <;9N* /eee_a0* /d00/dadeaR/0a* 01d1e* /* <;9N* /_daMb_* /0baabbMM/1R/dM*01d1e* /* <;9N* /_cbe/1* /0_/1beb1//R/da*01d1e* /* <;9N* 000ea1M* /eM0d000_/0R1bc*01d1e* /* <;9N* /0ca_cc* /1/_e1caM/bR10a*01d1e* /* <;9N* 00/bedM* /ee/1aba0/dR1da*01d1e* /* <;9N* /_cd/e1* /0_/d1/1c/_R1_a*01d1e* /* <;9N* /_dbaeM* /0b_0_1db/cR1ca*01d1e* /* <;9N* 00_/cdd* /M1/d//c0/eR/da*01d1e* /* <;9N* /cab1bc* /b_d_11_b/MR/_a*01d1e* /* <;9N* 00_/10_* /M11a/Md1/aR/ca*01d1e* /* <;9N* 00_/0/d* /M1/1cMe1

!"#$%&!'.&//.!032/&$& /R1ee* 01db_* /* <;9N* /Mbe* /M/c/10R01_* 01db_* /* <;9N* bacb* __a01bR/__* 01db_* /* <;9N* __a0* __a01dR///* 01db_* /* <;9N* __ad* __ad1_R//M* 01db_* /* <;9N* _cbaa10* _cd1deadcR1cM* 01db_* /* <;9N* _bdbd1_* _bdbadbMeR1eM* 01db_* /* <;9N* _ce1/ec* _ce1eMcMMR/0M* 01db_* /* <;9N* _Mcc10b* _Mccc0d0aR1MM* 01db_* /* <;9N* _ba0e0b* _bab0e__

Page 44: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.41 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

/1R1bM*01db_* /* <;9N* _bd/bae* _bd/ac/a//R1dM*01db_* /* <;9N* _cca0cb* _ccaMccc/0R1MM*01db_* /* <;9N* _Mde1ca* _Mdeceed/bR1dM*01db_* /* <;9N* _bab1Me* _babc_cb/dR1_M*01db_* /* <;9N* _cca1dc* _ccac_0M/_R1aM*01db_* /* <;9N* _M_1/0M* _M_1e0a//cR1_M*01db_* /* <;9N* _bM0bc/* _bM0adbb/eR1cM*01db_* /* <;9N* _ceaac1* _cM1_cbb/MR/1M*01db_* /* <;9N* _MbM/Mc* _MbMeMMa/aR1cM*01db_* /* <;9N* _bad0aa* _badM_Mb01R1eM*01db_* /* <;9N* _ce_a_0* _cec_bcM0/R1MM*01db_* /* <;9N* _e1dcd_* _e1_1deb00R1aM*01db_* /* <;9N* _eb0c_0* _eb0ac0e0bR/1M*01db_* /* <;9N* _eb/M_d* _eb0/c/M0dR//M*01db_* /* <;9N* _e0aMc/* _eb1/ecd0_R/0M*01db_* /* <;9N* _eb/eaM* _eb0/1Me0cR/bM*01db_* /* <;9N* _e0_ccc* _e0_aece0eR/dM*01db_* /* <;9N* _eb/edc* _eb01__d0MR10M*01db_* /* <;9N* daa1dd/* daa1e/de0aR/aM*01db_* /* <;9N* ccd1Me0* ccd/0b01b1R01M*01db_* /* <;9N* _eb/1c1* _eb/be1eb/R0/M*01db_* /* <;9N* _eb0MMc* _ebb/aceb0R10M*01db_* /* <;9N* d_ac/cc* d_acd/dbbbR1bM*01db_* /* <;9N* _eb1c_c* _eb1ae1ebdR1dM*01db_* /* <;9N* _ebb0d/* _ebb__0_b_R1_M*01db_* /* <;9N* _eb0/1b* _eb0d/dcbcR1cM*01db_* /* <;9N* _ebd01M* _ebd_/MebeR1eM*01db_* /* <;9N* _ebdeMM* _eb_1aMebMR1MM*01db_* /* <;9N* _eb10_b* _eb1_c/_baR1aM*01db_* /* <;9N* _eb0b1a* _eb0c/Mad1R/1M*01db_* /* <;9N* _eb0_aa* _eb0a/1_d/R//M*01db_* /* <;9N* _eb/ce_* _eb/aMdad0R/0M*01db_* /* <;9N* _eb/dbM* _eb/edMedbR/bM*01db_* /* <;9N* _ebbM/b* _ebd/0dcddR/dM*01db_* /* <;9N* _eb0e_e* _ebb1c_Md_R/_M*01db_* /* <;9N* _eb0bdM* _eb0cc1/dcR/cM*01db_* /* <;9N* _eb/bbd* _eb/cdb1deR/eM*01db_* /* <;9N* _ebb1c1* _ebbbe/ddMR/MM*01db_* /* <;9N* _eb/acc* _eb00ecddaR/aM*01db_* /* <;9N* _ebb01/* _ebb_/0/_1R01M*01db_* /* <;9N* _eb0cea* _eb0aa1c_/R0/M*01db_* /* <;9N* _e01eaM* _e0//1_e_0R10M*01db_* /* <;9N* d_adMcM* d_a_//cb_bR1bM*01db_* /* <;9N* _eb01eM* _eb0bMMe_dR1dM*01db_* /* <;9N* _eb0a_1* _ebb0c1M

Page 45: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.42 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Listings for Matrix Addition (#))$&$ Example

gnuplot Program to Graph the Data Generated by (#))$&$

)-$9$-%*+,&39"0)$9-")9!030%9-,D)-$90#$"#$9c(#))$&$.&//.%-&/)2-")dE&<*l%54&%*w<;=&*6E7xE&<*g%54&%*w"&5(ESExE&<*I#"=5<*g*wyR/1?xE&<*N&g*";?@<*9&)<&""30$9c(#))$&$.&//.!032(5<5x*3E;)?*d*<;<%&*w"&5(E*9#%x*T;<@*%;)&E8Vc(#))$&$.&//.%012/&$&d9#)+,V9N9$+$3-9c%-&/)9%01d91+$D93+,-)

(#))$&$.&//.%012/&$& /* ("5=1* =&=G"&5(E* _bbacb* =&=G"&5(E* _bbac/0* ("5=1* =&=G"&5(E* eaabaa* =&=G"&5(E* eaad1/b* ("5=1* =&=G"&5(E* _a0a/d* =&=G"&5(E* _a0a/bd* ("5=1* =&=G"&5(E* _accb/* =&=G"&5(E* _accb/_* ("5=1* =&=G"&5(E* cdc01/* =&=G"&5(E* cdc010c* ("5=1* =&=G"&5(E* ec1a/0* =&=G"&5(E* ec1a/Me* ("5=1* =&=G"&5(E* a/d/d1* =&=G"&5(E* a/d/0/M* ("5=1* =&=G"&5(E* eab1a1* =&=G"&5(E* eab//1a* ("5=1* =&=G"&5(E* M1MM/a* =&=G"&5(E* M1MM0//1* ("5=1* =&=G"&5(E* M1c1Md* =&=G"&5(E* M1c1c0//* ("5=1* =&=G"&5(E* a_0_M_* =&=G"&5(E* a_0_a1/0* ("5=1* =&=G"&5(E* eMc_01* =&=G"&5(E* eMc_bd/b* ("5=1* =&=G"&5(E* a1Mc_/* =&=G"&5(E* a1Mc_//d* ("5=1* =&=G"&5(E* eMeadb* =&=G"&5(E* eMeadc/_* ("5=1* =&=G"&5(E* ea00be* =&=G"&5(E* ea00bd/c* ("5=1* =&=G"&5(E* a1add0* =&=G"&5(E* a1add0/e* ("5=1* =&=G"&5(E* eMMa//* =&=G"&5(E* eMMa/d/M* ("5=1* =&=G"&5(E* a1aMMd* =&=G"&5(E* a1aMMb/a* ("5=1* =&=G"&5(E* a1a1ee* =&=G"&5(E* a1a1e_01* ("5=1* =&=G"&5(E* MabMaa* =&=G"&5(E* MabMMe0/* ("5=1* =&=G"&5(E* e/0b_d* =&=G"&5(E* e/0bc100* ("5=1* =&=G"&5(E* e1cac_* =&=G"&5(E* e1caba

(#))$&$.&//.!032/&$& /* ("5=1* =&=G"&5(E* dbbce1* =&=G"&5(E* dbbcea0* ("5=1* =&=G"&5(E* debaea* =&=G"&5(E* debaeab* ("5=1* =&=G"&5(E* daadca* =&=G"&5(E* daade1d* ("5=1* =&=G"&5(E* _/_1Md* =&=G"&5(E* _/_1ea_* ("5=1* =&=G"&5(E* 0/Me_0/* =&=G"&5(E* 0/Me_/_c* ("5=1* =&=G"&5(E* 0/bb110* =&=G"&5(E* 0/bb110e* ("5=1* =&=G"&5(E* 0bbM1c/* =&=G"&5(E* 0bbMbMdM* ("5=1* =&=G"&5(E* 0000_00* =&=G"&5(E* 0000011a* ("5=1* =&=G"&5(E* 0/0ae/e* =&=G"&5(E* 0/0ae/_/1* ("5=1* =&=G"&5(E* 00aab/a* =&=G"&5(E* 00aacdM//* ("5=1* =&=G"&5(E* 00_eM_d* =&=G"&5(E* 00_e_0c

Page 46: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.43 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

/0* ("5=1* =&=G"&5(E* 0/b1a0c* =&=G"&5(E* 0/b1a0_/b* ("5=1* =&=G"&5(E* 0b/11dM* =&=G"&5(E* 0b/1bea/d* ("5=1* =&=G"&5(E* 00_//a1* =&=G"&5(E* 00_1M_a/_* ("5=1* =&=G"&5(E* 0/baec/* =&=G"&5(E* 0/baec0/c* ("5=1* =&=G"&5(E* 0b/Ma_d* =&=G"&5(E* 0b/a/c_/e* ("5=1* =&=G"&5(E* 00ce1aM* =&=G"&5(E* 00ccMMe/M* ("5=1* =&=G"&5(E* 0/bbeac* =&=G"&5(E* 0/bbeac/a* ("5=1* =&=G"&5(E* 0b/ce/c* =&=G"&5(E* 0b/e1db01* ("5=1* =&=G"&5(E* 00deedb* =&=G"&5(E* 00ded/_0/* ("5=1* =&=G"&5(E* 0/0ccdc* =&=G"&5(E* 0/0ccdc00* ("5=1* =&=G"&5(E* 0d00bb1* =&=G"&5(E* 0d00bb/0b* ("5=1* =&=G"&5(E* 0/0c/d1* =&=G"&5(E* 0/0c/ba0d* ("5=1* =&=G"&5(E* 00_eeae* =&=G"&5(E* 00_M/0_0_* ("5=1* =&=G"&5(E* 00a11ab* =&=G"&5(E* 00Maecc0c* ("5=1* =&=G"&5(E* 00M_/_e* =&=G"&5(E* 00M_/_e0e* ("5=1* =&=G"&5(E* 0d0/dMM* =&=G"&5(E* 0d0/dMM0M* ("5=1* =&=G"&5(E* 0/0e1cM* =&=G"&5(E* 0/0e1ce0a* ("5=1* =&=G"&5(E* 0b1c/0a* =&=G"&5(E* 0b1cd_/b1* ("5=1* =&=G"&5(E* 00d/b/e* =&=G"&5(E* 00d1aacb/* ("5=1* =&=G"&5(E* 0/0c/1a* =&=G"&5(E* 0/0c/1ab0* ("5=1* =&=G"&5(E* 0d0b/cc* =&=G"&5(E* 0d0b0bdbb* ("5=1* =&=G"&5(E* 0/0ca0c* =&=G"&5(E* 0/0cMc/bd* ("5=1* =&=G"&5(E* 0be0b__* =&=G"&5(E* 0be0cedb_* ("5=1* =&=G"&5(E* 0/e_1_M* =&=G"&5(E* 0/edebcbc* ("5=1* =&=G"&5(E* 0/cdd1_* =&=G"&5(E* 0/cdc/bbe* ("5=1* =&=G"&5(E* 0bab/M_* =&=G"&5(E* 0ba0aecbM* ("5=1* =&=G"&5(E* 0/0e1cM* =&=G"&5(E* 0/0e1cMba* ("5=1* =&=G"&5(E* 0bedddM* =&=G"&5(E* 0bedc_cd1* ("5=1* =&=G"&5(E* 001ba_M* =&=G"&5(E* 001be_/d/* ("5=1* =&=G"&5(E* 0/bdM_e* =&=G"&5(E* 0/b_/Med0* ("5=1* =&=G"&5(E* 0d/M_b/* =&=G"&5(E* 0d/M01/db* ("5=1* =&=G"&5(E* 0/0acac* =&=G"&5(E* 0/0aca_dd* ("5=1* =&=G"&5(E* 0d/_/bM* =&=G"&5(E* 0d/_de0d_* ("5=1* =&=G"&5(E* 0/b0b_1* =&=G"&5(E* 0/b01/cdc* ("5=1* =&=G"&5(E* 001M0bc* =&=G"&5(E* 001Md_/de* ("5=1* =&=G"&5(E* 0bd/ada* =&=G"&5(E* 0bd/ebddM* ("5=1* =&=G"&5(E* 0/0_M0c* =&=G"&5(E* 0/0_M0cda* ("5=1* =&=G"&5(E* 0d01a0b* =&=G"&5(E* 0d01a0d_1* ("5=1* =&=G"&5(E* 0/de/cb* =&=G"&5(E* 0/de/c0_/* ("5=1* =&=G"&5(E* 00MMMa0* =&=G"&5(E* 00Ma/1a_0* ("5=1* =&=G"&5(E* 00eb/ca* =&=G"&5(E* 00e0a_b_b* ("5=1* =&=G"&5(E* 0/b/_Md* =&=G"&5(E* 0/b/_Mb_d* ("5=1* =&=G"&5(E* 0d0/01c* =&=G"&5(E* 0d0/01e__* ("5=1* =&=G"&5(E* /MMa__/* =&=G"&5(E* /MMa__M_c* ("5=1* =&=G"&5(E* aedd10* =&=G"&5(E* aedd/__e* ("5=1* =&=G"&5(E* c01/1e* =&=G"&5(E* c011Mb

Page 47: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.44 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Sun Studio Analyzer Hints to Collect Cache Misses on an UltraSPARC T1® Processor

-,./-)!90,+V,0%-.,0.UD1!"%0H=#4HG(&$)&*|5(("*|fPPo=#4HG(&$)&*m5(("*mfPPo=#4HG(&$)&*m"#9&EE*m,P=#4HG(&$)&*^@"&5(*>m,PA/111CL^}o,P=#4HG(&$)&*^@"&5(,P*^}o,P=#4HG(&$)&*+&9#)(E*>^+^fpmS/111111111C=#4HG(&$)&*p;)3<&E*>^+^fpmSc1111111111C=#4HG(&$)&*n+^/Gk5)N*>mfPPo\1l91CZZc=#4HG(&$)&*n+^/G20Q59@&2;)&*>mfPPo\1l0II91CZZc=#4HG(&$)&*n+^/G2/P5<5Q59@&2;)&*>mfPPo\1leI1CZZd=#4HG(&$)&*n+^/G+<"5)(*>Qmn,PC=#4HG(&$)&*n+^/GQ#"&*>Qmn,P\1l/9CZZ0=#4HG(&$)&*|fG20*|fPPoZZc=#4HG(&$)&*|fG2/*|fPPoZZd=#4HG(&$)&*mfG20*mfPPoZZc=#4HG(&$)&*mfG2/*mfPPoZZd|!5?&G0_cp*|fPPoZZ0Mm!5?&G0_cp*mfPPoZZ0M

Page 48: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.45 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Analyzing I/O Performance Problems in a Virtualized Environment — a Complete DescriptionAs described in the summary of this example in the section “Analyzing I/O

Performance Problems in a Virtualized Environment” on page 19, the following is a

full description of resolving system sluggishness due to a high rate of I/O system

calls. To identify the cause it is necessary to determine which I/O system calls are

called and in what frequency, by which process, and to determine why. The first step

in the analysis is to narrow down the number of suspected processes, based on the

ratio of the time each process runs in system context versus user context, with the

:*)$&$4567 and "%)$&$4567 commands.

In the example described here in detail, a Windows 2008 server is virtualized on the

OpenSolaris™ OS using the Sun™ xVM hypervisor for x86 and runs fine. When the

system is activated as an Active Directory domain controller, it becomes extremely

sluggish.

As a first step in diagnosing this problem, the system is examined using the

:*)$&$9O command, which prints out a high-level summary of system activity every

five seconds, with the following results:

The results show that the number of system calls reported in the Eg column

increases rapidly as soon as the affected virtual machine is booted, and remains

quite high even though the CPU is constantly more than 79% idle, as reported in the

+/ column. The number of calls is constantly more than 9 thousand from the third

interval and on, ranging as high as 83 thousand. Clearly something is creating a very

high system call load.

In the next step in the analysis, the processes are examined with "%)$&$9;<9;*.

The ;< option instructs the "%)$&$ command to report statistics for each thread

separately, and the thread ID is the appended to the executable name. The ;* option

instructs the "%)$&$ command to report information such as the percentage of time

the process has spent processing system traps, page faults, latency time, waiting for

user locks, and waiting for CPU, while the virtual machine is booted:

'$D% =&=#"g "&V- /+)' H&#3$) !"#% ( 1 )1&" H%-- %- *H "+"0 H% /- )% =1 *5 *L *M+, Eg !) #) Eg +/1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 M M 1 1 SSN NN5 Q5Q 1 L SR1 1 1 5QPMOQLN d1acb_c 1 1 1 1 1 1 1 1 1 1 1 SP5 N5P Q5M 1 1 /111 1 1 5QPM5NNR d1a__0M QS NPO 1 1 1 1 1 1 1 1 1 /1ed a01_ 5NLR 5 L SQ1 1 1 /ec1d_0d d1e0/dM d1e NOON 1 5 5 1 1 P P 1 1 /1__M QLQRM 010/b N 5Q QS1 1 1 5QOSORLR d1c0bc1 /10 RLR 1 1 1 1 1 M M 1 1 MNN5 NNQNQ /1_01 5 5N RO1 1 1 5QOSRNSL d1cdc0M L L 1 1 1 1 1 5 5 1 1 OMPM 0M_1M RQOL L M SO1 1 1 5QOSRN5L d1c_1cM 1 1 1 1 1 1 1 01 01 1 1 /e1ee Mb10d b1cbb O Q RR1 1 1 /e_aM/1M d1c_/bc 1 1 1 1 1 1 1 1 1 1 1 RSO5 NPNOP /c/d1 L N SM

Page 49: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.46 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

The first line of the output indicates that the ]-*#;/* process executes a very large

number of system calls — 200,000 — as shown in the >A< column. A factor of 100

more than U-,)$0%-/, the process with the second largest number of system calls.

At this stage, while it is known that the ]-*#;/* process is at fault, to solve the

problem it is necessary to identify which system calls are called. This is achieved with

the help of the following D-script, 9#3)<GEgE95%%ER(, which prints system call

rates for the top-ten processes/system call combinations every five seconds:

59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*k.v,j*DN9***<;=&"*J*<;=&E<5=!:*SA*)5)#E&9#)(*<;=&E<5=!*ASO99TP9*EgE95%%```&)<"g*DQ9***]95%%G9#3)<6!;(8*&l&9)5=&8*!"#4&I3)97*J*9#3)<>C:R99TS9*<;9NO_E*D/1***<"3)9>]98*/1C:559**)#"=5%;F&>]95%%G9#3)<8*><;=&E<5=!O<;=&"C*S*/111111111C:5L9**!";)<5>wy_(*yO01E*yc](*yEV)x8*]95%%G9#3)<C:5M9**9%&5">]95%%G9#3)<C:5N9**!";)<I>wV)xC:5O9**<;=&"*J*<;=&E<5=!:5P9T

The k.v,j probe fires when the script starts and is used to initialize the timer.

The EgE95%%```&)<"g probe fires whenever a system call is called and the

system call name, the name of executable that called it, and the process ID are

saved in the !&33.!0#,$ aggregation.

The $+!';O) prints the information collected— line 10 truncates the aggregation

to its top 10 entries, line 12 prints the system call count, and line 13 clears the

contents of the aggregation.

When 9#3)<GEgE95%%ER( is run, of the top 10 process/system call pairs that are

printed, seven are executed by ]-*#;/*, with the number of executions of 3)--'

and 1%+$- the most prevalent by far:

m,P n+.ojfp. =>? >@> E?C EF< GF< <AB ><C 2f^ ^A_ ,Qq >A< +,Q morQ.++S2sm,P/cdM1 U:* P2S S2R 1R1 1R1 1R1 LQ ON 52R b1t 55N 2L6 1 u&=3O(=SbMPM U:* 1R/ 1R0 1R1 1R1 1R1 1R1 /11 1R1 N 5 LB 1 l&)E<#"&(S/5PMQN %00$ 1R1 1R/ 1R1 1R1 1R1 /11 1R1 1R1 /1 1 5B 1 (<"59&S/5PNN U:* 1R/ 1R/ 1R1 1R1 1R1 MM PP 1R1 OPS Q RMO 1 u&=3O(=SbLMSS %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NS 1 MRR 1 EE@(S/5PMQP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 MR 1 LSQ 1 !"E<5<S///e1_ U:* 1R1 1R/ 1R1 1R1 1R1 _1 _1 1R1 OQP 5O ROR 1 u&=3O(=Sd5POMP %00$ 1R1 1R/ 1R1 1R1 1R1 1R1 /11 1R1 NR 1 LRP 1 B)9B;&T&"S/222^#<5%`*bc*!"#9&EE&E8*/0a*%T!E8*%#5(*5B&"5?&E`*1Rcd8*1Rbe8*1Rb/

Page 50: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.47 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

'*RS9#3)<GEgE95%%ER(01a ,)!/ LQ U)$&$

5PMQP "%)$&$ MO "%-&//cdM1 ]-*#;/* 55Q !#%%EgE/cdM1 ]-*#;/* 5LM %-&//cdM1 ]-*#;/* 5MP +0!$3//e1_ ]-*#;/* 5NO !#%%EgE5PNN ]-*#;/* 5O5 !#%%EgE

5PMQN /$%&!- MM5 +0!$3/cdM1 ]-*#;/* MOO5L 3)--'/cdM1 ]-*#;/* b_c1e 1%+$-

The next stage of the investigation is to identify why ]-*#;/* calls these I/O system

calls. To answer this, an analysis of the I/O is implemented with another D-script —

]-*#;)$&$2/, which collects the statistics of the I/O calls performed by ]-*#;/*,

while focusing on 3)--' and 1%+$-:

59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*k.v,j*DN9****E&&N**J*12:O99TP9*EgE95%%``%E&&N`&)<"gQ9*S*&l&9)5=&*JJ*wu&=3O(=x*\\*[5"?0*\\*E&&N*SR9*DS9*****]%E&&N65"?18*5"?/OE&&N7*J*9#3)<>C:/1*****E&&N*J*5"?/:559T5L9EgE95%%``%E&&N`&)<"g5M9S*&l&9)5=&*JJ*wu&=3O(=x*\\*[5"?0*\\*[E&&N*S5N9D5O9****E&&N*J*5"?/:5P9T

The two instances of the EgE95%%``%E&&N`&)<"g probe have predicates (lines 7

and 13) that ensure they are called only if the triggering call to 3)--' sets the file

pointer to an absolute value, by testing that &%VL (whence) is zero, or +..tG+.^.

The probes identify the file according to the file descriptor (5"?1).

To determine the I/O pattern, the D-script saves the last absolute position of the

file pointer passed to 3)--'47 in the variable )--' in line 10. The difference

between the current and previous position of the file pointer is used as the second

index of the aggregation in line 9.

5Q9EgE95%%``T";<&`&)<"g5R9S*&l&9)5=&*JJ*wu&=3O(=x*S5S9D01***]T";<&65"?18*5"?07*J*9#3)<>C:L59T

The EgE95%%``T";<&`&)<"g probe counts the number of times the 1%+$-

system call is called with a specific number of bytes written.

Page 51: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.48 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

LL9<;9NO_E*DLM9**<"3)9>]%E&&N8*_C:LN9**<"3)9>]T";<&8*_C:LO9**!";)<I>w%E&&N*I(&E9*****#IIE&<******9#3)<V)xC:LP9**!";)<5>w******y_%(*y/1%(*y/1]%3V)x8*]%E&&NC:LQ9**!";)<I>wV)T";<&*I(&E9*******E;F&******9#3)<V)xC:LR9**!";)<5>w******y_%(*y/1%(*y/1]%3V)x8*]T";<&C:LS9**!";)<I>wV)V)xC:b1***9%&5">]%E&&NC:M59**9%&5">]T";<&C:ML9T

Lines 23 and 24 truncate the saved counts to the top five values.

Lines 25 and 26 print out the count of calls to 3)--' and lines 27 and 28 print out

the count of calls to 1%+$-.

Following are results of a single invocation of ]-*#;)$&$2/:

3)--'9H/-)!9990HH)-$9999!0#,$9999999999O9999999LP9999999LR9999999999O9999999LS9999999LR**********_********1*******d09999999999O9999999L59999999NL**********_********/***/bd_d11%+$-9H/-)!99999)+W-9999!0#,$9999999999O9999999L59999999NL9999999995O99999999N9999999ON9999999995P99999999N9999999PM9999999995N99999999N999999NN59999999999O9999999959995MNOON

The results of this invocation show that all calls to 3)--' with an absolute position

and most of the calls to 1%+$- target a single file. The vast majority of the calls to

3)--' move the file pointer of this file to an absolute position that is one byte away

from the previous position, and the vast majority of the calls to the 1%+$- system

call write a single byte. In other words, ]-*#;/* writes a data stream as single

bytes, without any buffering — an access pattern that is inherently slow.

Next, the !$%&E command is used to identify the file accessed by ]-*#;/* through

file descriptor O as the virtual Windows system disk:

'*!$%&E*/cdM1222_`*+G,ho.v*=#(&`1c11*(&B`/M08c__db*;)#`0c*3;(`c1*?;(`1*E;F&`//c0ba0be/0******rGoPsoUrG2fov.h,2.******SlB=S@&"=;5S(;ENG9SB(;ENRB=(N222

Page 52: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.49 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

To drill down further into this problem, it is necessary to investigate where the

calls to 3)--' originate by viewing the call stack. This is performed with the

]-*#;!&33)$&!'2/ script, which prints the three most common call stacks for the

3)--' and 1%+$- system calls every five seconds:

59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*EgE95%%``%E&&N`&)<"g8*EgE95%%``T";<&`&)<"gN9*S*&l&9)5=&*JJ*wu&=3O(=x*SO9*DP9*****]96!"#4&I3)98*3E<59N>C7*J*9#3)<>C:Q9TR9<;9NO_E*DS9*****<"3)9>]98*bC:/1*****!";)<5>]9C:559****9%&5">]9C:5L9T

Line 6 saves the call stack of 3)--' and 1%+$-.

Line 10 prints the three most common stacks.

Following is the most common stack trace of the calls to 1%+$-, extracted from a

single invocation of ]-*#;!&33)$&!'2/:

1%+$-****%;49RE#R/~GGT";<&L1l5****u&=3O(=~o^h;%&s";<&L1lbe****u&=3O(=~o^h;%&s";<&f<L1ldM****u&=3O(=~B=(Ns";<&P&E9";!<#"L1l/(_****u&=3O(=~B=(Nh%3E@,=5?&L1l0b****u&=3O(=~B=(Nh%3E@L1la****u&=3O(=~|Ph%3E@L1la/****u&=3O(=~B(;ENG�3E@L1l/9****u&=3O(=~4("BG�3E@L1l0&****u&=3O(=~;(&GT";<&G(=5G94L1l/Me****u&=3O(=~4("BG5;#G4@G94L1l/c****u&=3O(=~u&=3G4@G!#%%L1l0(****u&=3O(=~=5;)G%##!GT5;<L1l009****u&=3O(=~=5;)G%##!L1le5****u&=3O(=~=5;)L1l/MMc****u&=3O(=~GE<5"<L1lc99999LRQOR

By examining the 1%+$- stack trace, it seems that the virtual machine is flushing the

disk cache very often, apparently for every byte. This could be the result of a disabled

disk cache. Further investigation uncovers the fact that if a Microsoft Windows server

acts as an Active Directory domain controller, the Active Directory directory service

performs unbuffered writes and tries to disable the disk write cache on volumes

hosting the Active Directory database and log files. Active Directory also works in this

manner when it runs in a virtual hosting environment. Clearly this issue can only be

solved by modifying the behavior of the Microsoft Windows 2008 virtual server.

Page 53: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.50 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Once the problem is resolved on the Microsoft Windows 2008 virtual server, the

]-*#;)$&$;]#&,$2/ D-script is implemented to print the patterns of calls to

1%+$-:

59*'[S3E"SE4;)S(<"59&*OEL99I"%&V*&9G90"$+0,9]#+-$M9*EgE95%%``T";<&`&)<"gN9*S*&l&9)5=&*JJ*wu&=3O(=x*\\*5"?1*JJ*X/*SO9*DP9***]T";<&u6wT";<&x7*J*u35)<;F&>5"?0C:Q99TR9*<;9NO_E*DS9***!";)<5>]T";<&uC:/1***9%&5">]T";<&uC:559T

Invoking it shows a much improved pattern of calls, in term of the size of each write:

'*RSu&=3OE<5<Ou35)<R(*_1%+$-99:&3#-99;;;;;;;;;;;;;9G+)$%+(#$+0,9;;;;;;;;;;;;;9!0#,$*****0_c*U*****************************************1*****_/0*U]]]]]]***********************************/d****/10d*U*****************************************1****01dM*U]]]]*************************************/1****d1ac*U]]]]]]]]]]]]]]]]*************************ba****M/a0*U]]]]]]]]]]]]]]]**************************bc***/cbMd*U*****************************************1

Page 54: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.51 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Tracking MPI Requests with DTrace — *"+)$&$2/

59k.v,jL9DM9***"&9BEG4g<&EJ1:*"&9BEG59<J1:*"&9BEG!#E<&(GE;F&J1:*N9***"&9BEG3)&l!GE;F&J1:*"&9BEG!#E<&(G=5<9@&EJ1:*O9***"&9BEG3)&l!G=5<9@&EJ1:*P9***E&)(EG59<J1:*E&)(EG4g<&EJ1:*#3<!3<G9)<*J*1:Q9***!";)<I>w;)!3<>^#<5%C*�OE;F&E******�Op5<9@&ExC:R9***!";)<I>w************#3<!3<V)xC:S9***!";)<I>w4g<&E*59<;B&*!#E<&(*3)&l!*!#E<&(*3)&l!*4g<&EV)xC:/1***!";)<I>w59<;B&V)xy_(*yc(*yc(*y_(*yc(*y_(******y_(*yc(V)x8*559**********"&9BEG4g<&E8*"&9BEG59<8*"&9BEG!#E<&(GE;F&8*5L99999999999%-!:).#,-U".)+W-89%-!:)."0)$-/.*&$!D-)85M9**********"&9BEG3)&l!G=5<9@&E8*E&)(EG4g<&E8*E&)(EG59<C:5N9T5O9SA*m";)<*+<5<;E<;9E*&B&"g*/*E&9*AS5P9!"#$%&```<;9NO/E&95Q9D5R9**!";)<I>wy_(*yc(*yc(*y_(*yc(*y_(******y_(*yc(*V)x8*5S9**********"&9BEG4g<&E8*"&9BEG59<8*"&9BEG!#E<&(GE;F&8*01*9999999999%-!:).#,-U".)+W-89%-!:)."0)$-/.*&$!D-)89L59**********"&9BEG3)&l!G=5<9@&E8*E&)(EG4g<&E8*E&)(EG59<C:LL9**LL#3<!3<G9)<:LM9TLN9SA*Q#%%&9<*f9<;B&*+&)(*o&u3&E<E*ASLO9=!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&LP9S5"?E6b7OZ=9EG#!JJxE&)(xSLQ9DLR9**LLE&)(EG59<:LS9Tb1*SA*Q#%%&9<*o&=#B5%*#I*+&)(*o&u3&E<E*ASM59=!;!&"3E&X<5"?&<```9#==O"&uO)#<;IgML9S5"?E6b7OZ=9EG#!JJxE&)(xSMM9DMN9**OOE&)(EG59<:MO9TMP9SA*Q#%%&9<*4g<&E*+&)<*ASMQ9=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O&)(MR9S5"?E6b7OZ=9EG#!JJxE&)(xSMS9Dd1***E&)(EG4g<&E*LJ*5"?E6b7OZ=9EG9#3)<:N59TNL9SA*Q#%%&9<*f9<;B&*o&9B*o&u3&E<*ASNM9=!;!&"3E&X<5"?&<```9#==O"&uO59<;B5<&NN9S5"?E6b7OZ=9EG#!JJx"&9BxSNO9DNP9**LL"&9BEG59<:NQ9TNR9SA*Q#%%&9<*o&=#B5%*#I*o&9B*o&u3&E<*ASNS9=!;!&"3E&X<5"?&<```9#==O"&uO)#<;Ig_1*S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG59<Z1SO59DOL9**OO"&9BEG59<:OM9TON9SA*Q#%%&9<*o&u3&E<*m%59&(*#)*m#E<&(*�*ASOO9=!;!&"3E&X<5"?&<```9#==O"&uO;)E&"<O;)O!#E<&(OuOP9S5"?E6b7OZ=9EG#!JJx"&9BxSOQ9DOR9***LL"&9BEG!#E<&(GE;F&:OS9T

Page 55: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc.52 Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

c1*SA*Q#%%&9<*pE?*=5<9@&(*m#E<&(*�*ASP59=!;!&"3E&X<5"?&<```9#==O=E?O=5<9@O!#E<&(O"&uPL9S5"?E6b7OZ=9EG#!JJx"&9BxSPM9DPN9***LL"&9BEG!#E<&(G=5<9@&E:PO9TPP9=!;!&"3E&X<5"?&<```9#==O=E?O=5<9@O!#E<&(O"&uPQ9S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG!#E<&(GE;F&Z1SPR9DPS9***OO"&9BEG!#E<&(GE;F&:e1*TQ59SA*Q#%%&9<*=&EE5?&E*;)*3)&l!*�*ASQL9=!;!&"3E&X<5"?&<```9#==O=E?O;)E&"<O;)O3)&lOuQM9S5"?E6b7OZ=9EG#!JJx"&9BxSQN9DQO9***LL"&9BEG3)&l!GE;F&:QP9TQQ9SA*Q#%%&9<*=&EE5?&E*"&=#B&(*I"#=*3)&l!*�*ASQR9=!;!&"3E&X<5"?&<```9#==O=E?O"&=#B&OI"#=O3)&lOuQS9S5"?E6b7OZ=9EG#!JJx"&9Bx\\"&9BEG3)&l!GE;F&Z1SM1*DR59***OO"&9BEG3)&l!GE;F&:RL9TRM9SA*Q#%%&9<*=&EE5?&E*"&=#B&(*I"#=*3)&l!*�*ASRN9=!;!&"3E&X<5"?&<```9#==O"&uO=5<9@O3)&lRO9S5"?E6b7OZ=9EG#!JJx"&9BxSRP9DRQ9***LL"&9BEG3)&l!G=5<9@&E:RR9TRS9SA*Q#%%&9<*9#3)<*#I*4g<&E*"&9;&B&(*ASa1*=!;!&"3E&X<5"?&<```9#==O"&uOlI&"O9#)<;)3&S59Sf"?E6b7OZ=9EG#!JJx"&9BxSSL9DSM9***"&9BEG4g<&E*LJ*5"?E6b7OZ=9EG9#3)<:SN9T

Page 56: Tuning Parallel Code on the Solaris™ Operating System

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com© 2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, OpenSolaris, Solaris, Sun HPC ClusterTools, and Sun Fire are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. AMD Opteron is a trademarks or registered trademarks of Advanced Micro Devices. Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Information subject to change without notice. Printed in USA 09/09

Sun Microsystems, Inc.Tuning Parallel Code on the Solaris OS — Lessons Learned from HPC

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com