Early Experience with Out-of-Core Applications on the Cray XMT

TRANSCRIPT

Early Experience with Out-of-Core Applications on the Cray XMT

Daniel Chavarría-Miranda§, Andrés Márquez§, Jarek Nieplocha§, Kristyn Maschhoff† and Chad Scherrer§
§ Pacific Northwest National Laboratory (PNNL)   † Cray, Inc.
Slide 2: Introduction

- The increasing gap between memory and processor speed is causing many applications to become memory-bound
- Mainstream processors rely on a cache hierarchy, but caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative: switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy
Slide 3: Cray XMT

- 3rd-generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors: SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Slide 4: Cray XMT (cont.)

- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache; instead a 128 KB, 4-way associative data buffer on the memory side
- Extra bits in each 64-bit memory word: full/empty bits for synchronization
- Memory is hashed at a 64-byte granularity, i.e. logical addresses that are contiguous across a 64-byte boundary may map to non-contiguous physical locations
- Global shared memory
Slide 5: Cray XMT (cont.)

- Lightweight User Communication library (LUC) coordinates data transfers and hybrid execution between ThreadStorm and Opteron processors: Portals-based on Opterons, Fast I/O API-based on ThreadStorms, RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system; ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system
Slide 6: Outline

- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static & dynamic versions
- Experimental setup and results
- Conclusions
Slide 7: PDTree (or Anomaly Detection for Categorical Data)

- Originates from cyber security analysis: detect anomalies in packet headers, locate and characterize network attacks
- The analysis method is more widely applicable: it uses ideas from conditional probability and multivariate categorical data analysis
- For a combination of variables and instances of values for these variables, find out how many times the pattern has occurred
- The resulting count table or contingency table specifies a joint distribution; efficient implementations of algorithms using such tables are very important in statistical analysis
- The ADTree data structure (Moore & Lee 1998) can be used to store data counts: it stores all combinations of values for the variables
Slide 8: PDTree (cont.)

- We use an enhancement of the ADTree data structure called a PDTree, in which we don't need to store all possible combinations of values: only a-priori specified combinations are stored

[Figure: example PDTree. A root node with total count N branches on variables B and C into value nodes (b0, N), (b1, N), (c0, N), (c1, N); these branch further on variables A, C and D into value nodes such as (a0, N), (a1, N), (c0, N), (c1, N), (d0, N), (d1, N).]
Slide 9: Multithreaded Implementation

- PDTree implemented as a multiple-type, recursive tree structure
- The root node is an array of ValueNodes (counts for the different value instances of the root variables); interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode; the XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode; if the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked
- readfe() and writeef() MTA operations create critical sections, taking advantage of the full/empty bits on each memory word
- As data analysis progresses, the probability of conflicts between threads decreases
Slide 10: Multithreaded Implementation (cont.)

[Figure: two threads race to extend a value list. T1 and T2 both try to grab the end pointer after the nodes vi = j (count) and vi = k (count). T1 succeeds and inserts a new node vi = m (count), so T2 now holds a lock on a pointer that is no longer the end of the list.]
Slide 11: Static and Dynamic Versions

[Figure: static vs. dynamic PDTree layout. A RootNode (e.g. count = 5, columns = ..., numCols = 3) heads an array of RootNodes; ColumnNodes (column = a, column = b, column = c, ...) each point to their values. In the static version the values form a linked list of ValueNodes (e.g. value = 10, count = 1, numCols = 3, columns = ..., nextVN = ...; value = 19, count = 4, ...); in the dynamic version a hash table of ValueNodes indexed by H(v) replaces the linked list.]
Slide 12: Outline

- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static & dynamic versions
- Experimental setup and results
- Conclusions
Slide 13: Experimental setup and Results

- Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS: not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis: needs the dynamic PDTree version with a 128K-element hash table
Slide 14: Experimental setup and Results (cont.)

[Figure: XMT hybrid I/O paths. A compute node with four ThreadStorm CPUs and DRAM and a service/login node with an Opteron CPU and DRAM are joined by the SeaStar interconnect to the Lustre filesystem. Direct access goes from the ThreadStorm processors to the filesystem; indirect access goes through a LUC RPC to the Opteron.]

Note: results were obtained on a preproduction XMT with only half of the DIMM slots populated.
Slide 15: Experimental setup and Results (cont.)

In-core execution on 1M records, static PDTree version (insertion times in seconds):

# of procs.   XMT Insertion   XMT Speedup   MTA Insertion   MTA Speedup
          1          239.26          1.00          200.17          1.00
          2          116.36          2.06           98.25          2.04
          4           56.48          4.24           48.07          4.16
          8           27.53          8.69           23.29          8.59
         16           13.97         17.13           11.61         17.24
         32            7.13         33.56            5.81         34.45
         64            3.68         65.02             N/A           N/A
         96            2.60         92.02             N/A           N/A
Slide 16: Experimental setup and Results (cont.)

[Figure: 100 MB dataset. Stacked bars of execution time in seconds (y-axis, 0 to 300 s) versus number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing and LUC Transfer components.]
Slide 17: Experimental setup and Results (cont.)

[Figure: 250 MB dataset. Stacked bars of execution time in seconds (y-axis, 0 to 700 s) versus number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing and LUC Transfer components.]
Slide 18: Conclusions

- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities: indirect access to Lustre through the LUC interface
- The I/O operation implementation needs further work to take full advantage of Lustre; running multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems