Early Experience with Out-of-Core Applications on the Cray XMT

TRANSCRIPT

Early Experience with Out-of-Core Applications on the Cray XMT

Daniel Chavarría-Miranda§, Andrés Márquez§, Jarek Nieplocha§, Kristyn Maschhoff† and Chad Scherrer§
§ Pacific Northwest National Laboratory (PNNL)   † Cray, Inc.
Slide 2: Introduction

- The increasing gap between memory and processor speed is causing many applications to become memory-bound
- Mainstream processors rely on a cache hierarchy, but caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative: switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy
Slide 3: Cray XMT

- 3rd-generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors: SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Slide 4: Cray XMT (cont.)

- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache; instead a 128 KB, 4-way associative data buffer on the memory side
- Extra bits in each 64-bit memory word: full/empty bits for synchronization
- Memory is hashed at a 64-byte granularity, i.e. logical addresses that are contiguous across a 64-byte boundary may map to non-contiguous physical locations
- Global shared memory
Slide 5: Cray XMT (cont.)

- Lightweight User Communication library (LUC) coordinates data transfers and hybrid execution between ThreadStorm and Opteron processors: Portals-based on Opterons, Fast I/O API-based on ThreadStorms, RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system; ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system
Slide 6: Outline

- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static & dynamic versions
- Experimental setup and results
- Conclusions
Slide 7: PDTree (or Anomaly Detection for Categorical Data)

- Originates from cyber security analysis: detect anomalies in packet headers, locate and characterize network attacks
- The analysis method is more widely applicable: it uses ideas from conditional probability and multivariate categorical data analysis
- For a combination of variables and instances of values for these variables, find out how many times the pattern has occurred
- The resulting count table or contingency table specifies a joint distribution; efficient implementations of algorithms using such tables are very important in statistical analysis
- The ADTree data structure (Moore & Lee 1998) can be used to store data counts: it stores all combinations of values for the variables
Slide 8: PDTree (cont.)

- We use an enhancement of the ADTree data structure called a PDTree, in which we don't need to store all possible combinations of values: only a-priori specified combinations are stored

[Figure: example PDTree. A root node with total count N branches on variables B and C into value nodes (b0, N), (b1, N), (c0, N), (c1, N); these branch further on variables A, C and D into value nodes such as (a0, N), (a1, N), (c0, N), (c1, N), (d0, N), (d1, N).]
Slide 9: Multithreaded Implementation

- PDTree implemented as a multiple-type, recursive tree structure
- The root node is an array of ValueNodes (counts for the different value instances of the root variables); interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode; the XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode; if the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked
- readfe() and writeef() MTA operations create critical sections, taking advantage of the full/empty bits on each memory word
- As data analysis progresses, the probability of conflicts between threads decreases
Slide 10: Multithreaded Implementation (cont.)

[Figure: two threads race to extend a value list. T1 and T2 both try to grab the end pointer after the nodes vi = j (count) and vi = k (count). T1 succeeds and inserts a new node vi = m (count), so T2 now holds a lock on a pointer that is no longer the end of the list.]
Slide 11: Static and Dynamic Versions

[Figure: static vs. dynamic PDTree layout. A RootNode (e.g. count = 5, columns = ..., numCols = 3) heads an array of RootNodes; ColumnNodes (column = a, column = b, column = c, ...) each point to their values. In the static version the values form a linked list of ValueNodes (e.g. value = 10, count = 1, numCols = 3, columns = ..., nextVN = ...; value = 19, count = 4, ...); in the dynamic version a hash table of ValueNodes indexed by H(v) replaces the linked list.]
Slide 12: Outline

- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static & dynamic versions
- Experimental setup and results
- Conclusions
Slide 13: Experimental setup and Results

- Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS: not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis: needs the dynamic PDTree version with a 128K-element hash table
Slide 14: Experimental setup and Results (cont.)

[Figure: XMT hybrid I/O paths. A compute node with four ThreadStorm CPUs and DRAM and a service/login node with an Opteron CPU and DRAM are joined by the SeaStar interconnect to the Lustre filesystem. Direct access goes from the ThreadStorm processors to the filesystem; indirect access goes through a LUC RPC to the Opteron.]

Note: results were obtained on a preproduction XMT with only half of the DIMM slots populated.
Slide 15: Experimental setup and Results (cont.)

In-core execution on 1M records, static PDTree version (insertion times in seconds):

# of procs.   XMT Insertion   XMT Speedup   MTA Insertion   MTA Speedup
          1          239.26          1.00          200.17          1.00
          2          116.36          2.06           98.25          2.04
          4           56.48          4.24           48.07          4.16
          8           27.53          8.69           23.29          8.59
         16           13.97         17.13           11.61         17.24
         32            7.13         33.56            5.81         34.45
         64            3.68         65.02             N/A           N/A
         96            2.60         92.02             N/A           N/A
Slide 16: Experimental setup and Results (cont.)

[Figure: 100 MB dataset. Stacked bars of execution time in seconds (y-axis, 0 to 300 s) versus number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing and LUC Transfer components.]
Slide 17: Experimental setup and Results (cont.)

[Figure: 250 MB dataset. Stacked bars of execution time in seconds (y-axis, 0 to 700 s) versus number of processors (8, 16, 32, 64, 96), broken down into Insertion, Preprocessing and LUC Transfer components.]
Slide 18: Conclusions

- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities: indirect access to Lustre through the LUC interface
- The I/O operation implementation needs further work to take full advantage of Lustre; running multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems