brasil ross 2011

DESCRIPTION
Basic Resource Aggregation Interface Layer. ACM ROSS 2011 Workshop presentation. TRANSCRIPT

© 2011 IBM Corporation. ROSS Workshop, 05/31/2011. IBM Research.

BRASIL: Basic Resource Aggregation Interface Layer
Eric Van Hensbergen ([email protected]), IBM Research Austin
Pravin Shinde (now at ETH Zurich)
Noah Evans (now at Bell Labs)
Motivation
[Figure 7: The data dependency graph for a portion of the Hartree-Fock procedure using a traditional formulation.]
64,000 Node Torus
Data Flow Oriented HPC Problems
PUSH Pipelines
For more detail, refer to the PODC09 short paper on the PUSH dataflow shell.

UNIX Model: a | b | c
PUSH Model: a |< b >| c
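The PUSH operators read as fan-out (|<) and fan-in (>|) around an ordinary pipeline stage: records from a are distributed across parallel instances of b, whose outputs are recombined for c. As a rough illustration of those semantics only (not the PUSH implementation, which multiplexes OS pipes in the shell), a round-robin fan-out and a record-preserving fan-in might look like:

```python
# Hypothetical sketch of PUSH fan-out/fan-in semantics: records are
# distributed round-robin to n parallel stages (a |< b), then each
# stage's output is recombined on record boundaries (b >| c).

def fan_out(records, n):
    """Distribute records round-robin across n parallel stages."""
    lanes = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        lanes[i % n].append(rec)
    return lanes

def fan_in(lanes):
    """Recombine per-stage outputs, preserving record boundaries."""
    merged = []
    for lane in lanes:
        merged.extend(lane)
    return merged

# a |< b >| c with 3 parallel instances of stage b (here, upper-casing)
records = ["alpha", "beta", "gamma", "delta"]
lanes = fan_out(records, 3)
results = fan_in([[r.upper() for r in lane] for lane in lanes])
```

Note that fan-in here preserves record boundaries but not global ordering across lanes, which is exactly the deterministic-delivery question raised in the discussion slide.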
PUSH Implementation
[Figure: PUSH implementation: a shell command's output enters a pipe multiplexor, which fans out over pipes to parallel shell commands; their outputs fan back in through a demultiplexor pipe to a final shell command]
Back to Motivation: 64,000 Nodes
[Figure: aggregation tree over the torus: terminal t, login node L, I/O nodes I1 and I2, compute nodes c1 through c4]
Related Work/Background Work
• MPI
• "A Compositional Environment for MPI Programs"
• Hadoop & Other Map/Reduce Solutions
• cpu (from Plan 9) - http://plan9.bell-labs.com/magic/man2html/1/cpu
• xcpu (from LANL) - http://www.xcpu.org/
• xcpu2 (from LANL)
File System Interfaces
[Figure: csrv file system layout: the exported directory contains clone, env, ns, and status plus numbered session directories (/0, /1); each session and sub-session contains ctl, env, ns, args, wait, status, stdin, stdout, and stdio. Panels: local Resources, Session, Sub-session]

Control File Syntax
• reserve [n] [os] [arch]
• dir [wdir]
• exec [command] [args]
• kill
• killonclose
• nice [n]
• splice [path]

Environment Syntax
• [key] = [value]

Namespace Syntax
• mount [-abcC] [server] [mnt]
• bind [-abcC] [new] [old]
• import [-abc] [host] [path]
• cd [dir]
• unmount [new] [old]
• clear
• . [path]
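Driving a session is just writing these textual commands into its ctl file. A speculative sketch of a small helper that builds and writes the control lines (the mount-point path is an assumption modeled on the shell examples elsewhere in this deck, not a fixed brasil path):

```python
# Hypothetical sketch of driving a csrv session through its ctl file,
# following the control-file syntax above. The path passed to
# write_ctl (e.g. ./mpoint/csrv/local/0/ctl) is an assumption.

def ctl_commands(nodes, command, args=()):
    """Build the control-file lines to reserve nodes and exec a command."""
    return ["reserve %d" % nodes,
            "exec %s" % " ".join([command] + list(args))]

def write_ctl(ctl_path, lines):
    """Write each control command as its own line to the ctl file."""
    with open(ctl_path, "w") as ctl:
        for line in lines:
            ctl.write(line + "\n")

# e.g. write_ctl("./mpoint/csrv/local/0/ctl", ctl_commands(2, "date"))
```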
Distribution and Aggregation
[Figure: distribution and aggregation of per-node session directories across two hosts into a single csrv namespace]
Command Line Interface(s)
• Direct File Interaction
• Python Command Line
• PUSH
ORS=blm find . -type f |< xargs chasen | sort | uniq -c >| sort -rn
echo "res 2" > ./mpoint/csrv/local/0/ctl
echo "exec date" > ./mpoint/csrv/local/0/0/ctl
echo "exec wc" > ./mpoint/csrv/local/0/1/ctl
echo "xsplice 0 1" > ./mpoint/csrv/0/local/0/ctl
cat ./mpoint/csrv/0/local/0/stdio
runJob.py -n 4 /bin/date
Evaluation Setup
[Chart: "Simple job deployment without input (date)", 512 nodes: time in seconds vs. node count, broken down into Housekeeping, Termination, Aggregation, Execution, and Reservation phases]
Simple Micro-benchmark: Job start times
[Chart: "Job deployment with input (wc)", 512 nodes: time in seconds vs. node count, broken down into Housekeeping, Termination, Aggregation, Input, Execution, and Reservation phases]
Job Completion Times with I/O
Discussion
• Non-buffered communication channels are great for synchronization, but bad for performance, as aggregation points become serialization points.
• Pushing multiplexing into PUSH was a mistake:
• it adds additional copies of data and additional context switches for I/O to do record separation
• by pushing them into the infrastructure we can go from 6 copies and context switches down to 2 for each pipeline stage
• "beltway buffers" and other techniques may reduce this further or eliminate copies altogether
• Transitive mounts of namespace are an elegant way to bypass network segmentation, but they also incur lots of overhead.
• Allowing splice inside a reservation has dubious usefulness. It seems like it might be more useful to allow individual elements to be spawned outside a reservation.
• Fan-in and fan-out are too limiting a use case. What about deterministic delivery? What about many-to-many pipeline components? What about collectives and barriers?
• Fault-tolerance and fault-debugging properties were poor: there was no channel for out-of-band communication of error or logging information. When transitive mounts went down, the system wedged, with no integrated method of figuring out where the system went down.
• File systems and communication are outside the scope of this work, but are still a sore point for performance with the Plan 9 kernel on Blue Gene. Both issues are actively being worked on, with a release planned later this summer.
Future Work: Cutting Out The Middle Man
[Figure: today an App on the initiating terminal or node reaches the App on the compute node through Brasil instances running atop the kernel on every hop (initiating node, parent or gateway node(s), compute node); future work removes Brasil from the middle of the data path]
Future Work: Splice Optimizations via Direct Communication
• Splice operations through name space are elegant, but transitive mounting of name space means that data has to flow through the hierarchy in order to be spliced from one element to another.
• A direct communication path would be much more desirable, particularly on BG/P, where the hierarchy is constructed on the tree network but the torus is the preferred node-to-node data transport.
• Solution: add a layer of indirection to splice communication
• add a file to the session directories which contains a list of "locations" for this node, essentially giving a recipe for connecting directly to the channel, in order of performance; if the source node cannot use any of these recipes it will default to the name space path
• example:

% cat ./mpoint/csrv/parent/c4/local/0/location
torus!1.3.4!564
tree!0.23!564
tcp!10.1.3.4!564

• in the simplest embodiment we mount the namespace of the node in order to access its interfaces, but we also want to play with direct connections for data and/or potentially hiding everything inside the csrv interface so that the mpipe splice code (and end users) can be more or less ignorant of how the resources are accessed
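The location file lists connection recipes as Plan 9 style dial strings (network!address!port), best transport first. A speculative sketch of the selection logic a splice source might apply (this is an illustration of the fallback rule described above, not brasil code):

```python
# Hypothetical sketch: parse a splice "location" file of dial strings
# (network!address!port) and pick the first recipe whose transport the
# source node can use; None means fall back to the namespace path.

def parse_locations(text):
    """Split location entries into (network, address, port) tuples."""
    recipes = []
    for line in text.splitlines():
        if line:
            net, addr, port = line.split("!")
            recipes.append((net, addr, port))
    return recipes

def pick_recipe(recipes, usable_nets):
    """Return the first recipe on a usable transport, in file order."""
    for recipe in recipes:
        if recipe[0] in usable_nets:
            return recipe
    return None  # default to the transitive name space path

locations = "torus!1.3.4!564\ntree!0.23!564\ntcp!10.1.3.4!564\n"
recipes = parse_locations(locations)
best = pick_recipe(recipes, {"tree", "tcp"})  # this node lacks torus access
```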
Future Work:
• Turn fan-out/fan-in/many-to-many pipeline operations into system primitives with syntax similar to existing pipe primitives but with the semantics of a multipipe:
• respect record boundaries
• allow broadcast
• allow enumerated specification of destination
• support splice operations
• Build out the rest of the brasil infrastructure based on this new primitive:
• All stdio channels are multipipes
• allow record buffers for decoupled performance
• All ctl channels implemented on top of multipipes
• etc.
• The same model could potentially be used for the implementation of collective operations and barriers within a distributed file system namespace.

Packet header fields: TYPE, SIZE, DESTINATION, PARAMETERS

pwrite(pipefd, buf, sz, ~(0));
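The pwrite at offset ~0 carries a small textual header describing the next record; the pipewrite C example near the end of this deck formats it as "%c\n%lud\n%lud\n\n" (type, size, destination, blank terminator). A sketch of building and decoding that header, mirroring that format:

```python
# Sketch of the multipipe packet header carried by the pwrite at
# offset ~(0). Field layout mirrors the pipewrite example in this
# deck: TYPE, SIZE, and DESTINATION on separate lines, with a blank
# line terminating the header.

def pack_header(size, which, pkttype="p"):
    """Format a packet header: type char, payload size, destination."""
    return "%s\n%d\n%d\n\n" % (pkttype, size, which)

def unpack_header(hdr):
    """Decode a header back into (type, size, destination)."""
    pkttype, size, which = hdr.rstrip("\n").split("\n")
    return pkttype, int(size), int(which)

hdr = pack_header(5, 2)  # a 5-byte record destined for reader 2
```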
This project is supported in part by the U.S. Department of Energy under Award Number DE-FG02-08ER25851.

http://www.research.ibm.com/austin
http://goo.gl/5eFB

Brasil Code Available: http://www.bitbucket.org/ericvh/hare/usr/brasil
New Version In Progress: http://www.bitbucket.org/ericvh/hare/sys/src/cmd/uem
Core Concept: BRASIL (Basic Resource Aggregate System Inferno Layer)
• Stripped-down Inferno: no GUI or anything else we can live without; minimal footprint
• Runs as a daemon (no console); all interaction via 9P mounts of its namespace
• Different modes:
• default (exports /srv/brasil or on tcp!127.0.0.1!5670)
• gateway (exports over standard I/O, to be used by ssh initialization)
• terminal (initiates ssh connection and starts a gateway)
• Runs EVERYWHERE:
• User's workstation
• Surveyor login nodes
• I/O nodes
• Compute nodes
Preferred Embodiment: BRASIL Desktop Extension Model
[Figure: workstation linked to a login node over an ssh-duct, with I/O and CPU nodes behind the login node]

• Setup
• User starts brasild on workstation
• brasild ssh's to the login node and starts another brasil, hooking the two together with 27b-6 and mounting resources in /csrv
• User mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p or ACME-sac)
• Boot
• User runs the anl/run script on the workstation
• the script interacts with taskfs on the login node to start cobalt qsub
• when the I/O nodes boot they connect their csrv to the login csrv
• when the CPU nodes boot they connect to the csrv on their I/O node
• Task Execution
• User runs the anl/exec script on the workstation to run an app
• the script reserves x nodes for the app using taskfs
• taskfs on the workstation aggregates execution by using taskfs running on the I/O nodes
Core Concept: Central Services
• Establish a hierarchical namespace of cluster services
• Automount remote servers based on reference (i.e. cd /csrv/criswell)
• Export local services for use elsewhere within the network
/csrv/local/L
/local/l1
/local/c1
/local/c2
/local/l2
/local/c3
/local/c4
/local
/csrv/local/l2
/local/c4
/local/L
/local/t
/local/l1
/local/c1
/local/c2
/local
[Figure: tree topology: terminal t, login node L, I/O nodes I1 and I2, compute nodes c1 c2 c3 c4]
Our Approach: Workload Optimized Distribution
Desktop Extension
PUSH Pipeline Model
Aggregation via Dynamic Namespace and Distributed Service Model (local service, proxy service, aggregate service, remote services)
Scaling
Reliability

[Figure: brasil daemons arranged hierarchically, distributing tasks across the nodes of the system]
“Old” Performance Graph (from USENIX)
[Embedded XCPU3 slide: "Evaluations: Deployment and aggregation time"]
© 2010 IBM Corporation

Problem: Limitations of Traditional Pipes
[Figure: two writers emitting records AAA and BBB into a shared pipe; a reader may observe arbitrary interleavings such as ABABAB, or torn records like ABA and BAB]
Long Packet Pipes
[Figure: with long packet pipes, the writers' records AAA and BBB travel as whole packets, so the reader receives complete records rather than interleaved bytes]
Enumerated Pipes
[Figure: records tagged with a destination (1:A, 2:B) are routed so that reader 1 receives the A records and reader 2 the B records]
Collective Pipes
[Figure: collective operations over multipipes: Broadcast replicates A to all readers; Reduce(+) combines contributions B and C into (B+C); Allreduce(+) delivers the combined (A+B+C) to every participant]
Splicing Pipes
[Figure: spliceto(b) connects pipe a's output into pipe b; splicefrom(b) draws pipe b's output into pipe a]
Example Simple Invocation
• mpipefs
• mount /srv/mpipe /n/testpipe
• ls -l /n/testpipe

--rw-rw-rw- M 24 ericvh ericvh 0 Oct 10 18:10 /n/testpipe/data

• echo hello > /n/testpipe/data &
• cat /n/testpipe/data

hello
Passing Arguments via aname
• mount /srv/mpipe /n/test othername
• ls -l /n/test

--rw-rw-rw- M 26 ericvh ericvh 0 Oct 10 18:12 /n/test/otherpipe

• mount /srv/mpipe /n/test2 -b bcastpipe
• mount /srv/mpipe /n/test3 -e 5 enumpipe
• ....you get the idea; read the man page for more details
Example for writing control blocks

int
pipewrite(int fd, char *data, ulong size, ulong which)
{
	int n;
	char hdr[255];
	ulong tag = ~0;
	char pkttype = 'p';

	/* header byte is at offset ~0 */
	n = snprint(hdr, 31, "%c\n%lud\n%lud\n\n", pkttype, size, which);
	n = pwrite(fd, hdr, n+1, tag);
	if(n <= 0)
		return n;

	return write(fd, data, size);
}
Larger Example (execfs)

/proc/clone/###
  stdin stdout stderr args ctl fd fpregs kregs mem note noteid notepg ns proc profile regs segment status text wait

mount -a /srv/mpipe /proc/### stdin
mount -a /srv/mpipe /proc/### stdout
mount -a /srv/mpipe /proc/### stderr
Really Large Example (gangfs)

/proc/gclone /proc/status /proc/g###
  stdin stdout stderr ctl ns status wait

mount -a /srv/mpipe /proc/### -b stdin
mount -a /srv/mpipe /proc/### stdout
mount -a /srv/mpipe /proc/### stderr

...and then, post-exec from the gangfs clone, execfs stdins are spliced from g#/stdin
...and then execfs stdouts and stderrs are spliced to g#/stdout and g#/stderr
...and you can do -e # with stdin to get enumerated instead of broadcast pipes
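The -b and -e mount flags select the delivery discipline for the gang's stdin multipipe. As a rough model of how the two differ (a hypothetical illustration, not gangfs code; the round-robin routing for -e is one plausible policy alongside explicit per-record destinations):

```python
# Hypothetical model of the two stdin delivery disciplines: broadcast
# (-b) copies every record to every gang member, while enumerated
# (-e n) delivers record i only to member i % n (round-robin).

def broadcast(records, members):
    """Every member sees every record (the -b discipline)."""
    return {m: list(records) for m in range(members)}

def enumerated(records, members):
    """Each record goes to exactly one member (the -e discipline)."""
    delivery = {m: [] for m in range(members)}
    for i, rec in enumerate(records):
        delivery[i % members].append(rec)
    return delivery
```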