TRANSCRIPT
New Process/Thread Runtime
Process in Process Techniques for Practical Address-Space Sharing
Atsushi Hori (RIKEN)
Dec. 13, 2017
Arm HPC Workshop@Akihabara 2017
Background
• The rise of many-core architectures
• The current parallel execution models are designed for multi-core architectures
• Shall we have a new parallel execution model?
What to be shared and what not to be shared
• Isolated address spaces
  • slow communication
• Shared variables
  • contentions on shared variables

                          Address Space
                          Isolated               Shared
Variables   Privatized    Multi-Process (MPI)
            Shared        ??                     Multi-Thread (OpenMP)
What to be shared and what not to be shared
• Isolated address spaces
  • slow communication
• Shared variables
  • contentions on shared variables

                          Address Space
                          Isolated               Shared
Variables   Privatized    Multi-Process (MPI)    3rd Exec. Model
            Shared        ??                     Multi-Thread (OpenMP)
Implementation of 3rd Execution Model
• MPC (by CEA)
  • multi-thread approach
  • the compiler converts all variables to thread-local (see the sketch after this list)
  • a.out and b.out cannot run simultaneously
• PVAS (by RIKEN)
  • multi-process approach
  • patched Linux: the OS kernel allows processes to share an address space
• MPC, PVAS, and SMARTMAP are not portable
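A minimal illustration of the thread-local idea behind MPC-style privatization. This is not MPC's actual compiler transformation, only a hand-written C sketch: declaring a global as __thread gives each thread its own instance, which is what the MPC compiler arranges automatically for every variable.

/* Sketch: per-thread instances of a "global" variable.
 * Each thread sees its own x at its own address. */
#include <pthread.h>
#include <stdio.h>

__thread int x;                        /* one instance of x per thread */

static void *task(void *arg)
{
    x = (int)(long)arg;                /* write only "my" x */
    printf("x @ %p = %d\n", (void *)&x, x);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, (void *)1L);
    pthread_create(&t2, NULL, task, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The limitation noted above remains: everything must be linked into one program, so two different executables such as a.out and b.out cannot be combined this way.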
Why portability matters?
• On large supercomputers (e.g., the K), users are not allowed to install a modified OS kernel or a kernel module
• When I tried to port PVAS onto McKernel, the core developer denied the modification
  • "DO NOT CONTAMINATE MY CODE !!"
PiP is very PORTABLE
Machine                CPU        OS
Xeon and Xeon Phi      x86_64     Linux
Xeon and Xeon Phi      x86_64     McKernel
the K and FX10         SPARC64    XTCOS
ARM (Opteron A1170)    Aarch64    Linux
Task Spawning Time: [graphs of spawning time in seconds vs. number of tasks (1 to 200) on Xeon, KNL, Aarch64, and the K, comparing PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, and Pthread]
Portability
• PiP can run on machines where the following are supported:
  • pthread_create() (or the clone() system call)
  • PIE
  • dlmopen()
• PiP does not run on:
  • BG/Q: PIE is not supported
  • Windows: PIE is not fully supported
  • Mac OSX: dlmopen() is not supported
• FACT: All machines listed in the Top500 (Nov. 2017) use a Linux-family OS !!
Process in Process (PiP)
• User-level implementation of 3rd exec. model
• Portable and practical
[Figure: /proc/<pid>/maps of a PiP process, showing multiple copies of the test program (/PIP/test/basic), of libpip.so, and of libc.so mapped into one address space, each task with its own stack, alongside the heap, ld.so, vdso, and vsyscall mappings.]

[Diagram: one address space holding all PiP tasks. Task-0 through Task-(n-1) are spawned from a.out and each has its own instance of "int x"; Task-(n) through Task-(m-1) are spawned from b.out and each has its own instance of "int a". Each task has its own copy of the program and Glibc.]
Why address space sharing is better?
• Memory-mapping techniques in the multi-process model
  • POSIX shmem (SYS-V, mmap, ...) (see the sketch below)
  • XPMEM
• With address-space sharing, the same page table is shared by the tasks
  • no page-table coherency overhead
  • saves memory for page tables
  • pointers can be used as they are
Memory mapping must maintain page table coherency -> OVERHEAD (system call, page fault, and page table size)
[Diagram: Proc-0 and Proc-1 each have their own page table mapping a shared region onto shared physical memory pages; the two page tables must be kept coherent.]
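A minimal, hedged sketch of the POSIX shmem sequence a sender and a receiver each have to go through (the name "/pip_demo" and the region size are placeholders; error handling omitted). Every call below is a system call, and each process builds its own page-table entries for the region when it first touches the pages:

/* POSIX shmem mapping sketch (illustrative only, not PiP code). */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/pip_demo"            /* hypothetical region name */
#define SHM_SIZE ((size_t)1 << 24)      /* 16 MiB for the sketch */

void *sender_map(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);            /* set the region size */
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;                           /* sender's view of the region */
}

void *receiver_map(void)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;                           /* may differ from the sender's address */
}

With PiP none of this mapping is needed: the child task simply receives the root task's pointer and dereferences it, because both tasks live in one address space and share one page table.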
Memory Mapping vs. PiP
(Excerpt from the paper "Process-in-Process: Techniques for Practical Address-Space Sharing", PPoPP 2018, February 24-28, 2018, Vienna, Austria)

... the latter can allow for greater concurrency because the simulation can run while the analysis is processing the data. For the experiments shown in Section 7.4, we chose the latter approach, but the implementation is flexible enough to allow for either method.

5 Experimental Platforms
We use four experimental platforms to cover several OS kernels and CPU architectures in our evaluation, as listed in Tables 3 and 4.
Table 3. Experimental platform H/W info.

Name       CPU              # Cores   Clock    Memory        Network
Wallaby    Xeon E5-2650 v2  8×2(×2)   2.6 GHz  64 GiB        ConnectX-3
OFP† [19]  Xeon Phi 7250    68(×4)    1.4 GHz  96(+16) GiB   Omni-Path
K [35]     SPARC64 VIIIfx   8         2.0 GHz  16 GiB        Tofu

† Flat mode was used in the evaluations in Sections 7.1 and 7.3 without using MCDRAM. The other evaluations were done with cache quadrant mode.
Table 4. Experimental platform S/W info.

Name     OS                     Glibc      PiP Exec. Mode(s)
Wallaby  Linux (CentOS 7.3)     w/ patch   process and thread
Wallaby  McKernel + CentOS 7.3  w/ patch   thread only
OFP      Linux (CentOS 7.2)     w/ patch   process and thread
K        XTCOS                  w/o patch  process and thread
The Linux kernel on the K computer is old, and we gave up trying to install the patched Glibc. The CPU of the K computer has 8 cores, and the PiP library without the patched Glibc can still utilize all CPU cores.
McKernel is a multikernel operating system that runs Linux with a lightweight kernel side by side on compute nodes [17]. In the experiments with McKernel on Wallaby, McKernel was configured to run on 14 cores out of 16, and the Linux kernel ran on the remaining 2 cores. The current McKernel is unable to handle the clone() flag combination described in Section 3.3, so the PiP programs ran in the thread execution mode.
6 PiP Performance Analysis
This section shows PiP performance by using in-house microbenchmarks to contrast PiP characteristics with the other memory-mapping techniques.
6.1 Setup Overhead
In the microbenchmark, the root task created a 2 GiB mmap()ed array region, and then a child task summed all members of the integer array, assuming the root task sent integer data to the child task via the allocated region. The array was initialized by the root task. Table 5 shows the times spent in the XPMEM and POSIX shmem functions. PiP also provides the <xpmem.h> header file so that XPMEM users can easily convert to PiP users. Indeed, most of the XPMEM functions provided by PiP do almost nothing. Each PiP-implemented XPMEM function takes only 40-80 clock cycles.
Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM              Cycles
xpmem_make()        1,585
xpmem_get()        15,294
xpmem_attach()      2,414
xpmem_detach()     19,183
xpmem_release()       693

POSIX Shmem                Cycles
Sender    shm_open()       22,294
          ftruncate()       4,080
          mmap()            5,553
          close()           6,017
Receiver  shm_open()       13,522
          mmap()           16,232
          close()          16,746
6.2 Page Fault Overhead
Figure 4 shows the time series of each access, using the same microbenchmark program as in the preceding subsection. Element access was strided by 64 bytes so that each cache block was accessed only once, to eliminate cache-block effects. In the XPMEM case, the mmap()ed region was attached by using the XPMEM functions. The upper-left graph in this figure shows the time series using POSIX shmem and XPMEM, and the lower-left graph shows the time series using PiP. Both graphs on the left-hand side show spikes at every 4 KiB. Because of space limitations we do not show two separate graphs, but the spike heights of XPMEM are higher than those of POSIX shmem. The spike heights of the PiP process mode and PiP thread mode are almost the same. However, the spikes of XPMEM and POSIX shmem are much higher than those of PiP.
[Figure 4 panels: access time (ticks) vs. array element byte offset (0 to 16,384); top: Shmem and XPMEM, bottom: PiP:process and PiP:thread; left: 4 KiB page size, right: 2 GiB page size.]
Figure 4. Access time series of an array (Wallaby/Linux).
XPMEM and POSIX shmem can be categorized as memory-mapping techniques, and a page fault (PF) happens every time a new memory page is accessed. The spikes in PiP are the time spent on translation lookaside buffer (TLB) misses, not PFs. In PiP, the whole array is touched at the time of initialization by the root (sender) task, and all required page-table entries are created then.
We ran the same benchmark program but using HugeTLB this time (graphs on the right-hand side). POSIX shmem cannot handle HugeTLB on this Linux kernel. XPMEM again shows huge spikes on every 4 KiB page boundary. We consulted the XPMEM device driver source code (version 2.6.4) and found that the XPMEM driver can create only 4 KiB page-table entries, regardless of the page size of the target region. In PiP, no TLB-miss spikes are found this time because a single 2 GiB page is used.
[Slide graphs (Xeon/Linux): the same access-time series, comparing Shmem and XPMEM (4 KiB and 2 MiB page sizes) against PiP:process and PiP:thread over array offsets 0 to 16,384 bytes.]
PiP takes less than 100 clocks !!
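A minimal sketch, under stated assumptions, of the kind of per-access timing loop described above (this is not the paper's benchmark code): walk a buffer with a 64-byte stride, matching the cache-line size, and record how many TSC ticks each access costs. On a freshly mapped region the spikes are page faults; on a region already touched by the PiP root task only TLB misses remain. x86-specific.

/* Strided access-time measurement sketch (x86, GCC/Clang). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                  /* __rdtsc() */

#define STRIDE 64                       /* one access per cache block */

int main(void)
{
    size_t size = 16384;                /* small buffer for the sketch */
    volatile char *buf = malloc(size);  /* stands in for the shared array */

    for (size_t off = 0; off < size; off += STRIDE) {
        uint64_t t0 = __rdtsc();
        (void)buf[off];                 /* the timed access */
        uint64_t t1 = __rdtsc();
        printf("%zu %llu\n", off, (unsigned long long)(t1 - t0));
    }
    free((void *)buf);
    return 0;
}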
Process in Process (PiP)
• dlmopen() (not a typo of dlopen())
  • loads a program into a new name space
  • the same variable "foo" can then have multiple instances at different addresses
• Position Independent Executable (PIE)
  • PIE programs can be loaded at any location
• Combine dlmopen() and PIE
  • load a PIE program with dlmopen()
  • variables can be privatized within the same address space (see the sketch below)
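A minimal, hedged sketch of the name-space trick (a small shared object "./libfoo.so" exporting a global "int foo" stands in for a PIE program; both names are hypothetical). Loading the same object into two new link-map name spaces yields two independent instances of "foo" inside one address space; PiP applies the same idea to PIE executables.

/* dlmopen() name-space sketch (Glibc extension; link with -ldl). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* LM_ID_NEWLM asks the dynamic linker for a brand-new name space. */
    void *ns1 = dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL);
    void *ns2 = dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL);
    if (ns1 == NULL || ns2 == NULL) {
        fprintf(stderr, "dlmopen: %s\n", dlerror());
        return EXIT_FAILURE;
    }
    int *foo1 = dlsym(ns1, "foo");     /* "foo" in name space 1 */
    int *foo2 = dlsym(ns2, "foo");     /* "foo" in name space 2 */
    *foo1 = 1;
    *foo2 = 2;
    printf("foo@%p = %d, foo@%p = %d\n",
           (void *)foo1, *foo1, (void *)foo2, *foo2);
    return 0;
}

Because the loaded object is position independent, the dynamic linker is free to place each copy at a different location, which is exactly what privatizes the variables within the single shared address space.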
Glibc Issue
• In the current Glibc, dlmopen() can create only up to 16 name spaces (see the probe sketch below)
  • each PiP task requires one name space to have privatized variables
  • a many-core architecture can run more than 16 PiP tasks, up to the number of CPU cores
• A Glibc patch is also provided to allow a larger number of name spaces, in case 16 is not enough
  • it changes the size of the name-space table
  • currently up to 260 PiP tasks can be created
• Some workaround code can be found in the PiP library code
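A minimal sketch, assuming a placeholder shared object "./libfoo.so", of how this limit can be probed: keep asking for new name spaces until dlmopen() fails. With an unpatched Glibc the loop stops after roughly 16 name spaces; with the patched Glibc it goes much further.

/* Name-space limit probe (Glibc extension; link with -ldl). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    /* Each successful dlmopen() consumes one more link-map name space. */
    while (dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL) != NULL)
        n++;
    printf("created %d name spaces before failure: %s\n", n, dlerror());
    return 0;
}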
PiP Showcases
Showcase 1 : MPI pt2pt
• Current Eager/Rndv.: 2 copies
• PiP Rndv.: 1 copy (see the sketch below)
[Graph (Xeon/Linux): bandwidth (MB/s) vs. message size (bytes) for eager-2copy, rndv-2copy, and PiP (rndv-1copy); higher is better.]
PiP is 3.5x faster @ 128KB
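Why address-space sharing allows a one-copy rendezvous, as a hedged sketch rather than the actual MPICH/PiP code: the receiver resolves the address of the sender task's buffer with pip_get_addr(), the same call used on the MPI_Win_allocate_shared slide below, and copies straight into its own receive buffer. The header name and the buffer name are assumptions.

/* One-copy receive sketch (illustrative; error handling omitted). */
#include <string.h>
#include <pip.h>                     /* PiP API header; exact path may differ */

double sendbuf[1 << 14];             /* each PiP task has its own instance */

void recv_from(int peer_pipid, double *recvbuf, size_t nelems)
{
    void *peer_sendbuf;
    /* Resolve the peer task's "sendbuf" instance in the shared address space. */
    pip_get_addr(peer_pipid, "sendbuf", &peer_sendbuf);
    /* Single copy: peer's send buffer -> my receive buffer. */
    memcpy(recvbuf, peer_sendbuf, nelems * sizeof(double));
}

With ordinary processes the same transfer needs an intermediate shared or kernel buffer, hence the two copies of the eager and rendezvous protocols.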
Showcase 2 : MPI DDT
• Derived Data Type (DDT) communication
  • non-contiguous data transfer
  • current: pack -> send -> unpack (3 copies)
  • PiP: non-contiguous send (1 copy)
[Graph (Xeon/Linux): normalized time (eager-2copy = 1) vs. count of double elements in the X, Y, Z dimensions, from (64K, 16, 128) down to (64, 16K, 128), comparing eager-2copy (base), rndv-2copy, and PiP non-contig vec; lower is better.]
Showcase 3 : MPI_Win_allocate_shared (1/2)
MPI Implementation

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    ...
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, comm, &mem, &win);
    ...
    MPI_Win_shared_query(win, north, &sz, &dsp_unit, &northptr);
    MPI_Win_shared_query(win, south, &sz, &dsp_unit, &southptr);
    MPI_Win_shared_query(win, east,  &sz, &dsp_unit, &eastptr);
    MPI_Win_shared_query(win, west,  &sz, &dsp_unit, &westptr);
    ...
    MPI_Win_lock_all(0, win);
    for (int iter = 0; iter < niters; ++iter) {
        MPI_Win_sync(win);
        MPI_Barrier(shmcomm);
        /* stencil computation */
    }
    MPI_Win_unlock_all(win);
    ...
}
PiP Implementation

int main(int argc, char **argv) {
    pip_init(&pipid, &p, NULL, 0);
    ...
    mem = malloc(size);
    ...
    pip_get_addr(north, "mem", &northptr);
    pip_get_addr(south, "mem", &southptr);
    pip_get_addr(east,  "mem", &eastptr);
    pip_get_addr(west,  "mem", &westptr);
    ...
    for (int iter = 0; iter < niters; ++iter) {
        pip_barrier(p);
        ...
        /* stencil computation */
    }
    ...
    pip_fin();
}
Showcase 3 : MPI_Win_allocate_shared (2/2)
[Graphs (KNL, 5-point stencil on a 4K x 4K array): left, total page-table size (KiB) and page-table size as a percentage of the array size vs. number of tasks (1 to 1,000) for PiP and MPI; right, total number of page faults vs. number of tasks for PiP and MPI; lower is better.]
Showcase 4 : In Situ
[Diagram, original SHMEM-based in situ: the LAMMPS process dumps and copies data (copy-in) into a pre-allocated shared buffer; the in-situ process copies the data out, gathers the data chunks, and runs the analysis.]

[Diagram, PiP-based in situ: the LAMMPS process only dumps the data; the in-situ process copies it out, gathers the data chunks, and runs the analysis, with no pre-allocated shared buffer.]

[Graph (Xeon/Linux), LAMMPS in situ, POSIX shmem vs. PiP: slowdown ratio (relative to running without in situ) for the LAMMPS 3d Lennard-Jones melt at problem sizes (4,4,4) through (12,12,12); lower is better.]
• The LAMMPS process ran with four OpenMP threads
• The in-situ process ran with a single thread
• O(N²) computation cost >> data transfer cost at (12,12,12)
Showcase 5 : SNAP
Number of Cores   MPICH/Threads (s)   MPICH/PiP (s)   Speedup (PiP vs. Threads)
16                683.3               430.5           1.6
32                379.1               221.2           1.7
64                207.9               123.0           1.7
128               153.0                68.3           2.2
256               106.4                42.0           2.5
512                91.6                27.7           3.3
1024               83.3                22.0           3.8
PiP vs. threads in hybrid MPI+X: SNAP strong scaling on OFP (1-16 nodes, flat mode).
• (MPI + OpenMP) vs. (MPI + PiP)
Showcase 5 : Using PiP in Hybrid MPI + "X" as the "X" (2)
• PiP-based parallelism
  – easy application data sharing across cores
  – no multithreading-safety overhead
  – naturally utilizes multiple network ports
[Diagram: PiP tasks share the application data while each task has its own MPI stack, so the tasks naturally drive multiple network ports.]

[Graphs: multipair message rate (KMessages/s) vs. message size (1 byte to 4 MB) between PiP tasks and between threads, for 1, 4, 16, and 64 pairs.]

[Graph: SNAP solve time (s) and PiP-vs-threads speedup vs. number of cores (16 to 1,024) for MPICH/Threads and MPICH/PiP; same data as the previous slide.]
Multipair message rate (osu_mbw_mr) between two OFP nodes (Xeon Phi + Linux, flat mode).
PiP vs. threads in hybrid MPI+X: SNAP strong scaling on OFP (1-16 nodes, flat mode).
Research Collaboration
• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
  • MPICH
• UT/ICL (Prof. Bosilca)
  • Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
  • MPC
• UIUC (Prof. Kale) — JLESC
  • AMPI
• Intel (Dr. Dayal)
  • In Situ
Summary
• Process in Process (PiP)
  • a new implementation of the 3rd execution model
  • better than memory-mapping techniques
• PiP is portable and practical because of its user-level implementation
  • it can run on the K and OFP supercomputers
• Showcases prove that PiP can improve performance
Final words
• The Glibc issues will be reported to Red Hat
• We are seeking PiP applications, not only in HPC but also in enterprise