TRANSCRIPT
New Process/Thread Runtime
Process in Process Techniques for Practical Address-Space Sharing
Atsushi Hori (RIKEN)
Dec. 13, 2017
Arm HPC Workshop@Akihabara 2017
Background
• The rise of many-core architectures
• The current parallel execution models are designed for multi-core architectures
• Shall we have a new parallel execution model?
What to be shared and what not to be shared
• Isolated address spaces
  • slow communication
• Shared variables
  • contentions on shared variables

                          Address Space
                          Isolated               Shared
Variables   Privatized    Multi-Process (MPI)
            Shared        ??                     Multi-Thread (OpenMP)
What to be shared and what not to be shared
• Isolated address spaces
  • slow communication
• Shared variables
  • contentions on shared variables

                          Address Space
                          Isolated               Shared
Variables   Privatized    Multi-Process (MPI)    3rd Exec. Model
            Shared        ??                     Multi-Thread (OpenMP)
Implementation of 3rd Execution Model
• MPC (by CEA)
  • multi-thread approach
  • the compiler converts all variables to thread-local (see the sketch after this list)
  • a.out and b.out cannot run simultaneously
• PVAS (by RIKEN)
  • multi-process approach
  • patched Linux: the OS kernel allows processes to share an address space
• MPC, PVAS, and SMARTMAP are not portable
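A minimal illustration of the thread-local idea behind MPC-style privatization. This is not MPC's actual compiler transformation, only a hand-written C sketch: declaring a global as __thread gives each thread its own instance, which is what the MPC compiler arranges automatically for every variable.

/* Sketch: per-thread instances of a "global" variable.
 * Each thread sees its own x at its own address. */
#include <pthread.h>
#include <stdio.h>

__thread int x;                        /* one instance of x per thread */

static void *task(void *arg)
{
    x = (int)(long)arg;                /* write only "my" x */
    printf("x @ %p = %d\n", (void *)&x, x);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task, (void *)1L);
    pthread_create(&t2, NULL, task, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

The limitation noted above remains: everything must be linked into one program, so two different executables such as a.out and b.out cannot be combined this way.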
Why portability matters?
• On large supercomputers (e.g., the K), users are not allowed to install a modified OS kernel or a kernel module
• When I tried to port PVAS onto McKernel, the core developer denied the modification
  • "DO NOT CONTAMINATE MY CODE !!"
PiP is very PORTABLE
Machine                CPU        OS
Xeon and Xeon Phi      x86_64     Linux
Xeon and Xeon Phi      x86_64     McKernel
the K and FX10         SPARC64    XTCOS
ARM (Opteron A1170)    Aarch64    Linux
Task Spawning Time: [graphs of spawning time in seconds vs. number of tasks (1 to 200) on Xeon, KNL, Aarch64, and the K, comparing PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, and Pthread]
Portability
• PiP can run on machines where the following are supported:
  • pthread_create() (or the clone() system call)
  • PIE
  • dlmopen()
• PiP does not run on:
  • BG/Q: PIE is not supported
  • Windows: PIE is not fully supported
  • Mac OSX: dlmopen() is not supported
• FACT: All machines listed in the Top500 (Nov. 2017) use a Linux-family OS !!
Process in Process (PiP)
• User-level implementation of 3rd exec. model
• Portable and practical
[Figure: /proc/<pid>/maps of a PiP process, showing multiple copies of the test program (/PIP/test/basic), of libpip.so, and of libc.so mapped into one address space, each task with its own stack, alongside the heap, ld.so, vdso, and vsyscall mappings.]

[Diagram: one address space holding all PiP tasks. Task-0 through Task-(n-1) are spawned from a.out and each has its own instance of "int x"; Task-(n) through Task-(m-1) are spawned from b.out and each has its own instance of "int a". Each task has its own copy of the program and Glibc.]
Why address space sharing is better?
• Memory-mapping techniques in the multi-process model
  • POSIX shmem (SYS-V, mmap, ...) (see the sketch below)
  • XPMEM
• With address-space sharing, the same page table is shared by the tasks
  • no page-table coherency overhead
  • saves memory for page tables
  • pointers can be used as they are
Memory mapping must maintain page table coherency -> OVERHEAD (system call, page fault, and page table size)
[Diagram: Proc-0 and Proc-1 each have their own page table mapping a shared region onto shared physical memory pages; the two page tables must be kept coherent.]
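A minimal, hedged sketch of the POSIX shmem sequence a sender and a receiver each have to go through (the name "/pip_demo" and the region size are placeholders; error handling omitted). Every call below is a system call, and each process builds its own page-table entries for the region when it first touches the pages:

/* POSIX shmem mapping sketch (illustrative only, not PiP code). */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/pip_demo"            /* hypothetical region name */
#define SHM_SIZE ((size_t)1 << 24)      /* 16 MiB for the sketch */

void *sender_map(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);            /* set the region size */
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;                           /* sender's view of the region */
}

void *receiver_map(void)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;                           /* may differ from the sender's address */
}

With PiP none of this mapping is needed: the child task simply receives the root task's pointer and dereferences it, because both tasks live in one address space and share one page table.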
Memory Mapping vs. PiP
(Excerpt from the paper "Process-in-Process: Techniques for Practical Address-Space Sharing", PPoPP 2018, February 24-28, 2018, Vienna, Austria)

... the latter can allow for greater concurrency because the simulation can run while the analysis is processing the data. For the experiments shown in Section 7.4, we chose the latter approach, but the implementation is flexible enough to allow for either method.

5 Experimental Platforms
We use four experimental platforms to cover several OS kernels and CPU architectures in our evaluation, as listed in Tables 3 and 4.
Table 3. Experimental platform H/W info.

Name       CPU              # Cores   Clock    Memory        Network
Wallaby    Xeon E5-2650 v2  8×2(×2)   2.6 GHz  64 GiB        ConnectX-3
OFP† [19]  Xeon Phi 7250    68(×4)    1.4 GHz  96(+16) GiB   Omni-Path
K [35]     SPARC64 VIIIfx   8         2.0 GHz  16 GiB        Tofu

† Flat mode was used in the evaluations in Sections 7.1 and 7.3 without using MCDRAM. The other evaluations were done with cache quadrant mode.
Table 4. Experimental platform S/W info.

Name     OS                     Glibc      PiP Exec. Mode(s)
Wallaby  Linux (CentOS 7.3)     w/ patch   process and thread
Wallaby  McKernel + CentOS 7.3  w/ patch   thread only
OFP      Linux (CentOS 7.2)     w/ patch   process and thread
K        XTCOS                  w/o patch  process and thread
The Linux kernel on the K computer is old, and we gave up trying to install the patched Glibc. The CPU of the K computer has 8 cores, and the PiP library without the patched Glibc can still utilize all CPU cores.
McKernel is a multikernel operating system that runs Linux with a lightweight kernel side by side on compute nodes [17]. In the experiments with McKernel on Wallaby, McKernel was configured to run on 14 cores out of 16, and the Linux kernel ran on the remaining 2 cores. The current McKernel is unable to handle the clone() flag combination described in Section 3.3, so the PiP programs ran in the thread execution mode.
6 PiP Performance Analysis
This section shows PiP performance by using in-house microbenchmarks to contrast PiP characteristics with the other memory-mapping techniques.
6.1 Setup Overhead
In the microbenchmark, the root task created a 2 GiB mmap()ed array region, and then a child task summed all members of the integer array, assuming the root task sent integer data to the child task via the allocated region. The array was initialized by the root task. Table 5 shows the times spent in the XPMEM and POSIX shmem functions. PiP also provides the <xpmem.h> header file so that XPMEM users can easily convert to PiP users. Indeed, most of the XPMEM functions provided by PiP do almost nothing. Each PiP-implemented XPMEM function takes only 40-80 clock cycles.
Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM              Cycles
xpmem_make()        1,585
xpmem_get()        15,294
xpmem_attach()      2,414
xpmem_detach()     19,183
xpmem_release()       693

POSIX Shmem                Cycles
Sender    shm_open()       22,294
          ftruncate()       4,080
          mmap()            5,553
          close()           6,017
Receiver  shm_open()       13,522
          mmap()           16,232
          close()          16,746
6.2 Page Fault Overhead
Figure 4 shows the time series of each access, using the same microbenchmark program as in the preceding subsection. Element access was strided by 64 bytes so that each cache block was accessed only once, to eliminate cache-block effects. In the XPMEM case, the mmap()ed region was attached by using the XPMEM functions. The upper-left graph in this figure shows the time series using POSIX shmem and XPMEM, and the lower-left graph shows the time series using PiP. Both graphs on the left-hand side show spikes at every 4 KiB. Because of space limitations we do not show two separate graphs, but the spike heights of XPMEM are higher than those of POSIX shmem. The spike heights of the PiP process mode and PiP thread mode are almost the same. However, the spikes of XPMEM and POSIX shmem are much higher than those of PiP.
[Figure 4 panels: access time (ticks) vs. array element byte offset (0 to 16,384); top: Shmem and XPMEM, bottom: PiP:process and PiP:thread; left: 4 KiB page size, right: 2 GiB page size.]
Figure 4. Access time series of an array (Wallaby/Linux).
XPMEM and POSIX shmem can be categorized as memory-mapping techniques, and a page fault (PF) happens every time a new memory page is accessed. The spikes in PiP are the time spent on translation lookaside buffer (TLB) misses, not PFs. In PiP, the whole array is touched at the time of initialization by the root (sender) task, and all required page-table entries are created then.
We ran the same benchmark program but using HugeTLB this time (graphs on the right-hand side). POSIX shmem cannot handle HugeTLB on this Linux kernel. XPMEM again shows huge spikes on every 4 KiB page boundary. We consulted the XPMEM device driver source code (version 2.6.4) and found that the XPMEM driver can create only 4 KiB page-table entries, regardless of the page size of the target region. In PiP, no TLB-miss spikes are found this time because a single 2 GiB page is used.
[Slide graphs (Xeon/Linux): the same access-time series, comparing Shmem and XPMEM (4 KiB and 2 MiB page sizes) against PiP:process and PiP:thread over array offsets 0 to 16,384 bytes.]
PiP takes less than 100 clocks !!
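A minimal sketch, under stated assumptions, of the kind of per-access timing loop described above (this is not the paper's benchmark code): walk a buffer with a 64-byte stride, matching the cache-line size, and record how many TSC ticks each access costs. On a freshly mapped region the spikes are page faults; on a region already touched by the PiP root task only TLB misses remain. x86-specific.

/* Strided access-time measurement sketch (x86, GCC/Clang). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                  /* __rdtsc() */

#define STRIDE 64                       /* one access per cache block */

int main(void)
{
    size_t size = 16384;                /* small buffer for the sketch */
    volatile char *buf = malloc(size);  /* stands in for the shared array */

    for (size_t off = 0; off < size; off += STRIDE) {
        uint64_t t0 = __rdtsc();
        (void)buf[off];                 /* the timed access */
        uint64_t t1 = __rdtsc();
        printf("%zu %llu\n", off, (unsigned long long)(t1 - t0));
    }
    free((void *)buf);
    return 0;
}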
Process in Process (PiP)
• dlmopen() (not a typo of dlopen())
  • loads a program into a new name space
  • the same variable "foo" can then have multiple instances at different addresses
• Position Independent Executable (PIE)
  • PIE programs can be loaded at any location
• Combine dlmopen() and PIE
  • load a PIE program with dlmopen()
  • variables can be privatized within the same address space (see the sketch below)
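A minimal, hedged sketch of the name-space trick (a small shared object "./libfoo.so" exporting a global "int foo" stands in for a PIE program; both names are hypothetical). Loading the same object into two new link-map name spaces yields two independent instances of "foo" inside one address space; PiP applies the same idea to PIE executables.

/* dlmopen() name-space sketch (Glibc extension; link with -ldl). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* LM_ID_NEWLM asks the dynamic linker for a brand-new name space. */
    void *ns1 = dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL);
    void *ns2 = dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL);
    if (ns1 == NULL || ns2 == NULL) {
        fprintf(stderr, "dlmopen: %s\n", dlerror());
        return EXIT_FAILURE;
    }
    int *foo1 = dlsym(ns1, "foo");     /* "foo" in name space 1 */
    int *foo2 = dlsym(ns2, "foo");     /* "foo" in name space 2 */
    *foo1 = 1;
    *foo2 = 2;
    printf("foo@%p = %d, foo@%p = %d\n",
           (void *)foo1, *foo1, (void *)foo2, *foo2);
    return 0;
}

Because the loaded object is position independent, the dynamic linker is free to place each copy at a different location, which is exactly what privatizes the variables within the single shared address space.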
Glibc Issue
• In the current Glibc, dlmopen() can create only up to 16 name spaces (see the probe sketch below)
  • each PiP task requires one name space to have privatized variables
  • a many-core architecture can run more than 16 PiP tasks, up to the number of CPU cores
• A Glibc patch is also provided to allow a larger number of name spaces, in case 16 is not enough
  • it changes the size of the name-space table
  • currently up to 260 PiP tasks can be created
• Some workaround code can be found in the PiP library code
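A minimal sketch, assuming a placeholder shared object "./libfoo.so", of how this limit can be probed: keep asking for new name spaces until dlmopen() fails. With an unpatched Glibc the loop stops after roughly 16 name spaces; with the patched Glibc it goes much further.

/* Name-space limit probe (Glibc extension; link with -ldl). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    /* Each successful dlmopen() consumes one more link-map name space. */
    while (dlmopen(LM_ID_NEWLM, "./libfoo.so", RTLD_NOW | RTLD_LOCAL) != NULL)
        n++;
    printf("created %d name spaces before failure: %s\n", n, dlerror());
    return 0;
}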
PiP Showcases
Showcase 1 : MPI pt2pt
• Current Eager/Rndv.: 2 copies
• PiP Rndv.: 1 copy (see the sketch below)
[Graph (Xeon/Linux): bandwidth (MB/s) vs. message size (bytes) for eager-2copy, rndv-2copy, and PiP (rndv-1copy); higher is better.]
PiP is 3.5x faster @ 128KB
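Why address-space sharing allows a one-copy rendezvous, as a hedged sketch rather than the actual MPICH/PiP code: the receiver resolves the address of the sender task's buffer with pip_get_addr(), the same call used on the MPI_Win_allocate_shared slide below, and copies straight into its own receive buffer. The header name and the buffer name are assumptions.

/* One-copy receive sketch (illustrative; error handling omitted). */
#include <string.h>
#include <pip.h>                     /* PiP API header; exact path may differ */

double sendbuf[1 << 14];             /* each PiP task has its own instance */

void recv_from(int peer_pipid, double *recvbuf, size_t nelems)
{
    void *peer_sendbuf;
    /* Resolve the peer task's "sendbuf" instance in the shared address space. */
    pip_get_addr(peer_pipid, "sendbuf", &peer_sendbuf);
    /* Single copy: peer's send buffer -> my receive buffer. */
    memcpy(recvbuf, peer_sendbuf, nelems * sizeof(double));
}

With ordinary processes the same transfer needs an intermediate shared or kernel buffer, hence the two copies of the eager and rendezvous protocols.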
Showcase 2 : MPI DDT
• Derived Data Type (DDT) communication
  • non-contiguous data transfer
  • current: pack -> send -> unpack (3 copies)
  • PiP: non-contiguous send (1 copy)
[Graph (Xeon/Linux): normalized time (eager-2copy = 1) vs. count of double elements in the X, Y, Z dimensions, from (64K, 16, 128) down to (64, 16K, 128), comparing eager-2copy (base), rndv-2copy, and PiP non-contig vec; lower is better.]
Showcase 3 : MPI_Win_allocate_shared (1/2)
MPI Implementation

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    ...
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, comm, &mem, &win);
    ...
    MPI_Win_shared_query(win, north, &sz, &dsp_unit, &northptr);
    MPI_Win_shared_query(win, south, &sz, &dsp_unit, &southptr);
    MPI_Win_shared_query(win, east,  &sz, &dsp_unit, &eastptr);
    MPI_Win_shared_query(win, west,  &sz, &dsp_unit, &westptr);
    ...
    MPI_Win_lock_all(0, win);
    for (int iter = 0; iter < niters; ++iter) {
        MPI_Win_sync(win);
        MPI_Barrier(shmcomm);
        /* stencil computation */
    }
    MPI_Win_unlock_all(win);
    ...
}
PiP Implementation

int main(int argc, char **argv) {
    pip_init(&pipid, &p, NULL, 0);
    ...
    mem = malloc(size);
    ...
    pip_get_addr(north, "mem", &northptr);
    pip_get_addr(south, "mem", &southptr);
    pip_get_addr(east,  "mem", &eastptr);
    pip_get_addr(west,  "mem", &westptr);
    ...
    for (int iter = 0; iter < niters; ++iter) {
        pip_barrier(p);
        ...
        /* stencil computation */
    }
    ...
    pip_fin();
}
Showcase 3 : MPI_Win_allocate_shared (2/2)
[Graphs (KNL, 5-point stencil on a 4K x 4K array): left, total page-table size (KiB) and page-table size as a percentage of the array size vs. number of tasks (1 to 1,000) for PiP and MPI; right, total number of page faults vs. number of tasks for PiP and MPI; lower is better.]
Showcase 4 : In Situ
[Diagram, original SHMEM-based in situ: the LAMMPS process dumps and copies data (copy-in) into a pre-allocated shared buffer; the in-situ process copies the data out, gathers the data chunks, and runs the analysis.]

[Diagram, PiP-based in situ: the LAMMPS process only dumps the data; the in-situ process copies it out, gathers the data chunks, and runs the analysis, with no pre-allocated shared buffer.]

[Graph (Xeon/Linux), LAMMPS in situ, POSIX shmem vs. PiP: slowdown ratio (relative to running without in situ) for the LAMMPS 3d Lennard-Jones melt at problem sizes (4,4,4) through (12,12,12); lower is better.]
• The LAMMPS process ran with four OpenMP threads
• The in-situ process ran with a single thread
• O(N²) computation cost >> data transfer cost at (12,12,12)
Showcase 5 : SNAP
Number of Cores   MPICH/Threads (s)   MPICH/PiP (s)   Speedup (PiP vs. Threads)
16                683.3               430.5           1.6
32                379.1               221.2           1.7
64                207.9               123.0           1.7
128               153.0                68.3           2.2
256               106.4                42.0           2.5
512                91.6                27.7           3.3
1024               83.3                22.0           3.8
PiP vs. threads in hybrid MPI+X: SNAP strong scaling on OFP (1-16 nodes, flat mode).
• (MPI + OpenMP) vs. (MPI + PiP)
Showcase 5 : Using PiP in Hybrid MPI + "X" as the "X" (2)
• PiP-based parallelism
  – easy application data sharing across cores
  – no multithreading-safety overhead
  – naturally utilizes multiple network ports
[Diagram: PiP tasks share the application data while each task has its own MPI stack, so the tasks naturally drive multiple network ports.]

[Graphs: multipair message rate (KMessages/s) vs. message size (1 byte to 4 MB) between PiP tasks and between threads, for 1, 4, 16, and 64 pairs.]

[Graph: SNAP solve time (s) and PiP-vs-threads speedup vs. number of cores (16 to 1,024) for MPICH/Threads and MPICH/PiP; same data as the previous slide.]
Multipair message rate (osu_mbw_mr) between two OFP nodes (Xeon Phi + Linux, flat mode).
PiP vs. threads in hybrid MPI+X: SNAP strong scaling on OFP (1-16 nodes, flat mode).
Research Collaboration
• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
  • MPICH
• UT/ICL (Prof. Bosilca)
  • Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
  • MPC
• UIUC (Prof. Kale) — JLESC
  • AMPI
• Intel (Dr. Dayal)
  • In Situ
Summary
• Process in Process (PiP)
  • a new implementation of the 3rd execution model
  • better than memory-mapping techniques
• PiP is portable and practical because of its user-level implementation
  • it can run on the K and OFP supercomputers
• Showcases prove that PiP can improve performance
Final words
• The Glibc issues will be reported to Red Hat
• We are seeking PiP applications, not only in HPC but also in enterprise