
Page 1: New Process/Thread Runtime

New Process/Thread Runtime

Process in Process Techniques for Practical Address-Space Sharing

Atsushi Hori (RIKEN)

Dec. 13, 2017

Page 2: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Background

• The rise of many-core architectures

• The current parallel execution models are designed for multi-core architectures

• Shall we have a new parallel execution model?

2

Page 3: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

What should be shared and what should not

• Isolated address spaces
  • slow communication

• Shared variables
  • contention on shared variables

3

                          Address Space
                   Isolated                Shared
Variables
  Privatized   Multi-Process (MPI)
  Shared       ??                     Multi-Thread (OpenMP)

Page 4: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

What should be shared and what should not

• Isolated address spaces
  • slow communication

• Shared variables
  • contention on shared variables

4

                          Address Space
                   Isolated                Shared
Variables
  Privatized   Multi-Process (MPI)    3rd Exec. Model
  Shared       ??                     Multi-Thread (OpenMP)

Page 5: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Implementation of 3rd Execution Model

• MPC (by CEA)
  • Multi-thread approach
  • Compiler converts all variables to thread-local
  • a.out and b.out cannot run simultaneously

• PVAS (by RIKEN)
  • Multi-process approach
  • Patched Linux
  • OS kernel allows processes to share an address space

• MPC, PVAS, and SMARTMAP are not portable

5

Page 6: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Why portability matters?

• On large supercomputers (e.g., the K), users are not allowed to install a modified OS kernel or kernel module

• When I tried to port PVAS onto McKernel, the core developer denied the modification
  • "DO NOT CONTAMINATE MY CODE !!"

6

Page 7: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

PiP is very PORTABLE

7

                        CPU        OS
Xeon and Xeon Phi       x86_64     Linux
                        x86_64     McKernel
the K and FX10          SPARC64    XTCOS
ARM (Opteron A1170)     Aarch64    Linux

(Figure: Task Spawning Time. Time [s] vs. number of tasks (1 to 200) on Xeon, KNL, Aarch64, and the K, comparing PiP:preload, PiP:thread, Fork&Exec, Vfork&Exec, PosixSpawn, and Pthread.)

Page 8: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Portability

• PiP can run on machines where the following are supported:
  • pthread_create() (or the clone() system call)
  • PIE
  • dlmopen()

• PiP does not run on:
  • BG/Q (PIE is not supported)
  • Windows (PIE is not fully supported)
  • Mac OS X (dlmopen() is not supported)

• FACT: All machines listed in the Top500 (Nov. 2017) run a Linux-family OS !!

8

Page 9: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

• User-level implementation of the 3rd exec. model
• Portable and practical

Process in Process (PiP)

9

(Figure: excerpt of /proc/<pid>/maps showing one address space containing multiple mapped instances of the program (/PIP/test/basic), the PiP library (/PIP/lib/libpip.so), and Glibc (/lib64/libc.so), together with per-task stacks, the heap, /lib64/ld.so, [vdso], and [vsyscall].)

(Diagram: one shared address space holds Task-0 … Task-(n-1) running a.out, each with its own instance of "int x", and Task-(n) … Task-(m-1) running b.out, each with its own instance of "int a"; each task has its own copy of the program and Glibc.)
Page 10: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Why is address-space sharing better?

• Memory-mapping techniques in the multi-process model
  • POSIX shmem (SYS-V, mmap, ..)
  • XPMEM

• With PiP, the same page table is shared by all tasks
  • no page-table coherency overhead
  • saves memory for page tables
  • pointers can be used as they are

10

Memory mapping must maintain page-table coherency -> OVERHEAD (system call, page fault, and page-table size)

(Diagram: Proc-0 and Proc-1 each have their own page table mapping a shared region onto the same shared physical memory pages; the two page tables must be kept coherent.)

Page 11: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Memory Mapping vs. PiP

11

(Excerpt from the paper "Process-in-Process: Techniques for Practical Address-Space Sharing", PPoPP 2018, February 24–28, 2018, Vienna, Austria.)

… the latter can allow for greater concurrency because the simulation can run while the analysis is processing the data. For the experiments shown in Section 7.4, we chose the latter approach, but the implementation is flexible enough to allow for either method.

5 Experimental Platforms

We use four experimental platforms to cover several OS kernels and CPU architectures in our evaluation, as listed in Tables 3 and 4.

Table 3. Experimental platform H/W info.

Name        CPU               # Cores    Clock      Memory         Network
Wallaby     Xeon E5-2650 v2   8×2(×2)    2.6 GHz    64 GiB         ConnectX-3
OFP† [19]   Xeon Phi 7250     68(×4)     1.4 GHz    96(+16) GiB    Omni-Path
K [35]      SPARC64 VIIIfx    8          2.0 GHz    16 GiB         Tofu

† Flat mode was used in the evaluations in Sections 7.1 and 7.3 without using MCDRAM. The other evaluations were done with cache quadrant mode.

Table 4. Experimental platform S/W info.

Name      OS                      Glibc        PiP Exec. Mode(s)
Wallaby   Linux (CentOS 7.3)      w/ patch     process and thread
Wallaby   McKernel + CentOS 7.3   w/ patch     thread only
OFP       Linux (CentOS 7.2)      w/ patch     process and thread
K         XTCOS                   w/o patch    process and thread

The Linux kernel on the K computer is old, and we gave up trying to install the patched Glibc. The CPU of the K computer has 8 cores, and the PiP library without the patched Glibc can still utilize all CPU cores.

McKernel is a multikernel operating system that runs Linux with a lightweight kernel side by side on compute nodes [17]. In the experiments with McKernel on Wallaby, McKernel was configured to run on 14 cores out of 16, and the Linux kernel ran on the remaining 2 cores. The current McKernel is unable to handle the clone() flag combination described in Section 3.3, so the PiP programs ran in the thread execution mode.

6 PiP Performance Analysis

This evaluation section shows PiP performance by using in-house microbenchmarks to contrast PiP characteristics with the other memory-mapping techniques.

6.1 Setup Overhead

In the microbenchmark, the root task created a 2 GiB mmap()ed array region, and then a child task summed all members of the integer array, assuming the root task sent integer data to the child task via the allocated region. The array was initialized by the root task. Table 5 shows the times spent in the XPMEM and POSIX shmem functions. PiP also provides the <xpmem.h> header file so that XPMEM users can easily convert to PiP users. Indeed, most of the XPMEM functions provided by PiP do almost nothing. Each PiP-implemented XPMEM function takes only 40–80 clock cycles.

Table 5. Overhead of XPMEM and POSIX shmem functions (Wallaby/Linux)

XPMEM                Cycles
xpmem_make()          1,585
xpmem_get()          15,294
xpmem_attach()        2,414
xpmem_detach()       19,183
xpmem_release()         693

POSIX Shmem                   Cycles
Sender      shm_open()        22,294
            ftruncate()        4,080
            mmap()             5,553
            close()            6,017
Receiver    shm_open()        13,522
            mmap()            16,232
            close()           16,746
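For context, the sender-side POSIX shmem sequence timed above looks roughly like the following sketch (the object name "/pip_demo" is illustrative and error handling is minimal; not code from the paper). The receiver performs shm_open(), mmap(), and close() on the same name.

#include <fcntl.h>      /* O_* flags */
#include <sys/mman.h>   /* shm_open(), mmap(), munmap(), shm_unlink() */
#include <unistd.h>     /* ftruncate(), close() */

#define REGION_SIZE (2UL << 30)   /* 2 GiB region, as in the microbenchmark */

int main(void) {
    /* create (or open) the named shared-memory object */
    int fd = shm_open("/pip_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return 1;
    ftruncate(fd, REGION_SIZE);                          /* set its size        */
    void *buf = mmap(NULL, REGION_SIZE,                  /* map it              */
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                           /* fd no longer needed */
    /* ... initialize buf; the receiver maps the same object and reads it ... */
    munmap(buf, REGION_SIZE);
    shm_unlink("/pip_demo");
    return 0;    /* build: gcc shmem_sender.c  (add -lrt on older glibc) */
}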

6.2 Page Fault Overhead

Figure 4 shows the time series of each access, using the same microbenchmark program as in the preceding subsection. Element access was strided by 64 bytes so that each cache block was accessed only once, to eliminate the cache-block effect. In the XPMEM case, the mmap()ed region was attached by using the XPMEM functions. The upper-left graph in this figure shows the time series using POSIX shmem and XPMEM, and the lower-left graph shows the time series using PiP. Both graphs on the left-hand side show spikes at every 4 KiB. Because of space limitations we do not show two separate graphs, but the spike heights of XPMEM are higher than those of POSIX shmem. The spike heights of the PiP process mode and PiP thread mode are almost the same. However, the spikes of XPMEM and POSIX shmem are much higher than those of PiP.

(Figure 4 panels: access time [ticks] vs. array element byte offset (0 to 16,384), for Shmem and XPMEM with 4 KiB pages, XPMEM with the HugeTLB page size, and PiP:process / PiP:thread.)

Figure 4. Access time series of an array (Wallaby/Linux).

XPMEM and POSIX shmem can be categorized as memory-mapping techniques, and a page fault (PF) happens every time a new memory page is accessed. The spikes in PiP are the time spent on translation lookaside buffer (TLB) misses, not PFs. In PiP, the whole array is touched at the time of initialization by the root (sender) task, and all required page-table entries are created then.

We ran the same benchmark program but using HugeTLB this time (graphs on the right-hand side). POSIX shmem cannot handle HugeTLB on this Linux kernel. XPMEM again shows huge spikes at every 4 KiB page boundary. We consulted the XPMEM device driver source code (version 2.6.4) and found that the XPMEM driver can create only 4 KiB-page PT entries, regardless of the page size of the target region. In PiP, no TLB-miss spikes are found this time because of using one 2 GiB page.

(Slide: enlarged access-time graphs (Xeon/Linux) for Shmem/XPMEM with 4 KiB pages, XPMEM with a 2 MiB page, and PiP:process / PiP:thread, annotated: PiP takes less than 100 clocks !!)
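The strided-access measurement described in Section 6.2 above can be sketched as follows (an assumed reconstruction, not the paper's actual code: x86 rdtsc timing, 64-byte stride so each cache line is touched exactly once):

#include <stdint.h>
#include <stddef.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* buf: the shared or attached array, len bytes; ticks: per-access times */
long walk(const int *buf, size_t len, uint64_t *ticks) {
    long sum = 0;
    for (size_t off = 0; off < len; off += 64) {
        uint64_t t0 = rdtsc();
        sum += buf[off / sizeof(int)];     /* first touch of this cache line  */
        ticks[off / 64] = rdtsc() - t0;    /* spike => page fault or TLB miss */
    }
    return sum;
}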

Page 12: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Process in Process (PiP)

• dlmopen() (not a typo of dlopen())
  • loads a program into a new name space
  • the same variable "foo" can have multiple instances with different addresses

• Position Independent Executable (PIE)
  • PIE programs can be loaded at any location

• Combine dlmopen() and PIE (see the sketch below)
  • load a PIE program with dlmopen()
  • we can privatize variables within the same address space

12
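A minimal sketch of this dlmopen()+PIE trick (not the actual PiP implementation; whether a given Glibc accepts an executable here may vary). The PIE "./foo" is assumed to be built with something like gcc -fpie -pie -rdynamic and to export a global int x:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Load two independent instances of the same PIE program,
       each into its own link-map name space (LM_ID_NEWLM).    */
    void *h0 = dlmopen(LM_ID_NEWLM, "./foo", RTLD_NOW);
    void *h1 = dlmopen(LM_ID_NEWLM, "./foo", RTLD_NOW);
    if (h0 == NULL || h1 == NULL) {
        fprintf(stderr, "dlmopen: %s\n", dlerror());
        return 1;
    }
    /* The "same" global variable now has two instances at different
       addresses, both living in this single address space.          */
    int *x0 = (int *)dlsym(h0, "x");
    int *x1 = (int *)dlsym(h1, "x");
    printf("x instance 0 at %p, instance 1 at %p\n", (void *)x0, (void *)x1);
    return 0;    /* build: gcc main.c -ldl */
}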

Page 13: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Glibc Issue

• In the current Glibc, dlmopen() can create only up to 16 name spaces (see the probe sketch below)
  • Each PiP task requires one name space to hold its privatized variables
  • Many-core architectures can run more than 16 PiP tasks, up to the number of CPU cores

• A Glibc patch is also provided to allow more name spaces, in case 16 is not enough
  • Changes the size of the name space table
  • Currently 260 PiP tasks can be created

• Some workaround code can be found in the PiP library code

13
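The name-space limit can be observed with a small probe like this (a hypothetical test program; "./foo" is any PIE or shared object):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    /* Keep creating new link-map name spaces until Glibc refuses;
       an unpatched Glibc stops after a small fixed number (16, per the slide). */
    while (n < 1024 && dlmopen(LM_ID_NEWLM, "./foo", RTLD_NOW) != NULL)
        n++;
    printf("created %d name spaces before failure: %s\n", n, dlerror());
    return 0;    /* build: gcc probe.c -ldl */
}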

Page 14: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

PiP Showcases

14

Page 15: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 1 : MPI pt2pt

• Current Eager/Rndv.: 2 copies
• PiP Rndv.: 1 copy

15

(Figure, Xeon/Linux: bandwidth (MB/s) vs. message size (bytes), comparing eager-2copy, rndv-2copy, and PiP (rndv-1copy). PiP is 3.5x faster @ 128KB.)

Page 16: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 2 : MPI DDT

• Derived Data Type (DDT) Communication (see the datatype sketch at the end of this slide)
  • Non-contiguous data transfer
  • Current: pack -> send -> unpack (3 copies)
  • PiP: non-contig send (1 copy)

16

(Figure, Xeon/Linux: normalized time (eager-2copy = base) vs. count of double elements in the X,Y,Z dimensions, from 64K,16,128 down to 64,16K,128, comparing eager-2copy (base), rndv-2copy, and PiP non-contig vec.)
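For illustration, a non-contiguous layout like the ones above can be described with an MPI derived datatype and sent in a single call; this sketch (dimensions and function name are made up, not the benchmark code) builds a strided vector of doubles:

#include <mpi.h>

/* Send one plane of an nx x ny x nz grid of doubles: ny*nz elements,
   each separated by a stride of nx doubles (illustrative layout).    */
void send_plane(double *grid, int nx, int ny, int nz, int dest) {
    MPI_Datatype plane;
    MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &plane);
    MPI_Type_commit(&plane);

    /* With eager/rndv-2copy this is packed, sent, and unpacked (3 copies);
       with address-space sharing the receiver can copy the strided
       elements directly from the sender's buffer (1 copy).              */
    MPI_Send(grid, 1, plane, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&plane);
}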

Page 17: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 3 : MPI_Win_allocate_shared (1/2)

17

MPI Implementation

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    ...
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, comm, &mem, &win);
    ...
    MPI_Win_shared_query(win, north, &sz, &dsp_unit, &northptr);
    MPI_Win_shared_query(win, south, &sz, &dsp_unit, &southptr);
    MPI_Win_shared_query(win, east,  &sz, &dsp_unit, &eastptr);
    MPI_Win_shared_query(win, west,  &sz, &dsp_unit, &westptr);
    ...
    MPI_Win_lock_all(0, win);
    for (int iter = 0; iter < niters; ++iter) {
        MPI_Win_sync(win);
        MPI_Barrier(shmcomm);
        /* stencil computation */
    }
    MPI_Win_unlock_all(win);
    ...
}

PiP Implementation

int main(int argc, char **argv) {
    pip_init(&pipid, &p, NULL, 0);
    ...
    mem = malloc(size);
    ...
    pip_get_addr(north, "mem", &northptr);
    pip_get_addr(south, "mem", &southptr);
    pip_get_addr(east,  "mem", &eastptr);
    pip_get_addr(west,  "mem", &westptr);
    ...
    for (int iter = 0; iter < niters; ++iter) {
        pip_barrier(p);
        ...
        /* stencil computation */
    }
    ...
    pip_fin();
}

Page 18: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 3 : MPI_Win_allocate_shared (2/2)

18

(Figures: 5P stencil (4K x 4K) on KNL. Total page-table size [KiB] and PT size as a percentage of the array size (MPI) vs. number of tasks, and total number of page faults vs. number of tasks, comparing PiP and MPI.)

Page 19: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 4 : In Situ

19

(Diagram, original SHMEM-based in situ: the LAMMPS process dumps and copies its data into a pre-allocated shared buffer (copy-in); the in situ process copies the data out, gathers the data chunks, and runs the analysis.)

(Diagram, PiP-based in situ: the LAMMPS process only dumps its data; the in situ process copies the data out, gathers the data chunks, and runs the analysis. No copy-in is needed.)

(Figure, Xeon/Linux: slowdown ratio (relative to running without in situ) for the LAMMPS 3d Lennard-Jones melt with problem sizes (xx, yy, zz) from (4,4,4) to (12,12,12), comparing POSIX shmem and PiP.)

LAMMPS in situ: POSIX shmem vs. PiP on Xeon + Linux

• The LAMMPS process ran with four OpenMP threads
• The in situ process ran with a single thread
• O(N^2) computation cost >> data transfer cost at (12,12,12)

Page 20: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 5 : SNAP

20

(Figure: SNAP solve time (s) and PiP-vs-threads speedup vs. number of cores (16 to 1024), comparing MPICH/Threads and MPICH/PiP. Solve times fall from 683.3 s to 83.3 s with MPICH/Threads and from 430.5 s to 22.0 s with MPICH/PiP; the speedup grows from 1.6x at 16 cores to 3.8x at 1024 cores.)

PiP vs. threads in hybrid MPI+X SNAP strong scaling on OFP (1-16 nodes, flat mode).

• (MPI + OpenMP) vs. (MPI + PiP)

Page 21: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Showcase 5 : Using PiP in Hybrid MPI + "X" as the "X" (2)

21

• PiP-based parallelism
  – Easy application data sharing across cores
  – No multithreading safety overhead
  – Naturally utilizing multiple network ports

(Diagram: each PiP task has its own MPI stack and network port while the application data is shared.)

(Figures: message rate (KMessages/s) vs. message size (1 B to 4 M) for 1, 4, 16, and 64 pairs, between PiP tasks and between threads; and the SNAP solve time / speedup chart from the previous slide.)

Multi-pair message rate (osu_mbw_mr) between two OFP nodes (Xeon Phi + Linux, flat mode).

PiP vs. threads in hybrid MPI+X SNAP strong scaling on OFP (1-16 nodes, flat mode).

Page 22: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Research Collaboration

• ANL (Dr. Pavan and Dr. Min) — DOE-MEXT
  • MPICH
• UT/ICL (Prof. Bosilca)
  • Open MPI
• CEA (Dr. Pérache) — CEA-RIKEN
  • MPC
• UIUC (Prof. Kale) — JLESC
  • AMPI
• Intel (Dr. Dayal)
  • In Situ

22

Page 23: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Summary

• Process in Process (PiP)
  • New implementation of the 3rd execution model
  • Better than memory-mapping techniques

• PiP is portable and practical because of its user-level implementation
  • Can run on the K and OFP supercomputers

• Showcases prove PiP can improve performance

23

Page 24: New Process/Thread Runtime

Arm HPC Workshop@Akihabara 2017

Final words

• The Glibc issues will be reported to Red Hat

• We are seeking PiP applications not only in HPC but also in enterprise

24