r. dementiev and p. sanders: asynchronous parallel disk sorting 1 asynchronous parallel disk sorting...
TRANSCRIPT
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting1
Asynchronous Parallel Disk Sorting
Roman Dementiev and Peter Sanders
Max-Planck-Institut für Informatik,Saarbrücken, Germany
Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting2
Motivation & Related Work Sorting is important for EM problems Disks are cheaper than computers Prefetch buffers for load disk balancing and overlapping have been
studied but no results which guarantee– overlapping during merging of runs– asymptotically optimal I/O volume
Here: algorithm & implementation– I/O cost lower bound – guarantees almost perfect overlapping between I/O and computation
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting3
Motivation & Related Work [ChaudhryCormen’01’02] impl. of distributed memory, parallel disk Theory: Implementations: Differences:
– Optimal prefetching algorithm– Overlapping of I/O and computation– Completely asynchronous implementation– Part of <stxxl> library, compatible with STL. Use as simple as:
stxxl::vector<my_type> v;long long int i=vec_size; // ‘long long int’ is a 64-bit data typewhile(i--)
v.push_back(f(i)); // fill vector with some valuesstxxl::sort(v.begin(), v.end()); // sort// check order using STL predicateassert(std::is_sorted(v.begin(), v.end()));
[BarvGrovVit’97] [SandEgnKorst’00] [HutchVitt’01] [HutchSandVitt’02]
Here[DementievSanders’03]
[BarveVitter’99]
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting4
External Multi-way Merge Sortru
n 0
input:
O(M)
chunk 0chunk 1 chunk 2 chunk 3chunk 4
run
1ru
n 2
run
3chunk 5 chunk 6 …
Bbuffers B B B
k=O(M/B)
k-mergerru
n 4
run
5ru
n 6
run
7
B B B B
k-merger
chunk 7chunk 8
k
run
012
3
Bru
n 45
67
B
buffers
…
These algorithms do not guarantee overlapping
D w
rite
bu
ffe
rs
1234
k
… .. mer
ge
elements
D blocks
me
rge
b
uff
ers
merging
D p
refe
tch
bu
ffe
rsD-head disk
D-head model
N – input size
M – main memory size
B – block size
D – number of disks
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting5
Multi-way Merge Sort with Overlapped I/Os
Theorem 1:Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time
where is the time needed for run formation,
is the time needed for merging k=(M/B) : # sequences,
k’ = O(N/M): total # runs.
mBMformruns TkT 'log )/(
upstartOIsortformruns TNTk
NTkT
/,
''max
upstartintmergeOIm TNTNTT ,max /
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting6
Overlapping Run Formation
Easy:
Corollary 2:Corollary 2: Input of size N => sorted runs of size ~ M/2 in time
2M in the paper
time
read
sort
write
read
sort
write
1 2
1 1
2
3
3
3
4
2
4
4
k-1
k-1
k-1
k
k
k
...
...
...
1 2
1
1
2
3
3
2
4
4
3 4
k-1
k-1
k-1
k
k
k
I/O bound
case
compute
bound case
startupOIsort TNTM
NMT
/,2
2max
run numbers
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting7
Simple External Merging Smallest element of a block is a trigger
k sorted runs….merger
sort pairs <trigger,block_id>
prediction sequence :
….
k-merger
….
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting8
Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy:
1B-12 3B-14 5B-16 …
1B-12 3B-14 5B-16 …
1B-12 3B-14 5B-16 …
k…
merger
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting9
Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy:
Solution:
2 3B-14 5B-16 …
2 3B-14 5B-16 …
2 3B-14 5B-16 …
k…
merger
D w
rite
bu
ffe
rs
1234
k
… .. mer
ge
elements
D blocks
me
rge
b
uff
ers
merging
D p
refe
tch
bu
ffe
rsD-head disk
2D
writ
e b
uff
ers
1234
k
… .. mer
ge
elements
D blocks
me
rge
b
uff
ers
merging
D p
refe
tch
bu
ffe
rsD-head disk
k+3Doverlapbuffers
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting10
Overlapping I/O and Merging [contd.] Theorem 4:Theorem 4: Merging k sequences with a total of N’ elements takes
time
– ℓ: time to produce one element of output – L: time to input or output D arbitrary blocks
Our I/O thread strategy: DB elements in write buffer => output step– < DB elements in write buffer AND D overlap buffers avail. => input step
To prove Theorem 4 we distinguish two cases– I/O bound case: 2L DBℓ– Compute bound case: 2L< DBℓ
startupTNDB
LN
','2
max
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting11
Compute Bound Case Lemma 6:Lemma 6: In compute bound case
after k/D+1 steps, the merging thread never blocks until all elements are merged.
Notation: # elem. in write buffer– r # of elem. in the overlap and
merge buffers Our I/O thread strategy:
DB elements in write buffer => output
step– < DB elements in write buffer AND D
overlap buffers avail. => input step
r
kB
kB+DB
kB+2DB
kB+3DB
kB+y L/ℓ
kB+2 L/ℓ
0 DB 2DB- L/ℓ 2DB
merge thread blocked
I/O blocking
fetch output
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting12
I/O Bound Case Lemma 7:Lemma 7: In I/O bound case the I/O
thread never blocks until all input blocks are fetched.
Our I/O thread strategy (the same): DB elements in write buffer
=> output step
– < DB elements in write buffer AND D overlap buffers avail. => input step
Proof: Proof: similar to compute bound case
r
blocking
fetch
output
DB- L/ℓ
DB 2DB- L/ℓ
2DB
kB+3DB
kB+2DB+ L/ℓ
kB+2DB
kB+DB+ L/ℓ
kB+DB
0kB+ L/ℓ
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting13
Multi-head Model Multi-disk Model Randomized Cyclic Allocation [VitHut’01] makes accesses in
balanced
Optimal prefetching from [HutchSanVitt’01,SanEgnKor’00] Corollary 8:Corollary 8: For any > 0, prefetch buffer of size m=(D/) merging k
sequences with a total N’ elements can be implemented to run in time
startupTNDB
LN
',1
'2max
…………
-ping
2D
writ
e b
uff
ers
…
O(D
) pr
efet
ch b
uffe
rs
k+O(D)overlap buffers
1
234
k
… .. mer
ge
elements
read buffers
D blocks
me
rge
bu
ffe
rs
merging
overlap-disk scheduling
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting14
Implementation Issues Implementation (part of <stxxl> library)
– Forming runs: Key sorting, efficient two passes MSD radix sort– Multi-way merging: [Sanders’00]– prefetch buffer + overlap buffer = read buffer– Asynchronous I/O (POSIX threads)– No superfluous copying
User blocks are transferred by DMA (O_DIRECT flag) Buffers are passed by pointer between pools
Applications
Sorting, Priority queues,Search trees
OS
Caching, Mem. managementDisk scheduling
POSIX threads,File system
STL + more
<stxxl> s
tru
ctu
re
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting15
Hardware
3000 Euro, 375 MByte/s, July 2002
PCIBusses
Controller
Channels
2x Xeon
2x64x66 Mb/s
4 Threads
E7500 12
8
400x64 Mb/s
MB/s4x2x100
8x80GBMB/s
Chipset
Intel
8x45
RAMDDR1 GB
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting16
Single Disk Performance LEDA-SM, TPIE 2 GB volume, 32-bit keys, runs of size
256 MB, g++ 3.2 –O6 I/O bandwidth 45.4 Mbyte/s = 95% of
peak bandwidth of the disk
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting17
Multiple Disk Performance 2 GB volume, 32-bit keys, runs of size
256 MB, g++ 3.2 –O6 TPIE, LEDA-SM
– Linux Soft-RAID 8X128KByte stripes
Bandwidth of 315 MByte/s
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting18
Element Size 16GB volume, 32-bit keys, runs of
size 256 MB, g++ 3.2 –O6 64 merging is I/O bound 128 run formation is I/O bound Small elements require special
treatment
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting19
Optimal Prefetching
2D
writ
e b
uff
ers
…
k+O(D)overlap buffers
1234
k
… .. mer
ge
read buffers
D blocks
me
rge
bu
ffe
rs
merging
2D
writ
e b
uff
ers
…
O(D
) pr
efet
ch b
uffe
rs
k+O(D)overlap buffers
1234
k
… .. mer
ge
read buffers
D blocks
me
rge
bu
ffe
rs
merging
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting20
Block Size Block sizes of several MB => good
performance Leave room for read and write
buffers Naive striping implies factor 8 block
size reduction Dramatic run time increase
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting21
Large Inputs One-pass, curves go up because of
– Slower zones– Seek times
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting22
Summary Parallel disk sorting with high performance on state of art hardware
with theoretical performance guarantees Bandwidth is no longer a limiting factor for external memory
algorithms Price performance ratio can improve by adding disks
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting23
?