

External Memory Parallel Sorting by Sampling

Yongzheng Zhang
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/∼yongzhen

Rui Zhang
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/∼rzhang

Abstract

This paper introduces an external memory parallel sorting algorithm for a multiprocessor architecture. The overall goal is to choose p − 1 partitioning elements so that the final p sorted files, one per processor, are of roughly equal size. The algorithm first determines a sample of splitters by either regular sampling or random sampling. Then the data file at each processor is partitioned according to the final splitters, and the sublists are redistributed to the appropriate processors. Finally, each processor sorts incoming records into runs and merges the sorted runs into a fully sorted file. We implemented our algorithm in C with the MPI package and tested its performance on both a cluster of Sun Solaris workstations and a Linux cluster, CGM1. The results indicate that regular sampling provides better performance than random sampling.

1 Introduction

The classical problem of sorting and related processing is universally acknowledged to be important and fundamental in computing [3, 5], because it accounts for a significant percentage of computing resources in large-scale applications [9, 12], and because sorting is an important paradigm in the design of efficient external memory algorithms. The substantial access gap between internal memory and disk memory is growing rapidly, since the latency and bandwidth of memory chips are improving more quickly than those of disks, and the use of parallel processors widens the gap further [16]. This gap creates a performance bottleneck. In light of this, the specific problem of external memory sorting assumes particular importance [3, 5].

In external memory sorting, data sets are often too massive to fit completely inside the computer's internal memory, so the data items are typically stored on disks. The resulting input/output communication (I/O) between fast internal memory and slower external memory can be a major performance bottleneck [16], as I/O is a fundamental and frequently used operation during the sorting process [3].

External memory algorithms explicitly control data placement and movement, so it is very important for algorithm designers to have a simple but reasonably accurate model of the memory system's characteristics. Vitter and Shriver introduced the commonly used parallel disk model (PDM) [15] to capture the main properties of multiple disk systems:

• N = problem size (in units of data items),


• M = internal memory size (in units of data items),

• B = block transfer size (in units of data items),

• D = number of independent disk drives, and

• p = number of processors.

where M < N and 1 < DB < M/2. All data items are assumed to be of fixed length. In a single I/O, each of the D disks can simultaneously transfer a block of B contiguous data items. PDM provides an elegant and reasonably accurate model for analyzing the relative performance of external memory algorithms and data structures [16].

The key to achieving efficient I/O performance in external memory applications is to design the application to access its data with a high degree of locality [16]. In our project, we assume D = p, which means each processor holds exactly one independent disk (more precisely, one big file). We also assume that the work of the individual nodes of the machine (including I/O, CPU, and network) is done completely in parallel with that of the other nodes. These assumptions allow us to take the time for a single processor to perform its segment of the computation as the total time for the multiprocessor to perform the full computation [6].

At the top level, our algorithm works as follows:

• Determine a sample S with p − 1 records s1, s2, ..., sp−1, such that in the final sorted order, all records on processor P1 have sorting key values less than or equal to s1, all records on processor P2 have sorting key values greater than s1 but less than or equal to s2, and so on, until all records on processor Pp have sorting key values greater than sp−1.

• Based on the sample, all processors redistribute each block of records in the file so that each record ends up at the appropriate processor. After the redistribution, each processor sorts the local records and saves them in a temporary file.

• Each processor merges its local temporary files into a final sorted file.
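As an illustration of the redistribution step, the following minimal C sketch (the helper name `dest_processor` is ours, not part of the paper's implementation) maps a record's sorting key to its destination processor by binary search over the p − 1 splitters:

```c
/* Given the sorted splitters s[0..p-2], return the destination processor
 * for a record with sorting key `key`: processor 0 receives keys <= s[0],
 * processor i receives s[i-1] < key <= s[i], and processor p-1 receives
 * keys > s[p-2].  Binary search makes this O(log p) per record. */
int dest_processor(int key, const int *s, int p) {
    int lo = 0, hi = p - 1;            /* candidate processors 0 .. p-1 */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (key <= s[mid])
            hi = mid;                  /* key fits at or before processor mid */
        else
            lo = mid + 1;
    }
    return lo;
}
```

With splitters {10, 20, 30} and p = 4, key 10 routes to processor 0, key 11 to processor 1, and key 31 to processor 3.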

The remainder of this paper is organized as follows: first we review the relevant literature in the area of external memory parallel sorting, then we present our main algorithms, and finally we show the experimental results.

2 Review of the Relevant Literature

Over the last few decades, there has been a great deal of work on sorting, because sorting and sorting-like operations account for a significant percentage of computer use [16, 9]. Many recently developed external memory sorting algorithms use disks independently. These algorithms are based upon the distribution and merge paradigms, the two generic approaches to sorting [16].

Distribution sort is a recursive process in which we determine a set of p − 1 partitioning elements to partition all the items into p disjoint buckets. All the items in one bucket precede all the items in the next bucket. We complete the sorting by recursively sorting the individual buckets and concatenating them to form a single fully sorted list [16]. The requirement here is that we choose the p − 1 partitioning elements so that the buckets are of roughly equal size, i.e., the load balance is good. Much work has been done in this area.

DeWitt et al. [6] consider external memory parallel sorting in a shared-nothing multiprocessor. The critical step is to determine the range of sorting keys to be handled by each processor. Two techniques for determining these ranges are introduced: probabilistic splitting, which uses sampling to estimate quantiles, and exact splitting, which uses a parallel version of the algorithm proposed by Iyer et al. [8]. The first step of the exact splitting algorithm requires that each processor fully sort its file fragment, producing p sorted runs, one per processor. It does not work efficiently when the size of the file segment is much larger than the internal memory size.

Quinn [13] has suggested implementing a parallel quicksort. The algorithm chooses p − 1 partitioning elements at random and runs the recursive quicksorts in parallel. Huang and Chow [7] consider external memory parallel sorting using approximate partitioning based upon sampling.

The performance of distribution sort primarily depends on how evenly the data can be partitioned into smaller ordered buckets. Unfortunately, no general, effective method is currently available, and it is an open question how to achieve linear speedup for parallel sorting on multiprocessors [14].

The merge sort paradigm is somewhat orthogonal to the distribution paradigm [16]. A typical merge sort algorithm consists of two main phases. In the "run formation" phase, a set of data blocks is scanned, one memory load at a time. Each memory load is sorted into a single "run", which is then written onto the disks. After the initial runs are formed, the merging phase begins. In each pass of the merging phase, data items in a group of runs are merged as they stream through the internal memory. Merge sort algorithms (e.g., [1, 3, 4]) perform well only with a small number of processors [14].
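The merging phase can be illustrated with a small in-memory C sketch (the helper `merge_runs` is ours; a real external merger would stream blocks from disk and use a selection tree or heap rather than a linear scan of run heads):

```c
#include <stddef.h>

/* Merge k sorted runs into `out` and return the number of items written.
 * runs[i] points to a sorted array of len[i] items; pos[i] tracks the
 * current head of run i.  For clarity this scans all k heads per output
 * item (O(k)); a production merger would use a heap for O(log k). */
size_t merge_runs(const int **runs, const size_t *len, int k, int *out) {
    size_t pos[64] = {0};              /* sketch assumes k <= 64 */
    size_t n = 0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < k; i++)
            if (pos[i] < len[i] &&
                (best < 0 || runs[i][pos[i]] < runs[best][pos[best]]))
                best = i;
        if (best < 0)
            return n;                  /* every run is exhausted */
        out[n++] = runs[best][pos[best]++];
    }
}
```

Merging the runs {1, 4, 7}, {2, 5}, and {3, 6, 8} this way yields the fully sorted sequence 1 through 8.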

Chaudhry et al. [5] introduce an efficient algorithm based on Leighton's columnsort algorithm [11]. It sorts N numbers, which are treated as an r × s matrix, where N = rs, s is a divisor of r, and r ≥ 2(s − 1)². When columnsort completes, the matrix is sorted in column-major order. That is, each column is sorted, and the keys in each column are no larger than the keys in the columns to its right.

3 Parallel Sorting Algorithms

Our overall goal is to choose p − 1 partitioning elements so that the final p sorted files are of roughly equal size, i.e., the load balance is good. We used two sampling techniques to select a sample of splitters: regular sampling and random sampling. Regular sampling selects splitters at equal intervals, while random sampling selects a certain number of pivots at random.

3.1 Main Algorithm

Our approach mainly derives from the work by Shi et al. [14], an internal memory parallel sorting algorithm based on regular sampling. We also applied the random sampling technique introduced by Quinn [13]. In this project we consider the problem of external memory parallel sorting in a distributed memory system. In this multiprocessor architecture, each processor has its own memory and an independent disk (more precisely, a big file), and all communication among processors must happen through an interconnection network.


In this sorting scenario, we have a data file with N integer numbers created with a data generator. Each processor holds a disk file with N/p unsorted records. At the termination of the sorting algorithm, the files have been partitioned, redistributed, and merged into approximately equal sized, non-overlapping sorted files, which must again be on disks, one at each processor. In more detail, our algorithm can be described as follows:

/* Input: original file list F = f1, f2, ..., fp (total size N, where p is the number of processors). Processor Pi holds N/p unsorted data items stored in file fi (1 ≤ i ≤ p). Output: sorted file list F′ = f′1, f′2, ..., f′p, where all records in f′i are less than or equal to those in f′i+1 (1 ≤ i ≤ p − 1). */

1. Each processor Pi samples its disk-resident file fi: it reads all data elements block by block (B = 128K) and selects p − 1 pivots at equal intervals (or at random) from each block to form the set of splitters Si. The size of Si is therefore N(p − 1)/(Bp).

2. The coordinating processor P1 gathers the unsorted samples Si from all other processors. P1 then sorts the union of these samples to form a regular sample S′ of size N(p − 1)/B. The final sample S, consisting of p − 1 elements taken at equal intervals from S′, is then broadcast to all other processors.

3. Each processor Pi reads and sorts data items block by block from its local file fi, and redistributes the records to the appropriate processors using the final sample S. When a processor's memory has been filled with incoming records, the processor sorts these records, writes the sorted run onto disk as a temporary file, and continues reading incoming records.

4. In parallel, the processors merge the sorted runs (that is, the temporary files) and write the result back onto disk as the final sorted file f′i.

Note that during the sampling step, if we select pivots by regular sampling, then each data block has to be sorted first. This additional sorting time can be saved in an alternative way: write the sorted blocks into local temporary files, and in the third step have each processor read these files (sorted blocks) rather than the unsorted blocks of the original data file. This way, each data block of the original file has to be sorted only once.
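The per-block pivot selection of step 1 can be sketched as follows (a hypothetical C helper, assuming one sorted block of B items is held in memory; with B = 128K and p = 16 it would return 15 pivots):

```c
#include <stddef.h>

/* Regular sampling over one sorted block of B items: select p-1 pivots
 * at equal intervals, i.e. the items at positions (j+1)*B/p - 1 for
 * j = 0 .. p-2. */
void sample_block(const int *block, size_t B, int p, int *pivots) {
    for (int j = 0; j < p - 1; j++)
        pivots[j] = block[(size_t)(j + 1) * B / (size_t)p - 1];
}
```

For example, sampling the sorted block {1, 2, 3, 4, 5, 6, 7, 8} with p = 4 yields the pivots 2, 4, and 6.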

3.2 Parallel Disk Model Parameters

The Parallel Disk Model (PDM) is an elegant and reasonably accurate model for analyzing the relative performance of external memory parallel algorithms. The PDM parameters of our project are as follows:

• N = 16M , problem size (in units of data items),

• M = 8M , internal memory size (in units of data items),

• B = 128K, optimal block transfer size (in units of data items),

• D = p = 16, maximum number of processors (disk files).

All data items are assumed to be of a fixed length of 4 bytes. This means the problem size is 64MB and the internal memory size is 32MB. For example, when 16 processors sort the 16M data items (1M data items per processor) in parallel, the first two steps in the algorithm above look like this. First, each processor can simultaneously transfer a block of 128K contiguous data items, giving a total of 1M/128K = 8 local blocks. From each block, 15 pivots are selected at equal intervals or at random, leading to a set of 15 × 8 = 120 splitters. Next, the coordinating processor P1 gathers all 16 sets of splitters and forms the regular sample of size 120 × 16 = 1920. Then the final sample of 15 records is determined at equal intervals.

If we fix the optimal block size B = 128K (as shown in the next section) and the internal memory size M = 8M, then in order for the regular sample of N(p − 1)/B splitters to fit completely on the coordinating processor (suppose it takes one fourth of the internal memory), the total problem size can be as large as N = (8M/4) × 128K/(16 − 1) ≈ 17.1G data items. This means we can reasonably assume that the whole regular sample fits on one processor.
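The 17.1G figure can be checked directly: requiring the regular sample to occupy at most one fourth of internal memory gives

```latex
\frac{N(p-1)}{B} \le \frac{M}{4}
\quad\Longrightarrow\quad
N \le \frac{MB}{4(p-1)}
    = \frac{2^{23}\cdot 2^{17}}{4\cdot 15}
    = \frac{2^{40}}{60}
    \approx 1.83\times 10^{10} \approx 17.1\text{G data items},
```

using M = 8M = 2²³ items, B = 128K = 2¹⁷ items, and p = 16.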

4 Performance Evaluation

In this section we discuss the experimental examination of our algorithm. We first discuss our experimental methodology and then present the performance results obtained.

4.1 Experimental Setup and Methodology

In order to investigate the performance of our external memory parallel sorting algorithm based on regular sampling and random sampling, we implemented the algorithm using C and the MPI communication library [2]. We also used a data generator which can create data sets of various sizes and distributions, from uniform to skewed data created via ZIPF distributions.

We constructed a parallel machine for preliminary experiments. It consists of 16 UltraSPARC-III processors running Sun Solaris 8; each processor has 64MB of local memory and a 10M Ethernet card. We did all final experiments on CGM1, a commercial-grade, 32-board, dual-processor Linux cluster. In total there are 64 Xeon 1.8 GHz chips (2 per board), 32 GB of distributed memory (1 GB per board), and 2.56 TB of distributed external memory (two 40 GB IDE disks per board), connected by a 100 Mb (Fast Ethernet) interconnect.

All sequential times were measured as wall clock times in seconds, running on each processor. All parallel times were measured as the wall clock time between the start of the first process and the termination of the last process. These times include all I/O. Furthermore, all wall clock times were measured with no user other than us on CGM1. Instead of running applications directly from the command line, CGM1 uses a batch submission infrastructure that guarantees fair, efficient use of resources. A queuing system known as OpenPBS (Portable Batch System) monitors all submission requests and takes care of running the applications and returning the results.

In our project, we did a set of experiments (each test was repeated four times in order to eliminate the influence of varying run-time characteristics) to evaluate the performance of our algorithm in the following steps:

1. Optimal block size: we executed our algorithm with a fixed data size and number of processors but various block sizes to determine which block size achieves the shortest execution time.


2. Speedup experiments: we executed our algorithm on a single processor of the CGM1 machine and measured the sequential wall clock time. Then we executed our algorithm on up to 16 processors of the CGM1 machine and measured the parallel wall clock time.

3. Load balance experiments: we executed our algorithm with the regular sampling and random sampling techniques on up to 16 processors of the CGM1 machine and evaluated the load balance performance in each case.

4. Scaleup experiments: we executed our algorithm with regular sampling on 2 to 16 processors of the CGM1 machine and evaluated the scalability of our algorithm.

5. Sizeup experiments: we executed our algorithm with regular sampling on 4 processors of the CGM1 machine and evaluated the sizeup performance of our algorithm.

4.2 Performance Results

Typically there are four metrics for evaluating the performance of a parallel algorithm: speedup, load balance, scaleup, and sizeup. Speedup is a useful metric because it indicates whether additional processors result in a corresponding decrease in the sorting time. Load balance measures whether all processors are equally loaded. Scaleup indicates whether a constant response time can be maintained when the workload is increased by adding a proportional number of processors and files. For the sizeup evaluation, we instead fix the number of processors and files but increase the number of data elements. Sizeup experiments indicate the growth rate of the execution time as a function of the problem size [5].

4.2.1 Optimal Block Size

The minimum block transfer size imposed by hardware is often 512 bytes, but operating systems generally use a larger block size, such as 16KB [16]. Since the CPU cost of the sorting is independent of the size of each block, producing longer runs tends to reduce the I/O cost while making the sorting CPU bound [6].

It is possible to use larger blocks to reduce the relative significance of seek and rotational latency, but the wall clock time per I/O will increase accordingly. For best results in applications where the data are streamed through internal memory, the block transfer size B in PDM "should be considered to be a fixed hardware parameter a little larger than the track size (say, on the order of 100KB for most disks)" [16].

In order to find the optimal block size for our algorithm on the cluster, we executed our algorithm with a fixed 16M data items and 16 processors, but various block sizes ranging from 8K to 512K, to determine which block size achieves the shortest execution time.

As shown in Table 1, we did seven tests based on block sizes of 8K, 16K, 32K, 64K, 128K, 256K and 512K (in data items), respectively. By calculating the average time in seconds to finish the parallel sorting, we can see that a block size of 128K achieves the shortest sorting time, as indicated in Figure 1.

4.2.2 Speedup Performance Evaluation

For the speedup experiments, we fixed the problem size at 16M data items, while varying the number of processors from 1 to 16. The data items were created using the data generator


B \ Time    Test 1      Test 2      Test 3      Test 4      Average
8K          52.264646   48.213372   47.206085   50.977083   49.665297
16K         36.775604   35.886760   36.835403   36.908217   36.601496
32K         28.911333   28.984640   33.364160   30.182869   30.360751
64K         27.411129   28.084392   31.296957   27.255955   28.512108
128K        26.888327   25.981139   24.743847   26.303148   25.979115
256K        35.347010   29.402780   32.100467   34.629514   32.869943
512K        37.821276   33.811213   36.671432   34.742710   35.761658

Table 1: Optimal block size results

[Figure: plot of average parallel sorting time in seconds against block size (8K–512K data items); the minimum, 25.98 seconds, occurs at 128K.]

Figure 1: Optimal block size


p \ Time    Test 1       Test 2       Test 3       Test 4       Average
1           545.758088   541.650612   533.852716   542.228178   540.872399
2           330.468458   329.891630   331.716016   329.626777   330.425720
4           137.346401   149.410647   149.232407   146.488899   145.619589
8           64.925571    66.414541    63.425187    66.606700    65.343000
16          32.623585    33.415828    31.895733    32.428208    32.590839

Table 2: Speedup results based on regular sampling

with 8 dimensions and mixed cardinalities, varying between 2 and 1000 for the different dimensions.

We executed our algorithm using both the regular sampling and random sampling techniques and measured the wall clock time. Each test on a specific number of processors, based on regular sampling or random sampling, was repeated four times. The sequential wall clock time was measured by executing our algorithm on a single processor of the CGM1 machine. Then we measured the parallel wall clock time by executing our algorithm on multiple processors of the CGM1 machine. The speedup results for regular sampling and random sampling are listed in Table 2 and Table 3, respectively.

p \ Time    Test 1       Test 2       Test 3       Test 4       Average
1           502.206848   503.235764   512.967881   511.132192   507.385671
2           317.178244   325.92796    320.792459   321.111448   321.252528
4           137.362508   134.898542   148.41629    133.53064    138.551995
8           54.106875    59.157725    56.74223     64.577927    58.646189
16          24.984488    25.882996    25.611825    26.544247    25.755889

Table 3: Speedup results based on random sampling

As we can see in Figure 2, adding processors significantly reduces the parallel sorting time. Several factors, such as the communication overhead between processors and the effects of skew in the data items, prevent the system from achieving perfectly linear speedup [6]. The results also show that the random sampling algorithm achieves a shorter sorting time than regular sampling. This is mainly because the regular sampling algorithm incurs additional block-writing time in its second step.

4.2.3 Load Balance Evaluation

For the load balance experiments, we fixed the problem size at 16M data items. We executed our algorithm with 2, 4, 8 and 16 processors, using the regular sampling and random sampling techniques separately, and evaluated how well all processors are load balanced. Table 4 shows the results obtained.

Suppose the size of the final sorted file at processor Pi is xi (1 ≤ i ≤ p). We define the variance V and standard deviation D of the final sizes, and the ratio R of the standard deviation over the mean size N/p, as follows:

V = (1/(p − 1)) [ Σᵢ₌₁ᵖ xᵢ² − (1/p)(Σᵢ₌₁ᵖ xᵢ)² ],  D = √V,  and  R = Dp/N.

The variance is a measure of how spread out a distribution is. It is computed as the average squared deviation of each number from its mean. Standard deviation is the square


[Figure: sorting time in seconds against number of processors (1, 2, 4, 8, 16) for regular and random sampling; both curves fall steeply as processors are added, with random sampling slightly faster throughout.]

Figure 2: Speedup performance by regular sampling versus random sampling

Regular sampling:
  p = 2:   8390945  8386271
  p = 4:   4225662  4182626  4192658  4176270
  p = 8:   2078401  2081212  2097864  2097798  2096230  2081731  2128841  2115139
  p = 16:  1048843  1032523  1047153  1063391  1032683  1049050  1049093  1063900
           1048066  1049547  1064951  1048134  1050843  1047922  1032032  1049085

Random sampling:
  p = 2:   7567499  9209717
  p = 4:   4061470  3783394  4323759  4608593
  p = 8:   1885171  2064980  2081325  2131611  2127547  2079901  2114045  2292636
  p = 16:  997954   868196   998540   885241   1046965  1147108  1130430  1097743
           1048191  1131038  886816   1164203  1163208  1098183  1032663  1080737

Table 4: Size of final sorted data file at each processor


            Regular Sampling          Random Sampling
p     N/p   D           R             D           R
2     8M    0.003152M   0.04%         1.107429M   13.84%
4     4M    0.020949M   0.52%         0.337123M   8.43%
8     2M    0.016803M   0.84%         0.106398M   5.32%
16    1M    0.009588M   0.96%         0.094195M   9.42%

Table 5: Load balance statistics

[Figure: standard deviation of final file sizes, as a percentage of the mean size, against number of processors (2–16) for regular and random sampling; the regular sampling curve stays below 1% while the random sampling curve ranges from about 5% to 14%.]

Figure 3: Load balance performance of regular sampling versus random sampling

root of the variance. It is the most commonly used measure of spread. The standard deviation has proven to be an extremely useful measure of spread, in part because it is mathematically tractable [10]. So in our case, it is reasonable to use the ratio of the standard deviation over the original file size to measure how well all processors are load balanced.

Based on these formulas, we calculated the load balance statistics shown in Table 5. As we can see, the ratio of the standard deviation to the mean file size is mostly below 10%, which indicates good performance.

Figure 3 compares the load balance performance of the two sampling techniques. The load balance of random sampling is much worse than that of regular sampling. This is not surprising, because some random pivots are more representative of the actual data items than others. This means that if we run the sorting algorithm using two different random samples of the same size, the load balance variance will also differ between the two. Increasing the number of pivots per block helps achieve better load balance with random sampling: the more pivots taken, the lower the load balance variance [6]. In the following experiments, we always executed our algorithm with regular sampling.

In order to get an idea of how skewed data affects the load balance performance, we did two tests with regular sampling, one executed on uniform data items and the other on skewed data items created via ZIPF distributions. Both tests have the same problem size of 16M and 16 processors. Table 6 gives the results; the load balance on skewed data (ZIPF α = 10) is worse than that on uniform data (ZIPF α = 0). In order to eliminate the effect of skewed data on load balance, we can select more pivots from each


Uniform data (α = 0)          Skewed data (α = 10)
1048843   1047432             1261133   999790
1047153   1050843             1048043   1015821
1049050   1048066             1015608   1014514
1048991   1048357             1017113   1001079
1049093   1049079             1016234   1031742
1048134   1047936             1014789   1015957
1049547   1032683             1016268   1048636
1047058   1064951             1000418   1260071

D = 0.0057M, R = 0.57%        D = 0.0800M, R = 8.00%

Table 6: Load balance on uniform versus skewed distribution

[Figure: standard deviation of final file sizes (in M data items) against block size (8K–512K); the deviation grows as the block size increases.]

Figure 4: Load balance performance on various block sizes

sorted block to get a better regular sample.

We are also interested in measuring how the data block size affects the load balance. We did seven tests with fixed 16M data items and 16 processors, with the block size varying from 8K to 512K. By calculating the standard deviation of the final file sizes in each test, we found that as the block size increases, the workload tends to become unbalanced, as shown in Figure 4.

The reason this occurs is that when the block size increases, the number of blocks per processor decreases accordingly, reducing the size of the regular sample and leading to higher load balance variance. However, as discussed in the subsection on optimal block size, the algorithm cannot achieve a good sorting time when the block size is small. As the sorting time is a key feature of a parallel sorting algorithm, we have to determine an appropriate block size in order to achieve both a short sorting time and good load balance (though better load balance itself helps reduce the sorting time). In this case, we selected a block size of 128K, whose corresponding standard deviation value is only 0.0857M, below 10% of the mean file size of 1M. This means it is reasonable and acceptable to use 128K as the block size for our algorithm on this cluster.


p     Test 1      Test 2      Test 3      Test 4      Average
2     20.534873   20.501474   20.020725   20.201662   20.314684
4     22.122656   22.169552   22.064002   22.217527   22.143434
6     23.250604   23.134389   23.643126   22.768990   23.199277
8     24.078981   22.317042   22.539703   24.608882   23.386152
10    23.181545   23.210565   23.544121   24.059308   23.498885
12    28.354383   28.284563   28.374335   28.536254   28.387384
14    31.402173   31.475530   31.864976   31.610783   31.588366
16    31.440832   32.044045   33.242405   32.841428   32.392178

Table 7: Scaleup results based on regular sampling

[Figure: sorting time in seconds against number of processors (2–16) with 1M data items per processor; the curve rises only gently, from about 20 to 32 seconds.]

Figure 5: Scaleup performance by regular sampling

4.2.4 Scalability Evaluation

For the scaleup experiments, we varied the number of processors (with their files) from 2 to 16, but fixed the size of the data file (one per processor) at 1M. Thus, with two processors, 2M data items were sorted; at 16 processors, the total size of all data items was 16M. Each test was repeated four times. Table 7 presents the scaleup results we obtained.

Ideally the algorithm should exhibit a constant sorting time when the problem size and hardware configuration are scaled together. Figure 5 indicates that the resulting sorting time at different scales stays basically constant within an acceptable range. The slight increase in sorting time is mainly due to the communication overhead among processors: when additional processors are added, the number of final pivots increases accordingly, so there is more communication among processors in order to distribute the sublists separated by the final pivots.

4.2.5 Sizeup Evaluation

For the sizeup experiments, we fixed the number of processors at 4, but the data file size ateach processor varies from 1M to 4M . As shown in Table 8, each experiment was repeated


four times and the average sorting time in seconds was calculated. The results indicate that our algorithm exhibits superlinear sizeup; that is, sorting 4M data items took more than four times as long as sorting 1M data items.

Size  Test 1       Test 2       Test 3       Test 4       Average      Ratio
1M    30.428702    29.363398    29.119551    29.035474    29.486781    1.00
2M    70.268822    67.676471    71.204376    68.469696    69.404841    2.35
3M    101.707116   110.422570   106.793636   108.374601   106.824481   3.62
4M    141.603099   149.225022   143.326049   150.140922   146.073773   4.95

Table 8: Sizeup results based on regular sampling

5 Conclusion and Discussion

We designed and implemented our external memory parallel sorting algorithm using regular and random sampling techniques, and ran extensive experiments to measure its performance. Overall, regular sampling works better than random sampling. We found that a block size of 128K achieves good sorting-time performance. Our algorithm achieved linear speedup, and the load balance obtained by regular sampling is good for the external memory parallel sorting scenario. As future work, we could try to find the optimal number of pivots, which leads to optimal speedup and load balance.

References

[1] A. Aggarwal and C. G. Plaxton. Optimal parallel sorting in multi-level storage. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, volume 5, pages 659–668.

[2] Argonne National Laboratory, http://www-unix.mcs.anl.gov/mpi/index.html. The Message Passing Interface (MPI) standard.

[3] Rakesh D. Barve, Edward F. Grove, and Jeffrey Scott Vitter. Simple randomized mergesort on parallel disks. Parallel Computing, 23(4–5):601–631, 1997.

[4] Rakesh D. Barve and Jeffrey Scott Vitter. A simple and efficient parallel disk mergesort. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, volume 11, pages 232–242, 1999.

[5] Geeta Chaudhry, Thomas H. Cormen, and Leonard F. Wisniewski. Columnsort Lives! An Efficient Out-of-Core Sorting Program. Available at http://www.cs.dartmouth.edu/∼geetac/colsort now.pdf.

[6] David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, pages 280–291, 1991.

[7] J. S. Huang and Y. C. Chow. Parallel sorting and data partitioning by sampling. In Proceedings of the Seventh International Computer Software and Applications Conference, pages 627–631, 1983.

[8] Balakrishna R. Iyer, Gray R. Ricard, and Peter J. Varman. Percentile finding algorithm for multiple sorted runs. In Proceedings of the Fifteenth International Conference on Very Large Data Bases, pages 135–144. Morgan Kaufmann, 1989.


[9] D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 1973.

[10] David M. Lane. Variance and Standard Deviation. Psychology, Statistics, and Administration at Rice University, http://davidmlane.com/hyperstat/A16252.html.

[11] Tom Leighton. Tight bounds on the complexity of parallel sorting. IEEE Transactions on Computers, C-34(4):344–354, April 1985.

[12] E. E. Lindstrom and J. S. Vitter. The design and analysis of bucketsort for bubble memory secondary storage. IEEE Transactions on Computers, C-34:218–233, 1985.

[13] M. J. Quinn. Parallel sorting algorithms for tightly coupled multiprocessors. Parallel Computing, 6:349–367, 1988.

[14] Hanmao Shi and Jonathan Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 14(4):361–372, 1992.

[15] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Two-level memories. Algorithmica, 12(2–3):110–147, 1994.

[16] Jeffrey Scott Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2):209–271, June 2001.

A Source Code

/*********************************************************
 * CSCI 6702 - Parallel Computing                        *
 * Project - External Memory Parallel Sorting            *
 * Authors: Rui Zhang & Yongzheng Zhang                  *
 * First Created: July 9, 2002                           *
 * Last Updated: July 19, 2002                           *
 *********************************************************/

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <stdlib.h>
#include "mpi.h"

#define M 1024*1024*8
#define B 1024*128

void Extmem_Sort(int p, int rank);
void File_Name(int order1, int order2, char fn[32]);
int *Safe_Malloc(int size);
void Block_Pivots(int p, int run, int len, int *pivot, int block[B]);
void Final_Pivots(int p, int psize, int *pivot, int *sample);
int *Exchange_Sublists(int p, int len, int *fl, int block[B],
                       int *pivot, int *sc, int *sd, int *rc, int *rd);
void Save_Block(int rank, int run, int flen, int *fblock);
void Merge_Files(int rank, int total, int run, int *fsize);
void Read_Number(int i, int *n, int *number, int *more, FILE *fp);
int Min_Number(int run, int *number, int *more);
void Quicksort(int *A, int start, int end);
int Partition(int *A, int start, int end);


int main(int argc, char *argv[]) {
    int p;       /* number of processors */
    int rank;    /* rank of each processor */

    /* start up MPI */
    MPI_Init(&argc, &argv);

    /* find out # processors */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* find out rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* external memory parallel sorting */
    Extmem_Sort(p, rank);

    /* quit MPI */
    MPI_Finalize();

    return 0;
}

void Extmem_Sort(int p, int rank) {
    char fn[32];     /* file name of data file */
    FILE *fp;        /* original file pointer */
    int i, j;        /* indices of array elements */
    int len = 0;     /* number of items in a block */
    int run = 0;     /* number of runs in a file */
    int total = 0;   /* total number of items */
    int psize;       /* # pivots at one processor */
    int block[B];    /* a block of integers */
    int *pivot;      /* array of pivots */
    int *sample;     /* array of regular sample */
    int *sc;         /* # integers in each sublist */
    int *sd;         /* start index of each sublist */
    int *rc;         /* receive number of integers */
    int *rd;         /* receive indices of sublists */
    int *fsize;      /* size of each final block */
    int *fblock;     /* final block of data elements */
    double start, finish;  /* start and finish time */

    /* get the name of the input file */
    File_Name(rank + 1, 0, fn);

    /* open local data file */
    if ((fp = fopen(fn, "rb")) == NULL) {
        printf("Cannot open the file %s.\n", fn);
        exit(0);
    }

    /* start time of parallel computing */
    start = MPI_Wtime();

    /* each processor reads in blocks and selects pivots */
    pivot = Safe_Malloc(p - 1);
    while (!feof(fp)) {


        /* read in one block of data items */
        len = fread(block, sizeof(int), B, fp);
        if (len > 0) {
            run++;                  /* number of runs increases */
            psize = (p - 1) * run;  /* current number of pivots */
            pivot = realloc(pivot, psize * sizeof(int));
            if (pivot == NULL) {
                printf("Not enough memory available!\n");
                exit(1);
            }
            /* find pivots in this block of data items */
            Block_Pivots(p, run, len, pivot, block);
        }
    }

    /* processor 0 gathers the regular sample and determines final pivots */
    if (rank == 0) sample = Safe_Malloc(psize * p);
    MPI_Gather(pivot, psize, MPI_INT, sample, psize, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) Final_Pivots(p, psize, pivot, sample);

    /* processor 0 broadcasts final pivots to other processors */
    MPI_Bcast(pivot, p - 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* allocate memory for send and receive counts and displacements */
    sc = Safe_Malloc(p);
    sd = Safe_Malloc(p);
    rc = Safe_Malloc(p);
    rd = Safe_Malloc(p);

    /* each processor reads in blocks and exchanges sublists */
    fsize = Safe_Malloc(run);
    run = 0;     /* reset number of runs to zero */

    /* reposition file pointer to beginning of file */
    rewind(fp);
    while (!feof(fp)) {
        /* read in one block of data items */
        len = fread(block, sizeof(int), B, fp);
        if (len > 0) {
            /* exchange sublists of each block */
            fblock = Exchange_Sublists(p, len, &fsize[run], block, pivot, sc, sd, rc, rd);

            /* write final block into a temporary file */
            Save_Block(rank, run, fsize[run], fblock);

            /* free temporary storage */
            free(fblock);

            run++;     /* increase order of runs by one */
        }
    }

    total = 0;
    for (i = 0; i < run; i++) total += fsize[i];


    printf("P%d: total size = %d, runs = %d\n", rank, total, run);

    /* close local data file */
    fclose(fp);

    /* merge all final blocks (in files) into one big file */
    Merge_Files(rank, total, run, fsize);

    /* free temporary storage */
    free(pivot);
    free(sc);
    free(sd);
    free(rc);
    free(rd);
    free(fsize);
    if (rank == 0) free(sample);

    /* finish time of parallel computing */
    finish = MPI_Wtime();
    if (rank == 0)
        printf("Elapsed time is %f - %f = %f seconds.\n\n", finish, start, finish - start);
}

/* get the name of input or output file */
void File_Name(int order1, int order2, char fn[32]) {
    char tmp[32];
    sprintf(tmp, "%d", order1);

    if (order2 == 0) strcpy(fn, "/tmp/Data/data");
    else if (order2 == -1) strcpy(fn, "/tmp/Data/out");
    else strcpy(fn, "/tmp/Data/temp");

    strcat(fn, tmp);
    if (order2 > 0) {
        strcat(fn, "_");
        sprintf(tmp, "%d", order2);
        strcat(fn, tmp);
    }
    strcat(fn, ".dat");
}

/* allocate memory safely */
int *Safe_Malloc(int size) {
    int *ptr;
    ptr = (int*) malloc(size * sizeof(int));
    if (ptr == NULL) {
        printf("Not enough memory available!\n");
        exit(1);
    }
    return ptr;
}

/* select pivots from each block of data items */
void Block_Pivots(int p, int run, int len, int *pivot, int block[B]) {
    int i, index;
    int rsize;     /* spacing between pivots */

    /* quicksort this block */
    Quicksort(block, 0, len - 1);

    rsize = (int)((B + p - 1) / p);
    for (i = 0; i < p - 1; i++) {
        index = (p - 1) * (run - 1) + i;
        /* regular sampling */
        if ((i + 1) * rsize <= len)
            pivot[index] = block[(i + 1) * rsize - 1];
        else pivot[index] = block[len - 1];
        /* random sampling */
        /* pivot[index] = block[random() % len]; */
    }
}

/* construct regular sample and determine final pivots */
void Final_Pivots(int p, int psize, int *pivot, int *sample) {
    int i;

    /* quicksort regular sample */
    Quicksort(sample, 0, p * psize - 1);

    /* select final pivots */
    printf("Final pivots: ");
    for (i = 0; i < p - 1; i++) {
        pivot[i] = sample[(i + 1) * psize - 1];
        printf("%d ", pivot[i]);
    }
    printf("\n\n");
}

/* exchange sublists of each block */
int *Exchange_Sublists(int p, int len, int *fl, int block[B],
                       int *pivot, int *sc, int *sd, int *rc, int *rd) {
    int i, j;          /* indices of array elements */
    int index = 0;     /* index of a data element */
    int flen = 0;      /* length of each final block */
    int *fblock;       /* final block of data elements */

    /* quicksort this block */
    Quicksort(block, 0, len - 1);

    /* compute send counts and displacements */
    sd[0] = 0;                    /* disp of first sublist */
    for (i = 1; i < p; i++) {     /* compute next p-1 disps */
        if (index == len) {       /* no element > present pivot */
            sd[i] = len;          /* set displacement to len */
            sc[i - 1] = 0;        /* set count of sublist to 0 */
        } else {
            for (j = index; j < len; j++) {     /* elements to be compared */
                /* element <= present pivot */
                if (block[j] <= pivot[i - 1]) {
                    sd[i] = j + 1;              /* update displacement */
                    sc[i - 1] = j + 1 - index;  /* update count */
                } else {
                    if (j == index) {    /* no element <= present pivot */
                        sd[i] = j;       /* disp equal to previous disp */
                        sc[i - 1] = 0;   /* no elements in this sublist */
                    }
                    break;               /* no more elements <= pivot */
                }
            }
            index = j;     /* next sublist starts from j */
        }
    }
    sc[p - 1] = len - sd[p - 1];     /* count of pth sublist */

    /* exchange send counts */
    MPI_Alltoall(sc, 1, MPI_INT, rc, 1, MPI_INT, MPI_COMM_WORLD);

    /* compute receive counts and displacements */
    for (i = 0; i < p; i++)
        flen += rc[i];              /* sum up size of final block */
    *fl = flen;
    fblock = Safe_Malloc(flen);     /* allocate memory for fblock */
    rd[0] = 0;                      /* first displacement is zero */
    for (i = 1; i < p; i++)
        rd[i] = rd[i - 1] + rc[i - 1];  /* compute other p-1 recv disps */

    /* exchange all p sublists */
    MPI_Alltoallv(block, sc, sd, MPI_INT, fblock, rc, rd, MPI_INT, MPI_COMM_WORLD);

    /* each processor sorts its final block */
    Quicksort(fblock, 0, flen - 1);

    return fblock;
}

/* write final block into a temporary file */
void Save_Block(int rank, int run, int flen, int *fblock) {
    char fn[32];
    FILE *fp;

    /* get the name of output file */
    File_Name(rank + 1, run + 1, fn);

    /* open the output file */
    if ((fp = fopen(fn, "wb")) == NULL) {
        printf("Cannot open the file %s.\n", fn);
        exit(0);
    }

    /* write final block into temporary file */
    fwrite(fblock, sizeof(int), flen, fp);

    /* close temporary data file */
    fclose(fp);
}

/* merge all final blocks (in files) into one big file */
void Merge_Files(int rank, int total, int run, int *fsize) {
    int i;           /* index of array items */
    int n = 0;       /* number of items read */
    int pos;         /* position of minimum */
    int *number;     /* one item from each block */
    int *more;       /* more numbers left */
    FILE **fpin;     /* input file pointers */
    FILE *fpout;     /* output file pointer */
    char fn[32];     /* data file name */

    /* allocate memory for file pointers */
    fpin = (FILE**) malloc(run * sizeof(FILE*));
    if (fpin == NULL) {
        printf("Not enough memory available!\n");
        exit(1);
    }

    /* allocate memory for number and more */
    number = Safe_Malloc(run);
    more = Safe_Malloc(run);

    /* open all temporary files */
    for (i = 0; i < run; i++) {
        /* get the name of input file */
        File_Name(rank + 1, i + 1, fn);

        /* open temporary data file */
        if ((fpin[i] = fopen(fn, "rb")) == NULL) {
            printf("Cannot open the file %s.\n", fn);
            exit(0);
        }

        /* read one data item from each file */
        Read_Number(i, &n, number, more, fpin[i]);
    }

    /* get the name of output file */
    File_Name(rank + 1, -1, fn);

    /* open the output file */
    if ((fpout = fopen(fn, "wb")) == NULL) {
        printf("Cannot open the file %s.\n", fn);
        exit(0);
    }

    while (n <= total) {     /* more items left in files */
        /* get the position of the minimum number */
        pos = Min_Number(run, number, more);

        /* write minimum number into output file */
        if (pos > -1) {
            fwrite(&number[pos], sizeof(int), 1, fpout);

            /* read in a new data item */
            Read_Number(pos, &n, number, more, fpin[pos]);
        } else break;
    }

    /* close data files */
    fclose(fpout);
    for (i = 0; i < run; i++)
        fclose(fpin[i]);

    /* free temporary storage */
    free(fpin);
    free(number);
    free(more);
}

/* read in one number from a given file */
void Read_Number(int i, int *n, int *number, int *more, FILE *fp) {
    int len;
    if (!feof(fp)) {     /* not end of file */
        len = fread(&number[i], sizeof(int), 1, fp);
        if (len == 1) {          /* read in one number */
            (*n)++;              /* one more number */
            more[i] = 1;         /* this run still has numbers */
        } else more[i] = 0;      /* no more numbers */
    } else more[i] = 0;          /* no more numbers */
}

/* find the minimum number */
int Min_Number(int run, int *number, int *more) {
    int i;
    int pos = 0;
    int done = 0;
    for (i = 0; i < run; i++) {
        if (more[i] == 1) {
            pos = i;
            break;
        } else done++;
    }
    if (done == run) return -1;     /* all files done */
    for (i = pos + 1; i < run; i++)
        if ((more[i] == 1) && (number[i] < number[pos]))
            pos = i;     /* position of minimum */
    return pos;
}

/* standard quicksort algorithm */
void Quicksort(int *A, int start, int end) {
    int pivot;
    if (start < end) {
        pivot = Partition(A, start, end);
        Quicksort(A, start, pivot);
        Quicksort(A, pivot + 1, end);
    }
}

/* Hoare partition scheme */
int Partition(int *A, int start, int end) {
    int i, j, x;
    x = A[start];
    i = start - 1;
    j = end + 1;
    while (1) {
        do j--; while (A[j] > x);
        do i++; while (A[i] < x);
        if (i < j) {
            int temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        } else return j;
    }
}
