
Page 1: Chap 8. Cosequential Processing                and the Sorting of Large Files

Chap 8. Cosequential Processing and the Sorting of Large Files

Page 2: Chap 8. Cosequential Processing                and the Sorting of Large Files

Chapter Objectives(1)

Describe a class of frequently used processing activities known as cosequential processes

Provide a general object-oriented model for implementing varieties of cosequential processes

Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches

Introduce heapsort as an approach to overlapping I/O with sorting in RAM

Page 3: Chap 8. Cosequential Processing                and the Sorting of Large Files

Chapter Objectives(2)

Show how merging provides the basis for sorting very large files

Examine the costs of K-way merges on disk and find ways to reduce those costs

Introduce the notion of replacement selection

Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks

Introduce UNIX utilities for sorting, merging, and cosequential processing

Page 4: Chap 8. Cosequential Processing                and the Sorting of Large Files

Contents

8.1 Cosequential operations

8.2 Application of the OO Model to a General Ledger Program

8.3 Extension of the OO Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix

Page 5: Chap 8. Cosequential Processing                and the Sorting of Large Files

Cosequential operations

Coordinated processing of two or more sequential lists to produce a single list

Kinds of operations

merging, or union

matching, or intersection

combination of above

Page 6: Chap 8. Cosequential Processing                and the Sorting of Large Files

Matching Names in Two Lists(1)

Also called the "intersection" operation

Output the names common to both lists

Things that must be dealt with to make the match procedure work reasonably:

initializing, i.e., arranging things so the procedure can start

methods for getting and accessing the next item from each list

synchronizing the two lists

handling EOF conditions

recognizing errors

e.g. duplicate names or names out of sequence

Page 7: Chap 8. Cosequential Processing                and the Sorting of Large Files

Matching Names in Two Lists(2)

In comparing two names

if Item(1) is less than Item(2), read the next name from List 1

if Item(1) is greater than Item(2), read the next name from List 2

if the names are the same, output the name and read the next names from both lists

Page 8: Chap 8. Cosequential Processing                and the Sorting of Large Files

Cosequential match procedure(1)

Flowchart of PROGRAM: match — Item(1) and Item(2) are read from List 1 and List 2 (using the input() and initialize() procedures); when the names are the same, the name is output; when Item(1) < Item(2), List 1 advances; when Item(1) > Item(2), List 2 advances.

Page 9: Chap 8. Cosequential Processing                and the Sorting of Large Files

Cosequential match procedure(2)

int Match(char * List1, char * List2, char * OutputList)
{
   int MoreItems;   // true if items remain in both of the lists

   // initialize input and output lists
   InitializeList(1, List1); InitializeList(2, List2); InitializeOutput(OutputList);

   // get first item from both lists
   MoreItems = NextItemInList(1) && NextItemInList(2);

   while (MoreItems) {   // loop until one of the lists is exhausted
      if (Item(1) < Item(2))
         MoreItems = NextItemInList(1);
      else if (Item(1) == Item(2)) {   // match found
         ProcessItem(1);
         MoreItems = NextItemInList(1) && NextItemInList(2);
      }
      else   // Item(1) > Item(2)
         MoreItems = NextItemInList(2);
   }
   FinishUp();
   return 1;
}

Page 10: Chap 8. Cosequential Processing                and the Sorting of Large Files

General Class for Cosequential Processing(1)

template <class ItemType>
class CosequentialProcess   // base class for cosequential processing
{
 public:
   // the following methods provide basic list processing;
   // they must be defined in subclasses
   virtual int InitializeList(int ListNumber, char * ListName) = 0;
   virtual int InitializeOutput(char * OutputListName) = 0;
   virtual int NextItemInList(int ListNumber) = 0;   // advance to next item in this list
   virtual ItemType Item(int ListNumber) = 0;        // return current item from this list
   virtual int ProcessItem(int ListNumber) = 0;      // process the item in this list
   virtual int FinishUp() = 0;                       // complete the processing

   // 2-way cosequential match method
   virtual int Match2Lists(char * List1Name, char * List2Name, char * OutputListName);
};

Page 11: Chap 8. Cosequential Processing                and the Sorting of Large Files

General Class for Cosequential Processing(2)

A Subclass to support lists that are files of strings, one per line

class StringListProcess : public CosequentialProcess<String &>
{
 public:
   StringListProcess(int NumberOfLists);   // constructor
   // basic list processing methods
   int InitializeList(int ListNumber, char * ListName);
   int InitializeOutput(char * OutputListName);
   int NextItemInList(int ListNumber);   // get next item
   String & Item(int ListNumber);        // return current item
   int ProcessItem(int ListNumber);      // process the item
   int FinishUp();                       // complete the processing
 protected:
   ifstream * List;               // array of list files
   String * Items;                // array of current items, one from each list
   ofstream OutputList;
   static const char * LowValue;  // used so that NextItemInList() doesn't
                                  // have to get the first item in a special way
   static const char * HighValue;
};
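The role of the HighValue sentinel is easiest to see in NextItemInList. Below is a standalone sketch of the idea (it uses std::ifstream and std::string instead of the book's String class, so the names and types are illustrative rather than the Appendix H code); LowValue plays the mirror-image role of letting the very first call proceed without a special case.

#include <fstream>
#include <string>

const std::string HighValue = "\xFF";   // sorts after every legal key in the files

// Read the next line of a list file into 'item'.
// At end-of-file, substitute HighValue so callers need no special EOF handling.
bool NextItemInList(std::ifstream & list, std::string & item)
{
    if (std::getline(list, item))
        return true;       // more items remain in this list
    item = HighValue;      // EOF: the sentinel becomes the "current" item
    return false;
}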

Page 12: Chap 8. Cosequential Processing                and the Sorting of Large Files

General Class for Cosequential Processing(3)

Appendix H: full implementation

An example of main

#include "coseq.h"

int main()
{
   StringListProcess ListProcess(2);   // process with 2 lists
   ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
}

Page 13: Chap 8. Cosequential Processing                and the Sorting of Large Files

Merging Two Lists(1)

Based on matching operation

Difference

must read each of the lists completely

must change MoreItems behavior

– keep this flag set to true as long as there are records in either list

HighValue

the special value (we use “\xFF”)

comes after all legal input values in the files, to ensure both input files are read to completion

Page 14: Chap 8. Cosequential Processing                and the Sorting of Large Files

Merging Two Lists(2)

Cosequential merge procedure based on a single loop

This method has been added to class CosequentialProcess

No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists
   (char * List1Name, char * List2Name, char * OutputListName)
{
   int MoreItems1, MoreItems2;   // true if more items remain in each list

   InitializeList(1, List1Name);
   InitializeList(2, List2Name);
   InitializeOutput(OutputListName);

   MoreItems1 = NextItemInList(1);
   MoreItems2 = NextItemInList(2);

   while (MoreItems1 || MoreItems2) {   (continued ...)

Page 15: Chap 8. Cosequential Processing                and the Sorting of Large Files

Merging Two Lists(3)

while (MoreItems1 || MoreItems2) {   // if either list has more items
   if (Item(1) < Item(2)) {          // list 1 has the next item to be processed
      ProcessItem(1);
      MoreItems1 = NextItemInList(1);
   }
   else if (Item(1) == Item(2)) {    // items are equal: process once, advance both lists
      ProcessItem(1);
      MoreItems1 = NextItemInList(1);
      MoreItems2 = NextItemInList(2);
   }
   else {                            // Item(1) > Item(2)
      ProcessItem(2);
      MoreItems2 = NextItemInList(2);
   }
}
FinishUp();
return 1;
}

Page 16: Chap 8. Cosequential Processing                and the Sorting of Large Files

Cosequential merge procedure(1)

Flowchart of PROGRAM: merge — NAME_1 and NAME_2 are read from List 1 and List 2 and written to OutputList; one branch handles the case (Item(1) < Item(2)) or a match, the other handles Item(1) > Item(2).

Page 17: Chap 8. Cosequential Processing                and the Sorting of Large Files

Summary of the Cosequential Processing Model(1)

Assumptions

two or more input files are processed in a parallel fashion

each file is sorted

in some cases, there must exist a high key value or a low key

records are processed in a logical sorted order

for each file, there is only one current record

records should be manipulated only in internal memory

Page 18: Chap 8. Cosequential Processing                and the Sorting of Large Files

Summary of the Cosequential Processing Model(2)

Essential components

initialization - read the first logical record from each list

one main synchronization loop - continues as long as relevant records remain

selection within the main synchronization loop

Input files and output files are sequence checked by comparing the previous item value with the new one

if (Item(1) > Item(2)) then ..........
else if (Item(1) < Item(2)) then .........
else ...........   /* current keys equal */
endif

Page 19: Chap 8. Cosequential Processing                and the Sorting of Large Files

Summary of the Cosequential Processing Model(3)

Essential components (cont’d)

substitute high values for the actual key at EOF

– the main loop terminates when high values have occurred for all relevant input files

– no special code is needed to deal with EOF

I/O and error detection are relegated to supporting methods so the details of these activities do not obscure the principal processing logic

Page 20: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (1)

– Figure 8.7 Sample journal entries

Acct.No. Check No. Date Description Debit/credit

101   1271   04/02/97   Auto expense                    -78.70
510   1271   04/02/97   Tune-up and minor repair         78.70
101   1272   04/02/97   Rent                           -500.00
550   1272   04/02/97   Rent for April                  500.00
101   1273   04/04/97   Advertising                     -87.50
505   1273   04/04/97   Newspaper ad re: new product     87.50
102    670   04/02/97   Office expense                  -32.78
540    670   04/02/97   Printer cartridge                32.78
101   1274   04/02/97   Auto expense                    -31.83
510   1274   04/09/97   Oil change                       31.83

Page 21: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (2)

– Figure 8.8 Sample ledger printout showing the effect of posting from the journal

101 Checking account #1
      1271   04/02/97   Auto expense                    -78.70
      1272   04/02/97   Rent                           -500.00
      1273   04/04/97   Advertising                     -87.50
      1274   04/02/97   Auto expense                    -31.83
      Prev. bal: 5219.23   New bal: 4521.20   (previous balance / new balance)
102 Checking account #2
       670   04/02/97   Office expense                  -32.78
      Prev. bal: 1321.20   New bal: 1288.42
505 Advertising expense
      1273   04/04/97   Newspaper ad re: new product     87.50
      Prev. bal: 25.00     New bal: 112.50
510 Auto expenses
      1271   04/02/97   Tune-up and minor repair         78.70
      1274   04/09/97   Oil change                       31.83
      Prev. bal: 501.12    New bal: 611.65

Page 22: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (3)

– Figure 8.9 List of journal transactions sorted by account number

Acct.No. Check No. Date Description Debit/credit

101   1271   04/02/97   Auto expense                    -78.70
101   1272   04/02/97   Rent                           -500.00
101   1273   04/04/97   Advertising                     -87.50
101   1274   04/02/97   Auto expense                    -31.83
102    670   04/02/97   Office expense                  -32.78
505   1273   04/04/97   Newspaper ad re: new product     87.50
510   1271   04/02/97   Tune-up and minor repair         78.70
510   1274   04/09/97   Oil change                       31.83
540    670   04/02/97   Printer cartridge                32.78
550   1272   04/02/97   Rent for April                  500.00

Page 23: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (4)

– Figure 8.10 Conceptual view of cosequential matching of the ledger and journal files

Ledger List                    Journal List
101 Checking account #1        101  1271  Auto expense
                               101  1272  Rent
                               101  1273  Advertising
                               101  1274  Auto expense
102 Checking account #2        102   670  Office expense
505 Advertising expense        505  1273  Newspaper ad re: new product
510 Auto expenses              510  1271  Tune-up and minor repair
                               510  1274  Oil change

Page 24: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (5)

– Figure 8.11 Sample ledger printout for the first six accounts

101 Checking account #1
      1271   04/02/97   Auto expense                    -78.70
      1272   04/02/97   Rent                           -500.00
      1274   04/02/97   Auto expense                    -31.83
      1273   04/04/97   Advertising                     -87.50
      Prev. bal: 5219.23   New bal: 4521.20
102 Checking account #2
       670   04/02/97   Office expense                  -32.78
      Prev. bal: 1321.20   New bal: 1288.42
505 Advertising expense
      1273   04/04/97   Newspaper ad re: new product     87.50
      Prev. bal: 25.00     New bal: 112.50
510 Auto expenses
      1271   04/02/97   Tune-up and minor repair         78.70
      1274   04/09/97   Oil change                       31.83
      Prev. bal: 501.12    New bal: 611.65
515 Bank charges
      Prev. bal: 0.00      New bal: 0.00
520 Books and publications
      Prev. bal: 87.40     New bal: 87.40

Page 25: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program(6)

The ledger is the master file, keyed on account number

The journal is the transaction file, also keyed on account number

Class MasterTransactionProcess (Fig 8.12)

Subclass LedgerProcess (Fig 8.14)

Page 26: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (7)

template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{
 public:
   MasterTransactionProcess();   // constructor
   virtual int ProcessNewMaster() = 0;        // processing when a new master record is read
   virtual int ProcessCurrentMaster() = 0;    // processing for each transaction that matches the master
   virtual int ProcessEndMaster() = 0;        // processing when all transactions for a master are done
   virtual int ProcessTransactionError() = 0; // processing for a transaction with no matching master

   // cosequential processing of master and transaction records
   int PostTransactions(char * MasterFileName, char * TransactionFileName,
                        char * OutputListName);
};

Page 27: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (8)

while (MoreMasters || MoreTransactions) {
   if (Item(1) < Item(2)) {               // finish this master record
      ProcessEndMaster();
      MoreMasters = NextItemInList(1);
      if (MoreMasters) ProcessNewMaster();
   }
   else if (Item(1) == Item(2)) {         // transaction matches the master
      ProcessCurrentMaster();             // another transaction for this master
      ProcessItem(2);                     // output the transaction record
      MoreTransactions = NextItemInList(2);
   }
   else {                                 // Item(1) > Item(2): transaction with no master
      ProcessTransactionError();
      MoreTransactions = NextItemInList(2);
   }
}

Page 28: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.2 The General Ledger Program (9)

int LedgerProcess::ProcessNewMaster()
{  // print the header and set this month's starting balance from last month's
   ledger.PrintHeader(OutputList);
   ledger.Balances[MonthNumber] = ledger.Balances[MonthNumber-1];
   return 1;
}

int LedgerProcess::ProcessCurrentMaster()
{  // add the transaction amount to this month's balance
   ledger.Balances[MonthNumber] += journal.Amount;
   return 1;
}

int LedgerProcess::ProcessEndMaster()
{  // print the balances
   PrintBalances(OutputList, ledger.Balances[MonthNumber-1],
                 ledger.Balances[MonthNumber]);
   return 1;
}
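A usage sketch of the driver for the ledger program, parallel to the earlier match example. The header name, data file names, and the constructor argument (taken here to be the month being posted) are assumptions; the actual LedgerProcess constructor is in the book's figures, not in these slides.

#include "ledger.h"   // assumed header declaring class LedgerProcess

int main()
{
   LedgerProcess ledger(3);   // assumption: post transactions for month number 3
   ledger.PostTransactions("ledger.dat", "journal.dat", "ledger.out");
   return 0;
}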

Page 29: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.3 A K-way Merge Algorithm

A very general form of cosequential file processing

Merge K input lists to create a single, sequentially ordered output list

Algorithm

begin loop

determine which list has the key with the lowest value

output that key

move ahead one key in that list

– for duplicate entries, move ahead one key in each list that contains the duplicate

loop again
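A minimal sketch of this loop for K sorted lists held in memory (illustrative only: the book's version works on files, and on duplicates it advances every list holding the duplicate, while this sketch simply copies every key it sees).

#include <string>
#include <vector>

// Merge K sorted vectors by repeatedly scanning for the smallest current key.
std::vector<std::string> KWayMerge(const std::vector<std::vector<std::string>> & lists)
{
    std::vector<std::string> output;
    std::vector<size_t> pos(lists.size(), 0);            // current position in each list
    while (true) {
        int minList = -1;
        for (size_t i = 0; i < lists.size(); i++)        // determine which list has the lowest key
            if (pos[i] < lists[i].size() &&
                (minList < 0 || lists[i][pos[i]] < lists[minList][pos[minList]]))
                minList = (int) i;
        if (minList < 0) break;                          // every list is exhausted
        output.push_back(lists[minList][pos[minList]]);  // output that key
        pos[minList]++;                                  // move ahead one key in that list
    }
    return output;
}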

Page 30: Chap 8. Cosequential Processing                and the Sorting of Large Files

K-way merge

works nicely if K is no larger than 8 or so

if K > 8, the set of comparisons needed to find the minimum key becomes expensive

a simple loop of comparisons to find the minimum (computation cost)

Selection Tree (if K > 8)

time vs. space trade off

a kind of “tournament” tree

the minimum value is at root node

the depth of tree is log2 K
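For larger K, the same merge can be driven by a priority queue, which plays the role of the selection tree (a sketch using std::priority_queue rather than an explicit tournament tree; each extraction and replacement costs on the order of log2 K comparisons).

#include <functional>
#include <queue>
#include <string>
#include <vector>

// One entry per list: the current key and the list it came from.
struct Entry {
    std::string key;
    size_t      list;
    bool operator>(const Entry & other) const { return key > other.key; }
};

std::vector<std::string> SelectionTreeMerge(const std::vector<std::vector<std::string>> & lists)
{
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> tree;  // minimum at the "root"
    std::vector<size_t> pos(lists.size(), 0);
    for (size_t i = 0; i < lists.size(); i++)            // load the first key of each list
        if (!lists[i].empty()) tree.push({lists[i][0], i});

    std::vector<std::string> output;
    while (!tree.empty()) {
        Entry e = tree.top(); tree.pop();                // minimum of the current keys
        output.push_back(e.key);
        if (++pos[e.list] < lists[e.list].size())        // advance the list that won
            tree.push({lists[e.list][pos[e.list]], e.list});
    }
    return output;
}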

8.3 Selection Tree for Merging a Large Number of Lists

Page 31: Chap 8. Cosequential Processing                and the Sorting of Large Files

A selection ("tournament") tree merging eight sorted lists; the smaller of the two current keys wins each comparison, so the overall minimum appears at the root, ready to be output:

List 0: 7, 10, 17 ...
List 1: 9, 19, 23 ...
List 2: 11, 13, 32 ...
List 3: 18, 22, 24 ...
List 4: 12, 14, 21 ...
List 5: 5, 6, 25 ...
List 6: 15, 20, 30 ...
List 7: 8, 16, 29 ...

First-round winners: 7 (Lists 0/1), 11 (Lists 2/3), 5 (Lists 4/5), 8 (Lists 6/7); second round: 7 and 5; root: 5, the next key to be input to the output list.

Page 32: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 A Second Look at Sorting in Memory

Read the whole file into memory, sort it, and write the whole file back to disk

Can we improve on the time that it takes for this RAM sort?

perform some parts of the work in parallel

selection sort is good for this but cannot be used to sort the entire file

Use the heap technique!

processing and I/O can occur in parallel

keep all the keys in heap

Heap building while reading a block

Heap rebuilding while writing a block

Page 33: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 Overlapping processing and I/O : Heapsort

Heap

a kind of binary tree (a complete binary tree)

each node has a single key, and that key is greater than or equal to the key at its parent node (so the minimum is at the root)

storage for the tree can be allocated sequentially as an array, so there is no need for pointers or other dynamic overhead for maintaining the heap

Page 34: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 A heap in both its tree form and as it would be stored in an array

Tree form (array position in parentheses):

                A(1)
         B(2)         C(3)
      E(4)  H(5)   I(6)  D(7)
    G(8) F(9)

Array form:

   position:  1  2  3  4  5  6  7  8  9
   key:       A  B  C  E  H  I  D  G  F

* a node at position n has its children at positions 2n and 2n+1

Page 35: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 Class Heap and Method Insert(1)

class Heap
{
 public:
   Heap(int maxElements);
   int Insert(char * newKey);
   char * Remove();
 protected:
   int MaxElements;
   int NumElements;
   char ** HeapArray;
   void Exchange(int i, int j);   // exchange elements i and j
   int Compare(int i, int j)      // compare elements i and j
      { return strcmp(HeapArray[i], HeapArray[j]); }
};

Page 36: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 Class Heap and Method Insert(2)

int Heap::Insert(char * newKey)
{
   if (NumElements == MaxElements) return FALSE;
   NumElements++;                          // add the new key at the last position
   HeapArray[NumElements] = newKey;
   // re-order the heap
   int k = NumElements;
   int parent;
   while (k > 1) {                         // k has a parent
      parent = k / 2;
      if (Compare(k, parent) >= 0) break;  // HeapArray[k] is in the right place
      // else exchange k and parent
      Exchange(k, parent);
      k = parent;
   }
   return TRUE;
}

Page 37: Chap 8. Cosequential Processing                and the Sorting of Large Files

Heap Building Algorithm(1)

input key order: F D C G H I B E A

New key to        Heap, after insertion of the new key
be inserted       (array positions 1-9)
F                 F
D                 D F
C                 C F D
G                 C F D G
H                 C F D G H

Selected heap in tree form (after inserting C):
      C
    F   D

(continued....)

Page 38: Chap 8. Cosequential Processing                and the Sorting of Large Files

Heap Building Algorithm(2)

New key to        Heap, after insertion of the new key
be inserted       (array positions 1-9)
I                 C F D G H I
B                 B F C G H I D
E                 B E C F H I D G
A                 A B C E H I D G F

Selected heaps in tree form:

after inserting I:            after inserting B:
        C                             B
      F   D                         F   C
    G  H  I                       G  H  I  D

(continued....)

input key order: F D C G H I B E A

Page 39: Chap 8. Cosequential Processing                and the Sorting of Large Files

Heap Building Algorithm(3)

New key to        Heap, after insertion of the new key
be inserted       (array positions 1-9)
A                 A B C E H I D G F

Final heap in tree form:
            A
        B       C
      E   H   I   D
    G  F

input key order: F D C G H I B E A

Page 40: Chap 8. Cosequential Processing                and the Sorting of Large Files

Illustration for overlapping input with heap building(1)

Total RAM area allocated for heap

First input buffer. The first part of the heap is built here. The first record is added to the heap, then the second record is added, and so forth.

Second input buffer. This buffer is being filled while the heap is being built in the first buffer.

(A free ride for main-memory processing: heap building is faster than the I/O!)

Page 41: Chap 8. Cosequential Processing                and the Sorting of Large Files

Illustration for overlapping input with heap building(2)

Second part of the heap is built here. The first record is added to the heap, then the second record, etc.

Third input buffer. This buffer is filled while the heap is being built in the second buffer.

Third part of the heap is built here.

Fourth input buffer is filled while the heap is being built in the third buffer.

(One heap keeps growing during I/O time!)

Page 42: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 Sorting while Writing to the File

Heap rebuilding while writing a block (another free ride for main-memory processing)

Retrieving the keys in order (Fig 8.20):

while (there are elements left in the heap)
   – get the smallest value (at the root)
   – put the largest value into the root
   – decrease the number of elements
   – reorder the heap

Overlapping retrieve-in-order with I/O: retrieve a block of records in order; while writing this block, retrieve the next block in order

Page 43: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.4 Method Remove (Figure 8.20)

char * Heap::Remove()
{
   // remove the smallest element, reorder the heap, and return the smallest element
   char * val = HeapArray[1];               // smallest value, saved for the return
   HeapArray[1] = HeapArray[NumElements];   // put the largest value into the root
   NumElements--;                           // decrease the number of elements

   // reorder the heap by exchanging and moving down
   int k = 1;      // node of the heap that contains the largest value
   int newK;       // node to be exchanged with the largest value
   while (2*k <= NumElements) {             // k has at least one child
      // set newK to the index of the smallest child of k
      if (Compare(2*k, 2*k+1) < 0) newK = 2*k;
      else newK = 2*k+1;
      // check whether k and newK are in order
      if (Compare(k, newK) < 0) break;      // in order
      Exchange(k, newK);                    // k and newK are out of order
      k = newK;                             // continue down the tree
   }
   return val;
}
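Putting Insert and Remove together gives the in-memory heapsort used to form runs. A usage sketch for the example key sequence from the heap-building slides (it presumes the class Heap declared earlier, plus a constructor, not shown in these slides, that allocates HeapArray with MaxElements + 1 slots and sets NumElements to 0, as in Appendix H):

#include <iostream>
// class Heap as declared on the earlier slide, with its Appendix H implementation

int main()
{
   // keys from the heap-building example: F D C G H I B E A
   const char * keys[9] = { "F", "D", "C", "G", "H", "I", "B", "E", "A" };

   Heap heap(9);                        // assumed constructor: allocates HeapArray[10]
   for (int i = 0; i < 9; i++)
      heap.Insert((char *) keys[i]);    // build the heap one key at a time

   for (int i = 0; i < 9; i++)
      std::cout << heap.Remove() << " ";   // prints: A B C D E F G H I
   std::cout << std::endl;
   return 0;
}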

Page 44: Chap 8. Cosequential Processing                and the Sorting of Large Files

8.5 Merging as a Way of Sorting Large Files on Disk

Keysort: holding keys in memory

Two Shortcomings of Keysort

substantial cost of seeking may happen after keysort

cannot sort really large files

– e.g. the 800 MB example file: 8,000,000 records, each 100 bytes with a 10-byte key part; 8,000,000 x 10 bytes => 80 MB of keys

cannot even sort all the keys in RAM (with only 10 MB available)

Multiway merge algorithm

small overhead for maintaining pointers, temporary variables

run: sorted subfile

using heap sort for each run

split, read-in, heap sort, write-back

Page 45: Chap 8. Cosequential Processing                and the Sorting of Large Files

Sorting through the creation of runs and subsequent merging of runs

8,000,000 unsorted records
   |
   |  80 internal sorts
   v
80 runs, each containing 100,000 sorted records
   |
   |  merge
   v
8,000,000 records in sorted order

Page 46: Chap 8. Cosequential Processing                and the Sorting of Large Files

Multiway merging (K-way merge-sort)

Can be extended to files of any size

Reading during run creation is sequential

no seeking due to sequential reading

Reading & writing is sequential

Sort each run: Overlapping I/O using heapsort

K-way merges with k runs

Since I/O is largely sequential, tapes can be used

Page 47: Chap 8. Cosequential Processing                and the Sorting of Large Files

How Much Time Does a Merge Sort Take?

Assumptions

only one seek is required for any sequential access

only one rotational delay is required per access

Four I/Os

during the sort phase

– reading all records into RAM for sorting, forming runs

– writing sorted runs out to disk

during the merge phase

– reading sorted runs into RAM for merging

– writing sorted file out to disk

Page 48: Chap 8. Cosequential Processing                and the Sorting of Large Files

Four Steps(1) Step1: Reading records into RAM for sorting and forming runs

assume: 10MB input buffer, 800MB file size

seek time --> 8msec, rotational delay --> 3msec

transmission rate --> 0.0145MB/msec

Time for step 1:

access 80 blocks: 80 x 11 msec ≈ 1 sec, plus transfer 80 blocks: 800 / 0.0145 msec ≈ 55 sec

Step2: Writing sorted runs out to disk

writing is reverse of reading

time that it takes for step2 equals to time of step1

Page 49: Chap 8. Cosequential Processing                and the Sorting of Large Files

Four Steps(2)

Step3: Reading sorted runs into RAM for merging

the 10 MB of RAM is used for storing the 80 runs

reallocate the 10 MB of RAM as 80 input buffers, one per run

each run must be accessed 80 times (one buffer load at a time) to read all of it

each buffer holds 1/80 of a run (0.125 MB)

total seek & rotation time --> 80 runs x 80 seeks --> 6,400 seeks; 6,400 x 11 msec ≈ 70 seconds

transfer time --> about 60 seconds (800 MB / 0.0145 MB/msec)

total time = total seek & rotation time + transfer time

Page 50: Chap 8. Cosequential Processing                and the Sorting of Large Files

Four Steps(3)

Step4: Writing sorted file out to disk

we need to know how big the output buffers are

with 200,000-byte (0.2 MB) output buffers:

total seek & rotation time = 4,000 x 11 msec ≈ 44 seconds

transfer time is still about 60 seconds

Consider Table 8.1 (p. 359)

What if we used keysort for the 800 MB file? --> 24 hrs 26 mins 40 secs

800,000,000 bytes / 200,000 bytes per seek = 4,000 seeks
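A small sketch that reproduces these back-of-the-envelope estimates from the stated parameters (8 msec seek, 3 msec rotational delay, 0.0145 MB/msec transfer, 800 MB file, 10 MB of RAM, 0.2 MB output buffers as assumed above); it uses the exact transfer rate, so the results differ slightly from the rounded figures in the text.

#include <cstdio>

int main()
{
   const double accessMs = 8.0 + 3.0;     // seek + rotational delay per random access
   const double xferMBms = 0.0145;        // transfer rate, MB per msec
   const double fileMB   = 800.0;
   const double ramMB    = 10.0;
   const int    runs     = (int)(fileMB / ramMB);       // 80 runs of 10 MB each

   double passMs = fileMB / xferMBms;                   // one full pass of transfer (~55 sec)
   double step1  = runs * accessMs + passMs;            // read records, form runs (80 accesses)
   double step2  = step1;                               // write the 80 sorted runs back out
   double step3  = runs * runs * accessMs + passMs;     // merge: 80 x 80 = 6,400 buffer loads
   double step4  = (fileMB / 0.2) * accessMs + passMs;  // write output via 0.2 MB buffers (4,000 seeks)

   std::printf("steps 1 and 2: %.0f sec each\n", step1 / 1000.0);
   std::printf("step 3:        %.0f sec\n", step3 / 1000.0);
   std::printf("step 4:        %.0f sec\n", step4 / 1000.0);
   std::printf("total:         %.1f min\n", (step1 + step2 + step3 + step4) / 60000.0);
   return 0;
}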

Page 51: Chap 8. Cosequential Processing                and the Sorting of Large Files

Effect of buffering on the number of seeks required

The 800 MB file (8,000,000 sorted records) is merged using 10 MB of RAM divided into 80 input buffers, one per run; each run is read in 80 buffer loads, i.e. 80 accesses per run:

1st run  = 80 buffers' worth (80 accesses)
2nd run  = 80 buffers' worth (80 accesses)
   :
80th run = 80 buffers' worth (80 accesses)

Page 52: Chap 8. Cosequential Processing                and the Sorting of Large Files

Sorting a Very Large File

Two kinds of I/O

Sort phase

– I/O is sequential if heapsort is used to form the runs

– since sequential access involves minimal seeking, there is little room to speed up this I/O algorithmically

Merge phase

– the RAM buffers for each run get loaded and reloaded at times determined by the data --> random access

– for performance, look for ways to cut down on the number of random accesses that occur while reading runs

– this is where there is room for improvement!

Page 53: Chap 8. Cosequential Processing                and the Sorting of Large Files

The Cost of Increasing the File Size

K-way merge of K runs

Merge sort is O(K^2) (the merge operation requires on the order of K^2 seeks)

If K is a big number, you are in trouble!

Some ways to reduce time!! (8.5.4, 8.5.5, 8.5.6)

more hardware (disk drives, RAM, I/O channel)

reducing the order of merge (k), increasing buffer size of each run

increase the lengths of the initial sorted runs

find the ways to overlap I/O operations

Page 54: Chap 8. Cosequential Processing                and the Sorting of Large Files

Hardware-based Improvements

Increasing the amount of RAM

longer & fewer initial runs

fewer seeks

Increasing the number of disk drives

no delay due to seek time after generation of runs

assign input and output to separate drives

Increasing the number of I/O channels

separate I/O channels, I/O can overlap

Improve transmission time

Page 55: Chap 8. Cosequential Processing                and the Sorting of Large Files

Decreasing the Num of Seeks Using Multiple-step Merges

K-way merge characteristics

a selection tree is used

– the number of comparisons is N*log K

(K-way merge with N records)

K is proportional to N

– O(N*log N) : reasonably efficient

Seeks are reduced by merging fewer runs at a time

so each run can be given a bigger buffer space

the multiple-step merge provides a way to do this without more RAM

Page 56: Chap 8. Cosequential Processing                and the Sorting of Large Files

Multiple-step merge(1)

Do not merge all runs at one time

Break the original set of runs into small groups and merge the runs in each group separately

Leads to fewer seeks, but extra transmission time in the second pass

Reads every record twice

to form the intermediate runs and then the final sorted file

Similar in spirit to using a selection tree when merging N lists!!

Page 57: Chap 8. Cosequential Processing                and the Sorting of Large Files

Two-step merge of 800 runs

The 800 initial runs are divided into 25 sets of 32 runs each; each set is merged into one intermediate run (25 separate 32-way merges), and the resulting 25 intermediate runs are then merged in a final 25-way merge:

...... 32 runs
...... 32 runs
...... 32 runs
   :
(25 sets x 32 runs) = 800 runs

Page 58: Chap 8. Cosequential Processing                and the Sorting of Large Files

Multiple-step merge(2)

Essence of multiple-step merging

increase the available buffer space for each run

extra pass vs. random access decrease

Can we do even better with more than two steps?

trade-offs between the seek&rotation time and the transmission time

major cost in merge sort

seek, rotation time, transmission time, buffer size, number of runs

Page 59: Chap 8. Cosequential Processing                and the Sorting of Large Files

Increasing Run Lengths Using Replacement Selection(1)

Facts of Life

Want to use heapsort in memory

Want to create longer output runs

Can we produce output runs longer than memory while still using an in-memory heap?

Replacement Selection

Idea

– always select the key from memory that has the lowest value

– output the key

– replace it with a new key from the input list

– use 2 heaps in the memory buffer

(continued...)

Page 60: Chap 8. Cosequential Processing                and the Sorting of Large Files

Increasing Run Lengths Using Replacement Selection(2)

Implementation

– step 1: read records and sort them using heapsort; this heap is the primary heap

– step 2: write out the record with the lowest key value

– step 3: bring in a new record and compare its key with that of the record just written

   – step 3-a: if the new key is higher, insert the new record into its proper place in the primary heap along with the other records selected for output

   – step 3-b: if the new key is lower, place the record in a secondary heap of records with key values lower than those already written out

– step 4: repeat step 3 as long as there are records in the primary heap and records to be read in; when the primary heap is empty, make the secondary heap the primary heap and repeat steps 2 and 3 (a sketch follows below)
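A compact sketch of this two-heap scheme (illustrative only: it uses std::priority_queue min-heaps over ints rather than the book's Heap class, and reads its input from a vector instead of a file); it returns the runs it forms, which for random input average about 2P keys each.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Replacement selection: form sorted runs from 'input' using P memory slots.
std::vector<std::vector<int>> FormRuns(const std::vector<int> & input, size_t P)
{
    using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap primary, secondary;                  // secondary holds keys saved for the next run
    std::vector<std::vector<int>> runs(1);
    size_t next = 0;

    while (next < input.size() && primary.size() < P)   // step 1: fill memory
        primary.push(input[next++]);

    while (!primary.empty() || !secondary.empty()) {
        if (primary.empty()) {                   // step 4: current run done, start the next one
            std::swap(primary, secondary);
            runs.push_back({});
        }
        int lowest = primary.top(); primary.pop();
        runs.back().push_back(lowest);           // step 2: output the lowest key in memory
        if (next < input.size()) {               // step 3: replace it with the next input key
            int key = input[next++];
            if (key >= lowest) primary.push(key);      // still fits in the current run
            else               secondary.push(key);    // too late for this run; save for the next
        }
    }
    return runs;
}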

Page 61: Chap 8. Cosequential Processing                and the Sorting of Large Files

Input: 21, 67, 12, 5, 47, 16
                            ^ front of input string (input is consumed from the right end)

Remaining input      Memory (P = 3)      Output run
21, 67, 12           5   47  16          --
21, 67               12  47  16          5
21                   67  47  16          12, 5
--                   67  47  21          16, 12, 5
--                   67  47  --          21, 16, 12, 5
--                   67  --  --          47, 21, 16, 12, 5
--                   --  --  --          67, 47, 21, 16, 12, 5

Example of the principle underlying replacement selection

(Heap sort!)

Page 62: Chap 8. Cosequential Processing                and the Sorting of Large Files

Replacement Selection(1)

What happens if a key arrives in memory too late to be output into its proper position relative to the other keys? (e.g., if the 4th key read is 2 rather than 12)

it goes into the second heap, to be included in the next run

refer to Figure 8.25 (p. 335)

Two questions

Given P locations in memory, how long a run can we expect replacement selection to produce, on average?

– On average, we can expect a run length of 2P

– Knuth provides an excellent description (pages 371-373)

(continued...)

Page 63: Chap 8. Cosequential Processing                and the Sorting of Large Files

Comparison of access times required to sort 8 million records, using both RAM sorts and replacement selection

Approach / records per seek to form runs / size of runs formed / merge order used / seeks required to form runs / total number of seeks / total seek & rotation delay time

800 RAM sorts followed by an 800-way merge: 10,000 / 10,000 / 800 / 1,600 / 681,600 / 4 hr 58 min

Replacement selection followed by a 534-way merge (records in random order): 2,500 / 15,000 / 534 / 6,400 / 521,134 / 3 hr 48 min

Replacement selection followed by a 200-way merge (records partially ordered): 2,500 / 40,000 / 200 / 200 / 206,400 / 1 hr 30 min

Page 64: Chap 8. Cosequential Processing                and the Sorting of Large Files

Step-by-step op. of replacement selection with 2 heaps working to form two sorted runs(1)

Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16
                                                      ^ front of input string

Remaining input                           Memory (P = 3)        Output run (A)
33, 18, 24, 58, 14, 17, 7, 21, 67, 12     5    47   16          --
33, 18, 24, 58, 14, 17, 7, 21, 67         12   47   16          5
33, 18, 24, 58, 14, 17, 7, 21             67   47   16          12, 5
33, 18, 24, 58, 14, 17, 7                 67   47   21          16, 12, 5
33, 18, 24, 58, 14, 17                    67   47  ( 7)         21, 16, 12, 5
33, 18, 24, 58, 14                        67  (17) ( 7)         47, 21, 16, 12, 5
33, 18, 24, 58                           (14) (17) ( 7)         67, 47, 21, 16, 12, 5

(Parenthesized keys are smaller than the last key output; they go into the second heap, for the next run.)

Page 65: Chap 8. Cosequential Processing                and the Sorting of Large Files

Step-by-step op. of replacement selection working to form two sorted runs(2)

First run complete; now start building the second

Remaining input      Memory (P = 3)      Output run (B)
33, 18, 24, 58       14  17   7          --
33, 18, 24           14  17  58          7
33, 18               24  17  58          14, 7
33                   24  18  58          17, 14, 7
--                   24  33  58          18, 17, 14, 7
--                   --  33  58          24, 18, 17, 14, 7
--                   --  --  58          33, 24, 18, 17, 14, 7
--                   --  --  --          58, 33, 24, 18, 17, 14, 7

Page 66: Chap 8. Cosequential Processing                and the Sorting of Large Files

Replacement Selection Plus Multiple Merging

The total number of seeks is less than for the one-step merges

The two-step merge requires transferring the data two more times than the one-step merge does

Two-step merges and replacement selection are still better, but the results are less dramatic

refer to the tables on the next slides


Page 67: Chap 8. Cosequential Processing                and the Sorting of Large Files

Approach Number ofRecords perSeek to Form Runs

MergePatternUsed

Numberof Seeksfor Sortsand Merges

Seek + RotationalDelayTime(min)

TotalPassesover theFile

Total Trans-missionTime(min)

Total of Seek,Rotation, andTransmissionTimes(min)

RAM sorts

replacementselection(records in random order)

replacementselection(records part -ially ordered)

Comparison of merges, considering transmission times(1):1-step merge

10,000

2,500

2,500

800-way

534-way

200-way

681,700

521,134

206,400

298

228

90

4

4

4

43

43

43

341

341

341

(continued...)


Page 68: Chap 8. Cosequential Processing                and the Sorting of Large Files

Approach Number ofRecords perSeek to Form Runs

MergePatternUsed

Numberof Seeksfor Sortsand Merges

Seek + RotationalDelayTime(min)

TotalPassesover theFile

Total Trans-missionTime(min)

Total of Seek,Rotation, andTransmissionTimes(min)

RAM sorts

replacementselection(records in random order)

replacementselection(records part -ially ordered)

Comparison of merges, considering transmission times(2):2-step merge

10,000

2,500

2,500

25 x 32-way(one 25-way)

19 x 28-way(one 19-way)

20 x 10-way(one 20-way)

127,200

124,438

110,400

56

55

48

6

6

6

65

65

65

121

120

113


Page 69: Chap 8. Cosequential Processing                and the Sorting of Large Files

Using Two Disks with Replacement Selection

Two disk drives

input & output can overlap

– reducing effective transmission time by up to 50%

seeking is virtually eliminated

Sort phase

the run selection & output can overlap

Merge phase

output disk becomes input disk, and vice versa

seeking will occur on input disk, output is sequential

substantially reducing merge & transmission time


Page 70: Chap 8. Cosequential Processing                and the Sorting of Large Files

Memory organization for replacement selection

RAM is divided into input buffers (filled from one disk drive), the heap used for replacement selection, and output buffers (written to the other disk drive).


Page 71: Chap 8. Cosequential Processing                and the Sorting of Large Files

More Drives? More Processors?

More drives?

Until I/O becomes so fast that processing cannot keep up with it

More processors?

mainframes

vector and array processors

massively parallel machines

very fast local area networks


Page 72: Chap 8. Cosequential Processing                and the Sorting of Large Files

Effects of Multiprogramming

Increase the efficiency of the overall system by overlapping processing and I/O

Effects are very hard to predict


Page 73: Chap 8. Cosequential Processing                and the Sorting of Large Files

A Concept Toolkit for External Sorting

For in-RAM sorting, use heapsort

Use as much RAM as possible

Use a multiple-step merge when

the number of initial runs is so large that seek and rotation time is much greater than transmission time

Use replacement selection when

the input may already be partially ordered

Use more than one disk drive and I/O channel

read/write can overlap

Look for ways to take advantage of new architecture and systems

parallel processing or high-speed networks


Page 74: Chap 8. Cosequential Processing                and the Sorting of Large Files

Sorting Files on Tape

Balanced merge with several tape drives

Each tape contains runs

Figure 8.28 (two-way balanced merge on 4 tapes), initial distribution:

Step 1   T1: R1 R3 R5 R7 R9
         T2: R2 R4 R6 R8 R10
         T3: --
         T4: --

If P is the number of passes, N is the number of initial runs, and k is the number of input drives, then P = ceiling(log_k N)

4 tape drives (2 for input, 2 for output), 10 runs ==> 4 passes

20 tape drives (10 for input, 10 for output), 200 runs ==> 3 passes
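A one-line check of the pass-count formula P = ceiling(log_k N) for the two cases above (k = number of input drives, N = number of initial runs):

#include <cmath>
#include <cstdio>

// Passes needed by a balanced k-way tape merge of N initial runs: ceiling(log_k N).
int Passes(int k, int N) { return (int) std::ceil(std::log((double) N) / std::log((double) k)); }

int main()
{
   std::printf("%d\n", Passes(2, 10));    //  4 drives (2 in, 2 out), 10 runs   -> 4 passes
   std::printf("%d\n", Passes(10, 200));  // 20 drives (10 in, 10 out), 200 runs -> 3 passes
   return 0;
}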

Page 75: Chap 8. Cosequential Processing                and the Sorting of Large Files

Sorting Files on Tape

Other balanced merge patterns (run lengths on each tape at each step)

(Fig 8.30)     T1             T2             T3         T4
Step 1         1 1 1 1 1      1 1 1 1 1      --         --
Step 2         --             --             2 2 2      2 2
Step 3         4 4            2              --         --
Step 4         --             --             --         10

(Fig 8.31)     T1             T2             T3         T4
Step 1         1 1 1 1 1      1 1 1 1 1      --         --
Step 2         1 1 1          1              --         3 3
Step 3         1 1            --             5          3
Step 4         1              4              5          --
Step 5         --             --             --         10

Page 76: Chap 8. Cosequential Processing                and the Sorting of Large Files

K-way Balanced Merge on Tapes

Some difficult questions

How does one choose an initial distribution that leads readily to an efficient merge pattern?

Are there algorithmic descriptions of the merge patterns, given an initial distribution?

Given N runs and J tape drives, is there some way to compute the optimal merging performance, so we have a yardstick against which to compare the performance of any specific algorithm?

Page 77: Chap 8. Cosequential Processing                and the Sorting of Large Files

Unix: Sorting and Cosequential Processing

Sorting in Unix

The Unix sort command

The qsort library routine

Cosequential processing utilities in Unix

compare two files: cmp

show differences between two files: diff

select lines common to two sorted files: comm

Page 78: Chap 8. Cosequential Processing                and the Sorting of Large Files

Let’s Review !!

8.1 Cosequential operations

8.2 Application of the Model to a General Ledger Program

8.3 Extension of the Model to Include Multiway Merging

8.4 A Second Look at Sorting in Memory

8.5 Merging as a Way of Sorting Large Files on Disk

8.6 Sorting Files on Tape

8.7 Sort-Merge Packages

8.8 Sorting and Cosequential Processing in Unix