MC9214 DATA STRUCTURES/UNIT‐3
CCET/MCA Page 1
UNIT III SORTING AND SEARCHING 9
General Background – Exchange sorts – Selection and Tree Sorting – Insertion Sorts –
Merge and Radix Sorts – Basic Search Techniques – Tree Searching – General Search Trees-
Hashing.
Introduction
Sorting and searching are fundamental operations in computer science. Sorting refers to
the operation of arranging data in some given order. Searching refers to the operation of
finding a particular record in a collection of existing information. Normally, information
retrieval involves searching, sorting and merging. In this chapter we discuss searching and
sorting techniques in detail.
After going through this unit you will be able to:
Know the fundamentals of sorting techniques
Know the different searching techniques
Discuss the algorithms of internal sorting and external sorting
Differentiate between internal sorting and external sorting
Analyze the complexity of each sorting technique
Discuss the algorithms of various searching techniques
Discuss merge sort
Discuss the algorithms of sequential search, binary search and binary tree search
Analyze the performance of searching methods
SEARCHING
Searching refers to the operation of finding the location of a given item in a collection of items.
The search is said to be successful if ITEM does appear in DATA and unsuccessful otherwise.
The following searching algorithms are discussed in this chapter.
1. Sequential Searching
2. Binary Search
3. Binary Tree Search
Sequential Search
This is the most natural searching method. The most intuitive way to search for a given ITEM in
DATA is to compare ITEM with each element of DATA one by one. The algorithm for a
sequential search procedure is now presented.
Algorithm
SEQUENTIAL SEARCH
INPUT : List of size N, target value T
OUTPUT : Position of T in the List, or an indication that T is not present
BEGIN
    Set FOUND := false
    Set I := 0
    While (I < N) and (FOUND is false)
        IF List[I] = T THEN
            Set FOUND := true
        ELSE
            Set I := I + 1
    IF FOUND is false THEN
        Report that T is not present in the List
END
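As a sketch, the sequential search above can be written in Python (the function and variable names are illustrative):

```python
def sequential_search(data, target):
    """Return the index of target in data, or -1 if it is not present."""
    for i, item in enumerate(data):   # compare target with each element one by one
        if item == target:
            return i                  # successful search
    return -1                         # reached the end: unsuccessful search
```

For example, `sequential_search([11, 22, 33], 22)` returns 1, while searching for a missing value returns -1.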
Binary Search
Suppose DATA is an array which is sorted in increasing numerical order. Then there is an
extremely efficient searching algorithm, called binary search, which can be used to find the
location LOC of a given ITEM of information in DATA.
The binary search algorithm applied to our array DATA works as follows. During each stage of
our algorithm, our search for ITEM is reduced to a segment of elements of DATA:
DATA[BEG], DATA[BEG + 1], DATA[BEG + 2], ...... DATA[END].
Note that the variables BEG and END denote the beginning and end locations of the segment
respectively. The algorithm compares ITEM with the middle element DATA[MID] of the
segment, where MID is obtained by
MID = INT((BEG + END) / 2)
(We use INT(A) for the integer value of A.) If DATA[MID] = ITEM, then the search is
successful and we set LOC := MID. Otherwise a new segment of DATA is obtained as follows:
(a) If ITEM < DATA[MID], then ITEM can appear only in the left half of the
segment: DATA[BEG], DATA[BEG + 1], ..., DATA[MID - 1]. So we reset END := MID - 1 and
begin searching again.
(b) If ITEM > DATA[MID], then ITEM can appear only in the right half of the
segment: DATA[MID + 1], DATA[MID + 2],....,DATA[END]
So we reset BEG := MID + 1 and begin searching again.
Initially, we begin with the entire array DATA; i.e. we begin with BEG = 1
and END = n. If ITEM is not in DATA, then eventually we obtain END < BEG.
This condition signals that the search is unsuccessful, and in this case we assign LOC :=
NULL. Here NULL is a value that lies outside the set of indices of DATA. We now formally
state the binary search algorithm.
Algorithm 2.9:
(Binary Search) BINARY(DATA, LB, UB, ITEM, LOC)
Here DATA is a sorted array with lower bound LB and upper bound
UB, and ITEM is a given item of information. The variables BEG,
END and MID denote, respectively, the beginning, end and middle
locations of a segment of elements of DATA. This algorithm finds
the location LOC of ITEM in DATA or sets LOC=NULL.
1. [Initialize segment variables.]
Set BEG := LB, END := UB and MID := INT((BEG + END)/2).
2. Repeat Steps 3 and 4 while BEG ≤ END and
DATA[MID] ≠ ITEM.
3. If ITEM<DATA[MID], then:
Set END := MID - 1.
Else:
Set BEG := MID + 1.
[End of If structure]
4. Set MID := INT((BEG + END)/2).
[End of Step 2 loop.]
5. If DATA[MID] = ITEM, then:
Set LOC:=MID.
Else:
Set LOC := NULL.
[End of If structure.]
6. Exit.
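A Python sketch of this algorithm (0-based indices here, unlike the 1-based pseudocode, and `None` standing in for NULL):

```python
def binary_search(data, item):
    """Return the index of item in the sorted list data, or None (NULL) if absent."""
    beg, end = 0, len(data) - 1
    while beg <= end:
        mid = (beg + end) // 2        # MID = INT((BEG + END) / 2)
        if data[mid] == item:
            return mid                # successful search
        elif item < data[mid]:
            end = mid - 1             # ITEM can appear only in the left half
        else:
            beg = mid + 1             # ITEM can appear only in the right half
    return None                       # END < BEG: unsuccessful search
```

Applied to the 13-element array of Example 2.9, `binary_search([11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99], 40)` returns 4, which is position 5 in the example's 1-based numbering.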
Example 2.9
Let DATA be the following sorted 13-element array:
DATA: 11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99
We apply the binary search to DATA for different values of ITEM.
(a) Suppose ITEM = 40. The search for ITEM in the array DATA is pictured below, where the
values of DATA[BEG] and DATA[END] at each stage of the algorithm are indicated by
parentheses and the value of DATA[MID] is shown in bold. Specifically, BEG, END and MID will
have the following successive values:
(1) Initially, BEG = 1 and END = 13. Hence,
MID = INT[(1 + 13) / 2 ] = 7 and so DATA[MID] = 55
(2) Since 40 < 55, END = MID – 1 = 6. Hence,
MID = INT[(1 + 6) / 2 ] = 3 and so DATA[MID] = 30
(3) Since 40 > 30, BEG = MID + 1 = 4. Hence,
MID = INT[(4 + 6) / 2 ] = 5 and so DATA[MID] = 40
The search is successful and LOC = MID = 5.
(1) (11), 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, (99)
(2) (11), 22, 30, 33, 40, (44), 55, 60, 66, 77, 80, 88, 99
(3) 11, 22, 30, (33), 40, (44), 55, 60, 66, 77, 80, 88, 99 [Successful]
Complexity of the Binary Search Algorithm
The complexity is measured by the number of comparisons f(n) to locate ITEM in DATA
where DATA contains n elements. Observe that each comparison reduces the sample size in half.
Hence we require at most f(n) comparisons to locate ITEM, where
f(n) = ⌊log2 n⌋ + 1
That is, the running time for the worst case is approximately equal to log2 n. The running
time for the average case is approximately equal to the running time for the worst case.
Limitations of the Binary Search Algorithm
The algorithm requires two conditions:
(1) the list must be sorted and
(2) one must have direct access to the middle element in any sublist.
Binary Search Tree
Suppose T is a binary tree. Then T is called a binary search tree if each node N of T has
the following property:
“The value at N is greater than every value in the left subtree of N and is less than every
value in the right subtree of N.”
Binary Search Tree (T)
SEARCHING AND INSERTING IN BINARY SEARCH TREES
Suppose an ITEM of information is given. The following algorithm finds the location of
ITEM in the binary search tree T, or inserts ITEM as a new node in its appropriate place in the
tree.
(a) Compare ITEM with the root node N of the tree.
(i) If ITEM < N, proceed to the left child of N.
(ii) If ITEM > N, proceed to the right child of N.
(b) Repeat Step (a) until one of the following occurs:
(i) We meet a node N such that ITEM = N. In this case the search is successful.
(ii) We meet an empty subtree, which indicates that the search is unsuccessful, and we
insert ITEM in place of the empty subtree.
In other words, proceed from the root R down through the tree T until finding ITEM in T
or inserting ITEM as a terminal node in T.
Example 2.11
Consider the binary search tree T in Fig. 2.6. Suppose ITEM = 20 is given. Compare ITEM
= 20 with the root, 38, of the tree T. Since 20 < 38, proceed to the left child of 38, which is 14.
1. Compare ITEM = 20 with 14. Since 20 > 14, proceed to the right child of 14, which is 23.
2. Compare ITEM = 20 with 23. Since 20 < 23, proceed to the left child of 23, which is 18.
3. Compare ITEM = 20 with 18. Since 20 > 18 and 18 does not have a right child, insert 20 as
the right child of 18.
ITEM=20 inserted
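The search-and-insert procedure above can be sketched in Python (the Node class and function name are illustrative; duplicate values are assumed not to occur):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def search_or_insert(root, item):
    """Return the node containing item, inserting it as a new leaf if absent."""
    if root is None:
        return Node(item)                 # empty tree: the new node becomes the root
    node = root
    while True:
        if item == node.value:
            return node                   # successful search
        elif item < node.value:
            if node.left is None:
                node.left = Node(item)    # insert in place of the empty left subtree
                return node.left
            node = node.left              # proceed to the left child
        else:
            if node.right is None:
                node.right = Node(item)   # insert in place of the empty right subtree
                return node.right
            node = node.right             # proceed to the right child
```

Inserting 14, 23, 18 and then 20 under a root of 38 reproduces the trace of Example 2.11: 20 ends up as the right child of 18.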
DELETING IN A BINARY SEARCH TREE
Suppose T is a binary search tree, and suppose an ITEM of information is given. This
section gives an algorithm which deletes ITEM from the tree T. Let N denote the node of T
containing ITEM; there are three cases, depending on the number of children of N.
Case 1. N has no children. Then N is deleted from T by simply replacing the location of N in
the parent node P(N) by the null pointer.
Case 2. N has exactly one child. Then N is deleted from T by simply replacing the location of
N in P(N) by the location of the only child of N.
Case 3. N has two children. Let S(N) denote the inorder successor of N. ( The reader can
verify that S(N) does not have a left child.) Then N is deleted from T by first deleting
S(N) from T (by using Case 1 or Case 2) and then replacing node N in T by the node S(N).
Observe that the third case is much more complicated than the first two cases. In all three
cases, the memory space of the deleted node N is returned to the AVAIL list.
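A recursive Python sketch of the three cases (the Node class and function name are illustrative; the AVAIL-list bookkeeping is left to Python's garbage collector):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def delete(root, item):
    """Delete the node containing item from the BST rooted at root; return the new root."""
    if root is None:
        return None
    if item < root.value:
        root.left = delete(root.left, item)
    elif item > root.value:
        root.right = delete(root.right, item)
    else:
        # Cases 1 and 2: N has at most one child, which replaces N.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case 3: N has two children -- find the inorder successor S(N).
        succ = root.right
        while succ.left is not None:          # S(N) has no left child
            succ = succ.left
        root.value = succ.value               # replace N's value by that of S(N)
        root.right = delete(root.right, succ.value)  # delete S(N) (Case 1 or 2)
    return root
```

Deleting a node with two children (say 14, with children 8 and 23) replaces it by its inorder successor, as described in Case 3.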
(a) Before deletions. (b) Linked representation.
(a) Node 44 is deleted
(b) Linked representation
Sorting Methods
The task of sorting, or ordering, a list of objects according to some linear order is so
fundamental that it is ubiquitous in engineering applications in all disciplines.
There are two broad categories of sorting methods:
Internal sorting takes place in the main memory, where we can take advantage of the
random access nature of the main memory.
External sorting is necessary when the number and size of objects are prohibitive to be
accommodated in the main memory.
· Given records r1, r2, ..., rn, with key values k1, k2, ..., kn, produce the records in the order
ri1, ri2, ..., rin, such that ki1 ≤ ki2 ≤ ... ≤ kin.
The complexity of a sorting algorithm can be measured in terms of:
· the number of algorithm steps to sort n records
· the number of comparisons between keys (appropriate when the keys are long character strings)
· the number of times records must be moved (appropriate when record size is large)
· Any sorting algorithm that uses comparisons of keys needs at least on the order of n log n
comparisons to accomplish the sorting.
Sorting Methods
Internal (in memory): quick sort, heap sort, bubble sort, insertion sort, selection sort, shell sort
External (appropriate for secondary storage): merge sort, radix sort, polyphase sort
Insertion Sort
The general idea of the insertion sort method is that for each element, find the slot where it
belongs.
Example
• The element in position Array[0] is certainly sorted.
• Thus, move on to insert the second character, D,
into the appropriate location to maintain the alphabetical order.
How does it work?
• Each element Array[j] is taken one at a time,
for j = 1 to n-1.
• Before insertion of Array[j], the subarray from Array[0] to Array[j-1] is sorted, and the
remainder of the array is not.
• After insertion, Array[0…j] is correctly ordered, while the subarray with elements
Array[j+1]…..Array[n-1] is unsorted.
Insertion Sort Algorithm
for i = 1 to n-1
    temp = a[i]
    loc = i
    while (loc > 0 && a[loc-1] > temp)
        a[loc] = a[loc-1]
        loc = loc - 1
    a[loc] = temp
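A direct Python rendering of this pseudocode (a sketch; the function name is illustrative):

```python
def insertion_sort(a):
    """Sort the list a in place by inserting each element into the sorted prefix."""
    for i in range(1, len(a)):
        temp = a[i]                        # element to insert
        loc = i
        while loc > 0 and a[loc - 1] > temp:
            a[loc] = a[loc - 1]            # slide larger elements one position right
            loc -= 1
        a[loc] = temp                      # drop the element into its slot
    return a
```

For example, `insertion_sort([5, 2, 4, 6, 1, 3])` yields `[1, 2, 3, 4, 5, 6]`.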
Insertion sort
• The initial state is that the first element, considered by itself, is sorted.
• The final state is that all elements, considered as a group, are sorted.
• The basic action is to arrange the elements in positions 0 through ‘i’. In each stage ‘i’
increases by 1. The outer loop controls this.
• When the body of the outer loop is entered, we know that elements at positions 0 through
‘i-1’ are sorted, and we need to extend this to positions 0 through ‘i’.
• At each step the element indexed by ‘i’ needs to be added to the sorted part of the array. This is
done by placing it in a temporary variable and sliding all elements larger than it one position
to the right.
• Then the temporary element is copied into the vacated position. The counter ‘loc’
indicates this position.
Complexity
• Best situation: the data is already sorted. The inner loop is never executed, and the outer
loop is executed n – 1 times for total complexity of O(n).
• Worst situation: data in reverse order. The inner loop is executed the maximum number
of times. Thus the complexity of the insertion sort in this worst possible case is quadratic, or
O(n^2).
Selection Sort
In this sorting method we find the smallest element in the list and put it in the first position. Then
we find the second smallest element in the list and put it in the second position, and so on.
Pass 1. Find the location LOC of the smallest in the list of N elements A[1], A[2], . . . , A[N], and
then interchange A[LOC] and A[1]. Then A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of N - 1 elements A[2], A[3], . . . ,
A[N], and then interchange A[LOC] and A[2]. Then A[1], A[2] is sorted, since A[1] < A[2].
Pass 3. Find the location LOC of the smallest in the sublist of N - 2 elements A[3], A[4], . . . ,
A[N], and then interchange A[LOC] and A[3]. Then A[1], A[2], A[3] is sorted, since
A[2] < A[3].
………………………………
Pass N - 1. Find the location LOC of the smaller of the elements A[N - 1],
A[N], and then interchange A[LOC] and A[N - 1]. Then A[1],
A[2], . . . , A[N] is sorted, since A[N - 1] < A[N]. Thus A is sorted after N - 1 passes.
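The passes above can be sketched in Python (0-based indices; the function name is illustrative):

```python
def selection_sort(a):
    """Sort a in place: on pass i, put the smallest remaining element at position i."""
    n = len(a)
    for i in range(n - 1):                # passes 1 .. N-1
        loc = i
        for j in range(i + 1, n):         # find location LOC of the smallest in a[i..n-1]
            if a[j] < a[loc]:
                loc = j
        a[i], a[loc] = a[loc], a[i]       # interchange A[LOC] and A[i]
    return a
```

For example, `selection_sort([64, 25, 12, 22, 11])` yields `[11, 12, 22, 25, 64]`.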
Hashing
Accessing elements in an array is extremely efficient. Array elements are accessed by index. If
we can find a mapping between the search keys and indices, we can store each record in the
element with the corresponding index. Thus each element would be found with one operation
only.
Advantage: the records can be referenced directly - ideally the search time is a constant,
with complexity O(1).
Question: how to find such correspondence?
Answers:
direct access tables
hash tables
Direct-address tables
Direct-address tables – the most elementary form of hashing.
Assumption – a direct one-to-one correspondence between the keys and the numbers 0, 1, …, m-1,
with m not very large.
Searching is fast, but there is a cost – the size of the array we need is the size of the largest key.
Not very useful if only a few keys are widely distributed.
Hash functions
Hash function: function that transforms the search key into a table address.
Hash functions transform the keys into numbers within a predetermined interval. These numbers
are then used as indices in an array (table, hash table) to store the records
• Keys – numbers.
If M is the size of the array, then h(key) = key % M.
This will map all the keys into numbers within the interval [0, M-1].
• Keys – strings of characters
Treat the binary representation of a key as a number, and then apply the first case.
How keys are treated as numbers: if each character is represented with m bits,
then the string can be treated as a base-2^m number.
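A sketch of the two cases above (the table size M = 97 is an illustrative prime; 8-bit character codes are assumed, so m = 8 and the base is 2^8 = 256):

```python
M = 97  # table size, chosen prime (illustrative)

def hash_number(key):
    """Numeric key: h(key) = key % M, mapping into the interval [0, M-1]."""
    return key % M

def hash_string(key):
    """String key: treat the 8-bit character codes as digits of a base-256 number."""
    value = 0
    for ch in key:
        value = value * 256 + ord(ch)   # shift left by m = 8 bits, add the next character
    return value % M                    # then apply the numeric case
```

For example, `hash_string("a")` is `ord("a") % 97 = 97 % 97 = 0`.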
Hash tables: Basic concepts
Once we have found the method of mapping keys to indices, the questions to be solved are how to
choose the size of the table (array) to store the records, and how to perform the basic operations:
Insert
Search
Delete
Let N be the number of the records to be stored, and M the size of the array (hash table). The
integer between 0 and M-1 generated by a hash function is used as an index in a hash table of
M elements.
Initially all slots in the table are blank. This is indicated either by a sentinel value or by a
special field in each slot.
To insert, use the hash function to generate an address for each value to be inserted.
To search for a key in the table, the same hash function is used.
To delete a record with a given key, we first apply the search method, and when the key is found
we delete the record.
Size of the table: Ideally we would like to store N records in a table with size N. However, in
many cases we don't know in advance the exact number of records. Also, the hash function can
map two keys to one and the same index, and some cells in the array will not be used. Hence we
assume that the size of the table can be different from the number of the records. We use M
to denote the size of the table.
A characteristic of the hash table is its
load factor L = N/M: the ratio between the number of records to be stored and the size
of the table. The method to choose the size of the table depends on the chosen method of
collision resolution, discussed below.
M should be a prime number.
It has been proved that if M is a prime number, we obtain better (more even) distribution
of the keys over the table.
Collision resolution
A collision occurs when two or more keys hash to one and the same index in the hash table.
Collision resolution deals with keys that are mapped to the same index.
Methods:
Separate chaining
Open addressing
o Linear probing
o Quadratic probing
o Double hashing
SEPARATE CHAINING
Complexity of separate chaining
The time to compute the index of a given key is a constant. Then we have to search in a list for
the record. Therefore the time depends on the length of the lists. It has been shown empirically
that on average the list length is N/M (the load factor L), provided M is prime and we use a
function that gives good distribution.
Unsuccessful searches go to the end of some list, hence we have L comparisons. Successful
searches are expected to go half way down some list; on average, the number of comparisons
in a successful search is L/2. Therefore we can say that the runtime complexity of separate
chaining is O(L). Note that what really matters is the load factor, rather than the size of the
table or the number of records taken separately.
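A minimal separate-chaining table can be sketched as follows (Python lists stand in for the linked lists, and Python's built-in `hash` stands in for the hash function; all names are illustrative):

```python
class ChainedHashTable:
    def __init__(self, m=11):                  # M chosen prime
        self.m = m
        self.slots = [[] for _ in range(m)]    # one chain per slot

    def insert(self, key, value):
        self.slots[hash(key) % self.m].append((key, value))

    def search(self, key):
        for k, v in self.slots[hash(key) % self.m]:  # walk the chain: O(L) on average
            if k == key:
                return v                       # successful search
        return None                            # unsuccessful: reached the end of the chain

    def delete(self, key):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]                   # search first, then remove the record
                return True
        return False
```

Usage: `t = ChainedHashTable(7); t.insert("apple", 1); t.search("apple")` returns 1.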
How to choose M in separate chaining?
Since the method is used in cases when we cannot predict the number of records in advance, the
choice of M basically depends on other factors such as available memory. Typically M is chosen
relatively small so as not to use up a large area of contiguous memory, but large enough so that
the lists are short enough for efficient sequential search. Recommendations in the literature vary
from M being about one tenth of N (the number of records) to M being equal (or close) to N.
Other methods of chaining:
Keep the lists ordered: useful if there are many more searches than inserts,
and if most of the searches are unsuccessful.
Represent the chains as binary search trees: extra effort needed – not efficient.
Advantages of separate chaining – used when memory is of concern; easily implemented.
Disadvantages – with unevenly distributed keys, some lists grow long while many slots in the table stay empty.
Open addressing
Invented by A. P. Ershov and W. W. Peterson in 1957 independently.
Idea: Store collisions in the hash table itself.
The method uses a collision resolution function in addition to the hash function.
If collision occurs, next probes are performed following the formula:
hi(x) = (hash(x) + f(i)) mod TableSize
where:
hash(x) is the hash function
f(i) is the collision resolution function
i is the number of the current attempt (probe) to insert an element.
Linear probing (linear hashing, sequential probing): f(i) = i
Insert: When there is a collision we just probe the next slot in the table.
If it is unoccupied – we store the key there.
If it is occupied – we continue probing the next slot.
Search: If the key hashes to a position that is occupied and there is no match,
we probe the next position.
a) match – successful search
b) empty position – unsuccessful search
c) occupied and no match – continue probing.
When the end of the table is reached, the probing continues from the beginning,
until the original starting position is reached.
Problems with delete: a special flag is needed to distinguish deleted from empty positions.
This is necessary for the search function – if we come to a "deleted" position,
the search has to continue as the deletion might have been done after
the insertion of the key we are looking for, and it might be further in the table.
Here is an example of Linear probing
The total amount of memory space needed is less, since no pointers are maintained.
Disadvantage: " Primary clustering"
Large clusters tend to build up: if an empty slot is preceded by i filled slots, the probability that
the empty slot is the next one to be filled is (i+1)/M.
If the preceding slot was empty, the probability is 1/M.
This means that when the table begins to fill up, many other slots are examined.
Linear probing runs slowly for nearly full tables.
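The insert, search and delete rules above, including the special "deleted" flag, can be sketched as follows (a toy table for integer keys; names and the table size are illustrative, and the table is assumed never to be completely full when inserting):

```python
EMPTY, DELETED = object(), object()   # sentinels: blank slot, and the special "deleted" flag

class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [EMPTY] * m

    def insert(self, key):
        i = key % self.m
        while self.slots[i] not in (EMPTY, DELETED):  # occupied: probe the next slot
            i = (i + 1) % self.m                      # wrap around at the end of the table
        self.slots[i] = key

    def search(self, key):
        i = key % self.m
        for _ in range(self.m):           # stop after one full cycle of the table
            if self.slots[i] is EMPTY:
                return None               # empty position: unsuccessful search
            if self.slots[i] == key:
                return i                  # match: successful search
            i = (i + 1) % self.m          # occupied (or deleted) and no match: keep probing
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED       # mark deleted; must NOT become EMPTY,
                                          # or later searches would stop too early
```

Note how `search` continues past DELETED slots: the key being sought may have been inserted before the deletion and so may lie further along the probe sequence.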
Quadratic probing: f(i) = i^2
A quadratic function is used to compute the next index in the table to be probed.
Example:
In linear probing we check the i-th position. If it is occupied, we check the (i+1)st position,
next the (i+2)nd, etc.
In quadratic probing, if the i-th position is occupied we check the (i+1)st,
next the (i+4)th, next the (i+9)th, etc.
The idea here is to skip regions in the table with possible clusters.
Double hashing: f(i) = i * hash2(x)
Purpose – same as in quadratic probing : to overcome the disadvantage of clustering.
Instead of examining each successive entry following a collided position, we use
a second hash function to get a fixed increment for the "probe" sequence.
The second function should be chosen so that the increment and M are relatively prime.
Otherwise there will be slots that would remain unexamined.
Example: hash2(x) = R - (x mod R), where R is a prime smaller than TableSize.
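The probe sequence h_i(x) = (hash(x) + i * hash2(x)) mod TableSize can be sketched as follows (M = 11 and R = 7 are illustrative primes; the primary hash is simply x mod M):

```python
M = 11        # table size, prime
R = 7         # R: a prime smaller than the table size

def probe_sequence(x, attempts=5):
    """First few slots examined for key x: h_i(x) = (x % M + i * hash2(x)) % M."""
    step = R - (x % R)                  # hash2(x): fixed increment, never zero
    return [(x % M + i * step) % M for i in range(attempts)]
```

For example, for x = 3 the increment is 7 - 3 = 4, so the probes visit slots 3, 7, 0, 4, 8. Because the increment and M are relatively prime, the sequence eventually visits every slot.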
In open addressing the load factor L is less than 1.
Good strategy is to keep L < 0.5
Rehashing
If the table is close to full, the search time grows and may become equal to the table size.
When the load factor exceeds a certain value (e.g. greater than 0.5) we do rehashing:
Build a second table twice as large as the original
and rehash there all the keys of the original table.
Rehashing is an expensive operation, with running time O(N).
However, once done, the new hash table will have good performance.
Extendible hashing
Used when the amount of data is too large to fit in main memory and external storage is used.
N records in total to store, M records in one disk block
The problem: in ordinary hashing several disk blocks may be examined to find an element -
a time consuming process.
Extendible hashing: no more than two blocks are examined.
Idea:
Keys are grouped according to the first m bits in their code.
Each group is stored in one disk block.
If the block becomes full and no more records can be inserted, each group is split into
two,
and m+1 bits are considered to determine the location of a record.
Example: let's say we have 4 groups of keys according to the first two bits:
00 01 10 11
00010 01001 10001 11000
00100 01010 10100 11010
01100
Each disk block in the example can contain 3 records only, so 4 blocks are needed to store the
above keys.
New key to be inserted: 01011.
Block 2 (the 01 group) is full, so we start considering 3 bits:
000/001 010 011 100/101 110/111
(still on same block)
00010 01001 01100 10001 11000
00100 01010 10100 11010
01011
The second group of keys is split onto two disk blocks – one for keys starting with 010,
and one for keys starting with 011.
A directory is maintained in main memory with pointers to the disk blocks for each bit pattern.
The size of the directory is 2^D = O(N^(1+1/M)/M), where
D - number of bits considered
N - number of records
M - number of disk blocks.
Conclusion
Hashing is the best search method (constant running time) if we don't need to have the records
sorted.
The choice of the hash function remains the most difficult part of the task and depends very
much on the nature of the keys.
Separate chaining or open addressing?
Open addressing is the preferred method if there is enough memory
to keep a table twice as large as the number of the records.
Separate chaining is used when we don't know in advance the number of the records to
be stored. Though it requires additional time for list processing, it is simpler to
implement.
Some application areas
Dictionaries, on-line spell checkers, compiler symbol tables.