MC9214 DATA STRUCTURES/UNIT‐3
CCET/MCA Page 1
UNIT III SORTING AND SEARCHING 9
General Background – Exchange sorts – Selection and Tree Sorting – Insertion Sorts –
Merge and Radix Sorts – Basic Search Techniques – Tree Searching – General Search Trees-
Hashing.
Introduction
Sorting and searching are fundamental operations in computer science. Sorting refers to
the operation of arranging data in some given order. Searching refers to the operation of
finding a particular record in a collection of existing information. Normally, information
retrieval involves searching, sorting and merging. In this chapter we discuss searching and
sorting techniques in detail.
After going through this unit you will be able to:
Know the fundamentals of sorting techniques
Know the different searching techniques
Discuss the algorithms of internal sorting and external sorting
Differentiate between internal sorting and external sorting
Analyze the complexity of each sorting technique
Discuss the algorithms of various searching techniques
Discuss merge sort
Discuss the algorithms of sequential search, binary search and binary tree search
Analyze the performance of searching methods
SEARCHING
Searching refers to the operation of finding the location of a given item in a collection of items.
The search is said to be successful if ITEM does appear in DATA and unsuccessful otherwise.
The following searching algorithms are discussed in this chapter.
1. Sequential Searching
2. Binary Search
3. Binary Tree Search
Sequential Search
This is the most natural searching method. The most intuitive way to search for a given ITEM in
DATA is to compare ITEM with each element of DATA one by one. The algorithm for a
sequential search procedure is now presented.
Algorithm
SEQUENTIAL SEARCH
INPUT : List of size N, target value T
OUTPUT : Position of T in the List, or an indication that T is not present
BEGIN
    Set FOUND := false
    Set I := 0
    While (I < N) and (FOUND is false)
        IF List[I] = T THEN
            Set FOUND := true
        ELSE
            Set I := I + 1
    IF FOUND is false THEN
        Report that T is not present in the List
END
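As a sketch, the sequential search above can be written in Python (the function and variable names are illustrative):

```python
def sequential_search(data, target):
    """Return the index of target in data, or -1 if it is not present."""
    for i, item in enumerate(data):   # compare target with each element one by one
        if item == target:
            return i                  # successful search
    return -1                         # reached the end: unsuccessful search
```

For example, `sequential_search([11, 22, 33], 22)` returns 1, while searching for a missing value returns -1.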
Binary Search
Suppose DATA is an array which is sorted in increasing numerical order. Then there is an
extremely efficient searching algorithm, called binary search, which can be used to find the
location LOC of a given ITEM of information in DATA.
The binary search algorithm applied to our array DATA works as follows. During each stage of
our algorithm, our search for ITEM is reduced to a segment of elements of DATA:
DATA[BEG], DATA[BEG + 1], DATA[BEG + 2], ...... DATA[END].
Note that the variables BEG and END denote the beginning and end locations of the segment
respectively. The algorithm compares ITEM with the middle element DATA[MID] of the
segment, where MID is obtained by
MID = INT((BEG + END) / 2)
(We use INT(A) for the integer value of A.) If DATA[MID] = ITEM, then the search is
successful and we set LOC := MID. Otherwise a new segment of DATA is obtained as follows:
(a) If ITEM < DATA[MID], then ITEM can appear only in the left half of the
segment: DATA[BEG], DATA[BEG + 1], ..., DATA[MID - 1]. So we reset END := MID - 1 and
begin searching again.
(b) If ITEM > DATA[MID], then ITEM can appear only in the right half of the
segment: DATA[MID + 1], DATA[MID + 2],....,DATA[END]
So we reset BEG := MID + 1 and begin searching again.
Initially, we begin with the entire array DATA; i.e. we begin with BEG = 1
and END = n. If ITEM is not in DATA, then eventually we obtain END < BEG.
This condition signals that the search is unsuccessful, and in this case we assign LOC :=
NULL. Here NULL is a value that lies outside the set of indices of DATA. We now formally
state the binary search algorithm.
Algorithm 2.9:
(Binary Search) BINARY(DATA, LB, UB, ITEM, LOC)
Here DATA is a sorted array with lower bound LB and upper bound
UB, and ITEM is a given item of information. The variables BEG,
END and MID denote, respectively, the beginning, end and middle
locations of a segment of elements of DATA. This algorithm finds
the location LOC of ITEM in DATA or sets LOC=NULL.
1. [Initialize segment variables.]
Set BEG := LB, END := UB and MID := INT((BEG + END)/2).
2. Repeat Steps 3 and 4 while BEG ≤ END and
DATA[MID] ≠ ITEM.
3. If ITEM<DATA[MID], then:
Set END := MID - 1.
Else:
Set BEG := MID + 1.
[End of If structure]
4. Set MID := INT((BEG + END)/2).
[End of Step 2 loop.]
5. If DATA[MID] = ITEM, then:
Set LOC:=MID.
Else:
Set LOC := NULL.
[End of If structure.]
6. Exit.
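A Python sketch of this algorithm (0-based indices here, unlike the 1-based pseudocode, and `None` standing in for NULL):

```python
def binary_search(data, item):
    """Return the index of item in the sorted list data, or None (NULL) if absent."""
    beg, end = 0, len(data) - 1
    while beg <= end:
        mid = (beg + end) // 2        # MID = INT((BEG + END) / 2)
        if data[mid] == item:
            return mid                # successful search
        elif item < data[mid]:
            end = mid - 1             # ITEM can appear only in the left half
        else:
            beg = mid + 1             # ITEM can appear only in the right half
    return None                       # END < BEG: unsuccessful search
```

Applied to the 13-element array of Example 2.9, `binary_search([11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99], 40)` returns 4, which is position 5 in the example's 1-based numbering.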
Example 2.9
Let DATA be the following sorted 13-element array:
DATA: 11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99
We apply the binary search to DATA for different values of ITEM.
(a) Suppose ITEM = 40. The search for ITEM in the array DATA is pictured below, where the
values of DATA[BEG] and DATA[END] at each stage of the algorithm are indicated by
parentheses and the value of DATA[MID] is shown in bold. Specifically, BEG, END and MID will
have the following successive values:
(1) Initially, BEG = 1 and END = 13. Hence,
MID = INT[(1 + 13) / 2 ] = 7 and so DATA[MID] = 55
(2) Since 40 < 55, END = MID – 1 = 6. Hence,
MID = INT[(1 + 6) / 2 ] = 3 and so DATA[MID] = 30
(3) Since 40 > 30, BEG = MID + 1 = 4. Hence,
MID = INT[(4 + 6) / 2 ] = 5 and so DATA[MID] = 40
The search is successful and LOC = MID = 5.
(1) (11), 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, (99)
(2) (11), 22, 30, 33, 40, (44), 55, 60, 66, 77, 80, 88, 99
(3) 11, 22, 30, (33), 40, (44), 55, 60, 66, 77, 80, 88, 99 [Successful]
Complexity of the Binary Search Algorithm
The complexity is measured by the number of comparisons f(n) to locate ITEM in DATA
where DATA contains n elements. Observe that each comparison reduces the sample size in half.
Hence we require at most f(n) comparisons to locate ITEM, where
f(n) = ⌊log2 n⌋ + 1
That is, the running time for the worst case is approximately equal to log2 n. The running
time for the average case is approximately equal to the running time for the worst case.
Limitations of the Binary Search Algorithm
The algorithm requires two conditions:
(1) the list must be sorted and
(2) one must have direct access to the middle element in any sublist.
Binary Search Tree
Suppose T is a binary tree. Then T is called a binary search tree if each node N of T has
the following property:
“The value at N is greater than every value in the left subtree of N and is less than every
value in the right subtree of N.”
Binary Search Tree (T)
SEARCHING AND INSERTING IN BINARY SEARCH TREES
Suppose an ITEM of information is given. The following algorithm finds the location of
ITEM in the binary search tree T, or inserts ITEM as a new node in its appropriate place in the
tree.
(a) Compare ITEM with the root node N of the tree.
(i) If ITEM < N, proceed to the left child of N.
(ii) If ITEM > N, proceed to the right child of N.
(b) Repeat Step (a) until one of the following occurs:
(i) We meet a node N such that ITEM = N. In this case the search is successful.
(ii) We meet an empty subtree, which indicates that the search is unsuccessful, and we
insert ITEM in place of the empty subtree.
In other words, proceed from the root R down through the tree T until finding ITEM in T
or inserting ITEM as a terminal node in T.
Example 2.11
Consider the binary search tree T in Fig. 2.6. Suppose ITEM = 20 is given. Compare ITEM
= 20 with the root, 38, of the tree T. Since 20 < 38, proceed to the left child of 38, which is 14.
1. Compare ITEM = 20 with 14. Since 20 > 14, proceed to the right child of 14, which is 23.
2. Compare ITEM = 20 with 23. Since 20 < 23, proceed to the left child of 23, which is 18.
3. Compare ITEM = 20 with 18. Since 20 > 18 and 18 does not have a right child, insert 20 as
the right child of 18.
ITEM=20 inserted
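The search-and-insert procedure above can be sketched in Python (the Node class and function name are illustrative; duplicate values are assumed not to occur):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def search_or_insert(root, item):
    """Return the node containing item, inserting it as a new leaf if absent."""
    if root is None:
        return Node(item)                 # empty tree: the new node becomes the root
    node = root
    while True:
        if item == node.value:
            return node                   # successful search
        elif item < node.value:
            if node.left is None:
                node.left = Node(item)    # insert in place of the empty left subtree
                return node.left
            node = node.left              # proceed to the left child
        else:
            if node.right is None:
                node.right = Node(item)   # insert in place of the empty right subtree
                return node.right
            node = node.right             # proceed to the right child
```

Inserting 14, 23, 18 and then 20 under a root of 38 reproduces the trace of Example 2.11: 20 ends up as the right child of 18.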
DELETING IN A BINARY SEARCH TREE
Suppose T is a binary search tree, and suppose an ITEM of information is given. This
section gives an algorithm which deletes ITEM from the tree T. Let N denote the node of T
containing ITEM; there are three cases, depending on the number of children of N.
Case 1. N has no children. Then N is deleted from T by simply replacing the location of N in
the parent node P(N) by the null pointer.
Case 2. N has exactly one child. Then N is deleted from T by simply replacing the location of
N in P(N) by the location of the only child of N.
Case 3. N has two children. Let S(N) denote the inorder successor of N. ( The reader can
verify that S(N) does not have a left child.) Then N is deleted from T by first deleting
S(N) from T (by using Case 1 or Case 2) and then replacing node N in T by the node S(N).
Observe that the third case is much more complicated than the first two cases. In all three
cases, the memory space of the deleted node N is returned to the AVAIL list.
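A recursive Python sketch of the three cases (the Node class and function name are illustrative; the AVAIL-list bookkeeping is left to Python's garbage collector):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def delete(root, item):
    """Delete the node containing item from the BST rooted at root; return the new root."""
    if root is None:
        return None
    if item < root.value:
        root.left = delete(root.left, item)
    elif item > root.value:
        root.right = delete(root.right, item)
    else:
        # Cases 1 and 2: N has at most one child, which replaces N.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case 3: N has two children -- find the inorder successor S(N).
        succ = root.right
        while succ.left is not None:          # S(N) has no left child
            succ = succ.left
        root.value = succ.value               # replace N's value by that of S(N)
        root.right = delete(root.right, succ.value)  # delete S(N) (Case 1 or 2)
    return root
```

Deleting a node with two children (say 14, with children 8 and 23) replaces it by its inorder successor, as described in Case 3.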
(a) Before deletions. (b) Linked representation.
(a) Node 44 is deleted
(b) Linked representation
Sorting Methods
The task of sorting, or ordering, a list of objects according to some linear order is so
fundamental that it is ubiquitous in engineering applications in all disciplines.
There are two broad categories of sorting methods:
Internal sorting takes place in the main memory, where we can take advantage of the
random access nature of the main memory.
External sorting is necessary when the number and size of objects are prohibitive to be
accommodated in the main memory.
· Given records r1, r2, ..., rn, with key values k1, k2, ..., kn, produce the records in the order
ri1, ri2, ..., rin, such that ki1 ≤ ki2 ≤ ... ≤ kin.
The complexity of a sorting algorithm can be measured in terms of:
· the number of algorithm steps to sort n records
· the number of comparisons between keys (appropriate when the keys are long character strings)
· the number of times records must be moved (appropriate when record size is large)
· Any sorting algorithm that uses comparisons of keys needs at least on the order of n log n
comparisons to accomplish the sorting.
Sorting Methods
Internal (in memory): quick sort, heap sort, bubble sort, insertion sort, selection sort, shell sort
External (appropriate for secondary storage): merge sort, radix sort, polyphase sort
Insertion Sort
The general idea of the insertion sort method is that for each element, find the slot where it
belongs.
Example
• The element in position Array[0] is certainly sorted.
• Thus, move on to insert the second character, D,
into the appropriate location to maintain the alphabetical order.
How does it work?
• Each element Array[j] is taken one at a time,
for j = 1 to n-1.
• Before insertion of Array[j], the subarray from Array[0] to Array[j-1] is sorted, and the
remainder of the array is not.
• After insertion, Array[0…j] is correctly ordered, while the subarray with elements
Array[j+1]…..Array[n-1] is unsorted.
Insertion Sort Algorithm
for i = 1 to n-1
    temp = a[i]
    loc = i
    while (loc > 0 && a[loc-1] > temp)
        a[loc] = a[loc-1]
        loc = loc - 1
    a[loc] = temp
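A direct Python rendering of this pseudocode (a sketch; the function name is illustrative):

```python
def insertion_sort(a):
    """Sort the list a in place by inserting each element into the sorted prefix."""
    for i in range(1, len(a)):
        temp = a[i]                        # element to insert
        loc = i
        while loc > 0 and a[loc - 1] > temp:
            a[loc] = a[loc - 1]            # slide larger elements one position right
            loc -= 1
        a[loc] = temp                      # drop the element into its slot
    return a
```

For example, `insertion_sort([5, 2, 4, 6, 1, 3])` yields `[1, 2, 3, 4, 5, 6]`.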
Insertion sort
• The initial state is that the first element, considered by itself, is sorted.
• The final state is that all elements, considered as a group, are sorted.
• The basic action is to arrange the elements in positions 0 through ‘i’. In each stage ‘i’
increases by 1. The outer loop controls this.
• When the body of the outer loop is entered, we know that elements at positions 0 through
‘i-1’ are sorted, and we need to extend this to positions 0 through ‘i’.
• At each step the element indexed by ‘i’ needs to be added to the sorted part of the array. This is
done by placing it in a temporary variable and sliding all elements larger than it one position
to the right.
• Then the temporary element is copied into the vacated position. The counter ‘loc’
indicates this position.
Complexity
• Best situation: the data is already sorted. The inner loop is never executed, and the outer
loop is executed n – 1 times for total complexity of O(n).
• Worst situation: data in reverse order. The inner loop is executed the maximum number
of times. Thus the complexity of the insertion sort in this worst possible case is quadratic, or
O(n^2).
Selection Sort
In this sorting method we find the smallest element in the list and put it in the first position. Then
we find the second smallest element in the list and put it in the second position, and so on.
Pass 1. Find the location LOC of the smallest in the list of N elements A[1], A[2], . . . , A[N], and
then interchange A[LOC] and A[1]. Then A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of N - 1 elements A[2], A[3], . . . ,
A[N], and then interchange A[LOC] and A[2]. Then A[1], A[2] is sorted, since A[1] < A[2].
Pass 3. Find the location LOC of the smallest in the sublist of N - 2 elements A[3], A[4], . . . ,
A[N], and then interchange A[LOC] and A[3]. Then A[1], A[2], A[3] is sorted, since
A[2] < A[3].
………………………………
Pass N - 1. Find the location LOC of the smaller of the elements A[N - 1],
A[N], and then interchange A[LOC] and A[N - 1]. Then A[1],
A[2], . . . , A[N] is sorted, since A[N - 1] < A[N]. Thus A is sorted after N - 1 passes.
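The passes above can be sketched in Python (0-based indices; the function name is illustrative):

```python
def selection_sort(a):
    """Sort a in place: on pass i, put the smallest remaining element at position i."""
    n = len(a)
    for i in range(n - 1):                # passes 1 .. N-1
        loc = i
        for j in range(i + 1, n):         # find location LOC of the smallest in a[i..n-1]
            if a[j] < a[loc]:
                loc = j
        a[i], a[loc] = a[loc], a[i]       # interchange A[LOC] and A[i]
    return a
```

For example, `selection_sort([64, 25, 12, 22, 11])` yields `[11, 12, 22, 25, 64]`.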
Hashing
Accessing elements in an array is extremely efficient. Array elements are accessed by index. If
we can find a mapping between the search keys and indices, we can store each record in the
element with the corresponding index. Thus each element would be found with one operation
only.
Advantage: the records can be referenced directly - ideally the search time is a constant,
with complexity O(1).
Question: how to find such correspondence?
Answers:
direct access tables
hash tables
Direct-address tables
Direct-address tables – the most elementary form of hashing.
Assumption – a direct one-to-one correspondence between the keys and the numbers 0, 1, …, m-1,
with m not very large.
Searching is fast, but there is a cost – the size of the array we need is the size of the largest key.
Not very useful if only a few keys are widely distributed.
Hash functions
Hash function: function that transforms the search key into a table address.
Hash functions transform the keys into numbers within a predetermined interval. These numbers
are then used as indices in an array (table, hash table) to store the records
• Keys – numbers.
If M is the size of the array, then h(key) = key % M.
This will map all the keys into numbers within the interval [0, M-1].
• Keys – strings of characters
Treat the binary representation of a key as a number, and then apply the first case.
How keys are treated as numbers: if each character is represented with m bits,
then the string can be treated as a base-2^m number.
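A sketch of the two cases above (the table size M = 97 is an illustrative prime; 8-bit character codes are assumed, so m = 8 and the base is 2^8 = 256):

```python
M = 97  # table size, chosen prime (illustrative)

def hash_number(key):
    """Numeric key: h(key) = key % M, mapping into the interval [0, M-1]."""
    return key % M

def hash_string(key):
    """String key: treat the 8-bit character codes as digits of a base-256 number."""
    value = 0
    for ch in key:
        value = value * 256 + ord(ch)   # shift left by m = 8 bits, add the next character
    return value % M                    # then apply the numeric case
```

For example, `hash_string("a")` is `ord("a") % 97 = 97 % 97 = 0`.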
Hash tables: Basic concepts
Once we have found the method of mapping keys to indices, the questions to be solved are how to
choose the size of the table (array) to store the records, and how to perform the basic operations:
Insert
Search
Delete
Let N be the number of the records to be stored, and M the size of the array (hash table). The
integer between 0 and M-1 generated by a hash function is used as an index in a hash table of
M elements.
Initially all slots in the table are blank. This is indicated either by a sentinel value or by a
special field in each slot.
To insert, use the hash function to generate an address for each value to be inserted.
To search for a key in the table, the same hash function is used.
To delete a record with a given key, we first apply the search method, and when the key is found
we delete the record.
Size of the table: Ideally we would like to store N records in a table with size N. However, in
many cases we don't know in advance the exact number of records. Also, the hash function can
map two keys to one and the same index, and some cells in the array will not be used. Hence we
assume that the size of the table can be different from the number of the records. We use M
to denote the size of the table.
A characteristic of the hash table is its
load factor L = N/M: the ratio between the number of records to be stored and the size
of the table. The method to choose the size of the table depends on the chosen method of
collision resolution, discussed below.
M should be a prime number.
It has been proved that if M is a prime number, we obtain better (more even) distribution
of the keys over the table.
Collision resolution
A collision occurs when two or more keys hash to one and the same index in the hash table.
Collision resolution deals with keys that are mapped to the same index.
Methods:
Separate chaining
Open addressing
o Linear probing
o Quadratic probing
o Double hashing
SEPARATE CHAINING
Complexity of separate chaining
The time to compute the index of a given key is a constant. Then we have to search in a list for
the record. Therefore the time depends on the length of the lists. It has been shown empirically
that on average the list length is N/M (the load factor L), provided M is prime and we use a
function that gives good distribution.
Unsuccessful searches go to the end of some list, hence we have L comparisons. Successful
searches are expected to go half way down some list; on average, the number of comparisons
in a successful search is L/2. Therefore we can say that the runtime complexity of separate
chaining is O(L). Note that what really matters is the load factor, rather than the size of the
table or the number of records taken separately.
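A minimal separate-chaining table can be sketched as follows (Python lists stand in for the linked lists, and Python's built-in `hash` stands in for the hash function; all names are illustrative):

```python
class ChainedHashTable:
    def __init__(self, m=11):                  # M chosen prime
        self.m = m
        self.slots = [[] for _ in range(m)]    # one chain per slot

    def insert(self, key, value):
        self.slots[hash(key) % self.m].append((key, value))

    def search(self, key):
        for k, v in self.slots[hash(key) % self.m]:  # walk the chain: O(L) on average
            if k == key:
                return v                       # successful search
        return None                            # unsuccessful: reached the end of the chain

    def delete(self, key):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]                   # search first, then remove the record
                return True
        return False
```

Usage: `t = ChainedHashTable(7); t.insert("apple", 1); t.search("apple")` returns 1.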
How to choose M in separate chaining?
Since the method is used in cases when we cannot predict the number of records in advance, the
choice of M basically depends on other factors such as available memory. Typically M is chosen
relatively small so as not to use up a large area of contiguous memory, but large enough so that
the lists are short enough for efficient sequential search. Recommendations in the literature vary
from M being about one tenth of N (the number of records) to M being equal (or close) to N.
Other methods of chaining:
Keep the lists ordered: useful if there are many more searches than inserts,
and if most of the searches are unsuccessful.
Represent the chains as binary search trees: extra effort needed – not efficient.
Advantages of separate chaining – used when memory is of concern; easily implemented.
Disadvantages – with unevenly distributed keys, some lists grow long while many slots in the table stay empty.
Open addressing
Invented by A. P. Ershov and W. W. Peterson in 1957 independently.
Idea: Store collisions in the hash table itself.
The method uses a collision resolution function in addition to the hash function.
If collision occurs, next probes are performed following the formula:
hi(x) = (hash(x) + f(i)) mod TableSize
where:
hash(x) is the hash function
f(i) is the collision resolution function
i is the number of the current attempt (probe) to insert an element.
Linear probing (linear hashing, sequential probing): f(i) = i
Insert: When there is a collision we just probe the next slot in the table.
If it is unoccupied – we store the key there.
If it is occupied – we continue probing the next slot.
Search: If the key hashes to a position that is occupied and there is no match,
we probe the next position.
a) match – successful search
b) empty position – unsuccessful search
c) occupied and no match – continue probing.
When the end of the table is reached, the probing continues from the beginning,
until the original starting position is reached.
Problems with delete: a special flag is needed to distinguish deleted from empty positions.
This is necessary for the search function – if we come to a "deleted" position,
the search has to continue as the deletion might have been done after
the insertion of the key we are looking for, and it might be further in the table.
Here is an example of Linear probing
The total amount of memory space needed is less, since no pointers are maintained.
Disadvantage: " Primary clustering"
Large clusters tend to build up: if an empty slot is preceded by i filled slots, the probability that
the empty slot is the next one to be filled is (i+1)/M.
If the preceding slot was empty, the probability is 1/M.
This means that when the table begins to fill up, many other slots are examined.
Linear probing runs slowly for nearly full tables.
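The insert, search and delete rules above, including the special "deleted" flag, can be sketched as follows (a toy table for integer keys; names and the table size are illustrative, and the table is assumed never to be completely full when inserting):

```python
EMPTY, DELETED = object(), object()   # sentinels: blank slot, and the special "deleted" flag

class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [EMPTY] * m

    def insert(self, key):
        i = key % self.m
        while self.slots[i] not in (EMPTY, DELETED):  # occupied: probe the next slot
            i = (i + 1) % self.m                      # wrap around at the end of the table
        self.slots[i] = key

    def search(self, key):
        i = key % self.m
        for _ in range(self.m):           # stop after one full cycle of the table
            if self.slots[i] is EMPTY:
                return None               # empty position: unsuccessful search
            if self.slots[i] == key:
                return i                  # match: successful search
            i = (i + 1) % self.m          # occupied (or deleted) and no match: keep probing
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED       # mark deleted; must NOT become EMPTY,
                                          # or later searches would stop too early
```

Note how `search` continues past DELETED slots: the key being sought may have been inserted before the deletion and so may lie further along the probe sequence.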
Quadratic probing: f(i) = i^2
A quadratic function is used to compute the next index in the table to be probed.
Example:
In linear probing we check the i-th position. If it is occupied, we check the (i+1)st position,
next the (i+2)nd, etc.
In quadratic probing, if the i-th position is occupied we check the (i+1)st,
next the (i+4)th, next the (i+9)th, etc.
The idea here is to skip regions in the table with possible clusters.
Double hashing: f(i) = i * hash2(x)
Purpose – same as in quadratic probing : to overcome the disadvantage of clustering.
Instead of examining each successive entry following a collided position, we use
a second hash function to get a fixed increment for the "probe" sequence.
The second function should be chosen so that the increment and M are relatively prime.
Otherwise there will be slots that would remain unexamined.
Example: hash2(x) = R - (x mod R), where R is a prime smaller than TableSize.
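The probe sequence h_i(x) = (hash(x) + i * hash2(x)) mod TableSize can be sketched as follows (M = 11 and R = 7 are illustrative primes; the primary hash is simply x mod M):

```python
M = 11        # table size, prime
R = 7         # R: a prime smaller than the table size

def probe_sequence(x, attempts=5):
    """First few slots examined for key x: h_i(x) = (x % M + i * hash2(x)) % M."""
    step = R - (x % R)                  # hash2(x): fixed increment, never zero
    return [(x % M + i * step) % M for i in range(attempts)]
```

For example, for x = 3 the increment is 7 - 3 = 4, so the probes visit slots 3, 7, 0, 4, 8. Because the increment and M are relatively prime, the sequence eventually visits every slot.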
In open addressing the load factor L is less than 1.
Good strategy is to keep L < 0.5
Rehashing
If the table is close to full, the search time grows and may become equal to the table size.
When the load factor exceeds a certain value (e.g. greater than 0.5) we do rehashing:
Build a second table twice as large as the original
and rehash there all the keys of the original table.
Rehashing is an expensive operation, with running time O(N).
However, once done, the new hash table will have good performance.
Extendible hashing
Used when the amount of data is too large to fit in main memory and external storage is used.
N records in total to store, M records in one disk block
The problem: in ordinary hashing several disk blocks may be examined to find an element -
a time consuming process.
Extendible hashing: no more than two blocks are examined.
Idea:
Keys are grouped according to the first m bits in their code.
Each group is stored in one disk block.
If the block becomes full and no more records can be inserted, each group is split into
two,
and m+1 bits are considered to determine the location of a record.
Example: let's say we have 4 groups of keys according to the first two bits:
00 01 10 11
00010 01001 10001 11000
00100 01010 10100 11010
01100
Each disk block in the example can contain 3 records only, so 4 blocks are needed to store the
above keys.
New key to be inserted: 01011.
Block 2 (the 01 group) is full, so we start considering 3 bits:
000/001 010 011 100/101 110/111
(still on same block)
00010 01001 01100 10001 11000
00100 01010 10100 11010
01011
The second group of keys is split onto two disk blocks – one for keys starting with 010,
and one for keys starting with 011.
A directory is maintained in main memory with pointers to the disk blocks for each bit pattern.
The size of the directory is 2^D = O(N^(1+1/M)/M), where
D - number of bits considered
N - number of records
M - number of disk blocks.
Conclusion
Hashing is the best search method (constant running time) if we don't need to have the records
sorted.
The choice of the hash function remains the most difficult part of the task and depends very
much on the nature of the keys.
Separate chaining or open addressing?
Open addressing is the preferred method if there is enough memory
to keep a table twice as large as the number of the records.
Separate chaining is used when we don't know in advance the number of the records to
be stored. Though it requires additional time for list processing, it is simpler to
implement.
Some application areas
Dictionaries, on-line spell checkers, compiler symbol tables.