91.102 - computing ii trees: some terminology. nodes: usually contain information. usually...


Upload: cori-hodges

Post on 03-Jan-2016


Page 1: 91.102 - Computing II Trees: Some terminology. Nodes: Usually contain information. Usually represented by structures. Branches: Edges - usually represented

91.102 - Computing II

Trees: Some terminology.

Nodes: Usually contain information. Usually represented by structures.

Branches: Edges - usually represented by pointers. They represent a connection (relationship) between two nodes.

Root: The “entry node” into the tree. Usually at the TOP.

Leaves: Nodes WITH NO edges emanating from them (only NULL pointers).

Interior Nodes: Nodes WITH edges emanating from them.

[Figure: a Tree with its parts labeled: the Root (a Node, and also an Interior Node), Interior Nodes, Leaves, and Edges (Branches).]


Parent: The first node encountered in going back to the root.

Children: the first nodes encountered on outgoing edges (going away from the root)

Level: the number of edges between “here” and the root.

Path: a “contiguous” collection of edges - all in one direction, generally away from the root.

Q: how many nodes in a tree have NO ancestors?

A: ….


Binary Tree - Recursive definition:

A) A binary tree is empty; or

B) A binary tree consists of a node with two children that are binary trees.

=========================

Full Binary Tree: each node has either NO children or TWO children that are non-empty binary trees.

Complete Binary Tree: a binary tree where every level is FULL, except possibly the lowest, where the only missing members are in the rightmost positions.

Extended Binary Tree: Represent the (empty) children of a leaf by something - squares...

[Figure: a Full Binary Tree next to a Non-Full Binary Tree.]

[Figure: a Complete Binary Tree next to a Non-Complete Binary Tree.]

[Figure: an Extended Binary Tree.]


In a complete binary tree, how many nodes are leaves?

Nodes per level, from the bottom row up to the root:

16 = 2^4

8 = 2^3

4 = 2^2

2 = 2^1

1 = 2^0

1 + 2 + 4 + 8 = 15 < 16

The bottom row could be un-full, but that would change little: just over half the nodes are leaves.
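The count can be checked with a short sketch. It assumes the usual fact that a complete binary tree with n nodes has n/2 interior nodes (integer division), so the remaining n - n/2 nodes are leaves. The function name leaf_count is ours, not part of the course code.

```c
/* Number of leaves in a complete binary tree with n nodes.
   Illustrative helper; the name is an assumption, not course code. */
int leaf_count(int n)
{
    return n - n / 2;   /* n/2 nodes are interior (integer division) */
}
```

For the 31-node tree above, leaf_count(31) gives 16, just over half of the nodes.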


Representations: how do we represent trees - binary trees especially.

One Method: Embed them in arrays. How?

A[1] is the root; A[2] and A[3] its children;

A[4], A[5] the grandchildren by A[2];

A[6], A[7] the grandchildren by A[3];

A[2n] and A[2n+1] are the children of A[n];

A[m/2] is the parent of A[m] - regardless of whether m is even or odd - and the division is an integer division.

Keep BOTH a tree and array picture in your mind...
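The index arithmetic above can be written as a few one-line helpers. This is a minimal sketch; the function names are ours, and positions are 1-based as in the slides (A[0] is unused).

```c
/* Index arithmetic for a binary tree embedded in an array A[1..n].
   The helper names are illustrative assumptions. */
int left_child(int n)  { return 2 * n; }
int right_child(int n) { return 2 * n + 1; }
int parent(int m)      { return m / 2; }   /* integer division; 0 means "no parent" */

/* Level of position m: the number of edges between A[m] and the root A[1]. */
int level(int m)
{
    int lv = 0;
    while (m > 1) {
        m = m / 2;      /* step to the parent */
        lv++;
    }
    return lv;
}
```

Both A[4] and A[5] report A[2] as their parent, which is the "regardless of whether m is even or odd" remark in code form.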

[Figure: the tree picture; each node is labeled with its array index, one level per line.]

1

2 3

4 5 6 7

8 9 10 11 12 13 14 15

16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

[Figure: the array picture; the same indices in array order.]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31


We have a data type and a representation. What is it good for?

First Application: fast priority queues.

Our previous implementations of priority queues required O(n) time for either insertion or retrieval. An application that requires the management of priority queues of large size (millions of elements) - for example, the simulation of a large high speed network - requires rummaging through millions of items EVERY TIME ANY ONE ITEM NEEDS SERVICE. This slows down the simulation so far that it is unusable.


Can we use (complete) binary trees to implement priority queues with a time complexity of O(log(n))? This would mean that doubling the size of the queue adds JUST ONE EXTRA LEVEL to the tree, and so only about one extra comparison and swap per operation. A simulator that uses priority queues would become almost completely insensitive to the size of the queues.

The trick is to impose an “internal” condition that must be satisfied by every such structure.

The condition is actually rather easy: the priority of every node must be at least as great as that of its children. This seems weaker than the requirement that all the entries be ordered - it should take less time to keep it always satisfied.
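As a sketch, the condition can be tested directly on the array representation: every position from 2 to n must hold a priority no greater than the one stored at its parent position. The function name and the choice of int priorities are assumptions made for illustration.

```c
/* Returns 1 if a[1..n] satisfies the condition (no child has a greater
   priority than its parent), and 0 otherwise. a[0] is unused.
   Name and int priorities are illustrative assumptions. */
int satisfies_condition(const int a[], int n)
{
    int i;
    for (i = 2; i <= n; i++) {
        if (a[i / 2] < a[i]) {
            return 0;          /* the parent of position i is smaller */
        }
    }
    return 1;                  /* also true for n = 0 and n = 1 */
}
```

Note that the array 70 66 7 23 43 passes the test even though it is not sorted, which is exactly why the condition is cheaper to maintain than a full ordering.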


How do we do it, under insertion and retrieval?

We start with insertion. Begin with an empty priority queue: since it has 0 nodes, its nodes all satisfy the priority condition.

A[1..15]: (all positions empty)

Insert 66 in the first open position (A[1], the root).

A[1]: 66

Tree: 66

The priority queue with one item will obviously satisfy the condition. Now insert 23 in the next open position (A[2]).

A[1..2]: 66 23

Tree (one level per line):
66
23

Is the condition still satisfied? Yes: A[2] is a child of A[1] and its priority (23) is less than the priority of A[1] (66). Now insert 7 in the next open position.

A[1..3]: 66 23 7

Tree (one level per line):
66
23 7

Is the condition still satisfied? Yes: A[3] is a child of A[1] and its priority (7) is less than the priority of A[1] (66). Insert 43 in the next open position.

A[1..4]: 66 23 7 43

Tree (one level per line):
66
23 7
43

Is the condition still satisfied? No: A[4] is a child of A[2] and its priority (43) is greater than the priority of A[2] (23). We must DO something to make sure the property holds again, before we exit the insertion. How? If we simply swap the contents of A[4] with those of its parent A[2], the condition holds again.

A[1..4]: 66 43 7 23

Tree (one level per line):
66
43 7
23

Now insert 70 - in A[5], to get:

A[1..5]: 66 43 7 23 70

Tree (one level per line):
66
43 7
23 70

The parent of A[5] is A[2] - since 43 < 70, the condition is NOT satisfied. Swap parent and child.

A[1..5]: 66 70 7 23 43

Tree (one level per line):
66
70 7
23 43

The parent of A[2] is A[1] - since 66 < 70, the condition is NOT satisfied. Swap parent and child.

A[1..5]: 70 66 7 23 43

Tree (one level per line):
70
66 7
23 43

The condition is satisfied!


A Time Complexity Question.

We can recall that keeping a list of n items ordered requires O(n) comparisons for every insertion - on average, you will need to compare the incoming item with half the items on the list. How many comparisons do we need here? If O(n), we haven’t really gained anything…

What is it that we do? Compare a new node (a leaf) with its parent - stop or swap. For every swap, repeat the comparison (with the new parent) and stop or swap.

How many comparisons? This is where the Complete Binary Tree structure comes in: a Complete Binary Tree with n nodes has about log2(n) levels (exactly floor(log2(n)) + 1). Roughly half of all the nodes are leaves, one quarter appear at the next level up, one eighth at the next level, and so on until you reach the root. Since you do at most one comparison - and swap - at each level, the time is O(log2(n)).

This is definitely better than O(n). But this is only for insertion, and insertion was O(1) in the array implementation we saw earlier in the course. Furthermore, deletion is O(1) in the ordered linked list implementation, so we’d better be able to show that deleting an item from this kind of data structure remains sufficiently cheap.


void Insert(PQItem Item, PriorityQueue *PQ)
{
    int ChildLoc;     /* location of current child */
    int ParentLoc;    /* parent of current child */

    (PQ->Count)++;                /* caution: insertion does not */
    ChildLoc = PQ->Count;         /* guard against overflow */
    ParentLoc = ChildLoc/2;

    while (ParentLoc != 0) {      /* while a parent still exists */
        if (Item <= PQ->ItemArray[ParentLoc]) {
            PQ->ItemArray[ChildLoc] = Item;    /* store Item */
            return;                            /* and return */
        } else {    /* here, Item > PQ->ItemArray[ParentLoc] */
            PQ->ItemArray[ChildLoc] = PQ->ItemArray[ParentLoc];
            ChildLoc = ParentLoc;
            ParentLoc = ParentLoc/2;
        }
    }
    /* Put Item in final resting place */
    PQ->ItemArray[ChildLoc] = Item;
}
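To try the insertion routine, the types it relies on have to be supplied. The sketch below is one possible set of definitions, not the course's own: PQItem as a plain int, a capacity of 15 (MAXCOUNT) to match the array pictures, and a variant named InsertChecked that adds the overflow guard the original comment warns about. The sift-up loop is the same as in Insert, with the two exits folded into the loop condition.

```c
#define MAXCOUNT 15                  /* assumed capacity, as in the 15-slot pictures */

typedef int PQItem;                  /* assumption: items are plain ints */

typedef struct {
    int    Count;                    /* number of items currently stored */
    PQItem ItemArray[MAXCOUNT + 1];  /* slot 0 is unused */
} PriorityQueue;

void Initialize(PriorityQueue *PQ)
{
    PQ->Count = 0;
}

/* Same sift-up idea as Insert, plus a capacity check (hypothetical variant).
   Returns 1 on success, 0 if the queue is already full. */
int InsertChecked(PQItem Item, PriorityQueue *PQ)
{
    int ChildLoc;
    int ParentLoc;

    if (PQ->Count >= MAXCOUNT) {
        return 0;
    }
    (PQ->Count)++;
    ChildLoc  = PQ->Count;
    ParentLoc = ChildLoc / 2;

    /* Move smaller parents down until Item fits or the root is reached. */
    while (ParentLoc != 0 && Item > PQ->ItemArray[ParentLoc]) {
        PQ->ItemArray[ChildLoc] = PQ->ItemArray[ParentLoc];
        ChildLoc  = ParentLoc;
        ParentLoc = ParentLoc / 2;
    }
    PQ->ItemArray[ChildLoc] = Item;  /* final resting place */
    return 1;
}
```

Inserting 66, 23, 7, 43, 70 in that order leaves 70 66 7 23 43 in A[1..5], the same state reached in the hand-worked example.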

Assume that, after some more insertions, the structure looks like:

A[1..13]: 70 66 45 23 43 37 41 20 18 39 33 19 7

Tree (one level per line):
70
66 45
23 43 37 41
20 18 39 33 19 7

Now we delete the highest priority element - which MUST reside in A[1]. We can’t just remove it, since this would leave us with something that is not even a complete binary tree:

A[2..13]: 66 45 23 43 37 41 20 18 39 33 19 7 (A[1] is now empty)

Tree (one level per line, root missing):
66 45
23 43 37 41
20 18 39 33 19 7

It is two Complete Binary Trees, since the common root is gone. How do we rebuild this into a SINGLE Complete Binary Tree, satisfying “the condition”?

1) Remove the element in the LAST used array position and insert it in the root position: A[n] -> A[1]. It should be easy to see that this gives us a new Complete Binary Tree.

A[1..12]: 7 66 45 23 43 37 41 20 18 39 33 19

Tree (one level per line):
7
66 45
23 43 37 41
20 18 39 33 19

Unfortunately, “the condition” doesn’t hold: A[1] < A[2], and A[1] < A[3]. How do we get it to hold again, without undoing the Complete Binary Tree?

2) Swap the contents of A[1] with the contents of its largest child. Since A[2] > A[3], we get:

A[1..12]: 66 7 45 23 43 37 41 20 18 39 33 19

Tree (one level per line):
66
7 45
23 43 37 41
20 18 39 33 19

The condition is still not satisfied, since A[2] < A[4] and A[2] < A[5]. Swap again with the largest child:

A[1..12]: 66 43 45 23 7 37 41 20 18 39 33 19

Tree (one level per line):
66
43 45
23 7 37 41
20 18 39 33 19

Since the condition is still unsatisfied, do it again:

A[1..12]: 66 43 45 23 39 37 41 20 18 7 33 19

Tree (one level per line):
66
43 45
23 39 37 41
20 18 7 33 19

7 is now in A[10], a leaf, and we are done. We still have to count the number of comparisons and swaps.

PQItem Remove(PriorityQueue *PQ)
{
    int CurrentLoc;         /* location being examined */
    int ChildLoc;           /* a child of CurrentLoc */
    PQItem ItemToPlace;     /* an Item value to relocate */
    PQItem ItemToReturn;    /* removed Item value to return */

    /* Initializations */
    ItemToReturn = PQ->ItemArray[1];           /* value to return */

    if (Empty(PQ)) return (ItemToReturn);      /* result undefined if PQ empty */

    ItemToPlace = PQ->ItemArray[PQ->Count];    /* last leaf's value */
    (PQ->Count)--;                  /* delete last leaf in level order */
    CurrentLoc = 1;                 /* CurrentLoc starts at root */
    ChildLoc = 2*CurrentLoc;        /* ChildLoc starts at root's left child */

    while (ChildLoc <= PQ->Count) {    /* while a child exists */
        /* Set ChildLoc to location of larger child of CurrentLoc */
        if (ChildLoc < PQ->Count) {    /* if right child exists */
            if (PQ->ItemArray[ChildLoc+1] > PQ->ItemArray[ChildLoc]) {
                ChildLoc++;
            }
        }

        /* If item at ChildLoc is larger than ItemToPlace, */
        /* move this larger item to CurrentLoc, and move   */
        /* CurrentLoc down.                                */
        if (PQ->ItemArray[ChildLoc] <= ItemToPlace) {
            PQ->ItemArray[CurrentLoc] = ItemToPlace;
            return (ItemToReturn);
        } else {
            PQ->ItemArray[CurrentLoc] = PQ->ItemArray[ChildLoc];
            CurrentLoc = ChildLoc;
            ChildLoc = 2 * CurrentLoc;
        }
    }
    /* final placement of ItemToPlace */
    PQ->ItemArray[CurrentLoc] = ItemToPlace;
    /* return the Item originally at the root */
    return (ItemToReturn);
}


In the worst case we must perform two comparisons and one swap for each level of the tree - except for the last one. The number of levels is about log2(n), for a tree holding n items, so the total number of operations during a deletion is O(log2(n)).
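The deletion procedure can be exercised with a self-contained sketch. The type definitions here are assumptions (PQItem as int, capacity MAXCOUNT of 15), and RemoveChecked is a hypothetical variant that reports an empty queue through its return value instead of leaving the result undefined. The sift-down loop follows the same steps: pick the larger child, stop if it is not larger than the item being placed, otherwise move it up.

```c
#define MAXCOUNT 15                  /* assumed capacity */

typedef int PQItem;                  /* assumption: items are plain ints */

typedef struct {
    int    Count;                    /* number of items currently stored */
    PQItem ItemArray[MAXCOUNT + 1];  /* slot 0 is unused */
} PriorityQueue;

/* Removes the highest-priority item and stores it in *Out.
   Returns 1 on success, 0 if the queue is empty (hypothetical variant). */
int RemoveChecked(PriorityQueue *PQ, PQItem *Out)
{
    int CurrentLoc;
    int ChildLoc;
    PQItem ItemToPlace;

    if (PQ->Count == 0) {
        return 0;
    }
    *Out = PQ->ItemArray[1];                  /* the root holds the maximum */
    ItemToPlace = PQ->ItemArray[PQ->Count];   /* last leaf in level order */
    (PQ->Count)--;

    CurrentLoc = 1;
    ChildLoc = 2;
    while (ChildLoc <= PQ->Count) {           /* while a child exists */
        /* choose the larger of the (up to two) children */
        if (ChildLoc < PQ->Count &&
            PQ->ItemArray[ChildLoc + 1] > PQ->ItemArray[ChildLoc]) {
            ChildLoc++;
        }
        if (PQ->ItemArray[ChildLoc] <= ItemToPlace) {
            break;                            /* ItemToPlace fits here */
        }
        PQ->ItemArray[CurrentLoc] = PQ->ItemArray[ChildLoc];
        CurrentLoc = ChildLoc;
        ChildLoc = 2 * CurrentLoc;
    }
    PQ->ItemArray[CurrentLoc] = ItemToPlace;  /* final placement */
    return 1;
}
```

Starting from 70 66 45 23 43 37 41 20 18 39 33 19 7, one removal returns 70 and leaves 66 43 45 23 39 37 41 20 18 7 33 19, after three moves down the tree.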

Insertion: O(log2(n)).

Deletion(Retrieval): O(log2(n)).

Worst Operation: O(log2(n)).

Recall that for the ordered (or unordered) linked list representation, the worst operation was O(n), with the same cost for the unordered (or ordered) array implementation. Let’s look at a table:

Size          Array    List     Heap
2             2        2        1
1024 = 2^10   1024     1024     10
2^20          2^20     2^20     20
2^30          2^30     2^30     30
2^40          2^40     2^40     40

2^40 means about one trillion items are being managed. In the simple array case, we have to perform one trillion comparisons and, on average, half a trillion swaps for every deletion; in the ordered list case, we must perform, on average, half a trillion comparisons for each insertion; in the HEAP case (this is the name for this new data type) we will have no more than 40 comparisons and swaps for an insertion and no more than 80 comparisons and 40 swaps for a deletion.


Insertion turns out to be even cheaper: at least two thirds of all items reside in the bottom two (complete) rows of the tree. This means that, on average, 75% of all insertions will find their correct place within 2 comparisons and 1 swap, regardless of the size of the set being “priority queued”.

One can also show that an unordered array of n elements can be made into a Heap in O(n) time - linear time.
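The slides do not say which construction they have in mind. One standard way to get the linear bound is the bottom-up method usually attributed to Floyd: sift down every interior position, starting from the last one (n/2) and ending at the root. The sketch below assumes int items stored in a[1..n]; the function names are ours.

```c
/* Sift the item at position i down until neither child is larger.
   Works on a[1..n]; a[0] is unused. Names are illustrative assumptions. */
static void sift_down(int a[], int n, int i)
{
    int item  = a[i];
    int child = 2 * i;

    while (child <= n) {
        if (child < n && a[child + 1] > a[child]) {
            child++;                 /* pick the larger child */
        }
        if (a[child] <= item) {
            break;
        }
        a[i] = a[child];             /* move the larger child up */
        i = child;
        child = 2 * i;
    }
    a[i] = item;
}

/* Turn an unordered array a[1..n] into a heap, bottom-up. */
void build_heap(int a[], int n)
{
    int i;
    for (i = n / 2; i >= 1; i--) {   /* leaves are already one-node heaps */
        sift_down(a, n, i);
    }
}
```

The cost is linear because most positions sit near the bottom and can only move down a level or two; only the root can travel the full height.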

We will see later in the course how this data type can be used for asymptotically efficient sorting.


Some Code… to generalize the idea of "comparison"…

Let's see how to make Insert allow arbitrary comparisons.

void Insert(PQItem Item, PriorityQueue *PQ)
{
    int ChildLoc;     /* location of current child */
    int ParentLoc;    /* parent of current child */

    (PQ->Count)++;                /* caution: insertion does not */
    ChildLoc = PQ->Count;         /* guard against overflow */
    ParentLoc = ChildLoc/2;

    while (ParentLoc != 0) {      /* while parent exists */
        if (Item <= PQ->ItemArray[ParentLoc]) {
            PQ->ItemArray[ChildLoc] = Item;    /* store Item */
            return;                            /* and return */
        } else {    /* here, Item > PQ->ItemArray[ParentLoc] */
            PQ->ItemArray[ChildLoc] = PQ->ItemArray[ParentLoc];
            ChildLoc = ParentLoc;
            ParentLoc = ParentLoc/2;
        }
    }
    PQ->ItemArray[ChildLoc] = Item;    /* Put Item in final resting place */
}


What happens if the "PQItem" is a complex structure: you may want to prioritize the collection of PQItems by one of the fields. For example, you may want to choose people by age, by income, by Last Name, by anything for which an unambiguous comparison is possible. How do we do this?

By passing a "comparison function" as an extra parameter.

Example. Assume the PQItem is a structure.

typedef struct {
    int age;
    long income;
    char FirstName[20];
    char LastName[20];
    char SSN[10];
} PQItem;


We define some appropriate comparison functions:

/* Each function returns true when x has NO GREATER priority than y. */

bool AscendingByAge(PQItem x, PQItem y)
{
    return (x.age <= y.age);
}

bool AscendingByLastName(PQItem x, PQItem y)
{
    return (strcmp(x.LastName, y.LastName) <= 0);
}

Finally, we re-define the Insertion Function:

void Insert(PQItem Item, PriorityQueue *PQ,
            bool (*compare)(PQItem, PQItem))
{
    int ChildLoc;     // location of current child
    int ParentLoc;    // parent of current child

    (PQ->Count)++;                // caution: insertion does not
    ChildLoc = PQ->Count;         // guard against overflow
    ParentLoc = ChildLoc/2;

    while (ParentLoc != 0) {      // while parent exists
        if (compare(Item, PQ->ItemArray[ParentLoc])) {
            /* new Item has NO GREATER priority than Parent */
            PQ->ItemArray[ChildLoc] = Item;    /* store Item */
            return;                            /* and return */
        } else {
            /* new Item has GREATER priority than Parent */
            PQ->ItemArray[ChildLoc] = PQ->ItemArray[ParentLoc];
            ChildLoc = ParentLoc;
            ParentLoc = ParentLoc/2;
        }
    }
    PQ->ItemArray[ChildLoc] = Item;    // Put Item in final resting place
}
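A small self-contained sketch shows the comparison parameter at work. Everything outside the insertion loop is an assumption made for illustration: a trimmed-down PQItem with only age and income, a capacity of 15, the function name InsertBy (with an added overflow check), and the second comparison function LowIncomeFirst. As in the slides, compare(x, y) returns true when x has no greater priority than y.

```c
#include <stdbool.h>

#define MAXCOUNT 15                  /* assumed capacity */

typedef struct {                     /* trimmed-down PQItem, for illustration */
    int  age;
    long income;
} PQItem;

typedef struct {
    int    Count;
    PQItem ItemArray[MAXCOUNT + 1];  /* slot 0 is unused */
} PriorityQueue;

/* Older people get higher priority. */
bool AscendingByAge(PQItem x, PQItem y)  { return x.age <= y.age; }

/* Hypothetical: LOWER income gets higher priority (inequality reversed). */
bool LowIncomeFirst(PQItem x, PQItem y)  { return x.income >= y.income; }

/* Insertion driven by the comparison function.
   Returns 1 on success, 0 if the queue is full. */
int InsertBy(PQItem Item, PriorityQueue *PQ, bool (*compare)(PQItem, PQItem))
{
    int ChildLoc;
    int ParentLoc;

    if (PQ->Count >= MAXCOUNT) {
        return 0;
    }
    (PQ->Count)++;
    ChildLoc  = PQ->Count;
    ParentLoc = ChildLoc / 2;

    /* Keep moving parents down while Item has GREATER priority. */
    while (ParentLoc != 0 && !compare(Item, PQ->ItemArray[ParentLoc])) {
        PQ->ItemArray[ChildLoc] = PQ->ItemArray[ParentLoc];
        ChildLoc  = ParentLoc;
        ParentLoc = ParentLoc / 2;
    }
    PQ->ItemArray[ChildLoc] = Item;
    return 1;
}
```

Feeding the same three records to two queues, one with AscendingByAge and one with LowIncomeFirst, puts the oldest person at the root of the first and the lowest earner at the root of the second.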


With these modifications - extended to all other functions that need to compare PQItems in order to insert them in the correct array position (PQRemove is one), our Priority Queue module can deal with complex structures and can change the ORDER of priority: for example, you may want to give higher priority to people with LOW income, rather than HIGH - just add a comparison function with the "right" inequality.


Other Things you do with Binary Trees.

Traversals: PreOrder, InOrder, PostOrder, LevelOrder.

Another Tree Representation: Linked Objects.

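typedef struct NodeTag {
    arbitrary Item;           /* "arbitrary" stands for whatever item type is needed */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

[Figure: a node drawn as a box with three fields: Item, LLink, RLink.]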


The easiest way for the programmer to deal with Pre-, In- and PostOrder traversals of a binary tree is to use recursion:

void Traverse(TreeNode *T)
{
    if (T != NULL) {
        Visit(T);
        Traverse(T->LLink);
        Traverse(T->RLink);
    }
}

This gives PreOrder. Place Visit BETWEEN the two Traverse calls for InOrder, and AFTER both calls for PostOrder.
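The three recursive orders can be compared side by side with a small sketch. It assumes a TreeNode whose Item is a single char, and instead of calling Visit each function appends the item to a caller-supplied buffer and returns the new length; both choices are ours, made so the result is easy to inspect.

```c
#include <stddef.h>

typedef struct NodeTag {
    char Item;                       /* assumption: one character per node */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* Each function appends the visited items to out, starting at position len,
   and returns the new length. The caller provides a large enough buffer. */
int PreOrder(const TreeNode *T, char *out, int len)
{
    if (T != NULL) {
        out[len++] = T->Item;                 /* visit BEFORE both subtrees */
        len = PreOrder(T->LLink, out, len);
        len = PreOrder(T->RLink, out, len);
    }
    return len;
}

int InOrder(const TreeNode *T, char *out, int len)
{
    if (T != NULL) {
        len = InOrder(T->LLink, out, len);
        out[len++] = T->Item;                 /* visit BETWEEN the subtrees */
        len = InOrder(T->RLink, out, len);
    }
    return len;
}

int PostOrder(const TreeNode *T, char *out, int len)
{
    if (T != NULL) {
        len = PostOrder(T->LLink, out, len);
        len = PostOrder(T->RLink, out, len);
        out[len++] = T->Item;                 /* visit AFTER both subtrees */
    }
    return len;
}
```

For a tree with root a, children b and c, and d and e as the children of c, the three functions produce a b c d e, b a d c e, and b d e c a respectively.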


Basically, this uses the Activation Stack Frames to keep the information needed to know whether you are “done” with the current node or not. If you wish to dispense with recursion, you must find another way to save the information you need: use your OWN stack.

void PreOrderTraversal(TreeNode *T)
{
    Stack S;
    TreeNode *N;

    InitializeStack(&S);
    Push(T, &S);                     // Root Node.
    while (!Empty(&S)) {
        Pop(&S, &N);
        if (N != NULL) {
            Visit(N->Item);
            Push(N->RLink, &S);
            Push(N->LLink, &S);      // LIFO: the left child is popped first
        }
    }
}
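A runnable version of the stack-based traversal needs a concrete stack. The sketch below assumes a fixed-size array of node pointers (MAXSTACK is our choice), a char Item, and an output buffer in place of Visit; none of these come from the course's Stack module.

```c
#include <stddef.h>

#define MAXSTACK 64                  /* assumed upper bound on stack depth */

typedef struct NodeTag {
    char Item;                       /* assumption: one character per node */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* PreOrder traversal without recursion. Appends visited items to out and
   returns how many were written, or -1 if the stack would overflow. */
int PreOrderWithStack(TreeNode *T, char *out)
{
    TreeNode *stack[MAXSTACK];
    int top = 0;                     /* number of entries on the stack */
    int len = 0;

    stack[top++] = T;                /* push the root */
    while (top > 0) {
        TreeNode *N = stack[--top];  /* pop */
        if (N != NULL) {
            out[len++] = N->Item;    /* visit */
            if (top + 2 > MAXSTACK) {
                return -1;
            }
            stack[top++] = N->RLink; /* pushed first, so popped last */
            stack[top++] = N->LLink; /* pushed last, so popped first */
        }
    }
    return len;
}
```

Pushing the right child before the left one is what makes the left subtree come off the stack, and get visited, first.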


Unfortunately, moving the “Visit” from before to between or after the Push calls will NOT result in giving us the other two Traversals… They require substantially more thought.

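[Figure: PreOrder Traversal with a Stack. The tree, one level per line, is a / b c / d e; the figure shows the stack contents after each step and the resulting visit order a b c d e.]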


A Fourth Traversal: LevelOrder. This uses QUEUES.

void LevelOrderTraversal(TreeNode *T)
{
    Queue Q;
    TreeNode *N;

    InitializeQueue(&Q);
    Insert(T, &Q);                   // enqueue the root node
    while (!Empty(&Q)) {
        Remove(&Q, &N);              // get the front item
        if (N != NULL) {
            Visit(N->Item);
            Insert(N->LLink, &Q);    // enqueue
            Insert(N->RLink, &Q);    // enqueue
        }
    }
}
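As with the stack version, a runnable LevelOrder traversal needs a concrete queue. This sketch assumes a simple array with head and tail indices (MAXQUEUE is our choice, and the array is not reused circularly), a char Item, and an output buffer in place of Visit.

```c
#include <stddef.h>

#define MAXQUEUE 64                  /* assumed bound on total enqueued pointers */

typedef struct NodeTag {
    char Item;                       /* assumption: one character per node */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* LevelOrder traversal. Appends visited items to out and returns how many
   were written, or -1 if the queue array would overflow. */
int LevelOrder(TreeNode *T, char *out)
{
    TreeNode *queue[MAXQUEUE];
    int head = 0;                    /* next position to dequeue */
    int tail = 0;                    /* next free position */
    int len = 0;

    queue[tail++] = T;               /* enqueue the root */
    while (head < tail) {
        TreeNode *N = queue[head++]; /* dequeue the front item */
        if (N != NULL) {
            out[len++] = N->Item;    /* visit */
            if (tail + 2 > MAXQUEUE) {
                return -1;
            }
            queue[tail++] = N->LLink;
            queue[tail++] = N->RLink;
        }
    }
    return len;
}
```

Because children are enqueued behind everything already waiting, a whole level is visited before any node of the next level.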

[Figure: LevelOrder Traversal with a Queue. The tree, one level per line, is a / b c / d e f g / h; the figure shows the queue contents after each step and the resulting visit order a b c d e f g h.]


Other Things you do with Binary Trees.

Search.

The idea here comes from “Binary Search” with arrays. It seems like a natural: either what you are looking for is in the current node or, if the key is smaller, you should look for it in the left subtree, otherwise look in the right subtree. What advantage are we looking for over an array implementation? For Binary Search to work, the array must be sorted, and inserting an element in sorted position in a sorted array has cost O(n), where n is the size of the array - about n/2 comparisons (average) to find the right place and about n/2 moves to make space in that right place.


Since Complete Binary trees have log2(n) levels, and we saw that heaps could be managed in O(log2(n)) time, we might be able to perform “sorted” insertions in O(log2(n)) time. The retrievals in O(log2(n)) time would be for free…

A Binary Search Tree is simply a binary tree containing data with the property that an Inorder Traversal will output the data in sorted order.

Another way: for every node, all items stored in its left subtree satisfy the same order relation with that node's item (say, less than), while those in its right subtree satisfy the opposite relation (greater than).
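The definition translates into short search and insertion routines. This is a minimal sketch under our own assumptions: int keys, duplicates ignored, nodes allocated with malloc, and function names (BSTInsert, BSTSearch, BSTInOrder, BSTFree) that are not from the course. Nothing here keeps the tree balanced.

```c
#include <stdlib.h>

typedef struct NodeTag {
    int Item;                        /* assumption: int keys */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* Insert Key below T and return the (possibly new) root of that subtree.
   Duplicates are ignored; if malloc fails the tree is left unchanged. */
TreeNode *BSTInsert(TreeNode *T, int Key)
{
    if (T == NULL) {
        TreeNode *n = malloc(sizeof *n);
        if (n != NULL) {
            n->Item = Key;
            n->LLink = NULL;
            n->RLink = NULL;
        }
        return n;
    }
    if (Key < T->Item) {
        T->LLink = BSTInsert(T->LLink, Key);
    } else if (Key > T->Item) {
        T->RLink = BSTInsert(T->RLink, Key);
    }
    return T;
}

/* Return the node holding Key, or NULL if it is not in the tree. */
TreeNode *BSTSearch(TreeNode *T, int Key)
{
    while (T != NULL && T->Item != Key) {
        T = (Key < T->Item) ? T->LLink : T->RLink;
    }
    return T;
}

/* InOrder traversal into out[]; returns the new length. */
int BSTInOrder(const TreeNode *T, int *out, int len)
{
    if (T != NULL) {
        len = BSTInOrder(T->LLink, out, len);
        out[len++] = T->Item;
        len = BSTInOrder(T->RLink, out, len);
    }
    return len;
}

void BSTFree(TreeNode *T)
{
    if (T != NULL) {
        BSTFree(T->LLink);
        BSTFree(T->RLink);
        free(T);
    }
}
```

Inserting 20, 9, 31, 8, 25 and then running the InOrder traversal yields 8 9 20 25 31, the sorted order the definition promises.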


Insert a, b, c, d, e, …

Simple-minded solution:

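[Figure: the resulting tree is a single chain: a at the top, then b, c, d, e, each one the right child of the node before it.]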

Oops… This doesn’t look like the hoped-for binary tree - this has degenerated into a list and we are back to O(n)...
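The degeneration is easy to measure. The sketch below uses a plain, unbalanced insertion and a Height function, then compares sorted input with a mixed order. The names, the int keys, the caller-supplied node storage, and the convention that an empty tree has height -1 are all our assumptions.

```c
#include <stddef.h>

typedef struct NodeTag {
    int Item;                        /* assumption: int keys */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* Hang node n (already filled in) below T with no rebalancing;
   returns the root of the subtree. */
static TreeNode *InsertNode(TreeNode *T, TreeNode *n)
{
    if (T == NULL) {
        return n;
    }
    if (n->Item < T->Item) {
        T->LLink = InsertNode(T->LLink, n);
    } else {
        T->RLink = InsertNode(T->RLink, n);
    }
    return T;
}

/* Height in edges; an empty tree has height -1 by convention here. */
static int Height(const TreeNode *T)
{
    int hl, hr;
    if (T == NULL) {
        return -1;
    }
    hl = Height(T->LLink);
    hr = Height(T->RLink);
    return 1 + (hl > hr ? hl : hr);
}

/* Insert keys[0..n-1] in order, using storage[0..n-1] for the nodes,
   and return the height of the resulting tree. */
int HeightAfterInserting(const int keys[], int n, TreeNode storage[])
{
    TreeNode *root = NULL;
    int i;
    for (i = 0; i < n; i++) {
        storage[i].Item  = keys[i];
        storage[i].LLink = NULL;
        storage[i].RLink = NULL;
        root = InsertNode(root, &storage[i]);
    }
    return Height(root);
}
```

Five keys inserted in sorted order give height 4 (a chain), while the order 3, 1, 4, 2, 5 gives height 2.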


There are two possibilities:

1) Hope. After all, the incoming data will be random, so the tree will grow randomly in both directions (left and right) anyway.

For those of you ready to accept this, I have this nice bridge connecting Manhattan and Brooklyn… It’s a real steal…

2) Roll up your sleeves and figure out an algorithm that will keep acceptable balance properties regardless of the properties of the input. Obviously, the insertion algorithm should have cost O(log2(n)) - or at least provably better than O(n).

For those of you ready for work, we shall take a look at AVL trees.


AVL Trees.

Question: given that I have a Binary Search Tree with n items and depth O(log2(n)), is it possible for me to insert or delete an item in O(log2(n)) time, in such a way that:

A) The tree remains a Binary Search Tree;

B) The depth remains O(log2(n)).

Answer: Yes.

Problem: HOW?

Imposing an “internal condition” worked well for Heaps (Priority Queues) - what kind of additional “internal condition” should we impose here?


Condition: that the height of the left subtree of every node can differ at most by 1 from the height of the right subtree.

Height: the length of the longest path from the root to some leaf.

[Figure: an example tree, annotated as follows.]

Left Subtree of Root: d = 3

Right Subtree of Root: d = 4

Height of Root = 4

Does the requirement hold recursively for all nodes?

NO: the right subtree of the root has two subtrees, one of depth 1, the other of depth 3.
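Checking the condition recursively can be sketched as follows. The code assumes int items, the convention that an empty subtree has height -1, and function names of our own. It recomputes heights at every node, which is simple but not efficient; a practical AVL implementation would store balance information in each node instead.

```c
#include <stddef.h>

typedef struct NodeTag {
    int Item;                        /* assumption: int keys */
    struct NodeTag *LLink;
    struct NodeTag *RLink;
} TreeNode;

/* Height in edges; an empty tree has height -1 by convention here. */
int Height(const TreeNode *T)
{
    int hl, hr;
    if (T == NULL) {
        return -1;
    }
    hl = Height(T->LLink);
    hr = Height(T->RLink);
    return 1 + (hl > hr ? hl : hr);
}

/* Returns 1 if, at EVERY node, the two subtree heights differ by at most 1. */
int IsAVL(const TreeNode *T)
{
    int diff;
    if (T == NULL) {
        return 1;                    /* an empty tree satisfies the condition */
    }
    diff = Height(T->LLink) - Height(T->RLink);
    if (diff < -1 || diff > 1) {
        return 0;
    }
    return IsAVL(T->LLink) && IsAVL(T->RLink);
}
```

A tree can pass the test at the root and still fail it deeper down, which is why the check has to recurse into both subtrees.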

An AVL Tree: it does not need to be Complete - or even very “occupied”. n = 20, d = 5, log2(20) is in [4, 5). A Complete Binary Tree of depth 5 could hold as few as 32 nodes or as many as 63.


Insert, successively, 20, 9, 8 into an empty AVL Tree.

An Empty tree clearly satisfies the AVL condition.

[Figure: a single node, 20, labeled EQ.]

A tree with ONE node also satisfies the AVL condition.

[Figure: 20 with left child 9; 9 is labeled EQ and 20 is labeled L + EQ = L.]

A tree with TWO nodes also satisfies the AVL condition.

Labels: EQ = equal height subtrees; L = Left Unbalanced.

[Figure: 20 at the root, with left child 9, which has left child 8. The labels are 8: EQ; 9: L + EQ = L; 20: L + L: condition violated.]

This tree with three nodes doesn’t. How do we fix it? How do we ensure that the subtrees of the root (which does not have to remain the node with value 20) satisfy the condition? Since 20 > 9, the node holding 20 must remain to the right of that holding 9, but the “parent-child” relationship does NOT need to be maintained.

[Figure: the rearranged tree: 9 at the root, with left child 8 and right child 20; all three nodes are labeled EQ.]

This is a special case of the following “general” situation:

[Figure: B is the root; A is the left child of B; T1 and T2 are the left and right subtrees of A; T3 is the right subtree of B.]

The left subtree of A has height n + 1, while the right subtree of A has height n. The right subtree of B has height n. Difference at B = 2.

[Figure: A is the root; T1 is the left subtree of A; B is the right child of A; T2 and T3 are the left and right subtrees of B.]

This is known as “single right rotation”. The left subtree of A has height n + 1, and the right subtree has height n + 1. Difference at A = 0.


In words:

1) The parent of B becomes the parent of A.

2) B becomes the rightchild of A.

3) The rightchild of A becomes the leftchild of B.

4) Everybody else stays the same.

3 pointers need to be moved (in the linked node implementation)

/* PRE:  p is the root of the non-empty AVL subtree being rotated
         and its left child is not empty.
   POST: The left child of p has become the new p.
         The old p has become the right child of the new p.
   USES: AVLTreeError. */

TreeNode *RotateRight(TreeNode *p)
{
    TreeNode *leftchild = p;

    if (p == NULL)
        AVLTreeError("Cannot RotateRight: empty tree");
    else if (p->left == NULL)
        AVLTreeError("Cannot RotateRight: empty subtree");
    else {
        leftchild = p->left;
        p->left = leftchild->right;
        leftchild->right = p;
    }

    return leftchild;
}
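The rotation can be tried on the 20, 9, 8 example with a self-contained sketch. The node type below (int Item with left and right pointers) is an assumption chosen to match the field names used by RotateRight, and the error reporting through AVLTreeError is replaced by simply returning the tree unchanged.

```c
#include <stddef.h>

typedef struct NodeTag {
    int Item;                        /* assumption: int keys */
    struct NodeTag *left;
    struct NodeTag *right;
} TreeNode;

/* Single right rotation at p. Returns the new root of the subtree.
   If p or its left child is missing, the tree is returned unchanged. */
TreeNode *RotateRight(TreeNode *p)
{
    TreeNode *leftchild;

    if (p == NULL || p->left == NULL) {
        return p;
    }
    leftchild = p->left;
    p->left = leftchild->right;      /* right subtree of A becomes left subtree of B */
    leftchild->right = p;            /* B becomes the right child of A */
    return leftchild;                /* the caller re-attaches A where B used to be */
}
```

Two of the three pointer moves happen inside the function; the third is the caller storing the returned pointer, for example root = RotateRight(root).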

If we now insert 31 and 25:

[Figure: 9 is the root, with left child 8 and right child 20; 31 is the right child of 20, and 25 is the left child of 31.]

Which node knows first that something is wrong? You can’t know what WILL happen as you move DOWN the tree, looking for the insertion point. You know what DID happen as you come back UP the tree. So the first node to notice is the one containing 20: its subtrees are unbalanced.

[Figure: the unbalanced tree: 9 at the root, with leftchild 8 and rightchild 20; 20 has rightchild 31, and 31 has leftchild 25.]

What makes sense? Everything to the left of 25 should remain there, everything to the right should too… Everything to the right of 20 should also remain there...

Step 1): Make 25 the right-child of 20, and 31 the right child of 25. This leaves us still unbalanced, but makes the next step easier.

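[Figure: after step 1: 9 at the root, with leftchild 8 and rightchild 20; 25 is the rightchild of 20, and 31 is the rightchild of 25.]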

[Figure: the tree after step 1: 20 has rightchild 25, and 25 has rightchild 31.]

Now move 25 up, in place of 20, and move 20 to the left-child of 25, making sure their children, if any, end up in the right positions.

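[Figure: after step 2: 9 at the root, with leftchild 8 and rightchild 25; 25 has leftchild 20 and rightchild 31.]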

We are done! The AVL condition is satisfied, as well as the Binary Search Tree one.
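Both conditions can be checked mechanically. The sketch below is one way to do it, under assumptions: the Node layout and the function names are made up for the example, and heights are recomputed from scratch rather than read from stored balance factors.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed node layout for this sketch. */
typedef struct Node {
    int key;
    struct Node *left;
    struct Node *right;
} Node;

/* Returns the height (counted in nodes) of the subtree at p if every node
   in it satisfies the AVL condition, or -1 as soon as one node does not. */
int CheckedHeight(const Node *p)
{
    int hl, hr;

    if (p == NULL) return 0;
    hl = CheckedHeight(p->left);
    if (hl < 0) return -1;
    hr = CheckedHeight(p->right);
    if (hr < 0) return -1;
    if (hl - hr > 1 || hr - hl > 1) return -1;   /* heights differ by 2 or more */
    return 1 + (hl > hr ? hl : hr);
}

/* Binary Search Tree condition: every key lies strictly between lo and hi. */
bool IsBST(const Node *p, long long lo, long long hi)
{
    if (p == NULL) return true;
    if (p->key <= lo || p->key >= hi) return false;
    return IsBST(p->left, lo, p->key) && IsBST(p->right, p->key, hi);
}

/* True when the tree is both a Binary Search Tree and height balanced. */
bool IsAVL(const Node *root)
{
    return IsBST(root, LLONG_MIN, LLONG_MAX) && CheckedHeight(root) >= 0;
}
```

Applied to the tree before the two steps, IsBST would still succeed but CheckedHeight would report the imbalance at 20; after the two steps both checks pass.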


This is a special case of the configuration below.

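[Figure: A at the root, with left subtree T1 and rightchild C; C has leftchild B and right subtree T4; B has left subtree T2 and right subtree T3.]

It is the “inner” subtree (rooted at B) that is causing the imbalance.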

We can observe that every descendant of B is greater than A and less than C. This means that T2 could become the rightchild of A, while T3 could become the leftchild of C, with B the new parent of BOTH A and C. We perform TWO “rotations”:

[Figure: the first rotation, a single right rotation at C: B becomes the rightchild of A, C becomes the rightchild of B, and T3 becomes the left subtree of C. T1, T2 and T4 keep their parents.]

This is still unbalanced, but it takes us back to a “single left rotation” case.

[Figure: before the second rotation: A at the root, with left subtree T1 and rightchild B; B has left subtree T2 and rightchild C; C has left subtree T3 and right subtree T4.]

Still unbalanced: perform a single left rotation.

[Figure: after the second rotation: B at the root, with leftchild A (subtrees T1 and T2) and rightchild C (subtrees T3 and T4).]

Balanced again.

/* PRE:  The AVL tree to which root points has been created.
         Newnode is a node that has been created and is to be
         inserted into the tree.
   POST: The node newnode has been inserted into the AVL tree with
         taller equal to TRUE if the height of the tree has increased,
         and with taller equal to FALSE otherwise.
   USES: LT, EQ, RightBalance, LeftBalance. */

TreeNode *InsertAVL(TreeNode *root, TreeNode *newnode, bool *taller)
{
    if(root == NULL){
        root = newnode;
        root->left = root->right = NULL;
        root->bf = EH;
        *taller = true;
    }
    else if(EQ(newnode->entry.key, root->entry.key)) {
        AVLTreeError("No duplicate keys in AVL tree");
    }
    else if(LT(newnode->entry.key, root->entry.key))
    // We are obviously going to the LEFT
    {
        root->left = InsertAVL(root->left, newnode, taller);
        // returned from LEFT INSERTION
        if(*taller) {               // left subtree is now taller
            switch(root->bf) {
            case LH:                // node was left high
                root = LeftBalance(root,taller);
                break;

            case EH:                // make node left high
                root->bf = LH;
                break;

            case RH:                // node now balanced height
                root->bf = EH;
                *taller = false;
                break;
            }
        }
    }
    else // try a RIGHT INSERTION
    {
        root->right = InsertAVL(root->right,newnode,taller);
        // returned from RIGHT INSERTION
        if(*taller) {               // right subtree is now taller
            switch(root->bf) {
            case LH:
                root->bf = EH;      // node now of balanced height
                *taller = false;
                break;

            case EH:                // make node right high
                root->bf = RH;
                break;

            case RH:                // node was right high
                root = RightBalance(root,taller);
                break;
            }
        }
    }
    return root;
}
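InsertAVL relies on RightBalance and LeftBalance, which are not shown on these slides. The sketch below is one plausible shape for RightBalance, following the rotation cases described earlier; it is an assumption, not the course's code. The simplified TreeNode (an int key instead of entry.key), the local rotation helpers and the use of assert in place of AVLTreeError are all choices made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LH, EH, RH } BalanceFactor;

/* Assumed, simplified node: key stands in for entry.key. */
typedef struct TreeNode {
    int key;
    BalanceFactor bf;
    struct TreeNode *left;
    struct TreeNode *right;
} TreeNode;

static TreeNode *RotateLeft(TreeNode *p)
{
    TreeNode *r = p->right;
    p->right = r->left;
    r->left = p;
    return r;
}

static TreeNode *RotateRight(TreeNode *p)
{
    TreeNode *l = p->left;
    p->left = l->right;
    l->right = p;
    return l;
}

/* PRE:  root was right high and its right subtree has just grown taller.
   POST: The subtree has been rebalanced, its new root is returned,
         and *taller is false. */
TreeNode *RightBalance(TreeNode *root, bool *taller)
{
    TreeNode *rs = root->right;    /* right subtree of root */
    TreeNode *ls;                  /* left subtree of rs (the "inner" tree) */

    assert(rs != NULL && rs->bf != EH);   /* EH cannot occur after an insertion */

    if (rs->bf == RH) {            /* outer case: single left rotation */
        root->bf = EH;
        rs->bf = EH;
        root = RotateLeft(root);
    } else {                       /* inner case: double left rotation */
        ls = rs->left;
        switch (ls->bf) {
        case RH: root->bf = LH; rs->bf = EH; break;
        case EH: root->bf = EH; rs->bf = EH; break;
        case LH: root->bf = EH; rs->bf = RH; break;
        }
        ls->bf = EH;
        root->right = RotateRight(rs);
        root = RotateLeft(root);
    }
    *taller = false;               /* the subtree is back to its old height */
    return root;
}
```

LeftBalance would be the mirror image, with left and right (and LH and RH) exchanged throughout.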


Perform a “double left” rotation:

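[Figure: three stages. Before: 9 at the root, with leftchild 8 and rightchild 20; 20 has rightchild 31, which has leftchild 25. After the first rotation: 20 has rightchild 25, which has rightchild 31. After the second rotation: 25 is the rightchild of 9, with leftchild 20 and rightchild 31.]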

[Figure: 9 at the root, with leftchild 8 and rightchild 25; 25 has leftchild 20 and rightchild 31. The new node 22 becomes the rightchild of 20.]

Insert 22: The root notices that its subtrees are unbalanced: the right one is deeper than the left one. It is the “inner” tree that is causing the imbalance: double left rotation, again.

[Figure: after the first rotation: 9 has leftchild 8 and rightchild 20; 20 has rightchild 25, which has leftchild 22 and rightchild 31.]

[Figure: after the second rotation: 20 at the root; its leftchild 9 has leftchild 8, and its rightchild 25 has leftchild 22 and rightchild 31.]


Some quick observations:

A node can find itself “unbalanced” in 4 ways:

1) via the leftchild of the leftchild: single right rotation;

2) via the rightchild of the rightchild: single left rotation;

3) via the rightchild of the leftchild: double right rotation;

4) via the leftchild of the rightchild: double left rotation.
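The four cases above can be told apart by comparing subtree heights. The sketch below does exactly that, under assumptions: the Node layout, the Rotation enum and the function names are made up for the example, and heights are recomputed on every call, which a real AVL implementation avoids by storing balance factors.

```c
#include <stddef.h>

/* Assumed node layout for this sketch. */
typedef struct Node {
    int key;
    struct Node *left;
    struct Node *right;
} Node;

typedef enum {
    NoRotation, SingleRight, SingleLeft, DoubleRight, DoubleLeft
} Rotation;

/* Height counted in nodes: an empty tree has height 0. */
int Height(const Node *p)
{
    int hl, hr;

    if (p == NULL) return 0;
    hl = Height(p->left);
    hr = Height(p->right);
    return 1 + (hl > hr ? hl : hr);
}

/* Decides which of the four repairs applies at node p. */
Rotation ChooseRotation(const Node *p)
{
    int diff = Height(p->left) - Height(p->right);

    if (diff > 1)       /* left too deep: is the outer or the inner grandchild to blame? */
        return Height(p->left->left) >= Height(p->left->right)
                   ? SingleRight : DoubleRight;
    if (diff < -1)      /* right too deep: same question, mirrored */
        return Height(p->right->right) >= Height(p->right->left)
                   ? SingleLeft : DoubleLeft;
    return NoRotation;
}
```

For the node 20 with rightchild 31 and grandchild 25 seen earlier, this picks the double left rotation.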

A Deletion: an easy example would involve removing 8 first and then 9.

[Figure: 20 at the root (EH); its leftchild 9 (L) has leftchild 8 (EH); its rightchild 25 (EH) has leftchild 22 (EH) and rightchild 31 (EH).]

Remove 8: the leaf is removed; the left subtree of 9 is decreased by 1 in height. When the recursion returns to 9, the balance factor goes from L to EH, but the total height has decreased by 1. Which means that 20 is now right unbalanced, by 1.

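[Figure: after removing 8: 20 at the root (R); its leftchild 9 (EH) is now a leaf; its rightchild 25 (EH) has leftchild 22 (EH) and rightchild 31 (EH).]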

Removing 9 leads to the configuration:

[Figure: 20 at the root (RR), with no leftchild; its rightchild 25 (EH) has leftchild 22 (EH) and rightchild 31 (EH).]

Since we are returning through a recursion, we have access to the parent of 20 - or the variable holding its address.

Let's make this more general by introducing height n subtrees at each child position:

[Figure: 20 at the root (RR), with left subtree T1 and rightchild 25 (EH); 25 has leftchild 22 (EH, with subtrees T2 and T3) and rightchild 31 (EH, with subtrees T4 and T5).]

What kind of conditions do we want? Besides being an AVL Tree… Each T is of height n - the tree rooted at 20 is of height n + 3.

a) If possible, the height of the transformed tree should remain the same - since this means that no further changes need to be made further up.

Let's look at a slightly more general configuration:

[Figure: 20 at the root (RR), with left subtree T1 and rightchild 25 (EH); 25 has left subtree T2 and right subtree T3.]

Performing a left rotation leads to:

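[Figure: 25 at the root (L), with leftchild 20 (R) and right subtree T3; 20 has left subtree T1 and right subtree T2.]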

Where the height of the final tree is still n + 3

Red: height = n + 1

Final Configuration:

[Figure: 25 at the root (L), with leftchild 20 (R) and rightchild 31 (EH); 20 has rightchild 22 (EH).]

To remove 31:

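[Figure: after removing 31: 25 at the root (LL), with leftchild 20 (R); 20 has rightchild 22 (EH).]

[Figure: after a left rotation at 20: 25 at the root (LL), with leftchild 22 (L); 22 has leftchild 20 (EH).]

[Figure: after a right rotation at 25: 22 at the root, with leftchild 20 and rightchild 25.]

But the height has decreased by 1, and this information must be propagated up the tree.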


The data structures actually used (text rather than Demo - they match):

typedef enum {LeftHeavy, Balanced, RightHeavy} BalanceFactor;

typedef struct AVLTreeNodeTag {
    BalanceFactor BF;
    KeyType Key;
    struct AVLTreeNodeTag *LLink;
    struct AVLTreeNodeTag *RLink;
} AVLTreeNode;


What about performance?

1) What is the expected height of an AVL tree with n entries?

A: Experimentally: log2(n) + 0.25, so very little depth is lost compared to a Complete Binary Tree. The worst AVL tree (double-starred section) will require at most about 44% more comparisons than the best.

2) Probability that an insertion requires rotation?

A: From empirical evidence: every third insertion or so.

3) How many rotations could be required for ONE insertion?

A: Just one (a single or a double rotation).


4) Probability that a deletion requires rotation?

A: From empirical evidence: every fifth deletion.

5) How many rotations could be required for ONE deletion?

A: One for every node on the path back to the root.

This last statement is one of the reasons why people have continued looking for other balancing algorithms: the Red-Black trees (for example) require at most three rotations per deletion, but they give up some depth (the law of the no-free-lunch strikes again).
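The 44% worst-case figure quoted in answer 1 comes from the sparsest AVL trees, in which one subtree is one level shorter than the other at every node. The fewest nodes N(h) that an AVL tree of height h can hold then satisfies N(h) = N(h-1) + N(h-2) + 1. The sketch below tabulates it; the function name is made up for the example, and height is assumed to be counted in levels (a single node has height 1).

```c
/* Fewest nodes an AVL tree of the given height can have, counting height
   in levels: an empty tree has height 0, a single node has height 1. */
unsigned long MinAVLNodes(int height)
{
    unsigned long shorter = 0;   /* N(h-2), starting at N(0) */
    unsigned long taller = 1;    /* N(h-1), starting at N(1) */
    int h;

    if (height <= 0) return 0;
    for (h = 2; h <= height; h++) {
        unsigned long next = taller + shorter + 1;   /* root + both subtrees */
        shorter = taller;
        taller = next;
    }
    return taller;
}
```

For instance, 143 nodes are enough to build an AVL tree 10 levels deep, while a Complete Binary Tree with 143 nodes has only 8 levels. As the trees get larger the ratio between the two depths approaches roughly 1.44.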


Tries… (pies??)

How can you access a large dictionary - the 200000+ words of the English language - quickly?

A trie (from retrieval) is a data-type designed for this kind of work: keys with the same prefix are accessed together.

Consider the following words:

art, artisan, artifact, arthur, boat, bloat, bloated, boa.

[Figure: the trie for these words. From the root, branch a leads through r and t to the word art, which then branches on h (u, r: arthur) and on i (f, a, c, t: artifact; s, a, n: artisan). Branch b splits on o (a: boa, then t: boat) and on l (o, a, t: bloat, then e, d: bloated).]


In English, there are 26 choices for the first letter, 26 for the second, etc. One should add a marker to indicate that an actual word has been found, and a pointer to the dictionary information.

A word of n characters can be found (or rejected) in at most 26*n character comparisons, regardless of the size of the dictionary…

Finding a word of n characters by standard word match (one word at a time) would require, on average, at least half as many character comparisons as there are words in the dictionary - probably several times as many, due to all the matching prefixes that would eventually fail.
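A trie along these lines can be sketched as follows. This is a minimal sketch under assumptions, not the course's implementation: the names TrieNode, NewTrieNode, TrieInsert and TrieContains are made up, only the lowercase letters a to z are accepted, and the letters are assumed to have consecutive character codes (as in ASCII). The pointer to the dictionary information is left out; only the end-of-word marker is kept.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 26

typedef struct TrieNode {
    bool isWord;                        /* marker: an actual word ends here */
    struct TrieNode *child[ALPHABET];   /* one link per letter a..z */
} TrieNode;

/* Allocates an empty node; returns NULL if memory runs out. */
TrieNode *NewTrieNode(void)
{
    TrieNode *node = malloc(sizeof *node);
    int i;

    if (node == NULL) return NULL;
    node->isWord = false;
    for (i = 0; i < ALPHABET; i++)
        node->child[i] = NULL;
    return node;
}

/* Inserts a lowercase word; returns false on a character outside a..z
   or if memory runs out. */
bool TrieInsert(TrieNode *root, const char *word)
{
    TrieNode *p = root;

    for (; *word != '\0'; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= ALPHABET) return false;
        if (p->child[i] == NULL) {
            p->child[i] = NewTrieNode();
            if (p->child[i] == NULL) return false;
        }
        p = p->child[i];
    }
    p->isWord = true;
    return true;
}

/* Follows one link per character; a prefix of a word is not itself a word
   unless its last node carries the marker. */
bool TrieContains(const TrieNode *root, const char *word)
{
    const TrieNode *p = root;

    for (; *word != '\0' && p != NULL; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= ALPHABET) return false;
        p = p->child[i];
    }
    return p != NULL && p->isWord;
}
```

Because each node here indexes its children directly by letter, a lookup takes one step per character; the 26*n bound applies when the children of a node have to be compared one at a time. The price is 26 pointers in every node, most of them NULL.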


Disadvantages:

A 26+ bit structure (one bit for each letter + various pointers) for each node of this tree. The tree would be shallow - average depth < 10 (probably) - but very “bushy”.

Carelessness in the implementation could be very expensive in space. Law of the no-free-lunch coming up again: time is gained by giving away space, just be careful you don’t give away too much.

There may be better ways of setting this up, but this method already explores a number of ideas that can be “tweaked”.