CS503: Eleventh Lecture, Fall 2008 Self-Balancing Trees Michael Barnathan

Post on 21-Dec-2015

Here’s what we’ll be learning:

• Review of Assignment 3.

• Data Structures:
  – AVL Trees.
  – B+ Trees.

The Assignment
• Between last week’s review and the assignment, I’ve probably shattered some confidence. Don’t get discouraged just yet.
• The problems you’re having stem from a lack of knowledge, not ability:
  – You can use the Scanner class to read space-delimited fields from the file, followed by the rest of the line (so you don’t need to parse it yourself).
    • I did mention this on the assignment description.
  – You can do this on ANY input stream, not just System.in.
  – I didn’t mention Iterators in class yet, but did mention Java’s “for each” syntax: “for (Object o : container)”
    • It’s good that you discovered them, though; they are very useful and I was planning on covering them soon.
  – Concatenation of strings is sort of clumsy in Java. The StringBuilder and StringWriter classes make this easier.
• However, discovering new programming knowledge is a component of ability:
  – In your careers, you will frequently need to use unfamiliar libraries and maintain others’ code.
  – You start out knowing nothing about that code.
  – To use it, you need to learn it.
• Remember, abstraction is key! Memorizing every detail is self-defeating. Know the general concepts and you’ll be able to look up the details.
  – Do you think I’ve memorized every algorithm I put on these slides? Of course not. But I know enough about the concepts to look up the information I need and implement the algorithms using it.
• Ask questions if you don’t understand something. Ask early, ask often.
  – Aside from my weird humor, this is the primary advantage of having me here over reading the textbook.
• Time management: Estimate how much time you will need. Then double it.
• If you get stuck on one part, move forward with another and come back to it later.
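The Scanner, “for each”, and StringBuilder hints above can be tried out together in a few lines. This is only a sketch: the class name DictEntry, its fields, and the line layout "word part definition" are assumptions made for illustration, not the assignment's actual format.

```java
import java.util.Scanner;

// Hypothetical dictionary entry; the names and the line format are assumptions.
class DictEntry {
    final String word;
    final String part;
    final String def;

    DictEntry(String word, String part, String def) {
        this.word = word;
        this.part = part;
        this.def = def;
    }

    // Parses one line of the assumed form "<word> <part> <definition...>".
    static DictEntry parse(String line) {
        Scanner in = new Scanner(line);   // Scanner also accepts a File or any InputStream.
        String word = in.next();          // First space-delimited field.
        String part = in.next();          // Second space-delimited field.
        // Whatever is left on the line, or "" if nothing follows the two fields.
        String def = in.hasNextLine() ? in.nextLine().trim() : "";
        in.close();
        return new DictEntry(word, part, def);
    }

    // Joins the words of several entries with a StringBuilder and a "for each" loop.
    static String listWords(Iterable<DictEntry> entries) {
        StringBuilder sb = new StringBuilder();
        for (DictEntry e : entries)       // Works on anything that is Iterable.
            sb.append(e.word).append(' ');
        return sb.toString().trim();
    }
}
```

Calling DictEntry.parse on each line of the file and collecting the results in a list would then cover the reading part without any hand-written parsing.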

toString()
• This isn’t too important, just syntactic sugar.
• System.out.println() will actually allow you to output any sort of Object, not just a String.
• By default, Java just outputs the class name and a hash code (something like Word@1b6d3586), which is not very useful.
• However, what this is really doing is calling a method called toString() and outputting the result. toString() is present in the Object class.
• So in the assignment, we can override toString() as follows:

    //Needs: import java.io.StringWriter;
    public String toString() {
        StringWriter out = new StringWriter();
        out.write("\n" + word + " (" + part + ")\n\n\t" + def + "\n");
        return out.toString(); //See, StringWriter has one too!
    }

• And then we can output a Word just by writing System.out.println(word);

AVL Trees – The Idea
• We looked at an algorithm for balancing trees using rotations last time.
• This turns out to be a pretty good strategy in general.
  – Rotations are O(1): they only affect up to 2 levels of the tree no matter how deep it is.
  – As in the DSW algorithm, rotations can be used to maintain tree balance.
  – The trick is knowing when to apply them.
    • A left rotation will decrease the right subtree’s height and increase the left subtree’s height.
    • A right rotation will do the opposite.
  – Recall: A balanced tree is one in which the depths of the leaves differ by no more than one level.
• We can enforce this condition using rotations!
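As a reminder of what a rotation does, here is a minimal sketch. It is not the Lecture 9 code: the class names RotNode and Rotations are made up for this example, and these versions return the new top of the subtree, which the caller has to re-attach.

```java
// Minimal node type for illustration only; not the lecture's class.
class RotNode {
    int value;
    RotNode left, right;
    RotNode(int value) { this.value = value; }
}

class Rotations {
    // Right rotation: the left child becomes the new top of the subtree.
    // The left side gets shorter and the right side gets taller.
    static RotNode rotateRight(RotNode root) {
        RotNode pivot = root.left;
        root.left = pivot.right;   // The pivot's right subtree moves under the old root.
        pivot.right = root;        // The old root becomes the pivot's right child.
        return pivot;              // Only a constant number of links changed: O(1).
    }

    // Left rotation: the mirror image of rotateRight.
    static RotNode rotateLeft(RotNode root) {
        RotNode pivot = root.right;
        root.right = pivot.left;
        pivot.left = root;
        return pivot;
    }
}
```

Rotating the left-leaning chain 3, 2, 1 to the right puts 2 on top with 1 and 3 as its children; rotating that result to the left restores the chain.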

Balance Factor

• The difference between the height of a node’s right and left subtrees is called the node’s balance factor.
  – Balance Factor = Height(Right) - Height(Left).
  – Some sources define it as Height(Left) - Height(Right), but this only flips the signs.
• Leaves, having no children, have a balance factor of 0 (Height(right) = Height(left) = 0).
• By the definition of tree balance, a subtree is considered balanced if its balance factor is -1, 0, or 1.
• A left rotation will lower the balance factor.
• A right rotation will raise it.
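The definition above translates almost word for word into code. The sketch below recomputes heights from scratch, which is fine for checking a tree by hand but too slow for a real AVL tree; the names BfNode and Balance are placeholders chosen for this example.

```java
// Placeholder node type for this example.
class BfNode {
    int value;
    BfNode left, right;
    BfNode(int value, BfNode left, BfNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class Balance {
    // Height with the convention used above: an empty subtree has height 0,
    // so a leaf has height 1.
    static int height(BfNode n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }

    // Balance Factor = Height(Right) - Height(Left).
    static int balanceFactor(BfNode n) {
        return height(n.right) - height(n.left);
    }

    // Balanced means every node has a balance factor of -1, 0, or 1.
    static boolean isBalanced(BfNode n) {
        if (n == null)
            return true;
        return Math.abs(balanceFactor(n)) <= 1
            && isBalanced(n.left)
            && isBalanced(n.right);
    }
}
```

A single leaf has balance factor 0, while a chain of three nodes hanging to the left gives its top node a balance factor of -2.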

Balance Factor

[Figure: example trees with each node labeled by its balance factor.]

No balance factor < -1 or > 1: This tree is balanced.

A node has a balance factor of -2: This tree is not balanced.

A right rotation will balance this tree.

AVL Trees - Structure
• Small modification to a node’s structure:

    class BinaryTree {
        int value;
        BinaryTree left;
        BinaryTree right;
    }

    class AVLTree extends BinaryTree {
        //value, left, and right are inherited.
        int balanceFactor;
    }

AVL Trees: Algorithms
• Insertion and deletion must keep the balance. Access doesn’t change.
• Insertion:
  – Insert as in a normal binary search tree, but go back up the tree and update the balance factor of each node back towards the root.
  – “Go back up the tree” -> do something after the recursive call / on the “pop”.
  – If the balance factor becomes +2 or -2, rotate to correct it.
  – Four different cases involving up to 2 rotations.
• Deletion:
  – Delete as in a normal binary search tree (replacing the node with its inorder successor), but go back up the tree and adjust the balance factors.
  – If the balance factor becomes +2 or -2, rotate to correct it.
  – If the balance factor becomes +1 or -1, we can stop.
    • This indicates that the height of the subtree hasn’t changed.
  – If the balance factor becomes 0, the subtree has become shorter, so we must keep going.
  – The deletion algorithm is very similar to the BST algorithm, so I won’t present it formally.

Insertion Cases (Wikipedia)

Note the similarity to tree_to_list:

Visual AVL Demonstration

• http://webpages.ull.es/users/jriera/Docencia/AVL/AVL%20tree%20applet.htm

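Insertion Algorithm

    //Refer to Lecture 9 for the rotate functions. The updates below assume they rotate in place,
    //so that "root" still refers to the top of the subtree after the call.
    //Also assumes left and right are typed as AVLTree (or cast), so balanceFactor is reachable.
    void insert(AVLTree root, AVLTree newtree) {
        //This can only happen now if the user passes in an empty tree.
        //Java passes references by value, so assigning root = newtree here would not be
        //visible to the caller; the caller should use newtree as the root instead.
        if (root == null)
            return;
        else if (newtree.value < root.value) { //Go left if <.
            if (root.left == null) //Found a place to insert.
                root.left = newtree;
            else
                insert(root.left, newtree); //Keep traversing.
        }
        else { //Go right if >=.
            if (root.right == null)
                root.right = newtree; //Found a place to insert.
            else
                insert(root.right, newtree); //Keep traversing.
        }

        updateBalance(root); //Runs on the way back up the tree.
    }

    void updateBalance(AVLTree root) {
        //The slide does not show the bookkeeping that keeps balanceFactor current after an
        //insertion; this method assumes root.balanceFactor already accounts for the new node.
        //Note that a negative balance factor guarantees a left child exists.
        if (root.balanceFactor < -1 && root.left.balanceFactor < 0) {
            rotateRight(root); //Left-left case: rotate right once.
            root.balanceFactor = 0; //After a single rotation, both nodes end up balanced.
            root.right.balanceFactor = 0;
        }
        else if (root.balanceFactor < -1) {
            //Left-right case: rotate the left child left, then rotate the root right.
            rotateLeft(root.left);
            rotateRight(root);
            root.left.balanceFactor = -1 * Math.max(root.balanceFactor, 0);
            root.right.balanceFactor = -1 * Math.min(root.balanceFactor, 0);
            root.balanceFactor = 0;
        }
        else if (root.balanceFactor > 1 && root.right.balanceFactor > 0) {
            rotateLeft(root); //Right-right case: rotate left once.
            root.balanceFactor = 0;
            root.left.balanceFactor = 0;
        }
        else if (root.balanceFactor > 1) {
            //Right-left case: rotate the right child right, then rotate the root left.
            rotateRight(root.right);
            rotateLeft(root);
            root.left.balanceFactor = -1 * Math.max(root.balanceFactor, 0);
            root.right.balanceFactor = -1 * Math.min(root.balanceFactor, 0);
            root.balanceFactor = 0;
        }
    }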
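For experimenting, it can help to have a version that runs on its own. The sketch below is a common alternative formulation, not the code from the slide: it stores each node's height instead of its balance factor, derives the balance factor when needed, and returns the new top of each subtree so the caller can re-attach it. The names AvlNode and Avl are placeholders.

```java
// Placeholder node type; stores a height rather than a balance factor.
class AvlNode {
    int value;
    int height = 1;            // A leaf has height 1; an empty subtree counts as 0.
    AvlNode left, right;
    AvlNode(int value) { this.value = value; }
}

class Avl {
    static int height(AvlNode n) { return n == null ? 0 : n.height; }

    // Balance Factor = Height(Right) - Height(Left).
    static int balance(AvlNode n) { return height(n.right) - height(n.left); }

    static void update(AvlNode n) {
        n.height = 1 + Math.max(height(n.left), height(n.right));
    }

    static AvlNode rotateRight(AvlNode root) {
        AvlNode pivot = root.left;
        root.left = pivot.right;
        pivot.right = root;
        update(root);          // The old root is now the lower node; fix it first.
        update(pivot);
        return pivot;
    }

    static AvlNode rotateLeft(AvlNode root) {
        AvlNode pivot = root.right;
        root.right = pivot.left;
        pivot.left = root;
        update(root);
        update(pivot);
        return pivot;
    }

    // Inserts value and returns the (possibly new) top of this subtree.
    static AvlNode insert(AvlNode root, int value) {
        if (root == null)
            return new AvlNode(value);
        if (value < root.value)
            root.left = insert(root.left, value);
        else
            root.right = insert(root.right, value);   // >= goes right.
        update(root);                                  // Work done "on the pop".
        int bf = balance(root);
        if (bf < -1) {
            if (balance(root.left) > 0)                // Left-right case.
                root.left = rotateLeft(root.left);
            return rotateRight(root);                  // Left-left case, or second half of left-right.
        }
        if (bf > 1) {
            if (balance(root.right) < 0)               // Right-left case.
                root.right = rotateRight(root.right);
            return rotateLeft(root);                   // Right-right case, or second half of right-left.
        }
        return root;
    }
}
```

Inserting 1 through 7 in ascending order, which would turn a plain BST into a linked list, leaves this tree with 4 at the root and a height of 3. Usage follows the pattern root = Avl.insert(root, v), starting from root = null.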

Insertion Analysis

• We go down the tree to insert.
• We go back up the tree and rotate.
• AVL trees are always balanced, so what is the complexity of this operation?

CRUD: AVL Trees.
• Insertion: O(log n).
• Access: O(log n).
• Updating an element: O(log n).
• Deleting an element: O(log n).

• Search: O(log n).
• Traversal: O(n).

• This is a winner.
• We have all of the nice BST properties, without having to worry about balance.
• This does, however, require O(n) extra space to store the balance factor for each node.

B+ Trees: Motivation
• Binary search trees are very useful data structures when data lives in memory.
• However, they are not good for disk access.
  – Disk access is very slow compared to memory.
  – Traversing a BST is a mess on disk.
    • If each node is stored somewhere on the disk, even a simple traversal requires a great deal of random access.
    • Random access is difficult to cache.
    • Range queries in particular perform poorly.
  – Nodes do not align to “blocks” on disk.
    • Disks can only read data one “block” at a time. If we need less than one block, we waste time reading data that isn’t used.
• A self-balancing tree called a B+ tree can solve these issues.
• These are used in several popular filesystems, including NTFS, ReiserFS, XFS, and JFS.
• They are also used to index tables in database systems, such as MySQL.

B+ Trees: Idea
• Very different from what we’ve seen.
• First, they grow UP, not DOWN.
• They are not binary; each node contains an array of n values and points to n+1 children.
• Only leaves hold actual values; interior nodes hold the maximum value in the corresponding leaf. This is used as a means of indexing.
  – Some variations use the minimum or middle.
• They are “threaded”:
  – Each leaf points to the next in sorted order.
  – This makes sequential access and range queries fast.
• If each variable occupies a bytes and your device’s block size is b, the optimal size of the array is b / a - 1.
  – One node of the tree would then fill one block.
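To see why the threading helps, here is a sketch of a range query that never goes back up the tree: once the first leaf is found, it just follows the next pointers. The names ScanLeaf and RangeScan are invented for this example, and locating the starting leaf is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical leaf type: sorted values plus a pointer to the next leaf.
class ScanLeaf {
    final int[] values;
    ScanLeaf next;
    ScanLeaf(int... values) { this.values = values; }
}

class RangeScan {
    // Collects every value in [lo, hi], starting from the leaf where lo would live.
    static List<Integer> range(ScanLeaf start, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        for (ScanLeaf leaf = start; leaf != null; leaf = leaf.next) {
            for (int v : leaf.values) {
                if (v > hi)
                    return out;   // Leaves are sorted, so nothing later can match.
                if (v >= lo)
                    out.add(v);
            }
        }
        return out;
    }
}
```

Each leaf visited is one sequential read, which is exactly the access pattern disks handle well.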

B+ Trees

Interior node: [14 | 20 | 26]
Leaves:        [7 14] -> [20] -> [22 26] -> [28 34 97]

Note that each value in an interior node is the maximum of its leaves’ values. The last pointer points to a child containing greater elements.

Advantage: you can tell which leaf to read using only the interior node (one disk read). The other leaves do not have to be read.

B+ Tree Structure

• We define k as the order of a B+ tree.
• Each node is allowed to store up to 2k values and 2k+1 pointers.
• The structure looks like this:

    class BPTree {
        static final int ORDER=2; //You choose this.
        int[] values = new int[ORDER*2];
        BPTree[] children = new BPTree[ORDER*2+1];
    }

Search

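• Each value in an interior node is the maximum value stored under the corresponding child. We can use this to locate the child to descend to when searching.

    //Assumes every slot of values is in use. A partially filled node would also need
    //a count of used slots, which the structure above does not have.
    BPTree search(BPTree root, int val) {
        if (root == null)
            return null;
        for (int childidx = root.values.length - 1; childidx >= 0; childidx--) {
            if (root.children[0] == null && root.values[childidx] == val) //Found in a leaf.
                return root;
            if (val > root.values[childidx])
                return search(root.children[childidx+1], val);
        }
        return search(root.children[0], val); //val is no larger than the first key.
    }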

Search Example

Interior node: [14 | 20 | 26]
Leaves:        [7 14] -> [20] -> [22 26] -> [28 34 97]

Search for 17.

Childidx = 2. Val > 26? No.
Childidx = 1. Val > 20? No.
Childidx = 0. Val > 14? Yes. Traverse down to the leaf containing 20 (child[childidx+1]).

The node containing 20 is a leaf. Val = 20? No. Return null.
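The walk-through above can be reproduced with a small runnable sketch. To sidestep partially filled arrays, this version assumes each node's values array holds exactly the keys in use; it also scans the keys from the left rather than from the right. The names BpNode and BpSearch are made up for this example.

```java
// Hypothetical node type: arrays are sized to the keys actually in use.
class BpNode {
    final int[] values;        // In interior nodes: the maximum under each child.
    final BpNode[] children;   // null for a leaf; otherwise values.length + 1 children.

    BpNode(int[] values, BpNode[] children) {
        this.values = values;
        this.children = children;
    }

    boolean isLeaf() { return children == null; }
}

class BpSearch {
    // Returns the leaf containing val, or null if val is not in the tree.
    static BpNode search(BpNode node, int val) {
        while (node != null && !node.isLeaf()) {
            int idx = 0;
            // Skip every child whose maximum is smaller than val.
            while (idx < node.values.length && val > node.values[idx])
                idx++;
            node = node.children[idx];   // The last child holds anything greater.
        }
        if (node == null)
            return null;
        for (int v : node.values)
            if (v == val)
                return node;             // Found in a leaf.
        return null;
    }
}
```

With the tree from the example, searching for 17 lands in the leaf holding 20 and returns null, while searching for 20 returns that leaf. Only one leaf is ever examined.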

Insertion: Split of a Leaf Node
• If the node is not full, we can just locate the proper position in the node’s array of keys to insert.
• However, if the node is full, we need to split the node. This is how the tree grows.
• Insertion always begins at the leaves.
• When a leaf splits, a new leaf is created, which becomes that leaf’s successor.
  – The lower half of the old leaf’s values stay, while the upper half move to the new leaf. The parent value for the old leaf becomes the new maximum of the values remaining in the leaf.
  – The old leaf is then linked to the new leaf.
• That means the leaf’s parent must point to this new node.
  – So we insert the new leaf into the parent.
  – Ack, there’s a problem here!
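Here is a sketch of just the leaf-level part of that procedure. It uses a List instead of a fixed array to keep the code short, inserts first and splits on overflow, and keeps the larger half in the old leaf when the count is odd; all of these are choices made for this example, as are the names SplitLeaf and LeafSplit. Updating the parent is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical leaf type for this example.
class SplitLeaf {
    final List<Integer> values = new ArrayList<>();   // Kept in sorted order.
    SplitLeaf next;                                    // Next leaf in sorted order.

    int max() { return values.get(values.size() - 1); }   // The key the parent would store.
}

class LeafSplit {
    static final int ORDER = 2;   // A leaf holds at most 2 * ORDER values.

    // Inserts val into leaf. Returns the newly created leaf if a split
    // happened, or null if the value simply fit.
    static SplitLeaf insert(SplitLeaf leaf, int val) {
        int pos = 0;
        while (pos < leaf.values.size() && leaf.values.get(pos) < val)
            pos++;                                     // Locate the proper position.
        leaf.values.add(pos, val);
        if (leaf.values.size() <= 2 * ORDER)
            return null;                               // Not over capacity: no split.

        SplitLeaf right = new SplitLeaf();
        int mid = (leaf.values.size() + 1) / 2;        // Lower half (rounded up) stays.
        List<Integer> upper = leaf.values.subList(mid, leaf.values.size());
        right.values.addAll(upper);                    // Upper half moves to the new leaf.
        upper.clear();                                 // Clearing the view removes it from the old leaf.
        right.next = leaf.next;                        // Link the new leaf in as the successor.
        leaf.next = right;
        return right;
    }
}
```

After a split, the caller would update the parent's key for the old leaf to leaf.max() and insert the returned leaf into the parent, which is where the trouble described above begins.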

Split of an Internal Node
• What if the parent is also full when we try to insert the new leaf into it?
• We then have to split the parent.
• This is similar to a leaf-node split (cut the node in half, move the maximum up), with one crucial difference:
  – When you move the old maximum to the parent, you remove it from the current node.
  – Internal nodes hold only index keys, not actual values, so this is OK.
• Now we’re inserting into this node’s parent.
  – And that means that node can split as well!
• When will the insanity end!?

Root Split

• When you reach the root, of course.
• When the root splits in two, a new root is created pointing to the old root and its new sibling, which are now its children.
• This increases the height of the tree by 1.
• So it is possible for one insertion to cascade splits all the way up the tree.
  – What do you think the complexity is, then?

B+ Tree Deletion

• As always, deletion is insertion in reverse.
• As a rule, B+ tree nodes should always be at least halfway full (that is why the order is half of the maximum number of values in a node).
• If deletion causes a node to fall below this size, we will have to undo splits.
• But first, the easy case:
  – If the leaf we remove from is more than half full, we simply remove the value and we’re finished.

“Borrowing” Values.

• If we fall below the halfway threshold but the next sibling of this leaf is above the threshold, just move the first value of the sibling into the last position of the current node.
• Since this is larger than anything currently in the node by definition, update the parent’s maximum value for this node as well.
• Values can be borrowed from the previous sibling as well.
  – If neither can spare an element, there’s no choice but to merge.

Merging
• The inverse of a split is called a merge or coalesce operation.
• The inability to borrow ensures that this node and its sibling are both at most half full.
• Therefore, we can merge the two siblings into one node.
• We then delete the pointer to the sibling from the parent node (and from the linked list of siblings, of course).
• This in turn can cause the parent to underflow…
  – Fortunately, internal nodes and leaves are merged in the same way.
• If the merge propagates to the root, the old root disappears and the height of the tree drops by 1.
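The leaf-level half of deletion, covering the easy case, borrowing, and merging, can be sketched as follows. This is a simplification under several assumptions: it only looks at the next sibling (not the previous one), it assumes that sibling shares the same parent, and it leaves all parent updates to the caller. The names DelLeaf, LeafDelete, and Repair are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical leaf type for this example.
class DelLeaf {
    final List<Integer> values;   // Kept in sorted order.
    DelLeaf next;                 // Next leaf in sorted order.
    DelLeaf(Integer... vals) { values = new ArrayList<>(Arrays.asList(vals)); }
}

class LeafDelete {
    static final int ORDER = 2;   // Leaves should hold at least ORDER values.

    enum Repair { NONE, BORROW, MERGE }

    // Removes val from leaf, then repairs an underflow using the next sibling.
    // Reports which repair was needed so the caller can fix up the parent.
    static Repair remove(DelLeaf leaf, int val) {
        leaf.values.remove(Integer.valueOf(val));     // Remove by value, not by index.
        DelLeaf sibling = leaf.next;
        if (leaf.values.size() >= ORDER || sibling == null)
            return Repair.NONE;                       // Easy case (or no next sibling to use).
        if (sibling.values.size() > ORDER) {
            // The sibling can spare one: its first value becomes our last.
            leaf.values.add(sibling.values.remove(0));
            return Repair.BORROW;                     // Caller updates this leaf's maximum in the parent.
        }
        // Neither can spare a value, so the two leaves fit into one node.
        leaf.values.addAll(sibling.values);
        leaf.next = sibling.next;                     // Unlink the sibling from the leaf list.
        return Repair.MERGE;                          // Caller removes the sibling from the parent.
    }
}
```

A MERGE result is the interesting one: removing the sibling's pointer from the parent is itself a deletion, so the same check has to be repeated one level up.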

Example
• The algorithms for these operations are complex. I haven’t decided whether we should discuss them in detail yet.
• First, make sure you understand the idea of what is going on.
• This will help:
  – http://people.ksp.sk/~kuko/bak/index.html
  – That applet demonstrates B-trees, not B+ trees, but I’ll point out the differences.
  – http://www.seanster.com/BplusTree/BplusTree.html
  – This is a B+-tree implementation, but uses the middle element rather than the maximum.

CRUD: B+ Trees.
• Insertion: O(log n).
• Access: O(log n).
• Update: O(log n).
• Delete: O(log n).

• Search: O(log n).
• Traversal: O(n).

• These are the same asymptotic performances as in AVL trees.
• The primary advantage of the B+ tree over the AVL tree is in disk performance and indexing.
  – There’s also no balance factor, but you waste more space with half-empty arrays in the worst case.
• You also have that nice linked list structure to traverse.

Our Last Balancing Act

• We’ve devoted a lot of time to tree balance.
• Next time, we’ll move on to heaps and heapsort, and we’ll revisit priority queues.
• The lesson:
  – Automatic solutions save time with repeated use, but often carry a higher initial cost.