TRANSCRIPT
CSE 326: Data Structures
Binary Search Trees
5/30/2008 1
Today’s Outline
• Dictionary ADT / Search ADT
• Quick Tree Review
• Binary Search Trees
ADTs Seen So Far
• Stack
  – Push
  – Pop
• Queue
  – Enqueue
  – Dequeue
• Priority Queue
  – Insert
  – DeleteMin
Then there is decreaseKey… which needs a pointer to the node. Why? Because find is not efficient in a priority queue.
The Dictionary ADT
• Data:
  – a set of (key, value) pairs
• Operations:
  – Insert(key, value)
  – Find(key)
  – Remove(key)
The Dictionary ADT is also called the “Map ADT”
• jfogarty: James, Fogarty, CSE 666
• phenry: Peter, Henry, CSE 002
• boqin: Bo, Qin, CSE 002
insert(jfogarty, ….)
find(boqin) returns (boqin, Bo, Qin, …)
A Modest Few Uses
• Sets
• Dictionaries
• Networks: router tables
• Operating systems: page tables
• Compilers: symbol tables
Probably the most widely used ADT!
Implementations

                         insert          find          delete
• Unsorted linked list   Θ(1)            Θ(n)          Θ(n)
• Unsorted array         Θ(1)            Θ(n)          Θ(n)
• Sorted array           Θ(log n + n)    Θ(log n)      Θ(log n + n)

What limits the performance? The time to move elements. Can we mimic binary search with a BST? SO CLOSE!
Tree Calculations
Recall: height is the max number of edges from the root to a leaf.
To find the height of a tree t:
height(t) = 1 + max{height(t.left), height(t.right)}
Runtime: Θ(N) (constant time for each node; each node is visited twice)
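The recursion above can be sketched in a few lines of Python (the `Node` class is an assumption for illustration, not from the slides), using height(null) = −1 so that a single leaf gets height 0:

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def height(t):
    # the empty tree has height -1, so a lone leaf has height 0
    if t is None:
        return -1
    return 1 + max(height(t.left), height(t.right))

# a root with a single leaf child: one edge, so height 1
root = Node(left=Node())
print(height(root))   # -> 1
```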
Tree Calculations Example
(example tree rooted at A, with subtrees rooted at B and C)
How high is this tree?
height(B) = 1 and height(C) = 4, so height(A) = 5
More Recursive Tree Calculations: Tree Traversals
A traversal is an order for visiting all the nodes of a tree
Three types:
• Pre-order: Root, left subtree, right subtree
• In-order: Left subtree, root, right subtree
• Post-order: Left subtree, right subtree, root
(an expression tree: + at the root, with left child * over the operands 2 and 4, and right child 5)
Inorder Traversal
void traverse(BNode t) {
  if (t != NULL) {
    traverse(t.left);
    process t.element;
    traverse(t.right);
  }
}
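As a sketch, all three traversals on the expression tree above can be written in Python (the `Node` class and the tree layout are assumptions for illustration):

```python
class Node:
    def __init__(self, element, left=None, right=None):
        self.element, self.left, self.right = element, left, right

def preorder(t, out):
    if t is not None:
        out.append(t.element)        # root first
        preorder(t.left, out)
        preorder(t.right, out)

def inorder(t, out):
    if t is not None:
        inorder(t.left, out)
        out.append(t.element)        # root in the middle
        inorder(t.right, out)

def postorder(t, out):
    if t is not None:
        postorder(t.left, out)
        postorder(t.right, out)
        out.append(t.element)        # root last

# the expression tree from the slide: (2 * 4) + 5
expr = Node('+', Node('*', Node('2'), Node('4')), Node('5'))
for f in (preorder, inorder, postorder):
    out = []
    f(expr, out)
    print(' '.join(out))
# -> + * 2 4 5
# -> 2 * 4 + 5
# -> 2 4 * 5 +
```

Note that the in-order traversal of an expression tree reads back the infix expression.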
Binary Trees
• A binary tree is
  – a root
  – a left subtree (maybe empty)
  – a right subtree (maybe empty)
• Representation: each node holds its data plus a left pointer and a right pointer
(example tree rooted at A with nodes B–J)
Binary Tree: Representation
(figure: each of the nodes A–F is a record with a left pointer, the data, and a right pointer; A points to B and C, B points to D and E, and C points to F)
Binary Tree: Special Cases
(three example trees illustrating a complete tree, a perfect tree, and a full tree)
Binary Tree: Some Numbers!
For a binary tree of height h:
– max # of leaves: 2^h (for a perfect tree)
– max # of nodes: 2^(h+1) − 1 (for a perfect tree)
– min # of leaves: 1 (for a “list” tree)
– min # of nodes: h + 1 (for a “list” tree)
Average depth for N nodes?
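A quick sanity check of the two perfect-tree formulas, counting a perfect tree level by level (a small helper written for this check, not from the slides):

```python
def build_perfect(h):
    # returns (node_count, leaf_count) for a perfect tree of height h:
    # a perfect tree of height h is a root over two perfect trees
    # of height h-1
    if h == 0:
        return (1, 1)
    n, l = build_perfect(h - 1)
    return (2 * n + 1, 2 * l)

for h in range(6):
    n, l = build_perfect(h)
    assert n == 2 ** (h + 1) - 1   # max # of nodes
    assert l == 2 ** h             # max # of leaves
    # a "list" tree of height h has h + 1 nodes and 1 leaf
print("formulas check out for h = 0..5")
```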
Binary Search Tree Data Structure
• Structural property
  – each node has ≤ 2 children
  – result:
    • storage is small
    • operations are simple
    • average depth is small
• Order property
  – all keys in the left subtree are smaller than the root’s key
  – all keys in the right subtree are larger than the root’s key
  – result: easy to find any given key
• What must I know about what I store? Comparison and equality testing.
(example BST rooted at 8, containing 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14)
Example and Counter-Example
(two example trees: binary search trees or not? Remember that ALL children must obey the order property, not just the immediate ones)
Find in BST, Recursive
Node Find(Object key, Node root) {
  if (root == NULL)
    return NULL;
  if (key < root.key)
    return Find(key, root.left);
  else if (key > root.key)
    return Find(key, root.right);
  else
    return root;
}
(example BST rooted at 10, containing 2, 5, 7, 9, 15, 17, 20, 30)
Runtime: O(depth of the tree)
Find in BST, Iterative
Node Find(Object key, Node root) {
  while (root != NULL && root.key != key) {
    if (key < root.key)
      root = root.left;
    else
      root = root.right;
  }
  return root;
}
Runtime: O(depth of the tree)
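The iterative find above translates directly to Python; here is a minimal sketch (the `Node` class and the exact shape of the example tree are assumptions):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def find(key, root):
    # iterative BST find: walk down one child per step
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

# a BST with root 10 over the keys from the slide's example
t = Node(10,
         Node(5, Node(2), Node(9, Node(7))),
         Node(15, None, Node(20, Node(17), Node(30))))
print(find(9, t).key)   # -> 9
print(find(8, t))       # -> None
```

Each loop iteration descends one level, which is where the O(depth) runtime comes from.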
Insert in BST
Insert(13), Insert(8), Insert(31) into the example tree.
Insertions happen only at the leaves – easy!
Runtime: O(depth of the tree)
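Because insertion just runs a find and hangs the new node where the search falls off the tree, it fits in a few lines; a Python sketch (class and key choices are illustrative assumptions):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # returns the (possibly new) root; duplicates are ignored
    if root is None:
        return Node(key)          # new nodes only appear at leaves
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def inorder(t):
    return inorder(t.left) + [t.key] + inorder(t.right) if t else []

root = None
for k in [10, 5, 15, 9, 20, 13, 8, 31]:
    root = insert(root, k)
print(inorder(root))   # -> [5, 8, 9, 10, 13, 15, 20, 31]
```

The in-order traversal coming out sorted is a quick check that the order property held throughout.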
BuildTree for BST
• Suppose keys 1, 2, 3, 4, 5, 6, 7, 8, 9 are inserted into an initially empty BST. The runtime depends on the order!
  – in the given order: Θ(n²)
  – in reverse order: Θ(n²)
  – median first, then left median, right median, etc. (5, 3, 7, 2, 1, 6, 8, 9): better – Θ(n log n)
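The median-first order amounts to building the tree recursively from a sorted list; a sketch of that idea (the tuple representation is an assumption made for brevity):

```python
def build_balanced(keys):
    # keys must be sorted; returns nested (left, key, right) tuples
    if not keys:
        return None
    mid = len(keys) // 2          # median becomes the root
    return (build_balanced(keys[:mid]),
            keys[mid],
            build_balanced(keys[mid + 1:]))

def height(t):
    if t is None:
        return -1
    left, _, right = t
    return 1 + max(height(left), height(right))

t = build_balanced(list(range(1, 10)))   # keys 1..9
print(height(t))   # -> 3, versus height 8 when inserted in sorted order
```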
Bonus: FindMin/FindMax
• Find minimum: follow left children as far as possible
• Find maximum: follow right children as far as possible
Deletion in BST
Why might deletion be harder than insertion?
Lazy Deletion
Instead of physically deleting nodes, just mark them as deleted
+ simpler
+ physical deletions done in batches
+ some adds just flip the deleted flag
– extra memory for the “deleted” flag
– many lazy deletions = slow finds
– some operations may have to be modified (e.g., min and max)
Non-lazy Deletion
• Removing an item disrupts the tree structure.
• Basic idea: find the node that is to be removed. Then “fix” the tree so that it is still a binary search tree.
• Three cases:
  – node has no children (leaf node)
  – node has one child
  – node has two children
Non-lazy Deletion – The Leaf Case
Delete(17): simply remove the leaf.
Deletion – The One Child Case
Delete(15): splice the node out, connecting its parent to its only child.
Deletion – The Two Child Case
Delete(5): what can we replace 5 with?
A value guaranteed to be between the two subtrees!
– succ, from the right subtree
– pred, from the left subtree
Deletion – The Two Child Case
Idea: replace the deleted node with a value guaranteed to be between the two child subtrees
Options:
• succ from the right subtree: findMin(t.right)
• pred from the left subtree: findMax(t.left)
Now delete the original node containing succ or pred
• It’s a leaf or one-child case – easy!
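All three deletion cases fit together in one recursive routine; here is a Python sketch using the successor option (the `Node` class and the example keys are assumptions for illustration):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    elif root.left is None:          # zero or one child: splice out
        return root.right
    elif root.right is None:
        return root.left
    else:                            # two children: copy succ, delete it
        succ = root.right            # succ = findMin(t.right)
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root

def inorder(t):
    return inorder(t.left) + [t.key] + inorder(t.right) if t else []

root = None
for k in [5, 2, 10, 7, 9, 30, 20]:
    root = insert(root, k)
root = delete(root, 5)           # two-child case: 7 replaces 5
print(root.key, inorder(root))   # -> 7 [2, 7, 9, 10, 20, 30]
```

The second `delete` call on the successor always hits the leaf or one-child case, exactly as the slide notes.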
Finally…
7 replaces 5, and the original node containing 7 gets deleted.
Balanced BST
Observation
• BST: the shallower the better!
• For a BST with n nodes
  – Average height is O(log n)
  – Worst case height is O(n)
• Simple cases such as insert(1, 2, 3, ..., n) lead to the worst case scenario
Solution: Require a Balance Condition that
1. ensures depth is O(log n) – strong enough!
2. is easy to maintain – not too strong!
Potential Balance Conditions
1. Left and right subtrees of the root have equal number of nodes
   Too weak! (the two subtrees can still have wildly different heights)
2. Left and right subtrees of the root have equal height
   Too weak! (e.g., a left chain plus a right chain, with no other nodes)
3. Left and right subtrees of every node have equal number of nodes
   Too strong! Only perfect trees qualify
4. Left and right subtrees of every node have equal height
   Too strong! Only perfect trees qualify
CSE 326: Data Structures
AVL Trees
Balanced BST
Observation
• BST: the shallower the better!
• For a BST with n nodes
  – Average height is O(log n)
  – Worst case height is O(n)
• Simple cases such as insert(1, 2, 3, ..., n) lead to the worst case scenario
Solution: Require a Balance Condition that
1. ensures depth is O(log n) – strong enough!
2. is easy to maintain – not too strong!
Potential Balance Conditions
1. Left and right subtrees of the root have equal number of nodes
2. Left and right subtrees of the root have equal height
3. Left and right subtrees of every node have equal number of nodes
4. Left and right subtrees of every node have equal height
The AVL Balance Condition
AVL balance property: the left and right subtrees of every node have heights differing by at most 1
• Ensures small depth
  – We will prove this by showing that an AVL tree of height h must have exponentially many nodes
• Easy to maintain
  – Using single and double rotations
(AVL stands for its inventors, Adelson-Velskii and Landis)
The AVL Tree Data Structure
Structural properties
1. Binary tree property (0, 1, or 2 children)
2. Heights of the left and right subtrees of every node differ by at most 1
Result: the worst case depth of any node is O(log n)
Ordering property
– Same as for BST
(example: a tree rooted at 8 over the keys 2–15 – this is an AVL tree)
AVL trees or not?
(two example trees; check the height-balance condition at every node)
Proving the Shallowness Bound
Let S(h) be the min # of nodes in an AVL tree of height h
Claim: S(h) = S(h−1) + S(h−2) + 1
Solution of the recurrence: S(h) grows exponentially in h, like the Fibonacci numbers – so an AVL tree with n nodes has height O(log n)
(example: an AVL tree of height h = 4 with the min # of nodes, 12, built from the minimal trees of height h = 1, 2, 3, …)
Testing the Balance Property
NULLs have height −1
We need to be able to:
1. Track balance
2. Detect imbalance
3. Restore balance
An AVL Tree
(example tree in which each node stores its data, its height, and its children; the leaves have height 0 and the root has height 3)
Track the height at all times. Why?
AVL trees: find, insert
• AVL find: same as BST find.
• AVL insert: same as BST insert, except we may need to “fix” the AVL tree after inserting the new value.
AVL tree insert
Let x be the node where an imbalance occurs.
Four cases to consider. The insertion is in the
1. left subtree of the left child of x.
2. right subtree of the left child of x.
3. left subtree of the right child of x.
4. right subtree of the right child of x.
Idea: Cases 1 & 4 are solved by a single rotation. Cases 2 & 3 are solved by a double rotation.
Bad Case #1
Insert(6), Insert(3), Insert(1)
Where is the AVL property violated?
Fix: Apply Single Rotation
Single Rotation: rotate between x (the node where the AVL property is violated) and its child. Result: 3 becomes the root of the subtree, with 1 and 6 as its children.
Single rotation in general
(a is the unbalanced node, with left child b; b’s subtrees are X and Y, and a’s other subtree is Z, where X < b < Y < a < Z, and the insertion made X one taller than Y and Z. Rotating b above a restores balance.)
Height of the tree before? Height of the tree after? Effect on ancestors?
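A minimal sketch of the case-1 single rotation in Python, run on Bad Case #1 above (the `Node` class and function names are assumptions; heights are recomputed rather than stored, for brevity):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(t):
    return -1 if t is None else 1 + max(height(t.left), height(t.right))

def rotate_right(a):
    # single rotation for case 1: b (= a.left) becomes the subtree root,
    # b keeps X, and a adopts Y, preserving X < b < Y < a < Z
    b = a.left
    a.left = b.right      # subtree Y moves across
    b.right = a
    return b

# Bad Case #1: insert 6, 3, 1 -> a left chain, imbalanced at 6
t = Node(6, Node(3, Node(1)))
t = rotate_right(t)
print(t.key, height(t))   # -> 3 1
```

After the rotation the subtree has the same height it had before the insert, which is why no ancestor needs fixing.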
Bad Case #2
Insert(1), Insert(6), Insert(3)
Fix: Apply Double Rotation
Intuition: 3 must become the root of the subtree, with 1 and 6 as its children.
Double Rotation:
1. Rotate between x’s child and grandchild
2. Rotate between x and x’s new child
(x is the node where the AVL property is violated)
Double rotation in general
(a is the unbalanced node, with child b; b’s subtrees are W and the subtree rooted at c, whose subtrees are X and Y; a’s other subtree is Z, where W < b < X < c < Y < a < Z. After the double rotation, c becomes the root of the subtree with b and a as its children.)
Height of the tree before? Height of the tree after? Effect on ancestors?
Double rotation, step 1
(rotate between x’s child and grandchild)
Double rotation, step 2
(rotate between x and x’s new child)
(worked example on a tree with keys 3, 4, 5, 6, 8, 10, 15, 16, 17)
Imbalance at node x
Single Rotation: rotate between x and its child
Double Rotation: rotate between x’s child and grandchild, then rotate between x and x’s new child
Single and Double Rotations:
(example tree with keys 2, 5, 7, 9, 11, 13, 30)
Inserting what integer values would cause the tree to need a:
1. single rotation?
2. double rotation?
3. no rotation?
Insertion into AVL tree
1. Find the spot for the new key
2. Hang the new node there with this key
3. Search back up the path for imbalance
4. If there is an imbalance:
   case #1 (zig-zig): perform a single rotation and exit
   case #2 (zig-zag): perform a double rotation and exit
Both rotations restore the subtree to its pre-insert height, hence only one rotation is sufficient!
Easy Insert
Insert(3) into the example tree.
Unbalanced? No.
Hard Insert (Bad Case #1)
Insert(33) into the example tree.
Unbalanced? Yes, at 15 – and the path to the new node is zig-zig.
How to fix? Single rotate.
Single Rotation
(rotate between 15 and 20: 20 takes 15’s place, with 15 (children 12 and 17) as its left child and 30 (new child 33) as its right child; the tree is balanced again)
Hard Insert (Bad Case #2)
Insert(18) into the example tree.
Unbalanced? Yes, at 15 – and the path to the new node is zig-zag.
How to fix? Double rotate.
Single Rotation (oops!)
(a single rotation between 15 and 20 does not fix the imbalance: the subtree containing the new 18 stays on the heavy side)
Double Rotation (Step #1)
(first rotate between x’s child 20 and grandchild 17, pulling 17 up)
Still unbalanced. But now it looks like a zig-zig tree!
Double Rotation (Step #2)
(then rotate between x = 15 and its new child 17; 17 takes 15’s place and the tree is balanced again)
Insert into an AVL tree: 5, 8, 9, 4, 2, 7, 3, 1
CSE 326: Data Structures
Splay Trees
AVL Trees Revisited
• Balance condition: left and right subtrees of every node have heights differing by at most 1
  – Strong enough: worst case depth is O(log n)
  – Easy to maintain: one single or double rotation
• Guaranteed O(log n) running time for
  – Find?
  – Insert?
  – Delete?
  – buildTree? (buildTree is Θ(n log n))
Single and Double Rotations
(recap of the single- and double-rotation pictures: subtrees X, Y, Z around nodes a and b, and subtrees W, X, Y, Z around a, b, c)
AVL Trees Revisited
• What extra info did we maintain in each node?
• Where were rotations performed?
• How did we locate this node?
Other Possibilities?
• Could use different balance conditions, different ways to maintain balance, different guarantees on running time, …
• Why aren’t AVL trees perfect? (They need extra info per node, complex logic to detect imbalance, and a recursive bottom-up implementation.)
• Many other balanced BST data structures
  – Red-Black trees
  – AA trees
  – Splay Trees
  – 2-3 Trees
  – B-Trees
  – …
Splay Trees
• Blind adjusting version of AVL trees
  – Why worry about balances? Just rotate anyway!
• Amortized time per operation is O(log n)
• Worst case time per operation is O(n)
  – But guaranteed to happen rarely
Insert/Find always rotate the node to the root!
SAT/GRE analogy question: AVL is to Splay trees as Leftist heap is to Skew heap
Recall: Amortized Complexity
If a sequence of M operations takes O(M f(n)) time, we say the amortized runtime is O(f(n)).
Amortized complexity is a worst-case guarantee over sequences of operations.
• Worst case time per operation can still be large, say O(n)
• Worst case time for any sequence of M operations is O(M f(n))
• Average time per operation for any sequence is O(f(n))
Recall: Amortized Complexity
• Is an amortized guarantee any weaker than worst case? Yes – it holds only for sequences
• Is an amortized guarantee any stronger than average case? Yes – it guarantees there are no bad sequences
• Is an average case guarantee good enough in practice? No – adversarial input, a bad day, …
• Is an amortized guarantee good enough in practice? Yes – again, no bad sequences
The Splay Tree Idea
If you’re forced to make a really deep access: since you’re down there anyway, fix up a lot of deep nodes!
(example: a deep chain of nodes)
Find/Insert in Splay Trees
1. Find or insert a node k
2. Splay k to the root using: zig-zag, zig-zig, or plain old zig rotation
Why could this be good??
1. Helps the new root, k
   o Great if k is accessed again
2. And helps many others!
   o Great if many others on the path are accessed
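The splay step can be sketched with a standard recursive implementation; this is one common formulation, not the course's own code, and the `Node` class is an assumption. The zig-zig, zig-zag, and zig cases are marked in comments:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(p):
    k = p.left
    p.left, k.right = k.right, p
    return k

def rotate_left(p):
    k = p.right
    p.right, k.left = k.left, p
    return k

def splay(root, key):
    # brings key (or the last node on its search path) to the root
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                       # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                     # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)   # zig
    else:
        if root.right is None:
            return root
        if key > root.right.key:                      # zig-zig
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:                    # zig-zag
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)   # zig

def inorder(t):
    return inorder(t.left) + [t.key] + inorder(t.right) if t else []

# a right chain 1-2-3-4-5-6; splaying 6 brings it to the root and
# roughly halves the depth of the other nodes along the path
chain = Node(1)
cur = chain
for k in range(2, 7):
    cur.right = Node(k)
    cur = cur.right
chain = splay(chain, 6)
print(chain.key, inorder(chain))   # -> 6 [1, 2, 3, 4, 5, 6]
```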
Splaying node k to the root: need to be careful!
One option (that we won’t use) is to repeatedly use AVL single rotations until k becomes the root (see Section 4.5.1 for details).
(figure: k is rotated past p, q, r, and s in turn)
Splaying node k to the root: need to be careful!
What’s bad about this process? r is pushed almost as low as k was.
Bad sequence: find(k), find(r), find(…), …
Splay: Zig-Zag
(k is the right child of p, which is the left child of g, or the mirror image)
Just like an AVL double rotation: k moves up to replace g, with p and g as its children.
Some nodes on the path are helped and some are hurt. Which nodes improve depth? k and its original children.
Splay: Zig-Zig
(k is the left child of p, which is the left child of g, or the mirror image)
Is this just two AVL single rotations in a row? Not quite – we rotate g and p first, then p and k.
Why does this help? The same number of nodes are helped as hurt, but later rotations help the whole subtree.
Special Case for Root: Zig
(k is a child of the root p: a single rotation makes k the root)
Relative depth of p, Y, Z? Down 1 level. Relative depth of everyone else? Much better.
Why not drop zig-zig and just zig all the way? Zig only helps one child!
Splaying Example: Find(6)
(the tree is a deep chain ending at 6 – think of it as created by inserting 6, 5, 4, 3, 2, 1, each insert taking constant time and leaving a LOT of deep nodes)
First step: a zig-zig brings 6 up two levels.
Still Splaying 6
(another zig-zig brings 6 up two more levels)
Finally…
(a zig makes 6 the root)
Another Splay: Find(4)
(a zig-zag brings 4 up, with 3 and 5 as its children)
Example Splayed Out
(one more zig-zag finishes splaying 4 to the root)
But Wait…
What happened here? Didn’t two find operations take linear time instead of logarithmic? What about the amortized O(log n) guarantee?
That still holds, though we must take into account the previous steps used to create this tree. In fact, a splay tree, by construction, will never look like the example we started with.
Why Splaying Helps
• If a node n on the access path is at depth d before the splay, it’s at about depth d/2 after the splay
• Overall, nodes which are low on the access path tend to move closer to the root
• Splaying gets amortized O(log n) performance. (Maybe not now, but soon, and for the rest of the operations.)
Practical Benefit of Splaying
• No heights to maintain, no imbalance to check for
  – Less storage per node, easier to code
• Data accessed once is often soon accessed again
  – Splaying does implicit caching by bringing it to the root
Splay Operations: Find
• Find the node in the normal BST manner
• Splay the node to the root
  – if the node is not found, splay what would have been its parent
What if we didn’t splay? The amortized guarantee fails!
Bad sequence: find(leaf k), find(k), find(k), …
Splay Operations: Insert
• Insert the node in the normal BST manner
• Splay the node to the root
What if we didn’t splay? The amortized guarantee fails!
Bad sequence: insert(k), find(k), find(k), …
Splay Operations: Remove
find(k) splays k to the root; deleting k then leaves two subtrees, L (keys < k) and R (keys > k). Now what?
Everything else splayed, so we’d better do that for remove too.
Join
Join(L, R): given two trees such that (stuff in L) < (stuff in R), merge them:
splay on the maximum element in L, then attach R as the new root’s right subtree.
Similar to BST delete – find max = find the element with no right child.
Does this work to join any two trees? No – we need L < R.
Delete Example
Delete(4): find(4) splays 4 to the root; removing it leaves L = {1, 2} and R = {6, 7, 9}. Find the max of L (2) and splay it to L’s root; then attach R as its right subtree.
Splay Tree Summary
• All operations are in amortized O(log n) time
• Splaying can be done top-down; this may be better because:
  – only one pass
  – no recursion or parent pointers necessary
  – we didn’t cover top-down in class
• Splay trees are very effective search trees
  – Relatively simple
  – No extra fields required
  – Excellent locality properties: frequently accessed keys are cheap to find, and nodes that are never accessed tend to drop to the bottom
(Like what? Skew heaps! – and no need to wait)
Splay E
(two worked examples of splaying E to the root of a larger tree with nodes A–I)
CSE 326: Data Structures
B-Trees
(Weiss Sec. 4.7)
Memory hierarchy (time to access, conservative):
• CPU (has registers): 1 ns per instruction
• Cache (SRAM, 8KB–4MB): 2–10 ns
• Main Memory (DRAM, up to 10GB): 40–100 ns
• Disk (mechanical device, many GB): a few milliseconds (5–10 million ns)
Summary: accesses to memory are painful, especially to the slower levels.
Disk access time = seek time + transfer time. Seek time is expensive – once you seek, transfer a lot.
Trees so far (Dictionary ADT: Find, Insert, Remove)
• BST: up to 2 children, height can be N−1; operations worst case O(N) – about 10 million disk accesses for the book’s example
• AVL: up to 2 children, height is O(log N); operations worst case O(log N) – about 23 disk accesses
• Splay: up to 2 children, height can be N−1; simpler implementation than AVL; operations worst case O(N), amortized O(log N)
(Picture the tree spread across memory: each pointer followed can be an access to disk.)
Example from the book: N = 10 million drivers in WA state.
We want to do even better – make the number of disk accesses smaller (3 or 4).
M = 5: 10 accesses; M = 10: 7 accesses; M = 100: 3–4 accesses
AVL trees
Suppose we have 100 million items (100,000,000):
• Depth of the AVL tree: log_2 100,000,000 ≈ 26.6
• Number of disk accesses: one per level – but with a 128-ary tree, log_128 100,000,000 ≈ 3.8. Wow!
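The two depths above are a one-line computation; a quick check in Python:

```python
import math

n = 100_000_000
avl_depth = math.log2(n)          # binary tree: one key per node
btree_depth = math.log(n, 128)    # 128-ary tree: one disk access per level
print(round(avl_depth, 1), round(btree_depth, 1))   # -> 26.6 3.8
```

The ratio is exactly log_2 128 = 7: every level of a 128-ary tree does the work of 7 levels of a binary tree.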
M-ary Search Tree
• Maximum branching factor of M (like we did with d-heaps)
• A complete tree has height Θ(log_M n)
• # disk accesses for find: best O(log_M n); worst O(n) – the tree might be unbalanced, or might just be binary (we only said the max branching factor is M)
• Runtime of find: with a binary search at each level, best O(log_M n · log_2 M) = O(log_2 n); worst O(n)
Solution: B-Trees
• specialized M-ary search trees
• Each node has (up to) M−1 keys:
  – the subtree between two keys x and y contains leaves with values v such that x ≤ v < y
• Pick the branching factor M such that each node takes one full {page, block} of memory
(example node with M = 7 and keys 3, 7, 12, 21: the subtrees hold x<3, 3≤x<7, 7≤x<12, 12≤x<21, 21≤x)
This keeps things more balanced.
B-Trees: What makes them disk-friendly?
1. Many keys stored in a node
   • All brought to memory/cache in one access!
2. Internal nodes contain only keys; only leaf nodes contain keys and actual data
   • The tree structure can be loaded into memory irrespective of data object size
   • Data actually resides on disk
B-Tree: Example
B-Tree with M = 4 (# pointers in an internal node) and L = 4 (# data items in a leaf)
(root keys 10 and 40; second-level keys 3, 15, 20, 30 and 50; the leaves hold sorted runs of data items between 1 and 70)
Note: all leaves are at the same depth! (Data objects in the leaves are ignored in the slides.)
N = 24, depth = 2. What depth with an AVL tree? A tree of height h holds 2^h to 2^(h+1)−1 nodes, so depth 4 to 5.
B-Tree Properties ‡
– Data is stored at the leaves
– All leaves are at the same depth and contain between ⎡L/2⎤ and L data items
– Internal nodes store up to M−1 keys
– Internal nodes have between ⎡M/2⎤ and M children
– Root (special case) has between 2 and M children (or the root could be a leaf)
‡These are technically B+-Trees
Leaves are at least ½ full, kind of like a complete tree. The tree is about log n deep – actually log_{M/2} n.
Example, Again
B-Tree with M = 4 and L = 4: 24 entries, but only 2 deep – a BST would be 4. Mostly full; some inactive entries. (Only showing keys, but leaves also have data!)
Exercises: Find(33), Find(2), Find(16); Insert(18), Insert(16), Insert(71); Insert(80) (split into 2 leaves, each with the min # of values, and put 71 at the top).
Building a B-Tree (M = 3, L = 2)
Start with the empty B-Tree. Insert(3): a single leaf [3]. Insert(14): the leaf becomes [3 14]. Now, Insert(1)?
Splitting the Root
Insert(1) puts too many keys in the leaf! How to solve? Definitely need to split. Then what? Split the leaf into [1 3] and [14], and create a new root with key 14. This is how B-Trees grow deeper.
Overflowing Leaves (M = 3, L = 2)
Insert(59): the leaf [14] becomes [14 59] – no overflow. Insert(26): too many keys in a leaf! So split the leaf into [14 26] and [59], and add a new child to the parent, whose keys become 14 and 59.
Same problem as before – must split. Here the parent had room for the new child. What if it hadn’t?
Propagating Splits (M = 3, L = 2)
Insert(5): the leaf splits, but there is no space in the parent! So split the node as well, add the new child, and create a new root. Splits can propagate all the way up to the root.
Insertion Algorithm
1. Insert the key in its leaf
2. If the leaf ends up with L+1 items, overflow!
   – Split the leaf into two nodes:
     • original with ⎡(L+1)/2⎤ items
     • new one with ⎣(L+1)/2⎦ items
   – Add the new child to the parent
   – If the parent ends up with M+1 items, overflow!
3. If an internal node ends up with M+1 items, overflow!
   – Split the node into two nodes:
     • original with ⎡(M+1)/2⎤ items
     • new one with ⎣(M+1)/2⎦ items
   – Add the new child to the parent
   – If the parent ends up with M+1 items, overflow!
4. Split an overflowed root in two and hang the new nodes under a new root
   – This makes the tree deeper!
Note that new leaves/internal nodes are guaranteed to have enough items after a split. Why? Both halves of a split have at least ⎡L/2⎤ (resp. ⎡M/2⎤) items.
After More Routine Inserts (M = 3, L = 2)
Insert(89), then Insert(79). How about deletion?
Deletion (M = 3, L = 2)
Delete(59):
1. Delete the item from its leaf
2. Update the keys of ancestors if necessary
What could go wrong? The deletion might cause a leaf to drop below ⎡L/2⎤ items.
Deletion and Adoption (M = 3, L = 2)
Delete(5)? Now a leaf has too few keys! So, borrow from a sibling (adoption): move 3 over and update the parent key. (This is easiest with sibling pointers, and parent pointers.)
What if there’s not enough to borrow from?
Does Adoption Always Work?
• What if the sibling doesn’t have enough for you to borrow from? e.g. you have ⎡L/2⎤−1 items and the sibling has exactly ⎡L/2⎤?
Merge them into one node!
Deletion and Merging (M = 3, L = 2)
Delete(3)? A leaf has too few keys, and no sibling with surplus! So merge: delete the leaf and fold its remaining items into the neighbor. But now an internal node has too few subtrees!
(What if L were 100? The deleted node would have 49 keys left over – give them to the neighbor: since the neighbor has no surplus, there must be room.)
Deletion with Propagation (More Adoption) (M = 3, L = 2)
The internal node with too few subtrees adopts a neighbor’s subtree – same as before: borrow from a rich neighbor! Then Delete(1), adopting from a sibling.
A Bit More Adoption (M = 3, L = 2)
Delete(26) (this is just to set up for the next slide).
Pulling out the Root (M = 3, L = 2)
A leaf has too few keys, and no sibling with surplus! So delete the leaf and merge. Then a node has too few subtrees and no neighbor with surplus! Delete the node and merge. But now the root has just one subtree!
Pulling out the Root (continued) (M = 3, L = 2)
The root has just one subtree! Simply make the one child the new root!
Deletion Algorithm
1. Remove the key from its leaf
2. If the leaf ends up with fewer than ⎡L/2⎤ items, underflow!
   – Adopt data from a sibling; update the parent
   – If adopting won’t work, delete the node and merge with a neighbor
   – If the parent ends up with fewer than ⎡M/2⎤ items, underflow!
3. If an internal node ends up with fewer than ⎡M/2⎤ items, underflow!
   – Adopt from a neighbor; update the parent
   – If adoption won’t work, merge with a neighbor
   – If the parent ends up with fewer than ⎡M/2⎤ items, underflow!
4. If the root ends up with only one child, make the child the new root of the tree
   – This reduces the height of the tree!
Thinking about B-Trees
• B-Tree insertion can cause (expensive) splitting and propagation
• B-Tree deletion can cause (cheap) adoption or (expensive) deletion, merging and propagation
• Propagation is rare if M and L are large. Why? Only 1/L inserts cause a split, and only 1/M of those go up at all!
• If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items (height 5: over 2 billion)
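The 30-million figure follows from the minimum-occupancy properties; a sketch of the arithmetic (the `min_items` helper is written here for illustration):

```python
import math

def min_items(M, L, height):
    # minimum # of data items in a B-tree of the given height:
    # the root has >= 2 children, every other internal node has
    # >= ceil(M/2) children, and every leaf holds >= ceil(L/2) items
    half = math.ceil(M / 2)
    leaves = 2 * half ** (height - 1)
    return leaves * math.ceil(L / 2)

print(min_items(128, 128, 4))   # -> 33554432   (over 30 million)
print(min_items(128, 128, 5))   # -> 2147483648 (over 2 billion)
```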
Tree Names You Might Encounter
FYI:
– B-Trees with M = 3, L = x are called 2-3 trees
  • Nodes can have 2 or 3 children
– B-Trees with M = 4, L = x are called 2-3-4 trees
  • Nodes can have 2, 3, or 4 children
K-D Trees and Quad Trees
Range Queries
• Think of a range query.
  – “Give me all customers aged 45-55.”
  – “Give me all accounts worth $5m to $15m.”
• On a single key, this can be done in O(log n + k) time with a sorted structure, where k is the number of results.
• What if we want both:
  – “Give me all customers aged 45-55 with accounts worth between $5m and $15m.”
Geometric Data Structures
• Organization of points, lines, planes, etc. in support of faster processing
• Applications
  – Map information
  – Graphics: computing object intersections
  – Data compression: nearest neighbor search
  – Decision trees: machine learning
k-d Trees
• Jon Bentley, 1975, while an undergraduate
• Tree used to store spatial data.
  – Nearest neighbor search.
  – Range queries.
  – Fast look-up.
• k-d trees are guaranteed log_2 n depth where n is the number of points in the set.
  – Traditionally, k-d trees store points in d-dimensional space, which are equivalent to vectors in d-dimensional space.
Range Queries
(figures: the same point set a–i queried with a rectangular query and with a circular query)
Nearest Neighbor Search
(figure: for the query point shown, the nearest neighbor is e)
k-d Tree Construction
• If there is just one point, form a leaf with that point.
• Otherwise, divide the points in half by a line perpendicular to one of the axes.
• Recursively construct k-d trees for the two sets of points.
• Division strategies
  – divide points perpendicular to the axis with the widest spread.
  – divide in a round-robin fashion (the book does it this way)
k-d Tree Construction (example)
(the point set a–i is divided perpendicular to the widest spread; splits s1–s8 alternate between x and y, and each leaf of the resulting tree – a, b, d, e, g, c, f, h, i – corresponds to one k-d tree cell)
2-d Tree Decomposition
(figure: the plane recursively partitioned by splits 1, 2, 3)
k-d Tree Splitting
ab
c
g h
ed
i a d g b e i c h fa c b d f e h g i
xy
sorted points in each dimension
f
• max spread is the max offx -ax and iy - ay.
• In the selected dimension themiddle point in the list splits the data.
• To build the sorted lists for the other dimensions scan the sortedlist adding each point to one of two sorted lists.
1 2 3 4 5 6 7 8 9
x
y
k-d Tree Splitting
ab
c
g h
ed
i a d g b e i c h fa c b d f e h g i
xy
sorted points in each dimension
f
1 2 3 4 5 6 7 8 9
a b c d e f g h i0 0 1 0 0 1 0 1 1
indicator for each set
scan sorted points in y dimensionand add to correct set
a b d e g c f h iy
k-d Tree Construction Complexity
• First sort the points in each dimension.
  – O(dn log n) time and dn storage.
  – These are stored in A[1..d, 1..n]
• Finding the widest spread and equally dividing into two subsets can be done in O(dn) time.
• We have the recurrence
  – T(n,d) ≤ 2T(n/2,d) + O(dn)
• Constructing the k-d tree can be done in O(dn log n) time and dn storage.
Node Structure for k-d Trees
• A node has 5 fields
  – axis (splitting axis)
  – value (splitting value)
  – left (left subtree)
  – right (right subtree)
  – point (holds a point if the left and right children are null)
Rectangular Range Query
• Recursively search every cell that intersects the rectangle.
(worked example on the k-d tree built earlier: the search visits only the subtrees whose cells intersect the query rectangle)
Rectangular Range Query
print_range(xlow, xhigh, ylow, yhigh: integer, root: node pointer) {
  if root = null then return;
  if root.left = null then {            // leaf: root holds a point
    if xlow ≤ root.point.x and root.point.x ≤ xhigh
       and ylow ≤ root.point.y and root.point.y ≤ yhigh
    then print(root);
  } else {                              // recurse into each child whose
                                        // cell can intersect the query
    if (root.axis = “x” and xlow ≤ root.value) or
       (root.axis = “y” and ylow ≤ root.value) then
      print_range(xlow, xhigh, ylow, yhigh, root.left);
    if (root.axis = “x” and xhigh ≥ root.value) or
       (root.axis = “y” and yhigh ≥ root.value) then
      print_range(xlow, xhigh, ylow, yhigh, root.right);
  }
}
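A runnable Python sketch of the same idea, with a round-robin k-d tree builder included so the example is self-contained (the `KDNode` class, the builder, and the sample points are all assumptions for illustration):

```python
class KDNode:
    def __init__(self, axis=None, value=None, left=None, right=None, point=None):
        self.axis, self.value = axis, value      # splitting axis / value
        self.left, self.right = left, right
        self.point = point                       # set only at leaves

def build_kd(points, axis=0):
    # round-robin splitting on the median, as the book does
    if len(points) == 1:
        return KDNode(point=points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(axis=axis, value=points[mid][axis],
                  left=build_kd(points[:mid], 1 - axis),
                  right=build_kd(points[mid:], 1 - axis))

def range_query(t, lo, hi, out):
    # visit every subtree whose cell can intersect the query rectangle
    if t.point is not None:                      # leaf
        if all(lo[d] <= t.point[d] <= hi[d] for d in (0, 1)):
            out.append(t.point)
        return
    if lo[t.axis] <= t.value:
        range_query(t.left, lo, hi, out)
    if hi[t.axis] >= t.value:
        range_query(t.right, lo, hi, out)

pts = [(1, 5), (2, 2), (4, 8), (5, 4), (7, 7), (8, 1), (9, 6)]
tree = build_kd(pts)
out = []
range_query(tree, (3, 3), (8, 8), out)
print(sorted(out))   # -> [(4, 8), (5, 4), (7, 7)]
```

The key point is the two independent `if` tests: a query rectangle that straddles the split must descend into both children.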
k-d Tree Nearest Neighbor Search
• Search recursively to find the point in the same cell as the query.
• On the way back up, search each subtree in which a closer point than the one you already know about might be found.
k-d Tree NNS (example)
(figures: the search descends to the cell containing the query point, then on the way back up visits each subtree whose cell intersects the ball of radius w around the query point; the nearest neighbor found is e)
Notes on k-d NNS
• Has been shown to run in O(log n) average time per search in a reasonable model.
• Storage for the k-d tree is O(n).
• Preprocessing time is O(n log n), assuming d is a constant.
Worst-Case for Nearest Neighbor Search
(figure: a query point placed so that half of the points are visited)
• Half of the points visited for a query
• Worst case O(n)
• But: on average (and in practice) nearest neighbor queries are O(log N)
Quad Trees
• Space partitioning: each split divides a square region into four quadrants
(figure: point set a–g and the resulting quad tree)
A Bad Case
(figure: two points a and b very close together force many levels of subdivision)
Notes on Quad Trees
• Number of nodes is O(n(1 + log(Δ/n))), where n is the number of points and Δ is the ratio of the width (or height) of the key space to the smallest distance between two points
• Height of the tree is O(log n + log Δ)
K-D vs Quad
• k-d Trees
  – Density balanced trees
  – Height of the tree is O(log n) with batch insertion
  – Good choice for high dimension
  – Supports insert, find, nearest neighbor, range queries
• Quad Trees
  – Space partitioning tree
  – May not be balanced
  – Not a good choice for high dimension
  – Supports insert, delete, find, nearest neighbor, range queries
Geometric Data Structures
• Geometric data structures are common.
• The k-d tree is one of the simplest.
  – Nearest neighbor search
  – Range queries
• Other data structures are used for
  – 3-d graphics models
  – Physical simulations