Trees & Hierarchies in SQL
Joe Celko
Copyright 2009
Trees in SQL
Trees are graph structures used to represent
– Hierarchies
– Parts explosions
– Organizational charts
Three major methods in SQL
– Adjacency list model
– Nested Sets Model
– Path enumeration
Trees in SQL -2
Trees are not hierarchies
– Hierarchies have subordination
– Kill your captain, you still have to take orders from your general
– Break an edge in a tree, and you have two or more disjoint trees.
This means an adjacency list model is a tree, but not a hierarchy
– It is also not normalized!
Trees as Nested Sets
[Figure: nested sets diagram of a tree with the nodes root, A0, A1, A2, and B0]
Graphs as Tables
Nodes and edges are not the same kind of things
– Organizational chart & Personnel file
You should use separate tables for the structure and the elements
– You can put more than one structure table on the same elements
– Very useful for data mining and reporting
Adjacency List Model
CREATE TABLE AdjTree
(node_name CHAR(2) NOT NULL,
parent_name CHAR(2), -- null is root
<< other data >>);
The structure and node data should be in different tables
– Nobody does this in practice
– Look at Oracle's Scott/Tiger sample
Adjacency List Model -2
Programmers do not add constraints:
– Trees have no cycles
– Number of edges = number of nodes - 1
– Look up others in any book on graph theory
The result is that adjacency list models get corrupted
– If you use stored procedures, they have to check every row for cycles
– Because the table has to be accessed procedurally, the corruption is not discovered
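The two constraints named above can be checked with queries. What follows is a minimal sketch using Python's sqlite3 module, not a production constraint. The sample rows, the corrupt X/Y pair, and the names Walk, start_node, and ancestor are assumptions made up for illustration; AdjTree and its columns follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AdjTree (node_name TEXT PRIMARY KEY, parent_name TEXT)")
conn.executemany(
    "INSERT INTO AdjTree VALUES (?, ?)",
    [("root", None), ("A0", "root"), ("A1", "A0"), ("A2", "A0"), ("B0", "root"),
     ("X", "Y"), ("Y", "X")],  # hypothetical corrupt rows: X and Y form a cycle
)

# Edge rule: every row with a parent is one edge, so edges should equal nodes - 1.
nodes, edges = conn.execute(
    "SELECT COUNT(*), COUNT(parent_name) FROM AdjTree").fetchone()
edge_rule_holds = (edges == nodes - 1)

# Cycle check: walk upward from every node. UNION (not UNION ALL) drops
# repeated rows, so the walk stops even when the data contains a cycle.
cycle_nodes = [row[0] for row in conn.execute("""
    WITH RECURSIVE Walk (start_node, ancestor) AS
    (SELECT node_name, parent_name
       FROM AdjTree
      WHERE parent_name IS NOT NULL
     UNION
     SELECT W.start_node, A.parent_name
       FROM Walk AS W, AdjTree AS A
      WHERE A.node_name = W.ancestor
        AND A.parent_name IS NOT NULL)
    SELECT DISTINCT start_node
      FROM Walk
     WHERE start_node = ancestor
     ORDER BY start_node""")]

print(edge_rule_holds, cycle_nodes)
```

On this sample the edge rule still holds (7 nodes, 6 edges) even though X and Y are each other's parent, so only the cycle check reports the corruption.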
Adjacency List Model -3
The adjacency list model is not normalized
– Change 'B0' to 'Bx': a single fact is changed
– You must change it in his node: one place
– You must also change it in his subordinates: many places!!
Try to write triggers or DRI actions for the adjacency list model
– The longest cycle is the number of nodes in the entire tree
Adjacency List Model -4
Many SQL products do not support table level constraints. Example:
CHECK ((SELECT COUNT(*) FROM AdjTree) - 1
= (SELECT COUNT(*)
FROM ((SELECT node_name FROM AdjTree)
UNION
(SELECT parent_name FROM AdjTree)) AS Names))
Nested Sets as Numbers
Basic nested sets model for tree structure
– Does not show the nodes table
– Does not show all constraints
CREATE TABLE Tree
(node_id INTEGER NOT NULL
REFERENCES Nodes(node_id),
lft INTEGER NOT NULL,
rgt INTEGER NOT NULL);
Problems with Adjacency list
You have to use cursors or self-joins to traverse the tree
Cursors are not a table -- their order has meaning -- Closure violation!
Cursors take MUCH longer than queries
Ten-level self-joins are worse than cursors
Problems with Path Enumeration
Path can get long in a deep tree
Great for searching down the tree, but not up the tree
– SELECT * FROM Tree WHERE path LIKE 'Root,%';
– SELECT * FROM Tree WHERE path LIKE '%,B0';
Inserting and deleting nodes is complicated
– Requires string manipulation to change all the paths beneath the insertion or deletion point
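A small sketch of the two search directions, run through Python's sqlite3 module. The table name PathTree and the sample paths are assumptions for illustration, using the same five-node tree as the other slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PathTree (node TEXT PRIMARY KEY, path TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO PathTree VALUES (?, ?)",
    [("Root", "Root"), ("A0", "Root,A0"), ("A1", "Root,A0,A1"),
     ("A2", "Root,A0,A2"), ("B0", "Root,B0")],
)

# Down the tree: everything beneath A0 has A0's path as a prefix.
subordinates = [r[0] for r in conn.execute(
    "SELECT node FROM PathTree WHERE path LIKE 'Root,A0,%' ORDER BY path")]

# Up the tree: a superior's path is a prefix of A1's path, so the
# pattern has to be built from the path stored in every candidate row.
superiors = [r[0] for r in conn.execute(
    "SELECT node FROM PathTree WHERE 'Root,A0,A1' LIKE path || ',%' ORDER BY path")]

print(subordinates, superiors)
```

The downward search compares a stored column against a fixed prefix; the upward search builds a different pattern per row, which is one way to see why it is the awkward direction.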
Problems with Nested Sets
Not good for traversals
– Too set oriented
– Great for hierarchical summaries
Inserting and deleting nodes is complicated
– Requires stored procedures to renumber the nodes
– Not as bad as you think!
The rows are VERY short, so a lot of them fit onto a page
Math is simple
Tree Aggregates
Give me the total cost for all subtrees
– (root, 13.75) -- sum of every node in tree
– (A0, 7.25) -- sum of “A0” subtree
– (A1, 2.00)
– (A2, 3.50)
Dropping A2 would reduce all superior rows by 3.50, but would not change A1
Find Root of Tree
SELECT * FROM Tree WHERE lft = 1;
It helps to have an index on the lft column
The rgt value of the root will be twice the number of nodes in the tree.
General rule: The number of nodes in any subtree is ((rgt - lft) + 1) / 2
Find All Leaf Nodes
SELECT * FROM Tree WHERE lft = rgt - 1;
An index on lft will help
A covering index on (lft, rgt) is even better
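The root, leaf, and subtree-size rules can be tried on a small tree. This is a sketch using Python's sqlite3 module; the lft/rgt numbers are an assumed numbering of the five-node sample tree (root over A0 and B0, A0 over A1 and A2), and the column name node follows the queries on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

# Root: the only row whose lft is 1.
root = conn.execute("SELECT node FROM Tree WHERE lft = 1").fetchone()[0]

# Leaves: nothing fits between lft and rgt.
leaves = [r[0] for r in conn.execute(
    "SELECT node FROM Tree WHERE lft = rgt - 1 ORDER BY lft")]

# Subtree size, the node itself included: ((rgt - lft) + 1) / 2.
sizes = dict(conn.execute("SELECT node, ((rgt - lft) + 1) / 2 FROM Tree"))

print(root, leaves, sizes)
```

The root's rgt is 10 here, twice the five nodes in the tree.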
Find Superiors of X
SELECT Sup.*
FROM Tree AS T1, Tree AS Sup
WHERE T1.node = 'X'
AND T1.lft BETWEEN Sup.lft AND Sup.rgt;
This is the most important trick in this method
The BETWEEN predicates preserve subordination in the hierarchy
One query for any depth tree
Find Subordinates of X
SELECT Sub.*
FROM Tree AS T1, Tree AS Sub
WHERE T1.node = 'X'
AND Sub.lft BETWEEN T1.lft AND T1.rgt;
This is the same pattern as finding superiors
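Both BETWEEN queries can be run side by side. A minimal sketch with Python's sqlite3 module; the sample numbering and the helper names superiors_of and subordinates_of are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

def superiors_of(node):
    # Every set that contains the node's lft value, the node itself included.
    return [r[0] for r in conn.execute("""
        SELECT Sup.node
          FROM Tree AS T1, Tree AS Sup
         WHERE T1.node = ?
           AND T1.lft BETWEEN Sup.lft AND Sup.rgt
         ORDER BY Sup.lft""", (node,))]

def subordinates_of(node):
    # Every set whose lft value falls inside the node's range.
    return [r[0] for r in conn.execute("""
        SELECT Sub.node
          FROM Tree AS T1, Tree AS Sub
         WHERE T1.node = ?
           AND Sub.lft BETWEEN T1.lft AND T1.rgt
         ORDER BY Sub.lft""", (node,))]

print(superiors_of("A1"), subordinates_of("A0"))
```

BETWEEN is inclusive, so each result contains the starting node; add a predicate such as Sup.node <> T1.node to leave it out.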
Find Depth of Tree
SELECT T1.node, COUNT(T2.node) AS lvl
FROM Tree AS T1, Tree AS T2
WHERE T1.lft BETWEEN T2.lft AND T2.rgt
GROUP BY T1.node;
Count the containing nested sets for levels
The closer to the root a node is, the greater the value of (rgt - lft)
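Counting the containing sets gives each node's level. A sketch on the assumed five-node sample tree, using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

# Each node is contained in itself and in all of its superiors,
# so the count is the node's level with the root at level 1.
levels = dict(conn.execute("""
    SELECT T1.node, COUNT(T2.node) AS lvl
      FROM Tree AS T1, Tree AS T2
     WHERE T1.lft BETWEEN T2.lft AND T2.rgt
     GROUP BY T1.node"""))

depth_of_tree = max(levels.values())
print(levels, depth_of_tree)
```

The depth of the whole tree is then the largest level, 3 in this sample.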
Totals by Level in Tree
SELECT T1.node, SUM(T2.cost) AS tot_level_cost
FROM Tree AS T1, Tree AS T2
WHERE T2.lft BETWEEN T1.lft AND T1.rgt
GROUP BY T1.node;
Uses any aggregate function the same way
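The subtree totals from the Tree Aggregates slide can be reproduced with this query. In the sketch below the costs for A1 and A2 come from that slide and 1.75 for A0 is what its 7.25 subtotal implies; the individual costs for root (1.50) and B0 (5.00) are assumptions chosen only so that the grand total is 13.75.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Tree
    (node TEXT PRIMARY KEY,
     lft INTEGER NOT NULL,
     rgt INTEGER NOT NULL,
     cost REAL NOT NULL)""")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?, ?)",
    [("root", 1, 10, 1.50),   # assumed
     ("A0", 2, 7, 1.75),      # implied by the 7.25 subtotal
     ("A1", 3, 4, 2.00),
     ("A2", 5, 6, 3.50),
     ("B0", 8, 9, 5.00)],     # assumed
)

# One row per node: the sum of everything inside that node's range.
totals = dict(conn.execute("""
    SELECT T1.node, SUM(T2.cost) AS tot_level_cost
      FROM Tree AS T1, Tree AS T2
     WHERE T2.lft BETWEEN T1.lft AND T1.rgt
     GROUP BY T1.node"""))

print(totals)
```

Swapping SUM for COUNT, MIN, MAX, or AVG gives the matching summary per subtree with no other change.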
CTEs and Adjacency List – 1
Part of SQL-99
– Oracle
– DB2
– SQL Server
– others
WITH RECURSIVE <temp table> <column list> AS
(<seed statement>
UNION ALL
<recursive statement>)
<statement>;
CTEs and Adjacency List – 2
Can do tree traversals and many other things
WITH RECURSIVE OrgChart (emp_id, mgr_id, lvl) AS
(SELECT P1.emp_id, P1.mgr_id, 0
FROM Personnel AS P1
WHERE P1.emp_id = :my_guy
UNION ALL
SELECT P2.emp_id, P2.mgr_id, (lvl + 1)
FROM Personnel AS P2, OrgChart AS C1
WHERE P2.mgr_id = C1.emp_id)
SELECT P3.emp_title, P3.emp_id, P3.mgr_id, C2.lvl
FROM Personnel AS P3, OrgChart AS C2
WHERE P3.emp_id = C2.emp_id
AND P3.emp_id <> :my_guy;
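The same recursive CTE runs in SQLite, which also accepts WITH RECURSIVE. A sketch through Python's sqlite3 module; the Personnel rows and titles are made up for illustration, lvl is qualified as C1.lvl, and an ORDER BY is added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Personnel (emp_id INTEGER PRIMARY KEY, mgr_id INTEGER, emp_title TEXT)")
conn.executemany(
    "INSERT INTO Personnel VALUES (?, ?, ?)",
    [(1, None, "President"), (2, 1, "VP Sales"), (3, 1, "VP Ops"), (4, 2, "Sales Rep")],
)

rows = conn.execute("""
    WITH RECURSIVE OrgChart (emp_id, mgr_id, lvl) AS
    (SELECT P1.emp_id, P1.mgr_id, 0
       FROM Personnel AS P1
      WHERE P1.emp_id = :my_guy
     UNION ALL
     SELECT P2.emp_id, P2.mgr_id, (C1.lvl + 1)
       FROM Personnel AS P2, OrgChart AS C1
      WHERE P2.mgr_id = C1.emp_id)
    SELECT P3.emp_title, P3.emp_id, P3.mgr_id, C2.lvl
      FROM Personnel AS P3, OrgChart AS C2
     WHERE P3.emp_id = C2.emp_id
       AND P3.emp_id <> :my_guy
     ORDER BY C2.lvl, P3.emp_id""", {"my_guy": 1}).fetchall()

for row in rows:
    print(row)
```

Each pass of the recursive part adds one more level of subordinates until no new rows appear.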
Delete a Subtree
Remove subtree rooted at :my_node
DELETE FROM Tree
WHERE lft BETWEEN
(SELECT lft
FROM Tree
WHERE node = :my_node)
AND (SELECT rgt
FROM Tree
WHERE node = :my_node);
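A sketch of the subtree delete on the assumed five-node sample tree, using Python's sqlite3 module. The helper name delete_subtree is hypothetical. It reads the boundaries first and then deletes, which is a design choice of this sketch so that the result does not depend on how an engine evaluates subqueries against a table that is being modified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

def delete_subtree(my_node):
    # Read the boundaries of the subtree, then remove everything inside them.
    lft, rgt = conn.execute(
        "SELECT lft, rgt FROM Tree WHERE node = ?", (my_node,)).fetchone()
    conn.execute("DELETE FROM Tree WHERE lft BETWEEN ? AND ?", (lft, rgt))

delete_subtree("A0")
remaining = conn.execute("SELECT node, lft, rgt FROM Tree ORDER BY lft").fetchall()
print(remaining)
```

A0, A1, and A2 are gone; the surviving rows keep their old numbers, so the numbering now has a gap from 2 through 7.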
Delete & Promote Oldest - 1
Delete A0 node
[Figure: tree before the delete. Root has children A0 and B0; A0 has children A1 and A2.]
Delete & Promote Oldest - 2
[Figure: tree after the delete. Root has children A1 and B0; A1, the promoted oldest child, has child A2.]
Delete & Promote Subtree - 1
Delete A0 node
[Figure: tree before the delete. Root has children A0 and B0; A0 has children A1 and A2.]
Delete & Promote Subtree - 2
[Figure: tree after the delete. Root has children A1, A2, and B0.]
Delete a Single Node
Method one: promote a child to the parent's prior position in the tree.
– Oldest son inherits family business
– Requires business rules and stored procedures
Method two: subordinate the entire subtree to the grandparent.
– Orphans go live with grandmother
– This is the default in nested sets model
Renumbering is not required
Useful View
CREATE VIEW LftRgt (seq_nbr)
AS SELECT lft FROM Tree
UNION ALL
SELECT rgt FROM Tree;
You can use this to find gaps in the node numbering
Gaps in Nested Sets Model -1
Deleted nodes leave gaps in the numbering of the lft and rgt values.
Fill in gaps by sliding everyone over to the left until there are no gaps.
UPDATE Tree
SET lft = (SELECT COUNT(*)
FROM LftRgt
WHERE seq_nbr <= Tree.lft),
rgt = (SELECT COUNT(*)
FROM LftRgt
WHERE seq_nbr <= Tree.rgt);
Gaps in Nested Sets Model -2
ROW_NUMBER () function to close up gaps
WITH X (lr, seq_nbr)
AS
(SELECT N.lr, N.seq_nbr
FROM (SELECT ROW_NUMBER()
OVER (ORDER BY seq_nbr) AS lr, seq_nbr
FROM LftRgt) AS N
WHERE N.lr <> N.seq_nbr) -- only the values that have to move
UPDATE Tree
SET lft = COALESCE((SELECT lr
FROM X
WHERE X.seq_nbr = Tree.lft), lft), -- keep values that do not move
rgt = COALESCE((SELECT lr
FROM X
WHERE X.seq_nbr = Tree.rgt), rgt);
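A sketch of closing the gaps left by a delete, using Python's sqlite3 module. The starting rows are the assumed sample tree after its A0 subtree was removed. The Renumber table is a hypothetical helper: this sketch materializes the old-to-new mapping before updating, as an assumption-free way to avoid relying on how an engine reads a table while the same statement is updating it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
# Numbering with a gap: the values 2 through 7 are missing.
conn.executemany("INSERT INTO Tree VALUES (?, ?, ?)", [("root", 1, 10), ("B0", 8, 9)])

conn.execute("""
    CREATE VIEW LftRgt
    AS SELECT lft AS seq_nbr FROM Tree
       UNION ALL
       SELECT rgt FROM Tree""")

# New number = how many existing numbers are less than or equal to the old one.
conn.execute("""
    CREATE TEMP TABLE Renumber AS
    SELECT L1.seq_nbr,
           (SELECT COUNT(*)
              FROM LftRgt AS L2
             WHERE L2.seq_nbr <= L1.seq_nbr) AS new_nbr
      FROM LftRgt AS L1""")

conn.execute("""
    UPDATE Tree
       SET lft = (SELECT new_nbr FROM Renumber WHERE seq_nbr = Tree.lft),
           rgt = (SELECT new_nbr FROM Renumber WHERE seq_nbr = Tree.rgt)""")

closed = conn.execute("SELECT node, lft, rgt FROM Tree ORDER BY lft").fetchall()
print(closed)
```

The nesting is unchanged; only the numbers slide down so that they run from 1 to twice the node count.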
Inserting into a Tree
The real trick is numbering the subtree correctly before inserting it.
The basic idea is to spread the Nested Sets numbers apart to make a gap the size of the subtree, then add the subtree.
The position of the subtree within the siblings of the new parent in the tree is another decision.
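A sketch of the spread-then-insert idea for the smallest case, a subtree of one node that needs a gap of two numbers. It uses Python's sqlite3 module; the helper name insert_rightmost_child is hypothetical, and placing the new node as the rightmost sibling is just one of the possible positioning decisions. The sketch declares no UNIQUE constraints on lft and rgt, so intermediate states of the updates are not checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tree (node TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO Tree VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

def insert_rightmost_child(parent, new_node):
    (parent_rgt,) = conn.execute(
        "SELECT rgt FROM Tree WHERE node = ?", (parent,)).fetchone()
    # Open a gap of 2 at the parent's rgt: the parent and its superiors
    # stretch, and everything to the right of the parent slides over.
    conn.execute("UPDATE Tree SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,))
    conn.execute("UPDATE Tree SET lft = lft + 2 WHERE lft > ?", (parent_rgt,))
    # The new leaf takes the two freed numbers.
    conn.execute("INSERT INTO Tree VALUES (?, ?, ?)",
                 (new_node, parent_rgt, parent_rgt + 1))

insert_rightmost_child("A0", "A3")
after_insert = conn.execute("SELECT node, lft, rgt FROM Tree ORDER BY lft").fetchall()
print(after_insert)
```

For a larger subtree the gap would be twice its node count, and the subtree's own numbers would be offset to start at the parent's old rgt value.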
Inserting into a Tree -2
If you frequently update the tree structure, then use a bigger spread in the numbering.
At higher levels, use steps of 100,000, then 10,000 and so forth.
Most SQL products can handle DECIMAL(p,s) with a precision of 30 or more digits.
Inserting into a Tree -3
Slide everyone to the left
[Figure: tree with the nodes Root, A0, A1, A2, and B, and a node labeled New being inserted]
Creating a Tree -1
If you want to have all the constraints for a proper hierarchy, then it is complicated.
CREATE TABLE Tree
(node_id INTEGER NOT NULL -- primary key optional
REFERENCES Nodes(node_id)
ON UPDATE CASCADE
ON DELETE CASCADE,
lft INTEGER NOT NULL UNIQUE CHECK (lft > 0),
rgt INTEGER NOT NULL UNIQUE CHECK (rgt > 1),
UNIQUE (lft, rgt), -- redundant, but useful
CHECK (lft < rgt)
);
Creating a Tree -2
Other needed constraints
– No overlaps in the nodes
SELECT *
FROM Tree AS T1
WHERE EXISTS
(SELECT *
FROM Tree AS T2
WHERE T1.lft BETWEEN T2.lft AND T2.rgt
AND T1.rgt NOT BETWEEN T2.lft AND T2.rgt);
Creating a Tree -3
Other needed constraints
– No disjoint nodes
SELECT *
FROM Tree AS T1
WHERE EXISTS
(SELECT *
FROM Tree AS T2
WHERE T1.lft <
(SELECT rgt
FROM Tree
WHERE lft = 1));
Creating a Tree -4
If you do not have triggers or CREATE ASSERTION, you can use an updatable view
CREATE VIEW GoodTree (node, lft, rgt)
AS
SELECT T1.node, T1.lft, T1.rgt
FROM Tree AS T1
WHERE NOT EXISTS (<overlaps>)
AND NOT EXISTS (<disjoint>)
WITH CHECK OPTION;
Converting an Adjacency Model into a Nested Sets Model
Current best method is to load nodes into a tree in a host language, then do a recursive pre-order tree traversal to get the lft and rgt traversal numbers.
Adjacency list method does not order siblings; Nested Sets Model does this automatically
Classic push down stack algorithm works
You can keep both models in one table with a column for the immediate superior
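A sketch of the host-language conversion with an explicit push-down stack, written in Python. The function name adjacency_to_nested_sets is hypothetical, and the sketch assumes a single root whose parent is None, no cycles, and that siblings are ordered by their position in the input, since the adjacency list itself carries no sibling order.

```python
def adjacency_to_nested_sets(rows):
    """rows: iterable of (node, parent) pairs; the root has parent None."""
    children = {}
    root = None
    for node, parent in rows:
        children.setdefault(node, [])
        if parent is None:
            root = node
        else:
            children.setdefault(parent, []).append(node)

    counter = 1
    lft = {root: counter}
    numbers = {}
    # Each stack entry is a node plus an iterator over its unvisited children.
    stack = [(root, iter(children[root]))]
    while stack:
        node, kids = stack[-1]
        child = next(kids, None)
        counter += 1
        if child is None:
            # All children done: this step of the counter is the node's rgt.
            numbers[node] = (lft[node], counter)
            stack.pop()
        else:
            # First visit to the child: this step of the counter is its lft.
            lft[child] = counter
            stack.append((child, iter(children[child])))
    return numbers


adj_rows = [("root", None), ("A0", "root"), ("B0", "root"),
            ("A1", "A0"), ("A2", "A0")]
nested = adjacency_to_nested_sets(adj_rows)
print(nested)
```

The resulting (lft, rgt) pairs can be loaded into the Tree table, optionally next to the parent column if both models are kept in one table.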
Converting a Nested Sets Model into an Adjacency Model
This is actually pretty straightforward; you can put it into a single view
SELECT B.emp AS boss, P.emp
FROM OrgChart AS P
LEFT OUTER JOIN
OrgChart AS B
ON B.lft
= (SELECT MAX(lft)
FROM OrgChart AS S
WHERE P.lft > S.lft
AND P.lft < S.rgt);
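The query can be tried as written in SQLite through Python's sqlite3 module. The OrgChart rows below reuse the assumed five-node sample numbering, and an ORDER BY is added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrgChart (emp TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO OrgChart VALUES (?, ?, ?)",
    [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
)

# The boss is the containing set with the largest lft, that is, the
# innermost set that still encloses the employee.
pairs = conn.execute("""
    SELECT B.emp AS boss, P.emp
      FROM OrgChart AS P
           LEFT OUTER JOIN
           OrgChart AS B
           ON B.lft
              = (SELECT MAX(lft)
                   FROM OrgChart AS S
                  WHERE P.lft > S.lft
                    AND P.lft < S.rgt)
     ORDER BY P.lft""").fetchall()

print(pairs)
```

The root has no containing set, so the outer join keeps it with a NULL boss, which matches the adjacency list convention of a NULL parent for the root.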
Structure versus Contents
Nested Sets Model allows the structure of trees to be compared
– For each tree find the lft value of the root node of each tree
– Make a canonical form and UNION ALL them
EXISTS ( SELECT *
FROM ( SELECT (lft - lftmost), (rgt - lftmost)
FROM Tree1
UNION ALL
SELECT (lft - lftmost), (rgt - lftmost)
FROM Tree2) AS Both (lft, rgt)
GROUP BY Both.lft, Both.rgt
HAVING COUNT (*) =1 ) ;
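A sketch of the structure comparison in SQLite through Python's sqlite3 module. The helper same_structure, the alias Combined, and the three sample tables are assumptions for illustration; lftmost is taken to be MIN(lft), the root's lft value, and both trees are assumed to have no gaps in their numbering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
samples = {
    "Tree1": [("root", 1, 10), ("A0", 2, 7), ("A1", 3, 4), ("A2", 5, 6), ("B0", 8, 9)],
    # Same shape, different names, numbering shifted by 100.
    "Tree2": [("p", 101, 110), ("q", 102, 107), ("r", 103, 104),
              ("s", 105, 106), ("t", 108, 109)],
    # Same node count, different shape.
    "Tree3": [("p", 1, 10), ("q", 2, 9), ("r", 3, 8), ("s", 4, 5), ("t", 6, 7)],
}
for name, rows in samples.items():
    conn.execute(f"CREATE TABLE {name} (node TEXT, lft INTEGER, rgt INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

def same_structure(t1, t2):
    # t1 and t2 must be trusted table names; identifiers cannot be bound
    # as query parameters.
    unmatched = conn.execute(f"""
        SELECT norm_lft, norm_rgt
          FROM (SELECT lft - (SELECT MIN(lft) FROM {t1}) AS norm_lft,
                       rgt - (SELECT MIN(lft) FROM {t1}) AS norm_rgt
                  FROM {t1}
                UNION ALL
                SELECT lft - (SELECT MIN(lft) FROM {t2}),
                       rgt - (SELECT MIN(lft) FROM {t2})
                  FROM {t2}) AS Combined
         GROUP BY norm_lft, norm_rgt
        HAVING COUNT(*) = 1""").fetchall()
    # A pair that appears only once exists in one tree but not the other.
    return not unmatched

print(same_structure("Tree1", "Tree2"), same_structure("Tree1", "Tree3"))
```

Node names play no part in the comparison, which is the point: only the canonical (lft, rgt) pairs, the structure, are compared.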
Questions & Answers
?