Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa State University.

Post on 06-Jan-2016

Page 1: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru

Department of Electrical and Computer Engineering

Iowa State University.

Page 2: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Motivation

Large amounts of biological sequence data are available.

An index for a text is usually larger than the text itself.

Efficient ways to store and query these data are required.

Page 3: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Related Works

String B-tree: has the best worst-case performance in secondary storage and allows updates. However, most existing programs still use suffix trees instead of string B-trees.

Many other works focus only on the construction of suffix trees and give no worst-case bounds:

S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological applications.

E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and protein sequence collections.

Clark and Munro, "Efficient suffix trees on secondary storage": focuses on reducing the space usage of suffix trees. Performance depends on the height of the tree.

Farach, odd-even suffix tree construction: optimal construction time in secondary storage. The performance of search and update operations is not studied.

We show that suffix trees can achieve the same level of efficiency as string B-trees when the alphabet has constant size.

Page 4: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Definitions

Let v be an internal node of a suffix tree.

size(v) is the number of leaves in the subtree rooted at v.

rank(v) = i iff $C^i \le \mathrm{size}(v) < C^{i+1}$, for a chosen constant C.

Internal nodes u and v belong to the same partition iff u is the parent of v and rank(v) = rank(u).

The rank of a partition P, rank(P), is the rank of the internal nodes in the partition.
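The definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the tree is assumed to be given as a dictionary `children` mapping each internal node to its list of children (a hypothetical input format), and the function names are made up for this example.

```python
def subtree_sizes(children, root):
    """Return size(v), the number of leaves below v, for every node."""
    size = {}
    # Iterative post-order traversal, so deep trees do not hit the recursion limit.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:
            size[node] = 1  # a leaf of the suffix tree
        elif expanded:
            size[node] = sum(size[k] for k in kids)
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return size


def rank(size_v, C):
    """Largest i with C**i <= size_v, using integer arithmetic only."""
    i = 0
    while C ** (i + 1) <= size_v:
        i += 1
    return i


def partitions(children, root, C):
    """Map every internal node to the root of the partition it belongs to."""
    size = subtree_sizes(children, root)
    part_root = {root: root}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if not children.get(v):
                continue  # only internal nodes are partitioned
            same = rank(size[v], C) == rank(size[u], C)
            part_root[v] = part_root[u] if same else v
            stack.append(v)
    return size, part_root


# Toy tree (not a real suffix tree): r has 7 leaves below it.
children = {
    "r": ["a", "b", "l7"],
    "a": ["l1", "l2", "l3"],
    "b": ["c", "l6"],
    "c": ["l4", "l5"],
}
size, part_root = partitions(children, "r", C=3)
# With C = 3, r, a and b have rank 1 and share one partition; c has rank 0.
```

In this toy tree the rank-1 partition has two leaves (a and b), which matches the bound of C-1 leaves for C = 3.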

Page 5: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

A Suffix Tree Partitioned

[Figure: a suffix tree divided into partitions labeled rank = 0, rank = 1, and rank = 2.]

Each root-to-leaf path goes through at most $\log_C n$ partitions.

Page 6: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Suffix Tree & Partition Example

C = 3

Partitions of rank 0

Page 7: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Suffix Tree & Partition Example

C = 3

Partitions of rank 1

Page 8: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Properties of a Partition

Nodes in a partition without any child in the same partition are referred to as leaves.

The node whose parent is in another partition is referred to as the root.

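There are at most C-1 leaves in each partition.

For a partition of rank i, size(root) is at least size(u) for every leaf u of the partition, so $C^{i+1} - 1 \ge \mathrm{size}(root) \ge \mathrm{size}(u) \ge C^i$.

If a partition had C leaves, their subtrees would be disjoint and the root would have size at least $C \cdot C^i = C^{i+1}$, which exceeds the bound above.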

Page 9: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Properties of a Partition

If a node v has more than 1 child in the same partition as v, it is referred to as a branching node.

There can be at most C-2 branching nodes, because there are at most C-1 leaves.

A skeleton partition tree for a partition P contains the root, all the leaves and branching nodes of a partition. There are at most 2C-2 nodes in a skeleton partition tree. With a suitable choice of C, it can be stored in 1 disk page.
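As a rough sketch of how the skeleton could be extracted, the following Python function keeps only the root, the partition leaves, and the branching nodes. The input format is an assumption made for this example: `part_children` maps each node of one partition to the list of its children that lie in the same partition.

```python
def skeleton_nodes(part_children, root):
    """Return the set of nodes kept in the skeleton partition tree.

    part_children: hypothetical mapping from a node of the partition to
    its children that belong to the same partition.
    """
    keep = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = part_children.get(v, [])
        if len(kids) == 0 or len(kids) > 1:
            keep.add(v)  # partition leaf (no kids) or branching node (2+ kids)
        stack.extend(kids)
    return keep


# One partition with C = 3: r -> x -> y, y branches into a and b, a -> a2.
part_children = {"r": ["x"], "x": ["y"], "y": ["a", "b"], "a": ["a2"]}
skeleton = skeleton_nodes(part_children, "r")
# Kept: the root r, the branching node y, and the leaves b and a2.
assert len(skeleton) <= 2 * 3 - 2
```

The non-branching nodes x and a are dropped, so the skeleton here has 4 nodes, which meets the 2C-2 bound for C = 3.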

Page 10: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Partition and Skeleton Partition Tree

Store a representative suffix in each node of the skeleton partition tree.

Page 11: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Searching for an Exact Match (1)

p = TTAATGAT

Page 12: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Searching for an Exact Match (1)

p = TTAATGAT

Load the representative suffix and compare to p.

Page 13: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Searching for an Exact Match (1)

p = TTAATGAT

Load the representative suffix and compare to p.

Suppose the representative suffix is TTATTAGGA……

The lcp between p and the representative suffix is 3.
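The lcp (longest common prefix) value in this example can be checked with a short Python helper. The function name is chosen for illustration, and only the visible prefix of the representative suffix is used, since the rest is elided on the slide.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


p = "TTAATGAT"
representative_prefix = "TTATTAGGA"  # visible part of the suffix only
assert lcp(p, representative_prefix) == 3  # "TTA" matches, then A != T
```

The mismatch occurs at the fourth character, inside the visible prefix, so the elided tail of the suffix does not affect the result.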

Page 14: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Searching for an Exact Match (2)

The lcp between p and the representative suffix is 3.

Move to the appropriate next partition.

Total number of disk accesses:

$O(|p|/B + \log_B n)$

p = TTAATGAT

Page 15: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Supporting Update Operations

With insertion and deletion, the size of a node, as well as its partition, may change.

During insertion of a suffix:

size(v) changes if and only if node v is an ancestor of the newly inserted leaf.

rank(v) may change only if size(v) changes and node v is the root of a partition.

If rank(v) changes, node v will become either a new partition by itself or a leaf in its parent's partition.
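The effect of an insertion on sizes and ranks can be illustrated with a simplified Python sketch. It assumes the new leaf hangs directly below an existing internal node (a real suffix tree insertion may also split an edge), and the `parent` and `size` dictionaries are a hypothetical representation chosen for this example.

```python
def rank(size_v, C):
    """Largest i with C**i <= size_v."""
    i = 0
    while C ** (i + 1) <= size_v:
        i += 1
    return i


def insert_leaf(parent, size, attach_to, C):
    """Add one leaf below attach_to and return the ancestors whose rank changed.

    parent: maps each internal node to its parent (None for the tree root).
    size:   maps each internal node to size(v); updated in place.
    """
    changed = []
    v = attach_to
    while v is not None:
        old_rank = rank(size[v], C)
        size[v] += 1  # every ancestor of the new leaf gains one leaf
        if rank(size[v], C) != old_rank:
            changed.append(v)
        v = parent.get(v)
    return changed


# Internal nodes only; c is the root of its own rank-0 partition (C = 3).
parent = {"r": None, "a": "r", "b": "r", "c": "b"}
size = {"r": 7, "a": 3, "b": 3, "c": 2}
changed = insert_leaf(parent, size, "c", C=3)
# Only c changes rank (size 2 -> 3, rank 0 -> 1); b and r keep rank 1.
```

After the insertion c has the same rank as its parent b, so in this example c becomes a leaf in its parent's partition, which is one of the two outcomes listed above.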

Page 16: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Only the Rank of the Root of a Partition Changes

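Suppose rank(v) increases by one for a node v that is not the root of its partition.

Then size(v) was $C^{\mathrm{rank}(v)+1} - 1$ before the insertion.

The root is a proper ancestor of v, so size(root) was at least $C^{\mathrm{rank}(v)+1}$.

Hence the root was not in the same partition as v, a contradiction.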

Page 17: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Insertion and Deletion

By the same argument, only the rank of a leaf of a partition can change during the deletion of a suffix.

Store and keep size(v) up to date for node v if:

1. Node v is the root of the partition.

2. Node v is connected to the root by a chain of branching nodes.

3. Node v is a non-branching node and is the child of a node u that satisfies one of the conditions above.

Page 18: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

The Root of a Partition is Removed

Let v be a child of the old root in the partition.

If v is a branching node, nothing needs to be done, and the new partition with v as the root has all the sizes set correctly.

If v is a non-branching node, we can calculate the size of its only child in the partition by subtracting the sizes of all other children from size(v).

After the updates, all the size values will be set correctly as stated previously.
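The subtraction step for a non-branching node amounts to the following small Python helper. The function name and argument layout are made up for illustration; the sizes of the other children are assumed to be available because those children are leaves of the suffix tree or roots of other partitions.

```python
def only_child_size(size_v, other_child_sizes):
    """Size of v's single in-partition child: size(v) minus its siblings' sizes."""
    child = size_v - sum(other_child_sizes)
    if child < 1:
        raise ValueError("sizes are inconsistent")
    return child


# v has 10 leaves below it; its other children account for 2 + 1 of them.
assert only_child_size(10, [2, 1]) == 7
```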

Page 19: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

The Root of a Partition is Removed

[Figure: a partition whose old root is removed; labels "Old Root" and "New Roots".]

Page 20: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

The Leaf of a Partition is Removed

If a leaf is removed from a partition:

The leaf becomes the root of a partition; its size can be calculated as the sum of the sizes of all its children, which were all roots of different partitions.

Either a previously branching node becomes a non-branching node, in which case no update of size is necessary, or

a previously non-branching node becomes a new leaf, in which case the size of the new leaf can be calculated by adding the sizes of all its children.

Page 21: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

The Leaf of a Partition is Removed

Leaf from another partition

Page 22: Obtaining Provably Good Performance from Suffix Trees in Secondary Storage

Results

Let B be the size of a disk block.

Let n be the total length of the strings.

Let m be the length of the string being inserted or deleted.

Construction takes $O(n \log_B n)$ disk accesses.

Insertion and deletion take $O(m \log_B (n+m))$ and $O(m \log_B n)$ disk accesses, respectively.

Let p be the length of a pattern.

Searching takes $O(p/B + \log_B n)$ disk accesses.
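To get a feel for how these bounds scale, the leading terms can be evaluated with a small Python sketch. The functions below drop the constants hidden by the O-notation, so they are not predictions of real disk access counts, and the sample values of B, n, m, and p are hypothetical.

```python
import math


def construction_accesses(B, n):
    """Leading term of the construction bound: n * log_B(n)."""
    return n * math.log(n, B)


def insertion_accesses(m, B, n):
    """Leading term of the insertion bound: m * log_B(n + m)."""
    return m * math.log(n + m, B)


def deletion_accesses(m, B, n):
    """Leading term of the deletion bound: m * log_B(n)."""
    return m * math.log(n, B)


def search_accesses(p, B, n):
    """Leading term of the search bound: p / B + log_B(n)."""
    return p / B + math.log(n, B)


# Hypothetical sizes: blocks of 1024 units, 2**30 characters of text.
B, n = 1024, 2 ** 30
estimate = search_accesses(p=1024, B=B, n=n)
# log_1024(2**30) is 3, so the estimate is close to 1 + 3 = 4.
```

The search term grows additively in the pattern length, while insertion and deletion grow multiplicatively in m, which is the main difference between the bounds.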