DATA STRUCTURE HANDOUT 1

MODULE ONE

DATA STRUCTURE: OVERVIEW

In computer science, a data structure is a particular way of storing and organizing data

in a computer so that it can be used efficiently.

Different kinds of data structures are suited to different kinds of applications, and some

are highly specialized to specific tasks. For example, B-trees are particularly well-suited

for implementation of databases, while compiler implementations usually use hash tables

to look up identifiers.

Data structures provide a means to manage huge amounts of data efficiently, such as

large databases and internet indexing services. Usually, efficient data structures are a key

to designing efficient algorithms. Some formal design methods and programming

languages emphasize data structures, rather than algorithms, as the key organizing factor in software

design. Storing and retrieving can be carried out on data stored in both main memory and in secondary

memory. Various data structures are available, and the right one should be chosen based on the needs of the application.

Overview

An array data structure stores a number of elements of the same type in a specific

order. They are accessed using an integer to specify which element is required

(although the elements may be of almost any type). Arrays may be fixed-length or

expandable.
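As a brief illustration (the class name ArrayDemo and the values used are made up here), a fixed-length array in Java is accessed by an integer index, while java.util.ArrayList is one common expandable-array implementation:

import java.util.ArrayList;
import java.util.List;

public class ArrayDemo {
    public static void main(String[] args) {
        int[] fixed = new int[4];          // fixed-length; every element has the same type
        fixed[0] = 10;                     // elements are read and written by integer index
        fixed[3] = 40;

        List<Integer> expandable = new ArrayList<>();   // an expandable array
        expandable.add(10);
        expandable.add(40);
        System.out.println(fixed[3] + " " + expandable.get(1));   // prints "40 40"
    }
}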

Record (also called tuple or struct). Records are among the simplest data structures.

A record is a value that contains other values, typically in fixed number and

sequence and typically indexed by names. The elements of records are usually

called fields or members.

A hash or dictionary or map is a more flexible variation on a record, in which

name-value pairs can be added and deleted freely.
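The following minimal Java sketch illustrates both ideas; the Point record and the key/value strings are purely illustrative, and the record syntax assumes a recent JDK (Java 16 or later):

import java.util.HashMap;
import java.util.Map;

public class RecordAndMapDemo {
    record Point(int x, int y) {}    // a record: a fixed number of named fields

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println(p.x() + "," + p.y());   // fields are accessed by name

        Map<String, String> dictionary = new HashMap<>();   // a hash/dictionary/map
        dictionary.put("colour", "blue");    // name-value pairs can be added freely
        dictionary.remove("colour");         // ...and deleted freely
        System.out.println(dictionary.containsKey("colour"));   // prints "false"
    }
}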

Union. A union type definition will specify which of a number of permitted

primitive types may be stored in its instances, e.g. "float or long integer". Contrast


with a record, which could be defined to contain a float and an integer; whereas, in

a union, there is only one value at a time.

A tagged union (also called a variant, variant record, discriminated union, or

disjoint union) contains an additional field indicating its current type, for enhanced

type safety.
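A tagged union can be sketched in Java with a sealed interface (Java 17 or later); the names Value, IntValue and FloatValue below are illustrative only, echoing the "float or long integer" example above:

sealed interface Value permits IntValue, FloatValue {}
record IntValue(long value) implements Value {}      // the "long integer" alternative
record FloatValue(double value) implements Value {}  // the "float" alternative

class TaggedUnionDemo {
    static double asDouble(Value v) {
        // the runtime type acts as the tag and is checked before the stored value is used
        if (v instanceof IntValue i) return i.value();
        if (v instanceof FloatValue f) return f.value();
        throw new IllegalStateException("unreachable: the hierarchy is sealed");
    }
}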

A set is an abstract data structure that can store specific values, without any

particular order, and no repeated values. Values themselves are not retrieved from

sets, rather one tests a value for membership to obtain a Boolean "in" or "not in".

An object contains a number of data fields, like a record, and also a number of

program code fragments for accessing or modifying them. Data structures not

containing code, like those above, are called plain old data structures.

Many others are possible, but they tend to be further variations and compounds of the

above.

1.1 BASIC PRINCIPLES

Data structures are generally based on the ability of a computer to fetch and store

data at any place in its memory, specified by an address—a bit string that can be itself

stored in memory and manipulated by the program. Thus the record and array data

structures are based on computing the addresses of data items with arithmetic operations;

while the linked data structures are based on storing addresses of data items within the

structure itself. Many data structures use both principles, sometimes combined in non-

trivial ways (as in XOR linking).

The implementation of a data structure usually requires writing a set of procedures that

create and manipulate instances of that structure. The efficiency of a data structure cannot

be analyzed separately from those operations. This observation motivates the theoretical

concept of an abstract data type, a data structure that is defined indirectly by the

operations that may be performed on it, and the mathematical properties of those

operations (including their space and time cost).


1.2 LANGUAGE SUPPORT

Most assembly languages and some low-level languages, such as BCPL (Basic

Combined Programming Language), lack support for data structures. Many high-level

programming languages and some higher-level assembly languages, such as MASM, on

the other hand, have special syntax or other built-in support for certain data structures,

such as vectors (one-dimensional arrays) in the C language or multi-dimensional arrays in

Pascal.

Most programming languages feature some sort of library mechanism that allows data

structure implementations to be reused by different programs. Modern languages usually

come with standard libraries that implement the most common data structures. Examples

are the C++ Standard Template Library, the Java Collections Framework, and Microsoft's

.NET Framework.

Modern languages also generally support modular programming, the separation between

the interface of a library module and its implementation. Some provide opaque data types

that allow clients to hide implementation details. Object-oriented programming

languages, such as C++, Java, and those of the .NET Framework, may use classes for this purpose.

Many known data structures have concurrent versions that allow multiple computing

threads to access the data structure simultaneously.


MODULE TWO

THE TREE STRUCTURE

In computer science, a tree is a widely used data structure that simulates a hierarchical

tree structure with a set of linked nodes.

A tree can be defined recursively (locally) as a collection of nodes (starting at a root

node), where each node is a data structure consisting of a value, together with a list of

nodes (the "children"), with the constraint that no node is duplicated.

A tree can be defined abstractly as a whole (globally) as an ordered tree, with a value

assigned to each node. Both these perspectives are useful: while a tree can be analyzed

mathematically as a whole, when actually represented as a data structure it is usually

represented and worked with separately by node (rather than as a list of nodes and an

adjacency list of edges between nodes, as one may represent a digraph, for instance).

For example, looking at a tree as a whole, one can talk about "the parent node" of a given

node, but in general as a data structure a given node only contains the list of its children,

but does not contain a reference to its parent (if any).

Mathematical

Viewed as a whole, a tree data structure is an ordered tree, generally with values attached

to each node. Concretely, it is:

A rooted tree with the "away from root" direction (a more narrow term is an

"arborescence"), meaning:

A directed graph,

whose underlying undirected graph is a tree (any two vertices are connected by

exactly one simple path),

with a distinguished root (one vertex is designated as the root),

which determines the direction on the edges (arrows point away from the root;

given an edge, the node that the edge points from is called the parent and the node

that the edge points to is called the child),

Together with:

an ordering on the child nodes of a given node, and


a value (of some data type) at each node.

Often trees have a fixed (more properly, bounded) branching factor (outdegree),

particularly always having two child nodes (possibly empty, hence at most two non-

empty child nodes), hence a "binary tree".

2.1 BINARY TREES

The simplest form of tree is a binary tree. A binary tree consists of

a. a node (called the root node) and

b. left and right sub-trees.

Both the sub-trees are themselves binary trees.

You now have a recursively defined data structure. (It is also possible to define a list

recursively: can you see how?)

A binary tree

The nodes at the lowest levels of the tree (the ones with no sub-trees) are called leaves.

In an ordered binary tree,

The keys of all the nodes in the left sub-tree are less than that of the root,

The keys of all the nodes in the right sub-tree are greater than that of the root,

The left and right sub-trees are themselves ordered binary trees.

Data Structure

The data structure for the tree implementation simply adds left and right pointers in place

of the next pointer of the linked list implementation.

The AddToCollection method is, naturally, recursive.

Similarly, the FindInCollection method is recursive.
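A minimal Java sketch of this structure is shown below, assuming integer keys; the method names mirror the AddToCollection and FindInCollection methods mentioned in the text, but the handout's surrounding collection interface is not reproduced here:

class TreeNode {
    int key;
    TreeNode left, right;   // left and right pointers replace the linked list's next pointer

    TreeNode(int key) { this.key = key; }
}

class TreeCollection {
    private TreeNode root;

    // recursive add: smaller keys go into the left sub-tree, larger keys into the right
    void addToCollection(int key) { root = add(root, key); }

    private TreeNode add(TreeNode node, int key) {
        if (node == null) return new TreeNode(key);
        if (key < node.key) node.left = add(node.left, key);
        else if (key > node.key) node.right = add(node.right, key);
        return node;    // duplicate keys are simply ignored in this sketch
    }

    // recursive find: the ordering property decides which sub-tree to search
    boolean findInCollection(int key) { return find(root, key); }

    private boolean find(TreeNode node, int key) {
        if (node == null) return false;
        if (key == node.key) return true;
        return key < node.key ? find(node.left, key) : find(node.right, key);
    }
}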

Analysis

Complete Trees

Before we look at more general cases, let's make the optimistic assumption that we've

managed to fill our tree neatly, i.e. that each leaf is the same 'distance' from the root.

A complete tree

This forms a complete tree, whose height is defined as the number of links from the root

to the deepest leaf.

First, we need to work out how many nodes, n, we have in such a tree of height, h.

Now,

n = 1 + 2^1 + 2^2 + .... + 2^h

From which we have,

n = 2^(h+1) - 1

and

h = floor( log2(n) )

Examination of the Find method shows that in the worst case, h+1 or ceiling( log2(n) )

comparisons are needed to find an item. This is the same as for binary search.


However, Add also requires ceiling( log2(n) ) comparisons to determine where to add an

item. Actually adding the item takes a constant number of operations, so we say that a

binary tree requires O(log n) operations for both adding and finding an item - a

considerable improvement over binary search for a dynamic structure which often

requires addition of new items.

2.3 TERMINOLOGY

A node is a structure which may contain a value or condition, or represent a

separate data structure (which could be a tree of its own). Each node in a tree has zero or

more child nodes, which are below it in the tree (by convention, trees are drawn growing

downwards). A node that has a child is called the child's parent node (or ancestor node,

or superior). A node has at most one parent.

An internal node (also known as an inner node, inode for short, or branch node) is any

node of a tree that has child nodes. Similarly, an external node (also known as an outer

node, leaf node, or terminal node) is any node that does not have child nodes.

The topmost node in a tree is called the root node. Being the topmost node, the root node

will not have a parent. It is the node at which algorithms on the tree begin, since as a data

structure, one can only pass from parents to children.

Note that some algorithms (such as post-order depth-first search) begin at the root, but

first visit leaf nodes (access the value of leaf nodes), only visit the root last (i.e., they first

access the children of the root, but only access the value of the root last).

All other nodes can be reached from it by following edges or links. (In the formal

definition, each such path is also unique.) In diagrams, the root node is conventionally

drawn at the top. In some trees, such as heaps, the root node has special properties.

Every node in a tree can be seen as the root node of the subtree rooted at that node.

The height of a node is the length of the longest downward path to a leaf from that node.

The height of the root is the height of the tree. The depth of a node is the length of the

path to its root (i.e., its root path). This is commonly needed in the manipulation of the

various self-balancing trees, AVL trees in particular. The root node has depth zero, leaf


nodes have height zero, and a tree with only a single node (hence both a root and leaf)

has depth and height zero. Conventionally, an empty tree (tree with no nodes) has depth

and height −1.

A subtree of a tree T is a tree consisting of a node in T and all of its descendants in T.

Nodes thus correspond to subtrees (each node corresponds to the subtree of itself and all

its descendants) – the subtree corresponding to the root node is the entire tree, and each

node is the root node of the subtree it determines; the subtree corresponding to any other

node is called a proper subtree (in analogy to the term proper subset).


MODULE THREE

THE GRAPH STRUCTURE

In computer science, a graph is an abstract data type that is meant to implement the

graph and hypergraph concepts from mathematics.

A graph data structure consists of a finite (and possibly mutable) set of ordered pairs,

called edges or arcs, of certain entities called nodes or vertices. As in mathematics, an

edge (x,y) is said to point or go from x to y. The nodes may be part of the graph structure,

or may be external entities represented by integer indices or references.

A graph data structure may also associate to each edge some edge value, such as a

symbolic label or a numeric attribute (cost, capacity, length, etc.).

Graphs (also known as Networks) are very powerful structures and find their

applications in path-finding, visibility determination, soft-bodies using mass-spring

systems and probably a lot more. A graph is similar to a tree, but it imposes no

restrictions on how nodes are connected to each other. In fact each node can point to

another node or even multiple nodes at once.

A node is represented by the GraphNode class, and the connections between GraphNode

objects are modeled by the GraphArc class.

The arc has only one direction (uni-directional) and points from one node to another so

you are only allowed to go from A to B and not in the opposite direction. Bi-directional

connections can be simulated by creating an arc from node A to B and vice-versa from B

to A. Also, each arc has a weight value associated with it, which describes how costly it

is to move along the arc.

This is optional though, so the default value is 1.


Putting it together, the graph is implemented as a uni-directional weighted graph. The

Graph manages everything: it stores the nodes and the arcs in separate lists, makes sure

you don’t add a node twice, or mess up the arcs (for example if you remove a node from

the graph, it also scans the arc list and removes all arcs pointing to that node) and

provides you with tools to traverse the graph.

3.1 BUILDING THE GRAPH STRUCTURE

In figure 1, you see a simple graph containing 8 nodes. You can add additional

nodes, which will be placed at the position of the cursor by pressing ‘a’ or start with a

fresh graph by pressing ‘r’. To create an arc pointing from node A to node B, simply click

both nodes successively. Traversal is also possible: first press ‘t’ to switch to ‘traverse’

mode, then click a node to find all nodes which are connected to that node.

Figure 1: Building a graph structure

Graph traversal

If you have tried out the traversal in the example above, you may wonder how it’s done.

The answer lies in two common algorithms to accomplish this: Breadth-first search

(BFS) and depth-first search (DFS). (The demonstration above used the breadth-first

search.)

The BFS algorithm visits all nodes that are closest to the starting node first, so it

gradually expands outward in all directions equally. This looks like a virus infecting the

direct neighborhood at each search iteration.

BFS utilizes a queue and proceeds as follows:

1. Mark the starting node and enqueue it.

2. Process the node at the front of the queue by calling a user-defined function on it.

3. Mark all connected nodes that are not yet marked and also put them into the queue.

4. Remove the node at the front of the queue.

5. Repeat steps 2-4 with the node that is now at the front of the queue.

6. Stop when the queue is empty.
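A minimal Java sketch of this procedure is given below, assuming the graph is held as an adjacency list of integer node ids (0 to n-1); the marked array plays the role of the markers that clearMarks() resets, and "processing" a node is just a print statement here:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class BreadthFirstSearch {
    static void bfs(List<List<Integer>> adjacency, int start) {
        boolean[] marked = new boolean[adjacency.size()];
        Queue<Integer> queue = new ArrayDeque<>();
        marked[start] = true;                 // 1. mark the starting node and enqueue it
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.peek();          // node at the front of the queue
            System.out.println("processing " + node);    // 2. process the node
            for (int next : adjacency.get(node)) {       // 3. mark and enqueue unmarked neighbours
                if (!marked[next]) {
                    marked[next] = true;
                    queue.add(next);
                }
            }
            queue.remove();                   // 4. remove the node at the front of the queue
        }                                     // 5./6. repeat until the queue is empty
    }
}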

The depth-first search (DFS) on the other hand takes the starting node, follows the next

arc it finds to get to the next node, and continues this until the complete path has been

discovered, then goes back to the starting node and follows the next path until it reaches a

dead end, and so on. It’s currently implemented as a recursive function, which means that it

can fail for very large graphs when the call stack exceeds its maximum size (I don’t know

how big it is in AS3, though).

Both algorithms have in common that they mark a node when it’s added to the queue,

otherwise the node would be enqueued and unnecessarily processed multiple times,

because different nodes can all point to a common node. So before you start a BFS or

DFS it’s very important to reset all markers by calling the clearMarks() function on the

graph.
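For comparison, a recursive DFS can be sketched in Java as follows (same adjacency-list assumption as the BFS sketch above); the recursion itself supplies the backtracking, which is also why a very deep graph can exhaust the call stack:

import java.util.List;

class DepthFirstSearch {
    static void dfs(List<List<Integer>> adjacency, int node, boolean[] marked) {
        marked[node] = true;                        // mark before descending
        System.out.println("processing " + node);   // user-defined processing step
        for (int next : adjacency.get(node)) {
            if (!marked[next]) {
                dfs(adjacency, next, marked);       // recursion plays the role of the queue
            }
        }
    }
}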

The two algorithms are visualized in figure 2 below. I’ve created a rectangular grid of

nodes (similar to a tilemap) by connecting each node with the top, bottom, left and right

neighbors (I left out the arcs because it would be a total mess). I have also deleted some

nodes to show you that both algorithms don’t rely on a regular structure and can look like

anything. Just click a node to start the traversal. You can toggle between both algorithms

by pressing ‘b’ (BFS) and ‘d’ (DFS).

Figure 2: Traversing a graph with BFS and DFS

BFS is much more useful than DFS in most situations. But DFS is likely to be faster, for

example when you only want to modify all connected nodes in some way.


MODULE FOUR

THE POLISH NOTATION

Polish notation, also known as prefix notation, is a symbolic logic invented by Polish

mathematician Jan Łukasiewicz in the 1920s. When using Polish notation, the instruction

(operation) precedes the data (operands). In Polish notation, the order (and only the

order) of operations and operands determines the result, making parentheses unnecessary.

For example, the expression 3 x (4 + 5) is written: x 3 + 4 5

Polish notation, also known as Polish prefix notation or simply prefix notation, is a form

of notation for logic, arithmetic, and algebra. Its distinguishing feature is that it places

operators to the left of their operands. If the arity of the operators is fixed, the result is a

syntax lacking parentheses or other brackets that can still be parsed without ambiguity.

The Polish logician Jan Łukasiewicz invented this notation around 1920 in order to

simplify sentential logic.

The term Polish notation is sometimes taken (as the opposite of infix notation) to also

include Polish postfix notation, or Reverse Polish notation, in which the operator is

placed after the operands.

This contrasts with the traditional algebraic methodology for performing mathematical

operations, the Order of Operations. (The mnemonic device for remembering the Order

of Operations is "Please Excuse My Dear Aunt Sally" - parentheses, exponents,

multiplication, division, addition, subtraction).

In the expression 3(4+5), you would work inside the parentheses first to add four plus

five and then multiply the result by three.

In the early days of the calculator, the end-user had to write down the results of their

intermediate steps when using the algebraic Order of Operations. Not only did this slow

things down, it provided an opportunity for the end-user to make errors and sometimes

defeated the purpose of using a calculating machine. In the 1960s, engineers at Hewlett-

Packard decided that it would be easier for end-users to learn Jan Lukasiewicz' logic

system than to try and use the Order of Operations on a calculator. They modified Jan


Lukasiewicz's system for a calculator keyboard by placing the instructions (operators)

after the data. In homage to Jan Lukasiewicz' Polish logic system, the engineers at

Hewlett-Packard called their modification reverse Polish notation (RPN).

The notation for the expression 3(4+5) would now be expressed as

4 5 + 3 x

or it could also be written as

3 4 5 + x

Reverse Polish notation provided a straightforward solution for calculator or computer

software mathematics because it treats the instructions (operators) and the data (operands)

as "objects" and processes them in a last-in, first-out (LIFO) basis.

This is called a "stack method". (Think of a stack of plates. The last plate you put on the

stack will be the first plate taken off the stack.)

Modern calculators with memory functions are sophisticated enough to accommodate the

use of the traditional algebraic Order of Operations, but users of RPN calculators like the

logic's simplicity and continue to make it profitable for Hewlett-Packard to manufacture

RPN calculators. Some of Hewlett Packard's latest calculators are capable of both RPN

and algebraic logic.

When Polish notation is used as a syntax for mathematical expressions by interpreters of

programming languages, it is readily parsed into abstract syntax trees and can, in fact,

define a one-to-one representation for the same. Because of this, Lisp (see below) and

related programming languages define their entire syntax in terms of prefix notation (and

others use postfix notation).

Here is a quotation from a paper by Jan Łukasiewicz, Remarks on Nicod's Axiom and on

"Generalizing Deduction", page 180.

"I came upon the idea of a parenthesis-free notation in 1924. I used that notation for

the first time in my article Łukasiewicz(1), p. 610, footnote."

The reference cited by Jan Łukasiewicz above is apparently a lithographed report in

Polish. The referring paper by Łukasiewicz Remarks on Nicod's Axiom and on


"Generalizing Deduction" was reviewed by H. A. Pogorzelski in the Journal of Symbolic

Logic in 1965.

Alonzo Church mentions this notation in his classic book on mathematical logic as

worthy of remark in notational systems even contrasted to Whitehead and Russell's

logical notational exposition and work in Principia Mathematica.

While no longer used much in logic, Polish notation has since found a place in computer

science.

4.1 POLISH NOTATION IN ARITHMETIC

The expression for adding the numbers 1 and 2 is, in prefix notation, written "+ 1

2" rather than "1 + 2". In more complex expressions, the operators still precede their

operands, but the operands may themselves be nontrivial expressions including operators

of their own. For instance, the expression that would be written in conventional infix

notation as

(5 − 6) * 7

can be written in prefix as

* (− 5 6) 7

Since the simple arithmetic operators are all binary (at least, in arithmetic contexts), any

prefix representation thereof is unambiguous, and bracketing the prefix expression is

unnecessary. As such, the previous expression can be further simplified to

* − 5 6 7

The processing of the product is deferred until its two operands are available (i.e., 5

minus 6, and 7). As with any notation, the innermost expressions are evaluated first, but

in prefix notation this "innermost-ness" can be conveyed by order rather than bracketing.

In the classical notation, the parentheses in the infix version were required, since moving

them

5 − (6 * 7)

or simply removing them


5 − 6 * 7

would change the meaning and result of the overall expression, due to the precedence

rule.

Similarly

5 − (6 * 7)

can be written in Polish notation as

− 5 * 6 7

4.2 POLISH NOTATION IN COMPUTER PROGRAMMING

Prefix notation has seen wide application in Lisp s-expressions, where the brackets

are required since the operators in the language are themselves data (first-class functions).

Lisp functions may also have variable arity. The Ambi programming language uses

Polish Notation for arithmetic operations and program construction.

The postfix reverse Polish notation is used in many stack-based programming languages

like PostScript and Forth, and is the operating principle of certain calculators, notably

from Hewlett-Packard.

The number of return values of an expression equals the number of operands in the

expression minus the total arity of the operators plus the total number of return values of

the operators.

4.3 ORDER OF OPERATIONS

Order of operations is defined within the structure of prefix notation and can be

easily determined. One thing to keep in mind is that when executing an operation, the

operation takes the first operand as its left argument and the second operand as its right

argument. This is not an issue with

operations that commute, but for non-commutative operations like division or

subtraction, this fact is crucial to the analysis of a statement. For example, the following

statement:

/ 10 5 = 2


is read as "divide 10 by 5". Thus the solution is 2, not 1/2 as would be the result of an

incorrect analysis.

Prefix notation is especially popular with stack-based operations due to its innate ability

to easily distinguish order of operations without the need for parentheses. To evaluate

order of operations under prefix notation, one does not even need to memorize an

operational hierarchy, as with infix notation. Instead, one looks directly to the notation to

discover which operator to evaluate first. Reading an expression from left to right, one

first looks for an operator and proceeds to look for two operands. If another operator is

found before two operands are found, then the old operator is placed aside until this new

operator is resolved. This process iterates until an operator is resolved, which must

happen eventually, as there must be one more operand than there are operators in a

complete statement. Once resolved, the operator and the two operands are replaced with a

new operand. Because one operator and two operands are removed and one operand is

added, there is a net loss of one operator and one operand, which still leaves an

expression with N operators and N + 1 operands, thus allowing the iterative process to

continue. This is the general theory behind using stacks in programming languages to

evaluate a statement in prefix notation, although there are various algorithms that

manipulate the process. Once analyzed, a statement in prefix notation becomes less

intimidating to the human mind as it allows some separation from convention with added

convenience. An example shows the ease with which a complex statement in prefix

notation can be deciphered through order of operations:

− * / 15 − 7 + 1 1 3 + 2 + 1 1 =

− * / 15 − 7 2 3 + 2 + 1 1 =

− * / 15 5 3 + 2 + 1 1 =

− * 3 3 + 2 + 1 1 =

− 9 + 2 + 1 1 =

− 9 + 2 2 =

− 9 4 =

5


An equivalent in-fix is as follows: ((15 / (7 − (1 + 1))) * 3) − (2 + (1 + 1)) = 5

Here is an implementation (in pseudocode) of prefix evaluation using a stack. Note that

under this implementation the input string is scanned from right to left. This differs from

the algorithm described above in which the string is processed from left to right. Both

algorithms compute the same value for all valid strings.

Scan the given prefix expression from right to left
for each symbol
{
    if operand then
        push onto stack
    if operator then
    {
        operand1 = pop stack
        operand2 = pop stack
        compute operand1 operator operand2
        push result onto stack
    }
}
return top of stack as result

The result is at the top of the stack.
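A runnable Java version of this right-to-left stack evaluation might look as follows, assuming a whitespace-separated expression of numbers and the four operators + - * / (the class and method names are illustrative only):

import java.util.ArrayDeque;
import java.util.Deque;

class PrefixEvaluator {
    static double evaluate(String expression) {
        String[] symbols = expression.trim().split("\\s+");
        Deque<Double> stack = new ArrayDeque<>();
        // scan the given prefix expression from right to left
        for (int i = symbols.length - 1; i >= 0; i--) {
            String s = symbols[i];
            if (s.equals("+") || s.equals("-") || s.equals("*") || s.equals("/")) {
                double operand1 = stack.pop();     // left operand
                double operand2 = stack.pop();     // right operand
                switch (s) {
                    case "+": stack.push(operand1 + operand2); break;
                    case "-": stack.push(operand1 - operand2); break;
                    case "*": stack.push(operand1 * operand2); break;
                    default:  stack.push(operand1 / operand2); break;
                }
            } else {
                stack.push(Double.parseDouble(s)); // operand: push onto stack
            }
        }
        return stack.pop();                        // the result is at the top of the stack
    }

    public static void main(String[] args) {
        // the worked example from section 4.3; prints 5.0
        System.out.println(evaluate("- * / 15 - 7 + 1 1 3 + 2 + 1 1"));
    }
}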

4.4 POLISH NOTATION FOR LOGIC

The table below shows the core of Jan Łukasiewicz's notation for sentential logic.

The "conventional" notation did not become so until the 1970s and 80s. Some letters in

the Polish notation table means a certain word in Polish, as shown:

Concept                    Conventional notation    Polish notation    Polish word

Negation                   ¬φ                       Nφ                 negacja
Conjunction                φ ∧ ψ                    Kφψ                koniunkcja
Disjunction                φ ∨ ψ                    Aφψ                alternatywa
Material conditional       φ → ψ                    Cφψ                implikacja
Biconditional              φ ↔ ψ                    Eφψ                ekwiwalencja
Falsum                     ⊥                        O                  fałsz
Sheffer stroke             φ | ψ                    Dφψ                dysjunkcja
Possibility                ◇φ                       Mφ                 możliwość
Necessity                  □φ                       Lφ                 konieczność
Universal quantifier       ∀p φ                     Πpφ                kwantyfikator ogólny
Existential quantifier     ∃p φ                     Σpφ                kwantyfikator szczegółowy

Note that the quantifiers ranged over propositional values in Łukasiewicz's work on

many-valued logics.

Bocheński introduced an incompatible system of Polish notation that names all 16 binary

connectives of classical propositional logic.


MODULE FIVE

STORAGE MANAGEMENT

Storage management is a general storage industry phrase that is used to describe the

tools, processes, and policies used to manage storage networks and storage services such

as virtualization, replication, mirroring, security, compression, traffic analysis and other

services. The phrase storage management also encompasses numerous storage

technologies including process automation, real-time infrastructure products and storage

provisioning.

In some cases, the phrase storage management may be used in direct reference to Storage

Resource Management (SRM) -- software that manages storage from a capacity,

utilization, policy and event-management perspective.

5.1 HIERARCHICAL STORAGE MANAGEMENT

Hierarchical storage management (HSM) is a data storage technique which

automatically moves data between high-cost and low-cost storage media. HSM systems

exist because high-speed storage devices, such as hard disk drive arrays, are more

expensive (per byte stored) than slower devices, such as optical discs and magnetic tape

drives. While it would be ideal to have all data available on high-speed devices all the

time, this is prohibitively expensive for many organizations. Instead, HSM systems store

the bulk of the enterprise's data on slower devices, and then copy data to faster disk

drives when needed. In effect, HSM turns the fast disk drives into caches for the slower

mass storage devices. The HSM system monitors the way data is used and makes best

guesses as to which data can safely be moved to slower devices and which data should

stay on the fast devices.

In a typical HSM scenario, data files which are frequently used are stored on disk drives,

but are eventually migrated to tape if they are not used for a certain period of time,

typically a few months. If a user does reuse a file which is on tape, it is automatically

moved back to disk storage. The advantage is that the total amount of stored data can be


much larger than the capacity of the disk storage available, but since only rarely-used

files are on tape, most users will usually not notice any slowdown.

HSM is sometimes referred to as tiered storage.

HSM (originally DFHSM, now DFSMShsm) was first implemented by IBM on their

mainframe computers to reduce the cost of data storage, and to simplify the retrieval of

data from slower media. The user would not need to know where the data was stored and

how to get it back; the computer would retrieve the data automatically. The only

difference to the user was the speed at which data was returned.

Later, IBM ported HSM to its AIX operating system, and then to other Unix-like

operating systems such as Solaris, HP-UX and Linux.

HSM was also implemented on the DEC VAX/VMS systems and the Alpha/VMS

systems. The first implementation date should be readily determined from the VMS

System Implementation Manuals or the VMS Product Description Brochures.

Recently, the development of Serial ATA (SATA) disks has created a significant market

for three-stage HSM: files are migrated from high-performance Fibre Channel Storage

Area Network devices to somewhat slower but much cheaper SATA disk arrays totaling

several terabytes or more, and then eventually from the SATA disks to tape.

The newest development in HSM is with hard disk drives and flash memory, with flash

memory being over 30 times faster than disks, but disks being considerably cheaper.

Conceptually, HSM is analogous to the cache found in most computer CPUs, where

small amounts of expensive SRAM memory running at very high speeds is used to store

frequently used data, but the least recently used data is evicted to the slower but much

larger main DRAM memory when new data has to be loaded.

In practice, HSM is typically performed by dedicated software, such as IBM Tivoli

Storage Manager, Oracle's SAM-QFS, Quantum, SGI Data Migration Facility (DMF),

StorNext, or EMC Legato OTG DiskXtender.

Use Cases


HSM is often used for deep archival storage of data to be held long term at low cost.

Automated tape robots can silo large quantities of data efficiently with low power

consumption.

Some HSM software products allow the user to place portions of data files on high-speed

disk cache and the rest on tape. This is used in applications that stream video over the

internet -- the initial portion of a video is delivered immediately from disk while a robot

finds, mounts and streams the rest of the file to the end user. Such a system greatly

reduces disk cost for large content provision systems.

Tiered storage

Tiered storage is a data storage environment consisting of two or more kinds of storage

delineated by differences in at least one of these four attributes: price, performance,

capacity and function.

Any significant difference in one or more of the four defining attributes can be sufficient

to justify a separate storage tier.

Examples:

Disk and tape: two separate storage tiers identified by differences in all four

defining attributes.

Old technology disk and new technology disk: two separate storage tiers identified

by differences in one or more of the attributes.

High performing disk storage and less expensive, slower disk of the same capacity

and function: two separate tiers.

Identical enterprise class disk configured to utilize different functions such as

RAID level or replication: a separate storage tier for each set of unique functions.

Note: Storage Tiers are not delineated by differences in vendor, architecture, or geometry

except where those differences result in clear changes to price, performance, capacity and

function.


5.2 GARBAGE COLLECTION (COMPUTER SCIENCE)

In computer science, garbage collection (GC) is a form of automatic memory

management. The garbage collector, or just collector, attempts to reclaim garbage, or

memory occupied by objects that are no longer in use by the program. Garbage collection

was invented by John McCarthy around 1959 to solve problems in Lisp.

Garbage collection is often portrayed as the opposite of manual memory management,

which requires the programmer to specify which objects to deallocate and return to the

memory system. However, many systems use a combination of approaches, including

other techniques such as stack allocation and region inference.

Resources other than memory, such as network sockets, database handles, user

interaction windows, and file and device descriptors, are not typically handled by garbage

collection. Methods used to manage such resources, particularly destructors, may suffice

to manage memory as well, leaving no need for GC. Some GC systems allow such other

resources to be associated with a region of memory that, when collected, causes the other

resource to be reclaimed; this is called finalization. Finalization may introduce

complications limiting its usability, such as intolerable latency between disuse and

reclaim of especially limited resources, or a lack of control over which thread performs

the work of reclaiming.

5.2.1 PRINCIPLES

Many computer languages require garbage collection, either as part of the

language specification (e.g., Java, C#, and most scripting languages) or effectively for

practical implementation (e.g., formal languages like lambda calculus); these are said to

be garbage collected languages. Other languages were designed for use with manual

memory management, but have garbage collected implementations available (e.g., C, C++).

Some languages, like Ada, Modula-3, and C++/CLI allow both garbage collection and

manual memory management to co-exist in the same application by using separate heaps

for collected and manually managed objects; others, like D, are garbage collected but

allow the user to manually delete objects and also entirely disable garbage collection


when speed is required. While integrating garbage collection into the language's compiler

and runtime system enables a much wider choice of methods, post hoc

GC systems exist, including some that do not require recompilation. (Post-hoc GC is

sometimes distinguished as litter collection.) The garbage collector will almost always be

closely integrated with the memory allocator.

5.2.2 BENEFITS

Garbage collection frees the programmer from manually dealing with memory

deallocation. As a result, certain categories of bugs are eliminated or substantially

reduced:

Dangling pointer bugs, which occur when a piece of memory is freed while there are still

pointers to it, and one of those pointers is dereferenced. By then the memory may have

been re-assigned to another use, with unpredictable results.

Double free bugs, which occur when the program tries to free a region of memory that

has already been freed, and perhaps already been allocated again.

Certain kinds of memory leaks, in which a program fails to free memory occupied by

objects that have become unreachable, which can lead to memory exhaustion. (Garbage

collection typically does not deal with the unbounded accumulation of data that is

reachable, but that will actually not be used by the program.)

Garbage collection also enables efficient implementations of persistent data structures.

Some of the bugs addressed by garbage collection can have security implications.

5.2.3 DISADVANTAGES

Typically, garbage collection has certain disadvantages:

Garbage collection consumes computing resources in deciding which memory to free,

even though the programmer may have already known this information. The penalty for

the convenience of not annotating object lifetime manually in the source code is

overhead, which can lead to decreased or uneven performance. Interaction with memory


hierarchy effects can make this overhead intolerable in circumstances that are hard to

predict or to detect in routine testing.

The moment when the garbage is actually collected can be unpredictable, resulting in

stalls scattered throughout a session. Unpredictable stalls can be unacceptable in real-time

environments, in transaction processing, or in interactive programs. Incremental,

concurrent, and real-time garbage collectors address these problems, with varying trade-

offs.

Non-deterministic GC is incompatible with RAII-based management of non-GCed

resources. As a result, the need for explicit manual resource management (release/close)

for non-GCed resources becomes transitive to composition. That is: in a non-

deterministic GC system, if a resource or a resource like object requires manual resource

management (release/close), and this object is used as 'part of' another object, then the

composed object will also become a resource like object that itself requires manual

resource management (release/close).

5.2.4 TRACING GARBAGE COLLECTORS

Tracing garbage collectors are the most common type of garbage collector. They

first determine which objects are reachable (or potentially reachable), and then discard all

remaining objects.

5.2.5 REACHABILITY OF AN OBJECT

Informally, an object is reachable if it is referenced by at least one variable in the

program, either directly or through references from other reachable objects. More

precisely, objects can be reachable in only two ways:

A distinguished set of objects are assumed to be reachable: these are known as the roots.

Typically, these include all the objects referenced from anywhere in the call stack (that is,

all local variables and parameters in the functions currently being invoked), and any

global variables.


Anything referenced from a reachable object is itself reachable; more formally,

reachability is a transitive closure.

The reachability definition of "garbage" is not optimal, insofar as the last time a program

uses an object could be long before that object falls out of the environment scope. A

distinction is sometimes drawn between syntactic garbage, those objects the program

cannot possibly reach, and semantic garbage, those objects the program will in fact never

again use. For example:

Object x = new Foo();
Object y = new Bar();
x = new Quux();
/* At this point, we know that the Foo object
 * originally assigned to x will never be
 * accessed: it is syntactic garbage. */
if (x.check_something()) {
    x.do_something(y);
}
System.exit(0);
/* In the above block, y *could* be semantic garbage,
 * but we won't know until x.check_something() returns
 * some value -- if it returns at all. */

The problem of precisely identifying semantic garbage can easily be shown to be

partially decidable: a program that allocates an object X, runs an arbitrary input program

P, and uses X if and only if P finishes would require a semantic garbage collector to solve

the halting problem. Although conservative heuristic methods for semantic garbage

detection remain an active research area, essentially all practical garbage collectors focus

on syntactic garbage.


Another complication with this approach is that, in languages with both reference types

and unboxed value types, the garbage collector needs to somehow be able to distinguish

which variables on the stack or fields in an object are regular values and which are

references: in memory, an integer and a reference might look alike. The garbage collector

then needs to know whether to treat the element as a reference and follow it, or whether it

is a primitive value. One common solution is the use of tagged pointers.

5.2.6 STRONG AND WEAK REFERENCES

The garbage collector can reclaim only objects that have no references pointing to

them either directly or indirectly from the root set. However, some programs require

weak references, which should be usable for as long as the object exists but should not

prolong its lifetime. In discussions about weak references, ordinary references are

sometimes called strong references. An object is eligible for garbage collection if there

are no strong (i.e. ordinary) references to it, even though there still might be some weak

references to it.

A weak reference is not merely any pointer to the object that a garbage collector

does not care about. The term is usually reserved for a properly managed category of

special reference objects which are safe to use even after the object disappears because

they lapse to a safe value. An unsafe reference that is not known to the garbage collector

will simply remain dangling by continuing to refer to the address where the object

previously resided. This is not a weak reference.

In some implementations, weak references are divided into subcategories.

For example, the Java Virtual Machine provides three forms of weak references, namely

soft references, phantom references, and regular weak references. A softly referenced

object is only eligible for reclamation if the garbage collector decides that the program is

low on memory. Unlike a soft reference or a regular weak reference, a phantom reference

does not provide access to the object that it references. Instead, a phantom reference is a

mechanism that allows the garbage collector to notify the program when the referenced

object has become phantom reachable. An object is phantom reachable if it still resides


in memory and it is referenced by a phantom reference, but its finalizer has already

executed. Similarly, Microsoft .NET provides two subcategories of weak references,

namely long weak references (tracks resurrection) and short weak references.
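As an illustration, java.lang.ref.WeakReference shows the basic strong-versus-weak behaviour; note that System.gc() is only a hint to the JVM, so whether the referent is actually cleared when this sketch runs is not guaranteed:

import java.lang.ref.WeakReference;

class WeakReferenceDemo {
    public static void main(String[] args) {
        Object strong = new Object();
        WeakReference<Object> weak = new WeakReference<>(strong);

        strong = null;          // drop the only strong reference
        System.gc();            // request a collection (a hint only)

        // the referent may now have been reclaimed, in which case get() returns null
        System.out.println(weak.get() == null ? "collected" : "still reachable");
    }
}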

5.2.7 WEAK COLLECTIONS

Data structures can also be devised which have weak tracking features. For

instance, weak hash tables are useful. Like a regular hash table, a weak hash table

maintains an association between pairs of objects, where each pair is understood to be a

key and value. However, the hash table does not actually maintain a strong reference on

these objects. A special behavior takes place when either the key or value or both become

garbage: the hash table entry is spontaneously deleted. There exist further refinements

such as hash tables which have only weak keys (value references are ordinary, strong

references) or only weak values (key references are strong).

Weak hash tables are important for maintaining associations between objects, such that

the objects engaged in the association can still become garbage if nothing in the program

refers to them any longer (other than the associating hash table).

The use of a regular hash table for such a purpose could lead to a "logical memory leak":

the accumulation of reachable data which the program does not need and will not use.
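In Java, java.util.WeakHashMap is a weak-key hash table of this kind; the sketch below is illustrative, and as with any GC-dependent behaviour the exact moment at which the entry disappears is not guaranteed:

import java.util.Map;
import java.util.WeakHashMap;

class WeakMapDemo {
    public static void main(String[] args) {
        Map<Object, String> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, "associated value");

        key = null;             // the map's weak key reference no longer keeps it alive
        System.gc();            // hint only; the entry may be removed after a collection

        System.out.println("entries: " + cache.size());   // often 0 after a collection
    }
}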

5.3 BASIC ALGORITHM

Tracing collectors are so called because they trace through the working set of

memory. These garbage collectors perform collection in cycles. A cycle is started when

the collector decides (or is notified) that it needs to reclaim memory, which happens most

often when the system is low on memory. The original method involves a naïve mark-

and-sweep in which the entire memory set is touched several times.

5.3.1 TRI-COLOR MARKING

Because of these pitfalls, most modern tracing garbage collectors implement some variant

of the tri-colour marking abstraction, but simple collectors (such as the mark-and-sweep


collector) often do not make this abstraction explicit. Tri-colour marking works as

follows:

Create initial white, grey, and black sets; these sets will be used to maintain progress

during the cycle.

Initially the white set or condemned set is the set of objects that are candidates for having

their memory recycled.

The black set is the set of objects that can cheaply be proven to have no references to

objects in the white set, but are also not chosen to be candidates for recycling; in many

implementations, the black set starts off empty.

The grey set is all the objects that are reachable from root references but the objects

referenced by grey objects haven't been scanned yet. Grey objects are known to be

reachable from the root, so cannot be garbage collected: grey objects will eventually end

up in the black set. The grey state means we still need to check any objects that the object

references.

The grey set is initialised to objects which are referenced directly at root level; typically

all other objects are initially placed in the white set.

Objects can move from white to grey to black, never in the other direction.

Pick an object from the grey set. Blacken this object (move it to the black set), by greying

all the white objects it references directly. This confirms that this object cannot be

garbage collected, and also that any objects it references cannot be garbage collected.

Repeat the previous step until the grey set is empty.

When there are no more objects in the grey set, then all the objects remaining in the white

set have been demonstrated not to be reachable, and the storage occupied by them can be

reclaimed.

The 3 sets partition memory; every object in the system, including the root set, is in

precisely one set.

The tri-colour marking algorithm preserves an important invariant:

No black object points directly to a white object.


This ensures that the white objects can be safely destroyed once the grey set is empty.

(Some variations on the algorithm do not preserve the tricolour invariant but they use a

modified form for which all the important properties hold.)

The tri-colour method has an important advantage: it can be performed 'on-the-fly',

without halting the system for significant time periods. This is accomplished by marking

objects as they are allocated and during mutation, maintaining the various sets. By

monitoring the size of the sets, the system can perform garbage collection periodically,

rather than as-needed. Also, the need to touch the entire working set each cycle is

avoided.
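The following single-threaded Java sketch models the tri-colour idea on a toy object graph; a real collector works on raw memory and runs interleaved with the program, so this is illustrative only, and the Obj class and set representations are assumptions of the sketch:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TriColourMarker {
    static class Obj {
        List<Obj> references = List.of();   // the objects this object refers to
    }

    // Returns the black set (proven reachable); whatever remains in the white set
    // afterwards could have its storage reclaimed.
    static Set<Obj> mark(Set<Obj> heap, Set<Obj> roots) {
        Set<Obj> white = new HashSet<>(heap);   // condemned set: candidates for recycling
        Set<Obj> black = new HashSet<>();       // reachable and fully scanned
        Deque<Obj> grey = new ArrayDeque<>();   // reachable but not yet scanned

        for (Obj root : roots) {                // grey set starts at the root references
            if (white.remove(root)) grey.push(root);
        }
        while (!grey.isEmpty()) {
            Obj obj = grey.pop();               // pick an object from the grey set
            for (Obj ref : obj.references) {
                if (white.remove(ref)) grey.push(ref);   // grey the white objects it references
            }
            black.add(obj);                     // blacken it: objects move white -> grey -> black
        }
        return black;                           // white now contains only unreachable objects
    }
}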

5.4 IMPLEMENTATION STRATEGIES

In order to implement the basic tri-colour algorithm, several important design

decisions must be made, which can significantly affect the performance characteristics of

the garbage collector.

5.4.1 Moving vs. non-moving

Once the unreachable set has been determined, the garbage collector may simply

release the unreachable objects and leave everything else as it is, or it may copy some or

all of the reachable objects into a new area of memory, updating all references to those

objects as needed. These are called "non-moving" and "moving" (or, alternatively, "non-

compacting" and "compacting") garbage collectors, respectively.

At first, a moving GC strategy may seem inefficient and costly compared to the non-

moving approach, since much more work would appear to be required on each cycle. In

fact, however, the moving GC strategy leads to several performance advantages, both

during the garbage collection cycle itself and during actual program execution:

No additional work is required to reclaim the space freed by dead objects; the entire

region of memory from which reachable objects were moved can be considered free

space. In contrast, a non-moving GC must visit each unreachable object and somehow

record that the memory it alone occupied is available.


Similarly, new objects can be allocated very quickly. Since large contiguous regions of

memory are usually made available by the moving GC strategy, new objects can be

allocated by simply incrementing a 'free memory' pointer. A non-moving strategy may,

after some time, lead to a heavily fragmented heap, requiring expensive consultation of

"free lists" of small available blocks of memory in order to allocate new objects.

If an appropriate traversal order is used (such as cdr-first for list conses), objects that

refer to each other frequently can be moved very close to each other in memory,

increasing the likelihood that they will be located in the same cache line or virtual

memory page. This can significantly speed up access to these objects through these

references.

One disadvantage of a moving garbage collector is that it only allows access through

references that are managed by the garbage collected environment, and does not allow

pointer arithmetic. This is because any native pointers to objects will be invalidated when

the garbage collector moves the object (they become dangling pointers). For

interoperability with native code, the garbage collector must copy the object contents to a

location outside of the garbage collected region of memory. An alternative approach is to

pin the object in memory, preventing the garbage collector from moving it and allowing

the memory to be directly shared with native pointers (and possibly allowing pointer

arithmetic).

5.4.2 Copying vs. mark-and-sweep vs. mark-and-don't-sweep

To further refine the distinction, tracing collectors can also be divided by

considering how the three sets of objects (white, grey, and black) are maintained during a

collection cycle.

The most straightforward approach is the semi-space collector, which dates to 1969. In

this moving GC scheme, memory is partitioned into a "from space" and "to space".

Initially, objects are allocated into "to space" until it becomes full and a collection is

triggered. At the start of a collection, the "to space" becomes the "from space", and vice

versa. The objects reachable from the root set are copied from the "from space" to the "to

space". These objects are scanned in turn, and all objects that they point to are copied into


"to space", until all reachable objects have been copied into "to space". Once the program

continues execution, new objects are once again allocated in the "to space" until it is once

again full and the process is repeated. This approach has the advantage of conceptual

simplicity (the three object color sets are implicitly constructed during the copying

process), but the disadvantage that a (possibly) very large contiguous region of free

memory is necessarily required on every collection cycle. This technique is also known

as stop-and-copy. Cheney's algorithm is an improvement on the semi-space collector.

A mark and sweep garbage collector maintains a bit (or two) with each object to record

whether it is white or black; the grey set is either maintained as a separate list (such as the

process stack) or using another bit. As the reference tree is traversed during a collection

cycle (the "mark" phase), these bits are manipulated by the collector to reflect the current

state. A final "sweep" of the memory areas then frees white objects. The mark and sweep

strategy has the advantage that, once the unreachable set is determined, either a moving

or non-moving collection strategy can be pursued; this choice of strategy can even be

made at runtime, as available memory permits. It has the disadvantage of "bloating"

objects by a small amount.
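A toy mark-and-sweep pass can be sketched in Java as follows, with a boolean field standing in for the per-object mark bit and the call stack standing in for the grey set; modelling the heap as a mutable list (e.g. an ArrayList) of objects is an assumption of the sketch:

import java.util.Iterator;
import java.util.List;

class MarkAndSweep {
    static class Obj {
        boolean marked;                     // the per-object white/black bit from the text
        List<Obj> references = List.of();
    }

    static void collect(List<Obj> heap, List<Obj> roots) {
        for (Obj o : heap) o.marked = false;    // everything starts white
        for (Obj root : roots) mark(root);      // "mark" phase: trace from the roots
        for (Iterator<Obj> it = heap.iterator(); it.hasNext(); ) {
            if (!it.next().marked) it.remove(); // "sweep" phase: free the white objects
        }
    }

    private static void mark(Obj obj) {
        if (obj.marked) return;
        obj.marked = true;
        for (Obj ref : obj.references) mark(ref);   // the grey set lives on the call stack
    }
}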

A mark and don't sweep garbage collector, like the mark-and-sweep, maintains a bit with

each object to record whether it is white or black; the gray set is either maintained as a

separate list (such as the process stack) or using another bit. There are two key

differences here. First, black and white mean different things than they do in the mark

and sweep collector. In a "mark and don't sweep" system, all reachable objects are always

black. An object is marked black at the time it is allocated, and it will stay black even if it

becomes unreachable. A white object is unused memory and may be allocated. Second,

the interpretation of the black/white bit can change. Initially, the black/white bit may

have the sense of (0=white, 1=black). If an allocation operation ever fails to find any

available (white) memory, that means all objects are marked used (black). The sense of

the black/white bit is then inverted (for example, 0=black, 1=white). Everything becomes

white. This momentarily breaks the invariant that reachable objects are black, but a full


marking phase follows immediately, to mark them black again. Once this is done, all

unreachable memory is white. No "sweep" phase is necessary.

5.4.3 Generational GC (ephemeral GC)

It has been empirically observed that in many programs, the most recently created

objects are also those most likely to become unreachable quickly (known as infant

mortality or the generational hypothesis). A generational GC (also known as ephemeral

GC) divides objects into generations and, on most cycles, will place only the objects of a

subset of generations into the initial white (condemned) set. Furthermore, the runtime

system maintains knowledge of when references cross generations by observing the

creation and overwriting of references. When the garbage collector runs, it may be able to

use this knowledge to prove that some objects in the initial white set are unreachable

without having to traverse the entire reference tree. If the generational hypothesis holds,

this results in much faster collection cycles while still reclaiming most unreachable

objects.

In order to implement this concept, many generational garbage collectors use separate

memory regions for different ages of objects. When a region becomes full, those few

objects that are referenced from older memory regions are promoted to the next highest

region, and the entire region can then be overwritten with fresh objects.

This technique permits very fast incremental garbage collection, since the garbage

collection of only one region at a time is all that is typically required.

Generational garbage collection is a heuristic approach, and some unreachable objects

may not be reclaimed on each cycle. It may therefore occasionally be necessary to

perform a full mark and sweep or copying garbage collection to reclaim all available

space. In fact, runtime systems for modern programming languages (such as Java and

the .NET Framework) usually use some hybrid of the various strategies that have been

described thus far; for example, most collection cycles might look only at a few

generations, while occasionally a mark-and-sweep is performed, and even more rarely a

full copying is performed to combat fragmentation. The terms "minor cycle" and "major

cycle" are sometimes used to describe these different levels of collector aggression.


5.4.4 Stop-the-world vs. incremental vs. concurrent

Simple stop-the-world garbage collectors completely halt execution of the program to run

a collection cycle, thus guaranteeing that new objects are not allocated and objects do not

suddenly become unreachable while the collector is running.

This has the obvious disadvantage that the program can perform no useful work while a

collection cycle is running (sometimes called the "embarrassing pause"). Stop-the-world

garbage collection is therefore mainly suitable for non-interactive programs.

Its advantage is that it is both simpler to implement and faster than incremental garbage

collection.

Incremental and concurrent garbage collectors are designed to reduce this disruption by

interleaving their work with activity from the main program.

Incremental garbage collectors perform the garbage collection cycle in discrete phases,

with program execution permitted between each phase (and sometimes during some

phases). Concurrent garbage collectors do not stop program execution at all, except

perhaps briefly when the program's execution stack is scanned. However, the sum of the

incremental phases takes longer to complete than one batch garbage collection pass, so

these garbage collectors may yield lower total throughput.

Careful design is necessary with these techniques to ensure that the main program does

not interfere with the garbage collector and vice versa; for example, when the program

needs to allocate a new object, the runtime system may either need to suspend it until the

collection cycle is complete, or somehow notify the garbage collector that there exists a

new, reachable object.

5.4.5 Precise vs. conservative and internal pointers

Some collectors can correctly identify all pointers (references) in an object; these are

called precise (also exact or accurate) collectors, the opposite being a conservative or

partly conservative collector. Conservative collectors assume that any bit pattern in

memory could be a pointer if, interpreted as a pointer, it would point into an allocated

object. Conservative collectors may produce false positives, where unused memory is not

released because of improper pointer identification. This is not always a problem in


practice unless the program handles a lot of data that could easily be misidentified as a

pointer. False positives are generally less problematic on 64-bit systems than on 32-bit

systems because the range of valid memory addresses tends to be a tiny fraction of the

range of 64-bit values. Thus, an arbitrary 64-bit pattern is unlikely to mimic a valid

pointer. Whether a precise collector is practical usually depends on the type safety

properties of the programming language in question. An example for which a

conservative garbage collector would be needed is the C language, which allows typed

(non-void) pointers to be type cast into untyped (void) pointers, and vice versa.

A related issue concerns internal pointers, or pointers to fields within an object. If the

semantics of a language allow internal pointers, then there may be many different

addresses that can refer to parts of the same object, which complicates determining

whether an object is garbage or not. An example for this is the C++ language, in which

multiple inheritance can cause pointers to base objects to have different addresses. In a

tightly optimized program, the corresponding pointer to the object itself may have been

overwritten in its register, so such internal pointers need to be scanned.

5.5 PERFORMANCE IMPLICATIONS

Tracing garbage collectors require some implicit runtime overhead that may be

beyond the control of the programmer, and can sometimes lead to performance problems.

For example, commonly used stop-the-world garbage collectors, which pause program

execution at arbitrary times, may make garbage collection inappropriate for some

embedded systems, high-performance server software, and applications with real-time

needs.

It is difficult to compare the two cases directly, as their behavior depends on the situation.

For example, in the best case for a garbage collecting system, allocation just increments a

pointer, but in the best case for manual heap allocation, the allocator maintains freelists of

specific sizes and allocation only requires following a pointer. However, this size

segregation usually causes a large degree of external fragmentation, which can have an

adverse impact on cache behavior. Memory allocation in a garbage collected language


may be implemented using heap allocation behind the scenes (rather than simply

incrementing a pointer), so the performance advantages listed above don't necessarily

apply in this case. In some situations, most notably embedded systems, it is possible to

avoid both garbage collection and heap management overhead by preallocating pools of

memory and using a custom, lightweight scheme for allocation/deallocation.

The overhead of write barriers is more likely to be noticeable in an imperative-style

program which frequently writes pointers into existing data structures than in a

functional-style program which constructs data only once and never changes it.

Some advances in garbage collection can be understood as reactions to performance

issues. Early collectors were stop-the-world collectors, but the performance of this

approach was distracting in interactive applications. Incremental collection avoided this

disruption, but at the cost of decreased efficiency due to the need for barriers.

Generational collection techniques are used with both stop-the-world and incremental

collectors to increase performance; the trade-off is that some garbage is not detected as

such for longer than normal.

5.5.1 Determinism

Tracing garbage collection is not deterministic in the timing of object finalization.

An object which becomes eligible for garbage collection will usually be cleaned up

eventually, but there is no guarantee when (or even if) that will happen. This is an issue

for program correctness when objects are tied to non-memory resources, whose release is

an externally visible program behavior, such as closing a network connection, releasing a

device or closing a file. One garbage collection technique which provides determinism in

this regard is reference counting.

Garbage collection can have a nondeterministic impact on execution time, by potentially

introducing pauses into the execution of a program which are not correlated with the

algorithm being processed. Under tracing garbage collection, the request to allocate a

new object can sometimes return quickly and at other times trigger a lengthy garbage

collection cycle. Under reference counting, whereas allocation of objects is usually fast,


decrementing a reference is nondeterministic, since a reference may reach zero, triggering

recursion to decrement the reference counts of other objects which that object holds.

5.5.2 Real-time garbage collection

While garbage collection is generally nondeterministic, it is possible to use it in

hard real-time systems. A real-time garbage collector should guarantee that even in the

worst case it will dedicate a certain number of computational resources to mutator

threads. Constraints imposed on a real-time garbage collector are usually either work

based or time based. A time based constraint would look like: within each time window

of duration T, mutator threads should be allowed to run at least for Tm time. For work

based analysis, MMU (minimal mutator utilization) is usually used as a real time

constraint for the garbage collection algorithm.

One of the first implementations of real-time garbage collection for the JVM was work

on the Metronome algorithm. There are other commercial implementations.

5.5.3 Reference counting

Reference counting is a form of garbage collection whereby each object has a

count of the number of references to it. Garbage is identified by having a reference count

of zero. An object's reference count is incremented when a reference to it is created, and

decremented when a reference is destroyed. The object's memory is reclaimed when the

count reaches zero.

Compared to tracing garbage collection, reference counting guarantees that objects are

destroyed as soon as they become unreachable (assuming that there are no reference

cycles), and usually only accesses memory which is either in CPU caches, in objects to

be freed, or directly pointed by those, and thus tends to not have significant negative side

effects on CPU cache and virtual memory operation.

There are some disadvantages to reference counting:

If two or more objects refer to each other, they can create a cycle whereby neither will be

collected as their mutual references never let their reference counts become zero. Some

garbage collection systems using reference counting (like the one in CPython) use

specific cycle-detecting algorithms to deal with this issue.
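The cycle problem just described can be seen in a few lines of Python (a toy sketch; the class RC and the functions incref and decref are invented names, not a real API):

class RC:
    """Toy reference-counted object."""
    def __init__(self, name):
        self.name = name
        self.count = 0
        self.refs = []

def incref(obj):
    obj.count += 1

def decref(obj):
    obj.count -= 1
    if obj.count == 0:
        print("freeing", obj.name)
        for child in obj.refs:           # release everything this object points to
            decref(child)

# Acyclic case: dropping the last reference to a also frees b.
a, b = RC("a"), RC("b")
incref(a)                                # an external reference to a
a.refs.append(b); incref(b)              # a -> b
decref(a)                                # prints: freeing a, freeing b

# Cyclic case: x and y point at each other, so neither count ever reaches zero.
x, y = RC("x"), RC("y")
incref(x); incref(y)                     # external references
x.refs.append(y); incref(y)              # x -> y
y.refs.append(x); incref(x)              # y -> x
decref(x); decref(y)                     # external references gone...
print(x.count, y.count)                  # 1 1  -> the cycle is leaked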


Another strategy is to use weak references for the "backpointers" which create cycles.

Under reference counting, a weak reference is similar to a weak reference under a tracing

garbage collector. It is a special reference object whose existence does not increment the

reference count of the referent object. Furthermore, a weak reference is safe in that when

the referent object becomes garbage, any weak reference to it lapses, rather than being

permitted to remain dangling, meaning that it turns into a predictable value, such as a null

reference.

In naive implementations, each assignment of a reference and each reference falling out

of scope often require modifications of one or more reference counters. However, in the

common case, when a reference is copied from an outer scope variable into an inner

scope variable, such that the lifetime of the inner variable is bounded by the lifetime of

the outer one, the reference incrementing can be eliminated. The outer variable "owns"

the reference. In the programming language C++, this technique is readily implemented

and demonstrated with the use of const references. Reference counting in C++ is usually

implemented using "smart pointers" whose constructors, destructors and assignment

operators manage the references. A smart pointer can be passed by reference to a

function, which avoids the need to copy-construct a new reference (which would increase

the reference count on entry into the function and decrease it on exit). Instead the

function receives a reference to the smart pointer which is produced inexpensively.

When used in a multithreaded environment, these modifications (increment and

decrement) may need to be atomic operations such as compare-and-swap, at least for any

objects which are shared, or potentially shared among multiple threads. Atomic

operations are expensive on a multiprocessor, and even more expensive if they have to be

emulated with software algorithms. It is possible to avoid this issue by adding per-thread

or per-CPU reference counts and only accessing the global reference count when the

local reference counts become or are no longer zero (or, alternatively, using a binary tree

of reference counts, or even giving up deterministic destruction in exchange for not

having a global reference count at all), but this adds significant memory overhead and


thus tends to be only useful in special cases (it's used, for example, in the reference

counting of Linux kernel modules).

Naive implementations of reference counting do not in general provide real-time

behavior, because any pointer assignment can potentially cause a number of objects

bounded only by total allocated memory size to be recursively freed while the thread is

unable to perform other work. It is possible to avoid this issue by delegating the freeing

of objects whose reference count dropped to zero to other threads, at the cost of extra

overhead.

5.5.4 Escape analysis

Escape analysis can be used to convert heap allocations to stack allocations, thus

reducing the amount of work needed to be done by the garbage collector.

5.5.5 Compile-time

Compile-time garbage collection is a form of static analysis allowing memory to

be reused and reclaimed based on invariants known during compilation. This form of

garbage collection has been studied in the Mercury programming language

5.5.6 Availability

Generally speaking, higher-level programming languages are more likely to have

garbage collection as a standard feature. In languages that do not have built in garbage

collection, it can often be added through a library, as with the Boehm garbage collector

for C and C++. This approach is not without drawbacks, such as changing object creation

and destruction mechanisms.

Most functional programming languages, such as ML, Haskell, and APL, have garbage

collection built in. Lisp, which introduced functional programming, is especially notable

for introducing this mechanism.

Other dynamic languages, such as Ruby (but not Perl 5, or PHP, which use reference

counting), also tend to use GC. Object-oriented programming languages such as

Smalltalk, Java and ECMAScript usually provide integrated garbage collection. Notable

exceptions are C++ and Delphi which have destructors. Objective-C has not traditionally


had it, but ObjC 2.0 as implemented by Apple for Mac OS X uses a runtime collector

developed in-house, while the GNUstep project uses a Boehm collector.

Historically, languages intended for beginners, such as BASIC and Logo, have often used

garbage collection for heap-allocated variable-length data types, such as strings and lists,

so as not to burden programmers with manual memory management. On early

microcomputers, with their limited memory and slow processors, BASIC garbage

collection could often cause apparently random, inexplicable pauses in the midst of

program operation.

Some BASIC interpreters, such as Applesoft BASIC on the Apple II family, repeatedly

scanned the string descriptors for the string having the highest address in order to

compact it toward high memory, resulting in O(N*N) performance, which could

introduce minutes-long pauses in the execution of string-intensive programs. A

replacement garbage collector for Applesoft BASIC published in Call-A.P.P.L.E.

(January 1981, pages 40–45, Randy Wigginton) identified a group of strings in every

pass over the heap, which cut collection time dramatically. BASIC.System, released with

ProDOS in 1983, provided a windowing garbage collector for BASIC that reduced most

collections to a fraction of a second.

5.6 LIMITED ENVIRONMENTS

Garbage collection is rarely used on embedded or real-time systems because of the

perceived need for very tight control over the use of limited resources. However, garbage

collectors compatible with such limited environments have been developed.

The Microsoft .NET Micro Framework and Java Platform, Micro Edition are embedded

software platforms that, like their larger cousins, include garbage collection.


MODULE SIX

HASH FUNCTION

A hash function is any algorithm or subroutine that maps large data sets of variable

length to smaller data sets of a fixed length. For example, a person's name, having a

variable length, could be hashed to a single integer. The values returned by a hash

function are called hash values, hash codes, hash sums, checksums or simply hashes.

6.1 DESCRIPTIONS

Hash functions are mostly used to accelerate table lookup or data comparison

tasks such as finding items in a database, detecting duplicated or similar records in a

large file, finding similar stretches in DNA sequences, and so on.

A hash function should be referentially transparent (stable), i.e., if called twice on

input that is "equal" (for example, strings that consist of the same sequence of

characters), it should give the same result. This is a contract in many programming

languages that allow the user to override equality and hash functions for an object: if two

objects are equal, their hash codes must be the same. This is crucial to finding an element

in a hash table quickly, because two of the same element would both hash to the same

slot.

All hash functions that map a larger set of data to a smaller set of data cause

collisions. Such hash functions try to map the keys to the hash values as evenly as

possible, because collisions become more frequent as hash tables fill up. In practice, hash tables are therefore often kept no more than about 80% full. Depending on the collision-resolution scheme used (such as double hashing or linear probing), other properties may be required as well. Although the idea was conceived in the 1950s, the design of good hash

functions is still a topic of active research.

Hash functions are related to (and often confused with) checksums, check digits,

fingerprints, randomization functions, error correcting codes, and cryptographic hash

functions. Although these concepts overlap to some extent, each has its own uses and


requirements and is designed and optimized differently. The HashKeeper database

maintained by the American National Drug Intelligence Center, for instance, is more

aptly described as a catalog of file fingerprints than of hash values.

6.2 HASH TABLES

Hash functions are primarily used in hash tables, to quickly locate a data record

(e.g., a dictionary definition) given its search key (the headword). Specifically, the hash

function is used to map the search key to an index; the index gives the place in the hash

table where the corresponding record should be stored. Hash tables, in turn, are used to

implement associative arrays and dynamic sets.

Typically, the domain of a hash function (the set of possible keys) is larger than its

range (the number of different table indexes), and so it will map several different keys to

the same index. Therefore, each slot of a hash table is associated with (implicitly or

explicitly) a set of records, rather than a single record. For this reason, each slot of a hash

table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record's location—it tells where one

should start looking for it. Still, in a half-full table, a good hash function will typically

narrow the search down to only one or two entries.

Caches

Hash functions are also used to build caches for large data sets stored in slow media. A

cache is generally simpler than a hashed search table, since any collision can be resolved

by discarding or writing back the older of the two colliding items. This is also used in file

comparison.

Bloom filters

Hash functions are an essential ingredient of the Bloom filter, a space-efficient

probabilistic data structure that is used to test whether an element is a member of a set.

Finding duplicate records

When storing records in a large unsorted file, one may use a hash function to map each

record to an index into a table T, and collect in each bucket T[i] a list of the numbers of


all records with the same hash value i. Once the table is complete, any two duplicate

records will end up in the same bucket. The duplicates can then be found by scanning

every bucket T[i] which contains two or more members, fetching those records, and

comparing them. With a table of appropriate size, this method is likely to be much faster

than any alternative approach (such as sorting the file and comparing all consecutive

pairs).
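The bucket-scanning idea above can be written down in a short Python sketch (find_duplicates and the use of the built-in hash are illustrative choices, not a prescribed method):

from collections import defaultdict

def find_duplicates(records, hash_fn, table_size):
    """Group record numbers by hash value, then compare only within buckets."""
    table = defaultdict(list)                    # T[i] -> list of record numbers
    for i, rec in enumerate(records):
        table[hash_fn(rec) % table_size].append(i)

    duplicates = []
    for bucket in table.values():
        if len(bucket) < 2:
            continue                             # nothing to compare in this bucket
        for a in range(len(bucket)):
            for b in range(a + 1, len(bucket)):
                if records[bucket[a]] == records[bucket[b]]:
                    duplicates.append((bucket[a], bucket[b]))
    return duplicates

print(find_duplicates(["ann", "bob", "ann", "eve"], hash, 8))   # [(0, 2)]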

Finding similar records

Hash functions can also be used to locate table records whose key is similar, but not

identical, to a given key; or pairs of records in a large file which have similar keys. For

that purpose, one needs a hash function that maps similar keys to hash values that differ

by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record

numbers, using such a hash function, then similar records will end up in the same bucket,

or in nearby buckets. Then one need only check the records in each bucket T[i] against

those in buckets T[i+k] where k ranges between −m and m.

This class includes the so-called acoustic fingerprint algorithms, that are used to locate

similar-sounding entries in large collection of audio files. For this application, the hash

function must be as insensitive as possible to data capture or transmission errors, and to

"trivial" changes such as timing and volume changes, compression, etc.

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of

strings, such as a document repository or a genomic database. In this case, the input

strings are broken into many small pieces, and a hash function is used to detect

potentially equal pieces, as above.

The Rabin–Karp algorithm is a relatively fast string searching algorithm that works in

O(n) time on average. It is based on the use of hashing to compare strings.

6.2.1 GEOMETRIC HASHING

This principle is widely used in computer graphics, computational geometry and

many other disciplines, to solve many proximity problems in the plane or in three-


dimensional space, such as finding closest pairs in a set of points, similar shapes in a list

of shapes, similar images in an image database, and so on. In these applications, the set of

all inputs is some sort of metric space, and the hashing function can be interpreted as a

partition of that space into a grid of cells. The table is often an array with two or more

indices (called a grid file, grid index, bucket grid, and similar names), and the hash

function returns an index tuple. This special case of hashing is known as geometric

hashing or the grid method. Geometric hashing is also used in telecommunications

(usually under the name vector quantization) to encode and compress multi-dimensional

signals.

6.2.3 PROPERTIES

Good hash functions, in the original sense of the term, are usually required to

satisfy certain properties listed below. Note that different requirements apply to the other

related concepts (cryptographic hash functions, checksums, etc.).

6.3 DETERMINISM

A hash procedure must be deterministic—meaning that for a given input value it

must always generate the same hash value. In other words, it must be a function of the

data to be hashed, in the mathematical sense of the term. This requirement excludes hash

functions that depend on external variable parameters, such as pseudo-random number

generators or the time of day. It also excludes functions that depend on the memory

address of the object being hashed, because that address may change during execution

(as may happen on systems that use certain methods of garbage collection), although

sometimes rehashing of the item is possible.

6.4 UNIFORMITY

A good hash function should map the expected inputs as evenly as possible over

its output range. That is, every hash value in the output range should be generated with

roughly the same probability. The reason for this last requirement is that the cost of

hashing-based methods goes up sharply as the number of collisions—pairs of inputs that


are mapped to the same hash value—increases. Basically, if some hash values are more

likely to occur than others, a larger fraction of the lookup operations will have to search

through a larger set of colliding table entries.

Note that this criterion only requires the value to be uniformly distributed, not random in

any sense. A good randomizing function is (barring computational efficiency concerns)

generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club

membership list may contain only a hundred or so member names, out of the very large

set of all possible names. In these cases, the uniformity criterion should hold for almost

all typical subsets of entries that may be found in the table, not just for the global set of

all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a

bucket receiving many more than m/n records should be vanishingly small. In particular,

if m is less than n, very few buckets should have more than one or two records. (In an

ideal "perfect hash function", no bucket should have more than one record; but a small

number of collisions is virtually inevitable, even if n is much larger than m – see the

birthday paradox).

When testing a hash function, the uniformity of the distribution of hash values can be

evaluated by the chi-squared test.

6.5 VARIABLE RANGE

In many applications, the range of hash values may be different for each run of the

program, or may change along the same run (for instance, when a hash table needs to be

expanded). In those situations, one needs a hash function which takes two parameters—

the input data z, and the number n of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say, 0 to

2^32 − 1), divide the result by n, and use the division's remainder. If n is itself a power of

2, this can be done by bit masking and bit shifting. When this approach is used, the hash


function must be chosen so that the result has fairly uniform distribution between 0 and n

− 1, for any value of n that may occur in the application. Depending on the function, the

remainder may be uniform only for certain values of n, e.g. odd or prime numbers.

We can allow the table size n to not be a power of 2 and still not have to perform any

remainder or division operation, as these computations are sometimes costly.

For example, let n be significantly less than 2^b. Consider a pseudo-random number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b − 1]. A hash function uniform on the interval [0, n − 1] is n · P(key) / 2^b. We can replace the division by a (possibly faster) right bit shift: (n · P(key)) >> b.
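A small Python sketch of these range-reduction options follows (CRC32 is used here purely as a stand-in for "some fixed hash P(key) uniform on [0, 2^b − 1]"; the function names are invented for the example):

import zlib

def big_hash(key: bytes) -> int:
    """A fixed hash with a large range [0, 2**32 - 1]."""
    return zlib.crc32(key)

def by_remainder(key, n):
    return big_hash(key) % n              # works for any n

def by_masking(key, n):
    assert n & (n - 1) == 0               # requires n to be a power of two
    return big_hash(key) & (n - 1)        # same as % n, using a bit mask

def by_scaling(key, n, b=32):
    return (n * big_hash(key)) >> b       # n * P(key) / 2**b without a division

for f in (by_remainder, by_masking, by_scaling):
    print(f.__name__, f(b"example key", 16))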

6.5.1 Variable Range with Minimal Movement (Dynamic Hash Function)

When the hash function is used to store values in a hash table that outlives the run

of the program, and the hash table needs to be expanded or shrunk, the hash table is

referred to as a dynamic hash table.

A hash function that will relocate the minimum number of records when the table is

resized is desirable. What is needed is a hash function H(z,n) – where z is the key being

hashed and n is the number of allowed hash values – such that H(z,n + 1) = H(z,n) with

probability close to n/(n + 1).

Linear hashing and spiral storage are examples of dynamic hash functions that execute in

constant time but relax the property of uniformity to achieve the minimal movement

property.

Extendible hashing uses a dynamic hash function that requires space proportional to n to

compute the hash function, and it becomes a function of the previous keys that have been

inserted.

Several algorithms that preserve the uniformity property but require time proportional to

n to compute the value of H(z,n) have been invented.

Data normalization

In some applications, the input data may contain features that are irrelevant for

comparison purposes. For example, when looking up a personal name, it may be


desirable to ignore the distinction between upper and lower case letters. For such data,

one must use a hash function that is compatible with the data equivalence criterion being

used: that is, any two inputs that are considered equivalent must yield the same hash

value. This can be accomplished by normalizing the input before hashing it, as by upper-

casing all letters.

Continuity

A hash function that is used to search for similar (as opposed to equivalent) data must be

as continuous as possible; two inputs that differ by a little should be mapped to equal or

nearly equal hash values.

Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash

functions, and other related concepts. Continuity is desirable for hash functions only in

some applications, such as hash tables that use linear search.

Hash function algorithms

For most types of hashing functions the choice of the function depends strongly on the

nature of the input data, and their probability distribution in the intended application.

Trivial hash function

If the datum to be hashed is small enough, one can use the datum itself (reinterpreted as

an integer in binary notation) as the hashed value. The cost of computing this "trivial"

(identity) hash function is effectively zero. This hash function is perfect, as it maps each

input to a distinct hash value.

The meaning of "small enough" depends on the size of the type that is used as the hashed

value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer

Integer and 32-bit floating-point Float objects can simply use the value directly; whereas

the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping

character strings between upper and lower case, one can use the binary encoding of each

character, interpreted as an integer, to index a table that gives the alternative form of that


character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in ASCII

or ISO Latin 1), the table has only 2^8 = 256 entries; in the case of Unicode characters, the table would have 17 × 2^16 = 1,114,112 entries.

The same technique can be used to map two-letter country codes like "us" or "za" to

country names (26^2 = 676 table entries), 5-digit zip codes like 13083 to city names

(100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code

00000) may be left undefined in the table, or mapped to some appropriate "null" value.
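As a concrete illustration of the two-letter-code case, a tiny Python sketch (code_index and the sample entries are invented for this example):

def code_index(code: str) -> int:
    """Map a two-letter lowercase code to a unique slot in a 26*26 = 676 table."""
    a, b = ord(code[0]) - ord('a'), ord(code[1]) - ord('a')
    return a * 26 + b

table = [None] * 676                      # the 676-entry lookup table
table[code_index("us")] = "United States"
table[code_index("za")] = "South Africa"

print(code_index("us"), table[code_index("za")])   # 538 South Africa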

Perfect hashing

[Figure: a perfect hash function for four names]

A hash function that is injective—that is, maps each valid input to a different hash value

—is said to be perfect. With such a function one can directly locate the desired entry in a

hash table, without any additional searching.

Minimal perfect hashing

[Figure: a minimal perfect hash function for the same four names]

A perfect hash function for n keys is said to be minimal if its range consists of n

consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a

minimal perfect hash function also yields a compact hash table, without any vacant slots.

Minimal perfect hash functions are much harder to find than perfect ones with a wider

range.

Hashing uniformly distributed data

If the inputs are bounded-length strings (such as telephone numbers, car license plates,

invoice numbers, etc.), and each input may independently occur with uniform probability,

then a hash function need only map roughly the same number of inputs to each hash

value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the

output must be an integer h in the range 0 to n−1, where N is much larger than n. Then


the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷

N (the value z scaled down by n/N and truncated to an integer), or many other formulas.

Warning: h = z mod n was used in many of the original random number generators, but was found to have a number of issues, one of which is that, as n approaches N, this function becomes less and less uniform.

Hashing data with other distributions

These simple formulas will not do if the input values are not equally likely, or are not

independent. For instance, most patrons of a supermarket will live in the same geographic

area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that

case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n,

which is quite sensitive to the trailing digits, may still yield a fairly even distribution.

Hashing variable-length data

When the data values are long (or variable-length) character strings—such as personal

names, web page addresses, or mail messages—their distribution is usually very uneven,

with complicated dependencies. For example, text in any natural language has highly

non-uniform distributions of characters, and character pairs, very characteristic of the

language. For such data, it is prudent to use a hash function that depends on all characters

of the string—and depends on each character in a different way.

In cryptographic hash functions, a Merkle–Damgård construction is usually used. In

general, the scheme for hashing such data is to break the input into a sequence of small

units (bits, bytes, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially,

as follows

S ← S0; // Initialize the state.

for k in 1, 2, ..., m do // Scan the input data units:

S ← F(S, b[k]); // Combine data unit k into the state.

return G(S, n) // Extract the hash value from the state.


This schema is also used in many text checksum and fingerprint algorithms. The state

variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S,n)

can be just S mod n. The best choice of F is a complex issue and depends on the nature of

the data.

If the units b[k] are single bits, then F(S,b) could be, for instance

if highbit(S) = 0 then
    return 2 * S + b
else
    return (2 * S + b) ^ P

Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned

integer multiplication with lost overflow; '^' is the bitwise exclusive or operation applied

to words; and P is a suitable fixed word.
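The general unit-by-unit scheme above can be written as a short Python sketch (the mixing step F chosen here, a multiply-and-add, is only illustrative; real designs choose F with much more care):

def hash_bytes(data: bytes, n: int, s0: int = 0) -> int:
    """Combine the units of the input into a running state S, then reduce mod n."""
    MASK = 0xFFFFFFFF                     # keep S a 32-bit unsigned value
    s = s0
    for b in data:                        # scan the input one byte at a time
        s = ((s * 31) + b) & MASK         # F(S, b): multiply-and-add mixing
    return s % n                          # G(S, n): extract a value in [0, n-1]

print(hash_bytes(b"hello, world", 97))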

Special-purpose hash functions

In many cases, one can design a special-purpose (heuristic) hash function that yields

many fewer collisions than a good general-purpose hash function. For example, suppose

that the input data are file names such as FILE0000.CHK, FILE0001.CHK,

FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that

extracts the numeric part k of the file name and returns k mod n would be nearly optimal.

Needless to say, a function that is exceptionally good for a specific kind of data may have

dismal performance on data with different distribution.

Rolling hash

In some applications, such as substring search, one must compute a hash function h for

every k-character substring of a given n-character string t; where k is a fixed integer, and

n > k. The straightforward solution, which is to extract every such substring s of t and

compute h(s) separately, requires a number of operations proportional to k·n. However,


with the proper choice of h, one can use the technique of rolling hash to compute all those

hashes with an effort proportional to k + n.
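A minimal Python sketch of a polynomial rolling hash in the Rabin–Karp style follows (the base and modulus constants are arbitrary choices for illustration):

def rolling_hashes(t: str, k: int, base: int = 256, mod: int = 1_000_003):
    """Yield the hash of every k-character substring of t in O(n + k) total work."""
    if len(t) < k:
        return
    h = 0
    for ch in t[:k]:                          # hash of the first window
        h = (h * base + ord(ch)) % mod
    yield h
    top = pow(base, k - 1, mod)               # weight of the character leaving the window
    for i in range(k, len(t)):
        h = (h - ord(t[i - k]) * top) % mod   # remove the outgoing character
        h = (h * base + ord(t[i])) % mod      # bring in the incoming character
        yield h

print(list(rolling_hashes("abcabc", 3)))      # equal substrings give equal hashes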

Universal hashing

A universal hashing scheme is a randomized algorithm that selects a hashing function h

among a family of such functions, in such a way that the probability of a collision of any

two distinct keys is 1/n, where n is the number of distinct hash values desired—

independently of the two keys. Universal hashing ensures (in a probabilistic sense) that

the hash function application will behave as well as if it were using a random function,

for any distribution of the input data. It will however have more collisions than perfect

hashing, and may require more operations than a special-purpose hash function.
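One standard universal family for integer keys is h_{a,b}(x) = ((a·x + b) mod p) mod n with a and b chosen at random; a minimal Python sketch, assuming keys are non-negative integers smaller than the prime P:

import random

P = 2_147_483_647                     # a prime larger than any key we expect

def make_universal_hash(n: int):
    """Pick a random member h_{a,b}(x) = ((a*x + b) mod P) mod n of the family."""
    a = random.randrange(1, P)        # a must be non-zero
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % n

h = make_universal_hash(16)
print(h(42), h(43), h(42) == h(42))   # the same key always hashes the same way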

Hashing with checksum functions

One can adapt certain checksum or fingerprinting algorithms for use as hash functions.

Some of those algorithms will map arbitrarily long string data z, with any typical real-

world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit

string, from which one can extract a hash value in 0 through n − 1.

This method may produce a sufficiently uniform distribution of hash values, as long as

the hash range size n is small compared to the range of the checksum or fingerprint

function. However, some checksums fare poorly in the avalanche test, which may be a

concern in some applications. In particular, the popular CRC32 checksum provides only

16 bits (the higher half of the result) that are usable for hashing.

Moreover, each bit of the input has a deterministic effect on each bit of the CRC32; that is, one can tell, without looking at the rest of the input, which bits of the output will flip if

the input bit is flipped; so care must be taken to use all 32 bits when computing the hash

from the checksum.
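A small Python sketch of using a checksum as a hash follows; the step that folds the high half into the low half is only an illustration of "use all 32 bits", not a standard recipe:

import zlib

def checksum_hash(data: bytes, n: int) -> int:
    """Derive a table index from a CRC32 checksum."""
    c = zlib.crc32(data)
    mixed = (c ^ (c >> 16)) & 0xFFFFFFFF   # fold the high half into the low half
    return mixed % n

print(checksum_hash(b"some record", 101))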


Hashing with cryptographic hash functions

Some cryptographic hash functions, such as SHA-1, have even stronger uniformity

guarantees than checksums or fingerprints, and thus can provide very good general-

purpose hashing functions.

In ordinary applications, this advantage may be too small to offset their much higher cost.

However, this method can provide uniformly distributed hashes even when the keys are

chosen by a malicious agent. This feature may help to protect services against denial of

service attacks.

Hashing By Nonlinear Table Lookup

Tables of random numbers (such as 256 random 32 bit integers) can provide high-quality

nonlinear functions to be used as hash functions or for other purposes such as

cryptography. The key to be hashed would be split into 8-bit (one byte) parts and each

part will be used as an index for the nonlinear table. The table values will be added by

arithmetic or XOR addition to the hash output value. Because the table is just 1024 bytes

in size, it will fit into the cache of modern microprocessors and allow for very fast

execution of the hashing algorithm. As each table value is much wider than 8 bits, one bit of input will affect nearly all output bits. This is different from multiplicative

hash functions where higher-value input bits do not affect lower-value output bits.

This algorithm has proven to be very fast and of high quality for hashing purposes

(especially hashing of integer number keys).
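A toy Python sketch of the table-lookup idea: 256 random 32-bit entries are indexed by each key byte and combined with XOR. The left-rotation is an extra step not described above, added here so that byte order also affects the result:

import random

random.seed(1)                                         # fixed seed so runs are repeatable
TABLE = [random.getrandbits(32) for _ in range(256)]   # 256 random 32-bit integers

def table_hash(key: bytes) -> int:
    """Split the key into 8-bit parts and combine the table entries they select."""
    h = 0
    for b in key:
        h = ((h << 5) | (h >> 27)) & 0xFFFFFFFF        # rotate so position matters
        h ^= TABLE[b]                                  # XOR in the table entry for this byte
    return h

print(hex(table_hash(b"integer keys hash well")))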

Efficient Hashing Of Strings

Modern microprocessors will allow for much faster processing if 8-bit character strings

are not hashed by processing one character at a time, but by interpreting the string as an

array of 32 bit or 64 bit integers and hashing/accumulating these "wide word" integer

values by means of arithmetic operations (e.g. multiplication by constant and bit-


shifting). The remaining characters of the string which are smaller than the word length

of the CPU must be handled differently (e.g. being processed one character at a time).

This approach has proven to speed up hash code generation by a factor of five or more on

modern microprocessors of a word size of 64 bit.

A far better approach for converting strings to a numeric value that avoids the problem

with some strings having great similarity ("Aaaaaaaaaa" and "Aaaaaaaaab") is to use a

Cyclic redundancy check (CRC) of the string to compute a 32- or 64-bit value.


MODULE SEVEN

HASH CODING AND HASH TABLE

Hashing is a method of storing records according to their key values. It provides access to

stored records in constant time, O(1), so it is comparable to B-trees in searching speed.

Therefore, hash tables are used for:

a) Storing a file record by record.

b) Searching for records with certain key values.

In hash tables, the main idea is to distribute the records uniquely on a table, according to

their key values. We take the key and we use a function to map the key into one location

of the array: f(key)=h, where h is the hash address of that record in the hash table.

If the size of the table is n, say array [1..n], we have to find a function which will give

numbers between 1 and n only.

Each entry of the table is called a bucket. In general, one bucket may contain more than

one (say r) records. In our discussions we shall assume r=1 and each bucket holds exactly

one record.

7.1. DEFINITIONS

Key density: the ratio of the number of key values actually in use to the total number of possible key values.
Synonyms: two key values key1 and key2 are synonyms with respect to f if f(key1)=f(key2).

Synonyms are entered into the same bucket if r>1 and there is space in that bucket.

When a key is mapped by f into a full bucket this is an overflow.

When two non-identical keys are mapped into the same bucket, this is a collision.

The hash function f:

a) Must be easy to compute,

b) Must be a uniform hash function. (a random key value should have an equal chance of

hashing into any of the n buckets.)

c) Should minimize the number of collisions.


Some hash functions used in practical applications :

1) f(key) = key mod n can be a hash function; however, n should not be a power of 2, and ideally n should be a prime number.

2) Ex-or'ing the first and the last m bits of the key:

Notice that the hash table will now have a size n = 2^m, which is a power of 2.

3) Mid-squaring:

a) take the square of the key.

b) then use m bits from the middle of the square to compute the hash address.

4) Folding:

The key is partitioned into several parts. All except the last part have the same length.

These parts are added together to obtain the hash address for the key. There are two ways

of doing this addition.

a) Add the parts directly

b) Fold at the boundaries.

Example:

key = 12320324111220, part length = 3

parts:  123 | 203 | 241 | 112 | 20
         P1    P2    P3    P4   P5

a) shift folding:          b) folding at the boundaries:
     123                        123
     203                        302   (P2 reversed)
     241                        241
     112                        211   (P4 reversed)
  +   20                     +   20
  ------                     ------
     699                        897
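The folding method above can be expressed as a short Python sketch (the function name fold and its parameters are invented for this example); it reproduces the two sums from the worked example:

def fold(key: str, part_len: int, at_boundaries: bool = False) -> int:
    """Split the key into parts of part_len digits and add them up.
    With at_boundaries=True, every second part is reversed before adding."""
    parts = [key[i:i + part_len] for i in range(0, len(key), part_len)]
    total = 0
    for i, p in enumerate(parts):
        if at_boundaries and i % 2 == 1:
            p = p[::-1]                    # fold: reverse the alternate parts
        total += int(p)
    return total

print(fold("12320324111220", 3))           # 699  (shift folding)
print(fold("12320324111220", 3, True))     # 897  (folding at the boundaries)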

Handling Collisions - Overflows:

Consider r=1, so there is one slot per bucket. All slots must be initialized to 'empty' (for instance, zero or minus one may denote empty).


1) Linear probing:

- When we reach the end of the table, we go back to location 0.

- Finding the first empty location will sometimes take a lot of time.

- Also, in searching for a specific key value, we have to continue the search until we find

an empty location, if that key value is not found at the calculated hash address.
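A minimal Python sketch of linear probing follows (one slot per bucket; None plays the role of the 'empty' marker, and the sketch assumes the table is never completely full):

N = 11
table = [None] * N                         # one slot per bucket, all empty

def h(key):
    return key % N

def insert(key):
    i = h(key)
    while table[i] is not None:            # probe forward until an empty slot
        i = (i + 1) % N                    # wrap around at the end of the table
    table[i] = key

def search(key):
    i = h(key)
    while table[i] is not None:            # stop at the first empty slot
        if table[i] == key:
            return i
        i = (i + 1) % N
    return -1                              # not found

for k in (3, 14, 25):                      # 14 and 25 collide with 3 (all hash to 3)
    insert(k)
print(search(25), search(99))              # 5 -1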

2) Random probing

When there is a collision, we start a (pseudo) random number generator.

For example;

f(key1)=3

f(key2)=3 --> collision

Then, start the pseudo random number generator and get a number, say 7. Add 3+7=10

and store key2 at location 10.

The pseudo-random number i is generated by using the hash address that causes the

collision. It should generate numbers between 1 and n and it should not repeat a number

before all the numbers between 1 and n are generated exactly once.

In searching, given the same hash address, for example 3, it will give us the same number

7, so key2 shall be found at location 10.

We carry out the search until:

a) We find the key in the table,

b) Or, until we find an empty bucket, (unsuccessful termination)

c) Or, until we search the table for one sequence and the random number repeats.

(unsuccessful termination, table is full)

3) Chaining

We modify entries of the hash table to hold a key part (and the record) and a link part.

When there is a collision, we put the second key to any empty place and set the link part

of the first key to point to the second one. Additional storage is needed for link fields.

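A short Python sketch of the chaining idea follows; as a simplification, each bucket here holds its own list of colliding keys, rather than the key-plus-link layout inside the table described above:

N = 11
buckets = [[] for _ in range(N)]           # each bucket holds a chain of keys

def insert(key):
    buckets[key % N].append(key)           # colliding keys join the same chain

def search(key):
    return key in buckets[key % N]         # walk only one chain, not the whole table

for k in (3, 14, 25):                      # all three hash to bucket 3
    insert(k)
print(buckets[3], search(14), search(99))  # [3, 14, 25] True False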


4) Chaining with overflow

In this method, we use extra space for colliding items.

f(key1)=3 goes into bucket 3

f(key2)=3 collision, goes into the overflow area

5) Rehashing:

Use a series of hash functions. If there is a collision, take the second hash function and

hash again, etc... The probability that two key values will map to the same address with

two different hash functions is very low.

Average number of probes (AVP) calculation:

Calculate the probability of collisions, then the expected number of collisions, then

average.

To delete a key, we have to put a special 'deleted' marker into its bucket rather than simply marking it empty, because there might have been collisions, and setting the bucket to empty would break the search chain. However, we then waste some locations: the load factor (LF) increases and the AVP increases. We cannot simply increase the hash table size, since the hash function will generate values between 1 and n (or 0 and n-1).

Using an overflow area is one solution.


MODULE EIGHT

RECURSIVE PROGRAMMING

Recursive programming is a powerful technique that can greatly simplify some

programming tasks. In summary, recursive programming is the situation in which a

procedure calls itself, passing in a modified value of the parameter(s) that was passed in

to the current iteration of the procedure. Typically, a recursive programming environment

contains (at least) two procedures: first, a procedure to set up the initial environment and

make the initial call to the recursive procedure, and second, the recursive procedure itself

that calls itself one or more times.

Let's begin with a simple example. The Factorial of a number N is the product of all the

integers between 1 and N. The factorial of 5 is equal to 5 * 4 * 3 * 2 * 1 = 120. In the real

world you would not likely use a recursive procedure for this, but it will serve as a simple

yet illustrative example. The first procedure is namedDoFact sets things up, calls the Fact

function and displays the result.

Sub DoFact()
    Dim L As Long
    Dim N As Long
    N = 5                           ' 5 matches the worked example above (5! = 120)
    L = Fact(N)
    Debug.Print "The Factorial of " & CStr(N) & " is " & Format(L, "#,##0")
End Sub

The Fact function does the real work of calculating the factorial.

Function Fact(N As Long) As Long
    If N <= 1 Then                  ' base case: stop recursing at 1 (or below)
        Fact = 1
    Else
        Fact = N * Fact(N - 1)      ' recursive case: N * (N - 1)!
    End If


End Function

In this code, the value of the input N is tested. If it is 1 or less, the function simply returns 1. If N

is greater than 1, Fact calls itself passing itself the value N-1. The function returns as its

result the input value N times the value of itself evaluated for N-1.

Cautions For Recursive Programming

While recursive programming is a powerful technique, you must be careful to structure

the code so that it will terminate properly when some condition is met. In the Fact

procedure, we ended the recursive calls when N was less than or equal to 1. Your

recursive code must have some sort of escape logic that terminates the recursive calls.

Without such escape logic, the code would loop continuously until the VBA runtime

aborts the processing with an Out Of Stack Space error. Note that you cannot trap an Out

Of Stack Space error with conventional error trapping. It is called an untrappable error

and will terminate all VBA execution immediately. You cannot recover from an

untrappable error.

For example, consider the following poorly written recursive procedure:

Function AddUp(N As Long)
    Static R As Long
    If N <= 0 Then
        R = 0
    End If
    R = AddUp(N + 1)                ' no escape condition: AddUp always calls itself again
    AddUp = R
End Function

In this code, there is no condition that prevents AddUp from calling itself. Every call to

AddUp results in another call to AddUp. The function will continue to call itself without

restriction until the VBA runtime aborts the procedure execution sequence.


MODULE NINE

MACROS

[Figure: jEdit's macro editor]

A macro (short for "macroinstruction", from Greek μακρο- 'large') in computer science is

a rule or pattern that specifies how a certain input sequence (often a sequence of

characters) should be mapped to a replacement input sequence (also often a sequence of

characters) according to a defined procedure. The mapping process that instantiates

(transforms) a macro use into a specific sequence is known as macro expansion.

A facility for writing macros may be provided as part of a software application or as a

part of a programming language. In the former case, macros are used to make tasks using

the application less repetitive. In the latter case, they are a tool that allows a programmer

to enable code reuse or even to design domain-specific languages.


Macros are used to make a sequence of computing instructions available to the

programmer as a single program statement, making the programming task less tedious

and less error-prone.

(Thus, they are called "macros" because a big block of code can be expanded from a

small sequence of characters). Macros often allow positional or keyword parameters that

dictate what the conditional assembler program generates and have been used to create

entire programs or program suites according to such variables as operating system,

platform or other factors. The term derives from "macro instruction", and such

expansions were originally used in generating assembly language code.

9.1 KEYBOARD AND MOUSE MACROS

Keyboard macros and mouse macros allow short sequences of keystrokes and

mouse actions to be transformed into other, usually more time-consuming, sequences of

keystrokes and mouse actions. In this way, frequently used or repetitive sequences of

keystrokes and mouse movements can be automated. Separate programs for creating

these macros are called macro recorders.

During the 1980s, macro programs – originally SmartKey, then SuperKey, KeyWorks,

Prokey – were very popular, first as a means to automatically format screenplays, then for

a variety of user input tasks. These programs were based on the TSR (Terminate and stay

resident) mode of operation and applied to all keyboard input, no matter in which context

it occurred. They have to some extent fallen into obsolescence following the advent of

mouse-driven user interface and the availability of keyboard and mouse macros in

applications such as word processors and spreadsheets, making it possible to create

application-sensitive keyboard macros.

Keyboard macros have in more recent times come to life as a method of exploiting the

economy of massively multiplayer online role-playing games (MMORPGs). By tirelessly

performing a boring, repetitive, but low risk action, a player running a macro can earn a

large amount of the game's currency or resources. This effect is even larger when a

macro-using player operates multiple accounts simultaneously, or operates the accounts


for a large amount of time each day. As this money is generated without human

intervention, it can dramatically upset the economy of the game. For this reason, use of

macros is a violation of the TOS or EULA of most MMORPGs, and administrators of

MMORPGs fight a continual war to identify and punish macro users.

9.2 APPLICATION MACROS AND SCRIPTING

Keyboard and mouse macros that are created using an application's built-in macro

features are sometimes called application macros. They are created by carrying out the

sequence once and letting the application record the actions. An underlying macro

programming language, most commonly a scripting language, with direct access to the

features of the application may also exist.

The programmers' text editor Emacs (short for "editing macros") follows this idea to a

conclusion. In effect, most of the editor is made of macros. Emacs was originally devised

as a set of macros in the editing language TECO; it was later ported to dialects of Lisp.

Another programmer's text editor, Vim (a descendant of vi), also has full implementation

of macros. It can record into a register (macro) what a person types on the keyboard and

it can be replayed or edited just like VBA macros for Microsoft Office. Vim also has a

scripting language called Vimscript to create macros.

Visual Basic for Applications (VBA) is a programming language included in Microsoft

Office. However, its function has evolved from and replaced the macro languages that

were originally included in some of these applications.

9.3 MACRO VIRUS

VBA has access to most Microsoft Windows system calls and executes when

documents are opened. This makes it relatively easy to write computer viruses in VBA,

commonly known as macro viruses. In the mid-to-late 1990s, this became one of the most

common types of computer virus. However, during the late 1990s and to date, Microsoft

has been patching and updating their programs. In addition, current anti-virus programs

immediately counteract such attacks.


9.4 TEXT SUBSTITUTION MACROS

Languages such as C and assembly language have rudimentary macro systems,

implemented as preprocessors to the compiler or assembler. C preprocessor macros work

by simple textual search-and-replace at the token, rather than the character, level. A

classic use of macros is in the computer typesetting system TeX and its derivatives,

where most of the functionality is based on macros. MacroML is an experimental system

that seeks to reconcile static typing and macro systems. Nemerle has typed syntax

macros, and one productive way to think of these syntax macros is as a multi-stage

computation. Other examples:

m4 is a sophisticated, stand-alone, macro processor.

TRAC

Macro Extension TAL, accompanying the Template Attribute Language

SMX, for web pages

ML/1 Macro Language One

The General Purpose Macroprocessor is a contextual pattern matching macro

processor, which could be described as a combination of regular expressions,

EBNF and AWK

SAM76

minimac, a concatenative macro processor.

troff and nroff, for typesetting and formatting Unix manpages.

9.5 EMBEDDABLE LANGUAGES

Some languages, such as PHP, can be embedded in free-format text, or the source

code of other languages. The mechanism by which the code fragments are recognised (for

instance, being bracketed by <?php and ?>) is similar to a textual macro language, but

they are much more powerful, fully featured languages.

9.5.1 Procedural macros

Macros in the PL/I language are written in a subset of PL/I itself: the compiler

executes "preprocessor statements" at compilation time, and the output of this execution


forms part of the code that is compiled. The ability to use a familiar procedural language

as the macro language gives power much greater than that of text substitution macros, at

the expense of a larger and slower compiler.

Frame Technology's frame macros have their own command syntax but can also contain

text in any language. Each frame is both a generic component in a hierarchy of nested

subassemblies, and a procedure for integrating itself with its subassembly frames (a

recursive process that resolves integration conflicts in favor of higher level

subassemblies). The outputs are custom documents, typically compilable source modules.

Frame Technology can avoid the proliferation of similar but subtly different components,

an issue that has plagued software development since the invention of macros and

subroutines.

Most assembly languages have less powerful procedural macro facilities, for example

allowing a block of code to be repeated N times for loop unrolling; but these have a

completely different syntax from the actual assembly language.

9.5.2 Syntactic macros

Macro systems that work at the level of abstract syntax trees are called syntactic

macros and preserve the lexical structure of the original program. By contrast, macro systems such as the C preprocessor described earlier, which work at the level of lexical tokens, cannot preserve that structure reliably. The most widely used

implementations of syntactic macro systems are found in Lisp-like languages such as

Common Lisp, Scheme, ISLISP and Racket. These languages are especially suited for

this style of macro due to their uniform, parenthesized syntax (known as S-Expressions).

In particular, uniform syntax makes it easier to determine the invocations of macros. Lisp

macros transform the program structure itself, with the full language available to express

such transformations. While syntactic macros are most commonly found in Lisp-like

languages, they have been implemented for other languages such as Dylan, Scala, and

Nemerle.
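
The difference can be seen in C itself. The following SQUARE macro, invented for this example, shows how token-level substitution loses the grouping of its argument, which is exactly what a syntactic (tree-level) macro would preserve.

    #include <stdio.h>

    /* A token-level macro: the argument is pasted in as raw text. */
    #define SQUARE(x) x * x

    int main(void)
    {
        /* Expands to 1 + 2 * 1 + 2, which evaluates to 5, not the intended 9;
           the textual expansion does not respect the argument's structure.
           Writing the macro as ((x) * (x)) works around this, whereas a
           syntactic macro keeps the argument as a single subexpression. */
        printf("%d\n", SQUARE(1 + 2));
        return 0;
    }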


9.6 EARLY LISP MACROS

The earliest Lisp macros took the form of FEXPRs, function-like operators whose

inputs were not the values computed by the arguments but rather the syntactic forms of

the arguments, and whose outputs were values to be used in the computation. In other

words, FEXPRs were implemented at the same level as EVAL, and provided a window

into the meta-evaluation layer. This was generally found to be a difficult model to reason

about effectively.

An alternative, later facility was DEFMACRO, a system that allowed programmers to specify source-to-source transformations that were applied before the program was run.

9.6.1 Hygienic macros

In the mid-eighties, a number of papers introduced the notion of hygienic macro

expansion (syntax-rules), a pattern-based system where the syntactic environments of the

macro definition and the macro use are distinct, allowing macro definers and users not to

worry about inadvertent variable capture (cf. Referential transparency). Hygienic macros

have been standardized for Scheme in both the R5RS and R6RS standards, and the R7RS standard also includes them. A number of competing implementations of hygienic macros exist, such as syntax-rules, syntax-case, explicit renaming, and syntactic closures; of these, syntax-rules and syntax-case have been standardized in the Scheme standards.

A number of languages other than Scheme either implement hygienic macros or

implement partially hygienic systems. Examples include Scala, Julia, Dylan, and

Nemerle.
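
The capture problem that hygiene prevents can be illustrated with an ordinary, unhygienic C macro; SWAP here is an illustrative macro written for this example, not a library facility.

    #include <stdio.h>

    /* An unhygienic macro: it introduces its own identifier, tmp. */
    #define SWAP(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

    int main(void)
    {
        int tmp = 1;
        int other = 2;

        /* The macro's local tmp shadows the caller's tmp, so the expansion
           manipulates the wrong variable (its tmp is even initialized from
           itself) and the caller's values are never swapped -- the kind of
           inadvertent capture that hygienic macro systems rule out. */
        SWAP(tmp, other);
        printf("tmp=%d other=%d\n", tmp, other);
        return 0;
    }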

9.7 APPLICATIONS

Evaluation order

Macro systems have a range of uses. Being able to choose the order of evaluation

(see lazy evaluation and non-strict functions) enables the creation of new syntactic

constructs (e.g. control structures) indistinguishable from those built into the

language. For instance, in a Lisp dialect that has cond but lacks if, it is possible to


define the latter in terms of the former using macros. Scheme, for example, has both continuations and hygienic macros, which together enable a programmer to design their own control abstractions, such as looping and early-exit constructs, without the need to build them into the language.
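
As a loose analogy in C (a hedged sketch, since the C preprocessor is far weaker than Lisp macros), a macro can still package a looping construct that the base language does not provide directly; FOREACH below is an invented name for this example.

    #include <stdio.h>

    /* A macro-defined control structure: iterate over an array, binding each
       element to `item` in turn.  Expands into an ordinary for loop. */
    #define FOREACH(item, array, length) \
        for (size_t _i = 0; _i < (length) && ((item) = (array)[_i], 1); ++_i)

    int main(void)
    {
        int values[] = {3, 1, 4, 1, 5};
        int v;

        FOREACH(v, values, sizeof values / sizeof values[0]) {
            printf("%d\n", v);
        }
        return 0;
    }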

Data sub-languages and domain-specific languages

Macros also make it possible to define data languages that are immediately compiled into code, which means that constructs such as state machines can be implemented in a way that is both natural and efficient.
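
A small C sketch of this idea is the "X-macro" technique, in which a list of states is written once as data and expanded into both an enum and a matching name table; the state names here are invented for the example.

    #include <stdio.h>

    /* The state list is written once, as data; the X-macro expands it
       into both an enum and a parallel table of printable names. */
    #define STATE_LIST(X) \
        X(IDLE)           \
        X(RUNNING)        \
        X(STOPPED)

    #define AS_ENUM(name)   STATE_##name,
    #define AS_STRING(name) #name,

    typedef enum { STATE_LIST(AS_ENUM) STATE_COUNT } state_t;

    static const char *state_names[] = { STATE_LIST(AS_STRING) };

    int main(void)
    {
        for (int s = 0; s < STATE_COUNT; ++s)
            printf("%d -> %s\n", s, state_names[s]);
        return 0;
    }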

Binding constructs

Macros can also be used to introduce new binding constructs. The best-known example is the transformation of let into the application of a function to a set of arguments.

Felleisen conjectures that these three categories make up the primary legitimate uses of

macros in such a system. Others have proposed alternative uses of macros, such as

anaphoric macros in macro systems that are unhygienic or allow selective unhygienic

transformation.

The interaction of macros and other language features has been a productive area of

research. For example, components and modules are useful for large-scale programming,

but the interaction of macros and these other constructs must be defined for their use

together. Module and component-systems that can interact with macros have been

proposed for Scheme and other languages with macros.

For example, the Racket language extends the notion of a macro system to a syntactic tower, in which macros can themselves be written in languages that contain macros, using hygiene to keep the syntactic layers distinct and allowing modules to export macros to other modules.

9.7.1 Macros for machine-independent software

Macros are normally used to map a short string (macro invocation) to a longer

sequence of instructions. Another, less common, use of macros is to do the reverse: to


map a sequence of instructions to a macro string. This was the approach taken by the

STAGE2 Mobile Programming System, which used a rudimentary macro compiler

(called SIMCMP) to map the specific instruction set of a given computer to counterpart

machine-independent macros. Applications (notably compilers) written in these machine-

independent macros can then be run without change on any computer equipped with the

rudimentary macro compiler. The first application run in such a context is a more

sophisticated and powerful macro compiler, written in the machine-independent macro

language. This macro compiler is applied to itself, in a bootstrap fashion, to produce a

compiled and much more efficient version of itself. The advantage of this approach is

that complex applications can be ported from one computer to a very different computer

with very little effort (for each target machine architecture, just the writing of the

rudimentary macro compiler). The advent of modern programming languages, notably C,

for which compilers are available on virtually all computers, has rendered such an

approach superfluous. This was, however, one of the first instances (if not the first) of

compiler bootstrapping.
