data structure handout 1
DESCRIPTION
Discusses the various topics under data structures and their applications.
TRANSCRIPT
MODULE ONE
DATA STRUCTURE: OVERVIEW
In computer science, a data structure is a particular way of storing and organizing data
in a computer so that it can be used efficiently.
Different kinds of data structures are suited to different kinds of applications, and some
are highly specialized to specific tasks. For example, B-trees are particularly well-suited
for implementation of databases, while compiler implementations usually use hash tables
to look up identifiers.
Data structures provide a means to manage huge amounts of data efficiently, such as
large databases and internet indexing services. Usually, efficient data structures are a key
to designing efficient algorithms. Some formal design methods and programming
languages emphasize data structures, rather than algorithms, as the key organizing factor in software
design. Storing and retrieving can be carried out on data held in main memory or in secondary
memory. Various data structures are available; which one to employ depends on the need.
Overview
An array data structure stores a number of elements of the same type in a specific
order. They are accessed using an integer to specify which element is required
(although the elements may be of almost any type). Arrays may be fixed-length or
expandable.
A record (also called a tuple or struct) is among the simplest data structures. A
record is a value that contains other values, typically in fixed number and
sequence and typically indexed by names. The elements of records are usually
called fields or members.
A hash or dictionary or map is a more flexible variation on a record, in which
name-value pairs can be added and deleted freely.
A union type definition specifies which of a number of permitted
primitive types may be stored in its instances, e.g. "float or long integer". Contrast
this with a record, which could be defined to contain both a float and an integer;
in a union, there is only one value at a time.
A tagged union (also called a variant, variant record, discriminated union, or
disjoint union) contains an additional field indicating its current type, for enhanced
type safety.
A set is an abstract data structure that can store specific values, without any
particular order and with no repeated values. Values themselves are not retrieved from
sets; rather, one tests a value for membership to obtain a Boolean "in" or "not in".
An object contains a number of data fields, like a record, and also a number of
program code fragments for accessing or modifying them. Data structures not
containing code, like those above, are called plain old data structures.
Many others are possible, but they tend to be further variations and compounds of the
above.
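To make these concrete, here is a minimal Java sketch (the class and variable names are invented for illustration) showing an array, a record-like class, a map, and a set:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StructuresDemo {
    // A record-like class: a fixed number of named fields.
    static class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        int[] scores = {90, 75, 88};                  // array: same-type elements, integer-indexed
        Point p = new Point(3, 4);                    // record: fields accessed by name
        Map<String, Integer> ages = new HashMap<>();  // map: name-value pairs, freely added
        ages.put("ada", 36);
        Set<String> tags = new HashSet<>();           // set: unordered, no duplicates
        tags.add("draft");
        System.out.println(scores[1] + " " + p.x + " "
                + ages.get("ada") + " " + tags.contains("draft"));
    }
}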
1.1 BASIC PRINCIPLES
Data structures are generally based on the ability of a computer to fetch and store
data at any place in its memory, specified by an address—a bit string that can be itself
stored in memory and manipulated by the program. Thus the record and array data
structures are based on computing the addresses of data items with arithmetic operations;
while the linked data structures are based on storing addresses of data items within the
structure itself. Many data structures use both principles, sometimes combined in
non-trivial ways (as in XOR linking).
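As an illustrative sketch (in Java, where object references stand in for stored addresses), the two principles look like this:

class LinkedNode {
    int value;
    LinkedNode next;   // linked principle: the node itself stores a reference
                       // (an address) to the next data item
}

class Principles {
    public static void main(String[] args) {
        // Array principle: the location of a[i] is computed arithmetically
        // from the base address, so the element is fetched directly.
        int[] a = {10, 20, 30};
        System.out.println(a[2]);

        // Linked principle: items are reached by following stored references.
        LinkedNode head = new LinkedNode();
        head.value = 1;
        head.next = new LinkedNode();
        head.next.value = 2;
        System.out.println(head.next.value);
    }
}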
The implementation of a data structure usually requires writing a set of procedures that
create and manipulate instances of that structure. The efficiency of a data structure cannot
be analyzed separately from those operations. This observation motivates the theoretical
concept of an abstract data type, a data structure that is defined indirectly by the
operations that may be performed on it, and the mathematical properties of those
operations (including their space and time cost).
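For example, a stack can be specified as an abstract data type purely by its operations. A minimal Java sketch, with interface and names invented for illustration:

import java.util.ArrayList;
import java.util.List;

// The abstract data type is defined only by its operations.
interface Stack<T> {
    void push(T item);
    T pop();
    boolean isEmpty();
}

// One possible implementation; its efficiency can only be judged
// with respect to the operations declared above.
class ListStack<T> implements Stack<T> {
    private final List<T> items = new ArrayList<>();
    public void push(T item) { items.add(item); }
    public T pop() { return items.remove(items.size() - 1); }
    public boolean isEmpty() { return items.isEmpty(); }
}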
1.2 LANGUAGE SUPPORT
Most assembly languages and some low-level languages, such as BCPL (Basic
Combined Programming Language), lack support for data structures. Many high-level
programming languages and some higher-level assembly languages, such as MASM, on
the other hand, have special syntax or other built-in support for certain data structures,
such as vectors (one-dimensional arrays) in the C language or multi-dimensional arrays in
Pascal.
Most programming languages feature some sort of library mechanism that allows data
structure implementations to be reused by different programs. Modern languages usually
come with standard libraries that implement the most common data structures. Examples
are the C++ Standard Template Library, the Java Collections Framework, and Microsoft's
.NET Framework.
Modern languages also generally support modular programming, the separation between
the interface of a library module and its implementation. Some provide opaque data types
that allow clients to hide implementation details. Object-oriented programming
languages, such as C++ and Java, and platforms such as the .NET Framework, may use classes for this purpose.
Many known data structures have concurrent versions that allow multiple computing
threads to access the data structure simultaneously.
MODULE TWO
THE TREES STRUCTURE
In computer science, a tree is a widely used data structure that simulates a hierarchical
tree structure with a set of linked nodes.
A tree can be defined recursively (locally) as a collection of nodes (starting at a root
node), where each node is a data structure consisting of a value, together with a list of
nodes (the "children"), with the constraints that no node is duplicated.
A tree can be defined abstractly as a whole (globally) as an ordered tree, with a value
assigned to each node. Both these perspectives are useful: while a tree can be analyzed
mathematically as a whole, when actually represented as a data structure it is usually
represented and worked with separately by node (rather than as a list of nodes and an
adjacency list of edges between nodes, as one may represent a digraph, for instance).
For example, looking at a tree as a whole, one can talk about "the parent node" of a given
node, but in general as a data structure a given node only contains the list of its children,
but does not contain a reference to its parent (if any).
Mathematical
Viewed as a whole, a tree data structure is an ordered tree, generally with values attached
to each node. Concretely, it is:
A rooted tree with the "away from root" direction (a narrower term is an
"arborescence"), meaning:
A directed graph,
whose underlying undirected graph is a tree (any two vertices are connected by
exactly one simple path),
with a distinguished root (one vertex is designated as the root),
which determines the direction on the edges (arrows point away from the root;
given an edge, the node that the edge points from is called the parent and the node
that the edge points to is called the child),
Together with:
an ordering on the child nodes of a given node, and
a value (of some data type) at each node.
Often trees have a fixed (more properly, bounded) branching factor (outdegree),
particularly always having two child nodes (possibly empty, hence at most two non-
empty child nodes), hence a "binary tree".
2.1 BINARY TREES
The simplest form of tree is a binary tree. A binary tree consists of
a. a node (called the root node) and
b. left and right sub-trees.
Both the sub-trees are themselves binary trees.
You now have a recursively defined data structure. (It is also possible to define a list
recursively: can you see how?)
A binary tree
The nodes at the lowest levels of the tree (the ones with no sub-trees) are called leaves.
In an ordered binary tree,
The keys of all the nodes in the left sub-tree are less than that of the root,
The keys of all the nodes in the right sub-tree are greater than that of the root,
The left and right sub-trees are themselves ordered binary trees.
Data Structure
The data structure for the tree implementation simply adds left and right pointers in place
of the next pointer of the linked list implementation.
The AddToCollection method is, naturally, recursive.
Similarly, the FindInCollection method is recursive.
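A minimal Java sketch of this structure, using the method names above (the node class and integer keys are assumptions for illustration):

class TreeNode {
    int key;
    TreeNode left, right;   // left and right pointers replace the single 'next'
                            // pointer of the linked list implementation
    TreeNode(int key) { this.key = key; }
}

class OrderedBinaryTree {
    TreeNode root;

    void add(int key) { root = addToCollection(root, key); }

    // AddToCollection: recursively descend until an empty slot is found.
    TreeNode addToCollection(TreeNode node, int key) {
        if (node == null) return new TreeNode(key);
        if (key < node.key) node.left = addToCollection(node.left, key);
        else if (key > node.key) node.right = addToCollection(node.right, key);
        return node;
    }

    // FindInCollection: the ordering property tells us which sub-tree to search.
    boolean findInCollection(TreeNode node, int key) {
        if (node == null) return false;
        if (key == node.key) return true;
        return key < node.key ? findInCollection(node.left, key)
                              : findInCollection(node.right, key);
    }
}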
Analysis
Complete Trees
Before we look at more general cases, let's make the optimistic assumption that we've
managed to fill our tree neatly, i.e. that each leaf is the same 'distance' from the root.
A complete tree
This forms a complete tree, whose height is defined as the number of links from the root
to the deepest leaf.
First, we need to work out how many nodes, n, we have in such a tree of height, h.
Now,
n = 1 + 2^1 + 2^2 + ... + 2^h
From which, summing the geometric series, we have
n = 2^(h+1) - 1
and
h = floor( log2 n )
Examination of the Find method shows that in the worst case, h + 1, or ceiling( log2 n ),
comparisons are needed to find an item. This is the same as for binary search.
However, Add also requires ceiling( log2 n ) comparisons to determine where to add an
item. Actually adding the item takes a constant number of operations, so we say that a
binary tree requires O(log n) operations for both adding and finding an item - a
considerable improvement over binary search for a dynamic structure which often
requires addition of new items.
2.3 TERMINOLOGY
A node is a structure which may contain a value or condition, or represent a
separate data structure (which could be a tree of its own). Each node in a tree has zero or
more child nodes, which are below it in the tree (by convention, trees are drawn growing
downwards). A node that has a child is called the child's parent node (or ancestor node,
or superior). A node has at most one parent.
An internal node (also known as an inner node, inode for short, or branch node) is any
node of a tree that has child nodes. Similarly, an external node (also known as an outer
node, leaf node, or terminal node) is any node that does not have child nodes.
The topmost node in a tree is called the root node. Being the topmost node, the root node
will not have a parent. It is the node at which algorithms on the tree begin, since as a data
structure, one can only pass from parents to children.
Note that some algorithms (such as post-order depth-first search) begin at the root but
visit leaf nodes first: they access the children of the root first, and access the value of
the root only last.
All other nodes can be reached from it by following edges or links. (In the formal
definition, each such path is also unique.) In diagrams, the root node is conventionally
drawn at the top. In some trees, such as heaps, the root node has special properties.
Every node in a tree can be seen as the root node of the subtree rooted at that node.
The height of a node is the length of the longest downward path to a leaf from that node.
The height of the root is the height of the tree. The depth of a node is the length of the
path to its root (i.e., its root path). This is commonly needed in the manipulation of the
various self balancing trees, AVL Trees in particular. The root node has depth zero, leaf
nodes have height zero, and a tree with only a single node (hence both a root and leaf)
has depth and height zero. Conventionally, an empty tree (tree with no nodes) has depth
and height −1.
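These definitions translate directly into recursive code. A sketch in Java, reusing the TreeNode class sketched earlier and the conventions above (the empty tree has height -1, the root has depth zero):

class TreeMeasures {
    static int height(TreeNode node) {
        if (node == null) return -1;     // convention: empty tree has height -1
        return 1 + Math.max(height(node.left), height(node.right));
    }

    static int depth(TreeNode root, TreeNode target) {
        if (root == null) return -1;     // target not found in this subtree
        if (root == target) return 0;    // the root node has depth zero
        int d = depth(root.left, target);
        if (d < 0) d = depth(root.right, target);
        return d < 0 ? -1 : d + 1;       // one more edge on the path from the root
    }
}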
A subtree of a tree T is a tree consisting of a node in T and all of its descendants in T.
Nodes thus correspond to subtrees (each node corresponds to the subtree of itself and all
its descendants) – the subtree corresponding to the root node is the entire tree, and each
node is the root node of the subtree it determines; the subtree corresponding to any other
node is called a proper subtree (in analogy to the term proper subset).
MODULE THREE
THE GRAPH STRUCTURE
In computer science, a graph is an abstract data type that is meant to implement the
graph and hypergraph concepts from mathematics.
A graph data structure consists of a finite (and possibly mutable) set of ordered pairs,
called edges or arcs, of certain entities called nodes or vertices. As in mathematics, an
edge (x,y) is said to point or go from x to y. The nodes may be part of the graph structure,
or may be external entities represented by integer indices or references.
A graph data structure may also associate to each edge some edge value, such as a
symbolic label or a numeric attribute (cost, capacity, length, etc.).
Graphs (also known as Networks) are very powerful structures and find their
applications in path-finding, visibility determination, soft-bodies using mass-spring
systems and probably a lot more. A graph is similar to a tree, but it imposes no
restrictions on how nodes are connected to each other. In fact each node can point to
another node or even multiple nodes at once.
A node is represented by the GraphNode class, and the connections between GraphNode
objects are modeled by the GraphArc class.
The arc has only one direction (uni-directional) and points from one node to another so
you are only allowed to go from A to B and not in the opposite direction. Bi-directional
connections can be simulated by creating an arc from node A to B and vice-versa from B
to A. Also, each arc has a weight value associated with it, which describes how costly it
is to move along the arc.
This is optional though, so the default value is 1.
Putting it together, the graph is implemented as a uni-directional weighted graph. The
Graph manages everything: it stores the nodes and the arcs in separate lists, makes sure
you don’t add a node twice, or mess up the arcs (for example if you remove a node from
the graph, it also scans the arc list and removes all arcs pointing to that node) and
provides you with tools to traverse the graph.
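The text describes an ActionScript implementation; the following is a minimal Java sketch of the same design, where the class layout and field names are assumptions based on the description:

import java.util.ArrayList;
import java.util.List;

class GraphNode {
    Object data;
    boolean marked;                        // used by the traversal algorithms below
    List<GraphArc> arcs = new ArrayList<>();
}

class GraphArc {
    GraphNode from, to;                    // uni-directional: points from 'from' to 'to'
    double weight;                         // cost of moving along the arc

    GraphArc(GraphNode from, GraphNode to) { this(from, to, 1.0); } // weight optional, default 1
    GraphArc(GraphNode from, GraphNode to, double weight) {
        this.from = from; this.to = to; this.weight = weight;
    }
}

class Graph {
    List<GraphNode> nodes = new ArrayList<>();
    List<GraphArc> arcs = new ArrayList<>();

    void addNode(GraphNode n) { if (!nodes.contains(n)) nodes.add(n); } // no duplicates

    void addArc(GraphNode a, GraphNode b, double weight) {
        GraphArc arc = new GraphArc(a, b, weight);
        arcs.add(arc);
        a.arcs.add(arc);
    }

    void clearMarks() { for (GraphNode n : nodes) n.marked = false; }
}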
3.1 BUILDING THE GRAPH STRUCTURE
In figure 1, you see a simple graph containing 8 nodes. You can add additional
nodes, which will be placed at the position of the cursor by pressing ‘a’ or start with a
fresh graph by pressing 'r'. To create an arc pointing from node A to node B, simply click
both nodes successively. Traversal is also possible: first press 't' to switch to 'traverse'
mode, then click a node to find all nodes which are connected to this node.
Figure 1: Building a graph structure
Graph traversal
If you have tried out the traversal in the example above, you may wonder how it’s done.
The answer lies in two common algorithms to accomplish this: Breadth-first search
(BFS) and depth-first search (DFS). (The demonstration above used the breadth-first
search.)
The BFS algorithm visits all nodes that are closest to the starting node first, so it
gradually expands outward in all directions equally. This looks like a virus infecting the
direct neighborhood at each search iteration.
BFS utilizes a queue and proceeds as follows:
1. Mark the starting node and enqueue it.
2. Process the node at the front of the queue by calling a user-defined function on it.
3. Mark all unmarked nodes connected to it and also put them into the queue.
4. Remove the node at the front of the queue.
5. Repeat steps 2-4 with the node that is now at the front of the queue.
6. Stop when the queue is empty.
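A sketch of this procedure in Java, using the Graph classes sketched earlier (the process function is a user-supplied placeholder, and the dequeue-then-process order of steps 2 and 4 is collapsed into one loop):

import java.util.ArrayDeque;
import java.util.Queue;

class GraphTraversal {
    static void bfs(Graph graph, GraphNode start) {
        graph.clearMarks();                    // reset markers before any traversal
        Queue<GraphNode> queue = new ArrayDeque<>();
        start.marked = true;                   // step 1: mark the starting node...
        queue.add(start);                      // ...and enqueue it
        while (!queue.isEmpty()) {             // step 6: stop when the queue is empty
            GraphNode node = queue.remove();   // steps 2 and 4: take the front node...
            process(node);                     // ...and process it
            for (GraphArc arc : node.arcs) {   // step 3: mark unmarked neighbours
                if (!arc.to.marked) {
                    arc.to.marked = true;
                    queue.add(arc.to);         // ...and enqueue them
                }
            }
        }
    }

    static void process(GraphNode node) {
        // user-defined function; e.g. System.out.println(node.data);
    }
}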
The depth-first search (DFS) on the other hand takes the starting node, follows the next
arc it finds to get to the next node, and continues this until the complete path has been
discovered, then goes back to the starting node and follows the next path until it reaches a
dead end and so on. It's currently implemented as a recursive function, which means it
can fail for very large graphs when the call stack exceeds its maximum size (I don't
know how big it is in AS3, though).
Both algorithms have in common that they mark a node when it is first encountered,
otherwise the node would be visited and unnecessarily processed multiple times,
because different nodes can all point to a common node. So before you start a BFS or
DFS it’s very important to reset all markers by calling the clearMarks() function on the
graph.
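For comparison, a recursive DFS sketch in the same style:

class DepthFirst {
    // Call graph.clearMarks() before starting a traversal.
    static void dfs(GraphNode node) {
        node.marked = true;              // mark on first visit to avoid reprocessing
        // process(node) would go here, as in BFS
        for (GraphArc arc : node.arcs) {
            if (!arc.to.marked) {
                dfs(arc.to);             // follow each unvisited arc to its end
            }
        }
    }
}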
The two algorithms are visualized in figure 2 below. I’ve created a rectangular grid of
nodes (similar to a tilemap) by connecting each node with the top, bottom, left and right
neighbors (I left out the arcs because it would be a total mess). I have also deleted some
nodes to show you that both algorithms don't rely on a regular structure and the graph
can look like anything. Just click a node to start the traversal. You can toggle between
the two algorithms by pressing 'b' (BFS) and 'd' (guess what ;-)).
Figure 2: Visualizing the two traversal algorithms
BFS is much more useful than DFS in most situations. But DFS is likely to be faster, for
example when you only want to modify all connected nodes in some way.
MODULE FOUR
THE POLISH NOTATION
Polish notation, also known as prefix notation, is a symbolic logic invented by Polish
mathematician Jan Lukasiewicz in the 1920s. When using Polish notation, the instruction
(operation) precedes the data (operands). In Polish notation, the order (and only the
order) of operations and operands determines the result, making parentheses unnecessary.
For example, the expression 3(4+5) is written:
x 3 + 4 5
Polish notation, also known as Polish prefix notation or simply prefix notation, is a form
of notation for logic, arithmetic, and algebra. Its distinguishing feature is that it places
operators to the left of their operands. If the arity of the operators is fixed, the result is a
syntax lacking parentheses or other brackets that can still be parsed without ambiguity.
The Polish logician Jan Łukasiewicz invented this notation around 1920 in order to
simplify sentential logic.
The term Polish notation is sometimes taken (as the opposite of infix notation) to also
include Polish postfix notation, or Reverse Polish notation, in which the operator is
placed after the operands.
This contrasts with the traditional algebraic methodology for performing mathematical
operations, the Order of Operations. (The mnemonic device for remembering the Order
of Operations is "Please Excuse My Dear Aunt Sally" - parentheses, exponents,
multiplication, division, addition, subtraction).
In the expression 3(4+5), you would work inside the parentheses first to add four plus
five and then multiply the result by three.
In the early days of the calculator, the end-user had to write down the results of their
intermediate steps when using the algebraic Order of Operations. Not only did this slow
things down, it provided an opportunity for the end-user to make errors and sometimes
defeated the purpose of using a calculating machine. In the 1960s, engineers at Hewlett-
Packard decided that it would be easier for end-users to learn Jan Lukasiewicz' logic
system than to try and use the Order of Operations on a calculator. They modified Jan
Lukasiewicz's system for a calculator keyboard by placing the instructions (operators)
after the data. In homage to Jan Lukasiewicz' Polish logic system, the engineers at
Hewlett-Packard called their modification reverse Polish notation (RPN).
The notation for the expression 3(4+5) would now be expressed as
4 5 + 3 x
or it could be further simplified to
3 4 5 + x
Reverse Polish notation provided a straightforward solution for calculator or computer
software mathematics because it treats the instructions (operators) and the data (operands)
as "objects" and processes them in a last-in, first-out (LIFO) basis.
This is called a "stack method". (Think of a stack of plates. The last plate you put on the
stack will be the first plate taken off the stack.)
Modern calculators with memory functions are sophisticated enough to accommodate the
use of the traditional algebraic Order of Operations, but users of RPN calculators like the
logic's simplicity and continue to make it profitable for Hewlett-Packard to manufacture
RPN calculators. Some of Hewlett-Packard's latest calculators are capable of both RPN
and algebraic logic.
When Polish notation is used as a syntax for mathematical expressions by interpreters of
programming languages, it is readily parsed into abstract syntax trees and can, in fact,
define a one-to-one representation for the same. Because of this, Lisp (see below) and
related programming languages define their entire syntax in terms of prefix notation (and
others use postfix notation).
Here is a quotation from a paper by Jan Łukasiewicz, Remarks on Nicod's Axiom and on
"Generalizing Deduction", page 180.
"I came upon the idea of a parenthesis-free notation in 1924. I used that notation for
the first time in my article Łukasiewicz(1), p. 610, footnote."
The reference cited by Jan Łukasiewicz above is apparently a lithographed report in
Polish. The referring paper by Łukasiewicz Remarks on Nicod's Axiom and on
"Generalizing Deduction" was reviewed by H. A. Pogorzelski in the Journal of Symbolic
Logic in 1965.
Alonzo Church mentions this notation in his classic book on mathematical logic as
worthy of remark in notational systems even contrasted to Whitehead and Russell's
logical notational exposition and work in Principia Mathematica.
While no longer used much in logic, Polish notation has since found a place in computer
science.
4.1 POLISH NOTATION IN ARITHMETIC
The expression for adding the numbers 1 and 2 is, in prefix notation, written "+ 1
2" rather than "1 + 2". In more complex expressions, the operators still precede their
operands, but the operands may themselves be nontrivial expressions including operators
of their own. For instance, the expression that would be written in conventional infix
notation as
(5 − 6) * 7
can be written in prefix as
* (− 5 6) 7
Since the simple arithmetic operators are all binary (at least, in arithmetic contexts), any
prefix representation thereof is unambiguous, and bracketing the prefix expression is
unnecessary. As such, the previous expression can be further simplified to
* − 5 6 7
The processing of the product is deferred until its two operands are available (i.e., 5
minus 6, and 7). As with any notation, the innermost expressions are evaluated first, but
in prefix notation this "innermost-ness" can be conveyed by order rather than bracketing.
In the classical notation, the parentheses in the infix version were required, since moving
them
5 − (6 * 7)
or simply removing them
5 − 6 * 7
would change the meaning and result of the overall expression, due to the precedence
rule.
Similarly
5 − (6 * 7)
can be written in Polish notation as
− 5 * 6 7
4.2 POLISH NOTATION IN COMPUTER PROGRAMMING
Prefix notation has seen wide application in Lisp s-expressions, where the brackets
are required since the operators in the language are themselves data (first-class functions).
Lisp functions may also have variable arity. The Ambi programming language uses
Polish Notation for arithmetic operations and program construction.
The postfix reverse Polish notation is used in many stack-based programming languages
like PostScript and Forth, and is the operating principle of certain calculators, notably
from Hewlett-Packard.
The number of return values of an expression equals the number of operands in the
expression minus the total arity of the operators, plus the total number of return values
of the operators.
4.3 ORDER OF OPERATIONS
Order of operations is defined within the structure of prefix notation and can be
easily determined. One thing to keep in mind is that when executing an operation, the
operation is applied to the first operand by the second operand. This is not an issue with
operations that commute, but for non-commutative operations like division or
subtraction, this fact is crucial to the analysis of a statement. For example, the following
statement:
/ 10 5 = 2
is read as "divide 10 by 5". Thus the solution is 2, not 1/2 as would be the result of an
incorrect analysis.
Prefix notation is especially popular with stack-based operations due to its innate ability
to easily distinguish order of operations without the need for parentheses. To evaluate
order of operations under prefix notation, one does not even need to memorize an
operational hierarchy, as with infix notation. Instead, one looks directly to the notation to
discover which operator to evaluate first. Reading an expression from left to right, one
first looks for an operator and proceeds to look for two operands. If another operator is
found before two operands are found, then the old operator is placed aside until this new
operator is resolved. This process iterates until an operator is resolved, which must
happen eventually, as there must be one more operand than there are operators in a
complete statement. Once resolved, the operator and the two operands are replaced with a
new operand. Because one operator and two operands are removed and one operand is
added, there is a net loss of one operator and one operand, which still leaves an
expression with N operators and N + 1 operands, thus allowing the iterative process to
continue. This is the general theory behind using stacks in programming languages to
evaluate a statement in prefix notation, although there are various algorithms that
manipulate the process. Once analyzed, a statement in prefix notation becomes less
intimidating to the human mind as it allows some separation from convention with added
convenience. An example shows the ease with which a complex statement in prefix
notation can be deciphered through order of operations:
− * / 15 − 7 + 1 1 3 + 2 + 1 1 =
− * / 15 − 7 2 3 + 2 + 1 1 =
− * / 15 5 3 + 2 + 1 1 =
− * 3 3 + 2 + 1 1 =
− 9 + 2 + 1 1 =
− 9 + 2 2 =
− 9 4 =
5
An equivalent infix expression is as follows: ((15 / (7 − (1 + 1))) * 3) − (2 + (1 + 1)) = 5
Here is an implementation (in pseudocode) of prefix evaluation using a stack. Note that
under this implementation the input string is scanned from right to left. This differs from
the algorithm described above in which the string is processed from left to right. Both
algorithms compute the same value for all valid strings.
Scan the given prefix expression from right to left
for each symbol {
    if operand then
        push onto stack
    if operator then {
        operand1 = pop stack
        operand2 = pop stack
        compute operand1 operator operand2
        push result onto stack
    }
}
return top of stack as result
The result is at the top of the stack.
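For illustration, here is a runnable Java version of the same right-to-left algorithm; the space-separated token format and the use of 'x' or '*' for multiplication are assumptions:

import java.util.ArrayDeque;
import java.util.Deque;

public class PrefixEval {
    static double evaluate(String expression) {
        String[] symbols = expression.trim().split("\\s+");
        Deque<Double> stack = new ArrayDeque<>();
        // Scan the prefix expression from right to left.
        for (int i = symbols.length - 1; i >= 0; i--) {
            String s = symbols[i];
            if (isOperator(s)) {
                double operand1 = stack.pop();
                double operand2 = stack.pop();
                stack.push(apply(s, operand1, operand2));
            } else {
                stack.push(Double.parseDouble(s));   // operand: push onto stack
            }
        }
        return stack.pop();   // the result is at the top of the stack
    }

    static boolean isOperator(String s) {
        return s.equals("+") || s.equals("-") || s.equals("*")
            || s.equals("x") || s.equals("/");
    }

    static double apply(String op, double a, double b) {
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "/": return a / b;
            default:  return a * b;   // "*" or "x"
        }
    }

    public static void main(String[] args) {
        // The worked example from the text evaluates to 5.
        System.out.println(evaluate("- * / 15 - 7 + 1 1 3 + 2 + 1 1")); // prints 5.0
    }
}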
4.4 POLISH NOTATION FOR LOGIC
The table below shows the core of Jan Łukasiewicz's notation for sentential logic.
The "conventional" notation did not become so until the 1970s and 80s. Some letters in
the Polish notation table means a certain word in Polish, as shown:
Concept                  Conventional notation   Polish notation   Polish word
Negation                 ¬φ                      Nφ                negacja
Conjunction              φ ∧ ψ                   Kφψ               koniunkcja
Disjunction              φ ∨ ψ                   Aφψ               alternatywa
Material conditional     φ → ψ                   Cφψ               implikacja
Biconditional            φ ↔ ψ                   Eφψ               ekwiwalencja
Falsum                   ⊥                       O                 fałsz
Sheffer stroke           φ | ψ                   Dφψ               dysjunkcja
Possibility              ◇φ                      Mφ                możliwość
Necessity                □φ                      Lφ                konieczność
Universal quantifier     ∀p φ                    Πpφ               kwantyfikator ogólny
Existential quantifier   ∃p φ                    Σpφ               kwantyfikator szczegółowy
Note that the quantifiers ranged over propositional values in Łukasiewicz's work on
many-valued logics.
Bocheński introduced an incompatible system of Polish notation that names all 16 binary
connectives of classical propositional logic.
MODULE FIVE
STORAGE MANAGEMENT
Storage management is a general storage industry phrase that is used to describe the
tools, processes, and policies used to manage storage networks and storage services such
as virtualization, replication, mirroring, security, compression, traffic analysis and other
services. The phrase storage management also encompasses numerous storage
technologies including process automation, real-time infrastructure products and storage
provisioning.
In some cases, the phrase storage management may be used in direct reference to Storage
Resource Management (SRM) -- software that manages storage from a capacity,
utilization, policy and event-management perspective.
5.1 HIERARCHICAL STORAGE MANAGEMENT
Hierarchical storage management (HSM) is a data storage technique which
automatically moves data between high-cost and low-cost storage media. HSM systems
exist because high-speed storage devices, such as hard disk drive arrays, are more
expensive (per byte stored) than slower devices, such as optical discs and magnetic tape
drives. While it would be ideal to have all data available on high-speed devices all the
time, this is prohibitively expensive for many organizations. Instead, HSM systems store
the bulk of the enterprise's data on slower devices, and then copy data to faster disk
drives when needed. In effect, HSM turns the fast disk drives into caches for the slower
mass storage devices. The HSM system monitors the way data is used and makes best
guesses as to which data can safely be moved to slower devices and which data should
stay on the fast devices.
In a typical HSM scenario, data files which are frequently used are stored on disk drives,
but are eventually migrated to tape if they are not used for a certain period of time,
typically a few months. If a user does reuse a file which is on tape, it is automatically
moved back to disk storage. The advantage is that the total amount of stored data can be
much larger than the capacity of the disk storage available, but since only rarely-used
files are on tape, most users will usually not notice any slowdown.
HSM is sometimes referred to as tiered storage.
HSM (originally DFHSM, now DFSMShsm) was first implemented by IBM on their
mainframe computers to reduce the cost of data storage, and to simplify the retrieval of
data from slower media. The user would not need to know where the data was stored and
how to get it back; the computer would retrieve the data automatically. The only
difference to the user was the speed at which data was returned.
Later, IBM ported HSM to its AIX operating system, and then to other Unix-like
operating systems such as Solaris, HP-UX and Linux.
HSM was also implemented on the DEC VAX/VMS systems and the Alpha/VMS
systems. The first implementation date should be readily determined from the VMS
System Implementation Manuals or the VMS Product Description Brochures.
Recently, the development of Serial ATA (SATA) disks has created a significant market
for three-stage HSM: files are migrated from high-performance Fibre Channel Storage
Area Network devices to somewhat slower but much cheaper SATA disk arrays totaling
several terabytes or more, and then eventually from the SATA disks to tape.
The newest development in HSM is with hard disk drives and flash memory, with flash
memory being over 30 times faster than disks, but disks being considerably cheaper.
Conceptually, HSM is analogous to the cache found in most computer CPUs, where
small amounts of expensive SRAM memory running at very high speeds are used to store
frequently used data, while the least recently used data is evicted to the slower but much
larger main DRAM memory when new data has to be loaded.
In practice, HSM is typically performed by dedicated software, such as IBM Tivoli
Storage Manager, Oracle's SAM-QFS, Quantum, SGI Data Migration Facility (DMF),
StorNext, or EMC Legato OTG DiskXtender.
Use Cases
HSM is often used for deep archival storage of data to be held long term at low cost.
Automated tape robots can silo large quantities of data efficiently with low power
consumption.
Some HSM software products allow the user to place portions of data files on high-speed
disk cache and the rest on tape. This is used in applications that stream video over the
internet -- the initial portion of a video is delivered immediately from disk while a robot
finds, mounts and streams the rest of the file to the end user. Such a system greatly
reduces disk cost for large content provision systems.
Tiered storage
Tiered storage is a data storage environment consisting of two or more kinds of storage
delineated by differences in at least one of these four attributes: price, performance,
capacity and function.
Any significant difference in one or more of the four defining attributes can be sufficient
to justify a separate storage tier.
Examples:
Disk and tape: two separate storage tiers identified by differences in all four
defining attributes.
Old technology disk and new technology disk: two separate storage tiers identified
by differences in one or more of the attributes.
High performing disk storage and less expensive, slower disk of the same capacity
and function: two separate tiers.
Identical enterprise class disk configured to utilize different functions such as
RAID level or replication: a separate storage tier for each set of unique functions.
Note: Storage Tiers are not delineated by differences in vendor, architecture, or geometry
except where those differences result in clear changes to price, performance, capacity and
function.
5.2 GARBAGE COLLECTION (COMPUTER SCIENCE)
In computer science, garbage collection (GC) is a form of automatic memory
management. The garbage collector, or just collector, attempts to reclaim garbage, or
memory occupied by objects that are no longer in use by the program. Garbage collection
was invented by John McCarthy around 1959 to solve problems in Lisp.
Garbage collection is often portrayed as the opposite of manual memory management,
which requires the programmer to specify which objects to deallocate and return to the
memory system. However, many systems use a combination of approaches, including
other techniques such as stack allocation and region inference.
Resources other than memory, such as network sockets, database handles, user
interaction windows, and file and device descriptors, are not typically handled by garbage
collection. Methods used to manage such resources, particularly destructors, may suffice
to manage memory as well, leaving no need for GC. Some GC systems allow such other
resources to be associated with a region of memory that, when collected, causes the other
resource to be reclaimed; this is called finalization. Finalization may introduce
complications limiting its usability, such as intolerable latency between disuse and
reclaim of especially limited resources, or a lack of control over which thread performs
the work of reclaiming.
5.2.1 PRINCIPLES
Many computer languages require garbage collection, either as part of the
language specification (e.g., Java, C#, and most scripting languages) or effectively for
practical implementation (e.g., formal languages like lambda calculus); these are said to
be garbage collected languages. Other languages were designed for use with manual
memory management, but have garbage collected implementations available (e.g., C,
C++). Some languages, like Ada, Modula-3, and C++/CLI allow both garbage collection and
manual memory management to co-exist in the same application by using separate heaps
for collected and manually managed objects; others, like D, are garbage collected but
allow the user to manually delete objects and also entirely disable garbage collection
when speed is required. While integrating garbage collection into the language's compiler
and runtime system enables a much wider choice of methods, post hoc
GC systems exist, including some that do not require recompilation. (Post-hoc GC is
sometimes distinguished as litter collection.) The garbage collector will almost always be
closely integrated with the memory allocator.
5.2.2 BENEFITS
Garbage collection frees the programmer from manually dealing with memory
deallocation. As a result, certain categories of bugs are eliminated or substantially
reduced:
Dangling pointer bugs, which occur when a piece of memory is freed while there are still
pointers to it, and one of those pointers is dereferenced. By then the memory may have
been re-assigned to another use, with unpredictable results.
Double free bugs, which occur when the program tries to free a region of memory that
has already been freed, and perhaps already been allocated again.
Certain kinds of memory leaks, in which a program fails to free memory occupied by
objects that have become unreachable, which can lead to memory exhaustion. (Garbage
collection typically does not deal with the unbounded accumulation of data that is
reachable, but that will actually not be used by the program.)
Efficient implementations of persistent data structures, whose versions share structure that would be difficult to free manually.
Some of the bugs addressed by garbage collection can have security implications.
5.2.3 DISADVANTAGES
Typically, garbage collection has certain disadvantages:
Garbage collection consumes computing resources in deciding which memory to free,
even though the programmer may have already known this information. The penalty for
the convenience of not annotating object lifetime manually in the source code is
overhead, which can lead to decreased or uneven performance. Interaction with memory
hierarchy effects can make this overhead intolerable in circumstances that are hard to
predict or to detect in routine testing.
The moment when the garbage is actually collected can be unpredictable, resulting in
stalls scattered throughout a session. Unpredictable stalls can be unacceptable in real-time
environments, in transaction processing, or in interactive programs. Incremental,
concurrent, and real-time garbage collectors address these problems, with varying trade-
offs.
Non-deterministic GC is incompatible with RAII-based management of non-GC
resources. As a result, the need for explicit manual resource management (release/close)
for non-GC resources becomes transitive to composition. That is: in a non-deterministic
GC system, if a resource or a resource-like object requires manual resource management
(release/close), and this object is used as part of another object, then the composed
object will also become a resource-like object that itself requires manual resource
management (release/close).
5.2.4 TRACING GARBAGE COLLECTORS
Tracing garbage collectors are the most common type of garbage collector. They
first determine which objects are reachable (or potentially reachable), and then discard all
remaining objects.
5.2.5 REACHABILITY OF AN OBJECT
Informally, an object is reachable if it is referenced by at least one variable in the
program, either directly or through references from other reachable objects. More
precisely, objects can be reachable in only two ways:
A distinguished set of objects are assumed to be reachable: these are known as the roots.
Typically, these include all the objects referenced from anywhere in the call stack (that is,
all local variables and parameters in the functions currently being invoked), and any
global variables.
Anything referenced from a reachable object is itself reachable; more formally,
reachability is a transitive closure.
The reachability definition of "garbage" is not optimal, insofar as the last time a program
uses an object could be long before that object falls out of the environment scope. A
distinction is sometimes drawn between syntactic garbage, those objects the program
cannot possibly reach, and semantic garbage, those objects the program will in fact never
again use. For example:
Object x = new Foo();
Object y = new Bar();
x = new Quux();
/* At this point, we know that the Foo object
 * originally assigned to x will never be
 * accessed: it is syntactic garbage.
 */
if (x.check_something()) {
    x.do_something(y);
}
System.exit(0);
/* In the above block, y *could* be semantic garbage,
 * but we won't know until x.check_something() returns
 * some value -- if it returns at all.
 */
The problem of precisely identifying semantic garbage can easily be shown to be
partially decidable: a program that allocates an object X, runs an arbitrary input program
P, and uses X if and only if P finishes would require a semantic garbage collector to solve
the halting problem. Although conservative heuristic methods for semantic garbage
detection remain an active research area, essentially all practical garbage collectors focus
on syntactic garbage.
Another complication with this approach is that, in languages with both reference types
and unboxed value types, the garbage collector needs to somehow be able to distinguish
which variables on the stack or fields in an object are regular values and which are
references: in memory, an integer and a reference might look alike. The garbage collector
then needs to know whether to treat the element as a reference and follow it, or whether it
is a primitive value. One common solution is the use of tagged pointers.
5.2.6 STRONG AND WEAK REFERENCES
The garbage collector can reclaim only objects that have no references pointing to
them either directly or indirectly from the root set. However, some programs require
weak references, which should be usable for as long as the object exists but should not
prolong its lifetime. In discussions about weak references, ordinary references are
sometimes called strong references. An object is eligible for garbage collection if there
are no strong (i.e. ordinary) references to it, even though there still might be some weak
references to it.
A weak reference is not merely any pointer to the object that a garbage collector
does not care about. The term is usually reserved for a properly managed category of
special reference objects which are safe to use even after the object disappears because
they lapse to a safe value. An unsafe reference that is not known to the garbage collector
will simply remain dangling by continuing to refer to the address where the object
previously resided. This is not a weak reference.
In some implementations, weak references are divided into subcategories.
For example, the Java Virtual Machine provides three forms of weak references, namely
soft references, phantom references, and regular weak references. A softly referenced
object is only eligible for reclamation if the garbage collector decides that the program is
low on memory. Unlike a soft reference or a regular weak reference, a phantom reference
does not provide access to the object that it references. Instead, a phantom reference is a
mechanism that allows the garbage collector to notify the program when the referenced
object has become phantom reachable. An object is phantom reachable if it still resides
in memory and it is referenced by a phantom reference, but its finalizer has already
executed. Similarly, Microsoft.NET provides two subcategories of weak references,
namely long weak references (tracks resurrection) and short weak references.
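As an illustration of the Java forms, a small runnable sketch using java.lang.ref.WeakReference (note that System.gc() is only a hint, so the exact output may vary):

import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        Object strong = new Object();
        WeakReference<Object> weak = new WeakReference<>(strong);

        System.out.println(weak.get() != null); // true: the strong reference keeps it alive

        strong = null;   // drop the only strong reference
        System.gc();     // request a collection (only a hint to the JVM)

        // If the referent was reclaimed, get() now returns null.
        System.out.println(weak.get());
    }
}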
5.2.7 WEAK COLLECTIONS
Data structures can also be devised which have weak tracking features. For
instance, weak hash tables are useful. Like a regular hash table, a weak hash table
maintains an association between pairs of objects, where each pair is understood to be a
key and value. However, the hash table does not actually maintain a strong reference on
these objects. A special behavior takes place when either the key or value or both become
garbage: the hash table entry is spontaneously deleted. There exist further refinements
such as hash tables which have only weak keys (value references are ordinary, strong
references) or only weak values (key references are strong).
Weak hash tables are important for maintaining associations between objects, such that
the objects engaged in the association can still become garbage if nothing in the program
refers to them any longer (other than the associating hash table).
The use of a regular hash table for such a purpose could lead to a "logical memory leak":
the accumulation of reachable data which the program does not need and will not use.
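Java's WeakHashMap is one such weakly keyed table; a small runnable sketch (GC timing is not guaranteed, so the final size may vary):

import java.util.Map;
import java.util.WeakHashMap;

public class WeakMapDemo {
    public static void main(String[] args) {
        Map<Object, String> table = new WeakHashMap<>();
        Object key = new Object();
        table.put(key, "associated value");
        System.out.println(table.size()); // 1: the key is strongly referenced

        key = null;      // no strong reference to the key remains
        System.gc();     // hint only; collection timing is not guaranteed

        // Once the key is collected, its entry is removed spontaneously.
        System.out.println(table.size()); // likely 0
    }
}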
5.3 BASIC ALGORITHM
Tracing collectors are so called because they trace through the working set of
memory. These garbage collectors perform collection in cycles. A cycle is started when
the collector decides (or is notified) that it needs to reclaim memory, which happens most
often when the system is low on memory. The original method involves a naïve mark-
and-sweep in which the entire memory set is touched several times.
5.3.1 TRI-COLOR MARKING
Because of these pitfalls, most modern tracing garbage collectors implement some variant
of the tri-colour marking abstraction, but simple collectors (such as the mark-and-sweep
collector) often do not make this abstraction explicit. Tri-colour marking works as
follows:
Create initial white, grey, and black sets; these sets will be used to maintain progress
during the cycle.
Initially the white set or condemned set is the set of objects that are candidates for having
their memory recycled.
The black set is the set of objects that can cheaply be proven to have no references to
objects in the white set, but are also not chosen to be candidates for recycling; in many
implementations, the black set starts off empty.
The grey set is all the objects that are reachable from root references but the objects
referenced by grey objects haven't been scanned yet. Grey objects are known to be
reachable from the root, so cannot be garbage collected: grey objects will eventually end
up in the black set. The grey state means we still need to check any objects that the object
references.
The grey set is initialised to objects which are referenced directly at root level; typically
all other objects are initially placed in the white set.
Objects can move from white to grey to black, never in the other direction.
Pick an object from the grey set. Blacken this object (move it to the black set), by greying
all the white objects it references directly. This confirms that this object cannot be
garbage collected, and also that any objects it references cannot be garbage collected.
Repeat the previous step until the grey set is empty.
When there are no more objects in the grey set, then all the objects remaining in the white
set have been demonstrated not to be reachable, and the storage occupied by them can be
reclaimed.
The three sets partition memory; every object in the system, including the root set, is in
precisely one set.
The tri-colour marking algorithm preserves an important invariant:
No black object points directly to a white object.
This ensures that the white objects can be safely destroyed once the grey set is empty.
(Some variations on the algorithm do not preserve the tricolour invariant but they use a
modified form for which all the important properties hold.)
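A minimal sketch of one marking cycle in Java, assuming a hypothetical GcObject type whose outgoing references are exposed as a list:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GcObject {
    List<GcObject> references = new ArrayList<>(); // outgoing references
}

class TriColorMarking {
    // Returns the set of unreachable (white) objects after one marking cycle.
    static Set<GcObject> mark(Set<GcObject> heap, Set<GcObject> roots) {
        Set<GcObject> white = new HashSet<>(heap);      // condemned candidates
        Set<GcObject> black = new HashSet<>();          // scanned, proven reachable
        Deque<GcObject> grey = new ArrayDeque<>(roots); // reachable, not yet scanned
        white.removeAll(roots);                         // roots start in the grey set

        while (!grey.isEmpty()) {
            GcObject obj = grey.pop();                  // pick an object from grey
            for (GcObject ref : obj.references) {
                if (white.remove(ref)) {                // grey each white referent
                    grey.push(ref);
                }
            }
            black.add(obj);                             // blacken the scanned object
        }
        return white; // everything still white is unreachable and may be reclaimed
    }
}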
The tri-colour method has an important advantage: it can be performed 'on-the-fly',
without halting the system for significant time periods. This is accomplished by marking
objects as they are allocated and during mutation, maintaining the various sets. By
monitoring the size of the sets, the system can perform garbage collection periodically,
rather than as-needed. Also, the need to touch the entire working set each cycle is
avoided.
5.4 IMPLEMENTATION STRATEGIES
In order to implement the basic tri-colour algorithm, several important design
decisions must be made, which can significantly affect the performance characteristics of
the garbage collector.
5.4.1 Moving vs. non-moving
Once the unreachable set has been determined, the garbage collector may simply
release the unreachable objects and leave everything else as it is, or it may copy some or
all of the reachable objects into a new area of memory, updating all references to those
objects as needed. These are called "non-moving" and "moving" (or, alternatively, "non-
compacting" and "compacting") garbage collectors, respectively.
At first, a moving GC strategy may seem inefficient and costly compared to the non-
moving approach, since much more work would appear to be required on each cycle. In
fact, however, the moving GC strategy leads to several performance advantages, both
during the garbage collection cycle itself and during actual program execution:
No additional work is required to reclaim the space freed by dead objects; the entire
region of memory from which reachable objects were moved can be considered free
space. In contrast, a non-moving GC must visit each unreachable object and somehow
record that the memory it alone occupied is available.
Similarly, new objects can be allocated very quickly. Since large contiguous regions of
memory are usually made available by the moving GC strategy, new objects can be
allocated by simply incrementing a 'free memory' pointer. A non-moving strategy may,
after some time, lead to a heavily fragmented heap, requiring expensive consultation of
"free lists" of small available blocks of memory in order to allocate new objects.
If an appropriate traversal order is used (such as cdr-first for list conses), objects that
refer to each other frequently can be moved very close to each other in memory,
increasing the likelihood that they will be located in the same cache line or virtual
memory page. This can significantly speed up access to these objects through these
references.
One disadvantage of a moving garbage collector is that it only allows access through
references that are managed by the garbage collected environment, and does not allow
pointer arithmetic. This is because any native pointers to objects will be invalidated when
the garbage collector moves the object (they become dangling pointers). For
interoperability with native code, the garbage collector must copy the object contents to a
location outside of the garbage collected region of memory. An alternative approach is to
pin the object in memory, preventing the garbage collector from moving it and allowing
the memory to be directly shared with native pointers (and possibly allowing pointer
arithmetic).
5.4.2 Copying vs. mark-and-sweep vs. mark-and-don't-sweep
To further refine the distinction, tracing collectors can also be divided by
considering how the three sets of objects (white, grey, and black) are maintained during a
collection cycle.
The most straightforward approach is the semi-space collector, which dates to 1969. In
this moving GC scheme, memory is partitioned into a "from space" and "to space".
Initially, objects are allocated into "to space" until it becomes full and a collection is
triggered. At the start of a collection, the "to space" becomes the "from space", and vice
versa. The objects reachable from the root set are copied from the "from space" to the "to
space". These objects are scanned in turn, and all objects that they point to are copied into
"to space", until all reachable objects have been copied into "to space". Once the program
continues execution, new objects are once again allocated in the "to space" until it is once
again full and the process is repeated. This approach has the advantage of conceptual
simplicity (the three object color sets are implicitly constructed during the copying
process), but the disadvantage that a (possibly) very large contiguous region of free
memory is necessarily required on every collection cycle. This technique is also known
as stop-and-copy. Cheney's algorithm is an improvement on the semi-space collector.
A mark and sweep garbage collector maintains a bit (or two) with each object to record
whether it is white or black; the grey set is either maintained as a separate list (such as the
process stack) or using another bit. As the reference tree is traversed during a collection
cycle (the "mark" phase), these bits are manipulated by the collector to reflect the current
state. A final "sweep" of the memory areas then frees white objects. The mark and sweep
strategy has the advantage that, once the unreachable set is determined, either a moving
or non-moving collection strategy can be pursued; this choice of strategy can even be
made at runtime, as available memory permits. It has the disadvantage of "bloating"
objects by a small amount.
A mark and don't sweep garbage collector, like the mark-and-sweep, maintains a bit with
each object to record whether it is white or black; the grey set is either maintained as a
separate list (such as the process stack) or using another bit. There are two key
differences here. First, black and white mean different things than they do in the mark
and sweep collector. In a "mark and don't sweep" system, all reachable objects are always
black. An object is marked black at the time it is allocated, and it will stay black even if it
becomes unreachable. A white object is unused memory and may be allocated. Second,
the interpretation of the black/white bit can change. Initially, the black/white bit may
have the sense of (0=white, 1=black). If an allocation operation ever fails to find any
available (white) memory, that means all objects are marked used (black). The sense of
the black/white bit is then inverted (for example, 0=black, 1=white). Everything becomes
white. This momentarily breaks the invariant that reachable objects are black, but a full
marking phase follows immediately, to mark them black again. Once this is done, all
unreachable memory is white. No "sweep" phase is necessary.
5.4.3 Generational GC (ephemeral GC)
It has been empirically observed that in many programs, the most recently created
objects are also those most likely to become unreachable quickly (known as infant
mortality or the generational hypothesis). A generational GC (also known as ephemeral
GC) divides objects into generations and, on most cycles, will place only the objects of a
subset of generations into the initial white (condemned) set. Furthermore, the runtime
system maintains knowledge of when references cross generations by observing the
creation and overwriting of references. When the garbage collector runs, it may be able to
use this knowledge to prove that some objects in the initial white set are unreachable
without having to traverse the entire reference tree. If the generational hypothesis holds,
this results in much faster collection cycles while still reclaiming most unreachable
objects.
In order to implement this concept, many generational garbage collectors use separate
memory regions for different ages of objects. When a region becomes full, those few
objects that are referenced from older memory regions are promoted to the next highest
region, and the entire region can then be overwritten with fresh objects.
This technique permits very fast incremental garbage collection, since the garbage
collection of only one region at a time is all that is typically required.
Generational garbage collection is a heuristic approach, and some unreachable objects
may not be reclaimed on each cycle. It may therefore occasionally be necessary to
perform a full mark and sweep or copying garbage collection to reclaim all available
space. In fact, runtime systems for modern programming languages (such as Java and
the .NET Framework) usually use some hybrid of the various strategies that have been
described thus far; for example, most collection cycles might look only at a few
generations, while occasionally a mark-and-sweep is performed, and even more rarely a
full copying is performed to combat fragmentation. The terms "minor cycle" and "major
cycle" are sometimes used to describe these different levels of collector aggression.
5.4.4 Stop-the-world vs. incremental vs. concurrent
Simple stop-the-world garbage collectors completely halt execution of the program to run
a collection cycle, thus guaranteeing that new objects are not allocated and objects do not
suddenly become unreachable while the collector is running.
This has the obvious disadvantage that the program can perform no useful work while a
collection cycle is running (sometimes called the "embarrassing pause"). Stop-the-world
garbage collection is therefore mainly suitable for non-interactive programs.
Its advantage is that it is both simpler to implement and faster than incremental garbage
collection.
Incremental and concurrent garbage collectors are designed to reduce this disruption by
interleaving their work with activity from the main program.
Incremental garbage collectors perform the garbage collection cycle in discrete phases,
with program execution permitted between each phase (and sometimes during some
phases). Concurrent garbage collectors do not stop program execution at all, except
perhaps briefly when the program's execution stack is scanned. However, the sum of the
incremental phases takes longer to complete than one batch garbage collection pass, so
these garbage collectors may yield lower total throughput.
Careful design is necessary with these techniques to ensure that the main program does
not interfere with the garbage collector and vice versa; for example, when the program
needs to allocate a new object, the runtime system may either need to suspend it until the
collection cycle is complete, or somehow notify the garbage collector that there exists a
new, reachable object.
5.4.5 Precise vs. conservative and internal pointers
Some collectors can correctly identify all pointers (references) in an object; these are
called precise (also exact or accurate) collectors, the opposite being a conservative or
partly conservative collector. Conservative collectors assume that any bit pattern in
memory could be a pointer if, interpreted as a pointer, it would point into an allocated
object. Conservative collectors may produce false positives, where unused memory is not
released because of improper pointer identification. This is not always a problem in
practice unless the program handles a lot of data that could easily be misidentified as a
pointer. False positives are generally less problematic on 64-bit systems than on 32-bit
systems because the range of valid memory addresses tends to be a tiny fraction of the
range of 64-bit values. Thus, an arbitrary 64-bit pattern is unlikely to mimic a valid
pointer. Whether a precise collector is practical usually depends on the type safety
properties of the programming language in question. An example for which a
conservative garbage collector would be needed is the C language, which allows typed
(non-void) pointers to be type cast into untyped (void) pointers, and vice versa.
A related issue concerns internal pointers, or pointers to fields within an object. If the
semantics of a language allow internal pointers, then there may be many different
addresses that can refer to parts of the same object, which complicates determining
whether an object is garbage or not. An example of this is the C++ language, in which
multiple inheritance can cause pointers to base objects to have different addresses. In a
tightly optimized program, the corresponding pointer to the object itself may have been
overwritten in its register, so such internal pointers need to be scanned.
5.5 PERFORMANCE IMPLICATIONS
Tracing garbage collectors require some implicit runtime overhead that may be
beyond the control of the programmer, and can sometimes lead to performance problems.
For example, commonly used stop-the-world garbage collectors, which pause program
execution at arbitrary times, may make garbage collection inappropriate for some
embedded systems, high-performance server software, and applications with real-time
needs.
It is difficult to compare the two cases directly, as their behavior depends on the situation.
For example, in the best case for a garbage collecting system, allocation just increments a
pointer, but in the best case for manual heap allocation, the allocator maintains freelists of
specific sizes and allocation only requires following a pointer. However, this size
segregation usually causes a large degree of external fragmentation, which can have an
adverse impact on cache behavior. Memory allocation in a garbage collected language
may be implemented using heap allocation behind the scenes (rather than simply
incrementing a pointer), so the performance advantages listed above don't necessarily
apply in this case. In some situations, most notably embedded systems, it is possible to
avoid both garbage collection and heap management overhead by preallocating pools of
memory and using a custom, lightweight scheme for allocation/deallocation.
The overhead of write barriers is more likely to be noticeable in an imperative-style
program which frequently writes pointers into existing data structures than in a
functional-style program which constructs data structures only once and never changes them.
Some advances in garbage collection can be understood as reactions to performance
issues. Early collectors were stop-the-world collectors, but the pauses introduced by this
approach proved disruptive in interactive applications. Incremental collection avoided this
disruption, but at the cost of decreased efficiency due to the need for barriers.
Generational collection techniques are used with both stop-the-world and incremental
collectors to increase performance; the trade-off is that some garbage is not detected as
such for longer than normal.
5.5.1 Determinism
Tracing garbage collection is not deterministic in the timing of object finalization.
An object which becomes eligible for garbage collection will usually be cleaned up
eventually, but there is no guarantee when (or even if) that will happen. This is an issue
for program correctness when objects are tied to non-memory resources, whose release is
an externally visible program behavior, such as closing a network connection, releasing a
device or closing a file. One garbage collection technique which provides determinism in
this regard is reference counting.
Garbage collection can have a nondeterministic impact on execution time, by potentially
introducing pauses into the execution of a program which are not correlated with the
algorithm being processed. Under tracing garbage collection, the request to allocate a
new object can sometimes return quickly and at other times trigger a lengthy garbage
collection cycle. Under reference counting, allocation of objects is usually fast, but
decrementing a reference is nondeterministic, since a count may reach zero, triggering
recursion to decrement the reference counts of other objects which that object holds.
5.5.2 Real-time garbage collection
While garbage collection is generally nondeterministic, it is possible to use it in
hard real-time systems. A real-time garbage collector should guarantee that even in the
worst case it will dedicate a certain number of computational resources to mutator
threads. Constraints imposed on a real-time garbage collector are usually either work
based or time based. A time based constraint would look like: within each time window
of duration T, mutator threads should be allowed to run at least for Tm time. For work
based analysis, MMU (minimal mutator utilization) is usually used as a real time
constraint for the garbage collection algorithm.
One of the first implementations of real-time garbage collection for the JVM was work
on the Metronome algorithm. There are other commercial implementations.
5.5.3 Reference counting
Reference counting is a form of garbage collection whereby each object has a
count of the number of references to it. Garbage is identified by having a reference count
of zero. An object's reference count is incremented when a reference to it is created, and
decremented when a reference is destroyed. The object's memory is reclaimed when the
count reaches zero.
Compared to tracing garbage collection, reference counting guarantees that objects are
destroyed as soon as they become unreachable (assuming that there are no reference
cycles), and usually only accesses memory which is either in CPU caches, in objects to
be freed, or directly pointed to by those, and thus tends not to have significant negative side
effects on CPU cache and virtual memory operation.
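To make the mechanism concrete, the following is a minimal C++ sketch of intrusive reference counting; the RefCounted, Node, retain, and release names are invented for this example.

    #include <cstdio>

    // Each object carries its own count; retain() increments it and
    // release() decrements it, reclaiming the object at zero.
    struct RefCounted {
        int refCount = 1;          // the creating reference counts as one
        virtual ~RefCounted() {}
    };

    void retain(RefCounted* obj)  { ++obj->refCount; }

    void release(RefCounted* obj) {
        if (--obj->refCount == 0)  // last reference gone: reclaim immediately
            delete obj;
    }

    struct Node : RefCounted {
        int value;
        explicit Node(int v) : value(v) {}
        ~Node() override { std::printf("Node %d reclaimed\n", value); }
    };

    int main() {
        Node* n = new Node(42);    // refCount == 1
        retain(n);                 // a second reference: refCount == 2
        release(n);                // refCount == 1, still alive
        release(n);                // refCount == 0, destructor runs here
    }

Note that destruction happens at the exact moment the last reference is released, which is the determinism discussed in section 5.5.1.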
There are some disadvantages to reference counting:
If two or more objects refer to each other, they can create a cycle whereby neither will be
collected as their mutual references never let their reference counts become zero. Some
garbage collection systems using reference counting (like the one in CPython) use
specific cycle-detecting algorithms to deal with this issue.
Another strategy is to use weak references for the "backpointers" which create cycles.
Under reference counting, a weak reference is similar to a weak reference under a tracing
garbage collector. It is a special reference object whose existence does not increment the
reference count of the referent object. Furthermore, a weak reference is safe in that when
the referent object becomes garbage, any weak reference to it lapses, rather than being
permitted to remain dangling, meaning that it turns into a predictable value, such as a null
reference.
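The standard C++ library exposes this pattern through std::shared_ptr and std::weak_ptr; the following hedged sketch (the Parent and Child types are invented for the example) shows a weak backpointer breaking what would otherwise be a cycle.

    #include <memory>

    struct Child;

    struct Parent {
        std::shared_ptr<Child> child;   // strong reference: keeps Child alive
    };

    struct Child {
        std::weak_ptr<Parent> parent;   // weak backpointer: adds no count
    };

    int main() {
        auto p = std::make_shared<Parent>();
        auto c = std::make_shared<Child>();
        p->child = c;
        c->parent = p;    // no cycle of strong references is formed

        p.reset();        // Parent's count reaches zero: it is destroyed,
                          // which in turn drops Child's count
        c.reset();        // Child's count reaches zero: it is destroyed

        // Any surviving weak reference has now "lapsed": calling lock()
        // on it yields an empty (null) shared_ptr rather than dangling.
    }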
In naive implementations, each assignment of a reference and each reference falling out
of scope often require modifications of one or more reference counters. However, in the
common case, when a reference is copied from an outer scope variable into an inner
scope variable, such that the lifetime of the inner variable is bounded by the lifetime of
the outer one, the reference incrementing can be eliminated. The outer variable "owns"
the reference. In the programming language C++, this technique is readily implemented
and demonstrated with the use of const references. Reference counting in C++ is usually
implemented using "smart pointers" whose constructors, destructors and assignment
operators manage the references. A smart pointer can be passed by reference to a
function, which avoids the need to copy-construct a new reference (which would increase
the reference count on entry into the function and decrease it on exit). Instead the
function receives a reference to the smart pointer which is produced inexpensively.
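A brief sketch of that pass-by-reference idiom, with a hypothetical Widget type:

    #include <cstdio>
    #include <memory>

    struct Widget { int id = 7; };

    // Taking a const reference to the smart pointer itself avoids
    // copy-constructing a new shared_ptr, so no reference count is
    // incremented on entry or decremented on exit.
    void useWidget(const std::shared_ptr<Widget>& w) {
        std::printf("widget %d, count %ld\n", w->id, w.use_count());
    }

    int main() {
        auto w = std::make_shared<Widget>();
        useWidget(w);   // prints count 1: no extra reference was created
    }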
When used in a multithreaded environment, these modifications (increment and
decrement) may need to be atomic operations such as compare-and-swap, at least for any
objects which are shared, or potentially shared among multiple threads. Atomic
operations are expensive on a multiprocessor, and even more expensive if they have to be
emulated with software algorithms. It is possible to avoid this issue by adding per-thread
or per-CPU reference counts and only accessing the global reference count when the
local reference counts become or are no longer zero (or, alternatively, using a binary tree
of reference counts, or even giving up deterministic destruction in exchange for not
having a global reference count at all), but this adds significant memory overhead and
thus tends to be only useful in special cases (it's used, for example, in the reference
counting of Linux kernel modules).
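As a minimal illustration of such atomic updates, a thread-safe count can be kept in a C++ std::atomic, whose fetch_add/fetch_sub are atomic read-modify-write operations of the kind mentioned above; the SharedBlock name is invented for the example.

    #include <atomic>

    struct SharedBlock {
        std::atomic<int> refCount{1};
    };

    void retain(SharedBlock* b) {
        // Incrementing needs no ordering guarantees of its own.
        b->refCount.fetch_add(1, std::memory_order_relaxed);
    }

    bool release(SharedBlock* b) {
        // Acquire-release ordering makes the final decrement synchronize
        // with all earlier releases before the object is reclaimed.
        // Returns true when the caller held the last reference.
        return b->refCount.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }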
Naive implementations of reference counting do not in general provide real-time
behavior, because any pointer assignment can potentially cause a number of objects
bounded only by total allocated memory size to be recursively freed while the thread is
unable to perform other work. It is possible to avoid this issue by delegating the freeing
of objects whose reference count dropped to zero to other threads, at the cost of extra
overhead.
5.5.4 Escape analysis
Escape analysis can be used to convert heap allocations to stack allocations, thus
reducing the amount of work needed to be done by the garbage collector.
5.5.5 Compile-time
Compile-time garbage collection is a form of static analysis allowing memory to
be reused and reclaimed based on invariants known during compilation. This form of
garbage collection has been studied in the Mercury programming language
5.5.6 Availability
Generally speaking, higher-level programming languages are more likely to have
garbage collection as a standard feature. In languages that do not have built in garbage
collection, it can often be added through a library, as with the Boehm garbage collector
for C and C++. This approach is not without drawbacks, such as changing object creation
and destruction mechanisms.
Most functional programming languages, such as ML, Haskell, and APL, have garbage
collection built in. Lisp, which introduced functional programming, is especially notable
for introducing this mechanism.
Other dynamic languages, such as Ruby (but not Perl 5, or PHP, which use reference
counting), also tend to use GC. Object-oriented programming languages such as
Smalltalk, Java and ECMAScript usually provide integrated garbage collection. Notable
exceptions are C++ and Delphi which have destructors. Objective-C has not traditionally
had it, but ObjC 2.0 as implemented by Apple for Mac OS X uses a runtime collector
developed in-house, while the GNUstep project uses a Boehm collector.
Historically, languages intended for beginners, such as BASIC and Logo, have often used
garbage collection for heap-allocated variable-length data types, such as strings and lists,
so as not to burden programmers with manual memory management. On early
microcomputers, with their limited memory and slow processors, BASIC garbage
collection could often cause apparently random, inexplicable pauses in the midst of
program operation.
Some BASIC interpreters, such as Applesoft BASIC on the Apple II family, repeatedly
scanned the string descriptors for the string having the highest address in order to
compact it toward high memory, resulting in O(N*N) performance, which could
introduce minutes-long pauses in the execution of string-intensive programs. A
replacement garbage collector for Applesoft BASIC published in Call-A.P.P.L.E.
(January 1981, pages 40–45, Randy Wigginton) identified a group of strings in every
pass over the heap, which cut collection time dramatically. BASIC.System, released with
ProDOS in 1983, provided a windowing garbage collector for BASIC that reduced most
collections to a fraction of a second.
5.6 LIMITED ENVIRONMENTS
Garbage collection is rarely used on embedded or real-time systems because of the
perceived need for very tight control over the use of limited resources. However, garbage
collectors compatible with such limited environments have been developed.
The Microsoft .NET Micro Framework and Java Platform, Micro Edition are embedded
software platforms that, like their larger cousins, include garbage collection.
MODULE SIX
HASH FUNCTION
A hash function is any algorithm or subroutine that maps large data sets of variable
length to smaller data sets of a fixed length. For example, a person's name, having a
variable length, could be hashed to a single integer. The values returned by a hash
function are called hash values, hash codes, hash sums, checksums or simply hashes.
6.1 DESCRIPTIONS
Hash functions are mostly used to accelerate table lookup or data comparison
tasks such as finding items in a database, detecting duplicated or similar records in a
large file, finding similar stretches in DNA sequences, and so on.
A hash function should be referentially transparent (stable), i.e., if called twice on
input that is "equal" (for example, strings that consist of the same sequence of
characters), it should give the same result. This is a contract in many programming
languages that allow the user to override equality and hash functions for an object: if two
objects are equal, their hash codes must be the same. This is crucial to finding an element
in a hash table quickly, because two equal elements will always hash to the same
slot.
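The contract can be illustrated with a short C++ sketch; the Point type is invented for the example, and its hash is computed from exactly the fields that operator== compares, so equal objects necessarily receive equal hash codes.

    #include <cstddef>
    #include <functional>

    struct Point {
        int x, y;
        bool operator==(const Point& o) const { return x == o.x && y == o.y; }
    };

    namespace std {
        template <> struct hash<Point> {
            size_t operator()(const Point& p) const {
                // Combine the two field hashes; 31 is a conventional
                // multiplier, not a requirement.
                size_t h = hash<int>{}(p.x);
                return h * 31 + hash<int>{}(p.y);
            }
        };
    }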
All hash functions that map a larger set of data to a smaller set of data cause
collisions. Such hash functions try to map the keys to the hash values as evenly as
possible because collisions become more frequent as hash tables fill up. For this reason,
hash tables are frequently kept no more than about 80% full. Depending on the
collision-resolution algorithm used (for example, double hashing or linear probing), other
properties may be required as well. Although the idea was conceived in the 1950s, the design of good hash
functions is still a topic of active research.
Hash functions are related to (and often confused with) checksums, check digits,
fingerprints, randomization functions, error correcting codes, and cryptographic hash
functions. Although these concepts overlap to some extent, each has its own uses and
requirements and is designed and optimized differently. The HashKeeper database
maintained by the American National Drug Intelligence Center, for instance, is more
aptly described as a catalog of file fingerprints than of hash values.
6.2 HASH TABLES
Hash functions are primarily used in hash tables, to quickly locate a data record
(e.g., a dictionary definition) given its search key (the headword). Specifically, the hash
function is used to map the search key to an index; the index gives the place in the hash
table where the corresponding record should be stored. Hash tables, in turn, are used to
implement associative arrays and dynamic sets.
Typically, the domain of a hash function (the set of possible keys) is larger than its
range (the number of different table indexes), and so it will map several different keys to
the same index. Therefore, each slot of a hash table is associated with (implicitly or
explicitly) a set of records, rather than a single record. For this reason, each slot of a hash
table is often called a bucket, and hash values are also called bucket indices.
Thus, the hash function only hints at the record's location—it tells where one
should start looking for it. Still, in a half-full table, a good hash function will typically
narrow the search down to only one or two entries.
Caches
Hash functions are also used to build caches for large data sets stored in slow media. A
cache is generally simpler than a hashed search table, since any collision can be resolved
by discarding or writing back the older of the two colliding items. This is also used in file
comparison.
Bloom filters
Hash functions are an essential ingredient of the Bloom filter, a space-efficient
probabilistic data structure that is used to test whether an element is a member of a set.
Finding duplicate records
When storing records in a large unsorted file, one may use a hash function to map each
record to an index into a table T, and collect in each bucket T[i] a list of the numbers of
all records with the same hash value i. Once the table is complete, any two duplicate
records will end up in the same bucket. The duplicates can then be found by scanning
every bucket T[i] which contains two or more members, fetching those records, and
comparing them. With a table of appropriate size, this method is likely to be much faster
than any alternative approach (such as sorting the file and comparing all consecutive
pairs).
Finding similar records
Hash functions can also be used to locate table records whose key is similar, but not
identical, to a given key; or pairs of records in a large file which have similar keys. For
that purpose, one needs a hash function that maps similar keys to hash values that differ
by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record
numbers, using such a hash function, then similar records will end up in the same bucket,
or in nearby buckets. Then one need only check the records in each bucket T[i] against
those in buckets T[i+k] where k ranges between −m and m.
This class includes the so-called acoustic fingerprint algorithms, which are used to locate
similar-sounding entries in large collections of audio files. For this application, the hash
function must be as insensitive as possible to data capture or transmission errors, and to
"trivial" changes such as timing and volume changes, compression, etc.
Finding similar substrings
The same techniques can be used to find equal or similar stretches in a large collection of
strings, such as a document repository or a genomic database. In this case, the input
strings are broken into many small pieces, and a hash function is used to detect
potentially equal pieces, as above.
The Rabin–Karp algorithm is a relatively fast string searching algorithm that works in
O(n) time on average. It is based on the use of hashing to compare strings.
6.2.1 GEOMETRIC HASHING
This principle is widely used in computer graphics, computational geometry and
many other disciplines, to solve many proximity problems in the plane or in three-
dimensional space, such as finding closest pairs in a set of points, similar shapes in a list
of shapes, similar images in an image database, and so on. In these applications, the set of
all inputs is some sort of metric space, and the hashing function can be interpreted as a
partition of that space into a grid of cells. The table is often an array with two or more
indices (called a grid file, grid index, bucket grid, and similar names), and the hash
function returns an index tuple. This special case of hashing is known as geometric
hashing or the grid method. Geometric hashing is also used in telecommunications
(usually under the name vector quantization) to encode and compress multi-dimensional
signals.
6.2.3 PROPERTIES
Good hash functions, in the original sense of the term, are usually required to
satisfy certain properties listed below. Note that different requirements apply to the other
related concepts (cryptographic hash functions, checksums, etc.).
6.3 DETERMINISM
A hash procedure must be deterministic—meaning that for a given input value it
must always generate the same hash value. In other words, it must be a function of the
data to be hashed, in the mathematical sense of the term. This requirement excludes hash
functions that depend on external variable parameters, such as pseudo-random number
generators or the time of day. It also excludes functions that depend on the memory
address of the object being hashed, because that address may change during execution
(as may happen on systems that use certain methods of garbage collection), although
sometimes rehashing of the item is possible.
6.4 UNIFORMITY
A good hash function should map the expected inputs as evenly as possible over
its output range. That is, every hash value in the output range should be generated with
roughly the same probability. The reason for this last requirement is that the cost of
hashing-based methods goes up sharply as the number of collisions—pairs of inputs that
are mapped to the same hash value—increases. Basically, if some hash values are more
likely to occur than others, a larger fraction of the lookup operations will have to search
through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in
any sense. A good randomizing function is (barring computational efficiency concerns)
generally a good choice as a hash function, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club
membership list may contain only a hundred or so member names, out of the very large
set of all possible names. In these cases, the uniformity criterion should hold for almost
all typical subsets of entries that may be found in the table, not just for the global set of
all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a
bucket receiving many more than m/n records should be vanishingly small. In particular,
if m is less than n, very few buckets should have more than one or two records. (In an
ideal "perfect hash function", no bucket should have more than one record; but a small
number of collisions is virtually inevitable, even if n is much larger than m – see the
birthday paradox).
When testing a hash function, the uniformity of the distribution of hash values can be
evaluated by the chi-squared test.
6.5 VARIABLE RANGE
In many applications, the range of hash values may be different for each run of the
program, or may change along the same run (for instance, when a hash table needs to be
expanded). In those situations, one needs a hash function which takes two parameters—
the input data z, and the number n of allowed hash values.
A common solution is to compute a fixed hash function with a very large range (say, 0 to
2^32 − 1), divide the result by n, and use the division's remainder. If n is itself a power of
2, this can be done by bit masking and bit shifting. When this approach is used, the hash
function must be chosen so that the result has fairly uniform distribution between 0 and n
− 1, for any value of n that may occur in the application. Depending on the function, the
remainder may be uniform only for certain values of n, e.g. odd or prime numbers.
We can allow the table size n to not be a power of 2 and still not have to perform any
remainder or division operation, as these computations are sometimes costly.
For example, let n be significantly less than 2^b. Consider a pseudo-random number
generator (PRNG) function P(key) that is uniform on the interval [0, 2^b − 1]. A hash
function uniform on the interval [0, n − 1] is n · P(key) / 2^b. We can replace the division by a
(possibly faster) right bit shift: (n · P(key)) >> b.
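The following C++ sketch (function names invented for the example) shows the three reductions described in this section: the general remainder, the bit mask that works only for power-of-2 table sizes, and the multiply-and-shift that replaces the division, here with b = 32.

    #include <cstdint>

    uint32_t byRemainder(uint32_t h, uint32_t n) {
        return h % n;                 // works for any n; division may be slow
    }

    uint32_t byMasking(uint32_t h, uint32_t n) {
        return h & (n - 1);           // valid only when n is a power of 2
    }

    uint32_t byMultiplyShift(uint32_t h, uint32_t n) {
        // n * h / 2^32, computed with a 64-bit product and a right shift;
        // maps a uniform h in [0, 2^32 - 1] near-uniformly onto [0, n - 1].
        return (uint32_t)(((uint64_t)n * h) >> 32);
    }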
6.5.1 Variable Range with Minimal Movement (Dynamic Hash Function)
When the hash function is used to store values in a hash table that outlives the run
of the program, and the hash table needs to be expanded or shrunk, the hash table is
referred to as a dynamic hash table.
A hash function that will relocate the minimum number of records when the table is
resized is desirable. What is needed is a hash function H(z,n) – where z is the key being
hashed and n is the number of allowed hash values – such that H(z,n + 1) = H(z,n) with
probability close to n/(n + 1).
Linear hashing and spiral storage are examples of dynamic hash functions that execute in
constant time but relax the property of uniformity to achieve the minimal movement
property.
Extendible hashing uses a dynamic hash function that requires space proportional to n to
compute the hash function, and it becomes a function of the previous keys that have been
inserted.
Several algorithms that preserve the uniformity property but require time proportional to
n to compute the value of H(z,n) have been invented.
Data normalization
In some applications, the input data may contain features that are irrelevant for
comparison purposes. For example, when looking up a personal name, it may be
desirable to ignore the distinction between upper and lower case letters. For such data,
one must use a hash function that is compatible with the data equivalence criterion being
used: that is, any two inputs that are considered equivalent must yield the same hash
value. This can be accomplished by normalizing the input before hashing it, as by upper-
casing all letters.
Continuity
A hash function that is used to search for similar (as opposed to equivalent) data must be
as continuous as possible; two inputs that differ by a little should be mapped to equal or
nearly equal hash values.
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash
functions, and other related concepts. Continuity is desirable for hash functions only in
some applications, such as hash tables that use linear search.
Hash function algorithms
For most types of hashing functions the choice of the function depends strongly on the
nature of the input data, and their probability distribution in the intended application.
Trivial hash function
If the datum to be hashed is small enough, one can use the datum itself (reinterpreted as
an integer in binary notation) as the hashed value. The cost of computing this "trivial"
(identity) hash function is effectively zero. This hash function is perfect, as it maps each
input to a distinct hash value.
The meaning of "small enough" depends on the size of the type that is used as the hashed
value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer
Integer and 32-bit floating-point Float objects can simply use the value directly; whereas
the 64-bit integer Long and 64-bit floating-point Double cannot use this method.
Other types of data can also use this perfect hashing scheme. For example, when mapping
character strings between upper and lower case, one can use the binary encoding of each
character, interpreted as an integer, to index a table that gives the alternative form of that
character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in ASCII
or ISO Latin 1), the table has only 2^8 = 256 entries; in the case of Unicode characters, the
table would have 17 × 2^16 = 1,114,112 entries.
The same technique can be used to map two-letter country codes like "us" or "za" to
country names (26^2 = 676 table entries), 5-digit zip codes like 13083 to city names
(100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code
00000) may be left undefined in the table, or mapped to some appropriate "null" value.
Perfect hashing
A hash function that is injective—that is, maps each valid input to a different hash value
—is said to be perfect. With such a function one can directly locate the desired entry in a
hash table, without any additional searching.
Minimal perfect hashing
A perfect hash function for n keys is said to be minimal if its range consists of n
consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a
minimal perfect hash function also yields a compact hash table, without any vacant slots.
Minimal perfect hash functions are much harder to find than perfect ones with a wider
range.
Hashing uniformly distributed data
If the inputs are bounded-length strings (such as telephone numbers, car license plates,
invoice numbers, etc.), and each input may independently occur with uniform probability,
then a hash function need only map roughly the same number of inputs to each hash
value. For instance, suppose that each input is an integer z in the range 0 to N−1, and the
output must be an integer h in the range 0 to n−1, where N is much larger than n. Then
the hash function could be h = z mod n (the remainder of z divided by n), or h = (z × n) ÷
N (the value z scaled down by n/N and truncated to an integer), or many other formulas.
Warning: h = z mod n was used in many of the original random number generators, but
was found to have a number of issues, one of which is that as n approaches N, this
function becomes less and less uniform.
Hashing data with other distributions
These simple formulas will not do if the input values are not equally likely, or are not
independent. For instance, most patrons of a supermarket will live in the same geographic
area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that
case, if n is 10000 or so, the division formula (z × n) ÷ N, which depends mainly on the
leading digits, will generate a lot of collisions; whereas the remainder formula z mod n,
which is quite sensitive to the trailing digits, may still yield a fairly even distribution.
Hashing variable-length data
When the data values are long (or variable-length) character strings—such as personal
names, web page addresses, or mail messages—their distribution is usually very uneven,
with complicated dependencies. For example, text in any natural language has highly
non-uniform distributions of characters, and character pairs, very characteristic of the
language. For such data, it is prudent to use a hash function that depends on all characters
of the string—and depends on each character in a different way.
In cryptographic hash functions, a Merkle–Damgård construction is usually used. In
general, the scheme for hashing such data is to break the input into a sequence of small
units (bits, bytes, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially,
as follows
S ← S0;                        // Initialize the state.
for k in 1, 2, ..., m do       // Scan the input data units:
    S ← F(S, b[k]);            // Combine data unit k into the state.
return G(S, n)                 // Extract the hash value from the state.
This schema is also used in many text checksum and fingerprint algorithms. The state
variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S,n)
can be just S mod n. The best choice of F is a complex issue and depends on the nature of
the data.
If the units b[k] are single bits, then F(S,b) could be, for instance
if highbit(S) = 0 then
    return 2 * S + b
else
    return (2 * S + b) ^ P
Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned
integer multiplication with lost overflow; '^' is the bitwise exclusive or operation applied
to words; and P is a suitable fixed word.
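One concrete instance of this schema (shown as an illustration, not as a prescribed function) is the well-known FNV-1a byte hash: S0 is a fixed "offset basis", and F XORs each byte into the state and then multiplies by a fixed prime.

    #include <cstddef>
    #include <cstdint>

    uint32_t fnv1a(const unsigned char* data, std::size_t len, uint32_t n) {
        uint32_t S = 2166136261u;         // S0, the FNV offset basis
        for (std::size_t k = 0; k < len; ++k) {
            S ^= data[k];                 // combine data unit k into the state
            S *= 16777619u;               // mix with the FNV prime
        }
        return S % n;                     // G(S, n): extract the hash value
    }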
Special-purpose hash functions
In many cases, one can design a special-purpose (heuristic) hash function that yields
many fewer collisions than a good general-purpose hash function. For example, suppose
that the input data are file names such as FILE0000.CHK, FILE0001.CHK,
FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that
extracts the numeric part k of the file name and returns k mod n would be nearly optimal.
Needless to say, a function that is exceptionally good for a specific kind of data may have
dismal performance on data with different distribution.
Rolling hash
In some applications, such as substring search, one must compute a hash function h for
every k-character substring of a given n-character string t, where k is a fixed integer and
n > k. The straightforward solution, which is to extract every such substring s of t and
compute h(s) separately, requires a number of operations proportional to k·n. However,
with the proper choice of h, one can use the technique of rolling hash to compute all those
hashes with an effort proportional to k + n.
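A hedged C++ sketch of such a rolling hash in the Rabin-Karp style follows; it returns the hashes of every k-character window of t using a number of operations proportional to k + n. The base B and modulus M are arbitrary choices for the example.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    std::vector<uint64_t> rollingHashes(const std::string& t, std::size_t k) {
        const uint64_t B = 257;               // base (assumption)
        const uint64_t M = 1000000007ull;     // prime modulus (assumption)
        std::vector<uint64_t> out;
        if (k == 0 || t.size() < k) return out;

        uint64_t h = 0, Bk = 1;               // Bk will hold B^(k-1) mod M
        for (std::size_t i = 0; i < k; ++i) {
            h = (h * B + (unsigned char)t[i]) % M;
            if (i + 1 < k) Bk = (Bk * B) % M;
        }
        out.push_back(h);                     // hash of the first window
        for (std::size_t i = k; i < t.size(); ++i) {
            // Slide the window in O(1): drop t[i-k], append t[i].
            h = (h + M - ((unsigned char)t[i - k] * Bk) % M) % M;
            h = (h * B + (unsigned char)t[i]) % M;
            out.push_back(h);
        }
        return out;
    }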
Universal hashing
A universal hashing scheme is a randomized algorithm that selects a hashing function h
among a family of such functions, in such a way that the probability of a collision of any
two distinct keys is at most 1/n, where n is the number of distinct hash values desired—
independently of the two keys. Universal hashing ensures (in a probabilistic sense) that
the hash function application will behave as well as if it were using a random function,
for any distribution of the input data. It will however have more collisions than perfect
hashing, and may require more operations than a special-purpose hash function.
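For illustration, here is a sketch of the classic universal family h(x) = ((a·x + b) mod p) mod n, with p a prime larger than any key and a, b drawn at random once per table; the particular prime and the names are assumptions for the example.

    #include <cstdint>
    #include <random>

    struct UniversalHash {
        static constexpr uint64_t p = 2147483647ull;  // the prime 2^31 - 1
        uint64_t a, b, n;

        UniversalHash(uint64_t n_, std::mt19937_64& rng) : n(n_) {
            a = rng() % (p - 1) + 1;    // a drawn uniformly from [1, p-1]
            b = rng() % p;              // b drawn uniformly from [0, p-1]
        }

        // Universality holds for keys x < p; the 64-bit arithmetic below
        // cannot overflow, since a*x + b < 2^63 + 2^31.
        uint64_t operator()(uint32_t x) const {
            return ((a * x + b) % p) % n;
        }
    };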
Hashing with checksum functions
One can adapt certain checksum or fingerprinting algorithms for use as hash functions.
Some of those algorithms will map arbitrarily long string data z, with any typical real-
world distribution—no matter how non-uniform and dependent—to a 32-bit or 64-bit
string, from which one can extract a hash value in 0 through n − 1.
This method may produce a sufficiently uniform distribution of hash values, as long as
the hash range size n is small compared to the range of the checksum or fingerprint
function. However, some checksums fare poorly in the avalanche test, which may be a
concern in some applications. In particular, the popular CRC32 checksum provides only
16 bits (the higher half of the result) that are usable for hashing.
Moreover, each bit of the input has a deterministic effect on each bit of the CRC32; that
is, one can tell, without looking at the rest of the input, which bits of the output will flip if
the input bit is flipped; so care must be taken to use all 32 bits when computing the hash
from the checksum.
Hashing with cryptographic hash functions
Some cryptographic hash functions, such as SHA-1, have even stronger uniformity
guarantees than checksums or fingerprints, and thus can provide very good general-
purpose hashing functions.
In ordinary applications, this advantage may be too small to offset their much higher cost.
However, this method can provide uniformly distributed hashes even when the keys are
chosen by a malicious agent. This feature may help to protect services against denial of
service attacks.
Hashing By Nonlinear Table Lookup
Tables of random numbers (such as 256 random 32 bit integers) can provide high-quality
nonlinear functions to be used as hash functions or for other purposes such as
cryptography. The key to be hashed would be split into 8-bit (one byte) parts and each
part will be used as an index for the nonlinear table. The table values will be added by
arithmetic or XOR addition to the hash output value. Because the table is just 1024 bytes
in size, it will fit into the cache of modern microprocessors and allow for very fast
execution of the hashing algorithm. As the table value is on average much longer than 8
bits, one bit of input will affect nearly all output bits. This is different from multiplicative
hash functions where higher-value input bits do not affect lower-value output bits.
This algorithm has proven to be very fast and of high quality for hashing purposes
(especially hashing of integer number keys).
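A small C++ sketch of this table-lookup technique follows, assuming a table of 256 random 32-bit integers filled once at start-up; the rotation step is one possible way to make byte position matter, and all names are invented for the example.

    #include <cstddef>
    #include <cstdint>
    #include <random>

    static uint32_t lookup[256];          // 256 x 4 bytes = 1024 bytes

    void initLookup(uint32_t seed) {
        std::mt19937 rng(seed);
        for (int i = 0; i < 256; ++i) lookup[i] = rng();
    }

    uint32_t tableHash(const unsigned char* key, std::size_t len) {
        uint32_t h = 0;
        for (std::size_t i = 0; i < len; ++i) {
            h = (h << 5) | (h >> 27);     // rotate so byte position matters
            h ^= lookup[key[i]];          // nonlinear table lookup, XOR-folded
        }
        return h;
    }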
Efficient Hashing Of Strings
Modern microprocessors allow for much faster processing if 8-bit character strings
are not hashed by processing one character at a time, but by interpreting the string as an
array of 32 bit or 64 bit integers and hashing/accumulating these "wide word" integer
values by means of arithmetic operations (e.g. multiplication by constant and bit-
shifting). The remaining characters of the string which are smaller than the word length
of the CPU must be handled differently (e.g. being processed one character at a time).
This approach has proven to speed up hash code generation by a factor of five or more on
modern microprocessors with a word size of 64 bits.
A far better approach for converting strings to a numeric value that avoids the problem
with some strings having great similarity ("Aaaaaaaaaa" and "Aaaaaaaaab") is to use a
Cyclic redundancy check (CRC) of the string to compute a 32- or 64-bit value.
MODULE SEVEN
HASH CODING AND HASH TABLE
Hashing is a method of storing records according to their key values. It provides access to
stored records in expected constant time, O(1), which makes it competitive with, and often faster than, B-trees in searching speed.
Therefore, hash tables are used for:
a) Storing a file record by record.
b) Searching for records with certain key values.
In hash tables, the main idea is to distribute the records uniquely on a table, according to
their key values. We take the key and we use a function to map the key into one location
of the array: f(key)=h, where h is the hash address of that record in the hash table.
If the size of the table is n, say array [1..n], we have to find a function which will give
numbers between 1 and n only.
Each entry of the table is called a bucket. In general, one bucket may contain more than
one (say r) records. In our discussions we shall assume r=1 and each bucket holds exactly
one record.
7.1. DEFINITIONS
Key density: the ratio n/T, where n is the number of keys in use and T is the total
number of possible key values.
Two key values are synonyms with respect to f, if f(key1)=f(key2).
Synonyms are entered into the same bucket if r>1 and there is space in that bucket.
When a key is mapped by f into a full bucket this is an overflow.
When two non-identical keys are mapped into the same bucket, this is a collision.
The hash function f;
a) Must be easy to compute,
b) Must be a uniform hash function. (a random key value should have an equal chance of
hashing into any of the n buckets.)
c) Should minimize the number of collisions.
Some hash functions used in practical applications :
1) f(key) = key mod n can be a hash function; however, n should never be a power of 2.
Ideally, n should be a prime number.
2) Ex-or'ing the first and the last m bits of the key:
Notice that the hash table will now have size n = 2^m, which is a power of 2.
3) Mid-squaring:
a) take the square of the key.
b) then use m bits from the middle of the square to compute the hash address.
4) Folding:
The key is partitioned into several parts. All except the last part have the same length.
These parts are added together to obtain the hash address for the key. There are two ways
of doing this addition.
a) Add the parts directly
b) Fold at the boundaries.
Example: key = 12320324111220, part length = 3:

    123 | 203 | 241 | 112 | 20
     P1    P2    P3    P4    P5

a) Adding the parts directly:      b) Folding at the boundaries
                                      (alternate parts are reversed):
      123                                123
      203                                302
      241                                241
      112                                211
    +  20                              +  20
    -----                              -----
      699                                897
Handling Collisions - Overflows :
Consider r = 1, so there is one slot per bucket. All slots must be initialized to 'empty'
(for instance, zero or minus one may denote empty).
1) Linear probing:
- When there is a collision, we probe the following locations one by one until an empty
slot is found (a code sketch of this strategy is given after the list of methods below).
- When we reach the end of the table, we go back to location 0.
- Finding the first empty location will sometimes take a lot of time.
- Also, in searching for a specific key value, we have to continue the search until we find
an empty location, if that key value is not found at the calculated hash address.
2) Random probing
When there is a collision, we start a (pseudo) random number generator.
For example;
f(key1)=3
f(key2)=3 → collision
Then, start the pseudo random number generator and get a number, say 7. Add 3+7=10
and store key2 at location 10.
The pseudo-random number i is generated by using the hash address that causes the
collision. It should generate numbers between 1 and n and it should not repeat a number
before all the numbers between 1 and n are generated exactly once.
In searching, given the same hash address, for example 3, it will give us the same number
7, so key2 shall be found at location 10.
We carry out the search until:
a) We find the key in the table,
b) Or, until we find an empty bucket, (unsuccessful termination)
c) Or, until we search the table for one sequence and the random number repeats.
(unsuccessful termination, table is full)
3) Chaining
We modify entries of the hash table to hold a key part (and the record) and a link part.
When there is a collision, we put the second key in any empty place and set the link part
of the first key to point to the second one. Additional storage is needed for the link fields.
4) Chaining with overflow
In this method, we use extra space for colliding items.
f(key1)=3 goes into bucket 3
f(key2)=3 collision, goes into the overflow area
5) Rehashing:
Use a series of hash functions. If there is a collision, take the second hash function and
hash again, etc... The probability that two key values will map to the same address with
two different hash functions is very low.
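As promised under method 1, the following is a minimal C++ sketch of linear probing with r = 1; the table size N, the EMPTY sentinel, and the names are assumptions for the example, and keys are assumed non-negative.

    #include <cstddef>

    const int EMPTY = -1;                 // marks an unused slot
    const std::size_t N = 11;             // table size (assumption)
    int slots[N];

    void initSlots() {
        for (std::size_t i = 0; i < N; ++i) slots[i] = EMPTY;
    }

    std::size_t hashAddr(int key) { return (std::size_t)key % N; }

    bool insertKey(int key) {
        std::size_t h = hashAddr(key);
        for (std::size_t i = 0; i < N; ++i) {
            std::size_t slot = (h + i) % N;   // wrap to location 0 at the end
            if (slots[slot] == EMPTY) { slots[slot] = key; return true; }
        }
        return false;                         // table is full
    }

    bool searchKey(int key) {
        std::size_t h = hashAddr(key);
        for (std::size_t i = 0; i < N; ++i) {
            std::size_t slot = (h + i) % N;
            if (slots[slot] == key)   return true;
            if (slots[slot] == EMPTY) return false;  // stop at first empty slot
        }
        return false;
    }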
Average number of probes (AVP) calculation :
Calculate the probability of collisions, then the expected number of collisions, then
average.
To delete a key, we have to put a special 'deleted' marker into its bucket, because there
might have been collisions, and we would break the search chain if we simply set that
bucket to empty. However, we shall then be wasting some locations: the load factor (LF)
is increased and the AVP is increased. We cannot simply increase the hash table size,
since the hash function will only generate values between 1 and n (or 0 and n−1).
Using an overflow area is one solution.
MODULE EIGHT
RECURSIVE PROGRAMMING
Recursive programming is a powerful technique that can greatly simplify some
programming tasks. In summary, recursive programming is the situation in which a
procedure calls itself, passing in a modified value of the parameter(s) that was passed in
to the current iteration of the procedure. Typically, a recursive programming environment
contains (at least) two procedures: first, a procedure to set up the initial environment and
make the initial call to the recursive procedure, and second, the recursive procedure itself
that calls itself one or more times.
Let's begin with a simple example. The Factorial of a number N is the product of all the
integers between 1 and N. The factorial of 5 is equal to 5 * 4 * 3 * 2 * 1 = 120. In the real
world you would not likely use a recursive procedure for this, but it will serve as a simple
yet illustrative example. The first procedure, named DoFact, sets things up, calls the Fact
function, and displays the result.
Sub DoFact()
    Dim L As Long
    Dim N As Long
    N = 3
    L = Fact(N)
    Debug.Print "The Factorial of " & CStr(N) & " is " & Format(L, "#,##0")
End Sub
The Fact function does the real work of calculating the factorial.
Function Fact(N As Long) As Long
    If N <= 1 Then
        Fact = 1
    Else
        Fact = N * Fact(N - 1)
    End If
End Function
In this code, the value of the input N is tested. If it is 1 or less, the function simply returns 1. If N
is greater than 1, Fact calls itself, passing the value N - 1. The function returns as its
result the input value N times the value of itself evaluated for N - 1.
Cautions For Recursive Programming
While recursive programming is a powerful technique, you must be careful to structure
the code so that it will terminate properly when some condition is met. In the Fact
procedure, we ended the recursive calls when N was less than or equal to 1. Your
recursive code must have some sort of escape logic that terminates the recursive calls.
Without such escape logic, the code would loop continuously until the VBA runtime
aborts the processing with an Out Of Stack Space error. Note that you cannot trap an Out
Of Stack Space error with conventional error trapping. It is called an untrappable error
and will terminate all VBA execution immediately. You cannot recover from an
untrappable error.
For example, consider the following poorly written recursive procedure:
Function AddUp(N As Long)
    Static R As Long
    If N <= 0 Then
        R = 0
    End If
    R = AddUp(N + 1)
    AddUp = R
End Function
In this code, there is no condition that prevents AddUp from calling itself. Every call to
AddUp results in another call to AddUp. The function will continue to call itself without
restriction until the VBA runtime aborts the procedure execution sequence.
MODULE NINE
MACROS
A macro (short for "macroinstruction", from Greek μακρο- 'large') in computer science is
a rule or pattern that specifies how a certain input sequence (often a sequence of
characters) should be mapped to a replacement input sequence (also often a sequence of
characters) according to a defined procedure. The mapping process that instantiates
(transforms) a macro use into a specific sequence is known as macro expansion.
A facility for writing macros may be provided as part of a software application or as a
part of a programming language. In the former case, macros are used to make tasks using
the application less repetitive. In the latter case, they are a tool that allows a programmer
to enable code reuse or even to design domain-specific languages.
Macros are used to make a sequence of computing instructions available to the
programmer as a single program statement, making the programming task less tedious
and less error-prone.
(Thus, they are called "macros" because a big block of code can be expanded from a
small sequence of characters). Macros often allow positional or keyword parameters that
dictate what the conditional assembler program generates and have been used to create
entire programs or program suites according to such variables as operating system,
platform or other factors. The term derives from "macro instruction", and such
expansions were originally used in generating assembly language code.
9.1 KEYBOARD AND MOUSE MACROS
Keyboard macros and mouse macros allow short sequences of keystrokes and
mouse actions to be transformed into other, usually more time-consuming, sequences of
keystrokes and mouse actions. In this way, frequently used or repetitive sequences of
keystrokes and mouse movements can be automated. Separate programs for creating
these macros are called macro recorders.
During the 1980s, macro programs – originally SmartKey, then SuperKey, KeyWorks,
Prokey – were very popular, first as a means to automatically format screenplays, then for
a variety of user input tasks. These programs were based on the TSR (Terminate and stay
resident) mode of operation and applied to all keyboard input, no matter in which context
it occurred. They have to some extent fallen into obsolescence following the advent of
mouse-driven user interfaces and the availability of keyboard and mouse macros in
applications such as word processors and spreadsheets, making it possible to create
application-sensitive keyboard macros.
Keyboard macros have in more recent times come to life as a method of exploiting the
economy of massively multiplayer online role-playing games (MMORPGs). By tirelessly
performing a boring, repetitive, but low risk action, a player running a macro can earn a
large amount of the game's currency or resources. This effect is even larger when a
macro-using player operates multiple accounts simultaneously, or operates the accounts
for a large amount of time each day. As this money is generated without human
intervention, it can dramatically upset the economy of the game. For this reason, use of
macros is a violation of the TOS or EULA of most MMORPGs, and administrators of
MMORPGs fight a continual war to identify and punish macro users.
9.2 APPLICATION MACROS AND SCRIPTING
Keyboard and mouse macros that are created using an application's built-in macro
features are sometimes called application macros. They are created by carrying out the
sequence once and letting the application record the actions. An underlying macro
programming language, most commonly a scripting language, with direct access to the
features of the application may also exist.
The programmers' text editor Emacs (short for "editing macros") follows this idea to a
conclusion. In effect, most of the editor is made of macros. Emacs was originally devised
as a set of macros in the editing language TECO; it was later ported to dialects of Lisp.
Another programmer's text editor, Vim (a descendant of vi), also has full implementation
of macros. It can record into a register (macro) what a person types on the keyboard and
it can be replayed or edited just like VBA macros for Microsoft Office. Vim also has a
scripting language called Vimscript to create macros.
Visual Basic for Applications (VBA) is a programming language included in Microsoft
Office. However, its function has evolved from and replaced the macro languages that
were originally included in some of these applications.
9.3 MACRO VIRUS
VBA has access to most Microsoft Windows system calls and executes when
documents are opened. This makes it relatively easy to write computer viruses in VBA,
commonly known as macro viruses. In the mid-to-late 1990s, this became one of the most
common types of computer virus. However, during the late 1990s and to date, Microsoft
has been patching and updating their programs. In addition, current anti-virus programs
immediately counteract such attacks.
9.4 TEXT SUBSTITUTION MACROS
Languages such as C and assembly language have rudimentary macro systems,
implemented as preprocessors to the compiler or assembler. C preprocessor macros work
by simple textual search-and-replace at the token, rather than the character, level. A
classic use of macros is in the computer typesetting system TeX and its derivatives,
where most of the functionality is based on macros. MacroML is an experimental system
that seeks to reconcile static typing and macro systems. Nemerle has typed syntax
macros, and one productive way to think of these syntax macros is as a multi-stage
computation. Other examples:
m4 is a sophisticated, stand-alone, macro processor.
TRAC
Macro Extension TAL, accompanying the Template Attribute Language
SMX, for web pages
ML/1 Macro Language One
The General Purpose Macroprocessor is a contextual pattern matching macro
processor, which could be described as a combination of regular expressions,
EBNF and AWK
SAM76
minimac, a concatenative macro processor.
troff and nroff, for typesetting and formatting Unix manpages.
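As a small illustration of the token-level substitution described before the list above, the following C/C++ preprocessor sketch (macro names invented for the example) shows each macro use being replaced with its expansion before compilation:

    #include <cstdio>

    #define MAX_ITEMS 64            // object-like macro: plain replacement
    #define SQUARE(x) ((x) * (x))   // function-like macro; the parentheses
                                    // guard against operator-precedence bugs

    int main() {
        int buffer[MAX_ITEMS];               // expands to: int buffer[64];
        std::printf("%d\n", SQUARE(3 + 1));  // expands to ((3 + 1) * (3 + 1)),
                                             // which prints 16
        (void)buffer;
        return 0;
    }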
9.5 EMBEDDABLE LANGUAGES
Some languages, such as PHP, can be embedded in free-format text, or the source
code of other languages. The mechanism by which the code fragments are recognised (for
instance, being bracketed by <?php and ?>) is similar to a textual macro language, but
they are much more powerful, fully featured languages.
9.5.1 Procedural macros
Macros in the PL/I language are written in a subset of PL/I itself: the compiler
executes "preprocessor statements" at compilation time, and the output of this execution
forms part of the code that is compiled. The ability to use a familiar procedural language
as the macro language gives power much greater than that of text substitution macros, at
the expense of a larger and slower compiler.
Frame Technology's frame macros have their own command syntax but can also contain
text in any language. Each frame is both a generic component in a hierarchy of nested
subassemblies, and a procedure for integrating itself with its subassembly frames (a
recursive process that resolves integration conflicts in favor of higher level
subassemblies). The outputs are custom documents, typically compilable source modules.
Frame Technology can avoid the proliferation of similar but subtly different components,
an issue that has plagued software development since the invention of macros and
subroutines.
Most assembly languages have less powerful procedural macro facilities, for example
allowing a block of code to be repeated N times for loop unrolling; but these have a
completely different syntax from the actual assembly language.
9.5.2 Syntactic macros
Macro systems that work at the level of abstract syntax trees are called syntactic
macros and preserve the lexical structure of the original program. Meanwhile, macro
systems, such as the C preprocessor described earlier, that work at the level of lexical
tokens cannot preserve the lexical structure reliably. The most widely used
implementations of syntactic macro systems are found in Lisp-like languages such as
Common Lisp, Scheme, ISLISP and Racket. These languages are especially suited for
this style of macro due to their uniform, parenthesized syntax (known as S-Expressions).
In particular, uniform syntax makes it easier to determine the invocations of macros. Lisp
macros transform the program structure itself, with the full language available to express
such transformations. While syntactic macros are most commonly found in Lisp-like
languages, they have been implemented for other languages such as Dylan, Scala, and
Nemerle.
9.6 EARLY LISP MACROS
The earliest Lisp macros took the form of FEXPRs, function-like operators whose
inputs were not the values computed by the arguments but rather the syntactic forms of
the arguments, and whose outputs were values to be used in the computation. In other
words, FEXPRs were implemented at the same level as EVAL, and provided a window
into the meta-evaluation layer. This was generally found to be a difficult model to reason
about effectively.
An alternate, later facility was called DEFMACRO, a system that allowed programmers
to specify source-to-source transformations that were applied before the program was run.
9.6.1 Hygienic macros
In the mid-eighties, a number of papers introduced the notion of hygienic macro
expansion (syntax-rules), a pattern-based system where the syntactic environments of the
macro definition and the macro use are distinct, allowing macro definers and users not to
worry about inadvertent variable capture (cf. Referential transparency). Hygienic macros
have been standardized for Scheme in both the R5RS and R6RS standards. The upcoming
R7RS standard will also include hygienic macros. A number of competing
implementations of hygienic macros exist such as syntax-rules, syntax-case, explicit
renaming, and syntactic closures. Both syntax-rules and syntax-case have been
standardized in the Scheme standards.
A number of languages other than Scheme either implement hygienic macros or
implement partially hygienic systems. Examples include Scala, Julia, Dylan, and
Nemerle.
9.7 APPLICATIONS
Evaluation order
Macro systems have a range of uses. Being able to choose the order of evaluation
(see lazy evaluation and non-strict functions) enables the creation of new syntactic
constructs (e.g. control structures) indistinguishable from those built into the
language. For instance, in a Lisp dialect that has cond but lacks if, it is possible to
define the latter in terms of the former using macros. For example, Scheme has
both continuations and hygienic macros, which enables a programmer to design
their own control abstractions, such as looping and early exit constructs, without
the need to build them into the language.
Data sub-languages and domain-specific languages
Next, macros make it possible to define data languages that are immediately
compiled into code, which means that constructs such as state machines can be
implemented in a way that is both natural and efficient.
Binding constructs
Macros can also be used to introduce new binding constructs. The most well-
known example is the transformation of let into the application of a function to a
set of arguments.
Felleisen conjectures that these three categories make up the primary legitimate uses of
macros in such a system. Others have proposed alternative uses of macros, such as
anaphoric macros in macro systems that are unhygienic or allow selective unhygienic
transformation.
The interaction of macros and other language features has been a productive area of
research. For example, components and modules are useful for large-scale programming,
but the interaction of macros and these other constructs must be defined for their use
together. Module and component-systems that can interact with macros have been
proposed for Scheme and other languages with macros.
For example, the Racket language extends the notion of a macro system to a syntactic
tower, where macros can be written in languages including macros, using hygiene to
ensure that syntactic layers are distinct and allowing modules to export macros to other
modules.
9.7.1 Macros for machine-independent software
Macros are normally used to map a short string (macro invocation) to a longer
sequence of instructions. Another, less common, use of macros is to do the reverse: to
map a sequence of instructions to a macro string. This was the approach taken by the
STAGE2 Mobile Programming System, which used a rudimentary macro compiler
(called SIMCMP) to map the specific instruction set of a given computer to counterpart
machine-independent macros. Applications (notably compilers) written in these machine-
independent macros can then be run without change on any computer equipped with the
rudimentary macro compiler. The first application run in such a context is a more
sophisticated and powerful macro compiler, written in the machine-independent macro
language. This macro compiler is applied to itself, in a bootstrap fashion, to produce a
compiled and much more efficient version of itself. The advantage of this approach is
that complex applications can be ported from one computer to a very different computer
with very little effort (for each target machine architecture, just the writing of the
rudimentary macro compiler). The advent of modern programming languages, notably C,
for which compilers are available on virtually all computers, has rendered such an
approach superfluous. This was, however, one of the first instances (if not the first) of
compiler bootstrapping.