

Chap 7. Indexing


Chapter Objectives(1)

Introduce concepts of indexing that have broad applications in the design of file systems

Introduce the use of a simple linear index to provide rapid access to records in an entry-sequenced, variable-length record file

Investigate the implementation of the use of indexes for file maintenance

Introduce the template features of C++ for object I/O

Describe the object-oriented approach to indexed sequential files


Chapter Objectives(2)

Describe the use of indexes to provide access to records by more than one key

Introduce the idea of an inverted list, illustrating Boolean operations on lists

Discuss when to bind an index key to an address in the data file

Introduce and investigate the implications of self-indexing files


Contents(1)

7.1 What is an Index?

7.2 A Simple Index for Entry-Sequenced Files

7.3 Using Template Classes in C++ for Object I/O

7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects

7.5 Indexes That Are Too Large to Hold in Memory


Contents(2)

7.6 Indexing to Provide Access by Multiple Keys

7.7 Retrieval Using Combinations of Secondary Keys

7.8 Improving the Secondary Index Structure: Inverted Lists

7.9 Selective Indexes

7.10 Binding


Overview: Index(1)

Index: a data structure that associates given key values with the corresponding record numbers

It is usually physically separate from the file (unlike indexed sequential files, which use tight binding).

Linear indexes (like the indexes found at the back of books)

Index records are ordered by key value, as in an ordered relative file

The best algorithm for finding a record with a specific key value is binary search

Addition requires reorganization
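
A minimal sketch of a linear index, assuming the index is held in memory as a sorted array of (key, record number) pairs. The names IndexEntry, lookup, and insertSorted are illustrative and do not come from the slides. Lookup is a binary search; insertion has to keep the array ordered, which is the reorganization cost mentioned above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical index entry: a key and the record number it refers to.
struct IndexEntry {
    std::string key;
    int recordNumber;
};

// Binary search over entries sorted by key; returns the record number or -1.
int lookup(const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    if (it == index.end() || it->key != key) return -1;
    return it->recordNumber;
}

// Insert while keeping the entries sorted; later entries are shifted up.
void insertSorted(std::vector<IndexEntry>& index, const IndexEntry& entry) {
    auto pos = std::lower_bound(index.begin(), index.end(), entry.key,
        [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    index.insert(pos, entry);
}
```

The data file itself is never touched by either operation; only the small index entries move.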

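Overview: Index(2)

[Figure: an index file holding the sorted keys k1 k2 k4 k5 k7 k9; each index entry points to the record with that key in the data file, whose records are shown as AAA ZZZ CCC XXX EEE FFF]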

Overview: Index(3)

Tree Indexes (like those of indexed sequential files)

Hierarchical: each level, beginning with the root level, points to the next level

Leaves point only to the data file

Indexed Sequential File

Binary Tree Index

AVL Tree Index

B+ tree Index


Roles of Index?

Index: keys and reference fields

Fast Random Accesses

Uniform Access Speed

Allow users to impose order on a file without actually rearranging the file

Provide multiple access paths to a file

Give user keyed access to variable-length record files


A Simple Index(1)

Data file: entry-sequenced, variable-length records

Primary key: unique for each entry in the file

Searching a file by key is a common need, but binary search cannot be used on a variable-length record file (there is no way to know where the middle record starts)

Solution: construct an index object for the file

index object : key field + byte-offset field
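
A minimal sketch of the idea, under assumptions that are not in the slides: the data file is modeled as an in-memory string, each record ends with a newline, and the first '|'-delimited field is the primary key. The functions buildIndex and fetch are hypothetical, and std::map stands in for the sorted index array.

```cpp
#include <map>
#include <string>

// Assumed layout (illustrative only): each record ends with '\n' and its
// first '|'-delimited field is the primary key.
std::map<std::string, int> buildIndex(const std::string& data) {
    std::map<std::string, int> index;   // key -> byte offset of the record
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = data.find('\n', start);
        if (end == std::string::npos) end = data.size();
        std::size_t bar = data.find('|', start);
        if (bar == std::string::npos || bar > end) bar = end;
        index[data.substr(start, bar - start)] = static_cast<int>(start);
        start = end + 1;                // next record begins after the newline
    }
    return index;
}

// Fetch one record by key by jumping straight to its byte offset.
std::string fetch(const std::string& data,
                  const std::map<std::string, int>& index,
                  const std::string& key) {
    auto it = index.find(key);
    if (it == index.end()) return "";
    std::size_t start = static_cast<std::size_t>(it->second);
    std::size_t end = data.find('\n', start);
    if (end == std::string::npos) return data.substr(start);
    return data.substr(start, end - start);
}
```

Because the index entries have a fixed shape even though the records do not, the key lookup never has to guess where a record starts.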


A Simple Index (3)

Index file: fixed-size record, sorted

Datafile: not sorted because it is entry sequenced

Record addition is quick (faster than a sorted file)

Can keep the index in memory

Finding a record is quicker with the index file than with a sorted data file

Class TextIndex encapsulates the index data and index operations

Index entry: key + reference field

Let's see Figure 7.4:

class TextIndex
{ public:
   TextIndex (int maxKeys = 100, int unique = 1);
   ~TextIndex ();
   int Insert (const char * key, int recAddr); // add to index
   int Remove (const char * key); // remove key from index
   int Search (const char * key) const; // search for key, return recAddr
   void Print (ostream &) const;
  protected:
   int MaxKeys; // maximum number of entries
   int NumKeys; // actual number of entries
   char ** Keys; // array of key values
   int * RecAddrs; // array of record references
   int Find (const char * key) const;
   int Init (int maxKeys, int unique);
   int Unique; // if true, each key must be unique
};

TextIndex::TextIndex

TextIndex:: TextIndex (int maxKeys, int unique)

: NumKeys (0), Keys(0), RecAddrs(0)

{Init (maxKeys, unique);}

TextIndex :: ~TextIndex ()

{delete [] Keys; delete [] RecAddrs;} // both arrays were allocated with new [] in Init

TextIndex::Init

int TextIndex :: Init (int maxKeys, int unique)

{

Unique = unique != 0;

if (maxKeys <= 0)

{

MaxKeys = 0;

return 0;

}

MaxKeys = maxKeys;

Keys = new char *[maxKeys];

RecAddrs = new int [maxKeys];

return 1;

}

TextIndex::Insert

int TextIndex :: Insert (const char * key, int recAddr)
{
   int i;
   int index = Find (key);
   if (Unique && index >= 0) return 0; // key already in
   if (NumKeys == MaxKeys) return 0; // no room for another key
   for (i = NumKeys-1; i >= 0; i--)
   {
      if (strcmp(key, Keys[i])>0) break; // insert into location i+1
      Keys[i+1] = Keys[i];
      RecAddrs[i+1] = RecAddrs[i];
   }
   Keys[i+1] = strdup(key);
   RecAddrs[i+1] = recAddr;
   NumKeys ++;
   return 1;
}

TextIndex::Remove

int TextIndex :: Remove (const char * key)

{

int index = Find (key);

if (index < 0) return 0; // key not in index

for (int i = index; i < NumKeys - 1; i++) // stop one short so Keys[i+1] stays in range

{

Keys[i] = Keys[i+1];

RecAddrs[i] = RecAddrs[i+1];

}

NumKeys --;

return 1;

}

TextIndex::Search

int TextIndex :: Search (const char * key) const

{

int index = Find (key);

if (index < 0) return index;

return RecAddrs[index];

}

TextIndex::Find

int TextIndex :: Find (const char * key) const

{

for (int i = 0; i < NumKeys; i++)

if (strcmp(Keys[i], key)==0) return i;// key found

else if (strcmp(Keys[i], key)>0) return -1;// not found

return -1;// not found

}

Index Implementation

Page 706~709

G.1 Recording.h

G.2 Recording.cpp

G.3 Makerec.cpp

Page 710~712

G.4 Textind.h

G.5 Textind.cpp

IndexRecordingFile

int IndexRecordingFile (char * myfile, TextIndex & RecordingIndex)
{
   Recording rec; int recaddr, result;
   DelimFieldBuffer Buffer; // create a buffer
   BufferFile RecordingFile(Buffer);
   result = RecordingFile . Open (myfile,ios::in);
   if (!result)
   {
      cout << "Unable to open file "<<myfile<<endl;
      return 0;
   }
   while (1) // loop until the read fails
   {
      recaddr = RecordingFile . Read (); // read next record
      if (recaddr < 0) break;
      rec. Unpack (Buffer);
      RecordingIndex . Insert(rec.Key(), recaddr);
      cout << recaddr <<'\t'<<rec<<endl;
   }
   RecordingIndex . Print (cout);
   result = RetrieveRecording (rec, "LON2312", RecordingIndex, RecordingFile);
   cout <<"Found record: "<<rec;
   return result;
}

RetrieveRecording

int RetrieveRecording (Recording & recording, char * key,

TextIndex & RecordingIndex, BufferFile & RecordingFile)

// read and unpack the recording, return TRUE if succeeds

{ int result;

cout <<"Retrieve "<<key<<" at recaddr "<<RecordingIndex.Search(key)<<endl;

result = RecordingFile . Read (RecordingIndex.Search(key));

cout <<"read result: "<<result<<endl;

if (result == -1) return FALSE;

result = recording.Unpack (RecordingFile.GetBuffer());

return result;

}

Template Class RecordFile

we want to make the following code possible

– Person p; RecordFile pFile; pFile.Read(p);

– Recording r; RecordFile rFile; rFile.Read(r);

it is difficult to support files for different record types without having to modify the class

Template class which is derived from BufferFile

– the actual declarations and calls

– RecordFile <Person> pFile; pFile.Read(p);

– RecordFile <Recording> rFile; rFile.Read(r);

Template Class for I/O Object(1)


Template Class RecordFile

template <class RecType>
class RecordFile : public BufferFile
{ public:
   int Read (RecType & record, int recaddr = -1);
   int Write (const RecType & record, int recaddr = -1);
   int Append (const RecType & record);
   RecordFile (IOBuffer & buffer) : BufferFile (buffer) {}
};
// The template parameter RecType must have the following methods
// int Pack (IOBuffer &); pack record into buffer
// int Unpack (IOBuffer &); unpack record from buffer
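
The requirement that RecType supply Pack and Unpack can be illustrated with a self-contained sketch. Everything here is a stand-in, not the classes from the slides: SimpleBuffer replaces IOBuffer, MemoryRecordFile keeps packed records in memory instead of deriving from BufferFile, and Person is an illustrative record type.

```cpp
#include <string>
#include <vector>

// Stand-in for the buffer classes: holds one packed record as text.
struct SimpleBuffer {
    std::string bytes;
};

// Stand-in for RecordFile<RecType>: stores packed records in memory.
template <class RecType>
class MemoryRecordFile {
public:
    // Pack the record and append it; returns its record address (slot number).
    int Append(const RecType& record) {
        SimpleBuffer buffer;
        if (!record.Pack(buffer)) return -1;
        slots.push_back(buffer.bytes);
        return static_cast<int>(slots.size()) - 1;
    }
    // Unpack the record stored at recaddr; returns recaddr, or -1 on failure.
    int Read(RecType& record, int recaddr) const {
        if (recaddr < 0 || recaddr >= static_cast<int>(slots.size())) return -1;
        SimpleBuffer buffer{slots[recaddr]};
        return record.Unpack(buffer) ? recaddr : -1;
    }
private:
    std::vector<std::string> slots;
};

// Any type with Pack and Unpack can be stored; this one is illustrative.
struct Person {
    std::string name;
    int Pack(SimpleBuffer& b) const { b.bytes = name; return 1; }
    int Unpack(const SimpleBuffer& b) { name = b.bytes; return 1; }
};
```

The file class never names Person; the compiler checks at instantiation time that the record type has the two methods the template calls.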

Adding I/O to an existing class RecordFile

add methods Pack and Unpack to class Recording

create a buffer object to use in the I/O

– DelimFieldBuffer Buffer;

declare an object of type RecordFile<Recording>

– RecordFile<Recording> rFile (Buffer);

Declaration and Calls


Recording r1, r2;
rFile.Open("myfile");
rFile.Read(r1);
rFile.Write(r2);

Directly open a file and read and write objects of class Recording

Object-Oriented Approach to I/O

Class IndexedFile

add indexed access to the sequential access provided by class RecordFile

extends RecordFile with Update, Append and Read methods

– Update & Append : maintain a primary key index of data file

– Read : supports access to object by key

TextIndex, RecordFile ==> IndexedFile

Issues of IndexedFile

– how to make a persistent index of a file

– how to guarantee that the index is an accurate reflection of the contents of the data file

Basic Operations of IndexedFile(1)

Create the original empty index and data files

Load the index file into memory

Rewrite the index file from memory

Add records to the data file and index

Delete records from the data file

Update records in the data file

Update the index to reflect changes in the data file

Retrieve records

Basic Operations of TextIndexedFile (1)

Creating the files

the index file and the data file are initially created as empty files with header records

implementation (makeind.cpp in Appendix G): Create method in class BufferFile

Loading the index into memory

loading/storing objects is supported in the IOBuffer classes

need to choose a particular buffer class to use for an index file (tindbuff.cpp in Appendix G)

– define class TextIndexBuffer as a derived class of FixedFieldBuffer to support reading and writing of index objects

Basic Operations of TextIndexedFile(2)

Rewriting the index file from memory

part of the Close operation on an IndexedFile

write the index object back to the index file

should protect the index against failure

write changes only when the index file is out of date (use a status flag)

Implementation: Rewind and Write operations of class BufferFile

Record Addition

Add an entry to the index

– using TextIndex::Insert

– requires rearrangement; if the index is in memory, no file access is needed

Add a new record to the data file

– using RecordFile<Recording>::Write

Basic Operations of TextIndexedFile(3)

Record Deletion

data file: the records need not be moved

index: either really delete the entry or just mark it as deleted

– using TextIndex::Remove

Record Updating (2 categories)

the update changes the value of the key field

– delete/add approach

– may require reordering both the index and the data file

the update does not affect the key field

– no rearrangement of the index file

– may need to reconstruct the data file

Class TextIndexedFile(1)

Members

methods

– Create, Open, Close, Read (sequential & indexed), Append, and Update operations

protected members

– ensure the correlation between the index in memory (Index), the index file (IndexFile), and the data file (DataFile)

char* Key()

– the template parameter RecType must have the Key method

– used to extract the key value from the record

Class TextIndexedFile(2)

template <class RecType>
class TextIndexedFile
{ public:
   int Read (RecType & record); // read next record
   int Read (char * key, RecType & record); // read by key
   int Append (const RecType & record);
   int Update (char * oldKey, const RecType & record);
   int Create (char * name, int mode=ios::in|ios::out);
   int Open (char * name, int mode=ios::in|ios::out);
   int Close ();
   TextIndexedFile (IOBuffer & buffer, int keySize, int maxKeys=100);
   ~TextIndexedFile (); // close and delete
  protected:
   TextIndex Index;
   BufferFile IndexFile;
   TextIndexBuffer IndexBuffer;
   RecordFile<RecType> DataFile;
   char * FileName; // base file name for file
   int SetFileName (char * fName, char *& dFileName, char *& IdxFName);
};

TextIndexedFile constructor / destructor

template <class RecType>

TextIndexedFile<RecType>::TextIndexedFile (IOBuffer & buffer,

int keySize, int maxKeys) : DataFile(buffer), Index (maxKeys),

IndexBuffer(keySize, maxKeys),

IndexFile(IndexBuffer)

{

FileName = 0;

}


template <class RecType>
TextIndexedFile<RecType>::~TextIndexedFile ()
{ Close(); }

TextIndexedFile::Create

template <class RecType>
int TextIndexedFile<RecType>::Create (char * fileName, int mode)
// use fileName.dat and fileName.ind
{
   int result;
   char * dataFileName, * indexFileName;
   result = SetFileName (fileName, dataFileName, indexFileName);
   cout <<"file names "<<dataFileName<<" "<<indexFileName<<endl;
   if (result == -1) return 0;
   result = DataFile.Create (dataFileName, mode);
   if (!result)
   {
      FileName = 0; // remove connection
      return 0;
   }
   result = IndexFile.Create (indexFileName, ios::out|ios::in);
   if (!result)
   {
      DataFile . Close(); // close the data file
      FileName = 0; // remove connection
      return 0;
   }
   return 1;
}

TextIndexedFile::Open

template <class RecType>
int TextIndexedFile<RecType>::Open (char * fileName, int mode)
// open data and index file and read index file
{
   int result;
   char * dataFileName, * indexFileName;
   result = SetFileName (fileName, dataFileName, indexFileName);
   if (!result) return 0;
   // open files
   result = DataFile.Open (dataFileName, mode);
   if (!result) { FileName = 0; return 0; }
   result = IndexFile.Open (indexFileName, ios::out);
   if (!result) { DataFile . Close(); FileName = 0; return 0; }
   // read index into memory
   result = IndexFile . Read ();
   if (result != -1)
   {
      result = IndexBuffer . Unpack (Index);
      if (result != -1) return 1;
   }
   DataFile.Close();
   IndexFile.Close();
   FileName = 0;
   return 0;
}

TextIndexedFile::Read


template <class RecType>
int TextIndexedFile<RecType>::Read (RecType & record)
{ return DataFile . Read (record, -1); } // sequential read of the next record

template <class RecType>
int TextIndexedFile<RecType>::Read (char * key, RecType & record)

{

int ref = Index.Search(key);

if (ref < 0) return -1;

int result = DataFile . Read (record, ref);

return result;

}

TextIndexedFile::Append


template <class RecType>
int TextIndexedFile<RecType>::Append (const RecType & record)

{

char * key = record.Key();

int ref = Index.Search(key);

if (ref != -1) // key already in file

return -1;

ref = DataFile . Append(record);

int result = Index . Insert (key, ref);

return ref;

}

TextIndexedFile::Close


template <class RecType>
int TextIndexedFile<RecType>::Close ()

{ int result;

if (!FileName) return 0; // already closed!

DataFile . Close();

IndexFile . Rewind();

IndexBuffer.Pack (Index);

result = IndexFile . Write ();

cout <<"result of index write: "<<result<<endl;

IndexFile . Close ();

FileName = 0;

return 1;

}

TextIndexBuffer

class TextIndexBuffer: public FixedFieldBuffer

{public:

TextIndexBuffer(int keySize, int maxKeys = 100,

int extraFields = 0, int extraSize=0);

// extraSize is included to allow derived classes to extend

// the buffer with extra fields.

// Required because the buffer size is exact.

int Pack (const TextIndex &);

int Unpack (TextIndex &);

void Print (ostream &) const;

protected:

int MaxKeys;

int KeySize;

char * Dummy; // space for dummy in pack and unpack

};

TextIndexBuffer::TextIndexBuffer

TextIndexBuffer::TextIndexBuffer (int keySize, int maxKeys, int extraFields, int extraSpace)

: FixedFieldBuffer (1+2*maxKeys+extraFields,

sizeof(int)+maxKeys*keySize+maxKeys*sizeof(int) + extraSpace)

// buffer fields consist of numKeys, actual number of keys

// Keys [maxKeys] key fields size = maxKeys * keySize

// RecAddrs [maxKeys] record address fields size = maxKeys*sizeof(int)

{

MaxKeys = maxKeys;

KeySize = keySize;

AddField (sizeof(int));

for (int i = 0; i < maxKeys; i++)

{

AddField (KeySize);

AddField (sizeof(int));

}

Dummy = new char[keySize+1];

}

TextIndexBuffer::Pack

int TextIndexBuffer::Pack (const TextIndex & index)

{

int result;

Clear ();

result = FixedFieldBuffer::Pack (&index.NumKeys);

for (int i = 0; i < index.NumKeys; i++)

{// note only pack the actual keys and recaddrs

result = result && FixedFieldBuffer::Pack (index.Keys[i]);

result = result && FixedFieldBuffer::Pack (&index.RecAddrs[i]);

}

for (int j = 0; j<index.MaxKeys-index.NumKeys; j++)

{// pack dummy values for other fields

result = result && FixedFieldBuffer::Pack (Dummy);

result = result && FixedFieldBuffer::Pack (Dummy);

}

return result;

}

TextIndexBuffer::Unpack

int TextIndexBuffer::Unpack(TextIndex & index)

{

int result;

result = FixedFieldBuffer::Unpack (&index.NumKeys);

for (int i = 0; i < index.NumKeys; i++)

{// note only unpack the actual keys and recaddrs

index.Keys[i] = new char[KeySize]; // just to be safe

result = result && FixedFieldBuffer::Unpack (index.Keys[i]);

result = result && FixedFieldBuffer::Unpack (&index.RecAddrs[i]);

}

for (int j = 0; j<index.MaxKeys-index.NumKeys; j++)

{// unpack dummy values for other fields

result = result && FixedFieldBuffer::Unpack (Dummy);

result = result && FixedFieldBuffer::Unpack (Dummy);

}

return result;

}

IndexRecordingFile

int IndexRecordingFile (char * myfile, TextIndexedFile<Recording> & indexFile)
{
   Recording rec; int recaddr, result;
   DelimFieldBuffer Buffer; // create a buffer
   BufferFile RecFile(Buffer);
   result = RecFile . Open (myfile,ios::in);
   if (!result)
   {
      cout << "Unable to open file "<<myfile<<endl;
      return 0;
   }
   while (1) // loop until the read fails
   {
      recaddr = RecFile . Read (); // read next record
      if (recaddr < 0) break;
      rec. Unpack (Buffer);
      indexFile . Append(rec);
   }
   Recording rec1;
   result = indexFile.Read ("LON2312", rec1);
   cout <<"Found record: "<<rec1; // print the record that was just retrieved
   return result;
}

Enhancements to TextIndexedFile(1)

Support other types of keys

Restriction: the key type is restricted to string (char *)

Relaxation: support a template class SimpleIndex with a parameter for the key type

Support data object class hierarchies

Restriction: every object must be of the same type in RecordFile

Relaxation: the type hierarchy supports virtual pack methods

Enhancements to TextIndexedFile(2)

Support multirecord index files

Restriction: the entire index must fit in a single record

Relaxation: add protected methods Insert, Delete, and Search to manipulate the arrays of index objects

Active optimization of operations

Obvious: the most obvious optimization is to use binary search in the Find method

Active: add a flag to the index object to avoid writing the index record back to the index file when it has not been changed
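
Both optimizations can be sketched together. This is a rough illustration, not the TextIndex class itself: FlaggedIndex, NeedsRewrite, and the use of std::vector and std::string are assumptions made to keep the example self-contained.

```cpp
#include <string>
#include <vector>

class FlaggedIndex {
public:
    // Binary search over the sorted keys; returns the position or -1.
    int Find(const std::string& key) const {
        int lo = 0, hi = static_cast<int>(keys.size()) - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (keys[mid] == key) return mid;
            if (keys[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
    // Insert in sorted position and mark the in-memory index as changed.
    bool Insert(const std::string& key, int recAddr) {
        if (Find(key) >= 0) return false;           // keys are unique
        std::size_t i = 0;
        while (i < keys.size() && keys[i] < key) ++i;
        keys.insert(keys.begin() + i, key);
        recAddrs.insert(recAddrs.begin() + i, recAddr);
        changed = true;
        return true;
    }
    // Return the record address for key, or -1 if the key is absent.
    int Search(const std::string& key) const {
        int i = Find(key);
        return i < 0 ? -1 : recAddrs[i];
    }
    // At close time: report whether a rewrite is needed, then clear the flag.
    bool NeedsRewrite() {
        bool wasChanged = changed;
        changed = false;
        return wasChanged;
    }
private:
    std::vector<std::string> keys;   // sorted key values
    std::vector<int> recAddrs;       // record address for each key
    bool changed = false;            // true when memory and file differ
};
```

A Close operation would call NeedsRewrite and skip the Rewind and Write steps when it returns false.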

Where are we going?

Plain Stream File

Persistency ==> Buffer support ==> BufferFile

<incremental approach>: Deriving BufferFile using various other classes

Random Access ==> Index support ==> IndexedFile

<incremental approach>: Deriving TextIndexedFile using RecordFile and TextIndex

Too Large Index(1)

On secondary storage (large linear index)

Disadvantages

binary searching of the index requires several seeks (not substantially faster than binary searching a sorted file)

index rearrangement requires shifting or sorting records on secondary storage

Alternatives (to be considered later)

hashed organization

tree-structured index (e.g. B-tree)
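
To get a feel for "several seeks", the worst-case number of probes in a binary search over n sorted entries is floor(log2 n) + 1, and each probe is a seek when the index is on disk. The helper below is an illustrative calculation, not code from the slides.

```cpp
// Worst-case number of probes for a binary search over n sorted entries,
// computed as floor(log2 n) + 1 by repeated halving. Returns 0 for n <= 0.
int worstCaseProbes(long n) {
    int probes = 0;
    while (n > 0) {
        ++probes;
        n /= 2;
    }
    return probes;
}
```

For an index of one million entries this gives 20 probes, which is why hashed and tree-structured organizations are considered for indexes that do not fit in memory.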

Too Large Index (2)

Advantages over the use of a data file sorted by key, even if the index is on secondary storage

can use a binary search (even with variable-length data records)

sorting and maintaining the index is less expensive than sorting and maintaining the data file

can rearrange the keys without moving the data records if there are pinned records

Index by Multiple Keys(1)

DB-Schema = (ID-No, Title, Composer, Artist, Label)

Find the record with ID-No “COL38358” (primary key: ID-No)

Find all the recordings of “Beethoven” (secondary key: Composer)

Find all the recordings titled “Violin Concerto” (secondary key: Title)

Index by Multiple Keys(2)

Most people don’t want to search only by primary key

Secondary Key

can be duplicated

Secondary Key Index

secondary key --> consult one additional index (primary key index)

Composer index

Secondary key            Primary key
BEETHOVEN                ANG3795
BEETHOVEN                DG139201
BEETHOVEN                COL38358
BEETHOVEN                DG18807
COREA                    WAR23699
DVORAK                   COL31809
PROKOFIEV                LON2312
RIMSKY-KORSAKOV          MER75016
SPRINGSTEEN              COL38358
SWEET HONEY IN THE R     FF245


Secondary Index:Basic Operations(1)

Record Addition

similar to the case of adding to primary index

secondary index is stored in canonical form

– fixed length (so it can be truncated)

– original name can be obtained from the data file

can contain duplicate keys

local ordering in the same key group
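
A canonical form can be produced by a small normalizing function. The rule used here (upper case, then truncate or blank-pad to a fixed width) is an assumption chosen to match the truncated key "SWEET HONEY IN THE R" that appears in the composer index; canonicalKey is a hypothetical name.

```cpp
#include <cctype>
#include <string>

// Hypothetical canonical form: upper case, truncated or blank-padded to width.
std::string canonicalKey(const std::string& raw, std::size_t width) {
    std::string out;
    for (char c : raw) {
        if (out.size() == width) break;     // truncate long keys
        out += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    out.resize(width, ' ');                 // pad short keys with blanks
    return out;
}
```

Applying the same function to search arguments ensures that a query is compared against keys in the same form as those stored in the index.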

Secondary Index: Basic Operations (2)

Record Deletion (2 cases)

Secondary index references the record directly

– delete the entries in both the primary index and the secondary index

– rearrange both indexes

Secondary index references the primary key

– delete only the primary index entry

– leave the secondary index reference to the deleted record intact (a later search fails in the primary index)

– advantage: fast

– disadvantage: entries for deleted records take up space in the secondary index

Secondary Index: Basic Operations (3)

Record Updating

primary key index serves as a kind of protective buffer

Secondary index references the record directly

– update all files containing the record’s location

Secondary index references the primary key (1)

– the secondary index is affected only when either the primary or the secondary key is changed

Secondary Index: Basic Operations (4)

Secondary index references the primary key (2)

when the secondary key changes

– rearrange the secondary key index

when the primary key changes

– update all reference fields

– may require reordering the secondary index

when the change is confined to other fields

– does not affect the secondary key index


Retrieval of Records

Types

primary key access

secondary key access

combination of above

Combination of keys

easy when using secondary key indexes

boolean operation (AND, OR)
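
Combining keys amounts to set operations on the lists of primary keys returned by each secondary index. The sketch below assumes each list is already sorted, as it would be if read from a sorted index, and uses the standard algorithms std::set_intersection and std::set_union. KeyList, matchAll, and matchAny are illustrative names.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

using KeyList = std::vector<std::string>;   // primary keys, sorted ascending

// AND: primary keys that appear in both lists.
KeyList matchAll(const KeyList& a, const KeyList& b) {
    KeyList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// OR: primary keys that appear in either list, without duplicates.
KeyList matchAny(const KeyList& a, const KeyList& b) {
    KeyList out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

Only the resulting primary keys are then looked up in the primary index, so the data file is read once per matching record.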

Inverted Lists(1)

Inverted List

a secondary key leads to a set of one or more primary keys

Disadvantages of the secondary index structure

the index must be rearranged whenever a record is added

the secondary key is repeated in a separate entry for each duplicate

Solution A: by an array of references

Solution B: by linking the list of references

Array of References

Revised composer index

Secondary key            Set of primary key references
BEETHOVEN                ANG3795  DG139201  DG18807  RCA2626
COREA                    WAR23699
DVORAK                   COL31809
PROKOFIEV                LON2312
RIMSKY-KORSAKOV          MER75016
SPRINGSTEEN              COL38358
SWEET HONEY IN THE R     FF245

* no need to rearrange

* limited reference array

* internal fragmentation

Inverted Lists (2)

Guidelines for a better solution

no reorganization when adding

no limitation for duplicate keys

no internal fragmentation

Solution B: by linking the list of references

A list of primary key references

secondary index record: secondary key field + relative record number of the first corresponding primary key reference

e.g. PROKOFIEV --> ANG36193, LON2312

Linking List of References (1)

Improved revision of the composer index

Secondary Index file

RRN  Secondary key           First reference (RRN in list file)
0    BEETHOVEN               3
1    COREA                   2
2    DVORAK                  7
3    PROKOFIEV               10
4    RIMSKY-KORSAKOV         6
5    SPRINGSTEEN             4
6    SWEET HONEY IN THE R    9

Label ID List file

RRN  Primary key   Next reference (-1 = end of list)
0    LON2312       -1
1    RCA2626       -1
2    WAR23699      -1
3    ANG3795       8
4    COL38358      -1
5    DG18807       1
6    MER75016      -1
7    COL31809      -1
8    DG139201      5
9    FF245         -1
10   ANG36193      0
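
Following a linked list of references can be sketched as below. The list file is modeled as an in-memory vector indexed by relative record number; ListEntry, collect, and addReference are illustrative names, and addReference links a new entry at the head of the chain rather than keeping the chain sorted, which is a simplification.

```cpp
#include <string>
#include <vector>

// One record of the list file: a primary key and a link to the next entry.
struct ListEntry {
    std::string primaryKey;
    int next;   // relative record number of the next entry, -1 ends the list
};

// Follow the chain that starts at head and collect the primary keys.
std::vector<std::string> collect(const std::vector<ListEntry>& listFile, int head) {
    std::vector<std::string> keys;
    for (int rrn = head; rrn != -1; rrn = listFile[rrn].next)
        keys.push_back(listFile[rrn].primaryKey);
    return keys;
}

// Append a new reference to the list file and link it in front of the
// existing chain; returns the new head to store in the secondary index.
int addReference(std::vector<ListEntry>& listFile, int head,
                 const std::string& primaryKey) {
    listFile.push_back({primaryKey, head});
    return static_cast<int>(listFile.size()) - 1;
}
```

Adding a reference appends one record to the list file and changes one link field; no existing list record is moved.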

The primary key references are kept in a separate, entry-sequenced file

Advantages

the secondary index is rearranged only when a secondary key is added or changed

rearrangement is quick

less penalty associated with keeping the secondary index file on secondary storage (less need for sorting)

the Label ID List file does not need to be sorted

reusing the space of deleted records is easy

Disadvantage

references for the same secondary key may not be physically grouped

– lack of locality

– could involve a large amount of seeking

– solution: keep the Label ID List file in memory

– the same Label ID List file can hold the lists of a number of secondary index files

– if it is too large to fit in memory, only a part of it can be loaded


Selective Indexes

Selective Index: Index on a subset of records

A selective index contains keys for only a portion of the records in the file

provide a selective view

useful when contents of a file fall into several categories

– e.g. 20 < Age < 30 and $1000 < Salary
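
A selective index can be built by applying a predicate while scanning the file. In this sketch the Employee type, its fields, and buildSelectiveIndex are hypothetical; the file is modeled as a vector and the reference stored for each key is the record's position in it.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Illustrative record type; not taken from the slides.
struct Employee {
    std::string id;
    int age;
    double salary;
};

// Index only the records for which selected returns true.
// The result maps each selected key to the record's position in the file.
std::map<std::string, int> buildSelectiveIndex(
        const std::vector<Employee>& file,
        const std::function<bool(const Employee&)>& selected) {
    std::map<std::string, int> index;
    for (std::size_t i = 0; i < file.size(); ++i)
        if (selected(file[i])) index[file[i].id] = static_cast<int>(i);
    return index;
}
```

A predicate such as "age between 20 and 30 and salary above 1000" then yields an index that covers just that category of records.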

Index Binding(1)

When should a key be bound to the physical address of its associated record?

File construction time binding (tight, in-the-data binding)

tight binding & faster access

the case of the primary key

when a secondary key is also bound at that time

– simpler and faster retrieval

– reorganization of the data file results in modifications of all bound index files

Index Binding (2)

Postpone binding until a record is actually retrieved (retrieval-time binding)

minimal reorganization & safe approach

mostly for secondary keys

Tight, in-the-data binding is good when

the data file is static, with little or no change

rapid performance during retrieval is required

e.g. mass-produced, read-only optical disks
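
Retrieval-time binding can be sketched as a two-step lookup: the secondary index yields primary keys, and only the primary index holds addresses. BoundIndexes and resolve are illustrative names, the addresses are made-up values, and the standard containers std::map and std::multimap stand in for the index files.

```cpp
#include <map>
#include <string>
#include <vector>

struct BoundIndexes {
    std::map<std::string, int> primary;                  // primary key -> address
    std::multimap<std::string, std::string> secondary;   // secondary key -> primary key
};

// Resolve a secondary key to record addresses at retrieval time.
// References whose primary key is no longer in the primary index are skipped,
// which is how stale entries left behind by deletions are tolerated.
std::vector<int> resolve(const BoundIndexes& ix, const std::string& secondaryKey) {
    std::vector<int> addrs;
    auto range = ix.secondary.equal_range(secondaryKey);
    for (auto it = range.first; it != range.second; ++it) {
        auto p = ix.primary.find(it->second);
        if (p != ix.primary.end()) addrs.push_back(p->second);
    }
    return addrs;
}
```

When a record moves, only its entry in the primary index changes; the secondary index is untouched, at the cost of one extra lookup per retrieval.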


Let’s Review (1)

7.1 What is an Index?

7.2 A Simple Index for Entry-Sequenced Files

7.3 Using Template Classes in C++ for Object I/O

7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects

7.5 Indexes That Are Too Large to Hold in Memory


Let’s Review(2)

7.6 Indexing to Provide Access by Multiple Keys

7.7 Retrieval Using Combinations of Secondary Keys

7.8 Improving the Secondary Index Structure: Inverted Lists

7.9 Selective Indexes

7.10 Binding
