
Post on 20-Dec-2015

File Structure

- File as a stream of characters
  - No structure
  - Consider students registered in a course:
    320587Joe SmithSC953184923Kathy LeeEN923249793Albert ChanSC943
- File as a structured collection of related data
  - A set of related data forms a record; a file consists of records
  - Information about each student forms a record:
    320587Joe SmithSC953
    184923Kathy LiEN923
    249793Albert ChanSC943
  - What is the meaning of each piece of information about each student?

DBMS Structures & Files

DBMS Structures    File Structures
Attribute          Field
Tuple              Record
Relation           File

Fields

- Each record consists of a set of fields
- Fields separate data units
  - Identification of the pieces of data in a record:
    320587Joe SmithSC953
- Usually the same fields exist in all records in a file

Field Separation Alternatives

- Fixed length fields
  - A given field (e.g., NAME) is the same size for all records
  - Easy and fast reading, but wastes space
    320587 Joe Smith   SC 95 3
    184923 Kathy Lee   EN 92 3
    249793 Albert Chan SC 94 3
- Length indicator at the beginning of each field
  - Also wastes space (at least 1 byte per field)
  - You have to know the length before you store
    63205879Joe Smith2SC2951361849238Kathy Li2EN29213624979311Albert Chan2SC29413
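
The two alternatives above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the field widths in WIDTHS are assumed (the slides do not give them), the field names are borrowed from the keyword example, and the length indicator is taken to be a single binary byte.

```python
# Assumed field widths for illustration; the slides do not state them.
WIDTHS = [("id", 6), ("name", 11), ("faculty", 2), ("deg", 2), ("year", 1)]

def parse_fixed(record: str) -> dict:
    """Slice a fixed-length record into fields and strip the padding."""
    fields, pos = {}, 0
    for name, width in WIDTHS:
        fields[name] = record[pos:pos + width].strip()
        pos += width
    return fields

def encode_fields(values: list) -> bytes:
    """Store each field as a one-byte length indicator followed by the data."""
    out = bytearray()
    for value in values:
        data = value.encode("ascii")
        out.append(len(data))  # assumption: no field is longer than 255 bytes
        out += data
    return bytes(out)

def decode_fields(buf: bytes) -> list:
    """Read back fields written by encode_fields."""
    fields, pos = [], 0
    while pos < len(buf):
        length = buf[pos]
        fields.append(buf[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return fields
```

With fixed widths the reader never inspects the data to find a field boundary, which is why reading is fast; the length-indicator form pays one extra byte per field instead of padding.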

Field Separation Alternatives

- Separate fields with delimiters
  - Use white space characters (blank, new line, tab)
  - Easy to read, uses one byte per field; have to be careful in the choice of the delimiter
    |320587|Joe Smith|SC|95|3||184923|Kathy Li|EN|92|3||249793|Albert Chan|SC|94|3|
- Use keywords
  - Each field has a keyword that indicates what the field is
  - Self-describing but high space overhead
    ID=320587NAME=Joe SmithFACULTY=SCDEG=95YEAR=3
    ID=184923NAME=Kathy LiFACULTY=ENDEG=92YEAR=3
    ID=249793NAME=Albert ChanFACULTY=SCDEG=94YEAR=3
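
A rough sketch of reading both formats in Python. It assumes each delimited record starts and ends with the delimiter, as in the example, and that the keyword set is known in advance; the function names are hypothetical.

```python
import re

KEYWORDS = ["ID", "NAME", "FACULTY", "DEG", "YEAR"]
KEYWORD_PATTERN = re.compile("(" + "|".join(KEYWORDS) + ")=")

def parse_delimited(record: str, delimiter: str = "|") -> list:
    """Split one record; assumes one leading and one trailing delimiter."""
    return record[1:-1].split(delimiter)

def parse_keyword(record: str) -> dict:
    """Split a self-describing record on its KEYWORD= markers."""
    # re.split with a capture group yields ['', 'ID', '320587', 'NAME', ...]
    parts = KEYWORD_PATTERN.split(record)
    return dict(zip(parts[1::2], parts[2::2]))
```

Both parsers break if a field value contains the delimiter or a keyword followed by "=", which is the care in choosing the delimiter that the slide warns about.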

Record Organization Alternatives

- Fixed length records
  - All records are the same length
    320587 Joe Smith   SC 95 3
    184923 Kathy Lee   EN 92 3
    249793 Albert Chan SC 94 3
  - The number and size of fields in each record may be variable
    |320587|Joe Smith|SC|95|3|      Padding
    |184923|Kathy Li|EN|92|3|       Padding
    |249793|Albert Chan|SC|94|3|    Padding

Record Organization Alternatives

- Variable length records
  - Fixed number of fields
    - Count the fields to detect the end of record
  - Length field at the beginning
    - Put the length of each record in front of it
    - You have to buffer the record before writing
      24320587|Joe Smith|SC|95|323184923|Kathy Li|EN|92|326249793|Albert Chan|SC|94|3
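
The length-field approach might look like this in Python. The sketch assumes a 2-byte big-endian binary length rather than the decimal digits shown on the slide; write_record and read_records are illustrative names.

```python
import io
import struct

def write_record(stream, fields: list) -> None:
    """Buffer the whole record first, because its length is written before it."""
    body = "|".join(fields).encode("ascii")
    stream.write(struct.pack(">H", len(body)))  # assumed 2-byte length prefix
    stream.write(body)

def read_records(stream):
    """Yield each record as a list of fields until the stream is exhausted."""
    while True:
        prefix = stream.read(2)
        if len(prefix) < 2:
            break
        (length,) = struct.unpack(">H", prefix)
        yield stream.read(length).decode("ascii").split("|")

rows = [
    ["320587", "Joe Smith", "SC", "95", "3"],
    ["184923", "Kathy Li", "EN", "92", "3"],
    ["249793", "Albert Chan", "SC", "94", "3"],
]
buf = io.BytesIO()
for row in rows:
    write_record(buf, row)
```

The three record bodies are 24, 23, and 26 bytes long, matching the prefixes in the slide's example.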

Record Organization Alternatives

- Variable length records (cont'd)
  - Index the beginning
    - Build a secondary index that shows where each record begins
      Data:  320587|Joe Smith|SC|95|3184923|Kathy Li|EN|92|3249793|Albert …
      Index: 00 24 47
  - End-of-record markers
    - Put a special end-of-record marker
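
As a small illustration of the offset index, the following Python sketch concatenates the records without separators and remembers where each one starts. The function names are hypothetical.

```python
def build_offset_index(records: list):
    """Concatenate records and return (data, list of starting byte offsets)."""
    data = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(data))
        data += record.encode("ascii")
    return bytes(data), offsets

def fetch(data: bytes, offsets: list, i: int) -> str:
    """Return record i; it ends where the next record begins."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[offsets[i]:end].decode("ascii")

data, offsets = build_offset_index([
    "320587|Joe Smith|SC|95|3",
    "184923|Kathy Li|EN|92|3",
    "249793|Albert Chan|SC|94|3",
])
```

For these three records the offsets come out as 0, 24, and 47, the same values the slide lists.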

Summary

- A file system consists of files (File, File, File, ...)
- A file consists of a header and records (Header, Record, Record, ...)
- A record consists of fields (Field, Field, Field, …)

Accessing a File

- Sequential access
  - Based on key values
  - Useful when the file is small or most (all) of the file needs to be searched
  - Complexity O(n), where n is the number of disk reads
  - Block records to reduce n
  - Block size should match physical disk organization
    - multiples of sector size
- Direct access
  - Based on relative record number (RRN)
  - Record-based file systems can jump to the record directly
  - Stream-based systems calculate byte offset = RRN * record length
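
On a stream-based system, direct access reduces to a seek. A minimal Python sketch, assuming zero-based RRNs and a 22-byte record (both assumptions; the slides state neither):

```python
import io

RECORD_LENGTH = 22  # assumed: 6 + 11 + 2 + 2 + 1 bytes per record

def read_by_rrn(stream, rrn: int, record_length: int = RECORD_LENGTH) -> bytes:
    """Jump straight to a record: byte offset = RRN * record length."""
    stream.seek(rrn * record_length)
    data = stream.read(record_length)
    if len(data) < record_length:
        raise IndexError(f"no record with RRN {rrn}")
    return data

def pack(student_id: str, name: str, faculty: str, deg: str, year: str) -> bytes:
    """Build one fixed-length record, padding the name with blanks."""
    return f"{student_id:<6}{name:<11}{faculty:<2}{deg:<2}{year:<1}".encode("ascii")

stream = io.BytesIO(
    pack("320587", "Joe Smith", "SC", "95", "3")
    + pack("184923", "Kathy Lee", "EN", "92", "3")
    + pack("249793", "Albert Chan", "SC", "94", "3")
)
```

Here read_by_rrn(stream, 1) returns the Kathy Lee record after a single seek, without touching the records before it.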

Header Records

- May be the same or different length than the rest of the records in the file
- May contain information about the file
  - Number of records
  - Size of records
  - Date of file creation
  - Date of last file modification
  - Name of file creator/owner
- Meta information
  - Formats of data
  - Origin of data
  - Units used
  - …
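
One possible header layout, sketched in Python with the struct module. The layout itself (a 4-byte record count, a 2-byte record size, and two 8-character YYYYMMDD dates) is purely an assumption for illustration; the slides do not prescribe one.

```python
import struct
from datetime import date

# Assumed layout: record count, record size, creation date, modification date.
HEADER = struct.Struct(">IH8s8s")

def pack_header(record_count: int, record_size: int, created: date, modified: date) -> bytes:
    """Serialize the header fields into a fixed 22-byte block."""
    return HEADER.pack(
        record_count,
        record_size,
        created.strftime("%Y%m%d").encode("ascii"),
        modified.strftime("%Y%m%d").encode("ascii"),
    )

def unpack_header(buf: bytes):
    """Read the header back from the first HEADER.size bytes of a file."""
    count, size, created, modified = HEADER.unpack(buf[:HEADER.size])
    return count, size, created.decode("ascii"), modified.decode("ascii")
```

A reader would call unpack_header on the first bytes of the file and then use the record size it finds there to compute record offsets.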

File Organization Issues

- Primary concern: organizing files for improving performance
  - Data compression
  - Reclaiming space in files
  - Search and sorting
  - Indexing

Data Compression

- Encoding information to reduce size of files
- Reversible compression
  - redundancy reduction
    - short notations: AB for Alberta
  - suppressing repeating sequences (images)
    22 23 24 24 24 24 24 24 25
    22 23 ff 24 06 25
  - variable length coding (Huffman)
    - most frequently used letters get the shortest codes
- Irreversible compression
  - from GIF to JPEG, save 20 ~ 90 %
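
The run-suppression example can be reproduced with a short Python sketch. It follows the slide's convention of a marker byte ff, then the repeated value, then the count; the minimum run length of 4 and the handling of literal ff bytes are assumptions added to make the scheme reversible.

```python
MARKER = 0xFF

def rle_encode(data: bytes, min_run: int = 4) -> bytes:
    """Replace long runs with the triple (MARKER, value, count)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        # A literal MARKER byte is always written as a triple so that the
        # decoder never mistakes it for the start of a run.
        if run >= min_run or data[i] == MARKER:
            out += bytes([MARKER, data[i], run])
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand every (MARKER, value, count) triple back into a run."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MARKER:
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Encoding the nine bytes 22 23 24 24 24 24 24 24 25 yields the six bytes 22 23 ff 24 06 25, as on the slide.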

Reclaiming Space in Files

- File updates
  - record addition
  - record deletion
  - record modification
- Requirements
  - how to recognize deleted records
    - tombstone: *
  - how to utilize space left by deleted records
    - storage compaction
      - reconstruct the file to reclaim space occupied by all deleted records
      - how often?
    - available list

Available List

- Consider fixed length records
- Available list is a linked list of deleted records
- Implemented as a stack
- Use relative record number (RRN) for physical addresses

RRN:                       0     1     2      3      4       5    6      7
Initial (head = -1):       Adam  Barb  Peter  Susan  Brenda  Sue  Tim    Jack
Delete RRN 3 (head = 3):   Adam  Barb  Peter  -1     Brenda  Sue  Tim    Jack
Delete RRN 6 (head = 6):   Adam  Barb  Peter  -1     Brenda  Sue  3      Jack
Insert Tamer (head = 3):   Adam  Barb  Peter  -1     Brenda  Sue  Tamer  Jack
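
The stack behaviour of the available list can be sketched in Python. This in-memory SlotFile class is hypothetical: a real file would store the link inside the deleted record's bytes, whereas here a slot simply holds either a record or the RRN of the next free slot.

```python
class SlotFile:
    """Fixed-length slots with a stack-like available list of deleted RRNs."""

    def __init__(self, records):
        self.slots = list(records)
        self.head = -1  # -1 marks the end of the available list

    def delete(self, rrn: int) -> None:
        """Push the freed slot: it stores the old head, then becomes the head."""
        self.slots[rrn] = self.head
        self.head = rrn

    def insert(self, record) -> int:
        """Pop the most recently freed slot, or append if none is free."""
        if self.head == -1:
            self.slots.append(record)
            return len(self.slots) - 1
        rrn = self.head
        self.head = self.slots[rrn]  # next free slot, or -1
        self.slots[rrn] = record
        return rrn

f = SlotFile(["Adam", "Barb", "Peter", "Susan", "Brenda", "Sue", "Tim", "Jack"])
f.delete(3)                  # head = 3, slot 3 holds -1
f.delete(6)                  # head = 6, slot 6 holds 3
reused = f.insert("Tamer")   # reuses slot 6, head returns to 3
```

Because insertion always takes the head of the list, no search is needed, which works only because every slot has the same size.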

Variable Length Records Case

- Problems
  - RRNs cannot be used
  - Fitting
  - Fragmentation
    - internal fragmentation: occurs if variable length records are stored in fixed size slots with padding
    - external fragmentation: split record leftover may be too small to hold any record
- Solutions
  - An available list with the byte offset
  - Placement strategies
  - Storage compaction
  - Coalescing holes
    - combining adjacent slots to form a bigger one

Placement Strategies

- First fit
  - unsorted list; the newly deleted record is put at the front
  - insertion uses the first one on the list that fits
- Best fit
  - the list is sorted in ascending order
  - insertion uses the first one on the list that fits
  - too much fragmentation
- Worst fit
  - the list is sorted in descending order
  - insertion always uses the first one if possible
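
The three strategies differ only in how the available list is ordered before it is scanned. A compact Python sketch, where each hole is a hypothetical (byte offset, size) pair and choose_hole is an illustrative name:

```python
def choose_hole(holes: list, size: int, strategy: str):
    """Return the (offset, size) hole a new record of `size` bytes would use."""
    if strategy == "first":
        ordered = holes  # kept in deletion order, newest first
    elif strategy == "best":
        ordered = sorted(holes, key=lambda h: h[1])  # ascending by size
    elif strategy == "worst":
        # Descending by size; only the largest hole is ever tried.
        ordered = sorted(holes, key=lambda h: h[1], reverse=True)[:1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for hole in ordered:
        if hole[1] >= size:
            return hole
    return None

holes = [(100, 40), (300, 25), (500, 72)]  # made-up offsets and sizes
```

For a 24-byte record, first fit picks (100, 40), best fit picks (300, 25) and leaves a 1-byte fragment, and worst fit picks (500, 72) and leaves a 48-byte hole that may still be usable.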

Search Problem

- Find a record with a given key value
- Sequential search: O(n)
- Binary search: O(log n)
  - the file must be sorted
  - how to maintain the sorting order?
    - deleting, insertion
  - variable length records
- Sorting
  - RAM sort: read the whole file into RAM, sort it, and then write it back to disk
  - Keysort: read the keys into RAM, sort keys in RAM, and then rearrange records according to sorted keys
- Index

Keysorting

Data file (not moved by the sort):
RRN  Record
1    320587 Joe Smith   SC 95 3
2    184923 Kathy Lee   EN 92 3
3    249793 Albert Chan SC 94 3

Keys in RAM:
Before sorting      After sorting
Key     RRN         Key     RRN
320587  1           184923  2
184923  2           249793  3
249793  3           320587  1

Problem: Now the physical file has to be rearranged
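
A minimal Python sketch of the keysort step, using 1-based RRNs as in the figure. The keysort function and key_of parameter are illustrative names.

```python
def keysort(records: list, key_of) -> list:
    """Sort (key, RRN) pairs in RAM; the records themselves are not moved."""
    keys = [(key_of(record), rrn) for rrn, record in enumerate(records, start=1)]
    keys.sort()
    return keys

records = [
    "320587 Joe Smith   SC 95 3",
    "184923 Kathy Lee   EN 92 3",
    "249793 Albert Chan SC 94 3",
]
sorted_keys = keysort(records, key_of=lambda record: record[:6])

# The costly part the slide points out: rewriting the file in key order
# needs one read per record at an arbitrary position.
rearranged = [records[rrn - 1] for _, rrn in sorted_keys]
```

Only the small (key, RRN) pairs are sorted in memory; the final list comprehension stands in for the physical rearrangement of the file.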

Indexing

- A tool used to find things
  - book index, student record indexes
- A function from keys to addresses
- A record consisting of two fields
  - key: on which the index is searched
  - reference: location of data record associated with the key
- Advantages
  - smaller size of the index file makes RAM index possible
  - binary search from files of variable length records
  - rearrange keys without moving records
  - multiple indexes
    - primary and secondary
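
An in-RAM index of (key, reference) pairs can be sketched with Python's bisect module. The Index class below is hypothetical; the references are the byte offsets of the variable-length records used in the earlier examples.

```python
import bisect

class Index:
    """Sorted keys with a parallel list of references (byte offsets)."""

    def __init__(self):
        self.keys = []
        self.refs = []

    def add(self, key: str, ref: int) -> None:
        """Insert in key order; the data records are never moved."""
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.refs.insert(i, ref)

    def find(self, key: str):
        """Binary search on the keys; return the reference or None."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.refs[i]
        return None

index = Index()
index.add("320587", 0)
index.add("184923", 24)
index.add("249793", 47)
```

The data file stays in arrival order while the index is kept sorted, so binary search works even though the records have different lengths.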

Operations With an Indexed File

- Create original index and data file
- Load index file into RAM before using it
- Rewrite index file after using it
  - file header
- Update
  - insertion
  - deletion
  - update

Secondary Index

- Provides multiple views of records
- Example: Consider a collection of music CDs

Primary index
CD #      physical location
ABG379    ...

Composer index
composer    CD #
Beethoven   ABG379

Title index
title       CD #
Symphony    ABG379

Primary vs Secondary Keys

- Uniqueness
  - a primary key is a unique identification of a record
  - a secondary key may be associated with many records
- Binding: association of key and address
- We may retrieve records using combinations of secondary keys
  FIND all records WHERE Composer = "Beethoven" AND Title = "Symphony 9"
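
The FIND query above amounts to intersecting the CD numbers that each secondary index returns and then resolving them through the primary index. A rough Python sketch: only ABG379 comes from the slides; the other CD numbers, titles, composers, and all locations are made up for illustration.

```python
# Hypothetical data: only ABG379 appears in the slides.
primary_index = {"ABG379": 0, "HYP001": 152, "HYP002": 338}
composer_index = {"Beethoven": {"ABG379", "HYP001"}, "Dvorak": {"HYP002"}}
title_index = {"Symphony 9": {"ABG379", "HYP002"}, "Fidelio": {"HYP001"}}

def find_all(composer: str, title: str) -> dict:
    """Return {CD number: location} for CDs matching both secondary keys."""
    cd_numbers = composer_index.get(composer, set()) & title_index.get(title, set())
    return {cd: primary_index[cd] for cd in sorted(cd_numbers)}
```

The secondary indexes store CD numbers rather than addresses, so moving a record on disk only requires updating the primary index.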

Binding

- Association between a key and a physical address
- Tight binding
  - bind early
  - the binding takes place when the file is constructed
  - advantage: high performance
  - disadvantage: updates
- Lazy binding
  - bind later
  - the binding takes place when the keys are actually used
  - advantage: easy updates
  - safer: consistency
- Primary index: tight binding; secondary index: lazy binding