Chapter 12: Representing Data Elements

(Slides by Hector Garcia-Molina, http://www-db.stanford.edu/~hector/cs245/notes.htm)

Topics for today

• How to lay out data on disk
• Sec. 12.3 and 15.7 (self-study): How to move data to memory

Issues:

• Flexibility
• Space utilization
• Complexity
• Performance

To evaluate a given strategy, compute the following parameters:

-> space used for expected data
-> expected time to
   - fetch record given key
   - fetch record with next key
   - insert record
   - append record
   - delete record
   - update record
   - read entire file
   - reorganize file

What are the data items we want to store?

• a salary
• a name
• a date
• a picture

What we have available: bytes (8 bits each)

To represent:

• Integer (short): 2 bytes
  e.g., 35 is 00000000 00100011

• Real, floating point
  n bits for mantissa, m for exponent…
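
The 2-byte integer layout above can be checked with a short sketch. It assumes a big-endian, signed encoding (the slide does not state byte order or signedness), and the helper names `encode_short` and `bit_string` are made up for illustration.

```python
import struct


def encode_short(value: int) -> bytes:
    """Pack an integer into 2 bytes (assumed: big-endian, signed)."""
    return struct.pack(">h", value)


def bit_string(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in data)


print(bit_string(encode_short(35)))  # 00000000 00100011
```

Because the size is fixed by the type, a reader needs no length field to know where the value ends.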

To represent:

• Characters
  various coding schemes suggested; most popular is ASCII

Example:
A:  1000001
a:  1100001
5:  0110101
LF: 0001010
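
As a quick check of the 7-bit codes listed above, the sketch below prints the ASCII code of a character in binary. The function name `ascii_bits` is hypothetical.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII code of a single character as a bit string."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return f"{code:07b}"


for label, ch in [("A", "A"), ("a", "a"), ("5", "5"), ("LF", "\n")]:
    print(f"{label}: {ascii_bits(ch)}")
```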

To represent:

• Boolean
  e.g., TRUE  1111 1111
        FALSE 0000 0000

• Application specific
  e.g., RED 1, GREEN 3, BLUE 2, YELLOW 4, …

Can we use less than 1 byte/code?
Yes, but only if desperate...

To represent:

• Dates
  e.g.: - Integer, # days since Jan 1, 1900
        - 8 characters, YYYYMMDD
        - 7 characters, YYYYDDD
        (not YYMMDD! Why?)

• Time
  e.g.: - Integer, seconds since midnight
        - characters, HHMMSSFF
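
A minimal sketch of the integer encodings for dates and times, assuming Jan 1, 1900 counts as day 0 (the slide does not say whether counting starts at 0 or 1). The helper names are illustrative only.

```python
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)  # assumption: this date is stored as 0


def date_to_days(d: date) -> int:
    """Encode a date as the number of days since Jan 1, 1900."""
    return (d - EPOCH).days


def days_to_date(n: int) -> date:
    """Decode a day count back into a date."""
    return EPOCH + timedelta(days=n)


def time_to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Encode a time of day as seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


d = date(2000, 2, 29)
print(date_to_days(d))        # integer form
print(d.strftime("%Y%m%d"))   # 8 characters, YYYYMMDD
print(d.strftime("%Y%j"))     # 7 characters, YYYYDDD
```

The integer form is compact and makes date arithmetic a subtraction; the character forms are readable without decoding.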

To represent:

• String of characters
  – Null terminated
    e.g., c a t \0
  – Length given
    e.g., 3 c a t
  – Fixed length
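
The three string layouts can be sketched side by side. The choices below (ASCII text, a 1-byte length prefix, space padding for fixed length) are assumptions for illustration; the slide fixes none of them.

```python
def null_terminated(s: str) -> bytes:
    """Store the characters followed by a single zero byte."""
    data = s.encode("ascii")
    if b"\x00" in data:
        raise ValueError("string may not contain the terminator")
    return data + b"\x00"


def length_prefixed(s: str) -> bytes:
    """Store a 1-byte length, then the characters (assumed max 255)."""
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("too long for a 1-byte length")
    return bytes([len(data)]) + data


def fixed_length(s: str, width: int) -> bytes:
    """Store exactly `width` bytes, padding with spaces (assumed pad)."""
    data = s.encode("ascii")
    if len(data) > width:
        raise ValueError("string longer than field")
    return data.ljust(width, b" ")


print(null_terminated("cat"))   # b'cat\x00'
print(length_prefixed("cat"))   # b'\x03cat'
print(fixed_length("cat", 5))   # b'cat  '
```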

To represent:

• Bag of bits
  Length | Bits

Key Point

• Fixed length items

• Variable length items
  - usually length given at beginning

Also

• Type of an item: tells us how to interpret it (plus its size, if fixed)

Overview

• Data Items
• Records
• Blocks
• Files
• Memory

Record - Collection of related data items (called FIELDS)

E.g.: Employee record:
  name field,
  salary field,
  date-of-hire field, ...

Types of records:

• Main choices:
  – FIXED vs VARIABLE FORMAT
  – FIXED vs VARIABLE LENGTH

Fixed format

A SCHEMA (not record) contains the following information:

- # fields
- type of each field
- order in record
- meaning of each field

Example: fixed format and length

Employee record schema:
(1) E#, 2 byte integer
(2) E.name, 10 char.
(3) Dept, 2 byte code

Records:
55 | s m i t h | 02
83 | j o n e s | 01
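
The Employee example can be sketched with Python's `struct` module. The schema lives outside the records, in one format string. Several details are assumptions: big-endian byte order, unsigned integers, the Dept code stored as a 2-byte integer, and names padded with zero bytes. The function names are hypothetical.

```python
import struct

# Schema, kept outside the records:
# E# (2-byte integer), E.name (10 chars), Dept (2-byte code).
EMPLOYEE = struct.Struct(">H10sH")


def pack_employee(emp_no: int, name: str, dept: int) -> bytes:
    """Encode one fixed-format, fixed-length record (14 bytes)."""
    return EMPLOYEE.pack(emp_no, name.encode("ascii"), dept)


def unpack_employee(raw: bytes) -> tuple:
    """Decode one record; strip the zero padding from the name."""
    emp_no, name, dept = EMPLOYEE.unpack(raw)
    return emp_no, name.rstrip(b"\x00").decode("ascii"), dept


def nth_record(block: bytes, n: int) -> tuple:
    """Fixed length means record n starts at n * record size."""
    start = n * EMPLOYEE.size
    return unpack_employee(block[start:start + EMPLOYEE.size])


block = pack_employee(55, "smith", 2) + pack_employee(83, "jones", 1)
print(nth_record(block, 1))
```

Since every record has the same size, no per-record metadata is needed to find field or record boundaries.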

Variable format

• Record itself contains format
  “Self Describing”

Example: variable format and length

2 | 5 | I | 46 | 4 | S | 4 | F O R D

- 2: # fields
- 5: code identifying field as E#
- I: integer type
- 46: the E# value
- 4: code for Ename
- S: string type
- 4: length of str.
- F O R D: the Ename value

Field name codes could also be strings, i.e. TAGS
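
A minimal sketch of a self-describing record in the spirit of the example above: a field count, then for each field a code, a type letter, and the value. The byte widths (1-byte count, code and string length; 2-byte big-endian integers) are assumptions, and `encode_record` / `decode_record` are illustrative names.

```python
import struct


def encode_record(fields: list) -> bytes:
    """fields: list of (field_code, value); value is an int or a str."""
    out = bytearray([len(fields)])              # number of fields
    for code, value in fields:
        out.append(code)                        # code identifying the field
        if isinstance(value, int):
            out += b"I" + struct.pack(">h", value)
        else:
            data = value.encode("ascii")
            out += b"S" + bytes([len(data)]) + data
    return bytes(out)


def decode_record(raw: bytes) -> list:
    """Walk the record using only the information stored inside it."""
    count, pos, fields = raw[0], 1, []
    for _ in range(count):
        code, kind = raw[pos], raw[pos + 1:pos + 2]
        pos += 2
        if kind == b"I":
            (value,) = struct.unpack_from(">h", raw, pos)
            pos += 2
        elif kind == b"S":
            length = raw[pos]
            value = raw[pos + 1:pos + 1 + length].decode("ascii")
            pos += 1 + length
        else:
            raise ValueError("unknown type code")
        fields.append((code, value))
    return fields


record = encode_record([(5, 46), (4, "FORD")])
print(decode_record(record))
```

The decoder never consults an external schema, which is the flexibility being paid for with the extra bytes per field.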

Variable format useful for:

• “sparse” records
• repeating fields
• evolving formats

But may waste space...

• EXAMPLE: var format record with repeating fields

Employee can have children

3 | E_name: Fred | Child: Sally | Child: Tom

Note: Repeating fields does not imply
- variable format, nor
- variable size

John | Sailing | Chess | --

• Key is to allocate maximum number of repeating fields (if not used, null)

Many variants between fixed - variable format:

Ex. #1: Include record type in record

5 | 27 | . . . .

- 5: record type; tells me what to expect (i.e. points to schema)
- 27: record length

Page 25: Chapter 121 Chapter 12: Representing Data Elements (Slides by Hector Garcia-Molina, hector/cs245/notes.htm)

Chapter 12 25

Record header - data at beginning that describes record

May contain:
- record type
- record length
- time stamp
- other stuff ...

Ex. #2 of variant between FIXED/VAR format

• Hybrid format
  – one part is fixed, other variable

E.g.: All employees have E#, name, dept; other fields vary.

25 | Smith | Toy | 2 | retired | Hobby:chess

- 2: # of var fields

Other interesting issues:

• Compression
  – within record - e.g. code selection
  – collection of records - e.g. find common patterns

• Encryption

Next: placing records into blocks

[Figure: a file drawn as a sequence of blocks]

• assume fixed length blocks
• assume a single file (for now)

Options for storing records in blocks:

(1) separating records
(2) spanned vs. unspanned
(3) mixed record types – clustering
(4) split records

(1) Separating records

Block: R1 | R2 | R3

(a) no need to separate - fixed size recs.
(b) special marker
(c) give record lengths (or offsets)
    - within each record
    - in block header
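
Option (c) with offsets in the block header can be sketched as follows. The header layout (a 2-byte record count followed by 2-byte offsets, all big-endian, plus one extra offset marking the end of the last record) is an assumption for illustration, as are the names `build_block` and `read_record`.

```python
import struct

BLOCK_SIZE = 4096  # assumed block size


def build_block(records: list) -> bytes:
    """Lay out variable-length records behind a header of offsets."""
    header_size = 2 + 2 * (len(records) + 1)
    offsets, pos = [], header_size
    for rec in records:
        offsets.append(pos)      # where this record starts
        pos += len(rec)
    offsets.append(pos)          # end of the last record
    if pos > BLOCK_SIZE:
        raise ValueError("records do not fit in one block")
    header = struct.pack(f">H{len(offsets)}H", len(records), *offsets)
    return (header + b"".join(records)).ljust(BLOCK_SIZE, b"\x00")


def read_record(block: bytes, i: int) -> bytes:
    """Fetch record i using only the header; no scanning needed."""
    (count,) = struct.unpack_from(">H", block, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    start, end = struct.unpack_from(">2H", block, 2 + 2 * i)
    return block[start:end]


block = build_block([b"R1", b"record2", b"R-3"])
print(read_record(block, 1))
```

Compared with a special marker, offsets allow any byte value inside a record and give direct access to the i-th record.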

(2) Spanned vs. Unspanned

• Unspanned: records must be within one block

  block 1: R1 R2
  block 2: R3 R4 R5
  ...

• Spanned

  block 1: R1 R2 R3(a)
  block 2: R3(b) R4 R5 R6 R7(a)
  ...

With spanned records:

  block 1: R1 R2 R3(a)
  block 2: R3(b) R4 R5 R6 R7(a)

• R3(a): need indication of partial record, and a “pointer” to the rest
• R3(b): need indication of continuation (+ from where?)

Spanned vs. unspanned:

• Unspanned is much simpler, but may waste space…

• Spanned essential if record size > block size

Example

10^6 records, each of size 2,050 bytes (fixed)
block size = 4096 bytes

Unspanned, only one record fits per block:

  block 1: R1 (2050 bytes), 2046 bytes wasted
  block 2: R2 (2050 bytes), 2046 bytes wasted

• Total wasted = 2 x 10^9 bytes
• Total space = 4 x 10^9 bytes
• Utilization = 50%
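
The figures in this example can be reproduced with a few lines of arithmetic. The sketch below ignores block and record headers; the spanned variant also ignores the space needed for partial-record pointers, so it is an optimistic bound. Both function names are hypothetical.

```python
def unspanned_usage(num_records: int, record_size: int, block_size: int):
    """Return (total bytes, wasted bytes, utilization) without spanning."""
    per_block = block_size // record_size
    if per_block == 0:
        raise ValueError("record larger than block: spanning is required")
    blocks = -(-num_records // per_block)   # ceiling division
    total = blocks * block_size
    used = num_records * record_size
    return total, total - used, used / total


def spanned_blocks(num_records: int, record_size: int, block_size: int) -> int:
    """Blocks needed if records may cross block boundaries (no overhead)."""
    return -(-(num_records * record_size) // block_size)


total, wasted, utilization = unspanned_usage(10**6, 2050, 4096)
print(total, wasted, round(utilization, 4))
print(spanned_blocks(10**6, 2050, 4096))
```

With these numbers the unspanned layout needs one block per record, while the spanned layout needs roughly half as many blocks.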

(3) Mixed record types

• Mixed - records of different types (e.g. EMPLOYEE, DEPT) allowed in same block

e.g., a block: EMP e1 | DEPT d1 | DEPT d2

Why do we want to mix?
Answer: CLUSTERING

Records that are frequently accessed together should be in the same block

Compromise:

No mixing, but keep related records in same cylinder ...

Example

Q1: select A#, C_NAME, C_CITY, …
    from DEPOSIT, CUSTOMER
    where DEPOSIT.C_NAME = CUSTOMER.C_NAME

a block:
  CUSTOMER, NAME=SMITH
  DEPOSIT, NAME=SMITH
  DEPOSIT, NAME=SMITH

• If Q1 frequent, clustering good
• But if Q2 frequent, CLUSTERING IS COUNTERPRODUCTIVE

Q2: SELECT * FROM CUSTOMER

(4) Split records

• Fixed part in one block
• Variable part in another block

Typically for hybrid format

[Figure: split records]

Block with fixed recs.: R1 (a), R1 (b)
Block with variable recs.: R2 (a), R2 (b), R2 (c)
(This block also has fixed recs.)

Question

What is the difference between
- Split records
- Simply using two different record types?

Options for storing records in blocks

(1) Separating records
(2) Spanned vs. Unspanned
(3) Mixed record types - Clustering
(4) Split records

Other Topics

(1) Insertion/Deletion
(2) Buffer Management
(3) Comparison of Schemes

Deletion

[Figure: a block containing record Rx]

Options:

(a) Immediately reclaim space
(b) Mark deleted
    – May need chain of deleted records (for re-use)
    – Need a way to mark:
      • special characters
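
Option (b) can be sketched with an in-memory list of slots. A deleted slot is marked and linked into a chain so that a later insert can re-use it. The class name `SlotFile`, the `"DELETED"` marker, and the last-in-first-out re-use order are assumptions for illustration.

```python
class SlotFile:
    """Fixed-size record slots; deleted slots are marked and chained."""

    def __init__(self):
        self.slots = []         # record bytes, or ("DELETED", next_free)
        self.free_head = None   # index of the first re-usable slot

    def insert(self, record: bytes) -> int:
        """Re-use a deleted slot if the chain has one, else append."""
        if self.free_head is not None:
            idx = self.free_head
            _, self.free_head = self.slots[idx]   # unlink from chain
            self.slots[idx] = record
            return idx
        self.slots.append(record)
        return len(self.slots) - 1

    def delete(self, idx: int) -> None:
        """Mark the slot deleted and push it onto the free chain."""
        if isinstance(self.slots[idx], tuple):
            raise KeyError(f"slot {idx} is already deleted")
        self.slots[idx] = ("DELETED", self.free_head)
        self.free_head = idx


f = SlotFile()
a, b, c = f.insert(b"r0"), f.insert(b"r1"), f.insert(b"r2")
f.delete(b)
f.delete(a)
print(f.insert(b"r3"))  # re-uses the most recently deleted slot
```

No valid record is moved on deletion, at the cost of carrying dead slots until they are re-used.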

As usual, many tradeoffs...

• How expensive is it to move a valid record to free space for immediate reclaim?

• How much space is wasted?
  – e.g., deleted records, free space chains, ...

Concern with deletions

Dangling pointers

[Figure: a pointer to R1 that now leads to "?"]

Solution #1: Do not worry

Solution #2: Tombstones

E.g., leave “MARK” in map or old location

• Physical IDs

[Figure: a block. The space holding the tombstone is never re-used; the rest of the deleted record's space can be re-used.]

Solution #2: Tombstones

E.g., leave “MARK” in map or old location

• Logical IDs

[Figure: a map with columns ID and LOC; the entry for ID 7788 holds the tombstone]

Never reuse ID 7788 nor space in map...
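
The logical-ID variant can be sketched with a small map class. Deleting a record leaves a tombstone in its map entry, and IDs are never handed out twice, so an old pointer resolves to "deleted" rather than to some other record. The names `RecordMap` and `TOMBSTONE`, and the sequential ID scheme, are assumptions for illustration.

```python
TOMBSTONE = object()  # the "MARK" left behind in the map


class RecordMap:
    """Maps logical record IDs to locations; deleted IDs keep a tombstone."""

    def __init__(self):
        self.map = {}       # logical ID -> location, or TOMBSTONE
        self.next_id = 0    # IDs only grow, so none is ever re-used

    def insert(self, location) -> int:
        rid = self.next_id
        self.next_id += 1
        self.map[rid] = location
        return rid

    def delete(self, rid: int) -> None:
        if self.map.get(rid, TOMBSTONE) is TOMBSTONE:
            raise KeyError(rid)
        self.map[rid] = TOMBSTONE   # the record's own space can be re-used

    def lookup(self, rid: int):
        """Return the location, or None if the record was deleted."""
        loc = self.map[rid]
        return None if loc is TOMBSTONE else loc


m = RecordMap()
rid = m.insert(("block 3", 120))
m.delete(rid)
print(m.lookup(rid))  # None: the dangling pointer is detected
```

The cost is the one the slide names: the map entry for a deleted ID stays occupied indefinitely.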

Insert

Easy case: records not in sequence
- Insert new record at end of file or in deleted slot
- If records are variable size, not as easy...

Insert

Hard case: records in sequence
- If free space “close by”, not too bad...
- Or use overflow idea...

Interesting problems:

• How much free space to leave in each block, track, cylinder?

• How often do I reorganize file + overflow?

Comparison

• There are 10,000,000 ways to organize my data on disk…

Which is right for me?

Issues:

• Flexibility
• Space utilization
• Complexity
• Performance

To evaluate a given strategy, compute the following parameters:

-> space used for expected data
-> expected time to
   - fetch record given key
   - fetch record with next key
   - insert record
   - append record
   - delete record
   - update record
   - read entire file
   - reorganize file

Example

How would you design the Megatron 3000 storage system? (for a relational DB, low end)
– Variable length records?
– Spanned?
– What data types?
– Fixed format?
– How to handle deletions?

Summary

• How to lay out data on disk

  Data Items
  Records
  Blocks
  Files
  Memory
  DBMS

Next

How to find a record quickly, given a key