CS 500: Database Theory, Final Review. Julia Stoyanovich (stoyanovich@drexel.edu)


Post on 30-Jun-2018


Page 1: Final Review - Computer Science | Drexel CCIjulia/cs500/documents/lectures/lecture... · Final Review!! Julia Stoyanovich (stoyanovich@drexel.edu) Julia Stoyanovich Final exam logistics

CS 500: Database Theory

Final Review

Julia Stoyanovich (stoyanovich@drexel.edu)


Final exam logistics

• When: August 29th @ 9am through September 1 @ 9pm

• The same format as the midterm: electronic, open book / open notes

• 3 hours in length

• The exam is cumulative: it will include material from the first half of the term, but will likely focus more on the material from the second half of the term (homeworks 3 and 4)


Topics not on the final

• Database application development (JDBC and such)

• MapReduce and Spark

• Data, Responsibly


Data mining (HW4)

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}.

Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?

conf({4,6} → 12) = supp({4,6,12}) / supp({4,6})

A rule has 100% confidence if, for any basket b, whenever all items on the left divide b, the item on the right also divides b.

• {4,6} → 12: since 4 and 6 are in b, b = 2 * 2 * 3 * c, where c is some natural number. In this case, is 12 guaranteed to divide b? Yes.

• {3,5} → 1

• {8,10} → 20: b = 2 * 2 * 2 * 5 * c

• {3,4,5} → 30: b = 2 * 2 * 3 * 5 * c

• {1,2} → 4

• {2,3,5} → 45

• {3,6} → 9
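The rules above can also be checked by brute force. The following is a minimal sketch, not part of the original slides; the function name `confidence` is an illustrative choice. It counts the baskets that contain the left-hand side and the fraction of those that also contain the right-hand item.

```python
def confidence(lhs, rhs, n_baskets=100):
    """Confidence of lhs -> rhs when item i is in basket j iff i divides j."""
    # Baskets that contain every item on the left-hand side.
    lhs_baskets = [b for b in range(1, n_baskets + 1)
                   if all(b % i == 0 for i in lhs)]
    if not lhs_baskets:
        return None  # confidence is undefined when the left side has no support
    both = [b for b in lhs_baskets if b % rhs == 0]
    return len(both) / len(lhs_baskets)

rules = [({4, 6}, 12), ({3, 5}, 1), ({8, 10}, 20), ({3, 4, 5}, 30),
         ({1, 2}, 4), ({2, 3, 5}, 45), ({3, 6}, 9)]
for lhs, rhs in rules:
    print(sorted(lhs), "->", rhs, confidence(lhs, rhs))
```

A rule reaches confidence 1.0 exactly when the right-hand item divides the LCM of the left-hand items.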


Data mining (HW4)

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}.

An itemset S is closed if no proper superset of S has the same support.

Describe all closed itemsets. Which of the itemsets are closed?

An itemset S is closed if the least common multiple (LCM) of its numbers increases when any other number between 1 and 100 is added to the itemset. That is, S consists of some integer j (the LCM) and all divisors of j.

• {1,5,7,35}: What is the support set of this itemset? In which baskets can all of its items be found? Those where b = 5 * 7 * c, i.e., baskets 35 and 70.

• {1,3,4,12}: In which baskets can all of its items be found? Those where b = 2 * 2 * 3 * c, i.e., baskets 12, 24, 36, 48, 60, 72, 84, 96. But all of these are also divisible by 2, so adding item 2 does not change the support.

• {1,2,3,6}

• {1,5,25}

• {1,2,17,34}

• {1,2,3,4,8}

• {1,2,3,5}

• {1,3,5,30}
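A brute-force check of closedness, as a hedged sketch (the helper names `support` and `is_closed` are illustrative, not from the slides). It relies on the fact that if any proper superset has the same support, then adding a single one of its extra items already leaves the support unchanged.

```python
def support(itemset, n=100):
    """Baskets containing all items, when item i is in basket j iff i divides j."""
    return {b for b in range(1, n + 1) if all(b % i == 0 for i in itemset)}

def is_closed(itemset, n=100):
    """True if adding any missing item strictly shrinks the support set."""
    base = support(itemset, n)
    return all(support(itemset | {x}, n) != base
               for x in range(1, n + 1) if x not in itemset)

candidates = [{1, 5, 7, 35}, {1, 3, 4, 12}, {1, 2, 3, 6}, {1, 5, 25},
              {1, 2, 17, 34}, {1, 2, 3, 4, 8}, {1, 2, 3, 5}, {1, 3, 5, 30}]
for s in candidates:
    print(sorted(s), is_closed(s))
```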


Data mining (HW4)

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if j divides i evenly. For example, basket 24 is the set of items {24, 48, 72, 96}.

A frequent itemset S is maximal if no proper superset of S is frequent.

Describe all maximal itemsets that have support exactly 4.

Under the new definition of the itemset to basket mapping, an itemset is in the basket that is the greatest common divisor (GCD) of its items, and in the baskets that correspond to all divisors of the GCD. All itemsets are in b=1.

To have support 4, the GCD must have exactly 4 distinct divisors (or 3 divisors in addition to the number 1). To realize this, the GCD must be a product of 2 distinct primes a * b or a cube of a prime a * a * a.

• {27,54,81}: In which baskets can all of its items be found? Those where b divides 27 and 54 and 81. That's baskets 27, 9, 3, 1.

• {15,30,45,60,75,90}

• {25,50,75,100}

• {6,36}

• {22,44,66,88}
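The GCD argument can be sketched in code. This is an illustration under one stated assumption: "maximal with support exactly 4" is read as "support is 4 and no item can be added without lowering the support", which holds exactly when the itemset already contains every multiple of its GCD up to 100. The function names are hypothetical.

```python
from functools import reduce
from math import gcd

def support_count(itemset):
    """Number of baskets containing the itemset when item i is in basket j iff j divides i."""
    g = reduce(gcd, itemset)
    # The itemset is in basket b exactly when b divides the GCD of its items.
    return sum(1 for b in range(1, g + 1) if g % b == 0)

def is_maximal_with_support_4(itemset, n=100):
    """Support is 4 and the itemset holds every multiple of its GCD up to n."""
    g = reduce(gcd, itemset)
    return support_count(itemset) == 4 and set(itemset) == set(range(g, n + 1, g))

for s in [{27, 54, 81}, {15, 30, 45, 60, 75, 90}, {25, 50, 75, 100},
          {6, 36}, {22, 44, 66, 88}]:
    print(sorted(s), support_count(s), is_maximal_with_support_4(s))
```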


User-defined types

rank = (reductionFactor − 1) / cost

Videos (vid: int, vfile: video_file), N = 100,000

• Violence: returns true if the given video contains violence, and false otherwise. On average it takes c1 = 0.4 sec to evaluate this method on a video, and we estimate that this method returns true for r1 = 20% of the videos in the relation.

• StrongLanguage: returns true if the given video contains strong language, and false otherwise. On average it takes c2 = 0.3 sec to evaluate this method on a video, and we estimate that this method returns true for r2 = 10% of the videos in the relation.

• Nudity: returns true if the given video contains nudity, and false otherwise. On average it takes c3 = 0.2 sec to evaluate this method on a video, and we estimate that this method returns true for r3 = 20% of the videos in the relation.

SELECT *
FROM   Videos
WHERE  Violence(vfile) AND StrongLanguage(vfile) AND NOT Nudity(vfile)


User-defined types

rank = (reductionFactor − 1) / cost

Videos (vid: int, vfile: video_file), N = 100,000

• Violence: c1 = 0.4 sec, r1 = 20%
• StrongLanguage: c2 = 0.3 sec, r2 = 10%
• Nudity: c3 = 0.2 sec, r3 = 20%
• NOT Nudity: c3 = 0.2 sec, r3' = 80%

• rank(Violence) = (r1 − 1) / c1 = −0.8 / 0.4 = −2
• rank(StrongLanguage) = (r2 − 1) / c2 = −0.9 / 0.3 = −3
• rank(NOT Nudity) = (r3' − 1) / c3 = −0.2 / 0.2 = −1

In the most efficient query evaluation plan, conditions are evaluated in increasing order of their rank. The best query plan is σ(NOT Nudity(Violence(StrongLanguage(Videos)))).
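The rank computation can be reproduced in a few lines. This is a small sketch using the costs and reduction factors given above; the dictionary layout and variable names are illustrative choices.

```python
def rank(reduction_factor, cost):
    """rank = (reductionFactor - 1) / cost, as defined on the slide."""
    return (reduction_factor - 1) / cost

# predicate -> (reduction factor, cost per tuple in seconds)
predicates = {
    "Violence": (0.2, 0.4),
    "StrongLanguage": (0.1, 0.3),
    "NOT Nudity": (0.8, 0.2),
}

# Evaluate predicates in increasing order of rank (most negative first).
order = sorted(predicates, key=lambda p: rank(*predicates[p]))
print(order)
```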


User-defined types

Videos (vid: int, vfile: video_file), N = 100,000

In the most efficient query evaluation plan, conditions are evaluated in increasing order of their rank. The best query plan is σ(NOT Nudity(Violence(StrongLanguage(Videos)))).

Let us see how this query is evaluated step-by-step.

1. N = 100,000 tuples are processed by StrongLanguage, which costs 100,000 * 0.3 sec; 10% of the tuples are passed along.

2. N' = 10,000 tuples are processed by Violence, which costs 10,000 * 0.4 sec; 20% of the tuples are passed along.

3. N'' = 2,000 tuples are processed by Nudity, which costs 2,000 * 0.2 sec. Those failing the condition are returned.

Total cost = 100,000 * 0.3 + 10,000 * 0.4 + 2,000 * 0.2 = 34,400 sec
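The step-by-step evaluation generalizes to any ordering of the predicates. A minimal sketch, assuming independent predicates as on this slide; `pipeline_cost` is a hypothetical helper name.

```python
def pipeline_cost(n_tuples, stages):
    """Total cost of a predicate pipeline.

    stages: list of (cost_per_tuple_sec, fraction_passed), in evaluation order.
    Returns (total cost in seconds, number of tuples that survive all stages).
    """
    total, remaining = 0.0, n_tuples
    for cost, passed in stages:
        total += remaining * cost   # every surviving tuple is evaluated once
        remaining *= passed         # only a fraction moves to the next stage
    return total, remaining

# StrongLanguage, then Violence, then NOT Nudity
total, returned = pipeline_cost(100_000, [(0.3, 0.1), (0.4, 0.2), (0.2, 0.8)])
print(round(total), round(returned))
```

Reordering the list of stages gives the cost of any alternative plan, which makes it easy to confirm that the rank-ordered plan is the cheapest of the six permutations.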


User-defined types

rank = (reductionFactor − 1) / cost

Videos (vid: int, vfile: video_file), N = 100,000

In addition to the information on costs and reduction factors given in (a), you are now told that an estimated 90% of the videos for which StrongLanguage is true also have Violence evaluate to true. Should you modify your plan from (a) in light of this information?

σ(NOT Nudity(Violence(StrongLanguage(Videos))))

In light of this new information, r1 = 90% in a plan that contains Violence(StrongLanguage(Videos)), in that order. The cost of the best plan in (a) is modified as follows:

1. N = 100,000 tuples are processed by StrongLanguage, which costs 100,000 * 0.3 sec. 10% of the tuples are passed along to the next operator in the pipeline.

2. N' = 10,000 tuples are processed by Violence, which costs 10,000 * 0.4 sec. 90% of the tuples are passed along to the next operator in the pipeline.

3. N'' = 9,000 tuples are processed by Nudity, which costs 9,000 * 0.2 sec. Those failing the condition are returned.

Total cost = 100,000 * 0.3 + 10,000 * 0.4 + 9,000 * 0.2 = 35,800 sec.


User-defined types

rank = (reductionFactor − 1) / cost

Videos (vid: int, vfile: video_file), N = 100,000

In addition to the information on costs and reduction factors given in (a), you are now told that an estimated 90% of the videos for which StrongLanguage is true also have Violence evaluate to true. Should you modify your plan from (a) in light of this information?

σ(NOT Nudity(Violence(StrongLanguage(Videos))))
Total cost = 100,000 * 0.3 + 10,000 * 0.4 + 9,000 * 0.2 = 35,800 sec

Going back to reasoning about ranks, we now have: rank(Violence) = (0.9 − 1) / c1 = −0.1 / 0.4 = −0.25. This is the highest rank, and so this operator should be evaluated last. A different plan is now the most efficient:

σ(Violence(NOT Nudity(StrongLanguage(Videos))))
Total cost = 100,000 * 0.3 + 10,000 * 0.2 + 8,000 * 0.4 = 35,200 sec


Basic file organization

• Heap files: good for full file scans or frequent updates

  - unordered files

  - insert at the end of file

  - assumes equality selection on key, exactly one match (why?)

• Sorted files: good for range queries on sort field(s)

  - need external sort to keep sorted

  - compacted after deletion

  - assumes selection on sort field(s)

• Hashed files: good for selection on equality

  - collection of buckets with primary & overflow pages

  - hashing function h(r) = bucket for record r

  - each bucket is a heap file


Cost of operations

                   Heap File      Sorted File                              Hashed File *
Scan all recs      p(T) D         p(T) D                                   1.25 p(T) D
Equality Search    p(T) D / 2     D log2 p(T)                              D
Range Search       p(T) D         D log2 p(T) + (# pages with matches)     1.25 p(T) D
Insert             2D             Search + p(T) D                          2D
Delete             Search + D     Search + p(T) D                          2D

* assuming no overflow bucket, 80% page occupancy

p(T) - number of data pages in table T

r(T) - number of records in table T

D - time to read or write a disk page


Clustered vs. unclustered index

[Figure: two panels, CLUSTERED and UNCLUSTERED. In each, data entries in the index file point to data records in the data file.]


B+ tree search

• Start at root, use key comparisons to navigate to a leaf

• Search for 5*, 15*, all data entries >= 24*

[Figure: B+ tree. Root: 13 | 17 | 24 | 30. Leaf data entries: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*]

How many disk I/Os to answer a point query?

How many disk I/Os to answer a range query?


Access paths

• An access path is a method of retrieving tuples: file scan, or index that matches a selection in the query

• An index matches a conjunction of terms if it can be used to retrieve all data values that match this conjunction of terms.

• A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.

  - e.g., tree index <a,b,c> matches the selection a=5 AND b=3; it also matches a=5 AND b>4; it does not match b=3.

• A hash index matches a conjunction of terms that has a term attribute=value for every attribute in the search key of the index.

  - e.g., hash index on <a,b,c> matches a=5 AND b=3 AND c=5; it does not match b=3; or a=5 AND b=5; or a>5 AND b=3 AND c=5
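The two matching rules can be written as small predicates. This is a simplified sketch of the rules as stated above, not a DBMS implementation; a conjunction is modeled as a list of (attribute, operator) pairs, and both function names are illustrative.

```python
def tree_index_matches(search_key, terms):
    """Tree index: the terms involve only attributes in a prefix of the search key."""
    attrs = {attr for attr, _op in terms}
    return bool(attrs) and attrs == set(search_key[:len(attrs)])

def hash_index_matches(search_key, terms):
    """Hash index: an equality term exists for every attribute of the search key."""
    eq_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= eq_attrs

key = ["a", "b", "c"]
print(tree_index_matches(key, [("a", "="), ("b", ">")]))             # a=5 AND b>4
print(hash_index_matches(key, [("a", ">"), ("b", "="), ("c", "=")]))  # a>5 AND b=3 AND c=5
```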


One approach to selection

• Find the most selective access path, retrieve tuples using it, and apply the remaining terms that do not match the index.

• The most selective access path: an index or file scan (!) that we estimate will require the fewest page I/Os.

• Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples / pages fetched.

• Example: day < 1/1/2011 AND bid=5 AND sid=3

  - option 1: use a B+ tree index on day, then check bid=5 and sid=3 for each retrieved tuple

  - option 2: use a hash index on <bid, sid>, then check day < 1/1/2011 for each retrieved tuple

Once again, we are interested in quantifying the I/O-based cost


Index-only evaluation

• Many DBMSs implement index-only query plans, which apply when the query can be satisfied using only the information in the search key of the index, without going to the data records on disk

• Important because typically only 1 index is clustered, and so using other indexes will potentially trigger several random I/Os

• Works well only with unclustered indexes


Using an index for selection

Sailors (sid: int, sname: string, rating: int, age: real)

Reserves (sid: int, bid: int, day: date, rname: string)

Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages

SELECT * FROM Reserves R WHERE R.rname < 'C%'

• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large)

• Example: assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10,000 tuples).

  - with a clustered index, cost is little more than 100 I/Os

  - with an unclustered index, cost is up to 10,000 I/Os


Access paths: another example

Employees (ssn, name, salary, age, did);

100,000 employees, 10 employee records per disk page.
Stored on disk in a sorted file (alternative 1), with did as the sort key.

Salaries from 0 to $100K; ages from 20 to 80; 50 employees per department. Uniform, uncorrelated values.

For each query: (1) List indexes that match the query. (2) What index would you build? (3) What is the cost of using that index to answer this query?

Q1. Compute the number of employees whose salary is $35K and who work in department 177.

Q2. List name, age, salary of employee with eid=12357.

Q3. Compute the number of employees who are between 30 and 35 years old.


2-way merge-sort

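Example with N=7 pages (pages separated by commas within a run, runs separated by |; two values per page):

Input file:              3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):    3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):    2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):    2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):    1,2 2,3 3,4 4,5 6,6 7,8 9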


2-way merge-sort

• What is the cost of this algorithm? Suppose the input occupies N = 2^k disk pages.

• In each pass, we read each page, process it, and write it out: 2 disk I/Os per page, per pass

• There are log2 N + 1 = k + 1 passes

• The overall cost is 2N (log2 N + 1) I/Os

[Figure: three main-memory buffers (INPUT 1, INPUT 2, OUTPUT) between the input disk and the output disk]


Generalization: external merge-sort

[Figure: the input is split into memory-sized chunks (M), each chunk is sorted into a run, and the runs are merged into the final sorted result]

N records, divided into NR / M sorted runs of M / R records each

B: block size; M: main memory size
N: input size (blocks); R: size of 1 record


Cost of external merge-sort

B: block size; M: main memory size
N: input size (pages); R: size of 1 record

Cost = 2 * N * (log_{M−1}(N / M) + 1)

Given B = 4KB, M = 64MB, R = 0.1KB

Pass 1: runs of 40 * 16 * 1024 = about 640,000 records

Pass 2: runs increase by a factor of M/B − 1 = 16,000
        sorted runs of 10,240,000,000 records

Pass 3: runs increase by a factor of M/B − 1 = 16,000
        sorted runs of 10^14 records

With a modest memory size, we can sort everything in 2-3 passes!


External merge-sort example

A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).

What is the number of passes, and the cost, of 2-way external merge-sort?

In this dataset, there are ceil(10,000 / 64) = 157 pages that must be sorted. In two-way external merge-sort, we use 1 memory block in pass 0 (each 64-record block is sorted), and 3 memory blocks in subsequent passes (pairs of adjacent sorted runs are merged).

To sort 157 pages, we will need 1 + ceil(log2 157) = 9 passes.

Each page is read and written once on each pass (2 I/Os per page per pass). Thus, the total cost of two-way external merge-sort on this dataset is 2 * 157 * 9 = 2,826 I/Os.


External merge-sort example

A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).

With memory size of 320KB, how many passes for generalized external merge sort? What is the cost?

Memory fits 320 / 64 = 5 pages. All are used for sorting in pass 0. All but 1 are used for merging in subsequent passes; the remaining page is used for output.

In pass 0 of generalized external merge-sort, we read in and sort 320KB (5 blocks worth) at a time, creating ceil(157/5) = 32 sorted runs of 5 blocks each.

Then in subsequent passes we merge 5 − 1 = 4 neighboring runs at a time. We need ceil(log4 32) = 3 merge passes to complete sorting (32 runs, then 8, then 2, then 1). That's a total of 4 passes, with 2 I/Os per page per pass, for a total of 2 * 157 * 4 = 1,256 I/Os, a significant reduction compared to the 2,826 I/Os of two-way merge-sort.
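Both merge-sort examples can be checked with one small function. This is a sketch under the usual assumptions (every pass reads and writes each page once); the function and parameter names are illustrative. It counts merge passes with a loop instead of a logarithm, which avoids floating-point rounding: 32 runs merged 4 at a time become 8, then 2, then 1, which is three merge passes.

```python
from math import ceil

def external_sort_cost(n_pages, run_pages, fan_in):
    """Return (number of passes, total I/Os) for external merge-sort.

    run_pages: pages sorted in memory at a time in pass 0
    fan_in:    number of runs merged together in each later pass
    """
    runs = ceil(n_pages / run_pages)   # sorted runs produced by pass 0
    passes = 1
    while runs > 1:                    # each merge pass turns fan_in runs into one
        runs = ceil(runs / fan_in)
        passes += 1
    return passes, 2 * n_pages * passes

n_pages = ceil(10_000 / 64)                 # 157 pages
print(external_sort_cost(n_pages, 1, 2))    # two-way: 1-page runs, 2-way merges
print(external_sort_cost(n_pages, 5, 4))    # 5 buffers: 5-page runs, 4-way merges
```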


Datalog

Buys(p,g) :− Likes(p,g)
Buys(p,g) :− Follows(p,f), Likes(f,g), ¬Hates(p,g)

Likes(A, 'Skirts')
Likes(A, 'Stilettos')
Likes(B, 'Shorts')
Likes(B, 'Sneakers')
Hates(A, 'Sneakers')
Follows(A,B)
Follows(B,A)
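Because the Buys rules are not recursive, each rule can be evaluated once over the facts. A minimal sketch using Python sets of tuples as relations (the representation is an illustrative choice, not part of the slides):

```python
likes = {("A", "Skirts"), ("A", "Stilettos"), ("B", "Shorts"), ("B", "Sneakers")}
hates = {("A", "Sneakers")}
follows = {("A", "B"), ("B", "A")}

# Rule 1: Buys(p,g) :- Likes(p,g)
buys = set(likes)

# Rule 2: Buys(p,g) :- Follows(p,f), Likes(f,g), not Hates(p,g)
buys |= {(p, g)
         for (p, f) in follows
         for (f2, g) in likes
         if f == f2 and (p, g) not in hates}

for fact in sorted(buys):
    print("Buys", fact)
```

A does not buy sneakers: the only derivation goes through Likes(B, 'Sneakers'), and it is blocked by Hates(A, 'Sneakers').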


Datalog

Linear version:

Path(x,y) :− Edge(x,y)
Path(x,y) :− Edge(x,z), Path(z,y)

Non-linear version:

Path(x,y) :− Edge(x,y)
Path(x,y) :− Path(x,z), Path(z,y)

[Figure: example graph with nodes 0 through 11]
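The recursive Path program can be evaluated bottom-up by applying the rules until no new facts appear. The sketch below implements naive evaluation of the first (linear) variant. The edges of the slide's graph did not survive extraction, so the edge set used here is purely hypothetical.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of
       Path(x,y) :- Edge(x,y)
       Path(x,y) :- Edge(x,z), Path(z,y)
    """
    path = set(edges)                       # first rule
    while True:
        # second rule: join Edge with the Path facts derived so far
        new = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2} - path
        if not new:                         # fixpoint reached
            return path
        path |= new

edges = {(0, 1), (1, 2), (2, 3)}            # hypothetical graph, not the one on the slide
print(sorted(transitive_closure(edges)))
```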


Normalization: important to know

• Closures, keys

  - Computing the closure of a set of attributes

  - Identifying candidate keys of a relation

  - Identifying whether an FD follows from a set of FDs

  - Minimal basis of a set of FDs

• Normal forms and decompositions

  - Determining whether a relation is in BCNF, in 3NF

  - Decomposition into BCNF

  - Determining whether a decomposition into BCNF is dependency-preserving

  - Decomposition into 3NF (synthesis)


Closure of a set of attributes

Suppose A = {A1, …, An} is a set of attributes and S is a set of FDs.

The closure of A under the FDs in S is the set of attributes B s.t. every relation that satisfies all the FDs in S also satisfies A → B.

We denote the closure of {A1,A2,…,An} by {A1,A2,…,An}+

Note that {A1,A2,…,An} ⊆ {A1,A2,…,An}+


Computing the closure of a set of attributes

Algorithm AttributeClosure

Input: a set of attributes {A1,A2,…,An} and a set of FDs S
Output: the closure {A1,A2,…,An}+

1. Split the FDs of S using the splitting rule, so that each FD has one attribute on the right

2. Initialize {A1,A2,…,An}+ ← {A1,A2,…,An}

3. Repeatedly search for some FD B1,B2,…,Bm → C such that {B1,B2,…,Bm} ⊆ {A1,A2,…,An}+ ∧ C ∉ {A1,A2,…,An}+, and add C to {A1,A2,…,An}+

4. Stop when no more attributes can be added to {A1,A2,…,An}+
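AttributeClosure translates almost line by line into code. In this sketch an FD is a pair (lhs, rhs) of attribute strings; that representation, the function name, and the sample FDs are illustrative assumptions. Right-hand sides with several attributes are handled directly, which is equivalent to splitting them first.

```python
def closure(attrs, fds):
    """Closure of a set of attributes under a list of FDs.

    attrs: iterable of attribute names, e.g. "AB"
    fds:   list of (lhs, rhs) pairs, e.g. [("AB", "D"), ("BD", "C")]
    """
    result = set(attrs)                      # step 2: start from the attributes themselves
    changed = True
    while changed:                           # steps 3-4: repeat until nothing can be added
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [("BD", "C"), ("AB", "D"), ("AC", "B"), ("BD", "A")]   # sample FDs over R(ABCD)
print(sorted(closure("AB", fds)), sorted(closure("AD", fds)))
```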


Closures and keys

Q: How can we tell if a set of attributes A1A2…An is a candidate key or a superkey of a relation R?

A: If {A1A2…An}+ = all the attributes in R, the set is a superkey; it is a candidate key if, in addition, no proper subset of it has this property.

Q: How can we compute the candidate keys for R?

A: Find all sets of attributes that functionally determine all other attributes and make sure these sets are minimal.


Example

R(ABCD): BD → C ; AB → D ; AC → B ; BD → A

Find all candidate keys of R under the given set of FDs.
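Candidate keys can be found by brute force: try attribute sets in order of increasing size and keep those whose closure is all of R, skipping supersets of keys already found. This is a sketch for small schemas only (it enumerates all subsets); the function names and the FD representation are illustrative.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attrs, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            s = set(combo)
            if any(k <= s for k in keys):    # contains a smaller key: not minimal
                continue
            if closure(s, fds) >= set(attrs):
                keys.append(s)
    return keys

fds = [("BD", "C"), ("AB", "D"), ("AC", "B"), ("BD", "A")]
print(["".join(sorted(k)) for k in candidate_keys("ABCD", fds)])
```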


Example

R(ABCD): ABD → C ; A → B ; AB → C ; B → A

Find all candidate keys of R under the given set of FDs.


Minimal basis of a set of FDs

• For a given relation R, there may exist several sets of FDs that are equivalent:

  - they give rise to the same closures of all subsets of R's attributes

  - the same sets of FDs follow from them

  - all such equivalent sets of FDs are called bases for S in R

• A minimal basis B is a set of FDs that satisfies 3 conditions

  1. All FDs in B have 1 attribute on the right

  2. If any FD is removed from B, the result is no longer a basis

  3. If for any FD in B we remove 1 attribute on the left, the result is no longer a basis


Example

R(ABCD): C → B ; BC → A ; A → C ; BD → A

Find all candidate keys.

Check whether the following are minimal bases of the set of FDs.

• {AC → D, D → B}

• {D → A, D → B, D → C}


Computing a projection of a set of FDs

Suppose relation R is given, with its corresponding set of FDs S. If we take a projection of R onto a set of attributes L, what can we say about the FDs of π_L(R)?

Why not simply take S and project each FD?

Algorithm ProjectFDs

Input: Relations R and R1 = π_L(R). A set of FDs S that hold in R.

Output: The set of FDs T that hold in R1.

1. Compute the closure of each subset of attributes of R1 in S. Add to T all non-trivial FDs X → A s.t. A is both in X+ and an attribute of R1.

2. Remove from T all FDs that involve attributes not in L (on either side).

3. Optionally compute the minimal basis of T, remove FDs from T that do not belong to the minimal basis.


Example

R(ABCD): A → B ; B → C ; C → D

Compute a projection of the set of FDs when R(ABCD) is projected onto ACD, i.e., the FDs of π_ACD(R).
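A direct sketch of FD projection for this example: compute the closure of every subset of the projected attributes and keep the non-trivial FDs whose right side stays inside the projection. The optional minimal-basis step is left out, so the output contains redundant FDs such as AC → D. The function names and FD representation are illustrative.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def project_fds(sub_attrs, fds):
    """Non-trivial FDs X -> A that hold in the projection onto sub_attrs."""
    projected = []
    for size in range(1, len(sub_attrs) + 1):
        for combo in combinations(sorted(sub_attrs), size):
            x = set(combo)
            # keep only implied attributes that are in the projection and not in X
            for a in sorted((closure(x, fds) & set(sub_attrs)) - x):
                projected.append(("".join(combo), a))
    return projected

fds = [("A", "B"), ("B", "C"), ("C", "D")]
print(project_fds("ACD", fds))
```

The FD A → C holds in the projection even though no FD of S mentions only A and C, which is why projecting each FD of S individually is not enough.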


Boyce-Codd Normal Form (BCNF)

Let R be a relation schema, S be the set of FDs given to hold over R.

R is in BCNF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:

1. The FD is trivial: {B1,B2,…,Bm} ⊆ {A1,A2,…,An}

2. A1A2…An is a candidate key of R

3. A1A2…An is a superkey of R

In a BCNF relation, the only set of attributes that determines values for other attributes is a superkey.


Third Normal Form (3NF)

Let R be a relation schema, S be the set of FDs given to hold over R.

R is in 3NF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:

1. The FD is trivial: {B1,B2,…,Bm} ⊆ {A1,A2,…,An}

2. A1A2…An is a candidate key of R

3. A1A2…An is a superkey of R

4. Each Bi is part of some candidate key of R

Conditions 1-3 are the same as for BCNF.

In contrast to BCNF, some redundancy is possible with 3NF. This normal form is a compromise, needed when no dependency-preserving decomposition into BCNF exists.


Example: are these relations in BCNF? In 3NF?

R(ABCD): A → B ; B → A ; A → D ; D → B

R(ABCD): AB → C ; BCD → A ; D → A ; B → C

R(ABCD): AC → D ; D → A ; D → C ; D → B
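The two definitions can be tested mechanically on the three relations above. The sketch below checks only the given FDs, which is sufficient when testing the original relation against its own FD set. All names are illustrative, and candidate keys are found by brute force, so this is only suitable for small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attrs, fds):
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            s = set(combo)
            if not any(k <= s for k in keys) and closure(s, fds) >= set(attrs):
                keys.append(s)
    return keys

def is_bcnf(attrs, fds):
    """Every FD is trivial or has a superkey on the left."""
    return all(set(rhs) <= set(lhs) or closure(lhs, fds) >= set(attrs)
               for lhs, rhs in fds)

def is_3nf(attrs, fds):
    """As BCNF, but also allow right-side attributes that belong to some key."""
    prime = set().union(*candidate_keys(attrs, fds))
    return all(closure(lhs, fds) >= set(attrs) or (set(rhs) - set(lhs)) <= prime
               for lhs, rhs in fds)

examples = [
    [("A", "B"), ("B", "A"), ("A", "D"), ("D", "B")],
    [("AB", "C"), ("BCD", "A"), ("D", "A"), ("B", "C")],
    [("AC", "D"), ("D", "A"), ("D", "C"), ("D", "B")],
]
for fds in examples:
    print(is_bcnf("ABCD", fds), is_3nf("ABCD", fds))
```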


Decomposition into BCNF

Algorithm BCNFDecomposition

Input: Relation R, a set of FDs S that hold in R.

Output: A decomposition of R into a set of relations, all of which are in BCNF.

Let R be a relation schema, S be the set of FDs given to hold over R. We decompose R by considering FDs that violate BCNF.

1. Check whether R is in BCNF. If so, return {R}.

2. Otherwise, let A1A2…An → B1B2…Bm be an FD that violates BCNF.

   2.1. Use AttributeClosure to compute {A1,A2,…,An}+

   2.2. Decompose R into R1 = {A1,A2,…,An}+ and R2 = R − {B1,B2,…,Bm}

   2.3. Use ProjectFDs to compute FDs of R1 and R2

   2.4. Recursively decompose R1 and R2 using BCNFDecomposition


Normalization: more examples

R(ABCD): AB → C ; D → B ; AC → D

(a) list candidate keys of R

(b) does the FD AD → B follow from the set of FDs above?

(c) is R in BCNF? is it in 3NF?


Normalization: more examples

R(ABCD): A → B ; B → D ; AD → C ; BC → A

Decompose R into BCNF. Show keys, projected FDs.

2 candidate keys: A and BC; thus BCNF is violated by B → D

R(ABCD)   Keys: A, BC   FDs: A → B ; B → D ; AD → C ; BC → A

R1(ABC)   Keys: A, BC   FDs: A → BC ; BC → A

R2(BD)    Key: B        FD: B → D

Is the decomposition dependency preserving? - yes


Normalization: more examples

R(ABCD)   A → B ; B → D ; AD → C ; BC → A

2 candidate keys: A and BC; thus BCNF is violated by B → D

Compute the minimal basis of the original set of FDs

A → B ; B → D ; A → C ; BC → A

Decompose R into 3NF. Clearly mark all candidate keys.

R1(AB), R2(AC), R3(BD), R4(BCA).
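The minimal basis can be computed in three passes: split right-hand sides into single attributes, drop extraneous left-hand attributes, then drop FDs implied by the rest. A minimal Python sketch, assuming FDs are (lhs, rhs) pairs of attribute strings; `minimal_basis` is a hypothetical helper name, and a set of FDs may have more than one minimal basis.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_basis(fds):
    # Pass 1: one attribute on each right-hand side.
    basis = [(frozenset(lhs), b) for lhs, rhs in fds for b in rhs]
    # Pass 2: remove left-hand attributes that are not needed.
    changed = True
    while changed:
        changed = False
        for i, (lhs, b) in enumerate(basis):
            for a in sorted(lhs):
                smaller = lhs - {a}
                if smaller and b in closure(smaller, basis):
                    basis[i] = (smaller, b)
                    changed = True
                    break
            if changed:
                break
    basis = list(dict.fromkeys(basis))  # drop duplicates, keep order
    # Pass 3: remove FDs that follow from the remaining ones.
    for fd in list(basis):
        rest = [g for g in basis if g != fd]
        if fd[1] in closure(fd[0], rest):
            basis = rest
    return [("".join(sorted(lhs)), b) for lhs, b in basis]


fds = [("A", "B"), ("B", "D"), ("AD", "C"), ("BC", "A")]
print(minimal_basis(fds))
```

Here the only change is that D is dropped from the left side of AD → C, because A alone already determines D through A → B and B → D.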


Decompose into BCNF

R(ABCD)   A → C ; B → A ; A → D ; AD → C

B is the only key. Thus, A → C ; A → D ; AD → C all violate BCNF. Let's decompose on A → C.

We end up with R1(AC) and R2(ABD). R1 is in BCNF, since the only FD that holds there is A → C, the FD on which we decomposed. To see whether R2 is in BCNF we need to project FDs of R onto R2, compute the keys of R2 and see whether any FDs violate BCNF.

To project FDs of R onto R2, we compute closures of all subsets of ABD w.r.t. the FDs of R. This gives: {A}+ = {ACD} = {AD}+, {B}+ = {ABCD}, {D}+ = {D}. There is no need to check any supersets of B, since B is already a candidate key. Now, given these closures, we see that R2 has the following non-trivial FDs: A → D, B → A, B → D. So, R2 is not in BCNF: the FD A → D violates BCNF.

Decomposing R2(ABD) on A → D gives R3(AB) and R4(AD). Both are in BCNF. The final decomposition is as follows:

R1(AC), with FD A → C; R3(AB), with FD B → A; R4(AD), with FD A → D.
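The projection of FDs onto R2(ABD) can be reproduced with a short script. A minimal Python sketch, assuming FDs are (lhs, rhs) pairs of attribute strings; `project_fds` is a hypothetical helper that reports only FDs with a minimal left side, which is one reasonable reading of "non-trivial FDs" here.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def project_fds(sub, fds):
    """Non-trivial FDs X → a that hold inside sub, with X minimal."""
    sub = set(sub)
    projected = []
    for size in range(1, len(sub)):
        for combo in combinations(sorted(sub), size):
            x = set(combo)
            for a in sorted((closure(x, fds) & sub) - x):
                # Skip X → a when dropping one attribute of X still gives a.
                if not any(a in closure(x - {y}, fds) for y in x):
                    projected.append(("".join(combo), a))
    return projected


fds = [("A", "C"), ("B", "A"), ("A", "D"), ("AD", "C")]
print(project_fds("ABD", fds))
```

The closures are always computed with respect to the FDs of the original relation R and only then restricted to the attributes of the sub-relation.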


Decomposition into 3NF

Algorithm 3NFSynthesisDecomposition

Input: Relation R, a set of FDs S that hold in R.

Output: A decomposition of R into a set of relations, all of which are in 3NF.

Let R be a relation schema, S be the set of FDs given to hold over R. We decompose R by considering FDs that violate 3NF.

1. Check whether R is in 3NF. If so, return {R}.

2. Find a minimal basis for S, say T.

3. For each FD in T of the form A1A2…An → B1B2…Bm create a relation A1A2…AnB1B2…Bm and add it to the decomposition

4. If none of the relations from Step 3 is a superkey for R, add another relation to the decomposition, whose schema is a key for R
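Steps 3 and 4 can be sketched in Python. This is a minimal sketch that assumes the minimal basis T has already been computed and is passed in as (lhs, rhs) pairs of attribute strings; `find_key` and `synthesize_3nf` are hypothetical helper names.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_key(rel, fds):
    """Shrink the full attribute set until no attribute can be removed."""
    key = set(rel)
    for a in sorted(rel):
        if closure(key - {a}, fds) >= set(rel):
            key.discard(a)
    return key


def synthesize_3nf(rel, basis):
    """Step 3: one relation per FD. Step 4: add a key if none is covered."""
    schemas = []
    for lhs, rhs in basis:
        schema = set(lhs) | set(rhs)
        if schema not in schemas:
            schemas.append(schema)
    if not any(closure(s, basis) >= set(rel) for s in schemas):
        schemas.append(find_key(rel, basis))
    return schemas


# Minimal basis C → B ; A → B ; CD → A over R(ABCD)
basis = [("C", "B"), ("A", "B"), ("CD", "A")]
print(["".join(sorted(s)) for s in synthesize_3nf("ABCD", basis)])
```

For this basis the relation built from CD → A already contains the key CD, so Step 4 adds nothing.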


Normalization: more examples

R(ABCD)   C → B ; A → B ; CD → A ; BCD → A

Decompose R into 3NF. Show keys.

First, we compute candidate keys for R. Since no FDs have either C or D on the right, both these attributes must be part of a candidate key. In fact, {CD} is the only candidate key of R, since {CD}+ = {ABCD}. R is not in 3NF, since FDs C → B and A → B violate this normal form.

To find a 3NF decomposition, we compute the minimal basis of the set of FDs. To do this, we observe that the last FD, with BCD on the left, can be dropped, since it is redundant with the FD that has CD on the left.

We create a 3NF decomposition with relations R1(CB), R2(AB) and R3(CDA). Since R3 is a superkey for R, we don't need to add any more relations to the decomposition, and we are done.


ER modeling

Draw an ER diagram that encodes the following business rules. Clearly mark all key and participation constraints.

Chefs work at restaurants. A chef is uniquely identified by an SSN, and is also described by a name and a cuisine in which she specialized. A restaurant is uniquely identified by a combination of name and city. Each chef works in at least one restaurant, and each restaurant must have at least one chef working at it. Some chefs own restaurants, and if a chef owns a restaurant - she is its sole owner.



ER to relational

[ER diagram: entity sets CHEFS (ssn, name, cuisine) and RESTAURANTS (name, city), connected by relationship sets work_at and own]
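One possible translation of the chefs and restaurants rules into tables is sketched below, using Python's sqlite3 module so that it can be run directly. Everything beyond the entity, attribute and relationship names is an assumption: the column names `owner_ssn`, `rname` and `rcity` are made up, and because each restaurant has at most one owner, the own relationship is folded into Restaurants as a nullable foreign key. The "at least one" participation constraints on work_at are not enforced by these declarations.

```python
import sqlite3


def make_restaurant_db():
    """Create the three tables in an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only if asked
    conn.executescript("""
        CREATE TABLE Chefs (
            ssn     TEXT PRIMARY KEY,
            name    TEXT,
            cuisine TEXT
        );
        CREATE TABLE Restaurants (
            name      TEXT,
            city      TEXT,
            owner_ssn TEXT REFERENCES Chefs (ssn),  -- NULL if no chef owns it
            PRIMARY KEY (name, city)
        );
        CREATE TABLE Work_at (
            ssn   TEXT REFERENCES Chefs (ssn),
            rname TEXT,
            rcity TEXT,
            PRIMARY KEY (ssn, rname, rcity),
            FOREIGN KEY (rname, rcity) REFERENCES Restaurants (name, city)
        );
    """)
    return conn


conn = make_restaurant_db()
conn.execute("INSERT INTO Chefs VALUES ('111', 'Ann', 'Thai')")
conn.execute("INSERT INTO Restaurants VALUES ('Basil', 'Philadelphia', '111')")
conn.execute("INSERT INTO Work_at VALUES ('111', 'Basil', 'Philadelphia')")
```

Keeping the owner inside Restaurants avoids a separate Own table and makes "sole owner" hold by construction, since a row has room for only one owner_ssn.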


ER to relational

[ER diagram with labels: PRESIDENTS, name, running_mate, VICE_PRESIDENTS, name, party]


Binary vs. ternary relationship sets

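[Two ER diagrams. Ternary version: PRESIDENT (name), VICE_PRESIDENT (name) and Party (name), connected by running_mate. Binary version: PRESIDENT (name) and VICE_PRESIDENT (name), connected by running_mate, with party as an attribute.]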


And now with constraints

[The same two ER diagrams, binary (party as an attribute) and ternary (Party as an entity set with name), now annotated with constraints]


Candidate keys, superkeys

Consider a relation schema and business rules below.

Dancers (name: string, dob: date, stage_name: string, company: string)

• No two dancers have the same combination of name and date of birth (dob).

• No two dancers have the same combination of stage name and company.

• A name, a dob and a stage name have to be specified for each dancer, but not all dancers belong to a company.

What are the candidate keys?

Which of these would be appropriate for a primary key?

Which are not appropriate for a primary key?

What are the superkeys?

Write a valid create table statement.
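One way to write the create table statement is sketched below, wrapped in Python's sqlite3 module so that it can be run and tested. It assumes that (name, dob) is chosen as the primary key and that (stage_name, company) is declared unique but is not used as the primary key, because company may be missing; the column types are SQLite's and may differ in other systems.

```python
import sqlite3


def make_dancers_db():
    """Create the Dancers table in an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Dancers (
            name       TEXT NOT NULL,
            dob        DATE NOT NULL,
            stage_name TEXT NOT NULL,
            company    TEXT,                 -- may be NULL: no company
            PRIMARY KEY (name, dob),
            UNIQUE (stage_name, company)
        )
    """)
    return conn


conn = make_dancers_db()
conn.execute("INSERT INTO Dancers VALUES ('Ann Lee', '1990-01-01', 'Star', NULL)")
try:
    # Same (name, dob) as the first row, so the primary key rejects it.
    conn.execute("INSERT INTO Dancers VALUES ('Ann Lee', '1990-01-01', 'Nova', 'ABT')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

A primary key cannot contain NULL, which is why a candidate key that includes company is a poor choice for the primary key in this schema.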


Relational algebra and SQL

Flights (flno: int, origin: string, destination: string, dist: int, departs: date, arrives: date)

Aircraft (aid: int, aname: string, range: int)

Employees (eid: int, ename: string, salary: int)

Certified (eid: int, aid: int)

(a) List eids of pilots certified to fly Boeing.

(b) List names of pilots certified to fly Boeing.
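Possible SQL answers to (a) and (b) are sketched below, run through Python's sqlite3 module on a few hypothetical rows. The sketch assumes that "Boeing" aircraft are those whose aname starts with Boeing, and it quotes the column name "range" to keep it from being read as a keyword. The comments give one matching relational algebra expression with the selection pushed down to Aircraft.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Aircraft  (aid INTEGER PRIMARY KEY, aname TEXT, "range" INTEGER);
    CREATE TABLE Employees (eid INTEGER PRIMARY KEY, ename TEXT, salary INTEGER);
    CREATE TABLE Certified (eid INTEGER, aid INTEGER, PRIMARY KEY (eid, aid));
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Aircraft  VALUES (1, 'Boeing 747', 8000), (2, 'Airbus A320', 3500),
                                 (3, 'Boeing 737', 3000);
    INSERT INTO Employees VALUES (10, 'Ann', 90000), (11, 'Bob', 70000), (12, 'Cy', 60000);
    INSERT INTO Certified VALUES (10, 1), (10, 3), (11, 2), (12, 2), (12, 3);
""")

# (a)  π eid (Certified ⋈ σ aname LIKE 'Boeing%' (Aircraft))
eids = [row[0] for row in conn.execute("""
    SELECT DISTINCT C.eid
    FROM Certified C, Aircraft A
    WHERE C.aid = A.aid AND A.aname LIKE 'Boeing%'
    ORDER BY C.eid
""")]

# (b)  π ename (Employees ⋈ Certified ⋈ σ aname LIKE 'Boeing%' (Aircraft))
names = [row[0] for row in conn.execute("""
    SELECT DISTINCT E.ename
    FROM Employees E, Certified C, Aircraft A
    WHERE E.eid = C.eid AND C.aid = A.aid AND A.aname LIKE 'Boeing%'
    ORDER BY E.ename
""")]

print(eids, names)
```

Query (a) does not need Employees at all, while (b) needs it only to look up names.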


Relational algebra and SQL


(c) List names of aircraft that can be used on non-stop flights from Bonn to Madras.

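A possible SQL answer to (c), run through Python's sqlite3 module on hypothetical rows. The sketch assumes that an aircraft "can be used" on a flight when its range is at least the flight's dist, and that it is enough for this to hold for some Bonn to Madras flight.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Flights  (flno INTEGER PRIMARY KEY, origin TEXT, destination TEXT,
                           dist INTEGER, departs DATE, arrives DATE);
    CREATE TABLE Aircraft (aid INTEGER PRIMARY KEY, aname TEXT, "range" INTEGER);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Flights  VALUES (1, 'Bonn', 'Madras', 4500, '2018-06-01', '2018-06-02'),
                                (2, 'Bonn', 'Paris', 300, '2018-06-01', '2018-06-01');
    INSERT INTO Aircraft VALUES (1, 'Boeing 747', 8000), (2, 'Airbus A320', 3500),
                                (3, 'Cessna 172', 800);
""")

query = """
    SELECT DISTINCT A.aname
    FROM Aircraft A, Flights F
    WHERE F.origin = 'Bonn' AND F.destination = 'Madras'
      AND A."range" >= F.dist
"""
aircraft_names = [row[0] for row in conn.execute(query)]
print(aircraft_names)
```

In relational algebra this is a join of Aircraft with the selected Bonn to Madras flights on the condition range ≥ dist, followed by a projection on aname.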


Relational algebra and SQL


(d) Find names of pilots who can operate planes with a range greater than 3,000 miles but are not certified on any Boeing aircraft.

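A possible SQL answer to (d), run through Python's sqlite3 module on hypothetical rows. It assumes that Boeing aircraft are those whose aname starts with Boeing. The "not certified on any Boeing" part is a negation, so it cannot be written as a plain join; the sketch uses NOT IN on eid, and a set difference (EXCEPT) on eids would work as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Aircraft  (aid INTEGER PRIMARY KEY, aname TEXT, "range" INTEGER);
    CREATE TABLE Employees (eid INTEGER PRIMARY KEY, ename TEXT, salary INTEGER);
    CREATE TABLE Certified (eid INTEGER, aid INTEGER, PRIMARY KEY (eid, aid));
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Aircraft  VALUES (1, 'Boeing 747', 8000), (2, 'Airbus A340', 7000),
                                 (3, 'Cessna 172', 800);
    INSERT INTO Employees VALUES (10, 'Ann', 90000), (11, 'Bob', 70000), (12, 'Cy', 60000);
    INSERT INTO Certified VALUES (10, 1), (10, 2), (11, 2), (12, 3);
""")

query = """
    SELECT DISTINCT E.ename
    FROM Employees E, Certified C, Aircraft A
    WHERE E.eid = C.eid AND C.aid = A.aid AND A."range" > 3000
      AND E.eid NOT IN (SELECT C2.eid
                        FROM Certified C2, Aircraft A2
                        WHERE C2.aid = A2.aid AND A2.aname LIKE 'Boeing%')
"""
pilot_names = [row[0] for row in conn.execute(query)]
print(pilot_names)
```

The difference is taken on eid rather than on ename, so two pilots who share a name are not confused with each other.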


SQL


(e) List eids of pilots certified to fly exactly 3 aircraft.

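A possible SQL answer to (e) needs only the Certified table and a group by / having clause. The sketch below runs it through Python's sqlite3 module on hypothetical rows, assuming (eid, aid) is the key of Certified so that COUNT(*) counts distinct aircraft.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Certified (eid INTEGER, aid INTEGER, PRIMARY KEY (eid, aid));
    -- Hypothetical sample rows: pilot 10 has 3 aircraft, 11 has 2, 12 has 4.
    INSERT INTO Certified VALUES (10, 1), (10, 2), (10, 3),
                                 (11, 1), (11, 2),
                                 (12, 1), (12, 2), (12, 3), (12, 4);
""")

query = """
    SELECT C.eid
    FROM Certified C
    GROUP BY C.eid
    HAVING COUNT(*) = 3
"""
exactly_three = [row[0] for row in conn.execute(query)]
print(exactly_three)
```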


SQL


(f) List aids of aircraft that can be used on flight AF007, along with an average salary of pilots who are certified to operate these aircraft.

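A possible SQL answer to (f), run through Python's sqlite3 module on hypothetical rows. Two assumptions are made: an aircraft can be used on a flight when its range is at least the flight's dist, and flno is stored as text here because the question uses the label AF007 even though the schema lists flno as int. Aircraft with no certified pilots do not appear in the result of this join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Flights   (flno TEXT PRIMARY KEY, origin TEXT, destination TEXT,
                            dist INTEGER, departs DATE, arrives DATE);
    CREATE TABLE Aircraft  (aid INTEGER PRIMARY KEY, aname TEXT, "range" INTEGER);
    CREATE TABLE Employees (eid INTEGER PRIMARY KEY, ename TEXT, salary INTEGER);
    CREATE TABLE Certified (eid INTEGER, aid INTEGER, PRIMARY KEY (eid, aid));
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Flights   VALUES ('AF007', 'Paris', 'New York', 5800,
                                  '2018-06-01', '2018-06-01');
    INSERT INTO Aircraft  VALUES (1, 'Boeing 747', 8000), (2, 'Airbus A340', 7000),
                                 (3, 'Cessna 172', 800);
    INSERT INTO Employees VALUES (10, 'Ann', 100000), (11, 'Bob', 80000), (12, 'Cy', 60000);
    INSERT INTO Certified VALUES (10, 1), (11, 1), (11, 2), (12, 3);
""")

query = """
    SELECT A.aid, AVG(E.salary)
    FROM Flights F, Aircraft A, Certified C, Employees E
    WHERE F.flno = 'AF007' AND A."range" >= F.dist
      AND C.aid = A.aid AND E.eid = C.eid
    GROUP BY A.aid
    ORDER BY A.aid
"""
rows = conn.execute(query).fetchall()
print(rows)
```

All four tables are needed here: Flights for the distance, Aircraft for the range, Certified to link pilots to aircraft, and Employees for the salaries.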


When writing queries

• For relational algebra, do worry about efficiency: avoid Cartesian product whenever possible, push selections

• For SQL, do worry about efficiency and readability:

  • avoid nested queries if your query can be expressed with a join

  • use group by / having as appropriate, not a subquery and a where clause in the outer query

  • use standard notation, like we covered in class, e.g., no need to write "inner join", and do write your queries by hand

• For both SQL and relational algebra: do not join with relations unnecessarily. You should have exactly the right number of tables in the from clause of a SQL query, no more no less