introduction zachary g. ives university of pennsylvania cis 550 – database & information...

34
Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 9, 2004 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Post on 19-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Introduction

Zachary G. IvesUniversity of Pennsylvania

CIS 550 – Database & Information Systems

September 9, 2004

Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

2

Welcome to CIS 550, Database and Information Systems!

Instructor: Zachary Ives, zives@cis 576 Levine Hall North Office hours: Tuesday, 3:00-4:00PM (after class)

TA: T.J. Green, tjgreen@cis Office hours: Thursday, 3:00-4:00PM

Newsgroup: upenn.cis.cis550Home page: www.seas.upenn.edu/~zives/cis550/

Texts and readings: Ramakrishnan & Gerke, Database Systems, 3rd ed. Supplementary papers (to be handed out in class) Other books may be useful, esp. Brundage’s Using

XQuery

3

Course Format and Grading

Roughly one major topic area per week to two weeks Readings in the text & research papers Occasionally, summaries/commentary on papers (5%) Homework assignment for each topic area (30%)

One midterm (10%), one final exam (20%) Project (30%) – groups of 3-4:

Build a “GMail”/Hotmail clone on top of a database, or Build a P2P system for synchronizing tables (Or propose your own idea)

General participation, discussion, intangibles (5%)

4

Why This Course?

Most CS courses concentrate on code – our interest is managing and representing data

Warning: this course doesn’t focus on teaching SQL or how to be an Oracle DBA (though it will get you started)

… So what in the world are we studying for 14 weeks???

5

What Do We Do with Data?

6

Some Ways to Represent Information

7

Example: An Encyclopedia Entry(www.wikipedia.com)

A database is an information set with a regular structure. Its front-end allows data access, searching and sorting routines. Its back-end affords data inputting and updating. A database is usually but not necessarily stored in some machine-readable format accessed by a computer. There are a wide variety of databases, from simple tables stored in a single file to very large databases with many millions of records, stored in rooms full of disk drives or other peripheral electronic storage devices.

Databases resembling modern versions were first developed in the 1960s. A pioneer in the field was Charles Bachman.

The most useful way of classifying databases is by the programming model associated with the database. Several models have been in wide use for some time. Historically, the hierarchical model was implemented first, then the network model, then the relational model overcame with the so-called flat model accompanying it for low-end usage…

8

Example: To-Do List

Buy school supplies due 9/7 Go to orientation on 9/7 Exercise every M/W/F Buy Philly postcards

How does this differ from the plain text model? What might you do with it that you couldn’t?

9

Example: Your PDA/Cell Phone

Event Day When Who WhereLunch 10/24 1pm Zack Cavanaugh’sAdvice10/25 9am Dr. Smith 599 LevineBiking 10/26 9am Jane PottruckDinner 10/26 6PM Jane Food Court

Calendar

Who Phone Email OfficeZack 6-2789 zives 576 Levine NDr. Smith 6-1234 drsmith 599 LevineJane 543-2198 jane 2220 Walnut St.

Contacts

10

What If We Want to Include Contact Info on Our Calendar?

Do we also want to keep e-mail addresses, telephone numbers etc.?

Should we expand the number of columns in our table:

Event When Who-name Who-email Who-tel …. Where

Lunch 1pm Zack zives 6-2789…. Cav……

What is the trade-off in terms of entering data?

11

“Link” Calendar with Contacts?

Why can’t we “link” calendar entries with contact info, and show the results of the two? The link could be based on something as simple as

the person's name (What’s the danger here? What else might work

better?)

This brings up an issue – how to “follow links” If we were to do this in Java, how might it be done?

12

Another Kind of Link: Classes and Subclasses

Person has attributes: ssn PennID set of user IDs given name family name …

Student IS A person who: takes courses is given grades is taught listens to lectures in class, OR over the Web, OR on videotape

This is yet another kind of information How have you previously seen such relationships encoded?

13

Data Representation and Modeling

All of the data we’ve seen have an implicit data model

The data model includes some basic assumptions about what’s an “item” of data, how to interpret it, and so on

The relational data model was the first model for data that is independent of its data structures and implementation A theory of normalization guides you in designing

relations Concepts from the relational data model have been

adapted to form object-oriented data models (with classes and subclasses), XML models, etc.

There are “sibling” fields to databases that consider: natural language models (how to understand words) document models (how to match words and documents) ontologies (how to define relationships between classes)

14

The DBMS Provides an Interfaceover the Database

A database (DB) is a large, integrated collection of data Generally is cohesive in “some” way

A DB models a real-world organization or unit A database management system (DBMS) is a

software package designed to store and manage databases Reliable storage & recovery of 100s of GB Querying/updating interface and API (for applications and

Web pages) Support for many concurrent users

Why do we need a DBMS, instead of coding in Java?

15

DBMS Benefit #1: Generality and Declarativity

Don’t require the programmer or user to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.

Instead, the programmer/user programs with a logical model in mind

The DBMS “makes it happen” based on an understanding of relative costs of different methods

16

Benefit #2: Efficiency and Scale

Size of personal address book is probably less than 100 entries, but there are things we'd like to do quickly and efficiently: “Give me all appointments on 10/28” “When am I next meeting Jim?”

“Program” these as quickly as possible (and make them resilient to data format changes)

Scale to a corporate calendar with hundreds of thousands of entries

17

Benefit #3: Management of Concurrency and Reliability

Suppose other people are allowed access to your calendar and are allowed to modify it? How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess?

Suppose the system crashes while we are changing the calendar. How do we recover our work?

This requires a basic concept…

18

Transactions

Key concept for concurrency is that of a transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry).

Key concept for recoverability is that of a log: keeping track of all actions carried out by the db.

19

The Layers of the DBMS

API/GUI

Optimizer

Storage Mgr

Exec. Engine

Storage

Catalog

Query

Physical plan

Pages

RequestsData

Pages

Stats

Schemas

(Simplification!)

Buffer Mgr

Index/file/rec MgrData/etc Requests

RequestsData/etc

Logging, recovery

Red = logicalBlue = physical

20

The Database Abstraction Provided by the DBMS

We think of databases at two levels: Logical structure:

What users/programmers see – program or query interface

Physical structure: Organization on disk, indices, etc.

The logical level is further split into: Overall database design (conceptual; seen by

the DB designer) Views that various users get to see

21

The Three-level Architecture forDatabases

View 1 View 2 … View N

Physical Level(file organization, indexing)

Schema Logical,Conceptual Level

22

Data Independence

A user of a relational database system should be able to use the database without knowing about how the precisely how data is stored, e.g.

After all, you don't worry IEEE floating-point when you do division in a Java program or with a calculator

SELECT When, WhereFROM CalendarWHERE Who = “Jane"

23

More on Data Independence

Logical data independence Protects the user from changes in the logical structure of the data:

could reorganize the calendar “schema” without changing how we query it

Physical data independenceProtects the user from changes in the physical structure of data:

could add an index on who (or sort by when) without changing how the user would write the query, but the query would execute faster (query optimization)

24

Presentation Layer (4th Tier): Data-Driven Web Sites

“Data driven web sites” also add an HTML “presentation” layer on top of what we’ve seen

Or they use XML plus “style sheets” to get the same effect

view

HTML

Processing

25

An Issue: 80% of the World’s Data is Not in a DB!

Examples: scientific data

(large images, complex programs that analyze the data) personal data WWW and email

(some of it is stored in something resembling a DBMS)

Data management is expanding to tackle these problems

Flexibility – data management imposes many constraints to make problems solvable

Must deal with entities outside our control

In this course, we’ll start by focusing on databases, but eventually look “outside the box” at the Web and at gluing together data from many places

26

Combining Databases with Mediators(a kind of middleware)

A layer above the three-tiered architecture, to combine multiple databases/sources on the Web Some of these are databases over which we have no control Some must be accessed in special ways We generally need to think about how to translate between

different database formats

XML

“Mediated Schema”

27

How Does One Build a Database?

Start with a conceptual model Design & implement schema Write applications using DBMS and other tools

Many ways of doing this where the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc.)

Common applications include PHP/JSP/servlet-driven web sites

The DBMS takes care of query optimization and execution

28

Conceptual Design

STUDENT COURSETakes

namesid cid name

PROFESSOR

Teaches

semester

fid name

29

Designing a Schema (Set of Relations)

Convert to tables +constraints

Then need to do “physical” design: the layout on disk, indices, etc.

sid name

1 Jill

2 Bo

3 Maya

fid name

1 Ives

2 Saul

8 Roth

sid cid

1 550-0103

1 700-1003

3 500-0103

cid name sem

550-0103 DB F03

700-1003 AI S03

501-0103 Arch F03

fid cid

1 550-0103

2 700-1003

8 501-0103

STUDENT Takes COURSE

PROFESSOR Teaches

30

Applications Use Queries in SQL

<html><body> <!-- hypotheticalEmbeddedSQL: SELECT *

FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid --></body></html>

Structured Query Language Based on restricted first-order logic expressions over relations Not procedural – defines constraints on the output

Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine

31

Processing the Query

SELECT * FROM STUDENT, Takes, COURSE

WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid

STUDENT

Takes COURSE

Merge

Hash

by cid by cidOptimizer

ExecutionEngine

StorageSubsystem

Web Server / UI / etc

32

DBMSs in the Real World

A huge industry for 20% of the world’s data! Big, mature relational databases

IBM, Oracle, Microsoft “Middleware” above these

SAP, PeopleSoft, dozens of special-purpose apps “Application servers” Integration and warehousing systems Current trends:

Web services; XML everywhere Smarter, self-tuning systems

33

So What about Database Research?

Not focusing on the problems of Oracle… Understanding what’s possible to do with XML

Better query processing Better languages for meta-info (e.g., constraints)

Data streams Peer-to-peer architectures Integrating data from different formats Lots of theory and systems-building

You’ll see familiar concepts in this course from operating systems and from complexity theory/logic

… And from programming languages, AI planning, …

34

In this Course...

Study relational databases, their design, how to query, what forms of indices to use.

Beyond relational algebra: a logical model of data (Datalog), recursion

XML and semi-structured data models Understanding DB internals

How DBs are built Performance implications

Integrating and mediating between databases (a huge problem today)