sep-01cs545 intro1 september 2001 gio wiederhold stanford university cs545 intro why databases?

Sep-01 CS545 Intro 1

September 2001

Gio Wiederhold, Stanford University

www-db.stanford.edu/people/gio.html

CS545 intro

Why Databases?


Abstract

The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users. Databases have been available in various forms since 1958. The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970.

Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also creates information overload.

To cope with these issues novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change.

The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies, the collections of terms and their relationships as used in the domain of discourse for the source. When sources are to be related we rely on their ontologies to make the linkages. Creating a sound algebra encompassing the required operations allows manipulation and composition of the interoperation process.


Outline

• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems
• Research Directions


Files versus Databases

Files: provide input and output for a program (transient)
• Devices: paper tape (ASCII), cards, magnetic tapes
• Examples:
1. FORTRAN: tapes 1-5 input, 5 standard in (80-column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
2. UNIX: standard in > standard out
3. Data-processing: in > program > out = in > program > out = in > program > out ....

Databases: storage (persistent, reliable, random access)
• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:
1. BOMP – Bill-of-materials (inventory), airline seats, processing


Files

• Files: a means for programs to store data for later use
– The initial program determines
1. what data are being stored (all? – memory dump [LISP])
2. how it is being stored – structure and format
3. when it is being stored and available
– successor programs must follow these decisions
• often the successor program is another invocation of the initial program
• Problems
– One program requires a different structure than another: BOMP
– Data must be available rapidly, incrementally:
• Class-assignments
• seat reservations
• library checkout
– Programs must be available continuously, depend on data


Databases

• Data are intended to be used by many programs
– Often small – transactions
– Various subsets of all the relevant data
– Structural transformations: Bill-of-Materials Programs:

Input program: records parts being delivered (Supplier :> parts)
Output program: records parts being consumed (Products :> parts)
Inventory: Suppliers, Products :> parts


BoMPs are common

  Data sources     Stuff             Data consumption
• Supplier         Parts             Product-Assemblies
• Clinical-labs    Observations      Patient-Records
• Employees        Salary & Tasks    Productivity
• Accidents        Reports           Failure-Analysis
• Flights          Seats             Passengers
• Classes          Grades            Student-Performance
• . . .

Two directions / hierarchies needed for data access: data sources and data consumption.

Solutions?


Design Problem & Solutions

Conceptual model
• Supplier program:
– Use a hierarchy: supplier :> parts supplied (1 : n)
• Consumer program:
– Use a hierarchy: consumer :> parts used (1 : m)

Actual solution in memory: a matrix; if it exceeds memory then either supplier or consumer part accesses become costly

Actual solution beyond memory:
1. redundant transformed data
2. pointer and index structures

[Figure: matrix labeled P, with suppliers s1, s2, s3, ... sn along one axis and consumers c1, c2, c3, ... cm along the other]
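The second option, pointer and index structures, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the supplier, consumer, and part names are invented, and in-memory dictionaries stand in for on-disk indexes. Each fact is stored once and indexed in both directions, so that neither the supplier hierarchy nor the consumer hierarchy requires a scan of the whole matrix.

```python
from collections import defaultdict

# Hypothetical delivery and consumption facts; all names are illustrative.
supplies = [("s1", "bolt"), ("s1", "nut"), ("s2", "bolt"), ("s3", "gear")]
uses = [("c1", "bolt"), ("c1", "gear"), ("c2", "nut"), ("c2", "bolt")]


def build_index(pairs):
    """Return (owner -> parts, part -> owners) for a list of (owner, part) pairs."""
    forward, inverse = defaultdict(set), defaultdict(set)
    for owner, part in pairs:
        forward[owner].add(part)
        inverse[part].add(owner)
    return forward, inverse


# Two access directions over the same sparse supplier/consumer matrix.
parts_by_supplier, suppliers_by_part = build_index(supplies)
parts_by_consumer, consumers_by_part = build_index(uses)


def suppliers_for_consumer(consumer):
    """Cross both hierarchies: which suppliers feed a given consumer?"""
    found = set()
    for part in parts_by_consumer[consumer]:
        found |= suppliers_by_part[part]
    return found
```

The price of the second direction is that every insertion now updates two structures, which is the update cost the later slides on indexes return to.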


Factors influencing design

• Size --- memories are getting bigger, problems too

• Density of matrix:
– suppliers supply only some parts, overlapping
– products consume only some parts, overlapping

• Performance requirements:
– supplier response can be less critical
– airline seats made available versus seats being sold
– laboratory data obtained versus patient records needed

• Usage patterns:
– batches versus single item accesses
– linked according to yet other criteria:


DBMSs

Database Management Systems
• Collection of the software needed to manage databases
• Components:
– Storage management – intertwined with the operating system
– Query and update processor – uses the schema
– Schema interpreter and compiler
– Transaction management and concurrency control/protection – also jointly with OS
– Logger for backup
– Recovery programs
• Large, complex, not all features always needed
• Many fewer vendors now than 10 years ago


Inventions – 1 - Data Description

• Schemas [McGee, 1958]: program independence
= A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
– Allows programs to use subsets
– Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
= avoids interpretation cost
– requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976]: bounded schemas
= Database administrator defines schema subset for user roles
– Can be compiled for fast execution
– Must be recompiled when base schema or view is changed
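A small sketch of program independence through a view, using Python's standard `sqlite3` module. The table, column, and view names (`parts`, `unit_cost`, `stock_clerk`) are assumptions made for illustration and do not come from the slides. A program written against the view keeps working unchanged after a column is added to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base schema: a symbolic description of each column (names are illustrative).
conn.execute("CREATE TABLE parts (part_no TEXT, supplier TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [("p1", "s1", 40), ("p2", "s2", 5)],
)

# A view: the schema subset defined for one user role.
conn.execute("CREATE VIEW stock_clerk AS SELECT part_no, qty FROM parts")


def clerk_report(connection):
    """A 'current program' that only knows the view, not the base table."""
    return connection.execute(
        "SELECT part_no, qty FROM stock_clerk ORDER BY part_no"
    ).fetchall()


before = clerk_report(conn)

# Adding a column to the base schema does not affect the program above.
conn.execute("ALTER TABLE parts ADD COLUMN unit_cost REAL")
after = clerk_report(conn)
```

Here `before` and `after` are the same rows, because the program names only the columns exposed by the view.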


Inventions – 2 – access trees

• Indexes [Landauer 1963]: balanced trees
= Efficient ancillary access path
– Requires updating to stay current
• Multiple Indexes [DavisLin 1965]: multi-attribute-based access
= Multiple ancillary access paths
– Allows access by multiple paths
– Requires much updating to stay current
• B-trees [Bayer, 1972]: index updateability
= Index blocks are kept only 50%-100% full for mostly fast update
– Improves performance greatly for indexes
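To make "ancillary access path" concrete, here is a minimal Python sketch. It uses a sorted list with the standard `bisect` module, not a B-tree, and the record contents are invented. It shows the two properties named above: lookups avoid scanning the primary storage, and every insertion must also update the index to keep it current.

```python
import bisect

records = []  # primary storage: records in arrival order
index = []    # ancillary access path: sorted (key, position) pairs


def insert(key, payload):
    """Store a record and update the index so that it stays current."""
    records.append((key, payload))
    bisect.insort(index, (key, len(records) - 1))


def lookup(key):
    """Binary-search the index, then follow positions into primary storage."""
    pos = bisect.bisect_left(index, (key, -1))
    hits = []
    while pos < len(index) and index[pos][0] == key:
        hits.append(records[index[pos][1]][1])
        pos += 1
    return hits


insert("p7", "gear")
insert("p2", "bolt")
insert("p7", "gear housing")
```

In this sketch `bisect.insort` shifts list elements on every insertion, which takes linear time. Keeping index blocks only partly full, as B-trees do, is one way to leave room so that most updates stay local.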


Inventions – 3 - structures

• Hierarchical Structures [IMS, 1963]: dense data structures
= Trees mapped to sequential structures for fast access to sparse data
– Fast access when many related values are needed
– Costly to update, often done periodically
– Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969]: arbitrary structures
= All data represented by object-attribute-value entries
– High cost when many related values are needed

Note that these two conflict – in today's database implementations performance has won out over flexibility
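The conflict can be seen in a toy Python comparison. The patient data and attribute names are invented for illustration. A nested record delivers all related values in one access, while the triple store accepts any new attribute without a schema change but must touch many entries to reassemble one object.

```python
# Hierarchical form: related values are stored together (illustrative data).
patient_record = {
    "id": "pt1",
    "name": "A. Example",
    "labs": {"glucose": 5.4, "sodium": 140},
}

# Triple form: every fact is an (object, attribute, value) entry.
triples = [
    ("pt1", "name", "A. Example"),
    ("pt1", "glucose", 5.4),
    ("pt1", "sodium", 140),
    ("pt2", "glucose", 6.1),
]


def values_for(obj, store):
    """Reassemble one object; this unindexed sketch scans every triple."""
    return {attr: val for o, attr, val in store if o == obj}


def add_attribute(obj, attr, val, store):
    """Flexibility: a new attribute needs no change to any schema."""
    store.append((obj, attr, val))


add_attribute("pt2", "allergy", "penicillin", triples)
```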


Inventions – 4 – model foodfight

• Relational Model [Codd 1970]
= tabular model, with an algebraic set of operations, normalization
– Formalization enabled understanding, dissemination
– No inter-relation semantics, specified when query is made
– Later constraints were added, implicitly defining keys, connections
• Hierarchical (also applied to one view of BOMPs)
= describe hierarchical connections among data records, no algebra
– An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
= describe structure, procedural navigation in near-arbitrarily linked data
– Strong inter-record connections, needed for locating data
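A minimal Python sketch of the "algebraic set of operations" over tables, with relations represented as lists of dictionaries. The relation and column names are assumptions for illustration, and this is not how any particular DBMS implements the algebra. Note that the connection between the two relations is stated only when the query is composed, through the shared `part` column.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Projection: keep the named columns and drop duplicate rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


def natural_join(left, right):
    """Natural join: combine rows that agree on all shared column names."""
    out = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in set(l) & set(r)):
                out.append({**l, **r})
    return out


# Illustrative relations for the bill-of-materials case.
supplies = [{"supplier": "s1", "part": "bolt"}, {"supplier": "s2", "part": "gear"}]
uses = [
    {"product": "c1", "part": "bolt"},
    {"product": "c2", "part": "bolt"},
    {"product": "c2", "part": "gear"},
]

# Which suppliers feed product c2? The operators compose like an expression.
result = project(
    select(natural_join(supplies, uses), lambda r: r["product"] == "c2"),
    ["supplier"],
)
```

Because each operator takes relations and returns a relation, expressions can be rearranged by an optimizer without changing their result, which is one reading of why formality mattered.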


Why did the relational model win?

• Relational Model DBMSes: Sequel, QUEL, SQL
– Formality – allowed essential optimization algorithms
– Restrictions – as normalization, provide guidance
– Teachability – exposed principles:
• can't teach only from examples
– DBMS independence – safety blanket for mission-critical users
• But implementations added features
• Use least common set of features?
– Hard to enforce once a system has been bought
• Few suppliers remain {ORACLE, IBM, MS, mySQL}

• ER model [Chen, 1976]
= Focuses on design, can be mapped to multiple implementations
– Few tools for direct translation
– Poor maintenance of model, ignored when DBs are expanded


Databases and the Web

• HTML presentation: HyperText Markup Language
= Data are transformed for human consumption, external refs
– Often hierarchical – object-oriented view
– If there was a schema, it is now hidden
• XML presentation
= Schema data is embedded
– Much flexibility
– Much more space when entries are small
– Requires an interpretation for viewing, such as XSLT
• RDF: Resource Description Framework
= Triple representation: object-attribute-value
– Great flexibility
– Uncertain implementation
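The XML trade-off can be illustrated with Python's standard `xml.etree.ElementTree`. The record and its field names are invented for this sketch. Because the tags carry the column names, a reader can recover the structure without a separate schema, at the cost of much more space when the values are short.

```python
import xml.etree.ElementTree as ET

# One small record; field names are illustrative.
row = {"part_no": "p1", "qty": "40"}

# XML form: schema information (the field names) is embedded as tags.
elem = ET.Element("part")
for name, value in row.items():
    ET.SubElement(elem, name).text = value
xml_text = ET.tostring(elem, encoding="unicode")

# Schema-less form of the same values, for a size comparison.
csv_text = ",".join(row.values())

# The embedded names let a reader rebuild the record without outside help.
parsed = {child.tag: child.text for child in ET.fromstring(xml_text)}

# Triple form in the object-attribute-value style that RDF uses.
triples = [("part:p1", name, value) for name, value in row.items()]
```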


Information overload Data starvation

• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– Internet & Intranets
– world-wide web
• Disintermediation
– ubiquitous publishing


Change in Supply vs Demand

What information consumes is rather obvious, it consumes the attention of its recipients.

Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

[Herbert Simon]


Making data relevant

• Data reduction
• Data abstraction
– Level changing
– Summarization
– Exception search
– Level change to integrate with other data sources
• Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm


Data and Knowledge

Information is created at the confluence of data (the state) and knowledge (the ability to select and project the state into the future).

[Figure: Knowledge Loop and Data Loop, with nodes Education, Recording, Action, Storage, State changes, Selection, Integration, Summarization, Decision-making, Abstraction, Experience]


Transforming Data to Information

• Application Layer: users at workstations
• Mediation Layer: value-added services
• Foundation Layer: data and simulation resources


Functions inside Mediation

[Figure: functions inside mediation, applied to heterogeneous resources: Selection, Summarize, Transform, Integration, articulation]


Function of Mediation

Apply Domain-specific Specialist Knowledge to add value

• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation

INFORMATION


Environmental Restoration at INEL: Undoing 50 years of messes

Idaho National Engineering Laboratory, June 1998
Lockheed Martin, ISX - Stanford Univ.

Many projects, many sources

[Figure: mediation architecture. Labels: query languages .... MQL [ISX], MSL [Stanford], OQL [ODMG]; QEM; OEM; CORBA; mediator; other mediators; wrappers; sources ERIS and IEDMS]


From Schemas to Ontologies

Ontologies allow communication among partners in enterprises (rarely in machine-readable form)

Relationships determine meaning - parent, school, company

Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.

Variable and Class names in Software

Knowledge-bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms.


Ontology: components

We represent the contents and structure of a language by its ontology:

• a set of well-defined terms, which delimit the domain of discourse

• relationships among those terms, chosen from a limited set

a formalizable subset of expert knowledge
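The two components can be sketched as a small Python class. The relation names `is_a` and `part_of` and the example terms are assumptions chosen for illustration; the slides say only that relationships are chosen from a limited set.

```python
# Assumption: an illustrative limited set of relationship kinds.
ALLOWED_RELATIONS = {"is_a", "part_of"}


class Ontology:
    """Terms that delimit a domain, plus relationships from a limited set."""

    def __init__(self):
        self.terms = set()
        self.relations = set()

    def add_term(self, term):
        self.terms.add(term)

    def relate(self, subject, relation, obj):
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"relation {relation!r} is not in the limited set")
        if subject not in self.terms or obj not in self.terms:
            raise KeyError("both terms must be defined before they are related")
        self.relations.add((subject, relation, obj))

    def ancestors(self, term, relation="is_a"):
        """Follow one relation transitively from a term."""
        found, frontier = set(), [term]
        while frontier:
            current = frontier.pop()
            for s, r, o in self.relations:
                if s == current and r == relation and o not in found:
                    found.add(o)
                    frontier.append(o)
        return found


onto = Ontology()
for t in ("brad", "nail", "fastener"):
    onto.add_term(t)
onto.relate("brad", "is_a", "nail")
onto.relate("nail", "is_a", "fastener")
```

Restricting the relationship kinds is what keeps such a structure formalizable: a program can reason over `is_a` chains, which it could not do over free-text descriptions.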


Heterogeneity among Domains

If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct

Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontologies


Unsolved problem in Interoperation

Common assumption in assembling and integrating distributed information resources

• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language

This assumption is provably false.

Working towards the goal of global consistency is

1. naïve -- the goal cannot be achieved

2. inefficient -- languages are efficient in local contexts


Large Ontologies: good or bad?

Have all the knowledge together
+ simple for customers of KBs
– hard for owners of KBs, must synchronize with many others
– in the limit -- everybody must be globally consistent

A large KB will cover multiple / all domains
• created by a committee -- slow
• maintained by a committee -- costly to impossible

Differences in level of abstraction -- efficiency
• homeowner: nail
• carpenter: sinker, brad, box nail, . . .
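The nail example suggests keeping each vocabulary local and linking the two with an explicit mapping, instead of forcing one global set of terms. The Python sketch below is an illustration under that assumption; the mapping table and function names are hypothetical.

```python
# Hypothetical articulation between two local vocabularies.
carpenter_to_homeowner = {
    "sinker": "nail",
    "brad": "nail",
    "box nail": "nail",
}


def generalize(term):
    """Carpenter's term -> homeowner's term; unknown terms pass through."""
    return carpenter_to_homeowner.get(term, term)


def specialize(term):
    """Homeowner's term -> every carpenter's term that it covers."""
    return sorted(t for t, g in carpenter_to_homeowner.items() if g == term)
```

Generalizing loses detail and specializing returns several candidates, so neither vocabulary can replace the other; each stays efficient in its own context.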


Evolution of mediation

[Figure: evolution of mediation in stages a. to e. Layers: datasources (D1-D6), wrappers (W1-W3), mediators (M1, M2), integrators (I1, I2), applications (A1-A6), network]


Definition*

A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.

It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.

* Wiederhold: IEEE Computer March 1992


Interfaces

• Application – Mediator: {OQL, KQML, ...}
• Mediator – Data sources: {SQL, TQL, XML, … }
• Data – real world: {sensors, clerks, … }
• Human – Computer: {x-widgets, HTML}


An Integration Architecture

[Figure: an integration architecture. Labels: Client Application; business reports; Mediator; portfolios for each company; stock market prices; Wrapper, Wrapper; TickerTape, Dialog]
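A minimal Python sketch of this arrangement follows. All class names, input formats, and figures are assumptions made for illustration and do not describe the actual system. Each wrapper hides the format of one source, and the mediator applies the domain knowledge (value = shares × price) to hand the client a result at a higher level than either source provides.

```python
class TickerWrapper:
    """Hypothetical wrapper: turns raw 'SYMBOL PRICE' lines into a dict."""

    def __init__(self, raw_lines):
        self._raw = raw_lines

    def prices(self):
        out = {}
        for line in self._raw:
            symbol, price = line.split()
            out[symbol] = float(price)
        return out


class PortfolioWrapper:
    """Hypothetical wrapper over (company, symbol, shares) rows."""

    def __init__(self, rows):
        self._rows = rows

    def holdings(self, company):
        return {sym: n for owner, sym, n in self._rows if owner == company}


class Mediator:
    """Combines both sources into information for the client application."""

    def __init__(self, ticker, portfolios):
        self.ticker = ticker
        self.portfolios = portfolios

    def portfolio_value(self, company):
        prices = self.ticker.prices()
        holdings = self.portfolios.holdings(company)
        return sum(n * prices[sym] for sym, n in holdings.items())


mediator = Mediator(
    TickerWrapper(["IBM 100.0", "ORCL 20.0"]),
    PortfolioWrapper([("acme", "IBM", 10), ("acme", "ORCL", 5), ("zeta", "IBM", 1)]),
)
```

If the price feed changes its format, only `TickerWrapper` needs to change; the mediator and the client application are unaffected.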


Status of Mediation Technology

Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update

Future
• Generated from models
• Domain Expert maintains models
• Specification determines functions
• Resource changes trigger regeneration


A mediator is not static software: Knowledge ages

[Figure: a mediator between its Application Interface and its Resource Interfaces, holding models, programs, rules, caches, . . . (software & people). Sources of change: changes of user needs, domain changes, resource changes. Roles: Owner / Creator, Maintainer, Lessor - Seller, Advertiser]


Domain Specialization

• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)

to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size

Empowerment: autonomously maintainable

* based on experience with software


Roles

Computer Scientists
• Provide tools
– adaptation
– integration
– matching
– composing
• Assess Standards
• Assure scalability

Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
• Provide feedback


Mediation Research Topics

• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
– many levels and types
– roles for wrappers vs. mediators vs. applications
– scalability by partitioning -- make it simple!
– Domain Ontologies --- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
• New types of information systems


Integration Science

[Figure: Integration Science draws on Artificial Intelligence (knowledge mgmt, domain expertise, uncertainty), Systems Engineering (analysis, documentation, costing), and Databases (access, storage, algebras). Other labels: Long Range Science Vision, Integration Methods, GIS]


Fat versus thin mediators

[Figure: mediators placed by domain scope and service scope, with "Just right" marked]

• Too broad: hard to maintain, needs a committee
• Too thin: insufficient added value
• Too fat: hard to compose
• Too narrow: few customers


Maintenance is good for you

[Figure: relative annual maintenance cost (0% to 100%) against lifetime in years (1 to 13?) for automobile, hardware, and software; depreciation = 1 / lifetime]


Client-Server Architecture

Client system

data and simulation resources

Fast build of clients by resource reuse

Changes (X) are difficult, can affect many clients


Systems with Mediators

Applications . . . .

Mediators . . . . . .

Data Resources . . .

Gio Wiederhold. 1995


Growth through Reuse

New Application

Prior & Revised Mediators

Extended Data Resources

Gio Wiederhold. 1995


Linear O(n) Cost of Growth -- now O(n^2)

• Data changes only affect some mediators; only in their domain

• Mediators can

1. supply old information to n-1 prior applications

2. provide better information to the new application

3. be partially or completely reused

• New applications, using the new data, can be developed and inserted dynamically



Assigning maintenance responsibility

a. Source data quality – supplier database, files, or web pages

b. Interface to the source – wrapper, supplier or vendor for supplier

c. Source selection – expert specialist in mediator

d. Source quality assessment – customer input to mediator

e. Semantic interoperation – specialist group providing input to the mediator

f. Consistency and metadata information – mediator service operation or warehouse

g. Informal, pragmatic integration – client services with customer input

h. User presentation formats – client services with customer input

[Figure labels: Sources, Services, Customers]
