Post on 19-Dec-2015
Sep-01 CS545 Intro 1
September 2001
Gio Wiederhold, Stanford University
www-db.stanford.edu/people/gio.html
CS545 intro
Why Databases?
Sep-01 CS545 Intro 2
Abstract
The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users. Databases have been available in various forms since 1958. The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970.
Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received. Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also creates information overload. To cope with these issues novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change.
The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies, the collection of terms and their relationships as used in the domain of discourse for the source. When sources are to be related we rely on their ontologies to make the linkages. Creating a sound algebra encompassing the required operations allows manipulation and composition of the interoperation process.
Sep-01 CS545 Intro 3
Outline
• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems
• Research Directions
Sep-01 CS545 Intro 4
Files versus Databases
Files: provide input and output for a program (transient)
• Devices: Paper tape (ASCII), Cards, Magnetic Tapes
• Examples:
1. FORTRAN: tapes 1-5 input, 5 standard in (80-column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
2. UNIX: standard in > standard out
3. Data-processing: in > program > out = in > program > out = in > program > out ....
Databases: storage (persistent, reliable, random access)
• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:
1. BOMP – Bill-of-materials (inventory), airline seats, processing
Sep-01 CS545 Intro 5
Files
• Files: a means for programs to store data for later use
– The initial program determines
1. what data are being stored (all? – memory dump [LISP])
2. how it is being stored – structure and format
3. when it is being stored and available
– Successor programs must follow these decisions
• often the successor program is another invocation of the initial program
• Problems
– One program requires a different structure than another: BOMP
– Data must be available rapidly, incrementally:
• Class-assignments
• seat reservations
• library checkout
– Programs must be available continuously, depend on data
Sep-01 CS545 Intro 6
Databases
• Data are intended to be used by many programs
– Often small – transactions
– Various subsets of all the relevant data
– Structural transformations: Bill-of-Materials Programs:
Input program: records parts being delivered (Supplier :> parts)
Output program: records parts being consumed (Products :> parts)
Inventory: Suppliers, Products :> parts
Sep-01 CS545 Intro 7
BoMPs are common
• Supplier | Parts | Product-Assemblies
• Clinical-labs | Observations | Patient-Records
• Employees | Salary & Tasks | Productivity
• Accidents | Reports | Failure-Analysis
• Flights | Seats | Passengers
• Classes | Grades | Student-Performance
• . . .
Two directions / hierarchies needed for data access:
Data sources Data consumption
Solutions?
Stuff
Sep-01 CS545 Intro 8
Design Problem & Solutions
Conceptual model
• Supplier program:
– Use a hierarchy: supplier :> parts supplied (1:n)
• Consumer program:
– Use a hierarchy: consumer :> parts used (1:m)
Actual solution in memory: Matrix. If it exceeds memory, then either supplier or consumer part accesses become costly
Actual solution beyond memory:
1. redundant transformed data
2. pointer and index structures
[Figure: matrix P with supplier columns s1, s2, s3, ..., sn and consumer rows c1, c2, c3, ..., cm]
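A minimal sketch of the "pointer and index structures" option, assuming made-up supplier, product, and part names. It keeps both directions of each hierarchy as an index, so neither supplier-side nor consumer-side access needs a scan of the sparse matrix. It illustrates the idea only; it is not the structure used by any particular BOMP system.

```python
from collections import defaultdict

# Hypothetical sample data: (supplier, part) and (product, part) pairs.
supplies = [("s1", "bolt"), ("s1", "nut"), ("s2", "bolt"), ("s3", "gear")]
uses = [("c1", "bolt"), ("c1", "gear"), ("c2", "nut")]

def build_index(pairs):
    """Return (owner -> parts, part -> owners) for a list of (owner, part) pairs."""
    forward, inverse = defaultdict(set), defaultdict(set)
    for owner, part in pairs:
        forward[owner].add(part)
        inverse[part].add(owner)
    return forward, inverse

parts_by_supplier, suppliers_by_part = build_index(supplies)
parts_by_product, products_by_part = build_index(uses)

def suppliers_for_product(product):
    """Cross both hierarchies: which suppliers feed a given product?"""
    result = set()
    for part in parts_by_product[product]:
        result |= suppliers_by_part[part]
    return result

print(sorted(suppliers_for_product("c1")))
```

Only the non-empty cells of the matrix are stored, at the price of keeping two indexes current on every update.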
Sep-01 CS545 Intro 9
Factors influencing design
• Size --- memories are getting bigger, problems too
• Density of matrix:
– suppliers supply only some parts, overlapping
– products consume only some parts, overlapping
• Performance requirements:
– supplier response can be less critical
– airline seats made available versus seats being sold
– laboratory data obtained versus patient records needed
• Usage patterns:
– batches versus single item accesses
– linked according to yet other criteria:
Sep-01 CS545 Intro 10
DBMSs
Database Management Systems
• Collection of the software needed to manage databases
• Components:
– Storage management – intertwined with the operating system
– Query and update processor – uses the schema
– Schema interpreter and compiler
– Transaction management and concurrency control/protection – also jointly with OS
– Logger for backup
– Recovery programs
• Large, complex, not all features always needed
• Many fewer vendors now than 10 years ago
Sep-01 CS545 Intro 11
Inventions – 1 - Data Description
• Schemas [McGee, 1958]: program independence
= A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
– Allows programs to use subsets
– Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
= Avoids interpretation cost
– Requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976]: bounded schemas
= Database administrator defines schema subset for user roles
– Can be compiled for fast execution
– Must be recompiled when base schema or view is changed
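A small sketch of schema-based program independence, assuming a hypothetical parts table and a hypothetical "clerk" view. The program fetches columns by name through the schema, so adding a column leaves it unaffected. The schema is interpreted on every read here; a compiled schema would precompute the position table.

```python
# Hypothetical schema: a symbolic description of each column.
schema = ["part_no", "name", "supplier", "unit_cost"]
rows = [
    (101, "bolt", "s1", 0.10),
    (102, "gear", "s3", 2.50),
]

def read(row, schema, wanted):
    """Fetch columns by name, so the program never depends on column positions."""
    position = {column: i for i, column in enumerate(schema)}
    return {column: row[position[column]] for column in wanted}

# A view: the subset of the schema that one user role may see.
clerk_view = ["part_no", "name"]

before = [read(r, schema, clerk_view) for r in rows]

# Adding a column changes the schema and the stored rows, not the program.
schema2 = schema + ["weight_g"]
rows2 = [r + (None,) for r in rows]
after = [read(r, schema2, clerk_view) for r in rows2]

print(before == after)
```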
Sep-01 CS545 Intro 12
Inventions – 2 – access trees
• Indexes [Landauer 1963]: balanced trees
= Efficient ancillary access path
– Requires updating to stay current
• Multiple Indexes [DavisLin 1965]: multi-attribute-based access
= Multiple ancillary access paths
– Allows access by multiple paths
– Requires much updating to stay current
• B-trees [Bayer, 1972]: index updateability
= Index blocks are kept only 50%-100% full for mostly fast update
– Improves performance greatly for indexes
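To make "ancillary access path" and "requires updating to stay current" concrete, here is a sketch that uses a sorted in-memory list as the index. This is a deliberate simplification and not a B-tree: every insert may shift many entries, which is exactly the cost a B-tree limits by leaving free space in its blocks. The record keys and payloads are made up.

```python
import bisect

records = []   # primary storage, in arrival order
index = []     # ancillary access path: sorted (key, record position) pairs

def insert(key, payload):
    """Store a record and update the index so it stays current."""
    records.append((key, payload))
    bisect.insort(index, (key, len(records) - 1))

def lookup(key):
    """Binary-search the index instead of scanning all records."""
    i = bisect.bisect_left(index, (key,))
    found = []
    while i < len(index) and index[i][0] == key:
        found.append(records[index[i][1]][1])
        i += 1
    return found

insert("gear", "G-1")
insert("bolt", "B-1")
insert("gear", "G-2")
print(lookup("gear"))
```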
Sep-01 CS545 Intro 13
Inventions – 3 - structures
• Hierarchical Structures [IMS, 1963]: dense data structures
= Trees mapped to sequential structures for fast access to sparse data
– Fast access when many related values are needed
– Costly to update, often done periodically
– Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969]: arbitrary structures
= All data represented by object-attribute-value entries
– High cost when many related values are needed
Note that these two conflict – in today's database implementations performance has won out over flexibility
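A sketch of triple storage with made-up part data. Each fact is one object-attribute-value entry, so new attributes need no schema change, but reassembling one object touches an entry per attribute, which is the cost noted above when many related values are needed.

```python
# Hypothetical data: every fact is one (object, attribute, value) entry.
triples = [
    ("p101", "name", "bolt"),
    ("p101", "supplier", "s1"),
    ("p101", "unit_cost", 0.10),
    ("p102", "name", "gear"),
    ("p102", "supplier", "s3"),
]

def record_for(obj):
    """Reassemble all values of one object; touches one entry per attribute."""
    return {a: v for o, a, v in triples if o == obj}

# Arbitrary structure: a new attribute needs no schema change.
triples.append(("p102", "color", "black"))

print(record_for("p102"))
```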
Sep-01 CS545 Intro 14
Inventions – 4 – model foodfight
• Relational Model [Codd 1970]
= tabular model, with an algebraic set of operations, normalization
– Formalization enabled understanding, dissemination
– No inter-relation semantics, specified when query is made
– Later constraints were added, implicitly defining keys, connections
• Hierarchical - (also applied to one view of BOMPs)
= describe hierarchical connections among data records, no algebra
– An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
= describe structure, procedural navigation in near-arbitrarily linked data
– Strong inter-record connections, needed for locating data
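The "algebraic set of operations" of the relational model can be sketched with three operators over lists of rows. The relations and values below are invented for illustration, and the functions are naive; a real DBMS would optimize the expression. Note that the connection between the two relations is stated only in the query, through the join column.

```python
def select(relation, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Keep only the named columns and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

def join(left, right, column):
    """Combine rows of two relations that agree on one column."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]

# Hypothetical relations.
supplies = [
    {"supplier": "s1", "part": "bolt"},
    {"supplier": "s2", "part": "bolt"},
    {"supplier": "s3", "part": "gear"},
]
uses = [{"product": "c1", "part": "bolt"}, {"product": "c2", "part": "gear"}]

# Which suppliers feed product c1?
answer = project(
    select(join(supplies, uses, "part"), lambda r: r["product"] == "c1"),
    ["supplier"],
)
print(answer)
```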
Sep-01 CS545 Intro 15
Why did the relational model win?
• Relational Model DBMSes: Sequel, QUEL, SQL
– Formality – allowed essential optimization algorithms
– Restrictions – such as normalization, provide guidance
– Teachability – exposed principles:
• can't teach only from examples
– DBMS independence – safety blanket for mission-critical users
• But implementations added features
• Use least common set of features?
– Hard to enforce once a system has been bought
• Few suppliers remain {ORACLE, IBM, MS, mySQL}
• ER model [Chen, 1976]
= Focuses on design, can be mapped to multiple implementations
– Few tools for direct translation
– Poor maintenance of model, ignored when DBs are expanded
Sep-01 CS545 Intro 16
Databases and the Web
• HTML presentation: HyperText Markup Language
= Data are transformed for human consumption, external refs
– Often hierarchical – object-oriented view
– If there was a schema, it is now hidden
• XML presentation
= Schema data is embedded
– Much flexibility
– Much more space when entries are small
– Requires interpretation for viewing, e.g., by XSLT
• RDF: Resource Description Framework
= Triple representation: object-attribute-value
– Great flexibility
– Uncertain implementation
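A rough illustration of "schema data is embedded" and "much more space when entries are small", using Python's standard xml.etree.ElementTree module on a made-up row. The column names travel with every value, so the serialized text is several times longer than the values alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical row with short values.
row = {"part_no": "101", "name": "bolt", "supplier": "s1"}

part = ET.Element("part")
for column, value in row.items():
    ET.SubElement(part, column).text = value
xml_text = ET.tostring(part, encoding="unicode")

value_chars = sum(len(v) for v in row.values())
print(xml_text)
print(len(xml_text), "characters of XML for", value_chars, "characters of data")

# The embedded names let a reader recover fields without an external schema.
assert ET.fromstring(xml_text).find("name").text == "bolt"
```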
Sep-01 CS545 Intro 17
Information overload Data starvation
• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– Internet & Intranets
– world-wide web
• Disintermediation
– ubiquitous publishing
Sep-01 CS545 Intro 18
Change in Supply vs Demand
What information consumes is rather obvious, it consumes the attention of its recipients.
Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.
[Herbert Simon]
Sep-01 CS545 Intro 19
Making data relevant
• Data reduction
• Data abstraction
– Level changing
– Summarization
– Exception search
– Level change to integrate with other data sources
• Follow Customer Model: hierarchical, divide-and-conquer, a common paradigm
Sep-01 CS545 Intro 20
Data and Knowledge
Information is created at the confluence of
data -- the state &
knowledge -- the ability to select and project the state into the future
[Figure: Knowledge Loop and Data Loop. Labels: Education, Recording, Action, Storage, State changes, Selection, Integration, Summarization, Decision-making, Abstraction, Experience]
Sep-01 CS545 Intro 21
Transforming Data to Information
Application Layer: users at workstations
Mediation Layer: value-added services
Foundation Layer: data and simulation resources
Sep-01 CS545 Intro 22
Functions inside Mediation
[Figure: functions inside a mediator. Labels: Selection, Summarize, Transform, Integration, articulation, Heterogeneous resources]
Sep-01 CS545 Intro 23
Function of Mediation
Apply Domain-specific Specialist Knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
INFORMATION
Sep-01 CS545 Intro 24
Environmental Restoration at INEL Undoing 50 years of messes
[Figure: mediation architecture, Idaho National Engineering Laboratory, June 1998 (Lockheed Martin, ISX, Stanford Univ.). Many projects, many sources. Labels: MQL [ISX], MSL [Stanford], OQL [ODMG], mediator, other mediators, CORBA, QEM, OEM, wrapper (four), ERIS, IEDMS]
Sep-01 CS545 Intro 25
From Schemas to Ontologies
Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.
Variable and Class names in Software
Knowledge-bases use term ontologies (often explicitly), add class definitions (to hold instances), constraints, and operations among the terms.
Sep-01 CS545 Intro 26
Ontology: components
We represent the contents and structure of a language by its ontology:
• a set of well-defined terms, which delimit the domain of discourse
• relationships among those terms, chosen from a limited set
a formalizable subset of expert knowledge
Sep-01 CS545 Intro 27
Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
– Local needs have priority
– Outside uses are a byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontologies
Sep-01 CS545 Intro 28
Unsolved problem in Interoperation
Common assumption in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts
Sep-01 CS545 Intro 29
Large Ontologies: good or bad?
Have all the knowledge together
+ simple for customers of KBs
– hard for owners of KBs, must synchronize with many others
– in the limit -- everybody must be globally consistent
A large KB will cover multiple / all domains
– created by a committee -- slow
– maintained by a committee -- costly to impossible
Differences in level of abstraction -- efficiency
homeowner: nail
carpenter: sinker, brad, boxnail, . . .
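The nail example can be sketched as a small mapping between two local vocabularies. The mapping table and the stock counts are invented for illustration. Translating upward to the homeowner's term is easy, but the carpenter's distinctions are lost, which suggests why each group keeps its own efficient sublanguage rather than one shared vocabulary.

```python
# Hypothetical mapping: carpenter's terms to the homeowner's broader term.
broader = {"sinker": "nail", "brad": "nail", "boxnail": "nail"}

# Hypothetical stock, recorded in the carpenter's vocabulary.
carpenter_stock = {"sinker": 200, "brad": 50, "boxnail": 120}

def to_homeowner(stock):
    """Re-express counts in the broader vocabulary; unknown terms pass through."""
    summary = {}
    for term, count in stock.items():
        general = broader.get(term, term)
        summary[general] = summary.get(general, 0) + count
    return summary

print(to_homeowner(carpenter_stock))
```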
Sep-01 CS545 Intro 30
Evolution of mediation
[Figure: evolution of mediation, with parts labeled a. through e. Layers: data sources (D1-D6), wrappers (W1-W3), mediators (M1, M2), integrators (I1, I2), applications (A1-A6), and a network]
Sep-01 CS545 Intro 31
Definition*
A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.
It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.
* Wiederhold: IEEE Computer March 1992
Sep-01 CS545 Intro 32
Interfaces
Application - Mediator: {OQL, KQML, ...}
Mediator - Data sources: {SQL, TQL, XML, … }
Data - real world: {sensors, clerks, … }
Human - Computer: {x-widgets, HTML}
Sep-01 CS545 Intro 33
An Integration Architecture
[Figure: integration architecture. Labels: Client Application, Mediator, Wrapper (two), TickerTape, Dialog, business reports, portfolios for each company, stock market prices]
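A toy version of this architecture, assuming two invented sources: a ticker feed and a holdings list. All symbols, prices, quantities, and function names are hypothetical. Each wrapper hides its source's format, and the mediator integrates and summarizes for the client application.

```python
# All names and figures below are hypothetical illustrations.

def ticker_wrapper():
    """Wrap a price feed: symbol -> latest price."""
    raw = "ACME:12.50;GLOBEX:40.00"
    return {s: float(p) for s, p in (item.split(":") for item in raw.split(";"))}

def portfolio_wrapper():
    """Wrap a holdings source, which uses its own field names."""
    raw = [{"company": "ACME", "qty": 100}, {"company": "GLOBEX", "qty": 10}]
    return {r["company"]: r["qty"] for r in raw}

def mediator():
    """Integrate both sources and summarize them for the client application."""
    prices, holdings = ticker_wrapper(), portfolio_wrapper()
    positions = {s: holdings[s] * prices[s] for s in holdings if s in prices}
    return {"positions": positions, "total": sum(positions.values())}

print(mediator())
```

If a source changes its format, only its wrapper needs to change; the client application sees the same mediator output.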
Sep-01 CS545 Intro 34
Status of Mediation Technology
Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain expert maintains models
• Specification determines functions
• Resource changes trigger regeneration
Sep-01 CS545 Intro 35
A mediator is not static software: Knowledge ages
[Figure: a mediator between the Application Interface and the Resource Interfaces. Contents: models, programs, rules, caches, . . . (Software & People). Roles: Owner / Creator, Maintainer, Lessor - Seller, Advertiser. Sources of change: changes of user needs, domain changes, resource changes]
Sep-01 CS545 Intro 36
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
Empowerment: autonomously maintainable
* based on experience with software
Sep-01 CS545 Intro 37
Roles
Computer Scientists
• Provide tools
– adaptation
– integration
– matching
– composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
• Give provider feedback
Sep-01 CS545 Intro 38
Mediation Research Topics
• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
– many levels and types
– roles for wrappers vs. mediators vs. applications
– scalability by partitioning -- make it simple!
– Domain Ontologies --- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
• New types of information systems
Sep-01 CS545 Intro 39
Long Range Science Vision
[Figure: Integration Science, drawing on three fields:
Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
Systems Engineering: analysis, documentation, costing
Databases: access, storage, algebras
Other labels: Integration Methods, GIS]
Sep-01 CS545 Intro 40
Fat versus thin mediators
• Too broad: hard to maintain, needs a committee
• Too thin: insufficient added value
• Too fat: hard to compose
• Too narrow: few customers
Just right
[Figure axes: domain scope, service scope]
Sep-01 CS545 Intro 41
Maintenance is good for you
[Figure: chart comparing automobile, hardware, and software. Axes: relative annual maintenance cost (0-100%); depreciation = 1 / lifetime; lifetime in years (1-13)]
Sep-01 CS545 Intro 42
Client-Server Architecture
Client system
data and simulation resources
Fast build of clients by resource reuse
Changes (x) are difficult, can affect many clients
Sep-01 CS545 Intro 43
Systems with Mediators
Applications . . . .
Mediators . . . . . .
Data Resources . . .
Gio Wiederhold. 1995
Sep-01 CS545 Intro 44
Growth through Reuse
New Application
Prior & Revised Mediators
Extended Data Resources
Gio Wiederhold. 1995
Sep-01 CS545 Intro 45
Linear O(n) Cost of Growth -- now O(n^2)
• Data changes only affect some mediators; only in their domain
• Mediators can
1. supply old information to n-1 prior applications
2. provide better information to the new application
3. be partially or completely reused
• New applications, using the new data, can be developed and inserted dynamically
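One way to read the O(n) versus O(n^2) contrast is to count the interfaces that must be built and maintained. The slide does not say what is counted, so the model below is an assumption: without mediators every application connects directly to every source, and with mediators each application and each source connects once to a mediation layer.

```python
def direct_links(applications, sources):
    """Assumed model: every application interfaces with every source."""
    return applications * sources

def mediated_links(applications, sources):
    """Assumed model: each application and each source interfaces once with a mediation layer."""
    return applications + sources

for n in (2, 10, 50):
    print(n, direct_links(n, n), mediated_links(n, n))
```

With n applications and n sources this gives n^2 direct interfaces against 2n mediated ones, so adding one more application costs one new interface instead of n.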
Sep-01 CS545 Intro 46
Assigning maintenance responsibility
a. Source data quality – supplier database, files, or web pages
b. Interface to the source – wrapper, supplier or vendor for supplier
c. Source selection – expert specialist in mediator
d. Source quality assessment – customer input to mediator
e. Semantic interoperation – specialist group providing input to the mediator
f. Consistency and metadata information – mediator service operation or warehouse
g. Informal, pragmatic integration – client services with customer input
h. User presentation formats – client services with customer input
Services
Sources
Customers