Post on 22-Dec-2015
1David Maier
PetDB
A Petabyte in Your Pocket
David Maier, Oregon Graduate Institute
with help from D. DeWitt, J. Naughton, L. Delcambre, K. Tufte, V. Papadimos, P. Tucker
Your PetDB
It’s 2015.
For $300 a year, you can have a personal petabyte database (PetDB).
You can talk to it from anywhere.
Organizes any kind of digital data.
– Doesn’t lose structure, can restructure
– Queryable
– Handles streams
– Organized by type, content, associations, multiple categorizations and groupings
Locate items by
– How or where you encountered them
– What you’ve done with them
– Where you were when you accessed them
What Would I Put in a Petabyte?
A lot.
Fill my office floor to ceiling with books: 100 GB
What do I do with 10,000 times as much?
Many possibilities:
– Contents of every book and magazine I read
– Every web page I visit
– All email I send or receive
– Every TV program I watch
– Every version of every piece of software I use
– Maps of everywhere I go
– Notes from every class or seminar I attend
– All the telephone calls I make
– My “Lifestream” (Freeman and Gelernter)
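As a quick check of the scale on this slide, the arithmetic can be sketched as follows. It assumes decimal units (1 PB = 10^15 bytes), which the slide does not specify, and takes the 100 GB office-of-books estimate at face value.

```python
# Decimal (SI) storage units; an assumption, since the slide does not say which it uses.
GB = 10**9
PB = 10**15

office_of_books = 100 * GB          # the slide's estimate for a floor-to-ceiling office
scaled = 10_000 * office_of_books   # "10,000 times as much"

print(scaled == PB)                 # True: 10,000 offices of books is one petabyte
print(f"{scaled // GB:,} GB")       # 1,000,000 GB
```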
Streams and Restructuring
Can incorporate streamed data on the fly.
– MD: Vital signs from patients in ICU
– Factory supervisor: status, output rate of all machines; finished products; rejects
Can restructure data if desired.
– Combined list of conferences in my area
– Info sheets on autos I’m considering buying
– Comparable salaries of faculty at my rank in similar departments
Anything I Might Want to Refer Back to
Personally indexed for me.
Can be located in a thousand different ways.
What is the company in Massachusetts I read about in the article on factory tours when I was on the plane to the sales meeting in Atlanta last spring?
Or Things I Might Want in the Future
• Histories of news groups and mailing lists
• Parts of the web I might want to browse, including past snapshots
• Descriptions and prices for any item I might want to buy
• Papers I’ve been meaning to read
• Historical data on stocks I’m interested in
Functions as a personal web portal
“Database” Not Completely Apt
• Didn’t have to define a scheme for it
• Doesn’t need to know the datatypes I want to store in advance
• Doesn’t chop data into rows and columns
  Unless I ask
• Can query over information streams
• Don’t need to write and run applications to add data
  Anything I’ve touched is there
  Or expressed an interest in
• Not on a particular computer
• Doesn’t have an “outside”
My PetDB is Good to Me
• I don’t move data between environments
  I’m never on the “wrong” machine
• Never go back to my office to grab a paper, never have the wrong folder at a meeting
• Don’t worry a lot about filing systems – PetDB organizes itself by the ways I like to look for information
• Anticipates what data I’ll be using
How to Do This?
On $300/year
Plan A: Pack my office floor to ceiling with disk drives.
About $1 million.
Plan B: Be clever.
– Share
– Stage
– Reconstitute
Share
Most of the information in my PetDB isn’t unique to me: magazine article, web page, stock quote.
Store one copy.
Information Paradox: What’s too expensive for one may be affordable for all.
[Figure labels: My PetDB; Others’ PetDBs]
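The “store one copy” idea can be illustrated with a content-addressed store, in which identical items from different users map to the same key. This is a minimal sketch, not the design from the talk: the `SharedStore` class, its method names, and the choice of SHA-256 as the key are all assumptions made for illustration.

```python
import hashlib

class SharedStore:
    """Toy content-addressed store: identical items are kept once (hypothetical design)."""

    def __init__(self):
        self.blobs = {}    # digest -> bytes, one copy per distinct item
        self.owners = {}   # digest -> set of users whose PetDB references the item

    def put(self, user: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)              # stored only on first sight
        self.owners.setdefault(digest, set()).add(user)  # later users just add a reference
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

    def bytes_stored(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())

store = SharedStore()
article = b"Factory tours in Massachusetts ..."
note = b"my private note"

ref_a = store.put("alice", article)
ref_b = store.put("bob", article)   # same article: no second copy is stored
store.put("alice", note)            # unique content is stored as usual

print(ref_a == ref_b)               # True
print(len(store.blobs))             # 2
```

Two users referencing the same article cost one copy plus two small references, which is the sense in which shared data becomes affordable for all.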
Stage
Not all data has to be at my current point of connection.
Mainly resides in shared and private servers on the Internet.
Staged to me on a series of data managers.
Access time depends on context, likely use
– Current itinerary: 1 second
– Upcoming trips: 5 seconds
– Past trips: 30 seconds
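One way to picture staging is as a placement decision: each item goes to the farthest tier that still meets the access-time target for its context. The sketch below uses the three targets from the slide; the tier names and their access times are hypothetical, as is the assumption that farther tiers are cheaper.

```python
# Latency targets from the slide, in seconds, keyed by usage context.
LATENCY_TARGET = {"current itinerary": 1, "upcoming trips": 5, "past trips": 30}

# Hypothetical tiers, nearest first, with assumed worst-case access times in seconds.
TIERS = [("local cache", 1), ("regional stager", 5), ("archive server", 30)]

def place(context: str) -> str:
    """Pick the farthest (presumably cheapest) tier that still meets the target."""
    target = LATENCY_TARGET[context]
    eligible = [name for name, access_time in TIERS if access_time <= target]
    return eligible[-1]

for ctx in LATENCY_TARGET:
    print(f"{ctx}: {place(ctx)}")
```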
Reconstitute
“If I found it once, PetDB can find it again”
Remember what procedure or search constructed or located data originally.
Use the same method to get it again.
Need to ensure base data is archived.
Plus a small amount of unique content
– Stuff I’ve created
– Foreground information that superimposes my personal perspective: selections, annotations, responses, manipulations, groupings
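The reconstitution idea, keeping the procedure that found the data rather than the data itself, might look like the following sketch. The `Recipe` and `PersonalIndex` names and the in-memory `ARCHIVE` list are hypothetical stand-ins; the talk does not specify how searches are recorded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stand-in for archived base data, which the slide says must be preserved elsewhere.
ARCHIVE: List[dict] = [
    {"title": "Factory tours", "state": "MA", "kind": "article"},
    {"title": "Stock report", "state": "NY", "kind": "quote"},
]

@dataclass(frozen=True)
class Recipe:
    """A remembered search: a description plus the predicate that located the data."""
    description: str
    predicate: Callable[[dict], bool]

    def run(self) -> List[dict]:
        return [item for item in ARCHIVE if self.predicate(item)]

class PersonalIndex:
    """Keeps recipes, not copies; data is re-derived on demand."""

    def __init__(self):
        self.recipes: Dict[str, Recipe] = {}

    def remember(self, name: str, recipe: Recipe) -> None:
        self.recipes[name] = recipe

    def reconstitute(self, name: str) -> List[dict]:
        return self.recipes[name].run()

index = PersonalIndex()
index.remember(
    "ma-articles",
    Recipe("articles about Massachusetts",
           lambda d: d["state"] == "MA" and d["kind"] == "article"),
)
found = index.reconstitute("ma-articles")
print([d["title"] for d in found])   # ['Factory tours']
```

The personal index stays small because it holds only recipes and unique content; this only works if the base data the recipes run against remains archived.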
What Infrastructure Do I Need?
Net Data Managers
• Network-centric vs. disk-centric
  – Data movement vs. data storage
• Work on live streams as well as stored data
• Deal with data of arbitrary types
• Run queries over thousands of sites
• Locate data by external contexts as well as internal content
• Large-scale monitoring
Data Management Space
            Disk Centric    Network Centric
No Query    File System     Web Servers
Query       DBMS            Net Data Managers (NDMs)
Why Net Data Managers?
File systems won’t work
– No queries, disk centric
Web Servers won’t work
– No structural query, no combining of data
– No support for optimization and execution of high-level queries spanning 1000s of sites
– No support for triggers
– In reality, nothing more than “page servers”
Limitations of Current DBMSs
• Schema-first
• Load then query
• Data in the box
• Scale
• Search by content, not by context
Key Elements of NDM
• Self-describing data (e.g., XML)
• NetQueries
• Algebraic basis
• Stream-processing components
  Oil refinery vs. book-order warehouse
Want to do for net-centric, data-intensive applications what relational DBs did for business data processing:
Reduce the coding effort to produce such applications, while improving performance, scalability and reliability.
Codd’s Contribution
What’s the most important aspect of the relational model?
– Calculus?
– Algebra?
– Equivalence?
My opinion: Observing that business data processing (BDP) programs only do about 6-7 different things:
– scan files
– select records
– combine records
– concatenate files
– remove fields
– remove duplicates
– [aggregate records]
What are the building blocks of net data management?
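The handful of operations listed on this slide can each be written as a small generator over records. The sketch below treats a “file” as a list of dicts; the function names are illustrative choices (loosely following relational algebra terminology) and are not taken from the talk.

```python
from collections import defaultdict

def scan(file):                      # scan files
    yield from file

def select(rows, pred):              # select records
    return (r for r in rows if pred(r))

def project(rows, fields):           # remove fields (keep only the named ones)
    return ({f: r[f] for f in fields} for r in rows)

def distinct(rows):                  # remove duplicates
    seen = set()
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            yield r

def join(left, right, field):        # combine records that agree on one field
    index = defaultdict(list)
    for right_row in right:
        index[right_row[field]].append(right_row)
    for left_row in left:
        for right_row in index[left_row[field]]:
            yield {**left_row, **right_row}

def aggregate(rows, key, field):     # aggregate records (sum per group)
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[field]
    return dict(totals)

def concat(*files):                  # concatenate files
    for f in files:
        yield from scan(f)

orders_q1 = [{"cust": 1, "amount": 10}, {"cust": 2, "amount": 5}]
orders_q2 = [{"cust": 1, "amount": 7}, {"cust": 1, "amount": 7}]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

orders = distinct(concat(orders_q1, orders_q2))
named = join(orders, customers, "cust")
big = select(named, lambda r: r["amount"] > 5)
totals = aggregate(project(big, ["name", "amount"]), "name", "amount")
print(totals)                        # {'Ann': 17}
```

Because each operation consumes and produces a stream of records, they compose freely, which is the property the question about net data management building blocks is asking for.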
Without NDMs
[Figure: data flows from Data Sources to Users. Components shown: Format Conversion (×3), Accumulator + Query Eng., Alert Service, Profiles, Parameter File, Data Product Generation, Algorithm, Browser / Push Receiver (×3). Legend: Custom Software, Generic Component.]
With NDMs
[Figure: data flows from Sources to Users. Components shown: Format Conversion (×3), Accumulator + Query Eng., Alert Service, Profiles, Parameter File, Algorithm, Data Product Generation, Browser / Push Receiver (×3). Legend: Custom Software, Generic Component.]
Kinds of Components
• Stream-based query processors
• Alerters
• Accumulators
• Remote monitoring/indexing
• Semantic Routers
• Replicators: lazy, eager, just-in-time
• Semantic caches
• Splitters
• Access-mode adapters
• Partial evaluators
Alerting vs. Querying
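[Figure: a DBMS holds stored data (D); queries (?) flow in and answers (!) flow out. An Alerter holds stored queries (?); data (D) flows in and alerts (!) flow out.]
Data Centric: stream of queries past a store of data
Net Centric: stream of data past a store of queries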
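The net-centric side of this contrast, a stream of data flowing past a store of queries, can be sketched as a set of standing predicates applied to each arriving item. The `Alerter` class below is a hypothetical illustration, and the vital-signs fields and thresholds are made-up values chosen to echo the ICU example earlier in the deck.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

class Alerter:
    """Holds standing queries; each arriving item is tested against all of them."""

    def __init__(self):
        self.queries: Dict[str, Callable[[dict], bool]] = {}

    def register(self, name: str, predicate: Callable[[dict], bool]) -> None:
        self.queries[name] = predicate

    def feed(self, stream: Iterable[dict]) -> Iterator[Tuple[str, dict]]:
        # The data moves; the queries stay put.
        for item in stream:
            for name, predicate in self.queries.items():
                if predicate(item):
                    yield name, item

alerter = Alerter()
alerter.register("high pulse", lambda v: v["pulse"] > 120)   # illustrative threshold
alerter.register("low oxygen", lambda v: v["spo2"] < 90)     # illustrative threshold

vitals = [
    {"patient": "A", "pulse": 80, "spo2": 97},
    {"patient": "B", "pulse": 130, "spo2": 88},
]
alerts = list(alerter.feed(vitals))
for name, item in alerts:
    print(name, item["patient"])
```

Nothing in the stream is stored: an item that matches no standing query simply passes through, which is the inverse of the load-then-query pattern of a conventional DBMS.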
Access Modes: Who Decides
[Figure: 2×2 grid of access modes (Post, Pull, Poll, Push), classified by who decides what data moves (producer or consumer) and who decides when data moves (producer or consumer).]
Assembling Applications from Components
Akamai FreeFlow (see NASDAQ site)
Splitting + Replication + Merge + Adapters
[Figure: components shown: Web Content, Split (Graphics / Text), Push, Replicate, Base Server, Field Server (×3), Pull (×2), Merge, Browser.]
NIAGARA Project
Initial investigation of NDM based on XML
University of Wisconsin and OGI
• Stream-oriented XML-QL evaluator
• “Text-in-context” search
• NiagaraCQ
• Merge operator (and rest of algebra)
• XML Firehose
Use of NDM for PetDB
• NetQueries encode procedures for reconstituting data
• Monitoring sources of interest
• Replication, splitting, push, accumulators, semantic routing for staging data
• NetQuery to inform an archive server what to save
• Archives, semantic caches express what they already hold with a NetQuery
Building the PetDB System
[Figure: PetDB system architecture. Components shown: PetDB, Indexer, ContextMgr., Petster, DataKennel, BackQuote, WebSnap, IP Server, TaskAnalyzer, Stager (×3), Profiler, Public Archives, Private Archive, StreamProcessor, InternetMonitor, ReplicateServer, SecureLocalCache.]
What Else is Needed?
• Superimposed Information
  Much of my unique content is an organizational overlay on base data
• Small-footprint data managers
• Presentation model of stream data
• Authorization and Authentication
• QoS control, content scaling
• Intelligent prediction, learning
• Secure staging areas