The Structure of (Computer) Scientific Revolutions
Dow Jones Enterprise Ventures, May 2006
Michael Franklin
UC Berkeley & Amalgamated Insight
Michael Franklin, Dow Jones EV Summit, May 2006
Data Management: Then
Structured Data Processing
Data Management: Now
The Structure Spectrum
• Structured data (schema-first)
  • regular, known, conforming, …
  • e.g., relational database
• Unstructured data (schema-never)
  • freeform, irregular
  • e.g., plain text, images, audio, …
• Semi-structured data (schema-later)
  • provides structural information, but less constrained
  • e.g., XML, tagged text/media
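As a rough illustration of the three points on the spectrum, the sketch below stores the same fact three ways and reads it back. The record, table name, and field names are hypothetical and chosen only for the example; they do not come from the talk.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Schema-first: the table definition must exist before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (tag_id TEXT NOT NULL, shelf INTEGER NOT NULL)")
conn.execute("INSERT INTO reading VALUES (?, ?)", ("tag-17", 3))
structured_shelf = conn.execute(
    "SELECT shelf FROM reading WHERE tag_id = ?", ("tag-17",)
).fetchone()[0]

# Schema-later: the data carries its own tags; structure is discovered on read.
root = ET.fromstring(
    '<reading tag_id="tag-17"><shelf>3</shelf><note>fields can vary per record</note></reading>'
)
semi_shelf = int(root.findtext("shelf"))

# Schema-never: free text; any structure has to be extracted by the reader.
text = "Saw tag-17 on shelf 3 this morning."
match = re.search(r"shelf (\d+)", text)
text_shelf = int(match.group(1)) if match else None
```

All three reads recover the same shelf number, but only the schema-first version lets the system reject a malformed record before it is stored.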
Whither Structured Data?
• Conventional Wisdom: ~20% of data is structured currently.
• Consumer apps, enterprise search, media apps are placing downward pressure on this.
A Contrarian View?
Two reasons why structured data is where the action will be:
• The “Data Industrial Revolution”: Data used to be “hand-crafted”, now it’s generated by computers!!!
• The Data Integration quagmire: structure provides crucial cues for making data usable.
The New Landscape
Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
• Mainframes: 1960s
• Minicomputers: 1970s
• Microcomputers/PCs: 1980s
• Web-based computing: 1990s
• Devices (cell phones, PDAs, wireless sensors, RFID): 2000s
Enabling a new generation of applications for operational visibility, monitoring, and alerting.
Data Streams Data Flood
Data sources shown: Clickstream, Barcodes, PoS System, Sensors, RFID, Telematics, Inventory, Phones, Transactional Systems
• Exponential data growth
• New challenges: continuous, inter-connected, distributed, physical
• Shrinking business cycles
• More complex decisions
State of the Art
• Custom-coded implementations that are expensive and often unsuccessful.
• Can we develop the right infrastructure to support large-scale data streaming apps?
High Fan In Systems
• A data management infrastructure for large-scale data streaming environments.
• Uniform Declarative Framework
• Every node is a data stream processor that speaks SQL-ese: stream-oriented queries at all levels
• Hierarchical, stream-based views as an organizing principle.
• Can impose a “view” over messy devices.
HiFi - Taming the Data Flood
Receptors
Warehouses, Stores
Dock doors, Shelves
Regional Centers
Headquarters
Hierarchical Aggregation
• Spatial
• Temporal
In-network Stream Query Processing and Storage
Fast Data Path vs. Slow Data Path
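A minimal sketch of what spatial and temporal aggregation over such a hierarchy could look like. The readings, location names, and the `aggregate` helper are hypothetical; for brevity the sketch re-aggregates raw readings at every level, whereas an in-network design would presumably have each level consume the output of the level below it.

```python
from collections import defaultdict

# Hypothetical readings: (region, store, shelf, timestamp_sec, tag_id)
readings = [
    ("west", "store-1", "shelf-a", 0, "t1"),
    ("west", "store-1", "shelf-a", 2, "t1"),
    ("west", "store-1", "shelf-b", 3, "t2"),
    ("west", "store-2", "shelf-a", 7, "t3"),
    ("east", "store-3", "shelf-a", 8, "t4"),
]

def aggregate(rows, depth, window_sec):
    """Count distinct tags per (location prefix, time window).

    depth=3 groups by shelf, depth=2 by store, depth=1 by region (spatial);
    window_sec sets the width of the time buckets (temporal).
    """
    tags = defaultdict(set)
    for row in rows:
        location, ts, tag = row[:depth], row[3], row[4]
        tags[(location, ts // window_sec)].add(tag)
    return {key: len(value) for key, value in tags.items()}

shelf_level = aggregate(readings, depth=3, window_sec=5)    # fine-grained view
region_level = aggregate(readings, depth=1, window_sec=10)  # coarse view
```

The same function serves every level; only the grouping depth and the window width change as data moves up the hierarchy.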
Device Issues: example
Shelf RFID Test - Ground Truth
Actual RFID Readings
“Restock every time inventory goes below 5”
Query-based Data Cleaning
Point
Smooth
CREATE VIEW smoothed_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= count_T)
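To make the Smooth query concrete, here is a small Python simulation of its logic, assuming that a range equal to the slide means non-overlapping 5-second windows and that count_T is a tunable threshold. The `smooth` function, its default values, and the sample readings are hypothetical, not part of HiFi.

```python
from collections import Counter

def smooth(readings, window_sec=5, count_t=2):
    """Keep (window, receptor_id, tag_id) groups seen at least count_t times.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    With range == slide the windows do not overlap (tumbling windows).
    """
    counts = Counter(
        (int(ts // window_sec), receptor_id, tag_id)
        for ts, receptor_id, tag_id in readings
    )
    return sorted(key for key, n in counts.items() if n >= count_t)

# Hypothetical readings: tag "A" is read steadily, tag "B" only once.
sample = [
    (0.5, "r1", "A"), (1.5, "r1", "A"), (2.0, "r1", "B"), (3.5, "r1", "A"),
    (6.0, "r1", "A"), (7.0, "r1", "A"),
]
smoothed = smooth(sample)
```

In this sample the single stray reading of tag "B" falls below the threshold and is dropped, while tag "A" is reported once per window.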
Query-based Data Cleaning
Point
Smooth
Arbitrate
CREATE VIEW arbitrated_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= ALL (SELECT count(*)
                         FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
                         WHERE tag_id = rs.tag_id
                         GROUP BY receptor_id))
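The Arbitrate query can be read as "within each window, assign a tag to the receptor that saw it most often". The Python sketch below simulates that reading under the same tumbling-window assumption; the `arbitrate` function and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def arbitrate(readings, window_sec=5):
    """For each (window, tag), keep the receptor(s) with the highest read count.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Ties keep every tied receptor, matching the >= ALL comparison.
    """
    counts = Counter(
        (int(ts // window_sec), receptor_id, tag_id)
        for ts, receptor_id, tag_id in readings
    )
    best = defaultdict(int)
    for (window, _receptor, tag), n in counts.items():
        best[(window, tag)] = max(best[(window, tag)], n)
    return sorted(
        key for key, n in counts.items() if n == best[(key[0], key[2])]
    )

# Hypothetical readings: tag "A" is seen mostly by r1 but once by neighbor r2.
sample = [
    (0, "r1", "A"), (1, "r1", "A"), (2, "r1", "A"), (2, "r2", "A"),
    (3, "r2", "B"), (4, "r2", "B"),
]
arbitrated = arbitrate(sample)
```

Here the stray reading of tag "A" by receptor r2 is discarded, so each tag ends up attributed to a single shelf.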
After Query-based Cleaning
“Restock every time inventory goes below 5”
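A small sketch of why cleaning matters for a rule like this one. The `restock_alerts` function and both count series are hypothetical illustrations, not the measurements from the shelf test.

```python
def restock_alerts(counts, threshold=5):
    """Return the indices where the count drops from >= threshold to below it."""
    alerts = []
    previous = None
    for i, count in enumerate(counts):
        if count < threshold and (previous is None or previous >= threshold):
            alerts.append(i)
        previous = count
    return alerts

# Hypothetical per-window tag counts for a shelf that really holds 6 items.
raw = [6, 4, 6, 3, 6, 6, 4, 6]      # dropped readings make the count dip
cleaned = [6, 6, 6, 6, 6, 6, 6, 6]  # after smoothing and arbitration

raw_alerts = restock_alerts(raw)
cleaned_alerts = restock_alerts(cleaned)
```

With the noisy series the rule fires three times even though nothing left the shelf; with the cleaned series it does not fire at all.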
Once you have the right abstractions…
• “Soft Sensors”
• Quality and lineage
• Optimization (power, etc.)
• Pushdown of external validation information
• Data archiving
• Model-based sensing
• Imperative processing
• …
Data Integration
• Integration is the ultimate schema-first problem.
• Structure is both a key enabler and a key impediment here.
Search vs. Query
What if you wanted to find out which actors donated to John Kerry’s presidential campaign?
Search vs. Query
• “Search” can return only what’s been previously “stored”.
Also…
• What if you wanted to find out the average donation of actors to each candidate?
• What if you wanted to compare actor donations this campaign to the last one?
• What if you wanted to find out who gave the most to each candidate?
• What if you wanted to know where the information came from, and how old it was?
A “Deep-Web” Query Approach
SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
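The join above can be tried locally with SQLite, as in the sketch below. Everything in it is a stand-in: the rows are placeholders rather than real people or donations, and the `candidate` and `amount` columns are assumed for illustration, since the slide elides the remaining columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Yahoo_Actors (name TEXT);
    -- candidate and amount are assumed columns, not taken from the slide.
    CREATE TABLE FECInfo (name TEXT, occupation TEXT, candidate TEXT, amount REAL);
    INSERT INTO Yahoo_Actors VALUES ('Actor One'), ('Actor Two');
    INSERT INTO FECInfo VALUES
        ('Actor One', 'Actor', 'Candidate X', 500.0),
        ('Actor Two', 'Actor', 'Candidate X', 1500.0),
        ('Someone Else', 'Engineer', 'Candidate X', 250.0);
""")

# The join from the slide: actors who also appear in the donation records.
donors = conn.execute("""
    SELECT y.name, f.occupation
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
    ORDER BY y.name
""").fetchall()

# Once the data is joined, aggregate questions become one more query.
averages = conn.execute("""
    SELECT f.candidate, AVG(f.amount)
    FROM Yahoo_Actors y JOIN FECInfo f ON y.name = f.name
    GROUP BY f.candidate
""").fetchall()
```

The second query answers the "average donation of actors to each candidate" question, which a keyword search over stored pages cannot compute. The join matches on exact name equality, so it finds only names that are spelled identically in both sources.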
“Yahoo Actors” JOIN “FECInfo”
Q: Did it Work?
The Fundamental Tradeoff
Axes: Level of Functionality vs. Time (and cost)
Curves: Structured (schema-first), Semi-Structured (schema-later), Unstructured (schema-less)
Structure enables computers to help users manipulate and maintain the data.
Dataspaces*
• Deal with all the data from an enterprise – in whatever form
• Data co-existence: no integrated schema, no single warehouse
• Pay-as-you-go services
  • Keyword search is bare minimum.
  • Data manipulation and increased consistency as you add work.
* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
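One way to picture the pay-as-you-go idea is the sketch below: keyword search works over every source as-is, and richer operations become available only for sources someone has invested in parsing. The source names, contents, and helper functions are hypothetical and are not taken from the Dataspaces paper.

```python
import csv
import io

# Hypothetical enterprise sources, kept in whatever form they arrived in.
SOURCES = {
    "crm.csv": "id,name\n1,Acme Corp\n2,Globex\n",
    "notes.txt": "Call Acme Corp about the late shipment.",
    "orders.json": '[{"customer": "Globex", "total": 120}]',
}

def keyword_search(term):
    """Bare-minimum service: substring match over every source, no schema needed."""
    term = term.lower()
    return sorted(name for name, content in SOURCES.items() if term in content.lower())

def customers_from_csv(text):
    """Extra work on one source: parse it so its fields can be queried by name."""
    return [row["name"] for row in csv.DictReader(io.StringIO(text))]

hits = keyword_search("acme")
customers = customers_from_csv(SOURCES["crm.csv"])
```

Search covers all three sources from day one, while field-level access exists only for the one source that has been given a parser so far.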
Dataspaces vs. Databases
Dataspaces:
• Data Coexistence
• Autonomous Sources
• Search, Browse, Approximate Answer
• Best Effort Guarantees
Databases:
• Single Schema
• Centralized Administration
• Structured Query
• Strict Integrity Constraints
The World of Dataspaces
Axes: Semantic Integration (High to Low) and Administrative Proximity (Near to Far)
Systems plotted: Desktop Search, Web Search, Virtual Organization, Federated DBMS, DBMS
Conclusions
• Structured data is not going away.
  • In fact, there will be lots more of it.
  • It must be processed as fast as it is created.
• Structure is crucial for successful data integration and manipulation.
  • Much effort will be expended to add structural information to text and media.
• Traditional (structured) database technology is not up to the task.
• Great opportunities for innovation.
  • HiFi and Dataspaces are examples.