Sensor Data Management and XML Data Management
Zachary G. Ives, University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 19, 2008

TRANSCRIPT

Page 1: Sensor Data Management and XML Data Management

Sensor Data Management and XML Data Management

Zachary G. Ives, University of Pennsylvania

CIS 650 – Implementing Data Management Systems

November 19, 2008

Page 2: Sensor Data Management and XML Data Management

Administrivia

By next Tuesday, please email me with a status report on your project … We are well under a month from the deadline!

For next time: Please read & review the TurboXPath paper

Page 3: Sensor Data Management and XML Data Management

Sensor Networks: Target Platform

Most sensor network research argues for the Berkeley mote as a target platform.

Mote: 4 MHz 8-bit CPU; 128 B RAM (original); 512 B Flash memory (original); 40 kbps radio with ~100 ft range

Sensors: light, temperature, microphone, accelerometer, magnetometer

http://robotics.eecs.berkeley.edu/~pister/SmartDust/

Page 4: Sensor Data Management and XML Data Management


Sensor Net Data Acquisition

• First: build routing tree

• Second: begin sensing and aggregation

Page 5: Sensor Data Management and XML Data Management

Sensor Net Data Acquisition (Sum)

[Figure: the routing tree, with each node showing its raw local sensor reading (mostly 5s)]

• First: build routing tree

• Second: begin sensing and aggregation (e.g., sum)

Page 6: Sensor Data Management and XML Data Management

Sensor Net Data Acquisition (Sum)

[Figure: the same routing tree; each node now forwards the sum of its own reading and its children's partial sums, with totals accumulating toward the base station]

• First: build routing tree

• Second: begin sensing and aggregation (e.g., sum); a sketch follows
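To make the aggregation step concrete, here is a minimal C sketch of in-network summation: each node adds its children's partial sums to its own reading and forwards a single value up the tree. The tree layout and readings are illustrative, not from the slides' figure.

#include <stdio.h>

#define MAX_CHILDREN 4

typedef struct Node {
    int reading;                      /* this node's sensor reading */
    struct Node *child[MAX_CHILDREN]; /* children in the routing tree */
    int nchildren;
} Node;

/* Each node forwards exactly one value: its own reading plus the
   partial sums received from its children. */
static int partial_sum(const Node *n) {
    int sum = n->reading;
    for (int i = 0; i < n->nchildren; i++)
        sum += partial_sum(n->child[i]);
    return sum;
}

int main(void) {
    Node leaf1 = { 5, { 0 }, 0 };
    Node leaf2 = { 5, { 0 }, 0 };
    Node leaf3 = { 8, { 0 }, 0 };
    Node mid   = { 5, { &leaf1, &leaf2 }, 2 };
    Node root  = { 7, { &mid, &leaf3 }, 2 };  /* node adjacent to the base station */
    printf("total delivered to base station: %d\n", partial_sum(&root));
    return 0;
}

In a real network each node computes its sum locally and transmits it once per epoch; the recursion here just models the same bottom-up flow in one process.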

Page 7: Sensor Data Management and XML Data Management

Sensor Network Research

Routing: need to aggregate and consolidate data in a power-efficient way
Ad hoc routing: generate a routing tree to the base station
Generally need to merge computation with routing

Robustness: need to combine info from many sensors to account for individual errors
What aggregation functions make sense?

Languages: how do we express what we want to do with sensor networks?
Many proposals here

Page 8: Sensor Data Management and XML Data Management

A First Try: TinyOS and nesC

TinyOS: a custom OS for sensor nets, written in nesC
Assumes a low-power CPU
Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)

Applications are built from “components”
Basically, small objects without any local state
Various features in libraries that may or may not be included

interface Timer {
  command result_t start(char type, uint32_t interval);
  command result_t stop();
  event result_t fired();
}

Page 9: Sensor Data Management and XML Data Management

Drawbacks of this Approach

Need to write very low-level code for sensor net behavior

Only simple routing policies are built into TinyOS; some of the routing algorithms may have to be implemented by hand

Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …

Page 10: Sensor Data Management and XML Data Management

An Alternative

“Much” of the computation being done in sensor nets looks like what we were discussing with STREAM

Today’s sensor networks look a lot like databases, pre-Codd:
Custom “access paths” to get to data
One-off custom code

So why not look at mapping sensor network computation to SQL?
Not very many joins here, but significant aggregation
Now the challenge is picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage

Page 11: Sensor Data Management and XML Data Management

TinyDB and TinySQL

Treat the entire sensor network as a universal relation
Each type of sensor data is a column in a global table

Tuples are created according to a sample interval (separated by epochs)
(Implications of this model?)

SELECT nodeid, light, temp
FROM sensors
SAMPLE INTERVAL 1s FOR 10s

Page 12: Sensor Data Management and XML Data Management

Storage Points and Windows

Like Aurora and STREAM, TinyDB can materialize portions of the data:

CREATE STORAGE POINT recentlight SIZE 8
AS (SELECT nodeid, light FROM sensors SAMPLE INTERVAL 10s)

and we can use windowed aggregates:

SELECT WINAVG(volume, 30s, 5s)
FROM sensors
SAMPLE INTERVAL 1s

Page 13: Sensor Data Management and XML Data Management

Events

ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp), event.loc
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE INTERVAL 2s FOR 30s

Page 14: Sensor Data Management and XML Data Management

Power and TinyDB

A cost-based optimizer tries to find the query plan that yields the lowest overall power consumption:
Different sensors have different power usage
Try to order sampling according to selectivity (sounds familiar?); a sketch follows this slide
Assumes a uniform distribution of values over the range
Batching of queries (multi-query optimization)
Convert a series of events into a stream join with a table

Also need to consider where the query is processed…
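As a concrete illustration of the selectivity point: for a conjunction of predicates over different sensors, a classic rule orders acquisitions by increasing cost / (1 − selectivity), so cheap and highly selective sensors are sampled first. Below is a minimal C sketch of that ordering; the sensor names, costs, and selectivities are illustrative assumptions, not TinyDB's actual numbers or code.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    double cost;        /* energy per sample (hypothetical units) */
    double selectivity; /* estimated fraction of tuples that pass */
} Acquisition;

static double rank_of(const Acquisition *a) {
    /* Lower rank means sample earlier; guard against selectivity == 1 */
    double pass = a->selectivity >= 1.0 ? 0.999999 : a->selectivity;
    return a->cost / (1.0 - pass);
}

static int by_rank(const void *x, const void *y) {
    double rx = rank_of((const Acquisition *)x);
    double ry = rank_of((const Acquisition *)y);
    return (rx > ry) - (rx < ry);
}

int main(void) {
    Acquisition acqs[] = {
        { "light",       90.0,   0.50 },  /* hypothetical numbers */
        { "temperature", 350.0,  0.10 },
        { "microphone",  1200.0, 0.05 },
    };
    qsort(acqs, 3, sizeof acqs[0], by_rank);
    for (int i = 0; i < 3; i++)
        printf("%d: sample %s\n", i + 1, acqs[i].name);
    return 0;
}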

Page 15: Sensor Data Management and XML Data Management

Dissemination of Queries

Based on the semantic routing tree (SRT) idea
The SRT build request is flooded first
Node n gets to choose its parent p, based on radio range from the root

A parent knows its children
It maintains an interval on values for each child
It forwards requests to children as appropriate (sketched after this list)

Maintenance:
If an interval changes, the child notifies its parent
If a node disappears, the parent learns of this when it fails to get a response to a query
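A minimal sketch of the forwarding decision, assuming each parent keeps a [lo, hi] interval of the values each child's subtree can produce; the structure and names are illustrative, not TinyDB's API.

#include <stdio.h>

typedef struct { int child_id; double lo, hi; } ChildInterval;

/* Forward a query whose predicate is "value in [qlo, qhi]" only to
   children whose known interval overlaps the query's range. */
static void forward_query(const ChildInterval *kids, int n,
                          double qlo, double qhi) {
    for (int i = 0; i < n; i++) {
        if (kids[i].hi >= qlo && kids[i].lo <= qhi)
            printf("forward to child %d\n", kids[i].child_id);
        /* otherwise prune: that subtree cannot contribute results */
    }
}

int main(void) {
    ChildInterval kids[] = { {1, 10, 20}, {2, 35, 50}, {3, 18, 40} };
    forward_query(kids, 3, 25.0, 45.0);  /* children 2 and 3 overlap */
    return 0;
}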

Page 16: Sensor Data Management and XML Data Management

Query Processing

Mostly consists of sleeping!
Wake briefly, sample, compute operators, then route onwards

Nodes are time synchronized
Awake time is proportional to the neighborhood size (why?)

Computation is based on partial state records
Basically, each operation is a partial aggregate value, plus the reading from the sensor (a sketch follows)
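For instance, a partial state record for AVG can carry (sum, count), merged up the tree with only the root dividing; this follows the TAG-style decomposition of AVG, and the code below is a minimal illustrative sketch rather than TinyDB's implementation.

#include <stdio.h>

typedef struct { double sum; long count; } AvgPSR;

static AvgPSR psr_init(double reading) {
    AvgPSR p = { reading, 1 };
    return p;
}

static AvgPSR psr_merge(AvgPSR a, AvgPSR b) {
    AvgPSR m = { a.sum + b.sum, a.count + b.count };
    return m;
}

static double psr_finalize(AvgPSR p) {  /* evaluated at the root only */
    return p.count ? p.sum / p.count : 0.0;
}

int main(void) {
    /* A node with reading 22.5 merges the PSRs from two children. */
    AvgPSR mine = psr_init(22.5);
    AvgPSR child1 = { 40.0, 2 };
    AvgPSR child2 = { 19.0, 1 };
    AvgPSR up = psr_merge(psr_merge(mine, child1), child2);
    printf("forwarding sum=%.1f count=%ld (avg so far %.2f)\n",
           up.sum, up.count, psr_finalize(up));
    return 0;
}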

Page 17: Sensor Data Management and XML Data Management

Load Shedding & Approximation

What if the router queue is overflowing?
Need to prioritize tuples and drop the ones we don’t want
FIFO vs. averaging the head of the queue vs. delta-proportional weighting (a sketch of the averaging option follows this slide)

Later work considers using approximation for more power efficiency:
If sensors in one region change less frequently, we can sample less frequently (or fewer times) in that region
If sensors change less frequently, we can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)
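Here is a minimal sketch of the head-averaging option: when the outbound queue is full, fold the two oldest tuples into one weighted average rather than dropping data outright. The queue structure and policy details are illustrative assumptions, not TinyDB's code.

#include <stdio.h>

#define QCAP 4

typedef struct { double value; int count; } Tuple;  /* count = readings folded in */
typedef struct { Tuple q[QCAP]; int len; } Queue;

static void enqueue(Queue *q, Tuple t) {
    if (q->len == QCAP) {
        /* Shed load: merge the two oldest tuples into one average. */
        Tuple a = q->q[0], b = q->q[1];
        Tuple merged = {
            (a.value * a.count + b.value * b.count) / (a.count + b.count),
            a.count + b.count
        };
        q->q[0] = merged;
        for (int i = 2; i < q->len; i++)
            q->q[i - 1] = q->q[i];
        q->len--;
    }
    q->q[q->len++] = t;
}

int main(void) {
    Queue q = { .len = 0 };
    for (int i = 1; i <= 6; i++) {
        Tuple t = { 10.0 * i, 1 };
        enqueue(&q, t);
    }
    for (int i = 0; i < q.len; i++)
        printf("tuple %d: value=%.1f count=%d\n", i, q.q[i].value, q.q[i].count);
    return 0;
}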

Page 18: Sensor Data Management and XML Data Management

The Future of Sensor Nets?

TinySQL is a nice way of formulating the problem of query processing with motes:
View the sensor net as a universal relation
Can define views to abstract some concepts, e.g., an object being monitored

But:
What about when we have multiple instances of an object to be tracked?
Correlations between objects? (Joins)
What if we have more complex data? More CPU power?
What if we want to reason about accuracy?

Page 19: Sensor Data Management and XML Data Management

XML: A Format of Many Uses

XML has become the standard for data interchange and for many document representations

Sometimes we’d like to store it…
Collections of text documents, e.g., the Web, document DBs
… How would we want to query those? IR/text queries, path queries, XQueries?

Interchanging data:
SOAP messages, RSS, XML streams
Perhaps subsets of data from RDBMSs

Storing native, database-like XML data:
Caching
Logging of XML messages

Page 20: Sensor Data Management and XML Data Management

XML: Hierarchical Data and Its Challenges

It’s not normalized…
It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing
Contrast with E-R diagrams
How to store the hierarchy?
Complex navigation may include going up or sideways in the tree
Updates, locking
Optimization

Also, it’s ordered
May restrict order of evaluation (or at least presentation)
Makes updates more complex

Many of these issues aren’t unique to XML
Semistructured databases, especially with ordered collections, were similar
But our efforts in that area basically failed…

Page 21: Sensor Data Management and XML Data Management

Two Ways of Thinking of XML Processing

XML databases (today):
Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
Query optimization

“Streaming XML” (next time):
RDBMS XML export
Partitioning of computation between source and mediator
“Streaming XPath” engines

The difference is in storage (or the lack thereof)

Page 22: Sensor Data Management and XML Data Management

XML in a Database

Use a legacy RDBMS:
Shredding [Shanmugasundaram+99] and many others
Path-based encodings [Cooper+01]
Region-based encodings [Bruno+02][Chen+04] (see the sketch after this list)
Order preservation in updates [Tatarinov+02], …
What’s novel here? How does this relate to materialized views and warehousing?

Native XML databases:
Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
Updates and locking
Query optimization (e.g., that of Galax)
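To make the region-based encodings concrete: each element stores (start, end, level) positions from a single document scan, and structural relationships reduce to interval containment. A minimal C sketch, with an illustrative document and field names.

#include <stdio.h>

typedef struct {
    const char *tag;
    int start, end;  /* begin/end positions of the element's region */
    int level;       /* depth in the document tree */
} Node;

/* a is an ancestor of d iff a's region strictly contains d's */
static int is_ancestor(const Node *a, const Node *d) {
    return a->start < d->start && d->end < a->end;
}

static int is_parent(const Node *a, const Node *d) {
    return is_ancestor(a, d) && a->level + 1 == d->level;
}

int main(void) {
    /* <book><title/><author><name/></author></book>, hypothetically: */
    Node book   = { "book",   1, 10, 0 };
    Node title  = { "title",  2,  3, 1 };
    Node author = { "author", 4,  9, 1 };
    Node name   = { "name",   5,  6, 2 };
    printf("book ancestor-of name?  %d\n", is_ancestor(&book, &name));  /* 1 */
    printf("book parent-of name?    %d\n", is_parent(&book, &name));    /* 0 */
    printf("author parent-of name?  %d\n", is_parent(&author, &name));  /* 1 */
    printf("title ancestor-of name? %d\n", is_ancestor(&title, &name)); /* 0 */
    return 0;
}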

Page 23: Sensor Data Management and XML Data Management

Query Processing for XML

Why is optimization harder?
Hierarchy means many more joins (conceptually): a “traverse”, “tree-match”, “x-scan”, “unnest”, or “path” operator (a structural join sketch follows this slide)
Though these are typically parent-child relationships
Often don’t have a good measure of “fan-out”
More ways of optimizing this

Order preservation limits processing in many ways
Nested content ~ left outer join
Except that we need to cluster a collection with the parent
Relationship with the NF2 approach

Tags don’t really add much complexity, except in trying to encode them efficiently

Complex functions and recursion: few real DB systems implement these fully

Why is storage harder? That’s the focus of Natix, really
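As an example of the extra "joins" that hierarchy introduces, here is a naive structural join over interval-encoded nodes, reusing the (start, end) encoding sketched earlier. Real engines use inputs sorted by start position plus a stack to do this in one pass; the data below is illustrative.

#include <stdio.h>

typedef struct { int id; int start, end; } Region;

/* Emit (ancestor, descendant) pairs where the ancestor's region
   contains the descendant's.  Naive nested-loop version. */
static void structural_join(const Region *anc, int na,
                            const Region *desc, int nd) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nd; j++)
            if (anc[i].start < desc[j].start && desc[j].end < anc[i].end)
                printf("ancestor %d contains descendant %d\n",
                       anc[i].id, desc[j].id);
}

int main(void) {
    /* e.g., all <section> elements vs. all <figure> elements */
    Region sections[] = { {1, 1, 20}, {2, 21, 40} };
    Region figures[]  = { {10, 5, 6}, {11, 25, 26}, {12, 45, 46} };
    structural_join(sections, 2, figures, 3);
    return 0;
}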

Page 24: Sensor Data Management and XML Data Management

The Natix System

In contrast to many pieces of work on XML, Natix focuses on the bottom layers, equivalent to System R’s RSS:
Physical layout
Indexing
Locking/concurrency control
Logging/recovery

Page 25: Sensor Data Management and XML Data Management

Physical Layout

What are our options in storing XML trees?
At some level, it’s all smoke and mirrors: we need to map to “flat” byte sequences on disk

But there are several options:
Shred completely, as in many RDBMS mappings
Each path may get its own contiguous set of pages, e.g., vectorized XML [Buneman et al.]
An element may get its 1:1 children, e.g., shared inlining [Shanmugasundaram+] and [Chen+]
All content may be in one table, e.g., [Florescu/Kossmann] and most interval-encoded XML
We may embed a few items on the same page and “overflow” the rest, as collections are often stored in ORDBMSs
We may try to cluster XML trees on the same page, as “interpreted BLOBs”: this is Natix’s approach (and also IBM DB2’s)

Pros and cons of these approaches?

Page 26: Sensor Data Management and XML Data Management

Challenges of the Page-per-Tree Approach

How big a tree? What happens if the XML overflows the page?

Natix claims an adaptive approach to choosing the tree’s granularity
Primarily based on balancing the tree and on constraints on children that must appear with a parent
What other possibilities make sense?

Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages

Page 27: Sensor Data Management and XML Data Management

Example

[Figure: an XML tree split across pages; the split point is in the parent page, with “proxy” nodes standing in for subtrees moved to other pages]
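One way to picture the proxy nodes from the figure: when a subtree is moved to another page, the parent page keeps a small record naming the target page and slot. A minimal sketch; the record layout and field names are assumptions, not Natix's actual format.

#include <stdio.h>
#include <stdint.h>

typedef enum { NODE_ELEMENT, NODE_TEXT, NODE_PROXY } NodeKind;

typedef struct {
    NodeKind kind;
    union {
        struct {                  /* ordinary node stored in this page */
            uint16_t tag_id;      /* interned element name */
            uint16_t first_child; /* slot of first child in this page */
        } local;
        struct {                  /* proxy: subtree lives elsewhere */
            uint32_t page_no;     /* page holding the moved subtree */
            uint16_t slot_no;     /* record slot of the subtree root */
        } proxy;
    } u;
} NodeRec;

int main(void) {
    /* Following a child pointer either stays within the page (local)
       or requires fetching another page (proxy), which is where the
       B+ Tree-like splitting scheme pays off. */
    NodeRec r = { .kind = NODE_PROXY, .u.proxy = { 42, 7 } };
    if (r.kind == NODE_PROXY)
        printf("follow proxy to page %u, slot %u\n",
               (unsigned)r.u.proxy.page_no, (unsigned)r.u.proxy.slot_no);
    return 0;
}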

Page 28: Sensor Data Management and XML Data Management

That Was Simple – But What about Updates?

Clearly, insertions and deletions can affect things:
Deletion may ultimately require us to rebalance
Ditto with insertion

But insertion may also make us run out of space. What to do?
Their approach: add another page; ultimately we may need to split at multiple levels, as in a B+ Tree

Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order

Page 29: Sensor Data Management and XML Data Management

Does this Help?

According to general lore, yes
The Natix experiments in this paper were limited in their query and adaptivity loads
But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach

Page 30: Sensor Data Management and XML Data Management

There’s More to Updates than the Pages

What about concurrency control and recovery?

We already have a notion of hierarchical locks, but they claim:
If we want to support IDREF traversal and indexing directly to nodes, we need more
What’s the idea behind SPP locking?

Page 31: Sensor Data Management and XML Data Management

Logging

They claim ARIES needs some modifications – why?

Their changes:
Need to make subtree updates more efficient: we don’t want to write a log entry for each subtree insertion
Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to the WAL
“Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree (sketched below)
A few minor tweaks to minimize undo/redo when only one transaction touches a page
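A minimal sketch of the annihilator idea during undo: records inside a subtree whose creation the transaction will undo anyway can be skipped, since deleting the whole subtree subsumes them. The log format and the two-pass structure here are illustrative assumptions, not the paper's algorithm as written.

#include <stdio.h>
#include <stdbool.h>

#define MAX_SUBTREES 16

typedef struct {
    long lsn;
    int subtree_id;      /* which subtree this record touched */
    bool creates_subtree;
} LogRec;

static void undo_transaction(const LogRec *log, int n) {
    /* Pass 1: find subtrees this transaction itself created. */
    bool created[MAX_SUBTREES] = { false };
    for (int i = 0; i < n; i++)
        if (log[i].creates_subtree)
            created[log[i].subtree_id] = true;

    /* Pass 2: undo newest-to-oldest.  Records inside a subtree whose
       creation we will undo anyway are annihilated (skipped); undoing
       the creation itself drops the whole subtree in one step. */
    for (int i = n - 1; i >= 0; i--) {
        if (!log[i].creates_subtree && created[log[i].subtree_id])
            printf("lsn %ld: skipped (annihilated)\n", log[i].lsn);
        else if (log[i].creates_subtree)
            printf("lsn %ld: undo creation (drop subtree %d)\n",
                   log[i].lsn, log[i].subtree_id);
        else
            printf("lsn %ld: normal undo\n", log[i].lsn);
    }
}

int main(void) {
    LogRec log[] = {
        { 100, 1, true  },  /* create subtree 1 */
        { 101, 1, false },  /* insert under subtree 1 */
        { 102, 2, false },  /* update in a pre-existing subtree */
        { 103, 1, false },  /* another update under subtree 1 */
    };
    undo_transaction(log, 4);
    return 0;
}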

Page 32: Sensor Data Management and XML Data Management

32

Annihilators

Page 33: Sensor Data Management and XML Data Management

Assessment

Native XML storage isn’t really all that different from other means of storage
There are probably some good reasons to make a few tweaks in locking
Optimization remains harder

A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking