Probabilistic Databases

Amol Deshpande, University of Maryland

Overview

V.S. Subrahmanian: ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates

Lise Getoor: Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution

Amol: MauveDB: Statistical Modeling in Databases, Correlated tuples in probabilistic databases

Overview of Today’s Presentation

Model-based Views/MauveDB [Amol]

- Goal: making it easy to continuously apply statistical models to streaming data

- Current focus on designing declarative interfaces, and on efficient maintenance algorithms

- Less on the “probabilistic databases” issues

Statistical Relational Learning [Lise]

Representing arbitrarily correlated data and processing queries over it [Prithviraj]

Motivation

Unprecedented, and rapidly increasing, instrumentation of our everyday world

Huge data volumes generated continuously that must be processed in real time

Typically imprecise, unreliable and incomplete data

- Measurement noise, low success rates, failures, etc.

Examples: wireless sensor networks, RFID, distributed measurement networks (e.g. GPS), industrial monitoring

Data Processing Step 1

Process data using a statistical/probabilistic model

- Regression and interpolation models: to eliminate spatial or temporal biases, handle missing data, prediction

- Filtering techniques (e.g. Kalman Filters), Bayesian Networks: to eliminate measurement noise, to infer hidden variables, etc.

[Figures: regression/interpolation models (temperature monitoring); Kalman Filters etc. (GPS data)]
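To make the filtering bullet concrete, here is a minimal sketch of a one-dimensional Kalman filter applied to noisy temperature-like readings. It assumes a random-walk state model; the function name `kalman_1d`, the variances, and the sample readings are illustrative assumptions and do not come from the talk.

```python
def kalman_1d(observations, process_var, obs_var, x0=0.0, p0=1.0):
    """Filter noisy scalar readings, assuming the true value follows a random walk."""
    x, p = x0, p0          # current estimate and its variance
    estimates = []
    for z in observations:
        # Predict: a random walk keeps the estimate but grows its uncertainty.
        p = p + process_var
        # Update: blend the prediction with the new reading via the Kalman gain.
        k = p / (p + obs_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical readings with one outlier (35.0); all values are made up.
readings = [20.3, 19.8, 20.9, 20.1, 35.0, 20.2]
smoothed = kalman_1d(readings, process_var=0.01, obs_var=1.0, x0=readings[0])
```

With a small process variance relative to the observation variance, the outlier pulls the estimate only part of the way toward 35, which is the noise-elimination effect the slide refers to.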

A Motivating Example

Inferring “transportation mode”/“activities” [Henry Kautz et al]

- Using easily obtainable sensor data, e.g. GPS, RFID proximity data

- Can do much if we can infer these automatically

[Figure: a path between home and office]

Have access to noisy “GPS” data

Infer the transportation mode: walking, running, in a car, in a bus

Preferred end result: Clean path annotated with transportation mode

Dynamic Bayesian Network

Use a “generative model” for describing how the observations were generated

[Figure: one time slice of the network, Time = t, with nodes Mt, Xt and Ot]

Mt: Transportation mode (Walking, Running, Car, Bus)

Xt: True velocity and location

Ot: Observed location

Need conditional probability distributions, e.g. a distribution on (velocity, location) given the transportation mode

- Prior knowledge or learned from data


[Figure: the network unrolled over two time slices, Time = t (Mt, Xt, Ot) and Time = t+1 (Mt+1, Xt+1, Ot+1); Ot and Ot+1 are the observed locations]

Given a sequence of observations (Ot), find the most likely Mt’s that explain it, or provide a probability distribution over the possible Mt’s.
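Both inference tasks can be sketched on a deliberately simplified model. The code below assumes a plain hidden Markov model: the continuous state Xt is dropped, each observation is a discretized speed ("slow" or "fast"), and only two of the four modes are kept. All probabilities are hypothetical values chosen for illustration, not parameters from the talk.

```python
MODES = ["Walking", "Car"]

# Hypothetical parameters, for illustration only.
PRIOR = {"Walking": 0.5, "Car": 0.5}
TRANSITION = {  # P(M_{t+1} | M_t)
    "Walking": {"Walking": 0.9, "Car": 0.1},
    "Car": {"Walking": 0.1, "Car": 0.9},
}
EMISSION = {  # P(O_t | M_t), with O_t a discretized speed
    "Walking": {"slow": 0.8, "fast": 0.2},
    "Car": {"slow": 0.3, "fast": 0.7},
}


def forward_filter(observations):
    """Return the distribution P(M_t | O_1..O_t) for every time step t."""
    beliefs = []
    belief = dict(PRIOR)
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous belief through the transition model.
            belief = {m: sum(belief[p] * TRANSITION[p][m] for p in MODES)
                      for m in MODES}
        # Update: weight by the likelihood of the observation, then normalize.
        unnorm = {m: belief[m] * EMISSION[m][obs] for m in MODES}
        total = sum(unnorm.values())
        belief = {m: v / total for m, v in unnorm.items()}
        beliefs.append(belief)
    return beliefs


def viterbi(observations):
    """Return the single most likely sequence of modes."""
    score = {m: PRIOR[m] * EMISSION[m][observations[0]] for m in MODES}
    back = []
    for obs in observations[1:]:
        prev, pointers, score = score, {}, {}
        for m in MODES:
            # Best predecessor of mode m at this step.
            best = max(MODES, key=lambda p: prev[p] * TRANSITION[p][m])
            pointers[m] = best
            score[m] = prev[best] * TRANSITION[best][m] * EMISSION[m][obs]
        back.append(pointers)
    path = [max(MODES, key=lambda m: score[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


observed = ["slow", "slow", "fast", "fast", "fast"]
beliefs = forward_filter(observed)   # one distribution over modes per time step
path = viterbi(observed)             # one most-likely mode per time step
```

`forward_filter` corresponds to providing a probability distribution on the possible Mt’s, while `viterbi` corresponds to finding the most likely Mt’s. A full DBN with the continuous Xt would need a different inference method than this discrete sketch.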


Statistical Modeling of Sensor Data

No support in database systems --> Database ends up being used as a backing store

- With much replication of functionality

- Very inefficient, not declarative…

How can we push statistical modeling inside a database system?

Abstraction: Model-based Views

An abstraction analogous to traditional database views

Present the output of applying the model as a database view

- That the user can query as with normal database views

Example DBN View

Original noisy GPS data:

User  Time    Location
John  5pm     (x1, y1)
John  5:05pm  (x2, y2)

User view of the data (smoothed locations, inferred variables):

User  Time    Location    Mode     prob
John  5pm     (x’1, y’1)  Walking  0.9
John  5pm     (x’1, y’1)  Car      0.1
John  5:05pm  (x’2, y’2)  Walking  0
John  5:05pm  (x’2, y’2)  Car      1

The user queries the view, e.g. select count(*) group by mode sliding window 5 minutes

Application of the model/inference is pushed inside the database

- Opens up many optimization opportunities

- e.g. can do inference lazily when queried etc.
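As a rough illustration of querying such a view, the sketch below loads the four example rows into an in-memory SQLite table and computes an expected count per mode by summing the probabilities. Reading count(*) as an expected count is an assumption made here, as are the table name `dbn_view` and its column names; the sliding-window clause is omitted because it is not standard SQL. This is not MauveDB's interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring the example view.
conn.execute(
    "CREATE TABLE dbn_view ("
    "user_name TEXT, obs_time TEXT, location TEXT, mode TEXT, prob REAL)"
)
rows = [
    ("John", "5pm", "(x'1, y'1)", "Walking", 0.9),
    ("John", "5pm", "(x'1, y'1)", "Car", 0.1),
    ("John", "5:05pm", "(x'2, y'2)", "Walking", 0.0),
    ("John", "5:05pm", "(x'2, y'2)", "Car", 1.0),
]
conn.executemany("INSERT INTO dbn_view VALUES (?, ?, ?, ?, ?)", rows)

# Expected number of tuples per mode: by linearity of expectation this is
# the sum of the tuples' marginal probabilities, whatever their correlations.
expected_counts = dict(
    conn.execute("SELECT mode, SUM(prob) FROM dbn_view GROUP BY mode").fetchall()
)
conn.close()
```

The expected count needs only the marginals, which is why a plain SUM suffices. Queries that ask for the probability of a combination of tuples would also need the correlations between them.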

Correlations

User  Time    Location    Mode     prob
John  5pm     (x’1, y’1)  Walking  0.9
John  5pm     (x’1, y’1)  Car      0.1
John  5:05pm  (x’2, y’2)  Walking  0
John  5:05pm  (x’2, y’2)  Car      1

Strong and complex correlations across tuples

- Mutual exclusivity

- Temporal correlations
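A small possible-worlds calculation shows why mutual exclusivity matters. The sketch takes the two 5pm tuples from the example and computes the probability that John has some mode at 5pm, once treating the tuples as independent and once as mutually exclusive. The assumption that exactly one mode holds in each possible world is inferred from the probabilities summing to 1, and the variable names are illustrative.

```python
from itertools import product

# Marginal probabilities of John's two 5pm tuples, from the example view.
p_walking, p_car = 0.9, 0.1

# Tuple-independent reading: each tuple is present or absent on its own,
# so we enumerate all four possible worlds.
independent = 0.0
for has_walking, has_car in product([True, False], repeat=2):
    world_prob = ((p_walking if has_walking else 1 - p_walking)
                  * (p_car if has_car else 1 - p_car))
    if has_walking or has_car:
        independent += world_prob

# Mutually exclusive reading: exactly one of the two tuples is present in
# each possible world, so the probabilities of the alternatives add up.
exclusive = p_walking + p_car
```

Under independence the answer comes out as 0.91, because the model admits a world in which John has no mode at all and one in which he has both. Under mutual exclusivity it is 1.0.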

MauveDB: Status

Written in the Apache Derby Java open source database system

Support for Regression- and Interpolation-based views

- Neither produces probabilistic data

- SIGMOD 2006 (w/ Sam Madden)

Currently building support for views based on Dynamic Bayesian networks [Bhargav]

- Kalman Filters, HMMs, etc.

- Initial focus on the user interfaces and efficient inference

- Will generate probabilistic data; may not be able to do anything too sophisticated with it

Research Challenges/Future Work

Generalizing to arbitrary models?

- Develop APIs for adding arbitrary models

- Try to minimize the work of the model developer

Probabilistic databases

- Uncertain data with complex correlation patterns

- Query processing, query optimization

View maintenance in presence of high-rate measurement streams

Thanks !!

Mauve == Model-based User Views
