probabilistic databases

18
Probabilistic Databases Amol Deshpande, University of Maryland

Upload: dorcas

Post on 08-Jan-2016

48 views

Category:

Documents


1 download

DESCRIPTION

Probabilistic Databases. Amol Deshpande, University of Maryland. Overview. V.S. Subrahmanian ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates Lise Getoor Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution Amol - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Probabilistic Databases

Probabilistic Databases

Amol Deshpande, University of Maryland

Page 2: Probabilistic Databases

Overview

V.S. Subrahmanian ProbView, PXML, Temporal Probabilistic

Databases, Probabilistic Aggregates

Lise Getoor Statistical Relational Learning, Probabilistic

Relational Models, Entity Resolution

Amol MauveDB: Statistical Modeling in Databases,

Correlated tuples in probabilistic databases

Page 3: Probabilistic Databases

Overview of Today’s Presentation

Model-based Views/MauveDB [Amol]

Statistical Relational Learning [Lise]

Representing arbitrarily correlated data and processing queries over it [Prithviraj]

Page 4: Probabilistic Databases

Overview of Today’s Presentation

Model-based Views/MauveDB [Amol] Goal: Making it easy to continuously apply statistical models to

streaming data Current focus on designing declarative interfaces, and on

efficient maintenance algorithms Less on the “probabilistic databases” issues

Statistical Relational Learning [Lise]

Representing arbitrarily correlated data and processing queries over it [Prithviraj]

Page 5: Probabilistic Databases

Motivation

Unprecedented, and rapidly increasing,

instrumentation of our every-day world

Huge data volumes generated continuously

that must be processed in real-time

Typically imprecise, unreliable and incomplete

data

Measurement noises, low success rates,

failures etc…

Wireless sensor networks

RFID

Distributed measurementnetworks (e.g. GPS)

Industrial Monitoring

Page 6: Probabilistic Databases

Data Processing Step 1

Process data using a statistical/probabilistic model Regression and interpolation models

To eliminate spatial or temporal biases, handle missing data, prediction Filtering techniques (e.g. Kalman Filters), Bayesian Networks

To eliminate measurement noise, to infer hidden variables etc

Regression/interpolation models

Temperature monitoring

Kalman Filters et

GPS Data

Page 7: Probabilistic Databases

A Motivating Example

Inferring “transportation mode”/ “activities” [Henry Kautz et al] Using easily obtainable sensor data, e.g. GPS, RFID proximity

data Can do much if we can infer these automatically

officehome

Have access to noisy “GPS” dataInfer the transportation mode: walking, running, in a car, in a bus

Page 8: Probabilistic Databases

Motivating Example

Inferring “transportation mode”/ “activities” [Henry Kautz et al] Using easily obtainable sensor data, e.g. GPS, RFID proximity

data Can do much if we can infer these automatically

officehome

Preferred end result: Clean path annotated with transportation mode

Page 9: Probabilistic Databases

Dynamic Bayesian Network

Use a “generative model” for describing how the observations were generated

Time = t

Mt

Xt

Ot

Transportation Mode: Walking, Running, Car, Bus

True velocity and location

Observed location

Need conditional probability distributions e.g. a distribution on (velocity, location) given the transportation mode

Prior knowledge or learned fromdata

Page 10: Probabilistic Databases

Dynamic Bayesian Network

Use a “generative model” for describing how the observations were generated

Time = t

Mt

Xt

Ot

Transportation Mode: Walking, Running, Car, Bus

True velocity and location

Observed location

Time = t+1

Mt+1

Xt+1

Ot+1

Page 11: Probabilistic Databases

Dynamic Bayesian Network

Given a sequence of observations (Ot), find the most likely Mt’s that explain it.Or could provide a probability distribution on the possible Mt’s.

Time = t

Mt

Xt

Ot

Transportation Mode: Walking, Running, Car, Bus

True velocity and location

Observed location

Time = t+1

Mt+1

Xt+1

Ot+1

Page 12: Probabilistic Databases

Statistical Modeling of Sensor Data

No support in database systems --> Database

ends up being used as a backing store With much replication of functionality

Very inefficient, not declarative…

How can we push statistical modeling inside a

database system ?

Page 13: Probabilistic Databases

Abstraction: Model-based Views

An abstraction analogous to traditional database views

Present the output of the application of model as a database view That the user can query as with normal database

views

Page 14: Probabilistic Databases

Example DBN View

User Time Location Mode prob

John 5pm (x’1, y’1) Walking 0.9

John 5pm (x’1, y’1) Car 0.1

John 5:05pm (x’2, y’2) Walking 0

John 5:05pm (x’2, y’2) Car 1

User Time Location

John 5pm (x1, y1)

John 5:05pm (x2, y2)Original noisy GPS data

User view of the data - Smoothed locations - Inferred variables

User

e.g. select count(*) group by mode sliding window 5 minutes

Application of the model/inference is pushed inside the databaseOpens up many optimization opportunitiese.g. can do inference lazily when queried etc

Page 15: Probabilistic Databases

Correlations

User Time Location Mode prob

John 5pm (x’1, y’1) Walking 0.9

John 5pm (x’1, y’1) Car 0.1

John 5:05pm (x’2, y’2) Walking 0

John 5:05pm (x’2, y’2) Car 1

User

Strong and complex correlations across tuples

- Mutual exclusivity

- Temporal correlations

Page 16: Probabilistic Databases

MauveDB: Status

Written in the Apache Derby Java open source

database system

Support for Regression- and Interpolation-based views Neither produce probabilistic data

SIGMOD 2006 (w/ Sam Madden)

Currently building support for views based on Dynamic

Bayesian networks [Bhargav] Kalman Filters, HMMs etc

Initial focus on the user interfaces and efficient inference

Will generate probabilistic data; may not be able to do anything

too sophisticated with it

Page 17: Probabilistic Databases

Research Challenges/Future Work

Generalizing to arbitrary models ? Develop APIs for adding arbitrary models

Try to minimize the work of the model developer

Probabilistic databases Uncertain data with complex correlation patterns

Query processing, query optimization

View maintenance in presence of high-rate

measurement streams

Page 18: Probabilistic Databases

Thanks !!

Mauve == Model-based User Views