
OLAP Over Uncertain and Imprecise Data

Adapted from a talk by T.S. Jayram (IBM Almaden)

with Doug Burdick (Wisconsin), Prasad Deshpande (IBM), Raghu Ramakrishnan (Wisconsin), Shivakumar Vaithyanathan (IBM)

Adapted by S. Sudarshan

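Dimensions in OLAP

[Figure: two dimension hierarchies. Location: ALL > {East, West}, with leaf members MA, NY, TX, CA. Automobile: ALL > {Truck, Sedan}, with leaf members F150, Sierra, Civic, Camry.]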

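Measures, Facts, and Queries

[Figure: grid of cells over Location (MA, NY, TX, CA; East, West; ALL) and Automobile (Truck, Sedan; ALL), holding facts p1 through p8. One grid square is labeled as a cell. Example fact: Auto = F150, Loc = NY, Repair = $200. Example query: Auto = Truck, Loc = East, SUM(Repair) = ?]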

Restriction on Imprecision

We restrict the sets of values in an imprecise fact to either:

1. A singleton set consisting of a leaf level member of the hierarchy, or,

2. The set of all the leaf level members under some non-leaf level member of the hierarchy.

Cells and Regions

A region is a vector of attribute values, one from the imprecise domain of each dimension of the cube.

A cell is a region in which all values are leaf level members.

Let reg(R) represent the set of cells in a region R.

Queries on precise data

A query Q = (R, M, A) refers to a region R, a measure M, and an aggregate function A. E.g.: (<Ambassador, Location>, Repairs, Sum)

The result of the query in a precise database is obtained by applying A on the measure M of all cells in R.

For the example above, the result is (P1 + P2)
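The definitions of regions, cells, and precise queries can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two-level hierarchies are assumed from the dimension figure, and the function names (`leaves`, `reg`, `query_precise`) and all fact values except the F150/NY example are hypothetical.

```python
from itertools import product

# Assumed two-level hierarchies: each non-leaf member maps to its leaves.
LOCATION = {"East": ["MA", "NY"], "West": ["TX", "CA"]}
AUTOMOBILE = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}


def leaves(hierarchy, member):
    """Return the leaf members covered by `member` (a leaf covers itself)."""
    if member == "ALL":
        return [leaf for group in hierarchy.values() for leaf in group]
    return hierarchy.get(member, [member])


def reg(region):
    """Enumerate the cells (auto, loc) contained in a region."""
    auto, loc = region
    return set(product(leaves(AUTOMOBILE, auto), leaves(LOCATION, loc)))


def query_precise(facts, region, aggregate):
    """Apply `aggregate` to the measures of all precise facts in the region."""
    cells = reg(region)
    return aggregate([m for (auto, loc, m) in facts if (auto, loc) in cells])


# Illustrative precise facts (auto, loc, repair); values are made up
# except for the F150/NY/$200 fact shown on the slide.
facts = [("F150", "NY", 200), ("Sierra", "MA", 300), ("Civic", "TX", 50)]
print(query_precise(facts, ("Truck", "East"), sum))  # 500
```

A cell such as ("F150", "NY") is simply a region whose `reg` contains exactly one element.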

Extend the OLAP model to handle data ambiguity

Imprecision

Uncertainty


Imprecision

[Figure: the Location by Automobile grid with facts p1 through p11. Example imprecise fact: Auto = F150, Loc = East, Repair = $200.]

Representing Imprecision using Dimension Hierarchies

Dimension hierarchies lead to a natural space of “partially specified” objects

Sources of imprecision: incomplete data, multiple sources of data

Motivating Example

[Figure: Truck (Sierra, F150) by East (MA, NY) grid with facts p1 through p5.]

We propose desiderata that enable appropriate definition of query semantics for imprecise data


Query: COUNT

Queries on imprecise data

Consider the query region <Pune, Model> in the figure. It overlaps two imprecise facts P4 and P5.

Three (naive) options for including a fact in a query:

Contains: consider only if contained in query

Overlaps: consider if overlapping query

None: ignore all imprecise facts
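The three naive options can be compared on a small example. The sketch below is illustrative only: the fact layout is borrowed from the Sierra/F150 by MA/NY example used elsewhere in this deck (not the <Pune, Model> figure), and `count_naive` is a hypothetical helper name.

```python
# Each fact maps to the set of cells it may occupy; one cell means precise.
FACTS = {
    "p1": {("F150", "NY")},
    "p2": {("Sierra", "NY")},
    "p3": {("F150", "MA"), ("F150", "NY")},    # imprecise in Location
    "p4": {("Sierra", "MA")},
    "p5": {("F150", "MA"), ("Sierra", "MA")},  # imprecise in Automobile
}


def count_naive(facts, query_cells, option):
    """COUNT over a query region under one of the three naive options."""
    total = 0
    for cells in facts.values():
        precise = len(cells) == 1
        if option == "none":
            include = precise and cells <= query_cells
        elif option == "contains":
            include = cells <= query_cells
        elif option == "overlaps":
            include = bool(cells & query_cells)
        else:
            raise ValueError(f"unknown option: {option}")
        if include:
            total += 1
    return total


f150_east = {("F150", "MA"), ("F150", "NY")}
for option in ("none", "contains", "overlaps"):
    print(option, count_naive(FACTS, f150_east, option))
# none 1, contains 2, overlaps 3
```

The same query gets three different answers, which is what motivates stating explicit desiderata for the semantics.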

Desideratum I: Consistency

Consistency specifies the relationship between answers to related queries on a fixed data set

[Figure: Truck (Sierra, F150) by East (MA, NY) grid with facts p1 through p5.]

Notions of Consistency

Generic idea: if a query region is partitioned and the aggregate is applied on each partition, then the aggregate q on the whole region must be consistent, in some way, with the aggregates qi on the partitions

General idea: alpha consistency for property alpha

Specific forms of consistency discussed in detail in paper

Sum consistency (for count/sum)

Boundedness consistency (for average)

Contains option : Consistency

Intuitively, consistency means that the answer to a query should be consistent with the aggregates from individual partitions of the query.

Using the Contains option could give rise to inconsistent results.

For example, consider the sum aggregate of the query above and that of its individual cells. With the Contains option, will the individual results add up to be the same as the collective?
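The question can be checked mechanically on a toy example. This sketch uses COUNT and a hypothetical three-fact layout; `is_sum_consistent` is an illustrative helper that compares the answer on a region against the sum of answers on a partition of that region.

```python
# Hypothetical layout: each fact is the set of cells it may occupy.
FACTS = [
    {("F150", "NY")},                    # p1, precise
    {("F150", "MA"), ("F150", "NY")},    # p3, imprecise in Location
    {("F150", "MA"), ("Sierra", "MA")},  # p5, imprecise in Automobile
]


def count_contains(facts, query):
    """Count facts whose cells all lie inside the query region."""
    return sum(1 for cells in facts if cells <= query)


def count_overlaps(facts, query):
    """Count facts that share at least one cell with the query region."""
    return sum(1 for cells in facts if cells & query)


def is_sum_consistent(count_fn, facts, partitions):
    """True if the count on the whole region equals the sum over the parts."""
    whole = set().union(*partitions)
    parts_total = sum(count_fn(facts, part) for part in partitions)
    return count_fn(facts, whole) == parts_total


PARTS = [{("F150", "MA")}, {("F150", "NY")}]
print(is_sum_consistent(count_contains, FACTS, PARTS))  # False
print(is_sum_consistent(count_overlaps, FACTS, PARTS))  # False
```

On this data Contains undercounts on the parts (1 versus 2 on the whole) and Overlaps overcounts (4 versus 3), so neither option adds up.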

Desideratum II: Faithfulness

Faithfulness specifies the relationship between answers to a fixed query on related data sets

Notion of result quality relative to the quality of the data input to the query.

– For example, the answer computed for Q = <F150, MA> should be of higher quality if p3 were precisely known.

[Figure: three versions of the Sierra/F150 by MA/NY grid with facts p1 through p5, labeled Data Set 1, Data Set 2, and Data Set 3.]

Formal definitions of both Consistency and Faithfulness depend on the underlying aggregation operator

Can we define query semantics that satisfy these desiderata?


[Figure: Sierra/F150 by MA/NY grid with facts p1 through p5.]

Query Semantics

Possible Worlds [Kripke63,…]

[Figure: four possible worlds w1, w2, w3, w4 over the Sierra/F150 by MA/NY grid, each containing facts p1 through p5.]

Possible Worlds Query Semantics

Given all possible worlds together with their probabilities, queries are easily answered (using expected values)

But number of possible worlds is exponential!
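A brute-force version of this semantics is easy to write for tiny inputs and makes the exponential blow-up concrete. The sketch below takes its facts and weights from the extended data model table in this deck; treating the imprecise facts as independent, and the helper names, are assumptions made for illustration.

```python
from itertools import product

# Each fact: (measure, {cell: probability of that completion}).
FACTS = {
    "p1": (100, {("F150", "NY"): 1.0}),
    "p2": (500, {("Sierra", "NY"): 1.0}),
    "p3": (150, {("F150", "MA"): 0.6, ("F150", "NY"): 0.4}),
    "p4": (200, {("Sierra", "MA"): 1.0}),
    "p5": (100, {("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5}),
}


def possible_worlds(facts):
    """Yield (probability, {fact: cell}) for every completion of the facts."""
    names = list(facts)
    choices = [list(facts[name][1].items()) for name in names]
    for combo in product(*choices):
        prob = 1.0
        world = {}
        for name, (cell, p) in zip(names, combo):
            prob *= p  # independence across facts is an assumption
            world[name] = cell
        yield prob, world


def expected_sum(facts, query_cells):
    """Expected SUM of the measure over the query region, across all worlds."""
    total = 0.0
    for prob, world in possible_worlds(facts):
        in_query = (facts[n][0] for n, cell in world.items() if cell in query_cells)
        total += prob * sum(in_query)
    return total


print(len(list(possible_worlds(FACTS))))  # 4
print(expected_sum(FACTS, {("F150", "MA")}))  # approximately 140
```

With k imprecise facts that each have two completions there are 2^k worlds, so this enumeration is only viable as a reference for checking faster methods.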

Allocation

Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data

Size increase is linear in number of (completions of) imprecise facts

Queries operate over this extended version

Key contributions:

Appropriate characterization of the large space of allocation policies

Designing efficient allocation policies that take into account the correlations in the data

Storing Allocations using Extended Data Model

[Figure: Truck (Sierra, F150) by East (MA, NY) grid with facts p1 through p5.]

ID  FactID  Auto    Loc  Repair  Weight
1   1       F150    NY   100     1.0
2   2       Sierra  NY   500     1.0
3   3       F150    MA   150     0.6
4   3       F150    NY   150     0.4
5   4       Sierra  MA   200     1.0
6   5       F150    MA   100     0.5
7   5       Sierra  MA   100     0.5

Advantages of EDM

No extra infrastructure required for representing imprecision

Efficient algorithms for aggregate queries:

SUM and COUNT: linear time algorithm.

AVERAGE: slightly complicated algorithm running in O(m + n^3) for m precise facts and n imprecise facts.
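For SUM and COUNT, a single pass over the extended rows suffices: each row contributes its measure (or 1) scaled by its weight. The sketch below uses the rows of the extended data model table; `edm_sum_count` is a hypothetical name and the tuple layout is an assumption.

```python
# Rows of the extended data model: (fact_id, auto, loc, repair, weight).
EDM = [
    (1, "F150", "NY", 100, 1.0),
    (2, "Sierra", "NY", 500, 1.0),
    (3, "F150", "MA", 150, 0.6),
    (3, "F150", "NY", 150, 0.4),
    (4, "Sierra", "MA", 200, 1.0),
    (5, "F150", "MA", 100, 0.5),
    (5, "Sierra", "MA", 100, 0.5),
]


def edm_sum_count(rows, query_cells):
    """One pass over the rows: weighted SUM and COUNT for the query region."""
    total, count = 0.0, 0.0
    for _fact_id, auto, loc, repair, weight in rows:
        if (auto, loc) in query_cells:
            total += weight * repair
            count += weight
    return total, count


total, count = edm_sum_count(EDM, {("F150", "MA")})
print(total, count)  # approximately 140 and 1.1
```

The running time is linear in the number of extended rows, which is itself linear in the number of completions of imprecise facts.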

Aggregating Uncertain Measures

Opinion pooling: provide a consensus opinion from a set of opinions Θ.

The opinions in Θ as well as the consensus opinion are represented as pdfs over a discrete domain O

Linear operator LinOp(Θ) produces a consensus pdf P that is a weighted linear combination of the pdfs in Θ.
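A linear opinion pool over a discrete domain can be sketched as follows. The requirement that the weights sum to 1 is the usual convention and is assumed here; the slide does not say how the weights are chosen. The function name `lin_op` and the example pdfs are illustrative.

```python
def lin_op(pdfs, weights):
    """Weighted linear combination of discrete pdfs (dicts outcome -> prob)."""
    if len(pdfs) != len(weights):
        raise ValueError("need one weight per pdf")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")  # assumed convention
    outcomes = set().union(*pdfs)
    return {
        o: sum(w * pdf.get(o, 0.0) for pdf, w in zip(pdfs, weights))
        for o in outcomes
    }


# Two hypothetical opinions about the measure Brake over {Yes, No}.
opinions = [{"Yes": 0.8, "No": 0.2}, {"Yes": 0.4, "No": 0.6}]
consensus = lin_op(opinions, [0.5, 0.5])
print(consensus)  # Yes is about 0.6, No is about 0.4
```

Because each input sums to 1 and the weights sum to 1, the consensus is again a valid pdf.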

Allocation Policies

For every region r in the database, we want to assign an allocation $p_{c,r}$ to each cell c in reg(r), such that

$\sum_{c \in \mathrm{reg}(r)} p_{c,r} = 1$

Three ways of doing so:

1. Uniform: Assign each cell c in a region r an equal probability.

$p_{c,r} = 1 / |\mathrm{reg}(r)|$

Allocation Policies



$\sum_{c \in \mathrm{reg}(r)} p_{c,r} = 1$

However, we can do better. Some cells may be naturally inclined to have more probability than others. E.g., Mumbai will clearly have more repairs than Bhopal. We can capture this automatically by giving more probability to cells with a higher number of precise facts.

2. Count based:

$p_{c,r} = N_c / \sum_{c' \in \mathrm{reg}(r)} N_{c'}$

where $N_c$ is the number of precise facts in cell c
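The uniform and count-based policies can be sketched side by side. Both return one allocation per cell of the region, summing to 1. The function names are hypothetical, and the fallback to uniform when a region contains no precise facts is an assumption, not something the slides specify.

```python
def uniform_allocation(region_cells):
    """Give every cell in the region the same allocation."""
    return {c: 1.0 / len(region_cells) for c in region_cells}


def count_allocation(region_cells, precise_counts):
    """Allocate in proportion to the number of precise facts in each cell."""
    total = sum(precise_counts.get(c, 0) for c in region_cells)
    if total == 0:
        # Assumed fallback: no precise evidence in the region.
        return uniform_allocation(region_cells)
    return {c: precise_counts.get(c, 0) / total for c in region_cells}


region = [("F150", "MA"), ("F150", "NY")]
counts = {("F150", "MA"): 1, ("F150", "NY"): 3}  # illustrative counts
print(uniform_allocation(region))        # 0.5 for each cell
print(count_allocation(region, counts))  # 0.25 for MA, 0.75 for NY
```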

Allocation Policies



$\sum_{c \in \mathrm{reg}(r)} p_{c,r} = 1$

Again, we can arguably get a better result by looking not just at the count, but at the actual value of the measure in question.

3. Measure based : next slide.

Measure Based Allocation

Assumes the following model: The given database D with imprecise facts has been generated by randomly injecting imprecision in a precise database D'.

D' assigns value o to a cell c according to some unknown pdf P(o, c).

If we could determine this pdf, the allocation is simply

$p_{c,r} = P(c) / \sum_{c' \in \mathrm{reg}(r)} P(c')$

Classifying Allocation Policies

                                 Measure correlation
                                 Ignored    Used
Dimension correlation  Ignored   Uniform
                       Used      Count      EM

Results on Query Semantics

Evaluating queries over extended version of data yields expected value of the aggregation operator over all possible worlds

intuitively, the correct value to compute

Efficient query evaluation algorithms for SUM, COUNT

consistency and faithfulness for SUM, COUNT are satisfied under appropriate conditions

Dynamic programming algorithm for AVERAGE

Unfortunately, consistency does not hold for AVERAGE

Alternative Semantics for AVERAGE

APPROXIMATE AVERAGE

E[SUM] / E[COUNT] instead of E[SUM/COUNT]

simpler and more efficient

satisfies consistency

extends to aggregation operators for uncertain measures
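The difference between the two definitions shows up even on three facts. The sketch below uses facts p1, p3, and p5 with the weights from the extended data model table and the query region <F150, East>; the brute-force enumeration for E[SUM/COUNT] stands in for the dynamic programming algorithm, which is not reproduced here. Function names are hypothetical.

```python
from itertools import product

FACTS = [  # (measure, {cell: allocation weight})
    (100, {("F150", "NY"): 1.0}),                       # p1
    (150, {("F150", "MA"): 0.6, ("F150", "NY"): 0.4}),  # p3
    (100, {("F150", "MA"): 0.5, ("Sierra", "MA"): 0.5}),  # p5
]
QUERY = {("F150", "MA"), ("F150", "NY")}


def approximate_average(facts, query):
    """E[SUM] / E[COUNT], computed in one pass over the allocations."""
    exp_sum = sum(m * w for m, alloc in facts for c, w in alloc.items() if c in query)
    exp_count = sum(w for _, alloc in facts for c, w in alloc.items() if c in query)
    return exp_sum / exp_count


def exact_expected_average(facts, query):
    """E[SUM/COUNT] by enumerating possible worlds (tiny inputs only)."""
    result = 0.0
    for combo in product(*(alloc.items() for _, alloc in facts)):
        prob = 1.0
        inside = []
        for (measure, _), (cell, w) in zip(facts, combo):
            prob *= w
            if cell in query:
                inside.append(measure)
        # Assumes every world has at least one fact in the query region.
        result += prob * sum(inside) / len(inside)
    return result


print(approximate_average(FACTS, QUERY))     # approximately 120
print(exact_expected_average(FACTS, QUERY))  # approximately 120.83
```

The two values are close but not equal, and only the first can be assembled from per-partition SUM and COUNT answers.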

Maximum Likelihood Principle

A reasonable estimate for this function P is the one that maximises the probability of generating the given imprecise data set D.

Example: Suppose the pdf depends only on the cells and is independent of the measure values. Thus, the pdf is a mapping P: C → ℝ, where C is the set of cells.

This pdf can be found by maximising the likelihood function:

$\mathcal{L}(P) = \prod_{r \in D} \sum_{c \in \mathrm{reg}(r)} P(c)$

EM Algorithm

The Expectation Maximization algorithm provides a standard way of maximizing the likelihood, when we have some unknown variables in the observation set.

Expectation step (compute data): Calculate the expected value of the unknown variables, given the current estimate of variables.

Maximization step (compute generator): Calculate the distribution that maximizes the probability of the current estimated data set.

EM Algorithm : Example

Initialization Step: Data: [4, 10, ?, ?]; Initial mean value: 0; New Data: [4, 10, 0, 0]

Step 1: New Mean: 3.5; New Data: [4, 10, 3.5, 3.5]

Step 2: New Mean: 5.25; New Data: [4, 10, 5.25, 5.25]

Result: New Mean: 6.890625
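The toy example can be replayed in code: the expectation step fills each missing value with the current mean, and the maximization step recomputes the mean from the completed data. The function name `em_mean` and its parameters are hypothetical; running six iterations reproduces the value 6.890625 shown as the result.

```python
def em_mean(observed, n_missing, initial_mean=0.0, steps=6):
    """Iterate the fill-in / re-estimate loop and return the mean per step."""
    mean = initial_mean
    history = []
    for _ in range(steps):
        # E-step: fill each missing value with the current mean estimate.
        data = list(observed) + [mean] * n_missing
        # M-step: re-estimate the mean from the completed data.
        mean = sum(data) / len(data)
        history.append(mean)
    return history


print(em_mean([4, 10], 2))
# [3.5, 5.25, 6.125, 6.5625, 6.78125, 6.890625]
```

Each step halves the distance to 7, the mean of the two observed values, so further iterations keep approaching 7.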



EM Algorithm : Application

Experiments : Allocation run time

Experiments : Query run time

Experiments : Accuracy

Uncertainty

Measure value is modeled as a probability distribution function over some base domain

e.g., measure Brake is a pdf over values {Yes, No}

sources of uncertainty: measures extracted from text using classifiers

Adapt well-known concepts from statistics to derive appropriate aggregation operators

Our framework and solutions for dealing with imprecision also extend to uncertain measures

Summary

Consistency and faithfulness desiderata for designing query semantics for imprecise data

Allocation is the key to our framework

Efficient algorithms for aggregation operators with appropriate guarantees of consistency and faithfulness

Iterative algorithms for allocation policies

Correlation-based Allocation

Involves defining an objective function to capture some underlying correlation structure

a more stringent requirement on the allocations

solving the resulting optimization problem yields the allocations

EM-based iterative allocation policy

interesting highlight: allocations are re-scaled iteratively by computing appropriate aggregations
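One way to picture the iterative re-scaling is the following sketch, which alternates between aggregating the expected number of facts per cell and re-normalizing each imprecise fact's allocations in proportion to that aggregate. It is an assumption-laden illustration in the spirit of a count-based EM policy; the exact objective function and update rule used by the authors may differ, and `em_allocation` is a hypothetical name.

```python
def em_allocation(precise_counts, imprecise_regions, iterations=50):
    """Iteratively re-scale allocations for imprecise facts.

    precise_counts: dict cell -> number of precise facts in that cell.
    imprecise_regions: list of lists of cells, one list per imprecise fact.
    Returns one dict cell -> allocation per imprecise fact.
    """
    # Start from uniform allocations.
    allocs = [{c: 1.0 / len(r) for c in r} for r in imprecise_regions]
    for _ in range(iterations):
        # Aggregation step: expected number of facts per cell.
        mass = dict(precise_counts)
        for alloc in allocs:
            for c, p in alloc.items():
                mass[c] = mass.get(c, 0.0) + p
        # Re-scaling step: allocate in proportion to the aggregated mass.
        allocs = [
            {c: mass[c] / sum(mass[c2] for c2 in r) for c in r}
            for r in imprecise_regions
        ]
    return allocs


# Illustrative input: one imprecise fact over two cells.
result = em_allocation({"a": 3, "b": 1}, [["a", "b"]])
print(result[0])  # converges to about 0.75 for "a" and 0.25 for "b"
```

Unlike the one-shot count-based policy, each imprecise fact here also influences the allocations of other imprecise facts that overlap the same cells.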
