Sean Kandel - Data Profiling: Assessing the Overall Content and Quality of a Data Set

Post on 15-Jul-2015


Agile Data Profiling

Sean Kandel

What’s in your data?

Opening Questions in the Data Lifecycle…

Unboxing: What’s in this data? Can I make use of it?

… Become Persistent Questions in the Data Lifecycle

What’s in this data? Can I make use of it?

Unboxing Transformation Analysis Visualization Productization

STRUCTURING · CLEANING · ENRICHMENT · DISTILLATION

“It’s easy to just think you know what you are doing and not look at data at every intermediary step. An analysis has 30 different steps. It’s tempting to just do this, then that, and then this. You have no idea in which ways you are wrong and what data is wrong.”

What’s in the data?

• The Expected: Models, Densities, Constraints

• The Unexpected: Residuals, Outliers, Anomalies

[Chart: Average Movie Ratings, with the expected pattern and unexpected outliers annotated]

Overview of all variables

Show relevant perspectives

What to compute?

• Densities and descriptive statistics

• Identify anomalies and outliers

How often to compute it?

Unboxing Transformation Analysis Visualization Productization

Challenge: Agility

• Profiling throughout the lifecycle

• Particularly important as you manipulate data

Design Space and Tradeoffs

Mapping out the Design Space

How much data to examine?

How accurate are the results?

How fast can you get them?

Mapping out the Design Space

Decide how your requirements fall on these axes

Find a strategy (if one exists) that fits the requirements

Accuracy

Urgency

Data Volume


Strategy vs Cost

Head-of-file

Unbox · Big Picture · Anomalies · Good Enough

Strategy vs Cost

Random Sample

Unbox · Big Picture · Anomalies · Good Enough

Strategy vs Cost

Scan, summarize, collect samples

Unbox · Big Picture · Anomalies · Good Enough

“Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”

— Data Analysis & Statistics, Tukey & Wilk, 1966

Technical Methods

Sanity Check: Is this really expensive?

• Computers are fast

• In-memory, column stores, OLAP, …

• Still, “Big Data” can be hard

• Big is sometimes really big

• Big data can be raw: no indexes or precomputed summaries

• Agility remains critical to harness the “informed human mind”

Two Useful Techniques

Sampling

• A variety of techniques available

Sketches

• One-pass memory-efficient structures for capturing distributions


Technique I: Sampling

Approaches to Sampling

• Scan-based access

• Head-of-file

• Bernoulli

• Reservoir

• Random I/O Sampling

• Block-level sampling

Head-of-File

• Pros:

• Very fast: small data, no disk seeks

• Absolutely required when unboxing raw data

• Nested data (JSON/XML), Text (logs, database dumps, etc.)

• Cons:

• Correlation of position and value

Bernoulli

• Take a full pass, flip a (weighted) coin for each record

• Pros:

• trivial to implement

• trivial to parallelize

• almost no memory required

• Cons:

• requires a full scan of the data

• output size proportional to input size, and random

from random import random
sample = filter(lambda x: random() < 0.01, data)

Reservoir

• Fix a “reservoir” of k items. For each subsequent item, with some probability eject an old item for the new one

• Pros:

• trivial to implement

• easy to parallelize

• constant memory required

• fixed-size output — need not know input size in advance

• Cons:

• Requires a full scan of the data

• Requires O(sample_size) memory


from random import random, randint

res = data[0:k]  # initialize: first k items
counter = k
for x in data[k:]:
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x
    counter += 1


Meta-Strategy: Stratified Sampling

• Sometimes you need representative samples from each “group”

• Coverage: e.g., displaying examples for every state in a map

• Robustness: e.g., consider average income

• if you miss the rare top tax bracket, estimate is way off

Stratification: the GroupBy / Agg pattern

• Given:

• A group-partitioning key for stratification

• Sizes for each stratum

• Easy to implement: partition, and construct sample per partition

• your favorite sampling technique applies

SELECT D.group_key, reservoir(D.value)

FROM data D

GROUP BY D.group_key;
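The GroupBy/Agg pattern above can also be sketched in plain Python — a minimal, in-memory version that keeps an independent reservoir per stratum (the `stratified_sample` helper and the toy record layout are illustrative, not from the slides):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, k):
    """Keep an independent reservoir of up to k records per stratum."""
    reservoirs = defaultdict(list)  # group key -> current reservoir
    seen = defaultdict(int)         # group key -> records seen so far
    for rec in records:
        g = key(rec)
        seen[g] += 1
        if len(reservoirs[g]) < k:
            reservoirs[g].append(rec)           # reservoir not yet full
        elif random.random() < k / seen[g]:     # replace with prob k/i
            reservoirs[g][random.randrange(k)] = rec
    return dict(reservoirs)

# e.g., keep up to 2 sampled incomes per state
data = [("CA", 50), ("CA", 60), ("CA", 70), ("NY", 80), ("NY", 90)]
sample = stratified_sample(data, key=lambda r: r[0], k=2)
```

Because each stratum gets its own reservoir, rare groups (a sparsely populated state, the top tax bracket) are guaranteed coverage regardless of their overall frequency.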

Record Sampling

• Randomly sample records?

• r = fraction of items sampled; p = #rows/block

• 20x random I/O penalty => you only win if you read fewer than 5% of blocks!

• Pretty inefficient: touches a 1-(1-r)^p fraction of blocks in expectation

Record Sampling

[Chart: expected % of blocks touched vs. % of items sampled, 1-(1-r)^p with p = 100]
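The 1-(1-r)^p expression is easy to sanity-check numerically (assuming p = 100 rows per block, as in the chart):

```python
# A block is read iff at least one of its p rows is sampled, so the
# expected fraction of blocks touched at row-sampling rate r is 1-(1-r)^p.
def blocks_touched(r, p=100):
    return 1 - (1 - r) ** p

print(blocks_touched(0.01))  # ~0.63: a 1% row sample touches ~63% of blocks
print(blocks_touched(0.05))  # ~0.99: a 5% row sample touches nearly every block
```

Even tiny row-sampling rates touch most blocks, which is why record-level random I/O loses to a scan so quickly.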

Block Sampling

• Randomly sample blocks of records from disk

• Concern: clustering bias.

• Techniques from database literature: assess bias and correct

• Beware: even block sampling needs to be well below 5%.

Sampling in Hadoop

• Larger unit of access: HDFS blocks (128MB vs. 64KB)

• HDFS buffering makes forward seeking within block cheaper

• But CPU costs may encourage sampling within the block.

• …and Hadoop makes it easy to sample across nodes

• Each worker only processes one block

• Must find record boundaries

• Tougher when dealing with quote escaping

Technique II: Sketching

Sketching

• Family of algorithms for estimating contents of a data stream

• Constant-sized memory footprint

• Computed in 1 pass over the data

• Classic Examples

• Bloom filter: existence testing

• HyperLogLog Sketches (FM): distinct values

• CountMin (CM): a surprisingly versatile sketch for frequencies

CountMin Sketch: Initialization

[Diagram: a Count-Min Sketch as a d × w array of counters (d hash functions, w hash buckets), all initialized to 0]

CountMin Sketch: Insertion

Insert(7): each of the d hash functions h1, h2, …, hd maps 7 to one of the w buckets in its row; increment that counter.

[Diagram: after Insert(7), cells h1(7), h2(7), …, hd(7) each hold 1]

CountMin Sketch: Insertion

Insert(4): again increment one counter per row. Where h_i(4) happens to collide with h_i(7), that cell now holds 2; the other incremented cells hold 1.

[Diagram: after Insert(7) and Insert(4), one row shows a colliding cell with value 2]

CountMin Sketch: Query

Count(7)? Read the d cells h1(7), h2(7), …, hd(7) — here 1, 2, 1 (the 2 includes a collision with 4) — and return the minimum: Count(7) = 1.

[Diagram: d hash functions, w hash buckets; min over the d probed cells]

CountMin Sketch: Theorem & Tuning

— Cormode/Muthukrishnan, J. Algorithms 55(1) (2005).

[Diagram: d hash functions, w hash buckets]

The answer is always an over-estimate. w controls the expected error amount; d controls the probability of error.

Suppose we want: 0.1% error, 99.9% probability. w = 2000, d = 10
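A minimal CountMin implementation, for illustration only — the class name and interface are hypothetical, and the salted-MD5 bucketing is an implementation convenience (the formal analysis assumes pairwise-independent hash families):

```python
import hashlib

class CountMin:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.total = 0

    def _bucket(self, row, x):
        # One salted hash per row, reduced to a bucket index.
        digest = hashlib.md5(f"{row}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def insert(self, x, count=1):
        self.total += count
        for i in range(self.d):
            self.table[i][self._bucket(i, x)] += count

    def count(self, x):
        # Every row over-estimates (collisions only add), so take the min.
        return min(self.table[i][self._bucket(i, x)] for i in range(self.d))

# 0.1% error with 99.9% probability, per the tuning above
cm = CountMin(w=2000, d=10)
for v in [7, 4, 4]:
    cm.insert(v)
```

Note the one-sided guarantee: cm.count(4) is never below the true count of 2, and cm.count(7) is never below 1.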

CountMeanMin Sketch

Idea: subtract out the expected overage from each row’s cell, i.e. the mean of the other cells in that row, then take the median of the d corrected estimates as Count(7).

[Diagram: d hash functions, w hash buckets; per-row mean correction, then a median across the d rows]

The Versatile CountMin Sketch

CountMin (and CountMeanMin) answer “point frequency queries”. Surprisingly, we can use them to answer many more questions:

• densities

• even order statistics (median, quantiles, etc.)

More Statistics

• Count-Range Queries

• Median

• Quantiles

• Histograms

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

CountMin: Point Queries — Count(x=13)

CountMin(⌊x/2⌋): Pair Queries — Count(x ∊ [14-15])

CountMin(⌊x/4⌋): Quartet Queries — Count(x ∊ [16-19])

Dyadic CountMin: log2 CountMins

Maintain all of these — CountMins over x, ⌊x/2⌋, ⌊x/4⌋, ⌊x/8⌋, ⌊x/16⌋ — and answer arbitrary range queries, e.g. Count(x ∊ [13-24]).
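The decomposition behind these range queries can be sketched independently of any particular sketch: a hypothetical `dyadic_cover` helper splits a range into the disjoint dyadic intervals whose counts you would look up, one per level of CountMin.

```python
def dyadic_cover(lo, hi, universe_bits=5):
    """Cover [lo, hi] with disjoint dyadic intervals [j*2^l, (j+1)*2^l - 1]."""
    out = []
    def rec(start, length):
        end = start + length - 1
        if end < lo or start > hi:
            return                      # no overlap with the query range
        if lo <= start and end <= hi:
            out.append((start, end))    # fully inside: one sketch lookup
            return
        rec(start, length // 2)         # otherwise split in half
        rec(start + length // 2, length // 2)
    rec(0, 1 << universe_bits)
    return out

# Count(x ∊ [13, 24]) over a 0-31 universe needs only 4 lookups:
print(dyadic_cover(13, 24))  # [(13, 13), (14, 15), (16, 23), (24, 24)]
```

Any range over a universe of size U decomposes into at most 2·log2(U) such intervals, so a range count is a handful of point queries against the stacked sketches.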


00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Median

Via binary search: suppose we have N elements, and the real median is 14. Repeatedly halve the candidate range, using a range-count query at each step to decide which half contains the N/2-th element, until the search narrows to 14.
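The binary search can be sketched against exact range counts (a sorted list stands in for the sketch here; with a dyadic CountMin each count would be an estimate, so the recovered median would be approximate too). The helper name is illustrative:

```python
import bisect

def median_via_range_counts(sorted_vals, universe=32):
    """Binary-search the smallest m with Count(x <= m) >= N/2,
    using one range-count query per probe (exact counts here)."""
    n = len(sorted_vals)
    lo, hi = 0, universe - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Count(x in [0, mid]) -- this is the range query a sketch would answer
        if bisect.bisect_right(sorted_vals, mid) * 2 >= n:
            hi = mid        # at least half the mass is at or below mid
        else:
            lo = mid + 1    # the median lies strictly above mid
    return lo

vals = sorted([14, 3, 14, 27, 14, 9, 21])
print(median_via_range_counts(vals))  # prints 14
```

Only log2(universe) range queries are needed, which is what makes the dyadic-CountMin construction pay off.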

More Statistics

• Count-Range Queries

• Median

• Quantiles: generalization of Median

• Histograms:

• fixed-width bins: range queries

• fixed-height bins: quantiles

1-10 11-20 21-30 31-40

Putting It Together

Wrangling Revisited

Unbox · Big Picture · Anomalies · Good Enough

Head-of-file · Bernoulli · Reservoir · Block · Stratified · Sketching

Summary

• ABP: Always Be Profiling

• Trade off latency and accuracy

• Approximation methods

• Heuristics and reasonable assumptions

Acknowledgments

Adam Silberstein, Joe Hellerstein
