sean kandel - data profiling: assessing the overall content and quality of a data set

Agile Data Profiling Sean Kandel

Upload: huguk

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Agile Data Profiling

Sean Kandel

Page 2: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

What’s in your data?

Page 3: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Opening Questions

in the Data Lifecycle…

Unboxing

What’s in this data?

Can I make use of it?

Page 4: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

… Become Persistent Questions

in the Data Lifecycle

What’s in this data?

Can I make use of it?

Unboxing Transformation Analysis Visualization Productization

Page 5: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Unboxing Transformation Analysis Visualization Productization

Page 6: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Unboxing Transformation Analysis Visualization Productization

STRUCTURING CLEANING

ENRICHMENT DISTILLATION

Page 7: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

“It’s easy to just think you know what you are doing and not look at data at every intermediary step.

An analysis has 30 different steps. It’s tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong.”

Page 8: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

What’s in the data?

• The Expected: Models, Densities, Constraints

• The Unexpected: Residuals, Outliers, Anomalies

Page 9: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Average Movie Ratings

Page 10: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Expected

Page 11: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Unexpected

Page 15: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Overview of all variables

Page 16: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Show relevant perspectives

Page 17: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

What to compute?

• Densities and descriptive statistics

• Identify anomalies and outliers

Page 18: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

How often to compute it?

Unboxing Transformation Analysis Visualization Productization

Page 19: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Challenge: Agility

• Profiling throughout the lifecycle

• Particularly important as you manipulate data

Page 20: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Design Space and Tradeoffs

Page 21: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Mapping out the Design Space

How much data to examine?

How accurate are the results?

How fast can you get them?

Page 22: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Mapping out the Design Space

Decide how your requirements fall on these axes

Find a strategy (if one exists) that fits the requirements

Accuracy

Urgency

Data Volume

Page 23: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Accuracy

Urgency

Data Volume

Strategy vs Cost

Head of file

Good Enough

Anomalies

Big Picture

Unbox

Page 24: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Strategy vs Cost

Random Sample

Accuracy

Urgency

Data Volume

Good Enough

Anomalies

Big Picture

Unbox

Page 25: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Strategy vs Cost

Scan, summarize, collect samples

Accuracy

Urgency

Data Volume

Good Enough

Anomalies

Big Picture

Unbox

Page 26: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Far better an approximate answer

to the right question, which is often

vague, than the exact answer to

the wrong question, which can

always be made precise.

Data Analysis & Statistics, Tukey & Wilk 1966

Page 27: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Technical Methods

Page 28: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Sanity Check: Is this really expensive?

• Computers are fast

• In-memory, column stores, OLAP, …

• Still, “Big Data” can be hard

• Big is sometimes really big

• Big data can be raw: no indexes or precomputed summaries

• Agility remains critical to harness the “informed human mind”

Page 29: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Two Useful Techniques

Sampling

• A variety of techniques available

Sketches

• One-pass memory-efficient structures for capturing distributions

Accuracy

Urgency

Data Volume

Page 30: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Technique I: Sampling

Page 31: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Approaches to Sampling

• Scan-based access

• Head-of-file

• Bernoulli

• Reservoir

• Random I/O Sampling

• Block-level sampling

Page 32: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Head-of-File

• Pros:

• Very fast: small data, no disk seeks

• Absolutely required when unboxing raw data

• Nested data (JSON/XML), Text (logs, database dumps, etc.)

• Cons:

• Correlation of position and value
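A minimal sketch of head-of-file sampling in Python, assuming the input is any iterable of records (an open text file yields lines the same way). The helper name head and the default n are illustrative and not from the talk:

```python
from itertools import islice

def head(records, n=1000):
    """Return the first n records without reading the rest of the input."""
    return list(islice(records, n))

# Usage with any iterable; `with open(path) as f: head(f, 1000)` works alike
# for a hypothetical file path.
first_three = head(iter(range(10)), n=3)
print(first_three)
```

Only the first records are ever read, so the sample inherits whatever ordering the file has, which is the position/value correlation listed under Cons.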

Page 33: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Bernoulli

• Take a full pass, flip a (weighted) coin for each record

• Pros:

• trivial to implement

• trivial to parallelize

• almost no memory required

• Cons:

• requires a full scan of the data

• output size proportional to input size, and random

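from random import random  # uniform draw in [0, 1) for the coin flip
sample = list(filter(lambda x: random() < 0.01, data))  # keep each record with probability 1%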

Page 34: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Reservoir

• Fix a “reservoir” of k items. For the n-th item, with probability k/n eject a random old item for the new one

• Pros:

• trivial to implement

• easy to parallelize

• constant memory required

• fixed-size output — need not know input size in advance

• Cons:

• Requires a full scan of the data

• Requires O(sample_size) memory


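from random import random, randint

res = list(data[0:k])  # initialize: first k items
counter = k            # number of items seen so far
for x in data[k:]:
    # keep the new item with probability k / (items seen, including x)
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x  # eject a random old item
    counter += 1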

Page 37: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Meta-Strategy: Stratified Sampling

• Sometimes you need representative samples from each “group”

• Coverage: e.g., displaying examples for every state in a map

• Robustness: e.g., consider average income

• if you miss the rare top tax bracket, estimate is way off

Page 38: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Stratification: the GroupBy / Agg pattern

• Given:

• A group-partitioning key for stratification

• Sizes for each stratum

• Easy to implement: partition, and construct sample per partition

• your favorite sampling technique applies

SELECT D.group_key, reservoir(D.value)

FROM data D

GROUP BY D.group_key;
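The same GroupBy / Agg pattern can be sketched in one pass in Python. This is an illustration under assumptions: reservoir in the SQL above is not a standard aggregate, and the names stratified_sample, key, and k (the per-stratum sample size) are hypothetical names introduced here.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, k, seed=0):
    """Keep one size-k reservoir per group, in a single pass over records."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # group -> current sample
    seen = defaultdict(int)         # group -> records seen so far
    for rec in records:
        g = key(rec)
        seen[g] += 1
        res = reservoirs[g]
        if len(res) < k:
            res.append(rec)                 # still filling this reservoir
        elif rng.random() < k / seen[g]:
            res[rng.randrange(k)] = rec     # eject a random old item
    return dict(reservoirs)

# A large group and a rare one: both are represented in the sample.
records = [("CA", i) for i in range(1000)] + [("WY", i) for i in range(3)]
sample = stratified_sample(records, key=lambda r: r[0], k=5)
print({g: len(rows) for g, rows in sample.items()})
```

A plain random sample of the same total size could easily miss the rare group entirely, which is the coverage and robustness concern raised for stratification.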

Page 40: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Record Sampling

• Randomly sample records?

• r = fraction of items sampled; p = number of rows per block

• 20x random I/O penalty => read fewer than 5% of blocks, or a full scan is cheaper!

• Pretty inefficient: touches an expected 1 - (1 - r)^p fraction of the blocks

Page 41: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Record Sampling

[Plot: expected % of blocks touched versus % of items sampled, following 1 - (1 - r)^p with p = 100]
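A short sketch of the curve, assuming each record is sampled independently with probability r and every block holds exactly p records; the function name is illustrative:

```python
def fraction_blocks_touched(r, p):
    """Probability that a block of p records contains at least one sampled record."""
    return 1.0 - (1.0 - r) ** p

for r in (0.001, 0.01, 0.05):
    print(f"r = {r:.3f}: {fraction_blocks_touched(r, 100):.3f} of blocks touched")
```

With p = 100, sampling just 1% of the records already touches roughly 63% of the blocks, far above the 5% budget implied by the random I/O penalty.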

Page 42: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Block Sampling

• Randomly sample blocks of records from disk

• Concern: clustering bias.

• Techniques from database literature: assess bias and correct

• Beware: even block sampling needs to be well below 5%.

Page 43: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Sampling in Hadoop

• Larger unit of access: HDFS blocks (128MB vs. 64KB)

• HDFS buffering makes forward seeking within block cheaper

• But CPU costs may encourage sampling within the block.

• …and Hadoop makes it easy to sample across nodes

• Each worker only processes one block

• Must find record boundaries

• Tougher when dealing with quote escaping

Page 44: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Technique II: Sketching

Page 45: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Sketching

• Family of algorithms for estimating contents of a data stream

• Constant-sized memory footprint

• Computed in 1 pass over the data

• Classic Examples

• Bloom filter: existence testing

• HyperLogLog Sketches (FM): distinct values

• CountMin (CM): a surprisingly versatile sketch for frequencies

Page 46: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Initialization

[Figure: Count-Min Sketch, a grid of counters with one row per hash function (d hash functions) and w hash buckets per row, all initialized to 0]

Page 47: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Insertion

Insert(7)

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); each hash function h1, h2, …, hd maps 7 to one bucket in its row]

Page 48: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Insertion

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); the counters at h1(7), h2(7), …, hd(7) are each incremented to 1]

Page 49: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Insertion

Insert(4)

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); each hash function h1, h2, …, hd maps 4 to one bucket in its row]

Page 50: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Insertion

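[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); the counters at h1(4), h2(4), …, hd(4) are each incremented. One of them was already hit by 7 and now reads 2; the others read 1]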

Page 52: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Query

Count(7)?

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); each hash function h1, h2, …, hd maps 7 to one bucket in its row]

Page 53: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Query

Count(7)?

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); the counters at h1(7), h2(7), …, hd(7) read 1, 2, 1]

Page 54: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Query

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets); the counters at h1(7), h2(7), …, hd(7) read 1, 2, 1]

Count(7) = min of these counters = 1
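The initialization, insertion, and query steps walked through above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the class name is hypothetical, and a salted SHA-256 digest stands in for the d independent hash functions.

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; row i uses its own hash function h_i."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]  # all counters start at 0
        self.total = 0

    def _bucket(self, row, item):
        # Assumption: salting the digest with the row index gives h_1 .. h_d.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def insert(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count
        self.total += count

    def count(self, item):
        # Collisions only ever add to a counter, so the minimum is the best bound.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.d))

cms = CountMinSketch(w=2000, d=10)
for x in [7, 4, 7]:
    cms.insert(x)
print(cms.count(7), cms.count(4))
```

Taking the minimum across rows is what keeps the estimate from ever falling below the true count: a collision can inflate a counter but never reduce it.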

Page 55: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Theorem & Tuning

Cormode/Muthukrishnan, J. Algorithms 55(1) (2005).

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets)]

Page 56: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Theorem & Tuning

Cormode/Muthukrishnan, J. Algorithms 55(1) (2005).

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets)]

The estimate is an over-estimate: it is never below the true count

Page 57: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin Sketch: Theorem & Tuning

Cormode/Muthukrishnan, J. Algorithms 55(1) (2005).

[Figure: Count-Min Sketch grid (d hash functions × w hash buckets)]

• w controls expected error amount

• d controls probability of error

Suppose we want:

0.1% error, 99.9% probability.

w = 2000

d = 10
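The theorem text did not survive on this slide, so the following rests on an assumption: it uses the common sizing w = ⌈2/ε⌉ and d = ⌈log2(1/δ)⌉ (error at most ε·N with probability at least 1 − δ, where N is the total count), which reproduces the numbers above. The cited paper states its bound with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ instead. The function name is illustrative.

```python
import math

def countmin_dimensions(epsilon, delta):
    """Sketch size for error <= epsilon * N with probability >= 1 - delta.

    Assumes the w = 2/epsilon, d = log2(1/delta) sizing, not the paper's e/epsilon form.
    """
    w = math.ceil(2 / epsilon)            # buckets per row: controls error amount
    d = math.ceil(math.log2(1 / delta))   # number of rows: controls failure probability
    return w, d

# 0.1% error with 99.9% probability
print(countmin_dimensions(epsilon=0.001, delta=0.001))
```

The sketch then holds w × d = 20,000 counters regardless of how many records are streamed through it.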

Page 58: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMeanMin Sketch

[Figure: Count-Mean-Min Sketch grid (d hash functions × w hash buckets)]

Idea: subtract out the expected overage, i.e. the mean of the other cells in the row

Page 59: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMeanMin Sketch

[Figure: Count-Mean-Min Sketch grid (d hash functions × w hash buckets); the mean of the other cells in a row is taken for that row]

Page 60: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMeanMin Sketch

[Figure: Count-Mean-Min Sketch grid (d hash functions × w hash buckets); a mean is taken per row, then the median across rows]

Page 61: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMeanMin Sketch

[Figure: Count-Mean-Min Sketch grid (d hash functions × w hash buckets); a mean is taken per row, then the median across rows gives Count(7)]
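A hedged sketch of the Count-Mean-Min query as described on these slides: in each row, subtract the mean of the other cells from the item's counter, then take the median of the corrected values. The class name and the salted SHA-256 hashing are assumptions made for illustration; published variants also clamp the result, which is omitted here.

```python
import hashlib
from statistics import median

class CountMeanMinSketch:
    """Same d x w counter grid as Count-Min, with a noise-corrected query."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.total = 0  # N: every row's counters sum to this value

    def _bucket(self, row, item):
        # Assumption: a digest salted with the row index stands in for h_1 .. h_d.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def insert(self, item):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += 1
        self.total += 1

    def count(self, item):
        corrected = []
        for row in range(self.d):
            c = self.table[row][self._bucket(row, item)]
            noise = (self.total - c) / (self.w - 1)  # mean of the other cells in this row
            corrected.append(c - noise)
        return median(corrected)

cmm = CountMeanMinSketch(w=50, d=5)
for _ in range(100):
    cmm.insert(7)                 # one heavy item
for x in range(1000, 2000):
    cmm.insert(x)                 # background noise that collides with it
print(round(cmm.count(7), 1))
```

Unlike the plain minimum, this estimate can fall below the true count, so it trades the one-sided guarantee for less bias on noisy data.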

Page 62: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

The Versatile CountMin Sketch

CountMin (and CountMeanMin) answer “point frequency queries”.

Surprisingly, we can use them to answer many more questions:

• densities

• even order statistics (median, quantiles, etc.)

Page 63: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

More Statistics

• Count-Range Queries

• Median

• Quantiles

• Histograms

Page 64: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin: Point Queries

[Figure: number line of the values 00 to 31]

Count(x=13)

Page 65: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin(⌊x/2⌋): Pair Queries

[Figure: number line of the values 00 to 31]

Count(x ∊ [14-15])

Page 66: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

CountMin(⌊x/4⌋): Quartet Queries

[Figure: number line of the values 00 to 31]

Count(x ∊ [16-19])

Page 67: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Dyadic CountMin: log2(n) CountMins for a domain of n values

[Figure: number line of the values 00 to 31]

Maintain all of these, and answer arbitrary range queries.

Count(x ∊ [13-24])

CountMin(x)

CountMin(⌊x/2⌋)

CountMin(⌊x/4⌋)

CountMin(⌊x/8⌋)

CountMin(⌊x/16⌋)
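A sketch of how an arbitrary range splits into dyadic blocks, one lookup per block. To keep the example short, exact Counter objects stand in for the per-level CountMin sketches (an assumption of this illustration), and one extra level is kept so that the whole domain 0 to 31 is itself a single block. All names are hypothetical.

```python
from collections import Counter

def dyadic_cover(lo, hi):
    """Split inclusive [lo, hi] into (level, index) blocks.

    Block (level, index) covers [index << level, ((index + 1) << level) - 1].
    """
    blocks, level = [], 0
    while lo <= hi:
        if lo & 1:            # lo is a right child: it cannot merge upward
            blocks.append((level, lo))
            lo += 1
        if not hi & 1:        # hi is a left child: it cannot merge upward
            blocks.append((level, hi))
            hi -= 1
        lo >>= 1
        hi >>= 1
        level += 1
    return blocks

class DyadicCounts:
    def __init__(self, levels):
        # Level j counts floor(x / 2**j); each would be a CountMin in practice.
        self.levels = [Counter() for _ in range(levels)]

    def insert(self, x):
        for level, counter in enumerate(self.levels):
            counter[x >> level] += 1

    def range_count(self, lo, hi):
        return sum(self.levels[level][idx] for level, idx in dyadic_cover(lo, hi))

dc = DyadicCounts(levels=6)
for x in [3, 13, 14, 14, 20, 24, 25, 31]:
    dc.insert(x)
print(sorted(dyadic_cover(13, 24)))
print(dc.range_count(13, 24))
```

For [13-24] the cover is the blocks [13], [14-15], [16-23], and [24], so the query costs four point lookups instead of twelve.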

Page 69: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

More Statistics

• Count-Range Queries

• Median

• Quantiles

• Histograms

Page 70: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Median

[Figure: number line of the values 00 to 31]

Via binary search.

(Suppose we have N elements, and the real median is 14)
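The binary search can be sketched as below: find the smallest value whose rank (the count of elements at or below it) reaches half of N. The range_count argument is a stand-in for a dyadic CountMin range query; here it is an exact count over a small list, which is an assumption made to keep the example self-contained. Passing a different q generalizes the same search to quantiles.

```python
def quantile(range_count, domain_lo, domain_hi, total, q=0.5):
    """Smallest x in the domain with range_count(domain_lo, x) >= q * total."""
    target = q * total
    lo, hi = domain_lo, domain_hi
    while lo < hi:
        mid = (lo + hi) // 2
        if range_count(domain_lo, mid) >= target:
            hi = mid          # enough mass at or below mid: look left
        else:
            lo = mid + 1      # too little mass: look right
    return lo

data = [3, 9, 14, 14, 14, 20, 31]

def exact_range_count(a, b):
    # Stand-in for a sketch-based range query over [a, b].
    return sum(1 for x in data if a <= x <= b)

print(quantile(exact_range_count, 0, 31, total=len(data)))
```

Each step issues one range query, so a domain of 32 values needs at most five queries; with approximate range counts the answer is correspondingly approximate.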

Page 75: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

More Statistics

• Count-Range Queries

• Median

• Quantiles: generalization of Median

• Histograms

[Figure: number line of the values 00 to 31]

Page 76: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

More Statistics

• Count-Range Queries

• Median

• Quantiles

• Histograms:

• fixed-width bins: range queries

• fixed-height bins: quantiles

1-10 11-20 21-30 31-40

Page 77: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Putting It Together

Page 78: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Wrangling Revisited

Good Enough

Anomalies

Big Picture

Unbox

Page 79: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Wrangling Revisited

Good Enough

Anomalies

Big Picture

Unbox

Head-of-file

Page 80: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Wrangling Revisited

Good Enough

Anomalies

Big Picture

Unbox

Head-of-file

Bernoulli

Block

Reservoir

Page 81: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Wrangling Revisited

Good Enough

Anomalies

Big Picture

Unbox

Head-of-file

Bernoulli

Sketching

Stratified

Block

Reservoir

Page 82: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Summary

• ABP: Always Be Profiling

• Trade off latency and accuracy

• Approximation methods

• Heuristics and reasonable assumptions

Page 83: Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

Acknowledgments

Adam Silberstein, Joe Hellerstein