Sean Kandel - Data Profiling: Assessing the Overall Content and Quality of a Data Set
TRANSCRIPT
Agile Data Profiling
Sean Kandel
What’s in your data?
Opening Questions in the Data Lifecycle…
Unboxing: What’s in this data? Can I make use of it?
… Become Persistent Questions in the Data Lifecycle
What’s in this data? Can I make use of it?
Unboxing Transformation Analysis Visualization Productization
STRUCTURING CLEANING
ENRICHMENT DISTILLATION
“It’s easy to just think you know what you are doing and not look at data at every intermediary step.
An analysis has 30 different steps. It’s tempting to just do this, then that, and then this. You have no idea in which ways you are wrong and what data is wrong.”
What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
Average Movie Ratings
Expected
Unexpected
Overview of all variables
Show relevant perspectives
What to compute?
• Densities and descriptive statistics
• Identify anomalies and outliers
How often to compute it?
Unboxing Transformation Analysis Visualization Productization
Challenge: Agility
• Profiling throughout the lifecycle
• Particularly important as you manipulate data
Design Space and Tradeoffs
Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
Mapping out the Design Space
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
Accuracy
Urgency
Data Volume
Strategy vs Cost
Head of file
Random sample
Scan, summarize, collect samples
(Axes: Accuracy, Urgency, Data Volume. Stages: Unbox → Big Picture → Anomalies → Good Enough)
“Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
Data Analysis & Statistics, Tukey & Wilk 1966
Technical Methods
Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
Technique I: Sampling
Approaches to Sampling
• Scan-based access
• Head-of-file
• Bernoulli
• Reservoir
• Random I/O Sampling
• Block-level sampling
Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation of position and value
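A head-of-file sample is just the first n records; a minimal sketch (the function name and file-based framing are my own, not from the talk):

```python
from itertools import islice

def head_sample(path, n=1000):
    """Return the first n lines of a file.  Very fast -- small data,
    no disk seeks -- but biased whenever record position correlates
    with record values."""
    with open(path) as f:
        return list(islice(f, n))
```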
Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
# Keep each record independently with probability 1% (needs: from random import random)
filter(lambda x: random() < 0.01, data)
Reservoir
• Fix a “reservoir” of k items. For each subsequent item n, with probability k/n eject an old item for the new one
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data
• Requires O(sample_size) memory
from random import random, randint

res = data[0:k]                       # initialize: first k items
counter = k
for x in data[k:]:
    if random() < k / float(counter + 1):
        res[randint(0, k - 1)] = x    # eject a random old item for the new one
    counter += 1
Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, estimate is way off
Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
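The same GroupBy/Agg pattern in plain Python, a sketch assuming a fixed reservoir size k per stratum (names are illustrative):

```python
from collections import defaultdict
from random import random, randint

def stratified_reservoir(records, group_key, k):
    """One pass over the data, maintaining a size-k reservoir
    sample per stratum."""
    reservoirs = defaultdict(list)   # group -> current sample
    counts = defaultdict(int)        # group -> items seen so far
    for rec in records:
        g = group_key(rec)
        counts[g] += 1
        if len(reservoirs[g]) < k:
            reservoirs[g].append(rec)               # fill phase
        elif random() < k / counts[g]:
            reservoirs[g][randint(0, k - 1)] = rec  # eject old for new
    return dict(reservoirs)
```

Any other sampling technique can be swapped in per partition; reservoir is convenient because stratum sizes need not be known in advance.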
Record Sampling
• Randomly sample records?
• r = fraction of items sampled; p = #rows per block
• With a 20x random I/O penalty, random access only pays off if you read fewer than 5% of blocks!
• Pretty inefficient: touches a 1-(1-r)^p fraction of blocks in expectation
Record Sampling
[Chart: expected % of blocks touched vs. % of items sampled; 1-(1-r)^p with p = 100]
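The block-touch formula is easy to check numerically; a small sketch (function name is my own):

```python
def frac_blocks_touched(r, p):
    """Expected fraction of p-row blocks that contain at least one
    record when records are sampled independently at rate r."""
    return 1 - (1 - r) ** p

# Even a 1% record sample touches most 100-row blocks:
print(frac_blocks_touched(0.01, 100))   # ≈ 0.634
```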
Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
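A minimal illustration of block-level sampling over an in-memory list (real implementations sample disk blocks; names are my own):

```python
from random import sample

def block_sample(records, rows_per_block, n_blocks):
    """Read a few randomly chosen blocks in full, avoiding per-record
    random I/O.  Beware clustering bias: rows within a block may be
    correlated (e.g. sorted or time-ordered data)."""
    blocks = [records[i:i + rows_per_block]
              for i in range(0, len(records), rows_per_block)]
    out = []
    for b in sample(blocks, min(n_blocks, len(blocks))):
        out.extend(b)
    return out
```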
Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
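One common trick for starting mid-block is to seek to an arbitrary offset and skip forward to the next newline; a simplified sketch that, as the slide warns, ignores quote escaping (quoted fields with embedded newlines need a CSV-aware scanner):

```python
def align_to_record(f, offset):
    """Seek a byte stream to `offset`, then advance to the next
    newline so parsing starts on a record boundary.  The (likely
    partial) record at the seek point is discarded; the previous
    worker is assumed to read past its block end to cover it."""
    f.seek(offset)
    if offset > 0:
        f.readline()   # discard the partial record
    return f.tell()
```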
Technique II: Sketching
Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog Sketches (FM): distinct values
• CountMin (CM): a surprisingly versatile sketch for frequencies
CountMin Sketch: Initialization
A d × w array of counters (d hash functions, w hash buckets), all initialized to 0.
CountMin Sketch: Insertion
Insert(7): for each of the d hash functions h1, h2, …, increment the cell [i, hi(7)] in row i. Each row now holds a 1 in that item’s bucket.
CountMin Sketch: Insertion
Insert(4): likewise increment [i, hi(4)] in each row. Where a hash of 4 collides with a hash of 7, the shared cell now reads 2.
CountMin Sketch: Query
Count(7)? Read cell [i, hi(7)] in each row and return the minimum. Collisions only inflate counters, so the minimum is the tightest available estimate.
CountMin Sketch: Theorem & Tuning
(Cormode & Muthukrishnan, J. Algorithms 55(1), 2005)
The returned count is an over-estimate.
w controls expected error amount; d controls probability of error.
Suppose we want: 0.1% error, 99.9% probability.
w = 2000
d = 10
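A minimal Count-Min sketch in Python, an illustrative sketch rather than a production implementation (seeded md5 stands in for the pairwise-independent hash functions the theorem assumes):

```python
import hashlib

class CountMin:
    """Count-Min sketch: d rows of w counters.  Point queries return
    the minimum over rows, which is always an over-estimate."""
    def __init__(self, w=2000, d=10):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, item, row):
        # Row index acts as a hash seed (stand-in for h1..hd).
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def insert(self, item, count=1):
        for i in range(self.d):
            self.table[i][self._bucket(item, i)] += count

    def count(self, item):
        return min(self.table[i][self._bucket(item, i)]
                   for i in range(self.d))
```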
CountMeanMin Sketch
Idea: subtract out the expected overage, i.e. the mean of the other cells in each row; then report the median of the corrected row estimates (rather than the min) as Count(7).
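The Count-Mean-Min correction can be sketched as a standalone estimator over the counter table (argument names are my own):

```python
import statistics

def count_mean_min(table, buckets, n):
    """Count-Mean-Min estimate.  In each row, subtract the mean of
    the *other* cells -- the expected overage from collisions -- then
    take the median of the corrected row estimates.
    table   : d x w counter array
    buckets : the queried item's bucket index in each row
    n       : total number of insertions"""
    w = len(table[0])
    estimates = []
    for row, b in zip(table, buckets):
        overage = (n - row[b]) / (w - 1)   # mean of the other cells
        estimates.append(row[b] - overage)
    return statistics.median(estimates)
```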
The Versatile CountMin Sketch
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions:
• densities
• even order statistics (median, quantiles, etc.)
More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms
CountMin: Point Queries, e.g. Count(x = 13) over the domain 00-31
CountMin(⌊x/2⌋): Pair Queries, e.g. Count(x ∊ [14-15])
CountMin(⌊x/4⌋): Quartet Queries, e.g. Count(x ∊ [16-19])
Dyadic CountMin: log2 CountMins, one per level (x, ⌊x/2⌋, ⌊x/4⌋, ⌊x/8⌋, ⌊x/16⌋)
Maintain all of these, and answer arbitrary range queries, e.g. Count(x ∊ [13-24]).
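The dyadic decomposition can be sketched with exact Counters standing in for the per-level CountMin sketches (class name is my own):

```python
from collections import Counter

class DyadicCounts:
    """Dyadic range counting over integers in [0, 2**levels).
    Exact Counters stand in for per-level CountMin sketches."""
    def __init__(self, levels=5):
        self.levels = levels
        self.sketches = [Counter() for _ in range(levels + 1)]

    def insert(self, x):
        # Record x at every granularity: x, x//2, x//4, ...
        for lvl in range(self.levels + 1):
            self.sketches[lvl][x >> lvl] += 1

    def range_count(self, lo, hi):
        """Count(x in [lo, hi]) from O(log domain) dyadic pieces."""
        total = 0
        hi += 1                              # work with half-open [lo, hi)
        lvl = 0
        while lo < hi:
            if lo & 1:                       # odd left edge: take one cell
                total += self.sketches[lvl][lo]
                lo += 1
            if hi & 1:                       # odd right edge: take one cell
                hi -= 1
                total += self.sketches[lvl][hi]
            lo >>= 1
            hi >>= 1
            lvl += 1
        return total
```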
Median
Via binary search over range counts. (Suppose we have N elements, and the real median is 14: repeatedly halve the candidate interval in [00-31], comparing Count(x ≤ mid) to N/2.)
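The binary search needs only range-count queries, the kind a dyadic CountMin answers; a sketch with exact counts standing in for the sketch:

```python
def median_via_counts(range_count, n, domain_max):
    """Find a median using only range-count queries.
    range_count(a, b) returns Count(x in [a, b]); n = total elements.
    Returns the smallest m with Count(x <= m) >= n/2."""
    lo, hi = 0, domain_max
    while lo < hi:
        mid = (lo + hi) // 2
        if 2 * range_count(0, mid) >= n:   # at least half lie at or below mid
            hi = mid
        else:
            lo = mid + 1
    return lo

# Exact counting stands in for the sketch here:
data = [10, 11, 12, 13, 14, 14, 15, 20, 25]
rc = lambda a, b: sum(a <= x <= b for x in data)
```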
More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms
More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms:
• fixed-width bins: range queries
• fixed-height bins: quantiles
(example fixed-width bins: 1-10, 11-20, 21-30, 31-40)
Putting It Together
Wrangling Revisited
Stages: Unbox → Big Picture → Anomalies → Good Enough
Techniques along the way: Head-of-file, Bernoulli, Reservoir, Block, Stratified, Sketching
Summary
• ABP: Always Be Profiling
• Tradeoff latency and accuracy
• Approximation methods
• Heuristics and reasonable assumptions
Acknowledgments
Adam Silberstein, Joe Hellerstein