
Page 1

Data Quality is Bad? Deal With It

Dennis Shasha, New York University

Page 2

Data Quality Problem – challenges

• Two companies merge or two divisions want to share data. Problem: identify common customers even though their names are spelled differently (work with Bellcore/Telcordia colleagues: Munir Cochinwala, Verghese Kurien, and Gail Lalk)

• Real-time sensor network. Problem: sensors fail; want to avoid false alarms (work with physicist Alan Mincer and student Yunyue Zhu)

Page 3

My Approach

• Let’s look at fields that have dealt with data quality problems for years, even though they consider these problems part of business as usual.

• We will ask: what do these fields do and how might that help us?

Page 4

Data Quality Problem – biology

• Take two genetically identical plants, treat them in the same way, and measure the RNA expression levels. Get vastly different results.

• Differences increase if experiments are done in different labs or by different people in the same lab.

• Even breathing can be dangerous…

• Goal: find causal relationships among genes.

Page 5

What Can One Do?

• One way to tease out causality is to perform a time series experiment on closely spaced time points.

• Want close spacing to be able to say gene expression level at time t depends on gene expression levels at t-1.

• Start with noise-free model.

Page 6

Noise-free Modeling of Transcriptome Time Series Data

Explain each target gene’s expression as a function of up to 4 input TFs.

[Figure: gene expression vs. time for genes z_i and z_k at times t through t+4. The red squares represent the transition function f to be learned, mapping transcription factor (TF) and target expression levels at one time point to the levels at the next. Krouk et al. 2010, submitted [19]]
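As an illustration of what such a transition function might look like, here is a minimal sketch that assumes, purely hypothetically (the talk does not specify the functional form), a linear f mapping 4 TF expression levels at time t to a target gene’s expression at time t+1. All variable names and data are invented.

# Sketch of a noise-free transition model: assume (hypothetically) that f is
# linear, so target(t+1) = w . TFs(t) + b, and fit it by least squares.
import numpy as np

rng = np.random.default_rng(0)
T = 6                                       # number of time points in the series
tf_expr = rng.normal(size=(T, 4))           # expression of 4 candidate TFs at each time
true_w = np.array([1.5, -0.7, 0.0, 0.3])

# Synthetic, noise-free target gene: its level at t+1 is determined by the TFs at t.
target_expr = np.empty(T)
target_expr[0] = 0.0
target_expr[1:] = tf_expr[:-1] @ true_w + 0.1

# Pair TFs at time t with the target at time t+1 and solve for the weights.
X = np.column_stack([tf_expr[:-1], np.ones(T - 1)])   # add intercept column
y = target_expr[1:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]
print("estimated TF weights:", np.round(w_hat, 3), "intercept:", round(b_hat, 3))

# Predict the target at the next time point from the last observed TF levels.
print("predicted target at next step:", round(tf_expr[-1] @ w_hat + b_hat, 3))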

Page 7

Modeling Noise (poor quality)

• There is reason to believe that Gaussian noise is a decent model of the inconsistencies in biological replicates.

• So model the relationship between observations and “true” value by a Gaussian noise component.

• We’ll see whether this is a good idea or not.
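A minimal sketch of that observation model, with invented numbers: each measurement is treated as the “true” expression value plus Gaussian noise, and the noise scale is estimated from biological replicates.

# Observation model sketch: observed = true value + Gaussian noise.
# The per-gene noise scale is estimated from biological replicates.
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_replicates = 5, 4
true_expr = rng.uniform(2.0, 10.0, size=n_genes)      # hidden "true" values
sigma = 0.5                                            # assumed noise scale
observed = true_expr[:, None] + rng.normal(0.0, sigma, size=(n_genes, n_replicates))

# Estimate the true value by the replicate mean and the noise by the replicate std.
est_true = observed.mean(axis=1)
est_sigma = observed.std(axis=1, ddof=1)
print("estimated expression:", np.round(est_true, 2))
print("estimated per-gene noise sd:", np.round(est_sigma, 2))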

Page 8

[Figure: (A) Transcriptome data set: a time series at 0, 3, 6, 9, 12, and 15 min forms the training set; the task is to predict the direction of change of each gene at 20 min. (B) Noisy model: a dynamic model f drives hidden states Z(t) through Z(t+5), and an observation model g with Gaussian noise (the black box) produces the observations Y(t) through Y(t+5); in the “leave-out-last” test it predicts the 20-min direction of change 71% correctly. (C) A naive “trend-forecast” from the 12 and 15 min points is 51% correct. Krouk et al. 2010, submitted [19]]
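A sketch of how the “leave-out-last” scoring might be computed, using made-up arrays in place of real predictions: compare the predicted direction of change at the held-out 20-minute point against the observation, for both a fitted model and a naive trend-forecast.

# Leave-out-last evaluation sketch: score direction-of-change predictions
# at the held-out final time point. All arrays here are made-up placeholders.
import numpy as np

rng = np.random.default_rng(2)
n_genes = 100
expr_15 = rng.normal(size=n_genes)                     # observed expression at 15 min
expr_12 = expr_15 - rng.normal(size=n_genes) * 0.3     # observed expression at 12 min
expr_20 = expr_15 + rng.normal(size=n_genes)           # held-out observation at 20 min
model_pred_20 = expr_20 + rng.normal(size=n_genes) * 0.8   # stand-in for a fitted model's prediction

def direction_accuracy(pred_20, obs_20, obs_15):
    """Fraction of genes whose predicted up/down direction at 20 min matches the data."""
    return np.mean(np.sign(pred_20 - obs_15) == np.sign(obs_20 - obs_15))

# Naive "trend-forecast": assume the 12 -> 15 min trend simply continues to 20 min.
naive_pred_20 = expr_15 + (expr_15 - expr_12)

print("model accuracy:", direction_accuracy(model_pred_20, expr_20, expr_15))
print("trend-forecast accuracy:", direction_accuracy(naive_pred_20, expr_20, expr_15))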

Page 9

Test and Adaptation

• Test the model by predicting values at a time point not used in the training.

• Predictions are generally not perfect, so the adaptation step is to figure out at which other time points to run additional experiments.

• One way to do this is to repeat the training and testing process, leaving out one experiment at a time. If the most critical experiment (the one whose removal hurts predictions most) is at time t, then gather more data at time t.
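A sketch of that leave-one-experiment-out loop, with placeholder fit_model and test_error functions standing in for the real pipeline (both are invented for illustration).

# Leave-one-experiment-out sketch: the time point whose removal hurts test
# error the most is the "most critical" one, so gather more data there.
import numpy as np

def fit_model(train_times, data):
    """Placeholder: fit the dynamic + noise model on the given time points."""
    return {"times": train_times}

def test_error(model, data, held_out_time):
    """Placeholder: error when predicting the held-out time point."""
    return abs(held_out_time - np.mean(model["times"]))   # fake error, for illustration only

data = None                       # stands in for the expression measurements
all_times = [0, 3, 6, 9, 12, 15]
held_out = 20

errors = {}
for t in all_times:
    reduced = [s for s in all_times if s != t]
    model = fit_model(reduced, data)
    errors[t] = test_error(model, data, held_out)

most_critical = max(errors, key=errors.get)
print("error when each time point is dropped:", errors)
print("gather more data at time", most_critical)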

Page 10

Lessons from Network Inference

• The objective is predictive power.

• Use the training set to learn the noise model and the causal relationships among the genes.

• If predictions work out, then good.

• Modeling data quality is part of the learning problem.

Page 11

Physics -- supernovas

• Look at the sky and observe showers of gamma particles.

• Model the background as a Poisson process.

• Look for exceptionally high bursts (these can last seconds, minutes, hours, up to days).

• Aim telescopes at the appropriate part of the sky.

Page 12


Astrophysical Application

Motivation: In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusually large burst may signal an event interesting to physicists.

Technical overview:
1. The sky is partitioned into 1800 × 900 buckets.
2. 14 sliding-window lengths are monitored, from 0.1 s to 39.81 s.
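A minimal sketch of the monitoring idea for a single sky bucket, assuming a known Poisson background rate (the real system tracks all 1800 × 900 buckets and the 14 window lengths in parallel; the rate, bin width, and injected burst below are invented).

# Sliding-window burst detection sketch for a single sky bucket.
# Background counts are modeled as Poisson; a window is flagged when its count
# exceeds a high quantile of the Poisson distribution for that window length.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
rate_per_sec = 4.0                       # assumed background rate for this bucket
dt = 0.1                                 # counts are binned at 0.1 s
counts = rng.poisson(rate_per_sec * dt, size=5000)
counts[2000:2050] += rng.poisson(3.0, size=50)          # inject a synthetic burst

# 14 geometrically spaced window lengths from 0.1 s to 39.81 s.
window_lengths = np.round(0.1 * (10 ** (np.arange(14) / 5.0)), 2)

for w in window_lengths:
    n_bins = max(1, int(round(w / dt)))
    sums = np.convolve(counts, np.ones(n_bins, dtype=int), mode="valid")  # sliding window sums
    threshold = poisson.ppf(1 - 1e-6, rate_per_sec * w)   # very small per-window false-alarm rate
    hits = np.nonzero(sums > threshold)[0]
    if hits.size:
        print(f"window {w:6.2f}s: burst candidate starting near t = {hits[0] * dt:.1f}s")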

Page 13

Physics -- adaptation

• A burst is only the first filter for detecting a supernova.

• If certain kinds of bursts (e.g. 10 second long bursts) lead to false positives often, then adjust the thresholds.
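A tiny sketch of that adjustment, with invented counters: window lengths whose alerts are too often false positives get their thresholds raised.

# Adaptive thresholds sketch: window lengths that generate too many false
# positives get their detection thresholds raised. All numbers are invented.
false_positives = {0.1: 2, 1.0: 5, 10.0: 40}   # false alarms per window length (s)
alerts = {0.1: 50, 1.0: 60, 10.0: 55}          # total alerts per window length
thresholds = {0.1: 12.0, 1.0: 30.0, 10.0: 90.0}

MAX_FP_RATE = 0.25
for w in thresholds:
    fp_rate = false_positives[w] / alerts[w]
    if fp_rate > MAX_FP_RATE:
        thresholds[w] *= 1.2      # make this window length harder to trigger
        print(f"{w}s windows: false-positive rate {fp_rate:.0%}, raising threshold to {thresholds[w]:.1f}")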

Page 14

Physics -- lessons

• Once again the noise model is an integral part of the problem setting.

• Adaptation is ongoing (no fixed training set).

• Because physicists are looking for a single piece of information, e.g. there is a supernova at location X,Y, redundancy can overcome noise.

Page 15

Drug Testing

• Give N patients a drug and N patients a placebo.

• This is a classic “data quality”/”biological variation” situation. Different patients will react differently to a drug and almost all patients will benefit from a placebo.

• Two questions: is the drug better than the placebo and how much?

Page 16

Drug Testing -- Resampling

• Suppose you arrange the results in a table (patient id, drug/placebo, improvement).

• Compute the average improvement for the drug population.

• Evaluate significance using a permutation test.

• Evaluate the level of improvement using confidence intervals.

• Neither test requires assumptions about the distribution.

Page 17

Typical table

Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Drug
4                      Placebo

Drug improvement: 38/3; Placebo: 13/3

Page 18

One Permutation of table

Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Placebo
4                      Drug

Drug improvement: 22/3; Placebo: 29/3

Page 19

Significance Test – is the drug’s apparent effect due to luck?

• count = 0

• do 10,000 times:
    permute the drug/placebo column
    recompute the improvement under the permutation
    if the recomputed improvement >= the measured improvement in the real test, then count += 1

• P-value = count/10,000: the probability that an improvement at least this large would arise by chance alone.
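A direct translation of this recipe into Python, using the small table above and taking “improvement” to mean the drug-group mean minus the placebo-group mean.

# Permutation significance test: shuffle the Drug/Placebo labels many times and
# see how often the shuffled data shows an improvement at least as large as the real one.
import numpy as np

rng = np.random.default_rng(4)
improvement = np.array([10, 12, 8, -3, 20, 4], dtype=float)
labels = np.array(["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"])

def drug_minus_placebo(imp, lab):
    return imp[lab == "Drug"].mean() - imp[lab == "Placebo"].mean()

observed = drug_minus_placebo(improvement, labels)

count = 0
n_permutations = 10_000
for _ in range(n_permutations):
    permuted = rng.permutation(labels)           # permute the drug/placebo column
    if drug_minus_placebo(improvement, permuted) >= observed:
        count += 1

p_value = count / n_permutations
print(f"observed drug - placebo improvement: {observed:.2f}")
print(f"p-value (fraction of permutations at least as extreme): {p_value:.3f}")

With only six patients the p-value is necessarily coarse; the same code scales unchanged to realistic trial sizes.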

Page 20

Confidence interval – what’s a good estimate of the drug’s benefit?

• do 10,000 times:
    take 2N elements from the original table with replacement
    compute the improvement

• Sort the 10,000 improvement scores and take the 95% confidence interval as the 250th score to the 9,750th score.
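A sketch of the bootstrap interval following the slide’s recipe of drawing 2N rows with replacement (a common variant resamples the drug and placebo arms separately), again using the small table above.

# Bootstrap confidence interval: resample 2N rows with replacement, recompute the
# drug-minus-placebo improvement each time, and read off the middle 95% of scores.
import numpy as np

rng = np.random.default_rng(5)
improvement = np.array([10, 12, 8, -3, 20, 4], dtype=float)
labels = np.array(["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"])
n_rows = len(improvement)          # 2N

scores = []
while len(scores) < 10_000:
    idx = rng.integers(0, n_rows, size=n_rows)      # sample 2N rows with replacement
    lab, imp = labels[idx], improvement[idx]
    if (lab == "Drug").any() and (lab == "Placebo").any():   # skip degenerate resamples
        scores.append(imp[lab == "Drug"].mean() - imp[lab == "Placebo"].mean())

scores.sort()
low, high = scores[249], scores[9749]               # 250th and 9,750th of the 10,000 scores
print(f"95% confidence interval for the drug benefit: [{low:.2f}, {high:.2f}]")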

Page 21

Lessons from Drug Testing

• Assume different patients can react differently.

• Is the drug benefit significant?

• How much of a benefit does it have?

• Lesson: questions are simple; individual noise is overcome with redundancy.

Page 22

Data Quality Problem – adversaries

• A farmer in the developing world wants to do a banking transaction.

• The bank has appointed the shopkeeper as its agent. The shopkeeper will call the bank over an insecure phone line.

• The farmer doesn’t know whether the shopkeeper is truly honest, or even whether messages can be intercepted and mangled (poor quality due to an adversary).

Page 23

Basic Solution

• The bank provides a collection of (essentially) one-time nonces and one-time pads to each of the farmer and the shopkeeper ahead of time.

• Per transaction: the farmer and the shopkeeper each send a one-time nonce and a message to the bank listing the amount of the transaction.

• The bank verifies their identities via the nonces, and the farmer and shopkeeper verify the amounts via the one-time pads.
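The talk does not spell out message formats, but a toy sketch of the idea might look like the following: the bank pre-shares lists of (nonce, pad) pairs with each party, each party sends its next unused nonce plus the amount encrypted with its next unused pad, and the bank checks the nonces and compares the decrypted amounts. All names and formats here are invented for illustration.

# Toy sketch of the nonce + one-time-pad transaction protocol. All helper names
# and message formats are made up; a real deployment needs far more care.
import secrets

def make_credentials(n):
    """Bank pre-generates n (nonce, pad) pairs for one party."""
    return [(secrets.token_hex(8), secrets.token_bytes(4)) for _ in range(n)]

def xor_pad(data, pad):
    """One-time-pad (XOR) a 4-byte value; applying it twice decrypts."""
    return bytes(a ^ p for a, p in zip(data, pad))

# Ahead of time: bank shares credentials with each party and keeps its own copy.
farmer_creds = make_credentials(10)
shop_creds = make_credentials(10)
bank_db = {"farmer": list(farmer_creds), "shop": list(shop_creds)}

def report(party_creds, amount):
    """A party uses its next unused (nonce, pad) to report the transaction amount."""
    nonce, pad = party_creds.pop(0)
    return nonce, xor_pad(amount.to_bytes(4, "big"), pad)

# Per transaction: farmer and shopkeeper each report the amount independently.
messages = {"farmer": report(farmer_creds, 250), "shop": report(shop_creds, 250)}

# Bank side: check each nonce is the expected unused one, decrypt, and compare amounts.
amounts = {}
for party, (nonce, ciphertext) in messages.items():
    expected_nonce, pad = bank_db[party].pop(0)
    assert nonce == expected_nonce, f"unexpected nonce from {party}: possible replay or impostor"
    amounts[party] = int.from_bytes(xor_pad(ciphertext, pad), "big")

print("decrypted amounts:", amounts)
print("transaction accepted" if amounts["farmer"] == amounts["shop"] else "amount mismatch: reject")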

Page 24

“Quality Issues” this Solves

• Replay is impossible because nonces are one-time.

• Mangling will be detected because of one-time pads.

• False confederates and hacking of telephone network will be detected thanks to one-time pads.

• Even a determined adversary can be overcome. Never mind a little random noise.

Page 25

Application – record matching

• Develop a noise model: how are sounds misheard and how are symbols mistyped? (A sketch follows this list.)

• Develop a training set that has correct outcomes but also metadata properties (e.g. who took the information and when it was taken), in case the noise characteristics/probabilities depend on them.

• Model cost of errors vs. cost to clean.
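A hedged illustration of the first bullet, with invented confusion probabilities: score a candidate name match by how plausible its character substitutions are under a simple noise model.

# Tiny record-matching sketch: score whether two name spellings plausibly refer to
# the same person under a character-confusion noise model. Probabilities are invented.
import math

confusion_prob = {("m", "n"): 0.05, ("c", "k"): 0.08, ("i", "y"): 0.04}  # made-up values

def mismatch_penalty(a, b):
    """Cheap penalty if a and b are commonly confused, expensive otherwise."""
    if a == b:
        return 0.0
    p = confusion_prob.get((a, b)) or confusion_prob.get((b, a)) or 0.001
    return -math.log(p)

def match_score(name1, name2):
    """Total penalty between two same-length spellings (a full version would use edit distance)."""
    if len(name1) != len(name2):
        return float("inf")
    return sum(mismatch_penalty(a, b) for a, b in zip(name1.lower(), name2.lower()))

print(match_score("Cochinwala", "Kochinwala"))   # small penalty: c/k is a common confusion
print(match_score("Cochinwala", "Xochinwala"))   # large penalty: rare substitution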

Page 26

Application – sensor reading

• Be conscious of what the goals of the sensor are, e.g. fire/no fire; earthquake/no earthquake.

• Use burst detection to locate possibly troublesome sensors in quiet times.

• Error model is key: could there be an adversary? Can you use non-parametric stats?

Page 27

Lessons

• Data quality problems (i.e. noise or adversarial attacks) are an everyday occurrence in many fields.

• First lesson: model the amount of noise and design the system to answer the critical question (e.g. what is the causal network, is the drug effective, where is the supernova) in spite of the noise.

Page 28

More Lessons

• Second lesson: if you can design for an adversary, then you get noise correction for free.

• Third lesson: use the metadata to localize bursts of errors and shut down the source of the noise.