better machine learning through data -...

81
Better Machine Learning Through Data Saleema Amershi Machine Teaching Group Microsoft Research August 14, 2016

Upload: others

Post on 06-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Better Machine Learning Through Data

Saleema AmershiMachine Teaching GroupMicrosoft Research

August 14, 2016

Page 2: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Making better sense of data.

Better data makes better machine learning.

Page 3: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Data + Algorithm = Model

Data ModelAlgorithm

Page 4: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Data + Algorithm = Model

Data ModelAlgorithm

Machine learning research often takes the data as given.

Page 5: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

When Algorithms Discriminate – The New York Times, 2015

Big Data’s all-too-human failings – Reuters, 2016

Artificial Intelligence’s White Guy Problem – The New York Times, 2016

Mapping Crime – Or Stirring Hate?– Financial Times, 2014

Page 6: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Making better sense of data.

Better data makes better machine learning.

Most influence practitioners have on machine learning is through data.

Page 7: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Data + Algorithm = Model

Data ModelAlgorithm

In research,

data is often taken as given.

Page 8: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

ModelData

In practice, the

algorithm is often taken as given.

In research,

data is often taken as given.

Page 9: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

ModelData

“Data scientists, according to interviews and

expert estimates, spend 50 percent to 80 percent

of their time mired in this more mundane labor

of collecting and preparing unruly digital data.”

- New York Times, 2014

In practice, the

algorithm is often taken as given.

Page 10: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

ModelData

Page 11: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

[Patel et al., CHI 2008]

Page 12: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

ModelData

Page 13: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

Data Model

Iterations are driven by evaluating models on data.

Page 14: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Data + Algorithm = Model

ModelData

Iterations are driven by evaluating models on data.

In practice, most effort is spent crafting input data.

Page 15: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm ModelData

Machine learning in theory

Page 16: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Collect &

Label

Samples

Create

Features

Evaluate

Results

Machine learning in practice

Page 17: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

Collect &

Label

Samples

Create

Features

Page 18: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

Collect &

Label

Samples

Create

Features

Structured Labeling [CHI 2014]

Feature Insight [VAST 2015]

ModelTracker[CHI 2015, VAST 2016]

Page 19: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Evaluate

Results

Create

FeaturesAlgorithm

Feature Insight [VAST 2015]

ModelTracker[CHI 2015, VAST 2016]

Collect &

Label

Samples

Structured Labeling [CHI 2014]

Page 20: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

0

0

Page 21: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

0

0

Page 22: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

1

0

Page 23: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

1

0

Page 24: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

0

Page 25: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

0

Page 26: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

1

Page 27: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

1

Page 28: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

2

Page 29: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

2

Page 30: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

3

2

Page 31: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

3

2

Page 32: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

4

2

Page 33: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

4

2

Page 34: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

4

3

Page 35: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

4

3

Page 36: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

4

3

Does not support

concept evolution

(refining the target

concept as data is

observed).

Page 37: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

How common is concept evolution?

Nine machine learning experts labeled the same 200 pages in two sessions (4 weeks apart).

Average consistency 81.7% (SD=6.8%)

6 out of 9 people’s labels changed significantly (via Chi Square test of symmetry)

25

50

75

100

Co

nsi

sten

cy

Participants

Page 38: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Proposed Solution – Structured Labeling

Enable people to explicitly organize their concept via grouping and tagging within a traditional labeling scheme.

Page 39: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Traditional Labeling

Pre-defined high-level categories.

Cat

Not Cat

Is this a Cat?

2

2

Page 40: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

Page 41: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

Page 42: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

Cat Poster

1

User provided tags on groups aid recall.

Grouping within high-level categories.

Page 43: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

Cat Poster

1

User provided tags on groups aid recall.

Grouping within high-level categories.Blogs

2

Lions

2

Page 44: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

Cat Poster

1

User provided tags on groups aid recall.

Grouping within high-level categories.Blogs

2

Lions

2

Page 45: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

User provided tags on groups aid recall.

Grouping within high-level categories.Blogs

2

Lions

2

Cat Poster

2

Page 46: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

User provided tags on groups aid recall.

Grouping within high-level categories.

Lions

2

Blogs

2

Cat Poster

2

Can move, merge

and split groups

as desired.

Page 47: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Assisted Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

Lions

2

Blogs

2

Cat Poster

2

Grouping recommendations to improve label consistency.

Page 48: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Assisted Structured Labeling

Cat

Not Cat

Is this a Cat?

2

Definitely Cat

2

Definitely Not Cat

2

2

Definitely Not Cat

Lions

2

Blogs

2

Cat Poster

2

Grouping recommendations to improve label consistency.

Similar items to help users make decisions.

Page 49: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Findings

People revised labels significantly more with structured labeling

People labeled more consistently

People preferred it over traditional labeling

Label Consistency

(X2=6.53, df=2, p < .038)(X2=20.19, df=2, p < .001)

Mean # Groups

(X2=12, df=2, p < .002)

# Revisions

Page 50: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Structured Labeling Summary

Current tools do not support concept evolution.

Structured labeling helps people refine their concepts by surfacing labeling decisions and aiding recall.

People used structured labeling when it was available and labeled more consistently.

Structure contains additional information (e.g., group related features, group related accuracy, decisions made…)

Page 51: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Evaluate

ResultsAlgorithm

ModelTracker[CHI 2015, VAST 2016]

Collect &

Label

Samples

Create

Features

Feature Insight [VAST 2015]

Structured labeling improves consistency

[CHI 2014]

Page 52: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” [Domingos, CACM 2012]

…yet, little guidance or best practices exist.

Page 53: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

How do people come up with features?

Look for features used in related domains.

Use intuition or domain knowledge.

Apply automated techniques

Feature ideation – Think of and experiment with custom features (a “black art”).

Page 54: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Proposed Solution – Feature Insight

Support compare and contrast of data.

Page 55: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

What makes a cat a cat?

Page 56: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

What makes a cat a cat?

Page 57: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Proposed Solution – Feature Insight

Support compare and contrast of data.

Comparing pairs vs sets?

Page 58: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Comparing Pairs vs Sets

Sets may help people think of generalizable features.

NegativePositivePositives Negatives

vs

Page 59: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Proposed Solution – Feature Insight

Support compare and contrast of data.

Comparing pairs vs sets?

Raw data vs visual summaries?

Page 60: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Looking at Raw Data vs. Visual Summaries

Visual summaries may reveal relevant characteristics and hide irrelevant noise.

vs

Raw Data Visual Summary

Page 61: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Individual Comparison Set Comparison

Raw

Dat

aV

isu

alSu

mm

arie

s

Page 62: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

0.0

0.5

1.0

ClassifierPerformance (p<.01)

0

2

4

6

Feature Count

0

1

2

3

Preference Rank(Small is better) (p=.03)

Raw + Individual

Raw + Set

Visual + Individual

Visual + Set

Findings

Visual summaries led to better features

Visual summaries preferred over looking at raw data

Sets useful only in combination with visuals

Page 63: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Feature Insight Summary

Featuring is arguably the most important step in machine learning, but there is little guidance on feature ideation.

Feature Insight supports error comparison, examination of sets, and visual summaries.

Visual summaries help people create better quality features.

Page 64: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Algorithm

Collect &

Label

Samples

Structured Labeling [CHI 2014]

Create

Features

Feature Insight [VAST 2015]

Evaluate

Results

ModelTracker[CHI 2015, VAST 2016]

Page 65: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

ModelTracker[CHI 2015, VAST 2016]

Collect &

Label

Samples

Structured Labeling [CHI 2014]

Create

Features

Feature Insight [VAST 2015]

How do people evaluate performance?

Page 66: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

Collect &

Label

Samples

Create

Features

0.710.67 0.70 ???

Predicted

Act

ual

Positive Negative

Positive 143 72

Negative 35 190

How do people evaluate performance?

Summary statistics hide important

information about model behavior.

Page 67: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

Collect &

Label

Samples

Create

Features

0.710.67 0.70 ???

Predicted

Act

ual

Positive Negative

Positive 143 72

Negative 35 190

How do people evaluate performance?

Summary statistics hide important

information about model behavior.

Page 68: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

Collect &

Label

Samples

Create

Features

0.710.67 0.70 ???

Predicted

Act

ual

Positive Negative

Positive 143 72

Negative 35 190

Summary statistics hide important

information about model behavior.

Switching tools to examine data is

disruptive and leads to a trial-and-

error approach [Patel et al., AAAI 2008].

How do people evaluate performance?

Page 69: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Example: Predicting Income Levels

Page 70: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine
Page 71: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Decision Tree

86% Accuracy

Support Vector Machine

85% Accuracy

Page 72: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Decision Tree

86% Accuracy

Support Vector Machine

85% Accuracy

Page 73: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

ModelTracker Demo

Page 74: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Significantly faster and more accurate performance analysis

ModelTracker Common Confusion Matrix

Page 75: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

ModelTracker Summary

Current tools for performance analysis and debugging hide a lot of important information about model behavior.

ModelTracker supports estimating performance at multiple levels of granularity while enabling direct access to data.

People are significantly faster and more accurate at performance analysis with ModelTracker.

Page 76: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

AlgorithmEvaluate

Results

ModelTracker[CHI 2015, VAST 2016]

Collect &

Label

Samples

Structured Labeling [CHI 2014]

Create

Features

Feature Insight [VAST 2015]

Page 77: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

TuneCollect Clean DeployTrain

Many more opportunities to better support machine learning in practice.

Label Feature Evaluate

Page 78: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Tune EvaluateLabel FeatureCollect Clean Train Deploy

Many more opportunities to better support machine learning in practice.

Page 79: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Tune EvaluateFeatureCollect Clean DeployLabel Train

Many more opportunities to better support machine learning in practice and theory.

Page 80: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Making better sense of data.

Better data means better machine learning.

Most influence practitioners have on machine learning is through data.

Many more opportunities!

Page 81: Better Machine Learning Through Data - Visualizationpoloclub.gatech.edu/idea2016/slides/keynote-amershi-kdd-idea16.pdf · Better Machine Learning Through Data Saleema Amershi Machine

Better Machine Learning Through Data

Saleema Amershi, [email protected] Teaching GroupMicrosoft Research

August 14, 2016

Thanks! Questions?