https://aka.ms/aiguidelines
INITIALLY
DURING
INTERACTION
WHEN WRONG
OVER TIME
Make clear
how well the
system can
do what it
can do.
2
Make clear
what the
system
can do.
1
Time services
based on
context.
3
Show
contextually
relevant
information.
4
Match
relevant
social norms.
5
Mitigate
social biases.
6
Support
efficient
invocation.
7
Support
efficient
dismissal.
8
Support
efficient
correction.
9
Scope
services when
in doubt.
10
Make clear
why the
system did
what it did.
11
Remember
recent
interactions.
12
Learn from
user behavior.
13
Update and
adapt
cautiously.
14
Encourage
granular
feedback.
15
Provide
global
controls.
17
Notify users
about
changes.
18
Convey the
consequences
of user
actions.
16
Agenda
AI is fundamentally changing how we interact with
computing systems
MS Word
MS OneNote
MS PowerPoint
Disclaimers
Make clear
how well the
system can
do what it
can do.
2
Make clear
what the
system
can do.
1
*Dan Russell. The Joy of Search.
Time
services
based on
context.
3
Show
contextually
relevant
information.
4
Match
relevant
social
norms.
5
Mitigate
social
biases.
6
Remember to
call your mom
“Information is not subject to biases, unless
users are biased against fastest routes”
“There’s no way to set an avg walking speed.
[The product] assumes users to be healthy”
Support
efficient
invocation.
7
Support
efficient
dismissal.
8
Support
efficient
correction.
9
Scope
services
when in
doubt.
10
Make clear
why the
system did
what it did.
11
Cupcakes are muffins
that believed
in miracles
Remember
recent
interactions.
12
Learn from
user behavior.
13
Update and
adapt
cautiously.
14
Encourage
granular
feedback.
15
Provide global
controls.
17
Notify users
about changes.
18
Convey the
consequences of
user actions.
16
Agenda
Findings &
Impact
Opportunity Analysis
Engagements with Practitioners
Academia Practitioners Industry
CHI 2019 Best Paper Honorable Mention Being leveraged by product teams
across the company throughout the
design and development process
Cited and used in related organizations
Translated to other languages
Initial Impact
Engagements with Practitioners
Findings &
Impact
Developing the Guidelines for Human-AI Interaction
Confirms that the guidelines
are applicable to a broad range
of products, features and
scenarios.
Applies / Violates: per-guideline counts from the evaluation (how often each guideline was found to apply, and how often it was violated):

Guideline   Applications   Violations
G1          26             14
G2          16             23
G3          18              6
G4          26              9
G5          22              9
G6          13             17
G7          20             11
G8          23             14
G9          17             20
G10          9             17
G11          8             27
G12         28             13
G13         22             12
G14         18              4
G15         18             17
G16         11             21
G17         13             29
G18          5             14
Remember
recent
interactions.
12
Make clear
what the
system
can do.
1
Remember
recent
interactions.
12
Make clear
what the
system
can do.
1
Show
contextually
relevant
information.
4
Provide
global
controls.
17
Provide
global
controls.
17
Make clear
why the
system did
what it did.
11
Provide
global
controls.
17
Make clear
why the
system did
what it did.
11
Make clear
how well the
system can
do what it
can do.
2
Types of content: examples, patterns,
research, code
Tagged by guideline and scenario
with faceted search and filtering
Comments and ratings to support
learning
Grow with examples and case studies
submitted by practitioners
Initial Impact
Opportunity Analysis
Engagements with Practitioners
Findings &
Impact
Workshops and Courses
Q & A Break
Agenda
How can I implement the HAI Guidelines?
Interaction Design
AI System Engineering
Interaction Design for AI requires ML & Eng Support
Time services
based on
context.
3
Hard to implement if the logging infrastructure is
oblivious to context.
Scope
services when
in doubt.
10
Does the ML algorithm know or state that
it is “in doubt”?
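One way to make "in doubt" concrete at the application layer is to gate the service on the model's own confidence. A minimal sketch (my illustration, not part of the guidelines), assuming a scikit-learn-style classifier with predict_proba; the thresholds and the `model` object are hypothetical placeholders:

```python
# Sketch of "scope services when in doubt": act only at high confidence,
# hedge at medium confidence, and stay quiet otherwise.
import numpy as np

HIGH_CONFIDENCE = 0.90   # act automatically above this (hypothetical value)
LOW_CONFIDENCE = 0.60    # below this, do nothing (hypothetical value)

def scoped_response(model, x):
    """Return (action, label, confidence) for a single example x."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    label = int(np.argmax(proba))
    confidence = float(proba[label])
    if confidence >= HIGH_CONFIDENCE:
        return ("suggest", label, confidence)    # offer the service
    if confidence >= LOW_CONFIDENCE:
        return ("hedge", label, confidence)      # e.g., phrase as "Probably ..."
    return ("abstain", None, confidence)         # scope down: stay quiet
```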
Make clear
why the
system did
what it did.
11
Is the ML algorithm explainable?
Make clear
how well the
system can
do what it
can do.
2
Make clear
what the
system
can do.
1
[Buolamwini, J. & Gebru, T. 2018]
90% accuracy
Failure explanation models with Pandora [Nushi et al. HCOMP 2018]
(Inputs to the explanation model: benchmark data; internal component features; content features such as object counts, "cat", "people"; other descriptive features.)
Example output: error rates for increasingly specific slices of the data,
All (5.5%) → Women (11.5%) → No makeup (23.7%) → Short hair (30.8%) → Not smiling (35.7%)
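The underlying idea is to report error rates on slices defined by human-readable features. A toy sketch (my illustration, not the Pandora implementation); the column names are hypothetical:

```python
# Error rates by slice: explain failures in terms of descriptive features.
import pandas as pd

def error_rate_by_slice(df: pd.DataFrame, conditions: dict) -> float:
    """Error rate on the subset of rows matching all feature conditions."""
    mask = pd.Series(True, index=df.index)
    for feature, value in conditions.items():
        mask &= df[feature] == value
    subset = df[mask]
    return (subset["prediction"] != subset["label"]).mean()

# Usage, assuming df has columns: label, prediction, gender, makeup, hair, smiling
# error_rate_by_slice(df, {})                                   # "All"
# error_rate_by_slice(df, {"gender": "female"})                 # "Women"
# error_rate_by_slice(df, {"gender": "female", "makeup": "no"}) # narrower slice
```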
[Nushi et. al. HCOMP 2018]
[Wu et. al. ACL 2019]
[Zhang et. al. IEEE TVCG 2018]
2
https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
[Platt et al., 1999; Zadrozny & Elkan, 2001]
[Gal & Ghahramani, 2016; Osband et al., 2016]
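In the spirit of the scikit-learn calibration example linked above, a minimal sketch of checking how well predicted probabilities match observed frequencies and recalibrating with Platt scaling; the dataset and base classifier are placeholders:

```python
# Calibration check + recalibration (sigmoid/Platt scaling).
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)                      # often over-confident
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

for name, clf in [("raw", raw), ("calibrated", calibrated)]:
    prob_pos = clf.predict_proba(X_te)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_te, prob_pos, n_bins=10)
    # A well-calibrated model has frac_pos close to mean_pred in every bin.
    print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))
```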
https://forecast.weather.gov/
2
Probably a yellow school bus driving down a street
Time services
based on
context.
3
Support
efficient
invocation.
7
Support
efficient
dismissal.
8
AI System
Automated
triggering
Explicit
invocation
Precision
Recall
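A minimal sketch (my illustration, not from the talk) of using the precision/recall tradeoff to decide when automated triggering is safe: require high precision before interrupting the user, and otherwise rely on explicit invocation. The min_precision value and the validation data are hypothetical:

```python
# Pick a score threshold for automated triggering from the PR tradeoff.
from sklearn.metrics import precision_recall_curve

def triggering_threshold(y_true, scores, min_precision=0.95):
    """Smallest score threshold whose precision is at least min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have len(thresholds) + 1 entries; drop the final point.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return t, p, r
    return None, None, None  # never precise enough: explicit invocation only

# Usage with hypothetical validation labels and model scores:
# thr, p, r = triggering_threshold(y_val, model.predict_proba(X_val)[:, 1])
```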
Learn from
user behavior.
13
Encourage
granular
feedback.
15
Update and
adapt
cautiously.
14
Encourage
granular
feedback.
15
Provide
global
controls.
17
Q & A
Agenda
Learn from
user behavior.
13
Remember
recent
interactions.
12
Make clear
why the
system did
what it did.
11
Intelligible, Transparent, Explainable AI
Predictable ~ (human) simulate-able: predict exactly what it will do
Intelligible ~ Transparent: answer counterfactuals, i.e., predict how a change to the model's inputs will change its output
Explainable ~ Interpretable: construct a rationalization for why (maybe) it did what it did
Inscrutable ⊇ Blackbox
Inscrutable: too complex to understand
Blackbox: know nothing about it
Caveat: my take; no consensus here
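A small sketch (my illustration) of the "answer counterfactuals" sense of intelligibility: for a model simple enough to query, predict how changing one input changes the output. The toy data and feature names are hypothetical:

```python
# Counterfactual probe: what if one input were different?
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[65, 1], [30, 0], [70, 0], [45, 1], [80, 1], [25, 0]])  # [age, asthma]
y = np.array([1, 0, 1, 0, 1, 0])                                      # high risk?
model = LogisticRegression().fit(X, y)

patient = np.array([[70, 1]])
counterfactual = patient.copy()
counterfactual[0, 1] = 0  # "What if this patient did not have asthma?"

p_fact = model.predict_proba(patient)[0, 1]
p_foil = model.predict_proba(counterfactual)[0, 1]
print(f"risk with asthma: {p_fact:.2f}, without: {p_foil:.2f}, "
      f"change: {p_fact - p_foil:+.2f}")
```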
Reasons for Wanting Intelligibility
1. The AI May be Optimizing the Wrong Thing
2. Missing a Crucial Feature
3. Distributional Drift
4. Facilitating User Control in Mixed Human/AI Teams
5. User Acceptance
6. Learning for Human Insight
7. Legal Requirements
[Weld & Bansal CACM 2019]
[Screenshot: draft paper on AI-advised decision making. Setup: a classifier h gives a recommendation r_h(x) := (argmax h(x), max h(x)) to a human decision maker d, who computes the final decision d(x, r_h(x)); the environment returns a utility that depends on the quality of that decision and the cost of human effort. Within the low-confidence "handover" region it is rational for the human to inspect and, if inaccurate, override the prediction. The paper argues that the most accurate model may not yield the highest team utility, and proposes a negative log-utility loss (NELU) that optimizes team utility rather than AI accuracy alone.]
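A minimal sketch (my reading of the setup above, not the paper's code) of team utility with a handover threshold: the human inspects and corrects low-confidence predictions at a cost and accepts the rest. The threshold, utility values, and the assumption that the human corrects inspected cases are all hypothetical simplifications:

```python
# Team utility for AI-advised decision making with a handover threshold.
import numpy as np

def team_utility(confidence, correct, threshold=0.8,
                 reward=1.0, penalty=-1.0, inspect_cost=-0.2):
    """Average utility per instance for a human-AI team.

    confidence: model's max predicted probability per instance
    correct:    whether the model's recommendation was right (bool array)
    """
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=bool)
    handover = confidence < threshold            # human inspects these
    utility = np.where(correct, reward, penalty).astype(float)
    utility[handover] = reward + inspect_cost    # assume the human fixes them, at a cost
    return utility.mean()

# A more accurate model can yield LOWER team utility if its errors fall in
# the high-confidence region the human accepts without inspection.
conf = [0.95, 0.92, 0.70, 0.65, 0.99]
print(team_utility(conf, correct=[True, True, False, False, True]))   # 3/5 correct
print(team_utility(conf, correct=[True, False, True, True, True]))    # 4/5 correct
```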
Deploy
Autonomous AI
Intelligibility Useful in Both Cases
Human-AI Team
Panda!
Firetruck?!?
AI Errors
Human Errors
But Humans Err as Well
Human Errors
AI Errors
AI-Specific Errors
Joint Errors
Preventable Errors
Intelligible AI → Better Teamwork
A Simple Human-AI Team
[Screenshot: Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. "Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance" (2019). Key idea: team performance depends on more than AI accuracy; it depends on the human's mental model of the AI's error boundary (knowing "When does the AI err?"), which lets the human decide when to accept or override the AI's recommendation. Figure 1 shows AI-advised readmission prediction, with the AI's actual error boundary and the doctor's mental model of it.]
ML Model
Readmission Prediction Classifier
Human Decision Maker
When can I trust it?
How can I adjust it?
Small decision tree over semantically
meaningful primitives
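A sketch (my illustration, not the paper's code) of the idea behind a small decision tree over semantically meaningful primitives: fit a shallow tree that predicts where a black-box model errs, so the human overseer can form a simple, memorable mental model of its error boundary. The black-box model, data, and feature names are placeholders:

```python
# Approximate a black-box model's error boundary with a parsimonious tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=3000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

blackbox = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
model_errs = blackbox.predict(X_te) != y_te          # True where the AI is wrong

# Parsimony matters: a depth-2 tree is something a person can actually remember.
error_boundary = DecisionTreeClassifier(max_depth=2, random_state=0)
error_boundary.fit(X_te, model_errs)
print(export_text(error_boundary, feature_names=[f"f{i}" for i in range(6)]))
```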
When can I trust it? How can I adjust it?
Linear model over semantically
meaningful primitives
When can I trust it? How can I adjust it?
[Figure: part of Figure 1 from Caruana et al., showing 3 (of 56 total) components of a GA2M model trained to predict a patient's risk of dying from pneumonia: shape functions for the patient's age and the boolean variable asthma (y-axis: contribution, in log odds, to predicted risk), plus a heat map of the pairwise interaction between age and cancer rate. Generalized additive models (GAMs) relate a set of features to the target using a linear combination of (potentially nonlinear) single-feature models called shape functions: y = β0 + Σ_j f_j(x_j). Popular shape functions include splines and decision trees. Surrounding text from Weld & Bansal, CACM 2019.]
Part of Fig 1 from R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. “Intelligible models for healthcare:
Predicting pneumonia risk and hospital 30-day readmission.” In KDD 2015.
1 (of 56) components of learned GA2M: risk of pneumonia death
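A toy sketch (my illustration, not the GA2M implementation from the cited paper) of the additive form y = β0 + Σ_j f_j(x_j): each shape function f_j is a small regression tree on a single feature, fit by a few rounds of backfitting on the residuals. Each fitted shape can then be plotted on its own, like the age and asthma components in the figure above:

```python
# Minimal additive model via backfitting with per-feature regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam(X, y, rounds=5, max_depth=3):
    n, d = X.shape
    beta0 = y.mean()
    shapes = [DecisionTreeRegressor(max_depth=max_depth) for _ in range(d)]
    contrib = np.zeros((n, d))                 # current f_j(x_j) per feature
    for _ in range(rounds):
        for j in range(d):
            # Fit f_j to what the other terms have not yet explained.
            residual = y - beta0 - contrib.sum(axis=1) + contrib[:, j]
            shapes[j].fit(X[:, [j]], residual)
            contrib[:, j] = shapes[j].predict(X[:, [j]])
    return beta0, shapes

def predict_gam(beta0, shapes, X):
    return beta0 + sum(f.predict(X[:, [j]]) for j, f in enumerate(shapes))
```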
Figure 4: A part of Figure 1 from [5] showing 3 (of 56 total) components for a GA2M model, which was trained to predict a
patient’s r isk of dying from pneumonia. The two l ine graphs depict the contr ibution of individual features to risk : a) patient’s
age, and b) boolean variable asthma. The y-axis denotes i ts contr ibution (log odds) to predicted risk . The heat map, c, visual izes
the contr ibution due to pairwise interactions between age and cancer rate.
evidence from the presence of thousands of words, one could see
the e ect of a given word by looking at the sign and magnitude of
the corresponding weight. This answers the question, “What if the
word had been omitted?” Similarly, by comparing the weights asso-
ciated with two words, one could predict the e ect on the model of
substituting one for the other.
Ranking Intel l igible Models: Since one may have a choice of
intelligible models, it is useful to consider what makes one prefer-
able to another. Grice introduced four rules characterizing coop-
erative communication, which also hold for intelligible explana-
tions [11]. The maxim of quality says be truthful, only relating
things that are supported by evidence. The maxim of quantity says
to give as much information as is needed, and no more. The maxim
of relation: only say things that are relevant to the discussion. The
maxim of manner says to avoid ambiguity, being be as clear as
possible.
Miller summarizes decades of psychological research, noting that explanations are contrastive, i.e., of the form “Why P rather than Q?” The event in question, P, is termed the fact and Q is called the foil [28]. Often the foil is not explicitly stated even though it is crucially important to the explanation process. For example, consider the question, “Why did you predict that the image depicts an indigo bunting?” An explanation that points to the color blue implicitly assumes that the foil is another bird, such as a chickadee. But perhaps the questioner wonders why the recognizer did not predict a pair of denim pants; in this case a more precise explanation might highlight the presence of wings and a beak. Clearly, an explanation targeted to the wrong foil will be unsatisfying, but the nature and sophistication of a foil can depend on the end user’s expertise; hence, the ideal explanation will differ for different people [7]. For example, to verify that an ML system is fair, an ethicist might generate more complex foils than a data scientist. Most ML explanation systems have restricted their attention to elucidating the behavior of a binary classifier, i.e., where there is only one possible foil choice. However, as we seek to explain multi-class systems, addressing this issue becomes essential.
Many systems are simply too complex to understand without approximation. Here, the key challenge is deciding which details to omit. After long study, psychologists determined that several criteria can be prioritized for inclusion in an explanation: necessary causes (vs. sufficient ones); intentional actions (vs. those taken without deliberation); proximal causes (vs. distant ones); details that distinguish between fact and foil; and abnormal features [28].
According to Lombrozo, humans prefer explanations that are simpler (i.e., contain fewer clauses), more general, and coherent (i.e., consistent with the human’s prior beliefs) [24]. In particular, she observed the surprising result that humans preferred simple (one-clause) explanations to conjunctive ones, even when the probability of the latter was higher than that of the former [24]. These results raise interesting questions about the purpose of explanations in an AI system. Is an explanation’s primary purpose to convince a human to accept the computer’s conclusions (perhaps by presenting a simple, plausible, but unlikely explanation), or is it to educate the human about the most likely true situation? Tversky, Kahneman, and other psychologists have documented many cognitive biases that lead humans to incorrect conclusions; for example, people reason incorrectly about the probability of conjunctions, with a
concrete and vivid scenario deemed more likely than an abstract one that strictly subsumes it [17]. Should an explanation system exploit human limitations or seek to protect us from them?
Other studies raise an additional complication about how to communicate a system’s uncertain predictions to human users. Koehler found that simply presenting an explanation as a fact makes people think that it is more likely to be true [18]. Furthermore, explaining a fact in the same way as previous facts have been explained amplifies this effect [35].
Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance
Gagan Bansal1 Besmira Nushi2 Ece Kamar2 Walter S. Lasecki3 Daniel S. Weld1,4 Eric Horvitz2
1University of Washington 2Microsoft Research 3University of Michigan 4Allen Institute for Artificial Intelligence
Abstract
Decisions made by human-AI teams (e.g., AI-advised humans) are increasingly common in high-stakes domains such as healthcare, criminal justice, and finance. Achieving high team performance depends on more than just the accuracy of the AI system: since the human and the AI may have different expertise, the highest team performance is often reached when they both know how and when to complement one another. We focus on a factor that is crucial to supporting such complementarity: the human’s mental model of the AI capabilities, specifically the AI system’s error boundary (i.e., knowing “When does the AI err?”). Awareness of this lets the human decide when to accept or override the AI’s recommendation. We highlight two key properties of an AI’s error boundary, parsimony and stochasticity, and a property of the task, dimensionality. We show experimentally how these properties affect humans’ mental models of AI capabilities and the resulting team performance. We connect our evaluations to related work and propose goals, beyond accuracy, that merit consideration during model selection and optimization to improve overall human-AI team performance.
1 Introduction
While many AI applications address automation, numerous others aim to team with people to improve joint performance or accomplish tasks that neither the AI nor people can solve alone (Gillies et al. 2016; Kamar 2016; Chakraborti and Kambhampati 2018; Lundberg et al. 2018; Lundgard et al. 2018). In fact, many real-world, high-stakes applications deploy AI inferences to help human experts make better decisions, e.g., with respect to medical diagnoses, recidivism prediction, and credit assessment. Recommendations from AI systems, even if imperfect, can result in human-AI teams that perform better than either the human or the AI system alone (Wang et al. 2016; Jaderberg et al. 2019).
Successfully creating human-AI teams puts additional demands on AI capabilities beyond task-level accuracy (Grosz 1996). Specifically, even though team performance depends on individual human and AI performance, the team performance will not exceed individual performance levels if the human and the AI cannot account for each other’s strengths and weaknesses. In fact, recent research shows that AI accuracy does not always translate to end-to-end team performance (Yin, Vaughan, and Wallach 2019; Lai and Tan 2018). Despite this, and even for applications where people are involved, most AI techniques continue to optimize solely for accuracy of inferences and ignore team performance.
While many factors influence team performance, we study one critical factor in this work: humans’ mental model of the AI system that they are working with. In settings where the human is tasked with deciding when and how to make use of the AI system’s recommendation, extracting benefits from the collaboration requires the human to build insights (i.e., a mental model) about multiple aspects of the capabilities of AI systems. A fundamental attribute is recognition of whether the AI system will succeed or fail for a particular input or set of inputs. For situations where the human uses the AI’s output to make decisions, this mental model of the AI’s error boundary, which describes the regions where it is correct versus incorrect, enables the human to predict when the AI will err and decide when to override the automated inference.
We focus here on AI-advised human decision making, a simple but widespread form of human-AI team.
Figure 1: AI-advised human decision making for readmission prediction: The doctor makes final decisions using the classifier’s recommendations. Check marks denote cases where the AI system renders a correct prediction, and crosses denote instances where the AI inference is erroneous. The solid line represents the AI error boundary, while the dashed line shows a potential human mental model of the error boundary.
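As a toy illustration of why the error boundary matters (my own sketch, not an experiment from the paper), the simulation below compares team accuracy when the human always accepts the AI versus when the human’s mental model matches the true error boundary exactly; the error region and the 0.80 solo-human accuracy are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0, 1, n)                    # a 1-D task instance
truth = (x > 0.5).astype(int)               # ground-truth label

# The AI errs inside a specific region of the input space (its error boundary).
ai_error_region = (x > 0.45) & (x < 0.55)
ai_pred = np.where(ai_error_region, 1 - truth, truth)

human_acc = 0.80                            # human solving the task alone

def team_accuracy(mental_model_of_error_region):
    """Human overrides the AI (with accuracy human_acc) exactly where the
    human *believes* the AI errs; elsewhere the AI's answer is accepted."""
    human_answer = np.where(rng.uniform(0, 1, n) < human_acc, truth, 1 - truth)
    final = np.where(mental_model_of_error_region, human_answer, ai_pred)
    return (final == truth).mean()

no_model = np.zeros(n, dtype=bool)          # always trusts the AI
perfect_model = ai_error_region             # knows exactly when the AI errs
print("accept everything   :", round(team_accuracy(no_model), 3))
print("perfect mental model:", round(team_accuracy(perfect_model), 3))
```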
When can I trust it?
How can I adjust it?
3 (of 56) components of learned GA2M: risk of pneumonia death
Deep cascade of CNNs
Variational networks
Transfer learning
GANs
Kidney MRI. From [Lundervold & Lundervold 2018], https://www.sciencedirect.com/science/article/pii/S0939388918301181
Input: Pixels
Features are not semantically meaningful
Intelligible? Yes → Use directly: explanations, controls.
Intelligible? No → Map to a simpler model.
Why is the model inscrutable?
• Model too complex → simplify by currying (instance-specific explanation) and/or simplify by approximating (usually have to do both of these!)
• Features not semantically meaningful → map to a new vocabulary
Usually have to do all of these: inscrutable model → simpler explanatory model.
LIME – Local Approximations
To explain the prediction for a point p:
1. Sample points around p
2. Use the complex model to predict labels for each sample
3. Weight samples according to distance from p
4. Learn a new simple model on the weighted samples (possibly using different features)
5. Use the simple model as the explanation
Slide adapted from Marco Ribeiro – see “Why Should I Trust You?: Explaining the Predictions of Any Classifier,” M. Ribeiro, S. Singh, C. Guestrin, SIGKDD 2016
To create features for the explanatory classifier, compute `superpixels’ using an off-the-shelf image segmenter and hope that the feature/values are semantically meaningful.
To sample points around p, set some superpixels to grey.
The explanation is the set of superpixels with high coefficients…
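A minimal sketch of the LIME-style recipe above for tabular features (image superpixels follow the same pattern, using binary masks over segments); predict_proba is a stand-in for any black-box model, and the Gaussian sampling and kernel width are assumptions, not the defaults of the official LIME package.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_proba, p, n_samples=1000, kernel_width=1.0, rng=None):
    """Explain predict_proba at point p with a locally weighted linear model."""
    rng = rng or np.random.default_rng(0)
    d = len(p)
    # 1. Sample points around p.
    samples = p + rng.normal(scale=0.5, size=(n_samples, d))
    # 2. Label each sample with the complex model.
    labels = predict_proba(samples)
    # 3. Weight samples by proximity to p.
    dists = np.linalg.norm(samples - p, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # 4. Fit a simple (linear) model on the weighted samples.
    simple = Ridge(alpha=1.0)
    simple.fit(samples, labels, sample_weight=weights)
    # 5. The simple model's coefficients are the explanation.
    return simple.coef_

# Usage with a toy black box whose output is driven mostly by feature 0.
black_box = lambda X: 1 / (1 + np.exp(-(3 * X[:, 0] - 0.5 * X[:, 1])))
print(lime_explain(black_box, p=np.array([0.2, -0.1])))
```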
Trade-off: Understandable ↔ Accurate
Over-simplification at one extreme; an inscrutable model at the other.
Any model simplification is a lie.
Need Desiderata
Fewer conjuncts (regardless of probability)
Explanations consistent with listener’s prior beliefs
Presenting an explanation made people believe P was true
If explanation ~ previous, effect was strengthened
… Even when the explainer is wrong…
when not to trust
When can I trust it?
How can I adjust it?
Except…
In these papers, Accuracy(Humans) << Accuracy(AI)
So… the rational decision is to omit the humans (not explain)
interactions, or computed in expectation from the system itself.
In this paper, we argue that a classifier that is intended to work with a human should maximize the number of accurate examples beyond the handover threshold by focusing the optimization effort on improving the accuracy of such examples and putting less emphasis on examples within the threshold, since such examples would be inspected and overridden by the human anyway. We achieve this goal by proposing a new optimization loss function, the negative log-utility loss (NELU), which optimizes for team utility rather than AI accuracy alone. Current classifiers, trained via log-loss to optimize accuracy, do not take this aspect of AI-advised decision making into account and therefore only aim at maximizing the number of correct instances on either side of the decision boundary.
Take, for instance, the illustrative example shown in Figure 1. The left side shows a standard classifier (h1) trained with negative log-loss, while the right side shows a classifier trained with negative log-utility loss (h2). The shaded gray area shows the handover region, for which it is rational for the human to inspect the prediction and, if inaccurate, override the decision. Although h1 is more accurate (i.e., more accurate instances on either side of the decision boundary), if it were used in an AI-advised decision making setting, its high accuracy in the handover region would not be useful to the team, because the human would not trust the classifier on those instances and would spend more time solving the problem herself, which translates to suboptimal team utility. For h2, even though the overall accuracy is lower (B is on the wrong side of the decision boundary), when combined with a human it could achieve higher team utility, because the human can safely trust the machine outside of the handover region and at the same time there are fewer data points to inspect within the handover region (notice that the set of data points A has now moved outside of this region).
While other aspects of collaboration can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper, accounting for team-based utility functions, is a first step towards human-centered optimization.
In sum, we make the following contributions:
1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.
2. We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function to overcome its issues.
3. We show that on multiple real data sets and machine learning models our new loss achieves higher utility than log-loss.
4. We provide a thorough explanation of when and why it is best to use the negative log-utility loss by showing experimental results with varying domain-related cost parameters.
Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Override meta-decision is costlier than Accept.
2 Problem Description
We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker d to help make decisions (Figure 2a). If h(x) denotes the classifier’s output, a probability distribution over Y, the recommendation r_h(x) consists of a label argmax h(x) and a confidence value max h(x), i.e., r_h(x) := (argmax h(x), max h(x)). Using this recommendation, the user computes a final decision d(x, r_h(x)). The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of the team is to maximize the cumulative utility.
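To make the accept-or-override setup concrete, here is a small worked example (my own illustration, not code from the paper): expected team utility when the human inspects only recommendations whose confidence falls below a handover threshold. The reward, penalty, and override-cost values are assumptions.

```python
import numpy as np

def team_utility(confidence, ai_correct, threshold=0.8,
                 reward_correct=1.0, penalty_wrong=-1.0, cost_override=-0.3,
                 human_accuracy=0.9, rng=None):
    """Average utility of AI-advised decisions with a confidence handover.
    Below `threshold` the human inspects (paying cost_override) and answers
    with `human_accuracy`; above it the AI's recommendation is accepted."""
    rng = rng or np.random.default_rng(0)
    handover = confidence < threshold
    human_right = rng.uniform(size=len(confidence)) < human_accuracy
    correct = np.where(handover, human_right, ai_correct)
    utility = np.where(correct, reward_correct, penalty_wrong)
    utility = utility + np.where(handover, cost_override, 0.0)
    return utility.mean()

# Toy data: confidences and whether the AI happened to be right on each instance.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 5000)
ai_correct = rng.uniform(size=5000) < conf        # calibrated toy classifier
print("avg team utility:", round(team_utility(conf, ai_correct), 3))
```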
Assistance Architecture
0) Solo Human (No AI)
1) AI Recommends
2) AI also gives its confidence
3) AI also explains (LIME-like)
4) AI gives human explanations
Explanations are Convincing
AI Correct
AI Incorrect
Better Explanations are More Convincing
That Other Question…
How can I adjust it?
• Deep neural paper embeddings
• Explain with linear bigrams
Beta: s2-sanity.apps.allenai.org
[Scatter-plot illustration: a query point x' among '+' and 'o' training examples, with bigram features such as "Tuning", "Agents", "Turing".]
If all one cared about was the explanatory model, one could change these parameters… but not even the features are shared with the neural model!
Instead… we generate new training instances by varying the feedback feature, weight them by distance to x', and retrain.
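A rough sketch of that feedback-incorporation step, under assumptions of my own (a binary feedback feature, Gaussian distance weighting, and a logistic-regression learner); it only illustrates "vary the feedback feature, weight by distance to x', and retrain," not the actual recommender.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_with_feedback(X, y, x_prime, feedback_feature, new_label,
                          n_copies=50, kernel_width=1.0, rng=None):
    """Augment training data with synthetic variants of x_prime whose
    feedback feature is toggled, weight all samples by distance to x_prime,
    and retrain a simple model."""
    rng = rng or np.random.default_rng(0)
    # Synthetic instances: copies of x_prime with the feedback feature varied.
    synth = np.tile(x_prime, (n_copies, 1))
    synth[:, feedback_feature] = rng.integers(0, 2, n_copies)   # toggle 0/1
    synth_y = np.full(n_copies, new_label)

    X_aug = np.vstack([X, synth])
    y_aug = np.concatenate([y, synth_y])
    # Samples near x_prime (including the synthetic ones) get higher weight.
    dists = np.linalg.norm(X_aug - x_prime, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_aug, y_aug, sample_weight=weights)
    return model
```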
Good News:
Less Good News:
No significant improvement on feed quality (team
performance) as measured by clickthru
Summary
Readmission prediction classifier (ML model) → Human decision maker
When can I trust it?
How can I adjust it?
Human-AI Team
Helpful
Intelligible? Yes → Use directly: explanations, controls. No → Map to a simpler model.
INITIALLY
1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.
DURING INTERACTION
3. Time services based on context.
4. Show contextually relevant information.
5. Match relevant social norms.
6. Mitigate social biases.
WHEN WRONG
7. Support efficient invocation.
8. Support efficient dismissal.
9. Support efficient correction.
10. Scope services when in doubt.
11. Make clear why the system did what it did.
OVER TIME
12. Remember recent interactions.
13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.
16. Convey the consequences of user actions.
17. Provide global controls.
18. Notify users about changes.
https://aka.ms/aiguidelines
Thanks! Questions?
Tutorial website: https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/articles/aaai-2020-tutorial-guidelines-for-human-ai-interaction/
Learn the guidelines
Introduction to guidelines for human-AI interaction
Interactive cards with examples of the guidelines in practice
Use the guidelines in your work
Printable cards (PDF)
Printable poster (PDF)
Find out more
Guidelines for human-AI interaction design, Microsoft Research Blog
AI guidelines in the creative process: How we’re putting the human-AI guidelines into practice at
Microsoft, Microsoft Design on Medium
How to build effective human-AI interaction: Considerations for machine learning and software
engineering, Microsoft Research Blog