Guidelines for Human-AI Interaction (aka.ms/aiguidelines): Opportunity Analysis, Engagements with Practitioners, Findings & Impact


https://aka.ms/aiguidelines

The 18 Guidelines for Human-AI Interaction

INITIALLY
1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.

DURING INTERACTION
3. Time services based on context.
4. Show contextually relevant information.
5. Match relevant social norms.
6. Mitigate social biases.

WHEN WRONG
7. Support efficient invocation.
8. Support efficient dismissal.
9. Support efficient correction.
10. Scope services when in doubt.
11. Make clear why the system did what it did.

OVER TIME
12. Remember recent interactions.
13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.
16. Convey the consequences of user actions.
17. Provide global controls.
18. Notify users about changes.

Agenda
AI is fundamentally changing how we interact with computing systems
MS Word, MS OneNote, MS PowerPoint
Disclaimers



1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.
*Dan Russell. The Joy of Search.


3. Time services based on context.
4. Show contextually relevant information.
5. Match relevant social norms.
6. Mitigate social biases.
"Remember to call your mom"


"Information is not subject to biases, unless users are biased against fastest routes"
"There's no way to set an avg walking speed. [The product] assumes users to be healthy"


7. Support efficient invocation.
8. Support efficient dismissal.
9. Support efficient correction.
10. Scope services when in doubt.
11. Make clear why the system did what it did.



"Cupcakes are muffins that believed in miracles"

12. Remember recent interactions.
13. Learn from user behavior.
14. Update and adapt cautiously.
15. Encourage granular feedback.
16. Convey the consequences of user actions.
17. Provide global controls.
18. Notify users about changes.

Agenda
Findings & Impact
Opportunity Analysis
Engagements with Practitioners

Findings & Impact: Academia, Practitioners, Industry

Initial Impact
CHI 2019 Best Paper Honorable Mention
Being leveraged by product teams across the company throughout the design and development process
Cited and used in related organizations
Translated to other languages

Engagements with Practitioners
Developing the Guidelines for Human-AI Interaction
Confirms that the guidelines are applicable to a broad range of products, features, and scenarios.

[Chart: "Applies / Violates" counts per guideline (G1–G18), showing how often practitioners found each guideline applied and how often they found it violated.]
Guidelines highlighted in the walkthrough of these results: 12 Remember recent interactions; 1 Make clear what the system can do; 4 Show contextually relevant information; 17 Provide global controls; 11 Make clear why the system did what it did; 2 Make clear how well the system can do what it can do.

Types of content: examples, patterns, research, code
Tagged by guideline and scenario, with faceted search and filtering
Comments and ratings to support learning
Grows with examples and case studies submitted by practitioners

Initial Impact
Opportunity Analysis
Engagements with Practitioners
Findings & Impact
Workshops and Courses
Q & A Break

Agenda
How can I implement the HAI Guidelines?
Interaction Design
AI System Engineering
Interaction Design for AI requires ML & Eng Support

3. Time services based on context: hard to implement if the logging infrastructure is oblivious to context.
10. Scope services when in doubt: does the ML algorithm know or state that it is "in doubt"? (A minimal abstention sketch follows this list.)
11. Make clear why the system did what it did: is the ML algorithm explainable?
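One way an ML component can expose being "in doubt" to the interaction layer is to return class probabilities and abstain below a confidence threshold, letting the UX scope or defer the service. A minimal sketch, assuming a probability vector is available; the threshold and fallback behaviour are illustrative assumptions, not from the talk.

```python
# Minimal sketch: "scope services when in doubt" by abstaining when the model's
# confidence falls below a threshold. Threshold and fallback are assumptions.
import numpy as np

def scoped_prediction(proba: np.ndarray, threshold: float = 0.8):
    """Return (label, confidence), or None to signal that the model is in doubt."""
    label = int(np.argmax(proba))
    confidence = float(proba[label])
    if confidence < threshold:
        return None          # scope the service: defer, hedge, or ask the user
    return label, confidence

# Example: one posterior too uncertain to act on, one confident enough.
print(scoped_prediction(np.array([0.45, 0.35, 0.20])))  # -> None (in doubt)
print(scoped_prediction(np.array([0.92, 0.05, 0.03])))  # -> (0, 0.92)
```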

1. Make clear what the system can do.
2. Make clear how well the system can do what it can do.


90% accuracy? [Buolamwini, J. & Gebru, T. 2018]


Failure explanation models with Pandora [Nushi et al., HCOMP 2018]: descriptive features drawn from benchmark data, internal component features, and content features (e.g., object counts, "cat", "people") are used to explain where a system fails. Failure rates grow for increasingly specific slices: All (5.5%), Women (11.5%), No makeup (23.7%), Short hair (30.8%), Not smiling (35.7%). (A toy version of this slicing is sketched below.)
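A toy version of this kind of analysis can be written in a few lines: group evaluation results by human-readable features and compare failure rates. This is only an illustrative sketch (made-up column names and data), not the Pandora implementation.

```python
# Illustrative sketch: measure how a model's failure rate changes across slices
# defined by human-readable features. Columns and data are assumptions.
import pandas as pd

df = pd.DataFrame({
    "is_woman":   [1, 1, 1, 0, 0, 1, 0, 1],
    "no_makeup":  [1, 0, 1, 0, 0, 1, 1, 0],
    "short_hair": [1, 1, 0, 0, 1, 0, 0, 0],
    "error":      [1, 0, 1, 0, 0, 1, 0, 0],  # 1 = model failed on this example
})

def failure_rate(frame: pd.DataFrame) -> float:
    return frame["error"].mean()

print("All:", failure_rate(df))
print("Women:", failure_rate(df[df.is_woman == 1]))
print("Women + no makeup:", failure_rate(df[(df.is_woman == 1) & (df.no_makeup == 1)]))
```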


[Nushi et al., HCOMP 2018]
[Wu et al., ACL 2019]
[Zhang et al., IEEE TVCG 2018]


2. Make clear how well the system can do what it can do.
Calibration of predicted probabilities (sketched below): https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html [Platt et al., 1999; Zadrozny & Elkan, 2001]
Uncertainty estimation [Gal & Ghahramani, 2016; Osband et al., 2016]
Communicating probabilistic forecasts: https://forecast.weather.gov/
Hedged output: "Probably a yellow school bus driving down a street"
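As a minimal sketch of the calibration idea behind the scikit-learn example linked above, the snippet compares a model's raw probabilities against a reliability curve and against a sigmoid-calibrated version; the dataset and classifier are arbitrary illustrative choices, not from the talk.

```python
# Minimal calibration sketch: compare predicted probabilities with observed
# frequencies (reliability curve). Dataset, model, and bin count are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3).fit(X_tr, y_tr)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    prob = model.predict_proba(X_te)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
    # Well-calibrated: mean predicted probability ≈ observed positive fraction.
    print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))
```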

3. Time services based on context.
7. Support efficient invocation.
8. Support efficient dismissal.
AI System: automated triggering vs. explicit invocation


Precision vs. recall: deciding when the system triggers automatically is a tradeoff between the two (see the sketch below).
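A hedged sketch of that tradeoff: sweep the score threshold that triggers the feature automatically and read an operating point off the precision-recall curve. The data, model, and the 0.9 precision target are assumptions for illustration.

```python
# Illustrative sketch: choose an automated-triggering threshold from the
# precision-recall curve; below it, fall back to explicit invocation by the user.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)

target_precision = 0.9                      # assumed product requirement
ok = np.where(precision[:-1] >= target_precision)[0]
if len(ok):
    i = ok[0]                               # first threshold meeting the target
    print(f"trigger automatically when score >= {thresholds[i]:.2f}; recall there = {recall[i]:.2f}")
else:
    print("no threshold reaches the target precision; keep explicit invocation only")
```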

13. Learn from user behavior.
15. Encourage granular feedback.
14. Update and adapt cautiously.

15. Encourage granular feedback.
17. Provide global controls.

Q & A

Agenda

13. Learn from user behavior.
12. Remember recent interactions.
11. Make clear why the system did what it did.

Intelligible, Transparent, Explainable AI (caveat: my take; there is no consensus here)
Predictable ~ (human) simulate-able: you can predict exactly what it will do.
Intelligible ~ transparent: you can answer counterfactuals, i.e., predict how a change to the model's inputs will change its output. (A minimal example follows this list.)
Explainable ~ interpretable: you can construct a rationalization for why (maybe) it did what it did.
Inscrutable ⊇ blackbox: inscrutable means too complex to understand; blackbox means you know nothing about it.
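For example, with a linear model the counterfactual question ("how would the output change if this input changed?") has a closed-form answer: the coefficient times the input delta. A minimal illustrative sketch; the feature names and weights are made up.

```python
# Counterfactual answering with a linear model: the effect of changing one input
# is just its coefficient times the change. Features and weights are assumptions.
coef = {"age": 0.03, "asthma": -0.9, "temperature": 0.12}   # assumed weights
x    = {"age": 67.0, "asthma": 1.0,  "temperature": 38.2}   # one example input

def predict(features: dict) -> float:
    return sum(coef[k] * v for k, v in features.items())

baseline = predict(x)
x_cf = dict(x, asthma=0.0)            # counterfactual: same input, asthma flipped off
print(baseline, predict(x_cf), predict(x_cf) - baseline)    # delta equals -coef["asthma"]
```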

Reasons for Wanting Intelligibility

1. The AI May be Optimizing the Wrong Thing

2. Missing a Crucial Feature

3. Distributional Drift

4. Facilitating User Control in Mixed Human/AI Teams

5. User Acceptance

6. Learning for Human Insight

7. Legal Requirements

[Weld & Bansal CACM 2019]

[Excerpt from a draft paper on team-aware loss functions, shown on the slide:]

...interactions, or computed in expectation from the system itself.

In this paper, we argue that a classifier that is intended to work with a human should maximize the number of accurate examples beyond the handover threshold by focusing the optimization effort on improving the accuracy of such examples and putting less emphasis on examples within the threshold, since such examples would be inspected and overridden by the human anyway. We achieve this goal by proposing a new optimization loss function, negative log-utility loss (NELU), that optimizes for team utility rather than AI accuracy alone. Current classifiers trained via log-loss do not take into account this aspect of AI-advised decision making, and therefore only aim at maximizing the number of instances on the correct side of the decision boundary.

Take, for instance, the illustrative example shown in Figure 1. The left side shows a standard classifier (h1) trained with negative log-loss, while the right side shows a classifier trained with negative log-utility loss (h2). The shaded gray area shows the handover region, for which it is rational for the human to inspect the prediction and, if inaccurate, override the decision. Although h1 is more accurate (i.e., more accurate instances on either side of the decision boundary), if it were used in an AI-advised decision-making setting its high accuracy in the handover region would not be useful to the team, because the human would not trust the classifier on those instances and would spend more time solving the problem herself, which translates to suboptimal team utility. For h2 instead, even though the overall accuracy is lower (B is on the wrong side of the decision boundary), if combined with a human it could achieve a higher team utility, because the human can safely trust the machine outside of the handover region and there are fewer data points to inspect within it (notice that the set of data points A has now moved outside of this region).

While other aspects of collaboration can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper, accounting for team-based utility functions, is a first step towards human-centered optimization. [1]

In sum, we make the following contributions:
1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.
2. We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function to overcome its issues.
3. We show that on multiple real data sets and machine learning models our new loss achieves higher utility than log-loss.
4. We provide a thorough explanation of when and why it is best to use the negative log-utility loss by showing experimental results with varying domain-related cost parameters. [2][3][4][5]

Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Override meta-decision is costlier than Accept.

2 Problem Description
We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker d to help make decisions (Figure 2a). If h(x) denotes the classifier's output, a probability distribution over Y, the recommendation r_h(x) consists of a label argmax h(x) and a confidence value max h(x), i.e., r_h(x) := (argmax h(x), max h(x)). Using this recommendation, the user computes a final decision d(x, r_h(x)). The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of this team is to maximize the cumulative utility. Before deriving a closed ...

Draft footnotes visible in the screenshot: [1] BUG: This is an attempt to say that we are just making a first step here to set expectations right. [2] BUG: Change. We analyze ... with different costs. [3] BUG: show connection with real world. [4] BUG: say we make assumptions about collaboration, but it's important. [5] BUG: Work we need to cite here to show that collaboration may not work: [Zhang et al., 2020; Lai and Tan, 2018; Feng and Boyd-Graber, 2019].
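To make the recommendation/handover setup concrete, here is a small, hedged simulation of AI-advised decision making in the spirit of the excerpt: the human accepts recommendations whose confidence max h(x) clears a handover threshold and inspects the rest, and each inspection carries a cost. The reward, cost, and threshold values are assumptions for illustration, not the paper's parameters.

```python
# Hedged sketch of AI-advised decision making: accept confident recommendations,
# inspect the rest. Reward/cost values and the handover threshold are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
confidence = rng.uniform(0.5, 1.0, n)       # max h(x): the AI's confidence
ai_correct = rng.random(n) < confidence     # confident predictions are right more often

HANDOVER = 0.75                             # below this, the human inspects and decides
REWARD_CORRECT, PENALTY_WRONG, COST_INSPECT = 1.0, -1.0, -0.2

accept = confidence >= HANDOVER
utility = np.where(
    accept,
    np.where(ai_correct, REWARD_CORRECT, PENALTY_WRONG),   # recommendation accepted as-is
    COST_INSPECT + REWARD_CORRECT,                         # human inspects, assumed correct
)
print("team utility per decision:", utility.mean().round(3),
      "| AI accuracy alone:", ai_correct.mean().round(3))
```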


Deploy: as autonomous AI, or as part of a human-AI team. Intelligibility is useful in both cases.



Panda! Firetruck?!? (AI errors)
But humans err as well (human errors)

[Venn diagram: Human Errors and AI Errors, with regions labeled Joint Errors, AI-Specific Errors, and Preventable Errors.]
Intelligible AI → Better Teamwork

A Simple Human-AI Team

[Excerpt from the paper shown on the slide:]

Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance
Gagan Bansal (1), Besmira Nushi (2), Ece Kamar (2), Walter S. Lasecki (3), Daniel S. Weld (1,4), Eric Horvitz (2)
(1) University of Washington, (2) Microsoft Research, (3) University of Michigan, (4) Allen Institute for Artificial Intelligence

Abstract: Decisions made by human-AI teams (e.g., AI-advised humans) are increasingly common in high-stakes domains such as healthcare, criminal justice, and finance. Achieving high team performance depends on more than just the accuracy of the AI system: since the human and the AI may have different expertise, the highest team performance is often reached when they both know how and when to complement one another. We focus on a factor that is crucial to supporting such complementarity: the human's mental model of the AI capabilities, specifically the AI system's error boundary (i.e., knowing "When does the AI err?"). Awareness of this lets the human decide when to accept or override the AI's recommendation. We highlight two key properties of an AI's error boundary, parsimony and stochasticity, and a property of the task, dimensionality. We show experimentally how these properties affect humans' mental models of AI capabilities and the resulting team performance. We connect our evaluations to related work and propose goals, beyond accuracy, that merit consideration during model selection and optimization to improve overall human-AI team performance.

1 Introduction. While many AI applications address automation, numerous others aim to team with people to improve joint performance or accomplish tasks that neither the AI nor people can solve alone (Gillies et al. 2016; Kamar 2016; Chakraborti and Kambhampati 2018; Lundberg et al. 2018; Lundgard et al. 2018). In fact, many real-world, high-stakes applications deploy AI inferences to help human experts make better decisions, e.g., with respect to medical diagnoses, recidivism prediction, and credit assessment. Recommendations from AI systems, even if imperfect, can result in human-AI teams that perform better than either the human or the AI system alone (Wang et al. 2016; Jaderberg et al. 2019).

Successfully creating human-AI teams puts additional demands on AI capabilities beyond task-level accuracy (Grosz 1996). Specifically, even though team performance depends on individual human and AI performance, the team performance will not exceed individual performance levels if human and AI cannot account for each other's strengths and weaknesses. In fact, recent research shows that AI accuracy does not always translate to end-to-end team performance (Yin, Vaughan, and Wallach 2019; Lai and Tan 2018). Despite this, and even for applications where people are involved, most AI techniques continue to optimize solely for accuracy of inferences and ignore team performance.

While many factors influence team performance, we study one critical factor in this work: humans' mental model of the AI system that they are working with. In settings where the human is tasked with deciding when and how to make use of the AI system's recommendation, extracting benefits from the collaboration requires the human to build insights (i.e., a mental model) about multiple aspects of the capabilities of AI systems. A fundamental attribute is recognition of whether the AI system will succeed or fail for a particular input or set of inputs. For situations where the human uses the AI's output to make decisions, this mental model of the AI's error boundary (which describes the regions where it is correct versus incorrect) enables the human to predict when the AI will err and decide when to override the automated inference.

We focus here on AI-advised human decision making, a simple but widespread form of human-AI team, for example, ...

Figure 1: AI-advised human decision making for readmission prediction: The doctor makes final decisions using the classifier's recommendations. Check marks denote cases where the AI system renders a correct prediction, and crosses denote instances where the AI inference is erroneous. The solid line represents the AI error boundary, while the dashed line shows a potential human mental model of the error boundary.

ML Model: Readmission Prediction Classifier → Human Decision Maker
When can I trust it? How can I adjust it?
Small decision tree over semantically meaningful primitives (an illustrative surrogate sketch follows)
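One common way to obtain a small decision tree over meaningful primitives is to fit a shallow tree as a global surrogate that mimics a more complex model's predictions. The sketch below is illustrative (synthetic data, arbitrary black-box model) and is not the procedure used in the paper.

```python
# Illustrative global-surrogate sketch: approximate a complex model with a
# shallow decision tree over human-readable features. Data and model are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
blackbox = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train a depth-3 tree to mimic the black box's predictions, then inspect it.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, blackbox.predict(X))
print("fidelity to black box:", surrogate.score(X, blackbox.predict(X)))
print(export_text(surrogate, feature_names=[f"feature_{i}" for i in range(6)]))
```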

Linear model over semantically meaningful primitives


[Excerpt from the paper shown on the slide:]

Figure 4: A part of Figure 1 from [5] showing 3 (of 56 total) components of a GA2M model, which was trained to predict a patient's risk of dying from pneumonia. The two line graphs depict the contribution of individual features to risk: (a) patient's age, and (b) the boolean variable asthma. The y-axis denotes the contribution (log odds) to predicted risk. The heat map, (c), visualizes the contribution due to pairwise interactions between age and cancer rate.

...evidence from the presence of thousands of words, one could see the effect of a given word by looking at the sign and magnitude of the corresponding weight. This answers the question, "What if the word had been omitted?" Similarly, by comparing the weights associated with two words, one could predict the effect on the model of substituting one for the other.

Ranking Intelligible Models: Since one may have a choice of intelligible models, it is useful to consider what makes one preferable to another. Grice introduced four rules characterizing cooperative communication, which also hold for intelligible explanations [11]. The maxim of quality says to be truthful, only relating things that are supported by evidence. The maxim of quantity says to give as much information as is needed, and no more. The maxim of relation says to only say things that are relevant to the discussion. The maxim of manner says to avoid ambiguity, being as clear as possible.

Miller summarizes decades of work in psychological research, noting that explanations are contrastive, i.e., of the form "Why P rather than Q?" The event in question, P, is termed the fact and Q is called the foil [28]. Often the foil is not explicitly stated even though it is crucially important to the explanation process. For example, consider the question, "Why did you predict that the image depicts an indigo bunting?" An explanation that points to the color blue implicitly assumes that the foil is another bird, such as a chickadee. But perhaps the questioner wonders why the recognizer did not predict a pair of denim pants; in this case a more precise explanation might highlight the presence of wings and a beak. Clearly, an explanation targeted to the wrong foil will be unsatisfying, but the nature and sophistication of a foil can depend on the end user's expertise; hence, the ideal explanation will differ for different people [7]. For example, to verify that an ML system is fair, an ethicist might generate more complex foils than a data scientist. Most ML explanation systems have restricted their attention to elucidating the behavior of a binary classifier, i.e., where there is only one possible foil choice. However, as we seek to explain multi-class systems, addressing this issue becomes essential.

Many systems are simply too complex to understand without approximation. Here, the key challenge is deciding which details to omit. After long study, psychologists determined that several criteria can be prioritized for inclusion in an explanation: necessary causes (vs. sufficient ones); intentional actions (vs. those taken without deliberation); proximal causes (vs. distant ones); details that distinguish between fact and foil; and abnormal features [28].

According to Lombrozo, humans prefer explanations that are simpler (i.e., contain fewer clauses), more general, and coherent (i.e., consistent with the human's prior beliefs) [24]. In particular, she observed the surprising result that humans preferred simple (one-clause) explanations to conjunctive ones, even when the probability of the latter was higher than the former [24]. These results raise interesting questions about the purpose of explanations in an AI system. Is an explanation's primary purpose to convince a human to accept the computer's conclusions (perhaps by presenting a simple, plausible, but unlikely explanation), or is it to educate the human about the most likely true situation? Tversky, Kahneman, and other psychologists have documented many cognitive biases that lead humans to incorrect conclusions; for example, people reason incorrectly about the probability of conjunctions, with a concrete and vivid scenario deemed more likely than an abstract one that strictly subsumes it [17]. Should an explanation system exploit human limitations or seek to protect us from them?

Other studies raise an additional complication about how to communicate a system's uncertain predictions to human users. Koehler found that simply presenting an explanation as a fact makes people think that it is more likely to be true [18]. Furthermore, explaining a fact in the same way as previous facts have been explained amplifies this effect [35].

3 INHERENTLY INTELLIGIBLE MODELS
Several AI systems are inherently intelligible. We previously mentioned linear models as an example that supports counterfactuals. Unfortunately, linear models have limited utility because they often result in poor accuracy. More expressive choices may include simple decision trees and compact decision lists. We focus on generalized additive models (GAMs), which are a powerful class of ML models that relate a set of features to the target using a linear combination of (potentially nonlinear) single-feature models called shape functions [25]. For example, if y represents the target and {x_1, ..., x_n} represents the features, then a GAM takes the form y = β_0 + Σ_j f_j(x_j), where the f_j denote shape functions and the target y is computed by summing single-feature terms. Popular shape functions include non-linear functions such as splines and decision trees. With linear shape functions, GAMs reduce to a linear model.
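A minimal toy sketch of the GAM form y = β_0 + Σ_j f_j(x_j) described above, using shallow decision trees as shape functions fitted by boosting-style backfitting; this is illustrative only and is not the GA2M implementation used in the cited work.

```python
# Toy GAM sketch: y ≈ beta_0 + sum_j f_j(x_j), fitting one shape function per
# feature with shallow trees in boosting-style rounds. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 2000)  # additive signal

beta0 = y.mean()
shapes = [[] for _ in range(X.shape[1])]   # the trees making up each f_j
pred = np.full_like(y, beta0)

for _ in range(50):                         # boosting-style backfitting rounds
    for j in range(X.shape[1]):
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=2).fit(X[:, [j]], residual)
        shapes[j].append(tree)
        pred += 0.1 * tree.predict(X[:, [j]])   # small learning rate

print("training MSE:", float(np.mean((y - pred) ** 2)))
# Each shape function f_j(x_j) is the sum of its trees' contributions, so its
# effect on the prediction can be plotted and inspected feature by feature.
```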

Part of Fig. 1 from R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission." In KDD 2015.
1 (of 56) components of a learned GA2M: risk of pneumonia death

Figure 4: A part of Figure 1 from [5] showing 3 (of 56 total) components for a GA2M model, which was trained to predict a

patient’s r isk of dying from pneumonia. The two l ine graphs depict the contr ibution of individual features to risk : a) patient’s

age, and b) boolean variable asthma. The y-axis denotes i ts contr ibution (log odds) to predicted risk . The heat map, c, visual izes

the contr ibution due to pairwise interactions between age and cancer rate.

… evidence from the presence of thousands of words, one could see the effect of a given word by looking at the sign and magnitude of the corresponding weight. This answers the question, "What if the word had been omitted?" Similarly, by comparing the weights associated with two words, one could predict the effect on the model of substituting one for the other.
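As a concrete illustration of this weight-based "what if" reasoning, here is a small sketch using a bag-of-words logistic regression; the toy corpus, labels, and the specific words probed are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy spam-vs-ham corpus, invented purely for illustration.
docs = ["win a free prize now", "meeting agenda attached",
        "free offer win money now", "project status report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))

# "What if the word had been omitted?": dropping one occurrence of a word
# shifts the log-odds by minus that word's weight.
print("effect of omitting 'free':", -weights["free"])

# Effect of substituting one word for another: the difference of their weights.
print("effect of replacing 'free' with 'attached':",
      weights["attached"] - weights["free"])
```

These two quantities are exactly the counterfactual questions posed in the paragraph above.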

Ranking Intelligible Models: Since one may have a choice of intelligible models, it is useful to consider what makes one preferable to another. Grice introduced four rules characterizing cooperative communication, which also hold for intelligible explanations [11]. The maxim of quality says to be truthful, only relating things that are supported by evidence. The maxim of quantity says to give as much information as is needed, and no more. The maxim of relation says to only say things that are relevant to the discussion. The maxim of manner says to avoid ambiguity, being as clear as possible.

Miller summarizes decades of work in psychological research, noting that explanations are contrastive, i.e., of the form "Why P rather than Q?" The event in question, P, is termed the fact and Q is called the foil [28]. Often the foil is not explicitly stated even though it is crucially important to the explanation process. For example, consider the question, "Why did you predict that the image depicts an indigo bunting?" An explanation that points to the color blue implicitly assumes that the foil is another bird, such as a chickadee. But perhaps the questioner wonders why the recognizer did not predict a pair of denim pants; in this case a more precise explanation might highlight the presence of wings and a beak. Clearly, an explanation targeted to the wrong foil will be unsatisfying, and, as noted above, the nature and sophistication of a foil depend on the end user's expertise.


Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance

Gagan Bansal¹, Besmira Nushi², Ece Kamar², Walter S. Lasecki³, Daniel S. Weld¹,⁴, Eric Horvitz²

¹University of Washington  ²Microsoft Research  ³University of Michigan  ⁴Allen Institute for Artificial Intelligence

Abstract

Decisions made by human-AI teams (e.g., AI-advised humans) are increasingly common in high-stakes domains such as healthcare, criminal justice, and finance. Achieving high team performance depends on more than just the accuracy of the AI system: since the human and the AI may have different expertise, the highest team performance is often reached when they both know how and when to complement one another. We focus on a factor that is crucial to supporting such complementarity: the human's mental model of the AI capabilities, specifically the AI system's error boundary (i.e., knowing "When does the AI err?"). Awareness of this lets the human decide when to accept or override the AI's recommendation. We highlight two key properties of an AI's error boundary, parsimony and stochasticity, and a property of the task, dimensionality. We show experimentally how these properties affect humans' mental models of AI capabilities and the resulting team performance. We connect our evaluations to related work and propose goals, beyond accuracy, that merit consideration during model selection and optimization to improve overall human-AI team performance.

1 Introduction

While many AI applications address automation, numerous others aim to team with people to improve joint performance or accomplish tasks that neither the AI nor people can solve alone (Gillies et al. 2016; Kamar 2016; Chakraborti and Kambhampati 2018; Lundberg et al. 2018; Lundgard et al. 2018). In fact, many real-world, high-stakes applications deploy AI inferences to help human experts make better decisions, e.g., with respect to medical diagnoses, recidivism prediction, and credit assessment. Recommendations from AI systems—even if imperfect—can result in human-AI teams that perform better than either the human or the AI system alone (Wang et al. 2016; Jaderberg et al. 2019).

Figure 1: AI-advised human decision making for readmission prediction: The doctor makes final decisions using the classifier's recommendations. Check marks denote cases where the AI system renders a correct prediction, and crosses denote instances where the AI inference is erroneous. The solid line represents the AI error boundary, while the dashed line shows a potential human mental model of the error boundary.

Successfully creating human-AI teams puts additional demands on AI capabilities beyond task-level accuracy (Grosz 1996). Specifically, even though team performance depends on individual human and AI performance, the team performance will not exceed individual performance levels if the human and the AI cannot account for each other's strengths and weaknesses. In fact, recent research shows that AI accuracy does not always translate to end-to-end team performance (Yin, Vaughan, and Wallach 2019; Lai and Tan 2018). Despite this, and even for applications where people are involved, most AI techniques continue to optimize solely for accuracy of inferences and ignore team performance.

While many factors influence team performance, we study one critical factor in this work: humans' mental model of the AI system that they are working with. In settings where the human is tasked with deciding when and how to make use of the AI system's recommendation, extracting benefits from the collaboration requires the human to build insights (i.e., a mental model) about multiple aspects of the capabilities of AI systems. A fundamental attribute is recognition of whether the AI system will succeed or fail for a particular input or set of inputs. For situations where the human uses the AI's output to make decisions, this mental model of the AI's error boundary—which describes the regions where it is correct versus incorrect—enables the human to predict when the AI will err and decide when to override the automated inference.

We focus here on AI-advised human decision making, a simple but widespread form of human-AI team, for example, …

When can I trust it? How can I adjust it?

3 (of 56) components of learned GA2M: risk of pneumonia death


Deep cascade of CNNs

Variational networks

Transfer learning

GANs

Kidney MRI, from [Lundervold & Lundervold 2018]: https://www.sciencedirect.com/science/article/pii/S0939388918301181

Input: pixels. Features are not semantically meaningful.

Intelligible? Yes → use directly. No (model too complex, inscrutable) → map to a simpler model and interact with it (explanations, controls).

Mapping an inscrutable model to a simpler explanatory model: simplify by currying (→ instance-specific explanation); simplify by approximating; and, when features are not semantically meaningful, map to a new vocabulary. Usually you have to do all of these!

LIME - Local Approximations

1. Sample points around p.

2. Use the complex model to predict labels for each sample.

3. Weight samples according to their distance from p.

4. Learn a new simple model on the weighted samples (possibly using different features).

5. Use the simple model as the explanation (a minimal sketch follows the slide credit below).

Slide adapted from Marco Ribeiro – see "Why Should I Trust You?: Explaining the Predictions of Any Classifier," M. Ribeiro, S. Singh, C. Guestrin, SIGKDD 2016.
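The five steps above can be sketched in a few lines for tabular data. This is a minimal illustration, not the authors' released LIME implementation; the black-box model, sampling scale, kernel width, and Ridge surrogate are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

def lime_tabular(black_box, p, n_samples=500, kernel_width=1.0, scale=0.5, seed=0):
    """Fit a locally weighted linear surrogate around the instance p."""
    rng = np.random.default_rng(seed)
    Z = p + scale * rng.standard_normal((n_samples, p.shape[0]))   # 1. sample around p
    y_z = black_box.predict_proba(Z)[:, 1]                         # 2. label with complex model
    dists = np.sum((Z - p) ** 2, axis=1)
    w = np.exp(-dists / (2 * kernel_width ** 2))                   # 3. weight by distance to p
    surrogate = Ridge(alpha=1.0).fit(Z, y_z, sample_weight=w)      # 4. weighted simple model
    return surrogate.coef_                                         # 5. coefficients = explanation

# Toy black box on invented data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

p = X[0]
print("local feature weights around p:", np.round(lime_tabular(rf, p), 3))
```

The returned coefficients play the same role as the high-weight superpixels in the image version of the procedure described next.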

To explain prediction for point p …

P

To create features for explanatory classifier,

Compute `superpixels’ using off-the-shelf image segmenter

Hope that feature/values are semantically meaningful

To sample points around p, set some superpixels to grey

Explanation is set of superpixels with high coefficients…

[Trade-off slide: (1) accurate but inscrutable vs. (2) understandable but an over-simplification; "Any model simplification is a lie."]

Need Desiderata

Fewer conjuncts (regardless of probability)

Explanations consistent with listener’s prior beliefs

Presenting an explanation made people believe P was true

If explanation ~ previous, effect was strengthened

… Even when the explainer is wrong…

when not to trust


When can I trust it?

How can I adjust it?

Except…

In these papers, Accuracy(Humans) << Accuracy(AI)

So… the rational decision is to omit the humans (not explain)

… interactions, or computed in expectation from the system itself.

In this paper, we argue that a classifier that is intended to work with a human should maximize the number of accurate examples beyond the handover threshold by focusing the optimization effort on improving the accuracy of such examples and putting less emphasis on examples within the threshold, since such examples would be inspected and overridden by the human anyway. We achieve this goal by proposing a new optimization loss function named negative log-utility loss (NELU) that optimizes for team utility rather than AI accuracy alone. Current classifiers trained via log-loss for optimizing team accuracy do not take into account this aspect of AI-advised decision making, and therefore only aim at maximizing the number of instances on either side of the decision boundary.

Take, for instance, the illustrative example shown in Figure 1. The left side shows a standard classifier (h1) trained with negative log-loss, while the right side shows a classifier trained with negative log-utility loss (h2). The shaded gray area shows the handover region for which it is rational for the human to inspect the prediction and, if inaccurate, override the decision. It is clear that although h1 is more accurate (i.e., more accurate instances on either side of the decision boundary), if it were used in an AI-advised decision making setting, its high accuracy in the handover region would not be useful to the team because the human would not trust the classifier on those instances and would spend more time solving the problem herself, which would then translate to suboptimal team utility. For the case of h2 instead, even though the overall accuracy is lower (B is on the wrong side of the decision boundary), if combined with a human it could achieve a higher team utility because the human can safely trust the machine outside of the handover region and at the same time there are fewer data points to inspect within the handover region (notice that the set of data points A has now moved outside of this region).

While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper, accounting for team-based utility functions, is a first step towards human-centered optimization.¹

In sum, we make the following contributions:

1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.

2. We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function to overcome its issues.

3. We show that on multiple real data sets and machine learning models our new loss achieves higher utility than log-loss.

4. We provide a thorough explanation of when and why it is best to use the negative log-utility loss by showing experimental results with varying domain-related cost parameters.²³⁴⁵

Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Override meta-decision is costlier than Accept.

2 Problem Description

We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker d to help make decisions (Figure 2a). If h(x) denotes the classifier's output, a probability distribution over Y, the recommendation r_h(x) consists of a label argmax h(x) and a confidence value max h(x), i.e., r_h(x) := (argmax h(x), max h(x)). Using this recommendation, the user computes a final decision d(x, r_h(x)). The environment, in response, returns a utility which depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of the team is to maximize the cumulative utility. Before deriving a closed …

¹BUG: This is an attempt to say that we are just making a first step here to set expectations right.
²BUG: Change. We analyze ... with different costs ...
³BUG: Show connection with the real world.
⁴BUG: Say we make assumptions about collaboration, but it's important.
⁵BUG: Work we need to cite here to show that collaboration may not work: [Zhang et al., 2020; Lai and Tan, 2018; Feng and Boyd-Graber, 2019].
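To make the accept/override protocol concrete, here is a hedged sketch of computing cumulative team utility under a fixed handover (confidence) threshold. The threshold, reward, penalty, and override cost are invented numbers, and this is the decision protocol only, not the paper's NELU loss.

```python
import numpy as np

def team_utility(conf, ai_pred, human_pred, truth,
                 threshold=0.8, reward=1.0, penalty=-1.0, override_cost=0.3):
    """Cumulative utility of AI-advised decision making: the human accepts
    confident recommendations and inspects/overrides the rest at a cost."""
    total = 0.0
    for c, a, h, t in zip(conf, ai_pred, human_pred, truth):
        if c >= threshold:          # Accept: cheap meta-decision, use the AI's label
            decision, cost = a, 0.0
        else:                       # Override: inspect and decide, costlier meta-decision
            decision, cost = h, override_cost
        total += (reward if decision == t else penalty) - cost
    return total

# Toy numbers, all invented for illustration.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=100)
ai_pred = np.where(rng.random(100) < 0.85, truth, 1 - truth)     # ~85%-accurate classifier
conf = rng.uniform(0.5, 1.0, size=100)                           # max h(x), the AI's confidence
human_pred = np.where(rng.random(100) < 0.95, truth, 1 - truth)  # careful but costlier human
print("cumulative team utility:", round(team_utility(conf, ai_pred, human_pred, truth), 2))
```

In this sketch, the AI's accuracy inside the handover region never enters the utility, which mirrors the argument for the negative log-utility loss above.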


Assistance Architecture

0) Solo Human (No AI)

1) AI Recommends

2) AI also gives its confidence

3) AI also explains (LIME-like)

4) AI gives human explanations

Explanations are Convincing (AI Correct / AI Incorrect).

Better Explanations are More Convincing (AI Correct / AI Incorrect).

That Other Question…

Beyond Accuracy: TheRole of Mental Models in Human-AI Team Performance

Gagan Bansal1 Besmira Nushi2 EceKamar 2 Walter S. Lasecki3

Daniel S. Weld1,4 Eric Horvitz2

1University of Washington 2Microsoft Research 3University of Michigan 4Allen Institute for Artificial Intelligence

Abstract

Decisions made by human-AI teams (e.g., AI-advised hu-mans) are increasingly common in high-stakes domains suchas healthcare, criminal justice, and finance. Achieving highteam performance depends on more than just the accuracy ofthe AI system: Since the human and the AI may have differ-ent expertise, the highest team performance is often reachedwhen they both know how and when to complement one an-other. We focus on a factor that is crucial to supporting suchcomplementary: thehuman’smental model of theAI capabil-ities, specifically the AI system’s error boundary (i.e. know-ing “When does the AI err?”). Awareness of this lets the hu-man decidewhen to accept or override theAI’srecommenda-tion. Wehighlight two key properties of anAI’serror bound-ary, parsimony and stochasticity, and a property of the task,dimensionality. We show experimentally how these proper-ties affect humans’ mental models of AI capabilities and theresulting team performance. We connect our evaluations torelated work and propose goals, beyond accuracy, that meritconsideration during model selection and optimization to im-proveoverall human-AI team performance.

1 IntroductionWhile many AI applications address automation, numer-ous others aim to team with people to improve joint per-formance or accomplish tasks that neither the AI nor peo-ple can solve alone (Gillies et al. 2016; Kamar 2016;Chakraborti and Kambhampati 2018; Lundberg et al. 2018;Lundgard et al. 2018). In fact, many real-world, high-stakesapplications deploy AI inferences to help human expertsmake better decisions, e.g., with respect to medical diag-noses, recidivism prediction, and credit assessment. Rec-ommendations from AI systems—even if imperfect—canresult in human-AI teams that perform better than eitherthe human or the AI system alone (Wang et al. 2016;Jaderberg et al. 2019).

Successfully creating human-AI teamsputsadditional de-mands on AI capabilities beyond task-level accuracy (Grosz1996). Specifically, even though team performance dependson individual human and AI performance, the team perfor-mance will not exceed individual performance levels if hu-

Copyright c 2019, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

Figure 1: AI-advised human decision making for read-mission prediction: The doctor makes final decisions us-ing the classifier’s recommendations. Check marks denotecases where the AI system renders a correct prediction, andcrosses denote instances where the AI inference is erro-neous. Thesolid line represents theAI error boundary, whilethedashed lineshowsapotential human mental model of theerror boundary.

man and the AI cannot account for each other’s strengthsand weaknesses. In fact, recent research shows that AI ac-curacy does not always translate to end-to-end team perfor-mance(Yin, Vaughan, and Wallach 2019; Lai and Tan 2018).Despite this, and even for applications where people are in-volved, most AI techniques continue to optimize solely foraccuracy of inferences and ignore team performance.

Whilemany factors influence team performance, westudyone critical factor in this work: humans’ mental model ofthe AI system that they are working with. In settings wherethe human is tasked with deciding when and how to makeuse of the AI system’s recommendation, extracting benefitsfrom the collaboration requires the human to build insights(i.e., a mental model) about multiple aspects of the capabil-ities of AI systems. A fundamental attribute is recognitionof whether the AI system will succeed or fail for a partic-ular input or set of inputs. For situations where the humanuses theAI’soutput to makedecisions, this mental model oftheAI’serror boundary—which describes theregionswhereit is correct versus incorrect—enables the human to predictwhen the AI will err and decide when to override the auto-mated inference.

We focus here on AI-advised human decision making, asimplebut widespread form of human-AI team, for example,

How can I adjust it?

• Deep neural paper embeddings

• Explain with linear bigrams

Beta: s2-sanity.apps.allenai.org

[Figure: schematic scatter plot of instances (+ and o) around a point x′, annotated with bigram features such as "Tuning Agents" and "Turing".]

If all one cared about was the explanatory model, one could change these parameters… but not even the features are shared with the neural model!


Instead, we generate new training instances by varying the feedback feature, weight them by their distance to x′,


and retrain.
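A hedged sketch of that retraining idea follows. The number of synthetic copies, the RBF distance weighting, the label assigned to the synthetic instances, and the logistic-regression learner are all my own assumptions; the slide does not specify these details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_with_feedback(X, y, model, x_prime, feedback_feature, feedback_label,
                          n_copies=20, sigma=1.0):
    """Generate synthetic instances that vary the feedback feature around x_prime,
    weight them by distance to x_prime, append them to the training set, and retrain."""
    synthetic = np.tile(x_prime, (n_copies, 1))
    # Vary only the feature the user gave feedback about.
    synthetic[:, feedback_feature] = np.linspace(X[:, feedback_feature].min(),
                                                 X[:, feedback_feature].max(), n_copies)
    synthetic_labels = np.full(n_copies, feedback_label)
    # Weight synthetic points by an RBF kernel on their distance to x_prime.
    weights_syn = np.exp(-np.sum((synthetic - x_prime) ** 2, axis=1) / (2 * sigma ** 2))
    X_aug = np.vstack([X, synthetic])
    y_aug = np.concatenate([y, synthetic_labels])
    w_aug = np.concatenate([np.ones(len(y)), weights_syn])
    return model.fit(X_aug, y_aug, sample_weight=w_aug)

# Toy usage on invented data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
x_prime = X[0]
clf = retrain_with_feedback(X, y, LogisticRegression(), x_prime,
                            feedback_feature=2, feedback_label=1)
print(clf.predict([x_prime]))
```

The reweighted fit nudges the model in the neighborhood of x′ without discarding the original training data.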

Good News: …

Less Good News: no significant improvement on feed quality (team performance), as measured by click-through.

Summary


ML model for readmission prediction: the classifier advises a human decision maker. When can I trust it? How can I adjust it?


Human-AI Team: helpful. Intelligible? Yes → use directly; No → map to a simpler model and interact with it (explanations, controls).

[Closing recap slide: the 18 guidelines for human-AI interaction]

https://aka.ms/aiguidelines

Thanks! Questions?

Tutorial website: https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/articles/aaai-2020-tutorial-guidelines-for-human-ai-interaction/

Learn the guidelines

Introduction to guidelines for human-AI interaction

Interactive cards with examples of the guidelines in practice

Use the guidelines in your work

Printable cards (PDF)

Printable poster (PDF)

Find out more

Guidelines for human-AI interaction design, Microsoft Research Blog

AI guidelines in the creative process: How we’re putting the human-AI guidelines into practice at

Microsoft, Microsoft Design on Medium

How to build effective human-AI interaction: Considerations for machine learning and software

engineering, Microsoft Research Blog
