Safe and Fair Machine Learning · 2019-10-06
TRANSCRIPT
Safe and Fair Machine Learning
Philip S. Thomas
College of Information and Computer Sciences, UMass Amherst
MSR Reinforcement Learning Day, October 3, 2019
Andy Barto, Blossom Metevier, Stephen Giguere, Bruno Castro da Silva, Emma Brunskill, Yuriy Brun, Georgios Theocharous, Mohammad Ghavamzadeh, Ari Kobren, Sarah Brockman
Machine learning algorithms should avoid undesirable behaviors.
Undesirable behavior of ML algorithms is causing harm.
Supervised Learning (Classification and Regression)
Reinforcement Learning
Can we create algorithms that allow their users to more easily control their behavior?
Desiderata
• Easy for users to define undesirable behavior.
• Guarantee that the algorithm will not produce this undesirable behavior.
Mean Time Hypoglycemic vs. Weighted Mean Time Hypoglycemic
Learn to predict loan repayment, and don’t discriminate.
Data, D → Algorithm, a → Solution, θ (a classifier over features X1 and X2, separating Repay from Default)
Learn to predict job aptitude, and don’t discriminate.
Data, D → Algorithm, a → Solution, θ (a classifier over features X1 and X2, separating Hire from Decline)
Learn to predict landslide severity, but don’t under-estimate.
Data, D → Algorithm, a → Solution, θ (a regressor from features X to predicted Severity)
Learn how much insulin to inject, but don’t increase the prevalence of hypoglycemia.
Data, D → Algorithm, a → Solution, θ
injection = (current − target)/θ1 + (meal size)/θ2
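Written as code, this two-parameter solution form is just the following (the parameter and input values below are purely illustrative, not clinical guidance):

```python
# The bolus-calculator solution form above: theta1 (correction factor) and
# theta2 (carbohydrate ratio) are the parameters the algorithm learns.
def injection(current, target, meal_size, theta1, theta2):
    return (current - target) / theta1 + meal_size / theta2

# Illustrative values only: glucose 180 mg/dL, target 100, 60 g meal.
print(injection(current=180, target=100, meal_size=60, theta1=40, theta2=15))
```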
Predict the average human height, but do not over-estimate.
Data, D → Algorithm, a → Solution, θ: a prediction, e.g., 1.7 meters
Achieve [main goal] but do not produce [undesirable behavior].
Data, D; Probability, δ → Algorithm, a → Solution, θ (a classifier over features X1 and X2 with labels +1 and −1)
Learn to predict loan repayment, and don’t discriminate.
Data, D; Probability, δ = 0.01 → Algorithm, a → Solution, θ (a Repay/Default classifier over X1 and X2; data from four people)
Learn how much insulin to inject, but don’t ever allow blood glucose to deviate from optimal by more than 1.2 mg/dL.
Data, D; Probability, δ = 0.05 → Algorithm, a → Solution, θ
injection = (current − target)/θ1 + (meal size)/θ2
Achieve [main goal] but do not produce [undesirable behavior].
Data, D; Probability, δ → Algorithm, a → Solution, θ OR No Solution Found (a classifier over X1 and X2 with labels +1 and −1)
Desiderata
• Interface for defining undesirable behavior.
• User-specified probability, δ.
• Guarantee that the probability of a solution that produces undesirable behavior is at most δ.
Notation
• Let D be all of the training data; D is a random variable.
• Let Θ be the set of all possible solutions the algorithm can return.
• Let f : Θ → ℝ be the primary objective function.
• Let a be a machine learning algorithm, a : 𝒟 → Θ; a(D) is the solution returned by the algorithm when run on data D.
• Let g : Θ → ℝ be a function that measures undesirable behavior: g(θ) ≤ 0 if and only if θ does not produce undesirable behavior, and g(θ) > 0 if and only if θ produces undesirable behavior.
• Let NSF ∈ Θ with g(NSF) = 0.
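The contract these definitions describe can be sketched with a deliberately trivial toy example (the constraint and the algorithm below are illustrative only, and this sketch shows the return contract, not the high-confidence guarantee itself):

```python
# Toy sketch of the Seldonian contract: an algorithm a maps data D to either
# a solution theta or the sentinel NSF ("No Solution Found"), with g(NSF) = 0.
NSF = "NSF"

def g(theta):
    # Illustrative constraint: any solution above 1.0 counts as undesirable.
    return 0.0 if theta == NSF else theta - 1.0

def seldonian_algorithm(D):
    # Trivial stand-in algorithm: return the data mean only if g <= 0.
    theta = sum(D) / len(D)
    return theta if g(theta) <= 0 else NSF

print(seldonian_algorithm([0.2, 0.4]))  # a solution satisfying the constraint
print(seldonian_algorithm([2.0, 4.0]))  # NSF: no safe solution found
```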
Desiderata
• Provide the user with an interface for defining undesirable behavior (i.e., defining g).
• Attempt to optimize a (possibly user-provided) objective f.
• Guarantee that
Pr(g(a(D)) ≤ 0) ≥ 1 − δ.
• The probability that the algorithm returns a solution that does not produce undesirable behavior is at least 1 − δ.
• The probability that the algorithm returns a solution that produces undesirable behavior is at most δ.
• We need a name for algorithms that provide this guarantee.
Pr(g(a(D)) ≤ 0) ≥ 1 − δ
• An algorithm that provides this guarantee is safe.
• An algorithm that provides this guarantee is Seldonian.
• Quasi-Seldonian: relies on reasonable but false assumptions (e.g., appeals to the central limit theorem).
Seldonian Framework
• A framework for designing machine learning algorithms.
• Provide the user with an interface for defining undesirable behavior (i.e., defining g).
• Attempt to optimize a (possibly user-provided) objective f.
• Guarantee that Pr(g(a(D)) ≤ 0) ≥ 1 − δ.
• This guarantee does not depend on any hyperparameter settings.
• I am not promoting a specific algorithm.
• The algorithms I am going to discuss are extremely simple examples; they show feasibility.
• I am promoting the framework.
Example Usage
• X is a vector of features describing a person convicted of a crime.
• Y is 1 if the person committed a subsequent violent crime and 0 otherwise.
• Find a solution, θ, such that ŷ(X, θ) is a good estimator of Y.
• g(θ) = Pr(ŷ(X, θ) = 1 | White) − Pr(ŷ(X, θ) = 1 | Not White) − ε
  • “Demographic Parity”
  • g(θ) ≤ 0 iff Pr(ŷ(X, θ) = 1 | White) ≈ Pr(ŷ(X, θ) = 1 | Not White)
• g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
  • “Predictive Equality”
  • g(θ) ≤ 0 iff Pr(FP | White) ≈ Pr(FP | Not White)
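As a rough illustration, a plug-in sample estimate of the demographic-parity constraint might look like the following (the function and group names are hypothetical, and this is the simple ratio estimate, not necessarily the exact estimator used in any particular Seldonian algorithm):

```python
# Plug-in estimate of g(theta) = Pr(yhat=1 | White) - Pr(yhat=1 | Not White) - eps,
# computed from predictions yhat and group membership labels.
def demographic_parity_g_hat(yhat, group, eps):
    white = [p for p, a in zip(yhat, group) if a == "white"]
    other = [p for p, a in zip(yhat, group) if a != "white"]
    rate_white = sum(white) / len(white)
    rate_other = sum(other) / len(other)
    return rate_white - rate_other - eps

yhat  = [1, 0, 1, 1, 0, 0]
group = ["white", "white", "white", "other", "other", "other"]
print(demographic_parity_g_hat(yhat, group, eps=0.1))  # positive: constraint violated
```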
Minimize classification loss, using g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
Data, D; Probability, δ → Algorithm, a → Solution, θ OR No Solution Found (a classifier over X1 and X2 with labels +1 and −1)
• Provide code for g:
Minimize classification loss, using g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
• Provide code for unbiased estimates of g:
Minimize classification loss, using g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
• Write an equation for g:
g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
• Users can use any of the common variables already provided:
  • Classification: FP, FN, TP, TN, conditional FP, accuracy, probability positive, etc.
  • Regression: MSE, ME, conditional MSE, conditional ME, mean prediction, etc.
  • Reinforcement learning: expected return, conditional expected return.
• Users can use any other variable for which they can provide unbiased estimates from data.
• Users can use any supported operator: +, −, /, abs, min, max, etc.
Minimize classification loss, using g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
Minimize classification loss, using g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε
Data, D; Probability, δ → Algorithm, a → Solution, θ OR No Solution Found (a classifier over X1 and X2 with labels +1 and −1)
Algorithm 1
The data D is partitioned into D_candidate and D_safety. Candidate Selection runs on D_candidate and produces a candidate solution, θ_c. The Safety Test runs on D_safety and returns one of {θ_c, NSF}.
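The partition step can be sketched as follows (the 60/40 split, the fixed seed, and the function name are illustrative assumptions, not a prescription):

```python
import random

# Split D into D_candidate and D_safety for the two-stage algorithm above.
def partition(D, safety_frac=0.4, seed=0):
    idx = list(range(len(D)))
    random.Random(seed).shuffle(idx)        # deterministic shuffle for the demo
    n_safety = int(len(D) * safety_frac)
    safety_idx = set(idx[:n_safety])
    D_safety = [D[i] for i in sorted(safety_idx)]
    D_candidate = [D[i] for i in sorted(set(idx) - safety_idx)]
    return D_candidate, D_safety

D_cand, D_safe = partition(list(range(10)))
print(len(D_cand), len(D_safe))  # 6 4
```

How to choose this split is itself an open question (it reappears under "How to partition data?" in the future-directions slide).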
Safety Test
• Consider the earlier example: g(θ) = Pr(FP | White) − Pr(FP | Not White) − ε.
• Given θ_c and D_safety, output either θ_c or NSF.
(Figure: confidence intervals on the base variables, e.g. Pr(FP | White) ∈ [0, 1], are propagated up through the expression tree for g (subtraction, abs, ε) to obtain a high-confidence bound on g(θ_c).)
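A minimal quasi-Seldonian safety test can be sketched as follows, using a normal (CLT) approximation in place of an exact concentration inequality, in the spirit of the "appeals to the central limit theorem" variant above; the function name and the one-sided bound are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Given unbiased per-sample estimates g_i of g(theta_c) computed on D_safety,
# form a one-sided (1 - delta) upper confidence bound on g(theta_c); return
# theta_c only if the bound is <= 0, otherwise NSF.
def safety_test(theta_c, g_hats, delta):
    n = len(g_hats)
    z = NormalDist().inv_cdf(1 - delta)            # normal quantile (CLT appeal)
    upper = mean(g_hats) + z * stdev(g_hats) / sqrt(n)
    return theta_c if upper <= 0 else "NSF"

# Estimates concentrated well below zero pass; estimates near zero do not.
print(safety_test("theta_c", [-0.5, -0.4, -0.6, -0.5], delta=0.05))
print(safety_test("theta_c", [0.1, -0.1, 0.2, 0.0], delta=0.05))
```

A fully Seldonian version would replace the normal quantile with a distribution-free bound (e.g., Hoeffding's inequality) at the cost of needing more data.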
Candidate Selection
• Use D_candidate to pick the solution, θ_c, predicted to:
  • optimize the primary objective, f, and
  • pass the subsequent safety test.
Reinforcement Learning
• Historical data, D, is data collected from running some current policy, π_cur.
• A solution, θ, is a policy or policy parameters.
• The user can define multiple objectives (reward functions), and can require improvement (or limit degradation) with respect to all of them.
• g(θ) = E[∑_{t=0}^∞ γ^t R_t | π_cur] − E[∑_{t=0}^∞ γ^t R_t | θ]
Pr(E[∑_{t=0}^∞ γ^t R_t | π_cur] ≤ E[∑_{t=0}^∞ γ^t R_t | a(D)]) ≥ 1 − δ
• Monte Carlo returns are unbiased estimates of E[∑_{t=0}^∞ γ^t R_t | π_cur].
• Use importance sampling to obtain unbiased estimates of E[∑_{t=0}^∞ γ^t R_t | θ_c].
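The per-trajectory importance sampling estimator mentioned above can be sketched as follows (the trajectory encoding, with per-step action probabilities under both policies, is an illustrative assumption):

```python
# Importance-weighted return: an unbiased estimate of the expected return of a
# candidate policy pi_theta, computed from a trajectory generated by pi_cur.
# Each step is a tuple (prob of action under pi_cur, prob under pi_theta, reward).
def importance_weighted_return(trajectory, gamma=1.0):
    weight, ret, discount = 1.0, 0.0, 1.0
    for p_cur, p_theta, r in trajectory:
        weight *= p_theta / p_cur   # likelihood ratio of the observed actions
        ret += discount * r
        discount *= gamma
    return weight * ret

# One trajectory where pi_theta would take each observed action twice as often.
traj = [(0.5, 1.0, 1.0), (0.5, 1.0, 1.0)]
print(importance_weighted_return(traj))
```

Averaging this quantity over many trajectories from π_cur gives the unbiased estimates the safety test consumes, though the variance of the weights grows with trajectory length.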
Reinforcement Learning
• The ability to require improvement w.r.t. multiple objectives makes objective specification easier.
• Try to change the current policy to one that reaches the goal quicker in expectation.
• Do not increase the probability that the agent steps in the water.
(Figure: gridworld navigation task from Start to Goal.)
A Powerful Interface for Reinforcement Learning
• Have the user label trajectories with a value L ∈ {0, 1}:
  • undesirable event: 1
  • no undesirable event: 0
• E[L] is the probability that the undesirable event will occur.
• Let g(θ) = E[L | θ] − E[L | π_cur], so that
Pr(E[L | a(D)] ≤ E[L | π_cur]) ≥ 1 − δ.
• The probability that the policy will be changed to one that increases the probability of an undesirable event is at most δ.
• The user need only be able to identify undesirable events!
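A sketch of estimating this label-based g from data, assuming importance-weighted labels for the candidate policy have already been computed (all names here are illustrative):

```python
# Estimate g(theta) = E[L | theta] - E[L | pi_cur] from trajectory labels:
# labels_cur are the user's 0/1 labels on trajectories from pi_cur, and
# weighted_labels are the same labels multiplied by importance weights,
# giving an off-policy estimate of the event probability under theta.
def g_hat_event_rate(labels_cur, weighted_labels):
    e_cur = sum(labels_cur) / len(labels_cur)
    e_theta = sum(weighted_labels) / len(weighted_labels)
    return e_theta - e_cur  # g <= 0 means the event probability did not increase

labels_cur = [1, 0, 0, 0]               # pi_cur: observed event rate 0.25
weighted = [0.5, 0.0, 0.0, 0.0]         # importance-weighted labels for theta
print(g_hat_event_rate(labels_cur, weighted))  # negative: estimated rate fell
```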
Example: Type 1 Diabetes Management
• Undesirable event: the person experienced hypoglycemia during the day.
• Try to keep blood glucose close to ideal levels (primary reward function), but guarantee with probability 0.95 that the probability of a hypoglycemic event will not increase.
Example: Classification (GPA, Disparate Impact)
Example: Classification (GPA, Demographic Parity)
Example: Classification (GPA, Equalized Odds)
Example: Bandit (Tutoring)
Example: Bandit (Tutoring, Skewed Proportions)
Example: Bandit (Loan Approval, Disparate Impact)
Example: Bandit (Recidivism, Statistical Parity)
Example: RL (Type 1 Diabetes Management)
Example: RL (Type 1 Diabetes Management)
Example: RL (Type 1 Diabetes Management)
Example: RL (Type 1 Diabetes Management)
Example: RL (HCPI, Mountain Car)
Example: RL (Daedalus)
Example: RL (Require significant improvement)
Future Research Directions
• How to partition data?
• Why NSF? (Not enough data? Conflicting constraints? Failed internal prediction? Available δ?)
• How to divide up δ among base variables and solutions / intelligent interval propagation?
• How to trade off the primary objective and the predicted safety test in candidate selection in a principled way?
• Secure Seldonian algorithms.
• Combine with reward machines (specification for g, and perhaps f)?
• Multi-agent RL (with different constraints on different agents)?
• Extend Fairlearn to be Seldonian / to settings other than classification?
• Improved off-policy estimators for RL safety tests.
• Sequential Seldonian algorithms.
• Efficient optimization in candidate selection.
• Actual HCI interface (natural language?).
• Better concentration inequalities (sequences?).
Watch For:
• High Confidence Policy Improvement (ICML 2015)
• Offline Contextual Bandits with High Probability Fairness Guarantees (NeurIPS 2019)
• On Ensuring that Intelligent Machines are Well-Behaved (2017 arXiv paper, updated soon!)