Data mining, privacy and (non-)discrimination
Bettina Berendt, KU Leuven
Knowledge and the Web /
Privacy and Big Data courses 2015. Last updated 9 December 2015
Agenda
• Motivation: concepts and current cases
• (Classical) discrimination-aware data mining
• Exploratory discrimination-aware data mining; evaluation
• (Some) limitations + outlook
Privacy and non-discrimination
Two fundamental rights. In ICT and data mining:
• Violations may result from the use of certain information
• Protection may result from changing processing w.r.t. this information (e.g. "features")
→ "privacy-preserving data mining/publishing"
→ "discrimination-aware data mining"
Is this discrimination?
https://www.wonga.com analyses, among other things, your social-media data to determine your creditworthiness. Assume (cf. examples from last week) that it generates patterns that deny a loan to:
1. People who like Converse sneakers
2. People who like Oil of Olay
Assume that this is because people who ... in the past very rarely paid back their loans.
(from Martijn Van Otterlo's presentation in Privacy and Big Data 2015)
PS: China's Social Credit Score (1) (from the Los Angeles Times)
"[I]n China, government authorities are hard at work devising their own e-database to rate each and every one of the nation's 1.3 billion citizens by 2020 using metrics that include whether they pay their bills on time, plagiarize schoolwork, break traffic laws or adhere to birth-control regulations."
PS: China's Social Credit Score (2)
China — largely atheist and lacking a strong civil society sector — has struggled for years to find a way to incentivize and reward moral and responsible behavior. It has launched appeals for citizens to uphold "traditional Chinese values" and […]
But the country continues to be shocked by incidents of callous, dishonest and immoral behavior, such as pedestrians refusing to help seniors who have fallen down (because they fear being sued by elderly extortionists), and motorists who accidentally strike pedestrians intentionally hitting them again to ensure they're dead (otherwise, the motorist would have to pay lifelong compensation for injuries).
The Social Credit System, the State Council says, offers hope of addressing this: "Only if there is mutual sincere treatment between members of society, and only if sincerity is fundamental, will it be possible to create harmonious and amicable interpersonal relationships. ... and realize social harmony, stability and a long period of peace and order."
Data and discrimination
E.g. a credit scoring & loan granting system
• uses/shares a person's personal data
• makes loan decisions depend on personal data
= differential treatment
Differential treatment is unlawful discrimination if it is based on "unjust grounds" (e.g., gender).
Attention! This is a preliminary definition in the legal sense!
"Discrimination is forbidden"
In many areas, including labour, loans, and insurance.
The protected-by-law grounds differ by area, but usually include gender, disability, age, sexual orientation, and cultural, religious and linguistic beliefs/affiliation.
A short intro: (Naudts, 2015) – PaBD lecture #6
“You may no longer ...“
European Court of Justice (2011) Case C-236/09, Association Belge des Consommateurs Test-Achats ASBL and Others v Conseil des ministres:
(18) The use of actuarial factors related to sex is widespread in the provision of insurance and other related financial services. In order to ensure equal treatment between men and women, the use of sex as an actuarial factor should not result in differences in individuals’ premiums and benefits. To avoid a sudden readjustment of the market, the implementation of this rule should apply only to new contracts concluded after the date of transposition of this Directive.
Historical examples: only { rich | white | male } people get to vote
Data mining (DM) and discrimination (D) (1)
"DM avoids D." E.g. in the domain of predictive policing:
• Dave Eggers, The Circle: start-up pitch (warning: satire)
• Chicago police "heat list"
• Relapse prediction and parole decisions
From The Economist, 2014
“The data that matter include the prisoner’s age at first arrest, his education, the nature of his crime, his behaviour in prison, his friends’ criminal records, the results of psychometric tests and even the sobriety of his mother while he was in the womb. The software estimates the probability that an inmate will relapse by comparing his profile with many others. The American version of LS/CMI, for example, holds data on 135,000 (and counting) parolees.
It is better to be guided by software than one’s gut, says Olivia Craven, head of the Idaho Commission of Pardons and Parole. Donna Sytek of the New Hampshire Parole Board agrees. Unaided, parole board members rely too much on their personal experiences and make inconsistent decisions, she says.”
What's right about this? What's wrong with this? (Reflection question)
Recommended reading: a legal view of predictive policing and Big Data: (Ferguson, 2015). More CS thinking: (Berendt, 2015).
DM and D (2)
"DM can lead to D, but ... hm ... maybe there's something to it?"
Cf. Laurens Naudts' remarks on the rational basis test in law and the assumptions of rationality concerning statistics and data mining.
Cf. "It is better to be guided by software than one's gut" above.
What's right about this? What's wrong with this? (Reflection question)
DM and D (3)
"DM can lead to D, but modifying the algorithm can fix it."
Classical discrimination-aware data mining
What's right about this? What's wrong with this? (Part of today's lecture)
Recommended reading: sources and critique in (Berendt & Preibusch, 2014).
DM and D (4)
"The point of DM is D. (And so is much of human civilization?!) DM can lead to D, but making the workings of the algorithm transparent can help make this more visible and encourage reflection and, ultimately, corrective action."
Exploratory discrimination-aware data mining
What's right about this? What's wrong with this? (Part of today's lecture; reflection question)
Recommended reading: (Berendt & Preibusch, 2014).
Agenda
• Motivation: concepts and current cases
• (Classical) discrimination-aware data mining
• Exploratory discrimination-aware data mining; evaluation
• (Some) limitations + outlook
Pedreschi, Ruggieri, & Turini (2008)
PD and PND items: potentially discriminatory / potentially non-discriminatory
• Goal: detect & block mined rules such as
  purpose=new_car & gender=female → credit=no
• Measures of the discriminatory power of a rule include
  elift(B & A → C) = conf(B & A → C) / conf(B → C),
  where A is a PD item and B a PND item
Note: two uses/tasks of data mining here:
• Descriptive: "In the past, women who got a loan for a new car often defaulted on it."
• Prescriptive: (Therefore) "Women who want a new car should not get a loan."
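To make the measure concrete, here is a minimal Python sketch of conf and elift, assuming a toy encoding of records as sets of "attribute=value" item strings. The records, item names, and numbers are invented for illustration; they are not from the paper.

```python
# A minimal sketch of confidence and elift (Pedreschi, Ruggieri & Turini, 2008).
# Records are encoded as sets of "attribute=value" items; all data below is invented.

def conf(records, premise, conclusion):
    """Confidence of the rule premise -> conclusion, i.e. P(conclusion | premise)."""
    covered = [r for r in records if premise <= r]
    if not covered:
        return 0.0
    return sum(1 for r in covered if conclusion <= r) / len(covered)

def elift(records, pd_items, pnd_items, conclusion):
    """elift(B & A -> C) = conf(B & A -> C) / conf(B -> C),
    where A is a set of PD items and B a set of PND items."""
    denom = conf(records, pnd_items, conclusion)
    if denom == 0.0:
        return float("inf")
    return conf(records, pd_items | pnd_items, conclusion) / denom

# How much does adding gender=female "lift" the denial rate among new-car applicants?
records = [
    {"gender=female", "purpose=new_car", "credit=no"},
    {"gender=female", "purpose=new_car", "credit=no"},
    {"gender=male", "purpose=new_car", "credit=yes"},
    {"gender=male", "purpose=new_car", "credit=no"},
]
print(elift(records, {"gender=female"}, {"purpose=new_car"}, {"credit=no"}))  # 1.0 / 0.75 ≈ 1.33
```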
Why not just "delete" PD attributes?
• If the focus is detection: this prevents detection.
• If the focus is prevention: this may reproduce indirect discrimination ... and that indirect discrimination will also not be detected! (See the toy example below.)
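A toy illustration of the prevention problem, reusing conf from the sketch above. The data is invented so that a PND attribute (district) is highly correlated with the deleted PD attribute (gender) and acts as a proxy.

```python
# Gender has been deleted before mining, but district D1 is (say) mostly female,
# so the miner rediscovers the same discriminatory pattern via the proxy attribute.
records_without_gender = [
    {"district=D1", "credit=no"}, {"district=D1", "credit=no"}, {"district=D1", "credit=no"},
    {"district=D2", "credit=yes"}, {"district=D2", "credit=yes"}, {"district=D2", "credit=no"},
]
# The rule district=D1 -> credit=no still comes out with confidence 1.0, denying
# credit to (mostly) the same women as before, and with gender gone this indirect
# discrimination can no longer be detected from the data alone.
print(conf(records_without_gender, {"district=D1"}, {"credit=no"}))  # 1.0
```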
DADM: Examples and DCUBE output
Three points of intervention for DADM – algorithmic / "classical"
• Post-processing: as a filter on the mining results (e.g. DCUBE); a filtering sketch follows this list.
• Pre-processing: similar to the distortion-based techniques for privacy-preserving association-rule mining; e.g. Hajian et al. 2013ff.
• In-processing: e.g. Kamiran et al. 2010 change the tree-learning algorithm: at each node, the good split is the one that achieves high purity with respect to the class label (e.g. credit good/bad) but low purity with respect to the sensitive attribute (e.g. gender).
Many algorithms also avoid indirect discrimination (as formally defined via correlations / probabilistic implication).
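As a sketch of the post-processing idea (DCUBE does this far more thoroughly, directly on the database): filter out mined rules whose elift reaches a threshold α. The rule list, the precomputed confidences, and the threshold value are all invented for illustration.

```python
# Post-processing sketch: block rules that are alpha-discriminatory, i.e. whose
# elift = conf(B & A -> C) / conf(B -> C) reaches a chosen threshold alpha.
rules = [
    {"premise": "gender=female & purpose=new_car", "conclusion": "credit=no",
     "conf_ab": 0.80, "conf_b": 0.30},   # elift ≈ 2.67
    {"premise": "savings=low & purpose=new_car", "conclusion": "credit=no",
     "conf_ab": 0.60, "conf_b": 0.55},   # elift ≈ 1.09
]
alpha = 2.0

def is_alpha_discriminatory(rule, alpha):
    return rule["conf_ab"] / rule["conf_b"] >= alpha

blocked = [r for r in rules if is_alpha_discriminatory(r, alpha)]
kept = [r for r in rules if not is_alpha_discriminatory(r, alpha)]
print(len(blocked), len(kept))  # 1 1
```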
Recall: Example weather data
NoTrueHighMildRainy
YesFalseNormalHotOvercast
YesTrueHighMildOvercast
YesTrueNormalMildSunny
YesFalseNormalMildRainy
YesFalseNormalCoolSunny
NoFalseHighMildSunny
YesTrueNormalCoolOvercast
NoTrueNormalCoolRainy
YesFalseNormalCoolRainy
YesFalseHighMildRainy
YesFalseHighHot Overcast
NoTrueHigh Hot Sunny
NoFalseHighHotSunny
PlayWindyHumidityTempOutlook
Recall: Decision tree learning for classification / prediction
In which weather will someone play (tennis etc.)?
Result: a decision tree (shown on the slide, learned from the WEKA weather data); but how do we get there?
Recall: Which attribute to select?
Based on the highest purity of the class attribute in the new nodes (measured by entropy / information gain). A small calculation sketch follows.
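A minimal sketch of this selection step on the 14 rows above (the standard WEKA weather data); entropy and information gain are computed exactly as in the textbook treatment.

```python
from collections import Counter
from math import log2

# The 14 rows of the weather table above, as (Outlook, Temp, Humidity, Windy, Play).
DATA = [
    ("Rainy","Mild","High",True,"No"), ("Overcast","Hot","Normal",False,"Yes"),
    ("Overcast","Mild","High",True,"Yes"), ("Sunny","Mild","Normal",True,"Yes"),
    ("Rainy","Mild","Normal",False,"Yes"), ("Sunny","Cool","Normal",False,"Yes"),
    ("Sunny","Mild","High",False,"No"), ("Overcast","Cool","Normal",True,"Yes"),
    ("Rainy","Cool","Normal",True,"No"), ("Rainy","Cool","Normal",False,"Yes"),
    ("Rainy","Mild","High",False,"Yes"), ("Overcast","Hot","High",False,"Yes"),
    ("Sunny","Hot","High",True,"No"), ("Sunny","Hot","High",False,"No"),
]
COLS = ["Outlook", "Temp", "Humidity", "Windy", "Play"]
rows = [dict(zip(COLS, r)) for r in DATA]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, attribute, target="Play"):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    gain = entropy([r[target] for r in rows])
    n = len(rows)
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

for a in ["Outlook", "Temp", "Humidity", "Windy"]:
    print(a, round(info_gain(rows, a), 3))
# Outlook wins (≈ 0.247), so it becomes the root split.
```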
Extending the weather data
Goal: learn a classifier that does not discriminate by gender

Gender  Outlook   Temp  Humidity  Windy  Play
M       Rainy     Mild  High      True   No
F       Overcast  Hot   Normal    False  Yes
M       Overcast  Mild  High      True   Yes
M       Sunny     Mild  Normal    True   Yes
F       Rainy     Mild  Normal    False  Yes
M       Sunny     Cool  Normal    False  Yes
F       Sunny     Mild  High      False  No
M       Overcast  Cool  Normal    True   Yes
F       Rainy     Cool  Normal    True   No
F       Rainy     Cool  Normal    False  Yes
M       Rainy     Mild  High      False  Yes
M       Overcast  Hot   High      False  Yes
F       Sunny     Hot   High      True   No
M       Sunny     Hot   High      False  No
Assume this "pattern" in the new weather data
(the slide repeats the extended table above, with a pattern in it highlighted)
Which attribute to select now?
Based on the highest purity of the class attribute in the new nodes (measured by entropy / information gain), AND each node being low in purity w.r.t. gender (~ half/half)!
(Of course, in general, this need not lead to the selection of the same attribute!)
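A minimal sketch of such a discrimination-aware split criterion, in the spirit of Kamiran, Calders & Pechenizkiy (2010): reward purity w.r.t. the class label and penalize purity w.r.t. the sensitive attribute. The simple subtraction IG_class − IG_sensitive is only one of the variants they study; rows and info_gain come from the sketch above, extended here with the Gender column.

```python
# Attach the Gender column from the extended table to the weather rows.
GENDERS = ["M","F","M","M","F","M","F","M","F","F","M","M","F","M"]
for r, g in zip(rows, GENDERS):
    r["Gender"] = g

def dadm_split_score(rows, attribute, target="Play", sensitive="Gender"):
    """Higher is better: informative about Play, uninformative about Gender."""
    return info_gain(rows, attribute, target) - info_gain(rows, attribute, sensitive)

# The tree learner then picks, at each node, the attribute maximizing this score
# instead of maximizing information gain w.r.t. the class alone.
best = max(["Outlook", "Temp", "Humidity", "Windy"],
           key=lambda a: dadm_split_score(rows, a))
print(best)
```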
Agenda
• Motivation: concepts and current cases
• (Classical) discrimination-aware data mining
• Exploratory discrimination-aware data mining; evaluation
• (Some) limitations + outlook
Decision making: DM only?
But are (e.g. loan) decisions made fully automatically?
Cf. EU Privacy Directive, Article 15(1): "Member States shall grant the right to every person not to be subject to a decision which produces legal effects concerning him or significantly affects him and which is based solely on automated processing of data intended to evaluate certain personal aspects relating to him, such as his performance at work, creditworthiness, reliability, conduct, etc."
Four points of intervention for DADM – algorithmic & beyond
• Pre-processing
• In-processing
• Post-processing: as a filter on the mining results (e.g. DCUBE); hiding "bad patterns"
• In the interaction of a decision-support system (Berendt & Preibusch): hiding or highlighting "bad patterns"
Limitations of classical DADM: constraint-oriented vs. exploratory DADM

Detection:
• Constraint-oriented DADM can only detect discrimination by pre-defined features / constraints. Ex.: PD(female), PND(has-children), but discrimination of mothers goes undetected.
• Exploratory DADM: exploratory data analysis supports feature construction and new feature analyses.

Avoidance of creation:
• Constraint-oriented DADM, fully automatic decision making: cannot implement the legal concept of "treat equal things equally and different things differently" (AI-hard).
• Constraint-oriented DADM, semi-automated decision support: sanitized rules → sanitized minds?
• Exploratory DADM, fully automatic decision making: ?
• Exploratory DADM, semi-automated decision support: salience, awareness, reflection → better decisions?
How to do exploratory DADM?
• Patterns that characterize classes
• Patterns that characterize rules
• Items, itemsets
• Interestingness measures
• Visualisation, exploration, interactivity
Exploratory DADM: DCUBE-GUI (screenshots on slides)
• Left: rule count (size) vs. PD/non-PD (colour)
• Right: rule count (size) vs. AD-measure (rainbow-colours scale)
• DCUBE-GUI: co-occurrences of items in rule premises
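A hypothetical sketch of this kind of rule-scatter view using matplotlib: each point is a mined rule, sized by coverage and coloured by its discrimination measure. rules_df, its columns, and the numbers are assumptions for illustration, not DCUBE-GUI's actual data model or code.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented summary of mined rules: coverage, confidence, and elift per rule.
rules_df = pd.DataFrame({
    "support": [120, 45, 300, 80],      # how many cases each rule covers
    "confidence": [0.90, 0.60, 0.75, 0.50],
    "elift": [2.4, 1.1, 0.9, 3.0],      # discrimination measure
})

plt.scatter(rules_df["support"], rules_df["confidence"],
            s=rules_df["support"],                 # point size ~ rule coverage
            c=rules_df["elift"], cmap="rainbow")   # colour ~ elift ("rainbow-colours scale")
plt.colorbar(label="elift")
plt.xlabel("support")
plt.ylabel("confidence")
plt.title("Rules: size ~ coverage, colour ~ discrimination measure")
plt.show()
```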
Evaluating DADM
• Algorithm-centric, automated measures
• User studies
Evaluation: Comparing cDADM & eDADM
(see the comparison table above; in addition:)
• Constraint-oriented DADM: "hiding bad patterns", black box
• Exploratory DADM: "highlighting bad patterns", white box
A more accurate definition of unlawful discrimination
Equality and discrimination are two sides of the same coin: "The principle of equality requires that equal situations are treated equally and unequal situations differently. Failure to do so will amount to discrimination unless an objective and reasonable justification exists" (Explanatory memorandum, Protocol 12 to the ECHR).
Differential/unequal treatment vs. discrimination:
• Differential treatment: neutral; tells us nothing about the legal acceptability of a given measure.
• Discrimination: refers to unacceptable differential treatment (from a legal perspective).
Whether or not differential treatment is unacceptable, and thus amounts to discrimination, is determined by the choices of law makers and judicial review. However: differential treatment may be perceived as unfair/unjust even if tolerated by law.
An important example of European non-discrimination law
European Convention on Human Rights, Art. 14, Prohibition of Discrimination: "The enjoyment of the rights and freedoms set forth in this Convention shall be secured without discrimination on any ground such as sex, race, colour, language, religion, political or other opinion, national or social origin, association with a national minority, property, birth or other status."
Limitations (1): DADM's simple view of unlawful discrimination
1. A given differentiation in treatment may or may not be unlawful discrimination
   • depending on the agent
   • if based on "innocuous" reasons (indirect discrimination)
   • depending on whether situations are comparable ("treat equal things equally and unequal things unequally"); NOT differentiating by a protected attribute may constitute discrimination!
   • depending on aims and proportionality of means, e.g. "genuine occupational requirement"
   • depending on the changing social & legal environment
2. A fixed set of attributes makes it impossible to detect new forms of discrimination.
Data mining for loan decision support
(pipeline diagram on slide: Data → Algorithm → Pattern → Decision)
• Data: loan defaults; demographics, loan purposes
• Algorithm: DM, cDADM, eDADM
• Pattern: positive/negative risk factors; graphical presentation; with/without discrimination
• Decision: grant/deny loan, justify; assessed for actionability and decision quality
Online experiment with 215 US mTurkers
• Framing – Prevention: bank; Detection: agency
• Payment: $6.00 show-up fee; $0.25 performance bonus per assessed task
• Tasks: 3 exercise tasks, 6 assessed tasks
• Questionnaire: demographics, quant/bank job, experience with discrimination
Decision-making scenario
"Dabiku is a Kenyan national. She is single and has no children. She has been employed as a manager for the past 10 years. She now asks for a loan of $10,000 for 24 months to set up her own business. She has $100 in her checking account and no other debts. There have been some delays in paying back past loans."
Task structure
• Vignette, describing applicant and application
• Rules: positive/negative risks, flagged
• Decision and motivation, optional comment
Required competencies (a toy aggregation sketch follows this list)
• Discard discrimination-indexed rules
• Aggregate rule certainties
• Justify decision by categorising risk factors
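One plausible reading of these competencies as a sketch: discard rules flagged as discriminatory, aggregate the certainties of the remaining applicable rules, and grant the loan if the aggregate is positive. The rule set, certainty values, and sum-then-threshold aggregation are invented for illustration, not the study's exact scoring.

```python
# Hypothetical rules shown to a participant, each with a signed certainty.
rules = [
    {"certainty": +0.50, "reason": "stable employment", "discriminatory": False},
    {"certainty": -0.30, "reason": "delays on past loans", "discriminatory": False},
    {"certainty": -0.67, "reason": "applicant is female", "discriminatory": True},
]

# Competency 1: discard discrimination-indexed rules.
usable = [r for r in rules if not r["discriminatory"]]
# Competency 2: aggregate the remaining rule certainties (here: a simple sum).
score = sum(r["certainty"] for r in usable)
# Competency 3: derive and justify a decision from the aggregate.
decision = "grant" if score > 0 else "deny"
print(round(score, 2), decision)  # 0.2 grant
```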
Rule visualisation by treatment (screenshots on slide; example rule features: savings, residence, foreigner)
• Constrained DADM: hide bad features (prevention scenario)
• Exploratory DADM: flag bad features (detection scenario)
• (not DA)DM: neither flagged nor hidden
Actionability and decision quality
Decisions and motivations: DM versus DADM
• More correct decisions in DADM
• More correct motivations in DADM
• No performance impact
Relative merits:
• Constrained DADM better for prevention
• Exploratory DADM better for detection
Berendt, B., & Preibusch, S. (2014). Better decision support through exploratory discrimination-aware data mining. Artificial Intelligence and Law.
Biases: discrimination persistent in cDADM
"I dropped the −.67 number a little bit because it included her being a female as a reason."
Agenda
• Motivation: concepts and current cases
• (Classical) discrimination-aware data mining
• Exploratory discrimination-aware data mining; evaluation
• (Some) limitations + outlook
Limitations (1), recap: DADM's simple view of unlawful discrimination
A given differentiation in treatment may or may not be unlawful discrimination
• depending on the agent
• if based on "innocuous" reasons (indirect discrimination)
• depending on whether situations are comparable ("treat equal things equally and unequal things unequally")
• depending on aims and proportionality of means, e.g. "genuine occupational requirement"
• depending on the changing social & legal environment
Claim: The eDADM white-box approach can accommodate (some of) these complexities:
• provide more flexibility for detecting and avoiding discrimination by positioning itself as a decision-support system
• support awareness and reflection
• increase transparency
• increase accountability
Limitations (2) / Outlook: social / critical theories of discrimination
• New discrimination grounds (see the "mother" example)
• Further patterns related to discrimination: intersectionality; + and − of hiding / showing features
• The hidden assumptions (and effects!) of DM:
  – Ontological status of features? DM creates new features and new forms of discrimination.
  – Notion of social justice underlying allocation?
Outlook: Evaluating these claims in practice
(see the cDADM vs. eDADM comparison table above)
Outlook: Developing the automated parts of eDADM further
(see the cDADM vs. eDADM comparison table above)
Thank you!
References
• Makinen, J. (2015). China prepares to rank its citizens on 'social credit'. Los Angeles Times, 15 November 2015. http://www.latimes.com/world/asia/la-fg-china-credit-system-20151122-story.html
• The Economist (2014). Parole and technology: Prison breakthrough. 19 April 2014. http://www.economist.com/news/united-states/21601009-big-data-can-help-states-decide-whom-release-prison-prison-breakthrough
• Ferguson, A. G. (2015). Big data and predictive reasonable suspicion. University of Pennsylvania Law Review, 163(2), 327-410. http://scholarship.law.upenn.edu/cgi/viewcontent.cgi?article=9464&context=penn_law_review
• Berendt, B. (2015). Big Capta, Bad Science? http://people.cs.kuleuven.be/~bettina.berendt/Reviews/BigData.pdf
• Berendt, B., & Preibusch, S. (2014). Better decision support through exploratory discrimination-aware data mining: foundations and empirical evidence. Artificial Intelligence and Law, 22(2), 175-209. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_preibusch_2014.pdf
• Pedreschi, D., Ruggieri, S., & Turini, F. (2008). Discrimination-aware data mining. In Proceedings of KDD'08, pp. 560-568. ACM. http://www.di.unipi.it/~ruggieri/Papers/kdd2008.pdf
• Ruggieri, S., Pedreschi, D., & Turini, F. (2010). DCUBE: Discrimination discovery in databases. In Proceedings of SIGMOD'10, pp. 1127-1130. http://www.di.unipi.it/~ruggieri/Papers/dcube.pdf (and further papers by the same team)
• Hajian, S., & Domingo-Ferrer, J. (2013). A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, 25(7), 1445-1459. http://crises2-deim.urv.cat/docs/publications/journals/684.pdf
• Hajian, S., Domingo-Ferrer, J., & Farràs, O. (2014). Generalization-based privacy preservation and discrimination prevention in data publishing and mining. Data Mining and Knowledge Discovery, 28(5-6), 1158-1188. http://crises2-deim.urv.cat/docs/publications/journals/813.pdf
• Kamiran, F., Calders, T., & Pechenizkiy, M. (2010). Discrimination aware decision tree learning. In ICDM 2010, pp. 869-874. http://wwwis.win.tue.nl/~tcalders/pubs/TR10-13.pdf
• "EU Privacy Directive": Directive 95/46/EC of the European Parliament and of the Council of 24.10.1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (O.J. L 281, 23.11.1995)
• Bygrave, L. A. (2001). Minding the machine: Article 15 of the EC Data Protection Directive and automated profiling. Computer Law & Security Report, 17, 17-24. http://folk.uio.no/lee/oldpage/articles/Minding_machine.pdf
• Gao, B., & Berendt, B. (2011). Visual data mining for higher-level patterns: discrimination-aware data mining and beyond. In Proceedings of BENELEARN 2011. http://www.liacs.nl/~putten/benelearn2011/Benelearn2011_Proceedings.pdf