
Balanced Ranking with Diversity Constraints

Ke Yang¹*, Vasilis Gkatzelis² and Julia Stoyanovich¹
¹ New York University, Department of Computer Science and Engineering
² Drexel University, Department of Computer Science
[email protected], [email protected], [email protected]

* Contact Author

Abstract

Many set selection and ranking algorithms have recently been enhanced with diversity constraints that aim to explicitly increase representation of historically disadvantaged populations, or to improve the overall representativeness of the selected set. An unintended consequence of these constraints, however, is reduced in-group fairness: the selected candidates from a given group may not be the best ones, and this unfairness may not be well-balanced across groups. In this paper we study this phenomenon using datasets that comprise multiple sensitive attributes. We then introduce additional constraints, aimed at balancing the in-group fairness across groups, and formalize the induced optimization problems as integer linear programs. Using these programs, we conduct an experimental evaluation with real datasets, and quantify the feasible trade-offs between balance and overall performance in the presence of diversity constraints.

1 Introduction

The desire for diversity and fairness in many contexts, ranging from results of a Web search to admissions at a university, has recently introduced the need to revisit algorithm design in these settings. Prominent examples include set selection and ranking algorithms, which have recently been enhanced with diversity constraints [Celis et al., 2018; Drosou et al., 2017; Stoyanovich et al., 2018; Zehlike et al., 2017; Zliobaite, 2017]. Such constraints focus on groups of items in the input that satisfy a given sensitive attribute label, typically denoting membership in a demographic group, and seek to ensure that these groups are appropriately represented in the selected set or ranking. Notably, each item will often be associated with multiple attribute labels (e.g., an individual may be both female and Asian).

Diversity constraints may be imposed for legal reasons, such as for compliance with Title VII of the Civil Rights Act of 1964 [(EEOC), 2019]. Beyond legal requirements, benefits of diversity, both to small groups and to society as a whole, are increasingly recognized by sociologists and political scientists [Page, 2008; Surowiecki, 2005]. Last but not least, diversity constraints can be used to ensure dataset representativeness, for example when selecting a group of patients to study the effectiveness of a medical treatment, or to understand the patterns of use of medical services [Cohen et al., 2009], an example we will revisit in this paper.

Our goal in this paper is to evaluate and mitigate an unintended consequence that such diversity constraints may have on the outcomes of set selection and ranking algorithms. Namely, we want to ensure that these algorithms do not systematically select lower-quality items in particular groups. In what follows, we make our set-up more precise.

Given a set of items, each associated with multiple sensitive attribute labels and with a quality score (or utility), a set selection algorithm needs to select k of these items aiming to maximize the overall utility, computed as the sum of the utility scores of the selected items. The score of an item is a single scalar that may be pre-computed and stored as a physical attribute, or it may be computed on the fly. The output of traditional set selection algorithms, however, may lead to an underrepresentation of items with a specific sensitive attribute. As a result, recent work has aimed to modify these algorithms with the goal of introducing diversity.

There are numerous ways to define diversity. For set selection, a unifying formulation for a rich class of proportional representation fairness [Zliobaite, 2017] and coverage-based diversity [Drosou et al., 2017] measures is to specify a lower bound $\ell_v$ for each sensitive attribute value $v$, and to enforce it as the minimum cardinality of items satisfying $v$ in the output [Stoyanovich et al., 2018]. If the $k$ selected candidates need to also be ranked in the output, this formulation can be extended to specify a lower bound $\ell_{v,p}$ for every attribute value $v$ and every prefix $p$ of the returned ranked list, with $p \le k$ [Celis et al., 2018]. Then, at least $\ell_{v,p}$ items satisfying $v$ should appear in the top $p$ positions of the output; a small checker for this condition is sketched below. We refer to all of these as diversity constraints in the remainder of this paper. Given a set of diversity constraints, one can then seek to maximize the utility of the selected set, subject to these constraints.
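To make the prefix formulation concrete, here is a minimal checker, a sketch of our own rather than code from the paper, with hypothetical input shapes: `labels` maps each item to its set of attribute values, and `lower_bounds` maps each pair (v, p) to $\ell_{v,p}$.

```python
def satisfies_diversity(ranking, labels, lower_bounds):
    """Check prefix diversity constraints on a ranked list.

    ranking: list of item ids, best first.
    labels: dict mapping item id -> set of attribute values.
    lower_bounds: dict mapping (value v, prefix length p) -> minimum
        number of items satisfying v among the top p positions.
    """
    for (v, p), bound in lower_bounds.items():
        count = sum(1 for i in ranking[:p] if v in labels[i])
        if count < bound:
            return False  # fewer than ell_{v,p} items with value v in top p
    return True
```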

As expected, enforcing diversity constraints often comes at a price in terms of overall utility. Furthermore, the following simple example exhibits that, when faced with a combination of diversity constraints, maximizing utility subject to these constraints can lead to another form of imbalance.


              Male                 Female
White    A (99)   B (98)     C (96)   D (95)
Black    E (91)   F (91)     G (90)   H (89)
Asian    I (87)   J (87)     K (86)   L (83)

Table 1: A set of 12 individuals with sensitive attributes race and gender. Each cell lists an individual's ID, and score in parentheses.

Example 1. Consider 12 candidates who are applying for k = 4 committee positions. Table 1 illustrates this example using a letter from A to L as the candidate ID and specifying the race, gender, and score of each candidate (e.g., candidate E is a Black male with a score of 91). Suppose that the following diversity constraints are imposed: the committee should include two male and two female candidates, and at least one candidate from each race. In this example both race and gender are strongly correlated with the score: White candidates have the highest scores, followed by Black and then by Asian. Further, male candidates of each race have higher scores than female candidates of the same race.

The committee that maximizes utility while satisfying the diversity constraints is {A, B, G, K}, with utility score 373. Note that this outcome fails to select the highest-scoring female candidates (C and D), as well as the highest-scoring Black (E and F) and Asian (I and J) candidates. This is in contrast to the fact that it selects the best male and the best White candidates (A and B). This type of "unfairness" is unavoidable due to the diversity constraints, but in this outcome it hurts historically disadvantaged groups (e.g., females and Black candidates) more. However, one can still try to distribute this unfairness in a more "balanced" way across different sensitive attribute values. For instance, an alternative committee selection could be {A, C, E, K}, with utility 372. For only a small drop in utility, this ensures that the top female, male, White, and Black candidates are selected.
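The numbers in Example 1 are small enough to verify exhaustively. The following sketch (our own illustration, not from the paper) brute-forces all four-person committees that satisfy the stated diversity constraints and recovers the committees discussed above:

```python
from itertools import combinations

# Candidates from Table 1: (id, race, gender, score).
candidates = [
    ("A", "White", "M", 99), ("B", "White", "M", 98),
    ("C", "White", "F", 96), ("D", "White", "F", 95),
    ("E", "Black", "M", 91), ("F", "Black", "M", 91),
    ("G", "Black", "F", 90), ("H", "Black", "F", 89),
    ("I", "Asian", "M", 87), ("J", "Asian", "M", 87),
    ("K", "Asian", "F", 86), ("L", "Asian", "F", 83),
]

def feasible(committee):
    # Diversity constraints: two males (hence two females), all races present.
    races = {race for _, race, _, _ in committee}
    males = sum(1 for _, _, g, _ in committee if g == "M")
    return males == 2 and len(races) == 3

# Enumerate all 4-person committees and sort feasible ones by total utility.
feasible_sets = [c for c in combinations(candidates, 4) if feasible(c)]
feasible_sets.sort(key=lambda c: sum(s for _, _, _, s in c), reverse=True)
for c in feasible_sets[:3]:
    print(sorted(i for i, _, _, _ in c), sum(s for _, _, _, s in c))
# Best: ['A', 'B', 'G', 'K'] with utility 373. Several committees tie at 372,
# among them ['A', 'C', 'E', 'K'], the balanced alternative from the text.
```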

Example 1 illustrates that diversity constraints may inevitably lead to unfairness within groups of candidates. An important concern is that, unless appropriately managed, this unfairness may disproportionately affect demographic groups with lower scores. This is particularly problematic when lower scores are the result of historical disadvantage, as may, for example, be the case in standardized testing [Brunn-Bevel and Byrd, 2015]. In this paper, we focus on this phenomenon; our goal is to provide measures for quantifying fairness in this context, to which we refer as in-group fairness, and to study the extent to which its impact can be balanced across groups rather than disproportionately affect a few groups.

Contributions. We make the following contributions:

• We observe that diversity constraints can impact in-group fairness, and introduce two novel measures that quantify this impact.

• We observe that the extent to which in-group fairness is violated in datasets with multiple sensitive attributes may be quite different for different attribute values.

• We translate each of our in-group fairness measures into a set of constraints for an integer linear program.

• We experimentally evaluate the feasible trade-offs between balance and utility, under diversity constraints.

Organization. After providing the required definitions and notation in Section 2, we introduce our notions of in-group fairness in Section 3. We translate in-group fairness into constraints for an integer linear program in Section 4, and use the leximin criterion to select the best feasible parameters. We present experiments on real datasets in Section 5, discuss related work in Section 6, and conclude in Section 7.

2 Preliminaries and Notation

Both the set selection and the ranking problem are defined given a set $I$ of $n$ items (or candidates), along with a score $s_i$ associated with each item $i \in I$; this score summarizes the qualifications or the relevance of each item. The goal of the set selection problem is to choose a subset $A \subseteq I$ of $k \ll n$ of these items aiming to maximize the total score $\sum_{i \in A} s_i$. The ranking problem, apart from selecting the items, also requires that the items are ranked, i.e., assigned to distinct positions $1, 2, \ldots, k$. In this paper we study the impact of diversity constraints that may be enforced on the outcome.

Each item is labeled based on a set $Y$ of sensitive attributes. For instance, if the items correspond to job candidates, the sensitive attributes could be "race", "gender", or "nationality". Each attribute $y \in Y$ may take one of a predefined set (or domain) of values, or labels, $L_y$; for the attribute "gender", the set of values would include "male" and "female". We refer to attributes that have only two values (e.g., the values {male, female} for "gender") as binary. We use $L = \bigcup_{y \in Y} L_y$ to denote the set of all the attribute values related to attributes in the set $Y$. (To simplify notation, we assume that domains of attribute values do not overlap.)

Given a sensitive attribute value $v \in L$, we let $I_v \subseteq I$ denote the set of items that satisfy this label. For instance, if $v$ corresponds to the label "female", then $I_v$ is the set of all female candidates. We refer to such a set $I_v$ as a group (e.g., the group of female candidates). For each attribute value $v$ and item $i \in I_v$, we let $I_{i,v} = \{j \in I_v : s_j \ge s_i\}$ be the set of items with attribute value $v$ that have a score greater than or equal to $s_i$ (including $i$), and we let $S_{i,v} = \sum_{j \in I_{i,v}} s_j$ denote the total score of this set; for simplicity, and without loss of generality, we assume that no two scores are exactly equal. In Example 1, if we let $v$ correspond to the attribute value "Black" and let $i$ be candidate G, then $I_{i,v} = \{E, F, G\}$ and $S_{i,v} = 272$. We also use $s_{max} = \max_{i \in I}\{s_i\}$ and $s_{min} = \min_{i \in I}\{s_i\}$ to denote the maximum and minimum scores over all available items. Let $\lambda = s_{max}/s_{min}$ be the ratio of the maximum over the minimum score.

For a given selection, $A$, of $k$ items, we use $A_v \subseteq I_v$ to denote the subset of items in $I_v$ that are in $A$, and $B_v \subseteq I_v$ for those that are not. Also, $a_v = \min_{i \in A_v}\{s_i\}$ is the lowest score among the ones that were accepted, and $b_v = \max_{i \in B_v}\{s_i\}$ is the highest score among the items in $I_v$ that were rejected. We say that a set selection of $k$ items is in-group fair with respect to $v$ if $b_v \le a_v$, i.e., no rejected candidate in $I_v$ is better than an accepted one. In Section 3 we define two measures for quantifying how in-group fair a solution is. Finally, we use $A_{i,v} = A_v \cap I_{i,v}$ to denote the set of selected items in $I_v$ whose score is at least $s_i$, for some $i \in A_v$.
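As a concrete illustration of this notation, consider the Black group under the committee {A, B, G, K} from Example 1. A small sketch (the variable names are ours, for illustration only):

```python
# Scores of the Black candidates from Table 1 (the group I_v for v = "Black").
scores = {"E": 91, "F": 91, "G": 90, "H": 89}
selected = {"A", "B", "G", "K"}   # the outcome A from Example 1

A_v = {i for i in scores if i in selected}       # {'G'}
B_v = {i for i in scores if i not in selected}   # {'E', 'F', 'H'}
a_v = min(scores[i] for i in A_v)                # lowest accepted score: 90
b_v = max(scores[i] for i in B_v)                # highest rejected score: 91
print(b_v <= a_v)  # False: not in-group fair w.r.t. "Black" (E beats G)
```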


3 Diversity and In-Group Fairness

To ensure proportional representation of groups in algorithmic outcomes, one way of introducing diversity constraints is to specify lower bounds for each group $I_v$. Using diversity constraints, one may require, for example, that the proportion of females in a selected set, or in the top-$p$ positions of a ranking for some $p \le k$, closely approximates the proportion of females in the overall population. Depending on the notion of appropriate representation, one can then define, for each sensitive attribute value (or label) $v$ and each position $p$, a lower bound $\ell_{v,p}$ on the number of items satisfying value $v$ that appear in the top-$p$ positions of the final ranking (see, e.g., [Celis et al., 2018]). Given such bounds, one can then seek to maximize the overall utility of the generated set or ranking. The main drawback of this approach is that it may come at the expense of in-group fairness, described next.

Given some label $v$ and the group of items $I_v$ that satisfy this label, in-group fairness requires that if $k_v$ items from $I_v$ appear in the outcome, then it should be the best $k_v$ items of $I_v$. Note that, were it not for diversity constraints, any utility-maximizing outcome would achieve in-group fairness: excluding some candidate in favor of someone with a lower score would clearly lead to a sub-optimal outcome. However, as we verified in Example 1, in the presence of diversity constraints, maintaining in-group fairness for all groups may be impossible. In this paper we propose and study two measures of approximate in-group fairness, which we now define.

IGF-Ratio. Given an outcome $A$, a direct and intuitive way to quantify in-group fairness for some group $I_v$ is to consider the unfairness in the treatment of the most qualified candidate in $I_v$ that was not included in $A$. Specifically, we use the ratio of the minimum score of an item in $A_v$ (the lowest accepted score) over the maximum score in $B_v$ (the highest rejected score). This provides a fairness measure with values in the range $[0, 1]$, where higher values imply more in-group fairness. For instance, if some highly qualified candidate in $I_v$ with score $s$ was rejected in favor of another candidate in $I_v$ with score $s/2$, then the ratio is $1/2$.

$$\text{IGF-Ratio}(v) = \frac{a_v}{b_v} \qquad (1)$$

IGF-Aggregated. Our second measure of in-group fairness aims to ensure that, for every selected item $i \in A_v$, the score $\sum_{h \in A_{i,v}} s_h$ is a good approximation of $\sum_{h \in I_{i,v}} s_h$. The first sum corresponds to the aggregate utility of all accepted items $h \in I_v$ with $s_h \ge s_i$, and the second sum is the aggregate utility of all items (both accepted and rejected) $h \in I_v$ with $s_h \ge s_i$. If no qualified candidate in $I_v$ is rejected in favor of some less qualified one, then these sums are equal. On the other hand, the larger the fraction of high-scoring candidates rejected in favor of lower-scoring ones, the wider the gap between these sums. For a given group $I_v$, our second in-group fairness measure is the worst-case ratio (over all $i \in A_v$) of the former sum divided by the latter. Just like our first measure, this leads to a number in the range $[0, 1]$, with greater values indicating more in-group fairness.

$$\text{IGF-Aggregated}(v) = \min_{i \in A_v} \left\{ \frac{\sum_{h \in A_{i,v}} s_h}{\sum_{h \in I_{i,v}} s_h} \right\} \qquad (2)$$
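Both measures are straightforward to evaluate for a given outcome. The sketch below (our own illustration, with hypothetical helper names) computes both, and reproduces the values for the Black group under the outcome {A, B, G, K} of Example 1; it assumes that both $A_v$ and $B_v$ are non-empty.

```python
def igf_ratio(group_scores, accepted):
    # Equation (1): lowest accepted score a_v over highest rejected score b_v.
    a_v = min(group_scores[i] for i in accepted)
    b_v = max(group_scores[i] for i in group_scores if i not in accepted)
    return a_v / b_v

def igf_aggregated(group_scores, accepted):
    # Equation (2): worst case, over accepted items i, of the accepted score
    # mass at or above s_i divided by the total group score mass at or above s_i.
    ratios = []
    for i in accepted:
        at_or_above = [h for h in group_scores
                       if group_scores[h] >= group_scores[i]]  # I_{i,v}
        num = sum(group_scores[h] for h in at_or_above if h in accepted)
        den = sum(group_scores[h] for h in at_or_above)
        ratios.append(num / den)
    return min(ratios)

black = {"E": 91, "F": 91, "G": 90, "H": 89}   # group I_v from Table 1
print(igf_ratio(black, {"G"}))                 # 90/91, about 0.989
print(igf_aggregated(black, {"G"}))            # 90/272, about 0.331
```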

4 Balancing In-Group Fairness

As our experiments in Section 5 show, the distribution of in-group fairness across groups can be quite imbalanced in the presence of diversity constraints. In order to mitigate this issue, we now introduce additional constraints that aim to better distribute in-group fairness across groups. We begin by showing that the problem of maximizing the total utility subject to a combination of diversity and in-group fairness constraints can be formulated as an integer linear program.

It is worth noting that even the much simpler problem of checking the feasibility of a given set of diversity constraints (let alone maximizing total utility and introducing additional in-group fairness constraints) is NP-hard (see Theorem 3.5 in [Celis et al., 2018]). Therefore, we cannot expect to solve our optimization problems in polynomial time. In light of this intractability, we instead formulate our optimization problems as integer programs. Although solving these programs can require exponential time in the worst case, standard integer programming libraries allow us to solve reasonably large instances in a very small amount of time. We briefly discuss the computational demands of these programs in Section 5.

4.1 Integer Program Formulations

The integer programs receive as input, for each $v \in L$ and $p \in [k]$, a value $\ell_{v,p}$, and for each $v \in L$, a value $q_v$; these correspond to diversity and in-group fairness lower bounds, respectively. The output of the program is a utility-maximizing ranking of $k$ items such that at least $\ell_{v,p}$ items from $I_v$ are in the top $p$ positions, and the in-group fairness of each $v \in L$ is at least $q_v$. For each item $i \in I$ and each position $p \in [k]$ of the ranking, the integer program uses an indicator variable $x_{i,p}$ that is set either to 1, indicating that item $i$ is in position $p$ of the ranked output, or to 0, indicating that it is not. We also use a variable $x_i = \sum_{p \in [k]} x_{i,p}$ that is set to 1 if $i$ is included in any position of the output, and to 0 otherwise. The program below is for the IGF-Ratio measure.

maximize:    $\sum_{i \in I} x_i s_i$

subject to:  $x_i = \sum_{p \in [k]} x_{i,p}$,   $\forall i \in I$
             $\sum_{i \in I} x_{i,p} \le 1$,   $\forall p \in [k]$
             $\sum_{i \in I_v} \sum_{q \in [p]} x_{i,q} \ge \ell_{v,p}$,   $\forall v \in L, \forall p \in [k]$
             $a_v \le (\lambda - (\lambda - 1) \cdot x_i) s_i$,   $\forall v \in L, \forall i \in I_v$
             $b_v \ge (1 - x_i) s_i$,   $\forall v \in L, \forall i \in I_v$
             $a_v \ge q_v b_v$,   $\forall v \in L$
             $a_v, b_v \in [s_{min}, s_{max}]$,   $\forall v \in L$
             $x_i, x_{i,p} \in \{0, 1\}$,   $\forall i \in I, \forall p \in [k]$

The first set of inequalities ensures that at most one item is selected for each position. The second set of inequalities corresponds to the diversity constraints: for each attribute value $v$ and position $p$, include at least $\ell_{v,p}$ items from $I_v$ among the top $p$ positions of the output.


Most of the remaining constraints then aim to guarantee that every group $I_v$ has an IGF-Ratio of at least $q_v$. Among these constraints, the most interesting one is $a_v \le (\lambda - (\lambda - 1) \cdot x_i) s_i$. This is a linear constraint, since both $\lambda = s_{max}/s_{min}$ and $s_i$ are constants. Note that, if $x_i$ is set to 1, i.e., if item $i \in I$ is accepted, then this constraint becomes $a_v \le s_i$ and hence ensures that $a_v$ can be no more than the score of any accepted item. Yet, if $x_i$ is set to 0, i.e., if item $i$ is rejected, then this becomes $a_v \le \lambda s_i = s_{max} \frac{s_i}{s_{min}} \le s_{max}$, and hence this constraint does not restrict the possible values of $a_v$. Therefore, $a_v$ is guaranteed to be no more than the smallest accepted score in $I_v$, and $b_v$ is at least as large as the largest rejected score, since $b_v \ge (1 - x_i) s_i$, and hence $b_v \ge s_i$ if $x_i = 0$. As a result, $a_v \ge q_v b_v$ enforces the IGF-Ratio constraint $\frac{a_v}{b_v} \ge q_v$.
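The program above maps directly onto an off-the-shelf integer programming library. Here is a sketch using PuLP (our own illustrative encoding, not the authors' implementation; the inputs `items`, `scores`, `groups`, `ell`, and `q` are assumed to match the notation above):

```python
import pulp

def balanced_ranking(items, scores, groups, ell, q, k):
    """Sketch of the IGF-Ratio integer program using PuLP.

    items: list of item ids; scores: id -> s_i;
    groups: attribute value v -> set of member ids (I_v);
    ell: (v, p) -> diversity lower bound for prefix length p (1-indexed);
    q: v -> IGF-Ratio lower bound q_v; k: length of the output ranking.
    """
    s_min, s_max = min(scores.values()), max(scores.values())
    lam = s_max / s_min
    prob = pulp.LpProblem("balanced_ranking", pulp.LpMaximize)
    x = pulp.LpVariable.dicts(
        "x", [(i, p) for i in items for p in range(k)], cat="Binary")
    xi = pulp.LpVariable.dicts("xi", items, cat="Binary")
    a = pulp.LpVariable.dicts("a", list(groups), lowBound=s_min, upBound=s_max)
    b = pulp.LpVariable.dicts("b", list(groups), lowBound=s_min, upBound=s_max)

    prob += pulp.lpSum(xi[i] * scores[i] for i in items)   # total utility
    for i in items:                                        # x_i = sum_p x_{i,p}
        prob += xi[i] == pulp.lpSum(x[(i, p)] for p in range(k))
    for p in range(k):                                     # one item per position
        prob += pulp.lpSum(x[(i, p)] for i in items) <= 1
    for v, members in groups.items():
        for p in range(1, k + 1):                          # prefix diversity bounds
            prob += pulp.lpSum(x[(i, r)] for i in members
                               for r in range(p)) >= ell.get((v, p), 0)
        for i in members:                                  # tie a_v, b_v to selection
            prob += a[v] <= (lam - (lam - 1) * xi[i]) * scores[i]
            prob += b[v] >= (1 - xi[i]) * scores[i]
        prob += a[v] >= q[v] * b[v]                        # IGF-Ratio(v) >= q_v
    prob.solve()
    return [i for p in range(k) for i in items if pulp.value(x[(i, p)]) == 1]
```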

In order to obtain the integer linear program that enforces IGF-Aggregated constraints, we modify the program above by replacing all the constraints that involve $a_v$ or $b_v$ with the set of constraints $\sum_{h \in I_{i,v}} x_h s_h \ge q_v x_i \sum_{h \in I_{i,v}} s_h$ for all $v \in L$ and all $i \in I_v$. Note that for $i$ with $x_i = 0$ this constraint is trivially satisfied, since the right-hand side becomes 0 and the left-hand side is always non-negative. If, on the other hand, $x_i = 1$, then this inequality ensures that, for all $i \in A_v$:

$$\frac{\sum_{h \in I_{i,v}} x_h s_h}{\sum_{h \in I_{i,v}} s_h} \ge q_v \;\Rightarrow\; \frac{\sum_{h \in A_{i,v}} s_h}{\sum_{h \in I_{i,v}} s_h} \ge q_v.$$

4.2 The Leximin Solution

Equipped with the ability to compute the utility-maximizing outcome in the presence of both in-group fairness and diversity constraints, our next step is to use these integer linear programs to optimize the balance of in-group fairness across different groups. In fact, this gives rise to a non-trivial instance of a fair division problem: given one of the in-group fairness measures that we have defined, for each possible outcome, this measure quantifies how "happy" each group should be, implying a form of "group happiness". Therefore, each outcome yields a vector $q = (q_1, q_2, \ldots, q_{|L|})$ of in-group fairness values, one for each group. As we have seen, achieving an in-group fairness vector of $q = (1, 1, \ldots, 1)$ may be infeasible due to the diversity constraints. Given the set $Q$ of feasible vectors $q$ implied by the diversity constraints, our goal is to identify an outcome (i.e., an in-group fairness vector $q$) that fairly distributes in-group fairness values across all groups.

From this perspective, our problem can be thought of as a fair division problem where the goods being allocated are the $k$ slots in the set, and the fairness of the outcome towards all possible groups is evaluated based on the in-group fairness vector $q$ that it induces. Fair division is receiving significant attention in economics (e.g., [Moulin, 2004]) and recently also in computer science (e.g., [Brandt et al., 2016, Part II]). A common solution that this literature provides to the question of how happiness should be distributed is maximin. Over all feasible in-group fairness vectors $q \in Q$, the maximin solution dictates that a fair solution should maximize the minimum happiness, so it outputs $\arg\max_{q \in Q}\{\min_{v \in L} q_v\}$. In this paper we consider a well-studied refinement of this approach, known as the leximin solution [Moulin, 2004, Sec. 3.3]. Since there may be multiple vectors $q$ with the same minimum happiness guarantee, $\min_{v \in L} q_v$, the leximin solution chooses among them one that also maximizes the second smallest happiness value. In case of ties, this process is repeated for the third smallest happiness, and so on, until a unique vector is identified. More generally, if the elements of each feasible vector $q \in Q$ are sorted in non-decreasing order, then the leximin solution is the vector that is greater than all others from a lexicographic order standpoint.

Given the vector $q$ of in-group fairness values that corresponds to the leximin solution, we can simply run one of the integer linear programs defined above to get the balanced set; but how do we compute this leximin vector $q$? A simple way to achieve this is to use binary search in order to first identify the set of maximin solutions. For each value of $q \in [0, 1]$ we can check the feasibility of the integer linear program if we use that same $q$ value for all the groups. Starting from $q = 0.5$, if the solution is feasible we update the value to $q = 0.75$, whereas if it is infeasible we update it to $q = 0.25$, and repeat recursively. Once the largest value of $q$ that can be guaranteed for all groups has been identified, we check for which group $I_v$ this constraint is binding, we fix $q_v = q$, and continue similarly for the remaining groups, aiming to maximize the second smallest in-group fairness value, and so on.
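A sketch of this search (our own illustration; `feasible` is a hypothetical callback that wraps the feasibility check of the integer program for a given map of per-group bounds):

```python
def leximin(group_labels, feasible, tol=1e-3):
    """Approximate leximin vector of per-group IGF lower bounds.

    group_labels: iterable of attribute values v.
    feasible: callback; feasible(q_map) is True iff the integer program
        with in-group fairness bounds q_map admits a solution.
    """
    q_map = {v: 0.0 for v in group_labels}
    free = set(group_labels)          # groups whose bound is not yet fixed
    while free:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:          # binary search on a common bound
            mid = (lo + hi) / 2
            if feasible({**q_map, **{v: mid for v in free}}):
                lo = mid
            else:
                hi = mid
        for v in list(free):          # look for a group whose bound is binding
            trial = {**q_map, **{u: lo for u in free}}
            trial[v] = min(1.0, lo + tol)
            if not feasible(trial):   # raising v alone breaks feasibility
                q_map[v] = lo
                free.remove(v)
                break
        else:                         # no binding group at this tolerance
            for u in free:
                q_map[u] = lo
            free = set()
    return q_map
```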

In the experiments of the following section, we compute the leximin solution on real datasets and evaluate its in-group fairness and utility.

5 Experimental Evaluation

In this section, we describe the results of an experimental evaluation with three empirical goals: (1) to understand the conditions that lead to reduced in-group fairness; (2) to ascertain the feasibility of mitigating the observed imbalance with the help of in-group fairness constraints; and (3) to explore the trade-offs between in-group fairness and utility.

5.1 Datasets

Our experimental evaluation was conducted on both real and synthetic datasets. However, due to space constraints we can only include results for the following two real datasets:

Medical Expenditure Panel Survey (MEPS). MEPS is a comprehensive source of individual- and household-level information regarding the amount of health expenditures by individuals from various demographic or socioeconomic groups [Cohen et al., 2009; Coston et al., 2019]. MEPS is ideal for our purposes as it includes sensitive attributes such as race and age group for each individual. Each candidate's score corresponds to the utilization feature, defined in the IBM AI Fairness 360 toolkit [Bellamy et al., 2018] as the total number of trips requiring medical care, and computed as the sum of the number of office-based visits, the number of outpatient visits, the number of ER visits, the number of inpatient nights, and the number of home health visits.

High-utilization respondents (with utilization ≥ 10) constitute around 17% of the dataset. MEPS includes survey data of more than 10,000 individuals. We use data from Panel 20 of calendar year 2016, and select the top 5,110 individuals with utilization > 5 as our experimental dataset.

    Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

    6038

[Figure 1: In-group fairness for groups defined by race and age in the MEPS dataset with 5,110 items, selecting top-k candidates as a ranked list, with diversity constraints on each of the 8 possible race groups and 2 age groups. Panels: (a) IGF-Ratio, diversity constraints only; (b) IGF-Ratio, adding balance constraints; (c) IGF-Aggregated, diversity constraints only; (d) IGF-Aggregated, adding balance constraints.]

[Figure 2: In-group fairness for groups defined by department size and area in the CS dataset with 51 items, selecting top-k candidates as a ranked list, with diversity constraints on each of the 2 possible size groups and 5 area groups. Panels (a, c): diversity constraints only; panels (b, d): adding balance constraints.]

[Figure 3: Score distribution for groups in the MEPS and CS datasets. Panels: (a) distribution of the utilization score in the MEPS dataset; (b) distribution of the publication count in the CS dataset.]

We focus on two categorical attributes: race (with values "White", "Black", "Multiple races", "Native Hawaiian", "Asian Indian", "Filipino", "Chinese", and "American Indian") and age ("Middle" and "Young").

CS department rankings (CS). CS contains information about 51 computer science departments in the US (https://csrankings.org). We use publication count as the scoring attribute and select two categorical attributes: department size (with values "large" and "small") and area (with values "North East", "West", "Middle West", "South Center", and "South Atlantic"). Unlike the MEPS dataset, CS is a relatively small dataset, but our experiments show that the diversity constraints and the balancing operations exhibit similar behavior. For this dataset we ask for a ranking of the top CS departments, while ensuring that departments of different size and location are well-represented.

5.2 Experimental Evidence of Imbalance

Using the in-group fairness measures defined in Section 3, we now experimentally evaluate the extent to which in-group fairness is violated in real datasets, as well as how well-balanced it is across the different groups of candidates.


          k1                   k2                   k3                   k4                   k5
Dataset   diversity ratio agg  diversity ratio agg  diversity ratio agg  diversity ratio agg  diversity ratio agg
MEPS      28%       5%    3%   34%       7%    3%   34%       5%    1%   33%       3%    3%   28%       4%    5%
CS        20%       2%    9%   18%       1%    8%   17%       3%    9%   12%       1%    0%   11%       3%    1%

Table 2: Percentage of optimal utility lost due to diversity constraints and in-group fairness constraints in the MEPS and CS datasets.

Figures 1a and 1c exhibit the IGF-Ratio and IGF-Aggregated values, respectively, for each of the 8 race groups and 2 age groups of the MEPS dataset, when a ranked list of the top-k candidates is requested, for $k \in \{20, 40, 60, 80, 100\}$. For every prefix of the ranked list, diversity constraints ensure each race and age is represented in that prefix in approximate correspondence with their proportion in the input. For example, at least 11 candidates among the top-20 are from the "Middle" age group, and at least 8 are from the "Young" age group.

In both Figures 1a and 1c, we observe that diversity constraints cause a significant amount of in-group unfairness with respect to the two metrics, leading to values below 0.1 for some groups for both IGF-Ratio and IGF-Aggregated. In particular, the young age group and the American Indian race are the two groups with the lowest in-group fairness values in both cases. Apart from low in-group fairness values, we also observe a significant amount of imbalance, since some other groups actually receive very high in-group fairness values for both measures.

At this point, it is worth providing some intuition regarding why imbalance might arise to begin with. To help with this intuition we provide Figure 3a, which exhibits the distribution of the scores for each group in MEPS (both race and age). From this plot, we can deduce that the individuals in the American Indian group tend to have lower scores compared to other ethnic groups, and that the young age group also has lower scores than the middle age group. However, diversity constraints require that these groups are represented in the output as well.

An important observation is that selecting a young American Indian in the outcome would satisfy two binding diversity constraints, while using just one slot in the outcome. This essentially "kills two birds with one stone", compared to an alternative solution that would dedicate a separate slot to each minority. The slot that this solution "saves" could then be used for a high-scoring candidate that is not part of a minority, leading to a higher utility. Therefore, without any in-group fairness constraints, the utility-maximizing solution is likely to include a young American Indian who is neither one of the top-scoring candidates in the young group, nor one of the top-scoring American Indians, thus introducing in-group unfairness to both of these groups.

This undesired phenomenon, the selection of low-quality candidates that satisfy multiple diversity constraints, is more likely to impact low-scoring groups, disproportionately affecting historically disadvantaged minorities or historically undervalued entities. We observe this in the CS dataset, where small departments, and departments located in the South Center and South Atlantic areas, experience higher in-group unfairness before our mitigation (see Figures 2a, 2c, and 3b).

5.3 The Impact of Leximin Balancing

Having observed the imbalance that may arise in the output of an unrestricted utility-maximizing algorithm, and having explained how this may systematically adversely impact historically disadvantaged groups, we now proceed to evaluate the impact of the leximin solution. In all of Figures 1b, 1d, 2b, and 2d, which are generated by the leximin solution, we see a clear improvement compared to their counterparts before any balancing was introduced. Recall that the leximin solution's highest priority is to maximize the minimum in-group fairness over all groups; looking, for instance, at Figure 1d and comparing it with Figure 1c, it is easy to see that the minimum IGF-Aggregated value has strictly increased for every value of k. Note that this does not mean that every group is better off. For instance, the White group is significantly worse off in Figure 1d, especially for larger values of k. However, this drop in the in-group fairness of the White group enabled a very significant increase for the American Indian group, and a noticeable increase for the Chinese and young age groups, which suffered the most in-group unfairness prior to the balancing operation. As a result, the in-group fairness values after the leximin balancing operation are all much more concentrated in the middle of the plot, instead of resembling a bi-modal distribution of high and low in-group fairness values, as was the case before in Figure 1c.

We observe very similar patterns in all the other applications of balancing as well, exhibiting a consistently better-balanced in-group fairness across groups. Before we conclude, we also show that this significant improvement came without a very significant loss in utility.

The Price of Balance

Just like the enforcement of diversity constraints leads to a drop in utility, the same is true when introducing the in-group fairness constraints in order to reach a more balanced distribution. To get a better sense of the magnitude of this loss in utility, in Table 2 we show the percentage loss caused by diversity and in-group fairness constraints, measured against the optimal utility. The k1, ..., k5 in the columns correspond to [20, 40, 60, 80, 100] for MEPS and to [8, 10, 12, 14, 16] for CS. Therefore, the entry in the first column and first row of this table should be interpreted as follows: for the MEPS dataset and a ranking request with k = 20, enforcing diversity constraints leads to a loss of 28% in utility, compared to the outcome without any such constraints. The entry to its right (5%) is the additional loss in utility caused if, on top of the diversity constraints, we also enforce the leximin constraints for the IGF-Ratio measure. Similarly, the next entry to the right (3%) is the additional loss in utility caused if, on top of the diversity constraints, we also enforce the leximin constraints for the IGF-Aggregated measure.


We note that, compared to the 28% utility loss required by the diversity constraints alone, the utility loss due to balancing in-group fairness is actually quite small, and we observe this to be the case for both datasets and all k values in our experiments.

As a final remark we note that, despite the fact that integer linear programs are not polynomial-time solvable in general, the computational cost involved in computing the utility-maximizing outcome for a given vector of $q_v$ values was not too significant, and the standard library solvers were quite efficient. Even for the most time-consuming case, which was the IGF-Aggregated measure on the MEPS dataset, the solver would complete in a matter of a few minutes.

6 Related Work

Yang and Stoyanovich were the first to consider fairness in ranked outputs [Yang and Stoyanovich, 2017]. They focused on a single binary sensitive attribute, such as male or female gender, and proposed three fairness measures, each quantifying the relative representation of protected group members at discrete points in the ranking (e.g., top-10, top-20, etc.), and compounding these proportions with a logarithmic discount, in the style of information retrieval. A follow-up work [Zehlike et al., 2017] developed a statistical test for group fairness in a ranking, also with respect to a single binary sensitive attribute, and proposed an algorithm that mitigates the lack of group fairness. They proposed the notions of in-group monotonicity and ordering utility, which are similar in spirit to our in-group fairness. The imbalance in terms of in-group fairness does not arise in the set-up of Zehlike et al., where only a single sensitive attribute is considered.

[Stoyanovich et al., 2018] considered online set selection under label-representation constraints for a single sensitive attribute, and posed the Diverse k-choice Secretary Problem: pick k candidates, arriving in random order, to maximize utility (the sum of scores), subject to diversity constraints. These constraints specify the lowest ($\ell_v$) and highest ($h_v$) number of candidates from group $v$ to be admitted into the final set. The paper developed several set selection strategies and showed that, if a difference in scores is expected between groups, then these groups must be treated separately during processing. Otherwise, a solution may be derived that meets the constraints, but that selects comparatively lower-scoring members of a disadvantaged group; it lacks balance.

Trade-offs between different kinds of accuracy and fairness objectives for determining risk scores are discussed in [Kleinberg et al., 2017]. The authors use the term balance to refer to the performance of a method with respect to members of a particular class (positive or negative), and note that "the balance conditions can be viewed as generalizations of the notions that both groups should have equal false negative and false positive rates." Our use of the term "balance" is consistent with their terminology, as it applies to in-group fairness.

The work of [Celis et al., 2018] conducts a theoretical investigation of ranking with diversity constraints of the kind we consider here, for the general case of multiple sensitive attributes. They prove that even the problem of deciding the feasibility of these constraints is NP-hard. They provide hardness results, and develop exact and approximation algorithms for the constrained ranking maximization problem, including a linear program and a dynamic programming solution. These algorithms also allow the scores of the candidates to be position-specific. The novelty of our work compared to that of Celis et al. is that we focus on the imbalance in terms of in-group fairness, develop methods for mitigating the imbalance, and provide an empirical study of these methods.

Diversity, as a crucial aspect of the quality of algorithmic outcomes, has been studied in Information Retrieval [Agrawal et al., 2009; Clarke et al., 2008] and content recommendation [Kaminskas and Bridge, 2017; Vargas and Castells, 2011]. See also [Drosou et al., 2017] for a survey of diversity in set selection tasks, including a conceptual comparison of diversity and fairness.

7 Conclusion and Open Problems

In this paper we identified the lack of in-group fairness as an undesired consequence of maximizing total utility subject to diversity constraints, in the context of set selection and ranking. We proposed two measures for evaluating in-group fairness, and developed methods for balancing its loss across groups. We then conducted a series of experiments to better understand this issue and the extent to which our methods can mitigate it. This paper opens up many interesting research directions, both empirical and theoretical.

From an empirical standpoint, it would be important to develop a deeper understanding of the aspects that may cause disproportionate in-group unfairness. In our experimental evaluation we observed that, all else being equal, a larger difference in expected scores between disjoint groups leads to a higher imbalance in terms of in-group fairness. In the future, we would like to identify additional such patterns that may lead to systematic imbalance, especially when this may disproportionately impact disadvantaged groups.

From a theoretical standpoint, it would be interesting to understand the extent to which polynomial-time algorithms can approximate the optimal utility or approximately satisfy the combination of diversity and in-group fairness constraints. In this paper we restricted our attention to finding exact solutions to the defined optimization problems and, since even simplified versions of these problems are NP-hard, we had to resort to algorithms without any appealing worst-case guarantees. In [Celis et al., 2018], the authors considered the design of approximation algorithms for closely related problems involving diversity constraints, but without any in-group fairness constraints. On the other hand, [Celis et al., 2018] also consider ranking problems where the candidates' scores can be position-dependent. Extending our framework to capture these generalizations is another direction for future work.

Finally, moving beyond the leximin solution, one could consider alternative ways to choose the in-group fairness vector $q$, such as maximizing the Nash social welfare, an approach that has recently received a lot of attention (e.g., [Cole and Gkatzelis, 2018] and [Caragiannis et al., 2016]).

Acknowledgements

This work was supported in part by NSF Grants No. 1926250, 1916647, and 1755955.


References

[Agrawal et al., 2009] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying search results. In Proceedings of the 2nd International Conference on Web Search and Web Data Mining, WSDM, pages 5–14, 2009.

[Bellamy et al., 2018] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John T. Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. CoRR, abs/1810.01943, 2018.

[Brandt et al., 2016] Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel D. Procaccia. Handbook of Computational Social Choice. Cambridge University Press, New York, NY, USA, 1st edition, 2016.

[Brunn-Bevel and Byrd, 2015] Rachelle J. Brunn-Bevel and W. Carson Byrd. The foundation of racial disparities in the standardized testing era: The impact of school segregation and the assault on public education in Virginia. Humanity & Society, 39(4):419–448, 2015.

[Caragiannis et al., 2016] Ioannis Caragiannis, David Kurokawa, Hervé Moulin, Ariel D. Procaccia, Nisarg Shah, and Junxing Wang. The unreasonable fairness of maximum Nash welfare. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC, pages 305–322, 2016.

[Celis et al., 2018] L. Elisa Celis, Damian Straszak, and Nisheeth K. Vishnoi. Ranking with fairness constraints. In 45th International Colloquium on Automata, Languages, and Programming, ICALP, pages 28:1–28:15, 2018.

[Clarke et al., 2008] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 659–666, 2008.

[Cohen et al., 2009] Joel W. Cohen, Steven B. Cohen, and Jessica S. Banthin. The Medical Expenditure Panel Survey: A national information resource to support healthcare cost research and inform policy and practice. Medical Care, pages S44–S50, 2009.

[Cole and Gkatzelis, 2018] Richard Cole and Vasilis Gkatzelis. Approximating the Nash social welfare with indivisible items. SIAM J. Comput., 47(3):1211–1236, 2018.

[Coston et al., 2019] Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R. Varshney, Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. Fair transfer learning with missing protected attributes. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES. ACM, 2019.

[Drosou et al., 2017] Marina Drosou, H. V. Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. Diversity in Big Data: A review. Big Data, 5(2):73–84, 2017.

[(EEOC), 2019] U.S. Equal Employment Opportunity Commission (EEOC). Title VII of the Civil Rights Act of 1964. https://www.eeoc.gov/policy/vii.html, 2019. [Online; accessed 24-Feb-2019].

[Kaminskas and Bridge, 2017] Marius Kaminskas and Derek Bridge. Diversity, serendipity, novelty, and coverage: A survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 7(1):2:1–2:42, 2017.

[Kleinberg et al., 2017] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference, ITCS, pages 43:1–43:23, 2017.

[Moulin, 2004] Hervé Moulin. Fair Division and Collective Welfare. MIT Press, 2004.

[Page, 2008] Scott E. Page. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press, new edition, 2008.

[Stoyanovich et al., 2018] Julia Stoyanovich, Ke Yang, and H. V. Jagadish. Online set selection with fairness and diversity constraints. In Proceedings of the 21st International Conference on Extending Database Technology, EDBT, pages 241–252, 2018.

[Surowiecki, 2005] James Surowiecki. The Wisdom of Crowds. Anchor, 2005.

[Vargas and Castells, 2011] Saúl Vargas and Pablo Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the 2011 ACM Conference on Recommender Systems, RecSys, pages 109–116, 2011.

[Yang and Stoyanovich, 2017] Ke Yang and Julia Stoyanovich. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, SSDBM, pages 22:1–22:6, 2017.

[Zehlike et al., 2017] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo A. Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM, pages 1569–1578, 2017.

[Zliobaite, 2017] Indre Zliobaite. Measuring discrimination in algorithmic decision making. Data Mining and Knowledge Discovery, 31(4):1060–1089, 2017.
