Discrimination discovery in scientific project evaluation: A case study

Andrea Romei, Salvatore Ruggieri and Franco Turini
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo 3, 56127 Pisa, Italy

A preliminary version of the results of this paper appeared in Romei et al. (2012).

Abstract

Discovering contexts of unfair decisions in a dataset of historical decision records is a non-trivial problem. It requires the design of ad-hoc methods and techniques of analysis, which have to comply with existing laws and with legal argumentations. While some data mining techniques have been adapted to the purpose, the state-of-the-art of research still needs methodological refinements, the consolidation of a Knowledge Discovery in Databases (KDD) process, and, most of all, experimentation with real data. This paper contributes by presenting a case study on gender discrimination in a dataset of scientific research proposals, and by distilling from the case study a general discrimination discovery process. Gender bias in scientific research is a challenging problem, which has been tackled in the social sciences literature by means of statistical regression. However, this approach is limited to testing a hypothesis of discrimination over the whole dataset under analysis. Our methodology couples data mining, for unveiling previously unknown contexts of possible discrimination, with statistical regression, for testing the significance of such contexts, thus obtaining the best of the two worlds.

Keywords: Discrimination discovery, Gender bias, Case study, Situation testing, Data mining, KDD process

1. Introduction

Discrimination refers to an unjustified distinction of individuals based on their membership, or perceived membership, in a certain group or category, disregarding individual merits. Unfair behaviors have been observed in a number of settings, including credit, housing, insurance, personnel selection and worker wages. Civil rights laws prohibit discrimination against protected groups defined on the grounds of gender, race, age, nationality, marital status, personal opinion, and so on. One crucial problem from the legal, economic and social points of view is discrimination discovery, that is, defining methods capable of providing evidence of discriminatory behavior in activities such as the ones listed above. In the socio-economic field the problem has been addressed by analysing data with a statistical approach. The basic idea is to check, by means of regression analysis, whether sensitive features, like gender and race, are correlated with a less favorable treatment of individuals. In our opinion such an approach can highlight only macroscopic situations, while failing to dig out situations of deep discrimination in (small) subsets of the population, i.e., niches of individuals with a particular combination of characteristics. As an example, consider the case of loan applications to a bank. The discriminatory behavior of a single branch office against applicants from a local minority can readily be hidden in the much larger set of decisions of the whole bank. In a few words, the statistical approach tends to find a general model characterizing the whole population, whereas discrimination often arises in specific contexts.
We maintain that a data mining approach, that is, the search for particular patterns in the data, must be coupled with statistical validation of the patterns found, as a thorough strategy for discovering (unexpected or unknown) contexts of discrimination.

Data mining approaches to discrimination discovery have recently gained momentum, but, in our opinion, they still need major advancements: first, experimentation with real data; second, methodological refinements; and third, the consolidation of a KDD process of discrimination discovery. Solving these issues is essential for the acceptance in practice of discrimination discovery methods based on data mining. In this paper we contribute to the advancement of the state-of-the-art in all those aspects.

First, we describe a large experiment on a real case study concerning the challenging problem of discovering gender discrimination in the selection of scientific projects for funding. Data refer to an Italian call for research proposals with 3790 applications nationwide.

Second, we couple a recently developed discrimination discovery method (Luong et al., 2011), based on data mining, with statistical validation of its results, thus reconciling the statistical and the data mining methodologies. The data mining method is unsupervised, in the sense that there are no examples of discriminatory or non-discriminatory situations to learn from. Rather, the method discovers sets of situations in which the comparison of the features and the decision may suggest a possible discrimination according to the legal methodology of situation testing. Statistical regression analysis is then used to prove or disprove them as a hypothesis.

Third, the description of the steps followed in the case study provides us with the basis for distilling a general KDD process for discrimination discovery. The abstracted process is rather complex. It contains automated and semi-automated steps, the possibility of iterating sub-processes, and the need for tuning the parameters of the analyses.

The paper is organized as follows. Section 2 offers a survey of the background material, including a multi-disciplinary introduction to gender bias in scientific research, a survey of data mining approaches for discrimination discovery, and details on the approach based on situation testing. Section 3 presents the case study of an Italian national research funding call, the available data, and the data preparation steps. Section 4 reports an exploratory analysis of gender differences and of the distributions of risk difference. Sections 5 and 6 report the core of the experiment, consisting first of extracting classification models describing contexts of possible discrimination, and then of validating such contexts by means of logistic regression. Four interesting contexts of possible discrimination are discussed. Section 7 generalizes the phases of the case study to a generic discrimination discovery process. Finally, the conclusions summarize the contributions of the paper and discuss some future work.

2. Background

The problem of gender differences in research is of concern to all major funding institutions. The European Union (EU) regularly publishes a report on the status of gender funding in the member states (European Commission, 2009), and it promotes gender equality in scientific research.1 EU legislation includes an explicit resolution on women and science (Council of the E.U., 1999), which notably preceded the resolutions on racial and employment equality.
The National Science Foundation (NSF) in the United States (US) supports the development of systemic approaches to increase the representation and advance- ment of women in academic science, technology, engineer- ing, and mathematics through the ADVANCE program.2 Broader overviews of studies and findings on gender in (sci- entific and technological) research have been conducted by the European Commission (2012) and by UNESCO (2007). In the next subsection, we review the existing literature on gender bias in scientific research, which basically relies on statistical regression as the basic tool for data analy- sis. Then, we briefly review recent approaches that use data mining for discrimination discovery and prevention. Finally, a deeper introduction is reported on the data anal- ysis technique of situation testing, and to its implementa- tion as a data mining algorithm. 1http://www.yellowwindow.be/genderinresearch 2http://www.portal.advance.vt.edu 2.1. Gender bias in scientific research Forms of gender discrimination may explain women’s under-representation in academia, both past and present. The surveys by Bentley and Adamson (2003) and Ceci and Williams (2011) cover multi-disciplinary literature on gender differences in manuscript reviewing, grant fund- ing, university admission, and hiring and promoting in re- search. We focus here on grant and fellowship funding. The influential paper by Wenner̊as and Wold (1997) reports a study on post-doctoral fellowship applications to the Swedish Medical Research Council (MRC) in 1995. A total of 62 applications were submitted by men and 52 by women: 16 men were funded (25.8%) versus 4 women (7.7%). Applicant’s sex and scientific competence are con- sidered as independent variables in a linear regression model estimating the score assigned by the reviewers. Scientific competence as a control factor is measured in terms of the number of published journal articles, their citation count, and total impact of those journals.3. The regression shows that “a female applicant had to be 2.5 times more produc- tive than the average male applicant to receive the same score”. However, subsequent studies by several funding so- cieties in Europe and North America fail to show evidence of sex bias in approval rates (Ceci and Williams, 2011). In fact, Sandström and Hällsten (2008) analysed data on applications to the MRC in 2004 and found a reversed gen- der bias, namely a small but significant effect in favor of funding women’s grants compared to men with the same scientific competence score. Let us recall here a few large scale studies. RAND (2005) investigates grant applications in the period 2001- 2003 to the NSF, the National Institutes of Health (NIH), and the Department of Agriculture. No evidence of gender bias was generally found after controlling for age, academic degree, institution, grant type, institute, and application year. There were two exceptions, partly explainable by the lack of further control variables. First, women received only 63% of the amount of funding awarded to men by the NIH. Secondly, women who applied in 2001 were less likely than men to submit applications in the next two years. Similar findings as in the first exception are also reported by Larivière et al. (2011) with reference to 9074 professors at universities in Quebec (Canada). The lower amount of funding received by women is not necessarily ev- idence of discriminatory decisions. 
Wilson (2004) explains the lower amount of funding granted to women by their marginalization within the scientific community, by their segregation to lower rank positions, and by their smaller social networks – all of these factors affecting their chances of funding possibilities. 3The problem of measuring and analysing science is the subject of scientometrics Indicators of scientific productivity of researchers have been debated for long time. See Bornmann and Daniel (2009) for a discussion on H-index and its variants, and for a comparison with the Impact Factor index. 2 Ley and Hamilton (2008) highlight that, whilst there is now generally gender equality between students and in- structors, there is still a striking drop in the roles of as- sistant professors and professors – i.e., a glass ceiling in science. The authors analysed more than 100,000 appli- cations between 1996 and 2007 to NIH grant programs to determine whether gender differences occur at some stage of a researcher’s career, which may explain the observed attrition. While they found a decrease in female applicants for grants throughout a researcher’s career, there is sub- stantial equity of the rates of funded applications between males and females at all stages of the process. Similar results are observed also by Brouns (2000) on grants awarded to 809 individual applicants by the Dutch Organization for Scientific Research. In this case, how- ever, the stratification by discipline exhibits a higher vari- ability of the success rate for women, from 26% up to 84%, compared to men, from 46% to 76%. Women ap- pear very successful in “hard” sciences (Physics, Mathe- matics, and Astronomy) and surprisingly unsuccessful in the “soft” natural sciences (Biology, Oceanography, and the Earth Sciences). Bornmann and Daniel (2005) anal- yse 1954 applications for doctoral and 743 applications for post-doctoral fellowships to a German foundation for the promotion of research in Biomedicine. The odds ratio of the approved doctoral fellowships for females (7%) against males (16%) is found statistically significant after checking for applicant’s age, grade, mobility, number of recommen- dation letters, ratings of reviewers. Marsh et al. (2008) summarize the major findings of an eight year research program on the analysis of peer reviews in grant applications to the Australian Research Council. Their dataset includes 2331 proposals rated by 6233 exter- nal assessors, out of a total of 10023 reviews. They con- sider issues such as: reliability of reviews, in the sense of an agreement of reviewers across individual proposals and across disciplines; trustworthiness of reviewers nominated by applicants; bias of national reviewers, who give more favorable evaluations than international ones; the positive influence of academic rank, in the sense that professors are more likely to be funded due to their experience and successful research track records; and the positive influ- ence of the prestige of the university affiliation and of the applicant’s age. They also consider the influence of an ap- plicant’s gender, finding that 15.2% of funded applications were led by females, which was exactly the same percent- age as female applicants. Once again, although women are under-represented in the applicants pool, they are equally represented in the funded pool. Their experiments also re- ject the “matching hypothesis” that reviewers give higher ratings to applicants of their same sex. 
Regarding the analytical methodology, research on peer review studies has carried out statistical analyses mainly by means of variants of correlation (Brouns, 2000), Z- tests of proportions (Ley and Hamilton, 2008), regression and more rarely by analysis of variance and discriminant function analysis. Multi-stages peer review processes have been also analysed with latent Markov models (Bornmann et al., 2008). The variants of regression adopted include multiple regression (Wenner̊as and Wold, 1997; Sandström and Hällsten, 2008), multi-level regression4 (Jayasinghe et al., 2003; Mutz et al., 2012), logistic regression (Born- mann and Daniel, 2005). The coefficient of the indepen- dent variable coding the applicant’s gender is taken as a measure of how gender affects the dependent variable, which is typically the score received by the application or its probability (or its logit) of being funded. Other in- dependent variables control factors such as scientific per- formance, scientific field, age, position, and institution. In this sense, “discrimination is the remaining racial [in our context, gender] difference after statistically account- ing for all other race-related [gender-related] influences on the outcome” (Quillian, 2006). However, it is difficult to know that all important characteristics of individuals have been taken into account: a recurring problem known as the omitted-variable bias. The inclusion of an omitted control variable may then explain (part of) the remaining gender differences. 2.2. Data mining for discrimination data analysis Discrimination discovery from data consists in the ac- tual discovery of discriminatory situations and practices hidden in a large amount of historical decision records. The aim is to unveil contexts of possible discrimination on the basis of legally-grounded measures of the degree of discrimination suffered by protected-by-law groups in such contexts. The legal principle of under-representation has inspired existing approaches for discrimination discovery based on pattern mining. A common tool for statistical analysis is provided by a 2 × 2, or 4-fold, contingency ta- ble, as shown in Figure 1. Different outcomes between groups are measured in terms of the proportion of people in each group (p1 for the protected group, and p2 for the unprotected one) with a specific outcome (benefit denial). Differences and rates of those proportions are commonly adopted as the formal counterpart of the legal principle of group under-representation. They are known in statistics as risk difference (RD = p1 − p2), also known as absolute risk reduction; risk ratio or relative risk (RR = p1/p2); relative chance (RC = (1 − p1)/(1 − p2)), also known as selection rate; odds ratio (OR = p1(1 − p2)/(p2(1 − p1))). Starting from a dataset of historical decision records, Pe- dreschi et al. (2008); Ruggieri et al. (2010a) propose to extract classification rules such as for instance: race=black, purpose=new_car => credit=no 4In addition to a measurement level random variable, multi-level regression (Goldstein, 2011) includes a subject level random variable modelling variations in a cluster of data. For instance, (Jayasinghe et al., 2003) adopt multi-level regression to take into account corre- lation in the cluster of ratings of a reviewer, and in the cluster of ratings of a same field of study. 
called potentially discriminatory rules, to unveil contexts (here, people asking for a loan to buy a new car) where the protected group (here, black people) suffered from under- or over-representation with respect to the decision (here, over-representation w.r.t. credit denial). The approach has been implemented by Ruggieri et al. (2010b) on top of an Oracle database by relying on tools for frequent itemset mining. The main limitation of such an approach is that measuring group representation by aggregated values over undifferentiated groups gives no control over the characteristics of the protected group as opposed to the others in such a context. The high value of a discrimination measure from Figure 1 can be justified by the fact that the proportions p1 and p2 mix decisions for people that may be very different with respect to characteristics that are lawful grounds for obtaining the required benefit (e.g., skills required for a job position). This results in an overly large number of rules that need to be further screened to filter out explainable discrimination. Luong et al. (2011) overcome this limitation by exploiting the legal methodology of situation testing, which will be presented in Section 2.3.

Figure 1: 4-fold contingency table and discrimination measures.

                 benefit
  group        denied   granted
  protected      a        b        n1
  unprotected    c        d        n2
                 m1       m2       n

  p1 = a/n1    p2 = c/n2
  RD = p1 - p2    RR = p1/p2    RC = (1-p1)/(1-p2)    OR = RR/RC = (a/b)/(c/d)

The approach described so far assumes that the dataset under analysis contains an attribute that denotes the protected group under analysis. The case when data do not contain such an attribute (or it is not even collectable at a micro-data level, e.g., as in the case of sexual orientation) is known as indirect discrimination analysis (Ruggieri et al., 2010a), where 'indirect' refers to the exploitation of a known correlation with some other attribute, which can be used as a proxy for group membership. A well-known example is redlining discrimination analysis, occurring when the ZIP code of residence is correlated with the race of individuals in highly segregated regions. In this paper, we restrict our attention to direct discrimination analysis.

Finally, we mention the related research area of discrimination prevention in data mining and machine learning (Calders and Verwer, 2010; Kamiran and Calders, 2012; Hajian and Domingo-Ferrer, 2012), where the problem is to design classification algorithms that trade off accuracy for non-discrimination in making predictions. Discriminatory predictions may be the result of a bias of the classifier induction algorithm, or of learning from training data traditional prejudices that are endemic in reality.

  Discovery(t) {
    for r in P {
      if ( benefit(r) = denied and diff(r) >= t )
        disc(r) <- true
      else
        disc(r) <- false
    }
    build a classifier on P where the class is disc
  }

Figure 2: Left: example of risk difference diff(r) for k = 4. Women are the protected group, knnset_women(r) (resp., knnset_men(r)) is the set of female (resp., male) k-nearest neighbors of r. Red labels benefit denied, green labels benefit granted. Right: pseudo code of k-NN as situation testing. Individuals r from the protected group P are first labeled as discriminated or not, and then a classifier is induced for describing those discriminated.

Summaries of contributions in discrimination discovery and prevention are collected in a recent book (Custers et al., 2013).
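Before moving on, the four measures of Figure 1 can be illustrated with a minimal Python sketch. The function and the example numbers below are ours (the example uses the program P2 peer-review counts reported later in Figure 6), not code from the paper:

  def discrimination_measures(a, b, c, d):
      """Measures of Figure 1 from the cells of a 4-fold contingency table.

      a, b: protected group, benefit denied / granted
      c, d: unprotected group, benefit denied / granted
      """
      n1, n2 = a + b, c + d            # group sizes
      p1, p2 = a / n1, c / n2          # proportions of benefit denial
      return {
          "RD": p1 - p2,                            # risk difference
          "RR": p1 / p2,                            # risk ratio (relative risk)
          "RC": (1 - p1) / (1 - p2),                # relative chance (selection rate)
          "OR": (p1 * (1 - p2)) / (p2 * (1 - p1)),  # odds ratio
      }

  # Example: 761 of 792 protected-group applications denied, vs. 1094 of 1194
  # for the unprotected group (cf. the peer-review P2 table in Figure 6).
  print(discrimination_measures(761, 31, 1094, 100))
  # -> RD ~ 0.045, RR ~ 1.05, RC ~ 0.46, OR ~ 2.24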
In particular, Romei and Ruggieri (2013) present, in a chapter of that book, a multi-disciplinary annotated bibliography of statistical tools, legal argumentations, economic models, and experimental approaches for discrimination data analysis.

2.3. Situation testing and k-NN

In a legal setting, situation testing is a quasi-experimental approach to investigate the presence of discrimination by checking the factors that may influence decision outcomes. Pairs of research assistants, called testers, undergo the same kind of selection. For example, they apply for the same job, they present themselves at the same night club, and so on. Within each pair, applicant characteristics likely to be related to the situation (characteristics related to a worker's productivity on the job in the first case; look, age and the like in the second case) are made equal by selecting, training, and credentialing testers to appear equally qualified for the activity. Simultaneously, membership in a protected group is experimentally manipulated by pairing testers who differ in membership – for example, a black and a white, a male and a female, and so on. Observing a significant difference in the selection outcome between testers is a prima facie evidence of discrimination, i.e., a proof that, unless rebutted, would be legally sufficient to prove the claim of discrimination. For applications of situation testing, we refer to Bendick (2007), covering employment discrimination in the US; to Rorive (2009), covering the context of the EU Member States; and to Pager (2007), including an appendix on the design of situation testing experiments.

In Luong et al. (2011), the idea of situation testing is exploited for discrimination discovery by just inverting the point of view. Given past records of decisions taken in some context, for each member of the protected group with vector of attributes r suffering from a negative decision outcome (someone who may claim to be a victim of discrimination), we look for 2k testers with similar characteristics. Such characteristics are legally admissible in affecting the decision, apart from the one of being or not in the protected group. Similarity is modeled via a distance function between tuples. If we can observe significantly different decision outcomes between the k-nearest neighbors of r belonging to the protected group and the k-nearest neighbors belonging to the unprotected group, we can ascribe the negative decision to a bias against the protected group, hence labeling the individual r as discriminated. This approach resembles the k-nearest neighbor (k-NN) classification model, where the class of an individual is predicted as the most frequent class among its k-nearest neighbors. Difference in decision outcomes between the two groups of neighbors is measured by any of the functions from Figure 1, calculated over the proportions for the two sets of testers. Throughout the paper, we consider the risk difference diff(r) = p1 - p2, with the intuitive reading that it represents the difference between the frequency p1 of negative decisions among the neighbors from the protected group and the frequency p2 among the neighbors from the unprotected group (see Figure 2 for an example).

A value diff(r) > t, for a maximum threshold t, implies that the negative decision for r is not explainable on the basis of the (legally-grounded) attributes used for distance measurement, but rather it is biased by group membership. diff(r) is then a measure of the strength of such a bias. When diff(r) > t, individual r is labeled as discriminated by setting a new attribute disc to true. Starting from this labeling procedure, the actual learning of the conditions of discrimination is then modeled as a standard classification problem, where the class is the attribute disc. The overall procedure is reported in Figure 2.

3. Case study: data understanding & preparation

In this section, we start the analysis of the case study of an Italian national call for research projects. We introduce the call and its evaluation process, the available data, and the features selected to form the dataset given in input to the discrimination discovery analysis.

Figure 4: Input data on research units.

3.1. The FIRB "Future in Research" call

In 2008, the Italian Ministry of University and Research published a call for scientific research projects under the Basic Research Investment Fund (FIRB), reserved for young scientists – the FIRB "Future in Research" call. The scientific scope of the call is very broad, ranging from social sciences and humanities, to physical sciences and engineering, to life sciences. Research proposals are submitted by a consortium of one or more research units, with a principal investigator (from now on, PI) and zero or more associate investigators heading each unit, each affiliated to an Italian university or to a public research organization. Research proposals are distinguished into two programs, depending on whether the PI holds a non-tenured position and she/he is at most 33 years old at the time of the call (program P1), or she/he holds a tenured position and she/he is at most 39 years old (program P2). Each program has its own total budget, but the submission forms and the evaluation procedures are the same for both.

The submitted proposals consist of a description of work and a budget for each research unit, called the B forms, and of a description of work and a budget for the whole proposal, called the A form. The global budget is basically the sum of the budgets of the research units participating in the project. The A form also contains the curriculum vitae of the PI, a list of her/his main publications, and an abstract of the proposal. The hiring of at least one young researcher (defined as "a post-doc or a post-degree of at most 32") per project proposal is required by the call. Inviting good reputation researchers from abroad to spend some period working on the project is instead an option.

Figure 3: The two-phase review process of the FIRB "Future in Research" call.

Project proposals underwent a two-step evaluation process, as shown in Figure 3. The first step consisted in a blind peer-review by national and international reviewers, resulting in four scores about:

(S1) scientific relevance of the proposal (score 0–8);
(S2) impact of the proposal (score 0–7);
(S3) scientific and technical value of the proposal (score 0–15);
(S4) quality of the partnership (score 0–10).

Only project proposals that received the best score for all of the four evaluation criteria (i.e., a total score of 40) were admitted to the second step, which consisted in an audition of the PI in front of a panel of national experts. The panel ranked the proposals into three classes: "to be funded", "to be funded if additional budget were available"5 and "not to be funded". In the first two cases, the panel also decided a budget cut with respect to the budget requested.

5 At that time, an increase of the budget of the call was under consideration.

3.2. Data sources

Anonymized data on project proposals and evaluation results were made available to us as an Oracle relational database. Proposals are identified by unique IDs CODE A. Similarly, research units have unique IDs CODE B.

Data on research units (see Figure 4) are retrieved from the B forms of project proposals. Table FORM B contains, for each research unit, attributes on the associate investigator leading the unit (gender, age, title, institution) and on the planned effort of the research unit in person/months. For each research unit, three other groups of data are available:

• participants in the research unit, whose gender and age attributes are stored in the PARTICIPANT B table;

• detailed costs of the research unit, stored in the DETAILED COST B table, including costs of: tenured personnel, personnel to be hired, equipment, overhead, travel and subsistence, consulting and other costs;

• aggregated costs, stored in the COST B table, including: the costs of hiring young researchers and good reputation researchers, the total budget of the research unit, and the eligible costs to be funded by the call.

Figure 5: Input data on research proposals and evaluation results.

Data on research proposals (see Figure 5) are retrieved from the A forms. Table FORM A contains data on the PI (gender, age, title, institution) and the research program of the proposal (P1 or P2). A few auxiliary tables follow:

• PUBLICATION stores the list of publications of the PI. Authors' names have been removed, but the number of authors is recorded;

• ERC CLASSIFICATION stores the scientific area of the research project according to the European Research Council (ERC) classification; more than one area could have been chosen for a research proposal, e.g., in case of multi-disciplinary topics, with the first one representing the main area6;

• KEYWORDS and ABSTRACT store respectively an ordered list of keywords and the textual abstract of the proposal, in English;

• COST A stores the duration of the project in months, and aggregated budget data: total effort in person/months, total cost of the project, total eligible costs to be funded, number of young researchers and their total cost, number of good reputation researchers and their total cost.

Data on the evaluation results are shown in the right-most tables of Figure 5. The scores obtained by a proposal over the four evaluation criteria of the first step of the evaluation process are stored in the SCORE table. Each criterion is coded with an ID. The ranks assigned by the commission of national experts to proposals in the second step of the evaluation process are stored in the AUDITIONS table. For proposals ranked as "to be funded" or "to be funded if additional budget were available", the total cost and grant assigned to the project after budget cut are stored in the GRANTS table.

6 The main area is used in the first step of the evaluation process to select the peer-reviewers of the proposal from a pool of area experts.
Table 1: Attributes of the dataset of the case study.

  Name            Description                                            Type
  Features on principal and associate investigators
  gender          Gender of the PI                                       Nominal
  region          Region of the institution                              Nominal
  city            City of the institution                                Nominal
  inst type       Type of the institution                                Nominal
  title           Title of the PI                                        Nominal
  age             Age of the PI                                          Numeric
  pub num         No. of publications of the PI                          Numeric
  avg aut         Average number of authors in pubs                      Numeric
  f partner num   No. of females among principal or associate invest.    Numeric
  Project costs (absolute values in €)
  tot cost        Total cost of the project                              Numeric
  fund req        Requested grant amount                                 Numeric
  fund req perc   fund req over tot cost                                 Numeric
  yr num          No. of young researchers                               Numeric
  yr cost         Cost of young researchers                              Numeric
  yr perc         yr cost over tot cost                                  Numeric
  grr num         No. of good reputation researchers                     Numeric
  grr cost        Cost of good reputation researchers                    Numeric
  grr perc        grr cost over tot cost                                 Numeric
  Research areas
  program                   Program P1 or P2 of the proposal             Nominal
  d1 lv1, d2 lv1, d3 lv1    1st, 2nd and 3rd area at the 1st ERC level   Nominal
  d1 lv2, d2 lv2, d3 lv2    1st, 2nd and 3rd area at the 2nd ERC level   Nominal
  d1 lv3, d2 lv3, d3 lv3    1st, 2nd and 3rd area at the 3rd ERC level   Nominal
  Evaluation results
  S1              Score S1 assigned by the peer-reviewer                 Numeric
  S2              Score S2 assigned by the peer-reviewer                 Numeric
  S3              Score S3 assigned by the peer-reviewer                 Numeric
  S4              Score S4 assigned by the peer-reviewer                 Numeric
  peer-review     Passed or rejected at the peer-review                  Nominal
  audition        Passed or rejected at the audition (i.e., proposal funded)   Nominal
  grant           Amount granted after budget cut                        Numeric

3.3. Data preparation

The data preparation phase produced a dataset for the discrimination analysis in the form of a single relational table, including both source and derived features for each project proposal. Table 1 summarizes four groups of features.

Features on the principal and associate investigators. These include gender, age, and title of the PI; number of publications and average number of authors in publications of the PI; region (North, Center, South of Italy), city and type of her/his institution (University, Consortium or Other); and number of female principal or associate investigators in the project proposal.

Project costs. Several costs are considered: total cost of the project, requested grant (both absolute and in proportion to the total cost), number and cost of young researchers, number and cost of good reputation researchers.

Research areas. In addition to the research program a proposal is submitted to (P1 or P2), up to three research areas are included, the first of which is the main area, according to the ERC classification. Such a classification consists of a three-level hierarchy. The top level includes Social sciences and Humanities (SH), Physical sciences and Engineering (PE), and Life Sciences (LS). The second and third levels (coded, e.g., as PE n and PE n m) include 25 and 3792 sub-categories respectively.

Evaluation results. The following attributes are included: the scores (S1)-(S4) received at the peer-review, whether the project passed the first evaluation phase (i.e., the peer-review), whether the project passed the second evaluation phase7 (i.e., the audition), and the actual amount granted after budget cut.

4. Case study: risk difference analysis

Since research proposals of programs P1 and P2 are evaluated in isolation (due to distinct budgets), from now on, we act as if there were two datasets, one per program. Program P1 received 1804 applications, 923 of which are from female PIs; program P2 received 1986, 792 of which are from female PIs.

7 Since no additional budget was available for the call, proposals ranked as "to be funded if additional budget were available" are considered as not passing the audition.

4.1. Exploratory data analysis of gender differences

Table 2 summarizes the proportion of genders in the two phases of the evaluation process: peer-review and audition. It is readily checked that, for both programs, the proportion of female PIs progressively decreases when moving from applicant proposals, to proposals passing the peer-review, up to those passing the audition decision. Let us quantify such a decrease by means of discrimination measures. Figure 6 shows the 4-fold contingency tables of passing the peer-review and of being funded for proposals in programs P1 and P2.

Table 2: Aggregate data on gender differences.

            PIs                 Peer-review passed     Audition passed
        Male      Female        Male      Female       Male      Female
  P1    881       923           43        31           25        17
        (48.8%)   (51.2%)       (58.1%)   (41.9%)      (59.5%)   (40.5%)
  P2    1194      792           100       31           51        12
        (60.1%)   (39.9%)       (76.3%)   (23.7%)      (81%)     (19%)

Figure 6: 4-fold contingency tables and discrimination measures.

  Peer-review P1     Rejected   Passed   Applic.
  Female                892        31       923
  Male                  838        43       881
                       1730        74      1804
  p1 = 892/923 = 0.966,  p2 = 838/881 = 0.951
  RD = 0.015,  RR = 1.02,  RC = 0.69,  OR = 1.48

  Audition P1        Rejected   Passed   Applic.
  Female                 14        17        31
  Male                   18        25        43
                         32        42        74
  p1 = 14/31 = 0.452,  p2 = 18/43 = 0.419
  RD = 0.033,  RR = 1.08,  RC = 0.94,  OR = 1.15

  Peer-review P2     Rejected   Passed   Applic.
  Female                761        31       792
  Male                 1094       100      1194
                       1855       131      1986
  p1 = 761/792 = 0.961,  p2 = 1094/1194 = 0.916
  RD = 0.045,  RR = 1.05,  RC = 0.46,  OR = 2.24

  Audition P2        Rejected   Passed   Applic.
  Female                 19        12        31
  Male                   49        51       100
                         68        63       131
  p1 = 19/31 = 0.613,  p2 = 49/100 = 0.49
  RD = 0.123,  RR = 1.25,  RC = 0.76,  OR = 1.65

Consider first the peer-review phase. Recall that the measures of risk difference (RD) and risk ratio (RR) compare the proportions of rejected proposals. Due to the small fraction of projects passing the phase, RD and RR cannot highlight differences in the outcomes: overall, the vast majority of proposals were rejected. In fact, RR is only 1.02 for P1 and 1.05 for P2; RD is only 1.5% for P1, and a modest 4.5% for P2. On the other hand, since relative chance (RC) compares the success rates, it highlights major differences: the chance of passing the peer-review for a female is only 69% of the chance of a male for program P1, and only 46% for program P2. Finally, since the odds ratio (OR) is the ratio between RR and RC, it highlights differences in both rejection and success rates. Consider now the audition phase. Rejected and funded projects are now more evenly distributed. The discrimination measures highlight no significant difference for program P1. Differences in program P2 are lower than in the first phase, yet still moderately high.

Are the different success rates of males and females due to legitimate differences in the characteristics or skills of the applicants? In order to answer such a question, Figure 7 reports the distributions of the age of the PI, of the number of her/his publications, and of a few costs of project proposals (young researchers, good reputation researchers, total cost, requested grant). Distributions are distinguished by gender of the PI and by program of the project proposal. The distribution of age across gender highlights no difference for both programs. Notice that the distributions across programs are clearly distinguished due to the requirements of each program in the call for proposals. However, the plot of the number of publications shows that males have a slightly higher productivity than females, for both programs P1 and P2. As an example, about 37% of females in program P2 have more than 20 publications, against a percentage of 48% for males. Turning our attention to project costs, we observe that proposals led by females require slightly lower costs for young researchers than proposals led by males, in both programs. The situation is similar for the total cost and the requested grant: the average total cost is € 980K for females and € 1080K for males. The distributions of costs for good reputation researchers are, instead, similar. Notice that such costs are non-zero for only 19% of proposals in program P1 and only 10% in P2.

Summarizing, even though an analysis of distributions provides some hints on gender differences, it is still too coarse-grained to draw any conclusion about the presence of discrimination. Aggregations at the level of the whole dataset may hide differences in smaller niches of data. Unveiling these niches is precisely the objective of the discrimination discovery analysis.

4.2. Distributions of gender risk difference

Let us instantiate the approach of situation testing (see Section 2.3) by exploring risk differences. Let r be a project proposal led by a female PI that did not pass the peer-review phase. The function diff(r) = p1 - p2 measures the risk difference between the rejection percentage p1 of its k-nearest neighbor proposals headed by female PIs and the rejection percentage p2 of its k-nearest neighbor proposals headed by male ones. Distance is measured on the basis of the proposals' characteristics that are (legally) admissible in affecting the (first or second phase) decision. We consider here all the features of Table 1 apart from the project evaluation results and the gender of the PI. Similarity is modeled via the distance function adopted in the experiments by Luong et al. (2011), which consists of the Manhattan distance of z-scores for continuous attributes in r, and of the percentage of mismatching attributes for discrete ones. The higher diff(r) is, the more the negative decision on proposal r is unexplainable by differences in the compared characteristics. The residual explanation is then either the gender of the PI, which implies a prima facie evidence of gender discrimination, or the lack of further explanatory variables – the omitted variables.

A critical choice concerns how to set the constant k. Figure 8 (a,b) shows the distributions of diff() for k = 4, 8, 16, 32 with reference to proposals from programs P1 and P2. As k increases, the distributions tend to flatten (for k sufficiently large, the risk differences of all proposals collapse to a unique value). From now on, we fix k = 8, which means comparing each proposal with 0.9% of the proposals in program P1 (= 16/1804, where 16 is 2k and 1804 is the overall number of proposals), and with 0.8% of the proposals in program P2.
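For illustration, the following Python sketch is our own simplified re-implementation of the diff(r) computation with the distance function described above (Manhattan distance of z-scores for continuous attributes, plus the fraction of mismatching discrete attributes); the column names and value encodings ("gender", "Female", "peer-review", "rejected") are assumptions patterned on Table 1, not the authors' code:

  import pandas as pd

  def diff_r(df, r_idx, cont_attrs, disc_attrs, k=8,
             group="gender", protected="Female",
             decision="peer-review", denied="rejected"):
      """Risk difference diff(r) of tuple r_idx: denial rate among the k nearest
      protected neighbours minus the denial rate among the k nearest unprotected ones."""
      # z-scores of continuous attributes; distance = Manhattan on z-scores
      # plus the fraction of mismatching discrete attributes.
      z = (df[cont_attrs] - df[cont_attrs].mean()) / df[cont_attrs].std()
      dist = (z - z.loc[r_idx]).abs().sum(axis=1) \
           + (df[disc_attrs] != df.loc[r_idx, disc_attrs]).mean(axis=1)
      dist = dist.drop(r_idx)                        # exclude r itself
      is_prot = df.loc[dist.index, group] == protected
      prot_knn = dist[is_prot].nsmallest(k).index    # k nearest female-led proposals
      unprot_knn = dist[~is_prot].nsmallest(k).index # k nearest male-led proposals
      p1 = (df.loc[prot_knn, decision] == denied).mean()
      p2 = (df.loc[unprot_knn, decision] == denied).mean()
      return p1 - p2

  # Individual r is then labeled as discriminated (disc = true) when diff_r(...)
  # exceeds a chosen threshold t, as in the procedure of Section 2.3.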
Figure 7: Cumulative distributions across gender of PIs and program of proposals (panels: age, number of publications, young researcher cost, good reputation researcher cost, total cost, requested grant; one curve per combination of gender and program).

Figure 8: Cumulative distributions of diff() (fraction of tuples with diff ≥ t): (a) program=P1, peer-review=rejected, gender=Female, for k = 4, 8, 16, 32; (b) program=P2, peer-review=rejected, gender=Female, for k = 4, 8, 16, 32; (c) program=P2, k = 8, gender=Female, separately for each score (S1)-(S4) lower than its maximum.

Recall that only the project proposals that receive the highest scores (S1)-(S4) pass the peer-review phase. It is interesting then to look at the distributions of diff() separately for each score. This is shown in Figure 8 (c), where the "benefit denied" decision is set as receiving a score lower than the maximum. The impact of the project (S2) appears to be the most biased criterion.

Distributions might also unveil forms of multiple discrimination. Figure 9 shows the distributions of risk difference for two possibly discriminated groups in isolation, namely female PIs and PIs affiliated to institutions from the South of Italy (an historically disadvantaged region of Italy), and for PIs belonging to both groups. The two groups in isolation exhibit some risk difference, with gender bias being more prominent than bias against people from the South. However, no increase in risk difference can be observed for the sub-group of female PIs from the South when compared to the whole group of female PIs.

Figure 9: Cumulative distributions of diff() for program=P2 and k = 8 (curves for gender=Female, region=South, and gender=Female ∧ region=South).

5. Case study: discrimination model extraction

In this section, we start applying a discrimination discovery approach to the pre-processed datasets of proposals in program P1 (resp., P2) with reference to the peer-review decision. We will not be considering the decision of the audition phase, for three reasons. First, the number of proposals involved in the second phase of the reviewing process is rather small, hence we run the risk of drawing no (statistically) significant conclusion. Second, the discrimination measures in Figure 6 highlight higher gender differences in the peer-review decisions than in the audition decisions, so we expect higher chances of finding non-negligible contexts of clear discrimination. Third, and most important, the set of features available in Table 1 appears adequate as control factors for the decision of the first phase only. In fact, peer-reviewers had access to the proposal text, to the curriculum and list of publications of the PI, and to the budget data, which corresponds closely to the set of features listed in Table 1. On the contrary, the panel of national experts "entered in personal contact" with the PIs during the auditions, so their decisions are affected by additional factors not recorded in the data, e.g., physical characteristics of the PI, proficiency in speaking, motivation, and appropriateness of answers to questions. The omitted-variable bias in analysing data with reference to the decision of the panel of national experts would then be considerably high. Consequently, any finding of possible discrimination would be questionable.

5.1. Before data mining: what can regression tell us?

Data analysts from economic and social sciences have typically adopted logistic regression as a tool for testing a hypothesis of possible discrimination. Before starting our data mining analysis, let us then follow such an approach and discuss what conclusions can be drawn without applying data mining methods. Logistic regression is a form of multiple linear regression:

  logit(P(Y = 1)) = α + ∑_{i=1}^{N} β_i X_i

where the logit of the dependent variable value Y = 1 is estimated as a linear function of the independent variables X_1, ..., X_N. The logit function logit(P(Y = 1)) = log(P(Y = 1)/(1 − P(Y = 1))) is the log of the odds of the probability P(Y = 1). By exponentiating both sides of the equation, we obtain:

  P(Y = 1) / (1 − P(Y = 1)) = e^{α + ∑_{i=1}^{N} β_i X_i} = e^α ∏_{i=1}^{N} e^{β_i X_i}

The value β_i can then be interpreted as the variation coefficient of the logarithm of the odds of the event Y = 1 due to a unit variation of the factor X_i, all other control factors being constant. A nominal feature X with values v_1, ..., v_k is modeled in this framework by k − 1 independent indicator variables X = v_1, ..., X = v_{k−1}. The coefficients of these features model the variation of the logit of P(Y = 1) with respect to the default value X = v_k.

Table 3: Logistic regression models for the datasets of proposals for programs P1 and P2. The dependent variable is peer-review = passed. Coefficients marked by [***] are statistically significant at the 99% confidence level.

  Variable            Model for P1          Model for P2
                      Coeff. (Std. err)     Coeff. (Std. err)
  gender = Female     -0.33 (-0.39)         -0.87 (0.30) [***]
  region = North       0.14 (0.30)           0.22 (0.22)
  region = South      -0.02 (0.35)          -0.42 (0.28)
  inst type = Univ    -0.42 (0.62)          -0.43 (0.40)
  inst type = Cons    -0.36 (0.50)           0.04 (0.45)
  age                 -0.03 (0.09)           0.01 (0.04)
  pub num              0.01 (0.01)          -0.01 (0.01)
  avg aut              0.01 (0.01)           0.03 (0.02)
  f partner num       -0.02 (0.25)           0.12 (0.16)
  tot cost             0.10 (0.35)           0.03 (0.26)
  fund req            -0.13 (0.50)          -0.05 (0.37)
  fund req perc        0.51 (0.37)           0.39 (0.28)
  yr num              -0.11 (0.26)          -0.56 (0.19) [***]
  yr cost             -0.09 (0.35)          -0.03 (0.26)
  yr perc              0.31 (0.26)           0.27 (0.19)
  grr num             -0.28 (0.46)           0.22 (0.24)
  grr cost            -0.09 (0.35)          -0.03 (0.26)
  grr perc             0.16 (0.32)           0.26 (0.21)
  d1 lv1 = PE         -0.42 (0.33)          -0.34 (0.25)
  d1 lv1 = SH          0.08 (0.33)           0.03 (0.29)

Table 3 shows the logistic regression models for the datasets of proposals in programs P1 and P2. The event Y = 1 is here peer-review = passed. Standard errors and statistical significance of regression coefficients are also shown. In both models, the regression coefficient of the indicator variable gender = Female is negative, which means that, all other factors being equal, female PIs have lower odds of passing the peer-review phase: by a factor of e^{-0.33} = 0.72 for program P1, and of e^{-0.87} = 0.42 for program P2, w.r.t. the odds of male PIs. For program P2, the null hypothesis that the coefficient is zero is rejected at the 99% level of statistical significance. The region of the institution of the PI affects the odds of passing the peer-review as well, particularly for proposals in program P2: PIs from the North of Italy have higher chances, whilst those from the South have lower ones. The variables on age, number of publications, and average number of authors in publications have coefficients close to zero. Concerning cost variables, proposals with a higher percentage of costs covering young researchers (variables yr perc, fund req perc) have higher odds of passing the peer-review. This is not unexpected, since the call explicitly states the objective of funding start-up research groups of young researchers. However, proposals with large (yet young) groups (variable yr num) are disfavored in the peer-review decision. Moreover, competition appears to be harder in the area of Physical Sciences and Engineering (PE) than in Social Sciences and Humanities (SH) and Life Sciences (LS), and for PIs from universities (variable inst type = Univ) compared to PIs from other institutions. Finally, the literature on discrimination analysis accounts for an included-variable bias (Killingsworth, 1993), namely for control variables that incorporate some form of gender discrimination. One such variable is f partner num, i.e., the number of female principal or associate investigators. Since its coefficients are small, it appears that the gender discrimination bias, if present, is directed mainly against the PI, and not against the group of investigators.

The conclusions drawn from Table 3 are certainly more informative than the exploratory data analysis of Section 4. However, whilst they reveal some gender bias at the level of the whole datasets, there is no indication of whether this is uniformly distributed or whether there are some contexts with a very high bias. Unfortunately, the statistical regression approach is limited to the verification of a hypothesis. Thus, one should explicitly figure out a possible context and re-compute a regression model for proposals in such a context. The purpose of our data mining approach is precisely to let such contexts emerge as a result of the analysis.
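For reference, a minimal sketch of this kind of regression fit, assuming the dataset of Table 1 is available as a pandas DataFrame with underscored column names; statsmodels is our choice here and is not necessarily the tool used for Table 3:

  import pandas as pd
  import statsmodels.api as sm

  def fit_peer_review_model(df):
      """Logistic regression of peer-review = passed on gender and a subset of the control factors."""
      y = (df["peer_review"] == "passed").astype(int)
      # explicit indicator for female-led proposals (reference level: Male)
      X = df[["age", "pub_num", "avg_aut", "tot_cost", "fund_req",
              "fund_req_perc", "yr_num", "yr_perc"]].copy()
      X["gender_Female"] = (df["gender"] == "Female").astype(float)
      # k-1 indicator variables for the remaining nominal control factors
      X = X.join(pd.get_dummies(df[["region", "inst_type", "d1_lv1"]],
                                drop_first=True).astype(float))
      X = sm.add_constant(X.astype(float))
      return sm.Logit(y, X).fit(disp=False)

  # exp(coefficient of gender_Female) is the multiplicative change in the odds of
  # passing the peer-review for female-led proposals, other factors being equal.
  # model = fit_peer_review_model(proposals_P2); print(model.summary())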
5.2. A classification model of the discriminated proposals

The number of proposals led by female PIs that did not pass the peer-review phase amounts to 892 for program P1 and to 761 for program P2. We now intend to extract from these two datasets a global description of the proposals whose negative decision is discriminatory according to the legal methodology of situation testing. We proceed as follows. First, we set a threshold value t for the maximum admissible risk difference. Values of risk difference greater than 0.05 (i.e., 5%) have been considered prima facie evidence of discrimination in some legislations and law cases, and Figure 8 (a) supports this choice in practice. In order to be more stringent, we assume from now on the higher threshold t = 0.10. Second, a proposal r is labeled as discriminated if its risk difference is greater than or equal to t – technically, we introduce a binary attribute disc defined as disc(r) = true iff diff(r) ≥ 0.10. These two steps allow us to reduce the problem of characterizing discriminatory decisions to the standard problem of inducing a classification model, where the class attribute is the newly introduced attribute disc. The resulting datasets have a distribution of disc = true and disc = false values of 26-74% for program P1, and of 38-62% for P2.

Since the intended use of the classification models is to describe global conditions under which a proposal led by a female PI is rejected at the peer-review phase with a risk difference of 0.10 or above, we restrict the search space to classification models that are readily interpretable (e.g., before a court in a law case). We experimented with classification rule models (RIPPER by Cohen (1995), and PART by Frank and Witten (1998)) and decision trees (C4.5 by Quinlan (1993)). Classification models are evaluated by the objective interestingness measures of accuracy, precision, recall and f-measure for the class value disc = true, using a 10-fold cross validation. The actual classification model is extracted from the whole dataset. Other settings that have been experimented with are mainly concerned with tackling the unbalanced distribution of class values, and they include standard approaches: uniform resampling of the training folds, cost-sensitive induction of classifiers,8 and meta-classification approaches (bagging and boosting). We relied on the Weka tool by Witten and Frank (2011) for algorithms and as the experimental environment. Finally, we also varied the set of predictive attributes in order to evaluate the explanatory power of different subsets:

• set D13 includes a subset of features of the PI (age, title, pub num, avg aut), of project costs (yr num, yr cost, grr num, grr cost, tot cost, fund req), of the research area (d1 lv1, d1 lv2), as well as the class attribute disc;

• set D19 adds features of project costs (fund req perc, yr perc, grr perc), of the PI (inst type, region) and of the participants (f partner num);

• set D27 also includes the remaining attributes of the ERC hierarchy and the attribute city of the PI.

8 The best performance is obtained with a cost of misclassifying disc = true set to 2.5 times the cost of misclassifying disc = false.

Table 4 reports the top 10 classification models obtained. For each model, the table includes: the set of predictive attributes, the extraction algorithm, whether cost-sensitive (CS) classification is adopted, and the performance measures for both programs P1 and P2. All of the top 10 classifiers use resampling, whilst none of them adopts meta-classification. The size of a model measures its structural complexity: for classification rules, it is the number of rules; for decision trees, it is the number of leaves.

Table 4: Top 10 classification models of discriminated proposals.

                       Accuracy      Precision     Recall        f-Measure     Size
  Id  Set  Alg   CS    P1    P2      P1    P2      P1    P2      P1    P2      P1    P2
   1  D13  Jrip  Y     48.4  45.2    33.1  40.2    97.0  93.4    0.49  0.56     35    23
   2  D19  C4.5  N     77.6  70.7    54.3  58.7    83.1  74.5    0.66  0.66    302   184
   3  D19  C4.5  Y     56.4  53.0    36.6  44.0    93.5  92.3    0.53  0.60     78    73
   4  D19  Jrip  Y     53.5  45.4    35.4  40.4    97.0  95.5    0.52  0.57     21    11
   5  D19  Part  N     72.3  66.6    47.9  54.5    77.9  67.8    0.59  0.60     74    33
   6  D19  Part  Y     61.9  57.3    39.7  46.4    90.9  88.8    0.55  0.61     74    87
   7  D27  C4.5  N     82.2  74.5    62.3  63.4    78.8  75.9    0.70  0.69   2051  4645
   8  D27  C4.5  Y     50.7  37.6    33.5  37.6    92.2  100     0.49  0.55      9   325
   9  D27  Jrip  Y     50.0  46.5    33.7  41.0    96.1  96.9    0.50  0.58     12    11
  10  D27  Part  Y     61.2  53.1    38.1  43.7    80.1  86.7    0.52  0.58    128   113

A few comments follow on the lessons learned in tuning models and parameters. First, resampling of the training folds proves to be an effective technique, improving performance both in terms of accuracy and f-measure, irrespective of the model type and subset of attributes. Using cost-sensitive classification in addition does not improve further: compare for instance rows 2 vs 3, 5 vs 6, and 7 vs 8 of Table 4, where the only difference between the pairs is in the use or not of misclassification costs. Second, the impact of the set of predictive attributes depends on the classification model. Jrip and PART both benefit from larger sets in terms of accuracy and f-measure when moving from D13 to D19, but then the additional attributes in D27 worsen the performance. Contrast for example rows 1 vs 4 vs 9, and 6 vs 10. This also holds for C4.5 models when using misclassification costs (see rows 3 vs 8). However, when using resampling only, there is an improvement from D19 to D27 (rows 2 and 7). C4.5 with resampling on D27 (row 7) turns out to be the best model with respect to both accuracy and f-measure. Third, we highlight the importance of extracting models that trade off performance with simplicity. The best model (row 7) is, unfortunately, the most complex one. The global description of discriminatory decisions it provides is accurate, but scattered over too many conditions, whose validation, e.g., by a legal expert, is impractical. This motivates the search for a few local contexts of possible discrimination.

6. Case study: rule reasoning and validation

In this section, we report four rules filtered out from the top 10 classifiers. They have been ranked in the top positions on the basis of both objective measures (precision, recall, average diff(), odds ratio) and subjective ones (interpretability, relation with known stereotypes). The antecedent of a rule unveils a context of prima facie evidence of gender discrimination: within such a context, proposals led by female PIs received different peer-review decisions (with a risk difference of at least 0.10) compared to proposals with similar characteristics led by male PIs. We validate the statistical significance of such a context by means of logistic regression. This way, we merge the capability of k-NN as a situation testing approach for discovering contexts of possible discrimination with the capability of statistical regression for testing hypotheses of possible discrimination – thus obtaining the best of the two worlds.

A technical note is in order. For each of the four rules, all proposals led by female PIs in the context of the rule turn out to have been rejected at the peer-review. As a consequence, the coefficient of the independent variable gender = Female in a logistic regression model cannot be calculated – this is known as the separation problem (Heinze, 2006). We will then apply Firth logistic regression (Firth, 1993), also called the penalized maximum likelihood method, which takes this problem into account.
For the same reason, we will calculate the odds ratio (OR) of a rule by applying the plus-4 correction, which consists of adding a fictitious +1 to each cell in the contingency table of Figure 1.

Figure 10: Cumulative distributions of diff() for proposals satisfying the antecedent of rule R1 (fraction of tuples with diff ≥ t, plotted against t, separately for gender=Female and gender=Male).

6.1. Rule R1: life sciences in program P1

The first rule concerns proposals in program P1. It highlights a possible discrimination against female PIs in the area of Life Sciences (LS).

R1: (d1_lv1 = LS) and (yr_num >= 2) and (yr_cost >= 244,000) and (pub_num <= 12) and (avg_aut >= 8.4) => disc=yes [prec=1.0] [rec=0.095] [diff=0.165] [OR=11.14]

The antecedent of the rule points out a context of research proposals requiring two or more young researchers, having a cost for them of €244K or more, and such that the PI has at most 12 publications with a mean number of authors of 8.4 or more. There are 33 proposals in the context: 8 led by male PIs, 2 of which passed the peer-review, and 25 led by female PIs, none of which passed the peer-review. All of the 25 proposals led by female PIs have been labeled as discriminated, i.e., the precision of the rule is 100%. Among the proposals led by female PIs that are labeled as discriminated, 9.5% satisfy the antecedent of the rule, i.e., the recall of the rule is 9.5%, which makes the context rather relevant for the anti-discrimination analyst. With reference to proposals of the LS area only, recall lifts up to 27%. The average risk difference of the 25 proposals led by female PIs is 16.5%, which is much higher than the discrimination threshold value of 10%. Figure 10 reports the cumulative distributions of diff() for proposals satisfying the antecedent of rule R1, distinguishing female-led and male-led projects. This is more informative than the average risk difference alone. Moreover, it highlights the other side of discrimination, namely favoritism: proposals led by males exhibit very low or even negative risk differences, or, stated otherwise, they have been favored in comparison to similar projects led by female PIs. Finally, the (corrected) odds ratio is 11.14: the odds of proposals led by female PIs of being rejected at the peer-review are 11.14 times the odds of those led by male PIs.

Rule R1 unveils then a possible gender discrimination in the Life Sciences area when proposals are ambitious (more than one young researcher to be hired) but the PI has a low productivity record of publications (at most 12) and high uncertainty on the PI's effective contribution (large number of co-authors). The peer-reviewers of the LS area appear to have compensated the lack of knowledge on the skills of the PIs by some prior knowledge or stereotype resorting to the gender of the PI, with females being disadvantaged. This phenomenon is known as statistical discrimination or rational racism (Romei and Ruggieri, 2013) – as opposed to taste-based discrimination, which is motivated by prejudice.

Table 5: Firth logistic regression models for the datasets of proposals satisfying the antecedent of rules R1-R4. The dependent variable is peer-review = passed. Each row lists the available coefficients, with standard errors in parentheses; coefficients marked by [*], [**], and [***] are statistically significant at the 90%, 95% and 99% confidence level, respectively. Rows with fewer than four entries correspond to variables with blank cells for some rules: blank cells are due to control variables with unique values (e.g., d1 lv1 is always "LS" in rule R1), or to control variables omitted due to high standard errors. f partner num is not part of the model in order to account for the included-variable bias (see Section 5.1).
gender = Female: -1.64 (1.40) [*], -1.37 (1.38) [**], -0.86 (1.22), -1.20 (0.63) [***]
inst type = Univ: 0.11 (2.41), -1.58 (1.28)
inst type = Cons.: -0.70 (1.63) [**]
age: -0.05 (0.46), 0.03 (0.42), 0.01 (0.44), 0.01 (0.20)
pub num: 0.12 (0.22), 0.01 (0.10), 0.01 (0.06), -0.01 (0.02)
avg aut: 0.03 (0.16), 0.01 (0.46), -0.13 (0.24), 0.04 (0.06)
tot cost: 0.87 (1.83), 0.01 (0.01), 0.14 (1.15), -0.19 (0.68)
fund req: -1.25 (2.62), 0.01 (0.01), -0.21 (1.64), 0.27 (0.97)
fund req perc: 0.40 (1.83), -1.7 (2.48), 0.83 (0.65)
yr num: 0.58 (1.38), 0.03 (0.66), 1.49 (1.83), -0.93 (0.51)
yr cost: -0.87 (1.83), 0.01 (0.01), -0.15 (1.15), 0.19 (0.68)
yr perc: 0.22 (1.41), -1.0 (1.5), -0.12 (0.70), 0.52 (0.46)
grr num: 0.27 (0.81), 0.58 (0.43)
grr cost: -0.15 (1.15), 0.19 (0.68)
grr perc: -0.11 (0.70), 0.24 (0.51)
d1 lv1 = PE: 0.24 (1.93), 0.78 (0.66)
d1 lv1 = SH: 0.25 (1.85), 0.59 (0.83)
d1 lv2 = LS2: -1.17 (1.88)
d1 lv2 = LS3: -0.72 (2.24)
d1 lv2 = LS4: -0.17 (1.63)
d1 lv2 = LS6: -0.21 (1.70)
d1 lv2 = LS7: 0.14 (2.10)

Table 5 reports the Firth logistic regression model for the proposals satisfying the antecedent of rule R1. All other factors being equal, female PIs have e^(-1.64) = 0.194 times the odds of male PIs of passing the peer-review.9 The coefficient -1.64 is greater (in absolute value) than -0.33, the one computed over the whole dataset of proposals in program P1 (see Table 3). More importantly, the coefficient is now significantly non-zero at the 90% confidence level.

9 Here we deal with the odds of passing, and low values denote a high burden. The odds ratio (OR) deals instead with the odds of being rejected, and high values denote a high burden.

6.2. Rule R2: physical and analytical chemical sciences in program P2

A context of possible discrimination for proposals in program P2 is unveiled by the following rule:

R2: (d1_lv2 = PE4) and (tot_cost >= 1,358,000) and (age <= 35) => disc=yes [prec=1.0] [rec=0.031] [diff=0.194] [OR=4.50]

where PE4 is the Physical and Analytical Chemical Sciences panel, at the second level of the ERC hierarchy. The context of rule R2 concerns proposals with a high budget led by young PIs. There are 9 proposals led by male PIs, 2 of which passed the peer-review, and 11 proposals led by female PIs, none of which passed it. The recall of rule R2 is 3.1%, i.e., the context covers 3.1% of the proposals labeled as discriminated. The precision of rule R2 is 100%, meaning that all of the 11 proposals in the context led by female PIs have been labeled as discriminated. The average risk difference is 19.4%, and the odds ratio is 4.5.
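As an illustration of the plus-4 corrected odds ratio used for the rules, the following sketch recomputes the reported values from the contingency counts given in the text for rules R1 and R2; the function name `plus4_odds_ratio` is ours, not taken from the paper.

```python
def plus4_odds_ratio(prot_neg, prot_pos, unprot_neg, unprot_pos):
    """Odds ratio of the negative decision (rejection) for the protected
    vs the unprotected group, with the plus-4 correction: a fictitious +1
    is added to each cell of the 2x2 contingency table."""
    a, b = prot_neg + 1, prot_pos + 1        # female-led: rejected, passed
    c, d = unprot_neg + 1, unprot_pos + 1    # male-led: rejected, passed
    return (a * d) / (b * c)

# Rule R1: 25 female-led proposals (0 passed), 8 male-led (2 passed).
print(round(plus4_odds_ratio(25, 0, 6, 2), 2))   # -> 11.14
# Rule R2: 11 female-led proposals (0 passed), 9 male-led (2 passed).
print(round(plus4_odds_ratio(11, 0, 7, 2), 2))   # -> 4.5
```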
Table 5 shows the Firth logistic regression model for the proposals in the context of rule R2. All other factors being equal, female PIs have e^(-1.37) = 0.254 times the odds of male PIs of passing the peer-review. The coefficient -1.37 is greater (in absolute value) than -0.87, the one computed over the whole dataset (see Table 3), yet it is significantly non-zero at the lower confidence level of 95%. Summarizing, rule R2 unveils a niche of proposals with a gender bias higher than the average bias of the whole dataset of proposals in program P2.

6.3. Rule R3: expensive projects in program P1

A second rule about proposals in program P1 is the following:

R3: (yr_cost >= 187,000) and (grr_cost >= 70,000) => disc=yes [prec=0.86] [rec=0.052] [diff=0.161] [OR=5.77]

The antecedent of the rule concerns proposals with a high budget for young researchers and for good reputation researchers. We checked that such a context is disjoint from the one of rule R1, where all proposals had no budget for good reputation researchers. There are 16 proposals led by male PIs in the context, 4 of which passed the peer-review, and 14 proposals led by female PIs, none of which passed it. Precision of the rule is 86%, meaning that 12 out of the 14 proposals led by female PIs have been labeled as discriminated, with an average risk difference for the 14 proposals of 16.1%. The recall of the rule is 5.2%, hence rules R1 and R3 together cover 14.7% of the proposals labeled as discriminated. The odds ratio of the proposals in the context is 5.77.

Figure 11: Scatter plot of grr cost over yr cost for proposals satisfying the antecedent of R3.

Intuitively, the peer-reviewers of program P1 seem to trust male PIs more than female PIs in managing projects with high costs of personnel, namely young and good reputation researchers. Does rule R3 unveil then a case of actual discrimination? Firth logistic regression on the dataset of proposals of the context (see Table 5) shows a coefficient for gender=Female of -0.86, which, however, is not statistically significant at the 90% confidence level – i.e., the null hypothesis that the coefficient is actually zero cannot be rejected. We proceed by analysing the proposals in the context further. The scatter plot in Figure 11 reports the costs of good reputation researchers over the costs of young researchers. It highlights that proposals led by female PIs tend to have a higher proportion of good reputation researcher costs than proposals led by male PIs. This could somehow be in contrast with the intended objectives of the call for proposals, which require a substantial hiring of young researchers. Therefore, it may well be the case that peer-reviewers have scored worse those proposals relying too much on the hiring of senior researchers. This could be argued as a legitimate and objective justification for the disparate treatment of female PIs, an exception admitted by anti-discrimination laws. Whether this is the case or not, however, is a matter of legal argumentation. Strictly speaking, the call for proposals did not set an explicit maximum threshold on the proportion of good reputation researcher costs over young researcher costs.

6.4. Rule R4: young PIs in program P2

A second rule for proposals in program P2 is the following:

R4: (age <= 32) and (fund_req >= 310,000) => disc=yes [prec=0.52] [rec=0.12] [diff=0.07] [OR=9.6]

The antecedent of the rule concerns younger PIs with a fund request greater than or equal to €310K.
Intuitively, this can be interpreted as a negative bias against younger female PIs who require a medium-high grant. There are 201 proposals in such a context: 131 with male PIs, 16 of which passed the peer-review, and 70 with female PIs, only 1 of which passed the peer-review. The odds ratio is 9.6. The precision of the rule (52%) is moderately higher than 38% – the overall percentage of proposals labeled as discriminated in program P2. That is, about half of the 69 female PIs whose proposal was rejected showed a risk difference greater than or equal to 10%. In fact, the average risk difference is only 7%. However, recall is rather high: 12% of the proposals labeled as discriminated in program P2 are in the context of rule R4. Finally, we checked that the overlap of proposals in the contexts of rules R4 and R2 is minimal, with only 3 proposals led by male PIs and 1 led by a female PI. Such an overlap originates from the fact that rules R2 and R4 are selected from two different classification models.

Consider the logistic regression model for proposals in the context of rule R4 (see Table 5). All other factors being equal, female PIs have e^(-1.20) = 0.301 times the odds of male PIs of passing the peer-review. The coefficient -1.20 is smaller (in absolute value) than the one of rule R2, but significant at the higher confidence level of 99%. Moreover, it is greater (in absolute value) than the one of the whole dataset (see Table 3). Summarizing, rule R4 highlights a context with a higher gender bias than in the whole dataset of proposals in program P2. This context and those of the other rules were not previously known as possible stereotypes of discriminatory behaviors. Rather, they have been the result of a discrimination discovery investigation.

7. A KDD process in support of discrimination discovery

Since personal data in decision records are highly dimensional, i.e., characterized by many multi-valued variables, a huge number of possible contexts may, or may not, be the theater of discrimination. In order to extract, select, and rank those that represent actual discriminatory behaviors, an anti-discrimination analyst should apply appropriate tools for pre-processing data, extracting prospective discrimination contexts, exploring in detail the data related to a context, and validating contexts both statistically and from a legal perspective.10 Discrimination discovery consists then of an iterative and interactive process. Iterative because, at certain stages, the user should have the possibility of choosing different algorithms, parameters, and evaluation measures, or of iteratively repeating some steps to unveil meaningful discrimination patterns. Interactive because several stages need the support of a domain expert in making decisions or in analysing the results of a previous step.

10 As observed by Gastwirth (1992), the objectives of science and the law often diverge, with rigorous scientific methods conflicting with the adversarial nature of the legal system.

Figure 12: The KDD process of situation testing for discrimination discovery.

We propose here to adopt the process reported in Figure 12, which is specialized in the use of situation testing for extracting contexts of possible discrimination. The process has been abstracted from the case study presented in the previous sections, and it consists of four major steps.

Data Understanding and Preparation.
The availability of historical data concerning decisions made in socially-sensitive tasks is the starting point for discovering discrimination. We assume a collection of data sources storing historical decision records in any format, including relational, XML, text, spreadsheets, or any combination of them. Standard data pre-processing techniques (selection, cleansing, transformation, outlier detection) can be adopted to reach a pre-processed dataset consisting of an input relation as the basis for the discrimination analysis. The grain of tuples in the relation is that of an individual (an applicant for a loan, for a position, for a benefit). Three groups of attributes are assumed to be part of the relation:

• protected group attributes: one or more attributes that identify the membership of an individual in a protected group. Attributes such as sex, age, marital status, language, disability, and membership of political parties or unions are typically recorded in application forms, curricula, or registry databases. Attributes such as race, skin color, and religion may not be available, and must be collected, e.g., by surveying the involved people;

• decision attribute: an attribute storing the decision for each individual. Decision values can be nominal, e.g., granting or denying a benefit, or continuous, e.g., the interest rate of a loan or the wage of a worker;

• control attributes: one or more attributes on control factors that may be (legally) plausible reasons affecting the actual decision. Examples include attributes on the financial capability to repay a loan, or on the productivity of an applicant worker.

Risk Difference Analysis.

For each tuple of the input relation denoting an individual of the protected group, the additional attribute diff is calculated as the risk difference between the decisions for its k nearest neighbors of the protected group and the decisions for its k nearest neighbors of the unprotected group (see Section 2.3). We call the output of the algorithm the risk difference relation. The value k is a parameter of the algorithm. A legitimate question is how to choose the "right" k. A large k means that every instance is a neighbor, hence the distribution of diff tends towards a unique value. Conversely, for a small k, we run the risk that the distribution is affected by randomness. As a consequence, a study of the distribution of diff for a few values of k is required. This means iterating the calculation of the diff attribute. Exploratory analysis of diff distributions may also be conducted to evaluate risk differences at the variation of: the protected group under consideration, e.g., discrimination against women or against youngsters; the compound effects of multiple discrimination grounds, e.g., discrimination against young women vs discrimination against women or youngsters in isolation; the presence of favoritism towards individuals of a dominant group, e.g., nepotism. Once again, this requires iterating the calculation of diff by specifying a different protected group attribute to focus on.

Discrimination Model Extraction.

By fixing a threshold value t, an individual r of the protected group can be labeled as discriminated or not on the basis of the condition diff(r) ≥ t. We introduce a new boolean attribute disc and set it to true for a tuple r meeting the condition above, and to false otherwise.
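A minimal sketch of the risk-difference and labeling steps just described follows. It assumes numeric control attributes and a plain Euclidean distance, which is a simplification of the distance adopted by Luong et al. (2011) for mixed attribute types; the names `X`, `protected`, and `negative` are hypothetical inputs, not identifiers from the paper.

```python
# Minimal sketch of the risk-difference (situation testing) step, under
# simplifying assumptions: numeric control attributes, Euclidean distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def risk_differences(X, protected, negative, k=16):
    """For each protected individual, diff = rate of negative decisions
    among its k nearest protected neighbors minus the rate among its
    k nearest unprotected neighbors."""
    X = np.asarray(X, dtype=float)
    protected = np.asarray(protected, dtype=bool)
    negative = np.asarray(negative, dtype=bool)

    # +1 neighbor on the protected side: the query point itself belongs to
    # that set and is dropped from its own neighborhood below.
    nn_prot = NearestNeighbors(n_neighbors=k + 1).fit(X[protected])
    nn_unprot = NearestNeighbors(n_neighbors=k).fit(X[~protected])
    neg_prot = negative[protected]
    neg_unprot = negative[~protected]

    diffs = []
    for x in X[protected]:
        _, idx_p = nn_prot.kneighbors(x.reshape(1, -1))
        _, idx_u = nn_unprot.kneighbors(x.reshape(1, -1))
        p1 = neg_prot[idx_p[0][1:]].mean()   # negative-decision rate, protected neighbors
        p2 = neg_unprot[idx_u[0]].mean()     # negative-decision rate, unprotected neighbors
        diffs.append(p1 - p2)
    return np.array(diffs)

# Labeling step: disc = true iff diff >= t, e.g.
#   diff = risk_differences(X, protected, negative, k=16)
#   disc = diff >= 0.10
```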
A global description of who has been discriminated can now be extracted by resorting to a standard classification problem on the dataset of individuals of the protected group, where the class attribute is the newly introduced disc attribute. The accuracy of the classifier is evaluated with objective interestingness measures, e.g., precision and recall over the disc = true class value. The intended use of the classifier is descriptive, namely to provide the analyst with a characterization of the individuals that have been discriminated. The choice of the value t should then be supported by laws or regulators.11 For instance, the four-fifths rule by the US Equal Employment Opportunity Commission (1978) states that a job selection rate (the RC measure from Figure 1) lower than 80% represents prima facie evidence of adverse impact.

11 A relevant question is the other way round – namely, can data mining help law makers and regulators in the definition of appropriate values for t?

Since the intended use of the extracted classifier is descriptive, classification models that are easily interpretable by (legal) experts and whose size is small should be preferred. In other words, one should trade accuracy for simplicity. Classification rules and decision trees are natural choices in this sense, since rules and tree paths can easily be interpreted and ranked. The extracted classification models provide a global description of the disc class values. They are stored in a knowledge base, for comparison purposes and for the filtering of specific contexts of discrimination – as described next.

Rule Reasoning and Validation.

The actual discovery of discriminatory situations and practices may reveal itself as an extremely difficult task. Due to time and cost constraints, an anti-discrimination analyst needs to put under investigation a limited number of contexts of possible discrimination. In this sense, only a small portion of the classification models can be analysed in detail, say the top N rules or the top N paths of a decision tree. We propose to concentrate on classification rules of the form:

(cond_1) and ... and (cond_n) => disc=yes [prec] [rec] [diff] [OR]

where (cond_1) and ... and (cond_n) is obtained from a classification model (from a rule or from a path of a decision tree). Rules are ranked on the basis of one or more interestingness measures, including: precision [prec] (the proportion of discriminated individuals among those of the protected group that satisfy the antecedent), recall [rec] (the proportion of the overall discriminated individuals covered by the antecedent), average value of diff [diff] (a measure of the degree of discrimination observed by individuals of the protected group that satisfy the antecedent), and odds ratio [OR] (a measure of the burden of negative decisions on the individuals of the protected group when compared to those of the unprotected group satisfying the antecedent of the rule). Notice that [diff] and [OR] may rank rules differently because they contrast distinct sets of groups (the 2k nearest neighbors, and the members of the unprotected group satisfying the antecedent of the rule). Statistical validation is accounted for in our approach by relying on logistic regression, which is a well-known tool in the legal and economic research communities. Readability and interpretability should also be taken into account by preferring rules with fewer items in the antecedent, thus trading interestingness with simplicity.
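By way of illustration, the sketch below computes the [prec], [rec], and [diff] measures for candidate rules given as boolean masks over the protected-group dataset, and ranks them by a chosen measure; the helper names are ours, the demo data is synthetic, and the odds ratio [OR] would be obtained with the plus-4 correction sketched in Section 6.

```python
import numpy as np

def rule_measures(antecedent, disc, diff):
    """Interestingness measures for one candidate rule.
    antecedent: boolean mask, true where the rule's conditions hold;
    disc: boolean disc labels; diff: risk differences
    (all defined over the protected-group dataset)."""
    prec = disc[antecedent].mean()               # discriminated among covered
    rec = disc[antecedent].sum() / disc.sum()    # covered among all discriminated
    avg_diff = diff[antecedent].mean()           # average degree of discrimination
    return {"prec": prec, "rec": rec, "diff": avg_diff}

def rank_rules(rules, disc, diff, by="prec", top_n=10):
    """rules: dict mapping a rule label to its antecedent mask."""
    scored = {name: rule_measures(mask, disc, diff) for name, mask in rules.items()}
    return sorted(scored.items(), key=lambda kv: kv[1][by], reverse=True)[:top_n]

# Synthetic demo with two hypothetical rule contexts.
rng = np.random.default_rng(0)
disc = rng.random(500) < 0.3
diff = np.where(disc, rng.uniform(0.1, 0.4, 500), rng.uniform(-0.2, 0.1, 500))
rules = {"R_a": rng.random(500) < 0.10, "R_b": rng.random(500) < 0.05}
print(rank_rules(rules, disc, diff, by="prec"))
```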
As an alternative approach to the selection of rules from the classifiers extracted in the previous step of the process, one could mine all classification rules of the form above by means of association rule mining. Unfortunately, this results in a huge number of rules covering overlapping contexts of possible discrimination. This is what occurs, for instance, in the rule extraction and filtering approach of Ruggieri et al. (2010a). The rules selected in this step of the process, however, are still subject to further consideration, e.g., by a legal expert, who may require further data exploration and, possibly, iteration of previous steps of the process. Therefore, the number of selected rules must be reasonably low. Selecting the best rules from the best performing classification models is then a means to keep the number of (overlapping) rules to a minimum.

8. Conclusions

The contribution of this paper has been threefold.

First, we have presented a complex case study in the context of scientific project funding using real data from an Italian national call for proposals. The application of discrimination discovery methodologies based on data mining to real case studies was lacking in the existing literature. So far, experiments and analyses have been conducted on "general purpose" datasets, not explicitly collected or processed for discrimination analysis. As a consequence, the reported analyses have been necessarily partial, typically being limited to summary statistics (e.g., the number of possibly discriminatory contexts found), to artificial examples, and to generic argumentations on the results found. This is a serious drawback that limits the acceptance of knowledge discovery methods in practice.

Second, we have proposed and applied a methodology that couples legal methods (situation testing) for the definition of cases of possible discrimination, data mining methods (a variant of k-NN plus standard classification) for the search of contexts of possible discrimination, and regression analysis for the statistical validation of such contexts. This approach overcomes the statistical analysis12 of discrimination conducted in the social sciences, economics, and legal literature, which is limited to the verification of an hypothesis of possible discrimination on the whole set of past decision records. Such an analysis reveals itself to be inadequate to cope with the problem of searching for unknown or unforeseen contexts of discriminatory decisions hidden in a large dataset. On the contrary, the rules discussed in Section 6 unveil prima facie evidence of discrimination when certain project costs are above a threshold value. Both the cost attribute and the threshold value, however, come as the result of the analysis – they were not an a priori hypothesis to be verified. The extraction of contexts of discrimination is precisely the objective of discrimination discovery.

12 The contrast between the two approaches above is an instance of the two general "cultures" in the use of statistical modeling (Breiman, 2001): data modeling vs algorithmic modeling.

Third, from the specific case study, we have abstracted a general process of discrimination discovery. The adopted methodology relies on an implementation of the legal practice of situation testing using a variant of k-NN, and then on extracting and reasoning about a classification model.
The steps of the methodology have been described in the process of Figure 12, which represents a guidance for researchers and anti-discrimination analysts. We believe that this contribution can provide higher confidence about the replicability of the analyses and their applicability to real cases.

Some issues remain open for future investigation. With reference to the case study, further analysis will be made possible by enriching the available dataset with additional control features, e.g., some accurate measures of the scientific productivity of applicants and of their professional network. This was not possible in our study, since our input data were anonymized. Concerning the tools adopted, while the k-NN algorithm remains the core component of the proposed process, of particular interest is the formalization of the deductive component, in which the extracted classification models are filtered, refined, transformed and validated into useful knowledge. We aim at designing a post-processing tool, by adapting the XQuake system (Romei and Turini, 2010), able to support the user in the deductive part of the process. Finally, throughout the paper, we have assumed the availability of a feature denoting the protected group under analysis – in the case study, the gender of the PIs. In indirect discrimination discovery, this assumption does not hold, e.g., because race, ethnicity, or sexual orientation may not be recorded in the data. In such cases, a different approach must be devised.

References

Bendick, M., 2007. Situation testing for employment discrimination in the United States of America. Horizons Stratégiques 3 (5), 17–39.
Bentley, J. T., Adamson, R., 2003. Gender differences in the careers of academic scientists and engineers: A literature review. Special report, National Science Foundation, http://www.nsf.gov.
Bornmann, L., Daniel, H.-D., 2005. Selection of research fellowship recipients by committee peer review: Reliability, fairness and predictive validity of Board of Trustees' decisions. Scientometrics 63 (2), 297–320.
Bornmann, L., Daniel, H.-D., 2009. The state of h index research. EMBO Reports 10 (1), 2–6.
Bornmann, L., Mutz, R., Daniel, H.-D., 2008. Latent Markov modeling applied to grant peer review. Journal of Informetrics 2 (3), 217–228.
Breiman, L., 2001. Statistical modeling: The two cultures. Statistical Science 16 (3), 199–231.
Brouns, M., 2000. The gendered nature of assessment procedures in scientific research funding: The Dutch case. Higher Education in Europe 25 (2), 193–199.
Calders, T., Verwer, S., 2010. Three naive Bayes approaches for discrimination-free classification. Data Mining & Knowledge Discovery 21 (2), 277–292.
Ceci, S. J., Williams, W. M., 2011. Understanding current causes of women's underrepresentation in science. Proc. of the National Academy of Sciences 108 (8), 3157–3162.
Cohen, W. W., 1995. Fast effective rule induction. In: Proc. of Int. Conf. on Machine Learning (ICML 1995). Morgan Kaufmann, pp. 115–123.
Council of the E.U., 1999. Resolution 1999/C 201/01 on Women and Science. http://eur-lex.europa.eu.
Custers, B. H. M., Calders, T., Schermer, B. W., Zarsky, T. Z. (Eds.), 2013. Discrimination and Privacy in the Information Society. Vol. 3 of Studies in Applied Philosophy, Epistemology and Rational Ethics. Springer.
Equal Employment Opportunity Commission, 1978. Uniform guidelines on employee selection procedure. 43 FR 38295, http://www.gpo.gov.
European Commission, 2009. The gender challenge in research funding: Assessing the European national scenes. Directorate General for Research, Science, Economy and Society, Unit L.4, http://ec.europa.eu.
European Commission, 2012. Meta-analysis of gender and science research. Directorate General for Research and Innovation, Sector B6.2, http://www.genderandscience.org.
Firth, D., 1993. Bias reduction of maximum likelihood estimates. Biometrika 80 (1), 27–38.
Frank, E., Witten, I. H., 1998. Generating accurate rule sets without global optimization. In: Proc. of Int. Conf. on Machine Learning (ICML 1998). Morgan Kaufmann, pp. 144–151.
Gastwirth, J. L., 1992. Statistical reasoning in the legal setting. The American Statistician 46 (1), 55–69.
Goldstein, H., 2011. Multilevel Statistical Models, 4th Edition. Wiley.
Hajian, S., Domingo-Ferrer, J., 2012. A methodology for direct and indirect discrimination prevention in data mining. IEEE Transactions on Knowledge and Data Engineering, to appear.
Heinze, G., 2006. A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine 25 (24), 4216–4226.
Jayasinghe, U. W., Marsh, H. W., Bond, N. W., 2003. A multilevel cross-classified modeling approach to peer-review of grant proposals. Journal of the Royal Statistical Society 166 (3), 279–300.
Kamiran, F., Calders, T., 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1–33.
Killingsworth, M. R., 1993. Analyzing employment discrimination: From the seminar room to the courtroom. American Economic Review 83 (2), 67–72.
Larivière, V., Vignola-Gagné, E., Villeneuve, C., Gélinas, P., Gingras, Y., 2011. Sex differences in research funding, productivity and impact: an analysis of Québec university professors. Scientometrics 87 (3), 483–498.
Ley, T. J., Hamilton, B. H., 2008. The gender gap in NIH grant applications. Science 322 (5907), 1472–1474.
Luong, B. T., Ruggieri, S., Turini, F., 2011. k-NN as an implementation of situation testing for discrimination discovery and prevention. In: Apté, C., Ghosh, J., Smyth, P. (Eds.), Proc. of the ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2011). ACM, pp. 502–510.
Marsh, H. W., Jayasinghe, U. W., Bond, N. W., 2008. Improving the peer-review process for grant applications: Reliability, validity, bias, and generalizability. American Psychologist 63 (3), 160–168.
Mutz, R., Bornmann, L., Daniel, H.-D., 2012. Does gender matter in grant peer review? An empirical investigation using the example of the Austrian Science Fund. Journal of Psychology 220, 121–129.
Pager, D., 2007. The use of field experiments for studies of employment discrimination: Contributions, critiques, and directions for the future. The ANNALS of the American Academy of Political and Social Science 609 (1), 104–133.
Pedreschi, D., Ruggieri, S., Turini, F., 2008. Discrimination-aware data mining. In: Li, Y., Liu, B., Sarawagi, S. (Eds.), Proc. of the ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2008). ACM, pp. 560–568.
Quillian, L., 2006. New approaches to understanding racial prejudice and discrimination. Annual Review of Sociology 32 (1), 299–328.
Quinlan, J. R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
RAND, 2005. Is there gender bias in federal grant programs? RAND Infrastructure, Safety, and Environment Brief RB-9147-NSF, http://rand.org.
Romei, A., Ruggieri, S., 2013. Discrimination data analysis: A multidisciplinary bibliography. In: Custers, B. H. M., Calders, T., Schermer, B. W., Zarsky, T. Z. (Eds.), Discrimination and Privacy in the Information Society. Vol. 3 of Studies in Applied Philosophy, Epistemology and Rational Ethics. Springer, pp. 109–135.
Romei, A., Ruggieri, S., Turini, F., 2012. Discovering gender discrimination in project funding. In: Proc. of the IEEE ICDM 2012 Int. Workshop on Discrimination and Privacy-Aware Data Mining (DPADM). IEEE Computer Society, pp. 394–401.
Romei, A., Turini, F., 2010. XML Data Mining. Software: Practice and Experience 40 (2), 101–130.
Rorive, I., 2009. Proving Discrimination Cases – the Role of Situation Testing. Centre For Equal Rights & Migration Policy Group, http://www.migpolgroup.com.
Ruggieri, S., Pedreschi, D., Turini, F., 2010a. Data mining for discrimination discovery. ACM Trans. on Knowledge Discovery from Data 4 (2), Article 9.
Ruggieri, S., Pedreschi, D., Turini, F., 2010b. DCUBE: Discrimination discovery in databases. In: Elmagarmid, A. K., Agrawal, D. (Eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2010). ACM, pp. 1127–1130.
Sandström, U., Hällsten, M., 2008. Persistent nepotism in peer-review. Scientometrics 74 (2), 175–189.
UNESCO, 2007. Science, Technology and Gender: An International Report, 4th Edition. UNESCO Publishing.
Wennerås, C., Wold, A., 1997. Nepotism and sexism in peer-review. Nature 387 (5), 341–343.
Wilson, R., 2004. Where the elite teach, it's still a man's world. The Chronicle of Higher Education 51 (15).
Witten, I. H., Frank, E., 2011. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 3rd Edition. Morgan Kaufmann, San Francisco.