Algorithmic Risk Assessments Can Alter Human Decision-Making Processes in High-Stakes Government Contexts
Ben Green and Yiling Chen
2020-12-09. DOI: 10.1145/3479562

Abstract. Governments are increasingly turning to algorithmic risk assessments when making important decisions, such as whether to release criminal defendants before trial. Policymakers assert that providing public servants with algorithmic advice will improve human risk predictions and thereby lead to better (e.g., fairer) decisions. Yet because many policy decisions require balancing risk-reduction with competing goals, improving the accuracy of predictions may not necessarily improve the quality of decisions. If risk assessments make people more attentive to reducing risk at the expense of other values, these algorithms would diminish the implementation of public policy even as they lead to more accurate predictions. Through an experiment with 2,140 lay participants simulating two high-stakes government contexts, we provide the first direct evidence that risk assessments can systematically alter how people factor risk into their decisions. These shifts counteracted the potential benefits of improved prediction accuracy. In the pretrial setting of our experiment, the risk assessment made participants more sensitive to increases in perceived risk; this shift increased the racial disparity in pretrial detention by 1.9%. In the government loans setting of our experiment, the risk assessment made participants more risk-averse; this shift reduced government aid by 8.3%. These results demonstrate the potential limits and harms of attempts to improve public policy by incorporating predictive algorithms into multifaceted policy decisions. If these observed behaviors occur in practice, presenting risk assessments to public servants would generate unexpected and unjust shifts in public policy without being subject to democratic deliberation or oversight.

1 INTRODUCTION
Following recent advances in the quality and accessibility of algorithms, governments increasingly use machine learning when making high-stakes decisions [21, 26]. Many applications of algorithms involve risk assessments, which predict the risk of some adverse outcome. These predictions are then presented to human decision-makers to inform consequential decisions about individuals. Applications of public sector risk assessments include informing pretrial and sentencing decisions.

2 BACKGROUND AND RELATED WORK
The increasing use of algorithmic decision-making aids across government has placed novel human-algorithm collaborations at the center of consequential policy decisions. In the context of machine learning, the standard framework for human-algorithm collaborations is "human-in-the-loop" systems. In these settings, people are incorporated into the machine learning pipeline to produce the best possible algorithmic output. Algorithms make the final decisions, with humans assisting (e.g., labeling training data and reviewing low-confidence classifications) [3]. In some cases, such tasks are completed by crowds of people, with crowdsourcing techniques used to support machine learning models [39, 71]. In public policy contexts, however, human-algorithm collaborations involve a different process and goal: algorithms are incorporated into human decision-making processes to generate the best possible human decision.
People make the final decisions, with algorithms assisting (e.g., making accurate predictions based on patterns within datasets). Instead of the human-in-the-loop frame, therefore, most uses of algorithms in government call for a complementary paradigm: "algorithm-in-the-loop" decision-making [28]. Algorithm-in-the-loop settings center human decisions (rather than algorithmic decisions) as the most important outcome, orienting attention to how algorithms influence human decisions.

Despite the potential promise of algorithms aiding human predictions, experimental studies have uncovered numerous limits in people's ability to make appropriate and effective use of algorithmic advice. Several studies have found that algorithmic advice can improve the accuracy of human predictions, but people's decisions about when and how to diverge from algorithmic recommendations are typically incorrect [28, 29, 32, 46]. People struggle to evaluate the quality of algorithmic advice [24, 28, 29, 46], often discount accurate algorithmic recommendations [18, 48, 75], and exhibit racial biases in their responses to risk assessments [28, 29]. Although evidence suggests that experts are capable of overriding some erroneous predictions in practice [15], other evidence demonstrates that incorrect predictions reduce the quality of expert judgments [40] and that experts make less effective use of algorithmic forecasts than laypeople [50]. These breakdowns in human-algorithm collaboration demonstrate that algorithmic interventions are indeterminate: the effects of using an algorithm often diverge in practice from what was expected based on the algorithm's technical characteristics [31].

Numerous evaluations of how judges use pretrial risk assessments indicate that providing accurate risk predictions does not generate the intended improvements in decision-making (i.e., reduced pretrial detention, recidivism, and racial disparities). In jurisdictions across the country, judges disproportionately override release recommendations to detain defendants, leading to much higher than expected pretrial detention rates [60, 63-65, 72]. Several studies have shown that risk assessments exacerbate rather than diminish racial disparities in pretrial detention, in part because judges often make more punitive decisions about Black defendants than similar white defendants [1, 14, 64]. Furthermore, ethnographic work has found that judges often resist using risk assessments because they dislike the idea of these tools replacing or surveilling them [6].

Despite growing knowledge about how people use algorithms in both experimental and real-world settings, a significant open question is whether presenting algorithmic advice alters the process through which people make decisions. Because risk assessments emphasize the likelihood of specific adverse outcomes (such as a pretrial defendant failing to appear for trial or being rearrested), scholars and activists have raised concerns that risk assessments could make risk a more salient factor in decision-making processes [25, 62, 66]. Such concerns follow closely from the existing literature on framing and priming effects. Prior research demonstrates that framing decisions around losses motivates decision-makers (including judges) to avoid those losses [58, 59, 67]. Similarly, priming people (including financial professionals) to consider risks makes them less likely to make or support decisions that involve risk [12, 19, 20, 23].
Initial studies have observed behaviors that are consistent with the possibility of risk assessments altering decision-making processes. A small experiment with 83 law students making simulated sentencing decisions found that presenting a risk assessment increased the sentence given to a high-risk defendant and decreased the sentence given to a low-risk defendant [62]. A later experiment with 340 judges making simulated sentencing decisions found that presenting a risk assessment increased the likelihood that a low socioeconomic status defendant would be incarcerated and decreased the likelihood that a high socioeconomic status defendant would be incarcerated [61]. Although these results suggest that showing a risk assessment heightens the salience of risk, they cannot conclusively determine whether risk assessments alter how people weigh risk when making decisions. Because these studies looked only at people's decisions with and without a risk assessment, neither can distinguish between two potential explanations: a) the risk assessment actually increased the weight that people gave to reducing risk when making decisions, and b) the risk assessment merely influenced the estimates of risk that people factored into their decisions. Distinguishing between these two explanations requires accounting for a risk assessment's effects on human risk predictions rather than directly comparing human decisions with and without a risk assessment.

In this study, we investigate whether risk assessments systematically alter how people factor risk into their decisions. Because risk assessments emphasize the risk of an adverse outcome, we hypothesized that presenting risk assessments would make people more attentive to avoiding risk when making decisions. Given existing racial disparities in risk, we also hypothesized that this effect would exacerbate racial disparities in decisions.

Based on how policy documents [55] and court rulings [37, 73] describe the use of risk assessments, we analyzed decisions as being made through a two-stage process (Figure 1). First is the risk-prediction process (RPP), which represents a quantitative prediction task. The RPP evaluates the attributes of a given subject (e.g., pretrial defendant) to predict that person's risk of an adverse outcome (e.g., failing to return to court for trial or being arrested before trial). This process yields a decision-maker's "perceived risk" about the subject. Second is the decision-making process (DMP), which involves a normative balancing act between numerous considerations rather than a straightforward translation of risk into a decision. The DMP incorporates the perceived risk alongside other relevant factors (e.g., the harms associated with pretrial detention) to make a decision about the subject (e.g., whether to release the defendant before their trial). A systematic change to the DMP reflects a shift in how decision-makers balance risk with other factors, which amounts to a shift in public policy [51, 76]. We instantiated the DMP as a function that determines the probability of detaining a defendant or rejecting a loan applicant conditioned on the perceived risk about the subject in question.

The central question of this study is whether risk assessments influence only the RPP, as is typically assumed, or instead affect both the RPP and the DMP. We categorize the influence of risk assessments into four possible "scenarios," as summarized in Table 1. Scenario 1 represents the baseline condition without any risk assessment.
When a risk assessment's advice is presented, it could lead to either Scenario 3 or Scenario 4. Scenario 3 is the commonly assumed outcome: risk assessments alter the RPP but not the DMP, meaning that improvements in prediction accuracy lead directly to more informed decisions. The assumption that algorithms lead to Scenario 3 is central to support for algorithmic decision-making aids in government [21, 37, 55, 62, 73]. In this scenario, which represents the absence of the dashed line in Figure 1, improving prediction accuracy with risk assessments directly improves decisions. Scenario 4 is our hypothesized outcome: risk assessments alter both the RPP and the DMP, meaning that shifts in the DMP could counteract any gains in prediction accuracy. In this scenario, which represents the presence of the dashed line in Figure 1, improving prediction accuracy with risk assessments may not improve decisions.

Our goal is to test whether showing a risk assessment alters the DMP, which amounts to distinguishing between Scenario 3 and Scenario 4. However, we cannot directly observe the DMP because it is a latent form of cognitive processing. Because risk assessments alter the RPP [28, 29, 32, 65], simply showing that a risk assessment changed participant decisions is insufficient to demonstrate that the risk assessment affected the DMP. The challenge arises because a risk assessment could alter decisions in two distinct but superficially indistinguishable ways. First, it could influence the RPP alone (Scenario 3), which would lead to decisions based on different risk estimates without changing how perceived risk factors into decisions. Second, it could influence both the RPP and the DMP (Scenario 4), which would lead to decisions based on different risk estimates and would change how perceived risk factors into decisions. The only way to determine whether risk assessments affect the DMP is to compare decisions made with and without a risk assessment while accounting for the risk assessment's effects on predictions. Accomplishing this requires access to decision-makers' perceptions of risk about each subject. However, this information is not produced in practice and is difficult to obtain experimentally without influencing people's behavior. Determining the risk assessment's effects on the DMP thus requires a more complex experimental setup than prior work that has directly compared the decisions that people make with and without a risk assessment's advice. As described in more detail below, we designed our experiment to elicit both decisions and predictions from participants, enabling us to infer the effects of risk assessments on human decision-making processes.

Our study progressed in two stages. The first stage involved developing risk assessments for pretrial detention and home improvement loans. The second stage involved running an experiment on Amazon Mechanical Turk to evaluate how people interact with these risk assessments when making predictions and decisions. The full study was approved by the Harvard University Institutional Review Board and the National Archive of Criminal Justice Data (which manages the data used for the pretrial setting).

Our experiment simulated two settings of government decision-making: pretrial detention and government home improvement loans. Within the context of this study, these settings are structurally similar. In both settings, decision-makers must balance risk-reduction with conflicting normative considerations.
Decisions are made about an individual "subject" and can be "positive decisions" or "negative decisions." Subjects with high risk are more likely to receive negative decisions. In the pretrial setting, the subject of the decision is a criminal defendant, the positive decision is to release the defendant before trial, and the negative decision is to detain the defendant before trial. In the loans setting, the subject of the decision is a loan applicant, the positive decision is to approve the loan application, and the negative decision is to reject the loan application.

Pretrial Setting. After someone is arrested in the United States, they must await trial. Courts can either hold the criminal defendant in jail until their trial or release them with a mandate to return for their trial. Pretrial detention decisions involve balancing competing goals. Courts aim to ensure that defendants will return to court for trial and will not commit any crimes if released. The higher the risk that a defendant will fail to return to court for their trial or will commit any crimes, the more likely a judge is to detain the defendant until their trial. The interest in reducing risk is advanced by detaining defendants. However, pretrial decisions are also made with an interest in protecting the liberty of defendants, ensuring that defendants are able to mount a proper legal defense, and reducing the hardship to defendants and their families [2]. Pretrial detention is associated with a range of negative outcomes that include longer prison sentences, sexual abuse, and limited employment opportunities [27]. The interests in protecting liberty and avoiding the harms of pretrial detention are advanced by releasing defendants. In recent years, many jurisdictions across the U.S. have turned to risk assessments as a tool to make more accurate and objective predictions of risk. These improvements in prediction are intended to reduce racial biases and increase pretrial release rates [35, 38, 55].

Home Improvement Loans Setting. Many people apply for a loan to improve their house (e.g., to rehabilitate a home or to make a home energy efficient). When someone applies for a loan, it is common for the lender to assess the risk that the borrower will fail to pay back the money. This is known as defaulting on the loan. The higher the risk that the potential borrower will default on the loan, the less likely the lender generally is to provide money to that person. The U.S. government provides many types of home improvement loans in order to support low-income applicants who are unable to obtain affordable loans from banks [69]. This sets up a balancing act between conflicting aims. On the one hand, the goal of limiting loan default risk is enhanced by declining loans to low-income applicants. On the other hand, the goals of promoting equity, economic development, and community stability are enhanced by providing loans to low-income applicants. It is common for lenders to evaluate loan applicants using risk assessments that predict the likelihood of loan default. Although there are no known cases of governments using risk assessments when allocating home improvement loans, this setting is akin to government uses of risk assessments to determine who should receive other resources [21].

In order to test the effects of presenting risk assessment predictions to participants in our experiment, we first developed risk assessments for pretrial detention and government home improvement loans.
See Section A of the Appendix for a more detailed description of the data we used and how we developed these models. Our goal in this stage was not to develop optimal risk assessments, but to develop risk assessments that resemble those used in practice and that could be presented to participants during the Mechanical Turk experiment.

We used datasets with information about 47,141 felony defendants across the United States who had been released before trial [68] (Table A.1) and 45,218 recipients of home improvement loans via the peer-to-peer lending company Lending Club (Table A.2). The data included demographic information (including race, which we restricted to Black and white) for the felony defendants but not the loan applicants.

We developed risk assessments (i.e., machine learning classifiers) using gradient boosted trees with ten-fold cross-validation. Our models included five attributes of each defendant and seven attributes of each loan application. The pretrial risk assessment was trained to predict whether a defendant, if released before trial, would fail to appear in court for trial or would be arrested before trial. The loans risk assessment was trained to predict whether a loan applicant, if given the loan, would default on that loan. Both risk assessments exhibited similar accuracy to pretrial and loan risk assessments developed in research and practice (pretrial AUC=0.67, loans AUC=0.69).

Drawing from the held-out validation sets in each setting, we selected samples of 300 defendants and 300 loan applicants whose profiles and risk predictions would be presented to participants during the experiment. When used in our experiment, the risk assessments presented numerical predictions of risk about subjects (i.e., 0%-100%, in intervals of 10%) but did not suggest what decision participants should make based on those predictions.

We recruited 2,685 participants on Amazon Mechanical Turk over two weeks in May 2020, restricting our task to workers inside the United States who had a task approval rate of at least 75%. Our analysis includes the results from the 2,140 participants who completed the experiment while also passing our quality control reviews (by correctly answering several comprehension questions and two attention-check questions). Across both settings, a majority of participants were male, white, and college graduates (Table 2). Participants were paid $3 for completing the experiment, and those making predictions received an additional payment of up to $1 based on the accuracy of their predictions. Bonus payments were allocated using a Brier score, which incentivizes participants to report their true estimate of risk. Participants completed the experiment in an average of 19.0 minutes and received an average wage of $15.02 per hour.

The experimental design consisted of three treatments. When participants entered the experiment, they were split evenly into either the pretrial or loans setting. We then followed a 2x2 design within each setting (Figure 2). Participants were split into control groups (which were not presented with a risk assessment's advice) and treatment groups (which were presented with a risk assessment's advice). Participants were also split into prediction groups (which were asked to make quantitative predictions about subjects) and decision groups (which were asked to make binary decisions about subjects).
This design elicits sufficient information to determine whether the risk assessments altered the DMP and is described below in more detail.

The experimental procedure was the same in both the pretrial and the loans settings. After completing a consent page, participants entered a tutorial that described their setting and the predictions or decisions that they would be asked to make. These descriptions explained the key considerations (including but not limited to risk) that factor into decisions in the relevant setting. Participants who would be shown the risk assessment's advice were also presented with background information about the risk assessment. The description of the risk assessment included details about the algorithm's prediction task, training data, and accuracy, and invited participants to use the predictions in whatever manner they desired. Participants were unable to proceed beyond the tutorial until they correctly answered several questions demonstrating their comprehension. We ignored all data from participants who required more than four attempts to correctly answer all of the comprehension questions. Participants then completed an intro survey (to provide demographic information and other attributes), a prediction or decision task (described in detail below), and an exit survey (to provide reflections on the task).

The key component of the experiment was the prediction or decision task (Figure 3). Based on their assigned setting, participants were presented with narrative profiles describing seven features about defendants or applicants. These defendants and applicants were drawn randomly from the assigned setting's 300-subject sample. Participants were tasked with making either numeric predictions of risk about 40 subjects or binary decisions about 30 subjects. Prediction-makers were asked to predict risk on a scale from 0% to 100%, with options in 10% increments. Decisions in the pretrial setting entailed whether to release or detain criminal defendants before trial. Decisions in the loans setting entailed whether to approve or reject home improvement loan applications. This setup matches salient elements of real-world settings such as pretrial adjudication, in which risk assessments are introduced as important decision-making aids [27, 64] and in which decisions are often made in just a few minutes [5, 60].

The primary goal of our experiment was to determine the effects of the risk assessments on human decision-making processes. This requires comparing the decisions of participants with and without a risk assessment while accounting for the risk assessment's effects on predictions. We therefore followed a 2x2 experimental setup within each setting, splitting participants according to whether they are presented with the risk assessment and whether they make binary decisions or quantitative risk predictions (Figure 2).

Our first experimental condition in each setting was whether or not participants were presented with the predictions of a risk assessment. Participants in the control group were shown only the narrative profiles about subjects. Participants in the treatment group were shown the narrative profiles as well as the risk assessment's predictions about subjects (Figure 3). This first condition allows us to compare the behaviors of participants with and without the risk assessment. However, directly comparing the decisions of the control and treatment groups cannot determine whether the risk assessment altered the DMP.
Decisions could differ across the control and treatment groups because the risk assessment influenced the RPP but not the DMP. For instance, a risk assessment could increase the likelihood of a defendant being detained before trial by a) making decision-makers more risk-averse or b) causing decision-makers to increase their estimate of the defendant's risk. Determining a risk assessment's influence on the DMP therefore requires accounting for the risk assessment's influence on predictions. This means that we must obtain information regarding participants' perceived risk about subjects in addition to their decisions about subjects.

We could obtain information about the risks perceived by decision-makers in two ways. The first approach is to ask each participant to make both predictions and decisions about subjects. This approach would provide the most accurate measure of the perceived risk associated with each decision. However, because this approach requires directly asking decision-making participants about risk, it would also prime them to consider risk whether or not they are shown the risk assessment. This priming would undermine the entire study by confounding our ability to detect how presenting a risk assessment influences the consideration of risk in the DMP.

The second approach to measuring perceived risk (which we take in this study) is to have some participants make risk predictions and some participants make decisions. We use the risk predictions provided by prediction-makers to estimate the risks perceived by decision-makers. Although this approach means that we cannot directly measure decision-making participants' perceptions of risk, it provides a reasonable proxy while maintaining the integrity of our research question.

Our second experimental condition, therefore, was whether participants were asked to make predictions or decisions. To obtain risk estimates about each subject (both with and without a risk assessment), we asked 75% of participants to make binary decisions about subjects and 25% to make numerical predictions of risk about each subject (Figure 2). We used the risk predictions elicited from the prediction-making participants to estimate the risks perceived by the decision-making participants. We estimated the perceived risk associated with a given decision as the average risk prediction made about the subject in question, grouping predictions and decisions based on whether the risk assessment was shown. For instance, the perceived risk assigned to a decision about a defendant made without the risk assessment was the average of the risk predictions made about that same defendant without the risk assessment. By eliciting many predictions about each subject, we obtained reliable measures of the average perceived risk about each subject (both with and without the risk assessment) without inappropriately influencing the behaviors of decision-making participants.

To study whether and how a risk assessment alters the decision-making process, we modeled the DMP of participants with and without a risk assessment. We characterized negative decisions as a function of perceived risk and conducted Bayesian mixed-effects logistic regressions to learn this function. Following the decision-making structure in Figure 1, we regressed participant decisions on three factors: the perceived risk about the subject in question, whether the risk assessment was shown, and the interaction between these two factors.
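To fix notation for what follows, the two quantities just described can be written out explicitly. The display below is our reconstruction from the description above, with illustrative symbol names (perceived_risk, shown_ra) rather than the paper's original notation. The perceived risk attached to decisions about a subject is the mean risk prediction elicited about that subject in the matching condition:

\[ \mathit{perceived\_risk}_{s,c} = \frac{1}{|P_{s,c}|} \sum_{p \in P_{s,c}} \hat{r}_{p,s}, \]

where \(\hat{r}_{p,s}\) is prediction-maker \(p\)'s risk prediction about subject \(s\), and \(P_{s,c}\) is the set of prediction-makers who evaluated subject \(s\) in condition \(c\) (risk assessment shown or not shown). The decision model (Equation 1) then plausibly takes the form

\[ \operatorname{logit} P(\mathit{negative}_{d,s} = 1) = \beta_0 + \beta_1\, \mathit{perceived\_risk}_{s,c} + \beta_2\, \mathit{shown\_ra} + \beta_3\, (\mathit{perceived\_risk}_{s,c} \times \mathit{shown\_ra}) + u_d + v_s + w, \]

where \(u_d\), \(v_s\), and \(w\) are the three random effects (for decision-maker, subject, and, on our reading, position in the task sequence). Scenario 3 corresponds to \(\beta_2 = \beta_3 = 0\); Scenario 4 corresponds to a positive \(\beta_2\) (uniform risk-aversion) or a positive \(\beta_3\) (heightened sensitivity to increases in risk).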
Factors such as subject attributes and the risk assessment's prediction are incorporated into this decision function through the perceived-risk term, which is based on these elements. We also included three random effects to account for repeated samples in the data. This regression is structured to infer the DMP that participants followed and to determine whether the risk assessment altered this function, thus distinguishing between Scenario 3 and Scenario 4. If risk assessments present information that improves the RPP but does not influence the DMP (Scenario 3), we would expect to see that showing the risk assessment does not alter this regression. In this case, neither regression factor that includes the risk assessment indicator (shown_ra) would be significant, such that the relationship between decisions and perceived risk is the same whether or not the risk assessment is shown. However, if risk assessments influence the DMP as hypothesized (Scenario 4), we would expect to see that showing the risk assessment alters this regression, making people more attentive to reducing risk when making decisions. This result could emerge through two different mechanisms: 1) the risk assessment makes participants more risk-averse at all levels of risk (in this case, the shown_ra coefficient would be positive), or 2) the risk assessment makes participants more sensitive to increases in risk (in this case, the perceived_risk × shown_ra interaction coefficient would be positive).

After observing that the risk assessments influenced the DMP (and thus generated Scenario 4 as hypothesized), we estimated the impacts of this influence. Our goal was to isolate the effects of the DMP change, controlling for the risk assessments' effects on the RPP. This analysis entailed comparing outcomes from the observed Scenario 4 behaviors with outcomes from the commonly expected Scenario 3 behaviors. Because our control group participants exhibited Scenario 1 and our treatment group participants exhibited Scenario 4, we did not observe Scenario 3 behaviors and could not directly compare Scenario 3 and Scenario 4 outcomes. We therefore estimated the differences between Scenario 3 and Scenario 4 outcomes through simulations. We began by fitting models for the RPP and DMP in the pretrial and loans settings, both with and without the risk assessment's advice. We then ran 1,000 trials simulating the outcomes for more than 4,000 defendants and loan applicants in the four scenarios described in Table 1. See Section C of the Appendix for additional details about our analyses.

We looked first at how the risk assessments affected predictions of risk. We evaluated participant "prediction quality" using a reverse Brier score bounded between 0 (worst possible performance) and 1 (best possible performance). In both settings, presenting the risk assessment reduced estimates of risk, improved prediction accuracy, and aligned the RPP more closely with the risk assessment's calculations. These results are consistent with prior work [28, 29].

In the pretrial setting, the risk assessment reduced perceived risk for 54.0% of defendants. Overall, defendants received an average reduction in perceived risk of 1.6% (from 40.6% to 38.9%, P=.001, d=0.19). While the reduction in perceived risk was significant for white defendants (38.4% to 35.7%, P=.003, d=0.30), Black defendants received a smaller and nonsignificant reduction (41.7% to 40.7%, P=.085, d=0.12).
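As a concrete reading of the prediction-quality metric introduced above, a reverse Brier score is one minus the mean squared error between probabilistic predictions and binary outcomes, so that 1 is the best and 0 the worst possible score. A minimal R illustration of that reading (not the authors' scoring code) follows:

    # Reverse Brier score: 1 = perfect probabilistic predictions, 0 = worst.
    # pred: predicted probabilities in [0, 1]; outcome: observed 0/1 outcomes.
    prediction_quality <- function(pred, outcome) {
      1 - mean((pred - outcome)^2)
    }

    prediction_quality(c(0.3, 0.6, 0.1), c(0, 1, 0))  # returns ~0.913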
Bayesian linear regression (Equation A.1) found that showing the risk assessment altered the risk-prediction process, most notably prompting participants to consider the age of defendants and to reduce the risk associated with violent crime and prior failures to appear (Table A.3). Through these changes, presenting the risk assessment increased the average participant prediction quality from 0.72 to 0.75 (P<.001, d=0.11).

In the loans setting, the risk assessment altered predictions of risk more dramatically. The risk assessment reduced the perceived risk for 92.3% of loan applicants and generated an overall average reduction of 14.2% for each applicant (from 38.5% to 24.3%, P<.001, d=1.54). Bayesian linear regression (Equation A.2) found that showing the risk assessment altered the RPP by significantly reducing participants' baseline risk predictions, increasing the salience of annual income and interest rate, and prompting participants to consider the length of loans (Table A.3). In turn, showing the risk assessment increased participant prediction quality from 0.75 to 0.83 (P<.001, d=0.31).

We next analyzed how the risk assessments affected participant decisions and decision-making processes. We first compared participant decisions with and without a risk assessment. Our goal was to investigate whether the shifts in decisions induced by the risk assessments align with the shifts in predictions induced by the risk assessments. If risk assessments lead to Scenario 3, we would expect to see that shifts in decisions closely track the shifts in predictions described above. In particular, reductions in perceived risk due to the risk assessment would be associated with reductions in negative decisions due to the risk assessment, and vice versa. Our results do not closely follow this pattern, however, indicating that the risk assessment's effects on decisions cannot be explained by shifts in the RPP alone.

In the pretrial setting, the risk assessment reduced pretrial detention rates but increased racial disparities. The risk assessment reduced each defendant's likelihood of pretrial detention by an average of 2.4% (from 44.5% to 42.1%, P<.001, d=0.21). White defendants received a 27% larger average reduction (38.7% to 35.9%, P=.014, d=0.24) than Black defendants (47.7% to 45.5%, P=.007, d=0.20). As a result, the overall racial disparity in pretrial detention increased by 18.8%, from 8.8% to 10.4%. In addition, the risk assessment increased the "accuracy" of decisions from 56.7% to 58.4% (P=.009, h=0.03) and reduced the "false positive rate" from 26.9% to 24.4% (P<.001, h=0.06).

In the loans setting, the risk assessment's effects on negative decisions contrasted with the risk assessment's effects on perceived risk. Although the risk assessment dramatically reduced risk predictions, the risk assessment did not significantly alter each loan applicant's likelihood of rejection (loan rejection rates went from 22.1% to 23.1%, P=.159, d=0.08). Furthermore, although the risk assessment significantly increased the accuracy of risk predictions, the risk assessment reduced the "accuracy" of decisions from 72.2% to 70.5% (P=.002, h=0.04) and increased the "false positive rate" from 17.3% to 18.5% (P=.015, h=0.03). In sum, the risk assessment notably reduced perceived risk yet did not reduce rejection rates, and similarly increased prediction accuracy yet decreased decision "accuracy."
" To further investigate the relationship between predictions and decisions, we then compared how the risk assessments altered the predictions and decisions made about each individual subject. We found that shifts in perceived risk did not translate to equivalent shifts in negative decisions ( Figure 4 ). In both settings, subjects for whom the risk assessment decreased perceived risk did not reliably receive lower negative decision rates due to the risk assessment. Among the 54.0% of pretrial defendants for whom the risk assessment reduced perceived risk, only 59.3% received a reduced likelihood of pretrial detention when the risk assessment was shown. Among the 92.3% of loan applicants for whom the risk assessment reduced perceived risk, only 52.0% received a reduced likelihood of rejection when the risk assessment was shown. Overall, shifts in decisions were relatively insensitive to shifts in predictions, with regression coefficients less than 1 in both settings (0.23 in pretrial, P=.003; 0.42 in loans, P<.001). For instance, a 10% reduction in average perceived risk due to the risk assessment was associated with a 4.4% reduction in the pretrial detention rate and a 2.8% increase in the loan rejection rate. These results depart notably from what we would expect to see if the risk assessments induced a shift to Scenario 3. These patterns demonstrate that reductions in perceived risk do not lead directly to reductions in pretrial detention or loan application rejections. Instead, these changes in perceived risk must be mediated through changes to the DMP before yielding decisions. Process. We next analyzed the risk assessments' effects on the DMP. Bayesian mixed-effects logistic regressions (Equation 1) found that the risk assessment altered the decision-making process in both settings, making participants more attentive to risk when making decisions. These results demonstrate that the risk assessments prompted the hypothesized shift to Scenario 4 rather than Scenario 3. In the pretrial setting, the risk assessment made participants more sensitive to increases in risk ( Figure 5 ). Presenting the risk assessment increased the odds ratio associated with a 10% increase in perceived risk from 1.82 to 2.39 (Table 3 ). These results mean that the risk assessment made perceived risk a stronger determinant of whether defendants were released or detained: the risk assessment reduced pretrial detention rates for defendants with low perceived risk and increased pretrial detention rates for defendants with high perceived risk. For example, the risk assessment reduces the detention likelihood by 6.3% for a defendant with a perceived risk of 30% but increases the detention likelihood by 8.7% for a defendant with a perceived risk of 60% (Table A. In the loans setting, the risk assessment made participants more risk-averse at all levels of risk ( Figure 5 ). Presenting the risk assessment increased the odds of rejecting loan applications by a factor of 2.09 (Table 3) . For all levels of perceived risk up to 46.0% (covering 97.3% of risk estimates with the risk assessment), participants were more than twice as likely to reject loan applications if they were shown the risk assessment (Table A. Table 3 for model coefficients). The risk assessment made participants more sensitive to increases in perceived risk, reducing detention at low risk and increasing detention at high risk. 
(B) Decision functions indicating the likelihood of rejecting a loan application based on the perceived risk of that applicant (see Table 3 for model coefficients). The risk assessment caused rejection rates to increase at all levels of perceived risk. (C) Shift in negative decision (i.e., pretrial detention or loan rejection) probability due to the shift in the DMP caused by showing the risk assessment. Given a perceived risk of 50%, for instance, the DMP shift increased the likelihood of pretrial detention by 4.7% and the likelihood of loan rejection by 21.9%. Bands indicate 95% confidence intervals in all panels. The values behind this figure are summarized in Table A.4. Table 3 . Bayesian mixed-effects logistic regression results estimating the likelihood of a negative decision about defendants and loan applicants as a function of perceived risk, following Equation 1. The first column presents the coefficient of each factor; the second column presents the coefficient of the interaction between that factor and the risk assessment being shown. The second column thus describes how showing the risk assessment altered each factor. Parenthetical terms represent standard errors and terms in brackets represent odds ratios. The intercept represents modeled participant responses at a perceived risk of 0%, with perceived risk measured in units of 10%. In the pretrial setting, presenting the risk assessment reduced the likelihood of detention for 0% risk but increased participants' sensitivity to increases in risk. In the loans setting, presenting the risk assessment increased the odds of rejecting loan applications by a factor of 2.09. These patterns are plotted in Figure 5 . When asked to reflect on their behavior after making decisions, participants did not seem to recognize that the risk assessment had altered how they consider risk when making decisions. Despite becoming more attentive to risk when making decisions, participants presented with a risk assessment expressed less support for basing decisions on risk (Pretrial: P=.003, d=0.21; Loans: P=.001, d=0.23). Furthermore, the risk assessments did not alter participant reports regarding the priority that decision-makers should assign to key considerations such as risk (Table A.5). We used simulations to estimate the impacts of each risk assessment's influence on the DMP. Our goal was to isolate the effects of the DMP shifts by controlling for the concurrent RPP shifts. We accomplished this through simulations that enabled us to compare the observed Scenario 4 outcomes with the commonly expected Scenario 3 outcomes. In the pretrial setting, the risk assessment's influence on the DMP reduced the average detention rate but exacerbated racial disparities ( Figure 6 ). Had the risk assessment affected only the RPP (i.e., created a shift from Scenario 1 to Scenario 3), none of the "accuracy," "false positive rate," nor detention rates for either race would have changed. The shift in the DMP (i.e., from Scenario 3 to Scenario 4) increased decision "accuracy" from 57.7% to 60.4% (P<.001, d=4.43), decreased the "false positive rate" from 27.4% to 24.2% (P<.001, d=6.20), and reduced detention by 4.9% for white defendants and by 3.0% for Black defendants (P<.001, d=1.52). Thus, although the DMP shift improved some outcomes, it also increased the racial disparity by 1.9% and by a factor of 1.34 from 5.6% in Scenario 3 to 7.5% in Scenario 4 (P<.001, d=1.06; Figure 6 ). 
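To make the scenario comparison concrete, the simulation logic described above can be sketched as a loop that pairs each scenario's fitted prediction model (RPP) with a fitted decision model (DMP). The sketch below is schematic: the model objects and helper names (rpp_ra, dmp_control, dmp_ra, simulate_negative_rate) are hypothetical stand-ins, and the authors' actual procedure is described in Section C of the Appendix.

    # Each scenario pairs an RPP (risk-prediction model) with a DMP (decision
    # model); the models are fitted with or without the risk assessment shown.
    simulate_negative_rate <- function(subjects, rpp, dmp, n_trials = 1000) {
      replicate(n_trials, {
        perceived <- predict(rpp, newdata = subjects)        # stage 1: perceived risk
        p_negative <- predict(dmp,
                              newdata = data.frame(perceived_risk = perceived),
                              type = "response")             # stage 2: P(negative decision)
        mean(rbinom(length(p_negative), 1, p_negative))      # simulated decision rate
      })
    }

    # Scenario 3: the risk assessment alters predictions only.
    s3 <- simulate_negative_rate(subjects, rpp_ra, dmp_control)
    # Scenario 4: the risk assessment alters predictions and decisions.
    s4 <- simulate_negative_rate(subjects, rpp_ra, dmp_ra)

Comparing the distributions of s3 and s4 isolates the effect of the DMP shift while holding the RPP shift fixed.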
In the loans setting, the change in the DMP caused by the risk assessment generated a notable decrease in "accuracy" and increase in rejections (Figure 6). Had the risk assessment affected only the RPP and thus prompted a shift from Scenario 1 to Scenario 3, the decision "accuracy" would have increased from 70.8% to 75.6% (P<.001, d=8.82), the "false positive rate" would have decreased from 17.5% to 11.5% (P<.001, d=12.31), and the rejection rate would have dropped from 22.2% to 14.9% (P<.001, d=13.09). The shift in the DMP negated these potential benefits, however, as the risk assessment made participants more risk-averse. Moving from Scenario 3 to Scenario 4 decreased the decision "accuracy" from 75.6% to 70.7% (P<.001, d=8.83), increased the "false positive rate" from 11.5% to 18.1% (P<.001, d=13.26), and increased the rejection rate from 14.9% to 23.2% (P<.001, d=14.88). Overall, instead of simply improving risk predictions and thereby generating a 7.3% increase in loans granted, the risk assessment also increased risk-aversion and thereby actually reduced the loans granted by 1.0% (Figure 6). The shift in the DMP is therefore responsible for an 8.3% increase in loan rejections.

This paper provides the first direct evidence that risk assessments can systematically alter how people balance risk with other factors when making policy-relevant decisions. Even though our risk assessments improved the accuracy of human predictions, they also induced shifts in decision-making processes that counteracted the potential benefits of these improved predictions. Presenting a risk assessment increased participant sensitivity to risk in pretrial detention decisions (thus exacerbating racial disparities) and increased participant risk-aversion in government loan decisions (thus reducing the loans granted). These shifts mean that even when the risk assessments reduced participant predictions of risk about subjects, participants did not accordingly reduce the rate of negative decisions about those subjects. Alternative explanations, such as the risk assessments simply making participants more confident in their risk estimates, can be ruled out by our data (see Section D in the Appendix).

Our results challenge the assumption that improving human predictions with risk assessments will necessarily improve human decision-making, an assumption that has been central to the adoption of algorithmic decision-making aids by governments. These findings demonstrate the potential limits and harms of efforts to improve public policy by incorporating predictive algorithms into multifaceted policy decisions.

If the observed changes were to occur in real-world settings, they would be notable for three primary reasons. First, our findings indicate that government algorithms could generate unexpected shifts in public policy and jurisprudence. Although improving the accuracy of risk predictions is consistent with policies that include risk as a consideration, a systematic increase in the salience of risk amounts to a shift in the normative balancing act that comprises public policy in domains such as pretrial adjudication [44, 51]. Such a shift reduces the range of factors that decision-makers consider, diminishing the implementation of public policy [76].
In pretrial settings, increasing the weight that judges place on risk would generate undue social harms [74] and enhance the constitutionally contested policy of preventative detention (detaining defendants until trial due to their likelihood to commit future crimes) [27, 44]. In loans settings, greater risk-aversion would reduce government aid and would counteract the goal of promoting equity through giving loans to low-income (and hence high-risk) applicants.

Second, because risk is intertwined with legacies of racial discrimination in the criminal justice and financial systems, more heavily basing decisions on risk would likely exacerbate racial disparities in incarceration and government aid. Due to past and present oppression in the United States, Blacks have disproportionately higher risk levels than whites for being arrested and defaulting on loans, making them particularly vulnerable to increased attention to risk [27, 41]. Indeed, we found that the DMP shifts caused by the risk assessments increased the racial disparity in pretrial decisions and reduced government aid in loans decisions.

Third, because these two effects would arise as an unexpected byproduct of integrating an algorithm into decision-making, they would occur without deliberation or oversight. These shifts in policy and jurisprudence (and the resulting racial disparities) would be the consequence of an algorithm's unintended influence on human decision-making rather than a democratic policymaking process. Because these effects are unexpected, they would likely evade scrutiny, at least until their effects manifest in practice with sufficient evidence. Such changes would likely be further obscured by decision-makers not recognizing that the risk assessment had influenced their behavior, as observed both here and in prior work [28, 29]. These effects add another dimension to the unexpected and unaccountable policy distortions that emerge when laws are translated into code [11].

Together, these implications highlight harms that can arise when algorithms are incorporated into multifaceted policy decisions. If evaluations of algorithmic decision-making aids do not account for human-algorithm interactions and the many normative considerations relevant to policy decisions, they are likely to overestimate the benefits and underestimate the harms of incorporating algorithms into government decision-making [31, 65]. There is an urgent need to uncover potential issues in human-algorithm collaborations before algorithms shape life-changing decisions. Risk assessments are increasingly being integrated into high-stakes decisions, yet consistently produce unexpected and unjust impacts in practice [1, 6, 64, 65]. Achieving a more responsible approach to algorithm-in-the-loop decision-making requires several areas of future work.

It is necessary to develop a deeper scientific understanding of how risk assessments and other algorithms influence human decision-making. Although we demonstrated that presenting risk assessments can alter human decision-making processes in harmful ways, many open questions remain. One open question is how the effects of risk assessments vary across contexts. Notably, our risk assessments exerted different effects across the two settings studied, making participants more sensitive to increases in perceived risk in the pretrial setting and more risk-averse in the loans setting. We do not know what caused the observed differences across the two settings.
One hypothesis is that the effects of a risk assessment depend on people's pre-existing notions of risk in that context. For instance, people may be strongly predisposed to consider risk in pretrial decisions, such that the risk assessment merely amplified this behavior, but not in government loans decisions, such that the risk assessment prompted heightened concern about mitigating risk.

An important role for future inquiry will be to study how algorithms alter decision-making processes in different settings and with different decision-makers. Algorithms are being deployed in many social contexts beyond government, such as schools [36], hospitals [40], and newsrooms [10]. Although these settings involve some straightforward prediction problems, in many cases people must integrate predictions with other considerations to make decisions. Determining the proper roles for algorithms in these and other settings thus requires a deeper understanding of how algorithmic predictions influence human decision-making across contexts.

A second open question is whether any mechanisms could mitigate the risk assessments' effects on decision-making processes, such that these algorithms do in fact lead to the widely expected Scenario 3 outcomes. It is possible that other approaches to presenting algorithms and structuring decision-making could improve how people incorporate algorithmic advice into their decisions. In the context of algorithm-aided human predictions, for instance, asking people to make preliminary predictions before being shown a risk assessment's predictions modestly improved accuracy and fairness, whereas providing feedback and explanations did not improve performance [29].

It is also necessary to develop a testing pipeline that evaluates human interactions with algorithmic decision-making aids before these tools are implemented in practice. Decisions to adopt algorithms should require a baseline of evidence suggesting that they are actually likely to improve decision-making. Our results show that a central assumption motivating risk assessments in public policy (that improving human predictions will improve human decisions) can be violated with laypeople. This finding suggests the need to investigate whether this assumption holds in practice. Furthermore, many regulations across the world point to human oversight as providing protections against algorithms, yet these protections rarely function as desired [30]. Rather than relying on untested assumptions, efforts to integrate algorithms into public policy should be grounded in proactive evaluations of proposed human-algorithm collaborations.

Attaining more thorough knowledge about the effects of algorithmic decision-making aids will require a pipeline of evaluations that combines several modes of analysis: experimental studies with laypeople in lab settings, experimental studies with domain experts in lab settings, and ethnographic and empirical studies of expert interactions with algorithms in practice. Each of these modes has particular strengths and weaknesses. Collectively, they can provide robust, proactive knowledge about how algorithms affect human decision-making and how to improve human-algorithm collaborations. Developing this pipeline requires further exploring how public servants collaborate with algorithms in practice and how lab experiments can inform the implementation of algorithmic decision-making aids.
The primary limitation of this paper is that our findings are based on the behaviors of Mechanical Turk workers in a lab experiment rather than judges or loan agents operating in real-world contexts. Our results do not directly reflect how algorithms affect the behaviors of experts making real decisions. There are likely to be significant differences between how laypeople and trained experts make decisions with algorithms, particularly related to perceptions of professional identity and autonomy [6]. Despite these differences, experiments with laypeople can shed light on some behaviors of experts in practice. Research suggests that both judges [33, 57, 58] and financial professionals [12, 23, 34] are susceptible to priming and framing effects (alongside other cognitive biases) in much the same manner as laypeople. Prior studies of how laypeople interact with risk assessments [28, 29] have demonstrated racially biased behaviors similar to those observed among judges using risk assessments in practice [1, 14]. Furthermore, the results of this study align with prior experiments suggesting that risk assessments cause law students and judges to place a greater priority on reducing crime risk [61, 62] and that pretrial risk assessments have increased racial disparities in practice [1, 64].

Lab studies with laypeople therefore present a valuable approach for attaining preliminary insights about human-algorithm collaborations. Initial trials with laypeople can provide a foundation of knowledge about how algorithms influence decision-makers and whether there are mechanisms that can improve these collaborations. Compared to studies with experts, experimental studies with laypeople have several advantages. Most importantly, such experiments allow us to learn about human-algorithm collaborations before an algorithm is implemented in real-world contexts. Furthermore, compared to lab and in situ evaluations with practitioners, experiments with laypeople can be conducted more quickly, with more participants, and with more precisely controlled experimental procedures. Insights from experiments with laypeople can inform the hypotheses and methods for studies with practitioners, which provide more precise knowledge about human-algorithm collaborations in a particular context but are more intensive to run.

A proactive pipeline of evaluations along these lines should become a central component of proposals and policies for how governments use algorithmic decision-making aids. If algorithms such as risk assessments are to be implemented in a given policy context, there must first be rigorous evidence regarding what impacts they are likely to generate and democratic deliberation supporting those impacts.

To create our pretrial risk assessment, we used the dataset "State Court Processing Statistics, 1990-2009: Felony Defendants in Large Urban Counties," which was collected by the U.S. Department of Justice [68]. The dataset contains court processing information about 151,461 felony cases filed in May of even years from 1990 to 2006, as well as in 2009, in 40 of the 75 most populous counties in the United States. The data contains information about each case that includes the arrest charges, the defendant's demographic characteristics and criminal history, and the outcomes of the case related to pretrial release (whether the defendant was released before trial and, if so, whether they were rearrested before trial or failed to appear in court for trial). We first cleaned the dataset.
We removed incomplete entries and restricted our analysis to defendants who were at least 18 years old and whose race was recorded as either Black or white. In order to have ground-truth data about whether a defendant was rearrested before trial or failed to appear for trial, we also restricted our analysis to defendants who were released before trial. This yielded a dataset of 47,141 defendants (Table A.1). The defendants were primarily male (76.7%) and Black (55.7%), with an average age of 30.8 years. Among these defendants, 15.0% were rearrested before trial, 20.3% failed to appear for trial, and 29.8% exhibited at least one of these outcomes (which we defined as "violating" the terms of pretrial release).

We used this data to train a risk assessment (i.e., a machine learning classifier) that predicts whether each defendant will violate pretrial release. We trained the model using gradient boosted decision trees [22] with the xgboost implementation in R [9]. The classifier incorporated five features about each defendant: age, offense type, number of prior arrests, whether that person has any prior failures to appear, and number of prior convictions. Despite knowing the race and gender of defendants, we excluded these attributes from the model to match common practice among risk assessment developers [4].

We performed model selection and evaluated the model using ten-fold cross-validation. We first set aside a random sample of 10% of the data as a held-out validation set. We then took the remaining 90% of the data as the training data. We split this training data into ten folds, using cross-validation to find hyperparameters for the boosted trees model. Cross-validation on the final model yielded an average test AUC of 0.66 (sd=0.009). We then trained the model on the complete training data and applied it to the held-out validation set, yielding an AUC of 0.67. This indicates comparable accuracy to COMPAS [47], the Public Safety Assessment [16], and other risk assessments used in practice [17].

We selected a sample of 300 defendants from the validation set whose profiles would be shown to participants during the Mechanical Turk experiment. To protect defendant privacy, this sample could include only defendants whose seven displayed attributes were shared with at least two other defendants in the complete dataset. This restriction meant that we could not select a uniform random sample of 300 defendants from the validation set. However, we found in practice that sampling from the validation set with weights based on each defendant's risk score yielded a sample population that resembles the complete set of released defendants across most dimensions (Table A.1).

We used a dataset of loans from the peer-to-peer lending company Lending Club to create our loans risk assessment. The data contains records about all 2,004,091 loans that Lending Club issued between 2007 and 2018. Each record includes information such as the purpose of the loan; the loan applicant's job, annual income, and approximate credit score; the loan amount and interest rate; and whether the borrower paid off the loan. The data includes the first three digits of each borrower's zip code but does not include further demographic information (such as the age, race, or gender of applicants). We cleaned the dataset to remove incomplete entries and classified credit scores into one of five categories (Poor, Fair, Good, Very Good, and Exceptional), as defined by FICO [54].
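For concreteness, here is a minimal sketch of the cross-validated training procedure described above for the pretrial model (the loans model is analogous), using the xgboost R package named in this appendix. The column names, hyperparameters, and train/validation objects are illustrative rather than the authors' exact configuration.

    library(xgboost)

    # The five defendant attributes described above (factors coerced to codes).
    features <- c("age", "offense_type", "prior_arrests",
                  "prior_fta", "prior_convictions")
    X <- data.matrix(train_data[, features])
    y <- train_data$violated_release  # 1 = failed to appear or rearrested

    # Ten-fold cross-validation to tune the number of boosting rounds.
    cv <- xgb.cv(data = X, label = y, objective = "binary:logistic",
                 eval_metric = "auc", nrounds = 200, nfold = 10,
                 early_stopping_rounds = 10, verbose = FALSE)

    # Fit the final model on the full training split and score the held-out set.
    model <- xgboost(data = X, label = y, objective = "binary:logistic",
                     nrounds = cv$best_iteration, verbose = FALSE)
    risk_scores <- predict(model, data.matrix(validation_data[, features]))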
We restricted our analysis to loans issued for home improvements, which represent 6.7% of all issued loans. Home improvement is the third most common loan purpose in the dataset, following debt consolidation and paying off credit cards. We also limited the data to loans that have been either fully paid or defaulted on. This yielded a dataset of 45,218 home improvement loans (Table A.2). The average loan was for $14,556.38. The average applicant had an income of $95,262.88 and a credit score of 707.5 (categorized by FICO as "Good"). More than 80% of these loans were fully paid off.

We used this data to train a risk assessment that predicts whether each loan will be defaulted on. We trained the classifier using gradient boosted decision trees [22] with the xgboost implementation in R [9]. Our model considered seven factors: three about each applicant (annual income, credit score category, and whether they own their home) and four about each loan (total value, interest rate, monthly installment, and whether its repayment term is 36 or 60 months). We evaluated the model using ten-fold cross-validation, following the procedure described above for the pretrial risk assessment. Cross-validation on the final model yielded an average test AUC of 0.70 (sd=0.01). Training the classifier on the complete training data (90% of the samples) and applying it to the held-out validation set (the remaining 10% of the data) yielded an AUC of 0.69. This performance is similar to that of other loan default risk assessments [70]. We selected a sample of 300 loan applicants from the validation set whose profiles would be shown to participants during the Mechanical Turk experiment (Table A.2). These applicants were selected through a uniform random sample from the complete validation set.

As we prepared to run our experiment in May 2020, we wanted to ensure that our results would not be the product of aberrant behavior prompted by the COVID-19 pandemic. Before running the full experiment, therefore, we conducted a retest of a trial experiment that we had conducted in December 2019. The December 2019 trial closely resembled the experiment described in the main text. We recruited 240 participants from Mechanical Turk to evaluate a sample of 100 defendants. For the May 2020 trial, we recruited 250 participants to evaluate the same set of 100 defendants. We compared the results of these two trials to determine whether COVID-19 altered the population of Mechanical Turk workers or human interactions with risk assessments. We focused on three results central to our study: the demographics of participants, how participants made risk predictions, and how participants made decisions about whether to release or detain defendants. For all three results, we did not observe any notable differences across the two trials, suggesting that COVID-19 did not have notable impacts on our results.

The demographics of our study participants were similar across the two trials. In both cases, participants were predominantly white (80.5% in 12/2019 vs. 73.4% in 05/2020), male (58.6% vs. 58.0%), and college-educated (73.5% vs. 70.2%). A logistic regression predicting which trial each participant was part of (based on all of the demographic attributes reported during the intro survey) yielded no statistically significant terms. We observed a high degree of consistency between the predictions made across the two trials.
The correlation between the average prediction made about each of the 100 defendants was r(198)=+.94, P<.001. A two-sided t-test yielded no statistically significant difference in participants' prediction quality across the two trials (0.751 vs. 0.753, P=.820). We also estimated the function used by participants to predict the risk of each defendant. We used a mixed-effects linear regression model to measure the average risk prediction about each defendant, grouped by whether the risk assessment was shown and whether the prediction was made in the first or second trial (we refer to this variable as "trial number"). The model included fixed effects for whether the risk assessment was shown, the trial number, the attributes of defendants, and the interactions between these three sets of factors (up to three-way). We also included random effects for participant and defendant identities to account for repeated samples. Our goal was to evaluate whether trial number influenced how participants made predictions. We observed minimal differences in the prediction function across the two trials. Trial number and the interaction between trial number and whether the risk assessment was presented were not statistically significant. Only two of the interactions that included trial number were statistically significant: participants were slightly less responsive to prior failures to appear (P=.025) and prior convictions (P=.039) in the second trial.

Finally, we observed a high degree of consistency between the decisions made across the two trials. The correlation between the average detention rate for each of the 100 defendants was r(198)=+.97, P<.001. We also estimated the function used by participants to decide whether to release or detain each defendant. We used a mixed-effects logistic regression model on all 8,070 decisions made across the two trials. The model included fixed effects for whether the risk assessment was shown, the trial number, and the perceived risk about each defendant, with up to three-way interactions between these factors. We included random effects for participants, defendants, and experiment progress to account for repeated measurements. None of the coefficients that included trial number were statistically significant, indicating that the decision-making function did not notably differ between the December 2019 and May 2020 trials.

In sum, we found high levels of test-retest reliability. The results found in May 2020 (in the early stages of the COVID-19 pandemic) closely resemble the results found in December 2019. This suggests that the results presented in this paper were not notably influenced by aberrant behaviors that arose in response to COVID-19. More broadly, it indicates that our results are reproducible upon repeated experimentation.
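The core of these test-retest comparisons can be expressed in a few lines of base R. This is an illustrative sketch only: the data frames (trial_dec2019, trial_may2020), their columns (defendant_id, prediction), and the participant-level quality vectors are assumed names rather than our actual codebase.

```r
# Illustrative sketch of the test-retest checks; all object and column
# names are assumptions for exposition.

# Average prediction about each defendant within each trial
avg_2019 <- aggregate(prediction ~ defendant_id, data = trial_dec2019, FUN = mean)
avg_2020 <- aggregate(prediction ~ defendant_id, data = trial_may2020, FUN = mean)

# Pearson correlation between the per-defendant averages across trials
merged <- merge(avg_2019, avg_2020, by = "defendant_id")
cor.test(merged$prediction.x, merged$prediction.y)

# Two-sided t-test comparing participant-level prediction quality
t.test(quality_dec2019, quality_may2020)
```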
This section provides further detail on the results provided in Section 5.1. We estimated the risk-prediction process of participants using Bayesian linear regression. We used a Bayesian approach for consistency with the next section, where Bayesian regression enabled analysis based on posteriors. For all results throughout the paper, the inferences made from Bayesian and non-Bayesian regressions were almost identical. We implemented models with the brms package in R [7], which provides a high-level interface to Markov Chain Monte Carlo (MCMC) sampling for Bayesian inference using Stan [8].

In both settings, we regressed the average prediction about each subject (both with and without the risk assessment) on the subject attributes presented to participants and a binary variable reflecting whether the risk assessment was shown. We included interactions between this variable and each subject attribute. To account for repeated samples of subjects, the model also included random effects for subject identity. We initialized models with uninformative priors and implemented sampling using four chains with 1,000 iterations, following 1,000 burn-in iterations on each chain. All coefficients in both models returned R̂ = 1.00, indicating that the chains were well-mixed and converged to a common distribution. We estimated statistical significance from the samples by using the probability of direction measure and obtaining the equivalent frequentist p-value [52, 53]. The results are summarized in Table A.3.

This section provides further detail on the results provided in Section 5.2.2. We estimated the decision-making process using Bayesian mixed-effects logistic regression, implemented in brms [7]. In both settings, we regressed each decision on the perceived risk about the subject in question, whether the risk assessment was shown, and the interaction between these two factors (Equation 1). To account for repeated samples, the model also included random effects for the participant identity, the subject identity, and the progress index marking the participant's progress in the experiment. We initialized models with uninformative priors and implemented sampling using four chains with 1,000 iterations, following 1,000 burn-in iterations on each chain. In both settings, all fixed effect coefficients returned R̂ = 1.00 and all random effect coefficients returned R̂ ≤ 1.01, indicating that the chains were well-mixed and converged to a common distribution. We estimated statistical significance from the samples by using the probability of direction measure and obtaining the equivalent frequentist p-value [52, 53]. The results are summarized in Table 3. The standard deviations for the random effects in the pretrial setting are 1.03 for worker, 0.90 for subject, and 0.07 for experiment progress index. The standard deviations for the random effects in the loans setting are 1.19 for worker, 0.90 for subject, and 0.29 for experiment progress index.

We then studied the characteristics of these fitted decision-making process functions (Figure 5 and Table A.4). We did this using all 4,000 posterior samples of the fixed effect coefficients from the fitted model. First, we used these samples to calculate the fitted negative decision rate at each level of risk from 0% to 100% (in intervals of 0.1%), both with and without the risk assessment. Second, we used these posterior estimates to calculate the shifts in negative decision rates caused by the risk assessment at each level of risk.
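For concreteness, a decision model of this form can be specified in brms roughly as follows. This sketch is illustrative rather than our exact specification: the data frame (decisions) and column names (detain, risk, shown_ra, worker, subject, progress) are assumed, the fixed effects mirror Equation 1, and priors are left at brms defaults here rather than set explicitly via the prior argument.

```r
# Illustrative brms specification of the decision model (assumed names;
# the fixed-effect structure mirrors Equation 1 in the main text).
library(brms)

fit <- brm(
  detain ~ risk * shown_ra +                        # perceived risk, whether the
                                                    # RA was shown, and their
                                                    # interaction
    (1 | worker) + (1 | subject) + (1 | progress),  # random intercepts
  data = decisions,
  family = bernoulli(link = "logit"),
  chains = 4, iter = 2000, warmup = 1000            # 1,000 post-warmup draws
)                                                   # per chain

summary(fit)                  # reports R-hat values for convergence checks
bayestestR::p_direction(fit)  # probability of direction; pd_to_p() converts
                              # it to an equivalent frequentist-style p-value
```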
This section provides further detail on the results provided in Section 5.3. We used simulations to isolate the effects of the changes in the DMP due to the risk assessments. This required simulating outcomes in the four scenarios described in Table 1 and comparing the results of Scenario 3 and Scenario 4. We did this in two stages. First, we used data from the experiment to learn participant prediction and decision functions both with and without the presence of a risk assessment. Second, we applied those functions to a large sample of defendants and loan applicants to simulate the outcomes of the four scenarios from Table 1.

We fit all models using generalized linear regression with a logit link function from the quasibinomial family. We used this quasibinomial approach because the fitted values of these regressions are bounded probabilities (either risk predictions or negative decision rates, both of which range from 0% to 100%). Although linear regression yields very similar results, it does not guarantee that predicted values will be bounded between 0 and 1. Before applying these models to new defendants, we used leave-one-out cross-validation to test the effectiveness of this approach on the data from our experiment. For each model in each setting, we removed one subject at a time, trained the model on the predictions or decisions about the other 299 subjects, and estimated the prediction or decision that would be made about the held-out subject both with and without the risk assessment. We evaluated these models by applying the risk prediction model and then using the output of that model as input to the decision model. The mean absolute error (MAE) of the entire pipeline for negative decision rates is 5.92 percentage points (RMSE=7.46) in the pretrial setting and 7.33 percentage points (RMSE=9.95) in the loans setting. All the models are unbiased estimators, with mean errors close to 0.

We then fit prediction and decision models for both settings on the complete set of 300 subjects for use in our simulations. We applied these models to the held-out validation sets from both settings (not including the 300 subjects sampled from those datasets for inclusion in our experiment). These samples represent approximately 10% of the complete data in each setting. They contain 4,375 defendants and 4,231 loan applicants drawn from the populations described in Tables A.1 and A.2. Our simulations proceeded as follows (see the sketch after this list):

(1) Apply the prediction and decision models to every subject to estimate the negative decision probabilities in the four scenarios from Table 1. The prediction and decision models enable us to simulate outcomes both with and without a risk assessment's advice. Using the outputs of the prediction model as the perceived risk, we applied these models in all four possible combinations of whether the risk assessment affected predictions and decisions. This process yields four estimated negative decision probabilities for each subject: predictions and decisions are both unaffected by the risk assessment (Scenario 1), predictions are unaffected by the risk assessment but decisions are affected by it (Scenario 2), predictions are affected by the risk assessment but decisions are unaffected by it (Scenario 3), and predictions and decisions are both affected by the risk assessment (Scenario 4).

(2) Run 1,000 trials simulating the outcome for each subject in each scenario, based on the negative decision probabilities found in the prior step. Doing this allowed us to estimate the distribution of outcomes for all four scenarios from Table 1.
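The sketch below illustrates this pipeline for one scenario. The object names (pred_data, dec_data, holdout) and columns are assumptions; responses are proportions rescaled to the [0, 1] interval, as the quasibinomial family expects.

```r
# Illustrative sketch of the simulation pipeline (assumed names). Responses
# are proportions in [0, 1]; quasibinomial GLMs keep fitted values bounded.
# pred_data holds each subject's displayed attributes plus the average risk
# prediction; dec_data pairs perceived risk with the negative decision rate.
pred_fit <- glm(avg_prediction ~ .,          # all displayed subject attributes
                data = pred_data, family = quasibinomial(link = "logit"))
dec_fit <- glm(detention_rate ~ perceived_risk,
               data = dec_data, family = quasibinomial(link = "logit"))

# Scenario 4: predictions and decisions both affected by the risk assessment
# (pred_fit and dec_fit fit on the shown-RA data; the other scenarios swap in
# the no-RA counterparts of these models).
risk_hat <- predict(pred_fit, newdata = holdout, type = "response")
p_neg <- predict(dec_fit, newdata = data.frame(perceived_risk = risk_hat),
                 type = "response")

# 1,000 simulated trials of the binary negative decision for each subject
sims <- replicate(1000, rbinom(length(p_neg), size = 1, prob = p_neg))
```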
In this section, we discuss potential alternative explanations for our conclusion that showing the risk assessment altered the DMP and describe why they are inconsistent with our results.

One alternative explanation is that the risk assessment makes people more confident in their risk predictions rather than more concerned about avoiding risk in decision-making. In other words, people may place a greater weight on their risk prediction because they are more certain about this prediction rather than because they are more concerned about risk as a consideration. If this were the case, we would expect to see risk become a more "extreme" distinguishing factor in decisions: low levels of perceived risk lead to lower negative decision rates, while high levels of perceived risk lead to higher rates. Although that is what we observe in the pretrial setting, we observe a very different pattern in the loans setting: rejection rates go up at all levels of risk (Figure 5). The loans setting results are consistent with our explanation that the risk assessment makes people more risk-averse, yet inconsistent with people becoming more confident in their risk predictions. For instance, it is implausible that becoming more confident that a loan applicant has a 0% likelihood of defaulting on the loan would more than double the likelihood of rejecting that loan application (Table A.4). This pattern in the loans setting suggests that the pretrial setting results are also caused by greater attentiveness to risk rather than greater confidence in estimates of risk. Furthermore, even if the pretrial setting does involve greater confidence in risk predictions, the effect would be equivalent to increasing the salience of risk: in both cases, the risk assessment would be causing perceived risk to become a stronger determinant of whether defendants are released or detained.

We can further investigate the role of confidence in decision-making by looking at participant self-reports of confidence. In the exit survey at the end of the experiment, we asked participants how confident they were in their decisions on a Likert scale from 1 (least confident) to 7 (most confident). We found that the risk assessment had no significant effects on participant confidence. In the pretrial setting, the risk assessment did not alter confidence among participants making predictions (P=.978, d=0.00) or decisions (P=.246, d=0.08). Similarly, in the loans setting, the risk assessment did not alter confidence among participants making predictions (P=.580, d=0.07) or decisions (P=.213, d=0.09). Given that the risk assessments did not significantly affect participant self-reports of confidence, it is unlikely that their effects can be attributed to making participants more confident in their estimates of risk.

Another alternative explanation is that perceived risk differs between participants making predictions and participants making decisions. In particular, the risk assessment might exert a stronger influence on participants making predictions than on participants making decisions. Our results directly contradict this explanation, however. Most notable is the loans setting, where the risk assessment reduced predictions of risk without reducing loan rejections. Among the 92.3% of loan applicants for whom the risk assessment reduced perceived risk, almost half received a higher likelihood of rejection when the risk assessment was shown. For this explanation to apply here, it would have to be the case that for almost half of the loan applicants, the risk assessment reduced risk estimates for prediction-makers yet increased risk estimates for decision-makers.
Although it is plausible that the risk assessment's effects on predictions could be attenuated for decision-makers, it is not plausible that prediction-makers and decision-makers would have their risk estimates influenced in opposite directions.

A third alternative explanation is that the risk assessments provide a random shock to decision-making, adding "noise" to decisions in a manner that is not connected to perceived risk. Two results clearly rule out this explanation. First, we observed that the reduction in pretrial detention was statistically significant, indicating that risk assessments can influence decisions in specific directions. Second, in both settings there was a positive and statistically significant relationship between changes in perceived risk and changes in negative decision rates for each subject (Figure 4). These correlations indicate that the risk assessments' effect on decisions is (at least loosely) connected to the risk assessments' effect on perceived risk.

D.4 The Risk Assessment Alters the "Other Factors" Rather than the DMP

Another potential explanation is that the risk assessment alters the calculation of the "other factors" that are incorporated into the DMP (Figure 1) rather than (or in addition to) altering the DMP itself. In the loans setting, for instance, the risk assessment could cause people to reduce their evaluation of the benefits of granting home improvement loans rather than cause people to become more risk-averse. However, there is little reason to believe that receiving an algorithmic risk estimate would prompt a large enough reduction in perceived benefit to fully offset the large observed reductions in perceived risk. Moreover, although this alternative explanation would place the change at a different point in Figure 1, the overall effect would be similar: the risk assessment would be altering decision-making in unexpected ways that can have significant negative impacts.

Table A.4. Modeled probability of negative decisions at a range of perceived risk levels, by setting and risk assessment treatment. The negative decision in the pretrial setting is detaining the defendant; the negative decision in the loans setting is rejecting the loan application. No RA indicates the probability of negative decisions when the risk assessment is not shown, Shown RA indicates the probability of negative decisions when the risk assessment is shown, and Difference indicates the difference between these values (numbers in brackets indicate the effect size of this difference). All differences in both settings are statistically significant with P<.001. These results are plotted in Figure 5.

REFERENCES

Evidence from Kentucky Bail Decisions. The John M. Olin Center for Law, Economics, and Business Fellows
ABA Standards for Criminal Justice
What is Human-in-the-Loop Machine Learning?
Public Safety Assessment FAQs
Evaluation of Broward County Jail Population: Current Trends and Recommended Options
Technologies of Crime Prediction: The Reception of Algorithms in Policing and Criminal Courts
Advanced Bayesian Multilevel Modeling with the R Package brms
Stan: A Probabilistic Programming Language
Yifeng Geng, and Yutian Li. 2020. xgboost: Extreme Gradient Boosting
Metrics at Work: Journalism and the Contested Meaning of Algorithms
Technological Due Process
Evidence for Countercyclical Risk Aversion: An Experiment with Financial Professionals
Generalizing from Survey Experiments Conducted on Mechanical Turk: A Replication Approach
A Case for Humans-in-the-Loop: Decisions in the Presence of Erroneous Algorithmic Scores
Public safety assessment: Predictive utility and differential prediction by race in Kentucky
Risk Assessment Instruments Validated and Implemented in Correctional Settings in the United States. Council of State Governments Justice Center
Algorithm Aversion: People Erroneously Avoid Algorithms After Seeing Them Err
Priming Risk: The Accessibility of Uncertainty in Public Policy Decision Making
Choice preferences without inferences: subconscious priming of risk attitudes
Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor
Greedy Function Approximation: A Gradient Boosting Machine
Priming the Risk Attitudes of Professionals in Financial Decision Making
Judgmental Forecasts of Time Series Affected by Special Events: Does Providing a Statistical Forecast Improve Accuracy?
'Fair' Risk Assessments: A Precarious Approach for Criminal Justice Reform
The Smart Enough City: Putting Technology in Its Place to Reclaim Our Urban Future
The False Promise of Risk Assessments: Epistemic Reform and the Limits of Fairness
Disparate Interactions: An Algorithm-in-the-Loop Analysis of Fairness in Risk Assessments
The Principles and Limits of Algorithm-in-the-Loop Decision Making
The False Comfort of Human Oversight as an Antidote to A.I. Harm. Slate
Algorithmic Realism: Expanding the Boundaries of Algorithmic Thought
Human Decision Making with Machine Assistance: An Experiment on Bailing and Jailing
Inside the Judicial Mind
Do Professional Traders Exhibit Myopic Loss Aversion? An Experimental Analysis
Pretrial Integrity and Safety Act of 2017
Designing for Complementarity: Teacher and Student Needs for Orchestration Support in AI-Enhanced Classrooms
HB 463 - Statement from the Sponsors. Criminal Law Reform: The First Year of HB 463
Combining Human and Machine Intelligence in Large-scale Crowdsourcing
Impact of a deep learning assistant on the histopathologic classification of liver cancer
The Moral Limits of Predictive Practices: The Case of Credit-Based Insurance Scores
Human Decisions and Machine Predictions
Prediction Policy Problems
Danger Ahead: Risk Assessment and the Future of Bail Reform
Crowdsourcing Performance Evaluations of User Interfaces
On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection
How We Analyzed the COMPAS Recidivism Algorithm
Judgemental Adjustment of Initial Forecasts: Its Effectiveness and Biases
Street-Level Bureaucracy: Dilemmas of the Individual in Public Services (30th Anniversary Expanded Edition)
Algorithm appreciation: People prefer algorithmic to human judgment
Pretrial Services Programs: Responsibilities and Potential. National Institute of Justice: Issues and Practices in Criminal Justice
Indices of Effect Existence and Significance in the Bayesian Framework
bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework
Understanding FICO Scores
One Year Criminal Justice Reform Report to the Governor and the Legislature
Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning
Does Unconscious Racial Bias Affect Trial Judges?
Gains, Losses, and Judges: Framing and the Judiciary
The Effect of Framing Actuarial Risk Probabilities on Involuntary Civil Commitment Decisions
Sheriff's Justice Institute
Impact of Risk Assessment on Judges' Fairness in Sentencing Relatively Poor Defendants
Evidence-Based Sentencing and the Scientific Rationalization of Discrimination
Juvenile detention risk assessment: A practice guide to juvenile detention reform. The Annie E. Casey Foundation
Assessing Risk Assessment in Action
Algorithmic Risk Assessment in the Hands of Humans
The Use of Pretrial
The Framing of Decisions and the Psychology of Choice
State Court Processing Statistics, 1990-2009: Felony Defendants in Large Urban Counties
Single Family Housing Repair Loans & Grants
Actionable Recourse in Linear Classification
Crowd-Assisted Machine Learning: Current Issues and Future Directions
"Not in it for Justice": How California's Pretrial Detention and Bail System Unfairly Punishes Poor People
Toward an Optimal Bail System
Making sense of recommendations
When the State Meets the Street: Public Service and Moral Agency

ACKNOWLEDGMENTS

We thank the area chairs and reviewers for thoughtful feedback regarding how to improve the manuscript. We also thank Alan Altshuler, Evan Green, Ben Lempert, and Salomé Viljoen for their helpful comments on earlier drafts of this manuscript and Steve Worthington for consultation on statistical methodology. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. This work was also supported by the Michigan Society of Fellows.