key: cord-0184695-dqpg7d9q authors: Schemmer, Max; Hemmer, Patrick; Kuhl, Niklas; Benz, Carina; Satzger, Gerhard title: Should I Follow AI-based Advice? Measuring Appropriate Reliance in Human-AI Decision-Making date: 2022-04-14 journal: nan DOI: nan sha: 33621ee7ee5fa167e0b7d58b359bd559496ec8cb doc_id: 184695 cord_uid: dqpg7d9q

Many important decisions in daily life are made with the help of advisors, e.g., decisions about medical treatments or financial investments. Whereas in the past, advice was often received from human experts, friends, or family, advisors based on artificial intelligence (AI) have become more and more present nowadays. Typically, the advice generated by AI is judged by a human and either relied upon or rejected. However, recent work has shown that AI advice is not always beneficial, as humans have been shown to be unable to ignore incorrect AI advice, essentially over-relying on the AI. Therefore, the goal should be to enable humans not to rely on AI advice blindly but rather to distinguish its quality and act upon it to make better decisions. Specifically, this means that humans should rely on the AI in the presence of correct advice and self-rely when confronted with incorrect advice, i.e., establish appropriate reliance (AR) on AI advice on a case-by-case basis. Current research lacks a metric for AR. This prevents a rigorous evaluation of factors impacting AR and hinders further development of human-AI decision-making. Therefore, based on the literature, we derive a measurement concept of AR. We propose to view AR as a two-dimensional construct that measures the ability to discriminate advice quality and behave accordingly. In this article, we derive the measurement concept, illustrate its application, and outline potential future research.

For many important decisions in life, we seek the opinion of advisors. While in the past, advice was typically obtained from human experts, nowadays, advisors based on artificial intelligence (AI) are becoming more and more present in research and practice [11]. For example, AI now advises medical professionals with regard to breast cancer screening [19] or the detection of COVID-19 pneumonia [5]. Past research has often focused on maximizing the utilization of advice [13, 27, 30], i.e., increasing the amount of accepted advice and, thus, compliance. Even though this might be valid for human advice seeking, a different paradigm might be more suitable in the scenario of human-AI decision-making due to a change in objectives. Research on human advice is often based on the perspective of the advisor, who is most likely interested in high advice utilization [30]. For example, when considering a bank that offers advice for investment decisions, the investments will have different potential outcomes for the bank itself. Therefore, the bank is interested in the client following its advice, as it has "stakes" in the decision. However, the critical difference in the human-AI decision-making setting is that the advice seekers can design and develop the advisor based on personal goals. If those goals are to maximize decision-making performance, blindly following AI advice does not necessarily lead to the best possible outcome [1]. For this reason, researchers argue that in order to benefit the most from AI advice, humans need to be able to appropriately rely on it [1, 2, 38].
We define appropriate reliance (AR) on AI advice as the human's ability to differentiate between correct and incorrect AI advice and to act upon that discrimination. As AI becomes more important in our daily lives, both professionally and privately, humans need to be able to discriminate between correct and incorrect advice. If their discrimination capabilities are sufficient, they could even benefit from an AI that, on average, performs worse than they do. However, the ability to discriminate is not the only factor of appropriate reliance: humans need not only to discriminate but also to adapt their decisions accordingly. For example, in theory, they could be capable of detecting errors but not dare to contradict the AI advice due to a disproportionately high level of trust in the AI or its perceived authority or skill. To enable humans to rely on AI advice appropriately, we need a precise understanding of AR and a coherent way to measure it. Currently, there is no specific measurement for AR with regard to AI advice, and researchers are using many different measurement concepts. Therefore, we derive a measurement concept for AR in the context of AI advice based on the literature on AR in automation [16, 31] and in organizational psychology [30]. Subsequently, we illustrate our measurement concept through a behavioral experiment. Lastly, we discuss possible avenues for future research. The remainder of this article is structured as follows: In Section 2, we first outline related work on AR in the context of human-AI decision-making. In Section 3, we propose a measurement concept capturing AR, followed by an illustration drawn from a user study in Section 4. In Section 5, we discuss AR and provide ideas for future work. Section 6 concludes our work.

Historically, many researchers have worked on AR with regard to automation [16] and robotics [31]. In the following, we provide an overview of the most common definitions. Foundational work on AR in the context of automation has been laid by Lee and See [16]. The authors outline the relationship between "appropriate trust" and AR. However, they do not define AR explicitly but provide examples of inappropriate reliance, such as "misuse and disuse are two examples of inappropriate reliance on automation that can compromise safety and profitability" [16, p. 50]. Other researchers go one step further and define inappropriate reliance as under- or over-reliance [18, 32, 37]. Wang et al. [33] define appropriate reliance in terms of the impact of reliance on performance. For example, they discuss a situation in which the automation reaches a reliability of 99% while human performance is at 50%. In their view, it would be appropriate to always rely on the automation, as this would increase performance. Talone [31] follows the work by Wang et al. [33] and defines AR as "the pattern of reliance behavior(s) that is most likely to result in the best human-automation team performance" [31, p. 13]. Both see appropriate reliance as a function of team performance. Recent work in human-AI decision-making has started to discuss AR in the context of AI advice. Lai et al. [14] give an overview of empirical studies that analyze AI advice with respect to AR. For example, Chandrasekaran et al. [4] analyze whether humans can learn to predict a model's behavior. This ability is associated with an improved ability to rely on the model's predictions in the right cases. Moreover, Gonzalez et al.
[8] evaluate the impact of explainable AI (XAI) on the discrimination of incorrect and correct AI advice. Similarly, Poursabzi-Sangdeh et al. [23, p. 1] point out the idea of AR in the form of "making people more closely follow a model's predictions when it is beneficial for them to do so or enabling them to detect when a model has made a mistake". However, the authors do not explicitly relate this idea to the concept of AR. In this context, additional work uses the term "appropriate trust" with a similar interpretation, namely as the behavior captured by "the fraction of tasks where participants used the model's prediction when the model was correct and did not use the model's prediction when the model was wrong" [34, p. 323]. Finally, Yang et al. [35, p. 190] state that "appropriate trust is to [not] follow an [in]correct recommendation". All these articles have in common that they consider AR or appropriate trust on a case-by-case basis. Similar to our work, Buçinca et al. [2] analyze whether cognitive forcing functions can reduce over-reliance on AI advice, which is measured as the percentage of agreement with the AI when the AI makes incorrect predictions. Bussone et al. [3] assess how explanations impact trust and reliance on clinical decision support systems. The authors partition reliance into over- and self-reliance as part of their study. However, they use a qualitative approach to answer their research questions. To summarize, previous research does not provide a unified measurement concept that allows measuring AR on AI advice.

Although several studies have examined human-AI interaction with regard to reliance, an agreed-upon definition of AR is still missing. We therefore initiate our research by deriving a definition of AR. To provide an accurate definition, we first analyze the two terms "appropriate" and "reliance" individually.

Reliance. Reliance itself is defined as a behavior [7, 16]. This means it is neither a feeling nor an attitude but the actual action conducted. Defining reliance as behavior also clarifies the role of trust, which is defined as "the attitude that an agent will help achieve an individual's goals in a situation characterized by uncertainty and vulnerability" [16, p. 51]. In general, research has shown that trust increases reliance, but reliance can also take place without trust being present [16]. For example, we might not trust a banking advisor but consciously decide that following the advice is still the best possible decision. The final reliance decision goes beyond trust and is also influenced by other attitudes such as perceived risk or self-confidence [25].

Appropriateness. After establishing a common understanding of reliance, we proceed by defining "appropriateness". The appropriateness of reliance stems from the fact that current AI is imperfect, i.e., it may provide erroneous advice. This erroneous advice can be divided into systematic errors and random errors [31]. While humans can identify systematic errors, random errors have no identifiable patterns and cannot be distinguished. These two error types allow us to differentiate between two cases of AR. If all errors are random and cannot be detected, then humans should always rely on the AI if, on average, the AI performs better, and never rely on it if the AI performs worse on average [31]. However, suppose there are some systematic errors.
In that case, depending on their discrimination capabilities, humans might be able to differentiate between correct and incorrect advice, which may even result in performance superior to that of the AI or the human conducting the task alone [9]. This shifts the problem from an overall judgment of the AI to a case-by-case discrimination: in the presence of systematic errors, humans should evaluate each case individually. Since the solution approach in the presence of only random errors is relatively simple, as pointed out above, we focus in this article on the more complicated setting in which a significant proportion of task instances exhibit systematic errors.

For AR in the presence of systematic errors, we see two main aspects. First, humans need to be able to differentiate between correct and incorrect advice. Second, owing to the behavioral nature of reliance, humans need to act upon this discrimination. For instance, a human decision-maker might be able to differentiate between correct and incorrect AI advice but trust the AI too much to reject its advice. Thus, AR consists of the capability to discriminate and the execution of the consequent behavior.

Definition 1. Appropriate reliance on AI advice is a) the human capability to differentiate between correct and incorrect AI advice and b) to act upon that discrimination.

Now that we have defined AR for our study, we derive a corresponding measurement. Current metrics in the literature do not capture the ability to discriminate AI advice together with the final decision made by the human. For example, the weight on advice (WOA) metric measures advice utilization [30]. This means the metric does not differentiate between correct and incorrect advice but instead measures the share of advice taken. Furthermore, performance metrics blur the effect of AR. For example, if an AI has higher performance on a task than a human, blindly relying on the AI without differentiating between correct and incorrect advice might increase team performance. However, it will not result in the desired outcome that team performance exceeds that of humans or AI conducting the task alone. For this reason, one cannot infer AR from a performance increase. This results in the need for a measurement concept that reflects how well humans are able to discriminate between correct and incorrect advice.

Following the judge-advisor literature [30], we propose to study AR in a sequential human-AI decision-making setup with two steps of human decision-making. Table 1 gives an overview of the different combinations based on a classification task. Note that, for simplicity, we refer to classification problems; however, the measurement concept can be extended to regression problems as well. We consider a sequential decision process that can be described as follows: First, the human makes a decision and then receives AI advice. Second, the human is asked to update the initial decision, i.e., to either adopt or override the AI advice. This allows measuring AR in a fine-grained way. For example, the initial human decision can either be correct or incorrect in the classification setting. The follow-up AI advice can then either confirm the human's initial decision or contradict it. We call the two contradictory cases positive and negative AI advice. For any further analysis, we propose to focus on the two contradictory cases, as the confirmation cases do not allow us to measure the human's discrimination ability and would blur it.
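To make this case taxonomy concrete, the following minimal sketch (our own illustration; the function and label names are hypothetical and not taken from the article) labels a single task instance by the relation between the human's initial decision and the AI advice. Only the two contradiction cases enter the reliance analysis described below.

```python
def advice_case(initial_label: str, ai_label: str, true_label: str) -> str:
    """Classify one task instance by how the AI advice relates to the
    human's initial decision (labels are class names, e.g. 'genuine'/'deceptive')."""
    if initial_label == ai_label:
        # The AI confirms the initial decision: no discrimination is observable,
        # so these cases are excluded from the reliance analysis.
        return "confirmation"
    if ai_label == true_label:
        return "positive AI advice"     # human initially wrong, AI advice correct
    if initial_label == true_label:
        return "negative AI advice"     # human initially right, AI advice incorrect
    return "contradiction, both wrong"  # cannot occur in a binary task


print(advice_case("genuine", "deceptive", "deceptive"))  # -> positive AI advice
```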
In general, if we do not consider the initial human decision, information about the human's discrimination ability, including the consequent action, gets lost: it is not traceable how the human would have decided without the AI advice. Yet it is precisely this interaction that needs to be documented to research AR holistically. To give an example, imagine a case where the AI gives correct advice 10 times and incorrect advice 10 times, and the human is initially correct 10 times. The question is how these 10 correct initial decisions are distributed over the AI advice. If, for example, 8 correct initial decisions are followed by correct AI advice, this leads to a high accuracy after receiving correct AI advice. However, it is then not distinguishable whether this accuracy results from a good discrimination ability or from other factors, e.g., uncertainty in one's own decision that is simply confirmed by the AI prediction. By considering the initial human decision and focusing on the contradiction cases, the influence of confounding factors that can lead to misinterpretations can be minimized.

After receiving AI advice, humans need to decide whether to rely on the AI or to self-rely. We can thus distinguish four types of reliance. First, positive AI-reliance describes the case in which the human is initially incorrect, receives correct advice, and relies on that advice. Second, the case in which the human relies on the initial incorrect decision and neglects positive AI advice is denoted as negative self-reliance. Third, if the human is initially correct and receives incorrect advice, this can either result in positive self-reliance, i.e., neglecting the incorrect AI advice, or in relying on it, which is denoted as negative AI-reliance. Whether AI-reliance or self-reliance has a positive effect thus depends on the quality of the AI advice. Therefore, we propose to measure AR on two dimensions. On the first dimension, we calculate the ratio of cases in which the human relies on correct AI advice under the condition that the initial decision was not correct, i.e., in which the human rightfully changes their mind to follow the AI advice. On the second dimension, we measure the relative amount of positive self-reliance in the presence of negative advice. We refer to these two measures as the relative positive AI reliance (RAIR) and the relative positive self-reliance (RSR):

RAIR = positive AI reliance / (positive AI reliance + negative self-reliance)
RSR = positive self-reliance / (positive self-reliance + negative AI reliance)

Figure 1 depicts the resulting space: on the x-axis, we plot the relative positive AI-reliance (RAIR), and on the y-axis, the relative positive self-reliance (RSR). The figure highlights the properties of the measurement concept. Both dimensions range between 0 and 1. As a baseline, we can consider a random decision. The random baseline allows us to detect under- and over-reliance in a static case. If RAIR or RSR is below this threshold, we observe under- or over-reliance, respectively. More specifically, an RSR below the threshold means that a human performs worse than chance in detecting incorrect AI advice, essentially representing over-reliance on AI. Similarly, an RAIR below the random threshold means that a human identifies correct AI advice worse than a random guess, essentially representing under-reliance on AI. We depict the decision-making threshold in Figure 1 to illustrate this reasoning. Furthermore, we can use the space to analyze the effect of experimental treatments. For example, a treatment that increases RAIR but decreases RSR actually just increases over-reliance on AI. Similarly, a condition that increases RSR while decreasing RAIR points towards under-reliance.
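As a minimal sketch of how the two ratios could be computed from logged experiment data (the record format below is our own assumption, not prescribed by the article):

```python
import math


def rair_rsr(records):
    """Compute RAIR and RSR from logged decisions.

    Each record is a dict with keys 'initial', 'ai', 'final', and 'truth',
    holding the class labels of the initial human decision, the AI advice,
    the final human decision, and the ground truth, respectively.
    """
    pos_total = pos_ai_reliance = 0    # cases with positive AI advice
    neg_total = pos_self_reliance = 0  # cases with negative AI advice
    for r in records:
        if r["initial"] == r["ai"]:
            continue  # confirmation case: excluded, no discrimination observable
        if r["ai"] == r["truth"]:            # positive AI advice
            pos_total += 1
            pos_ai_reliance += r["final"] == r["ai"]
        elif r["initial"] == r["truth"]:     # negative AI advice
            neg_total += 1
            pos_self_reliance += r["final"] == r["initial"]
    rair = pos_ai_reliance / pos_total if pos_total else math.nan
    rsr = pos_self_reliance / neg_total if neg_total else math.nan
    return rair, rsr
```

For a binary task, a participant who picks the final label at random is expected to land at RAIR = RSR = 0.5, which corresponds to the random baseline discussed above.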
We refer to the theoretical goal of an RAIR and an RSR of 1 as optimal AR. Most likely, this theoretical goal will not be reached in any practical context, as humans will not always be able to perfectly discriminate on a case-by-case basis whether they should rely on AI advice. Furthermore, random errors will reduce AR as they cannot be discriminated. Therefore, optimal AR will most likely remain a theoretical goal. Lastly, the area in which both RAIR and RSR are larger than the random threshold encompasses the final decisions that did not occur by chance. Therefore, AR comprises every combination of RAIR and RSR above the random threshold; in Figure 1, this corresponds to the top right quadrant. This means AR is not binary but a tuple of RAIR and RSR above the random threshold, with a theoretical optimum of 1. To illustrate our measurement concept, we describe the results of an experimental study in the following Section 4.

To illustrate the proposed measurement concept, we conducted a user study. The study's goal is to highlight how our measurement concept can be used to evaluate human-AI decision-making experiments with regard to AR. For illustration, we focus on the explainability of AI advice as a design decision. XAI is intensively discussed in research with regard to its impact on human-AI decision-making in general and on AR in particular [1, 2, 14, 29]. To discriminate advice, humans need information that can approximate the quality of the advice, e.g., the uncertainty of the advisor or explanations [30]. The emerging research stream of XAI [10] aims to equip human users with such insight into AI advice. Explanations might help the human decision-maker better judge the quality of an AI-based decision. An analogy is the interaction between a consultant, who provides advice, and their client. To assess the quality of the advice, the client will ask the consultant to describe the reasoning behind it. Based on the explanations, the decision-maker should be able to determine whether the advice can be relied upon or not. The same logic should hold for an AI advisor. On the other hand, experimental studies indicate that explanations in human-AI decision-making can lead to over-reliance [1, 38]. Research shows that explanations are sometimes interpreted more as a general sign of competence [2] and have a persuasive character [3]. This is supported by the psychology literature, which has shown that human explanations cause people to agree even when the explanation is wrong [12]. The same effect could occur when the explanations are generated by an AI. Therefore, there exists a trade-off between enabling the human to discriminate the AI's advice and the tendency that the mere existence of an explanation could increase reliance on the AI [1]. In this illustrative study, we use this often-discussed ambiguity to highlight the advantages of our measurement concept.

As the experimental task, we chose the classification of deceptive hotel reviews: humans have to judge whether a given hotel review is deceptive or genuine. Ott et al. [20, 21] provide the research community with a data set of 400 deceptive and 400 genuine hotel reviews. The deceptive ones were created by crowd-workers, resulting in corresponding ground truth labels. The implemented AI is based on a Support Vector Machine with an accuracy of 86%, which is similar to the performance reported in related literature [15].
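The article does not spell out the implementation of the classifier; the sketch below is one plausible reconstruction under our own assumptions (TF-IDF features, a linear SVM, and `reviews`/`labels` variables standing in for the 800 review texts and their labels), not the authors' exact setup. Its last lines anticipate the LIME-based word highlighting used in the XAI treatment described next.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from lime.lime_text import LimeTextExplainer

# reviews: list of 800 hotel review texts; labels: 'deceptive'/'genuine' per review
# (loading the Ott et al. corpus is omitted here).

# Linear SVM on TF-IDF features; the calibration wrapper adds predict_proba,
# which LIME needs in order to probe the model around an instance.
svm = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    CalibratedClassifierCV(LinearSVC(C=1.0)),
)
print(cross_val_score(svm, reviews, labels, cv=5, scoring="accuracy").mean())

svm.fit(reviews, labels)

# Word-level feature importances for a single review, as displayed in the XAI treatment.
explainer = LimeTextExplainer(class_names=list(svm.classes_))
explanation = explainer.explain_instance(reviews[0], svm.predict_proba, num_features=10)
print(explanation.as_list())  # [(word, signed weight), ...]
```

The sign of each weight gives the direction of a word's effect; binning the magnitudes would yield effect-size categories like the three used in the study's interface.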
For the XAI condition, we use a state-of-the-art explanation technique, namely LIME feature importance explanations [24]. Feature importance aims to explain the influence of an independent variable on the AI's decision in the form of a numerical value. Since we deal with text data, a common technique to display these values is to highlight the respective words according to their computed influence on the AI's decision [15]. We additionally provide information on the direction of the effect and differentiate the values into three effect sizes, following the implementation of Lai et al. [15] (see step 2 in Figure 2).

Fig. 2. Online experiment graphical user interface for the XAI treatment. The ground truth of the exemplary hotel review shown is "fake". The design of the interface is adapted from Lai et al. [15].

For the AR measurement concept, a sequential task setup is essential. In our study, this means the human first receives a review without any AI advice, i.e., just the plain text, and classifies whether the review is deceptive or genuine (see step 1 in Figure 2). Following that, the human receives either a simple AI advice statement, e.g., "the AI predicts that the review is fake", or the AI advice together with additional explanations (see step 2 in Figure 2). This sequential two-step decision-making allows us to measure AR. The participants were recruited via the platform Prolific.co. In total, we conducted the experiment with 200 participants. In each treatment, participants were provided with 16 reviews, 8 with correct and 8 with incorrect AI advice.

We depict the results of the experiment in Figure 3. In the AI condition, they show a high RSR of 0.72 (±0.03) and a relatively low RAIR of 0.3 (±0.03). This indicates that humans in this setting were able to identify wrong AI advice and self-rely to a high degree. The RAIR of 0.3 shows that we can observe under-reliance on AI, as it lies below the random-guess threshold of our binary classification task. Our experiment thereby highlights the general tendency of humans to ignore AI advice, which is usually discussed in the literature as algorithm aversion [6]. In the XAI condition, we can observe a significant increase in RAIR, whereas RSR remains largely unaffected.

In this paper, we reviewed AR and proposed a concept for its measurement in human-AI decision-making. Subsequently, we illustrated this concept within the scope of a user study to highlight its capabilities. With our approach, AR can be measured at a more fine-grained level. Thus, it can be leveraged in future experimental studies to address questions on how to design for AR. In this context, research needs to investigate the human capability to discriminate between incorrect and correct AI advice and evaluate possible impact factors. For example, the potential to discriminate might differ between positive and negative AI advice, i.e., between RAIR and RSR. One would assume that it is easier to discriminate negative advice than positive advice, as in the negative-advice case the human is, by definition, initially able to solve the task. In contrast, it might be challenging to discriminate positive AI advice after having failed to solve the task alone. Our illustrative experiment also showed these differences. The experiment further highlighted that while XAI might improve RAIR, it does not seem to influence RSR. Therefore, research needs to investigate these differences in detail and find proper ways to address them. Furthermore, researchers need to investigate factors beyond the discrimination capability that influence the behavioral part of AR.
For instance, research models could incorporate attitudes as well as human biases that might influence AR. Among others, important constructs that have been evaluated in the judge-advisor literature with regard to advice utilization are human confidence [17], perception [26], and trust [30]. Other research has shown the influence of human biases on AR. For example, so-called egocentric discounting, i.e., humans systematically overweighting their own decisions, increases under-reliance on AI [36]. Some of these attitudes, such as engagement with AI advice [10], might enable enhanced discrimination. Others could potentially harm AR. For example, maximizing trust in AI could lead to "blind trust" and, consequently, to a situation in which humans accept all advice [1]. Similar phenomena could occur in terms of cognitive constraints. Research in automation has shown that humans tend to follow the path of least cognitive effort [28]. Simply accepting AI advice could therefore be the preferred human decision. Research needs to investigate these factors in future work.

Lastly, we want to emphasize several limitations of the proposed measurement concept. First, the concept is limited to classification tasks but will be extended in future work; initial approaches can be found in the work of Petropoulos et al. [22]. Furthermore, the sequential task setup necessary for our measurement concept has some disadvantages, as it changes the task itself. Since the human first conducts the task alone before receiving AI advice, they are already mentally prepared and might react differently than if they had received the AI advice directly. Moreover, sequentially conducted tasks with AI advice might not always be possible or desired in real-world cases. Therefore, the measurement should be seen as an approximation of real human behavior. Instead of a sequential task setup, one alternative could be to simulate a human model based on a data set of task instances solved by humans without AI advice. This simulation model could approximate the initial human decision within a non-sequential task setting. However, this approach, too, is only an approximation of real human behavior. Future work should compare both approaches.

Many researchers have highlighted the need for humans to rely on AI advice appropriately, i.e., to be able to discriminate AI advice quality and act upon it for the best possible human-AI decision-making [1, 2, 18, 38]. However, current research is missing a measurement concept for AR that allows the evaluation of human-AI decision-making experiments. Therefore, in this article, we develop a new measurement concept for quantifying AR in the human-AI decision-making context. Specifically, we propose to view AR as a two-dimensional construct that measures the capability to discriminate the quality of advice and to behave accordingly. The first dimension considers the relative positive effect of relying on AI advice, whereas the second dimension assesses the relative positive self-reliance in the presence of incorrect AI advice. Subsequently, we illustrate our measurement concept and provide an outlook on future research. Our research provides a basis for future studies to evaluate the impact factors of AR and to develop design possibilities to improve human-AI decision-making.
References

[1] Does the whole exceed its parts? The effect of AI explanations on complementary team performance.
[2] To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making.
[3] The role of explanations on trust and reliance in clinical decision support systems.
[4] Do explanations make VQA models more predictable to a human?
[5] Can AI help in screening viral and COVID-19 pneumonia?
[6] Algorithm aversion: People erroneously avoid algorithms after seeing them err.
[7] The role of trust in automation reliance.
[8] Human evaluation of spoken vs. visual explanations for open-domain QA.
[9] Michael Vössing and Niklas Kühl. 2021. Human-AI Complementarity in Hybrid Intelligence Systems: A Structured Literature Review. PACIS 2021 Proceedings.
[10] Metrics for explainable AI: Challenges and prospects.
[11] Robo-advisory.
[12] Explanation, imagination, and confidence in judgment.
[13] Do you comply with AI? Personalized explanations of learning algorithms and their impact on employees' compliance behavior.
[14] Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies.
[15] Towards Building Model-Driven Tutorials for Humans.
[16] Trust in automation: Designing for appropriate reliance.
[17] What is AI literacy? Competencies and design considerations.
[18] Human Reliance on Machine Learning Models When Performance Feedback is Limited: Heuristics and Risks.
[19] International evaluation of an AI system for breast cancer screening.
[20] Negative deceptive opinion spam.
[21] Finding deceptive opinion spam by any stretch of the imagination.
[22] Do 'big losses' in judgmental adjustments to statistical forecasts affect experts' behaviour?
[23] Manipulating and measuring model interpretability.
[24] Model-agnostic interpretability of machine learning.
[25] Operator reliance on automation: Theory and data.
[26] Perceptions of Fairness and Trustworthiness Based on Explanations in Human vs. Automated Decision-Making (2022).
[27] Effects of distance between initial estimates and advice on advice utilization.
[28] Does automation bias decision-making?
[29] No explainability without accountability: An empirical study of explanations and feedback in interactive ML.
[30] Trust, confidence, and expertise in a judge-advisor system.
[31] The effect of reliability information and risk on appropriate reliance in an autonomous robot teammate.
[32] Aiding human reliance decision making using computational models of trust.
[33] Selecting methods for the analysis of reliance on automation.
[34] Are explanations helpful? A comparative study of the effects of explanations in AI-assisted decision-making.
[35] How do visual explanations foster end users' appropriate trust in machine learning?
[36] Advice taking in decision making: Egocentric discounting and reputation formation. Organizational Behavior and Human Decision Processes.
[37] Effect of descriptive information and experience on automation reliance.
[38] Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making.