key: cord-0462479-loxuodbe
authors: Zhang, L.; Karakasidis, G.; Odnoblyudova, A.; Dogruel, L.; Jung, A.
title: Explainable Empirical Risk Minimization
date: 2020-09-03
journal: nan
DOI: nan
sha: 90eacdc72aeae098735902ada585f02e953d136b
doc_id: 462479
cord_uid: loxuodbe

The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user signal. This user signal might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.
A key challenge for the successful use of machine learning (ML) methods is their explainability or interpretability (Holzinger, 2018; Hagras, 2018; Mittelstadt et al., 2016; Wachter et al., 2017). Explaining the predictions delivered by ML methods also seems to increasingly become a legal obligation (Wachter et al., 2017). Moreover, humans seem to have a basic need for understanding decision-making processes (Kagan, 1972; Kruglanski & Webster, 1996). The explainability of predictions delivered by ML is particularly important when these predictions inform decisions that crucially affect humans. As a case in point, consider the school closures during the Covid-19 pandemic throughout Europe (Flaxman et al., 2020). These school closures were decided by policy-makers based on predictions obtained by ML methods (Bicher et al., 2020). It was then customary that the policy-makers explain these decisions to the public (Cairney & Kwiatkowski, 2017).

Existing approaches to explainable (or interpretable) ML broadly form two categories: intrinsically explainable methods and model-agnostic methods (Molnar, 2019). Model-agnostic methods provide post-hoc (after model training) explanations (Ribeiro et al., 2016; Jung & Nardelli, 2020). These methods do not require the details of a ML method but only its predictions for some training examples. The second category of explainable ML methods is obtained from models that are considered intrinsically interpretable (Montavon et al., 2018; Bach et al., 2015; Hagras, 2018). Linear models using few features and shallow decision trees are often considered intrinsically interpretable (Molnar, 2019). The interpretation of a linear model is typically obtained from an inspection of the learned weights for the individual features. A large (in magnitude) weight is then read as an indicator for a high relevance of the corresponding feature. However, there is no widely accepted definition of which models are considered intrinsically interpretable.

What sets this work apart from most existing work on explainable ML is that we use a subjective notion of explainability. This subjective explainability, which is tailored to a specific human user, is implemented using the concept of a user signal. Broadly speaking, the user signal is some user-specific attribute that is assigned or associated with a data point. Formally, we can think of the user signal as an additional feature of a data point. This additional feature is measured or determined via the user and revealed to our method for each data point.

Similar to (Chen et al., 2018), we use information-theoretic concepts to measure subjective explainability. However, while (Chen et al., 2018) uses the mutual information between an explanation and the prediction, we measure the subjective explainability of a hypothesis using the conditional entropy of its predictions given a user feedback signal. This conditional entropy is then used as a regularizer for empirical risk minimization (ERM), resulting in explainable empirical risk minimization (EERM). The EERM principle requires a training set of data points for which, beside their features, also the label and user signal values are known. The user signal values for the data points in the training set are used to estimate the subjective explainability of a hypothesis. We obtain different instances of EERM from different hypothesis spaces (models).
Two specific instances are explainable linear regression (see Section 3.1) and explainable decision tree classification (see Section 3.2). We illustrate the usefulness of EERM using the task of detecting hate speech in social media. Hate speech is a main obstacle towards embracing the Internet's potential for deliberation and freedom of speech (Laaksonen et al., 2020). Moreover, the detrimental effect of hate speech seems to have been amplified during the current Covid-19 pandemic (Hardage & Peyman, 2020). Detecting hate speech requires multi-disciplinary expertise from both social science and computer science (Papcunová et al., 2021; Liao et al., 2020). Providing subjective explainability for ML users with different backgrounds is crucial for the diagnosis and improvement of hate speech detection systems (Laaksonen et al., 2020; Hardage & Peyman, 2020; Bunde, 2021).

Our main contributions can be summarized as follows:
• We introduce a novel measure for the subjective explainability of the predictions delivered by a ML method to a specific user. This measure is constructed from the conditional entropy of the predictions given some user signal (see Section 2.2).
• Our main methodological contribution is EERM, which uses subjective explainability as a regularizer. We present two equivalent (dual) formulations of EERM as optimization problems (see Section 3).
• We detail practical implementations of the EERM principle for linear regression and decision tree classification (see Sections 3.1-3.2).
• The usefulness of the EERM principle is illustrated using the task of detecting hate speech in social media (see Section 4). We use EERM to learn an explainable decision tree classifier for a user that associates hate speech with the presence of specific keywords.

We consider a ML application that involves data points, each characterized by a label (quantity of interest) y and some features (attributes) x = (x_1, …, x_n)^T ∈ ℝ^n (Hastie et al., 2001; Bishop, 2006). ML methods aim at learning a hypothesis map h that allows us to predict the label of a data point based solely on its features. In contrast to standard ML approaches, we explicitly take the specific user of the ML method into account. Each data point is also assigned a user signal u that characterizes it from the perspective of a specific human user. The user signal u is conceptually similar to the features x of a data point. However, while features represent objective measurements (e.g., obtained from sensing devices), the user signal u is provided (actively or passively) by the human user of the ML method.

Let us illustrate the rather abstract notion of a user signal by some examples. One important example of a user signal is a manually constructed feature of the data point. Section 4 considers hate speech detection in social media, where data points represent short messages ("tweets"). Here, the user signal u for a specific data point could be defined via the presence of a certain word that is considered a strong indicator for hate speech. The user signal u might also be collected in a more indirect fashion. Consider an application where data points are images that have to be classified into different categories. Here, the user signal u might be derived from EEG measurements taken when a data point (image) is revealed to the user (Zubarev et al., 2022).

The goal of supervised ML is to learn a hypothesis that is used to compute the predicted label ŷ = h(x) from the features x = (x_1, …, x_n)^T ∈ ℝ^n of a data point.
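To make this setup concrete, the following minimal Python sketch shows one possible way to represent a data point (features x, label y, user signal u) and a simple linear hypothesis; the feature values and the user signal are made up purely for illustration.

```python
import numpy as np

# A toy data point: feature vector x, label y, and user signal u.
# Here u = 1.0 could encode that the user flagged a keyword in the underlying tweet
# (all values are made up for illustration).
x = np.array([0.3, 0.0, 1.2])   # features, x in R^n with n = 3
y = 1.0                         # label (quantity of interest)
u = 1.0                         # user signal provided by the human user

# A linear hypothesis h(x) = w^T x with some (arbitrary) weights w.
w = np.array([0.5, -0.2, 0.1])
y_hat = w @ x                   # predicted label  y_hat = h(x)
print(y_hat)
```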
Any ML method with finite computational resources can only use a subset of (computationally) feasible maps. We refer to this subset as the hypothesis space (model) H of a ML method. Examples of such a hypothesis space are linear maps, decision trees or artificial neural networks (Hastie et al., 2015; Goodfellow et al., 2016).

For a given data point with features x and label y, we measure the quality of a hypothesis h using some loss function L((x, y), h). The number L((x, y), h) measures the error incurred by predicting the label y of a data point using the prediction ŷ = h(x). Popular examples of loss functions are the squared error loss L((x, y), h) = (h(x) − y)^2 (for numeric labels y ∈ ℝ) and the logistic loss L((x, y), h) = log(1 + exp(−y h(x))) (for binary labels y ∈ {−1, 1}). Roughly speaking, we would like to learn a hypothesis h that incurs a small loss on any data point. To make this informal goal precise, we can use the notion of expected loss or risk

L(h) := E{ L((x, y), h) }.   (2)

Ideally, we would like to learn a hypothesis ĥ with minimum risk,

ĥ ∈ argmin_{h ∈ H} L(h).   (3)

It seems natural to learn a hypothesis by solving the risk minimization problem (3). There are two caveats to consider when using the risk minimization principle (3). First, we typically do not know the underlying probability distribution p(x, y) required for evaluating the risk (2). We will see in Section 2.1 how empirical risk minimization (ERM) is obtained by approximating the risk with an average loss over some training set. The second caveat to a direct implementation of risk minimization (3) is its ignorance of the explainability of the learned hypothesis ĥ. In particular, we are concerned with the subjective explainability of the predictions ĥ(x) for a user that is characterized via a user signal u for each data point. We construct a measure for this subjective explainability in Section 2.2 and use it as a regularizer to obtain explainable ERM (EERM) (see Section 3).

The idea of ERM is to approximate the risk (2) using the average loss (or empirical risk)

L̂(h|D) := (1/m) ∑_{i=1}^{m} L((x^{(i)}, y^{(i)}), h).   (4)

The average loss L̂(h|D) of the hypothesis h is measured on a set of labelled data points (the training set)

D := { (x^{(i)}, y^{(i)}, u^{(i)}) }_{i=1}^{m}.   (5)

The training set D contains data points for which we know, beside the features x^{(i)}, also the true label value y^{(i)} and the corresponding user signal u^{(i)}. Section 4 applies our methods to the problem of hate speech detection. In this application, a data point is a short text message ("tweet") and the training set (5) consists of tweets for which we know whether they are hate speech or not. As the user signal we will use the presence of a small number of keywords that are considered a strong indicator for hate speech.

Many practical ML methods are based on solving the ERM problem

ĥ ∈ argmin_{h ∈ H} L̂(h|D).   (6)

However, a direct implementation of ERM (6) is prone to overfitting if the hypothesis space H is too large (e.g., linear maps using many features or very deep decision trees) compared to the size m of the training set. To avoid overfitting in this high-dimensional regime (Bühlmann & van de Geer, 2011; Wainwright, 2019), we add a regularization term λR(h) to the empirical risk in (6),

ĥ^{(λ)} ∈ argmin_{h ∈ H} L̂(h|D) + λ R(h).   (7)

The choice of the regularization parameter λ ≥ 0 in (7) can be guided by a probabilistic model for the data or by validation techniques (Hastie et al., 2001). A dual form of regularized ERM (7) is obtained by replacing the regularization term with a constraint,

ĥ^{(η)} ∈ argmin_{h ∈ H} L̂(h|D)  such that  R(h) ≤ η.   (8)

The solutions of (8) coincide with those of (7) for an appropriate choice of η (Bertsekas, 1999).
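As a concrete illustration of the primal form (7), the following Python sketch minimizes a regularized empirical risk for a linear model with squared error loss. The ridge-style regularizer R(w) = ||w||^2 and the synthetic data are placeholders chosen only to keep the example self-contained; they are not the explainability regularizer developed below.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic training set (placeholder data): m data points with n features.
m, n = 50, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

def empirical_risk(w):
    """Average squared error loss, cf. (4) with L((x, y), h) = (h(x) - y)^2."""
    return np.mean((X @ w - y) ** 2)

def regularizer(w):
    """Placeholder regularizer R(h); here a simple squared Euclidean norm."""
    return np.sum(w ** 2)

lam = 0.1  # regularization parameter lambda in (7)

# Regularized ERM (7): minimize empirical risk + lambda * R(h) over linear maps.
result = minimize(lambda w: empirical_risk(w) + lam * regularizer(w), x0=np.zeros(n))
w_hat = result.x
print("empirical risk of learned hypothesis:", empirical_risk(w_hat))
```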
Solving the primal formulation (7) might be computationally more convenient as it is an unconstrained optimization problem, in contrast to the dual formulation (8) (Boyd & Vandenberghe, 2004). However, the dual form (8) allows us to explicitly specify an upper bound η on the value R(h^{(η)}) for the learned hypothesis h^{(η)}.

Regularization techniques are typically used to improve the statistical performance (risk) of the learned hypothesis. Instead, we use regularization as a vehicle for ensuring explainability. In particular, we do not use the regularization term as an estimate for the generalization error L(h) − L̂(h|D). Rather, we use a regularization term that measures the subjective explainability of the predictions ŷ = h(x). The regularization parameter λ in (7) (or η in the dual formulation (8)) adjusts the level of subjective explainability of the learned hypothesis ĥ.

There seems to be no widely accepted formal definition for the explainability (interpretability) of a learned hypothesis ĥ. While linear regression is sometimes considered interpretable, the predictions obtained by applying a linear hypothesis to a huge number of features might be difficult to grasp. Moreover, the interpretability of linear models also depends on the background (formal training) of the specific user of a ML method. Similar to (Chen et al., 2018), we use information-theoretic concepts to make the notion of explainability precise. This approach interprets each data point as the realization of i.i.d. random variables. In particular, the features x, label y and user signal u associated with a data point are realizations drawn from a joint probability density function (pdf) p(x, y, u). In general, the joint pdf p(x, y, u) is unknown and needs to be estimated from data using, e.g., maximum likelihood methods (Bishop, 2006; Hastie et al., 2015). Note that, since we model the features of a data point as the realization of a random variable, the prediction ŷ = h(x) also becomes the realization of a random variable. Figure 1 summarizes the overall probabilistic model for data points, the user signal and the predictions delivered by (the hypothesis learned with) a ML method.

We measure the subjective explainability of the predictions ŷ delivered by a hypothesis h for a data point (x, y, u) as

E(h|u) := C − H(h|u).   (9)

Here, we used the conditional entropy (Cover & Thomas, 2006)

H(h|u) := H(ŷ|u) = −E{ log p(ŷ|u) },  with ŷ = h(x).   (10)

The calibration constant C in (9) only serves notational convenience. In particular, the precise value of C is meaningless for our approach (see Section 3) and only serves the convention that the subjective explainability E(h|u) is a non-negative quantity. For regression problems, the predicted label ŷ might be modelled as a continuous random variable. In this case, the quantity H(ŷ|u) is a conditional differential entropy. With slight abuse of notation, we refer to H(ŷ|u) as a conditional entropy and do not explicitly distinguish between the case where ŷ is discrete, such as in the classification problems studied in Sections 3.1-3.2 and Section 4, and the case where ŷ is continuous. We refer the reader to (Cover & Thomas, 2006) for a precise definition and discussion of conditional entropy and conditional differential entropy.

The conditional entropy H(h|u) in (9) quantifies the uncertainty (of a user that assigns the value u to a data point) about the prediction ŷ = h(x) delivered by the hypothesis h. Smaller values of H(h|u) correspond to smaller levels of subjective uncertainty about the predictions ŷ = h(x) for a data point with known user signal u. This, in turn, corresponds to a larger value E(h|u) of subjective explainability.
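For intuition, the following sketch computes a plug-in estimate of H(ŷ|u), and of the subjective explainability E(h|u) = C − H(ŷ|u), from a finite sample, assuming both the predictions and the user signal take a small number of discrete values. It is meant only to illustrate the quantity being regularized, not the specific estimators developed in Section 3.

```python
import numpy as np
from collections import Counter

def conditional_entropy_bits(y_hat, u):
    """Plug-in estimate of H(y_hat | u) in bits for discrete y_hat and u."""
    n = len(u)
    H = 0.0
    for u_val, n_u in Counter(u).items():
        # empirical distribution of predictions among data points with this user signal value
        preds = [y for y, uu in zip(y_hat, u) if uu == u_val]
        probs = np.array(list(Counter(preds).values())) / n_u
        H += (n_u / n) * (-np.sum(probs * np.log2(probs)))
    return H

# Toy predictions and binary user signals (made-up values).
y_hat = [1, 1, 0, 1, 0, 0, 1, 0]
u     = [1, 1, 1, 1, 0, 0, 0, 1]

H_est = conditional_entropy_bits(y_hat, u)
C = 1.0  # calibration constant, e.g., the maximum entropy of a binary prediction
print("estimated H(y_hat|u):", H_est, "bits; subjective explainability:", C - H_est)
```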
Section 4 discusses explainable methods for detecting hate speech or the use of offensive language. A data point represents a short text message (a tweet). Here, the user signal u could be the presence of specific keywords that are considered a strong indicator for hate speech or offensive language. These keywords might be provided by the user via answering a survey, or they might be determined by computing word histograms on public datasets that have been manually labeled (Davidson et al., 2017).

Figure 1. The features x, label y and user signal u of a data point are realizations drawn from a pdf p(x, y, u). Our goal is to learn a hypothesis h such that its predictions ŷ have a small conditional entropy given the user signal u.

Section 2 has introduced all the components of EERM as a novel principle for explainable ML. EERM learns a hypothesis h by using an estimate Ĥ(h|u) for the conditional entropy in (9) as the regularization term R(h) in (7),

ĥ^{(λ)} ∈ argmin_{h ∈ H} L̂(h|D) + λ Ĥ(h|u).   (11)

A dual form of (11) is obtained by specializing (8),

ĥ^{(η)} ∈ argmin_{h ∈ H} L̂(h|D)  such that  Ĥ(h|u) ≤ η.   (12)

The empirical risk L̂(h|D) and the regularizer Ĥ(h|u) are computed solely from the available training set (5). We will discuss specific choices for the estimator Ĥ(h|u) in Sections 3.1-3.2. The idea of EERM is that the solution of (11) (or (12)) provides a hypothesis that balances the requirement of a small loss (accuracy) with a sufficient level of subjective explainability E(h|u) = C − H(h|u). This balance is steered by the parameter λ in (11) and η in (12), respectively. Choosing a large value for λ in (11) (a small value for η in (12)) penalizes any hypothesis resulting in a large conditional entropy H(h|u). Thus, using a large λ in (11) (a small η in (12)) enforces a high level of subjective explainability of the learned hypothesis.

Figure 2 illustrates the parametrized solutions of (11) in the plane spanned by risk and subjective explainability. The different curves in Figure 2 are parametrized solutions of (11) using different realizations of the training set and different estimators Ĥ for the conditional entropy. One extreme case is λ = 0, when EERM (11) reduces to plain ERM and delivers a hypothesis h^{(λ=0)} with risk L_min. The other extreme case of EERM (11) is when λ = λ′ is chosen sufficiently large such that the resulting hypothesis h^{(λ′)} has zero conditional entropy H(h|u). The hypothesis h^{(λ′)} then achieves maximum subjective explainability E(h|u) = C but also incurs a larger risk L_max. The above extreme cases of EERM are obtained analogously for the dual form (12) with η either sufficiently large or η = 0, respectively.

We now specialize EERM in its primal form (11) to linear regression (Bishop, 2006; Hastie et al., 2015). Linear regression methods learn the parameters w of a linear hypothesis h^{(w)}(x) = w^T x by minimizing the squared error loss of the resulting predictions. The features x and user signal u of a data point are modelled as realizations of jointly Gaussian random variables with zero mean and covariance matrix C,

(x^T, u)^T ∼ N(0, C).   (13)

Note that (13) only specifies the marginal of the joint pdf p(x, y, u) (see Figure 1). Under the assumption (13), we obtain the conditional entropy (see (Cover & Thomas, 2006))

H(h|u) = (1/2) log(2πe σ²_{ŷ|u}).   (14)

Here, we use the conditional variance σ²_{ŷ|u} of the predicted label ŷ = h(x) given the user signal u of a data point. To develop an estimator Ĥ(h|u) for (14), we use the identity

σ²_{ŷ|u} = min_{α ∈ ℝ} E{ (ŷ − α u)² }.   (15)

The identity (15) relates the conditional variance σ²_{ŷ|u} to the minimum mean squared error that can be achieved by estimating ŷ using a linear estimator αu with some α ∈ ℝ. We obtain an estimator for (14) by removing the logarithm and replacing the expectation in (15) by a sample average over the training set D (5),

Ĥ(h|u) := min_{α ∈ ℝ} (1/m) ∑_{i=1}^{m} ( h(x^{(i)}) − α u^{(i)} )².   (16)

Inserting the estimator (16) into EERM (11) yields

ŵ^{(λ)} ∈ argmin_{w ∈ ℝ^n} (1/m) ∑_{i=1}^{m} ( y^{(i)} − w^T x^{(i)} )² + λ min_{α ∈ ℝ} (1/m) ∑_{i=1}^{m} ( w^T x^{(i)} − α u^{(i)} )²,   (17)

and Algorithm 1 as an instance of EERM for linear regression.

Algorithm 1: EERM for explainable linear regression.
Input: explainability parameter λ, training set D (see (5)).
1: compute ŵ by solving (17).
Output: h^{(λ)}(x) := x^T ŵ.
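A minimal sketch of how (17) might be solved numerically is given below. For fixed weights w, the inner minimum over α is a one-dimensional least-squares fit, so the whole objective can be evaluated in closed form and handed to a generic optimizer. The synthetic data are placeholders, and this is a sketch rather than a reference implementation of Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic training set (placeholder data): features X, labels y, user signals u.
m, n = 100, 4
X = rng.normal(size=(m, n))
u = X[:, 0] + 0.1 * rng.normal(size=m)                      # user signal correlated with one feature
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + 0.1 * rng.normal(size=m)

def eerm_objective(w, lam):
    """Objective (17): empirical squared error risk plus lambda times the estimator (16)."""
    y_hat = X @ w
    risk = np.mean((y - y_hat) ** 2)
    # The inner minimization over alpha in (16) is a 1-D least-squares problem:
    # alpha* = <y_hat, u> / <u, u>  (assuming u is not identically zero).
    alpha = (y_hat @ u) / (u @ u)
    H_hat = np.mean((y_hat - alpha * u) ** 2)
    return risk + lam * H_hat

lam = 1.0  # explainability parameter lambda
w_hat = minimize(eerm_objective, x0=np.zeros(n), args=(lam,)).x
print("learned weights:", w_hat)
```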
We now specialize EERM in its dual (constraint) form (12) to decision tree classifiers (Bishop, 2006; Hastie et al., 2015). Consider data points characterized by features x and a binary label y ∈ {0, 1}. Moreover, each data point is characterized by a binary user signal u ∈ {0, 1}. The restriction to binary labels and user signals is for ease of exposition. Our approach can be generalized easily to more than two label values (multi-class classification) and non-binary user signals.

The model H in (12) is constituted by all decision trees whose root node tests the user signal u and whose depth does not exceed a prescribed maximum depth d_max (Hastie et al., 2001). The depth d of a specific decision tree h is the maximum number of test nodes that are encountered along any possible path from the root node to a leaf node (Hastie et al., 2001). Figure 3 illustrates a hypothesis h obtained from a decision tree with depth d = 2. We consider only decision trees whose nodes implement a binary test, such as whether a specific feature x_j exceeds some threshold. Each such binary test can maximally contribute one bit to the entropy of the resulting prediction (at some leaf node). Thus, for a given user signal u, the conditional entropy of the prediction ŷ = h(x) is upper bounded by d − 1 bits. Indeed, since the root node is reserved for testing the user signal u, the number of binary tests carried out for computing the prediction is upper bounded by d − 1. We then obtain Algorithm 2 from (12) by using the estimator Ĥ(h|u) := d − 1. EERM then amounts to learning a separate decision tree for all data points sharing a common user signal u. The constraint in (12) can be enforced naturally by fixing a maximum tree depth d_max.

Algorithm 2: EERM implementation for learning an explainable decision tree classifier.
Input: subjective explainability parameter η, training set D (5).
1: set the maximum tree depth d_max := η.
2: split D into D_0 := {(x^{(i)}, y^{(i)}, u^{(i)}) ∈ D : u^{(i)} = 0} and D_1 := {(x^{(i)}, y^{(i)}, u^{(i)}) ∈ D : u^{(i)} = 1}.
3: learn a decision tree h_0 of depth at most d_max from D_0.
4: learn a decision tree h_1 of depth at most d_max from D_1.
Output: h^{(η)}(x) := h_u(x), i.e., the decision tree whose root node tests the user signal u and then applies h_0 or h_1.
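The following sketch shows one way Algorithm 2 could be realized with scikit-learn, learning one depth-limited tree per value of a binary user signal. The synthetic data and the choice η = 2 are placeholders, and the combined predictor is written as a small wrapper function rather than as a single scikit-learn tree object.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Synthetic training set (placeholder data): features X, binary labels y, binary user signal u.
m, n = 200, 6
X = rng.normal(size=(m, n))
u = (X[:, 0] > 0).astype(int)
y = ((X[:, 0] + 0.5 * X[:, 1]) > 0).astype(int)

eta = 2  # subjective explainability parameter: conditional entropy of predictions <= eta bits

# One depth-limited tree per user-signal value (cf. steps 2-4 of Algorithm 2 above).
trees = {}
for u_val in (0, 1):
    mask = (u == u_val)
    trees[u_val] = DecisionTreeClassifier(max_depth=eta).fit(X[mask], y[mask])

def h(x_new, u_new):
    """Combined hypothesis: the root 'tests' the user signal and dispatches to the matching subtree."""
    return trees[u_new].predict(x_new.reshape(1, -1))[0]

print("prediction:", h(X[0], u[0]), "vs. true label:", y[0])
```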
We study the usefulness of EERM in numerical experiments revolving around the problem of detecting hate speech and offensive language in social media (Wang et al., 2011). Hate speech is a contested term whose meaning ranges from concrete threats to individuals to venting anger against authority (Gagliardone et al., 2015). Hate speech is characterized by devaluing individuals based on group-defining characteristics such as their race, ethnicity, religion and sexual orientation (Erjavec & Kovačič, 2012). Our experiments use a public dataset that contains curated short messages (tweets) from a social network (Davidson et al., 2017). Each tweet has been manually rated by a varying number of users as either "hate speech", "offensive language" or "neither". For each tweet, we define its binary label as y = 1 ("inappropriate tweet") if the majority of users rated the tweet either as "hate speech" or "offensive language". If the majority of users rated the tweet as "neither", we define its label value as y = 0 ("appropriate tweet").

The feature vector x of a tweet is constructed using the normalized frequencies ("tf-idf") of individual words (Baeza-Yates & Ribeiro-Neto, 2011). Each tweet is also characterized by a binary user signal u ∈ {0, 1}. The user signal is defined to be u = 1 if the tweet contains at least one of the 5 most frequent words appearing in tweets with y = 1. We use Algorithm 2 to learn an explainable decision tree classifier whose conditional entropy, given the user signal, is upper bounded by η = 2 bits. The training set D used for Algorithm 2 is obtained by randomly selecting around 90% of the entire dataset. The remaining 10% of the tweets are used as a test set. To learn the decision tree classifiers in steps 3 and 4 of Algorithm 2, we used the implementations provided by the current version of the Python package scikit-learn (Pedregosa et al., 2011). The resulting explainable decision tree classifier h^{(η=2)}(x) (with the root node testing the user signal) achieved an accuracy of 0.929 on the test set.
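To make the experimental setup easier to follow, here is a rough sketch of the pipeline described above. The tiny synthetic tweet set, the column names of the hypothetical DataFrame, and the preprocessing details (tokenization, split) are assumptions for illustration and will differ from the exact setup behind the reported accuracy.

```python
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Tiny synthetic stand-in for the real tweet dataset (column names are assumptions).
tweets = pd.DataFrame({
    "text": ["you are a total idiot"] * 10 + ["lovely weather in helsinki today"] * 10,
    "label": [1] * 10 + [0] * 10,   # y = 1: "inappropriate tweet", y = 0: "appropriate tweet"
})

# tf-idf features x for each tweet.
X = TfidfVectorizer().fit_transform(tweets["text"])
y = tweets["label"].to_numpy()

# User signal u: 1 if the tweet contains one of the 5 most frequent words in tweets with y = 1.
word_counts = Counter(w for t in tweets.loc[tweets["label"] == 1, "text"] for w in t.lower().split())
keywords = {w for w, _ in word_counts.most_common(5)}
u = tweets["text"].str.lower().apply(lambda t: int(any(w in t.split() for w in keywords))).to_numpy()

# Roughly 90% / 10% train/test split.
X_tr, X_te, y_tr, y_te, u_tr, u_te = train_test_split(X, y, u, test_size=0.1, random_state=0)

# Algorithm 2 with eta = 2: one depth-limited tree per user-signal value.
eta = 2
trees = {v: DecisionTreeClassifier(max_depth=eta).fit(X_tr[u_tr == v], y_tr[u_tr == v]) for v in (0, 1)}

# Predict on the test set by dispatching each tweet to the tree matching its user signal.
y_pred = np.empty_like(y_te)
for v in (0, 1):
    mask = (u_te == v)
    if mask.any():
        y_pred[mask] = trees[v].predict(X_te[mask])

print("test accuracy:", accuracy_score(y_te, y_pred))
```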
Complex & Intelligent Systems Scikit-learn: Machine learning in python Why should i trust you?": Explaining the predictions of any classifier Explanation as a social practice: Toward a conceptual framework for the social design of ai systems One explanation does not fit all Why a right to explanation of automated decision-making does not exist in the general data protection regulation Dimensional Statistics: A Non-Asymptotic Viewpoint Topic sentiment analysis in twitter: A graph-based hashtag sentiment classification approach Neural networks for eeg/meg decoding and interpretation. SoftwareX