An Incremental MaxSAT-Based Model to Learn Interpretable and Balanced Classification Rules

Ferreira Júnior, Antônio Carlos Souza; Rocha, Thiago Alves

doi:10.1007/978-3-031-45368-7_15

Antônio Carlos Souza Ferreira Júnior⁹ &
Thiago Alves Rocha⁹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14195))

Included in the following conference series:

Brazilian Conference on Intelligent Systems

536 Accesses

Abstract

The increasing advancements in the field of machine learning have led to the development of numerous applications that effectively address a wide range of problems with accurate predictions. However, in certain cases, accuracy alone may not be sufficient. Many real-world problems also demand explanations and interpretability behind the predictions. One of the most popular interpretable models that are classification rules. This work aims to propose an incremental model for learning interpretable and balanced rules based on MaxSAT, called IMLIB. This new model was based on two other approaches, one based on SAT and the other on MaxSAT. The one based on SAT limits the size of each generated rule, making it possible to balance them. We suggest that such a set of rules seem more natural to be understood compared to a mixture of large and small rules. The approach based on MaxSAT, called IMLI, presents a technique to increase performance that involves learning a set of rules by incrementally applying the model in a dataset. Finally, IMLIB and IMLI are compared using diverse databases. IMLIB obtained results comparable to IMLI in terms of accuracy, generating more balanced rules with smaller sizes.

Access provided by University of Notre Dame Hesburgh Library. Download conference paper PDF

Conceptual challenges for interpretable machine learning

Article Open access 01 March 2022

Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges

Active Class Incremental Learning for Imbalanced Datasets

1 Introduction

The success of Machine Learning (ML) in recent years has led to a growing advancement in studies in this area [2, 8, 12]. Several applications have emerged with the aim of circumventing various problems and situations [4, 14, 20]. One such problem is the lack of explainability of prediction models. This directly affects the reliability of using these applications in critical situations involving, for example, finance, autonomous systems, damage to equipment, the environment, and even lives [1, 7, 23]. That said, some works seek to develop approaches that bring explainability to their predictions [13, 21, 22].

Precise predictions with high levels of interpretability are often not a simple task. There are some works that try to solve this problem by balancing the accuracy of the prediction with the interpretability [5, 6, 9, 15,16,17, 24]. It can be seen that some of these works use approaches based on the Boolean Satisfiability Problem (SAT) and the Maximum Boolean Satisfiability Problem (MaxSAT). The choice of these approaches to solve this problem has been increasingly recurrent in recent years. The reasons can be seen in the results obtained by these models.

SAT-based approaches have been proposed recently [18, 19] to learn quantifier-free first-order sentences from a set of classified strings. More specifically, given a set of classified strings, the goal is to find a first-order sentence over strings of minimum size that correctly classifies all the strings. One of the approaches demonstrated is SQFSAT (Synthesis of quantifier-free first-order sentences over strings with SAT). Upon receiving a set of classified strings, this approach generates a quantifier-free first-order sentence over strings in disjunctive normal form (DNF) with a given number of terms. What makes this method stand out is the fact that we can limit both the number of terms and the number of formulas per term in the generated formula. In addition, as the approach generates formulas in DNF, each term of the formula can be seen as a rule. Then, for each rule, its explanation is the conjunction of formulas in the rule, which can be interesting for their interpretability [11, 18]. On the other hand, as the model is based on the SAT problem, in certain situations it may bring results that are not so interesting in terms of interpretability and efficiency, such as in cases where the set of strings is large.

Ghosh, B. et al. created a classification model based on MaxSAT called IMLI [6]. The approach takes a set of classified samples, represented by vectors of numerical and categorical data, and generates a set of rules expressed in DNF or in conjunctive normal form (CNF) that correctly classifies as many samples as possible. In this work, we focus on using IMLI for learning rules in DNF. The number of rules in the set of rules can be defined similarly to SQFSAT, but IMLI does not consider the number of elements per rule. Although IMLI focuses on learning a sparse set of rules, it may obtain a combination of both large and small rules. IMLI also takes into account the option of defining a weighting for correct classifications. As the weighting increases, the accuracy of the model improves, but at the cost of an increase in the size of the generated set of rules. The smaller the weighting, the lower the accuracy of the model, but correspondingly, the generated set of rules tends to be smaller. Furthermore, IMLI uses an incremental approach to achieve better runtime performance. The incremental form consists of dividing the set of samples into partitions in order to generate a set of rules for each partition from the set of rules obtained in the previous partitions.

In this work, we aim to create a new approach for learning interpretable rules based on MaxSAT that unites SQFSAT with the incrementality of IMLI. The motivation for choosing SQFSAT is the possibility of defining the number of literals per clause, allowing us to generate smaller and more balanced rules. The choice of IMLI is motivated by its incrementability technique, which allows the method to train on large sets of samples efficiently. In addition, we propose a technique that reduces the size of the generated rules, removing possible redundancies.

This work is divided into 6 sections. In Sect. 2, we define the general notions and notations. Since all methods presented in this paper use Boolean logic, we also define in Sect. 2 how these methods binarize datasets with numerical and categorical data. In Sect. 3, SQFSAT and IMLI are presented, respectively. We present SQFSAT in the context of our work where samples consist of binary vectors instead of strings, and elements of rules are not first-order sentences over strings. In Sect. 4, our contribution is presented: IMLIB. In Sect. 5, we describe the experiments conducted and the results for the comparison of our approach against IMLI. Finally, in the last section, we present the conclusions and indicate future work.

2 Preliminaries

We consider the binary classification problem where we are given a set of samples and their classifications. The set of samples is represented by a binary matrix of size $n \times m$ and their classifications by a vector of size n. We call the matrix ${\textbf {X}}$ and the vector ${\textbf {y}}$. Each row of ${\textbf {X}}$ is a sample of the set and we will call it ${\textbf {X}}_i$ with $i \in \{1,...,n\}$. To represent a specific value of ${\textbf {X}}_i$, we will use $x_{i,j}$ with $j \in \{1,...,m\}$. Each column of ${\textbf {X}}$ has a label representing a feature and the label is symbolized by $x^j$. To represent a specific value of ${\textbf {y}}$, we will use $y_i$.

To represent the opposite value of $y_i$, that is, if it is 1 the opposite value is 0 and vice versa, we use $\lnot y_i$. Therefore, we will use the symbol $\lnot {\textbf {y}}$ to represent ${\textbf {y}}$ with all opposite values. To represent the opposite value of $x_{i,j}$, we use $\lnot x_{i,j}$. Therefore, we will use the symbol $\lnot {\textbf {X}}_i$ to represent ${\textbf {X}}_i$ with all opposite values. Each label also has its opposite label which is symbolized by $\lnot x^j$.

A partition of ${\textbf {X}}$ is represented by ${\textbf {X}}^t$ with $t \in \{1,...,p\}$, where p is the number of partitions. Therefore, the partitions of vector ${\textbf {y}}$ are represented by ${\textbf {y}}^t$. Each element of ${\textbf {y}}$ is symbolized by $y_i$ and represents the class value of sample ${\textbf {X}}_i$. We use $\mathcal {E}^- = \{X_i \mid y_i = 0, 1 \le i \le n\}$ and $\mathcal {E}^+ = \{X_i \mid y_i = 1, 1 \le i \le n\}$. To represent the size of these sets, that is, the number of samples contained in them, we use the notations: $|\mathcal {E}^-|$ and $|\mathcal {E}^+|$.

Example 1

Let ${\textbf {X}}$ be the set of samples

$${\textbf {X}} = \left[ \begin{array}{ccc} 0&{}0&{}1\\ 0&{}1&{}0\\ 0&{}1&{}1\\ 1&{}0&{}0 \end{array}\right] $$

and their classifications ${\textbf {y}} = [1, 0, 0, 1]$. The samples ${\textbf {X}}_i$ are: ${\textbf {X}}_1 = [0, 0, 1], ..., {\textbf {X}}_4 = [1, 0, 0]$. The values of each sample $x_{i,j}$ are: $x_{1,1}=0, x_{1,2}=0, x_{1,3}=1, x_{2,1}=0, ..., x_{4,3}=0$. The class values $y_i$ of each sample are: $y_1=1, ..., y_4=1$. We can divide ${\textbf {X}}$ into two partitions in several different ways, one of which is: ${\textbf {X}}^1 = \left[ \begin{array}{ccc} 0&{}1&{}1\\ 0&{}1&{}0 \end{array}\right] $, ${\textbf {y}}^1 = [0, 0]$, ${\textbf {X}}^2 = \left[ \begin{array}{ccc} 1&{}0&{}0\\ 0&{}0&{}1 \end{array}\right] $ e ${\textbf {y}}^2 = [1, 1]$.

Example 2

Let ${\textbf {X}}$ be the set of samples from Example 1, then

$$\lnot {\textbf {X}} = \left[ \begin{array}{ccc} 1&{}1&{}0\\ 1&{}0&{}1\\ 1&{}0&{}0\\ 0&{}1&{}1 \end{array}\right] $$

and $\lnot {\textbf {y}} = [0, 1, 1, 0]$. The samples $\lnot {\textbf {X}}_i$ are: $\lnot {\textbf {X}}_1 = [1, 1, 0], ..., \lnot {\textbf {X}}_4 = [0, 1, 1]$. The values of each sample $\lnot x_{i,j}$ are: $\lnot x_{1,1}=1, \lnot x_{1,2}=1, \lnot x_{1,3}=0, \lnot x_{2,1}=1, ..., \lnot x_{4,3}=1$. The class values of each sample $\lnot y_i$ are: $\lnot y_1=0, ..., \lnot y_4=0$. We can divide $\lnot {\textbf {X}}$ in partitions as in Example 1: $\lnot {\textbf {X}}^1 = \left[ \begin{array}{ccc} 1&{}0&{}0\\ 1&{}0&{}1 \end{array}\right] $, $\lnot {\textbf {y}}^1 = [1, 1]$, $\lnot {\textbf {X}}^2 = \left[ \begin{array}{ccc} 0&{}1&{}1\\ 1&{}1&{}0 \end{array}\right] $ e $\lnot {\textbf {y}}^2 = [0, 0]$.

We define a set of rules being the disjunction of rules and is represented by ${\textbf {R}}$. A rule is a conjunction of one or more features. Each rule in ${\textbf {R}}$ is represented by $R_o$ with $o \in \{1,...,k\}$, where k is the number of rules. Moreover, ${\textbf {R}}({\textbf {X}}_i)$ represents the application of ${\textbf {R}}$ to ${\textbf {X}}_i$. The notations $|{\textbf {R}}|$ and $|R_o|$ are used to represent the number of features in ${\textbf {R}}$ and $R_o$, respectively.

Example 3

Let $x^1=$ Man, $x^2=$ Smoke, $x^3=$ Hike be labels of features. Let ${\textbf {R}}$ be the set of rules ${\textbf {R}} =$ (Man) $\vee $ (Smoke $\wedge $ $\lnot $Hike). The rules $R_o$ are: $R_1=$ (Man) and $R_2=$ (Smoke $\wedge $ $\lnot $Hike). The application of ${\textbf {R}}$ to ${\textbf {X}}_i$ is represented as follows: ${\textbf {R}} ({\textbf {X}}_i) = x_{i,1} \vee (x_{i,2} \wedge \lnot x_{i,3})$. For example, Let ${\textbf {X}}$ be the set of samples from Example 1, then: ${\textbf {R}}({\textbf {X}}_1) = x_{1,1} \vee (x_{1,2} \wedge \lnot x_{1,3}) = 0 \vee (0 \wedge 0) = 0$. Moreover, we have that $|{\textbf {R}}| = 3$, $|R_1| = 1$ and $|R_2| = 2$.

As we assume a set of binary samples, we need to perform some preprocessing. Preprocessing consists of binarizing a set of samples with numerical or categorical values. The algorithm divides the features into four types: constant, where all samples have the same value; binary, where there are only two distinct variations among all the samples for the same value; categorical, when the feature does not fit in constant and binary and its values are three or more categories; ordinal, when the feature does not fit into constant and binary and has numerical values.

When the feature type is constant, the algorithm discards that feature. This happens due to the fact that a feature common to all samples makes no difference in the generated rules. When the type is binary, one of the feature variations will receive 0 and the other 1 as new values. If the type is categorical, we employ the widely recognized technique of one-hot encoding. Finally, for the ordinal type feature, a quantization is performed, that is, the variations of this feature are divided into quantiles. With this, Boolean values are assigned to each quantile according to the original value.

We use SAT and MaxSAT solvers to implement the methods presented in this work. A solver receives a formula in CNF, for example: $(p \vee q) \wedge (q \vee \lnot p)$. Furthermore, a MaxSAT solver receives weights that will be assigned to each clause in the formula. A clause is the disjunction of one or more literals. The weights are represented by $W(Cl) = w$ where Cl is one or more clauses and w represents the weight assigned to each one of them. A SAT solver tries to assign values to the literals in such a way that all clauses are satisfied. A MaxSAT solver tries to assign values to the literals in a way that the sum of the weights of satisfied clauses is maximum. Clauses with numerical weights are considered soft. The greater the weight, the greater the priority of the clause to be satisfied. Clauses assigned a weight of $\infty $ are considered hard and must be satisfied.

3 Rule Learning with SAT and MaxSAT

3.1 SQFSAT

SQFSAT is a SAT-based approach that, given X, y, k and the number of features per rule l, tries to find a set of rules ${\textbf {R}}$ with k rules and at most l features per rule that correctly classify all samples ${\textbf {X}}_i$, that is, ${\textbf {R}}({\textbf {X}}_i) = y_i$ for all i. In general, the approach takes its parameters ${\textbf {X}}$, ${\textbf {y}}$, k and l and constructs a CNF formula to apply it to a SAT solver, which returns an answer that is used to get ${\textbf {R}}$.

The construction of the SAT clauses is defined by propositional variables: $u_{o,d}^j$, $p_{o,d}$, $u_{o,d}^*$, $e_{o,d,i}$ and $z_{o,i}$, for $d \in \{1,...,l\}$. If the valuation of $u_{o,d}^j$ is true, it means that jth feature label will be the dth feature of the rule $R_o$. Furthermore, if $p_{o,d}$ is true, it means that the dth feature of the rule $R_o$ will be $x^j$, in other words, will be positive. Otherwise, it will be negative: $\lnot x^j$. If $u_{o,d}^*$ is true, it means that the dth feature is skipped in the rule $R_o$. In this case, we ignore $p_{o,d}$. If $e_{o,d,i}$ is true, then the dth feature of rule $R_o$ contributes to the correct classification of the ith sample. If $z_{o,i}$ is true, then the rule $R_o$ contributes to the correct classification of the ith sample. That said, below, we will show the constraints formulated in the model for constructing the SAT clauses.

Conjunction of clauses that guarantees that exactly one $u_{o,d}^j$ is true for the dth feature of the rule $R_o$:

$$\begin{aligned} A = \bigwedge _{o \in \{1,...,k\} \atop d \in \{1,...,l\}} \bigvee _{j \in \{1,...,m,*\}} u_{o,d}^j \end{aligned}$$

(1)

$$\begin{aligned} B = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop j,j' \in \{1,...,m,*\}, j \ne j'}} (\lnot u_{o,d}^j \vee \lnot u_{o,d}^{j'}) \end{aligned}$$

(2)

Conjunction of clauses that ensures that each rule has at least one feature:

$$\begin{aligned} C = \bigwedge _{o \in \{1,...,k\}} \bigvee _{d \in \{1,...,l\}} \lnot u_{o,d}^* \end{aligned}$$

(3)

We will use the symbol $s_{o,d,i}^j$ to represent the value of the ith sample in the jth feature label of ${\textbf {X}}$. If this value is 1, it means that if the jth feature label is in the dth position of the rule $R_o$, then it contributes to the correct classification of the ith sample. Therefore, $s_{o,d,i}^j = e_{o,d,i}$. Otherwise, $s_{o,d,i}^j = \lnot e_{o,d,i}$. That said, the following conjunction of formulas guarantees that $e_{o,d,i}$ is true if the jth feature in the oth rule contributes to the correct classification of the sample ${\textbf {X}}_i$:

$$\begin{aligned} D = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop {j \in \{1,...,m\} \atop i \in \{1,...,n\}}}} u_{o,d}^j \rightarrow (p_{o,d} \leftrightarrow s_{o,d,i}^j) \end{aligned}$$

(4)

Conjunction of formulas guaranteeing that if the dth feature of a rule is skipped, then the classification of this rule is not interfered by this feature:

$$\begin{aligned} E = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop i \in \{1,...,n\}}} u_{o,d}^* \rightarrow e_{o,d,i} \end{aligned}$$

(5)

Conjunction of formulas indicating that $z_{o,i}$ will be set to true if all the features of rule $R_o$ contribute to the correct classification of sample ${\textbf {X}}_i$:

$$\begin{aligned} F = \bigwedge _{o \in \{1,...,k\}} \bigwedge _{i \in \{1,...,n\}} z_{o,i} \leftrightarrow \bigwedge _{d \in \{1,...,l\}} e_{o,d,i} \end{aligned}$$

(6)

Conjunction of clauses that guarantees that ${\textbf {R}}$ will correctly classify all samples:

$$\begin{aligned} G = \bigwedge _{i \in \mathcal {E}^+} \bigvee _{o \in \{1,...,k\}} z_{o,i} \end{aligned}$$

(7)

$$\begin{aligned} H = \bigwedge _{i \in \mathcal {E}^-} \bigwedge _{o \in \{1,...,k\}} \lnot z_{o,i} \end{aligned}$$

(8)

Next, the formula Q below is converted to CNF. Then, finally, we have the SAT query that is sent to the solver.

$$\begin{aligned} Q = A \wedge B \wedge C \wedge D \wedge E \wedge F \wedge G \wedge H \end{aligned}$$

(9)

3.2 IMLI

IMLI is an incremental approach based on MaxSAT for learning interpretable rules. Given ${\textbf {X}}$, ${\textbf {y}}$, k, and a weight $\lambda $, the model aims to obtain the smallest set of rules ${\textbf {M}}$ in CNF that correctly classifies as many samples as possible, penalizing classification errors with $\lambda $. In general, the method solves the optimization problem $\min _{{\textbf {M}}} \{|{\textbf {M}}| + \lambda |\mathcal {E}_M| \mid \mathcal {E}_M = \{{\textbf {X}}_i \mid {\textbf {M}}({\textbf {X}}_i) \ne y_i\}\}$, where $|{\textbf {M}}|$ represents the number of features in ${\textbf {M}}$ and ${\textbf {M}}({\textbf {X}}_i)$ denotes the application of the set of rules ${\textbf {M}}$ to ${\textbf {X}}_i$. Therefore, the approach takes its parameters ${\textbf {X}}$, ${\textbf {y}}$, k and $\lambda $ and constructs a MaxSAT query to apply it to a MaxSAT solver, which returns an answer that is used to generate ${\textbf {M}}$. Note that IMLI generates set of rules in CNF, whereas our objective is to obtain sets of rules in DNF. For that, we will have to use as parameter $\lnot {\textbf {y}}$ instead of ${\textbf {y}}$ and negate the set of rules ${\textbf {M}}$ to obtain a set of rules ${\textbf {R}}$ in DNF.

The construction of the MaxSAT clauses is defined by propositional variables: $b_o^v$ and $\eta _i$, for $v \in \{1,...,2m\}$. The v ranges from 1 to 2m, as it also considers opposite features. If the valuation of $b_o^v$ is true and $v \le m$, it means that feature $x^v$ will be in the rule $M_o$, where $M_o$ is the oth rule of ${\textbf {M}}$. If the valuation of the $b_o^v$ is true and $v > m$, it means that feature $\lnot x^{v - m}$ will be in the rule $M_o$. If the valuation of $\eta _i$ is true, it means that sample ${\textbf {X}}_i$ is not classified correctly, that is, ${\textbf {M}}({\textbf {X}}_i) \ne y_i$. That said, below, we will show the constraints for constructing MaxSAT clauses.

Constraints that represent that the cost of a misclassification is $\lambda $:

$$\begin{aligned} A = \bigwedge _{i \in \{1,...,n\}} \lnot \eta _i, W(A) = \lambda \end{aligned}$$

(10)

Constraints that represent that the model tries to insert as few features as possible in ${\textbf {M}}$, taking into account the weights of all clauses:

$$\begin{aligned} B = \bigwedge _{v \in \{1,...,2m\} \atop o \in \{1,...,k\}} \lnot b_o^v, W(B) = 1 \end{aligned}$$

(11)

Even though the constraints in 11 prioritize learning sparse rules, they do so by directing attention to the overall set of rules, i.e. in the total number of features in ${\textbf {M}}$. Then, IMLI may generate a set of rules that comprises a combination of both large and small rules. In our approach presented in Sect. 4, we address this drawback by limiting the number of features in each rule.

We will use ${\textbf {L}}_o$ to represent the set of variables $b_o^v$ of a rule $M_o$, that is, ${\textbf {L}}_o = \{b_o^v\, | v \in \{1, ...,2m\}\}$, for $o \in \{1,...,k\}$. To represent the concatenation of two samples, we will use the symbol $\cup $. We also use the symbol @ to represent an operation between two vectors of the same size. The operation consists of applying a conjunction between the corresponding elements of the vectors. Subsequently, a disjunction between the elements of the result is applied. The following example illustrates how these definitions will be used:

Example 4

Let be ${\textbf {X}}_4$ as in Example 1, ${\textbf {X}}_4 \cup \lnot {\textbf {X}}_4 = [1,0,0,0,1,1]$ and ${\textbf {L}}_o = [b_o^1,b_o^2,b_o^3,b_o^4,b_o^5,b_o^6]$. Therefore, $({\textbf {X}}_4 \cup \lnot {\textbf {X}}_4) @ {\textbf {L}}_o = (x_{4,1} \wedge b_o^1) \vee (x_{4,2} \wedge b_o^2) \vee ... \vee (\lnot x_{4,6} \wedge b_o^6) = (1 \wedge b_o^1) \vee (0 \wedge b_o^2) \vee ... \vee (1 \wedge b_o^6) = b_o^1 \vee b_o^5 \vee b_o^6$.

The objective of this operation is to generate a disjunction of variables that indicates if any of the features associated with these variables are present in $M_o$, then sample ${\textbf {X}}_i$ will be correctly classified by $M_o$. Now, we can show the formula that guarantees that if $\eta _i$ is false, then ${\textbf {M}} ({\textbf {X}}_i) = y_i$:

$$\begin{aligned} C = \bigwedge _{\ i\ \in \ \{1, ..., n\}} \lnot \eta _i \rightarrow (y_i \leftrightarrow \bigwedge _{o \in \{1,...,k\}} (({\textbf {X}}_i \cup \lnot {\textbf {X}}_i) @ {\textbf {L}}_o)), W(C) = \infty \end{aligned}$$

(12)

We can see that C is not in CNF. Therefore, formula Q below must be converted to CNF. With that, finally, we have the MaxSAT query that is sent to the solver.

$$\begin{aligned} Q = A \wedge B \wedge C \end{aligned}$$

(13)

The set of samples ${\textbf {X}}$, in IMLI, can be divided into p partitions: ${\textbf {X}}^1$, ${\textbf {X}}^2$, ..., ${\textbf {X}}^p$. Each partition, but the last one, contains the same values of $|\mathcal {E}^-|$ and $|\mathcal {E}^+|$. Also, the samples are randomly distributed across the partitions. Partitioning aims to make the model perform better in generating the set of rules ${\textbf {M}}$. Thus, the conjunction of clauses will be created from each partition ${\textbf {X}}^t$ in an incremental way, that is, the set of rules ${\textbf {M}}$ obtained by the current partition will be reused for the next partition. In the first partition, constraints (10), (11), (12) are created in exactly the same way as described. From the second onwards, (11) is replaced by the following constraints:

$$\begin{aligned} B' = \bigwedge _{v \in \{1,...,2m\} \atop o \in \{1,...,k\}} \left\{ \begin{array}{ll} b_o^v \text {, if} \, b_o^v \, \text {is true in the previous partition};\\ \lnot b_o^v \text {, otherwise}. \end{array} \right. , W(B') = 1 \end{aligned}$$

(14)

The IMLI also has a technique for reducing the size of the generated set of rules. The technique removes possible redundancies in ordinal features as the one in Example 5. In the original implementation of the model, the technique is applied at the end of each partition. In our implementation for the experiments in Sect. 5, this technique is applied only at the end of the last partition. The reason for this is training performance.

Example 5

Let ${\textbf {R}}$ be the following set of rules with redundancy in the same rule:

$$(\textit{Age}>18 \wedge \textit{Age}>20) \vee (\textit{Height}\le 2).$$

Then, the technique removes the redundancy and the following set of rules is obtained:

$$(\textit{Age}>20) \vee (\textit{Height}\le 2).$$

4 IMLIB

In this section, we will present our method IMLIB which is an incremental version of SQFSAT based on MaxSAT. IMLIB also has a technique for reducing the size of the generated set of rules. Therefore, our approach partitions the set of samples ${\textbf {X}}$. Moreover, our method has one more constraint and weight on all clauses. With that, our approach receives five input parameters ${\textbf {X}}$, ${\textbf {y}}$, k, l, $\lambda $ and tries to obtain the smallest ${\textbf {R}}$ that correctly classifies as many samples as possible, penalizing classification errors with $\lambda $, that is, $\min _{{\textbf {R}}} \{|{\textbf {R}}| + \lambda |\mathcal {E}_R| \mid \mathcal {E}_R = \{{\textbf {X}}_i \mid {\textbf {R}} ({\textbf {X}}_i) \ne y_i\}\}$. That said, below, we will show the constraints of our approach for constructing MaxSAT clauses.

Constraints that guarantee that exactly only one $u_{o,d}^j$ is true for the dth feature of the rule $R_o$:

$$\begin{aligned} A = \bigwedge _{o \in \{1,...,k\} \atop d \in \{1,...,l\}} \bigvee _{j \in \{1,...,m,*\}} u_{o,d}^j, W(A) = \infty \end{aligned}$$

(15)

$$\begin{aligned} B = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop j,j' \in \{1,...,m,*\}, j \ne j'}} \lnot u_{o,d}^j \vee \lnot u_{o,d}^{j'}, W(B) = \infty \end{aligned}$$

(16)

Constraints representing that the model will try to insert as few features as possible in ${\textbf {R}}$:

$$\begin{aligned} C = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop j \in \{1,...,m\}}} \lnot u_{o,d}^j \wedge \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\}}} u_{o,d}^*, W(C) = 1 \end{aligned}$$

(17)

Conjunction of clauses that guarantees that each rule has at least one feature:

$$\begin{aligned} D = \bigwedge _{o \in \{1,...,k\}} \bigvee _{d \in \{1,...,l\}} \lnot u_{o,d}^*, W(D) = \infty \end{aligned}$$

(18)

The following conjunction of formulas ensures that $e_{o,d,i}$ is true if the jth feature label in the oth rule contributes to correctly classify sample ${\textbf {X}}_i$:

$$\begin{aligned} E = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop {j \in \{1,...,m\} \atop i \in \{1,...,n\}}}} u_{o,d}^j \rightarrow (p_{o,d} \leftrightarrow s_{o,d,i}^j), W(E) = \infty \end{aligned}$$

(19)

Conjunction of formulas guaranteeing that the classification of a specific rule will not be interfered by skipped features in the rule:

$$\begin{aligned} F = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop i \in \{1,...,n\}}} u_{o,d}^* \rightarrow e_{o,d,i}, W(F) = \infty \end{aligned}$$

(20)

Conjunction of formulas indicating that the model assigns true to $z_{o,i}$ if all the features of rule $R_o$ support the correct classification of sample ${\textbf {X}}_i$:

$$\begin{aligned} G = \bigwedge _{o \in \{1,...,k\}} \bigwedge _{i \in \{1,...,n\}} z_{o,i} \leftrightarrow \bigwedge _{d \in \{1,...,l\}} e_{o,d,i}, W(G) = \infty \end{aligned}$$

(21)

Conjunction of clauses designed to generate a set of rules ${\textbf {R}}$ that correctly classify as many samples as possible:

$$\begin{aligned} H = \bigwedge _{i \in \mathcal {E}^+} \bigvee _{o \in \{1,...,k\}} z_{o,i}, W(H) = \lambda \end{aligned}$$

(22)

$$\begin{aligned} I = \bigwedge _{i \in \mathcal {E}^-} \bigwedge _{o \in \{1,...,k\}} \lnot z_{o,i}, W(I) = \lambda \end{aligned}$$

(23)

Finally, after converting formula Q below to CNF, we have the MaxSAT query that is sent to the solver.

$$\begin{aligned} Q = A \wedge B \wedge C \wedge D \wedge E \wedge F \wedge G \wedge H \wedge I \end{aligned}$$

(24)

IMLIB can also partition the set of samples ${\textbf {X}}$ in the same way IMLI. Therefore, all constraints described above are applied in the first partition. Starting from the second partition, the constraints in (17) are replaced by the following constraints:

$$\begin{aligned} C' = \bigwedge _{o \in \{1,...,k\} \atop {d \in \{1,...,l\} \atop j \in \{1,...,m,*\}}} \left\{ \begin{array}{ll} u_{o,d}^j \text {, if} \, u_{o,d}^j \, \text {is true in the previous}\\ {partition};\\ \lnot u_{o,d}^j \text {, otherwise}. \end{array} \right. , W(C') = 1 \end{aligned}$$

(25)

IMLIB also has a technique for reducing the size of the generated set of rules demonstrated in Example 5. Moreover, we added two more cases which are described in Example 6 and Example 7.

Example 6

Let ${\textbf {R}}$ be the following set of rules with opposite features in the same rule:

$$(\textit{Age}>20) \vee (\textit{Height}\le 2 \wedge \textit{Height}>2) \vee (\textit{Hike} \wedge \textit{Not Hike}).$$

Therefore, the technique removes rules with opposite features in the same rule obtaining the following set of rules:

$$(\textit{Age}>20).$$

Example 7

Let ${\textbf {R}}$ be the following set of rules with the same feature occurring twice in a rule:

$$(\textit{Hike} \wedge \textit{Hike}) \vee (\textit{Age}>20).$$

Accordingly, our technique for removing redundancies eliminates repetitive features, resulting in the following set of rules:

$$(\textit{Hike}) \vee (\textit{Age}>20).$$

5 Experiments

Table 1. Databases information.

Full size table

In this section, we present the experiments we conducted to compare our method IMLIB against IMLI. The two models were implemented^{Footnote 1} with Python and MaxSAT solver RC2 [10]. The experiments were carried out on a machine with the following configurations: Intel(R) Core(TM) i5-4460 3.20GHz processor, and 12GB of RAM memory. Ten databases from the UCI repository [3] were used to compare IMLI with IMLIB. Information on the datasets can be seen in Table 1. Databases that have more than two classes were adapted, considering that both models are designed for binary classification. For purposes of comparison, we measure the following metrics: number of rules, size of the set of rules, size of the largest rule, accuracy on test data and training time. The number of rules, size of the set of rules, and size of the largest rule can be used as interpretability metrics. For example, a set of rules with few rules and small rules is more interpretable than one with many large rules.

Each dataset was split into $80\%$ for training and $20\%$ for testing. Both models were trained and evaluated using the same training and test sets, as well as the same random distribution. Then, the way the experiments were conducted ensured that both models had exactly the same set of samples to learn the set of rules.

For both IMLI and IMLIB, we consider parameter configurations obtained by combining values of: $k \in \{1,2,3\}$, $\lambda \in \{5,10\}$ and $lp \in \{8,16\}$, where lp is the number of samples per partition. Since IMLIB has the maximum number of features per rule l as an extra parameter, for each parameter configuration of IMLI and its corresponding $\textbf{R}$, we considered l ranging from 1 to one less than the size of the largest rule in $\textbf{R}$. Thus, the value of l that resulted in the best test accuracy was chosen to be compared with IMLI. Our objective is to evaluate whether IMLIB can achieve higher test accuracy compared to IMLI by employing smaller and more balanced rules. Furthermore, it should be noted that this does not exclude the possibility of our method generating sets of rules with larger sizes than IMLI.

For each dataset and each parameter configuration of k, $\lambda $ and lp, we conducted ten independent realizations of this experiment. For each dataset, the parameter configuration with the best average of test accuracy for IMLI was chosen to be inserted in Table 2. For each dataset, the parameter configuration with the best average of test accuracy for IMLIB was chosen to be inserted in Table 3. The results presented in both tables are the average over the ten realizations.

In Table 2, when considering parameter configurations that favor IMLI, we can see that IMLIB stands out in the size of the generated set of rules and in the size of the largest rule in datasets. Furthermore, our method achieved equal or higher accuracy compared to IMLI in four out of ten datasets. In datasets where IMLI outperformed IMLIB in terms of accuracy, our method exhibited a modest average performance gap of only three percentage points. Besides, IMLI outperformed our method in terms of training time in all datasets.

Table 2. Comparison between IMLI and IMLIB in different databases with the IMLI configuration that obtained the best result in terms of accuracy. The column Training time represents the training time in seconds.

Full size table

Table 3. Comparison between IMLI and IMLIB in different databases with the IMLIB configuration that obtained the best result in terms of accuracy. The column Training time represents the training time in seconds.

Full size table

In Table 3, when we consider parameter configurations that favor our method, we can see that IMLIB continues to stand out in terms of the size of the generated set of rules and the size of the largest rule in all datasets. Moreover, our method achieved equal or higher accuracy than IMLI in all datasets. Again, IMLI consistently demonstrated better training time performance compared to IMLIB across all datasets.

As an illustrative example of interpretability, we present a comparison of the sizes of rules learned by both methods in the Mushroom dataset. Table 4 shows the sizes of rules obtained in all ten realizations of the experiment. We can observe that IMLIB consistently maintains a smaller and more balanced set of rules across the different realizations. This is interesting because unbalanced rules can affect interpretability. See realization 6, for instance. The largest rule learned by IMLI has a size of 10, nearly double the size of the remaining rules. In contrast, IMLIB learned a set of rules where the size of the largest rule is 6 and the others have similar sizes. Thus, interpreting three rules of size at most 6 is easier than interpreting a rule of size 10. Also as illustrative examples of interpretability, we can see some sets of rules learned by IMLIB in Table 5.

Table 4. Comparison of the size of the rules generated in the ten realizations of the Mushroom base from Table 2. The configuration used was $lp = 16$, $k = 3$ and $\lambda = 10$. As the value of l used in IMLIB varies across the realizations, column l will indicate which was the value used in each realization. In the column Rules sizes, we show the size of each rule in the following format: ($|R_1|$, $|R_2|$, $|R_3|$). We have highlighted in bold the cases where the size of $|R_o|$ is the same or smaller in our model compared to IMLI.

Full size table

Table 5. Examples of set of rules generated by IMLIB in some tested databases.

Full size table

6 Conclusion

In this work, we present a new incremental model for learning interpretable and balanced rules: IMLIB. Our method leverages the strengths of SQFSAT, which effectively constrains the size of rules, while incorporating techniques from IMLI, such as incrementability, cost for classification errors, and minimization of the set of rules. Our experiments demonstrate that the proposed approach generates smaller and more balanced rules than IMLI, while maintaining comparable or even superior accuracy in many cases. We argue that sets of small rules with approximately the same size seem more interpretable when compared to sets with a few large rules. As future work, we plan to develop a version of IMLIB that can classify sets of samples with more than two classes, enabling us to compare this approach with multiclass interpretable rules from the literature [11, 24].

Notes

1.
Source code of IMLIB and the implementation of the tests performed can be found at the link: https://github.com/cacajr/decision_set_models.

References

Biran, O., Cotton, C.: Explanation and justification in machine learning: a survey. In: IJCAI-17 Workshop on Explainable AI (XAI), vol. 8, pp. 8–13 (2017)
Google Scholar
Carleo, G., et al.: Machine learning and the physical sciences. Rev. Mod. Phys. 91(4), 045002 (2019)
Article Google Scholar
Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
Ghassemi, M., Oakden-Rayner, L., Beam, A.L.: The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3(11), e745–e750 (2021)
Article Google Scholar
Ghosh, B., Malioutov, D., Meel, K.S.: Efficient learning of interpretable classification rules. J. Artif. Intell. Res. 74, 1823–1863 (2022)
Article MathSciNet MATH Google Scholar
Ghosh, B., Meel, K.S.: IMLI: an incremental framework for MaxSAT-based learning of interpretable classification rules. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 203–210 (2019)
Google Scholar
Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.Z.: XAI-explainable artificial intelligence. Sci. Robot. 4(37), eaay7120 (2019)
Google Scholar
Huang, H.Y., et al.: Power of data in quantum machine learning. Nat. Commun. 12(1), 2631 (2021)
Article Google Scholar
Ignatiev, A., Marques-Silva, J., Narodytska, N., Stuckey, P.J.: Reasoning-based learning of interpretable ML models. In: IJCAI, pp. 4458–4465 (2021)
Google Scholar
Ignatiev, A., Morgado, A., Marques-Silva, J.: RC2: an efficient MaxSAT solver. J. Satisfiability Boolean Modeling Comput. 11(1), 53–64 (2019)
Article MathSciNet MATH Google Scholar
Ignatiev, A., Pereira, F., Narodytska, N., Marques-Silva, J.: A SAT-based approach to learn explainable decision sets. In: Galmiche, D., Schulz, S., Sebastiani, R. (eds.) IJCAR 2018. LNCS (LNAI), vol. 10900, pp. 627–645. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94205-6_41
Chapter Google Scholar
Janiesch, C., Zschech, P., Heinrich, K.: Machine learning and deep learning. Electron. Mark. 31(3), 685–695 (2021)
Article Google Scholar
Jiménez-Luna, J., Grisoni, F., Schneider, G.: Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2(10), 573–584 (2020)
Article Google Scholar
Kwekha-Rashid, A.S., Abduljabbar, H.N., Alhayani, B.: Coronavirus disease (COVID-19) cases analysis using machine-learning applications. Appl. Nanosci. 1–13 (2021)
Google Scholar
Lakkaraju, H., Bach, S.H., Leskovec, J.: Interpretable decision sets: a joint framework for description and prediction. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1675–1684 (2016)
Google Scholar
Maliotov, D., Meel, K.S.: MLIC: a MaxSAT-based framework for learning interpretable classification rules. In: Hooker, J. (ed.) CP 2018. LNCS, vol. 11008, pp. 312–327. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98334-9_21
Chapter Google Scholar
Mita, G., Papotti, P., Filippone, M., Michiardi, P.: LIBRE: learning interpretable Boolean rule ensembles. In: AISTATS, pp. 245–255. PMLR (2020)
Google Scholar
Rocha, T.A., Martins, A.T.: Synthesis of quantifier-free first-order sentences from noisy samples of strings. In: 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), pp. 12–17. IEEE (2019)
Google Scholar
Rocha, T.A., Martins, A.T., Ferreira, F.M.: Synthesis of a DNF formula from a sample of strings using Ehrenfeucht-Fraïssé games. Theoret. Comput. Sci. 805, 109–126 (2020)
Article MathSciNet MATH Google Scholar
Sharma, A., Jain, A., Gupta, P., Chowdary, V.: Machine learning applications for precision agriculture: a comprehensive review. IEEE Access 9, 4843–4873 (2020)
Article Google Scholar
Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 32(11), 4793–4813 (2020)
Article Google Scholar
Vilone, G., Longo, L.: Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 76, 89–106 (2021)
Article Google Scholar
Yan, L., et al.: An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2(5), 283–288 (2020)
Article Google Scholar
Yu, J., Ignatiev, A., Stuckey, P.J., Le Bodic, P.: Computing optimal decision sets with SAT. In: Simonis, H. (ed.) CP 2020. LNCS, vol. 12333, pp. 952–970. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58475-7_55
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Instituto Federal de Educação, Ciência e Tecnologia do Ceará (IFCE), Fortaleza, Brazil
Antônio Carlos Souza Ferreira Júnior & Thiago Alves Rocha

Authors

Antônio Carlos Souza Ferreira Júnior
View author publications
Search author on:PubMed Google Scholar
Thiago Alves Rocha
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Antônio Carlos Souza Ferreira Júnior .

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Murilo C. Naldi
Centro Universitario da FEI, São Bernardo do Campo, Brazil
Reinaldo A. C. Bianchi

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ferreira Júnior, A.C.S., Rocha, T.A. (2023). An Incremental MaxSAT-Based Model to Learn Interpretable and Balanced Classification Rules. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_15

Download citation

DOI: https://doi.org/10.1007/978-3-031-45368-7_15
Published: 12 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45367-0
Online ISBN: 978-3-031-45368-7
eBook Packages: Computer ScienceComputer Science (R0)

Keywords

Publish with us

Policies and ethics

An Incremental MaxSAT-Based Model to Learn Interpretable and Balanced Classification Rules

Abstract

Similar content being viewed by others

Conceptual challenges for interpretable machine learning

Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges

Active Class Incremental Learning for Imbalanced Datasets

Explore related subjects

1 Introduction

2 Preliminaries

Example 1

Example 2

Example 3

3 Rule Learning with SAT and MaxSAT

3.1 SQFSAT

3.2 IMLI

Example 4

Example 5

4 IMLIB

Example 6

Example 7

5 Experiments

6 Conclusion

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Keywords

Publish with us