1 Introduction

Education is the single most important investment people can make in their future and the best way to escape poverty, promoting more inclusive societies [40]. Even in the worst circumstances, it is the basic building block that gives people the confidence to face the future. Nevertheless, the cost of attaining higher education has skyrocketed over the past few decades [39]. Faced with this, the Web has enabled one of the most prominent developments in higher education in recent years, not only taking classes out of the traditional classroom and making them open, accessible and dynamic [26], but also mitigating the loss of quality caused by poor school conditions and the unavailability of adequately trained teachers. In 2011, the deployment of massive open online courses (MOOCs), such as those of the edX consortium [8] and Coursera [19], exploded around the world and rapidly became the dominant trend in online learning [25], available to all who wish to learn regardless of educational background and individual needs.

Despite their importance and the high degree of interest they attract, there are significant limitations in moving from open content towards open educational practices, as described by [12]. Since online courses are based on “static” learning material, i.e. one-size-fits-all programs and content [29], it is not straightforward to assess the learning of such a great number of students, who differ considerably in their educational background, engagement styles and cognitive skills; this mismatch may lead to mere surface learning.

In this spirit, Machine Learning plays an important role in opening up the “black box of learning”, giving us a deeper and finer-grained understanding of how learning actually happens and creating a relevant and transformative approach for making education both universally available and more relevant. This paper aims to address the aforementioned challenges by proposing an explainable Machine Learning algorithm for personalizing web-based education systems. The key feature of the method is that it automatically tailors the content to the proficiency level of the examinees by using an unsupervised Deep Learning algorithm, such that the next item (or set of items) selected to be administered is optimally informative for a given person. The items’ personalized recommendations are elicited by using a post hoc explanation approach.

This paper is structured as follows: Sect. 2 presents some related work. Section 3 presents the problem formulation and describes the proposed method to deal with personalized learning. Section 4 explains the scenario of the simulations and results. Section 5 presents some results on the analysis of real data from a national exam for students in Brazil. Section 6 discusses and concludes the paper.

2 Related Work

One of the most important classes of latent variable models, item response theory (IRT) models [15] have been very successful in educational research [23]. In this context, latent traits refer to unobservable characteristics of individuals that may influence their performance in educational settings. By considering those abilities, it is possible to gain insights into the complex interplay between individual differences and educational outcomes (item responses), allowing for more targeted interventions and instructional approaches tailored to students’ unique needs and characteristics. Within the rich and broad corpus of findings devoted to IRT models [22], the two-parameter logistic (2PL) model [7] is one of the most prominent. Let the responses of a person j (\(j=1,\ldots ,n\)) to items \(i = 1,\ldots ,I\) be represented by a stochastic vector \(\textbf{x}_{j} = (x_{1j},x_{2j},\ldots ,x_{Ij})\) with elements

$$\begin{aligned} x_{ij} = {\left\{ \begin{array}{ll} 1, &{} \text {if person } j \text { gives a correct response in item } i \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Selecting a response alternative is the result of the complex interaction between the characteristics of the test item and the latent trait of the person, \(y_j\), which is commonly known as ‘ability’ (or proficiency) in the framework of educational and psychological measurement [23]. The probability of a correct response for an item i by individual j \((X_{ij} = 1)\), is given by:

$$\begin{aligned} P(X_{ij} = 1 | y_j, \mathbf {\Theta }) = \frac{e^{\left( W_i y_j - a_i\right) }}{1 + e^{\left( W_i y_j - a_i\right) }}, \end{aligned}$$
(1)

where \(\mathbf {\Theta } = \left( \textbf{W}, \textbf{a}\right) \) groups the model parameters. \(\textbf{W} = \left( W_1,\ldots ,W_I\right) \) is the vector of item discrimination parameters, which reflect the capability of a generic item i to discriminate between individuals with different levels of ability. The difficulty parameters of the items are \(d_i = a_i/W_i\). In IRT, each item is assumed to have its own unique item characteristic curve (ICC), which provides valuable information about the properties of the item. The left panel of Fig. 1 presents three examples of ICCs with varying W-parameters and the right panel presents ICCs with varying a-parameters. Greater values of W produce curves with steeper slopes; therefore, even individuals with almost the same latent traits will be well discriminated, because they will produce different response probabilities. On the other hand, higher values of a indicate a proportionally higher level of difficulty, since greater latent traits are needed to obtain a given value of \(P(X_{ij} = 1| y_j, \mathbf {\Theta })\).

Fig. 1.
figure 1

Item characteristic curves for the 2PL with W-parameters 0.3, 1.0, 1.7 (left) and a-parameters \(-1,0,1\) (right), respectively.
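The behavior of the ICCs above can be checked numerically with a minimal sketch of Eq. 1 (variable names are ours, and the parameter values mirror those in Fig. 1):

```python
import numpy as np

def p_correct(y, W_i, a_i):
    """2PL probability of a correct response (Eq. 1):
    P(X_ij = 1 | y_j) = exp(W_i*y_j - a_i) / (1 + exp(W_i*y_j - a_i))."""
    return 1.0 / (1.0 + np.exp(-(W_i * y - a_i)))

# At y = d_i = a_i / W_i the curve crosses 0.5, by definition of difficulty.
assert np.isclose(p_correct(0.7, 1.0, 0.7), 0.5)

# A more discriminating item (larger W) separates nearby abilities better.
flat  = p_correct(0.2, 0.3, 0.0) - p_correct(-0.2, 0.3, 0.0)
steep = p_correct(0.2, 1.7, 0.0) - p_correct(-0.2, 1.7, 0.0)
assert steep > flat

# A higher a shifts the curve right: the same ability yields a lower P(correct).
assert p_correct(0.0, 1.0, 1.0) < p_correct(0.0, 1.0, -1.0)
```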

While the 2PL is often easy to comprehend and interpret and has various important measurement properties, the complexities of the brain are even more challenging. Many pieces of knowledge with different “wiring” are required to identify the correct response. For example, [27] describes fourth-grade reading as follows: “reading includes developing a general understanding of written text, thinking about texts, and using texts for different purposes”, while [9] identified 15 factors (dimensions) related to language ability. In recent years, the study of nervous systems, minds and other complex phenomena through artificial intelligence systems has achieved state-of-the-art results in many fields [11, 24, 32]. Furthermore, there is an increasing effort to apply such methods to problems in psychometrics and education [18]. However, for the most part, this attention has focused only on predicting student performance using classification techniques, such as support vector machines [14, 41], or matrix factorization [36, 38]. Furthermore, these algorithms often act as a “black box” and do not provide detailed information about the reasons for a given decision.

The field of Explainable AI (XAI) has produced an abundance of explanation techniques, from the simplest, such as intrinsically interpretable models [10], to more sophisticated ones that assign a local interpretation according to the specific characteristics of the input. Among them, one of the most prominent is Layer-wise Relevance Propagation (LRP) [4]. It belongs to the propagation-based class, which assumes that the prediction was produced by a neural network and takes advantage of this structure to produce an explanation, starting from the last layer and backtracking to the input layer [33]. That is, using the weights of the network and its activation functions, LRP propagates the contribution of each neuron back to the input layer. Nevertheless, so far, research on explanation methods has mainly concentrated on supervised learning.

In a technical sense, our methodology can be seen as an extension of the 2PL model that considers more interaction layers between latent abilities, modeled by a Deep Boltzmann Machine. Furthermore, it is a new framework for explaining an unsupervised model, such as the 2PL, by rewriting it as a neural network (“neuralizing” it) and then applying a post hoc explanation method such as LRP.

3 Proposed Method

Over the past several years, big data methods, including but not limited to Machine Learning (ML), have been very successful in computer science applications, and there is an increasing effort to apply such methods to problems in psychometrics and education [18]. This includes methods with a “deep” architecture mimicking the information representation structure of human brains. Recently, Deep Boltzmann Machines (DBMs [31]) have risen to prominence due to their capacity to learn efficient representations of seemingly complex data. A DBM is a probabilistic undirected graphical model of interconnected units that learns a joint probability density over these units by adapting the connections between them. Formally, a DBM can be used to specify the relation between the level of a latent trait and item responses, as follows

$$\begin{aligned} P\left( \textbf{X},\textbf{Y}, Z;\mathbf {\Theta }\right) &= \frac{e^{-f(\textbf{x},\textbf{y}, z;\mathbf {\Theta })}}{G},\\ G &= \int e^{-f(\textbf{x},\textbf{y}, z;\mathbf {\Theta })} \,\text {d}\textbf{x} \,\text {d}\textbf{y} \,\text {d}z, \nonumber \end{aligned}$$
(2)

where \(\textbf{X}\) are the item responses, \(\textbf{Y}\) is the first hidden layer (m-dimensional) and Z is the second hidden layer (unidimensional). The term in the denominator is called the partition function, which normalizes the probability distribution by integrating over all possible values of \(\textbf{X}\), \(\textbf{Y}\) and Z, and

$$\begin{aligned} f(\textbf{x},\textbf{y}, z;\mathbf {\Theta }) = \frac{1}{2} \left( \textbf{y} - \textbf{b}\right) ^T\mathbf {\Sigma }\left( \textbf{y} - \textbf{b}\right) - \textbf{y}^T\mathbf {\Sigma }^{1/2} \textbf{W} \textbf{x} + \frac{1}{2} \frac{\left( z - c\right) ^2}{\sigma ^2} - \frac{z}{\sigma } \textbf{V} \textbf{y} - \textbf{x}^T \textbf{a}, \end{aligned}$$

where \(\mathbf {\Theta } = \left( \mathbf {\Sigma },\textbf{W}, \textbf{V},\textbf{a},\textbf{b}, c \right) \) groups the model parameters. \(\textbf{W} \in \mathbb {R}^{m\times I}\) corresponds to the matrix of item discrimination parameters (the slopes of the surface). The parameters \(\textbf{b}\) and c are the biases of the hidden units. \(\mathbf {\Sigma }\) is a diagonal matrix of precisions. Figure 2 shows how the variables are interconnected. The standard 2PL model corresponds to having one unique output \(y_1\). The DBM extends this model by allowing more output dimensions (\(y_2\) to \(y_m\)) and also by providing a combined view of those dimensions in z.

Fig. 2.
figure 2

Relationship between variables in the DBM.

Almost all latent trait models have unidentified parameters due to scale invariance, which occurs because of overparametrization. To rule this out, we fixed the matrix of precisions to the identity matrix, as in [6]. Many other impressive results obtained by fixing those parameters to unity have been published [16, 20, 30]. The probability of the latent traits is given by:

$$\begin{aligned} P \left( Y = y | \textbf{x}, z;\mathbf {\Theta }\right) &= \mathcal {N}_R\left( \textbf{y} \,|\, \textbf{W}\textbf{x} + \frac{\textbf{V}^Tz}{\sigma } + \textbf{b}, \textbf{I}_m \right) \end{aligned}$$
(3)
$$\begin{aligned} P \left( Z = z | \textbf{y};\mathbf {\Theta }\right) &= \mathcal {N}\left( z\,|\,\textbf{V}\textbf{y} + c, \sigma ^2\right) \end{aligned}$$
(4)

where \(\mathcal {N}(\cdot | \mu , \sigma ^2)\) denotes the Gaussian probability density function with mean \(\mu \) and variance \(\sigma ^2\), and \(\mathcal {N}_R(\cdot | \mu , \sigma ^2)\) denotes the rectified Gaussian distribution. Thus, it is possible to demonstrate that Eq. 2 is structurally similar to the expression for the 2PL model (Eq. 1); detailed proofs of these results are given in Appendix 1. This is the first result presented in this work: DBMs generalize the 2PL model.
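The “neuralized” view implied by Eqs. 3–4 can be sketched as follows. This is our illustration, not the authors’ estimator: all sizes, seeds and parameter values are arbitrary, and we ignore the top-down term of Eq. 3, so that the mode of the rectified Gaussian becomes a ReLU layer and the mean of Eq. 4 a linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
I, m = 50, 3                       # items and latent dimensions (illustrative)
W = rng.uniform(0.4, 1.8, (m, I))  # discrimination matrix; Sigma fixed to identity
V = rng.uniform(0.4, 1.8, m)       # weights from y to the combined trait z
b, c = np.zeros(m), 0.0            # hidden-unit biases
x = rng.integers(0, 2, I).astype(float)  # one examinee's 0/1 response vector

# Bottom-up pass: the mode of the rectified Gaussian in Eq. 3 (without the
# top-down term) is max(0, Wx + b) -- exactly a ReLU layer; the mean of the
# Gaussian in Eq. 4 is a linear layer.
y = np.maximum(0.0, W @ x + b)   # first hidden layer: m latent traits
z = float(V @ y + c)             # second hidden layer: combined trait

assert y.shape == (m,) and np.all(y >= 0.0)
```

With \(m = 1\) and the second layer removed, this forward pass reduces to the linear predictor inside the 2PL logistic function, which is the structural similarity the proof in Appendix 1 formalizes.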

3.1 Explaining DBMs Parameters

In contrast to their very high flexibility, with DBMs it is often difficult to estimate the item parameters in high-dimensional spaces and to provide a comprehensive explanation along several dimensions of knowledge, limiting their broader applicability. This is a major drawback of ML models, since their nested non-linear structure makes them highly non-transparent. Not surprisingly, this issue has recently received a lot of attention in the literature [4, 5, 34]. The goal of the explanation is to assign the importance of each input (item) for the output (latent trait). The multidimensional item discrimination can be interpreted through \(\Vert W_i \Vert \), which can be seen as a sensitivity analysis: the higher the value of \(\Vert W_i \Vert \), the higher the discrimination power of that item, independently of the assumed underlying latent structure. The multidimensional item difficulty can be interpreted through \(d_i = a_i/\Vert W_i \Vert \). A limitation of this approach is that it explains the variation of the latent trait rather than its value itself; in other words, it answers the question “which items lead to an increase/decrease of the latent trait when changed?”. This weakness has led to the development of a more precise technique that explains “which items contribute most to the latent trait”. To achieve this, we use Layer-wise Relevance Propagation (LRP) [4]. Let j and k be indices for the neurons in layers l and \(l + 1\), respectively, and assume that the output of the function f(x) has been propagated from the upper layers down to layer \(l+1\). LRP defines the “message” \(R_{j\leftarrow k}\) as the redistribution of the relevance \(R_k\) to the neurons in the previous layer:

$$\begin{aligned} R_{j\leftarrow k} = \frac{y_j\cdot w_{jk}}{\sum _{j}y_j \cdot w_{jk}}R_{k}. \end{aligned}$$

The overall relevance of j is then obtained by calculating \(R_j = \sum _k R_{j\leftarrow k}\). Note that, by definition, \(\sum _i R_i = \ldots = \sum _j R_j = \sum _k R_k = \ldots = f(x)\), that is, it satisfies the conservation (or completeness) property, since for all \(x \in \mathcal {X}\), we have \(\textbf{1}^T \mathcal {E}_f = f(x)\). The process is outlined in Fig. 3. First, the DBMs estimate the latent trait, f(x), based on the input. Then, LRP is applied to explain the prediction, that is, to indicate which items are more or less relevant.
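The propagation rule and its conservation property can be sketched on a small two-layer ReLU network (weights, sizes and the numerical stabilizer are our illustrative choices, not the fitted DBM):

```python
import numpy as np

def lrp_backward(y_layer, w, R_upper, eps=1e-9):
    """One LRP step: R_{j<-k} = (y_j w_jk / sum_j y_j w_jk) R_k,
    then R_j = sum_k R_{j<-k}. eps avoids division by zero."""
    contrib = y_layer[:, None] * w        # y_j * w_jk
    denom = contrib.sum(axis=0) + eps     # sum_j y_j w_jk (stabilized)
    return (contrib / denom) @ R_upper    # R_j for the lower layer

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 10).astype(float)  # item responses (input layer)
W1 = rng.uniform(0.4, 1.8, (10, 4))       # input -> hidden weights
w2 = rng.uniform(0.4, 1.8, (4, 1))        # hidden -> output weights

h = np.maximum(0.0, x @ W1)               # ReLU hidden activations
f = float((h @ w2)[0])                    # scalar output (latent trait)

R_hidden = lrp_backward(h, w2, np.array([f]))
R_input = lrp_backward(x, W1, R_hidden)   # per-item relevance

# Conservation (completeness): relevance sums to f(x) at every layer.
assert np.isclose(R_hidden.sum(), f, atol=1e-4)
assert np.isclose(R_input.sum(), f, atol=1e-4)
```

`R_input` then indicates which items contribute most to the estimated latent trait, which is the quantity used by the recommender below.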

3.2 Recommender System

When a student finishes a test, there are two fundamental and commonly asked questions: “what did I do wrong?” and “how can I improve?” The former is about identifying specific breakdowns, whereas the latter focuses on creating knowledge. These two questions motivate our methodology. At each time-step, the algorithm reads the input (item responses) and returns an output (student proficiency). Once the current state is outputted, it is aggregated by the recommender system to define the next items to be administered, which are sorted towards items with the same difficulty parameter (vector \(\textbf{d}\)) and lesser relevance (as defined by the LRP) among those answered by the student; this can be viewed as a content-based recommender system focusing on items where the student needs more attention/learning [3]. Next, the sequence in which the items are presented is considered. The system intentionally draws on the failure experiences of like-minded students, grading the stimuli to be presented accordingly. In this setting, the approach comprises collaborative filtering-based methods [35], performing failure-based learning collaboratively. Various studies posit fault-stages as a strategic way to identify possible breakdowns and engender higher-order knowledge [17, 21, 37]. This, in turn, has pedagogical benefits because it allows students to challenge their existing (incomplete or inaccurate) mental models and to activate cognitive mechanisms for deep learning.

Fig. 3.
figure 3

Explaining predictions of an AI system using LRP.

Formally, we have a set of latent traits \(\mathcal {Z} = \{z_j\}\), \(j=1,\ldots ,n\), and a set of response vectors \(\mathcal {X} = \{\textbf{x}_i\}\), \(i = 1,\ldots ,I\). The aim of the system is to create a list of the best l recommendations, called stimuli, \(\mathcal {S} = \{s_k\}\), \(k = 1,\ldots ,l\), for an examinee with ability \(z_a \in \mathcal {Z}\). The algorithm consists of two steps. The first step selects a set of items based on the difficulty parameter (note that the latent trait is on the same scale as this parameter). These items are chosen to be similar to those with lesser relevance (redistributed “message” given \(z_a\)) that have been answered by the student, such that:

$$\begin{aligned} \mathcal {S}_a = \{i \ :\ d_i - \delta \le z_a \le d_i + \delta \}, \end{aligned}$$
(5)

where \(\delta \) is determined by the user and defines the homogeneity of the items presented to the student. In the second step, failure is an intentional component. The items are sorted by the average of correct responses among the k nearest neighbors of \(z_a\). Consequently, we are able to control the examinee’s exposure to failure. This strategy gives the student better opportunities to deepen conceptual understanding through items suited to their proficiency level, and to increase problem-solving capacity through failure-based learning.
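One possible reading of the two steps, as a sketch (the function name, default values, and the use of relevance as a tie-breaker are our assumptions; `relevance` stands for the per-item LRP relevances and `X` for the peers’ 0/1 response matrix):

```python
import numpy as np

def recommend(z_a, d, relevance, z_all, X, delta=0.3, k=20, target=0.2):
    """Two-step recommendation sketch.
    Step 1 (Eq. 5): keep items whose difficulty d_i lies within delta of z_a.
    Step 2: rank candidates so the average correctness among the k nearest
    neighbors (in latent trait) is closest to the desired failure level,
    breaking ties in favor of items with lower LRP relevance."""
    S_a = [i for i in range(len(d)) if d[i] - delta <= z_a <= d[i] + delta]
    nn = np.argsort(np.abs(z_all - z_a))[:k]       # like-minded students
    peer_rate = X[nn].mean(axis=0)                 # per-item success rate among peers
    return sorted(S_a, key=lambda i: (abs(peer_rate[i] - target), relevance[i]))

# Toy usage: 8 items, 30 students, one examinee with z_a = 0.1.
rng = np.random.default_rng(2)
d = rng.uniform(-1, 1, 8)
rel = rng.uniform(0, 1, 8)
z_all = rng.normal(0, 1, 30)
X = rng.integers(0, 2, (30, 8))
items = recommend(0.1, d, rel, z_all, X, delta=0.5, k=10)

# Every recommended item satisfies the difficulty window of Eq. 5.
assert all(abs(d[i] - 0.1) <= 0.5 for i in items)
```

Setting `target` to 0.2 or 0.8 corresponds to the expected-average levels discussed in the real-data analysis: the same machinery covers both highly challenging and easy stimuli.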

4 Simulation Study

To illustrate our arguments and set up the subsequent discussion, two simulated examples are presented in this section. The first example investigates how well the true values of the latent trait and item parameters are recovered in different situations, varying the sample size and the number of items. The second example shows how the test items are sensitive to examinee differences along a number of correlated dimensions (latent traits). The data are designed to mirror a real-life scenario where items have different characteristics and latent traits exhibit multiple dimensions with varying correlations. These initial experiments aim to show that the DBM model is indeed a precise estimator of latent characteristics, generalizing the 2PL model to more latent dimensions.

4.1 Example 1

The datasets were generated containing the responses of \(n = 500, 2000\) and 5000 examinees to \(I = 50, 100\) and 250 items. The proficiency parameters are normally distributed, by model assumption. The values of the discrimination and difficulty parameters were generated from uniform distributions such that \(W \in (0.4,1.8)\) and \(a \in (-2.4,2.4)\). As mentioned earlier, the proposed model is structurally similar to the two-parameter logistic (2PL) Item Response Model [7] if we consider one hidden layer and \(m = 1\); thus, the responses were simulated using the 2PL parameter values. Each setting involved 10 replications. Using these data, the parameters were estimated with the DBM model. For comparison, the average root-mean-squared error and the correlation between the true and estimated parameters for the unidimensional models were obtained, and they are summarized in Table 1. The parameter estimates are close to the true values and appear to have converged to reasonable values, with high correlation with the ground truth and low RMSE in all scenarios.

Table 1. Average root-mean-squared error and correlation between the true (2PL’s) and estimated parameters (DBM’s) based on 10 replications.
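The data-generating setup of Example 1 can be sketched as follows (one of the \((n, I)\) configurations; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, I = 2000, 100                 # one (n, I) setting from the study
y = rng.normal(0.0, 1.0, n)      # proficiencies: standard normal, by assumption
W = rng.uniform(0.4, 1.8, I)     # discrimination parameters
a = rng.uniform(-2.4, 2.4, I)    # a-parameters (difficulty d_i = a_i / W_i)

# Bernoulli responses under the 2PL model (Eq. 1): n x I binary matrix.
P = 1.0 / (1.0 + np.exp(-(np.outer(y, W) - a)))
X = (rng.random((n, I)) < P).astype(int)

assert X.shape == (n, I)
assert set(np.unique(X)) <= {0, 1}
```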

4.2 Example 2

For the second example, the response datasets were produced with three levels of correlation between the latent trait dimensions (\(\rho = 0.2, 0.6\) and 0.8). The data are considered more unidimensional as \(\rho \) approaches 1, since all latent traits then represent the same piece of knowledge. The proficiency parameters were drawn from a multivariate normal distribution. Two datasets with known characteristics are used here. Dataset 1 was designed to represent more realistic conditions where the difficulty parameters are confounded with dimensionality, such that easier items tend to have high values of \(W_1\) and more difficult items tend to have high values of \(W_2\). Thus, the discrimination parameters, \(\textbf{W}_i = (W_{i1},W_{i2})\), ranged from \(\textbf{W}_1 = (1.8, 0.4)\) to \(\textbf{W}_I =(0.4, 1.8)\), and a ranged from \(-2.4\) (for \(W_{i1} = 0.4, W_{i2} = 1.8\)) to 2.4 (for \(W_{i1} = 1.8, W_{i2} = 0.4\)). This pattern of systematic shift in the item parameters is the same as that found in [28] and [1]. Dataset 2 was generated according to [2], considering three latent trait dimensions to obtain the simulated responses. The results are presented in Fig. 4. In contrast to the previous example, where the dimension was fixed at 1, cases involving multiple dimensions have additional discrimination parameters to consider; consequently, only a correlation metric was employed. Overall, the model recovered the parameters precisely, particularly when the model considered was two-dimensional for dataset 1 and three-dimensional for dataset 2 (matching the number of latent traits used to generate the responses). It is worth noting that when a one-dimensional model was employed with multidimensional data, we observed poor estimates of the discrimination parameters. Furthermore, the results were slightly worse when the correlation between competencies was 0.8, quite a high value.

Fig. 4.
figure 4

Results of correlation and RMSE for the Data Sets 1 and 2.
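A sketch of the Dataset 1 generating process under our reading of the systematic shift (the linear interpolation of the parameters across items is our assumption; sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, I, rho = 2000, 100, 0.6                  # one correlation level from the study
cov = np.array([[1.0, rho], [rho, 1.0]])
Y = rng.multivariate_normal(np.zeros(2), cov, n)  # correlated latent traits

# Systematic shift: W_i moves from (1.8, 0.4) to (0.4, 1.8) across items,
# while a moves from 2.4 (at W = (1.8, 0.4)) to -2.4 (at W = (0.4, 1.8)).
t = np.linspace(0.0, 1.0, I)
W = np.column_stack([1.8 - 1.4 * t, 0.4 + 1.4 * t])   # I x 2
a = 2.4 - 4.8 * t

# Compensatory multidimensional 2PL: response depends on W_i . y_j - a_i.
P = 1.0 / (1.0 + np.exp(-(Y @ W.T - a)))
X = (rng.random((n, I)) < P).astype(int)

# Sanity check: the sampled traits exhibit roughly the target correlation.
assert abs(np.corrcoef(Y.T)[0, 1] - rho) < 0.1
```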

5 Real Data Analysis

In the following, the operation of the recommender system is illustrated with a real example, the Brazilian National High School Examination (ENEM) [13]. This is a non-mandatory, standardized exam that evaluates high school students in Brazil. Every year, more than 5 million students take the test, which is composed of 180 multiple-choice items, coded as zero (incorrect) or one (correct), in four main content areas referred to as Natural Sciences (NC), Human Sciences (HS), Languages and Codes (LC) and Mathematics (MT). Since 2009, its importance has increased, as ENEM scores became accepted for admission by more than 23 universities and other institutions, as well as for high school degree certification. This study selected a sample of 100,000,000 responses from the 2018 ENEM dataset (among students who answered all 180 items of the test). The number of test dimensions (latent traits) was inferred from the presence of a single factor in all items and evidence of distinct clusters of local dependencies formed by other factors, based on the Bayesian Information Criterion. The results indicate that the dataset’s dimensionality is equal to 4; thus, we considered a model with four latent traits. Furthermore, the generative p-value was 0.69 for the latent traits and 0.66 for the overall items. These results suggest that the selected model (in terms of the number of dimensions) is appropriate for the data set. Figure 5a illustrates the distribution of the latent trait according to each of the four dimensions of knowledge, and their correlations are illustrated in Fig. 5b. Note that the distributions of the latent traits are complementary (a compensatory structure), reinforcing their multidimensional nature. In other words, a high ability on one dimension can compensate for a low ability on another: for instance, a high level of proficiency in ‘reading comprehension’ can offset a lower proficiency in ‘basic mathematics’.
In addition, it is possible to assemble all dimensions into one unidimensional representation, as described in Fig. 5c. Thereby, we assess the overall latent trait to compare students. Moreover, this information is used to recommend suitable new items. Many improvements emerged after applying what we here call step 2, where items are recommended according to Eq. 5. It personalizes the items based on the examinee’s neighborhood and digs further into the quality of the recommendations to determine the degree of fault-stages, as shown in Fig. 6 (left), instead of simply choosing according to \(P \left( X_{ij} = 1 | \textbf{y}; \mathbf {\Theta }\right) \), as depicted in Fig. 6 (right). On the left, each point expresses the proportion of items answered correctly, chosen according to their expected probability of failure (based on the responses of the k nearest neighbors). In contrast, on the right, the experiences of like-minded students are ignored. One would expect the box-plots to lie within the horizontal dashed lines; that is, if one sets as a goal an average of 0.2 (20%) correct responses, the students’ responses should be around this value. This clearly happens more effectively when step 2 is used (left of Fig. 6), since the boxplots present lower variation and lie around the horizontal lines. In practice, this means being able to recommend items that challenge the student at any level, from items with a low chance of solution (e.g., expected average = 0.2) to a high chance (e.g., expected average = 0.8). This information is more refined than simply considering \(P \left( X_{ij} = 1 | \textbf{y}; \mathbf {\Theta }\right) \), which does not yield good results, especially for items that do not challenge the student (expected average of correct responses from 0.4 onwards).

Fig. 5.
figure 5

The distribution of the latent trait according to each dimension (panel a), their correlations (panel b), and the assembled unidimensional representation (panel c).

Fig. 6.
figure 6

Results of recommender system step 2. The left panel refers to the distribution of the observed average of correct responses from items chosen according to responses of the nearest k neighbor. The right panel shows the distribution of the observed average of correct responses from items chosen according to \(P \left( X | \textbf{y}; \mathbf {\Theta }\right) \).

6 Conclusion and Discussion

In this paper, we have reformulated the two-parameter logistic model as a neural network, with the Deep Boltzmann Machine (DBM) defined as its extension. This makes the model explainable based on the underlying interpretation of the Item Response Theory parameters. Interpretability is further extended by the use of Layer-wise Relevance Propagation, which propagates the prediction backwards through the neural network to the input features. Based on that model, we have developed an algorithm for individualized education. The approach is continuously adapted based on substantively complex constructs and recommends new items according to their characteristics and the students’ latent traits. The explainable algorithm makes it possible to grasp what led the algorithm to a particular decision, playing an important role in opening up the “black box of learning” of both the algorithm and the student. In the latter case, this interpretable feedback is a valuable tool for verifying and improving the items to be recommended, as well as for generating higher-order knowledge.

We illustrated our procedure by analyzing two simulated data sets, where the DBM model recovered the IRT parameters precisely in different situations, involving varying sample sizes, numbers of items and dimensions. Furthermore, we applied the method to a real-world example, the Brazilian National High School Examination. Overall, our approach produced a better determination of the probability of answering an item correctly compared with simply using \(P \left( X | \textbf{y}; \mathbf {\Theta }\right) \). This performance arises from the highly detailed subpopulation selected based on the responses of the other examinees. However, given a student and a set of items, it is not obvious to what degree fault-stages serve as catalysts for deep and constructive learning; thus, more research is needed to ascertain this. In addition, there are various possible applications beyond the one presented here, particularly in more complex domains such as massive open online courses and e-commerce.