A genetic fuzzy expert system for automatic question classification in a competitive learning environment

Expert Systems with Applications 39 (2012) 7471–7478, doi:10.1016/j.eswa.2012.01.115

Elena Verdú, María J. Verdú, Luisa M. Regueras, Juan P. de Castro, Ricardo García
School of Telecommunications Engineering, University of Valladolid, Paseo Belén, 15, 47011 Valladolid, Spain

Keywords: Intelligent tutoring systems; Educational technology; Automatic question classification; Competitive learning; Genetic algorithms; Fuzzy systems

Abstract

Intelligent tutoring systems are efficient tools to automatically adapt the learning process to the student's progress and needs. One of the possible adaptations is to apply an adaptive question sequencing system, which matches the difficulty of the questions to the student's knowledge level. In this context, it is important to correctly classify the questions to be presented to students according to their difficulty level. Many systems have been developed for estimating the difficulty of questions. However, the variety of application environments makes it difficult to apply the existing solutions directly to other applications. Therefore, a specific solution has been designed in order to determine the difficulty level of open questions in an automatic and objective way. This solution can be applied to activities with special temporal and running features, such as the contests developed through QUESTOURnament, which is a tool integrated into the e-learning platform Moodle. The proposed solution is a fuzzy expert system that uses a genetic algorithm in order to characterize each difficulty level. From the output of the algorithm, it defines the fuzzy rules that are used to classify the questions. Data registered from a competitive activity in a Telecommunications Engineering course have been used in order to validate the system against a group of experts. Results show that the system performs successfully. Therefore, it can be concluded that the system is able to carry out the question classification task in a competitive learning environment.

1. Introduction

Over the last few years, the learning process has been changing substantially in order to be centred on the students and adapted to their needs and features. Different studies have shown the effectiveness of the new adaptive learning systems (Verdú, Regueras, Verdú, de Castro, & Pérez, 2008). Many of these systems attempt to be more adaptive by offering students questions with difficulty levels according to their skills and capabilities. The aim is to increase the efficiency and the level of interaction and motivation of students (Lilley, Barker, & Britton, 2004).
Too difficult or too easy questions can frustrate and decrease students' motivation, while adaptive question sequencing provides a more efficient and effective learning (Wauters, Desmet, & Van den Noortgate, 2010). Moreover, according to Lee and Heyworth (2000), students should be able to score higher if the items or problems are arranged according to their difficulty level, since after solving easier problems, they feel more motivated to solve the harder ones.

On the other hand, competitive learning systems, such as the QUESTOURnament system, are an effective technique to capture students' interest, motivation and engagement by arousing their competitive instincts (Anderson, 2006; Philpot, Hall, Hubing, & Flori, 2005). Moreover, competitive learning reduces procrastination, a common cause for students failing to complete assignments (Lawrence, 2004), and improves the learning process (Regueras et al., 2009).

QUESTOURnament is a telematic tool integrated into the e-learning platform Moodle that allows teachers to organize dynamic contests in any knowledge domain (Regueras et al., 2009). Students compete to get the highest marks and to be at the top of the ranking. They must solve exercises (known as challenges in QUESTOURnament) within a time limit and as soon as possible, since the scoring function varies with time.

The competitive nature of QUESTOURnament motivates students, but it can also provoke stress and discouragement in the worst classified students. Assigning adequate opponents and questions to a student may be an effective strategy to reduce these negative effects (Wu et al., 2007). Therefore, the system should group students by knowledge level so that students with similar skills compete together and answer questions with a difficulty level suitable for them.

In this context, it is very important to correctly classify questions by difficulty level. However, it is difficult for teachers to accurately estimate the difficulty level according to the students' level of competence (Watering & Rijt, 2006). Experience helps teachers to better estimate the difficulty level of the questions, but even senior teachers sometimes fail and have to rectify when they analyze the answers given by their students. An automatic estimation system could be the basis for an effective adaptation process.

Many systems that automatically estimate the difficulty level of items can be found in the literature (Burghof, 2001; Cheng, Shen, & Basu, 2008; Jong, Chan, Wu, & Lin, 2006; Lee, 1996; Wauters et al., 2010). However, the variety in the nature of the application environments makes it difficult to apply the existing solutions directly to other applications. Therefore, a specific solution has been designed in order to turn the competitive e-learning system QUESTOURnament into an intelligent system.
The objective is to make learning more effective and to mitigate some of the practical drawbacks of competitive learning.

This paper discusses the validity of an expert system that automatically estimates the difficulty level of the questions posed in the QUESTOURnament competitive learning system. Section 2 introduces the major issues about teachers' perception of difficulty and summarizes the search for the solution. The expert system is described in Section 3. Section 4 starts with a description of the experiment developed in order to validate the system. Next, a study that analyzes the accuracy of the estimations of difficulty obtained by the intelligent system is presented. Finally, the main conclusions are stated.

2. Background

2.1. Teachers' perception of difficulty

The correct estimation of the difficulty level of learning material (questions, items, etc.) is very important in the design and definition of assessment processes, adaptive learning systems or standard setting methods. However, there are few studies about the perception and estimation of difficulty level by teachers.

Estimating the difficulty level of questions is not an easy task. Several studies (Alexandrou-Leonidou & Philippou, 2005; Hadjidemetriou & Williams, 2002; Lee & Heyworth, 2000; Watering & Rijt, 2006) question the ability of teachers to make accurate difficulty level estimations of learning material, since teachers usually fail to identify the correct difficulty level according to the students' ability. In general terms, students' performance tends to be overestimated by teachers (Goodwin, 1999; Impara & Plake, 1998; Verhoeven, Verwijnen, Muijtjens, Scherpbier, & Van der Vleuten, 2002). Moreover, according to Watering and Rijt (2006), if the accuracy of teachers' perception of difficulty is analysed by categories, teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of hard items. Impara and Plake (1998) also suggest that estimating item difficulty accurately is quite difficult; however, they do not think that teachers systematically underestimate the difficulty of hard items and overestimate the difficulty of easy items. In this respect, other contradictory results are found too. For example, Mattar (2000) states that teachers are less successful at rating very difficult or very easy items, while Zhou (2009) indicates that teachers classify the hardest items better.

In short, although there are no conclusive studies about the tendency of teachers when they classify questions by difficulty level, all researchers agree on the difficulty of doing this classification. Therefore, an automatic system that adjusts the difficulty level of questions according to the students' behaviour would be a very useful support tool and a key component for a truly adaptive learning environment.

2.2. In search of an intelligent solution for a competitive tool

There are many domain-dependent intelligent tutoring systems (ITSs) that provide students with an adequate learning path through the different topics of a subject, according to the previously learnt topics. These systems are based on techniques such as Bayesian Networks (Hibou & Labat, 2004; Nouh, Karthikeyani, & Nadarajan, 2006; Vomlel, 2004) and require the previous definition of knowledge domains by using, for example, domain-specific ontologies (Colace & De Santo, 2006).
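For illustration only, the following minimal sketch (not taken from any of the cited systems; all probability values are invented) shows the kind of dependency such Bayesian student models encode: the probability of answering an item correctly conditioned on whether a prerequisite topic is mastered, and the Bayesian update of the mastery estimate after observing a correct answer.

```python
# Hypothetical numbers for a two-node network: prerequisite mastery -> correct answer.
p_prereq_known = 0.6                          # P(prerequisite mastered)
p_correct_given = {True: 0.85, False: 0.25}   # P(correct answer | mastery state)

# Marginal probability of a correct answer, by enumeration over the parent node.
p_correct = (p_correct_given[True] * p_prereq_known
             + p_correct_given[False] * (1 - p_prereq_known))

# Posterior mastery estimate after observing a correct answer (Bayes' rule): the
# kind of update a Bayesian student model performs as evidence arrives.
p_prereq_given_correct = p_correct_given[True] * p_prereq_known / p_correct
print(round(p_correct, 3), round(p_prereq_given_correct, 3))  # 0.61 0.836
```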
Modelling these networks of knowledge components and their dependencies, and generalizing them for every student, is not an easy task (Noguez, Sucar, & Ramos, 2005), especially for domain-independent systems like QUESTOURnament, which can be used for diverse subjects and levels of education.

Many domain-independent ITSs focus on presenting questions and problems adapted to the students' knowledge level. They often apply the Item Response Theory (IRT) to estimate both the characteristics of the questions, such as difficulty or guessing probability, and the knowledge level of students (Chen, Lee, & Chen, 2005; Lilley et al., 2004), independently of the knowledge domain. However, the correct application of traditional test theories implies some assumptions, which are not met by many examination contexts, especially when telematic tools are used for distance learning. Moreover, some of the characteristics of more specific tools, such as the competitive nature of QUESTOURnament, make the application of these theories difficult for the environment under study.

The typically used IRT models are one-dimensional, that is, they assume that the response to a question depends on a single trait, usually the knowledge level. Besides, it is also assumed that the response a student gives to a specific question does not depend on the responses given to other questions (Embretson & Reise, 2000). Therefore, using IRT entails carefully designing the tests so that both of these conditions are fulfilled. Moreover, conventional IRT models only response accuracy and ignores response time, since it was conceived for pure power tests (Roskam, 1997), which assume that students have unlimited time to solve a question. Even if limited time could be assumed, the requirement should at least be that time is not a factor that affects the students' response. However, in a competitive environment such as QUESTOURnament, time is very important, since only the first student who answers a challenge correctly will be able to obtain the highest score for that challenge. Therefore, there are different factors that could distort the results obtained by the IRT methods when applied to the QUESTOURnament system.

Students can apply different strategies during the competition, and even different personality factors can determine the students' final response to an item. Several challenges can be posed at the same time and students have to select one of them to be solved first. Many students tend to read all the different questions and select the one that seems the easiest to be solved first. Difficult challenges are usually read several times and solved after the easiest ones have been answered. On the other hand, two students with exactly the same knowledge level could respond to the same question differently, as one can be more persistent and devote more time to solving the question, while another one can be more anxious about the competition and respond quickly in order to be the first one. Consequently, time and number of readings are important factors that should be taken into account in the model, but their modelling depends on the actual students' behaviour.

Moreover, when teachers pose challenges in QUESTOURnament, they do not have any restriction related to time, type of questions or skills needed to solve them. They are free to use any configuration of the system in any context. Then, there are some important factors that can vary:

- Maximum time available to submit an answer to a challenge.
- Type of questions (open questions, multiple choice questions, true/false questions, short response questions, problems, etc.).
- Context surrounding students when they solve the questions: a contest may be developed in the classroom or at a distance, during one or several days.
- Personality of the students (e.g. the stress of a student faced with a competitive situation can influence the response).

There are different models adapted from classic IRT that cover partial aspects of the sought solution, but there is no model that covers all the aspects required by the specific features of the QUESTOURnament system. Roskam (1997) presents a model based on IRT for speed tests with a time limit, where correctness and response time are integrated. Van der Linden (2007) proposes a flexible hierarchical solution that basically comprises an IRT model, a response-time distribution model and a higher-level structure that takes into account the dependencies between the item and student parameters in those models. For each of these components, the most suitable model can be used.

In any case, a model based on IRT which took into account all possible factors that influence the response a student gives to a challenge within QUESTOURnament, across so many different contexts of application, would be vastly complex. There are other solutions to determine the difficulty level of learning material. However, most of these proposals are too simplistic – like the solution used in Jong et al. (2006), where the difficulty is estimated as the ratio between the number of times that a question is incorrectly answered and the total number of answers – or are too focused on the target subject – such as the solution described in Kunichika, Urushima, Hirashima, and Takeuchi (2002), which estimates the difficulty level of questions about English language sentences.

After analysing classical and specific solutions, it was decided to design an ad hoc solution for the system, whose fundamentals could be applied to other systems used in open contexts. This solution is based on the definition of a genetic fuzzy expert system, which classifies the questions into several difficulty levels. There are examples of successful application of this kind of system to e-learning environments, such as the one described by Romero, Gonzalez, Ventura, del Jesus, and Herrera (2009). They use an evolutionary algorithm to learn fuzzy rules which describe relationships between the students' interactions with the e-learning system Moodle and the final marks obtained in the course. Typically, genetic learning of rules assumes a predefined set of fuzzy membership functions generated by human domain experts (Cordón, 2004). However, as aforementioned, the different nature of the challenges that can be posed through QUESTOURnament, as well as the varied students' profiles, makes it very difficult to define and generalize fuzzy sets and rules. Teachers can use QUESTOURnament for multiple-choice questions or for laborious exercises or problems. Since, for example, ten minutes can be a very short time for a complex problem but a long time for a true/false question, it is very difficult to predefine the fuzzy membership function for the time parameter (see the short sketch after this paragraph). Moreover, the contests with QUESTOURnament can be developed in very different contexts, for example, during face-to-face classes or at a distance, even lasting several weeks.
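As an aside, a minimal sketch (an author-invented illustration, not the paper's code) of a trapezoidal membership function makes this point concrete: under two hypothetical parameterisations of a "Low time" set, one tuned for true/false items and one for laborious problems, the same ten-minute response receives completely different membership degrees, so no single predefined function would fit both contexts.

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership function with support [a, d] and core [b, c]."""
    rise = 1.0 if b == a else (x - a) / (b - a)
    fall = 1.0 if d == c else (d - x) / (d - c)
    return max(0.0, min(rise, 1.0, fall))

# Hypothetical "Low time" fuzzy sets (in minutes) for two very different challenge types.
low_time_true_false = lambda t: trapmf(t, 0, 0, 2, 5)    # true/false question
low_time_problem = lambda t: trapmf(t, 0, 0, 20, 40)     # laborious problem

print(low_time_true_false(10))  # 0.0 -> ten minutes is not "low" for a T/F item
print(low_time_problem(10))     # 1.0 -> ten minutes is clearly "low" for a problem
```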
All these elements (nature of the questions, application contexts of the system, profiles and behaviours of the students, etc.) make it necessary to define fuzzy sets and fuzzy rules each time a group of questions is classified. Doing it by hand would be very laborious and impractical, so an automatic system is needed. Besides, according to Nebot, Mugica, Castro, and Acosta (2010), learning the parameters of the fuzzification process by means of genetic algorithms instead of using the expert's criteria provides better results.

Then, the proposed system starts from scratch. Taking some data about the interaction of the students with QUESTOURnament and the initial difficulty level estimated by the teacher, it learns both the adequate membership functions with their linguistic values and the fuzzy rules. In the next section, the complete system is detailed. Throughout this description of the genetic fuzzy expert system, some real case examples are included to facilitate comprehension. The details of the real case and the corresponding experiment results are set out in Section 4.

3. The expert system

A genetic fuzzy expert system has been designed that generates fuzzy sets and rules appropriate for each specific case. The knowledge base is provided by a Fuzzy Model Generator that includes a genetic system capable of identifying the characteristics of the questions for each difficulty level. The estimation of the difficulty level then takes place in two phases. During a first phase, the Fuzzy Model Generator learns from the Facts Base (formed by the students' response patterns) and dynamically creates the classification rules and the fuzzy sets of the input variables for the specific data. During a second phase, the fuzzy expert system infers the difficulty level of each question.

The components of the whole system are shown in Fig. 1. From the Moodle and QUESTOURnament logs, three parameters are considered in the response patterns: the time in minutes from the last reading of the question until the submission of the answer, the grade obtained for that answer, and the number of accesses or readings before submitting the answer. All these factors depend on the students' behaviour when answering a question and are related to the difficulty level of each challenge (as aforementioned). All these data make up a set of context-dependent and noisy usage patterns that are stored in the Facts Base and feed the intelligent system.

For each difficulty level, the genetic system uses the response patterns of all the questions belonging to that level (according to the initial classification made by the teacher) and obtains a characterization of their responses as crisp sets. From these crisp sets, the Fuzzy Model Generator creates the fuzzy sets and rules of the Knowledge Base. Once the fuzzy sets and the rules of a group of questions have been generated, the Inference Engine can infer the difficulty level of the patterns in the Facts Base. Finally, the difficulty level of each question is calculated as the median of the difficulty levels of its response patterns, and the challenges repository is updated with the new difficulty level.

Thus, the system combines the students' behaviour and the teachers' perception in order to objectively estimate the real difficulty level of each challenge (a schematic sketch of this two-phase flow is given below).

3.1. The genetic system

The objective of the genetic algorithm of the proposed system is to generate groups of crisp sets that characterize the students' responses for three difficulty levels: easy, moderate and hard.
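Before describing the genetic system in detail, the two-phase flow just outlined can be summarized with the following minimal sketch; the class and method names (ResponsePattern, generator.learn, inference.infer) are hypothetical stand-ins for the Fuzzy Model Generator and the Inference Engine, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class ResponsePattern:
    question_id: int
    time_minutes: float  # time from the last reading of the question to the answer
    grade: float         # grade obtained for the answer
    accesses: int        # number of readings before submitting the answer

def estimate_difficulties(facts_base, initial_levels, generator, inference):
    # Phase 1: learn membership functions and fuzzy rules (the Knowledge Base)
    # from the usage patterns and the teacher's initial classification.
    knowledge_base = generator.learn(facts_base, initial_levels)
    # Phase 2: infer a crisp difficulty value for every pattern and take the
    # median per question as its new difficulty level.
    per_question = {}
    for pattern in facts_base:
        value = inference.infer(knowledge_base, pattern)
        per_question.setdefault(pattern.question_id, []).append(value)
    return {qid: median(values) for qid, values in per_question.items()}
```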
The system groups challenges by difficulty level according to the initial classification made by the teacher. The genetic algorithm then uses the responses for all the questions belonging to a specific difficulty level in order to obtain its characterization. As mentioned above, the input of the genetic algorithm is a set of response patterns with the structure <time, grade, number of accesses>. The crisp sets are then ranges of time, grade and number of accesses that together include the highest number of response patterns for a specific difficulty level. Therefore, a possible solution to the problem is represented with the following coded chromosome or individual: [t1, t2, g1, g2, a1, a2], where t1 and t2 are the lower and upper limits of a time range, g1 and g2 the lower and upper limits of a grade range, and a1 and a2 the lower and upper values of a range of number of accesses.

The genetic algorithm implements the BLX-α crossover operator, the uniform mutation operator, roulette wheel selection, and a fitness function based on the support measure typically used to evaluate inferred rules. In addition, due to the fact that the response patterns for a question do not depend only on the question itself but also on the behaviour of the students answering it (e.g. knowledge level, persistence, etc.), the algorithm also incorporates niching methods, such as sharing, in order to promote diversity and to be able to characterize each difficulty level by several groups of ranges. More details about the fitness function, selection method, crossover and mutation operators and diversity methods are available in Verdú, Regueras, Verdú, and de Castro (2010a, 2010b).

Fig. 1. Architecture of the genetic fuzzy expert system.
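To make the encoding and two of the operators named above more concrete, the following hedged sketch (not the authors' code; the α value, the helper names and the omission of selection, mutation and sharing are simplifications) evaluates a chromosome [t1, t2, g1, g2, a1, a2] with a support-based fitness and recombines two parents with BLX-α crossover.

```python
import random

# Chromosome: [t1, t2, g1, g2, a1, a2] = limits of the time, grade and accesses ranges.
def support_fitness(chromosome, patterns):
    """Fraction of the level's response patterns covered by the three ranges."""
    t1, t2, g1, g2, a1, a2 = chromosome
    covered = sum(1 for time, grade, accesses in patterns
                  if t1 <= time <= t2 and g1 <= grade <= g2 and a1 <= accesses <= a2)
    return covered / len(patterns) if patterns else 0.0

def blx_alpha(parent_a, parent_b, alpha=0.5):
    """BLX-alpha crossover: each gene is sampled from the parents' interval widened by alpha."""
    child = []
    for x, y in zip(parent_a, parent_b):
        lo, hi = min(x, y), max(x, y)
        spread = alpha * (hi - lo)
        child.append(random.uniform(lo - spread, hi + spread))
    return child

# Example with invented data: one candidate range group and three <time, grade, accesses> patterns.
patterns = [(10, 90, 1), (22, 85, 2), (50, 60, 1)]
print(support_fitness([0, 25, 80, 100, 1, 2], patterns))          # 2 of 3 patterns covered
print(blx_alpha([0, 25, 80, 100, 1, 2], [5, 30, 70, 95, 1, 3]))   # one recombined child
```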
3.2. Generation of the Fuzzy Model

For each difficulty level, the genetic algorithm obtains groups of ranges of the input variables that characterize the response patterns of that difficulty level. Table 1 shows the ranges obtained by the genetic algorithm in the experiment (which is described in detail in Section 4.1). Later, from these groups of ranges, the Fuzzy Model Generator obtains the membership functions and the classification rules of the Fuzzy Model.

Table 1
Groups of ranges delivered by the genetic system (where Acc. = number of accesses).

Level     | Time     | Grade     | Acc.   | Fitness
Hard      | [1, 48]  | [0, 5]    | [3, 4] | 0.143
Hard      | [7, 35]  | [0, 18]   | [1, 2] | 0.251
Moderate  | [4, 28]  | [30, 50]  | [1, 2] | 0.301
Moderate  | [0, 25]  | [80, 100] | [1, 2] | 0.541
Easy      | [13, 36] | [44, 60]  | [1, 2] | 0.107
Easy      | [27, 74] | [84, 100] | [1, 2] | 0.167

The fuzzy sets for the input variables grade, time and number of accesses corresponding to the data of Table 1 are represented in Fig. 2. In principle, a fuzzy set is defined for each range found by the genetic algorithm. For example, the fuzzy set of the input variable grade with the linguistic value "Very Low (VL)" corresponds to the grade range [0, 5], found by the genetic system for the hard level questions. However, when two ranges are very similar, such as the grade range [80, 100] found in easy questions and the grade range [84, 100] found in moderate ones, the algorithm assigns only one fuzzy set to both ranges, "Very High (VH)" in this specific example. On the other hand, when the system finds a range that is not similar to another range but includes or overlaps it, the system creates several fuzzy sets. For example, the grade range [0, 18] includes the grade range [0, 5]. In this case the system generates two different linguistic values, "Very Low (VL)" and "Low (L)".

The number of linguistic values of the input variables (Very Low, Low, Medium, High, etc.) is not fixed, as it depends on the number of fuzzy sets determined by the Fuzzy Model Generator each time a group of questions is classified. In the given example, five fuzzy sets have been defined for the input variables grade and time, so they take the linguistic values "Very Low (VL)", "Low (L)", "Medium (M)", "High (H)" and "Very High (VH)". However, only two fuzzy sets have been defined for the input variable number of accesses, so two linguistic values are used: "Low (L)" and "High (H)". The membership functions of the output variable Difficulty are not dynamically set, unlike those of the input variables, and always take the trapezoidal shapes shown in Fig. 2.

Once the fuzzy sets have been automatically created, the Fuzzy Model Generator defines the fuzzy rules from these fuzzy sets and the results of the genetic algorithm (see Table 1). For example, for the easy difficulty level the following fuzzy rules have been automatically defined:

IF GRADE IS VH AND TIME IS VL AND ACCESSES IS L THEN DIFFICULTY IS EASY
IF GRADE IS VH AND TIME IS L AND ACCESSES IS L THEN DIFFICULTY IS EASY

Two rules have been created from a single group of ranges found by the genetic system, since the time range was split during the fuzzy set creation phase. This same procedure is followed in order to define all the fuzzy rules from each group of ranges delivered by the genetic algorithm.

Fig. 2. Fuzzy sets of the input and the output variables.

Fig. 3. Sliced-cube FAM representation of the set of rules.

The fuzzy rules can be graphically represented by using a cube form called Fuzzy Associative Memory (FAM), as shown in Fig. 3, where the 16 rules that describe the given data are represented. At the end of this process, a set of fuzzy rules as well as the membership functions of the input and output variables describing the problem have been defined and stored in the Knowledge Base.

3.3. The inference engine

The inference engine uses the Mamdani method to infer the difficulty level corresponding to a response pattern from its three crisp input variables: time, grade and number of accesses. The Matlab Fuzzy Logic Toolbox has been used to simulate the operation of this component. Again, a concrete example is used to show this operation. Fig. 4 shows the fuzzy inference of the difficulty for the response pattern with time, grade and number of accesses equal to 31, 67 and 3, respectively.

Fig. 4. Fuzzy inference of the difficulty of the input pattern <31, 67, 3>.

Fuzzy inference takes place in four steps. First, the crisp input variables are fuzzified during the fuzzification phase. The rule evaluation phase then takes place. In the example, 16 rules describe the behaviour of the system, but this input pattern only fulfils the three antecedent conditions of two rules.
The classical Min method has been applied for the fuzzy AND operator. During the third step, the rule consequents are aggregated into a single fuzzy set using the Max composition method. Last, defuzzification obtains a crisp output using the MOM (Mean of Maximum) method. The pattern of the example has been assigned a difficulty value equal to 1.27, which corresponds to a difficulty level halfway between moderate and hard, but closer to moderate, as can be seen in Fig. 2.

4. Results

The hypothesis to be tested in this paper is that the designed expert system performs as a human expert, that is, as an expert teacher who is able to reclassify questions by difficulty by means of a thorough analysis of the students' behaviour while answering. In order to analyze and validate the performance of this expert system, it has been tested with real data from a contest developed with the QUESTOURnament tool in an undergraduate course of the Diploma in Telecommunications Engineering at the University of Valladolid (Spain).

4.1. The experiment

The study was carried out from February until June 2010 (during three weeks with a 2-hour laboratory session in each week), with 38 enrolled students. All these students participated in the contest and 12 challenges (exercises on IP addressing and routing) were posed by the teacher. The system asked the teacher, upon creating the challenges, to classify them according to the estimated difficulty level: easy, moderate or hard. According to the teacher's initial estimation, 4 challenges were classified as easy, 3 as moderate and 5 as hard. The total number of available answers for the easy, moderate and hard levels was 134, 103 and 169, respectively. For each answer, one response pattern is recorded with the grade obtained, the number of accesses and the time from reading to answering. All these records made up the input patterns to the system.

In the previous sections, the design and operation of the genetic fuzzy expert system have been explained with the help of some examples related to the real case presented now. Therefore, Table 1 corresponds to the output given by the genetic algorithm with the input data corresponding to the 12 challenges posed in this course. These groups of ranges found by the genetic system were used by the Fuzzy Model Generator to create the membership functions of the input variables shown in Fig. 2, as explained in Section 3.2. In the same way, the output of the genetic system and the generated fuzzy sets were used to define the 16 classification rules shown in Fig. 3, which describe the behaviour of the students when answering these challenges of different difficulty levels. For example, the system found that answers with a Very High grade and a Low number of accesses correspond to the easy difficulty level if the time is Low or Very Low, or to the moderate difficulty level if the time is Medium or higher.

On the other hand, as was expected, easy questions are characterized by Very High grades and Very Low or Low time. There are also moderate questions with answers graded Very High, but the time in this case is higher than the time required for the easy questions mentioned before, indicating that the time spent on a question increases with its difficulty, which is consistent with the study presented in Mason, Zollman, Bramble, and O'Brien (1992).

Applying the rules to the students' response patterns, the expert system obtains the new difficulty level for each challenge.
Thus, the next step is to update the challenges repository accordingly. According to the classification done by the system (see the third column in Table 2), three challenges should be reclassified (questions number 2, 8 and 10), as their initial difficulty does not match the difficulty assigned by the system.

Once the questions have been reclassified, it is necessary to validate the intelligent system.

4.2. Validation of the intelligent system

Since expert systems are intended to perform at close to human expert levels and to solve problems without a defined correct solution, they should typically be validated against human experts (O'Keefe, Balci, & Smith, 1987). Therefore, the chosen method for the system validation has been a "validation against a group of experts" based on the method described in Mosqueira-Rey, Moret-Bonillo, and Fernández-Leal (2008). This method provides a measure of agreement between the human experts and verifies whether the expert system performs as one of them and, therefore, whether it can be incorporated into the group of experts without making the agreement level worse.

In the experiment described, the group of experts consisted of the teacher of the course and two other teachers who are also experts on the subject. Table 2 shows the difficulty levels estimated by the genetic fuzzy expert system and the group of experts for each of the 12 challenges. The first column is simply a challenge identifier. The second column shows the initial classification done by the teacher. The third column shows a crisp number that represents the difficulty of a challenge obtained by the system, which ranges from 0 to 2 (see Fig. 2). The fourth column represents the corresponding linguistic value for this crisp output. The last three columns show the classification done by the human experts, who carefully assigned a difficulty level to all the questions after analyzing the actual results and behaviour of the students who answered them.

Table 2
Data for validation of the expert system.

Id | Initial classification by teacher | Expert system (crisp value) | Expert system (linguistic value) | Human expert 1 | Human expert 2 | Human expert 3
1  | Moderate | 1.08 | Moderate | Moderate | Moderate | Moderate
2  | Moderate | 1.51 | Hard bordering on moderate | Hard bordering on moderate | Hard | Hard
3  | Easy | 0.38 | Easy | Easy | Easy | Easy
4  | Easy | 0.52 | Easy bordering on moderate | Easy | Easy | Easy bordering on moderate
5  | Moderate | 1.07 | Moderate | Moderate | Moderate | Moderate
6  | Hard | 1.48 | Hard bordering on moderate | Hard | Hard | Hard bordering on moderate
7  | Easy | 0.37 | Easy | Easy | Easy | Easy
8  | Easy | 1.08 | Moderate | Moderate | Moderate | Moderate
9  | Hard | 1.68 | Hard | Hard | Hard | Hard
10 | Hard | 1.32 | Moderate bordering on hard | Moderate | Moderate | Moderate bordering on hard
11 | Hard | 1.62 | Hard | Hard | Hard | Hard
12 | Hard | 1.55 | Hard bordering on moderate | Moderate bordering on hard | Hard | Moderate bordering on hard

The level of agreement between each pair of human experts has been measured through the weighted kappa (Mosqueira-Rey et al., 2008).

Table 3
Values of weighted kappa.

               | Human expert 1 | Human expert 2 | Human expert 3 | Expert system
Human expert 1 | –     | 0.901 | 0.805 | 0.837
Human expert 2 | 0.901 | –     | 0.813 | 0.747
Human expert 3 | 0.805 | 0.813 | –     | 0.947
Expert system  | 0.837 | 0.747 | 0.947 | –
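For readers unfamiliar with the agreement measure, the following is a simplified, hedged illustration of a linearly weighted Cohen's kappa on ordinal difficulty labels; the exact weighting scheme and the treatment of the intermediate "bordering" labels used to obtain Table 3 follow Mosqueira-Rey et al. (2008) and may differ from this sketch, whose example data are invented.

```python
def weighted_kappa(rater_a, rater_b, n_levels=3):
    """Linearly weighted kappa for two lists of integer labels in [0, n_levels)."""
    n = len(rater_a)
    weight = lambda i, j: 1 - abs(i - j) / (n_levels - 1)  # full credit when labels match
    observed = sum(weight(a, b) for a, b in zip(rater_a, rater_b)) / n
    # Marginal label frequencies give the chance-agreement term.
    pa = [rater_a.count(k) / n for k in range(n_levels)]
    pb = [rater_b.count(k) / n for k in range(n_levels)]
    expected = sum(weight(i, j) * pa[i] * pb[j]
                   for i in range(n_levels) for j in range(n_levels))
    return (observed - expected) / (1 - expected)

# Invented labels for two raters (0 = easy, 1 = moderate, 2 = hard).
print(round(weighted_kappa([1, 2, 0, 0, 1, 2], [1, 2, 0, 1, 1, 2]), 3))  # 0.8
```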
The results of the kappa measure (see Table 3) show a strong agreement between pairs of experts, since values of kappa higher than 0.80 indicate an almost perfect agreement whereas values in the range 0.61–0.80 indicate a substantial agreement (Viera & Garrett, 2005). The level of agreement between the expert system and each human expert varies from a substantial agreement (kappa = 0.747) to an almost perfect agreement (kappa = 0.947).

Next, the Williams index (Mosqueira-Rey et al., 2008) has been used as a measure to verify that introducing the system into the group of experts does not decrease the agreement level of the group. Values equal to or higher than 1 indicate a good agreement, whereas values lower than 1 imply that the agreement in the group of experts with the expert system included is worse than the agreement among only the human experts. The Williams index has been calculated from the kappa values of Table 3, and a value of 1.005 has been obtained. Therefore, it can be concluded that the system has performed successfully and is able to carry out the reclassification task on behalf of the teacher satisfactorily.

5. Discussion and conclusion

An expert system that satisfactorily classifies the challenges posed in the competitive learning system QUESTOURnament according to their real difficulty level has been designed. Teachers can insert challenges into the QUESTOURnament tool and the genetic fuzzy expert system will readjust the initial difficulty level estimated by the teacher according to the real behaviour of the students when facing the challenges. The system has been tested with real data and the results have been successfully validated against human experts.

Once the system has been validated, its results can also be used to study and analyze the accuracy of the estimations done by teachers. When compared with the difficulty level obtained by the expert system, the teacher's estimation (Initial classification by teacher in Table 2) is quite accurate, taking into account that the teacher uses a three-level scale, whereas the values calculated from the data of the system include intermediate levels. The teacher's estimation does not match the difficulty assigned by the system in three cases. Specifically, the teacher overestimates the difficulty of one question and underestimates the difficulty of two other questions, without following any clear pattern. Thus, no conclusion can be drawn about the tendency of the teacher. The results of Table 2 only show that teachers estimate the difficulty of hard questions better than that of easy or moderate questions.

For future work, it is intended to study the possible inclusion of more parameters, related to the students' profile and behaviour, in the response patterns. Besides, the system assumes that the initial classification of questions done by the teacher is good enough. It could be interesting to study in depth how dependent the approach is on the initial difficulty levels assigned by teachers.

Finally, the tests show that the same difficulty level is characterized by different fuzzy sets. This is probably due to the different behaviour of students when solving a challenge (for example, some of them can be more resolute while others can be more persistent) and, of course, to their different knowledge levels. Then, the output of the system could also be used for student clustering, because typical behaviours of students can be detected from the rules shown in the FAM representation of Fig. 3.
For example, there are students who, when they do not know how to answer a challenge, read it several times and, finally, submit a quick response just in case they get it right by chance. This corresponds to the hard level questions characterized by a High number of accesses, Very Low to Medium time for solving and a Very Low grade. There are other rules for hard level questions that correspond to those more promising students who get a high grade even in hard questions. Therefore, each rule for the same difficulty level may correspond to students with a similar knowledge level and/or behaviour when answering questions; this result can be used to effectively classify students and to refine the model by taking into account their competences and profiles. Thus, it is also planned to study the possibility of using the different fuzzy sets obtained by the Fuzzy Model Generator to detect and classify groups of students according to their knowledge level and behaviour profile.

References

Alexandrou-Leonidou, V., & Philippou, G. N. (2005). Teachers' beliefs about students' development of the pre-algebraic concept of equation. In Proceedings of the 29th Conference of the International Group for the Psychology of Mathematics Education (pp. 41–48). Melbourne: University of Melbourne.
Anderson, J. R. (2006). On cooperative and competitive learning in the management classroom. Mountain Plains Journal of Business and Economics – Pedagogy, 7, 1–10.
Burghof, K. L. (2001). Assembling an item-bank for computerised linear and adaptive testing in Geography. International Education Journal, 2(4), 74–83.
Chen, C.-M., Lee, H.-M., & Chen, Y.-H. (2005). Personalized e-learning system using item response theory. Computers & Education, 44(3), 237–255.
Cheng, I., Shen, R., & Basu, A. (2008). An algorithm for automatic difficulty level estimation of multimedia mathematical test items. In Proceedings of the 8th IEEE International Conference on Advanced Learning Technologies (pp. 175–179). Los Alamitos, CA: IEEE Computer Society.
Colace, F., & De Santo, M. A. (2006). A tutoring tool based on Bayesian approach. In Proceedings of the Sixth International Conference on Advanced Learning Technologies (pp. 109–113). Washington, DC: IEEE Computer Society.
Cordón, O. (2004). Ten years of genetic fuzzy systems: Current framework and new trends. Fuzzy Sets and Systems, 141(1), 5–31.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline examinees. Applied Measurement in Education, 12, 13–28.
Hadjidemetriou, C., & Williams, J. S. (2002). Teachers' pedagogical content knowledge: Graphs, from a cognitivist to a situated perspective. In Proceedings of the 26th Conference of the International Group for the Psychology of Mathematics Education (pp. 57–64). Norwich: University of East Anglia.
Hibou, M., & Labat, J.-M. (2004). Embedded Bayesian network student models. In Proceedings of the Fifth International Conference on Information Technology Based Higher Education and Training (pp. 468–472). Istanbul: IEEE.
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69–81.
Jong, B.-S., Chan, T.-Y., Wu, Y.-L., & Lin, T.-W. (2006). Applying the adaptive learning material producing strategy to group learning. Lecture Notes in Computer Science, 3942, 39–49.
Kunichika, H., Urushima, M., Hirashima, T., & Takeuchi, A. (2002). A computational method of complexity of questions on contents of English sentences and its evaluation. In Proceedings of the International Conference on Computers in Education (pp. 97–101). Auckland, New Zealand: IEEE Computer Society.
Lawrence, R. (2004). Teaching data structures using competitive games. IEEE Transactions on Education, 47(4), 459–466.
Lee, F. L. (1996). Electronic homework: An intelligent tutoring system in mathematics. PhD thesis. The Chinese University of Hong Kong.
Lee, F. L., & Heyworth, R. M. (2000). Problem complexity: A measure of problem difficulty in algebra by using computer. Education Journal, 28(1), 85–107.
Lilley, M., Barker, T., & Britton, C. (2004). The development and evaluation of a software prototype for computer-adaptive testing. Computers & Education, 43(1), 109–123.
Mason, E., Zollman, A., Bramble, W. J., & O'Brien, J. (1992). Response time and item difficulty in a computer-based high school mathematics course. Focus on Learning Problems in Mathematics, 14(3), 41–51.
Mattar, J. D. (2000). Investigation of the validity of the Angoff standard setting procedure for multiple-choice items. PhD dissertation. University of Massachusetts.
Mosqueira-Rey, E., Moret-Bonillo, V., & Fernández-Leal, Á. (2008). An expert system to achieve fuzzy interpretations of validation data. Expert Systems with Applications, 35(4), 2089–2106.
Nebot, A., Mugica, F., Castro, F., & Acosta, J. (2010). Genetic fuzzy system for predictive and decision support modelling in e-learning. In Proceedings of the 2010 IEEE International Conference on Fuzzy Systems (pp. 1804–1811). IEEE Computer Society.
Noguez, J., Sucar, E., & Ramos, F. (2005). A probabilistic relational student model for virtual laboratories. In Proceedings of the Sixth Mexican International Conference on Computer Science (pp. 2–9). Puebla, Mexico. doi:10.1109/ENC.2005.7.
Nouh, Y., Karthikeyani, P., & Nadarajan, R. (2006). Intelligent tutoring system-Bayesian student model. In Proceedings of the 1st International Conference on Digital Information Management (pp. 257–262). Bangalore: IEEE.
O'Keefe, R., Balci, O., & Smith, E. P. (1987). Validation of expert system performance. IEEE Transactions on Expert Systems, 2(4), 81–90.
Philpot, T. A., Hall, R. H., Hubing, N., & Flori, R. E. (2005). Using games to teach statics calculation procedures: Application and assessment. Computer Applications in Engineering Education, 13(3), 222–232.
Regueras, L. M., Verdú, E., Muñoz, M. F., Pérez, M. A., de Castro, J. P., & Verdú, M. J. (2009). Effects of competitive e-learning tools on higher education students: A case study. IEEE Transactions on Education, 52(2), 279–285.
Romero, C., Gonzalez, P., Ventura, S., del Jesus, M. J., & Herrera, F. (2009). Evolutionary algorithms for subgroup discovery in e-learning: A practical application using Moodle data. Expert Systems with Applications, 36(2), 1632–1644.
Roskam, E. E. (1997). Models for speed and time-limit tests. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 188–207). New York: Springer-Verlag.
Van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308.
Verhoeven, B. H., Verwijnen, G. M., Muijtjens, A. M. M., Scherpbier, A. J. J. A., & Van der Vleuten, C. P. M. (2002). Panel expertise for an Angoff standard setting procedure in progress testing: Item writers compared to recently graduated students. Medical Education, 36, 860–867.
Verdú, E., Regueras, L. M., Verdú, M. J., de Castro, J. P., & Pérez, M. A. (2008). An analysis of the research on adaptive learning: The next generation of e-learning. WSEAS Transactions on Information Science and Applications, 5(6), 859–868.
Verdú, E., Regueras, L. M., Verdú, M. J., & de Castro, J. P. (2010a). Estimating the difficulty level of the challenges proposed in a competitive e-learning environment. Lecture Notes in Artificial Intelligence, 6096, 225–234.
Verdú, E., Verdú, M. J., Regueras, L. M., & de Castro, J. P. (2010b). A diversity-enhanced genetic algorithm to characterize the questions of a competitive e-learning system. In Proceedings of the IEEE International Conference on Advanced Learning Technologies (pp. 25–29). Los Alamitos, CA: IEEE Computer Society.
Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360–363.
Vomlel, J. (2004). Building adaptive tests using Bayesian networks. Kybernetika, 40, 333–348.
Watering, G. V. D., & Rijt, J. V. D. (2006). Teachers' and students' perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1, 133–147.
Wauters, K., Desmet, P., & Van den Noortgate, W. (2010). Adaptive item-based learning environments based on the item response theory: Possibilities and challenges. Journal of Computer Assisted Learning, 26, 549–562.
Wu, W. M. C., Cheng, H. N. H., Chiang, M.-C., Deng, Y.-C., Chou, C.-Y., Tsai, C.-C., & Chan, T.-W. (2007). Answer matching: A competitive learning game with uneven chance tactic. In Proceedings of the First IEEE International Workshop on Digital Game and Intelligent Toy Enhanced Learning (pp. 89–96). Los Alamitos, CA: IEEE Computer Society.
Zhou, W. (2009). Teachers' estimation of item difficulty: What contributes to their accuracy? In Proceedings of the 31st Annual Meeting of the North American Chapter of the International Group for the Psychology of Mathematics Education (pp. 261–264). Atlanta: Georgia State University.