Educational data mining: A survey from 1995 to 2005

C. Romero *, S. Ventura
Department of Computer Sciences, University of Cordoba, Cordoba, Spain

Abstract

There is currently increasing interest in data mining and educational systems, which has made educational data mining a new and growing research community. This paper surveys the application of data mining to traditional educational systems, particular web-based courses, well-known learning content management systems, and adaptive and intelligent web-based educational systems. Each of these systems has different data sources and objectives for knowledge discovery. After preprocessing the available data in each case, data mining techniques can be applied: statistics and visualization; clustering, classification and outlier detection; association rule mining and pattern mining; and text mining. Building on the success of the plentiful existing work, much more specialized work is needed for educational data mining to become a mature area.

© 2006 Elsevier Ltd. All rights reserved.
Keywords: Data mining; Educational systems; Web mining; Web-based educational systems

1. Introduction

During the past decades, the most important innovations in educational systems have been related to the introduction of new technologies (Ha, Bae, & Park, 2000) such as web-based education. This is a form of computer-aided instruction that is virtually independent of any specific location or hardware platform (Brusilovsky & Peylo, 2003). It has gained considerably in importance, and thousands of web courses have been deployed in the past few years. However, many current web-based courses are based on static learning materials that do not take into account the diversity of students. Adaptive and intelligent web-based educational systems have been seen as a way to provide individually richer learning environments. These systems try to offer learners personalized education by building a model of each individual's goals, preferences, and knowledge.

Data mining, or knowledge discovery in databases (KDD), is the automatic extraction of implicit and interesting patterns from large data collections (Klosgen & Zytkow, 2002). KDD can be used not only to learn the model for the learning process (Hamalainen, Suhonen, Sutinen, & Toivonen, 2004) or for student modeling (Tang & McCalla, 2002) but also to evaluate and improve e-learning systems (Zaïane & Luo, 2001) by discovering useful learning information from learning portfolios (Hwang, Chang, & Chen, 2004).

In conventional teaching environments, educators are able to obtain feedback on student learning experiences in face-to-face interactions with students, enabling a continual evaluation of their teaching programs (Sheard, Ceddia, Hurst, & Tuovinen, 2003). Decision making about classroom processes involves observing a student's behavior, analyzing historical data, and estimating the effectiveness of pedagogical strategies.
However, when students work in electronic environments, this informal monitoring is not possible, and educators must look for other ways to obtain this information. Organizations that run distance education sites collect large volumes of data, automatically generated by web servers and collected in server access logs. Web-based learning environments are able to record most of the students' learning behaviors and can hence provide a huge amount of learning profile data. Recently, there has been growing interest in the automatic analysis of learner interaction data with web-based learning environments (Muehlenbrock, 2005).

* Corresponding author. Tel.: +34 957 212172; fax: +34 957 218630. E-mail address: cromero@uco.es (C. Romero).
Expert Systems with Applications 33 (2007) 135–146. www.elsevier.com/locate/eswa
0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2006.04.005

In order to provide a more effective learning environment, data mining techniques can be applied (Ingram, 1999). Data mining is one step in the overall KDD process, which consists of preprocessing, data mining and postprocessing. Data mining has already been successfully applied in e-commerce (Srivastava, Cooley, Deshpande, & Tan, 2000), and it has begun to be used in e-learning with promising results. Although the discovery methods used in both areas (e-commerce and e-learning) are similar (Hanna, 2004), there are some important differences between them:

• Domain. The purpose of e-commerce is to guide clients in purchasing, while the purpose of e-learning is to guide students in learning (Romero, Ventura, & Bra, 2004).
• Data. In e-commerce the data used are normally simple web server access logs, but in e-learning there is more information about a student's interaction (Pahl & Donnellan, 2003). The user model also differs between the two kinds of system.
• Objective. The objective of data mining in e-commerce is to increase profit, which is tangible and can be measured in terms of amounts of money, number of customers and customer loyalty. The objective of data mining in e-learning is to improve learning, a goal that is more subjective and more subtle to measure.
• Techniques. Educational systems have special characteristics that require a different treatment of the mining problem. As a consequence, some specific data mining techniques are needed to address the process of learning in particular (Li & Zaïane, 2004; Pahl & Donnellan, 2003). Some traditional techniques can be adapted; some cannot.

The application of knowledge extraction techniques to educational systems in order to improve learning can be viewed as a formative evaluation technique. Formative evaluation (Arruabarrena, Pérez, López-Cuadrado, & Vadillo, 2002) is the evaluation of an educational program while it is still in development, with the purpose of continually improving the program. Examining how students use the system is one way to evaluate the instructional design in a formative manner, and it may help the educator to improve the instructional materials (Ingram, 1999). Data mining techniques can discover useful information that can be used in formative evaluation to help educators establish a pedagogical basis for decisions when designing or modifying an environment or teaching approach. The application of data mining in educational systems is an iterative cycle of hypothesis formation, testing, and refinement (see Fig. 1). Mined knowledge should enter the loop of the system and guide, facilitate and enhance learning as a whole. The aim is not only to turn data into knowledge, but also to filter mined knowledge for decision making.

As we can see in Fig. 1, educators and the academics responsible are in charge of designing, planning, building and maintaining the educational systems. Students use and interact with them.
Starting from all the available information about courses, students, usage and interaction, different data mining techniques can be applied in order to discover useful knowledge that helps to improve the e-learning process. The discovered knowledge can be used not only by providers (educators) but also by the users themselves (students).

Fig. 1. The cycle of applying data mining in educational systems. [Diagram: educators and academics responsible design, plan, build and maintain the educational systems (traditional classrooms, e-learning systems, adaptive and intelligent web-based educational systems); students use, interact, participate and communicate; data mining (clustering, classification, outlier, association, pattern matching, text mining) takes students' usage and interaction data, course information, academic data, etc., and shows discovered knowledge and recommendations.]

The application of data mining in educational systems can therefore be oriented to different actors, each with a particular point of view (Zorrilla, Menasalvas, Marin, Mora, & Segovia, 2005):

• Oriented towards students (Heraud, France, & Mille, 2004; Farzan, 2004; Lu, 2004; Tang & McCalla, 2005; Zaïane, 2002). The objective is to recommend to learners activities, resources and learning tasks that would favour and improve their learning; to suggest good learning experiences for the students; and to suggest path pruning and shortening, or simply links to follow, based on the tasks already done by the learner and their successes and on tasks done by other similar learners, etc.
• Oriented towards educators (Ha et al., 2000; Hamalainen et al., 2004; Merceron & Yacef, 2004; Minaei-Bidgoli & Punch, 2003; Mor & Minguillon, 2004; Muehlenbrock, 2005; Pahl & Donnellan, 2003; Romero et al., 2004; Silva & Vieira, 2002; Talavera & Gaudioso, 2004; Tang et al., 2000; Ueno, 2004b; Zaïane & Luo, 2001). The objective is to get more objective feedback for instruction; evaluate the structure of the course content and its effectiveness in the learning process; classify learners into groups based on their needs for guidance and monitoring; find learners' regular as well as irregular patterns; find the most frequently made mistakes; find the activities that are more effective; discover information to improve the adaptation and customization of the courses; restructure sites to better personalize courseware; organize the contents efficiently according to the progress of the learner; and adaptively construct instructional plans, etc.
• Oriented towards academics responsible and administrators (Becker, Ghedini, & Terra, 2000; Grob, Bensberg, & Kaderali, 2004; Luan, 2002; Ma, Liu, Wong, Yu, & Lee, 2000; Peled & Rashty, 1999; Sanjeev & Zytkow, 1995; Urbancic, Skrjanc, & Flach, 2002). The objective is to obtain parameters on how to improve site efficiency and adapt it to the behavior of its users (optimal server size, network traffic distribution, etc.); to obtain measures of how to better organize institutional resources (human and material) and the educational offer; to enhance the offer of educational programs; and to determine the effectiveness of the new computer-mediated distance learning approach.

There are many general data mining tools that provide mining algorithms, filtering and visualization techniques. Some examples of commercial and academic tools are DBMiner, Clementine, Intelligent Miner, Weka, etc. (Klosgen & Zytkow, 2002). However, these tools are not specifically designed and maintained for pedagogical purposes, and they are cumbersome for an educator who does not have extensive knowledge of data mining (Zaïane, Xin, & Han, 1998).
In order to solve this problem, some specific educational data mining, statistical and visualization tools have been developed to help educators analyze the different aspects of the learning process (see Table 1).

We have divided this paper into the following sections. We first review different types of educational systems and how data mining can be applied in each of them. We then describe the data mining techniques that have been applied in educational systems, grouping them by task. Finally, we summarize the main conclusions and outline some future research.

2. Educational systems: data and objectives

Data mining can be applied to data coming from two types of educational systems: traditional classrooms and distance education. The application of data mining techniques must be dealt with separately for each type because they have different data sources and objectives.

2.1. Traditional classrooms

Traditional classroom environments are the most widely used educational systems. They are based on face-to-face contact between educators and students, organized through lectures. There are many different subtypes: private and public education, elementary and primary education, adult education, higher, tertiary and academic education, special education, etc. They have been criticized because they encourage passive learning, ignore individual differences and needs of the learners, and do not pay attention to problem solving, critical thinking, or other higher order thinking skills (Johnson, Arago, Shaik, & Palma-Rivas, 2000). In conventional classrooms, educators attempt to enhance instruction by monitoring students' learning processes and analyzing their performance through paper records and observation. They can also use information about student attendance, course information, curriculum goals, and individualized plan data.
An educational institution has many diverse and varied sources of information (Ma et al., 2000): traditional databases (with student information, educator information, class and schedule information, etc.), online information (online web pages and course content pages), multimedia databases, etc.

Data mining can help each actor in the learning process. Institutions would like to know which students will enroll in a particular course and which students will need assistance in order to graduate. An administrator may wish to find out information such as the admission requirements and to predict the class enrollment size for timetabling. Students may wish to know how best to select courses based on predictions of how well they will perform in the courses selected. Instructors may wish to know which learning experiences contribute most to overall learning outcomes, why one class is outperforming another, which groups of students are similar, etc.

Table 1
Some specific educational data mining, statistics and visualization tools

Tool name              Authors                      Mining task
Mining tool            Zaïane and Luo (2001)        Association and patterns
MultiStar              Silva and Vieira (2002)      Association and classification
Data Analysis Center   Shen et al. (2002)           Association and classification
EPRules                Romero et al. (2003)         Association
KAON                   Tane et al. (2004)           Text mining and clustering
TADA-ED                Merceron and Yacef (2005)    Classification and association
O3R                    Becker et al. (2005)         Sequential patterns
Synergo/ColAT          Avouris et al. (2005)        Statistics and visualization
GISMO/CourseVis        Mazza and Milani (2005)      Visualization
Listen tool            Mostow et al. (2005)         Visualization
TAFPA                  Damez et al. (2005)          Classification
iPDF_Analyzer          Bari and Benzater (2005)     Text mining

There are some works about the application of data mining in traditional education.
One of the first articles about the use of data mining in education, aimed at understanding student enrollment, was written by Sanjeev and Zytkow (1995). They apply knowledge discovery in the form of statements "Pattern P holds for data in Range R" to university databases. The results were presented to a senior university administrator to support strategic decisions about institutional policies. Another work, by Becker et al. (2000), uses KDD to identify and understand whether curriculum revisions can affect students in a Brazilian university. They verify the qualitative impact of revisions and evaluate it using a number of techniques, such as summarization, association and classification. In a related work, the objective is to select weak students to attend remedial classes (Ma et al., 2000). The authors use a scoring function based on association rules: first they identify the potentially weak students, and then they select the course that each weak student is recommended to take. Finally, Luan (2002) presents an application in higher education for a comprehensive analysis of student characteristics. He proposes to use different unsupervised (Kohonen nets) and supervised (C5.0, genetic algorithms, etc.) data mining algorithms for clustering and prediction in order to enable educational institutions to better allocate resources and staff, proactively manage student outcomes, and improve the effectiveness of alumni development.

2.2. Distance education

Distance education or distance learning consists of techniques and methods that provide access to educational programs for students who are separated by time and space from lecturers. e-Learning systems lack a close one-to-one student–educator relationship. There are different subtypes of distance education: paper-based correspondence education, videotape education, computer-aided education (multimedia education, internet education or web-based education), etc.
Currently, the most widely used is web-based education, which allows students to learn conveniently via the Internet. Web-based education is a form of distance education delivered over the Internet (Johnson et al., 2000). Today, many terms are used to refer to web-based education, such as e-learning, e-training, online instruction, web-based learning, web-based training, web-based instruction, etc. There are also different types of web-based systems: synchronous and asynchronous, collaborative and non-collaborative, closed corpus and open corpus, etc. These web-based education systems can normally record the students' accesses in web logs that provide a raw trace of the learners' navigation on the site. There are several types of logs (Srivastava et al., 2000):

• Server log file. This is the most widely used data source for performing data mining. It contains just the bare details of timing, path, and input-response. There are a variety of formats, such as common log format (CLF), extended log format (ELF), etc. (Koutri, Avouris, & Daskalaki, 2004). Normally, there is a single log file for all students.
• Client log file. This consists of a set of log files, one per student, and contains information about the interaction of the user with the system. It can be implemented with a remote agent (such as JavaScript or Java applets), by modifying the source code of an existing browser, or by using cookies.
• Proxy log file. This consists of a set of log files from the caching layer between client browsers and web servers. This information complements the server log file information.

It should be noted that logging may be subject to legal restrictions. Therefore, whenever a log system authenticates users, it should not relate them to a person's true identity but should primarily distinguish them as individual persons (Rahkila & Karjalainen, 1999).
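To make the server log file and the remark on identity more concrete, the following minimal sketch parses one entry in the common log format and replaces the authenticated user name with a pseudonym, so that entries can still be grouped per individual without exposing a true identity. The sample entry, the names `parse_clf_line` and `pseudonymize`, and the salted SHA-256 scheme are illustrative assumptions; they are not taken from the surveyed works.

```python
import hashlib
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def pseudonymize(user, salt="course-salt"):
    """Map a user name to a stable pseudonym (hypothetical scheme)."""
    digest = hashlib.sha256((salt + user).encode("utf-8")).hexdigest()
    return "student-" + digest[:8]


def parse_clf_line(line):
    """Parse one CLF entry into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request field looks like: METHOD path PROTOCOL
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else ""
    # Unauthenticated requests carry "-" in the user field; leave them as-is.
    if entry["user"] != "-":
        entry["user"] = pseudonymize(entry["user"])
    return entry


# Made-up sample entry, for illustration only.
sample = ('192.0.2.10 - jdoe [10/Oct/2005:13:55:36 +0200] '
          '"GET /course/unit1.html HTTP/1.0" 200 2326')
entry = parse_clf_line(sample)
print(entry["user"], entry["path"], entry["status"])
```

Because the pseudonym is stable for a given user name and salt, later steps can still group entries by individual. In this sketch the salt is a plain default argument; a real deployment would presumably need to keep it private.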
Log files also have several inherent limitations: they track files rather than users and simple clicks rather than learning activities; they do not capture contextual information; they recognize specific computers rather than specific people; they may contain incomplete and incorrect information; and some technical aspects of web browsers (such as the cache) may prevent accesses from being logged. To address these problems, authors have proposed several solutions. Yu, Own, and Lin (2001) propose another way to record a learner's portfolio, which includes the learning path, preferred learning course, course grade, learning time, etc. Li and Zaïane (2004) use more information channels to model user navigational behavior: web access logs, the structure of a visited web site, and the content of visited web pages. Avouris, Komis, Fiotakis, Margaritis, and Voyiatzaki (2005) expand automatically generated log files by introducing contextual information as additional events and by associating comments and static files. Monk (2005) combines data on the activity with content and user profiles in a composite information model. Ingram (1999) combines data with other inquiry methods, such as informal chatting with students, in-class shows of hands, surveys, or written feedback about the web site. Iksal and Choquet (2005) propose a specific usage tracking meta-language to describe the semantics of the tracks recorded by web-based educational systems. Markham et al. (2003) propose to use software agents to extract data from the e-learning environment and to organize them in intelligent ways.

Next, we distinguish between three different types of web-based education systems: particular web-based courses, well-known learning content management systems, and adaptive and intelligent web-based educational systems.

2.2.1. Particular web-based courses

Particular web-based courses are specific courseware that uses standard HTML (HyperText Markup Language). There are many courses, tutorials, etc. of this type on the Internet and, like any other web site, they have the same kinds of data sources (Srivastava et al., 2000):

• Content: The real data in the web pages, i.e. the data the web pages were designed to convey to the users. This usually consists of texts, graphics, videos, sounds, etc.
• Structure: Data that describe the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the HTML tag becomes the root of the tree. The principal kind of inter-page structure information is the hyperlinks connecting one page to another.
• Usage: Data that describe the pattern of usage of web pages. There are two main types of student information (Silva & Vieira, 2002): information about the student's actions and communications, and information about the student's activities in the course.
• User profile: Data that provide demographic information about users of the web site. This includes registration data and customer profile information.

Data mining can be used to find out how students use the course, how a pedagogical strategy affects different types of students, in which order the students study subtopics, which pages or topics students skip, how much time the students spend on a single page, a chapter or the full course, etc.

2.2.2. Well-known learning content management systems

Well-known learning content management systems (LCMS) are platforms that offer a great variety of channels and workspaces to facilitate information sharing and communication between participants in a course. They let educators distribute information to students, produce content material, prepare assignments and tests, engage in discussions, manage distance classes and enable collaborative learning with forums, chats, file storage areas, news services, etc. Some examples of commercial LCMS are Blackboard, Virtual-U, WebCT, TopClass, etc., and some examples of free LCMS are Moodle, Ilias, Claroline, aTutor, etc. (Paulsen, 2003). These systems accumulate large amounts of log data on the students' activities and usually have built-in student monitoring features (Mazza & Milani, 2005). They can record whatever student activities are involved, such as reading, writing, taking tests, performing various tasks in real or virtual environments, and even communicating with peers (Mostow, 2004). They normally also provide a database that stores all the system's information: personal information about the users (profile), academic results, users' interaction data, etc. Although some platforms offer reporting tools, it becomes hard for a tutor to extract useful information when there are a great number of students. Data mining can be applied to explore, visualize and analyze data in order to identify useful patterns (Talavera & Gaudioso, 2004), to evaluate web activity in order to get more objective feedback for instruction, and to learn more about how the students learn on the LCMS (Zaïane & Luo, 2001).

2.2.3. Adaptive and intelligent web-based educational systems

Adaptive and intelligent web-based educational systems (AIWBES) provide an alternative to the traditional just-put-it-on-the-web approach in the development of web-based educational courseware (Brusilovsky & Peylo, 2003).
AIWBES attempt to be more adaptive by building a model of the goals, preferences and knowledge of each individual student and using this model throughout the interaction with the student in order to adapt to the needs of that student. AIWBES are the result of a joint evolution of intelligent tutoring systems (ITS) and adaptive hypermedia systems (AHS). Some examples of ITS are SQL-Tutor, German Tutor, ActiveMath and VC-Prolog-Tutor, and some examples of AHS are AHA!, InterBook, KBS-Hyperbook and WebCOBALT (Brusilovsky & Peylo, 2003). The data from AIWBES are semantically richer and can lead to more diagnostic analysis than data from traditional web-based education systems (Merceron & Yacef, 2004). The available data come from the domain model (which may be structured into an ontology), the pedagogical dataset (a set of problems, their answers and complexity information), interaction log files (data related to user interaction) and the student model (a list of satisfied and violated constraints). AIWBES use a standard student model (used internally by the tutoring system), but for the purpose of data mining it is necessary to have a new model of student interaction augmented with contextual data. These student interactions can be analyzed at a number of different levels of granularity: course, sessions, problems, attempts and constraints (Nilakant & Mitrovic, 2005).

Data mining can be used to discover the causes of problems in the system, for example incorrect feedback statements (Nilakant & Mitrovic, 2005), to adapt the level to the progress of the learner (Romero et al., 2004), and to suggest personalized learning experiences and activities for the students (Tang & McCalla, 2005).

3. Data preprocessing

Data preprocessing transforms the original data into a shape suitable for use by a particular mining algorithm.
So, before applying the data mining algorithm, a number of general data preprocessing tasks have to be addressed (Koutri et al., 2004; Zorrilla et al., 2005): • Data cleaning. It is one of the major preprocessing tasks, to remove irrelevant items and log entries that are not needed for the mining process such us graphics, scripts. • User identification. Process of associating page refer- ences to the connected user. • Session identification. It takes all of the page references for a given user and course in a log and breaks them up into user sessions. In our particular case, we have C. Romero, S. Ventura / Expert Systems with Applications 33 (2007) 135–146 139 A ut ho r's pe rs on al co py initially considered a new session when a change in a user course happens or when the time interval between two successive inter-transaction clicks ups 30 min (Zorrilla et al., 2005). • Path completion. It fills in page references that are miss- ing due to browser and proxy server caching. • Transaction identification. It breaks down sessions into smaller units, referred to as transactions or episodes. • Data transformation and enrichment. It consists of calcu- lating new attributes from the existing ones, conversing of numerical attributes into nominal attributes, provid- ing meaning to references contained in the log, etc. • Data integration. It is the integration and synchroniza- tion of data from heterogeneous sources. • Data reduction. It is for reducing data dimensionality. Additionally, data preprocessing of web-based educa- tional systems has some specific issues: • Most of the systems use user authentication (password protection) in which logs have entries identified by users since the users have to log-in, and sessions are already identified since users may also have to log-out (Rahkila & Karjalainen, 1999). • Most of the systems record the students’ interactions not only in log files but also directly in databases. 
If this is not the case, during the preparation process, data for each individual student (profile, logs, etc.) can be aggre- gated to a database (Talavera & Gaudioso, 2004). Dat- abases are more powerful than typical log text files and provide an analysis easier, more flexible and less bug prone. • Data transformation is more oriented to a better inter- pretation of data. Numerical values are discretized or transformed into ranges that provide a much more com- prehensible view of the data. New attributes result from other current attributes in a specific attribute derivation. The derivation performs some kind of aggregation, for example, each attempt is grouped into a new number of attempt attribute (Nilakant & Mitrovic, 2005). • In the division of individual visit sessions into transac- tions, can be identified subsessions or missions with coherent information needs in which the identified sequence is based on the real content of pages (Li & Zaı̈ane, 2004). Besides different meanings of interaction at different levels of abstraction can be distinguished (Pahl & Donnellan, 2003): learning and training interac- tion, human–computer interface and multimedia and service interaction. • The data filtration uses specific educational concepts as number of attempts, number of repeated reading, level of knowledge, etc. Normally, data is filtered by defining some condition on one or more attributes and removing the instances that violate it (Nilakant & Mitrovic, 2005). The educators have to actively participate in the previ- ous preprocessing task, for example, indicating specific data filtration and attribute derivation or transformation, etc. So, it is needed to enhance preprocessing facilities that prepare the e-learning data in a meaningful and useful way. 4. 
Data mining techniques in educational systems Data mining is a multidisciplinary area in which several computing paradigms converge: decision tree construction, rule induction, artificial neural networks, instance-based learning, Bayesian learning, logic programming, statistical algorithms, etc. (Klosgen & Zytkow, 2002). Next, we are going to describe some specific application of data mining techniques grouped by tasks, in web-based educational sys- tems (see Table 2). 4.1. Statistics and visualization Student’s usage statistics are often the starting point of evaluations of an e-learning system, although they are usu- ally not considered as data mining techniques (Tsantis & Castellani, 2001). Formal statistical inference is assumption driven in the sense that a hypothesis is formed and then tested against the data. Data mining, in contrast, is discov- ery driven in the sense that the hypothesis is automatically extracted from the data. Usage statistics may be extracted using standard tools designed to analyze web server logs as AccessWatch, Ana- log, Gwstat, WebStat, etc. (Zaı̈ane et al., 1998). But there are some specific statistical tools in educational data as Synergo/ColAT (Avouris et al., 2005). Some example of usage statistics are simple measures such as the total num- ber of visits and number of visits per page (Pahl & Donne- llan, 2003). Some other general statistics show the connected learner distribution over time, the most frequent acceded courses, how learners establish many learning ses- sions over time (Zorrilla et al., 2005). Besides, some specific statistical in AIWBES can show the average number of constraint violations, the average problem complexity, the total time spent in attempts (Nilakant & Mitrovic, 2005). More complex statistical tests of procedures such as regression analysis, correlation analysis, multivariate statistical methods. (Zarzo, 2003) need to use a more pow- erful statistical tools as SPSS, SAS, S, R, Statistica, etc. 
(Klosgen & Zytkow, 2002). If data are stored in a relational database, then SQL queries (Heiner, Beck, & Mostow, 2004; Merceron & Yacef, 2003) can provide a number of simple statistical operations such as standard deviation, mode, sample size, etc. (Nilakant & Mitrovic, 2005). However, the information obtained from usage statistics is not always easy for educators to interpret, and other techniques then have to be used.

Information visualization techniques can be used to graphically render complex, multidimensional student tracking data collected by web-based educational systems (Mazza & Milani, 2005). These techniques make it easier to analyze large amounts of information by representing the data in some visual display. Normally, large quantities of raw instance data are represented or plotted as spreadsheet charts, scatterplots, 3D representations, etc. The information visualized in statistical graphs can concern assignment complement, admitted questions, exam scores, etc. (Shen, Yang, & Han, 2002). Visualization techniques have been used to visualize social aspects in computer-supported collaborative learning, community relationships in peer-to-peer systems, and conversations in online groups. Instructors can manipulate the generated graphical representations, which allows them to gain an understanding of their learners and become aware of what is happening in distance classes. There are some visualization tools specific to educational data, such as GISMO/CourseVis (Mazza & Milani, 2005) and the Listen tool (Mostow et al., 2005).

C. Romero, S. Ventura / Expert Systems with Applications 33 (2007) 135–146

4.2. Web mining

Web mining (Srivastava et al., 2000) is the application of data mining techniques to extract knowledge from web data.
There are three main web mining categories from the viewpoint of the data used: web content mining is the process of extracting useful information from the contents of web documents; web structure mining is the process of discovering structure information from the web; and web usage mining (WUM) is the discovery of meaningful patterns from data generated by client–server transactions on one or more web localities. From the viewpoint of the system used, there are two types of web mining (Li & Zaïane, 2004): offline web mining, which is used to discover patterns and other useful information to help educators validate the learning models and restructure the web site; and online or integrated web mining, in which the automatically discovered patterns are fed into an intelligent software system or agent that can assist learners in their online learning endeavours. The mined patterns are used on the fly by the system to improve the application or its functions.

Different web mining techniques have been applied to educational systems, but almost all of them fall into one of the following three groups: clustering, classification and outlier detection; association rule mining and sequential pattern mining; and text mining.

4.2.1. Clustering, classification and outlier detection

Clustering is the process of grouping physical or abstract objects into classes of similar objects. Clustering and classification (Klosgen & Zytkow, 2002) are both classification methods: clustering is unsupervised classification, and classification is supervised classification. Classification and prediction are also related techniques: classification predicts class labels, whereas prediction predicts continuous-valued functions. An outlier, on the other hand, is an observation (or measurement) that is unusually large or small relative to the other values in a dataset.
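To make the definition of an outlier concrete, the following sketch flags values that lie far outside the interquartile range of a sample (Tukey's rule). The choice of this rule, the multiplier 1.5, the function name find_outliers and the response times are assumptions made purely for illustration; the works surveyed in this section use their own detection methods.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical response times (in seconds) for one e-learning item.
response_times = [12, 14, 15, 15, 16, 17, 18, 95]
outliers = find_outliers(response_times)
```

With this sample only the value 95 is flagged. Whether such a value is a recording error or a genuinely rare event still has to be decided by inspecting the data.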
Outliers are typically attributable to one of the following causes: the measurement is observed, recorded, or entered into the computer incorrectly; the measurement comes from a different population; or the measurement is correct but represents a rare event.

Table 2
Works about applying data mining techniques in educational systems

Authors                          Mining task                       Educational system
Sanjeev and Zytkow (1995)        Sequence pattern                  Traditional education
Zaïane et al. (1998)             Statistic and sequence pattern    LCM system
Beck and Woolf (2000)            Prediction                        AIWBE system
Becker et al. (2000)             Association and classification    Traditional education
Chen et al. (2000)               Classification                    Web-based course
Ha et al. (2000)                 Association                       Web-based course
Ma et al. (2000)                 Association                       Traditional education
Tang et al. (2000)               Text mining                       AIWBE system
Yu et al. (2001)                 Association                       Web-based course
Zaïane and Luo (2001)            Sequence pattern                  LCM system
Luan (2002)                      Clustering and prediction         Traditional education
Pahl and Donnellan (2003)        Sequence pattern and statistics   LCM system
Shen et al. (2002)               Visualization                     LCM system
Wang (2002)                      Association and sequence pattern  Web-based course
Merceron and Yacef (2003)        Statistic                         AIWBE system
Minaei-Bidgoli and Punch (2003)  Classification                    Web-based course
Shen et al. (2003)               Sequence pattern and clustering   Web-based course
Zarzo (2003)                     Statistic                         Web-based course
Arroyo et al. (2004)             Prediction                        AIWBE system
Baker et al. (2004)              Classification                    AIWBE system
Chen et al. (2004)               Text mining                       Web-based course
Freyberger et al. (2004)         Association                       AIWBE system
Hamalainen et al. (2004)         Classification                    AIWBE system
Heiner et al. (2004)             Statistic                         AIWBE system
Lu (2004)                        Association                       AIWBE system
Merceron and Yacef (2004)        Association                       AIWBE system
Minaei-Bidgoli et al. (2004)     Association                       Web-based course
Mor and Minguillon (2004)        Clustering                        LCM system
Romero et al. (2004)             Association                       AIWBE system
Talavera and Gaudioso (2004)     Clustering                        LCM system
Ueno (2004b)                     Outlier detection                 Web-based course
Ueno (2004a)                     Text mining                       Web-based course
Wang et al. (2004)               Sequence pattern and clustering   LCM system
Li and Zaïane (2004)             Association                       LCM system
Avouris et al. (2005)            Statistic                         Web-based course
Castro et al. (2005)             Outlier detection                 LCM system
Dringus and Ellis (2005)         Text mining                       LCM system
Feng et al. (2005)               Prediction                        AIWBE system
Hammouda and Kamel (2005)        Text mining                       Web-based course
Markellou et al. (2005)          Association                       Web-based course
Mazza and Milani (2005)          Visualization                     LCM system
Mostow et al. (2005)             Visualization                     AIWBE system
Muehlenbrock (2005)              Outlier detection                 AIWBE system
Nilakant and Mitrovic (2005)     Statistic                         AIWBE system
Tang and McCalla (2005)          Clustering                        AIWBE system
Zorrilla et al. (2005)           Statistic                         LCM system
Damez et al. (2005)              Classification                    AIWBE system
Bari and Benzater (2005)         Text mining                       LCM system

All these methods have been applied to web-based educational systems. Clustering can group together sets of pages with similar contents, users with similar navigation behavior, or navigation sessions. Classification makes it possible to characterize the properties of a group of user profiles, similar pages or learning sessions. Outlier detection can detect students with learning problems. Next, we describe some works that apply these techniques in different types of web-based educational systems:
• Particular web-based courses. Chen, Liu, Ou, and Liu (2000) apply a decision tree (the C5.0 algorithm) and data cube technology to web log portfolios for managing classroom processes. The induction analysis discovers potential student groups that have similar characteristics and reactions to a particular pedagogical strategy.
Minaei-Bidgoli and Punch (2003) classify students based on features extracted from the logged data in order to predict their final grades. They use genetic algorithms to optimize a combination of multiple classifiers by weighting feature vectors. Ueno (2004b) proposes a method for the online detection of outliers in learners' irregular learning processes, using the learners' response time data for the e-learning contents. The outlier detection method uses a Bayesian predictive distribution and supports two-way instruction by using the mining results for the learners' learning processes.
• Well-known learning content management systems. Talavera and Gaudioso (2004) propose mining student data using clustering to discover patterns reflecting user behaviors. They propose models for collaboration management to characterize similar behavior groups in unstructured collaboration spaces. Mor and Minguillon (2004) extend the sequencing capabilities of the SCORM standard to include the concept of recommended itinerary, by combining educators' expertise with the experience learned from system usage analysis. They use clustering algorithms to group students. Castro, Vellido, Nebot, and Minguillon (2005) detect atypical behavior in the grouping structure of the users of a virtual campus. They propose using a generative topographic mapping model and a clustering model to characterize groups of online students. The model neutralizes the negative impact of outliers on the data clustering process.
• Adaptive and intelligent web-based educational systems. Tang et al. (2000) use data clustering for web learning to promote group-based collaborative learning and to provide incremental learner diagnosis. They find clusters of students with similar learning characteristics based on the sequence and the contents of the pages they visited.
Currently, they are working on smart recommendation for evolving e-learning systems (Tang & McCalla, 2005) using clustering and collaborative filtering. This is a paper recommender system that can personalize and adapt the course content based on the system's observation of the learners and the accumulated ratings given by the learners. Hamalainen et al. (2004) introduce a hybrid model that combines data mining and machine learning techniques to construct a Bayesian network describing the student's learning process. The goal is to classify students in order to give them differentiated guidance according to their skills and other characteristics. Beck and Woolf (2000) construct a learning agent for high-level student modeling with machine learning in intelligent tutoring systems. The agent learns to predict the probability that the student's next response will be correct and how long it will take the student to generate that response. They use linear regression to predict observable variables. Arroyo, Murray, Woolf, and Beal (2004) infer unobservable learning variables from students' ITS log files. They start from a correlation analysis between variables and construct a Bayesian network that infers students' attitudes (negative and positive) and predictions of the system. They use a maximum likelihood method to learn conditional probabilities from students' data. Baker, Corbett, and Koedinger (2004) use a machine-learned latent response model to detect student misuse of intelligent tutoring systems. They build a classifier to identify whether a student is gaming the system in a way that leads to poor learning and is in need of an intervention. Feng, Heffernan, and Koedinger (2005) look for sources of error in predicting a student's knowledge. They perform a stepwise regression to determine which metrics help to explain poor prediction of state exam scores.
Muehlenbrock (2005) detects regularities and deviations in the learner's or educator's actions, among others, in order to provide educators and learners with additional information to manage their learning and teaching. Damez, Marsala, Dang, and Bouchon-Meunier (2005) use a fuzzy decision tree for user modeling and for automatically discriminating a novice from an experienced user. They use an agent to learn the cognitive characteristics of a user's interactions and classify users as experienced or not.

4.2.2. Association rule mining and sequential pattern mining

Association rule mining is one of the most studied mining methods. Association rules associate one or more attributes of a dataset with another attribute, producing an if–then statement concerning attribute values. Mining association rules between sets of items in large databases was first proposed by Agrawal, Imielinski, and Swami (1993), and it opened up a brand new family of algorithms. The original problem was market basket analysis, which tries to find all the interesting relations between purchased products. Sequential pattern mining (Agrawal & Srikant, 1995) attempts to find inter-session patterns, such as the presence of a set of items followed by another item in a time-ordered set of sessions or episodes.

These methods have been applied to web-based education systems. Associations can reveal which contents students tend to access together, or which combination of tools they use. Sequential patterns can reveal which content has motivated access to other contents, or how tools and contents are entwined in the learning process. Next, we describe some works that apply these techniques in different types of web-based educational systems:
• Particular web-based courses. Ha et al.
(2000) perform web page traversal path analysis for customized education, and mine web page associations for virtual knowledge structures, which can be formed by learners themselves as they navigate web pages. Yu et al. (2001) find incorrect student behavior. They modify traditional web logs and apply fuzzy association rules to find the relationships between the patterns of a learner's behavior, including the time spent online, the number of articles read and published, the number of questions asked, etc. Wang (2002) develops a portfolio analysis tool based on data mining techniques. He uses associative material clusters and the sequences among them. This knowledge allows educators to study the dynamic browsing structure and to identify interesting or unexpected learning patterns. To do this, he discovers two types of relations between documents: association relations and sequence relations. Shen, Han, Yang, Yang, and Huang (2003) use data mining and case-based reasoning for distance learning. They use clustering to classify students based on their learning actions, and they find sequential association rules between different knowledge points. Minaei-Bidgoli, Tan, and Punch (2004) propose mining interesting contrast rules for web-based education systems. Contrast rules help to identify attributes characterizing patterns of performance disparity between various groups of students. Markellou, Mousourouli, Spiros, and Tsakalidis (2005) propose an ontology-based framework and discover association rules using the Apriori algorithm. The role of the ontology is to determine which learning materials are more suitable to be recommended to the user.
• Well-known learning content management systems. Zaïane and Luo (2001) propose the discovery of useful patterns based on restrictions, to help educators evaluate students' activities in web courses.
Li and Zaïane (2004) also use recommender agents for e-learning systems, which use association rule mining to discover associations between user actions and URLs. The agent recommends online learning activities or shortcuts in a course web site based on a learner's access history. Pahl and Donnellan (2003) analyze a student's individual sessions. They first define the learning period of each student and then split web server log files into individual sessions, calculate session statistics, and search for session patterns and time series. Wang, Weng, Su, and Tseng (2004) propose a four-phase learning portfolio mining approach, which uses sequential pattern mining, clustering and decision tree creation in sequence to extract learning features and create a decision tree that predicts which group a new learner belongs to.
• Adaptive and intelligent web-based educational systems. Lu (2004) uses fuzzy association rules in a personalized e-learning material recommender system. He uses fuzzy matching rules to discover associations between students' requirements and a list of learning materials. Romero et al. (2004) propose using grammar-based genetic programming with multi-objective optimization techniques to provide feedback to courseware authors. They discover interesting relationships from students' usage information. Merceron and Yacef (2004) use association rules and symbolic data analysis, as well as traditional SQL queries, to mine student data captured from a web-based tutoring tool. Their goal is to find mistakes that often occur together. Freyberger, Heffernan, and Ruiz (2004) use association rules to guide a search for the best-fitting transfer model of student learning in intelligent tutoring systems. The association rules determine what operation to perform on the transfer model that predicts a student's success.
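The if–then form of an association rule, together with the support and confidence measures commonly used to rank such rules, can be illustrated with a brute-force sketch restricted to single-item rules. The session data, the thresholds and the function names are hypothetical, and the exhaustive enumeration is a simple stand-in for, and much less efficient than, algorithms such as Apriori.

```python
from itertools import combinations

# Hypothetical sessions: the set of course resources each student accessed.
sessions = [
    {"lecture1", "quiz1", "forum"},
    {"lecture1", "quiz1"},
    {"lecture1", "forum"},
    {"lecture2", "quiz1"},
    {"lecture1", "quiz1", "lecture2"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate rules 'if lhs then rhs' over single items by brute force."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        joint = support({a, b}, transactions)
        if joint < min_support:
            continue  # the pair is too rare to be of interest
        for lhs, rhs in ((a, b), (b, a)):
            # confidence = support(lhs and rhs) / support(lhs)
            confidence = joint / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, joint, confidence))
    return rules


rules = pair_rules(sessions)
for lhs, rhs, sup, conf in rules:
    print(f"if {lhs} then {rhs} (support={sup:.2f}, confidence={conf:.2f})")
```

In an educational setting the items could equally be mistakes, tools or pages, so the same two measures express, for example, how often two mistakes occur together.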
4.2.3. Text mining

Text mining methods can be viewed as an extension of data mining to text data, and they are closely related to web content mining. Text mining is an interdisciplinary area involving machine learning and data mining, statistics, information retrieval and natural language processing (Grobelnik, Mladenic, & Jermol, 2002). It can work with unstructured or semi-structured datasets such as full-text documents, HTML files, emails, etc. Next, we describe some works that apply these techniques in different types of web-based educational systems:
• Particular web-based courses. Ueno (2004a) uses data mining and text mining technologies for collaborative learning in an ILMS. Text mining is applied to the discussion board by means of an expanded correspondence analysis. Learners select the relevant category that represents their comment, and the system provides evaluations of a learner's comments between peers. Chen, Li, Wang, and Jia (2004) propose to automatically construct an e-textbook via web content mining. They use a ranking strategy to evaluate the suitability of web pages, and they extract concept features and build concept hierarchies. Tane, Schmitz, and Stumme (2004) propose an ontology-based tool to make the most of the resources available on the web. They use text mining and text clustering techniques to group documents according to their topics and similarities. Hammouda and Kamel (2005) propose performing data mining on documents, which serves as a basis for knowledge extraction in e-learning environments. In the text mining process, a grouping (clustering) approach is also employed to identify groups of documents.
• Well-known learning content management systems. Dringus and Ellis (2005) propose using text mining as a strategy for assessing asynchronous discussion forums.
Text mining techniques improve the educator's ability to evaluate the progress of a discussion thread. Bari and Benzater (2005) retrieve data from PDF interactive multimedia productions to help evaluate multimedia presentations, for statistical purposes and for extracting relevant data. They identify the main blocks of multimedia presentations and retrieve their internal properties.
• Adaptive and intelligent web-based educational systems. Tang et al. (2000) propose constructing a personalized web tutor tree by mining both the context and the structure of the courseware. They use a keyword-driven text mining algorithm to select articles for distance learning students.

5. Conclusions and future research

Educational data mining is an emerging field related to several well-established areas of research, including e-learning, adaptive hypermedia, intelligent tutoring systems, web mining, data mining, etc. The application of data mining in educational systems has specific requirements not present in other domains, mainly the need to take into account pedagogical aspects of the learner and the system.

Although educational data mining is a very recent research area, there is a significant number of contributions published in journals, international conferences and specific workshops, as well as some books in progress (Romero & Ventura, 2006), which show that it is a promising new area. One of the most promising lines of work is the use of e-learning recommendation agents (Lu, 2004; Zaïane, 2002). These recommender agents observe what a student is doing and recommend actions (activities, shortcuts, contents, etc.) that they think would be beneficial to the student. Recommender agents can also be integrated into evolving e-learning systems in which materials are automatically found on the web and integrated into the system (Tang & McCalla, 2005).
In this way, they help educators detect which parts of existing materials from heterogeneous sources such as the Internet are the best to use for composing new courses. In addition, recommenders can be integrated with domain knowledge and ontologies, combining web mining and the semantic web in semantic web mining (Markellou et al., 2005). Semantic web mining is a successful integration of ontological knowledge at every stage of the knowledge discovery process (Becker, Vanzin, & Ruiz, 2005).

Educational data mining is a young research area, and more specialized work oriented to the educational domain is needed in order to reach a level of application success similar to that of other areas, such as medical data mining, e-commerce data mining, etc. We believe that some future research lines are:
• Mining tools that are easier to use for educators or users who are not experts in data mining. Data mining tools are normally designed more for power and flexibility than for simplicity. Most current data mining tools are too complex for educators to use, and their features go well beyond the scope of what an educator may want to do. These tools therefore need a more intuitive and easy-to-use interface, with parameter-free data mining algorithms to simplify configuration and execution, and with good visualization facilities to make their results meaningful to educators and e-learning designers.
• Standardization of methods and data. Current tools for mining data from a specific course may be useful only to their developers. There are no general tools, nor reusable tools or techniques, that can be applied to any educational system. A standardization of data and of the preprocessing, discovery and postprocessing tasks is therefore needed.
• Integration with the e-learning system. The data mining tool has to be integrated into the e-learning environment as another author tool.
All data mining tasks (preprocessing, data mining and postprocessing) have to be carried out in a single application. Feedback and results obtained with data mining can then be applied directly to the e-learning environment.
• Specific data mining techniques. More effective mining tools are needed that integrate educational domain knowledge into data mining techniques. Education-specific mining techniques can do much more to improve instructional design and pedagogical decisions. Traditional mining algorithms need to be tuned to take the educational context into account.

Acknowledgement

The authors gratefully acknowledge the financial support provided by the Spanish Department of Research of the Ministry of Science and Technology under the TIN2005-08386-C05-02 Project.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on management of data, Washington, DC (pp. 207–216).
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Eleventh international conference on data engineering (pp. 3–14). Taipei, Taiwan: IEEE Computer Society Press.
Arroyo, I., Murray, T., Woolf, B., & Beal, C. (2004). Inferring unobservable learning variables from students' help seeking behavior. In Intelligent tutoring systems (pp. 782–784).
Arruabarrena, R., Pérez, T. A., López-Cuadrado, J., & Vadillo, J. G. J. (2002). On evaluating adaptive systems for education. In Adaptive hypermedia (pp. 363–367).
Avouris, N., Komis, V., Fiotakis, G., Margaritis, M., & Voyiatzaki, E. (2005). Why logging of fingertip actions is not enough for analysis of learning activities. In Workshop on usage analysis in learning systems at the 12th international conference on artificial intelligence in education.
Baker, R., Corbett, A., & Koedinger, K. (2004). Detecting student misuse of intelligent tutoring systems. In Intelligent tutoring systems (pp. 531–540).
Bari, M., & Benzater, B. (2005). Retrieving data from pdf interactive multimedia productions. In International conference on human system learning: Who is in control? (pp. 321–330).
Beck, J., & Woolf, B. (2000). High-level student modeling with machine learning. In Intelligent tutoring systems (pp. 584–593).
Becker, K., Ghedini, C., & Terra, E. (2000). Using KDD to analyze the impact of curriculum revisions in a Brazilian university. In Eleventh international conference on data engineering. Proceedings of the SPIE 14th annual international conference on aerospace/defense, sensing, simulation and controls, Orlando (pp. 412–419).
Becker, K., Vanzin, M., & Ruiz, D. D. A. (2005). Ontology-based filtering mechanisms for web usage patterns retrieval. In 6th International conference on e-commerce and web technologies (pp. 267–277).
Brusilovsky, P., & Peylo, C. (2003). Adaptive and intelligent web-based educational systems. International Journal of Artificial Intelligence in Education, 13, 156–169.
Castro, F., Vellido, A., Nebot, A., & Minguillon, J. (2005). Detecting atypical student behaviour on an e-learning system. In I Simposio Nacional de Tecnologías de la Información y las Comunicaciones en la Educación, Granada (pp. 153–160).
Chen, G., Liu, C., Ou, K., & Liu, B. (2000). Discovering decision knowledge from web log portfolio for managing classroom processes by applying decision tree and data cube technology. Journal of Educational Computing Research, 23(3), 305–332.
Chen, J., Li, Q., Wang, L., & Jia, W. (2004). Automatically generating an e-textbook on the web. In International conference on advances in web-based learning (pp. 35–42).
Damez, M., Marsala, C., Dang, T., & Bouchon-Meunier, B. (2005). Fuzzy decision tree for user modeling from human–computer interactions. In International conference on human system learning: Who is in control? (pp. 287–302).
Dringus, L., & Ellis, T. (2005). Using data mining as a strategy for assessing asynchronous discussion forums. Computers & Education Journal, 45, 141–160.
Farzan, R. (2004). Adaptive socio-recommender system for open-corpus e-learning. In Doctoral consortium of the third international conference on adaptive hypermedia and adaptive web-based systems.
Feng, M., Heffernan, N., & Koedinger, K. (2005). Looking for sources of error in predicting student's knowledge. In Proceedings of AAAI'05 workshop on educational data mining.
Freyberger, J., Heffernan, N., & Ruiz, C. (2004). Using association rules to guide a search for best fitting transfer models of student learning. In Workshop on analyzing student–tutor interactions logs to improve educational outcomes at ITS conference.
Grob, H., Bensberg, F., & Kaderali, F. (2004). Controlling open source intermediaries – a web log mining approach. In Proceedings of the 26th international conference on information technology interfaces (pp. 233–242).
Grobelnik, M., Mladenic, D., & Jermol, M. (2002). Exploiting text mining in publishing and education. In Proceedings of the ICML-2002 workshop on data mining lessons learned (pp. 34–39).
Ha, S., Bae, S., & Park, S. (2000). Web mining for distance education. In IEEE international conference on management of innovation and technology (pp. 715–719).
Hamalainen, W., Suhonen, J., Sutinen, E., & Toivonen, H. (2004). Data mining in personalizing distance education courses. In World conference on open learning and distance education, Hong Kong.
Hammouda, K., & Kamel, M. (2005). Ch. Data mining in e-learning.
Hanna, M. (2004). Data mining in the e-learning domain. Computers & Education Journal, 42(3), 267–287.
Heiner, C., Beck, J., & Mostow, J. (2004). Lessons on using ITS data to answer educational research questions. In Proceedings of the ITS2004 workshop on analyzing student–tutor interaction logs to improve educational outcomes (pp. 1–9).
Heraud, J., France, L., & Mille, A. (2004).
Pixed: An ITS that guides students with the help of learners' interaction log. In International conference on intelligent tutoring systems (workshop analyzing student–tutor interaction logs to improve educational outcomes), Maceio (pp. 57–64).
Hwang, W., Chang, C., & Chen, G. (2004). The relationship of learning traits, motivation and performance-learning response dynamics. Computers & Education Journal, 42(3), 267–287.
Iksal, S., & Choquet, C. (2005). Usage analysis driven by models in a pedagogical context.
Ingram, A. (1999). Using web server logs in evaluating instructional web sites. Journal of Educational Technology Systems, 28(2), 137–157.
Johnson, S., Arago, S., Shaik, N., & Palma-Rivas, N. (2000). Comparative analysis of learner satisfaction and learning outcomes in online and face-to-face learning environments. Journal of Interactive Learning Research, 11(1), 29–49.
Klosgen, W., & Zytkow, J. (2002). Handbook of data mining and knowledge discovery. New York: Oxford University Press.
Koutri, M., Avouris, N., & Daskalaki, S. (2004). Ch. A survey on web usage mining techniques for web-based adaptive hypermedia systems.
Li, J., & Zaïane, O. (2004). Combining usage, content, and structure data to improve web site recommendation. In International conference on e-commerce and web technologies (pp. 305–315).
Lu, J. (2004). Personalized e-learning material recommender system. In International conference on information technology for application (pp. 374–379).
Luan, J. (2002). Data mining, knowledge management in higher education, potential applications. In Workshop associate of institutional research international conference, Toronto (pp. 1–18).
Ma, Y., Liu, B., Wong, C., Yu, P., & Lee, S. (2000). Targeting the right students using data mining. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457–464).
Markellou, P., Mousourouli, I., Spiros, S., & Tsakalidis, A. (2005).
Using semantic web mining technologies for personalized e-learning experiences. In Proceedings of the web-based education (pp. 461–826).
Markham, S., Ceddia, J., Sheard, J., Burvill, C., Weir, J., Field, B., et al. (2003). Applying agent technology to evaluation tasks in e-learning environments. In Proceedings of the exploring educational technologies conference.
Mazza, R., & Milani, C. (2005). Exploring usage analysis in learning systems: Gaining insights from visualisations. In Workshop on usage analysis in learning systems at 12th international conference on artificial intelligence in education.
Merceron, A., & Yacef, K. (2003). A web-based tutoring tool with mining facilities to improve learning and teaching. In Proceedings of 11th international conference on artificial intelligence in education (pp. 201–208).
Merceron, A., & Yacef, K. (2004). Mining student data captured from a web-based tutoring tool: Initial exploration and results. Journal of Interactive Learning Research, 15(4), 319–346.
Merceron, A., & Yacef, K. (2005). Tada-ed for educational data mining. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 7(1), 267–287.
Minaei-Bidgoli, B., & Punch, W. (2003). Using genetic algorithms for data mining optimization in an educational web-based system. In GECCO (pp. 2252–2263).
Minaei-Bidgoli, B., Tan, P., & Punch, W. (2004). Mining interesting contrast rules for a web-based educational system. In International conference on machine learning applications.
Monk, D. (2005). Using data mining for e-learning decision making. Electronic Journal of e-Learning, 3(1), 41–54.
Mor, E., & Minguillon, J. (2004). E-learning personalization based on itineraries and long-term navigational behavior. In Proceedings of the 13th international world wide web conference (pp. 264–265).
Mostow, J. (2004). Some useful design tactics for mining ITS data.
In Proceedings of the ITS2004 workshop on analyzing student–tutor interaction logs to improve educational outcomes.
Mostow, J., Beck, J., Cen, H., Cuneo, A., Gouvea, E., & Heiner, C. (2005). An educational data mining tool to browse tutor–student interactions: Time will tell! In Proceedings of the workshop on educational data mining (pp. 15–22).
Muehlenbrock, M. (2005). Automatic action analysis in an interactive learning environment.
Nilakant, K., & Mitrovic, A. (2005). Application of data mining in constraint-based intelligent tutoring systems. In Proceedings of the artificial intelligence in education, AIED (pp. 896–898).
Pahl, C., & Donnellan, C. (2003). Data mining technology for the evaluation of web-based teaching and learning systems. In Proceedings of the congress e-learning, Montreal, Canada.
Paulsen, M. (2003). Online education and learning management systems. Bekkestua: NKI Forlaget.
Peled, A., & Rashty, D. (1999). Logging for success: Advancing the use of www logs to improve computer mediated distance learning. Journal of Educational Computing Research, 21(4), 413–431.
Rahkila, M., & Karjalainen, M. (1999). Evaluation of learning in computer based education using log systems. In ASEE/IEEE frontiers in education conference, San Juan, Puerto Rico (pp. 16–21).
Romero, C., & Ventura, S. (2006). Data mining in e-learning. Southampton, UK: Wit Press.
Romero, C., Ventura, S., Bra, P., & Castro, C. (2003). Discovering prediction rules in AHA! courses. In User modeling (pp. 25–34).
Romero, C., Ventura, S., & Bra, P. D. (2004). Knowledge discovery with genetic programming for providing feedback to courseware author. User Modeling and User-Adapted Interaction: The Journal of Personalization Research, 14(5), 425–464.
Sanjeev, P., & Zytkow, J. M. (1995). Discovering enrollment knowledge in university databases. In KDD (pp. 246–251).
Sheard, J., Ceddia, J., Hurst, J., & Tuovinen, J. (2003). Inferring student learning behaviour from website interactions: A usage analysis. Journal of Education and Information Technologies, 8(3), 245–266.
Shen, R., Han, P., Yang, F., Yang, Q., & Huang, J. (2003). Data mining and case-based reasoning for distance learning. Journal of Distance Education Technologies, 1(3), 46–58.
Shen, R., Yang, F., & Han, P. (2002). Data analysis center based on e-learning platform. In Proceedings of the 5th international workshop on the internet challenge: Technology and applications (pp. 19–28).
Silva, D., & Vieira, M. (2002). Using data warehouse and data mining resources for ongoing assessment in distance learning. In IEEE international conference on advanced learning technologies, Kazan, Russia (pp. 40–45).
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 12–23.
Talavera, L., & Gaudioso, E. (2004). Mining student data to characterize similar behavior groups in unstructured collaboration spaces. In Workshop on artificial intelligence in CSCL. 16th European conference on artificial intelligence (pp. 17–23).
Tane, J., Schmitz, C., & Stumme, G. (2004). Semantic resource management for the web: An e-learning application. In Proceedings of the WWW conference, New York, USA (pp. 1–10).
Tang, C., Yin, H., Li, T., Lau, R., Li, Q., & Kilis, D. (2000). Personalized courseware construction based on web data mining. In Proceedings of the first international conference on web information systems engineering, Washington, DC, USA (pp. 204–211).
Tang, T., & McCalla, G. (2002). Student modeling for a web-based learning environment: A data mining approach. In Eighteenth national conference on artificial intelligence, Menlo Park, CA, USA (pp. 967–968).
Tang, T., & McCalla, G. (2005). Smart recommendation for an evolving e-learning system.
International Journal on E-Learning, 4(1), 105–129.
Tsantis, L., & Castellani, J. (2001). Enhancing learning environments through solution-based knowledge discovery tools. Journal of Special Education Technology, 16(4).
Ueno, M. (2004a). Data mining and text mining technologies for collaborative learning in an ILMS “samurai”. In ICALT.
Ueno, M. (2004b). Online outlier detection system for learning time data in e-learning and its evaluation. In International conference on computers and advanced technology in education (pp. 248–253).
Urbancic, T., Skrjanc, M., & Flach, P. (2002). Web-based analysis of data mining and decision support education. AI Communications, 15, 199–204.
Wang, F. (2002). On using data-mining technology for browsing log file analysis in asynchronous learning environment. In Conference on educational multimedia, hypermedia and telecommunications (pp. 2005–2006).
Wang, W., Weng, J., Su, J., & Tseng, S. (2004). Learning portfolio analysis and mining in SCORM compliant environment. In ASEE/IEEE frontiers in education conference (pp. 17–24).
Yu, P., Own, C., & Lin, L. (2001). On learning behavior analysis of web based interactive environment. In Proceedings of ICCEE, Oslo/Bergen, Norway.
Zaïane, O. (2002). Building a recommender agent for e-learning systems. In ICCE (pp. 55–59).
Zaïane, O., & Luo, J. (2001). Web usage mining for a better web-based learning environment. In Proceedings of conference on advanced technology for education, Banff, Alberta (pp. 60–64).
Zaïane, O., Xin, M., & Han, J. (1998). Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In Advances in digital libraries (pp. 19–29).
Zarzo, M. (2003). E-learning in the new era of data mining. In International conference on network universities and e-learning, Valencia, Spain.
Zorrilla, M. E., Menasalvas, E., Marin, D., Mora, E., & Segovia, J. (2005). Web usage mining project for improving web-based learning sites.
In Web mining workshop, Cataluna.
C. Romero, S. Ventura / Expert Systems with Applications 33 (2007) 135–146