Improving e-Learning Recommendation by using Background Knowledge

Blessing Mbipom, Susan Craw, Stewart Massie
School of Computing Science & Digital Media
Robert Gordon University, Aberdeen, UK
b.e.mbipom/s.craw/s.massie@rgu.ac.uk

Abstract

There is currently a large amount of e-Learning resources available to learners on the Web. However, learners often have difficulty finding and retrieving relevant materials to support their learning goals because they lack the domain knowledge to craft effective queries that convey what they wish to learn. In addition, the unfamiliar vocabulary often used by domain experts makes it difficult to map a learner's query to a relevant learning material. We address these challenges by introducing an innovative method that automatically builds background knowledge for a learning domain. In creating our method, we exploit a structured collection of teaching materials as a guide for identifying the important domain concepts. We enrich the identified concepts with discovered text from an encyclopedia, thereby increasing the richness of our acquired knowledge. We employ the developed background knowledge to influence the representation and retrieval of learning resources in order to improve e-Learning recommendation. The effectiveness of our method is evaluated using a collection of Machine Learning and Data Mining papers. Our method outperforms the benchmark, demonstrating the advantage of using background knowledge for improving the representation and recommendation of e-Learning materials.

1 Introduction

Learning-focused content is increasingly available on the Web, thus providing an excellent source of information for building e-Learning systems (Clarà and Barberà, 2013). However, learners often have difficulty finding the right learning materials because they lack the domain knowledge required to formulate effective queries (Chen et al., 2014). In addition, a mismatch between the vocabulary used by learners when crafting their queries and that used by domain experts to describe learning concepts poses a further challenge for systems recommending resources to learners.

Another challenge with e-Learning recommendation is that the learning resources are often unstructured text, and so are not properly indexed for retrieval (Nasraoui and Zhuhadar, 2010). Dealing with unstructured learning resources makes it difficult to find and retrieve relevant learning resources.
Hence there is a need for an effective method of representing learning materials, with the aim of improving recommendation.

This paper proposes the automated acquisition of background knowledge about a domain that can then be employed for enhancing e-Learning recommendation. In our method, we create a concept-aware representation that contains a good coverage of relevant topics from the domain. First, we exploit a structured collection of teaching materials as a guide for identifying the important concepts. Next, we enrich the identified concepts with discovered text from an encyclopedia source, thereby increasing the richness of our representation. Our method is demonstrated in Machine Learning and Data Mining, although it can be applied to learning materials in other domains.

Other projects such as DeepQA (Ferrucci et al., 2013) and DBpedia (Lehmann et al., 2015) use a range of knowledge-rich representations to enhance retrieval. Such knowledge-rich sources are usually in the form of important topics that describe a domain. While these projects generally rely on handcrafted knowledge sources, they highlight the advantage of exploiting knowledge-rich representations as a basis for improving recommendation.

A good coverage of domain topics is useful for representing learning materials. These domain topics contain rich vocabulary and provide a good knowledge source for mapping learners' queries to learning materials, thus allowing us to address the mismatch between the vocabulary used by learners and domain experts. We address this issue by introducing a method that automatically creates custom background knowledge in the form of a rich set of domain topics. Further, we explore building a richer vocabulary to achieve a better coverage of the domain, and this method is employed to improve e-Learning recommendation.

We make several contributions in this work. Firstly, we create background knowledge for an e-Learning domain: we describe how we take advantage of the knowledge of experts contained in e-Books to build a knowledge-rich representation that is used to enhance recommendation. Secondly, we present a method that harnesses the developed background knowledge to augment the representation of learning resources in order to improve e-Learning recommendation. Finally, we explore a larger concept vocabulary which provides a better coverage of the domain. We refine the method presented by Mbipom et al. (2016) to generate a richer and more focused set of domain concepts. The results from our evaluation show the improvement in e-Learning recommendation when the richer concept vocabulary is used for representing learning resources.

The rest of this paper is organised as follows. In Section 2 we present related text representation approaches that underpin this work. Section 3 describes the development of our background knowledge using available knowledge sources. Section 4 discusses the representation of learning resources using our methods. Then Section 5 presents the evaluation of the learning resource representation. In Section 6 we present our refined method of generating background knowledge, with an evaluation using the richer vocabulary and a larger dataset for recommendation. Finally, Section 7 presents our conclusions.

2 Related Work

E-Learning recommendation is challenging because learning resources are often unstructured text, and so are not properly indexed for retrieval.
A possible solution to this challenge is the creation of effective representations that capture the content of learning resources. However, building suitable representations for learning resources in e-Learning environments is not easy (Dietze et al., 2012), as the resources do not have a pre-defined set of features by which they can be indexed. We propose the creation of a knowledge-rich representation that captures the domain-specific vocabulary contained in learning resources.

Figure 1 illustrates two broad approaches often used to address the challenge of text representation. These are corpus-based methods, such as topic models (Blei and McAuliffe, 2007; Chen and Liu, 2014); and structured representations, such as those that take advantage of ontologies (Boyce and Pahl, 2007; Yarandi et al., 2011). In Figure 1, the lower row of items identifies various knowledge sources that can be employed to build a range of knowledge-light to knowledge-rich text representation approaches.

Corpus-based methods usually involve the use of statistical models to identify topics from a corpus. The identified topics are often keywords (Beliga et al., 2015; Matsuo and Ishizuka, 2004) or phrases (Coenen et al., 2007; Witten et al., 1999). Coenen et al. (2007) showed that using a combination of keywords and phrases was better than using only keywords. These topics can be extracted from different text sources such as learning resources (Rodrigues et al., 2007; Yang et al., 2016), metadata, e.g. tables of contents (Bousbahi and Chorfi, 2015), and encyclopedias, e.g. Wikipedia (Milne and Witten, 2008; Qureshi et al., 2014). A drawback of corpus-based methods is that they normally rely on the coverage of the document collection used, so the topics produced may not be representative of the learning domain.

Figure 1: Two broad approaches used for text representation

Structured representations capture relationships between important domain concepts. This often entails using an existing ontology, e.g. the ACM taxonomy (Nasraoui and Zhuhadar, 2010; Ruiz-Iniesta et al., 2014), or creating a new one (Gherasim et al., 2013; Panagiotis et al., 2016). Although ontologies are designed to have a good coverage of their domains, the output is still dependent on the view of their builders and, because of handcrafting, existing ontologies cannot easily be adapted to new domains. e-Learning is dynamic because new resources become available regularly, and so using fixed ontologies limits the potential to incorporate new content.

The approach adopted in this paper draws insight from both the corpus-based methods and structured representations highlighted in Figure 1. We leverage a structured corpus of teaching materials, such as the tables of contents of e-Books, in order to identify important topics in an e-Learning domain. These topics are a combination of keywords and phrases as recommended in (Coenen et al., 2007). The identified topics are then enriched with discovered text from Wikipedia in order to enhance our representation. In addition, we refine the methods developed in previous work (Mbipom et al., 2016) so that we can generate a richer set of relevant topics that provide a good coverage of the learning domain. Consequently, our approach is employed to influence the representation and retrieval of relevant learning resources.
3 Creation of Background Knowledge

Background knowledge refers to information about a domain that is useful for general understanding and problem-solving (Zhang et al., 2013). We attempt to capture background knowledge as a set of domain concepts, each representing an important topic in the domain. For example, in a learning domain such as Machine Learning, you would find topics such as Classification, Clustering and Regression. Each of these topics would be represented by a concept, in the form of a concept label and a pseudo-document which describes the concept. The concepts can then be used to underpin the representation of e-Learning resources.

Our knowledge extraction process is shown in Figure 2. The inputs to this process are domain knowledge sources: we use a structured collection of teaching materials and an encyclopedia source. Next, ngrams are automatically extracted from our structured collection to generate a set of potential concept labels. Then a domain lexicon is used to validate the extracted ngrams, ensuring that the ngrams are also being used in another information source. The encyclopedia provides text descriptions for the identified ngrams. The output from this process is a set of domain concepts, each having a concept label and an associated pseudo-document. We discuss the stages of the background knowledge creation in the following sections.

Figure 2: An overview of the background knowledge creation process

3.1 Knowledge Sources

Two knowledge sources are used as initial inputs for discovering concept labels. A structured collection of teaching materials provides a source for extracting important topics identified by teaching experts in the domain, while a domain lexicon provides a broader but more detailed coverage of the relevant topics in the domain. The lexicon is used to verify that the concept labels identified from the teaching materials are directly relevant. Thereafter, an encyclopedia source, such as Wikipedia pages, is searched to provide the relevant text to form a pseudo-document for each verified concept label. The final output from this process is our set of domain concepts, each comprising a concept label and an associated pseudo-document.

Our approach is demonstrated with learning resources from Machine Learning and Data Mining. We use e-Books as our collection of teaching materials; a summary of the books used is shown in Table 1. Two Google Scholar queries, "Introduction to data mining textbook" and "Introduction to machine learning textbook", guided the selection process, and 20 e-Books that met all of the following three criteria were chosen. Firstly, the book should be about the domain. Secondly, there should be Google Scholar citations for the book. Thirdly, the book should be accessible. We use the Tables of Contents (TOCs) of the books as our structured knowledge source.

We use Wikipedia to create our domain lexicon because it contains articles for many learning domains (Völkel et al., 2006; Zheng et al., 2010) and the contributions of many people (Yang and Lai, 2010), so it provides the coverage we need in our lexicon. The lexicon is generated from two Wikipedia sources. First, the phrases in the contents and overview sections of the chosen domain are extracted to form a topic list. Then, a list with the titles of articles related to the domain is added to the topic list to assemble our lexicon. Overall, our domain lexicon contains a set of 664 Wiki-phrases.
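As a simple illustration of this assembly step, the sketch below merges two Wikipedia-derived phrase lists into a lexicon. It is a minimal sketch under our own assumptions, not the authors' pipeline; the file names and normalisation are hypothetical placeholders.

```python
# Illustrative sketch: assembling a domain lexicon from two Wikipedia-derived
# phrase lists, as described in Section 3.1. File names are hypothetical.

def load_phrases(path):
    """Read one phrase per line, lower-cased and stripped."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Phrases from the contents/overview sections of the domain's Wikipedia page,
# plus the titles of Wikipedia articles related to the domain.
topic_list = load_phrases("wiki_overview_phrases.txt")
article_titles = load_phrases("wiki_related_titles.txt")

# The union of both sources forms the domain lexicon of Wiki-phrases.
lexicon = topic_list | article_titles
print(f"{len(lexicon)} Wiki-phrases in the lexicon")
```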
Table 1: Summary of e-Books used

Book Title & Author | Cites
Machine learning; Mitchell | 264
Introduction to machine learning; Alpaydin | 2621
Machine learning a probabilistic perspective; Murphy | 1059
Introduction to machine learning; Kodratoff | 159
Gaussian processes for machine learning; Rasmussen & Williams | 5365
Introduction to machine learning; Smola & Vishwanathan | 38
Machine learning, neural and statistical classification; Michie, Spiegelhalter, & Taylor | 2899
Introduction to machine learning; Nilsson | 155
A First Encounter with Machine Learning; Welling | 7
Bayesian reasoning and machine learning; Barber | 271
Foundations of machine learning; Mohri, Rostamizadeh, & Talwalkar | 197
Data mining: practical machine learning tools and techniques; Witten & Frank | 27098
Data mining concepts models and techniques; Gorunescu | 244
Web data mining; Liu | 1596
An introduction to data mining; Larose | 1371
Data mining concepts and techniques; Han & Kamber | 22856
Introduction to data mining; Tan, Steinbach, & Kumar | 6887
Principles of data mining; Bramer | 402
Introduction to data mining for the life sciences; Sullivan | 15
Data mining concepts methods and applications; Yin, Kaku, Tang, & Zhu | 23

3.2 Generating Potential Domain Concepts

In the first stage of the process, the text from the TOCs is pre-processed. We remove punctuation, symbols, and numbers from the TOCs, so that only words are used for generating concept labels. After this, we remove two sets of stopwords. First, a standard English stopword list is removed, which discards common words while retaining a good set of words for generating our concept labels. Second, an additional set of words, which we refer to as TOC-stopwords, is removed. It contains: structural words, such as chapter and appendix, which relate to the structure of the TOCs; roman numerals, such as xxiv and xxxv, which are used to indicate the sections in a TOC; and words, such as introduction and conclusion, which describe parts of a learning material and are generic across domains. In addition, words referring directly to the name of the domain used for demonstration are removed, as we wish to generate concepts that describe the domain. We do not use stemming because we found it harmful during pre-processing: when searching an encyclopedia source with the stemmed form of words, relevant results would not be returned. The output from pre-processing is a set of TOC phrases.

In the next stage, we apply ngram extraction to the TOC phrases to generate all 1-3 grams from the entire set of TOC phrases. The output from this process is a set of TOC-ngrams containing 2038 unigrams, 5405 bigrams and 6133 trigrams, which are used as the potential domain concept labels. Many irrelevant ngrams are generated from the TOCs because we have simply selected all 1-3 grams. A minimal sketch of this stage is shown below.
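The following sketch illustrates the pre-processing and 1-3 gram extraction described above. The abridged stopword sets and example TOC lines are illustrative assumptions, not the authors' exact lists.

```python
# Illustrative sketch: pre-processing TOC text and extracting all 1-3 grams
# as potential concept labels (Section 3.2). Stopword sets are abridged.
import re

ENGLISH_STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "with"}  # abridged
TOC_STOPWORDS = {"chapter", "appendix", "xxiv", "xxxv", "introduction", "conclusion"}  # abridged

def preprocess_toc_line(line):
    """Keep alphabetic words only; drop stopwords. Stemming is deliberately not applied."""
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in ENGLISH_STOPWORDS | TOC_STOPWORDS]

def extract_ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams with n_min <= n <= n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical TOC lines; punctuation, numbers, and TOC-stopwords are removed.
toc_phrases = [preprocess_toc_line(l)
               for l in ["3.1 Decision tree induction", "Chapter 5 Cluster analysis"]]
candidates = {g for tokens in toc_phrases for g in extract_ngrams(tokens)}
print(sorted(candidates))
```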
3.3 Verifying Concept Labels using Domain Lexicon

A domain lexicon is used to verify the generated TOC-ngrams, confirming which of the ngrams are relevant for the domain. Our domain lexicon contains a set of 664 Wiki-phrases, each of which is pre-processed by removing non-alphanumeric characters. The distribution of Wiki-phrases is shown in Figure 3; the 84% of Wiki-phrases that are 1-3 grams are used for verification. The comparison of TOC-ngrams with the domain lexicon identifies the potential domain concept labels that are actually being used to describe aspects of the chosen domain in Wikipedia. During verification, ngrams referring directly to the title of the domain, e.g. machine learning and data mining, are not included in the Wiki-phrases because our aim is to generate concept labels that describe specific topics within the domain. Overall, a set of 17 unigrams, 58 bigrams and 15 trigrams is verified as potential concept labels. Bigrams yield the highest number of ngrams, which indicates that bigrams are particularly useful for describing topics in this domain.

Figure 3: Distribution of Wiki-phrases used for verifying concept labels

3.4 Domain Concept Generation

Our domain concepts are generated after a second verification step is applied to the ngrams returned from the previous stage. Each ngram is retained as a concept label if all three of the following criteria are met. Firstly, a Wikipedia page describing the ngram must exist. Secondly, the text describing the ngram must not be contained as part of the page describing another ngram. Thirdly, the ngram must not be a synonym of another ngram. For the third criterion, if two ngrams are synonyms, the ngram with the higher frequency is retained as a concept label while its synonym is retained as part of the extracted text. For example, the two ngrams cluster analysis and clustering are regarded as synonyms in Wikipedia, so the text associated with them is the same. The label clustering is retained as the concept label because it occurs more frequently in the TOCs, and its synonym, cluster analysis, is contained as part of the discovered text.

The concept labels are used to search Wikipedia pages in order to generate a description for each identified concept label. The search returns discovered text that forms a pseudo-document which includes the concept label. So, the concept label and pseudo-document pair make up a domain concept. Overall, 73 domain concepts are generated. Each pseudo-document is pre-processed using the standard techniques of English stopword removal and Porter stemming (Porter, 1980). The pseudo-document terms form the concept vocabulary that can be used to represent resources.
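A pseudo-document of this kind could be gathered as in the sketch below, which assumes the MediaWiki "extracts" API as the encyclopedia source. It is a minimal sketch, not the authors' exact retrieval pipeline, and the concept labels shown are examples only.

```python
# Illustrative sketch: fetching plain-text Wikipedia content to build a
# pseudo-document for each verified concept label (Section 3.4).
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_pseudo_document(label):
    """Return the plain-text extract of the Wikipedia page for `label`, or None."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,   # plain text rather than HTML
        "redirects": 1,     # follow redirects, so synonymous titles share one page
        "titles": label,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract")  # None if no describing page exists

concepts = {}
for label in ["clustering", "decision tree", "support vector machine"]:
    text = fetch_pseudo_document(label)
    if text:  # first verification criterion: a describing page must exist
        concepts[label] = text
```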
4 Representing Learning Resources Using Background Knowledge

Our background knowledge contains a rich representation of the learning domain, and by harnessing this knowledge for representing learning resources, we expect to retrieve documents based on the domain concepts that they contain. These concepts are designed to be effective for e-Learning, because they are assembled from the TOCs of teaching materials (Agrawal et al., 2012). We present two approaches which have been developed by employing our background knowledge in the representation of learning resources.

4.1 The CONCEPTBASED Document Representation Approach

Representing documents with the concept vocabulary allows retrieval to focus on the concepts contained in the documents. Figures 4 and 5 illustrate the CONCEPTBASED method. Firstly, in Figure 4, the concept vocabulary, t1 ... tc, from the pseudo-documents of concepts, C1 ... Cm, is used to create a term-concept matrix and a term-document matrix using TF-IDF weighting (Salton and Buckley, 1988). In Figure 4a, c_ij is the TF-IDF weight of term ti in concept Cj, while Figure 4b shows d_ik, the TF-IDF weight of ti in document Dk.

Figure 4: Term matrices for concepts and documents. (a) Term-concept matrix; (b) term-document matrix

Next, documents D1 ... Dn are represented with respect to concepts by computing the cosine similarity of the term vectors for concepts and documents. The output is the concept-document matrix shown in Figure 5a, where y_jk is the cosine similarity of the term vectors for Cj and Dk from Figures 4a and 4b respectively. Finally, the document similarity is generated by computing the cosine similarity of the concept vectors for documents. Figure 5b shows z_km, the cosine similarity of the concept vectors for Dk and Dm from Figure 5a. So, the CONCEPTBASED approach uses the document representation and similarity in Figure 5 to influence retrieval. We expect to retrieve documents that are similar based on the domain concepts that they contain.

Figure 5: Document representation and similarity using the CONCEPTBASED approach. (a) Concept-document matrix representation; (b) document-document similarity

4.2 The HYBRID Document Representation Approach

The HYBRID approach exploits the relative distribution of the vocabulary in the concept and document spaces to augment the representation of learning resources with a bigger, but focused, vocabulary as shown in Figure 6. The TF-IDF weight of a term thus changes depending on its relative frequency in both spaces. First, our 73 domain concepts, C1 ... Cm from Section 3.4, and the documents we wish to represent, D1 ... Dn, are merged to form a corpus. Next, a term-document matrix with TF-IDF weighting is created using all the terms, t1 ... tT, from the vocabulary of the merged corpus, as shown in Figure 6a. Entry q_ik is the TF-IDF weight of term ti in Dk. If ti has a lower relative frequency in the concept space compared to the document space, then the weight q_ik is boosted; so distinctive terms from the concept space get boosted. Although the overlap of terms from both spaces is useful for altering the term weights, it is valuable to keep all the terms from the document space because this gives us a richer vocabulary. The shaded term vectors for D1 ... Dn in Figure 6a form a term-document matrix for documents whose term weights have been influenced by the presence of terms from the concept vocabulary.

Figure 6: Representation and similarity of documents using the HYBRID approach. (a) Hybrid term-document matrix representation; (b) hybrid document similarity

Finally, the document similarity in Figure 6b is generated by computing the cosine similarity between the augmented term vectors for D1 ... Dn. Entry r_jk is the cosine similarity of the term vectors for documents Dj and Dk from Figure 6a. The HYBRID method exploits the vocabulary in the concept and document spaces to influence the retrieval of documents.
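To make the CONCEPTBASED pipeline concrete, the sketch below shows a minimal version using scikit-learn, assuming concept pseudo-documents and learning resources are plain strings. It is an illustration rather than the authors' implementation, and the HYBRID term re-weighting described above is not reproduced here.

```python
# Minimal sketch of the CONCEPTBASED representation (Figures 4 and 5),
# assuming `concept_docs` (pseudo-documents) and `docs` are lists of strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concept_docs = ["clustering groups similar items ...", "regression predicts numeric values ..."]
docs = ["a paper on k-means clustering ...", "a paper on linear regression ..."]

# Term matrices (Figure 4): one shared term space drawn from the concept vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
term_concept = vectorizer.fit_transform(concept_docs)  # concepts x terms (TF-IDF)
term_document = vectorizer.transform(docs)             # documents in the same term space

# Concept-document matrix (Figure 5a): y_jk = cos(C_j, D_k).
concept_document = cosine_similarity(term_concept, term_document)  # m concepts x n docs

# Document-document similarity (Figure 5b): cosine over the concept vectors.
doc_similarity = cosine_similarity(concept_document.T)  # n docs x n docs
print(doc_similarity.round(2))
```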
5 Evaluating Learning Resource Representation

Our methods are evaluated on a collection of topic-labelled learning resources by simulating an e-Learning recommendation task. We use a collection from Microsoft Academic Search (MAS) (Hands, 2012), in which the author-defined keywords associated with each paper identify the topics the papers contain. The keywords represent what relevance would mean in an e-Learning domain, and we exploit them for judging document relevance. The papers from MAS act as our e-Learning resources and, using a query-by-example scenario, we evaluate the relevance of a retrieved document by considering the overlap of its keywords with those of the query. This evaluation approach allows us to measure the ability of the methods to identify relevant learning resources.

We compare the performance of our CONCEPTBASED and HYBRID methods against that of Bag of Words (BOW). BOW is a standard Information Retrieval method where documents are represented using terms from the document space only, with TF-IDF weighting. For each of the three methods, the documents are first pre-processed by removing English stopwords and applying Porter stemming. Then, after representation, similarity-based retrieval is employed using cosine similarity.

5.1 Evaluation Method and Dataset

Evaluations using human evaluators are expensive, so we take advantage of the author-defined keywords for judging the relevance of a document. The keywords are used to define an overlap metric. Given a query document Q with a set of keywords K_Q, and a retrieved document R with its set of keywords K_R, the relevance of R to Q is based on the overlap of K_R with K_Q. The overlap is computed as:

    Overlap(K_Q, K_R) = |K_Q ∩ K_R| / min(|K_Q|, |K_R|)    (1)

We decide if a retrieval is relevant by setting an overlap threshold; if the overlap between K_Q and K_R meets the threshold, then R is considered to be relevant.

Figure 7 shows the number of keywords per document and the overlap of document pairs for the first dataset used. Our first dataset, which we refer to as dataset 1, contains 217 Machine Learning and Data Mining papers. A distribution of the keywords per document is shown in Figure 7a, where the documents are sorted by the number of keywords they contain. There are 903 unique keywords, and 1,497 keywords in total. A summary of the overlap scores for all document pairs is shown in Figure 7b. There are 23,436 document-pair entries for the 217 documents, and 20,251 are zero, meaning that there is no overlap in 86% of the data. So only 14% of the data have an overlap of keywords, indicating that the distribution of keyword overlap is skewed. 10% of document pairs have overlap scores ≥ 0.14, and 5% are ≥ 0.25. For experiments with this dataset we use 0.14 and 0.25 as thresholds, thus avoiding extreme values that would allow either very many or very few of the documents to be considered as relevant.

Figure 7: Number of keywords per document (a) and overlap profile of document pairs (b) in dataset 1

Our interest is in the topmost documents retrieved, because we want our top recommendations to be relevant. We use precision@n to determine the proportion of relevant documents retrieved:

    Precision@n = |retrievedDocuments ∩ relevantDocuments| / n    (2)

where n is the number of documents retrieved each time, retrievedDocuments is the set of documents retrieved, and relevantDocuments is the set of documents that are considered to be relevant, i.e. have an overlap that meets the threshold.
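A minimal sketch of Equations 1 and 2, together with the leave-one-out loop used in the next section, is given below. The variable names and data structures (keyword sets, a precomputed document-document similarity matrix) are our assumptions.

```python
# Illustrative sketch of the evaluation metrics (Equations 1 and 2) and the
# leave-one-out retrieval loop (Section 5.2).

def overlap(kq, kr):
    """Equation 1: keyword overlap between query and retrieved document."""
    if not kq or not kr:
        return 0.0
    return len(kq & kr) / min(len(kq), len(kr))

def precision_at_n(query_idx, ranked_docs, keywords, n, threshold):
    """Equation 2: fraction of the top-n retrieved documents that are relevant."""
    top_n = ranked_docs[:n]
    relevant = sum(overlap(keywords[query_idx], keywords[d]) >= threshold for d in top_n)
    return relevant / n

def average_precision_at_n(sim, keywords, n, threshold):
    """Leave-one-out: each document in turn is the query; the rest are ranked
    by a document-document similarity matrix `sim` (e.g. from CONCEPTBASED)."""
    scores = []
    for q in range(len(keywords)):
        ranked = sorted((d for d in range(len(keywords)) if d != q),
                        key=lambda d: sim[q][d], reverse=True)
        scores.append(precision_at_n(q, ranked, keywords, n, threshold))
    return sum(scores) / len(scores)
```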
5.2 Evaluation Results

The methods are evaluated using leave-one-out retrieval. In Figure 8, the number of recommendations (n) is shown on the x-axis and the average precision@n on the y-axis. RANDOM has been included to give an idea of the relationship between the threshold and the precision values; the RANDOM results are consistent with the relationship between the threshold and the proportion of data in Figure 7b.

Overall, HYBRID performs better than BOW and CONCEPTBASED, showing that augmenting the representation of documents with a bigger, but focused, vocabulary, as done in HYBRID, is a better way of harnessing our background knowledge. BOW also performs well because the document vocabulary is large, but the vocabulary used in CONCEPTBASED may be too limited. The richer representation used in HYBRID overcomes this limitation of CONCEPTBASED. All the graphs fall as the number of recommendations, n, increases. This is expected because the earlier retrievals are more likely to be relevant. However, the convergence of HYBRID and BOW at higher values of n may be because the documents retrieved by both methods are drawn from the same neighbourhoods.

Figure 8: Precision of the methods at overlap thresholds of 0.14 (a) and 0.25 (b) on dataset 1

The relative performance at a threshold of 0.25, shown in Figure 8b, is similar to the performance at 0.14. However, at this more challenging threshold, HYBRID and BOW do not perform well on the first retrieval. Generally, the results show that the HYBRID method is able to identify relevant learning resources by highlighting the domain concepts they contain, and this is important in e-Learning. The graphs show that augmenting the representation of learning resources with our background knowledge is beneficial for e-Learning recommendation.

6 Refined Background Knowledge

One issue with the previous concept generation method is that the concept vocabulary produced was limited. A suitable representation for e-Learning resources should have a good coverage of relevant domain topics. In this section, we discuss the steps taken to refine the method used for generating domain concepts, in order to improve our background knowledge and increase the coverage of our concept vocabulary.

6.1 Enriched Domain Concepts

In developing this method, we go through the phases described in Sections 3.2 - 3.4, with three refinements. First, in addition to the TOC-stopwords, the SMART stopwords (Salton, 1971) are also removed during pre-processing. This allows us to remove words that do not contribute to learning terms, and still retain a good set of words for generating our concepts. Second, words referring to the name of the domain used for demonstration, such as machine, learning, data, and mining, are not removed during pre-processing, as we observed that removing these words before ngram generation prevents other relevant ngrams that contain any of these words, such as instance based learning or reinforcement learning, from being identified. Third, we increase our ngram extraction to generate 1-5 grams from the TOC phrases because the distribution of Wiki-phrases in Figure 3 showed that 99% of the phrases are 1-5 grams; this allows us to increase the number of concepts we can generate.

We apply ngram extraction to the TOC phrases to produce the following TOC-ngrams: 2467 unigrams, 5387 bigrams, 3625 trigrams, 1668 fourgrams, and 576 fivegrams. The TOC-ngrams are verified as described in Section 3.3 using the Wiki-phrases to produce a set of potential concept labels containing 24 unigrams, 96 bigrams, 38 trigrams, 6 fourgrams, and no fivegrams. A second verification step, as described in Section 3.4, is applied to the potential concept labels. This entails using the verified ngrams to search Wikipedia pages in order to generate a domain concept: the search returns discovered text that forms a pseudo-document for a concept label. Overall, our refined method yields 150 domain concepts that pass the second verification, each having a concept label and pseudo-document pair. The pseudo-document terms are pre-processed using the standard techniques of English stopword removal and Porter stemming.
These terms form the concept vocabulary of our refined background knowledge, which we refer to as the CONCEPTBASED+ method.

6.2 Recommendation using the CONCEPTBASED+ Approach

The CONCEPTBASED+ method employs the richer concept vocabulary of our refined background knowledge for representing documents. We expect the representation created using the CONCEPTBASED+ method to contain a better coverage of the learning domain because of the richer concepts it contains; our aim is to address the issue of the limited concepts contained in the CONCEPTBASED method. For recommendation using CONCEPTBASED+, we use the same representation and document similarity as the CONCEPTBASED method, illustrated in Figures 4 and 5, but with the richer concept vocabulary. So documents are represented with respect to concepts by computing the cosine similarity of term vectors for concepts and documents to produce a concept-document matrix. Then the similarity between documents is generated by computing the similarity between the respective concept vectors for documents.

By using the CONCEPTBASED+ method for representation, we expect to retrieve documents that are similar based on the concepts they contain, and this is obtained from a document-document similarity matrix as shown in Figure 5b. A standard approach to representing documents would be to define the document similarity based on the term-document matrix illustrated in Figure 4b, but this exploits the concept vocabulary only. In our approach, we put more emphasis on the domain concepts, so we use the concept-document matrix illustrated in Figure 5a to underpin the similarity between documents. The CONCEPTBASED+ method combines this focus with the breadth of a richer set of domain concepts when representing documents.

6.3 Evaluating the Refined Representation

This section investigates whether the domain concepts generated using the refined approach, CONCEPTBASED+, are better for representing documents than the concepts generated with the original method, CONCEPTBASED. The same evaluation method and dataset 1 presented in Section 5.1 are adopted here, and leave-one-out retrieval is applied for evaluating the methods. In Figure 9, the number of recommendations is shown on the x-axis and the average precision@n on the y-axis. An overlap threshold of 0.14 is used because 10% of document pairs in this dataset have overlap scores ≥ 0.14.

The performance of CONCEPTBASED+ is shown by the darker line, and CONCEPTBASED by the gray line. BOW is included as the benchmark, and RANDOM gives an idea of the relationship between the threshold used and the precision values. The graphs of all the methods fall as the number of recommendations, n, increases; this is expected as earlier retrievals are more likely to be relevant. Overall, CONCEPTBASED+ outperforms CONCEPTBASED, BOW, and RANDOM, producing better recommendations for all values of n. This performance shows the advantage of using the richer concept vocabulary for representing learning materials. The results confirm that CONCEPTBASED+ contains concepts that have a better coverage of the learning domain than CONCEPTBASED, which has a limited set of concepts. So we adopt CONCEPTBASED+ as the background knowledge representation for learning materials in this domain.
Figure 9: Comparing CONCEPTBASED and CONCEPTBASED+ at a threshold of 0.14 on dataset 1

6.4 Evaluation Using a Larger Dataset

We compare the performance of our HYBRID and CONCEPTBASED+ methods against that of the standard BOW approach on a larger dataset, in order to confirm our findings from the previous experiments. Figure 10 shows the number of keywords per document and the overlap of document pairs for the second dataset used. Our second dataset, which we refer to as dataset 2, contains 1000 Machine Learning and Data Mining papers, also from Microsoft Academic Search. Figure 10a contains a distribution of the keywords per document, where the documents are sorted by the number of keywords they contain. There are 3063 unique keywords, and 4551 keywords in total. We take advantage of these author-defined keywords for judging relevance. A summary of the overlap profile of document pairs for dataset 2 is shown in Figure 10b. There are 499,500 document-pair entries for the 1000 documents, and 480,129 entries are zero, meaning that there is no overlap in 96% of the data. So only 4% of the data have an overlap of keywords, indicating that the distribution of keyword overlap is skewed. 3% of document pairs have overlap scores ≥ 0.2.

Figure 10: Number of keywords per document (a) and overlap profile of document pairs (b) in dataset 2

The same evaluation method presented in Section 5.1 is used here. Leave-one-out retrieval is applied, and precision@n, as given in Equation 2, is used to determine the proportion of relevant documents retrieved. With dataset 2, we use a threshold of 0.2, thus preventing values that would allow either too many or too few documents to be considered as relevant. In Figure 11, the number of recommendations is shown on the x-axis and the average precision@n on the y-axis. The average precision values are based on the overlap of keywords between document pairs and the threshold value chosen for the experiment. RANDOM gives an idea of the relationship between the threshold and the precision values, and its results are consistent with the overlap profile in Figure 10b.

On this bigger dataset, the CONCEPTBASED+ method outperforms HYBRID, BOW, and CONCEPTBASED, confirming that using a richer and focused vocabulary to represent documents is useful for e-Learning recommendation. The results also show HYBRID performing better than BOW, again confirming that augmenting the representation of learning resources with domain concepts is better than using the content only for e-Learning recommendation. Experiments were also run at thresholds of 0.25 and 0.33, and the relative performance at these thresholds is similar to the performance at 0.2, so the graphs are not shown. Our results show that we are able to leverage the vocabulary from CONCEPTBASED+, which is not only a larger vocabulary but one focused on domain concepts, thus allowing our method to influence the retrieval and recommendation of relevant learning resources.

Figure 11: Precision of the methods at overlap threshold of 0.2 on dataset 2

7 Conclusions

The growing availability of e-Learning materials on the Web provides opportunities for learners to easily access new and valuable information. However, finding good materials is difficult because retrieval has to overcome the challenge of the ineffective queries often input by learners. e-Learning recommendation offers a possible solution to this difficulty.
However, recommendation in e-Learning environments is challenging because the learning materials are often unstructured text, and so are not properly indexed for retrieval. We address this challenge by creating a method that automatically acquires background knowledge in the form of a rich set of concepts related to the selected learning domain. In building our method, we take advantage of the knowledge of experts contained in the TOCs of e-Books to identify relevant domain topics; by using e-Books we benefit from the provenance associated with these teaching materials. The identified topics are enriched with discovered text from Wikipedia, and this extends the coverage and richness of our representation.

The CONCEPTBASED method takes advantage of similar distributions of concept terms in the concept and document spaces to define a concept-term driven representation. Although the concept vocabulary in CONCEPTBASED is limited, HYBRID exploits the relative distribution of the vocabulary in the concept and document spaces to augment the representation of learning resources with a larger vocabulary influenced by domain concepts. CONCEPTBASED+ provides a richer concept vocabulary that allows concept-based distinctiveness to be helpful in the representation and retrieval of documents. This refined method allows us to generate a richer and more focused set of domain concepts, which provides a better coverage of the domain. The performance of CONCEPTBASED+ in our evaluation shows the advantage of using the richer concept vocabulary for representing learning materials. Our results confirm an improvement in e-Learning recommendation when a rich concept vocabulary is used for representing learning resources.

References

Agrawal, R., Chakraborty, S., Gollapudi, S., Kannan, A., and Kenthapadi, K. (2012). Quality of textbooks: An empirical study. In ACM Symposium on Computing for Development, pages 16:1–16:1. doi: 10.1145/2160601.2160623.

Beliga, S., Meštrović, A., and Martinčić-Ipšić, S. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences, 39(1):1–20.

Blei, D. M. and McAuliffe, J. D. (2007). Supervised topic models. In Neural Information Processing Systems, pages 121–128.

Bousbahi, F. and Chorfi, H. (2015). MOOC-Rec: A case based recommender system for MOOCs. Procedia - Social and Behavioral Sciences, 195:1813–1822. doi: 10.1016/j.sbspro.2015.06.395.

Boyce, S. and Pahl, C. (2007). Developing domain ontologies for course content. Journal of Educational Technology & Society, 10(3):275–288.

Chen, W., Niu, Z., Zhao, X., and Li, Y. (2014). A hybrid recommendation algorithm adapted in e-learning environments. World Wide Web, 17(2):271–284. doi: 10.1007/s11280-012-0187-z.

Chen, Z. and Liu, B. (2014). Topic modeling using topics from many domains, lifelong learning and big data. In 31st International Conference on Machine Learning, pages 703–711.

Clarà, M. and Barberà, E. (2013). Learning online: Massive Open Online Courses (MOOCs), connectivism, and cultural psychology. Distance Education, 34(1):129–136. doi: 10.1080/01587919.2013.770428.

Coenen, F., Leng, P., Sanderson, R., and Wang, Y. J. (2007). Statistical identification of key phrases for text classification. In Machine Learning and Data Mining in Pattern Recognition, pages 838–853. Springer. doi: 10.1007/978-3-540-73499-4_63.

Dietze, S., Yu, H. Q., Giordano, D., Kaldoudi, E., Dovrolis, N., and Taibi, D. (2012).
Linked education: Interlinking educational resources and the Web of data. In 27th Annual ACM Symposium on Applied Computing, pages 366–371. doi: 10.1145/2245276.2245347.

Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., and Mueller, E. T. (2013). Watson: Beyond Jeopardy! Artificial Intelligence, 199:93–105. doi: 10.1016/j.artint.2012.06.009.

Gherasim, T., Harzallah, M., Berio, G., and Kuntz, P. (2013). Methods and tools for automatic construction of ontologies from textual resources: A framework for comparison and its application. In Advances in Knowledge Discovery and Management, pages 177–201. Springer. doi: 10.1007/978-3-642-35855-5_9.

Hands, A. (2012). Microsoft academic search. Technical Services Quarterly, 29(3):251–252. doi: 10.1080/07317131.2012.682026.

Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al. (2015). DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195. doi: 10.3233/SW-140134.

Matsuo, Y. and Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(1):157–169. doi: 10.1142/S0218213004001466.

Mbipom, B., Craw, S., and Massie, S. (2016). Harnessing background knowledge for e-learning recommendation. In Research and Development in Intelligent Systems XXXIII: Incorporating Applications and Innovations in Intelligent Systems XXIV, pages 3–17. Springer. doi: 10.1007/978-3-319-47175-4_1.

Milne, D. and Witten, I. H. (2008). Learning to link with Wikipedia. In 17th ACM Conference on Information and Knowledge Management, pages 509–518. doi: 10.1145/1458082.1458150.

Nasraoui, O. and Zhuhadar, L. (2010). Improving recall and precision of a personalized semantic search engine for e-learning. In 4th IEEE International Conference on Digital Society, pages 216–221. doi: 10.1109/ICDS.2010.63.

Panagiotis, S., Ioannis, P., Christos, G., and Achilles, K. (2016). APLe: Agents for personalized learning in distance learning. In 7th International Conference on Computer Supported Education, pages 37–56. Springer. doi: 10.1007/978-3-319-29585-5_3.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Qureshi, M. A., O'Riordan, C., and Pasi, G. (2014). Exploiting Wikipedia to identify domain-specific key terms/phrases from a short-text collection. In Proceedings of the 5th Italian Information Retrieval Workshop, pages 63–74.

Rodrigues, L., Antunes, B., Gomes, P., Santos, A., Barbeira, J., and Carvalho, R. (2007). Using textual CBR for e-learning content categorization and retrieval. In 4th International Conference on Case-Based Reasoning Workshop on Textual Case-Based Reasoning.

Ruiz-Iniesta, A., Jimenez-Diaz, G., and Gomez-Albarran, M. (2014). A semantically enriched context-aware OER recommendation strategy and its application to a computer science OER repository. IEEE Transactions on Education, 57(4):255–260. doi: 10.1109/TE.2014.2309554.

Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Upper Saddle River, NJ, USA.

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523. doi: 10.1016/0306-4573(88)90021-0.

Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., and Studer, R. (2006). Semantic Wikipedia.
In Proceedings of the 15th International Conference on World Wide Web, pages 585–594. ACM. doi: 10.1145/1135777.1135863.

Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., and Nevill-Manning, C. G. (1999). KEA: Practical automatic keyphrase extraction. In 4th ACM Conference on Digital Libraries, pages 254–255. doi: 10.1145/313238.313437.

Yang, H.-L. and Lai, C.-Y. (2010). Motivations of Wikipedia content contributors. Computers in Human Behavior, 26(6):1377–1383. doi: 10.1016/j.chb.2010.04.011.

Yang, K., Chen, Z., Cai, Y., Huang, D., and Leung, H. F. (2016). Improved automatic keyword extraction given more semantic knowledge. In International Conference on Database Systems for Advanced Applications, pages 112–125. Springer. doi: 10.1007/978-3-319-32055-7_10.

Yarandi, M., Tawil, A.-R., and Jahankhani, H. (2011). Adaptive e-learning system using ontology. In Proceedings of the 22nd International Workshop on Database and Expert Systems Applications, pages 511–516. doi: 10.1109/DEXA.2011.9.

Zhang, X., Liu, J., and Cole, M. (2013). Task topic knowledge vs. background domain knowledge: Impact of two types of knowledge on user search performance. In Advances in Information Systems and Technologies, pages 179–191. Springer. doi: 10.1007/978-3-642-36981-0_17.

Zheng, Z., Li, F., Huang, M., and Zhu, X. (2010). Learning to link entities with knowledge base. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 483–491.