key: cord-0141416-h7gz47jw authors: Gomez, Manuel J.; Calderón, Mario; Sánchez, Victor; Clemente, Félix J. García; Ruipérez-Valiente, José A. title: Large Scale Analysis of Open MOOC Reviews to Support Learners' Course Selection date: 2022-01-11 journal: nan DOI: nan sha: 6296d4d142de31f33d4e58717325bc631f39ef51 doc_id: 141416 cord_uid: h7gz47jw

The recent pandemic has changed the way we see education. It is not surprising that children and college students are not the only ones using online education. Millions of adults have signed up for online classes and courses during the last few years, and MOOC providers, such as Coursera or edX, are reporting millions of new users signing up on their platforms. However, students do face some challenges when choosing courses. Though online review systems are standard among many verticals, no standardized or fully decentralized review systems exist in the MOOC ecosystem. In this vein, we believe that there is an opportunity to leverage available open MOOC reviews in order to build simpler and more transparent reviewing systems, allowing users to really identify the best courses out there. Specifically, in our research we analyze 2.4 million reviews (the largest MOOC review dataset used to date) from five different platforms in order to determine the following: (1) if the numeric ratings provide discriminant information to learners, (2) if NLP-driven sentiment analysis on textual reviews could provide valuable information to learners, (3) if we can leverage NLP-driven topic finding techniques to infer themes that could be important for learners, and (4) if we can use these models to effectively characterize MOOCs based on the open reviews. Results show that numeric ratings are clearly biased (63% of them are 5-star ratings), and the topic modeling reveals some interesting topics related to course advertisements, real-world applicability, or the difficulty of the different courses. We expect our study to shed some light on the area and promote a more transparent approach in online education reviews, which are becoming more and more popular as we enter the post-pandemic era.

Massive Open Online Courses (MOOCs) are a relatively recent educational phenomenon that has received a great deal of attention during the last decade (Pina and Steffens, 2015). This attention is due to their potential to disrupt traditional educational pathways through ease of access and free or low-cost content. MOOCs offer the potential to enable access to high-quality education to students, even in the most underserved regions of the world (Castillo et al., 2015). Furthermore, the recent COVID-19 pandemic has completely shifted the behavior of students around the world (Dhawan, 2020). Strict lockdowns and a growing fear of returning to classrooms were the perfect storm for the unparalleled acceleration in the adoption of MOOCs. Thus, it is not surprising that the largest providers of MOOCs, such as Coursera or Udemy, reported a +50% growth in revenue in 2020 (Lohr, 2020), the most significant year-over-year increase since the beginning of the MOOC era. Taking advantage of the momentum, ed-tech providers attempted to capitalize by widening their offer of courses. In fact, Class Central reported that, by the end of 2020, 16,300 MOOCs had been announced or launched, and in 2020 alone, around 2,800 courses were added (Shah, 2021).
Though a sudden increase in the educational offer seems like an idyllic scenario, as it provides a broader range of educational options, it is yet to be confirmed whether this growth is a net positive for the system as a whole. Some would argue that a supply side where a provider like Coursera alone shows around 415 Excel courses simply does not need one more Excel course. As a result, we might be reaching what some scholars call "choice overload," as a higher number of options might decrease the motivation to choose or the satisfaction with the final preferred option (Scheibehenne et al., 2010). Arguably, this lack of satisfaction with the preferred option contributes to low completion rates in the MOOC space, which are reported to be around 10% (Khalil and Ebner, 2014). Moreover, Hew and Cheung (2014) reported that the quality of MOOC education and the MOOC business model are still unresolved issues. In addition, more than 55% of the students who were part of a study conducted in LATAM reported spending days and weeks choosing a course (SkillMapper, 2022), shedding light on how time-consuming the process of finding and purchasing courses can be, a process that ultimately fails to deliver on the promise of democratizing education. In addition to the trend above, students do face other challenges when choosing courses. Though online review systems are standard among many verticals, no standardized or fully decentralized review systems exist in the MOOC ecosystem. Whereas some MOOC providers, such as Class Central (Central, 2021), rely on the standard 5-star rating system and free-form feedback, other popular ones do not even offer the option to leave a single review. Moreover, many of these MOOCs only allow the user to leave a review once the course has been completed (Gamage et al., 2016). Thus, the opinions of around 90% of students who never finish the MOOC are missing, usually creating a very positive and skewed perspective. In this vein, NCA (a research agency) found that around 18% of students who do not finish a MOOC experienced teachers with low pedagogical quality (SkillMapper, 2022). Naturally, the restrictiveness of many of these review systems limits the voice of many students, thus suppressing the distribution of these kinds of relevant insights for MOOC purchase decision making. More transparency in the reviewing systems could allow new approaches to facilitate students' choices when comparing different courses. The inconsistent review systems, high search costs, and meager completion rates contribute to a poor experience on many MOOC providers. Currently, textual reviews are treated as information complementary to the rating, giving a sense of what the course is like, but they are not factored into the review's rating itself or into personalized recommendations (Kapoor et al., 2020). We believe that there is an opportunity to leverage available open MOOC reviews in order to build simpler and more transparent reviewing systems, allowing users to really identify the best courses out there. In fact, Natural Language Processing (NLP) provides a set of techniques and approaches to analyze textual information (e.g., sentiment analysis, topic modeling, text classification) that could be used to add value to the reviews and provide useful insights for learners. Ultimately, there is an opportunity to provide a framework that could benefit the overall MOOC industry.
Such a framework would enable more transparency, better discoverability, and increased satisfaction with choice, while also helping to align the professional development of our society with current industry needs. As a by-product, these enhancements could also lead to improved overall satisfaction with MOOC providers. In our work, we aimed to collect the largest dataset of MOOC reviews from five different platforms via web scraping, in order to analyze around 2.4 million reviews and to transform this big data into useful insights and actionable information for learners and other stakeholders, such as industry and educators. Specifically, we state four research questions (RQs):
• RQ1. Do the crowdsourced numeric ratings of the courses provide discriminant information to prospective learners?
• RQ2. Does NLP-driven sentiment analysis on the textual reviews provide valuable information to prospective learners?
• RQ3. Can we leverage NLP-driven topic analysis techniques to find themes that can be important for prospective learners?
  - RQ3.1. Using words describing the course in a qualitative way.
  - RQ3.2. Using words describing the courses' content.
• RQ4. Can we use these models to effectively characterize MOOCs based on the open reviews?
The rest of the paper is organized as follows. Section 2 reviews background literature on MOOCs, previous course selection approaches, and previous analyses using MOOC reviews. Section 3 describes the complete methodology, and Section 4 shows the results of our analysis, answering each one of the RQs. Then, we finalize the paper with the discussion in Section 5 and the conclusions and future work in Section 6.

Technological advances, particularly Web 2.0, have brought a major transformation in the delivery of education, and one of the most recent innovations in e-learning is the development, roll-out, and uptake of MOOCs (Terras and Ramsay, 2015). MOOCs represent the next stage in the evolution of open educational resources (Al-Rahmi et al., 2018). In addition, they have the advantage of being available to all and open to an unlimited number of students (Al-Rahmi et al., 2019). The increase in the number of MOOCs has been dramatic in recent years, and their scope, spanning every sector of education, has led them to be considered a new form of virtual technology-enhanced learning environment (Pina and Steffens, 2015). In 2013, Billington and Fronmueller (2013) stated that some experts considered MOOCs a bubble that would burst soon. However, by the end of 2020, more than 180 million students had registered in one or more of the 16,300 courses offered by over 950 universities worldwide (Shah, 2021), indicating that MOOCs have continued to capture the attention of many institutions and the public worldwide since the "year of the MOOC" in 2012 (Pappano, 2012; Lan and Hew, 2020). Moreover, they have overcome all the challenges of the COVID-19 pandemic and have been instrumental in providing courses to learners, as they do not have physical boundaries (Purkayastha and Sinha, 2021). Although MOOCs have a great potential for educating large numbers of students in a wide variety of settings, previous research has also identified some problematic issues in this area. One of the major problems is the high dropout rate for learners: a small percentage (generally around 10%) of the large numbers of participants enrolling in MOOCs manage to complete the course, which is a poor result compared to traditional courses (Liyanagunawardena et al., 2014).
Bezerra and Silva (2017) conducted a review of the different reasons causing these high dropout rates, identifying 24 different reasons, such as the lack of prior knowledge, the difference between the expected and the real level of the course, or the quality of materials. The second major area of concern relates to the financial model used in these courses. While there is a great potential for attracting large numbers of students, universities still need to determine how to generate sufficient income from the delivery of these free or low-priced courses (Milheim, 2013). We see another example in (Jobe et al., 2014), where the authors reported challenges associated with recognition, validation, and accreditation of learning, even though this accreditation is obviously a difficult goal to achieve. Of course, another area of major concern is academic integrity, including verifying that a student who is registered for a course is the same person who completed the tasks, and the potential for widespread plagiarism on homework assignments. Finally, another common problem that is usually not discussed in the literature is course selection. Given the large number of courses (most of them with very high ratings) and information out there, learners are easily disoriented by the information overload (Zhang et al., 2018), and they usually find it difficult to choose the appropriate course for them (taking several days to decide); this course selection difficulty is also related to the high dropout rate, as they may not be choosing the best course to fit their preferences. Some of the reasons for the high dropout rate reported previously could be related to the overload and misunderstandings that learners suffer, since they do not usually know the real characteristics of the course that they are choosing: in a survey conducted by Gütl et al. (2014), 8.96% of the students reported that the course was too difficult, while 7.46% emphasized that the course was not challenging. Moreover, 14.93% of students also highlighted that the learning environment was not personalized, and another 6.72% indicated that the courses were poorly taught. The selection of courses is a crucial decision for learners, and the use of existing reviews from other learners to reveal the courses' real characteristics could be an interesting solution to make this choice easier and more fruitful. Learners are exposed to various challenges with this excess of learning resources and information: which provider should they choose to search for a specific MOOC? Who is the best provider? How can a specific MOOC be found? Are the available MOOC reviews reliable? (Ouertani and Alawadh, 2017). In the literature we usually find different approaches trying to perform an efficient course selection: collaborative filtering (matching users with similar interests in the same courses), content-based filtering (using features from courses that the user likes to recommend similar items), knowledge-based approaches, or hybrid recommender approaches, among others (Al-Badarenah and Alsakran, 2016). Ouertani and Alawadh (2017) presented the idea of a system that could satisfy learners' needs when searching for suitable courses among different providers, using previous experiences of its users. Moreover, Al-Badarenah and Alsakran (2016) developed a recommendation system that employed an association rules algorithm to recommend university elective courses to a target student based on what other similar learners have taken.
Aher and Lobo (2012) presented an approach combining k-means clustering and the Apriori association rule algorithm to recommend courses to students. We found many other studies that used this type of approach to improve course selection (Aher, 2014; Huang et al., 2019; Vanitha et al., 2019). In fact, in their review, Khalid et al. (2020) reported that most of the work on the implementation of recommender systems for courses used collaborative and content-based filtering. In our research, we decided to use a novel approach that aims to leverage existing open MOOC reviews using NLP to gather information that could be useful for learners' course selection. Moreover, as we do in our case study, previous researchers have explored the potential of using open MOOC reviews and NLP. For example, the authors in (Deng and Benckendorff, 2021) reported that it is critical that student reviews are systematically analyzed and interpreted in conjunction with student ratings data. In their work, they presented a study using Leximancer, a data-mining tool which extracted the key concepts from the data collection based on the frequency of occurrence and co-occurrence of concepts. This study highlighted some interesting key terms inferred, such as "videos," "engaging," "understanding," "easy," or "problems." Furthermore, Wen et al. (2014) applied sentiment analysis toward the curriculum and key course tools from forum posts, finding that the emotion of students was correlated with the number of students dropping out of the course each day. Moreover, Chen et al. (2020) employed an innovative structural topic modeling technique to analyze 1,920 reviews of 339 courses regarding computer science to understand what primary concerns the learners had. They found that 64.2% of the reviews in their data collection were 5-star reviews, while 16.7% were 4-star reviews, revealing clearly biased rating values. Regarding structural topic modeling, they found topics such as "Course levels," "Teaching style," or "Course content." Furthermore, Chang et al. (2016) presented a novel approach that used the courses' videos to locate "hot" video segments (i.e., the parts that students reviewed more), generate their subtitles, and extract the main keywords, helping the teacher better understand which parts of the content of each topic are most difficult for the learners. In our work, we go beyond the state of the art by presenting an approach that uses the largest MOOC review dataset to date in the literature to build NLP models (specifically topic finding and sentiment analysis) aimed at helping and supporting learners' course selection easily and quickly, rather than only extracting terms or revealing important topics within a word collection. While the literature usually considers extracting a set of key terms from the entire review dataset, our topic modeling approach considers two different aspects of the reviews: on the one hand, we consider words in reviews describing the course in a qualitative way; on the other hand, we consider words in reviews describing the courses' content. This approach aims to gather insights from two different points of view, also reducing the bias produced by other types of words.

We conducted our analysis in six different stages: 1) MOOC review providers, 2) Data scraping, 3) Data cleansing and wrangling, 4) Data collection, 5) Data pre-processing, and 6) Data modeling. We can see the entire methodology process represented in Figure 1.
Next, we explain each stage in detail. Initially, we considered five different MOOC review providers for our analysis: Udemy, Coursera, Domestika, Platzi, and Crehana. We briefly introduce each platform below:
• Udemy is a for-profit MOOC provider aimed at professional adults and students. Founded in May 2010 by Eren Bali, Gagan Biyani, and Oktay Caglar, it offers more than 183,000 courses taught by 65,000 different instructors, with 75 different languages available. It offers interesting features such as course quality checklists, a teaching hub, and training videos (Udemy, 2021).
• Coursera is an American MOOC provider founded in 2012 by Andrew Ng and Daphne Koller. In 2021, 150 universities offered more than 4,000 courses through Coursera, reaching more than 190 different countries. The platform is updated every year with new features such as personalized browsing, learner skill tracking, or cloud imports (Coursera, 2021b,a).
• Domestika is an online learning platform for creatives, with paid courses on software, crafts, illustration, and more. It was first started more than a decade ago in Spain, but has since moved its headquarters to San Francisco. Although there are a few courses about Python and web programming, as in other platforms, what makes the platform unique is its unwavering focus on creatives, and its major feature is the exceptionally high course quality (Rowan, 2021).
• Platzi is a MOOC provider founded in Colombia by Freddy Vega and Christian Van Der Henst in 2014. It offers more than 700 open courses within different areas, such as data science, videogames, or blockchain, among many others (Platzi, 2021).
• Crehana is an online education provider for today's student: digital and creative. Over 1,000,000 students around the world have used the platform, which offers subjects such as illustration, design, business, and more. Its main goal is to bring education closer to all types of students regardless of geographical location and income level (Dañino, 2021).
First of all, we needed to retrieve the structure of each of the websites. We used two different approaches, depending on how the websites were structured. In our experience:
• We used Scrapy to work with websites that serve static pages. Scrapy is a web framework built upon years of experience in extracting massive amounts of data in a robust and efficient manner (Kouzis-Loukas, 2016).
• We used Selenium, a portable open-source tool (Bruns et al., 2009), to work with websites that are more dynamic, use infinite scrolling, or require clicking buttons to get more information.
Using these tools, we started scraping the existing reviews on each one of the platforms. We worked with the metadata divided into two different levels:
• Review level. For each review, we collected the following information: URL, review text, rating of the review (from one to five), the platform, the username, the date of the review, and the identifier of the related course.
• Course level. For each course, we collected the following information: the course identifier, URL, title, the platform related to the course, the content category of the course, and the name of the teacher.
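As an illustration of the review-level scraping stage, the sketch below shows a minimal Scrapy spider for a static review page. The URL and CSS selectors are hypothetical placeholders (each platform needs its own selectors and pagination logic), so this is a sketch of the approach rather than the exact spiders used in our pipeline.

```python
import scrapy


class CourseReviewSpider(scrapy.Spider):
    """Minimal review-level spider; the URL and selectors are illustrative placeholders."""
    name = "course_reviews"
    start_urls = ["https://mooc-provider.example.com/course/sample-course/reviews"]

    def parse(self, response):
        course_id = response.url.rstrip("/").split("/")[-2]
        for block in response.css("div.review"):  # one block per review (hypothetical selector)
            yield {
                "url": response.url,
                "course_id": course_id,
                "review": block.css("p.review-text::text").get(),
                "rating": block.css("span.stars::attr(data-value)").get(),  # 1 to 5
                "username": block.css("span.username::text").get(),
                "date": block.css("time::attr(datetime)").get(),
                "platform": "example-platform",
            }
        # Follow pagination, if present, to collect the full review history of the course.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Dynamic pages were handled analogously with Selenium, replacing the response parsing above with browser-driven scrolling and clicking.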
The first step was to download all the crawled data and store it directly in a database. Because there is no structure defined and we only need to process the data once, we used Firestore, a non-relational database from Google Cloud. Firestore provides a simple and cheap solution to store data in a non-relational format, thus meeting our requirements. We used a dictionary structure to store the data in Firestore, adapted to each one of the educational platforms. After the data is saved in the non-relational database, we proceeded to apply a stepwise approach to clean and process the data, ensuring that the fields are properly registered. This part of the process is one of the most essential steps because it considers a set of rules that we use to process the metadata and give a proper structure to it. This series of rules mostly leverages regular expressions and variable conversions. These functions allow us to remove elements that do not add value, such as commas and special characters, and to normalize the types of variables when necessary. As a result, we obtain a clean dataset that will help to optimize our analysis later. This step is run in Python and consists of: (1) reading all the data that we previously stored in Firestore, (2) applying a particular set of rules to each of the rows, and (3) saving the processed value in an object. Since we needed to identify the language of each review, and this information is not available in the original metadata, we identified it using the raw text of each review. Initially, we considered using the audio feature of the course to identify the language, but we cannot rely on this feature to ensure that all the reviews are in the same language as the course, so we finally decided to work with the fastText library (Bhattacharjee, 2018) to infer the language of each review. The last step was to store the data in a PostgreSQL database. PostgreSQL is a relational database that helps us understand the collected data by easily running some analytics. The objects that we built are inserted into the two tables that we consider for this project:
• final_courses_info: contains all the metadata at the course level.
• final_reviews_info: contains all the metadata at the review level.
We initially considered all the data scraped from the five different platforms to collect the metadata. However, since for this study we decided to focus only on English reviews, the data collection was distributed as follows:
• Udemy platform: It has 95,143 courses and 71,518 of them have reviews, meaning that 24.8% of them do not have any review.
• Coursera platform: It considers 4,154 courses and 2,932 of them have reviews, so 29.4% of courses do not have any review.
• Domestika platform: It has 558 courses in English and 541 of them have reviews, meaning that 3% of the courses do not have any review.
• Platzi platform: It has 451 courses in English and 450 of them have reviews, meaning that 0.2% of the courses do not have any review.
• Crehana platform: It has 1 course with reviews (100%).
We established a minimum threshold of one review for each course to be considered in the analysis, meaning that 25% of the courses were removed from our work. In our data collection, some of the reviews are concentrated on certain courses. This can be related to several factors, such as important differences in the popularity of each one of the courses, marketing efforts from the platforms, or hot topics. For example, a "Roman history" MOOC will attract a very different population of learners in quantity and background than a "Python" course. The distribution of reviews by the different providers can be seen in Figure 2a. Furthermore, in Figure 2b we can see the distribution of the number of reviews per course. The maximum number of reviews for a single course is 9,149, the minimum is one review, and the median is six reviews per course.
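Returning to the cleansing and language-identification steps described earlier in this section, the snippet below is a minimal sketch of how they can be implemented: a couple of illustrative regular-expression rules (the real rule set is larger and platform-specific) followed by language inference with fastText's pre-trained lid.176.bin language-identification model, which must be downloaded separately.

```python
import re
import fasttext  # pip install fasttext; lid.176.bin is fastText's pre-trained language-ID model

# Two illustrative cleansing rules in the spirit of the pipeline described above.
def clean_text(raw: str) -> str:
    text = re.sub(r"http\S+", " ", raw)        # drop URLs
    text = re.sub(r"[^\w\s']", " ", text)      # drop commas and other special characters
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

lid_model = fasttext.load_model("lid.176.bin")  # path to the downloaded model file

def detect_language(review: str) -> str:
    labels, _ = lid_model.predict(review.replace("\n", " "))
    return labels[0].replace("__label__", "")   # e.g., "en", "es"

review = "Great course, the examples were clear and easy to follow."
print(clean_text(review), "->", detect_language(review))  # -> "en"
```

The cleaned records, together with the inferred language, are what end up in the final_reviews_info table described above.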
When trying to leverage NLP-driven topic finding in such a large set of reviews, we found that a large proportion of the words do not provide useful information to learners, such as "lot," "give," or "thing," among many others. We wanted to provide feedback to learners that could become useful insights to select courses, thus we collected the most frequent words in the collection (words that appear more than 500 times in the entire collection), and we manually classified every word into two different categories:
• QualitativeDescription: if the word is related to a qualitative description of the course (e.g., easy, clear, practical). This category includes 357 different words.
• Content: if the word is related to the content of the course itself (e.g., machine, yoga, cooking). This category includes 759 different words.
This categorization will help us to create the two different topic finding models in our research. Every word not matching either of those two categories will be excluded from our later analysis. Once we classified those words, the next step in the pre-processing is to clean each textual review. To do that, we keep only the review's main words, removing, for example, unnecessary URLs, numbers, or additional space characters. Moreover, to apply NLP techniques afterward, we need to define a set of "stop words" (i.e., words that will not be considered in the text analysis). Some examples of stop words are "would," "be," or "however," which are common words that appear in every document but do not provide helpful information to the analysis. From this point on, we can start treating each review as a "document," since most of the cleaning process has ended. Once the full review collection was cleaned, we lemmatized every document using the pywsd library. Lemmatization is the process of converting a word to its base form, and its implementation in pywsd works as follows:
1. It tokenizes the string, dividing it into a set of tokens (words).
2. It uses a Part-Of-Speech (POS) tagger to map each word to a POS tag (adverb, noun, adjective, etc.).
3. It calls the lemmatizer with the token and the POS tag to get the base form of the word.
This subsection is divided into four separate parts, each one explaining the analysis used to address each RQ. To discover the distribution of ratings within the data collection, we followed two different approaches: in the first approach, we simply counted every single numeric rating to see which were the most common ones; in the second approach, we calculated the mean rating of each one of the courses in order to see the distribution of ratings across the different courses separately. The main aim of sentiment analysis (or opinion mining) is to discover emotion, opinion, subjectivity, and attitude from a natural text. There are many different techniques that are frequently applied to natural text to determine the sentiment, such as feature extraction, emoticon analysis, or tokenization. During sentiment analysis, positive and negative words are usually extracted from the text and assigned a score from a dictionary of words (Ahuja and Dubey, 2017). In our work, to perform sentiment analysis, we explored the use of three different libraries in Python, namely TextBlob, VADER, and Flair:
• TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more (Loria, 2018). In its sentiment analysis implementation, given a sentence, it returns a score within the range [-1.0, 1.0], depending on how positive or how negative that sentence is.
• VADER: Valence Aware Dictionary for Sentiment Reasoning (VADER) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to the sentiments expressed in social media. It is an entirely free open-source tool, and it also takes into consideration word order and degree modifiers (Elbagir and Yang, 2019). Given its implementation, we used the "compound" value that the library provides, which is a number within the range [-1.0, 1.0], depending on how positive or how negative that sentence is.
• Flair is a framework which aims to present a simple, unified interface for conceptually very different types of word and document embeddings, hiding all embedding-specific engineering complexity and allowing researchers to "mix and match" various embeddings with little effort. It provides a pre-trained sentiment analysis model, which returns a tag ("Positive" or "Negative") depending on the sentiment of the text provided.
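The short sketch below illustrates how a single review can be scored with the three libraries just described. It uses each library's public API (TextBlob's polarity, VADER's compound score from the vaderSentiment package, and Flair's pre-trained en-sentiment classifier), while the example review itself is made up for illustration.

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from flair.data import Sentence
from flair.models import TextClassifier

vader = SentimentIntensityAnalyzer()
flair_clf = TextClassifier.load("en-sentiment")  # pre-trained Flair sentiment model

def score_review(text: str) -> dict:
    sentence = Sentence(text)
    flair_clf.predict(sentence)
    flair_label = sentence.labels[0]
    return {
        "textblob": TextBlob(text).sentiment.polarity,     # score in [-1.0, 1.0]
        "vader": vader.polarity_scores(text)["compound"],  # score in [-1.0, 1.0]
        "flair": (flair_label.value, round(flair_label.score, 3)),  # (POSITIVE/NEGATIVE, confidence)
    }

print(score_review("The videos were engaging, but the final project felt outdated."))
```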
To discover the main topics within the review collection, we applied Latent Dirichlet Allocation (LDA) topic modeling to the data. Specifically, we used the gensim library (Saxton, 2018) and its ldaMallet model (Graham et al., 2012). Generally in LDA, each document can be described by a distribution of topics, and each topic can be described by a distribution of words; ldaMallet uses an optimized Gibbs sampling algorithm for LDA (Yao et al., 2009). There are multiple metrics for evaluating the optimal number of topics. Recent studies have shown that the classic predictive likelihood metric (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated (O'Callaghan et al., 2015). This has led many studies to focus on the development of topic coherence measures. To determine the optimal number of topics, we used two of these coherence measures: C_v and C_UMass (Röder et al., 2015; Kapadia, 2019):
• The C_v measure is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity.
• The C_UMass measure is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as confirmation measure.
Based on these two coherence measures, we chose the number of topics for the two final models. The first of our models aimed to identify topics that described the course in a qualitative way (e.g., easy, clear, detailed, hard), thus we selected, for each review, only words that were classified as QualitativeDescription, removing the rest of the review to avoid bias produced by other types of words. Then, the second of our models aimed to identify topics that described the content of the courses, thus we selected only words that were previously classified as Content, removing the remaining ones.
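As a minimal sketch of this stage, the snippet below trains an LDA model on a toy corpus and computes the two coherence scores. For simplicity it uses gensim's built-in LdaModel, whereas our pipeline relies on the ldaMallet wrapper (which additionally requires the external MALLET toolkit); the toy documents stand in for the pre-processed reviews restricted to the kept category words. The per-document topic weights retrieved at the end are the quantities that feed the topic proportions described next.

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# Toy stand-in for the pre-processed reviews (only the kept category words remain).
docs = [["informative", "easy", "fun", "helpful"],
        ["outdated", "error", "issue", "week"],
        ["clear", "simple", "short", "easy"],
        ["practical", "real", "worth", "apply"],
        ["slide", "repetitive", "lack", "visual"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42, passes=10)

# Coherence measures used to pick the number of topics.
c_v = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v").get_coherence()
c_umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass").get_coherence()
print(f"C_v = {c_v:.3f}, C_UMass = {c_umass:.3f}")

# Per-document topic weights (each review can belong to several topics with a weight).
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```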
Then, based on the topics discovered by the topic finding algorithm, we calculated the proportion of each topic across the entire corpus. To do that, we evaluated each review to get its associated topics (note that, in LDA, each document can be assigned to several topics with a certain weight). We calculated the proportion of each topic as follows:

P_j = (Σ_{i=1}^{N} w_{ij}) / N

Therefore, the proportion of topic j is the summation of each weight w_{ij} assigned to topic j in each document i = 1, ..., N, divided by the number of documents in the corpus (N). To provide a more complete analysis, we wanted to see the relationship between the sentiment analysis and the qualitative topic model (e.g., see if a course with a bad mean rating is also related to more negative topics). To that end, we calculated the distribution of topics in each course separately (the distribution in each course being the sum of the proportions of each of its reviews separately), and compared them to the positive or negative values given by the sentiment classifier. A given course will be labeled as Positive if the majority of its reviews are labeled as Positive, and Negative if the majority of its reviews are labeled as Negative.

4.1. RQ1. Do the crowdsourced numeric ratings of the courses provide discriminant information to prospective learners?
In Figure 3a we can see the distribution of numeric ratings available in our collection. About 1.52 million reviews (63% of the total) are 5-star ratings, and about 519,000 reviews (21.5% of the total) are 4- and 4.5-star ratings, thus we have clearly biased data, where only 17.5% of the ratings are 3.5-star ratings or worse. Moreover, in Figure 3b we see that, out of 93,678 total courses, 48,640 (52%) have a mean rating between 4.5 and 5 stars, and 34,770 (37%) courses have a mean rating greater than or equal to 3.5 and below 4.5, highlighting again the systematic biases existing in open MOOC reviews. Therefore, this motivates alternative models to facilitate learners' course selection.

4.2. RQ2. Does NLP-driven sentiment analysis on the textual reviews provide valuable information to prospective learners?
As we stated in Section 3.6.2, we performed sentiment analysis on textual reviews based on three different libraries: TextBlob, VADER, and Flair. Since TextBlob and VADER provide numeric values, we can compare both libraries directly. After applying the pre-trained sentiment models to every review in our corpus, we obtained a mean compound value of 0.54 (SD = 0.37) using VADER, and a mean compound value of 0.37 (SD = 0.30) using TextBlob. Then, we aggregated the sentiment values by course. On the one hand, we find that 67,019 courses (71.5%) have a mean compound value above 0.4 (high positive mean) using VADER. On the other hand, we find that 29,777 courses (31.8%) have a mean compound value above 0.4 using TextBlob. We also calculated the Pearson's correlation coefficient and the Spearman's rank correlation coefficient between the compound sentiment value and the numeric ratings for both libraries. The VADER library obtained a Pearson's coefficient value of 0.49 and a Spearman's coefficient value of 0.35. In the case of TextBlob, the Pearson's coefficient was 0.38, while the Spearman's coefficient was 0.31. Although all these coefficients show a positive correlation between the two variables, the VADER library obtained a higher value, indicating a moderate level of correlation since it is below 0.5. In the case of Flair, it provides a label indicating if the sentence is Positive or Negative. To compare the three libraries, we established a threshold to classify TextBlob and VADER compound values into one of three (positive, negative, and neutral) labels: we classify a sentence as Positive when the compound value > 0.1, as Neutral when −0.1 ≤ compound value ≤ 0.1, and as Negative when the compound value < −0.1.
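A minimal sketch of this labeling scheme is shown below: the threshold maps each continuous compound score to one of the three labels, and a course is then labeled by the majority label of its reviews, as described above. The function and variable names are ours, for illustration only.

```python
from collections import Counter

def label_from_compound(compound: float) -> str:
    """Map a TextBlob/VADER compound score to a discrete sentiment label."""
    if compound > 0.1:
        return "Positive"
    if compound < -0.1:
        return "Negative"
    return "Neutral"

def label_course(review_labels: list[str]) -> str:
    """A course takes the majority label of its reviews."""
    return Counter(review_labels).most_common(1)[0][0]

review_scores = [0.62, 0.05, -0.3, 0.45]
review_labels = [label_from_compound(s) for s in review_scores]
print(review_labels)                # ['Positive', 'Neutral', 'Negative', 'Positive']
print(label_course(review_labels))  # 'Positive'
```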
In Figure 4a we can see a comparison of the number of positive, neutral, and negative reviews classified by each one of the libraries. As we can observe, all three libraries classified the majority of the reviews as Positive (2M, 2.02M, and 1.95M), also indicating skewed data. However, we see a more significant difference if we take a look at the Neutral and Negative reviews. On the one hand, regarding VADER, we note that 245,538 reviews are Neutral and 135,571 are Negative. On the other hand, TextBlob classifies 371,348 reviews as Neutral and 78,198 as Negative. Finally, the Flair library results show that 401,575 reviews are Negative and only 516 reviews are Neutral. Furthermore, in Figure 4b we can see a box plot comparing the sentiment values between the two libraries providing continuous values: VADER and TextBlob. As we observe, the VADER library produces higher sentiment values (median = 0.62, Q1 = 0.45, Q3 = 0.77) than TextBlob (median = 0.35, Q1 = 0.14, Q3 = 0.6).

4.3. RQ3. Can we leverage NLP-driven topic analysis techniques to find themes that can be important for prospective learners?
As described previously, we created two different topic finding models: the first one using words that describe the courses qualitatively, and the second one using words that describe the courses' content. The first model that we built was the qualitative description one. After applying the LDA algorithm as described in Section 3.6.3, we obtained a set of coherence measures to determine the optimal number of topics for this concrete model. We determined 14 to be the optimal number of topics, with a C_v score of 0.20 and a C_UMass score of -3.88. A summary of each topic, including the five most important related words, can be found in Table 1.

Table 1. Main terms of each topic:
1. Informative, easy, fun, enjoyable, helpful
2. Personal, authentic, productive, write, child, book
3. Basic, beginner, introduction, overview, simple
4. Simple, easy, effective, clear, short
5. Error, week, wrong, issue, outdated
6. Voice, fast, slow, hard, monotone
7. Money, worth, ad, free, quality
8. Specialization, team, rigorous, introduction
9. Step, explanation, detailed, easy, clear
10. Exam, test, practice, certification, real
11. Beginner, basic, advanced, intermediate
12. Real, worth, life, apply, practical
13. Theory, practical, theoretical, lab, hand
14. Slide, powerpoint, visual, lack, repetitive

We can see a wide variety of topics, such as the first one, described by the words "informative," "easy," or "fun," or the second one, described by the words "productive," "personal," or "authentic." Then, Figure 5 shows the distribution of these topics across the entire collection of reviews. In the figure, each topic is labeled by its three most important keywords (e.g., "simple easy effective"). As we can observe, the most frequent topics are "informative easy fun" (12.2%), "basic beginner introduction" (8.8%), and "personal authentic productive" (8.3%), and the least frequent topics are "real worth life" (5.3%) and "slide powerpoint visual" (5.0%). To see the relationship between the qualitative topic model and the sentiment analysis, we calculated the distribution of topics in labeled courses based on their reviews' sentiment value.
In this case, we used the Flair library, whose sentiment classifier, unlike those of other NLP packages, is based on a character-level LSTM neural network that takes sequences of letters and words into account when predicting. It is trained on a corpus but can also predict a sentiment for OOV (out-of-vocabulary) words, including typos (Terry-Jack, 2019). In Figure 6 we can see the distribution for both Positive and Negative labeled courses. As we can see, among the most frequent topics for positive courses there are more positive topics, such as "informative easy fun" (11.6%) and "simple easy effective" (8.6%). Then, in negative courses, the most frequent topics are the ones describing the course in a negative way, such as "voice fast slow" (11.7%) or "error week wrong" (10.4%). However, since most of the topics are positive, courses with negative sentiment values also show a high frequency of positive topics. Then, we conducted a MANOVA test including the 15 different topics, confirming that there is a significant difference in topic proportions when comparing Positive and Negative courses (F = 59.05, p < 0.0001). Thus, we can see that there is a relation between the sentiment analysis and the topic modeling approach that we have performed in this research. Moreover, the second model that we built was the one describing the courses' content. After applying the LDA algorithm, we determined 14 to be the optimal number of topics, with a C_v score of 0.27 and a C_UMass score of -5.08. In this case, we assigned a label to each topic based on the existing content categories in the corpus and the keywords of each topic. A summary of each topic, including its name, description, and the five most important related words, can be found in Table 2.

Table 2 (excerpt):
• Investing & Trading: Includes courses that aim to teach investing and trading knowledge and techniques. Although both investing and trading are methods of attempting to profit in the financial markets, investing often takes a long-term approach and trading usually takes a short-term approach. Main terms: business, trading, market, trade, financial.
• Music: Courses related to the music field, such as music production, how to sing, or how to play any instrument. Main terms: music, play, audio, song, piano.
• Courses that aim to teach security- and network-related content, such as how to prevent a hacking attack or how to provide more security to your devices. Main terms: security, software, hack, project, management.
• Language learning: Courses dedicated to learning any language (e.g., Spanish, French). Main terms: language, speak, accent, pronunciation, chinese.
• Finance & Accounting: Content related to accounting, which is an essential tool for providing information for decision-making, as well as for the evaluation of decisions previously made, and finance, which must seek resources at a reasonable cost and use them efficiently (Melé et al., 2017).

Again, we see a large variety of topics referring to the different content areas that courses aim to teach, such as "Programming," "Music," or "Language learning." Then, Figure 7 shows the distribution of these topics across the entire collection of reviews. In this case, we see a very uniform distribution of topics, since all of them have a similar value. Although there is a wide variety of knowledge areas among the different courses, there is only a 1.5% difference between the most and least frequent topics.
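For the MANOVA comparison between Positive and Negative courses reported above, a multivariate test over the per-course topic proportions can be run with statsmodels, as in the hedged sketch below. The data frame here is a randomly generated stand-in for our per-course table, and the column names (t1..t3, sentiment) are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Stand-in for the per-course table: columns t1..t3 are hypothetical topic proportions,
# and `sentiment` is the Flair-based majority label of each course.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.dirichlet([1.0, 1.0, 1.0], size=40), columns=["t1", "t2", "t3"])
df["sentiment"] = rng.choice(["Positive", "Negative"], size=40)

# Multivariate test: do topic-proportion vectors differ between Positive and Negative courses?
manova = MANOVA.from_formula("t1 + t2 + t3 ~ sentiment", data=df)
print(manova.mv_test())  # reports Wilks' lambda, Pillai's trace, and the associated F and p-values
```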
4.4. RQ4. Can we use these models to effectively characterize MOOCs based on the open reviews?
To evaluate whether we can effectively characterize MOOCs using the models developed in this research, we chose a set of four MOOCs belonging to different topics in order to analyze them:
• Finance for Non-Finance Professionals. The original category of this course is "Business" and it has 342 available reviews.
• Using Python to Access Web Data. The original category of this course is "Computer science" and it has 5,748 available reviews.
• Mind Control: Managing Your Mental Health During COVID-19. The original category of this course is "Health" and it has 1,183 available reviews.
• Teach English Now! Foundational Principles. The original category of this course is "Social sciences" and it has 2,447 available reviews.
As we can see in Figure 8, all four courses have a mean rating above 4.7 stars, which confirms the biased data that we previously highlighted. Furthermore, we also see a positive correlation between the mean rating and the sentiment analysis results of the courses (calculated using the Flair library), since the courses showing a higher mean rating also show a higher percentage of Positive labeled reviews. For example, the business-related course has a mean rating of 4.83/5, and 96.2% of its reviews are labeled as Positive, while the Python-related course has a mean rating of 4.73/5, and 88% of its reviews are labeled as Positive. Then, we evaluated the two topic models obtained through NLP on the open reviews. For the qualitative description model on the left, the COVID-19 and English courses have a high proportion of the topics "real worth life," "informative easy fun," and "personal authentic productive," while the Python-related course has a higher proportion of the topic "error week wrong," which indicates that reviews are revealing more problems in this last course than in the others. Furthermore, this finding is also confirmed by the fact that this course has the lowest mean rating and sentiment analysis result of the selected courses. In addition, the topics that have a high proportion in the COVID-19 and English courses reveal additional positive aspects of these courses, namely the personalized and real application that they show. Regarding the content model, we see that our approach obtains an accurate result when modeling the content of each course automatically. For example, we see that the health-related course, which was originally classified within the "Health" category, has a very high proportion of the topics "General health" and "Health and lifestyle." Furthermore, the business-related course, which was originally labeled as a "Business" course, has a very high proportion of the topics "Investing & Trading" and "Finance & Accounting." These results validate that our models can automatically add additional and reliable information for learners to select the most appropriate course for their preferences.

This study has four research objectives. First, the study aims to analyze the numeric ratings in the collection in order to see if they provide any valuable information to learners. Second, the study aims to perform NLP-driven sentiment analysis on textual reviews to see if this analysis provides any valuable information to learners, comparing it with the previous numeric ratings. Then, the study aims to perform topic analysis on textual reviews to find the most representative themes across the corpus and see if these themes provide any valuable information to learners.
To achieve these research objectives, this study used around 2.4 million reviews from learners who participated in 93,678 different courses across four MOOC platforms. The major findings revealed in this study are reported in this section and discussed in conjunction with the existing MOOC literature. Finally, this research performed an analysis trying to characterize four different MOOCs, applying the models developed in this study. In Section 2 we identified some previous studies that also tried to analyze existing reviews in the MOOC field. For example, Chen et al. (2020) analyzed 1,920 reviews from 339 computer science courses, finding that 64.2% of the reviews in their data collection were 5-star reviews, while 16.7% were 4-star reviews. In addition, Gamage et al. (2016) analyzed 137 courses from Coursera, edX, FutureLearn, Udacity, Iversity, Canvas, and a few independent MOOC platforms from universities. They found that 18.24% of the courses had a 5-star rating, while 43.8% of the courses had a 4-star rating. The results of our research also confirm that the majority of reviews are 5-star and 4-star ratings. This bias is evident in the fact that most courses receive a majority of 5-star ratings. Further, using the 5-star rating system may provide limited insight into separating good from great products or supplies, and star ratings may be less meaningful for differentiating what learners like and dislike about a course design (Li et al., 2021). Instead, this research also explores sentiment analysis and topic modeling to gather insights into different course elements that were positive or negative to the learner experience. The majority of MOOC reviews are less than 50 words long, and such short texts are likely to create poor context-dependency, insufficient co-occurrence information, and a sparse feature matrix, resulting in poor LDA performance (Qi and Liu, 2021). Knöös and Rääf (2021) analyzed 28,000 reviews scraped from five courses within the field of data science, and found that the average length of the extracted reviews was 103 characters. Many reviews consisted of only one or two adjectives such as "good" or "nice." In addition to having the biggest dataset analyzed in the area, our research found that the average length of the extracted reviews was 134.6 characters, which is larger than the typical average length of a review. In addition, since the coherence score is only an indicator of model performance, we iterated and manually reviewed the topics to ensure their relevance and coherence. Our sentiment analysis results confirm that we have skewed data, also revealing a positive correlation with the numeric ratings. Previous studies (Knöös and Rääf, 2021) used VADER sentiment analysis to predict the sentiment of reviews, finding that the library could predict the sentiment of positive reviews with 93% accuracy. However, the accuracy in the prediction of negative and neutral reviews was less than 60%. Thus, since there might be some negative reviews classified as positive, we can only estimate the factors and topics that learners had a negative sentiment towards. Chen et al. (2020) performed structural topic modeling and identified some interesting topics such as "course level," "teaching style," or "problem solving," while Deng and Benckendorff (2021) identified "Learning" as the most salient theme.
Moreover, a closer inspection of that concrete theme showed that learners placed high importance on making connections between MOOC content and their everyday lives. In our research, we found that reviews also emphasized the importance of the application of MOOCs in real life, with frequent topics and important keywords, such as topic 9 ("practice," "real") and topic 12 ("real," "worth," "life," "practical"). Seeing this trend, we recommend that MOOC content should take a more practical approach, applied in real-life contexts and environments, and that instructional conditions in MOOCs should be engineered to facilitate the acquisition of knowledge that can be transferred to real-world practices. In their study, Chen et al. (2020) explored the relationship between the ratings and the topics identified, finding that negative reviews tended to relate more to issues such as course assessment, learning tools and platforms, and critique, while positive reviews were more concerned with issues such as course levels, course organization, and learning perception. In our research, since the correlation identified between sentiment analysis results and ratings is a moderate one, we also decided to explore the relationship between sentiment analysis and our topic modeling approach, finding that courses with positive sentiment values tend to be related to positive topics, while courses with negative sentiment values tend to be more related to negative topics. Furthermore, we can take a look at both of our topic models and discuss the coherence and usability of the topics revealed. Regarding the qualitative description model, we see that some topics could be really useful for students to identify the courses' key features, such as topic 14 ("slide," "powerpoint," "lack," "repetitive"), identifying courses that might not be as well designed as others, or topic 2 ("personal," "authentic," "productive"), highlighting the personalization that many students missed in other courses (Gütl et al., 2014). However, we also identify other topics that might not add useful information, such as topic 11 ("beginner," "intermediate," "advanced"), whose keywords are confusing. Another example is topic 13, mixing the keywords "theory" and "practical," which likewise does not clarify much. Regarding the content model, we believe that the topics revealed by the model are appropriate in comparison with the original categories defined by the courses themselves, and also consistent among themselves. However, a topic model with a larger number of topics might discover new ones that are not included in our current analysis. This research's findings validate that the existing open MOOC review systems contain courses with extremely positive ratings, making the decision-making process challenging. A potential MOOC student will face thousands of highly-rated MOOCs for popular topics such as Python or Excel, wherein the majority of course ratings are around 4.7 stars or above, exacerbating what could be described as a "choice paradox" (Scheibehenne et al., 2010). A challenge that no MOOC provider has taken seriously is that all review and search systems are far from vertical-specific and have not improved significantly since MOOCs became more popular in 2012. This is not surprising, as providers have no incentives to improve the decision-making process and mainly focus on widening their offer and reach.
It is unknown whether providers mostly display courses with the highest rating by curation or if a structural decision pushes the overall distribution of the system to a standard that does not reflect reality. Moreover, there is a clear economic incentive for MOOC providers to only display courses in the high four-star range. A recent study from McKinsey found that even the slightest improvement in the star rating score (+0.2) in the retail sector could trigger a 37% increase in sales throughout a product life cycle (Briedis et al., 2020). The bias mentioned above, combined with an online education system with completion rates as low as 7%, creates the perfect scenario for MOOC students to be skeptical about the quality promoted by these providers, ultimately eroding trust in the system and leading to an inefficient use of time and money throughout the online educational journey. This study also has some limitations. Non-English reviews were excluded from this study. An analysis of non-English reviews may provide additional insight into the learning experience idiosyncratic to these learners (e.g., individuals who use non-English MOOC platforms, individuals whose primary language is not English) (Deng and Benckendorff, 2021). Future research could overcome this limitation by analyzing non-English reviews to provide a more comprehensive and complete understanding of the learning experience in MOOCs. Moreover, the reviews considered for the analysis were not further filtered by length and contextual richness. Arguably, further research could utilize a subset of reviews that would provide more context, and hence the results of a new topic labeling system could provide more nuance for the student. There are reasons to believe that artificial intelligence and community-driven tools can help improve the overall system. In this vein, the approach in our work using topic modeling, together with standardized and decentralized review systems, can provide more transparency to the decision-making process. Moreover, we can foresee increasing interest in platforms that leverage the aforementioned techniques and want to help students in their educational journey, thus creating a better system that increases the Return on Education (ROE). Last but not least, it is essential to recognize that the online education revolution will only be as good and transformative as the providers want it to be. However, there is also agency on the part of students, who are poised to seek more transparency in online education systems, which are only becoming more popular as we enter the post-pandemic era.

In this study, we aimed to answer four research questions in order to support learners' course selection in MOOCs, using the largest dataset so far in the literature. First, we analyzed the numeric ratings in our data collection, finding a biased dataset with a majority of 5-star and 4-star ratings. Then, we performed sentiment analysis on textual reviews using three pre-trained models from different libraries. Results suggest that there is a positive correlation between the compound sentiment values and the numeric ratings. All three libraries show that the majority of textual reviews (with results varying between 80.8% and 83.8%) are classified as positive reviews. Then, we built two models based on a topic modeling approach. The first model used words that described the courses in a qualitative way, and revealed some interesting topics and keywords such as "informative easy fun," "money worth ad," or "slide powerpoint visual."
This approach can effectively characterize courses, thus helping learners decide which course to choose, highlighting features that have been important for learners over time. The second model used words that described the courses' content, and revealed frequent topics such as "Health and lifestyle," "Programming," or "Cloud computing." Since each course can cover different areas, this approach can help learners discover some hidden topics in existing courses, based on other learners' reviews. We also calculated the distribution of each topic, which allowed us to see which were the most and least frequent topics across the entire collection. Moreover, we explored the relationship between the sentiment analysis results and the qualitative topic modeling approach, finding that courses with positive sentiment values tend to be related to positive topics such as "informative easy fun" or "real worth life," while courses with negative sentiment values tend to be more related to negative topics like "error week wrong." Finally, we tried to characterize four different MOOCs using our models, which confirmed the positive correlation between the sentiment analysis and the numeric rating, but also the difference in topic characterization when comparing courses with higher and lower sentiment analysis results. As part of our future work, we would like to validate the Content topic model against the original content category of every course to see the accuracy of the model. Moreover, since the typically limited length of reviews could be a problem for the LDA model, future studies could consider only long reviews in order to explore more relevant or meaningful topics. In addition, some of the pre-trained sentiment models tend to fail in some classifications (especially for neutral and negative reviews). Thus, in future studies, the design of a lexicon that better takes into account negative words could improve the analysis and results obtained. Finally, another step would be the development of a real recommendation system to validate the appropriateness of these discovered topics for their real purpose: allowing students to identify the best courses among the almost infinite variety of them. To wrap up, we believe that there is an opportunity to leverage current approaches and create more powerful, yet simpler, reviewing systems, using techniques similar to the ones applied in this research, thus being able to really identify the best courses out there and move beyond the current, highly biased MOOC review systems.

References
Em&aa: An algorithm for predicting the course selection by student in e-learning using data mining techniques
Prediction of course selection by student using combination of data mining algorithms in e-learning
Clustering and sentiment analysis on twitter data
An automated recommender system for course selection
Massive open online courses (moocs): Data on higher education
A model of factors affecting learning performance through the use of social media in malaysian higher education
A review of literature on the reasons that cause the high dropout rates in the moocs
fastText Quick Start Guide: Get started with Facebook's library for text representation and classification
Moocs and the future of higher education
Adapting to the next normal in retail: The customer experience imperative
Web application tests with selenium
Moocs for development: Trends, challenges, and opportunities. International Technologies & International Development 11
Class central
Developing a data-driven learning interest recommendation system to promoting self-paced learning on moocs
What are moocs learners' concerns? text analysis of reviews for computer science courses
Impact report 2021: Serving the world through learning
Introducing new tools and features as demand for online learning grows
What are the key themes associated with the positive learning experience in moocs? an empirical investigation of learners' ratings and reviews
Online learning: A panacea in the time of covid-19 crisis
Twitter sentiment analysis using natural language toolkit and vader sentiment
What do star rates for moocs tell you? an analysis of pedagogy and review rates to identify effective pedagogical model
Getting started with topic modeling and MALLET
Attrition in mooc: Lessons learned from drop-out students
Students' and instructors' use of massive open online courses (moocs): Motivations and challenges
A score prediction approach for optional course recommendation via cross-user-domain collaborative filtering
Moocs for professional teacher development
Evaluate topic models: Latent dirichlet allocation (lda)
Movie recommendation system using nlp tools
Recommender systems for moocs: A systematic literature survey
Moocs completion rates and possible methods to improve retention - a literature review
Sentiment analysis of mooc learner reviews: What motivates learners to complete a course?
Learning scrapy
Examining learning engagement in moocs: A self-determination theoretical perspective using mixed method
Key factors in mooc pedagogy based on nlp sentiment analysis of learner reviews: What makes a hit
Dropout: Mooc participants' perspective
Remember the moocs? after near-death, they're booming. The New York Times 26
Ethics in finance and accounting: Editorial introduction
Massive open online courses (moocs): Current applications and future potential
Moocs recommender system: A recommender system for the massive open online courses, in: Innovations in Smart Learning
An analysis of the coherence of descriptors in topic modeling
The year of the mooc. The New York Times 2
Are moocs promising learning environments?
Unstoppable study with moocs during covid 19 pandemic: A study
Evaluating on-line courses via reviews mining
Exploring the space of topic coherence measures
What is domestika
A gentle introduction to topic modeling using python
Can there ever be too many options? a meta-analytic review of choice overload
By the numbers: Moocs in 2020
Did you pay for an online course and never finish it? you aren't alone and lack of time isn't always to blame
Massive open online courses (moocs): Insights and challenges from a psychological perspective
URL: https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c
About us
Data science in action
Collaborative optimization algorithm for learning path construction in e-learning
Sentiment analysis in mooc discussion forums: What does it tell us?, in: Educational data mining
Efficient methods for topic model inference on streaming document collections
Mcrs: A course recommendation system for moocs. Multimedia Tools and Applications