key: cord-0883292-llqx21go authors: Kannan, Rathimala; Wang, Ivan Zhi Wei; Ong, Hway Boon; Ramakrishnan, Kannan; Alamsyah, Andry title: COVID-19 impact: Customised economic stimulus package recommender system using machine learning techniques date: 2021-11-12 journal: F1000Res DOI: 10.12688/f1000research.72976.2 sha: d6d4acb5d4e494cb66e980fcc3e91f1dde09900e doc_id: 883292 cord_uid: llqx21go

Background: The Malaysian government reacted to the pandemic's economic effect with the Prihatin Rakyat Economic Stimulus Package (ESP) to cushion the impact of the novel coronavirus 2019 (COVID-19) on households. The ESP consists of cash assistance, utility discount, moratorium, Employee Provident Fund (EPF) cash withdrawals, credit guarantee scheme and wage subsidies. A survey carried out by the Department of Statistics Malaysia (DOSM) shows that households prefer different types of financial assistance. These preferences highlight the need to customise ESPs effectively to manage the economic burden among low-income households. In this study, a recommender system for such ESPs was designed by leveraging data analytics and machine learning techniques. Methods: This study used a dataset from DOSM titled "Effects of COVID-19 on the Economy and Individual - Round 2," collected from April 10 to April 24, 2020. The Cross-Industry Standard Process for Data Mining was followed to develop machine learning models that classify ESP receivers according to their preferred subsidy types. Four machine learning techniques (Decision Tree, Gradient Boosted Tree, Random Forest and Naïve Bayes) were used to build predictive models for the moratorium, utility discount, and EPF and Private Remuneration Scheme (PRS) cash withdrawal subsidies. The best predictive model was selected based on the F-score metric. Results: Among the four machine learning techniques, Gradient Boosted Tree outperformed the rest. This technique predicted the following: moratorium preferences with 93.8% sensitivity, 82.1% precision and 87.6% F-score; utility discount preferences with 86% sensitivity, 82.1% precision and 84% F-score; and EPF and PRS preferences with 83.6% sensitivity, 81.2% precision and 82.4% F-score. Households that prefer moratorium subsidies did not favour other financial aids except for cash assistance. Conclusion: The findings present machine learning models that can predict individual household preferences from the ESP. These models can be used to design customised ESPs that effectively manage the financial burden of low-income households.

The novel coronavirus 2019 (COVID-19) pandemic has created devastation in people's lives worldwide, both socially and economically (Shah et al., 2020). As a result, governments have adopted various strategies aimed at reducing the pandemic's impact, particularly the financial strain. The Malaysian government has introduced a series of economic stimulus packages to support various segments of its citizens. One such support is the Prihatin Rakyat Economic Stimulus Package (ESP), introduced to cushion the impact of COVID-19 on low-income households after the first movement control order in the country. The ESP consists of cash assistance, utility discount, moratorium, Employee Provident Fund and Private Remuneration Scheme (EPF and PRS) cash withdrawals, and a credit guarantee scheme and wage subsidies (Flanders et al., 2020). Following the implementation of the ESP, the Department of Statistics Malaysia (DOSM) carried out a special survey from April 10 to April 24, 2020 to better understand the implications of COVID-19 for the economy and households.
The study included questions on social and economic factors and subsidy preferences. A typical low-income household often bears considerable debt and has limited savings. When movement control was implemented, households that lost their income sources faced difficulties in accessing necessities, such as food and housing (Flanders et al., 2020). Even though the government offered the ESP to help residents cope financially, the demands and preferences of citizens in the event of a pandemic are unknown. For example, some households are reluctant to withdraw from EPF and PRS because doing so reduces their savings for old age. A personalised ESP could be built to reduce residents' financial burden in this crisis if their requirements and preferences for the various subsidies, such as cash allowance, utility discount, moratorium or EPF and PRS withdrawals, could be foreseen. Using data analytics and machine learning approaches, this study analysed the survey data and constructed predictive models for customised economic stimulus packages. The following research questions were put forward. 1. How can households that favour moratorium subsidies be identified? 2. How can households that seek utility discount subsidies be identified? 3. How can households who desire EPF and PRS withdrawal subsidies be identified? This study contributes to the literature by applying four machine learning techniques to socioeconomic survey data to predict household subsidy preferences. A comparison of feature selection methods, such as the Gini index and gain ratio, and of various partitioning ratios of the training and test data sets was also carried out. The outcomes of this study can help the government deliver better and improved stimulus packages in the future based on individual preferences.

For planning and execution, this study used the Cross-Industry Standard Process for Data Mining (CRISP-DM), which is the industry-independent de facto standard for implementing data mining initiatives (Schröer et al., 2021). This process has six phases, namely business understanding, data understanding, data preparation, modelling, evaluation and deployment. Figure 1 depicts the activities carried out in each phase, as further explained below. The business understanding phase identifies the problems to solve from a machine learning perspective. All three research questions were selected as the problems, and the purpose was to propose predictive models for the moratorium, utility discount and EPF and PRS subsidies in the Prihatin Rakyat ESP. Data gathering, evaluation, characterisation and quality assurance are part of the data understanding phase. DOSM performed a special survey (Round 2) to investigate the consequences of the COVID-19 epidemic on household economics and status ("Department of Statistics Malaysia Official Portal," 2020; Malaysia, 2020). The dataset includes 36 questions and 41,386 respondents. However, the data obtained from DOSM were not complete because some questions were missing, namely Q3, Q6, Q19 and Q27-Q31. In terms of respondents, the data were complete, with a total of 41,386 participants, all aged 15 and older; 96.8% of the respondents had received benefits from the Prihatin Rakyat ESP. The raw data were based on respondents' answers, which included qualitative personal opinions on the economy, employment, lifestyle and education. The original dataset was in Malay and was translated into English for this study, as shown in Table 1.
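As a rough illustration of this data understanding step, the sketch below shows how such a survey extract could be loaded and profiled in Python with pandas; the study itself was carried out in KNIME, and the file name and column names, including the hypothetical esp_received flag, are illustrative assumptions rather than the actual DOSM variable names.

```python
# Illustrative sketch only: the study used KNIME, and the real DOSM file layout
# is not reproduced here. The file name and column labels are hypothetical.
import pandas as pd

# Load the translated survey responses (41,386 respondents, one column per question)
df = pd.read_csv("dosm_covid19_round2_english.csv")

print(df.shape)  # expected: (41386, number_of_questions)

# Count missing answers per question to see which items need cleaning or exclusion
print(df.isna().sum().sort_values(ascending=False).head(10))

# Share of respondents who reported receiving any Prihatin Rakyat ESP benefit
# ("esp_received" is a hypothetical Yes/No column standing in for the real item)
print((df["esp_received"] == "Yes").mean())
```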
There were 28 questions available for further analysis in this study after eliminating the missing ones. Questions 32 and 33 were excluded because they focused on the primary food and non-food products purchased during the movement control order period. Given that the dataset was cluttered with missing values and errors, considerable effort was spent on cleaning it before applying descriptive analytics techniques. Q34, Q35 and Q36 had 2,071, 2,243 and 2,310 missing values respectively, which were replaced by the most frequent values. Cleaning transformed the raw data into structured data. Without losing any information, all lengthy responses were reduced to short, detailed responses. For example, if the original answer for a respondent's dwelling state was "Wilayah Persekutuan Kuala Lumpur," it was converted to "KL." Questions with answers were labelled "Yes," whereas those without answers were labelled "No." One question, for example, asked about respondents' willingness to stop eating out as part of the new norm's lifestyle adjustments. Those who agreed answered that they "will not eat out," while those opposed to the shift in lifestyle left the question unanswered. As a result, these questions were converted to a "Yes" or "No" format.

The modelling phase employed a variety of machine learning techniques to achieve the study's goal of developing classification models. Different approaches, such as information-based learning, similarity-based learning, error-based learning and probability-based learning, can be used to create classifiers (Kelleher et al., 2015). This research used information-based learning (the decision tree family of algorithms) because it best explains decision logic. To create prediction models, we used four machine learning techniques: Decision Tree, Random Forest, Gradient Boosted Tree and Naïve Bayes. These four techniques were chosen based on a literature review (Mostafa et al., 2021; Sangavi et al., 2020), and the optimal model was determined by tuning their parameters. Feature selection methods, such as the Gini index and gain ratio, and various partitioning ratios of the training and test data sets were also compared (Trivedi, 2020). Figure 2 depicts the parameter tuning carried out in the modelling phase.
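To make this modelling phase concrete, the sketch below mirrors the same idea in Python with scikit-learn rather than the KNIME workflows actually used. The target column, the one-hot encoding of the categorical survey answers and the specific hyperparameter values are illustrative assumptions, not the authors' exact settings; df is the hypothetical dataframe from the earlier loading sketch. Note that scikit-learn exposes Gini and entropy (information gain) split criteria rather than KNIME's gain ratio option.

```python
# Rough sketch of the modelling phase: four classifiers, with the training/test
# partitioning ratio treated as one of the tuned parameters (60:40 and 80:20).
# Column names are hypothetical; df comes from the earlier loading sketch.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB

X = pd.get_dummies(df.drop(columns=["prefers_moratorium"]))   # one-hot encoded predictors
y = (df["prefers_moratorium"] == "Yes").astype(int)           # target: moratorium preference

models = {
    "decision_tree": DecisionTreeClassifier(criterion="entropy", max_depth=10),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosted_tree": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "naive_bayes": BernoulliNB(),
}

for test_size in (0.4, 0.2):  # 60:40 and 80:20 partitioning ratios
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, test_size, round(f1_score(y_te, model.predict(X_te)), 3))
```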
In the evaluation phase, the best predictive model for each subsidy was selected based on the standard performance evaluation metrics: Sensitivity, Precision, F-Score and Accuracy (Moscato et al., 2021). The formulas used to calculate each of the metrics are given below, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives: Sensitivity = TP / (TP + FN); Precision = TP / (TP + FP); F-Score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity); Accuracy = (TP + TN) / (TP + TN + FP + FN). The F-Score, also known as the F1 Score, balances precision and sensitivity; hence, this study used the F-Score to evaluate the machine learning models. In the final (deployment) phase, a deployment strategy for the model was created and documented: the best predictive model determined for each of the subsidies was to be recommended for further deployment. The entire CRISP-DM process was carried out using the Konstanz Information Miner (KNIME 4.3.2), a free and open-source data analytics platform. This study was carried out from November 23, 2020 to October 06, 2021 and obtained ethics approval (EA1322021) from the Technology Transfer Office, secretariat of the research ethics committee, Multimedia University.

The outcomes of this study are organised as descriptive analytics, model optimisation and findings. Descriptive analytics helps to understand the characteristics of each respondent and the relationships between variables. Table 2 provides descriptive information on the respondents. Figure 3 shows the various types of subsidies offered in the ESP. Among the 41,386 respondents, 72.2% were eligible to receive subsidies, 21.9% had newly applied, 3.2% were not eligible and 2.7% had appealed. Figure 3 also shows the most beneficial forms of support. The most popular type of subsidy was the cash allowance, followed by the moratorium, utility discounts and EPF and PRS cash withdrawals. The least preferred were the credit guarantee scheme and wage subsidies.

Following the descriptive analytics, the four machine learning techniques were applied to develop prediction models for the moratorium, utility discount and EPF and PRS withdrawal subsidies. Decision Tree, Gradient Boosted Tree, Random Forest and Naïve Bayes were subjected to parameter tuning to determine the best model and parameter values. Tables 3 to 6 show how the optimal model was obtained from each machine learning technique. The partitioning ratio indicates the split between training and test data. Gain ratio, Gini index and information gain were used to measure the quality of each predictor in classifying the target variable. The results show that the Gradient Boosted Tree and Naïve Bayes techniques performed best when 60% of the data were used to train the models and the remaining 40% were used for testing, whereas Random Forest and Decision Tree generated their best models when the training data were 80% and the test data were 20%. The F-Score was used as the evaluation measure to select the optimal models. After identifying the optimal model from each technique, the best among the four was selected; Table 7 shows the results. Gradient Boosted Tree outperformed the rest of the techniques in predicting the moratorium preference with 93.8% sensitivity, 82.1% precision and 87.6% F-score. When the data are instead partitioned with k = 5 and k = 10 folds, the findings reveal little difference in classifier performance, and the Gradient Boosted Tree still performs best; k-fold cross validation tends to produce a superior model when the data set is small, but makes little difference when the data set is large. A similar process was carried out to develop machine learning models for the utility discount and the EPF and PRS subsidies. The results show that for both subsidies, Gradient Boosted Tree was identified as the best machine learning technique. Table 8 and Table 9 show that this technique can predict utility discount preference with 86% sensitivity, 82.1% precision and 84% F-score, and EPF and PRS preference with 83.6% sensitivity, 81.2% precision and 82.4% F-score, respectively.
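The evaluation step can be illustrated in the same way. The sketch below computes sensitivity, precision, F-score and accuracy directly from a confusion matrix, matching the formulas above, and runs 5- and 10-fold cross validations of the gradient boosted tree. X and y are the hypothetical inputs from the previous sketch; this is not the authors' KNIME evaluation workflow.

```python
# Sketch of the evaluation phase under the same assumptions as the modelling sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

def summarise(y_true, y_pred):
    """Sensitivity, precision, F-score and accuracy from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, precision, f_score, accuracy

# Single 60:40 hold-out evaluation of the gradient boosted tree
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
gbt = GradientBoostingClassifier().fit(X_tr, y_tr)
print(summarise(y_te, gbt.predict(X_te)))

# k-fold cross validation (k = 5 and k = 10) as an alternative to a single split
for k in (5, 10):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=folds, scoring="f1")
    print(k, round(np.mean(scores), 3))
```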
1. How can households that favour moratorium subsidies be identified? To answer this research question, four classification models were built using the decision tree, gradient boosted tree, random forest and naïve Bayes machine learning techniques. The optimal model from each technique was determined by tuning its parameters, and the parameter values are explained in the previous section. Finally, the optimal models from the four algorithms were compared, and the best was chosen using the F-score performance evaluation measure. With a data partitioning ratio of 60% training and 40% testing, Gradient Boosted Tree was shown to be the best machine learning model for predicting households that prefer the moratorium subsidy, with an F-score of 0.876 and a sensitivity of 0.938. Hence, this model is recommended for the deployment phase. Although the gradient boosted tree identifies households that favour moratorium subsidies more accurately, it is difficult to interpret, because the relationship between each predictor and the target is modelled using a curve, making it difficult to explain how each predictor relates to the target. Machine learning techniques always involve a trade-off between prediction accuracy and interpretability: in general, a method's interpretability decreases as its accuracy improves (Hastie et al., 2021). Therefore, the decision tree model was used to create a ruleset that describes the general profile of households that favour the moratorium subsidy. Rule support refers to the number of respondents to whom a rule's conditions apply, and rule confidence indicates the probability that such respondents prefer the moratorium subsidy. Table 10 shows the basic characteristics of households that choose moratorium subsidies, for rules with a support of 400 and above. The first rule shows that households that prefer to receive a cash allowance, whose race is Malay, Native Sabah/Sarawak or others, and whose members are aged between 25 and 64 prefer the moratorium. Table 11 explains this first rule, indicating the general profile of households that prefer moratorium subsidies.

2. How can households that seek utility discount subsidies be identified? The four machine learning techniques described for the moratorium subsidy were applied to develop the classification model that identifies households that want utility discount subsidies, and all of the procedures outlined in the preceding section were followed to find the optimal machine learning model. The gradient boosted tree outperforms the other three techniques, with a data partitioning ratio of 80:20, an F-score of 0.84 and a sensitivity of 0.86. Although the gradient boosted tree identifies households that seek utility discount subsidies more accurately, this model is difficult to interpret. As a result, decision tree rules were developed to describe the overall profile of households that seek utility discount subsidies; one such rule is presented in Table 12.

3. How can households who desire EPF and PRS withdrawal subsidies be identified? To identify households that want EPF and PRS withdrawal subsidies, the decision tree, gradient boosted tree, random forest and naïve Bayes techniques were again used to develop the machine learning model, as for the moratorium and utility discount classification models. With a data partitioning ratio of 60:40, an F-score of 0.824 and a sensitivity of 0.836, the gradient boosted tree was found to be the best model compared to the others.
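As an illustration of how such decision tree rulesets (Tables 10 to 13) could be derived, the sketch below fits a shallow decision tree and reports each leaf's rule support and confidence. It again assumes the hypothetical X and y used above; the tree depth and the minimum leaf size of 400 (mirroring the rule-support threshold mentioned earlier) are illustrative choices rather than the authors' KNIME settings.

```python
# Sketch of deriving an interpretable ruleset from a decision tree, with each
# leaf's rule support and confidence. X and y are the hypothetical inputs above.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_leaf=400)
tree.fit(X, y)

# Human-readable if/then structure of the fitted tree
print(export_text(tree, feature_names=list(X.columns)))

# Rule support = respondents reaching the leaf; rule confidence = share of those
# respondents whose preferred subsidy is the positive class.
t = tree.tree_
for node in range(t.node_count):
    if t.children_left[node] == -1:  # leaf node
        support = int(t.n_node_samples[node])
        confidence = float(t.value[node][0][1] / t.value[node][0].sum())
        print(f"leaf {node}: support={support}, confidence={confidence:.2f}")
```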
Decision tree rules were also developed to explain the general features of households that prefer EPF and PRS withdrawal subsidies, and one of these rules is displayed in Table 13. The results imply that households that prefer moratorium subsidies did not favour other financial aids except cash assistance. By contrast, households that prefer utility discounts or EPF and PRS withdrawals also chose moratorium subsidies and cash assistance. All households preferred cash assistance, which had the highest score among the financial aids, followed by moratorium subsidies. Utility discounts and EPF and PRS withdrawals can be implemented according to household income group preferences. The wage subsidy and credit guarantee scheme were the least preferred forms of financial assistance. First, the Prihatin wage subsidy is only for eligible Social Security Organisation (SOCSO) subscribers; hawkers, small businesses and their employees might not subscribe to SOCSO and are thus ineligible to apply for this financial aid. Second, the credit guarantee scheme was not preferred because of the economic uncertainty arising from COVID-19: economic uncertainty adversely affects household income, leaving households unable to repay loan instalments.

The following are some of the limitations of this study's findings. The data used are survey responses and cannot be considered to represent the views of all Malaysians. According to DOSM, the dataset should not be used to analyse the impact of COVID-19 in Malaysia and should not be considered official statistics; it can, however, be utilised to assist in the reflection process ("Department of Statistics Malaysia Official Portal," 2020; Malaysia, 2020). Another limitation is the data partitioning method. This study partitioned the data into training and test data sets; to improve the model, the dataset could instead be divided into training, validation and test data. In addition, this research used information-based learning (the decision tree family of algorithms) because it best explains decision logic; however, other classification methods could be tested and might achieve better accuracy.

In conclusion, this study used data analytics and machine learning approaches to derive insights from the "Effects of COVID-19 on the Economy and Individual - Round 2" survey dataset. The CRISP-DM approach was applied to develop prediction models for households' preferred subsidies, namely moratoriums, utility discounts and EPF and PRS withdrawals, using four machine learning algorithms: Decision Tree, Random Forest, Naïve Bayes and Gradient Boosted Tree. For all three subsidies, the best predictive model was obtained by Gradient Boosted Tree. The findings can be used to design customised ESPs that effectively manage the economic burden of low-income households. Data used in this study were obtained from the survey dataset "Effects of COVID-19 on the Economy and Individual - Round 2," available from the Department of Statistics, Malaysia (DOSM).
The authors thank the Department of Statistics Malaysia for allowing us to use the "Effects of COVID-19 on the Economy and Individual - Round 2" survey data to carry out this study. The authors also acknowledge Multimedia University for supporting this research.