A comparative study of approaches to forecast the correct trading actions

Luís Baía1,2 | Luís Torgo1,2

1 LIAAD-INESC TEC, Porto, Portugal
2 Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, Porto, Portugal

Correspondence: Luís Baía, LIAAD-INESC TEC, Porto, Portugal. Email: luisbaia_1992@hotmail.com

Received: 7 November 2015 | Revised: 7 April 2016 | Accepted: 8 July 2016
DOI: 10.1111/exsy.12169
Expert Systems 2017;34:e12169. https://doi.org/10.1111/exsy.12169

Abstract
This paper addresses the problem of decision making in the context of financial markets, more specifically, the problem of forecasting the correct trading action for a certain future horizon. We study and compare two alternative ways of addressing these forecasting tasks: (a) using standard numeric prediction models to forecast the variation in the prices of the target asset and, in a second stage, transforming these numeric predictions into a decision according to some predefined decision rules; and (b) using models that directly forecast the right decision, thus ignoring the intermediate numeric forecasting task. The objective of our study is to determine whether both strategies provide identical results or whether there is any particular advantage that may distinguish each alternative in the context of financial markets.

KEYWORDS
classification, extensive experimental comparison, forecast, regression, trading actions

1 | INTRODUCTION
Many real-world applications require decisions to be made based on forecasting some numeric quantity. Sales forecasting may lead to some important decisions concerning the production process. Asset price forecasting may lead investors to buy or sell some financial product. Forecasting the future evolution of some indicator of a patient may lead a medical doctor to prescribe some important treatments.
These are just a few examples of concrete applications that fit this general setting: decisions based on numeric forecasts of some variable. Frequently, the decision process is based on a predefined protocol that associates intervals of the range of the numeric variable with concrete actions/decisions. This means that once we have a prediction for the numeric variable, we use some deterministic process to reach the action/decision to be made.

Despite the generality of this type of application, this work focuses on analyzing it in the context of financial markets. In this domain the goal of investors is to make the correct trading decision (Sell, Buy, or Hold) at any given point in time. These decisions are made based on the investor's expectations on the future evolution of the asset prices. In this work, we approach this decision problem using prediction models. More specifically, we will compare two possible ways of trying to forecast the correct trading decision at any point in time.

In our target applications, we assume that there are deterministic decision rules that, given the estimated evolution of the prices of the asset, indicate the trading action to be taken. These rules are typically driven by the investor's preferences concerning important aspects like financial risk. For instance, a rule could state that if the forecast of the variation of prices is a 2.5% increase then the correct decision is to buy the asset, as this will allow covering transaction costs and still have some profit.

Given the deterministic mapping from forecasted values into decisions, we can define the prediction task in two ways. The first consists of obtaining a numeric prediction model that we can use to obtain predictions of the future variation of the prices, which are then transformed (deterministically) into trading decisions (e.g., Hellstrom (1999); Lu, Lee, and Chiu (2009)).
The second alternative consists of directly forecasting the correct trading decisions (e.g., Luo and Chen (2013); Ma, Song, Hung, Su, and Huang (2012); Teixeira and de Oliveira (2010)). Which is the best option in terms of financial results? To the best of our knowledge, no comparative study was carried out to answer this question. This is the goal and the main contribution of this paper: to compare these two approaches to decision making and provide experimental evidence of the advantages and disadvantages of each alternative.

Copyright © 2016 John Wiley & Sons, Ltd. http://wileyonlinelibrary.com/journal/exsy

2 | PROBLEM FORMALIZATION
The problem of decision making based on forecasts of a numerical (continuous) value can be formalized as follows. We assume there is an unknown function that maps the values of p predictor variables into the values of a certain numeric variable Y. Let f be this unknown function that receives as input a vector x with the values of the p predictors and returns the value of the target numeric variable Y, whose values are supposed to depend on these predictors,

\[ f : \mathbb{R}^{p} \to \mathbb{R}, \qquad \mathbf{x} \mapsto f(\mathbf{x}). \]

We also assume that based on the values of this variable Y some decisions need to be made. Let g be another function that, given the values of this target numeric variable, transforms them into actions/decisions,

\[ g : \mathbb{R} \to \mathcal{A} = \{\alpha_1, \alpha_2, \alpha_3, \ldots\}, \qquad Y \mapsto g(Y), \]

where $\mathcal{A}$ represents a set of possible actions.

In our target applications, functions f and g are very different. Function g is known and deterministic, in the sense that it is part of the domain background knowledge. Function f is unknown and uncertain. The only information we have about function f is a historical record of mappings from x into Y, that is, a data set that can be used to learn an approximation of the function f.
Given that the variable Y is numeric, this approximation could be obtained using some existing multiple regression tools. This means that given a data set $D_r = \{\langle \mathbf{x}_i, Y_i \rangle\}_{i=1}^{n}$, we can use some regression tool to obtain a model $\hat{r}(\mathbf{x})$ that is an approximation of f. From an operational perspective, this means that given a test case q for which a decision needs to be made, we first use $\hat{r}$ to obtain a prediction for Y and then apply g to this predicted value to get the predicted action/decision, that is, $q \mapsto \hat{r}(q) \mapsto g(\hat{r}(q))$.

In the context of financial markets, the predictors describe the currently observed dynamics of the prices of some financial asset, and the target numeric variable Y represents the future variation of this price. This means that f is the unknown function that maps the currently observed price dynamics into a future evolution of the price. On the other hand, g is a deterministic function (typically based on domain knowledge and risk preferences of traders) that maps the prediction of the future evolution of prices into one of three possible decisions: Sell, Hold, or Buy.

Given the deterministic nature of g, we can use an alternative process for obtaining decisions. More specifically, we can build an alternative data set $D_c = \{\langle \mathbf{x}_i, g(Y_i) \rangle\}_{i=1}^{n}$, where the target variable is the decision associated with each known Y value in the historical record of data. This means that we have a nominal target variable, that is, we are facing a classification task. Once again, we can use some standard classification tool to obtain an approximation $\hat{c}$ of the unknown function that maps the predictors into the correct actions/decisions. Once such a model is obtained, we can use it, given a query case q, to directly estimate the correct decision by applying the learned model to the case, that is, $q \mapsto \hat{c}(q)$.
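The two routes formalized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-nearest-neighbour learner, the two-action `toy_g` rule, and the toy data are hypothetical stand-ins, not the models, decision rule, or data used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_neighbour(train: List[Tuple[Vector, object]], query: Vector):
    """Return the target of the training case closest to `query` (1-NN).

    Hypothetical stand-in for any regression or classification tool.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(train, key=lambda pair: sq_dist(pair[0], query))[1]


def to_classification_set(d_r, g: Callable[[float], str]):
    """Build D_c = {<x_i, g(Y_i)>} from D_r = {<x_i, Y_i>}."""
    return [(x, g(y)) for x, y in d_r]


def decide_via_regression(d_r, g, q):
    """Regression route: q -> r_hat(q) -> g(r_hat(q))."""
    y_hat = nearest_neighbour(d_r, q)
    return g(y_hat)


def decide_via_classification(d_r, g, q):
    """Classification route: q -> c_hat(q), with c_hat learned from D_c."""
    d_c = to_classification_set(d_r, g)
    return nearest_neighbour(d_c, q)


def toy_g(y: float) -> str:
    """Hypothetical two-action rule, used only to keep the example small."""
    return "up" if y > 0 else "down"


# Illustrative data: (predictor vector, observed numeric target).
toy_d_r = [((0.0, 1.0), 0.03), ((1.0, 0.0), -0.01), ((0.9, 0.2), -0.04)]
query = (0.1, 0.9)

print(decide_via_regression(toy_d_r, toy_g, query))
print(decide_via_classification(toy_d_r, toy_g, query))
```

With a 1-NN learner both routes necessarily agree; with learners that smooth or average over cases, the numeric forecast and the directly predicted class can disagree, which is what makes the comparison non-trivial.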
Given the description of the current dynamics of the price, the function $\hat{c}$ thus forecasts directly the correct trading action for this context.

Independently of the approach followed, the final goal of the applications we are targeting is always to make correct decisions. This means that whatever process we use to reach a decision, it will be evaluated in terms of the “quality” of the decisions it generates. In this context, it seems that the classification approach, by having the decisions as its target variable, would be easier to bias towards optimal actions. However, this approach completely ignores the intermediate numeric variable that is supposed to influence decisions, though one may argue that information on the relationship between Y and the decisions is “encoded” when building the training set $D_c$ by using as target the values of $g(Y_i)$. On the other hand, although the regression approach is focused on obtaining accurate predictions of Y, it completely ignores questions such as the possibly different costs/benefits of the different decisions, which could be easily encoded into the classification tasks. All these potential trade-offs motivate the current study. The main goal of this paper is to compare these two approaches in the context of financial markets.

3 | MATERIAL AND METHODS
This section describes the main issues involved in the experimental comparison we will carry out with the goal of comparing the two possible approaches described in the previous section.

3.1 | The tasks
The problem addressed in this paper is very common in automatic trading systems where decisions are based on the forecasts of some prediction models. The decisions to open or close short/long positions are typically the result of a deterministic mapping from the predicted price variation. In our experiments, we have used the asset prices of 12 companies. Each data set has a minimum of 7 years of daily data and a maximum of 30 years.
In order to simplify the study, we will be working with a 1-day horizon, that is, taking a decision based on the forecast of the asset's variation 1 day ahead. Moreover, we will be working exclusively with the closing prices of each trading session, that is, we assume trading decisions are to be made after the markets close.

The decision function for this application receives as input the forecast of the daily variation of the asset's closing price and returns a trading action. We will be using the following function in our experiments:

\[ g : \mathbb{R} \to \mathcal{A} = \{\text{hold}, \text{buy}, \text{sell}\}, \qquad Y \mapsto \begin{cases} \text{buy}, & Y > 0.02 \\ \text{sell}, & Y < -0.02 \\ \text{hold}, & \text{other cases.} \end{cases} \]

This means we are assuming that any variation of more than 2% in either direction will be sufficient to cover the transaction costs and still obtain some profit.

Concerning the data that will be used as predictors for the forecasting models (either forecasting the price variation [Y] or directly the trading action [A]), we have used the price variations on recent days as well as some trading indicators, such as the annual volatility, the Welles Wilder style moving average (Wilder, 1978), the stop and reverse point indicator developed by J. Welles Wilder (Wilder, 1978), the usual moving average, and others. The goal of this selection of predictors is to provide the forecasting models with useful information on the recent dynamics of the asset prices.

Regarding the performance metrics used to compare the approaches, we will use two metrics that capture important properties of the economic results of the trading decisions made by the alternative models. More specifically, we will use the Sharpe Ratio as a measure of the risk (volatility) associated with the decisions and the percentage total return as a measure of the overall financial results of these actions. We will also consider the Macro versions of the Recall and Precision metrics for the buy and sell signals combined.
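The decision rule and the macro-averaged metrics just described can be sketched as follows. The thresholds come from the definition of g above; the exact averaging (an unweighted mean over the buy and sell classes, with undefined ratios counted as 0) is an assumption, since the text does not spell out the formula, and the example data are made up for illustration.

```python
from typing import List, Sequence, Tuple


def g(y: float) -> str:
    """Map a forecasted daily variation into a trading action (2% thresholds)."""
    if y > 0.02:
        return "buy"
    if y < -0.02:
        return "sell"
    return "hold"


def macro_precision_recall(
    true_actions: Sequence[str],
    predicted_actions: Sequence[str],
    classes: Tuple[str, ...] = ("buy", "sell"),
) -> Tuple[float, float]:
    """Macro precision and recall over the non-hold classes.

    Assumption: per-class scores are averaged with equal weight, and a
    class with no predictions (or no true cases) contributes 0.
    """
    precisions: List[float] = []
    recalls: List[float] = []
    for c in classes:
        pairs = list(zip(true_actions, predicted_actions))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        n_pred = sum(1 for _, p in pairs if p == c)
        n_true = sum(1 for t, _ in pairs if t == c)
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Illustrative (made-up) true variations and model decisions.
true_variations = [0.03, -0.05, 0.00, 0.025, -0.01]
true_actions = [g(y) for y in true_variations]
predicted = ["buy", "hold", "buy", "buy", "hold"]

precision, recall = macro_precision_recall(true_actions, predicted)
print(true_actions)
print(precision, recall)
```

Note that the inequalities in g are strict, so a forecast of exactly 2% maps to hold.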
These two metrics allow us to analyze how many trading opportunities are being detected (Recall) and how accurate the models are regarding the prediction of every non-hold trading action (Precision). To make our experiments more realistic, we will consider a transaction cost of 2% for each buy or sell decision a model may originate.

At this stage, it is important to remark that the prediction tasks we are facing have some characteristics that make them particularly challenging. One of the main hurdles results from the fact that interesting events, from a trading perspective, are rare in financial markets. In effect, large movements of prices are not very frequent. This means that the data sets we will provide to the models have clearly imbalanced distributions of the target variables (both the numeric percentage variations and the trading actions). To make matters worse, the rare cases are precisely the ones that are most interesting from a trading perspective, which creates difficulties for most modeling techniques. In the next section, we will describe some of the measures we have taken to alleviate this problem. We have conducted a thorough analysis to test the impact of these measures on each approach.
Evaluating these existing techniques in the context of financial forecasting problems is the second main contribution of the paper.

3.2 | The models
In this section, we describe all the model variants that will be used in the experimental comparisons. They were selected to ensure that both approaches have the same conditions for a fair comparison and that the results are not biased by some specific characteristics of one particular technique. Several variants for each family of models (SVM [support vector machines], Random Forests, etc.) were tested in order to make sure our conclusions were not biased. Tables 1 and 2 show all the model variants used in our experiments (nearly 182 model variants were tested). To facilitate the reproducibility of our results, we have used the free and open source implementations of these techniques available in the R software environment (R Core Team, 2014).

TABLE 1 Regression models used for the experimental comparisons
Model | Variants | R package
SVM | cost = {1,5,10}, ε = {0.1,0.05,0.01}, tolerance = {0.001,0.005}, kernel = linear | e1071 (Meyer, Dimitriadou, Hornik, Weingessel, and Leisch, 2014)
SVM | cost = {1,10}, ε = {0.1,0.05,0.01}, degree = {2,3,5}, kernel = polynomial | e1071 (Meyer et al., 2014)
Random Forest | ntree = {500,750,1000,2000,3000}, mtry = {4,5,6} | randomForest (Liaw and Wiener, 2002)
Trees (pruned) | se = {0,0.5,1,1.5,2}, cp = 0, minsplit = 6 | DMwR (Torgo, 2010)
KNN | k = {1,3,5,7,11,15} | DMwR (Torgo, 2010)
NNET | size = {2,4,6}, decay = {0.05,0.1,0.15} | nnet (Venables and Ripley, 2002)
MARS | thresh = {0.001,0.0005,0.002}, degree = {1,2,3}, minspan = {0,1} | earth (Milborrow, 2014)
AdaBoost | dist = {gaussian}, n.trees = {10000,20000}, shrinkage = {0.001,0.01}, interaction.depth = {1,2} | gbm (Ridgeway, 2013)
SVM = support vector machines; KNN = K-nearest neighbours; NNET = neural networks; MARS = multivariate adaptive regression spline.

TABLE 2 Classification models used for the experimental comparisons
Model | Variants | R package
SVM | cost = {1,3,7,10}, kernel = linear, tolerance = {0.001,0.005,0.0005,0.002} | e1071 (Meyer et al., 2014)
SVM | cost = {1,10}, ε = {0.1,0.05}, degree = {2,3,4,5}, kernel = polynomial | e1071 (Meyer et al., 2014)
Random Forest | ntree = {500,750,1000,2000,3000}, mtry = {3,4,5} | randomForest (Liaw and Wiener, 2002)
Trees (pruned) | se = {0,0.5,1,1.5,2}, cp = 0, minsplit = 6 | DMwR (Torgo, 2010)
KNN | k = {1,3,5,7,11,15} | DMwR (Torgo, 2010)
NNET | size = {2,4,6}, decay = {0.05,0.1,0.15} | nnet (Venables and Ripley, 2002)
AdaBoost | coeflearn = c('Breiman','Freund','Zhu'), mfinal = c(500,1000,2000) | Boosting (Ridgeway, Southworth, and RUnit, 2013)
SVM = support vector machines; KNN = K-nearest neighbours; NNET = neural networks; MARS = multivariate adaptive regression spline.

The predictive tasks we are facing have two main difficulties: (a) the fact that the distribution of the target variables is highly imbalanced, with the more relevant values being less frequent; and (b) the fact that there is an implicit ordering among the decisions. The first problem causes most modeling techniques to focus on cases (the most frequent) that are not relevant for the application goals. The second problem is specific to classification tasks, as these algorithms do not distinguish among the different types of errors, whilst in our target application, confusing a buy decision with a hold decision is less serious than confusing it with a sell.

These two problems led us to consider several alternatives to our base modeling approaches described in Tables 1 and 2. For the first problem of imbalance, we have considered the hypothesis of using resampling to balance the distribution of the target variable before obtaining the models. In order to do that, we have used the SMOTE algorithm (Chawla, Bowyer, Hall, & Kegelmeyer, 2002).
SMOTE is well known for classification tasks and consists essentially of oversampling the minority classes and under-sampling the majority ones. The goal is to modify the data set in order to ensure that each class is similarly represented. Regarding the regression tasks, we have used the work by Torgo, Branco, Ribeiro, and Pfahringer (2015), where a regression version of SMOTE was presented. Essentially, the concept is the same as in classification: the method tries to balance the continuous distribution of the target variable by oversampling and under-sampling different ranges of its domain.

Regarding the second problem of the order among the classes, we have also considered a frequently used approach to handle this issue. Namely, we have used a cost–benefit matrix that allows us to distinguish between the different types of classification errors. Using this matrix, and given a probabilistic classifier, we can predict for each test case the class that maximizes the utility instead of the class that has the highest probability.

We have used the following procedure to obtain the cost–benefit matrices for our tasks. Correctly predicted buy/sell signals have a positive benefit estimated as the average return of the buy/sell signals in the training set. On the other hand, in the case of incorrectly predicting a true hold signal as buy (or sell), we assign it minus the average return of the buy (or sell) signals. Basically, the benefit associated with correctly predicting one rare signal is entirely lost when the model suggests an investment when the correct action would be doing nothing. In the extreme case of confusing the buy and sell signals, the penalty will be minus the sum of the average return of each signal. Choosing such a high penalty for these cases will make the model less likely to make this type of very dangerous mistake.
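The utility-maximizing prediction mentioned above can be sketched as follows. The matrix entries are the ones reported in Table 3 for Apple; the class probabilities are made up for illustration, and the function names are hypothetical rather than taken from any library.

```python
from typing import Dict

ACTIONS = ("sell", "hold", "buy")

# UTILITY[predicted][true]: cost-benefit values as reported in Table 3.
UTILITY: Dict[str, Dict[str, float]] = {
    "sell": {"sell": 0.49, "hold": -0.49, "buy": -0.82},
    "hold": {"sell": -0.24, "hold": 0.00, "buy": -0.17},
    "buy": {"sell": -0.82, "hold": -0.33, "buy": 0.33},
}


def expected_utility(action: str, class_probs: Dict[str, float]) -> float:
    """Expected utility of predicting `action` given estimated class probabilities."""
    return sum(class_probs[true] * UTILITY[action][true] for true in ACTIONS)


def max_utility_action(class_probs: Dict[str, float]) -> str:
    """Pick the action with the highest expected utility."""
    return max(ACTIONS, key=lambda a: expected_utility(a, class_probs))


def max_probability_action(class_probs: Dict[str, float]) -> str:
    """Usual rule: pick the most probable class."""
    return max(ACTIONS, key=lambda a: class_probs[a])


# Illustrative (made-up) output of a probabilistic classifier for one test case.
probs = {"sell": 0.3, "hold": 0.3, "buy": 0.4}

print(max_probability_action(probs))
print(max_utility_action(probs))
```

For these probabilities the most probable class is buy, but the expected utilities are about -0.33 (sell), -0.14 (hold), and -0.21 (buy), so the utility-based rule prefers hold: the sizeable chance of a true sell makes a buy too costly.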
Considering the case of incorrectly predicting a true sell (or buy) signal as hold, we also charge for it, but in a less severe way. Therefore, the average return of the sell (or buy) signals is considered, but divided by two. This division was our way of “teaching” the model that it is preferable to miss an opportunity to earn money rather than making the investor lose money. Finally, correctly predicting a hold signal gives neither penalty nor reward, because no money is either won or lost. Table 3 shows an example of such a cost–benefit matrix, obtained with the data from 1981-01-05 to 2000-10-13 of Apple.

TABLE 3 Example cost–benefit matrix for Apple shares (rows: predicted action; columns: true action)
Pred \ True | S | H | B
S | 0.49 | −0.49 | −0.82
H | −0.24 | 0.00 | −0.17
B | −0.82 | −0.33 | 0.33
S = sell; H = hold; B = buy.

We have thoroughly tested the hypothesis that using resampling before obtaining the models would boost the performance of the different models we have considered for our tasks (both classification and regression) and have also tested the hypothesis that using cost–benefit matrices to implement utility maximization would also improve the performance of the classification models. The results of testing these hypotheses will be presented in Section 4.

3.3 | The experimental methodology
In this section, we present the experimental methodology used in our comparative experiments. Because of the temporal nature of the data sets, the usual cross-validation methodology should not be used to estimate the performance of a certain model. This procedure involves randomly reshuffling the data, which may lead to test cases that are “older” than the training cases, which would lead to unreliable estimates. In this context, we have used a Monte Carlo simulation method consisting of randomly selecting a series of N points in time within the available data set.
For each of these random dates, we use a certain consecutive past window as training set for obtaining the alternative models, which are then tested/compared in a subsequent and consecutive test window. The Monte Carlo estimates are formed by the average scores obtained on the N repetitions. In our experiments, we have used N = 10, 50% of the data as the size of the training window, and 25% of the data as the size of the test sets.

With respect to testing the statistical significance of the observed differences between the estimated scores, we have used the recommendations of the work by Demšar (2006). More specifically, in situations where we are comparing k alternative models on one specific task, we have used the Wilcoxon signed-rank test to check the significance of the differences. In the experiments where k models are compared on t tasks, we use the Friedman test followed by a post hoc Nemenyi test to check the significance of the difference between the average ranks of the k models across the t tasks.

4 | EXPERIMENTAL RESULTS
This section describes the results of our empirical studies. We have split these results in two main parts. The first has to do with testing the validity of the hypotheses described in Section 3.2 concerning the usage of resampling techniques and also the usage of cost–benefit matrices. The second part presents the results of the final comparison between the two modeling approaches to our target decision problems.

4.1 | Addressing the particularities of the prediction tasks
As we have mentioned before, the prediction tasks we are facing have some particularities that make them especially challenging. We have described two ways of trying to overcome these challenges. In this section, we check the validity of these hypotheses.
Specifically, we test the advantages of: (a) using SMOTE on classification and regression models to overcome the problem of imbalanced distributions; and (b) using cost–benefit matrices on classification models to provide information on the different importance of the class values.

4.1.1 | Hypothesis 1: Resampling the data sets
Our first hypothesis states that resampling the data sets with the goal of balancing the response variable will enhance the performance of both classification and regression models. In the experiments carried out to check the validity of this hypothesis, we have observed that the way the resampling affected each modeling approach was not significantly different. In this context, we will just present the results for the classification models, as the conclusions with regression models were similar.

We start by comparing the top modeling variant obtained without using resampling against the best one using SMOTE, company by company, applying a Wilcoxon test in each case (for each modeling approach individually). In Figure 1 we analyze recall and precision for the buy and sell signals combined. The results were somewhat expected. Resampling methods make the distribution of the target more balanced, which means the models will have more examples of the rare cases that are interesting in this application. This has a positive impact on recall, because the models end up forecasting these events more frequently, as they are not so rare in the resampled training sets. However, due to the inherent difficulty of these financial forecasting tasks, this also involves a higher risk of making wrong predictions, and thus a decrease in precision. Because we are in a trading context, we should prefer safer decisions rather than riskier ones. However, we can confirm this only after checking the financial results of these strategies.
Nevertheless, we should reinforce that the issue of preferring high precision (less risk) over high recall is something specific to financial trading, and in other application domains, the conclusions on the usefulness of resampling could be different if the preference bias is different.

FIGURE 1 Best classification variant without SMOTE against the best classification with SMOTE for the macro versions of Precision and Recall metrics (asterisks denote that the respective variant is significantly better, according to a Wilcoxon test with α = 0.05)

Figure 2 allows us to analyze the financial consequences of the resampling procedure. We can observe that in terms of Total Return and Sharpe Ratio, the models applied to data sets without any resampling achieved significantly better results and, in several cases, by a large margin. This means that pre-processing the data using resampling is leading to models that make more risky decisions and that these decisions are often wrong, leading to serious financial losses.

FIGURE 2 Best classification variant without SMOTE against the best classification with SMOTE for the Total Return and Annualized Sharpe Ratio metrics (asterisks denote that the respective variant is significantly better, according to a Wilcoxon test with α = 0.05)

Considering all the plots at the same time, our results provide evidence that using SMOTE on the training data will make the models predict significantly more trading signals (Buy and Sell). This leads to higher recall but unfortunately also to much lower precision, because these signals are frequently wrong, with serious financial consequences.

However, we should not forget that the previous comparisons were carried out between the best overall variant of each type (with and without SMOTE). A different question is whether the same conclusions are valid if we consider each modeling technique individually. In this context, we will now check the resampling hypothesis per type of model instead of globally. We will group the variants according to the type of model and analyze the impact of using SMOTE per type of model.

The results of this new set of comparisons are shown in Figure 3. The first thing to notice is that the recall was significantly better every single time across all types of models where resampling was applied. On the other hand, the precision was worse almost every time, except with SVMs. So we again observe that the use of the SMOTE algorithm is making the models more capable of detecting the buy and sell (rare) classes at the cost of riskier decisions (except for SVMs). Concerning the financial metrics, we observe a clear advantage of the standard approach (i.e., no resampling applied), with SVM and AdaBoost being the only algorithms benefiting from the use of resampling on some tasks, with the advantage being more evident for SVMs.

FIGURE 3 Segmented by type of model and by metric, a Wilcoxon test is performed between the best model variant of each modeling tool (Classification without SMOTE versus Classification with SMOTE). Because there are 12 data sets in each segment, there are 12 results per segment. Each bar shows the results for each segment, where each one of the four colors is associated with a type of win (significant/non-significant win without SMOTE: strong/light blue; non-significant/significant win with SMOTE: light/strong green). The length of each color describes the number of times that type of win occurred

In summary, we have collected sufficient evidence to conclude that resampling the data sets for both classification and regression models will have a negative impact on their performance in the context of financial forecasting tasks, namely when considering trading-related evaluation metrics. Only the SVM, TREE, and AdaBoost algorithms have benefited from the resampling method on a small set of tasks.
Overall, our conclusion is that this resampling strategy is not recommended in the context of forecasting for financial trading.

4.1.2 | Hypothesis 2: Adding cost–benefit matrices
The second hypothesis we have put forward was that the use of cost–benefit matrices would boost the performance of the classification models, because the information on the implicit ordering among the classes would be passed to the models, allowing them to avoid more costly errors (e.g., confusing a buy decision with a sell decision).

In order to test this hypothesis, we will follow the same methodology used for the resampling hypothesis. Firstly, the top modeling variant of each alternative (with and without cost–benefit matrices) will be compared for each data set. The results regarding precision and recall are shown in Figure 4. These results are somewhat inconclusive. In terms of recall the use of these matrices led to better results, as we have five significant wins against only one significant loss of the approach using the matrices. On the other hand, in terms of precision, we observed three significant wins of the approach without cost–benefit matrices.

FIGURE 4 Best classification variant without costs against the best classification with costs for the macro versions of Precision and Recall metrics (asterisks denote that the respective variant is significantly better, according to a Wilcoxon test with α = 0.05)

Figure 5 shows the results of this same experiment in terms of the financial metrics. The most evident observation is the fact that not a single statistically significant difference was achieved by either alternative. However, both in terms of the Total Return and of the Annualized Sharpe Ratio, the approaches using cost–benefit matrices obtained more wins. This is an interesting result, suggesting that the use of these matrices may be beneficial for the classification models in the context of financial trading based on prediction models.

FIGURE 5 Best classification variant without costs against the best classification with costs for the Total Return and Annualized Sharpe Ratio metrics (asterisks denote that the respective variant is significantly better, according to a Wilcoxon test with α = 0.05)

Let us now check if these results hold across each different type of learning algorithm. Figure 6 shows the comparison between the top modeling variant of each type of model. Even though there is a slightly higher abundance of green (suggesting that the usage of cost–benefit matrices may be beneficial), these results are quite even. Each type of model is influenced in a different way by the cost–benefit matrices. Although NNET (neural networks), TREE (decision tree models), and SVM are able to take advantage of the information in the matrices, the conclusions for the other models are not so clear. These observations seem to indicate that the potential advantage of the usage of cost–benefit matrices is algorithm-dependent, with some techniques being able to capitalize on this extra information while others do not. As future work, it would be interesting to understand whether there are some theoretical properties of the algorithms that may be causing this differentiated behavior. A possible explanation has to do with the quality of the probability estimates. In effect, the decisions using cost–benefit matrices depend on the estimated probabilities of each class. If these estimates are wrong or unreliable, this may lead to unreliable decisions. Still, this explanation needs to be confirmed in practice.

FIGURE 6 Segmented by type of model and by metric, a Wilcoxon test is performed between the best model variant of each modeling tool (Classification without costs versus Classification with costs). Because there are 12 data sets in each segment, there are 12 results per segment. Each bar shows the results for each segment, where each one of the five colors is associated with a type of win (significant/non-significant win without costs: strong/light blue; draw: yellow; non-significant/significant win with costs: light/strong green). The length of each color describes the number of times that type of win occurred

In summary, contrary to the first hypothesis involving resampling, the conclusions for the second hypothesis regarding the advantages of cost–benefit matrices are that there is some potential for this alternative way of addressing the classification tasks. Although we have not observed an overwhelming advantage of this usage, we have found that for some modeling algorithms, these matrices provide a clear boost in terms of their results (both in terms of Precision/Recall and financially-oriented metrics).

4.2 | Comparison of classification and regression modeling approaches
This section presents the results of the experimental comparisons between the two general approaches to making trading decisions based on forecasting models. In our experiments, we have considered 76 classification models. For each of these models, we have also tried the version with resampling and the version with cost–benefit matrices, totaling 76 × 3 = 228 different classification variants. In terms of regression, we have a slightly larger set of 97 base models that were then tried with and without resampling, for a total of 97 × 2 = 194 variants. All these variants were compared on the data sets of the 12 companies described in Section 3.1 using the methodology described in Section 3.3.

We have divided our experimental analysis in two main parts.
In the first one, for each company and for each metric, we have compared the best regression and classification variant using a Wilcoxon signed‐rank statistical test with a significance level of 0.05 to check if we can reject the null hypothesis that there is no significant difference between the best classification and regression variants. This leads to 12 statistical tests for each metric (one test for each company), where the models compared for each company are not necessarily the same. The motivation of this first part is to compare the best classification variant against the best regression variant for each company and metric.

Figure 7 shows the results of this comparison for the Total Return and Sharpe Ratio financial evaluation metrics. The results in the two graphs are somewhat correlated. In effect, whenever we have found a significant difference in terms of Total Return, the same also happened in terms of Sharpe Ratio. Regarding the left graph (Total Return), we have one significant win for each approach and six against four non‐significant wins for classification and regression, respectively. With respect to the right graph (Sharpe Ratio), we can observe a slight advantage of the classification approach, with one more significant win and eight versus one non‐significant wins. Overall, we have observed a very slight advantage of the best classification approach against the best regression variant.

FIGURE 7 Best classification variant against the best regression one for the Total Return and Sharpe Ratio metrics (asterisks denote that the respective variant is significantly better, according to a Wilcoxon test with α = 0.05)

From an economic perspective, some results are contradictory. For instance, there is a very high level of Total Return for the Meg company (above 60% return), but the best Sharpe Ratio was very low. This means that the best model for the first metric was taking enormous amounts of risk and that the high level of return achieved was probably due to pure luck. On the other hand, there are some high values for the Total Return accompanied by high levels of Sharpe Ratio, such as for the Exas company. This strongly suggests that the models could actually provide some profit with low risk, thus indicating that the model actually predicted meaningful signals.

Given the high variability of the results across companies, drawing conclusions solely based on the analysis of the best variant per model and per metric may lead to wrong results. This establishes the motivation for the second part of our experiments.

In this second part of our experiments, instead of grouping by metric and company, we will just group by metric and study the average rank of each model across all the companies (top five of each approach are considered). With the use of the Friedman test followed by the post hoc Nemenyi test, we check whether there are statistically significant differences among these rankings. This way, if a model obtains a very good result for one company but poor results for all the others (meaning that it was lucky in that specific company), its average ranking will be poor, leaving the top average rankings to be populated by the true top models that perform well across most companies.

Figure 8 summarizes the results in terms of Total Return and Sharpe Ratio. In either case, because we could not reject the Friedman null hypothesis, the post hoc Nemenyi test was not performed. This means that we cannot say with 95% confidence that there is some significant difference in terms of the average rank of the models for both the Total Return and the Sharpe Ratio between these two modeling approaches.

Nevertheless, there are some observations to remark. Regarding Total Return, the model with the best average ranking is a classification model using cost–benefit matrices.
All the remaining classification variants are in their original form (without using cost–benefit matrices) and occupy mostly the last positions in terms of average rankings. Moreover, not a single variant obtained with SMOTE appears in this top five for each approach, which means that we confirm that resampling does not seem to pay off for this type of application due to the economic costs of making more risky decisions. Furthermore, another very interesting remark is that all the top models are using SVMs as the base learning algorithm. Overall, we cannot say that any of the two approaches (forecasting directly the trading actions using classification models or forecasting the price returns using regression) is better than the other regarding the Total Return.

FIGURE 8 The top five average ranking model variants of both modeling approaches (Classification and Regression) are forming a new set of variants, and their average rankings are recalculated. Each model is thus given an average ranking (x axis). The presence of at least one black line implies that the Friedman null hypothesis of all averages being equal was rejected. In that case, every pair of model variants not connected by any black line is considered to have their average rankings significantly different according to a Nemenyi statistical test

The second part of Figure 8 shows the results of the same experiment in terms of Sharpe Ratio, that is, the risk exposure of the alternatives. The conclusions are quite similar to the Total Return metric. Once again, no significant differences were observed. Still, one should note that the first five places are dominated by the classification approaches. The best variant for the Total Return is also the best variant for the Sharpe Ratio, which makes this variant unarguably the best one of our study when considering the 12 different companies. Hence, ultimately, we can state that the most solid model belongs to the classification approach using an SVM with cost–benefit matrices, because it obtained the highest returns with lowest associated risk. Finally, unlike the results for Total Return, in this case, we observe other learning algorithms appearing in the top five best results.

In conclusion, we cannot state that one approach performs definitely better than the other in the context of financial trading decisions. The scientific community typically puts more effort into the regression models, but this study strongly suggests that both have at least the same potential. Actually, the most consistent model we could obtain belongs to the classification approach. Another interesting conclusion is that, out of a considerably large set of different types of models, SVMs achieved better results both when considering classification and regression tasks.

5 | DISCUSSION

In this paper, we have studied the question of classification versus regression on decision making problems based on forecasts of a numeric variable. Our study was focused on financial trading decisions, so the obvious question is as follows: Can these results be generalized to other domains/tasks with similar structure?

This question was addressed in Baia (2015). The setup used in that work was rather similar to the one presented here, with two main differences: (a) the tasks were associated to several distinct contexts (while the current paper focuses only on trading tasks); and (b) each task was considered with a different number of possible decisions to be made. While the second point allowed us to evaluate if one modeling approach could gain some specific advantage over the other depending on the number of values of the decision variable, the first point let us check if whatever conclusion we reached regarding trading problems would also apply in a more generic context. Moreover, the tasks considered in Baia (2015) were all non‐temporal, further stressing the differences to the trading problem.
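The ranking procedure used in this second part can be sketched in a few lines. The following is a minimal illustration and not the code used in the study: it assumes that higher scores are better (as for Total Return or Sharpe Ratio), that tied models share the mean of their positions, and it computes the Friedman chi‐square statistic without tie correction, in the form given by Demšar (2006). The function names and the toy scores are hypothetical.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average rank of each model across data sets (rank 1 = best score).

    scores[model] holds one score per data set; higher is assumed better.
    Tied models receive the mean of the positions they occupy.
    """
    models = list(scores)
    n_datasets = len(scores[models[0]])
    totals = {m: 0.0 for m in models}
    for d in range(n_datasets):
        # Order the models from best to worst on data set d.
        ordered = sorted(models, key=lambda m: scores[m][d], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of models tied with ordered[i].
            while (j + 1 < len(ordered)
                   and scores[ordered[j + 1]][d] == scores[ordered[i]][d]):
                j += 1
            tied_rank = ((i + 1) + (j + 1)) / 2.0
            for m in ordered[i:j + 1]:
                totals[m] += tied_rank
            i = j + 1
    return {m: totals[m] / n_datasets for m in models}


def friedman_statistic(avg_ranks: Dict[str, float], n_datasets: int) -> float:
    """Friedman chi-square statistic (no tie correction) from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks.values())
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Hypothetical Total Return scores of three model variants on four companies.
TOY_SCORES = {
    "svm_cost": [0.10, 0.20, 0.30, 0.05],
    "svm_plain": [0.05, 0.10, 0.20, 0.10],
    "tree_plain": [0.00, 0.05, 0.10, 0.00],
}

if __name__ == "__main__":
    ranks = average_ranks(TOY_SCORES)
    print(ranks)
    print(friedman_statistic(ranks, n_datasets=4))
```

With k models, the statistic would be compared against a chi‐square distribution with k − 1 degrees of freedom; only when the null hypothesis is rejected would the post hoc Nemenyi test be applied, which matches the procedure followed above.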
The main conclusions of the study carried out in Baia (2015) were that, overall, the classification modeling approach outperforms the approach based on regression tools more frequently. Whether or not the user is willing to make an extensive search for the optimal parameters to model a certain task, the results point in the same direction. The methods based on the classification approach tend to be better. The only setup where the regression approach was more competitive was on tasks with a high number of classes/decisions and where the user is not merely interested in the accuracy of the decisions and wants to consider different grades of severity of the decision errors.

Contrary to what was observed in Baia (2015), in this paper, we have gathered enough evidence to state that there is no statistically significant difference between both approaches in the context of the trading decision problem. The trading problem has some characteristics that make it quite different from the generic tasks studied in Baia (2015), which may explain the different conclusions. Namely, the tasks addressed in the current paper are based on data that is ordered by time (time series data), and the frequency of decisions is rather imbalanced, with the more important decisions being rare. These characteristics raise significant challenges to modeling tools, and even with the use of techniques developed to address these issues, we were not able to obtain conclusive results.

In summary, the conclusions of the current paper seem to be specific to the financial trading setting. For more general tasks, existing evidence (Baia, 2015) seems to indicate that there is some advantage in using the classification approach. Future work should try to explain the reason for this different conclusion, namely if it is something specific to the domain or to some of its characteristics (time‐dependent data and imbalanced decisions).
6 | RELATED WORK

To the best of our knowledge, this is the first research work to directly compare the regression and classification approaches to financial trading. Existing work in this area essentially uses one of the approaches and compares different variants of it on some concrete financial data sets. Even outside of the financial trading domain, we were not able to find any comparison between these two plausible approaches to decision making based on numeric forecasts. Still, as we have mentioned in the introduction, we think this is a common and relevant setup for many application domains.

Nevertheless, there are some concepts that are strongly related with the problem of making decisions based on numeric forecasts and thus we provide here a short review of some of the main works in these areas.

In this paper, the ultimate goal is to predict an action/decision. However, in many application domains, there are decisions that are more important than others, and there may exist information on some costs and benefits associated with each decision. In this context, the whole area of cost‐sensitive learning (Elkan, 2001) is strongly related with our target applications. Another related problem is that of imbalanced distributions of the target variable in the context of prediction models. Both problems are present in financial trading. Branco, Torgo, and Ribeiro (2016) present a survey of existing techniques for handling these situations of imbalanced target variables. Although most of the existing work considers classification tasks (nominal target variables), this work also describes methods designed to handle similar problems within regression tasks (numeric target variables).

In terms of the use of regression models for trading systems, neural network models have been shown to be a promising tool for forecasting time series (namely the prices of some assets).
However, they typically require a large effort in terms of model tuning and can be computationally demanding. Gençay (1999) has observed that a simple feed‐forward network fails to statistically outperform the random walk model when the input variables are just the past returns. However, with the simple addition of a moving average, it can statistically outperform this baseline. Ghazali, Hussain, and Liatsis (2011), Shin and Ghosh (1995), and Sermpinis, Dunis, Laws, and Stasinakis (2012) have proposed variants and generalizations of neural networks that present decent improvements over the standard versions of these models. Most research seems to indicate that these standard versions of the neural network models may present poor results, yet some small variations in terms of the structure of the model may greatly improve their performance.

Another popular modeling technique in this area is k‐nearest neighbours (KNN). Gençay (1999) has observed that the regression KNN model statistically outperforms the random walk model by merely using the past returns as predictors, unlike the feed‐forward neural network. Lee, Wei, Cheng, and Yang (2012) have used the nearest‐neighbour‐based approach for churn predictions. Even though this problem is different from financial trading, there are some similarities. Detecting an uncommon event as soon as possible can be seen as highly relevant in trading, that is, detecting a buy or sell signal as early as possible to obtain the maximum possible profit.

Support Vector Machines (SVM) and Multivariate Adaptive Regression Splines (MARS) have been the subject of a study by Kao, Chiu, Lu, and Chang (2013). The authors also tested these models combined with wavelets, where some interesting results were obtained. Furthermore, Lu et al.
(2009) have studied the use of independent component analysis to reduce the noise and randomness of the data before applying standard SVM models and have observed a slight improvement of the results.

In terms of using classification models in the context of financial trading, there are not so many works. Chang, Fan, and Liu (2009) have studied the combination of piecewise linear models with back‐propagation artificial classification neural networks (PLR‐BPN). The experimental results were interesting in terms of the amount of profit obtained. However, Luo and Chen (2013) have seen that PLR‐BPN is outperformed by the combination of PLR with the well‐known Support Vector Machine model. Ma et al. (2012) have used cost matrices with back‐propagation neural networks. In several tasks, an increase of the utility score of a model came at the cost of a decrease in the accuracy level. However, this accuracy decrease may not be relevant for financial trading, where the main goal is to avoid serious errors like forecasting a buy signal when we should sell, as this type of error may have very serious economic impact. Teixeira and de Oliveira (2010) have studied the classification KNN model and also the combination of this model with indicators such as the RSI filter, stop‐gain and stop‐loss criteria. All the tested models outperformed the benchmark used, particularly the models combined with the above indicators, which obtained an overall better performance.

Atsalakis and Valavanis (2009) provide a very detailed state‐of‐the‐art review on stock market forecasting techniques. For each referred work, the authors list the data sets used, the chosen input variables and, most importantly, a summary of all the modeling techniques used as well as which models were compared against each other. Information regarding the usage of pre‐processing techniques and the training method is also given.
In summary, although our review of the related literature has not found any work with objectives similar to the current paper, we were able to observe that the most frequent approach to financial trading is based on regression approaches. Our work has shown that classification approaches should be given more importance by the research community in this area.

7 | CONCLUSIONS

This paper presents a comparative study of two different approaches to financial trading decisions based on forecasting models. The first, and more conventional approach, uses regression tools to forecast the future evolution of prices and then uses some decision rule based on these predictions, to choose the trading action. The second approach tries to directly forecast the “correct” trading decision.

Our study is a specific instance of the more general problem of making decisions based on numerical forecasts. In this paper we have focused on financial trading decisions because this is a domain that requires specific trade‐offs in terms of economic results. This means that our conclusions are specific to this area and it remains an open question for future research whether the same conclusions can be drawn in other application domains where the same two approaches to decision making are plausible. Still, our initial experiments (Baia, 2015) with other types of domains provide some evidence for the specificity of financial trading decision making that is addressed in the current paper.

Overall, the main conclusion of the study we have described in this paper is that, for this specific application domain, there seems to be no statistically significant difference between these two approaches to decision making. Given the large set of classification and regression models that were considered, as well as the different data sets involved in our study, we claim that this conclusion is supported by significant experimental evidence.
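The difference between the two approaches compared in this paper can be made concrete with a small sketch. It is only an illustration under stated assumptions: the symmetric 2.5% threshold is a hypothetical decision rule (inspired by the example rule given in the introduction), the function names are invented for this example, and we assume that the class labels of the second approach are obtained by applying the same rule to the observed future returns.

```python
from typing import List


def decision_rule(price_variation: float, threshold: float = 0.025) -> str:
    """Map a price variation into a trading action.

    The symmetric 2.5% threshold is an illustrative assumption; in practice
    the rule reflects the preferences of the investor.
    """
    if price_variation >= threshold:
        return "Buy"
    if price_variation <= -threshold:
        return "Sell"
    return "Hold"


def actions_from_regression(predicted_returns: List[float]) -> List[str]:
    """First approach: a regression model forecasts the returns, and the
    deterministic rule turns each numeric forecast into an action."""
    return [decision_rule(r) for r in predicted_returns]


def labels_for_classification(observed_returns: List[float]) -> List[str]:
    """Second approach: the rule is applied to the observed returns of the
    training data, and a classifier is then trained to predict these labels
    directly, skipping the numeric forecast."""
    return [decision_rule(r) for r in observed_returns]


if __name__ == "__main__":
    # Hypothetical numeric forecasts produced by some regression model.
    print(actions_from_regression([0.03, -0.04, 0.01]))
```

In both cases the final output is a sequence of Buy, Sell or Hold actions, which is what allows the two approaches to be evaluated with the same metrics.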
Financial forecasting has some particularities that are challenging to most modeling techniques. In our study we have considered the hypothesis of using some existing techniques that were developed to address these types of challenges. Another contribution of this paper involves testing the applicability of these solutions in the context of financial forecasting. Regarding the problem of the imbalance of the distribution of the target variable, we have considered the application of resampling strategies both for regression and classification models. This technique has been applied with success to many application domains. Our experiments have shown that although resampling increases the ability of the models to generate trading signals (thus increasing their recall), it also brings a significant amount of financial risk as several of these signals are wrong, frequently leading to catastrophic financial results. This means that our study clearly indicates that resampling is not recommended in the context of financial forecasting due to this increase in the risk exposure. The other standard technique we have considered in our study was the use of cost–benefit matrices as a means to make the classification models aware of the different costs of the classification errors. In this case, our study collected sufficient evidence to conclude that this method is promising, although not all modeling techniques were able to capitalize on the extra information provided by these matrices.
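To illustrate how a cost–benefit matrix can make a classifier aware of the different costs of its errors, the sketch below selects the action with the highest expected benefit given the estimated class probabilities. It is a generic cost‐sensitive decision rule in the spirit of Elkan (2001), not necessarily the exact procedure of our experiments, and the matrix values, names and probabilities are hypothetical.

```python
from typing import Dict

ACTIONS = ["Sell", "Hold", "Buy"]

# Hypothetical cost-benefit matrix: BENEFIT[predicted][true].
# Correct signals are rewarded, opposite signals are heavily penalized,
# and holding is neutral whatever the true action was.
BENEFIT: Dict[str, Dict[str, float]] = {
    "Sell": {"Sell": 1.0, "Hold": -0.2, "Buy": -1.0},
    "Hold": {"Sell": 0.0, "Hold": 0.0, "Buy": 0.0},
    "Buy": {"Sell": -1.0, "Hold": -0.2, "Buy": 1.0},
}


def expected_benefit(action: str, probs: Dict[str, float]) -> float:
    """Expected benefit of taking `action` given estimated class probabilities."""
    return sum(probs[true] * BENEFIT[action][true] for true in ACTIONS)


def best_action(probs: Dict[str, float]) -> str:
    """Action that maximizes the expected benefit."""
    return max(ACTIONS, key=lambda a: expected_benefit(a, probs))


if __name__ == "__main__":
    # Hypothetical probability estimates produced by some classifier.
    print(best_action({"Sell": 0.1, "Hold": 0.3, "Buy": 0.6}))
    print(best_action({"Sell": 0.3, "Hold": 0.4, "Buy": 0.3}))
```

Because the chosen action depends directly on the probability estimates, unreliable estimates translate into unreliable decisions, which is one possible reason why only some modeling techniques benefited from the matrices.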
ACKNOWLEDGEMENTS

This work is financed by the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness and Internationalisation ‐ COMPETE 2020 Programme within project POCI‐01‐0145‐FEDER‐006961 and by the North Portugal Regional Operational Programme (ON.2 ‐ O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency (FCT) within Project NORTE‐07‐0124‐FEDER‐000059. Part of the work of Luís Torgo was supported by a sabbatical scholarship (SFRH/BSAB/113896/2015) from the Portuguese funding agency (FCT).

REFERENCES

Atsalakis, G. S., & Valavanis, K. P. (2009). Surveying stock market forecasting techniques part II: Soft computing methods. Expert Systems with Applications, 36(3, Part 2), 5932–5941.
Baia, L. (2015). Actionable forecasting and activity monitoring: Applications to financial trading.
Branco, P., Torgo, L., & Ribeiro, R. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (to appear).
Chang, P.‐C., Fan, C.‐Y., & Liu, C.‐H. (2009). Integrating a piecewise linear representation method and a neural network model for stock trading points prediction. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 39(1), 80–92.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over‐sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
Elkan, C. (2001). The foundations of cost‐sensitive learning. In IJCAI’01: Proc. of 17th Int. Joint Conf. of Artificial Intelligence, volume 1, pp. 973–978. Morgan Kaufmann Publishers.
Gençay, R. (1999). Linear, non‐linear and essential foreign exchange rate prediction with simple technical trading rules. Journal of International Economics, 47(1), 91–107.
Ghazali, R., Hussain, A. J., & Liatsis, P. (2011). Dynamic ridge polynomial neural network: Forecasting the univariate non‐stationary and stationary trading signals. Expert Systems with Applications, 38(4), 3765–3776.
Hellstrom, T. (1999). Data snooping in the stock market. Theory of Stochastic Processes, (21, 1999b), pp. 33–50.
Kao, L.‐J., Chiu, C.‐C., Lu, C.‐J., & Chang, C.‐H. (2013). A hybrid approach by integrating wavelet‐based feature extraction with MARS and SVR for stock index forecasting. Decision Support Systems, 54(3), 1228–1244.
Lee, Y.‐H., Wei, C.‐P., Cheng, T.‐H., & Yang, C.‐T. (2012). Nearest‐neighbor‐based approach to time‐series classification. Decision Support Systems, 53(1), 207–217.
Liaw, A., & Wiener, M. (2002). Classification and regression by random forest.
Lu, C.‐J., Lee, T.‐S., & Chiu, C.‐C. (2009). Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47(2), 115–125.
Luo, L., & Chen, X. (2013). Integrating piecewise linear representation and weighted support vector machine for stock trading signal prediction. Applied Soft Computing, 13(2), 806–816.
Ma, G.‐Z., Song, E., Hung, C.‐C., Su, L., & Huang, D.‐S. (2012). Multiple costs based decision making with back‐propagation neural networks. Decision Support Systems, 52(3), 657–663.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2014). e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.6–4.
Milborrow, S. (2014). Earth: Multivariate adaptive regression spline models. R package version 3.2–7.
R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Ridgeway, G. (2013). GBM: Generalized Boosted Regression Models. R package version 2.1.
Ridgeway, G., Southworth, M. H., & RUnit, S. (2013). Package gbm.
Sermpinis, G., Dunis, C., Laws, J., & Stasinakis, C. (2012). Forecasting and trading the eur/usd exchange rate with stochastic neural network combination and time‐varying leverage. Decision Support Systems, 54(1), 316–329.
Shin, Y., & Ghosh, J. (1995). Ridge polynomial networks. IEEE Transactions on Neural Networks, 6(3), 610–622.
Teixeira, L. A., & de Oliveira, A. L. I. (2010). A method for automatic stock trading combining technical analysis and nearest neighbor classification. Expert Systems with Applications, 37(10), 6885–6890.
Torgo, L. (2010). Data Mining with R, learning with case studies. London, United Kingdom.
Torgo, L., Branco, P., Ribeiro, R. P., & Pfahringer, B. (2015). Resampling strategies for regression. Expert Systems, 32(3), 465–476.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed. ISBN 0‐387‐95457‐0). London, United Kingdom.
Wilder, J. (1978). New concepts in technical trading systems. Kingston, New York: Trend Research.

How to cite this article: Baía L, Torgo L. A comparative study of approaches to forecast the correct trading actions. Expert Systems. 2017;34:e12169. https://doi.org/10.1111/exsy.12169

AUTHOR BIOGRAPHIES

Luis Baia has a master's degree in Applied Mathematics and a Bachelor's degree in Pure Mathematics. After some months researching in Machine Learning at LIAAD, he joined a company called “Farfetch” as a Data Scientist.

Luis Torgo is an associate professor in the Department of Computer Science at the University of Porto in Portugal. An active researcher in machine learning and data mining for more than 20 years, Dr. Torgo is also a researcher in the Laboratory of Artificial Intelligence and Data Analysis (LIAAD) of INESC Porto LA.