Sequential Asset Ranking within Nonstationary Time Series

Gabriel Borrageiro, Nick Firoozye, Paolo Barucca

2022-02-24

Financial time series are both autocorrelated and nonstationary, presenting modelling challenges that violate the independent and identically distributed random variables assumption of most regression and classification models. The prediction with expert advice framework makes no assumptions on the data-generating mechanism yet generates predictions that work well for all sequences, with performance nearly as good as the best expert with hindsight. We conduct research using S&P 250 daily sampled data, extending the academic research into cross-sectional momentum trading strategies. We introduce a novel ranking algorithm from the prediction with expert advice framework, the naive Bayes asset ranker, to select subsets of assets to hold in either long-only or long/short portfolios. Our algorithm generates the best total returns and risk-adjusted returns, net of transaction costs, outperforming the long-only holding of the S&P 250 with hindsight. Furthermore, our ranking algorithm outperforms a proxy for the regress-then-rank cross-sectional momentum trader, a sequentially fitted curds and whey multivariate regression procedure.

There are countless practical examples in machine learning where one would like to combine the predictions of a committee of experts to solve a particular forecasting challenge. A traditional form of offline (batch) learning is ensemble learning, which arguably includes neural networks with their committees of hidden processing units and certainly includes boosting and bagging algorithms. Our particular modelling interest is in financial time series, which are typically nonstationary. Nonstationarity implies means and variances that change over time and violates the independent and identically distributed random variables assumption of most regression and classification models. As such, we require approaches that adopt sequential optimisation methods to model financial data, preferably approaches that make few or no assumptions about the data-generating process of the time series.

The prediction with expert advice framework is a multidisciplinary area of research suited to predicting sequences sequentially: no statistical distribution assumptions are made, and the resulting predictions are shown to minimise the regret with respect to the best available expert with hindsight. This framework provides a suitable way to model and forecast financial time series and is well suited to portfolio selection problems. The main result of this paper is the introduction of a novel ranking algorithm, the naive Bayes asset ranker, which we use to target long-only and long/short positions in constituents of the S&P 250 index. Our algorithm is optimised sequentially, which allows it to cope with the changing interactions and behaviours of the S&P 250 constituents. Furthermore, our algorithm outperforms a strategy that would hold the long-only S&P 250 index with hindsight, despite the index appreciating by 88.2% during the test period. Our algorithm can generate excess total and risk-adjusted returns by ranking the probability that individual constituents in the index will generate higher returns than the remaining constituents.
Our algorithm also outperforms a regress-then-rank cross-sectional momentum trading baseline, a sequentially fitted curds and whey multivariate regression procedure.

Our paper is organised as follows. Section 2 begins with the rationale for online learning in financial time series prediction, describes the prediction with expert advice modelling framework, moves on to online financial portfolio selection problems, including the ranking of constituents within portfolios of assets, and completes with an introduction to naive Bayes ranking. Section 3 presents our naive Bayes asset ranking algorithm. In section 4, we describe the design of the experiments we conduct, including details of the regress-then-rank baseline model we use, the curds and whey multivariate regression model. The section completes with a description of the results. Section 5 provides a discussion of the results and the implications for modelling financial time series. The paper concludes with remarks in section 6.

In this section, we introduce various concepts central to the algorithm we devise in section 3. We begin by arguing the case for online learning in financial time series prediction and describe the research area of prediction with expert advice, which leads to a review of the academic literature on online financial portfolio selection and the ranking of portfolios of assets. Finally, the section completes with a review of naive Bayes ranking.

Financial time series exhibit high serial autocorrelation and nonstationarity. Merton (1976) models the dynamics of financial assets as a jump-diffusion process, which implies that financial time series should exhibit small, so-called continuous, changes over time as well as occasional jumps. One approach for coping with nonstationarity is learning and optimising prediction models sequentially. The impact of high serial autocorrelation on time series whose forecasts are made with regression-based models is that the R^2 of the model fit will be spuriously high (Granger and Newbold, 1974). The impact of nonstationarity can be more severe, in that models fitted offline on training data generalise poorly on hitherto unseen test data.

One way to identify nonstationarity in time series is through the Dickey-Fuller test (Dickey and Fuller, 1979). Denote the autocovariance function as γ(s, t) = E[(x_s − μ_s)(x_t − μ_t)]. A process is said to be strictly stationary if the probabilistic behaviour of a collection of values x_1, x_2, ..., x_k is identical to that of a time-shifted set x_{1+h}, x_{2+h}, ..., x_{k+h}. It is said to be weakly stationary if:
1. its mean is constant and does not depend on time;
2. its variance is finite and its autocovariance function γ(s, t) depends on s and t only through their difference |s − t|.
Define the autoregressive(1) model with a drift term as Δx_t = θ_0 + θ_1 x_{t−1} + ε_t. If θ_1 = 0, the time series contains a unit root, implying nonstationarity. The Dickey-Fuller critical values allow the researcher to accept or reject the null hypothesis H_0: θ_1 = 0; the standard regression t-values for the parameter estimates of the above model are compared against these critical values. In section 4, we experiment with approximately the last seven years of daily sampled S&P 250 data. When running the Dickey-Fuller test against the closing prices, 83.6% of the time series are considered nonstationary. When running the test against daily returns of the form price_t/price_{t−1} − 1, 100% of the time series are considered stationary.
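As an illustration of this kind of diagnostic, the sketch below applies the augmented Dickey-Fuller test to each column of a price panel and to the corresponding daily returns; the DataFrame name and data are hypothetical, and this is not the exact procedure used to produce the figures above.

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def fraction_nonstationary(frame: pd.DataFrame, alpha: float = 0.05) -> float:
    """Fraction of columns for which the unit-root null hypothesis cannot be rejected."""
    nonstationary = 0
    for column in frame.columns:
        series = frame[column].dropna()
        p_value = adfuller(series, regression="c")[1]  # "c" includes a constant (drift) term
        if p_value > alpha:
            nonstationary += 1
    return nonstationary / frame.shape[1]

# prices: hypothetical DataFrame of daily closes, one column per constituent.
# returns = prices / prices.shift(1) - 1, matching the return definition in the text.
# fraction_nonstationary(prices) tends to be high, while
# fraction_nonstationary(returns.dropna()) tends to be close to zero.
```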
Naively, one might now assume that it is safe to perform any modelling on the daily returns and that nonstationarity will no longer present a problem. A simple experiment calls this view into question. Table 1 shows the output of the following procedure:

1. For each constituent of the S&P 250, construct daily returns using the closing prices.
2. Separate the daily samples into groups by calendar month.
3. Run the Levene test (Levene, 1960) to check whether the daily-sampled data, separated into monthly groups, have equal variances, also known as homogeneity of variance. Levene's test is an alternative to the Bartlett test (Bartlett, 1937) and is less sensitive to departures from normality.
4. Conduct a two-sample t-test for equal means (Snedecor and Cochran, 1989), in which the current month's daily mean is compared against the past month's daily mean.
5. For both tests, set the significance level α = 0.05.

The results for the Levene test show that the null hypothesis that the monthly group variances are all equal can be rejected in 100% of cases. For the two-sample t-test for equal means, the results indicate that the null hypothesis H_0: μ_t = μ_{t−1} can be rejected 8.1% of the time, which is evidence of nonstationarity even with daily returns. Figure 1 repeats the same test, except that the statistics being compared are shifted further apart in months. The curve plotted through the averages by the shift in months is the lowess curve (Cleveland, 1979). There is evidence that the null hypothesis of equal means can be rejected with a higher probability as the distance in time between the statistics increases. The traditional approach of training a model offline is therefore likely to result in models that generalise poorly, as the statistical properties of the data change over time. Thus, we see firm evidence for the need to apply some form of sequential optimisation when modelling these time series.

Prediction with expert advice is a multidisciplinary research area concerned with predicting individual sequences. Whereas standard statistical approaches to forecasting impose a probabilistic assumption on the data-generating mechanism, the prediction with expert advice framework makes no such assumptions yet generates predictions that work well for all sequences, with performance nearly as good as the best expert with hindsight (Cesa-Bianchi and Lugosi, 2006). The basic structure of problems in this context is encapsulated in algorithm 1, adapted from Rakhlin and Sridharan (2014).

Algorithm 1: Sequential prediction with an adaptive environment
// iterate over each time step
for t = 1, 2, ... do
1. The learner chooses the set of predictions ŷ_t ∈ D, where D is the decision space.
2. Nature simultaneously chooses an outcome z_t ∈ Z, where Z is the adversarial outcome space.
3. The learner suffers a loss ℓ(ŷ_t, z_t), and both players observe (ŷ_t, z_t).

Perhaps the most well-known algorithm within this framework is the weighted majority algorithm of Littlestone and Warmuth (1994). The authors study the construction of prediction algorithms in which the learner faces a sequence of trials, with a prediction to be made at the end of each trial and the goal of making as few mistakes as possible. They are interested in cases where the learner believes that some experts will perform well, but does not know which ones specifically. A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm in such a circumstance.
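To make the weighted-voting idea concrete, the following is a minimal sketch of an exponentially weighted average forecaster; the squared loss, learning rate and array shapes are illustrative assumptions rather than the specification of Littlestone and Warmuth.

```python
import numpy as np

def exponentially_weighted_forecaster(expert_predictions, outcomes, eta=0.5):
    """Aggregate expert predictions with multiplicative weight updates.

    expert_predictions: array of shape (T, n_experts)
    outcomes: array of shape (T,)
    eta: learning rate (illustrative choice)
    """
    n_experts = expert_predictions.shape[1]
    weights = np.ones(n_experts) / n_experts
    aggregated = []
    for y_hat, z in zip(expert_predictions, outcomes):
        aggregated.append(weights @ y_hat)          # learner's prediction before seeing z
        losses = (y_hat - z) ** 2                   # squared loss (assumption)
        weights *= np.exp(-eta * losses)            # multiplicative update
        weights /= weights.sum()                    # renormalise
    return np.array(aggregated)
```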
Littlestone and Warmuth show that the weighted majority algorithm is robust in the presence of errors in the data, and they discuss various versions of the algorithm, proving mistake bounds for them that are closely related to the mistake bounds of the best algorithms in the pool. Finally, given a sequence of trials, if there is an algorithm in the pool A that makes at most m mistakes, then the weighted majority algorithm will make at most c(log|A| + m) mistakes on that sequence, where c is a fixed constant.

If one were to criticise the weighted majority algorithm, the criticism would be directed at applying the algorithm to nonstationary data. For example, in section 3 of their paper, Littlestone and Warmuth detail a modification to their base algorithm which ensures that the weight assigned to the individual experts never dips below γ/|A|, the learning rate divided by the number of experts. This fixed minimum threshold weight is somewhat rudimentary. Our preference is for algorithms that assign more weight to experts that are performing well now but might have performed less well previously. Herbster and Warmuth (1998) ameliorate the problem of predicting sequences in the nonstationary setting by introducing an algorithm that determines the best experts for segments of the individual sequences. When the number of segments is k + 1 and the sequence is of length n, they can bound the additional loss of their algorithm over the best partition by O(k log|A| + k log(n/k)). An Achilles heel of the algorithm is that the optimal number of segments k, corresponding to the periods when new best experts are required, must be known a priori. Freund et al (1997) study online learning algorithms, belonging to the multiplicative-weights family, that combine the predictions of several experts. They apply their methods to the problem of predicting in a model in which the best expert may change with time. They derive a specialist-based algorithm for this problem that is as fast as the best-known algorithm of Herbster and Warmuth (1998) and achieves almost as good a loss bound. However, unlike Herbster and Warmuth, Freund et al's algorithm does not require prior knowledge of the length of the sequence or the number of switches.

The prediction with expert advice framework is valuable in sequential or online portfolio selection. Helmbold et al (1998) present an online investment algorithm that achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth (1995). They test the performance of their algorithm on actual stock data from the New York Stock Exchange accumulated over 22 years. On these data, their algorithm outperforms the best single stock with hindsight as well as Cover's universal portfolio selection algorithm (Cover, 1991). Singer (1998) notes that the earlier work on online portfolio selection algorithms that are competitive with the best constant-rebalanced portfolio determined in hindsight (Cover, 1991; Helmbold et al, 1998; Cover and Ordentlich, 1996) employs the assumption that high returns can be achieved using a fixed asset-allocation strategy. However, stock markets are nonstationary, and in many cases the return of a constant-rebalanced portfolio is much smaller than the return of an ad-hoc investment strategy that adapts to changes in the market.
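For reference, a constant-rebalanced portfolio holds fixed wealth fractions and rebalances back to them every period; the following sketch computes its cumulative wealth from price relatives, using hypothetical data.

```python
import numpy as np

def crp_wealth(price_relatives: np.ndarray, weights: np.ndarray) -> float:
    """Cumulative wealth of a constant-rebalanced portfolio.

    price_relatives: array of shape (T, d); entry [t, j] = price_{t,j} / price_{t-1,j}
    weights: length-d allocation, non-negative and summing to one
    """
    period_growth = price_relatives @ weights   # portfolio growth factor each period
    return float(np.prod(period_growth))

# Example with hypothetical data: two assets, three periods, equal weighting.
x = np.array([[1.01, 0.99], [1.02, 1.00], [0.98, 1.03]])
print(crp_wealth(x, np.array([0.5, 0.5])))
```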
In his paper, Singer (1998) presents an efficient portfolio selection algorithm that can track a changing market, and he describes a simple extension of the algorithm that accounts for transaction costs. Finally, he provides a simple analysis of the competitiveness of the algorithm and evaluates its performance on actual stock data from the New York Stock Exchange accumulated over 22 years. On these data, his algorithm outperforms all the algorithms referenced above, with and without transaction costs.

Related to the issue of portfolio selection is the ranking of a subset of assets to buy, or to trade in a long/short portfolio. Poh et al (2021) apply learning-to-rank algorithms, which were primarily designed for natural language processing, to cross-sectional momentum trading strategies. Cross-sectional strategies mitigate some of the risk associated with wider market moves by buying the top α-percentile of assets with the highest expected future returns and selling the bottom α-percentile of assets with the lowest expected future returns. Classical approaches that rely on the ranking of assets include ranking by annualised returns (Jegadeesh and Titman, 1993), as well as more recent regress-then-rank approaches (Wang and Rasheed, 2018; Kim, 2019; Gu et al, 2020). Poh et al (2021) take a different approach, using pair-wise learning-to-rank algorithms such as RankNet (Burges et al, 2005) and LambdaRank (Burges, 2010). Overall, in experiments they conduct on monthly-sampled Center for Research in Security Prices 2019 data, the learning-to-rank algorithms achieve higher total and risk-adjusted returns than the traditional regress-then-rank approaches. In terms of limitations, experimenting with daily sampled data would probably resonate more with the finance community, as it facilitates a more active portfolio management style. Grinold and Kahn (2019) demonstrate how active portfolio management generally outperforms passive portfolio management and present quantitative procedures for achieving this. However, working with higher-frequency data presents further challenges, such as transaction-cost mitigation, slower training times for the learning-to-rank algorithms and, perhaps more crucially, a decrease in signal and an increase in noise in the financial time series themselves.

In the following section, we introduce our naive Bayes ranker for portfolios composed of financial assets. First, however, we review the literature on naive Bayes ranking. The phrase "naive Bayes ranking" has broad meaning, with different implementations in various contexts. At the core of the idea is the naive Bayes classifier

P(y_c | x) = P(x | y_c) P(y_c) / Σ_j P(x | y_j) P(y_j),

which predicts that y takes the label value c if this posterior probability is highest. The strong (naive) independence assumptions between the features x mean that the probabilities can be modelled iteratively and inexpensively. Zhang and Su (2004) study the general performance of naive Bayes in ranking. They use the area under the receiver operating characteristic curve (Provost and Fawcett, 1997), also known as the AUC, to evaluate the quality of the rankings generated by a classifier, and they discuss how the AUC is a good metric for evaluating the quality of classifiers averaged across all possible probability thresholds.
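As a concrete illustration of the AUC as a ranking measure, the sketch below estimates it directly as the fraction of (positive, negative) pairs that are ranked correctly; the scores and labels are hypothetical.

```python
import numpy as np

def pairwise_auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: model scores, higher means more likely positive
    labels: binary labels, 1 for positive, 0 for negative
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Count pairs where the positive outscores the negative; ties count as half.
    comparisons = pos[:, None] - neg[None, :]
    return float(((comparisons > 0).sum() + 0.5 * (comparisons == 0).sum())
                 / (len(pos) * len(neg)))

# Hypothetical example: a perfect ranking gives AUC = 1.
print(pairwise_auc(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0])))
```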
Zhang and Su further show that, for binary classification, the AUC is equivalent to the probability that a randomly chosen example of class − will receive a lower estimated probability of belonging to class + than a randomly chosen example of class +. Thus, the AUC is a measure of the quality of a ranking. The AUC is created by plotting the true-positive rate against the false-positive rate at various threshold settings and has a maximum value of one when every positive example is ranked above every negative example. Overall, they find that their naive Bayes ranker, evaluated using the AUC, outperforms C4.4 on various datasets, C4.4 being the state-of-the-art decision-tree algorithm for ranking (Provost and Domingos, 2003).

Flach and Matsubara (2007) consider binary classification tasks in which a ranker sorts a set of instances from highest to lowest expectation that the instance is positive. They propose a lexicographic ranker, LexRank, whose rankings are derived not from scores but from a simple ranking of attribute values obtained from the training data. Using the odds ratio to rank the attribute values, they obtain a restricted version of the naive Bayes ranker. They systematically develop the relationships and differences between classification, ranking and probability estimation, which leads to a novel connection between the Brier score (Brier, 1950) and ROC curves. Combining LexRank with isotonic regression (Zadrozny and Elkan, 2001), which derives probability estimates from the ROC convex hull, results in the lexicographic probability estimator LexProb. Both LexRank and LexProb are empirically evaluated on various datasets and are highly effective.

This section introduces our novel ranking algorithm, the naive Bayes asset ranker. The algorithm can be used to sequentially rank an array of experts via a performance measure such as forecasted returns, profit and loss (pnl), or risk-adjusted returns such as the Sharpe or information ratio. Assume for the moment that the algorithm is presented with a set of d pnls. The goal is to select a subset of experts 1 ≤ k ≤ d such that the reward of the k experts is expected to be the highest; this is achieved by computing the sequential posterior probability that expert j ∈ {1, ..., d} is ranked higher than the other experts. A crucial attribute of the algorithm is that this posterior probability is computed with exponential decay, using a parameter 0 < τ ≤ 1, allowing experts who performed poorly previously but now perform well to be selected with weights that are not too close to zero; such behaviour is crucial for financial time series, which are nonstationary.

In section 4.1, we provide algorithmic details of a multivariate regression procedure, the curds and whey model, which is a suitable way of predicting several response variables from the same set of explanatory variables. By taking advantage of the correlations between the response variables, the model can reduce mean-square prediction error compared with the more common procedure of fitting individual regressions of each response variable on the shared set of predictor variables. Finally, we aggregate the one-step-ahead forecasts of the curds and whey procedure and provide them to our naive Bayes asset ranker when conducting experiments on daily returns of S&P 250 data.
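Before the formal listing of algorithm 2 below, here is a minimal sketch of one plausible realisation of the ranking update just described — decayed pairwise outperformance counts normalised into a posterior. The specific update and normalisation choices are assumptions for illustration, not the paper's exact listing.

```python
import numpy as np

class NaiveBayesAssetRanker:
    """Sketch: sequentially rank d experts by decayed pairwise outperformance counts."""

    def __init__(self, d: int, tau: float = 0.999):
        self.tau = tau                      # exponential decay constant, 0 < tau <= 1
        self.R = np.zeros((d, d))           # decayed counts: R[j, i] ~ times expert j beat expert i
        self.p = np.full(d, 1.0 / d)        # posterior that each expert ranks highest

    def update(self, r: np.ndarray) -> np.ndarray:
        """r: time-t performance vector (e.g. forecasted returns or realised pnls)."""
        wins = (r[:, None] >= r[None, :]).astype(float)   # element-wise indicator I(r_j >= r_i)
        self.R = self.tau * self.R + wins                  # decayed pairwise counts
        q = self.R.sum(axis=1) / self.R.sum()              # likelihood each expert is ranked higher
        posterior = q * self.p                              # combine with the running prior
        self.p = posterior / posterior.sum()                # normalise
        return np.argsort(-self.p)                          # indices sorted from highest to lowest

# Usage with hypothetical returns for three experts:
ranker = NaiveBayesAssetRanker(d=3)
print(ranker.update(np.array([0.01, -0.02, 0.03])))  # expert 2 should rank first
```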
Algorithm 2: the naive Bayes asset ranker
Input: r_t, τ
// r_t ∈ R^d, a time-t performance measure vector we wish to track, such as forecasted returns or realised pnls
// 0 < τ ≤ 1, an exponential decay constant
Initialise: R, p, q
// R_0 = 0^{d×d}, a d × d zero matrix
// p_0 = (1/d)^d, the posterior probability that each element within it is ranked higher than the others, with elements initialised to 1/d
// q_0 = 0^d, the likelihood that each element within it is ranked higher than the others
Output: z_t = argSort(p_t)
// argSort(.) returns the indices that would sort an array from largest to smallest value
// I(.) is equal to 1 if the condition is true, or 0 otherwise; the comparison I(s_t ≥ r_t) is made element-wise on the vector s_t
Denote the highest-ranked expert at time t as j* = z_{t,1}; assuming that a subset of experts 1 ≤ k ≤ d is used, the test-time ensemble return is computed from the returns of the k highest-ranked experts.

We begin with a description of the research experiment design, which comprises two experiments using the same fixed data. In the first experiment, we train and test sequentially using a regress-then-rank multivariate regression model to hold a subset of long-only positions from the S&P 250 index. Additionally, we optimise a novel online ranking model from the prediction with expert advice framework to target a subset of long-only positions. In the second experiment, the same models are now permitted to hold a basket of long/short positions.

We extract daily sampled S&P 250 data from Refinitiv. Designate the midprice for the j-th predictor at time t as p_{t,j}, and form daily returns r_{t,j} = p_{t,j}/p_{t−1,j} − 1. These returns are stored in a predictor vector x_t = (r_{t,1}, ..., r_{t,d})^T. Represent the vector of one-step-ahead target returns that we want to forecast as y_t = (r_{t+1,1}, ..., r_{t+1,d})^T. We wish to learn sequentially the multivariate mapping from x_t to y_t, which is a standard regression problem. The curds and whey procedure due to Breiman and Friedman (1997) is a suitable way of predicting several response variables from the same set of explanatory variables. The method takes advantage of the correlations between the response variables to improve predictive accuracy compared with the usual procedure of fitting individual regressions of each response variable on the shared set of predictor variables. The basic version of the procedure begins with the usual multivariate ridge regression and estimates a shrinkage matrix Φ. At test time, the vector of forecasts is ỹ_t = Φ ŷ_t, where ŷ_t is the vector of ridge-regression predictions. Breiman and Friedman also derive estimates of the matrix Φ that take the form Φ = T^{−1} D T, where T is the q × q matrix whose rows are the response canonical co-ordinates and D is a diagonal 'shrinking' matrix.

Given the nonstationary nature of the data we model, our particular interest is in sequential optimisation. Thus, we combine the curds and whey procedure with exponentially weighted recursive least-squares (ewrls) updating, as shown in algorithm 3.

Algorithm 3: the curds and whey procedure with ewrls updating
Require: λ, τ
// λ, the ridge penalty
// 0 < τ ≤ 1, an exponential forgetting factor
Initialise: Θ = 0^{(d+1)×d}, P = I_{d+1}/λ, Q = I_d/λ
// I_{d+1} denotes the identity matrix of size d + 1
Input: x_t ∈ R^{d+1} // the inputs/predictors vector at time t
y_t ∈ R^d // the targets/responses vector at time t
Output: ỹ_t // the forecast one-step-ahead returns vector

We use the forecasts of algorithm 3 as the basis for taking risk in a subset of constituents of the S&P 250 index. Specifically, in the long-only position experiment, the top decile of forecasts ỹ_t by size are bought; in the long/short position experiment, the top decile of forecasts are bought and the bottom decile of forecasts are sold.
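The sequential fitting in algorithm 3 rests on exponentially weighted recursive least squares; the following is a minimal sketch of that ingredient alone, omitting the curds and whey shrinkage step, with illustrative class and variable names.

```python
import numpy as np

class EWRLS:
    """Exponentially weighted recursive least squares for a multivariate response."""

    def __init__(self, n_inputs: int, n_outputs: int, ridge: float = 1.0, tau: float = 0.999):
        self.tau = tau                                   # forgetting factor
        self.theta = np.zeros((n_inputs, n_outputs))     # coefficient matrix
        self.P = np.eye(n_inputs) / ridge                # inverse covariance estimate

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.theta.T @ x

    def update(self, x: np.ndarray, y: np.ndarray) -> None:
        Px = self.P @ x
        k = Px / (self.tau + x @ Px)                     # gain vector
        error = y - self.predict(x)                      # one-step-ahead prediction error
        self.theta += np.outer(k, error)                 # coefficient update
        self.P = (self.P - np.outer(k, Px)) / self.tau   # covariance update

# Usage with hypothetical data: predict tomorrow's returns from today's (plus a bias term).
model = EWRLS(n_inputs=4, n_outputs=3)
x_t = np.array([1.0, 0.01, -0.02, 0.005])               # bias + three returns (hypothetical)
y_t = np.array([0.012, -0.018, 0.004])
y_hat = model.predict(x_t)
model.update(x_t, y_t)
```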
The long-only position constraint is of interest to mutual funds or exchange-traded funds where this constraint is applied. The long/short portfolio is of interest to the broader hedge-fund community and is an example of cross-sectional momentum trading. If there are k assets in the relevant decile, each constituent is assigned a weight of 1/k. A second forecaster that we consider is the naive Bayes asset ranker, algorithm 2, applied to the one-step-ahead forecasts of algorithm 3. Assuming there are k assets in the top decile that we wish to go long, the naive Bayes asset ranker assigns the j-th constituent (1 ≤ j ≤ k) a weight derived from its posterior ranking probability, normalised across the k long positions; for short positions, assuming there are k assets we wish to go short, the j-th constituent is assigned a weight in the same manner, with the opposite sign.

We must also consider execution costs. We force the curds and whey model and the naive Bayes asset ranker to trade as price takers, meaning that the models incur a cost equal to half the bid/ask spread multiplied by the absolute change in position. Furthermore, as these data are sampled daily, any portfolio rebalancing is applied at most once a day, at the close of trading circa 4 pm EST. Finally, the hyperparameters set for the models in this experiment are an exponential decay τ = 0.999 for algorithms 2 and 3, and a ridge penalty λ = 1 for algorithm 3. There is no particular reason for this choice of hyperparameters other than that they are sensible defaults. Some performance gains could be made by estimating these hyperparameters via a cross-validation procedure, but this would reduce the test-set size, so we avoid performing cross-validation here.

Two experiments using the same data are considered. In the first experiment, we consider long-only positions, and in the second, we consider cross-sectional momentum long/short positions. With hindsight, holding an equal weighting of constituents in the S&P 250 would earn a total return of 88.2%, implying a compound annual growth rate of 14.3%. The annualised Sharpe ratio is 0.74, which implies that the probability that the annualised return will be greater than zero is 69.7%. The maximum drawdown is 52.9%, which occurred in the March 2020 Covid-19-induced sell-off. The total return to maximum drawdown coverage is 1.67×. The final statistic to consider is the win ratio, the fraction of trading days with positive pnl; this comes in at 53.7%.

The long-only curds and whey procedure achieves a test-set total return of 135.4%, implying a compound annual growth rate of 19.8%. The annualised Sharpe ratio is 0.92, implying that the probability that the annualised return will be greater than zero is 73.8%. Finally, the total return to maximum drawdown coverage is 2.4×; thus, this model has outperformed holding the long-only S&P 250 with hindsight. The naive Bayes asset ranker has the highest total return and risk-adjusted return. It achieves a total return of 156.7%, implying a compound annual growth rate of 22.1%. The annualised Sharpe ratio is 1.17, implying that the probability that the annualised return will be greater than zero is 79.8%. Finally, the total return to maximum drawdown coverage is 3.2×.

In the experiment where we trade cross-sectional momentum with long/short positions, the curds and whey procedure achieves a test-set total return of 97.9%, implying a compound annual growth rate of 15.5%. The risk-adjusted performance, measured through the annualised Sharpe ratio, comes in at 0.76. This Sharpe ratio suggests that the probability that the annualised return will be greater than zero is 70.1%.
The total return to maximum drawdown coverage has improved to 2.8×, reflecting a lower maximum drawdown of 34.8%. As before, the naive Bayes asset ranker has the highest total return and risk-adjusted return. It achieves a total return of 149.3%, implying a compound annual growth rate of 21.3%. Furthermore, the annualised Sharpe ratio has increased to 1.47, which implies that the probability that the annualised return will be greater than zero is 86.5%. The total return to maximum drawdown coverage has increased to 8.6×, and the win ratio is also the highest, at 54.1%. Figure 4 shows the long/short cumulative selection probability by sector for the naive Bayes asset ranker. We see that it draws long positions predominantly from constituents in sectors such as manufacturing.

It is clear from the results that the market appreciated during the experiment's test period. Our proxy for regress-then-rank modelling, the curds and whey procedure, performs better than the long-only S&P 250 (with hindsight), irrespective of the long-only or long/short constraints. However, there is a deterioration in total return and risk-adjusted return when forcing the model to take short positions. The naive Bayes asset ranker performs well under both scenarios, with risk-adjusted returns highest for the long/short portfolio. How might we rationalise or explain the deterioration in the performance of the curds and whey regress-then-rank procedure and the excellent performance of the naive Bayes asset ranker? In the former case, several academic papers discuss shortcomings of regression models compared with classification models in the financial time series prediction setting. For example, Satchell and Timmermann (1995) show that regression models, which typically minimise mean-square prediction error, obtain worse performance than a random-walk model when forecasting daily foreign exchange rates. Furthermore, they show that the probability of correctly predicting the sign of the change in daily exchange rates can be higher for the non-linear forecasts than for the random-walk forecasts, even though the mean-square prediction error of the non-linear forecasts exceeds that of the random-walk forecasts. They conclude that the mean-square prediction error is not always an appropriate performance measure for evaluating predictions of non-linear processes. Moody et al (1998) discuss how supervised learning techniques that are used to minimise forecast error are not guaranteed to optimise trading system performance globally. Furthermore, they state that trading system profits depend upon sequences of interdependent decisions and are thus path-dependent. Examples of path-dependent trading decisions include evaluating transaction costs, market impact and taxes, all of which require knowledge of the current system state. Including information related to past decisions in the inputs to a trading system results in a recurrent decision system, requiring different optimisation from the supervised optimisation techniques used for direct forecasts. More recently, Amjad and Shah (2017) find that classical time series regression algorithms, such as auto-regressive integrated moving average models, have poor performance when forecasting Bitcoin returns.
They argue that the probability distribution of the sign of future price changes can be adequately approximated from finite data, specifically by classification algorithms that estimate this conditional probability distribution. We conclude by saying that if recent past data relates to hitherto-unseen near-future data, our naive Bayes asset ranker will outperform regress-then-rank models and, at the very least, achieve a loss close to that of the best expert with hindsight. It will achieve this because it directly models the conditional posterior probability that a portfolio constituent will be ranked higher than the other constituents in the portfolio.

Financial time series are both autocorrelated and nonstationary, presenting modelling challenges that violate the independent and identically distributed random variables assumption of most regression and classification models. The prediction with expert advice framework makes no assumptions on the data-generating mechanism yet generates predictions that work well for all sequences, with performance nearly as good as the best expert with hindsight. We conduct research using S&P 250 daily sampled data, extending the academic research into cross-sectional momentum trading strategies. We introduce a novel ranking algorithm from the prediction with expert advice framework, the naive Bayes asset ranker, to select subsets of assets to hold in either long-only or long/short portfolios. Our algorithm generates the best total returns and risk-adjusted returns, net of transaction costs, outperforming the long-only holding of the S&P 250 with hindsight. Furthermore, our ranking algorithm outperforms a proxy for the regress-then-rank cross-sectional momentum trader, a sequentially fitted curds and whey multivariate regression procedure.

References
Amjad and Shah (2017). Trading bitcoin and online time series prediction.
Bartlett (1937). Properties of sufficiency and statistical tests.
Breiman and Friedman (1997). Predicting multivariate responses in multiple linear regression.
Brier (1950). Verification of forecasts expressed in terms of probability.
Burges et al (2005). Learning to rank using gradient descent.
Burges (2010). From RankNet to LambdaRank to LambdaMART: An overview.
Cesa-Bianchi and Lugosi (2006). Prediction, learning, and games.
Cleveland (1979). Robust locally weighted regression and smoothing scatterplots.
Cover and Ordentlich (1996). Universal portfolios with side information.
Cover (1991). Universal portfolios.
Dickey and Fuller (1979). Distribution of the estimators for autoregressive time series with a unit root.
Flach and Matsubara (2007). Simple Lexicographic Ranker and Probability Estimator.
Freund et al (1997). Using and combining predictors that specialize.
Granger and Newbold (1974). Spurious regressions in econometrics.
Grinold and Kahn (2019). Advances in Active Portfolio Management: New Developments in Quantitative Investing.
Gu et al (2020). Empirical Asset Pricing via Machine Learning.
Helmbold et al (1998). On-line portfolio selection using multiplicative updates.
Herbster and Warmuth (1998). Tracking the best expert.
Jegadeesh and Titman (1993). Returns to buying winners and selling losers: Implications for stock market efficiency.
Kim (2019). Enhancing the momentum strategy through deep regression.
Kivinen and Warmuth (1995). Additive versus exponentiated gradient updates for linear prediction.
Levene (1960). Contributions to probability and statistics.
Littlestone and Warmuth (1994). The weighted majority algorithm. Information and Computation.
Merton (1976). Option pricing when underlying stock returns are discontinuous.
Moody et al (1998). Performance functions and reinforcement learning for trading systems and portfolios.
Poh et al (2021). Building cross-sectional systematic strategies by learning to rank.
Provost and Domingos (2003). Tree induction for probability-based ranking.
Provost and Fawcett (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions.
Rakhlin and Sridharan (2014). Statistical learning theory and sequential prediction.
Satchell and Timmermann (1995). An assessment of the economic value of nonlinear foreign exchange rate forecasts.
Singer (1998). Switching portfolios.
Snedecor and Cochran (1989). Statistical methods.
Wang and Rasheed (2018). Stock ranking with market microstructure, technical indicator and news.
Zadrozny and Elkan (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers.
Zhang and Su (2004). Naive bayesian classifiers for ranking.