Statistical Arbitrage Risk Premium by Machine Learning
Raymond C. W. Leung and Yu-Man Tam
March 18, 2021

Abstract. How to hedge factor risks without knowing the identities of the factors? We first prove a general theoretical result: even if the exact set of factors cannot be identified, any risky asset can use some portfolio of similar peer assets to hedge against its own factor exposures. A long position of a risky asset and a short position of a "replicate portfolio" of its peers represent that asset's factor residual risk. We coin the expected return of an asset's factor residual risk as its Statistical Arbitrage Risk Premium (SARP). The challenge in empirically estimating SARP is finding the peers for each asset and constructing the replicate portfolios. We use the elastic-net, a machine learning method, to project each stock's past returns onto that of every other stock. The resulting high-dimensional but sparse projection vector serves as investment weights in constructing the stocks' replicate portfolios. We say a stock has high (low) Statistical Arbitrage Risk (SAR) if it has low (high) R-squared with its peers. The key finding is that "unique" stocks have both a higher SARP and higher excess returns than "ubiquitous" stocks: in the cross-section, high SAR stocks have a monthly SARP (monthly excess returns) that is 1.101% (0.710%) greater than low SAR stocks. The average SAR across all stocks is countercyclical. Our results are robust to controlling for various known priced factors and characteristics.

Given any stock, how can one hedge against its factor risks? This question is simple to answer with a linear factor model structure.
For instance, under the Markowitz mean-variance portfolio theory and its resulting equilibrium capital asset pricing model (CAPM) (Sharpe (1964), Lintner (1965)), any given stock's returns can be explained by some linear combination of a risk-free asset return and its beta loading on the market portfolio return. Hence, the factor risk of any stock can be hedged out by shorting its beta loading multiplied by the market factor. However, with the large number of tradable and non-tradable factors that have been documented in the literature (Harvey, Liu, and Zhu (2016)), our question becomes difficult to approach: its answer then depends significantly on which factors the researcher decides to include in his empirical study, and is also affected by the empirical uncertainty in estimating the factor loadings. The first main contribution of this paper is to argue theoretically that an effective way to hedge against potentially unidentifiable factor risks of a stock is to answer this dual question: Given any stock, what portfolio of all other stocks is most similar to it? Suppose all stocks are exposed to the same set of linear factors but with heterogeneous factor loadings. If one can identify a group of peers that is the most "similar" to a given stock i, then this portfolio is also exposed to similar factor loadings as stock i. We view this portfolio of peer stocks as the replicate of stock i. A long position on stock i and a short position on its replicate will expose the holder to any remaining factor risks of stock i that cannot be completely hedged out by its peer stocks. We show this long-short position exactly equates to the residual factor risks of stock i. Provided an econometrician has a method to find these peer stocks, this long-short position does not require the econometrician to know the true underlying factor structure of the economy.
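This hedging intuition can be checked with a small simulation: in an economy where every asset loads on the same hidden factors, projecting one asset's returns onto its peers' returns yields a replicate whose long-short residual carries almost no factor exposure. The sketch below is purely illustrative (a two-factor toy economy, with a plain least-squares projection standing in for the paper's elastic-net; all parameter values are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 2000, 40, 2                       # days, peer assets, hidden factors

# A two-factor economy: every asset loads on the same (unidentified) factors F
# with heterogeneous loadings, in the spirit of equation (1).
F = rng.normal(0.0, 1.0, (T, K))            # factor returns
beta = rng.normal(1.0, 0.5, (N + 1, K))     # loadings; asset 0 is the focal asset
eps = rng.normal(0.0, 0.2, (T, N + 1))      # idiosyncratic risks
R = F @ beta.T + eps                        # excess returns, T x (N+1)

focal, peers = R[:, 0], R[:, 1:]

# Replicate of the focal asset: project its returns onto its peers' returns.
b, *_ = np.linalg.lstsq(peers, focal, rcond=None)
long_short = focal - peers @ b              # long the focal asset, short its replicate

# The long-short position carries almost no remaining factor exposure,
# even though F itself was never used in the projection.
resid_beta, *_ = np.linalg.lstsq(F, long_short, rcond=None)
print(np.abs(resid_beta).max())             # close to zero
```

Even though the projection never observes F, the replicate inherits the focal asset's factor loadings from its peers, so the long-short residual is essentially the focal asset's idiosyncratic risk; this is the intuition formalized in Theorem 1.1 below.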
We call the expected returns of this long-short position the Statistical Arbitrage Risk Premium (SARP) of stock i, and SARP is the key object of study in this paper. How do we empirically study SARP? Is there a cross-sectional difference in SARP? The second main contribution of the paper is answering these questions. For each month end, we use the elastic-net estimator, a machine learning method, to project each stock i's past twelve months' daily returns onto the returns of every other stock in the market. The resulting elastic-net projection vector is high-dimensional but very sparse. After a suitable normalization, the projection vector is then used as investment weights into all stocks other than i. The resulting portfolio is hence a machine-learning-constructed replicate of stock i. As theoretically motivated above, the time-series average return from a long position of stock i and a short position of its replicate is the SARP of stock i. Moreover, we call the elastic-net projection R² of each stock i the Statistical Arbitrage Risk (SAR) of stock i; we say a stock has high SAR if it has a low elastic-net R², while a stock has low SAR if it has a high R². The core empirical message of this paper can be succinctly summarized as follows: in the cross-section, "unique" stocks (i.e. those with low R², and hence high SAR) have a higher SARP than "ubiquitous" stocks (i.e. high R², so low SAR). Over the sample period of January 31, 1976 to December 31, 2020, high SAR stocks have a monthly SARP of 1.368% and low SAR stocks have a monthly SARP of 0.267%, and the difference of 1.101% is highly statistically significant. And even without studying SARP, we have the important corollary that high SAR stocks have a monthly return of 1.481% and low SAR stocks have a monthly return of 0.771%, and the difference of 0.710% is also highly statistically significant.
Our paper belongs to a growing literature applying machine learning methods to empirical asset pricing questions. Broadly speaking, many recent papers in this literature use machine learning methods for factor selection and/or forecasting. We do neither in this paper. There are only two purposes of using a machine learning method in this paper: to identify the SAR of each stock, and to construct the replicate portfolio of each stock. The estimation and inference of SARP for each stock use conventional empirical asset pricing procedures. Recent papers have applied variants of the least absolute shrinkage and selection operator (LASSO) estimator of Tibshirani (1996). Feng, Giglio, and Xiu (2020) take advantage of the sparsity property of the LASSO and develop a multi-step approach to evaluate the price of risk of a given new factor above and beyond an existing set of factors. Freyberger, Neuhierl, and Weber (2017) use the adaptive group LASSO to select characteristics that provide marginal information for the cross-section of expected stock returns. The literature has documented a large set of factors or characteristics (Harvey, Liu, and Zhu (2016)), and it is hoped that machine learning methods can substantially shrink down the number of factors that can explain the cross-section of returns. Chinco, Clark-Joseph, and Ye (2019) use the LASSO to predict one-minute-ahead returns using lagged high-frequency returns of other stocks as regressors. Gu, Kelly, and Xiu (2020) apply an extensive battery of machine learning methods and discover that such methods improve predictive accuracy over traditional methods. Shu et al. (2020) is a recent paper that uses an adaptive elastic-net estimator to construct a sparse portfolio that can track a large index portfolio. While both papers use the elastic-net estimator, our paper emphasizes the use of this estimator to discover a new asset pricing anomaly, while Shu et al.
(2020) emphasize the use of this estimator to mimic and dimension-reduce a large portfolio. Our paper is also related to the literature on pairs trading, substitutability of risky assets, and statistical arbitrage. Gatev, Goetzmann, and Rouwenhorst (2006) find pairs of similar stocks using the minimal distance between normalized historical prices, and argue that the resulting pairs trading strategy generates abnormal returns and that the source of this profit is the mispricing of close substitutes. Krauss (2017) is a recent survey of the pairs trading literature. Wurgler and Zhuravskaya (2002) similarly argue that stocks without close substitutes are likely to have large mispricings; the authors identify a small, predefined number of similar stocks using a predefined sorting method. In contrast, by using a machine learning method in this paper, the selection of a stock's risky peers is completely data driven. Indeed, the closest substitute of a given stock could potentially be a portfolio of hundreds of other stocks. Huck (2019) and Avellaneda and Lee (2010) are two recent studies on statistical arbitrage. Section 1 lays out the theoretical framework of the paper. Section 2 explains our estimation methodology. The main empirical results of the paper are in Section 3. Section 3.6 shows additional empirical robustness checks. We conclude in Section 4. We defer the details of the elastic-net procedure to Section A. All proofs for Section 1 are in the Online Supplementary Materials Leung and Tam (2021). We first prove a general theoretical asset pricing result that will guide our empirical research design. Theorem 1.1. Suppose there are N + 1 risky assets and a single risk-free asset. Assume all of these risky assets are governed by a linear factor structure with K factors with risky returns F, and where the idiosyncratic risk ε_i of the i-th risky asset is assumed to have zero mean and is independent of F for all i = 1, . . . , N + 1.
Suppose there are strictly more risky assets than factors, so N > K. Then the excess returns R_i of any individual risky asset i can be expressed as a linear combination of the other risky asset returns as

R_i = b_i′ R_{−i} + a_i′ Φ_i + (ε_i − b_i′ ε_{−i}).    (2)

That is, the excess returns R_i of any individual asset i can be expressed as a combination of: (i) the N × 1 vector of excess returns of all other risky assets R_{−i}; (ii) the factor loadings on some K risky asset returns Φ_i, and we will call these K assets the factor residuals of asset i; (iii) the N × 1 vector of idiosyncratic risks ε_{−i} of all other N risky assets; and (iv) the idiosyncratic risk ε_i of asset i itself. The N × 1 vector b_i depends on the intercepts {α_j}_{j=1, j≠i}^{N+1} and the entire set of factor loadings {β_j}_{j=1, j≠i}^{N+1} of the economy, and its analytical expression is in the Internet Appendix; here a_i := [α_i, β_i′]′ stacks asset i's own intercept and factor loadings. This result tells us that given any factor structure in the financial markets like (1), whose theoretical existence can always be justified via Ross (1976), the returns of a single risky asset R_i can be expressed as a linear combination b_i of the returns of all other risky assets R_{−i} as in (2). The key intuition of this result is a hedging and replication argument. Suppose the researcher does not know the exact identities of the factors F = [F_1, . . . , F_K] in the economy. But as long as all risky assets have an exposure to these factors, then any particular risky asset i can use some combination b_i of other risky stocks to hedge against risky asset i's factor risks. 1 For the empirical component of our paper, this means projecting one stock's returns onto the returns of all other stocks is not a naive statistical exercise but actually has concrete microeconomic foundations. The next result will tell us the economic content of Theorem 1.1. Corollary 1.2. Suppose the conditions of Theorem 1.1 hold.
For each fixed risky asset i, we can find K + 1 risky assets whose returns are given by

Φ_{i,k} := F_k − c_{i,k}′ (R_{−i} − ε_{−i}), for k = 1, . . . , K,    (3b)

where c_{i,k} is some N × 1 deterministic vector that depends only on the factor loadings {β_j}_{j=1, j≠i}^{N+1} and intercepts {α_j}_{j=1, j≠i}^{N+1} of the economy (its analytical expression is in the Internet Appendix), and where the mispricing-adjustment return Φ_{i,0} is constructed analogously from the intercepts. Here we denote α_{−i} := [α_1, . . . , α_{i−1}, α_{i+1}, . . . , α_{N+1}] and ||·||_2 is the Euclidean norm on R^N. We will denote Φ_i := [Φ_{i,0}, Φ_{i,1}, . . . , Φ_{i,K}]′. Corollary 1.2 is what motivates us to refer to those K risky assets with returns Φ_i as factor residuals for asset i. From (2), we see the following decomposition,

a_i′ Φ_i = α_i Φ_{i,0} + β_i′ Φ_{i,1:K},    (4)

where Φ_{i,1:K} := [Φ_{i,1}, . . . , Φ_{i,K}]′. The first part Φ_{i,0} adjusts for any potential mispricings in the economy. As an important special case, when there is no mispricing at all in the economy, 2 meaning α_i = 0 and α_j = 0 for all assets j ≠ i, then Φ_{i,0} = 1 and α_i Φ_{i,0} = 0. If, on the other hand, the intercepts α_i, α_j are generically non-zero, then the factor residuals of asset i will explicitly hedge out any level of mispricing in the economy through Φ_{i,0}, and asset i's net mispricing exposure is α_i Φ_{i,0}. Secondly, the next K parts Φ_{i,k} adjust for factor exposures. 1 Some readers may think that the result in Theorem 1.1 and the subsequent Corollary 1.2 can be trivially derived by "inverting the factor loadings" in (1) to derive (2). However, we should note that there are (significantly) more assets N than factors K in this economy. Thus the economy-wide factor loading matrix is necessarily of dimension N × K, meaning the matrix is not square and thus not invertible. We take extra care in the proofs to make sure that a proper sense of "factor inversion" is possible in this general setup. 2 It is well known that if all risky assets are mean-variance efficient then necessarily there is no mispricing in the economy. See an early discussion of this classical empirical asset pricing test in Black, Jensen, and Scholes (1972).
From the perspective of an agent who owns asset i, he is exposed to each of the k = 1, . . . , K factor returns through the factor loadings β_i = [β_{i,1}, . . . , β_{i,K}] of (1). However, all the other N risky assets will also have some exposure to the k-th factor. Suppose the agent constructs an "artificial asset" that loads into the return of the k-th factor, while shorting some combination c_{i,k} of the factor contributions to all the other N risky assets, R_{−i} − ε_{−i}. The resulting artificial asset has the returns Φ_{i,k}. Hence Φ_{i,k} is precisely the residual exposure of asset i to the k-th factor, after using the returns of all other assets to "hedge" as much as possible of this factor risk. The "hedging" nature of these artificial assets motivates us to call them "factor residuals" for asset i. The holder of asset i is exposed to K factor risks, and so he will have to construct K of these factor residuals. And since asset i has exposure to K factors, the agent will weigh the β_{i,k} loadings into the k-th factor residual return Φ_{i,k}, as in the β_i′ Φ_{i,1:K} term of (4). Let us rearrange (2) and take its expectation,

E[R_i − b_i′ R_{−i}] = a_i′ E[Φ_i],    (5)

where E[ε_{−i}] = 0_N and E[ε_i] = 0 because idiosyncratic risks are not priced. The term (5) is exactly the expected portfolio return into these K factor residuals. This is why we will call a_i′ E[Φ_i] the Statistical Arbitrage Risk Premium (SARP) for asset i. The key objective of this paper is to empirically study SARP. The next result shows that, under weak economic and technical conditions, SARP is non-zero. Corollary 1.3. The SARP of any non-redundant asset i is almost surely non-zero when some of the K factors are correlated. The following result, Corollary 1.4, relates the regression R-squared to Theorem 1.1. In the remainder of this paper, we will identify the regression R² with the Statistical Arbitrage Risk (SAR) of an asset i.
That is, we will say an asset i has a high (low) SAR if it has a low (high) regression R². Corollary 1.4(a) provides another way to view the K factor residuals of stock i from (3b). Condition (ii) is used to simplify the equations and condition (iii) is a mild technical assumption. For the sake of exposition, consider the case when σ²_ε is negligible, so that the magnitude of the term in Corollary 1.4(a) is driven by the positive- or negative-definiteness of the matrix Var(Φ_i) − Var(F). The case where this K × K matrix is negative-definite is when the volatility of the factor residuals of asset i is lower than the volatility of the factors themselves. This happens when the factor residuals of asset i do a good job of hedging asset i against its exposure to the K factor risks. Recall from (3b) that these K factor residuals depend on the factor structure of all the other N risky assets. This implies these K factor residuals can only do a good job in insuring asset i against factor risks if the N other risky assets also highly co-move with asset i itself, which implies a high regression R-squared. Given the desirability of these K factor residuals, the holder of asset i will be willing to pay a high price for them, which then pushes down their expected returns. This explains why in Corollary 1.4(b) there is a negative relationship between the regression R-squared and the SARP. The discussion for the case when that K × K matrix is positive-definite is analogous. The above discussions are still contingent upon the existence and positivity of such a SARP, which, again, is entirely an empirical question we now proceed to answer. We summarize the empirical implications of our theoretical discussions. Empirical Hypothesis. (1) The expected difference between a given asset's return and some linear combination of other assets' returns can be seen as a Statistical Arbitrage Risk Premium (SARP). Under weak economic and technical conditions, the SARP is non-zero.
Moreover, one does not need to know a priori the underlying factors that drive the economy to compute SARP. (2) We can identify the regression R² with Statistical Arbitrage Risk (SAR). We anticipate assets with low SAR (i.e. high R²) to have a low SARP, and assets with high SAR (i.e. low R²) to have a high SARP in the cross-section. The empirical methodology of the paper is separated into two distinct steps. In the first step, we project a given stock's return onto the span of all other stocks' returns to get an empirical approximation of b_i from (2). However, despite the microfoundations of Theorem 1.1, we shall argue it is econometrically non-trivial to execute this projection. We will apply a machine learning method to overcome a critical technical hurdle. In the second step, we will use standard portfolio sort methods from the empirical asset pricing literature to estimate the expectation a_i′ E[Φ_i] of (5), which is again the SARP of asset i. Our data source is standard. We use both the CRSP daily and monthly data from December 31, 1974 to December 31, 2020 with the standard filters. In all of the subsequent empirical analysis, we identify a stock by its PERMNO number as recorded in the CRSP database. 3 As is well known with the CRSP and Compustat databases, there are several unique identifiers of equities: PERMNO, PERMCO and GVKEY. Each has its own strengths and weaknesses. We recognize some - but few - firms have dual-class shares, which implies a single firm can have multiple PERMNOs. Rather than making arbitrary corrections to somehow "merge" the time series of returns of these dual-class shares, we simply leave the PERMNOs "as is" in CRSP. In all, this means our paper focuses on tracking an individual security rather than the individual firm - although they are synonymous with each other except for those few special cases. Moreover, we only
include stocks that, in the past twelve months for any given month end, have at least 60 days of valid trading returns. This 60-day choice ensures that we include effectively all stocks, except for the most extremely illiquid or dead stocks, so that our results are not driven only by the liquid and hence most likely large stocks. We also obtain the Fama-French data from Kenneth French's website.

Figure 1 (timeline of the projection procedure): For each month end t−1 = December 31, 1975, January 31, 1976, . . . , November 30, 2020, we use the past twelve months' worth of daily observations d_1, . . . , d_{D_{i,t−1}} to project each stock's returns onto the returns span of every other stock. We only consider stocks with at least D_{i,t−1} ≥ 60 days' worth of daily trading data over the past twelve months ending at month t−1. Note that the number of stocks N_{t−1} in the market at each month end t−1 may vary. The returns span for stock i consists of all other stocks j = 1, . . . , i−1, i+1, . . . , N_{t−1} whose trading days at least overlap with those of stock i, so {d_1, . . . , d_{D_{i,t−1}}} ⊆ {d_1, . . . , d_{D_{j,t−1}}}. The return of longing stock i and shorting its replicate is realized one month ahead at month t (e.g. month end t−1 = Dec 31, 1975 and t = Jan 31, 1976, with daily observations starting Dec 31, 1974).

Remark 2.1 (Filters and sampling period). We subset only for US common equities (i.e. SHRCD code of 10 or 11), and only those that are listed on the NYSE, AMEX or NASDAQ (i.e. EXCHCD code of 1, 2 or 3). In addition, we start our data sample from 1974 because the CRSP datasets only started to include NASDAQ stocks in December 1972. We start in December 1974 to allow an additional year of buffering for good measure. Effectively this means the numbers of stocks in CRSP pre-1974 and post-1974 are structurally different by several magnitudes; see (Bali et al., 2016, §7.1.2) for a detailed discussion.
Our estimation and projection method is clearly sensitive to the total number of stocks. Starting our analysis pre-1974 might bias our results simply due to CRSP data limitations. The first step in the empirical test of Theorem 1.1 is to project a given stock's returns onto the returns span of all other stocks. Our projection procedure is summarized in Figure 1. For months ending t−1 = December 31, 1975, January 31, 1976, . . . , November 30, 2020, we use the past twelve months' worth of daily observations to project each stock i's returns onto the returns of all other stocks. We reserve a one-month gap for returns realization in a procedure to be described in Section 2.3. Unless specified otherwise, we will denote t−1 as the end of the projection month, and t as the one-month-ahead returns realization date; that is, we set t = January 31, 1976, February 29, 1976, . . . , December 31, 2020. Figure 3(a) shows the number of stocks that have at least 60 past trading days for each given month t−1. Observe that since we start our projection procedure on December 31, 1975, we require daily data starting from December 31, 1974. Let N_{t−1} be the total number of stocks traded in the market at month t−1. The return vector of stock i at month t−1 and the returns span of all other assets are, respectively:

y_{i,t−1} := [R_{i,d_1}, . . . , R_{i,d_{D_{i,t−1}}}]′,    (6a)
X_{i,t−1} := the D_{i,t−1} × (N_{t−1} − 1) matrix whose columns are the corresponding daily returns of every other stock j in stock i's returns span,    (6b)

where R_{i,d} is the daily return of stock i, and D_{i,t−1} is the number of trading days of stock i in the past 12 months ending at month t−1. The dimensions of (6b) are approximately 250 × 5000 for each stock i. There are T = 539 months from December 31, 1975 to November 30, 2020. Thus we run a total of approximately T × 5000 ≈ 2.7 million projections in this paper. In this paper we will use the elastic-net estimator developed by Zou and Hastie (2005) to empirically project a given stock's return onto the returns span of all other stocks. We defer a detailed and technical discussion of the elastic-net in our context to Section A.1.
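To make the projection step concrete, the following is a minimal sketch using scikit-learn's ElasticNet on simulated returns. The penalty parameters alpha and l1_ratio are illustrative placeholders of this sketch, not the paper's tuned choices (those are discussed in Section A.1), and the simulated data merely stand in for (6a)-(6b):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
D, N = 250, 1000                 # ~250 trading days, a wide cross-section of peers

# Simulated stand-in for (6a)-(6b): daily returns of N other stocks, and a focal
# stock that truly co-moves with only its first five peers.
X = rng.normal(0.0, 0.02, (D, N))
y = X[:, :5] @ np.full(5, 0.2) + rng.normal(0.0, 0.01, D)

# Elastic-net projection with no intercept (cf. Remark 2.2). The penalty
# parameters below are illustrative, not the paper's tuned values.
enet = ElasticNet(alpha=1e-4, l1_ratio=0.5, fit_intercept=False, max_iter=50_000)
enet.fit(X, y)

beta_hat = enet.coef_                        # high-dimensional but sparse
n_peers = int((beta_hat != 0).sum())         # number of selected peer stocks
r2 = enet.score(X, y)                        # projection R^2: the stock's SAR measure
print(n_peers, round(r2, 3))
```

Unlike OLS, the fit is well defined even though D ≪ N, and the nonzero entries of beta_hat identify the stock's statistical peers while the projection R² measures its SAR.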
But if we are interested in estimating linear relationships as in Theorem 1.1, why do we not use the workhorse ordinary least squares (OLS) estimator? The design matrix (6b) of returns to evaluate our empirical hypothesis is necessarily a T × N matrix, where T ≈ 250 is the number of days, and N ≈ 5000 is the total number of traded stocks. This is a case where T ≪ N. This means the N × N matrix X′_{i,t−1} X_{i,t−1} has rank at most T < N and is therefore singular. The OLS estimator is thus necessarily not well defined. In contrast, the elastic-net is a machine learning method that explicitly allows for "wide" T ≪ N regressors. We denote the elastic-net projection coefficient vector of stock i at month t−1 as β̂_{i,t−1} ∈ R^{N_{i,t−1}}, and the resulting coefficient of determination ("R-squared") as R²_{i,t−1}. Recall the conventional definition of R²,

R²_{i,t−1} = 1 − ||y_{i,t−1} − ŷ_{i,t−1}||² / ||y_{i,t−1} − μ̂_{i,t−1} 1||²,

where ŷ_{i,t−1} := X_{i,t−1} β̂_{i,t−1} is the fitted value and μ̂_{i,t−1} := (1/D_{i,t−1}) Σ_d R_{i,d} is the sample mean. We emphasize we are only using the elastic-net as a projection method for constructing the replicates, and we do not use it for statistical inference. The statistical inference claims are on the expected returns of SARP, which we will discuss beginning in Section 2.3. As a result, despite the large number of projections that we run at this stage, we do not suffer from the multiple hypothesis testing problem that has been discussed in the recent empirical asset pricing literature by Harvey et al. (2016). Other than its use in calculating R², we do not use the fitted value ŷ_{i,t−1} in any other subsequent steps. This is in contrast to the recent literature on financial applications of machine learning (say Gu et al. (2020) and others), where the fitted value (or predicted value, when one uses X_{i,t} instead of X_{i,t−1}) is used to assess model forecasting accuracy. Remark 2.2 (Not projecting onto the intercept). We deliberately do not include an intercept as a regressor in (6b).
Including an intercept, together with the sparseness property of the elastic-net estimator, would attribute extremely high R² to stocks with stale prices. See Section A.1.2 for a more detailed discussion. Other than the projection method here and in Section 3.6.2, unless noted otherwise, all subsequent statistical inference tests that employ a linear regression will include an intercept. Remark 2.3 (Why the elastic-net and not other machine learning methods?). Out of a myriad of machine learning methods, why did we choose the elastic-net estimator to test our empirical implication? Simply put, we regard the elastic-net estimator as a parsimonious method that allows us to test our theoretical prediction. We defer to Section A.1 for detailed technical discussions of why we particularly use and prefer the elastic-net estimator in this paper. Remark 2.4 (Overlapping data). It is evident we are using overlapping time-series data. Overlapping returns data in traditional empirical asset pricing raises several technical inference issues revolving around autocorrelations (see Hansen and Hodrick (1980) and a recent discussion by Hedegaard and Hodrick (2016)). However, the "wide" regressors in our setting imply we actually gain ≈ 5000 new cross-sectional data points with each month's advance. Because of the sheer amount of newly entering cross-sectional data, it is not necessarily true that the projected coefficients ending at months t−2 and t−1 would be quantitatively similar, even though they differ by only a one-month step. For each month end t−1 and each stock i = 1, . . . , N_{t−1}, we collect the projected coefficient β̂_{i,t−1} and the R²_{i,t−1}. We track stock i's one-month-ahead return R_{i,t} from t−1 to t. For subsequent expositional clarity, we will sometimes call stock i the focal stock and write its return at month t as

R^{focal}_{i,t} := R_{i,t}.    (7)

Next, we introduce the replicate (portfolio) of a stock i. This is a key idea of this paper. Let R_{−i,t} := [R_{1,t}, . . . , R_{i−1,t}, R_{i+1,t}, . . .
, R_{N_{t−1},t}]′ be the vector of month t returns for all stocks except stock i. We wish to treat the projected coefficients β̂_{i,t−1} as investment weights into each of the N_{t−1} − 1 stocks. However, the regularization nature of the elastic-net causes the entries of these estimated coefficients to be small in magnitude. So if we were to directly use β̂_{i,t−1} as investment weights, the result would be a very small allocation into the risky component of the portfolio. To have a more reasonable risky component, we normalize β̂_{i,t−1} by its L₁ norm and write

β̃_{i,t−1} := β̂_{i,t−1} / ||β̂_{i,t−1}||₁.    (8)

Since investment weights must sum to one, we place the remainder of the weights, 1 − 1′β̃_{i,t−1}, into the risk-free asset with returns r_{f,t}; here 1 is a vector of ones of conformable dimensions. In all, the month t return of stock i's replicate is

R̃_{i,t} := β̃′_{i,t−1} R_{−i,t} + (1 − 1′β̃_{i,t−1}) r_{f,t}.    (9)

Finally, we track the long-short return of stock i against its replicate,

R_{LS,i,t} := R_{i,t} − R̃_{i,t}.    (10)

This long-short return (10) will proxy for the difference R_i − b_i′ R_{−i} in Theorem 1.1, and is the key emphasis of study in this paper. We will use conventional portfolio sort methods of the empirical asset pricing literature to estimate its expectation, which by (5) equals the SARP of stock i.

Figure 2 (average equity-only proportion of the replicate portfolios): We consider the average amount of equity-only components that are invested by each replicate portfolio. Stocks are sorted into their method-m R² decile bins k = 'Lo', . . . , 'Hi'. At the end of the estimation month t−1, we have the normalized vector β̃_{i,t−1} of asset i. The quantity 1′β̃_{i,t−1} is the proportion allocated to the equity-only components of stock i's replicate, as in (9). The average equity-only proportion across all the replicates in bin k at month t−1 is the quantity (1/|B^k_{t−1}|) Σ_{i∈B^k_{t−1}} 1′β̃_{i,t−1}. We plot the time series of this average equity-only proportion for the bins k = 'Lo', 5 and 'Hi', with the y-axis expressed in decimals (e.g. 0.10 means 10%); the plot labeled "Overall average" shows the overall average of these equity-only components across all ten bins.

Remark 2.5 (Projection method for constructing mimicking portfolios). Breeden et al. (1989) and Lamont (2001) use methods analogous to (9) to construct a mimicking factor out of tradable base assets. Ang et al. (2006) also use an analogous method to construct a factor mimicking aggregate volatility risk. While our approach in (9) seems identical to these previous methods, we stress the dimensionality of the regressors is substantially different. The number of base assets in those aforementioned methods is small, usually in the range of five to ten. In contrast, our base assets are effectively every stock other than stock i itself, which number in the thousands. Wurgler and Zhuravskaya (2002) use an idea that is conceptually similar - but operationally quite different - to this paper in evaluating arbitrage risk. Wurgler and Zhuravskaya (2002) first use a sorting method to match a stock i with the three other stocks that are closest to it on size and book-to-market, and then run a linear regression of the returns of stock i onto these three stocks. Similar to us, they use the regression coefficients to construct a replicate of stock i. However, the selection procedure of Wurgler and Zhuravskaya (2002) by portfolio sort is not entirely statistical in nature. Indeed, a priori there is no reason why the time-series statistical properties of any given stock i would be best matched by the top three stocks matched on size and value, even if size and value are well-known priced factors. Moreover, the focus of Wurgler and Zhuravskaya (2002) is to study "demand curves" for stocks, and to identify this effect the authors constrain their study to 259 stock additions to the S&P 500 index over a 13-year period.
In contrast, the focus of our study is SARP, and we consider effectively all stocks (so not just those in the S&P 500) at any given point in time and over a 45-year period. Having defined three types of returns (7), (9) and (10) associated with stock i, we now proceed to construct portfolios. We use our theoretical discussions as guidance: Corollary 1.4 anticipates we should find a negative relationship between the regression R-squared (the SAR of stock i) and the SARP of stock i. At the end of month t−1, we sort each stock i by its R²_{i,t−1} into decile bins. Let B^k_{t−1} be the set of stocks in the k-th bin at month t−1. We organize the bins in ascending order, so bin k = 1 (labeled "Lo") consists of stocks with the lowest R²'s, and bin k = 10 (labeled "Hi") consists of stocks with the highest R²'s. We focus on equal-weighted portfolios in this paper (see Remark 2.6 for a discussion of the issues in evaluating value-weighted SARP). The equal-weighted excess return of bin k = Lo, 2, . . . , Hi of the focal stocks is standard:

R̄^k_t := (1/|B^k_{t−1}|) Σ_{i∈B^k_{t−1}} (R_{i,t} − r_{f,t}),    (11)

where we denote |B| as the cardinality of a set B. In this paper, we will view and assume the replicate of stock i to have the same equal weight as its focal counterpart,

R̄^{k,replicate}_t := (1/|B^k_{t−1}|) Σ_{i∈B^k_{t−1}} (R̃_{i,t} − r_{f,t}).    (12)

Finally, the construction of the equal-weighted portfolio of the long-short of the focal stocks versus their replicates is now immediate thanks to the aforementioned equivalent weighting assumption. We define the equal-weighted portfolio return of the long-focal, short-replicate position as

R̄^k_{LS,t} := (1/|B^k_{t−1}|) Σ_{i∈B^k_{t−1}} (R_{i,t} − R̃_{i,t}).    (13)

Note (13) does not subtract a risk-free rate as it is the difference of two excess returns. Remark 2.6 (Ambiguity in value-weighted SARP). Value-weighting raises several causes for concern in evaluating SARP. Value-weights for the focal stocks in each bin k are conventional. The cause for caution is what one should assume for the value-weights of the replicates. Perhaps the most natural way is to just assume the replicate of stock i has the same value-weight as stock i itself.
While this value-weight assumption for the replicates might seem natural, it is also ambiguous. For stocks with high SAR there is no contention: for these high SAR stocks, their replicates are effectively just the risk-free asset, and so using value-weights in this way would just produce the equity risk premium. But for stocks with low SAR, there will be many similar peer stocks in their replicates. It is possible that a focal stock has large size, but the elastic-net statistically matches it with small peer stocks, and vice-versa. Thus the total value of the replicate portfolio of stock i could be substantially different from that of the focal stock i itself. Assigning the replicate of stock i the same value-weight as the focal stock i in bin k may or may not reflect the actual size of the replicate; the same concern also holds in converse. This ambiguity in what "value-weighting" means for the replicates is what drives our paper to focus on the most parsimonious weighting scheme: equal weights. We can now translate our theoretical predictions of Section 1 to concrete quantities for an empirical investigation: (i) by Corollary 1.3, SARP should be non-zero (and indeed positive); that is, we should expect the time-series average of R^k_{LS,t} of (10) to be non-zero in the data; and (ii) more importantly, by Corollary 1.4, SARP should be increasing in SAR in the cross-section. Let's first investigate the average characteristics of stocks that are decile sorted by their elastic-net R². Table 1 shows the time-series mean, standard deviation, 5th and 95th percentiles of these characteristics Z^k_{t−1} for each bin k. Result (i) of Table 1 shows the elastic-net R² characteristic Z_{i,t−1} = R²_{i,t−1}. The elastic-net is capturing a wide range of projection R²'s, from effectively 0% in the lowest bin to 62% in the highest bin. 
As a matter of comparison and for subsequent robustness discussions in Section 3.6.2, at the end of month t − 1 we take each stock i's past twelve months of daily returns and project them onto the daily returns of the Fama and French (2015) five factors using OLS. We collect the resulting FF5 OLS R², and this is also a characteristic for each stock i for month t − 1. Result (ii) shows the summary statistics of the FF5 OLS R² characteristic for stocks sorted by elastic-net R². It is quite interesting to observe that stocks with high SAR (low elastic-net R²) also have a low corresponding FF5 OLS R², and likewise stocks with low SAR (high elastic-net R²) have a high FF5 OLS R². Results (iii) and (iv) examine the quality of our replicate construction procedure. Recall (9). The characteristic Z_{i,t−1} = number of non-zero entries in β̂_{i,t−1} is the number of risky peers used to construct the replicate of stock i, and (iii) shows this result. Note the sparseness of the elastic-net projection vector: while N_{t−1} ranges in the thousands, the number of non-zero entries in the projection vector is remarkably low. Stocks with the highest SAR are essentially so "unique" that their replicates are just the risk-free asset. In other words, the next best alternative to investing in a truly unique stock is simply the risk-free asset. In contrast, for stocks with the lowest SAR, one can statistically identify about 82.58 risky peers on average. So even if one forgoes investing in a "ubiquitous" stock, one can construct its replicate out of (on average) 82.58 peer stocks that have similar statistical properties. Next, the characteristic Z_{i,t−1} = 1'β̂_{i,t−1} is the proportion of equity that is used to construct the replicate of stock i. 
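The two replicate characteristics in results (iii) and (iv) can be read directly off the projection vector; a minimal sketch (the function name and input are hypothetical):

```python
import numpy as np

def replicate_summary(beta_hat):
    """Summary characteristics of a replicate built from an elastic-net
    projection vector beta_hat (one entry per peer stock).

    Returns the number of risky peers (non-zero entries) and the proportion
    of the replicate allocated to equity, 1'beta_hat; the remainder
    1 - 1'beta_hat is implicitly held in the risk-free asset.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    n_peers = int(np.count_nonzero(beta_hat))  # characteristic (iii)
    equity_share = float(beta_hat.sum())       # characteristic (iv)
    return n_peers, equity_share
```

For the highest SAR stocks both numbers are near zero, consistent with their replicates being essentially the risk-free asset.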
Result (iv) shows we allocate only 4% into the equity components (so 96% into the risk-free asset) of the replicates for stocks with the highest SAR, but we allocate on average 60% into the equity components of the replicates for stocks with the lowest SAR. At least with our elastic-net projection procedure, we basically never achieve a 100% allocation into equity for the replicate of any stock, not even for stocks with the lowest SAR; a 100% allocation could happen only if there were a stock l physically different from stock i but otherwise statistically equivalent to it. That is to say, all stocks are effectively unique, but there is nonetheless a wide range of heterogeneity. Results (v) to (vii) show the market capitalization, book-to-market and dollar volume liquidity characteristics. From result (v), low SAR stocks tend to be smaller stocks while high SAR stocks tend to be bigger stocks. Nonetheless, we should keep in mind the magnitudes and also within decile bin heterogeneity. (Footnote: adding regressors to an OLS projection mechanically raises R². In our elastic-net application, the number of regressors N_{t−1} remains fixed, and it is an estimation outcome that only a few of them have non-zero coefficient loadings.) For low SAR stocks the average market capitalization is about $1.2 billion with a standard deviation of $9.3 billion, and for the high SAR stocks the average is $6.1 billion with a standard deviation of $26.2 billion. In all, while the high SAR bin holds on average "mega-sized" stocks, the stocks in the low SAR bin are not exclusively microcap stocks either. Hence our subsequent cross-sectional results are unlikely to be driven by a size effect. Result (vi) shows an interesting value tilt for stocks with high SAR and a growth tilt for stocks with low SAR. Finally, result (vii) shows the dollar volume liquidity (VOLD) characteristic. 
In particular, VOLD_{i,t−1} is defined as the product of the trading volume of stock i on the last day of month t − 1 and the closing price of stock i on that day, all divided by 1,000,000. We see that on average, high SAR stocks have lower liquidity than low SAR stocks. On the surface of results (v) to (vii), it appears size, book-to-market and liquidity are related to a stock's SAR. We will explicitly control for these three characteristics, among others, in Section 3.4 to ensure our results are not driven by the well-known pricing qualities of these characteristics in the cross-section. Pursuant to Table 1 (i), we directly plot the time series of the cross-sectionally averaged R² characteristic in Figure 3 (b). Again, as motivated by our theoretical discussions, we regard assets with low R² as having high SAR, and assets with high R² as having low SAR. As a consequence, the plots in Figure 3 (b) can thus be regarded as an "average" SAR of the economy at any given point in time. By inspection, it appears the average SAR of the economy is countercyclical and spikes considerably during times of financial distress (e.g. the 1980-1991 US savings and loan crisis, the 1998 Russian financial crisis, the 2008-2009 Great Recession, the 2011-2012 Greek government debt crisis, the March 2020 COVID-19 crash, and others). These plots strongly suggest SAR (and the resulting SARP of assets) is not simply a statistical construct but is also capturing systematic shocks. We formally test the association of SAR with macroeconomic variables in Figure 3 (c). The regression results corroborate the visual inspection of Figure 3 (b): shocks to the average SAR are countercyclical, in that they are negatively associated with shocks to both personal consumption expenditure and consumer sentiment. Table 1: Summary characteristics of elastic-net R² sorted stocks. We project each stock's past twelve months' daily returns onto every other stock using the elastic-net estimator. 
Stocks are sorted into deciles based on their elastic-net R², from the lowest (quantile 1, labelled "Lo") to highest (quantile 10, labelled "Hi"). At the end of each month, we first compute the simple average of the stocks' characteristics within each bin. We then take the time-average of these averaged characteristics for each bin, and the displayed figure shows this cross-sectional and time-series average for each bin. Brackets show the standard deviation and parentheses show the 5th and 95th percentiles. How good are the replicates at mimicking the return behavior of the focals? While we will discuss at length the first moments of the focals and the replicates starting from Section 3.2, let's first investigate their second cross-moments. For bins k₁, k₂ ∈ {Lo, 2, . . . , Hi} and return types m₁, m₂ ∈ {Foc, Rep} (recall (11) and (12)), we track the time series R^{k₁}_{m₁,t} and R^{k₂}_{m₂,t}. The numbers in Table 2 report the sample correlation between the time series R^{k₁}_{m₁,t} and R^{k₂}_{m₂,t}. The diagonal entries are most important: stocks with low (high) R²'s have low (high) correlations with their replicates. Specifically, the correlation between the returns of the stocks with the highest SAR (so the lowest R²) and their replicates is only 17.8%. In contrast, this correlation rises substantially along the diagonal of Table 2, to 91.9% for the stocks with the lowest SAR (so the highest R²). In all, for stocks with low SAR, the elastic-net replicate portfolio procedure (which is entirely based on past returns) does a fairly good job of matching the stocks' one-month ahead return second moments. However, and expectedly, the replicates do a poor job of mimicking stocks with high SAR. Table 2: Pairwise correlations of one-month ahead returns between stocks and their replicates. 
The (k₁, k₂)-th entry above represents the correlation of the time series of one-month ahead returns between the k₁-th elastic-net R² sorted bin of the stocks and the k₂-th bin of their replicates. The entries are expressed in decimals (e.g. 0.10 means 10%). We form equal-weighted decile portfolios every month by projecting each stock's daily returns over the past year onto every other stock using the elastic-net estimator. Stocks are sorted into deciles based on their elastic-net R² from the lowest (quantile 1, labelled "Lo") to highest (quantile 10, labelled "Hi"). The rows labelled "Foc" report the portfolios of the focal stocks (7). The columns labelled "Rep" report the replicate returns of the focal stocks that are constructed out of the estimated elastic-net beta coefficients according to (9). The sampling period is from January 31, 1976 to December 31, 2020. Figure (a) shows the time series of the total number of traded stocks N_{t−1} in the US at any given point in time. The elastic-net R² for each stock i at month t − 1 is denoted as R²_{i,t−1}. The top panel of Figure (b) plots the time series of the averaged R² across all stocks, R̄²_t, and the bottom panel plots the difference ΔR̄²_t := R̄²_t − R̄²_{t−1}. Table (c) regresses the time series ΔR̄²_t onto the following regressors' differences and their lags: industrial production, personal consumption expenditure, unemployment rate, University of Michigan consumer sentiment, and the market factor. All macro variables data are from the FRED Economic Data of the St. Louis Federal Reserve. Since UMCSENT is only readily available from January 1978, the regressions' sampling period is monthly from January 31, 1978 to December 31, 2020. The univariate portfolio sorts of Table 3 show the main empirical result of this paper. Let's first discuss the returns of the focal stocks in the first column of Table 3. 
Stocks with the lowest elastic-net R² earn an excess return of 1.481% (t-stat 5.298) per month, while stocks with the highest elastic-net R² earn an excess return of 0.771% (t-stat 2.882). The difference in returns between stocks with the lowest and highest elastic-net R² is 0.710% (t-stat 4.386). In other words, high SAR stocks have a substantially greater return than low SAR stocks. We form equal-weighted decile portfolios every month by projecting each stock's past twelve months' daily returns onto every other stock using the elastic-net estimator. Stocks are sorted into deciles based on their elastic-net R² from the lowest (quantile 1, labelled "Lo") to highest (quantile 10, labelled "Hi"). The row labelled "Lo - Hi" is the monthly return difference between the "Lo" bin and the "Hi" bin. The row labelled "Avg" is the simple average of the monthly excess returns across the ten k = 'Lo', 2, . . . , 'Hi' bins. The column labelled "Focal" reports the one-month ahead portfolio excess returns (7). The column labelled "Replicate" reports the one-month ahead excess returns of portfolios that are constructed out of the estimated normalized elastic-net coefficients according to (9). The column labelled "Foc - Rep" reports the one-month ahead returns of the portfolio with a long position in the focal stocks and a short position in the corresponding replicates. We define the Statistical Arbitrage Risk Premium (SARP) as the returns from "Foc - Rep". In addition, a stock with low EN R² is said to have high Statistical Arbitrage Risk (SAR), while a stock with high EN R² is said to have low SAR. The "mean" column is reported in monthly percentage terms (e.g. 1.0 means 1%). Robust Newey and West (1987) t-statistics with six lags are reported in the column "t" in parentheses. The sample period is monthly from January 31, 1976 to December 31, 2020. 
By conventional wisdom in the empirical asset pricing literature, these results already strongly hint that R² is a potential priced factor in the cross-section. However, guided by our theoretical discussions of Section 1, we are not just interested in testing the cross-sectional difference in the focal stocks (but see later in Section 3.5 for a factor discussion). Rather, we are interested in empirically testing for the cross-sectional presence of SARP. Let's consider the empirical results of the replicates in the second column of Table 3. The excess returns of the replicates are almost monotonically increasing, from 0.115% (t-stat 2.109) per month in the lowest R² bin to 0.503% (t-stat 2.070) in the highest R² bin, even though their difference of -0.387% (t-stat -1.662) is only weakly statistically significant. The excess returns of the replicates in some middle bins (e.g. bins 3 to 6) are statistically insignificant from zero. Other than these four middle bins, it appears the projection construction procedure does construct a replicate asset that has reasonable mean returns. Finally, we come to the main highlight of our paper: the third column of Table 3, which reports SARP. In the cross-section, high SAR stocks earn a monthly SARP that is 1.101% greater than that of low SAR stocks. We also document the presence of an "unconditional SARP". The "Avg" portfolio in Table 3 takes the simple average of the returns of all ten k = 'Lo', 2, . . . , 'Hi' bins. We find that "Avg" enjoys a monthly SARP of 0.499% (t-stat 3.391). In all, these empirical results show strong evidence in support of our core theoretical predictions: (i) an unconditional SARP exists and is positive for the average representative asset; and (ii) SARP is increasing with SAR: stocks with low (high) SAR earn a low (high) SARP in the cross-section. We present two sets of evidence to show our main result from Table 3 (SARP is increasing in SAR) is robust after controlling for risk factors and other characteristics. 
We show our results still persist even after adjusting for the Fama-French three factors (Fama and French (1992, 1993)) and the Fama-French five factors (Fama and French (2015)). For each bin k and each return r^k_t ∈ {R^k_{Foc,t}, R^k_{Rep,t}, R^k_{LS,t}}, we consider the following two time series factor model regressions: r^k_t = α^k + β^k MktRF_t + s^k SMB_t + h^k HML_t + ε^k_t (14a), and r^k_t = α^k + β^k MktRF_t + s^k SMB_t + h^k HML_t + c^k CMA_t + w^k RMW_t + ε^k_t (14b). The regressors are well-known: "MktRF" is the market factor, "SMB" is the size factor, "HML" is the value factor, "CMA" is the investment factor, and "RMW" is the profitability factor. We show our main result remains robust after we control for various stock characteristics. At the end of each month, stocks are first sorted by a characteristic into quintile portfolios. Then, for a given characteristic and within each of its five portfolios, we further sort stocks based on the elastic-net R² into decile bins. Each of these ten elastic-net R² bins is then averaged over its respective five characteristic portfolios. In all, the ten resulting elastic-net R² bins represent returns that control for a particular characteristic. All portfolios are equal-weighted. Table 4: Fama-French 3 and 5 factor regressions on portfolios sorted by elastic-net R². We run the Fama-French three (14a) and five (14b) factor time series regressions onto the elastic-net R² sorted decile portfolios. Stocks are sorted into deciles based on their elastic-net R² from the lowest (quantile 1, labelled "Lo") to highest (quantile 10, labelled "Hi"). The column labelled "Lo - Hi" is the monthly return difference between the "Lo" bin and the "Hi" bin. The row labelled "Avg" is the simple average of the monthly excess returns across the ten k = 'Lo', 2, . . . , 'Hi' bins. The group labelled "Focal" reports the one-month ahead portfolio excess returns (7). The group labelled "Replicate" reports the one-month ahead excess returns of portfolios that are constructed out of the estimated normalized elastic-net beta coefficients according to (9). 
The group labelled "Foc - Rep" reports the one-month ahead returns of the portfolio with a long position in the focal stocks and a short position in the corresponding replicates. We define the Statistical Arbitrage Risk Premium (SARP) as the returns from "Foc - Rep". In addition, a stock with low EN R² is said to have high Statistical Arbitrage Risk (SAR), while a stock with high EN R² is said to have low SAR. The estimated values are reported in monthly percentage points (e.g. 1.0 means 1%), and we report the Newey and West (1987) robust t-statistics. Illiq is Amihud's illiquidity, Illiq_{i,t−1} = (1/D_{i,t−1}) Σ_d |R_{i,d}| / VOLD_{i,d}, where D_{i,t−1} is the total number of trading days of stock i in the past twelve months leading up to month t − 1 and VOLD is as defined in Section 3.1.1. Mom is momentum; we define the momentum of stock i at the end of month t − 1 (i.e. the end of the estimation period) as the return of the stock during the 11-month period covering months t − 12 through t − 2. STR is Jegadeesh (1990) and Lehmann (1990)'s short-term reversal; it is defined as STR_{i,t−1} = 100 × R_{i,t−1}. Table 5 shows the results. We omit the results on the replicates for brevity. Overall, these results that control for the aforementioned characteristics are consistent with the unconditional results of Table 3. We pay special attention to our results after controlling for size, value, VOLD, idiosyncratic volatility and total volatility. In particular, even though Table 1 shows a size, value and VOLD tilt for the elastic-net R² sorted stocks, the results here show that SARP is still persistent after we control for these characteristics. The main empirical objective of this paper is to show that assets' SARP increases with SAR. However, as hinted in the main results of Section 3.2, conventional empirical asset pricing results would suggest that SAR is a priced factor. We can show the following corollary factor result. Let us define the Statistical Arbitrage Risk factor (SAR factor) with returns at month t as R_{SAR,t} := R^{Lo}_{LS,t} − R^{Hi}_{LS,t} (15), where recall the definition of R^k_{LS,t} in (13). 
In other words, the SAR factor is simply the difference in the SARP of the high SAR and low SAR stocks. Let's investigate the price of risk λ_SAR of our SAR factor via the classical Fama and MacBeth (1973) regressions. Table 5: Newey and West (1987) robust t-statistics with 6 lags in parentheses. We perform a bivariate dependent sort. Each month, we first sort stocks based on a characteristic (i.e. size, book-to-market, idiosyncratic volatility, total volatility, Amihud's illiquidity, momentum, short-term reversal, dollar volume liquidity, idiosyncratic skewness, and total skewness; see Section 3.4 for details and references on these characteristics) into quintile portfolios. For a given characteristic and within each of its five portfolios, we sort stocks based on elastic-net R² into decile bins, from the lowest (quantile 1, labelled "Lo") to highest (quantile 10, labelled "Hi"). The column labelled "Lo - Hi" is the monthly return difference between the "Lo" bin and the "Hi" bin. These ten elastic-net R² bins are then averaged over each of the five characteristic portfolios. Thus these ten elastic-net R² bins represent returns that control for a particular characteristic. All portfolios are equal-weighted. The sample period is monthly from January 31, 1976 to December 31, 2020. We run the Fama and MacBeth (1973) regressions with four different models: (i) SAR factor + CAPM; (ii) SAR factor + FF3 + momentum; (iii) SAR factor + FF5 + momentum; and (iv) FF5 + momentum as a benchmark. The data of the momentum factor is available from Kenneth French's website. Table 6 (a) shows the correlations of our SAR factor with the other asset pricing factors. Next is the choice of test assets. We first evaluate the price of risk of our SAR factor on the classical 25 portfolios formed on size and book-to-market (5 × 5). 
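The Fama and MacBeth (1973) two-pass estimation can be sketched in a few lines. This is a simplified illustration (plain OLS in both passes and classical Fama-MacBeth standard errors rather than the Newey-West adjustment we report, with hypothetical array inputs):

```python
import numpy as np

def fama_macbeth(test_asset_returns, factor_returns):
    """Two-pass Fama-MacBeth (1973) estimate of factor prices of risk.

    test_asset_returns : (T, N) array of test-asset excess returns
    factor_returns     : (T, K) array of factor returns

    Pass 1: time-series OLS of each asset on the factors -> betas (N, K).
    Pass 2: each month, cross-sectional OLS of returns on betas -> lambda_t;
            the price of risk is the time-series average of the lambda_t's.
    """
    R = np.asarray(test_asset_returns, dtype=float)
    F = np.asarray(factor_returns, dtype=float)
    T, N = R.shape
    X = np.column_stack([np.ones(T), F])                  # intercept + factors
    betas = np.linalg.lstsq(X, R, rcond=None)[0][1:].T    # (N, K) factor loadings
    Z = np.column_stack([np.ones(N), betas])              # cross-sectional regressors
    lambdas = np.array([np.linalg.lstsq(Z, R[t], rcond=None)[0] for t in range(T)])
    lam = lambdas.mean(axis=0)                            # [zero-beta rate, prices of risk]
    se = lambdas.std(axis=0, ddof=1) / np.sqrt(T)         # Fama-MacBeth standard errors
    return lam, lam / se
```

On simulated data with a single factor of mean 1% per month, the second entry of the estimated price-of-risk vector recovers roughly that 1%.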
In addition, given that our projection and replicate construction procedure of Section 2 is explicitly dependent on past returns, it is prudent to evaluate our SAR factor on test assets that are also sorted along a past-return dependent characteristic. To this end, we also consider the 25 portfolios formed on size and momentum (5 × 5) and the 25 portfolios formed on size and residual variance (5 × 5). Finally, we also evaluate our SAR factor on portfolios that are further sorted along corporate fundamentals, and consider the 25 portfolios formed on book-to-market and investment (5 × 5), and the 36 portfolios formed on size, operating profitability and investment (2 × 4 × 4). All test portfolios are equal-weighted and are obtained from French's website. Table 6 shows the main result of this section, and the focus is the value of λ_SAR. The SAR factor has a positive price of risk, is statistically significant, and is robust across the various model specifications and test assets. Focusing on the SAR factor + FF5 + momentum model, we see the price of risk estimates for the SAR factor range from 0.990% to 1.631% per month, depending on the test assets. The SAR factor is clearly a tradable portfolio. Figure 7 (a) shows the cumulative returns (not inflation adjusted) from December 31, 1975 to December 31, 2020 of an initial $100 investment in our SAR factor and other factors, and Figure 7 (b) plots the log cumulative returns. The strategy "(Lo - Hi) Foc" is the cumulative return from R^Lo_{Foc,t} − R^Hi_{Foc,t}; the strategy "(Lo - Hi) Rep" is that of R^Lo_{Rep,t} − R^Hi_{Rep,t}; and "SAR factor" is that of (15). Following the "(Lo - Hi) Foc" strategy results in a $3,177.63 cumulative return on December 31, 2020; "(Lo - Hi) Rep" results in $5.70; and the SAR factor results in $18,219.66. The amazing performance of the SAR factor is firstly because the "(Lo - Hi) Foc" strategy itself already has a high return. Table 6: SAR factor price of risk. 
This table shows the market price of risk λ_SAR in monthly percentage points (i.e. 1.0 means 1%) of the SAR factor and various other factors, where the parentheses show the Fama and MacBeth (1973) t-statistics. The last column shows the χ² statistic and the brackets show its p-value. We define our Statistical Arbitrage Risk factor (SAR factor) in (15); the factor return R_{SAR,t} is the result of a long position in the Lo elastic-net R² bin with returns R^Lo_{LS,t} and a short position in the Hi elastic-net R² bin with returns R^Hi_{LS,t}. We focus on five sets of equal-weighted test assets: the 25 portfolios formed by the 5 size and 5 book-to-market bins; the 25 portfolios formed by the 5 size and 5 momentum bins; the 25 portfolios formed by size and Ang et al. (2006) residual variance bins; the 25 portfolios formed by 5 book-to-market and 5 investment bins; and the 36 portfolios formed by 2 size, 4 operating profitability and 4 investment bins. We use the Fama and MacBeth (1973) two-pass procedure by first running a monthly time series regression of the test assets onto the proposed factor models to obtain the betas, and then running a cross-sectional regression to obtain the prices of risk. The sampling period is monthly from January 31, 1976 to December 31, 2020. More importantly, the short positions in the replicates act as leverage, and the replicates also have high correlations with the focal stocks. As a matter of comparison, investing in the risk-free asset would result in a cumulative return of $687.29; the market factor returns $2,505.58; the SMB factor $264.59; the HML factor $220.58; the CMA factor $318.83; the RMW factor $467.17; and the MOM factor $1,547.36. See Figure 4 for a plot of the monthly returns of the SAR factor. See Figure 5 for a plot of the rolling cumulative returns of the SAR factor. Table 7: Cumulative returns and various descriptive statistics of the SAR factor and other assets. The column "Cum. 
return" of Table (a) shows the cumulative returns from an initial $100 investment on December 31, 1975 to December 31, 2020 (the final amount is not inflation adjusted). The column "Ann. Sharpe ratio" shows the average monthly excess return of each strategy divided by its standard deviation and then multiplied by √12. The third to sixth columns of (a) show the first four moments of the monthly excess return time series; the mean and standard deviation columns are expressed in percentage points (e.g. 1.0 means 1%). The seventh column shows the 5th and 95th percentiles of the monthly returns in percentage points. The column "ARMA(1, ·) coef" shows the estimated coefficient a, while the column "ARMA(·, 1) coef" shows the estimated coefficient b, of the ARMA(1,1) model r_t = c + a r_{t−1} + b ε_{t−1} + ε_t, and the parentheses show the associated t-statistics. Figure (b) shows the log cumulative returns from said initial investment across various investment strategies. The red shaded regions are NBER recession dates. Panel (c) shows correlations. Figure 5: Rolling window cumulative returns of the SAR factor. We plot the cumulative returns over rolling n = 12, 36 and 60 months of an initial $1 investment in the SAR factor, the "(Lo - Hi) Foc" strategy, and the market factor for comparison (e.g. 1.5 means a total return of 150% over the past n months). The red shaded regions are NBER recession dates. The sampling period is monthly from January 31, 1976 to December 31, 2020. We run several robustness checks on our main result; they are summarized in Table 8. 3.6.1 SARP is not driven by the equity risk premium. Is SARP just the equity risk premium in disguise? From Table 1 (iv) and the replicate construction procedure of Section 2.2, it is evident that stocks with high SAR have replicates that are essentially just the risk-free asset. The SARP for high SAR stocks is simply the equity premium. 
However, the SARP for low SAR stocks is distinctly different from their equity premia; there is a plethora of risky peers in the replicates of low SAR stocks. Nonetheless, as constructed, there is a potential concern that the main empirical message of this paper, that stocks with high (low) SAR have high (low) SARP, is driven by the equity premium of high SAR stocks. To rule out this concern, we follow the projection and replicate construction procedures of Section 2, but then drop all stocks whose replicates consist only of the risk-free asset (that is, the replicates must now contain at least one risky peer). Result (i) of Table 8 repeats the univariate sort of Table 3 on this subset. We see high SAR stocks still have a monthly return that is 0.658% (t-stat 3.901) higher than that of low SAR stocks. More importantly, we see the SARP of high SAR stocks (which is now distinctly different from the equity risk premium) is still substantially higher, at 0.815% (t-stat 3.147) per month, than the SARP of low SAR stocks. Hence the cross-sectional difference in SARP is not driven by the equity risk premium of high SAR stocks. In results (ii) to (vii), we repeat the bivariate sort procedure of Table 5 and sort various characteristics into quintiles. Our main result remains robust after subsetting stocks to have at least one risky peer in their replicates, and after controlling for these characteristics. Is there any value at all in running the computationally expensive elastic-net projection procedure of Section 2.1.2? Can we get analogous results using simple OLS projections with much fewer regressors? Instead of projecting each stock i onto the past twelve months' daily returns of every other stock using the elastic-net, let's simply project each stock i's past twelve months' daily returns onto a handful of conventional factors using OLS. Table 8 reports Newey and West (1987) robust t-statistics with six lags. 
Result (i) repeats the procedure of the main result in Table 3, except we restrict each stock to have a replicate that contains at least one risky asset, so the subsetted stocks' replicates cannot simply be the risk-free asset. Results (ii) to (iv) repeat the bivariate dependent sort procedure of Table 5 but enforce the aforementioned minimum-one-risky-asset requirement on the stocks. Results (viii) and (x) are analogous to the main result of Table 3. We can see a similar result in the FF5 projections in (iv). However, upon closer inspection, the FF projection procedures are not robust to size effects, while the elastic-net counterpart of Section 2.1.2 is. In Table 8 (ix), we repeat the bivariate sort procedure of Table 5, controlling for size. In (xi), we see no cross-sectional SARP in stocks that are constructed out of the aforementioned FF5 projection and sorting procedure, after controlling for size. Overall, these results show SARP can still be found using conventional asset pricing factor models and the standard OLS procedure. However, the "quality" of this SARP is rather low compared to the SARP that is identified and constructed via our elastic-net procedure. This paper theoretically and empirically shows the relationship between the Statistical Arbitrage Risk (SAR) and the Statistical Arbitrage Risk Premium (SARP) of a stock. Theoretically, SARP is the expected return of the residual factor risks of a given stock. Empirically, we use the elastic-net, a machine learning method, to project a given stock's returns onto the span of every other stock in the market. The projection R² determines SAR (low R² means high SAR), and the normalized projection coefficient entries serve as investment weights in constructing the replicate portfolio of a given stock. The core message of this paper is: SARP is increasing in SAR. We see several interesting directions to further study SAR and SARP by machine learning (ML). 
In this paper, the elastic-net serves two simultaneous roles and steps: (1) fitting and variable selection, and (2) portfolio construction. In Step (1), the elastic-net R² is used to measure the SAR of stock i, and the non-zero entries of the projection vector are used to identify the risky peers of stock i. In Step (2), the normalized values of the non-zero entries are used as investment weights to construct the replicate of stock i. A further study of SAR and SARP could disentangle these two roles by using two different ML methods. Step (1) can benefit from a more sophisticated method ML1 that takes in "wide regressors" as inputs but has better sparse variable selection and goodness-of-fit R² (and hence SAR) properties than the elastic-net. For Step (2), we see benefits in designing an ML2 replicate portfolio construction method that also takes into account the relative importance of the peers from Step (1). Once these two improved steps are complete, the estimation of SARP and other analyses can follow the conventional empirical asset pricing procedures outlined in this paper. Tangential to using ML methods for forecasting and factor selection in finance, we feel a novel avenue for ML applications in finance is this sense of portfolio construction. In closing, we strongly believe this paper has merely scratched the surface in the study of SAR and SARP by machine learning. A.1 Discussions on the projection procedure. Here we outline the details of our projection procedure. Let's introduce the optimization problem of the elastic-net estimator developed by Zou and Hastie (2005). Let y be a T × 1 vector, X be a T × N matrix, and let β be an N × 1 vector. Consider the optimization problem β̂ ∈ arg min_{β ∈ R^N} (1/2T) ||y − Xβ||²₂ + λ₁ ||β||₁ + λ₂ ||β||²₂, λ₁, λ₂ ≥ 0 (17), where ||x||₁ := Σ_{j=1}^N |x_j| is the L¹-norm on R^N, and ||x||₂ := (Σ_{j=1}^N x_j²)^{1/2} is the L²-norm on R^N. 
In our application, y will be the time series vector of T days of returns of a particular stock, and X will be the concatenation of the time series of N other stocks. Note this means there are N + 1 stocks in total. The solution β̂ is called the elastic-net estimator. This estimator encompasses the special cases of the ordinary least squares (OLS) estimator (when λ₁ = 0, λ₂ = 0), the least absolute shrinkage and selection operator (LASSO) estimator of Tibshirani (1996) (when λ₁ > 0, λ₂ = 0), and the ridge estimator (when λ₁ = 0, λ₂ > 0). The hyperparameters λ₁, λ₂ control the strength of the L¹- and L²-norm penalties, respectively. In this paper, when we refer to the elastic-net estimator, we always refer to the case when λ₁, λ₂ are both strictly positive. In our actual implementation, we use a 3-fold cross-validation procedure to empirically select the hyperparameters λ₁, λ₂ > 0. Remark A.1. To reduce the already lengthy computational time in estimating (17), we cross-validate for only one hyperparameter rather than two in our empirical studies by making the following simplifying assumption on λ₁, λ₂. We set λ₁ = λε and λ₂ = ½λ(1 − ε), set ε = 1/2, and only cross-validate for the single parameter λ > 0. We deliberately do not estimate an intercept in (17) (see footnote 8). This is to avoid attributing a high R² to a stock with very stale prices. Imagine a given stock i has a vector of 12 months' worth of daily returns y_{i,t−1} where the daily returns are almost all 0's except on a handful of days. As an extreme, suppose all other stocks have a return matrix X_{i,t−1} of (6) with rank D_{i,t−1} (see footnote 9). Then using the elastic-net estimator without an intercept (17), the squared error term would be high, thus leading to an overall low R²_{i,t−1}. Observe that in this case β̂_{i,t−1} = 0 is not necessarily an optimal solution, because X_{i,t−1} is of rank D_{i,t−1} while β̂_{i,t−1} is N_{t−1} × 1, and we have D_{i,t−1} ≪ N_{t−1}. 
This means there could exist some sparse $\hat\beta_{i,t-1} \neq 0$ that achieves a smaller value in (17) than that of the zero vector. This is so when
$$\|y_{i,t-1} - X_{i,t-1}\hat\beta_{i,t-1}\|_2^2 + \lambda_1 \|\hat\beta_{i,t-1}\|_1 + \lambda_2 \|\hat\beta_{i,t-1}\|_2^2 < \|y_{i,t-1}\|_2^2,$$
especially when the penalty weights $\lambda_1, \lambda_2$ are small. In contrast, if one were to use the elastic-net estimator with an intercept, then setting $\hat\beta_{i,t-1} = 0$ with intercept $\hat c \approx 0$ is an optimal solution. As a result, this would lead the numerator term in the calculation of $R^2$ to be approximately zero, and thus result in a high $R^2$. We want to avoid the latter case in our results. In this paper, we exclusively reserve "high $R^2$" to mean stock $i$ can be well explained by other risky assets, and "low $R^2$" to mean stock $i$ cannot be explained by other risky assets.

^8 The elastic-net estimator with an intercept is given by
$$(\hat c, \hat\beta) \in \arg\min_{c \in \mathbb{R},\ \beta \in \mathbb{R}^N} \frac{1}{2T} \|y - c\mathbf{1} - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2, \qquad \lambda_1, \lambda_2 \ge 0,$$
where, by convention, the $L_1$- and $L_2$-penalties are applied only to the $\beta$ coefficients and not to the intercept $c$.

^9 In the actual empirical implementations, we do not impose nor check for any such condition.

A.1.3 Parsimony: Why elastic-net and not other machine learning methods?

Out of a myriad of machine learning methods, why did we choose the elastic-net estimator to test our empirical implication? In Theorem 1.1, we motivated the need to linearly regress a stock return $R_i$ onto the returns $R_{-i}$ of all other stocks. The elastic-net estimator can actually be seen as a linear regression problem with constraints. By Lagrange duality, (17) is identical to the following constrained least squares problem,
$$\min_{\beta \in \mathbb{R}^N} \frac{1}{2T} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le \eta_1 \ \text{and}\ \|\beta\|_2^2 \le \eta_2,$$
for some $\eta_1, \eta_2 \ge 0$ that depend on the values of $\lambda_1, \lambda_2 \ge 0$. Most references in the statistical literature (e.g.
Zou and Hastie (2005)) prefer the unconstrained Lagrangian form (17). We can now justify why we choose to implement the elastic-net estimator in this paper out of the myriad of machine learning methods. Without a doubt, there are more advanced machine learning and econometric methods that can increase the in-sample fit and thereby boost the in-sample $R^2$. A recent extensive study by Gu, Kelly, and Xiu (2020) shows that many machine learning methods can achieve high in-sample and out-of-sample $R^2$'s. However, the issue with these "black box" methods is that, regardless of their in-sample or even out-of-sample performance, their estimated model parameters often do not have a transparent link to the regressors themselves. In contrast, although the elastic-net estimator is inherently non-linear, its estimated coefficients can be applied linearly back to the regressors. This parsimonious nature of the elastic-net allows us to explicitly and linearly construct the replicates as in (9). We emphasize that the elastic-net machine learning method is simply a tool to test the prediction of Theorem 1.1 via the construction of the replicates (9). We have no intention of conducting statistical inference on the estimated elastic-net coefficients. Our statistical inference claims are still based upon well-understood portfolio sort methods in the empirical asset pricing literature, as outlined in Section 2.3. The consideration of the replicates is what drives us to prefer the elastic-net over its close cousin, the LASSO. The LASSO enjoys a "sparsity property", whereby the number of non-zero estimated coefficients tends to be small even though the set of regressors could be large. Sparsity means only a handful of stocks have to be considered in order to construct the replicate of stock $i$.
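The linear construction of the replicate just described can be sketched as follows (our own toy data; in the spirit of (9), the fitted coefficients are normalized by their $L_1$ norm and reused as investment weights on the peers' subsequent returns):

```python
# Sketch of a replicate portfolio built from L1-normalized elastic-net
# coefficients; dimensions, penalties, and returns are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
T, N = 250, 40
X_past = rng.normal(scale=0.02, size=(T, N))                  # peers' past daily returns
y_past = 0.7 * X_past[:, 5] + rng.normal(scale=0.01, size=T)  # stock i's past returns

# Fit stock i's returns on all peers; non-zero entries identify its peers.
beta = ElasticNet(alpha=1e-4, l1_ratio=0.5,
                  fit_intercept=False, max_iter=10_000).fit(X_past, y_past).coef_

weights = beta / np.abs(beta).sum()      # L1-normalized investment weights
X_next = rng.normal(scale=0.02, size=N)  # peers' one-month-ahead returns
replicate_return = weights @ X_next      # the replicate's realized return
```

Because the coefficients enter linearly, applying them to the out-of-sample peer returns is a plain dot product; this is exactly the transparency that a generic non-linear learner would not offer.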
This is important because if the number of stocks needed to construct the replicate of stock $i$ is large, then whatever statistical results we claim to find may be economically infeasible due to trading costs and other market frictions. Unfortunately, if a group of regressors is highly correlated, then the LASSO has a tendency to select only one regressor from the group, effectively at random. This makes for poor portfolio construction of the replicates for numerous reasons, the loss of potential diversification being an obvious one. The elastic-net remedies this problem by inheriting the grouping property of the ridge estimator. In all, this means the elastic-net is a good candidate for our consideration of the replicates because: (i) it can fit the data (i.e., from the least squares property of OLS); (ii) it can linearly apply its estimated coefficients over the regressors; (iii) it has a sparsity property (i.e., from the LASSO); and (iv) it has a grouping property (i.e., from ridge). Would a more general machine learning method be useful for our purposes? The answer is mixed. A general machine learning estimator has the form $y = g(X; \theta, \lambda) + \varepsilon$, where $y$ is the response variable, $g$ could be a parameterized or non-parametric function, $X$ is the set of regressors, $\theta$ parameterizes $g$, $\lambda$ is a hyperparameter, and $\varepsilon$ is a nuisance parameter. Generally speaking, the relationship between $X$, $\theta$ and $\lambda$ could be highly non-linear. The data is typically split into three sets $[y_{\text{train}}, y_{\text{validate}}, y_{\text{test}}]$ and $[X_{\text{train}}, X_{\text{validate}}, X_{\text{test}}]$. For a given hyperparameter $\lambda$, the machine learning method uses the training set $(y_{\text{train}}, X_{\text{train}})$ to fit the data and find an optimal parameter $\hat\theta(\lambda)$. Using the validation set $(y_{\text{validate}}, X_{\text{validate}})$, the method then finds an optimal $\hat\lambda$ that fits the validation data, and a trained model is then given by the parameters $\hat\theta(\hat\lambda)$ and $\hat\lambda$.
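The generic fit-then-validate scheme just described can be sketched schematically as follows (our own toy data; `ElasticNet` merely stands in for the generic learner $g$, and the candidate grid is illustrative):

```python
# Schematic three-way split: fit theta_hat(lambda) on the training set,
# then select lambda_hat on the validation set.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
T, N = 500, 20
X = rng.normal(size=(T, N))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=T)

# [train, validate, test] split of both y and X.
X_train, X_validate, X_test = X[:300], X[300:400], X[400:]
y_train, y_validate, y_test = y[:300], y[300:400], y[400:]

best_lam, best_err = None, np.inf
for lam in [1e-4, 1e-3, 1e-2, 1e-1]:              # candidate hyperparameters
    # Training step: for a given lambda, fit theta_hat(lambda).
    model = ElasticNet(alpha=lam, l1_ratio=0.5,
                       fit_intercept=False, max_iter=10_000).fit(X_train, y_train)
    # Validation step: score theta_hat(lambda) on the held-out validation set.
    err = np.mean((y_validate - model.predict(X_validate)) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err             # lambda_hat and its validation error

trained = ElasticNet(alpha=best_lam, l1_ratio=0.5,
                     fit_intercept=False, max_iter=10_000).fit(X_train, y_train)
test_err = np.mean((y_test - trained.predict(X_test)) ** 2)  # out-of-sample check
```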
Finally, the forecast accuracy of the model is tested on the testing set by comparing the predicted value $\hat y = g(X_{\text{test}}; \hat\theta(\hat\lambda), \hat\lambda)$ against its realization $y_{\text{test}}$. In our context, if we were to apply such a general machine learning method to asset $i$, then we would still use the same response variable and regressors as in (6). However, the fitted value would be of very little use to us when constructing the replicates (9) from a finance perspective. There are two issues. Firstly, we are not using a forecast value $\hat y$ to construct portfolios. We are using the fitted coefficients $\hat\theta(\hat\lambda) = \hat\beta_{i,t-1}$ (subsequently normalized by their $L_1$ norm) as investment weights for the replicate of stock $i$. Secondly, in order to use fitted coefficients as investment weights, it is almost necessary that these fitted coefficients have a clear and linear relationship to the one-month-ahead returns $R_{-i,t}$. The OLS, LASSO and elastic-net certainly have this property, as seen in (9). However, a general machine learning method does not have this linear relationship between the fitted coefficients and the out-of-sample regressors. This non-linearity explicitly prevents us from directly using these general machine learning methods to construct a portfolio. Nonetheless, as discussed at the conclusion of the main text, there are many research avenues to disentangle the fitting role and the replicate portfolio construction role of the elastic-net. By designing a method ML$_1$ for fitting and another method ML$_2$ for replicate portfolio construction, there are many other properties of SAR and SARP to explore. Regardless, and at least to us, the elastic-net seems to be the most parsimonious choice and a good baseline.
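As a final illustration of the grouping property that led us to prefer the elastic-net over the LASSO, consider this toy example (our own construction): with two nearly identical regressors, the LASSO tends to concentrate weight on one of them, while the elastic-net, through its ridge component, spreads weight across the correlated pair, preserving diversification in the replicate.

```python
# Toy illustration of the grouping property: LASSO vs. elastic-net on two
# almost perfectly correlated "stocks". Data and penalties are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
T = 500
common = rng.normal(size=T)
x1 = common + 1e-6 * rng.normal(size=T)   # two almost perfectly
x2 = common + 1e-6 * rng.normal(size=T)   # correlated regressors
X = np.column_stack([x1, x2])
y = common + 0.05 * rng.normal(size=T)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y).coef_
print("LASSO coefficients:      ", lasso)  # typically lopsided toward one column
print("elastic-net coefficients:", enet)   # typically shared across the pair
```

In our runs the elastic-net coefficients on the two columns are close to equal, whereas the LASSO's are far apart; the exact split depends on the solver and penalty levels, which is precisely the arbitrariness the grouping property removes.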
References

Illiquidity and stock returns: cross-section and time-series effects
The Cross-Section of Volatility and Expected Returns
Statistical arbitrage in the US equities market
Empirical Asset Pricing: The Cross Section of Stock Returns
The capital asset pricing model: Some empirical tests
Empirical tests of the consumption-orientated CAPM
Sparse signals in the cross-section of returns
The cross-section of expected stock returns
Common risk factors in the returns on stocks and bonds
A five-factor asset pricing model
Risk, return, and equilibrium: Empirical tests
Taming the factor zoo: A test of new factors
Dissecting Characteristics Nonparametrically
Pairs trading: Performance of a relative-value arbitrage rule
Empirical asset pricing via machine learning
Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis
… and the Cross-Section of Expected Returns
Estimating the risk-return trade-off with overlapping data inference
Large data sets and machine learning: Applications to statistical arbitrage
Evidence of predictable behavior of security returns
Statistical arbitrage pairs trading strategies: Review and outlook
Economic tracking portfolios
Fads, martingales, and market efficiency
Online Supplementary Materials for 'Statistical Arbitrage Risk Premium by Machine Learning'
The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets
A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix
The Arbitrage Theory of Capital Asset Pricing
Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk
High-dimensional index tracking based on the adaptive elastic net
Regression shrinkage and selection via the lasso
Does arbitrage flatten demand curves for stocks
Regularization and variable selection via the elastic net