key: cord-0597187-k54jq3yh authors: James, Nick; Menzies, Max; Gottwald, Georg title: On financial market correlation structures and diversification benefits across and within equity sectors date: 2022-02-22 journal: nan DOI: nan sha: 9fecb9eabb071cf321065a529a207c695059002f doc_id: 597187 cord_uid: k54jq3yh We study how to assess the potential benefit of diversifying an equity portfolio by investing within and across equity sectors. We analyse 20 years of US stock price data, which includes the global financial crisis (GFC) and the COVID-19 market crash, as well as periods of financial stability, to determine the `all weather' nature of equity portfolios. We establish that one may use the leading eigenvalue of the cross-correlation matrix of log returns as well as graph-theoretic diagnostics such as modularity to quantify the collective behaviour of the market or a subset of it. We confirm that financial crises are characterised by a high degree of collective behaviour of equities, whereas periods of financial stability exhibit less collective behaviour. We argue that during times of increased collective behaviour, risk reduction via sector-based portfolio diversification is ineffective. Using the degree of collectivity as a proxy for the benefit of diversification, we perform an extensive sampling of equity portfolios to confirm the old financial adage that 30-40 stocks provide sufficient diversification. Using hierarchical clustering, we discover a `best value' equity portfolio for diversification consisting of 36 equities sampled uniformly from 9 sectors. We further show that it is typically more beneficial to diversify across sectors rather than within. Our findings have implications for cost-conscious retail investors seeking broad diversification across equity markets. Financial market structure and behaviour are notoriously difficult to describe and predict. 
Over the last 100 years, countless mathematical models and intuitive rules have been developed to predict the behaviour of individual assets as well as broader market trends. In 1952, Markowitz [1] revolutionised the study of financial markets and the practice of asset selection by arguing that diversification across many assets provides superior risk reduction to the optimal selection of individual assets. The idea of diversification relies on disentangling the risk of a particular financial asset into the risk of the market, the so-called systematic risk, which an investor cannot control, and the individual risk of an asset, the so-called unsystematic risk, which is assumed to be uncorrelated with the systematic risk of the market. Diversification amounts to averaging out the unsystematic risk by investing in a sufficient number of individual assets, leaving an investor exposed only to the inherent systematic risk of the market. The benefit of diversification is intimately tied to the notion that the price of an asset can be decomposed into a (noisy) collective market component and an idiosyncratic noisy component which is uncorrelated with the collective behaviour of the market [2, 3, 4]. By analysing data from a 20-year period of 339 US equities, we aim to shed some light on how well this separation of the risk into a collective market component and an individual component holds across time, and how diversification benefits vary when investing in different sub-collections of the market. We pay particular attention to the traditional method of diversifying across industry sectors, and study how beneficial this approach is in diversifying an equity portfolio.

Email addresses: nick.james@unimelb.edu.au (Nick James), max.menzies@alumni.harvard.edu (Max Menzies), georg.gottwald@sydney.edu.au (Georg Gottwald)
Until the last several decades, active investment management has been dominated by fundamental investors who make investment decisions based on the future earning potential of companies, relative to their current valuations. The correlation between the prosperity of companies in different sectors and that of the overall economy varies significantly. For example, equities classified in the Information Technology, Financials, Energy and Materials sectors often thrive during periods of economic growth. By contrast, sectors with more defensive earning profiles such as Healthcare, Utilities and Consumer Staples tend to outperform during recessionary periods. Therefore, it is reasonable to expect that it is more beneficial to diversify across sectors rather than within them. However, this intuitive reasoning requires a thorough investigation backed up by data. Ever since Markowitz's work, cross-correlation matrices of asset prices have been the key object of study in capturing market structure and the interdependencies of assets in the market or within a subset of the market such as equity sectors. The spectral properties of these matrices encode important information about the overall market structure. To study evolutionary correlation structures, principal component analysis of the cross-correlation matrix has been employed. In particular, the leading eigenvectors and eigenvalues have been used to characterise the collective behaviour of the market. It has been shown that a few components describe most of the observed variability of the market [5, 6, 7, 8, 9]. Using random matrix theory, differences between cross-correlation matrices of stock price changes and random matrices can be used to uncover non-random aspects of the market [10, 11, 12, 7].
Network analysis, in which the stock market is viewed as a complex network where the cross-correlation matrix describes the coupling strength between individual assets, was used to find correlated groups of assets within the market [13, 14, 15, 16, 17, 18, 19] . The aim we set ourselves here is to find and employ a quantitative measure informing investors if diversifying their portfolio by investing in a larger number of equities will be beneficial for their risk reduction. The quantification of risk reduction is a difficult task, and highly definitional. We argue that diversification is beneficial if the unsystematic risk is sufficiently large compared to the systematic risk of the collective market. Hence diversification is beneficial in a market with a sufficiently low degree of collective behaviour, in which individual assets display a certain degree of independence. On the contrary, in a market which exhibits a high degree of collective behaviour, diversification may not lead to a significant reduction in the overall risk of the portfolio. Here we apply several complementary diagnostics to uncover dominant collective behaviour (or the lack thereof) of the market as a whole and in terms of individual sectors. We will use the leading eigenvalue of the cross-correlation matrix as a proxy for the collective behaviour of the market (or subset of the market) [20, 21, 7] , with a larger value of it being indicative of stronger collective behaviour. We further employ modularity, a diagnostic borrowed from complex network analysis, to probe into how far sectors function as mutually independent sets of equities. 
We find that all our metrics show the same signature: in times of crisis, such as the global financial crisis (GFC) in 2008/2009 or the 2020 market crash related to the COVID-19 pandemic, the market exhibits increased collective behaviour in which assets collectively react to overwhelming market forces, and equity-specific unsystematic risk is swamped by the systematic risk of the market. By contrast, periods of sustained equity price growth (often referred to as bull market periods) are characterised by a lesser degree of collectivity, allowing for more efficient equity portfolio diversification. There is a commonly held principle among investors that, in order to diversify, a sufficient and perhaps optimal number of equities to hold in a portfolio is between 30 and 40 [22]. Many fund managers and individual investors may wish to limit their total number of held equities, due to mandated restrictions in their investment policy statement [23], transaction fees, or the complexity of managing large portfolios. Hence finding the smallest number of equities which still allows for sufficient diversification is of paramount interest to investors. Using an extensive sampling strategy, we aim to find evidence in the data for the 30-40 stock rule, and to determine how this rule is affected by the presence of sectors. Motivated by the results on the degree of collectivity, we measure the propensity of the market to allow for diversification by the reduction in the magnitude of the leading eigenvalue of the cross-correlation matrix associated with the respective portfolios. Perhaps unsurprisingly, we show that the precise selection of equities within a sector is less important than selecting a sufficient number of sectors to choose from. Moreover, we will show that the data suggest that during the 20-year period from 2001 to 2020 the anecdotal 30-40 equity rule does apply when investing across sectors.
Interestingly, we show that a portfolio consisting of 36 equities sampled uniformly from 9 sectors provides comparable risk mitigation to a 90-equity portfolio sampled uniformly from 10 sectors. Moreover, we show that risk reduction is less sensitive to the precise selection of equities within a sector once sectors are chosen. This supports the rationale behind recent trends in finance where diversification is promoted by investing in thematic areas. The paper is structured as follows. Section 2 describes the US equity data used for our analysis. In Section 3, we study the market structure across and within sectors. We begin with a study of the leading eigenvalue of the correlation matrix to explore the collective behaviour of the whole portfolio comprised of all equities as well as within each equity sector. Periods of financial crisis and of bull markets are clearly identified by increased and decreased collective behaviour, respectively. Periods of financial crisis are further characterised by an increase in the market's homogeneity. We augment the analysis with graph-theoretic diagnostics and show that modularity can be used as another proxy for the degree of collectivity of the market, exhibiting the same temporal signatures as the leading eigenvalue of the cross-correlation matrix. In Section 4, we turn to the more practical problem of studying the diversification benefit provided by diversifying across sectors. We verify that appropriately chosen combinations of 30-40 stocks across diverse sectors provide essentially as much diversification benefit as the entire market.

We consider the daily stock prices of N = 339 US equities from January [...]

We will establish that one can measure the degree of collectivity of the market by certain spectral and graph-theoretic properties of the cross-correlation matrix of the log returns of the stock price data. These measures will be used in Section 4 as a proxy for the benefit of diversification of a portfolio.
We denote by $c_i(t)$, $i = 1, \dots, N$, $t = 0, \dots, T$, the multivariate time series of daily closing prices among our collection of $N$ equities. The multivariate time series of log returns, $r_i(t)$, $i = 1, \dots, N$, $t = 1, \dots, T$, is defined as

$$r_i(t) = \log\left(\frac{c_i(t)}{c_i(t-1)}\right).$$

Our primary objects of study in this section are correlation matrices of log returns across rolling time windows of length $\tau$. We choose here $\tau = 120$ days. We standardise the log returns over such a window by defining

$$R_i(t) = \frac{r_i(t) - \langle r_i \rangle}{\sigma(r_i)},$$

where $\langle \cdot \rangle$ denotes the temporal average over the time window and $\sigma$ the associated standard deviation. The correlation matrix $\Psi$ is then defined as follows: for each window ending at time $t = \tau, \dots, T$, let $R$ be the $N \times \tau$ matrix with entries $R_{is} = R_i(s)$, $i = 1, \dots, N$, $s = t - \tau + 1, \dots, t$, and let $\Psi(t) = \frac{1}{\tau} R R^T$. Explicitly, individual entries describing the correlation between equities $i$ and $j$ are defined as

$$\Psi_{ij}(t) = \frac{1}{\tau} \sum_{s = t - \tau + 1}^{t} R_i(s) R_j(s) \qquad (2)$$

for $1 \leq i, j \leq N$. We may analogously define the cross-correlation matrices for each individual sector by restricting $i$ and $j$ to a set of indices corresponding to a particular sector. All entries $\Psi_{ij}$ lie in $[-1, 1]$. $\Psi$ is a symmetric positive semi-definite matrix with real and non-negative eigenvalues $\lambda_i(t)$, so we may order them as $\lambda_1 \geq \dots \geq \lambda_N \geq 0$. As all diagonal entries of $\Psi$ are equal to 1, the trace of $\Psi$ is equal to $N$. Thus, we may normalise the eigenvalues by defining

$$\tilde{\lambda}_i = \frac{\lambda_i}{\sum_{j=1}^{N} \lambda_j} = \frac{\lambda_i}{N}.$$

Principal component analysis has been a cornerstone in the analysis of dominant patterns in multivariate time series [24] and has been widely applied to financial data (see, for example, [7]). The eigenvectors $v_i$ of the cross-correlation matrix $\Psi$, which we assume to be normalised throughout, capture directions of maximal variance of the data in a time period of length $\tau$, and the eigenvalues $\lambda_i$ capture how much of the observed variance of the data in that period can be described by the respective eigenvectors. In particular, $\tilde{\lambda}_i$ describes the proportion of the variance of the data that is explained by the $i$th eigenvector $v_i$.
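The construction above (log returns, window-wise standardisation, correlation matrix and normalised leading eigenvalue) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the price array is assumed to have shape (N, T + 1), and the population standard deviation is assumed so that the diagonal of the correlation matrix equals 1.

```python
import numpy as np


def log_returns(prices):
    """Log returns r_i(t) = log(c_i(t) / c_i(t-1)) for an (N, T+1) price array."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[:, 1:] / prices[:, :-1])


def window_correlation(returns_window):
    """Correlation matrix of an (N, tau) block of log returns.

    Assumption: each row is standardised with the population standard
    deviation (ddof=0), so that every diagonal entry equals 1.
    A row with zero variance would produce a division by zero.
    """
    r = np.asarray(returns_window, dtype=float)
    centred = r - r.mean(axis=1, keepdims=True)
    standardised = centred / r.std(axis=1, keepdims=True)
    return standardised @ standardised.T / r.shape[1]


def normalised_leading_eigenvalue(psi):
    """Largest eigenvalue of psi divided by its trace N."""
    eigenvalues = np.linalg.eigvalsh(psi)  # returned in ascending order
    return eigenvalues[-1] / psi.shape[0]


def rolling_lambda1(prices, tau=120):
    """Normalised leading eigenvalue for every window of tau returns."""
    r = log_returns(prices)
    n_steps = r.shape[1]
    return np.array([
        normalised_leading_eigenvalue(window_correlation(r[:, end - tau:end]))
        for end in range(tau, n_steps + 1)
    ])
```

Restricting the rows of the price array to the equities of one sector gives the sector-level version of the same quantity. A value near 1/N indicates nearly uncorrelated equities, while a value near 1 indicates that a single mode explains almost all of the variance.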
Hence if there are only a few eigenvalues of large magnitude, the data can be described by a linear combination of a few dominant eigenvectors. We are particularly interested inλ 1 (t) = λ 1 (t)/N as a function of the rolling τ = 120-day window. Indeed, in the extreme case thatλ 1 is close to 1, the data can be described by the single mode v 1 , which we refer to as the market. Hence, if the temporal evolution of equities is dominated by a single mode, then all the variance in the data can be explained by v 1 , and there is no significant contribution of variance coming from other subspaces spanned by higher eigenvectors. In this sense we defineλ 1 as a measure of the strength in collective correlations among a group of equities and as a proxy for a potential benefit of diversification. Figure 1 shows the evolution of the leading eigenvalueλ 1 (t) of the correlation matrix for all GICS sectors and the entire collection of equities, over the 20-year period we examine. There are several noteworthy findings. First, the leading eigenvalue attains large values during the two most prominent market crises, the global financial crisis (GFC) in 2008/2009 and the COVID-19 market crash in 2020. The GFC features three spikes in short succession commencing in late 2008 and the subsequent severe market responses in 2010 and 2011. By contrast, the COVID-19 market crash corresponds to one pronounced spike sustained for a period in early 2020. During bear markets and crises, the magnitude of the leading eigenvalue increases, often sharply, to large values -this heralds increased correlation between all underlying equities and less opportunity for successfully diversifying a portfolio returns stream. Spikes of the leading eigenvalue can be explained by indiscriminate selling of risky assets (including equities) by both active and passive funds management businesses. 
Such spikes in the leading eigenvalue are associated with increased correlations among all underlying equities, and pronounced negative returns exhibited by equities with significant market beta. This suggests that during periods of large values of $\tilde{\lambda}_1$ diversification may not be beneficial, as the overall systematic risk dominates the unsystematic risk. During bull markets, for example during the extended period from 2016 to 2019, the normalised leading eigenvalue $\tilde{\lambda}_1$ can experience large fluctuations; however, the overall magnitudes are small, pointing to a lesser degree of correlation between equities. This lesser degree of correlation can be utilised by investors to diversify their portfolio. Second, all sectors display broadly similar evolution over time regarding the peaks and troughs of their leading eigenvalue. Third, the degree of collectivity $\tilde{\lambda}_1$ is larger at all times when calculated from a cross-correlation matrix restricted to individual sectors than when calculated using all equities. This is consistent with our intuitive argument that diversification is more beneficial when investing across sectors rather than within them, since equities within a sector are more likely to be mutually correlated. An interesting observation is the absence of a spike in the leading eigenvalue around the time of the dot-com bubble in 2000/2001, in particular in the Information Technology sector. There are several possible explanations for this. First, most financial datasets spanning a significant period of time, such as ours, suffer from survivorship bias. Many technology-related companies went bankrupt during this period (including Pets.com, Webvan and 360Networks) and do not appear in our dataset. Second, many companies that are generally thought of as Information Technology companies may be classified in other GICS sectors. One prominent example of this is Amazon's classification within the Consumer Discretionary sector.
Factors such as these may have dampened the measured collective behaviour of equities (and the degree of severity) during the dot-com crisis, especially within the Information Technology sector. Finally, we further investigate the extent of uniformity of the leading eigenvector $v_1$ by introducing

$$h(t) = \frac{|\langle v_1, \mathbf{1} \rangle|}{\|v_1\| \, \|\mathbf{1}\|},$$

where $\mathbf{1} = (1, 1, \dots, 1) \in \mathbb{R}^{\tilde{N}}$ and $\tilde{N}$ denotes the number of underlying equities used to construct the cross-correlation matrix (2). We remark that when the whole equity market is considered $\tilde{N} = N$, and if only a particular one of the $M = 11$ sectors is considered then $\tilde{N}$ equals the size of that sector. Note that $h(t) \leq 1$, with $h(t) = 1$ when $v_1$ is proportional to $\mathbf{1}$. In this case, all equities carry the same amount of variance. This can be used to quantify the potential benefit of diversification: increased values of $h(t)$ indicate increased interchangeability of equities and hence less opportunity for diversification or judicious selection of individual equities. In Figure 2, we plot the uniformity measure $h(t)$ for each sector and for the entire market. The results are consistent with those shown in Figure 1 for the degree of collectivity. As for the leading eigenvalue $\tilde{\lambda}_1$, the degree of uniformity $h(t)$ spikes during market crises (GFC and COVID-19), both for the individual sectors and for the entire market. We complement the spectral analysis of the cross-correlation matrix (2) with a graph-theoretic view of the cross-correlation matrix over time. We view the correlation matrix as an adjacency matrix of a weighted graph to uncover the presence or absence of correlated sectors. Specifically, we consider a weighted graph with adjacency matrix $A$ with $A_{ij} = |\Psi_{ij}|$. Unlike usual network-based community-finding algorithms, which are designed to identify communities purely from the structure of the adjacency matrix [25, 26, 7, 27], we assume here that the given sectors predetermine the communities a priori.
The graph-theoretic diagnostics are then used to quantify the strength of the partition of the graph into those fixed sectors. We study in particular the (rolling) modularity associated with the partition of the graph defined by the sectors. Modularity measures the difference between the observed number of (weighted) edges within a sector and the expected number of edges if they were randomly assigned [25]. Treating each individual asset as a vertex, its degree is defined as $k_i = \sum_{j=1}^{N} A_{ij}$, while the total number of edges (counted by weight) of the graph is $e = \frac{1}{2} \sum_{i=1}^{N} k_i$. Denoting by $S_m$ the set of equities which make up the $m$th sector, the modularity is defined as

$$Q = \frac{1}{2e} \sum_{m=1}^{M} \sum_{i, j \in S_m} \left( A_{ij} - \frac{k_i k_j}{2e} \right).$$

As elsewhere in this section, we compute and study $Q$ on a rolling $\tau = 120$-day basis. In Figure 3, we show the evolution in modularity $Q$ for the partition defined by the GICS sectors. Consistent with the degree of collectivity $\tilde{\lambda}_1$ and the uniformity measure $h(t)$, modularity clearly identifies financial crises as events with small modularity $Q$, indicating that in times of financial crisis sectors cease to constitute independent sets of equities which are more correlated with each other than with equities from other sectors. In fact, there are only four events in time where the level of modularity drops below 0.018: the three troughs corresponding to the GFC between 2008 and 2012, and the COVID-19 market crash in 2020. In contrast, modularity is highest during the mid-2000s and during the equity bull market of 2016-2019. As for the normalised leading eigenvalue $\tilde{\lambda}_1$ (cf. Fig. 1), the modularity experiences large fluctuations, albeit with values significantly larger than those experienced during financial crises.
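The two remaining diagnostics of this section, the uniformity of the leading eigenvector and the modularity of the fixed sector partition, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the uniformity is assumed to be the absolute cosine between the leading eigenvector and the vector of ones, the adjacency matrix is assumed to be the entrywise absolute value of the correlation matrix, and the modularity is assumed to take the standard weighted form of [25].

```python
import numpy as np


def uniformity(psi):
    """Absolute cosine between the leading eigenvector of psi and the ones vector.

    Assumption: this is the form of the uniformity measure h; it equals 1
    when all components of the leading eigenvector are equal.
    """
    _, vectors = np.linalg.eigh(psi)  # eigenvalues in ascending order
    v1 = vectors[:, -1]
    ones = np.ones(len(v1))
    return abs(v1 @ ones) / (np.linalg.norm(v1) * np.linalg.norm(ones))


def sector_modularity(psi, sector_labels):
    """Weighted modularity Q of a fixed partition given by sector_labels.

    Assumption: adjacency A = |psi|, degrees k_i = sum_j A_ij and
    total weight e = (1/2) sum_i k_i.
    """
    a = np.abs(np.asarray(psi, dtype=float))
    k = a.sum(axis=1)
    two_e = k.sum()
    labels = np.asarray(sector_labels)
    same_sector = labels[:, None] == labels[None, :]
    excess = a - np.outer(k, k) / two_e
    return float((excess * same_sector).sum() / two_e)
```

In this sketch a market in which every pair of equities is perfectly correlated has uniformity 1 and modularity 0 for any partition, which matches the qualitative picture of crises: high uniformity and low modularity.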
To test if sectors constitute reasonable communities in the sense that there are more (weighted) edges linking equities within a sector than what one expects from a random allocation of edges, we calculated the average modularity over 500 random allocations of equities to 11 random groupings of the same size as the original sectors. The averaged modularity of this ensemble of randomised sectors is of the order of 5 times smaller than the modularity of the actual market with its sectors (not shown). Interestingly though the temporal evolution experiences the same troughs and spikes as the modularity Q shown in Figure 3 . We have shown in this Section that the normalised leading eigenvalue of the cross-correlation matrixλ 1 (t), the equity uniformity measure h(t) and the modularity Q(t) have very similar signatures. All these diagnostics can be used to identify collective behaviour of the market and hence to assess the potential benefit of diversification. In the following Section we will use the normalised leading eigenvalue as our proxy for a diversification benefit. We remark on the choice of the time window length τ = 120 used to construct the cross-correlation matrix (2) . The choice of this parameter is a delicate balance between excessive and insufficient smoothing. If the value of τ is chosen too large, the level of smoothing will be excessive, and we may be unable to identify abrupt changes in the correlation structure. A prime example of this is the COVID-19 market crash, which was extremely severe, albeit quite brief. Alternatively, if τ is chosen too small, we may erroneously interpret short-term transient noisy events as meaningful changes of the correlation structure. The size of the smoothing window varies significantly in the literature, ranging from windows of 3 months to 2 years of trading data [28, 7] , depending on the time-scales of interest. 
Here we are interested in both abrupt market changes developing over a few months as well as longer-term structural shifts in correlation patterns, motivating the compromising value of τ = 120 days, i.e. 6 months. We use the whole data set comprised of several periods of bear and bull markets and did not stratify the data according to different market dynamics. First, this allows us to investigate the long-term implications of equity portfolio diversification strategies, which consists of bull and bear market periods. Second, given that market dynamics varies significantly over time (and no two market crises are ever the same), the optimal portfolio structure of a previous period of economic crisis or stability, may not be ideal in a similarly-themed future period. Finally, it may be difficult for retail investors to anticipate changes in equity market dynamics, and restructure their portfolios based on expected equity market performance. Accordingly, we study optimal diversification strategies for equity market investors who may lack the resources, information or interest to constantly re-balance their equity sector exposure based on market sentiment. We now perform an extensive sampling procedure to explore how diversification benefits depend on the number of equities held in the portfolio and on the number of sectors from which to choose those equities. To quantify the potential diversification benefit we choose here, motivated by the results obtained in Section 3, the degree of collective behaviour of the marketλ 1 (t). We study the diversification benefits of portfolios that consist of mn equities such that n equities are drawn from m separate sectors. Both the individual equities and the sectors are drawn randomly and independently with uniform probability. We draw D = 500 portfolios for each combination (m, n). 
To quantify the potential diversification benefit for a portfolio consisting of $mn$ equities, we determine the $mn \times mn$ correlation matrix $\Psi$ for each draw and calculate the associated normalised leading eigenvalue $\tilde{\lambda}_{m,n}(t)$. We again use a rolling time window of length $\tau = 120$ days when determining the cross-correlation matrix. For each combination $(m, n)$ of number of sectors $m$ and number of equities per sector $n$, we record the 5th percentile, 50th percentile (median) and 95th percentile of the $D$ values of $\tilde{\lambda}_{m,n}(t)$. These are denoted by $\tilde{\lambda}^{0.05}_{m,n}(t)$, $\tilde{\lambda}^{0.50}_{m,n}(t)$ and $\tilde{\lambda}^{0.95}_{m,n}(t)$, respectively. We introduce the temporal mean of the median of the normalised eigenvalues,

$$\mu_{m,n} = \frac{1}{T - \tau + 1} \sum_{t=\tau}^{T} \tilde{\lambda}^{0.50}_{m,n}(t),$$

as a measure of the diversification benefit of a portfolio with $n$ stocks in each of $m$ sectors; smaller values of $\mu_{m,n}$ correspond to a larger diversification benefit. Table 1 records $\mu_{m,n}$ for portfolios with $2 \leq m \leq 10$ and $2 \leq n \leq 9$. In the following we denote by $(m, n)$ a portfolio with $n$ equities chosen from each of $m$ separate sectors. As expected, the diversification benefit is seen to be smallest for the smallest portfolio (2, 2) consisting of 4 equities and largest for the largest portfolio (10, 9) consisting of 90 equities. Table 1 reveals that if we keep the total number $mn$ of equities contained in a portfolio constant, we have $\mu_{m,n} < \mu_{n,m}$ for $m > n$, showing that it is more beneficial to diversify across sectors than within sectors. We can fix the number of sectors $m$ and increase $n$ to see a reduction in the magnitude of $\mu_{m,n}$, implying, as expected, a larger diversification benefit. Similarly, we can fix the number of equities from each sector and increase the number of sectors $m$. The decrease of $\mu_{m,n}$ is stronger when the number of sectors is increased than in the previous scenario where the number of equities per sector is varied, again pointing to the fact that diversifying across sectors is more beneficial than within sectors.
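The sampling procedure described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names and the `sector_members` structure (a list of index arrays, one per sector) are hypothetical, and sectors and equities are assumed to be drawn without replacement, which the text does not state explicitly.

```python
import numpy as np


def sample_portfolio(sector_members, m, n, rng):
    """Draw m sectors uniformly, then n equities uniformly from each sector.

    Assumption: both draws are without replacement, so every sector must
    contain at least n equities.
    """
    sectors = rng.choice(len(sector_members), size=m, replace=False)
    picks = [rng.choice(sector_members[s], size=n, replace=False) for s in sectors]
    return np.concatenate(picks)


def leading_eigenvalue_series(returns, tau):
    """Normalised leading eigenvalue over every rolling window of length tau."""
    n_assets, n_steps = returns.shape
    values = []
    for end in range(tau, n_steps + 1):
        psi = np.corrcoef(returns[:, end - tau:end])  # rows are equities
        values.append(np.linalg.eigvalsh(psi)[-1] / n_assets)
    return np.array(values)


def percentile_curves(returns, sector_members, m, n, tau=120, draws=500, seed=0):
    """5th, 50th and 95th percentile curves over `draws` random (m, n) portfolios."""
    rng = np.random.default_rng(seed)
    series = np.array([
        leading_eigenvalue_series(
            returns[sample_portfolio(sector_members, m, n, rng)], tau
        )
        for _ in range(draws)
    ])
    return np.percentile(series, [5, 50, 95], axis=0)


def mu(returns, sector_members, m, n, **kwargs):
    """Temporal mean of the median curve for an (m, n) portfolio."""
    _, median_curve, _ = percentile_curves(returns, sector_members, m, n, **kwargs)
    return float(median_curve.mean())
```

Evaluating `mu` on a grid of (m, n) pairs would give a table of the same layout as Table 1. With 500 draws and several thousand windows per pair this is slow in pure Python, so a smaller number of draws is a sensible starting point when experimenting.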
We show in Table 1 a greedy strategy (red online) where, starting at the smallest portfolio (2, 2) and ending at the largest portfolio (10, 9), we aim to decrease the value of $\mu_{m,n}$ at each step by either increasing the number of sectors $m$ to choose from or the number of equities $n$ to be chosen from each of those sectors. The greedy path is shown in Fig. 4. It is seen that the median $\mu_{m,n}$ saturates and that not much is gained by increasing the number of sectors from 9 to 10. The question we ask in this Section is whether we can find a portfolio which results in a comparable diversification benefit to the largest (10, 9) portfolio but which contains significantly fewer equities. Since there is no significant difference in our measure of the diversification benefit $\mu_{m,n}$ between the (9, 9) and the (10, 9) portfolios, we restrict our analysis from now on to portfolios with a maximum of 9 sectors, with the (9, 9) portfolio being the most diversified portfolio. The smallest portfolio which has a value of $\mu_{m,n}$ comparable to the minimal value of the (10, 9) and (9, 9) portfolios is identified to be the (9, 4) portfolio. Using hierarchical clustering, we will show below that the (9, 4) portfolio with a total of 36 equities indeed behaves similarly to the most diversified (9, 9) portfolio with a total of 81 equities. To explore the (9, 4) portfolio in more detail, and to see how it compares to portfolios of the same size such as the (4, 9) portfolio as well as to the most diversified (9, 9) portfolio, we show in Figure 5 the temporal evolution of $\tilde{\lambda}^{0.50}_{m,n}$. It is clearly seen in Figure 5a that the (9, 4) portfolio exhibits smaller values of $\tilde{\lambda}^{0.50}_{m,n}$ than the (4, 9) portfolio, which holds the same total number of equities, at all times, independent of whether the market experiences a financial crisis or a bull market. Moreover, the spread, as measured by the distance between the 5th and the 95th percentile curves, is smaller for the (9, 4) portfolio.
Remarkably, as seen in Figure 5b, the curve of $\tilde{\lambda}^{0.50}_{m,n}$ for the (9, 4) portfolio closely resembles that of the largest (9, 9) portfolio, with comparable spread. This shows that the diversification benefit of the smaller (9, 4) portfolio is very similar to that of the much larger (9, 9) portfolio.

Table 1: Average $\mu_{m,n}$ of the median normalised eigenvalue $\tilde{\lambda}^{0.50}_{m,n}(t)$ for different pairs of $m$ sectors and $n$ equities per sector. In red, we display a greedy strategy for reducing the value of $\mu_{m,n}$ (implying an increase in the overall diversification benefit) by gradually increasing the portfolio size, starting from the smallest portfolio (2, 2).

The previous discussion was centred around the average behaviour of a portfolio with a specified number of sectors and equities per sector. For investors it is of paramount importance to know whether the average behaviour is typical. If this is not the case, then the diversification benefit will strongly depend on the particular choice of the equities picked from each sector. We expect that the variance will be larger in smaller portfolios, with a maximum at the (2, 2) portfolio, and will decrease with increasing number of equities held, with a minimum variance for the largest (9, 9) portfolio. To quantify this we look at the average spread defined by

$$\sigma_{m,n} = \frac{1}{T - \tau + 1} \sum_{t=\tau}^{T} \left( \tilde{\lambda}^{0.95}_{m,n}(t) - \tilde{\lambda}^{0.05}_{m,n}(t) \right).$$

The difference between the 5th and 95th percentile of a distribution corresponds, under the assumption of Gaussianity, to approximately 3.29 times the underlying standard deviation (twice the 95th percentile of the standard normal distribution, 1.645). We record $\sigma_{m,n}$ in Table 2. As for the average $\mu_{m,n}$, we observe that $\sigma_{m,n} < \sigma_{n,m}$ for $m > n$, implying that to construct a portfolio of $mn$ equities it is more beneficial to increase the number of sectors than the number of equities held per sector. Similarly, the decrease in the spread is more pronounced when, for a fixed number of equities per sector, we increase the number of sectors to choose from than when, for a fixed number of sectors, the number of equities per sector is increased.
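As a minimal sketch, the average spread can be computed directly from the sampled eigenvalue curves. The function name and the input layout (one row per sampled portfolio, one column per rolling window) are assumptions made for illustration, and the spread is assumed to be the time average of the gap between the 95th and 5th percentile curves, as the surrounding text suggests.

```python
import numpy as np


def average_spread(samples, lower=5, upper=95):
    """Time-averaged gap between the upper and lower percentile curves.

    samples: array of shape (draws, windows) holding the normalised leading
    eigenvalue of each sampled portfolio in each rolling window.
    """
    samples = np.asarray(samples, dtype=float)
    low, high = np.percentile(samples, [lower, upper], axis=0)
    return float(np.mean(high - low))
```

A small value means that almost any random choice of equities within the chosen sectors gives a similar degree of collectivity, so the median curve is representative of a typical portfolio.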
We now address the question of which portfolio combinations $(m, n)$ share the most similar evolution in their collective dynamics. This allows us to determine the smallest portfolio which has a comparable diversification benefit to the most diversified (9, 9) portfolio. To tackle this question, we perform hierarchical clustering on the distance metric

$$d\big((m, n), (m', n')\big) = \frac{1}{T - \tau + 1} \sum_{t=\tau}^{T} \left| \tilde{\lambda}^{0.50}_{m,n}(t) - \tilde{\lambda}^{0.50}_{m',n'}(t) \right|, \qquad (8)$$

which quantifies the average absolute difference between the median eigenvalues of two portfolios $(m, n)$ and $(m', n')$. This results in a $64 \times 64$ distance matrix for $2 \leq m, n \leq 9$. Given the relatively high dimensionality of the data, we choose the Manhattan distance over alternatives such as the Euclidean distance. However, we checked that the key findings remain unchanged when using the Euclidean distance instead. Once the distance matrix has been formed, we apply hierarchical clustering to determine which portfolio combinations share the most similarity in the evolution of their collective behaviour. Hierarchical clustering is a convenient tool to reveal proximity between different elements of a collection. Here, we perform agglomerative hierarchical clustering based on the average-linkage criterion [29]. The algorithm works in a bottom-up manner, where each portfolio combination starts in its own cluster, and pairs of clusters are merged as one traverses up the hierarchy. Given the high transaction costs investors may face when holding larger portfolios, we wish to identify the smallest (or best value) portfolios which provide the greatest risk reduction relative to the number of equities held.

Table 2: Average spread $\sigma_{m,n}$ of the median normalised eigenvalues for different pairs of $m$ sectors and $n$ equities per sector. In red, we display a greedy strategy for reducing the value of $\sigma_{m,n}$ (implying a smaller dependency on the portfolio selection) by gradually increasing the portfolio size, starting from the smallest portfolio (2, 2).
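The clustering step can be sketched with SciPy's hierarchical clustering routines. This is an illustrative sketch under stated assumptions: the function names and the dictionary of median curves keyed by (m, n) are hypothetical, the distance is taken to be the time-averaged absolute difference between median curves, and cutting the tree into a fixed number of flat clusters is added here only for inspection (the paper reads the clusters off a dendrogram).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def median_distance_matrix(medians):
    """Pairwise mean absolute difference between median eigenvalue curves.

    medians: dict mapping (m, n) to a 1-D array of equal length.
    Returns the sorted portfolio labels and the square distance matrix.
    """
    labels = sorted(medians)
    curves = np.array([medians[key] for key in labels])
    gaps = np.abs(curves[:, None, :] - curves[None, :, :])
    return labels, gaps.mean(axis=2)


def average_linkage_clusters(distance_matrix, n_clusters):
    """Agglomerative average-linkage clustering cut into n_clusters flat clusters."""
    condensed = squareform(distance_matrix, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

The linkage array `tree` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the hierarchy itself. Swapping the mean absolute difference for a root-mean-square difference gives the Euclidean variant mentioned in the text.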
As in Section 3, we compute distances $d((m, n), (m', n'))$ over the entire period, rather than stratifying according to different macroeconomic characteristics, in order to address 'all weather' risk reduction across a range of market scenarios. The resulting dendrogram from this analysis is shown in Figure 6. Clusters of similar evolution in their collective behaviour are identified as blue blocks along the anti-diagonal. The corresponding portfolio combinations are shown on the far left of Figure 6. Darker blue colouring of a square block corresponds to less distance between evolutionary paths, and a higher degree of affinity. The dendrogram exhibits 3 primary subclusters and a small outlier cluster. The outlier cluster (orange leaves) consists only of portfolios (2, 2) and (2, 3), which contain just 4 and 6 equities respectively and which provide the least diversification benefit (cf. Table 1). Directly above the outlier cluster is a subcluster of 9 relatively small portfolios ranging from (3, 2) to (2, 7) on the left-side panel. Excluding the outliers, this subcluster provides the least diversification benefit to an investor. The largest portfolio in this cluster is portfolio (2, 9) with 18 equities and the smallest portfolio is portfolio (3, 2) with only 6 equities. Both portfolios exhibit similar temporal evolution in terms of the median eigenvalues and hence similar levels of risk reduction. This confirms again that, when constructing a portfolio, it is more advantageous to increase the number of sectors than the number of equities held per sector. The predominant subcluster in Figure 6 spans (according to the labels on the left-hand side) portfolios (7, 4) to (9, 8). It can be further subdivided into two subclusters. The first subcluster consists of portfolios ranging from (7, 4) to (5, 5).
The largest portfolio in this cluster is portfolio (5, 9) with 45 equities and the smallest is portfolio (7, 3) with 21 equities, both providing a similar level of risk reduction. Again, it is seen that diversifying across sectors is much more effective in terms of diversification benefits than simply increasing the number of equities. The second of the two subclusters spans (according to the left-hand side panel) portfolios (7, 6) to (9, 8). This subcluster contains the portfolios which behave most similarly in terms of their median eigenvalues, as seen by the dark blue colour. It contains the largest portfolio (9, 9), and its smallest member is our designated portfolio (9, 4). This has several important implications for equity-based portfolio management. First, there is an old adage in financial markets that 30-40 equities are sufficient for diversification and elimination of unsystematic portfolio risk. The (9, 4) portfolio, composed of 36 equities, fits nicely into this range. Second, it is of great relevance to retail and cost-conscious investors that a (9, 4) portfolio provides a nearly identical diversification benefit to a (9, 9) portfolio, more than twice its total size.

Figure 6: Hierarchical clustering between pairs (m, n) of different portfolio structures, using the L^1 metric between median functions (8). Two outlier portfolios (the smallest in size) are revealed, followed by three subclusters of the majority collection. Of greatest interest is the dense dark partition of high similarity ranging from (7, 6) to (9, 8). This contains both the (9, 4) and (9, 9) portfolios, revealing that a 36-equity portfolio of 9 sectors and 4 equities per sector attains a near-identical diversification benefit to the largest possible portfolio.
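The way a 'best value' portfolio is read off the dendrogram can be sketched in code: cut the hierarchy into a fixed number of clusters, locate the cluster containing (9, 9), and take its member with the fewest equities m × n. This is a hedged illustration only. The helper `smallest_peer`, the toy trajectories and the choice of cutting into 4 clusters (mirroring the three subclusters plus the outlier cluster) are assumptions, and the toy data are not expected to reproduce the (9, 4) result.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def smallest_peer(pairs, medians, target=(9, 9), n_clusters=4):
    """Return the smallest structure clustered with `target`, plus all members."""
    T = medians.shape[1]
    Z = linkage(pdist(medians, metric="cityblock") / T, method="average")
    # Cut the hierarchy into at most `n_clusters` flat clusters.
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    target_label = labels[pairs.index(target)]
    members = [p for p, lab in zip(pairs, labels) if lab == target_label]
    # Rank by total number of equities m * n; ties broken by the pair itself.
    best = min(members, key=lambda p: (p[0] * p[1], p))
    return best, members


pairs = [(m, n) for m in range(2, 10) for n in range(2, 10)]
rng = np.random.default_rng(1)
T = 200
# ASSUMPTION: toy trajectories in which the number of sectors m matters more
# than the number of equities per sector n. Not the data of the paper.
medians = np.array([
    0.3 + 0.6 / m + 0.1 / n + 0.005 * rng.standard_normal(T)
    for m, n in pairs
])

best, members = smallest_peer(pairs, medians)
print("smallest portfolio clustered with (9, 9):", best)
print("cluster size:", len(members))
```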
We have used spectral and graph-theoretic characteristics of the cross-correlation matrix of the log returns of equities in the US market from 2000 until 2020 to quantify the collective behaviour of equities over time, as a diagnostic for potential diversification benefits in terms of identifying the dominance of systematic risk over unsystematic risk. We found that the leading eigenvalue, a uniformity measure and modularity can all be used to detect dominant collective behaviour in the market such as the GFC and the COVID-19 crisis, as well as identify bear markets as encountered during the period from 2016-2019. We then studied the properties of random portfolios of a specific size. A major takeaway from our portfolio sampling and hierarchical clustering analysis is the identification of a best value 'all weather' portfolio consisting of 4 equities chosen from each of 9 sectors, totalling 36 equities. The sampling procedure and respective dendrogram highlight that this portfolio provides a reduction in unsystematic risk comparable to that of the largest and most diversified portfolio, which consists of 9 equities chosen from each of 9 sectors, totalling 81 equities. The findings in this paper highlight optimal equity sector diversification strategies during a 20-year period which includes multiple periods of economic crisis as well as periods of stability, and hence provide guidance for portfolio construction in an 'all weather' setting which is agnostic to the current macroeconomic environment. We verified that the actual choice of sectors and equities is not important in terms of risk reduction: the optimal (9, 4) portfolios exhibit very little spread, again comparable to the spread incurred by the largest (9, 9) portfolio. This supports the widely known rule of thumb that a portfolio consisting of 30-40 equities is sufficient to reduce unsystematic risk.
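As an illustration of the spectral diagnostic summarised above, the following sketch computes the leading eigenvalue of the cross-correlation matrix of log returns and divides it by the number of equities N. Dividing by N is an assumed normalisation, and the one-factor toy price generator `toy_prices` with its parameter `beta` is hypothetical, introduced only to contrast strongly and weakly collective markets.

```python
import numpy as np


def normalised_leading_eigenvalue(prices):
    """Leading eigenvalue of the correlation matrix of log returns, divided by N."""
    log_returns = np.diff(np.log(prices), axis=0)   # shape (T - 1, N)
    corr = np.corrcoef(log_returns, rowvar=False)   # shape (N, N)
    eigenvalues = np.linalg.eigvalsh(corr)          # returned in ascending order
    return eigenvalues[-1] / corr.shape[0]


def toy_prices(beta, n_days=500, n_equities=20, seed=0):
    """ASSUMPTION: one-factor toy market; `beta` sets the common-factor strength."""
    rng = np.random.default_rng(seed)
    market = rng.standard_normal((n_days, 1))
    idiosyncratic = rng.standard_normal((n_days, n_equities))
    returns = 0.01 * (beta * market + np.sqrt(1.0 - beta**2) * idiosyncratic)
    return 100.0 * np.exp(np.cumsum(returns, axis=0))


crisis_like = normalised_leading_eigenvalue(toy_prices(beta=0.9))
calm_like = normalised_leading_eigenvalue(toy_prices(beta=0.1))
print(f"strong collective behaviour: {crisis_like:.3f}")
print(f"weak collective behaviour:   {calm_like:.3f}")
```

A value close to 1 indicates that a single collective mode explains most of the variance, so little unsystematic risk is left to diversify away; a value close to 1/N indicates largely independent equities.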
Our results demonstrate that there is significantly greater benefit in diversifying equity portfolios across sectors than within sectors: a (9, 4) portfolio provides significantly larger risk reduction than, for example, a (4, 9) portfolio of equal total size. Reassuringly, for the optimal (9, 4) portfolio we found that the risk reduction does not depend strongly on the actual choice of sectors and equities over a long 20-year investment period. There are several avenues of potential future research. First, it would be interesting to consider a market consisting of more than a single asset class and to include asset classes such as fixed income, currencies, commodities, cryptocurrencies and other alternative asset classes. In particular, it would be interesting to see if the graph-theoretic approach is able to identify separate community behaviours based on different asset classes, and if these could be further broken down into underlying constituent groupings (such as equity sectors). Similarly, it would be interesting to extend the portfolio sampling to include other asset classes. It is possible (and quite likely) that including more asset classes is conducive to the diversification of portfolios and reduces the tendency for correlated collective portfolio behaviour. Second, one could study phenomena similar to those explored in this paper in different geographies. It is possible that in some countries, market dynamics may be more or less correlated than those of the US equity market. Finally, one could extend the portfolio sampling procedure to consider portfolio returns, in addition to risk. This paper specifically deals with the concept of portfolio diversification from the standpoint of reducing collective behaviours. If one were to consider the returns (in addition to the risk) in various portfolio settings, this could potentially be of great interest to the community of financial market researchers.
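To make the across-versus-within comparison concrete, the sketch below draws one (9, 4) and one (4, 9) portfolio of equal size from a sector-labelled universe. It is a minimal sketch under assumptions: the function `sample_portfolio`, the sector names and the tickers are hypothetical placeholders, and uniform sampling without replacement is assumed at both the sector and the equity level.

```python
import numpy as np


def sample_portfolio(sector_to_equities, m, n, rng):
    """Draw m sectors uniformly, then n equities uniformly within each sector."""
    sectors = rng.choice(sorted(sector_to_equities), size=m, replace=False)
    portfolio = []
    for sector in sectors:
        picks = rng.choice(sector_to_equities[sector], size=n, replace=False)
        portfolio.extend((str(sector), str(ticker)) for ticker in picks)
    return portfolio


# ASSUMPTION: hypothetical universe of 9 sectors with 12 equities each.
universe = {
    f"sector_{s}": [f"S{s}_E{e}" for e in range(12)] for s in range(9)
}

rng = np.random.default_rng(0)
across = sample_portfolio(universe, m=9, n=4, rng=rng)  # 9 sectors x 4 equities
within = sample_portfolio(universe, m=4, n=9, rng=rng)  # 4 sectors x 9 equities

print(len(across), len({sector for sector, _ in across}))
print(len(within), len({sector for sector, _ in within}))
```

Repeating such draws many times and computing a collectivity measure for each draw would give the distribution whose median and spread are compared across portfolio structures.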
All equity data is obtained from Bloomberg (https://www.bloomberg.com).

Portfolio selection
How markets process information: News releases and volatility
Economic news and bond prices: Evidence from the U.S. Treasury market
Real-time price discovery in global stock, bond and foreign exchange markets
Collective behavior of stock price movements in an emerging market
An analysis of cross-correlations in an emerging market
Temporal evolution of financial-market correlations
Identifying states of a financial market
Uncovering the dynamics of correlation structures relative to the collective market motion
Noise dressing of financial correlation matrices
Random matrix approach to cross correlations in financial data
Quantifying and interpreting collective behavior in financial markets
Topology of correlation-based minimal spanning trees in real and model markets
Dynamics of market correlations: Taxonomy and portfolio analysis
Kertész, Clustering and information in correlation based financial networks
Random matrix theory analysis of cross correlations in financial markets
Systematic analysis of group identification in stock markets
Information-theoretic approach to lead-lag effect on financial markets
Networks in financial markets based on the mutual information rate
Hierarchical PCA and applications to portfolio management
Competition of noise and collectivity in global cryptocurrency trading: Route to a self-contained market
Some studies of variability of returns on investments in common stocks
Investment policy statement: Elements of a clearly defined IPS for non-profits
Principal component analysis
A tutorial on spectral clustering
Community detection in graphs
Dynamics, behaviours, and anomaly persistence in cryptocurrencies and equities surrounding COVID-19
Fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python