key: cord-0502987-4rdctxf5 authors: Colladon, A. Fronzetti; Grassi, S.; Ravazzolo, F.; Violante, F. title: Forecasting financial markets with semantic network analysis in the COVID-19 crisis date: 2020-09-09 journal: nan DOI: nan sha: 2b105ed8562ee6a03d1958f444d6a3652a08a866 doc_id: 502987 cord_uid: 4rdctxf5
This paper uses a new textual data index for predicting stock market data. The index is applied to a large set of news to evaluate the importance of one or more general economic related keywords appearing in the text. The index assesses the importance of the economic related keywords based on their frequency of use and semantic network position. We apply it to the Italian press and construct indices to predict Italian stock and bond market returns and volatilities in a recent sample period, including the COVID-19 crisis. The evidence shows that the index captures the different phases of financial time series well. Moreover, results indicate strong evidence of predictability for bond market data, both returns and volatilities, short and long maturities, and for stock market volatility.
1 Introduction
In order to make informed decisions, individuals attempt to anticipate future movements of economic variables and business cycle dynamics. This is particularly important for investment decisions in financial markets. An agent can gain from predicting future market needs and investing early, satisfying market demands when they appear. There is a large literature on predicting financial markets and many indicators have been used as predictors. However, the predictive power of such indicators is often questionable and evidence of systematic predictability is weak, see Welch and Goyal (2008).
Indeed, as early as March 2020, markets collapsed, with weekly losses above 30%, well before macroeconomic indicators were impacted and before accurate information about the severity of the spread of the coronavirus in Europe and in the US was available. In this context, Italy represents a peculiar case. It was the first country in Europe to experience a major outbreak of COVID-19, with a much higher mortality rate than that observed in other countries. This crisis hit Italy at a moment when public finances were already under stress. In Italy, the country with the third largest public debt in the world after the US and Japan, the COVID-19 crisis has sharpened an already difficult economic situation, raising serious doubts about the short-term sustainability of the economy as well as the long-term outlook. From the early summer of 2020, signals and actions from the European Central Bank and the European Union in support of the members' economies and health systems have somewhat alleviated the burden of an otherwise extensively paralyzed economic environment. Such a period of economic and social turmoil, in which news media have not merely broadcast information but have also conveyed perceptions and expectations about future states of the economy, represents an unprecedented test ground to evaluate the link between news media information and macro-finance variables.
This paper introduces a new index of semantic importance. The index is based on a novel methodology that evaluates the relative importance of one or more general economic related keywords (ERKs) that appear in the news. The index, whose construction combines methods drawn from both network analysis and text mining, evaluates semantic importance along the three dimensions of prevalence, i.e. frequency of word occurrences, connectivity, i.e.
degree of centrality of a word in the discourse, and diversity, i.e. richness and distinctiveness of textual associations. We identify 38 relevant ERKs suited to our scope. Using a large database of articles published by a pool of Italian newspapers over the period between January 2017 and April 2020, we assign a score to each ERK compounding the three dimensions mentioned above. We then aggregate the information from the 38 ERKs in a single composite news index. For the aggregation, we apply Partial Least Squares (PLS) between the target variable and the (38 ERKs) predictors incorporating [...]
While existing empirical literature in this context has focused on stock market return predictability, see e.g. Baker and Wurgler (2006), Baker and Wurgler (2007), Mettere (2012) and Limongi Concetto and Ravazzolo (2019) among others, we evaluate the power of our media news index in predicting not only the Italian stock market aggregate return but also various short and long maturity government bond index returns as well as their volatility. Periods of large movements in the stock and bond markets are associated with political instability and economic uncertainty. Our findings show that the index is able to anticipate the different phases of the market and capture idiosyncratic features of each series. We find evidence of economically meaningful and statistically significant predictability for government bond returns and volatilities at all maturities. For stock market data, we find evidence of predictability of the market portfolio returns only to a limited extent. However, when predicting stock market volatility, adding information contained in news media improves the prediction accuracy by up to 8% compared to standard benchmark forecast models.
The remainder of the paper is organized as follows. Section 2 introduces our new textual data index.
Section 3 provides a detailed description of the data employed, the methodological strategy used to predict financial time series with textual data, and the results of our analysis. Finally, Section 4 concludes.
2 A new index for textual data
In this paper we propose a novel measure of text importance that combines methods drawn from both social network analysis and text mining. The index aims at measuring the relative importance of a predefined set of words mentioned in a large set of textual documents. The methodology, labelled Semantic Brand Score, was introduced by Fronzetti Colladon (2018) for application to commercial brands' reputation and awareness, but has never been applied in the economic and financial environment. Starting from word frequency as a natural measure of importance within a text (Piantadosi, 2014), the associations that a word has in the text, as well as the heterogeneity of its context, are used as pivotal additional variables for a comprehensive assessment. Our index explicitly exploits the relationships among words in a text. To this end, texts are transformed into networks of co-occurring words and relationships are studied through social network analysis, see Wasserman and Faust (1994). As an example consider the following sentence (and a word [...]
The index measures words' semantic importance, i.e. in our context that of the selected ERKs, along the three dimensions of prevalence, diversity and connectivity. Prevalence, which relates to the notion of awareness, see Keller (1993), measures how frequently an ERK is mentioned, the rationale being that an ERK used frequently is easier to remember and more recognizable. Prevalence is calculated as the frequency of a word in a given set of documents and time-frame.
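To make the construction concrete, the following Python sketch builds a small co-occurrence network and computes prevalence as a raw keyword count. It is an illustration only: the function names, the toy documents and the co-occurrence window of two tokens are assumptions and are not taken from the paper.

```python
from collections import Counter


def cooccurrence_edges(docs, window=2):
    """Return undirected edge weights {(u, v): count} with u < v.

    docs is a list of token lists (assumed already pre-processed).
    Two words are linked when they appear within `window` tokens
    of each other; the window size is an assumption of this sketch.
    """
    edges = Counter()
    for tokens in docs:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:
                    edges[tuple(sorted((u, v)))] += 1
    return edges


def prevalence(docs, keyword):
    """Prevalence as the raw frequency of `keyword` across all documents."""
    return sum(tokens.count(keyword) for tokens in docs)


# Toy documents (hypothetical, for illustration only).
docs = [
    ["debt", "crisis", "hits", "market"],
    ["market", "fears", "debt", "default"],
]
edges = cooccurrence_edges(docs, window=2)
print(prevalence(docs, "debt"))
print(edges[("debt", "market")])
```

The edge dictionary produced here is the kind of weighted network on which the diversity and connectivity dimensions are then evaluated.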
Prevalence of a particular set of words could ultimately influence the opinions and behaviors of the readers. For instance, in our context, the recurrent use of specific combinations of words may trigger fear or an optimistic view about the current, as well as future, states of the economy.
Diversity measures the degree of heterogeneity of the semantic context in which a word is used, with emphasis on the richness and distinctiveness of its textual associations. Diversity is defined by the number and uniqueness of connections a word has in the co-occurrence network and it is measured by the distinctiveness centrality metric, see Fronzetti Colladon and Naldi (2020). More precisely, in a graph of $n$ nodes (words) and $E$ edges (e.g. 1), distinctiveness of node $i$ is given by:
$$D(i) = \sum_{j=1,\, j \neq i}^{n} \log_{10}\!\left(\frac{n-1}{g_j}\right) I(w_{ij} > 0) \tag{1}$$
where $W$ is the set of weights associated to each edge; $g_j$ is the degree of node $j$, which is a neighbor of node $i$; $I(\cdot)$ is an indicator function that equals 1 if there is an edge that connects nodes $i$ and $j$ with positive weight, $w_{ij}$. We postulate that the degree of diversity, i.e. whether a set of words is of interest to a specific category rather than a variety of economic actors, e.g. institutional investors, policy makers, private investors, etc., provides relevant information about how pervasive a topic is in the weave of the economy.
The third dimension of the index, i.e. connectivity, assesses the weighted betweenness centrality of the ERKs, see Brandes (2001) and Freeman (1978). Connectivity measures how much a word is embedded in a discourse acting as a bridge between its parts, or more specifically, how often a word appears in-between the network paths which interconnect the other words in the text.
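As an illustration of the diversity dimension, the sketch below computes a distinctiveness score on a toy co-occurrence network. It assumes the form $D(i) = \sum_{j \neq i} \log_{10}((n-1)/g_j)\, I(w_{ij} > 0)$, which matches the symbol definitions given above; the function name, the edge dictionary layout and the example words are hypothetical.

```python
import math


def distinctiveness(edges, nodes):
    """Distinctiveness score per node under the assumed formula.

    edges: {(u, v): weight} for an undirected network.
    nodes: all words in the network (n = number of nodes).
    A neighbor j contributes log10((n - 1) / g_j), where g_j is
    the degree of j, so links to rarely connected words count more.
    """
    nodes = list(nodes)
    n = len(nodes)
    neighbors = {u: set() for u in nodes}
    for (u, v), w in edges.items():
        if w > 0:  # indicator I(w_ij > 0)
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {u: len(nb) for u, nb in neighbors.items()}
    return {
        i: sum(math.log10((n - 1) / degree[j]) for j in neighbors[i])
        for i in nodes
    }


# Toy network (hypothetical words and weights).
edges = {
    ("debt", "crisis"): 3,
    ("debt", "market"): 1,
    ("debt", "bond"): 2,
    ("market", "bond"): 1,
}
nodes = ["debt", "crisis", "market", "bond"]
scores = distinctiveness(edges, nodes)
for word in nodes:
    print(word, round(scores[word], 4))
```

In this toy network "crisis" scores zero because its only neighbor, "debt", is connected to every other word, so the link carries no distinctive information.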
Following Wasserman and Faust (1994), for node $i$ we have:
$$C(i) = \sum_{j < k} \frac{d_{jk}(i)}{d_{jk}} \tag{2}$$
where $d_{jk}(i)/d_{jk}$ is the proportion of shortest network paths connecting nodes $j$ and $k$ (measured by edge weights) that include the node $i$.
Finally, an index is constructed as a composite score obtained by summing the standardized measures of prevalence, diversity and connectivity discussed [...]
[...] large stock market, among the most liquid in Europe and, with the third largest sovereign debt in the world after Japan and the US. This makes the Italian government debt market very attractive, and thus liquid, at all maturities. We assess the predictive power of textual news information as a [...]
Table 1: Descriptive statistics (mean, standard deviation, maximum and minimum) of the weekly FTSE MIB and the 2, 5, 10 and 30 years maturity BTP indices return and realized variance.
[...] European Central Bank, and exhibit an upward sloped volatility term structure.
Choosing pertinent keywords to search in a database of newspaper articles is crucial for the construction of an informative textual index. As documented in the literature, such a choice is non-trivial because word meaning can vary across fields and users. Prior to the computation of the semantic network index, common text pre-processing routines [...]
The database of Italian news, published between January 2, 2017 and April 24, 2020, contains more than 579,000 news articles. After pre-processing and removal of stop-words, the database comprises about 361,614,000 words. For each news item we only consider the title and the lead, i.e. the initial 30% of text, ignoring the remaining part. This is consistent with previous work, which suggested that semantic importance indices are more informative when calculated on the news parts that better capture the readers' attention, i.e.
the title and the lead, see Fronzetti Colladon (2020). This is also aligned with past research, which showed that a large share of internet users only read the beginning of online articles, see Nielsen and Loranger (2006) among others. We calculate an index, as detailed in Section 2, for each ERK listed in [...] predictors incorporating information from both the definition of scores and loadings. Therefore, our measure is a series-specific index.
We opt for a simple forecasting model, i.e. the ARX(1):
$$y_{t+1} = \alpha + \beta y_t + \gamma' x_t + \varepsilon_{t+1} \tag{3}$$
where $y_{t+1}$ is the target variable we aim at predicting, $x_t$ is a set of news information predictors and $\varepsilon_{t+1} \sim WN(0, \sigma^2)$. An obvious choice for $x_t$ is the pool (or a subset) of the 38 indices assigned to the ERKs listed in Table 2. If, on the one hand, this approach enables identifying and isolating ERKs with predictive power from the rest, on the other hand, the large set of regressors, relative to the limited number of observations in our sample, may generate an undesirable level of uncertainty. To circumvent this problem we convey the information contained in the index associated with the individual (sets of) keywords using two alternative aggregation approaches. The first approach extracts, from the pool of 38 ERK variables, one or more common factors by means of partial least squares (PLS).² The PLS is computed individually for each series and repeated for each vintage for which forecasts are produced, therefore using real-time information. The competing aggregation method, following Stock and Watson (1999) and Timmermann (2006), consists in computing multiple forecasts for each target variable using (3) and one predictor at a time, and then in aggregating those forecasts. We use the equal weight combination:
$$\hat{y}_{t+1} = \sum_{i=1}^{38} w_i \, \hat{y}_{i,t+1} \tag{4}$$
where $\hat{y}_{i,t+1}$ is the forecast of $y_{t+1}$ generated by the linear ARX(1) using $x_{i,t}$ and $w_i = 1/38$, $i = 1, \ldots, 38$.
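The equal weight scheme can be sketched in a few lines of Python. The sketch assumes an ARX(1) with an intercept and a single predictor, $y_{t+1} = \alpha + \beta y_t + \gamma x_t + \varepsilon_{t+1}$, estimated by ordinary least squares, and averages the single-predictor forecasts with equal weights. The estimation method, the function names and the toy data are assumptions made for illustration; the PLS aggregation is not covered.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (small problems)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def arx1_forecast(y, x):
    """One-step-ahead ARX(1) forecast from aligned series y and x."""
    # Regress y_{t+1} on (1, y_t, x_t) over the available sample.
    X = [[1.0, y[t], x[t]] for t in range(len(y) - 1)]
    alpha, beta, gamma = ols(X, y[1:])
    return alpha + beta * y[-1] + gamma * x[-1]


def equal_weight_forecast(y, predictors):
    """Average of the single-predictor ARX(1) forecasts (w_i = 1/N)."""
    forecasts = [arx1_forecast(y, x) for x in predictors]
    return sum(forecasts) / len(forecasts)


# Toy data (illustrative only): a target and two hypothetical keyword indices.
y = [0.1, 0.4, -0.2, 0.3, 0.5, -0.1, 0.2, 0.0]
x1 = [1.0, -0.5, 0.8, 1.2, -0.3, 0.4, 0.1, 0.6]
x2 = [0.2, 0.9, -0.4, 0.1, 0.7, -0.6, 0.3, 0.5]
print(equal_weight_forecast(y, [x1, x2]))
```

In a real-time exercise the estimation would be repeated at each forecast origin using only the data available at that date.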
We contrast the predictive performance of the model exploiting textual news information against two standard benchmarks: the white noise model when the target is the return of either stocks or bonds³ and the AR(1) model when we aim at forecasting volatility. Welch and Goyal (2008) [...]
In-sample evidence
Results from Inoue and Kilian (2004) imply that in-sample predictability is a necessary condition for out-of-sample predictability. To assess the degree of in-sample fit, in Figure 3 we compare the ARX model including the semantic importance index to the nested benchmark [...]
² We have also tried principal component analysis (PCA), but results were inferior and are not reported. We believe that the total number of ERKs is limited and PCA is less adequate in this case.
³ The white noise model for returns corresponds to the driftless random walk model for prices. We refer to it, labelled RW, in the remaining part of the paper.
⁴ We have also applied the RW no-change benchmark to volatility predictions; results are substantially inferior to those of the AR models and are not reported in the text.
[...] BTP maturity it achieves lower MSPE than the RW. The forecast combination of ARX (EW) is statistically superior to the benchmark only for the 2-year bond return, yet falls behind the ARX model using the aggregated index.
Volatility out-of-sample results
Panel B of Table 3 provides results on volatility predictions. In this case, we observe the largest gains, in terms of forecast accuracy, obtained by using the semantic index. This result suggests that textual news data convey relevant information about stock market uncertainty. The reduction in MSPE, when forecasting stock market volatility using the ARX compared to the benchmark, is up to 8% and it is statistically significant at the 1% level.
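For reference, the percentage MSPE reduction relative to a benchmark can be computed as in the sketch below. The numbers are made up for illustration, the function names are hypothetical, and the test of statistical significance is not reproduced.

```python
def mspe(actual, forecast):
    """Mean squared prediction error over the evaluation sample."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mspe_reduction_pct(actual, model_fc, bench_fc):
    """Percentage MSPE reduction of the model relative to the benchmark.

    Positive values mean the model is more accurate than the benchmark.
    """
    return 100.0 * (1.0 - mspe(actual, model_fc) / mspe(actual, bench_fc))


# Made-up realized values and forecasts (illustration only).
actual = [1.0, 2.0, 3.0, 4.0]
bench_fc = [0.0, 0.0, 0.0, 0.0]
model_fc = [1.0, 2.0, 3.0, 2.0]
print(mspe(actual, bench_fc))
print(mspe(actual, model_fc))
print(mspe_reduction_pct(actual, model_fc, bench_fc))
```

A value of 8 from `mspe_reduction_pct` would correspond to the 8% reduction quoted in the text.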
When predicting bond volatility, the ARX statistically outperforms the benchmark for all maturities, with the largest economic gains observed for the 10-year maturity (6% improvement). The forecast combination does not measure up, often proving statistically inferior to the benchmark. This result stresses the importance of the index construction to avoid an unfavourable signal-to-noise balance. Indeed, textual data helps to improve forecast accuracy provided specific key- [...]
This paper introduces a new textual data index for predicting stock market data. The index is based on a novel methodology applied to a large set of newspaper articles to evaluate the importance of one or more general economic related keywords that appear in a text. The index considers three dimensions: prevalence, connectivity and diversity. The methodology is applied to the Italian press and 38 economic related keywords are selected. The resulting index is used to predict the Italian stock market and government bond returns and volatilities in the 2017-2020 period, including the inception of the COVID-19 crisis. Our findings show that the semantic network index based on media news text data is able to capture the different phases and individual features of return and volatility dynamics of financial variables. Periods of large movements in the index are associated with political and economic instability. When the index is used to predict weekly market and bond returns and volatilities, we find strong evidence of predictability of bond returns and volatility at all maturities, as well as of stock market volatility.
Investor Sentiment and the Cross-Section of Stock Returns
Investor Sentiment in the Stock Market
Stock Prices, News, and Economic Fluctuations
News-Driven Business Cycles: Insights and Challenges
A Faster Algorithm for Betweenness Centrality
Forecasting Cryptocurrencies Under Model and Parameter Instability
SIMPLS: An Alternative Approach to Partial Least Squares Regression
Term Structure Forecasting using Macro Factors and Forecast Combination
Comparing Predictive Accuracy. Economic Statistics
Centrality in Social Networks: Conceptual Clarification
The Semantic Brand Score
Brand Intelligence Analytics
Using Social Network and Semantic Analysis to Analyze Online Travel Forums and Forecast Tourism Demand
Distinctiveness Centrality in Social Networks
In-Sample or Out-of-Sample Tests of Predictability: Which One Should We Use?
Conceptualizing, Measuring, and Managing Customer-Based Brand Equity
The Value of News for Economic Developments
Optimism in Financial Markets: Stock Market Returns and Investor Sentiments
When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks
Empirical Exchange Rate Models of the Seventies: Do They Fit Out of Sample
When Does Investor Sentiment Predict Stock Returns?
Prioritizing Web Usability. Pearson Education
The Extreme Value Method for Estimating the Variance of the Rate of Return
Python 3 Text Processing with NLTK 3 Cookbook
Zipf's Word Frequency Law in Natural Language: A Critical Review and Future Directions
The role of medium size
Forecasting inflation
Forecast Combinations
Social network analysis
A Comprehensive Look at The Empirical Performance of Equity Premium Prediction
The Porter Stemming Algorithm: Then and Now. Program Electronic Library s