Text as data: a machine learning-based approach to measuring uncertainty

Rickard Nyman and Paul Ormerod

11 June 2020

The Economic Policy Uncertainty index has gained considerable traction with both academics and policy practitioners. Here, we analyse news feed data to construct a simple, general measure of uncertainty in the United States using a highly cited machine learning methodology. Over the period January 1996 through May 2020, we show that the series unequivocally Granger-causes the EPU, and that there is no Granger-causality in the reverse direction.

Following the seminal work of Baker et al. (2016), the Economic Policy Uncertainty index (EPU) has gained considerable influence. The original paper now has almost 4,000 citations according to Google Scholar, and the index itself appears in speeches and papers by central bankers.

The website dedicated to the EPU (https://www.policyuncertainty.com/index.html) offers two different versions of the data. The first is based purely on a quantification of newspaper coverage of policy-related economic uncertainty; for the United States, this is based on a monthly search of 10 large newspapers. We refer to this below as EPUNEWS. The second includes EPUNEWS as a component, but also incorporates information on the tax code and on disagreement amongst economic forecasters in the Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters. We refer to this as EPUGEN (for "general"). Full details of the construction of both indices are provided on the website cited above.

The indices do not simply provide information about the current level of economic policy uncertainty. The website states that: "A significant dynamic relationship exists between our economic policy uncertainty index and real macroeconomic variables. We find that an increase in economic policy uncertainty as measured by our index foreshadows a decline in economic growth and employment in the following months".

Here, we report a measure of uncertainty (UNCERT) which Granger-causes both EPUNEWS and EPUGEN in the United States over the period January 1996 through May 2020. In other words, changes in this measure systematically lead changes in both EPUNEWS and EPUGEN. There is no evidence of causation from the EPU indices to UNCERT. UNCERT is constructed from the Reuters newsfeed for the United States using the unsupervised learning algorithm for obtaining vector representations of words developed by Pennington et al. (2014).

Section 2 describes the construction of UNCERT and compares the three variables discussed above. Section 3 provides a summary of the results. An extended version of the results is set out in the Appendix.

Over the course of the most recent decade, important advances have been made in machine learning in the conversion of text into some form of quantitative representation. A recent paper in the Journal of Economic Literature by Gentzkow et al. (2019) draws attention to the potential which these algorithms create for economists and other quantitative social scientists: "New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research" (p.535).
We use an approach which has become standard in machine learning, known as GloVe (Pennington et al., op. cit.). A clear overview, with a description of how to download and use the method, is given at https://nlp.stanford.edu/projects/glove/.

The authors assemble a very large corpus of words from various sources. We use the one described on the GloVe website as Common Crawl (glove.42B.300d.zip). A co-occurrence matrix is constructed, which describes how frequently pairs of words co-occur with each other in any given corpus. The webpage referenced above states: "The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Owing to the fact that the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space".

The eventual output of the process is that every word in the corpus has a unique n-dimensional vector associated with it. The elements of each vector are real-valued numbers which essentially describe the closeness of the word to all other words in the corpus. This description is perforce rather imprecise; it is only intended to give a broad, non-technical indication of what is going on. Full technical details are in Pennington et al. (op. cit.).

We apply the algorithm to the Reuters newsfeed over the period 1 January 1996 through 31 May 2020. To restrict the analysis to the US, we analyse all stories published by the New York and Washington offices, amounting to a total of 2,540,233 articles. The basic analysis is carried out on a daily basis, and the results are aggregated to a monthly frequency.

Our basic approach is simply to count the number of times the word "uncertainty" appears in the Reuters newsfeed each day. But we add to this word the four words closest to it, identified by the GloVe methodology. These are: "uncertainties", "uncertain", "unpredictability" and "ambiguity". We then scale the raw data by counting the number of articles that mention at least one of the words, divided by the total number of articles. The scale is therefore the proportion of articles that match the keyword search.
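As a purely illustrative sketch of these two steps, the R code below loads the pre-trained Common Crawl vectors, finds the words nearest to "uncertainty" by Euclidean distance, and converts the daily keyword counts into a proportion of articles. It is not the authors' pipeline: the file path and the articles data frame (assumed to have columns date and text) are assumptions made only for the illustration.

```r
# Minimal sketch (not the authors' code): nearest GloVe words to "uncertainty"
# and the daily proportion of articles matching the keyword search.
# Assumes glove.42B.300d.txt (a large file) has been unzipped locally, and that
# `articles` is a data frame with columns `date` and `text`.

library(data.table)

glove <- fread("glove.42B.300d.txt", header = FALSE, quote = "", data.table = FALSE)
vecs  <- as.matrix(glove[, -1])
rownames(vecs) <- glove[[1]]

# Euclidean distance from every word vector to the vector for "uncertainty"
target <- vecs["uncertainty", ]
dists  <- sqrt(rowSums(sweep(vecs, 2, target)^2))

# The four nearest words, excluding "uncertainty" itself
nearest  <- sort(dists)[2:5]
keywords <- c("uncertainty", names(nearest))

# Proportion of articles each day mentioning at least one of the keywords
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
articles$hit <- grepl(pattern, tolower(articles$text))
uncert_daily <- aggregate(hit ~ date, data = articles, FUN = mean)
```

The daily proportions can then be averaged within each month to give the monthly series discussed below.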
The closest word to "uncertainty" (except, of course, for the word itself) is "uncertainties". The Euclidean distance between the vector associated with "uncertainty" and the vector associated with "uncertainties" is 5.40. Of the 1.9 million words in the corpus, the Euclidean distance to the one furthest away is 17.70; the median is 8.64 and the standard deviation 0.68. Figure 1 plots the Euclidean distance of the 200 words closest to "uncertainty".

Figure 1: Euclidean distance from the GloVe vector of "uncertainty" of the 200 words nearest to it

As can be seen, the distance rises quite rapidly, and we select, as mentioned, the four nearest. The monthly count of the five words is plotted with EPUNEWS in Figure 2 for comparison, with both series expressed in the same units of measurement so that they can be compared easily.

Figure 2: The Uncertainty Index derived from the Reuters newsfeed and the Economic Policy Uncertainty Index based solely on news, January 1996 - May 2020

In general, the two series move closely together. Uncertainty was high in the aftermath of the dot-com boom in the early 2000s, and again during the financial crisis. After a temporary fall, it rose again in the early 2010s. The economic recovery in America, whilst stronger than it was in Europe, was weak by historical standards. Uncertainty appears to have been a factor in this. There is a marked divergence between the two series in April and May 2020, the period of the Covid crisis. One possible explanation is that although output dropped dramatically, the balance sheets of major companies were in general sufficiently strong to dampen doubts about any substantial defaults.

We carry out Granger causality tests between UNCERT and, separately, both EPUNEWS and EPUGEN. We do this both over the full sample period, January 1996 through May 2020, and over each of two sub-samples, January 1996 through December 2007 and January 2008 through May 2020. The main reason is to check whether there is any difference in the results between the pre- and post-financial crisis periods, though by coincidence this split gives sub-samples of very similar length.

We use the methodology described in Toda and Yamamoto (1995). In outline, the procedure for investigating Granger causality between any two series is as follows:

1. Check the order of integration of the two series using the Augmented Dickey-Fuller and Kwiatkowski-Phillips-Schmidt-Shin tests. Let m be the maximum order of integration found.
2. Specify the VAR model using the data in levels, whatever was found in step 1. Determine the number of lags to use with a standard method; we use the Akaike Information Criterion.
3. Check the stability of the VAR using OLS-CUSUM plots.
4. Test for autocorrelation of the residuals. If autocorrelation is found, increase the number of lags until it goes away. We use the multivariate Portmanteau and Breusch-Godfrey tests for serially correlated errors. Let p be the number of lags then used.
5. Add m extra lags of each variable to the VAR.
6. Perform Wald tests with the null hypothesis that the first p lags of the independent variable have coefficients equal to zero. If this is rejected, we have evidence of Granger-causality from the independent to the dependent variable.

In the Appendix we describe the specific functions used in R and some further details of the results; a short illustrative sketch of the procedure is also given at the end of this section. A complete set of results, including the R commands used at each stage of the process, is available from Nyman (rickard.nyman.11@ucl.ac.uk).

The results are unequivocal. The UNCERT series Granger-causes both EPU series, and the EPU series do not Granger-cause UNCERT.

The series UNCERT is based on the word "uncertainty", with a small number of words added which are closest to it, as defined by the results obtained using a machine learning-based methodology. In terms of the EPUNEWS variable, the website states that "we search for articles containing the term 'uncertainty' or 'uncertain', the terms 'economic' or 'economy' and one or more of the following terms: 'congress', 'legislation', 'white house', 'regulation', 'federal reserve', or 'deficit'". In other words, the core word of each is the same, but the additional information used is different. The results suggest that policy makers react to changes in uncertainty rather than anticipating them.

Wald tests: UNCERT and EPUNEWS
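The Toda-Yamamoto procedure set out above can be sketched in R using the packages listed in the Appendix. The sketch below is illustrative rather than the authors' exact code: the series names uncert and epunews are assumed to be monthly numeric vectors already in memory, the order of integration m is taken as one (as reported in the Appendix), and only one direction of the test is shown.

```r
# Illustrative sketch of the Toda-Yamamoto procedure (not the authors' exact code).
# `uncert` and `epunews` are assumed to be monthly numeric vectors already loaded;
# the order of integration m is taken as 1, as reported in the Appendix.

library(vars)   # VARselect, VAR, serial.test, stability
library(aod)    # wald.test

m <- 1
y <- data.frame(uncert = uncert, epunews = epunews)

# Step 2: choose the lag length p in levels using the Akaike Information Criterion
p <- as.integer(VARselect(y, lag.max = 10, type = "const")$selection["AIC(n)"])

# Steps 3 and 4: estimate the VAR(p), check stability (OLS-CUSUM) and test for
# residual autocorrelation; if autocorrelation is found, increase p until it goes away
fit <- VAR(y, p = p, type = "const")
plot(stability(fit, type = "OLS-CUSUM"))
serial.test(fit, type = "PT.asymptotic")   # multivariate Portmanteau test
serial.test(fit, type = "BG")              # Breusch-Godfrey test

# Step 5: re-estimate the VAR with m extra lags (the extra lags are not tested)
fit_aug <- VAR(y, p = p + m, type = "const")

# Step 6: Wald test that the first p lags of EPUNEWS are jointly zero in the
# UNCERT equation, i.e. a test of Granger-causality from EPUNEWS to UNCERT.
# Testing the uncert lags in the epunews equation gives the reverse direction.
eq    <- fit_aug$varresult$uncert
terms <- grep("^epunews\\.l", names(coef(eq)))[1:p]
wald.test(b = coef(eq), Sigma = vcov(eq), Terms = terms)
```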
References

Baker, S.R., Bloom, N. and Davis, S.J. (2016), "Measuring economic policy uncertainty", The Quarterly Journal of Economics, 131(4), 1593-1636.

Gentzkow, M., Kelly, B. and Taddy, M. (2019), "Text as data", Journal of Economic Literature, 57(3), 535-574.

Pennington, J., Socher, R. and Manning, C.D. (2014), "GloVe: Global vectors for word representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Toda, H.Y. and Yamamoto, T. (1995), "Statistical inference in vector autoregressions with possibly integrated processes", Journal of Econometrics, 66(1-2), 225-250.

Appendix

The 10 most similar words to "uncertainty" are: uncertainty, uncertainties, uncertain, unpredictability, ambiguity, certainty, confusion, turmoil, expectation, instability.

R packages used for the Granger causality tests:

• tseries - we use the two functions adf.test and kpss.test (the Augmented Dickey-Fuller test and the Kwiatkowski-Phillips-Schmidt-Shin test respectively) to check whether the series are stationary or contain unit roots.
• vars - we use the function VARselect to compute the Akaike Information Criterion for VAR(p) processes with p from 1 through 20. We use the VAR function to estimate a VAR(p) process. We use the function serial.test to compute the multivariate Portmanteau and Breusch-Godfrey tests for serially correlated errors in a VAR(p) process. We use the function stability to compute empirical fluctuation processes according to the OLS-CUSUM method.
• aod - we use the function wald.test to perform the Wald tests for Granger causality.

The order of integration of each series is one, both over the full sample and over each sub-period.

Number of lags selected using the Akaike Information Criterion, varying the number of lags from 1 through 10:

UNCERT and EPUNEWS, January 1996 - May 2020
UNCERT and EPUGEN, January 1996 - May 2020
UNCERT and EPUNEWS, January 1996 - December 2007
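The unit-root checks behind the order of integration reported above can be illustrated with the tseries functions listed in this Appendix. This is a minimal sketch under the assumption that one of the series, here called uncert (an illustrative name), is already loaded; the comments describe the pattern of test outcomes consistent with an I(1) series, not guaranteed output.

```r
# Illustrative unit-root checks behind the reported order of integration of one.
# `uncert` stands for any one of the series (an assumed, already-loaded vector).

library(tseries)   # adf.test and kpss.test

# In levels: for an I(1) series, adf.test fails to reject a unit root and
# kpss.test rejects level stationarity
adf.test(uncert)
kpss.test(uncert)

# In first differences: for an I(1) series, adf.test rejects a unit root and
# kpss.test no longer rejects stationarity, so the maximum order of integration is 1
adf.test(diff(uncert))
kpss.test(diff(uncert))
```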