key: cord-0192538-zjmtlvjp authors: Hossu, Philip; Parde, Natalie title: Tracking Turbulence Through Financial News During COVID-19 date: 2021-09-09 journal: nan DOI: nan sha: a4f4ab9d58d18bf79f5e217c372e857a8cda0de8 doc_id: 192538 cord_uid: zjmtlvjp

Grave human toll notwithstanding, the COVID-19 pandemic created uniquely unstable conditions in financial markets. In this work we uncover and discuss relationships involving sentiment in financial publications during the 2020 pandemic-motivated U.S. financial crash. First, we introduce a set of expert annotations of financial sentiment for articles from major American financial news publishers. After an exploratory data analysis, we then describe a CNN-based architecture to address the task of predicting financial sentiment in this anomalous, tumultuous setting. Our best performing model achieves a maximum weighted F1 score of 0.746, establishing a strong performance benchmark. Using predictions from our top performing model, we close by conducting a statistical correlation study with real stock market data, finding interesting and strong relationships between financial news and the S&P 500 index, trading volume, market volatility, and different single-factor ETFs.

Recent advancements in fundamental natural language understanding problems (Zhang et al., 2019; Raffel et al., 2020; Clark et al., 2020) have generated renewed interest in their downstream deployment in many applications; one such example is extrapolating meaning from large quantities of financial publications. The year 2020 and its accompanying financial unpredictability presents an interesting case study and test bed for financial sentiment models that rely on complex inferences. In this work we tackle that problem, and uncover numerous interesting and important relationships involving sentiment in financial publications during the 2020 pandemic-related U.S. financial crash. This work is unique in that we focus neither on lexicon scoring nor on developing a trading strategy during normal market conditions, but instead on analyzing the relationship between sentiment and market movement during a financial crisis, a subject seldom studied due to the infrequency of financial crashes and pandemics. Our key contributions are as follows:

• We collect a set of expert annotations for 800 financial article titles and descriptions with COVID-19-specific vocabulary, labeled for negative, neutral, or positive sentiment.

• We introduce CNN-based models to distinguish between sentiment categories in this challenging dataset, achieving a maximum weighted F1 score of 0.746.

• We perform correlation studies between our top model's predictions and popular market measures: S&P 500 Average, S&P 500 Trading Volume, Cboe Volatility Index, Single-Factor ETFs, and FAANG+ stocks.

Our experiments yield intriguing findings, including a strong relationship between predicted market sentiment and the S&P 500 Average and a notably lower correlation with value stocks compared to other factors.

Sentiment analysis is a popular NLP task (Ribeiro et al., 2016) that has been employed in a variety of practical fields, ranging from product review analyses for marketing purposes (Hu and Liu, 2004), to tracking online voter sentiment (Tumasjan et al., 2011), as well as to financial markets. Current methods for extracting sentiment from bodies of text fall primarily into two categories: lexicon-based and machine learning methods. Lexicon-based approaches ultimately produce dictionaries containing key/value pairs for scoring text.
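As a concrete illustration, here is a minimal sketch of lexicon-based scoring; the toy lexicon and additive scoring rule are our own illustrative assumptions, not drawn from any published lexicon:

```python
# Toy sentiment lexicon: word -> score. Entries and values are illustrative only.
LEXICON = {"beat": 1.0, "upgrade": 1.5, "miss": -1.0, "bankruptcy": -2.0}

def lexicon_score(text: str) -> float:
    """Sum the scores of any lexicon words appearing in the text."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

print(lexicon_score("company shares beat estimates after analyst upgrade"))  # 2.5
```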
Although simplistic, lexicon-based approaches have the added benefit of being extremely efficient, requiring user programs to do little more than load a dictionary into memory and perform linear searches. However, lexicon-based methods tend to fall short compared to modern deep learning techniques, especially in domains with specialized vocabularies. Machine learning methods often outperform lexicons in financial text applications (Man et al., 2019). Financial texts not only contain highly specialized language referring to specific actions and outcomes (e.g., "[Company] prices senior notes" announces a type of bond offering), but also exhibit language that can have very different meaning compared to general settings (e.g., "bull" may typically have a neutral or negative connotation, but is overwhelmingly positive in a financial context).

A recent surge of work in financial sentiment analysis emerged during SemEval-2017's Task 5 (Cortis et al., 2017), in which participants were asked to create systems capable of predicting article sentiment on a continuous scale from -1 (bearish/negative/price-down) to 1 (bullish/positive/price-up), given a dataset of 1,647 annotated financial statements and headlines. The top two performing systems (Mansar et al., 2017; Kar et al., 2017) scored within 0.001 points of each other, yet used vastly different techniques. Mansar et al. (2017) utilized three primary input features: pre-trained GloVe embeddings (Pennington et al., 2014), DepecheMood scores (Staiano and Guerini, 2014), and VADER scores (Hutto and Gilbert, 2014). They paired these with a CNN-based architecture, leveraging the following layers in order: convolution, max pooling, dropout, fully connected, dropout, and fully connected. Conversely, Kar et al. (2017) fed pre-trained Word2Vec embeddings (Mikolov et al., 2013) separately to a CNN and a bidirectional gated recurrent unit (BiGRU), and also generated a set of hand-crafted features which were fed to multiple fully connected layers. All three of these model outputs were concatenated before being input to another fully connected layer, which ultimately issued a prediction. We take inspiration from the former paper (Mansar et al., 2017) and experiment with a similar CNN-based approach.

We collected financial articles using the NewsFilter.IO API. This service provides straightforward and high-quality data access for online articles from most major American publishers of financial news. Articles are provided in a JSON dictionary format with fields corresponding to title, description, publisher name, published time, and stock tickers mentioned, among others. We refer to this dataset from here onward as FIN-COV-NEWS. To collect market and share prices, we downloaded publicly available data from Yahoo Finance, which allows users to download daily price data given a specific stock or instrument symbol and a date range.

We collected data from January 1 to December 31, 2020. This yielded 192,154 articles from the following sources: Wall Street Journal, Bloomberg, SeekingAlpha, PR Newswire, CNBC, and Reuters. Considering only article titles, the dataset contains 2,069,244 words, or 1,660,842 words without common stopwords (Bird et al., 2009), with a vocabulary size of 118,719 unique words. The twenty most common title words (excluding punctuation) are shown in Table 1. The (mean, min, max) word counts are (8.64, 1, 54) for titles and (27.35, 1, 1309) for article descriptions.
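A minimal sketch of how such corpus statistics can be computed with NLTK; the two-element `titles` list stands in for the real dataset, and all variable names are our own:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download("punkt"); nltk.download("stopwords")  # one-time setup
titles = ["Stocks Fall as Coronavirus Fears Grow",
          "Company X Announces Quarterly Dividend"]  # stand-in for real titles

STOP = set(stopwords.words("english"))
tokens = [w.lower() for t in titles for w in word_tokenize(t)]
content = [w for w in tokens if w not in STOP]
lengths = [len(word_tokenize(t)) for t in titles]

print(len(tokens), len(content), len(set(tokens)))              # word/vocab counts
print(sum(lengths) / len(lengths), min(lengths), max(lengths))  # mean, min, max
```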
As expected, a significant number of COVID words appear in the dataset (e.g., "coronavirus," "covid-19," and "virus"), alongside traditional finance words like "dividend."

We aggregated articles by day and computed the occurrences of different sets of query words over time, producing the proportion of all posts published on a given day that contain a query word in the post title or description (sketched in code at the end of this section). Our first plot, shown in Figure 1, uses common "Earnings" query words: {revenue, earnings, dividend, miss, beat, estimates}. These words peak notably near the beginnings and ends of fiscal quarters. The blue line denotes the daily percent of posts mentioning one (or multiple) query terms, and the grey bars denote the daily post volume.

We additionally consider a query of "COVID-19-related" words: {covid, covid19, coronavirus, virus, corona, covid-19, 2019-ncov, 2019ncov, sars-cov-2, sarscov2}. Interestingly, we note a spike of COVID-related keywords before the actual spike of cases in the U.S., and a steady decline between April and June despite the increasing ground-truth number of cases and the ongoing economic impact. One may also observe that the percent of posts mentioning the "COVID-19-related" keywords peaks at around 60% in late March 2020 and bottoms out at around 10% in September. Figure 2 shows these percentages, again with the blue line denoting the daily percent of posts mentioning one (or multiple) query terms, and the grey bars denoting the daily post volume. When comparing to the actual COVID-19 statistics aggregated by the U.S. Centers for Disease Control (CDC) and made publicly available via their website, we observe a strong initial financial press reaction to the virus, despite the low infection rate. The press decreased its focus on COVID-19 over time, despite its continuing impact on society and the economy.

While labeling the FIN-COV-NEWS data was essential for learning COVID-19-related trends, we additionally utilized samples of two external datasets created for similar tasks. For the first (SEMEVAL), from SemEval-2017 (Cortis et al., 2017), we sample 750 random articles and use cutoff points at -0.33 and 0.33 to threshold the continuous sentiment labels into the categorical negative, neutral, and positive. This yielded 156 negative articles, 413 neutral, and 181 positive. We additionally sample 750 random phrases from v1.0 of the Financial Phrase Bank (FINPHRASEBANK; Malo et al., 2014). This dataset similarly contains news phrases from financial data sources with categorical negative, neutral, and positive labels. We sampled exclusively from the "AllAgree" subset (containing only samples exhibiting a unanimous label decision from the annotators) to ensure the highest possible quality in our training data.

To train our model to identify language patterns during the 2020 COVID-19 financial crisis, we first acquired labels for a sample from FIN-COV-NEWS. Annotations were solicited from four annotators (one of our authors and three external experts). Our external annotators (referred to as B, C, and D) were recruited from the quantitative strategy research group of a top Investment Management/Exchange Traded Fund provider, where annotator A (an author) worked for a summer. All three external annotators are full CFA charterholders and have worked in the investment space for at least three years. The experts completed the labeling task on a volunteer basis, and were given approximately two weeks to complete their labels.
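As promised above, a minimal pandas sketch of the daily query-word proportion computation behind Figures 1 and 2; the DataFrame column names and the simple substring matching are our own assumptions about one reasonable implementation, not our exact pipeline:

```python
import pandas as pd

COVID_TERMS = ["covid", "covid19", "coronavirus", "virus", "corona",
               "covid-19", "2019-ncov", "2019ncov", "sars-cov-2", "sarscov2"]

def daily_proportion(articles: pd.DataFrame, terms: list) -> pd.Series:
    """Fraction of each day's posts whose title or description mentions a term.

    Assumes columns "published" (datetime), "title", and "description".
    """
    text = (articles["title"].fillna("") + " " +
            articles["description"].fillna("")).str.lower()
    hit = text.apply(lambda t: any(term in t for term in terms))  # substring match
    return hit.groupby(articles["published"].dt.date).mean()
```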
Negative: Article reports news expected to negatively impact the price of a specific company or industry, or market health as a whole, either now or in the future. Examples: COVID-19 cases spread in a country (and it is unclear that a specific company or industry stands to gain); a company announces layoffs or bankruptcy; a news blurb indicates that the price of a share is down.

Positive: Article reports news expected to positively impact the price of a specific company or industry, or market health as a whole, either now or in the future. Examples: a company is in talks for a significant new business partnership; COVID-19 vaccine progress is positive; a news blurb indicates that the price of a share is up.

Neutral: Article cannot be placed with any notable degree of certainty in the positive or negative categories. Examples: the article mentions a company or event which has no influence on U.S. markets; the article mentions multiple companies with multiple sentiments; the annotator is uncertain whether the company will benefit from or be hurt by the news.

We selected a basic categorical (negative/neutral/positive) labeling scheme, applied at the article level. Annotators were provided with XLSX files in which each row contained an article title, a truncated article description (∼25 words), and a label column to be completed. Labeling guidelines are shown in Table 2 (reproduced above). We allowed annotators to volunteer for a round number of articles they felt comfortable labeling, resulting in the following amounts of data: (A, 300), (B, 200), (C, 300), (D, 300). Within each annotator's set, 50 articles were shared among all annotators to compute inter-annotator agreement.

The distributions of all annotations, separated by annotator, are shown in Figure 3, with -1, 0, and 1 corresponding to negative, neutral, and positive labels. Annotators C and D both leaned heavily towards assigning neutral labels, reserving positive and negative labels for clearly good or bad news, whereas Annotator B had a fairly uniform distribution of labels across their data subset. Pairwise Cohen's Kappa (Cohen, 1960) scores are shown in Table 3. Annotator B was an outlier, exhibiting poor agreement with the other two experts, C and D, likely due to this annotator's increased willingness to assign positive/negative labels. Experts C and D agreed well with each other at κ = 0.685. Following convention (Landis and Koch, 1977), annotator pairs (A, D) and (C, D) exhibit "substantial" agreement, and annotator pairs (A, B) and (A, C) exhibit high "moderate" agreement. However, Annotator B exhibits only "fair" agreement with Annotator C and barely reaches the bottom threshold of "moderate" with Annotator D. Based on this analysis, we removed Annotator B from the dataset, leaving 800 labeled articles in FIN-COV-NEWS. We highlight the inherent uncertainty in labeling these financial documents: upon manual inspection, "close misses" are not uncommon, where some annotators thought an article was positive and others thought it was neutral (or some thought it was negative, others neutral). Even when considering exclusively domain experts, it is difficult to obtain complete agreement.

We experiment with various models to predict financial sentiment, varying input data and features as well as model structure. To evaluate our models, we report both macro and weighted F1 scores (harmonic mean of precision and recall) as computed by the Scikit-Learn Python library (Pedregosa et al., 2011).
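A minimal sketch of these metrics, plus the pairwise Cohen's kappa used above, via scikit-learn; the toy label arrays are illustrative:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

gold = [0, 1, 1, 2, 1, 0]  # toy labels: 0=negative, 1=neutral, 2=positive
pred = [0, 1, 2, 2, 1, 1]

print(f1_score(gold, pred, average="macro"))     # unweighted mean over classes
print(f1_score(gold, pred, average="weighted"))  # weighted by class support
print(cohen_kappa_score(gold, pred))             # pairwise agreement, e.g. C vs. D
```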
We include both because although weighted F1 may more effectively reflect performance in real-world settings, it may also skew optimistic given the slight class imbalance in our training data: roughly 60% of labels are neutral, with the remaining 40% split between positive and negative.

We assessed performance relative to two baseline models, ALLNEUTRAL and VADERMAX, to establish dataset learnability above chance or naive heuristics. We report performance metrics for these baselines in Table 4. ALLNEUTRAL is a majority-class baseline that predicts "neutral" for each instance, and VADERMAX is a rule-based heuristic that predicts labels based on the maximum VADER score for any word in the instance. ALLNEUTRAL achieves a poor macro F1 score, but a weighted F1 score around 0.5 due to the label imbalance. VADERMAX performs comparably, giving evidence that lexicon-based heuristics are insufficient in a financial crisis/recovery context.

To ensure competitive performance, we experiment with different word embeddings, text preprocessing steps, and model inputs and architectures. All experiments were performed in Python 3, using Keras (Chollet et al., 2015). For the model architecture, we adapt a Convolutional Neural Network (CNN) design similar to that of Mansar et al. (2017), shown in Figure 4 (a code sketch follows at the end of this section). We also experimented with a BiGRU-based model, as BiGRUs have demonstrated competitive performance relative even to bidirectional LSTMs in comparable tasks (Sinha and Khandait, 2020). However, despite a ∼30x increase in training time, the BiGRU performed similarly to the CNN upon evaluation, so we selected the CNN for our model input experiments. (Future work may be able to achieve higher performance using more complex Transformer-based models, such as those by Devlin et al. (2019) and Araci (2019); since our focus here was on establishing proof of concept in this challenging setting, substantial fine-tuning experiments remained out of scope but offer intriguing future possibilities.) For all model results, maximum F1 scores are reported on the testing data over 100 epochs of training.

We experimented with a variety of input features. All models are trained using a concatenation of the article title text and a truncated version (∼25 words) of the article description. In this subset of experiments, we consider only the 800 labeled samples of FIN-COV-NEWS in a simple 85/15 train/test split. Our input features and preprocessing steps were as follows:

• COVID-19 Embeddings: We created custom embeddings for the missing words {coronavirus, covid19, 2019ncov, covid, sarscov2}. These embeddings are length-300 arrays of floating point values that would not otherwise be assigned to any word (e.g., coronavirus = [-5.99, -5.99, ...]).

• Percent Replacements: We expanded commonly seen percentages, e.g., +5% → {plus, 5, percent}.

• Dash Removal: We removed hyphens and dash characters (e.g., hyphens, en dashes, or em dashes) from the text.

• Stopword Removal: We removed stopwords using the NLTK stopwords list (Bird et al., 2009).

The preprocessing steps for each model are listed in Table 5; M4 uses {Dash Removal, COVID-19 Embeddings, Stopword Removal, Percent Replacements}, and M5 additionally applies Punctuation Removal. We find that the optimal level of preprocessing occurs with M4.
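As promised above, a minimal Keras sketch of a CNN in this style, with the layer order described for Mansar et al. (2017): convolution, max pooling, dropout, fully connected, dropout, fully connected. All hyperparameters here (filter count, kernel width, dropout rate, sequence length, vocabulary size) are illustrative assumptions rather than our exact settings, and the random matrix stands in for pretrained Word2Vec vectors:

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM = 60, 20000, 300  # assumed values

def build_cnn(embedding_matrix: np.ndarray) -> models.Model:
    """Conv -> max pool -> dropout -> dense -> dropout -> dense (3-way softmax)."""
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMB_DIM,
                         weights=[embedding_matrix],  # frozen pretrained vectors
                         input_length=MAX_LEN, trainable=False),
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),  # max pooling over the sequence dimension
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(3, activation="softmax"),  # negative / neutral / positive
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example with a random matrix standing in for pretrained embeddings:
model = build_cnn(np.random.randn(VOCAB_SIZE, EMB_DIM).astype("float32"))
probs = model.predict(np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN)))
print(probs.shape)  # (2, 3)
```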
Although the regular expressions incorporated into M5 reduce the number of missing words significantly, model performance does not improve, so we omit the extra processing step from our final model. We compare conditions leveraging pretrained GloVe embeddings, pretrained Word2Vec embeddings, and the inclusion of VADER scores in Table 6, using an 85/15 split of FIN-COV-NEWS. The model inputs are defined as follows: N1: {300-length GloVe 6B, VADER Scores}; N2: {300-length Google News Word2Vec, VADER Scores}; N3: {300-length Google News Word2Vec, No VADER Scores}. We find that Word2Vec embeddings consistently result in higher performance than GloVe embeddings. This is likely because the GloVe pretraining data is more general-purpose, whereas the Word2Vec model was pretrained on news articles, likely capturing richer relationships for our use case.

For the best performing model, we expanded training to the full SEMEVAL + FINPHRASEBANK + FIN-COV-NEWS dataset. The macro and weighted evaluation scores are shown in Table 7. On the test set, this model predicted 256 articles correctly (∼74.2%), 77 off by one (∼22.3%), and 12 clear mispredictions (∼3.5%). While still errors, off-by-one errors signify that the model predicted only one label away from the true value. We further decompose these incorrect predictions to gain a better understanding of what the model was missing. We deem errors in which the model erred on the side of caution (e.g., predicting neutral when the label was positive or negative) to be acceptable errors, as they would likely cause minimal detrimental effects in downstream systems.

Shifting perspective to the twelve instances for which the model made clear mispredictions, there were ten instances for which the model predicted positive and the true label was negative, and two instances of the opposite. Upon manual inspection of these errors, we found that the articles in question were indeed fairly ambiguous. For example, when the model saw the (stopword-removed) input "one best years bull market erased s&p 500 rose almost 29 percent course 2019, good second best year bull market. took little four weeks fall apart.", the algorithm judged this to be good news, unable to catch the subtlety of "little four weeks fall apart."

We extend the predictions of our best performing model to the entire year (January 1 to December 31, 2020) of article titles plus truncated short descriptions to uncover the relationships between financial articles and actual market movement. We remove weekends (when the market is closed) from our dataset to avoid times when share prices are relatively stable. We compute Pearson correlation coefficients in all cases using the stats module of SciPy (Virtanen et al., 2020).

We first compare our model predictions and the S&P 500 average, shown in the top two rows of Table 8. Averaged Daily Model Sentiment refers to a simple daily aggregate of the article sentiment; Smoothed Daily Model Sentiment is a 10-day rolling average of the article sentiment. We find that the rolling average has a strong correlation with the S&P 500, at r = 0.886. This information is also presented visually in the first three plots of Figure 5. We additionally hypothesized that trading volume and market volatility would correlate with predicted financial news sentiment. To assess this, we considered S&P 500 trading volume, as well as the Cboe Volatility Index (VIX), a well-regarded measure of market volatility based on options contracts.
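A minimal sketch of this correlation computation; the DataFrame layout and names are illustrative assumptions, not our exact pipeline:

```python
import pandas as pd
from scipy.stats import pearsonr

def sentiment_market_correlation(df: pd.DataFrame, window: int = 10):
    """Correlate smoothed daily sentiment with a market series.

    Assumes `df` is indexed by trading day with columns "sentiment"
    (mean predicted sentiment, -1 to 1) and "market" (e.g., S&P 500 close).
    """
    smoothed = df["sentiment"].rolling(window).mean()  # 10-day rolling average
    valid = smoothed.notna()                           # drop the warm-up rows
    r, p = pearsonr(smoothed[valid], df.loc[valid, "market"])
    return r, p
```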
These correlations are shown in rows three through five of Table 8. Interestingly, while our smoothed sentiment model outputs exhibit non-zero correlation with both S&P 500 trading volume and the VIX, the magnitude of this correlation is notably higher with the VIX.

Next, we investigated which single-factor strategies aligned most closely with predicted financial article sentiment. Factor investing is a classical investing technique that seeks to identify attributes of securities that are statistically linked with higher returns. The body of academic work on this topic is extensive (we direct interested readers to Fama and French (1993, 2014)). For strategy consistency, we selected four single-factor exchange traded funds (ETFs) from iShares by BlackRock, by far the largest ETF provider as of March 16, 2021. iShares defines its four factor-based ETFs as follows:

• Value ($VLUE): Stocks discounted relative to fundamentals.
• Quality ($QUAL): Financially healthy companies.
• Momentum ($MTUM): Stocks with an upward price trend.
• Size ($SIZE): Smaller, more nimble companies.

While these are over-simplified definitions of the factors, they provide a basic intuition regarding which companies were pooled into which of the single-factor ETFs. The statistical correlations between these tickers and our model results are shown in Table 9. We note that Value is the clear outlier when compared to the other factors, with a significantly lower correlation (r = 0.625).

Turning to the FAANG+ technology stocks, we found a correlation of r = 0.441 (or r = 0.669 with smoothed model sentiment) between the news for these specific companies and the $FNGS ETN (https://www.microsectors.com/fang); this was notably lower than the overall market correlations that we found earlier. Figure 6 shows that sentiment tracks market performance fairly well through the initial crash, but while the tech stocks recover quickly and add to their gains, sentiment continues to oscillate around a neutral point. This behavior highlights one of the potential shortcomings of a purely sentiment-driven trading strategy: when market behavior is fairly stable and trending upward, news sentiment does not always grow progressively to match the trend.

In this work, we comprehensively investigated the relationship between financial news sentiment and stock market performance during the 2020 pandemic-induced financial crisis. While comparable works on this subject (Zammarchi et al., 2021) focus primarily on the use of social media data in conjunction with lexicons or standard machine learning models, we leveraged a task-specific CNN model to predict financial article sentiment over time, achieving a maximum weighted F1 of 0.746. This enabled us to subsequently perform a correlation study that considered not only the stock market as a whole, but also different factor-based strategies and industry subsets. As part of this work we collected expert annotations of financial sentiment for 800 news articles (a subset of FIN-COV-NEWS), with strong inter-annotator agreement (averaged pairwise κ = 0.656). We make these annotations available to other interested parties by request.

With a solid deep learning-based sentiment model and an abundance of stock data, there is a high ceiling for where future work could lead. One obvious avenue would be continued improvement of the financial sentiment prediction model, although significant improvements may require more labeled data.
Using the current model predictions, more fine-grained analysis could also be performed on other subsets of companies or industries of interest. While it is clear that news sentiment could not completely localize the market crash/bottom, an intriguing extension of our work would be to consider sentiment prediction on a more fine-grained time interval, and to investigate whether it could be used to inform higher-frequency trading strategies that could weather highly turbulent market conditions.

References

Araci (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.
Bird et al. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
Chollet et al. (2015). Keras. https://keras.io.
Clark et al. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
Cohen (1960). A Coefficient of Agreement for Nominal Scales.
Cortis et al. (2017). SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs and News.
Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Fama and French (1993). Common Risk Factors in the Returns on Stocks and Bonds.
Fama and French (2014). A Five-Factor Asset Pricing Model. SSRN Scholarly Paper ID 2287202.
Hu and Liu (2004). Mining and Summarizing Customer Reviews.
Hutto and Gilbert (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.
Kar et al. (2017). RiTUAL-UH at SemEval-2017 Task 5: Sentiment Analysis on Financial Data Using Neural Networks.
Landis and Koch (1977). The Measurement of Observer Agreement for Categorical Data.
Malo et al. (2014). Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts.
Man et al. (2019). Financial Sentiment Analysis (FSA): A Survey.
Mansar et al. (2017). Fortia-FBK at SemEval-2017 Task 5: Bullish or Bearish? Inferring Sentiment towards Brands from Financial News Headlines.
Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space.
Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python.
Pennington et al. (2014). GloVe: Global Vectors for Word Representation.
Raffel et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Ribeiro et al. (2016). SentiBench: A Benchmark Comparison of State-of-the-Practice Sentiment Analysis Methods.
Sinha and Khandait (2020). Impact of News on the Commodity Market: Dataset and Results.
Staiano and Guerini (2014). DepecheMood: A Lexicon for Emotion Analysis from Crowd-Annotated News.
Tumasjan et al. (2011). Election Forecasts With Twitter: How 140 Characters Reflect the Political Landscape.
Virtanen et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python.
Zammarchi et al. (2021). Impact of the COVID-19 Outbreak on Italy's Country Reputation and Stock Market Performance: A Sentiment Analysis Approach.
Zhang et al. (2019). ERNIE: Enhanced Language Representation with Informative Entities.