key: cord-0434165-5n1bcclm authors: Davis, E.; Danforth, C. M.; Mieder, W.; Dodds, P. S. title: Computational Paremiology: Charting the temporal, ecological dynamics of proverb use in books, news articles, and tweets date: 2021-07-10 journal: nan DOI: nan sha: f4fdb00591bd83667905fdcb7d47617836c0291e doc_id: 434165 cord_uid: 5n1bcclm Proverbs are an essential component of language and culture, and though much attention has been paid to their history and currency, there has been comparatively little quantitative work on changes in the frequency with which they are used over time. With wider availability of large corpora reflecting many diverse genres of documents, it is now possible to take a broad and dynamic view of the importance of the proverb. Here, we measure temporal changes in the relevance of proverbs within three corpora, differing in kind, scale, and time frame: Millions of books over centuries; hundreds of millions of news articles over twenty years; and billions of tweets over a decade. We find that proverbs present heavy-tailed frequency-of-usage rank distributions in each venue; exhibit trends reflecting the cultural dynamics of the eras covered; and have evolved into contemporary forms on social media. Our goal here is to advance 'computational paremiology': The data-driven study of proverbs. We first build a quantitative foundation by estimating the frequency of use over time for an ecology of proverbs in several large corpora from different domains. We then characterize basic temporal dynamics allowing us to address fundamental questions such as whether or not proverbs appear in texts according to a similar distribution to words [1] [2] [3] [4] . In studies of phraseology, data on frequency of use is often conspicuously absent [5] . 
The recent proliferation of large machine-readable corpora has enabled new frequency-informed studies of words and n-grams that have expanded our knowledge of language use in a variety of settings, from the Google Books n-gram Corpus and the introduction of "culturomics" [6, 7], to the availability and analysis of Twitter data [8]. However, routine formulae, or multi-word expressions that cannot be reduced to a literal reading of their semantic components, remain notoriously resistant to reliable identification despite carrying high degrees of symbolic and indexical meaning [9]. It is, for instance, much easier to chart a probability distribution of single words or n-grams than of complex lexicon-dependent utterances such as proverbs, conventional metaphors, or idioms. Perhaps the most recognizable routine formulae are proverbs and their close cousins, idioms. Centuries of the study of proverbs (paremiology) have shown their importance in language and culture, and that they are immensely popular among the folk [10]. Proverbs are generally metaphorical in their use, and map a generic situation described by the proverb to an immediate context. In light of challenges in developing reliable instruments for the measurement and quantification of figurative language, research would greatly benefit, as it has with words, from a better understanding of the frequency and dynamics of proverb use in texts. By applying new methodologies in measuring frequency and probability distributions, this study seeks to contribute to this endeavor.
Before going any further, we must give a more precise definition of the proverb. Though there is still some debate, it is widely agreed that proverbs are popular sayings that offer general advice or wisdom. Naturally, however, not all such sayings are proverbs.
Many attempts at more precise definitions have been made, perhaps the simplest being that of Gallacher: "A proverb is a concise statement of an apparent truth which has [had, or will have] currency among the people." This definition, while convenient, leaves out some important features of proverbs, such as their metaphoricity and their dependence on context [11]. Mieder's definition is perhaps the most useful for our present purposes: "Proverbs [are] concise traditional statements of apparent truths with currency among the folk. More elaborately stated, proverbs are short, generally known sentences of the folk that contain wisdom, truths, morals, and traditional views in a metaphorical, fixed, and memorizable form and that are handed down from generation to generation" [11]. Proverbs maintain a particular relationship with their context of use that provides a fruitful domain for frequency and probability analysis.
arXiv:2107.04929v1 [cs.CL] 10 Jul 2021
An important part of the proverb is the context in which it is used. The metaphorical property of a proverb need not only have to do with the proverb itself (as in the proverb/metaphor "war is hell", in which war is compared to hell within the proverb). In general, the use of a proverb is metaphorical in context, meaning that the proverb offers wisdom about a current situation via a metaphoric comparison to a proverbial one [11]. For instance, the proverb "still waters run deep" might be used to caution someone against taking a seeming calm for granted, as it may belie unseen dangers. As with many other proverbs, it is hard to imagine anyone using the proverb "you can't put lipstick on a pig" in any literal or pragmatic context. Rather, these phrases offer wisdom embodied in the culture as opposed to that of the speaker. In this way proverbs may be used generically without proffering personal expertise. Indeed, proverbs are necessarily ambiguous enough to offer wisdom in any number of situations.
Michael Lieber argued that this ambiguity paradoxically gives proverbs the function of disambiguating the situations in which they are used. In part due to their role as cultural rather than individual wisdom, they can be invoked impersonally as a way of clarifying a complex reality [12]. As such, part of Winick's definition of the proverb is that they "address recurrent social situations in a strategic way" [11].
It is important to note the distinction between proverbs and idioms. An example of an idiom would be the phrase "red herring", denoting something misleading. The meanings of idioms, like those of proverbs, often cannot be ascertained from the meanings of their component words. But unlike proverbs, idioms are often not complete sentences, require context, and need not reference a paradigmatic situation. Proverbs, on the other hand, represent a complete situation and offer some sort of general wisdom. The boundary between the two, however, is rather fuzzy and contains many idioms and proverbial expressions. For instance, the proverb "every cloud has its silver lining" is perhaps better known by its idiomatic reduction "silver lining". In fact, people may use an idiom without any knowledge of its proverbial context. Our intent here is to focus on expressions of full proverbs, and not their idiomatic uses. As previous work has shown, it is possible to investigate the manipulations and idiomizations of individual proverbs [5, 13], and part of our study is devoted to continuing that work. However, our approach certainly has limitations, and further research into flexible searches or other identification methods will be essential in future work.
Metaphor and idiom identification and comprehension are an open area of research in machine learning and NLP (Natural Language Processing) [14, 15]. In general, metaphors and metaphorical speech are difficult to identify, and do not occur in consistent repeated phrasings.
Whereas "bag-of-words" methods allow the tacit assumption that most of the words are represented in the lexicon of the language, in the search for routine formulae one must access the lexicon as an essential step in verifying a phrase's meaningfulness. Furthermore, the source and target domains of a metaphor's mapping, as laid out by Lakoff and Johnson in their Conceptual Metaphor Theory, are seldom explicit [16, 17]. However, proverbs generally appear in the same recognizable format, and in the form of a full, self-contained sentence. Prospectively, understanding of the conceptual mapping involved in proverb use may provide a useful step towards general understanding of metaphors in the above fields [18, 19].
Arguably, the proverb's flexibility of use has helped make proverbs an essential part of language and communication, literature, discourse, and media [10]. Interest in the collection and study of proverbs dates back to at least the ancient Greeks and Sumerians. Erasmus famously collected proverbs. In English literature, the proverb has been an important device for many famous authors, among them Geoffrey Chaucer, William Shakespeare, Oscar Wilde, and Agatha Christie [20, 21].
Modern politics attests to the continued relevance of the proverb. In politics, proverbs have been employed as a way to communicate succinctly and persuasively with the populace. Early American politicians like Benjamin Franklin used proverbs to help shape a national identity and character, as with his still widely read and cited Poor Richard's Almanac. Abraham Lincoln employed proverbs in his famous speeches surrounding the American Civil War and Emancipation. During the Second World War, Churchill, Truman, and Hitler all famously used proverbs in their speeches and slogans [22]. During Emancipation and the American civil rights movement respectively, proverbs were used by Frederick Douglass and Martin Luther King Jr. to motivate the people and communicate moral values [22].
More recently, dominant political figures in the US like Barack Obama and Bernie Sanders have used proverbs to great effect [23] , and political and religious interests try to shape which proverbs children are taught in school [24] . This is by no means the first quantitative study of proverb use. Permiakov called for demographic studies of proverb knowledge to gather an impression of which proverbs were being used by the folk, in the interest of establishing a paremiological minimum: A minimum lexicon of proverbs for a language [25] . Subsequent interest in proverb knowledge in psychology and folklore resulted in several studies conducted in the United States. Early studies by Albig and Bain in the 1930s found that American college students could recall on average between 25 and 27 distinct proverbs, many of which were common among participants [26, 27] . A more recent study by Haas observed proverb familiarity among college students in several regions of the US. They performed experiments in both proverb generation and proverb recognition. Notably, students could recognize more proverbs than they could recall on their own [28] . Apart from the lexicographic collection of proverbs from texts, several attempts have been made to quantify and characterize their use. Whiting, in his assiduous collection of proverbs from texts in "Modern Proverbs and Proverbial Sayings" [29] , kept track of the frequency with which they were encountered. Norrick attempted a manual search for proverb frequency, though he was constrained to only using proverbs starting with the letter f, and used a relatively small text sample [30] . In the first serious computational analysis of proverb frequency, Lau searched for and counted instances of proverbs in newspapers in the Lexis/Nexis ALLNWS database [31] . 
David Cram theorized that proverbs, acting as self-contained lexical units, were employed in much the same way that words are, and that their use involved a "lexical loop" where the speaker accesses the lexicon in addition to the syntax when forming a text. As such, in the case of proverbs (and phrasal idioms), one ought to "analyze a syntactic string as a single lexical item" [32]. Moon's exhaustive early study of fixed expressions and idioms (denoted FEIs) in the Oxford Hector Pilot Corpus (OHPC) did just that [13]. The study represents the first serious attempt to apply the new tools of computational linguistics to routine formulae. Moon searched the OHPC (a precursor to the British National Corpus, or BNC) for instances of 6776 FEIs from the Collins Cobuild English Language Dictionary. It is worth noting that at the time, there were few machine-readable English phraseological lexica. Though proverbs made up only 3.5% of the searched phrases (240), 19% of the expressions found in the corpus were proverbial expressions, the second most common subtype behind "simple expressions" (70%). Of the proverbs found, 59% were deemed metaphorical. Moon notes that exploitations of FEIs are easy to miss, and uses the proverb "a bird in the hand is worth two in the bush" as an example. Significantly, Moon noted that journalism was overrepresented in the corpus, and that the results did not represent the distributions of these FEIs in English as a whole. This and other similar caveats inspired the present study to observe genre-specific corpora separately, and compare after analysis.
Čermák's essay collection "Proverbs: Their Lexical and Semantic Features" contains several essays that deal with the distribution of proverbs in the British National Corpus [5]. In Čermák's pioneering essays, he searches for occurrences of English proverbs in the BNC (100 million words) [34]. In this study, even the most common proverbs seem to occur relatively infrequently.
For example, "easier said than done" is the most common, appearing 62 times in the entire corpus. His study discusses the relevance of corpus occurrence to a paremiological minimum (he uses a limited proverb list from Wiktionary). Another study focuses on text introducers to various proverbs using collocation analysis. (Čermák notably spearheaded one of the first machine-readable phraseological lexica, the "Czech Idiom Dictionary" (1994).) Čermák relates frequency dictionaries to discussions of a paremiological minimum. Should proverb frequency in large corpora be taken into account when judging that minimum? Of course, there are problems with this approach as well: proverbs rely heavily on oral tradition, and are prone to frequent corruptions and purposeful exploitations. As such, there is no guarantee that a search for a given phrasing of a proverb will capture all, if any, of its occurrences in a text. There are ways around this on an individual basis, but it depends on the proverb: some employ parallel structures (like "good X make good Y"), or have popular idiomizations (like "silver lining"). Most recently, in an introductory paremiology textbook [35], Steyer outlined a general corpus-linguistic method for studying proverbs, similar to those of Moon and Čermák.
Here, we expand on the above literature, including much larger corpora and proverb data sets. Should the ambition be to find these distributions in English as a whole? We contend that there is no such universal corpus for any language. Clearly, use of these phrases is context-dependent, and it seems unlikely that intercontextual searches will yield greater insight than single-genre searches. Instead, frequency dynamics and distributions in separate corpora from differing contexts may be more informative.
For our present study of proverbs from a corpus linguistic point of view, we focus on two problems, which we may frame, however artificially, as questions: 1) How does frequency of proverb use compare across proverbs, and does that distribution echo previous findings in linguistics? and 2) What stories emerge once the dimension of time is added to our observations of the frequency of proverb use in these corpora? Can shifts in popularity be related to known events, and can our knowledge of the history of proverb use be advanced through these methods?
FIG. 1. Daily relative frequency of the 3-gram "enough is enough" on Twitter. The popularity of "enough is enough" on Twitter grew steadily over the last decade, and it has been the most popular proverb on Twitter since 2016, perhaps originating from its consistent use by Senator Bernie Sanders [23]. It has since become associated with growing protests against police brutality and gun violence. Annotations reflect widely reported violent events and protests (with the exception of the 2018 US midterm elections). The stark simplicity of this sixteenth century proverb evokes a narrative of repetition past the point of tolerance [33]. In this instance, beginning as a condemnation by Senator Sanders of the continued reaffirmation of the status quo in US politics, it is now popular as a collective outcry against political inaction in the wake of regular mass shootings in the US, and against a lack of accountability in the killing of black Americans by police. The changing significance and popularity of the proverb in the past decade displays the aptitude of proverbial speech to be successfully employed in varying contexts, and its potential to illustrate narrative commonalities between phenomena.
One of the foundational achievements in the study of complex systems was Zipf's identification of scaling laws in language and other social phenomena [36]. Indeed, as early as 1996, natural language (in the context of computational linguistics) was cited explicitly as an example of the recently coined "complex adaptive systems" [2]. It was first observed by Zipf that the rank distribution of words in a text follows a power law $F(r) = c r^{-\alpha}$, where $r$ is a word's rank and $F(r)$ is its frequency, with $\alpha \approx 1$. While primary interest here is paid to its appearance and seeming ubiquity in language, the same class of distributions has been observed in phenomena across a wide range of fields including physics, biology, psychology, sociology, urban studies, and engineering [37, 38]. Several studies have addressed possible mechanisms for the emergence of these distributions from empirical data. Notably, work by Dodds et al. showed that the distribution results from a Simon competition model, in which the first mover has an advantage [39]. In this case the older proverbs may have a competitive edge in their proliferation and popularity. Cancho et al. showed in a language-generating genetic algorithm that optimal results for both low speaker and receiver effort followed a Zipf distribution [40]. While Zipf observed this phenomenon for words in a text, it has since been observed that individual words in a large corpus follow a broken power law distribution and do not strictly adhere to Zipf's law [3]. Several attempts have been made to generalize the original Zipf distribution. Benoit Mandelbrot derived an analogous distribution using information theory, dubbed the Zipf-Mandelbrot distribution [41]. More recently, Cancho and Sole formalized a broken power law distribution with two distinct scaling regimes [4]. One shortcoming noted in many evaluations of Zipf's law in text is that power law scaling breaks down toward the tail of these empirical distributions. Recent work by Williams et al.
[3] however, showed that power law scaling holds over more orders of magnitude when randomly partitioned phrases are used rather than individual words. That study also suggested a refocusing of corpus linguistic attention from words to phrases as essential elements of language. Further work by Williams et al. [42] suggested that changes in scaling in Zipf distributions of large corpora can be attributed to text mining. Few, if any, attempts have been made to apply Zipf's law to phraseological lexica.
With large amounts of newly digitized text, corpus linguistics and lexicology/lexicography have seen renewed and wider interest, along with new results. Can these methods be used to tell new stories that are of interest to those working in the humanities? In particular, how can that work embed itself into the existing wealth of knowledge accrued by those disciplines? In this case, how can computational work on proverbs situate itself in the existing knowledge base of paremiology?
In their seminal 2011 paper, Michel et al. discussed the newly created Google Books corpus, and coined the term "culturomics" to describe the nascent discipline concerned with observable trends in the use of n-grams over time [6]. They present several convincing case studies, among them trends in the use of "influenza" alongside historical outbreaks, and the use of geographical and antagonistic terms alongside the history of the American Civil War. These case studies make use of time series data and relative frequency to tell complex stories of interest from simple queries. However, Pechenick et al. note that there are issues with Google Books' representation of culture. For one, books are not indexed by popularity, and each book appears only once. As a result, the linguistic contributions of the most popular books are weighted equally with the least popular [7]. For instance, enormously influential books like To Kill a Mockingbird, I Know Why the Caged Bird Sings, Mockingjay, or Harry Potter and the Order of the Phoenix are only represented once, and share the same weight as any other book (new editions notwithstanding). Secondly, the rise in volume of scientific and academic publication in the last century drastically increased the relative influence of this type of writing, skewing the last century of English as a whole towards that genre. Here, we examine only the English Fiction subset of the corpus, which can be partially defended [43].
FIG. 2. Zipf distributions for entries from Mieder's Dictionary of American Proverbs [33]. Panels: A. Gutenberg Project; B. New York Times; C. Google Books 3-grams; D. Twitter 3-grams. For each corpus, proverbs are enumerated and shown on logarithmic axes as a function of rank, with 'hold your tongue', 'to delay may mean to forget', 'time will tell', and 'never say never' topping the charts in Gutenberg, NYT, Google, and Twitter respectively. Each distribution exhibits heavy-tailed behavior, more prominently for Gutenberg and NYT.
Other work by Reagan et al. utilized the timelines within texts to evaluate the emotional arc of a text, given word valence (sentiment) data. Inspired by Kurt Vonnegut's rejected Master's thesis (in Anthropology) on the shapes of stories, they found that indeed the emotional arcs of most stories in the Gutenberg corpus could be reduced to a handful of paradigmatic shapes [44]. Work by Underwood et al. used historical use of gendered names and words to reveal trends in gender representation in literature using data from the HathiTrust digital library [45]. The Storywrangler project [8] allows users to explore the temporal dynamics of n-grams found on Twitter.
Using a data set reflecting a random 10% of Twitter since 2008 (presently over 150 billion tweets), Storywrangler tracks the prevalence of n-grams on a daily scale. These n-grams are portrayed via rank by popularity, and convey, for example, the rise and dynamics of President Trump (further depicted in the PoTUSometer) [46], or the meteoric rise and continued influence of Justin Bieber (of surprising relevance to this work). Unlike the Google Books n-gram Corpus, Storywrangler is notable in its ability to track phrases in both original tweets and retweets, conveying aspects of popularity through amplification. Beyond simple words and phrases, data have been used to track the progression of ideas. For instance, Leskovec et al.'s paper on "meme-tracking" tracked the progression and mutation of popular sayings as they proliferated through news reporting and blogging [47].
Recently, "Computational Folkloristics" has gained recognition as an area of study, with a 2016 issue of the Journal of American Folklore being devoted to the subject [48]. Using classification, networks, geographical data, temporal data, and digitized text, folklorists and other interested academics have explored new possibilities in understanding texts and cultural history. The Danish Folklore Nexus developed by Abello et al. provides tools for large-scale analysis of Danish folk tales and stories, aiding in classification of stories, or mapping their similarity to others through networks. Tools like this can augment traditional methods of studying folklore, using data-driven methodology to guide future avenues of folklore research [49]. This represents a paradigmatic example of a computational tool participating in the continued discourse around folklore, without being an end in and of itself.
In an effort to quantify the ecology of proverbial language, a list of over 14,000 proverbs was obtained from Mieder's Dictionary of American Proverbs [33].
Proverbs were stored in an SQL database for ease of access, and matched for frequency with four distinct corpora:
• Gutenberg
• New York Times Annotated Corpus (1987-2007)
• Google Books n-gram Corpus (English Fiction)
• Twitter (2008-2020)
Individual corpora were collected as follows. The Gutenberg corpus comprises over 60,000 collected published documents spanning several centuries. The present study restricts its use to the subset of documents in English. As the metadata for the Gutenberg corpus does not consistently encode the date of original publication, temporal data was collected using author birth dates (gathered from the Gutenberg library for R) [50]. These were used in place of publication dates, as the publication dates in the corpus seldom represent the original publication; instead they represent the digitized edition. For temporal analysis, documents without authors and their birth dates were omitted.
The Gutenberg corpus comes with several caveats. Firstly, works were curated by perceived importance. Works also disproportionately represent the 18th and 19th centuries, and for this reason much of our work with Gutenberg focuses on this period. Several authors have much of their extensive oeuvre represented in the corpus (e.g., Anthony Trollope, Mark Twain), which could compromise a more objective view of English writing tendencies of the period.
Data from the New York Times were gathered from the New York Times Annotated Corpus of 1.8 million articles from 1987-2007 [51]. The data are organized as XML documents in NITF (News Industry Text Format). The corpus includes obituaries and other short pieces in addition to more traditional news articles.
FIG. 4 (caption, partial). The gray represents the data binned by month, and the orange represents the data binned by year. The proverb "to delay may mean to forget" owes its yearly rhythm to its role as the NYT's charity tagline. The frequencies are normalized by article count (obits, and non-body included). Plots are ordered in the grid by rank, first left to right, then top to bottom.
The 2020 English Fiction Google n-grams corpus consists of every n-gram that appears at least 40 times in its set of millions of digitized books. For each n-gram, the corpus provides, for each year in which it appears in the data set, the frequency with which it appeared that year and the number of documents it appeared in that year [6].
Data from Twitter were accessed through the Vermont Complex Systems Center's Storywrangler API [8]. Storywrangler receives a randomly selected 1/10th of each day's tweets from Twitter's Decahose API (including retweets), and organizes n-grams by rank and frequency. Data for 2-gram and 3-gram proverbs were obtained through the tool, and were aggregated so that the collection was case insensitive.
The data from all four corpora were processed using Python, and the libraries pandas and matplotlib were used for organization and visualization respectively [52, 53]. In our processing of Gutenberg and the New York Times, punctuation in both proverbs and texts was removed. Twitter data were punctuation insensitive. Regular expressions were used to capture variations in punctuation when processing the Google Books n-gram Corpus.
Where relative frequency is used, it is calculated as $f_{\mathrm{rel}} = f_t / n_t$, which is the frequency $f_t$ for time period $t$ divided by the number of documents $n_t$ found during time period $t$. Zipf distributions were plotted using the ranks of proverbs in a corpus, with rank 1 being the most frequent, against their frequencies. Zipf distribution plots are shown on a log-log scale, as is standard.
Networks of books and proverbs, as well as authors and proverbs, were made using books/authors as nodes, connected by proverbs they have in common. The networks are unweighted, and do not reflect instances where books/authors share multiple proverbs. Betweenness centrality in these networks is calculated as $C_B(v) = \sum_{i \neq v \neq j} \sigma_{ij}(v) / \sigma_{ij}$, where $\sigma_{ij}$ is the number of shortest paths between nodes $i$ and $j$ and $\sigma_{ij}(v)$ is the number of those paths that pass through node $v$, or the proportion of shortest paths between any two other nodes in the network that pass through a given node.
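The matching and normalization steps described in this section can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `(period, text)` input shape, the choice of ASCII punctuation, and counting every occurrence (rather than once per document) are all assumptions made for illustration, and only the Python standard library is used in place of pandas.

```python
import re
import string
from collections import Counter

# Assumption: "punctuation" means ASCII punctuation; the text does not say
# exactly which characters were stripped.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def count_proverbs(documents, proverbs):
    """Count proverb occurrences per time period.

    documents: iterable of (period, text) pairs (hypothetical input shape).
    Returns (counts, n_docs), where counts[period][proverb] plays the role
    of f_t and n_docs[period] the role of n_t.
    """
    # Lookarounds keep matches aligned to whole words in the cleaned text.
    patterns = {
        p: re.compile(r"(?<!\S)" + re.escape(normalize(p)) + r"(?!\S)")
        for p in proverbs
    }
    counts, n_docs = {}, Counter()
    for period, text in documents:
        n_docs[period] += 1
        clean = normalize(text)
        period_counts = counts.setdefault(period, Counter())
        for proverb, pattern in patterns.items():
            hits = len(pattern.findall(clean))
            if hits:
                period_counts[proverb] += hits
    return counts, n_docs


def relative_frequency(counts, n_docs):
    """Apply f_rel = f_t / n_t to every proverb in every period."""
    return {
        period: {p: f / n_docs[period] for p, f in period_counts.items()}
        for period, period_counts in counts.items()
    }


def zipf_ranks(counts):
    """Rank proverbs by total frequency; rank 1 is the most frequent."""
    totals = Counter()
    for period_counts in counts.values():
        totals.update(period_counts)
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, p, f) for rank, (p, f) in enumerate(ordered, start=1)]
```

Calling `count_proverbs` on a list of dated texts, then passing its two return values to `relative_frequency`, yields one normalized value per proverb and period; `zipf_ranks` gives the rank and frequency pairs that would be drawn on log-log axes. Exact-phrase matching of this kind misses altered or truncated proverbs, which is the limitation the text itself acknowledges.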
Most processing was performed using the Vermont Advanced Computing Core (VACC) located at the University of Vermont.
Fig. 2 shows Zipf distributions for entries from Mieder's Dictionary of American Proverbs using 3-gram anchors for each of the four corpora studied. While the distributions exhibit heavy tails, we do not observe robust power-law scaling over many orders of magnitude. We find the largest number of distinct proverbs appearing in Gutenberg and the New York Times, on the order of thousands, with the Google Books and Twitter examples showing roughly an order of magnitude fewer. We note that Zipf's law for words does not itself extend over many orders of magnitude [54], typically only 2 or 3, and that it is meaningful, mixed-length phrases that present many orders of magnitude of scaling [3]. The Zipf distributions for proverbs are thus comparable to what we see for single words. With a more sophisticated proverb detection method, one that captures minor variations in phrase structure, we would expect to see some adjustments to the Zipf distributions we have observed, though a priori it is not clear how. Short, robust proverbs ("time flies") will be well counted, while longer ones for which, say, constituent function words might be changed based on context or era ("he/she/they who hesitates . . . ") would only see their apparent observed frequency of usage grow. While the most popular entry in the Gutenberg corpus and the Google Books n-gram corpus was the phrase "hold your tongue", this phrase is classified as a proverbial expression rather than a proverb (its use requires outside context). For clarity of focus the phrase has been excluded from figures in this section. "Sink or swim", another proverbial expression, has been left in.
In light of the limitations of the Gutenberg corpus detailed in Methods, it is difficult to make claims about the trends of proverb use over time (Figure 3).
However, it is clear from the data shown in Figure 3 that proverbs appear in a remarkable portion of the documents in the corpus. "The sooner the better", for example, appears in nearly one in every ten documents in the early 1800s. The data for proverbs in the Gutenberg corpus were used to construct a network with documents as nodes, connected if a given proverb appears in both documents. When betweenness centrality was calculated for nodes in the network, surprisingly James Joyce's Ulysses had the 14th highest centrality, close to several dictionaries of proverbs and quotations, and the collected works of Mark Twain (Table 1). Creasy [55] documented Joyce's use of proverbs in Ulysses from a critical perspective, noting that they are often altered, and blend high and low culture in the work. As Joyce uses many fewer proverbs than a comprehensive proverbial dictionary, the book's centrality in this network implies that Joyce's use of proverbs is far from arbitrary, and that his choice of proverbs is purposefully situated in the broader context of English proverbial knowledge.
Figure 4 shows time series plots for the 16 most common proverbs in the New York Times Annotated Corpus. Shown are frequencies binned by month and year, and normalized by article count. All articles are included in the count, including smaller articles like obituaries (the average article count is 248 per issue). It is by no means a surprise that proverbs appear frequently in journalism; in fact Lau's study found as much [31]. Not present in that work, however, is a temporal dimension (not to mention a different time period). It is clear in Figure 4 that the proverbs represented are used on a monthly or semi-monthly basis, and are rarely if ever absent in a year's publications. In these representations of proverb use, it is easier to identify use patterns and perhaps to extract narratives from their dynamics.
The easiest, if somewhat trivial, case is "to delay may mean to forget", which owes its yearly rhythm to its role as the NYT's charity tagline. Its frequency of use increased markedly over the period studied, though it stayed confined to the winter holiday months. With the exception of "to delay may mean to forget", and consistent with accepted definitions of the proverb, the consistency with which proverbs are used in the New York Times suggests they are employed widely for their utility in mapping general wisdom to a specific context. Nonetheless, prominent spikes in frequency can be associated with historical events. For instance, the brief severalfold increase in the use of "boys will be boys" around November of 1992 is likely attributable to a contentious and widely publicized sexual assault case at the time, which prompted additional discussion of rape culture [56, 57]. The maximum in use of "pay as you go" seems to correspond with concurrent discussion of a local gas tax levy in New Jersey, and national discussion of President Bush's proposed second-term tax cuts. Its increase in use in 1996 seems to stem from discussion of the Environmental Bond Act being proposed in New York at the time [58, 59].
FIG. 6. Time series plots for the nine most popular 2-gram proverbs on Twitter (ranked by overall count), showing frequency among all 2-grams on Twitter by year. The gray represents the daily frequency, while the orange represents the 30-day rolling average. The proverbs "be yourself" and "time flies" maintain popularity over the period studied. Notably, "safety first" shows an increase in popularity in early 2020, possibly relating to the coronavirus pandemic. Plots are ordered in the grid by rank, first left to right, then top to bottom.
In Figure 5 are time series plots for the 12 most common 2-gram proverbs in the Google n-grams corpus.
Here the gray represents yearly frequency (counted once per volume), and the orange represents the five-year rolling average, normalized by the number of volumes in a given year. One can see clearly from the figure the emergence of several more recent proverbs: "safety first", "money talks", and "shit happens". "Safety first" exhibits a precipitous rise in usage in the early 20th century. Specifically, in 1912, the National Safety Council (NSC) in the US adopted the phrase as its slogan to promote standards of worker safety, though the Safety First Movement was initiated by US Steel in 1906. Its origin has been traced back to at least 1818 [23]. The data shown in Figure 5 support the history of its popularization [33, 60]. Previous scholarship on the proverb "shit happens" traced its origin to the year 1944, and its rise in popularity corresponds to its humorous use as a bumper sticker, and the cultural controversy (and legal battles) associated with it [61, 62]. It also famously appeared in the movie Forrest Gump [63]. Figure 6 shows time series plots for the 16 most popular 3-gram proverbs in the Google Books n-gram corpus. Though the proverb "never say never" originated in 1887 [33], it is evident that it gained far wider popularity in the late 1900s. Though the proverb "enough is enough" dates at least to 1546 [33], its popularity seems to increase vastly throughout the 20th century. The proverb "divide and conquer" seems to have briefly gained popularity around the World War II era. On Twitter, the four most common 2-gram proverbs, on average, do not seem to exhibit much variability in their usage (Figure 7). The proverbs "be yourself" and "time flies" seem to remain above 10^-6, or 1 in every million 2-grams on Twitter, during the period studied. An increase in usage of "safety first" in early 2020 may be related to the onset of the coronavirus pandemic during the same period.
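The normalization and smoothing behind these time series (a five-year rolling average for the book data, a 30-day rolling average for Twitter) can be sketched in a few lines of Python. The helper names `normalized_series` and `rolling_mean` are hypothetical, the counts are invented toy values, and the trailing (rather than centered) window is an assumption, since the alignment of the window is not specified in the text.

```python
def normalized_series(volume_hits, volume_totals):
    """Divide the number of volumes containing a phrase by all volumes, per year."""
    return {
        year: volume_hits.get(year, 0) / total
        for year, total in sorted(volume_totals.items())
        if total > 0
    }


def rolling_mean(series, window=5):
    """Trailing rolling mean over a {period: value} mapping.

    Assumes the periods are consecutive; early periods average over
    however many points are available so far.
    """
    periods = sorted(series)
    smoothed = {}
    for i, period in enumerate(periods):
        chunk = [series[p] for p in periods[max(0, i - window + 1): i + 1]]
        smoothed[period] = sum(chunk) / len(chunk)
    return smoothed


# Invented toy counts: volumes containing a phrase, and all volumes, per year.
hits = {1910: 1, 1911: 2, 1912: 6, 1913: 9, 1914: 12}
totals = {1910: 100, 1911: 100, 1912: 100, 1913: 100, 1914: 100}

yearly = normalized_series(hits, totals)
smoothed = rolling_mean(yearly, window=3)
```

The same `rolling_mean` applies to daily Twitter frequencies with `window=30`, provided the keys sort chronologically.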
On Twitter (Figure 8), the convenience of proverbs as succinct narratives has made them useful as titles for several media events in the past decade. Of note, Figure 8 shows marked shifts in the frequency of "never say never" and "love is blind". "Never say never" owes its initial attention in 2010 to Justin Bieber's single of the same title, later repeated as his slogan and as the title of a biographical documentary (Justin Bieber: Never Say Never). This was not the first film to utilize the proverb in its title; Sean Connery's final performance as James Bond was titled Never Say Never Again (1983). Figure 9 shows the dynamics of "never say never" on Twitter in more detail. We observe first its meteoric rise in popularity at the time of the release of the song Never Say Never as the lead single from the soundtrack for a modern remake of the Karate Kid movie (roughly two orders of magnitude in a single day). At the time of the single's official release on June 8, 2010, "never say never" was the 63rd most used 3-gram on Twitter. When Justin Bieber: Never Say Never was released on January 31, 2011, "never say never" was the 34th most common 3-gram on Twitter; for comparison, "I love you" was 22nd at the time. Remarkably, the popularity of "never say never" on Twitter decayed so slowly that it did not reach its pre-Bieber frequency until 2016. The continued presence of the proverb in Twitter discourse suggests that in the wake of its initial rise, it was more frequently adopted for general, non-Bieber usage. (A similarly popular, nonproverbial 3-gram from a song of that year, "rock that body", appeared and disappeared from the Twitter discourse in the span of a few months.) While the size and fervor of Bieber's fanbase at the time (a period called "Bieber fever" [64]) certainly contributed to its popularity, its continued use over a five-year period is compelling evidence that the proverb became a more integral part of the Twitter lexicon for a time.
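Two quantities used in this discussion, the rank of a phrase among all 3-grams on a given day and the time taken for its frequency to fall back to a pre-event level, can be illustrated with a short Python sketch. The counts and the frequency series below are invented toy values rather than data from this study, the function names are hypothetical, and taking the baseline as the mean of the pre-event values is an assumption made for illustration.

```python
def rank_of(ngram, counts):
    """1-based rank by descending count; tied n-grams share the best rank."""
    target = counts[ngram]
    return 1 + sum(1 for c in counts.values() if c > target)


def first_return_to_baseline(series, baseline, start):
    """Index of the first value at or below `baseline`, searching from `start`.

    Returns None if the series never falls back to the baseline.
    """
    for i in range(start, len(series)):
        if series[i] <= baseline:
            return i
    return None


# Invented daily 3-gram counts for a single day.
counts = {
    "i love you": 900,
    "never say never": 500,
    "rock that body": 500,
    "time will tell": 100,
}
rank = rank_of("never say never", counts)

# Invented relative-frequency series (arbitrary units): quiet, spike, slow decay.
series = [1.0, 1.1, 80.0, 40.0, 10.0, 3.0, 1.0, 0.9]
baseline = sum(series[:2]) / 2          # mean of the pre-event values (assumption)
peak = series.index(max(series))        # position of the spike
returned_at = first_return_to_baseline(series, baseline, peak + 1)
```

On real data, a smoothed series would likely be preferable to raw daily values, since a single quiet day could otherwise be mistaken for a return to baseline.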
In 2020, "Love is Blind" became the title of a literally minded reality dating show in which participants were quarantined in private rooms, only communicating via audio interfaces [65]. In this instance, the proverb was not only an apt description of the show's narrative, but a template for its formation. Additionally, it came to represent a narrative solution to the isolation imposed by the concurrent pandemic. However, the increase in the phrase's popularity seems only to have lasted for the month of the show's release, after which it seems to settle at its former rate of use. The proverb itself is ancient, and translations exist in nearly every European language. While with "never say never" (the most popular proverb on Twitter) we see a sudden rise and slow decay, we see a different pattern in the second most popular proverb, "enough is enough". From 2016 to the present, we see a steady increase in the frequency of "enough is enough" on Twitter. Recent work by Mieder attributes its renewed popularity in part to its constant use by Bernie Sanders [23]. Unlike "never say never", there does not seem to be a single event that precipitates this trend. However, an investigation into the several local maxima suggests a possible narrative correspondence. Many of these local maxima correspond to events related to either police violence or mass shootings. Famously, survivors of the Parkland shooting in 2018 appeared on the cover of Time magazine with a simple title: "Enough." [66]. Coverage of the March for Our Lives against gun violence in the New York Times included the title: March for Our Lives Highlights: Students Protesting Guns Say "Enough Is Enough" [67]. When protesters marched in DC in the wake of the murder of George Floyd, Politico's coverage was titled: 'Enough is enough': Thousands descend on D.C. for largest George Floyd protest yet [68]. Inasmuch as proverbs can create metaphorical mappings from a paradigmatic situation (or narrative) onto a present one, "enough is enough" represents a compelling narrative of continued injustice, and a critical point of retaliation. However, the data from Twitter display a narrative of repeated tragedy in spite of public outcry. The proverb was most popular during the 2018 US midterm elections.
Figure caption (proverb time series on Twitter): The gray represents the daily frequency, while the orange represents the 30-day rolling average. The proverb "never say never" owes its meteoric rise in popularity in 2010 to popular musician Justin Bieber's single and biographical documentary of the same name. "Never say never" remains the most popular proverb on Twitter until 2016, when it is supplanted by "enough is enough", which has steadily gained popularity in the last decade, owing in part to its constant use by Senator Bernie Sanders, and punctuated by reactions to tragedies related to gun and police violence. Plots are ordered in the grid by rank, first left to right, then top to bottom.
This study is by no means the first exploring the potential of new and growing digital databases for the future of phraseology. In fact, in one of the most recent textbooks on paremiology, there is a section on proverbs and corpus linguistics. However, as yet, we believe there has been no large-scale effort to examine the dynamics of proverb use across corpora of several domains. Pioneering work by Čermák and Moon validated the use of computational resources to augment quantitative efforts. That work was limited by the available computational and digital resources. Much attention has been paid to the use of words and n-grams in general in large corpora, but it is difficult to extract from them instances of individual narrative or metaphorical language use. Proverbs, in their tendency to act as both narrative and metaphor, and in their often relatively fixed structure, are perhaps an ideal test case for our ability to observe broader cultural narratives through the piecemeal, routine stories employed by the folk.
Through novel or context-specific words and phrases, we are able to observe discourse around specific phenomena ("pizzagate", "pandemic", or "Make America Great Again"). In contrast, through proverbs we may be able to observe how we organize specific phenomena into the paradigmatic narratives represented by proverbs. Much of proverb scholarship has been concerned with the idea of a "paremiological minimum": a minimum proverbial lexicon for a language and culture. Certainly, as shown by Lau [24], and again in the present study, computational studies of the frequency of proverb use can contribute to the understanding of these minima, as those proverbs which seem ubiquitous in large corpora ought to be understood by speakers of a language. Furthermore, temporal analysis of their frequency may further validate that their frequency is related to enduring currency among the folk, rather than correspondence with a specific occurrence.
FIG. 9. Daily relative frequency of the 3-gram "never say never" on Twitter (frequency among all 3-grams on Twitter). While "never say never" was already popular on Twitter as of 2008, its popularity was amplified in 2010 by the release of Justin Bieber's single entitled "Never say never", and his subsequent biographical documentary of the same name. Remarkably, it remained the most popular proverb on Twitter for almost six years, punctuated by anniversaries and reruns of the movie, until it was surpassed by "enough is enough" in 2016.
Another concern in paremiology and phraseology is the origins of sayings. Work like the present study can serve to both validate and expand on previous scholarship on the history of phrases. In the study of the statistical distribution of natural language, there exists the idea of a kernel lexicon, a subset of words that are essential to communication using a given language.
Much literature on the study of culture and education has focused on what one might consider a "minimum of cultural literacy". Special attention has been paid to which proverbs constitute part of that minimum. It is clear from this study that the most common proverbs vary considerably between corpora. However, given the prevalence of these popular proverbs in their respective contexts, we can posit that English learners would benefit in their comprehension of the language if they were familiar with these proverbs. A natural limitation of this study, and indeed of any study that uses extant data to study language, is the issue of representativeness. In this study that limitation is twofold: both the lexicon directing the search and the data being searched are inherently limited. While The Dictionary of American Proverbs is extensive, and represents much that is known of proverbs in America, it naturally excludes new proverbs and does not account for the many ways in which the structure of the proverbs it contains may be manipulated in practical use. There are, however, lexical resources that address recent proverbs, for example The Dictionary of Modern Proverbs, and the methodology of this study may be readily applied to such lexica [69]. Previous studies on proverb frequency have relied on composite corpora, namely variations of the BNC (British National Corpus), which contains manually curated selections from several domains of text. The present approach of studying data from distinct domains allows for both a more limited and more useful interpretation of the results: we can only claim that results are representative of proverb use on Twitter, for instance, rather than of proverb use in English as a whole, which would be an impossible achievement. Certainly, fieldwork (digital and otherwise) continues to be important in identifying new proverbs and changing structures of existing proverbs.
This task may be aided in the future by tools like StoryWrangler, which track n-gram rank, likely capturing new proverbs in the process. The task then would be extracting likely proverbs from these data, which would require linguistic, cultural, and computational expertise. Analyses of the frequency and rank of proverbs in this study verify that with ever increasing amounts of machine-readable textual data, we may produce longitudinal phraseological studies. Furthermore, as machine/robot comprehension of natural language becomes increasingly important, this area, too, would benefit from an expanded lexicon that includes proverbs and routine formulae, and understanding of metaphor may be assisted by a more basic understanding of the mapping from general to specific situations that exists in the use of proverbs.
Human Behaviour and the Principle of Least-Effort
Quantitative linguistics and complex system studies
Zipf's law holds for phrases, not words
Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited
Proverbs: their lexical and semantic features. Number volume 36 in Supplement series of Proverbium Yearbook of International Proverb Scholarship. The University of Vermont
Quantitative analysis of culture using millions of digitized books
Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Multiword expressions: A pain in the neck for NLP. Lecture Notes in Computer Science
Proverbs are never out of season: Popular wisdom in the modern age
"Proverbs speak louder than words": Folk wisdom in art, culture, folklore, history, literature and mass media
Analogic ambiguity: A paradox of proverb usage
Fixed expressions and idioms in English: A corpus-based approach
Unsupervised Type and Token Identification of Idiomatic Expressions
Models of metaphor in NLP
Metaphors We Live By
Understanding figurative proverbs: A model based on conceptual blending
Learning to identify metaphors from a corpus of proverbs
PROMETHEUS: A corpus of proverbs annotated with metaphors
The literary use of proverbs
Proverbs and social history
Proverbs are the best policy: Folk wisdom and American politics
"Right makes Might": Proverbs and the American worldview
Winick, editors, What goes around comes around: the circulation of proverbs in contemporary life: Essays in Honor of Wolfgang Mieder
On the question of a Russian paremiological minimum
Proverbs and social control
Verbal stereotypes and social control
Proverb familiarity in the United States: Cross-regional comparisons of the paremiological minimum
Modern Proverbs and Proverbial Sayings
How Proverbs Mean: Semantic Studies in English Proverbs
"It's About Time": The ten proverbs most frequently used in newspapers and their relation to American values
The linguistic status of the proverb
A Dictionary of American Proverbs
Proverbs from a corpus linguistic point of view
Human behavior and the principle of least effort: An introduction to human ecology. Martino Publishing
Power-law distributions in empirical data
Universality of rank-ordering distributions in the arts and sciences
Simon's fundamental rich-get-richer model entails a dominant first-mover advantage
Least effort and the origins of scaling in human language
An informational theory of the statistical structure of languages
Text mixing shapes the anatomy of rank-frequency distributions
Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not
The emotional arcs of stories are dominated by six basic shapes
The transformation of gender in English-language fiction
Computational timeline reconstruction of the stories surrounding Trump: Story turbulence, narrative control, and collective chronopathy
Meme-tracking and the dynamics of the news cycle
Big folklore: A special issue on computational folkloristics
Computational folkloristics
The New York Times Annotated Corpus (dataset)
The pandas development team. pandas-dev/pandas: Pandas
Matplotlib: A 2D graphics environment
Text mixing shapes the anatomy of rank-frequency distributions
"To vary the time-honoured adage": Ulysses and the Proverb
Assault case renews debate on rape shield law. The New York Times
Jury chosen in Glen Ridge assault trial. The New York Times
Vote Yes on the Bond Act. The New York Times
How the money was spent in previous environmental Bond Acts. The New York Times
Safety metaphors and theories, a review of the occupational safety literature of the US, UK and The Netherlands, till the first part of the 20th century
Proverbs: a handbook. Greenwood folklore handbooks
Cunningham v. State, 1991
A mathematical model of Bieber Fever: The most infectious disease of our time?
Love Is Blind
The Young and the Relentless
March for Our Lives Highlights: Students Protesting Guns Say 'Enough Is Enough'. The New York Times
"Enough is enough": Thousands descend on D.C. for largest George Floyd protest yet
The Dictionary of Modern Proverbs
Thank you to David Dewhurst, Josh Minot, Michael Arnold, Nicholas Allgaier, and Thayer Alshaabi for providing invaluable guidance in the writing of this paper. The authors are grateful for the computing resources provided by the Vermont Advanced Computing Core, which was supported in part by NSF award No. OAC-1827314, and for financial support from the Massachusetts Mutual Life Insurance Company to CMD and PSD.
Tables SI-SIV show the total count of the 50 most popular proverbs in their respective corpora. Supplementary Figures S1-S4 are in the same format as the time series plots in the body, but show data for proverbs ranked 17-32 in their respective corpora.