key: cord-219817-dqmztvo4
authors: Oghaz, Toktam A.; Mutlu, Ece C.; Jasser, Jasser; Yousefi, Niloofar; Garibay, Ivan
title: Probabilistic Model of Narratives Over Topical Trends in Social Media: A Discrete Time Model
date: 2020-04-14
journal: nan
DOI: nan
sha:
doc_id: 219817
cord_uid: dqmztvo4

Online social media platforms are turning into the prime source of news and narratives about worldwide events. However, a systematic summarization-based narrative extraction that can facilitate communicating the main underlying events is lacking. To address this issue, we propose a novel event-based narrative summary extraction framework. Our proposed framework is designed as a probabilistic topic model, with categorical time distribution, followed by extractive text summarization. Our topic model identifies topics' recurrence over time with a varying time resolution. This framework not only captures the topic distributions from the data, but also approximates the user activity fluctuations over time. Furthermore, we define significance-dispersity trade-off (SDT) as a comparison measure to identify the topic with the highest lifetime attractiveness in a timestamped corpus. We evaluate our model on a large corpus of Twitter data, including more than one million tweets in the domain of the disinformation campaigns conducted against the White Helmets of Syria. Our results indicate that the proposed framework is effective in identifying topical trends, as well as extracting narrative summaries from text corpora with timestamped data.

Social media and microblogging platforms, such as Twitter and Facebook, are becoming the primary sources of real-time content regarding ongoing socio-political events, such as the United States Presidential Election in 2016 [11], and natural and man-made emergencies, such as the COVID-19 pandemic in 2020 [9].
However, without the appropriate tools, the massive textual data from these platforms makes it extremely challenging to obtain relevant information on significant events, distinguish between high-quality and unreliable content [17], or identify the opinions within a polarized domain [13]. The challenges mentioned above have been studied from different aspects related to topic detection and tracking within the field of Natural Language Processing (NLP). Researchers have developed automatic document summarization tools and techniques, which intend to provide concise and fluent summaries over a large corpus of textual data [20]. Preserving the key information in the summary and producing summaries that are comparable to human-created narratives are the primary goals of the extractive and abstractive approaches for automatic text summarization [2]. News websites are a prime example of the application of such techniques, where automatic text summarization algorithms are applied to generate news headlines and titles from the news content [31]. The shortage of labeled data for text analysis has encouraged researchers to develop novel unsupervised algorithms that consider the co-occurrence of words in documents, as well as emerging techniques that exploit an additional source of information, as in Wikipedia knowledge-based topic models [37, 38]. Additionally, unsupervised learning enables training general-purpose systems that can be used for a variety of tasks and applications as strong classifiers [7]. In this regard, statistical models of co-occurrence such as Latent Dirichlet Allocation (LDA) [6] discover the relevant structure and co-occurrence dependencies of words within a collection of documents to capture the distribution of the latent topic variable from the data.

1 In this study, we have used the terms narrative and story interchangeably. *Equal contribution.
Although abundant timestamped textual data, particularly from social media platforms and news reports, are available for analysis, the changes in the distribution of data over time have been neglected in most of the topic mining algorithms proposed in the literature [35]. For instance, time-series analysis on datasets over the events related to the 2012 US presidential election suggests that modeling topics and extracting summaries without considering the text-time relationship leads to missing the rise and fall of topics over time, the changes in term correlations, and the emergence of new topics [12]. Although continuous-time topic models such as [35] have been proposed in the literature, topic models with a continuous-time distribution cannot model many modes in time, which leads to a deficiency in modeling the fluctuations. Additionally, continuous-time models suffer from instability problems when analyzing a multimodal dataset that is sparse in time. In this paper, we propose a probabilistic model of topics over time with categorical time distribution to detect topical recurrence, designed as an LDA-based generative model. To achieve probabilistic modeling of narratives over topical trends, we incorporate the components of narratives, including named entities and temporal-causal coherence between events, into our topic model. We believe that what differentiates a narrative model from topic analysis and summarization approaches is the ability to extract relevant sequences of text relative to the corresponding series of events associated with the same topic over time. Accordingly, our proposed narrative framework integrates unsupervised topic mining with extractive text summarization for narrative identification and summary extraction. We compare the narratives identified by our model with the topics identified by Latent Dirichlet Allocation (LDA) [6] and Topics over Time (TOT) [35].
This comparison includes presenting numerical results and analysis for a large corpus of more than one million tweets in the domain of disinformation campaigns conducted against the White Helmets of Syria. The collected dataset contains tweets spanning 13 months within the years 2018 and 2019. Our results provide evidence that our proposed method is effective in identifying topical trends within timestamped data. Furthermore, we define a novel metric called significance-dispersity trade-off (SDT) in order to compare and identify topics with higher lifetime attractiveness in timestamped data. Finally, we demonstrate that our proposed model discovers time-localized topics over events that approximate the distribution of user activities on social media platforms. The remainder of this paper is organized as follows: First, an overview of the related works is provided in Section 2. In Section 3, we provide a detailed explanation of our proposed method followed by the experimental setup and results. Finally, in Section 5 we conclude the paper and discuss future directions.

In this section, we first provide a background on narrative analysis and how the literature has investigated stories in social media. Then, we present an overview of topic modeling and text summarization.

Narratives can be found in all day-to-day activities. The fields of research on narrative analysis include narrative representation, coherence and structure of narratives, and the strategies, aim, and functionality of storytelling [22]. From a computational perspective, narratives may relate to topic mining, text summarization, machine translation [33], and graph visualization. The latter can be achieved by using directed acyclic graphs (DAGs) to demonstrate relationships over the network of entities [15]. Narrative summaries can be constructed from an ordered chain of individual events, with causality relationships amongst the events, appearing within a specific topic [18].
The narrative sequence may report fluctuations over time relative to the underlying events. Additionally, a story-like interpretation of the text is necessary to imply a narrative [25]. Since social media have been accepted as a component of today's society, many studies have investigated narratives in social media content [14, 25, 34]. These narratives contain small autobiographies that have been developed in personal profiles and cover trivial everyday life events. Other types of narratives appearing in social media platforms consist of breaking news and long stories of past events [25]. Some types of narratives, such as breaking news, result in the emergence of other narratives related to the predictions or projections of events in the near future [14]. This literature views social media conversation cascades as stories that are co-constructed by the tellers and their audience, and that circulate amongst the public within and across social media platforms. Moreover, events have been considered as the causes of online user activity and can be identified via activity fluctuations over time [3, 25]. Developing appropriate tools for social media narrative analysis can facilitate communicating the main ideas regarding the events in large data. As social media activities generate abundant timestamped multimodal data, many studies such as [8] have presented algorithms to discover the topics and develop descriptive summaries over social media events.

Topic models are probabilistic models that discover word patterns reflecting the underlying topics in a collection of documents [1]. The most commonly used approach to topic modeling is Latent Dirichlet Allocation (LDA) [19]. LDA is a generative probabilistic model with a hierarchical Bayesian network structure that can be used for a variety of applications with discrete data, including text corpora. When LDA is used for topic mining, a document is a bag-of-words that has a mixture of latent topics [6].
Many advanced topic modeling approaches have been derived from LDA, including Hierarchical Topic Models [15, 16] that learn and organize the topics into a hierarchy to address a super-sub topic relationship. This approach is well-suited for analyzing social media and news stories that contain rich data over a series of real-world events [30]. Dynamic topic models [5] and topic models over time with continuous-time distribution [35] intend to capture the rise and fall of topics within a time range. However, continuous-time topic models, such as those with a beta or normal time distribution, cannot model many modes in time. Furthermore, the smooth time distribution over topics does not allow recognizing distinct topical events in the timestamped dataset, where topical events reflect the event-based topic activity fluctuations over time. Topic modeling and summarization of social media data is challenging as a result of certain restrictions, such as the maximum number of characters allowed on the Twitter platform. As short texts or microblogs have low word co-occurrence and little contextual information, models designed for short-text topic analysis and summarization may use short-text aggregation to enrich the relevant context before further analysis [27].

Document summarization techniques are generally categorized into abstractive and extractive text summarization models. Herein, we consider extractive text summarization methods. Several algorithms for extractive text summarization have been proposed in the literature that assign a salience score to sentences [10]. To summarize a text corpus with short text, [29] presents an automatic summarization algorithm with topic clustering, cluster ranking and assigning scores to the intermediate features, and sentence extraction.
Some other approaches, particularly for Twitter data, include aggregating tweets by hashtags or conversation cascades [27, 32], and obtaining summaries for a targeted event of interest as one or a set of tweets that are representative of the topics [8]. Additionally, neural network-based summarization models [23, 28], commonly with an encoder-decoder architecture, leverage an attention mechanism for contextual information among sentences, or the ROUGE evaluation metric, to identify discriminative features for sentence ranking and summarization. However, these architectures require labeled datasets and might not apply to short text. Text summarization with compression using neural networks is proposed by [36], which applies joint extraction and syntactic compression to rank compressed summaries with a neural network. Our focus in the present work is on probabilistic topic modeling and extractive text summarization to provide descriptive narratives for the underlying events that occurred over a period of time.

In this section we explain our narrative framework. The framework comprises 2 steps: I. narrative modeling based on topic identification over time; and II. extractive summarization from the identified narratives. To discover the narratives over topical events, first, we use our discrete-time generative narrative model as an unsupervised learning algorithm to learn the distribution of textual contents from daily conversation cascades. Then, we extract narrative summaries over topical events from sentences in the time categories. This is achieved by sampling from the identified distribution of narratives and performing sentence ranking. The narrative modeling and summarization steps are explained below in separate subsections.

To model narratives, we design our topic model such that the discovered topics present a series of timely ordered topical events. Accordingly, the topical events deliver a narrative covering distinct events over the same topic.
In this regard, we present Narratives Over Categorical time (NOC), a novel probabilistic topic model that discovers topics based on both word co-occurrence and temporal information to present a narrative of events. According to the topic-time relationship explained above, we refer to the topics as narratives, to topical events as events, and to the extracted timely ordered sentences of documents with a high probability of belonging to each event as the extracted narrative summary. To fully comply with the definition of narrative, we assume a causality relation between the conversation cascades in social media. However, we do not investigate the causality relation across the conversation cascades or named entities. The differences between our narrative model and dynamic topic models [5], topic models with continuous time distribution [35], and hierarchical topic models [16, 26] include: not filtering the data for a specific event, imposing sharp transitions for topic-time changes with time slicing, discovering topical events without scalability and sparsity issues, allowing multimodal topic distribution in time as a result of the categorical time distribution, and selecting an appropriate slicing size such that distinct topical events are recognizable. Additionally, categorical time distribution enables discovering topical events with varying time resolution, for instance, weekly, biweekly, and monthly. Time discretization raises the question of selecting the appropriate slicing size or number of categories, which depends on the characteristics of the dataset under study. In contrast, topic models with continuous time distribution cannot model many modes in time. Additionally, continuous time models such as [35] suffer from instability problems if the dataset is multimodal and sparse in time.
Furthermore, categorical time enables discovering topic recurrence, which results in identifying topical events related to distinct narrative activities, which is of interest in this paper. Narrative activities in social media refer to the amount of textual content that is circulating in online platforms over time, corresponding to a specific topic. The generative process in NOC models timestamps and words per document, with inference by Gibbs sampling, which is a Markov Chain Monte Carlo (MCMC) algorithm. The graphical model of NOC is illustrated in Figure 1. As can be seen from the graphical model, the posterior distribution of topics is dependent on both text and time modalities. This generative procedure can be described as follows:
I. Draw T multinomials ϕ_z, one for each topic z, from a Dirichlet prior β;
II. For each document d, draw a multinomial θ_d from a Dirichlet prior α;
III. For each word w_di in d:
(a) draw a topic z_di from multinomial θ_d;
(b) draw a word w_di from multinomial ϕ_{z_di};
(c) draw a timestamp category t_di from the categorical distribution ψ_{z_di}.
In this model, Gibbs sampling provides approximate inference instead of exact inference. To calculate the probability of topic assignment to word w_di, we first need to calculate the joint probability of the dataset as P(z_di, w_di, t_di | w_−di, t_−di, z_−di, α, β, ψ) and use the chain rule to derive the probability P(z_di | w, t, z_−di, α, β, ψ), where the −di subscript refers to all tokens except w_di. In this conditional, n_zv refers to the number of times word v is assigned to topic z, m_dz refers to the number of word tokens in document d that are assigned to topic z, and b_k represents the kth time slice. The details on the Gibbs sampling derivation can be found in the Appendix section. After each iteration of Gibbs sampling, we update the probability p(t_{z_di} ∈ b_k) based on the indicator I(.), which is equal to 1 when t_{z_di} ∈ b_k, and 0 otherwise. In this paper, we report results with bi-weekly categorical time resolution.
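The update of the per-topic time distribution and the per-token topic weights described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard collapsed Gibbs form for LDA multiplied by the categorical time factor, symmetric scalar priors, and a small smoothing constant. The function names, argument names, and the `smoothing` parameter are hypothetical.

```python
def update_psi(topic_of_token, slice_of_token, num_topics, num_slices, smoothing=1e-9):
    """Estimate psi[z][k]: share of tokens assigned to topic z whose timestamp is in slice b_k.

    Assumption: psi is the normalised sum of the indicator I(t in b_k) over the
    tokens currently assigned to topic z; `smoothing` only avoids division by zero.
    """
    counts = [[0.0] * num_slices for _ in range(num_topics)]
    for z, k in zip(topic_of_token, slice_of_token):
        counts[z][k] += 1.0  # indicator I(t in b_k), accumulated per topic
    psi = []
    for row in counts:
        total = sum(row) + smoothing * num_slices
        psi.append([(c + smoothing) / total for c in row])
    return psi


def topic_weights(m_d, n_w, n_total, psi, k, alpha, beta, vocab_size):
    """Unnormalised conditional weights for one token (counts exclude that token).

    m_d[z]     : tokens in the current document assigned to topic z
    n_w[z]     : times the current word is assigned to topic z
    n_total[z] : all tokens assigned to topic z
    k          : index of the time slice of the current token
    Assumed form: (m_dz + alpha) * (n_zw + beta) / (n_z + V * beta) * psi[z][k].
    """
    return [
        (m_d[z] + alpha) * (n_w[z] + beta) / (n_total[z] + vocab_size * beta) * psi[z][k]
        for z in range(len(m_d))
    ]


# Toy usage: 5 tokens, 2 topics, 3 time slices.
psi = update_psi([0, 0, 1, 1, 1], [0, 1, 1, 1, 2], num_topics=2, num_slices=3)
weights = topic_weights([1, 0], [1, 0], [2, 3], psi, k=0, alpha=1.0, beta=0.5, vocab_size=4)
```

In a full sampler, the weights would be normalised and a new topic drawn for the token, and `update_psi` would be called once per sweep over the corpus.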
To determine the values for hyper-parameters α and β and to investigate the sensitivity of the model to these values, we repeated our experiment with symmetric Dirichlet distributions using values α ∈ {0.1, 0.5, 1} and β ∈ {0.01, 0.1, 0.5, 0.8, 1}. We observed that the model did not show significant sensitivity to the values of these hyper-parameters. Thus, we fix α = 1 and β = 0.5, both as symmetric Dirichlet distributions. We initialize the hyperparameter ψ in 2 ways for comparison: I. random initialization (model referred to as NOC_R); and II. based on the probability of user activity per time category, illustrated in Figure 3c (model referred to as NOC_A). To estimate the number of topics for our experiments, we first visualize the tweets' hashtag co-occurrence graph. We measure the graph modularity to examine the structure of the communities in this graph. We observe the highest modularity score of 0.41 using a modularity resolution equal to 0.85. Figure 2 illustrates a downsampled version of this graph, where each color represents a modularity class. The edges of the graph are weighted according to the number of hashtag co-occurrences in the document collection. Our modularity analysis suggests that few distinct hashtag communities exist. Additionally, the dataset under study contains tweets associated with a single domain. As a result, we assume the number of topics to be relatively low. To choose an appropriate number of topics, we repeated LDA with the number of topics T ∈ {4, ..., 20} in increments of 1. We evaluated the c_v coherence of topics identified by LDA and observed the highest coherence score for T = 5. Thus, we report our experimental results using this value.

We employ the discovered probabilities of topics over documents, θ, probabilities of words per topic, ϕ, and probabilities of topics per time category, ψ, to perform sentence ranking.
This ranking allows extracting the sentences with the highest scores of belonging to each topic. This is achieved by performing weighted sampling on the collection of documents based on the probabilities of topics per time category ψ and drawing D documents from θ. The weighted sampling leads to drawing more documents from the time categories b_k with a higher ψ, as these time slices contain more documents related to the topic z. Each document contains a sequence of sentences (s_1, s_2, ..., s_J) ∈ d from the aggregated conversation cascades per day. Information on the aggregation of conversation cascades and document preparation can be found in Section 4.2. Since the social media narrative activity over a topic evolves from the circulation of identical or similar textual content in the platform, the content involves significant similarity. For instance, the Twitter conversation cascades include replies, quotes, and comments, where replies and quotes duplicate the textual content. Therefore, we applied the Jaro-Winkler distance over the timely ordered sentences and dismissed the sentences with similarity above 70%, while keeping the longest sentence. After removing redundant text as described earlier, we calculate the probability of each sentence s_j by measuring the sum of the probabilities of topics for words w_di ∈ s_j. Then, we select the sentences with the highest cumulative probability of words w per topic z. Summary coherence was induced as suggested in [4] by ordering the extracted sentences according to their timestamps such that the oldest sentences appear first. Table 4 in the Appendix section contains the extracted narrative summaries for 5 topics for a sample run.

As mentioned earlier, the topics discovered by NOC present a series of timely ordered topical events. Thus, the topical events deliver a narrative covering distinct social media events over the same topic.
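The extractive summarization step described above (redundancy removal, sentence scoring, and timestamp ordering) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `difflib.SequenceMatcher` from the Python standard library is swapped in for the Jaro-Winkler measure, sentences are tokenized by whitespace, and the names `summarize`, `phi_z`, `top_n`, and `sim_threshold` are hypothetical.

```python
from difflib import SequenceMatcher


def summarize(sentences, phi_z, top_n=3, sim_threshold=0.7):
    """Return up to top_n sentence texts for one topic, oldest first.

    sentences : list of (timestamp, text) pairs
    phi_z     : dict mapping word -> p(word | topic z)
    Assumption: string similarity is SequenceMatcher.ratio(), used here in
    place of Jaro-Winkler; near-duplicates above sim_threshold are merged,
    keeping the longest sentence.
    """
    kept = []
    for ts, text in sorted(sentences, key=lambda s: s[0]):
        dup_index = None
        for i, (_, other) in enumerate(kept):
            if SequenceMatcher(None, text.lower(), other.lower()).ratio() > sim_threshold:
                dup_index = i
                break
        if dup_index is None:
            kept.append((ts, text))
        elif len(text) > len(kept[dup_index][1]):
            kept[dup_index] = (ts, text)  # keep the longest of the near-duplicates

    def score(item):
        # Sum of per-word topic probabilities over the words of the sentence.
        return sum(phi_z.get(word, 0.0) for word in item[1].lower().split())

    best = sorted(kept, key=score, reverse=True)[:top_n]
    return [text for _, text in sorted(best, key=lambda s: s[0])]


# Toy usage with made-up probabilities.
toy_sentences = [
    (2, "white helmets filmed attack"),
    (1, "white helmets filmed the attack today"),
    (3, "funding frozen"),
    (4, "unrelated words here"),
]
toy_phi = {"white": 0.2, "helmets": 0.2, "attack": 0.3, "funding": 0.1}
summary = summarize(toy_sentences, toy_phi, top_n=2)
```

The final sort by timestamp mirrors the ordering used for summary coherence, so that the oldest selected sentence appears first.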
Figure 3 demonstrates the narrative distributions generated with NOC, where the hyperparameter ψ was randomly initialized (referred to as NOC_R). This figure shows that the narratives identified by our model are distinct from each other and that the collapsed distribution of all narratives approximates the distribution of social media user activity over time. The identified narratives can be evaluated using effective evaluation metrics for topic models. Accordingly, we calculate pointwise mutual information [24] to measure the coherence of a topic z over its most probable words, where K is the number of most probable words for each narrative, p(w_j) and p(w_k) refer to the probabilities of occurrence for words w_j and w_k, and p(w_j, w_k) represents the probability of co-occurrence for the two words in the collection of documents. We compare our model with LDA and TOT [35], where TOT is a probabilistic topic model over time with a Beta distribution for time. Table 2 displays the average coherence score measured across the topics discovered by LDA, TOT, and NOC. For NOC, we investigate initializing the hyperparameter ψ with random and user activity-based initialization, referred to as NOC_R and NOC_A, respectively. We consider the K = 500 most probable words from each topic. This comparison suggests that the narratives identified by NOC are more coherent than the topics identified by LDA, with an improvement in coherence of about 35%. The observed improvement compared with TOT was about 27%. Additionally, initializing the hyperparameter ψ in NOC using the distribution of user activity improves the narrative coherence by about 3%.

The topic attractiveness to social media users can be investigated as a measure of the length of conversation cascades, the number of initiated textual content, and the number of unique users performing an activity relative to the underlying topic.
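The PMI-based coherence score used in the comparison of LDA, TOT, and NOC above can be sketched as follows. This is a minimal illustration that assumes the common formulation, namely the average of log(p(w_j, w_k) / (p(w_j) p(w_k))) over all pairs of a topic's top words, with probabilities estimated from document-level co-occurrence. The exact normalisation used by the authors is not recoverable from the text, and the names `pmi_coherence` and `eps` are hypothetical.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words (needs at least 2 words).

    top_words : list of the K most probable words of the topic
    documents : list of token lists
    Assumption: p(w) and p(w_j, w_k) are the fractions of documents that
    contain the word or both words; eps guards against log(0).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = [
        math.log((prob(wj, wk) + eps) / (prob(wj) * prob(wk) + eps))
        for wj, wk in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)


# Toy usage: "a" and "b" always co-occur, so their PMI is log(0.5 / 0.25) = log 2.
toy_docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
coherence = pmi_coherence(["a", "b"], toy_docs)
```

Higher values indicate that a topic's top words co-occur more often than chance would predict.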
The user activity fluctuations for timestamped data may contain activity bursts that are illustrative of significant events. Similarly, the generation and propagation of textual content within an online platform can illustrate the narrative activity relative to the events over time, where a burst represents a significant narrative activity. Additionally, the recurrence of a topic can be considered as an attractiveness measure for the associated topic. In this regard, we propose the significance-dispersity trade-off (SDT) metric to compare the identified narratives against each other. SDT measures the lifetime attractiveness of the identified narratives based on the distribution of narratives over topical events. The proposed metric quantifies the significance of the narrative activities and the recurrence of a topic by employing the Shannon entropy of the discovered narrative distributions. The intuition behind the SDT score is that the value of the entropy is maximum when the probability distribution is uniform. In contrast, this value is minimum if the distribution is a delta function. This is visualized in Figure 4 in the Appendix section. We define the dispersity of a categorical time topic distribution as a measure of the dispersion of the time categories. Based on this definition, the SDT score of topic z can be obtained as SDT_z = H^α (H_max − H)^(1−α), where H is the Shannon entropy for the categorical distribution of time for topic z, H_max = log_2(K), and K refers to the number of time slices in the distribution. We assume that social media topics with high lifetime attractiveness are significant and recurrent. However, the probability distribution imposes a trade-off between the two. The parameter α (a weighting exponent, distinct from the Dirichlet prior used in the topic model) provides a weighted geometric mean of H and H_max − H that enables promoting either significance or recurrence, depending on the application under study. A larger value of α promotes dispersity in the SDT score, and a smaller value promotes mode significance.
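The SDT score can be illustrated with a short sketch. It assumes that the score is the weighted geometric mean of H and H_max − H that the text describes, with the trade-off parameter as the exponent on H; the function names and the argument name `weight` are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a categorical distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sdt_score(probs, weight=0.5):
    """Weighted geometric mean of H and H_max - H for one topic's time distribution.

    probs  : probabilities of the topic over the K time slices (sums to 1)
    weight : trade-off parameter in [0, 1]; larger values favour dispersity
    Assumption: SDT = H**weight * (H_max - H)**(1 - weight), with H_max = log2(K).
    """
    h_max = math.log2(len(probs))
    # Clamp to [0, h_max] so floating-point noise never yields a negative base.
    h = min(max(shannon_entropy(probs), 0.0), h_max)
    return (h ** weight) * ((h_max - h) ** (1.0 - weight))


# Toy usage over K = 4 time slices.
delta_like = sdt_score([1.0, 0.0, 0.0, 0.0])    # single burst: H = 0
uniform = sdt_score([0.25, 0.25, 0.25, 0.25])   # fully dispersed: H = H_max
bimodal = sdt_score([0.5, 0.5, 0.0, 0.0])       # two equal modes: H = 1, H_max = 2
```

Under this assumed form, both the single-burst and the uniform distribution score 0 for weights strictly between 0 and 1, while distributions with a few pronounced modes score higher, which matches the stated trade-off between significance and recurrence.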
The SDT score is bounded, where H = 0 occurs when the distribution under study is a delta function, and H = H_max corresponds to the uniform distribution. Since the categorical time distribution of our narrative model allows many modes in time, recurrent narratives can be identified. Additionally, the narrative activity fluctuations can be modeled using the categorical time distribution in topic analysis. Table 3 provides a comparison of the SDT scores measured for the 5 identified narratives, using varying values of α. The distribution of the extracted narratives is illustrated in Figure 3a. We can clearly see in this figure that narratives 1 and 3 have the highest dispersity. In contrast, narratives 4 and 2 have the highest significance. We compare SDT_i for narrative i with the amount of user activity associated with narrative i. The results suggest that the SDT score can be used to identify the narrative with the highest lifetime attractiveness in a timestamped dataset. In our experiments, this is achieved for topic 1 when the value of α is greater than or equal to 0.7. As can be seen, this topic is associated with the highest user activity count, reported in the same table.

To analyze topical events and provide narratives, we investigate a Twitter dataset on the domain of the White Helmets of Syria over a period of 13 months from April 2018 to April 2019. This dataset was provided to us by Leidos Inc as part of the Computational Simulation of Online Social Behavior (SocialSim) program initiated by the Defense Advanced Research Projects Agency (DARPA). We analyze more than 1,052,000 tweets from this period. To prepare the model inputs, we filter out non-English tweets. Then, we clean up the data by removing usernames, short URLs, and emoticons. Additionally, we remove the stopwords and perform Part of Speech (POS) tagging and Named Entity Recognition (NER) on each tweet using the Stanford Named Entity Recognizer model.
Using the NER tool, we extract persons, locations, and organizations and remove all pseudo-documents that do not contain named entities, similar to [21]. Furthermore, we remove the tweets shorter than 3 words. As Twitter enforces a maximum limit of 280 characters per tweet, the collected tweets lack context information and have very low word co-occurrence. To tackle the challenge of topic modeling on short-text tweets and to include plentiful context information, we prepare pseudo-documents for our model inputs by aggregating daily root, parent, and reply/quote/retweet comments in each conversation cascade. We maintain the order of the conversation according to the timestamps associated with each tweet. This text aggregation method results in pseudo-documents rich in context and related words with a daily time resolution. We use the output of the pre-processing phase as the model input pseudo-documents, referred to as documents in this paper.

In this paper, we addressed the problem of narrative modeling and narrative summary extraction for social media content. We presented a narrative framework consisting of I. Narratives Over Categorical time (NOC), a probabilistic topic model with categorical time distribution; and II. extractive text summarization. The proposed narrative framework identifies narrative activities associated with social media events. Identifying topics' recurrence and significance over time categories with our model allowed us to propose the significance-dispersity trade-off (SDT) metric. SDT can be employed as a comparison measure to identify the topic with the highest lifetime attractiveness in a timestamped corpus. Results on real-world timestamped data suggest that the narrative framework is effective in identifying distinct and coherent topics from the data. Additionally, the results illustrate that the identified narrative distributions approximate the user activity fluctuations over time.
Moreover, informative and concise narrative summaries for timestamped data are produced. Further improvement of the narrative framework can be achieved by taking into account the causality relation across the social media conversation cascades and social media events. Other future directions include identifying topical hierarchies and extracting summaries associated with each hierarchy.

Starting with the joint distribution P(w, t, z | α, β, ψ), we can use conjugate priors to simplify the equations as below: P(w, t, z | α, β, ψ) = P(w | z, β) p(t | ψ, z) P(z | α), where P and p refer to the probability mass function (PMF) and probability density function (PDF), respectively. The conditional probability P(z_di | w, t, z_−di, α, β, ψ) can then be found using the chain rule. The probability p(t_di ∈ b_k) is measured based on the indicator I(.), which is equal to 1 when t_{z_di} ∈ b_k, and 0 otherwise.

Table 4: Extracted narrative summaries for 5 topics for a sample run.
Remember first they said the video including the pics of the chlorine cylinder was fake. Whitehelmets One America News Pearson Sharp Visits Hospital in Douma Where White Helmets Filmed Chemical Attack Hoax Multiple Eyewitness Doctors Say No Chemical Attack Took Place Syria. This is the video evidence of the airstrike on Zardana an Idlib town controlled by Very expensive camera on the helmet of the WhiteHelmets rescuer. White Helmets making films of chemical attacks with children in Idlib.
Chemical, Attack, Douma, Terrorist, Fake, Child, Propaganda, Video, Russian, Russia
From the fabrication of the plays of the chemist and coverage of the crimes of terrorism to the public cooperation with the Israeli army the white helmets. They are holding children! Another chemical attack is imminent its all they've got left! 4 dead including two children and more than 50 wounded mostly women and children. Love the White Helmets propaganda almost as untruthful as the BBC. Trumps USA has built a rationale for its public that it will need to support rebels in holding on to a large chunk of Syria.
I wonder how it is possible that criminal associations such as WhiteHelmets and the Syrian Human Rights Observatory can make the world go round as they want by influencing the policies of world leaders. U.S. freezes funding for Syrias White Helmets. White helmets are terrorists. Former Head of Royal Navy Lord West on BBC White Helmets Aren't Neutral They're On The Side Of The Terrorists.

The summaries provided here are the results for a sample run of the proposed narrative framework and do not reflect authors' personal opinions.

[1] A survey of topic modeling in text mining
[2] Text summarization techniques: a brief survey
[3] Leveraging burst in twitter network communities for event detection
[4] Sentence ordering in multidocument summarization
[5] Dynamic topic models
[6] Latent dirichlet allocation
[7] A Survey of Multi-Label Topic Models
[8] Automatic summarization of events from social media
[9] The covid-19 social media infodemic
[10] Summarizing microblogs during emergency events: A comparison of extractive summarization algorithms
[11] Twitter as arena for the authentic outsider: exploring the social media campaigns of Trump and Clinton in the 2016 US presidential election
[12] ThemeDelta: Dynamic segmentations over temporal topic models
[13] Polarization in social media assists influencers to become more influential: analysis and two inoculation strategies
[14] Small Stories Research: A Narrative Paradigm for the Analysis of Social Media. The Sage Handbook of social media research methods
[15] HiEve: A corpus for extracting event hierarchies from news stories
[16] Hierarchical topic models and the nested chinese restaurant process
[17] ClaimBuster: the first-ever end-to-end factchecking system
[18] Skip n-grams and ranking functions for predicting script events
[19] Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
[20] Twitter Based Event Summarization
[21] Real-time entity-based event detection for twitter
[22] Models of narrative analysis: A typology
[23] Ranking sentences for extractive summarization with reinforcement learning
[24] Automatic evaluation of topic coherence
[25] Seriality and storytelling in social media
[26] Large-scale hierarchical topic models
[27] Short and sparse text topic modeling via self-aggregation
[28] Leveraging contextual sentence relations for extractive summarization using a neural attention model
[29] Sumblr: continuous summarization of evolving tweet streams
[30] Sub-story detection in Twitter with hierarchical Dirichlet processes
[31] From Neural Sentence Summarization to Headline Generation: A Coarse-to-Fine Approach
[32] Seq2Seq models for recommending short text conversations
[33] Narrative information extraction with non-linear natural language processing pipelines
[34] Make data sing: The automation of storytelling
[35] Topics over time: a non-Markov continuous-time model of topical trends
[36] Neural extractive text summarization with syntactic compression
[37] Incorporating Wikipedia concepts and categories as prior knowledge into topic models. Intelligent Data Analysis
[38] Concept over time: the combination of probabilistic topic model with wikipedia knowledge