MISNIS: An Intelligent Platform for Twitter Topic Mining

Accepted Manuscript

Joao P. Carvalho, Hugo Rosa, Gaspar Brogueira, Fernando Batista
PII: S0957-4174(17)30531-6. DOI: 10.1016/j.eswa.2017.08.001. Reference: ESWA 11470.
To appear in: Expert Systems With Applications
Received date: 1 February 2016. Revised date: 31 July 2017. Accepted date: 1 August 2017.
Please cite this article as: Joao P. Carvalho, Hugo Rosa, Gaspar Brogueira, Fernando Batista, MISNIS: An Intelligent Platform for Twitter Topic Mining, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.08.001

Highlights:
- An intelligent platform to efficiently collect and manage large Twitter corpora
- Circumvents Twitter restrictions that limit free access to 1% of all flowing tweets
- An add-on implementing intelligent methods for Twitter topic mining
- Intelligent retrieval of tweets related to a given topic
- A case study is presented as a demonstration example

MISNIS: An Intelligent Platform for Twitter Topic Mining

Joao P. Carvalho, INESC-ID/Instituto Superior Técnico, Universidade de Lisboa, R. Alves Redol, 9, 1000-029 Lisboa, Portugal, joao.carvalho@inesc-id.pt
Hugo Rosa, INESC-ID, Portugal, hugo.rosa@inesc-id.pt
Gaspar Brogueira, INESC-ID/ISCTE-IUL, Portugal, gmrba@iscte.pt
Fernando Batista, INESC-ID/ISCTE-IUL, Portugal, Fernando.Batista@inesc-id.pt

Abstract

Twitter has become a major tool for spreading news, for the dissemination of positions and ideas, and for the commenting and analysis of current world events. However, with more than 500 million tweets flowing per day, it is necessary to find efficient ways of collecting, storing, managing, mining and visualizing all this information. This is especially relevant if one considers that Twitter has no way of indexing tweet contents, and that the only available categorization "mechanism" is the #hashtag, which is totally dependent on a user's will to use it. This paper presents an intelligent platform and framework, named MISNIS - Intelligent Mining of Public Social Networks' Influence in Society - that addresses these issues and allows a non-technical user to easily mine a given topic from a very large tweet corpus and obtain relevant contents and indicators such as user influence or sentiment analysis. When compared to other existing similar platforms, MISNIS is an expert system that includes specifically developed intelligent techniques that: (1) Circumvent the Twitter API restrictions that limit access to 1% of all flowing tweets. The platform has been able to collect more than 80% of all flowing Portuguese language tweets in Portugal when online; (2) Intelligently retrieve most tweets related to a given topic even when the tweets do not contain the topic #hashtag or user indicated keywords.
A 40% increase in the number of retrieved relevant tweets has been reported in real world case studies. The platform is currently focused on Portuguese language tweets posted in Portugal. However, most developed technologies are language independent (e.g. intelligent retrieval, sentiment analysis, etc.), and technically MISNIS can be easily expanded to cover other languages and locations.

Keywords: Twitter; Intelligent Topic Mining; Fuzzy Fingerprints; Text Analytics; Sentiment Analysis.

1 Introduction

When Twitter was launched in 2006 as a simple public social networking service enabling users to send and read short 140-character messages, hardly anyone could predict that it would become a major tool for spreading news, for the dissemination of positions or ideas, and for the commenting and analysis of current world events. This became evident with the so-called "Arab Spring" in 2010, where Twitter was used as an alternative means of communicating to the outside world what was censored by state-controlled traditional news broadcasters. During the subsequent years, events such as the "Spanish protests", the "London riots" or the "Taksim Gezi Park protests" further increased the notion that important events are often commented on Twitter before they become "public news". This has led to a change in how the public perceives the importance of social networks, and even news agencies and networks had to adapt and are now using Twitter as a potential (and sometimes preferential) source of information. As an example, Sankaranarayanan (2009) showed how Twitter can be used to automatically obtain breaking news from the tweets posted by users, and exemplifies that when Michael Jackson passed away, "the first tweet was posted 20 minutes after the 911 call, which was almost an hour before the conventional news media first reported on his condition". In fact, Twitter is so fast that it can even outpace an earthquake: on August 23rd, 2011, when a 5.9 magnitude earthquake struck close to Richmond, Virginia, U.S.A., the effects were first felt in Washington D.C., from where several tweets were posted stating the event; various people reported having read those tweets in New York City (400 km away) before the earthquake reached them! (Ford, 2011) (Gupta et al., 2014). The negative side of this fast paced online news environment is that it discourages fact-checking and verification (Chen et al., 2015), and some concern is justified when considering the rise of phenomena such as "fake news", which were, for example, exploited in recipe-like fashion to impact the 2016 USA Presidential elections (Mustafaraj and Metaxas, 2017). Vosoughi et al. (2017) aimed to reduce the impact of false information on Twitter by automatically predicting, with 75% accuracy, the veracity of rumors on a collection of nearly 1 million tweets, extracted from real-world events such as the 2013 Boston Marathon bombings, the 2014 Ferguson unrest and the 2014 Ebola epidemic. The previous examples show the importance of automatically analyzing the massive amount of information on Twitter. However, using Twitter as a source of information involves many technical obstacles, of which the first is collecting and dealing with the amount of flowing information. As of mid-2015, more than 500 million tweets covering thousands of different topics are published daily (almost 6000 tweets per second!).
Collecting, storing, managing and visualizing such large amounts of information is a far from trivial problem and demands dedicated and intelligent hardware and software platforms. Even assuming one is able to access all tweeted contents, there is still the problem of filtering which content is relevant for a given topic of interest. This is far from trivial, even if we simply consider doing it on a daily basis: on a given day it is very unlikely that more than a few thousand tweets are relevant to a given discussion topic (even when considering major topics). We are talking about detecting 0.001%-0.01% of the 500 million daily tweets, which is basically trying to find a needle in a haystack. Twitter's approach to this problem is to provide a list of top trends (Twitter, 2010) and the #hashtag mechanism: when referring to a certain topic, users are encouraged to indicate it using a hashtag. E.g., "#refugeeswelcome in Europe!" indicates that the topic of the tweet is the current refugee crisis in Europe. However, not all tweets related to a given topic are hashtagged. In fact, according to (Mazzia, 2010), only 16% of all tweets are hashtagged, a number that has been confirmed by our experiments. The explanation for this lies partially in the fact that 140 characters is a scarce amount of text to communicate a thought, something that can be aggravated by the inclusion of a #hashtag, which also uses valuable space. It thus becomes clear that, to correctly analyze a given discussion topic, it is of the utmost importance to retrieve as many of the remaining 84% untagged tweets as possible. Since no other tagging mechanisms exist in Twitter, the process of retrieving tweets that are related to a given topic, our needle in the haystack, must use some kind of text classification process in order to detect whether the contents of a given tweet are somehow related to the intended topic. This classification process must simultaneously be able to retrieve only the relevant tweets (i.e., have high values of precision and recall), and be computationally efficient in order to deal with the huge amount of data. Additional tasks of interest when considering the use of Twitter as an information source include finding which of the topic related tweets have more relevance (e.g. by finding who the most important "actors" discussing the topic are), performing sentiment analysis, extracting statistics on the topic origin and its spatial-temporal evolution, etc. In this article, we present an intelligent platform, MISNIS - Intelligent Mining of Public Social Networks' Influence in Society, which addresses the issues mentioned above, and can be used as an expert system by social scientists when studying social networks' impact in society. The platform can be divided into two major blocks (Figure 1): 1) Smart mechanisms for collecting, storing and managing Twitter information; 2) Intelligent mechanisms to retrieve, analyze and represent the information that is relevant for a given topic.

Figure 1: MISNIS Framework architecture

When compared to other existing similar platforms, MISNIS includes specifically developed intelligent techniques that: (1) Circumvent the Twitter API restrictions that limit access to 1% of all flowing tweets.
The platform has been able to collect more than 80% of all flowing Portuguese language tweets in Portugal when online (Brogueira et al., 2016); (2) Intelligently retrieve most tweets related to a given topic even when the tweets do not contain the topic #hashtag or user indicated keywords. A 40% increase in the number of retrieved relevant tweets has been reported in real world case studies (Carvalho et al., 2017). Despite being operational, MISNIS is an ongoing work and can be improved in several aspects: (1) The platform is currently focused on Portuguese language tweets posted in Portugal. However, most developed technologies are language independent (e.g. intelligent retrieval, sentiment analysis, etc.), and technically MISNIS can be easily expanded to cover other languages and locations; (2) Sentiment analysis methods can be improved; (3) Dependence on Twitter and Google APIs: most changes to the API endpoints imply changing and recompiling the platform code.

The paper is organized as follows: Section 2 describes some related work relevant to the developed platform; Section 3 describes the architecture of the framework that was implemented for Twitter data acquisition, storage, management and visualization; Section 4 focuses on the expert system for intelligent Twitter data mining added to the framework; Section 5 presents a small case study used to exemplify the developed platform and framework; finally, Section 6 presents some conclusions.

2 Related Work

2.1 Large Scale Social Data Acquisition and Storage

The analysis of the content and information shared on social networks has proven useful in various fields, including Politics, Marketing, Tourism, Public Health, and Safety. Twitter is amongst the most widely used social networks, making available about 500 million tweets every day [1], on average. Twitter provides free access to part of the information produced by its users through public APIs (Application Programming Interfaces), and the popularity of Twitter as a source of information has led to the development of numerous applications and to new research methods in various fields. For example, Paul et al. (2011) developed a method for tracking disease risk factors at the behavioral level, tracking diseases by geographic region to analyze the symptoms and medication applied. The study was based on about 1.5 million tweets related to health that contained references to various ailments including allergies, obesity and insomnia. Santos et al. (2013, 2014) used a set of approximately 2700 tweets produced in Portugal to predict the incidence and spread of the influenza virus through the Portuguese population. Widener et al. (2014) used information extraction and sentiment analysis (through a data mining framework) to try to understand how geolocated tweets can be used to research the prevalence of healthy and unhealthy food in contiguous regions of the United States. Other studies related to public health were also reported by Culotta (2010) and by Scanfeld et al. (2010). Twitter was also used as a source of information to help identify or locate the occurrence of earthquakes, taking into account that "when an earthquake occurs people produce many posts on Twitter related to the event, which permits the identification of earthquakes simply by observing the increase in the tweet volume" (Sakaki et al., 2010).

[1] https://about.twitter.com/company (accessed 14-05-2015)
Kumar et al. (2013b) propose an approach to identify a subset of users, and their locations, that should be followed in disaster situations in order to get quick access to useful information about the event. During a crisis, a particular user's location is an important factor in determining whether he/she is likely to publish relevant information on the state of the crisis. For instance, in the event of an earthquake, tweets produced in a place close to the earthquake are likely to be more relevant to assess the situation than tweets produced from a more distant location. Other studies have addressed similar topics (Mendoza et al., 2010; Qu et al., 2011; Lachlan et al., 2014). Gerber (2014) tried to predict criminal activity in the largest city of the United States of America using tweets marked in space and time. Since tweets are public and officially made available by Twitter services, linguistic analysis models that enable the automatic identification of topics related to the commission of a crime may be considered quite relevant, not only for preventing similar crimes but also for supporting decision making during trial in a court of law.

The design of software architectures for capturing Twitter information and extracting relevant knowledge from tweets represents a major challenge, not only due to the massive amounts of streamed data, but also due to the access limits imposed by Twitter (Oussalah et al., 2013). Most of the work reported in the literature restricts the data in some way in order to respect those limits. For example, Perera et al. (2010) describe a software architecture, based on Python and MySQL, that uses the Twitter API to collect tweets sent only to specific users. Twython, a Python wrapper, was used to obtain spatial data (location, name, description, etc.) about the authors of the tweets. The collection process runs in 5-minute intervals and gathers the tweets sent to particular Twitter user ids, including that of President Barack Obama. Anderson et al. (2011) report a concurrency-based software architecture that allows collecting a large volume of data, with a theoretical maximum of 500 million tweets per day. The code is multi-threaded and therefore suited to machines with multiple processors; it uses the Spring MVC, Hibernate and JPA frameworks, and the infrastructure components Tomcat, Lucene (for tweet indexing) and MySQL (to store the collected data). Marcus et al. (2011) developed TwitInfo, a platform for collecting and processing tweets in real time for sentiment analysis. The architecture proposed by Oussalah et al. (2013) collects tweets continuously and in real time using the Streaming API, aiming to semantically and spatially analyze the collected data. The collected tweets are restricted to a rectangular area bounded by geographical coordinates (longitude and latitude), and the software implementation is based on Django (a framework for developing web applications in Python), Lucene, and MySQL. This architecture allows searching for tweets using the text, username or location.

2.2 Tweet Topic Detection

One of the goals of this work is to automatically classify tweets into a set of topics of interest. Tweet Topic Detection involves automatically determining if a given tweet is related to a given, usually #hashtagged, topic.
This is basically a classification problem, albeit one with its own specificities: (1) it is a text-based classification problem, with an unknown and vast number of classes, where the documents up for classification are very short (a maximum of 140 characters in length); (2) it fits the Big Data paradigm due to the huge amounts of streaming data. We purposefully distinguish between Topic Classification and Topic Detection. The former is broadly known in Natural Language Processing (NLP) as Text Categorization, and is defined as the task of finding the correct topic (or topics) for each document, given a closed set of generic categories (subjects, topics) such as politics, sports, music, religion, etc., and a collection of text documents (Feldman, 2006), in this case tweets; the tweets will commonly belong to one or more of those categories and it is highly uncommon that a tweet goes unclassified. The latter is more detailed in its approach, since it attempts to determine the topic of the document from a predetermined large set of possible topics, where the topics are so unique amongst themselves that there is a good chance that a tweet without a hashtag will not belong to any of the current trends. With this difference in mind, the works on topic detection within Twitter most similar to ours are those related to emerging topics or trends, as for example the works of Mathioudakis (2010), Cataldi (2010), Kasiviswanathan (2011) or Saha (2012). In these articles, the authors use a wide variety of text analysis techniques to determine the most common related words and, as a consequence, detect topics. In our work, we assume the existence of trending topics and set the goal of efficiently detecting tweets that are related to said topics, despite not being explicitly marked (hashtagged) as such. Topic Classification is also a well-documented and commonly studied task. In Lee (2011), Twitter Trending Topics are classified into 18 broad categories like sports, politics, technology, etc., and classification accuracies of 65% and 70% are achieved when using text-based and network-based classification modelling, respectively. The experiment was performed on a dataset of 768 randomly selected trending topics (over 18 classes). More recently, Cigarrán et al. (2016) proposed an approach based on Formal Concept Analysis (FCA) to perform Twitter topic detection in unsupervised fashion, finding that it outperforms traditional classification, clustering and probabilistic approaches on the RepLab 2013 benchmark dataset. Empowered by previous work and supporting our view of topic detection, the most promising method is Twitter Topic Fuzzy Fingerprinting (Rosa et al., 2014a, 2014b). Fingerprint identification is a well-known and widely documented technique in forensic sciences. In computer science, a fingerprint is a procedure that maps an arbitrarily large data item (such as a computer file, or an author's set of texts) to a much more compact information block, its fingerprint, that uniquely identifies the original data for all practical purposes, just as human fingerprints uniquely identify people. Fuzzy Fingerprints were originally introduced as a tool for text classification by Homem and Carvalho (2011). They were successfully used to detect the authorship of newspaper articles (out of 73 different authors). For text classification purposes, a set of texts associated with a given class is used to build the class fingerprint.
Each word in each text represents a distinctive event in the process of building the class fingerprint, and distinct word frequencies are used as a proxy for the class associated with a specific text. The set of the fuzzy fingerprints of all classes is known as the fingerprint library. Given a fingerprint library and a text to be classified, the text fingerprint is obtained using a process similar to the one used to create the fingerprint of each class, and then a similarity function is used to fit the text into the class that has the most similar fingerprint. In order to use Fuzzy Fingerprints for tweet topic detection, several procedural changes were proposed by Rosa et al. (2014a, 2014b). According to the authors, the Twitter Topic Fuzzy Fingerprints performed very well on a set of 2 million English, Spanish and Portuguese tweets collected over a single day, beating other widely used text classification techniques. The training set consisted of 11,000 tweets containing 22 of the top daily trends (hashtagged topics). 350 unhashtagged test tweets were properly classified with an f-measure score of 0.844 (precision = 0.804, recall = 0.889). Further work by Rosa (2014) used a training set of 21,000 tweets, from "21 impartially chosen topics of interest out of the top trends of the 18th of May, 2013". The test set was made of "585 tweets that do not contain any of the top trending hashtags" and "each tweet was impartially annotated to belong to one of the 21 chosen top trends". After extensive parameter optimization using a development set, the fuzzy fingerprint method scored an f-measure of 0.833, proving to be not only more accurate than other well-known classification techniques (kNN and SVM), but also much faster (177 times faster than kNN and 419 times faster than SVM). Topic models are commonly reported in the literature as one of the most successful techniques for topic detection/classification/trending on Twitter (Hoffman et al., 2010). Non-probabilistic topic models, namely Latent Semantic Analysis (LSA) (Landauer, 1998), appeared first, but most of the current literature refers to generative probabilistic models (Blei, 2012) based on Latent Dirichlet Allocation (LDA). Our previous attempts at applying methods based on LDA to the specific problem of tweet topic detection produced weak results, unless very extensive parameterization and testing was done a priori for each new topic, which obviously prevents their use in the developed platform.

2.3 User Influence

The concept of influence is of much interest for several fields, such as sociology, marketing and politics. Empirically speaking, an influential person can be described as someone with the ability to change the opinion of many, in order to reflect his or her own. While Rogers (1982) supports this statement, claiming that "a minority of users, called influentials, excel in persuading others", more modern approaches (Domingos, 2001) seem to emphasize the importance of interpersonal relationships amongst ordinary users, reinforcing that people make choices based on the opinions of their peers. The point is that "influence" is an abstract concept, which makes it exceptionally hard to quantify. Several studies have attempted to accomplish this goal.
In (Cha, 2010), three measures of influence on Twitter were taken into account: in-degree, re-tweets and mentions, where "in-degree is the number of people who follow a user; re-tweets mean the number of times others forward a user's tweet; and mentions mean the number of times others mention a user's name". The study concluded that while the in-degree measure is useful to identify users who get a lot of attention, it "is not related to other important notions of influence such as engaging audience". Instead, "it is more influential to have an active audience who re-tweets or mentions the user". In (Leavitt, 2009), the authors conclude that within Twitter, "news outlets, regardless of follower count, influence large amounts of followers to republish their content to other users", while "celebrities with higher follower totals foster more conversation than provide retweetable content". InfluenceTracker (Razis, 2014) is a framework that rates the impact of a Twitter account taking into consideration an Influence Metric, based on the ratio between the number of followers of a user and the users it follows, and the amount of recent activity of a given account. It also calculates a Tweet Transmission rate where the "most important factor (...) is the followers' probability of re-tweeting". Cha (2010) also shows "that the number of followers a user has, is not sufficient to guarantee the maximum diffusion of information in Twitter (...) because, these followers should not only be active Twitter users, but also have impact on the network". Even if one agrees on the measures that best represent influence, aggregating and computing those measures is not a trivial task, since user interactions should not be ignored. A sound approach consists in using graphs and computing user relevance by resorting to graph centrality algorithms. In graph theory and network analysis, the concept of centrality refers to the identification of the most important vertices within a graph, in this case the most important users. We therefore define a graph G(V,E) where V is the set of users and E is the set of directed links between them. Currently the most "famous" centrality algorithm is PageRank (Page, 1998, 1999). It is one of Google's search engine methods, with web pages used as nodes and back-links forming the edges of the graph. PageRank is considered a random walk model, because the weight of a page/node is "the probability that a random walker (which continues to follow arbitrary links to move from page to page) will be at a node at any given time". A damping factor is used as the "probability of the random walk to jump to an arbitrary page, rather than to follow a link, on the Web" and is "required to reduce the effects on the PageRank computation of loops and dangling links in the Web" (Phuoc, 2009). Other, less complex, centrality methods with guaranteed convergence exist, such as Katz (Katz, 1953), and they are often preferred over PageRank for that reason.

2.4 Sentiment Analysis

Sentiment Analysis is a relevant and well-known task that consists of extracting sentiments and emotions expressed in texts. Being the first step towards online reputation analysis, it is now gaining particular relevance because of the rise of social media, such as blogs and social networks.
The increasing amount of user-generated content constitutes huge volumes of opinionated text all over the web that are precious sources of information, especially for decision support. Sentiment Analysis can be used to know what people think about a product, a company, an event, or a political candidate. Sentiment analysis can be performed at different complexity levels, where the most basic one consists simply of deciding whether a portion of text contains a positive or a negative sentiment. Dealing with the huge amounts of data available on Twitter demands clever strategies. One approach combines sentiment analysis and causal rule discovery (Dehkharghani, 2014). Another, by Kontopoulos (2013), uses ontologies. An interesting and simpler idea, explored by Go (2009), consists of using emoticons, abundantly available on tweets, to automatically label the data and then use such data to train machine learning algorithms. The paper shows that machine learning algorithms trained with this approach achieve above 80% accuracy when classifying messages as positive or negative. A similar idea was previously explored by Pang (2002) for movie reviews, using star ratings as polarity signals in the training data. This latter paper analyses the performance of different classifiers on movie reviews, and presents a number of techniques that were used by many authors and served as a baseline for subsequent studies. As an example, they adapted a technique, introduced by Das and Chen (2001), for modelling the contextual effect of negation, adding the prefix NOT_ to every word between a "negation word" and the first punctuation mark following the negation word. Common approaches to sentiment analysis also involve the use of sentiment lexicons of positive and negative words or expressions (Stone, 1996; Hu, 2004; Wilson, 2005; Baccianella, 2010). Another research approach involves learning polarity lexicons and can be especially useful for dealing with large corpora. The process starts with a seed set of words, and the idea is to incrementally find words or phrases with similar polarity, in semi-supervised fashion (Turney, 2002). The final lexicon contains many more words, possibly capturing domain-specific information, and is therefore more likely to be robust. The work reported by Kim (2004) is another example of a learning algorithm that uses WordNet synonyms and antonyms to learn polarity.

2.5 Twitter Data Analysis Platforms

To the best of our knowledge, there are no Twitter data analysis platforms with the exact same goal as MISNIS. However, similar ones have come up in recent years, albeit more focused on social media marketing, by aiding brands to grow or just helping regular users and public personalities to better foster their web engagement. The earliest such alternative that we found was TwitterMonitor (Mathioudakis and Koudas, 2010), which automatically identified emerging trends on Twitter and provided meaningful analytics that synthesized an accurate description of each topic.
Other tools focus on eye-catching reports and graphics to display relevant data and often provide insightful analytics on the retrieved data: (i) Warble (www.warble.co) is a web-based solution that allows users to track keywords and hashtags that matter to them, while monitoring brands and brand engagement; (ii) Twitonomy (www.twitonomy.com) provides detailed analytics on anyone's tweets, allowing users to get insights on followers and friends as well as on their interactions with other users; (iii) TweetReach (www.tweetreach.com) gives real-time analytics on a user's reach, performance and engagement while continuously analysing all posts about the topics he/she cares about, including sentiment analysis; (iv) SocioViz (www.socioviz.net) analyses any topic, term or hashtag and identifies key influencers, opinions and contents; (v) Mozdeh (http://mozdeh.wlv.ac.uk/) is a free Windows application for keyword, issue, time series, sentiment, gender and content analyses of social media texts; (vi) Netlytic (http://netlytic.org) is a community-supported text and social network analyser that can automatically summarize and discover social networks from online conversations on social media sites; (vii) DiscoverText (http://discovertext.com) is a commercial company combining machine learning classifiers with cloud and crowdsource services to retrieve relevant items and sort them into topics and sentiment categories; (viii) Visibrain (http://www.visibrain.com) is a commercial "media monitoring tool for PR and communications professionals, used for reputation management, PR crisis prevention, and detecting influencers and trends", claiming to be able to capture all social media around a brand. Most of these tools share common features with MISNIS. From keyword tracking to geolocated data, sentiment analysis and user influence, almost all the above-mentioned platforms implement at least one of these capabilities, often more, and with a better user interface. However, except for DiscoverText, none of the platforms has mechanisms to overcome the Twitter API limits. Hence, they are only able to capture and analyse 1% of all flowing tweets. The exception, DiscoverText, makes use of the Twitter "firehose", which allows access to 100% of the tweets in real-time streaming. However, it is a paid (and quite expensive) solution. Even more importantly, whether a paid or a free solution, and as far as we could tell, none of the mentioned platforms is able to detect relevant related tweets unless they contain explicit user defined keywords/hashtags, therefore missing important information for the analysis of a given topic.

3 Twitter Data Acquisition, Storage, Management and Visualization

In order to circumvent Twitter data access restrictions and build a tweet repository from which data can be efficiently managed, intelligently retrieved and visualized, we developed an information system consisting of four main modules (Figure 2): 1) Data collection; 2) Data expansion; 3) Data access; 4) Visualization.

Figure 2: Architecture of the Twitter data collection, management and visualization system.

The presented system focuses mainly on Portuguese Twitter data (tweets produced in Portugal and in European Portuguese), but can easily be adapted and expanded for most countries and languages. In the Collect module we retrieve geolocated tweets produced in Portugal and discard those that are not recognized as written in European Portuguese.
The collection uses the Twitter Streaming API, which implies that at most 1% of the data is collected. All collected tweets are stored in MongoDB [2]. In the Expand module we identify each individual previously collected user, and explore their timelines to retrieve past tweets (and add them to MongoDB). In the Access module we implemented a REST API [3] as an abstraction to intelligently access the database. The Visualization module implements a dashboard to visualize metrics, indicators and any queried information. Each module is detailed in the following sections.

3.1 Geolocated Data Collection

The Twitter Streaming API statuses/filter allows access to tweets being published at the time of the request. Several filters can be used to tailor the requested information: the API allows filtering by keywords, hashtags, user IDs and geographically delimited regions (Kumar, 2013a). The number of parameters and the volume of returned information are limited by Twitter. Currently each request allows for a maximum of 400 keywords, 25 geographical regions or 5000 user IDs, and up to a maximum of 1% of all currently flowing tweets are searched for and returned. The geolocation of a tweet can be obtained using two different processes: i) directly from the tweet, when the user opts to make his location known at the time of publishing; ii) using the information contained in the user profile location field. The percentage of geolocated tweets is low, under one fifth of all tweets [4], and as such, it is easier to comply with the Twitter API restrictions when filtering for geolocated tweets:
- Taking into consideration that currently there are 500 million tweets per day [5], it is theoretically possible to retrieve up to 5 million tweets per day using the Streaming API from a single user account;
- Previous works (Brogueira, 2014a, 2014b) have shown that in 2014 around 60,000 geolocated tweets were produced per day in Portugal and in (European) Portuguese, i.e., almost 2 orders of magnitude below the 5 million Streaming API daily limit.

The implemented platform operates 24 hours per day, using the Twitter Streaming API to collect geolocated flowing tweets within Portugal. The delimitation of the geographical area of mainland Portugal and the archipelagos of Madeira and the Azores uses the Twitter REST API geo/search. Note that this area includes small parts of Spain and North Africa that are filtered a posteriori. The collected tweets are compressed and stored on a hard drive (in as "real time" as possible), and later filtered and stored in a MongoDB database. The filtering operation consists of considering only tweets that are produced in Portugal (field place.country = "pt") and written in European Portuguese (field lang = "pt"). The double-check allows for the detection of tweets where the Twitter language detection algorithm [6] fails, which is not uncommon between Portuguese, Spanish and Galician. In addition to the tweet filtering and storage operation, the author of each tweet is identified using the field user.id and, if unknown, added to a "Users" MongoDB collection. For each new user, a JSON [7] document is created containing: i) the user.id; ii) the date of first detection; iii) control flags (see section 3.2).

[2] https://www.mongodb.org, last accessed July 2015
[3] http://www.restapitutorial.com/, last accessed July 2015
[4] https://pressroom.usc.edu/twitter-and-privacy-nearly-one-in-five-tweets-divulge-user-location-through-geotagging-or-metadata/, last accessed July 2015
[5] https://about.twitter.com, last accessed October 2015
[6] https://blog.twitter.com/2013/introducing-new-metadata-for-tweets, last accessed July 2015
[7] http://json.org/, last accessed June 2015
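For illustration, a minimal sketch of the Collect module logic is shown below. It assumes a tweepy 3.x-style streaming client and a local MongoDB instance; credentials, the bounding box and collection names are placeholders, and the sketch filters tweets inline whereas the platform stores the raw stream first and filters later.

```python
import tweepy
from pymongo import MongoClient

# Placeholder credentials and a rough bounding box for mainland Portugal
# (the platform also covers Madeira and the Azores with additional regions).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
PORTUGAL_BBOX = [-9.6, 36.9, -6.2, 42.2]  # sw_lon, sw_lat, ne_lon, ne_lat

db = MongoClient()["misnis"]  # illustrative database name

class GeoCollector(tweepy.StreamListener):
    def on_status(self, status):
        tweet = status._json
        place = tweet.get("place") or {}
        # Double-check: produced in Portugal and written in (European) Portuguese.
        if place.get("country_code") == "PT" and tweet.get("lang") == "pt":
            db.tweets.insert_one(tweet)
            # Register previously unseen authors for later timeline expansion.
            db.users.update_one(
                {"user_id": tweet["user"]["id"]},
                {"$setOnInsert": {"first_seen": tweet["created_at"],
                                  "timeline_collected": False}},
                upsert=True)
        return True

stream = tweepy.Stream(auth, GeoCollector())
stream.filter(locations=PORTUGAL_BBOX)  # only geolocated tweets inside the box
```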
3.2 Database Expansion

The tweet corpus expansion is based on the retrieval of the timeline of each user present in the "Users" collection. The user's timeline is the record of the user's recent Twitter activity, and can be accessed via the REST API statuses/user_timeline. The API returns the most recent 3,200 tweets of a given user. However, Twitter imposes several restrictions that hinder the timeline collection process: up to 180 requests are authorized per 15-minute period and per authenticated Twitter API access account (Kumar, 2013a), and each request returns at most 200 tweets. Therefore, 16 API requests are needed in order to retrieve a 3,200-tweet timeline. As such, it is only possible to retrieve the timelines of 11 different users per 15-minute period (around 1080 timelines per day) when using a single Twitter account. Since an average of 232 new users is identified per day by the system (Brogueira, 2015a), and the total number of registered users was, by August 2015, above 120K, collecting the timelines of every single new user and updating existing users' timelines on a continuous basis using a single Twitter API access account is an increasingly lengthy process that is not viable and far from straightforward due to the above-mentioned restrictions. Timeline retrieval in the developed system uses 15 different Twitter API access accounts that are synchronized and optimized to prevent repeated invocations and failures due to exceeding the API limits during the 15-minute window. The retrieval process considers three different scenarios: i) integration of a new user, which implies obtaining the complete timeline (up to 3,200 tweets); ii) existing user, for whom it is necessary to retrieve the tweets produced since the date of the last retrieved tweet; iii) blocked access users, i.e., users that have explicitly blocked access to their timelines. Each account is dynamically assigned to one of the first two scenarios taking into consideration the number of timelines to process for each. Blocked users are kept out of the loop and checked sporadically in order to detect an eventual status change. Each account is also associated with a JSON document containing the strings used for Twitter API authentication (OAuth), and the flags that identify which scenario the account is currently assigned to and how it should operate. More details can be found in (Brogueira et al., 2015b, 2016). With the joint operation of the geolocated data collection and data expansion modules, MISNIS has been able to collect more than 80% of all flowing Portuguese language tweets in Portugal when online, which is a huge amount when compared to the theoretical 1% freely made available by Twitter (Brogueira et al., 2016).
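A minimal sketch of the timeline expansion step is shown below (again assuming tweepy and pymongo; the multi-account scheduling, scenario flags and error handling used by the platform are omitted, and field and collection names are illustrative):

```python
import tweepy
from pymongo import MongoClient

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
# wait_on_rate_limit makes tweepy sleep through the 180 requests / 15-minute
# window; the platform instead rotates 15 authenticated accounts.
api = tweepy.API(auth, wait_on_rate_limit=True)
db = MongoClient()["misnis"]

def fetch_timeline(api, user_id, since_id=None, max_tweets=3200):
    """Page through statuses/user_timeline, 200 tweets per request
    (16 requests for a full 3,200-tweet timeline), optionally keeping only
    tweets newer than since_id (the 'existing user' scenario)."""
    collected, max_id = [], None
    while len(collected) < max_tweets:
        batch = api.user_timeline(user_id=user_id, count=200,
                                  max_id=max_id, since_id=since_id)
        if not batch:
            break
        collected.extend(t._json for t in batch)
        max_id = batch[-1].id - 1  # continue below the oldest tweet seen so far
    return collected

for user in db.users.find({"timeline_collected": False}):
    tweets = fetch_timeline(api, user["user_id"])
    if tweets:
        db.tweets.insert_many(tweets)
    db.users.update_one({"_id": user["_id"]},
                        {"$set": {"timeline_collected": True}})
```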
3.3 Data Access and Data Sharing

The Data Access module consists of a REST API developed in order to facilitate access to the information stored in the database, allow access by third party applications, and enable the developed Dashboard (see section 3.4). REST stands for Representational State Transfer, a set of constraints and principles used in web interface architectures. A REST API is a set of data and functions that facilitate information exchange between applications and web services designed according to the REST principles. The developed REST API makes a set of endpoints available for interaction with MongoDB. Table 1 presents some of the endpoints developed for information access. They are divided into three categories: i) access to tweet information; ii) access to user information; iii) access to previously processed statistics (computed on the stored data). Table 2 presents the endpoints available for adding information to the database, namely information resulting from the intelligent data processing methods presented in Section 4.

Table 1: Database access endpoints of the REST API.
/api/{collection}/tweet/{id_tweet}: all information concerning a specific tweet
/api/{collection}/tweet/page/{id_page}: set of 1K tweets ordered by decreasing publishing date
/api/{collection}/tweet/hour/{hour}: tweets collected per hour
/api/{collection}/tweet/day/{day}: tweets collected per day
/api/{collection}/tweet/weekDay/{weekDay}: tweets collected per week day
/api/{collection}/tweet/month/{month}: tweets collected per month
/api/{collection}/tweet/year/{year}: tweets collected per year
/api/{collection}/query/{query}: set of tweets according to the filter specified in the "query" parameter
/api/user/{id_user}: all tweets produced by the user
/api/user/{id_user}/firstProfile: user profile when the first tweet included in the database was published
/api/user/{id_user}/lastProfile: user profile when the last tweet included in the database was published
/api/user/{id_user}/ageGender: fields of the user profile used to infer age and gender

Table 2: REST API endpoints for saving "intelligent analysis" results into MongoDB.
/api/t2f2tweets/{topic}: tweets related to a given topic (processed using Twitter Topic Detection)
/api/popusers/{topic}: most relevant users on a given topic of discussion

3.4 Data Visualization

The Data Visualization module consists of a web dashboard integrating several data indicators. The dashboard is implemented using the Google Charts API [8] and the REST API presented in the previous section. The dashboard includes charts and statistics for geolocated tweets, timeline tweets and user information. Since some of the statistics and charts involve a large volume of data, and the processing of the respective queries is not feasible in real time, such queries are pre-processed automatically on a daily basis. In such cases, the visualized information refers to data collected up to the previous day. Figure 3 shows an example of such queries, where information concerning geolocated tweets is visualized. The top center of the screen shows the total number of tweets and the average number of collected tweets per day. The top graph shows the evolution in the number of collected tweets per day during the analyzed period (the y-scale is in millions of tweets). The bottom 4 graphs show (left-to-right, top-to-bottom): histogram of tweets per hour of the day (%); histogram of tweets per day of the week (%); number of tweets per day during the last month (millions); number of tweets per month during the last 16 months (millions). It is also possible to query and visualize in "real time". For example, Figure 4 shows the dashboard with the locations of the geolocated tweets collected on May 31st, 2015 (distributed per 6-hour periods).

[8] https://developers.google.com/chart/?csw=1

Figure 3: Dashboard containing indicators about collected geolocated tweets

Figure 4: Visualization of collected geolocated tweets on May 31st, 2015
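As a hedged usage illustration of the Data Access API described in Section 3.3, a client could query the endpoints of Table 1 as follows (the host, collection name, identifiers and parameter formats below are placeholders, not values defined by the platform):

```python
import requests

BASE = "http://localhost:8080"  # placeholder host for the MISNIS REST API

# All tweets collected on a given day from an (illustrative) collection name
day_tweets = requests.get(f"{BASE}/api/geoTweets/tweet/day/2015-05-31").json()

# A single tweet, a user's tweets, and the user's first stored profile
tweet = requests.get(f"{BASE}/api/geoTweets/tweet/601231234567890").json()
user_tweets = requests.get(f"{BASE}/api/user/123456789").json()
first_profile = requests.get(f"{BASE}/api/user/123456789/firstProfile").json()
```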
4 Intelligent Twitter Topic Mining Add-on

The framework described in the previous section allows the collection, management and visualization of an extensive tweet corpus. In this section we address how we add to the framework the capabilities to intelligently retrieve relevant information from the stored corpus. Despite the underlying complexity, the process is designed to be easily accessible to users, regardless of specialization and degree of technical knowledge. The process is represented in Figure 5, succinctly described as follows, and detailed in the next subsections.

Figure 5: Intelligent Twitter topic mining (FUWS – Fuzzy Word Similarity algorithm; TD – Topic Detection task; PR – PageRank for User Influence task; SA – Sentiment Analysis task).

All interaction with the user occurs in the "User interface" layer. A "Twitter Topic Fuzzy Fingerprints" layer is used to process user inputs, interact with the REST API, process the data returned from MongoDB and present the results to the user. When a user wants to retrieve information relevant to a given topic of interest, he/she must input the search period (begin date, end date) and any keywords/#hashtags using the available web interface (Figure 6-1). The system will then use a fuzzy word similarity algorithm (FUWS – see section 4.2) to search MongoDB for similar keywords and #hashtags found in the database during the referenced time period (Figure 6-2). As a result of this step, all tweets containing such keywords/#hashtags are retrieved and stored in a temporary collection, and an output list of all similar keywords/#hashtags is returned and shown to the user. The user can prune the list and indicate which of the returned keywords and hashtags might or might not be useful within the context of the topic to be analyzed. The pruned keywords/hashtags list is then passed back to the Twitter Topic Fuzzy Fingerprints layer, where it is used to create a topic Fuzzy Fingerprint (see section 4.1) based on the tweets stored in the temporary collection. This fingerprint is then used to find tweets in MongoDB that are related to the topic of interest (Figure 6-3). The set of relevant tweets is written back into a separate collection in MongoDB. Tweets are considered relevant to the topic if they match the fingerprint to a given degree. It should be noted that the method can find relevant tweets even when they do not contain any of the items present in the pruned list of keywords and/or #hashtags. A user only needs to provide a single relevant #hashtag, since the method is able to create the topic fingerprint based on the contents of the tweets found after applying the FUWS. The relevant tweets are then processed in order to find the top-20 most influential users in propagating the topic under study (see section 4.3), and sentiment analysis is performed (see section 4.4). The resulting data is also written back into MongoDB in separate collections. A Results webpage (Figure 7), containing relevant information and automatically obtained by querying the above-mentioned collections, is presented to the user as a result of the process. The presented results can be further detailed, and the collections can be queried using any of the developed REST API commands (in a user friendly way).

Figure 6: Intelligent topic mining steps when viewed from the user dashboard

Figure 7: Intelligent topic mining results' visualization
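As a rough stand-in for the FUWS keyword expansion step (Figure 6-2), the following sketch uses a plain character-level similarity ratio (Python's difflib) rather than the actual FUWS measure; function, field and collection names are illustrative.

```python
from difflib import SequenceMatcher
from pymongo import MongoClient

def similar(a, b, threshold=0.75):
    """Character-level similarity ratio; a crude stand-in for the FUWS measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def expand_keywords(db, begin, end, seeds, threshold=0.75):
    """Return hashtags used in the period that resemble the seed keywords.
    Assumes tweets carry a sortable 'date' field (illustrative schema)."""
    candidates = set()
    for tweet in db.tweets.find({"date": {"$gte": begin, "$lte": end}},
                                {"entities.hashtags": 1}):
        for tag in tweet.get("entities", {}).get("hashtags", []):
            candidates.add(tag["text"].lower())
    return sorted(c for c in candidates
                  if any(similar(c, s, threshold) for s in seeds))

db = MongoClient()["misnis"]
# e.g. expand_keywords(db, "2014-11-20", "2014-11-23", ["socrates", "freesocas"])
```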
4.1 Twitter Topic Fuzzy Fingerprints

In order to perform tweet topic detection, we used the Twitter Topic Fuzzy Fingerprints method (Rosa et al., 2014a, 2014b, 2014c), which adapts the original Fuzzy Fingerprints method to the characteristics of individual tweets. As described in section 2.2, textual fuzzy fingerprinting works by comparing the similarity of the fuzzy fingerprint of an individual text with the fingerprint of a possible author (created based on a set of texts written by that author). When using Fuzzy Fingerprints to detect if a tweet is related to a given topic, we start by creating the fingerprint of the topic (instead of an author), plus the fingerprints of a set of trending topics. A topic fingerprint is created based on a set of training tweets known to be related to the topic in question. All tweets containing #hashtags related to the topic are retrieved from the database, and their text is preprocessed to remove information that is unimportant for this task, such as, for example, words with fewer than 3 characters, as studied in (Rosa et al., 2014a, 2014b). The next step consists in obtaining the word frequencies in the processed tweets. Only the top-k words are considered for the fingerprint. The computation of the top-k words and respective frequencies is done using an approximate counting method, FSS (Filtered Space Saving) (Homem, 2010), for efficiency reasons. This process is repeated for all the trending topics and for the topic to be detected. The next step differs from the original method: we account for the Inverse Class Frequency of each word, icf_v (1), which is an adaptation of the well-known Inverse Document Frequency (idf), to reorder the top-k word lists of all topics:

icf_v = log(J / J_v)    (1)

In (1), J is the cardinality of all considered topics, and J_v is the number of topics where word v is present. The product of the frequency of each word v with its icf_v is used to discount the occurrence of common words and obtain a new ordered k-sized rank for each #hashtagged topic. The next step consists in fuzzifying each top-k list in order to obtain the fingerprint for each #hashtagged topic. A membership value is assigned to each word of the top-k list based on its order (instead of its frequency or its icf). In MISNIS we used the fuzzifying function represented in (2), a Pareto-based piecewise linear function, where the top 20% of the k words assume 80% of the membership degree:

μ_ji = 1 - (0.8 / (0.2 k)) i,                if i <= 0.2 k
μ_ji = 0.2 (1 - (i - 0.2 k) / (0.8 k)),      otherwise        (2)

where μ_ji is the membership value of the ith top word in topic j, i is the word rank, and k is a constant (the size of the fingerprint). The fingerprint of topic j is a size-k fuzzy vector, where each position contains an element v_ji (in this approach v_ji is a word of topic j) and a membership value μ_ji representing the fuzzified value of the rank of v_ji (the membership of the rank), obtained by the application of (2). Formally, topic j is represented by its size-k fingerprint Φ_j (3):

Φ_j = {(v_j1, μ_j1), (v_j2, μ_j2), ..., (v_jk, μ_jk)}    (3)

The set of all computed topic fingerprints constitutes the fingerprint library. Once the fingerprint library is created, it is possible to look in MongoDB for tweets that, despite not containing an intended #hashtag, discuss the same topic.
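A minimal sketch of the fingerprint construction just described is given below. Exact counting is used instead of the FSS approximate counter, preprocessing is reduced to lowercasing and splitting, equation (2) follows the piecewise-linear reading given above, and the data structures are illustrative.

```python
import math
from collections import Counter

def top_k_counts(tweets, k=20):
    """Word frequencies of a topic's training tweets, keeping only the top-k
    words; words with fewer than 3 characters are dropped, as in MISNIS."""
    counts = Counter(w for t in tweets for w in t.lower().split() if len(w) >= 3)
    return dict(counts.most_common(k))

def build_fingerprint_library(topics, k=20):
    """topics: dict mapping a topic name to a list of training tweet texts.
    Returns a dict mapping each topic to its fingerprint {word: membership}."""
    counts = {name: top_k_counts(tweets, k) for name, tweets in topics.items()}
    J = len(topics)

    def icf(word):                        # equation (1): icf_v = log(J / J_v)
        Jv = sum(1 for c in counts.values() if word in c)
        return math.log(J / Jv)

    def pareto(i, k):                     # equation (2): Pareto fuzzification
        if i <= 0.2 * k:
            return 1 - 0.8 * i / (0.2 * k)
        return 0.2 * (1 - (i - 0.2 * k) / (0.8 * k))

    library = {}
    for name, c in counts.items():
        # Re-rank the top-k words by frequency * icf, then fuzzify the rank.
        ranked = sorted(c, key=lambda w: c[w] * icf(w), reverse=True)
        library[name] = {w: pareto(i, k) for i, w in enumerate(ranked)}  # eq. (3)
    return library
```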
In the original fingerprint method (Homem, 2011), this would be done by computing the fingerprint of an individual tweet and comparing it to the topic fingerprint. However, knowing that a text fingerprint is essentially based on the fuzzification of the order of the word frequencies of the text, and that tweets have a maximum of 140 characters, it is not possible (or useful) to create the fingerprint of an individual tweet, since within a tweet very few relevant words (if any!) are repeated. As such, a new similarity score, the Tweet to Topic Similarity Score (T2S2), was developed to test for the similarity between a given tweet and a topic fingerprint (4). The T2S2 score does not take into account the size of the text to be classified (i.e., its number of words), and hence avoids the problem of fuzzy fingerprint similarity computation for short texts:

T2S2(Φ_j, T) = ( Σ_{v ∈ Φ_j ∩ T} μ_j(v) ) / ( Σ_{i=1..x} μ_ji ),  with x = min(k, |T|)    (4)

In (4), Φ_j is the fingerprint of #hashtagged topic j, T is the set of distinct words in the preprocessed tweet text, {v_j1, ..., v_jk} is the set of words of Φ_j, and μ_ji is the membership degree of word v_ji in the fingerprint Φ_j. Essentially, T2S2 sums the membership values of every word v that is common between the tweet and the #hashtag j fingerprint, and normalizes this value by dividing it by the sum of the top-x membership values of Φ_j, where x is the minimum between k and the cardinality of T. T2S2 approaches 1 when most or all words of the tweet belong to the top words of the fingerprint. T2S2 tends to 0 when there are no common words between the tweet and the fingerprint, or when the few common words are at the bottom of the fingerprint. When a given tweet has a T2S2 score with #topic j above a given threshold, it is considered relevant to the topic. Such tweets are retrieved from the database. In the MISNIS platform we use the following parameters when performing topic detection: fingerprint size k=20, and words with fewer than 3 characters removed during preprocessing (no stopwords are removed and no stemming is performed). Tweets with a T2S2 score above 0.10 are retrieved. Stemming is not performed since, even though it gives marginal gains in Precision and Recall, there is a high penalty in execution time. Contrary to other tasks, stopwords do not hinder performance, so they are kept. The choice and optimization of the preprocessing steps are detailed in (Rosa et al., 2014, 2014a, 2014b). A real world case study based on a 2011 London Riots dataset has shown that the Twitter Topic Fuzzy Fingerprints method was able to retrieve 40% more relevant tweets than #hashtag-based retrieval alone (with an F-measure estimated to be 0.95) (Carvalho et al., 2017).
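A minimal sketch of the T2S2 scoring and threshold-based retrieval, matching equation (4) and using the fingerprint dictionaries from the previous sketch (preprocessing is again reduced to lowercasing, splitting and dropping short words):

```python
def t2s2(fingerprint, tweet_text, k=20):
    """Tweet to Topic Similarity Score, equation (4).
    fingerprint: dict {word: membership} as built by build_fingerprint_library."""
    words = {w for w in tweet_text.lower().split() if len(w) >= 3}
    common = sum(mu for w, mu in fingerprint.items() if w in words)
    x = min(k, len(words))
    top_x = sum(sorted(fingerprint.values(), reverse=True)[:x])
    return common / top_x if top_x else 0.0

def retrieve_relevant(fingerprint, tweets, threshold=0.10):
    """Keep the tweets whose T2S2 score with the topic fingerprint is above
    the relevance threshold (0.10 in the MISNIS platform)."""
    return [t for t in tweets if t2s2(fingerprint, t) > threshold]
```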
4.3 User Influence

In order to determine the most influential users within a topic, a directed graph of user interactions is built from the retrieved tweets: when a tweet from @userA mentions @userB, a directed link @userA->@userB is created. This is also true for re-tweets:
- The tweet "RT @userC The crisis is everywhere!" from @userA creates the link @userA->@userC.

In (Rosa et al., 2015), an empirical analysis showed that, in the context of Twitter user influence, PageRank (Page, 1998, 1999) outperforms other well-known network centrality algorithms, in particular Katz (1953). As such, PageRank was chosen for determining user relevance within MISNIS. PageRank parameterization consists in deciding the damping factor d (see section 2.3). The true value that Google uses as damping factor is unknown, but it has become common to use d=0.85 in the literature. A lower value of the damping factor implies that the graph's structure is less respected, therefore making the "walker" more random and less strict. After several experiments, we opted to use d=0.85 within the platform.

4.4 Sentiment Analysis

Tweets are a form of short informal text that may contain misspellings, slang terms, shortened word forms, elongations, leet speech, hashtags, and many other specific phenomena that pose new challenges to Sentiment Analysis (Kiritchenko et al., 2014). Given the limited amounts of annotated data for certain languages, supervised learning is often not possible, and alternative approaches using either manual or automatic sentiment lexicons are often applied (Kiritchenko et al., 2014). In order to take this into account and to maintain language independence, our approach to Sentiment Analysis closely follows the idea explored by Go (2009), which consists of using emoticons, abundantly available on tweets, to automatically label the data and then use such data to train machine learning models. In order to build our sentiment models, we have used the knowledge flow described in http://markahall.blogspot.co.nz/2012/03/sentiment-analysis-with-weka.html (retrieved in September 2015), which uses Weka (Hall et al., 2009) to automatically label the training tweets based on emoticons. This approach has the huge advantage of being easily applied to different languages for which manual labels are scarce or non-existing, as is the case of the Portuguese language. We have then adopted an approach based on logistic regression, which corresponds to maximum entropy (ME) classification for independent events (Berger, 1996), to create our sentiment models based on the previously labelled data. The ME models used in this study were trained using the MegaM tool (Daume, 2004), which uses an efficient implementation of conjugate gradient (for binary problems). Finally, the MegaM models were used through an interface available from the NLTK toolkit [9]. Our Portuguese language models were trained using around 200k tweets, and achieved an accuracy between 69% and 71% for the "Positive"/"Neutral"/"Negative" classification problem. Moreover, based on the produced model, we can automatically derive a new automatic lexicon that can be used for alternative future classification approaches.

[9] http://www.nltk.org/_modules/nltk/classify/megam.html
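As an illustration of the emoticon-based distant labeling idea, the sketch below uses scikit-learn's logistic regression as a stand-in for the Weka knowledge flow and the MegaM maximum-entropy models actually used by the platform, and derives only binary positive/negative labels; the emoticon lists are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

POSITIVE = (":)", ":-)", ":D", "=)")   # illustrative emoticon lists
NEGATIVE = (":(", ":-(", ":'(")

def emoticon_label(text):
    """Distant label from emoticons; None when there is no clear signal."""
    pos = any(e in text for e in POSITIVE)
    neg = any(e in text for e in NEGATIVE)
    if pos == neg:
        return None
    return "positive" if pos else "negative"

def train_sentiment_model(tweets):
    """Train a logistic regression model on emoticon-labelled tweets.
    Emoticons are removed from the features so the model learns the words."""
    texts, labels = [], []
    for t in tweets:
        label = emoticon_label(t)
        if label is None:
            continue
        for e in POSITIVE + NEGATIVE:
            t = t.replace(e, " ")
        texts.append(t)
        labels.append(label)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    return vectorizer, model

# Usage: vec, model = train_sentiment_model(tweet_texts)
#        model.predict(vec.transform(["que dia fantastico"]))
```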
As an output, the procedure created a dataset of 10,687 tweets regarding José Sócrates' arrest (including additional sentiment information per tweet), produced several indicators regarding topic evolution through time, and listed the top 20 users discussing and forwarding the topic (Table 3). The first retrieved true positive tweet occurs on November 22nd, at 00:08:51, just a few minutes after the arrest occurred. This is one more example of the relevance of Twitter in current news events. The Precision of the retrieved results (i.e., the fraction of retrieved tweets that are true positives) is 0.97, with many of the false positives (tweets that are not related to José Sócrates' arrest) retrieved in the 48 hours before the arrest (November 20-21st). It was not possible to calculate the Recall, since it would be necessary to manually check 3 million tweets one by one to count the false negatives (which would obviously defeat the purpose of the expert system), but previous tests using the same parameters on other databases have shown that Fuzzy Fingerprint Recall is usually slightly higher than Precision (Rosa et al., 2014a). It should be noted that, at the time, the database did not contain tweets from Twitter users that never produced geolocated contents. As such, there were no tweets collected from major media players such as TV stations and newspapers. Although 10,687 tweets out of a universe of 3,309,468 may seem a low number, this can be explained by the absence of the major media players from the database, and also by the fact that the majority of Twitter users in Portugal are young teenagers (Brogueira et al., 2014a, 2014b) who usually do not show any special interest in political matters. Of special relevance is the fact that, starting from 380 hashtagged tweets, the framework retrieved more than 10K tweets that could otherwise go unnoticed among more than 3 million tweets (with a very high Precision). Amongst the top users, the high incidence of Portuguese comedians is rather interesting, namely @niltoncomedy (ranked 5th), @raminhoseffect (ranked 11th), and @omalestafeito (ranked 13th), as well as a few humor-oriented internet personalities, such as @avo_idalina (ranked 4th), who is a fictional grandmother character, and @miguelluzp (ranked 15th) and @oconguito (ranked 19th), both famous teenage YouTubers. A proper sociological analysis of the retrieved results is available in (Rebelo et al., 2016).

Table 3: Top-20 most influential users on Twitter during the early days of the Sócrates arrest events (excluding major media players).

Ranking (PageRank)  User  PageRank Weight
1  @afonsotorrao  0.00701
2  @likerabos  0.00418
3  @sergiohatesyou  0.00356
4  @avo_idalina  0.00295
5  @niltoncomedy  0.00291
6  @tiagosamuel69  0.00219
7  @marinhskilla  0.00209
8  @pedromvfranco  0.00206
9  @nandinholuz  0.00167
10  @unkndeath2002  0.00158
11  @raminhoseffect  0.00155
12  @im_a_barbieee  0.00143
13  @omalestafeito  0.00141
14  @issodepende  0.00134
15  @miguelluzp  0.00127
16  @inaaraujo99  0.00126
17  @rubencgomes  0.00125
18  @pedroboucherie  0.00124
19  @oconguito  0.00109
20  @daspthebest  0.00109

6 Conclusions

Social networks and social networking are here to stay. This is not a controversial or novel statement: independently of how one feels about adopting the use of social networks, no one can deny their importance in modern world society.
From event advertising or idea dissemination, to commenting and analysis, social networks have become the de facto means for individual opinion making and, consequently, one of the main shapers of an individual's perception of society and the world that surrounds her/him. Despite the undeniable importance of social networks, too many questions concerning their effect on society are yet to be properly addressed. What makes events become important in social networks? Why and how do they become important? How long does it take for an event to make an impact on social networks and society? Can social networks give more importance to an event than it really deserves, i.e., are social networks becoming a factor by themselves? What is the role of social networks' major actors (important journalists, bloggers, commentators, politicians, etc.) in the propagation of such events? Are such actors at the origin of the events, or mere catalysts of the observations of minor role players? In this article we presented a framework, MISNIS, that can help answer such questions:
 The framework enables the identification and tracking of important events (topics) and of key actors within those topics, as well as the identification of their origin and propagation timeline. The framework provides measures and indicators to characterize events, contributing essential information for understanding social network phenomena and their importance in current world society.
 MISNIS addresses the issues of collecting, storing, managing, mining and visualizing Twitter data. It applies well-known and novel techniques from the fields of Computational Intelligence, Information Retrieval, Big Data, Topic Detection, User Influence and Sentiment Analysis to social network data mining, in particular Twitter.
 MISNIS can be used as an expert system by social scientists, sociologists, or any other users to retrieve relevant data and study social networks' impact on society, without requiring any technical computer expertise.
The framework is currently operational, even if in prototype form with limited access. External access to its functionalities can be made available upon request to the authors. Future work includes:
 Expanding and facilitating access to the platform (direct external access via user accounts);
 Expanding the platform to use other public social networks, such as public blogs, webpages, public Facebook profiles, etc., as additional sources of data;
 Expanding multilinguality and region coverage. The platform is currently focused on Portuguese language tweets posted in Portugal, but most of the developed technologies are language independent and can presently be used for tweets in most languages (e.g., intelligent tweet retrieval, sentiment analysis, etc.). However, some collection, data expansion and geo-location mechanisms are either region or language dependent, and should be adapted. Currently we are working on English language tweets in the UK and Ireland and in the USA;
 Improving the sentiment analysis methods. Even though we opted for a language-independent mechanism, it is possible to improve sentiment analysis by combining it with language-dependent lexicons. We also find the polarity sentiment approach very limiting, and would like to include more elaborate approaches;
 The platform is very dependent on the use of several Twitter and Google APIs: most changes to the API endpoints imply changing and recompiling the platform code.
We would like to improve the code architecture to allow for external, dynamic API changes: the platform administrator would simply need to update the API access data, without the need to recompile the code.

Acknowledgment

This work was supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, under project PTDC/IVC-ESCT/4919/2012 and project PEst-OE/EEI/LA0021/2013.

References

Baccianella, S., Esuli, A., and Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proc. of LREC'10, Valletta, Malta. ELRA.
Berger, A. L., Pietra, S. D., and Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Brogueira, G., Batista, F., and Carvalho, J. P. (2015a). Arquitetura e desenvolvimento de um repositório de tweets em português europeu. In 5as Jornadas de Informática da Universidade de Évora - JIUE 2015. Springer.
Brogueira, G., Batista, F., and Carvalho, J. P. (2015b). Sistema inteligente de recolha, armazenamento e visualização de informação proveniente do Twitter. In Conferência da Associação Portuguesa de Sistemas de Informação, CAPSI 2015.
Brogueira, G., Batista, F., and Carvalho, J. P. (2015c). Using geolocated tweets for characterization of Portuguese administrative regions. In 18th AGILE International Conference on Geographic Information Science.
Brogueira, G., Batista, F., Carvalho, J. P., and Moniz, H. (2014a). Expanding a database of Portuguese tweets. In SLATE'14 - 3rd Symposium on Languages, Applications and Technologies, volume 4569 of OpenAccess Series in Informatics (OASIcs), pages 275–282. Schloss Dagstuhl.
Brogueira, G., Batista, F., Carvalho, J. P., and Moniz, H. (2014b). Portuguese geolocated tweets: An overview. In ISDOC2014 - Proceedings of the International Conference on Information Systems and Design of Communication, pages 178–179. ACM.
Brogueira, G., Batista, F., and Carvalho, J. P. (2016). A smart system for Twitter corpus collection, management and visualization. International Journal of Technology and Human Interaction (IJTHI), 13(3), December 2016. IGI Global.
Rebelo, C., Pereira, I., Rosa, H., Batista, F., and Carvalho, J. P. (2016). The news will be tweeted: Multiple uses of Twitter around a major political event. Submitted to New Media and Society.
Carvalho, J. P. and Coheur, L. (2013). Introducing UWS - A fuzzy based word similarity function with good discrimination capability: Preliminary results. In FUZZ-IEEE, pages 1–8.
Carvalho, J. P., Pedro, V., and Batista, F. (2013). Towards intelligent mining of public social networks' influence in society. In IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), pages 478–483, Edmonton, Canada.
Carvalho, J. P., Rosa, H., and Batista, F. (2017). Detecting relevant tweets in very large tweet collections: The London Riots case study. In FUZZ-IEEE 2017, IEEE International Conference on Fuzzy Systems, July 2017, Naples, Italy.
Cataldi, M., Di Caro, L., and Schifanella, C. (2010). Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, MDMKDD '10, pages 4:1–4:10, New York, NY, USA. ACM.
Cha, M., Haddadi, H., Benevenuto, F., and Gummadi, K. P. (2010). Measuring user influence in Twitter: The million follower fallacy. International AAAI Conference on Weblogs and Social Media, pages 10–17.
Chen, Y., Conroy, N. J., and Rubin, V. L. (2015). News in an online world: The need for an "automatic crap detector". In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community (ASIST '15), Article 81, 4 pages. American Society for Information Science, Silver Springs, MD, USA.
Cigarrán, J., Castellanos, A., and García-Serrano, A. (2016). A step forward for topic detection in Twitter: An FCA-based approach. Expert Systems with Applications, 57:21–36. http://dx.doi.org/10.1016/j.eswa.2016.03.011
Culotta, A. (2010). Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 115–122, New York, NY, USA. ACM.
Das, S. and Chen, M. (2001). Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Daume III, H. (2004). Notes on CG and LM-BFGS optimization of logistic regression. http://hal3.name/megam/
Dehkharghani, R., Mercan, H., Javeed, A., and Saygin, Y. (2014). Sentimental causal rule discovery from twitter. Expert Systems with Applications, 41(10):4950–4958.
Domingos, P. and Richardson, M. (2001). Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 57–66, New York, NY, USA. ACM.
Feldman, R. and Sanger, J. (2006). Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York, NY, USA.
Ford, R. (2011). Hollywood Reporter [Online]. Available at: http://www.hollywoodreporter.com/news/earthquake-Twitter-users-learned-tremors-226481 [Accessed 30/7/2017].
Gerber, M. S. (2014). Predicting crime using twitter and kernel density estimation. Decision Support Systems, 61:115–125.
Go, A., Bhayani, R., and Huang, L. (2009). Twitter sentiment classification using distant supervision. Technical report, Stanford University.
Gupta, M., Li, R., and Chang, K. (2014). Towards a social media analytics platform: Event detection and description for Twitter - a tutorial. 23rd International WWW Conference. [Online] Available at: http://www2014.kr/asset/slide/Towards%20a%20Social%20Media%20Analytics%20Platform.pdf [Accessed 11/1/2017].
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18.
Hoffman, M. D., Blei, D. M., and Bach, F. R. (2010). Online learning for latent Dirichlet allocation. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, NIPS, pages 856–864. Curran Associates, Inc.
Homem, N. and Carvalho, J. P. (2010). Finding top-k elements in data streams. Information Sciences, 180(24):4958–4974. Elsevier.
Homem, N. and Carvalho, J. P. (2011). Authorship identification and author fuzzy fingerprints. In 30th Annual Conference of the North American Fuzzy Information Processing Society, NAFIPS2011.
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD '04, pages 168–177. ACM.
Kasiviswanathan, S. P., Melville, P., Banerjee, A., and Sindhwani, V. (2011). Emerging topic detection using dictionary learning. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 745–754, New York, NY, USA. ACM.
Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43.
Kim, S.-M. and Hovy, E. (2004). Determining the sentiment of opinions. In Proc. of COLING '04. ACL.
Kiritchenko, S., Zhu, X., and Mohammad, S. M. (2014). Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723–762.
Kontopoulos, E., Berberidis, C., Dergiades, T., and Bassiliades, N. (2013). Ontology-based sentiment analysis of twitter posts. Expert Systems with Applications, 40(10):4065–4074. Elsevier.
Kumar, S., Morstatter, F., and Liu, H. (2013a). Twitter Data Analytics. Springer, New York, NY, USA.
Kumar, S., Morstatter, F., Zafarani, R., and Liu, H. (2013b). Whom should I follow?: Identifying relevant users during crises. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, HT '13, pages 139–147, New York, NY, USA. ACM.
Lachlan, K. A., Spence, P. R., and Lin, X. (2014). Expressions of risk awareness and concern through twitter: On the utility of using the medium as an indication of audience needs. Computers in Human Behavior, 35:554–559.
Leavitt, A., Burchard, E., Fisher, D., and Gilbert, S. (2009). The influentials: New approaches for analyzing influence on twitter.
Lee, K., Palsetia, D., Narayanan, R., Patwary, M. M. A., Agrawal, A., and Choudhary, A. (2011). Twitter trending topic classification. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, ICDMW '11, pages 251–258, Washington, DC, USA. IEEE Computer Society.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
Marcus, A., Bernstein, M. S., Badar, O., Karger, D. R., Madden, S., and Miller, R. C. (2011a). Tweets as data: Demonstration of TweeQL and TwitInfo. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 1259–1262, New York, NY, USA. ACM.
Marcus, A., Bernstein, M. S., Badar, O., Karger, D. R., Madden, S., and Miller, R. C. (2011b). TwitInfo: Aggregating and visualizing microblogs for event exploration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 227–236, New York, NY, USA. ACM.
Mathioudakis, M. and Koudas, N. (2010). TwitterMonitor: Trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 1155–1158, New York, NY, USA. ACM.
Mazzia, A. and Juett, J. (2010). Suggesting hashtags on twitter. Master's thesis, University of Michigan.
Mendoza, M., Poblete, B., and Castillo, C. (2010). Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 71–79, New York, NY, USA. ACM.
Mustafaraj, E. and Metaxas, P. T. (2017). The fake news spreading plague: Was it preventable? In Proceedings of the 2017 ACM on Web Science Conference (WebSci '17), pages 235–239, New York, NY, USA. ACM. DOI: https://doi.org/10.1145/3091478.3091523
Oussalah, M., Bhat, F., Challis, K., and Schnier, T. (2013). A software architecture for twitter collection, search and geolocation services. Knowledge-Based Systems, 37:105–120.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). A ranking for every page on the web. World Wide Web Internet And Web Information Systems, 54(1999-66):1–17.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web.
Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79–86.
Paul, M. J. and Dredze, M. (2011). You are what you tweet: Analyzing twitter for public health. In Adamic, L. A., Baeza-Yates, R. A., and Counts, S., editors, ICWSM. The AAAI Press.
Perera, R., Anand, S., Subbalakshmi, K., and Chandramouli, R. (2010). Twitter analytics: Architecture, tools and analysis. In Military Communications Conference, MILCOM 2010, pages 2186–2191.
Phuoc, N. Q., Kim, S.-R., Lee, H.-K., and Kim, H. (2009). PageRank vs. Katz status index, a theoretical approach. In Proceedings of the 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, ICCIT '09, pages 1276–1279, Washington, DC, USA. IEEE Computer Society.
Qu, Y., Huang, C., Zhang, P., and Zhang, J. (2011). Microblogging after a major disaster in China: A case study of the 2010 Yushu earthquake. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW '11, pages 25–34, New York, NY, USA. ACM.
Razis, G. and Anagnostopoulos, I. (2014). InfluenceTracker: Rating the impact of a twitter account. CoRR.
Rogers, E. M. (1962). Diffusion of Innovations.
Rosa, H. (2014). Topic detection within social networks. Master's thesis, Instituto Superior Técnico, Universidade de Lisboa.
Rosa, H., Batista, F., and Carvalho, J. P. (2014a). Twitter topic fuzzy fingerprints. In WCCI2014, FUZZ-IEEE, 2014 IEEE World Congress on Computational Intelligence, International Conference on Fuzzy Systems, pages 776–783, Beijing, China.
Rosa, H., Carvalho, J. P., and Batista, F. (2014b). Detecting a tweet's topic within a large number of Portuguese Twitter trends. In Pereira, M. J. V., Leal, J. P., and Simões, A., editors, 3rd Symposium on Languages, Applications and Technologies, volume 38 of OpenAccess Series in Informatics (OASIcs), pages 185–199, Dagstuhl, Germany. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
Rosa, H., Carvalho, J. P., Astudillo, R., and Batista, F. (2015). Detecting user influence in twitter: PageRank vs Katz, a case study. 7th European Symposium on Computational Intelligence and Mathematics.
Saha, A. and Sindhwani, V. (2012). Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 693–702, New York, NY, USA. ACM.
Sakaki, T., Okazaki, M., and Matsuo, Y. (2010). Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 851–860, New York, NY, USA. ACM.
Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., and Sperling, J. (2009). TwitterStand: News in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '09, pages 42–51, New York, NY, USA. ACM.
Santos, C. J. and Matos, S. (2013). Predicting flu incidence from Portuguese tweets. In International Work-Conference on Bioinformatics and Biomedical Engineering 2013, Proceedings.
Scanfeld, D., Scanfeld, V., and Larson, E. L. (2010a). Dissemination of health information through social networks: Twitter and antibiotics. American Journal of Infection Control, 38(3):182–188.
Stone, P., Dunphy, D., Smith, M., and Ogilvie, D. (1966). The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. of the 40th Annual Meeting on Association for Computational Linguistics, pages 417–424. ACL.
Twitter (2010). To trend or not to trend. https://blog.twitter.com/2010/trend-or-not-trend. Accessed: 2014-03-28.
Vosoughi, S., Mohsenvand, M., and Roy, D. (2017). Rumor Gauge: Predicting the veracity of rumors on Twitter. ACM Trans. Knowl. Discov. Data, 11(4), Article 50, 36 pages. DOI: https://doi.org/10.1145/3070644