Depression Detection with Multi-Modalities Using a Hybrid Deep Learning Model on Social Media
Hamad Zogan, Xianzhi Wang, Shoaib Jameel, Guandong Xu
2020-07-03

Social networks enable people to interact with one another by sharing information, sending messages, making friends, and having discussions, which generates massive amounts of data every day, popularly called user-generated content. This data is present in various forms such as images, text, videos, links, and others, and reflects user behaviours, including their mental states. It is challenging yet promising to automatically detect mental health problems from such data, which is short, sparse, and sometimes poorly phrased. However, there are efforts to automatically learn patterns from such user-generated content using computational models. While many previous works have largely studied the problem on a small scale by assuming uni-modality of data, which may not give us faithful results, we propose a novel scalable hybrid model that combines Bidirectional Gated Recurrent Units (BiGRUs) and Convolutional Neural Networks to detect depressed users on social media such as Twitter, based on multi-modal features. Specifically, we encode words in user posts using pre-trained word embeddings and BiGRUs to capture latent behavioural patterns, long-term dependencies, and correlations across the modalities, including semantic sequence features from the user timelines (posts). The CNN model then helps learn useful features. Our experiments show that our model outperforms several popular and strong baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. We also show that our model helps improve predictive performance when detecting depression in users who post messages publicly on social media.

Mental illness is a serious issue faced by a large population around the world. In the United States (US) alone, every year, a significant percentage of the adult population is affected by different mental disorders, which include depression (6.7%), anorexia and bulimia nervosa (1.6%), and bipolar disorder (2.6%) [1]. Mental illness has sometimes been attributed as a factor in mass shootings in the US [26], which have taken numerous innocent lives. One of the most common mental health problems is depression, which is more prevalent than other mental illness conditions worldwide [60]. The risk of suicide in depressed people is 20 times higher than in the general population [54]. Diagnosis of depression is usually a difficult task because depression detection requires thorough and detailed psychological testing by experienced psychiatrists at an early stage [39]. Moreover, people who suffer from depression very commonly do not visit clinics to ask doctors for help in the early stages of the problem [66]. However, it is common for people who suffer from mental health problems to often "implicitly" (and sometimes even "explicitly") disclose their feelings and their daily struggles with mental health issues on social media as a way of relief [3, 33]. Therefore, social media is an excellent resource for automatically discovering people who are under depression.
While it would take a considerable amount of time to manually sift through individual social media posts and profiles to locate people going through depression, automatic scalable computational methods could provide timely, mass detection of depressed people, which could help prevent many major fatalities in the future and help people who genuinely need it at the right moment. The daily activities of users on social media could be a gold-mine for data miners because this data provides rich insights into user-generated content. It not only gives them a new platform to study user behaviour but also enables interesting data analysis that might not be possible otherwise. Mining users' behavioural patterns, for psychologists and scientists, by examining their online posting activities on multiple social networks such as Facebook, Weibo [12, 25], Twitter, and others could help target the right people at the right time and provide urgent, crucial care [5]. There are existing startup companies such as Neotas, with offices in London and elsewhere, which mine publicly available user data on social media to help other companies automatically perform background checks, including understanding the mental states of prospective employees. This suggests that studying the mental health conditions of users online using automated means not only helps government or health organisations but also has a huge commercial scope. The behavioural and social characteristics underlying social media information attract the interest of many researchers from different domains, such as social science, marketing research, and data mining, who analyze social media information as a source for examining human moods, emotions, and behaviours. Depression diagnosis is usually difficult to achieve on a large scale because most traditional ways of diagnosing are based on interviews, questionnaires, self-reports, or testimony from friends and relatives. Such methods are hardly scalable to cover a larger population. Individuals and health organizations have thus shifted away from their traditional interactions and now meet online, building online communities for sharing information and seeking and giving advice, which helps scale their approach to some extent so that they can cover more of the affected population in less time. Besides sharing their moods and actions, recent studies indicate that many people on social media tend to share or give advice on health-related information [17, 29, 36, 40]. These sources provide a potential pathway to discover mental health knowledge for tasks such as diagnosis, medications, and claims. Detecting depression through online social media is very challenging, requiring various hurdles to be overcome, ranging from acquiring data to learning model parameters from sparse and complex data. Concretely, one of the challenges is the availability of a relevant and sufficient amount of data for mental illness detection. The reason why more data is ideal is primarily that it gives the computational model more statistical and contextual information during training, leading to faithful parameter estimation. While there are approaches that have tried to learn a model on small-scale data, the performance of these methods is still sub-optimal. For instance, in [10], the authors tried crawling tweets that contain depression-related keywords as ground truth from Twitter.
However, they could collect only a limited amount of relevant data, mainly because it is difficult to obtain relevant data on a large scale quickly given the underlying search intricacies associated with the Twitter Application Programming Interface (API) and the daily data download limit. Despite using the right keywords, the service might return many false positives. As a result, their model suffered from unsatisfactory quantitative performance due to poor parameter estimation on small, unreliable data. The authors in [9] faced a similar issue, using a small number of data samples to train their classifier; as a result, their study suffered from unreliable model training on insufficient data, leading to poor quantitative performance. In [20], the authors propose a model to detect anxious depression in users. They propose an ensemble classification model that combines results from three popular models, and also study the performance of each model in the ensemble individually. To obtain relevant data, the authors introduced a method to collect their dataset quickly by choosing the first 100 randomly sampled users who were followers of the MS India student forum over one month. A very common problem faced by researchers in detecting depression on social media is the diversity of user behaviours on social media, which makes it extremely difficult to define depression-related features that cope with mental health issues. For example, although social media can provide enough data for effective feature engineering and can capture several kinds of user interaction for study, it was noticed in [15, 51] that one could obtain only a few crucial features for detecting people with eating disorders. The authors in [44] also suffered from inadequate features and an insufficient amount of relevant data, leading to poor results. Different from the above works, we propose a novel model that is trained on a relatively large dataset, showing that the method scales and produces better, more reliable quantitative performance than existing popular and strong comparative methods. We also propose a novel hybrid deep learning approach that can capture crucial features automatically based on data characteristics, making the approach reliable. Our results show that our model outperforms several state-of-the-art comparative methods. Depressed users behave differently when they interact on social media, producing rich behavioural data, which is often used to extract various features. However, not all of them are related to depression characteristics. Many existing studies have either neglected important features or selected less relevant features, which are mostly noise. On the other hand, some studies have considered a variety of user behaviours. For example, [41] is one such work that collected a large-scale dataset with reliable ground-truth labels. The authors extracted various features representing user behaviour in social media and grouped these features into several modalities. Finally, they proposed a new model called the Multimodal Dictionary Learning Model (MDL) to detect depressed users from tweets, based on dictionary learning. However, given the high-dimensional, sparse, figurative, and ambiguous nature of tweet language use, dictionary learning cannot capture the semantic meaning of tweets.
In contrast, word embedding is a newer technique that can address the above difficulties through neural network paradigms. Hence, given the capability of word embeddings to hold the semantic relationships between tweets and to capture the similarity between terms, we combine multi-modal features with word embeddings to build a comprehensive spectrum of behavioural, lexical, and semantic representations of users. Recently, using deep learning to gain insightful and actionable knowledge from complex and heterogeneous data has become mainstream in AI applications for healthcare; e.g., medical image processing and diagnosis have seen great success. The advantage of deep learning lies in its outstanding capability for iterative learning and automated optimization of latent representations from multi-layer network structures [32]. This motivates us to leverage the superior learning capability of neural networks with the rich and heterogeneous behavioural patterns of social media users. To be specific, this work aims to develop a novel deep learning-based solution for improving depression detection by utilizing multi-modal features from the diverse behaviours of depressed users in social media. Apart from the latent features derived from lexical attributes, we notice that the dynamics of tweets, i.e., the tweet timeline, provide a crucial hint reflecting how a depressed user's emotions change over time. To this end, we propose a hybrid model comprising a Bidirectional Gated Recurrent Unit (BiGRU) and a Convolutional Neural Network (CNN) to boost the classification of depressed users using multi-modal features and word embedding features. The model can derive new deterministic feature representations from training data and produce superior results for detecting the depression level of Twitter users. Our proposed model uses a BiGRU, a network that can capture distinct and latent features as well as long-term dependencies and correlations across the feature matrix. The BiGRU is designed to use backward and forward contextual information in text, which helps obtain a user's latent features from their various behaviours by using reset and update gates in a hidden layer in a robust way. In general, GRU-based models have shown better effectiveness and efficiency than other Recurrent Neural Networks (RNNs) such as the Long Short-Term Memory (LSTM) model [8]. Capturing contextual patterns bidirectionally helps obtain a representation of a word based on its context, meaning that under different contexts a word can have different representations. This is more powerful than techniques such as the traditional unidirectional GRU, where each word is represented by only one representation. Motivated by this, we add a bidirectional network to the GRU that can effectively learn from multi-modal features and provide a better understanding of context, which helps reduce ambiguity. Besides, the BiGRU can extract more distinct features, which helps improve the performance of our model. The BiGRU model can capture contextual patterns very well, but it lacks the ability to automatically learn the right features for the model, which play a crucial role in predictive performance. To this end, we introduce a one-dimensional CNN as a new feature extraction method to classify user timeline posts. Our full model can be regarded as a hybrid deep learning model in which there is an interplay between a BiGRU and a CNN model during model training.
Some existing models have combined CNN and BiRNN models. For instance, in [63] the authors combine a BiLSTM or BiGRU with a CNN to learn better features for text classification, using an attention mechanism for feature fusion; this is a different modelling paradigm from the one introduced in this work, which captures the multi-modalities inherent in the data. In [62], the authors proposed a hybrid BiGRU and CNN model that later constrains the semantic space of sentences with a Gaussian. While the modelling paradigms may be closely related through the combination of a BiGRU and a CNN model, their model is designed to handle sentence sentiment classification rather than depression detection, which is a much more challenging task, as tweets in our problem domain are short sentences that are largely noisy and ambiguous. In [53], the authors propose a combined BiGRU and CNN model for salary prediction but do not exploit multi-modal and temporal features. Finally, we also studied the performance of our model when using the two attributes, word embedding and multi-modalities, separately. We found that model performance deteriorated when we used only multi-modal features, and we further show that combining the two attributes leads to better performance. To summarize, our study makes the following contributions: (1) We propose a novel depression detection framework that deep-learns the textual, behavioural, temporal, and semantic modalities from social media. (2) We apply a bidirectional Gated Recurrent Unit to detect depression using several features extracted from user behaviours. (3) We build a CNN network to classify user timeline posts, concatenated with the BiGRU network, to identify social media users who suffer from depression. To the best of our knowledge, this is the first work to use the multi-modalities of topical, temporal, and semantic features jointly with word embeddings in deep learning for depression detection. (4) The experimental results obtained on a real-world tweet dataset show the superiority of our proposed method compared to baseline methods. The rest of our paper is organized as follows. Section 2 reviews the work related to our paper. Section 3 presents the dataset used in this work and the different pre-processing steps we applied to the data. Section 4 describes the two different attributes that we extracted for our model. In Section 5, we present our model for detecting depression. Section 6 reports experiments and results. Finally, Section 7 concludes this paper. In this section, we discuss closely related literature and explain how it differs from our proposed method. In general, just like our work, most existing studies focus on user behaviour to detect whether a user suffers from depression or any mental illness. We also discuss other relevant literature covering word embeddings and hybrid deep learning methods proposed for detecting mental health issues from online social networks and other resources, including public discussion forums. Since we introduce the notion of latent topics in our work, we also cover related literature on topic modelling for depression detection, which has been widely studied. Data present in social media is usually in the form of information that users share for public consumption, which also includes related metadata such as user location, language, and age, among others [20]. In the existing literature, there are generally two steps to analyzing social data.
The first step is collecting the data generated by users on networking sites, and the second step is analyzing the collected data using, for instance, a computational model or manual inspection. In any data analysis, feature extraction is an important task because, using only a small set of relevant features, one can learn a high-quality model. Understanding depression on online social networks can be carried out using two complementary approaches, which are widely discussed in the literature:
• Post-level behavioural analysis
• User-level behavioural analysis
Post-level behavioural analysis. Methods of this kind mainly target the textual features of a user's post, extracted in the form of statistical knowledge such as count-based methods [21]. These features describe the linguistic content of the post, as discussed in [9, 19]. For instance, in [9] the authors propose a classifier to understand the risk of depression. Concretely, the goal of the paper is to estimate the risk of user depression from social media posts. To this end, the authors collect data from social media for the year preceding the onset of depression from user profiles and distil behavioural attributes relating to social engagement, emotion, language and linguistic styles, ego networks, and mentions of antidepressant medications. The authors collect their data using a crowd-sourcing task on Amazon Mechanical Turk, which is not a scalable strategy. In their study, the crowd workers were asked to undertake a standardized clinical depression survey, followed by various questions on their depression history and demographics. While the authors conducted thorough quantitative and qualitative studies, the approach is disadvantageous in that it does not scale to a large set of users and does not consider text-level semantics such as latent topics or semantic analysis using word embeddings. Our work is both scalable and considers various features that are jointly learned using a novel hybrid deep learning model with a multi-modal learning approach. It harnesses high-performance Graphics Processing Units (GPUs) and, as a result, has the potential to scale to large sets of instances. Hu et al. [19] also consider various linguistic and behavioural features in data obtained from social media. Their underlying model relies on both classification and regression techniques for predicting depression, while our method performs classification, but at a large scale using a varied set of crucial features relevant to this task. To analyze whether a post contains positive or negative words and/or emotions, or the degree of adverbs, the authors of [49] used cues from the text, for example, "I feel a little depressed" and "I feel so depressed", where they capture the usage of the word "depressed" in sentences that express two different feelings. The authors also analyzed the posts' interactions on Twitter (retweets, likes, comments). Some researchers have studied post-level behaviours to predict mental problems by analysing tweets on Twitter to identify depression-related language. In [38], the authors developed a model to uncover meaningful and useful latent structure in a tweet. Similarly, in [41], the authors monitored different symptoms of depression mentioned in users' tweets. In [42], the authors study user behaviour on both Twitter and Weibo, using linguistic features to analyze users' posts.
They used a Chinese-language psychological analysis system called TextMind for sentiment analysis. One of the interesting post-level behavioural studies was done by [41] on Twitter, finding depression-relevant words, antidepressants, and depression symptoms. In [37], the authors used post-level behaviour for detecting anorexia; they analyze domain-related vocabulary such as anorexia, eating disorders, food, meals, and exercise. User-level behavioural analysis. There are various features for modelling users in social media, as they reflect overall behaviour across several posts. Different from post-level features extracted from a single post, user-level features are extracted from several tweets posted at different times [49]. This kind of analysis also extracts a user's social engagement on Twitter from many tweets, retweets, and/or interactions with others. Generally, the linguistic style of posts can be considered for feature extraction [19, 59]. The authors in [41] extracted six depression-oriented feature groups for a comprehensive description of each user from the collected dataset. They used the number of tweets and social interactions as social network features; for user profile features, they used the personal information users share on the social network. Analysing user behaviour is also useful for detecting eating disorders. Wang et al. [51] extracted user engagement and activity features on social media, along with linguistic features of the users for psychometric properties. This resembles the settings described in [20, 37, 42], where the authors extracted 70 features from two different social networks (Twitter and Weibo), covering user profiles, posting times, and interaction features such as the number of followers and followees. One interesting work is [56], where the authors combine user-level and post-level semantics and cast their problem as a multiple-instance learning setup. The advantage of this method is that it can learn from user-level labels to identify post-level labels. There is extensive literature using deep learning for detecting depression on the Internet in general, ranging from tweets to traditional document collections and user studies. While some of these works could also fall into one of the categories above, we present these latest findings, which use modern deep learning methods, separately. The most closely related recent work to ours is [23], where the authors propose a CNN-based deep learning model to classify Twitter users with respect to depression using multi-modal features. The framework proposed by the authors has two parts. In the first part, the authors train their model in an offline mode, exploiting features from Bidirectional Encoder Representations from Transformers (BERT) [11] and visual features from images using a CNN model. The two sets of features are then combined, just as in our model, for joint feature learning. There is then an online depression detection phase that considers user tweets and images jointly, with feature fusion at a later stage. In another recently proposed work [7], the authors use visual and textual features to detect depressed users on Instagram rather than Twitter. Their model also uses the multi-modalities in data but confines itself to Instagram only. While the model in [23] showed promising results, it still has certain disadvantages.
For instance, BERT vectors for masked tokens are computationally demanding to obtain even during the fine-tuning stage, unlike our model, which does not have to train word embeddings from scratch. Another limitation of their work is that they obtain sentence representations from BERT; for instance, BERT imposes a 512-token length limit where longer sequences are simply truncated, resulting in some information loss, whereas our model supports a much longer sequence length that we can tune easily because our model is computationally cheaper to train. We propose a hybrid model that considers a variety of features, unlike these works. While we have not specifically used visual features in our work, using a diverse set of crucial, relevant textual features is arguably more reasonable than relying on visual features alone. Of course, our model has the flexibility to incorporate a variety of other features, including visual features. Multi-modal features from text, audio, and images have also been used in [64], where a new graph attention-based model embedded with multi-modal knowledge was proposed for depression detection. While they used a temporal CNN model, their overall architecture was evaluated on small-scale questionnaire data: their dataset contains 189 sessions of interactions ranging between 7 and 33 minutes (with an average of 16 minutes). Since they did not evaluate their method on short and noisy data from social media, it remains to be seen how their method scales to such large collections. Xezonaki et al. [57] propose an attention-based model for detecting depression from transcribed clinical interviews rather than from online social networks. Their main conclusion was that individuals diagnosed with depression use affective language to a greater extent than those who are not going through depression. In another recent work [55], the authors discuss depression among users during the COVID-19 pandemic using LSTMs and fastText [28] embeddings. In [43], the authors also propose a multi-modal RNN-based model for depression prediction but apply their model to online user forum datasets. Trotzek et al. [48] study the problem of early detection of depression from social media using deep learning, where they leverage different word embeddings in an ensemble-based learning setup. The authors even train a new word embedding on their dataset to obtain task-specific embeddings. While the authors used a CNN model to learn high-quality features, their method does not consider temporal dynamics coupled with latent topics, which we show to play a crucial role in overall quantitative performance. The general motivation of word embeddings is to find a low-dimensional representation of a word in the vocabulary that signifies its meaning in the latent semantic space. While word embeddings have been popularly applied in various domains of natural language processing [34] and information retrieval [61], they have also been applied in the domain of mental health issues such as depression. For instance, in [2], the authors study a few Reddit communities (Reddit is also used in [47]) that contain discussions of mental health struggles such as depression and suicidal thoughts. To better model the individuals who may have these thoughts, the authors proposed to exploit the representations obtained from word embeddings, grouping related concepts close to each other in the embedding space.
The authors then compute the distances between a list of manually generated concepts to discover how related concepts align in the semantic space and how users perceive those concepts. However, they do not exploit various multi-modal features, including topical features, in their space. Farruque et al. [13] study the problem of creating word embeddings when data is scarce, for instance, for depressive language detection from user tweets. The underlying motivation of their work is to simulate a retrofitting-based word embedding approach [14], where they begin with a pre-trained model and fine-tune it on domain-specific data. Gong et al. [16] proposed a topic modelling approach to depression detection using multi-modal analysis. They propose a novel topic model that is context-aware with temporal features. While the model produced satisfactory results on the 2017 Audio/Visual Emotion Challenge (AVEC) data, the method does not use a variety of rich features and could face scalability issues, because simple posterior inference algorithms such as those based on Gibbs or collapsed Gibbs sampling do not parallelize the way deep learning methods do, or require sophisticated engineering to parallelize. Twitter is popularly regarded as an online social media resource that provides free data for mining tweets, which explains its popularity among researchers, who have widely used data from Twitter. One can freely and easily download tweet data through its APIs. In the past, researchers have generally followed two methods for using Twitter data:
• Using an already existing dataset shared freely and publicly by others. The downside of such datasets is that they might be too old to learn anything useful in the current context. Recency may be crucial in some studies, such as understanding the current trends of a recently trending topic [22].
• Crawling data using a vocabulary from a social media network, which is slow but helps obtain fresh, relevant, and reliable data, enabling the study of patterns currently being discussed on online social networks. This method takes time to collect relevant data and then process it, given that resources such as Twitter, which provide data freely, impose tweet download restrictions per user per day as a result of a fair-usage policy applied to all users. Developing and validating the vocabulary terms used by users with mental illness is time-consuming but yields a reliable list of words by which reliable tweets can be crawled, reducing the number of false positives (a concrete sketch of this crawling strategy is given after the dataset description below).
Recent research conducted by the authors of [41] is one such work that collected a large-scale dataset with reliable ground-truth labels, which we aim to reuse. We present the statistics of the data in Table 1. To describe the dataset further, the authors collected three complementary datasets:
• Depression dataset: Each user is labelled as depressed, based on their tweet content between 2009 and 2016. This includes 1,402 depressed users and 292,564 tweets.
• Non-depression dataset: Each user is labelled as non-depressed, and the tweets were collected in December 2016. This includes over 300 million active users and 10 billion tweets.
• Depression-candidate dataset: Users are labelled as depression candidates, where a tweet was collected if it contained the word "depress". This includes 36,993 depression-candidate users and over 35 million tweets.
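As referenced above, the sketch below illustrates the vocabulary-driven crawling strategy. This is a minimal, hypothetical example using the tweepy library (v4) as one possible client, not the original authors' collection code; the bearer token, seed vocabulary, and query filters are illustrative assumptions.

```python
# Illustrative sketch of vocabulary-based tweet crawling (not the authors'
# actual collection code). Assumes tweepy >= 4 and a valid bearer token.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN",  # hypothetical credential
                       wait_on_rate_limit=True)

# Seed vocabulary; a validated mental-illness lexicon would replace this.
keywords = ["depress"]

candidate_tweets = []
for word in keywords:
    # The recent-search endpoint returns at most 100 tweets per call, so
    # large-scale collection needs repeated, paginated calls over many days.
    response = client.search_recent_tweets(
        query=f"{word} -is:retweet lang:en",
        tweet_fields=["author_id", "created_at"],
        max_results=100,
    )
    candidate_tweets.extend(response.data or [])
```

Because of the per-day download limits mentioned above, a production crawler would wrap these calls in pagination and persist results incrementally.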
Data collection mechanisms are often loosely controlled, resulting in impossible data combinations (for instance, users labelled as depressed who have provided no posts), missing values, and other issues.

Table 1. Statistics of the large dataset collected by the authors in [41], which is used in this study.

  Dataset         Depressed    Non-Depressed
  No. of Users    1,402        300 million
  No. of Tweets   292,564      10 billion

After data has been crawled, it is still not ready to be used directly by a machine learning model due to the various noise still present; this is called the "raw data". The problem is even more exacerbated when the data has been downloaded from online social media such as Twitter, because tweets may contain spelling and grammar mistakes, smileys, and other undesirable characters. Therefore, a pre-processing strategy is needed to ensure satisfactory data quality so that the computational model can achieve reliable predictive analysis. The raw data used in this study has labels of "depressed" and "non-depressed" and is organised as follows. Users: this data is packaged as a JSON file for each user account, describing details about the user such as user id, number of followers, number of tweets, etc. Note that JSON is a standard, popular data-interchange format that is easy for humans to read and write. Timeline: this data package contains files with several tweets along with their corresponding metadata, again in JSON format. To further clean the data, we used the Natural Language Processing Toolkit (NLTK). This package has been widely used for text pre-processing [18] and in various other works, including for removing common words such as stop words from text [10, 20, 38]. We removed the common words from users' tweets (such as "the", "an", etc.), as these are not discriminative or useful enough for our model. These common words also increase the dimensionality of the problem, which can lead to the "curse of dimensionality" and may impact overall model efficiency. To further improve text quality, we also removed non-ASCII characters, a step widely used in the literature [59]. Pre-processing removed plenty of noisy content from the dataset, leaving high-quality, reliable data for this study. Besides, this distillation helped reduce the computational complexity of the model, because we only deal with informative data in the modelling. The statistics of this distilled data are given in the experiments section. To further mitigate the issue of sparsity in the data, we excluded users who posted fewer than ten posts and users who have more than 5,000 followers, ending up with 2,500 positive users and 2,300 negative users. Social media data conveys user content, insights, and emotions reflected in individuals' behaviours on the social network; this data shows how users interact with their connections. In this work, we collect information from each user and categorize it into two types of attributes, namely the multi-modal attribute and word embedding, as follows. For the multi-modal attribute, the goal is to calculate the attribute value corresponding to each modality for each user. We estimate the dimensionality of all modalities of interest to be 76, and we mainly consider four major modalities, listed below, ignoring two modalities due to missing values.
These features are extracted for each user as follows. 4.1.1 Social Information and Interaction. From this attribute, we extracted several features embedded in each user profile. These are features related to each user account, as specified by each feature name. Most of the features are directly available in the user data, such as the number of users followed, friends, favourites, etc. Moreover, the extracted features relate to user behaviour on their profile. For each user, we calculate their total number of tweets, the total length of all their tweets, and the number of retweets. We further calculate a posting-time distribution for each user by counting how many tweets the user published during each of the 24 hours of a day; hence it is a 24-dimensional integer array. To obtain the posting-time distribution, we extract two digits as the hour information for each tweet, then go through all tweets of each user and track the count of tweets posted in each hour of the day. Emojis allow users to express their emotions through simple icons and non-verbal elements and are useful for getting the reader's attention. Emojis give us a glance at the sentiment of any text or tweet, and it is essential to differentiate between positive and negative sentiment text [31]. User tweets contain a large number of emojis, which can be classified into positive, negative, and neutral. For each of the positive, neutral, and negative types, we count their frequency in each tweet; we then sum the numbers over each user's tweets to get a total per user. The final output is three values corresponding to the positive, neutral, and negative emojis used by the user. We also consider Valence-Arousal-Dominance (VAD) features, which comprise valence, arousal, and dominance scores; in addition, we count first-person singular and first-person plural pronoun usage. Using affective norms for English words, VAD scores for 1,030 words are obtained. We create a dictionary with each word as a key and a tuple of its (valence, arousal, dominance) scores as the value. Next, we parse each tweet and calculate a VAD score for it using this dictionary. Finally, for each user, we add up the VAD scores of that user's tweets to calculate the user's VAD score. Topic modelling belongs to the class of statistical modelling frameworks that help discover abstract topics in a collection of text documents. It gives us a way of organizing, understanding, and summarizing collections of textual information, and it helps find hidden topical patterns, where the number of topics is specified by the user a priori. It can be defined as a method of finding groups of words (i.e., topics) from a collection of documents that best represent the latent topical information in the collection. In our work, we applied unsupervised Latent Dirichlet Allocation (LDA) [4] to extract the latent topic distribution from user tweets. To calculate topic-level features, we first consider the corpus of all tweets of all depressed users. Next, we split each tweet into a list of words, assemble all words in decreasing order of their frequency of occurrence, and remove common English words (stopwords) from the list. Finally, we apply LDA to extract the latent topic distribution with K = 25 topics, where K is the number of topics. We have found experimentally that K = 25 is a suitable value.
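As an illustration of this step, the sketch below builds the K = 25 topic distribution with gensim's LdaModel. The `user_tweets` mapping (user id to cleaned token lists) and the number of training passes are assumptions made for exposition, not details from the paper.

```python
# Sketch of the topic-level feature extraction; gensim is one possible choice.
# `user_tweets` maps a user id to a list of cleaned, tokenized tweets (assumed).
from gensim import corpora
from gensim.models import LdaModel

K = 25  # number of latent topics, found experimentally to be suitable

# Bag-of-words corpus over all tweets of all depressed users.
documents = [tokens for tweets in user_tweets.values() for tokens in tweets]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(tokens) for tokens in documents]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K, passes=10)

def topic_vector(user_tokens):
    """Return a K-dimensional topic distribution for one user's combined tokens."""
    bow = dictionary.doc2bow(user_tokens)
    dense = [0.0] * K
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = float(prob)
    return dense
```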
While there are tuning strategies, including those based on Bayesian non-parametrics [46], we opted for a simple, popular, and computationally efficient approach that gives us the desired results. Another modality is the count of depression symptoms occurring in tweets, as specified by the nine groups in the DSM-IV criteria for a depression diagnosis. The symptoms are listed in Appendix A. We count how many times the nine depression symptoms are mentioned by the user in their tweets. The symptoms are specified as a list of nine categories, each containing various synonyms for the particular symptom. We created a set of seed keywords for all nine categories and, with the help of the pre-trained word embedding, extracted similar terms to extend the list of keywords for each depression symptom. Furthermore, we scan through all tweets, counting how many times a particular symptom is mentioned in each tweet. We also focused on antidepressants: we created a lexicon of antidepressants from the "Antidepressant" Wikipedia page, which contains an exhaustive and regularly updated list, and counted the number of antidepressant names mentioned. The medicine names are listed in Appendix B. Word embeddings are a class of representation learning models that find the underlying meaning of words in the vocabulary in some low-dimensional semantic space. Their underlying principle is based on optimising an objective function that brings words repeatedly occurring together within a certain contextual window close to each other in the semantic space. The usual window size that works well in many settings is 10 [34]. A remarkable ability of these models is that they can effectively capture various lexical properties of natural language, such as similarities between words, analogies among words, and others. These models have become increasingly popular in the natural language processing domain and are commonly used as input to deep learning models. Among the various word embedding models proposed in the literature, word2vec [27] is one of the most popular techniques; it uses shallow neural networks to learn word embeddings. word2vec is a predictive model for learning word embeddings from raw text that is also computationally efficient. It takes a large corpus of text as input and generates a vector space, with a corresponding vector in the space allocated to each specific word; words that share common contexts in the corpus are located near each other in the space. To learn the semantic meaning of the words posted by depressed users, we add a new attribute to extract more meaningful features. Count features in the multi-modal attribute are useful and effective for extracting features from normal text; however, they cannot effectively capture the underlying semantics, structure, sequence, and meaning of tweets. Since count features are based on the independent occurrence of words in a text corpus, they cannot capture the contextual meaning of words in the text, which word embeddings capture effectively. Motivated by this, we apply word embedding techniques to extract more meaningful features from every user's tweets and to capture the semantic relationships among word sequences. We used the popular word2vec model [27] with a 300-dimensional set of word embeddings pre-trained on the Google News corpus to produce a matrix of word vectors.
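A sketch of this step is shown below: it loads the pre-trained 300-dimensional Google News vectors with gensim and builds the embedding matrix. The `word_index` mapping (token to integer id) is an assumption; it would come from whatever tokenizer is applied to the tweets (see the padding sketch in the next section).

```python
# Sketch: build the embedding matrix from the pre-trained Google News
# word2vec vectors. `word_index` (token -> integer id) is assumed to come
# from the tweet tokenizer.
import numpy as np
from gensim.models import KeyedVectors

EMBED_DIM = 300
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, idx in word_index.items():
    if word in w2v:               # out-of-vocabulary words keep a zero vector
        embedding_matrix[idx] = w2v[word]
```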
The Skip-gram model is used to learn word vector representations, which are characterised by low-dimensional, real-valued representations for each word. This is usually done as a pre-processing stage, after which the learned vectors are fed into a model. In this section, we describe our hybrid model, which learns from multi-modal features. While various hybrid deep learning models have been proposed in the literature, our method is novel in that it learns multi-modal features, which include topical features, as shown in Figure 1. The joint learning mechanism learns the model parameters in a consolidated parameter space where different model parameters are shared during the training phase, leading to more reliable results. Note that simple cascade-based approaches suffer from error propagation from one stage to the next [65]. At the end of the feature extraction step, we obtain the training data in the form of an embedding matrix for each user, representing the user timeline posts attribute, and a 76-dimensional vector of integers for each user, representing the multi-modal attribute. Due to the complexity of user posts and the diversity of user behaviour on social media, we propose a hybrid model that combines a CNN with a BiGRU to detect depression through social media, as depicted in Figure 1. For each user, the model takes two inputs, one for each attribute. First, the four-modality feature input, representing the user behaviour vector, is fed into the BiGRU, capturing distinct and latent features as well as long-term dependencies and correlations across the feature matrix. The second input represents each user's tweets, with each word replaced by its embedding, and is fed to the convolution layer to learn representation features from the sequential data. The outputs of the two branches are concatenated into a single feature vector, which is fed into a sigmoid activation layer for prediction. In the following sections, we discuss the two existing architectures that are combined into a novel computational model for modelling spatial structures and multi-modalities. In particular, the model comprises a CNN network to learn the spatial structure of user tweets, and a framework to extract latent features from the multi-modal attribute, followed by the application of a BiGRU. An individual user's timeline comprises semantic information and local features. Recent studies show that CNNs have been successfully used for learning strong, suitable, and effective feature representations [24]. The effective feature learning capabilities of CNNs make them an ideal choice for extracting semantic features from user posts. In this work, we apply a CNN network to extract semantic information features from user tweets. The input to our CNN network is the embedding matrix layer with a sentence matrix, where the sentence is treated as a sequence of words $s = [w_1, w_2, w_3, \ldots, w_i]$. Each word $w \in \mathbb{R}^{1 \times d}$ is one row of the embedding matrix $\mathbb{R}^{W \times d}$, where $d$ represents the dimension of each word vector and $W$ represents the number of words in each user's posts. We cap the size of each user's sentence at 1,000 words, which covers on average about ten tweets per user. Note that this size is much larger than what has been used in other recent, closely related models based on BERT.
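Concretely, the timeline input for the CNN branch can be built as follows. This is a sketch assuming the Keras preprocessing utilities; `timelines` (user id to cleaned timeline text) is an assumed input, not a name from the paper.

```python
# Sketch: turn each user's concatenated timeline into a fixed-length sequence
# of word ids for the CNN branch; MAX_LEN follows the paper's 1,000-word cap.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 1000

tokenizer = Tokenizer()
tokenizer.fit_on_texts(timelines.values())   # `timelines`: user id -> text (assumed)
word_index = tokenizer.word_index            # feeds the embedding matrix above

sequences = tokenizer.texts_to_sequences(timelines.values())
X_posts = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
```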
Also, we can train our model on our own dataset, which creates dataset-specific representations in a computationally less demanding way, unlike BERT-based approaches that are both computationally and financially expensive to train and fine-tune. The input layer is attached to the convolution layer by three convolutional layers to learn n-gram features capturing word order, thereby capturing crucial text semantics that usually cannot be captured by a bag-of-words-based model [52]. We use a convolution operation to extract features between words as follows:

$$c_n = f(\mathbf{W} \cdot x_{n:n+h-1} + b) \qquad (1)$$

where $f$ is a nonlinear function, $b$ denotes the bias, and $x_{n:n+h-1}$ is a window of $h$ words; the convolution is applied to windows of word vectors of size $h$. The network then creates a feature map according to the following equation:

$$c = [c_1, c_2, \ldots, c_{W-h+1}] \qquad (2)$$

The feature map output by the convolution layer is the input to the pooling layer, an important step for reducing the dimensionality of the space by selecting appropriate features. We use a max-pooling layer to take the maximum value over every feature-map patch:

$$\hat{c} = \max\{c\} \qquad (3)$$

We then stack a recurrent layer on top of this pipeline to optimize the results. The Recurrent Neural Network (RNN) is a powerful network when the input is a sequence of fixed vectors to be processed in order, even if the underlying data is non-sequential. Models such as the BiGRU, GRU, and LSTM fall into the class of RNNs. The static attributes are input to the BiGRU. The GRU is an alternative to the LSTM that merges the forget gate and the input gate into a single update gate, which makes it computationally more efficient than an LSTM network due to the reduced number of gates. A GRU can effectively and efficiently capture long-distance dependencies between features, but a one-way, or unidirectional, GRU can only partly capture the historical information in the features. Moreover, for our static attributes, we would like to obtain information about the behavioural semantics of each user. To this end, we apply a BiGRU that combines the forward and backward directions for every input feature, capturing the behavioural semantics in both directions. Bidirectional models, in general, capture information from both the past and the future, which makes them more powerful than unidirectional models [11]. Suppose the input representing a user's behaviour is $x_1, x_2, \ldots, x_n$. A traditional unidirectional GRU takes the form:

$$h_s = \mathrm{GRU}(x_s, h_{s-1}) \qquad (4)$$

A Bidirectional GRU consists of two layers of GRU, as in Figure 2, introduced to obtain both forward and backward information. The hidden layer produces two output values, one for the backward pass and one for the forward pass:

$$\overrightarrow{h}_s = \mathrm{GRU}(x_s, \overrightarrow{h}_{s-1}), \qquad \overleftarrow{h}_s = \mathrm{GRU}(x_s, \overleftarrow{h}_{s+1}), \qquad h_s = \overrightarrow{h}_s \oplus \overleftarrow{h}_s \qquad (5)$$

where $x_s$ is the input at step $s$, while $\overrightarrow{h}_s$ and $\overleftarrow{h}_s$ represent the hidden states of the forward and backward GRU at step $s$. Each GRU network is defined as follows. The network calculates the update gate $z_s$ at time step $s$:

$$z_s = \sigma(W_z x_s + U_z h_{s-1}) \qquad (6)$$

This gate helps the model decide how much information from the previous step should be passed to the next step. The reset gate in Equation 7 is used to determine how much information from past steps needs to be forgotten:

$$r_s = \sigma(W_r x_s + U_r h_{s-1}) \qquad (7)$$

The GRU model uses the reset gate to save relevant information from the past in a candidate state, as depicted in Equation 8:

$$\tilde{h}_s = \tanh(W_h x_s + U_h (r_s \odot h_{s-1})) \qquad (8)$$

Lastly, the model calculates $h_s$, which holds all the information and passes it down the network, as depicted in Equation 9:

$$h_s = (1 - z_s) \odot h_{s-1} + z_s \odot \tilde{h}_s \qquad (9)$$
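Putting the two branches together, a minimal sketch of the hybrid network in the Keras functional API is shown below. The filter count and GRU width are illustrative assumptions; treating the 76-dimensional behaviour vector as a length-76 sequence of scalars is one plausible reading of how the recurrence runs over the modalities, and `embedding_matrix`/`word_index` come from the earlier sketches.

```python
# Minimal sketch of the hybrid BiGRU + CNN model (TensorFlow 2.x assumed).
# Layer sizes are illustrative; the paper specifies a non-trainable embedding
# layer, a max-pooling size of 4, and a sigmoid output.
from tensorflow.keras import layers, models

MAX_LEN, EMBED_DIM, N_MODALITIES = 1000, 300, 76
vocab_size = len(word_index) + 1   # from the tokenizer sketch above

# CNN branch over the user's timeline posts.
post_in = layers.Input(shape=(MAX_LEN,), name="timeline")
x = layers.Embedding(vocab_size, EMBED_DIM,
                     weights=[embedding_matrix], trainable=False)(post_in)
x = layers.Conv1D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling1D(pool_size=4)(x)
x = layers.Flatten()(x)

# BiGRU branch over the 76-dimensional multi-modal behaviour vector,
# treated here as a length-76 sequence of scalars (an assumption).
mm_in = layers.Input(shape=(N_MODALITIES, 1), name="modalities")
y = layers.Bidirectional(layers.GRU(64))(mm_in)

# Fuse both branches and classify with a sigmoid unit.
z = layers.concatenate([x, y])
out = layers.Dense(1, activation="sigmoid")(z)

model = models.Model(inputs=[post_in, mm_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# Training setup reported in the paper: 10 epochs with batch size 32, e.g.
# model.fit([X_posts, X_mm], y_labels, epochs=10, batch_size=32)
```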
After we obtain the latent features from each branch, we integrate and concatenate them into a single feature vector, which is input to an activation function for classification, as described below.

6 EXPERIMENTS AND RESULTS

We compare our model with the following classification methods:
• ∼MDL: The Multimodal Dictionary Learning Model (MDL) detects depressed users on Twitter [41]. It uses dictionary learning to extract latent data features and a sparse representation of each user. Since we cannot access all of the attributes used in [41], we implement MDL in our own way (denoted ∼MDL).
• SVM: Support vector machines are a class of machine learning models for text classification that optimise a loss function to draw a maximum-margin hyperplane separating two sets of labelled data, e.g., between positive and negative labelled data [6]. This is among the most popular classification algorithms.
• NB: Naive Bayes is a family of probabilistic algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between features [30]. While the suitability of the conditional independence assumption has been questioned by various researchers, these models often give surprisingly strong performance compared with more sophisticated models [45].
For our experiments, we used the dataset described in Section 3, which provides data at a large scale, especially for the labelled negative and candidate-positive users. After pre-processing and extracting information from the raw data, we filtered it down to the following:
• Number of users labelled positive: 5,899.
• Number of tweets from positive users: 508,786.
• Number of users labelled negative: 5,160.
• Number of tweets from negative users: 2,299,106.
After further excluding users who posted fewer than ten posts and users who have more than 5,000 followers, we end up with a final dataset consisting of 2,500 positive users and 2,300 negative users. We adopt an 80:20 ratio to split our data into training and test sets. We used word2vec vectors pre-trained on the Google News corpus, which comprises 3 billion words. We used Python 3.6.3 and TensorFlow 2.1.0 to develop our implementation. We set the embedding layer to be non-trainable so that the feature representations, e.g., word vectors and topic vectors, are kept in their original form. We used one hidden layer and a max-pooling layer of size 4, which gave better performance in our setting. For optimizing both the BiGRU and CNN networks, we used the Adam optimization algorithm. Finally, we trained our model for 10 iterations with a batch size of 32. This number of iterations was sufficient for the model to converge, and our experimental results further support this claim, as we outperform strong existing baseline methods. We employ traditional information retrieval metrics, namely precision, recall, F1, and accuracy, based on the confusion matrix, to evaluate our model. A confusion matrix is a summary matrix used for evaluating classification performance, also called an error matrix because it tabulates the number of wrong predictions against the number of right predictions. Some important terminologies associated with computing the confusion matrix are the following:
• P: the actual positive cases, which are depressed users in our task.
• N: the actual negative cases, which are non-depressed users in our task.
• TN: the actual case is not depressed, and the prediction is not depressed as well.
• FN: the actual case is depressed, but the prediction is not depressed.
• FP: the actual case is not depressed, but the prediction is depressed.
• TP: the actual case is depressed, and the prediction is depressed as well.
Based on the confusion matrix, we can compute the accuracy, precision, recall, and F1 score as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
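Mirroring these formulas, the evaluation can be computed with scikit-learn as sketched below; `y_true` and the held-out inputs `X_posts_test`/`X_mm_test` are assumed to come from the 80:20 split described above.

```python
# Sketch: evaluation with scikit-learn, mirroring the formulas above.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Threshold the sigmoid output at 0.5 to obtain hard labels.
y_pred = (model.predict([X_posts_test, X_mm_test]) > 0.5).astype(int)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
```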
In our experiments, we study our model's attributes, including the quantitative performance of our hybrid model, using the multi-modal attribute and the user timeline semantic features attribute jointly. After grouping user behaviour on social media into the multi-modal attribute (MM), we evaluate the performance of the model. First, we examine the effectiveness of using the multi-modal attribute (MM) alone with different classifiers. Second, we show how model performance increases when we combine word embedding with MM. We summarise the results in Table 2 and Figure 4 as follows:
• Naive Bayes obtains the lowest F1 score, which demonstrates that this model has less capability to classify tweets than the other models for detecting depression. The reason for its poor performance could be that the model is not robust to sparse and noisy data.
• The ∼MDL model outperforms SVM and NB and obtains better accuracy than these two methods. Since this is a recent model especially designed to discover depressed users, it captures the intricacies of the dataset well and learns its parameters faithfully, leading to better results.
• Our proposed model improves depression detection by up to 6% in F1 score compared to the ∼MDL model. This suggests that our model outperforms a strong model, primarily because it leverages a rich set of features that are jointly learned through consolidated parameter estimation, resulting in a robust model.
• We can also see from the table that our model consistently outperforms all existing strong baselines.
• Furthermore, our model achieves the best performance, with 85% F1, indicating that combining a BiGRU over the multi-modal attribute with a CNN over user timeline semantic features is sufficient to detect depression on Twitter.
To take a closer look at our model's performance and how it classifies the samples, we use the confusion matrix. For this, we import the confusion matrix module from scikit-learn, which helps us generate the matrix. We visualize the confusion matrix, which shows the proportion of samples in each actual/predicted class combination. We can observe from Figure 3 that our model effectively predicts both non-depressed users (TN) and depressed users (TP). We have also compared the effectiveness of each of the two attributes of our model. To test the performance of the model with each attribute, we build the model so that it is fed each attribute separately and compare how it performs. First, we test the model using only the multi-modal attribute; we can observe in Figure 4 that the model performs less well when we use the BiGRU only. In contrast, the model performs better when we use only the CNN with the word embedding attribute. This signifies that extracting semantic information features from user tweets is crucial for depression detection.
Although the model using only the word embedding attribute outperforms the multi-modal attribute, the true positive rates (sensitivity) for both attributes are close to each other, as we see from the precision scores of the BiGRU and CNN branches. Finally, we can see that model performance increases when both the CNN and BiGRU are combined, outperforming each attribute used independently. After depressed users are classified, we examined the most common depression symptoms among them. In Figure 5, we can see that symptom one (feeling depressed) is the most common symptom posted by depressed users, which shows how depressed users expose and post their depressive mood on social media more than any other symptom. Besides that, other symptoms such as energy loss, insomnia, a sense of worthlessness, and suicidal thoughts appeared in more than 20% of the depressed users. To further investigate the five most influential symptoms among depressed users, we collected all the tweets associated with these symptoms. Then we created a tag cloud [50] for each of these five symptoms to determine the frequent words related to each symptom and their importance, as shown in Figure 6, where larger-font words are relatively more important than the rest within the same cloud. These clouds give us an overview of the words that occur most frequently within each of these five symptoms. In this paper, we propose a new model for detecting depressed users through social media analysis by extracting features from user behaviour and from users' online timelines (posts). We have used a real-world dataset of depressed and non-depressed users and applied it in our model. We have proposed a hybrid model characterised by an interplay between the BiGRU and CNN models: we assign the multi-modal attribute, which represents user behaviour, to the BiGRU, and the user timeline posts to the CNN, to extract semantic features. Our results show that training this hybrid network improves classification performance and identifies depressed users, outperforming other strong methods. This work has great potential to be further explored in the future. For instance, we can enhance the multi-modal features by using short-text topic modelling, e.g., proposing a new variant of the Biterm Topic Model (BTM) [58] capable of generating depression-associated topics as a feature extractor for detecting depression. Besides, recently proposed pre-trained language models such as deep contextualized word representations (ELMo) [35] and Bidirectional Encoder Representations from Transformers (BERT) [11] could be trained on a large corpus of depression-related tweets instead of using a pre-trained word embedding model. While such pre-trained language models introduce challenges because of the restrictions they impose on sequence length, studying these models on this task would help unearth their pros and cons. Eventually, our future work aims to detect other mental illnesses in conjunction with depression to capture the complex mental issues that can pervade an individual's life.
REFERENCES

Diagnostic and statistical manual of mental disorders (DSM-5®)
Towards using word embedding vector space for better cohort analysis
Depressed individuals express more distorted thinking on social media
Latent Dirichlet allocation
Methods in predictive techniques for mental health status on social media: A critical review
LIBSVM: A library for support vector machines
Multimodal depression detection on Instagram considering time interval of posts
Empirical evaluation of gated recurrent neural networks on sequence modeling
Predicting depression via social media
Depression detection using emotion artificial intelligence
BERT: Pre-training of deep bidirectional transformers for language understanding
A depression recognition method for college students using deep integrated support vector algorithm
Augmenting semantic representation of depressive language: From forums to microblogs
Retrofitting word vectors to semantic lexicons
Analysis of user-generated content from online social communities to characterise and predict depression degree
Topic modeling based multi-modal depression detection
Take two aspirin and tweet me in the morning: How Twitter, Facebook, and other social media are reshaping health care
Natural language processing methods used for automatic prediction mechanism of related phenomenon
Predicting depression of social media user on different observation windows
Anxious depression prediction in real-time social data
Rehabilitation of count-based models for word vector representations
Text-based detection and understanding of changes in mental health
SenseMood: Depression detection on social media
Supervised deep feature extraction for hyperspectral image classification
Using social media content to identify mental health problems: The case of #Depression in Sina Weibo
Mental illness, mass shootings, and the politics of American firearms
Advances in pretraining distributed word representations
Rethinking communication in the e-health era
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
Borut Sluban, and Igor Mozetič. Sentiment of emojis
Deep learning for depression detection of Twitter users
Depressive moods of users portrayed in Twitter
GloVe: Global vectors for word representation
Deep contextualized word representations
Identifying health-related topics on Twitter
Early risk detection of anorexia on social media
Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter
Beyond modelling: Understanding mental disorders in online social media
Dissemination of health information through social networks: Twitter and antibiotics
Depression detection via harvesting social media: A multimodal dictionary learning solution
Cross-domain depression detection via harvesting social media
Multi-modal social and psycho-linguistic embedding via recurrent neural networks to identify depressed users in online forums
Detecting cognitive distortions through machine learning text analytics
A comparison of supervised classification methods for the prediction of substrate type using multibeam acoustic and legacy grain-size data
Sharing clusters among related groups: Hierarchical Dirichlet processes
Understanding depression from psycholinguistic patterns in social media texts
Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences
Recognizing depression from Twitter activity
Timelines, tag clouds and the case for vernacular visualization
Detecting and characterizing eating-disorder communities on social media
Topical n-grams: Phrase and topic discovery, with an application to information retrieval
Salary prediction using bidirectional-GRU-CNN model
World Health Organization
Estimating the effect of COVID-19 on mental health: Linguistic indicators of depression during a global pandemic
Modeling depression symptoms from social network data through multiple instance learning
Georgios Paraskevopoulos, Alexandros Potamianos, and Shrikanth Narayanan. 2020. Affective conditioning on hierarchical networks applied to depression detection from transcribed clinical interviews
A biterm topic model for short texts
Semi-supervised approach to monitoring clinical depressive symptoms in social media
Survey of depression detection using social networking sites via data mining
Relevance-based word embedding
Combining convolution neural network and bidirectional gated recurrent unit for sentence semantic classification
Feature fusion text classification model combining CNN and BiGRU with multi-attention mechanism
Graph attention model embedded with multi-modal knowledge for depression detection
MedLDA: Maximum margin supervised topic models
Depression and disclosure behavior via social media: A study of university students in China

Appendix A. List of depression symptoms as per DSM-IV: (1) Depressed mood. (2) Diminished interest. (3) Significant weight or appetite change. (4) Insomnia or hypersomnia. (5) Psychomotor agitation or retardation. (6) Fatigue or loss of energy. (7) Feelings of worthlessness or excessive guilt. (8) Diminished ability to think or concentrate. (9) Recurrent thoughts of death or suicidal ideation.