key: cord-0322423-pl6aleul
authors: Yang, Qi; Farseev, Aleksandr; Filchenkov, Andrey
title: Two-Faced Humans on Twitter and Facebook: Harvesting Social Multimedia for Human Personality Profiling
date: 2021-06-20
journal: nan
DOI: 10.1145/3463944.3469270
sha: fc36338ce45c1832b2ee8f5f0c350e1bf3d40f57
doc_id: 322423
cord_uid: pl6aleul

Human personality traits are the key drivers behind our decision-making, influencing our life path on a daily basis. Inference of personality traits, such as the Myers-Briggs Personality Type, as well as an understanding of dependencies between personality traits and users' behavior on various social media platforms, is of crucial importance to modern research and industry applications. The emergence of diverse and cross-purpose social media avenues makes it possible to perform user personality profiling automatically and efficiently based on data represented across multiple data modalities. However, the research efforts on personality profiling from multi-source multi-modal social media data are relatively sparse, and the level of impact of different social network data on machine learning performance has yet to be comprehensively evaluated. Furthermore, there is no such dataset available to the research community for benchmarking. This study is one of the first attempts towards bridging this important research gap. Specifically, in this work, we infer the Myers-Briggs Personality Type indicators by applying a novel multi-view fusion framework, called "PERS", and comparing the performance results not just across data modalities but also with respect to different social network data sources. Our experimental results demonstrate PERS's ability to learn from multi-view data for personality profiling by efficiently leveraging the significantly different data arriving from diverse social multimedia sources. We have also found that the selection of a machine learning approach is of crucial importance when choosing social network data sources and that people tend to reveal multiple facets of their personality in different social media avenues. Our released social multimedia dataset facilitates future research in this direction.

During the past decade, an increasing number of social media platforms have emerged, and such platforms have started playing a vital role in facilitating human interactions worldwide. Since 2012, the average daily social media screen time has increased from 60 minutes to 144 minutes [1]. Furthermore, it has spiked even higher since the start of the COVID-19 outbreak [12], when people were locked at home with the only remaining option of engaging with their friends through social media.

To maintain high user engagement rates, it is essential for social network conglomerates to position and recommend relevant content according to user interests and online behaviours. For example, extroverted people are more likely to use social media in general, as they tend to reveal themselves as enthusiastic and interactive, and therefore form more social circles around themselves [23]. In contrast, introverts were found to spend significantly more time evaluating the value of each online service they use before a deeper user-service interaction may occur [33].

With such large and diverse data available nowadays on social media, it is becoming practically impossible to manually distinguish social media users when attempting to provide them with more personalized online experiences [16]. Therefore, an automated approach to understanding human behaviour patterns on social media is in high demand [20]. Unfortunately, personality profiling still heavily depends on manual procedures like questionnaires and quizzes [37], and therefore its cost remains unacceptably high, limiting its usage in real-time online services, such as social networking websites [18].

However, automatic personality inference is also known to be a hard task [5, 21], which is mainly due to the multi-facet nature of social media data. For example, Twitter is often used for casual daily interactions, while Facebook is nowadays perceived more as a private communication channel. As a result, Facebook's audience demographics vary drastically from young to senior ages, while e.g. TikTok's audience mostly consists of young individuals aged between 18 and 34 years old. Finally, social networks like Pinterest might not just have a significant audience age shift but also tend to be largely populated by female users [48]. Furthermore, one might also explore the drastic difference in behavioural traits that people exhibit across various social media avenues. For example, being one of the most open social media outlets, Twitter is known to concentrate on users' expressions rather than their identity, encapsulating our "real me" from the broader public [21]. At the same time, on the specialized personality-focused forums, such as PersonalityCafe 1 , the communication might be more concentrated on the members' behavioural habits, allowing for gaining a deeper insight into one's behaviour from the content they post. Considering such a multi-facet multi-source cross-demographic environment, the task of automatic personality profiling from social media data appears to be challenging and, not yet being widely tackled by the research community, requires a more in-depth analysis.

Despite the advantages of leveraging multiple data modalities and sources, there are several associated difficulties:

• Data gathering: Data from modern social media platforms are often distributed across various Web resources and shielded behind privacy settings. It is therefore important to implement large-scale cross-source data collection techniques.

• Data representation: As real-world social media data comes with different data modalities (e.g. text, image, video, location, etc.), the incorporation of such heterogeneous multi-modal data involves the creation of accurate and mutually compatible approaches to data representation (feature learning).

• Data modelling: Effective data integration into a single machine learning model is a challenging task, as the data sources and data modalities often represent various aspects of human life and are therefore often very different in nature. Even worse, the high dimensionality of the multi-modal feature spaces might often lead to the so-called "curse of dimensionality" problem when being processed directly, and therefore dimensionality balancing needs to be accomplished.

Inspired by the research gap and challenges above, in this work we raise the following three research questions. First, to establish a benchmark for multi-view personality profiling, it is important to understand: (RQ1): Is it possible to reliably and accurately infer user personality traits at a large scale in an automatic fashion? Second, to gain an understanding of the real-world applicability of our approach to modern social media scenarios, it is crucial to discover if: (RQ2) Is it possible to improve personality profiling performance by leveraging multi-view social multimedia data? Third, to establish a clear path of future research on multi-source learning, it is crucial to know: (RQ3) What is the impact of social media data origin on personality user profiling performance?

To answer our proposed research questions, in this study we introduce a novel multi-view personality profiling meta ensemble framework, called "PERS", which is able to effectively profile social media user personality by leveraging multimodal multimedia data coming from multi-facet social networks. Furthermore, we introduce efficient data gathering and representation techniques, allowing for seamless processing of the data from Facebook, Twitter, and PersonalityCafe social media forums. Finally, we release the PERS dataset 2 to the research community, allowing for future extensive cross-disciplinary research.

The major contributions of this work are threefold. First and foremost, we have proposed a novel machine learning framework for multi-view user profiling and demonstrated that efficient personality profiling is possible and able to achieve industry-level performance for several personality attributes. Second, we have demonstrated that different social networks are different in nature, which impacts the personality profiling performance and therefore needs to be considered during the data modelling process. Third, we have released a new multi-source cross-social personality profiling dataset to be used by the research community in future studies along the direction.

In the past two decades, several studies have attempted to model human personality traits from a statistical perspective. First, being inferred from statistical analysis of the English lexicon, the Big Five model was proposed by Digman [10], where the author reveals the close relationship between human personality and written language. Inspired by this idea, Pennebaker et al. [40] later laid the foundation of statistical personality profiling by introducing the LIWC word categorization scheme, which has numerically bridged personality traits and written language utilization patterns.

Furthermore, several studies have been devoted to automatic personality profiling, where cross-disciplinary research groups utilized machine learning techniques for automatic human personality inference based on test-generated data [3, 34]. It is worth noting that these studies were all based on relatively small datasets and are therefore limited in supporting large-scale observations, staying far from being applied in a real-world scenario. Moving forward, the "Small Data" problem was partially mitigated by the introduction of the "MyPersonality" project [31], the first large-scale personality-labelled dataset that includes user-generated data from Facebook, which immediately attracted the attention of the multimedia community and entailed the first larger-scale studies along the social media personality profiling research direction [24, 32, 46].

These studies have made a giant leap in the field; however, one could also notice that most of them still lack a very important factor, which limits their real-world applicability: they are largely focused on a single data source (e.g. Facebook) or a single data modality (i.e. text), which keeps them from being applied to modern multi-source multi-view social multimedia data. Namely, the Linguistic Inquiry and Word Count (LIWC) works [27, 44] are mostly focused on text-only data processing to predict personality by using personality-labelled word categories. At the same time, Arnoux et al. [4] and Tandera et al. [47] instead utilized pre-trained Global Vectors for Word Representation (GloVe) embeddings of textual data, being the first to report the results of machine-learning-driven single-modal personality inference.

Finally, several studies have approached user profiling from a multi-modal data perspective. For example, Farseev et al. [17] have proposed a multi-modal ensemble model tackling the task of demographic profiling from multi-modal data. Furthermore, the authors have extended the framework to leverage sensor data and multi-source multi-task learning for wellness profiling [13]. Buraya et al. [6] proposed to solve the problem of relationship status inference by applying "out of the box" machine learning to early-fused data from Twitter, Instagram, Facebook, and Foursquare, which achieved a significant 17% inference performance uplift as compared to single-modal learning. Going further, [49] proposed a factorization method to model the intra-modal and inter-modal relationships within multi-modal data inputs, which proved the crucial role of multi-modal data incorporation in improving the user profiling performance; while Buraya et al. [5] instead leveraged the temporal component of the multi-modal data, being the first to apply deep learning for multi-view personality profiling. While being a significant contribution to the field of multi-view profiling, the above works still lack multi-source cross-social network data processing [11], which limits their applicability in the majority of real-world scenarios.

As can be seen, there is significant evidence that the incorporation of multi-modal data for automatic user profiling is useful for achieving better prediction performance. However, when it comes to evaluating the role of social network choice for user profile learning, the existing research results remain relatively sparse. At the same time, it is reasonable to assume that, often serving different needs of an individual, various social media sources might provide data that is very diverse in nature, and therefore a more comprehensive study on the roles of different data sources for personality user profiling is necessary.

To represent human personality, in this work we use the Myers-Briggs Type Indicator (MBTI) [38] , which has been widely adopted by the research community [5, 6] and splits one's personality into 16 types, each formed by the following four binary dimensions:

• Extroversion and Introversion (EI): this dimension determines how an individual focuses her energies and interest, whether she is influenced externally by the opinion and interpretation of others (Extroverts) or motivated by her inner thoughts (Introverts).

• Sensing and Intuition (SN).

• Thinking and Feeling (TF).

• Judging and Perceiving (JP): this dimension reflects an individual's approach towards work, decision-making, and planning. Judging individuals are highly organized in their thoughts, while Perceivers behave more spontaneously.

The data was collected from the Twitter, Facebook, and PersonalityCafe 3 social networks during the time interval from 1st Jan 2018 to 1st Jan 2021, via the following steps:

1) Ground truth collection. To obtain personality ground truth from Twitter, we downloaded all of the tweets that contain self-reported personality-related keywords/phrases, such as "I'm an ENTP" or "I am an ENTP", and extracted the personality trait from those phrases to be the ground truth for each user (see Figure 1 (a) for an example). To harvest the Facebook ground truth, we monitored Facebook comments under personality test results released on the 16personalities portal (see Figure 1 (b) for an example). Likewise, to obtain the personality-related ground truth from the PersonalityCafe forum, we downloaded users' publicly available self-reported personality traits from their profile pages (see Figure 1 (c) for an example).

2) User-generated content (UGC) collection. To collect UGC from Twitter and Facebook, we downloaded user timelines through the Twitter REST API 4 and the Facebook Graph API 5 , respectively. At the same time, to collect UGC from the PersonalityCafe forum, we downloaded posts from the MBTI forum thread.

3) Data pre-processing. Since social network data might exhibit significant noise levels and often contains grammatical errors, it is necessary to perform data pre-processing prior to the data modelling stage. At the same time, it is necessary to remove the direct personality mentions from the text content so that the model is not able to use personality abbreviations from the post content at the inference stage. To mitigate the above two problems, we have pre-processed our dataset via the following three steps: 1) Data filtering: to ensure a sufficient amount of data per user for training and inference, we have filtered out the users with fewer than 10 tweets available; 2) In-line label replacement: for all personality traits, the personality type name was replaced with the "<type>" placeholder (e.g. "ENTJ" is replaced with "<type>"); 3) Social indicator replacement: similarly to Nguyen et al. [39], we have further converted emojis into the corresponding descriptive textual strings, removed all non-ASCII words, and normalized the text by replacing user mentions, URLs, hashtags, and date-time expressions by the corresponding placeholders as follows: @USER, HTTPURL, HASHTAG, DATETIME.

Table 1 highlights the statistics of our dataset across binary personality labels, while Figure 2 visually reflects the personality label distributions. Table 3 shows the consistency of the data distributions across the general social networks, which reduces the risk of falling into source-dependent bias during the data modelling stage. At the same time, it is important to note that in the PersonalityCafe forum data, the ENFP, INFP, INFJ, and ENFJ labels dominate the rest. The latter observation shows that personality-related data sources might have a distribution shift towards individuals of certain personality types (different from a general distribution) that tend to participate in such specific personality-related discussions. Therefore, the evaluation based on the PersonalityCafe dataset must be accomplished independently.
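The filtering and placeholder-replacement steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the regular expressions, the function names `clean_post` and `filter_users`, and the case-insensitive matching of type names are assumptions made for illustration, and the emoji and date-time conversions are omitted.

```python
import re

# Hypothetical patterns; the paper does not specify the exact rules it used.
MBTI_TYPE = re.compile(r"\b[EI][NS][TF][JP]\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def clean_post(text):
    """Replace social indicators and personality type names with placeholders."""
    text = URL.sub("HTTPURL", text)
    text = MENTION.sub("@USER", text)
    text = HASHTAG.sub("HASHTAG", text)
    text = MBTI_TYPE.sub("<type>", text)
    # Drop every whitespace-separated word that contains non-ASCII characters.
    return " ".join(word for word in text.split() if word.isascii())


def filter_users(posts_by_user, min_posts=10):
    """Keep users with at least `min_posts` posts and clean their content."""
    return {
        user: [clean_post(post) for post in posts]
        for user, posts in posts_by_user.items()
        if len(posts) >= min_posts
    }
```

Replacing the type name before training matters because the ground truth was harvested from the very posts that mention it; leaving it in would let a model read the label directly from the text.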

The MBTI personality categorization scheme defines each of the 4 binary MBTI categories to represent a different aspect of human personality. However, when combined into 16 personality types, the scheme is known to suffer from a major shortcoming: the overlap between "neighbour categories" (e.g. INTJ and INTP). Given the noisy nature of social media content, it might be a good idea to predict individual binary MBTI personality traits instead of modelling the overlapping 16-category scenario. Therefore, in this work, we have adopted such a binary personality categorization scheme.

To facilitate an effective data modelling process, the data needs to be properly represented in the form of feature vectors. Following the best practices described in the literature on user profiling [15, 30, 43], we have chosen the following data representation approaches:

1) Textual Features: First, to represent the textual data at a user level, all posts of each user were concatenated into a corresponding user-specific "document". Second, the term frequency-inverse document frequency (TF-IDF) has been extracted to form the document-term matrix. Finally, we have applied Latent Semantic Analysis (LSA [25]), as such a transformation has previously shown a sizable performance uplift [8] when applied for user profiling. The final dimension of the compressed textual feature vector was set to 100, a value found empirically during a grid search.
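The textual representation above (concatenation, TF-IDF, LSA) can be sketched as follows. This is an illustrative stand-in rather than the authors' code: whitespace tokenization, the smoothed IDF formula, and the function names `tfidf_matrix` and `lsa` are assumptions, and LSA is realized here as a truncated SVD computed with NumPy.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(user_posts):
    """Build a user-by-term TF-IDF matrix from {user: [post, ...]}."""
    # Concatenate all posts of a user into one user-level "document".
    docs = [" ".join(posts).lower().split() for posts in user_posts.values()]
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    # Document frequency: number of user documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    matrix = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            # Smoothed IDF (an assumed variant; the paper does not state one).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            matrix[i, index[term]] = tf * idf
    return matrix, vocab


def lsa(matrix, n_components):
    """Project rows onto the leading singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_components, len(s))
    return u[:, :k] * s[:k]
```

With the setting reported in the text, `lsa(matrix, 100)` would yield one 100-dimensional vector per user, provided the matrix has at least 100 users and terms.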

2) Visual Features: To represent visual data, we have automatically mapped each photo into the distribution of 1000 ImageNet [9] image concepts via the pre-trained ResNet-101 model [26]. We then summed up the predicted concept occurrence likelihoods for each user and element-wise normalized the obtained vector by the total number of images available from the user. In such a way, for each user, we have obtained a 1000-sized image concept distribution vector. Similarly to the text modality, Principal Component Analysis (PCA [28]) has been further applied to reduce the dimensionality of the visual feature space to 200.
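The aggregation of per-image concept likelihoods into a user-level vector, followed by PCA, can be sketched as below. The sketch assumes the per-image probabilities have already been produced by a classifier; the function names are hypothetical, the toy vectors have 3 concepts rather than 1000, and PCA is computed via SVD of the centered matrix.

```python
import numpy as np


def user_concept_vector(image_probabilities):
    """Sum per-image concept likelihoods and divide by the number of images."""
    probs = np.asarray(image_probabilities, dtype=float)  # (n_images, n_concepts)
    return probs.sum(axis=0) / probs.shape[0]


def pca(features, n_components):
    """Project user vectors onto their leading principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return centered @ vt[:k].T


# Toy usage: one user with two images over three concepts.
vector = user_concept_vector([[0.2, 0.8, 0.0], [0.6, 0.2, 0.2]])
```

Dividing by the image count keeps users with many photos comparable to users with few, since each user vector remains a probability distribution over concepts.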

Given a user $i$, we denote the multi-view data associated with the user as a set $X$:

$$X = \{(x_i^{T}, x_i^{I}, y_i^{l}),\ i = 1, 2, \dots, N\},$$

where $N$ represents the number of samples in the dataset, $x_i^{T}$ is the collection of the $i$-th user's text content represented as Textual Features, $x_i^{I}$ is the collection of the $i$-th user's image content represented as Image Features, and $y_i^{l}$ represents the $l$-th of the four binary personality trait ground truth labels.

In such a way, we can now formulate the personality profiling as a user-level multi-view document classification task, where users play the role of "documents" when being rephrased in standard terms.

Below, we now can define the PERS framework as a two-step stacked generalized ensemble approach. The architecture of our PERS framework is illustrated in Figure 3 .

First-Step: Given the training set $D = \{(x_i, y_i),\ i = 1, 2, \dots, N\}$, where $y_i$ is the personality label and $x_i$ represents the feature vector, we shuffle $D$ and uniformly split it into $K$ equal sub-sets $D_1, \dots, D_K$. In such a way, $D_k$ and $D_{(-k)} = D - D_k$ are the definitions of the test and training sets, respectively, for the $k$-th fold in K-Fold cross-validation. Now, we can specify a $J$-sized list of "base" classifiers and train the $j$-th base classifier $h_j$ based on the training set $D_{(-k)}$. We denote $z_{ij}$ as the prediction of the $j$-th classifier on $x_i$ for each sample in the test set $D_k$ for the $k$-th cross-validation fold. In such a way, immediately after the cross-validation process, we can define a new dataset based on the $J$ "base" classifier outputs:

$$Z = \{(z_{i1}, z_{i2}, \dots, z_{iJ}, y_i),\ i = 1, 2, \dots, N\}.$$

For each data modality, we compute $Z^{T}$ and $Z^{I}$ separately and then form the input $Z$ for the next stage of data processing via column-wise concatenation of $Z^{T}$ and $Z^{I}$. The corresponding dimension of $Z$ is therefore $N \times 2J$.

Second-Step: Same as in the first step, we perform K-fold cross-validation of each base classifier on $Z$ coming from the First-Step to obtain the output $Z'$, formally defined as follows:

$$Z' = \{(z'_{i1}, z'_{i2}, \dots, z'_{iJ}, y_i),\ i = 1, 2, \dots, N\},$$

where $z'_{ij}$ represents the $i$-th prediction inferred by the $j$-th base classifier. The corresponding dimension of $Z'$ is therefore $N \times J$. Therefore, to get the final model prediction, we trained a meta classifier $h_{meta}$ on $Z'$, which can be formulated as:

$$\hat{y}_i = h_{meta}(z'_{i1}, z'_{i2}, \dots, z'_{iJ}).$$

Specifically, in this study we have chosen a Support Vector Machine with a linear kernel (LinearSVM) as our meta classifier.
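To make the data flow of the two-step procedure concrete, the following sketch computes out-of-fold predictions per modality, concatenates them column-wise, and cross-validates the base learners once more on the result. It only mirrors the structure described above under several assumptions: the two base learners (nearest centroid and k-nearest neighbours) are lightweight stand-ins for XGBoost, LightGBM, and Random Forest; the least-squares meta classifier is a stand-in for LinearSVM; and all function names are hypothetical.

```python
import numpy as np


def centroid_scores(x_train, y_train, x_test):
    """Stand-in base learner: margin between distances to the class centroids."""
    c0 = x_train[y_train == 0].mean(axis=0)
    c1 = x_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(x_test - c0, axis=1)
    d1 = np.linalg.norm(x_test - c1, axis=1)
    return d0 - d1  # positive means closer to class 1


def knn_scores(x_train, y_train, x_test, k=3):
    """Stand-in base learner: share of positive labels among k neighbours."""
    dists = np.linalg.norm(x_test[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def out_of_fold(learners, x, y, folds):
    """Return an N x J matrix holding predictions made on held-out folds only."""
    z = np.zeros((len(y), len(learners)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        for j, learner in enumerate(learners):
            z[test_idx, j] = learner(x[train_idx], y[train_idx], x[test_idx])
    return z


def two_step_features(x_text, x_image, y, learners, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    # First step: per-modality predictions, concatenated column-wise (N x 2J).
    z = np.hstack([out_of_fold(learners, x_text, y, folds),
                   out_of_fold(learners, x_image, y, folds)])
    # Second step: base learners are cross-validated again on Z (N x J).
    return out_of_fold(learners, z, y, folds)


def fit_linear_meta(z, y):
    """Stand-in meta classifier: least-squares linear separator (not an SVM)."""
    design = np.hstack([z, np.ones((len(z), 1))])
    w, *_ = np.linalg.lstsq(design, 2.0 * y - 1.0, rcond=None)
    return lambda z_new: (np.hstack([z_new, np.ones((len(z_new), 1))]) @ w > 0).astype(int)
```

Because every entry of the intermediate matrices is predicted by a model that never saw the corresponding sample during training, the meta classifier is fitted on predictions that resemble those it will receive at inference time.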

To maximize the performance of the PERS framework, it is of crucial importance to choose a set of suitable machine learning algorithms as base classifiers. Previous studies [2, 17, 41] suggest XGBoost [7], LightGBM [29], and Random Forest to be the top-choice base models for user profiling, as their performance on social media data has been reported to beat state-of-the-art baselines, often including those coming from the Deep Learning community.

Random forest is an ensemble learning algorithm that integrates multiple decision trees to produce a prediction. For a classification problem, the prediction result is the vote over all decision tree predictions. During training, bootstrap sampling is used to form the training set of each decision tree. When training each node of each decision tree, only a subset of the features drawn from the entire feature vector is used. By integrating multiple decision trees, each trained on resampled data and a subset of feature components, the variance of the model can be effectively reduced.
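The three ingredients named above (bootstrap sampling, feature subsampling, and voting) can be shown in a deliberately small sketch. To keep it short, each tree is a depth-one decision stump, so the per-node feature subset coincides with a per-tree subset; this simplification and the function names are assumptions, not the configuration used in the study.

```python
import numpy as np


def fit_stump(x, y, feature_ids):
    """Pick the (feature, threshold, flip) rule with the fewest training errors."""
    best = None
    for f in feature_ids:
        for thr in np.unique(x[:, f]):
            pred = (x[:, f] > thr).astype(int)
            for flip in (0, 1):
                err = np.mean((pred ^ flip) != y)
                if best is None or err < best[0]:
                    best = (err, f, thr, flip)
    return best[1:]


def fit_forest(x, y, n_trees=25, n_features=2, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        feats = rng.choice(x.shape[1], size=n_features, replace=False)  # feature subset
        forest.append(fit_stump(x[rows], y[rows], feats))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual stump predictions."""
    votes = np.array([(x[:, f] > thr).astype(int) ^ flip for f, thr, flip in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

Averaging many trees that each saw a different resample and feature subset is what reduces variance: individual stumps may err on different samples, while the vote is more stable.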

XGBoost is an effective and scalable gradient boosting machine that has been widely adopted in industry in recent years. It is an ensemble model containing a set of classification and regression trees (CART). Given training data $x_i$ and target $y_i$, the XGBoost model can be defined as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where $K$ is the total number of trees, $f_k$ for the $k$-th tree is a function in the functional space $F$, and $F$ is the set of all possible CARTs.

LightGBM is a variant of Gradient Boosting Machines that mitigates the "optimal division point search" problem, arising from the increasing computational complexity on larger datasets. The problem is solved via the following two tricks, which reduce the training data size and data dimensionality:

• Gradient-based One-Side Sampling (GOSS): excludes most of the samples with small gradients and only uses the remaining samples to calculate the information gain.

• Exclusive Feature Bundling (EFB): bundles mutually exclusive features, as they rarely take non-zero values at the same time.
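The GOSS idea can be illustrated with a short sampling routine. It follows the commonly described scheme (keep the fraction `top_rate` of samples with the largest absolute gradients, randomly draw a fraction `other_rate` of all samples from the rest, and up-weight the drawn ones by (1 - top_rate) / other_rate); the function name and default rates are assumptions chosen for illustration and are not taken from the study.

```python
import numpy as np


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the samples retained by GOSS."""
    rng = np.random.default_rng(seed)
    gradients = np.asarray(gradients, dtype=float)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # largest |gradient| first
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top_idx = order[:n_top]
    # Randomly keep a small part of the low-gradient samples.
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # Re-weight the sampled low-gradient rows to keep the gain estimate unbiased.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights
```

The re-weighting compensates for the discarded low-gradient samples, so the information gain computed on the reduced set approximates the one computed on the full data.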

To answer our research questions, we have evaluated the performance of PERS framework being trained across different data sources and data modality combinations. For all experiments, the dataset was uniformly split into training set and test set with the ratio of 85:15, maintaining the original personality label distributions. To understand the impact of different modalities, data sources, and fusion strategies on the final performance of the model, we have selected the following community-adopted personality profiling baselines [6, 17, 42] :

• Independently-trained base classifiers (see descriptions in Section 4.3) with respect to each data modality; 
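The 85:15 split that maintains the original label distributions, as described for the experimental setup, corresponds to a stratified split. A minimal sketch is given below; the function name, the rounding rule, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.15, seed=0):
    """Split sample indices so that each label keeps its share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy usage: 60 Introverts and 40 Extroverts.
labels = ["I"] * 60 + ["E"] * 40
train_idx, test_idx = stratified_split(labels)
```

Splitting each label group separately is what keeps a minority class, such as Sensing users, represented in the test set at the same rate as in the full data.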

Due to the imbalanced distribution of personality labels in our datasets (see Section 3.1 for details on the data distributions), for performance evaluation we have adopted the "$F_1$" metric [17], which is the harmonic mean between precision and recall, where the average is calculated per label across all labels. The $F_1$ metric is formally defined as:

$$F_1 = \frac{1}{|L|} \sum_{l \in L} \frac{2 \cdot p_l \cdot r_l}{p_l + r_l},$$

where $p_l$ and $r_l$ are the precision and recall for the label $l \in L$, and $L$ is the set of labels. We have further adopted the Matthews correlation coefficient metric [36] (Mcor), as it incorporates both true and false positives and negatives and is generally regarded as a "balancing" measure that can be used even if the classes are of very different sizes. The Mcor metric is formally defined as:

$$\mathrm{Mcor} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. We prioritize the $F_1$ score as our main evaluation metric, while the Mcor score plays an auxiliary role for making decisions regarding performance when the differences in $F_1$ values are marginal.
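Both metrics can be computed directly from confusion counts. The sketch below is a plain-Python illustration with hypothetical function names; returning 0 when a denominator is zero is an assumed convention.

```python
import math


def confusion(y_true, y_pred, positive):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def macro_f1(y_true, y_pred):
    """Average the per-label F1 scores over all labels."""
    scores = []
    for label in sorted(set(y_true)):
        tp, fp, fn, _ = confusion(y_true, y_pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def mcor(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for a binary labelling."""
    tp, fp, fn, tn = confusion(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, with true labels `[1, 1, 1, 0, 0, 0]` and predictions `[1, 1, 0, 0, 0, 1]`, each label has precision and recall of 2/3, so the averaged F1 is 2/3, while Mcor is (2·2 − 1·1)/9 = 1/3.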

To judge the applicability of the PERS framework in a real-world scenario, we have evaluated the limits of PERS's performance across the Twitter, Facebook, and PersonalityCafe datasets. The evaluation results are presented in Table 4. From the table, it can be seen that, being trained on the multi-view data from Twitter, the PERS framework is able to achieve an industry-level performance of 0.82 $F_1$ score when predicting the Extroversion-Introversion (EI) personality trait. While the performance obtained for the other three personality categories is significantly lower (by between 0.18 $F_1$ score for Judging-Perceiving (JP) and 0.28 $F_1$ score for Sensing-Intuition (SN)), we believe that such highly promising performance for the EI label testifies to the tremendous potential of multi-view social media data for psychographic discovery and personality profiling. The superiority for the EI label, at the same time, can be explained by the natural difference between these two human personality categories when it comes to user communication on social platforms: Extroverts are known to be much more open to others, while Introverts are the opposite, being more selective and making decisions at a slightly more conservative pace. Such inspiring results allow us to give a positive answer to our RQ1 and possibly give birth to a wide range of new research directions related to Personality Profiling and Multi-View learning.

But what about the other two datasets? An interesting finding comes from the results presented in Table 5, where PERS demonstrates a breakthrough performance on the PersonalityCafe dataset, showing the best overall $F_1$ scores when predicting all 4 binary MBTI categories. Such a result can be explained by the specific nature of the PersonalityCafe dataset, where users purposely reveal their behavioural differences and are therefore often biased towards particular social behaviour concepts. Such results also reaffirm our positive answer to RQ1 and allow us to conclude that the nature of a data source and the social network use patterns are indeed of crucial importance when solving the multi-view cross-media personality profiling problem.

First, we have investigated the contribution of different data modalities towards personality profiling performance and their integration ability. An interesting observation comes from the cross-modal experimental results presented in Table 4: the PERS framework has performed 2% better than other single-source baselines for all but the SN binary label when trained on the Twitter and Facebook datasets. Another interesting observation can be made from the modality combination results, where, being trained on both textual and image data, PERS is able to outperform by more than 1% not just other single-modal classifiers but also the early-fused baselines.

The above findings suggest that the introduction of multi-modality into user profiling could serve as a powerful booster of model performance. Such an observation could be explained by the richness of visual data when reflecting user preferences, which serves as a greatly beneficial supplement to the textual data modality at the data modelling stage. The latter finding positively answers our RQ2 by emphasizing the important role of multi-modal data learning for personality profiling applications.

Table 4: Evaluation of the "PERS" framework trained on the independent modalities and the modality combinations in Twitter and Facebook. Text in green indicates the best performance, while red indicates the worst.

Finally, let us also highlight an interesting observation that comes from the single-modal evaluation results (see Table 4). It is important to note that, in the cases of learning from a single-modal source, "PERS" trained on the text modality performs best across all personality labels, with a superiority margin ranging from 0.02 to 0.28 $F_1$ score. The latter can be explained by the quantitative domination of textual data over the visual modality (see Table 1). Another potential reason behind such a trend could be the high level of noise in the user-generated visual data, where the images are less strict in terms of perspective and object positioning as compared to professional photos. Moreover, such visual content often includes objects that might not directly reflect the semantics of the data and therefore might not be accurate in representing an author's personality. Such a hypothesis also aligns well with our chosen visual data representation approach, where the ImageNet concept distribution might be simply too general for personality profiling tasks, as opposed to, for example, demographic profiling [17].

At last, let's examine the impact of the social media data origin on personality user profiling performance, so that an industry guideline can be established for future research.

As the textual modality is present in all three data sources, let's describe the PERS performance on textual data first. From Table 4 and Table 5, it can be noticed that the PERS framework trained on the Twitter dataset outperformed its counterparts trained on the Facebook and PersonalityCafe datasets by more than 0.19 $F_1$ score in predicting the EI label. In contrast, when it comes to the SN label, Twitter-trained PERS was not able to outperform Facebook and PersonalityCafe data, staying behind by 0.2 and 0.11 $F_1$ score, respectively. Finally, the PERS performance for the TF and JP labels based on Twitter textual data was found to be better than that on Facebook by 0.03 and 0.05 $F_1$ score, respectively, but considerably worse than that on PersonalityCafe by 0.19 and 0.11 $F_1$ score, respectively.

The superiority of Twitter in predicting the EI label could be explained by the differences in the "energy" source for Extroverted and Introverted personality types. Precisely, according to Martin [35], Extroverts prefer to source their life energy from active involvement in events and engaging in different activities, while Introverts often prefer doing things alone, obtaining their energy from dealing with the ideas, pictures, memories, and reactions that are inside their mind. Similarly, in the digital world, it can be seen that on Twitter both personality types are able to express themselves, fulfilling both their enjoyment (ENJ) and observation/learning (LEN) needs, while on Facebook the ENJ factor is fulfilled for a proportionally smaller number of individuals, affecting the overall user base distribution [45]. Correspondingly, Twitter and PersonalityCafe data are diverse enough to differentiate the EI personalities, achieving higher prediction scores as compared to Facebook-based prediction. The observation is also supported by our data distribution (see Section 3.1), where the Twitter and PersonalityCafe datasets are clearly skewed towards Introverts, providing more data for PERS to learn how this personality type directs its energy and makes decisions. The latter aspect is important, as it is known that Extroverts might generate substantially more UGC as compared to Introverts [45], and therefore the sufficiency of Introvert-crafted content is crucial for mutually consistent and comprehensive learning from the data.

At the same time, an inverse picture can be noticed for the SN label results, and there is a "low-hanging fruit" explanation of the phenomenon: for both Twitter and Facebook, the SN personality is distributed with a clear shift towards the Intuitive personality type, while being short on Sensing individuals in the data. Despite reflecting the real-life distribution, this data property also entails a possible technical issue of insufficient variance, limiting the model's ability to differentiate when it comes to learning the Sensing and Intuitive user personas. Considering that on Facebook and PersonalityCafe more Sensing personality types were identified, it is reasonable to assume that this is also the reason why PERS performed better on these latter two sources as compared to the former one. To this end, a more "sensing" Facebook can also be explained by the fact that Facebook is nowadays mainly treated as a communication tool, so that people land there to fulfill their daily communication needs, while Twitter more often serves as a source of inspiration, attracting even more Intuitive individuals into its nets [45].

Now it is time to compare the visual modality performance. The first thing that might capture a reader's attention is that the image data modality performed similarly on the TF and JP labels for both Twitter and Facebook sources; at the same time, Facebook performed better on the EI and SN labels, with an uplift of 0.06 and 0.03 F1 score, respectively. As described earlier, both personality categories are very different in the way they direct their energy and perceive the external world [35], and therefore the data diversity introduced by incorporating the visual modality is of crucial importance for the PERS performance. As Twitter is a "less visual" data source as compared to Facebook, and its data distributions are also less balanced (as discussed above) for both personality labels, it is reasonable to assume that these two factors might entail the superiority of Facebook over Twitter on visual data in our particular case.
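The effect of a skewed label distribution discussed above can be illustrated with a minimal sketch. It is not part of PERS: the helper names `binary_f1` and `macro_f1`, the macro-averaging choice, and the toy 8-to-2 Intuitive/Sensing sample are all assumptions made purely for illustration. The sketch shows that a predictor that always outputs the majority class looks acceptable in terms of accuracy, yet scores zero F1 on the minority class.

```python
def binary_f1(y_true, y_pred, positive):
    """F1 score of one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    return sum(binary_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical skewed sample: 8 Intuitive (N) users and 2 Sensing (S) users.
y_true = ["N"] * 8 + ["S"] * 2
# A degenerate model that always predicts the majority class.
y_pred = ["N"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_sensing = binary_f1(y_true, y_pred, "S")
f1_intuitive = binary_f1(y_true, y_pred, "N")
macro = macro_f1(y_true, y_pred, ("S", "N"))

print(f"accuracy={accuracy:.2f}, F1(S)={f1_sensing:.2f}, "
      f"F1(N)={f1_intuitive:.2f}, macro-F1={macro:.2f}")
```

On this toy sample the accuracy is 0.8 while the Sensing-class F1 is 0, which is why a class-sensitive metric is more informative than accuracy when one personality type is under-represented.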

Finally, it is worth noting that PERS trained only on the textual data from the PersonalityCafe forum outperforms the results obtained from both Twitter and Facebook data by at least 0.1 F1 score. Such a finding can be explained by the precise focus of the PersonalityCafe forum on the personality topic, which provides additional meaningful data descriptors that PERS can utilize to improve its personality inference score.

Backed by all the observations above, we can now answer RQ3 by highlighting the drastic differences between social media data sources when they are used for automated personality profiling, which are dictated by the way different personalities engage in social network activities.

Although PERS outperforms the baselines on all binary personality inference tasks, the predicted labels, when combined, might often mismatch the user's actual full MBTI personality type; therefore, only binary personality predictions (such as the EI prediction on the Twitter dataset) can be leveraged in real-world settings.
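To make this limitation concrete, the following minimal sketch (an illustration only, not the PERS implementation) combines four binary predictions into a four-letter MBTI type and compares per-label accuracy with exact-match accuracy of the full type. The function names, the 0/1 encoding of each dimension, and the toy labels are hypothetical.

```python
from typing import Dict, List, Sequence

# Letter pairs for the four MBTI dimensions; a binary label indexes the pair.
# The 0/1 encoding (e.g. 0 = E, 1 = I) is an assumption for this sketch.
DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def to_mbti(bits: Sequence[int]) -> str:
    """Map four binary labels (EI, SN, TF, JP order) to a four-letter type."""
    if len(bits) != 4:
        raise ValueError("expected exactly four binary labels")
    return "".join(pair[b] for pair, b in zip(DIMENSIONS, bits))


def per_label_accuracy(y_true: List[Sequence[int]],
                       y_pred: List[Sequence[int]]) -> Dict[str, float]:
    """Accuracy of each binary dimension taken separately."""
    n = len(y_true)
    return {
        "".join(pair): sum(t[i] == p[i] for t, p in zip(y_true, y_pred)) / n
        for i, pair in enumerate(DIMENSIONS)
    }


def exact_match_accuracy(y_true: List[Sequence[int]],
                         y_pred: List[Sequence[int]]) -> float:
    """Share of users whose full four-letter type is predicted correctly."""
    hits = sum(to_mbti(t) == to_mbti(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy data (hypothetical): each user has exactly one wrongly predicted label.
y_true = [[0, 1, 0, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
y_pred = [[0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 1]]

label_acc = per_label_accuracy(y_true, y_pred)
full_acc = exact_match_accuracy(y_true, y_pred)
print(label_acc, full_acc)
```

In this toy case every binary label is 75% accurate, yet no user receives the correct full type, since a single wrong dimension is enough to change the four-letter result.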

Therefore, it is evident that new data source-specific multi-view learning approaches need to be implemented [15, 19], where personality profiling models leverage more multi-view data representations (such as avatars [22], sensor data [14], etc.) and mitigate the specific issues arising from the differences in communication styles across different social avenues. The development of such models and their application to content generation or recommendation services will be the focus of our future research.

In this work, we presented a study on automated human personality profiling across multiple data modalities and social networking sites, namely Facebook, Twitter, and PersonalityCafe. Our proposed personality profiling framework, called "PERS", has demonstrated superior performance over single-source and multi-source baselines. Via our cross-social evaluation, we have also shown that different social networking platforms exhibit distinct user communication and usage patterns, which in turn affect user profiling model performance and need to be treated with care for datasets with skewed distributions. Finally, to facilitate future research in this exciting direction, we have released our new large-scale cross-social multi-view personality profiling dataset and supplemented it with the corresponding statistics and analytics for community use.

Daily time spent on social networking by internet users worldwide from 2012 to 2020

Machine Learning Approach to Personality Type Prediction Based on the Myers-Briggs Type Indicator®

Lexical Predictors Of Personality Type

25 Tweets to Know You: A New Model to Predict Personality with Social Media

Multi-view personality profiling based on longitudinal data

Towards user personality profiling from multiple social networks

XGBoost: A Scalable Tree Boosting System

Gender Identification in Twitter using N-grams and LSA: Notebook for PAN at CLEF

ImageNet: A Large-Scale Hierarchical Image Database

Personality Structure: Emergence of the Five-Factor Model

360° user profile learning from multiple social networks for wellness and urban mobility applications

Understanding economic and health factors impacting the spread of COVID-19 disease

Tweet can be Fit: Integrating Data from Wearable Sensors and Multiple Social Networks for Wellness Profile Learning

Tweet can be fit: Integrating data from wearable sensors and multiple social networks for wellness profile learning

TweetFit: Fusing Multiple Social Media and Sensor Data for Wellness Profile Learning

SoMin.ai: Social multimedia influencer discovery marketplace

Harvesting multiple sources for user profile learning: a big data study

bBridge: A Big Data Platform for Social Multimedia Analytics

Cross-domain recommendation via clustering on multi-layer graphs

SoMin.ai: Personality-Driven Content Generation Platform

Twitter, Facebook, or Instagram? Which Platform(s) You Should Be On

Improving user profile with personality traits predicted from social media content

Personality Traits and Social Media Use in 20 Countries: How Personality Relates to Frequency of Social Media Use, Social Media News Use, and Social Media Use for Social Interaction

Reddit: A Gold Mine for Personality Prediction

Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

Deep Residual Learning for Image Recognition

Text messaging, personality, and the social context

Principal Component Analysis

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Personality Classification from Online Text using Machine Learning Approach

Facebook as a Research Tool for the Social Sciences

Personality Traits Classification on Twitter

The influence of extro/introversion on the intention to pay for social networking sites

Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text

Looking at Type: The Fundamentals. Gainesville: Center for Application of Psychological Type

Comparison of the predicted and observed secondary structure of T4 phage lysozyme

Review of Research on the Myers-Briggs Type Indicator

MBTI Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator

BERTweet: A pretrained language model for English Tweets

Linguistic styles: language use as an individual difference

I Know Where You Are Coming From: On the Impact of Social Media Sources on AI Model Performance (Student Abstract)

Overview of the 3rd Author Profiling Task at PAN

Overview of the 6th Author Profiling Task at PAN

Predicting Dark Triad Personality Traits from Twitter Usage and a Linguistic Analysis of Tweets

Why do social network site users share information on Facebook and Twitter

Personality Predictions Based on User Behavior on the Facebook Social Media Platform

Discovery and innovation of computer science technology in artificial intelligence era

Distribution of Pinterest users worldwide as of

Learning Factorized Multimodal Representations

This work is financially supported by the National Center for Cognitive Research of ITMO University. This work was supported by the Ministry of Science and Higher Education of the Russian Federation, research project no. 075-03-2020-139/2 (goszadanie no. 2019-1339). This research is also supported by the Enterprise Singapore "Startup SG Tech POV" Grant Scheme under the SOMIN PTE LTD project.