Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities
Sara Abdali
2022-03-25

As social media platforms are evolving from text-based forums into multi-modal environments, the nature of misinformation in social media is changing accordingly. Taking advantage of the fact that visual modalities such as images and videos are more favorable and attractive to users, and that textual content is sometimes skimmed carelessly, misinformation spreaders have recently targeted contextual correlations between modalities, e.g., text and image. Thus, many research efforts have been put into the development of automatic techniques for detecting possible cross-modal discordances in web-based media. In this work, we aim to analyze, categorize and identify existing approaches, as well as the challenges and shortcomings they face, in order to unearth new opportunities for furthering research in the field of multi-modal misinformation detection.

Nowadays, billions of multi-modal posts containing text, images, videos, soundtracks, etc. are shared throughout the web, mainly via social media platforms such as Facebook, Twitter, Snapchat, Reddit and Instagram. While the combination of modalities allows for more expressive, detailed and user-friendly content, it brings about new challenges, as it is harder to adapt unimodal solutions to multi-modal environments. In recent years, due to the widespread use of multi-modal platforms, machine learning researchers have introduced many automated techniques for multi-modal tasks such as Visual Question Answering (VQA) (Agrawal et al. 2017; Goyal et al. 2017; Gurari et al. 2018; Hudson and Manning 2019; Singh et al. 2019), image captioning (Chen et al. 2015; Gurari et al. 2020; Krishna et al. 2017; Young et al. 2014) and, more recently, misinformation detection (Singhal et al. 2019; Qi et al. 2021; Giachanou, Zhang, and Rosso 2020), including hate speech detection in multi-modal memes (Kiela et al. 2020).

Similar to other multi-modal tasks, detecting misinformation on multi-modal platforms is harder and more challenging, as it requires not only the evaluation of each modality but also of cross-modal correlations and the credibility of the combination. This becomes even more challenging when each modality, e.g., text or image, is credible on its own but the combination creates misinformative content. For instance, a COVID-19 anti-vaccination misinformation post can have text that reads "vaccines do this" and then attach a graphic image of a dead person. In this case, although the image and text are not individually misinformative, taken together they create misinformation.

Over the past decade, several detection models (Shu et al. 2017, 2020b; Islam et al. 2020; Cai et al. 2020) have been developed to detect misinformation. However, the majority of them leverage only a single modality, e.g., text (Horne and Adali 2017; Wu et al. 2017; Guacho et al. 2018; Shu et al. 2020a) or image (Huh et al. 2018; Qi et al. 2019; Choudhary and Arora 2021), and thus miss the important information conveyed by other modalities. There are existing works (K. Shu and Liu 2019; Shu et al. 2019a; Abdali, Shah, and Papalexakis 2020, 2021; Hakak et al. 2021) that leverage ensemble methods, which create a separate model for each modality and then combine them to produce improved results.
However, in much multi-modal misinformative content, individual modalities that are only loosely combined are insufficient to identify fake news, and as a result the joint model also fails. Nevertheless, in recent years, machine learning scientists have come up with different techniques for cross-modal fake news detection which combine information from multiple modalities and leverage cross-modal information, such as the consistency and meaningful relationships between different modalities, to detect misinformation. Studying and analyzing these techniques and identifying existing challenges will give a clearer picture of the state of knowledge on multi-modal misinformation detection and open the door to new opportunities in this field. Even though there are a number of valuable surveys on fake news detection (Shu et al. 2017; Kumar and Shah 2018; da Silva, Vieira, and Garcia 2019), very few of them focus on multi-modal techniques (Alam et al. 2021). Since the number of proposed techniques for multi-modal fake news detection has been increasing immensely, the necessity of a comprehensive survey on existing techniques, datasets and emerging challenges is felt more than ever. With that said, in this work, we aim to conduct a comprehensive study of fake news detection in multi-modal environments.

The rest of this survey is organized as follows: To start with, we introduce some of the recent cross-modal cues for detecting misinformation. In addition, we discuss different fusion mechanisms to merge the modalities involved in such cues. Furthermore, we categorize existing solutions based on the machine learning techniques they apply. Moreover, we introduce existing datasets for multi-modal fake news detection. Finally, we discuss existing challenges and shortcomings, and propose new avenues of research for furthering the study of multi-modal misinformation detection.

Multi-modal features and clues

As mentioned earlier, combinations of features, e.g., text and image, have recently been used for detecting misinformation in multi-modal environments. In this section, we first present a non-exhaustive list of commonly used clues that are leveraged by machine learning scientists for detecting misinformation. We then discuss categories of fusion mechanisms that merge the modalities involved in such clues in order to generate multi-modal representations.

Image and text mismatch

The combination of textual content and article image is one of the most widely used sets of features for multi-modal fake news detection. The intuition behind this cue is that some fake news spreaders use tempting images, e.g., exaggerated, dramatic or sarcastic graphics that are far from the textual content, to attract users' attention. Since it is difficult to find both pertinent and pristine images to match these fictions, fake news generators sometimes use manipulated images to support non-factual scenarios. Researchers refer to this cue as the similarity relationship between text and image (Zhou, Wu, and Zafarani 2020; Giachanou, Zhang, and Rosso 2020; Xue et al. 2021), which can be captured with a variety of similarity measuring techniques such as the cosine similarity between the title and image tags embeddings (Zhou, Wu, and Zafarani 2020; Giachanou, Zhang, and Rosso 2020) or dedicated similarity measurement architectures (Xue et al. 2021). On video-based platforms such as YouTube, TikTok, etc., the video content is served with descriptive textual information such as the video description, title, users' comments and replies.
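The text-image similarity cue described above can be approximated with off-the-shelf encoders. Below is a minimal sketch, assuming precomputed title and image-tag embeddings; the helper names and dimensions are hypothetical placeholders, not components of any of the cited works:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def text_image_mismatch_score(title_embedding: np.ndarray,
                              image_tag_embeddings: list) -> float:
    # One simple heuristic: compare the title embedding with the average
    # embedding of the tags/objects detected in the article image.
    # A low similarity hints at a possible text-image mismatch.
    tag_centroid = np.mean(image_tag_embeddings, axis=0)
    return 1.0 - cosine_similarity(title_embedding, tag_centroid)

# Hypothetical usage with 300-d embeddings produced by any text encoder / image tagger:
title_emb = np.random.rand(300)
tag_embs = [np.random.rand(300) for _ in range(5)]
print(text_image_mismatch_score(title_emb, tag_embs))
```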
Different users and video producers use different writing styles in such textual content. These writing styles can be learned by machine learning models and distinguished from unfamiliar patterns. Meanwhile, the meaningful relationship between the visual content and the descriptive information, e.g., the video title, is another clue that can be used for detecting online misbehavior (Choi and Ko 2022). However, this is a very challenging task, as it is difficult to detect frames that are relevant to the text and discard irrelevant ones, e.g., advertisements or opening and ending frames. Moreover, encoding all video frames is very inefficient in terms of speed and memory.

Textual content and propagation network

The majority of online fact checkers, such as BS Detector or News Guard, provide labels that pertain to domains rather than articles. Despite this disparity, several works (Helmstetter and Paulheim 2018; Zhou 2017) show that the weakly supervised setup of training on labels pertaining to domains, and subsequently testing on labels pertaining to articles, yields negligible accuracy loss due to the strong correlation between the two. Thus, by recognizing domain features and behaviors, we may be able to classify the articles they publish with admissible accuracy. Among these feature patterns are the propagation network and word usage patterns of the domains, which can be considered (Silva et al. 2021b; Shu et al. 2019b; Silva et al. 2021a; Zhou and Zafarani 2019) as a discriminating signature of different domains. It has been empirically shown (Silva et al. 2021b) that news articles from different domains not only have significantly different word usage but also follow different propagation patterns.

Textual content and overall look of the serving domain

Recently, researchers have come up with a new cue for detecting misinformation: the overall look of the serving webpage (Abdali, Shah, and Papalexakis 2020). It is shown that, in contrast to credible domains, unreliable web-based news outlets tend to be visually busy and full of distracting elements such as advertisements, popups, etc. Trustworthy webpages often look professional and ordered: they often request users to agree to sign up or subscribe, and have featured articles, a headline picture, standard writing styles, etc. On the other hand, unreliable domains tend to have an unprofessional blog-post style, sometimes with hard-to-read fonts, errors, negative space and so forth. Considering this discriminating clue, researchers have recently proposed to take the overall look of the webpages into account, in addition to textual content and social context, and create a multi-modal model for detecting misinformation (Abdali, Shah, and Papalexakis 2020, 2021).

Video and audio mismatch

Due to the ubiquity of camera devices and video-editing applications, video-based frameworks are extremely vulnerable to manipulation, e.g., virtual backgrounds, anime filters, etc. Such visual manipulations introduce non-trivial noise to the video frames, which may lead models to pick up irrelevant information from videos and misclassify them (Shang et al. 2021). Moreover, manipulated videos often incorporate content in different modalities, such as audio and text, none of which is necessarily misinformative when considered individually; however, they mislead audiences when considered jointly with the video content.
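As a concrete illustration of the domain-to-article weak supervision described above, the following is a minimal sketch assuming a hypothetical table of articles with a domain column and a separate domain-level credibility list; the column names and example domains are placeholders, not part of any cited dataset or fact checker:

```python
import pandas as pd

# Hypothetical inputs: article-level records and a domain-level label list
# (e.g., produced by a fact-checking service that rates whole outlets).
articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "domain": ["reliable-news.example", "dubious-blog.example", "reliable-news.example"],
    "text": ["...", "...", "..."],
})
domain_labels = {"reliable-news.example": "credible",
                 "dubious-blog.example": "unreliable"}

# Weak supervision: propagate the domain label to every article it published.
# Articles from unrated domains stay unlabeled and can be held out or used for testing.
articles["weak_label"] = articles["domain"].map(domain_labels)
train_set = articles.dropna(subset=["weak_label"])
print(train_set[["article_id", "weak_label"]])
```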
In order to detect misleading content that is jointly expressed in video, audio and text, researchers have proposed to leverage frame-based information along with audio and text content on video-based platforms like TikTok (Shang et al. 2021).

Different fusion mechanisms are used to combine features from different modalities, including those mentioned above. We may categorize feature fusion techniques as follows:

Early fusion, also known as feature-level fusion, refers to combining features from different data modalities at an early stage using an operation which is often concatenation. This type of combination is performed ahead of classification and, if the fusion is done after feature extraction, it is sometimes referred to as intermediate fusion (Boulahia et al. 2021).

Late fusion, also known as decision-level fusion, depends on the results obtained from each data modality individually. In other words, the modality-wise classification results are combined using techniques such as the sum, max, average or weighted average. Most late fusion solutions use handcrafted rules, which are prone to human bias and far from real-world peculiarities (Boulahia et al. 2021).

Comparison of fusion mechanisms: In most cases, early fusion is a complex operation, whereas late fusion is easier to perform (Atrey et al. 2010), because unlike early fusion, where the features from different modalities, e.g., image and text, may have different representations, the decisions at the semantic level usually have the same representation; therefore, the fusion of decisions is easier. However, the late fusion strategy does not utilize the feature-level correlation among modalities. For instance, it has been shown (Gallo et al. 2020) that early fusion of image and text features outperforms multi-modal late fusion when using BERT and CNNs on the UPMC Food-101 dataset (Wang et al. 2015). One advantage of early fusion is that it requires less computation time, because training is performed only once, whereas late fusion needs multiple classifiers for local decisions (Atrey et al. 2010). However, there are hybrid approaches as well, which take advantage of both early and late fusion strategies (Atrey et al. 2010). A simplified scheme of early vs. late fusion of multi-modal data is demonstrated in Fig. 1.

As far as the model-based study of multi-modal misinformation detection is concerned, machine learning scientists have come up with a variety of solutions, which we categorize below, based on the machine learning techniques they apply, into classic machine learning and deep learning based approaches.

As discussed earlier, the vast majority of misinformation detection approaches leverage a single modality, a.k.a. aspect, of news articles, e.g., text. Recently, however, there have been a few works that incorporate various aspects of a news article using classic machine learning techniques in order to create a multi-modal article representation. For example, Shu et al. (K. Shu and Liu 2019) propose individual embeddings for text, user-user, user-article and publisher-article interactions, define a joint optimization problem, and leverage an Alternating Least Squares (ALS) approach to solve for the variables. In another work, Abdali et al. propose HiJoD (Abdali, Shah, and Papalexakis 2020), which encodes three different aspects, i.e., the article text, the context of social sharing behaviors and host website/domain features, into individual embeddings and extracts the shared structures of these embeddings by canceling out the unshared structures with a principled tensor-based framework. Finally, the shared structures are utilized for article classification.
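To make the early vs. late fusion distinction concrete, here is a minimal sketch assuming two pretrained unimodal encoders and a generic scikit-learn classifier; it illustrates the two strategies in general, not the pipeline of any specific cited work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
text_feats = rng.normal(size=(n, 64))    # e.g., sentence-encoder outputs
image_feats = rng.normal(size=(n, 128))  # e.g., CNN image embeddings
labels = rng.integers(0, 2, size=n)      # 0 = real, 1 = fake (toy labels)

# Early (feature-level) fusion: concatenate modality features, train one model.
early_clf = LogisticRegression(max_iter=1000)
early_clf.fit(np.concatenate([text_feats, image_feats], axis=1), labels)

# Late (decision-level) fusion: one model per modality, then average the
# predicted probabilities (a simple handcrafted combination rule).
text_clf = LogisticRegression(max_iter=1000).fit(text_feats, labels)
image_clf = LogisticRegression(max_iter=1000).fit(image_feats, labels)
late_scores = (text_clf.predict_proba(text_feats)[:, 1] +
               image_clf.predict_proba(image_feats)[:, 1]) / 2.0
late_preds = (late_scores > 0.5).astype(int)
```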
More recently, Meel et al. (Hakak et al. 2021) have proposed an ensemble framework which leverages a text embedding, a score calculated as the cosine similarity between the image caption and the news body, and noisy images. Although some modules of this model, e.g., the text embedding generator, leverage deep attention-based architectures, the classification itself is done via a classic ensemble technique, i.e., max voting.

Due to the impressive success of deep neural networks in classifying text, images and many other modalities, over the past few years they have been widely exploited by research scientists for a variety of multi-modal tasks, including misinformation detection. We may categorize deep learning based multi-modal misinformation detection into five categories: concatenation-based, attention-based, generative, graph neural network based and cross-modality discordance-aware architectures. In what follows, we summarize the existing works under these categories.

Concatenation-based architectures: The majority of the existing work on multi-modal misinformation detection embeds each modality, e.g., text or image, into a vector representation and then concatenates the vectors in order to generate a multi-modal representation which is utilized for classification. For instance, Singhal et al. propose to use pretrained XLNet and VGG-19 models for text and image, respectively.

Attention-based architectures: As mentioned above, many architectures simply concatenate vector representations and thereby fail to build effective multi-modal embeddings. Such models are inefficient in many cases; for instance, the entire text of an article need not be false, or inconsistent with the corresponding image (and vice versa), for the article to be misinformative content. Thus, some recent works use attention mechanisms to attend to the relevant parts of the image, text, etc. The attention mechanism is a more effective approach for utilizing embeddings, as it produces richer multi-modal representations. For example, Qian et al. (Qian et al. 2021) propose a Hierarchical Multi-modal Contextual Attention Network (HMCAN) which utilizes a pretrained BERT and ResNet50 to generate word and image embeddings and a multi-modal contextual attention network to explore the multi-modal context information. In this work, different multi-modal contextual attention networks constitute a hierarchical encoding network that explores and captures the rich hierarchical semantics of multi-modal data. Other attention-based works (Sachan et al. 2021; Jing et al. 2021) connect features of text and images into a series and feed them into a vision-language transformer model to learn a joint representation of multi-modal features; TRANSFAKE (Jing et al. 2021), for example, adopts a BERT-like preprocessing method for the concatenated text, comments and image. In another work (Wang, Mao, and Li 2022), Wang et al. apply scaled dot-product attention on top of image and text features as a fine-grained fusion and use the fused features to classify articles. Wang et al. also propose a deep learning network for biomedical informatics that leverages visual and textual information and a semantic- and task-level attention mechanism to focus on the essential contents of a post that signal anti-vaccine messages (Wang, Yin, and Argyris 2021).
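The scaled dot-product attention used as a fine-grained fusion step in several of the works above can be sketched as follows; this is a generic cross-attention block in PyTorch under assumed feature shapes, not the exact architecture of any cited model:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend over image region features (scaled dot-product)."""
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(text_dim, hidden_dim)
        self.k = nn.Linear(image_dim, hidden_dim)
        self.v = nn.Linear(image_dim, hidden_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, text_dim); image_feats: (batch, n_regions, image_dim)
        q, k, v = self.q(text_feats), self.k(image_feats), self.v(image_feats)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (batch, n_tokens, n_regions)
        attn = torch.softmax(scores, dim=-1)
        fused = attn @ v              # image evidence aligned to each text token
        return fused.mean(dim=1)      # pooled multi-modal representation

# Toy usage: 8 text tokens, 36 image regions, batch of 4.
fusion = CrossModalAttention(text_dim=768, image_dim=2048, hidden_dim=256)
out = fusion(torch.randn(4, 8, 768), torch.randn(4, 36, 2048))
print(out.shape)  # torch.Size([4, 256])
```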
Lu et al. concatenate representations of user interactions, word representations and propagation features after applying a dual co-attention mechanism to capture the correlations between users' interactions/propagation and the tweet's text (Lu and Li 2020). Finally, Song et al. (Song et al. 2021) propose a multi-modal fake news detection model based on a Crossmodal Attention Residual network (CARN) and Multichannel convolutional neural Networks (CARMN). CARN selectively extracts the information related to a target modality from a source modality while maintaining the unique information of the target.

Generative architectures: In this category of deep learning solutions, the goal is to utilize generative models in order to learn individual or multi-modal representations and use them to augment the classifier. As an example, Jaiswal et al. propose a BERT-based multi-modal variational autoencoder (Jaiswal, Singh, and Singh 2021) that consists of an encoder, a decoder and a fake news detector. The encoder encodes the shared representation of both the image and the text into a multidimensional latent vector; the decoder decodes the latent vector back into the original image and text; and the fake news detector is a binary classifier that takes the shared representation as input and classifies it as either fake or real. Similarly, Khattar et al. propose a deep Multimodal Variational Autoencoder (MVAE) (Khattar et al. 2019) which learns a unified representation of both modalities of a tweet's content. Like the previous work, MVAE has three main components: an encoder, a decoder and a fake news detector that utilizes the learned shared representation to predict whether a news item is fake or not. Wang et al. propose Event Adversarial Neural Networks (EANN), an end-to-end framework which can derive event-invariant features and thus benefit the detection of fake news on newly arrived events. It consists of three main components: a multi-modal feature extractor, the fake news detector and the event discriminator. The multi-modal feature extractor is responsible for extracting the textual and visual features from posts; it cooperates with the fake news detector to learn the discriminating representation of news articles. The role of the event discriminator is to remove event-specific features and keep the features shared among events. In another work (Silva et al. 2021b), Silva et al. propose a cross-domain framework using text and the propagation network. The proposed model consists of two components: unsupervised domain embedding learning and supervised domain-agnostic news classification. The unsupervised domain embedding component exploits the text and the propagation network to represent a news domain with a low-dimensional vector. The classification model represents each news record as a vector using the textual content and the propagation network, and then maps this representation into two different subspaces such that one preserves the domain-specific information. Later on, these two components are integrated to identify fake news while exploiting domain-specific and cross-domain knowledge in the news records. A last example is a work by Zeng et al. (Zeng, Zhang, and Ma 2020), who propose to capture the correlations between text and image with a VAE-based multi-modal feature fusion method.
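The encoder-decoder-detector pattern shared by the variational autoencoder approaches above can be sketched as follows; this is a simplified PyTorch illustration under assumed feature dimensions, not a reproduction of MVAE or any other specific cited model:

```python
import torch
import torch.nn as nn

class MultimodalVAEDetector(nn.Module):
    """Encode text+image into a shared latent, reconstruct both, and classify."""
    def __init__(self, text_dim=768, image_dim=2048, latent_dim=64):
        super().__init__()
        self.encoder = nn.Linear(text_dim + image_dim, 2 * latent_dim)  # -> (mu, log_var)
        self.text_decoder = nn.Linear(latent_dim, text_dim)
        self.image_decoder = nn.Linear(latent_dim, image_dim)
        self.detector = nn.Linear(latent_dim, 2)  # fake vs. real

    def forward(self, text_feat, image_feat):
        mu, log_var = self.encoder(torch.cat([text_feat, image_feat], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
        return {
            "text_recon": self.text_decoder(z),
            "image_recon": self.image_decoder(z),
            "logits": self.detector(z),
            "kl": -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1),
        }

# The training loss would combine the reconstruction errors, the KL term and the
# detector's cross-entropy, weighted by hyperparameters (omitted here).
model = MultimodalVAEDetector()
out = model(torch.randn(4, 768), torch.randn(4, 2048))
print(out["logits"].shape)  # torch.Size([4, 2])
```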
Graph neural network architectures: In recent years, Graph Neural Networks (GNNs) have been successfully exploited for fake news detection (Song, Shu, and Wu 2021b; Benamira et al. 2019) and have thereby caught researchers' attention for multi-modal misinformation detection as well. In this category of deep learning solutions, article content, e.g., text and images, is represented by graphs, and graph neural networks are then used to extract semantic-level features. For instance, Wang et al. construct a graph for each social media post based on the point-wise mutual information (PMI) scores of pairs of words, the objects extracted from the visual content and knowledge concepts obtained through knowledge distillation. They then utilize a Knowledge-driven Multi-modal Graph Convolutional Network (KMGCN) to extract the multi-modal representation of each post through graph convolutional networks. Another GCN-based model is GAME-ON (Dhawan et al. 2022), which represents each news item with uni-modal visual and textual graphs and then projects them into a common space. To capture multi-modal representations, GAME-ON applies a graph attention layer on a multi-modal graph generated from the modality graphs.

Cross-modality discordance-aware architectures: In the categories discussed above, deep learning models are used to fuse different modalities in order to obtain discriminating representations. The intuition is that fabrication of either modality will lead to dissonance between the modalities and result in misrepresented, misinterpreted and misleading news. Therefore, there are subtle cross-modal discordance clues that can be identified and learned by customized architectures. For instance, in many cases fake news propagators use irrelevant modalities, e.g., images, video or audio, for false statements in order to attract readers' attention. Thus, the similarity of the text to other modalities, e.g., image or audio, is a cue for measuring the credibility of a news article. With that said, Zhou et al. (Zhou, Wu, and Zafarani 2020) propose SAFE, a similarity-aware multi-modal fake news detection method that defines the relevance between the news textual and visual information with a modified cosine similarity. Similarly, Giachanou et al. propose a multi-image system that combines textual, visual and semantic information (Giachanou, Zhang, and Rosso 2020); the semantic representation refers to the text-image similarity calculated as the cosine similarity between the title and image tags embeddings. In another work, Singhal et al. (Singhal et al. 2021) develop an inter-modality discordance based fake news detection method which learns discriminating features and employs a modified version of the contrastive loss that explores the inter-modality discordance. Xue et al. (Xue et al. 2021) propose a Multimodal Consistency Neural Network (MCNN) which utilizes a similarity measurement module to measure the similarity of multi-modal data and detect possible mismatches between the image and the text. Lastly, Biamby et al. (Biamby et al. 2021) leverage a Contrastive Language-Image Pre-Training (CLIP) model (Radford et al. 2021) to jointly learn image/text representations and detect image-text inconsistencies in tweets. Instead of concatenating vector representations, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples.
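The pairing objective just described can be written compactly. Below is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of image/text embeddings, shown only to illustrate the idea rather than the cited implementations:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim) embeddings of matched (image, text) pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(logits.size(0))            # the i-th image matches the i-th text
    # Symmetric cross-entropy: predict the right text for each image and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: at inference time, a low image-text similarity for a post can be
# read as a possible cross-modal inconsistency signal.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```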
On video-based platforms such as YouTube, different producers typically use different titles and descriptions, and users and subscribers express their opinions in different writing styles. With this clue in mind, Choi et al. propose a framework to identify fake content on YouTube (Choi and Ko 2022). They propose to use domain knowledge and the "hit-likes" of comments to create a comment embedding which is effective in detecting fake news videos. They encode multi-modal features, i.e., image and text, and detect differences between the title, description or video and users' comments. In another work (Shang et al. 2021), Shang et al. develop TikTec, a multi-modal misinformation detection framework that explicitly exploits the captions to accurately capture the key information from unreliable video content. This framework learns the composed misinformation that is jointly conveyed by the visual and audio content. TikTec consists of four major components: a Caption-guided Visual Representation Learning (CVRL) module that identifies misinformation-related visual features in each sampled video frame; an Acoustic-aware Speech Representation Learning (ASRL) module that learns the misleading semantic information deeply embedded in unstructured and casual audio tracks; a Visual-speech Co-attentive Information Fusion (VCIF) module that captures the multi-view composed information jointly embedded in the heterogeneous visual and audio content of the video; and a Supervised Misleading Video Detection (SMVD) module that identifies misleading COVID-19 videos. A summary of the existing deep learning based works is reported in Table 1. It is worth mentioning that many of the state-of-the-art solutions utilize a hybrid of deep learning techniques.

In this section we introduce some of the existing multi-modal datasets for fake news detection.

Image-Verification-Corpus is an evolving dataset containing 17,806 fake and real posts with images shared on Twitter. The dataset is created as an open corpus that may be used for assessing online image verification approaches (based on tweet texts and user features), in addition to building classifiers for new content, which currently consists of tweets containing images (Boididou et al. 2018).

Fakeddit is a dataset collected from Reddit, a social news and discussion website where users can post submissions on various subreddits. Fakeddit consists of over 1 million submissions from 22 different subreddits, spanning over a decade with the earliest submission being from 3/19/2008 and the most recent from 10/24/2019. These submissions are posted on highly active and popular pages by over 300,000 users. Fakeddit comprises submission titles, images, user comments, and submission metadata including the score, the username of the author, the subreddit source, the sourced domain, the number of comments, and the up-vote to down-vote ratio. Approximately 64% of the samples in Fakeddit contain both text and image data (Nakamura, Levy, and Wang 2020).

NewsBag comprises 200,000 real and 15,000 fake news articles. The real articles have been collected from the Wall Street Journal, and the fake ones have been extracted from The Onion, which publishes satirical content. The NewsBag dataset is highly imbalanced; thus, NewsBag++ is also introduced, which is an augmented training version of the dataset and contains 200,000 real and 389,000 fake news articles (Jindal et al. 2020).

MM-COVID is a multilingual and multi-modal COVID-19 fake news data repository.
This dataset comprises 3,981 fake news items and 7,192 pieces of trustworthy information in 6 different languages, i.e., English, Spanish, Portuguese, Hindi, French and Italian. MM-COVID consists of visual, textual and social context information, e.g., user and network information.

ReCOVery contains 2,029 news articles that have been shared on social media, most of which (2,017) have both textual and visual information for multi-modal studies. ReCOVery is imbalanced in news class, i.e., the proportion of real vs. fake articles is around 2:1. The combined number of users who spread real news (78,659) and users who share fake articles (17,323) is greater than the total number of users included in the dataset (93,761); the assumption in this dataset is that users can engage in spreading both real and fake news articles (Chen, Chu, and Subbalakshmi 2021).

N24News is a multi-modal dataset extracted from New York Times articles published from 2010 to 2020. Each article belongs to one of 24 different categories, e.g., science, arts, etc. The dataset comprises up to 3,000 samples of real news for each category; in total, 60,000 news articles are collected.

MuMiN, a Large-Scale Multilingual Multi-modal Fact-Checked Misinformation Social Network Dataset, comprises 21 million tweets belonging to 26 thousand Twitter threads, which have been linked to 13 thousand fact-checked claims in 41 different languages. MuMiN is available in three versions (large, medium and small), with the largest one consisting of 10,920 articles and 6,573 images. In this dataset, if a claim is "mostly true", it is labeled as factual. When a claim is deemed "half true" or "half false", it is labeled as misinformation, with the justification that a statement containing a significant amount of false information should be considered misleading content. When there is no clear verdict, it is labeled as other (Nielsen and McConville 2022).

A side-by-side comparison of the aforementioned datasets is illustrated in Table 2.

Recent studies on multi-modal learning have made significant contributions to the field of multi-modal fake news detection. However, there are still weaknesses whose recognition opens the door to new opportunities, not only in fake news detection but also in multi-modal studies in general. In this section, we discuss challenges, shortcomings and opportunities in multi-modal fake news detection. A non-exhaustive list of challenges and shortcomings, divided into data-, feature- and model-based categories, is as follows:

Data-based shortcomings: This category refers to the weaknesses of current multi-modal datasets for misinformation detection.

Lack of comprehensive datasets: As discussed in the dataset section, most of the existing multi-modal datasets contain only the image and text of the articles, are monolingual and small in size, and are sometimes imbalanced in terms of the fake to real ratio.

Bias toward specific events: Many of the existing datasets are created for specific events such as the COVID-19 crisis and thereby do not cover a variety of events. Thus, they might not be useful enough to train models that detect fake news in other contexts.

Binary and domain-level ground truth: Most of the existing datasets provide us with binary and domain-level ground truth for well-known outlets such as The Onion or The New York Times. Moreover, they often do not give any information about the reasons for misinformativeness, e.g., cross-modal discordance, manipulation, etc.
Feature-based shortcomings: This category comprises shortcomings related to the cross-modal features that are leveraged in multi-modal fake news detection.

Insufficiency of cross-modal cues: Although researchers have proposed some multi-modal cues, most of the existing models naively fuse image-based features with textual features as a supplement. There are fewer works that leverage explainable cross-modal cues beyond the image and text combination; plenty of multi-modal cues are still neglected.

Ineffective cross-modal embeddings: As mentioned earlier, the majority of the existing approaches only fuse embeddings with simple operations such as concatenation of the representations, and thereby fail to build effective and non-noisy cross-modal embeddings. Such architectures fail in many cases, as the resulting cross-modal embedding consists of irrelevant parts and may lead to noisy representations.

Figure 3: A summary of the challenges in multi-modal misinformation detection.

Model-based shortcomings: This category refers to the shortcomings of current machine learning solutions for detecting misinformation in multi-modal environments.

Inexplicability of attention mechanisms: While some recent works attempt to use attention-based techniques to overcome the problem of ineffective multi-modal embeddings discussed above, they do not provide any explicable information about the regions of interest or common patterns of inconsistencies; they usually follow a trial-and-error approach to find relevant sections to attend to.

Non-transferable models for unseen events: Most of the existing models are designed in such a way that they extract and learn event-specific features, e.g., COVID-19, elections, etc. Thus, they are most likely biased toward specific events and non-transferable to unseen and emerging events. For this very reason, building models that learn general features and separate them from non-transferable event-specific features would be extremely useful.

Unscalability of current models: Considering the expensive and complicated structures of deep networks and the fact that most existing multi-modal models leverage multiple deep networks (one for each modality), these models are not scalable as the number of modalities increases. Moreover, many of the existing models require heavy computing resources and large volumes of memory and processing units. Therefore, the scalability of proposed models should be taken into account when developing new architectures. Fig. 3 illustrates a summary of the discussed challenges.

Considering the challenges and shortcomings in multi-modal misinformation detection discussed above, we propose the following opportunities for furthering research in the field of multi-modal misinformation detection:

Comprehensive multi-modal datasets: As discussed in detail, one important gap in the misinformation detection literature is the lack of a comprehensive multi-modal dataset, which needs to be addressed in the future.

Identifying cross-modal clues: As mentioned earlier, cross-modal cues are currently limited to a handful of trivial clues such as the similarity of text and image. Identifying subtler, yet neglected, cues not only helps in the development of discordance-aware models, but could also be helpful in recognizing vulnerabilities of the serving platforms, which is part and parcel of adversarial learning.
Developing efficient fusion mechanisms: As discussed before, many of the existing solutions leverage naive fusion mechanisms such as concatenation, which may result in inefficient and noisy multi-modal representations. Therefore, another fruitful avenue of research lies in the study and development of more efficient fusion techniques that produce information-richer representations.

Developing cross-modal discordance-aware architectures: As described earlier, most of the existing works either blindly merge modalities or take a trial-and-error approach to attend to the relevant modalities. Implementing discordance-aware models not only results in information-richer representations, but may also be useful in making attention-based techniques more efficient.

Adversarial learning in multi-modal misinformation detection: Although there are existing generative architectures, the adversarial study of multi-modal misinformation detection, in order to make detection models more adversarially robust, is of utmost importance and has been mostly neglected.

Interpretability of multi-modal models: The development of explainable frameworks that help better understand and interpret the predictions made by multi-modal detection models is another opportunity in multi-modal misinformation detection, and could also be very useful for related studies such as adversarial learning and explainable AI.

Transferable models for unseen events: As mentioned earlier, except for a few works, most of the existing models are designed for specific events and, as a result, are ineffective for emerging ones. Since misinformation spreads during a variety of events, developing general and transferable models is extremely crucial.

Development of scalable models: Another opportunity is to develop models that are more efficient in terms of time and resources and do not become intolerably complicated as the number of fused modalities increases.

In this work, we study existing works on multi-modal misinformation detection, analyze their strengths and weaknesses, and offer new opportunities for advancing research in this field. First, we introduce cross-modal clues and fusion mechanisms. Then, we categorize existing solutions into two main categories, classic machine learning and deep learning techniques, and break each of them down based on the utilized techniques. In addition, we introduce and compare existing datasets on multi-modal misinformation detection. Finally, we categorize existing challenges into data-, feature- and model-based shortcomings and propose new directions to address them.

References

Identifying Misinformation from Website Screenshots
HiJoD: Semi-Supervised Multi-aspect Detection of Misinformation using Hierarchical Joint Decomposition
KNH: Multi-View Modeling with K-Nearest Hyperplanes Graph for Misinformation Detection
VQA: Visual Question Answering
Multimodal Fake News Detection
Multimodal Fake News Detection
Multimodal fusion for multimedia analysis: a survey
Semi-Supervised Learning and Graph Neural Networks for Fake News Detection (ASONAM '19)
Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation
Detection and visualization of misleading content on Twitter
Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition (Machine Vision and Applications 32)
Threats to Online Advertising and Countermeasures: A Technical Survey
MMCoVaR: Multimodal COVID-19 Vaccine Focused Data Repository for Fake News Detection and a Baseline Architecture for Classification
Microsoft COCO Captions: Data Collection and Evaluation Server
Effective fake news video detection using domain knowledge and multimodal data fusion on YouTube
ImageFake: An Ensemble Convolution Models Driven Approach for Image Based Fake News Detection
CoAID: COVID-19 Healthcare Misinformation Dataset
DETERRENT: Knowledge Guided Graph Attention Network for Detecting Healthcare Misinformation
Can Machines Learn to Detect Fake News? A Survey Focused on Social Media
GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection
Image and Text fusion for UPMC Food-101 using BERT and CNNs
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Semi-supervised Content-Based Detection of Misinformation via Tensor Embeddings
VizWiz Grand Challenge: Answering Visual Questions from Blind People
Captioning Images Taken by People Who Are Blind
An ensemble machine learning approach through effective feature extraction to classify fake news
Weakly Supervised Learning for Fake News Detection on Twitter
This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News
Deep learning for misinformation detection on online social networks: a survey and new perspectives (Social Network Analysis and Mining 10)
AENeT: an attention-enabled neural architecture for fake news detection using contextual features
Fake News Detection Using BERT-VGG19 Multimodal Variational Autoencoder
Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs
NewsBag: A multimodal benchmark dataset for fake news detection
TRANSFAKE: Multi-task Transformer for Multimodal Enhanced Fake News Detection
Beyond news contents: the role of social context for fake news detection
MVAE: Multimodal Variational Autoencoder for Fake News Detection
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
False information on web and social media: A survey
AMFB: Attention based multimodal Factorized Bilinear Pooling for multimodal Fake News Detection
MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation
GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media
AIMH at SemEval-2021 Task 6: Multimodal Classification Using an Ensemble of Transformer Models
MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset
Improving Fake News Detection by Using an Entity-Enhanced Framework to Fuse Diverse Multimodal Clues
Exploiting Multi-domain Visual Information for Fake News Detection
Hierarchical Multi-Modal Contextual Attention Network for Fake News Detection
Learning Transferable Visual Models From Natural Language Supervision
ARCNN framework for multimodal infodemic detection
Socially Aware Multimodal Deep Neural Networks for Fake News Classification
SCATE: Shared Cross Attention Transformer Encoders for Multimodal Fake News Detection
A Multimodal Misinformation Detector for COVID-19 Short Videos on TikTok
DEFEND: Explainable Fake News Detection
Detecting Fake News With Weak Social Supervision
Hierarchical Propagation Networks for Fake News Detection: Investigation and Exploitation
Fake News Detection on Social Media: A Data Mining Perspective
Understanding User Profiles on Social Media for Fake News Detection
Leveraging Multi-Source Weak Social Supervision for Early Detection of Fake News
The Role of User Profiles for Fake News Detection (ASONAM '19)
Propagation2Vec: Embedding partial propagation networks for explainable fake news early detection
Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data
Inter-Modality Discordance for Multimodal Fake News Detection
SpotFake: A Multi-modal Framework for Fake News Detection
A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks
Temporally evolving graph neural network for fake news detection
Temporally evolving graph neural network for fake news detection
FMFN: Fine-Grained Multimodal Fusion Networks for Fake News Detection
Recipe recognition with large multimodal food dataset
Fake News Detection via Knowledge-Driven Multimodal Graph Convolutional Networks
N24News: A New Dataset for Multimodal News Classification
Detecting Medical Misinformation on Social Media Using Multimodal Deep Learning
Gleaning wisdom from the past: Early detection of emerging rumors in social media
Tracing Fake-News Footprints: Characterizing Social Media Messages by How They Propagate
Detecting fake news by exploring the consistency of multimodal data
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
Fake news detection for epidemic emergencies via deep correlations between text and images
MDMN: Multi-task and Domain Adaptation based Multi-modal Network for early rumor detection
ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research (CIKM '20)
SAFE: Similarity-Aware Multi-Modal Fake News Detection
Network-Based Fake News Detection: A Pattern-Driven Approach (SIGKDD Explor.)
A Brief Introduction to Weakly Supervised Learning

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant #2127309 to the Computing Research Association for the CIFellows Project.