FNR: A Similarity and Transformer-Based Approach to Detect Multi-Modal Fake News in Social Media

The availability and interactive nature of social media have made them the primary source of news around the globe. The popularity of social media tempts criminals to pursue their immoral intentions by producing and disseminating fake news using seductive text and misleading images. Therefore, verifying social media news and spotting fakes is crucial. This work analyzes multi-modal features from texts and images in social media to detect fake news. We propose a Fake News Revealer (FNR) method that uses transformer models to extract contextual and semantic features and a contrastive loss to measure the similarity between image and text. We applied FNR to two real social media datasets. The results show that the proposed method achieves higher accuracy in detecting fake news than previous works.

Figure 1: Twitter posts about the April 15, 2013 Boston terrorist attack: (a) real news, (b) real news, (c) fake news, (d) fake news.

Social media can lead criminals to produce and publish fake news with malicious intent. Such fraud can be perpetrated to mislead the public; to harm an institution, a person, or a government; or to manipulate public and private stock markets. Meanwhile, ordinary people, unaware that the news is false, republish it without verification. The news then spreads more widely and causes greater damage, leading to distrust and disregard for correct news and warnings. It also makes it difficult for reporters and journalists to cover correct and important news.
Visual misinformation tends to involve much more straightforward deceptions. It is common for old photos and videos to be repurposed as evidence of recent events. Using out-of-context images is another way to lend a fake story credibility and gain trust: the image from a genuine news story is reused in a different event or context to support the fake news. Images play a pivotal role in influencing public opinion and creating false perceptions. According to psychological research [3], when individuals see an image alongside a trivial statement such as "turtles are deaf", they are more likely to believe it. In a simulated social media environment, a post that incorporates photos gets more likes and shares, and people are more likely to perceive it as factual [4]. Hence, considering the visual features of news and the similarity relations between its image and text is essential for recognizing whether the news is true.
In social media, we do not always know the source of the news, but we can observe users' reactions to a news post. On Twitter, for example, people retweet or comment on tweets with or without intent. We only see retweets and comments, so determining their reliability is critical. These reactions can themselves spread fake news and divert attention from the real story. For example, Figure 1 shows some tweets related to a terrorist attack in Boston in 2013. As can be seen, fake tweets are used to mislead readers, divert attention from the news, or further agendas by exploiting public attention. In the fake tweet of Figure 1c, the author tries to implicate another person as a terrorist by likening that person's backpack to the terrorist's. In Figure 1d, the image from the actual news is placed next to a story about a missing person, presenting the missing person as a terrorist.
As a means to deal with these challenges, we propose an end-to-end framework referred to as Fake News Revealer (FNR): a similarity and transformer-based approach to detect multi-modal fake news in social media. In this approach, we featurize text using BERT, a language representation model whose name stands for Bidirectional Encoder Representations from Transformers [5]. In addition, we use ViT, a vision transformer model that treats patches of images as tokens, as in an NLP application, and provides the sequence of linear embeddings of these patches as inputs to the transformer [6]. After these two embedding modules, we use two projection modules that project the extracted features into similar-sized vectors and tune their weights based on the task and dataset. To optimize the model, we use two loss functions: a contrastive loss, a supervised loss between image and text, and a classification loss, a cross-entropy loss between the predicted and actual label of each news item.
Our main contributions are:
• Utilizing transformer models for both image and text, which yields better results than other text or image classification models in fake news detection.
• Using a contrastive loss between image and text to explore the relations between them for each news item.
• Outperforming the state-of-the-art multi-modal fake news detection methods on two public social media datasets.

RELATED WORKS
With the expansion of using social networks, automatic detection of fake news has become essential. The intentional nature of fake news and its adverse effects and implications have encouraged more researchers to focus on this issue.
In this section, we categorize the related works based on the modes of their inputs. Then, we focus on recent multi-modal fake news detection methods.

Single Modality
In early works, only one mode of data was used to detect fake news, with textual data receiving the most attention due to its prevalence in the news. Linguistic features were utilized in [2] to validate news on Twitter, and structural and cognitive features were extracted to detect fake news on social networks in [7]. These methods, particularly those that utilize linguistic features, are topic-specific and cannot be generalized to all topics. Additionally, the methods in [2] and [7] do not extract features automatically, resulting in limited, hand-crafted solutions. The authors in [8] and [9] implemented machine learning models on social media images to detect fake news. Using recurrent deep networks, the authors in [10] pioneered the use of deep models for fake news detection, demonstrating improved results.

Multiple Modality
The authors in [11] employed a visual question-answering system via deep networks for fake news detection using multi-modal data. Alternatively, [12] used subtitle texts in addition to images to detect fake news. EANN [13] is an adversarial neural network that uses both the image and the text of the news. It tries to reduce the impact of individual news events with an adversarial mechanism so that the model generalizes to unexpected events. This model uses a pre-trained VGG-19 network for the image and a deep convolutional network for textual properties. MVAE [14] presents a variational autoencoder with an encoder-decoder network and detects fake news using the learned hidden vectors; a deep BiLSTM network extracts textual features, and a pre-trained model extracts image features [14].
SpotFake [15] detects fake news by embedding text and images into vectors and then fusing these vectors. CARMN [16] utilizes a cross-modal attention mechanism to model the relationship between image and text; it then obtains feature vectors via a self-attention mechanism and determines fake news using a concatenation of these feature vectors. Finally, AMFB [17] uses attention-based BiLSTM to capture textual features and attention-based CNN-RNN blocks to capture visual features. It then uses a multi-layer perceptron to classify the extracted features.
Our work focuses on fake news detection for social media by proposing a transformer and similarity-based method. Compared with the existing multi-modal fake news detection methods for the general scenario, our approach extracts contextual features from images and considers more interactions between the multi-modal data. Table 1 provides a comparison between the previous state of the art and the proposed method.

PROPOSED METHOD
The Fake News Revealer (FNR) architecture is shown in Figure 2. Our approach is composed of four parts: 1) text, 2) image, 3) similarity, and 4) classifier. These modules are combined to provide a label for each news item. BERT and ViT are used as pre-trained models for extracting text and image feature vectors, respectively. These feature vectors are then combined and classified by a fully connected classification layer.
We also utilize a projector module that resizes the input vector for both the text and image branches by applying linear functions. Unlike the frozen encoders, the projector's weights are learned end-to-end during training, making the image and text representation vectors trainable. Refer to Figure 3a for more details.
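As a rough illustration, the projector can be sketched as two stacked linear maps over a batch of encoder outputs. This is a minimal NumPy sketch with hypothetical weight names; the actual projector of Figure 3a may also include an activation and dropout between the layers, which are omitted here.

```python
import numpy as np

def project(x, w1, b1, w2, b2):
    """Map an encoder output x of shape (b, d) to shape (b, k)
    via two linear layers, as in the text/image projectors."""
    h = x @ w1 + b1          # (b, d) -> (b, k)
    return h @ w2 + b2       # (b, k) -> (b, k)

rng = np.random.default_rng(0)
b, d, k = 4, 768, 64                     # batch, encoder width, projection size
x = rng.normal(size=(b, d))
w1, b1 = rng.normal(size=(d, k)) * 0.01, np.zeros(k)
w2, b2 = rng.normal(size=(k, k)) * 0.01, np.zeros(k)
print(project(x, w1, b1, w2, b2).shape)  # (4, 64)
```

Only the projector's weights (and the classifier's) receive gradient updates of this kind; the encoder features they consume come from the pre-trained models.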
The input to our algorithm is n news items (N), each consisting of text, image, and label (fake or real). Let N_i represent news item i, and let T_i, I_i, and L_i be its text, image, and label, respectively. Then:

N_i = (T_i, I_i, L_i),  i = 1, ..., n   (1)

Our goal is to predict L_i using T_i and I_i. We process the news in batch mode through the deep model, so in each run the number of input items is b.

Textual Part
The purpose of this module is to featurize text and embed it into a vector. This part consists of two main sub-modules: the first is an encoder that extracts representative features using a pre-trained model; the second is a projector.
BERT [5] has recently been proven to be a state-of-the-art language representation model for a variety of NLP tasks through pre-training on large corpora. Because of its predefined maximum sequence length, BERT readily handles short texts like tweets. Accordingly, the proposed model uses BERT for textual feature extraction.
This part takes T with size (b, t), where t is the maximum text length. After applying BERT, we obtain a vector B with size (b, 768), where 768 is the size of the last hidden layer of BERT, and apply the projector to obtain the vector F_T with size (b, k):

F_T = w_2 (w_1 B + b_1) + b_2   (2)

where w_1, w_2, b_1, and b_2 are the weights and biases of the linear layers inside the text projector.

Visual Part
This module also consists of two sub-modules. The first is an image encoder that encodes images into a representation vector. We use ViT, the Vision Transformer [6]: a BERT-like transformer encoder pre-trained on an extensive collection of images in a supervised fashion. The model is fed fixed-size image patches (16x16 resolution) that are linearly embedded. Starting from this pre-trained model, we fine-tune it within our architecture. The second sub-module is a projector, as described before.
Subsequently, this module takes I with shape (b, width, height, depth), representing the width, height, and depth of the images, and ViT projects it into the vector space V with size (b, 768), where 768 is the size of the last hidden vector of the ViT model. After applying the projector, we obtain F_I with size (b, k), the same length as the F_T obtained from the textual part. The projector works according to the following formula:

F_I = w_4 (w_3 V + b_3) + b_4   (3)

where w_3, w_4, b_3, and b_4 are the weights and biases of the linear layers inside the image projector.
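The patch-based input described above can be illustrated with a small sketch: a 224x224x3 image is split into 14x14 = 196 non-overlapping 16x16 patches, each flattened into a 768-dimensional vector before ViT's learned linear embedding. The `to_patches` helper below is our illustration, not part of the ViT API.

```python
import numpy as np

def to_patches(img, p=16):
    """Split an (H, W, C) image into non-overlapping p x p patches and
    flatten each into a vector, as ViT does before linear embedding."""
    h, w, c = img.shape
    img = img.reshape(h // p, p, w // p, p, c)       # (14, 16, 14, 16, 3)
    img = img.transpose(0, 2, 1, 3, 4)               # (14, 14, 16, 16, 3)
    return img.reshape(-1, p * p * c)                # (196, 768)

img = np.zeros((224, 224, 3))
print(to_patches(img).shape)  # (196, 768)
```

The resulting sequence of 196 patch vectors is what the transformer encoder consumes, analogously to a sequence of word tokens in BERT.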

Similarity Part
In this module, we calculate the similarity of texts and images via a supervised contrastive loss [18]. The image and text features are matrices of size (b, k); to determine whether they are similar, their inner products are calculated. The similarities between the texts and images of b tweets are represented in the (b, b) matrix that we call the predicted matrix P:

P = F_T F_I^T   (4)

This loss function considers an image and a text to be most similar to itself. Thus, we take the expected matrix to be the average of the text-to-text and image-to-image similarities, according to the following formula [19]:

E = softmax((F_T F_T^T + F_I F_I^T) / 2)   (5)

After calculating the expected matrix E, we use cross-entropy to find the actual loss. The contrastive loss is the average of the text similarity loss l_T and the image similarity loss l_I [18]:

l_T = CE(P, E),  l_I = CE(P^T, E^T),  L_sim = (l_T + l_I) / 2   (6)

Classifier Part

In this module, the text and image feature vectors are concatenated to obtain the desired news representation:

F = [F_T ; F_I]   (7)

This news representation is then passed through two linear layers for fake news classification. After this linear mapping, we obtain a vector of size (b, 2), one entry per class. According to Figure 3b, and assuming w_5, w_6, b_5, and b_6 are the weights and biases of the linear layers inside the classifier:

C = w_6 (w_5 F + b_5) + b_6   (8)

After passing this vector through the softmax function, we optimize the model by computing the cross-entropy between the predicted and actual labels, where α is the balancing factor for the weighted loss function:

L_cls = CE_α(softmax(C), L)   (9)

The loss for the whole network is obtained from the two losses, with λ as a trade-off parameter:

L_total = L_sim + λ L_cls   (10)
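The contrastive loss of the similarity part can be sketched in NumPy. This follows the CLIP-style formulation implied by [19]: the predicted matrix is the text-image inner product, the expected matrix averages the text-text and image-image similarities, and the loss averages the two cross-entropies. The function names and exact normalization here are our assumptions, not the authors' released code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, target):
    """Row-wise cross-entropy between softmaxed logits and a target distribution."""
    logp = np.log(softmax(logits))
    return -(target * logp).sum(axis=1)

def contrastive_loss(f_t, f_i):
    """Similarity loss between text features f_t and image features f_i, both (b, k)."""
    p = f_t @ f_i.T                                   # predicted similarity matrix (b, b)
    e = softmax((f_t @ f_t.T + f_i @ f_i.T) / 2)      # expected similarity matrix (b, b)
    l_t = cross_entropy(p, e)                         # text-side loss
    l_i = cross_entropy(p.T, e.T)                     # image-side loss
    return ((l_t + l_i) / 2).mean()

rng = np.random.default_rng(0)
f_t = rng.normal(size=(8, 64))
f_i = rng.normal(size=(8, 64))
print(float(contrastive_loss(f_t, f_i)))  # a positive scalar loss
```

Minimizing this loss pushes matching text-image pairs (the diagonal of P) to dominate, encoding how consistent a tweet's image is with its text.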

EXPERIMENTS
This section presents the implementation details and experimental results of applying FNR to real datasets, along with a comparison to state-of-the-art methods.

Datasets
We used two publicly available datasets gathered from social media as described below.
Twitter This dataset was introduced in [20] for the automatic verification of multimedia use, to distinguish fake from real news on Twitter 1 . The data consists of a development set and a test set, each with its own events. Each row contains text, images or video, and additional information about the user's profile. The tweets in this dataset are mainly in English (tweets in other languages were translated to English before running the experiments).
Weibo This dataset [10] was collected from the Weibo 2 social network between 2012 and 2016 and is written in Chinese. Each item in this dataset likewise contains text, user information, and an image. The texts were labeled by Weibo's official verification system. The dataset was divided by [13] into train and test sets such that the news events in each set are disjoint; we utilized the same split.
In both datasets, only items with both text and an image are used, and only the first image of each item is utilized. Table 2 lists the number of train/test and fake/real items used in our experiments.

Implementation Details
The PyTorch framework is used to build our architecture with Python 3.6. We optimized with AdamW, using different learning rates and weight decays for each part of the architecture to make the model converge faster. We used the Optuna 3 library to find the best hyperparameters, which are as follows: classification learning rate 0.005, classification weight decay 0.07, projection vector size (k) 64, dropout 0.3, and λ set to 1; α was set to the ratio of the size of the larger class to that of the smaller class. We set the batch size to 256 and the number of epochs to 100; the maximum text length (t) was 32 words for Twitter and 200 characters for Weibo. All input images were resized to height 224, width 224, and depth 3. We used a learning-rate scheduler and an early-stopping checkpoint to avoid overfitting. For the encoder models, the Hugging Face 4 library was used. The implementation is available in our repository 5 .
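As a small illustration of the class-balancing factor α described above (the ratio of the larger class's size to the smaller's), a pure-Python sketch with illustrative counts:

```python
from collections import Counter

def balance_factor(labels):
    """alpha = (# samples in the larger class) / (# samples in the smaller class),
    used to weight the classification cross-entropy toward the minority class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# illustrative counts: 6,000 real vs 4,000 fake items
labels = ["real"] * 6000 + ["fake"] * 4000
print(balance_factor(labels))  # 1.5
```

The resulting α multiplies the minority class's term in the weighted cross-entropy so that neither class dominates training.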
Because the input text is raw text gathered from social media, it is non-standard and noisy and needs to be cleaned with normalization techniques. Our preprocessing includes converting abbreviations to their complete forms, removing unnecessary punctuation, and deleting non-standard characters. Texts in other languages were also translated to match the dataset's language. After preprocessing, the text is tokenized and ready to be encoded. The images were also preprocessed by deleting low-quality images, resizing them to 224x224, and converting them into the appropriate input format for the encoder.
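The normalization steps can be sketched as follows. The abbreviation table and regular expressions here are illustrative stand-ins, not the authors' actual preprocessing rules:

```python
import re

ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please"}  # illustrative entries

def clean_tweet(text):
    """Minimal normalization: strip URLs, expand abbreviations, delete
    non-standard characters, and collapse repeated punctuation/whitespace."""
    text = re.sub(r"https?://\S+", "", text)                 # drop URLs
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    text = " ".join(words)
    text = re.sub(r"[^\w\s.,!?#@]", "", text)                # delete non-standard chars
    text = re.sub(r"([.!?,])\1+", r"\1", text)               # collapse repeated punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("pls r u ok?? http://t.co/xyz"))  # "please are you ok?"
```

The cleaned string is then passed to the BERT tokenizer; hashtags and mentions are kept here since they can carry signal in tweets.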

Baselines
The following are some benchmark algorithms and models that we chose for a comparative study.

Single Modality
We first considered only the news text and examined several algorithms to obtain a better feature vector for the textual module. The hyperparameters were tuned by Optuna, and the best results are reported.

Figure 4: Ablation study on the Twitter dataset
For the visual module, we used the news images and tuned the parameters with Optuna. The CNN baseline used in this step consists of three parallel convolutional layers with different filter sizes.

Multi-Modality
For multi-modality, we chose recent state-of-the-art works, listed in Table 1.
We tested two versions of the proposed model in this section: one without the similarity measurement (FNR-WS) and one with it (FNR-S).

Results
To compare the proposed method with previous works on multi-modal fake news detection, we considered the following evaluation criteria: accuracy, recall, precision, and F1-score. These criteria are standard for classification problems.
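For reference, the four criteria can be computed from the confusion-matrix counts as in this minimal sketch (treating "fake" as the positive class; the label values are illustrative):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary fake/real classifier."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 1 = fake, 0 = real (illustrative labels)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```

F1 balances precision and recall, which matters when the cost of missing fake news differs from the cost of flagging real news.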
We are solving a binary classification problem with balanced data, since the numbers of fake and real news items are almost equal. We also consider the AUC criterion and the receiver operating characteristic (ROC) curve for classification evaluation; we include these measurements in the comparison tables and plot the curves of three previous works for comparison.
Ablation Study An ablation study was performed to explore which data mode is more beneficial, why a multi-modal approach should be used, and why similarity measures should be considered. First, we used and tested each mode separately; then we fused the modes and investigated the effectiveness of a multi-modal approach for fake news detection. Finally, we added the contrastive loss to investigate its effectiveness in enhancing the results.

Figure 5: Ablation study on the Weibo dataset
As Figure 4 shows for Twitter, because the language of tweets is brief, imprecise, and noisy, the text mode is less accurate on its own, and the image mode performs better. However, merging the two modalities improves the outcome, demonstrating that the two modes cover each other's shortcomings. Adding the similarity part, which considers the relationship between text and image, yields further improvement.
As illustrated in Figure 5, the images on Weibo are not very expressive, and genuine news images are often exploited for false news, so the image mode does not help detect fake news on its own; the text mode outperforms the visual mode. However, when the two modalities are included concurrently, the model's performance significantly improves, as the two modalities complement each other. When the relationship between text and image is considered, the accuracy increases further.
Performance Comparison Using the criteria above, comparison results for the Twitter dataset are presented in Table 3 and for the Weibo dataset in Table 4. Figure 6 compares the ROC curves of three relevant studies and our work on the Twitter dataset, and Figure 7 compares them on the Weibo dataset. Based on the results in these tables, BERT performed better than the other text-only algorithms: it properly extracts conceptual and lexical features and detects fake news from text alone more accurately.
Moreover, ViT outperforms the other fake news detection algorithms that rely solely on images; with its transformer-based network, it extracts valuable and efficient features.
According to the results, some existing multi-modal fake news detection methods perform more favorably than single-mode methods, indicating that combining data modes and extracting their inter-modal properties is beneficial. SpotFake outperformed the other baselines because it extracts more useful characteristics from text and images. In addition, both CARMN and AMFB present performant combinations of attention-based mechanisms.
The Fake News Revealer (FNR) framework consistently outperforms the competition on a variety of performance metrics. Each modality can retain its distinct traits with our methodology while seamlessly combining similarity and complementary information from the other modalities.

CONCLUSION AND FUTURE WORKS
A new multi-modal framework for detecting fake news is presented in this work, employing text and image transformers as well as a similarity measure with a contrastive loss function. To extract textual information, we employed BERT, a transformer-based encoder. Visual characteristics were retrieved using ViT, a BERT-like transformer created explicitly for image processing. The connection between an image and its text is captured with a contrastive loss, and the fused features are fed into a two-layer linear classifier to assess whether the news is real or fake. We tested the proposed method on two freely available multi-modal datasets, carrying out extensive experiments against existing models. The proposed framework (FNR) outperforms state-of-the-art methods in detecting fake news. Since a large amount of news is published in different modes, future work could incorporate other data modes such as video, audio, and news propagation graphs.