TechRxiv

TruthSeeker: The Largest Social Media Ground-Truth Dataset for Real/Fake Content

preprint
posted on 2023-05-12, 14:45 authored by Sajjad Dadkhah, Xichen Zhang, Alexander Gerald Weismann, Amir Firouzi, Ali A. Ghorbani

Automatic detection of fake content on social media platforms such as Twitter is an enduring challenge. Technically, determining whether news on social media is fake is a straightforward binary classification problem. However, manually fact-checking even a small fraction of daily tweets is nearly impossible because of their sheer volume. To address this challenge, we crawled and crowdsourced one of the most extensive ground-truth datasets, containing more than 180,000 labels for tweets from 2009 to 2022, with 5-label and 3-label classifications obtained using Amazon Mechanical Turk. We used multiple levels of validation to ensure an accurate ground-truth benchmark dataset. We then implemented numerous machine learning and deep learning algorithms, including several variations of BERT-based models, to test the accuracy of real/fake tweet detection under both labeling schemes and to determine which models achieved the best metrics. We further analyzed the dataset by combining the DBSCAN text clustering algorithm with the YAKE keyword extraction algorithm to determine how topics cluster and relate to one another. Finally, we analyzed each user in the dataset, computing a Bot Score, Credibility Score, and Influence Score to better understand what types of Twitter users post these tweets, how much influence each tweet carries, and whether any underlying patterns relate each score to the truthfulness of the tweet. The experimental results show a marked improvement for models dealing with short-length text in solving a real-life problem such as automatically detecting fake content in social media.
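
As a rough illustration of the topic-clustering step mentioned in the abstract, the sketch below runs a small, standard-library-only DBSCAN over bag-of-words vectors with cosine distance. It is not the authors' pipeline: the toy tweets, the `eps` and `min_samples` values, and the helper names (`tokenize`, `cosine_distance`, `dbscan`, `top_terms`) are all assumptions made for this example. The keyword step uses plain term frequency as a stand-in and is not the YAKE algorithm.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def cosine_distance(a, b):
    """Cosine distance between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Textbook DBSCAN; returns one label per vector, with -1 marking noise."""
    n = len(vectors)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


def top_terms(vectors, labels, cluster, k=3):
    """Most frequent terms in a cluster (frequency stand-in, NOT YAKE)."""
    total = Counter()
    for vec, label in zip(vectors, labels):
        if label == cluster:
            total.update(vec)
    return [term for term, _ in total.most_common(k)]


# Hypothetical toy tweets, invented for this example only.
tweets = [
    "vaccine causes illness claim",
    "vaccine illness claim debunked",
    "new vaccine illness claim spreads",
    "election results certified today",
    "election results certified officially",
    "state election results certified",
    "my cat sleeps all day",
]
vectors = [Counter(tokenize(t)) for t in tweets]
labels = dbscan(vectors, eps=0.4, min_samples=2)  # parameter values are assumptions
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
print(top_terms(vectors, labels, 0))
```

In this toy run the two topical groups form separate clusters and the unrelated tweet is left as noise. On real tweets, a library implementation with richer text representations would be the more practical choice.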

Email Address of Submitting Author

sdadkhah@unb.ca

ORCID of Submitting Author

0000-0002-5582-0255

Submitting Author's Institution

University of New Brunswick

Submitting Author's Country

  • Canada
