Abstract
Automatic detection of fake content in social media such as Twitter is
an enduring challenge. Technically, determining fake news on social
media platforms is a straightforward binary classification problem.
However, manually fact-checking even a small fraction of tweets is
infeasible given the sheer volume posted daily. To
address this challenge, we crawled and crowdsourced one of the most
extensive ground-truth datasets, containing more than 180,000 labels for
tweets posted between 2009 and 2022 under both 5-label and 3-label
classification schemes, using
Amazon Mechanical Turk. We utilized multiple levels of validation to
ensure an accurate ground-truth benchmark dataset. We then implemented
a range of machine learning and deep learning models, including several
variations of BERT-based architectures, to evaluate real/fake tweet
detection accuracy under both labeling schemes and to identify which
models achieved the highest metrics. We further analyzed the dataset by
combining the DBSCAN text clustering algorithm with the YAKE keyword
extraction algorithm to uncover topic clusters and their relationships.
Finally, we analyzed
each user in the dataset, computing a Bot Score, Credibility Score, and
Influence Score to better characterize the types of users who post, the
influence each of their tweets carries, and whether any underlying
patterns link these scores to tweet truthfulness. The experimental
results
demonstrated substantial improvements for models handling short-length
text on a real-world problem: automatically detecting fake content in
social media.