The Largest Social Media Ground-Truth Dataset for Real/Fake Content: TruthSeeker

Automatic detection of fake content in social media such as Twitter is an enduring challenge. Technically, determining fake news on social media platforms is a straightforward binary classification problem. However, manually fact-checking even a small fraction of daily tweets would be nearly impossible due to the sheer volume. To address this challenge, we crawled and crowd-sourced one of the most extensive ground-truth tweet datasets. Utilizing PolitiFact and expert labeling as a base, it contains more than 180 000 labels from 2009 to 2022, with five- and three-label classifications created using Amazon Mechanical Turk. We utilized multiple levels of validation to ensure an accurate ground-truth benchmark dataset. We then implemented numerous machine learning and deep learning algorithms on the data, including several variations of bidirectional encoder representations from transformers (BERT)-based models and classical machine learning algorithms, to test the accuracy of real/fake tweet detection in both label settings and to determine which versions gave the highest result metrics. Further analysis is performed on the dataset using the DBSCAN text clustering algorithm combined with the YAKE keyword extraction algorithm to determine topic clusters and relationships. Finally, we analyzed each user in the dataset, determining their bot score, credibility score, and influence score for a better understanding of what type of Twitter user posts, the influence of each of their tweets, and whether any underlying patterns can be drawn from each score concerning the truthfulness of the tweet. The experimental results illustrate a profound improvement for models dealing with short-length text in solving a real-life classification problem, such as automatically detecting fake content in social media.

Sajjad Dadkhah, Member, IEEE, Xichen Zhang, Alexander Gerald Weismann, Amir Firouzi, and Ali A. Ghorbani, Senior Member, IEEE
Index Terms-Automatic detection, bidirectional encoder representations from transformers (BERT) based model training, crowd-sourced data, fake and real ground truth, fake news detection, large feature dataset, Twitter dataset, X dataset.

I. INTRODUCTION
IN THE modern era, social media has become an integral component of human existence. The exponential growth in the usage and popularity of social media has resulted in innumerable advantages for individuals and enterprises alike. Besides providing a source of leisure and entertainment, social media platforms allow users to disseminate their original content and access a broad audience base to consume diverse information, including local and international news. The prevalence of social media has transformed the communication landscape, creating a ubiquitous platform that facilitates a diverse range of user interactions and behaviors. However, despite these many positive aspects, negatives also exist. Sharing fake news has become easier with social media, allowing misleading or incorrect information to reach a large audience quickly. During the 2016 U.S. presidential election, research showed that approximately 14% of Americans relied on social media as their primary news source, surpassing print and radio. Allcott and Gentzkow [1] found that false news about the two presidential candidates, Donald Trump and Hillary Clinton, was shared millions of times on social media. Likewise, during the 2020 U.S. presidential election campaign, recent research uncovered more extensive misinformation campaigns around COVID-19. Moreover, in the aftermath of that election, specific security associations caught fake news campaigns claiming election fraud had been detected.
One major challenge in analyzing social media content and catching the fake news distributed throughout it is collecting and labeling a large enough training dataset to be used as ground truth [2], [3]. A vast volume of incorrect information is disseminated on social media daily, potentially resulting in adverse consequences for individuals and society. The implications of misinformation spread through social media are far-reaching and can significantly impact public perception, decision-making, and political outcomes. Therefore, exploring effective methods for identifying and mitigating the spread of misinformation on social media platforms is essential.
The above examples show that methods for identifying fake news are both a relevant research topic and a pressing societal need. While related tweet classification problems, such as topic or sentiment detection, have been researched considerably, automatic fake news detection requires more engagement [4].
A dataset is the most critical component for the credibility and trustworthiness of a machine learning/deep learning model. However, the limitations of the existing fake news datasets are undeniable. Most of the existing datasets need to be updated to reflect the advanced generation patterns of new fake news creators. In addition, many online social media users and posts become unavailable after they have been detected as malicious or suspicious. High performance on such a dataset cannot guarantee the applicability of any model to new data input. In this article, we designed and generated a novel Twitter dataset called TruthSeeker. As Fig. 1 illustrates, we utilized the Amazon Mechanical Turk crowd-sourcing platform to collect labels, and we explored the correlation between tweet labels and online creators'/spreaders' characteristics. Our analysis provided valuable insights that enabled us to develop a more precise method for detecting fake content in social media posts, despite their limited length. In the spirit of collaborative research, we are making our dataset and all related documents available for download on the Canadian Institute for Cybersecurity (CIC) dataset pages https://www.unb.ca/cic/datasets/truthseeker-2023.html.

II. EXISTING FRAMEWORK AND DATASETS
This section involves a detailed literature review and examination of various characteristics of multiple existing datasets for detecting fake content in social media [5], as shown in Table I. Accurately identifying fake news is essential, and a reliable dataset is a critical component of achieving this; without a relevant and complete dataset, it becomes challenging to train models that can accurately identify fake news. Murayama [5] discusses the growing interest in detecting and verifying the authenticity of information related to fake news. They conducted a comprehensive survey of 118 publicly available datasets from the web. The datasets were categorized based on their focus on detecting fake news, verifying facts, analyzing fake news, and detecting satire. The researchers also examined the characteristics and uses of each dataset, highlighting challenges and opportunities for future research.
The construction of truth-based datasets has been an endeavor undertaken for many years. One of the earliest examples of combining truth scores from multiple sources is the original PolitiFact dataset created by Vlachos and Riedel [6]. This dataset merged the truth scores from two websites, Channel 4's fact-checking blog and the Truth-O-Meter from PolitiFact, into a single scale with five labels: True, Mostly True, Half-True, Mostly False, and False. The dataset also includes the URLs and scores of the news. Our dataset creation process relied on this five-label structure and on a combination of expert and crowd-sourced data crawling to balance qualitative and quantitative data, which is crucial for creating datasets that models can train on efficiently. A different way to create a dataset was introduced with the PHEME dataset [7]. This dataset concentrated on five breaking news incidents and the corresponding discussions on Twitter. The objective was to distinguish how much of the discussion about the news was rumor versus non-rumor. To achieve this, journalists annotated each piece of data, resulting in a relatively small dataset of about 5800 unique annotated tweets for five events.
A similarly small sample size of 2900 tweets is used in the RumorEval-2017 dataset [8]. Attempting to train a large-scale model on such limited data would result in poor model performance and potential overfitting. Therefore, our pipeline needs a middle ground. To achieve this, we adopt the idea of expert annotations from the PHEME dataset and apply it to TruthSeeker: we use qualitative labeling by native English speakers to fact-check each statement and ensure accurate labeling of source statements.
Other significant dataset creation efforts, including the Twitter15 and Twitter16 datasets [9], rely on labeling just the source statement and leaving the information propagation up to interpretation, creating a large volume of tweets with potentially correct labels. Such labeling, however, lacks granularity and will inevitably produce poor model performance.
Despite more than ten years of work, even the most modern implementations of PolitiFact's data, such as the LIAR dataset [10], only have around 13 000 manually labeled pieces of data. While this is impressive, the dataset could still be much larger and cover more modern forms of news propagation, such as Twitter and Facebook. To address these limitations, TruthSeeker utilizes news articles and social media (specifically Twitter) for a much larger scale of data. These early datasets served as the foundation for TruthSeeker's creation.
Evolutions of older datasets, such as PHEME-update [11] and FakeNewsNet [12], seek to remedy this issue of small sample sizes with increased training data. The increase in sample size is a significant improvement: in the PHEME-update dataset, the number of threads has been increased more than 20×, to over 6000 rather than the original 300. FakeNewsNet combines the rated and fact-checked news from PolitiFact and GossipCop to generate a dataset with almost 24 000 unique labeled pieces of information. However, this fundamental approach to generating data will always result in a relatively small data size.
The Rumor-anomaly dataset [13], among others, produces a vast number of tweets (4 million across 1000 rumors), but these are not labeled individually. This is why we use a hybrid data collection and verification approach in TruthSeeker, which allows us to have similar amounts of expertly documented source statements as the original PolitiFact and PHEME datasets while generating over 140 000 actual data points from a smaller sample size, with each data point labeled individually.
Fast, automatic detection of fake content is crucial, as it can prevent the spread of such content; relying on fact-checking agencies alone may not suffice, particularly on social media. In a study by Vo and Lee [14], the authors highlight the problem of fake news spreading despite fact-checking systems. They point out that these systems tend to focus on fact-checking while overlooking the role of online users in disseminating false information.
More recently, a large share of fake news detection and content analysis of news and tweets has centered around health-related, and specifically COVID-19-related, misinformation. HealthStory [15] and HealthRelease [16] attempt to find patterns in data relating to real and fake health news and how it spreads throughout social media, examining user information to determine the credibility of the users who spread it. TruthSeeker contains similar features to these two datasets (as will be discussed in a later section) to provide as much context as possible on the tweet and the person who posted it.
Datasets such as COVID-HeRA [17] attempt a more granular classification of tweets using categories such as Real News/Claims, Possibly Severe, Highly Severe, Other, and Refutes/Rebuts Misinformation. From a surface-level view, these categories are extensive. Unsurprisingly, a small data size (just over 61 000 unique tweets) with more than five categories leads to middling F1 scores, although binary classification performs much better than expected. Similar results were noticed in our research; however, the size of the TruthSeeker dataset seemed to help improve the five-label classification results substantially.
Other COVID-related datasets, such as MM-COVID [18] and indic-covid [19], attempt to generate multilingual datasets for fake news related to COVID-19. Creating a corpus of information large enough to train an accurate model is difficult enough in one language; attempting to cover multiple ones is a herculean effort. The initial goal of the TruthSeeker dataset only included fake news detection in the English language. As English is the lingua franca of the world, it was viewed as the most critical language for building fake news detection models. In [20], the authors examine misinformation related to COVID-19 on social networks and how it has become a problem, leading the World Health Organization to call it an "infodemic." Various research studies [12], [21], [22], [23] have tackled the issue of identifying fake news. In a study by Helmstetter and Paulheim [2], the automatic detection of fake news in social media was treated as a binary classification problem. The authors acknowledged the challenge of obtaining a sizable training corpus, which led them to propose an alternative method using weak supervision to gather a large-scale but noisy training dataset. The dataset was labeled based on the source's trustworthiness, and a classifier was trained. However, the approach still struggled with shorter sentences such as tweets. Despite the efforts to address the issue of fake news through research on fake news detection, comprehensive, community-driven, and up-to-date fake news datasets are still missing. It is evident that the existing methods in this field have several issues that emphasize the necessity of a comprehensive and extensive dataset for social media, such as TruthSeeker.

III. DATASET CREATION
The creation of the TruthSeeker dataset begins with a combination of Real and Fake news crawled from the PolitiFact website. From this data, keywords relating to each piece of text are generated. This was done by painstakingly generating keywords manually for 700 Real and 700 Fake pieces of news. Many automated keyword generation algorithms were tried to speed up this manual process, using Python packages such as an attention-based approach, Python Keyphrase Extraction (PKE), Rapid Automatic Keyword Extraction (RAKE), Rank-based Unsupervised Keyword extraction (RaKUn), and Yet Another Keyword Extractor (YAKE). However, in preliminary testing, they resulted in poor keyword generation, providing either: 1) too few keywords to get meaningfully related tweets when calling Twitter's Full-archive search API; or 2) so many keywords that the combination became too hyper-specific to return any results at all. This problem occurred with every keyword extraction algorithm attempted, leading to the conclusion that, given the sensitive nature of Twitter keyword searching and the low reliability of the keyword generation algorithms, automation would not provide meaningful or useful results. This made manual keyword generation the obvious choice.
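For reference, below is a minimal sketch of the kind of automated extraction that was attempted, using the yake package; the parameters shown are illustrative, and the sample title is the statement quoted later in Section IV.

```python
import yake  # pip install yake

title = ("86% of Americans and 82% of gun owners support requiring "
         "all gun buyers to pass a background check.")

# Ask for up to five keyphrases of at most three words each.
extractor = yake.KeywordExtractor(lan="en", n=3, top=5)
for phrase, score in extractor.extract_keywords(title):
    print(f"{score:.4f}  {phrase}")  # lower YAKE score = more relevant
```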
Manual keyword generation was the most effective approach, as each set of keywords could be constructed to best summarize the article title and return the most results, taking a qualitative approach to assure the most accurate data possible. A general rule was to create a minimum of two and a maximum of five keywords for any associated piece of text. It was observed, through both the automated and manual keyword generation, that any fewer or more would result in tweets that were unrelated to the topic or so hyper-specific that no results would exist; thus, the limit of two to five keywords was set. Careful attention was paid to ensuring that each set of keywords was generated only in reference to what the original text was discussing, and to limiting the keywords to words within the original text as much as possible, to further ensure that no personal biases could become a factor in their creation. The final number of tweets crawled for the 700 Real and 700 Fake pieces of ground-truth news was slightly under 186 000, giving on average 133 tweets per piece of news and exceeding our initial hope of 100 per news piece.
Below is an example of a piece of Real news, its associated keywords and the number of results returned from the custom API.
Results: Using the getStats() API call from our custom Twitter-integrated endpoint, we can observe that this piece of news (with the unique ID of 19) returns 88 tweets utilizing the manual keywords listed above.
The getTweets() endpoint returns all associated tweets with their full metadata (created_at, id, text, etc.) in JavaScript Object Notation (JSON) format. Below is an example of the information returned for one of the 88 tweets matching the ground-truth news title with ID 19.
For the creation of this dataset, the main pieces of information extracted from the returned JSON data were a cleaned version of the "text" called "cleaned_text," the Twitter ID of the user called "id," and the time of the tweet's creation called "created_at." This information was then processed and saved in a comma-separated value (.csv) file with the appropriate formatting, to later be fed into the Amazon Mechanical Turk system.
Each row of this CSV file contains the original tweet (OT), the metadata discussed earlier, a copy of the ground-truth "statement" (the original article title), the "manual_keywords," and that article title's unique "id" or "query_id." This duplication is required for creating the individual tasks to be completed in the Amazon Mechanical Turk system. One last check is done to make sure that the file contains no non-UTF-8-encoded characters or symbols. After this, the CSV file is uploaded to Amazon Mechanical Turk for processing.
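A minimal sketch of this extraction step, assuming the getTweets() output is a JSON list of tweet objects carrying the created_at, id, and text fields described above; the cleaning rules and helper names are our own illustrative choices, not the exact pipeline.

```python
import csv
import json
import re

def clean_text(text: str) -> str:
    """Strip URLs and @-mentions (our assumption about the cleaning step)
    and drop any non-UTF-8-safe characters."""
    text = re.sub(r"https?://\S+|@\w+", "", text)
    return text.encode("utf-8", errors="ignore").decode("utf-8").strip()

def tweets_to_csv(raw_json: str, statement: str, keywords: str,
                  query_id: int, out_path: str) -> None:
    rows = json.loads(raw_json)  # list of tweet objects from getTweets()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["created_at", "id", "cleaned_text",
                         "statement", "manual_keywords", "query_id"])
        for t in rows:
            # Duplicate the statement, keywords, and query_id on every row,
            # as required for building the individual Mechanical Turk tasks.
            writer.writerow([t["created_at"], t["id"], clean_text(t["text"]),
                             statement, keywords, query_id])
```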

IV. CROWD-SOURCING AND LABELING UTILIZING AMAZON MECHANICAL TURK
The Amazon Mechanical Turk service was a key part of the creation of the TruthSeeker dataset, allowing for the construction of a much larger dataset with the help of "Turkers" (individuals performing an Amazon Mechanical Turk task) rather than manually assessing each tweet. Each row of our dataset was translated into, and treated as, a human intelligence task (HIT): a micro-job that needs to be completed by a Turker. A visualization of the HIT is shown in Fig. 2.
1) Our HIT was limited to Master Turkers only, meaning that only Turkers assessed by Amazon to be of the highest quality were allowed to participate in our micro-jobs. This allowed us to be sure of having highly skilled Turkers making the judgments in the tasks we assigned, rather than workers rushing through to receive payment as fast as possible, giving us a baseline skill and competency level that non-Master Turkers would not have been guaranteed to afford us.
2) The HIT that we published for the Amazon Turkers to complete was a variation of a basic semantic similarity task. We asked the Master Turkers to examine the source statement (i.e., "statement" in Fig. 2) and an accompanying tweet. They would then need to decide to what degree the tweet agrees with the statement. A set of instructions in the sidebar was also included for the Turker to read before beginning the task. The instructions provided definitions for each of the five options (Agree, Mostly Agree, Unknown, Mostly Disagree, Disagree) and an example tweet matching each category.
1) Statement: "86% of Americans and 82% of gun owners support requiring all gun buyers to pass a background check."
2) Agree: The tweet agrees with the ideas presented in the statement.
3) Tweet: "In the same way that doctors shouldn't write a prescription without knowing a patient's medical history to ensure the drug will do no harm, gun sellers shouldn't be allowed to complete a transaction w/o a background check on the buyer. The majority of Americans support this."
4) Mostly Agree: The tweet agrees with the majority of the ideas presented in the statement.
5) Tweet: "More than 50% of Americans are in favor of some form of gun control, whether it be background checks or something else entirely. . ."
6) Unknown: The tweet neither aligns nor differs with the presented statement.
7) Tweet: "America is a country that loves guns."
8) Mostly Disagree: The tweet disagrees with the majority of the ideas presented in the statement.
9) Tweet: "I understand that some people are in favor of background checks, but most REAL Americans are not."
10) Disagree: The tweet disagrees with the ideas presented in the statement.
11) Tweet: "Democrats are busy clutching their pearls over gun control, they claim the founders wouldn't support current Americans right to bear arms. I'd like to remind Democrats, our Founders had just finished a war against their former countrymen. Shall not be infringed is pretty clear."

A final measure was taken to ensure higher accuracy of HIT responses: each HIT was completed by three separate Master Turkers. This allowed us to further verify the final label applied to each tweet after all HITs were completed.

V. RESULTS
The results we received from the Master Turkers were classified in two separate ways. Algorithm 1 illustrates the five-way and three-way label creation methods utilized in this paper. A five-way label includes all the original categories (Unknown, Mostly True, True, False, Mostly False), and a three-way label includes (Unknown, True, False). The creation of the five-way labeled dataset was much more restrictive in terms of which data could be used; the specific protocol is given in Algorithm 1. For the three-way labels, if at least two of the three Turkers share a sentiment, then the final result is labeled as True, False, or Unknown. This method allows for the retention of much more data while still maintaining high confidence in the accuracy of the results, as there was at least some shared sentiment toward the validity of the news and therefore the truthfulness of the tweet. Table II illustrates the breakdown of the five-way and three-way label results using both Master and standard Amazon Mechanical Turkers. This comparison was done to gauge the quality of Master Turkers over standard ones, as well as to show the spread of results from an initial test batch of 1000 tweets.
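As a rough illustration of the three-way consolidation just described, the following sketch collapses the three Turker responses under the two-thirds rule; whether this matches Algorithm 1 in every detail is an assumption, and the stricter five-way protocol is not reproduced here.

```python
from collections import Counter

# Collapse the five response options onto three sentiments.
TO_THREE = {"Agree": "Agree", "Mostly Agree": "Agree",
            "Disagree": "Disagree", "Mostly Disagree": "Disagree",
            "Unknown": "Unknown"}

def three_way_majority(responses: list[str]) -> str:
    """Two-thirds majority over the three Turker responses for one tweet."""
    votes = Counter(TO_THREE[r] for r in responses)
    sentiment, count = votes.most_common(1)[0]
    return sentiment if count >= 2 else "NO MAJORITY"

print(three_way_majority(["Agree", "Mostly Agree", "Unknown"]))  # Agree
print(three_way_majority(["Agree", "Disagree", "Unknown"]))      # NO MAJORITY
```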
Below is a random news statement pulled from our data, together with an associated tweet for each category (Agree, Mostly Agree, Unknown, Mostly Disagree, Disagree) related to it.
Statement: "Ivermectin sterilizes the majority (85%) of the men who take it."
1) Tweet (Unknown): ". . . Now their "treatment alternative" is not just killing them, but rendering the men functionally or fully sterile. They claimed the free vaccine harms women's fertility and genetics, so instead they pay big bucks for Ivermectin, which mutates sperm and sterilizes the men!"
2) Tweet (False): "@90mifromneedles @Blackamazon I think the no schadenfreude train left without me. I saw Ivermectin apparently sterilizes the majority (85%) of men that take it and followed the link to the study. My first thought was well, at least those pushing its use for COVID will no longer contrib to the gene pool."
3) Tweet (Mostly False): "@Acyn Ivermectin will make them shit out their stomach linings and sterilizes men LOL."
4) Tweet (True): "@redsteeze @JerseyWalcott That's absurd. Pretty soon, they're going to start claiming that (life giving) Ivermectin sterilizes men and shrinks their sexual organs."
5) Tweet (Mostly True): "@jeek The study you linked does not say that it sterilizes 85% of men that take it. It says that "a recent report showed that 85% of all male patients treated in a particular center with ivermectin in the recent past who went to the laboratory for routine tests were discovered to. . ."

It should be noted that the "Unknown" category was problematic in that the Master Turkers could use it as a catch-all when they were unsure of what response to give, rather than only when a tweet had an unknown relation to the source statement. It may be advantageous either to remove this classification category entirely or to split it into more granular categories to get more accurate results. For the purposes of having the most accurate data possible, though, we decided to leave this option in, as we wanted to make sure the data we used was of the highest quality possible.

VI. TRUTHSEEKER MODEL ANALYSIS
This section showcases the results of training two model types on the TruthSeeker dataset: a standard binary classification model (with categories True and False) and a four-label classification model (False, Mostly False, Mostly True, True). Both models attempt to predict the truthfulness of a tweet using their respective classification categories.
The final TruthSeeker dataset exists as one CSV file that is preprocessed and later used for training our models. Its initial raw structure is illustrated below. Table III lists the features mainly used in our dataset.

A. Dataset Preprocessing
After importing the CSV file, a few preprocessing steps are applied to the data before model creation. First, any rows with a majority answer column value of "NO MAJORITY" or "Unrelated" are removed. The "NO MAJORITY" label indicates that the analysis of the tweet by three separate Amazon Turkers was inconclusive as to the label it should receive. The "Unrelated" label was used to weed out tweets not directly related to the statement being made, and thus unusable for determining the truth of the tweet in relation to the original statement.
These rows are dropped (using basic dataframe comprehension), and the new dataset is split into two separate dataframes: one containing all data except the five_label_majority_answer column and one containing all data except the three_label_majority_answer column. For each of the two newly created dataframes, we generate "ground_truth_value" and "categorical_label" columns. The "ground_truth_value" column takes the BinaryNumTarget of the statement and the majority answer of the tweet as inputs and generates a truthfulness value. Tables IV and V give the logic tables for the four-label and two-label conversions. After this conversion, the labels are encoded and placed in the "categorical_label" column for easier use. The dataset contains 150 000 unique tweets coinciding with 1400 unique statements and their manually generated keywords, and it is balanced exactly 50/50 between True and False statements.
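A minimal sketch of this preprocessing, assuming the column names given above and our reading of the two-label conversion in Table V; the file path is illustrative.

```python
import pandas as pd

df = pd.read_csv("TruthSeeker.csv")

# Drop inconclusive or off-topic rows, then keep the decisive sentiments.
df = df[~df["three_label_majority_answer"].isin(["NO MAJORITY", "Unrelated"])]
df = df[df["three_label_majority_answer"].isin(["Agree", "Disagree"])]

# Two-label conversion (our reading of Table V): a tweet is truthful when it
# agrees with a true statement or disagrees with a false one.
def ground_truth(row) -> int:
    agrees = row["three_label_majority_answer"] == "Agree"
    statement_true = row["BinaryNumTarget"] == 1
    return int(agrees == statement_true)

df["ground_truth_value"] = df.apply(ground_truth, axis=1)
df["categorical_label"] = df["ground_truth_value"].map({1: "True", 0: "False"})
```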
Fig. 3(b) shows that a clear majority of Turkers found tweets related to a source statement to either agree or mostly agree with it, while a large percentage of the data was inconclusive and thus marked as NO MAJORITY. Adjusting for this in Fig. 3(a) using a two-thirds majority rule, most of the Turker results that were once NO MAJORITY can be grouped into either the agree or disagree column. After performing the two-thirds majority conversion, we can see that the Turkers determined the majority of tweets to be in agreement with the source statement, with a small subset of disagreeing responses or answers too difficult to place easily into either of the major categories.

VII. MODEL TESTING
It can be difficult to extract important information from short texts like tweets [32], even with accurate labels. The features we used included the number of uncommon or complex words, the number of adjectives, and metadata such as how many replies the user has received. As Table VII shows, we achieved impressive results in detecting fake social media content, especially considering the limited amount of reliable data available for short texts. In this section, we show how these results can be improved even further by using different versions of bidirectional encoder representations from transformers (BERT)-based deep learning models. Table VI should be referenced as necessary for all metrics used in the results section. Our study used 50 unique features and six different machine learning models (listed in Table VIII).
With the TruthSeeker dataset fully developed and realized, the next goal of our research was to implement multiple BERT-based models to see whether it would be possible to accurately assess the truthfulness of a tweet; below, we implement four such models. Fig. 4(e)-(h) illustrates the results of running the ROBERTA model on the TruthSeeker dataset for ten epochs. The results are not as strong but still quite promising: with ten epochs, the accuracy reaches almost 69% (Fig. 4(e)) with no apparent convergence. Further tests with more epochs could perhaps have achieved an accuracy of 70% or higher, and other hyperparameters could also be tweaked to see whether any meaningful improvement results.
Fig. 4(i)-(l) illustrates the results of running the classical BERT model on the TruthSeeker dataset for five epochs. We achieve an accuracy (Fig. 4(j)) slightly higher than with ROBERTA, DISTILBERT, and ALBERT on our binary label, although they are still fairly close matches. This marginal difference is also potentially attributable to the one-epoch difference in training and the increased model size of BERT compared to the others mentioned.
Fig. 4(m)-(p) illustrates the results of running the classical BERT model on the TruthSeeker dataset. While the results are fairly underwhelming, they are consistent with the accuracy of other pre-trained models. As can be seen in Fig. 4(m), the model seems to converge with a relatively low accuracy (Fig. 4(n)) and high evaluation loss (Fig. 4(p)). More training time or iterations seem unlikely to generate better results and would more likely overfit the model to our dataset.
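As a hedged illustration of this fine-tuning setup, a binary BERTWEET classifier built with the Hugging Face transformers Trainer might look as follows; the checkpoint choice (vinai/bertweet-base), hyperparameters, and toy data are our assumptions, not the paper's exact configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in rows; the real inputs are the tweets and two-label targets
# produced by the preprocessing described in Section VI-A.
data = Dataset.from_dict({
    "text": ["Most Americans support universal background checks.",
             "Ivermectin sterilizes the majority of men who take it."],
    "label": [1, 0],  # 1 = True, 0 = False
})

tok = AutoTokenizer.from_pretrained("vinai/bertweet-base")
data = data.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length", max_length=128),
                batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/bertweet-base", num_labels=2)

# Hyperparameters here are illustrative, not the paper's exact settings.
args = TrainingArguments(output_dir="ts-bertweet", num_train_epochs=4,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=data).train()
```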

VIII. SOCIAL POST TEXT CLUSTERING
This section presents the results of running the DBSCAN text clustering algorithm on our TruthSeeker dataset with different hyperparameters. We embed our tweets using the Sentence Transformer (all-mpnet-base-v2) and then apply the DBSCAN algorithm with varying epsilon values. After applying the YAKE keyword extractor to each resulting cluster, we gain a better understanding of the content referenced in our tweets and news.
We then take the list of keywords and remove duplicates and sub-strings while also considering the case sensitivity of words. Next, we display the top ten clusters and their associated cleaned keywords; the results of these tests and their outputs follow. As Table XI illustrates, applying DBSCAN clustering to the Fake and Real tweet data with different epsilon values resulted in more than 100 clusters, with precise keywords detected for each cluster (the top ten clusters ranked by size are shown in Table XI). This gives an insight into the most important and highly related topics within the data itself [33]. These results showcase how versatile the data in TruthSeeker is, making it well suited for training automatic detection algorithms in the fake news domain. Having access to the Twitter API v2 Full-archive search enabled us to view tweets as far back as 2006 (the founding of Twitter) and, in our case, from 2009 to 2022.
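A minimal sketch of this clustering pipeline, combining sentence-transformers, scikit-learn's DBSCAN, and yake; the epsilon, min_samples, and stand-in tweets are illustrative.

```python
import yake
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

# A handful of stand-in tweets; the real input is the cleaned_text column.
tweets = [
    "Background checks for all gun buyers are widely supported.",
    "Most Americans back requiring gun buyers to pass a background check.",
    "Ivermectin does not sterilize the men who take it.",
    "Claims that ivermectin sterilizes 85% of men are unsupported.",
]

# Embed tweets with the all-mpnet-base-v2 sentence transformer.
embeddings = SentenceTransformer("all-mpnet-base-v2").encode(tweets)

# Cluster with cosine distance; eps is swept over several values in practice.
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)

# Describe each cluster (label -1 is DBSCAN noise) with YAKE keywords.
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)
for cluster in sorted(set(labels) - {-1}):
    text = " ".join(t for t, lab in zip(tweets, labels) if lab == cluster)
    print(cluster, [kw for kw, _ in extractor.extract_keywords(text)])
```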

IX. USER ANALYSIS
This section analyzes the individual users behind each tweet crawled during dataset creation, focusing on three metrics: bot score, credibility, and influence.

A. Bot Score
A user's bot score is a value between 0.0 and 1.0 determined by a model trained on 17 features derived from the user's account, including follower count (number of followers), friend count (number of friends), favorite count (number of favorites), status count (number of tweets), account age, list count (lists created), and URL count (number of URLs posted); 1.0 indicates the highest likelihood of being a bot and 0.0 the lowest. A bot is a nonhuman Twitter user, commonly associated with an account that proliferates spam, disinformation, or useless information within a system. Using the support vector machine (SVM) configuration developed in [21], we run our data through this model to better understand the bot-to-real-user ratio in our dataset.
Fig. 6 represents the results of running this bot score test on all data. Any result with a score of less than or equal to 0.5 is considered "Not Bot," whereas anything greater than 0.5 is considered "Bot." As can be seen, bots make up a minority of the overall data but are still sizeable enough for there to be a potential for false information to be disseminated. The split of bots between fake and real topics is very similar, showcasing that bots are involved in all issues on Twitter.
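A minimal sketch of the scoring-and-thresholding step, using an SVM with probability outputs over a 17-feature account matrix; the kernel choice and synthetic data are our assumptions, as the actual configuration follows [21].

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for the 17 account features (follower count, friend count,
# favorite count, status count, account age, list count, URL count, ...);
# real training rows would come from the labeled accounts used in [21].
X_train = rng.random((200, 17))
y_train = rng.integers(0, 2, 200)          # 1 = bot, 0 = human

clf = SVC(kernel="rbf", probability=True)  # kernel choice is illustrative
clf.fit(X_train, y_train)

X_users = rng.random((5, 17))              # feature rows for users to score
bot_scores = clf.predict_proba(X_users)[:, 1]          # P(bot) in [0, 1]
labels = np.where(bot_scores > 0.5, "Bot", "Not Bot")  # the 0.5 cutoff
print(list(zip(bot_scores.round(3), labels)))
```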
The results indicate that a user who interacts with True rather than Fake topics appears to have higher credibility, which on a common-sense level appears accurate: more credible people spend time engaging with real topics rather than fake ones.

B. User Influence
We can classify an influential user in a few separate ways. First, we define an influential user as one whose actions in a network are capable of affecting the actions or thoughts of many other users in the network; in the formulas that follow, such a user is denoted by i.
TABLE XI
TWEET CLUSTERS AND THEIR ASSOCIATED KEYWORDS USING THE DBSCAN ALGORITHM

Below are a few formulas proposed for calculating the influence score of each individual user in a network. Equation (1) represents the general activity, and (2) is the signal strength, where OT1 is the number of original tweets posted by the author, RP1 is the number of replies posted by the author, RT1 is the number of retweets made by the author, and FT1 is the number of tweets of other users marked as favorite (liked) by the author, following Cheng et al. [34]

$$\mathrm{NetworkScore}(i) = \log(F_2 + 1) - \log(F_4 + 1) \tag{3}$$

Equation (3) shows our interpretation of the social network score and its potential, where F2 is the number of topically active followers and F4 is the number of topically active friends; RT3 is the number of users who have retweeted the author's tweets, M4 is the number of users mentioning the author, and F1 is the number of followers.

1) Proposed Influence Score: The final influence score chosen for measuring the influence of users within the TruthSeeker dataset is described by an equation in which IF is the influence score, FC is the followers count, SC is the statuses count (the number of tweets the author has posted), and LC is the listed count (the number of times the user has been added to a list by another person). We normalized the final score; the alpha used for our tests was 0.7 and the beta was 0.3, weighing the fact that a user was added to a list as a sign that they are viewed as a trustworthy source of information. The graphs below showcase the results of the normalization of the data. As can be seen, the average influence of a user in our system is relatively low; outliers with massive followings can be easily seen, as can inactive or bot accounts.
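As a rough sketch only, the following implements the network score of (3) together with a hypothetical normalized influence score that blends follower and listed counts using the stated alpha = 0.7 and beta = 0.3 weights; the blend itself is our illustrative assumption, not the paper's published formula.

```python
import math

def network_score(topically_active_followers: int,
                  topically_active_friends: int) -> float:
    """Social network score from (3): log(F2 + 1) - log(F4 + 1)."""
    return (math.log(topically_active_followers + 1)
            - math.log(topically_active_friends + 1))

# Hypothetical influence score: the alpha/beta weights (0.7/0.3) come from
# the text, but combining log follower count (FC) and log listed count (LC)
# this way is an illustrative assumption, not the paper's formula.
def influence_score(fc: int, lc: int,
                    alpha: float = 0.7, beta: float = 0.3) -> float:
    return alpha * math.log(fc + 1) + beta * math.log(lc + 1)

def normalize(scores: list[float]) -> list[float]:
    """Min-max normalize scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
```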

C. User Credibility
A user's credibility score is calculated using a simple followers/(friends + followers) equation. This score represents their ability to effect change in a meaningfully large way in our system: users with higher credibility are able to spread information further and to more people. Fig. 7 shows the results of applying the equation to the full dataset of both Real and Fake tweet statements. Showcasing the reality of the Twitter ecosystem as a whole, Fig. 7(a) and (b) illustrates that most users in our system have a middling level of influence; most users have a fairly low impact on their environment and on others around them. However, some outliers with a large amount of influence are able to disseminate information easily and widely.
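The credibility formula is simple enough to sketch directly; the zero-guard for accounts with no followers or friends is ours.

```python
def credibility_score(followers: int, friends: int) -> float:
    """Credibility = followers / (friends + followers); 0.0 when both are 0."""
    total = friends + followers
    return followers / total if total else 0.0

print(credibility_score(900, 100))  # 0.9: followed far more than following
print(credibility_score(10, 990))   # 0.01: follows many, few follow back
```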
Overall, these metrics showcase the well-balanced nature of our dataset and its mirroring of the real-world Twitter environment that most users experience.

X. CONCLUSION
The expansion of social media platforms such as Twitter created an opening for unverified platforms and users to spread real and fake news. Therefore, the automatic detection of this misleading information on social media, and finding ways to combat it, has become an ongoing challenge for researchers. Addressing this challenge is critical to preventing the spread of misinformation, which can cause significant harm, especially in times of crisis. One of the primary obstacles to detecting fake content on social media platforms is the vast volume of content to be evaluated manually. This massive volume of data demands different machine learning and deep learning algorithms for automating the process. However, the success of such algorithms depends heavily on the quality of the dataset used for training.
The existing fake news datasets need to be updated and expanded in scope; the TruthSeeker dataset significantly contributes to fake news detection in social media by addressing this problem. The dataset, which contains more than 180 000 labels from 2009 to 2022, was collected using Amazon Mechanical Turk, a crowd-sourcing platform, and was verified using a three-factor active-learning verification method, ensuring its credibility and trustworthiness. Employees of the authors' institution further verified the two- and five-label classifications, and 456 unique, highly skilled Amazon Mechanical Turk workers labeled each tweet three times. Moreover, the dataset contains binary and multiclass classifications, allowing for a more precise and nuanced analysis of tweet content.
To evaluate the accuracy of the detection models, the authors implemented various machine learning and deep learning algorithms, including multiple BERT-based models. The results demonstrated significant improvements in the ability to automatically detect fake content, even with the limited length of tweets. Additionally, the authors introduced three auxiliary social media scores (bot, credibility, and influence) to better understand the patterns and characteristics of Twitter users posting fake or true tweets and their impact on the content they post. Furthermore, the authors utilized clustering-based event detection to analyze the relationships between topics and tweets, and the correlation between tweet labels and online creators'/spreaders' characteristics. This analysis provided valuable insights that can help improve the precision and effectiveness of fake content detection models.
In conclusion, the TruthSeeker dataset significantly contributes to the field of fake news detection, specifically regarding Twitter. The TruthSeeker dataset was a project undertaken by the Canadian Institute for Cybersecurity to determine the validity of tweets posted on Twitter in an automated way. All the data will be available on the CIC dataset page https://www.unb.ca/cic/datasets/truthseeker-2023.html. The extensive collection of labels, rigorous verification methods, and focus on Twitter content make this dataset valuable for researchers in this area. Additionally, applying multiple BERT-based models and auxiliary social media scores, combined with clustering-based event detection, has provided valuable insights that can help address the long-standing challenge of automatically detecting fake content on social media platforms. While there are still challenges to be addressed, the TruthSeeker dataset has shown promise in advancing the field of fake news detection and is a vital step toward addressing the issue of automatically detecting misinformation on social media platforms.

Fig. 1. Overall pipeline of the dataset generation method in this article.

Fig. 2. Example view the Master Turker sees when completing the HIT.

Fig. 3. Histograms showcasing the distribution of crowd-sourced results from the Amazon Mechanical Turkers. (a) Three-label majority distribution of decisions. (b) Five-label majority distribution of decisions.

Fig. 4(a)-(d) illustrates the results of running the ROBERTA model on the TruthSeeker dataset for four epochs. Extremely promising accuracy and F1 scores are achieved, as seen in Fig. 4(b) and (c), with accuracy and an F1 score of almost 96% and a relatively low amount of training time. This model appears to converge around four epochs, making it doubtful that any meaningful improvements could be made with additional training iterations.

Fig. 5(a)-(d) illustrates the results of running the DISTILBERT model on the TruthSeeker dataset. Results for accuracy (Fig. 5(b)) and F1 score (Fig. 5(a)) are quite high on this model as well, giving a marginally lower accuracy than the base BERT model yet still maintaining around 95%. Being around 40% smaller than the original BERT model, and losing a marginal amount of performance because of this, may be the cause of the slightly reduced statistical values for this model. Fig. 5(e)-(h) illustrates the results of running the BERTWEET model on the TruthSeeker dataset. Boasting the highest accuracy (Fig. 5(f)) and F1 scores (Fig. 5(e)) of all pre-trained models attempted, BERTWEET provides the best results with our dataset; being the first public large-scale pre-trained language model for English tweets, this is not surprising. Fig. 5(i)-(l) illustrates running the ALBERT model on the TruthSeeker dataset. Table IX illustrates the results of all two-label classification models; BERTWEET showcases a clear improvement over all other model types, with the highest accuracy and F1 scores. Table X illustrates the results of all four-label classification models; ROBERTA appears to have the highest overall performance. With the lowest accuracy, the ALBERT framework's lightweight BERT approach results in poorer performance; however, the performance is still impressive, with scores of over 94% in the two previously mentioned metrics.

TABLE I

TABLE II
FIVE-WAY LABEL AND THREE-WAY LABEL

TABLE III
LIST OF FEATURES IN THE TRUTHSEEKER DATASET WITH THEIR ASSOCIATED DESCRIPTIONS

TABLE IV
FOUR-LABEL CONVERSION TRUTH TABLE

TABLE V
TWO-LABEL CONVERSION TRUTH TABLE

Table IV illustrates the four-label conversion truth table, which takes into account the original statement's validity and the majority answer of the tweet; a final truthfulness value is then assigned. Table V shows the two-label conversion truth table, similar in nature to the previous one, except that only two truthfulness values are possible.

TABLE VI
EVALUATION METRIC REFERENCE SHEET