Spam-Detection with Comparative Analysis and Spamming Words Extractions

Abstract: Communication through email plays an essential part in every sector of our day-to-day life. Considering its significance, it is important to filter spam from legitimate email. Spam email, also known as junk email, consists of unwanted messages sent in large quantities through electronic media. Most spam is commercial in nature and is not only irritating but also harmful, because it may carry malicious scams, links to malware-hosting sites, or viruses attached to the message. In this paper, we identify spam emails and show how they can be distinguished from legitimate emails. We deployed four machine learning models and two deep learning models over four datasets and a combined dataset. We also extract the important keywords that occur repeatedly in the spam email repositories. This knowledge helps to detect spam emails for personal and community security purposes.


I. INTRODUCTION
E-mail refers to the exchange of data among people who use digital networking equipment. In 1971, Ray Tomlinson invented and used the very first e-mail. He developed the first system capable of sending mail across the ARPANET between users on different hosts, using the @ sign to link the user name with the destination host. The technology became known as email in the early 1980s [1,2,3].
Email has become an easy way to share data, ideas, and written correspondence around the world, and it is a means of communicating between individuals via electronic devices. More specifically, an e-mail is a message that may contain text, files, photographs, web addresses, or other items and that is transmitted over a network to a specific person or to a group of people at the same time. No additional fees need to be paid for emailing individuals or groups. In 2019, there were 3.9 billion e-mail users globally, and the number is expected to grow to over 4.48 billion in 2024. In 2018, about 281 billion e-mails were sent and received globally every day. Nowadays, e-mail is a convenient and widely available service, and a sensitive one as well. It is accessible from anywhere, facilitates bulk message transmission, and enables instant access to information, including any type of file. As a result, email has become one of the world's most common communication systems [4,5].
However, as technology advances, email has been undermined by a practice called email spamming. Email spam takes the form of unwanted, unsolicited electronic contact. Web users who give their e-mail addresses for website and newsletter sign-ups must brace themselves for a potential flood of malware and advertising messages. While most unwanted e-mails are disturbing but essentially innocuous, users have to be concerned about malicious e-mails sent to steal their online identity and computer data. Since the early 1990s, email spam has grown steadily, and it was expected to account for about 90 percent of worldwide mail traffic by 2014 [6,7]. Spam emails serve no purpose for the receiver: they drain storage space and network bandwidth and consume receivers' time. Spam accounted for 53.95 percent of e-mail traffic in March 2020. Spam emails usually aim to gather confidential customer or personal information so that scammers can target and harass the receiver. It is therefore important to detect junk or spam email and keep it out of the inbox [8,9].
In this paper, we identify spam emails in collections of emails using machine learning and deep learning approaches. We use four different datasets: the TREC 2007 spam dataset, the Enron email dataset, the PU dataset, and the Ling-spam dataset. We apply both machine learning and deep learning models, which gave us good accuracy. We created word clouds to show the spam words responsible for spam emails in each dataset. First, we used feature engineering to find the most important features in the collected datasets. The data was then preprocessed to improve the accuracy of the models. After preprocessing, different techniques were used to convert the preprocessed data into numerical vectors. The datasets were also split into train, test, and cross-validation sets. We used several machine learning models, namely logistic regression, XGBoost, support vector machine, and random forest, and deep learning models based on word embedding and LSTM.
The rest of the paper is organized as follows. Section II gives a brief outline of the related literature. Section III explains the proposed methods and briefly describes the techniques and datasets used in the study. Section IV presents the experimental setup and results. Finally, Section V concludes the paper.

II. RELATED WORKS
Several studies have classified spam mail using various datasets with machine learning and deep learning approaches. For the TREC07 dataset, one study used Naïve Bayes and a neural network (MLP), with low performance as its limitation. Sharma, Prajapat and Aslam [10] used a keyword-based classification algorithm to implement multilayer perceptron (MLP) neural network and Naïve Bayes models. Using statistical analysis of the messages, they labeled each message in the TREC07 dataset as either spam or ham. A significant downside of the MLP model is that, relative to NB, training is slow and costly. Another study used the Ling-spam dataset and assessed performance by accuracy alone. Palanisamy, Kumaresan and Varalakshmi [11] implemented a hybrid of the negative selection algorithm (NSA) and PSO on the Ling-spam dataset, using a local outlier factor (LOF) as the performance measure for the predictor distribution. Because measures such as memory use, computation time, and false positives were not included in evaluating the system, the success of their method cannot be fully judged; only its classification accuracy was used. The accuracy of the classification is still very poor, so the work could be improved further with better optimization techniques. The PU and Enron spam datasets were used in a study whose limitation is time-consuming training: Akshita [12] applied deep learning to content-based spam classification, using a DL4J network model on the PU1, PU2, PU3, PUA and Enron spam datasets.
Another study used both the PU1 and Ling-spam datasets. Zhou [13] performed this work, whose limitation is the interpretation and computation required for the higher and lower spam thresholds. To minimize the chance of misclassification, the work uses an updated Naïve Bayesian junk mail filter with a cost-effective three-way method. Another study was conducted on the TREC and SpamAssassin datasets by Zhong [14]; its limitation is that word obfuscation affects the classification accuracy. C. Hua Li [15] carried out research based on the Ling-spam, PU1 and PU3 datasets; the limitation of that work is that training takes more time. Sunita [16] studied both the TREC and Ling-spam datasets using machine learning and deep learning, implementing a CNN and different machine learning models. Sanjiban and Abhishek [17] also combined machine learning and deep learning, using SVM and ANN models for the classification. Ammara and Hikmat [18] used several machine learning and deep learning models in their study: SVM, AdaBoost, MLP, DNN, and RF. Shikhar Seth [19] worked on the Enron spam dataset using a deep learning technique. Ankit Narendrakumar Soni [20] also worked on the Enron dataset using deep learning, with LSTM and CNN models.
In a recent approach [21], the researchers used SVM to separate spam from legitimate emails, with Laplace feature map algorithms for extracting important features. Although three different datasets were used in the study, the accuracy was not up to the mark. In another study [22], junk emails were filtered using different machine learning and deep learning approaches; there, word embedding with deep learning outperformed the other algorithms. Unsupervised learning has also been employed for spam filtering and, perhaps surprisingly, performs admirably in terms of accuracy [23].

III. METHODOLOGY
This section describes different steps illustrated in figure 1.

A. Data Acquisition and Preprocessing
In this paper, four different datasets are used for both machine learning and deep learning: the TREC 2007 spam dataset, the Enron email dataset, the PU123A corpora, and the Ling-spam dataset. We also create a 'Basket' dataset, which is a combination of all four selected datasets. Each dataset is then preprocessed to fit the algorithms. The dataset description is given in Table I.
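As a rough illustration of how a combined dataset could be assembled and cleaned, the sketch below merges toy corpora into one list and applies a simple text normalization. The cleaning steps (lowercasing, dropping punctuation, collapsing whitespace), the record layout, and the names `preprocess` and `build_basket` are assumptions made for illustration; the paper does not specify its exact preprocessing.

```python
import re

def preprocess(text):
    """Lowercase, replace non-alphanumeric characters, collapse whitespace (assumed steps)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_basket(datasets):
    """Merge several corpora into one list of records, keeping the source name.

    `datasets` maps a corpus name to a list of (subject, body, label) tuples,
    where label is 1 for spam and 0 for legitimate mail (hypothetical layout).
    """
    basket = []
    for name, emails in datasets.items():
        for subject, body, label in emails:
            basket.append({
                "source": name,
                "text": preprocess(subject + " " + body),
                "label": label,
            })
    return basket

# Toy stand-ins for the real corpora.
toy = {
    "enron": [("Re: meeting", "See you at 10am.", 0)],
    "lingspam": [("FREE offer!!!", "Earn $500 now", 1)],
}
basket = build_basket(toy)
```

Only the subject and body are merged here, which keeps the records comparable across corpora that expose different metadata.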

C. Summarization and Context Learning
In summarization and context learning, we introduce two approaches: spam/legitimate email summarization and keyword finding. In the summarization approach, we produce a summary of all spam emails and identify the most significant sentences that make an email unwanted. Keyword finding identifies the most important keywords in the email text in order to visualize the contextual sense of the corresponding dataset. We also use word clouds of these keywords for context learning, because a word cloud makes it possible to highlight important text-based data points. Word clouds are commonly used to examine data from social networks and other sources [28,29].
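The keyword-finding step can be sketched as a frequency count over the emails labeled as spam. This is a minimal sketch, not the paper's implementation: the tokenizer, the tiny stop-word list, and the function name `top_spam_words` are assumptions.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "you", "your", "for", "in", "is", "this"}

def top_spam_words(emails, labels, k=100):
    """Return the k most frequent non-stop-words among emails labeled spam (label 1)."""
    counts = Counter()
    for text, label in zip(emails, labels):
        if label != 1:
            continue  # only spam emails contribute to the counts
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)

emails = [
    "Earn money now, free money for you",
    "Meeting notes for the project",
    "Free pills, order now",
]
labels = [1, 0, 1]
top = top_spam_words(emails, labels, k=3)
```

The resulting (word, count) pairs could then be passed to a word cloud renderer, for example the `generate_from_frequencies` method of the `wordcloud` package, which is not shown here.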

D. Classifier Models
We used several classifiers to identify spam emails. Logistic regression is a binary classification algorithm that estimates the probability of a target variable. The target (dependent) variable is dichotomous, meaning that only two classes are possible [30]. Support-vector machines are supervised learning models, with associated learning algorithms, that analyze data for classification and regression [31]. XGBoost is an algorithm for structured or tabular data that has lately dominated applied machine learning and Kaggle competitions. It is an implementation of gradient-boosted decision trees optimized for speed and efficiency [32]. Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that works by building a variety of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees (or, for regression, their mean prediction) [33].
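To make the logistic regression description concrete, the following is a minimal from-scratch sketch trained with batch gradient descent on toy data. The two features, the learning rate, and the epoch count are hypothetical choices for illustration; the experiments in this paper rely on library implementations rather than this code.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Estimated probability that feature vector x belongs to the spam class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy features: [count of "free", count of "meeting"] (hypothetical).
X = [[3, 0], [2, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

A message is then classified as spam when `predict_proba` exceeds a threshold such as 0.5.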
In deep learning, word embedding is the generic term for a collection of language modeling and feature learning techniques in which words or phrases from the corpus are mapped to vectors of real numbers [34]. Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections, so it can process not only single data points but also whole sequences of data [35]. We used two deep learning approaches: first, a word embedding layer followed by a dense layer; second, a word embedding layer and an LSTM layer followed by a dense layer. In this paper, these models are applied to all four datasets and the Basket dataset.
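The first deep learning variant (word embedding followed by a dense output) can be illustrated with a forward pass written in plain Python. This is a sketch under stated assumptions: the vocabulary, the embedding size, the mean pooling between the embedding and the dense unit, and the random untrained weights are all illustrative, and the LSTM variant is not shown.

```python
import math
import random

random.seed(0)

VOCAB = {"<pad>": 0, "free": 1, "money": 2, "meeting": 3, "now": 4}  # hypothetical
EMBED_DIM = 4

# One real-valued vector per vocabulary index (randomly initialised, untrained).
embedding = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in VOCAB]
dense_w = [random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
dense_b = 0.0

def encode(text, max_len=6):
    """Tokenize, map known words to integer ids, and pad with 0 to max_len."""
    ids = [VOCAB[t] for t in text.lower().split() if t in VOCAB][:max_len]
    return ids + [0] * (max_len - len(ids))

def forward(ids):
    """Embedding lookup, mean pooling over non-pad tokens, then a sigmoid dense unit."""
    vectors = [embedding[i] for i in ids if i != 0]
    if vectors:
        pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    else:
        pooled = [0.0] * EMBED_DIM
    z = sum(w * x for w, x in zip(dense_w, pooled)) + dense_b
    return 1.0 / (1.0 + math.exp(-z))

prob = forward(encode("Free money now"))
```

In a trained model the embedding table and dense weights would be learned from the labeled emails; here they only show how token ids become real-valued vectors and then a spam probability.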

IV. EXPERIMENTAL SETUP AND RESULTS
This section provides a detailed description of the experiments like environment setup, hyperparameter tuning, evaluation metrics, and corresponding results. It also provides a summary of the datasets used for the evaluation.

A. Experimental Setup
We used Python libraries for machine learning, including NumPy, SciPy, and scikit-learn. The deep learning models are implemented with the TensorFlow and Keras libraries. Keras is a high-level neural network library that runs on top of TensorFlow or Theano; in simple words, it works using TensorFlow as the backend. We used Google Colab to run TensorFlow and Keras.

Spamming keywords and example spam sentences by dataset:

Dataset: [label and keywords not recoverable from the source]
Spam sentences: in exchange for all of the capital stock of sovs extreme has paid dollar 10k in cash and 15 million shares, it amp transfer the funds to the suppliers by means of western union money gram less your fee, www essentialmedicine org action many of our chapters have also posted photos on our flickr, lower the price of one of the most widely used aids drugs by 96 percent throughout sub-saharan africa, stability remaining uncorrupted products items shipping, the head office who will instruct you amp give advice regarding every new payment, go here now and get it, you have received this email because you have signed up to receive the cnn, he said after snapping pictures of the pill bottle sculpture at yale, alert preferences please click here, please visit the confirmation link below and fill out our short 30 second secure web, viagra if you have a problem getting or keeping an erection your sex life, the position offered is a part time job and will only require from you, repl1cas is a well established online store.

Dataset: Enron
Keywords: dollar, will, company, percent, email, information, please, statements, may, one, now, money, time, business, 000.
Spam sentences: we were compensated 3 ooo dollars to distribute this report, i will do my best, believe that there could be a possible fit here for our company and our share holders, 100 percent all natural product that produces no side effects, you may purge your email address from our database, all information provided within this email pertaining to investing , stocks , securities must be understood as information provided and not investment advice, please copy and paste this link into your browser selftreatment, even after positive statements have been made regarding the above company, shares may be soid at any time, hot news flash today -this one is moving, get the magic blue pills now !, earn huge money quickly from home, this is how i made that money with the porn -chemisorb, the permanent fix to penis enlargement limited time offer, i intend to invest in real estate business your country with you as my partner, you have been pre -approved for a $400 , 000 home loan at a 3 . 25 % fixed rate.

Dataset: LingSpam
Keywords: dollar, mail, order, report, email, will, address, free, program, money, send, name, list, one, business.
Spam sentences: it is completely duplicatable ! by making sure you make thousands of dollars a week, use our server to send your own mail, order today ! only $ 9 . 95 plus $ 3 shipping and handling, for each report , send $ 5 cash and a self-addressed stamped envelope, ! more features , more services (such as free email accounts!!), we will give you free of charge an original hand painted cel, 50 million e -mail addresses on a cd-rom , here 's a great directory for free and interesting internet sites, this company has been most effective for this program, this is a legitimate , legal , moneymaking opportunity, send us email with remove in the subject line, can you make one cent from each of theses names ?, the spider removes duplicates and saves the email list in a ready to send format, order any four videos and get the fifth one free, if you have been looking for a home-based business opportunity , this could be your lucky day.

Dataset: Basket Dataset
Keywords: dollar, percent, new, pills, email, one, price, transfer, top, now, please, mail, company, money, time, business.
Spam sentences: get dollar 500 instantly when you register, the company guarantees to pay net 10 percent fee out of the amount of every payment, some additional openings for new employees we are glad to offer you, erection treatment pills , anti-depressant pills , weight loss , and more !, if you would prefer not to receive further emails fromthe savers club, we've had our eye on match point for almost one year now, offers the most money or offers the cheapest price per product, in fact it is my friend that advised me to urgently arrange to transfer out this money through another foreign beneficiary outside the country, we ' ve had our eye on match point for almost one year now, hoids two wildcards that can be offered to other top piayers in the weeks leading up to the event, please hit your keyboard delete button now and please excuse the intrusion, are you in the mail order business ?, i will or the parent company will !, use hypnosis to make money !, a hard time finding places that will give you a credit card or a merchant card, own a corporation or own a business where you are under a duty to withold from your employes paycheck.

B. Evaluation Results
In this section, we present experimental findings obtained from the various spam detection datasets and from our 'Basket' dataset. Figure 2 shows the word clouds of the top 100 spamming words from the different datasets. A word's size is approximately proportional to its significance for spam email. To extract the spamming words, we count only the emails labeled as spam. These word clouds show that the readily identifiable words are also related to the context of the individual datasets, regardless of their font size.
We split all four datasets into 80 percent training and 20 percent testing. The training part is used to train the models, and the test part is used to evaluate their performance. The numerical evaluation criteria used for machine learning in this paper are the widely used accuracy, precision, and recall metrics. For the deep learning part, we calculated loss and accuracy.
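The 80/20 split and the three metrics can be sketched in a few lines. The helper names `train_test_split` and `classification_metrics`, the fixed seed, and the toy labels below are assumptions for illustration; in practice a library such as scikit-learn provides equivalent routines.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, test) parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

train, test = train_test_split(list(range(10)))
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Calling `classification_metrics` once with `positive=1` and once with `positive=0` gives the per-class precision and recall for spam and ham respectively.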
After the ML/DL algorithms are deployed, the analysis yields useful measures for each algorithm, such as classification accuracy, precision, and recall. We apply four different ML classifiers: logistic regression, support vector machine, XGBoost, and random forest.
Table III summarizes the machine learning approaches: it specifies the dataset name, the number of emails (data points) in that dataset, the number of features selected for training, the ML model applied, train loss, train accuracy, test loss, test accuracy, and precision/recall for spam and ham (normal/legitimate) email.
Table IV summarizes the deep learning approaches. Figure 3 shows the log-loss comparison of the four machine learning models on the four datasets.
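For reference, the log-loss used to compare models is the mean negative log-likelihood of the true labels under the predicted spam probabilities. The sketch below is a standard binary formulation; the clipping constant `eps` and the toy inputs are assumptions.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log-loss: -mean(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Three toy predictions: true labels and predicted spam probabilities.
loss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
```

A low train loss paired with a much higher test loss is the pattern usually read as overfitting.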

C. Discussion
First, we extracted different features from each dataset. We found subject and body fields for the Enron, Ling-spam, and PU datasets, whereas the TREC 2007 dataset offers approximately 80 different features. We selected only those features that carry enough information to separate spam from legitimate emails. Table III shows that we took 2 features for Enron, PU, and Ling-spam, whereas we found 7 features from TREC 2007 useful for our research. We merged all four datasets into a combined one called the 'Basket dataset', which is treated as a dataset in its own right in our study. In the Basket dataset, only the subject and body of the four datasets are considered. After feature extraction, we preprocessed the data so that noise would not hamper the models' learning. Feature engineering was then performed on the preprocessed datasets: for ML we used average word2vec with tf-idf, and for DL we used one-hot encoding and tokenization interchangeably.
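The "average word2vec with tf-idf" feature can be read as a tf-idf weighted average of word vectors per email. The sketch below shows that reading on toy data; the idf variant (log(N/df) + 1), the two-dimensional hand-written vectors, and the function names are assumptions, and real word vectors would come from a trained word2vec model.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency per word, using the log(N/df) + 1 variant (assumed)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(n / c) + 1.0 for w, c in df.items()}

def tfidf_weighted_vector(doc, word_vectors, idf):
    """Average the word vectors of a tokenized document, weighted by tf-idf."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in Counter(doc).items():
        if word not in word_vectors or word not in idf:
            continue  # skip words without a vector or an idf score
        weight = (count / len(doc)) * idf[word]
        total = [t + weight * v for t, v in zip(total, word_vectors[word])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

docs = [["free", "money", "now"], ["meeting", "now"], ["free", "pills"]]
# Hand-written toy vectors standing in for trained word2vec embeddings.
word_vectors = {"free": [1.0, 0.0], "money": [0.0, 1.0], "now": [0.5, 0.5]}
idf = idf_table(docs)
vec = tfidf_weighted_vector(docs[0], word_vectors, idf)
```

Each email then becomes one fixed-length numeric vector, which is the form the classical classifiers expect.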
After all preprocessing steps, the machine learning and deep learning models were deployed on our data. Table III summarizes the details of the ML outcomes. Logistic regression performs quite well on the Ling-spam dataset, whereas the support vector machine overfitted on all four datasets. XGBoost overfitted on the PU dataset but performs well on TREC, Enron, and Ling-spam. Random forest's performance is quite good, but it overfitted on all datasets except Ling-spam. Based on the performance metrics, XGBoost performs better than the others. From Figure 3 and Table III, the logistic regression loss ranges from 0.13 to 0.57 and the XGBoost loss ranges from 0.22 to 0.68 across the datasets. For the support vector machine and random forest, the train loss is low but the test loss is dramatically higher. On the TREC spam dataset the ML models perform well at classifying spam email, whereas on the other datasets they perform well at classifying legitimate email.
So, considering the precision, recall, and accuracy metrics and the loss function, XGBoost is the best machine learning model for all four datasets.
Table IV summarizes the deep learning approaches. Loss is comparatively low for all approaches. Training accuracy is over 99%. Test accuracy is quite good for all datasets, at over 95%. No overfitting was encountered in any case. The models performed well even on the 'Basket dataset': the word-embedding approach (with a dense layer) reaches 98.53% accuracy with a loss of 0.175, and the LSTM with word-embedding approach reaches 99.05% accuracy with a loss of 0.0547. In contrast, the machine learning models' losses are quite high and their train and test accuracies surprisingly low on the 'Basket dataset'. The word-embedding layer performs slightly better than the word embedding with LSTM layer on the TREC 2007, Enron, PU, and Ling-spam datasets, but on our Basket dataset the word embedding with LSTM layer performed better.
After observing all results, the deep learning approaches performed best, not only on each individual dataset but also on our Basket dataset. The Basket dataset helps us to create a more robust and multi-dimensional model. As the four datasets come from different domains and cover different types of spam email, our model could perform well across a wide range of areas.

V. CONCLUSION
Spam, or unsolicited junk mail, is both harmful and annoying. In this paper, we developed a pipeline to separate junk mail from legitimate mail. We applied machine learning and deep learning algorithms to four individual datasets. We found that XGBoost is the best machine learning model for all four datasets. In deep learning, the word-embedding layer performed better on the individual datasets. We also used the 'Basket' dataset, where the best result was obtained with the LSTM. Finally, hybrid systems seem to be the most effective method for creating a reliable anti-spam filter today.