Semi-Supervised Generative Adversarial Network for Sentiment Analysis of Drug Reviews

Sentiment analysis has become a very popular research topic and covers a wide range of domains such as economy, politics and health. In the pharmaceutical field, automated analysis of online user reviews provides information on the effectiveness and potential side effects of drugs, which could be used to improve pharmacovigilance systems. Deep learning approaches have revolutionized the field of Natural Language Processing (NLP), achieving state-of-the-art results in many tasks, such as sentiment analysis. These methods require large annotated datasets to train their models. However, in most real-world scenarios, obtaining high-quality labeled datasets is an expensive and time-consuming task. In contrast, unlabeled texts can generally be obtained easily. In this work, we propose a semi-supervised approach based on a Semi-Supervised Generative Adversarial Network (SSGAN) to address the lack of labeled data for the sentiment analysis of drug reviews, and to improve the results provided by supervised approaches in this task. To evaluate the real contribution of this approach, we present a benchmark comparison between our semi-supervised approach and a supervised approach, which uses a similar architecture but without the generative adversarial setting. Experimental results show better performance of the semi-supervised approach when annotated reviews make up less than ten percent of the training set, with a significant improvement in the classification of neutral reviews, the class with the fewest examples. To the best of our knowledge, this is the first study that applies an SSGAN to the sentiment classification of drug reviews. Our semi-supervised approach provides promising results for dealing with the shortage of annotated data, but there is still much room for improvement.


I. INTRODUCTION
Since the beginning of the century, millions of user reviews have become available on the Internet. In fact, online users generate a huge amount of reviews on fresh topics every second, making it increasingly difficult to massively label this amount of data in a short time [1].
As a result, sentiment analysis has become one of the most fruitful research fields in Natural Language Processing (NLP). This task covers a wide range of domains and could contribute to a variety of applications like financial forecasting [2], political strategy planning [3], public opinion analyzing [4], online decision making [5], and pharmacovigilance [6], among many others.
Pharmacovigilance, also known as drug safety, tries to prevent drug side effects. The early detection of side effects depends on clinical trials and specific testing protocols, which are usually conducted under given conditions with a limited number of test subjects and time. As a consequence of the design of clinical trials, discrepancies in patient selection and treatment conditions can have a significant impact on drug efficacy and detection of potential adverse effects [7]. Additional information sources, such as user reviews, can offer great potential in this regard [8]. Automatic detection of negative user reviews could alert to unexpected drug behaviors and obtain relevant information to improve pharmacovigilance systems [9]. Moreover, the identification of positive user reviews could help to assess the drug efficacy and even discover new uses for a drug.

I. Segura-Bedmar is with the Computer Science Department, Universidad Carlos III de Madrid, Leganés, 28911, Spain (e-mail: isegura@inf.uc3m.es).
Approaches in sentiment analysis have evolved from deterministic rule-based systems [10] to machine learning [11] and hybrid methods [12]. Over the last years, the emergence of deep learning techniques has brought remarkable breakthroughs in NLP, providing state-of-the-art results in many tasks [13]. Sentiment analysis is no exception to this success, and a wide range of deep learning-based systems have also achieved state-of-the-art results on standard sentiment analysis datasets [14], [15]. However, patient sentiment analysis, specifically of their experience with drugs, has received less research attention [16]-[18]. This is a challenging task because user opinions can have different degrees of subjectivity and may focus on other aspects (such as the quality of the healthcare service or healthcare professionals), introducing noise into the classification task [19]. Moreover, the shortage of annotated data, which is essential for training machine learning algorithms, has become one of the greatest bottlenecks in NLP [20], [21], and, in particular, for the task of sentiment analysis [22].
To overcome the lack of labeled examples, unsupervised and semi-supervised approaches have been applied to sentiment analysis [23]-[25]. However, these approaches do not outperform traditional supervised machine learning methods.
More recently, transfer learning and cross-domain approaches have been applied to mitigate the lack of labeled data [17], [26]-[28]. Transfer learning is a machine learning approach where a model, trained for one task, is reused as the starting point for training a new model for another task. A successful example of transfer learning in NLP is Bidirectional Encoder Representations from Transformers (BERT) [29], whose main advantage is that it can accurately represent the different meanings of a word. A fine-tuning phase is then performed for a specific task, such as text classification, text summarization or named entity recognition, among many others. These BERT-based models achieve the current state-of-the-art results for many NLP tasks [29], [30]. However, the fine-tuning phase still requires thousands of annotated examples for the target task. Therefore, the quality of the results drops significantly when the number of labeled examples is insufficient [31].
To address the challenge caused by the shortage of labeled drug reviews, we propose to employ a semi-supervised method by implementing a Semi-Supervised Generative Adversarial Network (SS-GAN) [32]. Generally, in Generative Adversarial Networks (GAN) [33], a generator is trained to produce synthetic examples from different data distributions. During training, the generator tries to mislead a discriminator, which is trained to distinguish between synthetic and real examples.
To the best of our knowledge, only one previous work [31] has used an SSGAN for the task of sentiment analysis. Therefore, our study is the first work applying an SSGAN to the sentiment classification of drug reviews. To assess the real contribution of our semi-supervised approach, we also performed a comparison between the proposed semi-supervised approach and the supervised approach described in [6], which is based on the BERT-Mini model [34] fine-tuned with a Bidirectional Long Short-Term Memory (BiLSTM) layer. In this way, our approach exploits a more complex fine-tuning architecture for the task of sentiment classification than that proposed in [31], since they only used a softmax layer.
We use different splits of the training dataset, intended to show the performance of the semi-supervised approach compared to the supervised approach when trained with only a small percentage of the training dataset. Thus, these splits will allow us to show how much data should be annotated to obtain results similar to those obtained by the supervised approach.
The organization of this paper is as follows. After discussing prior work (Section II), we present the dataset used and describe the deep learning architectures studied in this work (Section III). Then, we evaluate the models and discuss their results (Section IV). Finally, we provide some conclusions extracted from the experimentation and present future research lines (Section V).

II. RELATED WORK
Although sentiment analysis has been widely applied to many application domains [2]-[6], [35], [36], the pharmaceutical domain has only recently begun to receive research attention. In this section, we review previous research focused on sentiment analysis of drug reviews.
Early work on this task mainly used sentiment rules and lexicons (such as SentiWordNet [37]) [38], [39]. Later works such as [40]-[43] proposed approaches based on a Bag-of-Words (BoW) model to represent the input texts and different machine learning algorithms such as Decision Trees, K-Nearest-Neighbor (KNN), Support Vector Machine (SVM) or Naïve Bayes to estimate the polarity of patient posts in online health forums. One of the most popular studies is the work of [17], because the authors created a labeled dataset of drug reviews, which has become a standard benchmarking corpus for the task. This dataset is described in more detail in Subsection III-A. The authors used logistic regression to classify the drug reviews, achieving an accuracy of 92.2%.
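As a simple illustration of the Bag-of-Words representation used in these early approaches, the sketch below builds count vectors over a toy vocabulary (the reviews and the whitespace tokenization are invented for the example; real systems relied on proper tokenizers and libraries):

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Represent a text as a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Toy reviews, invented for illustration only
reviews = ["the drug worked great", "terrible side effects, the worst"]
vocabulary = sorted({w for r in reviews for w in r.lower().split()})

vectors = [bow_vector(r, vocabulary) for r in reviews]
```

These sparse count vectors are then fed to a classical classifier (SVM, KNN, Naïve Bayes, logistic regression) to predict the polarity.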
More recently, deep learning approaches have shown remarkable results [18], [44], [45]. C. Colon-Ruiz and I. Segura-Bedmar [6] performed a benchmark comparison between different combinations of deep learning models such as Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and BERT (BERT-Mini [34]). Their experiments show that the BERT model obtained the best results with an accuracy of 90.4% on the dataset created by Gräßer et al. Another study, using the same dataset, was described in [46], which explored both LSTM and BiLSTM layers followed by a self-attention mechanism, obtaining an accuracy of 92%.
Biseda and Mo [18] compared a variety of eight different BERT models for sentiment classification of drug reviews on the dataset proposed in [17]. Their experiments showed an accuracy of 90.6%. Recently, several neural network models with Embeddings from Language Model (ELMo) [47] and different transformers such as BERT, BERT variants, and XLNET have been explored in [45]. The best result, an accuracy of 94.3%, was obtained by a BERT architecture modified by replacing some multi-head attention layers with dynamic convolutions to reduce the temporal complexity.
Semi-supervised approaches could improve the efficiency of classifying text without the need for large labeled datasets. Wu et al. [48] implemented a semi-supervised approach with a variational autoencoder and recurrent LSTM networks to improve dimensional sentiment analysis (DSA) performance with considerably less labeled data. The authors evaluated their approach on several benchmark datasets, specifically on a dataset consisting of status updates shared by Facebook users [49]. These texts were scored on a scale of 1 (most negative) to 9 (most positive). The semi-supervised approach was compared with a supervised LSTM model. The experiments showed that the semi-supervised approach outperformed the supervised one with an improvement of approximately 15.9% in Pearson correlation when only 10% of the training texts were used to train the models.
Croce et al. [31] proposed a supervised approach based on BERT and a semi-supervised generative adversarial network to perform different NLP tasks using small training datasets. Among the different NLP tasks, a sentiment analysis task was performed on the Stanford Sentiment Treebank-5 (SST-5) dataset [50], which consists of movie reviews classified as negative, somewhat negative, neutral, somewhat positive, or positive. For this task, the BERT model was fine-tuned with a softmax layer to perform the classification. The semi-supervised approach achieved an improvement of approximately 10% in accuracy compared with the supervised BERT-based approach when only 1% of the reviews were labeled.
From this review, we can say that semi-supervised methods based on deep learning architectures provide promising results, although they have so far been only tentatively applied to sentiment analysis. To the best of our knowledge, these semi-supervised approaches have not been used for sentiment classification of drug reviews.

III. DATASET AND METHODS

A. Dataset
For this work, we used the dataset of drug reviews created by [17]. In this dataset, the drug reviews, written by patients, were collected from Drugs.com and Druglib.com, two of the most visited pharmaceutical information websites offering information to both consumers and health care professionals. Each drug review includes a rating from 1 to 10, indicating the degree of patient satisfaction with the drug. For example, a patient with epilepsy posted the following comment: "I've had nothing but problems with the Keppera : constant shaking in my arms & legs, pins needles feeling in my arms & legs, severe light headedness, no appetite, etc.". The patient refers to the drug Keppra and describes the adverse effects it has caused. The rating provided by the patient is negative, with a value of 2. The reviews were grouped into three levels of polarity according to their rating: positive (rating ≥ 7), negative (rating ≤ 4) and neutral (4 < rating < 7), as proposed in [17]. The dataset, which contains a total of 215,063 drug reviews, is split into training and test sets with a ratio of 75:25 using stratified random sampling to maintain the same proportion of all classes. Additionally, as in [6], 15% of the total training set is used as the validation set, which is used to fit the hyperparameters of the models. Figure 1 shows that the distribution of the reviews according to their classes (degree of satisfaction) is strongly unbalanced in the training and test sets. Reviews with positive polarity (class 2) represent about 66% of the data, reviews with negative polarity (class 0) about 25% and, finally, reviews with neutral polarity (class 1) only 9%.
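The rating-to-polarity grouping described above can be expressed as a small helper (a minimal sketch; the function name is ours, and the class indices follow those used in the paper):

```python
def rating_to_class(rating):
    """Map a patient rating to a polarity class:
    0 = negative (rating <= 4), 1 = neutral (4 < rating < 7), 2 = positive (rating >= 7)."""
    if rating <= 4:
        return 0
    if rating >= 7:
        return 2
    return 1
```

For instance, the Keppra review above, with a rating of 2, falls into class 0 (negative).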
In order to prepare the training dataset for our comparison of the semi-supervised approach with the supervised one, we performed a series of splits (see Table I) to obtain several ratios between labeled and unlabeled reviews. By employing different ratios between labeled and unlabeled data, we will be able to observe how much annotated data each approach needs to obtain similar results. For example, in the split "20:80", only 20% of the training dataset is randomly selected to be used as labeled reviews, while the remaining 80% is used as unlabeled reviews. In this particular split of 20%, there are 27,420 labeled reviews, of which 6,820 belong to class 0 (negative), 2,427 to class 1 (neutral) and 18,173 to class 2 (positive). The splits are created using stratified random sampling to keep the same class distribution in both partitions.
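A stratified labeled/unlabeled split such as the "20:80" one can be sketched as follows (a stdlib-only illustration with invented toy labels; the actual experiments presumably used standard tooling for stratified sampling):

```python
import random
from collections import defaultdict

def stratified_split(labels, labeled_fraction, seed=0):
    """Return (labeled_idx, unlabeled_idx) index lists, keeping the
    class distribution of `labels` in both partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    labeled_idx, unlabeled_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * labeled_fraction)
        labeled_idx.extend(indices[:cut])       # e.g., 20% kept as labeled
        unlabeled_idx.extend(indices[cut:])     # remaining 80% treated as unlabeled
    return labeled_idx, unlabeled_idx

# Toy distribution mimicking the imbalance: 10 negative, 4 neutral, 26 positive
labels = [0] * 10 + [1] * 4 + [2] * 26
labeled_idx, unlabeled_idx = stratified_split(labels, 0.2)
```

Because the sampling is done per class, each split preserves roughly the 66/25/9 class proportions of the full training set.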
The semi-supervised approach will use labeled examples for training the classification model, but also the unlabeled examples to improve the distinction between automatically generated examples and real ones. On the other hand, the supervised approach will only use the labeled reviews to train its model.

B. Supervised approach: BERT-BiLSTM
To assess the real contribution of our semi-supervised approach, we also propose a supervised approach that employs a BERT model followed by a BiLSTM to perform the fine-tuning for the sentiment analysis task of drug reviews.
BERT provides a representation of its inputs as a result of a pre-training phase on large-scale corpora. These representations provide a vector for each occurrence of a word, taking into account the context in which the word appears. In this way, BERT can represent the different meanings of a word depending on the context.
To format the drug reviews in a BERT-based format, the texts were processed using the tokenizer provided by bert-for-tf2 1 and the vocabulary of the pre-trained BERT-Mini model. In addition, due to the difference in text lengths, it was necessary to pad and truncate the reviews based on the cumulative distribution function of their lengths (Figure 2). This figure shows that almost 100% of the reviews have a length of less than 250 tokens. Thus, the length of the processed texts was set to 250 tokens.
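The padding and truncation step can be sketched as follows (a minimal illustration; the padding ID of 0 is an assumption, and the actual token IDs come from the bert-for-tf2 tokenizer):

```python
def pad_or_truncate(token_ids, max_length=250, pad_id=0):
    """Force a sequence of token IDs to a fixed length:
    truncate long reviews, pad short ones with `pad_id`."""
    if len(token_ids) >= max_length:
        return token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))
```

Fixing the length to 250 keeps nearly all reviews intact while allowing the inputs to be batched into tensors of uniform shape.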
As mentioned above, one of the main advantages of BERT is its ability to provide contextualized word embeddings with a more accurate representation of words. However, training BERT for a specific task is expensive from a parameter point of view [51]. Due to the computational cost, we used a pre-trained BERT model with only four encoder layers, namely the BERT-Mini model [34]. Moreover, we use adapter modules [51] as a transfer learning mechanism that can adjust a pre-trained model to new tasks without modifying the weights of the original model. The adapter tuning strategy consists of injecting new layers into the original network, forming a bottleneck architecture that projects the input features into a lower dimension, applies a non-linearity and returns the input to the original dimension. The adapter module also incorporates an internal skip-connection to avoid problems with near-zero initializations. As a result, the original BERT model weights can remain unchanged, considerably reducing the number of trainable parameters during fine-tuning for the target task.
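The bottleneck computation of an adapter module can be sketched as a plain forward pass (a stdlib-only sketch with invented toy weights; real adapters are trainable layers injected inside each transformer block):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapter_forward(h, w_down, w_up):
    """Adapter bottleneck: down-project, non-linearity, up-project,
    then the internal skip-connection adds the input back."""
    bottleneck = relu(matvec(w_down, h))         # project to a lower dimension
    restored = matvec(w_up, bottleneck)          # back to the original dimension
    return [x + r for x, r in zip(h, restored)]  # skip-connection

# Toy sizes: hidden size 4, bottleneck size 2 (weights invented)
w_down = [[0.1] * 4, [0.2] * 4]   # 2 x 4
w_up = [[0.0, 0.0]] * 4           # 4 x 2, near-zero init
h = [1.0, 2.0, 3.0, 4.0]
out = adapter_forward(h, w_down, w_up)
```

Note how the skip-connection makes a near-zero-initialized adapter behave as an identity function, so inserting it does not disturb the pre-trained network at the start of fine-tuning.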
BERT-Mini requires much less memory than the original model, enabling better training times, at the expense of worse results for the final task [34]. To address this problem, we add a BiLSTM layer, which has been shown to provide good results in sentiment analysis tasks [6]. Moreover, after this layer, we also added a fully connected perceptron layer with leaky rectified linear unit activation function (leaky ReLU) [52] to improve class prediction. As the last layer, we use a softmax layer, which outputs a vector containing probabilities for each of the classes (positive, neutral, negative).
Finally, for model training we used 200 epochs with the ADAM optimizer (learning rate of 0.001), a batch size of 200 and categorical cross-entropy as a loss function. The training parameters were selected empirically. The reader can find a detailed description of this approach in [6].

C. Semi-Supervised approach: GAN with BERT-BiLSTM
Our semi-supervised approach uses the same architecture as our supervised one, but also exploits an adversarial network to improve the internal representation of the examples for the classification task. In a GAN architecture [33], two models are trained: a generative model G and a discriminative model D. G learns the distribution of the data and generates synthetic examples. D predicts the probability that an example is synthetic (generated by G) or real (from the training dataset). The training of G consists of maximizing the probability that D makes a mistake. This scenario corresponds to a minimax game of two players [53], G and D, competing against each other. The generative model is confronted with an adversary, a discriminative model that learns to differentiate whether an example comes from G or belongs to the training data. This competition leads both models to improve their performances until the distributions generated by G are practically indistinguishable from those belonging to the real training data.
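Formally, this two-player competition corresponds to the standard minimax objective of [33] (reproduced here for reference, with x drawn from the real data distribution and z the noise input of the generator):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to maximize this value, while G is trained to minimize it, i.e., to maximize the probability that D misclassifies G(z) as real.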
In this way, we use a semi-supervised GAN (SSGAN) [32] that adversarially improves the distributions learned by G and D and, thereby, provides better representations for the classification task. To achieve this, the generator must produce a batch of new synthetic examples to try to mislead the discriminator. The generator takes as input (x_g) a uniform noise distribution and uses a multilayer perceptron to learn the synthetic examples. Then, the discriminator, which is also a multilayer perceptron, is trained on c classes (in our case, three classes: positive, negative and neutral) plus an additional binary class (c+1) to identify whether the input example comes from the generator or from the real training data. The last layer of D is a softmax layer over the (c+1) classes. Figure 3 shows the whole architecture of our final approach (SSGAN-BL), which consists of the SSGAN model whose discriminator receives both the representations of real reviews produced by the BERT-BiLSTM model and the synthetic representations produced by the generator. During the training of the SSGAN-BL model, the weights of G and D are updated, but so are those of the BERT-BiLSTM model. Once the SSGAN-BL model is trained, G is discarded, keeping the rest of the model for the sentiment classification task with the three classes (positive, negative and neutral) and ignoring the binary class (c+1) in the softmax layer of D.
We now describe in detail the loss functions used to train the models. From now on, x_b is the vector of an original review (the input for BERT-BiLSTM) and x_g is a vector generated from a uniform random distribution (the input for G). Once this is defined, we can express the probability P, provided by the discriminator D, that a given input, x_b or x_g, is associated with one of the (1, ..., c+1) classes. Using the probability distributions P, we can measure the error of each of the models. To train D, we must minimize the sum of the errors produced by misclassifying labeled reviews with wrong classes in (1, ..., c), by misidentifying labeled and unlabeled reviews as "fake" reviews (that is, with the (c+1) class), and by misidentifying synthetic examples generated by G as "real" reviews. On the other hand, to train G, we must minimize the sum of the errors produced by D in identifying an example generated by G as a synthetic one and by G in generating examples with distributions significantly different from those of the real reviews.
Therefore, the loss function of D is defined as L_D = L_Dsup + L_Duns, where (1) provides the error of assigning a labeled review x_b to the original c classes, (2) defines the error of recognizing a labeled or unlabeled review x_b as a synthetic example (c+1), and (3) defines the error of recognizing a synthetic example x_g as a real example:

L_Dsup = −E log P(ŷ = y|x_b, y ∈ (1, ..., c)) (1)

L_Duns = −E log[1 − P(ŷ = y|x_b, y = c + 1)] (2)
− E log P(ŷ = y|x_g, y = c + 1) (3)

Finally, the error is back-propagated and the weights are updated for the D and BERT-BiLSTM models (adjusting the inner representations of the reviews).
Meanwhile, the loss function of G is defined as L_G = L_Guns + L_Gmatch. G must be able to mislead D into labeling its synthetic examples as real (4) and to generate representations whose statistics match those of the real reviews (5):

L_Guns = −E log[1 − P(ŷ = y|x_g, y = c + 1)] (4)

L_Gmatch = ||E f(x_b) − E f(x_g)||_2^2 (5)

where f denotes the activations of an intermediate layer of D (feature matching [32]).

For training, we used 400 epochs with the ADAM optimizer (learning rate of 4e-5). The batch sizes can differ (64-200) depending on the number of labeled reviews used for training. This is due to the need to preserve the representation of labeled reviews in each batch, to avoid the divergence produced by the unsupervised components in adversarial training. On the other hand, for both G and D, we used Leaky ReLU activation functions, except in the last layer of the generator, where we used a hyperbolic tangent (tanh) activation function. The latter ensures that the h_fake representations produced by the generator are similar to the h_real representations provided by BERT-BiLSTM, whose last layer is a BiLSTM recurrent network with hyperbolic tangent activation functions.
The pseudo-code of the training algorithm can be found in Algorithm 1.
Our source code is publicly available to enable the reproducibility of our experiments 2 .

IV. EXPERIMENTAL RESULTS
In this section, we focus on the results obtained with our approaches on the test dataset described in Section III.
To evaluate both approaches, we used one of the standard metrics for text classification tasks: F1. This metric can be extended to multi-class problems with its micro-averaged and macro-averaged versions. Micro-averaged F1 aggregates the contributions of all classes, while macro-averaged F1 is the mean of the values obtained by each class independently.

Algorithm 1 Training of SSGAN-BL
for each epoch do
  for each batch (l, u, z) do
    while D ← train do
      L_Dsup ← D(f(l)); L_Duns ← D(f(l + u)) and D(G(z))
      L_Dtotal = L_Dsup + L_Duns
    end while
    D → trainable = False
    while G ← train do
      L_Guns ← D(G(z))
      L_Gmatch ← f(l + u) and f(z)
      L_Gtotal = L_Guns + L_Gmatch
    end while
  end for
end for

Figures 4 and 5 show the performance of the supervised and semi-supervised approaches using different ratios of labeled (Annotated %) and unlabeled reviews. In this way, we can see the difference between the results provided by the semi-supervised and supervised approaches when only limited subsets of labeled reviews are available.
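The difference between the two averages can be illustrated with a stdlib-only sketch (toy labels invented for the example):

```python
def f1_per_class(y_true, y_pred, label):
    """F1 of a single class, treated as a one-vs-rest binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 values."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """For single-label multi-class problems, micro-F1 equals accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy predictions where the rare class 1 (neutral) is always missed
y_true = [2, 2, 2, 2, 0, 0, 1]
y_pred = [2, 2, 2, 2, 0, 0, 2]
```

In this toy case micro-F1 stays high while macro-F1 drops sharply, which is why macro-F1 is the metric that exposes the models' behavior on the under-represented neutral class.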
Figure 4 compares both approaches when different percentages of labeled examples are used to train the models. We can observe that, with less than 10% of labeled examples, the semi-supervised approach provides better performance in terms of micro-F1 than the supervised approach (see Fig. 4b). Similarly, the semi-supervised approach outperforms the supervised one in terms of macro-F1 even when 20% of the training examples are used (see Fig. 4a). However, from 10% of labeled examples onwards, the semi-supervised approach starts to obtain worse results than the supervised one in terms of micro-F1. This may be due to the fact that, from 10% onwards, the supervised approach has enough data to obtain good enough internal representations of the reviews on its own.
Moreover, we can observe in Figure 5b that the semi-supervised model obtains better results than the supervised model for the class with the fewest instances (the neutral class), regardless of the percentage of labeled data used. Thus, this class benefits most from the data representations adversarially enhanced by the SSGAN-BL model. The difference in the performance of the two models for the under-represented class can also be seen in Figure 4a. The macro-F1 metric, described above, shows a greater differentiation between the two approaches regardless of the percentage of labeled data. Owing to the performance of SSGAN-BL on class 1 (the neutral class), the mean of the values obtained for each class individually is higher than in the case of BERT-BiLSTM. However, Figure 4b shows that there is no statistically significant difference in terms of micro-F1 (below 10% of labeled data). This may be due to a slight increase in failures of SSGAN-BL when classifying instances of the other classes (positive or negative), at the expense of improving performance on the under-represented class (neutral). As can be seen in the confusion matrices (see Fig. 6), both approaches show a similar pattern of behavior. The number of reviews correctly classified as class 1 (neutral) is higher with SSGAN-BL (see Fig. 6f) than with BERT-BiLSTM (see Fig. 6c). However, the number of reviews correctly classified as class 0 (negative) is slightly lower with SSGAN-BL. This is repeated in each of the comparisons from Figures 6g and 6i onwards. From this setup, there is also a decrease in the number of reviews correctly classified as class 2 (positive) by the SSGAN-BL model. The latter can be seen in Figures 4 and 5 from 10% of labeled data onwards.
Moreover, we can see a pattern of behavior in the confusion matrices that is not reflected by the F1 metrics. When we observe Figures 6b and 6e, we notice that the SSGAN-BL model classifies reviews into the opposite polarity less often than BERT-BiLSTM. In contrast, BERT-BiLSTM is more inclined to classify a positive example as a negative one, or a negative example as a positive one. It can be argued that misclassifying a polarized review as neutral is preferable to misclassifying such a review into the opposite polarity (e.g., positive as negative). For example, the following review is labeled as class 0 (negative): "Have been on Actos for almost a year, gained 24 pounds and have swelling in hands and feet and are retaining a lot of water in my thighs. My sugar levels are good. My doctor lowered my dosage from 30 mg to 15 mg but refused to take me off. Will get a second opinion because the side effects are too much.". In this case, the BERT-BiLSTM model classifies it as positive, while the SSGAN-BL model classifies it as neutral.

V. CONCLUSION
In recent years, the emergence of transformer-based approaches has revolutionized the field of NLP. Models such as BERT have achieved state-of-the-art results for many NLP tasks. However, fine-tuning these models for specific tasks still requires large annotated corpora, which are not always available.
In this paper, we propose a semi-supervised generative adversarial network (SSGAN) approach capable of dealing with the sentiment analysis of drug reviews when the size of the labeled dataset is limited. To evaluate the real contribution of our approach, we performed a comparison between the semi-supervised approach and a supervised one. Our experimental results show that the semi-supervised approach provides better results than the supervised approach in terms of macro-F1. However, in terms of micro-F1, its performance is worse than that of the supervised approach when we employ more than 10% of labeled data. Moreover, the semi-supervised approach provides a noticeable improvement in the classification of reviews of the under-represented class (the neutral class). Conversely, when the availability of labeled reviews increases, its performance on the over-represented classes decreases slightly. The semi-supervised approach provides promising results; however, there is still much room for improvement. Therefore, we plan to propose new deep learning methods capable of automatically generating instances for training, reducing the dependence on annotated datasets. In addition, we will explore our semi-supervised approach for other NLP tasks such as named entity recognition, relation extraction or text summarization, among others. Moreover, we will study other transformer-based architectures such as GPT-2 [54] and XLM [55] to deal with these tasks in settings enabling semi-supervised approaches and few-shot transfer learning [56]. The basic idea of the latter technique is to reuse a model, trained to distinguish between different classes, to learn to identify examples of classes unknown to the model, using only a few seed examples.