Semantic-Preserving Adversarial Text Attacks

Deep learning models are known to be immensely brittle to adversarial text examples. Existing text adversarial attack strategies can be roughly divided into character-level, word-level, and sentence-level attacks. Despite the success of recent text attack methods, inducing misclassification with minimal text modifications while keeping lexical correctness, syntactic soundness, and semantic consistency remains a challenge. In this paper, we devise a Bigram and Unigram-based adaptive Semantic Preservation Optimization (BU-SPO) approach which attacks text documents not only at the unigram word level but also at the bigram level to avoid generating meaningless sentences. We also present a hybrid attack strategy that collects substitution words from both synonym and sememe candidates to enrich the potential candidate set. Besides, a Semantic Preservation Optimization (SPO) method is devised to determine the word substitution priority and reduce the perturbation cost. Furthermore, we constrain the SPO with a semantic filter (dubbed SPOF) to improve the semantic similarity. To estimate the effectiveness of our proposed methods, BU-SPO and BU-SPOF, we attack four victim deep learning models trained on three text datasets. Experimental results demonstrate that our approaches achieve the highest semantic consistency and attack success rates while making minimal word modifications compared with competitive methods.


INTRODUCTION
Deep neural networks (DNNs) have exhibited brittleness towards adversarial examples, primarily in the image domain [1], [2]. An adversarial image example can be crafted by intentionally adding a small number of pixel perturbations to a legitimate input. These perturbations are usually hard for human vision to perceive but can mislead well-trained DNN models into erroneous predictions. This phenomenon has raised great interest in the image recognition community, and an abundance of adversarial attack and defense methods have been proposed to improve the robustness and interpretability of DNNs [3]. However, the vulnerability of DNNs in the Natural Language Processing (NLP) field is generally underestimated, especially for security-sensitive NLP tasks such as spam filtering [4], webpage phishing [5], and sentiment analysis [6].
Compared to image attacks, there are non-trivial difficulties in crafting text adversarial samples. Firstly, the text adversarial samples should be lexically correct, syntactically sound, and semantically similar to the original text. This ensures that the adversarial modifications are imperceptible to human readers. Secondly, the words in text sequences are discrete tokens instead of continuous pixel values as in images. Therefore, it is infeasible to directly compute the model gradient with respect to every word. Thirdly, making small perturbations on many pixels may still yield a meaningful image from a human perception perspective. However, even a small change of a single word can make a sentence meaningless. Several lines of text attack methods have been proposed, such as character-level attacks, sentence-level attacks, and word-level attacks [7]. However, character-level attacks (e.g., noise → nosie) lead to lexical errors, and sentence-level attacks (i.e., inserting a whole sentence into the original text) often cause significant semantic changes. To avoid these problems, many recent works focus on word-level attacks that replace an original word with another carefully selected one [8]. However, existing methods mostly generate substitution candidates for every individual word (i.e., a unigram), which can easily break commonly used phrases, leading to meaningless outputs (e.g., high school → tall school). In addition, when sorting word replacement orders, most algorithms calculate a word importance score (WIS) and attack words in descending order of the WIS. There are different definitions of WIS, such as probability weighted word saliency (PWWS) [9] and the change of the DNN's prediction before and after deleting a word [10].

Fig. 1. The workflow of our BU-SPOF method with a text example: "Study: CEOs rewarded for outsourcing. NEW YORK (CNN/Money) - The CEOs of the top 50 US companies that sent service jobs overseas pulled down far more pay than their counterparts at other large companies last year, a study said Tuesday." For brevity, the figure shows only several words from the long text. This example is originally labeled as "Business" (66.68%) by LSTM, but is misclassified as "Sci/Tech" after replacing the bigram "New York" with "Empire State". The two green boxes denote two successful attacks, but we accept the "Empire State" substitution because it preserves more semantics (0.9877) than "AFP" (0.9724).
A major drawback of using such a static attack order is word substitution inflexibility; e.g., sequentially selecting the top-3 WIS words {top1, top2, top3} may not fool a classifier, but the combination {top1, top3} sometimes can.
In this work, we propose a new word-level attack method named Bigram and Unigram based Semantic Preservation Optimization (BU-SPO) which effectively addresses all the drawbacks above. Unlike traditional unigram word attacks, we consider both unigram and bigram substitutions. In our approach, we generate more natural candidates by replacing a bigram with its synonyms (e.g., high school → secondary school). Table 1 lists several examples that illustrate the superiority of bigram attacks in comparison with unigram attacks. Additionally, we propose to replace input words by considering both their synonym candidates and sememe candidates (i.e., sememe-consistent words). By incorporating these complementary candidates, we have better choices to craft high-quality adversarial texts.
More importantly, we propose an effective candidature search method, Semantic Preservation Optimization (SPO), to determine word replacement priorities. The SPO inherits the best-performing candidate combinations from the previous generation and determines every next replacement word with a heuristic search. For instance, if changing the {top1} word cannot mislead a classifier, the static methods used in the literature will select the combination {top1, top2} in the second iteration, but our adaptive SPO will check more combinations, e.g., {top1, top2}, {top1, top3}, etc. Compared with the static strategy, the SPO allows us to fool DNN models with much fewer modifications, which is significant in reducing grammatical mistakes. In addition, we build a semantic filter into the SPO algorithm (SPOF), so that it selects the best candidate to maximally preserve the semantic consistency between the input text and the adversarial output. Fig. 1 illustrates the framework of our algorithm with an attack example. Our main contributions in this work are summarized as below:
1. We propose to attack text documents not only at the unigram word level but also at the bigram level. This strategy is significant in generating more semantically natural adversarial samples and avoiding meaningless outputs.
2. We propose a hybrid approach to generate word substitutions from both synonym candidates and sememe candidates. Such a complementary combination provides more options to craft meaningful adversarial examples.
3. We design a Semantic Preservation Optimization (SPO) method to adaptively determine the word replacement order. The SPO is designed to mislead a DNN classifier using minimal word modifications compared with static word replacement baselines. Making fewer word replacements helps to reduce syntactic mistakes.
4. We further customize the SPO with a semantic filter (SPOF), targeting to return the adversarial example that preserves the highest semantic consistency. This step can significantly improve the quality of the output adversarial example in terms of sentence naturalness and fluency.

We conduct extensive experiments on the IMDB, AG's News, and Yahoo!Answers datasets by attacking both CNN and LSTM models. The experimental results validate the effectiveness of our method in achieving a high attack success rate with low perturbation cost while simultaneously keeping high semantic consistency. Besides, our BU-SPOF also shows superiority in transfer attacks, adversarial retraining, and targeted attacks compared with baselines.
The rest of this paper is organized as follows. In Section 2, we briefly review the related works in generating text adversarial samples. In Section 3, we discuss and formalize our algorithm in detail. Evaluation metrics and experimental results are reported in Section 4. Finally, Section 5 concludes this paper.
RELATED WORK

Firstly, character-level attacks [11], [12], [13] generate adversarial text by deleting, inserting, or swapping characters. Belinkov and Bisk [13] devised four types of synthetic noise: swap, middle random, fully random, and keyboard typo, which can mislead neural machine translation (NMT) models to a large degree. However, they modify every word of an input sentence wherever possible, which leads to a high perturbation cost. For example, the "swap" of two letters (e.g., noise → nosie) is applied to all words with length ≥ 4, as it does not alter the first and last letters. To reduce the distortion degree, Ebrahimi et al. [11] proposed HotFlip, which represents every character as a one-hot vector. It then estimates the best character change by computing directional derivatives with respect to vector operations. Gao et al. [12] designed the black-box DeepWordBug, which evaluates the word importance score by removing words one by one and comparing the prediction changes. However, character-level attacks break the lexical constraint and lead to misspelled words, which can be easily detected and removed by a spell checker installed before the classifier.
Additionally, sentence-level attacks [14], [15] concatenate an adversarial sentence before, or more commonly after, the clean input text to confuse deep models. For example, Jia and Liang [14] appended a compatible sentence to the end of a paragraph to fool reading comprehension models (RCMs). The adversarial sentence looks similar to the original question by combining an altered question and fake answers, aiming to mislead the RCM to a wrong answer location. Nevertheless, this strategy requires a lot of human intervention and cannot be fully automated; e.g., it relies on about 50 manually-defined rules to ensure the adversarial sentence is in a declarative form. Recently, Wallace et al. [15] sought universal adversarial triggers, i.e., input-agnostic sequences that cause a specific target prediction when concatenated to any input from the same dataset. The universal sequence is randomly initialized and iteratively updated to increase the likelihood of the target prediction using token replacement gradients as in HotFlip. However, this method usually leads to dramatic semantic changes and generates sentences incomprehensible to humans.

Finally, word-level attacks replace original input words with carefully picked words. The core problems are (1) how to select proper candidate words and (2) how to determine the word substitution order. Incipiently, Papernot et al. [16] projected words into a 128-dimension embedding space and leveraged the Jacobian matrix to evaluate the input-output interaction. However, a small perturbation in the embedding space may lead to totally irrelevant words, since there is no hard guarantee that words close in the embedding space are semantically similar. Therefore, subsequent studies focused on synonym substitution strategies that search synonyms from the GloVe embedding space, existing thesauri (e.g., WordNet and HowNet), or the BERT Masked Language Model (MLM). Using GloVe, Alzantot et al.
[18] designed a population-based genetic algorithm (GA) to imitate natural selection. However, the GloVe embedding usually fails to distinguish antonyms from synonyms. For example, the nearest neighbors of expensive in GloVe space are {pricey, cheaper, costly}, where cheaper is its antonym. Therefore, GloVe-based algorithms have to use a counter-fitting method to post-process the adversary's vectors to ensure the semantic constraint [19]. Compared with GloVe, utilizing well-organized linguistic thesauri, e.g., the synonym-based WordNet [20] and the sememe-based HowNet [21], is simple and easy to implement. Ren et al. [9] sought synonyms using the WordNet synsets and ranked word replacement order via probability weighted word saliency (PWWS). However, PWWS sorts word importance scores once and replaces words one by one in descending order of their scores. This generally leads to local optima and word over-substitution, as the top-k words are not always the strongest combination for misleading DNN models. Zang et al. [22] showed that the sememe-based HowNet can provide more substitute words than WordNet and proposed Particle Swarm Optimization (PSO) to determine which group of words should be attacked. In addition, some recent studies utilized the BERT MLM to generate contextual perturbations, such as BERT-Attack [23] and BERT-based Adversarial Examples (BAE) [24]. The pretrained BERT MLM can ensure the predicted token fits the sentence well, but it is unable to preserve the semantic similarity. For example, in the sentence "the food was [MASK]", predicting the [MASK] as good or bad is equally fluent but results in opposite sentiment labels. Notably, all these works focused on unigram attacks.

ALGORITHM
This section details our proposed BU-SPO and BU-SPOF methods. Formally, let X = {w_1, w_2, ..., w_n} denote an input text of n words, and let X and Y denote the text space and the label space, respectively. The DNN classifier F learns a mapping from the text space to the label space, F : X → Y.

Black-box Text Attack
We design our method in black-box settings where no network architectures, intermediate parameters, or gradient information are available. The only capability of the black-box adversary is to query the output labels (confidence scores) of the threat model, acting as a standard user.
Given a well-trained DNN classifier F, it aims to produce the correct label Y_true ∈ Y for any input X ∈ X, i.e., F(X) = Y_true, by maximizing the posterior probability:

F(X) = argmax_{Y_i ∈ Y} P(Y_i | X).

A rational text attack pursues a human-imperceptible perturbation ΔX that can fool the classifier F when it is added to the original X. The altered input X* = X + ΔX is defined as the text adversarial example. Generally, a successful adversarial example misleads a well-trained classifier into either an arbitrary label other than the true label, i.e., F(X*) ≠ Y_true, or a pre-specified target label Y_target ≠ Y_true, i.e., F(X*) = Y_target. These two attack strategies, defined in Eq. (2) and Eq. (3), are known as untargeted and targeted attacks, respectively. A valid text perturbation needs to satisfy lexical, grammatical, and semantic constraints. As our attack method makes no character modifications, the lexical constraint is naturally retained. Additionally, we propose a bigram substitution strategy to avoid meaningless outputs, and introduce an adaptive search algorithm, SPO, to minimize the number of word perturbations while preserving semantic similarity and syntactic coherence.
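The untargeted and targeted success conditions can be sketched as a single check. This is a toy illustration only: the `is_successful` helper and the labels below are invented for this sketch, not part of the paper's implementation.

```python
def is_successful(pred_adv, y_true, y_target=None):
    """Decide whether an adversarial prediction counts as a successful attack.

    Untargeted (Eq. (2)-style): any label other than the true one succeeds.
    Targeted (Eq. (3)-style): only the pre-specified target label succeeds.
    """
    if y_target is None:
        return pred_adv != y_true      # untargeted attack
    return pred_adv == y_target        # targeted attack

# Toy usage: true label 0, adversarial prediction 2.
print(is_successful(2, y_true=0))               # untargeted: True (success)
print(is_successful(2, y_true=0, y_target=1))   # targeted at label 1: False
```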

Semantic Similarity
The semantic similarity between the original input sentence and the adversarial output sentence is vitally important to ensure that the modifications are imperceptible to humans.
In this paper, we employ the Universal Sentence Encoder (USE) to measure the semantic similarity between text examples [25]. The USE model encodes input sentences into 512-dimensional embedding vectors so that we can easily calculate their cosine similarity score. Specifically, the USE encoder is trained on a variety of general-purpose web text, such as Wikipedia, web news, web question-answer pages, and discussion forums. Therefore, it is capable of feeding multiple downstream tasks. Formally, denoting the USE encoder by Encoder, the USE score between an example X and its adversarial variation X_adv is defined as

USEscore(X, X_adv) = cos(Encoder(X), Encoder(X_adv)).

One major advantage of the USE sentence embedding is that it can indicate how well the selected candidate word fits the original sentence. In contrast, alternative word embedding methods (e.g., word2vec [26]), which map each word to an embedding space, fail to generate context-aware representations.
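The USE score of Eq. (4) reduces to a cosine similarity between two sentence embeddings. A minimal sketch, using small stand-in vectors instead of real 512-dimensional USE encodings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors, as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In the paper, Encoder(X) and Encoder(X_adv) are 512-d USE vectors;
# the 3-d vectors below only illustrate the arithmetic.
x, x_adv = [1.0, 0.0, 1.0], [1.0, 0.1, 0.9]
score = cosine_similarity(x, x_adv)
assert 0.99 < score <= 1.0  # nearly identical embeddings score close to 1
```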

Bigram and Unigram Candidate Selection
Before elaborating on the candidate selection procedure, we first briefly introduce WordNet and HowNet and give the definitions of the synonym space and the sememe space. WordNet [20] groups word relations into 117,000 unordered synonym sets (synsets). Different synsets are interlinked by super-subordinate relations, e.g., the "furniture" synset includes the "bed" synset. In this work, we collect synonym candidates from the WordNet synonym space W. HowNet [21] annotates words by their sememes, where a sememe is a minimum unit of semantic meaning in linguistics. For example, the word "apple" has multiple sememes, e.g., "fruit", "computer", etc. Words sharing the same sememe tag can be interchangeable in crafting adversarial examples. We define the sememe candidates provided by HowNet as the sememe space H.
Considering polysemy, a word may have more than one sememe defined in HowNet. To guarantee valid substitutions, we take only words that have at least one common sememe with the original word w_i into its candidate set S_i.
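The hybrid candidate set for a unigram can be sketched as the union of its WordNet-style synonyms and HowNet-style sememe-consistent words. The two toy lexicons below are invented for illustration; they are not real WordNet or HowNet entries.

```python
# Toy synonym space W and sememe space H (invented entries for illustration).
SYNONYMS = {"movie": {"film", "picture"}}
SEMEMES = {                       # word -> set of sememe tags
    "movie": {"entertainment", "shows"},
    "show":  {"entertainment", "shows"},
    "apple": {"fruit", "computer"},
}

def candidate_set(word):
    """S_i: synonyms of `word` plus words sharing at least one sememe with it."""
    synonyms = set(SYNONYMS.get(word, set()))
    tags = SEMEMES.get(word, set())
    sememe_words = {w for w, t in SEMEMES.items() if w != word and t & tags}
    return synonyms | sememe_words

assert candidate_set("movie") == {"film", "picture", "show"}
```

Here "show" enters the candidate set of "movie" only through the shared sememe tags, illustrating how the sememe space complements the synonym space.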

Best candidate selection.
Given the candidate set S_i (or B_i), every w'_i ∈ S_i is a potential candidate for the replacement of word w_i. We define the candidate importance score I_{w'_i} for each substitution candidate w'_i as the reduction of the true-label prediction probability:

I_{w'_i} = P(Y_true | X) − P(Y_true | X'_i),

where X'_i is the text obtained by replacing w_i in X with w'_i. Then we pick the candidate that achieves the highest I_{w'_i} as the best substitution word w*_i. Formally, the candidate selection function is

w*_i = argmax_{w'_i ∈ S_i} I_{w'_i}.

Repeating this procedure on every word one by one solves the first key issue of our method, as summarized in lines 1 to 11 of Algorithm 1.
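The greedy best-candidate step can be sketched as follows. The `prob_true` lookup is a toy stand-in for querying the black-box classifier's true-label probability; the sentences and probabilities are invented for this sketch.

```python
# Invented true-label probabilities for a toy black-box classifier.
PROB = {
    "the movie was good": 0.90,
    "the movie was fine": 0.55,
    "the movie was well": 0.70,
}

def prob_true(text):
    """Stand-in for querying P(Y_true | text) from the victim model."""
    return PROB[text]

def best_substitution(text, word, candidates):
    """Pick w*_i: the candidate whose substitution most reduces the
    true-label probability, i.e., the one with the highest importance score."""
    base = prob_true(text)
    scores = {c: base - prob_true(text.replace(word, c)) for c in candidates}
    return max(scores, key=scores.get)

# "fine" drops the true-label probability by 0.35, "well" only by 0.20.
assert best_substitution("the movie was good", "good", ["fine", "well"]) == "fine"
```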

Semantic Preservation Optimization
The Semantic Preservation Optimization (SPO) is designed to determine the word replacement priority with three objectives: 1) achieve a successful attack, 2) make minimal substitutions, and 3) preserve the semantic similarity. The change of the true-label probability between X and X*_i denotes the largest attack effect that can be achieved by modifying w_i:

ΔP*_i = P(Y_true | X) − P(Y_true | X*_i).

A straightforward way of determining the word replacement priority is to sort the words by their ΔP*_i in descending order and select the top-k ones. However, we empirically find that incrementally replacing words in such a static order often leads to local optima and word over-substitution. This means that simply selecting the top-k words by ΔP*_i does not necessarily provide the best word combination for misleading DNNs.
In this paper, we propose the Semantic Preservation Optimization (SPO) method, which adaptively determines the word substitution priority. Particularly, we first create the initial generation G_0 as an empty set (line 12 of Algorithm 1). Then we set the maximum number of words that can be modified, i.e., M = min(M, n), where M is a predefined replacement cap. This threshold forces us to stop the loop if the input example does not admit an adversarial alteration after M substitutions. The SPO procedure is listed in lines 14-21 of Algorithm 1.
For each generation, we first create the population set for the current generation G_m using the function F defined in Algorithm 2. Specifically, F directly returns all the best substitution candidates {w*_1, ..., w*_n} as the first generation. Then we iteratively query the classifier F and check whether its prediction is changed by replacing the first-generation candidates. If a population member X_adv achieves a successful attack, the optimization completes and returns X_adv. Otherwise, we calculate the probability shift ΔP_adv in line 21. If we cannot find a successful attack in the current generation, we proceed to the next iteration while updating ΔP_adv.
In the next generation, we call F again to construct G_m in three steps, as listed in lines 5-8 of Algorithm 2. Firstly, we search for the most effective element of the previous generation G_{m−1}, i.e., the one attaining the maximal ΔP_adv; we denote this best element as G_{m−1}^best. Then we remove all the candidate words belonging to G_{m−1}^best from the full candidate set. Finally, we combine G_{m−1}^best with every remaining candidate w*_i and assign the result to the current population member G_m(i). The greedy search in lines 16-21 of Algorithm 1 is the same as in the first generation but replaces one more word/bigram in every subsequent generation to craft X_adv. This procedure does not stop until it successfully finds an adversarial example or reaches the upper threshold M. The SPO method enables us to preserve the best population member from the previous generation and adaptively determine which word should be altered in the current generation. Based on the SPO, we achieve a higher attack success rate by replacing fewer words compared with static baselines. This solves the second issue.
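The generation-by-generation search described above can be sketched as follows. Here `attack_gain` stands in for ΔP_adv and `fools` for a query to the classifier; both are toy stubs, and the example reproduces the {top1, top3} case from the Introduction.

```python
def spo_search(candidates, attack_gain, fools, max_subs):
    """Adaptive SPO sketch: keep the best member of the previous generation
    and extend it with every remaining candidate in the next generation."""
    best = frozenset()  # G_0 starts empty
    for _ in range(min(max_subs, len(candidates))):
        generation = [best | {c} for c in candidates if c not in best]
        for member in generation:
            if fools(member):
                return member            # successful adversarial combination
        best = max(generation, key=attack_gain)  # inherit the strongest member
    return None                          # gave up after the replacement cap M

# Toy example: only {top1, top3} fools the model. A static order would try
# top1, then top1+top2, then top1+top2+top3, over-substituting words; SPO
# checks every extension of the best member and finds {top1, top3} directly.
gains = {frozenset({"top1"}): 0.3, frozenset({"top2"}): 0.2,
         frozenset({"top3"}): 0.1, frozenset({"top1", "top2"}): 0.4,
         frozenset({"top1", "top3"}): 0.6}
result = spo_search(["top1", "top2", "top3"],
                    attack_gain=lambda m: gains.get(m, 0.0),
                    fools=lambda m: m == {"top1", "top3"},
                    max_subs=3)
assert result == {"top1", "top3"}
```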

SPO with Semantic Filter (SPOF)
To further enhance the semantic similarity of SPO, we improve it with a semantic filter, yielding the SPOF algorithm (Algorithm 3). For each successful adversarial candidate found in a generation, SPOF calculates the USE score between X and X_adv by Eq. (4) and accepts the candidate with the highest semantic similarity.

Targeted Attack Strategy
Targeted attack is the scenario where attackers aim to misdirect the classifier to a pre-specified target label Y target .
In this section, we show that our BU-SPO and BU-SPOF algorithms can be easily adapted to conduct targeted attacks by making the following three modifications. Firstly, we change the successful-attack condition in line 18 of Algorithm 1 and line 19 of Algorithm 3 from F(X) ≠ F(X_adv) to F(X_adv) = Y_target. This means we only count adversarial examples that mislead the classifier to the target label as successful attacks. Secondly, we evaluate the attack strength by calculating how much the target-label probability increases rather than how much the true-label score decreases. Therefore, Eq. (5) is reformulated as Eq. (10) and Eq. (9) is transformed into Eq. (11). Additionally, line 21 of Algorithm 1 and line 23 of Algorithm 3 become Eq. (12). Finally, we select the most frequent NE substitution from the target class, i.e., NE_Target, instead of from the complementary set NE_COMP. This helps to increase the target-label score and improve the success rate of targeted attacks.
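The switch from untargeted to targeted scoring can be sketched with two small helpers; the class names and probabilities below are toy values, not results from the paper.

```python
def untargeted_gain(probs_before, probs_after, y_true):
    """Untargeted attack strength: drop in the true-label probability."""
    return probs_before[y_true] - probs_after[y_true]

def targeted_gain(probs_before, probs_after, y_target):
    """Targeted attack strength: rise in the target-label probability."""
    return probs_after[y_target] - probs_before[y_target]

# Toy class distributions before and after one word substitution.
before = {"Business": 0.7, "Sci/Tech": 0.2, "Sports": 0.1}
after  = {"Business": 0.4, "Sci/Tech": 0.5, "Sports": 0.1}

assert abs(untargeted_gain(before, after, "Business") - 0.3) < 1e-9
assert abs(targeted_gain(before, after, "Sci/Tech") - 0.3) < 1e-9
```

The same substitution is scored differently depending on the attack goal: the untargeted score rewards any drop of the true label, while the targeted score only rewards movement toward Y_target.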

EXPERIMENTS
We evaluate the effectiveness of our BU-SPO and BU-SPOF methods on widely used text datasets. We provide code and data in a GitHub repository 3 to ensure reproducibility.

Datasets
We conduct experiments on three publicly available benchmarks: IMDB, AG's News, and Yahoo!Answers. Details of these datasets are summarized in Table 2. IMDB [28] is a binary sentiment classification dataset containing 50,000 movie reviews, where 25,000 samples are used for training and 25,000 for testing. The average text length is 227 words (without punctuation). AG's News [29] is a news classification dataset with 4 topic classes, i.e., World, Sports, Business, and Sci/Tech. Each class consists of 30,000 training examples and 1,900 test documents, so it contains 120,000 training samples and 7,600 test samples in total.

Victim Models
We apply our attack algorithm to four popular victim models, including two Convolutional Neural Networks (CNNs) and two Recurrent Neural Networks (RNNs). These models are effective tools for text classification at either the word level or the character level. Word-based CNN (CNN) [30] stacks a word embedding layer with 50 embedding dimensions, a convolutional layer with 250 filters, a global max pooling layer, and two pairs of fully-connected and nonlinear activation layers. Besides, it contains two dropout layers with a 0.2 dropout rate to prevent overfitting. This Word CNN model is implemented on all three datasets.
Character-based CNN (Ch-CNN) [29] is composed of a 69-dimensional character embedding layer, 6 convolutional layers, and 3 densely-connected layers. Each convolutional layer employs 256 filter kernels with filter sizes varying from 3 to 7. It also inserts one dropout layer with a 0.1 dropout rate after every densely-connected layer. We evaluate this Ch-CNN model on the AG's News dataset.
Word-based LSTM (LSTM) passes the input sequence through a 100-dimension embedding layer and a 128-unit long short-term memory layer, followed by a dropout layer with rate 0.5. The LSTM structure prevents gradient vanishing by utilizing a memory cell and is effective for sequential text classification [16]. This Word LSTM model is applied to the AG's News dataset.
Bidirectional LSTM (Bi-LSTM) consists of a 128-dimension word embedding layer and a bidirectional layer that wraps 64 LSTM units. It then combines a dropout layer with rate 0.5 and a fully-connected layer for classification. We run this Bi-LSTM model on both the IMDB and Yahoo!Answers datasets.
Table 3 lists the classification accuracy of these models on the original legitimate test samples.

Evaluation Metrics
We use three metrics to evaluate text attack performance: the Attack Success Rate (ASR), the average word replacement (AWR) number, and the semantic similarity (USE) score.
The ASR indicates how effectively an adversary can mislead the victim model. Formally, an attack is successful when the classifier F correctly classifies the original legitimate input, F(X) = Y_true, but makes a wrong prediction on the corresponding attacked input, F(X + ΔX) = Y*. Therefore, the ASR is defined as the number of successful attacks divided by the number of correctly classified test samples, where Y* can be any label different from Y_true (untargeted attack) or a user-specified label (targeted attack), and ΔX denotes the modifications to the legitimate text sample. To accurately quantify the perturbation cost, we count the number of word modifications for each sample and compute the average word replacement (AWR) number as the adversarial attack cost. The semantic similarity score is also an important indicator of the quality of the crafted adversarial examples, as it measures whether an adversarial example reads naturally and fluently. Intuitively, a rational attacker hopes to attain a high ASR and semantic similarity score while modifying a small number of words.
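The three metrics can be computed from per-sample attack records as sketched below. The records are toy tuples, and averaging AWR and the USE score over successful attacks only is our assumption for this sketch, since the averaging convention is not spelled out here.

```python
# Toy per-sample records: (clean_correct, attack_succeeded, words_replaced, use_score)
records = [
    (True,  True,  2,  0.98),
    (True,  False, 20, 0.00),   # hit the replacement cap M without success
    (True,  True,  4,  0.95),
    (False, False, 0,  0.00),   # misclassified even before the attack
]

attackable = [r for r in records if r[0]]            # correctly classified inputs
successes  = [r for r in attackable if r[1]]

asr = len(successes) / len(attackable)               # attack success rate
awr = sum(r[2] for r in successes) / len(successes)  # average word replacements
use = sum(r[3] for r in successes) / len(successes)  # average USE score

assert asr == 2 / 3 and awr == 3.0 and abs(use - 0.965) < 1e-9
```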

Baselines
We compare our method with representative black-box word-level attack algorithms as listed below.
• RAND (random attack) selects a random synonym from WordNet and ranks the attack order by our SPO algorithm.
• Word saliency attack (WSA) [31] gets replacement words from WordNet and rephrases texts in the word saliency (WS) descending order.The word saliency is similar to Eq. ( 5) but replaces w i with unknown.
• PWWS [9] chooses candidate words from WordNet and sorts word attack order by multiplying the word saliency and probability variation.
• PSO [22] selects word candidates from HowNet and employs the PSO to find adversarial text.This method treats every sample as a particle where its location in the search space needs to be optimized.
• TextFooler (TEFO) [10] obtains synonyms from the GloVe space and defines the WIS by iteratively deleting input words and calculating the changes of the DNN's score.
• BERT-ATTACK (BEAT) [23] takes advantage of BERT MLM to generate candidates and attack words by the static WIS descending order.The WIS is similar to Eq. ( 5) but changes w i to a masked word.

Experimental Settings
We train all the DNN models using the ADAM optimizer [32] with learning rate = 0.001, β1 = 0.9, β2 = 0.999, and ε = 10^−7. We deploy BU-SPO and the first three baselines on Keras. The PSO, TEFO, and BEAT are tested on the TextAttack framework [33], where the Ch-CNN model and the Yahoo!Answers dataset are currently unavailable. For this reason, results under these settings are shown as infeasible to obtain (i.e., N/A) in Table 4, Table 5, and Table 6. We set the upper bound of the word replacement number to M = 20 for our methods. This means we stop the attack iteration if a text sample does not admit an adversarial attack after 20 substitutions. For the baseline methods, we use their recommended parameters for fair comparison. Particularly, for the most related baseline, PWWS, we also report its performance under the same constraint, i.e., M = 20, for fair comparison. For efficiency, attack performance is assessed on 1,000 test samples of each dataset, following the conventional setting [10], [22].

Experimental Results and Analysis
The experimental results of ASR, AWR, and the semantic similarity score are listed in Table 4, Table 5, and Table 6, respectively. We examine the first four contributions mentioned in the Introduction by asking four research questions.

Q1: Is our adaptive SPO superior to static baselines? To validate this, we design U-SPO, which searches substitution words from only WordNet and attacks text only at the unigram word level (the same as WSA and PWWS) but employs our SPO to determine the word substitution priority. Experimental results in Table 4 and Table 5 show that U-SPO achieves a higher ASR and changes a much smaller number of words compared with its static counterparts (WSA and PWWS). Besides, RAND delivers a higher ASR than WSA on IMDB and AG's News, which also illustrates the merit of our adaptive SPO.

Q2: Is the hybrid of synonym and sememe candidates beneficial? We present a hybrid version of U-SPO, i.e., HU-SPO, which is the same as U-SPO but integrates HowNet to search synonym-sememe candidates. Table 4 shows that HU-SPO accomplishes the highest ASR in most cases and outperforms U-SPO by a large margin. Intriguingly, Table 5 exhibits that HU-SPO achieves such a high ASR using fewer word substitutions. This strongly suggests the benefit of incorporating HowNet in the candidate selection step.

Q3: What's the advantage of combining the bigram attack?
The bigram substitution is vital for improving semantic smoothness and generating meaningful sentences. To show this, we propose the BU-SPO method, which combines the bigram and unigram attacks. Compared with HU-SPO, BU-SPO achieves a higher USE score even though it changes more words. This means bigram substitution can avoid producing meaningless sentences. In addition, we list two adversarial examples each from IMDB (Table 7), AG's News (Table 8), and Yahoo! Answers (Table 9) for qualitative analysis. We can see from these adversarial examples that our bigram substitution can greatly reduce the semantic variations. For example, Table 8 shows that our method replaces two words (i.e., information technology → IT) but causes less semantic variation than PWWS, which changes only one word (information → entropy).

Q4: Can the semantic filter really improve semantic similarity?
A straightforward way to validate this point is to compare BU-SPO and BU-SPOF, since the only difference between the two algorithms is whether they use the semantic filter. From Table 6 we can see that BU-SPOF attains higher semantic similarity than BU-SPO and often achieves the highest USE score compared with all baselines. This confirms our expectation that the semantic filter is significant in improving the naturalness and fluency of the generated adversarial examples.
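The bigram-first lookup discussed above can be sketched in a few lines. This is a minimal illustration with hypothetical, hand-made candidate dictionaries (`BIGRAM_SYNONYMS`, `UNIGRAM_SYNONYMS`); the paper draws real candidates from WordNet synonyms and HowNet sememes.

```python
# Hypothetical toy candidate spaces; the paper uses WordNet/HowNet instead.
BIGRAM_SYNONYMS = {("information", "technology"): ["IT"]}
UNIGRAM_SYNONYMS = {"information": ["data"], "technology": ["tech"], "brief": ["short"]}

def build_candidates(words):
    """Return (position, span, candidates) triples, preferring bigram matches.

    Positions covered by a matched bigram are not searched word-by-word,
    mirroring the rule of skipping w_i and w_{i+1} separately.
    """
    candidates = []
    i = 0
    while i < len(words):
        bigram = tuple(words[i:i + 2])
        if len(bigram) == 2 and bigram in BIGRAM_SYNONYMS:
            candidates.append((i, 2, BIGRAM_SYNONYMS[bigram]))
            i += 2  # skip both words of the matched bigram
        else:
            subs = UNIGRAM_SYNONYMS.get(words[i], [])
            if subs:
                candidates.append((i, 1, subs))
            i += 1
    return candidates

cands = build_candidates("an information technology company".split())
# ("information", "technology") is treated as a single substitution unit.
```

With the bigram matched as a unit, the attack can emit "IT" for the whole phrase instead of replacing "information" and "technology" independently, which is exactly the behavior that avoids phrases like "information entropy".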
Overall, Table 4, Table 5, and Table 6 show that our proposed algorithms (U-SPO, HU-SPO, BU-SPO, and BU-SPOF) almost sweep the top-3 results on all datasets and victim models, indicating the superiority of our method.

Transferability
Transferability of adversarial examples is the ability of adversarial samples generated to mislead a specific model F to also mislead other well-trained models F′, even if their network structures greatly differ [34]. To evaluate whether our adversarial samples are transferable between models, we construct three more CNN models named Word CNN2, Word CNN3, and Word CNN4. Different from the previous Word CNN model (described in Section 4.2), Word CNN2 has one more fully connected layer, Word CNN3 replaces the ReLU nonlinearity with Tanh, and Word CNN4 adds one convolutional layer. We apply the 1000 adversarial examples generated on Word CNN to attack Word CNN2, Word CNN3, Word CNN4, and the LSTM model. Fig. 2 shows the results on the original Word CNN and the transferred models. It can be seen from Fig. 2 that our method attains the best transfer attack performance, demonstrating the strength of our method in transfer attacks.

Afghan women make arrive brief Olympic debut introduction. Afghan women made a short-lived debut in the Olympic Games on Wednesday as 18-year-old judo wildcard Friba Razayee was defeated after 45 seconds of her first match peer in the under-70kg middleweight. BU-SPOF (Successful attack. True label score: 90.86% → 29.69%) Afghan women make brief Olympic debut. Afghan women made a short-lived debut in the Olympic Games Olympiad on Wednesday as 18-year-old judo wildcard Friba Razayee was defeated after 45 seconds of her first match in the under-70kg middleweight.
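The transfer evaluation above boils down to re-scoring a fixed adversarial set on target models that were never attacked directly. The sketch below uses toy one-feature threshold "classifiers" as stand-ins for Word CNN and its variants (all names and numbers are hypothetical); lower accuracy on a target model means higher transferability.

```python
def make_model(threshold):
    """A toy binary classifier over a single scalar feature (hypothetical)."""
    return lambda x: 1 if x > threshold else 0

source = make_model(0.5)                       # stand-in for Word CNN
targets = {"cnn2": make_model(0.45),           # stand-ins for Word CNN2/CNN3
           "cnn3": make_model(0.6)}

# Adversarial pairs (x_adv, y_true) that already fool the source model.
adv_set = [(0.48, 1), (0.40, 1), (0.55, 0)]

def accuracy(model, pairs):
    """Fraction of pairs the model still classifies correctly."""
    return sum(model(x) == y for x, y in pairs) / len(pairs)

# Source accuracy is 0 by construction; target accuracy measures transfer.
transfer_acc = {name: accuracy(m, adv_set) for name, m in targets.items()}
```

The same loop applies unchanged to real text models: craft once against the source, then only call the targets' forward passes.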

Targeted Attack Evaluations
A targeted attack is usually regarded as a more dangerous attack strategy, as it can arbitrarily mislead the victim model to misclassify any label as a pre-specified target label [35]. In this section, we conduct targeted attack experiments on the AG's News dataset by attacking the Word-CNN, Ch-CNN, and Word-LSTM models. For each model, we attack 1000 legitimate samples toward the four target labels: 0 (World), 1 (Sports), 2 (Business), and 3 (Sci/Tech). Table 10 shows the experimental results. From Table 10 we can see that our BU-SPOF attains a much higher ASR than PWWS for all target labels and victim models, especially for the Ch-CNN model. Besides, our BU-SPOF replaces fewer words than PWWS. This illustrates that our method is more powerful for both targeted and untargeted attacks.
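The difference between the two success criteria is worth making concrete: an untargeted attack only needs the prediction to move away from the true label, while a targeted attack must land on the chosen label. A minimal sketch (the probability vector is hypothetical):

```python
LABELS = ["World", "Sports", "Business", "Sci/Tech"]  # AG's News classes

def attack_succeeds(probs, y_true, target=None):
    """Success test for an adversarial example's predicted distribution."""
    pred = max(range(len(probs)), key=probs.__getitem__)
    if target is None:           # untargeted: any misclassification counts
        return pred != y_true
    return pred == target        # targeted: must hit the pre-specified label

# Hypothetical model output for an adversarial example whose true label
# is 0 (World); the attack aimed at label 2 (Business).
probs = [0.10, 0.15, 0.55, 0.20]
```

The targeted criterion is strictly harder: every targeted success is also an untargeted success, but not vice versa, which is why targeted ASR is the more demanding benchmark.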

CONCLUSIONS
In this paper, we have proposed a novel Bigram and Unigram based Semantic Preservation Optimization (BU-SPO) algorithm for crafting natural language adversarial samples. Specifically, BU-SPO exploits both unigram and bigram modifications to avoid breaking commonly used bigram phrases. Besides, the hybrid synonym-sememe candidate selection approach provides better candidate options to craft high-quality adversarial examples. More importantly, we design an adaptive SPO algorithm to determine the word substitution priority, which is significant in reducing the perturbation cost. We also propose to improve the SPO with a semantic filter (BU-SPOF) to further enhance semantic preservation. Extensive experimental results show that our BU-SPO and BU-SPOF methods achieve high attack success rates (ASR) and high semantic similarity with small numbers of word modifications. Besides, the proposed BU-SPOF also shows its superiority in transfer attacks, adversarial retraining, and targeted attacks. In the future, research on defense methods using an n-gram strategy with n > 2 will be a promising direction.

Fig. 3. Adversarial retraining results. The higher the accuracy, the more robust the model is after retraining.

TABLE 1
Comparisons between unigram attacks and bigram attacks. One advantage of bigram substitution is that it can distinguish commonly used bigram phrases and avoid generating meaningless sentences.
3.3.1 Candidate set creation. Suppose the input sentence contains n words, i.e., X = {w_1, w_2, ..., w_n}. For each word w_i, we first connect it to its next word w_{i+1} and check if the bigram (w_i, w_{i+1}) has synonyms in the synonym space W. If it does, we collect all the synonyms to create the bigram candidate set B_i and skip searching candidates for w_i and w_{i+1} separately. Otherwise, we gather all the candidate words for w_i from the synonym space W and the sememe space H and denote them as a subset S_i ⊂ W ∪ H. It is worth mentioning that we apply a candidate filter here to make sure all the candidate words in S_i have the same part-of-speech (POS) tags as w_i. Replacing words with the same POS tags (e.g., nouns) helps avoid introducing grammatical errors. If w_i is a named entity (NE), we enlarge S_i by absorbing more same-type NE words. An NE refers to a pre-defined real-world object that can be symbolized by a proper noun, such as person names, organizations, and locations [27]. The candidate NE (denoted as NE_COMP) must have the same NE type as the original word. It is selected as the most frequently appearing word from the complementary NE set W − W_Ytrue, where W_Ytrue contains all the NEs of the Y_true class. Then we update the synonym set accordingly.
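The POS-consistency filter described above can be sketched as follows. This is a toy illustration with a hand-made POS lexicon (`POS` is hypothetical); the paper relies on real POS tags for the filtering.

```python
# Hypothetical toy POS lexicon; a real implementation would use a POS tagger.
POS = {"job": "NOUN", "duty": "NOUN", "work": "VERB",
       "excellent": "ADJ", "splendid": "ADJ", "excel": "VERB"}

def filter_by_pos(word, raw_candidates):
    """Keep only candidates whose POS tag matches the original word's tag.

    Substitutes with a different POS (e.g. a verb replacing a noun) are
    dropped, since they would likely introduce grammatical errors.
    """
    tag = POS.get(word)
    return [c for c in raw_candidates if POS.get(c) == tag]

kept = filter_by_pos("job", ["duty", "work"])           # "work" tagged VERB
kept_adj = filter_by_pos("excellent", ["splendid", "excel"])
```

In practice the raw candidate lists would come from the synonym space W and sememe space H before this filter produces the final S_i.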
Algorithm 1: The proposed BU-SPO algorithm.
Input: sample sentence containing n words X = (w_1, ..., w_n); maximum word replacement bound M; classifier F.
Output: adversarial example X_adv.
/* Select candidates for input words */
1: for i = 1 to n do connect w_i with its next word as (w_i, w_{i+1}); ...
12: create the initial generation with empty G_0 = ∅;
13: set the upper bound M = min(M, n);
/* The SPO search starts */
14: for m = 1 to M do ...

Given the best substitution word w*_i for each original w_i, we obtain n adversarial examples {X*_1, ..., X*_n}, with each being modified on one word, while keeping the sentence semantics unchanged.

Algorithm 3: The proposed BU-SPOF algorithm. The SPOF employs the same strategy to collect bigram and unigram candidates but improves the word priority determination procedure. Specifically, for each generation, we first create an empty set SucAdv = ∅ to collect all possible adversarial examples that make successful attacks (Algorithm 3, line 15). If a population member X_adv achieves a successful attack (line 19), we calculate the semantic similarity score between X and X_adv (line 20) and append it to the successful adversarial example set SucAdv (line 21). We traverse this procedure for every population member in each generation. Then we select from SucAdv the adversarial example with the highest semantic similarity.
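The SucAdv selection step of SPOF can be sketched compactly. The code below is a minimal illustration: a toy Jaccard word-overlap score stands in for the Universal Sentence Encoder similarity the paper actually uses, and `fools_model` is a hypothetical callback wrapping the victim classifier.

```python
def jaccard(a, b):
    """Toy similarity: word-set overlap (stand-in for the USE score)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def pick_best_adversarial(original, population, fools_model):
    """Among population members that fool the model, return the one most
    similar to the original text (None if no member succeeds)."""
    suc_adv = [(jaccard(original, x), x) for x in population if fools_model(x)]
    if not suc_adv:
        return None
    return max(suc_adv)[1]

orig = "a b c d"
pop = ["a b c e", "a x y z", "a b c d"]
# Hypothetical victim: anything different from the original "fools" it.
best = pick_best_adversarial(orig, pop, fools_model=lambda x: x != orig)
```

This captures the filter's intent: success alone is not enough; among successful candidates, the one preserving the most semantics wins.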

TABLE 2
Dataset information summary. "# Avg. Words" is the average number of words over all samples.

TABLE 3
Test accuracy of the four DNN models before attacks.

TABLE 4
The Attack Success Rate (ASR) of various attack algorithms. For each row, the highest ASR is highlighted in bold, the second highest in underline, and the third highest in italic.

TABLE 5
The Average Word Replacement (AWR) number of various attack methods. For each row, the smallest AWR is highlighted in bold, the second smallest in underline, and the third smallest in italic.

TABLE 6
The average Universal Sentence Encoder (USE) score of various attack methods. For each row, the highest USE score is highlighted in bold, the second highest in underline, and the third highest in italic.

TABLE 7
Adversarial examples on IMDB (attacking Word CNN). Green texts are original words, while red ones are substitutions. (Successful attack. True label score: 70.5% → 1.21%)

TABLE 8
Adversarial examples by attacking the Word LSTM model on the AG's News dataset.
AG's News Example 1 PWWS (Successful attack. True label score: 90.86% → 37.17%)
AG's News Example 2 PWWS (Successful attack. True label score: 66.13% → 45.21%) Internosis Will Relocate To Greenbelt in October. Internosis Inc., an information entropy technology company in Arlington, plans to move its headquarters to Greenbelt in October. The relocation will bring 170 jobs to Prince George's County.
BU-SPOF (Successful attack. True label score: 97.41% → 18.42%) Internosis Will Relocate To Greenbelt in October. Internosis Inc., an information technology IT company in Arlington, plans to move its headquarters to Greenbelt in October. The relocation will bring 170 jobs to Prince George's County.

TABLE 9
Adversarial examples by attacking the Bi-LSTM model on the Yahoo! Answers dataset.
Yahoo! Answers Example 1 PWWS (Failure. True label score: 92.54% → 43.65%) What are exist good honorable resources to learn memorize about treatments for prostate cancer?
BU-SPOF (Successful attack. True label score: 92.54% → 30.42%) What are good resources to learn about treatments for prostate cancer prostatic adenocarcinoma?
Yahoo! Answers Example 2 PWWS (Successful attack. True label score: 81.52% → 40.85%) Why did president Bush Equine get his Masters degree?
BU-SPOF (Successful attack. True label score: 81.52% → 2.37%) Why did president Bush Dubyuh get his Masters degree?

Fig. 2. Transfer attack on Yahoo! Answers. Lower accuracy indicates higher transfer ability (the lower the better).

Adversarial retraining is an effective way to improve the model's robustness by adding adversarial examples to the training set. In this experiment, we randomly select {500, 1000, 1500, 2000} AG's News training samples to generate adversarial examples. Then we append these crafted adversarial examples to the training set and retrain the Word CNN model. We evaluate whether the adversarially retrained model becomes more robust by checking its classification accuracy.
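The retraining loop described above (append adversarial examples with their correct labels, then refit) can be sketched with a toy model. Here a nearest-class-mean classifier over a single scalar feature stands in for Word CNN; all data points are hypothetical.

```python
def nearest_mean_fit(data):
    """Fit per-class feature means on (x, y) pairs with scalar features."""
    means = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda lbl: abs(means[lbl] - x))

# Toy training set and one adversarial example kept with its true label.
train = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
adv = [(0.45, 1)]
retrained = nearest_mean_fit(train + adv)   # adversarial retraining step
```

Before retraining the toy model misclassifies the adversarial point; after the means are refit on the augmented set, the point falls on the correct side, which is the effect Fig. 3 measures at scale.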

TABLE 10
Targeted attack results on AG's News dataset.