Soft Computing Approaches for Tagging Arabic Text

This paper discusses the application of soft computing (SC) techniques in natural language processing. It proposes the design and implementation of automatic models that take Arabic text and classify its words into the essential parts of speech. Part-of-speech (POS) tagging is the process of classifying the words in a sentence based on their meanings, functions, and types (noun, verb, adjective, etc.). This work developed two-stage tagging models. The first stage pre-processes the text by removing the suffixes attached to the words. The second stage builds the tagging models using the Multilayer Perceptron (MLP), the Fully Recurrent Neural Network (FRNN), and Support Vector Machines (SVM). The system classifies words and assigns the correct part of speech to each one according to its position in the sentence. The effectiveness of the proposed models was tested using two different languages (Arabic and Hindi). The results showed that the proposed models successfully solve the word disambiguation problem for Arabic text. Compared with previous studies, the proposed models classified the parts of speech with an accuracy of up to 99%.


INTRODUCTION
Soft Computing (SC) is a collection of computational models that exploit tolerance for imprecision and uncertainty to achieve robustness and low solution cost when formulating real-world problems. SC generally includes Artificial Neural Networks (ANN), Fuzzy Logic (FL), Evolutionary Computing, Genetic Algorithms (GA), and Rough Set Theory [30,32]. The main characteristic of SC techniques is their ability to evaluate, decide, check, and calculate within a vague and imprecise domain, emulating the human ability to learn from experience. Natural Language Processing (NLP) can be defined as an automatic or semi-automatic approach to processing human language [17,25,33]. Recently, the processing of the Arabic and Hindi languages has become a primary focus of research and commercial development. Most NLP applications include a fast and accurate POS tagger as one of their core components [21]. The Part of Speech (POS) is a classification of words according to their meanings and functions. The POS tagger plays an essential role in most NLP applications, such as machine translation, information extraction, speech recognition, and grammar and spelling checkers [13].
Moreover, the accuracy of POS tagging is determined by factors such as ambiguous words, ambiguous phrases, unknown words, and multi-part words. Several features motivate researchers to adopt neural network-based solutions to such problems [12,15]. The most interesting features of neural networks (NN) include massive parallelism, uniformity, generalization ability, distributed representation and computation, learnability, trainability, and adaptation. Neural approaches have been applied successfully in many areas of artificial intelligence, such as image processing, NLP, speech recognition, pattern recognition, and classification tasks [2]. Recurrent Neural Networks (RNN) consist of neurons with feedback connections, which are biologically more plausible and computationally more powerful than other adaptive models such as Hidden Markov Models (HMM) [31], Feed-Forward Networks, and Support Vector Machines (SVM) [14,16,24,34]. SVMs are a supervised learning method used to perform binary classification and regression tasks. They belong to a family of generalized linear classifiers. The main advantage of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin [6,12].

THE DIFFERENCES BETWEEN POS TAGGER MODELS
Part-of-speech tagging is a complicated process. It is not just a matter of having a list of words and their parts of speech, because some words can represent more than one part of speech and some occur in ambiguous phrases [4,5]. Hence, it is hard to build a POS tagger that achieves 100 percent accuracy on extensive training data. Typically, different approaches have been used to build part-of-speech taggers, such as rule-based [9,18,19], stochastic [7,10,11], neural network [1,12,13,14,15,16,20,22,23,26], and hybrid systems [8]. The rule-based and stochastic approaches need a vast amount of data to adapt and implement the POS tagger. In contrast, neural methods are known to need only a small amount of data for the training and learning stages.
Moreover, neural approaches not only learn the associations (word-to-tag mappings) from a representative training data set but can also generalize to unseen data [1,17]. Overall, stochastic taggers have several advantages over rule-based taggers.
They avoid the need for laborious manual rule construction and may capture useful information that humans overlook. However, these probabilistic taggers have the disadvantage that linguistic information is captured only indirectly, in large tables of statistics. In contrast, rule-based taggers have minimal storage requirements and, at the same time, are more portable [17].

PERFORMANCE MEASURES
Evaluating the performance of the classification process is crucial in machine learning systems, because without it, it is infeasible to compare learning algorithms or even to know whether a hypothesis should be used. The most important attribute in the assessment of a part-of-speech tagger is accuracy [28]. The quality of the output depends on the comparability of conditions [17] such as:
Tag-set size: Normally, a small tag-set helps to give highly accurate tagging, but it does not offer as much information or disambiguation between the lemmas as a larger one would.
Corpus type: A corpus (plural: corpora) is a set of texts collected for a purpose. The type of corpus affects the quality of the tagger's output when the genre or type of the corpus data differs from the tagged material.
Vocabulary type: Tagging specialized texts, such as medical or legal texts, requires a training corpus that contains examples of such texts; otherwise, the number of unknown words will be unnaturally high. Likewise, the high incidence of idiomatic expressions in literary texts often leads to inaccuracy.
Moreover, ambiguous words and phrases, unknown words, and multi-part words can affect the accuracy of POS tagging. Ambiguity appears at different levels of the language processing sequence, such as the syntactic or semantic phase [28]. 04% on test data that also included words unseen during the training. Jabar [23] implemented an Arabic part-of-speech tagger based on the multilayer perceptron (MLP). The experiments showed that the MLP tagger is highly accurate (98%), with low training time and fast word tagging. Only a small amount of data was used to achieve the adaptation and learning of the network. Jabar [12] proposed an Arabic part-of-speech tagger based on support vector machines. The radial basis function is used as a linear function approximation. The experiments showed that the SVM tagger achieves high accuracy and recall of about 99.99%. Jabar [14] proposed an Arabic part-of-speech tagger based on Fully Recurrent Neural Networks (FRNN). The back-propagation through time (BPTT) learning algorithm is used to adjust the weights of the network and associate inputs with cyclic outputs. To predict the syntactic classification tags accurately, an encoding scheme is also presented and applied. The experiments showed that the FRNN tagger is accurate and achieved 94% in the classification phase.

RELATED WORK
Similarly, the POS disambiguation problem was successfully solved. Khoja [19] proposed the APT (Arabic Part-of-Speech Tagger), which combines statistical and rule-based techniques. The APT tag-set is derived from the BNC English tag-set, modified with some concepts from traditional Arabic grammar.

TAGGERS DESIGN
The proposed system consists of two main stages, as depicted in Figure 2. The key function of the first stage is to prepare the input data sets for the next stage. This stage is written and implemented using VBA commands for Excel. The main function of the second stage is to implement the automatic taggers.
These taggers are designed and implemented using the NeuroSolutions for Excel software. The first stage is called the "pre-processing phase" [17]. It carries out the following tasks: text normalization, text tokenization, and text encoding.
Text normalization converts the input from free text into a form suitable for use in the next stage. In general, the input can be supplied as either a text file or an XML file. Therefore, the system is designed to disregard all the HTML tags and extract the pure content of the document.
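The normalization step can be illustrated with a minimal Python sketch that drops markup and keeps only the text content. This is an assumption-laden example, not the original implementation (which was written in VBA): the class name `_TextExtractor` and the function `normalize_text` are hypothetical, and skipping the contents of script and style elements is a choice made for this sketch.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and ignores markup, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


def normalize_text(raw):
    """Return the plain text of `raw`, dropping any HTML/XML tags."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Input that contains no tags passes through unchanged apart from whitespace collapsing, so the same function can serve both plain text files and marked-up files.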
Then, text tokenization splits the clean text into simple tokens such as numbers, punctuation, symbols, and words. An algorithm has been developed to perform the text tokenization task. Lastly, text encoding transforms the input data into a suitable numerical form that the network can identify and use [13,17].
The proposed encoding method aims to overcome the drawbacks of previous encoding schemes.
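As an illustration of the tokenization and encoding steps, the following Python sketch splits clean text into tokens and maps symbols to one-hot vectors. The regular expression, the function names, and the one-hot scheme are all assumptions made for this example; the encoding method actually proposed in this work is not reproduced here. The pattern also treats Arabic diacritic marks as separate symbols, so it assumes undiacritized text.

```python
import re

# One token per number, word (letters of any script, including Arabic),
# or single punctuation/symbol character. Whitespace only separates tokens.
_TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+|[^\w\s]|_")


def tokenize(text):
    """Split text into number, word, punctuation, and symbol tokens."""
    return _TOKEN_RE.findall(text)


def build_index(items):
    """Map each distinct item to an integer id in first-seen order."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index


def one_hot(item, index):
    """Encode `item` as a 0/1 vector with a single 1 at its id."""
    vec = [0] * len(index)
    vec[index[item]] = 1
    return vec
```

For example, `tokenize("Price: 12.5 USD, ok!")` yields the tokens `Price`, `:`, `12.5`, `USD`, `,`, `ok`, and `!`. A one-hot vector grows with the size of the vocabulary or tag-set, which is one reason more compact encodings are often preferred in practice.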

EXPERIMENTS AND RESULTS
The experiments were carried out using the Arabic tag-set proposed by Khoja [19].
The tag-set contains 177 tags covering various categories. Reducing words to their basic roots is not considered in this study, which assumes that the words were segmented before POS tagging began. The experiments covered the three taggers proposed in this paper: the SVM tagger, the MLP tagger, and the FRNN tagger. The input text is encoded into a suitable form and then divided into three sets: training data sets, cross-validation data sets, and test data sets. Cross-validation computes the error on a test data set at the same time as the network is being trained with the training set. The Genetic Algorithm (GA) is used as a heuristic optimization method for finding the best network parameters [27]. It starts with an initial population of randomly created bit strings.
These initial samples are encoded and applied to the problem. This study used the GA to optimize the learning rule parameters, such as the step size and momentum value [13,14,17]. This enables the optimization of the momentum values for all gradient components in the NeuroSolutions software that use momentum. The GA is also used to determine the number of processing elements. To allow the fitter individuals in the population to reproduce at a higher rate, a selection method based on the roulette wheel technique is used. The standard way to assess tagger performance is the percentage of correct tag assignments.
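The GA-based parameter search described above can be sketched in Python as follows. This is only an illustrative sketch, not the NeuroSolutions implementation: the bit resolution, one-point crossover, mutation rate, elitism, and the fitness transform 1/(1 + error) are assumptions, and `toy_error` is a hypothetical stand-in for the validation error of a trained network.

```python
import random

BITS = 8  # bits per parameter (assumed resolution)


def decode(chrom):
    """Split a bit string into (step_size, momentum), each scaled to [0, 1]."""
    def to_unit(bits):
        return int("".join(map(str, bits)), 2) / (2 ** len(bits) - 1)
    return to_unit(chrom[:BITS]), to_unit(chrom[BITS:])


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]  # guard against floating-point round-off


def evolve(error_fn, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Search for the (step_size, momentum) pair that minimizes error_fn."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(2 * BITS)]
                  for _ in range(pop_size)]
    best = min(population, key=lambda c: error_fn(*decode(c)))
    for _ in range(generations):
        # Lower error means higher fitness; error_fn must be non-negative.
        fitnesses = [1.0 / (1.0 + error_fn(*decode(c))) for c in population]
        next_pop = [best[:]]  # elitism: carry over the best individual
        while len(next_pop) < pop_size:
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            cut = rng.randint(1, 2 * BITS - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_pop.append(child)
        population = next_pop
        best = min(population, key=lambda c: error_fn(*decode(c)))
    return decode(best)


def toy_error(step_size, momentum):
    """Hypothetical error surface with its minimum at (0.1, 0.7)."""
    return (step_size - 0.1) ** 2 + (momentum - 0.7) ** 2
```

Calling `evolve(toy_error)` returns a step size and momentum in [0, 1]. In a real setting, `error_fn` would train the network with the decoded parameters and return its cross-validation error, which makes each fitness evaluation far more expensive than in this toy example.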
The NeuroSolutions for Excel software provides six measures for testing network performance: the mean squared error (MSE), the normalized mean squared error (NMSE), the correlation coefficient (r), the percent error, Akaike's information criterion (AIC), and Rissanen's minimum description length (MDL) criterion [17].
The NMSE estimates the overall deviation between the values predicted by the network and the measured (desired) values. It is defined in terms of the following quantities: P is the number of output processing elements, N is the number of exemplars in the data set, y_ij is the network output for exemplar i at processing element j, and d_ij is the desired output for exemplar i at processing element j. The correlation coefficient (r) measures the relation between a network output x and a desired output d. Usually, the MSE is used as the evaluation function for the reliability of the network output. The best network results for training data of the proposed taggers
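The measures discussed above can be made concrete with a short Python sketch. The formulas used here are assumptions based on the common definitions: MSE as the squared error averaged over all N exemplars and P outputs, NMSE as the total squared error divided by the total variation of the desired outputs, and r as the Pearson correlation. They may differ in detail from the exact expressions used by the software, and the function names are hypothetical.

```python
import math


def mse(desired, output):
    """Mean squared error over N exemplars (rows) and P outputs (columns)."""
    n, p = len(desired), len(desired[0])
    sse = sum((d - y) ** 2
              for d_row, y_row in zip(desired, output)
              for d, y in zip(d_row, y_row))
    return sse / (n * p)


def nmse(desired, output):
    """MSE normalized by the variation of the desired outputs (assumed form)."""
    n, p = len(desired), len(desired[0])
    denom = 0.0
    for j in range(p):
        col = [row[j] for row in desired]
        denom += (n * sum(d * d for d in col) - sum(col) ** 2) / n
    return p * n * mse(desired, output) / denom


def correlation(x, d):
    """Pearson correlation between one network output x and desired d."""
    n = len(x)
    mx, md = sum(x) / n, sum(d) / n
    cov = sum((xi - mx) * (di - md) for xi, di in zip(x, d))
    return cov / math.sqrt(sum((xi - mx) ** 2 for xi in x)
                           * sum((di - md) ** 2 for di in d))


def tagging_accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)
```

Under this form, a network that always outputs the mean of the desired values has an NMSE of 1, so values well below 1 indicate that the network has learned more than the average target. The percentage of correct tag assignments remains the most direct measure of a tagger's quality.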

COMPARISON & CONCLUSIONS
Comparison with related work
The comparison has to be carried out carefully because the features used here to identify the languages and the tag sets differ from those of previous studies [12]. Comparing the proposed taggers with other existing taggers is complicated because tagger accuracy relies on numerous parameters, such as language difficulty (ambiguous words, ambiguous phrases), the nature of the language (English, Arabic, Hindi, Chinese, etc.), the training data size, the tag-set size, and the evaluation criteria [13,14]. The tag-set size has a significant impact on the tagging process. The proposed taggers are assessed using accuracy as well as the MSE. In addition, the amount of data used in the training and learning stages is considered. A comparison of the proposed taggers with the results of other taggers showed that the proposed taggers achieved a high accuracy rate when GA optimization was used to improve the values of the momentum rate and the step size. The proposed taggers (SVM, MLP, and FRNN) achieved a high accuracy of 99% in the final experiments, when the GA optimization process was applied. Table 4 summarizes the comparison with other researchers. Figure 7 illustrates the overall comparison results.

Conclusions
The research mainly aims to implement an automatic and accurate tagging system that can be used as a core component of NLP applications. The automatic part-of-speech tagging system is built on neural network techniques and can tag free text automatically. The study demonstrated several kinds of taggers that can solve the problems associated with the construction of languages, such as Arabic and Hindi part-of-speech tagging. The new approaches are highly accurate, with low processing time and high-speed tagging. A two-stage automatic tagging system based on SVM, MLP, and FRNN is designed and implemented. The proposed system helps to classify words and assign the correct POS to each of them. The results are exceptionally encouraging, with correct assignments and recall of about 99%. The Genetic Algorithm is used to optimize network variables such as the momentum rate and step size. In addition, the disambiguation of Arabic words is solved by the proposed Arabic POS taggers.

FUTURE WORK
This paper presented the design and implementation of an automatic tagger that can tag free text directly, associating each word with its correct part of speech. The work handles only two types of files (text files and HTML files), so supporting more file types would be preferable. Extracting the text directly from a website would also be very useful. The current work comprises two separate stages: the pre-processing phase, implemented in VBA, and the processing phase, implemented using the NeuroSolutions software.
It would be advantageous to merge the two stages into a single component to produce a portable system that can be used with any other application.