UQuAD1.0: Development of an Urdu Question Answering Training Dataset for Machine Reading Comprehension

In recent years, low-resource Machine Reading Comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models have been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset intended for extractive machine reading comprehension, consisting of 49k question-answer pairs in question, passage, and answer format. Of these, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: a rule-based baseline and advanced Transformer-based models. Since the latter clearly outperformed the former, we concentrate our analysis on Transformer-based architectures. Using XLM-RoBERTa and multi-lingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.


Introduction
Text comprehension and question answering remain difficult tasks for machines and require large-scale resources for training. The scarcity of annotated datasets in low-resource Asian languages is one of the primary reasons that the development of language-specific Question Answering models lags behind, particularly in the case of the Urdu language. Some dataset-creation techniques for low-resource languages transfer English resources in order to perform NLP tasks. In response to increased demand and a dearth of standard datasets in Urdu, we introduce UQuAD1.0 (Urdu Question Answering Dataset): a large-scale question-answering dataset built for Urdu MRC. We gathered 4k Urdu QA pairs via crowdsourcing and combined them with 45k Urdu-translated SQuAD [1] tuples. The study includes statistics on the distribution of answers and questions, along with the types of questions. By publicly releasing training data for reading comprehension tasks, UQuAD1.0 contributes to multi-lingual language processing research. While machine translation (MT) for languages with minimal resources has proven to be a challenging task [2] [3], the level of difficulty grows further when translating between a morphologically rich and a morphologically poor language [4] [5]. To that end, we have developed the following research questions:
• RQ1: Can QA resources for other languages be created just by translating English resources?
• RQ2: Can time and effort be saved when manually annotating QA materials in other languages by utilizing existing resources in English?
• RQ3: Is it possible to learn Urdu RC using pre-trained multi-lingual architectures that have been trained on a variety of languages?
• RQ4: Is it possible to evaluate a model based on its language understanding capability?
In RQ1, we examine MT, i.e., the machine-translated SQuAD resource. With flawless translation, MT alone should suffice to train an Urdu QA system equivalent to one trained on SQuAD. However, there are two issues: (1) translation shifts or loses the position of the answer span, and (2) the quality of QA pairs varies. Without overcoming these obstacles, F1 performance using our best-performing model is 12.49, demonstrating the weakness of this technique. We determined from the results of RQ1 that MT performance is low, so it is appropriate to create language-specific (Urdu) resources for QA. As a result, we built the small Urdu MRC benchmark dataset using the same crowdsourcing technique that was used to build the English SQuAD, with the following contributions:
1) Including a variety of question types: to evaluate different aspects of the MRC model's language understanding capability, we provide different question types based on Bloom's taxonomy.
2) Avoiding lexical shortcuts: by imposing lexical and syntactic variety while creating queries, similar to the benchmark SQuAD dataset.

3) Covering diverse sources: our data consists of Urdu Wikipedia articles and Tafheem (RC) worksheets from Cambridge O-level books.
Although part of our dataset came from a translated resource, a significant amount of time and resources was needed to train workers to examine and repair translated samples. This addresses our RQ2: manual annotation takes a comparable amount of time and effort. To answer RQ3, we fine-tuned an Urdu QA system on the small human-crowdsourced data and the large translated resource with a multi-lingual Transformer-based architecture, achieving an F1 score of 0.66. Finally, for RQ4, we evaluate the model's language comprehension capabilities by examining the types of questions it can answer and the associated accuracy.
This paper discusses related work in Section 2, followed by dataset construction and statistics in Section 3. Models are described in Section 4, and Section 5 presents the experiments and results. Finally, Section 6 presents the conclusion and future work.

Related Work
A task where the system must answer questions about a document is called machine reading comprehension (MRC). This technique gained significant acceptance following the publication of a large-scale Reading Comprehension (RC) dataset termed SQuAD [1], containing over 100,000 questions on popular Wikipedia articles. The broad use of SQuAD has resulted in the creation of other related datasets. For instance, TriviaQA [6] comprises 96k questions and answers regarding trivia games, discovered on the Internet together with documents containing the answers. The Natural Questions corpus [7] is a set of questions almost three times the size of SQuAD, with questions extracted from Google search logs. MS MARCO [8] has one million queries extracted from Bing Search.
Unfortunately, there are very few similar MRC datasets for other languages, necessitating the development of multi-lingual MRC for low-resource languages; the XSQuAD [9] dataset was built to meet this demand. It contains 40 paragraphs and 1190 question-answer pairs from SQuAD that have been translated into ten languages. Arabic and Hindi are also included in XSQuAD, but not Urdu. In order to address the unavailability of RC datasets other than English, significant efforts have been made in recent years to develop Reading Comprehension datasets in low-resource languages. For example, SberQuAD [10], a SQuAD-like dataset for the Russian language, was recently created using the same technique as SQuAD. Additionally, [11] offered a Bulgarian dataset, [12] presented a Tibetan dataset, and [13] generated an Arabic Reading Comprehension Dataset (ARCD) to fill the gap of MRC in other languages. Also, SQuAD-it [14], a semi-automatic translation of the SQuAD dataset into Italian, is a huge dataset for question answering in Italian containing 60k question/answer pairs. [15] released FQuAD: French Question Answering Dataset in two versions with 25k and 60k samples. [16] introduced HindiRC, which consists of only 127 questions from 24 paragraphs, manually annotated by humans. Another synthetic dataset, translated from SQuAD 1.1 for Hindi reading comprehension by [17], consists of over 18k questions. The absence of native-language annotated datasets other than English is one of the primary reasons language-specific Question Answering models take longer to develop.
As previously stated in the literature review, Hindi and Arabic are two low-resource languages that have evolved rapidly in MRC, with the research community giving them due attention in recent years. Although Urdu shares many characteristics with Arabic, Persian, and Hindi, such as the lack of capitalization, compound words, similar morphology, and free word order [5], Urdu still struggles to produce initial studies in NLP. These languages are among the most widely spoken globally, with 170 million Urdu speakers, 490 million Hindi speakers, and 255 million Arabic speakers worldwide. These low-resource languages are in high demand in real-world applications such as human-robot interaction, question answering, recommendations, and specialized search queries. This has a knock-on impact on every business, because machines that comprehend questions and respond with appropriate information can boost efficiency and save time.
Unfortunately, Arabic and Hindi do not have monolingual large-scale MRC datasets to create state-of-the-art RC models, but they are making progress in this area by experimenting with alternate methodologies. On the other hand, the Urdu research community lags behind its close relatives, as it lacks even a single dataset in the Nastaliq script.
To the best of our knowledge, MRC contains no contributions in Urdu. To address the scarcity of Urdu language comprehension data, we present UQuAD1.0, an Urdu QA dataset for reading comprehension consisting of a total of 49k tuples (question, paragraph, and answer). Our research complements previous efforts by annotating small resources while utilizing large resources generated for another language. From a model architecture standpoint, most existing state-of-the-art models for reading comprehension rely on Transformer-based architectures that use the self-attention mechanism to weigh the significance of each component of the passage differently, achieving good performance.

UQuAD1.0 Dataset
The UQuAD dataset consists of two main parts: a large-scale Machine Translated (MT) part and a small-scale manually annotated part built using a crowdsourced approach. Both parts are presented in detail in Sections 3.1 and 3.2. General statistics about each portion are presented in Table 1.

Machine Translated UQuAD1.0
To address RQ1, we examine the difficulty of retaining answer spans from English to Urdu. Based on Google Translate output of English SQuAD tuples into Urdu, we identify the following three cases, illustrated in Appendix A:
• Exact matching (36%): English answer spans are translated into exact Urdu terms.
• Synonym matching (17%): The Urdu answer spans are paraphrased versions of the Urdu passage's terms.
• Unpreserved Spans (47%): Google Translation cannot retain answer spans throughout translation due to a language barrier or translation inaccuracy.
We were able to collect 53% of the UQuAD dataset using the first two cases. However, we had to delete 47% of the data due to answer-span issues. It is clear from the last example in Appendix A that a feminine pronoun is referred to as male throughout the phrase; in other cases, the answer was not preserved between paragraph and answer owing to translation discrepancies.
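The span-retention check described above can be sketched as a small classifier over translated (passage, answer) pairs. This is a minimal illustration, not the exact pipeline used for UQuAD; the `synonyms` lookup is a hypothetical stand-in for whatever paraphrase resource is available:

```python
def classify_span(passage, answer, synonyms=None):
    """Classify a translated QA pair by whether the answer span survives
    translation: "exact" if the translated answer appears verbatim in the
    translated passage, "synonym" if a known paraphrase of it does, and
    "unpreserved" otherwise.
    """
    if answer in passage:
        return "exact"
    # `synonyms` maps an answer to a list of acceptable paraphrases
    # (hypothetical; a real system would use a paraphrase resource).
    for paraphrase in (synonyms or {}).get(answer, []):
        if paraphrase in passage:
            return "synonym"
    return "unpreserved"
```

Pairs classified as "unpreserved" (47% in our data) would then be dropped.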

Crowd-Sourced UQuAD1.0

The main challenge in training QA systems is poorly translated QA pairs. Example 3 in Appendix A shows that machine translation cannot find the correct answer span in a large portion of translated samples. We thus build a small-scale, language-specific, human-generated resource for fine-tuning QA systems. The advantage of this resource is near-perfect precision; the disadvantage is that it is labor-intensive. Similar to the SQuAD1.0 collection process, we crowdsourced over 4k question-answer pairs. The data generation procedure for this dataset is generally the same as for SQuAD1.0. However, we exploited unique aspects of the Urdu language, such as extensive vocabulary usage and a diversity of question types as per Bloom's taxonomy (Appendix B), to enrich and diversify the dataset. For the question-answer creation process, we recruited volunteers from different cities so that everyone had their own distinct style of questioning, which added more variety to the dataset. We used a dedicated user interface (UI) guided by the SQuAD guidelines to build human-generated QA pairs. We took a sample of 100 Urdu Wikipedia pages and extracted paragraphs of considerable length without graphics. The 100 articles resulted in 1972 paragraphs covering various topics from politics, religion, education, and music, as reflected in Table 2. Human annotators utilized the UI depicted in Figure 1 to read a text, enter questions, choose question types specific to Bloom's taxonomy, and then highlight the spans containing answers. Generating questions by copying and pasting content from Wikipedia was restricted.

UQuAD Question and Answer Types Analysis
To increase the difficulty of this dataset, we prevent MRC models from adopting simple techniques based on fundamental word matching. Additionally, rather than focusing exclusively on keywords, our goal is to generate questions that can only be addressed by examining the entire passage. Not all questions are equally challenging. Some questions are simple to answer, while others may require much thinking. Bloom [18] provides a taxonomy to assist in framing queries at various thinking levels. It divides cognitive abilities into six categories, ranging from low-level abilities to high-level abilities that require deeper cognitive instruction. Each question is complicated in its own way, and comprehending and correctly replying to each demands a separate set of cognitive talents. In Figure 5.3, we identify the types of reasoning required to solve Bloom's taxonomy-based questions and present the results of a manual assessment of 200 questions drawn from the test set. The most common question type accounts for 26.4% of all inquiries and falls under the Remember category, which we may query using lexical variants or by rearranging synonyms. Analyze questions, which account for 19.6% of all questions, require the collection of evidence from multiple sentences. 2.9% of questions fall into the Comprehend category, which interprets the message provided in the question using the various cues listed in column 3 of Appendix 2. Finally, in the external-knowledge category, we checked whether the response was absent from the text or the response span was erroneously picked owing to worker error. Similar to the SQuAD dataset, we also establish five forms of reasoning necessary to answer 200 questions from the UQuAD test set, summarized in Figure 5.3. The most frequent type constitutes 27% of the test data and involves rearranging the syntax or altering the phrasing of the supporting phrase.
Questions answerable from the passage using a synonym and questions requiring global knowledge account for 20% and 10%, respectively. 13% of questions need evidence from several sentences. On average, 10% of questions featured a deduction among a sentence's options that satisfy the question's requirements. Finally, 20% of questions were based on information outside the paragraph or were picked erroneously owing to human error. For answers, we categorize UQuAD answers into six groups, shown in Table 3. This results in 18% object responses, followed by person, date, and place. Description and reasoning questions account for 17.4% and 8.3%, respectively. We conclude that UQuAD1.0 has more Date and Person classes than SQuAD1.0, but the other classes are relatively equivalent.

Models
We investigate the performance of three models: a baseline approach based on a sliding window [19] and two multi-lingual Transformer-based models, BERT [20] and XLM-RoBERTa [21]. The sliding-window approach was first introduced in the MCTest paper; it solves the answer extraction problem in a rule-based manner without requiring any training data. BERT is a powerful pre-trained model that recently obtained state-of-the-art performance on various NLP tasks. We use the multi-lingual pre-trained model released by Google to fine-tune BERT on the UQuAD1.0 task without applying additional language-specific NLP techniques. The third model is XLM-RoBERTa, a multi-lingual pre-trained Transformer model trained on Common Crawl data covering 100 languages, including Urdu. XLM-RoBERTa outperformed previous multi-lingual models such as mBERT and XLM on a variety of downstream tasks.

Sliding Window Baseline
We picked the sliding window as the baseline approach since it is also used in the benchmark SQuAD work, and because it demonstrates that matching term frequency or simple word matching between question and context cannot solve the RC problem. For a given (paragraph, question), the sliding-window approach works as follows:
1) Tokenize P-Q-A Tuple: convert each Paragraph-Question-Answer tuple to a set of tokens using a dedicated Urdu tokenizer from the Stanford Stanza library.
2) Generate Candidate Answers: generate a list of text spans of the input paragraph. These are treated as candidate answers.
3) Score Candidate Answers (SW+D): score each candidate answer using Sliding Window and Distance features.
4) Compute Final Score: for each candidate answer, final score = sliding window score − distance score.
5) Predict Answer: the candidate answer with the highest score is the answer predicted by the model.
Since UQuAD is not a multiple-choice question dataset, we had to generate candidate answers from scratch. For that, we first generate all possible text spans from the passage with a threshold on the maximum length of candidate answers. Then we keep only the answers that have the highest unigram and bigram overlap score with the related question. The sliding-window and distance-based scores of each candidate answer are computed using the algorithms in Figure 3.
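As a rough sketch of the scoring in step 3, the following implements an MCTest-style sliding window: slide a window the size of the combined question-and-candidate token set over the passage and take the maximum sum of inverse-count weights of the overlapping tokens. This is a simplified illustration under our own weighting assumptions, not the exact algorithm of Figure 3:

```python
import math
from collections import Counter

def sliding_window_score(passage_tokens, question_tokens, candidate_tokens):
    """MCTest-style sliding-window score: rare passage tokens that also
    appear in the question or candidate answer contribute more weight."""
    counts = Counter(passage_tokens)
    # Inverse-count weighting: rarer tokens are more informative.
    weight = {w: math.log(1 + 1 / counts[w]) for w in counts}
    target = set(question_tokens) | set(candidate_tokens)
    size = max(1, len(target))
    best = 0.0
    for i in range(max(1, len(passage_tokens) - size + 1)):
        window = passage_tokens[i:i + size]
        best = max(best, sum(weight[w] for w in window if w in target))
    return best
```

The distance feature subtracted in step 4 would similarly measure the token distance in the passage between question words and the candidate span.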

Transformer models: XLMRoberta and mBERT
In this work, we examine the performance of Transformer-based models for machine reading comprehension. Models built on top of the Transformer architecture [22] account for the vast majority of state-of-the-art results in a variety of natural language processing tasks. In the absence of pre-trained Urdu monolingual models, we leverage the transfer-learning capability inherent in these models to fine-tune existing pre-trained models for question answering on the UQuAD dataset. We examined several such models and found that two of them outperform all others in our experiments: multi-lingual BERT (mBERT) and XLM-RoBERTa. They already incorporate knowledge of 104 and 100 languages, respectively, including Urdu.

Train/Validation/Test split
We created two sets of annotated QA pairs, one for training (80%) and one for testing (20%), with no overlap of passages or articles. Statistics on both portions are presented in Table 1. The training part was further split into 80% for actual training and 20% for validation. We use 5-fold cross-validation for better significance of the performance results.
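The passage-level split can be sketched as follows: whole passages, not individual questions, are assigned to one side, so no passage overlaps between train and test (a minimal illustration; the tuple layout is an assumption for the example):

```python
import random

def split_by_passage(qa_pairs, train_frac=0.8, seed=13):
    """Split (passage, question, answer) tuples into train/test with
    no passage overlap: shuffle the distinct passages and assign each
    whole passage to exactly one side."""
    passages = sorted({p for p, _q, _a in qa_pairs})
    random.Random(seed).shuffle(passages)
    train_passages = set(passages[:int(len(passages) * train_frac)])
    train = [t for t in qa_pairs if t[0] in train_passages]
    test = [t for t in qa_pairs if t[0] not in train_passages]
    return train, test
```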

Data Preprocessing
While mBERT and XLM-RoBERTa are distinct models, their overall fine-tuning process using the Transformers library is similar, demonstrating the API's potential. Therefore, we describe in the following paragraphs the common foundation for both models. We begin by preprocessing the whole dataset to eliminate any noise or inconsistency introduced during data gathering. We eliminate instances that exhibit one or more of the following:
• The paragraph does not contain an answer span.
• The answer index is incorrect (i.e., the paragraph text at the answer index differs from the answer text).
• The answer index is -1.
We then prepare the model inputs based on each answer's start/end indices in the resulting data. We know the answer's placement in terms of its character index inside the paragraph, but we require its location in terms of the model's internal tokenization scheme. To do this, we proceed as follows for each (Paragraph, Question, Answer) tuple:
1) Tokenize the answer to ascertain the number of tokens it contains.
2) In the associated paragraph, replace the answer with a list of the model's mask tokens (i.e., [MASK] for mBERT and <mask> for XLM-RoBERTa), according to the number of tokens in the answer. Then tokenize the result.
3) Determine the start/end indices of the answer in the tokenized paragraph by locating the mask tokens in the encoded result.
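The three alignment steps above can be sketched as follows. The tokenizer is passed in as a function to keep the example self-contained; in practice it would be the model's own tokenizer, with [MASK] or <mask> as the mask token accordingly:

```python
def locate_answer_tokens(paragraph, answer, start_char, tokenize, mask_token="[MASK]"):
    """Map a character-level answer span to token-level start/end indices:
    replace the answer with mask tokens, tokenize, and find the masks."""
    n_tokens = len(tokenize(answer))                     # step 1
    masked = (paragraph[:start_char]
              + " ".join([mask_token] * n_tokens)        # step 2
              + paragraph[start_char + len(answer):])
    tokens = tokenize(masked)
    start = tokens.index(mask_token)                     # step 3
    return start, start + n_tokens - 1
```

With a subword tokenizer the answer may span several tokens, which is why the number of masks must match the tokenized answer length.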

Fine-Tuning Process
Choosing the maximum input sequence length (max_len) at which samples will be truncated is a small but essential step. It matters for successful fine-tuning because using the longest input sequence as the threshold would slow down the training process and potentially cause memory overflow, resulting in Out Of Memory (OOM) failures. Both the maximum length and the batch size must fit within the memory constraints of our GPU. As a result, there is a tradeoff between data loss, memory usage, and training speed. To handle this, we calculate data loss as a function of the maximum length value and tolerate up to 2% data loss. Then we choose a threshold that considers all three variables (i.e., data loss, training speed, memory usage). Finally, we load the pre-trained model, set its hyperparameters (learning rate, optimizer, batch size), and begin fine-tuning the model using the previously encoded UQuAD training data. For each model, we trained two sub-models using the same architecture: one to predict the answer start index and the other to predict the answer end index. Table 4 presents the values of the hyperparameters used during fine-tuning. A maximum input sequence length of 384 and a batch size of 16 is the largest combination the available GPU (i.e., Tesla T4) could handle without memory issues, with data loss smaller than 2% and acceptable speed (~1h30min per epoch). We use the Adam optimizer [23] (more specifically, the AdamW variant) and a dynamic learning rate that decreases linearly over training steps, with the number of warmup steps set to 0. The validation set is a randomly sampled 20% of the training dataset used to tune hyperparameters. The number of epochs was tuned by plotting the training and validation loss for different numbers of epochs (see Figure 4). During fine-tuning, the training loss keeps decreasing while the validation loss initially decreases and then starts increasing after several epochs, reflecting the start of over-fitting. We choose the epoch preceding this point as the best.
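The max_len selection described above can be sketched as picking the smallest candidate length whose truncation-induced data loss stays under the 2% tolerance (a simplified illustration; the actual choice also weighs memory usage and training speed):

```python
def truncation_loss(token_lengths, max_len):
    """Fraction of samples longer than max_len, i.e. the 'data loss'
    incurred by truncating at that length."""
    return sum(1 for n in token_lengths if n > max_len) / len(token_lengths)

def pick_max_len(token_lengths, candidates=(128, 256, 384, 512), tolerance=0.02):
    """Smallest candidate max_len whose data loss is within tolerance."""
    for m in sorted(candidates):
        if truncation_loss(token_lengths, m) <= tolerance:
            return m
    return max(candidates)
```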

Model Evaluation
In many aspects, the MRC task is comparable in complexity to a human reading comprehension task. As a result, MRC model evaluation can take the same form: the model answers questions about paragraphs and is assessed by comparing the model's replies to the correct ones. This answers RQ3 about the model's capability of learning RC. We may compare the model's output to the correct answer and assign it a score of 1 if they are identical and 0 if they are not. This metric is called Exact Match (EM). It will, however, count partially accurate answers as wrong responses. Even if the model's output is extremely close to the correct answer, for instance when the correct answer is "KotAdu city" and the model outputs "KotAdu", the exact match score will still be zero.
Consequently, the F1 metric, the harmonic mean of precision and recall, is often used for extractive answers. Precision is measured by the proportion of words in the model's output that appear in the correct response, and recall is measured by the proportion of words in the correct response that appear in the model's output. The F1 measure can thus give a partial score when the model's output is only partially correct. Figure 5 provides an example of calculating the EM and F1 scores.
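The two metrics can be stated concretely. This is the standard SQuAD-style token-level computation (shown here without the usual normalization of punctuation and casing):

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if prediction and gold answer are token-identical, else 0."""
    return int(pred.split() == gold.split())

def f1_score(pred, gold):
    """Token-level F1: precision = overlap / |pred| tokens,
    recall = overlap / |gold| tokens."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the example above, "KotAdu" against the gold answer "KotAdu city" yields EM = 0 but F1 = 2/3.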

Results
We assess the performance of the three models on the UQuAD test set. 23% of the questions in the test set feature more than one possible answer, which provides greater versatility when assessing the model, as the same question may have many answers, occurring in a variety of locations across the paragraph. The evaluation is carried out using two widely used measures for Machine Reading Comprehension: the Accuracy/Exact Match (EM) metric and the F1 score. The F1 score indicates the average overlap between the predicted response and the true answer, whereas EM represents the proportion of predicted answers precisely matching the true answers. While the Sliding Window base model achieved decent performance on English SQuAD (i.e., an F1 score of 0.2), its application to Urdu did not yield good results. The model achieved a very low accuracy of 4% and an F1 score of 0.03. Given the lexical particularities of Urdu, this rule-based algorithm, focusing primarily on matching words and the distance between them, underperforms when applied to the language. When testing the Transformer-based models XLM-RoBERTa and mBERT, we select the answer start index with the highest probability in the dedicated sub-model (i.e., the sub-model that predicts the answer start index).
Similarly, we take the predicted answer end index from the dedicated sub-model. However, their Exact Match scores are significantly lower than their F1 scores, because the models predict answers with a high level of word intersection that fall slightly short of an exact match. For example, a 5-word predicted answer of which four words precisely match the true answer is considered wrong and counts as 0 for EM, while the word-based F1 score will be high. We investigate this aspect by calculating the percentage of predicted answers contained in the actual answers and vice versa. We found for XLM-RoBERTa (resp. mBERT) that 47% (resp. 50%) of the predicted answers are contained in the associated true answers, and 66% (resp. 17%) of the true answers are contained in the associated predicted answers. This comparison also requires whole answers with the same word order; dropping the word-order requirement results in higher percentages, hence the high F1 scores achieved by the models. Table 5 summarizes the performance evaluation findings. The high performance of the Transformer-based models suggests that they capture semantic aspects of Urdu at a broader level. Also, the self-attention mechanism helps memorize relevant parts of the paragraphs that are key to extracting the final answer. Both XLM-RoBERTa and mBERT were pre-trained on text corpora of millions of documents from different languages. This transfer-learning process reduces the fine-tuning time and the required data volume, and lets the model benefit from understanding general aspects of different languages that can apply in a wide range of use cases, especially for relatively similar languages such as Arabic and Hindi, which share a large set of common characteristics.
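The span selection from the two sub-models can be sketched as an independent argmax over the start and end outputs (a simplified illustration; it ignores the joint start/end search some implementations use):

```python
def predict_span(start_scores, end_scores, tokens):
    """Pick the answer span by taking the argmax of the start sub-model's
    scores and the argmax of the end sub-model's scores independently."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if end < start:  # guard against an inverted span
        start, end = end, start
    return " ".join(tokens[start:end + 1])
```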

MRC Performance stratified by a question and answer types
This part assesses the language comprehension power and limitations of our most accurate model, XLM-RoBERTa, using three criteria: Urdu question difficulty, named entities, and question type (who, what, when, where, and which). The proportion of accurately predicted responses (i.e., an exact match between predicted and true answers) for each question (resp. answer) type is shown in Figure 6(a) (resp. (b)). In both charts, we only display question/answer types with more than ten occurrences in the test set. We can see in (a) that performance on "What" questions is poorer than on other WH questions. This matches our understanding, because the "what" question is not as explicit as "Where" for a place, "When" for a date, and "Who" for an individual. The "Which" question likewise received a poor score. In (b), the model gets 97% of the country answers correct. We anticipate that this is due to the transfer-learning process, as countries would have been encountered repeatedly during pre-training in multiple languages.
Moreover, because they were few compared to dates and locations, the model quickly grasped them. Compared to the others, performance on organization and person answer types is relatively poor. A thorough error analysis of these questions reveals that the model can comprehend the context of the question and provide a sentence containing the answer despite the questions being "ambiguous" in nature. The model's prediction for "Exact Match" question types is essentially correct even though the term "services" is not retrieved, and even though the question words are paraphrases of context terms, the model was able to provide an accurate answer to the "Synonym Matching" questions. As seen in Appendix C, it was able to connect the phrase "legislative process" to the "governing body" through the use of word knowledge. For anaphora-based questions, the model cannot distinguish the antecedent "Chaudhry Aitzaz Ahsan" from the subsequent "He". An interesting next step would be to test whether the model can handle multi-sentence reasoning challenges and cataphora resolution issues in the absence of relevant training samples. The overall results are compelling and correspond to the model's expected behavior. However, this is far from a thorough exploration of the model's comprehension, and a more in-depth examination of the model's explainability might provide fascinating results.