Handwritten Pashto Characters Dataset for Optical Character Recognition

<div>This work was presented at the 9th Joint Symposium on Computational Intelligence (JSCI9), organized by the IEEE-CIS Thailand Chapter. The symposium aims to support research students and young researchers by giving participants a place to share and discuss their research before publishing it. The event was open to all researchers who want to broaden their knowledge in the field of computational intelligence.</div><div><br></div><div><div>The Pashto character database developed in this work is available at <a href="https://github.com/mudaser37/pashtoCharacterDataset" rel="noreferrer noopener" target="_blank">GitHub - mudaser37/pashtoCharacterDataset</a></div></div>


I. INTRODUCTION
Character recognition, or optical character recognition (OCR), may be defined as a system that converts scanned images of machine-printed and handwritten text into an editable form [1]. OCR has been used extensively as a basic application of different learning methods in machine learning [2]. The importance of OCR is apparent from the fact that paper documents become outdated in the age of digital computers. With OCR, old books and papers can be archived and stored in digital formats, which turns important information into a reusable form.
However, most OCR systems have been built to recognize Latin, Japanese, Chinese, and other characters, while Pashto text recognition has received comparatively little attention in language research. Pashto is a national language of Afghanistan and is also spoken in large parts of Pakistan. It is spoken by around 50 million people around the globe [3]. Pashto has a rich and diverse literature. The available written material varies widely and covers diverse topics such as education, politics, religion, poetry, and much more. Despite all this, the Pashto language still lacks mature OCR technology. There are several reasons for this lack: Pashto is a cursive language written from right to left, and its characters vary significantly in shape, whereas in non-cursive script languages the shape of a character varies very little.
Most research on Arabic, Persian, and Urdu OCR has focused on the recognition of handwritten scripts. However, the recognition of Pashto characters remains a challenging task.
We have a long-term research plan whose goal is a complete OCR system for Pashto. Since there is no standard dataset for OCR development, we prepare a Pashto character dataset as the first stage. We will also introduce a deep learning method for these characters. In short, the contribution of this study is the development of a new handwritten character dataset for Pashto character recognition.

II. RELATED WORK
The characters of Arabic and Persian are similar to those of Pashto, and there is a lack of research specifically on Pashto OCR. We therefore reviewed some prior work on Arabic and Persian and summarize it as follows.
OCR has been approached through two popular classes of methods, namely holistic and analytical methods [4]. We discuss these two approaches in turn. Holistic methods do not rely on specific rules governing typography. Because they can be generalized to any language, such approaches are common. An image containing text is treated as a one-dimensional vector, and features are extracted from the image [4]. No segmentation is required for such methods. These algorithms are robust to changes in size and rotation. One of their major drawbacks is that a large amount of training data is needed; furthermore, a rich set of features is required to build a model. The BBN Byblos OCR system is a well-known OCR system based on holistic approaches [5]. Multiple languages have been tested on this system. With these methods, a very low error rate was recorded for synthetic data. When applied to a comparatively larger database, however, these approaches fail to perform well because very little training data was used during the development stage.
For Pashto text, a method based on a holistic algorithm is reported in [6]. The authors used the Noori Nastaliq script and evaluated their OCR system on a synthetic database. Further OCR methods can be found in references [7]-[12].
The second class of OCR methods is analytical methods, which are more advanced and are constructed from specific grammatical rules of the respective language. A unique set of features is used to identify each character. These methods perform segmentation at the atomic level, and they perform better when that prior segmentation is easy. For non-cursive script languages, the boundary of a character can be located easily; hence the results are much better [4]. Good segmentation is mandatory for acceptable performance, and it is itself a major challenge in analytical methods. For the Pashto language, no algorithm based on analytical methods has been developed yet. Some methods based on Hidden Markov Models and Neural Networks are reported in [13], [14] for other cursive script languages.
A database for Pashto ligatures is also reported in [20]. The authors used Recurrent Neural Networks to develop a Pashto OCR system, and tests were performed on a limited set of images. The authors named their database KPTI; it consists of 17,015 images of Pashto text. To the best of our knowledge, [20] is the best research work reported specifically for the Pashto language. Other works that use deep learning-based methods for cursive script languages can be found in references [16]-[19].
A medium-size database for Pashto OCR has been developed [20]. The same work also reports the development of an OCR system for the recognition of isolated Pashto characters. Classification is performed at two levels, high-level and low-level classification, and the K-nearest neighbor (K-NN) classifier is used for low-level feature classification. Apart from these few reported studies, research on Pashto OCR is still at an initial stage, and much more work is needed to develop a Pashto OCR system that can be deployed in practical applications.
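To make the K-NN idea mentioned above concrete, the following is a minimal sketch of a K-nearest neighbor classifier over flattened image vectors. It is not the system reported in [20]: the feature representation (raw vectors), the Euclidean distance, the value of k, and the function name `knn_predict` are all assumptions chosen for illustration.

```python
import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Label one feature vector by majority vote among its k nearest neighbors.

    train_x: (n_samples, n_features) array of training vectors
    train_y: (n_samples,) array of class labels
    query:   (n_features,) vector to classify
    """
    # Euclidean distance from the query to every training vector
    dists = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

In practice each character image would be resized to a common size and flattened (or replaced by extracted features) before being passed in as a row of `train_x`.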

A. Pashto Language
Pashto is written in the Arabic script. Comparing character sets, we can conclude that the Arabic and Persian character sets are subsets of the Pashto character set, and 36 characters of the Urdu language are also contained in it. There are 44 basic Pashto letters, as shown in Fig 5. A textual analysis of Pashto web corpora is reported in [1]. The study shows the most frequent words and ligatures in Pashto text, along with the complexities caused by breaker characters. Count and frequency information for Pashto's unique ligatures and primary ligatures is also presented in [1]. The following section describes the important aspects of the dataset proposed in this work.

B. Pashto Dataset
An appropriate dataset should contain almost all possible words and shapes of the target language. In general, Pashto literature contains a variety of text layouts. These variations exist mainly due to the content of the text materials. The content of Pashto literature is classified as poetry, essays, novels, reports, news, and religion. This work therefore attempts to create a novel dataset of real handwritten Pashto characters [21].
The steps for building the Pashto character dataset are as follows:
1) Collection of Pashto characters: Real instances of Pashto character images were collected from various regions in order to capture differences in writing style. The images were collected from faculty members, teachers, and students of two universities, Benawa University and Kandahar University. Furthermore, classmates, Afghan students at KMUTT, and some volunteers in Kandahar also shared their handwritten Pashto characters. The total number of participants in the data collection was 650. The data collected per participant is shown in Figure 5. To create the database of handwritten Pashto characters, a blank A4 sheet was designed with 112 rows and 26 columns. Since there are 44 Pashto characters, two blank pages were given to each person. All sheets were collected and scanned at a resolution of 300 dpi (dots per inch). Samples of the sheet are shown in Fig 1 and Fig 2.
Fig. 1: Structure of paper given to participants
2) Preprocessing on characters: Character segmentation is an operation that decomposes an image of a sequence of characters into sub-images of individual symbols. It is one of the decision processes in an OCR system. Segmenting the individual characters from a scanned image is a difficult task. We used the OpenCV Python library to segment each character, as explained in the following steps:
a) Thresholding: Thresholding is a technique in OpenCV that assigns pixel values according to a threshold value: each pixel value is compared with the threshold. In this work, thresholding is applied to obtain binary images, which improves the accuracy of edge detection. The lower and upper threshold values were set to 127 and 255, respectively.
b) Shape Analysis: Contours are useful for shape analysis, for finding the size of an object of interest, and for object detection. OpenCV provides the findContours() function, which extracts the contours from an image. It works best on binary images, so thresholding should be applied first. In the current work, we used findContours() for character detection and then extracted each character and saved it in a separate class. Samples of the extracted characters are shown in the accompanying figure.
For data augmentation, we do not collect new data; rather, we transform the data that is already present. We augmented the data to increase the diversity of each character for training models. An example is shown in Fig 4, where the characters are augmented into four different shapes (a, b, c, d).
This research is part of our long-term strategy for the analysis of cursive-script languages. As future work, we plan to develop a baseline deep learning neural network model and then fine-tune it in order to evaluate and achieve better results.