A Data Mining Model for Predicting Diarrhea in Afghan Children

The data used for this study may be obtained from the Demographic and Health Survey (DHS) Program website under the Data section at https://dhsprogram.com/data/dataset/Afghanistan_Standard-DHS_2015.cfm?flag=0


I. INTRODUCTION
Diarrhea is the world's second leading cause of mortality, after pneumonia, in children under five [1], and a significant cause of childhood morbidity and mortality [2]. It kills around 525,000 children every year. There are three clinical types of diarrhea: acute watery diarrhea, acute bloody diarrhea, and persistent diarrhea [WHO]. Diarrheal disease is often caused by contaminated food and water. Worldwide, 780 million people lack access to improved drinking water, and 2.5 billion lack improved sanitation. Diarrhea due to infection is common in many developing countries; in low-income countries, children under three years of age suffer an average of three diarrhea episodes per year. Each episode deprives the child of the nutrition required for development. As a result, diarrhea is a significant cause of malnutrition, and malnourished children are in turn more likely to fall ill from diarrhea [WHO,2]. Infant mortality is the death of children under the age of one, and it is measured by the infant mortality rate (IMR) [3,4]. In 1990, nine million infants younger than one year died globally. By 2015, this figure had almost been halved to 4.6 million infant deaths [5].
Child mortality, measured by the under-5 mortality rate (U5MR), is the death of a child before the fifth birthday. National statistics often report these two mortality rates together. Globally, 5.4 million children died before their fifth birthday in 2017 [6]. Reducing child mortality is now a target under Goal 3 of the Sustainable Development Goals (SDGs) ("Ensure healthy lives and promote wellbeing for all at all ages") [7]. In 2012, Afghanistan ranked 18th globally in total under-five mortality rate and 8th among the nations with the highest number of under-five deaths from pneumonia and diarrhea. Thus, it is not surprising that diarrhea and pneumonia are the two most prevalent health problems claiming the lives of Afghan children under five.
It is therefore beneficial to identify the factors associated with diarrhea in Afghanistan through a predictive model. This study proposes a predictive model for diarrhea in children under five that reflects characteristics specific to Afghanistan. It also aims to find a suitable classifier for this task. Three popular classifiers, Naïve Bayes, Random Forest, and Support Vector Machine, are selected for the study. The dataset used is Afghanistan's Demographic and Health Survey 2015, which comprises records of children under five years of age.

II. CLASSIFICATION SELECTION
Classification is a supervised learning technique whose primary objective is to construct models from known data and to predict the categories of new data [8,9]. In classification, models are built by splitting a supplied dataset into training and test sets. One or more classification algorithms are run on the training set to produce the classifier models, and the test set is then used to assess the accuracy of those models [5]. Subsections A to C briefly introduce the classifiers selected for this study, while Section III briefly describes previous related work.

A. Naïve Bayes
Naïve Bayes learning is based on a Bayesian probabilistic model that assigns class probabilities to an instance. The simple Naïve Bayes algorithm uses these probabilities to assign the instance to a class. The Naïve Bayes classifier is very widely used, and it leads to a simple prediction framework that gives decent results in some practical cases. A more descriptive term for the underlying probability model would be the "independent feature model"; it is also called idiot's Bayes. It assumes that, given the class, the presence (or absence) of a particular attribute is unrelated to the presence (or absence) of any other attribute. A Naïve Bayes classifier converges faster than logistic regression, so it requires less training time. The method is applicable to a wide range of applications with various features, and it handles missing values well. In theory, Naïve Bayes has a lower error rate than other classifiers, but this does not always hold in practice, where it can be less accurate. In particular, it cannot learn interactions between features, does not support pruning, and produces sharp decision boundaries [10,11].
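As a minimal illustration of the independence assumption described above, the following Python sketch trains a categorical Naïve Bayes classifier by counting and predicts the class with the highest posterior. The toy records, the attribute values, the function names, and the use of add-one (Laplace) smoothing are assumptions made here for illustration; they are not taken from the AfDHS data or from the tool used in this study.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen per attribute, needed for add-one smoothing
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            distinct[j].add(value)
    return class_counts, value_counts, distinct


def predict_naive_bayes(model, row):
    """Return the class with the highest log-posterior for one instance."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior of the class
        for j, value in enumerate(row):
            # Each attribute contributes independently (the "naive" step);
            # add-one smoothing avoids zero probabilities.
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(distinct[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up records: (wealth, residence) -> class label
rows = [("poor", "rural"), ("poor", "rural"), ("rich", "urban"),
        ("rich", "urban"), ("poor", "urban"), ("rich", "rural")]
labels = ["Diarrhea", "Diarrhea", "No Diarrhea",
          "No Diarrhea", "Diarrhea", "No Diarrhea"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("poor", "rural"))
```

Working in log space keeps the product of many small probabilities numerically stable, which matters once the number of attributes grows.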

B. Random Forest
The random forest is an ensemble classifier built from decision trees that can be used to solve classification and regression problems. It generates multiple random trees using bootstrap sampling (bagging) of the training dataset, random selection of features at each split decision, and a voting scheme over the trees, which improves predictive ability and results in higher performance.
Most of the time, it performs better than a single decision tree. The selection of a random subset of features is one example of the random subspace approach, which is used in various applications. Random forest makes few assumptions about the data, which makes it well suited to modeling high-dimensional data. By reducing over-fitting and removing the need to prune trees, random forest runs efficiently and is able to provide high predictive accuracy. It can deal with missing values and outliers while preserving accuracy. Its combination of interpretability and prediction accuracy is exceptional among common machine learning methods. Random forest requires much less input preparation: it can handle binary, categorical, and numerical features, and there is no need for feature normalization. Random forests are also quick to train and to tune with respect to their hyperparameters [12,13].
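The three ingredients named above (bootstrap sampling, random feature selection, and voting) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a full random forest: each "tree" is only a one-level stump on a single randomly chosen feature, and all names, the toy data, and the default of 25 trees are assumptions introduced here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) records with replacement (bagging)."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]


def train_stump(rows, labels, feature):
    """One-level tree: map each value of one feature to its majority class."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], Counter())[label] += 1
    # Fallback class for feature values not present in the bootstrap sample
    default = Counter(labels).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return feature, table, default


def predict_stump(stump, row):
    feature, table, default = stump
    return table.get(row[feature], default)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Each stump sees its own bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        feature = rng.randrange(n_features)  # random feature selection
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy, made-up data in which both features separate the two classes
rows = [("a", "x")] * 5 + [("b", "y")] * 5
labels = ["pos"] * 5 + ["neg"] * 5
forest = train_forest(rows, labels)
vote = predict_forest(forest, ("a", "x"))
```

Because every stump is trained on a different resample and a different feature, individual errors tend to be uncorrelated, and the majority vote averages them out.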

C. Support Vector Machine
The support vector machine (SVM) was introduced for regression, binary classification, and ranking, and is based on statistical learning theory. It is widely used by healthcare researchers for classification because of its many attractive features, including its ability to handle complicated nonlinear data. The basic concept of the SVM algorithm is to find the optimal hyperplane that separates the two classes under consideration by maximizing the distance between the classes' closest points. The hyperplane in the middle of this margin is called the optimal hyperplane. SVM is a good classifier that does not require prior knowledge, even when the dimensionality of the input space is very high [12,14].
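The margin criterion can be made concrete with a short Python sketch. It does not train an SVM; it only evaluates, for a given hyperplane w·x + b = 0, the smallest distance from any training point to that hyperplane, which is the quantity SVM training seeks to maximize. The function name, the toy points, and the two candidate hyperplanes are assumptions made for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1; a positive result means every point lies on its
    correct side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    distances = []
    for x, y in zip(points, labels):
        decision = sum(wi * xi for wi, xi in zip(w, x)) + b
        distances.append(y * decision / norm)
    return min(distances)


# Toy, made-up points: two negatives on the left, two positives on the right
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, +1, +1]

# Both candidates separate the classes; the centered one (x = 2) leaves
# more room to the closest points than the shifted one (x = 1.5).
margin_centered = geometric_margin((1.0, 0.0), -2.0, points, labels)
margin_shifted = geometric_margin((1.0, 0.0), -1.5, points, labels)
```

Of the two separating hyperplanes, the centered one has the larger margin and is therefore the one the maximum-margin principle would prefer.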

III. RELATED WORK
Several studies have been conducted to extract hidden patterns from diarrhea data in the healthcare industry. Their essential aim is to provide useful information for decision-makers and policymakers, helping to determine the diarrhea status of the target population. At present, data mining and machine learning contribute significantly to predicting and analyzing the factors related to diarrhea. This section reviews previous work on the modeling of diarrhea.
The work in [15] used statistical methods to investigate the association between the prevalence of childhood diarrhea and caregiver awareness of its causes and prevention in a prospective cohort of 952 children under five years of age in Cochabamba, Bolivia. The caregiver knowledge survey found that more than 80% of caregivers were unaware that handwashing with soap could prevent childhood diarrhea. In this cohort, a lack of knowledge of the value of hygiene and sanitation practices for diarrhea prevention was a significant risk factor for diarrheal disease. The study's findings suggest that health promotion in these communities should emphasize increasing understanding of how water treatment, handwashing with soap, proper disposal of child feces, and food preparation relate to childhood diarrhea prevention.
The study in [16] applied a Bayesian logistic regression model to Demographic and Health Survey data from Nepal to identify the factors associated with diarrhea and their possible influence. It found that children of mothers with no formal schooling have a higher chance of having diarrhea. The prevalence of diarrhea was higher in the autumn and winter seasons, children aged 12-24 months were in the highest risk category, and male children had a higher risk than females.
The research in [17] was a cross-sectional analysis conducted using a formal questionnaire and an observation checklist. Using simple random sampling, a total of 546 households with at least one child under five were chosen. Data entry and cleaning were carried out using the Statistical Package for Social Science (SPSS). For descriptive analysis, frequencies and proportions were computed, and multivariable regression was used for further analysis. The study strongly recommended practices focusing on adequate handwashing at all appropriate times, proper handling of refuse, nutrition enhancement, and better childcare.
In the work [18], a matched case-control study was conducted at the Maternal and Child Health (MCH) Clinic in Tanzania during the rainy season to elucidate the risk factors for and etiology of diarrheal diseases in children under five years of age. Precoded questionnaires were completed with demographic data, health history, and physical signs, and stool samples were obtained for bacterial, parasitological, and viral studies. The probability of diarrhea was associated with having many siblings, the number of surviving siblings, the birth order, and the distance from the house to the water source.
As the above literature shows, various statistical and data mining techniques have been applied to healthcare data on diarrhea. This is beneficial for developing countries, as it allows domain experts to infer useful knowledge.
Previous works on diarrhea in Afghanistan relied on statistical methods and focused only on medical-related data. To date, there has been no attempt to implement a predictive diarrhea model based on both non-medical data (for example, wealth status) and medical-related data.

IV. METHODOLOGY
Data mining aims to discover unknown and hidden patterns in large amounts of data in order to obtain valuable information. As stated earlier, one of the objectives of this work is to find a suitable classifier among the most popular machine learning techniques. The classification task in this study is binary, with two categories: 'No Diarrhea' indicates that the child did not have diarrhea, and 'Diarrhea' indicates that the child had diarrhea.

A. Data Understanding
The dataset in this study is the publicly available Afghanistan Demographic and Health Survey 2015 (AfDHS). It contains information from a cross-sectional, population-based, nationally representative survey carried out by the Central Statistics Organization and the Ministry of Public Health in Afghanistan's rural and urban areas in 2015.
Altogether, the AfDHS comprises information from 25,650 households, some of which contributed more than one sample to the dataset (i.e., more than one record, each relating to a different child). A detailed description of the data in this survey can be found in [19]. The dataset may be obtained from the Demographic and Health Surveys (DHS) website.

B. Data Preprocessing
As in most data mining processes, an initial data preprocessing step is necessary for this study. It consists of two main stages, data cleansing and feature selection, as described below:

1) Data Cleansing
Initially, the AfDHS data comprised 32,712 records (i.e., samples). Children whose mothers did not know whether they had had diarrhea in the past two weeks, and children with missing data, were excluded. In the end, 5,800 complete and usable samples remained. Table I reports the number of samples in both categories.

2) Feature Selection
Each sample in the dataset consists of 1,187 attributes. Careful analysis reveals that most attributes are unrelated to diarrhea and unsuitable for modeling. Such attributes include the respondent's line number, date of interview, mother's name, etc. Therefore, an essential step in this initial stage is to identify the attributes relevant to diarrhea. This was done in consultation with doctors and experts in the area. Different doctors and experts named different sets of attributes that contribute to diarrhea, so the study identified the attributes common among them. In total, 21 attributes were commonly named. Table II reports these 21 attributes.

TABLE II (rows 17 to 21)
No. | Attribute | Description
17 | Place of residence | Place of residence (rural and urban)
18 | Gender | Sex of child (male or female)
19 | Wealth | The economic situation of the family
20 | Breastfeeding | Child currently breastfeeding
21 | Occupation | Occupation of the mother

Feature selection, that is, selecting informative or relevant attributes, is essential to the success of the data mining process. Attribute (feature) selection methods reduce the dimensionality of the data by removing redundant and irrelevant attributes from a data set. The feature selection process has many benefits. It makes the data quicker to visualize and understand, reduces the time and storage required for the mining process, and improves the algorithm's performance by avoiding the curse of dimensionality. Performance is enhanced by eliminating features that add no value to the algorithm's efficiency.
This study used Information Gain and the correlation-based feature selection (CFS) algorithm to reduce the dimensionality of the datasets.
Information Gain (IG) is an entropy-based feature evaluation method widely used in machine learning. In feature selection for text categorization, it is defined as the amount of information that a feature item contributes to the category decision; that is, the importance of a term for classification is measured by how much information its presence or absence provides [20]. The information gain of a term t, in its standard form, is shown in equation (1).

$$IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}) \qquad (1)$$

Here C is the document collection, partitioned into m categories c_1, ..., c_m, and t is one of its features or terms. The IG value depends on the probability P(c_i) of each category and on its conditional probabilities given that t is present, P(c_i | t), or absent, P(c_i | t̄).
The greater the value of IG(t), the more useful the term t is for classifying C, and the more strongly it should be selected. The value of IG(t) also depends on the values of P(t) and P(t̄).
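For a tabular dataset such as the one used here, the same quantity can be computed for a categorical attribute rather than a term: it is the entropy of the class labels minus the expected entropy after grouping the records by attribute value. The Python sketch below illustrates this under the assumption of categorical attributes and base-2 logarithms; the function names and toy values are hypothetical and are not taken from the study's tooling.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy given the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


# Toy, made-up example with a balanced binary class
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # determines the class exactly
uninformative = ["a", "b", "a", "b"]  # tells nothing about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Ranking all candidate attributes by this score and keeping the highest-scoring ones is the usual way an information gain filter is applied.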
The CFS algorithm computes the correlation between all features and the output class. Using a correlation-based heuristic evaluation function, it selects the best feature subset, i.e., the subset whose features are positively correlated with the class variable and have low correlation with each other [21]. The CFS heuristic is given in equation (2).

$$r_{zc} = \frac{k\,\bar{r}_{zi}}{\sqrt{k + k(k-1)\,\bar{r}_{ii}}} \qquad (2)$$

where
r_{zc}: correlation between the summed features and the class variable.
k: number of features.
\bar{r}_{zi}: the average of the feature-class correlations.
\bar{r}_{ii}: the average of the feature-feature inter-correlations.
CFS and Info Gain were applied to the datasets to determine the informative and highly correlated attributes. Tables III and IV report the results. Comparing Tables III and IV, the attributes selected only by Information Gain are wealth status and place of residence (rural or urban), whereas the attributes selected only by CFS are preceding birth interval and duration of breastfeeding. These differences merit further analysis. Therefore, models were built using both the Info Gain and the CFS selected attributes. Tables V and VI report their accuracy measures.
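The CFS heuristic rewards subsets whose features correlate with the class and penalizes subsets whose features correlate with each other. A minimal Python sketch of the merit computation follows, assuming the standard form k·r̄_zi / sqrt(k + k(k−1)·r̄_ii) and assuming the correlations have already been estimated; the function name and the numeric values are hypothetical and serve only to show the effect of redundancy.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset under the CFS heuristic.

    feature_class_corr: one correlation per feature with the class variable.
    feature_feature_corr: one correlation per pair of features in the subset
    (empty when the subset holds a single feature).
    """
    k = len(feature_class_corr)
    mean_class = sum(feature_class_corr) / k
    if feature_feature_corr:
        mean_inter = sum(feature_feature_corr) / len(feature_feature_corr)
    else:
        mean_inter = 0.0
    return k * mean_class / math.sqrt(k + k * (k - 1) * mean_inter)


# Two features, each correlated 0.6 with the class (made-up values)
merit_independent = cfs_merit([0.6, 0.6], [0.0])  # features uncorrelated
merit_redundant = cfs_merit([0.6, 0.6], [1.0])    # features fully redundant
merit_single = cfs_merit([0.6], [])
```

With these values, the redundant pair scores no better than a single feature, whereas the uncorrelated pair scores higher, which is the behavior the heuristic is designed to produce.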

C. Model Building
On the cleaned dataset with the sixteen attributes selected by Info Gain and CFS, three classification algorithms, namely Naïve Bayes, Random Forest, and Support Vector Machine (SVM), were applied using the WEKA [22] machine learning tool. The datasets were balanced so that the classification categories were approximately equally represented, and each dataset was split into 85% training and 15% testing data.
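A percentage split of this kind can be sketched in plain Python as follows. This is an illustration only and does not reproduce WEKA's implementation; the function name, the fixed random seed, and the use of 5,800 placeholder records (the size reported for the cleaned data, used here purely as an example) are assumptions.

```python
import random


def percentage_split(records, train_fraction=0.85, seed=42):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the cleaned samples
records = list(range(5800))
train, test = percentage_split(records)
```

Shuffling before the cut avoids any ordering in the source file (for example, by region or interview date) leaking into the split.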

D. Evaluation
To evaluate the performance and effectiveness of the implemented predictive model, evaluation metrics are used. In this study, four commonly used metrics are employed [23]: 'Accuracy', 'Precision', 'Recall', and 'Area Under the Curve' (AUC). Brief descriptions are given below:

1) Accuracy
This metric is the most widely used in classification, as it is the first important indicator of how well the model performs. It is usually expressed as a percentage (i.e., 0% to 100%). It can be determined by the equation below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where
TP: the number of true positives (samples correctly classified as belonging to the target class).
TN: the number of true negatives (samples correctly classified as not belonging to the target class).
FP: the number of false positives (samples incorrectly labeled as the target class when they are not).
FN: the number of false negatives (samples incorrectly labeled as not the target class when they are).

2) Precision
This metric was introduced in the information retrieval field; however, it also applies to classification and is a useful addition when evaluating performance. It is also expressed as a percentage. It can be determined by the equation below:

$$\text{Precision} = \frac{TP}{TP + FP}$$

3) Recall
The Recall metric is often used in conjunction with the Precision metric in the information retrieval field. Hence, it can add useful information when evaluating performance. It is also expressed as a percentage. It can be determined by the equation below:

$$\text{Recall} = \frac{TP}{TP + FN}$$
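The three metrics above follow directly from the four confusion-matrix counts. The short Python sketch below computes them from lists of actual and predicted labels; the function names and the toy labels are assumptions made for illustration and are not results from this study.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, TN, FP, FN) with respect to the given positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Toy, made-up labels (D = 'Diarrhea', N = 'No Diarrhea')
actual = ["D", "D", "D", "N", "N", "N", "N", "D"]
predicted = ["D", "D", "N", "N", "N", "D", "N", "D"]
tp, tn, fp, fn = confusion_counts(actual, predicted, positive="D")
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
```

Multiplying each value by 100 gives the percentage form used in the results tables.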

4) Area Under the Curve (AUC)
AUC is the area under the receiver operating characteristic (ROC) curve. It provides a comprehensive assessment of a model's accuracy by sweeping across the range of decision thresholds. The larger the area, the more accurate the diagnostic test. It is also expressed as a percentage.
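One way to see why the AUC is threshold-independent is its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below computes the AUC this way; the function name and the toy scores are assumptions for illustration, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc_from_scores(scores, labels, positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy, made-up classifier scores for the positive class
labels = [1, 1, 0, 0]
perfect = auc_from_scores([0.9, 0.8, 0.3, 0.2], labels, positive=1)
partial = auc_from_scores([0.9, 0.4, 0.6, 0.2], labels, positive=1)
```

In the second example one positive sample is scored below one negative sample, so three of the four pairs are ranked correctly.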

V. RESULTS AND DISCUSSION
Predictive model building is an iterative procedure, so it is crucial to perform multiple experiments with different classifiers to select the best model for the problem at hand. The experiments were conducted using three popular classifiers, namely Naïve Bayes, Random Forest, and Support Vector Machine, on the balanced datasets. Experiments were carried out for each algorithm, and percentage split validation was used to estimate each algorithm's performance: 85% of the instances were used for training, and the remaining 15% made up the test set. Performance was evaluated based on accuracy, precision, recall, and the area under the curve (AUC). The AUC serves as an indicator of the overall performance of the algorithm [23,24], since the area under the receiver operating characteristic curve assesses a predictor across the whole range of decision thresholds. The larger the area, the more reliable the diagnostic test. Therefore, the models with the highest accuracy and AUC are deemed the best. Referring to Tables V and VI, the random forest algorithm has higher accuracy, AUC, precision, and recall with the attributes selected by correlation-based feature selection (CFS) than with the attributes selected by Information Gain. This finding suggests that CFS suits the nature of this dataset.
For the prediction of diarrhea in children under five years in Afghanistan, random forest gives the highest overall prediction accuracy of 81.48%, with an AUC of 89.80%, precision of 82%, and recall of 81.4%, compared with the other classification methods. Its AUC is also higher than that of the other methods, indicating that random forest is the best model for determining children's diarrhea in this study. At the other end, Naïve Bayes gave the lowest overall prediction accuracy, 69.10%.
As this work only used a subset of the collected data and preprocessed it to be a balanced dataset, validation may be done using the remaining data in the next step.

VI. CONCLUSIONS AND FUTURE WORK
Diarrhea is the second leading cause of mortality, after pneumonia, in children under five years. Understanding the crucial factors associated with diarrhea and having a suitable predictive model for diarrhea in Afghanistan would help to mitigate this problem. This study is the first attempt to determine the most suitable tool for implementing a diarrhea predictive model. The study applied Info Gain and CFS for feature selection and identified the most informative and highly correlated attributes. The attributes selected only by Info Gain were wealth status and place of residence, whereas those selected only by CFS were preceding birth interval and duration of breastfeeding. The model was implemented using the attributes selected by CFS and by Info Gain. Random Forest was found to be the most suitable of the three popular data mining techniques considered, with accuracy, AUC, precision, and recall of 81.5%, 89.8%, 82%, and 81.4%, respectively.
Future work can adopt the 16 attributes and rely on CFS to identify the importance of each selected attribute. More detailed analysis can be carried out to establish each attribute's influence on diarrhea. Model validation will then be done using the remaining data extracted from the survey. Furthermore, a comparative analysis using the full imbalanced data may be undertaken using the approach of Momand et al. [25], and this work may be incorporated into their malnutrition application, developed for the Afghan government for detecting and monitoring the health status of pre-school age children.