Machine Learning Prediction of Hospitalization due to COVID-19 based on Self-Reported Symptoms: A Study for Brazil*

Abstract—Predicting the need for hospitalization due to COVID-19 may help patients seek timely treatment and assist health professionals in monitoring cases and allocating resources. We investigate the use of machine learning algorithms to predict the risk of hospitalization due to COVID-19 using the patient's medical history and self-reported symptoms, regardless of the period in which they occurred. Three datasets containing information on 217,580 patients from three different states in Brazil were used. Decision trees, neural networks, and support vector machines were evaluated, achieving accuracies between 79.1% and 84.7%. Our analysis shows that better performance is achieved in Brazilian states ranked more highly in the official human development index (HDI), suggesting that health facilities with better infrastructure generate data that is less noisy. One of the models developed in this study has been incorporated into a mobile app that is available for public use.


I. INTRODUCTION
The rapid propagation of the SARS-CoV-2 virus and the possible need for hospitalization by patients who develop COVID-19 have overloaded healthcare systems in several countries around the world. The shortage of hospital beds, especially those in intensive care units (ICU), has been one of the main challenges to fight this disease, affecting medical and governmental decision making [1], [2].
The number of hospitalized cases has been widely adopted as a metric with which to estimate the required resources for health facilities and to define lockdown restriction levels [3]. Furthermore, machine learning (ML) tools have been used to predict the number of new as well as hospitalized cases a few weeks ahead, helping local authorities to make informed decisions [4], [5].
A tool to estimate the risk of hospitalization might be useful from an individual perspective by helping a patient seek treatment in time. It could also support health professionals in remote locations and under-resourced environments in making decisions related to patient transfer and bed allocation. Point-of-care prediction methods are well suited to these cases, as they provide a practical and cost-effective strategy. In this regard, smartphone-based diagnostic and data collection tools have particular appeal [6]–[8].
Based on data collected between March and June 2020, Jehi et al. [9] used logistic regression with the least absolute shrinkage and selection operator (LASSO) to predict the risk of hospitalization for 4,536 patients with COVID-19, achieving a sensitivity of 76.9% and specificity of 72.6%. Although the study was conducted at the beginning of the pandemic and used a limited dataset, it provides a clear indication that ML has potential in the prediction of hospitalization due to COVID-19.
The method introduced by Sudre et al. [10] predicts hospitalization based on symptoms self-reported over a period of up to 9 days. Their work uses clustering to automatically group patients into six distinct groups according to symptoms. It was noted that patients who experienced a similar level of COVID-19 severity fell into the same cluster and that the risk of hospitalization was high for patients in two of the six clusters. When using 2, 5, and 9 days of continuously reported data, precisions of 48.0%, 70.4%, and 84.9% and recalls of 47.2%, 70.3%, and 84.6% were achieved, respectively.
Chen et al. [11] used random forests to distinguish severe from non-severe cases of COVID-19 in 362 patients. They considered severe cases to be patients with a respiratory rate above 30 breaths per minute, an oxygen saturation below 93%, or a partial pressure of oxygen below 300 mmHg. Using only the patients' comorbidities and symptoms, the system achieved approximately 90% predictive accuracy. When this data was combined with laboratory test results (not including COVID-19-specific tests), the accuracy rose to 99%.
Although the results mentioned above for hospitalization or severity prediction are promising, there is still room for improvement with regard to the number of days over which symptoms should be reported, the study population, and the labeling criteria. Furthermore, the definition of severe cases used in [11] differs from the classification provided by the WHO [12], which may affect the system's performance in the context of the current clinical management of COVID-19 cases in most countries. Additionally, the WHO recommends that the decision to hospitalize be made on a case-by-case basis, considering not only the clinical presentation but also the patient's demographics (age, sex, and medical history), risk factors, and even the conditions at home [12].
The current work aims to investigate the success of ML-based methods in estimating the risk of hospitalization due to COVID-19, using the patient's medical history and self-reported symptoms, regardless of the period in which they occurred. Our final goal is to conceive a classification methodology that can be implemented as a smartphone-based solution for the self-assessment of COVID-19 severity. We train and evaluate ML algorithms using official hospitalization data released by the government of Brazil so that the results can be compared with current medical practices. Brazil is one of the countries hardest hit by the COVID-19 pandemic, having reached 200,000 deaths by the beginning of 2021.

II. MATERIALS AND METHODS

A. Dataset acquisition
According to Brazilian law, it is mandatory for hospitals and other health facilities to report all disease cases to the government. During the COVID-19 pandemic, state departments of public health in Brazil have periodically released notifications of COVID-19 cases, which include anonymized patient data regarding age, gender, symptoms, previous health conditions, and other information. These data were collected during screening and subsequently updated for hospitalized patients.
To avoid social or racial bias during model training, we have considered databases from states located in Brazilian macroregions that are clearly distinguished from each other in terms of human development index. Additionally, only databases that include the diagnosis method and the case management (whether hospitalized or not) were considered.
The data provided by the Brazilian states of Alagoas (AL) [13], Espírito Santo (ES) [14], and Santa Catarina (SC) [15] met the criteria mentioned above and were selected for this study. These states occupy, respectively, the 27th (last), 9th, and 3rd positions in the state-wise human development index (HDI) ranking in Brazil [16]. The datasets contain data collected from March to December of 2020 in both public and private hospitals and healthcare facilities. We further removed patients with fewer than two symptoms and those without laboratory confirmation. The composition of the extracted data is shown in Table I.
Some features are not available in the AL and ES datasets because the associated symptoms were not reported. In the SC dataset, comorbidity is annotated for 51.0% of hospitalized patients but only 0.1% of non-hospitalized patients. This low incidence among non-hospitalized patients is inconsistent with previous research [9], [11] and with the AL and ES datasets, which indicate comorbidities in 40.6% and 28.4% of non-hospitalized patients, respectively. For this reason, the comorbidity feature was removed from the SC dataset.
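As a rough illustration, the record-filtering criteria described above (laboratory confirmation and at least two reported symptoms) can be sketched as follows; the field names and record layout are hypothetical, not the actual column names used in the state databases.

```python
# Minimal sketch of the record-filtering step; "symptoms" and
# "lab_confirmed" are illustrative field names, not the real schema.

def filter_records(records):
    """Keep only laboratory-confirmed cases reporting two or more symptoms."""
    return [
        r for r in records
        if r.get("lab_confirmed") and len(r.get("symptoms", [])) >= 2
    ]

patients = [
    {"symptoms": ["fever", "cough"], "lab_confirmed": True},    # kept
    {"symptoms": ["fever"], "lab_confirmed": True},             # too few symptoms
    {"symptoms": ["fever", "dyspnea"], "lab_confirmed": False}, # no confirmation
]
kept = filter_records(patients)
print(len(kept))  # 1
```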
The Brazilian Institute of Geography and Statistics (IBGE) officially classifies the Brazilian population into five racial groups: Branco (White), Preto (Black), Amarelo (East Asian), Indígena (Indigenous), and Pardo (mixed race) [17]. However, we considered the black, indigenous, and mixed-race populations as a single group, since self-identification has in these cases been reported to be imprecise, with many black and indigenous persons identifying themselves as mixed race [18].

B. Experimental setup
For the proposed study, we investigated the predictive performance of three ML algorithms: decision trees (DT), neural networks (NN), and support vector machines (SVM). Hyperparameter optimization was performed for each of these techniques using nested cross-validation combined with data augmentation, as illustrated in Fig. 1. As the datasets differ in terms of the features they contain, this cross-validation was performed independently for each combination of ML algorithm and dataset. For DTs, the maximum depth of the tree and the split quality criterion were the hyperparameters optimized during cross-validation. For NNs, the network architecture and the activation function were the variable hyperparameters. The L2 regularization penalty was the only hyperparameter for the linear-kernel SVM classifier. Table II summarizes the ranges considered for these hyperparameters.

Fig. 1: Hyperparameter optimization, training, and validation scheme adopted in this study, which combines nested cross-validation and data augmentation. This process has been used for each combination of ML algorithm and dataset.
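A minimal sketch of this nested scheme for the DT case: the inner loop searches over the two DT hyperparameters named above, while the outer loop estimates the performance of the whole tuning procedure. It uses synthetic data and scikit-learn and is not the authors' actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for one of the state datasets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: grid search over the DT hyperparameters mentioned in the text
# (maximum depth and split quality criterion); the grid values are illustrative.
param_grid = {"max_depth": [5, 10, 20], "criterion": ["gini", "entropy"]}
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=inner, scoring="roc_auc")

# Outer loop: unbiased performance estimate of the tuned classifier.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```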
Class imbalance is an important issue for the three datasets, as pointed out in Section II-A. To mitigate this, we used stratified cross-validation and data augmentation within the scheme shown in Fig. 1. A stratified k-fold split [19] was used for both the outer and inner loops. The synthetic minority over-sampling technique (SMOTE) [20], a data augmentation method, was applied to the training folds in the outer loop. No synthesized data was included in the outer-loop test folds.
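To illustrate the SMOTE idea, the numpy-only sketch below synthesizes minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours; real experiments would typically use a library implementation such as the one in imbalanced-learn.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, rng):
    """Generate n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    k = min(5, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    neigh = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))     # random minority sample
        j = neigh[i, rng.integers(k)]    # one of its k nearest neighbours
        lam = rng.random()               # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 4))         # hospitalized (minority) samples
synth = smote_like_oversample(X_min, 30, rng)
print(synth.shape)  # (30, 4)
```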
The comorbidity feature was an integer value corresponding to the number of comorbidities. For normalization purposes, the age feature was the actual age divided by 100. All the other features were considered to be dichotomous traits with 0/1 binary values.
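The feature encoding described above might be sketched as follows; the symptom vocabulary is illustrative, not the datasets' actual feature list.

```python
def encode_patient(age, comorbidities, symptoms, all_symptoms):
    """Encode one patient as described above: age divided by 100, an
    integer comorbidity count, and 0/1 indicators for each symptom."""
    features = [age / 100.0, len(comorbidities)]
    features += [1 if s in symptoms else 0 for s in all_symptoms]
    return features

# Hypothetical symptom vocabulary, not the datasets' exact feature list.
SYMPTOMS = ["fever", "cough", "dyspnea", "fatigue"]
vec = encode_patient(62, ["diabetes"], {"fever", "dyspnea"}, SYMPTOMS)
print(vec)  # [0.62, 1, 1, 0, 1, 0]
```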
We anticipated the potential for algorithmic bias, given that previous studies have found evidence of racial health inequity in the context of COVID-19 in Brazil [21], [22]. Therefore, the racial groupings shown in Table I were used to assess bias; this analysis was performed on the AL and ES datasets. To achieve this, three 5-fold stratified cross-validation experiments were performed: for all patients, for the white group, and for the black/indigenous/mixed-race group. Table III shows the performance metrics for each cross-validation experiment, while Fig. 2 shows the receiver operating characteristic (ROC) curves of these experiments. It can be seen that NN and SVM achieved the best results, without a clear advantage for either.
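A sketch of this per-group evaluation protocol, using toy data and a logistic-regression stand-in for the actual classifiers; group labels and features are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy data: features, hospitalization labels, and a synthetic group label.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
race = rng.choice(["white", "black/indigenous/mixed"], size=n)

# One stratified 5-fold CV experiment per subgroup, plus one for all patients.
for name, mask in [("all", np.ones(n, bool)),
                   ("white", race == "white"),
                   ("black/indigenous/mixed", race != "white")]:
    Xg, yg = X[mask], y[mask]
    aucs = []
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=0).split(Xg, yg):
        clf = LogisticRegression().fit(Xg[tr], yg[tr])
        aucs.append(roc_auc_score(yg[te], clf.predict_proba(Xg[te])[:, 1]))
    print(name, round(float(np.mean(aucs)), 3))
```

Comparing the per-group AUCs (as in Table IV of the paper) is what reveals whether the classifier behaves differently across racial groupings.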

III. RESULTS AND DISCUSSION
Using the area under the curve (AUC) as the evaluation metric, NN performed best for the AL dataset with a mean sensitivity of 76.8% and mean specificity of 81.8%; SVM performed best for the ES dataset with a mean sensitivity of 81.2% and mean specificity of 84.0%; and NN also achieved the best results for the SC dataset with a mean sensitivity of 84.6% and mean specificity of 84.6%. The mean accuracy of the best models ranged from 79.1% to 84.7% across the three datasets. Accuracy, sensitivity, and specificity were calculated using a decision threshold of 0.5 on the normalized predicted probabilities.
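Computing these threshold-based metrics from predicted probabilities can be sketched as below; this is a generic formulation, not the authors' evaluation code.

```python
import numpy as np

def threshold_metrics(y_true, y_prob, thr=0.5):
    """Accuracy, sensitivity (recall on the positive class), and
    specificity at a fixed threshold on the predicted probability."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= thr).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {"accuracy": (tp + tn) / len(y_true),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

m = threshold_metrics([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.6])
print(m)  # accuracy 0.5, sensitivity 0.5, specificity 0.5
```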
The hyperparameters selected during cross-validation were found to be similar across the three datasets. Parameters that were frequently found to be optimal were: entropy as quality criterion and a maximum depth of 10 for DTs; one 32-neuron hidden layer architecture using rectified linear unit (ReLU) activation for NNs; and an L2 regularization parameter of 1 for the SVM.
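Under an assumed mapping of these hyperparameters to scikit-learn's API, the frequently selected models could be instantiated as below. Note that scikit-learn expresses SVM regularization through the inverse penalty C, so C=1.0 stands in for the reported L2 parameter of 1; the toy data and training settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Frequently optimal hyperparameters reported in the text, mapped
# (as an assumption) onto scikit-learn estimators.
dt = DecisionTreeClassifier(criterion="entropy", max_depth=10, random_state=0)
nn = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                   max_iter=500, random_state=0)
svm = SVC(kernel="linear", C=1.0, random_state=0)

# Fit each model on synthetic data as a smoke test.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
for model in (dt, nn, svm):
    model.fit(X, y)
    print(type(model).__name__, round(model.score(X, y), 2))
```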
These results are consistent across classifiers for each dataset and across the three datasets for each classifier. This confirms the findings of previous studies [9]–[11] that ML can be used to predict hospitalization based on a patient's symptoms and health status with relatively high accuracy. It also shows that the choice of ML algorithm may substantially affect the classification performance.
In order to determine whether the classification is influenced by the racial groupings, separate evaluations were performed for the AL and the ES datasets. The results in Table IV show that the performance differences are very small, suggesting that the effectiveness of the system is not influenced by the racial grouping of the patients.

Fig. 2: Mean ROC curves and evaluation metrics for cross-validation performed on each dataset/classifier pair. Accuracy, sensitivity, and specificity are mean values calculated using a threshold of 0.5. The AUC standard deviation is plotted in red and given in parentheses.

IV. SYSTEM AVAILABILITY
A neural network model for hospitalization prediction using the proposed method has been incorporated into the mobile app ContraCovid, which is available for public download1. The app is intended for the self-monitoring of patients suffering from COVID-19 and uses the proposed predictor as one of the metrics with which to recommend that patients seek treatment.

V. CONCLUSION
We have evaluated ML algorithms to predict hospitalization due to COVID-19 using the patient's self-reported symptoms and previous health status. In contrast to previous studies, we do not restrict the time frame over which the self-reporting must occur. To conduct this evaluation, we compiled our datasets from databases officially published by three state departments of health in Brazil. In total, data from 217,580 laboratory-confirmed cases of SARS-CoV-2 were used to assess the performance of decision trees, neural networks, and support vector machines. Since the information reported by the three health departments was not identical, independent experiments were performed for each dataset. Nested cross-validation with data augmentation was applied to each dataset/algorithm pair. The achieved accuracies ranged from 79.1% to 84.7%. NN and SVM performed best, with neither offering a clear advantage over the other.
The effectiveness of each algorithm was shown to be consistent across the three datasets. This suggests that ML can predict hospitalization due to COVID-19 using only self-reported symptoms with acceptable accuracy. Based on the official Brazilian state-wise human development index (HDI) ranking [16], performance was best for the richest state and worst for the poorest state, with average AUC varying between 85% and 91%. This may be related to the number of hospital beds available for COVID-19 in each state, which may lead to increased noise in the data.
A comparison between the results obtained for different racial groups (where such race information was available in the official data) indicates that the performance of the systems is not influenced by the racial grouping of the patients.