Multi-Branching Temporal Convolutional Network With Tensor Data Completion for Diabetic Retinopathy Prediction

Diabetic retinopathy (DR), a microvascular complication of diabetes, is the leading cause of vision loss among working-aged adults. However, due to the low compliance rate of DR screening and the expensive medical devices required for ophthalmic exams, many DR patients do not seek proper medical attention until DR has progressed to an irreversible stage (i.e., vision loss). Fortunately, widely available electronic health record (EHR) databases provide an unprecedented opportunity to develop cost-effective machine-learning tools for DR detection. This paper proposes a Multi-Branching Temporal Convolutional Network with Tensor Data Completion (MB-TCN-TC) model to analyze the longitudinal EHRs collected from diabetic patients for DR prediction. Experimental results demonstrate that the proposed MB-TCN-TC model not only effectively copes with the imbalanced data and missing value issues commonly seen in EHR datasets but also captures the temporal correlations and complicated interactions among medical variables in the longitudinal clinical records, yielding superior prediction performance compared to existing methods. Specifically, our MB-TCN-TC model provides AUROC and AUPRC scores of 0.949 and 0.793 respectively, achieving an improvement of 6.27% on AUROC, 11.85% on AUPRC, and 19.3% on F1 score compared with the traditional TCN model.

Diabetic retinopathy (DR) is a microvascular complication of diabetes. It is a leading cause of vision loss among working-aged people around the world [1]. According to the American Diabetes Association (ADA), almost all type 1 patients and more than 60% of type 2 patients have varying degrees of retinopathy after 20 years of diabetes [2]. Vision loss can be prevented if DR is diagnosed at an early stage. Unfortunately, many diabetic patients are unaware of their potential risk of DR because DR is asymptomatic at the early stage. Thus, they miss the most effective time window to halt DR progression, after which vision loss or blindness becomes inevitable [3]. Additionally, although the ADA recommends ophthalmic examinations in its annual DR screening guidelines, the compliance rates are below expectation, which results in 19% of DR patients being undiagnosed [1], [3]. The shortage of expensive medical devices and experts for ophthalmic exams also limits the availability of DR screening, especially in rural areas. As such, reliable and cost-effective DR detection tools are urgently needed to facilitate DR screening.
With the rapid development of medical information technology, we now live in an era of data explosion where a large amount of data is readily available and accessible in the clinical environment [4], [5], [6], [7]. This data-rich environment provides an unprecedented opportunity for developing automated tools for disease diagnosis. For example, many data-driven methods have been developed to support clinical decision-making by leveraging electronic health record (EHR) data [8], [9], [10], which contain rich health information including patient demographics, vital signs, and lab tests. However, EHRs are generally complex in structure and highly heterogeneous. Fully utilizing EHRs for reliable DR diagnosis poses the following challenges.
1) Poor data quality: Real-world EHRs suffer from the issues of imbalanced data and missing values. Disease detection tasks are often formulated as binary classification problems, where each training sample comes from one of two classes (e.g., positive vs. negative). Imbalanced datasets with skewed class distributions result from the fact that the number of negative patients (without a specific disease) is generally much larger than that of positive patients. With imbalanced training data, the majority class will dominate the classifier, yielding unsatisfactory performance in predicting the minority class. Missing value problems occur due to the heterogeneity in clinical needs and lack of documentation [11], [12]. For example, patients often do not take all types of lab tests at every clinic visit. A patient may take a Complete Blood Count test to measure the counts of red and white blood cells at one visit, while he/she may take a Comprehensive Metabolic Panel test to measure the levels of Alanine Aminotransferase (ALT) and Blood Urea Nitrogen at the next visit. Such data quality issues will introduce biases into decision-making if left unaddressed, constraining the clinical value of EHRs.
2) Multivariate longitudinal records: EHR datasets are commonly associated with multivariate longitudinal records (or sequences), especially for patients with chronic diseases such as DR. The medical variables usually have not only intra-sequence correlation (i.e., autocorrelation within one individual sequence) but also inter-sequence correlation (i.e., the correlation between different sequences). For example, the observation of ALT for one DR patient at a given time stamp is correlated with the observations at nearby time stamps. On the other hand, the ALT could also be correlated with the observations of Aspartate Aminotransferase (AST) at the same or different time stamps. Note that EHRs present a wealth of information on the dynamic evolution of medical variables for large populations. This leads to a new three-way tensor form of data, patients × variables × sequences, as opposed to the table of predictor and response variables in traditional predictive modeling. Novel machine learning approaches are urgently needed to effectively handle such a complex data structure for disease prediction.
In this paper, we propose a Multi-Branching Temporal Convolutional Network with Tensor Data Completion (MB-TCN-TC) to model multivariate longitudinal EHRs for reliable DR prediction by addressing both the imbalanced data and missing value issues. We first utilize the CANDECOMP/PARAFAC (CP) decomposition to complete the tensor data for missing value imputation by simultaneously accounting for the multidimensional correlations between lab test variables, patients, and time sequences. Second, to further account for the imputation uncertainty, we propose to add the missing value masks as another input stream to our predictive model to address the possible discrepancy between the imputed values and the true observations. Third, we adapt the multi-branching TCN framework [8] to handle the imbalanced data and model both the intra- and inter-sequence correlations in the longitudinal EHRs for DR prediction. We evaluate the proposed MB-TCN-TC on a large real-world dataset. Experimental results demonstrate that our MB-TCN-TC model achieves better performance for DR detection compared with existing methods.

A. DR Screening Methods in Current Clinical Practice
The most common screening method for DR is a dilated comprehensive eye examination (DCEE) performed by an ophthalmologist or optometrist [2]. According to the ADA guidelines for diabetes care [13], adults with diabetes should have an initial DCEE within 5 years after the onset of type 1 diabetes or at the time of diagnosis of type 2 diabetes, and have subsequent exams every 1-2 years if there is no evidence of DR, or at least annually if any level of DR is present at the initial exam. During the exam, doctors apply eye drops to dilate patients' pupils so that they can examine patients' retinas with devices such as a slit lamp. For 4-6 hours after dilation, the patient's vision may be blurry and the patient is advised not to drive. This causes significant inconvenience, especially for patients living in rural areas that are a few hours away from the ophthalmologist's or optometrist's clinic.
An alternative screening method for DR is fundus photography, in which trained clinicians take fundus pictures of diabetic patients. The digital color fundus photographs of the retina are then evaluated by doctors or certified technicians either at point-of-care or remotely. This diagnosis process requires experts to manually inspect the fundus pictures and evaluate the presence of lesions associated with vascular abnormalities, which is time-consuming and depends heavily on the clinicians' domain knowledge.

B. Machine Learning for Healthcare Data
The fast-growing volume of healthcare data presents remarkable prospects for data-driven scientific knowledge discovery and clinical decision support. The utilization of machine learning in healthcare data analysis has witnessed exponential growth in recent years. Numerous studies have demonstrated the potentially transformative impact of machine learning algorithms in deciphering complicated healthcare data, including medical images and EHRs, to improve the precision of diagnostic and treatment strategies [4], [14].
Medical imaging such as Magnetic Resonance Imaging, Computed Tomography, and ophthalmoscopy enables non-intrusive assessment of the inner structures of the body for disease diagnosis. A substantial body of research has been conducted to develop effective machine learning methods to analyze imaging data [15], [16], [17], [18], [19], [20]. In particular, Convolutional Neural Networks (CNNs) have gained great interest in the field of medical image analysis, largely attributable to their unique ability to discern intricate structural relationships among neighboring pixels in images [21]. For example, Sanghvi et al. [22] developed a CNN model for the detection of COVID-19 and pneumonia using chest X-ray images. Calisto et al. [23] and Dontchos et al. [24] demonstrated the promising potential of CNN models for breast cancer diagnosis using breast images.
The EHR is an electronic version of a patient's medical information over time, including demographics, medications, vital signs, laboratory data, etc. A variety of statistical and machine learning models, including logistic regression, support vector machines (SVMs), random forests, and deep neural networks (DNNs), have been developed to investigate EHRs, with applications in the data-driven prediction of numerous diseases such as sepsis [8], diabetes [25], and myocardial infarction [26]. Comprehensive reviews of machine learning for healthcare data can be found in [10], [27].

C. Existing Work on Data-Driven DR Prediction
In the past decade, a variety of machine-learning models have been developed to analyze fundus images for automated DR detection. For example, Sopharak et al. [33] proposed a machine learning model for exudate detection in retinal images; an SVM classifier was then built based on the selected features for DR detection. Roychowdhury et al. [34] evaluated AdaBoost, k-nearest neighbor (KNN), SVM, and Gaussian Mixture Models for classifying retinopathy lesions from fundus images. Reza et al. [35] proposed a machine learning algorithm based on marker-controlled watershed segmentation to detect the optic disc, exudates, and cotton wool spots in fundus images and further identify DR. In general, most traditional machine learning methods still depend heavily on manual feature extraction and selection, which is not only a trial-and-error, labor-intensive process but also depends on human expertise [33], [36], [37].
Deep learning models have also been proposed to analyze retinal images. The primary advantage of DNNs over traditional machine learning methods is their ability to learn complex representations of the input data without the need for explicit feature engineering [38]. DNN-based features have proven to be more informative than handcrafted features for data-driven disease detection [39], [40], [41], [42]. As a result, DNN models, especially CNNs, have been developed to analyze retinal images for DR prediction with better performance than conventional machine learning methods. For example, Ting et al. [43] proposed a deep learning model for screening DR and other eye diseases from retinal images. Wang et al. [44] developed a multi-channel generative adversarial network to investigate retinal images for DR diagnosis. Comprehensive reviews of data-driven DR diagnosis based on retinal images can be found in [36], [37].
Despite the impressive performance of DNNs in fundus photography recognition, expensive imaging devices (e.g., digital fundus cameras) and ophthalmic imaging skills are still needed, limiting the use of this approach to well-funded healthcare providers with trained technicians. More cost-effective screening tools based on EHRs, which are widely available and accessible across all healthcare settings, have been developed, such as Cox's proportional hazard model [28], decision-tree-based methods [29], ensemble models [30], and extreme gradient boosting (XGBoost) [31]. However, many of these methods, e.g., decision-tree-based models and gradient-boosting-based models, lack the ability to model the dynamic disease trajectory of DR. Our prior work [3] has shown that longitudinal information on the disease dynamics is conducive to data-driven disease prediction. Note that Cox proportional hazard models generally assume a linear relationship between the medical variables and the proportional hazard [45]. This linearity assumption is often not valid for real-world EHR data, which generally describe nonlinear and non-stationary disease dynamics.
Table I summarizes the limitations of existing machine learning methods for DR prediction using EHRs. Specifically, data-driven DR prediction based on EHR data suffers from the problems of missing values and extremely imbalanced data, as discussed in Section I. In the literature, there are various studies on the imputation of missing medical data, which generally estimate the missing values by investigating the data structure and exploiting the correlation patterns. Common imputation techniques include case deletion, mean imputation, interpolation and extrapolation, and advanced machine-learning-based imputation [46], [47]. However, most of those methods focus on addressing the missing value issue in matrix-form data but are not directly applicable to capturing the multi-dimensional correlation among patients, variables, and sequences in longitudinal EHRs, and thus are not well-suited for tensor-form data imputation.
Similarly, various methods have been proposed to handle the imbalanced data issue, such as random under/over-sampling, informed under-sampling, and synthetic minority over-sampling (SMOTE) [48], [49], [50], [51]. However, under/over-sampling methods suffer from severe information loss or over-fitting problems. Informed under-sampling (e.g., KNN-based sampling) and SMOTE are not generally applicable to longitudinal data analysis because, in those methods, the similarity calculation is based on the Euclidean distance, which may not capture the true closeness between longitudinal sequences [48], [52]. Therefore, novel analytical models are urgently needed to effectively learn from complexly structured longitudinal EHRs for reliable DR prediction.

III. RESEARCH METHODOLOGY
This section presents the proposed MB-TCN-TC framework for DR prediction from longitudinal EHRs. Suppose there are N patients indexed by p ∈ {p_1, ..., p_N} and each patient is described by the medical records denoted as (v_p, t_p, y_p): v_p ∈ R^M denotes the medical variables; t_p ∈ R^T contains the time points at which the medical variables are observed; y_p ∈ {0, 1} is the label, with y_p = 1 indicating that patient p is diagnosed with DR and y_p = 0 otherwise. As shown in Fig. 1(a), the data for all patients can be summarized as a tuple D = (X, y), where X ∈ R^{N×M×T} is a 3-way tensor of size N × M × T, with N, M, T denoting the number of patients, variables, and time stamps respectively, and y ∈ R^N is the label vector. As illustrated in Fig. 1(b), the missing values are first imputed with tensor completion. Second, missing value masks are generated to capture the missingness patterns; these are combined with the imputed tensor to generate two streams of balanced subsets by under-sampling the majority class, as shown in Fig. 1(c). Third, the balanced datasets are fed into the MB-TCN to train the network and further predict the DR probability for new patients, as shown in Fig. 1(d).
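As a concrete illustration of this data layout, the following sketch assembles a toy version of the tuple D = (X, y) together with an indicator tensor marking which entries are observed. All sizes and the missingness pattern here are hypothetical stand-ins, not the real cohort.

```python
import numpy as np

# Hypothetical sizes: N patients, M medical variables, T time stamps.
N, M, T = 100, 21, 12
rng = np.random.default_rng(0)

X = rng.normal(size=(N, M, T))                    # 3-way tensor of medical records
W = (rng.random((N, M, T)) > 0.3).astype(float)   # w_pvt = 1 if x_pvt is observed
X[W == 0] = np.nan                                # unobserved lab values are missing
y = (rng.random(N) < 0.03).astype(int)            # labels, ~3% positive as in the data
```

The tensor X and mask W are exactly the objects that the tensor completion and the missing-value-mask input stream operate on in the following subsections.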

A. Tensor Completion for Missing Data Imputation
Tensor factorization has been proven to be an efficient method to study the latent structure in tensor-form data for missing value imputation [53]. The critical idea of tensor-factorization-based imputation is to jointly consider the multi-dimensional correlations among medical variables, time, and patients in a compute-efficient way. The CANDECOMP/PARAFAC (CP) decomposition [53] is a widely used approach to study tensor data by projecting it into a linear combination of rank-1 tensors, as shown in Fig. 2. Specifically, suppose the rank of the 3D tensor X (i.e., the medical records) is R, where R ≤ min{N, M, T}, and the elements of X are denoted as x_pvt, where p ∈ {1, ..., N}, v ∈ {1, ..., M}, and t ∈ {1, ..., T}. Then, the CP decomposition of a complete tensor X is represented as: x_pvt = Σ_{r=1}^{R} u_pr h_vr s_tr, where u_pr, h_vr, and s_tr are elements of the factor matrices U ∈ R^{N×R}, H ∈ R^{M×R}, and S ∈ R^{T×R}. As such, the high-dimensional tensor data X is characterized by the low-dimensional latent space defined by the factor matrices U, H, and S. However, such a decomposition procedure is not viable if the tensor X is incomplete. With missing data, the CP decomposition is achieved by minimizing the squared error between the observed data and the reconstructed tensor:

J_W(U, H, S) = (1/2) Σ_{p=1}^{N} Σ_{v=1}^{M} Σ_{t=1}^{T} w_pvt (x_pvt − Σ_{r=1}^{R} u_pr h_vr s_tr)^2,   (1)

where W is a 3D indicator tensor with w_pvt = 0 if x_pvt is missing and w_pvt = 1 otherwise. We employ the CP-weighted optimization (CP-WOPT) algorithm [54] to solve the minimization problem in (1) and estimate the factor matrices U, H, and S. Specifically, at each iteration, we compute X_1 = W * X and X_2 = W * [[U, H, S]] (where * denotes element-wise multiplication) and further derive the gradient of the objective function in (1) by computing the partial derivatives of J_W(U, H, S) with respect to each element, i.e., u_pr, h_vr, s_tr, of the factor matrices:
∂J_W/∂U = (X_2(1) − X_1(1))(S ⊙ H),
∂J_W/∂H = (X_2(2) − X_1(2))(S ⊙ U),
∂J_W/∂S = (X_2(3) − X_1(3))(H ⊙ U),

where X_1(i) and X_2(i) are the matrices resulting from the mode-i matricization (i.e., flattening or unfolding) of the tensors X_1 and X_2 [55], [56]. The symbol ⊙ denotes the Khatri-Rao product, which generates a matrix of size NM × R from two matrices U ∈ R^{N×R} and H ∈ R^{M×R} as the column-wise Kronecker product: U ⊙ H = [u_1 ⊗ h_1, u_2 ⊗ h_2, ..., u_R ⊗ h_R], where ⊗ denotes the Kronecker product of two vectors. Given the computed gradients, the nonlinear conjugate gradient algorithm with Hestenes-Stiefel updates [57] and the More-Thuente line search method for the step size [58] are employed to iteratively estimate the factor matrices until convergence; the missing entries of X are then filled in from the reconstructed tensor [[U, H, S]].
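The imputation step can be sketched as follows. This is a simplified illustration: it minimizes the objective in (1) with plain gradient descent rather than the nonlinear conjugate gradient and More-Thuente line search used in the actual CP-WOPT implementation, and the function names, rank, and step size are hypothetical.

```python
import numpy as np

def cp_reconstruct(U, H, S):
    # [[U, H, S]]_pvt = sum_r u_pr * h_vr * s_tr
    return np.einsum('pr,vr,tr->pvt', U, H, S)

def cpwopt_gradients(X, W, U, H, S):
    # Gradients of J_W = 0.5 * || W * (X - [[U, H, S]]) ||^2
    D = W * cp_reconstruct(U, H, S) - W * np.nan_to_num(X)  # X_2 - X_1
    gU = np.einsum('pvt,vr,tr->pr', D, H, S)
    gH = np.einsum('pvt,pr,tr->vr', D, U, S)
    gS = np.einsum('pvt,pr,vr->tr', D, U, H)
    return gU, gH, gS

def impute(X, W, R=2, lr=0.01, iters=500, seed=0):
    """Complete tensor X (NaNs at W == 0) via rank-R weighted CP fitting."""
    rng = np.random.default_rng(seed)
    N, M, T = X.shape
    U, H, S = (rng.normal(scale=0.3, size=(n, R)) for n in (N, M, T))
    for _ in range(iters):
        gU, gH, gS = cpwopt_gradients(X, W, U, H, S)
        U, H, S = U - lr * gU, H - lr * gH, S - lr * gS
    return cp_reconstruct(U, H, S)  # completed tensor fills the missing entries
```

On a synthetic low-rank tensor with entries masked out, this fit recovers the observed entries far better than a zero baseline, which is the behavior the CP-WOPT algorithm exploits for imputation.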

B. Missingness-Informed TCN
TCN is an effective tool to capture both the temporal correlations in longitudinal data (i.e., intra-sequence correlation) and the heterogeneous patterns between different features (i.e., inter-sequence correlation) [59]. Here, we propose a missingness-informed TCN that processes the imputed tensor data X to capture critical temporal and variable correlations while being aware of the missingness patterns for better prediction. As shown in Fig. 3, there are four key features in this model: 1) Multi-channel input: TCNs treat the multi-variable longitudinal data as multi-channel sequences and apply trainable filters to process such multi-channel input simultaneously. Fig. 4 shows the fundamental procedure of how TCNs handle multi-variable longitudinal data. Specifically, each variable in the time series is considered as an individual input channel (e.g., 2 channels in the figure means there are 2 variables in the input). TCNs then use convolutional filters to process the temporal data, where the filter size determines how many time points the network examines in one operation, and the channel number of the filter is the same as the number of input channels. As the filters are applied across the input channels, the network can learn from multiple interrelated sequences simultaneously, effectively capturing the inter-sequence correlations.
2) Dilated causal convolution: Fig. 3(c) shows the procedure of dilated causal convolution. Specifically, TCNs achieve the dilated causal convolution with filters f_e : {0, 1, ..., k_e − 1} → R over the longitudinal input as:

(x_p,v *_d f_e)(t) = Σ_{τ=0}^{k_e−1} f_e(τ) x_p,v(t − τ · d),

where x_p,v denotes the observations of variable v for patient p over time, * represents the convolutional operation, and d, f_e, k_e are the dilation factor, filter, and kernel size, respectively. The causal convolution is realized by the index t − τ · d, which ensures that the direction of the convolutional operation is toward the past and that the network output at the current time t is not a function of any information after time t. By tuning the dilation factor d and kernel size k_e, TCNs have the ability to capture both local- and long-range correlations in the temporal domain.
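The dilated causal convolution above can be written out directly in a few lines. This is a minimal single-channel sketch (real TCN filters additionally sum over input channels, and the taps that would fall before t = 0 are simply dropped here, mimicking zero padding of the past):

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """out[t] = sum_{tau=0}^{k-1} f[tau] * x[t - tau*d]; taps before t=0 are dropped."""
    T, k = len(x), len(f)
    out = np.zeros(T)
    for t in range(T):
        for tau in range(k):
            if t - tau * d >= 0:          # causal: only look toward the past
                out[t] += f[tau] * x[t - tau * d]
    return out

# With f = [1, 1] and dilation d = 2, each output mixes x[t] with x[t-2]:
out = dilated_causal_conv(np.array([1., 2., 3., 4., 5.]), np.array([1., 1.]), 2)
# out == [1., 2., 4., 6., 8.]
```

Because out[t] only touches indices t − τ·d, changing any future input leaves all earlier outputs unchanged, which is the causality property the text describes.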
3) Residual block: The residual block (Fig. 3(b)) [60] is another critical component for effective longitudinal data modeling, which is leveraged to solve the problem of gradient degradation when the network goes deeper to model complexly structured data. The output of the residual block is o = σ(x + F(x)), where σ(•) is the activation function and F is a series of operations including the dilated causal convolution, batch normalization, nonlinear activation, and dropout. Such a residual operation enables the network to focus on the residual of the identity mapping to cope with the gradient degradation problem in DNN training [60]. Our proposed model consists of two layers of causal convolution and ReLU activation within a residual block. In addition, we implement batch normalization for each causal convolutional layer to re-center and re-scale the input (i.e., the output of the previous layer), which helps to improve the speed and stability of network training [61]. Moreover, a dropout layer is added after each causal convolution to avoid possible over-fitting [62].
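A minimal PyTorch sketch of such a residual block, assuming the two-convolution structure described above with o = ReLU(x + F(x)); the channel count and hyperparameters are illustrative, and causality is obtained by left-padding before each convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """o = ReLU(x + F(x)), where F is two dilated causal convolutions,
    each followed by batch normalization, ReLU, and dropout."""
    def __init__(self, channels, kernel_size=3, dilation=1, p_drop=0.5):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so output stays causal
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                        # x: (batch, channels, time)
        h = F.pad(x, (self.pad, 0))              # pad only the past side
        h = self.drop(torch.relu(self.bn1(self.conv1(h))))
        h = F.pad(h, (self.pad, 0))
        h = self.drop(torch.relu(self.bn2(self.conv2(h))))
        return torch.relu(x + h)                 # identity skip connection
```

The identity path x + F(x) is what lets gradients bypass the convolutions, which is the mechanism the text credits for avoiding gradient degradation in deeper stacks.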
4) Missing value masks: To account for the possible discrepancy between the imputed values and the true observations, the missing value masks, i.e., the 3D indicator tensor W, serve as another input to the TCN, as shown in Fig. 3(a). This enables the network to learn the missingness pattern, i.e., the correlation between non-missing and missing values, to mitigate the impact on the prediction performance of any error incurred during the imputation procedure.
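Conceptually, the mask can be wired in as a second parallel input stream whose features are concatenated with those of the imputed records. The sketch below substitutes single plain convolutions for the full residual-block stacks for brevity; all layer names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class TwoStreamInput(nn.Module):
    """Imputed records x and missing-value masks w flow through parallel
    convolutional streams; their features are concatenated afterwards."""
    def __init__(self, n_vars, n_filters=8):
        super().__init__()
        def stream():
            return nn.Sequential(nn.Conv1d(n_vars, n_filters, 3, padding=1),
                                 nn.ReLU())
        self.x_stream, self.w_stream = stream(), stream()

    def forward(self, x, w):                 # each: (batch, n_vars, time)
        zx = self.x_stream(x)                # features of the imputed values
        zw = self.w_stream(w)                # features of the missingness pattern
        return torch.cat([zx, zw], dim=1)    # (batch, 2 * n_filters, time)
```

Keeping the streams separate until concatenation lets the network weight the imputed values differently depending on whether they were actually observed.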

C. Multi-Branching Outputs
In this subsection, we adapt the multi-branching (MB) architecture [8] to cope with the imbalanced data issue in the DR dataset. The MB network consists of a core TCN and an MB output layer. The core TCN can be treated as the feature extractor or backbone of the network, and the MB layer is the output classifier. Existing literature has shown that the classifier component of the network suffers more from the imbalanced data issue, while the backbone portion is less susceptible during training [63], [64]. As such, we propose to train the core TCN with the whole dataset and use a balanced sub-dataset to train each branching output in the MB layer to alleviate the impact of imbalanced data on the classifier.
We denote the original dataset as D, which consists of the minority class D_P (i.e., DR patients) and the majority class D_N (i.e., non-DR patients). As shown in Fig. 1(c), N_b balanced sub-datasets D_1, ..., D_{N_b} are generated, each combining the minority class D_P with a randomly under-sampled subset of the majority class D_N. The same operation is also conducted for the missing value masks. Furthermore, an MB layer with N_b outputs is added at the end of the core TCN to process the N_b balanced sub-datasets. Fig. 3 shows the details of our MB network architecture. Note that each balanced dataset consists of the imputed EHRs and the missing value masks, which are processed by two parallel stacks of sequential residual blocks. The results of the two parallel streams are then concatenated and flow to the fully connected layer. In the end, the MB output layer generates N_b predicted DR probabilities based on the outputs of the fully connected layer. As such, each balanced dataset serves as the training data for the corresponding branch of the output layer, and the core TCN structure is optimized by all N_b training subsets. Specifically, the optimization is achieved by minimizing the binary cross-entropy loss:

L(θ) = − Σ_{i=1}^{N_b} Σ_{p} I(p ∈ D_i) [y_p log P̂_i(θ; x_p, w_p) + (1 − y_p) log(1 − P̂_i(θ; x_p, w_p))],

where θ represents the model parameters, I(p ∈ D_i) is the indicator function denoting whether patient p belongs to subset D_i, and P̂_i(θ; x_p, w_p) is the prediction of DR probability provided by the i-th branching output for the input signal x_p and the corresponding mask w_p of patient p. The final prediction of DR probability for patient p is the mean of the N_b predicted probabilities:

P̂(x_p, w_p) = (1/N_b) Σ_{i=1}^{N_b} P̂_i(θ; x_p, w_p).
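A compact sketch of the MB output layer, the branch-wise binary cross-entropy, and the averaged final prediction; the names and dimensions are illustrative, and the shared backbone that produces the feature vector z is omitted.

```python
import torch
import torch.nn as nn

class MultiBranchOutput(nn.Module):
    """N_b parallel output branches on top of shared backbone features."""
    def __init__(self, feat_dim, n_branches=10):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_branches))

    def forward(self, z):                    # z: (batch, feat_dim)
        # One DR probability per branch: (batch, n_branches)
        return torch.cat([torch.sigmoid(h(z)) for h in self.heads], dim=1)

def mb_loss(probs, y, member):
    # probs: (batch, N_b); y: (batch,); member[p, i] = I(p in D_i)
    y = y[:, None].float()
    bce = -(y * torch.log(probs + 1e-8) + (1 - y) * torch.log(1 - probs + 1e-8))
    return (member * bce).sum() / member.sum()   # each branch sees only its subset

def mb_predict(probs):
    return probs.mean(dim=1)                 # average of the N_b branch outputs
```

The membership indicator is what restricts each branch's gradient to its own balanced subset while the shared backbone still receives gradients from every subset.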

A. Data Source and Data Extraction
The dataset used in this study is obtained from the 2018 Cerner Health Facts data warehouse, one of the largest Health Insurance Portability and Accountability Act (HIPAA)-compliant databases in the U.S., storing de-identified clinical data of more than 63 million patients [3]. This database contains comprehensive clinical information including patient demographics, hospital visits, diagnoses, procedures, medication prescriptions, vital signs, lab tests, etc., providing an unprecedented opportunity for data-driven diagnosis of DR. The labels are derived by examining whether one or more DR diagnosis codes exist: "1" represents a DR patient, while "0" represents a non-DR diabetic patient. We select the independent variables based on the literature [1], [44] and include 21 routine blood tests for diabetic patients, 5 comorbidity variables (i.e., neuropathy, nephropathy, hypertension, obesity, and coronary artery disease), 3 demographic variables (i.e., gender, age, and race), and the duration of diabetes, measured in years from the first diabetic diagnosis to the beginning of the prediction window. Note that all 21 blood tests are subject to missing values, the missing rates of which are shown in Fig. 5. The final dataset consists of 414,199 diabetic patients with a 3% (12,590) DR-positive rate. We randomly partition the dataset into a training set (70%) for model training, a validation set (10%) for hyperparameter tuning, and a testing set (20%) for performance evaluation.

B. Experimental Design
As shown in Fig. 6, the performance of our MB-TCN-TC is compared with TCN + different imputation techniques and TCN + imbalanced data handling methods. Specifically, our MB-TCN-TC method is benchmarked against TCN + carry-forward imputation (TCN+CF), TCN + mean imputation (TCN+mean), TCN + under-sampling, TCN + over-sampling, and the pure TCN. Additionally, we evaluate the effect of the number of branching outputs (i.e., 5, 10, and 20 branches) on the performance of MB-TCN-TC. The performance of both our model and the benchmarking methods is evaluated according to 6 metrics: Confusion Matrix, Area Under the Receiver-Operating-Characteristic Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), Recall, Precision, and F1 score. The ROC curve characterizes the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR), and the PR curve captures the relationship between Precision and Recall. Hence, AUROC and AUPRC serve as summary metrics across all thresholds, quantifying the overall prediction performance.
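For reference, AUROC can be computed directly from its probabilistic interpretation (the chance that a randomly chosen positive outranks a randomly chosen negative), and Precision, Recall, and F1 follow from the confusion matrix. The small sketch below illustrates these definitions; it is a didactic implementation, not the evaluation code used in the experiments.

```python
import numpy as np

def auroc(y_true, scores):
    """P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]       # all positive-negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def precision_recall_f1(y_true, y_pred):
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because AUROC integrates over all thresholds, it is insensitive to the choice of operating point; AUPRC plays the same role for the Precision-Recall tradeoff, which matters more under the 3% positive rate of this cohort.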
Our model is built using PyTorch, leveraging the computational power of an NVIDIA RTX A4500 GPU. In the CP decomposition-based imputation, we set the rank of the tensor to R = 25, the maximum number of iterations to N_m = 10,000, and the error tolerance to 10^−16. The hyperparameters of our neural network are selected by empirical fine-tuning. Specifically, our final network model consists of 3 residual blocks and 10 branching outputs (i.e., N_b = 10); the number of filters is 64 and the kernel size is k_e = 3 in the causal convolution; the dilation factor d is set to 1 for the first and 2 for the second dilated convolutional layer in each residual block; the dropout rate is 0.5; and we use the Adam optimizer with an initial learning rate of 0.0002 and momentum values of β_1 = 0.9, β_2 = 0.999 during network training. Table II further shows the performance comparison in terms of the Confusion Matrix along with the Recall, Precision, and F1 metrics. Specifically, our MB-TCN-TC-10 yields the highest number of true positives, with 1,880 DR patients correctly identified, indicating strong Recall in detecting DR. Our model also generates fewer false negatives, with 680 patients, which is critical in medical diagnostics as it reduces the risk of missing a potential DR diagnosis. Moreover, our model maintains the highest overall Precision (0.723) with the fewest false positives (i.e., 722 patients). Hence, our model makes correct DR predictions more often than the other benchmark methods. Additionally, the highest Recall of 0.734 suggests that our MB-TCN-TC-10 has the best capability to identify DR patients. Our model also yields the highest F1 score of 0.728, achieving the best balance between Recall and Precision.

C. Experimental Results
Table III shows the AUROC and AUPRC scores of our MB-TCN-TC and other imputation methods commonly used in current practice, with and without (w/o) the missing-value mask. Note that all the network models compared in Table III contain a classifier layer with 10 branching outputs. Our MB-TCN-TC with the missing-value mask provides the highest AUROC and AUPRC scores of 0.949 and 0.793 respectively, as compared with MB-TCN+CF with the missing-value mask (AUROC of 0.925 and AUPRC of 0.734), MB-TCN+CF w/o the missing-value mask (AUROC of 0.915 and AUPRC of 0.724), MB-TCN+mean with masks (AUROC of 0.916 and AUPRC of 0.726), MB-TCN+mean w/o masks (AUROC of 0.920 and AUPRC of 0.727), and MB-TCN-TC w/o masks (AUROC of 0.937 and AUPRC of 0.775). Table IV further shows the comparison between our MB-TCN-TC-10 model and other methods for addressing the imbalanced data issue. Note that all methods compared here use tensor decomposition-based imputation and missing value masks to handle the missing value issue. Our MB-TCN-TC-10 achieves better performance, with AUROC and AUPRC scores of 0.949 and 0.793, respectively. Specifically, our MB-TCN-TC-10 achieves 1.71% and 4.89% improvements on AUROC and AUPRC respectively, compared with the under-sampling method; increases the AUROC and AUPRC scores by 2.82% and 2.72%, compared with the pure TCN trained on the original imbalanced dataset; and increases the AUROC and AUPRC metrics by 1.61% and 2.85%, compared to the over-sampling method.

TABLE V PERFORMANCE COMPARISON BETWEEN OUR MB-TCN-TC-10 METHOD AND TRADITIONAL MACHINE LEARNING METHODS
The validation loss (AUROC) does not increase (decrease) as we further increase the training epochs, indicating that there is no overfitting issue. This is attributable to the dropout layer in our network. Specifically, in our training procedure, we select the dropout rate as γ = 0.5. The critical idea of dropout is to randomly freeze a subset of the DNN neurons with a pre-defined probability (i.e., γ = 0.5) at each training iteration [62]. This technique introduces stochasticity into network training to avoid possible overfitting, which is equivalent to an L2 regularization. In fact, the dropout loss function L_drop can be written as the summation of the original loss function L(θ; X, W) and an L2 regularization on the network parameters:

L_drop = L(θ; X, W) + λ Σ_{i=1}^{l} ||θ_i||^2,   (6)

where the regularization parameter λ is a function of the dropout rate γ, l denotes the number of hidden layers, and θ_i stands for the weights and biases of the i-th hidden layer. This regularization technique effectively avoids the overfitting issue.
Furthermore, it is worth noting that the training loss (AUROC) is larger (lower) than the validation loss (AUROC). This is also due to the use of the dropout layer during model training. The dropout technique is only used in the training phase to avoid overfitting, while all neurons are active and contribute to the prediction during the validation or testing phase. In other words, the L2 regularization (i.e., the second term on the right-hand side) in (6) is removed during the validation or testing phase, leading to a lower validation loss compared to the training loss. Additionally, because all features extracted by the hidden layers remain active for prediction during the validation phase, a higher AUROC is obtained for the validation set compared with the training phase.

B. Performance Comparison With Existing Literature
Table V presents a comparison of the performance scores between our MB-TCN-TC-10 model and traditional machine learning methods in the existing literature. Note that traditional machine learning methods (i.e., random forest, XGBoost, simple neural network, logistic regression) are not directly applicable to longitudinal data. To facilitate a meaningful comparison, the longitudinal dataset is transformed into a non-temporal dataset by keeping only the last encounter of routine blood tests for each patient to train those models. According to Table V, our model outperforms the benchmarks across various metrics, notably the highest-performing benchmark, XGBoost. Specifically, the MB-TCN-TC-10 yields an improvement of 3.7% on AUROC and 8.1% on AUPRC compared to XGBoost.
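The last-encounter transformation described above can be sketched as follows; the record layout `(patient_id, encounter_time, features)` is a hypothetical simplification of the actual EHR schema:

```python
def last_encounter(records):
    """Collapse longitudinal records into one row per patient by keeping
    only the most recent encounter, so non-temporal models can be trained.
    `records` is an iterable of (patient_id, encounter_time, features)."""
    latest = {}
    for pid, t, feats in records:
        if pid not in latest or t > latest[pid][0]:
            latest[pid] = (t, feats)  # keep the later encounter
    return {pid: feats for pid, (t, feats) in latest.items()}
```

This throws away the temporal trajectory on purpose, which is exactly the information the TCN-based models retain and the likely source of their advantage over the flattened benchmarks.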
Additionally, our model achieves the best Recall of 0.734 and the highest Precision of 0.723. The high Recall and Precision demonstrate that our model not only identifies DR patients more accurately but also achieves a superior reduction in the false-positive rate. Moreover, the highest F1 score of 0.728 further demonstrates our model's enhanced capability to maintain an optimal balance between Recall and Precision. Meanwhile, the MB-TCN-TC-10 provides the highest true positive count with 1,880 DR patients correctly identified. Our model also yields the smallest false negative count with 680 patients, which is critical in medical diagnostics as it reduces the risk of missing a potential DR diagnosis. Additionally, our model generates the smallest number of false positives (i.e., 722) while maintaining the highest overall Precision (0.723), which indicates that our model makes correct DR predictions more often than the other benchmark methods.
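The Recall, Precision, and F1 figures above follow directly from the confusion-matrix counts; a quick sketch verifying them from the counts reported in the text:

```python
# Confusion-matrix counts reported for MB-TCN-TC-10
TP, FN, FP = 1880, 680, 722

recall = TP / (TP + FN)                 # fraction of true DR patients identified
precision = TP / (TP + FP)              # fraction of DR predictions that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.734 0.723 0.728
```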

C. Discussion on the Performance Improvement
According to Tables III and IV, our MB-TCN-TC model effectively addresses the missing value and imbalanced data issues, which are very common problems in data-driven disease detection. Specifically, given the same classification layer with 10 branching outputs, our MB-TCN-TC with the missing-value mask yields the best performance in DR identification (see Table III) compared with other imputation methods. This improvement is due to two unique features of our framework: (1) the tensor decomposition-based imputation is an effective method to jointly consider the multi-dimensional correlations among medical variables, time, and patients, which is conducive to reliably imputing the missing data; (2) the missing-value mask effectively captures the missingness structure, mitigating the impact of possible imputation error on the prediction performance.
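Feature (1) can be sketched as fitting a low-rank CP model (cf. Fig. 2) to the observed entries of the patient × variable × time tensor and filling the missing entries from the reconstruction. The plain gradient-descent fitting below is a simplified stand-in for the completion algorithm used in the paper; the tensor shape, rank, step size, and iteration count are illustrative:

```python
import numpy as np

def cp_complete(T_obs, mask, rank=1, lr=0.05, iters=5000, seed=0):
    """Fill missing entries of a 3-way tensor (patients x variables x time)
    by fitting a rank-`rank` CP model sum_r u_r o h_r o s_r to the observed
    entries (mask == 1) via gradient descent, then imputing from the fit."""
    rng = np.random.default_rng(seed)
    I, J, K = T_obs.shape
    U = 0.1 * rng.standard_normal((I, rank))
    H = 0.1 * rng.standard_normal((J, rank))
    S = 0.1 * rng.standard_normal((K, rank))
    for _ in range(iters):
        # Residual restricted to observed entries only
        R = (np.einsum('ir,jr,kr->ijk', U, H, S) - T_obs) * mask
        U -= lr * np.einsum('ijk,jr,kr->ir', R, H, S)
        H -= lr * np.einsum('ijk,ir,kr->jr', R, U, S)
        S -= lr * np.einsum('ijk,ir,jr->kr', R, U, H)
    recon = np.einsum('ir,jr,kr->ijk', U, H, S)
    # Keep the observed values; impute only where the mask is zero
    return np.where(mask.astype(bool), T_obs, recon)
```

Because all three factor matrices are fit jointly, an entry missing for one patient is informed by the same variable's trajectory in similar patients, which is the multi-dimensional correlation argument made above.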
Similarly, given the same imputation method (i.e., tensor decomposition-based imputation with the missing-value mask), our MB-TCN-TC-10 provides the most accurate predictions compared to other techniques for handling imbalanced data (see Table IV). This improvement is due to the MB-enhanced classifier layer, where each branching output is trained with a balanced dataset to mitigate the potential bias induced by the imbalanced data. Additionally, the MB architecture combines the predictions from multiple independent classifiers according to (5): the branching outputs learn different mapping functions, and combining them significantly lowers the variance and increases the robustness of the prediction performance.
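Operationally, the MB training scheme amounts to splitting the majority class into disjoint subsets, pairing each with the full minority class, and averaging the branch predictions. A minimal sketch, with illustrative function names and a list-based data layout:

```python
import random

def make_balanced_subsets(non_dr, dr, n_branches, seed=0):
    """Split the majority (non-DR) class into n_branches disjoint subsets
    and pair each with the full minority (DR) class, yielding one balanced
    sub-dataset per branching output."""
    rng = random.Random(seed)
    non_dr = non_dr[:]            # shuffle a copy; leave the original intact
    rng.shuffle(non_dr)
    step = len(non_dr) // n_branches
    return [non_dr[i * step:(i + 1) * step] + dr for i in range(n_branches)]

def combine_branches(branch_probs):
    """Average the independent branch outputs into a single prediction,
    lowering the variance of the combined classifier."""
    return sum(branch_probs) / len(branch_probs)
```

Averaging independent, differently-trained branches is the standard ensemble variance-reduction argument invoked above: individual branch errors partially cancel in the mean.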
Table VI further evaluates the prediction performance of the MB-TCN-TC with 5, 10, and 20 branching outputs. When increasing the MB outputs from 5 to 10, both the AUROC and AUPRC slightly improve. This is because the imbalance ratio in the training sub-dataset for each branching output improves from 6:1 to 3:1 when using 10 instead of 5 MB outputs. Note that there is a nonnegligible decrease in the performance scores when further increasing the branching outputs to 20. One possible reason is that the sample size of the balanced sub-dataset used to train each branching output decreases as the number of branches increases, which may lead to ineffective or insufficient training of each branching output. Additionally, increasing the branching outputs too much also increases the model size, resulting in a more complex network that is more difficult to converge during training.
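The ratio arithmetic is straightforward: splitting the non-DR pool across more branches shrinks each branch's majority share while the DR class is reused in full. A 30:1 overall ratio (implied by the 6:1 and 3:1 figures quoted above, and used here only for illustration) reproduces both numbers:

```python
def per_branch_imbalance(n_non_dr, n_dr, n_branches):
    """Imbalance ratio of each branch's sub-dataset after splitting the
    non-DR class evenly across the branches."""
    return (n_non_dr / n_branches) / n_dr

print(per_branch_imbalance(30, 1, 5))   # 6.0  (5 branches  -> 6:1 per branch)
print(per_branch_imbalance(30, 1, 10))  # 3.0  (10 branches -> 3:1 per branch)
```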

D. Broad Applications
Our predictive model is built on EHR data, which is readily available and accessible in all healthcare settings. It does not require fundus images taken by retinal cameras, which are expensive for underserved populations. Our approach provides a cost-effective, non-image-based tool for DR screening, which has promising potential to increase the compliance rate of recommended ophthalmic exams among asymptomatic patients and to overcome barriers to providing ubiquitous diabetic eye care in rural or underserved areas.
Furthermore, our approach provides effective solutions to long-standing challenges in modeling EHRs, including incomplete datasets, imbalanced class distributions, and multi-variate longitudinal records, thereby ensuring reliable data-driven disease detection. The versatility of MB-TCN-TC extends beyond DR prediction, making it applicable to the automatic prediction of a wide range of diseases based on EHRs. Notably, chronic conditions such as Alzheimer's disease, Myasthenia Gravis, and Hidradenitis Suppurativa, which typically exhibit gradual, long-term progression [65], [66], [67], can significantly benefit from our MB-TCN-TC approach. Our model can capture critical information about the heterogeneous variable interactions and disease trajectory, which are difficult to discern through manual inspection of high-dimensional longitudinal EHR data. This capability facilitates accurate disease detection and has the potential to revolutionize chronic disease monitoring. These examples underscore the broad applicability of our research, indicating its potential for further exploration in diverse healthcare contexts.

E. Challenges in Real-World Clinical Implementation
Implementing the model in real-world clinical scenarios faces some challenges. One challenge is the computational burden, a common issue in large-scale machine learning. The unique architecture of our MB-TCN-TC model allows it to capture intricate patterns in complexly structured EHR data; this, on the other hand, requires substantial computational resources, especially when handling a large volume of EHR data. Specifically, our dataset contains 414,199 diabetic patients with multiple medical variables changing over time, resulting in approximately 217 million data points. In our study, we utilize NVIDIA RTX A4500 Graphics Processing Units (GPUs) to improve computational efficiency through parallel processing. As a result, the total training time for 50 epochs is approximately 10 hours.
Furthermore, in addition to the computational burden, DNN models are widely recognized as difficult to optimize. The objective function is generally non-convex, meaning there are multiple local optima. Traditional convex optimization algorithms and simple gradient descent methods [68] often struggle with this issue, possibly ending up at a suboptimal solution. The situation worsens in the many flat regions of non-convex functions, where the function value is nearly constant; the gradient in such regions is tiny, leaving these algorithms stuck at sub-optimal solutions. In this project, we implement the ADAM algorithm [69], a first-order gradient-based method that employs adaptive estimates of lower-order moments, to optimize our objective function. ADAM is an appealing choice due to its computational efficiency, minimal memory requirements, and suitability for large datasets and parameter counts. Its applicability to large-scale non-convex optimization enhances the training of large DNN models.
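For concreteness, the ADAM update can be sketched on a scalar objective. This is the textbook form from [69], not the paper's training code; the learning rate and step count below are illustrative:

```python
import math

def adam(grad, theta0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal scalar ADAM: adaptive estimates of the first and second
    moments of the gradient, with bias correction for the zero init."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

Because the effective step size is roughly lr regardless of the raw gradient magnitude, ADAM keeps moving through the near-flat regions where plain gradient descent stalls, which is the property highlighted above.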

VI. CONCLUSION
This paper presents a novel Multi-branching Temporal Convolutional Network with tensor completion (MB-TCN-TC) to investigate multi-variate longitudinal clinical data for reliable DR identification. We first leverage tensor data completion to estimate the missing values in the incomplete EHR data. Second, we adapt the multi-branching temporal neural network framework to further cope with the imbalanced data issue and to model the longitudinal sequences. Experimental results show that the proposed MB-TCN-TC model effectively captures critical temporal information and complicated variable interactions in the longitudinal EHRs while accounting for both the imbalanced data and missing value issues. Our MB-TCN-TC method yields superior performance when identifying DR from longitudinal EHR data, with better AUROC and AUPRC metrics compared to existing methods. More importantly, the MB-TCN-TC framework is generally applicable to modeling complex longitudinal sequences with imbalanced data and missing value issues for reliable health monitoring and diagnostics.

Fig. 2. Illustration of CP decomposition: u_r, h_r, and s_r denote the r-th column vectors of the matrices U, H, and S, respectively.

Fig. 3. Architectural detail of the missingness-informed TCN: (a) missing value mask; (b) the architecture of a residual block; (c) the dilated causal convolution with dilation factors d = 1, 2 and kernel size k_e = 3; (d) the multi-branching output layer.
), N_b balanced datasets are created from D by under-sampling the non-DR class D_N as D_i = {D_N^i, D_P}, where D_N = ∪_{i∈1,...,N_b} D_N^i.

Fig. 5. Missing rates for blood test variables in the dataset from DR patients (top) and non-DR patients (bottom).

Fig. 6. Illustration of the experimental design to evaluate the performance of the proposed MB-TCN-TC method.

Fig. 7(a) and (b) illustrate the ROC and PR curves of our MB-TCN-TC with 10 branching outputs (MB-TCN-TC-10) and other methods. Note that a good model in the ROC space is located towards the top-left corner, indicating a low FPR and a high TPR. According to Fig. 7(a), our method, i.e., MB-TCN-TC-10 with mask, dominates the ROC space with the highest AUROC of 0.949. On the other hand, in the PR space, a good classification model should achieve both higher Recall and higher Precision, towards the top-right corner. As shown in Fig. 7(b), the MB-TCN-TC-10 with mask outperforms the other methods, yielding the highest AUPRC of 0.793. Notably, our MB-TCN-TC-10 method achieves an improvement of 6.27% on AUROC and 11.85% on AUPRC compared with the pure TCN without the tensor-based imputation and MB architecture (AUROC of 0.893, AUPRC of 0.709). Table II further shows the performance comparison in terms of the confusion matrix along with the Recall, Precision, and F1 metrics. Specifically, our MB-TCN-TC-10 yields the highest true positive count with 1,880 DR patients correctly identified, indicating a strong Recall in detecting DR. Our model also generates a lower false negative count with 680 patients, which is critical in medical diagnostics as it reduces the risk of missing a potential DR diagnosis. Moreover, our model maintains the highest overall Precision (0.723) with the smallest number of false positives (i.e., 722 patients). Hence, our model makes correct DR predictions more often than the other benchmark methods. Additionally, the highest Recall of 0.734 suggests that our MB-TCN-TC-10 has the best capability to identify DR patients, and our model also yields the highest F1 score of 0.728, achieving the best balance between Recall and Precision.

Fig. 7. (a) ROC and (b) PR curves for the proposed MB-TCN-TC with 10 branching outputs and other methods.

Fig. 8(a) shows the training and validation losses of our MB-TCN-TC-10 model over 50 epochs. The training loss keeps decreasing with some random fluctuations as the epoch count increases, while the validation loss first decreases and then stabilizes around a low value after about 20 epochs. A similar trend is observed in the evolution of the AUROC over epochs, as shown in Fig. 8(b). In our experiments, we select the total epoch number as 26 to increase computational efficiency while guaranteeing the network model is well trained according to Fig. 8. Note that the validation loss (AUROC) does not

TABLE II COMPARISON OF PERFORMANCE SCORES BETWEEN OUR MB-TCN-TC WITH 10 BRANCHING OUTPUTS AND OTHER METHODS

TABLE III PERFORMANCE COMPARISON IN AUROC AND AUPRC BETWEEN OUR MB-TCN-TC MODEL AND OTHER IMPUTATION METHODS WITH AND W/O MISSING VALUE MASK

TABLE VI VARIATION OF AUROC AND AUPRC GENERATED BY MB-TCN-TC WITH DIFFERENT NUMBERS OF MB OUTPUTS