Diabetic Retinopathy Detection using Transfer Learning from Pre-trained Convolutional Neural Network Models

Abstract—Diabetic Retinopathy is the most common eye disease that diabetes can cause. It can even result in vision loss and complete blindness. Early detection prevents complete vision loss. However, it is challenging to detect the disease early because it may not show symptoms in the early stages. The existing models for diabetic retinopathy cannot detect all the stages of diabetic retinopathy. The most widely used metrics, like accuracy, f1-score, precision, and recall, do not consider the level of disagreement between labels, which helps in detecting all the stages of diabetic retinopathy. In this paper, we used ResNet, VGG, and EfficientNet pre-trained models. We evaluated the results using the quadratic weighted kappa, which is appropriate for classifying the different stages of diabetic retinopathy based on severity. Furthermore, it considers the level of disagreement between actual and predicted labels. We achieved a quadratic weighted kappa score of 0.85 using the EfficientNet b3 network, surpassing existing models like support vector machines, decision trees, convolutional neural networks, and the DenseNet pre-trained model.


I. INTRODUCTION
DIABETIC RETINOPATHY is the most common diabetic eye disease and a leading cause of blindness in American adults [1]. Because of excessive blood sugar, the tiny blood vessels in the retina can break and hemorrhage, causing diabetic retinopathy. Any type of diabetes can result in diabetic retinopathy, and the longer one has diabetes, the higher the risk. Depending on the severity of the disease, the effect can range from near-normal vision to complete loss of sight [2], [3]. Diabetic Retinopathy is more likely to affect adults over 40 who have diabetes. In the United States of America, approximately 4.1 million adults suffer from diabetic retinopathy [4]. Early detection of diabetic retinopathy can prevent 95% of vision loss [5]. Diabetic Retinopathy may not give any symptoms in the earlier stages, as it happens inside the eye, even when the blood sugar level is maintained and vision is normal. Because of this, a doctor can only detect diabetic retinopathy after a proper examination [6].
Diabetic Retinopathy can be divided into two stages: Non-Proliferative Diabetic Retinopathy (NPDR) and Proliferative Diabetic Retinopathy (PDR). NPDR can further be divided into three phases: Mild, Moderate, and Severe [3], [7]. NPDR occurs when excessive sugar levels start affecting the tiny blood vessels in the retina, causing them to swell and leak fluid; as a result, the retina lacks oxygen and nutrients [3], [8]. In response, the body produces vascular endothelial growth factor (VEGF), which triggers the growth of new blood vessels to supply nutrients and oxygen to the retina. However, these new vessels are fragile and easily damaged, resulting in more swelling and leaking. This advanced stage is called Proliferative Diabetic Retinopathy (PDR), which is hazardous as it often causes permanent vision loss [3], [9].
Using the PyTorch framework, we developed VGG, ResNet, and EfficientNet pre-trained models and fine-tuned them for diabetic retinopathy detection. PyTorch supports automatic differentiation and efficiently uses GPUs for parallel processing [10]. Furthermore, we used Google Colab Pro, which provides GPU runtime for a limited time and 25.46 GB of RAM, to train the developed models.
As early detection of diabetic retinopathy can prevent patients from losing their vision, we have focused on early detection using pre-trained models like ResNet, VGG, and EfficientNet. We used the publicly available dataset from Kaggle [11] to develop a state-of-the-art model for diabetic retinopathy detection. We first used VGG, followed by ResNet, and finally EfficientNet. We have also reported the results of all the models using the most relevant metric, the Quadratic Weighted Kappa [12]. The EfficientNet b3 60 model achieved a state-of-the-art QWK score of 0.85. The proposed model can be used in healthcare to aid doctors in decision-making on a patient's condition, as the model can distinguish each class of diabetic retinopathy.

II. BACKGROUND AND LITERATURE REVIEW
Many researchers have used different classes of algorithms to detect diabetic retinopathy. We have classified the literature into Computer-Aided Detection (CAD) algorithms, Machine Learning (ML) algorithms, and Deep Learning (DL) algorithms. Most researchers used metrics like accuracy, recall, etc. Some researchers used the Quadratic Weighted Kappa [12], which is most relevant to diabetic retinopathy detection. Still, the reported Quadratic Weighted Kappa was less than 0.40, which is very low and might not detect all the stages of diabetic retinopathy with high probability.

A. Computer-aided detection and Machine Learning algorithms
In the early 2000s, Computer-Aided Detection (CAD) systems were developed and introduced into the clinical workflow to aid doctors in decision-making and disease diagnosis. But the CAD systems have shown limitations. One of them is that they report more false positives than a human reader, requiring further analysis and adding cost and time before the doctor can decide on a patient's health condition [13].
Machine Learning requires structured data. Decision trees, support vector machines, linear regression, and logistic regression are some examples of machine learning algorithms. The advantages of machine learning models are that they are less complex and can be implemented on conventional computers with CPUs. In addition, machine learning systems can be set up and operated quickly. The main drawback is that they do not work well for unstructured data like images and audio. However, machine learning algorithms can work very well even when the dataset is significantly small [14].
Ramya [15] used SVMs for two-class diabetic retinopathy detection and obtained a detection rate of 86% for the diabetic-affected eye. However, before feeding the data to the SVM, she had to preprocess the images and extract features using the Middle Filtering Method and histogram equalization.
Based on this review, it is evident that ML algorithms require an image processing technique to extract features and a machine learning classifier for classification, while CAD algorithms require more time and cost to extract features from retinal images for diabetic retinopathy detection. Therefore, in Section II-B, we discuss deep learning techniques for end-to-end detection of diabetic retinopathy, where feature extraction and classification are both performed by the deep learning model, removing the need for separate image processing techniques and an ML classifier.

B. Deep Learning algorithms
Deep Learning is a sub-field of machine learning inspired by both the functionality and structure of the brain [17]. Hardware accelerators such as the Graphical Processing Unit (GPU) and Tensor Processing Unit (TPU) have been the backbone of many modern deep learning architectures to provide faster training and inference. Deep Learning algorithms can be enhanced with more data, unlike other algorithms that get saturated after a specific limit is reached. Furthermore, deep learning algorithms can automatically extract features from the data without human intervention, unlike machine learning algorithms requiring manually extracted features as in [15]. Deep Learning algorithms can be used when there is enough computing power, high storage capacity, and extensive data.
Pratt et al. [18] proposed a Convolutional Neural Network (CNN) model to diagnose diabetic retinopathy from digital fundus images. They took a publicly available dataset of 80,000 images from Kaggle and obtained an overall accuracy of 75% on 5,000 validation images. They used a 13-layered network composed of 10 convolutional layers for feature extraction and three fully connected (FC) layers to classify the given input image. The acquired dataset comprises five classes: 0 - No DR, 1 - Mild DR, 2 - Moderate DR, 3 - Severe DR, 4 - Proliferative DR.
Doshi et al. [19] used deep convolutional neural networks to classify high-resolution digital fundus retinal images into five stages based on the severity of diabetic retinopathy, with quadratic weighted kappa as the metric. They obtained scores of 0.386 for a single convolutional network and 0.3996 for a combination of three similar models. The main drawback of this technique is that the quadratic weighted kappa scores are below 0.40; such a low QWK score might indicate that not all the stages of diabetic retinopathy are detected with high probability.
Gao et al. [20] proposed an ensemble framework of deep CNNs for diabetic retinopathy detection, also using the dataset from Kaggle. They combined two deep CNN models with an ensemble technique. The accuracies for imbalanced model 1, imbalanced model 2, balanced model 1, and balanced model 2 are 78.13%, 80.36%, 60.8%, and 60.89%, respectively. The demerit of this paper is that high accuracy alone is not enough, as accuracy can be high even when only a few stages of diabetic retinopathy are detected.
Through this survey of diabetic retinopathy detection using deep learning models, we have found that the proposed models must be improved to achieve a much higher Quadratic Weighted Kappa and must detect all five stages of diabetic retinopathy shown in Fig. 1.

C. Transfer Learning
Transfer Learning is a prevalent method in Deep Learning that transfers knowledge from one learning task to another. In Transfer Learning, one first downloads a pre-trained model, usually trained on a large dataset such as ImageNet [24]. Then, one fine-tunes it by replacing its final layers with either fully connected layers or a machine learning classifier such as Random Forest, Support Vector Machines, or Decision Trees, so that it suits the task at hand. Several pre-trained models trained on ImageNet [24] are available. We have taken ResNet, VGG, and EfficientNet, as they have been used extensively in academic research and in industry because of their features. The main features are the skip connections introduced in ResNet, which enhance the trainability of deeper networks; the small convolutional filters used in the VGG network, where stacked 3x3 filters give fewer trainable parameters while retaining a large receptive field; and the efficient compound scaling method used in EfficientNet, which effectively scales the network's depth, width, and resolution to learn the most complex features.
Simonyan et al. [21] proposed a CNN architecture known as the Visual Geometry Group (VGG) network, which uses only 3x3 convolutional filters. This architecture secured first and second places in the localization and classification tasks of the ImageNet 2014 challenge.
He et al. [22] proposed Deep Residual Learning for Image Recognition. They developed a residual learning framework to ease the training of deeper models, and because of the increased depth, they achieved better accuracy. For the ImageNet [24] dataset, they developed networks up to 152 layers deep.
Tan et al. [23] developed EfficientNet. Their paper proposed a new scaling method that uniformly scales all depth/width/resolution dimensions using a simple and very effective compound coefficient. They also proposed a baseline network using Neural Architecture Search (NAS) and then scaled it up to obtain a family of models known as EfficientNets, which achieved better accuracy and efficiency than previous convolutional networks. The authors developed the compound scaling method by treating depth, width, and resolution as dependent on each other.

D. Our goal
Most of the work on diabetic retinopathy detection focuses on the presence or absence of diabetic retinopathy. Furthermore, most papers did not use the quadratic weighted kappa, which must be used to indicate that all five stages of diabetic retinopathy are detected with high probability. When it was used, the reported score is low, e.g., 0.39 [19], and some models like [18] cannot detect the Mild DR stage. Therefore, the main goal of this paper is to develop a model that can efficiently detect all stages of diabetic retinopathy with a high quadratic weighted kappa. A high quadratic weighted kappa, unlike metrics like accuracy, indicates that all the stages are detected, so it can help doctors make better decisions about the patient.
We have proposed a model using the EfficientNet b3 pre-trained model in this paper. It is scaled uniformly and appropriately in width, depth, and resolution to detect complex features in the given retinal images. Hence, it achieves a state-of-the-art QWK score of 0.85 and can efficiently detect all the stages of diabetic retinopathy.
To build the final model, we analyzed the pre-trained models ResNet, VGG, and EfficientNet and observed that EfficientNet performed best. The reason could be the effective compound scaling method used in building EfficientNet, which jointly increases depth, width, and image resolution, unlike other scaling methods that scale these dimensions arbitrarily.

III. METHODOLOGY

A. Dataset Description

Fig. 1. Dataset Description
This project uses the resized version of the diabetic retinopathy Kaggle competition dataset [11]. This dataset contains 35,126 digital fundus retinal images. The given images are high-resolution digital fundus images of the retina taken under different imaging conditions. We split the data into three sets for training, validation, and testing. The training set contains 24,590 retinal images, whereas the validation and test sets each contain 5,268 retinal images. We resized the images to (150, 150) resolution to speed up training. Fig. 1 illustrates the output labels, indicated as 0-4, where the label counts describe the number of images for each class and the target class denotes the grade of diabetic retinopathy. The grade can help doctors diagnose the patient and choose the appropriate treatment, since treatment depends on the severity of diabetic retinopathy. The last two columns give the number of left-eye and right-eye images, respectively, for each stage of diabetic retinopathy.
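A minimal sketch of such a three-way split is shown below, with dummy tensors standing in for the 35,126 retinal images. The split sizes (24,590 / 5,268 / 5,268) come from the text; the random-split strategy and the seed are illustrative assumptions, since the paper does not specify them:

```python
import torch
from torch.utils.data import random_split, TensorDataset

# Dummy dataset standing in for the 35,126 retinal images.
full_dataset = TensorDataset(torch.arange(35126))

# Three-way split: 24,590 train / 5,268 validation / 5,268 test.
train_set, val_set, test_set = random_split(
    full_dataset,
    [24590, 5268, 5268],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)
print(len(train_set), len(val_set), len(test_set))  # 24590 5268 5268
```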

B. Quadratic Weighted Kappa
Most researchers have considered metrics like accuracy, sensitivity, and ROC score, which are not suitable for this problem because they do not consider the level of disagreement between labels. Thus, we have used the quadratic weighted kappa [12] to evaluate the model. The quadratic weighted kappa considers the level of disagreement between actual and predicted labels. For example, if the predicted label is 0 and the actual label is 2, the disagreement is greater than when the actual label is 2 and the predicted label is 1, and with quadratic weights it is penalized four times as heavily. Therefore, this metric is very appropriate for this project, as target labels must be penalized based on the level of disagreement. The quadratic weighted kappa is computed using three matrices: the expected matrix (E), the output matrix (O), and the weight matrix (W).
The quadratic weighted kappa can be computed as follows:
Step 1: calculate the expected matrix (E) by taking the outer product of the histograms of the actual and predicted labels.
Step 2: construct the output matrix (O) by building a confusion matrix of actual and predicted labels.
Step 3: calculate the weight matrix (W) as

W(i, j) = (i - j)^2 / (k - 1)^2,

where i is an actual label, j is a predicted label, and k is the number of classes.
Step 4: normalize the expected matrix (E) and output matrix (O) by dividing each by its sum.
Step 5: calculate the weighted kappa using the formula

kappa = 1 - num / den,

where num is the sum of the elements of the element-wise product of the weight matrix (W) and the output matrix (O), and den is the sum of the elements of the element-wise product of the weight matrix (W) and the expected matrix (E).
The quadratic weighted kappa ranges from -1 to 1. The larger the value, the lower the disagreement between the actual and predicted labels. A score of 1 indicates perfect agreement, whereas a score of -1 indicates extreme disagreement. Since the weights are quadratic, the level of disagreement between labels is penalized quadratically. The metric assumes the target and predicted labels are ordinal. The target labels of this project are ordinal, as they are assigned based on the severity of diabetic retinopathy, which follows an order; hence we have used this metric.
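The five steps above can be sketched as follows. This is an illustrative NumPy implementation of the quadratic weighted kappa, not the authors' exact code:

```python
import numpy as np

def quadratic_weighted_kappa(actual, predicted, k=5):
    """Quadratic weighted kappa, following steps 1-5 above."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    # Step 2: output matrix O = confusion matrix of actual vs. predicted.
    O = np.zeros((k, k))
    for a, p in zip(actual, predicted):
        O[a, p] += 1

    # Step 1: expected matrix E = outer product of the label histograms.
    E = np.outer(np.bincount(actual, minlength=k),
                 np.bincount(predicted, minlength=k)).astype(float)

    # Step 3: weight matrix W[i, j] = (i - j)^2 / (k - 1)^2.
    i, j = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    W = (i - j) ** 2 / (k - 1) ** 2

    # Step 4: normalize E and O so each sums to 1.
    O /= O.sum()
    E /= E.sum()

    # Step 5: kappa = 1 - num / den.
    return 1 - (W * O).sum() / (W * E).sum()

# Perfect agreement gives 1; total disagreement approaches -1.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
```

Since the diagonal of W is zero, exact matches contribute nothing to the numerator, and larger label gaps are penalized quadratically, exactly as described above.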

C. Confusion matrices for models
This section demonstrates the confusion matrix for the multi-class classification scenario using the finalized model's results, as illustrated in Table I. For a given class, TP is its diagonal cell, FP is the sum of the remaining cells in its column, FN is the sum of the remaining cells in its row, and TN is the sum of all the remaining cells, which here is 31 + 128 + 0 + 0 + 9 + 624 + 5 + 0 + 0 + 77 + 46 + 13 + 0 + 14 + 6 + 98 = 1051. The TP, TN, FP, and FN can be calculated similarly for the remaining classes. We used the weighted average for accuracy, precision, recall, and f1-score as well.
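The per-class counts described above can be computed generically from any multi-class confusion matrix. The 3x3 matrix below is a hypothetical example for illustration only, not Table I:

```python
import numpy as np

def per_class_counts(cm, c):
    """TP, FP, FN, TN for class c of a multi-class confusion matrix
    (rows = actual labels, columns = predicted labels)."""
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp      # predicted as c but actually another class
    fn = cm[c, :].sum() - tp      # actually c but predicted as another class
    tn = cm.sum() - tp - fp - fn  # all cells outside row c and column c
    return int(tp), int(fp), int(fn), int(tn)

# Hypothetical 3-class confusion matrix, purely for illustration.
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 0, 8]])
print(per_class_counts(cm, 0))  # (5, 2, 1, 12)
```

Note that TN for one class is the sum of every cell outside that class's row and column, which is exactly the 16-cell sum shown above for the 5x5 matrix of Table I.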

D. Model Description
In this project, we have used VGG, ResNet, and EfficientNet based on the architectures' simplicity, efficiency, and performance in the classification and localization tasks on the ImageNet dataset [24]. Furthermore, we have fine-tuned these models to suit the diabetic retinopathy classification task. All the pre-trained backbones in this work are frozen during the training and validation phases; only the final classifier is trained to detect the five stages of diabetic retinopathy. Initially, all the models were trained for 30 epochs, and the best-performing model, EfficientNet b3, was selected. After finding the best model, we trained the EfficientNet b3 model for another 30 epochs and saved the best weights, resulting in further improvement.

1) EfficientNet architecture: In this section, we describe the details of the EfficientNet architecture. Table II illustrates the architecture of the baseline model of the EfficientNet family, EfficientNet b0. The main building block of EfficientNet models is the inverted residual block [26], illustrated in Fig. 6 and indicated as MBConv in Table II. First, the squeeze-and-excitation [25] technique is applied to the MBConv block. Next, the input feature maps are expanded channel-wise using 1x1 convolutions. Then, a 3x3 depth-wise convolution and a point-wise convolution are performed to reduce the channels in the output feature map. Finally, a shortcut connection joins the input and output feature maps. The primary motivation of the EfficientNet authors was to improve both efficiency and accuracy. They found that it is essential to balance network depth, width, and resolution while scaling the network. Therefore, they proposed an effective compound scaling method to obtain the larger versions of EfficientNet from the baseline architecture. The compound scaling method can be formulated as

depth: d = α^ϕ,  width: w = β^ϕ,  resolution: r = γ^ϕ,
subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1. (3)
By setting ϕ equal to 1 and using a grid search, the parameters α, β, and γ can be found by choosing the values that give the best accuracy. Then, by increasing ϕ, the authors developed the different versions of the EfficientNet models.
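A small sketch of the compound scaling rule in (3), using the grid-searched constants reported in the EfficientNet paper for the b0 baseline (α = 1.2, β = 1.1, γ = 1.15):

```python
# Grid-searched constants from the EfficientNet paper (for phi = 1).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Multipliers for depth, width, and resolution at coefficient phi."""
    depth = ALPHA ** phi        # multiplier on the number of layers
    width = BETA ** phi         # multiplier on the number of channels
    resolution = GAMMA ** phi   # multiplier on the input resolution
    return depth, width, resolution

# The constraint alpha * beta^2 * gamma^2 ~= 2 keeps total FLOPs roughly
# doubling for each unit increase of phi.
print(ALPHA * BETA ** 2 * GAMMA ** 2)  # ~1.92
```

Each higher EfficientNet variant (b1, b2, b3, ...) corresponds to a larger ϕ, so depth, width, and resolution grow together rather than arbitrarily.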
2) Workflow: The workflow of the model is explained step by step as follows.
Step 1: All the input images were resized to a size of 150 x 150.
Step 2: The data is fed into the pre-trained model for feature extraction.
Step 3: A classifier was trained on the extracted features to predict the classes of diabetic retinopathy.
Step 4: Finally, we evaluated the model on the test dataset using metrics like accuracy, precision, recall, f1-score, and quadratic weighted kappa.
The general structure of our models is shown in Fig. 2. In the model structure, the pre-trained model is a convolutional neural network, and a feed-forward neural network forms the final classifier.

3) Model Improvement Techniques: To improve the performance of the models, we used the Adam optimizer, which combines the capabilities of RMSProp, which adjusts the effective learning rate based on the gradients of the weights, and momentum, which helps the model overcome local minima. Furthermore, to overcome the exploding gradient problem, we used gradient clipping with a clip value of 0.1. We also used a weight decay value of 10^-4. The learning rate for all our models is 0.001.
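A minimal PyTorch sketch of these training refinements, with a stand-in linear classifier. The clip value of 0.1 is applied here as a gradient-norm clip; value-wise clipping via `clip_grad_value_` is the other possible reading of the text:

```python
import torch
import torch.nn as nn

# Stand-in for the fine-tuned classifier head (1,536-d features, 5 grades).
model = nn.Linear(1536, 5)

# Adam with lr = 0.001 and weight decay = 1e-4, as described above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

features = torch.randn(8, 1536)        # dummy batch of extracted features
labels = torch.randint(0, 5, (8,))     # dummy DR grades 0-4

# One training step with gradient clipping at 0.1.
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
optimizer.step()
```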

4) Final Modelling:
We have used the EfficientNet models from b0 to b6. Of all the models, EfficientNet b3 achieved the state-of-the-art result. First, EfficientNet b3 produces a feature map of size 7 x 7 x 1,536. It then applies adaptive average pooling 2d with output size 1 x 1 to generate a feature map of size 1 x 1 x 1,536, which is flattened to a vector of size 1,536 and fed to the classifier. We have replaced the classifier used in EfficientNet b3 with our final classifier to detect the five stages of diabetic retinopathy. First, the feature vector of dimension 1,536 is fed to a fully connected layer, giving a feature vector of size 512. It then passes through a dropout layer with a rate of 0.5, followed by a ReLU activation layer. Next, the output of the ReLU is passed to a fully connected layer with output size 512, which feeds a dropout layer with a rate of 0.25 and then a ReLU activation layer. Finally, the output of the ReLU activation layer is given to a fully connected layer with five units, equal to the number of output classes. It is then fed to a Softmax activation layer to generate the probabilities for each category. The class with the highest probability is the predicted grade of diabetic retinopathy. The detailed architecture of the final model is depicted in Fig. 3.
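The head described above can be sketched in PyTorch as follows. The layer sizes and dropout rates follow the text, while everything else (variable names, batch size) is illustrative:

```python
import torch
import torch.nn as nn

# Final classifier replacing the EfficientNet b3 head:
# 1,536 -> 512 -> 512 -> 5, with dropout, ReLU, and a Softmax output.
classifier = nn.Sequential(
    nn.Linear(1536, 512),
    nn.Dropout(p=0.5),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.Dropout(p=0.25),
    nn.ReLU(),
    nn.Linear(512, 5),      # five units, one per DR grade
    nn.Softmax(dim=1),      # probabilities over the 5 grades
)

# The pooled-and-flattened EfficientNet b3 feature vector has size 1,536.
features = torch.randn(4, 1536)
probs = classifier(features)
print(probs.shape)  # torch.Size([4, 5])
```

The predicted grade is then `probs.argmax(dim=1)`, the class with the highest probability.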

IV. RESULTS
This section demonstrates the results evaluated using accuracy, precision, recall, f1-score, and quadratic weighted kappa. Fig. 4 shows the accuracy, f1-score, precision, recall, and quadratic weighted kappa of the VGG, ResNet, and EfficientNet models. The EfficientNet b3 model trained for 30 and 60 epochs is denoted as EfficientNet b3 30 and EfficientNet b3 60, respectively. The quadratic weighted kappa metric is denoted as QWK. The QWK scores of VGG 16, VGG 19, ResNet 18, ResNet 34, and ResNet 50 are exactly zero because of strong disagreement between the actual labels and the labels predicted by the model. In the following subsection, we show the confusion matrices for all the models, which help illustrate why the QWK score is 0. The QWK score of 0 arises because the given images were taken under various imaging conditions and these models are not robust to the varying conditions. The different versions of the ResNet and VGG models are scaled only in depth, which is not efficient enough to extract the complex features in the input images.

A. Metric scores for models
However, ResNet models use skip connections to overcome the vanishing gradient problem, so the architecture can be extended to enormous depth. The deeper versions of ResNet, ResNet101 and ResNet152, give a small positive QWK score, indicating that increasing the depth increases the level of agreement between the actual and predicted labels, which is desirable. By observing the QWK scores for the VGG and ResNet models, we found that models scaled only by depth are ineffective at handling images taken under various imaging conditions. Thus, we selected the EfficientNet models, which use the compound scaling method. The QWK scores for all versions of EfficientNet are comparatively good. Of all the versions, EfficientNet b3 trained for 60 epochs performed best, as indicated by its QWK score of 0.85. The reason is that different versions of EfficientNet have different resolutions, depths, and widths, and the EfficientNet b3 model's scaling of resolution, depth, and width is well suited to the resized input images. Scaling resolution, depth, and width more than needed makes a model underperform, which is evident from the QWK scores of the EfficientNet b4, b5, and b6 models. The remaining metrics (accuracy, precision, recall, and f1-score) are similar across the models. The reason for picking QWK is to show that a model with high accuracy, recall, etc. is not the best model unless it also has a high QWK score: even if the model cannot detect all the stages, the accuracy, precision, recall, and f1-score can be much higher than the QWK score. A detailed view of the results is depicted in Fig. 4. All the metric scores are highest for the EfficientNet b3 60 model, the QWK scores for the ResNet 18, 34, and 50 and VGG models are precisely zero, and the remaining metrics are similar for most of the models.

B. Confusion Matrices for Models
In this section, we demonstrate the confusion matrices for all the models. From Fig. 5, we found that the QWK score of the VGG and ResNet models is either 0 or very low because these models only identify class 0 and misclassify the remaining classes completely. In contrast, the EfficientNet models could recognize the other classes as well. Moreover, EfficientNet b3 30 and EfficientNet b3 60 detected all the classes with high probability; hence the QWK score is much higher for the EfficientNet models than for the VGG and ResNet models.

C. Comparison with existing models

Ramya [15] used an SVM to classify the normal eye and the diabetic-affected eye, with the Middle Filtering method for feature extraction. As a result, she could classify the normal eye with a recognition rate of 86%, whereas for the diabetic-affected eye the recognition rate is 82%.
Enrique et al. [16] used the Messidor database. They extracted three features, namely blood vessels, hard exudates, and microaneurysms, and used support vector machine and decision tree classifiers to classify diabetic retinopathy into four stages: Normal, Mild NPDR, Moderate NPDR, and Severe NPDR. They obtained an accuracy of 85.1% for both the SVM and the DT, and the two classifiers have similar results on the remaining metrics.
Pratt et al. [18] used a publicly available dataset from Kaggle. They developed a convolutional neural network to detect all the stages of diabetic retinopathy and obtained an accuracy of 75% and a specificity of 95%. The main drawback of this method is that their proposed model completely misclassifies the Mild DR class, which can be observed from their confusion matrix: for class 1 (Mild DR), the total number of correct predictions is zero.
Doshi et al. [19] used the EyePACS dataset. They developed an ensembled model using three similar convolutional neural networks to classify high-resolution retinal images into the five stages of diabetic retinopathy. They achieved a quadratic weighted kappa of 0.3996, which is very low and can be improved much further.
Gao et al. [20] used the Kaggle dataset and obtained an Imbalanced accuracy of 80.36% and a recall of 47.70% by using the transfer learning technique for the classification of diabetic retinopathy into five stages. The five stages are Normal, Mild NPDR, Moderate NPDR, Severe NPDR, and PDR.
We used a Kaggle dataset and developed the EfficientNet b3 60 model, which is able to detect all five stages of diabetic retinopathy with a QWK score of 0.85, surpassing the existing models discussed above.

V. CONCLUSION AND FUTURE WORK
Until now, little research has been done on detecting all the stages of diabetic retinopathy with a high Quadratic Weighted Kappa. In this paper, we used three families of pre-trained models: ResNet, VGG, and EfficientNet. Among them, VGG and ResNet can only detect class 0 with high probability. Our results suggest that pre-trained models like ResNet and VGG, which are scaled only by depth, cannot learn the features that correspond to all the stages of diabetic retinopathy. The EfficientNet models, which use the compound scaling method to uniformly scale depth, width, and resolution, can detect more than one stage of diabetic retinopathy. In particular, the EfficientNet b3 model can detect all the stages of diabetic retinopathy with a high quadratic weighted kappa score of 0.85. As we used Google Colab Pro for model training, the resources were not sufficient to train the EfficientNet b7 model; nevertheless, we achieved a quadratic weighted kappa score of 0.85. Further improvement could come from using higher-resolution images, ensembling different CNN models or multiple copies of the best single model (in our case, EfficientNet b3), or evaluating more advanced architectures such as CoAtNet, which combines depth-wise convolution and self-attention.