SSL-Unet: A Self-Supervised Learning Strategy Based on U-Net for Retinal Vessel Segmentation

Abstract—Retinal vessel segmentation is a crucial non-destructive detection technology for fundus diseases, and supervised methods are commonly applied to it with good performance. However, supervised methods rely wholly on expert-provided labels as the only learning signal during training. This over-reliance on direct semantic supervision has several issues: (1) The structural information of the whole raw data is much richer than that of the labeled data. (2) Supervised learning can only handle a single segmentation task under the guidance of the given labels. Nevertheless, unlabeled data contain additional helpful information, such as spatial structure and association properties, which is an instructive and significant supplement to supervised learning. (3) Labeling low-contrast and complexly structured vascular images by clinical experts is an onerous task, which results in smaller datasets and high cost. Therefore, a self-supervised learning U-Net model (SSL-Unet) is presented in this paper, which requires only a small number of labeled data together with some unlabeled data. A self-supervised module is introduced to help the model learn the inter-relations between data through a low-entropy pretext task, so that both labeled and unlabeled data can be exploited effectively and the potential features of the data can be learned by the self-supervised module. Two different self-supervised learning strategies and a pixel-level optimization function are designed for two datasets with different structures. Experimental results on two public retinal datasets, DRIVE and CHASEDB1, demonstrate that the proposed model further improves vessel segmentation performance and can be generalized to different backbone networks.


I. INTRODUCTION
Retinal vascular abnormalities are closely associated with common diseases such as diabetes, cataracts, and atherosclerosis [1]. Vessel segmentation is an essential basis for computer-aided diagnosis of retinal diseases [2], and the efficient and accurate segmentation of retinal vessels has become an urgent requirement for clinical diagnosis.
In recent years, many supervised learning methods have shown outstanding performance on retinal vessel segmentation [3]-[6]. However, a prerequisite for the success of supervised methods is the availability of a large amount of annotated data. In reality, labeling medical images is costly and laborious, especially for retinal vascular images [7]-[10]. Chen et al. [11] proposed a semi-supervised method combined with a generative adversarial network (GAN) that corrects pseudo labels with discriminators, achieving decent segmentation results with a small amount of labeled data. Xu et al. [12] proposed a partially supervised framework with an active learning strategy that labels the most informative patches, reducing the dependence on labeled data. Unfortunately, while such methods can leverage unlabeled data through pseudo labels, the inter-relation between vessel images is not considered. Ma et al. [13] proposed a self-supervised approach that learns the semantic similarity of vascular images by training attention-guided generators and discriminators. In this paper, self-supervised learning (SSL) is used to mitigate this problem. SSL can learn representation information from data without annotations through pretext tasks [14]-[19]. We find that this information benefits supervised methods: it enhances the model's understanding of the inherent correlations in the unlabeled data and compensates for the over-reliance of supervised methods on ground truths during segmentation, as shown in Fig. 1.
(The authors are with the School of Computer and Information Sciences, Chongqing Normal University, 401331, China. Email: macsy@cqnu.edu.cn; jia-lun@163.com; jju txs@163.com; wuruoyucqnu@163.com.)
Based on the above analysis, the main innovations of this paper are as follows: 1) A segmentation model incorporating a self-supervised module (SSL-Unet) is proposed, allowing both labeled and unlabeled images to be fed into the network. 2) Two self-supervised training strategies are proposed for the two datasets, which effectively exploit the unlabeled data to improve the segmentation performance of the model. 3) A regularized reconstruction loss is introduced to minimize the pixel-level differences between unlabeled data.

II. METHOD
A. Model Architecture
An SSL-Unet model consisting of two baseline networks is proposed, with unlabeled and labeled images as input data. It is worth noting that the two networks share an identical encoder-decoder structure, as shown in Fig. 1. Specifically, the labeled data are fed into baseline(a), and the supervised loss is calculated from the vessel probability map produced by baseline(a) and the ground truths. The unlabeled data are taken as input to the self-supervised module, which contains baseline(a) and baseline(b). By constructing a pretext task to train the self-supervised module, a binary probability map is produced by baseline(a) and pseudo labels are obtained from baseline(b). The pseudo labels are used to supervise the learning of the SSL module, from which the reconstruction loss is obtained. Finally, both the reconstruction loss and the supervised loss are utilized to optimize the model.
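As a minimal sketch of this data flow (with trivial stand-in functions in place of the real encoder-decoder baselines, flat lists in place of images, and squared error in place of the actual loss terms), one training step can be outlined as:

```python
def baseline_a(x):
    # Stand-in for the encoder-decoder baseline(a): returns a "probability map".
    return [min(max(v, 0.0), 1.0) for v in x]

def baseline_b(x):
    # Stand-in for baseline(b): its output is binarized into pseudo labels.
    return [1.0 if v > 0.5 else 0.0 for v in x]

def mse(pred, target):
    # Mean squared error as an illustrative placeholder for the paper's losses.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def training_step(labeled_x, labels, unlabeled_x):
    # Supervised branch: baseline(a)'s map on labeled data vs. ground truths.
    sup_loss = mse(baseline_a(labeled_x), labels)
    # Self-supervised branch: baseline(b) yields pseudo labels that
    # supervise baseline(a)'s prediction on the unlabeled data.
    pseudo = baseline_b(unlabeled_x)
    rec_loss = mse(baseline_a(unlabeled_x), pseudo)
    return sup_loss + rec_loss
```

This only illustrates how the two branches contribute one supervised and one reconstruction term to a single objective; the actual weighting of the terms is described in the loss-function section.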

B. An SSL strategy based on the DRIVE dataset
Generally, SSL requires specifically designed pretext tasks such as cropping [20], scaling [21], and jigsaw puzzles [22], which enable the network to be fully trained on different datasets. These transformations change the spatial information of the data but not its semantic information. Inspired by [23], a self-supervised strategy is proposed for the DRIVE dataset. Given unlabeled data X, a geometric transformation operation Θ transforms X into X1 and X2, where slicing and rotation are applied in sequence. Rectangles of the same color in Xrec1 and Xrec2 represent the same local area. By predicting the image patches, the baselines can capture local information from different views, and the consistency of the local features is maximized by pixel-wise comparison to train the network, as shown in Fig. 2.
Fig. 3. Illustration of the proposed SSL strategy.
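A minimal sketch of such a slice-then-rotate transformation Θ, operating on a small 2D grid as a stand-in for an image (the patch size and the 90-degree rotation here are illustrative assumptions, not the paper's exact settings):

```python
def slice_patches(img, size):
    # Cut the image into non-overlapping size x size patches, row-major order.
    h, w = len(img), len(img[0])
    return [[row[c:c + size] for row in img[r:r + size]]
            for r in range(0, h, size) for c in range(0, w, size)]

def rotate90(patch):
    # Rotate a square patch 90 degrees clockwise.
    return [list(col) for col in zip(*patch[::-1])]

def transform(img, size):
    # Theta: slicing followed by rotation, producing one transformed view.
    return [rotate90(p) for p in slice_patches(img, size)]
```

Applying `transform` with two different patch sizes or rotation counts would yield the two views X1 and X2 whose local features are then compared pixel-wise.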

C. An SSL strategy based on the CHASEDB1 dataset
The structure of human organs is basically symmetrical, e.g., lungs, kidneys, eyes, and arms. We observe that in the CHASEDB1 dataset, the distribution of vascular trunks in the left and right eyes of the same patient is almost coincident, as shown in Fig. 3. Therefore, a decoupled-rebuild strategy is introduced, which aims at modeling the relationship between the local and global representations of blood vessels and focuses on the prediction of the encoder-decoder baselines at the pixel level. Suppose x_L is the data of the left eye, x_R is the data of the right eye, f_enc is the encoder, and f_dec is the decoder. x_L and x_R are respectively decoupled as inputs to the encoder f_enc:

f_enc(x_L) = p(L) + n(L),  f_enc(x_R) = p(R) + n(R)

where p(.) represents the vessel trunks extracted by the attention branch and n(.) represents the tiny vessel information. As can be seen from Fig. 3, the vascular trunks of the left and right eyes of the same patient are approximately similar, so it is assumed that p(L) ≈ p(R). The vascular trunk information of the left eye and that of the right eye are cross-fused to obtain new features as follows:

f_enc(x_L) = p(R) + n(L),  f_enc(x_R) = p(L) + n(R)

Then they are sent to the corresponding decoders for the subsequent upsampling steps, respectively.
z_L = f_dec(p(R) + n(L)),  z_R = f_dec(p(L) + n(R))

where z is the recovery map after multiple upsampling steps and f_dec is the decoder. The reconstructed representations of the left and right eyes are validated by maximizing their consistency, as shown in Fig. 4.
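A schematic sketch of this decouple-and-rebuild step, with trivial stand-ins for the attention-guided decoupling and the decoder f_dec (a simple threshold separates "trunk" from "tiny vessel" responses here; the real model uses learned convolutional features):

```python
def decouple(x, threshold=0.5):
    # Stand-in for attention-guided decoupling: values above the threshold
    # are treated as trunk responses p(.), the rest as tiny vessels n(.).
    p = [v if v > threshold else 0.0 for v in x]
    n = [v if v <= threshold else 0.0 for v in x]
    return p, n

def f_dec(feat):
    # Stand-in decoder: identity "upsampling".
    return list(feat)

def cross_rebuild(x_left, x_right):
    pL, nL = decouple(x_left)
    pR, nR = decouple(x_right)
    # Cross-fuse: swap the trunk components between the two eyes,
    # relying on the assumption p(L) ~ p(R).
    z_left = f_dec([a + b for a, b in zip(pR, nL)])
    z_right = f_dec([a + b for a, b in zip(pL, nR)])
    return z_left, z_right
```

Because the trunks of the two eyes are assumed nearly identical, the rebuilt maps should stay consistent with the originals, which is exactly what the consistency objective enforces.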

D. Momentum gradient updating
The quality of the pseudo labels affects the performance of the model, and training baseline(a) and baseline(b) separately increases the difference between their predictions, which leads to inaccurate pseudo labels. Momentum updating is able to retain the previous updating direction during gradient optimization. Thus, the weights w of baseline(b) are dominated by the weights q of baseline(a) in the training process:

w_{t+1} = µ w_t + (1 − µ) q_t
where µ is the momentum coefficient, w_t denotes the weights of baseline(b) before the update, and w_{t+1} the weights after the update. A larger µ yields smoother and more stable updates of baseline(b), so µ is set to 0.999 as recommended by [24], and the update is performed every T epochs, with T set to 2.
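Assuming the standard exponential-moving-average form of momentum updating (as in MoCo-style frameworks; the exact variant used here is an inference from reference [24]), the per-parameter update can be sketched as:

```python
def momentum_update(w, q, mu=0.999):
    # EMA update: baseline(b) weights w drift slowly toward the
    # baseline(a) weights q, at a rate controlled by mu.
    return [mu * wi + (1.0 - mu) * qi for wi, qi in zip(w, q)]
```

With mu close to 1, baseline(b) changes only slightly per update, which keeps its pseudo labels stable while still tracking baseline(a).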

E. Loss function
The loss function of the SSL-Unet model comprises two components: the loss of the supervised module and the loss of the self-supervised module. For the segmentation task under supervision, the cross-entropy loss function is defined as

L_sup = − Σ_i Σ_{n=1}^{C} g_{n,i} log(p_{n,i})

where C represents the number of categories, p_{n,i} is the predicted probability that pixel i belongs to the n-th category, and g_{n,i} is the true label corresponding to pixel i. Since no labeled data are involved in the loss calculation of the self-supervised module, two losses are designed for it: a contrast loss and a regularized reconstruction loss. The contrast loss is

L_con = (p(L) − p(R))^2 + (n(L) − n(R))^2   (9)

and the regularized reconstruction loss L_rec compares the predictor output with the pseudo labels, where F_θ is a predictor with the same AdaptiveAvgPool as the projector, G is l2 regularization along the channel axis, and N is the number of input data. In view of the difference in magnitude between the optimization functions, a Laplace smoothing factor ε is added, which speeds up convergence; ε is set to 10 in the following experiments. Therefore, the total loss function is defined as

L = α1 L_sup + α2 L_con + α3 L_rec

where the α are weighting parameters: α1 = 0.8, α2 = 0.1, α3 = 0.1. Note that α1 = 0.8, α2 = 0, α3 = 0.2 on the DRIVE dataset.
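A minimal numeric sketch of how the terms combine: a pixel-wise cross-entropy for the supervised part and a weighted sum for the total objective (L_con and L_rec are passed in as precomputed scalars here, since they depend on network features not reproduced in this sketch):

```python
import math

def cross_entropy(probs, labels):
    # Pixel-wise cross-entropy: probs[i][n] is the predicted probability of
    # pixel i for category n; labels[i][n] is the one-hot ground truth.
    return -sum(g * math.log(p)
                for p_row, g_row in zip(probs, labels)
                for p, g in zip(p_row, g_row))

def total_loss(l_sup, l_con, l_rec, alphas=(0.8, 0.1, 0.1)):
    # Weighted combination of the supervised and self-supervised terms.
    a1, a2, a3 = alphas
    return a1 * l_sup + a2 * l_con + a3 * l_rec
```

On the DRIVE dataset the contrast term is disabled, which corresponds to calling `total_loss` with `alphas=(0.8, 0.0, 0.2)`.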

III. EXPERIMENTS
A. Datasets and Evaluation Metrics
Experiments are conducted on two public datasets [25], [26]. Unlabeled data are obtained by removing the labels of labeled data. Detailed information is shown in Table I. ACC, SP, F1, and AUC are utilized to evaluate the performance of the model.

B. Implementation Details for Training
The proposed model is implemented with the PyTorch framework. The batch size is set to 8, the Adam optimizer with default parameters is applied in combination with the cosine annealing method, the initial learning rate is set to 0.0005, and the number of epochs is 50.
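As a sketch of the cosine annealing schedule assumed here (the standard half-cosine decay to a minimum learning rate of 0; the exact variant and restart settings used in the paper are not specified):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=50, lr_init=0.0005, lr_min=0.0):
    # Learning rate decays from lr_init to lr_min along a half cosine curve.
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

In PyTorch this behavior corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=50` on an Adam optimizer initialized at lr=0.0005.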

C. Experimental Results and Analysis
SSL-Unet is compared with several mainstream methods on the DRIVE and CHASEDB1 datasets. The results are summarized in Table II and Table III, from which it can be concluded that SSL-Unet achieves the best performance on most metrics. A visual comparison is shown in Fig. 3 and Fig. 4. Besides, the ablation results in Table IV illustrate that the generation of pseudo labels and the reconstruction loss significantly improve the segmentation performance over the backbone network. Finally, we compare the complexity of several methods in Table V. Although SSL-Unet has more parameters than the other methods, its lower inference time is more beneficial for practical applications.

IV. CONCLUSION
We propose an SSL-Unet model for retinal vessel segmentation together with two self-supervised training strategies. The strategies help the self-supervised module learn from pseudo labels, thereby improving segmentation performance. Moreover, the fusion of the self-supervised and supervised paradigms is applied to retinal segmentation for the first time, and it can also be extended to any segmentation network.