YoloCurvSeg: You Only Label One Noisy Skeleton for Vessel-style Curvilinear Structure Segmentation

Weakly-supervised learning (WSL) has been proposed to alleviate the conflict between data annotation cost and model performance through employing sparsely-grained (i.e., point-, box-, scribble-wise) supervision and has shown promising performance, particularly in the image segmentation field. However, it is still a very challenging task due to the limited supervision, especially when only a small number of labeled samples are available. Additionally, almost all existing WSL segmentation methods are designed for star-convex structures, which are very different from curvilinear structures such as vessels and nerves. In this paper, we propose a novel sparsely annotated segmentation framework for curvilinear structures, named YoloCurvSeg. An essential component of YoloCurvSeg is image synthesis. Specifically, a background generator delivers image backgrounds that closely match the real distributions through inpainting dilated skeletons. The extracted backgrounds are then combined with randomly emulated curves generated by a Space Colonization Algorithm-based foreground generator, via a multilayer patch-wise contrastive learning synthesizer. In this way, a synthetic dataset with both images and curve segmentation labels is obtained, at the cost of only one or a few noisy skeleton annotations. Finally, a segmenter is trained with the generated dataset and possibly an unlabeled dataset. The proposed YoloCurvSeg is evaluated on four publicly available datasets (OCTA500, CORN, DRIVE and CHASEDB1) and the results show that YoloCurvSeg outperforms state-of-the-art WSL segmentation methods by large margins. With only one noisy skeleton annotation (respectively 0.14\%, 0.03\%, 1.40\%, and 0.65\% of the full annotation), YoloCurvSeg achieves more than 97\% of the fully-supervised performance on each dataset. Code and datasets will be released at https://github.com/llmir/YoloCurvSeg.


Introduction
Curvilinear structures are elongated, curved, multi-scale structures that often appear tree-like and are commonly found in natural images (e.g., cracks and aerial road maps) and biomedical images (e.g., vessels, nerves and cell membranes). Automatic and precise segmentation of these curvilinear structures plays a significant role in both computer vision and biomedical image analysis. For example, road mapping serves as a prerequisite in both autonomous driving and urban planning. In the biomedical field, studies (Pritchard et al., 2014; Lin et al., 2021c; Kawasaki et al., 2009; Lin et al., 2020) have suggested that the morphology and topology of specific curvilinear anatomy (e.g., retinal vessels and corneal nerve fibers) are highly relevant to the presence or severity of various diseases such as hypertension, arteriolosclerosis, keratitis, age-related macular degeneration, diabetic retinopathy, and so on. Retinal vessels are observable in retinal fundus images and optical coherence tomography angiography (OCTA) images, while corneal nerve fibers are identifiable in confocal corneal microscopy (CCM) images. It has been suggested that early signs of many ophthalmic diseases are reflected by microvascular and capillary abnormalities (Allon et al., 2021; Lin et al., 2021b). Collectively, accurate segmentation of various curvilinear structures is of great importance for computer-aided diagnosis, quantitative analysis and early screening of various diseases, especially in ophthalmology.
In recent years, benefiting from the development of deep learning (DL), many DL-based segmentation algorithms for curvilinear structures have been proposed and have shown overwhelming performance compared to traditional (e.g., matched filter-based and morphological processing-based (Nguyen et al., 2013; Singh and Srivastava, 2016)) methods. Most existing works are dedicated to designing sophisticated network architectures (Peng et al., 2021; Mou et al., 2021; He et al., 2022) and deploying strategies to preserve curvilinear structures' topology by employing generative adversarial networks (GANs) (Lin et al., 2021c; Son et al., 2019) or topology-preserving loss functions (Cheng et al., 2021a; Shit et al., 2021). These methods are typically fully-supervised, wherein large-scale well-annotated datasets are required. However, collecting and labeling a large-scale dataset with full annotation is very costly and time-consuming, particularly for medical images, since their annotation requires expert knowledge and clinical experience. Furthermore, annotating curvilinear structures is even more challenging, given that curvilinear structures are slender, multi-scale, and complex in shape with fine details.
More recently, many efforts have been made to reduce the annotation cost for DL model training. For example, semi-supervised learning (SSL) trains models by combining limited amounts of annotated data with massive unlabeled data (Xu et al., 2022; Hou et al., 2022; Mittal et al., 2019). While effective, most state-of-the-art (SOTA) SSL methods still require about 5%-30% of the data to be accurately and precisely labeled to achieve about 85%-95% of the fully-supervised performance, which is still not sufficiently cost-effective and remains time-consuming when it comes to labeling curvilinear structures. Weakly supervised learning (WSL) attempts to alleviate the annotation issue from another perspective by performing sparsely-grained (i.e., point-, scribble-, bounding box-wise) supervision and attains promising performance (Liang et al., 2022; Lin et al., 2016; Tang et al., 2018a,b; Kervadec et al., 2019). Compared with either points or bounding boxes, scribbles are a relatively more flexible and generalizable form of sparse annotation that can be used to annotate complex structures (Luo et al., 2022). Existing scribble-supervised segmentation methods mainly fall into two categories. The first line of research exploits structural or volumetric priors to expand scribbles into more accurate pseudo proposals, for example, by grouping pixels with similar grayscale intensities or locations into the same class (Liang et al., 2022; Lin et al., 2016; Ji et al., 2019). However, the expansion process may introduce noisy proposals, which may induce error accumulation and deteriorate the performance of the segmentation model. Some work (Huo et al., 2021) also points out an inherent weakness of these methods, namely that models retain their own predictions and thus resist updating. The second line learns adversarial shape priors utilizing extra unpaired but fully-annotated masks. Such approaches somewhat contradict the motivation of saving annotation costs, especially for complex curvilinear structures (Larrazabal et al., 2020; Valvano et al., 2021; Zhang et al., 2020b). Moreover, most WSL methods still require sparsely labeling the entire dataset (or a large portion thereof), and they are mainly designed and validated on relatively simple structures (e.g., cardiac structures or abdominal organs) with assumptions and priors that may not apply to complex structures (e.g., curvilinear structures).
To address the aforementioned challenges, we here present a novel WSL segmentation framework for vessel-style curvilinear structures, namely You Only Label One Noisy Skeleton for Curvilinear Structure Segmentation (YoloCurvSeg). For curvilinear structures, label noises/errors are inevitable, and a good segmentation approach should be noise-tolerant. Therefore, instead of utilizing only the annotated pixels for supervision, YoloCurvSeg ingeniously converts the weakly-supervised problem into a fully- or semi-supervised one via image synthesis. It employs a trained inpainting network as a background generator, which takes one (or multiple, depending on availability) noisy skeleton (as shown in Fig. 1) and dilates it to serve as an inpainting mask, so as to obtain a background that closely matches the real distribution. The extracted background is then augmented and combined with randomly emulated curves generated by a Space Colonization Algorithm-based foreground generator, from which a synthetic dataset is obtained through a multilayer patch-wise contrastive learning synthesizer. Finally, a segmenter performs coarse-to-fine two-stage segmentation using the synthetic dataset and an unlabeled dataset (if available). Our main contributions are summarized as follows:
• We propose a novel weakly-supervised framework for one-shot skeleton/scribble-supervised curvilinear structure segmentation, namely YoloCurvSeg. To the best of our knowledge, YoloCurvSeg is a pioneering weakly-supervised segmentation method for curvilinear structures utilizing noisy and sparsely-annotated data.
• YoloCurvSeg innovatively converts a WSL problem into a fully-supervised one through four steps: curve generation, image inpainting, image translation and coarse-to-fine segmentation. The proposed framework is noise-robust, sample-insensitive and easily extensible to various curvilinear structures.
• We evaluate YoloCurvSeg on four challenging curvilinear structure segmentation datasets, namely OCTA500 (Li et al., 2020), CORN (Zhao et al., 2020), DRIVE (Staal et al., 2004) and CHASEDB1 (Fraz et al., 2012). Experimental results show that YoloCurvSeg outperforms SOTA WSL and noisy label learning methods by large margins. Meanwhile, we demonstrate that ≥ 97% of the fully-supervised performance can be achieved with only one noisy skeleton label (approximately 0.1% or 1% of the full annotation), which shall also inspire subsequent works on WSL and curvilinear dataset construction.

Related Works
Related works mainly involve curvilinear structure segmentation, weakly-supervised segmentation and medical image synthesis, which we introduce below one by one.

Curvilinear Structure Segmentation
Existing automatic curvilinear structure segmentation algorithms can be roughly divided into two categories. The first category is traditional unsupervised methods, mainly including mathematical morphology methods and various filtering methods (Mou et al., 2021). For instance, Zana and Klein (2001) segment vascular-like patterns using a hybrid framework of morphological filtering and cross-curvature analysis. Passat et al. (2006) present a preliminary approach to strengthen the segmentation of cerebral vessels by incorporating high-level anatomical knowledge into the segmentation process. Filtering methods include Hessian matrix-based filters (Frangi et al., 1998), matched filters (Singh and Srivastava, 2016; Hoover et al., 2000), multi-oriented filters (Soares et al., 2006), symmetry filters (Zhao et al., 2017), etc. The other category is supervised methods, wherein data with ground truth labels are used to train segmenters based on predefined or model-extracted features. Traditional machine-learning-based approaches are dedicated to pixel-level classification using handcrafted features (Zhang et al., 2017; Holbura et al., 2012). Recently, DL-based approaches have made significant progress in various segmentation tasks. For example, Ronneberger et al. (2015) propose U-Net, which has been widely used in numerous medical image segmentation tasks. Existing curvilinear structure segmentation works focus on well-designed network architectures, introducing multi-scale (He et al., 2022; Wu et al., 2018), multi-task (Lin et al., 2021b; Peng et al., 2021; Hao et al., 2022), or various attention mechanisms (Mou et al., 2021; Yu et al., 2022), as well as on exploiting morphological and topological properties by introducing GANs or morphology-/topology-preserving loss functions (Cheng et al., 2021a; Shit et al., 2021). Still, data availability and annotation quality are the main limitations of these methods.

Weakly-supervised Segmentation
Weakly-supervised segmentation aims to reduce labeling costs by training segmentation models on data annotated with coarse granularity (Liang et al., 2022). Among various formats of sparse annotation, scribble is recognized as the most flexible and versatile one, capable of annotating even very complex structures (Luo et al., 2022; Valvano et al., 2021). Existing scribble-supervised segmentation methods fall into two main categories. The first one exploits structural or volumetric priors to expand scribble annotations by assigning the same class to pixels with similar intensities or nearby locations (Lin et al., 2023; Liang et al., 2022; Lin et al., 2016; Ji et al., 2019). The main limitation of such approaches is that they heavily rely on pseudo proposals and often contain multiple stages, which can be time-consuming and prone to errors that may be propagated during model training. The second category learns adversarial shape priors utilizing extra unpaired but fully-annotated masks. Such approaches somewhat contradict the motivation of saving annotation costs, especially for complex curvilinear structures (Larrazabal et al., 2020; Valvano et al., 2021; Zhang et al., 2020b). Additionally, these methods still require sparsely labeling the entire dataset or a large portion of it, and they are mainly designed and validated on relatively simple structures, such as cardiac structures or abdominal organs, with assumptions and priors that may not apply to complex structures (e.g., curvilinear ones). In this paper, we make use of noisy skeletons, which differ from scribbles in two ways: (1) skeletons are more demanding to label, since all branches are supposed to be covered; (2) noisy skeletons are more likely to contain errors or noise, which are inevitable when quickly labeling slender structures. We convert sparse and noisy skeleton annotations to accurate ones via an image synthesis pipeline, thus requiring only one noisy skeleton label. This significantly reduces the annotation cost.
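To make the notion of a "noisy skeleton" concrete: the jitter introduced by fast manual tracing can be emulated by warping a clean skeleton mask with a smoothed random displacement field. The sketch below is our own illustration (the function name `elastic_jitter` and the parameter values are illustrative assumptions, not taken from any released code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_jitter(mask, alpha=8.0, sigma=4.0, seed=0):
    """Simulate annotation jitter on a binary skeleton: a random,
    Gaussian-smoothed displacement field warps the mask, mimicking the
    noise of fast manual tracing (alpha = displacement scale in pixels,
    sigma = smoothness of the field)."""
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # Bilinear resampling of the mask at the jittered coordinates.
    warped = map_coordinates(mask.astype(float), [yy + dy, xx + dx],
                             order=1, mode='constant')
    return (warped > 0.5).astype(np.uint8)
```

Larger `alpha` relative to `sigma` produces rougher, more "hand-drawn" skeletons.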

Medical Image Synthesis
GANs (Goodfellow et al., 2020) have become the mainstay of medical image synthesis, with common applications in intra-modality augmentation (Zhou et al., 2020), cross-domain image-to-image translation (Peng et al., 2022), quality enhancement (Cheng et al., 2021b), missing modality generation (Huang et al., 2022a,b), etc. Below we briefly review previous works on retinal image synthesis, a topic relevant to our work. Costa et al. (2017a) employ a U-Net-based conditional GAN, i.e., Pix2pix (Isola et al., 2017), trained with paired fundus images and vessel masks, to learn a mapping from vessel masks to the corresponding fundus images. To simplify the framework, they later propose an adversarial autoencoder (AAE) for retinal vascularity synthesis and a GAN for generating retinal images (Costa et al., 2017b). Similarly, Guibas et al. (2017) present a two-stage approach that consists of a DCGAN for generating vasculature from noise and a cGAN (Pix2pix) for synthesizing the corresponding fundus image. Note that cGAN requires paired images and vessel masks for training, which is a somewhat strict condition. These methods require an extra set of vessel annotations to train the AAE or DCGAN and may sometimes generate vessels with unrealistic morphology; the generated images also lack diversity. Zhao et al. (2018) develop Tub-sGAN, which incorporates style transfer into the GAN framework to generate more diverse outputs. In another work, SkrGAN (Zhang et al., 2019) introduces a sketch-prior-related constraint to guide the image generation process. Yet, the sketches utilized are extracted by the Sobel edge operator and cannot be used as segmentation masks.
In this paper, we employ a multilayer patch-wise contrastive foreground-background fusion GAN for several considerations.
According to previous research, training a GAN to learn a direct mapping from a curvilinear structure mask to the corresponding image is difficult, especially under few-shot conditions (Lin et al., 2021a). Therefore, we provide the GAN with extracted real backgrounds, enabling an implicit skip-connection that allows the GAN to focus more on mapping the foreground regions. Such a design not only enhances performance but also accelerates convergence. Multilayer patch-wise contrastive learning allows the provided mask and the foreground region of the generated image to be spatially aligned (despite unpaired training), which further benefits the subsequent segmenter.

Method
YoloCurvSeg comprises four main components: (1) a Curve Generator that produces binary curve masks that well accommodate the corresponding image modality of interest; (2) an Inpainter for extracting backgrounds from labeled samples; (3) a Synthesizer that synthesizes images from the generated curve masks and the image backgrounds; and (4) a two-stage Segmenter trained with the synthetic dataset and an unlabeled dataset. The overall framework is shown in Fig. 2.

Curvilinear Structure Generation
Space colonization is a procedural modeling algorithm in computer graphics that simulates the growth of branching networks or tree-like structures (Runions et al., 2005, 2007), including vasculature, leaf venations, root systems, etc. It is employed in YoloCurvSeg to model the iterative growth of curvilinear structures with two fundamental elements: attractors and nodes. Its core steps are described in the bottom left panel of Fig. 2, wherein blue dots denote attractors and black ones denote nodes: a) place a set of attractors randomly or following a predefined pattern, and then associate nodes with nearby attractors (if the distance between a node and an attractor is within an attraction distance $D_a$); b) for each node, calculate its average direction from all attractors affecting it; c) calculate the position of new nodes by normalizing the average direction to a unit vector and scaling it by a predefined segment length $L_s$; d) place nodes at the calculated positions and check whether any nodes fall within an attractor's kill zone; e) prune an attractor if any node stays within its kill distance $D_k$; f) repeat steps b)-e) until the maximum number of nodes is reached. By observing the pattern of the foreground/curve in a single image or the few accessible images, including the curves' starting point, boundary, and degree of curvature, it is relatively straightforward to set the corresponding hyperparameters, such as the root node coordinates $C_r$ (e.g., the starting point of the vessels in fundus images lies in the optic disc region), as well as the bounds and obstacles. For $D_a$, $D_k$ and $L_s$, the commonly used values of 5, 30 and 5 can be re-tuned as needed. Regarding the attractors, we use a grid placement strategy to control their number by setting the number of grids in both the horizontal and vertical directions. For simplicity, we set the same grid number $A_g$ for both directions. Each attractor can be jittered within a
certain range $A_j$ to introduce randomness. Attractors located outside the boundary or inside the obstacles are removed. Table 1 summarizes the parameters and post-processing operations we employ for generating the four types of curves, and representative examples are demonstrated in the bottom panel of Fig. 2. Please note that our adopted settings and post-processing operations only represent our empirical choices and are not necessarily the best-performing ones; users can make further adjustments based on their own observations and experiences. In our configuration, multiplying with the field of view (FOV) region is performed to align the curve with the corresponding image background, ensuring that the curve does not exceed the FOV area. Random Crop and Random Flip are employed to further enhance the diversity of the curves, while Erode and Dilate are utilized to fine-tune the curve thickness. In addition to the curvilinear shape, we also simulate the thickness of each branch via $R^n = R_1^n + R_2^n$, where $R$, $R_1$ and $R_2$ respectively denote the radii of a father branch and its two child branches, and $n$ is set to 3 according to Murray's law (Painter et al., 2006). The calculation is performed recursively from the branch tips (whose radii are set to 1) towards the tree base. Several intuitive demos can be accessed at link1. By setting random grid attractors and root nodes via predefined parameters, we construct a bank of curves of the same type but with varied shapes for each dataset of interest, and then employ them to train the synthesizers and the segmenters.
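The growth loop a)-f) above, together with Murray's law for branch radii, can be sketched in a few dozen lines of Python. This is a simplified illustration under our own assumptions (nearest-node attractor assignment, 2D points, no bounds/obstacles); function names and defaults are illustrative and not taken from the paper's code:

```python
import numpy as np

def space_colonization(attractors, root, D_a=5.0, D_k=1.0, L_s=1.0, max_nodes=200):
    """Iterative curve growth: nodes extend toward nearby attractors,
    and an attractor is pruned once a node enters its kill distance."""
    attractors = [np.asarray(a, float) for a in attractors]
    nodes = [np.asarray(root, float)]
    edges = []  # (parent_index, child_index) pairs of the growing tree
    while attractors and len(nodes) < max_nodes:
        # a)/b) each attractor pulls its nearest node within D_a;
        #       accumulate a set of unit directions per influenced node
        pull = {}
        for a in attractors:
            dists = [np.linalg.norm(a - n) for n in nodes]
            i = int(np.argmin(dists))
            if 1e-9 < dists[i] <= D_a:
                pull.setdefault(i, []).append((a - nodes[i]) / dists[i])
        if not pull:
            break
        # c)/d) place a new node one segment length L_s along the
        #       averaged, re-normalized direction
        for i, dirs in pull.items():
            d = np.mean(dirs, axis=0)
            d /= np.linalg.norm(d)
            nodes.append(nodes[i] + L_s * d)
            edges.append((i, len(nodes) - 1))
        # e) prune attractors reached by any node (kill distance D_k)
        attractors = [a for a in attractors
                      if min(np.linalg.norm(a - n) for n in nodes) > D_k]
    return np.array(nodes), edges

def murray_radius(child_radii):
    """Parent radius from child radii via Murray's law, R^n = R1^n + R2^n, n = 3."""
    return sum(r ** 3 for r in child_radii) ** (1.0 / 3.0)
```

Running the radius computation recursively from tip radii of 1 toward the root, as described above, yields branch thicknesses consistent with Murray's law.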

Inpainting for Background Extraction
Inpainting is the task of reconstructing missing or masked regions in an image. Similar to removing watermarks or extraneous pedestrians from images, we here employ an inpainting model to remove foregrounds (e.g., vessels and nerve fibers) from the images of interest, under the hypothesis that the dilated noisy skeletons can fully cover the foregrounds. In inpainting, common concerns are the network's ability to grasp local and global context and to generalize to a different (especially higher) resolution.

Architecture
Inspired by Suvorov et al. (2022), we adopt an inpainting network based on the recently proposed fast Fourier convolutions (FFCs) (Chi et al., 2020), which possess image-wide receptive fields, strong generalizability and relatively few parameters. Given a masked image $I \odot (1-m)$, where $I$ and $m$ respectively denote the original image and the binary mask of the inpainting regions, the feed-forward inpainting network $f_\theta(\cdot)$ aims to output an inpainted image $\hat{I} = f_\theta(I')$ taking a four-channel input $I' = \mathrm{concat}(I \odot (1-m), m)$. FFC builds its basis on the channel-wise fast Fourier transform (FFT) and has a receptive field covering the whole image. It splits channels into two parallel branches: a local branch using conventional convolutions and a global branch using the real FFT to capture global context, as shown in Fig. 3. The real FFT is only applicable to real-valued signals, and the inverse real FFT ensures that the output is real-valued. Compared to the full FFT, the real FFT uses only half of the spectrum. In FFC, the real FFT is first applied to the input tensor and a ComplexToReal operation is performed by concatenating the real and imaginary parts. Convolutions are then applied in the frequency domain. Finally, a RealToComplex operation followed by an inverse real FFT transforms the features from the frequency domain back to the spatial domain, and the local and global branches are fused. For the up-sampling and down-sampling layers of the Inpainter and the architecture of the discriminator in adversarial training, we respectively follow the ResNet settings employed in He et al. (2016) and Suvorov et al. (2022). The training is performed on [image, randomly synthesized mask] pairs. We adopt the mask generation strategy of Suvorov et al. (2022), which contains multiple rectangles with arbitrary aspect ratios as well as wide polygonal chains.
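The half-spectrum property of the real FFT that FFC exploits is easy to verify with NumPy; the snippet below mirrors the real FFT → ComplexToReal → (frequency-domain convolution) → RealToComplex → inverse real FFT sequence at a purely illustrative level:

```python
import numpy as np

# A real-valued feature map (H x W), as fed to the FFC global branch.
x = np.random.rand(8, 8)

# Real FFT: for real inputs, only half of the spectrum
# (W // 2 + 1 frequency columns) is stored; the rest is redundant
# by conjugate symmetry.
spec = np.fft.rfft2(x)
assert spec.shape == (8, 8 // 2 + 1)

# "ComplexToReal": concatenate real and imaginary parts so that ordinary
# convolutions can operate on a real-valued tensor in the frequency domain.
spec_ri = np.concatenate([spec.real, spec.imag], axis=-1)

# (a frequency-domain convolution would be applied to spec_ri here,
#  followed by the inverse "RealToComplex" regrouping)

# Inverse real FFT returns a purely real tensor of the original size.
y = np.fft.irfft2(spec, s=x.shape)
assert np.allclose(y, x)
```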

Objective
Compared with naive supervised losses, which may result in blurry predictions, the perceptual loss (Johnson et al., 2016) evaluates the distance between feature maps of the inpainted image and the original image via a pre-trained network $\phi(\cdot)$. It does not require exact reconstruction and allows for variations in the reconstructed image. Given that inpainting focuses on understanding the global structure, we introduce a high receptive field perceptual loss $\mathcal{L}_{HRP}$ through a pre-trained ResNet50 $\phi_{HRF}(\cdot)$ with dilated convolutions
$$\mathcal{L}_{HRP} = \mathcal{M}\big([\phi_{HRF}(I) - \phi_{HRF}(\hat{I})]^2\big),$$
where $\mathcal{M}$ is a sequential two-stage mean operator, i.e., obtaining the inter-layer mean of intra-layer means. Additionally, an adversarial loss $\mathcal{L}_{adv}$ is utilized to encourage the inpainted image to be realistic. Specifically, we use a PatchGAN (Isola et al., 2017) discriminator $D_\xi(\cdot)$ and label patches that overlap with the mask as fake and the others as real. The non-saturating adversarial loss is defined as
$$\mathcal{L}_D = -\mathbb{E}_{I}\left[\log D_\xi(I)\right] - \mathbb{E}_{\hat{I}}\left[\log D_\xi(\hat{I})\right] \odot (1-m) - \mathbb{E}_{\hat{I}}\left[\log (1 - D_\xi(\hat{I}))\right] \odot m,$$
$$\mathcal{L}_G = -\mathbb{E}_{\hat{I}}\left[\log D_\xi(\hat{I})\right], \qquad \mathcal{L}_{adv} = \mathrm{sg}_{\theta}(\mathcal{L}_D) + \mathrm{sg}_{\xi}(\mathcal{L}_G),$$
where $\hat{I} = f_\theta(I')$ is the output of the inpainting network and $\mathrm{sg}_{var}$ represents stop gradient w.r.t. $var$. To further stabilize the training process, we use a gradient penalty $\mathcal{L}_{GP} = \mathbb{E}_I \|\nabla D_\xi(I)\|_2^2$ (Ross and Doshi-Velez, 2018) and a perceptual loss defined on the features of the discriminator, $\mathcal{L}_{DP}$ (Wang et al., 2018). The final objective of the Inpainter is
$$\mathcal{L}_{inpaint} = \mathcal{L}_{HRP} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{DP}\mathcal{L}_{DP} + \lambda_{GP}\mathcal{L}_{GP},$$
where $\lambda_{adv}$, $\lambda_{DP}$ and $\lambda_{GP}$ are hyper-parameters balancing the contributions of the different losses. $\mathcal{L}_{HRP}$ is responsible for supervised signals and global structure consistency, while $\mathcal{L}_{adv}$ and $\mathcal{L}_{DP}$ are responsible for local details and realism.
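The sequential two-stage mean operator $\mathcal{M}$ can be sketched as follows, with the feature lists standing in for the multi-layer outputs of the pre-trained dilated ResNet50 $\phi_{HRF}$ (a hypothetical minimal version of our own, ignoring batching and per-layer weighting):

```python
import numpy as np

def hrf_perceptual_loss(feats_real, feats_pred):
    """Perceptual loss with the sequential two-stage mean operator M:
    the squared feature difference is averaged within each layer first
    (intra-layer mean), then the per-layer means are averaged across
    layers (inter-layer mean)."""
    intra = [np.mean((fr - fp) ** 2) for fr, fp in zip(feats_real, feats_pred)]
    return float(np.mean(intra))
```

Note that $\mathcal{M}$ weights each layer equally regardless of its spatial size, which is exactly what the two-stage (rather than global) mean achieves.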

Training
Given that the training of the Inpainter does not require annotation, and that it learns a general ability to recover missing regions through contextual understanding, we initialize the model with parameters pre-trained on the Places-Challenge dataset (Zhou et al., 2017) and fine-tune it on the images accessible within each training set. The validation set of the Inpainter consists of both accessible training set images and validation set images (each paired with 10 predefined masks obtained using the same generation strategy employed in Suvorov et al. (2022)). Training is conducted with a batch size of 8 and an Adam optimizer with a learning rate of $10^{-3}$ for 50 epochs. Data augmentation consists of random flipping, rotation and color jittering. For each training image, we first apply the aforementioned augmentation strategy to generate 20 augmented images offline and then employ the same strategy for online augmentation during training. We empirically set $\lambda_{adv} = 3$, $\lambda_{DP} = 10$ and $\lambda_{GP} = 10^{-4}$. Once trained, the Inpainter is used to remove the foregrounds from the skeleton-labeled samples, taking the dilated noisy annotations as the masks. We then construct a background bank for each dataset by augmenting the extracted backgrounds through random horizontal and vertical flipping as well as rotation (spanning from $0^\circ$ to $90^\circ$).
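Building the background bank from an extracted background via flips and rotations might look like the following sketch (the angle grid and the helper name `augment_background` are our own illustrative choices; the paper samples rotations from $0^\circ$ to $90^\circ$):

```python
import numpy as np
from scipy.ndimage import rotate

def augment_background(bg, angles=(0, 30, 60, 90)):
    """Expand one extracted background into a small bank via horizontal /
    vertical flips combined with rotations in [0, 90] degrees (a fixed
    grid here for brevity; random sampling would work the same way)."""
    variants = []
    for base in (bg, np.fliplr(bg), np.flipud(bg)):
        for a in angles:
            # reshape=False keeps the original spatial size,
            # mode='reflect' fills rotation borders plausibly
            variants.append(rotate(base, a, reshape=False, mode='reflect'))
    return variants
```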

Patch-wise Contrastive Learning Based Synthesis
We now have a curve (foreground) bank $B_{curv}$ and a background bank $B_{bg}$, respectively produced by the Curve Generator and the Inpainter, for each given dataset. We construct an intermediate dataset $X_{inter} = \{x_1, \cdots, x_N\}$ by randomly sampling a curve $c_i$ from $B_{curv}$ and a background $b_i$ from $B_{bg}$, and then concatenating them to form a temporary sample $x_i = \mathrm{concat}(b_i, c_i)$. The problem now turns into an unpaired image-to-image translation task, i.e., designing a synthesizer to learn a mapping from $X_{inter}$ to the corresponding real dataset $Y$. It is desirable that the local context, especially the foreground, of the synthetic image $\hat{y}_i$ be spatially aligned with that of the corresponding intermediate image $x_i$ (especially $c_i$) as much as possible.
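Assembling $X_{inter}$ is then a matter of random pairing and channel-wise concatenation, e.g. (the helper name `build_intermediate` and the array shapes are our own illustrative assumptions):

```python
import numpy as np

def build_intermediate(curve_bank, bg_bank, n, seed=0):
    """Form x_i = concat(b_i, c_i): a randomly drawn background
    (H x W x C) stacked with a randomly drawn binary curve mask
    (H x W) along the channel axis."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n):
        c = curve_bank[rng.integers(len(curve_bank))]
        b = bg_bank[rng.integers(len(bg_bank))]
        samples.append(np.concatenate([b, c[..., None]], axis=-1))
    return samples
```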
Previously, for unpaired image translation, most existing methods applied GANs with a cycle structure, relying on cycle-consistency to ensure high-level correspondence (Zhu et al., 2017). While effective, the underlying bijective assumption behind cycle-consistency is sometimes too restrictive, which may reduce the diversity of the generated samples. More importantly, cycle-consistency is not suitable for our task, since it does not guarantee any explicit or implicit spatial constraint. In this context, we introduce a multilayer patch-wise contrastive learning based synthesizer to learn a mapping from $X_{inter}$ to $Y$, inspired by Chen et al. (2020) and Park et al. (2020), as illustrated in the middle panel of Fig. 2 (a). It is trained in a generative adversarial manner with an internal contrastive learning pretext task.
The generator (i.e., synthesizer) $G$ is a U-shape network, which first down-samples the input image into high-level features via an encoder $E$ with three residual blocks equipped with instance normalization and ReLU activation. As such, each pixel in the high-level feature map represents the embedding feature vector of a patch in the original image. Several layers of interest $E_{l \in L}(x)$ in $E$ are selected to extract multi-scale features of patches, and each passes through a two-layer multilayer perceptron (MLP) $H_l$ ($l$ indexes a layer), yielding a feature stack $\{v_{l \in L} = H_{l \in L}[E_{l \in L}(x)]\}$. Given patch-wise features $v_l$ and the corresponding pair $\{H_l(E_l(x))_{s_1}, H_l(E_l(G(x)))_{s_2}\}$, with $s_1$ and $s_2$ denoting the spatial locations of the patches of interest, we let $v^+$ represent a patch at the same location as $v$ and $v_n^-$ denote the $n$-th among $N$ patches at different locations. The objective of the contrastive learning task is to maintain the local information at the same spatial location. Similar to the noise contrastive estimation loss (Oord et al., 2018), our objective function can be written as
$$\mathcal{L}_c = \mathbb{E}\left[-\log \frac{\exp(v \cdot v^+ / \tau)}{\exp(v \cdot v^+ / \tau) + \sum_{n=1}^{N}\exp(v \cdot v_n^- / \tau)}\right],$$
where $\tau$ is a temperature hyper-parameter. Besides, we employ the identity loss, first proposed in Zhu et al. (2017), to regularize the generator $G$. We pass each real sample $y \in Y$ through the encoder $E$ and obtain the patch-wise features $v^*$, the positive samples $v^{*+}$ and the negative samples $v_n^{*-}$. The identity loss is formulated as
$$\mathcal{L}_{id} = \mathbb{E}\left[-\log \frac{\exp(v^* \cdot v^{*+} / \tau)}{\exp(v^* \cdot v^{*+} / \tau) + \sum_{n=1}^{N}\exp(v^* \cdot v_n^{*-} / \tau)}\right].$$
We use the LSGAN loss (Mao et al., 2017) as our adversarial loss $\mathcal{L}_{adv}$ to make the synthetic images as realistic as possible. Therefore, with trade-off parameters $\lambda_{adv}$, $\lambda_c$ and $\lambda_{id}$, the overall loss of the synthesizer is defined as
$$\mathcal{L}_{syn} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_c\mathcal{L}_c + \lambda_{id}\mathcal{L}_{id}.$$
The synthesizer is trained with an Adam optimizer (learning rate $10^{-4}$, cosine decay) and a batch size of 1. We utilize the images accessible within each training set as the corresponding real dataset $Y$ for training the synthesizer. The hyperparameters and model weights are selected based on the Fréchet Inception Distance (FID) between the synthesized images and $Y$ (Heusel et al., 2017). We set $\lambda_{adv} = 1$, $\lambda_c = 1$, $\lambda_{id} = 0.5$ and $\tau = 0.07$. The training lasts for 300 epochs.
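For a single query patch, the contrastive objective above reduces to a softmax cross-entropy over one positive and $N$ negatives; a minimal NumPy sketch of our own (dot-product similarity, no batching or multi-layer aggregation) could read:

```python
import numpy as np

def patch_nce_loss(v, v_pos, v_negs, tau=0.07):
    """InfoNCE over patch embeddings: v and v_pos are D-dim features of
    the same spatial location in the input and output images; v_negs
    (N x D) come from other locations of the same image."""
    # Logits: similarity to the positive first, then to the N negatives.
    logits = np.concatenate([[v @ v_pos], v_negs @ v]) / tau
    logits -= logits.max()  # numerical stability before exponentiation
    p = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy with the positive as the "correct class".
    return float(-np.log(p[0]))
```

When all similarities are equal the loss is $\log(N+1)$; it approaches 0 as the positive dominates the negatives.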

Two-stage Coarse-to-Fine Segmentation
A synthetic dataset $D_{syn} = \{(\hat{y}_1, c_1), \cdots, (\hat{y}_N, c_N)\}$, with $\hat{y}_i$ being a synthetic image and $c_i$ being the corresponding curve ground truth, is created by the Synthesizer. The weakly-supervised task is then transformed into a fully- or semi-supervised one, making use of either solely the synthetic dataset or a combination of an unlabeled dataset $D_{ori}$ and the synthetic dataset $D_{syn}$. In this section, we introduce a two-stage coarse-to-fine segmentation pipeline to tackle the task.
A specific segmentation network is first trained on $D_{syn}$ to obtain a coarse model $S_{coarse}$ with a segmentation loss $\mathcal{L}_{seg}$
$$\mathcal{L}_{seg} = 0.5 \times \mathcal{L}_{ce} + 0.5 \times \mathcal{L}_{dice}, \quad (10)$$
where $\mathcal{L}_{ce}$ and $\mathcal{L}_{dice}$ respectively denote the cross-entropy loss and the Dice loss. We observe that the performance of $S_{coarse}$ is mainly limited by two issues: one is that the curves generated by the Curve Generator still exhibit a certain morphological gap relative to the foregrounds of the real images, and the other is that there is also a slight but inevitable intensity gap between the Synthesizer-generated images and the real images. We target the latter issue by making use of $D_{ori}$ to further boost the segmentation performance. We employ the predictions on $D_{ori}$ from $S_{coarse}$ as pseudo-labels, and train a fine model $S_{fine}$ on the combined dataset of $D_{ori}$ and $D_{syn}$ through random batch sampling. The final loss function, denoted as $\mathcal{L}_{final}$, is formulated as
$$\mathcal{L}_{final} = \mathcal{L}_{seg} + \lambda_{psd}\mathcal{L}_{psd},$$
where $\mathcal{L}_{psd}$ denotes the loss on $D_{ori}$, sharing the same calculation as $\mathcal{L}_{seg}$ in Eq. (10), namely $0.5 \times \mathcal{L}_{ce} + 0.5 \times \mathcal{L}_{dice}$, and $\lambda_{psd}$ is a trade-off parameter. Please note that each of the two losses is calculated only for the corresponding data samples, i.e., $\mathcal{L}_{seg}$ for $D_{syn}$ and $\mathcal{L}_{psd}$ for $D_{ori}$. We employ the vanilla U-Net with feature channels of 16, 32, 64, 128 and 256 as the architecture of both $S_{coarse}$ and $S_{fine}$. We use an SGD optimizer (weight decay $= 10^{-4}$, momentum $= 0.9$) to train both $S_{coarse}$ and $S_{fine}$ with a batch size of 12 and an initial learning rate of $10^{-2}$. The total number of iterations and $\lambda_{psd}$ are respectively set to 30k and 1.
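The combined objective of Eq. (10) can be sketched for the binary case as follows (a simplified soft-Dice formulation with an `eps` smoothing term; the exact smoothing used in the paper is not specified, so these details are our own assumptions):

```python
import numpy as np

def seg_loss(prob, target, eps=1e-6):
    """L_seg = 0.5 * L_ce + 0.5 * L_dice for a binary foreground
    probability map `prob` and a binary ground truth `target`."""
    prob = np.clip(prob, eps, 1 - eps)
    # Pixel-wise binary cross-entropy.
    ce = -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    # Soft Dice: 1 - 2 * intersection / (|prob| + |target|), smoothed.
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return 0.5 * ce + 0.5 * dice
```

The same function would serve for $\mathcal{L}_{psd}$, applied to pseudo-labels on $D_{ori}$.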

Experiments
In this section, we extensively evaluate the effectiveness of our YoloCurvSeg framework on four representative curvilinear structure segmentation datasets.

Datasets and Preprocessing
We comprehensively evaluate YoloCurvSeg on four ophthalmic datasets: OCTA500, CORN, DRIVE and CHASEDB1. OCTA500 is used for retinal microvascular segmentation, and only the subset containing 300 samples with a $6 \times 6$ mm$^2$ field of view (FOV) and a $400 \times 400$ resolution is utilized. We only make use of the en-face images generated by maximum projection between the internal limiting membrane layer and the outer plexiform layer. CORN consists of 1578 CCM images for nerve fiber segmentation. It also provides two subsets respectively consisting of 340 low-quality and 288 high-quality images. All CCM images have a resolution of $384 \times 384$ and an FOV of $400 \times 400$ µm$^2$. Instead of following the dataset's original division, we use 1532 images (samples overlapping with the test set are removed and the validation split ratio is 0.2) for training and validation, and test on 60 relatively accurately labeled samples provided in its subset. DRIVE and CHASEDB1 are used for retinal vessel segmentation and respectively have resolutions of $565 \times 584$ and $999 \times 960$. These two fundus datasets are cropped via the provided FOV masks and are respectively resized to $576 \times 576$ and $960 \times 960$. For DRIVE, we utilize the original division of 20 training samples and 20 testing samples. For CHASEDB1, we follow the division in

Implementation Details
We implement YoloCurvSeg and all compared methods in PyTorch on a workstation equipped with 8 RTX 3090Ti GPUs. In the Synthesizer, the indices of the layers selected to calculate L c are {0, 4, 8, 12, 16}. For training the Segmenters S coarse and S fine, the polynomial policy with power = 0.9 is used to adjust the learning rate online (Mishra and Sarawadekar, 2019). Other hyperparameters, training details and model architectures are provided in previous sections. It is worth noting that manually-delineated vessel segmentation labels are provided for OCTA500, DRIVE and CHASEDB1. To generate noisy skeleton annotations for those three datasets, we perform the skeletonize operation in scikit-image (Van der Walt et al., 2014) to obtain the skeletons of the original ground truth masks and then employ elastic transformation to simulate jitter noise that may be introduced during fast manual labeling. For CORN, only noisy skeleton labels are provided, and thus they are directly used in all our experiments. For this dataset, we dilate each skeleton to a 3-pixel width to serve as the full mask in its fully-supervised learning setting, and the same operation is also applied to the testing set annotations. For the sparse labels used in other comparative WSL methods, skeletons of the backgrounds are also generated via skeletonization. The synthesis process in YoloCurvSeg can be online or offline. For better reproducibility and fair comparison, we use the offline version in our experiments, i.e., we first generate the synthetic dataset and then train the Segmenters. By randomly combining samples from the pre-generated curve bank and the augmented background bank, we generate 1276, 5005, 1240 and 1604 synthetic samples respectively for OCTA500, CORN, DRIVE and CHASEDB1 if all training samples are labeled. If only one sample is labeled, we respectively generate 100, 100, 60 and 80 synthetic samples. The segmentation models in all compared methods and YoloCurvSeg are trained for 30000 iterations.
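The noisy-skeleton simulation can be approximated as follows. This is a simplified nearest-neighbour warp driven by a piecewise-constant random displacement field, standing in for the elastic transformation described above; `grid` and `max_shift` are illustrative parameters of this sketch, not the paper's settings, and a real pipeline would more likely use the elastic transforms in scikit-image or albumentations.

```python
import numpy as np

def jitter_skeleton(skel, max_shift=3, grid=32, seed=0):
    # Approximate the elastic-transformation jitter used to simulate fast,
    # imprecise manual tracing: sample pixels of the binary skeleton through
    # a smooth (blockwise-constant) random displacement field.
    rng = np.random.default_rng(seed)
    h, w = skel.shape
    coarse = (h // grid + 1, w // grid + 1)
    # Coarse random integer displacements, upsampled by repetition.
    dy = np.repeat(np.repeat(rng.integers(-max_shift, max_shift + 1, coarse),
                             grid, 0), grid, 1)[:h, :w]
    dx = np.repeat(np.repeat(rng.integers(-max_shift, max_shift + 1, coarse),
                             grid, 0), grid, 1)[:h, :w]
    yy, xx = np.mgrid[:h, :w]
    ys = np.clip(yy + dy, 0, h - 1)
    xs = np.clip(xx + dx, 0, w - 1)
    return skel[ys, xs]
```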

Synthesis Performance
Before comparing with SOTA WSL methods, we first qualitatively and quantitatively evaluate the synthesis performance of YoloCurvSeg. We visualize representative examples, in terms of the noisy skeleton labels, dilated masks for inpainting, extracted backgrounds, generated curves and synthesized images, in Fig. 4. It can be observed from the last column that the generated curves match the synthetic images well. We also compare the intensity distributions of the synthetic datasets with the real ones in Fig. 5, exhibiting high intensity similarities between the synthetic and real images in terms of both background and foreground. From the t-SNE (Van der Maaten and Hinton, 2008) visualization in Fig. 6, the synthetic datasets are generally in line and well mixed with the real ones. In most cases, the synthetic data are even more uniformly and widely distributed, having a similar effect as data augmentation. The FIDs between the synthetic components and the corresponding real ones are reported in Table 2.

Comparison with SOTA
Since noisy skeletons can be considered as sparse annotations with a certain degree of noise, or simply as noisy labels, we compare YoloCurvSeg with two categories of methods: (1) WSL methods and (2) noisy label learning (NLL) methods. The Dice similarity coefficient (DSC [%]) and the average symmetric surface distance (ASSD [pixel]) are used as the evaluation metrics. Sensitivity (SE) and specificity (SP) are also employed for more comprehensive evaluations of the differences among the various methods. All benchmarked WSL methods, NLL methods, and fully-supervised (FS) methods utilize the same segmentation network architecture as that adopted in S coarse and S fine (i.e., vanilla U-Net) for fair comparison. Selecting the vanilla U-Net architecture is further motivated by its versatility and widespread application in medical image segmentation tasks (Ronneberger et al., 2015; Isensee et al., 2021; Antonelli et al., 2022).
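The two primary metrics can be sketched compactly in NumPy/SciPy; this is a simplified reimplementation for illustration, not the evaluation code used in the paper.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred, gt):
    # Dice similarity coefficient in percent, on binary masks.
    inter = np.logical_and(pred, gt).sum()
    return 100.0 * 2.0 * inter / (pred.sum() + gt.sum())

def surface(mask):
    # Boundary pixels of a binary mask (mask minus its erosion).
    return mask & ~binary_erosion(mask)

def assd(pred, gt):
    # Average symmetric surface distance in pixels: mean distance from each
    # surface point of one mask to the nearest surface point of the other.
    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    dp = distance_transform_edt(~sg)[sp]   # pred surface -> gt surface
    dg = distance_transform_edt(~sp)[sg]   # gt surface -> pred surface
    return (dp.sum() + dg.sum()) / (len(dp) + len(dg))
```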

Comparison with WSL methods
We compare YoloCurvSeg with 11 scribble-supervised segmentation methods employing the same skeleton set that we generate: pCE (partial cross-entropy loss, baseline); random walker pseudo labeling (RW); uncertainty-aware self-ensembling and transformation-consistent model (USTM); Scribble2Label (S2L); Mumford-Shah Loss (MLoss); entropy minimization (EM); dense CRF loss; gated CRF loss; active contour loss (AC); dual-branch network with dynamically mixed pseudo label supervision (DBDM); and tree energy loss. The results are shown in Table 3 and Table 4. For fair comparison, YoloCurvSeg does not go through the fine stage and is denoted as Ours (coarse) (i.e., the performance of S coarse). The upper part of each table indicates that all training data are sparsely labeled and all training set images are utilized to train the Segmenter in the compared methods (for our method, they are used to train the Inpainter, Synthesizer and Segmenter). In the lower part, "One" indicates that only one sample is labeled and all other data are unlabeled and not utilized. Please be aware that in the one-shot setting, the Inpainter and the Synthesizer are trained using only the single image of the sparsely annotated sample, and this remains consistent throughout all experimental procedures outlined in the following sections. Additionally, all segmentation networks are randomly initialized without using any pre-trained parameters.
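For reference, the pCE baseline supervises only the annotated pixels. A minimal NumPy sketch follows, assuming unlabeled pixels are marked with 255 (an illustrative convention of this sketch, not necessarily the compared implementations'):

```python
import numpy as np

def partial_ce(p, scribble, eps=1e-6):
    # Partial cross-entropy: binary cross-entropy averaged only over
    # annotated pixels; pixels marked 255 are ignored entirely.
    p = np.clip(p, eps, 1.0 - eps)
    labeled = scribble != 255
    y = scribble[labeled].astype(float)
    q = p[labeled]
    return float(-(y * np.log(q) + (1 - y) * np.log(1 - q)).mean())
```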
As shown in Table 3, pCE achieves relatively low segmentation performance in most cases, as it only supervises sparsely annotated regions. RW is clearly not suitable for thin and elongated curvilinear structures, as its arbitrary expansion introduces a significant amount of noise into the pseudo-labels, resulting in performance lower than the baseline. Most compared methods attempt to generate and refine pseudo-labels through introducing various forms of CRF loss (Dense CRF and Gated CRF), combining well-designed network architectures with consistency learning (e.g., USTM and DBDM), or employing more advanced forms of loss (e.g., MLoss, AC and Tree Energy). Although effective, these methods still show a significant performance gap relative to fully-supervised performance, even when all samples are annotated, let alone when only a single sample is annotated. Among all compared methods, Tree Energy achieves the second-best performance in most cases, but it still has a significant gap compared to YoloCurvSeg and is highly affected by noise in the sparse annotations. YoloCurvSeg achieves the best performance on all datasets under both settings, outperforming the other WSL methods by large margins. Comparing "All" versus "One", YoloCurvSeg is apparently not sensitive to the sample size of the labeled data, achieving 96.1%, 106.3%, 95.3% and 96.2% of the fully-supervised performance in terms of DSC on the four datasets with only 0.14%, 0.03%, 1.40% and 0.65% of the pixels labeled. It is worth noting that, even when only a single sample is annotated, YoloCurvSeg (the second-to-last row in the bottom half of each table) still achieves superior performance compared to all compared methods with all samples annotated (top half of each table). Representative visualization results are shown in Fig. 7.

Comparison with NLL methods
In Table 5, we also compare YoloCurvSeg with several NLL methods, including generalized cross-entropy loss (GCE), co-teaching (COT), TriNet, confident learning with spatial label smoothing (CLSLS) and divergence-aware selective training (DAST), on OCTA500 and DRIVE. Most of these methods allow training under conditions of all-noisy samples as well as mixed noisy (S in Table 5) and fully-supervised (M in Table 5) samples. It can be seen that with the inclusion of fully-supervised samples in training, these methods generally achieve a certain degree of performance improvement. However, despite utilizing one full mask and multiple or even all skeleton samples, these methods are still inferior to YoloCurvSeg, which employs only one skeleton sample. We also find that all NLL methods perform worse than the fully-supervised model (FS in Table 5) trained solely with the same single fully labeled sample, illustrating that additional noisily labeled samples are not beneficial to model performance under such noisy conditions.

Robustness Analysis and Ablation Study
To verify the robustness of YoloCurvSeg with respect to the selected one-shot sparsely labeled sample, we randomly select 10 samples from each dataset and compare the performance with that of the fully-supervised model trained on the same sample. As demonstrated in Fig. 8, YoloCurvSeg exceeds full supervision in almost all cases and delivers highly stable performance decoupled from image/annotation quality, which nevertheless induces great fluctuations in the performance of the fully-supervised models. In addition to robustness, the predictions from YoloCurvSeg also have smaller variances. Both aspects indicate that YoloCurvSeg is sample-insensitive and can reduce the risk of selecting a wrong sample to label.
To investigate the impact of the noisy skeleton's completeness on the segmentation performance, we conduct partial erasure analysis experiments on the skeletons. Due to the low contrast between small vessels and the background in fundus images, missing annotations are highly likely on such images. Therefore, we select two samples from the DRIVE dataset and erase the noisy skeleton labels of some small vessels, as illustrated in Fig. 11. Specifically, we respectively erase 12.55% and 9.66% of the annotated regions on samples No. 25 and No. 38. From the figure, we can clearly observe the erased areas on the noisy skeletons and their impact on the extracted background images and the synthesized images. The segmentation model's performance metrics (DSC, ASSD) on the two samples with complete noisy skeleton annotations are respectively (77.99, 1.71) and (77.74, 1.59). After erasing some noisy skeletons and synthesizing the corresponding new training sets, the performance metrics become (78.06, 1.83) and (78.11, 1.40), exhibiting only small fluctuations before and after erasure and thus demonstrating the proposed method's robustness. These slight fluctuations may be attributed to the significant dilation operation applied to the noisy skeleton, which may result in dilated masks covering the small vessels missed during annotation. Additionally, randomly generated foreground curves may also cover some small vessels, further reducing the labeling noise caused by missed annotations. On the other hand, as shown in Fig. 8, YoloCurvSeg demonstrates stable one-shot performance on all four datasets, despite the inevitable presence of varying degrees of label omission in those samples. This is particularly evident on the CORN dataset (where there are already some missing annotations in the dataset itself), indicating that our approach is robust even in the presence of some degree of missing annotations.
As for the ablation study, we first remove the background bank extracted by the Inpainter and perform direct curve-to-image translation. As shown in panel (a) of Fig. 9, the synthetic images present unrealistic background textures due to the large gap between the pre- and post-translation distributions (which is also shown in column B of Table 2). For high-resolution image datasets, the foregrounds of the synthetic images are distorted and fail to spatially align with the corresponding curve masks, which also occurs when we remove L c of the Synthesizer (and use CycleGAN (Zhu et al., 2017) as a substitution), as shown in Fig. 9 (b). This indicates that the contrastive Synthesizer (especially L c) is crucial for maintaining the corresponding local context at the same spatial location.
To more comprehensively demonstrate the significance of the Inpainter, we endeavor to generate background images through alternative unsupervised methods for subsequent image synthesis. We investigate Gaussian blurring, low-pass filtering, and median filtering. Our objective is to selectively remove foreground vessels while preserving the maximum amount of background detail, by carefully adjusting the parameters. For Gaussian blurring, we utilize a kernel size of 9 × 9 and perform 25 iterations on samples from both the OCTA500 and DRIVE datasets. For low-pass filtering, we generate two-dimensional Gaussian masks with a standard deviation of 0.05 based on the image dimensions. These masks are then applied to suppress the high-frequency components. As for median filtering, we respectively apply kernel sizes of 29 × 29 and 31 × 31 to the OCTA500 and DRIVE datasets. Visualization results indicate that median filtering is relatively more effective in eliminating the foreground, albeit at the cost of sacrificing details and introducing blurriness, as illustrated in Fig. 10. As depicted in the last column of that figure, the image synthesis results with median-filtered backgrounds are superior to direct translation results from curves to real images. However, since the provided backgrounds are relatively blurry, there are still considerable amounts of artifacts. We then select ten samples from both OCTA500 and DRIVE for one-shot coarse-stage performance comparisons, the results of which are presented in Table 6. Apparently, the overall performance on the ten randomly selected samples from the OCTA500 and DRIVE datasets indicates that utilizing the background extracted by the Inpainter induces overall performance gains of approximately 2% and 4% over the median-filtered background for one-shot (access to only one sample) segmentation. Utilizing a noisy skeleton annotation to improve the segmentation performance is cost-effective, as evidenced by the measured annotation time for one such annotation described later.
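The median-filtering baseline above reduces to a few lines; the sketch below uses SciPy as an assumed stand-in, with the kernel size following the text for OCTA500.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_background(image, kernel=29):
    # Unsupervised background extraction baseline: a large median filter
    # suppresses thin, bright curvilinear foreground (e.g. vessels) while
    # roughly preserving the smooth background intensity.
    return median_filter(image, size=kernel)
```

The thinner the foreground structures relative to the kernel, the more completely they vanish; the cost, as noted above, is a blurred background.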
It is worth pointing out that the results reported in the previous sections represent the performance of S coarse. We also explore the performance of S fine with and without utilizing D syn. Please note that "without using D syn" means that only D ori (with predictions from S coarse as the pseudo-labels) is used for training S fine. To comprehensively evaluate the topological connectivity and small vessel segmentation performance, the clDice (Shit et al., 2021) metric is also computed. It can be observed from Fig. 12 that the DSC and ASSD metrics of S fine obtained with D syn in the training data are slightly better than those of S fine obtained without using D syn, which may be attributed to the fact that the synthetic curves have high degrees of continuity and can reduce the model's outlier predictions. However, the clDice metric is slightly lower, possibly because synthetic data inevitably exhibit a certain intensity gap compared to real data, especially in small vessel regions. Additionally, we conduct a performance comparison of the fully-supervised model with and without pretraining on D syn. Results show that the synthetic images from YoloCurvSeg also have great potential to serve as pretraining images; pretraining on D syn, followed by fine-tuning on fully supervised datasets, can further enhance the performance of the fully-supervised model. Specifically, it increases the DSC of the vanilla U-Net model by 0.55, 0.63 and 0.81, and decreases the ASSD by 0.061, 0.041 and 0.148, respectively on OCTA500, DRIVE and CHASEDB1. Ultimately, via further utilizing the additional unlabeled dataset D ori, YoloCurvSeg (S fine) achieves 97.00%, 110.01%, 97.49% and 97.63% of the fully-supervised performance (with full masks of all available samples) with only one noisy skeleton annotation on the four datasets.
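The clDice metric combines a topology precision and a topology sensitivity computed from mask/skeleton pairs. A minimal sketch, assuming the masks and their skeletons (e.g. from skimage's skeletonize) are already available:

```python
import numpy as np

def cl_dice(v_pred, s_pred, v_gt, s_gt, eps=1e-6):
    # clDice (Shit et al., 2021): harmonic mean of topology precision
    # (fraction of the predicted skeleton inside the ground-truth mask)
    # and topology sensitivity (fraction of the ground-truth skeleton
    # inside the predicted mask). v_* are masks, s_* their skeletons.
    tprec = (np.logical_and(s_pred, v_gt).sum() + eps) / (s_pred.sum() + eps)
    tsens = (np.logical_and(s_gt, v_pred).sum() + eps) / (s_gt.sum() + eps)
    return 2.0 * tprec * tsens / (tprec + tsens)
```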
To better illustrate the time-saving benefits of our method in real clinical scenarios, we randomly select samples from the four datasets (30 samples each for OCTA500 and CORN, and 10 samples each for DRIVE and CHASEDB1) and invite two ophthalmologists to annotate them in both the noisy skeleton and the full mask formats. We find that annotating a noisy skeleton-style label for retinal vessels in a 6 mm × 6 mm OCTA image takes approximately 4.5 minutes, while annotating a full mask takes around 48 minutes due to the need for careful examination and modification of edges and details. Similarly, annotating the corneal nerve fibers in a CCM image, and the retinal vessels in DRIVE and CHASEDB1 style retinal fundus images, respectively takes about 1 minute, 3.5 minutes and 3 minutes (noisy skeleton) and 14 minutes, 62 minutes and 55 minutes (full mask) for a single sample. We plot the segmentation performance against the annotation time consumption of all evaluated methods, including all WSL methods, NLL methods and YoloCurvSeg, under both the single-sample and all-sample conditions in Fig. 13. Our method achieves the highest segmentation performance (≥ 97% of FS) with the lowest annotation time cost (< 0.3% of FS) across all four tasks.

Comparison with SOTA FS methods and Discussion
In previous experiments and analyses, we utilize the vanilla U-Net as the segmentation network architecture in each compared method, for fair comparison. To further demonstrate the practicality and scalability of YoloCurvSeg, we conduct both quantitative and qualitative comparisons and analyses incorporating more advanced segmentation networks and frameworks, especially those designed for curvilinear structure segmentation, and discuss some potential future directions for improvement. Concretely, we explore the fully-supervised performance of two more advanced CNN U-Net variants, namely EfficientUNet (Tan and Le, 2019) and CS2-Net (Mou et al., 2021), as well as four Vision Transformer based segmentation networks, namely SwinUNet (Cao et al., 2023), TransUNet (Chen et al., 2021), UTNet (Gao et al., 2021) and MedFormer (Gao et al., 2022), on the OCTA500, CORN, and DRIVE datasets. All networks are trained with the same hyperparameters, including initial learning rate, learning rate policy, optimizer, batch size, etc., following the specifications of our vanilla U-Net Segmenters, as described in previous sections, to ensure a fair comparison. The quantitative results are listed in Table 7. It can be observed that, plausibly due to maintaining a superior training paradigm in all comparisons, such as the optimizer and the learning rate schedule strategies, the vanilla U-Net does not exhibit significant gaps compared to the best methods on all three datasets. It even surpasses more advanced networks in some cases, which is consistent with findings reported in the nnU-Net paper (Isensee et al., 2021). Among all comparisons, CS2-Net and TransUNet respectively achieve the best and the second-best overall performance. We thus explore replacing the two-stage Segmenters in YoloCurvSeg with those two better-performing networks. The experimental results show that using more advanced architectures can further improve the segmentation performance.

Table 7. Comparative performance of various segmentation networks under the fully-supervised setting utilizing all samples, as evaluated on the OCTA500, CORN, and DRIVE datasets, alongside the fine stage's efficacy of our proposed method under the one-shot setting with different segmentation network architectures. * indicates the model is initialized with pre-trained weights on ImageNet. SwinUNet is not evaluated on DRIVE due to the lack of appropriate high-resolution pre-trained parameters. EfficientUNet employs the ImageNet pre-trained EfficientNet-B3 as its encoder.
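The learning-rate schedule shared by all these trainings, the polynomial policy with power = 0.9, reduces to a one-liner; a sketch for reference:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # Polynomial learning-rate decay used for training the Segmenters:
    # lr = base_lr * (1 - iter / max_iter) ** power.
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```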

To elucidate the gap between YoloCurvSeg and SOTA FS methods, we conduct qualitative comparisons on some examples in Fig. 14. It can be observed that for structures with low image contrast or small peripheral vessels/nerves, such as the areas outlined by the red circles, our method shows a certain gap in accuracy and structural coherence compared to the well-performing FS methods. This is mainly due to the morphological gap and the intensity gap between synthetic images and real images. Potential solutions include fine-tuning the hyperparameters of the first component of YoloCurvSeg to generate curves that better match real shapes. In addition, introducing new paradigms in image translation/synthesis, such as diffusion models (Ho et al., 2020; Cheng et al., 2023), may further enhance the realism of the synthesized images. Improving the network structure by introducing various attention mechanisms, especially self-attention, or defining objective functions that preserve topology (Cheng et al., 2021a; Shit et al., 2021), may further enhance the performance of our framework. The former has already been validated in the experiments in Table 7. Another future direction worth exploring is to employ noisy label learning methods (Zhang and Sabuncu, 2018; Yang et al., 2022) to train the fine stage's Segmenter, since the generated pseudo-labels are inevitably noisy.
Through extensive experiments, we have demonstrated that YoloCurvSeg can be applied to the two most common 2D curvilinear structure segmentation tasks across three different modalities, the two structures being nerve fibers and retinal vessels, with good generalization. To further demonstrate the scalability of the proposed pipeline, we conduct additional synthesis and segmentation validation analyses on an X-ray coronary angiography dataset, namely DCA1 (Cervantes-Sanchez et al., 2019). The first 100 samples from the dataset are used as the training and validation set, with the remaining 34 samples serving as the test set. All images are resized to 320 × 320. We arbitrarily select the sample to be annotated, and the quantitative results are presented in Table 8. Compared to the fully-supervised setting using all samples and full masks, YoloCurvSeg still achieves exceptionally high one-shot performance, and this is accomplished without precisely tuning the Curve Generator of YoloCurvSeg. Three samples from the dataset and the corresponding synthesized images from YoloCurvSeg are shown in Fig. 15.
Other similar and potentially transferable scenarios include cell membrane, crack, road (in aerial images) and leaf vein segmentation, etc. That being said, we acknowledge the challenge of applying and transferring YoloCurvSeg to curvilinear structure segmentation tasks in 3D scenarios, such as brain vessel segmentation and cardiac vessel segmentation, as demonstrated in the works of Vessel-CAPTCHA (Dang et al., 2022) and Examinee-Examiner Network (Qi et al., 2021). This is a major direction for future exploration. The current challenges mainly lie in migrating YoloCurvSeg's second and third components, namely the Inpainter and the multilayer patch-wise Synthesizer, to 3D scenarios, which requires careful modification and design of the network's input format and size to balance performance and computational cost. Such explorations to some extent go beyond the scope of this work.
One concern regarding the YoloCurvSeg pipeline is whether the performance advantage arises from the overall framework's parameters, and whether this introduces unfairness in comparison to other methods. In Table 9, we provide the total parameters of the training and testing models for the YoloCurvSeg pipeline and the compared methods. We acknowledge that there may exist a certain degree of unfairness in our comparison. However, it is important to emphasize that this parameter comparison is, to some extent, merely illustrative, as certain methods, despite not augmenting the model's trainable parameters, necessitate additional non-trainable parameters and data processing, such as tree filters and minimum spanning tree calculations in Tree Energy Loss, and energy field computations in Gated CRF. Additionally, the first three components of YoloCurvSeg primarily contribute to the image synthesis process and are largely decoupled from the segmentation task. They can be considered preprocessing steps that do not directly participate in the training of the Segmenter. This distinguishes YoloCurvSeg from methods that incorporate additional trainable or non-trainable parameters into the segmentation process itself. We ensure fairness in the segmentation model and its training paradigm as much as possible. Furthermore, during testing or prediction, YoloCurvSeg only requires the Segmenter, which aligns with the majority of the compared methods.

Conclusion
This paper presents a novel sparsely annotated segmentation framework for curvilinear structures, named YoloCurvSeg. YoloCurvSeg is an image synthesis based pipeline comprising a Curve Generator, an Inpainter, a Synthesizer and a two-stage Segmenter. Extensive experiments are conducted on four publicly accessible datasets, with the superiority of our proposed framework being successfully established. Potential future directions include transferring YoloCurvSeg to 3D scenarios and exploring a better pipeline to further reduce the domain gap between synthetic and real images.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1.
Fig. 1. YoloCurvSeg achieves more than 97% of the fully-supervised performance on each of four representative datasets utilizing only one noisy skeleton annotation, which means physicians can largely save labeling time and still obtain satisfactory segmentation results.

Fig. 2.
Fig. 2. Top: Overview of our proposed YoloCurvSeg, which comprises four main components: a space colonization algorithm-based curve generator, a background inpainter, a multilayer patch-wise contrastive foreground-background fusion based synthesizer, and a two-stage coarse-to-fine segmenter. Bottom: Details of the Curve Generator and the curve generation process for the four datasets utilized.

Table 1.
Fig. 3.
Fig. 3. The architecture of the Inpainter. The input is a four-channel image, with the first three channels being the original image and the last channel being the binary mask of the inpainting regions. The output is the inpainted image. The dimensional change of the feature map in FFC is shown in the lower right panel.

Fig. 4.
Fig. 4. Visualization of synthetic data from YoloCurvSeg. From left to right are examples of the noisy skeleton label, the dilated inpainting mask, the extracted background, the generated foreground, the synthesized image and the generated foreground superimposed on the synthesized image.

Fig. 5.
Fig. 5. Histograms of the four datasets in terms of the real data (top) and the corresponding synthetic data (bottom).
Fig. 6. t-SNE visualization of the four real and synthetic datasets. CORN good and CORN poor respectively denote the high-quality and low-quality subsets of CORN.

Fig. 7.
Fig. 7. Qualitative visualization of representative results from our S coarse and other SOTA WSL methods under the one-shot setting.

Fig. 9.
Fig. 9. Visualization of representative synthetic results from the ablation study. Arrows mark unrealistic background regions, structures, or misalignments between the masks and the regions of interest.

Fig. 10. Fig. 11.
Fig. 10. Examples of background images acquired through distinct unsupervised techniques, such as Gaussian blurring, low-pass filtering, and median filtering, along with instances of images synthesized using the median-filtered images and our Inpainter-extracted images as backgrounds. Zoom in for details.

Fig. 12.
Fig. 12. Performance of YoloCurvSeg under different training paradigms. FS denotes fully-supervised learning. Standard deviations are given within the parentheses.

Fig. 13.
Fig. 13. Segmentation accuracy (DSC) vs. annotation time for all benchmarked WSL and NLL methods, as well as the one- and all-shot fully-supervised (FS) settings. The number and type of labels used are indicated in the parentheses, with M and S respectively representing full mask and noisy skeleton.

Fig. 14.
Fig. 14. Qualitative visualization of representative results from SOTA FS (all samples utilized) methods and our S fine (with the CS2-Net architecture) under the one-shot setting. Red circles indicate areas of interest worth noting. Zoom in for details.

Fig. 15.
Fig. 15. Visualization of synthetic data from the DCA1 dataset. From left to right are examples of the original image, the full mask, the noisy skeleton, the extracted background and two synthesized images.

Table 2.
FID scores between various synthetic components and the real ones. A: synthetic mask vs. real mask, B: synthetic mask vs. real image, C: synthetic background vs. real image, D: synthetic image vs. real image, E: real training image vs. real test image.
YoloCurvSeg achieves competitive FID scores on all four datasets, two of which are even smaller than those between the real training and test sets, as shown in column E of Table 2. Successful alignment between the synthetic curves and the regions of interest in the synthetic images, together with high similarity between synthetic and real images, is well established; both are crucial factors for YoloCurvSeg to achieve strong segmentation performance in the following experiments.

Table 3.
Comparison with existing WSL methods on the OCTA500 and CORN datasets. The best results are highlighted in bold, and the second-best results are underlined. FS denotes fully-supervised learning.

Table 4.
Comparison with existing WSL methods on the DRIVE and CHASEDB1 datasets. The best results are highlighted in bold, and the second-best results are underlined. FS denotes fully-supervised learning.

Table 6.
Comparison of YoloCurvSeg's coarse-stage one-shot segmentation performance using background images respectively extracted by median filtering and by the Inpainter in the image synthesis process. Median represents median filtering.

Table 9.
Comparison of the total parameters in the training and testing phases between the YoloCurvSeg pipeline and comparative methods. Inp: Inpainter; Syn: Synthesizer; Seg: Segmenter.