Assisted annotation in Deep LOGISMOS: Combining deep learning and graph optimization for simultaneous multi-compartment 3D segmentation of calf muscles on MRI

175 MR images of 350 lower legs from 93 subjects (47 healthy, 35 DM1, 6 Pre-DM1, 5 JDM). MR image size was 512×512×30, voxel size 0.7×0.7×7 mm. Acquisition used the first echo of a 3-point Dixon gradient echo sequence, TR = 150 ms, TE = 3.5 ms, FOV = 36 cm, bandwidth 224 Hz/pixel, scan time 156 s. Human subject research was approved by the University of Iowa IRB as part of NIH project R01-NS094387.


I. INTRODUCTION
In humans, the muscles of the lower leg between the knee joint and the ankle support weight-bearing activities such as walking, running and jumping. Anatomically, this group is composed of five individual muscle compartments shown in Fig. 1(a): Tibialis Anterior (TA), Tibialis Posterior (TP), Soleus (Sol), Gastrocnemius (Gas), and Peroneus Longus (PL) [1]. Structural and volumetric changes of these compartments provide valuable information for diagnosing various muscular diseases and for evaluating their severity and progression. One such disease is myotonic dystrophy type 1 (DM1), an inherited disorder characterized by progressive muscle weakness, myotonia, and dystrophic changes [2]. DM1 is the most common form of muscular dystrophy that begins in adulthood and causes severe fatty degeneration of calf muscle in most patients [3]. Magnetic resonance (MR) imaging, which offers non-invasive imaging of muscles with high sensitivity to dystrophic changes, has been widely used in the clinic for muscular disease diagnosis and follow-up evaluation [3], [4]. Traditional structural assessment of multiple individual muscles invariably resorts to manual tracing [5], [6], which is arduous, time-consuming, and a limiting factor in large research and clinical settings. Automated segmentation of multiple individual calf muscles is therefore essential for developing quantitative biomarkers of muscular disease diagnosis and progression.
Past calf muscle segmentation research is relatively sparse. Valentinitsch et al. [7] proposed a three-stage method using unsupervised multi-parametric k-means clustering to segment calf muscle regions and subcutaneous fat for determining subcutaneous adipose tissue (SAT) and inter-muscular adipose tissue (IMAT). Yao et al. [8] combined deep learning with a dual active contour model to accurately locate the fascia lata and segment multiple tissue types for quantifying calf muscle and fat volumes. Amer et al. [9] employed deep learning to segment the whole calf muscle region, within which IMAT and healthy muscle are classified afterward by deep convolutional auto-encoders. All of these whole-muscle-region segmentation methods were proposed mainly to separate muscle, SAT and IMAT for estimating fat infiltration in muscular dystrophies.
However, the segmentation of individual muscle compartments is more desirable for assessing the progression of different neuromuscular diseases [10]. For example, it has been shown that individual skeletal muscles may be affected differently by DM1 [11]. It is therefore necessary to improve the efficiency and utility of muscle MRI as a marker of muscle pathology [12].
Automated 3D segmentation of individual calf muscle compartments is challenging and attempts in this field are rare. As shown in Fig. 1(b-e), muscular dystrophy introduces substantial variations of shape, texture and grayscale appearance to a part of or the entire calf region, in addition to the already substantial variations due to the flexible nature of the muscles and the leg's position in the scanner. Commean et al. [13] proposed a semi-automated method using thresholding and edge detection to segment bones, adipose tissue, and five individual muscle compartments. Ghosh et al. [14] fine-tuned a pre-trained AlexNet on 700 3D MR images to predict two parameters representing the contour of the leg muscles and achieved an average DSC (Dice Similarity Coefficient) of 0.85 ± 0.09. However, the network must be trained separately for each leg muscle, so the method cannot learn shared features by training on all muscles together. More recently, Guo et al. [15] proposed a novel neighborhood relationship-aware network based on 3D U-Net [16], called FilterNet, for automated segmentation of individual calf muscle compartments and reached an average DSC of 0.90 ± 0.01 on 40 T1-weighted 3D MR images of 11 healthy and 29 diseased subjects. This approach was used in clinical research [12].
Although the aforementioned approaches reported acceptable segmentation performance by applying deep learning methods, several critical issues remain to be settled. 1) Availability of sufficiently large annotated datasets represents a bottleneck limiting their application, especially in large clinical settings where new data accumulates continuously. Annotation (manual tracing) of medical images is not only arduous and time-consuming but also requires costly specialty-oriented knowledge and skills. 2) There is still room for improvement of deep learning-based calf segmentation approaches. 3) Undesirable regional inaccuracies remain in the deep learning segmentation due to the lack of global-information-aware optimization. Our work attempts to address all of these issues.
Compared with previously reported approaches, the contributions of our work can be summarized as follows. 1) Assisted annotation with efficient adjudication substantially decreased expert manual tracing effort when forming annotated training sets. 2) FilterNet+ improved the performance of the underlying FilterNet approach and offered stable training, accelerated convergence, improved generalization, and, as a result, improved segmentation. 3) Deep LOGISMOS substantially improved the performance of 3D calf muscle compartment segmentation by utilizing FilterNet+ pre-segmentation and new machine-learned cost functions.

II. METHODS
A. Assisted Annotation
Fig. 2 shows the workflow of our assisted annotation approach, which employs an iterative loop to make the best use of the existing annotated datasets and to add new annotated datasets efficiently. This approach a) starts with a small training set, b) uses it to create the initial version of an automated calf segmentation method, and c) employs this method to automatically segment additional unannotated images, some of which are likely segmented inaccurately at first. These automated segmentations are d) expert-corrected using the Just-Enough-Interaction (JEI) functionality of LOGISMOS [17] and combined with the previous training set, thus e) forming a new, larger training set of expert-annotated images, which is then used to create the next version of the automated calf segmentation method in step "b". The assisted annotation steps ("b-e") are repeated until the desired performance is achieved or all data are annotated.
The process of creating new versions of the automated calf segmentation (step "b" above) relies on the following sub-steps in each iteration of the assisted annotation loop: 1) deep-learning based approximate pre-segmentation of calf muscle compartments; 2) deep-learning based design of LOGISMOS cost functions; 3) design of multi-object JEI for efficient editing of automatically segmented calf compartments.
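The loop of steps "a" through "e" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `train` and `expert_correct`, the `batch_size` parameter, and the string placeholder types are all hypothetical, and only the "all data are annotated" stopping rule is shown (a performance-based stop would be an extra check inside the loop).

```python
from typing import Callable, List, Sequence, Tuple

Image = str        # placeholder type: an image identifier (assumption)
Annotation = str   # placeholder type: a segmentation label map (assumption)
Pair = Tuple[Image, Annotation]


def assisted_annotation(
    annotated: List[Pair],
    unannotated: Sequence[Image],
    train: Callable[[List[Pair]], Callable[[Image], Annotation]],
    expert_correct: Callable[[Image, Annotation], Annotation],
    batch_size: int,
) -> List[Pair]:
    """Grow the training set by alternating training and expert correction."""
    pool = list(unannotated)
    training_set = list(annotated)            # step (a): small initial set
    while pool:
        segment = train(training_set)         # step (b): (re)build the method
        batch, pool = pool[:batch_size], pool[batch_size:]
        for image in batch:
            proposal = segment(image)         # step (c): automated segmentation
            corrected = expert_correct(image, proposal)  # step (d): expert review
            training_set.append((image, corrected))      # step (e): enlarge set
    return training_set
```

Each pass retrains on everything annotated so far, so later batches are expected to need less correction than earlier ones.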

B. FilterNet+: DL-Based Pre-Segmentation
Pre-processing: Bias field correction [18] is first applied to minimize intensity non-uniformity in MR images. Z-score normalization is then applied to the intensities of all images of individual legs to reduce inter-subject variations. Optimal thresholding and k-means clustering are used to extract the regions of interest (ROI) corresponding to the left and right legs. All right legs are mirrored to conform to left legs to reduce the task complexity. All pre-processing steps are completely unsupervised and are carried out automatically without any user intervention.
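Two of the pre-processing steps, z-score normalization and mirroring, are simple enough to sketch directly. The code below is an illustration under stated assumptions: the volume is a nested list indexed [slice][row][column], the left-right direction is taken to be the column axis, and the population standard deviation is used. Bias field correction and ROI extraction are not shown.

```python
from statistics import mean, pstdev
from typing import List

Slice = List[List[float]]  # one 2D slice, indexed [row][column]


def zscore(volume: List[Slice]) -> List[Slice]:
    """Normalize all voxels of one leg to zero mean and unit variance."""
    voxels = [v for sl in volume for row in sl for v in row]
    mu, sigma = mean(voxels), pstdev(voxels)
    if sigma == 0:
        raise ValueError("constant image cannot be z-score normalized")
    return [[[(v - mu) / sigma for v in row] for row in sl] for sl in volume]


def mirror_left_right(volume: List[Slice]) -> List[Slice]:
    """Flip every slice along the column axis (assumed left-right)."""
    return [[row[::-1] for row in sl] for sl in volume]
```

Mirroring twice returns the original volume, which is a convenient sanity check when mapping predictions for right legs back to their native orientation.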
FilterNet+, its novel training strategy: Our first-attempt FilterNet approach to calf segmentation was presented in [15], introducing a neighborhood-relationship-enhanced convolutional neural network. Benefiting from the increased convolution receptive field, resolution-preserving skip connections, and explicit edge-aware regularization by a kernel-based edge gate to constrain voxel-level probability values inside a neighborhood, our original FilterNet outperformed all other state-of-the-art deep-learning approaches tested in both voxel-level label predictions and 3D object surface positioning [15]. The newly designed and enhanced version, FilterNet+, overcomes several shortcomings of the previous approach, namely the insufficient training strategy and the lack of optimization of non-architectural aspects, whose influence had been underestimated. We also considered richer network architecture extensions such as attention mechanisms [19] and dense connections [20], but did not incorporate them because they increased the number of network parameters and offered only marginal improvements. FilterNet+ improvements thus focus on two non-architectural aspects: the loss design and the training strategy. The architecture and training are shown in Figs. 3 and 4.
[Figure caption fragment: The input X is a 120 × 120 × 28 3D image patch cropped from the whole 160 × 160 × 28 3D pre-processed image of one leg. The size of the output Ŷ_p is 6 × 120 × 120 × 28 and it is further processed for loss calculation and the LOGISMOS cost function learning in Fig. 4. Best viewed in color.]
FilterNet+ training uses a new loss function L that combines the Dice loss L_dice, the multi-class cross-entropy loss L_CE, and the edge loss L_e, where λ is an adjustable weight on L_e reflecting the strength of edge-aware regularization during training. L_dice originates from DSC as in [21], where N = 6 is the number of classes, Ŷ_n is the predicted label for class n from the softmax output of the network, Y is the one-hot encoding of the ground truth, and i ∈ Y indexes the foreground voxels of the segmentation map. Incorporating the Dice loss lets the model consider loss information both locally and globally and, as a result, improves the edge continuity between calf muscles. L_CE is a multi-class cross-entropy loss, and L_e is the L1-norm of the difference between the derived edge maps and the true edge maps, both generated by our 2D trainable convolution kernel, the edge gate F_LoG. The edge gate is a trainable variant of the Laplacian of Gaussian filter and can effectively extract valuable edge information from the predicted region label Ŷ and the ground truth Y to derive edge maps. It is updated during training through the trainable parameter σ, initially set to 1. More details about the edge gate can be found in our original FilterNet approach [15].
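How the three terms fit together can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than the published formulation: the Dice and cross-entropy terms are summed with unit weights, the soft Dice uses a linear (not squared) denominator, voxels are laid out along one dimension, and a fixed discrete Laplacian stands in for the trainable edge gate F_LoG.

```python
import math
from typing import List

Probs = List[List[float]]  # [class][voxel]: softmax outputs or one-hot truth


def dice_loss(pred: Probs, truth: Probs, eps: float = 1e-6) -> float:
    """1 minus the mean soft Dice coefficient over classes."""
    scores = []
    for p, y in zip(pred, truth):
        inter = sum(pi * yi for pi, yi in zip(p, y))
        scores.append((2 * inter + eps) / (sum(p) + sum(y) + eps))
    return 1.0 - sum(scores) / len(scores)


def cross_entropy(pred: Probs, truth: Probs, eps: float = 1e-12) -> float:
    """Mean over voxels of -log(probability assigned to the true class)."""
    n_vox = len(pred[0])
    total = -sum(yi * math.log(pi + eps)
                 for p, y in zip(pred, truth) for pi, yi in zip(p, y))
    return total / n_vox


def edge_map(x: List[float]) -> List[float]:
    """Fixed 1D Laplacian; a stand-in for the trainable edge gate (assumption)."""
    return [x[i - 1] - 2 * x[i] + x[i + 1] for i in range(1, len(x) - 1)]


def edge_loss(pred: Probs, truth: Probs) -> float:
    """Mean absolute (L1) difference between predicted and true edge maps."""
    diffs = [abs(a - b) for p, y in zip(pred, truth)
             for a, b in zip(edge_map(p), edge_map(y))]
    return sum(diffs) / len(diffs)


def total_loss(pred: Probs, truth: Probs, lam: float) -> float:
    """Assumed combination: L = L_dice + L_CE + lam * L_e."""
    return (dice_loss(pred, truth) + cross_entropy(pred, truth)
            + lam * edge_loss(pred, truth))
```

With a small λ the edge term barely influences early training; raising λ later shifts emphasis toward boundary agreement once the regional predictions are already reasonable.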
Benefiting from the new combined constraint, the network outputs (probability maps) are optimized to efficiently reflect both the regional and the edge-based information as the likelihood, in [0, 1], of a voxel being correctly classified, which contributes to the LOGISMOS cost function design as shown in Fig. 5.
Training of FilterNet+ was improved by introducing the following new strategies: a) dropout layers were added to the encoder path to prevent over-fitting and improve generalization [22]; b) Kaiming initialization of the trainable network parameters improved model fitting [23]; c) Adam optimization was employed instead of stochastic gradient descent for stochastic optimization [24], [25]; d) a learning rate warmup heuristic for Adam was used to stabilize training and accelerate convergence [26]; and e) learning rate reduction was only allowed when the validation metric stopped improving in two consecutive training epochs. These modifications resulted in stabilized training, accelerated convergence, improved generalization, and thus better segmentation performance.
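Strategies "d" and "e" together define a learning rate schedule, sketched below in plain Python. The linear warmup shape, the `warmup_epochs` parameter, and the class itself are assumptions for illustration; the source only states that a warmup heuristic was used and that reduction requires two consecutive epochs without validation improvement.

```python
class WarmupPlateauSchedule:
    """Linear warmup followed by reduction when validation stops improving."""

    def __init__(self, base_lr: float, warmup_epochs: int,
                 patience: int = 2, factor: float = 0.5):
        self.base_lr = base_lr
        self.warmup_epochs = warmup_epochs
        self.patience = patience          # consecutive bad epochs tolerated
        self.factor = factor              # multiplicative reduction
        self.scale = 1.0                  # cumulative plateau reduction
        self.best = float("inf")          # best validation loss so far
        self.bad_epochs = 0
        self.epoch = 0

    def lr(self) -> float:
        """Learning rate to use for the current epoch."""
        if self.warmup_epochs:
            warm = min(1.0, (self.epoch + 1) / self.warmup_epochs)
        else:
            warm = 1.0
        return self.base_lr * warm * self.scale

    def step(self, val_loss: float) -> None:
        """Record one finished epoch and its validation loss."""
        self.epoch += 1
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.scale *= self.factor
                self.bad_epochs = 0
```

In a PyTorch training loop the plateau part would more commonly be delegated to `torch.optim.lr_scheduler.ReduceLROnPlateau`; the hand-written version here only makes the rule explicit.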
Post-processing: Raw 3D object segmentation produced by the network shows local inaccuracies (small holes, coarse boundaries), which can be easily improved by simple post-processing refinement. Post-processing included two iterations of a recursive Gaussian image filter (σ = 2) and hole filling by enforcing single-component connectivity of each segmented calf compartment. The refined FilterNet+ yielded an approximate pre-segmentation of calf compartments, the performance of which was evaluated separately and which was also used for initialization and graph construction of the subsequent Deep LOGISMOS steps.
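Enforcing single-component connectivity can be illustrated on a 2D binary mask by keeping only the largest connected region. This is a simplified sketch: it works in 2D with 4-connectivity, whereas the source operates on 3D compartments, and the Gaussian smoothing step is not shown.

```python
from collections import deque
from typing import List, Set, Tuple


def keep_largest_component(mask: List[List[int]]) -> List[List[int]]:
    """Keep only the largest 4-connected foreground region of a 2D mask."""
    rows, cols = len(mask), len(mask[0])
    seen: Set[Tuple[int, int]] = set()
    best: List[Tuple[int, int]] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen.add((r, c))
            queue, component = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    keep = set(best)
    return [[1 if (r, c) in keep else 0 for c in range(cols)]
            for r in range(rows)]
```

Applying the same routine to the inverted mask and treating every background region except the largest as foreground is one way to fill enclosed holes.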

C. Deep LOGISMOS
LOGISMOS (Layered Optimal Graph Image Segmentation for Multiple Objects and Surfaces) is a general approach for optimally segmenting multiple n-D surfaces that mutually interact within and/or between objects [27], [28]. Columns of interconnected graph nodes are used to cover the search region for target surfaces. After assigning a cost to each node, multi-surface segmentation is achieved by finding the set of nodes, one node per column, with globally optimal total cost. Additional context-specific graph arcs can be used to enforce geometric constraints that represent prior shape and anatomy knowledge. The efficiency of LOGISMOS is mainly determined by good target-object shape priors as the initialization and a relevant cost function that yields the desired image segmentation. The traditional implementation relies on an interactively defined initial approximate segmentation and on cost functions designed by human experts. In this work, we overcame these manual-design limitations by using the above FilterNet+ segmentation to initialize LOGISMOS, while the cost functions were jointly learned from segmentation examples in combination with the independently learned FilterNet+ parameters, yielding the overall Deep LOGISMOS approach (Fig. 5).
Graph construction: FilterNet+ pre-segmentation provides an approximate segmentation of each calf muscle compartment as 3D mesh surfaces and defines their topology and mutual relationships. Graph columns are constructed along the directions normal to the mesh surfaces. More detailed information about graph construction can be found in [17], [27], [28]. To incorporate the spatial relationships between muscle compartments as object separation constraints, the column orientations of the subgraphs associated with individual compartments are specially designed as shown in Fig. 6(a), where the columns are built from inside to outside for TA and Sol, and from outside to inside for TP, Gas and PL. This special orientation scheme utilizes anatomical prior knowledge about the muscle compartments to avoid formation of frustrating cycles [29].
Machine learning cost design: As shown in Fig. 6(b), the appearance of the probability map of a muscle compartment is very similar to a clearly defined bright object. Therefore, the gradient of the probability along the column directions is chosen as a machine-learned feature in the trained LOGISMOS cost function. In Section II-B, the edge gate is trained globally on the predicted labels Ŷ and the ground truth Y to derive edge maps (Fig. 4). Since Ŷ and Y represent calf muscle compartments and hold both the region and edge information, we utilized the edge gate learned on the input calf images to derive residual features to be combined with the machine-learned features from the probability maps (Fig. 5). The contribution of the added residual features improves the properties of the learned cost functions.
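The probability-gradient feature can be pictured on a single graph column. The sketch below is an assumption-laden illustration: it takes a probability profile sampled at the column's nodes, ordered from inside the compartment to outside, and uses the central-difference gradient directly as the node cost, so the steepest drop in probability receives the lowest cost. For columns built from outside to inside the sign would flip, and the residual edge-gate features are not included.

```python
from typing import List


def column_costs(prob_along_column: List[float]) -> List[float]:
    """Node costs from the probability gradient along one graph column.

    Nodes are assumed ordered from inside the object to outside, so the
    probability drops at the boundary and a strong drop yields a low cost.
    """
    p = prob_along_column
    last = len(p) - 1
    costs = []
    for k in range(len(p)):
        prev_p = p[max(k - 1, 0)]      # neighbor index clamped at the ends
        next_p = p[min(k + 1, last)]
        costs.append((next_p - prev_p) / 2.0)
    return costs
```

For the profile `[1.0, 1.0, 0.8, 0.1, 0.0, 0.0]` the lowest cost falls at node 2, where the probability falls most steeply.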
Deep LOGISMOS segmentation: The constructed graph in the LOGISMOS system integrates the shape prior from the refined FilterNet+ pre-segmentation, object separation constraints, geometric smoothness constraints, and the node costs assigned by the newly machine-learned cost function design. The global optimality of the segmentation is guaranteed by the graph optimization. The final simultaneous segmentation of all 5 calf-muscle compartments is obtained by optimal hypersurface detection in polynomial time as described in [27].
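The idea of choosing one node per column at minimum total cost under a smoothness constraint can be shown on a toy problem. The sketch below handles a single surface over an open chain of columns and solves it by dynamic programming; this is a deliberately simpler technique swapped in for illustration and is not the multi-surface, multi-object graph optimization of [27]. The function name and the `max_step` parameter are hypothetical.

```python
from typing import List


def optimal_surface(costs: List[List[float]], max_step: int) -> List[int]:
    """Pick one node per column minimizing total cost, with neighboring
    columns differing by at most max_step node positions."""
    n_nodes = len(costs[0])
    best = list(costs[0])            # best[k]: min cost of a path ending at k
    back: List[List[int]] = []       # back-pointers, one list per later column
    for col in range(1, len(costs)):
        new_best, pointers = [], []
        for k in range(n_nodes):
            lo, hi = max(0, k - max_step), min(n_nodes - 1, k + max_step)
            j = min(range(lo, hi + 1), key=lambda idx: best[idx])
            new_best.append(best[j] + costs[col][k])
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest end node.
    k = min(range(n_nodes), key=lambda idx: best[idx])
    surface = [k]
    for pointers in reversed(back):
        k = pointers[k]
        surface.append(k)
    return surface[::-1]
```

In the toy cost matrix `[[9, 0, 9, 9], [9, 9, 4, 0], [9, 0, 9, 9]]`, the unconstrained minimum jumps from node 1 to node 3 and back; with `max_step=1` the jump is forbidden and the solution passes through the costlier node 2 in the middle column, which mirrors how smoothness constraints suppress isolated outliers.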
Just-Enough Interaction (Deep LOGISMOS-JEI): The dynamic nature of the underlying algorithm is utilized to edit the segmentation result via interactive modification of local costs. Since JEI modification is directly applied to the graph, the updated result is still globally optimal (with respect to the modified costs) and satisfies existing geometric constraints. In practice, user interaction on one 2D slice is often enough to correct segmentation errors in its neighboring 2D slices and thus reduces the amount of human effort. In addition, due to the embedded inter-object constraints, in regions where multiple compartments are close to each other, editing is only needed on one compartment.

III. EXPERIMENTAL METHODS
A. Data
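Only 40 lower leg T1-weighted MR images from 40 subjects were initially available (11 healthy, 23 DM1, 2 pre-DM1, 4 Juvenile Onset DM1 or JDM), the same data set as reported in [15]. Over the course of a longitudinal DM1 study, some of the initial subjects were re-scanned and new subjects were added, with an additional 135 MR images acquired with the same scanning parameters, increasing the annotated set size to 175 images of 350 lower legs from 93 subjects (47 healthy, 35 DM1, 6 Pre-DM1, 5 JDM). MR image size was 512×512×30 with a voxel size of 0.7×0.7×7 mm. Acquisition used the first echo of a 3-point Dixon gradient echo sequence (TR = 150 ms, TE = 3.5 ms, FOV = 36 cm, bandwidth 224 Hz/pixel, scan time 156 s). Human subject research was approved by the University of Iowa IRB as part of NIH project R01-NS094387.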

B. Independent standard
The initial set of 40 annotated MR datasets (80 legs) was fully manually traced by experts in 3D Slicer [30]; each annotation took approximately 8 hours on average. This set was used for the initial stage of training Deep LOGISMOS to deliver decent automated segmentations of the five calf compartments. The remaining 135 MR datasets were sequentially segmented, their segmentations were reviewed and, if needed, interactively corrected by experts using Deep-LOGISMOS+JEI (Section II-C), and they served as additional training data in the assisted annotation training loop (Fig. 2). The average time of reviewing and editing each 3D MR image in the assisted annotation loop was approximately 25 minutes, a 95% decrease in expert effort.
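The 95% figure follows directly from the two reported per-image times. A short check, using only the numbers stated above (the function name is illustrative):

```python
def effort_reduction(manual_minutes: float, assisted_minutes: float) -> float:
    """Fractional reduction in expert time per image."""
    return 1.0 - assisted_minutes / manual_minutes


# 8 hours of manual tracing versus 25 minutes of review and editing.
reduction = effort_reduction(8 * 60, 25)
```

Here `reduction` evaluates to 1 - 25/480, or roughly 0.948, which rounds to the reported 95%.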

C. Experimental Setting
Multiple experiments were designed to compare the performance of our original FilterNet [15], which served as a baseline method, with the newly developed methods and to demonstrate the contribution of our new approaches. Similarly, to demonstrate the improvements achieved by assisted annotation, we compared performance on differently sized datasets (fully traced or assist-annotated). The following five methods were compared, where the numeric index specifies the number of datasets (legs) used: FilterNet 80, FilterNet+ 80, DeepLOGISMOS 80, FilterNet+ 350, and DeepLOGISMOS 350.
Given a limited-size dataset, 4-fold cross-validation was used to evaluate the performance of each tested approach, with the 4 groups created randomly at the subject level so that data (legs) from the same subject were never simultaneously used for both training and testing. The data were split 65%-10%-25% to form the training, validation, and testing sets. That means that for a dataset of 80 (350) legs, training+validation was based on 60 (262) legs and testing was done on 20 (88) legs, repeated 4 times.
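Subject-level fold creation can be sketched as follows. This is an illustration under assumptions: legs are identified by hypothetical string keys mapped to subject identifiers, subjects are shuffled with a fixed seed and dealt round-robin into folds, and the further 65%-10% division of the non-test legs into training and validation sets is not shown.

```python
import random
from typing import Dict, List, Tuple


def subject_level_folds(leg_to_subject: Dict[str, str], n_folds: int = 4,
                        seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Return (train_val_legs, test_legs) per fold, grouping legs by subject."""
    subjects = sorted(set(leg_to_subject.values()))
    random.Random(seed).shuffle(subjects)
    # Deal subjects round-robin so fold sizes differ by at most one subject.
    groups = [set(subjects[i::n_folds]) for i in range(n_folds)]
    folds = []
    for test_subjects in groups:
        test = sorted(leg for leg, subj in leg_to_subject.items()
                      if subj in test_subjects)
        train_val = sorted(leg for leg, subj in leg_to_subject.items()
                           if subj not in test_subjects)
        folds.append((train_val, test))
    return folds
```

Because the split is made over subjects rather than legs or images, re-scans of the same person in a longitudinal study also stay on one side of the train/test boundary.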
Each image segmentation method design uses specific parameters that influence its behavior. In all tests, the same parameters were used in the corresponding steps of each method. In FilterNet+, to increase the robustness and generalization of the network, the input image patches were sized 120×120×28, cropped from the localized leg areas, overlapping with a step size of 20 voxels along the x and y directions, yielding 9 times as many training patches as the number of available leg images. Data augmentation was performed on each training patch with random rotation, scaling, etc. The learning rate was halved whenever the combined loss L on the validation dataset did not decrease in 2 consecutive training epochs. The FilterNet+ training loss parameter λ was initially set to 0.001 and increased tenfold every 10 epochs; the batch size was 16. FilterNet+ was implemented using the PyTorch platform [31] and trained on an Nvidia Tesla V100 GPU with 32 GB memory. LOGISMOS graph columns consisted of 49 nodes spaced 0.35 mm apart. LOGISMOS smoothness constraints were set as 6 node-to-node distances, corresponding to 2.1 mm.
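Two of the numeric settings above can be checked with a few lines of Python: the factor of 9 from overlapping patches and the stepwise growth of λ. The function names are illustrative, the 160-voxel in-plane leg size is taken from the figure caption in Section II-B, and no upper bound on λ is assumed because none is stated.

```python
from typing import List, Tuple


def patch_origins(image_size: int = 160, patch_size: int = 120,
                  step: int = 20) -> List[Tuple[int, int]]:
    """In-plane (x, y) corners of all overlapping patches."""
    offsets = range(0, image_size - patch_size + 1, step)
    return [(x, y) for x in offsets for y in offsets]


def edge_weight(epoch: int, initial: float = 0.001, every: int = 10) -> float:
    """Edge-loss weight: starts at `initial`, grows tenfold every `every` epochs."""
    return initial * 10 ** (epoch // every)
```

A 160-voxel extent admits offsets 0, 20, and 40 for a 120-voxel patch, so there are 3 × 3 = 9 patches per leg, and the 6-node smoothness limit at 0.35 mm spacing gives 6 × 0.35 = 2.1 mm.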

D. Quantitative Analysis
To comprehensively evaluate the segmentation performance and allow method-to-method comparisons, DSC and the Jaccard Similarity Coefficient (JSC) evaluated region-based accuracy, while absolute surface-to-surface distance (ASSD, in pixels) and relative surface-to-surface distance (RSSD, in pixels) assessed boundary-based accuracy. ASSD and RSSD measure the distances between the surface of the automated segmentation and the independent standard. Because of the order-of-magnitude difference between the in-slice resolution in the XY plane (0.7 mm) and the slice distance along Z (7 mm), the surface-to-surface distances were calculated in the XY plane slice by slice. To allow meaningful comparisons, the scores S_ASSD and S_RSSD were calculated from ASSD and RSSD with an application-specific parameter α, empirically chosen as α = 1.111 to reach approximate linearity and a maximum score of 1.0 when the two surfaces match (at zero distance). Given the four indices DSC, JSC, S_ASSD, and S_RSSD, the final comprehensive score was defined as
S_final = 0.25 × (DSC + JSC + S_ASSD + S_RSSD). (4)
A higher S_final indicates better comprehensive performance in the combined regional and boundary-positioning respect. Additionally, ASSD_max was evaluated as the maximum absolute distance between two surfaces of each compartment. Performance indices were averaged for left and right calf muscle compartments and reported as mean ± standard deviation. For statistical comparisons between methods, paired t-tests were used; a p value < 0.05 denoted statistical significance.
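The region-based indices and the final score of Eq. (4) can be computed as in the sketch below. Segmentations are represented as sets of voxel coordinates, which is a convenience for illustration only. The distance-based scores S_ASSD and S_RSSD are passed in as already computed values because their defining formula is not reproduced in this text.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dsc(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def jsc(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Jaccard similarity coefficient: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)


def final_score(dsc_value: float, jsc_value: float,
                s_assd: float, s_rssd: float) -> float:
    """Eq. (4): unweighted mean of the four indices."""
    return 0.25 * (dsc_value + jsc_value + s_assd + s_rssd)
```

Both region indices are undefined for two empty sets, so a caller would need to handle that case. Note also that JSC = DSC / (2 - DSC), so the two region terms in S_final are monotonically related rather than independent.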

IV. RESULTS
The performance comparisons of the five tested methods are listed in Table I and visualized in Fig. 7. Compared with the original FilterNet 80 [15], FilterNet+ 80 achieved significantly better results for each compartment in terms of DSC, JSC, ASSD, RSSD, ASSD_max, as well as the comprehensive S_final, for at least some of the compared quantitative indices.
Table I and Fig. 7 show the superiority of our deep learning method in comparison with our previous FilterNet approach. In agreement with the ablation study principles, the results are methodically ordered from the simplest and earliest original FilterNet 80 in increasing complexity: first introducing the improvements of the FilterNet+ 80 approach, then combining FilterNet+ with LOGISMOS optimization, and finally employing assisted annotation to increase the training set sizes in the FilterNet+ 350 and DeepLOGISMOS 350 approaches. Fig. 8 demonstrates that segmentation inaccuracies in FilterNet 80 (tunnels, mis-classifications, undesired disjoint objects) are successfully resolved by the improvements designed for the FilterNet+ 80 approach. DeepLOGISMOS 80 segmentations exhibit additional increases in accuracy, surface smoothness, and topologic superiority as shown in Fig. 8. For the DM1 subject, while the PL compartment segmented by FilterNet+ 80 spreads into the surrounding tissue, this problem is resolved by DeepLOGISMOS 80 due to the addition of machine-learned residual features to the cost function (Section II-C). Similarly, benefiting from LOGISMOS graph optimization, a region of mis-classified PL around the boundary of Sol and mis-classified Gas in the control subject by FilterNet 80 is corrected by FilterNet+ 80 and DeepLOGISMOS 80. At the same time, the importance of topologic correctness of the Pre-DM1 and DM1 pre-segmentation can be seen in FilterNet+ 80 and DeepLOGISMOS 80, where the attractive edge costs at falsely pre-segmented locations did not allow LOGISMOS to properly reposition the Sol surface.

A. Ablation Study
The superiority of training on a larger dataset, indicating the effectiveness of assisted annotation, is further shown in FilterNet+ 350 and DeepLOGISMOS 350 (Table I, Fig. 7). LOGISMOS-JEI based assisted annotation, in the process of providing larger training datasets, dramatically reduces the annotation effort of human experts. In all four examples in Fig. 8, most of the errors in the earlier segmentation approaches are successfully resolved by FilterNet+ 350 and DeepLOGISMOS 350.

B. Generalizability of combining Deep LOGISMOS and assisted annotation
The power of our work in combining deep learning pre-segmentation and graph-optimality-seeking Deep LOGISMOS trained on data produced by efficient assisted annotation was demonstrated in the case of segmenting human calf muscle compartments on MRI. Alternatively, other deep convolutional neural network architectures can be integrated into the Deep LOGISMOS framework to utilize information linkages between deep learning and graph optimization. The framework is further strengthened by the inherently incorporated Deep-LOGISMOS+JEI based assisted annotation (Fig. 2), whose effectiveness and efficiency in reducing the annotation effort and optimizing the segmentation model are clearly visible from the achieved segmentation improvements (Section IV). Note, of course, that the Deep-LOGISMOS+JEI method used here for assisted annotation is not the only one applicable. The idea of iterative training-segmentation-annotation epochs can be generically incorporated into supervised learning methods, or one can elect to employ suggested annotation approaches [32]. Given this inherent generalizability of Deep LOGISMOS and the assisted annotation paradigm, these strategies can be further integrated, and the machine-learned deep segmentation features and the machine-learned LOGISMOS cost functions can be applied to various segmentation tasks to benefit both the segmentation processes and those leading to assisted annotations.

C. Future work
Although we showed that assisted annotation substantially reduces the expert effort of manual tracing (from 8 hours to 25 minutes per 3D image), the total time and effort of reviewing and editing a large dataset cannot be neglected either. There are two promising directions to further relieve the annotation effort problem: active learning [33] and quality assessment without ground truth. Quality assessment without ground truth focuses on further reducing the human effort spent searching for small segmentation errors in a large 3D image by automatically locating likely segmentation errors on the volumetrically visualized object surfaces. Afterward, the identified likely-erroneous locations can be used as feedback to guide the network to prevent similar errors. As a result, the time of reviewing and editing the segmentations to produce new annotations can be significantly reduced.

VI. CONCLUSION
A hybrid framework combining the main advantages of our convolutional neural network FilterNet+ with those of our graph-based LOGISMOS approach, further supported by Deep-LOGISMOS+JEI assisted annotation, was reported. The presented comparative performance assessment demonstrated improved performance in simultaneous multi-compartment 3D segmentation of calf muscle compartments on 3D MRI. By maximizing the value of an original small dataset of fully annotated MR images of 80 lower legs, and by initially training a Deep LOGISMOS segmentation method on this small dataset, we have designed and employed an efficient assisted annotation strategy that decreased the average time required to 3D-annotate 5 calf-muscle compartments on a volumetric 512×512×30 MR image from 8 hours to 25 minutes, a 95% reduction of human expert effort. Our Deep LOGISMOS method trained on a larger dataset of 350 assisted-annotated legs then outperformed all other tested deep learning and graph-optimization approaches in region-based voxel labeling, boundary-based surface positioning, and the final comprehensive performance score. Compared with our previously reported FilterNet method, mean DSC was improved by 4.6% on average, from 88.0-91.3% to 92.9-95.9%. The mean absolute surface positioning errors were improved by 47.5% on average, from 1.4-2.2 pixels to 0.7-1.2 pixels. The mean comprehensive final score was improved by 6.5 on average, from 84.5-89.1 to 91.0-94.8 for the five 3D muscle compartments per leg. The reduction of local maximum segmentation errors (Max ASSD) was even more pronounced. The striking performance improvements suggest the clinical-use potential of our new fully automated simultaneous segmentation of calf muscle compartments.
ACKNOWLEDGMENT
Eric Axelson's contributions to data preparation and management are gratefully acknowledged.