Multitask Deep Learning Reconstruction and Localization of Lesions in Limited Angle Diffuse Optical Tomography

—Diffuse optical tomography (DOT) leverages near-infrared light propagation through tissue to assess its optical properties and identify abnormalities. DOT image reconstruction is an ill-posed problem due to the highly scattered photons in the medium and the small number of measurements relative to the number of unknowns. Limited-angle DOT reduces probe complexity at the cost of increased reconstruction complexity. Reconstructions are thus commonly marred by artifacts and, as a result, it is difficult to obtain an accurate reconstruction of target objects, e.g., malignant lesions. Moreover, a good reconstruction does not always ensure good localization of small lesions. Furthermore, conventional optimization-based reconstruction methods are computationally expensive, rendering them too slow for real-time imaging applications. Our goal is to develop a fast and accurate image reconstruction method using deep learning, where multitask learning ensures accurate lesion localization in addition to improved reconstruction. We apply spatial-wise attention and a distance transform based loss function in a novel multitask learning formulation to improve localization and reconstruction compared to single-task optimized methods. Given the scarcity of real-world sensor-image pairs required for training supervised deep learning models, we leverage physics-based simulation to generate synthetic datasets and use a transfer learning module to align the sensor-domain distribution between in silico and real-world data, while taking advantage of cross-domain learning. Applying our method, we find that we can reconstruct and localize lesions faithfully while allowing real-time reconstruction. We also demonstrate that the present algorithm can reconstruct multiple cancer lesions. The results demonstrate that multitask learning provides sharper and more accurate reconstructions.


I. INTRODUCTION
Near-infrared diffuse optical tomography is an optical imaging technique used for the examination of biological tissue at a macroscopic scale and has witnessed increased interest as a non-invasive and low-cost breast cancer screening alternative to ionizing X-ray mammography, the de facto screening technique [1], [2]. DOT projects near-infrared light from one or more light sources and acquires boundary data of the tissue (back-scattered light) using a set of detectors, yielding, after reconstruction, cross-sectional or volumetric images of tissue [3]. The reconstruction process converts information from the projection (sensor) domain into the image domain to enable interpretation and diagnosis by domain experts or by artificial intelligence (Fig. 1). DOT allows quantitative imaging of optical properties as it measures and visualizes the distribution of tissue absorption and scattering properties. These optical parameters are correlated with physiological markers such as blood oxygenation and tissue metabolism, allowing quantitative assessment of tissue malignancy [4].

A. Challenges in DOT Image Reconstruction
Given that the optical properties of normal and diseased tissues diverge, it is possible to detect breast cancer lesions using these optical properties reconstructed from collected boundary measurements of light propagation within a tissue [5]. The key challenge lies in properly determining optical properties given the nonlinear physics of photon scattering and the ill-posedness of the problem. Furthermore, although near-infrared light can propagate several centimeters inside the tissue, photons scatter many times and penetrate along random paths through the tissue before reaching the detectors, which makes the reconstruction task difficult.
When applied to limited-view data acquisition, which allows a reduction of scanner complexity, reconstruction becomes even more challenging (a smaller number of measurements compared to the number of unknowns) [6]. A linear probe [1], for instance, reduces system complexity (lower hardware and software requirements); however, the co-linear design of sources and detectors in the probe results in a limited view of the target and in artifacts.

B. Current DOT Image Reconstruction Approaches
DOT image reconstruction is an ill-posed problem that has been tackled using a wide variety of reconstruction methods (see [7], [8] for a review).
1) Conventional Methods: While absolute imaging relies on a single set of measurements to reconstruct spatially distributed optical coefficients, difference imaging aims at recovering the change in the tissue optical properties based on measurements taken before and after the change [9]. The reference measurement can be a previous state (e.g., the rest stage in brain DOT) or a reference tissue or phantom. A drawback of difference imaging is that reconstructed images are usually only qualitative and their spatial resolution can be low [10]. In this manuscript, the focus is on absolute imaging.
The image reconstruction problem is commonly formulated as recovering the optical properties distribution x* (2D image or 3D volume) that minimizes the reconstruction error between the boundary measurement sample y and the forward projection F(·) from a candidate reconstructed image x:

x* = arg min_x ||y − F(x)||² + λ R(x),   (1)

where the first term enforces data fidelity, R(·) is a regularization term used to constrain the inverse problem to yield plausible solutions, and λ is a hyper-parameter that controls the contribution of R. Common regularization choices include, for instance, standard Tikhonov regularization R(x) = ||x||²₂ and total variation (TV) regularization R(x) = ||∇x||₁.
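As a concrete illustration, the objective in (1) can be sketched as follows; the random linear operator standing in for the true (nonlinear) forward model F, and all names, are our own illustrative choices, not from the paper:

```python
import numpy as np

# Toy sketch of the data-fidelity + regularization objective. A random
# linear operator A stands in for the true forward model F.
def objective(x, y, forward, lam, reg="tikhonov"):
    fidelity = np.sum((y - forward(x)) ** 2)       # ||y - F(x)||^2
    if reg == "tikhonov":                          # R(x) = ||x||_2^2
        penalty = np.sum(x ** 2)
    else:                                          # TV: R(x) = ||grad x||_1
        penalty = np.abs(np.diff(x, axis=0)).sum() + \
                  np.abs(np.diff(x, axis=1)).sum()
    return fidelity + lam * penalty

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 128 * 128))          # 256 toy measurements
F = lambda x: A @ x.ravel()

x = np.zeros((128, 128))                           # candidate image
y = F(x)                                           # matching measurements
print(objective(x, y, F, lam=0.1))                 # 0.0 for a perfect fit
```

A model-based solver would minimize this objective over x for every new measurement y, which is exactly the per-instance cost the deep learning approach amortizes into training.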
Until recently, the focus in DOT reconstruction research has been on model-based algorithms, whose design follows directly from the underlying mathematical problem formulation. A disadvantage of a model-based approach is that each instance needs to be optimized independently at reconstruction time, a computationally costly approach that can prohibit real-time application [3]. Reconstruction methods can roughly be divided into linear approaches, based on inverse scattering theory, and nonlinear iterative approaches, based on model fitting. Linear approaches rely on Born or Rytov approximations, where the linearization is developed from analytical solutions of the diffusion equation for a homogeneous background [3]. This model can be constructed using Green's functions for the diffusion equation. The resultant linear system is usually solved by an iterative method (e.g., conjugate gradients [11]). The Born or Rytov approximation assumes that relative changes induced by anomalies are sufficiently small. When the optical coefficient changes become very large, as in the case of large abnormal lesions, this approximation can lose reconstruction accuracy [12].
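A minimal sketch of such a linearized reconstruction, with a random matrix standing in for the sensitivity (Jacobian) matrix rather than a real DOT model, could look like this:

```python
import numpy as np
from scipy.sparse.linalg import cg

# Linearized (Born-type) system J dx = dy, solved with conjugate
# gradients on the damped normal equations (J^T J + lam I) dx = J^T dy.
rng = np.random.default_rng(1)
J = rng.standard_normal((256, 400))       # 256 measurements, 400 unknowns
dx_true = np.zeros(400)
dx_true[180:190] = 0.05                   # small localized perturbation
dy = J @ dx_true                          # measurement change

lam = 1e-3                                # Tikhonov damping
N = J.T @ J + lam * np.eye(400)           # symmetric positive definite
dx, info = cg(N, J.T @ dy)                # info == 0 on convergence
print(info)
```

The small-perturbation assumption is visible here: the linear system only relates *changes* in measurements to *changes* in optical coefficients around a known background.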
Non-linear methods consider the model in terms of explicit parameters and adjust these parameters in order to optimize the objective function in (1). Such methods usually minimize the objective function iteratively until convergence, searching for the distribution of optical properties that minimizes the difference between the measured and modelled data. A common approach in iterative reconstruction is to solve the forward model and calculate the Jacobian matrix at each iteration (e.g., Newton-like methods [13]). Alternatively, gradient-based reconstruction techniques use the gradient of the objective function to update the solution, offering a reduction in computational complexity at the cost of slower convergence. Regularization terms generally have to be added as the optimization problems are ill-posed.
Optimizing acquisition time and computational speed along with reconstruction quality remains an active research area in DOT. Despite recent advances, the reconstruction time of conventional DOT reconstruction methods is too high to be suitable for real-time application [4]. In this manuscript, we consider a throughput of at least 20 images (frames) per second as a minimum real-time equivalent for human experts [14], which corresponds to a runtime per image of < 0.05 s. An ideal frame rate of 60 fps (or 0.016 s per image) is the minimum sufficient rate to perceive no flickering in normal images [15].
Alternatively, the image reconstruction task can be learned using a deep neural network, with a significant off-line training time (up to weeks during prototyping, where each model takes a few hours to train) that is offset by a fast inference time, as reported in Ben Yedder et al. [16].
2) Deep Learning Methods: Recently, deep learning has emerged as a powerful alternative for biomedical image reconstruction [16]-[18] and has shown potential for improving resolution accuracy and speeding up reconstruction, especially in the presence of noisy and limited-view acquisition. The capability of deep neural networks to learn complex functions automatically, with fast inference, makes them ideal for this problem.
Deep neural network based techniques have shown promising potential in solving the inverse scattering problem [12], [19]-[21]. Preliminary work [19], [20], [22] did not consider end-to-end models but rather two-step approaches, where a first image estimate, approximated using conventional methods, is further enhanced using a deep learning model. Recent end-to-end models typically considered a full-view acquisition setup: a circular scanner with more than 16 point sources uniformly distributed along the boundary to maximize the number of measurements. For instance, Sun et al. [19] addressed the multiple scattering problem of microwaves in biological samples using a two-step reconstruction method, where an analytical method providing the first image estimate is followed by a convolutional neural network (CNN) for image reconstruction. Feng et al. [21] proposed a multilayer perceptron (MLP) feed-forward artificial neural network for full-view 2D-DOT image reconstruction. However, performance degrades significantly for limited-angle acquisition, where the number of sources is smaller and their placement is co-linear. Recently, based on the Lippmann-Schwinger equation and deep convolutional framelets for inverse problems [23], Yoo et al. [12] designed a deep learning approach for 3D inverse scattering problems. Their model is based on an autoencoder CNN for a 3D-DOT inverse problem that learns the nonlinear physics of the inverse scattering problem.
Preliminary versions of our work [24], [25] tackled reconstruction of biological tissue parameters from limited-view data acquisition in a strongly scattering medium using a decoder-like fully convolutional neural network architecture (FCN), with a weighted combination of mean squared error (MSE) and Fuzzy Jaccard as loss function, and were tested only on phantom data. While advancements have been made, reconstructions are still plagued by artifacts (e.g., false positive reconstructed pixels) and deviations from the ground truth location of the lesion, limiting their practical clinical utility. Furthermore, accurate localization is not necessarily guaranteed by a 'good' reconstruction, especially given the often small ratio of lesion versus background. For accurate diagnosis (e.g., via lesion biopsy) and treatment planning, the location and size of the lesion are paramount.

C. Multitask Learning
Jointly solving for different tasks using a unified model is frequently considered in the computer vision field and has been shown to improve results in different applications, e.g., taskonomy [26] and skin-multitask [27]. Zamir et al. [26] showed that models optimized to jointly predict several related tasks perform better than models trained on individual tasks separately. Similarly, Kawahara et al. [28] proposed a single model capable of performing multiple tasks on multimodal skin lesion classification, while Abhishek et al. [29] reported a performance improvement in skin cancer lesion management prediction when incorporating multitask learning. Sbalzarini [30] asserted that solving ill-posed problems in sequence can be more efficient and easier than solving them in isolation, as the overall solution space contracts when jointly solved problems constrain each other's input/output spaces. Arguably, a unified framework allows feature-space sharing and joint parameter tuning for both tasks, where the two problems regularize each other when tackled jointly. In our application as well, we illustrate that a multitask approach leads to better reconstruction and localization, enabling more accurate diagnosis.

D. Contributions
We make the following contributions in this paper: (i) To the best of our knowledge, we are the first to propose multitask learning for DOT reconstruction. We leverage a spatial attention mechanism in combination with a distance transform based loss function. Our network, Multitask-Net, exploits deep layers of spatial-wise attention [31], [32] to attend to important features by filtering out irrelevant and noisy ones. Adaptively re-weighting features, according to their interdependencies in feature space, improves the representation ability of our model. Crucially, the distance transform based loss improves lesion reconstruction and localization.
(ii) We leverage a physics-based optical diffusion simulation to generate in silico training datasets and evaluate results on real measurements of physical phantoms and clinical data collected with the diffuse optical spectroscopy probe for breast tissue (DOB-Scan) [1], [33].
(iii) To the best of our knowledge, this is the first work that tests deep learning based DOT image reconstruction on clinical data of real patients.
(iv) To bridge the gap between real-world acquisition and in silico data simulation, transfer learning is used. We evaluate the generalization ability of our model to unseen data subject to sensor non-idealities and noise.
(v) We conduct extensive experiments investigating trade-offs in network characteristics and scanning-probe characteristics. The performance evaluation of our proposed model shows that our framework improves reconstruction and localization accuracy when compared against competing methods.
In Section II, we introduce our proposed method for multitask reconstruction and localization learning, followed by domain adaptation learning. The physics-based optical diffusion simulation, test datasets, and competing methods are detailed in Section III. Trade-offs of network characteristics and evaluation on various datasets are described in Section IV. In silico performance results are presented in Section IV-A, results on physical phantoms in Section IV-B, and results on real-world data in Section IV-C. We conclude the paper by discussing insights and limitations of lesion attention-directed reconstruction in DOT in Section V.

II. METHODOLOGY
We aim to reconstruct an image directly from the sensor domain using a deep neural network. Given a set of raw DOT acquisitions y ∈ R^{S×D}, from S sources (emitters) with D sensors (detectors), and their corresponding tomographic image x ∈ R^{W×H}, which represents the tissue's optical coefficients (Fig. 1), we seek to learn the inverse function F^{-1}(·) that maps between the sensor measurements and the image domain while remaining true to the constraints of the underlying physics of the inverse problem. Therefore, we seek to learn the inverse function F^{-1}(·) in an end-to-end fashion that solves:

θ* = arg min_θ Σ_i L( F^{-1}(y_i; θ), x_i ) + R(θ),   (2)

where L is the network loss function that penalizes the dissimilarity between the reconstructed image and the ground truth, R is a regularization term that enforces priors, and θ are the optimized network parameters (weights).

A. Reconstructing Images from DOT Measurements
[Fig. 2 caption (fragment, displaced here by extraction) [34]: The red arrow represents the simulated lesion. (C) Architecture of the proposed deep residual attention model. The squeeze-and-excitation block (inset) squeezes a given feature map U along the channels and excites spatially. The output of the sigmoid layer σ(·) corresponds to the relative importance of the spatial information (i, j) of the feature map U. All feature maps produced by the convolutional layers are set to size 128 × 128 using (3×3) followed by (5×5) kernels. The similarity-wise loss L_loc, leveraged to enforce localization accuracy, computes the mean absolute error between the distance transform of the reconstructed image and that of the ground truth.]

We extend our previously proposed decoder-like architectures FCN [24] and FCN-FJ [25], consisting of a fully connected layer followed by a set of convolution layers, with a spatial-wise attention mechanism and a new loss component (Fig. 2-C). The fully connected layer maps the measurement vector y onto a two-dimensional array by learning a weighted combination of the different receptive sensors, based on the signal of the back-scattered light collected at different locations in the reconstructed tissue, and therefore provides an initial image estimate. Empirically, we did not witness any improvement in the reconstruction results when using more than one fully connected layer. This may be due to the size of the input measurement vector, which is only 256×1 dimensional in our dataset; higher-dimensional inputs may benefit from additional layers. The output of this layer is reshaped into a 128 × 128 array and delivered to the subsequent layers for optimization. A set of six residual convolutional blocks, with 32 channels and increasing filter sizes (3, 5, 7), then refines this first estimate via nonlinear transformations and produces the final reconstructed image x. We refer the reader to Appendix-A for the mathematical formulation.
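The backbone described above (dense mapping, reshape, six residual blocks) might be sketched in Keras roughly as follows; the attention mechanism and custom loss are omitted here, and details such as the exact kernel ordering and block internals are our illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Decoder-like backbone sketch: dense layer -> 128x128 initial estimate
# -> six residual convolutional blocks -> final reconstruction.
def residual_block(x, filters=32, kernel=3):
    y = layers.Conv2D(filters, kernel, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, kernel, padding="same")(y)
    if x.shape[-1] != filters:                 # match channels for the skip
        x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(layers.Add()([x, y]))

inp = layers.Input(shape=(256,))               # 1x256 measurement vector
x = layers.Dense(128 * 128)(inp)               # initial image estimate
x = layers.Reshape((128, 128, 1))(x)
for k in (3, 3, 5, 5, 7, 7):                   # 6 blocks, growing kernels
    x = residual_block(x, filters=32, kernel=k)
out = layers.Conv2D(1, 3, padding="same")(x)   # final reconstruction
model = tf.keras.Model(inp, out)
print(model.output_shape)                      # (None, 128, 128, 1)
```

The single dense layer carries the sensor-to-image domain transfer; everything after it operates purely in image space.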
While the mean squared error was used in isolation in our preliminary FCN work [24], a Fuzzy Jaccard loss component L_FJ was later added in FCN-FJ [25] to address the imbalance between the minority of pixels corresponding to one or more lesions and the majority of 'background' pixels corresponding to healthy tissue.

B. Spatial-wise Attention Network
The framework of our proposed residual attention network (RAN) is shown in Fig. 2-C. The first and last convolutional layers are a shallow feature extractor and a reconstruction layer, respectively. The middle layers are residual attention blocks used to extract hierarchical attention-aware features [31], [35]. Since the contributions of features to the reconstruction task vary across feature maps, we leverage a spatial attention mechanism to prune irrelevant features and enhance informative ones by adaptively re-weighting features according to their inter-dependencies in feature space. The spatial attention compresses the feature map U along the channels and excites it spatially to guide the network to pay more attention to the regions of interest; lesion localization in the case of DOT reconstruction. The attention mechanism has been shown to be effective for natural language processing [36], [37] and computer vision tasks [31], [32], as it selectively disregards noise (less important information) and focuses on what is relevant. Empirically, we did not observe any improvements in the reconstruction results using channel-wise attention [31]. Each residual block uses sequential modules of the form: two convolutions, batch normalization, ReLU, and zero-padding, followed by a squeeze-and-excite module [32].
The spatial squeeze operation is achieved through a convolution generating a projection tensor q = W * U with weight W ∈ R^{1×1×C}, where each element of the tensor q ∈ R^{H×W} represents the linearly combined representation of all C channels at that spatial location. The projection is then passed through a sigmoid layer σ to rescale activations to [0, 1], which are thereafter used to re-calibrate U spatially:

Û_{i,j,c} = σ(q_{i,j}) · U_{i,j,c},   (3)

where σ(q_{i,j}) represents the relative importance of spatial location (i, j) of a given feature map. As a result, focus is guided to the lesion location at the cost of reduced attention to healthy tissue, a sacrifice justified by its lower importance to the clinician when diagnosis is the end task.
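The spatial squeeze-and-excitation described above can be written compactly in NumPy; the 1×1 convolution reduces to a dot product over the channel axis (the toy weights below are our own):

```python
import numpy as np

# Spatial squeeze-and-excitation: a 1x1 convolution squeezes U to one
# map q, a sigmoid turns q into per-pixel gates, and U is rescaled.
def spatial_attention(U, W, b=0.0):
    q = U @ W + b                        # spatial squeeze: (H, W)
    gate = 1.0 / (1.0 + np.exp(-q))      # sigmoid gates in (0, 1)
    return U * gate[..., None]           # spatial excitation of U

U = np.random.default_rng(0).standard_normal((128, 128, 32))
out = spatial_attention(U, W=np.full(32, 1.0 / 32))   # toy 1x1 weights
print(out.shape)                          # (128, 128, 32)
```

Because the gates lie in (0, 1), the operation can only attenuate features, never amplify them; re-weighting is purely a matter of which spatial locations are suppressed least.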

C. Similarity-wise Loss Function
To guide the learning of the RAN network, we propose a novel loss function that combines two terms: a mean squared error that guides the network to reconstruct the pixel-wise representation of the image, and a location-based term that strengthens lesion localization accuracy:

L = L_MSE + β L_loc,   (4)

where L_MSE is the mean squared error loss commonly used in image reconstruction tasks, L_loc is the location-based loss term, and β is a hyper-parameter balancing the contribution of L_loc. We first allow the network to learn to reconstruct, via L_MSE, an image estimate that is relatively close to the ground truth pixel-wise distribution, and then, via L_loc, refine that candidate image using location information.

1) Reconstruction Loss: L_MSE measures the pixel-wise intensity difference between the ground truth and reconstructed images and allows the network to learn an image estimate that is relatively close to the ground truth pixel distribution. However, L_MSE evaluates pixels in isolation, ignoring the overall image structure. Furthermore, L_MSE can be biased toward matching the background rather than the target of interest, a concern exacerbated in DOT image reconstruction of breast tissue with zero or more small isolated lesions, where the majority of pixels are background (healthy tissue).
2) Localization Loss: The location-based loss term L_loc is introduced to promote more accurate lesion localization. L_loc is based on the Euclidean distance transform, which maps an image to a similar grayscale image, except that the grayscale intensities of points inside the background region are changed to show the distance from each point to the closest foreground boundary. The distance transform is sensitive to small changes in the size and position of the foreground object (e.g., offsetting an object by a single pixel can generally change all pixel values of the distance transform image). We leverage this sensitivity to penalize deviations in lesion localization in the reconstruction. L_loc calculates the L1 distance between the distance transforms of the ground truth and reconstructed image pairs (Eq. (5), Fig. 2-C) and goes to zero when the reconstructed image overlaps the ground truth. The location-based distance transform loss is given by:

L_loc = (1 / (W·H)) Σ_{i,j} | DT(x)_{i,j} − DT(x̂)_{i,j} |,   (5)

where DT(·) denotes the Euclidean distance transform, x the ground truth image, and x̂ the reconstructed image.
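A minimal sketch of this localization loss, using SciPy's Euclidean distance transform, is given below; binarizing the images with a fixed threshold is our assumption for illustration, not a detail stated in the text:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Localization loss sketch: mean absolute difference between the
# Euclidean distance transforms of ground-truth and reconstructed images.
def loc_loss(x_true, x_pred, thresh=0.5):
    # Background pixels (below threshold) receive their distance to the
    # nearest lesion (foreground) pixel; lesion pixels receive zero.
    dt_true = distance_transform_edt(x_true < thresh)
    dt_pred = distance_transform_edt(x_pred < thresh)
    return np.abs(dt_true - dt_pred).mean()

img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0                        # a small synthetic lesion
shifted = np.roll(img, 5, axis=1)              # same lesion, offset 5 px
print(loc_loss(img, img), loc_loss(img, shifted) > 0.0)
```

Shifting the lesion by even a few pixels changes the distance values over the entire background, which is precisely the global sensitivity to position that L_loc exploits.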

D. Transfer Learning Network
Training a deep learning based model can require a large dataset to converge to optimal performance. However, this is a challenging requirement with relatively new imaging devices such as DOB probes. Physics-based simulation data can provide an alternative source of training samples. Similar to Yedder et al. [25] and Zhou et al. [38], where sensor data distribution adaptation is used to take advantage of cross-domain learning, we leverage transfer learning to bridge the gap between real-world acquisition and in silico simulated data. An MLP (Fig. 2-A) tackles the domain shift between the real measurements y_p collected from the probe and the in silico training measurements y_s. By minimizing the transfer learning loss L_TL over N_p real measurements, collected from the probe using a phantom solution, and their tissue-equivalent simulated data, the transfer learning module learns the mapping φ(y_p^i) ≈ y_s^i so that mapped measurements lie in the same domain as y_s:

L_TL = (1 / N_p) Σ_{i=1}^{N_p} | φ(y_p^i) − y_s^i |.   (6)

The final reconstructed image, at inference time, is computed as: x* = F^{-1}(φ(y_p)).
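A hedged Keras sketch of such a transfer-learning module φ follows; the hidden width and depth are our illustrative assumptions, as the text only specifies that φ is an MLP over 256-D measurements:

```python
import tensorflow as tf
from tensorflow.keras import layers

# MLP mapping a real probe measurement y_p (256-D) into the simulated
# measurement domain y_s, trained on paired phantom/simulated data.
phi = tf.keras.Sequential([
    layers.Input(shape=(256,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(256),                     # output lives in the y_s domain
])
phi.compile(optimizer="adam", loss="mae")  # L1, matching the mean-absolute
                                           # transfer-learning objective

# Training would pair phantom measurements with their simulated twins:
#   phi.fit(y_p_train, y_s_train, epochs=..., batch_size=...)
# At inference, reconstruction becomes x_star = F_inv(phi(y_p)).
print(phi.output_shape)                    # (None, 256)
```

Keeping φ separate from the reconstruction network means the (expensive) reconstruction model is trained once on simulation and only the small adapter needs real calibration data.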

III. DATASETS
To train our tomographic image reconstruction network, we collect training samples from synthesized tissue geometries with known optical properties using a physics-based simulation. We use our recently developed functional hand-held diffuse optical breast scanner (DOB-probe) [1] to collect real data from physical phantoms and patients for testing our method. No phantom or patient data is used during training and validation.

A. In silico Training Data
In silico training data pairs, which include images of optical tissue properties and their corresponding detector measurements, were used to train our reconstruction network (Fig. 2-B). A set of synthetic 2D images with different lesion shapes, sizes, and locations, representing optical tissue properties, discretized into finite element nodes (2D triangular meshes), and their corresponding forward projection measurements were synthesized (Fig. 3).

Fig. 3. Workflow of in silico training data generation for DOT image reconstruction. 2D triangular meshes modeling tissue with different lesion sizes and depths were generated using the Gmsh software [39] and fed to the time-resolved optical absorption and scattering tomography software (Toast++) for forward projection simulation, using 750 nm wavelength modulation and breast-like optical properties.
The Toast++ software suite [40] was used to simulate the frequency-domain forward model and generate projection measurements for each training mesh. Frequency-domain models, or intensity-modulated sources, use a frequency ω to modulate the intensity of the light. Frequency-domain systems provide information on the medium in the form of amplitude and phase data, which permits the recovery of both absorption and scattering coefficients. The forward model uses the finite element method to solve the diffusion equation that models light propagation in a diffusive medium. The diffusion equation is given by [41]:

−∇ · (D(r) ∇φ(r, ω)) + ( μ_a(r) + iω / c_m(r) ) φ(r, ω) = 0,   r ∈ Ω,

with boundary condition [41], [42]:

φ(r, ω) + 2 ζ D(r) ∂_ν φ(r, ω) = q(r, ω),   r ∈ ∂Ω,

where μ_a(r) and D(r) denote the absorption and diffusion coefficients at position r, respectively; c_m(r) represents the speed of light in the medium; q(r, ω) represents the light source distribution on the boundary ∂Ω of the domain Ω; ζ is a term incorporating the refractive index mismatch at the tissue-air boundary; ∂_ν denotes the derivative along the outward boundary normal; and φ(r, ω) is the generated photon density distribution at modulation frequency ω. Synthesized meshes define the medium and the distribution of parameters (absorption, scattering, refractive index).
Our functional hand-held probe geometry was modeled accurately in Toast++, as illustrated in Fig. 2-B, to obtain measurements that mimic the real values obtained by the DOB probe [1], [34]. The probe is composed of 2 LED light sources, which illuminate the tissue symmetrically and deliver near-infrared light to the body surface at different points, and 128 co-linear equi-spaced detectors that measure the back-scattered light emitted from the boundary. Correspondingly, the output of the forward model simulation is a 1D raw intensity diffraction vector y_s (1 × 256) resulting from the scattering of the illuminating light exiting the object.
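The shape of this measurement vector follows directly from the probe geometry; a trivial sketch (with random placeholder values) makes the stacking explicit:

```python
import numpy as np

# Each acquisition stacks the 128 detector readings from each of the
# 2 LED sources into the single 1x256 vector y_s used as network input.
S, D = 2, 128
readings = np.random.default_rng(0).random((S, D))   # per-source intensities
y_s = readings.reshape(1, S * D)
print(y_s.shape)                                      # (1, 256)
```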
A total of 4000 samples of (2D mesh, 256-D vector) data pairs are used. Realistic human breast tissue optical parameter distributions [43] were considered to mimic both healthy and cancerous tissue optical coefficients. The absorption coefficients of breast tissue vary between 0.0018 and 0.0025 mm⁻¹ for healthy tissue and between 0.09 and 0.16 mm⁻¹ for cancerous tissue. Similarly, the scattering coefficients vary between 0.8 and 1 mm⁻¹ for healthy tissue and between 1 and 1.6 mm⁻¹ for cancerous tissue. Given that relative optical parameter changes in healthy tissue are sufficiently small compared to optical perturbations at abnormalities, background optical parameters were assumed uniform in each sample (the background absorption coefficient is set to the average of breast tissue, similar to [44], [45]), while the abnormalities' size, depth, number, and respective optical parameters were varied. Furthermore, since the absorption coefficients in the lesion are larger than in the background, the absorption features due to the lesion emerge in the outgoing light at the boundary. Therefore, similar to [46], we focused our in silico dataset on examples that have been shown to be hard to reconstruct and more difficult to treat. The size of the reconstructed image is 6×6 cm with a resolution of 128×128 pixels.
The in silico data is designed to be consistent with existing prior art [12], [21], [47], where we vary the diameters of two-dimensional circular heterogeneous masses representing lesions. We vary the radius from 3-15 mm and tune the optical coefficients of the background to match those of biological non-lesion tissue. We randomize the number of lesions (0-3), as well as their size, depth, and location, to ensure a sufficiently rich dataset matching real-world conditions. While simulated data can in principle be infinite in size, we found the current size of our in silico dataset to be sufficient for our purposes, in that we observed diminishing returns in performance relative to the increased training time as the dataset size grew.
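Sampling one such synthetic scene per the stated ranges can be sketched as follows; uniform sampling is our assumption, since the text gives only the ranges:

```python
import numpy as np

# One synthetic scene: 0-3 circular lesions with radii 3-15 mm inside a
# 6x6 cm field of view, with optical coefficients in the stated ranges.
rng = np.random.default_rng(42)

def sample_phantom():
    lesions = [{
        "radius_mm": rng.uniform(3, 15),
        "center_mm": rng.uniform(0, 60, size=2),   # 6x6 cm field
        "mu_a": rng.uniform(0.09, 0.16),           # cancerous absorption
    } for _ in range(rng.integers(0, 4))]          # 0 to 3 lesions
    background_mu_a = rng.uniform(0.0018, 0.0025)  # healthy absorption
    return background_mu_a, lesions

bg, lesions = sample_phantom()
print(0.0018 <= bg <= 0.0025, len(lesions) <= 3)    # True True
```

Each sampled scene would then be meshed and passed through the Toast++ forward model to produce its paired measurement vector.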
A Gaussian noise layer is used before the dense layer to add independent zero-centered Gaussian noise to each measurement y_i. This noise model simulates the highly varying noise at each individual detector caused by interference of refracting light as well as sensor noise. We choose σ = 10% of the maximum signal value, consistent with previous work [12], [19], [21]. For completeness, we note that in addition to this in silico noise, the probe accounts for ambient noise, recorded by capturing a frame without active emitters, which is then subtracted from the real measurements (phantom and clinical) before reconstruction.
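The noise model amounts to the following (here applied offline to a clean measurement vector rather than as a network layer; the function name is our own):

```python
import numpy as np

# Additive zero-centered Gaussian noise with sigma = 10% of the maximum
# signal value, applied independently to each detector measurement.
def add_measurement_noise(y, sigma_frac=0.10, rng=None):
    rng = rng or np.random.default_rng()
    sigma = sigma_frac * np.abs(y).max()
    return y + rng.normal(0.0, sigma, size=y.shape)

y = np.linspace(0.1, 1.0, 256)                # clean simulated measurement
y_noisy = add_measurement_noise(y, rng=np.random.default_rng(0))
print(y_noisy.shape)                           # (256,)
```

Placing the equivalent noise layer inside the network (before the dense layer) resamples the noise every epoch, so the model never sees the same corrupted measurement twice.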

B. Phantom and Clinical Test Data
Two test datasets were used in our experiments: a physical phantom dataset to validate our method, and a clinical dataset to assess generalizability to a clinical setting.
1) Phantom Data: To analyze the performance of our approach in a controlled real-world experimental setting, breast-mimicking phantoms with known inhomogeneity locations are used (Fig. 2-A). The phantom dataset is composed of a solution of intralipid 20% emulsion (Fresenius Kabi Inc.) that mimics the optical properties of background breast tissue [1], [48]. The cancerous lesion is simulated by a 4 mm diameter tube filled with black ink (Higgins India Ink) that matches the optical properties of tumor lesions [48]. The DOB-probe (Fig. 2-B) collects the phantom measurements y_p. By varying the location of this simulated lesion, we can test the accuracy of the reconstruction method in a number of different scenarios.
2) Clinical Data: The data is collected from two participants diagnosed with breast cancer, following an ethics and institutional review board approved protocol, and subsequently de-identified to protect patient privacy. For each patient, 2D images were scanned from two locations: the first a sweep over the cancerous lesion location and the second a sweep over the same region on the opposite, healthy breast as a baseline. Figure 7-B illustrates the scan procedure. On average, five measurements were taken on each sweep. The precise location of the cancerous lesion was determined via mammography and ultrasound, which showed calcification over the area, and confirmed by biopsy.

C. Implementation
The model was implemented in the Keras [49] framework and trained for a total of 500 epochs on an Nvidia Titan X GPU, with early stopping if the validation loss had not improved in the last 10 epochs. The Adam optimizer with default parameters is used [50]. Hyper-parameters such as the number of filter channels, non-linearity, weight initialization, regularization term, and batch size were tuned by trial and error on our validation dataset during prototyping.
By optimizing the model's performance on the validation set, we set the hyper-parameters as follows: batch size 64; learning rate 10⁻⁴; and (Δ = 10, β₀ = 0.2, γ = 0.005), which describe the update equation of the hyper-parameter β in (4). We use an 80/10/10 training/validation/test split of the in silico training data. The model was trained for the first 100 epochs using L_MSE only, before introducing L_loc, to avoid noisy reconstructions. An L2 weight regularization of 10⁻⁴ was applied to the last residual layer of the network to avoid overfitting to the training data. Transfer learning was applied at evaluation time on the phantom and real data, having been trained using phantom measurements only. A separate phantom dataset was used to train the transfer learning network.
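One plausible form of this training schedule is sketched below. Note that the stepped update rule for β is our assumption: the text names the constants (Δ = 10, β₀ = 0.2, γ = 0.005) but not the equation itself.

```python
# Combined-loss schedule sketch: L_MSE alone for the first 100 epochs,
# then beta * L_loc is added, with beta stepped every delta epochs.
# The step rule below is OUR ASSUMPTION, not the paper's equation.
def beta_schedule(epoch, warmup=100, delta=10, beta0=0.2, gamma=0.005):
    if epoch < warmup:
        return 0.0                        # L_MSE only
    steps = (epoch - warmup) // delta     # one increment every delta epochs
    return beta0 + gamma * steps

print(beta_schedule(50), beta_schedule(100), beta_schedule(150))
# 0.0 0.2 0.225
```

Whatever its exact form, the warm-up serves the stated purpose: the network first learns a stable pixel-wise estimate before the location term starts reshaping it.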

D. Competing Methods
We contrast our results with those of eight state-of-the-art methods, four conventional and four deep learning based. The conventional methods are the least-squares approach (LS) [51], the Levenberg-Marquardt (LM) [40] and nonlinear conjugate gradient (NCG) [52] optimization-based methods, and an analytical method [34]. The deep learning based methods are MLP [21], FC-CNN [12], and our preliminary works FCN [24] and FCN-FJ [25]. Inverse solvers are constructed using first derivatives, as in the NCG-based approach, or second derivatives, as in the LM-based approach. The algorithms are implemented in the state-of-the-art public software packages NIRFAST for LS, and TOAST++ for the LM and NCG methods.
LS [51] and LM [40] are Jacobian-based approaches. The LM-based method employs the modified Levenberg-Marquardt algorithm [53], a trust-region approach, to minimize the objective function over the search space of model parameters. At each iterative step, the Jacobian matrix is recalculated and inverted, and the optical parameter estimates are updated. The algorithm converges when the reconstructed optical parameters have not improved over the last two iterations. A first-order Tikhonov penalty term is used to regularize the reconstruction algorithms. The regularization parameter, found by a parameter sensitivity study, is set to 0.01. We refer the reader to Appendix B-C for the mathematical formulation and parameter search details.
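One LM-style iteration of the kind described here can be sketched as follows. This is a generic textbook update, not the TOAST++ or NIRFAST implementation; the callables, the constant damping parameter, and the regularization operator `L` are illustrative assumptions.

```python
import numpy as np

def lm_step(x, forward, jacobian, y_meas, lam, L):
    """One Levenberg-Marquardt-style update for an inverse problem
    minimizing ||F(x) - y||^2 plus a first-order Tikhonov penalty.
    `forward` and `jacobian` are user-supplied model callables,
    `L` the regularization operator, `lam` the damping parameter."""
    r = y_meas - forward(x)            # residual
    J = jacobian(x)                    # recomputed at each iteration
    H = J.T @ J + lam * (L.T @ L)      # damped, regularized normal matrix
    dx = np.linalg.solve(H, J.T @ r)   # update direction
    return x + dx
```

Iterating `lm_step` on a well-posed linear toy problem converges to the measurement-consistent parameters, which is the behavior the trust-region variant controls via an adaptive `lam`.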
NCG is a gradient-based method [52]. The objective function describes the difference between the calculated forward measurement and the measured data. The inverse solver iteratively updates an initial distribution of the absorption and scattering parameters via minimization of the gradient of the objective function. Total variation (TV) is used to regularize the solution, and the regularization parameter, found by trial and error, is set to 0.1. The finite element method is used to discretize the computational domain into local elements. In this paper, the object geometry is represented by a polygon over which a series of disjoint triangles is generated using the GMesh software suite [39]. Anisotropic TV regularization [54], as implemented in the Toast++ software suite [41], was used. The reader is referred to Appendix B-C for the formulation and parameter settings.
The analytical method in [34] is a difference approach that relies on the relative change between optical perturbations at abnormalities and in healthy tissue. Homogeneous data are collected using a phantom and used as a reference against which inhomogeneities in tissue are measured. The difference in absorption between the reference (normal) and abnormal tissues provides the imaging contrast for tissue diagnostics. Hypothetical curved paths, as explained in [34], are used to recover the absorption coefficient from each illuminated source. The relative change is back-scattered, using a closed-form solution of the diffusion equation, to recover the optical coefficients. Hence, the absorption coefficient at each coordinate (x, y) is calculated as a superposition of back-scattered values along the paths of illuminated light from each source. The computed absorption coefficients are then used to recreate two-dimensional cross-sectional images for each wavelength. For the detailed mathematical model, we refer the reader to [33], [34].
The MLP method [21] is a multilayer perceptron with two fully connected layers. The FC-CNN [12] uses a fully connected layer followed by a CNN with an encoder-decoder structure of three convolutional layers. The hyperbolic tangent (tanh) is used as the activation function for the fully connected layer and the first two convolutional layers, and the rectified linear unit (ReLU) for the last convolutional layer [12]. The FC-CNN implements an architecture similar to our baseline approach FCN [24], the difference lying in the number of convolutional layers and the filter and kernel sizes. Both architectures were re-implemented as specified in their introducing papers and trained using the mean squared error between input and output as the loss function. Our preliminary works FCN [24] and FCN-FJ [25] were briefly described in Section II-A.

IV. RESULTS
To measure the quality of the results, we consider: (i) the lesion localization error, i.e., the distance between the centres of the lesions in the ground truth and reconstructed images; (ii) the peak signal-to-noise ratio (PSNR, 10 log_10(MAX_I/MSE), where MAX_I is the maximum intensity value over all pixels); (iii) the structural similarity index (SSIM) [55]; and (iv) the Fuzzy Jaccard [56].
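The first two metrics can be sketched as follows; the intensity-centroid definition of the lesion centre is an illustrative assumption, and the PSNR follows the definition given in the text.

```python
import numpy as np

def localization_error(recon, truth):
    """Distance between lesion centres, taken here as the intensity
    centroid of each image (a stand-in for the paper's centre definition)."""
    def centroid(img):
        ys, xs = np.indices(img.shape)
        w = img / img.sum()
        return np.array([(ys * w).sum(), (xs * w).sum()])
    return np.linalg.norm(centroid(recon) - centroid(truth))

def psnr(recon, truth):
    """PSNR = 10 log10(MAX_I / MSE), with MAX_I the maximum pixel intensity."""
    mse = np.mean((recon - truth) ** 2)
    return 10 * np.log10(truth.max() / mse)
```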

A. Results on Synthetic Data
Trained on the in silico data and tested on a separate test set of 80 images, we visually evaluate the reconstruction of our Multitask-Net method and compare results with those of LS [51], NCG [52], MLP [21], and FC-CNN [12]. In Fig. 4, we show results on in silico samples with different lesion sizes and depths. Interestingly, while both the LS and NCG methods are able to detect lesion presence, LS outperforms NCG in reconstructing lesion depth and size. Images reconstructed by the MLP method are quite noisy and give only a rough approximation of the lesion location. Further filtering via a CNN enables better results for FC-CNN. Overall, our model provides more accurate results, as it is able to differentiate between different lesion sizes and to correctly recover lesion location, dimensions, and absorption coefficients despite the limited-angle condition. The y-axis scales highlight our model's superiority in reconstructing the optical coefficient values. The difference in performance can be explained by the ability of the convolution operators to extract more comprehensive contextual information and synthesize more complex features.

Fig. 4. Qualitative reconstruction performance of our model compared to the NCG, MLP, and FC-CNN models on in silico samples with various ground truth lesion sizes, locations, and absorption coefficients (a-f). Different y-axis scales were needed to highlight the models' performance in reconstructing the optical coefficient values. Evidently, the images reconstructed by our method are more accurate than those reconstructed by the other approaches. The MLP approach presents a rough reconstruction that is further improved by FC-CNN. The NCG method reproduces the lesion location, while the lesion depth is less accurate. The LS method provides the best overall results among the conventional methods.
The last two columns of Fig. 4 (e-f) show reconstruction performance for a healthy tissue case and for a deep lesion case (∼5 cm in depth). Note the model's ability to capture and reconstruct a tumor-free signal. We also illustrate the limits of our contribution: deep lesions cannot be distinguished from the background. This is expected given the rapidly decreasing strength of the reflected signal, as photons scatter many times along random paths before being absorbed by the tissue. A recent study [57] on self-diagnosis confirmed by imaging reported a mean lesion depth of 1.6 cm with a maximum of 6 cm (163 patients, 173 lesions). While this does not preclude deeper lesions in larger population studies, it shows that our performance is aligned with real-world assumptions on lesion depth.

B. Results on Phantom Data
Trained on the in silico data and tested on the phantom dataset, we visually compare our proposed reconstruction of absorption coefficients to the competing methods' results on sample phantom cases. We compare to the analytical method [34], the LS-based method [51], the LM-based method [40], the NCG-based method [52], the MLP model [21], FC-CNN [12], and our preliminary works FCN [24] and FCN-FJ [25].
1) Qualitative Results on Phantom Data: In Fig. 5 and Fig. 6, we contrast our results with the state of the art on phantom data. Our DOT reconstruction model is mainly interested in changes of the absorption parameter, as these relate to the hemoglobin concentration changes that mark the presence of anomalies [4]. The analytical approach, while detecting the presence of the lesion, only approximates its location by a rough margin and fails to provide the precise reconstructed optical coefficient values required for proper assessment of the lesion (Fig. 5-Analytical). The LM- and NCG-based methods (Fig. 5-LM, NCG) provide good reconstructions, as they are able to detect and localize lesions; their lower performance on lesion size can be attributed in part to noise and artifacts in the real data. MLP provides a rough approximation of lesion location and size with very low per-pixel values of the reconstructed absorption coefficients (Fig. 5-MLP). FC-CNN and FCN-FJ, on the other hand, localize the lesion in a single concentrated area with some error in lesion size and optical coefficients (Fig. 5-FC-CNN and Fig. 6-FCN-FJ (L_MSE + L_FJ)). Adopting the new loss term L_loc clearly reduces the lesion localization error (Fig. 6-FCN (L_MSE + L_FJ + L_loc)); small artifacts are a disadvantage of this model. The proposed RAN consistently obtains results that minimize both location and size error, while misestimating the values of the reconstructed absorption coefficients (Fig. 6-RAN).
Observe how incorporating both the new loss term L_loc and the RAN in the Multitask-Net significantly reduces the artifacts, which could otherwise compromise diagnosis, while improving lesion localization and the reconstruction of optical coefficients (Fig. 6-Multitask-Net). From these results, it is clear that multitask learning achieves superior performance in both tasks without compromising subtask performance.
2) Quantitative Results on Phantom Data: Table I presents the results on the phantom dataset. The architecture of Yoo et al. yields results close to those of our preliminary work FCN [24], given the similarity between the architectures, which differ in model depth and filter sizes. It is worth mentioning that some artifacts are due to the shift in distribution, which we overcome using the transfer learning model.
Rows 5 to 8 present an ablation study showing the contribution of each element of our model. While the distance transform loss reduces the lesion localization error, the attention network helps in terms of reconstruction similarity and false-positive reduction. We obtain marked improvements in localization error (up to 18%), as well as improvements in SSIM (up to 4%), Fuzzy Jaccard (up to 5%), and PSNR (up to 3%). Interestingly, we found that the RAN performs better than FCN+L_loc in terms of the Fuzzy Jaccard metric; hence we can argue that the RAN helps to better focus on the lesion area and reduces false-positive pixels.
The computational time for each reconstruction in this study is reported in Table I (runtime). Once the deep learning model is trained, reconstruction takes on the order of a few milliseconds, three orders of magnitude faster than the conventional methods, thus enabling real-time applications. Note that the new loss function has little to no effect on the inference time.

C. Results on Clinical Patient Data
Figure 7 presents the reconstructed cross-sectional images of absorption coefficients of breast scans from two patients. The probe is placed above the identified lesion in the cancerous breast and at the same location on the opposite, healthy breast of each patient (Fig. 7-A). We compare with the analytical [34], NCG [52], MLP [21], and FC-CNN [12] methods, reporting visual results in Fig. 7-B,C and quantitative results in Table II. As a partial ground truth, both patients underwent ultrasound scans to estimate lesion dimensions, showing irregular masses measuring 22 × 21 × 23 mm for the first patient and 19 × 16 × 18 mm for the second, with biopsies confirming that the lesions are cancerous.
In healthy tissue (Fig. 7-B:i, C:i), only the background signal is reconstructed by almost all methods. In cancerous tissue, all competing approaches and our proposed method detect the lesion. The small artifacts (false positives) in the healthy tissue reconstructed by MLP, FC-CNN, and Multitask-Net (Fig. 7-B:i, C:i) can be attributed in part to the shift in distribution between the training and test datasets. While the transfer learning module reduces this gap, it cannot eliminate it completely for the real patient data distribution. We observed that such artifacts could be removed by post-processing, e.g., via thresholding by a qualified clinical operator. At the same time, these failure cases highlight the need for more real-life patient data to better train the transfer learning module. Consistent with the results on phantom data, the analytical, NCG, and MLP methods produce only a rough approximate outline of the tumor, and the methods only partly agree on the approximate center of the predicted lesion.
Table II reports the ratio of the reconstructed area to the ground truth lesion size, averaged over the different probe sweeps for each patient. The areas of the thresholded regions (> 80% binary threshold), given the pixel dimensions of the image, were compared to the ground truth area. We observe that, in contrast to the other approaches, our method reconstructs the lesions more faithfully with respect to their relative sizes for patient I, while it underestimates the lesion size for patient II. In addition, our method obtains a more precise delineation of the lesions.
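The area-ratio computation described above can be sketched as follows; the pixel size and the exact threshold convention (80% of the image maximum) are assumptions for illustration.

```python
import numpy as np

def area_ratio(recon, gt_area_mm2, pixel_mm=1.0, level=0.8):
    """Ratio of reconstructed lesion area to ground-truth area.
    The lesion region is the set of pixels above `level` (80%) of
    the image maximum, converted to mm^2 via the pixel size."""
    mask = recon > level * recon.max()
    recon_area = mask.sum() * pixel_mm ** 2
    return recon_area / gt_area_mm2
```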
While the transfer learning network bridges the gap between phantom (training) data and real-world data to some extent, the results on real-world data illustrate that high variability in, e.g., illumination and noise levels can still mislead the network. Given that no two tumors are alike, tumor heterogeneity leads to markedly different acquisition signatures that are not necessarily present in the training data.

D. Multitask versus Single-task
We first tested how the combination of different loss functions influences the reconstruction and localization performance.

E. Effect of Network Dimensions
Increasing the dimensions of the network can improve performance, yet results in larger hardware and training time requirements. With our application domain in hand-held devices, it is critical to evaluate the trade-off between network dimensions and performance. With an exhaustive architecture search beyond the scope of this work, we limit ourselves to varying the number of layers and channels. Tables IV and V show experimental results obtained on the in silico dataset.
First, we test the variation of model performance over different depths of the network.We use our proposed model as a baseline and constructed a shallower and a deeper model by adding and subtracting two residual blocks with 32 channels from the baseline, respectively.
While the deeper model performs better than the shallower network (Table IV), performance did not improve (∼1% variation in all metrics) when adding more residual attention layers to the baseline model, at the price of more parameters and a risk of overfitting. A lighter architecture is more convenient for deployment in low-power handheld devices, requiring a fine balance between accuracy and latency.
Note that the fully connected layer, used to back-project the sensor data into the image domain, non-linearly increases the memory requirements of the neural network, thereby limiting the maximum number of additional convolution layers. This layer is deemed compulsory for an inverse scattering problem where a direct analytical reconstruction is not well defined, unlike in magnetic resonance imaging or computed tomography, where a well-defined inversion exists via the Fourier or Radon transform [58].
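The memory cost of the dense back-projection layer can be made concrete with a quick parameter count; the layer sizes below are illustrative.

```python
def fc_params(n_sensors, img_side):
    """Number of weights in a dense layer mapping a sensor vector to an
    img_side x img_side image estimate (bias terms omitted)."""
    return n_sensors * img_side * img_side

# 128 sensors mapped to a 128 x 128 image estimate:
# 128 * 128 * 128 = 2,097,152 weights, versus a few thousand for a
# typical 3x3 convolutional layer -- the dense back-projection
# dominates the network's memory footprint.
```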
To observe the dependency on the number of convolution channels, we additionally trained the network with varying numbers of channels. We observed only marginal variations in network performance beyond 32 channels: 1% and 1.5% in the SSIM and Fuzzy Jaccard metrics, respectively (Table V). This can be explained by the size and shape of the lesions, which are generally small compared to the background and smooth, and can therefore be fitted by a small number of feature maps.

F. Effect of Varying the Number of Sensors
Given that our probe has only 2 light sources and 128 sensors (detectors), we tested the robustness of our model to variation in the number of sensors (co-linear, equispaced detectors) by reducing it gradually from 128 to 16. To train the network for the different source-detector configurations, we generated different sets of training data and changed the input sizes of the network accordingly while maintaining the overall network structure. While it is out of the scope of this work to test the results using the DOB-Scan probe (which would require new hardware design, testing, and manufacturing), we report experimental performance results for the in silico dataset in Table VI and visual inspection results in Fig. 8.
Note the high contrast between the background and the lesion area, despite the increased difficulty of accurately recovering lesion size and location in sparse signal conditions (smaller numbers of detectors). The visual results are on par with the quantitative results in Table VI, where we notice an increase in localization error and a slight decrease in SSIM.

V. DISCUSSION

A. Attention-directed Lesion Reconstruction
DOT reconstruction leverages the fact that healthy tissue has near-uniform optical coefficients, whereas lesions have optical coefficients that diverge markedly and are characterized by high variability. Encoding a highly variable signal (lesion) requires more information and computational complexity than a near-uniform signal. The high variance is explained in part by high inter-tumor heterogeneity, a leading cause preventing effective treatment and prognosis [46]. Furthermore, for the clinician it is sufficient that the network correctly recognizes certain areas as healthy tissue, without requiring precise pixel-wise reconstruction. In contrast, the lesion needs not only to be detected, but also located and reconstructed precisely in order to enable diagnosis, prognosis, and successful treatment recommendation. Therefore, directing network attention to the lesion is one way of dedicating more computational power where it is required, without sacrificing the detection of healthy tissue.
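A minimal sketch of spatial attention of the kind described in Fig. 2's inset (squeeze the feature map along the channels, excite spatially via a sigmoid map) follows; the 1x1 projection weights `w` are random placeholders for learned parameters, and the single-example layout is an assumption.

```python
import numpy as np

def spatial_attention(U, w, b=0.0):
    """Spatial squeeze-and-excitation: a 1x1 projection `w` squeezes
    the C channels of feature map U (H x W x C) to one map, a sigmoid
    turns it into per-location importances sigma(i, j), and U is
    rescaled spatially so the network can focus on lesion regions."""
    q = U @ w + b                      # (H, W): channel squeeze per pixel
    sigma = 1.0 / (1.0 + np.exp(-q))   # spatial importance map in (0, 1)
    return U * sigma[..., None]        # excite: rescale every channel map

H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
U = rng.standard_normal((H, W, C))
out = spatial_attention(U, rng.standard_normal(C))
```

Because sigma lies in (0, 1), the block can only attenuate locations, which is what lets it suppress (near-uniform) background while passing lesion features through.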

B. Network Generalization
The generalization capability of our network is illustrated by the fact that training occurs on simulation data, yet performance is strong on phantom and real data. Moreover, unlike the analytical approach, which requires a reference measurement of healthy tissue to compute the inverse image, our network, similar to that of Yoo et al. [12], is designed to learn directly from the collected measurements and does not require any reference or prior conditions that might bias the search space. In a clinical setting such as breast cancer screening, such a reference measurement from a homogeneous background may be nontrivial to obtain. Therefore, our model can generalize better to unseen conditions without additional pre- or post-processing. In this work, we obtained good results when training only on in silico data; we expect improved results as more real data from phantoms and patients are collected. Finally, we used the model architecture and implementation details described in Section III-B.3, with good initial results. However, it is likely that other architectures yielding improved results can be found, e.g., via neural architecture search [59].

C. Limitations
Although this study provides a proof of concept of the potential advantages of applying multitask deep learning to DOT image reconstruction, it has some limitations. First, the reconstruction of deep lesions (lesion depth ≥ 35 mm) remains a challenging task in DOT, especially with a linear probe, as it is harder to differentiate their signal from the tumor-free tissue signal. As a result, the reconstructed image can be biased towards tumor-free tissue. Similarly, while our model is able, in most cases, to detect lesion presence, the reconstruction of multiple lesions exhibits some weaknesses, as shown in Fig. 9. Specifically, occluded lesions and lesions situated at the border of the probe can be mis-reconstructed: only one lesion is reconstructed, the lesion location deviates, or tumor-free tissue is reconstructed (a false negative). This problem is mainly due to the weakness of the collected signal under the limited-angle condition. While multiple sweeps or variations in the scanning angle can help in the latter case, they are unlikely to help with deep lesions. Second, since the real dataset contains only 2 patients with few sweeps, it is not an exhaustive list of all cases, and the proposed method therefore requires validation on larger and more diverse datasets. This is, however, not a technical limitation of the model, since more tests can be done as more real data become available.

VI. CONCLUSION
We presented the first work to effectively apply multitask learning to DOT reconstruction, combining an appearance similarity loss and a reconstruction loss with a spatial attention mechanism, with application to reconstructing lesions in breast cancer tissue. We demonstrated improvement in localization error over the state of the art and presented a step forward in DOT image reconstruction and in the use of machine learning for image reconstruction. Leveraging physics-based simulators of light propagation to generate realistic synthetic datasets provided an alternative source of large and diversified training data. The critical addition of transfer learning ensures that our method is not limited by the inherently small real-world datasets, while still performing with high accuracy on such datasets, where more complex sources of artifacts and noise are present. Nevertheless, we noted some performance differences on real-world data; as more patient data are collected, it will be critical to better understand and address their causes. Finally, we showed that our method is sufficiently fast to ensure real-time processing, enabling in situ deployment in clinical settings.
APPENDIX A
NETWORK FORMULATION DETAILS

The fully connected layer computes

o_fc = g(W y),

where g(·) is the neuron activation function, W the learned weight matrix, and y the measurement vector input to the network. Hence, the fully connected layer learns a weighted combination of the different receptive sensors, based on the collected signal y, and its reshaped output provides a first image estimate that is subsequently enhanced by the following layers. Filters (kernels) pass over the image estimate and transform it based on the weights in each filter. The final feature-map values at each convolutional layer (cn) are then computed according to

o_cn[m, n] = Σ_j Σ_k h[j, k] · o_{cn−1}[m + j, n + k],

where the input image is denoted by o_{cn−1}[m, n], the kernel by h, and m and n index rows and columns, respectively. The in silico reconstructions in Fig. 4 show that consecutive convolution layers are very important to recover the accurate size and location of the optical anomalies. Yoo et al. [12] adopted a similar architecture and theoretically justified it based on the analysis of deep convolutional framelets [23].
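The convolutional feature-map formula above can be sketched directly; the single-channel, 'valid'-region layout and the index convention are assumptions for illustration.

```python
import numpy as np

def conv2d_valid(o_prev, h):
    """Feature-map computation per the formula:
    o_cn[m, n] = sum_j sum_k h[j, k] * o_{cn-1}[m + j, n + k]
    ('valid' region only, single channel, no activation)."""
    kh, kw = h.shape
    H, W = o_prev.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(h * o_prev[m:m + kh, n:n + kw])
    return out
```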

APPENDIX B IMAGE RECONSTRUCTION REGULARISATION DETAILS
We consider the regularized output least-squares approach to the inverse problem in equation (1), where x represents the sought optical parameter. The first-order Tikhonov penalty term takes the form [53]:

ψ(Δx_k) = λ Δx_kᵀ LᵀL Δx_k,

where LᵀL ∈ R^{N×N} denotes the Laplacian, Δx_k = x_k − x_0, and x_0 is an initial estimate. We define LᵀL as:

[LᵀL]_{ij} = n(i) if i = j; −1 if i and j are neighbours; 0 otherwise,

where n(i) is the number of neighbours of basis component i. Four-connected pixel neighbourhoods in 2D are used. The anisotropic total variation penalty term takes the form [54]:

ψ_TV(μ) = ‖D_x μ‖_1 + ‖D_y μ‖_1,

where D_x is a matrix of size M × N representing the discrete partial derivative of μ, the sought optical parameter, along the x direction, and D_y is the derivative matrix along the y direction. M is the number of triangles in the mesh and N the number of nodes.
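The LᵀL operator for a 4-connected pixel grid can be constructed explicitly as follows; this dense construction is a sketch (a sparse matrix would be used in practice).

```python
import numpy as np

def laplacian_LtL(rows, cols):
    """L^T L for a first-order Tikhonov penalty on a 2D pixel grid
    with 4-connected neighbourhoods: diagonal entries n(i) (the number
    of neighbours of pixel i), -1 for neighbouring pairs, 0 elsewhere."""
    N = rows * cols
    A = np.zeros((N, N))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    j = rr * cols + cc
                    A[i, i] += 1      # accumulate n(i) on the diagonal
                    A[i, j] = -1      # neighbouring pair
    return A
```

The resulting operator is symmetric with zero row sums, the defining properties of a graph Laplacian.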

APPENDIX C HYPER-PARAMETER SELECTION AND SETTINGS
To validate the conventional algorithms quantitatively and select the best reconstruction parameters, we conducted a study using numerical phantoms and report the results in Table VII. The reference mesh coarseness was set to 1.5 mm in all experiments, as we noticed that decreasing it only resulted in longer reconstruction times with negligible to no improvement in reconstruction accuracy.

Fig. 1 .
Fig. 1. Example DOT image reconstruction workflow. A DOT-Scan probe consists of two near-infrared sources for tissue illumination and a set of detectors registering the back-scattered photons. Light propagation and scattering in the tissue are schematized. The DOT reconstructed image shows the optical coefficients in the tissue. The success of diagnosis and treatment relies on accurate reconstruction and estimation of the optical properties of the medium.

Fig. 2 .
Fig. 2. The overall architecture of the proposed model. (A) Transfer learning network used for adapting the data distribution between in silico measurements y_s and real data y_p collected from the probe. (B) In silico training data pair generation simulating our probe's specification [34]. The red arrow represents the simulated lesion. (C) Architecture of the proposed deep residual attention model. The squeeze-and-excitation block (inset) squeezes a given feature map U along the channels and excites spatially. The output of the sigmoid layer σ(·) corresponds to the relative importance of the spatial information (i, j) of the feature map U. All feature maps produced by the convolutional layers are set to size 128 × 128 using (3×3) followed by (5×5) kernels. The similarity-wise loss L_loc, leveraged to enforce localization accuracy, computes the mean absolute error between the distance transforms of the reconstructed image and the ground truth.
Here D(a_ij, S) refers to the Euclidean distance between the image pixel location a_ij and the lesion boundaries of a shape S. Lesion boundaries are computed during training by simply thresholding the training images at the mean plus standard deviation (μ + σ) of each image.
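A minimal sketch of the distance-transform loss described here follows, assuming a brute-force Euclidean transform; the thresholded mask pixels stand in for the lesion boundary, which is an illustrative simplification.

```python
import numpy as np

def distance_transform(img):
    """Brute-force Euclidean distance transform: for each pixel (i, j),
    the distance D(a_ij, S) to the nearest lesion pixel, with the lesion
    mask obtained by thresholding at mu + sigma of the image."""
    mask = img > img.mean() + img.std()
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return np.zeros(img.shape)
    ii, jj = np.indices(img.shape)
    d = np.sqrt((ii[..., None] - ys) ** 2 + (jj[..., None] - xs) ** 2)
    return d.min(axis=-1)

def l_loc(recon, truth):
    """L_loc: mean absolute error between the two distance transforms."""
    return np.mean(np.abs(distance_transform(recon) - distance_transform(truth)))
```

The loss is zero when the predicted and ground-truth lesions coincide and grows with their spatial offset, which is what drives the localization improvement.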

Fig. 5 .Fig. 6 .
Fig. 5. Qualitative reconstruction performance of our model compared to state-of-the-art techniques on phantom samples with known ground truth lesion locations. The x-axis and y-axis show lesion location and depth in mm, respectively. The color bar shows the reconstructed absorption coefficient values. Different y-axis scales are unfortunately needed and are indicative of the problematic performance of some models in reconstructing the optical coefficient values. The image reconstructed using MLP is equivalent to a back-projection operation and gives a rough approximation of the lesion location, while the precision of the reconstructed absorption coefficients can be low. Adding convolution layers helps to better filter noise and reduce false positives, while still mispredicting the correct absorption coefficient values required for proper diagnosis. The LM and NCG methods provide good reconstructions with accurate lesion presence detection. Lesion size misprediction can be attributed to real data noise and artifacts.

Fig. 7 .
Fig. 7. (A) Clinical data scan procedure. Dashed blue lines delineate the different probe sweeps. (B-C) Qualitative reconstruction performance of our model compared to the analytical, NCG, MLP, and FC-CNN methods on different real patient data. (i) Reconstruction of a candidate's healthy breast tissue versus a cancerous lesion over different probe positions (ii-iii). Our proposed method's results (last column) are contrasted with those of the other approaches. Note how the size of the lesion reconstructed by our method more faithfully reflects the relative difference in size (approximate lesion sizes obtained with ultrasound; patient 1 (B): 22 × 21 × 23 mm, patient 2 (C): 19 × 16 × 18 mm). The minimum of the color scale is set to the background value; the maximum is set dynamically to report visually consistent results while accommodating the very high variance between images that would otherwise prevent any comparison.

Fig. 8 .
Fig. 8. Effect of varying the number of detectors on the qualitative results. While the model is still able, in most cases, to detect lesion presence, lesion size and correct localization are impacted by the reduction in the number of sensors and the increased sparsity of the collected signal.

Fig. 9 .
Fig. 9. Reconstruction of multiple lesions with different sizes and proximities. (A) While the model is able, in most cases, to detect lesion presence, occluded, edge-located, and deep lesions can be mis-reconstructed. (B) Some failure examples where only one of the lesions is reconstructed.

TABLE I
QUANTITATIVE RESULTS ON 32 PHANTOM EXPERIMENTS. METHODS WITH A RUNTIME BELOW 0.05 S PER IMAGE (MORE THAN 20 IMAGES PER SECOND) ARE CONSIDERED REAL-TIME CAPABLE. LOSS: L_MSE | L_FJ | L_Loc.

TABLE II
QUANTITATIVE RESULTS ON REAL DATA EXPERIMENTS.

Using L_MSE and L_loc compared to using L_MSE in isolation increases SSIM by 6% and Fuzzy Jaccard by 7% with the best weight. Table I clearly shows that multitask learning obtains optimal results for all metrics except for a marginal drop in Fuzzy Jaccard of 1%.

TABLE III
SENSITIVITY STUDY: THE EFFECT OF β ON PERFORMANCE.