MOTION ESTIMATION OF AN UNDERWATER PLATFORM USING IMAGES FROM TWO SONAR SENSORS

Many underwater applications require precise localization. This can be achieved by techniques such as image registration applied to two consecutive acoustic images obtained by a sonar. However, this can be a complex task to implement in real time. The use of deep learning (DL) techniques for motion estimation can significantly reduce the processing complexity and achieve high-accuracy position estimates. In this paper we investigate the performance improvement when using multiple sonar sensors compared to a single sensor. The DL network is trained using images generated by a sonar simulator. The results show an improvement in the estimation accuracy when using two sensors.


I. INTRODUCTION
For exploration and surveying in underwater environments, autonomous underwater vehicles (AUVs) and remotely operated underwater vehicles (ROVs) are widely used [1]. The operation of such vehicles requires an accurate estimation of their position relative to the seafloor. Micronavigation techniques have been developed for this purpose [2]-[6]. Motion estimation based on optical images is a well-known approach in terrestrial [7], [8] and aerial [9], [10] applications. Recently, this approach has been used to estimate the trajectory of an underwater platform by applying a deep learning (DL) network to a sequence of images from a camera [11]. However, optical images are not reliable in underwater environments, where the visibility can be poor [12].
The work [13] presents a method for attitude and trajectory estimation using sonar (acoustic) images. This method is capable of obtaining accurate position estimates by analyzing the pixel displacement between consecutive images. However, due to its complexity, this method is difficult to implement in real time. In [14], we presented a method based on DL networks to estimate the motion of an underwater platform and its trajectory using sonar images. The method significantly reduces the complexity and processing time compared to the method in [13]. The low processing time makes the method in [14] suitable for real-time applications. The DL networks in [14] achieve millimeter accuracy in positioning between two sonar images. However, higher estimation accuracy is required for some applications, e.g., synthetic aperture sonars [15]. In [14], a single sonar sensor was considered for motion estimation. The purpose of this paper is to investigate whether using two sensors separated from the sonar transmitter can improve the accuracy even further.
Fig. 1: Sonar FoV parameters and the coordinate system relative to the sonar. The motion in forward and backward directions corresponds to the y-axis, the motion in sideways direction corresponds to the x-axis and the rotation around the z-axis is represented with the parameter θ. The pitch angle is measured from the xy-plane, which is parallel to the seafloor.
J. E. Almanza-Medina acknowledges financial support from CONACyT. The work of Y. Zakharov and B. Henson was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) through Grants EP/R003297/1 and EP/V009591/1.
A drawback of the DL approach is that it requires large volumes of labeled data for training the networks. In [14], synthetic images are generated by the sonar simulator from [16] to solve this problem. In this work, we modify the sonar simulator from [16] to allow acoustic images to be generated for more complicated sonar configurations and use these images for training and validation of DL networks.

II. SONAR SIMULATOR
The sonar simulator proposed in [16] and used in [14] for training DL networks is built upon the development software Unity [17]. It is based on a ray-tracing technique to generate the images. The sonar has a field of view (FoV) that is determined by an aperture angle, elevation angle, maximum range and pitch angle as shown in Fig. 1. The sonar images are generated following a hop-and-generate process, where the simulated platform generates an image at a particular position in a simulated environment, then it moves to a different position and generates another image and so on. When an image is generated, the position and orientation of the sonar sensor in the underwater environment are also stored.
The simulator in [16] only uses a single sonar sensor with the same transmit-receive antenna. We expanded the sonar simulator to separate the sonar transmitter from the receiver. We also added the capability to simulate a sonar with multiple sensors in different positions at the same time. A transmitter (shown as the dark green beam in Fig. 2) illuminates the environment while the sensors generate the sonar images. Then, the platform moves to another position by following a randomly generated trajectory. An example of a simulated environment is shown in Fig. 3. The procedure for simulating the underwater environment is the same as that used in [14], where multiple scenarios with rocks randomly positioned on the seafloor were created. The generated images from each sensor and the position in the scenario where they were acquired are stored in data sets for training and validation of DL networks.
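The hop-and-generate process described above can be illustrated with a minimal Python sketch of the pose bookkeeping only; it does not reproduce the Unity ray tracing. The function names, the uniform sampling of each step, and the planar pose representation (x, y, heading) are assumptions made for illustration. The step limits are the maximum displacements quoted in Section IV-A.

```python
import math
import random

MAX_STEP_CM = 2.0    # maximum translation between two images (Section IV-A)
MAX_ROT_DEG = 0.45   # maximum rotation between two images (Section IV-A)

def random_step(rng):
    """Draw one displacement (dx, dy, dtheta). Uniform sampling is an assumption."""
    return (rng.uniform(-MAX_STEP_CM, MAX_STEP_CM),
            rng.uniform(-MAX_STEP_CM, MAX_STEP_CM),
            rng.uniform(-MAX_ROT_DEG, MAX_ROT_DEG))

def apply_step(pose, step):
    """Move a world pose (x, y, heading_deg) by a step given in the platform frame."""
    x, y, heading = pose
    dx, dy, dtheta = step
    c = math.cos(math.radians(heading))
    s = math.sin(math.radians(heading))
    # Rotate the platform-frame translation into the world frame, then add it.
    return (x + c * dx - s * dy, y + s * dx + c * dy, heading + dtheta)

def hop_and_generate(n_images, seed=0):
    """Return the list of poses at which images would be generated and stored."""
    rng = random.Random(seed)
    pose = (0.0, 0.0, 0.0)
    poses = [pose]
    for _ in range(n_images - 1):
        pose = apply_step(pose, random_step(rng))
        poses.append(pose)
    return poses
```

In a full pipeline, each pose in the returned list would be paired with the image rendered by every sensor at that pose.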

III. DL NETWORKS FOR MOTION ESTIMATION
In [14], several DL networks were evaluated for motion estimation using sonar images. We found that the PoseNet network [18] is well suited for the task after optimizing the network parameters to obtain the best possible performance. The architecture of the optimized PoseNet is shown in Fig. 4. The input of the network is an image made of two consecutive sonar images. This input is connected to a series of 9 convolutional layers with stride 2. Each of the first 8 convolutional layers is followed by a ReLU activation layer and a batch normalization layer before being connected to the next convolutional layer. The output of the last convolutional layer is connected to an average pooling layer with an averaging window of 4, then an output regression layer is connected to generate the motion estimates in 3 degrees of freedom (DoF). The regression layer uses the Mean Squared Error (MSE) loss function. In this paper we continue to use this network to validate the motion estimation using the newly proposed sonar configuration.

IV-A. Sonar configurations
With the modified simulator, two sonar configurations were built:
1) Two sensors: One transmitter (dark green beam in Fig. 2) and two sonar sensors (two light green beams in Fig. 2) are placed on an underwater platform, looking sideways, perpendicular to the forward motion of the platform, as shown in Fig. 2. The distance between the transmitter and each sensor is 50 cm. The FoV of the transmitter and the sensors is 29° × 14° (azimuth and elevation angles, respectively), with 96 beams in the azimuth and a pitch angle of 35°. This is based on the parameters of the Didson 300 sonar [19].
2) One sensor: This configuration is the same as used in [14]. One transmitter with a single sonar sensor is used with no separation between them. They both have the same FoV as described in the case of two sensors.
Three DoF are considered for the motion. The displacement of the sensors between consecutive images is described by a vector ∆ = [∆x, ∆y, ∆θ], representing translations along the x- and y-axes and rotation around the z-axis (denoted by θ), respectively (see Fig. 1). For this work, the maximum displacement between two images is 2.0 cm and 0.45° for the translations and rotation, respectively. The height of the sensor from the seafloor is 2.5 m. The sonar image size is 512×96 pixels and the pixel values are integer numbers in the range from 0 to 255.
Examples of sonar images generated by the sensors are shown in Fig. 5. Since the two sensors are situated on either side of the transmitter (as shown in Fig. 2), they have slightly different points of view of the scenario. In the images, it can be seen that part of each image is completely dark. For sensor 1, this area appears on the left and for sensor 2, the area appears on the right. This is because the FoV of the transmitter does not totally overlap with the FoV of the sensors, so this portion of the sensor's FoV does not receive signals from the transmitter.
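As an illustration of how the displacement vector ∆ could be derived from two stored poses, the following Python sketch expresses the second pose in the frame of the first. The function name, the pose layout (x, y, heading in degrees), and the choice of the first pose as the reference frame are assumptions; the source does not specify how the simulator computes ∆.

```python
import math

def relative_displacement(pose_a, pose_b):
    """Return (dx, dy, dtheta) of pose_b expressed in the frame of pose_a.

    Poses are (x, y, heading_deg) in a common world frame (assumed layout).
    Translations keep the input units; the rotation is returned in degrees.
    """
    xa, ya, ha = pose_a
    xb, yb, hb = pose_b
    c = math.cos(math.radians(ha))
    s = math.sin(math.radians(ha))
    wx, wy = xb - xa, yb - ya
    # Rotate the world-frame translation back into the frame of pose_a.
    dx = c * wx + s * wy
    dy = -s * wx + c * wy
    # Wrap the heading difference into [-180, 180).
    dtheta = (hb - ha + 180.0) % 360.0 - 180.0
    return dx, dy, dtheta
```

For example, with both headings at 90° and a world-frame move of one unit along the world y-axis, the sketch returns a displacement of one unit along the sensor x-axis and no rotation.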

IV-B. Training the DL network
To create the training data sets, pairs of consecutive images are concatenated into a single image. Each concatenated image is associated with a displacement label to make a training sample. The label corresponds to the vector ∆. The three elements in the labels are normalized to the range from -10 to 10 with respect to their maximum values. This way, they have the same weight within the loss function. A data set of 20,000 pairs of concatenated images is generated for each sonar sensor.
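The label normalization described above can be sketched in Python as follows. The maximum values and the target range are taken from the text; the function names and the linear scaling are assumptions made for illustration.

```python
# Maximum absolute displacements from Section IV-A: 2.0 cm, 2.0 cm, 0.45 degrees.
MAX_ABS = (2.0, 2.0, 0.45)
LABEL_RANGE = 10.0  # labels span -10 to 10, as stated in the text

def normalize_label(delta):
    """Scale (dx_cm, dy_cm, dtheta_deg) linearly so each element lies in [-10, 10]."""
    return tuple(d / m * LABEL_RANGE for d, m in zip(delta, MAX_ABS))

def denormalize_label(label):
    """Invert normalize_label to recover physical units from a network output."""
    return tuple(v / LABEL_RANGE * m for v, m in zip(label, MAX_ABS))
```

With this scaling, a full-scale translation of 2.0 cm and a full-scale rotation of 0.45° both map to 10, so an equal relative error in either contributes equally to the MSE loss.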
For the two-sensor configuration, the concatenated image pair from each sensor is concatenated with the corresponding concatenated image pair from the other sensor to make a larger image of 4 concatenated images. This larger image is fed into the network. For the one-sensor case, the pairs of concatenated images are fed directly into the DL network. The data sets for both cases are split into 95% and 5% for the training set and validation set, respectively.
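The sample assembly and the 95%/5% split can be sketched in Python as follows. Images are represented as lists of pixel rows. The concatenation axis (side by side), the order of the four images, and the shuffling before the split are assumptions, since the text does not state them; all function names are hypothetical.

```python
import random

def concat_side_by_side(*images):
    """Join images (lists of pixel rows) along the width; the axis is an assumption."""
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same number of rows")
    return [[p for img in images for p in img[r]] for r in range(height)]

def two_sensor_sample(s1_prev, s1_next, s2_prev, s2_next):
    """Build the 4-image network input: one consecutive pair per sensor, then both pairs."""
    pair_1 = concat_side_by_side(s1_prev, s1_next)
    pair_2 = concat_side_by_side(s2_prev, s2_next)
    return concat_side_by_side(pair_1, pair_2)

def split_indices(n_samples, train_percent=95, seed=0):
    """Shuffle sample indices and split them into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * train_percent // 100
    return indices[:n_train], indices[n_train:]
```

For the 20,000 samples mentioned above, `split_indices(20000)` yields 19,000 training indices and 1,000 validation indices.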
The sonar images generated by the simulator are noiseless. We follow two approaches for training the networks. One consists in training with the noiseless images and the other consists in training with the same images, but with low-level noise added to their pixels. The noise is generated according to two considerations [20]: (i) the pixels of acoustic shadows in the images are modified with additive Gaussian noise with the mean and standard deviation of 4% and 2% of the maximum pixel value, respectively; (ii) the rest of the pixels are affected by adding noise with the Rayleigh distribution with a scale parameter of 4% of the maximum pixel value, thus representing the scattering noise. After the networks have been trained, we validate the estimation accuracy using either the noiseless images or images with high-level noise based on measurements of noise in real Didson sonar images described in [21]. The high-level noise has a Gaussian distribution with the mean and standard deviation of 13.72% and 3.14% of the maximum pixel value, respectively, and a Rayleigh distribution with a scale parameter of 13.72% of the maximum pixel value.
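The two-part noise model can be sketched in Python as follows. The noise parameters are those given in the text. Treating pixels with value 0 as acoustic shadow, rounding the result, and clipping it to the range 0 to 255 are assumptions, as are all function and constant names.

```python
import math
import random

MAX_PIXEL = 255
# (Gaussian mean, Gaussian std, Rayleigh scale) as fractions of the maximum pixel value.
LOW_LEVEL = (0.04, 0.02, 0.04)
HIGH_LEVEL = (0.1372, 0.0314, 0.1372)

def rayleigh(rng, scale):
    """Draw a Rayleigh sample by inverse transform sampling."""
    return scale * math.sqrt(-2.0 * math.log(1.0 - rng.random()))

def add_noise(image, rng, level=LOW_LEVEL, shadow_level=0):
    """Add Gaussian noise to shadow pixels and Rayleigh noise to all other pixels.

    Pixels at or below shadow_level are treated as acoustic shadow (assumption).
    """
    g_mean, g_std, r_scale = (f * MAX_PIXEL for f in level)
    noisy = []
    for row in image:
        out = []
        for p in row:
            if p <= shadow_level:
                n = rng.gauss(g_mean, g_std)
            else:
                n = rayleigh(rng, r_scale)
            # Round and clip to the valid 8-bit range (assumption).
            out.append(min(MAX_PIXEL, max(0, round(p + n))))
        noisy.append(out)
    return noisy
```

Calling `add_noise(image, random.Random(0), HIGH_LEVEL)` would apply the validation noise level instead of the training one.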
The DL networks were trained in MATLAB. The training uses the Adam optimization algorithm [22]. The learning rate starts at 0.0001 and halves every 12 epochs until the validation loss converges. During the training, a dropout regularization with a rate of 50% is applied. Table I presents results of training the networks with noiseless and low-level noise images in terms of the root-mean-square error (RMSE) obtained when validating with noiseless and high-level noise images.
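The learning-rate schedule can be written as a short Python function. This is only an illustration of the stated schedule, not the MATLAB training code; the function name and the zero-based epoch counting are assumptions.

```python
def learning_rate(epoch, initial=1e-4, drop_period=12, drop_factor=0.5):
    """Piecewise-constant schedule: start at `initial` and halve every `drop_period` epochs.

    `epoch` is counted from zero (assumption).
    """
    return initial * drop_factor ** (epoch // drop_period)
```

Under this convention, epochs 0 to 11 use 0.0001, epochs 12 to 23 use 0.00005, and epochs 24 to 35 use 0.000025.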

IV-C. Numerical results
It can be seen that the network trained with two sensors performs better than the network trained with one sensor. For the y-axis, when training and validating with noiseless images, the RMSE is 2.74 mm for the one-sensor configuration and 1.42 mm for the two-sensor configuration, a reduction by a factor of almost two. The other parameter estimates are also improved when training with the two-sensor configuration.
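For reference, the RMSE metric and the improvement factor quoted above can be computed as in the following Python sketch. The function name is hypothetical and the RMSE definition is the standard one; the two RMSE values are the y-axis figures from the text.

```python
import math

def rmse(estimates, truths):
    """Root-mean-square error between estimated and true displacements."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(estimates))

# y-axis RMSE values reported in the text (noiseless training and validation).
RMSE_ONE_SENSOR_MM = 2.74
RMSE_TWO_SENSORS_MM = 1.42
improvement_factor = RMSE_ONE_SENSOR_MM / RMSE_TWO_SENSORS_MM  # about 1.93
```

The ratio of about 1.93 is what the text describes as a reduction by a factor of almost two.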
The only case when the two-sensor configuration presents a higher RMSE compared to the one-sensor configuration is when training with noiseless images and validation with high-level noise images. This can be caused by the black areas on the side of each sonar image which do not provide information about the motion. It is possible that since this area is always black (0 value), the network learns to ignore that part of the images for the motion estimation, but when randomly generated noise is added, it affects the estimates. This issue is eliminated when training with noisy images, even if it is not the same level of noise that is used for validation.
For both configurations, the best estimates are for the y-axis. In [14], it is found that there is a high correlation between estimates of translation along the x-axis and the rotation, which affects the estimation accuracy. Therefore, the estimates along the x-axis and of the rotation are less accurate than the estimates for the y-axis.
Training with low-level noise images reduces the RMSE of the two-sensor configuration while the RMSE of the one-sensor configuration is not reduced. This suggests that the two-sensor configuration is more suitable for motion estimation with real data, since the high-level noise used for validation matches the noise measured in real sonar images.
When training, the DL network parameters were tuned to provide the best performance for the one-sensor case; we used the same network with the same parameters as in [14]. However, a better performance is obtained by the network trained with two sensors even without optimizing the parameters. For future work, it is possible that the performance can be improved further by tuning the DL network to optimize the two-sensor case, removing the black area on the images before putting them into the DL network, adjusting the image noise before training, and/or optimizing the configuration of the sensors such as the distance and orientation relative to the transmitter and the FoV.

V. CONCLUSIONS
In this paper we present a DL-based motion estimation that combines sonar images from two sensors rather than using images from only one sensor. This is an attempt to improve the motion estimation accuracy obtained with a single sensor. The two-sensor configuration shows an improvement in the estimation accuracy compared to the one-sensor configuration, even without tuning the DL training parameters to try to optimize the estimation. For instance, the RMSE for the y-axis movement is reduced by a factor of almost two, while the RMSE for the other types of movement is also reduced.
The obtained results suggest that further work with the two-sensor configuration could further improve the motion estimation accuracy. Future work can focus on optimizing the training parameters, removing image areas with no information about the motion and/or optimizing the sensor configuration relative to the sonar transmitter.