In this paper, we propose a deep neural network that simultaneously estimates camera poses and reconstructs full-resolution depth maps of the environment using only consecutive monocular images. In contrast to traditional monocular visual odometry methods, which cannot estimate scaled depths, we demonstrate the recovery of scale information by using a sparse depth image as a supervision signal in the training step. In addition, based on the scaled depth, the relative poses between consecutive images can be estimated by the proposed deep neural network. Another novelty lies in the deployment of view synthesis, which synthesizes a new image of the scene from a different view (camera pose) given an input image. View synthesis is the core technique used to construct the loss function for the proposed neural network. Because this loss requires both the predicted depths and the relative poses, the proposed method couples visual odometry and depth prediction together. In this way, both the estimated poses and the predicted depths from the neural network are scaled, with the sparse depth image serving as the supervision signal during training. Experimental results on the KITTI dataset show that our method performs competitively in challenging environments.
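The coupling between depth and pose through view synthesis can be illustrated with a minimal NumPy sketch. It is not the network or the loss used in this paper; it only shows the general idea under several assumptions: a pinhole camera with a hypothetical intrinsic matrix `K`, a hypothetical 4x4 transform `T` from the target camera frame to the source camera frame, nearest-neighbour sampling in place of the differentiable bilinear sampling a trainable model would need, L1 penalties, and zeros marking pixels without a sparse depth measurement. All function names are illustrative.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel into the source view.

    depth: (h, w) predicted depth of the target image.
    K:     (3, 3) pinhole intrinsics (assumed, last row [0, 0, 1]).
    T:     (4, 4) target-to-source rigid transform (assumed).
    Returns source pixel coordinates (2, h, w) and source-frame depth (h, w).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    # Back-project pixels to 3D points in the target camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    # Move the points into the source frame and project them.
    src = (T @ pts_h)[:3]
    proj = K @ src
    z = src[2]
    safe_z = np.where(z > 1e-6, z, 1.0)  # avoid dividing by zero
    uv = proj[:2] / safe_z
    return uv.reshape(2, h, w), z.reshape(h, w)

def synthesize(source, depth, K, T):
    """Synthesize the target view by sampling the source image.

    Nearest-neighbour sampling is a simplification for this sketch.
    """
    h, w = depth.shape
    uv, z = reproject(depth, K, T)
    u = np.rint(uv[0]).astype(int)
    v = np.rint(uv[1]).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros_like(source)
    out[valid] = source[v[valid], u[valid]]
    return out, valid

def photometric_loss(target, source, depth, K, T):
    """Mean absolute difference between target and synthesized target."""
    synth, valid = synthesize(source, depth, K, T)
    if not valid.any():
        return 0.0
    return float(np.abs(target[valid] - synth[valid]).mean())

def sparse_depth_loss(pred_depth, sparse_depth):
    """L1 error on pixels that carry a sparse measurement (value > 0)."""
    mask = sparse_depth > 0
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth[mask] - sparse_depth[mask]).mean())
```

The photometric term depends on both `depth` and `T`, so an error in either one degrades the synthesized image. On its own, that term is unchanged when depth and translation are scaled together, which is why a metric signal such as the sparse depth term is needed to fix the scale.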