CoT-AMFlow: Adaptive Modulation Network with Co-Teaching Strategy for Unsupervised Optical Flow Estimation

The interpretation of ego motion and scene change is a fundamental task for mobile robots. Optical flow information can be employed to estimate motion in the surroundings. Recently, unsupervised optical flow estimation has become a research hotspot. However, unsupervised approaches are often unreliable on partially occluded or texture-less regions. To deal with this problem, in this paper we propose CoT-AMFlow, an unsupervised optical flow estimation approach. In terms of the network architecture, we develop an adaptive modulation network that employs two novel module types, flow modulation modules (FMMs) and cost volume modulation modules (CMMs), to remove outliers in challenging regions. As for the training paradigm, we adopt a co-teaching strategy, where two networks simultaneously teach each other about challenging regions to further improve accuracy. Experimental results on the MPI Sintel, KITTI Flow and Middlebury Flow benchmarks demonstrate that our CoT-AMFlow outperforms all other state-of-the-art unsupervised approaches, while still running in real time. Our project page is available at https://sites.google.com/view/cot-amflow.


Introduction
Mobile robots typically operate in complex environments that are inherently dynamic [1]. Therefore, it is important for such autonomous systems to be conscious of dynamic objects in their surroundings. Optical flow describes the pixel-level correspondence between two ordered images and can be regarded as a useful representation for dynamic object detection. Consequently, many approaches for mobile robot tasks, such as SLAM [2], dynamic object detection [3] and robot navigation [4], incorporate optical flow information to improve their performance.
With the development of deep learning technology, deep neural networks have presented highly compelling results for optical flow estimation [5,6,7]. These networks typically excel at learning optical flow estimation from large amounts of data with hand-labeled ground truth. However, this data labeling process can be extremely time-consuming and labor-intensive. Recent unsupervised optical flow estimation approaches have therefore attracted much attention, because their advantage of not requiring ground truth enables them to be easily deployed in real-world applications [8,9,10,11,12]. However, their performance in challenging regions, such as partially occluded or texture-less regions, is often unsatisfactory [10,13]. The underlying cause of this performance degradation is threefold: 1) The popular coarse-to-fine framework [12,13] is often sensitive to noise in the flow initialization from the preceding pyramid level, and the challenging regions can introduce errors in the flow estimations, which in turn propagate to subsequent levels. 2) The commonly used cost volume [10,11] for establishing feature correspondence can contain many outliers due to the ambiguous correspondence in challenging regions; however, most existing networks directly send the noisy cost volume to the following flow estimation layers without explicitly alleviating the impact of these outliers. 3) Several training strategies have been proposed to handle challenging regions for unsupervised optical flow estimation, such as occlusion reasoning [9,10] and self-supervision [11,12,13]. These strategies generally train a single network to provide prior information. However, this prior information is not accurate enough, because a single network can be easily disturbed by outliers when the ground truth is inaccessible, and the inaccurate prior information can further lead to significant performance degradation.
To overcome these limitations, we propose CoT-AMFlow, which comprises adaptive modulation networks, named AMFlows, that learn optical flow estimation in an unsupervised way with a co-teaching strategy. An overview of our proposed CoT-AMFlow is illustrated in Fig. 1. We leverage three novel techniques to improve the flow accuracy, as follows:
• We apply flow modulation modules (FMMs) in our AMFlow to refine the flow initialization from the preceding pyramid level using local flow consistency, which addresses the issue of accumulated errors.
• We present cost volume modulation modules (CMMs) in our AMFlow to explicitly reduce outliers in the cost volume using a flexible and efficient sparse point-based scheme.
• We adopt a co-teaching strategy, where two AMFlows with different initializations simultaneously teach each other about challenging regions to improve robustness against outliers.
Related Work

Optical Flow Estimation
Traditional approaches typically estimate optical flow by minimizing a global energy that measures both brightness consistency and spatial smoothness [18,19,20]. With recent developments in deep learning technology, supervised approaches using convolutional neural networks (CNNs) have been extensively applied to optical flow estimation, and the achieved results are very promising. FlowNet [5] was the first end-to-end deep neural network for optical flow estimation. It employs a correlation layer to compute feature correspondence. Later on, PWC-Net [6] and LiteFlowNet [7] presented a pyramid architecture, which consists of feature warping layers, cost volumes and flow estimation layers. Such an architecture can achieve remarkable flow accuracy and high efficiency simultaneously. Their subsequent versions [21,22] also made incremental improvements. Unsupervised approaches generally adopt similar network architectures to supervised approaches, and focus more on training strategies. However, existing network architectures do not explicitly address the issues of noisy flow initializations and outliers in the cost volume, as previously mentioned. Therefore, we develop the FMMs and CMMs in our AMFlow to overcome these limitations.
Among the training strategies for unsupervised approaches, DSTFlow [8] first presented a photometric loss and a smoothness loss for unsupervised training. Additionally, some approaches train a single network to perform occlusion reasoning for accuracy improvement [9,10]. Self-supervision [11,12] is also an important strategy for unsupervised training. It first trains a single network to generate flow labels, and then conducts data augmentation to make flow estimation more challenging. The augmented samples are further employed as supervision to train another network. One variant of self-supervision is to train only one network with a two-forward process [13]. However, training a single network to provide flow labels is likely to be unreliable due to the disturbance of outliers and the lack of ground-truth supervision. To address this issue, we integrate self-supervision into a co-teaching framework, where two networks simultaneously teach each other about challenging regions to improve stability against outliers.

Co-Teaching Strategy
The co-teaching strategy was first proposed for the image classification task with extremely noisy labels [23]. Since then, many researchers have resorted to this strategy for various robust training tasks, such as face recognition [24] and object detection [25]. The main difference between previous studies and our approach is that they focus on the task of supervised learning with noisy labels, while we focus on the task of unsupervised learning. Moreover, the noise in their tasks exists at the image level (noisy image classification labels), while the outliers in our task exist at the pixel level (inaccurate flow estimations in challenging regions).

AMFlow
In this section, we first introduce the overall architecture of our AMFlow, and then present our FMM and CMM. Since we use many notations, we suggest that readers refer to the glossary provided in the appendix. Fig. 2 illustrates an overview of our proposed AMFlow, which follows the pipeline of PWC-Net [6]. Different pyramid levels of feature maps are first extracted hierarchically from the input images I_1 and I_2 using a siamese feature pyramid network, and are then sent to the coarse-to-fine flow decoder. For simplicity, we take level l as an example to introduce our flow decoder. First, the upsampled flow estimation F_12^{l+1} from level l+1 is processed by our FMM for refinement, and the generated modulated flow F̂_12^{l+1} is employed to align the feature map x_2^l with the feature map x_1^l. A correlation operation is then employed to compute the cost volume C^l, which is processed by our CMM to remove outliers. After obtaining the modulated cost volume Ĉ^l, we take it as input and employ the same flow estimation layer as PWC-Net [6] to estimate the flow residual, which is subsequently added to F̂_12^{l+1} to obtain the flow estimation F_12^l at level l. This process iterates, and flow estimations at different scales are generated.
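The correlation operation above can be illustrated with a minimal numpy sketch. The search-window radius, the per-pixel dot-product similarity and the toy inputs below are simplifying assumptions; the actual layer operates on learned CNN feature maps.

```python
import numpy as np

def correlation(f1, f2, d=1):
    """Toy cost volume: correlate each pixel of f1 with every pixel of f2
    inside a (2d+1) x (2d+1) search window (zero padding at the borders)."""
    C, H, W = f1.shape
    f2p = np.pad(f2, ((0, 0), (d, d), (d, d)))
    channels = []
    for dy in range(2 * d + 1):          # vertical displacement candidates
        for dx in range(2 * d + 1):      # horizontal displacement candidates
            shifted = f2p[:, dy:dy + H, dx:dx + W]
            channels.append((f1 * shifted).mean(axis=0))  # per-pixel similarity
    return np.stack(channels)            # shape: ((2d+1)^2, H, W)

# identical feature maps: the zero-displacement channel (index 4) scores 1.0
cost = correlation(np.ones((4, 3, 3)), np.ones((4, 3, 3)))
```

The same operation, applied to a single feature map against itself, gives the self-correlation layer used by the FMM below.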

Flow Modulation Module (FMM).
In the coarse-to-fine framework, the flow estimation from the preceding level is adopted as the flow initialization at the current level. Therefore, inaccurate flow estimations in challenging regions can propagate to subsequent levels and cause significant performance degradation. Our FMM is developed to address this problem based on the concept of local flow consistency [26].
Our FMM is based on the assumption that neighboring pixels with similar feature vectors should have similar optical flows. Therefore, for a pixel p with an inaccurate flow estimation F(p), we look for another pixel q around p that has a feature vector similar to that of p and an accurate flow estimation F(q). Then, we replace F(p) with F(q).
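This local-flow-consistency idea can be sketched with an explicit (non-learned) numpy routine. Note this is only a toy illustration of the principle; the actual FMM realizes the replacement through a learned displacement map, and the confidence threshold and dot-product similarity here are our illustrative assumptions.

```python
import numpy as np

def fmm_refine(flow, feat, conf, thresh=0.5, radius=1):
    """For each low-confidence pixel p, copy the flow of the confident
    neighbor q whose feature vector is most similar to p's.
    flow: (2, H, W), feat: (C, H, W), conf: (H, W) in [0, 1]."""
    H, W = conf.shape
    out = flow.copy()
    for y in range(H):
        for x in range(W):
            if conf[y, x] >= thresh:
                continue                       # flow already trusted
            best_q, best_sim = None, -np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if (dy == 0 and dx == 0) or not (0 <= qy < H and 0 <= qx < W):
                        continue
                    if conf[qy, qx] < thresh:
                        continue               # only trust confident neighbors
                    sim = float(feat[:, y, x] @ feat[:, qy, qx])
                    if sim > best_sim:
                        best_sim, best_q = sim, (qy, qx)
            if best_q is not None:
                out[:, y, x] = flow[:, best_q[0], best_q[1]]
    return out

# demo: the low-confidence center pixel matches the top-left neighbor's feature,
# so it inherits that neighbor's flow
feat = np.zeros((2, 3, 3)); feat[0, 0, 0] = 1.0; feat[0, 1, 1] = 1.0
conf = np.ones((3, 3)); conf[1, 1] = 0.0
ys, xs = np.meshgrid(np.arange(3), np.arange(3), indexing="ij")
flow = np.stack([ys, xs]).astype(float)
refined = fmm_refine(flow, feat, conf)
```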
To this end, we first compute a confidence map M^l based on the upsampled flow estimation F_12^{l+1} and the downsampled input images I_1^l and I_2^l, as illustrated in Fig. 2. The confidence computing operation is defined as follows:

M^l = B(I_1^l, W(I_2^l, F_12^{l+1})),    (1)

where B(·,·) denotes the function for measuring the photometric difference [13], and W(I, F) denotes the warping operation of image I based on flow F. Then, we use a self-correlation layer to compute a self-cost volume C_s^l, which measures the similarity between each pixel in the feature map x_1^l and its neighboring pixels. The adopted self-correlation layer is identical to the correlation layer used in the above-mentioned flow decoder, except that it takes only one feature map as input. We further concatenate M^l with C_s^l, and send the concatenation to several convolution layers to obtain a displacement map D^l. Finally, we warp F_12^{l+1} based on D^l to obtain the modulated flow estimation F̂_12^{l+1}.

Cost Volume Modulation Module (CMM).
Ambiguous correspondence in challenging regions can introduce noise into the cost volume, which further influences the subsequent flow estimation layers. Our CMM is designed to reduce this noise.
Several traditional approaches have formulated the task of denoising the cost volume as a weighted least squares problem, which yields the following solution for level l [27,28]:

Ĉ^l(p, u) = Σ_{q ∈ N^l(p)} ω^l(p, q) · C^l(q, u),    (2)

where Ĉ^l(p, u) denotes the modulated cost at pixel p for flow residual candidate u; pixel q belongs to the neighbors N^l(p) of p; ω^l(p, q) denotes the modulation weight; and C^l(q, u) denotes the original cost at pixel q for flow residual candidate u. Note that the one-dimensional u is transformed from the original two-dimensional flow residual candidate for simplicity, following the same scheme as PWC-Net [6].
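A dense 3×3-neighborhood version of (2) can be sketched as follows. The neighborhood size and the externally supplied weights are simplifying assumptions; in the network the weights would be predicted rather than given.

```python
import numpy as np

def modulate_cost_dense(cost, weights):
    """Dense form of Eq. (2): the modulated cost at p is a weighted sum of
    the original costs over a 3x3 neighborhood N(p), with zero padding.
    cost:    (U, H, W)  original cost volume (U flow-residual candidates)
    weights: (9, H, W)  per-pixel modulation weights for the 9 neighbors"""
    U, H, W = cost.shape
    padded = np.pad(cost, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(cost)
    k = 0
    for dy in range(3):
        for dx in range(3):
            out += weights[k] * padded[:, dy:dy + H, dx:dx + W]
            k += 1
    return out

# demo: uniform weights turn the modulation into a local box average
out = modulate_cost_dense(np.ones((1, 3, 3)), np.full((9, 3, 3), 1 / 9))
```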
The intuition behind our CMM is to implement (2) in a deep neural network, which is realized by a flexible and efficient sparse point-based scheme based on deformable convolution [29]:

Ĉ^l(p, u) = Σ_{k=1}^{K} ω_k^l · C^l(p + p_k + Δp_k^l, u),    (3)

where K denotes the number of sampling points; ω_k^l denotes the modulation weight for the k-th point; p_k is the fixed offset of the original convolution layer with respect to p; and Δp_k^l is the learned offset for the k-th point. The whole loss function for training our CoT-AMFlow is a weighted sum of the above three losses, as shown on Lines 6 and 7 in Algorithm 1. The details will be introduced in Section 3.3.
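The sparse point-based modulation in (3) can be sketched in numpy as below. The bilinear sampling at fractional offsets mirrors how deformable convolution reads off-grid locations; supplying the weights and offsets externally is our illustrative simplification, as in the network they are predicted by convolution layers.

```python
import numpy as np

def bilinear(cost, u, y, x):
    """Bilinearly sample channel u of cost at fractional location (y, x)."""
    H, W = cost.shape[1:]
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * cost[u, y0, x0] + (1 - wy) * wx * cost[u, y0, x1]
            + wy * (1 - wx) * cost[u, y1, x0] + wy * wx * cost[u, y1, x1])

def modulate_cost_sparse(cost, offsets, weights, base_offsets):
    """Eq. (3)-style modulation: sample K points per pixel at fixed-plus-learned
    offsets and combine them with per-pixel modulation weights.
    cost: (U, H, W); offsets: (K, 2, H, W); weights: (K, H, W)."""
    U, H, W = cost.shape
    K = weights.shape[0]
    out = np.zeros_like(cost)
    for u in range(U):
        for y in range(H):
            for x in range(W):
                for k in range(K):
                    sy = y + base_offsets[k][0] + offsets[k, 0, y, x]
                    sx = x + base_offsets[k][1] + offsets[k, 1, y, x]
                    out[u, y, x] += weights[k, y, x] * bilinear(cost, u, sy, sx)
    return out

# demo: one sampling point with zero offset and unit weight leaves cost unchanged
rng = np.random.default_rng(0)
cost_in = rng.random((1, 3, 3))
cost_out = modulate_cost_sparse(cost_in, np.zeros((1, 2, 3, 3)),
                                np.ones((1, 3, 3)), [(0, 0)])
```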

Co-Teaching Strategy
Our co-teaching strategy is illustrated in Fig. 1, and the corresponding steps are shown in Algorithm 1. Specifically, we simultaneously train two networks A (with parameters Θ_A) and B (with parameters Θ_B). In each mini-batch, we first let the two networks run forward individually to obtain several outputs (Line 4). Then, we filter out the pixels with a high occlusion probability by setting their values in the occlusion map to 1 (completely occluded, and thus having no impact on L_ph) (Line 5). The filtering threshold is controlled by R(·), which equals 1 at the beginning and then decreases gradually as the epoch number increases. The key point of our co-teaching strategy is that each network uses the occlusion maps estimated by the other network to compute its own loss function (Lines 6 and 7). Finally, we update the parameters of the two networks separately and also update R(·) (Lines 8 and 10). Next, we answer two important questions about our co-teaching strategy: 1) Why do we need a dynamic threshold R(·)? 2) Why can swapping the occlusion maps estimated by the two networks improve the accuracy of unsupervised optical flow estimation?
To answer the first question, note that it is meaningless to compute the photometric loss on occluded regions, and thus we adopt an occlusion-masked photometric loss. According to [32], networks first learn easy and clear patterns, i.e., unchallenging regions. However, as the number of epochs increases, networks gradually become affected by the inaccurately estimated occlusion maps and thus overfit on the occluded regions, which in turn leads to more inaccurate occlusion estimations and further causes significant performance degradation. To address this, we keep more pixels in the initial epochs, i.e., R(·) is large. Then, we gradually filter out pixels with a high occlusion probability, i.e., R(·) gradually decreases, to ensure the networks do not memorize these possible outliers.
The dynamic threshold can, however, only alleviate, not entirely avoid, the adverse impact of the occluded regions. Therefore, we further adopt a scheme with two networks, which connects to the answer to our second question. The intuition is that different networks have different abilities to learn flow estimation and, correspondingly, can generate different occlusion estimations. Therefore, swapping the occlusion maps estimated by the two networks can help them adaptively correct inaccurate occlusion estimations. Compared with most existing approaches, which directly transfer errors back to themselves, our co-teaching strategy can effectively avoid the accumulation of errors and thus improve stability against outliers for unsupervised optical flow estimation. Note that since deep neural networks are highly non-convex and a network with different initializations can converge to different local optima, we employ two AMFlows with different initializations in our CoT-AMFlow, following [23], as illustrated in Fig. 1.
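The core of one co-teaching exchange can be sketched in a few lines of numpy. The linear decay of R(·), the quantile-based pixel filtering, and the parameter names tau and ramp_frac are our illustrative assumptions; the essential point is only that each network's photometric loss is masked by the other network's (filtered) occlusion map.

```python
import numpy as np

def r_schedule(epoch, total_epochs, tau=0.8, ramp_frac=0.1):
    """Dynamic keep-ratio R(.): starts at 1 and decays linearly to tau
    over the first ramp_frac of training (assumed schedule shape)."""
    ramp = max(int(total_epochs * ramp_frac), 1)
    return 1.0 - (1.0 - tau) * min(epoch / ramp, 1.0)

def coteach_step(occ_a, occ_b, photo_a, photo_b, keep_ratio):
    """One exchange: pixels whose occlusion probability exceeds the
    keep_ratio quantile are hard-set to 1 (fully occluded, no photometric
    penalty); then each network masks its loss with the OTHER's map."""
    def filter_occ(occ):
        thr = np.quantile(occ, keep_ratio)
        occ = occ.copy()
        occ[occ > thr] = 1.0
        return occ
    mask_a, mask_b = filter_occ(occ_a), filter_occ(occ_b)
    loss_a = float((photo_a * (1.0 - mask_b)).mean())  # A uses B's mask
    loss_b = float((photo_b * (1.0 - mask_a)).mean())  # B uses A's mask
    return loss_a, loss_b
```

With fully non-occluded maps the losses reduce to the plain mean photometric error, and as R(·) shrinks, more of the most-occluded pixels are excluded from both networks' losses.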

Dataset and Implementation Details
In our experiments, we set λ_1 = 2 in our loss function. In addition, we use λ_2 = 0 for the first 40% of the epochs and increase it linearly to 0.15 over the next 20% of the epochs, after which it stays constant. The learning rate adopts an exponential decay scheme with an initial value of 10^-4, and the Adam optimizer is used. Moreover, we set the two hyper-parameters of the co-teaching strategy in Algorithm 1 to 0.8 and 0.1, respectively, for evaluation on public benchmarks.
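The loss-weight and learning-rate schedules above can be written down directly. The per-epoch decay factor of 0.95 is an assumed value, since the text only states that the decay is exponential with an initial rate of 10^-4.

```python
def lambda2_schedule(epoch, total_epochs, peak=0.15):
    """lambda_2 stays at 0 for the first 40% of training, ramps linearly to
    `peak` over the next 20%, then remains constant."""
    start, end = 0.4 * total_epochs, 0.6 * total_epochs
    if epoch < start:
        return 0.0
    if epoch < end:
        return peak * (epoch - start) / (end - start)
    return peak

def lr_schedule(epoch, lr0=1e-4, decay=0.95):
    """Exponential learning-rate decay (decay factor is an assumption)."""
    return lr0 * decay ** epoch
```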
We first evaluate our CoT-AMFlow on three popular optical flow benchmarks: MPI Sintel [14], KITTI Flow 2012 [15] and KITTI Flow 2015 [16]. The experimental results are shown in Section 4.2. Then, we perform a generalization evaluation on the Middlebury Flow benchmark [17], as presented in Section 4.3. We also conduct extensive ablation studies to demonstrate the superiority of 1) our selection of the co-teaching hyper-parameters; 2) our FMM and CMM; 3) our AMFlow over other network architectures; and 4) our co-teaching strategy over other strategies for unsupervised training. The experimental results are presented in the appendix. Furthermore, we follow a training scheme similar to those of previous unsupervised approaches [11,12,13] for fair comparison. For the MPI Sintel benchmark, we first train our model on raw movie frames and then fine-tune it on the training set. For the two KITTI Flow benchmarks, we first employ the KITTI raw dataset to pre-train our model and then fine-tune it using the multi-view extension data. Additionally, we adopt two standard evaluation metrics, the average end-point error (AEPE) and the percentage of erroneous pixels (F1) [14,15,16,17].

Table 1: Evaluation results on the MPI Sintel, KITTI Flow 2012 and KITTI Flow 2015 benchmarks. Here, we show the primary evaluation metrics used on each benchmark. For the Sintel Clean and Final benchmarks, the AEPE (px) for all pixels is presented. For KITTI Flow 2012 and 2015, "Noc" and "All" represent the F1 (%) for non-occluded pixels and all pixels, respectively. "S" denotes supervised approaches. † indicates a network using more than two frames. Best results for supervised and unsupervised approaches are both shown in bold font.

Performance on Public Benchmarks
According to the online leaderboards of the MPI Sintel, KITTI Flow 2012 and KITTI Flow 2015 benchmarks, as shown in Table 1, our CoT-AMFlow outperforms all other unsupervised optical flow estimation approaches. Our approach is significantly ahead of the other unsupervised approaches, especially on MPI Sintel, where an AEPE improvement of 0.53 px to 5.42 px is achieved on the Sintel Clean benchmark. We also use the KITTI Flow 2015 benchmark to record the average inference time of our CoT-AMFlow. The results in Table 1 show that our approach still runs in real time while providing state-of-the-art performance. One exciting fact is that our unsupervised CoT-AMFlow achieves considerable performance even when compared with supervised approaches. Specifically, on the MPI Sintel Clean benchmark, our CoT-AMFlow outperforms classic networks such as PWC-Net [6] and LiteFlowNet [7], while achieving only slightly inferior performance compared with LiteFlowNet2 [22], which demonstrates the effectiveness of our adaptive modulation network and co-teaching strategy. Fig. 3 illustrates examples from the three public benchmarks, where we can clearly see that our CoT-AMFlow yields more robust and accurate results.

Generalization Analysis across Datasets
We apply the CoT-AMFlow trained on the MPI Sintel benchmark directly to the Middlebury Flow benchmark to test the generalization ability of our approach. Table 2 shows the online leaderboard of the Middlebury Flow benchmark. Note that our CoT-AMFlow has not been fine-tuned on this benchmark. We can observe that our CoT-AMFlow significantly outperforms the unsupervised UnFlow [9] and even presents superior performance over supervised approaches such as PWC-Net [6] and LiteFlowNet [7]. These results strongly verify that our CoT-AMFlow has an excellent generalization ability.

Conclusion
In this paper, we proposed CoT-AMFlow, an adaptive modulation network with a co-teaching strategy for unsupervised optical flow estimation. Our CoT-AMFlow presents three major contributions: 1) a flow modulation module (FMM), which refines the flow initialization from the preceding pyramid level to address the issue of accumulated errors; 2) a cost volume modulation module (CMM), which explicitly reduces outliers in the cost volume to improve the accuracy of optical flow estimation; and 3) a co-teaching strategy for unsupervised training, which employs two networks that teach each other about challenging regions to improve robustness against outliers. Extensive experiments have demonstrated that our CoT-AMFlow achieves state-of-the-art performance for unsupervised optical flow estimation with an impressive generalization ability, while still running in real time. We believe that our CoT-AMFlow can be directly used in many mobile robot tasks, such as SLAM and robot navigation, to improve their performance. It is also promising to employ the co-teaching strategy in other unsupervised tasks, such as unsupervised disparity or scene flow estimation.

From columns 1)-4) in Table 6, we can observe that for each existing unsupervised approach, the performance can be significantly improved when the training strategy is changed from the original one to our co-teaching strategy, which strongly demonstrates the effectiveness of our strategy. The reason why our co-teaching strategy performs better is that it improves robustness against outliers for unsupervised optical flow estimation by employing two networks that simultaneously teach each other about challenging regions.

Figure 1: An overview of our CoT-AMFlow. We integrate self-supervision into a co-teaching framework, where two AMFlows with different initializations teach each other about challenging regions to improve stability against outliers and further enhance the accuracy of flow estimation.

Figure 2: An illustration of our AMFlow, which uses FMMs and CMMs to refine flow initializations and remove outliers in cost volumes, respectively.


Table 2: Evaluation results on the Middlebury Flow benchmark. "S" denotes supervised approaches. Note that our CoT-AMFlow has not been fine-tuned on this benchmark. Best results for supervised and unsupervised approaches are both shown in bold font.

Table 3: Glossary of notations used in the paper. Among others, Ĩ1, Ĩ2, F̃12 and Õ12 denote the samples augmented via the transformations [13] employed on I1, I2, F12 and O12, respectively.

Table 4: AEPE (px) results of our CoT-AMFlow with different hyper-parameter settings in the proposed co-teaching strategy. The best result is shown in bold font.

Table 5: AEPE (px) results of variants of our CoT-AMFlow with some of the proposed modules disabled, where "All" denotes the AEPE over all pixels, and "s0-10", "s10-40" and "s40+" denote the AEPE over pixels that move less than 10 pixels, between 10 and 40 pixels, and more than 40 pixels, respectively. Best results are shown in bold font.

Table 6: AEPE (px) results of different combinations of unsupervised network architectures and unsupervised training strategies. Note that XXX-Net and XXX-Strat denote the network architecture and the unsupervised training strategy used in XXX, respectively. † indicates a network using more than two frames. The best result is shown in bold font.
Moreover, from column 5), we can see that, compared with the other training strategies, our co-teaching strategy achieves the best performance when employed in the same network architecture, i.e., our AMFlow, which further demonstrates its superiority over other strategies for unsupervised training.