EgoFish3D: Egocentric 3D Pose Estimation From a Fisheye Camera via Self-Supervised Learning

Egocentric vision has gained increasing popularity recently, opening new avenues for human-centric applications. However, although egocentric fisheye cameras provide wide-angle coverage, they introduce severe image distortion together with strong human body self-occlusion, imposing significant challenges on data processing and model construction. Unlike previous work that only leverages synthetic data for model training, this paper presents a new real-world EgoCentric Human Pose (ECHP) dataset. To tackle the difficulty of collecting 3D ground truth with motion capture systems, we simultaneously collect images from a head-mounted egocentric fisheye camera and from two third-person-view cameras, circumventing the environmental restrictions. Using self-supervised learning under multi-view constraints, we propose a simple yet effective framework, namely EgoFish3D, for egocentric 3D pose estimation from a single image in different real-world scenarios. The proposed EgoFish3D incorporates three main modules: 1) the third-person-view module takes two exocentric images as input and estimates the 3D pose represented in the third-person camera frame; 2) the egocentric module predicts the 3D pose in the egocentric camera frame; and 3) the interactive module estimates the rotation matrix between the third-person and egocentric views. Experimental results on our ECHP dataset and existing benchmark datasets demonstrate the effectiveness of the proposed EgoFish3D, which achieves superior performance to existing methods.


I. INTRODUCTION
EGOCENTRIC vision is an emerging field in computer vision, involving the analysis of data captured from a head-mounted or chest-mounted wearable camera [1], [2], [3]. When the egocentric camera is directed downwards, especially by incorporating a fisheye lens, the human body and the surrounding environment can be captured with an enlarged field-of-view, offering expanded visual cues for processing [3], [4], [5], [6]. Compared to the third-person view, egocentric vision is advantageous for long-term human-centric perception in a free-living environment, offering new opportunities for understanding human behavior and social activities [5], [6], [7], [8], [9].
One of the prerequisites of egocentric vision for downstream applications is accurate pose estimation from egocentric views. Although extensive progress in monocular 2D/3D human pose estimation has been achieved in recent years [10], [11], [12], [13], these conventional third-person-view pose estimation methods are prone to errors when directly used to predict 2D/3D poses from novel first-person viewpoints, as illustrated by the typical results shown in Fig. 1(b). Hitherto, few dedicated egocentric human pose estimation methods are available, due to several inherent challenges in egocentric vision, as shown in Fig. 1(c). First, there are significant human body self-occlusions from the first-person view, especially for the head and lower limbs, making the estimation of occluded joints difficult. Second, although an egocentric camera with a fisheye lens enlarges the field of view and captures more details of the human body, the recorded images are severely distorted, increasing the difficulty of accurate annotation in practice. Furthermore, there is a lack of real-world datasets with accurate ground truth, as data collection using a motion capture (MoCap) system is labor-intensive and limited to dedicated laboratory settings. To partially address these issues, recent effort within the vision community has been directed to building public datasets using synthetic human models [14], [15]. However, training with synthetic datasets can degrade the generalization capability of the model when applied to real-world scenarios.
To circumvent the above problems, it is necessary to develop a dedicated method for egocentric 3D human pose estimation, and it is advantageous to incorporate self-supervised learning. Some recent work has leveraged intrinsic constraints across multiple third-person views, such as multi-view geometry and view consistency [16], [17], [18], to enable 3D human pose estimation without ground truth. These self-supervised methods have demonstrated competitive pose estimation performance against their fully-supervised counterparts. In practice, however, directly transferring the multi-view self-supervised mechanism to egocentric 3D pose estimation remains challenging. First, the intrinsic parameters of the third-person-view and egocentric cameras are different. Second, there is often limited overlap between the third-person-view and first-person-view images, and the transformation between the two views is hard to acquire. More importantly, conventional 2D/3D pose estimation methods can hardly work on egocentric images, thus limiting the direct use of self-supervised learning with multi-view constraints.

Fig. 1. Egocentric 3D pose estimation from a single fisheye camera. (a) Our proposed EgoFish3D can achieve accurate 2D and 3D pose estimation from distorted images captured by a single fisheye camera. (b) Existing third-person-view 2D/3D pose estimation methods [10], [11] fail on this challenging task when applied to images captured by the fisheye camera. (c) The strong self-occlusion of the lower limbs, the severe distortion of the fisheye camera, and the lack of real-world datasets are the inherent challenges in egocentric vision.
To address the aforementioned challenges, we first constructed the EgoCentric Human Pose (ECHP) dataset using a head-mounted GoPro camera with a fisheye lens, together with two RGB cameras that simultaneously capture images from a third-person view. The training and validation sets of ECHP consist of 30 video sequences (∼75k frames) recorded in 8 different real-world indoor/outdoor scenes, in which 10 different daily actions performed by 9 subjects with 20 different body textures were recorded. The test set consists of 7 video sequences (∼17k frames) covering the same 10 actions performed by 4 subjects with new body textures, captured simultaneously with a multi-camera motion capture system to provide ground-truth annotations. This dataset not only serves as a new benchmark for egocentric 3D pose estimation, but also helps enhance the generalization capability of the proposed method to real-world scenarios. Central to this paper, we propose a novel self-supervised method, namely EgoFish3D, for egocentric 3D human pose estimation from a single head-mounted fisheye camera. An overview of the proposed method is shown in Fig. 2. EgoFish3D consists of three modules: 1) the third-person-view module; 2) the egocentric module; and 3) the interactive module.
The third-person-view module first generates the 3D pose from two third-person-view cameras, providing a relatively accurate 3D pose represented in a third-person-view coordinate system. The egocentric module then takes the distorted first-person-view image as input, performs 2D pose estimation, and predicts the 3D pose in the egocentric coordinate system as the final output. Within this module, latent image features, 2D heatmaps, and human masks are fused to improve accuracy. The interactive module estimates the 3D rotation between the two 3D poses, providing additional supervision for training the other two modules. During the training phase, the three modules are trained in a self-supervised manner; during inference, only the egocentric module is used to predict the 3D pose from a single egocentric fisheye image. The proposed work represents the first attempt at achieving egocentric 3D pose estimation from a downward-looking fisheye camera in a self-supervised manner without 3D pose ground truth as a prior. Experimental results on our ECHP dataset and the public synthetic datasets [14], [15] demonstrate that our method achieves good accuracy compared to existing supervised approaches.
In summary, the main contributions of this paper include:
• A self-supervised method is proposed to achieve egocentric 3D pose estimation from a single image without the need for 3D ground truth annotations.
• A real-world dataset, ECHP, is constructed, which contains synchronized images from two third-person-view cameras and an egocentric fisheye camera.
• An interactive module is introduced to learn the relationship between the third-person and egocentric views.
II. RELATED WORK

A. Egocentric 3D Human Pose Estimation

To mimic the visual perception of a human, one line of research is based on an egocentric camera looking outwards, but such methods struggle to recover the 3D human pose because only a limited portion of the body is observed. Jiang et al. [4] took advantage of the dynamic motion signatures of the surroundings to infer the invisible pose from a chest-mounted camera. However, the estimated poses are inaccurate and easily affected by changes in the environment. Another group of studies [19], [20] modeled egocentric 3D pose estimation and forecasting as a Markov decision process, but the pose estimation is limited to a single mode of action, such as running or walking.
Compared to cameras looking outwards, some recent works adopt a downward-looking egocentric camera setting, which can consistently keep the human body in the field of view. Moreover, these works integrate the camera with a fisheye lens to enlarge the perceived area of the human body, which boosts the performance of egocentric 3D human pose estimation. Using a head-mounted stereo fisheye camera, Rhodin et al. [21] proposed a markerless egocentric full-body motion capture method, but the stereo camera is inconvenient in practical applications. Xu et al. [15] presented a real-time egocentric 3D pose estimation method with a single cap-mounted fisheye camera, which takes both original and zoomed-in images as input to deal with the strong occlusion of the lower limbs. However, this method does not directly regress the 3D pose in the inference phase but instead predicts the absolute depths of the joints. Most recently, Tome et al. [14] introduced a large synthetic dataset captured from a head-mounted fisheye camera and proposed a three-branch network that achieves state-of-the-art 3D pose estimation performance on this synthetic dataset. To deal with severe distortion, Zhang et al. [22] proposed an automatic calibration module to estimate the fisheye camera parameters, thus mitigating the effect of image distortion for robust egocentric 3D pose estimation. It should be pointed out that the models proposed in [14], [15], [22] are trained on synthetic data under strong supervision, leading to degraded generalization in real-world applications. Motivated by this, in this paper we establish a real-world egocentric human action dataset. In addition, we take full advantage of the multi-view constraints between the third-person-view and egocentric cameras to achieve egocentric 3D human pose estimation in a self-supervised manner.

B. 3D Human Pose Estimation Via Self-Supervised Learning Under Multi-View Constraints
Recently, self-supervised learning has attracted increasing attention in 3D human pose estimation, where multi-view information is utilized to mitigate the ambiguity of learning 3D human poses from synchronized 2D images captured by third-person-view cameras [23], [24], [25], [26], [27]. However, the fusion of multiple views and the annotation of 3D poses in different camera views are challenging. Employing typical fusion methods, such as multi-view consistency of the same pose [16], [23], [26] and triangulation [25], [28], can reduce the cost of 3D human pose labelling and allow the network to learn in a self-supervised manner.
Rhodin et al. [23] employed multi-view consistency to constrain the system to predict the same pose in all views with the help of only a few annotated examples. CanonPose [16] disentangles the observed 2D pose into a canonical 3D pose and a camera orientation. Specifically, it contains several sub-networks inferring the same 3D pose from different views, and the predictions from all views are aggregated to produce the final 3D pose. However, applying the multi-view consistency constraint alone is not sufficient, because the model may become trapped in a trivial solution for different inputs [23], [29]. Iqbal et al. [29] introduced a novel objective function based on normalized 3D bone lengths computed from the Human3.6M dataset. Differently, Rhodin et al. [30] took advantage of a temporal consistency prior to first learn a geometry-aware body representation from sequential unlabelled multi-view images, and then mapped this geometry representation to actual 3D poses. In this paper, we aim to address multi-view self-supervised learning from the combination of third-person and egocentric views.
The other branch of work on fusing multi-view data employs conventional triangulation, given the intrinsic and extrinsic parameters of the cameras [25], [28]. Iskakov et al. [28] first proposed a baseline method that computes the 3D human pose from multi-view 2D poses algebraically. Kocabas et al. [25] utilized epipolar geometry to generate 3D pose annotations from multiple 2D poses. Our method also incorporates the triangulated 3D pose as a prior for faster convergence and better performance.

III. EGOCENTRIC HUMAN POSE (ECHP) DATASET
There is a pressing need for available real-world data to enable the development of egocentric pose estimation algorithms
due to the unique camera placement and the images captured from the first-person view. However, existing training methods for egocentric pose estimation are almost exclusively based on synthetic datasets, with real-world data only used for testing. Due to the domain gap between synthetic and real-world environments, the generalization capability of such models in real applications is limited. More importantly, simultaneously capturing ground-truth data with a MoCap system for supervised learning is difficult, because a multi-camera MoCap system is confined to a dedicated environment and the annotation is labour-intensive.
In this paper, we first construct an EgoCentric Human Pose (ECHP) dataset to enable egocentric 3D human pose estimation in a self-supervised learning manner. Unlike previous datasets that mainly consist of synthetic images [14], [15], our ECHP is composed of real-world images collected by an egocentric camera with a fisheye lens in different scenarios. The dataset contains data from 9 different subjects with 20 different body textures performing 10 daily actions. The third-person-view images are captured by two Intel RealSense D455 cameras, and the egocentric images are captured by a head-mounted GoPro Hero9 equipped with a fisheye lens.

A. Data Collection
To overcome the difficulty of acquiring ground-truth data with a MoCap system, two RGB cameras are used to capture images from the third-person view, and a head-mounted GoPro camera with a fisheye lens captures images from the egocentric view. In this manner, the ECHP dataset consists of well-synchronized images captured from two different views, i.e., the third-person view and the egocentric view. In practice, the egocentric camera is fixed on the head through a helmet, extends forward by about 13-18 cm, and tilts downwards by about 15-25 degrees to reduce self-occlusion and ensure that the lower limbs are visible as much as possible. We also use an Aruco marker [31] to obtain the 6D pose of the egocentric camera represented in the third-person-view frames. All three cameras are calibrated with a 25 mm chessboard to determine their intrinsic parameters [32].
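As a concrete illustration of this setup, the following is a minimal OpenCV sketch of the two steps above: chessboard-based intrinsic calibration and Aruco-based 6D pose estimation of the head-mounted camera. It assumes opencv-contrib-python with the legacy cv2.aruco API; the chessboard corner grid, the marker dictionary, and the marker side length are illustrative assumptions not specified in the paper.

```python
import cv2
import numpy as np

def calibrate_intrinsics(calib_images, pattern=(9, 6), square=0.025):
    """Chessboard intrinsic calibration (25 mm squares as reported; grid size assumed)."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts, size = [], [], None
    for img in calib_images:                       # BGR frames of the chessboard
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
    return K, dist

def aruco_6d_pose(frame_bgr, K, dist, marker_len=0.08):
    """6D pose of the Aruco marker (rigidly attached to the GoPro) as seen from a
    third-person-view camera; the dictionary and the 8 cm side length are assumptions."""
    aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        return None                                # unstable detections are discarded
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, marker_len, K, dist)
    R, _ = cv2.Rodrigues(rvecs[0])
    return R, tvecs[0].reshape(3)                  # rotation and translation in the camera frame
```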
The training and validation sets of our ECHP dataset contain 9 different subjects with 20 different body textures performing 10 daily actions (i.e., squatting, walking, dancing, stretching, waving, boxing, kicking, touching, clamping, knocking). To improve the diversity of the dataset, real-world data are captured in 8 different scenes, both indoor and outdoor. In total, the training and validation parts of the ECHP dataset contain 30 video sequences with about 75k frames. To fully evaluate the performance as well as the generalization capability of different egocentric pose estimation algorithms, for the test set we simultaneously use a VICON MoCap system with a full-body gait model to collect the same 10 daily actions performed by 4 subjects with new body textures, yielding 7 video sequences with about 17k frames and 3D ground truth of anatomical joint positions. It should be emphasized that the test data are captured in a single indoor scene due to the use of the MoCap system, and, more importantly, 2 of the 4 subjects are unseen in the training set.
In summary, the ECHP dataset can be divided into three parts: about 65k images with both third-person-view and egocentric images are used for training, the remaining 10k images are used for validation, and the egocentric images with 3D ground truth captured by the VICON MoCap system form the test set. Table I lists the exact frame number, subject number and gender, and the train/validation/test split of the egocentric images in each video sequence.

B. Data Preparation
It is noteworthy that our main focus is on egocentric pose estimation, so we use OpenPose [10] to extract, offline, the 2D joints of the target human from the RGB images captured by the third-person-view cameras. These two 2D poses serve as the input to the third-person-view module of EgoFish3D. Besides, a pretrained human instance segmentation model [33] is used to extract the human mask offline as an input to the egocentric module. The extrinsic parameters between the two third-person-view cameras are pre-measured and fixed during data collection. To fit our proposed network, we resize the third-person-view images to 640 × 480 and the egocentric images to 384 × 384.
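For illustration, the following is a minimal sketch of the offline human-mask extraction step. The paper only states that a pretrained instance segmentation model [33] is used; the choice of torchvision's Mask R-CNN here, the score threshold, and the helper name `person_mask` are our assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained COCO instance segmentation model (torchvision >= 0.13 API).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def person_mask(image_rgb, score_thr=0.7):
    """Return a binary mask of the most confident 'person' detection (COCO label 1)."""
    pred = model([to_tensor(image_rgb)])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] > score_thr)
    if keep.sum() == 0:
        return None
    best = pred["scores"][keep].argmax()
    return (pred["masks"][keep][best, 0] > 0.5).float()   # H x W, values in {0, 1}
```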

A. Problem Statement
In this paper, our aim is to perform egocentric 3D pose estimation from a single head-mounted fisheye camera by leveraging the self-supervision provided by two third-person-view cameras. Let $\{C_1\}$, $\{C_2\}$ denote the coordinate systems of the two third-person-view cameras $c_1$, $c_2$, and let $\{C_{ego}\}$ denote the local coordinate system of the egocentric fisheye camera $c_{ego}$. For a camera $c$, the captured image at each frame is denoted $I_c$ and the corresponding 2D joints of the target human are denoted $J_c$. Given an image $I_{c_{ego}}$ captured by the egocentric fisheye camera, a deep neural network $f_{\theta}(\cdot)$ with parameters $\theta$ is designed to first predict the 2D joint positions $\hat{J}_{c_{ego}}$, and then estimate the 3D human pose $\hat{P}_{c_{ego}}$ represented in the egocentric camera coordinate system. In this paper, we propose EgoFish3D to perform this task via self-supervised learning.

B. Overview of EgoFish3D
The architecture of our proposed EgoFish3D is shown in Fig. 4. Specifically, the network contains three modules, i.e., the third-person-view module $f^{trd}_{\theta}$, the egocentric module $f^{ego}_{\theta}$, and the interactive module $f^{itr}_{\theta}$. The third-person-view module $f^{trd}_{\theta}$ performs 3D pose estimation from two external cameras, aiming to offer a relatively accurate 3D pose, represented in the third-person-view camera coordinate system, as supervision. The input of this module is the two 2D poses $J_{c_1}$, $J_{c_2}$ estimated from the two third-person-view images, and the output is the 3D poses $\hat{P}_{c_1}$, $\hat{P}_{c_2}$ in the third-person-view coordinate systems. The network is composed of several MLP blocks, where each block is the combination of a linear layer, a batch normalization layer, and an activation function.
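A minimal PyTorch sketch of this building block (linear layer, batch normalization, activation), and of a depth regressor assembled from it, is given below; the hidden width, the number of stacked blocks, and the class name `ThirdPersonDepthRegressor` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

def mlp_block(in_dim, out_dim):
    """Linear layer + batch normalization + activation, as described in the text."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(inplace=True),
    )

class ThirdPersonDepthRegressor(nn.Module):
    """Maps a flattened 2D pose (J joints x 2) to per-joint depths, in the spirit of f_theta^trd.
    The 15-joint skeleton follows the ECHP definition (Fig. 6); hidden size is assumed."""
    def __init__(self, num_joints=15, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            mlp_block(num_joints * 2, hidden),
            mlp_block(hidden, hidden),
            nn.Linear(hidden, num_joints),      # one depth value per joint
        )

    def forward(self, joints_2d):               # joints_2d: (B, J, 2)
        return self.net(joints_2d.flatten(1))   # (B, J)
```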
For the egocentric module $f^{ego}_{\theta}$, the input is a single egocentric image $I_{c_{ego}}$. We first perform 2D pose estimation to obtain the heatmaps of the body joints, and then predict the 3D pose $\hat{P}_{c_{ego}}$ in the egocentric coordinate system. To tackle the inherent challenges of pose estimation in distorted egocentric images, a feature fusion mechanism is proposed that combines the high-level features $FT_{c_{ego}}$ of the input image, the 2D heatmaps of body joints $\widehat{HM}_{c_{ego}}$, and the human masks $\hat{M}_{c_{ego}}$. The module is built upon a CNN backbone that performs the feature fusion. The fused features are then fed into an encoder-regressor, a combination of several CNN layers and MLPs, to generate the final 3D pose estimate.
Note that the 3D poses estimated by the above two modules are represented in different coordinate systems; hence, it is necessary to perform rotation alignment between the third-person-view and egocentric coordinate systems across frames. To this end, we introduce an interactive module, which takes the paired 2D poses $\{J_{c_1}, \hat{J}_{c_{ego}}\}$, $\{J_{c_2}, \hat{J}_{c_{ego}}\}$ from the different coordinate systems as input. The network structure is similar to the third-person-view module and is composed of several MLPs, predicting the rotation matrices $^{c_1}\hat{R}_{c_{ego}}$ and $^{c_2}\hat{R}_{c_{ego}}$, respectively.

C. Third-Person-View Module
This module aims to predict 3D poses in the third-person-view camera coordinate systems, given two third-person-view images and the transformation matrix between the two cameras. To generate accurate pose estimates, we combine conventional triangulation under multi-view constraints with a learning-based depth estimation model.
1) 3D Pose Triangulation: Given the intrinsic parameters and the transformation matrix of the two cameras, a point in 3D space can be determined by triangulation from its projections on the two images. Among different triangulation algorithms, we adopt the one related to depth estimation. After obtaining the 2D poses $J_{c_1}$, $J_{c_2}$ via OpenPose [10], the analytical solution of the 3D joint positions $P^{tri}_{c_1}$, $P^{tri}_{c_2}$ can be calculated.
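As a hedged sketch of this triangulation step, the snippet below lifts each OpenPose joint to 3D with OpenCV's DLT routine, given the 3×4 projection matrices of the two calibrated cameras; the function name and the reference-frame convention are our assumptions.

```python
import cv2
import numpy as np

def triangulate_pose(P1, P2, joints_c1, joints_c2):
    """P1, P2: (3, 4) projection matrices of cameras c1 and c2.
    joints_c1, joints_c2: (J, 2) pixel coordinates of the same J joints.
    Returns (J, 3) triangulated joints in the reference (c1) frame."""
    pts4d = cv2.triangulatePoints(P1, P2,
                                  joints_c1.T.astype(np.float64),
                                  joints_c2.T.astype(np.float64))   # (4, J) homogeneous
    return (pts4d[:3] / pts4d[3]).T                                  # (J, 3)
```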

2) 3D Pose Estimation Via Depth Prediction:
To mitigate inaccuracies of the triangulated 3D pose, we also introduce a network $f^{trd}_{\theta}$ to predict the depth values of the 2D joints, i.e., $\hat{d}_c = f^{trd}_{\theta}(J_c)$. The 3D joints $\hat{P}_c$ represented in each camera coordinate system can then be computed from the predicted depths and the intrinsic parameters of that camera.
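A minimal sketch of this back-projection, assuming a standard pinhole model for the third-person-view cameras (the fisheye projection of the egocentric camera is not involved here):

```python
import torch

def backproject(joints_2d, depth, K):
    """joints_2d: (B, J, 2) pixels; depth: (B, J) predicted depths; K: (3, 3) intrinsics.
    Returns (B, J, 3) joints in the camera coordinate system."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (joints_2d[..., 0] - cx) / fx * depth
    y = (joints_2d[..., 1] - cy) / fy * depth
    return torch.stack([x, y, depth], dim=-1)
```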
3) Loss Functions: In this module, two constraints are involved to facilitate the training of the network. First, the constraint $L^{trd}_{pose}$ between the depth-based prediction and the triangulation result is introduced.
Besides, the two estimated 3D poses represented in $\{C_1\}$ and $\{C_2\}$ are expected to be the same after transformation.
During training, the total loss $L_{trd}$ is formulated as the weighted sum of these two constraints, with weights $\omega_1$ and $\omega_2$.
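A minimal sketch of these two constraints is given below; using the MSE as the distance and applying the known rigid transform between the two cameras are our assumptions, since the paper only states that the two terms are combined with weights $\omega_1$ and $\omega_2$.

```python
import torch.nn.functional as F

def third_person_loss(P_hat_c1, P_hat_c2, P_tri_c1, R_12, t_12, w1=1.0, w2=0.5):
    """P_hat_c1, P_hat_c2: (B, J, 3) depth-based predictions in {C1} and {C2};
    P_tri_c1: (B, J, 3) triangulated pose in {C1}; R_12, t_12: extrinsics mapping {C2} to {C1}."""
    # 1) agreement between the depth-based prediction and the triangulated pose
    l_pose = F.mse_loss(P_hat_c1, P_tri_c1)
    # 2) cross-view consistency: the pose predicted in {C2}, transformed into {C1},
    #    should match the pose predicted in {C1}
    P_c2_in_c1 = P_hat_c2 @ R_12.T + t_12
    l_cons = F.mse_loss(P_hat_c1, P_c2_in_c1)
    return w1 * l_pose + w2 * l_cons
```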

D. Egocentric Module
This module aims to predict both the 2D and the 3D pose of the target human in the egocentric camera coordinate system given a distorted fisheye image, where the 3D pose is the final output. To achieve better 3D pose estimation, we propose a feature fusion method that combines the high-level features of the input image, the 2D heatmaps of the body joints, and the human body mask. Moreover, the 2D heatmap (and thus the 2D pose) branch is supervised with reprojection constraints derived from the 3D pose estimated by the third-person-view module.

1) 2D Pose and Heatmap Prediction by Reprojection:
After obtaining the 3D joints from the third-person-view module, an intuitive strategy is to project them onto the distorted egocentric image plane, thus providing supervisory information for training the egocentric 2D pose detector. The intrinsic parameters of the fisheye camera are obtained with the calibration method of [32]. It is extremely challenging to directly predict the transformation between two cameras with little overlap and different intrinsic parameters. For simplicity, in our ECHP dataset we use the Aruco marker to determine the 6D pose $\{^{c_1}R_{c_{ego}}, {}^{c_1}T_{c_{ego}}\}$ of the egocentric camera in the third-person-view coordinate system. However, the detection is affected by the illumination and the distance from the marker to the camera, so we only use images with stable Aruco detections to train the egocentric 2D pose detector. Given the $P^{tri}_{c_1}$ estimated from the third-person-view camera $c_1$, it can be reprojected onto the egocentric image, yielding the egocentric 2D pose $J^{rep}_{c_{ego}}$ as well as the heatmaps $HM^{rep}_{c_{ego}}$. The egocentric 2D pose detector is then trained to predict the 2D heatmaps $\widehat{HM}_{c_{ego}}$ under the supervision of $HM^{rep}_{c_{ego}}$, using an MSE loss $L^{ego}_{rep}$ as the constraint.
In practice, the reprojected 2D pose has relatively low accuracy due to errors in the triangulated 3D pose and the extrinsic parameters. It is noteworthy that the 2D pose detector can predict more accurate heatmaps than the reprojected ones.
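The sketch below illustrates how supervision heatmaps can be rendered from the reprojected 2D joints and used in the MSE reprojection loss; the 48 × 48 heatmap resolution and the 384 × 384 input size follow the implementation details given later, while the Gaussian rendering and the value of sigma are our assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmaps(joints_2d, heatmap_size=48, image_size=384, sigma=2.0):
    """joints_2d: (J, 2) tensor of reprojected joints in image pixels.
    Returns (J, H, H) target heatmaps with a Gaussian blob per joint."""
    scale = heatmap_size / image_size
    ys, xs = torch.meshgrid(torch.arange(heatmap_size), torch.arange(heatmap_size),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()             # (H, H, 2)
    centers = joints_2d * scale                                # (J, 2) in heatmap coordinates
    d2 = ((grid[None] - centers[:, None, None]) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2))                   # (J, H, H)

def reprojection_loss(pred_heatmaps, joints_rep):
    """MSE between the detector's heatmaps and those rendered from the reprojected joints."""
    return F.mse_loss(pred_heatmaps, gaussian_heatmaps(joints_rep))
```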

2) Information Fusion for Egocentric 3D Pose Estimation:
The main objective of this paper is to train the network $f^{ego}_{\theta}$ to predict the 3D pose $\hat{P}_{c_{ego}}$ given a single fisheye image. To achieve this, we propose a feature fusion mechanism in the egocentric module to boost the pose estimation performance, which consists of three main branches. The first branch uses the pretrained 2D pose detector to obtain the heatmaps $\widehat{HM}_{c_{ego}}$, the second branch extracts the latent features $FT_{c_{ego}}$ of the input image, and the third branch uses a pretrained human instance segmentation network [33] to extract the human masks $\hat{M}_{c_{ego}}$. The three branches are fused in an attention-aware manner, and the fused feature is fed into an encoder and regressor to predict the final 3D pose. During training, two self-supervised constraints are used. One is the reprojection loss $L^{ego}_{rep}$ mentioned above, which forces the egocentric 2D detector to generate reasonable heatmaps. The other is the transformation constraint $L^{ego}_{pose}$ on the 3D poses between the third-person view and the egocentric view, supervised by the rotation matrix $^{c_1}\hat{R}_{c_{ego}}$ predicted by the interactive module; note that we choose $\{C_1\}$ as the reference frame. Besides, an additional bone-length loss $L^{ego}_{bone}$ is utilized to force the lengths of corresponding left and right limbs to be the same.
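The following PyTorch sketch illustrates one possible form of the three-branch fusion; the paper does not specify the attention design at this level of detail, so the channel-wise gating used here (and the class name `AttentionFusion`) is an assumption, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Concatenate image features, joint heatmaps, and the human mask, then re-weight
    channels with a simple squeeze-and-excitation style gate (design assumed)."""
    def __init__(self, feat_ch, num_joints=15):
        super().__init__()
        in_ch = feat_ch + num_joints + 1               # features + heatmaps + mask
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, in_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats, heatmaps, mask):
        # All inputs are assumed to be resized to the same spatial resolution.
        x = torch.cat([feats, heatmaps, mask], dim=1)   # (B, in_ch, H, W)
        return x * self.gate(x)                         # channel-wise attention
```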
3) Loss Functions: During the training of the egocentric module, the total loss $L_{ego}$ is the weighted sum $L_{ego} = \alpha_1 L^{ego}_{rep} + \alpha_2 L^{ego}_{pose} + \alpha_3 L^{ego}_{bone}$.

E. Interactive Module
As the detection of Aruco markers is greatly affected by the environment and easily fails in some circumstances, we propose the interactive module to automatically predict the rotation matrix between the third-person-view and egocentric coordinate systems, offering the transformation constraints for training the egocentric module. Given a pair of 2D poses $\{J_c, \hat{J}_{c_{ego}}\}$, where $c \in \{c_1, c_2\}$, the interactive module $f^{itr}_{\theta}$ predicts the Euler angles $[\theta, \phi, \psi]$ through simple MLPs, from which the rotation matrix $\hat{R}$ is determined. By leveraging the known extrinsic parameters between the two third-person-view cameras, the transformation constraint $L^{itr}_{tran}$ is defined to enforce consistency between the two predicted rotations.
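A minimal sketch of the interactive module is shown below: an MLP regresses the three Euler angles from the concatenated 2D poses, which are then converted to a rotation matrix. The ZYX composition order, the hidden width, and the class name `InteractiveModule` are assumptions not fixed by the paper.

```python
import torch
import torch.nn as nn

def euler_to_matrix(angles):
    """angles: (B, 3) = (theta, phi, psi); returns (B, 3, 3) rotation matrices (ZYX order assumed)."""
    cx, cy, cz = [torch.cos(angles[:, i]) for i in range(3)]
    sx, sy, sz = [torch.sin(angles[:, i]) for i in range(3)]
    one, zero = torch.ones_like(cx), torch.zeros_like(cx)
    Rx = torch.stack([one, zero, zero, zero, cx, -sx, zero, sx, cx], -1).view(-1, 3, 3)
    Ry = torch.stack([cy, zero, sy, zero, one, zero, -sy, zero, cy], -1).view(-1, 3, 3)
    Rz = torch.stack([cz, -sz, zero, sz, cz, zero, zero, zero, one], -1).view(-1, 3, 3)
    return Rz @ Ry @ Rx

class InteractiveModule(nn.Module):
    """Predicts the rotation between a third-person-view 2D pose and the egocentric 2D pose."""
    def __init__(self, num_joints=15, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 4, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3),                       # Euler angles
        )

    def forward(self, joints_trd, joints_ego):           # each (B, J, 2)
        angles = self.net(torch.cat([joints_trd, joints_ego], -1).flatten(1))
        return euler_to_matrix(angles)
```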
To ensure fast convergence of the interactive module, at the beginning of training we use the extrinsic parameters $^{c}R_{c_{ego}}$ estimated from the Aruco markers to form an additional constraint $L^{itr}_{mat}$.
During training, the total loss $L_{itr}$ is formulated as the weighted sum $L_{itr} = \beta_1 L^{itr}_{tran} + \beta_2 L^{itr}_{mat}$. Note that $L^{itr}_{mat}$ is only used in the first 5 epochs.
In summary, the loss function for training the whole EgoFish3D network is the sum of the losses of the aforementioned modules, i.e., $L = L_{trd} + L_{ego} + L_{itr}$.

A. Dataset
For the ECHP dataset described in Section III, we train the model on 65k frames from the sequences {seq1-seq2, seq4-seq11, seq13-seq17, seq21-seq30}, and validate on the remaining images from {seq3, seq12, seq18-seq20} to provide qualitative results. To demonstrate the effectiveness of our proposed self-supervised method, we also capture seven test videos, named {test1-test7}, of 4 subjects (2 of them are novel subjects, and all body textures are unseen in the training set) with 3D pose ground truth provided by the VICON system for quantitative evaluation. Note that the nose joint is not captured by the VICON system, so we report results for 14 body joints only.
To evaluate the effectiveness of our method, detailed comparison experiments were also conducted on two existing synthetic datasets. 1) Mo2Cap2 [15] is a large-scale synthetic dataset that simulates images captured by a single cap-mounted fisheye camera; it contains 530k rendered images covering about 3000 different actions and 700 different body textures. Besides, there is a real-world test set of ∼5.5k frames for quantitative evaluation. 2) xR-EgoPose [14] is also a synthetic dataset captured by a head-mounted fisheye camera. It has 383k frames with 23 male and 23 female characters performing 9 different actions. Both Mo2Cap2 and xR-EgoPose contain ground-truth annotations of 2D and 3D joint positions. Fig. 5 shows several examples from the three datasets. Although our ECHP contains fewer images than Mo2Cap2 and xR-EgoPose, we believe that the real-world egocentric images in our dataset can contribute to the community for the development of egocentric pose estimation algorithms.

B. Implementation Details
Since our method is based on self-supervised learning, the network may not converge with an end-to-end training strategy. To make the model converge faster and reduce overfitting, we adopt a multi-stage training strategy instead. First, we train the third-person-view module with the 3D pose triangulation and depth estimation methods to estimate the 3D pose in the third-person-view coordinate system, with $\omega_1 = 1.0$, $\omega_2 = 0.5$. Then the egocentric 2D pose detector is trained to estimate the heatmaps with $\alpha_1 = 1.0$. Note that the input egocentric image is of size 384 × 384 and the heatmap resolution is 48 × 48. To improve the generalization of the egocentric 2D pose estimation model to different real-world scenarios, we train it on a combination of ECHP (∼40k frames with available 2D reprojections from the triangulated 3D poses) and Mo2Cap2 (∼40k synthetic frames with annotated 2D poses). Next, we train the interactive module to estimate the rotation matrix between the third-person-view and egocentric coordinate systems; in the first 5 epochs, we use the extrinsic parameters estimated from Aruco as supervision to speed up convergence, with $\beta_1 = 1.0$, $\beta_2 = 1.0$. Finally, we train the egocentric module and finetune the whole network for better performance with $\alpha_2 = 1.0$, $\alpha_3 = 0.05$. The proposed method and the comparison methods are implemented in PyTorch, and we use Adam for optimization with a learning rate of 0.001.

C. Evaluation Metrics
Three evaluation protocols are used in this paper. The first is the Mean Per Joint Position Error (MPJPE), which calculates the average Euclidean distance between the ground-truth and estimated 3D joint positions:

$E(P, \hat{P}) = \frac{1}{N} \sum_{i=1}^{N} \left\| P_i - \hat{P}_i \right\|_2,$ where $N$ is the number of joints, $P_i$ is the ground-truth position of the $i$-th joint, and $\hat{P}_i$ is the corresponding estimate.
The second is PA-MPJPE, which is the MPJPE after alignment by Procrustes Analysis to remove the global translation, rotation, and scale between the two 3D poses. The third is BA-MPJPE, which rescales the bone lengths of the estimated and ground-truth 3D poses to a standard skeleton to remove the influence of body scale, and then computes PA-MPJPE.
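The following NumPy sketch makes the first two protocols concrete: MPJPE as the mean per-joint Euclidean distance, and PA-MPJPE after a similarity Procrustes alignment (scale, rotation, translation). It is a standard formulation rather than the authors' evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions in millimetres."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after a similarity (scale + rotation + translation) Procrustes alignment."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(X.T @ Y)
    if np.linalg.det(U @ Vt) < 0:          # avoid an improper rotation (reflection)
        U[:, -1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (X ** 2).sum()
    aligned = scale * X @ R + mu_g
    return mpjpe(aligned, gt)
```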

D. Comparison Methods
Since we are the first to propose a self-supervised method for egocentric 3D pose estimation from a downward-looking fisheye camera by leveraging multi-view constraints from the third-person view, we compare our proposed EgoFish3D with four existing supervised methods [14], [15], [22], [34] that use egocentric images with ground-truth annotations. First, comparison experiments are conducted on our ECHP dataset. For a fair comparison, we keep the third-person-view module and the interactive module fixed to generate the same pseudo labels and use them to train the different methods. For the network proposed by Martinez [34], we run the network on the 2D poses extracted from our predicted heatmaps. Since the code for [14] and [15] is not publicly available, we implemented these two models ourselves. As it is difficult to determine the rotations between body parts in real-world data, only the first and third branches of the network proposed by Tome [14] are implemented for the comparison. For the network proposed by Xu [15], we implement the heatmap-zoom and joint depth estimation modules, and generate the 3D pose via the reprojection formula of the egocentric fisheye camera. Second, we compare our proposed EgoFish3D with other methods on the Mo2Cap2 [15] and xR-EgoPose [14] datasets. To compare 3D pose estimation performance, we retrain our egocentric module on the xR-EgoPose and Mo2Cap2 datasets and report both quantitative and qualitative results. Third, to illustrate the generalization ability of EgoFish3D, we directly apply the model trained on our ECHP dataset to the real-world test data of Mo2Cap2, without finetuning on the synthetic dataset, and show qualitative results. The main comparison methods on our dataset are listed as follows.
• Martinez [34], a baseline method with several MLPs for 3D pose estimation, where the input is a 2D pose.
• Tome [14], a state-of-the-art supervised method with an encoder-decoder network for egocentric 3D pose estimation.
• Xu [15], a two-branch network that takes both original and zoomed-in images as input for supervised pose estimation.
• Zhang [22], a two-branch network introducing an automatic calibration module to deal with the distortion of the fisheye lens.
• EgoFish3D, the full network of our proposed method.

E. Ablation Study
We also conduct ablation studies of the proposed EgoFish3D to demonstrate the effectiveness of its different sub-networks and loss functions. We remove or change the following parts of the network one by one.
• EgoFish3D, the full network of our proposed method.
• w/o $L^{itr}_{mat}$, an ablated model without $L^{itr}_{mat}$ as in (7).

A. Quantitative Results
Unless otherwise specified, bold and underlined values in the tables indicate the best and second-best results in each column, respectively. All values are reported in millimeters (mm). By leveraging the full-body gait model provided by the VICON MoCap system, the ground-truth annotations are 3D joint positions at the anatomical level.
1) ECHP Dataset: The comparison of 3D pose estimation results with other state-of-the-art methods on our ECHP dataset is reported in the upper part of Table II. The second column lists the average MPJPE (PA-MPJPE) over all test data, and the remaining columns show the results for each action.
It can be seen that our proposed EgoFish3D achieves the best average result (MPJPE = 107.9) under the MPJPE protocol against the comparison methods [14], [15], [34]. This is because our egocentric feature fusion method can leverage more of the useful information implied in the input image; for instance, the human mask is beneficial for suppressing heatmap responses with low confidence or beyond the extent of the human body. Our method also achieves comparable performance (PA-MPJPE = 73.1) to Xu [15] under the PA-MPJPE protocol. It should be noted that the method of Xu [15] is sensitive to the distorted egocentric images captured by a fisheye lens and, unlike our EgoFish3D, requires the intrinsic parameters of the fisheye camera to compute the 3D pose during inference; without the known intrinsic parameters, its performance degrades significantly. For the action-level comparison, our EgoFish3D achieves the best or second-best pose estimation for nine actions, the exception being Boxing, in which the upper limbs often move outside the field of view; this shows that our method is more stable across different actions than the comparison methods.
We also report the estimation error of each joint in Table V. It can be seen that distal joints (e.g., hands) have larger PJPE, which becomes smaller after Procrustes analysis. This is mainly because the precise estimation of the rotation between the third-person and egocentric views is quite challenging. We also evaluate the performance of our proposed interactive module by reporting the mean angle error of the estimated camera orientation, which is 6.02° for our interactive module.
2) xR-EgoPose Dataset: For the xR-EgoPose dataset, the comparison MPJPE results are presented in Table III. We retrain our model on this synthetic dataset and compare the performance with three existing methods [14], [22], [34]. Our method achieves the best average performance (MPJPE = 48.0) compared to Tome [14] (MPJPE = 54.7) and Zhang [22] (MPJPE = 50.0). Notably, our method uses the same supervision cues as Tome (p3d+hm), i.e., 2D heatmaps and 3D pose, yet achieves superior performance to the three-branch network Tome (p3d+hm+rot), which uses the extra supervision of the relative rotations between body joints that cannot be easily acquired from real-world data. Across the nine actions, our model achieves the best performance on four actions and the second-best performance on another four.
3) Mo2Cap2 Dataset: The results of different methods on the Mo2Cap2 dataset are reported in Table IV. We retrain our model on this synthetic dataset and test it on the public real-world egocentric images. We follow the evaluation protocol of [15] and report BA-MPJPE. For the data captured in the indoor environment, we compare our method with four different methods [15], [22], [35], [36]; our method achieves comparable performance (BA-MPJPE = 60.9) to Xu [15] and outperforms the other three methods, achieving the best or second-best performance on five of the eight actions. For the data captured in the outdoor environment, we compare with three methods [15], [35], [36], and our method performs best overall (BA-MPJPE = 79.2), achieving the best performance on five of the eight actions.

4) Ablation Study:
The lower part of Table II lists the results of the ablation studies on the ECHP dataset. We report the average MPJPE (PA-MPJPE) over all test data in the second column, and the remaining columns show the results for each action. Compared to the other six ablated models B-G, our full network (model A) achieves the best results on both MPJPE and PA-MPJPE. For the action-level comparison, the full network achieves the best or second-best pose estimation for nine actions, significantly outperforming the other models.
Does the feature fusion method work? To evaluate the effectiveness of the proposed three-branch feature fusion mechanism in the egocentric module, we remove each branch one by one, corresponding to models B-D. The results show that all three branches contribute to the final performance.

Can we directly use the triangulated 3D pose for supervision? We perform an ablation study by removing the third-person-view module and directly applying the triangulated 3D pose as supervision. As shown by model E, the performance is clearly worse than that of the full network. This originates from the inaccurate 2D poses and trivial solutions of the triangulated 3D pose in some circumstances. More importantly, with a well-trained third-person-view module, we can ease the data collection procedure by using only one third-person-view camera. We also provide more quantitative results for our proposed third-person-view module in Table VI to demonstrate the effectiveness of the designed regressor over the triangulation method. We captured additional third-person-view images along with the VICON system to perform third-person-view pose estimation and compare our module with three approaches, i.e., the triangulation method, triangulation with temporal smoothing, and the state-of-the-art third-person pose estimation method PARE [11]. During inference, we only feed the exocentric images captured from a single-view camera to the third-person-view module. Our third-person-view module achieves the best performance among them, which highlights its effectiveness and shows that it can generate relatively accurate weak labels.

Does the coarse prior knowledge help train the network? Models F and G highlight the significance of the prior knowledge provided by the triangulated 3D pose and the Aruco-based rotation matrix. With the help of these priors, our full network (model A) consistently improves the performance. In particular, without $L^{itr}_{mat}$ computed from the Aruco-based rotation, the network can hardly estimate the 3D pose in the egocentric coordinate system.

B. Qualitative Results
Fig. 7 shows the visualization results of egocentric 2D pose estimation on the ECHP and Mo2Cap2 datasets by our proposed self-supervised method. For the ECHP dataset, as shown in Fig. 7(a), the proposed EgoFish3D can predict relatively accurate 2D poses even under occlusion of the lower limbs. The generalization ability of the 2D poses predicted by our model on the Mo2Cap2 dataset is exhibited in Fig. 7(b), where the 2D pose estimator is trained by mixing the ECHP dataset with a small amount of synthetic data (∼40k frames) from Mo2Cap2, without further finetuning on this dataset. Besides, as shown in Fig. 8, our egocentric module predicts more accurate 2D poses than direct reprojections, which demonstrates the effectiveness of the reprojection loss $L^{ego}_{rep}$. The egocentric 3D pose estimation results of our method on the ECHP, xR-EgoPose, and Mo2Cap2 datasets are presented in Fig. 9(a)-(c), respectively. To explore the generalization ability of our proposed method, we directly apply the model trained on ECHP to the Mo2Cap2 dataset without finetuning, and we retrain EgoFish3D on xR-EgoPose [14] for the comparison on that dataset. Given a single egocentric image captured by a fisheye lens, the proposed EgoFish3D can predict the 3D pose for different actions, even for unseen subjects and textures and for occluded body parts.

VII. CONCLUSION AND FUTURE WORK
This paper proposes a self-supervised egocentric pose estimation method, called EgoFish3D, to estimate both the 2D and 3D pose in the egocentric view from a single RGB image. This is achieved by leveraging information from both the third-person and egocentric views. Specifically, EgoFish3D incorporates three main modules: the third-person-view module, the egocentric module, and the interactive module. This paper also introduces a real-world EgoCentric Human Pose dataset, called ECHP, which captures images from three different cameras, circumventing the use of a MoCap system to acquire ground truth for training. Our experimental results demonstrate that EgoFish3D can predict relatively accurate 2D and 3D poses. In future work, we aim to make our approach generalize to different placements of the egocentric camera and to combine human pose estimation with more egocentric vision tasks.

Fig. 2 .
Fig. 2. Illustration of the training and inference phases of our proposed EgoFish3D. During the training phase, the third-person-view module takes two images from the third-person view as input and generates a relatively accurate 3D pose represented in the coordinate system of the external cameras, the interactive module predicts the rotation difference between the third-person-view and egocentric coordinate systems, and the egocentric module estimates both 2D and 3D poses from a distorted egocentric image. During the inference phase, only the egocentric module directly predicts 2D and 3D poses from an egocentric image captured by a fisheye camera.

Fig. 3 .
Fig. 3. Details of our proposed ECHP dataset. (a) Placement of the two third-person-view cameras (RGB cameras of the RealSense D455) and a head-mounted egocentric camera (GoPro) with a downward-looking fisheye lens; (b) Selected examples of the different scenes, subjects, and body textures in our ECHP dataset. In total, we capture data of 9 subjects in 8 different scenes with 20 different body textures; (c) 10 different daily actions are recorded in our dataset.

Fig. 4 .
Fig. 4. Overview of our proposed EgoFish3D. The figure shows the network architecture, which contains three modules: the third-person-view module, the egocentric module, and the interactive module. The black arrows indicate the direction of information flow. The colorized lines with arrows indicate different loss functions. The yellow dotted line indicates that two sub-networks share the same weights.

Fig. 5 .
Fig. 5. Examples from our proposed ECHP dataset and the public datasets, i.e., xR-EgoPose [14] and Mo2Cap2 [15]. Compared to xR-EgoPose and Mo2Cap2, both the training and test data in the ECHP dataset are real-world images and suffer less from occlusion of the lower limbs, improving the generalization capability of the learned models in practical applications.
• w/o $\hat{M}$, an ablated model that removes the human instance segmentation masks $\hat{M}_{c_{ego}}$.
• w/o $FT$, an ablated model that removes the feature extraction branch $FT_{c_{ego}}$.
• w/o $\widehat{HM}$, an ablated model that removes the heatmap prediction branch $\widehat{HM}_{c_{ego}}$ in the egocentric module.
• w/o $f^{trd}_{\theta}$, an ablated model that removes the third-person-view module $f^{trd}_{\theta}$; only the triangulated 3D pose is used for supervision.
• w/o $L^{trd}_{pose}$, an ablated model in which the loss constraining the depth-based and triangulated poses in the third-person-view module is removed.

Fig. 7 .
Fig. 7. Visualization results of egocentric 2D pose estimation by our proposed EgoFish3D. (a) On the ECHP dataset; (b) On the Mo2Cap2 dataset. The red points are the predicted joint positions and the colorized lines indicate the skeleton.

Fig. 8 .
Fig. 8. Visualization of 2D poses produced by (a) the proposed egocentric module and (b) reprojection of the transformed 3D pose. The egocentric module generates more accurate 2D poses than the reprojected ones, especially for the joints of the upper limbs.

Fig. 9 .
Fig. 9. Visualization of egocentric 3D pose estimation by our EgoFish3D. (a) On the ECHP dataset. Left: results on the test set (the two bottom rows are new subjects, and all body textures are unseen in the training set); Right: results on the validation set. (b) On the xR-EgoPose dataset, for which we retrain our model. (c) On the Mo2Cap2 dataset, to which we directly apply our model trained on the ECHP dataset. Red denotes the 3D pose predicted by our EgoFish3D and blue denotes the ground truth.

TABLE I: DETAILS OF OUR ECHP DATASET

TABLE II: COMPARISON MPJPE (PA-MPJPE) RESULTS OF EGOCENTRIC 3D POSE ESTIMATION IN MILLIMETERS (mm) ON ECHP DATASET

TABLE III: COMPARISON RESULTS OF EGOCENTRIC POSE ESTIMATION IN MILLIMETERS (mm) ON XR-EGOPOSE DATASET

TABLE IV: COMPARISON BA-MPJPE RESULTS OF EGOCENTRIC 3D POSE ESTIMATION IN MILLIMETERS (mm) ON MO2CAP2 DATASET

TABLE V: ERROR PER JOINT IN MILLIMETERS (mm)

TABLE VI: RESULTS OF OUR PROPOSED THIRD-PERSON-VIEW MODULE COMPARED WITH DIFFERENT METHODS

Fig. 6. Human body joints (N = 15) for the ECHP dataset.