An Eye Tracking based Aircraft Helmet Mounted Display Aiming System

Abstract—A helmet mounted display system is a device used in fighter aircraft to provide situational awareness, scene enhancement images and other information to a pilot. The system is not limited by the field of view and integrates radar and fire control systems into the helmet, allowing the pilot to select and aim at targets by pointing his or her head. However, the direct use of helmet orientation to indicate the direction of interest ignores the more flexible and efficient interaction offered by the pilot's eye movements, and the opaque goggle of the helmet prevents the direct use of a cockpit-mounted camera to sense the pilot's eye gaze. Therefore, an eye gaze direction based helmet display system is introduced to realize the spatial combination of head posture and eye gaze direction. A wide-angle monocular camera installed inside the helmet captures eye images in the helmet coordinate frame in real time. Robust eye gaze directions (namely, pitch and yaw angles) are obtained through a combination of appearance-based and model-based eye tracking algorithms. The eye gaze direction is then combined with the head posture sensor to acquire the pilot's line of sight relative to the cockpit. Prototype experiments demonstrate that the proposed helmet display system allows the user to select and aim at targets faster, and can locate the gaze point in a simulated cockpit environment with less than 2.1 degrees of average error.


I. INTRODUCTION
A helmet mounted display (HMD) is a device used in fighter aircraft to provide visual information to pilots, such as situational awareness, enhanced images of the scene, and strategic information. Furthermore, an HMD allows a pilot's head pointing direction to be used as the control and guidance direction of weapon systems, accelerating the speed of information exchange and flight control [1]. In the military weapon indication system of a jet fighter, the HMD uses inertial, optical, electromagnetic, ultrasonic or hybrid sensor methods to accurately measure the position and orientation of the helmet. The hybrid sensor approach works best, using a combination of inertial and optical sensors to improve tracking accuracy, enhance the update rate and reduce delay.
However, direct use of an HMD's orientation instead of a pilot's eye gaze direction to indicate the target direction ignores the pilot's eye movements, which offer a more flexible, more efficient and freer way of interacting. Eye tracking technology has been a hot research topic in the field of computer vision in recent years. Through various sensors and algorithms, the gaze direction of a user's eyes can be identified and estimated readily, improving the efficiency of various human-computer interaction applications [2], [3]. Yet, the opaque goggle of the flight helmet prevents the direct use of a cockpit-mounted camera to sense a pilot's eye gaze.
Motivated by the aforementioned needs, this work attempts to upgrade the current HMD system with eye gaze direction, combining eye tracking with head orientation to acquire the final aiming direction. Conventional eye tracking uses a fixed camera mounted on a reference surface to capture and calculate the direction of the person's eyes, while the helmet goggles worn by fighter pilots block the view of an external camera towards the pilot's face. Therefore, the camera that captures the eye image must be installed inside the goggles. As a result, the captured images are possibly dark and blurry due to poor lighting conditions (if no additional light source is added) and the vibration of the helmet. In order to obtain an accurate gaze direction of the pilot, it is therefore critically important to detect the pilot's eye features robustly. In this paper, a deep neural network was trained to perform the task of detecting interpretable eye features. The core of this detector is a stacked hourglass convolutional neural network (CNN) architecture that was initially proposed for the human pose estimation task [4]. The task of eye-region landmark detection bears similarities to the problem of joint detection in human hand and full body pose, where a key problem is the occlusion of landmarks by other body parts or decorations. The hourglass architecture captures long-range context by performing repeated bottom-up, top-down inference, which ensures a large effective receptive field and allows for the encoding of spatial relations between landmarks, even under occlusion. This eye-region landmark detector is trained solely on high-quality synthetic eye images [5] which are accurately labeled for the locations of important landmarks in the eye region, such as the eyelid-sclera border, the limbus region (iris-sclera border), and the eye corners.
The key advantage of this approach is that model-based and feature-based gaze estimation methods can be applied even to eye images for which iris localization and ellipse fitting are highly challenging with traditional methods. Three hourglass modules were trained on eye images and annotations provided by UnityEyes [5], and the trained model allows for a real-time implementation (60 Hz). The detected features were then used to fit a 3D eyeball model which directly estimates the gaze direction in 3D [2]. Figure 1 shows the whole pipeline of the proposed system. As the eye tracking system is head-mounted, the measured gaze direction is essentially relative to the helmet. To deduce the line of sight relative to the fixed world frame, the instantaneous kinematic information of the helmet must also be acquired to transform the eye-to-head direction into the final gaze direction. Various methods exist, as stated previously, to measure the instantaneous orientation of the helmet under versatile acceleration situations. Here, a gyroscope, which cannot measure the correct orientation under the accelerations of an aircraft, was used simply to illustrate the working principle of the proposed eye tracking based HMD aiming system in a lab environment. The posture of the helmet relative to the cockpit and the gaze direction relative to the head are then combined to represent the final aiming direction. To evaluate the feasibility and precision of this system, experiments were conducted on an HMD prototype to estimate the final gaze point on the screen of a desktop computer. To obtain the relative pose of the head and the screen, an additional camera is placed on the top border of the screen and an ArUco marker [6] is attached to the front of the prototype helmet. The prototype experiment shows that the average angular error is less than 2.1 degrees when the distance from the screen is about 60 cm, which demonstrates the feasibility of the eye tracking based HMD aiming system.
This paper makes the following three contributions. First, a large number of synthesized eye images are used to train an eye region landmark extractor. Second, eye movements are added to the helmet mounted display aiming system to determine the final aiming direction. Third, an experimental prototype was built and the robustness of our HMD aiming was evaluated in a lab environment.

II. RELATED WORK

A. Helmet Mounted Display Aiming System
Modern advanced fighters mostly use a helmet mounted display (HMD) aiming system to ameliorate human-system interaction. Its main working principle is to accurately measure the orientation (pitch, yaw and roll) of a pilot's helmet to indicate the aiming direction. In some cases, the position (x, y, and z) of the pilot's helmet relative to the cockpit must also be measured. The current HMD aiming system generally uses hybrid sensor methods, such as combined inertial and optical sensors, to accurately measure the position and orientation of the helmet, which improves tracking accuracy and update rate and reduces delay. The Joint Helmet-Mounted Cueing System (JHMCS) is a derivative of the DASH III and Kaiser Agile Eye helmet displays, developed by Vision Systems International (VSI). The JHMCS retains the same electromagnetic helmet position sensor as the DASH (Display and Sight Helmet) system, and uses a newer and faster digital processing software package [7]-[10]. The JHMCS can be integrated with night vision goggles to provide pilots with visual information about the cockpit and environment at night.
French company Thales put its helmet mounted display system, code-named Scorpion, on the military aviation market in 2008. The attitude of the Scorpion was initially measured by alternating current (AC) electromagnetic sensors, later replaced with a Hybrid Optical-based Inertial Tracker (HObIT) [11]-[13]. The HObIT was developed by InterSense and tested by Thales [14]. In addition, compared with the JHMCS, the Scorpion can display colorful numbers and symbols. In the aircraft mission system, the helmet is used to indicate and guide the aiming sensor and over-the-horizon missiles.
The Eurofighter Typhoon uses the Helmet-Mounted Symbology System (HMSS) developed by British companies BAE Systems and Pilkington Optronics, code-named Striker, with an upgraded version named Striker II. It can display raster images and symbol information, and can be integrated with night vision goggles. Similar to the DASH system, the HMSS uses an integrated helmet position sensor to measure and indicate the direction of the pilot's line of sight, and to ensure that the displayed image information and symbols are consistent with the pilot's head movement [15], [16].
Existing helmet aiming systems all directly use the direction of the helmet to indicate the targeting direction, ignoring the pilot's eye movements, which offer a more flexible and more efficient interaction method. Therefore, studying the eye tracking based helmet mounted display aiming system has important strategic significance for improving the combat effectiveness of fighters.

B. Eye Tracking Technology
Feature-based eye tracking methods generally detect certain visual features of a human eye from the captured images, such as the pupil, iris, eye corners and corneal reflection points (with a light source), and then extract the relevant eye gaze parameters to estimate the gaze point by a mapping model. Sesma et al. [17] introduced a simple and common approach that uses the pupil-center-eye-corner (PC-EC) vector to represent eye movements. Huang et al. [18] formulate a feature vector from the estimated head pose and the distances between 6 landmarks detected on a single eye. To address the difficulty of accurately detecting the eye corner, researchers have added a light source to the system and replaced the eye corner point with the Purkinje spot formed by the light source reflected off the cornea [19]-[22]. In addition, methods have been proposed that use a deep neural network to detect the eye features [2], [23]. Park et al. [2] used a deep convolutional neural network trained solely on synthetic eye images to extract 18 eye region landmarks. Rakshit et al. [23] proposed training a convolutional neural network to directly segment entire elliptical structures and demonstrated that such a framework is robust to occlusions.

Model-based methods attempt to fit a known 3D model to the eye image by minimizing a suitable energy [24]-[28]. The symmetry axis of the eyeball is the optical axis, and there is a fixed deflection angle between the optical axis of the eyeball and the visual axis that varies from person to person. This angle is called the kappa angle and is usually around 5 degrees. Eye tracking refers to estimating the direction of the visual axis and determining the gaze point based on information about the observed scene. The eyeball can generally be regarded as two intersecting spheres with deformations; the center and radius of the eyeball as well as the angular offset between the visual and optical axes are determined during user calibration procedures.
Further, approaches also proposed to use a neural network to fit an eyeball to an eye image [2], [24], [29], [30].
Appearance-based methods directly take raw eye images (or other auxiliary information, such as the face) as input and train a mapping model between the eye appearance and the gaze direction. Appearance-based methods often require large, diverse training datasets [31]-[34] due to changes in head pose, illumination and (partial) occlusions, and typically leverage some form of convolutional neural network (CNN) architecture as well as different input data modalities. Krafka et al. [32] introduced the large-scale dataset GazeCapture. Using this dataset, a convolutional neural network performing 2D gaze estimation was trained, achieving an error of less than 1.34 cm. Zhang et al. [33] presented MPIIGaze, which has become a commonly used public dataset. Zhang et al. [35], [36] adapted the LeNet-5 and VGG-16 architectures and introduced spatial weights to encode the face image. Besides, weakly supervised gaze estimation from video [37] and gaze estimation from eye landmark heatmaps [38] have also been proposed.

III. EYE REGION LANDMARK DETECTION

A. Neural Network Architecture
Since a custom-made camera that captures the pilot's eye image is installed inside the goggle, the relatively dark environment and the vibration of the helmet demand robust eye feature detection. Here, a deep convolutional neural network is trained to perform the task of eye region landmark localization, which has been demonstrated to be a robust method in previous works [2], [4], [39]. The primary building block of this network is the residual module, named after its skip connection, as shown in Fig. 2. The first path is the residual mapping, composed of three convolutional layers (white) with different kernel scales in series, with batch normalization (BN, blue) and ReLU (orange) interleaved. The second path is the identity mapping, which only contains a convolutional layer with a kernel size of 1. The stride of all convolutional layers is 1, with matching padding, so the spatial size of the data does not change; only the depth (channels) changes. The residual module is controlled by two parameters: input depth M and output depth N. The hourglass module, the core component of the network, is built from residual modules. The hourglass module has different levels of complexity depending on its order. Both the upper and lower paths contain several residual modules (gray), which gradually extract deeper features. The upper path operates at the original scale, while the lower path undergoes a process of down-sampling (divided by 2) and then up-sampling (multiplied by 2). Down-sampling (green) uses max pooling with a kernel size of 2×2 and a stride of 2, and up-sampling (light orange) uses bilinear interpolation. The input and output depth of each residual module is 32. The feature maps are downscaled via pooling operations and then upscaled using bilinear interpolation, forming an hourglass-shaped network structure.
A second-order hourglass is obtained by replacing the middle residual module of a first-order hourglass with a first-order hourglass module; replacing the middle residual module of a second-order hourglass with a first-order hourglass module in turn yields a third-order hourglass. The increase in the order of the hourglass module can therefore be regarded as a recursive process. This work uses a fourth-order hourglass module, as shown in Fig. 3. Three fourth-order hourglass modules were stacked to perform the feature extraction task. The architecture of the entire network is shown in Fig. 4.
The stacked hourglass network architecture has previously been applied to human pose estimation and facial landmark detection, where complex spatial relations need to be modeled at various scales [4], [39]. The hourglass architecture has proven effective at estimating the locations of occluded joints or key points. The architecture performs repeated multi-scale refinement of feature maps, which ensures a large effective receptive field and is capable of encoding spatial relations between landmarks, even under occlusion. The network calculates the coordinates of each eye feature point through a soft-argmax layer, and then appends 3 linear fully-connected layers with 100 neurons each (with batch normalization and ReLU activation) and one final regression layer with a single neuron to predict the eyeball radius.
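The residual and recursive hourglass modules described above can be sketched in a few lines of PyTorch. This is a condensed illustration under the stated settings (2×2 max pooling with stride 2, bilinear up-sampling, residual depth 32), not the authors' implementation; class and parameter names are our own, and the number of residual modules per branch is reduced for brevity.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Residual module: a 3-conv residual path plus a 1x1 identity path."""
    def __init__(self, m, n):
        super().__init__()
        self.path = nn.Sequential(
            nn.BatchNorm2d(m), nn.ReLU(),
            nn.Conv2d(m, n // 2, 1),            # 1x1 bottleneck
            nn.BatchNorm2d(n // 2), nn.ReLU(),
            nn.Conv2d(n // 2, n // 2, 3, padding=1),  # 3x3, stride 1, padding 1
            nn.BatchNorm2d(n // 2), nn.ReLU(),
            nn.Conv2d(n // 2, n, 1),            # 1x1 expansion
        )
        self.skip = nn.Conv2d(m, n, 1)          # identity mapping (1x1 conv)

    def forward(self, x):
        return self.path(x) + self.skip(x)

class Hourglass(nn.Module):
    """Order-k hourglass: the upper branch stays at full scale; the lower
    branch max-pools by 2, recurses (order - 1), then up-samples by 2."""
    def __init__(self, order, ch=32):
        super().__init__()
        self.up = Residual(ch, ch)
        self.pool = nn.MaxPool2d(2, 2)
        self.low1 = Residual(ch, ch)
        self.low2 = Hourglass(order - 1, ch) if order > 1 else Residual(ch, ch)
        self.low3 = Residual(ch, ch)
        self.unpool = nn.Upsample(scale_factor=2, mode="bilinear",
                                  align_corners=False)

    def forward(self, x):
        low = self.unpool(self.low3(self.low2(self.low1(self.pool(x)))))
        return self.up(x) + low
```

A fourth-order module requires input sizes divisible by 16; e.g. `Hourglass(4)` maps a `(1, 32, 32, 48)` tensor to an output of identical shape, which is what allows three such modules to be stacked.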

B. Training Data
High-quality synthetic eye images from UnityEyes [5] were utilized for training. These synthesized data provide rich and accurate eye feature point coordinates, including the eyelid-sclera border, the limbus region (iris-sclera border), and the eye corners, as shown in Fig. 5. Specifically, the eyelid-sclera border has 16 annotation points, the iris-sclera border has 32 annotation points, and the eye corners have 7 annotation points. The dataset additionally includes pupil size, head posture, gaze direction, etc. UnityEyes is effectively infinite in size and was designed to exhibit good variation in iris color, eye region shape, head pose and illumination conditions.

C. Intermediate Supervision
Multiple hourglass modules are stacked in the entire network structure, so that the network can repeat the bottom-up and top-down processes. The key to adopting this structure is intermediate supervision: the loss of the heatmaps output by each hourglass module is calculated at the convolutional layers marked by the red boxes in Fig. 4. The loss of each hourglass module is thus computed separately, so that subsequent hourglass modules can re-evaluate and reassess higher-order spatial relationships. In this work, we selected a total of 18 eye feature points: 8 on the eyelid-sclera border, 8 on the iris-sclera border, 1 at the iris center, and 1 at the eyeball center. The network performs the task of predicting heatmaps, one per eye feature point. The heatmaps encode the per-pixel confidence for a specific feature point. Two-dimensional Gaussian distributions are centered at the sub-pixel feature point positions, with a peak value of 1. The neural network minimizes the $\ell_2$ distance between the predicted and ground-truth heatmaps per feature point via the loss term

$\mathcal{L}_{heatmaps} = \alpha \sum_{p} \lVert h(p) - \tilde{h}(p) \rVert_2^2,$

where $h(p)$ is the confidence at pixel $p$ and $\tilde{h}$ is a heatmap predicted by the network. The weight coefficient $\alpha$ is empirically set to unity. Additionally, the loss term for predicting the radius of the eyeball is

$\mathcal{L}_{radius} = \beta \lVert \tilde{r}_{uv} - r_{uv} \rVert_2^2,$

where $\tilde{r}_{uv}$ is the predicted eyeball radius, $r_{uv}$ is the ground truth, and $\beta$ is set to $10^{-7}$.
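The ground-truth heatmap construction and the per-landmark $\ell_2$ loss can be illustrated with NumPy. A minimal sketch, assuming a unit-peak Gaussian and a sum-of-squares distance; the `sigma` value and function names are our own.

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """2D Gaussian centred at the (sub-pixel) landmark (cx, cy), peak value 1."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def heatmap_loss(pred, gt, alpha=1.0):
    """Weighted l2 distance between predicted and ground-truth heatmaps,
    summed over all pixels (and over landmarks if stacked on a leading axis)."""
    return alpha * np.sum((pred - gt) ** 2)

# One 36x60 heatmap for a landmark at (30, 18); a perfect prediction has zero loss.
gt = gaussian_heatmap(36, 60, 30.0, 18.0)
```

During training, this loss is evaluated on the output of every hourglass module, so gradient signal reaches the intermediate stages directly.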

D. Training Process
To increase the robustness of the model, data augmentation was employed during training. The ADAM optimizer was used, with a learning rate of $5 \times 10^{-4}$, a batch size of 16, an $\ell_2$ regularization coefficient of $10^{-4}$ and ReLU activations. The model was trained for 6M steps on an Nvidia GTX 1660 Super GPU; it consists of fewer than 1 million parameters and allows for a real-time implementation (60 FPS).
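The optimizer configuration above can be sketched as a minimal training step. The single-layer `Conv2d` stand-in model and the function name are ours (not the stacked-hourglass network), and Adam's `weight_decay` is used here to approximate the $\ell_2$ regularization term.

```python
import torch

# Hyperparameters as stated in the text.
model = torch.nn.Conv2d(1, 18, 3, padding=1)  # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)

def train_step(images, gt_heatmaps):
    """One ADAM step minimizing the heatmap l2 loss (batch size 16 in the text)."""
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(images), gt_heatmaps)
    loss.backward()
    opt.step()
    return float(loss.detach())
```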

IV. GAZE ESTIMATION AND HMD AIMING

A. Model-based Gaze Estimation
A simple model of the human eyeball can generally be regarded as a large sphere intersected by a smaller sphere representing the corneal bulge, as shown in Fig. 6. Suppose the predicted coordinates of the 8 iris landmarks in a given eye image are $(u_{i1}, v_{i1}), \dots, (u_{i8}, v_{i8})$. In addition, the eyeball center $(u_c, v_c)$ and the iris center $(u_{i0}, v_{i0})$ are also detected. Finally, the network predicts the eyeball radius in pixels, $r_{uv}$. Knowing the eyeball and iris center coordinates and the eyeball radius in pixels makes it possible to fit a 3D model without access to any camera intrinsic parameters.
Since the intrinsic parameters of the camera are not known, the coordinates can only be unprojected into 3D space in pixel units. Thus the radius remains $r_{xy} = r_{uv}$ in 3D model space and $(x_c, y_c) = (u_c, v_c)$. Assuming the gaze direction is $g^c = (\theta, \phi)$, the iris center coordinates can be represented as

$u_{i0} = u_c + r_{uv} \cos\theta \sin\phi, \quad v_{i0} = v_c + r_{uv} \sin\theta.$

To write similar expressions for the 8 iris edge feature points, the angular iris radius $\delta$ and an angular offset $\gamma$, which is equivalent to eye roll, are jointly estimated. For the $j$-th iris edge feature point (with $j = 1, \dots, 8$):

$u_{ij} = u_c + r_{uv} \cos\theta_j \sin\phi_j, \quad v_{ij} = v_c + r_{uv} \sin\theta_j,$

where

$\theta_j = \theta + \delta \sin\gamma_j, \quad \phi_j = \phi + \delta \cos\gamma_j, \quad \gamma_j = \gamma + (j-1)\pi/4.$

For this model-based gaze estimation, $\theta$, $\phi$, $\gamma$ and $\delta$ are unknown, whereas the other variables are provided by the eye-region landmark localization step of the network. An iterative optimization method, such as the conjugate gradient method, was used to solve this problem by minimizing the loss

$\min_{\theta, \phi, \gamma, \delta} \sum_{j} \lVert (\hat{u}_{ij}, \hat{v}_{ij}) - (u_{ij}, v_{ij}) \rVert_2^2,$

where $(\hat{u}_{ij}, \hat{v}_{ij})$ are the estimated pixel coordinates of the $j$-th iris feature point at each iteration. Calculating person-specific parameters from calibration samples adapts this model to a specific person: gaze correction can be applied as $(\hat{\theta}, \hat{\phi}) = (\theta + \Delta\theta, \phi + \Delta\phi)$, where $(\Delta\theta, \Delta\phi)$ is the person-specific angular offset between the optical and visual axes.
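The iterative model fit can be sketched with SciPy's least-squares solver. The text mentions the conjugate gradient method; a trust-region least-squares solver is used here for brevity, and the function names and initial guess are our own assumptions. The forward model matches the iris-landmark equations above.

```python
import numpy as np
from scipy.optimize import least_squares

def project_iris(params, uc, vc, r):
    """Project 8 iris-edge landmarks given (theta, phi, gamma, delta),
    eyeball centre (uc, vc) and eyeball radius r, all in pixel units."""
    theta, phi, gamma, delta = params
    g = gamma + np.arange(8) * np.pi / 4.0   # gamma_j, 45-degree spacing
    th = theta + delta * np.sin(g)           # theta_j
    ph = phi + delta * np.cos(g)             # phi_j
    u = uc + r * np.cos(th) * np.sin(ph)
    v = vc + r * np.sin(th)
    return np.stack([u, v], axis=1)

def fit_gaze(iris_pts, uc, vc, r):
    """Least-squares fit of (theta, phi, gamma, delta) to the 8 detected
    iris-edge landmarks; the starting point is an arbitrary guess."""
    res = least_squares(
        lambda p: (project_iris(p, uc, vc, r) - iris_pts).ravel(),
        x0=np.array([0.0, 0.0, 0.0, 0.4]))
    return res.x
```

Given synthetic landmarks generated from known angles, the fit recovers pitch and yaw; the $(\delta, \gamma)$ pair has a sign/phase ambiguity that does not affect the gaze direction.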

B. HMD Aiming
In early works, the pilot's head posture was used to select and aim at targets. This work combines the gaze direction with the head posture for more accurate and flexible human-machine interaction. Through the above derivation, we have acquired the gaze angles (θ, φ) of the eye relative to the camera that captures the facial features, presented as the pitch and yaw angles. Since the eyeball does not roll, the roll angle ϕ can be regarded as 0. Conventional eye tracking uses a fixed camera mounted on a reference surface to capture and calculate the direction of the human eyes, while the helmet goggles worn by fighter pilots block the view of an external camera towards the pilot's face. To prevent interference from the helmet goggles, the camera that captures the eye images is installed inside the goggles and fixed to the helmet. Though optical or hybrid sensor methods are used in a real aircraft, since a gyroscope cannot measure the correct orientation under heavy maneuvers, a 6-axis gyroscope was installed on top of the helmet to detect head posture for a straightforward proof-of-concept illustration. For the follow-up experiment, a desktop screen and a web-cam installed directly above the screen were used to obtain the aiming point on the screen for the aiming accuracy evaluation. Four coordinate frames, namely the HMD camera frame, the HMD frame, the web-cam frame and the space (cockpit) frame, are established as shown in Fig. 7.
Through the 3D gaze estimation process, the eye gaze direction relative to the camera mounted on the helmet, $g^h_e = (\theta, \phi, \varphi)$, is acquired. The head attitude relative to the aircraft cockpit is measured by the gyroscope as $p^c_h = (\alpha, \beta, \gamma)$. The head attitude uniquely determines a unit vector $e^c_h = (\cos\alpha, \cos\beta, \cos\gamma)$; both $p^c_h$ and $e^c_h$ indicate the direction of the head. Since the gaze direction is relative to the camera mounted on the helmet, it can be considered an additional rotation applied to the head attitude. The gaze angles can be converted into a rotation matrix $R^h_e \in SO(3)$ composed of elemental rotations about the three axes,

$R^h_e = R_z(\varphi) R_y(\phi) R_x(\theta),$

and the unit vector of the final gaze direction can then be calculated as

$v^c_e = e^c_h \cdot R^h_e = (x_e, y_e, z_e).$

The corresponding $g^c_e$ indicates the final aiming direction relative to the cockpit, taking both eye movement and head posture into account.
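The composition of head attitude and eye-in-head rotation can be sketched as follows. The elemental-rotation order for $R^h_e$ is an assumption (the text does not specify it), and the helper names are our own; the eye roll is zero as stated above.

```python
import numpy as np

def rot_x(a):
    return np.array([[1, 0, 0],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a),  np.cos(a)]])

def rot_y(a):
    return np.array([[ np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0],
                     [np.sin(a),  np.cos(a), 0],
                     [0, 0, 1]])

def aiming_direction(head_dir, theta, phi, roll=0.0):
    """Rotate the head-direction unit vector e_h^c by the eye-in-head
    rotation R_e^h built from pitch (theta), yaw (phi) and roll (zero)."""
    R = rot_z(roll) @ rot_y(phi) @ rot_x(theta)
    v = head_dir @ R
    return v / np.linalg.norm(v)
```

With zero gaze angles the aiming direction reduces to the head direction, as expected; a rotation preserves the unit norm of the vector.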

V. EXPERIMENTS

A. HMD Aiming System Setup
To evaluate the effectiveness and robustness of the proposed eye tracking based helmet aiming system, a series of experiments was conducted to quantitatively demonstrate how well the system performs. A total of nine marker points were fixed on the screen, covering most of the screen area; the screen is 54 cm long and 30 cm wide, as shown in Fig. 7. A camera was additionally installed above the screen for estimating the coordinates of the head and marker points relative to the screen. Note that additional camera calibration was required here. To distinguish between the camera on the screen and the camera on the helmet, we denote the camera on the helmet as $C_a$ and the camera on the screen as $C_b$. Assume that the x-y plane of the $C_b$ coordinate system coincides with the screen, so the coordinates of each landmark under camera $C_b$ are known. ArUco is a synthetic marker [6], usually used for target detection and positioning; here it is used to detect the position of the subject's head. Other methods, such as a depth camera, could also accomplish this job. Since the position of the head $p^c_h = (x_h, y_h, z_h)$ is acquired by detecting the ArUco marker, together with the aiming direction, the position of the line of sight on the screen can be calculated as

$x_s = x_h + z_h \tan\theta^c_e, \quad y_s = y_h + z_h \tan\phi^c_e,$

where $z_h$ represents the vertical distance of the head in the $C_b$ coordinate system, that is, the distance between the head and the screen; $x_h$ and $y_h$ are the horizontal and vertical coordinates of the head in the $C_b$ coordinate system; $\theta^c_e$ represents the yaw and $\phi^c_e$ the pitch of the eyeball. The accuracy of the eye gaze tracking system is quantitatively evaluated by calculating the angular error

$E_{dg} = \arctan(E_d / E_g),$

where $E_d$ is the distance between the estimated gaze position and the real observed position, and $E_g$ is the distance between the subject and the screen plane.
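The screen-intersection and angular-error relations can be written directly in code. A sketch consistent with the variables described above; the coordinate conventions and function names are our assumptions.

```python
import numpy as np

def gaze_point_on_screen(head_pos, theta, phi):
    """Intersect the sight line with the screen plane (the x-y plane of C_b).
    head_pos = (x_h, y_h, z_h), where z_h is the head-to-screen distance;
    theta is the yaw and phi the pitch of the final aiming direction."""
    x_h, y_h, z_h = head_pos
    return (x_h + z_h * np.tan(theta), y_h + z_h * np.tan(phi))

def angular_error(e_d, e_g):
    """E_dg = arctan(E_d / E_g), in degrees: E_d is the on-screen distance
    between estimated and true gaze points, E_g the subject-to-screen distance."""
    return np.degrees(np.arctan2(e_d, e_g))
```

For example, a 2.1 cm on-screen error at 60 cm viewing distance corresponds to roughly 2.0 degrees, matching the accuracy figures reported below.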

B. Aiming at Different Distances
To calculate the user's point of gaze on the screen, the positional relationship between locations on the screen and the head position must be determined. Subjects were first asked to stay still and look at several points whose positions on the screen are known. While the user fixates each point on the screen, the eye movements and the head rotation are measured. Since the measured head posture is an absolute value, the rotation of the head relative to the screen must be known; a mapping between the two sets of points is therefore generated through the calibration of the gyroscope. The subjects were about 60 cm away from the screen and were requested to look at different marker points on the screen while the estimated gaze points were recorded. The angular error was computed with respect to the target point positions. The experiment tested the gaze point error at head distances of 40, 60, and 80 cm; the results are shown in Fig. 8 and summarized in Table I. It is assumed that the coordinates of the estimated gaze point obey Gaussian distributions:

$\hat{x} \sim N(\mu_x, \sigma_x^2), \quad \hat{y} \sim N(\mu_y, \sigma_y^2),$

where $\mu_x$, $\sigma_x^2$, $\mu_y$ and $\sigma_y^2$ are the sample means and variances for the horizontal and vertical directions. The radius of the light blue circle in the figure is

$r = 3\sqrt{\sigma_x^2 + \sigma_y^2},$

which represents the estimated gaze point distribution with 99.7% confidence. Note that as the head-to-screen distance increases, the estimated gaze point error in distance increases. When the head is 40 cm away from the screen, the average error is 2.00 cm, achieving high accuracy. When the head is 80 cm away from the screen, the average gaze point error is 3.02 cm. In terms of distance, the point error gradually increases, but in terms of angular error, the change is not significant.
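Assuming the gaze-point coordinates are Gaussian, a circle covering roughly 99.7% of the points (the three-sigma rule) has a radius determined by the sample variances. A sketch; the exact radius definition used for the figure is our assumption.

```python
import numpy as np

def confidence_circle_radius(xs, ys):
    """Radius r = 3 * sqrt(sigma_x^2 + sigma_y^2) of the circle covering
    ~99.7% of the estimated gaze points under the Gaussian assumption."""
    return 3.0 * np.sqrt(np.var(xs) + np.var(ys))
```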
From the 2.00 degree error for the 40 cm case to the 2.10 degree error for the 80 cm case, the angular error only increases by 0.1 degree, demonstrating the consistently good performance of the proposed HMD aiming system prototype.

C. Aiming by Different Persons
Five people of different ages and genders voluntarily tested the proposed HMD prototype. The box plots in Fig. 9 show the accuracy of the five voluntary subjects. The gaze accuracy varied due to different physical characteristics such as the eyes, head movement, height and sitting position. The subjects were requested to look at 9 positions on the screen, and the estimated gaze points were recorded. We computed the angular error with respect to the target point positions.
The proposed HMD aiming method was also compared with related works such as Skodras et al. [40], Cheung et al. [41], Arar et al. [42], Li et al. [43] and Wang et al. [44], with results presented in Table II. Conclusively, the proposed work achieves an accuracy of 2.07 degrees under free head movement without any light source. Note that our equipment is head-mounted and intended for aiming in fighter aircraft; as long as the positional relationship between the helmet and the gaze plane is calibrated, the gaze point prediction can be achieved.

VI. CONCLUSION
This work proposed a comprehensive and effective HMD aiming system to provide a freer, faster and more flexible human-machine interaction method for aircraft pilots. The detailed design and the underlying algorithms were presented and discussed. The eye gaze direction is obtained by combining the eye-to-helmet orientation and the helmet-to-cockpit orientation. A custom-made wide-angle monocular camera installed inside a flight helmet captures eye images, circumventing the blocking issue caused by the opaque helmet goggle. Proof-of-concept prototype experiments illustrate the consistently good performance of the proposed HMD aiming system prototype, as well as advantages such as free head movement and no additional light source compared to related works. Conclusively, the proposed HMD aiming system offers a feasible alternative to current HMD technology and provides a practical human-machine interaction method.