Multi-Ellipsoidal Extended Target Tracking with Variational Bayes Inference

—In this work, we propose a novel extended target tracking algorithm, which is capable of representing a target or a group of targets with multiple ellipses. Each ellipse is modeled by an unknown symmetric positive-deﬁnite random matrix. The proposed model requires solving two challenging problems. First, the data association problem between the measurements and the sub-objects. Second, the inference problem that involves non-conjugate priors and likelihoods which needs to be solved within the recursive ﬁltering framework. We utilize the variational Bayes inference method to solve the association problem and to approximate the intractable true posterior. The performance of the proposed solution is demonstrated in simulations and real-data experiments. The results show that our method outperforms the state-of-the-art methods in accuracy with lower computational complexity.


I. INTRODUCTION
Recent advances in autonomous vehicles, robotics, and intelligent systems entail the need to not only estimating the position of an object but also recognizing its shape. This requirement is usually fulfilled with short-range sensor systems, where it is possible to collect multiple instantaneous measurements from a single target. In contrast to traditional point target tracking methods, one can extract more information from the measurements, such as the shape, size, or orientation of the target. The special algorithms, which are capable of estimating these unknowns together with the target's kinematic state, are called Extended Target/Object Tracking (ETT/EOT) algorithms.
A primitive approach to ETT is to represent the target's extent as a simple shape and estimate the relevant parameters. These simple shapes can be a line [1], a circle [2] or a rectangle [3]. More complex shapes can be defined using random hyper-surface models which assume that the measurements are generated from an unknown random surface [4]- [6]. More recently, Gaussian Process (GP) based models [7], [8] were proposed to estimate the extent of targets with unknown shapes. GP based ETT algorithms define the target's contour as an unknown radial function with a GP prior. An advantage of these models is that the estimated contours are descriptive, i.e., the contour representations can be utilized in further purposes such as classification of targets [9], [10].
The authors are with the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey e-mail: barkint@metu.edu.tr, umut@metu.edu.tr, emreo@metu.edu.tr. One of the most common approaches in the ETT literature is the random matrix model (RM) [6], [11]- [16], which was pioneered by Koch [11]. In RM based methods, the extent is approximated by an ellipse which is represented by a symmetric positive definite (SPD) unknown matrix. The inference in [11] neglects the measurement noise covariance matrix in order to meet the conjugacy requirement. This approach was later improved in [12], where the measurement noise covariance is incorporated into the updates. More recent studies consider the orientation angle of the target together with the RM model [6], [14]- [16].
A single ellipsoidal representation of the target can provide only lumped information about the shape of the target, which may result in over-simplified representations. To remedy this problem, more recent works on random matrices focus on multi-ellipsoidal (ME) models [17]- [20], where the target extent is represented with more than one ellipse. Each ellipse is called a sub-object, and as the number of sub-objects increases, a finer representation of the target's extent can be obtained. A representative example of using multiple ellipses for target extent is given in Figure 1. ME models in the literature use different inference methods for estimating a mixture of Gaussian inverse Wishart (GIW) distributions which represent the kinematic state together with the extent. In [17], a particle filter is used for inference which becomes computationally heavy when the number of particles is increased. The ME model proposed in [20] does not contain a unified kinematic model for different sub-objects. In [19], the kinematics of the sub-objects are unified however, the method requires computationally heavy partitioning algorithms to associate the measurements with the sub-objects. Furthermore, mixture reduction algorithms are required to manage the number of components resulting from different association hypotheses. In [18], an approach that can handle varying numbers of sub-objects in time is proposed. In this work, we present a ME-ETT approach utilizing variational Bayes inference for solving the measurement association problem and obtaining an approximate distribution for the intractable posterior. Our approach does not require any clustering [20], partitioning [19], mixture reduction [19] and merging methods [20]. The resulting algorithm has low computational complexity, and outperforms the state-of-the-art methods in accuracy. The rest of the paper is organized as follows. First, the problem formulation is given in Section II. Then, the variational inference for the measurement update is explained in Section III. Section IV presents the time update. In Section V the simulation results will be shown and discussed. Finally, we will conclude the article in Section VI.

II. PROBLEM FORMULATION
We consider an extended target composed of L ≥ 1 subobjects. At any time k, the extent state of the th sub-object is represented by the SPD matrix X k ∈ R ny×ny , = 1, . . . , L. We define the kinematic state of the extended object as where x c k ∈ R ny denotes a reference point on the target and {µ k ∈ R ny } L =1 denote the displacement vectors of the sub-object centers from the reference point x c k [17]. We assume that µ 1 k = 0, i.e., the reference point of the extended target is at the center of the first sub-object without loss of generality. The vectorx k contains all non-positional kinematic • Set of real matrices of size m × n is represented with R m×n . • N (x; µ, Σ) represents the multivariate Gaussian distribution with mean vector µ ∈ R nx and covariance matrix Σ ∈ S nx ++ , • IW(X; v, V) represents the inverse Wishart distribution over the realvalued positive definite matrix X with degrees of freedom v, and positive definite scale matrix V where etr(·) e tr(·) . • For the number of measurements M k ∈ Z + , y • c \φ is a generic constant that denotes the constant terms with respect to variable φ in an equation.
The iterate numbers are shown with parenthesized superscripts, e.g., q (i) (·). • The expectation with respect to a specific variable is shown with a subscript in the expectation sign, e.g., Ez will show an expectation with respect z When it is clear from the context which variable we are taking the expectation with respect to, we will show the expectation with an overline, e.g., (X k ) −1 will denote E X [(X k ) −1 ]. • When it is necessary to take expectations with respect to all random variables except for one of them, we will use a backslash "\" in the subscript of the expectation operator, e.g., E \z will denote expectation with respect to all random variables except z is abbreviated as p(·|Y k−1 ) for the sake of brevity. • The quadratic forms x T Ax and outer products xx T are written as x T A(·) and x(·) T , respectively to avoid unnecessarily duplicating long expressions.
information, such as velocity and acceleration. An illustration of this kinematic state vector for L = 4 is shown in Figure 2. The kinematic state vector x k ∈ R nx is assumed to evolve with the following linear Gaussian dynamics where F ∈ R nx×nx denotes the state transition matrix and w k ∈ R nx denotes zero-mean white Gaussian process noise vector with covariance matrix Q, i.e., w k ∼ N (w k ; 0, Q). Suppose at time k, a set of M k target-originated measurements are captured by the sensor which is denoted as The measurements are assumed to be conditionally i.i.d. and distributed according to the following Gaussian mixture.
is the set of sub-object extent matrices X k , = 1, . . . , L. • R ∈ R ny×ny is the SPM measurement noise covariance matrix. • s ∈ R + is a known positive scaling constant. • Other necessary notations can be found in Table I. The likelihood of the set of measurements y 1:M k k is then given as The aim is to find the posterior distribution denotes the cumulative set of measurements obtained from the sensor up to and including time k. At each time step k, the posterior distribution p(x k , X 1:L k , π 1:L k Y k ) is assumed to be in the form given below.
where m k|k , P k|k are the mean and covariance of the kinematic state vector x k , respectively. The notations IW(X; v, V) and D π 1:L , α 1:L denote the inverse Wishart and Dirichlet distributions, respectively The recursive calculation of the posterior p(x k , X 1:L k , π 1:L k Y k ) will be carried out in two steps, namely, the measurement update and time update, which will be investigated separately in the following sections.
In the measurement update, we would like to update the predicted distribution in (7) with the likelihood in (4b). Unfortunately, such an update is not analytically tractable, and furthermore, it would not result in a posterior in the form (5).
In order to ensure analytical tractability and to preserve the form of the posterior, first, we are going to define some latent variables and then resort to variational approximation. We first define the latent variables z j k , j = 1, . . . , M k which represent the noise-free measurements satisfying Note that the conditional joint distribution for y j k , z j k is given as We also define the association/responsibility vector r j k for each measurement y j k (or z j k ), which is a binary vector of size L defined as where the elements r j, k ∈ {0, 1}, = 1, . . . , L, which are called responsibilities in the literature [21,Ch. 10]. These binary variables satisfy Note that, with these properties, the elements r j, k are all equal to zero except for one of them which is unity. The index * for which r j, k is equal to unity, i.e., r j, * k = 1, is the index of the sub-object which the measurement y j k (or z j k ) is associated to. The distribution of r j, k is defined as follows. P r j, k = 1 π 1:L k π k (12) for = 1, . . . , L. The expression (12) can be written as for j = 1, . . . , M k . Note that given the association vector r j k , the noisy and noiseless measurements y j k and z j k are distributed as for j = 1, . . . , M k where * is the index for which the element r j, k is equal to unity, i.e., r j, * k = 1. The expressions (14) can conveniently be written as for j = 1, . . . , M k . Using these expressions, we can write the conditional joint distribution for y j k , z j k , r j k as The overall conditional joint distribution for can then be written as Since the posterior for x k , X 1:L k , π 1:L k given Y k is intractable, we aim for approximating the joint posterior for given Y k in the following form.
We calculate the terms of the approximation above iteratively using variational approximation with the following true joint density.
The logarithm of the joint density above is given as With this log-distribution, the factors q z (·), q r (·), q x (·), q X (·), and q π (·) can be calculated using variational Bayes approach [21, Ch. 10] as where the parametersẑ j k|k , P z,j k|k , γ j, k|k , m k|k , P k|k , v k|k , V k|k , α k|k , j = 1, . . . , M k , = 1, . . . , L, are found iteratively. Once the updated distributions in (21) are obtained we can approximate the updated posterior p x k , which is in the form (5) as required. The derivations for the distributions in (21) and for the iterations of their parameters are given in the following subsections. A summary of the resulting iterative update procedure can be found in Algorithm 1.
The density q (i+1) z · is given as [21, Ch. 10] where c \z denotes any constant term(s) with respect to the variables z 1:M k . The joint log-distribution in the expectation above is given as Taking the expectation of both sides, we get for j = 1, . . . , M k . Hence we have B. Calculation of q The density q (i+1) r · is given as where c \r denotes any constant term(s) with respect to the variables r 1:M k . The joint log-distribution in the expectation above is given as Taking the expectation of both sides, we get where for j = 1, . . . , M k , = 1, . . . , L. Hence, we have Note that the expression above corresponds to the following probabilistic characterization.

C. Calculation of q
The density q (i+1) x · is given as where c \x denotes any constant terms with respect to the variable x k . The joint log-distribution in the expectation above is given as Taking the expectation of both sides, we get = log N u 1:L K ; H 1:L x k , Λ −1 where for = 1, . . . , L. Hence we have where The density q (i+1) X · is given as where c \x denotes any constant terms with respect to the variables X 1:L k . The joint log-distribution in the expectation above is given as Taking the expectation of both sides, we get where (44) The density q (i+1) π · is given as where c \π denotes any constant terms with respect to the variables π 1:L k . The joint log-distribution in the expectation above is given as Taking the expectation of both sides, we get Hence we have for = 1, . . . , L.

F. Expected Values Required for Iterations
The expected values required in the previous subsections can be calculated as follows.

IV. TIME UPDATE
With the target dynamics given in (2), the time update of the kinematic state is performed following the regular Kalman filter time update equations The parameters of the inverse Wishart distribution of each sub-object are updated with a forgetting factor λ IW , The forgetting factor is used to obtain the maximum entropy prediction density of the extent states when the dynamics of the extent state is slowly varying and unknown [22,Theorem 1]. Similarly, the sufficient statistics of the Dirichlet distributed mixture weights π 1:L k are propagated with a forgetting factor λ D , If the true parameter evolution model is slowly varying, the time update equations (52) and (53) will not underestimate the uncertainty by maximizing the entropy. Note that, for the special case of stationary parameters forgetting factors are set to 1, i.e., λ IW = λ D = 1.
The proposed algorithm is also versatile to perform under alternative random matrix time update schemes.

V. EXPERIMENTAL RESULTS
In this section, we will demonstrate the performance of the proposed algorithm in various experiments and compare its performance with the algorithms presented in [6], [12], [17], [20]. In the sequel, we will denote the proposed algorithm as VB and the methods in [17] and [20] as VPF and JL, respectively.
To assess the performance of the algorithms, we consider the Intersection-Over-Union (IOU) similarity measure [3], [7], [23] together with Gaussian Wasserstein (GW) distance [24], [25] for the extent estimates and the root-mean-square error (RMSE) for the kinematic state estimates. The RMSE between the true and the estimated states is defined as where N is the number of time steps in a single run. The IOU measure between the estimated extent and the true extent is calculated as where X true k and X k|k denotes the true and estimated extent matrices at time k, respectively. Note that IOU takes values between 0 and 1, where 1 corresponds to the perfect match, while 0 indicates no intersection between the true and the estimated extents. The GW distance [25] can be expressed as where c true k , c k|k and X true k , X k|k stand for true and estimated center locations and true and estimated extent matrices at time k, respectively.
The comparison metrics above are formulated for a single simulation run. In the sequel, all simulation experiments are performed 100 times with different realizations of the measurement noise at each run. The presented numbers in the simulations are the averages of the 100 Monte Carlo (MC) runs.

A. Simulations
The first simulation experiment consists of two co-centered elliptical objects moving according to the near constant velocity model with parameters given in Table II. Throughout the simulation, 7 measurements are generated from each object per time step. We compare the performances of VB, VPF (with N =100 particles), and JL algorithms over 100 MC runs. An example MC run is illustrated in Figure 3. The comparison metrics and the computation time of the algorithms are given in Table III. VB algorithm estimates both the kinematic and extent states better than other approaches in terms of RMSE, IOU, and GW distance.
Note that the number of association events in the JL algorithm grows exponentially as the number of the sub-objects  and measurements increases [20]. Hence the JL algorithm's computation time is significantly higher than the other methods as shown in Table III. Being a sequential Monte Carlo method, VPF algorithm is also computationally costly. On the other hand, the computation time of VB is linear in both number of measurements M k and number of sub-objects L. The computational complexity of the proposed solution is O(LM k ) at time k. Consequently, the computation time of the VB algorithm is significantly lower than the alternatives as shown in Table III.
In the second simulation scenario, we have a target with a shape similar to an airplane (see: Figure 4.). The object performs a constant velocity motion on a straight line.
The relevant parameters of the scenario are given in Table IV. Throughout the simulation, the number of measurements is set to 2 for each sub-object, yielding 8 measurements in total per time step. The performance evaluation metrics are given in Table V. The VB algorithm outperforms JL and VPF algorithms in terms of accuracy in the extent estimates and provides results which exhibit smaller variation over time.

B. Occlusion Scenario
In this section, we illustrate the capabilities of the algorithm in the presence of an occlusion problem which is frequently encountered in various target tracking applications. For instance, aerial objects might be partly or fully occluded by thick clouds while tracking with a day camera. Many practical systems resort to multiple complementary sensors, such as a  thermal and a day camera, to prevent track loss during such occlusions.
In this simulation scenario, we simulated a group of coordinated targets, which contains 5 individuals moving on a straight path in a V-shape formation 1 . During a certain part of the simulated trajectory, the line of sight of the sensor is partly blocked. It is assumed that no measurements can be obtained from the top two targets of the formation at the corresponding time instances. During the occlusion period, the group of targets gradually slows down. Then, they increase their velocity incrementally back to their regular pace. A visualization which describes the corresponding scenario is depicted in Figure 5.
We compare the performance of the proposed approach, VB, against the algorithm in [12], referred to as FFK in the sequel. The VB algorithm incorporates a unified kinematic model, i.e., the sub-objects depend on a common kinematic state as described in Section II. However, FFK treats each sub-object as a different target and tracks them individually without considering the interactions between them. During the occlusion period, FFK relies only on time update equations to estimate the kinematic and extent state of the occluded targets.   On the contrary, VB can utilize the measurements that are collected from the visible targets to extract information about occluded targets' state. Consequently, it can provide much better performance as shown in Figure 5.

C. Experiments with Real Data
In this subsection, the performance of the proposed algorithm is demonstrated with real data. The data are extracted from aerial footages of a delta-wing aircraft and a glider. Throughout the experiments, each frame of the videos is processed to generate point measurements. In order to distinguish the airplanes from the background, a thresholding is performed in the HSV color space and a median filter is used to reduce the clutters and to eliminate distant measurements. After this pre-processing step, the pixel coordinates belonging to the airplanes are uniformly sampled to obtain the measurements.
The first real data experiment consists of a delta wing airplane [26]. A representative example of VB algorithm's performance is illustrated in Figure 6a. Note that, the camera zoom is not constant throughout the scenario, which introduces additional challenges for the algorithm.
The second scenario is an air-footage of a glider in motion [27]. The performance of VB algorithm is illustrated in Figure 6b.
In Figure 6a and 6b, we also depict the extent estimates of two different single ellipsoidal target tracking algorithms [6], [12]. For the sake of clarity, only the estimates in the final frames of the video sequences are plotted. Both results demonstrate that modeling the extent with multiple ellipses provides a more accurate representation of the target extent.

VI. CONCLUSION AND FUTURE WORK
In this study, a novel extended target tracking algorithm that is capable of representing a target or a group of targets with multiple ellipses while simultaneously estimating the kinematics is presented. The proposed solution involves approximating the intractable posterior distribution with the variational Bayes method to estimate kinematic and extent states. We demonstrated the performance of the approach in simulations and real data experiments. The results of the experiments show that the proposed method significantly improves the computation time while achieving better kinematic and extent state estimation performance compared to the existing approaches in the literature.  In the last frame of each sequence, the result of the single ellipsoidal target tracking algorithms [6] (magenta dashed line) and [12] (yellow dashed line) are also presented.