An Event-by-Event Feature Detection and Tracking Invariant to Motion Direction and Velocity

This paper presents a new event-based method for selecting and tracking features from the output of an event-based camera while being robust to motion direction and velocity changes. It solves for the first time the problem of establishing event correspondences across time without requiring a minimization on time slices or frames of accumulated events. Unlike existing tracking algorithms, the method is designed to detect generic features and does not need to operate on particular predefined shapes such as corners. It relies on a new framework for event-based computation that detects finite-curvature structures by instantaneously estimating both speed and position from local flow estimations. Results on several real environments are shown, taking into account large variations in illuminance, speed, and type of features. The combined use of velocity and spatial information allows for robustness against occlusion. The method is free of parameter tuning, as it dynamically adapts to the state of the trackers.


INTRODUCTION
This paper introduces an event-based solution for selecting and tracking features from the output of an event-based camera. It overcomes for the first time the hard issue of extracting stable features from the temporal information output by the sensor while being independent of motion direction and velocity changes. This is a major issue when using these sensors, as the acquired space-time features can vary drastically even though the underlying spatial structure is the same. This paper is based on a pure event-per-event methodology that allows it to make the most of event-based cameras' native properties of low latency and low power. It makes it possible, for the first time, to formulate the problem of feature selection and tracking in the time domain as an inter-dependent process between detection, tracking and continuous velocity estimation. The method does not assume any prior of what a feature is.

Event-based acquisition introduces a major shift in the way visual information is processed. Unlike conventional cameras that produce absolute intensity images at a fixed rate, event cameras rely on a data-driven acquisition process where independent pixels asynchronously signal light intensity changes (coined as "events") at a high temporal resolution (usually around 1 µs). The use of sparse information represented by visual events can be puzzling to a conventional computer vision scientist used to dealing with snapshots, gray levels and colors. Processing visual events requires new thinking to derive a novel family of algorithms capable of operating efficiently in the time domain while reducing memory footprint and computational power requirements. With continuously improving spatial resolutions, almost matching conventional sensors (from VGA to megapixel), the event-based vision sensors' high temporal resolution allows the measurement of velocities in the focal plane with great precision. This is the most valuable and reliable source of information from these sensors, and it is used extensively in this paper.

Presently, there are two opposite philosophies in processing data acquired by event cameras. The first one advocates the use of optimization techniques on static images of accumulated events, or on batches of events defined over different durations and sizes. This approach only uses the high dynamic range of these sensors; the recurring motivation behind it is to reuse decades of conventional computer vision techniques [1][2] and to avoid using single events, which are wrongly considered too noisy and carrying little information. This approach results in large wastes of resources and an increased latency. The second approach is closer to the event-based acquisition properties. It advocates a local space-time processing of individual events, as they occur, with the idea that each one of them contributes an infinitesimal amount of information to the sensing problem to solve. This approach naturally saves power and memory because it only stores what is required and computes with a minimal amount of resources, only when something has changed in the scene.

Previous works that tackled feature detection from the output of an event-based sensor include several corner detectors that are inferred from the optical flow computation [3], or obtained by applying the Harris operator to locally integrated temporal frames of events [4]. The time-surface introduced in [5] is another way to extract dynamic patterns (which also include corners), as it allows the definition of a compact feature-velocity descriptor around an incoming event. In [6], the time-surface allows building a feature that can then be used with the FAST [2] detector. In [7], a FAST detector is applied to the time surface to find corners, and a graph-based technique is jointly used to find the closest corner in space and time supporting the motion assumption to achieve tracking. An improvement to this approach is introduced in [8] with the idea of using a local descriptor built from the time surface. The use of descriptors allows for tracking via minimizing descriptor differences.

A hybrid class of approaches combines frames and events to overcome the difficulty of pattern selection and tracking using events, as the same scene pattern can produce different events depending on the motion direction, making event correspondences across time very challenging. In [9], the DAVIS sensor is used at a rate of 35 fps to extract corners, and other traditional image features are detected on frames by applying standard computer vision detectors, while the tracking is operated on the events captured between the frames [10]. The same idea can also be found in [11]. An expectation-maximization approach, coupled with an iterative closest point algorithm, is used in [12] to keep track of corners that are detected at the beginning of the sequences from an edge map built by accumulating events over a manually selected integration time. Events are integrated over a duration deduced from the optical flow, and the edges are mapped to generate unitary contours to which an ICP algorithm is then applied [13]. The approach proposed in [14], based on the global optimization of a cost function over a cluster of events, aims to find the motion parameters that yield a homography mapping the cluster of events into an "event image" with maximal contrast. This approach, while not based on features, still deals with the data association problem. It has some overlapping properties with our approach, such as the ability to relate the sensor's motion to the data; however, they differ deeply due to their synchronous/asynchronous natures.

EVENT-BASED SENSORS
Event-based vision sensors are gaining popularity within the computer vision community as they offer many advantages over conventional frame-based cameras that cannot be matched without an unreasonable increase in computational resources. Low computation requirements are achieved by reducing redundant visual data at the level of pixels. This naturally allows lower latency and precise temporal acquisition.

Most of the available event-based vision sensors stem from the Dynamic Vision Sensor (DVS) [16]. As such, they work in a similar manner by capturing relative illuminance changes: each individual pixel emits an "event" each time the illuminance crosses a predefined threshold, at µs precision. An event contains both the pixel coordinates from which it originates and the time at which it is triggered, hence it is usually defined as e = [x, t]. Additional information can be added, such as the sign of that change, as shown in Figure 1(c) and conventionally referred to as "ON/OFF" polarities. Other variations of silicon retinas implement functions such as capturing absolute illuminance information with the same asynchronous principle, as shown in Figure 1(e) [15][17] (not used in this work), or implement a less advantageous hybrid solution that also acquires conventional images [9]. The asynchronous level-crossing sampling allows these sensors to reach over 120 dB of dynamic range. The presented work is carried out using a 640 × 480 pixel Asynchronous Time-based Image Sensor (ATIS) [18] and the Event-Camera Dataset [19] captured with a DAVIS240C [20].

Trackers definition, position and angle update
We define a tracker T as a set {X(t), v(t), θ(t), ω(t)}, where X(t) is its position in the focal plane, v(t) its translation velocity, θ(t) its rotation angle w.r.t. the optical axis z, and ω(t) its rotation speed. An event output by the neuromorphic camera, indexed by n, is defined as the pair e_n = {X_n, t_n}, with X_n its 2D spatial coordinates in the focal plane and t_n its time of occurrence. An incoming event is "assigned" to a tracker T when it is close enough to the tracker, i.e. satisfying the spatial constraint

||X_n − X|| ≤ R,

where R is the spatial radius of the tracker (as shown in Figure 2).

[Figure 2: A tracker T located at X is assigned an incoming event e_n at location X_n if ||X_n − X|| ≤ R, both expressed in R_f, the coordinate reference frame of the focal plane.]

If we assume a constant-velocity motion of the tracker between consecutive events (a reasonable assumption due to the low latency and the high temporal accuracy of event-based sensors), we can update its position X(t_n) and angle θ(t_n) at the time of arrival of e_n. This assumes that v and ω can be estimated at any time, which is partially achieved by computing the optical flow, as explained later.
We define the location of that event in the tracker coordinate frame as x_n, such that x_n = R(θ)⁻¹(X_n − X), with R(θ) the 2D rotation matrix of angle θ.
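The tracker state and its event-driven pose update can be sketched as follows. This is a minimal illustration in Python (the paper's benchmark language); the class and method names are hypothetical, and the sign convention of the rotation is an assumption.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

class Tracker:
    """Minimal tracker state {X, v, theta, omega}; field names are hypothetical."""
    def __init__(self, X, R=12.5):
        self.X = np.asarray(X, float)   # position in the focal plane (px)
        self.v = np.zeros(2)            # translation velocity (px/s)
        self.theta = 0.0                # rotation angle around the optical axis
        self.omega = 0.0                # rotation speed (rad/s)
        self.R = R                      # spatial radius of the tracker (px)
        self.t = 0.0                    # time of the last pose update (s)

    def assigns(self, X_n):
        """Spatial constraint: the event is assigned if ||X_n - X|| <= R."""
        return np.linalg.norm(np.asarray(X_n, float) - self.X) <= self.R

    def update_pose(self, t_n):
        """Constant-velocity pose update between consecutive events."""
        dt = t_n - self.t
        self.X = self.X + self.v * dt
        self.theta += self.omega * dt
        self.t = t_n

    def to_tracker_frame(self, X_n):
        """Event location x_n expressed in the tracker coordinate frame."""
        return rot(-self.theta) @ (np.asarray(X_n, float) - self.X)
```

A tracker at X = (10, 10) moving at 100 px/s along x, updated 10 ms later, ends up at (11, 10); events are then re-expressed relative to that pose.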

Tracker activity
To enable the continuous update of a tracker T, we first define the time decay factor λ_n associated with T as λ_n = e^{−(t_n − t_{n−1})/τ(t_n)}, where τ(t_n) is the time needed for the tracker to move by one pixel. The estimation of this dynamic time property of the tracker will be fully expressed in section 3.6. Times t_n and t_{n−1} correspond, respectively, to the times of arrival of the incoming event e_n and of the preceding one, e_{n−1}.
We then define A, initialized to 0 at t_0, as a measure of the activity of a tracker T: A is updated by each incoming event assigned to T, being first decayed by λ_n and then incremented by one unit. This form is thus a sliding average over a time period τ. Each variable defined by a similar equation is thus robust to noise, as well as reactive to variations on a timescale τ.
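The event-driven update of the activity can be sketched as below. The exact rule A ← λ_n·A + 1 is an assumption, consistent with the later description of A as a decaying sum of unit increases per event.

```python
import math

def decay_factor(t_n, t_prev, tau):
    """lambda_n = exp(-(t_n - t_prev)/tau), where tau is the time the
    tracker needs to move by one pixel."""
    return math.exp(-(t_n - t_prev) / tau)

def update_activity(A, t_n, t_prev, tau):
    """Event-driven sliding average (assumed rule): decay the previous
    activity, then add a unit contribution for the incoming event."""
    return decay_factor(t_n, t_prev, tau) * A + 1.0
```

With events arriving every δt, this recursion converges to the steady state 1/(1 − e^{−δt/τ}), which is the normalization used later for the speed update.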

Optical flow
For an incoming event e_n = {x_n, t_n} assigned to T, we define the set E_n containing all events that occurred in the time interval [t_n − Nτ(t_n), t_n], within a distance r of x_n (shown as a black circle in Figure 3). The quantity Nτ(t) = N·τ, with τ the time needed for a unitary contour to move by one pixel and N the number of lines considered, expresses the time needed by the tracker to move by N pixels; it also defines an adaptive time window parametrized by the velocity of the tracker. With E_n, we estimate the local optical flow f_n (shown in blue in Figure 3) at x_n at time t_n as

f_n = (x_n − x̄_k) / (t_n − t̄_k),

where x̄_k and t̄_k are the spatial position and the timestamp averaged over the set E_n.
The spatial coordinates of an incoming event e_n are now expressed in the tracker T's coordinate system R_T, at location x_n. The optical flow f_n is computed as the difference between x_n and the mean position of previous events happening in the temporal window. This computation of the optical flow, as a difference between the mean spatio-temporal position of E_n and the current event, works reliably provided one considers small spatial neighborhoods r (usually set between 2 and 4 pixels). The reliable direction of the estimated flow is the one orthogonal to local edges, due to ambiguities related to the aperture problem. We will show in the following sections that the local average flow allows the tracker to track local edges, by introducing a metric that quantifies the aperture problem. We intentionally choose this simple form of flow instead of more accurate incremental flow computation techniques such as the canonical local plane-fitting regularization used in [5][21]. The main reason is that it is computationally less prohibitive than most existing event-based optical flow algorithms, and it still proves to converge to an accurate flow estimate through the continuous update of the tracking procedure.
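The mean-difference flow estimate can be sketched as follows; the precise normalization (mean elapsed time in the denominator) is an inference from the description, so treat it as an assumption.

```python
import numpy as np

def local_flow(x_n, t_n, E_n):
    """Sketch of the mean-difference flow estimate.

    E_n: list of (x_k, t_k) past events within radius r of x_n.
    The flow is the offset of the current event from the mean past
    position, divided by the mean elapsed time (assumed form)."""
    xs = np.array([x for x, _ in E_n], float)
    ts = np.array([t for _, t in E_n], float)
    x_bar, t_bar = xs.mean(axis=0), ts.mean()
    dt = t_n - t_bar
    if dt <= 0:
        return np.zeros(2)          # degenerate window: no usable estimate
    return (np.asarray(x_n, float) - x_bar) / dt
```

For an edge translating at 100 px/s along x, past events at x = 100·t yield exactly (100, 0) px/s, illustrating why this cheap estimator is accurate once the neighborhood is small.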

Remark 2:
In the tracker's coordinate frame, all events generated by the same edge will follow a narrow spatial distribution that outlines the edge once the tracker velocity converges to the structure's, as shown in [22], [23]. The estimated flow in the tracker's coordinate frame, being the relative velocity between the tracker and the structure, will converge to 0 when the estimation is correct [24].

Contour velocity estimation
A 3-degree-of-freedom tracker (2 for the translation and 1 for the rotation) cannot be properly estimated from the optical flow alone, as the flow provides only the two translational components. The rotational component can be calculated from an estimator that integrates the optical flow over the time period τ, based on the motion equation expressed in the tracker reference frame. We define v_c = (v_cx, v_cy)^T and ω_c as, respectively, the translation and rotation velocities of the contour in R_T. These are the velocities we want to estimate and nullify by correcting the tracker velocities v and ω.
For any incoming event e_n = {x_n, t_n}, and defining x_n = (x_n, y_n)^T, the instantaneous velocity of the object at the event location is v_c + ω_c o_n, where o_n = (−y_n, x_n)^T is the resulting vector in the focal plane of the cross product of z = (0, 0, 1)^T and (x_n^T, 0)^T.
We then define the unit vector normal to the edge as u_n = (c_n, s_n)^T. Hence the optical flow f_n = (f_nx, f_ny)^T, which is the local velocity projected on u_n, is equal to the projection of v_c + ω_c o_n onto u_n. We define the following quantities, used to estimate v_c and ω_c; they are updated in a similar way as the activity A, as explained in section 3.2, and we regroup these terms, as well as the contour velocities, into two vectors. The following lines provide calculation details, leading to the compact form of equation 16.
In matrix form, this yields a linear system. We define the matrix M from terms indexed by α, β ∈ {x, y} and u, v ∈ {c, s}. This matrix holds purely spatial information about the contour, composed of the event locations x_n along with the local contour directions u_n. From this definition of M, equation 12 can be combined into a single matrix equation, and with the definition given in equation 10, we obtain a recurrence. As Σ(t = 0) = 0 and M(t = 0) = 0, by recurrence, for all t_n we have the main relation Σ = M V_c. Hence, V_c is calculated by inverting M: V_c = M⁻¹Σ. Both M and Σ are updated in a purely event-based fashion. V_c contains the contour velocities in R_T that we want to nullify, as that means a correct tracking of the contour (illustrated by Figure 4).
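A possible event-driven implementation of the M/Σ accumulation and the inversion V_c = M⁻¹Σ is sketched below. Since the full derivation is abbreviated above, the per-event construction (each event contributing one scalar normal-flow measurement) and the class name are assumptions.

```python
import numpy as np

class ContourVelocityEstimator:
    """Event-driven least squares for V_c = (v_cx, v_cy, omega_c).

    Assumed construction: each event with location x_n (tracker frame),
    edge normal u_n and measured flow f_n contributes one scalar equation
    a_n . V_c = b_n, with a_n = (c_n, s_n, o_n.u_n) and b_n = f_n.u_n."""
    def __init__(self):
        self.M = np.zeros((3, 3))     # decayed sum of a_n a_n^T
        self.Sigma = np.zeros(3)      # decayed sum of b_n a_n

    def update(self, x_n, u_n, f_n, lam):
        o_n = np.array([-x_n[1], x_n[0]])      # in-plane part of z x (x_n, 0)
        a = np.array([u_n[0], u_n[1], o_n @ u_n])
        b = f_n @ u_n                           # measured normal flow
        self.M = lam * self.M + np.outer(a, a)
        self.Sigma = lam * self.Sigma + b * a

    def solve(self):
        """V_c = M^-1 Sigma (valid once M is well conditioned)."""
        return np.linalg.solve(self.M, self.Sigma)
```

Feeding synthetic events generated by a contour moving at v_c = (2, 1) px/s with ω_c = 0.5 rad/s recovers those three values exactly, since the system is consistent.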

Normalization for the tracker speed update
The estimated contour velocities V_c can be expressed as the opposites of the error vectors between the tracker velocities v and ω and the contour velocities v_c,R_f and ω_c,R_f in the local frame R_f. We aim for an exponential dampening of this velocity error, leading to a match between the tracker and contour velocities. Let us define τ̄ as the average time interval between two successive events: if n̄ is the average number of events per second affecting the tracker, then τ̄ = 1/n̄. Finally, we define δv_n and δω_n, the translation and rotation corrections applied to the tracker for the n-th event. On average, t_{n−1} = t_n − τ̄; thus, by differentiating these equations and using equations 19, we obtain the per-event correction. Let us now consider the evolution of the activity A, defined in equation 4, between t and t + δt, under the constraint τ̄ ≪ δt ≪ τ. This constraint can be achieved, as τ̄ is the average time between two events and τ is the time during which the tracker moves by one pixel: during τ, many events will occur, typically one for each pixel of the observed shape.
The exponential decay of the activity ensures that it reaches a steady state after some time. We approximate the evolution of the activity over a restricted duration δt as A(t + δt) = ρ(δt)A(t) + ∆, with ρ the decay term given by ρ(δt) = e^{−δt/τ}, which for δt ≪ τ gives ρ(δt) ≈ 1 − δt/τ, and ∆ the sum of unit increases for each event that occurred between t and t + δt. As the activity reaches a steady state, A(t + δt) = A(t). Finally, injecting this result into 22, we obtain the speed correction used in our implementation, allowing for a smooth speed update during the convergence process. The activity A acts as a normalization value for all quantities updated event-by-event in this work. Figure 5 illustrates how the velocity correction mechanism operates "under the hood" of a tracker.

Aperture metric
The local and incremental approach in this work is prone to the same aperture problem as the optical flow it relies on. To address it, we introduce a metric that measures the quality of a tracked structure, i.e. whether it allows the correct velocity to be estimated: a structure such as a single line or a bundle of parallel lines is typically a bad candidate, while intersecting lines or edges with finite curvature are good ones. For clarity, let us assume two local edges are captured by the tracker (the same analysis extends to n edges) and that they are approximated as line segments with respective direction angles θ_1 and θ_2. We define ∆θ as the difference of these angles, constrained to take values in [0, π/2] (this does not remove any generality from the problem, since we can always choose the smallest intersecting angle of the two lines for ∆θ). The edges define a good structure to track if |∆θ| is sufficiently large (i.e. the lines are less parallel). According to this hypothesis, the metric we need must be a monotonic function of ∆θ. Let us express the unit direction vectors of the two line segments in complex form, e^{iθ_1} and e^{iθ_2}. To constrain the metric function to take values in [0, 1], we multiply the vectors' arguments by a factor of 2. Let r_1 = e^{2iθ_1} and r_2 = e^{2iθ_2} be these new unit vectors. Assuming θ_2 ≥ θ_1, we define the two quantities θ̄ = θ_1 + θ_2 and ∆θ = θ_2 − θ_1. Then the averaged resultant of the two vectors is µ = (r_1 + r_2)/2 = e^{iθ̄} cos(∆θ). From here, we see that we can build the metric function on the cosine in the amplitude of µ. It is monotonic for ∆θ ∈ [0, π/2], decreasing from 1 to 0. Thus, if the tracker captures two parallel edges, ∆θ = 0 and the function is maximal; if the two edges are orthogonal, ∆θ = π/2 and the function reaches its minimum of 0.
To build the metric function, which is also updated for each incoming event assigned to a tracker, we proceed as follows:
• from the set E_n defined in section 3.3, build the offset vectors δx_k = x_k − x̄_k and double their direction angles. These vectors are averaged over E_n into a vector r_n giving the principal orientation of the local edge;
• then we compute the temporally decaying sum of the normalized vectors r_n and divide it by the tracker activity A.
The chosen metric is then ||µ(t)||, which takes values in [0, 1].
Structures subject to aperture ambiguities are filtered out by setting a threshold on ||µ||, as good candidates to track yield small metric values.
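The angle-doubling construction can be checked with a few lines of Python; for simplicity, this sketch averages ideal edge angles directly rather than per-event offset vectors.

```python
import numpy as np

def aperture_metric(angles):
    """||mu||: magnitude of the average of unit vectors whose direction
    angles are doubled. Doubling makes opposite directions (theta and
    theta + pi) coincide, so (anti)parallel edges give a value close to 1
    (aperture problem) while orthogonal edges give a value close to 0."""
    doubled = np.exp(2j * np.asarray(angles, float))
    return abs(doubled.mean())
```

Two parallel edges (∆θ = 0) or antiparallel ones (∆θ = π) yield 1; two orthogonal edges (∆θ = π/2) yield 0, matching the |cos ∆θ| behaviour derived above.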

Tracker Stabilization
Trackers are by default initialized with a null velocity. As events are registered, a tracker's motion parameters (velocity and position) are updated continuously, leading to a rapid convergence. However, this continuous update might still report non-null velocity errors, leaving us uncertain about the steady state of the trackers. We implement an additional estimator to lift this uncertainty, based on the averaged norm of x_n − x̄_k as used in computing the optical flow in 6. As the tracking converges, the flow norm should converge to 0. We then build the ν function from this averaged norm. A threshold is set on ν (experimentally fixed to 0.5 pixel), below which we can state that a tracker has been stabilized. This value can be changed, but it is not critical: experiments have shown that as long as we stay within this order of magnitude, the described estimator is able to discriminate between an ongoing stabilization and a converged state.
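A sketch of the stabilization test, assuming ν is the activity-normalized decaying average of the flow offset norms (the function name and exact form are illustrative):

```python
import math

def stabilized(offset_norms, dts, tau, threshold=0.5):
    """Assumed form of the nu estimator: a decaying sum of the flow
    offset norms ||x_n - x_bar|| normalized by the activity A. The
    tracker is declared stabilized once nu drops below ~0.5 px."""
    S = A = 0.0
    for d, dt in zip(offset_norms, dts):
        lam = math.exp(-dt / tau)   # same decay as the activity update
        S = lam * S + d
        A = lam * A + 1.0
    return (S / A) < threshold
```

With this form, ν is a weighted average of recent offset norms, so it converges to 0 exactly when the flow norm does, as stated above.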

Time constant definition
So far, the mechanism of a tracker is built around the decay mechanism that allows it to be updated upon reception of an event and gives the tracker the ability to self-adjust to the scene dynamics. This decaying property is summarized by the time constant τ in the exponential function; hence the estimation of τ is one of the most important tasks to achieve, if not the most important.
To estimate τ from the events, we first consider the average squared velocity of points over the tracker area, where the averaged terms <o_n> and <||o_n||²> are updated according to the same decaying mechanism as before. Finally, τ is defined from these averaged quantities (equation 36).

Spatial descriptor and tracker lock
For the entire tracking process, a rolling set of events is maintained from the events affecting the tracker. At time t, we only keep events that appeared between t − Nτ(t) and t. Events from set E_n are extracted from this spatial descriptor, stored as a 2 × m matrix.

[Figure 6: Evolution of the main variables during the tracker convergence on a corner in translation, described in detail in section 3.8. The top graph shows the velocity variable v of the tracker, the estimated contour velocity v_c w.r.t. the tracker, and the updates δv_n for each event. For clarity, we use the norm of each of these variables. The second graph shows the evolution of the activity A, the aperture metric µ = ||µ|| and the convergence metric ν. Square pictures show the spatio-temporal context of the scene, while green circles contain the projected events, i.e. the events from the tracker perspective. Key moments are highlighted with timestamps t_1 to t_5.]
This local descriptor is calculated in a continuous space, i.e. the coordinates are real values instead of integers. This allows for a much smoother spatial representation of the object, often enhancing its visual appearance compared to the raw events, and it can thus be used in subsequent algorithms based on that tracking, such as feature matching, tracking recovery or object recognition.

Locking the feature
As the spatial descriptor is so far a rolling set, a drift can still occur even with a tracker that has converged onto a trackable feature. To avoid this, we can lock the tracker.
When the tracker T goes into the locked status at time t_l, we freeze the spatial descriptor, and new events are compared to the events stored in d at time t_l. This shape then becomes the reference shape, and the tracking process can remain stable for much longer times.

Disengaging the feature
One of the challenges of the approach is caused by sudden decelerations of tracked objects. As fewer events are produced during a deceleration, the correction process must be robust to rapid changes in dynamics. The solution lies in monitoring the activity A of the locked tracker: below a certain threshold, the tracker must be disengaged. We roll back its position X and angle θ to the last known values for which a viable optical flow was computed, and we stop updating the position from the speed as described in 1. At the same time, we decay the speed itself, as only a rapid deceleration can induce such a behaviour. In most cases, this process provides sufficient time for the object to move again in the focal plane and for the tracking to resume, once the activity A rises above a predefined threshold.
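The activity-gated disengage/resume logic can be sketched as follows; the threshold values and the exponential form of the velocity decay are illustrative assumptions.

```python
import math

def monitor(tracker_v, A, dt, tau, low=2.0, high=4.0, engaged=True):
    """Sketch of the disengage/resume mechanism (thresholds illustrative).

    Below activity `low`, the tracker disengages: the velocity-based pose
    update stops and the velocity itself is exponentially decayed. Once
    the activity rises above `high`, normal tracking resumes."""
    if engaged and A < low:
        engaged = False
    elif not engaged and A > high:
        engaged = True
    if not engaged:
        tracker_v = [vi * math.exp(-dt / tau) for vi in tracker_v]
    return tracker_v, engaged
```

Using two thresholds (hysteresis) avoids rapid toggling around a single activity value; whether the paper's implementation does this is an assumption.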

Illustration case
The main parameters of a tracker are highlighted in a practical case displayed in Figure 6. This real-data example shows a square moving in pure translation at a constant speed of about 1000 px·s⁻¹. We show the temporal evolution of the six main parameters of a tracker: first, the norm of the tracker velocity ||v||, the norm of the contour velocity w.r.t. the tracker ||v_c||, and the norm of the velocity variation for each event ||δv_n||; in a second graph, the activity A, the aperture metric µ = ||µ|| and the convergence metric ν. The sequence is broken into five key moments, each reflecting a different important phase of the process.

At t = t_1, no event has occurred in the tracker's region of interest (ROI) since its initialization. All the tracker's parameters are either null or undefined, and the tracker is still. At t = t_2, the convergence process starts after the activity reaches a predefined threshold. The tracker then starts to cancel out any contour velocity. This results in an increase of the convergence metric ν. The aperture metric already shows that the structure captured by the tracker is a straight line, as it reaches a value close to 1. For clarity, we deactivated the mechanism that rejects tracking based on the aperture metric. At t = t_3, the tracker reaches a steady state. All values remain almost constant; the tracker knows it has converged (ν is low), but the aperture issue remains. Only the orthogonal component of the velocity is known to be reliable, and the tracker is free to move along that line¹. During this phase, the contour velocity v_c, and thus the applied correction δv_n, are almost null. Around t = t_4, a corner appears within the tracker ROI. Immediately, the aperture metric drops. The other corner edge feeds new velocity information to the tracker, resulting in an increase of the contour velocity v_c as well as of the convergence metric ν. The tracker compensates to cancel out this remaining velocity. One can notice a small increase of the activity. This is due to an increase of the number of events assigned to the tracker when the corner appears, while the velocity, and thus the time constant, does not immediately change. Finally, at t = t_5, once the correction has been applied, the tracker speed matches the corner speed, and the feature appears immobile from the tracker perspective (in the tracker reference system R_T). Both the aperture and the convergence metrics are below their predefined thresholds, and the shape is locked as described in section 3.7.1. This further increases the stability of the tracking. As the tracker velocity increases, its time constant τ lowers, slightly lowering the activity and improving the time response of the whole system to match the scene dynamics. The tracker was intentionally initialized far from its target feature, resulting in a 100 ms convergence time. When trackers were initialized closer to the target for the very same example, the convergence time was on average less than 10 ms.

1. In most cases, the tracker has a null velocity component in the direction of the line. In our case, this induces a negative velocity along the y axis, and the tracker will move towards the bottom of the focal plane.

Algorithms
The method presented so far is implemented as summarized by the algorithm detailed in this section. The tracking can be decomposed into three blocks; all tracker variables previously described are now indexed by i. For each incoming event, the tracker's monitoring variables are updated with λ_i(t_n); when a tracker must be disengaged, its position and angle are reset to their last valid values. We define for each tracker a set of important states that are used in the implementation. Those states are:
• Idle: a tracker is considered Idle as long as the number of events assigned to it is below an experimentally predefined threshold.
• Stabilizing: a tracker is considered to be Stabilizing when computation is being performed but the convergence estimator is still too high. This means that the tracker still needs time to catch up with the target feature.
At this stage, the proposed algorithm is able to track structures that are not restricted to corners or simple edges and can be fairly complex, as shown by the samples in Figure 8. These features were extracted during a run of our algorithm on a recording of a city environment. This unconstrained structure selection gives access to a wide range of features that are crucial to higher-level tasks.

EVALUATION
We compare the algorithm's performance to the state of the art. As shown in section 3, the tracking and detection mechanisms are interlaced such that features emerge in an online manner. This is a purposely designed property, as users do not have to assume complex priors on the features to track. We tested the algorithm on sequences generated by the event-based camera simulator, alongside the provided dataset acquired by event-based sensors in natural environments [19]. To assess the algorithm's performance, the provided frames are used, and a classical frame-based tracking algorithm is initialized whenever a tracker is locked. The same dataset provides ground truth from an Inertial Measurement Unit (IMU), allowing us to directly assess the camera's ego-motion inferred from the trackers' estimated velocities. The data is acquired using a DAVIS 240C, with a resolution of 240 × 180 pixels [9]. Preliminary tests have also been performed with a 640 × 480 VGA ATIS [18]. Similar results were found with both sensors using the same experimental set of parameters. The algorithm is benchmarked using a Python implementation. We initialized 63 trackers dispatched on a 9 × 7 grid, with a 25-pixel-diameter ROI.

Comparison with Frame-based tracking
When frames are available, we can compare the event-based tracker with a frame-based algorithm. For this purpose, we used the frame-based Channel and Spatial Reliability Tracking (CSRT), an OpenCV implementation of [25], as the baseline. Each event-based tracker, upon locking, is compared to a CSRT tracker initialized at the corresponding location on the temporally closest frame. We compare the two trackers during the stages where the event-based tracker is in locked status. The CSRT results are visually checked for the entire duration of the benchmark. Even though CSRT is considered one of the state-of-the-art frame-based tracking algorithms, it does occasionally fail. In order to obtain the most accurate assessment of the proposed algorithm, we manually removed the scenarios where the CSRT trackers failed, even though the presented event-based tracker was performing correctly. For certain highly textured scenes, the frame-based tracker was also unable to provide any ground-truth data, due to excessive displacement between two consecutive frames and to motion blur. However, this method allowed us to automate the benchmark and avoid manual labeling of the considered scenes.

Benchmark with the Event Camera Simulator
We used data generated with the Event Camera Simulator [19], for which the ground truth can be inferred from a known camera motion and scene depth. Simple movements mixing translation and axial rotation were simulated, along with the corresponding frames. For each locked tracker, we determine the 3D location of the object and re-project its location onto the focal plane. We generated two sequences: the first one uses the shapes poster from the Event-Camera dataset, while the second one uses the texture of a book cover. The event-based and frame-based algorithms use similar window sizes: a 25 px ROI for both the event-based method and the CSRT trackers. The trackers' accuracy and life-span are reported in Table 2.

Benchmark on event camera dataset
The performance comparison is now applied to the same sequences of the dataset that were reported in previous work in [8]. Since no ground truth is available, the CSRT trackers were used as ground truth for comparison. We report in Table 3 the results obtained for six sequences. A track is considered valid if its mean error to the ground truth is less than 5 pixels. We also display the resulting mean error of those valid tracks and their average lifespan. Most of the scene's objects come in and out of the camera's field of view; trackers that go out of the focal plane are considered lost. This explains the average lifespan of the trackers. Re-entry strategies are beyond the scope of this paper, as they imply storing previously seen trackers. Results show lifespans of several seconds with accurate tracking.

Panoramic reconstruction
Using a set of trackers, we now compute a panoramic reconstruction of an outdoor scene from a recorded sequence of 6.6s. Each green rectangle represents a tracker used for that reconstruction. We display on the panoramic view the most representative set of trackers that were locked and used for the reconstruction (blue dots), some of them highlighted in colored circles for clarity and shown in more detail below. This selection shows the wide variety of features the trackers can lock onto. Below those features, we display the average reconstruction error and the duration ∆t during which the tracker stayed locked on its feature. We notice that the longest tracking durations mainly concern highly "complex" features. The top right feature is displayed to show the limitation of the algorithm under textured and noisy recordings: the trackers can sometimes lock onto ill-defined shapes, resulting in poor tracking duration. However, in these cases, the convergence estimator previously described and the tracker activity monitoring are able to cancel such a tracker, noticing that it does not "behave" as it should.
For this online robust reconstruction, we used 471 trackers, with an average tracking time of 0.62s and an average reconstruction error of 3.3 pixels considering all trackers. For the non-outliers, the median standard deviation of the reconstruction error is 0.72 pixels. This means that 72.7% of the selected trackers could successfully track features with an error of less than 1 pixel. Of those 471 trackers, 27.3% were outliers that reached the reconstruction error threshold of 20 pixels. In most cases, this was due to "blurred" straight lines that can sometimes report no aperture issue when small rapid saccadic motions occur.

Comparison with ground-truth IMU data
The Event-Camera Dataset [19] provides IMU and camera pose data, serving as ground-truth measurements. This information is particularly relevant for the 3-degree-of-freedom ego-motion of the camera. Let M = [X, Y, Z]^T be a 3D point and m = [x, y, z]^T its projection into the focal plane, both expressed in the camera coordinate frame, so that M = Zm. Differentiating the rigid-motion equation Ṁ = −ω × M − t, where ω and t are respectively the angular and the linear velocities of the camera, gives the instantaneous velocity ṁ = −ω × m − (1/Z) t − (Ż/Z) m. Since the velocity along the z axis of m is null, this is equivalent to:

ẋ = xy ω_x − (1 + x²) ω_y + y ω_z + (x t_z − t_x)/Z
ẏ = (1 + y²) ω_x − xy ω_y − x ω_z + (y t_z − t_y)/Z    (40)

This flow is related to the optical flow calculated in the focal plane via the upper left 2 × 2 submatrix k, taken from the intrinsic matrix of the camera. Equation 40 can be solved for ω knowing only the calibration parameters, provided the linear translation in the ego-motion is null. We focus on that particular case because solving for the complete 3D velocity requires a set of points M and their projections m, which is not available in the Event-based Camera Dataset. We show the estimated ego-motion results for the sequence shapes_rotation in Figure 9. The axes have been normalized, yielding for the x and y rotations a geometric constant of 220 pixels·rad⁻¹. For the z-axis, the normalization constant found was 0.98, showing the good match between this ego-motion computation and the provided ground-truth.
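Under the pure-rotation assumption the flow reduces to ẋ = xy ω_x − (1+x²) ω_y + y ω_z and ẏ = (1+y²) ω_x − xy ω_y − x ω_z, which is linear in ω and can be solved in the least-squares sense from a set of flow measurements. A minimal sketch (the function name is ours; normalized image coordinates are assumed):

```python
import numpy as np

def rotation_from_flow(pts, flows):
    """Least-squares estimate of the angular velocity ω from normalized image
    points and their measured flow, assuming null linear translation."""
    A, b = [], []
    for (x, y), (u, v) in zip(pts, flows):
        A.append([x * y, -(1 + x * x), y])   # row multiplying ω for the x-flow
        b.append(u)
        A.append([1 + y * y, -x * y, -x])    # row multiplying ω for the y-flow
        b.append(v)
    omega, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return omega
```

Feeding the averaged tracker velocities into such a solver yields the rotational speeds compared against the gyroscope in Figure 9.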

Trackers stability and latency
By design, the tracker time constant τ is auto-adjusted during tracking with respect to the scene dynamics; it is the only parameter that needs explicit initialization, as the inverse of a minimal speed, experimentally chosen as about 5 pixels·s⁻¹. The inertial behavior of a tracker is implemented to ensure a smooth and noise-robust tracking mechanism. A tracker loses its target because the target leaves the camera's field of view, rather than because of drifts caused by a non-robust tracking approach. The price of this stability is the delay caused by the inertia, an undesired effect for a low-latency vision sensor, or for any vision sensor in general. However, the velocity-based tracking algorithm we introduce has an interesting property that we can observe experimentally: on average, a tracker converges to the correct location within a radius of 20 to 30 pixels when a structure enters its receptive field. This typically constant offset is an important asset, as the time to convergence is a decreasing function of the speed, i.e. the faster the apparent motion, the faster the tracker converges, hence the lower the latency of the algorithm.
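The inertial behavior can be pictured as a first-order, event-driven low-pass filter on the tracker position; the following is an illustration of the principle (class and constant names are ours), not the authors' implementation:

```python
import math

V_MIN = 5.0          # minimal speed in pixels/s, as stated in the text
TAU0 = 1.0 / V_MIN   # initial time constant, the inverse of the minimal speed

class InertialTracker:
    """First-order (inertial) position update driven by incoming events."""
    def __init__(self, x, y, tau=TAU0):
        self.x, self.y, self.tau, self.t = x, y, tau, 0.0

    def update(self, xe, ye, te):
        # Weight grows with elapsed time: recent events barely move the tracker,
        # while after a long silence the tracker follows the event fully.
        a = 1.0 - math.exp(-(te - self.t) / self.tau)
        self.x += a * (xe - self.x)
        self.y += a * (ye - self.y)
        self.t = te
```

In the actual method τ is continuously re-estimated from the tracked feature's speed, which is what makes the mechanism parameter-tuning free after initialization.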

CONCLUSION
This work introduced an event-based method to simultaneously select and track local spatial structures from a stream of events output by an event-based camera in real-time. The presented method relies on a velocity-based tracking approach to converge towards selected features from which the true velocity can be estimated. More importantly, it is designed to be resilient to the aperture problem. The algorithm is also designed to reduce the number of parameters to be set by users: mainly a decay coefficient that represents a dimensionless inertia parameter. The method is expected to work even better at higher spatial resolutions, as higher spatial resolution combined with the current sub-millisecond temporal precision implies that velocity can be estimated more accurately.
The method shows robustness in a wide range of conditions. It can detect and track a large variety of spatial structures at different velocities. Tracking benchmarks have shown that it is accurate while being updated event by event, at the native rate driven by the scene, thus reducing resource requirements. This has been made possible by estimating reliable velocities for each tracked feature. Most importantly, using several projection velocities for initialization results in very little dependency on the tuning parameters. From a larger perspective, this paper introduces a new canonical scheme for computation with event-based cameras that provides independence from motion direction and velocity changes. The simultaneous use of the spatial and precise timing information carried by each incoming event, combined with a regularization scheme using the local activities of events, proved efficient for local computations. The same scheme could be applied efficiently to a wide variety of problems.
Event-based cameras (Figure 1(a)) are based on an asynchronous level-crossing sampling, as shown in Figure 1(b)-(c).

Figure 1 :
Figure 1: Event-based acquisition of visual signals: (a) the ATIS event-based camera used in this paper [15]; (b) principle of event-based sampling: the variations of the logarithm of the light intensity of a pixel located at [x, y]^T over time; (c) asynchronous temporal contrast events generated each time the light intensity crosses a level in log(I); (d) the output of a conventional frame-based camera at some 30-60Hz vs. (e) the high temporal precision event-based output of the sensor (around 1µs).
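The level-crossing sampling of panels (b)-(c) can be simulated in a few lines; a sketch assuming a scalar intensity trace and a contrast threshold θ (the function name is ours):

```python
import math

def level_crossing_events(times, intensities, theta=0.1):
    """Emit ON (+1) / OFF (-1) events each time log-intensity crosses a
    threshold level away from the last reference value, as in Figure 1(b)-(c)."""
    events = []
    ref = math.log(intensities[0])  # reference log-intensity at the last event
    for t, I in zip(times[1:], intensities[1:]):
        logI = math.log(I)
        while logI - ref >= theta:   # intensity increased by one level: ON event
            ref += theta
            events.append((t, +1))
        while ref - logI >= theta:   # intensity decreased by one level: OFF event
            ref -= theta
            events.append((t, -1))
    return events
```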

Figure 4 :
Figure 4: Description of the estimation of the contour velocities v_c and ω_c. We update the vector Σ and the matrix M defined in equations 10 and 13 using the event location x_n, its orthogonal vector o_n, the optical flow f_n and the unit vector u_n = f_n/||f_n||. All those vectors are expressed in R_T.

Figure 5 :
Figure 5: Convergence process of a tracker onto a simple feature. We separate translation - (a), (b) - and rotation - (c), (d) - for clarity; in practice, both are resolved in parallel. (a) and (c): the tracker starts moving as the feature appears. Each event is associated with a local velocity v_n, in red (cf. Figure 4), and allows estimating the contour velocities in translation v_c and in rotation ω_c, in grey. The corrections applied, δv_n and δω_n, follow equation 28. In both the translation and rotation cases, when the tracker speed matches the contour velocities - (b) and (d) - the contour appears immobile in the tracker reference frame.

• Converged: a tracker has converged once it has stabilized and matched the target feature.

Figure 8 :
Figure 8: Panoramic reconstruction of a natural outdoor scene using a hand-held event-based camera (shown in A using a conventional frame camera for clarity). In B, the panorama constructed from events, emphasizing a subset of the features used, for clarity. Some highlighted features are represented with colored circles to show the diversity of what can be tracked (shown in C), with their average reconstruction error and the duration ∆t during which a tracker stayed locked on its feature. The green squares in B show some of the focal plane locations used to construct the panorama.

Figure 9 :
Figure 9: Comparison of tracking results for the shapes_rotation sequence against the camera's estimated angular velocity. Tracker velocities are averaged to extract the position and speed of an initially centered point, assuming the objects in the scene are static and only the camera moves. Red lines: gyroscopic data from the IMU provided with the DAVIS dataset [19]. Blue lines: estimated rotational speed for each axis of the gyroscope. Green lines: error between the reconstructed rotation and the IMU ground-truth. Top graph: vertical rotation of the camera, producing a horizontal speed of the objects; mean error: 0.04 rad/s. Center graph: horizontal rotation of the camera, producing a vertical speed of the objects; mean error: 0.005 rad/s. Bottom graph: rotation of the camera along the optical axis; mean error: 0.01 rad/s. Top images: snapshots from the corresponding scenes, with a restricted sample of the tracked features and their respective velocity orientations.

Algorithm 1: Main Algorithm
require: N_T trackers initialized in the focal plane.
for each e_n = {X_n, t_n} do
    for Tracker T_i, i ∈ [1, N_T] do
        Update T_i according to Algorithm 2
        if ||X_n − X_i|| < R then
            Compute x_n according to eq. 2
            Compute the velocity error according to Algorithm 3

Algorithm 2: Update for each event
require: Tracker T_i, event e_n = {X_n, t_n}.
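The main loop of Algorithm 1 can be sketched as follows; the per-event tracker update (Algorithm 2) and the velocity-error estimation (Algorithm 3) are left as placeholders, and R is an illustrative receptive-field radius:

```python
import math

R = 12.0  # tracker receptive-field radius in pixels (illustrative value)

class Tracker:
    def __init__(self, X):
        self.X = list(X)      # position in the focal plane
        self.v = [0.0, 0.0]   # estimated velocity

    def update(self, t_n):
        """Placeholder for Algorithm 2: advance the tracker state to time t_n."""
        pass

def process(events, trackers):
    """Event-by-event main loop (Algorithm 1): every event updates all trackers,
    and events inside a tracker's radius contribute to its velocity correction."""
    matches = []
    for X_n, t_n in events:
        for i, T in enumerate(trackers):
            T.update(t_n)
            if math.dist(X_n, T.X) < R:
                # eq. 2 analogue: event location in the tracker's reference frame
                x_n = (X_n[0] - T.X[0], X_n[1] - T.X[1])
                # Algorithm 3 would estimate the velocity error from x_n here
                matches.append((i, x_n))
    return matches
```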

Table 1 :
Experimental values for the hyperparameters presented. Most of them are either constant across all experiments, or scale linearly with the radius R.

Table 2 :
Comparison between the frame-based and event-based trackers using simulated data. CSRT trackers are removed once the event-based tracker has lost the target due to the target's disappearance from the scene.

Table 3 :
Results of the ground-truth comparison with the CSRT trackers from the OpenCV library.