Towards Zero Shot Learning of Geometry of Motion Streams and Its Application to Anomaly Recognition

Visual anomaly recognition (VAR) is a core part of many intelligent systems. However, vagueness in definitions and the lack of a priori knowledge about the distribution of anomalies make VAR a challenging problem. Supervised solutions often fail in such scenarios because they cannot adapt to concept drift. To this end, we study the effect of temporal derivatives over differential manifolds for designing a zero-shot (label-agnostic) VAR solution. The rationale behind this work is to leverage the genericity and discriminative representation available in the geometric structure of motion tensors. Our approach proceeds by drawing segments of temporal derivatives from raw image sequences and projecting them onto a Grassmann product space before clustering. The suitability of the proposed approach is corroborated by extensive experiments and comparisons with prior art.


Introduction
A large number of CCTV cameras are being deployed in public places and transport networks. The camera market is forecast to grow at a compound rate Watch, in just the last five years the data generated by security cameras has increased from 566 petabytes to 2,500 petabytes per day [2]. Another report by Google and Smart Insights indicates that approximately one million hours of video (mostly unlabeled) are uploaded to YouTube every day [3]. This indicates the availability of large volumes of unlabeled space-time data on crowd behavior.
This can be leveraged (for surveillance purposes) by means of low-shot learning systems that do not require any supervision. Present surveillance systems have a human in the loop and are limited by bio-mechanical constraints such as fatigue and personal biases. For these reasons, Intelligent Surveillance Systems (ISS) are required that can handle unstructured data, such as surveillance camera streams or offline security scans, with high accuracy and reliability. Video Anomaly Recognition (VAR) is an integral task of any ISS and requires swift recognition of suspicious events [4,5]. The purpose of a VAR model is to recognize outlier patterns, such as errant behaviors or events in videos, that do not conform to the normal pattern. An anomalous pattern is generally a rare event; it may include (but is not limited to) accidents, stampedes, wrong driving, physical fighting and other abnormal behaviors in public places (a few samples are shown in Fig. 1). Recognizing anomalies has potential applications such as preventing stampedes, intelligent human-machine interaction, old-age health care, defense reconnaissance, transportation systems and so forth.
There are no set definitions for anomalies. The propensity of anomalies to be sparse and inauthentic makes their annotation challenging. The same event might be interpreted differently under different contexts. For instance, rapid traffic might be a normal event at a crossroads; however, the same is not true for a peaceful public place. Owing to this, an unsupervised approach is more suitable for handling anomalies. According to the no-free-lunch theorem, teacher-based techniques trained on a labeled dataset may not perform well on data from an unseen distribution [6]. Unsupervised algorithms suit unstructured data because spatio-temporal data varies hugely, the space of anomaly labels is immense, and exhaustive training of a supervised model is not possible. Also, the performance of an unsupervised system increases as more unlabeled data is supplied [7].
Conventional methods mainly capture the likelihood of stationarity of active objects in the scene. For human feature extraction, it has been observed that the focus should be on essential elements in image sequences rather than on RGB data, since a model easily overfits to unessential elements [8]. We have observed that anomalies do not always come from distributions modeling humans; they can also be due to other objects in the scene (as is evident in the experimented datasets). Hence, a few object-related representations cannot account for all kinds of irregularities. On the other hand, clustering algorithms cannot be blamed for unaligned clusters; the main cause is the lack of suitable feature representations for the latent space. Here, motion can be used as a strong prior for recognition tasks like segmentation, as shown by [9,10]. Taking a cue from these observations, we pose VAR as finding irregularities based on the geometry of motion streams. For this we employ temporal derivatives (which are indicators of temporal change in a scene), since they are sparse and locally continuous and can be seen as trajectories in Riemannian space. Unlike conventional approaches to anomaly recognition, our approach does not depend entirely on the dominant motion, which may suffer from perspective distortions.
Our algorithm begins by pooling all image sequences. These are then processed by SuBSENSE foreground segmentation [11], which uses local feedback loops and adaptive sensitivity to illumination variation. In an alternative approach, rather than extracting the foreground, we use deep optical flow from FlowNet2 [12].
Both SuBSENSE and FlowNet2 provide the required temporal derivatives for each image sequence in the pool. These are then split into smaller temporal segments. This process is visually explained in Fig. 3 in section 3. Each of these segments can be treated as a tensor of order three. These tensors are then decomposed using factor-k flattening. Each such factor can be represented as a point on a Grassmannian. Next, we take the product of Grassmannians rather than using each Grassmannian in isolation. It has been observed that the product of Grassmannians provides better results than individual factor manifolds. Chordal distance is then employed to measure the geodesic distances among different points on the product of Grassmannians [13]. The similarity matrix, obtained from the pool of geodesic distances among temporal segments, is then clustered using Minimum-Cluster-Variance (MCV) based Agglomerative Hierarchical Clustering (AHC). It produces separate clusters of anomalies and non-anomalies. This works well for offline scenarios, where the entire data is available before clustering; however, for responsive surveillance it is crucial to have an online processing system. To this end, we present an unsupervised active learning approach with a weak oracle that uses two parameters, β and γ, as its confidence measures. The key idea is to delay the clustering as long as possible without compromising the confidence in clustering. Using this approach, the data is processed as it arrives. Based on the two approaches for extracting temporal derivatives, we call the SuBSENSE-based approach Unsupervised Segmentation (US) and the FlowNet2-based approach Unsupervised Flow (UF). The results are compared with state-of-the-art deep models and unsupervised approaches for anomaly recognition.
The main contributions of this work are listed below:
• To the best of our knowledge, we are the first to analyze space-time manifolds purely on the basis of temporal derivatives with multi-linear motion representations. This allowed us to assert that the quality of anomaly recognition is owing not to appearance or illumination but to the inherent motion biases.
• We have studied the challenges of anomaly recognition and formulated a simple yet generic approach for offline zero-shot anomaly recognition. Additionally, a novel unsupervised active learning approach is presented, which takes the help of weak oracles in order to lay down the context. This extends our framework to the online learning regime.
• We have conducted an extensive empirical study on five publicly available anomaly recognition benchmarks with coarse-to-fine levels of anomalies.
This paper is organized as follows. Section two reviews related work. Section three presents the methodology, where we discuss the details of the proposed offline and online approaches. Section four covers the experiments, results, ablations and analysis. Lastly, section five concludes the paper along with future research directions.

Related Work
The features learned by unsupervised techniques are more generalizable [14]. Bag-of-Visual-Words (BoW) is a well-known model and has surfaced in many zero-shot classification works [14,15,16]. Wang et al. used spatio-temporal local features and clustered them with the k-means algorithm [14]. Similarly, Niebles et al. used generative modeling of spatio-temporal features [16]. Chen et al. used force fields to model crowd behavior in terms of size, position and orientation [17].
The problem of abnormality recognition (VAR) is usually formalized as an outlier recognition problem. An outlier can be detected based on temporal or spatial data. To provide parametric and non-parametric solutions to VAR, prior works have used raw optical flow [18,19], pixel-based approaches [20,17], particle-based approaches [21,22,23] and trajectory-based representations [24,25]. These are sophisticated algorithms which are sensitive to either local fluctuations (in appearance) or dominant motion. Deep learning (DL) approaches overcome these limitations by automating feature discovery and through transfer learning [26,27]. Nonetheless, it has been noted that feature transferability does not often lead to improved performance unless a model incorporates essential elements of representation [8].
Semi-supervised techniques have been proposed to leverage large unlabeled data [5,28,29,30,31,32]; however, they still depend on large labeled datasets and easily deviate under concept drift [33,34]. Additionally, they are somewhat tricky to train, which is not in the spirit of a generic solution. Our approach avoids this by employing the genericity of Riemannian structure without needing any labeled data.
A few unsupervised non-visual anomaly recognition works [35,36,28] have used autoencoders for feature extraction. However, these cannot be applied directly to spatio-temporal data. Moreover, they do not define any trainable objective and thus fail to extract differential representations for anomalies. To avoid this, weak-supervision models based on representation learning have been proposed [5,30].
Representation learning assumes that if the set of regular events is known a priori, then a generative or discriminative model can be trained. Generally, these  [5,37,30,4,31,32,28]. However, the deep models suffer from class imbalance since anomalies are very sparse and spurious, and the use of binary labels for weak supervision makes the system not fully automated, which leads to increased label bias [38].
On the other hand, it has been proved that data for many vision problems lies on low-dimensional manifolds [39]. For example, covariance matrix of an image past for spectral decomposition of mode-2 or mode-3 tensors [41,42,43,44].
Motivated by the above discussion, we pose VAR as finding irregularities based on the geometry of motion streams. Our approach employs a Riemannian metric for projecting the raw space-time data onto a manifold of temporal derivatives that leverages the temporal shape of the objects, which is not captured by trajectory-based approaches or raw flow analysis. This is explained in detail in the next section.

Approach
Our approach tries to solve the problem of video anomaly recognition (VAR) in a zero-shot way. Given the openness of the criteria for defining anomalies, it is generally hard to get appropriate representations. Even supervised models cannot be trained exhaustively. In this situation, one option is to use either a few-shot or a zero-shot learning model. This work proposes a data-driven zero-shot learning solution whose performance improves as more data is added.
With zero-shot models, the representation of data is the key attribute in the context of learning. We have noticed that clustering algorithms alone cannot be blamed for unaligned clusters.
Anomalies are often rare events. Anomaly agents leave peculiar marks in the spatio-temporal space. Our idea is to discriminatively capture these marks by leveraging the Riemannian structure present in their geometry. In such a setting, space-time anomalies can be seen as a trajectory on a non-linear manifold. An image sequence can be seen as a 3D hyperplane with H, W, T  Eqn. (4)), is then employed as a geodesic measure between any two points to form a similarity matrix. This similarity matrix is then clustered into groups, each having points belonging to anomalous or normal event distributions. This process is explained in detail in the next few sections. The proposed approach is generic in the sense that it is data driven and has multiple applications, as demonstrated in section 4.5.

Grassmann manifold
Among the many manifolds that have intrinsic Riemannian structure, the Grassmann manifold (or Grassmannian) has been found to be the most suitable representation for 3D tensors [45,46]. The Grassmannian is an abstract quotient manifold derived from the Stiefel manifold, and is used to fit the orthogonality constraints. A Grassmannian $G_{n,p}$ with non-zero $p$ and $n$ such that $n \le p$ is the set of all $n$-dimensional linear subspaces of real $p$-dimensional space in and $O_p$ is an orthogonal group.

Geodesic similarity measure
Geodesic distance (GD) acts as the inter-segment similarity measure between any two points on a Grassmannian. GD on a Grassmannian $G_{n,p}$ between two $n$-dimensional linear subspaces $P$ and $Q$ in $\mathbb{R}^p$ can be characterized in multiple ways; however, a well-accepted norm is to use the canonical angles $\theta_1, \ldots, \theta_m$ between the canonical vectors of the two subspaces. They can be computed recursively as shown below:

$$\cos\theta_k = \max_{x \in P} \, \max_{y \in Q} \, \langle x, y \rangle = \langle x_k, y_k \rangle$$

where $\langle x, y \rangle$ is the inner product between $x$ and $y$, $x_k$ is the $k$th canonical vector of the $n$-plane $P$, and similarly $y_k$ is the $k$th canonical vector of the $n$-plane $Q$, subject to $\langle x, x \rangle = \langle y, y \rangle = 1$ and $\langle x, x_i \rangle = \langle y, y_i \rangle = 0$ for $i = 1, \ldots, k-1$.

Chordal distance is used as GD, since it is differentiable everywhere and works best for Grassmannians [13]. It is defined as the L2-norm of the sine(s) of the angles between the corresponding canonical vectors of the two points on a Grassmannian, as shown below:

$$dist_{chordal}(P, Q) = \lVert \sin\theta \rVert_2 = \Big( \sum_{k=1}^{m} \sin^2\theta_k \Big)^{1/2}$$

Product manifold (PM). A PM is a compound object in high-dimensional space which is composed of factor manifolds (FMs). To understand it better, consider an example with two FMs: one is a line in $\mathbb{R}^1$ and the other is a circle in $\mathbb{R}^2$. The PM of these two FMs is an infinite cylinder in $\mathbb{R}^3$. In a similar way, a PM represents the cross-section of its constituent FMs. A GPM is a PM of Grassmann FMs. Let $M_1, M_2, \ldots, M_j$ be a set of Grassmann FMs. When the topology of this set is the same as the product topology, it is called a PM. For this set, the PM can be defined as:

$$M = M_1 \times M_2 \times \cdots \times M_j$$

where $\times$ represents the Cartesian product. Our experiments revealed that the PM yields better performance than the individual FMs.
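The canonical angles and the chordal distance can be computed numerically from orthonormal bases of the two subspaces. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, uses the standard fact that the singular values of $P^\top Q$ are the cosines of the canonical angles, and the function names are hypothetical. How the per-factor distances are combined on the product manifold is also an assumption here (the sines from all factors are stacked before taking the L2-norm).

```python
import numpy as np


def orthonormal_basis(a):
    """Return an orthonormal basis for the column space of a (reduced QR)."""
    q, _ = np.linalg.qr(a)
    return q


def principal_angles(p, q):
    """Canonical angles between the column spaces of orthonormal p and q."""
    # The singular values of p^T q are the cosines of the canonical angles.
    cosines = np.linalg.svd(p.T @ q, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def chordal_distance(p, q):
    """L2-norm of the sines of the canonical angles."""
    return float(np.linalg.norm(np.sin(principal_angles(p, q))))


def product_chordal_distance(point_a, point_b):
    """Chordal distance between two points on a product of Grassmannians.

    Each point is a tuple of orthonormal bases, one per factor manifold.
    Assumption: sines from all factors are stacked before taking the norm.
    """
    sines = [np.sin(principal_angles(a, b)) for a, b in zip(point_a, point_b)]
    return float(np.linalg.norm(np.concatenate(sines)))
```

With this convention, two orthogonal lines are at distance 1 and a subspace is at distance 0 from itself, regardless of which orthonormal basis is chosen to represent it.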
Algorithm 1: Clustering anomalies on a Grassmannian Input: Set D of all temporal segments Output:

Subspace representation of anomalies
As explained in Algorithm 1, the $i$th image sequence $IS_i$ having frames . . , $T_m$ } | $m < n$; with each segment $T_i$ having length 10 and an overlap of 30% with $T_{i-1}$. The respective temporal derivative $\check{T}_i$ is obtained for each segment $T_i \in D$. Temporal derivatives are obtained from SuBSENSE [11] and FlowNet2 [12] in the form of foreground segmentation and deep optical flow respectively.

Distance between two segments $(T_i, T_j)$ is equivalent to the chordal distance $dist_{chordal}(P_i, P_j)$ between the two corresponding points $(P_i, P_j)$ on the GPM. It is formulated as the L2-norm of the component-wise sine(s) of the principal angles between the column spaces spanned by the orthogonal matrices of the two points, as laid out by Eqn. (4). All points in the set $E$ are then clustered using pairwise chordal distance and similarity matrix $\Sigma$.
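The segmentation and subspace-extraction steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes NumPy, the function names are hypothetical, a 30% overlap on length-10 segments is read as a stride of 7 frames, and each factor is represented by the leading left singular vectors of the corresponding mode-k unfolding (the text does not say which singular vectors or how many are retained, so `rank` is a free parameter here).

```python
import numpy as np


def split_segments(num_frames, length=10, overlap=0.3):
    """Return (start, end) frame index pairs for overlapping segments."""
    stride = max(1, int(round(length * (1.0 - overlap))))
    return [(s, s + length) for s in range(0, num_frames - length + 1, stride)]


def unfold(tensor, mode):
    """Mode-k flattening: rows index the chosen mode, columns all the others."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def gpm_point(tensor, rank):
    """One orthonormal basis per mode of an order-3 tensor.

    Assumption: each factor keeps the `rank` leading left singular vectors
    of the mode-k unfolding.
    """
    point = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        point.append(u[:, :rank])
    return tuple(point)
```

For a 24-frame sequence, `split_segments(24)` yields the three segments (0, 10), (7, 17) and (14, 24); frames that do not fill a complete final segment are dropped in this sketch.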

Clustering
For measuring the clustering accuracy, we have employed the concept of cluster purity (accuracy). Conventionally the objective of a clustering algorithm is to increase inter-class variance while keeping the intra-class variance as low as possible. A transparent measure of clustering quality is cluster accuracy.
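As an illustration, cluster purity can be computed as below. This sketch assumes the standard definition of purity (each cluster is credited with its majority ground-truth label, and the credited counts are summed and divided by the number of points); the text does not spell out the exact formula used, and the function name is hypothetical. Ground-truth labels are used here only for evaluation, not for clustering.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    # Credit each cluster with the size of its most frequent label.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)
```

For example, with cluster assignments `[0, 0, 0, 1, 1]` and labels `["n", "n", "a", "a", "a"]`, each cluster contributes two majority points, giving a purity of 0.8.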

Online VAR
Thus far we have seen the working of the offline VAR approach. It assumes the entire data to be available beforehand; however, this assumption cannot be satisfied in situations where data generation is a function of time. For example, in live surveillance, a continuous stream of data is broadcast and anomalies need to be detected as they appear in the scene. To this end, we have designed an unsupervised active learning algorithm (Fig. 6) which leverages the existing clustered data for assigning an incoming segment to a cluster.
Algorithm 2: Online anomaly recognition

Algorithm 2 describes the steps involved in online VAR. It begins with a one-time initialization using the offline approach on a subset of the segment set $D$. This gives a point set $E$, a similarity matrix Σ and a clustered set or clusterer Ψ. After initialization, medoids are found for each cluster in Ψ. Upon arrival of a new segment $T_i$, its temporal derivative $\check{T}_i$ is extracted. Tensor $\check{T}_i$ is then transformed into three mode-k matrices with factor-k flattening. After decomposing these matrices with HOSVD, a GPM point $P_i$ is obtained using Eqn. (3). Two constants, γ and β, are maintained for every cluster in Ψ. These act as confidence measures for the weak oracle. γ is the minimum number of elements to be maintained by a cluster $\psi_k$. β ∈ (0, 1) is used for finding the maximum allowed chordal distance between the medoid $m \in \psi_k$ and the GPM point $P_i$; beyond this distance $P_i$ is not assigned to $\psi_k$. If the cardinality of $\psi_k$, containing the nearest medoid $m$ of $P_i$, is not less than γ, and the distance between $P_i$ and $m$ is not greater than β times the distance between $m$ and the farthest element $P_{far} \in \psi_k$, then the oracle assigns the point $P_i$ to $\psi_k$ and updates the similarity matrix Σ and $P_{far}$; otherwise reclustering is done over the entire point set $E$ including $P_i$. The effect of the choice of γ and β is discussed in section 4.3.
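The weak oracle's decision rule can be sketched in a few lines. This is a simplified illustration, not the full Algorithm 2: the function and argument names are hypothetical, distances to the medoids and to each cluster's farthest member are assumed to be precomputed chordal distances, and the subsequent update of Σ and $P_{far}$ (or the reclustering itself) is left to the caller.

```python
def assign_or_recluster(dist_to_medoids, cluster_sizes, farthest_dists, beta, gamma):
    """Return the index of the cluster to assign to, or None to trigger reclustering.

    dist_to_medoids[k]: distance from the new point to the medoid of cluster k.
    cluster_sizes[k]:   current number of elements in cluster k.
    farthest_dists[k]:  distance from the medoid of cluster k to its farthest member.
    """
    # Pick the cluster whose medoid is nearest to the new point.
    k = min(range(len(dist_to_medoids)), key=dist_to_medoids.__getitem__)
    large_enough = cluster_sizes[k] >= gamma
    close_enough = dist_to_medoids[k] <= beta * farthest_dists[k]
    return k if (large_enough and close_enough) else None
```

A return value of `None` corresponds to the "otherwise" branch in the text, where the entire point set including the new point is reclustered.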

Experiments
For comparison of our work with recent methods, we have selected three widely  [19] and mixture of dynamic textures (MDT) [22]. The other two works, AMC [17] and OADC [24], have been selected for their global-feature-based unsupervised approach. Due to the unavailability of code for these methods, we have used an in-house implementation. In this section, we proceed by first presenting the VAR datasets, then the offline and online VAR results and related ablations, followed by results analysis and a few applications.

Datasets
We have carefully selected five publicly available datasets in a fine-to-coarse way, such that they cover both global and local anomalies. Additionally, we have  ing normal crowd behavior scenes such as people walking or running. UMN datasets were presented by Mehran et al. [48]. The UCSD pedestrian dataset was contributed by [21]. Normal behavior includes pedestrians walking on the pathways with no unusual activity happening. Abnormal behavior includes pedestrians walking on the surrounding grass, on wheelchairs, non-pedestrians such

accuracy plots for the US and UF approaches are reported in Fig. 8, 9. Segment length for these experiments was kept at ten. We can observe that in all of the plots, the performance of the UF approach is better than that of US. One reason for this can be that the flow contains variable magnitude at each point in spatio-temporal space, unlike segmentation. It is also evident from the plots that the clustering performance starts saturating as we increase the number of clusters

Fig. 10: (a) Performance of the UF approach for segment lengths 10, 20, 30 and 60, with the cluster count varied along the X-axis. (b) Effect of the change in segment length on the prediction accuracies. An increase in segment length results in a marginal increment in cluster accuracy but leads to a significant decrease in frame-level accuracy. This shows the suitability of segment length ten in comparison to others.

lengths 10, 20, 30 and 60 are plotted against different cluster counts for the UF approach. It can be noticed that better cluster accuracy is achieved for a higher segment length; however, the variance in cluster accuracies among different segment lengths is not very high. To investigate further, Fig. 10b reports the effect of segment length variation on cluster and frame-level accuracy for the UF approach at cluster count five. Here, it is clearly visible that as the segment length increases, the cluster accuracy does not increase in proportion. However, the frame-level accuracy decreases significantly with an increase in segment length. This suggests that fine-level discrimination reduces as we increase the segment length.
FlowNet captures the variable spatio-temporal magnitude of flow, which is not very good at the edges of a moving object; on the other hand, SuBSENSE-based segmentation has crisp edges. This suggests that the information captured by the two approaches can be fused together. To explore this idea, we have employed conjunctive and disjunctive late-fusion approaches. In the conjunctive fusion, a frame is considered anomalous if both UF and US assign the corresponding segment to a cluster with anomalous segments in the majority, otherwise

has attained highest score on the UMN crowd dataset. OADC is implemented without motion saliency, as we think it interferes with natural motion by increasing motion contrast irrespective of knowledge of the kind of anomaly it handles. Due to this, OADC has performed better than AMC on local events and comparably on global events. Amongst the unsupervised approaches UF performs best, followed by OADC. AMC shows the worst performance on the UCSD and Caviar datasets due to its bias towards global events. The performance of the proposed UF approach has been better than the others over all datasets. Fig. 11 reports the specificity and sensitivity plots of the different algorithms.
Sensitivity measures the probability of an anomalous event being recognized as anomalous, whereas specificity measures the probability of a normal event being recognized as normal. It is evident from Fig. 11 that UF has better recognition rates for both anomalies and normal events, whereas MPPCA and AMC have the lowest average recognition rates for anomalies and normal events.

to five medoids from that cluster. Anomalies are marked in red. One can observe that each cluster tries to identify a different aspect of the scene. Anomalies are concentrated towards the last three clusters, while the first two clusters capture the density of pedestrians. Cluster one has high pedestrian density and some anomalies as well. Cluster two has low pedestrian density and a few anomalies. Clusters three and four have bike and skating anomalies. Cluster five has anomalies involving large vehicle movement. Fig. 12 shows that each cluster captures some specific kind of information about the scene.

Relation between accuracy and beta. The oracle uses β as a confidence measure to decide the maximum allowed distance from the medoid of a cluster to a new point, before assigning it to that cluster. The relationship of cluster accuracy and the frequency/tendency to recluster with respect to β is reported in Fig. 13b.

Results: online VAR
The plot contains normalized reclustering and accuracy scores. One can observe

Results analysis and discussion
Through the above experiments we found that, in the absence of any supervision, inherent biases in the data samples can lead to semantically pronounced clusters. Usually, anomalies at global scale are identified better than anomalies at local scale. This is also corroborated by the sensitivity rates in Fig. 11a. We found that, unlike methods such as AMC, which work well for global-level anomalies, the proposed approach works well for both global and local kinds of anomalies. We found that the proposed approach has the best sensitivity and specificity in comparison to the other approaches. The sensitivity scores are slightly higher than the corresponding specificity scores. This implies that anomalies have higher recognition rates and are better discriminated than non-anomalous data.
Qualitative clustering results are reported in Fig. 12. One can notice that the anomalies are clustered towards the end, while the first two clusters contain non-anomalous pedestrian movements. However, it should be noted that clusters one and two differ from each other in terms of population density. In Fig. 12, one may also observe that the UF approach sometimes fails to distinguish between kinds of anomaly, such as bike or skating.
Due to uncertainty in the types and count of anomalies, we have assessed the performance of the proposed approach by clustering with different numbers of clusters. We found that the cluster accuracy curves (Fig. 8, 9) normally saturate towards the end; this suggests that if we take a large enough number of clusters, then most of the events can be categorized well. It is evident from the cluster accuracy curves that UF performs better than the US approach. One reason for this could be the additional directional information captured by the UF approach compared with US.

As observed in section 4.2, variation in segment length has only a marginal effect on the cluster accuracy; however, frame-level accuracy is affected by it. Owing to this, a segment length of ten was used across all experiments. To enable the proposed approach for streamed data, the online UF approach was proposed. The performance of the online approach was found to be equivalent to that of the offline approach. The online approach was found to improve its performance as the number of segments increased with time. These observations make the online approach a suitable candidate for VAR in online scenarios. The online approach depends on its confidence measure β. We observed that higher values of β resulted in lower reclustering rates and better accuracies.

US and UF variants of the proposed approach were combined in a late-fusion manner to gauge how well they complement each other. However, as the results revealed in Table 1

For the sake of analyzing the generalization ability of the proposed approach, we have experimented on a few similar problems like nearest-neighbor retrieval, action recognition and gesture recognition. This is covered in the next section.

In the previous sections we have seen that the proposed approach works well for VAR. In this section we evaluate the performance of the proposed approach in the context of VAR-related problems, to test its suitability for other applications. More specifically, we have considered the tasks of media retrieval, action recognition and gesture recognition.

index $id(f_i)$ which belongs to class $class(f_i)$ (either normal or anomalous), the task of NNR is to retrieve another frame $f_j$ from the retrieval database $D_{base}$ such that $id(f_j) = \arg\min_{f_k \in D_{base}} dist(f_i, f_k)$. If $class(f_i) = class(f_j)$, the retrieval is considered a true positive. The whole UMN dataset (crowd, web) is considered for the NNR task. The joint dataset is split into two disjoint sets $S_{90}$ and $S_{10}$ in 90:10 proportion. Set $S_{90}$ is used for constructing $D_{base}$, and set $S_{10}$ is used for query generation. None of the query frames $f_i \in S_{10}$ is indexed in database $D_{base}$. During the NNR task, frame-level accuracies are considered and results are evaluated both qualitatively and quantitatively.

Under Curve (AUC) measure. The proposed approach has the highest AUC value, followed by MDT. The qualitative results are presented in Fig. 15. Each row shows the two retrieved nearest neighbors of a query. We find that our method outperforms the other methods due to the GPM representation.

Table 3 and 4. Gesture recognition accuracies are reported in Table 5. Confusion matrices of class-wise recognition scores are provided for each of the three datasets in Fig. 17. We find that the performance of the proposed approach is comparable to other supervised and unsupervised works. The consistent performance of the proposed approach across different multiclass applications demonstrates that it is not biased towards anomaly-related binary recognition tasks.
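The nearest-neighbor retrieval protocol described earlier in this section (retrieve the database item at minimum distance from the query, and count a true positive when the classes agree) can be sketched as below. The function names are hypothetical, and `dist` stands in for whichever distance is used between frame representations; in the setting of this paper that would be the chordal distance on the GPM, which is an assumption of this sketch.

```python
def nearest_neighbor(query, database, dist):
    """Return the index of the database item closest to the query."""
    return min(range(len(database)), key=lambda k: dist(query, database[k]))


def retrieval_accuracy(queries, query_classes, database, database_classes, dist):
    """Fraction of queries whose nearest neighbor has the same class."""
    hits = 0
    for query, query_class in zip(queries, query_classes):
        j = nearest_neighbor(query, database, dist)
        hits += int(database_classes[j] == query_class)
    return hits / len(queries)
```

Keeping `dist` as a parameter lets the same evaluation loop be reused with any representation, as long as the query set and the database are disjoint.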

Conclusions and future work
In this work, a zero-shot learning approach for anomaly recognition is proposed by modeling the temporal derivatives as trajectories on a Grassmann Product Manifold (GPM). The GPM is leveraged for its discriminative representations and because it leaves less room for design choices. The video anomaly recognition problem is comprehensively studied in terms of coarse-to-fine scale anomalies on five publicly available datasets.
Additionally, an online variant is proposed for adapting the offline model to streamed data. For this, a modified version of active learning is presented in which a weak oracle uses confidence measures to take decisions without any help from a strong learner. The performance of the online approach is found to be comparable to that of the offline approach. We found that the ability to leverage the inherent bias in the data samples makes the proposed approach very suitable for the anomaly recognition task. The genericity of the proposed approach is further validated on other multiclass recognition tasks. Despite not using any label information, the overall performance of the proposed zero-shot approach is found to be comparable to other supervised or weakly-supervised works.
We observed that most of the approaches performed well on the UMN dataset; however, all approaches had significantly lower performance on the Caviar dataset. One reason for this could be the inclusion of spatio-temporal stagnation of objects in the anomaly events and complex multi-person interactions.
A few late-fusion strategies have also been explored; however, the fusion showed a significant drop in performance. This revealed the need for further exploration in the direction of early and feature-level fusion schemes. We plan to address these issues in the future. We also plan to adapt the proposed work to other applications such as video summarization, content-based spatio-temporal search and automatic concept discovery. Our future work involves exploring metrics on GPMs with emphasis on large-scale complex human-to-human interactions.