Locating X-Ray Coronary Angiogram Keyframes via Long Short-Term Spatiotemporal Attention With Image-to-Patch Contrastive Learning

Ruipeng Zhang, Binjie Qin, Member, IEEE, Jun Zhao, Member, IEEE, Yueqi Zhu, Yisong Lv, and Song Ding

Locating the start, apex and end keyframes of moving contrast agents for keyframe counting in X-ray coronary angiography (XCA) is very important for the diagnosis and treatment of cardiovascular diseases. To locate these keyframes from the class-imbalanced and boundary-agnostic foreground vessel actions that overlap complex backgrounds, we propose long short-term spatiotemporal attention by integrating a convolutional long short-term memory (CLSTM) network into a multiscale Transformer to learn the segment- and sequence-level dependencies in the consecutive-frame-based deep features. Image-to-patch contrastive learning is further embedded between the CLSTM-based long-term spatiotemporal attention and Transformer-based short-term attention modules. The imagewise contrastive module reuses the long-term attention to contrast the image-level foreground/background of the XCA sequence, while the patchwise contrastive projection selects random background patches as convolution kernels to project foreground/background frames into different latent spaces. A new XCA video dataset is collected to evaluate the proposed method. The experimental results show that the proposed method achieves a mAP (mean average precision) of 72.45% and an F-score of 0.8296, considerably outperforming the state-of-the-art methods. The source code is available at https://github.com/Binjie-Qin/STA-IPCon.


I. INTRODUCTION
IN X-ray coronary angiography (XCA, all acronyms in this paper are listed in Table I) for the diagnosis and treatment of cardiovascular diseases, measuring the contrast-diffusion time span over the two phases of filling and disappearing of contrast agents in myocardial perfusion can be directly used to evaluate coronary microvascular function [1], [2]. Locating the start, apex and end keyframes (see Fig. 1) of moving contrast agents for keyframe counting in the two phases serves as the main mode of decision-making but suffers from several challenging problems: an extreme foreground-background imbalance with a very small number of low-contrast foreground vessels that overlap with complex and dynamic backgrounds, subtle changes in foreground action volume with limited inter-keyframe variation, and missing boundaries between the keyframes and the surrounding frames (see Fig. 1). Without separating the small number of vessels from the complex and dynamic backgrounds [3], [4], it is difficult to identify the boundary-agnostic keyframes in the imbalanced and overlapping XCA sequence, let alone classify and localize them.
We assume that learning the vessel's evolving trend (see Fig. 1) by aggregating long short-term spatiotemporal features for segment- and sequence-level dependency modeling is the key to solving this challenging keyframe localization problem. Specifically, we treat keyframe extraction as temporal action localization (TAL). As one of the most challenging problems in computer vision, TAL has been studied [5], [6], [7] for general video sequences but not for the challenging XCA sequences. Recently, Actionformer [6] achieved the best TAL performance [5], [6]. However, most TAL methods refine discriminative action boundaries from segment-level semantics [7], [8], [9], [10] and model inter-frame relationships directly with a Transformer architecture, hardly focusing on image-to-patch spatiotemporal features to model the gradually changing small features in a video sequence.

Fig. 1. Start-to-apex-to-end XCA keyframe localization in the two actions (red and green) of contrast filling/disappearing phases. We predict the two mid-points of the actions and regress the lengths of the action phases to determine the start and end frames, averaging these two frames to obtain the apex keyframe. The three images of each frame are the previous frame of the keyframe, the keyframe, and the next frame of the keyframe.

Besides, Transformer usually divides video sequences into small segments (or snippets) and models temporal relationships within each segment using local-window attention [11], [12], leading to a loss of long-range inter-segment information exchange. The multiscale Transformer [6] used temporal downsampling to shorten the time length and increase the receptive field of the sequence for establishing inter-segment dependencies. However, the loss of long-range inter-segment information is still unavoidable because downsampling discards temporal information. Therefore, existing methods ignore the gradually evolving trend of blood vessels in an XCA sequence and lose long-term dependencies in TAL. To solve these problems, we propose a long short-term spatiotemporal attention network with image-to-patch contrastive learning to refine segment- and sequence-level spatiotemporal attention modeling, increasing the contrastive learning performance for boundary-agnostic XCA keyframe localization. The main contribution of this work is threefold:

1) An effective XCA keyframe localization method is proposed, building upon the convolutional long short-term memory (CLSTM) network for learning segment- and sequence-level long short-term dependencies and upon Actionformer [6] for modeling short-term attention in sequential XCA segments.
2) A low-rank background patch is selected randomly as a convolutional kernel for patchwise convolutional projection in each frame, effectively projecting foreground/background patches into different latent spaces, while image-level foreground/background features are simultaneously contrasted by reusing the long short-term spatiotemporal attention.
3) To the best of our knowledge, this is the first study of XCA keyframe localization that exploits the class-imbalanced small foreground features that are sparsely distributed and overlap with complex backgrounds. The proposed model clearly outperforms state-of-the-art (SOTA) methods on the collected dataset.

II. RELATED WORK

A. XCA Sequence Recognition
The XCA sequence provides consecutive frames containing heterogeneous blood vessels that overlap with various interferences, such as anatomical structures, mixed Poisson-Gaussian noise [13], [14], and respiratory and cardiac motions. Vessel segmentation [15], [16], [17] and vessel extraction [3], [18] are the main topics for XCA sequences. Most deep-learning-based vessel segmentation methods use an encoder-decoder architecture for single-image segmentation and multidimensional convolution or long short-term memory (LSTM) for sequence processing. Traditional vessel extraction algorithms are mainly built upon grey values or tubular feature representations, which also enhance background structures with similar tubular artifacts and thus introduce more difficulty into subsequent vessel classification or tracking. Recently, by decomposing video sequences into low-rank backgrounds and sparsely distributed foreground objects, robust principal component analysis (RPCA) [19], [20], [21] has proven able to separate moving contrast-filled vessels from complex and dynamic backgrounds in XCA sequences. To address the computational costs and noisy remnants, RPCA-UNet [3] greatly improved computational efficiency and achieved excellent restoration of heterogeneous vessel profiles by exploiting patchwise feature selection in an RPCA unrolling network [22]. We refer interested readers to recent comprehensive reviews on XCA vessel extraction [3], [4].

B. Keyframe Extraction for Video Summarization
Keyframe extraction finds a small subset of the most representative frames in a video sequence for static video summarization [23], [24], which traditionally includes three main categories. 1) Frame clustering [25], [26] groups similar frames by feature representation and similarity metrics and then extracts the frame closest to each cluster center as a keyframe. 2) Shot segmentation [27], [28] first detects shots by representing low- and mid-level features of video content and identifying shot boundaries in the original video, and then extracts one or more keyframes from each shot. Both categories lack effective feature representations to distinguish subtle changes between consecutive unstructured frames within poor-quality sequences [9], [29]. 3) Sparse coding methods [30], [31] extract a few (sparse) frames while preserving the essential video content, which is best reconstructed as a linear combination of the selected keyframes. Keyframe dictionary selection [31] and RPCA-based methods [30], [32] used $L_{2,1}$-norm [32], $L_1$-norm [30], or $L_{2,p}$-norm [31] sparsity constraints to ensure the sparsity of the reconstruction coefficients, selecting keyframes as local/global maxima of the norm-regularized reconstruction objective. Patch-based sparse representation [33] has been proven to outperform frame-level sparse representation because it balances the representativeness of global features and local details.
In the era of deep learning, keyframe extraction is treated as frame-level importance-based sequence labeling or sequence-to-sequence learning with full supervision, exploiting encoder-decoder recurrent neural networks (RNN) with bidirectional [34], [35] or hierarchical [36] LSTM and convolutional RNN [37], as well as attention mechanisms [12], [34], [35], to capture the spatiotemporal dependencies among frames. A fully convolutional sequential network (FCSN) with stacked convolutions [38] took 2D CNN features of single frames and 1D temporal convolutions to embed semantic and pairwise relations into long-range dependencies. Nevertheless, supervised learning is tedious and costly in manually annotating frame- or shot-level labels for video sequences. Therefore, reinforcement learning (RL) was built upon an encoder-decoder architecture and an FCSN-based 3D spatiotemporal U-Net [29] to extract video features and produce probability weights for optimizing the frame selection of RL agents, which are updated during training with diversity and representativeness reward functions. Ultrasound keyframes [9] were extracted via detection-based nodule filtering and a customized reward mechanism, eliminating redundancy and integrating lesion features into keyframe searching. However, the lack of high-quality annotations prevents supervised learning and RL methods from reaching high efficiency in video summarization.
Consisting of a summarizer and a discriminator, generative adversarial networks (GANs) embedded with an a priori spatiotemporal model or attention mechanism [12], [39], [40] adversarially learn to create importance-score-derived keyframes via the summarizer, which fools the trainable discriminator to the extent that it can no longer distinguish the score-weighted keyframe features from the original features. However, GANs suffer from instability and sensitivity to hyperparameters when modeling complex spatiotemporal distributions for XCA-like videos.

C. Temporal Action Localization
TAL [5], [41] localizes the beginning and end time stamps of the actions of interest and recognizes the action categories in long untrimmed videos. TAL for nonhuman activity understanding through low-contrast long-term sequential X-ray and infrared imaging [42] is more demanding and challenging than video-based human action localization due to the difficulty in precisely locating imperceptible and heterogeneous action changes. Currently, the most effective TAL methods are based on deep learning with frame-level full supervision and are typically classified into two-stage and one-stage methods [5], [43]. The former, also known as the anchor-based top-down approach, partitions each video into multiple temporal positions (i.e., anchors) as multiscale action proposals and performs action recognition/regression on each proposal, while one-stage methods usually employ a bottom-up solution to predict actionness, startness, and endness scores at each temporal point for direct regression of action boundaries.
Recent two-stage approaches improved action proposals by extracting features via 3D ROI pooling [44] and pyramid pooling [45] or by modeling the context among action proposals using graph neural networks [46], attention [47] or Transformer [48]. One-stage methods utilized a cascade of temporal CNNs with a recurrent scheme [49] or a saliency-based refinement module [7] to aggregate the contextual features of every temporal point for the regression of action boundaries, generating more flexible but noisy point proposals for TAL. To represent long-range dependencies, recent one-stage methods exploited Transformer [6], [43], [50], [51] to weight all temporal points for capturing the internal correlation of the data. Actionformer [6] outperformed all SOTA methods [5] by simply integrating local self-attention into a temporal feature pyramid and extracting action candidates at each location of the pyramid. A lightweight convolutional decoder further implemented shared classification and regression to decode the feature pyramid into different actions with labels and temporal boundaries.
By incorporating all global points into scaled dot-product attention, which inevitably introduces undesired backgrounds, Transformer [11], [52] may have modeling difficulty and high parametric and computational complexity in representing the discriminative spatiotemporal features of foreground actions. Some improved Transformers introduced long-term forecasting [53], memory mechanisms [54], temporal window-to-window communication [55] and downsampling along the temporal domain [6] to selectively highlight the foreground feature representation and reduce the complexity. Furthermore, self-attention and traditional attention have been combined to refine the feature representation [56]. Therefore, we propose a CLSTM-based long short-term spatiotemporal attention module in Actionformer [6] to compensate for the long-range dependency modeling deficiency of Transformer. A few researchers have proposed pretraining [57], [58] for TAL to learn video feature representations. However, pretraining foreground/background contrast for refining foreground actions from overlapping backgrounds has not been reported thus far. To the best of our knowledge, the proposed method is the first work to implement contrastive learning [59] for efficient and robust foreground/background representations in TAL.

Fig. 2. The proposed network has long-term attention, short-term attention, patchwise and imagewise contrastive modules. The modules in the dotted box are from Actionformer [6]. The solid arrows represent the data streams, and the dashed arrows represent data streams that can be chosen to activate or not.

III. METHOD
The proposed architecture has four modules (see Fig. 2): the long-term spatiotemporal attention, short-term attention, patchwise contrastive and imagewise contrastive learning modules. The imagewise contrastive module can be activated and deactivated alternately during the first ten epochs of training to accelerate its convergence. These modules can also be skipped, in which case the network degenerates into the original Actionformer [6].

A. Problem Definition
We define TAL [5], [6] for an input XCA sequence $X = \{x_1, x_2, \ldots, x_T\}$, where $x_i$ is the $i$th frame and $T$ is the sequence length. The desired output is an action list $\hat{Y} = \{\hat{y}_1, \hat{y}_2\}$, where $\hat{y}_i = (n_i, start_i, end_i)$ predicts the action category $n_i \in \{0, 1\}$, start frame number $start_i$ and end frame number $end_i$. When $n_i$ is 0, $\hat{y}_i$ represents the filling action of contrast agents; otherwise, it represents the disappearing action of contrast agents (see Fig. 1). Specifically, the proposed TAL method predicts the two mid-points of the actions and regresses the lengths of the action phases to determine the start and end frames of the filling/disappearing actions. The apex frame is determined by averaging the end frame of the filling action and the start frame of the disappearing action.

B. Preprocessing
We use SVS-Net [15] to extract 3D spatiotemporal features from consecutive frames in sequential segments at the beginning of training and inference. We choose 64 × 64 deep spatiotemporal features as the processed high-level features per segment. Since each segment contains four consecutive frames, the sequential temporal information is compressed into a visual tube to enrich the 3D spatiotemporal information and reduce long-term memory loss in the subsequent CLSTM-based spatiotemporal attention modeling (Section III-C). This is important to take full advantage of CLSTM's capability in modeling long short-term spatiotemporal attention. These preprocessed deep features of segments are called the original input features, which have dimensions of B × H × W × T, with B, H, W and T representing the batch size, the height and width of the image, and the time, respectively.
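To make the tensor bookkeeping concrete, the following is a minimal PyTorch sketch of the segment grouping; the shapes and the stride of 4 follow Sections III-B and IV-C, while the function name and the random tensors standing in for SVS-Net outputs are purely illustrative.

```python
import torch

def segment_sequence(frames: torch.Tensor, seg_len: int = 4, stride: int = 4) -> torch.Tensor:
    """Group an XCA sequence (N, 512, 512) into non-overlapping 4-frame segments.

    Each segment is later encoded by SVS-Net into a 64x64 feature map, so a sequence
    of N frames yields T = N // stride segment-level features of size 64x64.
    """
    n = frames.shape[0] // stride * stride            # drop the incomplete trailing segment
    return frames[:n].reshape(-1, seg_len, *frames.shape[1:])  # (T, 4, 512, 512)

# Hypothetical shapes: SVS-Net maps each (4, 512, 512) segment to a (64, 64) feature,
# and a batch of sequences is stacked into the (B, H, W, T) tensor used downstream.
frames = torch.rand(128, 512, 512)                    # one padded XCA sequence
segments = segment_sequence(frames)                   # (32, 4, 512, 512)
features = torch.rand(2, 64, 64, segments.shape[0])   # (B, H, W, T) original input features
```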

C. Long-Term Attention Module
To highlight long-term spatiotemporal features for modeling the rising and falling evolution trend of vessel changes, a long-term spatiotemporal attention module is built upon CLSTM with two parts, i.e., temporal and spatial attention (see Fig. 2). For the temporal attention, the convolutional-recurrent learning of CLSTM has proven to capture the evolution trend of temporal changes [60]. CLSTM replaces the fully connected layers of LSTM with convolutional layers when calculating the gates from the input $X_t$ and hidden state $h_{t-1}$, so that CLSTM handles spatial data better. Each CLSTM cell $(X_t, c_{t-1}, h_{t-1})$ [61] at time $t$ has the formulation
$$i_t = \sigma(W_{xi} * X_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} * X_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} * X_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o)$$
$$h_t = o_t \circ \tanh(c_t)$$
where $\sigma(\cdot)$ and $\tanh(\cdot)$ are activation functions, $*$ is the convolution and $\circ$ is the Hadamard product. $c_t$ is named the memory cell, which records partial potential spatiotemporal information of past frames at time stamp $t$. It is initialized at the beginning and updated by $c_{t-1}$, $f_t$, $i_t$, $X_t$ and $h_{t-1}$ at each time stamp. $i_t$, $f_t$ and $o_t$ are three gates that control the degree of updating $c_t$, forgetting $c_{t-1}$ and outputting $h_t$. $h_t$ is the output, which is determined by the memory cell $c_t$ and the output gate $o_t$. Important information is saved selectively through the explicit $h_t$ and implicit $c_t$ while processing the whole sequence. We regard the sequence of hidden states produced by CLSTM as the temporal attention (TA) of the input features.

Second, spatial attention is proposed to compensate for the missing spatial attention in Actionformer [6]. Although CLSTM has a stronger spatial modeling capability than LSTM, our experiments have shown that it is still not ideal for modeling long short-term spatiotemporal attention. Therefore, classical CNN-based spatial attention [62] is utilized to further enhance the spatial representation ability of the proposed model. In this work, three groups of convolution and batch normalization are added as the following spatial attention (SA), with the first two groups having a ReLU activation function:
$$SA(X) = BN(Conv((Relu(BN(Conv(\cdot))))^{2}(X)))$$
where $Conv(\cdot)$ is the convolution operation, $BN(\cdot)$ is the batch normalization, $Relu(\cdot)$ is the activation function and $(\cdot)^2$ means two repeated operations. The spatiotemporal attention (STA) module is then defined by cascading the two parts, i.e., applying SA to the CLSTM output. Thus, when $TA(\cdot)$, $SA(\cdot)$ and $STA(\cdot)$ are used respectively, this module outputs the corresponding attention map; the attention maps are shown in Fig. 5(b)-(d). The long-term attention module receives the original input features processed by SVS-Net with B × H × W × T and does not change the feature size, so that the attention map can be used to calculate the Hadamard product with the input features for contrasting the foreground/background as described in Section III-E.
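As an illustration only, the following PyTorch sketch shows one possible realization of the long-term attention module. It assumes single-channel segment features of shape (B, H, W, T), uses a single ConvLSTM cell without peephole terms in place of the three-hidden-layer CLSTM of Section IV-C, and all class names are ours rather than those of the released code.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell [61]: gates are computed by convolution instead of FC layers."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)   # update memory cell
        h = torch.sigmoid(o) * torch.tanh(c)                          # output hidden state
        return h, c

class SpatialAttention(nn.Module):
    """Conv-BN-ReLU x2 followed by Conv-BN, as described in Section III-C."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return self.body(x)

class LongTermSTA(nn.Module):
    """Temporal attention (ConvLSTM over segments) followed by spatial attention."""
    def __init__(self, hid_ch=1):
        super().__init__()
        self.cell = ConvLSTMCell(1, hid_ch)
        self.sa = SpatialAttention(hid_ch)

    def forward(self, x):                      # x: (B, H, W, T) segment features
        B, H, W, T = x.shape
        h = x.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        maps = []
        for t in range(T):                     # temporal attention: recurrent pass
            h, c = self.cell(x[..., t].unsqueeze(1), h, c)
            maps.append(self.sa(h))            # spatial attention refines each state
        a = torch.stack(maps, dim=-1).squeeze(1)   # attention map, same size as input
        return x * a                           # Hadamard product reweights the features

module = LongTermSTA()
out = module(torch.rand(2, 64, 64, 32))        # (2, 64, 64, 32)
```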

D. Short-Term Attention Module
The short-term attention module receives the features processed by the long-term attention module and the patchwise contrastive module with B × H × W × T. Each frame $x_i \in \mathbb{R}^{H \times W}$ ($H \times W = 4096$) of the input sequence $X \in \mathbb{R}^{H \times W \times T}$ is flattened and projected into $C = 512$ dimensions using a convolution $E(\cdot)$ to form $X' = \{E(x_1), E(x_2), \ldots, E(x_T)\}$ with $X' \in \mathbb{R}^{T \times C}$. A Transformer encoder (see the yellow block of Fig. 2) is then used for encoding via layer normalization, self-attention, an MLP and a residual structure. Here, the self-attention mechanism [52] is implemented by projecting $X'$ into three different subspaces $Q$, $K$ and $V$:
$$Q = X'W_Q, \quad K = X'W_K, \quad V = X'W_V$$
where $W_Q, W_K \in \mathbb{R}^{C \times \dot{C}}$ and $W_V \in \mathbb{R}^{C \times \bar{C}}$ are the projection matrices, $\dot{C}$ and $\bar{C}$ are the hidden and output dimensions, and $Q, K \in \mathbb{R}^{T \times \dot{C}}$ and $V \in \mathbb{R}^{T \times \bar{C}}$ are the projection results. In our practice, both $\dot{C}$ and $\bar{C}$ are equal to 128. Generally, self-attention is calculated as
$$Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $K^{T}$ denotes the transpose of $K$, and $QK^{T} \in \mathbb{R}^{T \times T}$ denotes the correlation matrix between frames. The $softmax$ activation function normalizes the correlation coefficients, which are then multiplied by $V$ for weighting. $d_k$ is the dimension of the key [63] in Transformer, which is equal to $\dot{C}$; $\sqrt{d_k}$ is used to avoid large values appearing in the correlation matrix that would cause a small activation function gradient. The multi-head self-attention (MSA) mechanism and the multiscale Transformer [6] are also used in our practice but are omitted from the equations for simplicity.
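A minimal sketch of this windowed self-attention is given below, assuming single-head attention with a window of 4 (Section IV-C) and omitting the multiscale feature pyramid of Actionformer [6]; the function name and the random projection matrices are illustrative.

```python
import math
import torch

def windowed_self_attention(x, w_q, w_k, w_v, window: int = 4):
    """Scaled dot-product self-attention over local temporal windows.

    x: (T, C) flattened segment features; w_q/w_k: (C, C_hid); w_v: (C, C_out).
    """
    out = []
    for s in range(0, x.shape[0], window):
        xw = x[s:s + window]
        q, k, v = xw @ w_q, xw @ w_k, xw @ w_v
        attn = torch.softmax(q @ k.T / math.sqrt(k.shape[-1]), dim=-1)  # (w, w) correlation
        out.append(attn @ v)
    return torch.cat(out, dim=0)   # (T, C_out)

x = torch.rand(128, 512)
w_q, w_k, w_v = torch.rand(512, 128), torch.rand(512, 128), torch.rand(512, 128)
y = windowed_self_attention(x, w_q, w_k, w_v)   # (128, 128)
```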
Due to the high complexity of Transformer described in Section II-C, all of the self-attention operations use local windows to reduce the computational cost, which leads to a lack of long-term dependency. To alleviate this problem, the multiscale Transformer [6] is used to increase the receptive field by downsampling in the temporal domain, which may lose information. Thus, we solely use short-term attention to process the features extracted by the long-term attention module. After encoding, the obtained features are decoded by convolution as in [6]. The dimensions of the output features are B × C × T.

E. Imagewise Contrastive Module
To learn the subtle and contrastive differences between the foreground and background for identifying the start and end frames, we introduce an image-to-patch contrastive learning [59] module (see Fig. 3) that enables the network to better distinguish the foreground from the background in the absence of pixel-level labels. Contrastive learning [59] pulls samples of the same class (foreground/background) together and pushes different classes apart, and is often used for self-supervised or semisupervised learning by constructing positive and negative pairs from unlabeled data. Similar to [64], which used an image-level attention map for contrastive learning, we calculate the Hadamard product of the long-term attention map and the features processed by TA to obtain the vessel features and background features as follows:
$$Foreground_t = X_t \circ AttentionMap_t$$
$$Background_t = X_t \circ (1 - AttentionMap_t)$$
where $X$ is the features processed by TA and $AttentionMap$ is the attention map generated by STA in Section III-C.
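A minimal sketch of this imagewise separation is shown below; taking the background as the complement of the attention map is our assumption consistent with Fig. 3(a), and the function name is illustrative.

```python
import torch

def split_foreground_background(x_t, attn_t):
    """Separate frame features into vessel and background parts with the STA map.

    x_t:    (H, W) features of frame t processed by TA.
    attn_t: (H, W) long-term attention map of frame t, assumed to lie in [0, 1].
    """
    foreground = x_t * attn_t           # Hadamard product keeps attended vessel regions
    background = x_t * (1.0 - attn_t)   # complement keeps the remaining background
    return foreground, background

fg, bg = split_foreground_background(torch.rand(64, 64), torch.rand(64, 64))
```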
According to the annotations labeled by clinicians, we select a time $t$ that must contain foreground vessels. Then, $Foreground_t$ can be regarded as a positive case and $Background_t$ as a negative case (see Fig. 3(a)). $Foreground_t$ and $Background_t$ are processed by a fully connected layer to generate vectors representing their features before contrastive learning is applied. A positive and negative pair has thus been constructed.
A specific batch size of two is designed to generate more pairs for contrastive learning, since each sequence in this setting generates only one positive and negative pair. After obtaining two positive and negative pairs from different sequences, a soft-nearest-neighbors contrastive loss is employed to increase the similarity between cases of the same category (i.e., foreground with foreground, background with background) and reduce the similarity between cases of different categories:
$$ConLoss = -\frac{1}{F}\sum_{i=1}^{F} \log \frac{\sum_{j \in F_i,\, i \neq j} e^{\cos(f_i, f_j)/\tau}}{\sum_{j \in F,\, i \neq j} e^{\cos(f_i, f_j)/\tau}} \qquad (10)$$
where $\cos$ denotes the cosine similarity function, $F$ is the number of sampled foregrounds/backgrounds, $f_i$ denotes the $i$th of the $F$ samples, $F_i$ is the set of sampled foregrounds/backgrounds of the same category as $f_i$, and $\tau$ denotes the temperature parameter. This loss function minimizes the feature gap within the same category (foreground/background), maximizes the feature gap between foreground and background, and forces the attention map to distinguish the foreground/background with the largest difference in the original input features. It is used in the first five odd epochs of training (epoch = 1, 3, 5, 7, 9). The reason why ConLoss is solely activated in the first five odd epochs is based on the following two experimental observations: 1) ConLoss converges quickly, so adding it at the beginning of training is sufficient, whereas activating ConLoss during the whole training can affect the optimization of the main loss defined in Section III-H; 2) alternately activating and deactivating ConLoss, rather than activating it continuously, helps the network explore more potential optima instead of limited suboptimal solutions in the optimization space.
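The following PyTorch sketch implements such a soft-nearest-neighbors loss over the pooled foreground/background vectors; the function name, feature dimension and temperature value are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor_loss(feats, labels, tau: float = 0.1):
    """Soft-nearest-neighbors contrastive loss, following the structure of Eq. (10).

    feats:  (F, D) fully connected projections of sampled foregrounds/backgrounds.
    labels: (F,) 1 for foreground, 0 for background.
    For each sample, the numerator sums similarities to same-class samples and the
    denominator sums similarities to all other samples.
    """
    sim = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1) / tau
    mask_not_self = ~torch.eye(len(feats), dtype=torch.bool)
    same_class = (labels.unsqueeze(0) == labels.unsqueeze(1)) & mask_not_self
    exp_sim = sim.exp()
    num = (exp_sim * same_class).sum(dim=1)
    den = (exp_sim * mask_not_self).sum(dim=1)
    return -(num / den).log().mean()

# Two sequences with batch size 2 give two positive/negative pairs (4 samples).
feats = torch.randn(4, 128)
labels = torch.tensor([1, 0, 1, 0])
loss = soft_nearest_neighbor_loss(feats, labels)
```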

F. Patchwise Contrastive Module
To address the class imbalance and the imperceptible differences between foreground vessels and vessel-like background disturbances, we further compare and contrast the foreground and background samples at the patch scale. Patchwise contrastive learning has recently been studied in a few works [65], [66]. Unfortunately, there is still no feasible method to construct positive and negative pairs for XCA sequences at the patch scale, because it is not known whether a patch contains the small number of foreground vessels, although there are contrast-free XCA frames that solely contain background during the initial phase of XCA imaging.
We exploit random patch projection [67], [68] to design a patchwise contrastive module (see Fig. 3(b)), which takes as input the features processed by the long-term attention module with B × H × W × T. A 5 × 5 background patch is selected randomly from the contrast-free input features at t = 0 to act as a convolution kernel for projecting all input features. This patchwise contrastive learning builds upon the fact that the convolution of two patches is equivalent to the dot-product of two vectors, which reflects their similarity by measuring the length of the projection of one vector onto the other vector's space in terms of the $L_2$-norm. When a contrast-free background patch is used as the convolution kernel, a larger convolution result is obtained if the kernel convolves foreground patches containing a large amount of vessels, because the $L_2$-norm of foreground patches is large, as shown in the green block of Fig. 3(b). On the contrary, if the kernel convolves background patches, as shown in the blue block of Fig. 3(b), the convolution result is obviously small. This patchwise contrastive learning can automatically distinguish the foreground and background by projecting foreground/background patches into different spaces.
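A minimal sketch of the patchwise projection is shown below, assuming single-channel features of shape (B, H, W, T); the function name, the fixed random seed and the 'same' padding are illustrative choices.

```python
import torch
import torch.nn.functional as F

def patchwise_projection(features, patch_size: int = 5, seed: int = 0):
    """Project every frame with a random background patch taken from the first frame.

    features: (B, H, W, T) long-term attention outputs; frame t = 0 is contrast-free,
    so a randomly cropped 5x5 patch from it contains only background. Used as a
    convolution kernel, it yields larger responses on foreground (vessel) patches
    than on background patches, pushing the two into different latent spaces.
    """
    g = torch.Generator().manual_seed(seed)
    B, H, W, T = features.shape
    y = int(torch.randint(0, H - patch_size + 1, (1,), generator=g))
    x = int(torch.randint(0, W - patch_size + 1, (1,), generator=g))
    kernel = features[0, y:y + patch_size, x:x + patch_size, 0].reshape(1, 1, patch_size, patch_size)
    frames = features.permute(0, 3, 1, 2).reshape(B * T, 1, H, W)   # treat frames as a batch
    proj = F.conv2d(frames, kernel, padding=patch_size // 2)        # (B*T, 1, H, W)
    return proj.reshape(B, T, H, W).permute(0, 2, 3, 1)             # back to (B, H, W, T)

projected = patchwise_projection(torch.rand(2, 64, 64, 32))         # same size as the input
```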

G. Action Classification and Boundary Regression
After the short-term attention module, a 1D convolutional layer followed by an activation function is used to map the high-dimensional temporal features to the n dimension to generate the action classification probability $p_t \in \mathbb{R}^{T \times n}$ ($n = 2$), and another layer is used to generate the boundary regression distance $d_t = (d_t^{start}, d_t^{end}) \in \mathbb{R}^{T \times 2}$, which represents the distances between time stamp $t$ and the start/end frames of the action centered on $t$ [6]. The output of the proposed model is then defined as $Y = \{y_1, y_2, \ldots, y_T\} \in \mathbb{R}^{T \times 4}$, where $y_t = (p_t, d_t)$ is the output of the $t$th time stamp. Section III-I describes how to convert the output $Y$ into the action list $\hat{Y}$ defined in Section III-A.
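For illustration, a minimal sketch of the two heads is given below; the kernel size, the sigmoid/ReLU activations and the class name are assumptions, not the exact configuration of [6].

```python
import torch
import torch.nn as nn

class ClsRegHeads(nn.Module):
    """1D convolutional heads that decode temporal features into per-frame outputs.

    The classification head maps C-dimensional features to n = 2 action probabilities,
    and the regression head predicts the distances (d_start, d_end) from each time
    stamp to the boundaries of the action centered on it.
    """
    def __init__(self, c: int = 512, n: int = 2):
        super().__init__()
        self.cls = nn.Sequential(nn.Conv1d(c, n, 3, padding=1), nn.Sigmoid())
        self.reg = nn.Sequential(nn.Conv1d(c, 2, 3, padding=1), nn.ReLU())  # distances >= 0

    def forward(self, feats):                 # feats: (B, C, T) from the short-term module
        p = self.cls(feats).transpose(1, 2)   # (B, T, n) action probabilities
        d = self.reg(feats).transpose(1, 2)   # (B, T, 2) boundary distances
        return torch.cat([p, d], dim=-1)      # (B, T, 4): y_t = (p_t, d_t)

heads = ClsRegHeads()
y = heads(torch.rand(2, 512, 128))            # (2, 128, 4)
```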

H. Training
The loss function has supervised and contrastive losses. First, the supervised loss defined in [6], [69], and [70] is used for training the backbone network; it aggregates a classification loss $L_{cls}$ and a regression loss $L_{reg}$ over all time stamps, where $T$ is the sequence length, $T^{+}$ is the number of positive samples, and $F_t$ denotes whether time stamp $t$ is within an action. $L_{cls}$ is the focal loss [69] for classifying action probabilities with imbalanced data. $L_{reg}$ is the distance intersection-over-union (IoU) loss [70] for distance regression. $L_{cls}$ supervises the network at every time stamp $t$, while $L_{reg}$ is only activated for those $t$ that are within an action. The output $Y \in \mathbb{R}^{T \times 4}$ is therefore used for supervision here. Second, if the imagewise contrastive module is available, then $ConLoss$ defined in Equation (10) is used for the first five odd epochs of training.

I. Postprocessing and Inference
What we want from TAL is the action list $\hat{Y} = \{\hat{y}_1, \hat{y}_2\}$, where $\hat{y}_i = (n_i, start_i, end_i)$, as described in Section III-A. However, the direct output of the proposed model is $Y = \{y_1, y_2, \ldots, y_T\} \in \mathbb{R}^{T \times 4}$, where $y_t = (p_t, d_t)$ is the action classification probability and boundary regression distance described in Section III-G. Therefore, we convert $Y$ into the action list $\hat{Y}$ during inference as
$$n_t = \arg\max p_t, \quad start_t = t - d_t^{start}, \quad end_t = t + d_t^{end}$$
This operation generates the action with the highest probability for each time stamp $t$. Then, Soft-NMS [71] is used to suppress overlapping background actions. In addition, each of the two action categories (filling/disappearing) selects the action instance with the highest probability as the final prediction of the model during inference. We thereby obtain two action localization results $\hat{Y} = \{\hat{y}_1, \hat{y}_2\}$ for a sequence. In detail, $\hat{y}_1$ can be parsed as the mid-point (red solid point) of the filling action and the two corresponding red rays in Fig. 1, and $\hat{y}_2$ can be parsed as the mid-point of the disappearing action and its two corresponding green rays. The three keyframes are then calculated as
$$Start\ Frame = start_1, \quad Apex\ Frame = (end_1 + start_2)/2, \quad End\ Frame = end_2$$
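The decoding of one sequence can be sketched as follows; Soft-NMS is omitted, and the function name and the assumption that the two class probabilities occupy the first two channels are illustrative.

```python
import torch

def decode_keyframes(y):
    """Decode the model output Y into the start, apex and end keyframes of one sequence.

    y: (T, 4) per-frame outputs (p_filling, p_disappearing, d_start, d_end).
    For each action category, the time stamp with the highest probability is kept as
    the action mid-point and its regressed distances give the action boundaries.
    """
    probs, dists = y[:, :2], y[:, 2:]
    t = torch.arange(y.shape[0], dtype=y.dtype)
    actions = []
    for cat in range(2):                       # 0: filling, 1: disappearing
        best = torch.argmax(probs[:, cat])
        start = (t[best] - dists[best, 0]).item()
        end = (t[best] + dists[best, 1]).item()
        actions.append((start, end))
    (start1, end1), (start2, end2) = actions
    apex = (end1 + start2) / 2                 # average of filling end and disappearing start
    return start1, apex, end2

keyframes = decode_keyframes(torch.rand(128, 4))
```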

IV. EXPERIMENTS AND RESULTS

A. Experimental Materials
Two hundred and sixty clinical XCA sequences were collected from Renji Hospital of Shanghai Jiao Tong University. The length of each sequence ranges from 31 to 379 frames. Following the setting of [6], the model can process sequences of different lengths by zero padding. The original dataset has different resolutions, including 512 × 512 and 800 × 800 pixels. Each frame is reshaped to 512 × 512 and processed by SVS-Net [15], and the final resolution of the features is 64 × 64 pixels. Each sequence is annotated by two clinicians with the three keyframe locations and frames per second (FPS), so the clinicians only need to provide simple per-sequence labels. We take the average of the two clinicians' annotations as the final annotation. The dataset is converted to the ActivityNet-1.3 [72] format, which contains the action category, start time and end time of the actions. During training, the three keyframes are converted to temporal labels by center sampling as in [6]. To facilitate comparison with other advanced methods, the dataset is divided randomly into three subsets for training, validation and testing at a ratio of 136:60:64.

B. Evaluation Metrics
Average precision (AP) and mean average precision (mAP) are widely used in TAL [6], [7] and are calculated to evaluate the sequence-level performance of the proposed method, i.e., a whole predicted action is compared with the true action. We define Precision (P), Recall (R) and F-score (F) as
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \cdot P \cdot R}{P + R}$$
where TP (true positives) is the total number of detected actions whose IoU with the ground truth is higher than the IoU threshold, FP (false positives) is the total number of detected actions whose IoU is lower than the threshold, and FN (false negatives) is the total number of undetected actions for which the ground truth shows that there is an action. The IoU threshold is predesigned; when the IoU between a result and the ground truth exceeds the threshold, the TAL result is considered correct, so these metrics evaluate the sequence-level performance. P is the ratio of TP among all results and is used to evaluate prediction accuracy. R is the proportion of correctly detected actions among all actions in the ground truth. F comprehensively considers both P and R and indicates the overall performance [6], [7], [50]. These metrics range from 0 to 1, with higher values indicating better performance. We rank the results according to the confidence score and calculate P and R one by one according to the IoU. Then, a series of P and R values can be obtained and a P-R or P(R) curve can be drawn in the Cartesian coordinate system.
The area under the P-R curve has become a general metric to measure the performance of various detection tasks; it is called AP and formulated as
$$AP = \int_{0}^{1} P(R)\, dR$$
The average area under the P-R curves at different IoU thresholds is the mAP. AP and mAP mainly evaluate the performance of sequence-level detection.
We also check the frame-level performance from two aspects. First, P, R and F are used to evaluate whether each frame is detected as the correct category, i.e., whether the predicted category of each frame matches the ground truth. Furthermore, to evaluate the keyframe localization ability, we define the average deviation (AD) as
$$AD = \frac{1}{3|I|}\sum_{i \in I}\left(|P_s^i - L_s^i| + |P_a^i - L_a^i| + |P_e^i - L_e^i|\right)$$
where $P_s^i$ is the predicted start frame number of the $i$th sample, $P_a^i$ and $P_e^i$ are the predicted apex and end frame numbers, respectively, $L_s^i$, $L_a^i$ and $L_e^i$ are the target keyframes, and $I$ is the set of samples. This metric evaluates the deviation between the predicted keyframes and the targets; the smaller the AD value and thus the deviation, the better the performance of the proposed model.
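Under the AD formulation above (averaging absolute deviations over the three keyframes and all samples, which is our reconstruction of the formula), the metric can be computed as in the following sketch; the function name and the toy inputs are illustrative.

```python
import numpy as np

def average_deviation(pred, target):
    """Average deviation (AD) between predicted and target keyframes.

    pred, target: arrays of shape (num_samples, 3) holding the start, apex and end
    frame numbers of each sample.
    """
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return np.abs(pred - target).mean()   # mean over the 3 keyframes and all samples

ad = average_deviation([[3, 20, 41], [5, 18, 36]], [[2, 21, 40], [9, 17, 38]])
```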

C. Experimental Settings
We feed 4 consecutive frames into SVS-Net with a sliding window of stride 4, extract 64 × 64 features at the encoding stage and flatten them into 4096 dimensions. The number of action categories is set to 2. All input sequences are padded to a length of 128. The window size of the Transformer self-attention is set to 4. The long-term spatiotemporal attention module uses convolutional kernels of size 3 with stride 1 and padding 1 for the first two convolutional layers in spatial attention, and a kernel of size 1 with stride 1 and padding 0 for the last layer. The temporal attention uses a standard CLSTM with three hidden layers that have 8, 8 and 1 output dimensions, and its kernel size is set to 3. The other settings follow [6]: the initial learning rate is 1e-4 with cosine learning rate decay, the batch size is 2, and a weight decay of 1e-4 is used. The model is evaluated after 50 epochs of training. AP@[0.3:0.1:0.7] is used to evaluate the mAP of our model. The code is implemented in PyTorch and trained on an NVIDIA GeForce RTX 3090.

D. Comparison Methods
To evaluate the performance of our algorithm, we select several SOTA Transformer-based TAL methods for comparison, including AFSD [7], TALLFormer [50], E2E-TAD [51] and Actionformer [6]. These algorithms are trained with their default settings on our dataset. Because the open-source codes use different data formats, we converted our data to the corresponding formats to train them.

E. Result Analysis
TABLE II and Fig. 4 summarize the experimental results. Our method achieves an mAP of 72.45%, with an AP of 98.44% at IoU = 0.3 and an AP of 36.10% at IoU = 0.7. It clearly outperforms the best competitor, Actionformer [6], improving AP by 7.59% at IoU = 0.3 and by 3.48% at IoU = 0.7, and it is the first to exceed 70% mAP. We believe that these results stem from the excellent modeling capability of the proposed method, which is further evidenced by the poor performance of the other SOTA models. It is worth noting that the poor performance of the other models on our dataset also shows that our XCA dataset is a very difficult dataset for TAL.
The keyframe localization is visualized in Fig. 4. The colored rectangles in the first row represent the target keyframes, and those in the second row represent the predicted keyframes. Predictions close to the targets are successfully achieved on some samples. However, for a few samples with obvious boundary-agnostic characteristics, there is a slightly larger deviation between the target and the prediction.

F. Statistical Analysis
We report the one-sample t-tests on the baseline and the proposed method in TABLE III. Specifically, we calculated the absolute values of the differences between the keyframes predicted by the baseline/proposed methods and the targets as the deviation list (DL) in Equation (17). A one-sample, one-sided t-test is then performed on DL against different Popmean values (expected population means). The null hypothesis is that the mean of DL is greater than Popmean, and the alternative hypothesis is that the mean of DL is less than Popmean. The results show that we should reject the null hypothesis for the baseline when Popmean = 6.2, i.e., P-value (P-v) = 0.03 ≤ 0.05 and T-value (T-v) = −1.83, and reject the null hypothesis for the proposed method when Popmean = 5.4, i.e., P-v = 0.03 ≤ 0.05 and T-v = −1.92. This indicates that the proposed method clearly has less deviation than the baseline. Furthermore, the 95% confidence intervals of the baseline and the proposed method are 0 to 6.13 and 0 to 5.30, respectively.
The statistical significance measured by a paired t-test (two-sided) between the proposed and baseline [6] methods gives T-v = −2.75 and P-v = 0.0065 ≤ 0.05, which means that the proposed method yields significantly less deviation than the baseline does.
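The reported tests can be reproduced with SciPy as sketched below; the deviation lists here are random placeholders rather than the values measured on our test set.

```python
import numpy as np
from scipy import stats

# Placeholder deviation lists (absolute frame deviations per keyframe and sample).
dl_baseline = np.abs(np.random.default_rng(0).normal(5.5, 3.0, 192))
dl_proposed = np.abs(np.random.default_rng(1).normal(4.7, 3.0, 192))

# One-sample, one-sided t-tests: H0 is that the mean deviation is >= popmean,
# H1 that it is smaller (alternative='less').
t_base, p_base = stats.ttest_1samp(dl_baseline, popmean=6.2, alternative='less')
t_prop, p_prop = stats.ttest_1samp(dl_proposed, popmean=5.4, alternative='less')

# Paired, two-sided t-test between the per-sample deviations of the two methods.
t_pair, p_pair = stats.ttest_rel(dl_proposed, dl_baseline)
```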

G. Ablation Experiments
The short-term attention (S-T) module [6] is used as the baseline in this experiment. SA and TA in the long-term (L-T) module and the S-T module are treated as three parts for the ablation study reported in TABLE IV. The best result is shown in bold, and the second best result is underlined. The results show that the proposed parts improve the original Actionformer [6] and reinforce each other in different combinations; the TAL performance is best when all of the modules are used. When the short-term module is not used, the performance drops sharply, which indicates that adopting the action Transformer is essential in the proposed method. In addition, the long-term spatiotemporal attention module further enhances the TAL performance on XCA sequences.
TABLE IV also shows the effectiveness of the proposed method in terms of P, R, F and AD. Note that some results that do not meet the definitions of P, R and F are treated specially. For example, the apex frame is calculated as the average of the end frame of the filling action and the start frame of the disappearing action, so the predicted start frame may be later than the apex frame, which could affect the test metrics; solving this problem more rigorously is a direction for future work. The proposed method achieves the best frame-level performance in terms of all four metrics. TABLE V reports the ablation study on the imagewise contrastive (ICon), patchwise contrastive (PCon) and image-to-patch contrastive (IPCon) modules. The proposed model achieves the best performance in terms of most metrics. In particular, it achieves the highest P, R and F values and the lowest AD of 4.71, which means that the proposed method has a 4.71-frame distance between the prediction and the target on average, shorter than the average 5.46-frame distance of the most advanced method [6].
Note that due to the boundary-agnostic nature of XCA sequences, it is more difficult to optimize this metric to a small standard deviation.

H. Experiments on Feature Extraction

SVS-Net [15] is used for feature extraction in this work instead of the I3D [73] used in Actionformer [6]. To verify this choice, we used both methods to extract features and conducted experiments; the results are shown in TABLE VI. The baseline model is selected for this experiment because the I3D method destroys the spatial dimensions of the features. The results show that SVS-Net achieves better feature extraction performance than I3D in our scenario.

We use SVS-Net [15] with stride = 4 for feature extraction in this paper, which means that there is no overlap between neighboring features and the sequence length decreases. To investigate the influence of the feature stride, we decrease it to introduce some overlap between neighboring features. The results in TABLE VI show that the overlapping strategy does not perform better, because it leads to a large amount of redundant computation and overlapping interference between neighboring features. In addition, the training time is further extended when smaller strides are used.

I. Experiments With Different Random Patches
A low-rank background patch is selected randomly as the convolutional kernel in the patchwise contrastive module. We randomly select the convolutional kernel only in the frame at t = 0 and use this kernel for all frames. Because no contrast agent has been injected at this stage, the kernel solely contains background features, so different selections do not influence performance significantly. To verify this, we added an experiment reported in TABLE VII: for the same trained model, we select different patches at t = 0 as the convolution kernel and test the performance. The results show that different selections hardly influence the results.

TABLE VIII PERFORMANCE OF STUDY ON DIFFERENT PROBLEM SETTINGS

J. Experiments With Different Problem Settings
We define the TAL task as the localization of the filling/disappearing actions. To prove the rationality of this setting, it is compared with two other problem settings that locate the whole stage together with the filling or disappearing stage. The first setting, called setting-1, localizes the filling and whole actions, while setting-2 localizes the disappearing and whole actions. The postprocessing of setting-1 computes the end frame as $End\ Frame = end_2$, while the postprocessing of setting-2 computes it as $End\ Frame = (end_1 + end_2)/2$. We conducted experiments under these two settings, as shown in TABLE VIII. The results show that our original setting is optimal. The worse results of setting-1 and setting-2 could be related to the overlap between the whole stage and the filling/disappearing stage. We find that Actionformer-based methods do not handle overlapping actions well, because they regress boundaries with a B × 2 × T tensor for all actions on the temporal axis at once rather than regressing the boundary of each action separately. Overlapping actions mean that two different distances need to be regressed for the same time period, which makes it more difficult for the network to accurately regress action boundaries. Setting-2 is worse than setting-1 because the disappearing action, which overlaps the whole action, is longer than the filling action (see Fig. 4).
To handle the overlapping actions, our method adapts the Actionformer-based architecture with a simple multi-boundary regression (MBR). Specifically, an action classification and boundary regression module as in Fig. 2 is generated for each type of action to predict its localization results. The results are still not as good as our original setting. We believe that locating the whole action requires the network to focus on two different trends (filling and disappearing) over a long period of time, which is more difficult than locating a single action with one monotonically increasing or decreasing trend (filling or disappearing).

K. Visual Evaluation With Attention Mechanism
To improve the explainability of the proposed modules, attention maps are used to show their behavior in Fig. 5; the method of generating the attention maps can be found in Fig. 2. Fig. 5(a) shows the original images. The attention maps in Fig. 5(b)-(g) have artifacts due to the frame compression effect of SVS-Net [15], but this does not affect the judgement of the model. Although CLSTM also has the ability of spatial modeling, what it learns is the general area of the vessel, as shown in Fig. 5(b). When spatial attention is used, the spatial structure of the vessel is distinguished from the complex background by weak differences, as shown in Fig. 5(c). Fig. 5(d) shows the results generated by the long-term spatiotemporal attention module. In this case, the network can distinguish the vessel structures from the background more clearly by learning the spatiotemporal characteristics of moving regional features. However, noise can still be clearly seen in the background. The contrastive modules effectively alleviate this issue in Figs. 5(e)-(g). The final results (Fig. 5(g)) indicate that our method can learn the vessel structures clearly.

V. CONCLUSION
We proposed a novel long short-term spatiotemporal attention network with image-to-patch contrastive modules for locating keyframes in challenging XCA sequences, and collected an XCA sequence dataset for evaluation. Comparison and ablation experiments have proven that the proposed method strongly outperforms SOTA methods. The proposed method can be applied to flow-like scenarios in spatiotemporal monitoring networks. For example, [74] built a model to estimate crowd traffic in public places, where overcrowding and stampedes may occur as crowds gather; the action of crowd gathering could be monitored by the strategies of our method to mitigate and prevent risk without estimating the crowd traffic directly. Another example is traffic inflow and outflow prediction as in [75]; the proposed method could potentially locate the moments when inflows and outflows significantly increase or decrease to reduce traffic congestion and accidents.

Fig. 3. The architectures of the imagewise contrastive module and the patchwise contrastive module. (a) Imagewise contrastive module: it uses the attention map generated by the long-term attention module to separate the foreground vessels from the background. (b) Patchwise contrastive module: a random patch from the contrast-free background images is used as a convolutional kernel to project foreground and background patches into different spaces, enhancing the contrast between foreground and background.

Fig. 4. Comparison of the targets and the keyframe localization results predicted by the proposed model. The first row shows the target frames, and the second row shows the predicted frames.

TABLE I ACRONYMS

TABLE II PERFORMANCE OF DIFFERENT SOTA TAL METHODS IN TERMS OF AP AND MAP VALUES

TABLE III STATISTICAL ANALYSIS OF ONE-SAMPLE T-TEST

TABLE IV PERFORMANCE OF ABLATION STUDY ON SPATIOTEMPORAL ATTENTION MODULE