Spatio-Temporal Contrastive Learning Enhanced GNNs for Session-based Recommendation

Session-based recommendation (SBR) systems aim to predict the next item from a user's short-term behavior sequence, without access to a detailed user profile. Most recent works model user preference by treating sessions as between-item transition graphs and use various graph neural networks (GNNs) to encode the pair-wise relations among items and their neighbors. Some existing GNN-based models mainly aggregate information from the view of the spatial graph structure, which ignores the temporal relations among the neighbors of an item during message passing; the resulting information loss leads to sub-optimal performance. Other works address this challenge by incorporating additional temporal information but lack sufficient interaction between the spatial and temporal patterns. To address this issue, inspired by the uniformity and alignment properties of contrastive learning techniques, we propose a novel framework called Session-based Recommendation with Spatio-Temporal Contrastive Learning Enhanced GNNs (RESTC). The idea is to supplement the GNN-based main supervised recommendation task with the temporal representation via an auxiliary cross-view contrastive learning mechanism. Furthermore, a novel global collaborative filtering graph (CFG) embedding is leveraged to enhance the spatial view in the main task. Extensive experiments demonstrate that RESTC significantly outperforms state-of-the-art baselines, e.g., with gains of up to 27.08% on HR@20 and 20.10% on MRR@20.


INTRODUCTION
Recommendation systems have been an efficient tool for helping users make informed choices according to their available profiles and the preferences reflected in their long-term interaction history, and they are widely used in web search and various streaming media [16,59,78]. However, traditional recommenders may perform poorly when the user's interactions within a narrow period are inadequate or the user is not logged in. Thus, Session-based Recommendation (SBR) has attracted increasing research attention [5,26,66,71], since it characterizes users' short-term preference from the limited interactions in the current session, e.g., a basket of products purchased in one transaction visit, and then predicts the products that a user will interact with in the future.
Recently, most existing SBR methods [3,35,43,66,71] construct a graph structure from the session and leverage Graph Neural Networks (GNNs) to aggregate information between adjacent items and capture complex high-order relations, achieving effective performance. However, temporal information, a vital signal for capturing the evolution of the user's preference along the temporal dimension [9,23], is omitted by the above-mentioned GNN-based methods because aggregation during message passing in the graph structure is permutation-invariant. Fig. 1 shows a concrete example of the impact of this temporal information loss. Suppose two sessions lead to different next items but are encoded as the same graph representation, since the aggregation function of the GNN cannot distinguish the temporal order of an item's neighbors. In that case, a GNN-based model lacking the essential temporal pattern will produce incorrect results, which limits its capacity.
Additionally, to better understand the importance of temporal information in session graphs, we randomly sample some representative sessions from the public dataset Diginetica. As shown in Fig. 2, to process sessions using GNNs, the original sessions first need to be converted into graphs. However, from (A), (B) and (C) of Fig. 2, we notice that it is difficult to accurately reconstruct the original session if the session graph contains directed cycles. More specifically, directed cycles are notable examples of samples with in-degree exceeding 1. We count the session graphs in the Diginetica dataset whose in-degree is greater than 1: such sequences account for 23.85% of all samples. These phenomena also indicate the necessity of incorporating temporal patterns.
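The ambiguity described above can be illustrated with a minimal sketch. The two toy sessions and the helper names `session_to_graph` and `max_in_degree` are hypothetical and not taken from the paper or from Diginetica; the sketch assumes a session graph is the weighted set of directed transitions between consecutive items.

```python
from collections import Counter


def session_to_graph(session):
    """Return the weighted directed edges (consecutive item pairs) of a session."""
    return Counter(zip(session, session[1:]))


def max_in_degree(session):
    """Largest number of distinct predecessors of any item in the session graph."""
    predecessors = {}
    for src, dst in set(zip(session, session[1:])):
        predecessors.setdefault(dst, set()).add(src)
    return max((len(p) for p in predecessors.values()), default=0)


# Two different click orders (toy data) ...
session_a = [1, 2, 3, 2, 4, 2]
session_b = [1, 2, 4, 2, 3, 2]

# ... that collapse into exactly the same weighted session graph.
same_graph = session_to_graph(session_a) == session_to_graph(session_b)
print(same_graph)
print(max_in_degree(session_a))
```

Because both sessions map to the same graph, a purely graph-based encoder receives identical input for them, so the original order cannot be recovered from the graph alone.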
Fortunately, some works have attempted to incorporate temporal information by modeling a session as dynamic sub-session graphs at fixed-length time intervals [38,79] or by integrating timestamp information as a contextual dimension [47]. However, modeling multiple sub-session graphs along the timeline may introduce redundant spatial structure information, and it still misses temporal order during aggregation. Another line of work directly treats each session as a sequence of items with relative order or position information and utilizes Recurrent Neural Networks (RNNs) [17,18,50] or memory networks [33] to learn the sequential signal in a session and capture users' preference. But modeling sequences with RNN-based models is arguably insufficient to obtain accurate user representations in sessions and neglects complex transition patterns of items [66]. Besides, all of these methods lack sufficient interaction between spatial structures and temporal patterns in the latent space, which restricts the representation capability of the models.
Therefore, incorporating the temporal pattern and then modeling the latent mutual representation of the spatial and temporal views of a session is crucial and challenging for session-based recommendation systems. To align the embeddings of the two views in a unified latent space, (i) one straightforward way is to directly adopt concatenation or cross-attention based methods [27,76] to fuse the two information sources after the encoding phase. But in this way each view knows little about the other, since there is no effective interaction between the two encoders during training; (ii) another approach is to utilize GANs [29,30,32] to learn the joint distribution of multi-style views, or to leverage semi-supervised learning paradigms like Co-training [1] so that each view acquires complementary information from the other. However, the min-max objective of GAN-style methods is unstable to optimize. Besides, both GAN and Co-training mechanisms face the mode collapse problem [39] while learning the latent representations of different views during training.
Due to the issues mentioned above, and inspired by the uniformity property [41] and the theoretical guarantee of semantic representation alignment in latent space [46] for contrastive learning, we propose a novel auxiliary spatio-temporal contrastive learning framework named RESTC. RESTC aligns the spatial and temporal semantic representations in a projected feature space to conserve as much mutual information between the two views as possible. Existing contrastive learning techniques for sequential [6,34,69] or GNN-based recommendation [20,57,65] generally generate positive samples using item-level augmentation, e.g., item cropping, masking, and reordering in sequence data or sub-sampling in graph data. These augmentations are not suitable for the SBR task, since they induce semantically inconsistent samples and damage the completeness of temporal patterns. Specifically, sequential augmentation (e.g., cropping, masking, reordering) may compromise the completeness of users' original intent and their evolving preference [11] related to temporal order, which may introduce more bias during training. Therefore, capturing the complete temporal pattern is essential for modelling the user's interest, and it also avoids the additional noise caused by damaging the temporal order. Furthermore, since the average number of nodes in session-based graphs shown in Table 1 is relatively small compared with user-item interaction graphs [65], graph augmentation (e.g., sub-sampling) struggles to sample useful sub-structural information and also disrupts the completeness of spatial information, which may likewise introduce bias during sub-graph contrastive learning.
Different from the above works, we comprehensively consider two views at the session level and adopt a spatial encoder for the graph structure representation and a temporal encoder to supply the temporal representation as the informative positive sample. It is worth noting that RESTC is model-agnostic and can be applied to any GNN-based model. Here we employ the powerful Multi-relational Graph Attention Network (MGAT), refined from GAT [52], as the spatial encoder. We further derive a well-designed Session Transformer (SESTrans) augmented with a temporal enhanced module as the temporal encoder. For the contrastive objective, we propose a mixed noise negative sampling strategy, different from [2], to further enhance model performance. With the contrastive learning loss, we enhance the cross-view interactions in the latent space to refine the session representation by maximizing the agreement of positive pairs. Furthermore, due to the data sparsity of short-term session data, a Collaborative Filtering Graph (CFG), derived from all sessions as a global weighted item transition graph, is leveraged to enhance the spatial view with the collaborative filtering embedding in the main supervised task. Our main contributions are summarized as follows:
• We highlight the significance of incorporating temporal information for the GNN-based SBR task, facilitating the development of cross-view interactions between the spatial and temporal patterns.
• To the best of our knowledge, the proposed spatio-temporal contrastive learning framework RESTC is the first work aiming to align and refine the representations of spatial and temporal views in the latent space for the SBR task, and it can be effectively plugged into many existing GNN-based models.

Sequence-based Models in SBR
In early research, FPMC [45] utilized Markov chains and matrix factorization to capture the sequential pattern of a session. Recently, neural network-based models have demonstrated effectiveness in exploiting sequential data in SBR tasks. GRU4Rec [18] was the first RNN-based model, which captured item transitions with multi-layer GRUs. NARM [26] combined an attention-based method with an RNN to better model complex item relations. STAMP [33] used an attention-based memory network to capture the user's current interest. Inspired by the Transformer architecture, SASRec [22] stacked several self-attention layers to model the item-transition sequence. BERT4Rec [48] employed deep bidirectional self-attention to model user behaviors for sequential recommendation. Besides, Yuan et al. [74] proposed a dual sparse attention network to explore the current user's interest via an adaptively learnable target embedding. These attention-based models deal separately with the user's last item and the whole current session, thus capturing the user's general and recent interest.

GNN-based Models in SBR
Most recent works focus on utilizing Graph Neural Networks (GNNs) to extract the relationships within a session, and they have shown better results than sequence-based models [43,66,71]. For instance, SR-GNN [66] used a gated GNN model to obtain item embeddings over an item graph and predicted the next item using an attention mechanism. GC-SAN [71] utilized self-attention networks to aggregate the information of session graphs. FGNN [43] leveraged multi-head attention to aggregate the neighbor items' embeddings in a weighted item-transition graph. LESSR [3] preserved session order based on GRU and shortcut graph attention to solve the lossy session encoding and ineffective long-range dependency capturing problems. Zhou et al. and Pan et al. [38,79] constructed a sequence of dynamic graph snapshots at timestamps to model the preference evolution. GCE-GNN [63] proposed to exploit a session-graph convolution and a global neighbor graph convolution to produce a more accurate session embedding. GCARM [37] considered the dynamic correlations between the local and global neighbors of each node during information propagation. G$^3$-SR [7] proposed global graph guided SBR by leveraging an unsupervised pre-training process to extract global item-to-item relational information. However, traditional GNN-based models lack the temporal information needed to capture users' evolving preferences.

Temporal Augmented Models in RS
Temporal information can model users' dynamic preferences over time and plays an essential role in recommendation systems. Some previous works have incorporated temporal patterns into GNNs in other recommendation settings [24,47,62,72,77]. For instance, JODIE [24] designed a coupled recurrent neural network model that learns users' embedding trajectories and estimates the user's embedding at any time in the future. TGAT [72] proposed the temporal graph attention layer to aggregate temporal-topological neighbourhood features and learn the time-feature interaction efficiently. TGN-MetA [47] utilized memory-tower augmentation to process augmented graphs of different magnitudes on separate levels to optimize Temporal GCNs. DGSR [77] proposed to explore the interactive behaviour of users and items with time and order information by leveraging a dynamic graph neural network. However, most of those works focus on user-item interaction graphs. For SBR, TMI-GNN [47] proposed to use temporal information to guide the multi-interest network to focus on multi-interest mining. TASRec [79] incorporates temporal information by constructing a sequence of dynamic graph snapshots at different timestamps. GNG-ODE [11] proposes a graph-nested GRU ordinary differential equation to encode both temporal and structural patterns into continuous-time dynamic embeddings. Orthogonal to these dynamic GNN-based models, we incorporate temporal information via a contrastive learning strategy.

Contrastive Learning in RS
Recently, in the CV and NLP [53,55,61,70] areas, multiple contrastive learning [2,4,10,31,54,60] methods have demonstrated superior performance in representation modelling by measuring the similarity between different views of unlabeled raw data. This self-supervised mechanism is widely adopted in recommendation systems because it carries good semantic or structural meaning and benefits downstream tasks. For instance, GCC [40] proposed sub-graph instance discrimination, which utilized contrastive learning to learn intrinsic and transferable structural representations. Yao et al. [73] proposed multi-task contrastive learning for a two-tower model. Besides, S$^3$-Rec [80] used mutual information maximization to explore the correlation among items, attributes, and contexts. Recently, Wei et al. proposed CLCRec [64], which leverages contrastive learning to learn the mutual dependencies between item content and collaborative signals in order to solve the cold start problem. Wu et al. [65] generated multiple views of the same node from a graph and employed contrastive learning to maximize their agreement and mine hard negative samples. DuoRec [42] designed a contrastive regularization to reshape the distribution of sequence representations. In the SBR task, Li et al. [25] made use of a global-level contrastive learning model to solve noise and sampling problems in heterogeneous graphs. S$^2$-DHCN [68] is the work most relevant to ours; it designs a contrastive learning mechanism to enhance hyper-graph modelling via another line-graph GCN model. COTREC [67] augments the session-based graph with two views that exhibit sessions' internal and external connectivity by contrastive learning. CORE [19] unifies the representation space of session embeddings using a contrastive-based representation-consistent encoding strategy. CGL [36] used self-supervised learning and main supervised learning to explore the correlations of different sessions for enhancing the item representations. But these methods ignore the temporal pattern in the spatial structure, leading to information loss. Orthogonal to these methods, our RESTC employs spatio-temporal contrastive learning to supply sufficient interactions between spatial structures and temporal patterns by aligning the two views in the latent space.
OVERALL FRAMEWORK

Problem Definition
Suppose that the item set is $V = \{v_1, v_2, \ldots, v_N\}$, where $v_i$ indicates the $i$-th item and $N$ is the number of item categories. Given an ongoing session denoted as $s = [v_1^s, v_2^s, \ldots, v_n^s]$, where $v_i^s$ represents the $i$-th historical interactive item of the user within session $s$ and $n$ is the length of the session, the task is to predict the item $v_{n+1}^s$ that the user will interact with at the next timestamp. Generally, the goal of session-based recommendation is to recommend the top-$K$ ranked items ($1 \le K \le N$) that have the highest probability of being clicked or purchased by the user...

SPATIO-TEMPORAL CONTRASTIVE LEARNING
In this section, we augment a session into two views of embeddings, produced by a temporal encoder (Sec. 4.2) and a spatial encoder (Sec. 4.1), respectively. To align the output embeddings of the two encoders and let them interact in the latent space, we design a contrastive learning task and introduce it in Sec. 4.3.

Temporal Encoder for Session Sequences
We present how to model session data as sequences from a temporal view, corresponding to the temporal part of Fig. 4.

Session Sequence Construction.
Given a session $s = [v_1^s, v_2^s, v_3^s, \ldots, v_n^s]$, an embedding layer maps all items in the session to a sequence of item embeddings of length $m$, where $m$ denotes the max length of all sessions. Zero vectors are padded after the sequence when the length of a session $s$ is shorter than $m$. To help aggregate the item embeddings into a fused session representation as a temporal pattern, we additionally add the special item [CLS] at the end of the session sequence to learn global attention information, similar to the BERT encoder [8,56]. To encode temporal information, we equip the initial item embeddings with learnable absolute temporal position embeddings (denoted as $P_t \in \mathbb{R}^{m \times d}$), which yields the encoder input $X' \in \mathbb{R}^{(m+1) \times 2d}$.

Session Transformer Layers for SEs.
To obtain the preliminary temporal embedding of a session $s$, we leverage the Session Transformer (SESTrans), which follows the standard Transformer encoder [51] and employs the weight matrices $W^Q$, $W^K$, $W^V$ to linearly transform the input $X' \in \mathbb{R}^{(m+1) \times 2d}$ into query, key, and value vectors, denoted as Q, K, V, to which scaled dot-product attention is applied. Intuitively, the attention module aggregates low-level item representations into high-level item representations via a linear combination. We also implement SESTrans in a multi-head fashion as in [51]. Since the self-attention network (SAN) is linear in its input, we feed the output of SESTrans to a feed-forward network (FFN) with a non-linear activation, where $W_1, W_2 \in \mathbb{R}^{2d \times 2d}$ and $b_1, b_2 \in \mathbb{R}^{2d}$ are trainable parameters of the FFN layers. Besides, we stack several encoder layers to learn more complicated session representations from the temporal view, accompanied by standard residual connections, dropout, and layer normalization.
After that, we obtain the encoder's output embedding X.
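For reference, the two operations named above can be written in their standard Transformer form [51]. This is a sketch under stated assumptions rather than the paper's exact equations: the scaling factor $\sqrt{d_k}$ (with $d_k$ the per-head key dimension) and the ReLU activation follow the original Transformer and may differ from the choices made in SESTrans.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
\mathrm{FFN}(x) &= \max(0,\; x W_1 + b_1)\, W_2 + b_2 .
\end{align}
```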

Temporal Enhanced Module.
To better aggregate item embeddings from the encoder layers and obtain the user's evolving preference with respect to the timeline, we develop a novel temporal enhanced module. In particular, we use the embedding of the special item [CLS] in the output embeddings X as the query vector $Q'$, and the rest of the output embeddings X as the key vector $K'$. Note that $Q'$ is the global preference representation and $K'$ is the preference evolution representation. Besides, we leverage the initial embedding $X'$ as our value vector $V'$, since it contains the original temporal positional encoding information, which can benefit our output embedding with the temporal pattern. Then, we add the two representations and apply a non-linear transformation with ReLU activation to obtain the combined vector $t$. Finally, a softmax function is used to calculate the attentive relations and obtain the aggregated vector $h_t$. At this point, we have the aggregated vector $h_t$ and the global preference vector from the embedding of the special token [CLS], denoted as $x_c$. We then concatenate the two vectors and pass them to a feed-forward layer. Finally, dropout and L2 normalization are applied after the FFN layer to obtain the temporal view embedding.

Spatial Encoder for Session Graphs
This subsection describes the session graph construction and its learning process, illustrated in the local spatial part of Fig. 4. GAT [52] and Multi-relational GCN [21] have shown their powerful capability in learning graph structure and multiple types of edge relations, respectively. We further extend them to our multi-relational weighted graph and denote the model as MGAT. The input to our encoder layer is a set of item features after the embedding layer, one for each unique item in the current session (so their number is no greater than the session length), each of hidden size $d$. We define the relation embeddings of in-relation, out-relation, bi-direction, and self-loop as $r_{in}$, $r_{out}$, $r_{bi}$, and $r_{self}$, respectively. We denote by $r_{ij}$ the general relation embedding between $v_i$ and $v_j$, which is determined by the specific relation between the two items, i.e., one of the four relations. The attention scores among these items are calculated from $e_{ij}$, the relational similarity between item $v_i$ and its neighbor $v_j$ obtained by an element-wise product and a relational inner product, which yields the attention score $\alpha_{ij}$.
It is worth noting that our MGAT differs from [28,52,63]: we employ a multi-head attention mechanism that incorporates all edge relations, instead of a single-head latent space, to enhance the representation ability for the spatial structure. To be specific, each head computes one kind of relation among items and their neighbors, and the embeddings of the multi-head attention are then added rather than concatenated. Here $R = 4$ denotes the four relations mentioned above, and $\alpha_{ij}^{(r)}$ is the normalized attention coefficient of item $v_i$ and its neighbor $v_j$ in the $r$-th relation head. Then, we get the attention-aware representation $H = [h_1, h_2, h_3, \ldots, h_n]$ of a specific session based on the initial item order of the session, where $n$ is the number of items in the current session.
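The four relation types used by MGAT can be illustrated with a small sketch. The function name `relation_type`, the toy session, and the direction convention (relations are named from the point of view of item `i`) are assumptions made here for illustration; the paper does not spell them out in this section.

```python
def relation_type(edges, i, j):
    """Classify the relation between item i and item j, seen from item i.

    Returns "self", "bi", "out", "in", or None when the items are not connected.
    """
    if i == j:
        return "self"
    forward = (i, j) in edges    # transition i -> j exists
    backward = (j, i) in edges   # transition j -> i exists
    if forward and backward:
        return "bi"
    if forward:
        return "out"
    if backward:
        return "in"
    return None


# Toy session: edges are the directed transitions between consecutive items.
session = [1, 2, 3, 2, 4]
edges = set(zip(session, session[1:]))

items = sorted(set(session))
relations = {
    (i, j): relation_type(edges, i, j)
    for i in items
    for j in items
    if relation_type(edges, i, j) is not None
}
print(relations[(2, 3)], relations[(1, 2)], relations[(4, 2)])
```

Each relation label would then select one of the relation embeddings $r_{in}$, $r_{out}$, $r_{bi}$, $r_{self}$ when computing attention for that pair.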

Local Spatial Aggregation.
To emphasize the recent preference within the current session, we concatenate the H representations with a learnable position embedding. Besides, the session information can also be represented by the average of its item embeddings in general. Thus, we take both into consideration: $\check{H}$ is the position-sensitive session embedding, $H_s$ is the average embedding of the general session, and a soft-attention score indicates the importance of each item. Finally, the spatial view embedding of a session $s$ is calculated by combining the item embeddings with their corresponding importance scores.

Contrastive Loss function
One of the key properties of contrastive learning is to align features from positive pairs [60]. Such positive pairs could be (i) a data sample with two augmentation tricks applied before being fed into an encoder [2,4], (ii) a data sample passed twice through an encoder with different dropout noise [10], or (iii) a data sample passed through two different encoders [15]. Inspired by [15], which constructs the contrastive samples with two different encoders, we utilize contrastive learning to align the augmented representations from the spatial and temporal encoders in the latent space and maximize the lower bound of the mutual information of the two views.
To achieve this target, we design a spatio-temporal contrastive loss function to distinguish whether two representations are derived from the same session. Specifically, the contrastive loss learns to minimize the difference between the augmented spatial and temporal views of the same session and to maximize the difference between the two augmented views derived from different sessions. Technically, considering a mini-batch of $B$ sessions $s_1, s_2, \ldots, s_i, \ldots, s_B$, we get the output embeddings from the spatial encoder (see Eq. 13) and the temporal encoder (see Eq. 6), denoted as $G(s_i)$ and $T(s_i)$ for each session, respectively, and we treat $(G(s_i), T(s_i))$ as the positive pair. For the negative samples, we propose a mixed noise negative sampling strategy that applies a column-wise shuffling operator to each $T(s_i)$ in the batch to produce noisy temporal samples, combines them with all $T(s_i)$ to obtain a negative candidate pool of size $2B$, and then randomly samples $M$ negative examples, denoted as $C^-$, from the pool. Formally, inspired by SimCLR [2], we adopt InfoNCE [13] as the contrastive loss, in which $\mathrm{sim}(x, y) = \frac{x^{\top} y}{\|x\| \|y\|}$ computes the cosine similarity and $\tau$ is a fixed temperature parameter. By minimizing the contrastive objective, we obtain enhanced session representations with sufficient interactions between the spatial and temporal augmented views in the latent space.
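The sampling strategy and loss described above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the authors' implementation: the function names, the toy embeddings, the temperature value, and the exact InfoNCE form (positive term included in the denominator, loss averaged over the batch) are hypothetical choices, and "column-wise shuffling" is interpreted here as permuting the feature entries of each temporal embedding.

```python
import math
import random


def cosine(x, y):
    """Cosine similarity sim(x, y) = x.y / (|x| |y|) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def mixed_noise_negatives(temporal, num_neg, rng):
    """Sample negatives from a pool of original and feature-shuffled temporal embeddings."""
    noisy = []
    for t in temporal:
        shuffled = list(t)
        rng.shuffle(shuffled)          # assumed reading of "column-wise shuffling"
        noisy.append(shuffled)
    pool = [list(t) for t in temporal] + noisy   # 2B candidates in total
    return rng.sample(pool, num_neg)


def info_nce(spatial, temporal, num_neg, tau=0.2, seed=0):
    """Average InfoNCE loss over the batch, with (G(s_i), T(s_i)) as positive pairs."""
    rng = random.Random(seed)
    total = 0.0
    for g, t in zip(spatial, temporal):
        negatives = mixed_noise_negatives(temporal, num_neg, rng)
        pos = math.exp(cosine(g, t) / tau)
        neg = sum(math.exp(cosine(g, c) / tau) for c in negatives)
        total += -math.log(pos / (pos + neg))
    return total / len(spatial)


# Toy batch of B = 3 sessions with 3-dimensional view embeddings.
temporal = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
spatial = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]
loss = info_nce(spatial, temporal, num_neg=4)
print(loss)
```

Note that, following the description of the pool as containing all temporal embeddings of the batch, a session's own temporal embedding may occasionally be drawn as a negative; whether the original method filters this case is not stated here.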

MAIN SUPERVISED TASK OF RESTC
Note that the auxiliary contrastive learning task does not need labels. This section introduces the main supervised task, which aggregates the spatial and temporal embeddings. Since collaborative filtering information can also be represented as a graph, we construct the global collaborative filtering graph to enhance the spatial encoder (see details in Sec. 5.1). Sec. 5.2 illustrates how to generate the final session representation by fusing the temporal embeddings and the enhanced spatial embeddings, based on which RESTC predicts the next item (see Sec. 5.3). Lastly, Sec. 5.4 presents how to jointly train the contrastive and downstream tasks in a multi-task fashion.

Spatial Encoder for CFG
A Collaborative Filtering Graph (CFG) is used to learn the collaborative filtering information of a session based on a global item-transition view. Given a complete session set from all anonymous users, denoted as $S = [s_1, s_2, s_3, \ldots, s_{|S|}]$, let $\mathcal{G}_c = (\mathcal{V}_c, \mathcal{E}_c)$ be a graph where $\mathcal{V}_c \in V$ denotes the item set and $\mathcal{E}_c$ represents the weighted edges from all item relationships. We define an item pair as connected in a session if the two items are adjacent in that session, and the number of times the connection repeats is treated as the weight of the edge between the pair. This can be found in the CFG encoder part of Fig. 4.
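The construction rule above can be sketched as follows. The helper name `build_cfg` and the toy sessions are hypothetical, and the sketch assumes that item pairs are unordered and that consecutive repeats of the same item do not create an edge; a directed variant would key on the ordered pair instead.

```python
from collections import Counter


def build_cfg(sessions):
    """Count how often each unordered item pair is adjacent across all sessions."""
    weights = Counter()
    for session in sessions:
        for u, v in zip(session, session[1:]):
            if u != v:                       # assumption: skip immediate repeats
                weights[frozenset((u, v))] += 1
    return weights


# Toy session set from three anonymous users.
sessions = [[1, 2, 3], [2, 3, 4], [3, 2]]
cfg = build_cfg(sessions)
print(cfg[frozenset((2, 3))])
```

The resulting counter is a sparse representation of the weighted adjacency matrix that the CFG encoder consumes.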

Collaborative Filtering Graph Encoding.
Obtaining the embedding of the CFG enriches a session's representation with implicit collaborative filtering information from other session data. Without the assistance of CFG embeddings, modeling a single short-term session could be ineffective in capturing complex transitional relationships among items over all sessions, and it would suffer from severe data sparsity problems. We therefore leverage GraphSAGE-GCN [14], which uses the mean-pooling propagation rule, to encode the CFG and aggregate the K-hop neighbors' information of every item. In one layer of the encoder, $Z^{(0)} \in \mathbb{R}^{N \times d}$ represents the initial input embedding of the items of all sessions, $W_c^{(k)} \in \mathbb{R}^{d \times d}$ denotes the learnable weight matrix in the $k$-th layer, and $\tilde{A} = A + I_N$ is the adjacency matrix added to the identity matrix, which can be seen as adding a self-loop to each item in the CFG. $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the degree matrix over the CFG. After passing through $K$ graph convolution layers, we get the K-hop CFG embedding, represented as $Z \in \mathbb{R}^{N \times d}$, where $N$ is the number of items over all sessions.
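As a rough illustration of the mean-pooling propagation rule, the sketch below computes one aggregation step $\tilde{D}^{-1}\tilde{A}Z$ on a toy weighted graph. It is a simplified sketch, not the encoder itself: the learnable weight matrix $W_c^{(k)}$ and any non-linearity are deliberately omitted, and the function name `mean_aggregate` and the toy data are hypothetical.

```python
def mean_aggregate(adj, feats):
    """One mean-pooling step: each row becomes the weighted mean over (A + I) neighbors."""
    n = len(feats)
    out = []
    for i in range(n):
        # Row i of A~ = A + I (self-loop added on the diagonal).
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        degree = sum(row)  # D~_ii = sum_j A~_ij
        dim = len(feats[i])
        out.append(
            [sum(row[j] * feats[j][k] for j in range(n)) / degree for k in range(dim)]
        )
    return out


# Toy weighted CFG over three items (symmetric adjacency) with 1-d features.
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 3.0],
    [0.0, 3.0, 0.0],
]
feats = [[1.0], [2.0], [4.0]]
one_hop = mean_aggregate(adj, feats)
two_hop = mean_aggregate(adj, one_hop)  # stacking steps reaches K-hop neighbors
print(one_hop)
```

Applying the step K times lets each item receive information from neighbors up to K hops away, which is the role of the stacked layers in the text.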

Spatial Encoder Enhancing with CFG Embedding.
We additionally add the K-hop neighbor view from the CFG (denoted as Z) to obtain the enhanced graph-structure representation; the part of the global CFG embeddings that involves the items in the current session $s$ is extracted and denoted as $Z_s$. The embedding of a specific session is computed with a trainable parameter $W_g \in \mathbb{R}^{3d \times d}$, the position embedding $P_e$ mentioned in Eq. 10, and $H$, the output embedding of MSG in Eq. 9. At this point, we have obtained an enhanced graph-based session embedding that simultaneously contains the spatial view of the current session and the global collaborative filtering information from all sessions.

Embedding Fusion of The Two Views
After the session data pass through the encoders of the spatial and temporal views simultaneously, we obtain distinct semantic representations from the two views. To generate a hybrid preference representation that combines the advantages of both views, we also apply the soft-attention mechanism to combine the enhanced spatial graph embeddings with the temporal embeddings and acquire an attentive vector for each item. The inputs are the spatial embedding from Eq. (16) and the temporal embedding T from Eq. 6; $z_i^s$ indicates the CFG embedding of item $v_i$ in session $s$, and $h_i$ denotes the MSG embedding of $v_i$; $W_f, W_7, W_8 \in \mathbb{R}^{d \times d}$ are learnable matrices, and $b_7, f_g \in \mathbb{R}^{d}$ are learnable biases. Finally, we get the semantic-rich representation $S_h$, which incorporates the global collaborative filtering spatial information, the session spatial information, and the session temporal information.

Next-item Prediction Task
We further use the session embedding $S_h$ to make recommendations by computing the probability distribution over the candidate items. Specifically, we apply the softmax function to obtain the main task output $\hat{y}$, the predicted probability distribution, using a transformation matrix in $\mathbb{R}^{d \times d}$ for the distribution prediction. Then, we apply cross-entropy between $\hat{y}$ and the ground truth $\{y_1, y_2, y_3, \ldots, y_N\}$ as the objective function of the main task.
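The prediction step can be sketched as a softmax over candidate scores followed by a cross-entropy loss. This is a simplified illustration under stated assumptions: scores are taken as plain inner products between a session embedding and candidate item embeddings, the transformation matrix from the text is omitted, and all names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(session_emb, item_embs):
    """Probability of each candidate item given the session embedding."""
    scores = [sum(a * b for a, b in zip(session_emb, item)) for item in item_embs]
    return softmax(scores)


def cross_entropy(probs, target):
    """Cross-entropy against a one-hot ground truth at index `target`."""
    return -math.log(probs[target])


def top_k(probs, k):
    """Indices of the k most probable items, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


# Toy session embedding and three candidate item embeddings.
probs = predict([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k(probs, 2), cross_entropy(probs, 0))
```

The top-K list corresponds to the recommended items, and the cross-entropy value is the per-sample main-task loss.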

Multi-task Training for Contrastive and Supervised Tasks
We unify the main recommendation task with the contrastive learning task to enhance the performance of SBR, which can be viewed as a multi-task training process in which $\lambda_1$ controls the strength of contrastive learning and $\lambda_2$ is the coefficient of the $L_2$ regularization over all trainable parameters $\Theta$. The whole training procedure of RESTC is summarized in Algorithm 1.
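A joint objective consistent with this description can be written as below. This is a hedged sketch rather than the paper's exact equation: the loss names $\mathcal{L}_{main}$ (cross-entropy of the next-item prediction task) and $\mathcal{L}_{cl}$ (spatio-temporal contrastive loss) are labels assumed here for illustration.

```latex
\begin{equation}
\mathcal{L} = \mathcal{L}_{main} + \lambda_1 \, \mathcal{L}_{cl} + \lambda_2 \, \lVert \Theta \rVert_2^2
\end{equation}
```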
Algorithm 1: Training procedure of RESTC. The procedure iterates over each batch in the DataLoader and over each session s in the batch.

Experimental Settings
In this section, we conduct extensive experiments on six datasets to answer our research questions.
Dataset Description.
We evaluate RESTC on six public benchmark datasets that are often used for session-based recommendation models: Tmall, Diginetica, Gowalla, RetailRocket, Nowplaying, and LastFM. Tmall comes from a competition at IJCAI and contains anonymous users' shopping logs on the Tmall online website. Diginetica records the clicks of anonymous users within six months and comes from the CIKM Cup 2016 platform. Gowalla is a check-in dataset that is widely used for point-of-interest recommendation; we follow [3] to process this data. RetailRocket originates from a Kaggle contest published by an e-commerce company and contains the browsing activity of anonymous users within six months. Nowplaying describes the music listening behavior of users and comes from the resource of [75]. LastFM is a popular music dataset that has been used as a benchmark in many recommendation tasks; following [12], we employ it as session-based data.
Moreover, we adopt the data augmentation and filtering for the sessions following [35,43,66,71]. Specifically, we process these datasets into sessions. Concretely, we remove all sessions whose length is shorter than 1 and all items that appear fewer than 5 times over all sessions. We also use the data of the last 7 days as the test data and the earlier data as the training data. In addition, given a session $s = [v_1, v_2, \ldots, v_n]$, we augment the sequence and generate the corresponding labels by splitting it into $([v_1], v_2), ([v_1, v_2], v_3), \ldots, ([v_1, v_2, \ldots, v_{n-1}], v_n)$ for all sessions in the six datasets. The details of the processed data are shown in Table 1.
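The prefix-splitting augmentation described above can be sketched in a few lines. The function name `augment_session` and the toy session are hypothetical; the filtering and train/test split steps are not shown.

```python
def augment_session(session):
    """Split a session into (prefix, next item) training pairs."""
    return [(session[:i], session[i]) for i in range(1, len(session))]


pairs = augment_session([10, 20, 30, 40])
print(pairs)
```

A session of length n therefore yields n - 1 training samples, each pairing a growing prefix with the item that follows it.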

Baselines.
For fair comparisons, we compare RESTC with sequential-based methods, GNN-based methods, temporal-augmented methods and contrastive learning methods (e.g., self-supervised sequential methods [19,42,69] and graph contrastive learning approaches [67,68]), respectively.
Sequential-based Methods:
• FPMC [45] learns the representation of a session via a Markov-chain based method. We ignore the user profile information in the experiment and adapt it to session-based recommendation.
• GRU4Rec [18] is an RNN-based method that utilizes GRU and adopts a ranking-based loss to model the preference of users within the current session.
• NARM [26] is an attention-based RNN model that learns the session embedding.
• STAMP [33] is an attention model that captures the user's temporal interests from historical clicks in a session and relies on self-attention of the last item to represent users' short-term interests.
• SASRec [22] is a self-attention-based sequential recommendation model that allows us to capture long-term semantics.
• BERT4Rec [49] employs deep bidirectional self-attention to model user behaviour sequences.

GNN-based Methods:
• SR-GNN [66] is the first GNN-based model for the SBR task, which transforms the session data into a directed unweighted graph and utilizes a gated GNN to learn the representation of the item-transition graph.
• GC-SAN [71] uses a gated GNN to extract local context information and then employs the self-attention mechanism to obtain the global representation.
• CSRM [58] integrates an internal memory encoder with an external memory network by considering the correlation between neighboring sessions.
• FGNN [43] leverages a weighted graph attention network to compute the information flow in the session graph and generates the user preference via a graph readout function.
• GCE-GNN [63] transforms the sessions into a global graph and local graphs to enable cross-session learning.

Temporal-Augmented Methods:
• TASRec [79] incorporates temporal information by constructing a sequence of dynamic graph snapshots at different timestamps.
• TMI-GNN [24] leverages temporal information to guide the multi-interest network to focus on multi-interest mining.
• GNG-ODE [11] uses a graph-nested GRU ordinary differential equation to encode both temporal and structural patterns into continuous-time dynamic embeddings.

Contrastive Learning Methods:
• CL4Rec [69] uses contrastive learning to learn the mutual dependencies between item content and collaborative signals.
• DuoRec [42] designs a contrastive regularization to reshape the distribution of sequence representations.
• CORE [19] unifies the representation space of session embeddings by using a contrastive-based representation-consistent encoding strategy.
• $S^2$-DHCN [68] transforms the session data into a hyper-graph and a line-graph and uses self-supervised learning to enhance session-based recommendation.
• COTREC [67] exploits the session-based graph to augment two views that exhibit the internal and external connectivity of sessions via contrastive learning.

Evaluation Metrics and Parameter Settings.
Following the baselines mentioned above, we adopt two widely used metrics for the SBR task: HR@N (Hit Rate) and MRR@N (Mean Reciprocal Rank). For each baseline, we report its best performance under the original settings from its paper. In our settings, we apply grid search to find the optimal parameters, using a random 20% of the training data as the validation set. Concretely, we search the embedding dimension in the range {100, 150, 200, 250, 300, 350}, and the default batch size is set to 512. We also investigate the coefficient of the contrastive learning task from 5e-4 to 1e-1. In our experiments, the default constant of $L_2$ regularization is 1e-5. We stack 2 SESTrans encoder layers by default, which achieves the best performance in capturing the temporal patterns in our experiments. We then search the numbers of MGAT and CFG encoding layers from 1 to 4 and find that 1 MGAT layer and 3 CFG embedding layers are already enough for learning the spatial structure representation of a session. Besides, we utilize the Adam optimizer with a learning rate of 0.001, together with Step-LR and Cosine-Annealing-LR schedulers to adjust the learning rate. More experimental details are given in Sec 6.5.
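For readers unfamiliar with the two metrics, the following is a minimal sketch of how HR@N and MRR@N are commonly computed from ranked recommendation lists. The function names `hr_at_n` and `mrr_at_n` and the toy data are illustrative assumptions and are not taken from the RESTC implementation.

```python
from typing import Sequence


def hr_at_n(ranked: Sequence[Sequence[int]], targets: Sequence[int], n: int = 20) -> float:
    """Fraction of test cases whose target item appears in the top-n list."""
    hits = sum(1 for items, target in zip(ranked, targets) if target in items[:n])
    return hits / len(targets)


def mrr_at_n(ranked: Sequence[Sequence[int]], targets: Sequence[int], n: int = 20) -> float:
    """Mean reciprocal rank of the target; a target outside the top-n counts as 0."""
    total = 0.0
    for items, target in zip(ranked, targets):
        top = list(items[:n])
        if target in top:
            total += 1.0 / (top.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


# Toy example: two test sessions, each with a ranked list of item ids.
ranked_lists = [[3, 1, 2], [5, 4, 6]]
true_items = [1, 6]
print(hr_at_n(ranked_lists, true_items, n=2))
print(mrr_at_n(ranked_lists, true_items, n=3))
```

HR@N only checks whether the target is retrieved, whereas MRR@N also rewards placing it near the top of the list.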

Overall Results (RQ1)
The experiment results of baselines and RESTC model over six datasets are reported in Table 2.
The results show that the traditional machine learning method FPMC performs worse than the deep learning methods since it cannot capture long-term dependencies. Among the sequence-based methods, STAMP and NARM perform better than GRU4Rec since they utilize attention mechanisms to learn the critical relations among all items. Besides, CSRM performs the best among the sequence-based baselines, including STAMP and NARM, demonstrating the effectiveness of leveraging collaborative filtering information from other sessions.

Table 2. ★ indicates a statistically significant level (p-value < 0.001) when comparing our RESTC with the baselines. Underlined numbers mean the best baseline. The best performance for each benchmark is marked in black bold. TM, DG, RR, LF, NP, GW denote Tmall, Diginetica, RetailRocket, LastFM, Nowplaying and Gowalla, respectively.
Note that GNN-based methods outperform sequence-based methods, which indicates that there still exist useful spatial-structure patterns that sequence-based methods leave undiscovered. Moreover, the information in item-transition graphs (the spatial view) might be relatively more informative than the temporal view used in sequence-based methods. Specifically, GC-SAN shows better results than SR-GNN, demonstrating that combining GNN with self-attention can better model the current session's local and global context information. GCE-GNN shows better results than SR-GNN and GC-SAN, demonstrating that combining the information of local sessions and the global neighbor graph effectively enriches the session representation. TASRec outperforms general GNN-based methods like SR-GNN, FGNN, and GC-SAN, suggesting that incorporating temporal information significantly benefits spatial-structure modeling. $S^2$-DHCN shows excellent performance on LastFM and Gowalla in terms of HR@20 since it uses inter- and intra-relations over all sessions and then applies self-discrimination to improve the representation.
For our RESTC, the results in Table 2 and Table 3 show that it significantly outperforms most of the comparative baselines, including sequence-based, GNN-based, temporal-augmented, self-supervised sequential, and graph contrastive learning methods. In particular, across the six benchmarks, RESTC shows better results than the other baselines on TM, DG, RR, LF and GW, which reflects RESTC's superior representation capability. Moreover, the significant improvement of RESTC over strong self-supervised sequential baselines (e.g., CL4Rec, DuoRec and CORE) and graph contrastive learning baselines (e.g., $S^2$-DHCN and COTREC) implies that leveraging temporal and collaborative filtering information is promising for refining the session representation.

From Table 4, we can observe that removing any of the above components consistently leads to a performance drop, implying that these components are all significant to RESTC. Concretely, w/o SESTrans underperforms w/o Cont, showing that incorporating temporal information by directly combining the temporal embedding with the spatial embedding in the main supervised task already improves the performance. Moreover, the performance drop of w/o CFG is more evident than that of w/o Cont. This phenomenon is consistent with our assumption that obtaining the implicit collaborative filtering information from the global weighted session graph, denoted as CFG, can enhance the spatial representation, which helps remedy the data sparsity problem of short-term sessions. Furthermore, comparing the standard RESTC with RESTC w/o Cont shows that spatio-temporal contrastive learning enhances the performance on both metrics by an obvious margin. This reveals that cross-view interactions via contrastive regularization in the latent space can further reinforce the session representation for the main prediction task.
Besides, w/o PE-G demonstrates that removing the position embedding in the spatial structure view results in a remarkable performance drop, since the model cannot recover the initial order relation after graph embedding. Moreover, w/o PE-S performs worse than RESTC on the selected datasets, which shows the effectiveness of temporal-aware encoding in the temporal encoder SESTrans.

Ablation on Different Spatial Encoder Backbones (RQ3)
Since our proposed RESTC is a model-agnostic framework that can effectively adapt to various GNN-based spatial encoders, we want to investigate the effectiveness of leveraging MGAT to learn the spatial representation of the session graph. Therefore, we compare it with other GNN-based backbones on Tmall, Diginetica, RetailRocket and LastFM. Specifically, we substitute the MGAT backbone with several variants, including Gated Graph Neural Network (GGNN) [66,71], GraphSAGE-LSTM [14], GAT [43,52] and MGAT without multi-head attention (MGAT w/o MH). Among them, GGNN constructs the session as a weighted directed graph, uses the occurrence frequency of item-pair transitions as edge weights, and applies a gate-based aggregation function; GraphSAGE-LSTM and GAT adopt the same method to construct the session graph, but they utilize LSTM and attention-weighted sum as the aggregation functions, respectively. As depicted in Fig 5, the MGAT backbone achieves significant improvements compared with GAT and MGAT w/o MH, verifying the advantage of constructing sessions as multi-relational session graphs and leveraging multi-head MGAT. Moreover, compared to GraphSAGE-LSTM and GGNN, MGAT achieves better performance, suggesting that the attention mechanism is more powerful for learning the spatial structural representation of the session graph.

Further Analysis on Spatio-Temporal Contrastive Learning (RQ4)
To further analyze what factors affect the performance of our proposed spatio-temporal contrastive learning, we study different settings. We first investigate the impact of the temperature $\tau$. Then, we dive into the influence of distinct negative sampling strategies in the contrastive learning objective function. We adjust the hyperparameter $\tau$ on Tmall and Diginetica, which show a similar trend to the other datasets. Then, due to the limited space, we present the results of using variants of negative sampling on Tmall, Diginetica, RetailRocket, and LastFM.
6.4.1 Impact of Temperature $\tau$. As mentioned in [2,65], $\tau$ plays a critical role in hard negative mining for contrastive learning. The experiment results in Fig 6 show the curves of RESTC performance with respect to different values of $\tau$. We can observe that: (1) The larger the value of $\tau$ (e.g., 1.0), the slower the model converges during training, and there is a significant decrease in the model's performance when it converges. Similar to [65], we attribute this phenomenon to the difficulty of identifying hard negative samples, whose temporal representations are similar to those of positive samples, which makes the model fail to distinguish them from the positive samples in the latent space. (2) In contrast, setting $\tau$ to a too small value (e.g., 0.1) causes the model to converge quickly, leading to premature overfitting during training. We conjecture that a small $\tau$ makes the model focus excessively on the hard negative samples and offers more gradients to guide the optimization, thus making the spatial and temporal representations easier to discriminate and accelerating the training process [44]. Therefore, depending on the dataset, we choose the value of $\tau$ between 0.1 and 1.
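To make the role of the temperature concrete, the following is a minimal sketch of a generic InfoNCE-style cross-view loss in which the spatial representation of a session is the anchor, its own temporal representation is the positive, and the temporal representations of the other sessions in the batch are negatives. The names `cosine` and `info_nce`, the use of cosine similarity, and the toy vectors are assumptions for illustration; the exact objective used by RESTC is defined in Sec 4 and may differ.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(spatial: List[Vector], temporal: List[Vector], tau: float) -> float:
    """Average InfoNCE loss over a batch of paired spatial/temporal views.

    For session i, temporal[i] is the positive and temporal[j], j != i,
    are the negatives. Similarities are divided by the temperature tau.
    """
    losses = []
    for i, anchor in enumerate(spatial):
        logits = [cosine(anchor, t) / tau for t in temporal]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two sessions whose two views are perfectly aligned.
spatial_views = [[1.0, 0.0], [0.0, 1.0]]
temporal_views = [[1.0, 0.0], [0.0, 1.0]]
for tau in (0.1, 0.5, 1.0):
    print(tau, info_nce(spatial_views, temporal_views, tau))
```

Dividing by a small temperature sharpens the softmax, so the negatives most similar to the anchor dominate the gradient; a large temperature flattens it and treats all negatives almost equally.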

From Table 5, we can observe that the Spatial-only method, which uses only spatial representations as negative samples, performs worse than all the other methods. This may indicate that without using temporal representations as negative samples, it is challenging to align spatio-temporal information in the latent space, leading to sub-optimal performance. Besides, S-T (ma) and S-T (sma) perform slightly better than S-T (sa), which we conjecture is because increasing the sampling size and diversity of negative samples (spatial and temporal views) helps the model distinguish between positive and negative sample pairs. In addition, S-T (mn) outperforms all the variants of sampling strategies, which may be because adding random noise to the set of temporal representations enhances the robustness of contrastive learning. Moreover, we also validate the correlation between the noise sampling strategy and the batch size. The results are consistent with SimCLR [3]: enlarging the number of negative samples by increasing the batch size from 128 to 512 significantly improves performance.
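The four non-default variants compared in Table 5 differ only in which representations of a training batch serve as negatives. The sketch below is a rough illustration under assumptions: the function `select_negatives`, the strategy keys, and the string placeholders standing in for embeddings are all hypothetical, and the default mixed-noise strategy S-T (mn) is omitted because its definition (Sec 4.3) is not reproduced here.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Rep = TypeVar("Rep")


def select_negatives(
    i: int,
    spatial: Sequence[Rep],
    temporal: Sequence[Rep],
    strategy: str,
    rng: Optional[random.Random] = None,
) -> List[Rep]:
    """Pick negative candidates for session i from one training batch."""
    rng = rng or random.Random(0)
    others = [j for j in range(len(spatial)) if j != i]
    if strategy == "spatial_only":  # spatial views of the other sessions
        return [spatial[j] for j in others]
    if strategy == "st_sa":  # one randomly chosen temporal view of another session
        return [temporal[rng.choice(others)]]
    if strategy == "st_ma":  # temporal views of all other sessions
        return [temporal[j] for j in others]
    if strategy == "st_sma":  # spatial and temporal views of all other sessions
        return [spatial[j] for j in others] + [temporal[j] for j in others]
    raise ValueError(f"unknown strategy: {strategy}")


# Placeholder strings stand in for the G(s) and T(s) embeddings.
spatial_batch = ["G0", "G1", "G2"]
temporal_batch = ["T0", "T1", "T2"]
for name in ("spatial_only", "st_sa", "st_ma", "st_sma"):
    print(name, select_negatives(0, spatial_batch, temporal_batch, name))
```

Moving from `st_sa` to `st_ma` and `st_sma` increases both the number and the diversity of negatives, which is the effect the comparison in Table 5 is probing.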

Impact of Hyperparameters (RQ6)
Next, we analyze the sensitivity of RESTC to different hyperparameter settings. Due to the limited space, we only show the results of HR@20 on Tmall, Diginetica, and Retailrocket.

Impact of Hidden Dimension.
To investigate the impact of the hidden dimension, we test the performance when increasing the size from 100 to 400. From the leftmost of Fig. 8 (A), we can conclude that increasing the hidden dimension does not continuously improve the performance. Our RESTC model achieves the best performance at 300 for Diginetica and Tmall, while obtaining the optimal result at 200 for Retailrocket. The reason might be that a larger hidden size leads to overfitting.

As for the strength of contrastive learning (Fig 8 (B)), a larger coefficient $\lambda_1$ of the contrastive loss does not show a tendency toward better performance. Our model obtains the most satisfactory performance when $\lambda_1$ is near 0.005, 0.001, 0.0005 for Diginetica, Retailrocket, and Tmall, respectively. HR@20 drops noticeably when $\lambda_1$ becomes larger than these values, especially on Tmall. The main reason is that increasing $\lambda_1$ might harm the optimization of the main prediction task. Therefore, we set the coefficient $\lambda_1$ for each dataset according to grid search.

Effect of MGAT Layers.
To further analyze the impact of the number of aggregation layers in the spatial encoder MGAT, we vary the number of MGAT layers in the range of {1, 2, 3, 4}. As the results in Fig 8 (C) show, leveraging a 1-layer MGAT for RESTC already achieves the best performance, and stacking more layers leads to a decreasing tendency. We conjecture that adopting more layers causes overfitting since most of the sessions are relatively short according to the average lengths in the dataset statistics in Table 1.

The CFG embedding enriches the current session with inter-session information, which is an efficient way to alleviate data sparsity problems and enhance recommendation performance. We vary the number of layers from 1 to 4 to study the impact of the CFG embedding module's depth. From the middle of Fig. 8, we observe that the three-layer setting makes RESTC obtain the best result. Stacking more layers adds more noisy information, which relates to the over-smoothing issue of high-order relations in graphs.

Analysis on Different Session Lengths (RQ5)
In many scenarios, sessions are transferred to the server at various lengths [37]. It is worthwhile to investigate the robustness of our RESTC model compared with baselines on different lengths of sessions. We separate all the sessions in Tmall, Diginetica, RetailRocket and LastFM into three groups: a short group (S) with session lengths from 0 to 5, a medium group (M) with session lengths from 5 to 10, and a long group (L) with the remaining sessions. We utilize MRR@20 instead of HR@20 to evaluate the performance of the methods since the MRR metric can better reflect the ranking quality of correct results.
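The length-based grouping can be sketched as follows. Since the stated ranges share the boundary value 5, this sketch assumes that lengths up to 5 are short, lengths 6 to 10 are medium, and longer sessions are long; this boundary handling and the names `length_group` and `split_by_length` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def length_group(session: List[int], short_max: int = 5, medium_max: int = 10) -> str:
    """Return 'S', 'M' or 'L' according to the session length."""
    n = len(session)
    if n <= short_max:
        return "S"
    if n <= medium_max:
        return "M"
    return "L"


def split_by_length(sessions: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Bucket sessions into short, medium and long groups."""
    groups: Dict[str, List[List[int]]] = defaultdict(list)
    for session in sessions:
        groups[length_group(session)].append(session)
    return dict(groups)


# Toy sessions of lengths 3, 7 and 12 with hypothetical item ids.
toy_sessions = [list(range(3)), list(range(7)), list(range(12))]
buckets = split_by_length(toy_sessions)
print({name: [len(s) for s in group] for name, group in buckets.items()})
```

Metrics such as MRR@20 can then be computed separately on each bucket to compare robustness across session lengths.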

Representation Quality of RESTC (RQ7)
To evaluate whether spatio-temporal contrastive learning affects the representation learning performance, we utilize t-SNE to reduce the dimension of the learned embeddings and visualize them in 2D planes. As shown in Fig. 10, we compare the visualization results of RESTC, RESTC (w/o Cont.), $S^2$-DHCN and GC-SAN on Retailrocket, using six labels and randomly sampling 50 session instances for each label. It is expected that session embeddings should be closer if they have the same label (next-to-click item). From Fig. 10, by comparing RESTC and its variant, we observe that removing spatio-temporal contrastive learning makes the learned embeddings less distinguishable in the latent space, showing that contrastive learning yields a better alignment between session embeddings with the same label. Moreover, some session embeddings with different labels are mixed to some degree for $S^2$-DHCN and GC-SAN, which makes them indiscernible. In contrast, our RESTC shows a more diverse distribution and hence can better make a correct prediction, demonstrating the superiority of RESTC in representation learning.

CONCLUSION
This paper proposes a novel framework called RESTC, which aims to effectively learn the session representation from cross-view interactions and collaborative filtering information. It is equipped with spatio-temporal contrastive learning to extract self-supervised signals from spatial and temporal views to mitigate temporal information loss and improve the quality of representation learning.
In the next-item prediction task, we utilized the embedding of the collaborative filtering graph to enrich the spatial structure information, which can also solve the data sparsity problem of the short-term session.Extensive experiment results demonstrate that RESTC achieves significant improvements compared with other recent baselines.

Fig. 1. Two distinct sessions may be represented as the same graph if the temporal information is omitted, indicating that the temporal pattern should be sufficiently considered to supplement GNN-based models for the SBR task.

Fig. 2. Sampled sessions from a real public dataset. The numbers in the nodes denote the item indices.

Fig. 3. Three essential kinds of information in session data: (A) the temporal view of a session is a behavioral sequence containing the user's dynamic preference w.r.t. its timeline; (B) the spatial view of a session refers to a between-item transition directed graph, each edge of which indicates a behavior shift from the source item to the target item (for example, a user has clicked item $v_2$ after $v_1$). Note that the behavior shift associated with an edge could happen many times in a session, and such edges are orthogonal to time; (C) collaborative filtering information in other sessions can be extracted from a global weighted graph and then used to compensate for the item profiles in a short-term session.

Our RESTC framework includes a Spatio-Temporal Contrastive Learning strategy in Sec 4 and a Collaborative Filtering Graph enhanced main supervised task in Sec 5. The training process is shown in Fig 4: First, the session data (e.g., $s_2$) is transformed into two aggregated embeddings ($T(s_2)$ and $G(s_2)$) encoded by the local temporal and spatial encoders. Then, to remedy the temporal information loss during the encoding of the spatial structure as shown in Fig 3, spatio-temporal contrastive learning is applied to align and interact with the embeddings of the two views in the latent space. Furthermore, in the main prediction task, we enhance the spatial embeddings H with the Collaborative Filtering Graph (CFG) embedding Z and apply embedding fusion to generate the session representation for predicting the next item.

Vol. 1, No. 1, Article 101. Publication date: February 2024.

Fig 4 (example sessions): s2: v1 v2 v3 v2 v4 v5; s1: v7 v3 v5 v4 v5 v7; s3: v5 v3 v4 v3 v6 ...

• RQ1 How does RESTC perform compared to existing methods in the SBR task?
• RQ2 Are the main components (e.g., session graph encoder (MGAT), temporal encoder (SESTrans), CFG encoder, spatio-temporal contrastive learning) really working well?
• RQ3 How does the spatial encoder (MGAT) work effectively compared to other GNN-based backbones?
• RQ4 How do different settings (temperature $\tau$, negative sampling strategies) of contrastive learning impact the performance of RESTC?
• RQ5 Is RESTC robust to different lengths of session data?
• RQ6 How do different hyper-parameters affect RESTC?
• RQ7 Is the spatio-temporal contrastive learning really improving the representation learning?

Fig 5, RESTC equipped with MGAT as a spatial encoder is superior to all the comparative GNN-based backbones. Concretely, the MGAT backbone significantly improved

6.4.2 Variants of Negative Sampling Strategy. To investigate how the choice of negative sampling affects the performance of contrastive learning, we ablate several negative sampling strategies as shown in Fig 7. Specifically, we compare our default method with four variants of session-level contrastive learning, which select negative samples from the spatial or temporal session representations in a training batch:
(a) Spatial-only, which selects the spatial representations of the other sessions as candidates, denoted as Spatial-only;
(b) Spatio-Temporal (single alignment), which randomly selects one different temporal representation, denoted as S-T (sa);
(c) Spatio-Temporal (multiple alignments), which selects the other temporal representations from the batch, denoted as S-T (ma);
(d) Spatio-Temporal (self and multiple alignments), which selects the representations of both spatial and temporal candidates in the batch, denoted as S-T (sma).
As illustrated in Sec 4.3, the default method of RESTC is Spatio-Temporal (mixed noise), denoted as S-T (mn).

Fig. 7. Four variants of negative sampling strategy and the default method.

6.5.2 Strength of Contrastive Learning. In RESTC, we utilize the hyperparameter $\lambda_1$ to trade off the contrastive loss and the cross-entropy loss. To demonstrate the utility of $\lambda_1$, we compare the experimental results obtained with $\lambda_1$ values from {0.0, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1}. As in the rightmost of Fig 8 (B)

Fig. 10. t-SNE visualization of session embeddings in the latent space; each color represents a specific label.
4.2.1 Multi-relational Session Graph Construction. There may exist duplicate items in one session. Thus, it is important to construct a session graph to capture the spatial relationships in terms of item transitions. Given a session $s$ with a repeatable item sequence $s = [v_{s,1}, v_{s,2}, v_{s,3}, \ldots, v_{s,n}]$, let $G_s = (V_s, E_s)$ be the corresponding session graph, where the node set $V_s$ consists of the unique items in the session and the edge set $E_s$ contains the edges between any two adjacent items $(v_{s,i}, v_{s,j})$ in the sequence $s$, forming the item-transition pattern behind the session. In contrast to FGNN [43], which utilizes the occurrence frequency of edges to construct a weighted directed graph for a session, we leverage a multi-relational weighted graph that uses multiple types of relationships, including in-relation, out-relation, bi-direction and self-loop. Specifically, the out-relation indicates that only the transition $(v_{s,i}, v_{s,j})$ exists in the graph; the in-relation is the reverse. The bi-direction indicates that transitions between $v_{s,i}$ and $v_{s,j}$ exist in both directions. Besides, the self-loop implies that there exists a self-transition of an item. By using these four relationships, the spatial structure can be enriched with more accurate inter-relationships among item transitions. We name this graph the Multi-relational Session Graph (MSG). A concrete example is shown in Fig. 4, in which the session $s_1 = [v_1, v_2, v_3, v_2, v_4, v_5]$ can be converted into a multi-relational graph as shown inside the blue dotted rectangle with local.
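The construction of the four relation types can be sketched in a few lines. This is a minimal illustration under assumptions: the function `build_msg` and the relation labels are hypothetical names, edge weights are omitted, and a self-loop is only created when an item is immediately followed by itself, which is one possible reading of the description above.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def build_msg(session: List[int]) -> Dict[Edge, str]:
    """Assign a relation type to every ordered item pair linked by a transition.

    For a transition u -> v that has no reverse transition, (u, v) is an
    out-relation and (v, u) is the matching in-relation. If both u -> v and
    v -> u occur, both pairs are bi-directional. u -> u is a self-loop.
    """
    transitions = set(zip(session, session[1:]))  # adjacent item pairs
    relations: Dict[Edge, str] = {}
    for u, v in transitions:
        if u == v:
            relations[(u, v)] = "self-loop"
        elif (v, u) in transitions:
            relations[(u, v)] = "bi-direction"
        else:
            relations[(u, v)] = "out-relation"
            relations[(v, u)] = "in-relation"
    return relations


# Item indices of the example session [v1, v2, v3, v2, v4, v5].
msg = build_msg([1, 2, 3, 2, 4, 5])
for edge in sorted(msg):
    print(edge, msg[edge])
```

In this example, items 2 and 3 are connected bi-directionally because the session moves from 2 to 3 and back, while all other transitions produce an out-relation with a matching in-relation.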
4.2.2 Multi-relational Graph Attention Network for MSGs. We next present how to propagate item features on a multi-relational session graph to encode item-transition relations. Graph attention network (GAT)

Table 3. Comparisons with contrastive learning and temporal-augmented methods. ★ indicates a statistically significant level (p-value < 0.001) when comparing our RESTC with the baselines. The best performance for each benchmark is marked in black bold. Underlined numbers mean the second-best result. For conciseness, we select TM, NP and DG as the datasets and HR@20 and MRR@20 as the metrics to compare our RESTC with advanced temporal-augmented, self-supervised sequential and graph contrastive learning approaches.

Besides, RESTC outperforms temporal-augmented GNNs like TASRec, TMI-GNN and GNG-ODE in most settings, indicating that adequate interactions between spatial and temporal views via contrastive learning can significantly boost performance. Therefore, the comparative results demonstrate the effectiveness and generalization ability of our RESTC framework.

Table 4. Ablation study on variants of RESTC.

Table 5. Comparison of variants of negative sampling.