Masked Token Enabled Pre-training: A Task-Agnostic Approach for Understanding Complex Traffic Flow

Conventional deep learning models perform well for traffic flow analysis by training on large amounts of labeled data in a one-model-for-one-task fashion, leading to huge computational complexity in dynamic intelligent transportation system (ITS) applications. To overcome this limitation, this paper proposes a Token-based Self-Supervised Network (TSSN), which can learn traffic flow (TF) features in a task-agnostic way and provide a well-bootstrapped pre-training model for a variety of tasks. TSSN tokenizes TF data into segments; each segment, named a token, comprises numerous consecutive points. Masked Token Prediction (MTP), a pretext task, is designed to understand TF correlations by forecasting tokens that are randomly masked. MTP enables TSSN to capture the high-level intrinsic semantics of TF and provide general-purpose token embeddings. Therefore, TSSN can be more generalizable while keeping high performance. By replacing the final fully-connected layer with a set of untrained new layers and fine-tuning with small-scale task-specific data, TSSN can be deployed for a variety of downstream tasks. The simulation results demonstrate that TSSN improves overall performance on various downstream tasks when compared to state-of-the-art models.


I. INTRODUCTION
Time-series data analytics has shown considerable potential in real-world applications such as intelligent transportation systems (ITS), smart healthcare and financial management. In ITS, traffic flow (TF) data, which represent the number of vehicles passing a fixed point within a given time interval, can accurately indicate the transportation situation and support appropriate route planning and congestion management. Therefore, precise TF data analysis is critical to improving the reliability and efficiency of ITS applications [1].
Non-linearity and unpredictability are the key features of TF data that make them extremely challenging to process [2]. Meanwhile, TF data may display a wide variety of patterns depending on a number of factors. For example, due to the prevalence of roadside structures, TF may show specific trends in different blocks of a city [3]. Among all these patterns, discovering the traffic patterns during particular periods of time on different days, i.e., finding the traffic temporal correlations, is the focus of this work. One main issue of TF data analysis is that the training of a specific model in machine learning is closely tied to a given task. When the purpose is altered or replaced, the trained model may become obsolete, mandating the training of a new model from scratch. Another difficulty is that a large amount of labeled data is required to train a deep learning (DL) model, which can take a significant period of time and make the learning result less useful [4].
The advancement of self-supervised learning (SSL) has been shown to be successful and efficient in both natural language processing (NLP) and computer vision (CV) [5]. SSL can use unlabeled data and agnostic pretext tasks to train a model. Then, through fine-tuning on small-scale task-specific data, downstream tasks, which are the ultimate tasks to solve in TF analysis, can attain high performance [6]. Bidirectional Encoder Representations from Transformers (BERT), for example, can learn token correlations in natural language from two pre-training tasks and obtain general token representations. Inspired by BERT, tokens and sentences can be created from the TF data, and subsequent processing can be applied to each token instead of each TF point. Even with the transformer's attention mechanism, SSL models based on individual points would focus more on short-term correlation and high-frequency changes, potentially resulting in poor downstream task performance due to low generalization. To the best of our knowledge, no existing study has attempted to treat several points of TF data as a whole.
Therefore, this paper presents a Token-based Self-Supervised Network (TSSN) which builds a pre-training model that can be used to bootstrap a variety of task-specific fine-tuning models. The TF data are divided into tokens, each of which contains a series of consecutive points. As a result, by considering multiple points as a token, the model can acquire high-level contextual semantics and encode tokens with representations that leverage fundamental structures of TF data. Then, a masked token prediction (MTP) pretext task is defined, in which a subset of tokens is masked at random and the pre-training model must predict these tokens from the contextual tokens in a sentence. The token-based method can explore more macroscopic and long-term aspects than point-based methods, whilst MTP blurs the lines between different tasks, allowing a bootstrapping model to learn the TF semantics for improving task-specific model performance.
Finally, a general model is learned, and specific tasks can be completed with minor fine-tuning. The pre-training model generates representations of each token for various downstream tasks. Our main contributions in this paper are as follows: 1) A novel network, i.e. TSSN, is proposed for generating an effective task-agnostic model for various downstream tasks on TF data. Meanwhile, a novel pretext task, i.e. MTP, is designed to provide strong surrogate supervision signals for the pre-training of TSSN. The goal is to understand the temporal correlations of unlabeled TF data through tokenization and SSL. 2) Three types of downstream tasks, i.e. TF classification, prediction and completion, are solved by using the representations of tokens created in the pre-training model. Only the final few layers are reconstructed and fine-tuned with task-specific labeled data. 3) Extensive experiments are conducted to compare TSSN with traditional methods, as well as pre-training models with different numbers of points in a token. All three types of downstream tasks are evaluated, each with different parameter configurations. The results clearly demonstrate the effectiveness of the proposed TSSN. The rest of the paper is organized as follows. Section II explores the state of the art of SSL on time-series data. The TF data are processed in Section III, and the pre-training and fine-tuning tasks are discussed in Section IV. Extensive experiments are conducted in Section V with detailed analysis. Section VI concludes this paper.

II. RELATED WORKS
Understanding the semantic-level properties of TF is required for the establishment of ITS applications. For TF analysis, tensor-based methods have been used, but their high complexity and lack of scalability make further development difficult [7]. As a result, DL models have been considered feature learners in a number of ways. These cutting-edge models are primarily concerned with boosting performance by strengthening model structure [8] or merging multi-modal data [9]. However, none of the literature attempts to learn the features by tokenizing the TF in the same way that words are tokenized in the NLP field.

A. Traffic flow prediction
TF prediction is crucial for a variety of ITS applications. The majority of studies use historical data to anticipate future TF utilizing tensor-based approaches or DL models. The impact of disruptions on TF prediction is investigated by Zheng et al. [10]. They propose two factors, an inherent influence factor and a disturbance influence factor, which they eliminate from TF data to represent the inherent patterns. Feng et al. [11] propose a support vector machine (SVM)-based TF prediction model that combines spatial and temporal correlations, whereas Han et al. [12] use a long short-term memory (LSTM) network to make predictions based on representation learning of shape-based features. Both of these studies concentrate on short-term predictions (less than half an hour). For both short-term and long-term TF predictions (from 5 min to 60 min), Huang et al. [13] present a graph convolutional network (GCN)-based model. By using spatial features, the model can achieve a high level of performance.

B. Traffic flow classification
Other groups emerge from the TF data when different contextual circumstances, such as weather, accidents, weekends, or vacations, are considered. Cinsdikici et al. [14] offer a two-phase approach to capture TF density variations and classify TF patterns at each time instant. The TF can be split into three categories: free flow, dense flow and congested flow. Bui et al. [15] investigate how traffic sound data can be used to classify traffic density. To extract sound features and achieve TF categorization, they propose a graph-based representation learning approach employing convolutional neural networks (CNN). The idea is to distinguish rush hour at different times of the day (morning or evening).

C. Traffic flow completion
Tensor completion methods model the TF data with high-dimensional tensors, generate high-order decompositions to obtain low-rank representations, and then solve the completion problem using norm regularization [16], [17]. However, due to their considerable complexity, these strategies are difficult to put into practice. Neural network-based TF data completion is more efficient and can even fill in long stretches of missing data. Li et al. [18] investigate how generative adversarial networks (GAN) might be used to fill in missing TF data in a graph-based approach. Han et al. [19] offer a GAN-based data completion method that combines tensor modeling to achieve long-term TF completion. The main idea is to use low-dimensional representations to recover the best-fitting TF data sequences. The results reveal that the performance is satisfactory even for one-week data completion.

D. Self-supervised learning on time series
SSL is gaining popularity as a tool for extracting features from unlabeled data. SSL has a lot of potential in NLP thanks to the transformer structure. One of the most successful models is Google's BERT, which employs transformer encoders to learn sentence representations [20]. In BERT, tokens are masked at random, and two pre-training tasks, masked language modeling (MLM) and next sentence prediction (NSP), are introduced during the pre-training stage.
Yuan et al. [21], inspired by BERT, develop a self-supervised pre-training model that gains general knowledge from large-scale unlabeled satellite image time series and applies it to classification tasks with scarce-label data. Ma et al. [22] present a time-series clustering approach in which pseudo-class labels are generated through the k-means algorithm. As a result, clustering can be accomplished without a huge amount of labeled data. Shi et al. [23] devise two new pre-training tasks for acoustic time-series classification. The proposed model can be utilized to increase performance on small labeled datasets.
According to the current state of the art, TF data processing usually treats each point as a single basic unit rather than a series of points. Another issue is that current solutions are only targeted at specific tasks. For instance, the model suggested in [13] can only make predictions on TF data, whereas [19] is only for TF completion. Few studies have attempted to develop a task-agnostic technique that can be applied to various aspects of ITS applications using SSL. Therefore, employing token-based processing and pre-training methods, this work primarily offers an SSL model that can provide sufficient bootstrapping for various downstream tasks.

III. TRAFFIC FLOW DATA PRE-PROCESSING
Let X denote the dataset, which contains traffic flow data of several days on multiple road segments. Assume there are N points in total for one day. All points in a day are treated as a sentence. Let x_i ∈ X, i = 1, ..., N denote the i-th point in a sentence. Each sentence contains several tokens. Let K denote the number of points in a token, and assume that K is a factor of N. Then the number of tokens in one sentence is M = N/K. Let t_j, j ∈ [1, M] denote the j-th token in one sentence; then we have

t_j = [x_{(j-1)K+1}, x_{(j-1)K+2}, ..., x_{jK}].

The rationale is to treat K consecutive points as a unit and learn the features or patterns that it possesses.
The TF data are normalized before training. Each point is scaled into [0, 1] by min-max normalization as follows,

x_i ← (x_i − min(X)) / (max(X) − min(X)),    (1)

A part of the tokens is masked with random numbers to let the model learn the representations in a self-supervised manner. The token t_j is masked if ξ_j ≤ α, where ξ_j is the j-th random number that follows U(0, 1), and α stands for the masking ratio. All points in a masked token are shifted by the same positive or negative random value δ, i.e.,

x_i ← x_i + δ, ∀ x_i ∈ t_j,    (2)

For η probability, the masking value δ is sampled from U(0, b); otherwise it is sampled from U(−b, 0), where b bounds the magnitude of the masking value.

Notations of General Symbols and Operations

Let m record whether the element at a specific position in the input sequence is masked. It can be defined as follows,

m_j = 1 if token t_j is masked, and m_j = 0 otherwise.

It is worth noting that instead of random points, the entire token is masked with the same value, which is crucial to the TSSN's success.
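The pre-processing described above, i.e., min-max normalization, tokenization and whole-token masking, can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is ours, and the exact distribution of the masking value δ is an assumption (here uniform on (−b, b)).

```python
import numpy as np

def preprocess(x, K, alpha=0.15, b=0.5, rng=None):
    """Tokenize one day of TF data, min-max normalize it, and mask a
    random subset of tokens (illustrative sketch; names are ours)."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(x)
    assert N % K == 0, "K must be a factor of the points per day N"
    # Min-max normalization into [0, 1], eq. (1).
    x = (x - x.min()) / (x.max() - x.min())
    tokens = x.reshape(N // K, K)           # M = N/K tokens of K points
    # Mask whole tokens with probability alpha; every point in a masked
    # token is shifted by the SAME random value (assumed here in (-b, b)).
    m = rng.random(len(tokens)) < alpha     # per-token mask indicator m_j
    delta = rng.uniform(-b, b, size=len(tokens))
    masked = tokens + m[:, None] * delta[:, None]
    return masked, tokens, m
```

Masking the entire token with one value, rather than perturbing individual points, is what forces the model to reason at the token level.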

IV. PRE-TRAINING OF TSSN AND FINE-TUNING
The proposed model mainly consists of an embedding layer, a positional encoding (PE) layer and multiple Transformer encoders.

A. Model Structure
Assume each input is denoted as x ∈ R^N. It is reshaped into a matrix X ∈ R^{M×K}, i.e., M tokens with K points in each token.
1) Embedding: Each token t is embedded into a D-dimension vector t̃ by a fully-connected layer. The parameters are trainable to improve the precision of the embedding during the learning process. Therefore, the token embedding can be given as follows,

t̃ = t W_Embedding,

where W_Embedding ∈ R^{K×D}, and the dimension of the embedded matrix X̃ becomes M × D.
2) Positional encoding: The positional information of the input sequence is encoded by a pair of sin and cos functions with different frequencies [24], i.e.,

p_{j,2d} = sin(j / 10000^{2d/D}),  p_{j,2d+1} = cos(j / 10000^{2d/D}),

where j ∈ [1, M] and d indexes the embedding dimensions. The p_{j,d} is added to X̃ element-wise, i.e., X̃ ← X̃ + P. It is worth noting that each point in a token must have the same PE value, implying that each point has the same impact on tokens in other locations. As a result, PEs include token-level positional information rather than point-level positional information.
3) Transformer Encoder: X̃ is then processed by Y transformer encoders with L-head attention, where D/L ∈ Z. X̃ is transformed by fully-connected layers into L query, key and value matrices, denoted as Q_l, K_l and V_l, l ∈ [1, L], respectively. Let n = D/L; then Q_l, K_l and V_l can be calculated as follows,

Q_l = X̃ W_l^Q, K_l = X̃ W_l^K, V_l = X̃ W_l^V,

where W_l^Q, W_l^K, W_l^V ∈ R^{D×n}. The l-th attention head H_l can be calculated as follows,

H_l = softmax(Q_l K_l^T / √n) V_l,

where H_l ∈ R^{M×n}. Then, the multi-head attention is calculated by the concatenation of the L heads with a linear projection by a fully-connected layer, i.e.,

A = [H_1, H_2, ..., H_L] W^O,

where W^O ∈ R^{D×D} and the dimension of A is equivalent to that of X̃. Residual connection and layer normalization L(·) are applied for better convergence, according to [24]. Let X̃ ← L(X̃ + A); then a two-layer position-wise feed-forward network is applied after X̃ to get the embedding representations of the tokens,

Ȳ = L(X̃ + g(X̃ W_1 + b_1) W_2 + b_2),

where g represents the activation function. The encoded representations of each token in a sentence are the transformer encoders' outputs Ȳ. As a result, this layer is known as the TSSN representation layer. To retrieve the final sequence of TF data, instead of using transformer decoders, a fully-connected layer is utilized to reduce the dimension and flatten the vector, i.e.,

y = Ȳ W_3 + b_3,

where W_3 ∈ R^{D×K}, b_3 ∈ R^{K×1}, and the final dimension of y remains the same as the original input x.
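A minimal PyTorch sketch of this structure (linear token embedding, token-level sinusoidal PE, stacked transformer encoders, and the final fully-connected layer mapping each token back to K points). This is not the authors' implementation: the default hyper-parameters and the use of the library's built-in `nn.TransformerEncoderLayer` are our assumptions.

```python
import torch
import torch.nn as nn

class TSSN(nn.Module):
    """Sketch of the described structure; defaults are assumptions."""
    def __init__(self, K, M, D=64, heads=8, layers=3):
        super().__init__()
        self.embed = nn.Linear(K, D)                 # W_Embedding: K -> D
        # Token-level sinusoidal PE: one vector per token position j.
        pe = torch.zeros(1, M, D)
        j = torch.arange(M, dtype=torch.float32).unsqueeze(1)
        d = torch.arange(0, D, 2, dtype=torch.float32)
        angle = j / torch.pow(torch.tensor(10000.0), d / D)
        pe[0, :, 0::2] = torch.sin(angle)
        pe[0, :, 1::2] = torch.cos(angle)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(
            d_model=D, nhead=heads, batch_first=True,
            dropout=0.1, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(D, K)                   # W_3, b_3: D -> K

    def forward(self, x):                            # x: (batch, M, K)
        h = self.embed(x) + self.pe                  # embedded tokens + PE
        h = self.encoder(h)                          # representations
        return self.out(h)                           # (batch, M, K)
```

During pre-training the output is compared against the masked tokens; during fine-tuning the final layer `out` would be replaced by a task-specific head.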

B. Pre-training
The pre-training stage and the fine-tuning stage are the two stages of the training process. The pre-training stage enables TSSN to extract implicit token features in a task-independent manner using MTP, which involves predicting tokens that are randomly masked. As a result, the mean square error (MSE) between the predicted data and the true data that are masked is used as the pre-training loss function, i.e.,

L(θ) = (1 / Σ_j m_j) Σ_{j: m_j = 1} ||y_j − t_j||²,    (12)

where θ stands for all trainable parameters of the model. Only masked tokens are considered in the loss calculation, according to eq. (12). TSSN can learn more abstract and long-term features because the masked tokens are chosen at random and consist of numerous consecutive points.
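The masked-token loss of eq. (12) can be sketched as follows; the tensor shapes (batch, M tokens, K points) are our assumptions.

```python
import torch

def mtp_loss(pred, target, mask):
    """MSE over masked tokens only, as in eq. (12).
    pred/target: (B, M, K) predicted and true token values.
    mask:        (B, M) boolean, True where the token was masked."""
    sq_err = (pred - target) ** 2   # squared error for every point
    return sq_err[mask].mean()      # average over masked tokens only
```

Unmasked tokens contribute nothing to the gradient, so the model is trained purely to reconstruct the hidden tokens from their context.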

C. Fine-tuning
To fine-tune TSSN for various downstream tasks, the TSSN's final fully-connected layer is replaced by new layers with untrained parameters. For the fine-tuning tasks, the pre-training model only produces the embedding representations Ȳ of each token in a sentence. With task-specific labeled data, only a little further training is required. Furthermore, the loss function is determined by the downstream task's purpose. Only the parameters of the posterior layers are updated during the fine-tuning process, while the parameters of the pre-training model stay intact.
To evaluate the performance of TSSN, three types of downstream tasks are considered: TF classification, prediction and completion.
1) Traffic flow classification: For S-category classification tasks, two fully-connected layers are attached after Ȳ, i.e.,

y^C = softmax(g(Ȳ W_1^C + b_1^C) W_2^C),

where W_1^C ∈ R^{D×D_C-hid}, b_1^C ∈ R^{D_C-hid×1}, and W_2^C ∈ R^{D_C-hid×S}. The final output y^C becomes an S-dimension vector, each element y_s^C of which represents the probability of being classified into the s-th category.
The labeled dataset for the classification task is denoted as {(x, ρ)}, where ρ_s ∈ {0, 1} and Σ_{s=1}^S ρ_s = 1. ρ stands for the one-hot encoding of the categorical label, where ρ_s = 1 means x belongs to category s. Then, the loss function is defined by the cross entropy between the predicted labels and the true labels, i.e.,

L_C = − Σ_{s=1}^S ρ_s log y_s^C,

2) Traffic flow prediction: To make precise predictions on the future horizon of TF based on existing data, TSSN is re-constructed by connecting two dense layers after the representation layer. For a P-horizon prediction task, the output y^P is given as follows,

y^P = g(Ȳ W_1^P + b_1^P) W_2^P + b_2^P,

where W_1^P ∈ R^{D×D_P-hid}, b_1^P ∈ R^{D_P-hid×1}, W_2^P ∈ R^{D_P-hid×P} and b_2^P ∈ R^{P×1}, respectively. The dimension of the output is equivalent to the length of the horizon that is required to be forecasted. The loss function is defined as the MSE between the predicted values and the true values, i.e.,

L_P = (1/P) ||y^P − x^P||²,

where x^P contains the true values of the predicted horizon.
3) Traffic flow data completion: The TF data completion task fills in the blanks among existing TF values. The positions of missing data are sequentially chosen with randomness. Assuming that the total missing number is C, the task's fine-tuning model is as follows,

y^Cm = g(Ȳ W_1^Cm + b_1^Cm) W_2^Cm + b_2^Cm,

where W_1^Cm ∈ R^{D×D_Cm-hid} and W_2^Cm ∈ R^{D_Cm-hid×K}. The output dimension is equivalent to the number of TF points in one sentence. However, the loss computation only considers the TF values at the missing positions. A fixed value β is used to replace the missing data. Assume that the missing places are represented by a binary vector m^Cm of N elements, each of which is specified as follows,

m_i^Cm = 1 if x_i is missing, and m_i^Cm = 0 otherwise.

Then, the loss function can be defined using MSE as follows,

L_Cm = (1/C) Σ_{i=1}^N m_i^Cm (y_i^Cm − x_i)²,

In contrast to the pre-training task, missing data is substituted with a fixed value instead of a random number. Furthermore, unlike the prediction task, which merely outputs the numbers at the expected positions, the output y^Cm is a whole sentence.
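As a concrete illustration of such a fine-tuning head, here is a minimal sketch for the S-category classification case. The paper does not specify how the token representations Ȳ are reduced before the first fully-connected layer, so the mean-pooling over tokens below is our assumption, as are the class name and default sizes.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Hypothetical S-category fine-tuning head: token representations
    (B, M, D) are pooled over tokens (pooling choice assumed), then
    passed through two fully-connected layers W_1^C and W_2^C."""
    def __init__(self, D=64, hidden=64, S=2):
        super().__init__()
        self.fc1 = nn.Linear(D, hidden)   # W_1^C: D -> D_C-hid
        self.fc2 = nn.Linear(hidden, S)   # W_2^C: D_C-hid -> S

    def forward(self, reps):              # reps: (B, M, D)
        h = reps.mean(dim=1)              # pool tokens -> (B, D)
        return self.fc2(torch.relu(self.fc1(h)))  # logits over S classes
```

During fine-tuning, only these head parameters are updated; the pre-trained encoder would be frozen, e.g. via `for p in encoder.parameters(): p.requires_grad = False`.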
V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experiments setup
1) Pre-training: Five distinct values of K are evaluated to verify the performance of TSSN: K = 1 (five-minute TF value), K = 2 (ten-minute TF value), K = 6 (half-hour TF value), K = 12 (one-hour TF value) and K = 24 (two-hour TF value). D = 64 is the embedding dimension, which means that each token is represented by a 64-dimension vector. The source sentence embeddings are fed into three repeated transformer encoders with multi-head attention. The TF data come from the Caltrans Performance Measurement System (PeMS) [25], which has a large number of detectors installed throughout California's freeway system. TF data are collected every five minutes, which results in 60/5 = 12 points per hour and N = 12 × 24 = 288 points per day. The original data are discrete values that have been smoothed with an 11-point window that averages the five points before and after the current value. Pre-training uses a total of 3,506,846 points of TF data, with 80 % designated as training data and the remaining 20 % as validation data. The TF data are normalized according to eq. (1) before training. The pre-training data are randomly masked according to eq. (2), where the mask rate α = 15 % and the boundary of the masking value b = 0.5.
The TSSN is trained for 50 epochs with batch size 128 for each K value. Each epoch of training is followed by a validation process. The Adam optimizer is used, with an initial learning rate of 10^−4, which is warmed up throughout the first three epochs and then automatically and adaptively lowered. For all the fully-connected layers, the dropout rate is set at 0.1. The Gaussian error linear unit (GELU) activation function [26] is used throughout the whole TSSN architecture. All of the tests in this paper were run on NVIDIA Tesla graphics processing units (GPUs) using CUDA 10 and PyTorch.
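The warmup-then-decay schedule could be sketched as follows. The paper only states that the rate is warmed up for three epochs and then adaptively lowered, so the inverse-square-root decay form below is our assumption, as is the function name.

```python
def lr_at(step, warmup_steps, base_lr=1e-4):
    """Learning-rate schedule sketch: linear warmup to base_lr over
    warmup_steps, then an assumed inverse-square-root decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    return base_lr * (warmup_steps / (step + 1)) ** 0.5  # slow decay
```

In practice this would be wired into the Adam optimizer via a per-step `torch.optim.lr_scheduler.LambdaLR` callback.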
2) Fine-tuning: The initial learning rate is reduced to 2 × 10^−5 during the fine-tuning step to keep the learned parameters as close to the optimal values as possible. All the other hyper-parameters are unchanged from the pre-training stage. Besides, weekday-weekend classification, P-horizon data prediction and C-point data completion are the three types of downstream tasks that are evaluated.
a) Weekday-weekend classification: One of the categorical downstream tasks of TSSN is the weekday-weekend classification task. Weekday and weekend TF data have distinct properties and patterns. The purpose is to classify a day as either a weekday or a weekend day according to the TF data. This task makes use of a new dataset called the Seattle Inductive Loop Detector Dataset [27], [28]. The data are gathered in the same manner as PeMS. Meanwhile, each day is assigned a weekday or weekend label, with ρ = [0, 1] denoting weekday and ρ = [1, 0] denoting weekend. The hidden size is D_C-hid = 64. There are up to 100 fine-tuning epochs, each of which is followed by a validation phase. Two baselines are introduced, namely a three-layer fully connected (FC) network and a gated recurrent unit (GRU) network [29], to compare numerous metrics such as average classification accuracy, precision, recall, F1 score and Kappa coefficient.
b) P-horizon prediction: Short-term and long-term predictions are considered as downstream prediction tasks. For short-term predictions, P is set to 1, 6 and 12, which represent 5-min, 30-min and one-hour horizon predictions, respectively, while P = 36 (3-h horizon) and P = 72 (6-h horizon) stand for the long-term predictions. The hidden size D_P-hid is set to 64, and the fine-tuning processes end only when the validation losses have stabilized. A bidirectional long short-term memory (Bi-LSTM) network [27] and a Transformer network [24] are among the comparative approaches.
Through bidirectional feedback loops, Bi-LSTM is specifically built to capture patterns in sequence learning tasks. The metrics include mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE) and R 2 .
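These metrics can be computed as follows, in a plain NumPy sketch; the convention of reporting MAPE as a percentage over non-zero true values is assumed.

```python
import numpy as np

def metrics(y_true, y_pred):
    """MAE, RMSE, MAPE (in %) and R^2 for a pair of TF sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = 100.0 * np.abs(err / y_true).mean()   # assumes y_true != 0
    r2 = 1.0 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    return mae, rmse, mape, r2
```

The MAPE = 20 % availability boundary used in the analysis below corresponds to `mape <= 20.0` under this convention.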
c) C-point data completion: Under varying values of K, two tasks, namely 12-point (1 h) and 36-point (3 h) data completion, are tested. The missing values are replaced with β = 0, and the hidden layer's size is set to D Cm-hid = 64. The baselines are a three-layer FC network and a Transformer network [24]. In addition, TSSN and a GAN-based model proposed in [19] are compared. The metrics are still the same as for prediction tasks.
B. Results and analysis
1) Convergence: After multiple epochs of training, all of the pre-training models converge to a low loss value and remain stable, as illustrated in Fig. 2(a). It is obvious that tokens with a smaller number of points (K value) yield a lower loss value. This is because a higher K value makes learning more difficult, although it can capture long-term temporal correlations.
After the first 20 epochs of training, the validation accuracies for all K values exceed 95 %, and when converged, they reach above 98 % for the classification task. The FC model takes longer to converge than any of the TSSN models, but the ultimate result is satisfactory. The GRU slowly converges to a high loss value, making it unsuitable for classifying weekdays and weekends. Figures 2(c) and 2(d) show that for both prediction and completion tasks, each model can converge to a low loss value, although the convergence speeds are different.
2) Weekday-weekend classification: Table III displays the classification task's results, where TSSN(K = 2) achieves the highest accuracy. All TSSNs outperform FC and GRU networks in terms of accuracy, implying that TSSN can greatly improve classification task performance. The Kappa coefficients also suggest that all of the TSSNs are valid and dependable, whereas the classification consistency for FC network and GRU network can only be considered sufficient and moderate, respectively.
3) Traffic flow prediction: The literature generally considers that for TF data, P ≤ 12 (1 h) stands for short-time prediction and P > 12 represents long-time prediction. MAPE = 20 % is generally considered the boundary between model availability and unavailability. The results illustrated in Fig. 3 to Fig. 6 come from the TF data of a week chosen at random. a) Short-time prediction: Table IV shows that Bi-LSTM performs best for the 1-horizon (5 min) prediction task. On the MAPE metric, the best-performing TSSN (K = 12) trails Bi-LSTM by around 61.22 %. This is because Bi-LSTM is better at capturing short-term features, whereas TSSNs are better at capturing intermediate- and even long-term characteristics. Since the MAPE approaches 20 %, the Transformer network, as well as the TSSN (K = 1), which only considers individual points and ignores the intrinsic correlations of consecutive points, is hardly capable. The same tendencies can be seen in Fig. 3, where Bi-LSTM fits the true data best, while all other models deviate from the real values in some way.
On MAPE, TSSNs begin to outperform Bi-LSTM for the 6-horizon (30 min) and 12-horizon (1 h) prediction tasks, demonstrating that TSSN can capture more accurate features than Bi-LSTM. This is because TSSN can learn the representations of TF tokens and thus better extract the internal patterns of TF. However, it is not obvious that a higher K value equates to superior longer-horizon prediction ability. In general, though, for prediction tasks beyond one horizon, the TSSNs with multiple points per token (K > 1) outperform TSSN (K = 1) and the other two baselines.
b) Long-time prediction: The MAPEs of all the TSSNs from K = 2 to K = 24 are less than 20 % for the 36-horizon (3 h) prediction task. TSSN (K = 12) has the best performance among all of them. TSSN (K = 24) falls slightly behind TSSN (K = 12), by 4.7 %. The MAPE of TSSN (K = 1) exceeds 20 % by 9.7 %, indicating that it is unavailable and demonstrating the utility of the token-based pre-training model for long-term prediction tasks. For this task, the Bi-LSTM and Transformer models are completely unavailable. Similar trends can also be seen in Fig. 6.
The prediction task with a 72-horizon (6 h) can be much more challenging than the 36-horizon task. The best-performing model is TSSN (K = 12), whose MAPE is just 0.5 % higher than the boundary. Meanwhile, over MAPE, TSSN (K = 24) falls behind TSSN (K = 12) by 0.8 %. The other models, particularly the two baselines, are insufficient for this task.
In Fig. 7, TSSN (K = 24) is chosen to compare with TSSN (K = 1), Bi-LSTM and Transformer over several lengths of predicted horizon under varied metrics. The Bi-LSTM performs best at the start, i.e., P = 1. Then it degrades to the worst in terms of long-term prediction. In comparison to the Transformer model, the results of TSSN (K = 1), which merely applies pre-training processing, remain dismal at all lengths of prediction horizon. This suggests that the Transformer structure alone is not well suited for TF prediction tasks, and that employing pre-training alone cannot improve prediction task performance. Except for P = 1, TSSN (K = 24) performs the best. The results clearly suggest that the proposed token-based approach can greatly improve prediction capabilities.
4) Traffic flow completion: Three typical patterns are identified for both 12-point and 36-point completion: the upslope pattern, the stationary pattern and the downslope pattern. These patterns represent the TF's increasing, stable, and decreasing trends, respectively. To demonstrate the performance of the TF completion task, a random day is chosen. Table V shows that the overall completion performance of TSSN (K = 24) is the best among all the models. The MAPEs of all TSSN models are less than 20 % for 12-point completion. However, because their MAPEs are below the availability boundary, only TSSN (K = 12) and TSSN (K = 24) can be used in the 36-point completion task.
In [19], the MAPE of 36-point completion is slightly higher than that of TSSN (K = 24). Figure 8 shows that TSSN (K = 24) can fit the missing values well in the 12-point completion task. The results of the 36-point completion tasks reveal that all models can fit the missing values with a minor bias for both the upslope pattern and the downslope pattern, as shown in Fig. 9. TSSN (K = 24) fits the missing values best on the upslope pattern, but not well enough on the downslope pattern. This is caused by the randomness of the fine-tuning process. In the case of the stationary pattern, only TSSN (K = 24) can complete the missing values with the least amount of error. The reason is that both the upslope pattern and the downslope pattern are generally linear with little fluctuation, whereas the stationary pattern has substantial high-frequency fluctuation, making it impossible for the simpler models to capture the features.
All of the results demonstrate that the proposed TSSN is capable of extracting high-level and long-term features, making it ideal for a wide range of downstream tasks.

VI. CONCLUSION
In this paper, a token-based SSL network, i.e. TSSN, for TF analysis with a unique pretext task, i.e. MTP, has been proposed. TSSN segments TF data into tokens and performs token-level operations such as positional encoding. Then, MTP is designed to mask tokens at random and let TSSN forecast these tokens according to the contextual tokens. Therefore, rather than focusing on point-level correlations and high-frequency details, MTP enables TSSN to precisely capture the high-level semantics of TF. As a result, TSSN can attain great performance while retaining a significant ability to generalize. The results demonstrate that TSSNs, especially TSSN (K = 24), outperform traditional task-specific models on downstream tasks such as TF classification, prediction and completion.