Spatiotemporal Graph Convolutional Recurrent Neural Network Model for Citywide Air Pollution Forecasting

Citywide Air Pollution Forecasting tries to precisely predict the air quality multiple hours ahead for the entire city. This topic is challenged since air pollution varies in a spatiotemporal manner and depends on many complicated factors. Our previous research has solved the problem by considering the whole city as an image and leveraged a Convolutional Long Short-Term Memory (ConvLSTM) model to learn the spatiotemporal features. However, an image-based representation may not be ideal as air pollution and other impact factors have natural graph structures. In this research, we argue that a Graph Convolutional Network (GCN) can efficiently represent the spatial features of air quality readings in the whole city. Specially, we extend the ConvLSTM model to a Spatiotemporal Graph Convolutional Recurrent Neural Network (Spatiotemporal GCRNN) model by tightly integrating a GCN architecture into an RNN structure for efficient learning spatiotemporal characteristics of air quality values and their influential factors. Our extensive experiments prove the proposed model has a better performance compare to the state-of-the-art ConvLSTM model for air pollution predicting while the number of parameters is much smaller. Moreover, our approach is also superior to a hybrid GCN-based method in a real-world air pollution dataset.


Introduction
Air Pollution is a severe problem for many big cities in the world. Accurately predicting air quality multiple hours ahead is a challenging task in recent years. One of the most concerning problems is that air pollution varies by both spatial and temporal forms. As pointed in recent research papers [1], [2], [3], [4], [5], the air cleanness in a city changes from one location to other locations and time by time. Therefore, we need a spatiotemporal architecture to model air pollution features efficiently and effectively.
Our previous paper in [1] illustrated that many spatiotemporal factors can affect a city's air quality. These factors include meteorological factors such as rain, humidity, wind speed, wind direction; transportation factors like the traffic volume or vehicle driving speed; and external factors such as air pollution from nearby cities or areas. These mentioned spatiotemporal influences make the air pollution forecasting problem harder; thus, an air quality prediction model should effectively represent the air pollution values and their spatiotemporal impact factors. Our past research introduced a large-scale real-world air pollution dataset collected from Seoul city in South Korea to tackle the spatiotemporal air pollution forecasting task. Using the collected Seoul data, we proposed to use a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) in a ConvLSTM model to interpolate and predict air pollution for the entire city by leveraging an image-based approach. The ConvLSTM model outperformed other Deep Learning models in forecasting air pollution of Seoul city in 12 hours ahead. However, the image-based representation may not capture well the natural graph structures of air pollution observation stations in a city and other impact factors such as meteorology or traffic. This research tries to solve and overcome the mentioned problem by leveraging a graphbased method. Figure 1: A Graph structure constructed for the air pollution forecasting problem in Seoul city, South Korea. Each node denotes an air pollution monitoring station. The labels of edges are the real-world distances between stations (some are omitted).
In Fig. 1, we show a graph structure constructed for the air pollution forecasting problem in Seoul city, South Korea. Each node denotes an air pollution monitoring station, and edges are the connections between two stations weighted by their distance. Inspired from recent success in using Graph Convolutional Networks (GCNs) for spatiotemporal problems and derived from the spatiotemporal basis of air pollution data and its influential factors, we propose a Spatiotemporal Graph Convolutional Recurrent Neural Network (Spatiotemporal GCRNN) model. Our model uses a GCN structure to encode the spatial feature and an RNN architecture to model the time-series features of the air pollution data. The most crucial point is that we tightly integrate the GCN architecture into the RNN structure to have a unified unit that efficiently learns the spatiotemporal features at the same time. Our experiments prove that the proposed method yields better performance than previous approaches to Seoul data while the parameters are much smaller.
In the ConvLSTM paper for air pollution interpolating and forecasting [1], a large scale dataset of air pollution and spatiotemporal related factors was constructed. This dataset includes air pollution values in monitoring stations collected in Seoul city for three years, from 2015 to 2017, and many air quality impact factors that contribute to building an accurate air pollution prediction model. However, we believe that an even larger dataset will accelerate this field of research with better prediction models. Therefore, in this study, we collect more data from Seoul city in two recent years of 2018 and 2019 with all aforementioned spatiotemporal factors.
The new dataset has more extended period of data (from 2015 to 2019), more data points (a 75% bigger) which brings a more significant training data for both air pollution and spatiotemporal research.
In summary, our contributions in this paper are three folds: • Firstly, we introduce Spatiotemporal Graph Convolutional Recurrent Neural Network model, a unified integration of GCN and RNN architectures, which works efficiently and effectively for spatiotemporal air pollution forecasting problems.
• Secondly, we conduct extensive experiments to prove the proposed method produces better performance than the state-of-the-art ConvLSTM model and outperforms a recent GCN based approach. We implemented the two most common graph convolutional operators, such as spectral graph and diffusion graph convolution, in our model.

•
Lastly, we enlarge the previous large-scale dataset of Seoul data to a new massive dataset to bring researchers a large-scale data source for both air pollution and spatiotemporal related problems.
The rest of the paper is structured as follows: First, we review relating literature in the related work section. The following are detailed descriptions of our proposed method and architecture. The empirical experiments are then presented and evaluated. Finally, we conclude and discuss our future research directions.

Related work 2.1. Spatiotemporal Deep Learning for Air Pollution Forecasting
Many Deep Learning based models have been adopted for spatiotemporal air pollution prediction. In [2], the authors proposed a method that used a spatial predictor as a neural network to model the spatial correlations at different locations. Their approach defined spatial neighbors of a station within three circles from farthest (e.g., 300km) to nearest (e.g., 30km), and then aggregated the air pollutant values and related factors such as meteorological data in these circles. Their spatial learning algorithm is complicated and uses hand-crafted features (e.g., the circles' diameters). In contrast, our approach uses graph convolution to automatically learn the graph structures of air pollution stations and related factors in any locations in a city.
Similarly, in [3], the authors introduced a real-time air pollution prediction model based on big data; and in [4], an LSTM-based encoder-decoder method was employed to forecast air pollution in South Korea. A more recent research [5] models air pollution influential factors using a multi-modal approach for air pollution forecasting in Seoul city. Nevertheless, none of those aforementioned solutions leverages the graph structure for air pollution data as in our current research.
Air pollution forecasting is also a time series prediction problem. There are some well-known machine learning algorithms for time series forecasting such as Autoregressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR), and Recurrent Neural Networks (RNNs). In [5], the authors did experiments to show that the ARIMA, SVR, and pure RNN models were inefficient in middle-to-long-term air pollution prediction since they could not exploit spatiotemporal information from many impact factors. On the other hand, some RNN variants such as Long Short-Term Memory (LSTM) [15] and Gated Recurrent Units (GRUs) [16] enable us to handle longer training sequence and create more accurate prediction for long-term foreseeing. Many recent papers have leveraged and proved the advantages of LSTM and GRUs structures in air pollution forecasting [1], [4], [5]. In this paper, we follow previous papers and propose to use GRUs as an RNN architecture to learn temporal features of air pollution data.

Spatiotemporal Graph Convolutional Networks
In recent years, Convolutional Neural Networks (CNNs) have been generalized to arbitrary graphs based on the spectral graph theory. Graph Convolutional Networks (GCNs) were first introduced in [6], which connects the spectral graph theory and deep neural networks. Paper [7] proposes ChebNet model to improve GCN with fast localized spectral convolutional filters. Paper [8] simplifies ChebNet by parameterizing each spectral filter by the first order Chebyshev polynomials and achieves state-of-the-art performance in semi-supervised classification tasks.
GCNs are now trending for spatiotemporal problems like traffic forecasting. In [9], the authors represented the pair-wise spatial correlations between traffic sensors using a directed graph in which nodes are sensors, and edge weights are closeness between the sensor pairs denoted by the roads network distance. The proposed model was a graph diffusion convolution model to capture the spatial dependency of traffic sensors. Many following papers also suggest using GCN in traffic forecasting and produce reasonable results [10], [11].
Similarly, GCNs have been started using in air pollution forecasting problems. A paper in [12] proposed a geocontext based diffusion convolutional recurrent neural network (GC-DCRNN) model for forecasting short-term PM 2.5 concentrations. In the paper [13], the researchers introduced to use a sequential combination of graph convolutional neural network and long short-term memory model for spatiotemporal PM 2.5 forecasting in 72 hours ahead. A newer paper [14] defined a knowledge enhanced graph neural network model for PM 2.5 forecasting. These papers both leveraged the GCN model as the spatial representation for air pollution features but still have shortcomings compared to our approach.
The geo-context based method in [12] tried to use geographic features around monitoring stations such as roads, buildings, green lands, or water areas to model the spatial relationships between stations. It also exploited a graph convolutional recurrent neural network as in our system. The biggest problem is that they used static data such as geographic features collected from OpenStreetMap data. In contrast, we use dynamic and real-time data like traffic volume and average vehicle speeds on roads. Moreover, in crowded cities like Seoul, the geo-context around each monitoring station may be similar because of the high density of buildings and roads. Therefore, the geo-context is reduced to the real-world distances between stations as in our method.
The approach in [13] firstly extracted the spatial features of input data by graph convolution operations, and then fed the extracted features into an LSTM model to learn the temporal features. The final layer was a fully connected layer to regress the PM 2.5 values. This paper's input data was air pollution values and weather data in Jing-Jin-Ji (Beijing, Tianjin, and Hebei) area in North China. The distinct steps of GCN and LSTM layers in this method may not accurately capture the complex correlations of many spatiotemporal factors in the air pollution forecasting problem as in our Spatiotemporal GCRNN model can capture both spatial and temporal features at the same time. The more recent paper [14] attempted to enhance the nodes and edges of a graph by the knowledge from the air pollution domain such as the air pollution transportation between cities, the meteorological information, and the geographical knowledge (e.g., mountains). We argue that their knowledge enhance approach does not generalize well for other spatiotemporal forecasting problems and they only consider the city-level prediction, whereas we try to forecast air pollution at the station-level for a city.

Methodology
In this section, we formalize the air pollution forecasting as a GCN problem and describe how the Spatiotemporal Graph Convolutional Recurrent Neural Network model can capture well the spatiotemporal features.

Graph and Graph Convolutional Network Representations
The most important concepts in the graph theory are (Weighted) Adjacency Matrix (W ), Degree Matrix (D), and the following equations define W , D, L of a graph, and some common types of Laplacian Matrices such as symmetric normalized Laplacian (L sym ) or random walk normalized Laplacian (L rw ): The following equation denotes the kernel of a GCN in the ChebNet model [7]: where g θ is a kernel (θ represents the vector of Chebyshev coefficients) applied to Λ, the diagonal matrix of Laplacian eigenvalues (Λ represents the diagonal matrix of scaled Laplacian eigenvalues,Λ = 2Λ/λ max − I n , λ max is the largest eigenvalue of L). k is the smallest order neighborhood, and K indicates the largest order localization. Finally, T stands for the Chebyshev polynomials of the kth order.

Problem Formalization
The goal of an air pollution forecasting task is to predict the future air quality values given previously observed air pollution from N correlated air pollution monitoring stations. We can represent the monitoring stations network as a weighted graph G = (V, E, W ), where V is a set of nodes or stations, |V | = N , E is a set of station relationships (edges), and W ∈ R N ×N is a weighted adjacency matrix denoting the stations correlations. The distance of two observation stations indicates the relation between two stations in the graph. Let F is the number of features in each node (e.g., the Figure 2: An air pollution forecasting task is constructed as a graph-based problem. T is the number of historical graph signals, T is the number of future prediction graph signals. X ∈ R N ×F is an input graph signal,X ∈ R N ×1 is an output signal, h(.) is the learned function given the graph G.
air pollution values and other impact factors values at that node), then the input graph signal of G is X ∈ R N ×F . X (t) denotes the input graph signal observed at time t, andX (t ) is the output graph signal at time t ,X ∈ R N ×1 (because we only output one estimated air pollution value for one station at a time). The air pollution forecasting problem tries to learn a function h(.) that maps T historical graph signals to future T graph signals, given a graph G: Fig. 2 shows how an air pollution forecasting task is constructed as a graph based problem with the graph signals representing the air pollution monitoring stations. Both input and output graph signals have the same graph structure.

Modeling Spatial Features
In this paper, we learn the spatial dependency between air pollution monitoring stations by representing the station network into a graph structure as in the Problem Formalization section. After that, we employ the graph convolution operators to model the spatial features of input graph signals. The extracted spatial features are used in combination with temporal features to predict future air pollution outputs. We try with two well-known graph convolution definitions, spectral graph convolution and diffusion graph convolution, for modeling spatial dependency. The formulas of two graph convolutional definition are presented below.
The spectral graph convolution [6], [7], [8] employs the concepts of symmetric normalized graph Laplacian matrix The convolution operation over a graph signal X ∈ R N ×F and a kernel g θ is defined as below: with the kernel g θ as in equation 2, θ are the learnable parameters, f ∈ {1, ..., F }. G denotes the convolution over a graph G. T 0 (x) = 1, We can approximate the kernel g θ by a truncated expansion in terms of Chebyshev polynomials T k (x) up to Kth order. The graph convolution definition is now have the form: This expression is K-localized since it is a Kth-order polynomial in the Laplacian, i.e. it depends only on nodes that are at maximum K steps away from the central node (Kth-order neighborhoods). Equation 5 is our spectral graph convolution definition used in this paper. A graph diffusion convolution was defined in the paper [9]. The diffusion process is characterized by a random walk on graph G with restart probability α ∈ [0, 1], and a state matrix D −1 W , where D is the degree diagonal matrix (see equation 1). The resulted diffusion convolution from the mentioned diffusion process is specified as: where θ are the parameters for the filter g θ and D −1 W represents the transition matrix of the diffusion process. The parameter K is the number of diffusion steps from a node to its neighborhood, which corresponds to the Kth-order in the spectral graph convolution.
Since spectral graph convolution and graph diffusion convolution are two common graph convolution operators, in this paper, we implement each operator for the GCN layer and do experiments to compare their influences to the prediction results.

Modeling Temporal Features
The previous paper [1] proved the power of a combination of CNN and RNN models in spatiotemporal forecasting problems by using convolutional operators to replace the fully connections in input-to-state and state-to-state transitions of an LSTM cell. In this paper, we also leverage that idea by replacing the matrix multiplications in a GRU cell with the graph convolution to learn spatial and temporal features of the input data simultaneously. A GRU or Gated Recurrent Unit [16] model is a variant of RNN model that can learn long-term sequence of data similar to an LSTM model but with the smaller number of parameters. A GRU cell with the graph convolution operators establishes a spatiotemporal layer that we call a Graph Convolutional Recurrent Neural Network layer, or GCRNN layer.
In detail, following are four transformation equations for reset gate, update gate, candidate values and hidden states in a GRU cell of one GCRNN layer.
where X (t) , H (t) denote the input and output at time t, r (t) , u (t) are reset gate and update gate at time t, C (t) are candidate values at time t, respectively. G means the graph convolutional operator defined by spectral graph convolution (equation 5) or diffusion convolution (equation 6). Θ r , Θ u , Θ C are parameters for the corresponding filters. denotes the Hadarmad product or element-wise matrixmatrix multiplication. σ and tanh are the activation functions.
For multiple steps forecasting, a sequence to sequence architecture (seq2seq) is applied [17]. Both the encoder and decoder of the seq2seq consist of one or many GCRNN layers. The final states of the encoder are used to initialize the initial state of the decoder. The combination of GCRNN layers and a seq2seq scheme builds up our Spatiotemporal Graph Convolutional Recurrent Neural Network (Spatiotemporal GCRNN) model. Fig. 3 shows the system architecture of our Spatiotemporal GCRNN model designed for spatiotemporal air pollution forecasting. The input is the historical time-series of air pollution represented as graph signals, and the output is the future multiple hours of air pollution prediction.

Modeling Spatiotemporal Impact Factors
As presented in previous paper [1], air pollution in a city is influenced by many spatiotemporal factors. Regarding this paper's graph based approach, each spatiotemporal impact factor will be represented as a graph signal with the same graph structure as air pollution graph. We propose a strategy to learn these impact factors and air pollution graph features efficiently.
We fuse air pollution values and all spatiotemporal influential factors as one input graph signal with multiple features at the corresponding node. Denotes X a ∈ R N ×Fa is the air pollution graph signal, X m ∈ R N ×Fm is the meteorological graph signal, X t ∈ R N ×Ft is the traffic volume graph signal, X s ∈ R N ×Fs is the average driving speed graph signal, and X o ∈ R N ×Fo is the outside air pollution graph signal. N is the number of graph nodes, F a , F m , F t , F s , and F o are air pollution, meteorological, traffic volume, average speed, and outside air pollution number of features, respectively. Then the input graph signal X of the model is a concatenation of all graph signals: ⊕ is a vector concatenation operator. Therefore, the total number of features for the input graph signals is F = F a + F m + F t + F s + F o . All graph convolution operations (equation 5 and equation 6) and GRU equations (equation 7) are not changed except the input graph signals turn to X (t) ∈ R N ×F . Fig. 4 shows the strategy of modeling spatiotemporal impact factors as the combination of input graph signals. The architecture of the Spatiotemporal GCRNN model does not change.

Experiments
In this section, we describe our extensive experiments to prove the strength of the proposed Spatiotemporal GCRNN model in spatiotemporal air pollution forecasting tasks. After introducing the experiments' setup, we conduct experiments to directly compare the performance of the Spatiotemporal GCRNN to ConvLSTM, a state-of-the-art air pollution prediction method from the previous paper [1], following are comparisons with a recent GCN-based air pollution forecasting model. Lastly, we ablation study the effects of many GCN configurations such as Kth-order, weighted adjacency matrix construction, and different types of graph convolutional operators.

Experiments Setup
We use the Seoul dataset as in the state-of-the-art paper [1]. The Seoul dataset is a large-scale dataset that consists of 3-year spatiotemporal data in Seoul city, Korea, from 2015 to 2017. This dataset includes air pollutants, such as P M 10 , P M 2.5 ,...; meteorological data, like temperature, wind speed, wind direction, rainfall,...; traffic volume of main roads; average driving speed on roads; and the air pollution from 3 areas in China (Beijing, Shanghai, and Shandong) that affects Seoul's air quality. As mentioned in the Introduction section, in this paper, we broaden the Seoul dataset with a more extended period of air pollution and related impact factors data. Table 1 shows statistics about our new Seoul dataset compared to the existing one. Our new dataset is 83% more in the number of air pollution samples and 75% larger in the total number of all data points. Nevertheless, to keep a fair comparison with previous methods, we still do experiments with the existing Seoul dataset and reserve the new dataset for further research. We follow the same data pre-processing of ConvLSTM model from [1] for this paper's experiments, including the grid construction. Firstly, Seoul city is divided into a grid of shape 32 × 32, and each grid-cell is assigned with the corresponding observation stations or traffic survey points. This step makes our experiments in this paper comparable with the previous method. After the pre-processing, we construct a graph for the grid-level air pollution forecasting problem with each node is assigned by the corresponding cell.
We use the Tensorflow framework [18] to build the model and Nvidia GPU for training and testing. All experiments in this paper use the base learning rate of 0.001 and decay every ten epochs with the decay ratio 0.1 until reaching the minimal value of 2.0e-06. The batch size is 64, the number of GRU hidden units is 64, and the number of GCRNN layers is 2. We train with a maximum of 100 epochs but apply early stopping when the validation error does not decrease after 50 epochs. The Adam optimizer is utilized by its convergence speed and popular in deep learning optimizing problems. We also employ the Root Mean Squared Error (RMSE) loss to compare the differences between the real observed and the estimated values. The training set is two years, 2015 and 2016, and the test set is the year 2017. This split strategy ensures the training and test set have the same distribution but still makes our model a good generalization.
Regarding test set errors, we apply the following metrics to have numerous viewpoints on the testing performance. The first metric is a common regression metric such as Root Mean Squared Error (RMSE). Also, we utilized a correlation coefficient R squared (R 2 ) score, which was used in [13], to show how good is our prediction values' distribution fit to the real observed values. Moreover, the citywide spatiotemporal air pollution interpolation and prediction paper [1] presented a novel spatiotemporal metric called spRMSE (or spatiotemporal RMSE). This metric evaluates how other spatiotemporal factors impact the efficiency of prediction outputs. To calculate this measurement, we use the following process: alternately remove one of the existing air pollution values in a node of the input graph signal in test set but still keep the features of other nodes and then compute RMSE of predicted air pollution value with the groundtruth one. The final error is the mean of RMSEs after this procedure with all nodes in the testing data. We will compare the spRMSE metric values of our proposed Spatiotemporal GCRNN model to the ConvLSTM model.

Performance comparisons with the ConvL-STM model for short-term air pollution forecasting
In this section, we conduct experiments of the Spatiotemporal GCRNN model for air pollution forecasting. We use PM 2.5 air pollutants as in the experiments of the Con-vLSTM model [1]. The number of nodes of the input graph signals is 25 nodes (corresponding to 25 PM 2.5 monitoring stations). To prove the graph based model is better for spatial data, we also perform experiments with PM 10 air pollutants, which are observed by larger monitoring stations, 37 stations corresponding to 37 nodes of the input graph signals. The forecasting time-steps are similar to the paper [1], from 1 hour to 12 hours in the future. Fig. 5 shows RMSE metric values on testing data of the ConvLSTM model and Spatiotemporal GCRNN model with PM 2.5 and PM 10 air pollution forecasting from 1 hour to 12 hours (smaller is better). The Spatiotemporal GCRNN model produces slightly better performance in RMSE values compare to the ConvLSTM model with PM 2.5 air pollution prediction but achieves significant better results for PM 10 prediction. These results claim that a graphbased method can capture more information from a larger and more complex structure than a image-based approach in the ConvLSTM paper [1]. Moreover, one advantage of the graph-based model is the smaller model size. The ConvLSTM model with 12 hours forecasting has the number of trainable parameters more than 20.5M parameters while the Spatiotemporal GCRNN model has only 371K parameters, a 55x smaller. Hence, we can conclude that the Spatiotemporal GCRNN model with a much smaller size than a ConvLSTM model can produce better performance for spatiotemporal air pollution forecasting.
Regarding comparing the performance of modeling the spatiotemporal impact factors by ConvLSTM and Spatiotemporal GCRNN models, we use both RMSE and spRMSE metrics. We perform the experiments with the two types of input graph signals, air pollution only and air pollution with spatiotemporal impact factors, respectively. The performance results of Spatiotemporal GCRNN and ConvLSTM models are presented in Table 2. In the table, ConvLSTM and ConvLSTM + All are ConvLSTM models without and with spatiotemporal impact factors data, ST-GCRNN and ST-GCRNN + All are Spatiotemporal GCRNN models including only air pollution input data and including all spatiotemporal impact factors, respectively. All experiments are executed with PM 2.5 air pollution 1-hour ahead prediction (similar to experiments in [1]). The results reveal that a Spatiotemporal GCRNN model obtains better performance in terms of RMSE and spRMSE metrics than a ConvLSTM model in both input data are air pollution only and air pollution with all spatiotemporal impact factors. Even compare to the best RMSE in [1] with ConvLSTM + Meteorology data, the ST-GCRNN model still achieves smaller RMSE value (5.5947 vs. 6.5809). In addition, the Spatiotemporal GCRNN + All has a smaller spRMSE value than Spatiotemporal GCRNN without spatiotemporal impact factors. Therefore, we still recognize the efficacy of a GCNbased model in learning spatiotemporal factors for air pollution forecasting problems.

Performance comparisons with a GCN-based air pollution prediction model for long-term forecasting
In this experiment section, we choose the model in [13] as our baseline for GCN-based air pollution forecasting comparisons. As pointed out in the related work section, the GCN-based model in [13] is a Hybrid model that includes a GCN model and an LSTM layer as our system but it separates them to two consequential steps while we combine two architectures into a unified system.
We implement the Hybrid model of GCN and LSTM from the paper [13] in Seoul data. This model includes three sequential layers, the first layer is a GCN model to extract spatial features of the input data, the second one is an LSTM model with input is the extracted spatial features and output is the learned temporal features, and the last layer is a fully connected (FC) layer to regress the predictions. We implement GCN and LSTM layers of the Hybrid model with the same hyper-parameters as in our Spatiotemporal GCRNN model. We also apply the same training and testing configurations as in section 4.1. The input graph signals of the Hybrid model are identical to our model.
For performance comparisons of this Hybrid model with our proposal model, we conduct experiments for medium to long-term prediction time steps: forecasting 12, 24, and up to 72 hours in the future with both PM 2.5 and PM 10 air pollutants. Table 3 demonstrates the results on the same test set of the Hybrid model and Spatiotemporal GCRNN with PM 2.5 and PM 10 air pollution forecasting in 12, 24 and 72 hours ahead. Overall, our Spatiotemporal GCRNN model achieves better performance in PM 2.5 and PM 10 forecasting over the Hybrid model. In detail, our ST-GCRNN model obtains a larger gap of the performance in PM 2.5 forecasting than PM 10 . The reason could be that the distributions of PM 2.5 air pollution values are more complicated than the PM 10 readings and our unified combination of GCN and GRU models can exploit the complex correlations of PM 2.5 air pollution better than sequential steps as in the Hybrid model.  We prove the above assessment by doing statistical analysis for PM 2.5 and PM 10 air pollutants in Table 4. We compute the mean and standard deviation values for the following statistical measurements of the data: data points changing by time (which means the distribution of air pollution readings at each station over the time), and data points changing by locality (equals to the distribution of air pollution values over all stations in a city at a time). The statistic values are computed based on the [0-1] normalized data of PM 2.5 and PM 10 readings. From the table, the PM 2.5 air pollutant data has a higher standard deviation than the PM 10 regarding changing by time or by locality. It means the distribution of PM 2.5 values spreads around the mean value more than PM 10 data; thus, they need a more efficient architecture to learn their correlations. Regarding the less spreading PM 10 values, our model is still better the Hybrid model in 12, 24 and 72 hours forecasting. From these experiments' results, we can conclude that the tight integration of GCN and RNN models in Spatiotemporal GCRNN architecture learns the spatiotemporal air pollution data better than separated layers in a GCN-based hybrid approach. Furthermore, the number of trainable parameters of the Hybrid model is 518K parameters, a 1.5x larger our model.

Ablation study the effects of different GCN configurations
To evaluate the effects of different GCN configurations toward the forecasting results, we experiment with the following parameters: K-order to explore the impact of the graph convolutional filter's receptive fields; weighted adjacency matrix W threshold to vary the sparsity of W ; and the graph convolution definitions (spectral or diffusion graph convolution).
The K-order determines the number of nearest neighbors of a node that will be considered as the filter's receptive fields in graph convolutional operation. When K increases, the size of receptive fields larger, and the graph can capture wider spatial dependency at the cost of increasing learning complexity and the training time. Fig. 6 shows the results with different K on the validate set. We observe that with the increase of K, the validation loss first quickly falls (K from 1 to 2), and then decreases slightly (K ≥ 3), while the training time per epoch rises a lot. In this paper, we use K = 2 as the default K-order value to balance between the validation loss and the training time. Regarding computing the weighted adjacency matrix W , we use a threshold to control the sparsity of W . All weights less than will be set to zero. A weighted edge equals to zero also means there is no relationship between two nodes of that edge. In graph based problems, W can be built using a thresholded Gaussian kernel [19].
denotes the distance between 2 stations in station-level or 2 grid-cells in grid-level. σ is the standard deviation of distances and is the threshold. In this paper, we experiment with three decreasing values of : 0.1, 0.01, and 0.0 (no threshold). From the experiments' results in Fig. 7, we can observe that when decreases from 0.1 to 0.0, the validation loss decreases, but the training time for one epoch also goes up. Hence, we use = 0.01 as the default threshold in building the adjacency matrix for the graph since it has the trade-off between the validation errors and the training time.
For the last ablation study, we explore the effects of different graph convolution operators such as spectral convolution and diffusion convolution. The spectral convolution operation is implemented as in equation 5 while the diffusion convolution is performed following equation 6. From [9], the diffusion convolution implementation includes two cases, the first is a random walk diffusion, and the second is a dual random walk diffusion with the diagonal matrix D in equation 6 is split into 2 matrices: an out-degree diagonal matrix and an in-degree diagonal matrix. Fig. 8 shows RMSE values for these studying graph convolution operations with three time steps 1, 6, and 12 hours in PM 2.5 forecasting. As a results, the dual random walk diffusion convolution achieves a slightly better performance than other convolutional implementations.

Conclusion
In this paper, we introduce a spatiotemporal GCN-based model for the citywide air pollution forecasting. The new proposal model named Spatiotemporal Graph Convolutional Recurrent Neural Network is a deep integration of a GCN model for spatial features extraction and an RNN model for temporal features learning. The Spatiotemporal GCRNN model has the size 55x smaller than the state-ofthe-art ConvLSTM model for air pollution forecasting but produces better results. Through a series of empirical experiments, we prove that our Spatiotemporal GCRNN model has better performance than the ConvLSTM model in short-term air pollution prediction, and also achieves superior results in medium to long-term air pollution forecasting compare to a GCN-based hybrid model that separates GCN and LSTM in discrete layers.
Moreover, in this research, we expand the previous Seoul data from 3 years time period to a 5-year dataset (75% bigger). We hope that this large-scale dataset will become a valuable data source for researchers in air pollution as well as spatiotemporal based research. We will public the new dataset along with this paper.
In the future, we will enhance our research in the following directions: first, exploiting advanced graph convolution network models such as spatial-based graph convolution or graph attention networks to achieve better performance; second, due to the small size of the graph-based model, we can deploy the trained model to build real-time spatiotemporal forecasting for urban intelligent systems.