When Is It Likely to Fail? Performance Monitor for Black-Box Trajectory Prediction Model

Wenbo Shao, Boqi Li, Wenhao Yu, Jiahui Xu, Hong Wang, Senior Member, IEEE

Abstract—Accurate trajectory prediction is vital for various applications, including autonomous vehicles. However, the complexity and limited transparency of many prediction algorithms often result in black-box models, making it challenging to understand their limitations and anticipate potential failures. This further raises potential risks for systems based on these prediction models. This study introduces the Performance Monitor for Black-Box Trajectory Prediction Model (PMTM) to address this challenge. The PMTM estimates the performance of black-box trajectory prediction models online, enabling informed decision-making. The study explores various methods' applicability to the PMTM, including anomaly detection, machine learning, deep learning, and ensemble methods, with specific monitors designed for each method to provide real-time output representing prediction performance. Comprehensive experiments validate the PMTM's effectiveness, comparing different monitoring methods. Results show that the PMTM achieves promising monitoring performance, particularly excelling in deep learning-based monitoring. It achieves improvement scores of 0.81 and 0.79 for average prediction error and final prediction error monitoring, respectively, outperforming previous white-box and gray-box methods. Furthermore, the PMTM's applicability is validated on different datasets and prediction models, while ablation studies confirm the effectiveness of the proposed mechanism. Hybrid prediction experiments further demonstrate the value of the proposed monitor from an application perspective.
Note to Practitioners—This research presents the PMTM, a valuable tool for practitioners in the automation industry. The PMTM enables real-time monitoring of black-box trajectory prediction models, enhancing system reliability and facilitating informed decision-making. The practical application of the PMTM lies in improving safety and reliability in critical domains, especially in the context of autonomous vehicles. Black-box trajectory prediction models commonly used in these domains may exhibit unexpected deficiencies, potentially leading to risks. By monitoring the prediction performance online, systems can proactively identify potential insufficiencies and make informed decisions to ensure safer and more reliable operations. The PMTM offers practitioners different monitoring solutions based on various approaches, addressing their specific needs effectively. While the PMTM has shown promising outcomes, further exploration and testing are necessary to fully harness and apply its monitoring results in automated systems. Practitioners are encouraged to adopt the PMTM as an essential monitoring mechanism to enhance the reliability of their trajectory prediction models and achieve safer and more efficient automation in their domains.

W. Shao, W. Yu, and H. Wang are with the Tsinghua Intelligent Vehicle Design and Safety Research Institute, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: swb19@mails.tsinghua.edu.cn; wenhaoyu@mail.tsinghua.edu.cn; hong wang@mail.tsinghua.edu.cn). B. Li is with the Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI 48105, USA (e-mail: boqili@umich.edu). J. Xu is with the School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China (e-mail: 13645450063@163.com).

Fig. 1. Online performance monitoring of a black-box trajectory prediction model: given the historical trajectory of predicted object A observed by the ego vehicle, the black-box trajectory prediction model outputs A's future trajectory, and the performance monitoring model warns that the predicted trajectory may be inaccurate, with an estimated error.

A. Motivation
Autonomous driving is a critical development direction that is of significant importance for improving traffic efficiency and safety. The trajectory prediction model is an essential component of autonomous driving systems, as it provides effective scenario understanding by predicting the future motions of traffic participants. It serves as a prerequisite for safe and efficient decision-making in autonomous driving. Various trajectory prediction algorithms have been developed [1]. These advancements have provided crucial support for improving trajectory prediction performance. However, each of these methods has its limitations, which may result in unexpected prediction failures in challenging scenarios. These failures can lead to inaccurate predicted trajectories, posing severe risks to autonomous driving systems [2].
In the field of automotive functional safety, the failure detection, isolation, and recovery (FDI-R) mechanism plays a crucial role in ensuring the overall safety and functionality of the vehicle [3]. An essential prerequisite for FDI-R is the effective monitoring of module failures. Potential safety risks arising from inadequate trajectory prediction performance are categorized as a Safety of the Intended Functionality (SOTIF) issue [4]. To address this, it is necessary to develop a performance monitor for trajectory prediction models to enable online detection of potential prediction failures and facilitate subsequent isolation and recovery processes.
Compared to traditional vehicle subsystems and components, the trajectory prediction model is responsible for understanding the scenario. Due to the complexity of scenarios, the randomness of traffic participant movements, and the inherent interpretability and uncertainty challenges of some prediction models, various unpredictable failures may occur, making monitoring prediction models more difficult than monitoring traditional vehicle components. Specifically, the key challenges and requirements of the monitor are identified:
• Applicability to black-box prediction models: Due to technical privacy concerns, conflicts of interest, and the growing complexity of prediction models, many trajectory prediction models suffer from a lack of interpretability, limiting the information that can be obtained during their monitoring process. Therefore, treating the trajectory prediction model as a black box when studying online performance monitoring techniques becomes necessary and valuable, where the monitor cannot access the internal design details of the prediction model.
• Generalizability to different types of prediction models: Existing prediction models exhibit various forms, structures, and principles. Multiple types of prediction models may even coexist within the same autonomous driving system [5]. It is necessary to develop a more general performance monitoring approach.
• Information supplementation: There is no need to build an additional complex and accurate trajectory prediction model. Instead, the monitor should offer crucial diagnostic information as a supplement that is important for safe and reliable driving, even if it may not be as fine-grained. This format also facilitates the use of lightweight models to meet the requirements.
In summary, existing failure detectors have become inadequate to meet the aforementioned requirements [6]. Therefore, this research aims to propose a general performance monitor for black-box trajectory prediction models (PMTM). By assessing the prediction models' performance online, it will provide essential information to the system, enhancing the safety and reliability of autonomous driving.

B. Related Work
1) Autonomous Driving Trajectory Prediction: Trajectory prediction in autonomous driving has been extensively studied using various approaches [1], [7], including physics-based and learning-based methods. Physics-based methods [8]–[10] leverage fundamental principles and equations of motion to estimate future positions and velocities of objects. These methods offer interpretable predictions and computational efficiency but may struggle to capture complex and non-linear behaviors in real-world scenarios.
Learning-based prediction can be categorized into machine learning (ML), deep learning (DL), and reinforcement learning (RL) approaches. ML-based prediction [11], [12] involves extracting hand-crafted features from past trajectories or sensor data and using statistical models to predict future trajectories. Despite their widespread use, these methods rely on manual feature engineering and may have limitations in capturing complex temporal dependencies. On the other hand, deep learning models [13]–[15], such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have gained prominence in trajectory prediction. RNNs can effectively model temporal dependencies, while CNNs process sensor data to capture relevant information. These methods automatically learn representations and are capable of handling complex scenarios, but often require large labeled datasets and computational resources. RL-based prediction [16], [17] utilizes trial and error to optimize decision-making processes. These approaches learn policies through agent-environment interactions, considering long-term rewards. Although promising for complex and uncertain situations, RL methods typically require significant computational resources and careful reward design.
Overall, continued research efforts in this field aim to enhance the capabilities of trajectory prediction systems and facilitate the widespread adoption of autonomous vehicles. However, improving trajectory prediction accuracy should not be the sole objective of trajectory prediction research, particularly as the diminishing returns on system performance from increased prediction accuracy become evident. It is crucial to shift attention toward the possibility of prediction model failures and develop targeted monitoring and mitigation strategies. Furthermore, the diverse range of prediction methods necessitates performance monitoring with a certain level of generality across different models. Moreover, as prediction models become increasingly complex and less interpretable, and considering the limited availability of detailed information for certain publicly available prediction models, it is necessary to design performance monitors specifically tailored to black-box prediction models.
2) AI-Based Model Performance Estimation: Current advancements in technology, particularly in artificial intelligence (AI), have played a crucial role in autonomous driving perception, prediction, and decision-making [18], [19]. With the rapid development of AI, estimating the performance of AI models has become increasingly important to ensure their reliability and robustness in various domains, including autonomous driving. Various techniques have been investigated to estimate the performance of AI models, with a focus on uncertainty estimation, anomaly detection (AD), outlier detection, and out-of-distribution (OOD) detection, as well as supervised learning-based approaches.
An area of research focuses on uncertainty estimation techniques [20]–[22]. These techniques aim to quantify the uncertainty associated with AI model predictions, providing insights into the model's confidence and reliability. Bayesian neural networks [23], Monte Carlo dropout [24], and deep ensembles [25] are among the methods used to estimate uncertainty in AI models. Monitoring the uncertainty estimates helps identify situations where the model may encounter ambiguous or challenging input, ensuring cautious decision-making in critical scenarios.
Another approach involves AD, outlier detection, or OOD detection techniques [26]–[28]. These methods aim to identify abnormal or unexpected patterns in the model's input data or predictions. By monitoring for anomalies or outliers, potential model failures, data shifts, or adversarial attacks can be detected, ensuring the robustness and reliability of the AI model. Techniques such as autoencoders, one-class SVMs, and generative adversarial networks have been employed for AD and outlier detection.
Supervised learning-based techniques have also been applied for performance monitoring [29]–[31]. These methods involve training additional models to evaluate and validate the primary AI model's predictions. By utilizing labeled data or expert knowledge, performance monitors can compare the primary model's outputs with the ground truth or expert judgments, detecting potential errors, biases, or limitations. These techniques improve the interpretability and trustworthiness of AI models in autonomous driving systems.
It is worth noting that the above research in AI-based model performance monitoring primarily focuses on the perception domain, particularly in classification and abnormal behavior detection tasks. However, there has been limited involvement in the trajectory prediction domain. Further research efforts are needed to explore and develop specific monitoring techniques tailored to trajectory prediction models in autonomous driving.
In conclusion, the field of AI-based model performance monitoring is actively researching various techniques, including uncertainty estimation, AD/outlier/OOD detection, and supervised learning-based approaches. These techniques contribute to enhancing the reliability, robustness, and trustworthiness of AI models in autonomous driving and other safety-critical applications. Continued research efforts in monitoring AI model performance will further advance the development and deployment of autonomous driving systems.
3) Performance Monitoring for Autonomous Driving: Performance monitoring in the context of autonomous driving has received significant attention to ensure the safety, reliability, and effectiveness of autonomous systems. Researchers have explored various techniques and methodologies for monitoring the performance of autonomous driving systems, with a specific focus on the perception and decision-making modules.
In the domain of perception, monitoring techniques have been developed to assess the accuracy and robustness of object detection, tracking, and classification algorithms [32]–[34]. These techniques involve leveraging sensor redundancy to detect anomalies and failures or utilizing AI-based methods to estimate model performance. Information from different sensors plays a complementary role and is used to monitor and correct failures in individual sensors that can lead to perception failures. Additionally, temporal information from multiple time steps is used for error detection in perception. However, unlike perception models, it is challenging to obtain multiple sources of information for cross-validation in trajectory prediction modules. Moreover, researchers have explored the application of uncertainty estimation and AD methods in tasks such as image classification and semantic segmentation.
In terms of decision-making, performance monitoring techniques focus on evaluating the correctness and efficiency of planning and decision algorithms [35]–[37]. Researchers have explored approaches to assess adherence to traffic rules, analyze the quality of path planning and trajectory generation, and identify potential errors or suboptimal decisions. By monitoring the decision module, the system can detect and rectify faulty or unsafe decisions, ensuring the effectiveness and safety of autonomous driving.
In conclusion, there has been limited research on performance monitoring techniques specifically tailored to the trajectory prediction model. Previous work has analyzed monitoring approaches for prediction models from the perspectives of uncertainty estimation and the addition of auxiliary modules [38], [39], but these may not be applicable to physics-based prediction methods or highly black-box prediction models where model parameters are not accessible. Further research is needed to develop effective performance monitoring techniques for trajectory prediction in autonomous driving systems.

C. Contributions
This study introduces a generic PMTM, enabling real-time estimation of trajectory prediction performance, thereby contributing to the realization of safe and trustworthy intelligent automation systems. The main contributions are as follows:
1) A specifically designed monitor for black-box trajectory prediction models. It complements autonomous driving systems by providing online performance estimates of the trajectory prediction model.
2) A comprehensive review, design, and comparison of different methods. This research presents various monitors based on AD, ML, DL, and ensemble approaches, separately, and the effectiveness of the different models is systematically compared.
3) Integrated experiments to explore the advantages and characteristics of the PMTM. The study evaluates the monitoring of both AI-based and physics-based trajectory prediction models, thus validating the universality of the method. Additionally, the monitors' performance is assessed using different datasets, and ablation studies are conducted to showcase the contributions of individual components.

D. Paper Organization
The paper is structured as follows: Section II presents the proposed performance monitoring framework for black-box trajectory prediction models. The implementation details of the performance monitoring model are described in Section III. In Section IV, the experimental setup is explained, and in Section V, the results and discussions are provided. Finally, Section VI concludes the paper with a summary and future prospects.

II. PERFORMANCE MONITORING FRAMEWORK FOR BLACK-BOX TRAJECTORY PREDICTION MODEL

In this section, the performance monitoring framework is proposed, as shown in Fig. 2. The framework aims to estimate the performance of the black-box trajectory prediction model online.

A. Trajectory Prediction Model
The trajectory prediction model accepts various inputs, including sensor data (e.g., LiDAR, radar, and camera inputs), vehicle state information, and environmental context. Using complex neural network architectures or advanced techniques, the model processes these inputs to generate predictions of future trajectories for the target objects. The model's output typically represents the predicted trajectories in a deterministic form, such as position coordinates. However, it is worth noting that the monitoring techniques discussed here can also be extended to probabilistic trajectory prediction models by diagnosing the mean and variance of the probability distributions. The general representation of a deterministic trajectory prediction model is given as follows:

Ŷ = f(X)

where f(·) represents a black-box trajectory prediction model with unknown specific parameters. Ŷ represents the output of the prediction model, i.e., the estimated trajectories of the target objects over the next t_f time steps. X denotes the inputs to the prediction model, which may include the historical states of the target objects, information about surrounding dynamic traffic participants, and context such as road geometry and traffic rules. The specific inputs of the prediction model depend on the chosen algorithm. For example, physics-based models may only require historical kinematic or dynamic states of the target objects, while learning-based models may incorporate additional information, such as modeling interactions between different traffic participants using graph neural networks (GNNs), CNNs, or social force models. Map information may also be transformed into bird's-eye-view images or vector maps to consider the constraints and guidance provided by the road structure.

B. Performance Monitoring Model
The performance monitoring model, as the core of the PMTM, takes environmental information and the output of the trajectory prediction model as inputs and estimates the performance of the prediction model as output. Considering the black-box nature of the prediction model, it is assumed that the performance monitoring model cannot modify the model's parameters or access its internal information or intermediate features. Therefore, the monitoring model relies on the environmental information and the predicted trajectories from the black-box model. Specifically, the performance monitoring can be represented as follows:

Ẑ = g(X′, Ŷ)

where Ẑ represents the monitoring result for the trajectory prediction model's performance, g(·) denotes the performance monitoring model, and X′ represents the input information of the performance monitoring model.
To ensure the generality of the monitoring model for different types of prediction models, it primarily utilizes the historical state information of the target objects, which is commonly used in most prediction models. Based on previous research [40], the kinematic information of the target objects exhibits a strong correlation with the prediction model's performance, while the information from surrounding traffic participants shows no significant relationship with the prediction errors. Thus, including the historical state of the target objects as an input to the monitoring model is deemed sufficient.
Additionally, the focus on the target objects' historical states and the output of the prediction model helps reduce the complexity and size of the model. Avoiding overly complex structural designs or introducing additional models facilitates the practical application of the monitoring model.
In addition, the output of the monitoring model serves as a diagnostic and supplementary tool for the trajectory prediction model. It characterizes the actual state of the prediction model and typically represents prediction errors. In Section III, various output formats of the monitoring model based on different methods will be discussed.
In conclusion, the proposed performance monitoring framework provides a means to evaluate the performance of black-box trajectory prediction models in autonomous driving. The trajectory prediction model leverages various inputs to generate predictions of future trajectories, while the performance monitoring model evaluates the performance based on environmental information and the prediction model's output. The framework enables real-time detection of potential performance degradation or anomalies, contributing to the safety and reliability of autonomous driving systems employing black-box trajectory prediction models.
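As a concrete sketch of this wiring, the snippet below shows how a monitor can wrap a black-box predictor while seeing only its inputs and outputs. All function names and the constant-velocity stand-in predictor are illustrative assumptions, not the paper's implementation; the monitor here is a toy heuristic standing in for g(·).

```python
import numpy as np

def black_box_predictor(history):
    """Stand-in for an inaccessible prediction model: constant-velocity
    extrapolation of the last observed step (illustrative only)."""
    step = history[-1] - history[-2]
    return history[-1] + step * np.arange(1, 13)[:, None]  # 12 future steps, 2-D

def monitor(history, predicted):
    """Toy monitor g(X', Y^): flags fast, strongly turning objects as likely
    to be predicted poorly. A heuristic placeholder, not the learned models
    described later; note it only reads history and the predictor's output."""
    vel = np.diff(history, axis=0)
    speed = np.linalg.norm(vel, axis=1).mean()
    heading_change = np.abs(np.diff(np.arctan2(vel[:, 1], vel[:, 0]))).sum()
    return speed * (1.0 + heading_change)  # higher value -> riskier prediction

history = np.cumsum(np.ones((8, 2)) * 0.5, axis=0)  # straight, constant speed
y_hat = black_box_predictor(history)                # predicted trajectory Y^
z_hat = monitor(history, y_hat)                     # performance estimate Z^
```

The key design point mirrored here is that `monitor` never touches the predictor's parameters or internals, only X′ and Ŷ.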

III. PERFORMANCE MONITORING MODEL
In this section, the performance monitoring models are presented to evaluate the performance of black-box trajectory prediction models online in autonomous driving. As shown in Fig. 3, this study proposes four distinct types of methods that utilize various techniques for performance assessment.

A. AD-Based Model
The AD-based model utilizes AD techniques to monitor the performance of black-box trajectory prediction models. As shown in Fig. 4, it is built upon the following fundamental assumption: for data-driven trajectory prediction models, their performance is highly dependent on the training data. Specifically, these models tend to exhibit good fitting capabilities for the majority of normal data points within the training set, while their prediction accuracy may deteriorate when encountering outliers or abnormal data that deviate from the normal patterns.
A generic AD-based model is represented as follows:

Ẑ = g_AD(X′, Ŷ)

where g_AD(·) represents the AD-based monitoring model, which aims to generate an anomaly score Ẑ. A higher anomaly score indicates a higher deviation of the input data from the normal region. According to the aforementioned assumption, a higher score suggests potentially poorer performance. In this study, various AD models were utilized and compared, as shown in Tab. I.
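As one hedged illustration of g_AD (Tab. I is not reproduced here, so the choice of Isolation Forest and the synthetic features are assumptions of this sketch), an AD model can be fitted on flattened [X′, Ŷ] features drawn from the prediction model's assumed training distribution and then queried for an anomaly score at test time:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Flattened monitor inputs [X', Y^]: e.g. 8 historical + 12 predicted 2-D
# points -> 40-D vectors (synthetic, for illustration only).
normal = rng.normal(0.0, 1.0, size=(500, 40))   # in-distribution scenes
outlier = rng.normal(6.0, 1.0, size=(1, 40))    # an atypical scene

# Fit the AD model on the (assumed) training distribution of the predictor.
ad_model = IsolationForest(random_state=0).fit(normal)

# score_samples returns higher values for more normal points, so negate it
# to obtain an anomaly score Z^ where higher = more anomalous.
z_normal = -ad_model.score_samples(normal[:1])[0]
z_outlier = -ad_model.score_samples(outlier)[0]
```

Under the section's assumption, the larger score for `outlier` would flag a scene where the black-box predictor is more likely to perform poorly.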

B. ML-Based Model with Error Estimation
Compared to AD-based methods that provide anomaly scores, a more direct and interpretable approach to monitoring the performance of trajectory prediction models is to construct a model specifically designed to estimate trajectory prediction errors. ML offers a simple and convenient solution for this purpose. Therefore, in this study, we developed an ML-based monitoring model to estimate the errors of trajectory prediction models. The general representation is as follows:

Ẑ = g_ML(X′, Ŷ)

where X′ and Ŷ have the same meanings as described earlier. However, for this method, as the ground truth of Ẑ, Z represents the prediction errors instead of anomaly scores, providing higher interpretability and operability. During practical operations, Z can be a comprehensive error value for a predicted trajectory or individual prediction errors at each future time step, expressed as:

Z = (e_1, e_2, ..., e_{t_f})

where e_t denotes the error of the trajectory prediction model at time t. For single-modality output prediction, it corresponds to the Euclidean distance between the true trajectory point and the predicted trajectory point, i.e., e_t = ∥s_t − ŝ_t∥_2. Since the ML-based models used require input and output in one-dimensional vector form, in this study, we flatten and concatenate the aforementioned two-dimensional input features X′ and Ŷ to create unified input features for the monitoring model. Furthermore, the velocity and acceleration features of the predicted object are extracted as supplementary inputs based on previous research [40]. Additionally, the input is standardized for feature normalization.
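The per-step error definition above can be computed directly; the small example below (with made-up trajectories) also derives the two aggregate quantities monitored in the experiments, the average and final prediction errors:

```python
import numpy as np

# Hypothetical ground-truth and predicted trajectory points s_t and s^_t.
true_traj = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred_traj = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])

# e_t = ||s_t - s^_t||_2 at each future time step t
errors = np.linalg.norm(true_traj - pred_traj, axis=1)

avg_error = errors.mean()   # comprehensive (average) error over the horizon
final_error = errors[-1]    # error at the last predicted step
```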
To train the monitoring model, we evaluate the original trajectory prediction model on the same training and validation sets used to train it. This process generates a dataset with trajectory prediction error labels, which serves as the new training set for the ML model. Supervised learning is employed during the training process, with mean square error (MSE) used as the loss function for model optimization.
In this study, various categories of ML algorithms are used to design different monitoring models. Based on the requirements and forms of the monitoring task, several methods are ultimately selected and evaluated, as shown in Tab. II.
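A minimal sketch of the ML pipeline just described follows: flattened features are standardized and a regressor is trained on error labels. Tab. II is not reproduced here, so gradient boosting is only one plausible candidate, and the data are synthetic stand-ins for the relabeled training set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the relabeled training set: flattened [X', Y^, v, a]
# features and a scalar prediction-error label Z (illustrative data only).
X_feat = rng.normal(size=(400, 20))
z_true = np.abs(X_feat[:, 0] * 2.0 + rng.normal(scale=0.1, size=400))

scaler = StandardScaler().fit(X_feat)          # feature normalization
ml_monitor = GradientBoostingRegressor(random_state=0)
ml_monitor.fit(scaler.transform(X_feat), z_true)  # supervised, squared-error loss

z_hat = ml_monitor.predict(scaler.transform(X_feat[:5]))  # estimated errors Z^
```

At deployment, `z_hat` would be the monitor's online estimate of how far off the black-box predictor is for each scene.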

C. DL-Based Model with Error Estimation
DL models possess stronger fitting capabilities, making them advantageous over ML models in estimating trajectory prediction model errors. In this section, we further explore DL-based error estimation models built on the foundation of the ML-based approaches.
The fundamental principles of DL-based performance monitors are similar to those of ML-based methods, employing supervised learning to train the models. However, compared to ML models, DL models generally exhibit more powerful nonlinear fitting abilities, automatic feature learning capabilities, capacity to handle large-scale data, and contextual modeling capabilities. They are better equipped to capture complex patterns and correlations within the data, enabling more accurate estimation of trajectory prediction model errors and enhancing the performance and reliability of the monitoring models.
A generic DL-based model is represented as follows:

Ẑ = g_DL(X′, Ŷ)

In the DL-based monitoring model, similar input features X′ and Ŷ are used. However, compared to the relatively fixed input requirements of ML models, deep neural networks allow for more flexible feature input and processing formats. As shown in Fig. 5, different types of inputs are integrated in a more efficient manner. Moreover, considering the limited relevance between the surrounding traffic participants' information and prediction performance according to [40], and to reduce the complexity of the monitoring model, this study primarily uses the features of the predicted object as inputs. Additionally, Ẑ still represents the estimated prediction error, which can be the comprehensive error value for the entire trajectory or the prediction error at each future time step.
To train the DL model, this study employs the dataset that includes trajectory prediction error labels. When Z represents the prediction error at each future time step, the loss function used during the training process is as follows:

L = (1/t_f) Σ_{t=1}^{t_f} (ẑ_t − e_t)²

Multiple typical DL models are designed and compared, including the Multilayer Perceptron (MLP), CNN, Long Short-Term Memory (LSTM), and attention mechanisms such as the Transformer. MLP is a basic DL model with multiple fully connected layers, suitable for sequential and structured data. CNN is primarily used for extracting local features through convolutional and pooling layers, and 1D convolution (Conv1D) is employed in this research to process features. LSTM is a DL module designed for sequential data, capturing long-term dependencies through gating mechanisms. Attention mechanisms have gained prominence in recent years, with the Transformer being a model based on attention mechanisms. This mechanism can automatically learn the importance of different positions based on contextual information, better capturing the correlations within the data. Multiple network designs were explored for each category of models to ensure the effectiveness of the final analysis results. By comparing the performance of these models in estimating trajectory prediction errors, this study aims to explore the characteristics of different models, identify the most suitable model structure for specific tasks, and improve the accuracy and reliability of the monitoring model.
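A compact sketch of the per-step error regression setup follows. To stay dependency-light it uses scikit-learn's MLP (one of the model families named above) rather than the full encoder-decoder networks of Fig. 5; the horizon t_f, the features, and the error labels are all illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t_f = 12                                        # assumed prediction horizon

# Flattened [X', Y^] features and per-step error labels Z = (e_1, ..., e_tf).
X_feat = rng.normal(size=(600, 40))
# Illustrative labels: errors grow with the horizon and with feature magnitude.
base = np.abs(X_feat[:, :1])
Z = base * np.linspace(0.5, 2.0, t_f)[None, :] \
    + rng.normal(scale=0.05, size=(600, t_f))

# Small MLP monitor trained with squared-error loss, a stand-in for the
# encoder-decoder architectures (MLP/CNN/LSTM/Transformer) compared above.
dl_monitor = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                          random_state=0)
dl_monitor.fit(X_feat, Z)

z_hat = dl_monitor.predict(X_feat[:3])          # per-step error estimates
```

The multi-output regression directly mirrors the loss above: one estimated error ẑ_t per future time step.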
In this study, for each data point during the testing of the monitoring model, the corresponding attributes X, X′, and Ŷ are assembled; the error-estimation-based methods use the estimated value ê_P as ẑ, while the proposed AD method uses the anomaly score as ẑ.

Fig. 5. Performance monitoring based on DL (a typical network architecture). The DL model consists of an encoder and a decoder, with separate encoders for the historical and predicted positions, velocities, and accelerations. In addition to considering the positional information of trajectory points in the input, the model explicitly extracts kinematic information as features. These features are then processed to obtain the final monitoring results. In the encoder or decoder, basic models such as MLP, LSTM, 1D convolution, and Transformer are considered.

D. Ensemble-Based DL Model
In the context of monitoring black-box trajectory prediction models, this study explores the potential application of ensemble methods to enhance model performance reliability and robustness. Specifically, an integrated DL model is constructed by incorporating ensemble-based ideas, as illustrated in Fig. 6 and represented by the following equation:

Ẑ = g_EN(g_1(X′, Ŷ), ..., g_M(X′, Ŷ))

where g_i with i ∈ {1, ..., M} denotes an individual DL-based monitoring model discussed earlier. g_EN represents the integrated model based on the ensemble, which can be a combination of homogeneous or heterogeneous models with different parameters. Two metrics are employed to monitor model performance, namely, the mean and standard deviation of predictions from multiple models, reflecting accuracy and uncertainty, respectively. On the one hand, the combination of homogeneous but diverse models, known as deep ensemble [25], can be considered. This approach involves employing DL models with the same structure but different parameter settings. By adjusting factors such as model initialization and training data ordering, multiple homogeneous but diverse models can be obtained. These models exhibit variations during the training process, providing diversity. By combining the predictions of these models, an overall prediction output is obtained. This combination approach helps reduce individual model biases and improves the stability and accuracy of the overall model. Previous work [38] explored the application of this method in trajectory prediction tasks and achieved performance monitoring based on uncertainty extracted from the trajectory prediction model. In this study, it is assumed that the internal parameters of the trajectory prediction model are unadjustable and invisible. Consequently, we investigate the potential of applying the deep ensemble approach to separately designed monitoring models for performance monitoring.
On the other hand, the combination of heterogeneous models with distinct characteristics is considered. This approach involves using different types of DL models and integrating them. These models may have different architectures, loss functions, or input feature representations, exhibiting greater diversity. By merging their prediction results, a more comprehensive and diverse prediction output is obtained. The combination of heterogeneous models helps capture the strengths of different models, compensate for their limitations, and enhance the generalizability of the overall model. Ultimately, two metrics are employed to monitor the ensemble model's performance. The first metric is the mean of predictions from multiple models. By computing the expected value (exp.) of predictions from multiple models, an overall prediction result is obtained, which integrates perspectives and predictive capabilities from multiple models, offering higher accuracy and stability. The equation is as follows:

ẑ_exp = (1/M) Σ_{i=1}^{M} ẑ_i

where ẑ_i represents the result of the i-th submodel.
The second metric is the standard deviation of the predictions from the multiple models, which reflects the uncertainty of the model's output. A higher standard deviation indicates greater disagreement among the submodels and thus potentially higher uncertainty in the prediction results. By monitoring the standard deviation, the reliability of the overall model can be evaluated:

$$\sigma = \sqrt{\frac{1}{M}\sum_{i=1}^{M} \big(\hat{z}_i - \bar{z}\big)^2}.$$

By incorporating these ensemble ideas, an integrated DL model is constructed to monitor the prediction performance. The combination leverages the strengths and diversity of multiple models, thereby improving prediction accuracy and generalization. Simultaneously, by monitoring the mean and standard deviation of the submodel outputs, the stability and uncertainty of the model are assessed, providing a more comprehensive understanding of the model's performance.
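As a minimal sketch of these two ensemble metrics (assuming each submodel has already produced a scalar error estimate for a given scene; the function and variable names are illustrative, not from the paper), the aggregation can be written as:

```python
import numpy as np

def ensemble_metrics(submodel_estimates):
    """Aggregate per-submodel error estimates into the two monitoring
    metrics: the ensemble mean (accuracy proxy) and the population
    standard deviation (uncertainty proxy)."""
    z = np.asarray(submodel_estimates, dtype=float)   # shape (M,)
    z_bar = z.mean()                                  # (1/M) * sum_i z_i
    sigma = np.sqrt(np.mean((z - z_bar) ** 2))        # population std
    return z_bar, sigma

# Example: M = 5 submodels estimating the prediction error of one scene
mean_est, std_est = ensemble_metrics([0.42, 0.38, 0.45, 0.40, 0.44])
```

A small spread (sigma) here would indicate the submodels agree, so the mean estimate can be trusted more; a large spread flags an uncertain monitoring result.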

A. Datasets
To obtain the actual performance of trajectory prediction models as labels for training the monitoring models, this study utilized real-world trajectory datasets to train and test the prediction models. Specifically, three natural motion datasets were employed: SIND, INTERACTION, and ApolloScape.
The SIND dataset is a natural motion dataset recorded at a signalized intersection in China. Captured using drones, it overcomes issues such as occlusion, resulting in a comprehensive collection of traffic data. The dataset comprises 13,000 traffic participants, including cars, trucks, buses, pedestrians, tricycles, bikes, and motorcycles, with annotated trajectory and category information for each participant. The original SIND dataset was further filtered to exclude vehicles parked on the roadside for long periods, focusing on dynamic objects in the scenario. The dataset was divided into 23 subsets based on collection time, with 16 subsets used for training and validation and the remaining 7 designated as the test set.
In addition, different datasets were used to evaluate the performance of the trajectory prediction and monitoring models in the presence of distribution changes. The INTERACTION dataset contains natural motion data of highly interactive scenarios from multiple locations, captured using drones or roadside devices. Two subsets corresponding to highly interactive intersection scenarios were selected as test sets, namely TC Intersection VA (VA), recorded at signalized intersections, and USA Intersection GL (GL), recorded at unsignalized intersections. Furthermore, the ApolloScape dataset was utilized as a test set with extreme distribution shift to analyze the performance of the prediction and monitoring models. It was collected by the Apollo acquisition car on urban streets, capturing highly complex traffic flows involving a mixture of vehicles, cyclists, and pedestrians in various scenarios such as straight roads and intersections.
Moreover, traditional trajectory prediction algorithms are often evaluated under the assumption of perfect perception information. In practice, however, trajectory prediction models rely on information from upstream perception models, which may be inaccurate due to sensor limitations and environmental interference. This issue has gained increasing attention in recent research [49]. This study further explores the impact of noise introduced by insufficient perception capabilities. Specifically, it assumes the noise follows a Gaussian distribution and adds perturbations to the historical trajectory data input to the prediction model, in order to assess the performance of both the prediction and monitoring models. In the experiments, Gaussian noise with standard deviations of 0.002 and 0.02 was adopted, respectively.
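A minimal sketch of this perturbation is shown below (assuming historical trajectories are stored as a `(num_objects, num_frames, 2)` array of x/y positions; the array layout and names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def inject_gaussian_noise(history, sigma, seed=0):
    """Perturb historical trajectory positions with i.i.d. zero-mean
    Gaussian noise to emulate imperfect upstream perception.
    `history` has shape (num_objects, num_frames, 2) holding x/y
    coordinates; `sigma` is the noise standard deviation."""
    rng = np.random.default_rng(seed)
    return history + rng.normal(loc=0.0, scale=sigma, size=history.shape)

# Example: 4 objects, 6 past frames (the GRIP++ history length), sigma = 0.02
clean = np.zeros((4, 6, 2))
noisy = inject_gaussian_noise(clean, sigma=0.02)
```

The perturbed array is then fed to the prediction model in place of the clean history, while the ground-truth future trajectories remain untouched for error computation.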

B. Trajectory Prediction Models
In this study, typical existing trajectory prediction models were employed as the targets for monitoring. Specifically, GRIP++ [13] was chosen as the primary focus; it is an enhanced graph-based, interaction-aware trajectory prediction method designed for autonomous driving. This method utilizes graphs to represent the interactions between objects and employs an encoder-decoder Gated Recurrent Unit (GRU) model for prediction. Unlike methods that can only predict future trajectories for individual traffic agents one by one, GRIP++ can simultaneously predict the trajectories of all observed objects, thus achieving efficient runtime performance. In the GRIP++ setting, the prediction frequency is 2 Hz, and the prediction model takes the past 6 frames of scenario information to predict the future 6 frames of trajectories for the traffic participants. As a classic prediction model, evaluating its performance can effectively reflect the effectiveness of the monitoring approach.
Furthermore, to demonstrate the effectiveness of the proposed monitoring method on different types of trajectory prediction models, a typical rule-based prediction model, namely the constant velocity (CV) model, was also selected as a research target. It maintains the same basic prediction settings as GRIP++, estimating the future 6 frames of trajectories based on the current velocity of the object.
In addition, HeatIR [15], a recently proposed method, was used as another trajectory prediction model under investigation. HeatIR is a multi-agent prediction approach based on a three-channel model. Unlike GRIP++, it incorporates static map features in its input. In the HeatIR setting, the prediction frequency is 10 Hz, and the prediction model takes the past 10 frames of information to predict the next 30 frames.
All the mentioned models were treated as black boxes during the evaluation process, meaning that their internal parameters or features were neither modified nor accessed when training and testing the monitoring models.

C. Evaluation Metrics
To evaluate the performance of the monitoring model, the following key evaluation metrics were employed in this study:

1) Cut-off Curve and Improvement Score (IS): The cut-off curve was originally a common method for evaluating the performance of classification models; in this study, it is adopted to assess the capability of the monitoring model. The estimated results $\hat{z}$ ($\bar{z}$ or $\sigma$ for ensemble-based methods) were considered as reference indicators. The data points were rearranged in descending order of $\hat{z}$, and different proportions of data were sequentially filtered out to calculate the average value of $e_P$ for the remaining data. This process generated a cut-off curve, and the area under the cut-off curve (AUCOC) was taken as an evaluation metric. Additionally, by using $e_P$ itself as the reference, or by filtering the data in random order, the best curve and the random curve were obtained as references. Furthermore, the IS (i.e., the self-awareness score in [39]) was calculated as an evaluation metric to compare different monitoring models.
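The procedure above can be sketched as follows. The exact IS formula from [39] is not reproduced in the text, so the normalization below, which yields 1 for a perfect ranking and 0 for a random one, is an assumption; all names are illustrative:

```python
import numpy as np

def cutoff_curve(scores, errors, fractions=None):
    """Filter out the top fraction of samples ranked by the monitor's
    score (descending) and return the mean prediction error e_P of the
    remaining samples, for each filtering fraction."""
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    if fractions is None:
        fractions = np.linspace(0.0, 0.9, 10)
    order = np.argsort(-scores)            # descending by estimated error
    e_sorted = errors[order]
    n = len(e_sorted)
    curve = np.array([e_sorted[int(round(f * n)):].mean() for f in fractions])
    return np.asarray(fractions), curve

def aucoc(fractions, curve):
    """Area under the cut-off curve via trapezoidal integration."""
    return float(np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(fractions)))

def improvement_score(scores, errors):
    """Assumed IS normalization against the best (oracle) and random
    reference curves: 1 for a perfect ranking, 0 for a random one."""
    f, c_model = cutoff_curve(scores, errors)
    _, c_best = cutoff_curve(errors, errors)          # rank by true error
    c_rand = np.full_like(c_model, np.mean(errors))   # expected random curve
    a_m, a_b, a_r = aucoc(f, c_model), aucoc(f, c_best), aucoc(f, c_rand)
    return (a_r - a_m) / (a_r - a_b)
```

A monitor whose scores rank samples exactly by their true error attains IS = 1 under this normalization, while anti-correlated scores drive IS below zero.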
2) Monitoring Error: For the proposed error-based monitoring method, the monitoring error $e_M = |e_P - \hat{e}_P|$ was calculated as another metric. By comparing the difference between the estimated prediction error and the actual prediction error, the performance of the monitoring model in estimating the prediction model's error could be assessed.

D. Baselines
Two baseline methods are introduced for comparison:

1) Uncertainty Estimation: As demonstrated in [40], based on deep ensemble, the original trajectory prediction network was modified to output uncertainty estimates along with the predicted trajectories. The estimated uncertainty was aggregated as the average prediction entropy (APE) and final prediction entropy (FPE), which were used as $\hat{z}$ to monitor ADE and FDE, respectively. This method assumes that the trajectory prediction model is a white-box model, meaning that its training process and model parameters can be modified.
2) Self-Aware Trajectory Prediction: As shown in [39], a self-aware module was introduced on top of the original trajectory prediction network. This module extracted intermediate-level features from the trajectory prediction network and used them, together with the predicted trajectory, as inputs for error estimation. This method assumes that the trajectory prediction model is a gray-box model, meaning that its intermediate-level features can be accessed.
These two baseline methods provide different approaches to the task of monitoring trajectory prediction models. By comparing their performance with that of the proposed monitoring approach, we can gain insights into the effectiveness and advantages of the proposed method.

A. Comparison of Different Monitoring Models
To compare the proposed monitoring methods, comprehensive testing was conducted on the SIND dataset. First, the trajectory prediction model was trained with the GRIP++ algorithm on the SIND training set. Then, the trained prediction model was treated as a black box and evaluated on the SIND dataset to obtain the data labels described in Section IV for monitoring-model training. Subsequently, the different monitoring models were trained on the new training set and evaluated on the test set. All methods were given access to the same training and test data. The results are summarized in TABLE III.
Overall, DL-based models demonstrated significant advantages among the different methods, mainly owing to their explicit modeling of prediction errors and their feature learning and nonlinear fitting capabilities. Networks with relatively simple structures, such as MLP and Conv1D, showed good performance, comprehensively exceeding both baseline methods. In contrast, networks designed with LSTM modules did not perform as well, possibly because the short history and future frame lengths in this prediction task did not allow the LSTM's sequence modeling capabilities to be fully exploited. It is worth mentioning that multiple design explorations were conducted for each type of network in the experiments, making the presented results representative. Additionally, introducing attention mechanisms did not yield significant performance improvements for the monitoring models. Furthermore, ML-based methods showed promising results in terms of AUCOC and IS. Among them, LR achieved 0.741 and 0.715 when ADE and FDE were used as the forms of $e_P$, respectively, while RFR achieved the best results of 0.804 and 0.780. Comparing the different methods, those incorporating ensemble ideas, such as RFR and Voting, tended to achieve better performance. However, a gap between ML-based and DL-based methods remains when analyzing the monitoring error $e_M$.
In contrast, AD-based methods delivered mediocre performance. Although some methods, such as MCD and ABOD, achieved good results, they were still not as effective as the ML- or DL-based methods. The main reason may be that the level of anomaly in trajectories does not always correspond directly to prediction errors. This further supports recent research insights on OOD detection [50], which emphasize the need to focus on scenarios that may lead to inadequate model performance, such as out-of-model-scope (OMS) scenarios, rather than solely on OOD or long-tail scenarios.

B. Evaluation in the Presence of Significant Distribution Shifts
Real-world scenarios may exhibit a deviation in data distribution from the training data, which is a crucial factor leading to a decline in prediction performance. In this section, several datasets were additionally employed to simulate potential distribution shifts in real-world settings. The VA, GL, and ApolloScape datasets reflect distribution shifts caused by variations in location and scene type. The SIND test sets injected with different levels of noise represent distribution shifts arising from inaccurate prediction inputs due to inadequate perception capabilities.
1) Degradation of Trajectory Prediction Performance due to Distribution Shifts: TABLE IV records the prediction performance of the GRIP++ model on the different test sets. It can be observed that distribution shifts result in a significant increase in prediction errors. When locations and scenario types change, the behavior patterns, traffic rules, and road restrictions followed by traffic participants also vary, leading to suboptimal performance of the prediction model on the test data. Furthermore, noise introduced by inadequate perception capabilities should not be ignored: even injecting noise with a magnitude of σ = 0.002 increases the prediction errors by 35.6%/33.2%.
2) Evaluation of Monitoring Models: TABLE V presents the performance of the various monitoring models on the different test sets, focusing on the methods that exhibited promising results in TABLE III. Distribution shifts also lead to a certain degree of performance decline in the monitoring models, particularly in the evaluation of $e_M$. However, the decline in IS is relatively smaller, indicating reasonable consistency between the prediction errors estimated by the monitoring models and the true prediction errors. For example, when input noise with σ = 0.002 is injected, the MLP-based and Conv1D-based networks, which performed best, still achieve IS values of 0.813 and 0.811, respectively.

C. Analysis of Ensemble Methods
This section explores the potential of ensemble methods to improve the performance of monitoring models. The evaluation of ML-based methods in TABLE III and TABLE V has partially demonstrated the advantages of ensemble methods. For example, RFR exhibits better monitoring performance than TR, and Voting, as an ensemble of heterogeneous ML models, shows improved generalization. The following discussion further examines the application of this approach in DL-based methods.
1) Ensemble of Homogeneous Models: First, the ensemble of homogeneous models was analyzed, taking the MLP-based model, which showed promising results, as an example. Five independent submodels were constructed using the method proposed in Section III-D.

2) Ensemble of Heterogeneous Models: The ensemble of heterogeneous models was further evaluated by employing three MLP-based networks with different architectures, one Conv1D-based network, and one Transformer-based network. The results are recorded in TABLE VII. Overall, the ensemble of heterogeneous models tends to exhibit more pronounced performance improvements than the ensemble of homogeneous models mentioned above. This may be because heterogeneous models are more likely to converge to different local optima, thereby enhancing the robustness of the ensemble model.

2) Monitoring for HEATIR: This section further examines the applicability of the proposed monitoring method to the HEATIR model. To obtain the monitored HEATIR model, a dataset partitioning similar to that of the aforementioned GRIP++ is used, with specific adjustments made according to the HEATIR configuration. The results are presented in TABLE IX, indicating the satisfactory performance of the proposed method. It is worth noting that the Transformer-based network, incorporating attention mechanisms, performs well, while the MLP-based network alone achieves mediocre results. This difference may be attributed to the increased complexity of the problem caused by the longer input and output sequences in HEATIR compared with GRIP++. Therefore, it is necessary to select an appropriate monitoring method according to the specific requirements of different prediction tasks. Additionally, ensemble-based methods exhibit the best performance and help enhance robustness in the presence of significant distribution shifts, as evidenced by the testing results on VA.

E. Ablation Study
TABLE X presents the results of the ablation study, focusing on monitoring the GRIP++ model and documenting the evaluation results of the monitoring methods that performed well among the ML- and DL-based approaches, namely RFR and the MLP-based model. First, comparing rows 3, 4, and 10 shows that considering both the historical trajectories and the predicted trajectories in the input of the monitoring model is crucial for improving its effectiveness. Additionally, comparing rows 5 to 10 demonstrates the necessity of incorporating kinematic features as input features, as discussed in Section II. Specifically, the results in rows 5 and 6 show that using only the current velocity or acceleration as a monitoring indicator already exhibits good consistency with the prediction errors. Furthermore, comparing rows 7 to 10 reveals that explicitly modeling velocity or acceleration features in the input of the ML or DL models contributes to a significant improvement in monitoring performance.

F. A Typical Application: Hybrid Prediction
As a typical application of the proposed method, hybrid prediction can complement the analysis of the application potential of the monitoring model. The GRIP++ and CV models are taken as the base prediction models, and their respective monitoring models are constructed using the MLP-based method. The output $\hat{z}$ of the monitoring model is used as a reference to combine the outputs $\hat{Y}$ of the different prediction models into hybrid predicted trajectories, allowing their errors to be evaluated. Specifically, two types of hybrid methods are considered in this study:

1) H1: Based on the estimate $\hat{z}$ of the monitoring model, the trajectory to be ultimately predicted is chosen as follows:

$$\hat{Y}_{\mathrm{Hybrid}} = \begin{cases} \hat{Y}_{\mathrm{GRIP++}}, & \hat{z}_{\mathrm{GRIP++}} \le \hat{z}_{\mathrm{CV}} \\ \hat{Y}_{\mathrm{CV}}, & \text{otherwise,} \end{cases}$$

where $\hat{Y}_{\mathrm{GRIP++}}$ and $\hat{Y}_{\mathrm{CV}}$ represent the results of the two base prediction models, $\hat{z}_{\mathrm{GRIP++}}$ and $\hat{z}_{\mathrm{CV}}$ represent the corresponding monitoring results, and $\hat{Y}_{\mathrm{Hybrid}}$ represents the ultimately predicted trajectory.
2) H2: Based on the per-time-step estimates $\hat{z}^t$ of the monitoring models, the ultimately predicted trajectory point is selected for each time step as follows:

$$\hat{s}^{t}_{\mathrm{Hybrid}} = \begin{cases} \hat{s}^{t}_{\mathrm{GRIP++}}, & \hat{z}^{t}_{\mathrm{GRIP++}} \le \hat{z}^{t}_{\mathrm{CV}} \\ \hat{s}^{t}_{\mathrm{CV}}, & \text{otherwise,} \end{cases}$$

where the superscript $t \in \{1, 2, \ldots, t_f\}$ denotes the prediction or monitoring result at the $t$-th future time step. In addition, two trajectory fusion methods are introduced for comparison: 1) Best, which replaces $\hat{z}$ with $e_P$ in H1, and 2) Avg., which directly averages the predicted trajectories output by the two base models.
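The two selection rules above can be sketched as follows (assuming each monitor outputs a scalar estimate for H1 and a per-time-step vector for H2, with a lower estimate meaning a lower expected error; function and variable names are illustrative, not from the paper):

```python
import numpy as np

def hybrid_h1(traj_a, traj_b, z_a, z_b):
    """H1: pick the entire trajectory from the base model whose monitor
    estimates the smaller error (one scalar estimate per model)."""
    return traj_a if z_a <= z_b else traj_b

def hybrid_h2(traj_a, traj_b, z_a, z_b):
    """H2: pick each future point from the base model whose per-time-step
    error estimate is smaller. Trajectories have shape (t_f, 2); the
    monitoring estimates z_a, z_b have shape (t_f,)."""
    mask = (np.asarray(z_a) <= np.asarray(z_b))[:, None]   # (t_f, 1)
    return np.where(mask, traj_a, traj_b)

# Example: two 3-step trajectories and per-step monitoring estimates
a = np.ones((3, 2))    # stand-in for the learned model's output
b = np.zeros((3, 2))   # stand-in for the CV model's output
mixed = hybrid_h2(a, b, [0.1, 0.9, 0.1], [0.5, 0.2, 0.5])
```

H2 can switch between models mid-horizon, which is why it can match or beat H1 when one model degrades only over part of the prediction horizon.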
The results are recorded in TABLE XI, where it can be observed that the hybrid prediction models, which consider the monitoring results $\hat{z}$, generally perform as well as or better than the base models. In scenarios with significant distribution shifts, such as SIND with injected noise, the hybrid methods demonstrate substantial performance improvements. For instance, when noise with σ = 0.002 is injected into the input data, GRIP++ and CV achieve ADEs of 0.564 and 0.695, respectively, whereas H1 and H2 both achieve an ADE of 0.522, outperforming GRIP++/CV by 7%/24.9%. This is because, as the distribution shift increases, the GRIP++ model, which relies on training data, exhibits noticeable performance degradation. Hybrid prediction makes it possible to detect scenarios where GRIP++ may fail and replace it with the more robust CV model, thereby improving the overall prediction performance.

G. Qualitative Analysis
Fig. 7 illustrates the visualization, where the predicted trajectory of the object of attention is represented by a solid red line. In Fig. 7(a), the trajectory predictor exhibits a deviation in predicting the trajectory of a car making a left turn, while the output of the monitoring model compensates for this deviation. In Fig. 7(b), the trajectory predictor fails to accurately predict the acceleration behavior of a vehicle at an intersection, but the monitoring model's output successfully addresses this shortcoming. In Fig. 7(c), a non-motorized vehicle within the intersection chooses to avoid another non-motorized vehicle, but the trajectory predictor fails to predict this behavior effectively; the monitoring model, however, accurately estimates the resulting prediction error. In Fig. 7(d), the monitoring model successfully estimates the error caused by the trajectory predictor's failure to recognize the turning behavior of a non-motorized vehicle within the intersection.

VI. CONCLUSION
In this study, the PMTM was proposed to monitor the performance of black-box trajectory prediction models. Through extensive evaluations and ablation studies, four categories of monitoring methods were compared comprehensively: AD-based, ML-based, DL-based, and ensemble methods. Moreover, the results demonstrated that the effectiveness of the monitoring models improves when both historical and predicted trajectories are considered and kinematic features are incorporated. Furthermore, the hybrid prediction approach, which integrates the monitoring models with the base prediction models, showed promising results in compensating for prediction errors and achieving better overall performance.
Moving forward, future research can explore the incorporation of additional data sources and advanced monitoring techniques to enhance the capabilities of the performance monitor. The development of interpretable and explainable monitoring models will also contribute to better understanding of and trust in the trajectory prediction process. Furthermore, investigating the adaptability of the monitoring approach to different domains and real-world applications is essential for practical implementation. By continuously refining and expanding the performance monitor, the reliability and applicability of trajectory prediction models can be enhanced, thereby advancing the development of safer and more accurate autonomous systems.
This work was supported by the National Natural Science Foundation of China (Projects 52072215 and U1964203), the National Key R&D Program of China (2022YFB2503003), and the State Key Laboratory of Intelligent Green Vehicle and Mobility. (Corresponding Author: Hong Wang)

Fig. 1. Performance monitor for black-box trajectory prediction model. It assumes that the parameters of the trajectory prediction model are unknowable and unmodifiable. The performance monitoring model, functioning as a standalone system, provides online estimation of the trajectory prediction model's performance by utilizing current scenario information and the output of the trajectory prediction model.

Fig. 6. Ensemble-based DL model. The ensemble consists of M different submodels, which can be either homogeneous models with different parameters or heterogeneous models with different architectures. All submodels receive the same input and produce their respective monitoring results, which are then combined to obtain the final result.

Fig. 7. Visualization of trajectory prediction and monitoring results (based on the error estimation method). The historical trajectory is represented by the slate-gray dotted line, while the real future trajectory is represented by the black solid line. The predicted trajectories for different object types are indicated by different colors: blue for cars, magenta for buses or trucks, green for non-motorized vehicles, yellow for pedestrians, and red for the object of attention. The orange area represents the estimated error.
The comprehensive design, evaluation, and implementation framework of the proposed PMTM.
Fig. 4. Performance monitoring based on AD. Under this assumption, real trajectories, and predicted trajectories that are close to the real trajectories, are located within a set of normal points and exhibit lower anomaly scores. Points with poor prediction performance (e.g., large prediction errors) deviate significantly from this set, resulting in higher anomaly scores.
$X' = \{\ldots, s_0\}$ represents the historical state information of the predicted object, which is fed into the monitoring model. In this context, $s_t$ denotes the position of the predicted object at time $t$, specifically $s_t = \{x_t, y_t\}$, where $x_t$ and $y_t$ represent the lateral and longitudinal coordinates in the global coordinate system, respectively. $\hat{Y} = \{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{t_f}\}$ represents the predicted trajectory. During the training process, $X'$ and the real future trajectory $Y$ are used as inputs to train the AD model. During testing, $X'$ and the prediction $\hat{Y}$ are used as inputs to evaluate their relative anomaly scores compared with the training data.

TABLE III: COMPARISON OF DIFFERENT MONITORING MODELS. THE ADE/FDE INDICATE THE FORM OF e_P USED IN THE EVALUATION. THE BEST METRIC WITHIN EACH TYPE OF METHOD IS MARKED WITH AN UNDERLINE, AND THE BEST METRIC AMONG ALL METHODS IS MARKED IN BOLD.

TABLE V: EVALUATION OF MONITORING MODELS ON DIFFERENT TEST SETS (IS/e_M).

TABLE VIII: EVALUATION OF MONITORING FOR THE CV MODEL. IN THE FIRST ROW, SIND/VA INDICATES THE SOURCE OF THE TEST SET, AND IN THE SECOND ROW, ADE/FDE REPRESENTS THE ADOPTED FORM OF e_P.

In this subsection, the generalizability of the proposed method to different trajectory prediction models is demonstrated. Specifically, the classic rule-based constant velocity (CV) model is used as the prediction model. The MLP-based monitoring network described earlier is employed in conjunction with the CV model and the SIND training dataset to obtain the corresponding training data and monitoring model. The evaluation results are presented in TABLE VIII. It can be observed that the monitoring model still achieves good performance. For instance, when evaluated on the SIND test set with ADE as the form of e_P, an IS of 0.907 is achieved, and although the prediction error e_P is as high as 0.693, the monitoring model achieves a low monitoring error e_M of 0.258.

TABLE IX: EVALUATION OF MONITORING FOR THE HEATIR MODEL. THE e_P FORM IS ADE.

TABLE X: RESULTS OF THE ABLATION STUDY. THE LABELS IN THE "MAIN MODEL" COLUMN INDICATE WHETHER THE PROPOSED ML/DL MODELS WERE USED. FOR EACH MODEL, THE COLUMNS REPORT SIND (IS), SIND (e_M), VA (IS), AND VA (e_M).

TABLE XI: EVALUATION RESULTS OF HYBRID PREDICTION (ADE).