System Identification of Static Nonlinear Elements: A Unified Approach of Active Learning, Over-fit Avoidance, and Model Structure Determination

Systems containing linear first-order dynamics and static nonlinear elements (i.e., nonlinear elements whose outputs depend only on the present value of their inputs) are often encountered; for example, certain automobile engine subsystems. Therefore, system identification of static nonlinear elements becomes a crucial component that underpins the success of the overall identification of such dynamical systems. In relation to identifying such systems, we are often required to identify models in differential equation form, and consequently, to describe static nonlinear elements as functions in the time domain. Identification of such functions describing static elements is often a black-box identification exercise; although the inputs and outputs are known, correct mathematical models describing the static nonlinear elements may be unknown. Therefore, with the aim of obtaining computationally efficient models, calibrating polynomial models for such static elements is often attempted. That approach comes with several issues, such as the long time required to collect adequate amounts of measurements to calibrate models, having to test different models to pick the best one, and having to avoid models over-fitting to noisy measurements. Given that premise, this paper proposes an approach to tackle some of those issues. The approach involves collecting measurements based on an uncertainty-driven Active Learning scheme to reduce time spent on measurements, simultaneously fitting smooth models using Gaussian Process (GP) regression to avoid over-fitting, and subsequently picking best-fitting polynomial models using the GP-regressed smooth models. The principles for the single-input-single-output (SISO) static nonlinear element case are demonstrated in this paper through simulation. These principles can easily be extended to MISO systems.


I. INTRODUCTION
Accurate identification of static elements is crucial for identifying effective models for control applications related to dynamical systems, especially in the domain of automobile engine subsystems [1]. Such systems commonly include linear dynamics and static nonlinear elements, and the system architecture comprising such linear dynamics and static elements has been specifically studied over the years [2]. When it comes to identifying static elements, a well-known challenge is the long time required for measurements [1], especially in relation to systems like engines, as measurements typically have to be taken once system responses have reached a steady state. Furthermore, in instances where the model structure is unknown and black-box identification is required (which is often the case in real-world system identification tasks), computation time has to be spent on model structure determination and over-fit avoidance. It is therefore desirable to reduce the time required for measurements by enabling the collection of only an amount of data which is optimal and informative enough in some sense, and also to reduce the computation time by developing algorithms that can systematically determine suitable model structures. This paper contributes to this space by presenting a unified approach, combining Active Learning [3], [4] to systematically collect informative measurements with model structure determination in a way that avoids over-fitting.
Optimal design of experiments (DoE) in relation to system identification has been extensively studied [5]-[7] with the aim of reducing time spent on experiments, or maximizing the information gained within a time budget. DoE often relies on Fisher Information [7], [8], and thus comes with the limitation of having to assume a model for the system beforehand. Assuming a model can become challenging in some real-world system identification tasks, and as an alternative, the approach of Mean Value Engine Modeling (MVEM) [9] is sometimes adopted. MVEM is usually physics-based, and the resulting models can be complicated, making them difficult to couple with DoE since mathematically describing 'Information' can become tedious. Active Learning, which is greatly analogous to DoE, in its conventional use adopts techniques like 'query by committee' [10] to collect measurements guided by some measure of model uncertainty. However, such approaches used for system identification [4] also often rely on an assumed model, or a set of models, which can be identified as a limitation, especially since more modern machine learning approaches such as Gaussian Processes (GPs) [11] enable similar uncertainty-based approaches in a non-parametric paradigm, avoiding the requirement of having to assume parametric models.
In this paper, the authors propose a greedy Active Learning scheme to systematically collect measurements to identify static nonlinear elements. The scheme starts with a set of given prior measurements and decides the next point at which to take measurements using the maximum uncertainty produced by GP regression as the selection criterion. The scheme operates recursively until some termination criteria are reached. Since GPs are non-parametric, the proposed Active Learning scheme is model-free, and therefore differs from the parametric model-based approaches presented in works such as [4], [7]. On termination of the Active Learning-based measurement scheme, a polynomial-based model structure is determined and calibrated. Important literature on polynomial structure determination includes the works [12], [13]; however, unlike this paper, they are not coupled with Active Learning schemes, and revolve around determining polynomial structures from a given set of measurements. We demonstrate the approach on the identification of a single-input-single-output (SISO) static nonlinear element. The approach can easily be extended to the MISO case.
The structure of the paper is as follows: Section II describes the principles of GP-based regression; Section III presents the system identification problem formulation; Section IV presents the methodology detailing the Active Learning scheme and the polynomial model structure determination approach; Section V presents the results; and Section VI concludes the paper by discussing the implications of results and potential avenues for future work.

II. RELATED KNOWLEDGE: GAUSSIAN PROCESS FORMULATION FOR REGRESSION
GPs are a powerful tool for solving multi-dimensional nonlinear regression problems as well as classification problems. Reference [11] is useful for studying GPs in detail, while this section summarizes the GP regression formulation for the one-dimensional nonlinear regression case for a smooth function.
Suppose there is a noisy process y = f(x) + ε, where x ∈ ℝ is a one-dimensional independent variable, y ∈ ℝ is a one-dimensional dependent variable, and ε ∈ ℝ is noise, and it is required to learn a regression model describing the process. Firstly, it is required to collect some training data X and Y, where X = [x_1, x_2, x_3, ..., x_n]^T is a vector containing measurement points and Y = [y_1, y_2, y_3, ..., y_n]^T is a vector containing corresponding noisy process responses; n ∈ ℤ+. Secondly, a kernel matrix K(X, X), whose elements are given by K_ij = k(x_i, x_j), where k(·, ·) is some kernel function and x_i, x_j ∈ X, has to be selected. Choices for kernel functions and some rationale behind the suitability of particular kernel functions for specific applications are available in [11]. The work of this paper focuses on smooth functions, and therefore has selected the squared exponential kernel function

k(x_i, x_j) = α² exp( -(x_i - x_j)² / (2η²) )    (1)

where α, η ∈ ℝ+ in (1) are hyper-parameters, as it has been known to work well with smooth functions and has been commonly used in previous works to learn smooth functions [14]-[20].
When the squared exponential kernel function is used, α and η, along with an additional hyper-parameter (i.e., noise variance σ), construct the set θ = {α, η, σ} of hyper-parameters. Following a maximum likelihood estimation formulation (details given in [11]), a regression model describing the noisy process can be obtained by minimizing (with respect to θ) the cost function of negative log marginal likelihood given by:

-log p(Y|X, θ) = (1/2) Y^T Σ⁻¹ Y + (1/2) log |Σ| + (n/2) log 2π    (2)

where p(Y|X, θ) is the conditional probability of vector Y given the vector X, |·| is the determinant of a matrix, and Σ is the covariance matrix given by:

Σ = K(X, X) + σ²I    (3)

where I is the corresponding identity matrix. After finding the θ which minimizes the negative log marginal likelihood, that solution can be used in (4) and (5) to predict process outputs y* for arbitrary inputs x*:

ŷ* = K(x*, X) Σ⁻¹ Y    (4)

σ_y² = k(x*, x*) - K(x*, X) Σ⁻¹ K(X, x*)    (5)

These results can be interpreted as a Gaussian distribution in the form ŷ*|x* ∼ N(ŷ*, σ_y); σ_y in (5) is the predicted uncertainty, which is used later in this paper as the criterion for selecting informative measurement points.
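As an illustration of the formulation above, the following minimal sketch (not the authors' implementation; function and variable names are illustrative) implements the squared exponential kernel (1), the negative log marginal likelihood (2)-(3), and the predictive equations (4)-(5) for the one-dimensional case:

```python
import numpy as np

def se_kernel(xa, xb, alpha, eta):
    # Squared exponential kernel of (1): k(x_i, x_j) = alpha^2 exp(-(x_i - x_j)^2 / (2 eta^2))
    d = xa[:, None] - xb[None, :]
    return alpha**2 * np.exp(-d**2 / (2.0 * eta**2))

def neg_log_marginal_likelihood(theta, X, Y):
    # Cost function of (2), with Sigma = K(X, X) + sigma^2 I as in (3)
    alpha, eta, sigma = theta
    Sigma = se_kernel(X, X, alpha, eta) + sigma**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * Y @ np.linalg.solve(Sigma, Y) + 0.5 * logdet + 0.5 * len(X) * np.log(2.0 * np.pi)

def gp_predict(theta, X, Y, x_star):
    # Predictive mean (4) and standard deviation (5) at query points x_star
    alpha, eta, sigma = theta
    Sigma = se_kernel(X, X, alpha, eta) + sigma**2 * np.eye(len(X))
    K_s = se_kernel(x_star, X, alpha, eta)
    mean = K_s @ np.linalg.solve(Sigma, Y)
    var = alpha**2 - np.einsum('ij,ji->i', K_s, np.linalg.solve(Sigma, K_s.T))
    return mean, np.sqrt(np.maximum(var, 0.0))
```

Note that, for a fixed θ, `gp_predict` returns both a prediction and the uncertainty σ_y that is later used as the measurement-selection criterion; the uncertainty grows toward the prior value α away from the measurements.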

III. PROBLEM FORMULATION
This section presents the system identification problem formulation. Suppose there is a static nonlinear SISO element in the form y = f(x) + ε, where x ∈ ℝ is the input, y ∈ ℝ is the output, and ε ∈ ℝ is noise. The element is subject to the following assumptions: 1) x and y are measurable.
2) y is a continuous and smooth function of x, but the structure of the function f (.) is unknown.
3) x is bounded such that x_min ≤ x ≤ x_max, where x_min, x_max ∈ ℝ, and the values of x_min and x_max are known.

4) The element shows open loop input-to-output stability ∀x in x_min < x < x_max, implying that |y| < y_max, ∀y, where |·| denotes absolute value and y_max ∈ ℝ+.
5) Noise ε is zero mean white noise where |ε| < ε_max, ∀ε.
The objective would now be to produce a computationally efficient model ŷ = f̂(x) which satisfactorily describes the element while not over-fitting to the measurements. The approach proposed to achieve this objective starts by collecting some prior measurements X = [x_1, x_2, x_3, ..., x_n]^T and Y = [y_1, y_2, y_3, ..., y_n]^T and estimating a GP model ŷ_gp|x ∼ N(ŷ_gp, σ_y) from those measurements by minimizing the negative log marginal likelihood objective function given in (2). Further measurements {x_{n+1}, y_{n+1}} will be collected via a maximum-σ_y-based Active Learning scheme (described in Section IV); those measurements will be added to the sets X and Y, and recursive estimation of the GP model and data collection will take place until some termination criteria (described in Section IV) indicating the convergence of the GP model are reached. The aim of Active Learning, as discussed before, is to reduce the number of measurements collected as opposed to collecting all possible measurements.
In the interest of producing a computationally efficient model, we impose a polynomial structure on f̂ given by:

ŷ = f̂(x) = Σ_{k=0}^{m} c_k x^k

where m is the polynomial order and φ_c = [c_0, c_1, ..., c_m]^T is the vector of polynomial coefficients. Eventually, the solution this paper seeks to find is:

φ* = argmin_{φ_c, m} (1/N) Σ_{i=1}^{N} |ŷ_i - ŷ_gp,i| / |ŷ_gp,i|

which minimizes the Mean Absolute Relative Error (MARE) between the polynomial model and the GP model over a user-specified N, N ∈ ℤ+, number of function evaluation points where ŷ_gp,i ≠ 0, with the aim of the solution φ* providing a polynomial model describing the nonlinear static element whilst not over-fitting to the collected measurements. Section IV details the methodology proposed to solve the problem formulated in this section.
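For clarity, the MARE objective above can be evaluated as in the following sketch (a hypothetical helper, not from the paper; points where the GP output is zero are excluded, as in the formulation):

```python
import numpy as np

def mare(y_poly, y_gp):
    # Mean Absolute Relative Error between polynomial and GP model outputs,
    # computed only over evaluation points where the GP output is nonzero
    y_poly, y_gp = np.asarray(y_poly, float), np.asarray(y_gp, float)
    mask = y_gp != 0
    return np.mean(np.abs(y_poly[mask] - y_gp[mask]) / np.abs(y_gp[mask]))
```

For example, outputs of 1.1 and 2.2 against GP outputs of 1.0 and 2.0 give a MARE of 0.1 (10% mean relative error).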

IV. METHODOLOGY
The methodology is broken down and detailed in the subsections within this section. Figs. 1 and 2 at the end of this section depict the methodology in the form of two flowcharts; Fig. 1 depicts the Active Learning scheme proposed to obtain a GP model, and Fig. 2 depicts the subsequent polynomial model identification scheme which relies on the GP model.

A. The Static Nonlinear Element
The methodology is presented in this section via a demonstrative example using the static nonlinear element given by (7). Assumptions (1) to (6) listed in Section III are valid for this element; x_min and x_max are set to 0 and 1 respectively following Assumption (3), and γ is set to 0.05 following Assumption (6). Readers have the freedom to experiment with the proposed approach using different static elements subject to the assumptions.

B. Performing Measurements
The Active Learning scheme operates using some given measurements X = [x_1, x_2, ..., x_n]^T, Y = [y_1, y_2, ..., y_n]^T, and by recursively collecting measurements {x_{n+1}, y_{n+1}} and adding them to the sets X and Y. Capturing noise becomes necessary; therefore, any measurement instance {x_i, y_i} would have to be repeated multiple times (say M occasions). As a result, each x_i ∈ X would correspond to repeats x_{i,j} and y_{i,j}, j = 1, ..., M, with y_i taken as the mean of the repeated responses, and the sample variance σ_i = (1/M) Σ_{j=1}^{M} (y_{i,j} - y_i)² would have to be recorded ∀i and arranged as σ_me = [σ_1, σ_2, ..., σ_n]^T along with the sets X and Y. M is set to 10 for the demonstrative example.
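A minimal sketch of this measurement step follows (illustrative only; `element` is a hypothetical callable standing in for the physical measurement, and σ_i is taken as the sample variance of the M repeats, per the reading above):

```python
import numpy as np

def measure_point(x_query, element, M=10):
    # Repeat the noisy measurement at x_query M times (M = 10 in the example);
    # return the sample mean response y_i and the sample variance sigma_i
    samples = np.array([element(x_query) for _ in range(M)])
    return samples.mean(), samples.var()
```

The returned mean and variance would be appended to Y and σ_me respectively, with x_query appended to X.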
C. The Active Learning Scheme
1) Initializing Hyper-Parameters: Suppose the Active Learning scheme is to be started with a given set of measurements X, Y, and σ_me having n measurements. The first step the authors propose is to arrange X in ascending order, i.e., set X = [x_1, x_2, ..., x_n]^T such that x_i ≤ x_{i+1}, ∀i, i ∈ ℤ+, i ≤ n - 1, and set Y = [y_1, y_2, ..., y_n]^T to correspond to X. Thereafter, initializing the hyper-parameters (α, η, σ) as follows is proposed: and setting σ_ini to the mean value of the elements of σ_me.
Since overlapping x values would be deliberately avoided, η_ini will naturally be nonzero. However, since y is a system response, no such guarantee of being nonzero naturally applies to α_ini. Should a rare instance resulting in α_ini = 0 occur for some set Y, an exception handling routine should be in place to catch such instances and proceed to add one or a few more random measurements to the sets X, Y, and σ_me in order to capture some degree of nonlinearity of the static element in question, and ensure α_ini ≠ 0, prior to using the collected measurements to estimate GP models.
2) Collecting Prior Measurements: Since n prior measurements arranged as X = [x_1, x_2, ..., x_n]^T, Y = [y_1, y_2, ..., y_n]^T, and σ_me = [σ_1, σ_2, ..., σ_n]^T are required to begin the Active Learning scheme, collecting any n random measurements would be adequate as long as the condition α_ini ≠ 0 is satisfied. For the purpose of the demonstrative example, though, the authors do not resort to random measurements, and instead set n = 2 and impose just two prior measurements such that x_1 = x_min and x_2 = x_max, since the x_min and x_max values will be known beforehand according to Assumption (3). When starting with just two measurements, should the event y_1 = y_2 occur, resulting in α_ini = 0, more measurements would have to be added to the prior measurement set as described earlier, before using the prior measurements to estimate GP models.
3) Estimating GP Models: Estimating GP models (i.e., models in the form ŷ_gp|x ∼ N(ŷ_gp, σ_y)) is done using any available measurement sets X and Y, and minimizing the negative log marginal likelihood cost function in (2). Starting from the initial values of the hyper-parameters (i.e., θ_ini = {α_ini, η_ini, σ_ini}), gradient-based optimization can be used to reach the nearest local minimum of the cost function [11]. The solution θ* = {α*, η*, σ*} resulting from the local minimum is taken as the solution yielding the GP model from the available set of measurements X and Y.
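To make this estimation step concrete, the following sketch (an illustration, not the authors' code; the log-space parameterization and the bounds are added assumptions to keep the hyper-parameters positive and the search well-behaved) minimizes the negative log marginal likelihood of (2) with a gradient-based local optimizer started from θ_ini:

```python
import numpy as np
from scipy.optimize import minimize

def nll(log_theta, X, Y):
    # Negative log marginal likelihood of (2) with Sigma = K(X, X) + sigma^2 I;
    # optimizing in log-space keeps alpha, eta, sigma strictly positive
    alpha, eta, sigma = np.exp(log_theta)
    d = X[:, None] - X[None, :]
    Sigma = alpha**2 * np.exp(-d**2 / (2.0 * eta**2)) + sigma**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * Y @ np.linalg.solve(Sigma, Y) + 0.5 * logdet + 0.5 * len(X) * np.log(2.0 * np.pi)

def fit_gp(X, Y, theta_ini):
    # Gradient-based local search from the proposed initial hyper-parameters;
    # bounds (an added safeguard) keep the search away from degenerate values
    res = minimize(nll, np.log(np.asarray(theta_ini)), args=(X, Y),
                   method='L-BFGS-B', bounds=[(-5.0, 5.0)] * 3)
    return np.exp(res.x)
```

The returned triple corresponds to θ* = {α*, η*, σ*}; restarting from θ_ini at each iteration matches the re-initialization step of the scheme.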

4) Selecting the Next Best Measurement:
After estimating a GP model (i.e., ŷ_gp|x ∼ N(ŷ_gp, σ_y)) using measurements X and Y, a set of test points X_t = [x_1, x_2, ..., x_N]^T is constructed. The specialty of X_t is that it contains N equally spaced values, whose minimum and maximum are x_min and x_max respectively. The set of measurements X is then removed from X_t to construct a new set X*_t = X_t - X = [x_{n+1}, x_{n+2}, ..., x_N]^T. Uncertainty (i.e., σ_y) values are generated for all points in X*_t using (5), and the set σ*_y = [σ_{n+1}, σ_{n+2}, ..., σ_i, ..., σ_N]^T is constructed. The index i* corresponding to the maximum uncertainty value in σ*_y is selected, and the point at index i* is considered the next best measurement (i.e., x_{n+1}) to be taken. Should i* become a point adjacent to any of the available measurements, an exception comes into effect to let i* become a random index not adjacent to or overlapping with any available measurements. This exception allows for a greater spatial distribution of the measurements. Subsequently, the next best measurement {x_{n+1}, y_{n+1}} is added to the sets X and Y, and the process continues recursively by re-initializing hyper-parameters, re-estimating the GP model, and re-locating the maximum uncertainty from subsequent sets σ*_y, until the termination criteria specified in the next subsection are reached. The flowchart in Fig. 1 will be helpful to grasp the Active Learning scheme.
For the purpose of the demonstrative example, N is set to be 1000.
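The selection step above can be sketched as follows (illustrative only; `min_gap` is an assumed adjacency tolerance, and here the uncertainty vector is passed in pre-computed over the full candidate grid rather than only over X*_t):

```python
import numpy as np

def next_measurement_point(x_candidates, x_measured, sigma_y, min_gap):
    # Remove already-measured points from the N-point candidate grid, pick the
    # remaining candidate with maximum predictive uncertainty sigma_y, and, if
    # that pick lies adjacent to an existing measurement (closer than min_gap),
    # fall back to a random non-adjacent candidate to spread measurements out
    x_measured = np.asarray(x_measured)
    idx = np.where(~np.isin(x_candidates, x_measured))[0]
    best = idx[np.argmax(sigma_y[idx])]
    if np.min(np.abs(x_candidates[best] - x_measured)) < min_gap:
        far = [i for i in idx if np.min(np.abs(x_candidates[i] - x_measured)) >= min_gap]
        if far:
            best = np.random.choice(far)
    return x_candidates[best]
```

With two prior measurements at x_min and x_max, the first pick naturally lands near the middle of the range, where the GP is most uncertain.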

5) Termination Criteria:
Since the Active Learning scheme iteratively takes new measurements, the foremost termination criterion imposed is a maximum number of iterations. Function evaluation is done on at most N points when selecting the next best measurement, and as a result, limiting the maximum number of iterations to N as well is proposed; N iterations would result in N equally spaced points covering the whole measurement range according to the method in subsection IV-C4. Therefore, if i is the number of iterations, i = N will be one termination criterion.
Next, suppose the hyper-parameters resulting after i iterations, 2 ≤ i < N, are α*_i, η*_i, and σ*_i. The following measures are defined:

γ_α = |α*_i - α*_{i-1}| / |α*_{i-1}|    (10)
γ_η = |η*_i - η*_{i-1}| / |η*_{i-1}|    (11)
γ_σ = |σ*_i - σ*_{i-1}| / |σ*_{i-1}|    (12)

Ideally, γ_α, γ_η, and γ_σ should all tend to zero to imply model convergence. Therefore, the termination criterion is set with respect to a threshold γ_hyp, γ_hyp ∈ ℝ+, γ_hyp << 1, such that termination occurs if the three conditions γ_α < γ_hyp, γ_η < γ_hyp, and γ_σ < γ_hyp are satisfied simultaneously, and remain satisfied for N_hyp consecutive instances (iterations). For the demonstrative example, the values γ_hyp = 0.05 and N_hyp = 10 are set.
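A sketch of this convergence check follows (illustrative only; it assumes the measures above are relative changes of the hyper-parameters between consecutive iterations):

```python
def hyperparameters_converged(history, gamma_hyp=0.05, n_hyp=10):
    # history: one (alpha, eta, sigma) tuple per iteration. Converged when the
    # relative change of every hyper-parameter stays below gamma_hyp for
    # n_hyp consecutive iterations (gamma_hyp = 0.05, n_hyp = 10 in the example)
    if len(history) < n_hyp + 1:
        return False
    for prev, curr in zip(history[-n_hyp - 1:-1], history[-n_hyp:]):
        for p, c in zip(prev, curr):
            if abs(c - p) / abs(p) >= gamma_hyp:
                return False
    return True
```

The check is deliberately conservative: a single large jump within the window resets the count, so termination requires a sustained plateau rather than a momentary lull.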

D. Estimating Polynomial Models
The GP model (i.e., ŷ_gp|x ∼ N(ŷ_gp, σ_y)) resulting from the Active Learning scheme would itself serve as a model describing the static nonlinear element of interest. However, GP models as such are known to be computationally expensive [11], and therefore are not very likely to be usable as control-oriented, computationally efficient models at present. As such, the ultimate solution this paper endeavors to obtain is the polynomial model of order m formulated in Section III. For an assumed order m, φ_c can be found by minimizing the norm ||ŷ - ŷ_gp||, where ŷ is a vector containing N polynomial model output points and ŷ_gp is the corresponding vector of GP model outputs. When the polynomial model is expressed in its matrix form ŷ = X_p φ_c, φ_c can be found by linear least squares as:

φ_c = (X_p^T X_p)⁻¹ X_p^T ŷ_gp    (13)

Thus, assuming a polynomial order m and solving for φ_c using (13) yields a polynomial model describing the static nonlinear element, solving the problem this paper focuses on. It is proposed to improve the polynomial model by starting from m = 1 and recursively increasing the polynomial order while solving for the coefficients φ_c. Such an approach aligns with the Weierstrass approximation theorem [21] and can produce polynomial models having a low residual error. The termination criterion in (14) is proposed for the recursion to stop. As a result of this termination criterion, recursion stops when the MARE (expressed in (15)) between the polynomial model and the GP model falls below a threshold γ_p, γ_p ∈ ℝ+, γ_p << 1. The flowchart in Fig. 2 will be helpful to grasp the polynomial model fitting scheme.
MARE < γ_p    (14)

MARE = (1/N) Σ_{i=1}^{N} |ŷ_i - ŷ_gp,i| / |ŷ_gp,i|    (15)

For the purpose of the demonstrative example, γ_p is set to 0.01.
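The order-recursion above can be sketched as follows (illustrative only; `m_max` is an assumed safety cap not stated in the paper, and `np.linalg.lstsq` solves the least-squares problem of (13) in a numerically stable way rather than forming the normal equations explicitly):

```python
import numpy as np

def fit_polynomial_to_gp(x_eval, y_gp, gamma_p=0.01, m_max=20):
    # Increase the polynomial order m from 1, solving the linear least-squares
    # problem of (13) at each order, until the MARE between the polynomial and
    # the GP model outputs falls below gamma_p (gamma_p = 0.01 in the example)
    mask = y_gp != 0
    for m in range(1, m_max + 1):
        Xp = np.vander(x_eval, m + 1, increasing=True)  # columns: 1, x, ..., x^m
        phi, *_ = np.linalg.lstsq(Xp, y_gp, rcond=None)
        y_hat = Xp @ phi
        mare = np.mean(np.abs(y_hat[mask] - y_gp[mask]) / np.abs(y_gp[mask]))
        if mare < gamma_p:
            break
    return m, phi
```

For a GP model that is itself (nearly) polynomial, the recursion stops at the matching order; for a general smooth target it stops at the lowest order meeting the MARE threshold.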

V. RESULTS
The parameter values used for the demonstrative example involving the static nonlinear element in (7) are summarized in Table I, and the proposed termination criteria can be summarized as follows:
• The Active Learning scheme terminates when the iteration count reaches i = N, or when the measures γ_α, γ_η, and γ_σ (described in (10), (11), and (12)) satisfy the conditions γ_α < γ_hyp, γ_η < γ_hyp, and γ_σ < γ_hyp for N_hyp consecutive instances (iterations).
• The polynomial fitting terminates when the MARE (described in (15)) satisfies MARE < γ_p.
The identification exercise was repeated 50 times to assess the performance of the methodology. Fig. 3 shows the histogram of the number of iterations taken for the Active Learning scheme to converge. Fig. 4 shows the measurements taken in a trial which converged in 164 Active Learning iterations, and Fig. 5 shows the performance of the GP and polynomial models estimated from this data, alongside the noisy (γ = 0.05) static element. How the termination criteria for both the Active Learning scheme and the polynomial fitting scheme are met for this example is illustrated in Fig. 6 and Fig. 7.

VI. CONCLUSIONS
An approach unifying Active Learning, over-fit avoidance, and model structure determination (polynomial architecture) for system identification of static nonlinear elements was proposed. The approach includes two steps: (1) using a GP-based Active Learning scheme to estimate a smooth nonlinear model while collecting a set of measurements that is optimal in a sense; such a method constitutes a framework for data collection which guides users about the sufficiency or insufficiency of the measurements collected; (2) a polynomial model fitting scheme using the estimated GP model. Estimating the smooth GP model avoids over-fitting; the subsequent polynomial fitting delivers a model that is more computationally efficient and control-oriented than the GP model. Techniques for parameter initialization and algorithm termination were introduced, and the effectiveness of the approach was demonstrated for the SISO case via an example. Since this work focused only on static elements, future work can focus on extending similar schemes to dynamical systems as well as MIMO cases. Another interesting question will be eliminating the use of GPs (as they can be computationally expensive for large data sets) and proposing Active Learning schemes based on more computationally efficient means while remaining model-free (i.e., not relying on an assumed model or a set of assumed models). Relaxing some of the assumptions made in this paper will also be interesting.