Bridging Reinforcement Learning and Iterative Learning Control: Autonomous Reference Tracking for Unknown, Nonlinear Dynamics

—This work addresses the problem of reference tracking in autonomously learning agents with unknown, nonlinear dynamics. Existing solutions require model infor- mation or extensive parameter tuning, and have rarely been validated in real-world experiments. We propose a learning control scheme that learns to approximate the unknown dynamics by a Gaussian Process (GP), which is used to optimize and apply a feedforward control input on each trial. Unlike existing approaches, the proposed method neither requires knowledge of the system states and their dynamics nor knowledge of an effective feedback control structure. All algorithm parameters are chosen automatically, i.e. the learning method works plug and play. The proposed method is validated in extensive simulations and real-world experiments. In contrast to most existing work, we study learning dynamics for more than one motion task as well as the robustness of performance across a large range of learning parameters. The method’s plug and play applicability is demonstrated by experiments with a balancing robot, in which the proposed method rapidly learns to track the desired output. Due to its model-agnostic and plug and play properties, the proposed method is expected to have high potential for application to a large class of reference tracking problems in systems with unknown, nonlinear dynamics.

. A robot with unknown dynamics is meant to track a reference trajectory leading to a desired, highly dynamic motion (top). On each iteration, the proposed learning method determines, based on experimental data, a Gaussian Process model, which is in turn used to design and apply a feedforward control input (bottom).
is paved by control techniques that enable robots to precisely perform agile and dynamic motions. For example, model predictive control can achieve accurate motion if a precise model of the dynamics is available [4]- [6]. Requirements regarding the model's precision can be relaxed by robust or adaptive control techniques if the uncertainties comply with preset assumptions [7]- [9]. Under similar conditions, Iterative Learning Control (ILC) can overcome model uncertainties and unknown disturbances by learning from errors of previous trials [10], [11]. However, all of these control approaches require system-specific prior knowledge to craft a suited model, controller, or learning configuration. Autonomy requires a methodology that selfreliantly learns a solution to the control problem without requiring any system-specific prior knowledge. In particular, Reinforcement Learning (RL) techniques have been employed to solve complex motion tasks without requiring any prior information. However, RL solutions typically suffer from two major drawbacks: First, most of the results were obtained in simulated environments [12]- [14]. Second, the few real-world applications typically required multiple hours of learning, and the resulting controllers can be prone to failure [15]- [18]. A major breakthrough with respect to robustness and data-efficiency was achieved by hybrid techniques that learn parameter-free models, namely Gaussian Processes (GP), but also employ system-specific information such as knowledge of a state vector and an effective state feedback structure [19]. In the prominent example of PILCO [20], experimental data are used to approximate the unknown dynamics by a GP, which is used to determine the optimal parameters of a state feedback controller. By this approach, an inverted pendulum on a cart could be swung up and stabilized in 12 seconds of system interaction [20]. However, in the context of autonomous motion learning, GP-based learning methods still suffer from two drawbacks: First, previously proposed methods only solve set-point stabilization tasks, which do not enable robots to perform challenging, dynamic maneuvers. These require reference tracking. Second, GP-based learning techniques still require system-specific prior knowledge such as the configuration of cost functions, a state vector that fully describes the system dynamics, and a control structure that is effective with respect to the problem at hand. Hence, the methods are not suited for plug & play learning of highly dynamic robotic motions.
The present contribution proposes a GP-based learning method for autonomously solving highly dynamic reference tracking tasks in systems with unknown, nonlinear dynamics. The proposed method autonomously determines all of its necessary parameters such that plug & play application becomes feasible. The method's capability to rapidly learn solutions to various reference tracking tasks while not requiring any system-specific prior knowledge is validated by extensive simulations and real-world experiments using a two-wheeled inverted pendulum robot.

A. Related Work
Learning for control has been considered in a large body of literature that can be categorized by (i) the considered control problem respectively control strategy, (ii) necessary system-specific prior knowledge, and (iii) speed of learning. Reinforcement Learning (RL) techniques typically do not require any model and only few learning parameters such as step sizes or weights in cost functions. Furthermore, general RL approaches such as genetic algorithms [21] or policy gradient approaches [22] can be applied to arbitrary control problems with unknown, nonlinear dynamics, but in turn require comparatively long periods of learning [20]. The speed of learning can be significantly increased if the technique is targeted towards a specific control problem and strategy such as stabilization by state feedback control, see e.g. [23], [24]. A particularly data-efficient approach are so called model-based techniques that model the unknown, nonlinear dynamics by a GP, which is then used to design a state feedback controller [25]. Some successful applications to real-world examples are the control of a single inverted pendulum [20], double inverted pendulum [26], and robotic manipulator [27]. The concept of GP-based learning control has been further investigated in a variety of contributions. Stability of feedback-controlled GPs has been analyzed [28], [29], the problem of computational and data requirements has been investigated [30], [31], and solutions for safely improving an existing feedback controller have been proposed [32]- [34]. While all of these works consider the challenging problem of efficiently learning control solutions for unknown, respectively uncertain, nonlinear dynamics, they have focused on the problem of set-point stabilization of systems, for which an effective feedback control structure is known. If the control tasks consists in performing a highly-dynamic motion, the achievable performance of timedomain feedback control is inherently limited by phenomena such as unknown delays, measurement noise, or non-minimum phase dynamics. To overcome the performance limitations of feedback control, a feedforward control component is required, see Fig. 2.
In contrast to GP-based learning techniques, Iterative Learning Control (ILC) has focused on reference tracking tasks solved by feedforward control [35], [36]. Model-based techniques like norm-optimal or H ∞ ILC automatically determine the learning parameters, but require a model of the linear plant dynamics [37]- [39]. Model-free approaches like PD-ILC do not require a model but learning parameters that are typically tuned in experiment [40]. The concepts of PD-type [41] and norm-optimal [42] ILC have been extended to the case of nonlinear dynamics, but assume the dynamics to be known. To relax requirements with respect to available model information, recent research has focused on so called datadriven ILC (DD-ILC) [43], which does not require a model of the plant. In the case of nonlinear, unknown dynamics, DD-ILC methods typically employ dynamic linearization of the plant dynamics and estimate the gradient of said linearization [44]- [46]. Alternatively, neural networks (NN) have been employed in DD-ILC to model the unknown dynamics [47], [48]. While DD-ILC methods can solve reference tracking tasks without requiring a plant model, some system-specific prior knowledge is required as, e.g., the signs of the dynamic linearization [44]- [46] or the layout of a suited neural network [47], [48].
In summary, we conclude that reference tasks in systems with unknown, nonlinear dynamics can be solved by DD-ILC methods, which, however, require system-specific prior knowledge such that autonomous plug & play application is generally not possible. In contrast, set-point stabilization problems can be solved by GP-based learning that assume comparatively little system-specific prior knowledge. However, in the context of reference tracking tasks, GP-based learning methods suffer from the inherent limitations of feedback control. To the best of our knowledge, there exists no learning method that autonomously solves reference tracking tasks for unknown, nonlinear systems, employs feedforward control to overcome the limitations of feedback control, and does not require system-specific prior knowledge such that autonomous plug & play application is enabled.

B. Contributions
In this contribution, a GP-based ILC scheme is proposed that autonomously solves reference tracking tasks in systems with unknown, nonlinear dynamics. The proposed method includes a procedure to autonomously determine necessary parameters and enable plug & play application. Since the proposed method employs a GP to model the input/output dynamics, only the output variable, instead of an entire state vector, has to be known and measured. To overcome the inherent limitations of feedback control, the proposed method employs feedforward control. The proposed method is first validated by extensive simulations of a two-wheeled inverted pendulum robot (TWIPR), in which precise tracking is achieved after a small number of trials. Unlike existing approaches, the proposed method is not only verified for a single, well-chosen parameter configuration but for a wide range of parameter combinations such that robustness with respect to the autonomously determined parameters is ensured. In contrast to a variety of contributions, in which validation was restricted to simulated environments, the proposed method's capability of solving real-world reference tracking tasks in a plug & play manor is validated by experiments on a TWIPR, see Fig. 1.

C. Notation
Let N ≥0 and N denote the set of nonnegative, respectively positive, integers. Let R denote the set of real numbers. Vectors are in lower-case letters and bold type, e.g., v. Matrices are in upper-case letters and bold type, e.g., A. The i th component of a vector v is denoted by [v] i . The element in row i and column j of matrix A is denoted by [A] ij . Let v denote a norm of vector v, and A the corresponding, induced matrix norm of matrix A. A specific example is the infinity norm, denoted by · ∞ .

II. PROBLEM FORMULATION
Consider an autonomous system that can repeatedly attempt a reference tracking task, as, e.g., a robot trying to perform a desired maneuver. We assume that the system's output, e.g., a joint angle or position, can be influenced by an input signal, e.g., a motor torque, and that the relation of these variables is deterministic, causal, and time-invariant. However, we do not assume that a model of the dynamics is available and we do assume the general case of nonlinear dynamics.
Formally, consider a discrete-time, single-input, singleoutput, repetitive system with a finite trial duration of N ∈ N samples, and, on trial j ∈ N ≥0 and sample n ∈ [1, N ], output variable y j (n) ∈ R, respectively input variable u j (n) ∈ R. The samples are collected in the so called output trajectory y j ∈ R N , respectively input trajectory u j ∈ R N , i.e., ∀j ∈ N ≥0 , Without loss of generality, the dynamics can be written in the lifted form ∀j ∈ N ≥0 , where p is the unknown dynamics function. The task consists of updating the input u j from trial to trial such that the output y j converges to the desired reference trajectory r ∈ R N . Tracking performance is measured by the error trajectory and root-mean-squared error (RMSE) The problem considered in this work consists in developing a learning method that updates the input trajectory on each trial such that the RMSE decreases. Learning performance is judged based on the progression of the RMSE through trials, and the RMSE shall decline quickly and monotonically. The learning method must not require any a priori model information on the plant dynamics. To support plug & play application, the method must autonomously determine necessary parameters. Furthermore, the method must provide a fair degree of robustness with respect o autonomously determined parameters.

III. PROPOSED LEARNING METHOD
We address the proposed problem by an iterative learning scheme, in which each iteration consists of three steps. First, a  parameter-free model of the plant dynamics is identified using the experimental data of previous trials, see Section III-A. To accommodate for possibly nonlinear dynamics, a generic GP model is employed, which predicts the output trajectory for a given input trajectory. Second, the updated input trajectory is determined by solving an optimal feedfoward control problem based on the GP model, see Section III-B. Third, the updated input trajectory is applied to the plant and resulting data is in turn used to refine the GP model. The structure of the proposed learning scheme is depicted in Fig. 3. To enable plug & play application, the proposed method autonomously determines necessary parameters, see Section III-C.

A. Gaussian Process Model
We propose a Gaussian Process (GP) model, formally a function m : R N → R N , that predicts the plant's output trajectoryŷ ∈ R N based on an input trajectory u ∈ R N , where the trial index is omitted for sake of notational simplicity.
Let f (v) : R D → R denote the unknown target function that that depends on the regression vector v ∈ R D . Predictions are based on K ∈ N observations z k ∈ R stemming from The K observation pairs (z k , v k ) are collected in the observation training vectorz ∈ R K and regression training matrix V ∈ R D×K , i.e.,z The kernel function of two regression vectors v ∈ R D and v ∈ R D is denoted by k vv ∈ R. The kernel matrix of two regression matrices V ∈ R D×K ,V ∈ R D×K , which are assembled according to (8), is denoted by K VV ∈ R K×K and has entries K VV ij = k vivj .
Given F ∈ N test regression vectors assembled in the regression matrix V ∈ R D×F , the predicted mean µ ∈ R F and covariance Σ ∈ R F ×F are given by The general GP framework can be employed in three ways to model the unknown dynamics (3) An IIR model results when the regression vector consists of the current input and the P ∈ N previous output samples, i.e., (12) A SS model results when the regression vector consists of the current input and the previous state sample, i.e., The SS model requires multiple GPs with each predicting the progression of a single state variable [49], which not only increases computational complexity, but also requires measurement of the full state vector. Furthermore, IIR and SS models require roll-out predictions meaning that previous predictions are required for predicting the current sample [49], which also increases model complexity. In contrast, the FIR model only requires a single batch prediction according to (9). We, hence, employ a FIR model and the regression vector is defined as in (11). We further choose difference-predictions, i.e., which, compared to absolute predictions, increase the model's capability of extrapolation, see [49].
As kernel function, we employ a squared-exponential kernel (SEK) where l ∈ R is a so called length scale.
Remark 1. SEKs allow a GP to model arbitrary target functions. In the context of dynamic systems, a SEK leads to a nonlinear, time-invariant (NTI) model. Using a squared kernel instead, as e.g.
results in a linear, time-invariant (LTI) model. If the plant dynamics are linear, one may employ a squared kernel to decrease computational complexity in comparison to a NTI model.
To predict an output trajectoryŷ for an arbitrary input trajectory u, the latter is used to determine N regression vectors v n , n ∈ [1, N ], according to (11), which are assembled in a regression matrix V according to (8). The predicted mean vector µ follows from (9). By (14), µ contains difference predictions such that the components ofŷ follow from the cumulative sum of µ, i.e.
Mean and covariance predictions require the measurement variance σ 2 w and length-scale l, which are so called hyperparameters. Typically, hyper-parameters are determined based on training data, and numerous approaches have been detailed in the literature [50]. We propose selecting hyper-parameters by minimizing the so called leave-one-out squared-mean-error (LOO-SME). For each of the k ∈ [1, K] available observations z k , the remaining observation pairs are used to predict z k . The LOO-SME follows from summing the squared difference between the K leave-one-out predictions and respective observations z k . Formally, the k th LOO-predictionμ k is given byμ leading to the LOO-SME e LOO and the hyper-parameters θ := σ 2 w l T follow from The optimization problem (20) can be solved efficiently, because analytic expressions of the gradients are available, see [50].
Remark 2. Determining hyper-parameters by LOO-SME minimization is a rather uncommon choice as the variance of the predictions is not taken into account [50]. However, comparison of LOO-SME minimization with state of the art methods such as evidence maximization, see [50], has shown that the former leads to more reliable performance of the learning scheme proposed in this work. We assume that this is due to the proposed learning scheme solely relying on the GP's mean prediction and, hence, hyper-parameters should be chosen such that the accuracy of mean predictions is maximized.
GP predictions are known to become computational expensive with increasing amounts of training data [51]. To overcome this limitation, various data selection approaches have been proposed to reduce training data to a tractable amount, see, e.g., [51], [52]. In the present work, we simply propose limiting the training data to the last H ∈ N trials.

B. Optimal Feedforward Control
After the GP model has been identified, it is used to determine an input trajectory that leads to a smaller difference between reference and output trajectory than the input trajectories of previous trials. We propose an optimal control design, where the input is chosen to minimize a quadratic cost criterion. The latter not only considers the predicted tracking error, but also the change of the input trajectory to avoid model inversion and, hence, increase robustness with respect to the uncertainty of the current trial's model. Formally, the cost criterion is given by where q, s ∈ R >0 are scalar weights. On each trial, the updated input trajectory u j+1 is chosen to minimize the cost criterion, i.e., ∀j ∈ N ≥0 , u j+1 = argmiñ u J(ũ) .
The optimization problem (22) can be solved efficiently since analytic expressions of the cost's gradient with respect to the input variable can be obtained [20].

C. Autonomous Parameterization
Ideally, autonomous learning methods should require neither a priori model information nor manual tuning of parameters. In contrary to previous contributions, the proposed method automatically determines necessary parameters by the procedure outlined in this section, and, as a result, plug & play application is enabled, see Fig. 4.
First, we consider the choice of the initial data set I that is used to determine the first GP model and consists of I ∈ N trajectory pairs (y i , u i ), i.e., For this purpose, we first determine the largest significant frequency f O of the reference trajectory. The frequency f O is used to design a zero-phase low-pass filter f LP . The lowpass filter f LP is applied to a zero mean normal distribution with covariance σ 2 I I, and the initial input trajectories are drawn from the resulting distribution, i.e., The input variance σ 2 I is iteratively increased until an input trajectory drawn according to (24) leads to an output trajectory, whose maximum roughly equals the maximum of the reference, i.e., The number of initial trials is a robust parameter and is set to one, I = 1. chosen based on the experimental data. The scalar q, which weights the error trajectory, is chosen as unity, i.e., The scalar s, which weighs the change in input variable, is chosen as the average squared ratio of output to input maxima over the I initial trials, i.e., The procedure described in this section automatically determines all the necessary parameters without requiring any a priori information on the plant. The following simulations are going to demonstrate that the automatically determined parameters lead to the desired learning performance and that the method provides a fair degree of robustness with respect to the automatically determined parameters.

IV. VALIDATION BY SIMULATION
In this section, the proposed learning method is validated by simulation of a two-wheeled inverted pendulum robot (TWIPR) that is meant to perform challenging maneuvers, see Fig. 5. The TWIPR and automatic determination of learning parameters are presented in Section IV-A. Afterwards, the learning performance for three representative references is investigated in Section IV-B, and the proposed method's robustness with respect to learning parameters is verified in Section IV-C. Lastly, the effect of the weight s on the learning characteristics is studied in Section IV-D.

A. The Learning Problem
Consider the TWIPR and three desired maneuvers depicted in Fig. 5. The corresponding pitch angle reference trajectories are denoted by r 1 ∈ R 25 , r 2 ∈ R 50 , and r 3 ∈ R 71 and formal definitions are given in Appendix II. The robot consists of a pendulum body housing main electronics including a microcomputer, inertial measurement units, motors and accumulator. Wheels are mounted onto the motors such that the robot can drive while balancing its chassis. A model of the TWIPR's nonlinear dynamics is not available. Only an approximate, linear model of the dynamics at the upright equilibrium has been obtained, which merely suffices to design a stabilizing feedback controller, see Appendix III. Due to the imprecise model, the feedback controller can not track the references precisely, and we instead employ the proposed learning method.
Instead of a state vector, the learning method only requires knowledge of the output variable, which is given by the pitch angle, i.e., ∀n ∈ N ≥0 , y(n) := Θ(n) .
The input variable is given by the motor torque, ∀n ∈ [1, N ], u L (n) ∈ R. Application of the proposed learning method requires learning parameters that are automatically determined by the procedure outlined in Section III-C. We aim at tracking pitch trajectories with a maximum of approximately 75 degrees and and spectral content roughly below 5 Hertz, i.e., Based on the frequency f O , a forward-backward, second order Butterworth filter f LP is designed, which is used for drawing initial input trajectories, see (24). To determine the input variance σ 2 I , three test input trajectories are applied to the unknown, nonlinear dynamics [53], which are implemented as a black-box simulation model. The input trajectories are drawn according to (24) with respective variances As detailed in Fig. 6, the parameterization selects σ 2 I = 25 because the resulting output trajectory has the some order of magnitude as the reference.
Next, the weights s and q of the cost function are determined. According to (26), q = 1 is selected. To determine the weight s, one initial trial with the previously determined input variance is performed. As detailed in Fig. 7, the value of of s directly results from (27) and the experimental data, i.e., To demonstrate the data-efficiency of the proposed learning method, the training data are limited to the last five trials, i.e., H = 5.   6. Determination of the input variance σ 2 I : Three different values are used to draw random input trajectories that are applied to the plant. The input variance σ 2 I = 1 hardly excites the system. In contrary, the input variance σ 2 I = 225 leads to an output trajectory that significantly exceeds the reference's maximum. The input variance σ 2 I = 25 is selected, because the corresponding output trajectory has the same order of magnitude as the reference. Fig. 7. Determination of the weight s: Based on the maxima of input and output trajectory in an initial trial, the weight s is chosen according to (27).

B. Learning Performance
First, learning performance for the desired references r 1 , r 2 , and r 3 is investigated. The parameters are chosen according to the previous section and only one initial trial I = 1 is used. In Fig. 8, progressions of the output trajectories and error norms over the trials are depicted. For all three references, the proposed method achieves precise tracking after roughly 15 trials. The respective RMSEs rapidly decline over the first trials and converge to a small value close to zero. In case of the references r 2 and r 3 , the RMSE decreases monotonically. The simulations demonstrate that, by using the automatically determined parameters, the proposed method rapidly learns to track three different reference trajectories without requiring any system-specific prior knowledge.

C. Robustness Analysis
The previous simulations have validated the method's capability of achieving satisfying tracking performance when using the automatically determined parameters. However, a Fig. 8. The proposed learning method is employed to track the three desired references. Despite varying lengths, amplitudes and frequencies of the references, satisfying tracking performance is achieved within 10-15 trials. The RMSE is monotonically declining for two of the references and converges to a small value close to zero in all three scenarios. Fig. 9. The proposed learning method is run for a total of 5000 different combinations of parameters and initial data. The RMSE's maximum over all runs converges to a value significantly lower than the initial. Hence, robust learning is guaranteed for a large parameter space. learning method should, ideally, not only achieve satisfying performance for a single parameter configuration, but for a wide parameter space. Hence, the proposed method's robustness with respect to the automatically determined parameters is validated in the following study, where we aim at tracking reference r 1 . We consider two different scenarios, namely, the greedy case of one initial trial, I = 1, and the conservative case of five initial trials, I = 5. Recall the two parameters s and σ 2 I , which are the weight in the optimal control problem and the initial input variance.
In the case of I = 1, the weight s is chosen from the set S 1 that consists of ten logarithmically spaced values and the initial input variance σ 2 I is chosen from the set V 1 that consists of ten quadratically spaced values, i.e., for I = 1, For each of the 100 parameter pairs in P 1 , 50 runs are performed. A run r consists of choosing a parameter pair (s, σ 2 I ) k from P I , producing I initial input trajectories, and executing the proposed learning method for an additional 50 trials such that a progression of the RMSE throughout trials is obtained, which we denote by where j ∈ [0, I + N ] is the trial index, k ∈ [1, 100] is the parameter index, and r ∈ [1,50] is the run index. The same procedure is applied in the case of I = 5, but the parameters are chosen from larger sets, i.e., for I = 5, To evaluate performance, the maximum e max j ∈ R, 99th percentile e P99 j ∈ R, 75th percentile e P75 j ∈ R, and median e med j ∈ R of the RMSE over parameters and runs are considered. Formally, ∀j ∈ N ≥0 , e max j := max k∈ [1,100],r∈ [1,50] e RMSE j,k,r , and e P99 j , e P75 j , e med j are defined accordingly. Results depicted in Fig. 9 show that, for both I = 1 and I = 5, the maximum of the RMSE converges to a value that is a roughly ten times smaller than the initial value such that the method's robustness is validated. The RMSE's 99 th percentile is monotonically decreasing, which implies that, besides single outliers, the method achieves the desired form of convergence as defined in Section II. Furthermore, the RMSE's median declines below a value of five degrees within 25 trials meaning that satisfying tracking performance is achieved. Lastly, it should be noted that a wider parameter space could be considered in the case of I = 5 meaning that, by increasing the amount of initial data, the robustness of the method can be further increased. Increasing the value leads to faster learning but also a larger variance in performance. Excessively small values of s may lead to a RMSE that diverges for some initial data.

D. Effects of Weights
The previous analysis has shown that the method rapidly learns to track a desired reference while also being robust with respect to the automatically determined parameters. Next to the use case of automatic plug & play application, the method can also be tuned to meet the needs of a specific application. Hence, we next investigate how the choice of weight s affects learning characteristics, namely the rate of convergence and robustness with respect to initial data. For this purpose, we consider the weights s ∈ {10 0 , 10 −2 , 10 −4 , 10 −6 } .
The remaining learning parameters are chosen as one initial trial I = 1 and an initial input variance σ 2 I = 25. For each weight, 50 runs with differing initial data are performed and performance is judged based on the RMSE's 90 th percentile and median over the 50 runs.
Results depicted in Fig. 10 show that for a comparatively large value of s = 10 0 , the RMSE monotonically declines at a slow pace. Furthermore, there is hardly any difference between median and 90 th percentile performance. Decreasing the value to s = 10 −2 leads to a significant increase in speed of convergence. Speed of median convergence can be further increased by lowering the value of the weight to s = 10 −4 , which, however, comes at the price of larger 90 th percentile RMSEs, which imply an increase in performance variance. If the weight is lowered to an even smaller value, s = 10 −6 , median performance is not further increased, but the 90 th percentile RMSE does no longer converge meaning that learning fails in a significant portion of runs. In summary, the study indicates that the weight s may be used to tune learning behavior, whereby comparatively large values of s lead to slow learning that is robust with respect to initial data. Decreasing the value of s can increase the speed of learning, but may come at the price of sensitivity with respect to initial data.

V. VALIDATION BY EXPERIMENT
To demonstrate the plug & play applicability of the proposed learning method, it is applied to a real-world TWIPR, which has been previously used to validate learning control methods [54]. The robot is meant to dive beneath an obstacle as depicted in Fig. 1  First, the proposed method determines the learning parameters yielding I = 1, σ 2 I = 2, and s = 0.1. The initial input trajectory is drawn according to (24) an applied to the TWIPR. The corresponding output trajectory significantly differs from the reference with a RMSE of roughly 75 • , see Fig. 11. From here onwards, the method iteratively determines a GP model, updates the input trajectory, and performs an experimental trial. Once learning begins, the RMSE rapidly declines, the RMSE drops below 20 • on the fourth trial, and a RMSE of less than 10 • is reached on the eighth trial. Sufficiently precise tracking precision for diving beneath the obstacle is achieved on the seventh trial and a RMSE close to zero is achieved on the tenth trial. Note, that the RMSE slightly increases on some of the trials, which is likely due to the initial conditions varying from trial to trial.
In summary, the experiments validate that the proposed method enables a real-world robot with unknown, nonlinear dynamics to learn a challenging maneuver. Not only did learning require a small number of trials (≈ 10) but the method could also be applied in a plug & play manor without iterative tuning of parameters.

VI. CONCLUSION
In this work, a GP-based learning control scheme has been proposed that autonomously solves reference tracking tasks in systems with unknown, nonlinear dynamics. On each iteration, the unknown dynamics are approximated by a Gaussian Process (GP), which is then used to determine and apply an optimal feedforward control input. The method is completely plug & play, since all required algorithms parameters are determined automatically and manual tuning is avoided. The effectiveness and efficiency of the method were demonstrated by simulations and experiments using the example of a two-wheeled inverted pendulum robot that rapidly learns to perform several challenging maneuvers without any manual tuning or system-specific prior knowledge.
In contrast to previous GP-based learning control approaches, the proposed method overcomes the inherent limitations of time-domain feedback control; it neither assumes knowledge of an effective feedback control structure, nor does it assume the entire state vector to be known. Instead, the proposed method directly adjusts the input based on the measured output. It is therefore as model-agnostic and independent of system-specific prior knowledge as pure reinforcement learning schemes.
While reinforcement learning approaches typically require hundreds of trials for convergence and are therefore unsuitable Fig. 11. Experimental results of the TWIPR learning to dive beneath an obstacle. Starting from an initial RMSE of roughly 75 • , the tracking error rapidly declines over the the following trials and sufficiently precise tracking for diving beneath the obstacle is achieved on the seventh trial.
for experimental validation, the proposed learning control method solves reference tracking problems in a small two-digit number of trials and was successfully validated in real-world experiments.
While the vast majority of previous contributions either validate methods only in simulations or provide only a single results for one carefully chosen parameterization and one specific motion, the present validation has proven effectiveness of the proposed method for several different motions and a large range of algorithm parameterizations. We thereby demonstrated robustness with respect to the automatically determined parameters, and we further investigated the effect of the learning weights on the trade-off between speed of learning and robustness.
We believe that the proposed method is highly suitable for use in kinematic systems that must perform challenging, highly dynamic maneuvers. Beyond the use case of rigid robotics, we expect the proposed method to have a major impact on the development of soft robotics, exoskeletons, and neuroprosthetics, and will therefore contribute to the evolution of autonomous robotic systems that rapidly learn to perform complex, dynamic motions under unknown conditions. Future work will be concerned with extending the proposed approach to multi-input/multi-output systems and applying the method to other real-world applications. Moreover, combined approaches for simultaneous learning of feedfoward and feedback control will be devised and studied.

APPENDIX I COMPARISON OF FEEDFORWARD AND FEEDBACK CONTROL FOR REFERENCE TRACKING
We consider a discrete-time, linear system of first order with sampling period T = 0.02 seconds, output y ∈ R, and input u ∈ R. The system is affected by an input delay of one sample leading to state-space dynamics ∀n ∈ N, x(n + 1) = 0.5 1 0 0 x(n) + 0 1 u(n) , (42) where x ∈ R 2 is the state vector. The system's output is affected by measurement noise, i.e.
The task consist of having the output y follow the reference r ∈ R over a finite horizon of N = 100 samples with ∀n ∈ [1, N ], r(n) = sin(2πT n) .
The feedforward control strategy consists of applying an input trajectory The input values are determined by optimization such that the squared tracking error is minimized, i.e.
The feedback control strategy consists of a generic, non-linear function to ensure that performance is not limited by the structure of the feedback law. In particular, the input values u FB are computed as the sum of ten polynomials of tenth order, to which the current and nine previous error samples serve as inputs, i.e., ∀n ∈ [1, N ], The set of feedback parameters K = {k ij | i, j ∈ [1, 10]} is determined via optimization such that the squared tracking error is minimized, i.e.