Search-Based Planning and Reinforcement Learning for Autonomous Systems and Robotics



Introduction
Deep learning [19] and (deep) reinforcement learning [16], [22] have produced a series of landmark results in applied artificial intelligence for vehicles and autonomous systems: DeepMind's Deep Q-Learning mechanism [18] in 2014, AlphaGo's victory over the champion in the game of Go [21] in 2016, and OpenAI's PPO [20] in 2017. Broadly, deep learning is the leading approach to data representation, while reinforcement learning is a modern approach to decision-making. Before moving to these modern approaches, it is essential to present the fundamentals of intelligent autonomous systems, which must at least solve simultaneous localization and mapping (SLAM) while interacting with an unknown environment. The current state of the art also covers search-based planning and replanning [17] for autonomous and mobile systems, and motion planning [1], [3], [8] explores the associated probabilistic problems. In this paper, the next section describes path planning, presenting search-based planning as finding the shortest path under a heuristic. The following section covers uncertainty using SLAM concepts. We then introduce reinforcement learning for generating trajectory paths in a maze environment. Finally, we conclude with experiments in both simulation and the real world.

Path Planning
In this section, we explore path-finding algorithms by comparing several popular ones: Dijkstra's algorithm, Best-First-Search, and the A* algorithm. Initially, the controlling functions were written in both Python and C++. Through this we learned more about the languages, became able to write publisher and subscriber nodes, and came to understand how a system like ROS works as open-source software and how to develop software without reinventing the wheel. In the end, we successfully wrote a navigation application that implements the A* shortest-path algorithm to make a vehicle or autonomous robot follow the 2D path generated in software. The source code is well commented, the packages are well organized, and both are uploaded to Bitbucket, making it easier for other contributors at universities or organizations to continue the project. First, we define the theoretical concepts below.

Dijkstra's Algorithm and Best-First-Search
A common example of a graph-based path-finding algorithm is Dijkstra's algorithm, illustrated in Figure 2.1. Dijkstra's algorithm is a shortest-path algorithm conceived by computer scientist Edsger W. Dijkstra in 1956. It works by visiting a set of open nodes in the graph, starting with the start node. It then repeatedly examines the unexamined node with the lowest distance cost, adding it to the set of closed nodes (nodes that have been examined). It expands outward from the starting point until it reaches the goal. If there are no negative-edge nodes (nodes with negative distance cost), Dijkstra's algorithm is guaranteed to find a shortest path from the start to the goal, since the lowest-distance nodes are examined first. In the following map, the star is the starting point, the "X" is the goal, the white path is the calculated path, and the blue area around it is the region Dijkstra's algorithm has scanned. A* is a best-first search algorithm, meaning that it chooses the path considered the best solution (least distance traveled, shortest time, etc.) by searching among all possible paths to the target.
As mentioned before, Dijkstra's algorithm accurately finds the shortest path, but it wastes time exploring directions that are not promising, while Greedy Best-First Search explores promising directions but may return a longer path. The A* algorithm accounts for both the actual distance from the start and the estimated distance to the goal, so it can guarantee the shortest path while taking much less time than Dijkstra's algorithm.
First, let us define the cost function: $f(n) = g(n) + h(n)$, where $f(n)$ is the cost of node $n$; the lower $f(n)$, the better the node. Here $g(n)$ is the cost it took to get to the node, for example the length of the road traveled from the start, and $h(n)$ is the heuristic score: the tentative cost to reach the goal from that node. The accuracy of your path depends on how good your $h$ is; very rarely (if ever) is your $h$ perfect.
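To make the cost function concrete, here is a minimal sketch of A* on a 4-connected occupancy grid, using the Manhattan distance as $h$. The grid layout, function names, and unit step costs are illustrative assumptions, not the navigation app's actual code; note that setting $h \equiv 0$ recovers Dijkstra's algorithm.

```python
import heapq

def manhattan(a, b):
    # Admissible heuristic h(n) for a 4-connected grid.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(grid, start, goal):
    """A* over a 2D grid of 0 (free) / 1 (obstacle) cells.
    Each heap entry is (f, g, node, parent); f = g + h."""
    open_set = [(manhattan(start, goal), 0, start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_set:
        f, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                      # already expanded with a lower cost
        came_from[node] = parent
        if node == goal:                  # reconstruct path by walking parents
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                heapq.heappush(open_set,
                               (g + 1 + manhattan(nxt, goal), g + 1, nxt, node))
    return None  # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # routes around the obstacle row
```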

Uncertainty Representation using the Kalman Filter and Landmark-Based SLAM
This chapter provides a first insight into SLAM. The Kalman filter and the theory behind SLAM are covered briefly before a more practical explanation in the last section of the chapter. The aim of the chapter is to introduce the Kalman filter to the reader to ease further understanding of SLAM.

Principle of Kalman filter
In 1960, Rudolf Emil Kalman published his famous article on a recursive approach to discrete-data linear filtering, "A New Approach to Linear Filtering and Prediction Problems". Since then, with the dramatic development of digital computation, the Kalman filter has become a popular research topic and has been applied in various areas, such as automatic control, robotics, radar technology, and so on.
The Kalman filter was originally designed to collect and incorporate data from sensors in an adaptive way. Once the system of equations and the sensors' statistics are known, the filter gives an optimal estimate after filtering out errors and noise. There are many ways to optimize data against particular criteria, but the Kalman filter appears to be the best linear Gaussian estimator, since it can incorporate all the data given to it.
In short, the Kalman filter is a mathematical system of equations describing a method for effective recursive estimation of states such that the average of all variances and covariances of the state variables is minimized. The Kalman filter is optimal for estimating states in the past, the present, and the future, even when the accuracy of the robot model is not guaranteed. It is of interest to us under the assumption that the system is linear and the noise is Gaussian. There are two phases in the filter: prediction and update. First, the Kalman filter predicts an $n$-dimensional prior state vector $x'_t$ at time step $t$ through the stochastic difference equation

$$x'_t = A_t x_{t-1} + B_t u_t + w_t, \qquad (38)$$

where $A_t$ is the state transition matrix at time step $t$, $B_t$ is the optional control input matrix applied to the control $u_t$, and $w_t \sim \mathcal{N}(0, Q_t)$ is the additive white Gaussian noise (AWGN) of the transition. Next, the filter computes the prior covariance matrix $P'_t$ as

$$P'_t = A_t P_{t-1} A_t^{\top} + Q_t, \qquad (39)$$

where $Q_t$ is the covariance matrix of the transition noise $w_t$ at time step $t$. The filter then enters the second phase, called correction or update, where the posterior state vector $x_t$ and the posterior covariance matrix $P_t$ are improved from the corresponding priors by

$$x_t = x'_t + K_t (z_t - H_t x'_t), \qquad (40)$$

$$P_t = (I - K_t H_t) P'_t, \qquad (41)$$

where $H_t$ is the (Jacobian) matrix from the measurement model $z_t = H_t x'_t + v_t$ at time step $t$, $v_t \sim \mathcal{N}(0, R_t)$ is the random measurement noise (AWGN), $R_t$ is the measurement noise covariance matrix, $I$ is the identity matrix, and $K_t$ is the Kalman gain at time step $t$, computed by

$$K_t = P'_t H_t^{\top} (H_t P'_t H_t^{\top} + R_t)^{-1}. \qquad (42)$$

If, at time step $t$, the observation error covariance $R_t$ decreases, the actual observation $z_t$ is treated as better than the prediction $H_t x'_t$. Conversely, a smaller prior error covariance $P'_t$ makes the actual measurement $z_t$ less important and weights the prior prediction $x'_t$ more. The prediction phase and the correction phase of the Kalman filter are performed sequentially.
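The two phases map directly onto a few lines of linear algebra. Below is a minimal NumPy sketch of one predict-update cycle following equations (38)-(42); the scalar constant-position model at the end is an illustrative assumption, not the robot model used in this work.

```python
import numpy as np

def kalman_step(x, P, u, z, A, B, H, Q, R):
    # Prediction: x'_t = A x_{t-1} + B u_t,  P'_t = A P A^T + Q   (eqs. 38-39)
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    # Kalman gain: K_t = P' H^T (H P' H^T + R)^{-1}               (eq. 42)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # Correction: x_t = x' + K (z - H x'),  P_t = (I - K H) P'    (eqs. 40-41)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new

# Toy 1-D example: estimate a scalar position from noisy measurements.
A = B = H = np.eye(1)
Q, R = np.eye(1) * 0.01, np.eye(1) * 0.5
x, P = np.zeros(1), np.eye(1)
for z in [1.1, 0.9, 1.05]:
    x, P = kalman_step(x, P, np.zeros(1), np.array([z]), A, B, H, Q, R)
print(x, P)  # the mean converges toward ~1.0 and the covariance shrinks
```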
To summarize, at every time step $t$, the belief $bel(x_t \mid z_t, u_{t-1})$ of the robot is represented by the mean $\mu_t$ and the covariance $\Sigma_t$, and the basic Kalman filter algorithm returns $bel(x_t)$ as a normal distribution with mean $\mu_t$ and variance $\Sigma_t$. The mathematical derivation, along with further background on the Kalman filter, can be found in [5], [6], [7].

General landmark based SLAM
The SLAM problem comes in various categories and algorithms, but, with regard to the landmark-based method, it typically consists of the following parts: landmark extraction, data association, state estimation, state update, and landmark update. Again, each part can be solved in many ways. This section is written following [9], which gives a very practical approach to the SLAM process.
Generally speaking, landmarks are features that can easily be re-observed and distinguished from the environment. The robot uses them to find out where it is, much the same way a human does.
Like the state variables, landmarks cannot be sensed directly; we have to reliably extract them from the robot's measurement data. Two landmark extraction algorithms are popular when, as on most robots, a laser scanner is used. First, spike landmark extraction uses extreme changes in the scan to find landmarks: it looks for places in the range readings where two consecutive values differ by more than a certain amount, for example 0.5 meters.
Second, random sample consensus (RANSAC) is a method for extracting lines from a laser scan. It repeatedly samples a small subset of readings and fits a line through them; once this is done, RANSAC checks how many laser readings lie close to this best-fit line. If the number is above a threshold that we set manually, the robot can believe it has seen a line.
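For illustration, here is a hedged sketch of both extraction ideas: a spike detector over consecutive range readings and a bare-bones RANSAC line fit with inlier counting. All thresholds (0.5 m jump, 0.05 m tolerance, 20 inliers) are arbitrary placeholders, and this is not the implementation referenced at [11].

```python
import random
import numpy as np

def spike_landmarks(ranges, jump=0.5):
    """Spike extraction: flag scan indices where consecutive range
    readings differ by more than `jump` meters (threshold is our choice)."""
    return [i for i in range(1, len(ranges))
            if abs(ranges[i] - ranges[i - 1]) > jump]

def ransac_line(points, iters=100, tol=0.05, min_inliers=20):
    """Fit one line to 2-D scan points: repeatedly sample two points,
    fit the line through them, and keep it if enough readings lie close."""
    best = None
    pts = np.asarray(points, dtype=float)
    for _ in range(iters):
        p, q = pts[random.sample(range(len(pts)), 2)]
        d = q - p
        n = np.array([-d[1], d[0]])       # normal to the candidate line
        norm = np.linalg.norm(n)
        if norm == 0:
            continue
        dist = np.abs((pts - p) @ (n / norm))   # point-to-line distances
        inliers = int((dist < tol).sum())
        if inliers >= min_inliers and (best is None or inliers > best[0]):
            best = (inliers, p, q)
    return best  # None if no line passed the inlier threshold

scan = [1.0, 1.02, 0.98, 1.7, 1.72]        # fake ranges with one jump
print(spike_landmarks(scan))                # -> [3]
pts = [(x * 0.1, 0.5) for x in range(30)]   # 30 points on a horizontal line
print(ransac_line(pts) is not None)         # -> True
```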
Code implementing RANSAC can be found at [11]. The data-association problem is that of matching observed landmarks from different (laser) scans with each other. If the robot fails to re-observe and match landmarks, many problems arise, such as getting lost, building a wrong map, and confused navigation. We therefore define a data-association policy that deals with these issues. We assume a database is set up to store the landmarks the robot has seen, initially empty. The rule is that until the robot has seen a landmark N times (a manually set count), it does not consider the landmark worthwhile to use.
This prevents the case where we extract a bad landmark that is hard to distinguish, re-observe, and match again. This technique is called the nearest-neighbor approach, as it relies on association with the nearest landmarks in the database.
For the landmark-based setting, we need to extend some concepts from the probabilistic framework of the previous section. We define $m_i$ as a vector representing the location of the $i$-th landmark, whose true location is assumed time-invariant; the set of all landmarks is denoted $m = \{m_1, \dots, m_N\}$. Unfortunately, most environments in which a robot is situated are chaotic, so the system models are nonlinear and non-Gaussian as well. In this case, the systems must be linearized before we can use the Kalman filter, which is why the extended Kalman filter (EKF) appears.
Once landmark extraction and data association are ready, the SLAM process can be considered as three steps, already familiar from previous chapters: 1. predict the current state estimate using the odometric data; 2. update the predicted state by re-observing landmarks; 3. add newly observed landmarks to the state (the landmark update).
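As a small illustration of the third step, the sketch below shows how the EKF state vector and covariance grow when a new landmark is added; the initial landmark variance is an assumed placeholder rather than a value derived from the measurement model.

```python
import numpy as np

def add_landmark(x, P, lm_xy, init_var=1.0):
    """Step 3 (landmark update): augment the state vector and covariance
    with a newly observed landmark at position lm_xy. The initial variance
    is an assumption; real systems derive it from the measurement model."""
    x_aug = np.concatenate([x, lm_xy])
    n = P.shape[0]
    P_aug = np.zeros((n + 2, n + 2))
    P_aug[:n, :n] = P
    P_aug[n:, n:] = np.eye(2) * init_var   # new landmark starts uncorrelated
    return x_aug, P_aug

# Robot pose (x, y, theta) with no landmarks yet.
x = np.array([0.0, 0.0, 0.0])
P = np.eye(3) * 0.1
x, P = add_landmark(x, P, np.array([2.0, 1.5]))
print(x.shape, P.shape)  # (5,) (5, 5) -- one landmark appended
```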

A view from simulation
Now we have a general idea of how SLAM may work. There is a simulation of our topic of interest, EKF landmark-based SLAM, which was done by Jai. All related material can be found at [16].
Figure 4a shows a 2D projection map, where shapes drawn with black solid lines represent obstacles.
Small blue crosses represent landmarks, and, for the purposes of this simulation, we assume these landmarks can be re-observed through obstacles; in fact, radio frequency identification (RFID) landmarks can make this possible. The thin line with circle markers is the path the robot will take, representing the control actions given to the robot at every time step.
Figure 4b shows the moment the simulation starts. At each time step, the red triangle depicts the robot with respect to the map it is concurrently building, while the blue triangle marks the true pose the robot occupies in the real-life map. At first, the robot is quite confident of where it is, so the blue triangle and the red triangle nearly overlap. The set of red crosses represents the point clouds resulting from sensor data extraction, which can be used to build a map. The robot's pose and the landmarks' positions each have a small red surrounding ellipse, indicating the Gaussian noise of the two-dimensional random variable of the coordinate.
In Figure 4c, once the robot has performed a significant amount of SLAM, the state vector x and the other parameters of the Kalman filter have grown along with the data of new landmarks. At this point, a weak computer running the simulation can cause additional uncertainty because of the gap between calculation time and the real-time requirement. Because of this extra uncertainty, meaning the robot lacks the information to compute everything exactly at the right time, the map becomes confused, as in Figure 4d. In Figure 4c, the red triangle, the robot's location in its own mind, drifts a little from its true position, the blue triangle. Everything inside its current map is shifted in a similar way as well. This is actually a good sign, because it shows the strong correlation between landmarks and landmarks, landmarks and point clouds, and point clouds and point clouds, which is what allows the robot to recover from localization and mapping failures caused by odometry data and the real-time problem. The last two panels, Figures 4e and 4f, show the results of our simulation. Due to the probabilistic framework, the results differ slightly on every run, which is one of their interesting characteristics. Nevertheless, these results demonstrate one very promising SLAM approach. The next two pages give a statistical graphical representation of the parameters over the whole SLAM simulation.
As can be seen from Figure 5 together with Figure 6, there is a place where the position error becomes relatively high, which corresponds to the strange red orbit in Figure 4e. This error in position due to the sensor scan is the reason everything is shifted in Figure 4c. At this point, the error in angular orientation also shows high peaks.
Personally, we think these errors occur at the bottom-left corner along the robot's planned path. A plausible explanation is that this place has only four visible landmarks in the 180 degrees in front of the robot, and the bottom-left corner is the first corner after the start of the simulation, when the robot does not yet have enough data to be consistent enough for a good turn. One piece of evidence supporting this explanation is the overshoot of the standard deviation in the y direction in Figure 6: combined with the robot's path, the bottom-left corner is the first corner on its way, which suddenly changes the robot's y coordinate while the state vector x is still too naive for such skilled estimation. We can also see that after this first failure, even though everything is shifted, all the relative mapping results are good enough to prevent this kind of error from recurring at the following corners.

Figure 5: Statistical graph of the estimation over the simulation process

Modelling and Visualization
GMapping is a highly efficient Rao-Blackwellized particle filter that gradually builds grid maps from laser scanner data. GMapping is probably the most used SLAM algorithm and is currently the standard algorithm on the PR2. Further information can be found at [2]. In ROS, the slam_gmapping package contains a wrapper around GMapping, providing us with SLAM capabilities.
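In ROS 1, slam_gmapping publishes the grid it builds as a nav_msgs/OccupancyGrid on the map topic. A minimal rospy listener, assuming gmapping is already running and receiving laser scans, might look like this (the 50% occupancy cutoff is our arbitrary choice):

```python
#!/usr/bin/env python
import rospy
from nav_msgs.msg import OccupancyGrid

def map_callback(msg):
    # slam_gmapping publishes the grid it builds on the /map topic.
    info = msg.info
    occupied = sum(1 for c in msg.data if c > 50)   # cells likely occupied
    rospy.loginfo("map %dx%d @ %.2f m/cell, %d occupied cells",
                  info.width, info.height, info.resolution, occupied)

if __name__ == "__main__":
    rospy.init_node("map_listener")
    rospy.Subscriber("map", OccupancyGrid, map_callback)
    rospy.spin()
```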

Reinforcement Learning concepts in Search-based Planning
In this section, we explore reinforcement learning, in which an agent learns to maximize its total reward. Figure 4.1 shows the cumulative reward accrued at each step, which can be formalized by the equation

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T. \qquad (4.1)$$

In the maze environment, the autonomous vehicle or robot receives an initial state $S_0$ and takes an action $A_0$. To transition from one state to another, the agent moves to the next state $S_1$ and receives the reward associated with that state, $R_1$.
More generally, this can be written as

$$G_t = \sum_{k=0}^{T} R_{t+k+1}. \qquad (4.2)$$

It is essential to discount the rewards with a factor called gamma, $\gamma$, which lies between 0 and 1, or shortly $0 \le \gamma < 1$. For short-term rewards, a small gamma makes the discount large; for long-term rewards, a large gamma makes the discount small. We can therefore rewrite the formalization from (4.2) as

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad (4.3)$$

where $\gamma \in [0, 1)$.
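A short sketch of equation (4.3): folding the reward list from the end computes the discounted return without explicitly raising gamma to powers. The example episode is invented for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for an episode's reward list.
    Gamma near 0 emphasizes short-term reward; gamma near 1, long-term."""
    g = 0.0
    for r in reversed(rewards):   # fold from the end: G_t = R_{t+1} + gamma*G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1, 0, 0, 10], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```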
Two kinds of tasks are commonly distinguished in reinforcement learning problems: episodic tasks, which create lists of states $S_t$, actions $A_t$, rewards $R_t$, and new states $S_{t+1}$, usually running from an initial point to a terminal state; and continuing tasks, which have no terminal state. Value estimates can be improved with the temporal-difference update

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right],$$

where $\alpha$ is a learning rate, $V(S_t)$ is the previous estimate, $R_{t+1}$ is the next reward, and the temporal-difference target is the term $R_{t+1} + \gamma V(S_{t+1})$. One of the biggest challenges in reinforcement learning is balancing exploration and exploitation (the trade-off): while exploitation is expected to maximize total reward, exploration discovers more information about the environment. When full knowledge of the environment is available, dynamic programming can be used to solve for the optimal policy or optimal value function via the Bellman optimality equations, formalized by

$$V(S_t) \leftarrow \mathbb{E} \left[ R_{t+1} + \gamma V(S_{t+1}) \right]. \qquad (4.6)$$

Accordingly, there are two modeling approaches for reinforcement learning, model-based and model-free learning, compared in Table 1.1 below.
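Returning to the temporal-difference update above, it can be written as a one-line table update. The following sketch uses a plain dictionary for V and invented state names; it is a minimal illustration, not this paper's implementation.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
    The bracketed term is the temporal difference."""
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0)
                                    - V.get(s, 0.0))
    return V

V = {}
V = td0_update(V, s="A", r=1.0, s_next="B")
print(V)  # {'A': 0.1}
```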

Model-based and Model-free
Model-based methods learn action models using transition probabilities; for instance, dynamic programming is commonly used to solve the resulting approximation problems.
Model-free methods, on the other hand, learn directly, without confidently extracting a model of the actions; Q-learning is used to solve model-free learning [17]. Regarding datasets, model-based learning often requires larger training data, while model-free learning needs less. As another example, Greedy Adaptive Dynamic Programming can learn the optimal policy without fixed policy evaluation. Table 1.1 summarizes model-based versus model-free learning. There are three approaches to solving reinforcement learning. Value iteration: in the reinforcement learning context, value iteration returns the maximum expected value for the agent at each state. In Figure 4.2, we implemented a sample bug algorithm [24] using a Markov decision process for a 20x20 maze environment, extended from [24], [9]; it shows the optimized path obtained from the optimal policy.
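A compact sketch of value iteration on a small 4-connected maze follows; the 3x3 grid, step cost, and sweep count are illustrative assumptions standing in for the 20x20 environment of Figure 4.2.

```python
import numpy as np

def value_iteration(grid, goal, gamma=0.95, step_cost=-1.0, iters=500):
    """Deterministic value iteration on a 4-connected maze:
    V(s) <- max_a [ r + gamma * V(s') ].  Obstacles are cells == 1."""
    rows, cols = len(grid), len(grid[0])
    V = np.zeros((rows, cols))
    for _ in range(iters):
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == 1 or (r, c) == goal:
                    continue                      # skip obstacles and the goal
                best = -np.inf
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                        best = max(best, step_cost + gamma * V[nr, nc])
                if best > -np.inf:
                    V[r, c] = best
    return V

grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(value_iteration(grid, goal=(2, 2)).round(1))  # values decay with distance
```

The greedy policy, moving to the neighbor with the highest value, then traces the optimized path from any start cell to the goal.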

Experiments in search-based planning and replanning
Current studies on search-based planning and replanning [17] apply to autonomous systems and robotics. Here we ran two experiments, not only in simulation but also executed in real-time environments.
Let us consider the experiments in Figure 5.1. On the left, the figure illustrates the initial point along with its 8 nearest points, as represented in [17]. Next, in the center panel, the planner continues to expand through the maze toward the goal, which is determined at the beginning, until it reaches the first cell of an obstacle.

SLAM in Unstructured Environments
In this experiment, we demonstrate the concepts using GMapping as implemented in ROS (Figure 5.3). The system architecture can map real-time environments following reference [2]. For motion control, we rely on standard wheel encoder systems [8], [3], [12].
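For reference, a standard dead-reckoning update for a differential-drive robot from wheel encoder distances looks like the sketch below; the wheel base and encoder values are made-up parameters, not our robot's calibration.

```python
import math

def update_odometry(x, y, theta, d_left, d_right, wheel_base):
    """Dead-reckoning update for a differential-drive robot from wheel
    encoder distances (meters traveled by each wheel since last update).
    wheel_base is the distance between the two wheels (an assumed parameter)."""
    d_center = (d_left + d_right) / 2.0
    d_theta = (d_right - d_left) / wheel_base
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    theta = (theta + d_theta) % (2.0 * math.pi)
    return x, y, theta

pose = (0.0, 0.0, 0.0)
pose = update_odometry(*pose, d_left=0.10, d_right=0.12, wheel_base=0.3)
print(pose)  # small forward motion with a slight left turn
```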

Q-learning for autonomous systems
As in the previous section, we choose Q-learning as our method for solving the reinforcement learning problem.
Algorithm 3 describes Q-learning as it can be applied to an autonomous system. For more detail, please refer to [16], where reinforcement learning was integrated.
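A minimal tabular Q-learning sketch in the spirit of Algorithm 3 is given below. The epsilon-greedy policy, the tiny corridor environment, and all hyperparameters are our illustrative assumptions, not the algorithm's exact formulation in [16].

```python
import random

def q_learning(step, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    `step(s, a) -> (s_next, reward, done)` is an assumed environment API."""
    Q = {}
    for _ in range(episodes):
        s, done, t = (0, 0), False, 0      # assumed fixed start state
        while not done and t < 200:        # cap episode length
            t += 1
            if random.random() < eps:      # explore: random action
                a = random.choice(actions)
            else:                          # exploit: best current estimate
                a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))
            s_next, r, done = step(s, a)
            best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next
                                                      - Q.get((s, a), 0.0))
            s = s_next
    return Q

def step(s, a):
    # Tiny corridor: states (0,0)..(4,0); reaching (4,0) gives +10 and ends.
    x = min(max(s[0] + (1 if a == "right" else -1), 0), 4)
    s2 = (x, 0)
    return s2, (10.0 if s2 == (4, 0) else -1.0), s2 == (4, 0)

Q = q_learning(step, actions=["left", "right"])
print(max(["left", "right"], key=lambda a: Q.get(((0, 0), a), 0.0)))  # "right"
```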
The result of this experiment is shown in Figure 5.4.

Conclusions
In summary, we covered the principles of linear and nonlinear problems in probabilistic approaches, using the Kalman filter and the extended Kalman filter respectively to face uncertainty. We also showed modeling via an XML representation and demonstrated visualization for autonomous systems. We illustrated how to deal with uncertainty representation, one of the most important challenges for self-driving vehicles in advanced technologies. At the same time, we explored search-based planning through the backtracking of trajectory generation, in not only the A* algorithm but also D* with Reset, both applied to shortest-path planning. A sample implementation of reinforcement learning was selected to optimize state-of-the-art path planning in advanced autonomous vehicles.
In the future, we will integrate deep reinforcement learning for solving decision-making, and a deep learning approach for solving localization and mapping, in order to increase performance.

Figure 2.3: Path-finding problem using Dijkstra's algorithm. Enough theory; let us see what the A* algorithm can do.

Figure 4: Visualization of the EKF landmark-based SLAM

Figure 5.1: Simulation path planning. [LEFT] Start to explore all of the nearest nodes from the initial point. [CENTER] Continue to extract the maze environment, guided by the goal determined at the beginning, until the first cell of an obstacle is reached. [RIGHT]
Figure 5.2: