OODA-RL: A REINFORCEMENT LEARNING FRAMEWORK FOR ARTIFICIAL GENERAL INTELLIGENCE TO SOLVE OPEN WORLD NOVELTY

Faster adaptability to open-world novelties by intelligent agents is a
necessary factor in achieving the goal of creating Artificial General
Intelligence (AGI). The classical RL framework does not consider unseen
changes (novelties) in the environment. Therefore, in this paper, we propose
OODA-RL, a Reinforcement Learning based framework that can be used to develop
robust RL algorithms capable of handling both known environments and
adaptation to unseen environments. OODA-RL expands the definition of the
internal composition of the agent compared to the abstract definition in the
classical RL framework, allowing RL researchers to incorporate novelty
adaptation techniques as an add-on feature to existing SoTA as well as
yet-to-be-developed RL algorithms.


Introduction
Developing intelligent agents that can interact with and learn from the surrounding environment is one of the primary goals of Artificial Intelligence (AI) research. Among the various sub-fields of AI, Reinforcement Learning (RL) has emerged as the proponent for learning from interaction with the environment via a trial-and-error mechanism [11]. Although much of the RL theory was developed between the late 20th and early 21st century [12,15,21,23,24], it was mostly confined to solving problems with smaller state spaces due to the algorithms' lower sample efficiency. Large-scale real-world applications of RL were motivated by the introduction of Deep Reinforcement Learning (DRL) [16]. Since then, it has been widely used in domains ranging from video games [17] to robotics [8]. Despite the impressive learning ability of modern RL algorithms [9,16,19,20], the basis of RL remains the classical RL framework shown in Fig. 1. This framework assumes that the agent interacts with the environment by receiving some observations (or states) (o_t) from it, and then performs an action (a_t) to receive a reward (r_{t+1}) based on the change in the states, where t denotes the time-step. This framework works well for a closed agent-environment interface but fails to incorporate novelties that may arise in the environment outside the agent's interaction with it. Novelty in an open world [13] has been defined by DARPA, in its SAIL-ON project description [5], as previously unseen changes in the environment, which can at times be quite adversarial in nature, e.g., in a warfare setting. Therefore, it is crucial to develop agents that can adapt to such novel changes. Coincidentally, there is an abstract concept called the Observation-Orientation-Decision-Action (OODA) loop, which was proposed by John R. Boyd to tackle military warfare problems [2].
It describes an agent that collects information from the environment and orients itself according to the newly gathered information by updating its knowledge and preparing newer strategies in a cyclic manner. A similar approach in the field of Machine Learning (ML) is known as Active Learning (AL). It involves an expert system (commonly a human expert) that supervises the agent in making a decision based on a query [14]. However, it is infeasible to deploy a real-time human expert to resolve every query of the agent in the open world. Inspired by the ideas of OODA and AL, in this paper we propose an Active Learning-based RL framework that can be easily adapted to develop RL algorithms to solve open-world novelty.
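The classical agent-environment interaction loop described above can be sketched in a few lines of code. The toy `SimpleEnv` and the random action choice below are illustrative placeholders, not part of the proposed framework:

```python
import random

class SimpleEnv:
    """Toy environment: the state is a counter; reaching 3 yields reward 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 increments the state, action 0 resets it
        self.state = self.state + 1 if action == 1 else 0
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = SimpleEnv()
obs = env.reset()
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])        # a_t chosen by a (random) policy
    obs, reward, done = env.step(action)  # o_{t+1}, r_{t+1} from the environment
    total_reward += reward
    if done:
        obs = env.reset()
```

OODA-RL keeps this outer loop intact and only elaborates what happens inside the agent between receiving `obs` and emitting `action`.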

OODA-RL Framework
The proposed OODA-RL framework, as shown in Fig. 2, is an Active Reinforcement Learning based framework that can be used to develop robust RL algorithms capable of handling both known environments and adaptation to unseen environments. It is designed to incorporate the notion of open-world novelty while developing an RL algorithm for solving a real-world problem. It retains the same idea of the Environment from the classical RL framework, which gives out observations and action-based rewards to the agent. The abstraction of the Agent has been elaborated by defining its internal composition in four stages. Each stage is described below in detail.

Observe Stage
The agent receives raw unstructured data from the environment about its surroundings in the Observe stage. Artificial agents can be equipped with a plethora of sensors to collect diverse types of data such as images, video, and audio, depending on the problem setting. After collecting the data, this stage transmits it to the Orient stage, which uses it to develop its own representation of a simulated environment. In parallel, another task of this stage is to process the data into meaningful information through representation learning [1]. The processed state representation is fed forward to the Decide stage, which in turn decides whether it should be passed to the Act stage or provided to the Orient stage in the case of an unseen situation. It is necessary to read in the correct data from the external environment to create an appropriate representation. Therefore, state-of-the-art (SoTA) deep learning techniques such as Convolutional Neural Networks [4,10], Transformers [22], etc. can be used to perform such representation learning.
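A minimal sketch of the Observe stage's representation-learning step is given below. A fixed random projection stands in for a trained CNN or Transformer encoder; the `observe` function and the 16-dimensional output size are illustrative assumptions, not prescribed by the framework:

```python
import numpy as np

def observe(raw_frame, projection):
    """Observe-stage sketch: map raw sensor data to a compact state
    representation. A fixed random projection stands in for a trained
    CNN/Transformer encoder."""
    x = raw_frame.astype(np.float32).ravel() / 255.0  # normalize raw pixels
    return projection @ x                             # low-dimensional representation

rng = np.random.default_rng(0)
projection = rng.standard_normal((16, 64 * 64))   # 16-dim representation
frame = rng.integers(0, 256, size=(64, 64))       # fake 64x64 grayscale frame
state_repr = observe(frame, projection)
```

In a full implementation, `projection` would be replaced by a learned encoder trained jointly with the rest of the agent.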

Decide Stage
This stage primarily plays the role of a decision maker: it decides whether the learned state representations are classified as a seen scenario or an unseen scenario. In the case of a seen scenario, the Decide stage signals the Act stage to use the learned policy for taking an optimal action; in the case of an unseen scenario, it sends a query to the Orient stage to find an appropriate policy for adapting to the unseen novelty. Once the adaptation is done, the policy in the Act stage is updated to incorporate the novelty, and the newly learned representation is stored in the buffer of the Decide stage.

Orient Stage
This stage plays two primary roles in the framework. First, it creates a continually learned simulated environment based on the raw observations it receives from the Observe stage. Second, it acts as the expert system in the AL loop between the Decide and Orient stages. When the Decide stage sends a query to the Orient stage, then depending on the knowledge base, the Orient stage either explores in the simulated environment or exploits the existing knowledge to find an appropriate policy for novelty adaptation. In an RL setting, the problem of novelty adaptation can be framed as an unseen task for the agent. In the past few years, various techniques have gained popularity for this, such as Meta-Reinforcement Learning, which allows quick adaptation to novel changes in the scenario without learning from scratch by utilizing past experiences [7,18]. Skill learning is also an emerging concept in RL for solving unseen tasks in an extrinsic-reward-free setting [3,6]. The Orient stage is likewise an external-reward-free domain that does not interact with the outside world. Thus, Meta-learning and Skill learning can provide suitable bases for developing mechanisms to solve the novelty adaptation problem efficiently.
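The explore-or-exploit behavior of the Orient stage when answering a Decide-stage query can be sketched as follows. The string-keyed knowledge base, the candidate-policy list, and the `score_fn` evaluation are hypothetical stand-ins; a real system would run meta-RL or skill-learning routines in the simulated environment instead:

```python
class OrientStage:
    """Orient-stage sketch: answer Decide-stage queries from a knowledge base
    (exploit), or evaluate candidate policies in a stand-in simulated
    environment when the query is new (explore)."""
    def __init__(self, candidate_policies):
        self.knowledge_base = {}                  # scenario key -> adapted policy
        self.candidate_policies = candidate_policies

    def handle_query(self, scenario_key, score_fn):
        if scenario_key in self.knowledge_base:   # exploit existing knowledge
            return self.knowledge_base[scenario_key]
        # explore: score each candidate policy in the simulated environment
        best = max(self.candidate_policies, key=score_fn)
        self.knowledge_base[scenario_key] = best  # cache the adapted policy
        return best

orient = OrientStage(["cautious", "aggressive", "neutral"])
# hypothetical scoring: here, simply the length of the policy name
policy = orient.handle_query("fog", score_fn=lambda p: len(p))
```

Because exploration happens entirely inside the simulated environment, this query-answering loop never touches the external world, matching the stage's external-reward-free role.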

Act Stage
In an RL setting, the common mathematical representation of the functionality responsible for choosing an action based on the input information is called a policy. In the proposed framework, a policy can be improved (or optimized) using various SoTA RL algorithms such as DQN [16], PPO [19], SAC [9], etc. to choose the best action based on the state representation. Some real-world examples of actions are moving a robotic arm, rotating the steering wheel of an autonomous car, etc. Once the action is taken in the real world, the environment may or may not give an immediate reward, depending upon its nature.
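A minimal sketch of the Act stage's action selection is shown below. The linear action scorer and the epsilon-greedy rule stand in for a trained DQN/PPO/SAC policy network; the weight matrix and state vector are purely illustrative:

```python
import numpy as np

def act(state_repr, weights, epsilon=0.0, rng=None):
    """Act-stage sketch: a linear scorer rates each action from the state
    representation and picks the best one (epsilon-greedy). The linear
    scorer stands in for a trained policy network."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                     # occasional exploration
        return int(rng.integers(weights.shape[0]))
    q_values = weights @ state_repr                # one score per action
    return int(np.argmax(q_values))                # greedy action

weights = np.array([[1.0, 0.0],   # action 0, e.g., "steer left"
                    [0.0, 2.0]])  # action 1, e.g., "steer right"
action = act(np.array([0.5, 1.0]), weights, epsilon=0.0)
```

When the Orient stage reports an adapted policy, only `weights` (i.e., the policy parameters) would need to be swapped out; the action-selection interface stays the same.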

Conclusion and Future Works
In this paper, we have proposed a primitive OODA-RL framework with the aim of solving open-world novelty. This framework improves upon the classical RL framework by elaborating the agent's functionality into four stages. It can be utilized for developing robust RL algorithms to tackle unseen environment changes while maintaining the ability to solve known-world problems. This is a work-in-progress draft that is planned to be updated at frequent intervals until completion.