Current robot systems are designed, built and programmed by teams with multidisciplinary skills. The traditional approach to program such systems is typically referred to as the *robotics control pipeline* and requires going from observations to final low-level control commands through: a) state estimation, b) modeling and prediction, c) planning, and d) low level control translation. As introduced by Zamalloa et al.2017, the whole process requires fine tuning of every step in the pipeline incurring into a relevant complexity where optimization at every step is critical and has a direct impact in the final result.

Artificial Intelligence methods and, particularly, neuromorphic techniques such as artificial neural networks (ANNs) are becoming more and more relevant in robotics. Starting from 2016, promising results, such as the work of Levine et al. 2016, showed a path towards simplifying the construction of robot behaviors through the use of deep neural networks as a replacement of the traditional approach described above. The described end-to-end approach for programming robots scales nicely when compared to traditional methods.

Reinforcement Learning (RL) is a field of machine learning that is concerned with making sequences of decisions. It considers an agent situated in an environment where for each timestep the agent takes an action and receives an observation and a reward. A RL algorithm seeks to maximize the agent's total reward trough a trial and error learning process. Deep Reinforcement Learning (DRL) is the study of RL by using neural networks as function approximators. In recent years, several techniques for DRL have shown good success in learning complex behaviour skills and solving challenging control tasks in high-dimensional state-space.

However, many of the benchmarked environments such as Atari and Mujoco rarely deal with realistic or complex environments (common in robotics), or utilize the tools commonly used in robotics such as the Robot Operating System (ROS). The research conducted in the previous work can only be translated into real world robots with a considerable amount of effort for each particular robot. The scalability of previous methods for modular robots is questionable.

Modular robots can extend their components seamlessly. This brings clear advantages for their construction, but training them with current DRL methods becomes cumbersome due to the following reasons: every small change in the physical structure of the robot will require a new training; building the tools to train modular robots (such as the simulation model, virtual drivers) is a time consuming process; transferring the results to the real robot is complex given the flexibility of these systems. In this work we present a framework that utilizes the traditional tools in the robotics environment, such as Gazebo and ROS, which simplifies the process of building modular robots and their corresponding tools. Our framework includes baseline implementations for the most common DRL techniques for both value and policy iteration methods.

The methods presented bellow will be consistent with the following nomenclature that is partially inspired on the work by Peters et al 2010: The three main components of a RL system for robotics include the state $s$ (also found in literature as $x$), the action $a$ (also found as $u$) and the reward denoted by $r$. We will denote the current time step by $k$. The stochasticity of the environment gets represented by using a probability distribution: $$\mathbf{s}_{k+1}\sim p\left( \mathbf{s}_{k+1}\left\vert \mathbf{s}_{k},\mathbf{a}_{k}\right. \right)$$, as model where $\mathbf{a}_{k}\in\mathbb{R}^{M}$ denotes the current action and $\mathbf{s_k}, \mathbf{s}_{k+1}\in\mathbb{R}^{N}$ denote the current and next state, respectively. Further, we assume that most policy gradient methods have actions that are generated by a policy $\mathbf{a}_{k} \sim\pi_{\mathbf{\theta}}\left( \mathbf{a}_{k}\left\vert \mathbf{s}_{k}\right. \right)$ which is modeled as a probability distribution in order to incorporate exploratory actions. The policy is assumed to be parametrized by $K$ policy parameters $\mathbf{\theta} \in\mathbb{R}^{K}$. The sequence of states and actions forms a trajectory denoted by $\mathbf{\tau}=[\mathbf{s}_{0:H},\mathbf{a}_{0:H}]$ where $H$ denotes the horizon which can be infinite.

Often, **trajectory**, **history**, **trial** or **roll-out** are used interchangeably. At each instant of time, the learning system receives a reward:

$$r_{k} = r\left( \mathbf{s} _{k},\mathbf{a}_{k}\right) \in\mathbb{R}$$ The general goal of policy optimization is to optimize the policy parameters $\mathbf{\theta}\in\mathbb{R}^{K}$ so that the expected return is optimized: $$J\left( \mathbf{\theta}\right) =E\left\{ \sum\nolimits_{k=0}^{H}\gamma \cdot r_{k}\right\}$$ where $\gamma$ is $\gamma = [0,1]$ for discounted reinforcement learning and $\gamma = \frac{1}{H}$ for the average reward case. For real-world applications, we require that any change to the policy parameterization has to be smooth as drastic changes can be hazardous for the actor as well as useful initializations of the policy based on domain knowledge would otherwise vanish after a single update step. For these reasons, policy gradient methods which follow the steepest descent on the expected return are the method of choice. These methods update the policy parameterization according to the gradient update rule $$\mathbf{\theta}_{h+1}=\mathbf{\theta}_{h}+\alpha_{h}\left. \mathbf{\nabla}_{\mathbf{\theta}}J\right\vert _{\mathbf{\theta}=\mathbf{\theta}_{h}}$$ where $\alpha_{h}\in\mathbb{R}^{+}$ denotes the learning rate and $h\in\{0,1,2,\ldots\}$ the current update number.

Robotics is dominated by scenarios with continuous states and actions spaces. This implies that most traditional value-based off-the-shelf RL approaches are not valid for treating such situations. As pointed out by Peters et al. 2010, Policy Gradient (PG) methods differ significantly from others as they do not suffer from these problems in the same way other techniques do.

One of the typical issues experienced when doing RL in robotics^{1} is that uncertainty in the state might degrade the performance of the policy. PG methods suffer from this as well. However, they rely on optimization techniques for the policy that do not need to be changed when dealing with this uncertainty.

The nature of PG methods allows them to deal with continuous states and actions in exactly the same way as discrete ones. PG techniques can be used either model free or model-based. The policy representation can be chosen so that it is meaningful for the task and can incorporate domain knowledge. This often leads to the use of fewer parameters in the learning process. Additionally, its generic formulation shows that PG methods are valid even when the reward function is discontinuous or even unknown.

While PG techniques might seem interesting for a roboticist on a first look, they are by definition on-policy and need to forget data reasonably fast in order to avoid the introduction of a bias to the gradient estimator. In other words, they are not as good as other techniques at using the data available (their sample efficiency is low). Another typical problem with PG methods is that convergence is only guaranteed to a local maximum while in tabular representations, value function methods are guaranteed to converge to a global maximum.

The idea of Actor Critic using Kronecker-Factored Trust Region (ACKTR) is to replace the Stochastic Gradient Descent (SGD), which explores the weight space inefficiently, and optimizes both the actor and the critic using Kronecker factored approximate curvature (K-FAC) with trust region. ACKTR replaces SGD of A2C, the synchronous version of A3C and instead computes the natural gradient update. The natural gradient update is applied both to the actor and the critic.

ACKTR uses the K-FAC to compute the natural gradient update efficiently. In order to define a Fisher metric for RL policies, ACKTR uses a policy function that defines a distribution over actions given the current state, and takes the expectation over the trajectory distribution. The mathematical formulation for the Fisher metric is given by:

$$F_a = \mathbb{E}_{p(\tau)}[\bigtriangledown_\theta\log\pi(a_t|s_t)(\bigtriangledown_\theta\log\pi(a_t|s_t))^T]$$ where $$p(\tau) = p(s_0) \prod^T_{t=0}\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)$$ is the distribution of trajectories. In practice, we approximate the intractable expectation above with trajectories collected during training. In the case of training the critic, one can think of it as least-squares function approximation problem. In this case, the most common second-order algorithm is Gauss-Newton, which approximates the curvature as the Gauss-Newton matrix $G:= \mathbb{E}[J^T J]$, where $J$ is the Jacobian mapping from parameters to outputs. The Gauss-Newton matrix is equivalent to the Fisher matrix for a Gaussian observation model, which allows to apply K-FAC to the critic as well. In more detail, the output of the critic $v$ is defined to be a Gaussian distribution with $p(v|s_t) \sim \mathcal{N}(v;V(s_t), \sigma^2)$. Setting $\sigma$ to 1 is equivalent to the vanilla Gauss-Newton method. In the case when the actor and the critic are disjoint, it is possible to apply K-FAC updates to each of them using the same metric as definedin [Equation 7](#eq:kfac_metric). To prevent instability during training, it is important to use an architecture where the two networks both share lower-layer representations but have distinct output layers. The joint distribution of the policy and the value distribution can be defined by assuming independence of the two output distributions, for instance $p(a,v|s)=\pi(a|s)p(v|s)$, and constructing the Fisher metric with respect to $p(a,v|s)$. This is similar to the standard K-FAC except that we need to sample the two networks' outputs independently. In this case, the K-FAC to approximate the Fisher matrix is: $$F_v = \mathbb{E}_{p(\tau)}[\bigtriangledown\log p(a, v|s) \bigtriangledown\log p(a, v|s)^T]$$

The pseudocode presented gives an overview of the ACKTR implementation used in our evaluation.

Import the necessary dependencies:

```
import gym
import gym_gazebo2
from baselines.acktr import acktr
from baselines.common.models import mlp
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
```

Create the environment which you would like to train for. For example:

```
def make_env():
env = gym.make('MARA-v0')
return env
env = DummyVecEnv([make_env])
```

Create the network and call the training function:

```
network = mlp(num_layers, num_hidden, layer_norm)
#**parameters is a dictionary containing the value of the hyperparameters
acktr.learn(env=env, network=network, **parameters)
```

Full ACKTR train script available here.

Proximal Policy Optimization (PPO) is an alternative to Trust Region Policy Optimization (TRPO), that attains data efficiency and reliable performance of TRPO while using first order optimization. In the case of standard PG methods, the gradient update is performed per data sample. On the other hand, PPO enables multiple epochs of mini-batch updates. There are a few variants of PPO in the literature, which optimize the surrogate objective or use adaptive KL penalty coefficient. The Clipped Surrogate Objective is given as:

$$L^{clip}(\theta) = \hat{E_{t}}[min(r_t(\theta)\hat{A_{t}}, clip(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A_{t}})]$$ where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, is the probability ratio of the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{old}}$, $\hat{A_{t}}$ is an estimator of the advantage function at timestep $t$ and $\epsilon$ is a hyperparameter, for example $\epsilon = 0.2$. The first term is the surrogate objective that is also used in TRPO. The second term, $clip(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A_{t}}$, is clipping the probability ratio, $r_t$, to be between the interval $[1-\epsilon,1+\epsilon]$. The $L^{clip}(\theta)$ takes the minimum of the un-clipped and clipped value, which excludes the change in the probability ratio when the objective improves and includes it when the objective is worse. The clipping prevents PPO from having a large policy update. The Adaptive KL Penalty Coefficient is an alternative to the clipped surrogate objective or an addition to it where the goal is to use the penalty on KL divergence and update the penalty coefficient to achieve some target KL divergence ($d_{targ}$) at each policy update.

The presented pseudo code presents the detailed workflow of the PPO algorithm:

Import all the necessary dependencies:

```
import gym
import gym_gazebo2
import tensorflow as tf
import multiprocessing
from baselines.ppo2 import ppo2
```

Create tensorflow session:

```
config = tf.ConfigProto(
allow_soft_placement=True,
intra_op_parallelism_threads=multiprocessing.cpu_count(),
inter_op_parallelism_threads=multiprocessing.cpu_count(),
log_device_placement=False
)
tf.Session(config=config).__enter__()
```

Create the training environment. For instance:

```
def make_env():
env = gym.make('MARA-v0')
return env
env = DummyVecEnv([make_env])
```

Call the training function for PPO:

```
#**parameters is a dictionary containing the value of the hyperparameters
ppo2.learn(env=env, **parameters)
```

Full PPO train script available here.

As mentioned before PPO is improvement of TRPO and including clipping parameter. In our documentation due to same mathematical formulation between TRPO and PPO we decided to skip its description however we would like to provide an example how to launch training for TRPO using gym_gazebo and the MARA environment.

Import all the necessary dependencies:

```
import gym
import gym_gazebo2
from baselines.trpo_mpi import trpo_mpi
from baselines.common.models import mlp
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
```

Create the training environment. For example:

```
def make_env():
env = gym.make('MARA-v0')
return env
env = DummyVecEnv([make_env])
```

Create the network and call the training function:

```
network = mlp(num_layers, num_hidden, layer_norm)
#**parameters is a dictionary containing the value of the hyperparameters
trpo_mpi.learn(env=env, network=network, **parameters)
```

Full TRPO train script available here.

^{1} provided that there is no additional state estimator (actor-critic methods)