Current robot systems are designed, built and programmed by teams with multidisciplinary skills. The traditional approach to program such systems is typically referred to as the *robotics control pipeline* and requires going from observations to final low-level control commands through: a) state estimation, b) modeling and prediction, c) planning, and d) low level control translation. As introduced by Zamalloa et al.2017, the whole process requires fine tuning of every step in the pipeline incurring into a relevant complexity where optimization at every step is critical and has a direct impact in the final result.

Artificial Intelligence methods and, particularly, neuromorphic techniques such as artificial neural networks (ANNs) are becoming more and more relevant in robotics. Starting from 2016, promising results, such as the work of Levine et al. 2016, showed a path towards simplifying the construction of robot behaviors through the use of deep neural networks as a replacement of the traditional approach described above. The described end-to-end approach for programming robots scales nicely when compared to traditional methods.

Reinforcement Learning (RL) is a field of machine learning that is concerned with making sequences of decisions. It considers an agent situated in an environment where for each timestep the agent takes an action and receives an observation and a reward. A RL algorithm seeks to maximize the agent's total reward trough a trial and error learning process. Deep Reinforcement Learning (DRL) is the study of RL by using neural networks as function approximators. In recent years, several techniques for DRL have shown good success in learning complex behaviour skills and solving challenging control tasks in high-dimensional state-space.

However, many of the benchmarked environments such as Atari and Mujoco rarely deal with realistic or complex environments (common in robotics), or utilize the tools commonly used in robotics such as the Robot Operating System (ROS). The research conducted in the previous work can only be translated into real world robots with a considerable amount of effort for each particular robot. The scalability of previous methods for modular robots is questionable.

Modular robots can extend their components seamlessly. This brings clear advantages for their construction, but training them with current DRL methods becomes cumbersome due to the following reasons: every small change in the physical structure of the robot will require a new training; building the tools to train modular robots (such as the simulation model, virtual drivers) is a time consuming process; transferring the results to the real robot is complex given the flexibility of these systems. In this work we present a framework that utilizes the traditional tools in the robotics environment, such as Gazebo and ROS, which simplifies the process of building modular robots and their corresponding tools. Our framework includes baseline implementations for the most common DRL techniques for both value and policy iteration methods.

The methods presented bellow will be consistent with the following nomenclature that is partially inspired on the work by Peters et al 2010: The three main components of a RL system for robotics include the state $s$ (also found in literature as $x$), the action $a$ (also found as $u$) and the reward denoted by $r$. We will denote the current time step by $k$. The stochasticity of the environment gets represented by using a probability distribution: $$\mathbf{s}_{k+1}\sim p\left( \mathbf{s}_{k+1}\left\vert \mathbf{s}_{k},\mathbf{a}_{k}\right. \right)$$, as model where $\mathbf{a}_{k}\in\mathbb{R}^{M}$ denotes the current action and $\mathbf{s_k}, \mathbf{s}_{k+1}\in\mathbb{R}^{N}$ denote the current and next state, respectively. Further, we assume that most policy gradient methods have actions that are generated by a policy $\mathbf{a}_{k} \sim\pi_{\mathbf{\theta}}\left( \mathbf{a}_{k}\left\vert \mathbf{s}_{k}\right. \right)$ which is modeled as a probability distribution[^2] in order to incorporate exploratory actions. The policy is assumed to be parametrized by $K$ policy parameters $\mathbf{\theta} \in\mathbb{R}^{K}$. The sequence of states and actions forms a trajectory denoted by $\mathbf{\tau}=[\mathbf{s}_{0:H},\mathbf{a}_{0:H}]$ where $H$ denotes the horizon which can be infinite.

Often, **trajectory**, **history**, **trial** or **roll-out** are used interchangeably. At each instant of time, the learning system receives a reward:

$$r_{k} = r\left( \mathbf{s} _{k},\mathbf{a}_{k}\right) \in\mathbb{R}$$ The general goal of policy optimization is to optimize the policy parameters $\mathbf{\theta}\in\mathbb{R}^{K}$ so that the expected return is optimized: $$J\left( \mathbf{\theta}\right) =E\left\{ \sum\nolimits_{k=0}^{H}\gamma \cdot r_{k}\right\}$$ where $\gamma$ is $\gamma = [0,1]$ for discounted reinforcement learning and $\gamma = \frac{1}{H}$ for the average reward case. For real-world applications, we require that any change to the policy parameterization has to be smooth as drastic changes can be hazardous for the actor as well as useful initializations of the policy based on domain knowledge would otherwise vanish after a single update step. For these reasons, policy gradient methods which follow the steepest descent on the expected return are the method of choice. These methods update the policy parameterization according to the gradient update rule $$\mathbf{\theta}_{h+1}=\mathbf{\theta}_{h}+\alpha_{h}\left. \mathbf{\nabla}_{\mathbf{\theta}}J\right\vert _{\mathbf{\theta}=\mathbf{\theta}_{h}}$$ where $\alpha_{h}\in\mathbb{R}^{+}$ denotes the learning rate and $h\in\{0,1,2,\ldots\}$ the current update number.

NAF introduces a new representation of the Q function that allows the use of Q learning algorithm in continuous action spaces. In the continous case, in order to find the greedy policy, DeepQ requires an optimization of $a_{t}$ that is difficult and slow in the case of complex and nonlinear functions. NAF addresses this problem by combining neural-network based Q-learning and NAF. As an alternative to policy gradient based methods, NAF analytically finds the action that maximizes the Q-function $Q(s,a|\theta^{Q})$ by representing it in terms of the value $V(s|\theta^{V})$ and the advantage function $A(s,a|\theta^{A})$: $$Q(s,a|\theta^{Q}) = A(s,a|\theta^{A}) + V(s|\theta^{V})$$ Thus, the Q-function is quadratic with respect to the action that maximizes it that is given by $\mu(s|\theta^{\mu})$: $$A(s,a|\theta^{A}) = - \frac{1}{2} (a - \mu(s|\theta^{\mu}))^{T} P(s|\theta^{P}) (a-\mu(s|\theta^{\mu}))$$

The following pseudocode explains the training process of the DQN NAF algorithm:

In practice, sequentially exploring the environment leads to correlated data that makes the networks unstable. Inspired by its use in DeepQ, NAF implements the Experience Replay mechanism that ensures the sampled data are independent and identically distributed (i.i.d) random variables. Thus, experience replay collects uncorrelated data by randomly sampling from previously stored data in a memory. Moreover, NAF uses target networks implemented in DeepQ as they improve the stability of the networks and decrease divergence issues when training the networks. As a result, a more robust and stable learning of the networks happens because the target weights are updated at a lower frequency. According to the original NAF paper, batch normalization ensures the robustness of NAF against different types of environments with states that deal with different types of data, e.g. position or velocity. Another important aspect in RL in general and for NAF in particular is exploration. NAF uses stochastic actions by adding a noise process $\mathcal{N}$ to the actions coming from the deterministic policy:

$$\mu'(s_{t}) = \mu(s_{t}\vert \theta_{t}^{\mu}) + \mathcal{N}$$

The addition of noise ensures an adequate exploration of the environment and the stochasticity of the actions can stem from taking actions at random or by adding a certain well-known noise process (Ohrnstein Uhlenbeck, Brownian motion) to the action. The choice of the noise process considerably influences the performance and convergence of the algorithm.

Robotics is dominated by scenarios with continuous states and actions spaces. This implies that most traditional value-based off-the-shelf RL approaches are not valid for treating such situations. As pointed out by Peters et al. 2010, Policy Gradient (PG) methods differ significantly from others as they do not suffer from these problems in the same way other techniques do.

One of the typical issues experienced when doing RL in robotics[^3] is that uncertainty in the state might degrade the performance of the policy. PG methods suffer from this as well. However, they rely on optimization techniques for the policy that do not need to be changed when dealing with this uncertainty.

The nature of PG methods allows them to deal with continuous states and actions in exactly the same way as discrete ones. PG techniques can be used either model free or model-based. The policy representation can be chosen so that it is meaningful for the task and can incorporate domain knowledge. This often leads to the use of fewer parameters in the learning process. Additionally, its generic formulation shows that PG methods are valid even when the reward function is discontinuous or even unknown.

While PG techniques might seem interesting for a roboticist on a first look, they are by definition on-policy and need to forget data reasonably fast in order to avoid the introduction of a bias to the gradient estimator. In other words, they are not as good as other techniques at using the data available (their sample efficiency is low). Another typical problem with PG methods is that convergence is only guaranteed to a local maximum while in tabular representations, value function methods are guaranteed to converge to a global maximum.

DDPG is a model-free, off-policy, actor-critic algorithm based on the Deterministic Policy Gradient (DPG). In a nutshell, DDPG finds the policy network parameters that maximize the expected return $Q^{\pi}(s_{t}|a_{t})$ with policy gradients. DDPG is based on an actor-critic algorithm where the actor consists of a parametrized deterministic policy $\pi(s_{t}|s\theta_{\mu})$ that maps states to actions whereas the critic $Q(s_{t},a_{t}|\theta^{Q})$ evaluates the value of the action coming from the actor. The critical value is fed back into the actor network for further optimizing the policy performance denoted as $J$ using the policy gradient defined below: $$\begin{gathered} \triangledown_{\theta^{\mu}} J = E_{s_{t} \sim p{\beta}} [\triangledown_{a}Q(s,a\vert\theta^{Q})\vert_{s=s_{t}, a=\mu(s_{t})}\\ \triangledown_{\theta_\mu}\mu(s\vert\theta^{\mu})\vert_{s=s_{t}}]\end{gathered}$$

As previously explained for NAF, DDPG uses the techniques implemented in DeepQ that guarantee a stable learning such as the Experience Replay mechanism, the use of target networks or batch normalization. Moreover, DDPG also uses stochastic actions by adding a noise process $\mathcal{N}$ to the actions coming from the deterministic policy.

Import the necessary dependencies

```
import baselines.ddpg.training as training
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import *
import baselines.common.tf_util as U
import gym
import gym_gazebo
import tensorflow as tf
from mpi4py import MPI
```

Create the enviroment and set the default parameters for the training:

```
#Default parameters
env_id = "MARAOrient-v0"
noise_type='ou_0.2'
#noise_type='ou_1'
layer_norm = True
seed = 0
render_eval = False
render = False
normalize_returns=False
normalize_observations=True
critic_l2_reg=1e-2
batch_size=64
popart=False
reward_scale=1.
clip_norm=None
num_timesteps=None
evaluation = True
nb_epochs = 150
# Create envs.
env = gym.make(env_id)
env.reset()
eval_env = None
```

In DDPG it is important to set the noise parameters since the sucess of the training depends on which noise model you have chosen. For example:

```
# Parse noise_type
action_noise = None
param_noise = None
nb_actions = env.action_space.shape[-1]
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
```

Create tensorflow sessiong and set the memory and if you would like to have layer normalization for the actor and the critic:

```
with tf.Session(config=tf.ConfigProto()) as session:
# Configure components.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(layer_norm=layer_norm)
actor = Actor(nb_actions, layer_norm=layer_norm)
# Seed everything to make things reproducible.
seed = 0
set_global_seeds(seed)
env.seed(seed)
if eval_env is not None:
eval_env.seed(seed)
```

Call to the training function:

```
optim_metric = training.train(env=env,
eval_env=eval_env,
session=session,
param_noise=param_noise,
action_noise=action_noise,
actor=actor,
critic=critic,
memory=memory,
actor_lr = 1e-04,
critic_lr = 1e-03,
gamma =0.99,
nb_epoch_cycles = 20,
nb_train_steps = 50,
#nb_rollout_steps = 100,
nb_rollout_steps = 100,
nb_epochs= 500,
render_eval=render_eval,
reward_scale=1,
render=render,
normalize_returns=normalize_returns,
normalize_observations=normalize_observations,
critic_l2_reg=critic_l2_reg,
batch_size = batch_size,
popart=popart,
clip_norm=clip_norm,
job_id=str(0))
```

The idea of Actor Critic using Kronecker-Factored Trust Region (ACKTR) is to replace the Stochastic Gradient Descent (SGD), which explores the weight space inefficiently, and optimizes both the actor and the critic using Kronecker factored approximate curvature (K-FAC) with trust region. ACKTR replaces SGD of A2C, the synchronous version of A3C and instead computes the natural gradient update. The natural gradient update is applied both to the actor and the critic.

ACKTR uses the K-FAC to compute the natural gradient update efficiently. In order to define a Fisher metric for RL policies, ACKTR uses a policy function that defines a distribution over actions given the current state, and takes the expectation over the trajectory distribution. The mathematical formulation for the Fisher metric is given by:

$$F_a = \mathbb{E}_{p(\tau)}[\bigtriangledown_\theta\log\pi(a_t|s_t)(\bigtriangledown_\theta\log\pi(a_t|s_t))^T]$$ where $$p(\tau) = p(s_0) \prod^T_{t=0}\pi(a_t|s_t)p(s_{t+1}|s_t,a_t)$$ is the distribution of trajectories. In practice, we approximate the intractable expectation above with trajectories collected during training. In the case of training the critic, one can think of it as least-squares function approximation problem. In this case, the most common second-order algorithm is Gauss-Newton, which approximates the curvature as the Gauss-Newton matrix $G:= \mathbb{E}[J^T J]$, where $J$ is the Jacobian mapping from parameters to outputs. The Gauss-Newton matrix is equivalent to the Fisher matrix for a Gaussian observation model, which allows to apply K-FAC to the critic as well. In more detail, the output of the critic $v$ is defined to be a Gaussian distribution with $p(v|s_t) \sim \mathcal{N}(v;V(s_t), \sigma^2)$. Setting $\sigma$ to 1 is equivalent to the vanilla Gauss-Newton method. In the case when the actor and the critic are disjoint, it is possible to apply K-FAC updates to each of them using the same metric as definedin [Equation 7](#eq:kfac_metric). To prevent instability during training, it is important to use an architecture where the two networks both share lower-layer representations but have distinct output layers. The joint distribution of the policy and the value distribution can be defined by assuming independence of the two output distributions, for instance $p(a,v|s)=\pi(a|s)p(v|s)$, and constructing the Fisher metric with respect to $p(a,v|s)$. This is similar to the standard K-FAC except that we need to sample the two networks' outputs independently. In this case, the K-FAC to approximate the Fisher matrix is: $$F_v = \mathbb{E}_{p(\tau)}[\bigtriangledown\log p(a, v|s) \bigtriangledown\log p(a, v|s)^T]$$

The pseudocode presented gives an overview of the ACKTR implementation used in our evaluation.

Import the ncessary dependencies:

```
# Use algorithms from baselines
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction
```

Create the enviroment which you would like to train for. For example:

`env = gym.make('MARAOrient-v0')`

Create tensorflow session and call the training function:

```
with tf.Session(config=tf.ConfigProto()) as session:
ob_dim = env.observation_space.shape[0]
ac_dim = env.action_space.shape[0]
with tf.variable_scope("vf"):
vf = NeuralNetValueFunction(ob_dim, ac_dim)
with tf.variable_scope("pi"):
policy = GaussianMlpPolicy(ob_dim, ac_dim)
learn(env,
policy=policy, vf=vf,
gamma=0.99,
lam=0.97,
timesteps_per_batch=2500,
desired_kl=0.002,
num_timesteps=1e6,
animate=False)
```

Proximal Policy Optimization (PPO) is an alternative to Trust Region Policy Optimization (TRPO), that attains data efficiency and reliable performance of TRPO while using first order optimization. In the case of standard PG methods, the gradient update is performed per data sample. On the other hand, PPO enables multiple epochs of mini-batch updates. There are a few variants of PPO in the literature, which optimize the surrogate objective or use adaptive KL penalty coefficient. The Clipped Surrogate Objective is given as:

$$L^{clip}(\theta) = \hat{E_{t}}[min(r_t(\theta)\hat{A_{t}}, clip(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A_{t}})]$$ where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, is the probability ratio of the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{old}}$, $\hat{A_{t}}$ is an estimator of the advantage function at timestep $t$ and $\epsilon$ is a hyperparameter, for example $\epsilon = 0.2$. The first term is the surrogate objective that is also used in TRPO. The second term, $clip(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A_{t}}$, is clipping the probability ratio, $r_t$, to be between the interval $[1-\epsilon,1+\epsilon]$. The $L^{clip}(\theta)$ takes the minimum of the un-clipped and clipped value, which excludes the change in the probability ratio when the objective improves and includes it when the objective is worse. The clipping prevents PPO from having a large policy update. The Adaptive KL Penalty Coefficient is an alternative to the clipped surrogate objective or an addition to it where the goal is to use the penalty on KL divergence and update the penalty coefficient to achieve some target KL divergence ($d_{targ}$) at each policy update.

The presented pseudo code presents the detailed workflow of the PPO algorithm:

Import all the necessary dependencies:

```
import gym
import gym_gazebo
from baselines.ppo2 import ppo2
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
```

Create the training enviroment. For example:

```
def make_env():
env = gym.make('MARAOrient-v0')
return env
```

Set up seeds and call the training function for PPO:

```
seed = 0
set_global_seeds(seed)
network = 'mlp'
alg_kwargs['network'] = 'mlp'
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
model = learn(env=env,
seed=seed,
total_timesteps=1e6, save_interval=10, **alg_kwargs)
```

As mentioned before PPO is improvement of TRPO and including clipping parameter. In our documentation due to same mathematical formulation between TRPO and PPO we decided to skip its description however we would like to provide an example how to launch training for TRPO using gym_gazebo and the MARA enviroment.

```
import gym
import gym_gazebo
import tensorflow as tf
import numpy as np
from mpi4py import MPI
from baselines import logger
from baselines.common import set_global_seeds, tf_util as U
from baselines.ppo1.mlp_policy import MlpPolicy
from baselines.common.mpi_fork import mpi_fork
from baselines.trpo_mpi import trpo_mpi
env = gym.make('MARAOrient-v0')
sess = U.single_threaded_session()
sess.__enter__()
rank = MPI.COMM_WORLD.Get_rank()
if rank != 0:
logger.set_level(logger.DISABLED)
workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank()
set_global_seeds(workerseed)
def policy_fn(name, ob_space, ac_space):
return MlpPolicy(name=name, ob_space=env.observation_space, ac_space=env.action_space,
hid_size=32, num_hid_layers=2)
env.seed(workerseed)
print(workerseed)
trpo_mpi.learn(env, policy_fn, timesteps_per_batch=1024, max_kl=0.01,
cg_iters=10, cg_damping=0.1, max_timesteps=1e6, gamma=0.99, lam=0.98, vf_iters=5, vf_stepsize=1e-3)
```

We have presented Different DRL techniques for modular robotics. Our setup and framework consists of tools, such as ROS and Gazebo, which allows more realistic representation of the environment and easy transfer of the trained knowledge from simulation to real robot.

There still remain many challenges within the DRL field for robotics. The main problems are the long training times, the simulation-to-real robot transfer, reward shaping, sample efficiency and extending the behaviour to diverse tasks and robot configurations.

So far, our work with modular robots has focused on simple tasks such as reaching a point in space. In order to have an end-to-end training framework (from pixels to motor torques) and to perform more complex tasks, we aim to integrate additional rich sensory input such as vision.

[^1]: $^{1}$ Erle Robotics

[^2]: DDPG uses a deterministic policy and adds stochastic noise to the action coming from it. In NAF, policy is deterministic as well.

[^3]: provided that there is no additional state estimator (actor-critic methods)