During the last week at Acutronic Robotics, we have been tuning the neural network hyper-parameters of one of our most recent releases: ROS2Learn. To be precise, we have tried several approaches (some successful, some less so) that have led to a faster learning process in the MARA-v0 environment using the PPO algorithm. A table with the best parameters for each algorithm-environment combination is available here.

*Note that you need basic knowledge of deep reinforcement learning to follow the detailed explanations given below. You can also learn more in the PPO2 paper.*

### Introduction

We spent the past week improving ROS2Learn in order to get higher-quality results in a shorter amount of time. This involved gathering higher-quality observations and tuning the hyper-parameters, which we present in this post. We will focus in particular on the MARA-v0 environment.

The results we published in the original paper yielded, after 1M steps, reward values between -500 and -750 with 2048 steps per episode. Those results come from a single run (hence no standard deviation), which is statistically insufficient. Another purpose of this post is to give better statistical assurance about the possible outcomes when running the algorithms.

### Number of steps per episode (nsteps)

One of the hyper-parameters we changed is *nsteps*. Since we train with an on-policy algorithm (PPO), we update the policy at the end of each episode. *nsteps* sets the trade-off between how many times we train and how much experience we feed in at each update. In principle we should not see much difference timing-wise, since smaller *nsteps* simply means training more often on smaller chunks of data, but in practice we observed a small improvement for smaller *nsteps*.

When the training stage of each episode is reached, we split the gathered data into mini-batches. We kept the mini-batch size constant for these tests (256 steps per mini-batch, which we have found empirically to be the best) and tried *nsteps* values of 512, 1024 and 2048, obtaining the highest rewards with 1024 (4 mini-batches per episode). Note that the accumulated reward depends directly on *nsteps*, since it is just the sum of the reward at each step. Hence, a -500 reward over 2048 steps is equivalent to a -250 reward over 1024 steps.

Using more mini-batches appears redundant to the system, although more stable, while using fewer makes the learning too unstable.
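To illustrate (this is a minimal sketch, not the ROS2Learn code itself): one rollout of *nsteps* transitions can be shuffled and split into fixed-size mini-batches, and the accumulated reward can be normalized per step so that runs with different *nsteps* become comparable.

```python
import numpy as np

def split_minibatches(nsteps, minibatch_size=256, seed=0):
    # Shuffle the step indices of one rollout and split them into
    # equally sized mini-batches, as an on-policy update would.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(nsteps)
    return np.array_split(indices, nsteps // minibatch_size)

def reward_per_step(episode_reward, nsteps):
    # The accumulated reward is a sum over nsteps, so dividing by
    # nsteps lets us compare episodes of different lengths.
    return episode_reward / nsteps

batches = split_minibatches(1024)
print(len(batches), len(batches[0]))                               # -> 4 256
print(reward_per_step(-500, 2048) == reward_per_step(-250, 1024))  # -> True
```

The last line checks the equivalence stated above: -500 over 2048 steps is the same per-step reward as -250 over 1024 steps.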

### Learning rate (LR)

Another hyper-parameter we tuned was the LR. Originally we used a linearly decreasing LR, starting at 3e-4 and reaching 0 at *total_timesteps*. This approach has several drawbacks:

- By making the LR dependent on *total_timesteps*, every run ends with LRs close to 0.
- Runs with different *total_timesteps* cannot be compared, which is inconvenient.
- A linear decrease limits how high the initial LRs can be (it is good to have high LRs at the beginning and decrease them progressively).

In order to deal with these issues, we propose a new way of defining the learning rate:

*lr(ep_num)* = 3e-4 · e^(−0.001918 · *ep_num*)

where *ep_num* is the current episode number. This form is exponential, meaning that it decreases quickly at the beginning and then keeps decreasing slowly without ever reaching 0. The coefficient 0.001918 was chosen to reduce the LR to 3/4 of its initial value over the first 150 episodes.

This form is invariant to the rest of the parameters, which makes comparison across runs possible, never reaches 0, and allows high learning rates in the early stages.
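A minimal sketch of the two schedules side by side (3e-4 and 0.001918 come from the text above; the helper names are ours, for illustration only):

```python
import math

LR0 = 3e-4      # initial learning rate
K = 0.001918    # decay coefficient chosen above

def lr_exponential(ep_num):
    # New schedule: decays quickly at first, then slowly, and never
    # reaches 0; independent of total_timesteps, so runs stay comparable.
    return LR0 * math.exp(-K * ep_num)

def lr_linear(step, total_timesteps):
    # Original schedule: reaches exactly 0 at total_timesteps, so the
    # final episodes of every run barely learn.
    return LR0 * (1.0 - step / total_timesteps)

# After 150 episodes the exponential LR has dropped to ~3/4 of LR0:
print(round(lr_exponential(150) / LR0, 2))   # -> 0.75
```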

### Number of hidden network units (num_hidden)

We went from *num_hidden* = 64 to *num_hidden* = 16 (a reduction by a factor of 4). Simpler networks are still able to properly model the mapping from observations to actions. The benefits of smaller networks include, among others, faster back-propagation and less over-fitting.
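To get a feel for the size difference, here is a quick parameter count for a fully connected network with two hidden layers (the observation and action dimensions below are hypothetical, chosen only for illustration):

```python
def mlp_param_count(obs_dim, num_hidden, act_dim, num_hidden_layers=2):
    # Weights and biases of the input layer, the hidden-to-hidden
    # layers, and the output layer of a fully connected MLP.
    params = (obs_dim + 1) * num_hidden
    params += (num_hidden_layers - 1) * (num_hidden + 1) * num_hidden
    params += (num_hidden + 1) * act_dim
    return params

big = mlp_param_count(obs_dim=12, num_hidden=64, act_dim=6)
small = mlp_param_count(obs_dim=12, num_hidden=16, act_dim=6)
print(big, small, round(big / small, 1))   # -> 5382 582 9.2
```

Because the hidden-to-hidden term grows quadratically in *num_hidden*, dividing the width by 4 shrinks the parameter count by roughly an order of magnitude, which is where the back-propagation speedup comes from.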

### Clip range (cliprange)

We also started with a higher *cliprange* of 0.25 (instead of *cliprange* = 0.2). This PPO parameter controls how much the policy is allowed to change in each episode. Making it higher makes the learning faster but more unstable: our best runs used this clip range, but so did one of the worst.
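For reference, a minimal NumPy sketch of the PPO clipped surrogate objective, showing how *cliprange* bounds each policy update (the ratio and advantage values are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, cliprange=0.25):
    # PPO surrogate: take the minimum of the unclipped objective and
    # one where the probability ratio is clipped to
    # [1 - cliprange, 1 + cliprange], bounding the policy change.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange) * advantage
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with positive advantage is capped at 1.25 * advantage:
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))   # -> [1.25]
```

A larger *cliprange* widens the clipping interval, permitting bigger policy steps per update, which is exactly the faster-but-less-stable trade-off described above.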

### Results and conclusions

These hyper-parameters and others have been tested with at least three training runs each to give a reliable picture of their performance. You can check those here.

The final results come from the best 9 out of 10 runs. As mentioned, there was one run where the reward never rose above -750, but in the rest we obtained rewards in the range [-350, 670]. The fact that this run's result is so far from the others is, by our criteria, enough to leave it out of the statistics.

First we show the evolution of the maximum, mean and minimum reward obtained in each experiment:

Note that the mean reward with these parameters is around +100, with a clear increasing tendency, whereas the previous version seems to get stuck at around -400. Another way of measuring the evolution of the learning is through the distance to the target:

We can also see the evolution of the number of collisions per episode:

Finally the evolution of the entropy:

In conclusion, we believe we have achieved a considerable improvement in training speed. These results can also serve as a benchmark for future improvements and for comparison with, for example, the collision environments.