During the last week at Acutronic Robotics, we have been working on tuning the neural network hyper-parameters of one of our most recent releases: ROS2Learn. To be precise, we have tried several approaches (some successful, some not so much) that have led to a faster learning process in the MARA-v0 environment using the PPO algorithm. A table with the best parameters for each algorithm-environment combination is available here.

Note that some basic knowledge of deep reinforcement learning is needed to follow the detailed explanations given below. You can also learn more in the PPO2 paper.

Introduction

We have spent the last week trying to improve ROS2Learn in order to get higher-quality results in a shorter amount of time. This involved obtaining higher-quality observations and tuning the hyper-parameters, which we present in this issue, as the title says. We will particularly focus on the MARA-v0 environment.

The results we published in the original paper yielded, after 1M steps, reward values between -500 and -750 with 2048 steps per episode. Those results were obtained from a single run (hence without standard deviation), which is statistically insufficient. Another purpose of this issue is to give better statistical assurance about the possible outcomes when running the algorithms.

[Figure: reward evolution of the original results]

Number of steps per episode (nsteps)

One of the hyper-parameters we changed is nsteps. Since we train with an on-policy algorithm (PPO), we train at the end of each episode. nsteps sets the trade-off between how many times we are going to train and the amount of experience we feed in each time we train. In principle, timing-wise, we should not see much difference, since with smaller nsteps we train more often but on smaller chunks of data; in practice we see a small improvement for smaller nsteps.

When we reach the training stage in each episode, we split the gathered data into mini-batches. We kept the size of each mini-batch constant for these tests (256 steps per mini-batch, which we have found empirically to work best), and tried nsteps values of 512, 1024 and 2048, getting the highest rewards with 1024 (4 mini-batches per episode). Note that the accumulated reward depends directly on nsteps, since it is just the sum of the reward at each step. Hence, a -500 reward over 2048 steps is equivalent to a -250 reward over 1024.

Using more mini-batches seems redundant for the system, although more stable, while using fewer makes the learning too unstable.
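As an illustration (a sketch in the style of a baselines PPO2 setup, not the actual ROS2Learn code; nenvs = 1 and the variable names are our own assumptions), this is roughly how nsteps, the number of mini-batches and the mini-batch size relate, and how rewards can be normalized for comparison across different nsteps:

```python
# Illustrative sketch in the style of a baselines PPO2 setup, not the actual
# ROS2Learn code. nenvs = 1 is an assumption.
nenvs = 1           # number of parallel environments
nsteps = 1024       # steps gathered before each training stage
nminibatches = 4    # mini-batches per training stage

nbatch = nenvs * nsteps                 # samples per training stage: 1024
nbatch_train = nbatch // nminibatches   # samples per mini-batch: 256

# The accumulated reward scales with nsteps, so to compare runs that use
# different nsteps it helps to look at the reward per step instead:
def reward_per_step(episode_reward, nsteps):
    return episode_reward / nsteps

# -500 over 2048 steps and -250 over 1024 steps are the same per-step reward.
assert reward_per_step(-500, 2048) == reward_per_step(-250, 1024)
```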

Learning rate (LR)

Another hyper-parameter we tuned was the LR. Originally we used a linearly decreasing LR, starting at 3e-4 and reaching 0 at total_timesteps (a rough sketch of this schedule follows the list below). This approach has several drawbacks:

  • Firstly, by making the LR dependent on total_timesteps, you end up with LRs close to 0 by the end of your runs.
  • Also, you can't compare runs with different total_timesteps, which is inconvenient.
  • A linear decrease limits how high your first LRs can be (it is good to have high LRs at the beginning and decrease them progressively).
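For reference, the old linear schedule can be sketched as follows (in the style of baselines' ppo2, where the learning rate may be given as a function of the fraction of training remaining; the function name and the lr0 default are our own):

```python
# Sketch of the original linearly decreasing schedule (baselines-style).
# `frac` is the fraction of training remaining: 1.0 at the start, approaching
# 0.0 as the run gets close to total_timesteps.
def linear_lr(frac, lr0=3e-4):
    return frac * lr0

# Because frac is computed from total_timesteps, the LR at a given timestep
# changes when total_timesteps changes, and it always ends up close to 0.
```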

In order to deal with these issues, we propose a new way of defining the learning rate:

LR(ep_num) = LR_0 * exp(-0.001918 * ep_num)

where ep_num is the episode number we are in and LR_0 is the initial learning rate. This form is exponential: it decreases quickly at the beginning and then keeps decreasing slowly without ever reaching 0. The coefficient 0.001918 has been chosen to reduce the LR to 3/4 of its initial value over the first 150 episodes.

[Figure: learning rate evolution under the exponential schedule]

This form is independent of the rest of the parameters, which makes comparisons between runs possible; it never reaches 0 and allows high learning rates in the early stages.
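A minimal sketch of this schedule, assuming the decay is applied to an initial learning rate of 3e-4 (the starting value of the old schedule; the function name is ours):

```python
import math

# Sketch of the proposed exponential schedule. The coefficient 0.001918 makes
# the LR drop to about 3/4 of its initial value after 150 episodes
# (exp(-0.001918 * 150) ~= 0.75), and the LR never reaches 0.
def exponential_lr(ep_num, lr0=3e-4):
    return lr0 * math.exp(-0.001918 * ep_num)

print(exponential_lr(0))    # 3e-4
print(exponential_lr(150))  # ~2.25e-4, i.e. ~3/4 of the initial value
```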

Number of hidden network units (num_hidden)

We have moved from num_hidden = 64 to num_hidden = 16 (a reduction by a factor of 4). Simpler networks are still able to properly model the mapping between observations and the actions to be taken. The benefits of smaller networks include, among others, faster back-propagation and less over-fitting.
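To give an idea of the size difference, here is a rough parameter count for a generic 2-hidden-layer MLP policy (not the exact ROS2Learn network; the observation and action dimensions below are hypothetical, for illustration only):

```python
# Rough parameter count (weights + biases) of a generic 2-hidden-layer MLP
# policy, ignoring the value head. obs_dim and act_dim are hypothetical
# placeholders for the MARA-v0 observation and action sizes.
def mlp_param_count(obs_dim, act_dim, num_hidden, num_layers=2):
    sizes = [obs_dim] + [num_hidden] * num_layers + [act_dim]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

obs_dim, act_dim = 12, 6  # illustrative dimensions only
print(mlp_param_count(obs_dim, act_dim, num_hidden=64))  # 5382 parameters
print(mlp_param_count(obs_dim, act_dim, num_hidden=16))  # 582 parameters
```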

Clip range (cliprange)

We have also used a higher cliprange = 0.25 (instead of cliprange = 0.2). This is a PPO parameter that controls how much we are allowed to change the policy at each update. Making it higher makes the learning faster but more unstable. Our best runs were obtained with this clip range, although one of the worst ones was as well.
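For context, cliprange enters PPO's clipped surrogate objective, which limits how far the new policy can move from the old one at each update. A small numpy sketch of that objective (not the ROS2Learn implementation):

```python
import numpy as np

# PPO clipped surrogate objective for a batch of samples.
# ratio = pi_new(a|s) / pi_old(a|s); adv = advantage estimates.
def ppo_clipped_objective(ratio, adv, cliprange=0.25):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange) * adv
    # PPO maximizes the minimum of both terms: a larger cliprange lets the
    # policy move further per update (faster but less stable learning).
    return np.mean(np.minimum(unclipped, clipped))
```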

Results and conclusions

These hyper-parameters and others have been tested with at least three training runs each to get a good feeling for their performance. You can check those here.
The final results come from the best 9 runs out of 10. As mentioned, we got one run where we did not get a reward higher than -750, but for the rest we got rewards in the range [-350, 670]. The fact that this run is so far away from the others is, by our criteria, enough to leave it out of the statistics.

First we show the evolution of the maximum, mean and minimum reward obtained during each experiment:

In the figure on the left, the red line is the maximum reward, the blue line is the mean and the green line is the minimum. In the figure on the right we show the Tensorboard reward evolution of the 9 training runs with the new parameters in blue, which is roughly equivalent (with a noise correction) to the blue line on the left multiplied by nsteps. The yellow part is the equivalent for our old selection of parameters, with statistics computed over 3 training runs.

Note that the mean reward with these parameters is around +100, with a clear increasing tendency, whereas the previous version seems to get stuck at around -400. Another way of measuring the evolution of the learning is through the distance to the target:

In both figures, the red line is the maximum distance, the blue line is the mean and the green line is the minimum. Note that the log scale on the y axis of the right figure shows that the minimum distance keeps improving, which is not clear in the figure on the left.

We can also see the evolution of the number of collisions per episode:

In the figure on the right we show the raw means, while on the left you can see the number of collisions with a noise correction (together with the standard deviation), using the same smoothing strategy as Tensorboard.
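The noise correction mentioned here is, roughly, the exponential moving average that Tensorboard applies to its scalar plots. A minimal re-implementation of that kind of smoothing (our own sketch, not Tensorboard's code nor the exact script we used) could look like:

```python
# Exponential moving average smoothing, similar in spirit to the smoothing
# applied by Tensorboard to its scalar plots.
def smooth(values, weight=0.8):
    smoothed, last = [], values[0]
    for v in values:
        last = last * weight + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed

print(smooth([0, 10, 10, 10, 10]))  # [0.0, 2.0, 3.6, 4.88, 5.904]
```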

Finally, the evolution of the entropy:

In conclusion, we think we have achieved a considerable improvement in training speed. This can also serve as a benchmark for future improvements and for comparison with, for example, the collision environments.