We have introduced a more accurate system for state observations, making the whole learning process more stable. However, this increase in observation accuracy has led to a reduction in sampling speed, meaning the environment now requires more time to execute each step. To compensate for this speed loss, we have invested some time in tuning the hyperparameters, achieving convergence times similar to those of the previous version while improving the final learned behavior.

All these changes can be found in our official gym-gazebo2 repository, and our hyperparameter selection is available at ROS2Learn/Hyperparams, which we will update regularly.
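If you want to experiment with your own tuning, the sketch below shows one way a hyperparameter set could be bundled for a PPO training run. The parameter names and values here are illustrative assumptions, not the exact selection stored in ROS2Learn/Hyperparams.

```python
# Illustrative sketch only: these names and values are assumptions,
# not the exact hyperparameters published in ROS2Learn/Hyperparams.
ppo_hyperparams = {
    "num_layers": 2,            # hidden layers of the MLP policy
    "hidden_size": 64,          # units per hidden layer
    "timesteps_per_batch": 2048,
    "clip_param": 0.2,          # PPO clipping range
    "entcoeff": 0.0,            # entropy bonus coefficient
    "optim_epochs": 10,
    "optim_stepsize": 3e-4,     # Adam learning rate
    "gamma": 0.99,              # discount factor
    "lam": 0.95,                # GAE lambda
}

def make_train_kwargs(env_id="MARA-v0"):
    """Bundle the environment id and hyperparameters for a training run."""
    return {"env_id": env_id, **ppo_hyperparams}
```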

As you can see in the images below, we achieve a very stable improvement during the learning process.

Reward obtained per episode with the new observations and hyperparameters.
Entropy per episode with the new observations and hyperparameters.
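You can reproduce plots like these from your own runs by parsing the logs that the training script writes. The sketch below assumes a CSV with `episode`, `reward` and `entropy` columns, which may differ from the actual log format produced by ROS2Learn.

```python
# Sketch for plotting per-episode reward and entropy from a training log.
# Assumes a CSV with "episode", "reward" and "entropy" columns; adjust the
# column names to whatever your logger actually writes.
import pandas as pd
import matplotlib.pyplot as plt

log = pd.read_csv("progress.csv")

fig, (ax_r, ax_e) = plt.subplots(2, 1, sharex=True)
ax_r.plot(log["episode"], log["reward"])
ax_r.set_ylabel("Reward")
ax_e.plot(log["episode"], log["entropy"])
ax_e.set_ylabel("Entropy")
ax_e.set_xlabel("Episode")
plt.tight_layout()
plt.show()
```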

The policy generated by this training is valid from very early stages: the error is below 1 cm starting at around episode 2500. The longer we train, the more accurate the policy becomes. Take a look at the zero-error convergence we achieve at episode 4634.
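As a rough sanity check during evaluation, you can flag convergence once the end-effector distance to the target drops below the 1 cm tolerance mentioned above. This is only a sketch; the function names and the way you obtain the end-effector and target positions are assumptions, not part of the gym-gazebo2 API.

```python
import numpy as np

TOLERANCE_M = 0.01  # 1 cm target tolerance

def end_effector_error(ee_position, target_position):
    """Euclidean distance (in metres) between end-effector and target."""
    return float(np.linalg.norm(np.asarray(ee_position) - np.asarray(target_position)))

def has_converged(ee_position, target_position, tol=TOLERANCE_M):
    """True when the positional error is within the 1 cm tolerance."""
    return end_effector_error(ee_position, target_position) < tol

# Example: a 7 mm error counts as converged, a 1.5 cm error does not.
print(has_converged([0.30, 0.20, 0.55], [0.30, 0.20, 0.557]))   # True
print(has_converged([0.30, 0.20, 0.55], [0.30, 0.215, 0.55]))   # False
```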

GIF: Perfect convergence at episode 4634 using the new observations and hyperparameters, executed at 0.1 rad/s.

We have executed the final policy very slowly so that you can appreciate the smoothness and accuracy of MARA.
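A minimal sketch of such a slowed-down rollout with the standard Gym API is shown below. The environment id, the `policy` callable and the action scaling are assumptions for illustration, not the exact ROS2Learn run script.

```python
# Sketch of a slowed-down evaluation rollout. "MARA-v0", `policy` and the
# velocity scaling are illustrative assumptions, not the exact ROS2Learn code.
import gym
import gym_gazebo2  # registers the MARA environments

SPEED_SCALE = 0.1  # scale actions down, roughly emulating the 0.1 rad/s run

def slow_rollout(policy, env_id="MARA-v0", episodes=1):
    env = gym.make(env_id)
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)  # deterministic action from the trained policy
            obs, reward, done, info = env.step(action * SPEED_SCALE)
    env.close()
```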

Compare this with the learning process generated by the previous observation system. The learning was less stable and suffered from sudden drops in reward from time to time. The agent was able to recover from these drops, but it took a long time.

Reward obtained per episode with the old observations and hyperparameters.

Feel free to raise your questions in the GitHub issues section.

Check out ROS2Learn and gym-gazebo2 if you want to achieve policies like this one. You will also find environments where you can add collision and orientation terms to the reward system, allowing you to give the robot more customized behavior.
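As a hedged illustration of what such a reward could look like, the sketch below combines the translation error, an orientation term and a collision penalty into a single scalar. The weights and function signature are assumptions for illustration, not the exact reward implemented in the gym-gazebo2 environments.

```python
# Illustrative weights; the actual gym-gazebo2 reward terms and coefficients differ.
W_POSITION = 1.0
W_ORIENTATION = 0.1
COLLISION_PENALTY = 10.0

def shaped_reward(position_error_m, orientation_error_rad, collided):
    """Negative cost: distance to target, orientation mismatch, collision penalty."""
    reward = -W_POSITION * position_error_m
    reward -= W_ORIENTATION * abs(orientation_error_rad)
    if collided:
        reward -= COLLISION_PENALTY
    return reward

# Example: 5 cm away, 0.2 rad off the target orientation, no collision.
print(shaped_reward(0.05, 0.2, collided=False))  # -0.07
```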