We have introduced a more accurate system for state observations, making the whole learning process more stable. However, this increase in observation accuracy has led to a reduction in sampling speed, meaning the environment now requires more time to execute each step. To compensate for this speed loss, we have invested some time tuning the hyperparameters, achieving convergence times similar to those of the previous version while improving the final learned behavior.
As you can see in the images below, we achieve a very stable improvement during the learning process.
The policy generated by this training is valid from very early stages; the error is lower than 1 cm starting at around episode 2500. The longer we train, the more accurate the policy becomes. Take a look at the no-error convergence we achieve at episode 4634.
We have executed the final policy very slowly so that you can appreciate the smoothness and accuracy of the MARA.
Have a look at the learning process generated by the previous observation system. The learning was less stable, and we suffered sudden performance drops from time to time. The agent was able to recover from these drops, but recovery took a long time.
Feel free to raise your questions in the GitHub issue section.
Check ROS2Learn and gym-gazebo2 if you want to achieve policies like this one. You will also find environments where you can add collision and orientation terms to the reward system, which lets you shape a more custom behavior for the robot.
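As a rough illustration of how collision and orientation terms can be folded into a reward, here is a minimal sketch. The function name, weights, and signature are hypothetical illustrations and are not taken from gym-gazebo2's actual environment code:

```python
def shaped_reward(distance, orientation_error, collided,
                  w_dist=1.0, w_orient=0.1, collision_penalty=10.0):
    """Combine distance, orientation, and collision terms into one scalar reward.

    distance          -- end-effector-to-target distance (meters)
    orientation_error -- angular error to the target orientation (radians)
    collided          -- True if the robot hit an obstacle this step

    All names and weights here are illustrative, not gym-gazebo2's API.
    """
    reward = -w_dist * distance - w_orient * orientation_error
    if collided:
        reward -= collision_penalty  # strongly discourage any contact
    return reward

# Closer, better-aligned, collision-free states earn higher (less negative) reward.
print(shaped_reward(0.05, 0.1, False))  # small error, no collision
print(shaped_reward(0.05, 0.1, True))   # same pose, but a collision occurred
```

Tuning the weights trades off positional accuracy against orientation accuracy, while the collision penalty dominates both so the agent learns to avoid contact first.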