Thanks to the introduction of the ROS2Learn framework and the gym-gazebo2 toolkit, transferring a learned policy from simulation to a robot has become easier. As shown in the video, we can accurately replicate on a real robot the behavior demonstrated in simulation.
ROS2Learn is a framework that uses the traditional tools of the robotics ecosystem to train policies for reinforcement learning agents: Gazebo, which takes care of simulating the robot, and ROS2, which controls its movement.
This approach simplifies both the process of building modular robots and the transfer of learned policies to the real robots. ROS2Learn also includes baseline implementations of the most common DRL policy optimization methods, including Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO) and Actor-Critic using Kronecker-Factored Trust Region (ACKTR).
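For readers unfamiliar with PPO (the algorithm we use below), its core idea is a clipped surrogate objective that limits how far each policy update can move away from the policy that collected the data. The following is a minimal, generic NumPy sketch of that objective, shown only for illustration; it is not taken from the ROS2Learn implementation itself.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective at the core of PPO (to be maximized)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping the ratio keeps each policy update conservative.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```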
The gym-gazebo2 module takes care of creating the robot simulation environment, which in our case is the Modular Articulated Robotic Arm (MARA).
MARA is the first robotic arm that runs ROS2 in each joint, with industrial-grade features such as time synchronization and deterministic communication latencies. To ease the training process, we expose the environment through an interface compliant with OpenAI's Gym.
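Because the environment follows the standard Gym interface, an agent interacts with the simulated MARA through the usual reset/step loop. The sketch below illustrates this; the environment id "MARA-v0" and the registering import are assumptions made for the example and may not match the exact names exposed by gym-gazebo2.

```python
import gym
import gym_gazebo2  # assumed import that registers the MARA environments

# "MARA-v0" is an assumed environment id, used here only for illustration.
env = gym.make("MARA-v0")

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # a trained policy would choose the action here
    obs, reward, done, info = env.step(action)
env.close()
```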
We trained a multilayer perceptron (MLP) with two hidden layers of 16 neurons each. Given the current joint positions, the joint velocities and the distance to the target, this network outputs the action that brings the end-effector of the MARA to the desired position. Training it to reach a fixed target point consistently took around 6 hours on a single machine. As the training algorithm we used PPO, whose hyperparameters we had previously tuned for this environment.
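For illustration, such a policy network can be sketched as below. This is a minimal PyTorch version of a 2x16 MLP mapping the observation (joint positions, joint velocities and distance to the target) to a joint-space action; only the hidden-layer sizes come from the text, while the framework, activation function and input/output dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MLPPolicy(nn.Module):
    """Policy with two hidden layers of 16 units, as described above."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 16), nn.Tanh(),
            nn.Linear(16, 16), nn.Tanh(),
            nn.Linear(16, act_dim),  # mean of the action distribution
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Assumed observation layout: 6 joint positions + 6 joint velocities + 3D distance to target.
policy = MLPPolicy(obs_dim=15, act_dim=6)
action = policy(torch.zeros(1, 15))
```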
With these settings, we obtained a network that controls the robotic arm with high precision in simulation and, thanks to the tools used for training, with the same degree of precision on the real MARA. It is worth mentioning that we used a specific driver for training, which speeds up the simulation time and allows us to gather more experience in less time. This is particularly useful for experience-intensive on-policy algorithms such as PPO. It is also remarkable that, even though we use different drivers for training and execution, both show the same behavior, which supports our approach. More information regarding the drivers can be found in the following article. Thanks to this similarity between drivers, ROS2Learn and gym-gazebo2 let us transfer the learned policy onto the real robot without developing any additional layers for data transfer between the neural network (NN) and the robot.
This is a step forward in our vision of bringing AI techniques to industrial robot control. These techniques are computationally fast compared to more classical methods, allowing for more efficient control. The next step we plan to explore is achieving adaptive control with RL methods, which will allow us to apply NNs in industrial use cases that require a high level of accuracy and adaptability. We also plan to combine these achievements with a vision system to expand the capabilities of our MARA. We envision that the combination of adaptive control and computer vision will pave the way towards bringing these methods from research concepts to common industrial practice for robot control and manipulation.