Reinforcement Learning (RL) has recently gained attention in the robotics field. Rather than programming robots explicitly, it allows roboticists to train them, producing results that generalize better and cope with the dynamic environments typically encountered in robotics. Furthermore, RL techniques, used in combination with modular robotics, could empower a new generation of robots that are more adaptable and capable of performing a variety of tasks without human intervention.
While some results have shown the feasibility of using RL on real robots, the approach is expensive: it requires hundreds of thousands of attempts (performed by a group of robots) over a period of several months. Since such resources are available only to a few, training in simulation has gained popularity. The idea is to train a virtual model of the real robot until the desired behavior is learned, and then transfer that knowledge to the real robot, whose behavior can be further refined with a limited number of additional training iterations. Following some of the initial releases of OpenAI's gym, many groups adopted the MuJoCo physics engine. To overcome the obstacles of this common infrastructure used in the RL community, we combined the Gazebo robot simulator with the Robot Operating System (ROS) to create gym_gazebo, an environment built on the tools commonly used by roboticists.
gym-gazebo2 is its upgraded version, fully compatible with ROS 2. The advances made in the heavily developed ROS 2 and the infrastructure around it led us to create an upgraded version that fully complies with the newest version of the Robot Operating System.
Why move to ROS 2?
We want to benefit from the latest performance and security updates, as well as cutting edge development tools.
In this documentation we will explain the most common ROS 2 packages used in the gym_gazebo2 toolkit and elaborate on how it can be used to test and evaluate different RL algorithms.
As seen in the figure above, gym-gazebo2 incorporates interfaces for experimenting with state-of-the-art DRL algorithms and with different ROS 2 packages, making it possible to combine the best of both worlds: AI and robotics packages.
In theory, all of the available RL algorithms can be easily integrated into the gym_gazebo2 infrastructure. Several functions need to be implemented so that the RL algorithm can exchange the relevant information with the environment and evolve. The main ones are:
step: a function which should return four values:
observation: an environment-specific object representing an observation of the environment. In robotics, this usually refers to joint positions or the end-effector position.
reward: the amount of reward achieved by the previous action. The scale is environment-dependent, but the goal is always to obtain as much reward as possible; it indicates how well a given algorithm is performing in the environment.
done: indicates whether it is time to reset the environment. Usually, done is true when the episode has terminated.
info: used for debugging. It can sometimes be useful to have more details, for example the raw probabilities behind the environment's last state change.
reset: defines the movement to be performed by the robot, usually called when done=True. In this function we define where the robot should go; in our particular case, we reset the robot to its initial position.
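Putting these functions together, a gym-style environment might be sketched as follows. This is a minimal, self-contained illustration only: the class name, joint limits, reward, and episode length are assumptions for the sake of the example, not the actual gym_gazebo2 implementation.

```python
import math

class MinimalRobotEnv:
    """Hypothetical sketch of a gym-style environment for a 6-axis arm.

    The joint limits, reward, and episode length below are illustrative,
    not the real gym_gazebo2 values.
    """

    N_JOINTS = 6
    MAX_STEPS = 100

    def __init__(self):
        self.joints = [0.0] * self.N_JOINTS
        self.steps = 0

    def reset(self):
        # Return the robot to its initial position and hand back
        # the initial observation, as described above.
        self.joints = [0.0] * self.N_JOINTS
        self.steps = 0
        return list(self.joints)

    def step(self, action):
        # Clamp the commanded joint positions to illustrative limits.
        self.joints = [max(-math.pi, min(math.pi, a)) for a in action]
        self.steps += 1
        observation = list(self.joints)
        # Toy reward: higher when the joints are near the zero pose.
        reward = -math.sqrt(sum(j * j for j in self.joints))
        done = self.steps >= self.MAX_STEPS
        info = {"steps": self.steps}  # debugging extras
        return observation, reward, done, info
```

An RL algorithm only needs this reset/step protocol to drive the environment; everything robot-specific stays hidden behind it.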
In short, the workflow of each time-step is the following:
The process gets started by calling reset(), which returns the initial observation.
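That per-time-step workflow can be sketched with a toy stand-in environment. ToyEnv below is hypothetical; gym_gazebo2 environments expose the same reset/step protocol, so only the environment class would change.

```python
import random

class ToyEnv:
    """Stand-in environment following the gym reset/step protocol."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0                      # initial observation

    def step(self, action):
        self.t += 1
        observation = float(action)     # echo the action back
        reward = -abs(action)           # toy reward
        done = self.t >= 10             # short fixed-length episode
        return observation, reward, done, {}

env = ToyEnv()
observation = env.reset()               # 1. reset() returns the initial observation
total_reward = 0.0
done = False
while not done:
    action = random.uniform(-1.0, 1.0)  # 2. the agent picks an action (here, random)
    observation, reward, done, info = env.step(action)  # 3. step() advances one time-step
    total_reward += reward              # 4. the reward drives the learning update
```

A real agent would replace the random action with its policy's output, but the loop itself stays identical.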
In order to leverage the ROS infrastructure and output the current progress of the training in Gazebo, the following ROS packages are being used:
joint_state: through the joint_state topic we obtain the joint values and compute the forward kinematics to get the robot's current end-effector position. The reward is then calculated as the difference between the target and the current end-effector position, and is returned by the step function, which, as mentioned before, drives the optimization scheme of the RL algorithm during training.
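As a rough sketch of that reward computation, the following assumes a toy 2-link planar arm in place of MARA's real 6-DOF kinematic chain; both function names are hypothetical, and the real end-effector pose would come from the joint values published on the joint_state topic.

```python
import math

def forward_kinematics_2link(q1, q2, l1=1.0, l2=1.0):
    """Toy planar 2-link forward kinematics (link lengths l1, l2).

    MARA's actual 6-DOF chain is more involved; this only illustrates
    the joint-angles -> end-effector-position mapping.
    """
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return (x, y)

def reward_from_joints(joint_positions, target):
    """Reward = negative Euclidean distance from end-effector to target."""
    ee = forward_kinematics_2link(*joint_positions)
    return -math.dist(ee, target)

# The closer the end-effector gets to the target, the higher the reward.
r_far = reward_from_joints((0.0, 0.0), (0.0, 2.0))           # end-effector at (2, 0)
r_near = reward_from_joints((math.pi / 2, 0.0), (0.0, 2.0))  # end-effector at (0, 2)
```

Using the negative distance keeps the reward maximal (zero) exactly when the target is reached.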
We’ve been sampling random actions from the environment’s action space. But what are those actions, really? Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations:
```python
import gym
import gym_gazebo2

env = gym.make('MARA-v0')
print(env.action_space)
# Box(6,)
print(env.observation_space)
# Box(12,)
```
The action space is the desired joint position for the 6 axes of MARA. The observation space is the current position of each joint, the difference between the current end-effector position and the desired target, and the speed of each joint. In more advanced environments such as MARAOrientCollision-v0, the difference between the current end-effector orientation and the desired target orientation, represented as quaternions, is also taken into account.
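To show how such spaces are typically used, here is a minimal stand-in for a Box space with the sample/contains interface that gym's spaces provide; the class below is a simplified assumption, not gym's actual implementation, and the bounds are illustrative.

```python
import random

class Box:
    """Minimal stand-in for a gym Box space over a single interval."""
    def __init__(self, low, high, shape):
        self.low, self.high, self.shape = low, high, shape

    def sample(self):
        # Draw a random point inside the box (finite bounds only).
        return [random.uniform(self.low, self.high) for _ in range(self.shape[0])]

    def contains(self, x):
        return (len(x) == self.shape[0]
                and all(self.low <= v <= self.high for v in x))

# Mirrors the printed spaces above: 6 joint commands in, 12 observed values out.
action_space = Box(low=-3.14, high=3.14, shape=(6,))
observation_space = Box(low=-float("inf"), high=float("inf"), shape=(12,))

action = action_space.sample()  # a random but valid joint-position command
assert action_space.contains(action)
```

Sampling from the action space is exactly what a random agent does before any learning has happened, and contains() lets an environment validate incoming actions.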
Read more about this topic in our related publications: