gym_gazebo2 - A reinforcement learning toolkit for robots

"a toolkit for reinforcement learning using ROS 2 and Gazebo"
We have developed a software infrastructure for rapid exploration and development of solutions based on Reinforcement Learning.
Source code

Reinforcement Learning (RL) has recently gained attention in the robotics field. Rather than programming, it allows roboticists to train robots, producing results that generalize better and are able to comply with the dynamic environments typically encountered in robotics. Furthermore, RL techniques, if used in combination with modular robotics, could empower a new generation of robots that are more adaptable and capable of performing a variety of tasks without human intervention.

While some results have shown the feasibility of using RL on real robots, such an approach is expensive. It requires hundreds of thousands of attempts (performed by a group of robots) over a period of several months. These capabilities are available only to a restricted few, so training in simulation has gained popularity. The idea behind using simulation is to train a virtual model of the real robot until the desired behavior is learned, and then transfer that knowledge to the real robot. The real robot's behavior can be further enhanced by exposing it to a restricted number of additional training iterations. Following some of the initial releases of OpenAI’s gym, many groups started using the MuJoCo physics engine. To overcome the obstacles with the common infrastructure used in the RL community, we used the Gazebo robot simulator in combination with the Robot Operating System (ROS) to create gym_gazebo2, an environment built on the common tools used by roboticists.

gym-gazebo2, an upgraded version fully compatible with ROS 2.

The advances made in the heavily developed ROS 2 and the infrastructure around it led us to create an upgraded version that fully complies with the newest version of the Robot Operating System.

Why move to ROS 2?

We want to benefit from the latest performance and security updates, as well as cutting edge development tools.

In this documentation we will explain the most common ROS 2 packages used in the gym_gazebo2 toolkit and elaborate on how it can be used to test and evaluate different RL algorithms.

As seen in the figure above, gym-gazebo2 incorporates interfaces for experimenting with state-of-the-art DRL algorithms and different ROS 2 packages, making it possible to combine the best of both worlds: AI and robotics packages.

In theory, all of the available RL algorithms can be easily integrated into the gym_gazebo2 infrastructure. Several functions need to be implemented so that the RL algorithm can input and output the relevant information and evolve. The main ones are:

  • step function which should return four values:
    • observation: an environment-specific object representation of our observation of the environment. In robotics, this usually refers to joint position or end-effector positions.
    • reward: the amount of reward achieved by the previous action. This value is environment dependent, but the goal is always to make the reward as high as possible. It indicates how well a given algorithm is performing in the environment.
    • done: indicates whether it's time to reset the environment again. Usually done is true when the episode is terminated.
    • info: used for debugging. It can sometimes be useful to have more details, for example the raw probabilities behind the environment's last state change.
  • reset: defines the movement to be done by the robot, usually called when done=True. In this function we can define where the robot should go; in our particular case, we reset the robot to its initial position.
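
A minimal sketch of this interface, assuming a one-dimensional toy task in place of a real robot. The class name ToyReachEnv, its parameters, and its reward are hypothetical and are not part of gym_gazebo2; the sketch only illustrates the shape of the step and reset contract described above:

```python
class ToyReachEnv:
    """Hypothetical 1-D stand-in for a robot environment (not a gym_gazebo2 class)."""

    def __init__(self, target=1.0, max_steps=50, tolerance=0.05):
        self.target = target
        self.max_steps = max_steps
        self.tolerance = tolerance
        self.position = 0.0
        self.steps = 0

    def reset(self):
        # Return the "robot" to its initial position and give the first observation.
        self.position = 0.0
        self.steps = 0
        return [self.position]

    def step(self, action):
        # Apply the action, then report observation, reward, done and info.
        self.position += action
        self.steps += 1
        distance = abs(self.target - self.position)
        observation = [self.position]
        reward = -distance  # closer to the target means a higher reward
        done = distance < self.tolerance or self.steps >= self.max_steps
        info = {"distance": distance}  # extra details, useful for debugging
        return observation, reward, done, info


env = ToyReachEnv()
first_observation = env.reset()
observation, reward, done, info = env.step(0.5)
```

A real environment would replace the arithmetic in step with commands sent to the simulated robot and readings taken back from it.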

In short, the workflow of each time-step is the following:

  1. Execute an action
  2. Take the observation
  3. Check for a collision
  4. Compute the reward
  5. Return the status (done)
  6. Reset the agent according to the done boolean variable

The process gets started by calling reset(), which returns the initial observation.
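
The six stages above can be sketched as follows, assuming a single-joint stub instead of a simulated robot. StubArmEnv, its joint limit, the collision penalty, and the random policy are all hypothetical placeholders chosen for illustration:

```python
import random


class StubArmEnv:
    """Hypothetical single-joint stand-in, not a gym_gazebo2 environment."""

    JOINT_LIMIT = 2.0  # assumed limit; exceeding it counts as a collision

    def __init__(self, target=0.8, tolerance=0.05):
        self.target = target
        self.tolerance = tolerance
        self.joint = 0.0

    def reset(self):
        self.joint = 0.0
        return [self.joint]

    def step(self, action):
        self.joint += action                            # 1. execute the action
        observation = [self.joint]                      # 2. take the observation
        collided = abs(self.joint) > self.JOINT_LIMIT   # 3. check for a collision
        error = abs(self.target - self.joint)
        reward = -error - (10.0 if collided else 0.0)   # 4. compute the reward
        done = collided or error < self.tolerance       # 5. return the status
        return observation, reward, done, {"collided": collided}


def run(env, total_steps=200, seed=0):
    """Drive the environment with random actions and count finished episodes."""
    rng = random.Random(seed)
    observation = env.reset()  # the process starts with reset()
    episodes = 0
    for _ in range(total_steps):
        action = rng.uniform(-0.3, 0.3)  # placeholder for an RL policy
        observation, reward, done, info = env.step(action)
        if done:                         # 6. reset according to done
            observation = env.reset()
            episodes += 1
    return episodes


episodes_finished = run(StubArmEnv())
```

In a training setup, the random action would come from the RL algorithm, which would also consume the reward returned at each step.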

In order to leverage the ROS infrastructure and output the current progress of the training in Gazebo, the following ROS packages are being used:

  1. Orocos Kinematics Dynamics Library (KDL): this library allows us to load the modular robot model from URDF, construct the kinematic chain and find the inverse and forward kinematics. This is useful in our case, where we would like to calculate the reward based on the current end-effector position and the target position.
  2. Appropriate ROS 2 publishers and subscribers: for each modular robot, appropriate publishers and subscribers are incorporated in the environment. They allow us to convert the action values from the RL algorithm into joint actions understandable by the ROS infrastructure, send them to the Gazebo simulator, and visualize the current status of the training. Furthermore, by subscribing to the joint_state topic we constantly monitor the current joint positions and calculate the forward kinematics to get the current end-effector position of the robot. The reward is then calculated as the difference between the target and the current end-effector position, and incorporated in the step function, which, as mentioned before, is used in the optimization scheme of the RL algorithm to advance the training.
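
The reward computation described above can be illustrated with a small sketch. It does not use KDL or ROS 2: it assumes a hypothetical planar arm with two unit-length links and closed-form forward kinematics, and it assumes the reward is the negative Euclidean distance between the end effector and the target, which is one way to turn the "difference" into a scalar. All names are illustrative:

```python
import math

LINK_LENGTHS = (1.0, 1.0)  # hypothetical link lengths in metres


def forward_kinematics(joint_angles, link_lengths=LINK_LENGTHS):
    """End-effector (x, y) of a planar serial arm with revolute joints."""
    x = y = 0.0
    cumulative_angle = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        cumulative_angle += angle  # each joint rotates everything after it
        x += length * math.cos(cumulative_angle)
        y += length * math.sin(cumulative_angle)
    return x, y


def distance_reward(joint_angles, target):
    """Negative distance from the end effector to the target (assumed reward)."""
    ex, ey = forward_kinematics(joint_angles)
    return -math.hypot(target[0] - ex, target[1] - ey)


# Arm stretched along the x axis: the end effector sits on the target.
reward_at_rest = distance_reward((0.0, 0.0), target=(2.0, 0.0))
# Second joint bent by 90 degrees: the end effector moves away from the target.
reward_bent = distance_reward((0.0, math.pi / 2), target=(2.0, 0.0))
```

In the toolkit itself, the joint angles would come from the joint state subscriber and the forward kinematics from the chain built from the robot's URDF.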

A simple way to exercise an environment is to sample random actions from its action space. But what are those actions, really? Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations:

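import gym
import gym_gazebo2  # importing the package registers the MARA environments with gym
env = gym.make('MARA-v0')
print(env.action_space)       # format of valid actions
print(env.observation_space)  # format of valid observations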

The action space is the desired joint position for each of the 6 axes of MARA. The observation space consists of the current position, the difference between the current end-effector position and the desired target, and the speed of each joint. In more advanced environments such as MARAOrientCollision-v0, the difference between the current and the desired end-effector orientation, represented as quaternions, is also taken into account.
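
As an illustration of how such an observation could be assembled, here is a sketch under stated assumptions: the "current position" is read as one position per joint, the end-effector difference is taken as a 3-D Cartesian vector computed as target minus current, and the parts are concatenated in the order listed above. The function name and the layout are hypothetical and may differ from the actual MARA environments:

```python
NUM_JOINTS = 6  # MARA has six axes, per the description above


def build_observation(joint_positions, joint_velocities, ee_position, target_position):
    """Concatenate the described parts into one flat list (hypothetical layout)."""
    if len(joint_positions) != NUM_JOINTS or len(joint_velocities) != NUM_JOINTS:
        raise ValueError("expected one position and one velocity per joint")
    # Difference between the desired target and the current end-effector position.
    ee_error = [t - e for t, e in zip(target_position, ee_position)]
    return list(joint_positions) + ee_error + list(joint_velocities)


observation = build_observation(
    joint_positions=[0.0] * NUM_JOINTS,
    joint_velocities=[0.0] * NUM_JOINTS,
    ee_position=(0.25, 0.0, 0.5),
    target_position=(0.5, 0.0, 1.0),
)
```

Under these assumptions the vector has 6 + 3 + 6 entries; an orientation-aware variant would append the quaternion difference as well.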

Read more about this topic in our related publications: