ROS API for Reinforcement Learning
We have developed a software infrastructure for rapid exploration and development of solutions based on Reinforcement Learning.

Reinforcement Learning (RL) has recently gained attention in the robotics field. Rather than programming, it allows roboticists to train robots, producing results that generalize better and are able to comply with the dynamic environments typically encountered in robotics. Furthermore, RL techniques, if used in combination with modular robotics, could empower a new generation of robots that are more adaptable and capable of performing a variety of tasks without human intervention.

While some results have shown the feasibility of using RL on real robots, such an approach is expensive: it requires hundreds of thousands of attempts (performed by a group of robots) over a period of several months. These capabilities are available only to a restricted few, so training in simulation has gained popularity. The idea is to train a virtual model of the real robot until the desired behavior is learned and then transfer that knowledge to the real robot, whose behavior can be further enhanced by exposing it to a restricted number of additional training iterations. Following some of the initial releases of OpenAI’s gym, many groups started using the MuJoCo physics engine. To overcome the current obstacles with the common infrastructure used in the RL community today, we have used the Gazebo robot simulator in combination with the Robot Operating System (ROS) to create gym_gazebo, an environment built on the tools roboticists commonly use.

In this documentation we will explain the most common ROS packages used in the gym_gazebo simulator and elaborate on how it can be used to test and evaluate different RL algorithms.

gym_gazebo infrastructure

As seen in the figure above, gym_gazebo incorporates interfaces for experimenting with state-of-the-art RL algorithms and different ROS packages, making it possible to combine the best of both worlds: AI and robotics packages.

In theory, any available RL algorithm can be integrated into the gym_gazebo infrastructure. For the RL algorithm to exchange the relevant information with the environment and make progress during training, several functions need to be implemented:

  1. step: applies an action and returns four values:
    • observation: an environment-specific object representing our observation of the environment. In robotics, this usually refers to joint positions or end-effector positions.
    • reward: the amount of reward achieved by the previous action. The scale is environment dependent, but the goal is always to obtain as high a reward as possible. The reward indicates how well a given algorithm is performing in the environment.
    • done: indicates whether it is time to reset the environment. Usually done is true when the episode has terminated, for example because the robot has spent too long searching for the target or has collided with a structure in the environment.
    • info: diagnostic information used for debugging, for example the raw probabilities behind the environment's last state change.
  2. reset: defines what happens when the environment is reset; it is usually called when done=True. In this function we can define where the robot should go, for example back to its initial position.
  3. At each time-step the agent, based on the output of the RL algorithm, chooses an action. The environment then returns an observation and a reward, which are fed back to the RL algorithm.

This is just an implementation of the classic “agent-environment loop”. Each time-step, the agent chooses an action, and the environment returns an observation and a reward.

RL process

The process gets started by calling reset(), which returns an initial observation.
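The loop described above can be sketched without ROS or Gazebo. The following is only an illustration under stated assumptions, not the gym_gazebo implementation: ToyReachEnv, its one-dimensional state, the tolerance, the step limit and the random policy are all hypothetical stand-ins chosen so that the example runs on its own.

```python
import random


class ToyReachEnv:
    """Hypothetical 1-D stand-in for a robot environment (not part of gym_gazebo)."""

    def __init__(self, target=1.0, max_steps=50, tolerance=0.05):
        self.target = target
        self.max_steps = max_steps
        self.tolerance = tolerance
        self.position = 0.0
        self.steps = 0

    def reset(self):
        """Move back to the initial position and return the first observation."""
        self.position = 0.0
        self.steps = 0
        return self.position

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        self.position += action
        self.steps += 1
        distance = abs(self.target - self.position)
        reward = -distance  # assumption: closer to the target means higher reward
        reached = distance < self.tolerance
        timed_out = self.steps >= self.max_steps
        done = reached or timed_out
        info = {"reached": reached, "timed_out": timed_out}
        return self.position, reward, done, info


def run_episode(env, policy):
    """Run one episode of the agent-environment loop; return total reward and last info."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, info


def random_policy(observation):
    """Placeholder for an RL algorithm: ignores the observation."""
    return random.uniform(-0.1, 0.1)


if __name__ == "__main__":
    total, info = run_episode(ToyReachEnv(), random_policy)
    print(total, info)
```

In a real environment, step would publish the action to the robot and read the resulting joint state instead of updating a single number, but the four return values and the role of reset stay the same.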

In order to leverage the ROS infrastructure and output the current progress of the training in Gazebo, the following ROS packages are being used:

  1. Orocos Kinematics and Dynamics Library (KDL): this library allows us to load the modular robot model from URDF, construct the kinematic chain, and compute the forward and inverse kinematics. This is useful in our case, where we would like to calculate the reward based on the current end-effector position and the target position.
  2. Appropriate ROS publishers and subscribers: for each modular robot, appropriate publishers and subscribers are incorporated in the environment. The publishers convert the action values from the RL algorithm into joint commands that the ROS infrastructure understands, which allows us to send the actions generated by the RL algorithm to the Gazebo simulator and visualize the current status of the training. The subscribers monitor the joints: by subscribing to the joint_state topic we get the current joint values and calculate the forward kinematics to obtain the current end-effector position of the robot. The reward is then calculated from the difference between the target and the current end-effector position and returned by the step function, which, as mentioned before, the RL algorithm uses in its optimization scheme to advance the training.
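The reward computation described in the list above can be sketched as follows. This is a simplified illustration, not the gym_gazebo code: in gym_gazebo the forward kinematics come from KDL and the robot's URDF, whereas here a hypothetical planar arm with revolute joints stands in for the robot. The function names, the link lengths and the choice of the negative Euclidean distance as the reward are assumptions made for the example.

```python
import math


def forward_kinematics(joint_angles, link_lengths):
    """End-effector (x, y) of a planar serial arm with revolute joints.

    Stand-in for the KDL forward kinematics solver (assumption).
    """
    x = y = 0.0
    cumulative_angle = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        cumulative_angle += angle
        x += length * math.cos(cumulative_angle)
        y += length * math.sin(cumulative_angle)
    return x, y


def distance_reward(joint_angles, link_lengths, target):
    """Negative Euclidean distance between end-effector and target.

    The sign convention (higher is better) is an assumption.
    """
    x, y = forward_kinematics(joint_angles, link_lengths)
    return -math.hypot(target[0] - x, target[1] - y)


if __name__ == "__main__":
    # Joint angles as they might arrive from a joint state message.
    angles = [math.pi / 4, -math.pi / 4]
    print(distance_reward(angles, [1.0, 1.0], target=(1.5, 0.5)))
```

With this convention the reward approaches zero from below as the end-effector gets closer to the target, so maximizing it drives the arm toward the target.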

To be able to call the environment from your training script (which is explained in the next section), you need to register it. This is done in the following two Python scripts:

  1. In the environment folder itself (example).
  2. In the following script, where we define the name of the environment which will be callable from the training script.

First import gym and gym_gazebo:

import gym
import gym_gazebo

Import the necessary dependencies for your RL algorithm, for example: PPO, TRPO, etc.

Make the environment:

env = gym.make('EnvironmentName'), where EnvironmentName is the name we have defined in the script (example).

Optional: depending on your RL algorithm, you may need to create a training session. As an example, in the case of TensorFlow we'd need to call:

Define or call the policy function of the RL algorithm.
Define or call the training function of the RL algorithm.

You are all set! Now you can launch your training script, described in more detail in the following section.

To install gym_gazebo, please refer to our open source implementation and the documentation regarding its installation.

To start training a certain environment in gym_gazebo you need to follow these steps:

  1. cd examples/{environment}, where {environment} is the custom made environment of your robot, or an already existing one such as the turtlebot.
  2. Launch it: python

After this step the environment should be loaded and the training process should start. The training can be visualized in the Gazebo simulator.

import gym
import gym_gazebo

env = gym.make('MARATop3DOF-v0')
env.reset()  # start the episode and get an initial observation
for _ in range(1000):
    env.step(env.action_space.sample())  # take a random action

If you’d like to see some other environments in action, try replacing MARATop3DOF-v0 above with something like MARAVisionOrientCollision-v0, MARAOrientCollision-v0 or MARANoGripper-v0. Each of them includes different properties for training the MARA robot. For example, MARAOrientCollision-v0 takes collisions into account in the reward shaping, so that the learned policy makes the robot execute its trajectory without colliding with surrounding objects.

We’ve been sampling random actions from the environment’s action space. But what are those actions, really? Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations:

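import gym
import gym_gazebo
env = gym.make('MARATop3DOF-v0')
print(env.action_space)       # format of valid actions
print(env.observation_space)  # format of valid observations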

The action space is the desired joint position for each of the six axes of MARA. The observation space consists of the current position and speed of each joint and the difference between the current end-effector position and the desired target. In more advanced environments such as MARAOrientCollision-v0, the difference between the current end-effector orientation and the desired target orientation, represented as quaternions, is also taken into account.
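To make the composition of the observation concrete, here is a small sketch of how such a vector could be assembled. It is an assumption for illustration only: the function name build_observation, the ordering of the components and the use of a flat Python list are hypothetical and are not taken from the gym_gazebo source.

```python
def build_observation(joint_positions, joint_velocities, end_effector, target):
    """Concatenate joint state and end-effector error into one flat list.

    Layout (assumed): positions, then velocities, then target minus
    end-effector position, component by component.
    """
    error = [t - e for t, e in zip(target, end_effector)]
    return list(joint_positions) + list(joint_velocities) + error


if __name__ == "__main__":
    observation = build_observation(
        joint_positions=[0.0, 0.5, -0.5, 0.0, 1.0, 0.0],
        joint_velocities=[0.0] * 6,
        end_effector=[0.25, 0.0, 0.75],
        target=[0.5, 0.0, 1.0],
    )
    print(len(observation))
```

For a six-axis arm and a three-dimensional target this layout gives 6 + 6 + 3 = 15 values; an orientation-aware environment would append further components for the quaternion difference.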