Imitation Learning (IL)

"Learning from demonstrations"
An emerging technique that uses expert data to train a robot to achieve a desired behavior.

Introduction

Reinforcement Learning (RL) has recently gained attention in the robotics field because of its potential to generalize robot behavior across a variety of tasks. However, the results obtained with RL methods depend on reward shaping, which is task dependent, and this restricts their applicability in certain scenarios. Even though new methods have arisen that take a hierarchical approach to learning and generalizing behavior, the dependency on reward shaping remains a challenge when training more complex behaviors.

Imitation Learning (IL) does not require a mathematical expression (a reward) for achieving a certain behavior. Instead, it relies on expert data from which the Neural Network (NN) learns to produce similar behavior for a given task. It thus avoids reward shaping, which in many cases can be a bottleneck. However, the IL approaches presented in the literature rarely use data extracted from a human expert and instead rely heavily on expert data generated by an RL algorithm, which means more time is needed to complete the whole process.

Approaches

IL (also known as apprenticeship learning) consists of training robots to perform a task from demonstrations. The two main approaches to IL are Behavioral Cloning (BC), which learns a policy as a supervised learning problem that maps observations to actions using expert trajectories; and Inverse Reinforcement Learning (IRL), which optimizes a cost function between the trajectories generated by the method and the expert trajectories, which are taken as the ground truth.
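The supervised-learning view of BC can be illustrated with a toy example. The sketch below is not taken from the implementation discussed in this document: the one-dimensional state, the linear policy, and the hypothetical expert rule a = -0.5 * s are all assumptions chosen to keep the example small.

```python
# Behavioral cloning on a toy 1-D problem: fit a linear policy a = w * s + b
# to expert (state, action) pairs by ordinary least squares.
# The "expert" is hypothetical: it always applies a = -0.5 * s.

def expert_action(state):
    return -0.5 * state

def fit_linear_policy(states, actions):
    """Closed-form least squares for a = w * s + b."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b

# Expert demonstrations: states visited by the expert and the actions it took.
states = [-2.0, -1.0, 0.5, 1.0, 3.0]
actions = [expert_action(s) for s in states]

w, b = fit_linear_policy(states, actions)

def cloned_policy(state):
    """Policy obtained purely from demonstrations, with no reward signal."""
    return w * state + b
```

Because the demonstrations here are noise-free and exactly linear, the fit recovers the expert rule. With a real robot, the cloned policy is only reliable near the states that appear in the demonstrations.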

On the one hand, BC is simple, but it tends to succeed only with large amounts of training data because of the compounding error caused by covariate shift. On the other hand, IRL learns a cost function rather than a policy directly. IRL is expensive to run, much like RL, and can also converge to a locally optimal cost. In both cases, the expert data is generated in simulation, using either Trust Region Policy Optimization (TRPO) or Proximal Policy Optimization (PPO).

Leaving BC aside, as it requires more time spent gathering a larger data set, IRL trains a reward function and then uses RL to learn a policy that maximizes that reward. Certain IRL methods are mathematically equivalent to Generative Adversarial Networks (GANs), which is why we use the Generative Adversarial Imitation Learning (GAIL) algorithm to train different types of robots.

GAIL

Generative Adversarial Imitation Learning is a model-free imitation learning algorithm that allows robots to be taught how to perform specific tasks from demonstrations. To do so, a data set called expert data is needed, which represents the task we want the robot to learn. The technique combines GANs and RL.

A GAN is composed of two networks:

• Generator: its purpose is to generate new data similar to the expert one.
• Discriminator: its aim is to be able to distinguish between "real" or expert data and "fake" or the data which comes from the Generator.

Mathematically, the Discriminator D and the Generator G are neural networks that play a minimax game with value function V(D,G). The Generator takes a noise input z from the latent space and transforms it into the form of the expert data we want to imitate, in such a way that it maximizes the probability of creating data recognized as real.

$$\min\limits_{G} \log(1-D(G(z)))$$

The Discriminator takes as input a set of data, either generated, G(z), or real, x, and tries to minimize the probability that data created by the Generator is recognized as real.

$$\max\limits_{D} \log(1-D(G(z)))$$

In short, they play the minimax game described above.

$$\min\limits_{G}\max\limits_{D} V(D,G) = E_{x\sim p_{data}(x)}[\log D(x)] + E_{z\sim p_z(z)}[\log(1-D(G(z)))]$$
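To make the value function concrete, the following sketch estimates V(D,G) from two small batches of samples. The one-dimensional data and both discriminators are hypothetical and chosen only for illustration. The point is that a Discriminator that separates the batches well obtains a higher value than one that cannot tell them apart.

```python
import math

def gan_value(discriminator, real_samples, generated_samples):
    """Monte Carlo estimate of V(D, G) from two batches of samples."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(x)) for x in generated_samples) / len(generated_samples)
    return real_term + fake_term

# Hypothetical 1-D data: expert samples near 1.0, generated samples near -1.0.
real_samples = [0.9, 1.0, 1.1]
generated_samples = [-1.1, -1.0, -0.9]

def sharp_discriminator(x):
    """Logistic classifier that separates the two clusters well."""
    return 1.0 / (1.0 + math.exp(-5.0 * x))

def blind_discriminator(x):
    """Outputs 0.5 everywhere, i.e. it cannot tell the batches apart."""
    return 0.5

v_sharp = gan_value(sharp_discriminator, real_samples, generated_samples)
v_blind = gan_value(blind_discriminator, real_samples, generated_samples)
```

The Discriminator tries to push this value up, while the Generator tries to produce samples that pull it back down.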

Here, the optimal D for a given G, which maximizes V(D,G), can be expressed in terms of the expert data distribution over x, p$_{data}$, and the Generator's data distribution over x, p$_g$.

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

$$\max\limits_D V(D,G) = E_{x\sim p_{data}}[\log D^*_G(x)] + E_{x\sim p_g}[\log(1 - D^*_G(x))]$$
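The closed form for the optimal Discriminator can be checked numerically at a single point x. In the sketch below, the two density values are hypothetical. The contribution of x to V(D,G) is evaluated on a grid of candidate Discriminator outputs and the best one is compared with the formula.

```python
import math

def pointwise_value(d, p_data, p_g):
    """Contribution of a single x to V(D, G) when D(x) = d."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

def optimal_discriminator(p_data, p_g):
    """Closed-form maximizer p_data / (p_data + p_g)."""
    return p_data / (p_data + p_g)

# Hypothetical density values of the two distributions at one point x.
p_data, p_g = 0.3, 0.1

d_star = optimal_discriminator(p_data, p_g)

# Brute-force search over candidate outputs in the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
d_best = max(grid, key=lambda d: pointwise_value(d, p_data, p_g))
```

Up to the grid resolution, the brute-force maximizer agrees with the closed form.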

Once the Generator is trained to optimality, p$_g$ approaches p$_{data}$ until p$_g$=p$_{data}$, which means the data created by the Generator follows the same distribution as the expert data.
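As a quick check of this limit, substituting p$_g$=p$_{data}$ into the optimal Discriminator and into the maximized value gives:

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{data}(x)} = \frac{1}{2}$$

$$\max\limits_D V(D,G) = E_{x\sim p_{data}}\left[\log \frac{1}{2}\right] + E_{x\sim p_g}\left[\log \frac{1}{2}\right] = -\log 4$$

In other words, at the optimum the Discriminator can do no better than guessing.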

We use gradient-based optimization to update the parameters of the Generator and the Discriminator. Policy gradient methods converge to a local minimum under suitable conditions, but the gradient estimate can have high variance, so small steps are required. That is why an RL algorithm is needed.

In this case, a TRPO step helps by preventing the policy from changing too much because of noise in the policy gradient. In other words, it restricts the optimization of the Generator so that the Discriminator can optimize itself at the same pace as the Generator.
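The idea of limiting how far the policy may move can be sketched with a KL-divergence check. This is a deliberately simplified illustration and not the TRPO update itself, which uses a conjugate-gradient step and a line search over a batch of states. The Gaussian policy, the helper names, and the shrink factor are assumptions. The threshold 0.01 matches the max_kl value used in the training call in this document.

```python
import math

MAX_KL = 0.01  # same threshold as max_kl in the training call of this document

def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        total += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5
    return total

def backtrack_step(mu_old, std, full_step, max_kl=MAX_KL, shrink=0.5, max_tries=10):
    """Shrink a proposed update of the policy mean until the KL constraint holds."""
    scale = 1.0
    for _ in range(max_tries):
        mu_new = [m + scale * d for m, d in zip(mu_old, full_step)]
        if gaussian_kl(mu_old, std, mu_new, std) <= max_kl:
            return mu_new, scale
        scale *= shrink
    return list(mu_old), 0.0  # no acceptable step found: keep the old policy

# Hypothetical 3-D action distribution and a proposed (too large) update.
mu_old = [0.0, 0.0, 0.0]
std = [1.0, 1.0, 1.0]
full_step = [1.0, 0.0, 0.0]

mu_new, accepted_scale = backtrack_step(mu_old, std, full_step)
```

The full step would change the policy too much, so it is shrunk until the new policy stays inside the trust region.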

To summarize, GAIL tries to find a saddle point (π,D) of the expression below by fitting a parameterized policy π, with weights θ, and a Discriminator D, with weights w. It alternates between an Adam step on w to increase the value with respect to D and a TRPO step on θ to decrease the value with respect to π.

$$E_{\pi}[\log(D(s,a))] + E_{\pi_E}[\log(1-D(s,a))] - \lambda H(\pi)$$

D can be interpreted as a local cost function c(s,a) that provides a learning signal to the policy.
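A minimal sketch of this interpretation follows, under the sign convention of the expression above, where D(s,a) is pushed towards 1 on policy samples and towards 0 on expert samples. The function names and the eps stabilizer are assumptions. Implementations may label the two classes the other way round, in which case the sign of the reward flips accordingly.

```python
import math

def local_cost(d_value, eps=1e-8):
    """c(s, a) = log D(s, a): high when the pair is easily flagged as policy-generated."""
    return math.log(d_value + eps)

def surrogate_reward(d_value, eps=1e-8):
    """Reward handed to the RL step: the negative of the local cost."""
    return -local_cost(d_value, eps)

# Hypothetical Discriminator outputs for two state-action pairs.
looks_like_expert = 0.1    # D is fairly sure the pair came from the expert
looks_like_policy = 0.9    # D is fairly sure the pair came from the policy

r_expert_like = surrogate_reward(looks_like_expert)
r_policy_like = surrogate_reward(looks_like_policy)
```

The policy is therefore rewarded for producing state-action pairs that the Discriminator mistakes for expert data.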

The pseudocode presented gives an overview of the GAIL implementation used in our evaluation.

Environments

For the experiments we use the Gazebo simulator and ROS to create our own environments, through a framework that extends OpenAI Gym for robotics. The code is available at erlerobot/gym-gazebo.

To evaluate the algorithms presented above, we added environments that model real robots: a 3 Degrees of Freedom (DoF) SCARA robot and a 6 DoF MARA robot.

The robots we used are listed below with their corresponding observation and action spaces (the number is the dimension of a continuous space).

| Robot name | Observation space | Action space |
|---|---|---|
| SCARA | 9 (continuous) | 3 (continuous) |
| MARA | 12 (continuous) | 6 (continuous) |

Experiments and examples

We used the corresponding ROS packages to convert the actions generated by the RL and GAIL algorithms into appropriate trajectories that each robot can execute to reach a given target in its workspace. All the trajectories are saved to files and stored.

We have implemented a training script to launch the training. For example:

Import the necessary dependencies for training with IL:

import gym
import gym_gazebo

import argparse
import os.path as osp
import logging
from mpi4py import MPI

import numpy as np
from tqdm import tqdm

from baselines.gail import mlp_policy
from baselines.common import set_global_seeds, tf_util as U
from baselines.common.misc_util import boolean_flag
from baselines import logger
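from baselines.gail.dataset.h_ros_dset import H_ros_Dset
# TransitionClassifier is used in main(); module path assumed to match OpenAI baselines
from baselines.gail.adversary import TransitionClassifier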
import os

Load the saved expert data, previously acquired as described above:

def argsparser():
    log_path = "/path_where_to_log"
    checkpoint_path = "/path_to_save_checkpoints"

Extra arguments for the hyperparameters and for defining the environment:

    parser = argparse.ArgumentParser("Tensorflow Implementation of GAIL")
    parser.add_argument('--checkpoint_dir', help='the directory to save model', default=checkpoint_path)
    parser.add_argument('--log_dir', help='the directory to save log file', default=log_path)

    # Optimization Configuration
    parser.add_argument('--g_step', help='number of steps to train policy in each epoch', type=int, default=3)
    parser.add_argument('--d_step', help='number of steps to train discriminator in each epoch', type=int, default=1)
    # Network Configuration (Using MLP Policy)
    # Algorithms Configuration
    parser.add_argument('--policy_entcoeff', help='entropy coefficient of policy', type=float, default=0)
    # Training Configuration
    parser.add_argument('--save_per_iter', help='save model every xx iterations', type=int, default=1)
    # argparse does not apply type= to non-string defaults, so pass an int explicitly
    parser.add_argument('--num_timesteps', help='number of timesteps per episode', type=int, default=int(5e6))  # max timesteps

    return parser.parse_args()
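One detail of argparse is worth keeping in mind when defining numeric hyperparameters: the type conversion is applied to values given on the command line and to string defaults, but a non-string default is used as is. The small demo below uses hypothetical argument names that are not part of the training script.

```python
import argparse

parser = argparse.ArgumentParser("argparse default-conversion demo")
# A float default is kept as a float even though type=int is given.
parser.add_argument('--float_default', type=int, default=5e6)
# Converting the default explicitly yields an int.
parser.add_argument('--int_default', type=int, default=int(5e6))
# A string default is parsed through type=, just like a command-line value.
parser.add_argument('--str_default', type=int, default='5000000')

# Parse an empty argument list so that only the defaults are used.
args = parser.parse_args([])
```

A timestep budget written as 5e6 therefore stays a float unless it is converted explicitly or overridden on the command line.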

Once you have defined all the hyperparameters and the environment you want to train in, open a TensorFlow session and define the policy function:

def get_task_name(args):

def main(args):
    U.make_session(num_cpu=1).__enter__()
    set_global_seeds(args.seed)
    env = gym.make(args.env_id)

    def policy_fn(name, ob_space, ac_space, reuse=False):
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                    reuse=reuse, hid_size=args.policy_hidden_size, num_hid_layers=2)
    env.seed(args.seed)
    gym.logger.setLevel(logging.WARN)
    logger.configure(os.path.abspath(args.log_dir))

    dataset = H_ros_Dset(expert_path=args.expert_path, traj_limitation=args.traj_limitation)
    reward_giver = TransitionClassifier(env, args.adversary_hidden_size, entcoeff=args.adversary_entcoeff)

Define the training function, which calls the learn function of the GAIL algorithm and sets some seeding parameters:

def train(env, seed, policy_fn, reward_giver, dataset, algo,
          g_step, d_step, policy_entcoeff, num_timesteps, save_per_iter,
          checkpoint_dir, log_dir, pretrained, BC_max_iter, robot_name, task_name=None):

    pretrained_weight = None

    from baselines.gail import trpo_mpi_local
    # Set up for MPI seed
    rank = MPI.COMM_WORLD.Get_rank()
    if rank != 0:
        logger.set_level(logger.DISABLED)
    seed = 0
    env.seed(seed)
    set_global_seeds(seed)
    trpo_mpi_local.learn(env, policy_fn, reward_giver, dataset, rank,
                         pretrained=pretrained, pretrained_weight=pretrained_weight,
                         g_step=g_step, d_step=d_step,
                         entcoeff=policy_entcoeff,
                         max_timesteps=num_timesteps,
                         ckpt_dir=checkpoint_dir, log_dir=log_dir,
                         save_per_iter=save_per_iter,
                         timesteps_per_batch=1024,
                         max_kl=0.01, cg_iters=10, cg_damping=0.1,
                         gamma=0.995, lam=0.97,
                         vf_iters=5, vf_stepsize=1e-3,
                         task_name=task_name, robot_name=robot_name)

    train(env,
          args.seed,
          policy_fn,
          reward_giver,
          dataset,
          args.algo,
          args.g_step,
          args.d_step,
          args.policy_entcoeff,
          args.num_timesteps,
          args.save_per_iter,
          args.checkpoint_dir,
          args.log_dir,
          args.pretrained,
          args.BC_max_iter,
          args.robot_name,