Introducing a New Framework for Flexible and Reproducible Reinforcement Learning Research



Reinforcement learning (RL) research has seen a number of significant advances over the past few years. These advances have allowed agents to play games at a super-human level; notable examples include DeepMind’s DQN on Atari games, AlphaGo and AlphaGo Zero, and OpenAI Five. Specifically, the introduction of replay memories in DQN enabled agents to leverage their previous experience, large-scale distributed training spread the learning process across multiple workers, and distributional methods allowed agents to model full distributions of returns, rather than simply their expected values, to learn a more complete picture of their world. This type of progress is important, as the algorithms yielding these advances are also applicable to other domains, such as robotics (see our recent work on robotic manipulation and teaching robots to visually self-adapt).
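To make the replay memory idea concrete, here is a minimal sketch of a fixed-capacity buffer with uniform sampling. The class and field names (`ReplayMemory`, `Transition`) are hypothetical and chosen for illustration only; this is not the implementation used by DQN or by our framework.

```python
import random
from collections import deque, namedtuple

# Hypothetical record type for one step of agent experience.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayMemory:
    """Fixed-capacity FIFO store of past transitions (illustrative sketch)."""

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement over the stored transitions.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


# Usage: store five transitions in a buffer that only holds the latest three.
memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    memory.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = memory.sample(batch_size=2)
```

Sampling old transitions at random, rather than learning only from the most recent one, is what lets an agent reuse its previous experience many times.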

Quite often, developing these kinds of advances requires quickly iterating over a design, often with no clear direction, and disrupting the structure of established methods. However, most existing RL frameworks do not provide the combination of flexibility and stability that enables researchers to iterate on RL methods effectively, and thus to explore new research directions that may not have immediately obvious benefits. Further, reproducing the results from existing frameworks is often too time-consuming, which can lead to scientific reproducibility issues down the line.

Today we’re introducing a new TensorFlow-based framework that aims to provide flexibility, stability, and reproducibility for new and experienced RL researchers alike. Inspired by one of the main components in reward-motivated behaviour in the brain and reflecting the strong historical connection between neuroscience and reinforcement learning research, this platform aims to enable the kind of speculative research that can drive radical discoveries. This release also includes a set of colabs that clarify how to use our framework.

Ease of Use
Clarity and simplicity are two key considerations in the design of this framework. The code we provide is compact (about 15 Python files) and is well-documented. This is achieved by focusing on the Arcade Learning Environment (a mature, well-understood benchmark), and four value-based agents: DQN, C51, a carefully curated simplified variant of the Rainbow agent, and the Implicit Quantile Network agent, which was presented only last month at the International Conference on Machine Learning (ICML). We hope this simplicity makes it easy for researchers to understand the inner workings of the agent and to quickly try out new ideas.

Reproducibility
We are particularly sensitive to the importance of reproducibility in reinforcement learning research. To this end, we provide our code with full test coverage; these tests also serve as an additional form of documentation. Furthermore, our experimental framework follows the recommendations given by Machado et al. (2018) on standardizing empirical evaluation with the Arcade Learning Environment.
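One of the recommendations in Machado et al. (2018) is to make the environment stochastic with "sticky actions": with some probability (0.25 in the paper), the environment repeats the previously executed action instead of the agent's new one. The sketch below illustrates the idea with a wrapper applied once per agent step, which is a simplification. The names `StickyActionEnv` and `EchoEnv` and the `reset`/`step` interface are assumptions made for this example, not the API of our framework or of the Arcade Learning Environment.

```python
import random


class StickyActionEnv:
    """Wraps an environment so the previous action is sometimes repeated."""

    def __init__(self, env, repeat_prob=0.25, seed=None):
        self._env = env
        self._repeat_prob = repeat_prob
        self._rng = random.Random(seed)
        self._last_action = None

    def reset(self):
        self._last_action = None
        return self._env.reset()

    def step(self, action):
        # With probability repeat_prob, ignore the requested action and
        # execute the previously executed one instead.
        if (self._last_action is not None
                and self._rng.random() < self._repeat_prob):
            action = self._last_action
        self._last_action = action
        return self._env.step(action)


class EchoEnv:
    """Toy stand-in environment that returns the action it executed."""

    def reset(self):
        return 0

    def step(self, action):
        return action


# Usage: request alternating actions and count how often a different
# action was executed instead of the requested one.
env = StickyActionEnv(EchoEnv(), repeat_prob=0.25, seed=0)
env.reset()
requested = [t % 2 for t in range(1000)]
executed = [env.step(a) for a in requested]
overridden = sum(r != e for r, e in zip(requested, executed))
```

The point of this stochasticity is that an agent cannot score well simply by memorizing one fixed sequence of actions.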

Benchmarking
It is important for new researchers to be able to quickly benchmark their ideas against established methods. As such, we are providing the full training data of the four provided agents across the 60 games supported by the Arcade Learning Environment. The data is available as Python pickle files (for agents trained with our framework) and as JSON data files (for comparison with agents trained in other frameworks). We additionally provide a website where you can quickly visualize the training runs for all provided agents on all 60 games. Below we show the training runs for our four agents on Seaquest, one of the Atari 2600 games supported by the Arcade Learning Environment.
The training runs for our four agents on Seaquest. The x-axis represents iterations, where each iteration is 1 million game frames (4.5 hours of real-time play); the y-axis is the average score obtained per play. The shaded areas show confidence intervals from 5 independent runs.
We are also providing the trained deep networks from these agents, the raw statistics logs, and the TensorFlow event files for plotting with TensorBoard. These can all be found in the downloads section of our site.
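As a rough illustration of how curves like the ones above can be produced from per-run data, the sketch below averages five runs per iteration and computes a confidence interval. Everything specific here is an assumption: the JSON layout (`run_1` ... `run_5`, each a list of per-iteration scores), the made-up numbers, the 60 frames-per-second rate used to convert frames to hours, and the choice of a 95% Student-t interval. The actual data files and the interval shown in the plot may differ.

```python
import json
import math
import statistics

FRAMES_PER_ITERATION = 1_000_000  # stated in the figure caption
ASSUMED_FPS = 60                  # assumption: real-time Atari 2600 frame rate
T_CRITICAL_95_DF4 = 2.776         # two-sided 95% Student-t value, 4 d.o.f.


def hours_of_play(iterations):
    """Converts training iterations to hours of real-time play."""
    return iterations * FRAMES_PER_ITERATION / ASSUMED_FPS / 3600


def mean_and_interval(scores):
    """Returns (mean, low, high) for exactly five independent runs."""
    if len(scores) != 5:
        raise ValueError("the hard-coded t value assumes exactly 5 runs")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRITICAL_95_DF4 * stderr
    return mean, mean - half_width, mean + half_width


# Made-up scores in a hypothetical layout: one list of per-iteration
# average scores for each of five runs.
raw = json.loads("""
{
  "run_1": [100.0, 400.0],
  "run_2": [110.0, 420.0],
  "run_3": [90.0, 380.0],
  "run_4": [105.0, 410.0],
  "run_5": [95.0, 390.0]
}
""")

# Transpose so each entry holds the five runs' scores at one iteration.
per_iteration = [mean_and_interval(list(col)) for col in zip(*raw.values())]

for i, (mean, low, high) in enumerate(per_iteration, start=1):
    print(f"iteration {i} (~{hours_of_play(i):.1f} h): "
          f"{mean:.1f} [{low:.1f}, {high:.1f}]")
```

At an assumed 60 frames per second, one iteration works out to a little over 4.6 hours, which is in line with the roughly 4.5 hours quoted in the caption.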

Our hope is that our framework’s flexibility and ease of use will empower researchers to try out new ideas, both incremental and radical. We are already actively using it for our research and finding that it gives us the flexibility to iterate quickly over many ideas. We’re excited to see what the larger community can make of it. Check it out at our GitHub repo, play with it, and let us know what you think!

Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Marc G. Bellemare, Pablo Samuel Castro, Carles Gelada, Subhodeep Moitra and Saurabh Kumar. We also extend special thanks to Sergio Guadarrama, Ofir Nachum, Yifan Wu, Clare Lyle, Liam Fedus, Kelvin Xu, Emilio Parisotto, Hado van Hasselt, Georg Ostrovski and Will Dabney, and the many people at Google who helped us test it out.