Long-Range Robotic Navigation via Automated Reinforcement Learning

Aleksandra Faust, Senior Research Scientist and Anthony Francis, Senior Software Engineer, Robotics at Google

In the United States alone, there are 3 million people with a mobility impairment that prevents them from ever leaving their homes. Service robots that can autonomously navigate long distances can improve the independence of people with limited mobility, for example, by bringing them groceries, medicine, and packages. Research has demonstrated that deep reinforcement learning (RL) is good at mapping raw sensory input to actions, for example in learning to grasp objects and in robot locomotion, but RL agents usually lack the understanding of large physical spaces needed to safely navigate long distances without human help and to easily adapt to new spaces.

In three recent papers, “Learning Navigation Behaviors End-to-End with AutoRL,” “PRM-RL: Long-Range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning”, and “Long-Range Indoor Navigation with PRM-RL”, we investigate easy-to-adapt robotic autonomy by combining deep RL with long-range planning. We train local planner agents to perform basic navigation behaviors, traversing short distances safely without collisions with moving obstacles. The local planners take noisy sensor observations, such as a 1D lidar that provides distances to obstacles, and output linear and angular velocities for robot control. We train the local planner in simulation with AutoRL, a method that automates the search for RL reward and neural network architecture. Despite their limited range of 10 – 15 meters, the local planners transfer well to both real robots and to new, previously unseen environments. This enables us to use them as building blocks for navigation in large spaces. We then build a roadmap, a graph where nodes are locations and edges connect the nodes only if local planners, which mimic real robots well with their noisy sensors and control, can traverse between them reliably.

Automating Reinforcement Learning (AutoRL)
In our first paper, we train the local planners in small, static environments. However, training with standard deep RL algorithms, such as Deep Deterministic Policy Gradient (DDPG), poses several challenges. For example, the true objective of the local planners is to reach the goal, which is a sparse reward. In practice, this forces researchers to spend significant time iterating and hand-tuning the rewards. Researchers must also make decisions about the neural network architecture, without clear accepted best practices. And finally, algorithms like DDPG are unstable learners and often exhibit catastrophic forgetting.

To overcome those challenges, we automate the deep RL training. AutoRL is an evolutionary automation layer around deep RL that searches for a reward and neural network architecture using large-scale hyperparameter optimization. It works in two phases: reward search and neural network architecture search. During the reward search, AutoRL trains a population of DDPG agents concurrently over several generations, each with a slightly different reward function optimizing for the local planner’s true objective: reaching the destination. At the end of the reward search phase, we select the reward that leads agents to their destination most often. In the neural network architecture search phase, we repeat the process, this time using the selected reward and tuning the network layers, optimizing for the cumulative reward.
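The two phases can be illustrated with a toy sketch. This is not the AutoRL implementation: the scoring functions below are hypothetical stand-ins for training and evaluating a DDPG agent, the mutation rules are invented for illustration, and candidates are scored one after another rather than concurrently. It only shows the shape of the procedure, where phase one picks a reward by success rate and phase two tunes layer widths under that fixed reward.

```python
import random


def evolutionary_search(initial, mutate, score, generations, population_size, rng):
    """Keep the best-scoring candidate seen across all generations."""
    best, best_score = initial, score(initial)
    for _ in range(generations):
        candidates = [mutate(best, rng) for _ in range(population_size)]
        for candidate in candidates:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


def toy_success_rate(reward_weights):
    # Hypothetical proxy for "how often agents trained with this reward reach the goal".
    return 1.0 / (1.0 + sum((w - t) ** 2 for w, t in zip(reward_weights, (1.0, 0.5))))


def toy_cumulative_reward(layer_widths, reward_weights):
    # Hypothetical proxy for the cumulative reward of a trained network.
    return sum(reward_weights) - sum(abs(width - 64) for width in layer_widths) / 64.0


def mutate_weights(weights, rng):
    return tuple(w + rng.gauss(0.0, 0.1) for w in weights)


def mutate_widths(widths, rng):
    return tuple(max(16, w + rng.choice((-16, 0, 16))) for w in widths)


rng = random.Random(0)

# Phase 1: reward search, scored by the true objective (reaching the goal).
initial_weights = (0.0, 0.0)
best_weights, success = evolutionary_search(
    initial_weights, mutate_weights, toy_success_rate, 10, 20, rng)

# Phase 2: architecture search with the selected reward held fixed.
initial_widths = (16, 16)
best_widths, reward = evolutionary_search(
    initial_widths, mutate_widths,
    lambda widths: toy_cumulative_reward(widths, best_weights), 10, 20, rng)

print("selected reward weights:", best_weights)
print("selected layer widths:", best_widths)
```

The split matters because the two phases optimize different quantities: the sparse true objective is used only to choose the reward, and the denser cumulative reward is then used to compare architectures.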

Automating reinforcement learning with reward and neural network architecture search.

However, this iterative process means AutoRL is not sample efficient. Training one agent takes 5 million samples; AutoRL training over 10 generations of 100 agents requires 5 billion samples, equivalent to 32 years of training! The benefit is that after AutoRL the manual training process is automated, and DDPG does not experience catastrophic forgetting. Most importantly, the resulting policies are higher quality: AutoRL policies are robust to sensor, actuator and localization noise, and generalize well to new environments. Our best policy is 26% more successful than other navigation methods across our test environments.
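The sample budget above is simple arithmetic, sketched here for reference. The sampling rate is not stated in the post; the last function only back-computes the rate that the 32-year figure would imply, so treat that number as an inference.

```python
# Figures quoted in the text above.
SAMPLES_PER_AGENT = 5_000_000
GENERATIONS = 10
AGENTS_PER_GENERATION = 100
QUOTED_YEARS = 32


def total_samples(samples_per_agent, generations, agents_per_generation):
    """Total environment samples consumed by the whole search."""
    return samples_per_agent * generations * agents_per_generation


def implied_samples_per_second(samples, years):
    """Sampling rate implied by spreading the samples over the given years."""
    seconds = years * 365.25 * 24 * 3600
    return samples / seconds


total = total_samples(SAMPLES_PER_AGENT, GENERATIONS, AGENTS_PER_GENERATION)
rate = implied_samples_per_second(total, QUOTED_YEARS)
print(f"{total:,} samples in total")
print(f"implied rate: {rate:.2f} samples per second")
```

Under these assumptions the 32-year figure corresponds to roughly five samples per second of experience.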

AutoRL (red) success over short distances (up to 10 meters) in several unseen buildings. Compared to hand-tuned DDPG (dark-red), artificial potential fields (light blue), dynamic window approach (blue), and behavior cloning (green).
AutoRL local planner policy transfer to robots in real, unstructured environments

While these policies only perform local navigation, they are robust to moving obstacles and transfer well to real robots, even in unstructured environments. Though they were trained in simulation with only static obstacles, they can also handle moving objects effectively. The next step is to combine the AutoRL policies with sampling-based planning to extend their reach and enable long-range navigation.

Achieving Long Range Navigation with PRM-RL
Sampling-based planners tackle long-range navigation by approximating robot motions. For example, probabilistic roadmaps (PRMs) sample robot poses and connect them with feasible transitions, creating roadmaps that capture valid movements of a robot across large spaces. In our second paper, which won Best Paper in Service Robotics at ICRA 2018, we combine PRMs with hand-tuned RL-based local planners (without AutoRL) to train robots once locally and then adapt them to different environments.

First, for each robot we train a local planner policy in a generic simulated training environment. Next, we build a PRM with respect to that policy, called a PRM-RL, over a floor plan for the deployment environment. The same floor plan can be used for any robot we wish to deploy in the building, with a one-time setup for each robot and environment pair.

To build a PRM-RL we connect sampled nodes only if the RL-based local planner, which represents robot noise well, can reliably and consistently navigate between them. This is done via Monte Carlo simulation. The resulting roadmap is tuned to both the abilities and geometry of the particular robot. Roadmaps for robots with the same geometry but different sensors and actuators will have different connectivity. Since the agent can navigate around corners, nodes without clear line of sight can be included, whereas nodes near walls and obstacles are less likely to be connected into the roadmap because of sensor noise. At execution time, the RL agent navigates from roadmap waypoint to waypoint.
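A minimal sketch of the Monte Carlo edge test follows. The `simulated_rollout` function is a hypothetical stand-in for running the trained local planner in a noisy simulator, and its success model is invented for illustration. The defaults of 20 trials and a 90% success threshold are the values this post reports for its evaluation.

```python
import math
import random


def simulated_rollout(start, goal, rng, max_range=10.0):
    """Hypothetical stand-in for one noisy simulated run of the local planner.

    Success probability falls off with distance; beyond max_range it always fails.
    """
    distance = math.dist(start, goal)
    if distance > max_range:
        return False
    success_probability = 1.0 - 0.5 * distance / max_range
    return rng.random() < success_probability


def edge_is_reliable(start, goal, rollout, rng, trials=20, min_success=0.9):
    """Accept an edge only if enough repeated rollouts reach the goal."""
    successes = sum(1 for _ in range(trials) if rollout(start, goal, rng))
    return successes / trials >= min_success


def build_roadmap(nodes, rollout, rng, trials=20, min_success=0.9):
    """Return index pairs (i, j) for every node pair the planner traverses reliably."""
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if edge_is_reliable(nodes[i], nodes[j], rollout, rng, trials, min_success):
                edges.append((i, j))
    return edges


rng = random.Random(0)
nodes = [(0.0, 0.0), (0.5, 0.0), (100.0, 0.0)]
print(build_roadmap(nodes, simulated_rollout, rng))
```

Because the acceptance test runs the planner itself rather than a straight-line collision check, swapping in a different rollout function (a different robot, sensor, or policy) changes which edges survive, which is why the roadmap ends up specific to the robot.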

Roadmap being built with 3 Monte Carlo simulations per randomly selected node pair.
The largest map was 288 meters by 163 meters and contained almost 700,000 edges, collected over 4 days using 300 workers in a cluster and requiring 1.1 billion collision checks.

The third paper makes several improvements over the original PRM-RL. First, we replace the hand-tuned DDPG with AutoRL-trained local planners, which results in improved long-range navigation. Second, we add Simultaneous Localization and Mapping (SLAM) maps, which robots use at execution time, as a source for building the roadmaps. Because SLAM maps are noisy, this change closes the “sim2real gap”, a phenomenon in robotics where simulation-trained agents significantly underperform when transferred to real robots. Our simulated success rates are the same as in on-robot experiments. Last, we add distributed roadmap building, resulting in very large scale roadmaps containing up to 700,000 nodes.

We evaluated the method using our AutoRL agent, building roadmaps using the floor maps of offices up to 200x larger than the training environments, accepting edges with at least 90% success over 20 trials. We compared PRM-RL to a variety of different methods over distances up to 100m, well beyond the local planner range. PRM-RL had 2 to 3 times the rate of success over baseline because the nodes were connected appropriately for the robot’s capabilities.

Navigation success rates over 100 meters in several buildings. First paper, AutoRL local planner only (blue); original PRMs (red); path-guided artificial potential fields (yellow); second paper (green); third paper, PRMs with AutoRL (orange).

We tested PRM-RL on multiple real robots and real building sites. One set of tests is shown below; the robot is very robust except near cluttered areas and off the edge of the SLAM map.

On-robot experiments

Autonomous robot navigation can significantly improve the independence of people with limited mobility. We can achieve this by developing easy-to-adapt robotic autonomy, including methods that can be deployed in new environments using information that is already available. This is done by automating the learning of basic, short-range navigation behaviors with AutoRL and using these learned policies in conjunction with SLAM maps to build roadmaps. These roadmaps consist of nodes connected by edges that robots can traverse consistently. The result is a policy that once trained can be used across different environments and can produce a roadmap custom-tailored to the particular robot.

The research was done by, in alphabetical order, Hao-Tien Lewis Chiang, James Davidson, Aleksandra Faust, Marek Fiser, Anthony Francis, Jasmine Hsu, J. Chase Kew, Tsang-Wei Edward Lee, Ken Oslund, Oscar Ramirez from Robotics at Google and Lydia Tapia from University of New Mexico. We thank Alexander Toshev, Brian Ichter, Chris Harris, and Vincent Vanhoucke for helpful discussions.


DAM TOYS x AX2 STUDIO INU & SARU 1/6th scale Ninkyo – Seiji 12-inch action figure

Pre-order the DAMTOYS x AX2 STUDIO 1/6 Ninkyo Seiji figure from KGHobby.

“A man who does not keep his words, he is not a man” by Seiji / Ninkyo from the Gi Clan of the Inu Tribe

Seiji is a descendant of the Gi Warrior Clan. Because of his superior combat skills, he was hired to do debt collection for Shenjin Gumi. Seiji has an unusual hobby: he likes to collect weapons from his opponents. He is also known as “Ninkyo Seiji” for fighting for the weak.

Additional Info:
Ninkyo – A yakuza who is willing to help the weak against the strong.
The Eight Extended Clans – The eight warriors of the legendary Inu Tribe, Gi Clan is one of them.
Shenjin Gumi – Name of a yakuza gang. Its members come from both tribes, and most of them are people who were abandoned or not accepted. To some degree, it is one of the few organizations that allows the two communities to coexist peacefully.

DAM TOYS x AX2 STUDIO INU & SARU 1/6th scale Ninkyo – Seiji 12-inch action figure Items List: Head Sculpt, 12-inch Body, Costume, Scarf, Samurai Sword, Ancient Blade x2, Dagger x3, Metal Necklace, Mask, Interchangeable Hand x4



Virtual Reality is Here

A little different lead this time out… so at the IBS show this past week one of the bigger plays was showing Virtual and Augmented Reality (VR/AR) in action. I have been following this space for our world for a while and I am a huge fan of the potential. The issue is the costs are so far in the stratosphere that putting this into play is just not realistic at this point. But someday it’s going to be and when that happens, it will be a game changer. The technology is incredible and it will allow the potential customer to see the end product in place and in action and that excites me. I thought about this also during a recent trip to Florida. I did the “Void” at Disney Springs and it is a truly immersive VR experience. In this case I was in the middle of a Star Wars battle and I will tell you… It. Was. Real. The detail was incredible. Between that and seeing some of the tests with building products, I just can’t wait. (Note there is a Void in Las Vegas, I may have to sneak to that before or after BEC)

If you’ve experienced the Void or have thoughts on VR or AR in our world, please drop me a line…


–  Last week I covered the beauty of the new Glass Magazine. This week it’s time to look at the brains of it and also give out my “ad of the month.” The articles this month feature detailed and intense reporting. The World of Glass and Industry Forecast pieces were loaded with insight and I loved the “Trendhunter” piece by Ron Crowl of Fenetech. I could see that series becoming a favorite of mine. All in all a loaded issue content wise!

–  As for the ad of the month, I mentioned last week the ads just “popped” more and so this month it was tough to pick just one winner- so I have a few. I really liked Swisspacer’s piece of cold and warm- really excellent use of an image and story. I have no idea who deserves the credit from Swisspacer so please if you are reading this and you know the person- pass along my kudos! I liked HHH’s new ad- it used color and image perfectly and jumped off the page. Props to you Melissa Blank and Mike Synon! FuseRocket from DFI was a fabulous ad. Stopped me in my tracks. Syndi Sim well done! Last but not least the creative Q&A from Consolidated Glass Holdings caught me perfectly. Great work by Angela Beach from CGH, who nailed it on a cool 2 page spread. Overall this may have been the best magazine for ads I have ever seen. May it continue!

–  The latest Architectural Billings Index came out and reported a huge positive jump last month. A massive score of 55.3 was posted and folks I will tell you I am stunned.  That is an amazing and unexpected report.  Obviously I’m thrilled with it, it’s the highest total in quite a while, but I surely did not expect it.  Let’s keep rolling.

–  Fellow road warriors… how about these new airplane seats?  I am always seeing these articles and wonder which major airline will jump in and take a chance with something radically different.
–  A list of the richest cities in the US came out this week… and you know me, I love lists like this… the rankings are pretty static from last year but a few things jumped out. Rumson, NJ jumped 19 spots this year to become the 19th richest city in the US. Which one of you glass superstars moved in there? (I have never heard of that area.) I was surprised to see my former home of Ohio have a few spots in the top 50, including Village of Indian Hill at #11. I was also surprised that Malibu, CA was at 43rd; I always assumed it would be higher. Anyway the entire top 50 is HERE.
–  Last this week… a hearty congratulations to John Wheaton.  25 years ago this past week John started his incredible firm Wheaton Sprague.  John is a wonderful, smart and positive voice in our industry and I am happy to see him celebrate the silver anniversary.  Well done John and here’s to many many more successful years ahead!


–  Interesting take on the whole Amazon HQ 2 adventure.
–  I have used “hidden city” airline sites before but I have been wimping out on booking tickets- and now I am glad I did. Story is one to watch here.

–  I am not a scientist or a beer drinker-  but can this be real?  Seems very far fetched.

Coming in March to Netflix… “The Dirt” which is based on the awesome book of the same name about the music group Motley Crue.   The book was incredible- how will the movie be?  I am pumped to see it….


Learning to Generalize from Sparse and Underspecified Rewards

Posted by Rishabh Agarwal, Google AI Resident and Mohammad Norouzi, Research Scientist

Reinforcement learning (RL) presents a unified and flexible framework for optimizing goal-oriented behavior, and has enabled remarkable success in addressing challenging tasks such as playing video games, continuous control, and robotic learning. The success of RL algorithms in these application domains often hinges on the availability of high-quality and dense reward feedback. However, broadening the applicability of RL algorithms to environments with sparse and underspecified rewards is an ongoing challenge, requiring a learning agent to generalize (i.e., learn the right behavior) from limited feedback. A natural way to investigate the performance of RL algorithms in such problem settings is via language understanding tasks, where an agent is provided with a natural language input and needs to generate a complex response to achieve a goal specified in the input, while only receiving binary success-failure feedback.

For instance, consider a “blind” agent tasked with reaching a goal position in a maze by following a sequence of natural language commands (e.g., “Right, Up, Up, Right”). Given the input text, the agent (green circle) needs to interpret the commands and take actions based on such interpretation to generate an action sequence (a). The agent receives a reward of 1 if it reaches the goal (red star) and 0 otherwise. Because the agent doesn’t have access to any visual information, the only way for the agent to solve this task and generalize to novel instructions is by correctly interpreting the instructions.

In this instruction-following task, the action trajectories a1, a2 and a3 reach the goal, but the sequences a2 and a3 do not follow the instructions. This illustrates the issue of underspecified rewards.

In these tasks, the RL agent needs to learn to generalize from sparse (only a few trajectories lead to a non-zero reward) and underspecified (no distinction between purposeful and accidental success) rewards. Importantly, because of underspecified rewards, the agent may receive positive feedback for exploiting spurious patterns in the environment. This can lead to reward hacking, causing unintended and harmful behavior when deployed in real-world systems.
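The maze example can be made concrete with a small sketch. The grid here is an assumption: an open grid with no walls, a start at (0, 0) and a goal at (2, 2), since the original figure is not reproduced in the text. The three trajectories are illustrative choices rather than the exact a1, a2 and a3 of the figure. All three earn the same binary reward, but only the first one follows the instruction.

```python
MOVES = {"Right": (1, 0), "Left": (-1, 0), "Up": (0, 1), "Down": (0, -1)}


def final_position(start, actions):
    """Apply each action in turn on an open grid (no walls, an assumption)."""
    x, y = start
    for action in actions:
        dx, dy = MOVES[action]
        x, y = x + dx, y + dy
    return (x, y)


def binary_reward(start, goal, actions):
    """Sparse success-failure feedback: 1 only if the trajectory ends on the goal."""
    return 1 if final_position(start, actions) == goal else 0


def follows_instruction(instruction, actions):
    """True only when the actions match the commands step by step."""
    return list(instruction) == list(actions)


START, GOAL = (0, 0), (2, 2)
instruction = ["Right", "Up", "Up", "Right"]
trajectories = {
    "a1": ["Right", "Up", "Up", "Right"],   # follows the instruction
    "a2": ["Up", "Right", "Right", "Up"],   # reaches the goal by another route
    "a3": ["Right", "Right", "Up", "Up"],   # reaches the goal by another route
}

for name, actions in trajectories.items():
    print(name, binary_reward(START, GOAL, actions), follows_instruction(instruction, actions))
```

The reward column cannot separate the purposeful success from the accidental ones, which is exactly the gap an auxiliary reward has to fill.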

In “Learning to Generalize from Sparse and Underspecified Rewards”, we address the issue of underspecified rewards by developing Meta Reward Learning (MeRL), which provides more refined feedback to the agent by optimizing an auxiliary reward function. To learn from sparse rewards, MeRL is combined with a memory buffer of successful trajectories collected using a novel exploration strategy. The effectiveness of our approach is demonstrated on semantic parsing, where the goal is to learn a mapping from natural language to logical forms (e.g., mapping questions to SQL programs). In the paper, we investigate the weakly-supervised problem setting, where the goal is to automatically discover logical programs from question-answer pairs, without any form of program supervision. For instance, given the question “Which nation won the most silver medals?” and a relevant Wikipedia table, an agent needs to generate an SQL-like program that results in the correct answer (i.e., “Nigeria”).

The proposed approach achieves state-of-the-art results on the WikiTableQuestions and WikiSQL benchmarks, improving upon prior work by 1.2% and 2.4% respectively. MeRL automatically learns the auxiliary reward function without using any expert demonstrations (e.g., ground-truth programs), making it more widely applicable and distinct from previous reward learning approaches. The diagram below depicts a high level overview of our approach:

Overview of the proposed approach. We employ (1) mode covering exploration to collect a diverse set of successful trajectories in a memory buffer; (2) Meta-learning or Bayesian optimization to learn an auxiliary reward that provides more refined feedback for policy optimization.

Meta Reward Learning (MeRL)
The key insight of MeRL in dealing with underspecified rewards is that spurious trajectories and programs that achieve accidental success are detrimental to the agent’s generalization performance. For example, an agent might be able to solve a specific instance of the maze problem above. However, if it learns to perform spurious actions during training, it is likely to fail when provided with unseen instructions. To mitigate this issue, MeRL optimizes a more refined auxiliary reward function, which can differentiate between accidental and purposeful success based on features of action trajectories. The auxiliary reward is optimized by maximizing the trained agent’s performance on a hold-out validation set via meta learning.

Schematic illustration of MeRL: The RL agent is trained via the reward signal obtained from the auxiliary reward model while the auxiliary rewards are trained using the generalization error of the agent.

Learning from Sparse Rewards
To learn from sparse rewards, effective exploration is critical to find a set of successful trajectories. Our paper addresses this challenge by utilizing the two directions of Kullback–Leibler (KL) divergence, a measure of how different two probability distributions are. In the example below, we use KL divergence to minimize the difference between a fixed bimodal (shaded purple) and a learned Gaussian (shaded green) distribution, which can represent the distribution of the agent’s optimal policy and our learned policy respectively. One direction of the KL objective learns a distribution that tries to cover both modes, while the distribution learned by the other objective seeks a particular mode (i.e., it prefers one mode over another). Our method exploits the mode covering KL’s tendency to focus on multiple peaks to collect a diverse set of successful trajectories and mode seeking KL’s implicit preference between trajectories to learn a robust policy.
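The difference between the two directions can be reproduced numerically. The sketch below is an illustration only, not the paper's method: it discretizes a bimodal target p on a grid and fits a single Gaussian q by brute-force grid search, once minimizing KL(p‖q) (mode covering) and once minimizing KL(q‖p) (mode seeking). The mixture weights, the mode locations and the search grid are all assumptions chosen for the demonstration.

```python
import math

XS = [i / 10 for i in range(-80, 81)]      # evaluation grid on [-8, 8]
MEANS = [i / 5 for i in range(-15, 16)]    # candidate means in [-3, 3]
STDS = [i / 10 for i in range(3, 31)]      # candidate standard deviations in [0.3, 3.0]


def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def kl(a, b, tiny=1e-300):
    """Discrete KL(a || b); b is floored at `tiny` to avoid log(0)."""
    return sum(ai * (math.log(ai) - math.log(max(bi, tiny)))
               for ai, bi in zip(a, b) if ai > 0)


def fit_gaussian(target, objective):
    """Grid search for the (mean, std) minimizing objective(target, q)."""
    best = None
    for mean in MEANS:
        for std in STDS:
            q = normalize([gaussian(x, mean, std) for x in XS])
            loss = objective(target, q)
            if best is None or loss < best[0]:
                best = (loss, mean, std)
    return best[1], best[2]


# Assumed bimodal target: modes at -2 and +2, the right one slightly heavier.
p = normalize([0.4 * gaussian(x, -2.0, 0.5) + 0.6 * gaussian(x, 2.0, 0.5) for x in XS])

covering_mean, covering_std = fit_gaussian(p, lambda target, q: kl(target, q))
seeking_mean, seeking_std = fit_gaussian(p, lambda target, q: kl(q, target))

print("mode covering, KL(p||q):", covering_mean, covering_std)
print("mode seeking,  KL(q||p):", seeking_mean, seeking_std)
```

Under these assumptions the covering fit spreads a wide Gaussian across both peaks, while the seeking fit collapses onto the heavier peak, matching the left and right panels described in the caption.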

Left: Optimizing mode covering KL. Right: Optimizing mode seeking KL

Designing reward functions that distinguish between optimal and suboptimal behavior is critical for applying RL to real-world applications. This research takes a small step in the direction of modelling reward functions without any human supervision. In future work, we’d like to tackle the credit-assignment problem in RL from the perspective of automatically learning a dense reward function.

This research was done in collaboration with Chen Liang and Dale Schuurmans. We thank Chelsea Finn and Kelvin Guu for their review of the paper.


On the Path to Cryogenic Control of Quantum Processors

Posted by Joseph Bardin, Visiting Faculty Researcher and Erik Lucero, Staff Research Scientist and Hardware Lead, Google AI Quantum Team

Building a quantum computer that can solve practical problems that would otherwise be classically intractable due to the computational complexity, cost, energy consumption or time to solution is the longstanding goal of the Google AI Quantum team. Current thresholds suggest a first generation error-corrected quantum computer will require on the order of 1 million physical qubits, which is more than four orders of magnitude more qubits than exist in Bristlecone, our 72-qubit quantum processor. Increasing the number of physical qubits to the level needed for a fault-tolerant quantum computer while maintaining high-quality control of each qubit are intertwined and exciting technological challenges that will require inventions beyond simply copying and pasting our current control architecture. One critical challenge is reducing the number of input/output control lines per qubit by relocating the room temperature analog control electronics to the 3 kelvin stage in the cryostat, while maintaining high-quality qubit control.

As a step towards solving that challenge, this week we presented our first generation cryogenic-CMOS single-qubit controller at the International Solid State Circuits Conference in San Francisco. Fabricated using commercial CMOS technology, our controller operates at 3 kelvin, consumes less than 2 milliwatts of power and measures just 1 mm by 1.6 mm. Functionally, it provides an instruction set for single-qubit gate operations, providing analog control of a qubit via digital lines between room temperature and 3 kelvin, all while consuming ~1000 times less power compared to our current room temperature control electronics.

Google’s first generation cryogenic-CMOS single-qubit controller (center and zoomed on the right) packaged and ready to be deployed inside our cryostat. The controller measures 1mm by 1.6mm.

How to Control 72 Qubits
In our lab in Santa Barbara, we run programs on Bristlecone by applying gigahertz frequency analog control signals to each of the qubits to manipulate the qubit state, to entangle qubits and to measure the outcomes of our computations. How well we define the shape and frequency of these control signals directly impacts the quality of our computation. To make high-quality qubit control signals, we leverage technology developed for smartphones packaged in server racks at room temperature. Individual coaxial cables deliver these signals to each qubit, which are themselves kept inside a cryostat chilled to 10 millikelvin. While this approach makes sense for a Bristlecone-scale quantum processor, which demands 2 control lines per qubit for 144 unique control signals, we realized that a more integrated approach would be required in order to scale our systems to the million qubit level.

Research Scientist Amit Vainsencher checking the wiring on Bristlecone in one of Google’s flagship cryostats. Blue coaxial cables are connected from custom analog control electronics (server rack on the right) to the quantum processor.

In our current setup, the number of physical wires connected from room temperature to the qubits inside the cryostat and the finite cooling power of the cryostat represent a significant constraint. One way to alleviate this is to move the digital-to-analog control closer to the quantum processor. Currently, our room temperature digital-to-analog waveform generators used to control individual qubits dissipate ~1 watt of waste heat per qubit. The cooling power of our cryostat at 3 kelvin is 0.1 watt. That means if we crammed 150 waveform generators into our cryostat (never mind the limited physical space inside the refrigerator for a moment) we would overwhelm the cooling power of our cryostat by 1500x, thereby cooking our cryostat and rendering our qubits useless. Therefore, simply installing our existing digital-to-analog control in the cryostat will not set us on the path to control millions of qubits. It is clear we need an integrated low-power qubit control solution.
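The heat budget above is a short calculation, sketched here in milliwatts to keep the arithmetic exact. The 2 milliwatt figure is the upper bound this post quotes for the cryogenic-CMOS controller. The final count is a rough ceiling that assumes the controllers are the only heat load at the 3 kelvin stage, which the post does not claim.

```python
# Figures quoted in the post, expressed in milliwatts.
COOLING_POWER_MW = 100          # 0.1 watt of cooling power at 3 kelvin
ROOM_TEMP_GENERATOR_MW = 1000   # ~1 watt of waste heat per waveform generator
CRYO_CONTROLLER_MW = 2          # "less than 2 milliwatts" per cryogenic controller


def overload_factor(units, milliwatts_per_unit, cooling_power_mw):
    """How many times the dissipated heat exceeds the available cooling power."""
    return units * milliwatts_per_unit / cooling_power_mw


def max_units(milliwatts_per_unit, cooling_power_mw):
    """Largest number of units whose combined heat fits in the cooling budget."""
    return cooling_power_mw // milliwatts_per_unit


print(overload_factor(150, ROOM_TEMP_GENERATOR_MW, COOLING_POWER_MW))
print(max_units(CRYO_CONTROLLER_MW, COOLING_POWER_MW))
```

Even at the quoted power level, the budget covers only tens of controllers, which is consistent with the post describing this chip as a first step rather than a scalable solution.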

A Cool Idea
In collaboration with University of Massachusetts Professor Joseph Bardin, we set out to develop custom integrated circuits (ICs) to control our qubits from within the cryostat to ultimately reduce the physical I/O connections to and from our future quantum processors. These ICs would be designed to operate in the ultracold environment, specifically 3 kelvin, and turn digital instructions into analog control pulses for qubits. A key research objective was to first design a custom IC with low power requirements, in order to prevent warming up the cryostat.

We designed our IC to dissipate no more than 2 milliwatts of power at 3 kelvin, which can be challenging as most physical CMOS models assume operation closer to 300 kelvin. After design and fabrication of the IC with the low power design constraints in mind, we verified that the cryogenic-CMOS qubit controller worked at room temperature. We then mounted it in our cryostat at 3 kelvin and connected it to a qubit (mounted at 10 millikelvin in the same cryostat). We carried out a series of experiments to establish that the cryogenic-CMOS qubit controller worked as designed, and most importantly, that we hadn’t just installed a heater inside our cryostat.

Schematic of the cryogenic-CMOS qubit controller mounted on the 3 kelvin stage of our dilution refrigerator and connected to a qubit. Our standard qubit control electronics were connected in parallel to enable control and measurement of the qubit as an in-situ check experiment.

Performance at Low Temperature
Baseline experiments for our new quantum control hardware, including T1, Rabi oscillations, and single qubit gates, show similar performance compared to our standard room-temperature qubit control electronics: qubit coherence time was virtually unchanged, and high-visibility Rabi oscillations were observed by varying the amplitude of the pulses out of the cryogenic-CMOS qubit controller—a signature response of a driven qubit.

Comparison of the qubit coherence time measured using the standard and cryogenic quantum controllers.
Measured Rabi amplitude oscillations using the cryogenic controller. The green and black traces are the probability of measuring the qubits in the 1 and 0 states, respectively.

Next Steps
Although all of these results are promising, this first generation cryogenic-CMOS qubit controller is but one small step towards a truly scalable qubit control and measurement system. For instance, our controller is only able to address a single qubit, and it still requires several connections to room temperature. In addition, we still need to work hard to quantify the error rates for single qubit gates. As such, we are excited to reduce the energy required to control qubits and still maintain the delicate control required to perform high-quality qubit operations.

This work was carried out with the support of the Google Visiting Researcher Program while Prof. Bardin, an Associate Professor with the University of Massachusetts Amherst, was on sabbatical with the Google AI Quantum Team. This work would not have been possible without the many contributions of members of the Google AI Quantum team, especially Evan Jeffrey for his integration of the cryo-CMOS controller into the qubit calibration software, Ted White for his on-demand qubit calibrations and Trent Huang for his tireless design rules checks.
