Moving Beyond Translation with the Universal Transformer

Posted by Stephan Gouws, Research Scientist, Google Brain Team and Mostafa Dehghani, University of Amsterdam PhD student and Google Research Intern

Last year we released the Transformer, a new machine learning model that showed remarkable success over existing algorithms for machine translation and other language understanding tasks. Before the Transformer, most neural-network-based approaches to machine translation relied on recurrent neural networks (RNNs), which operate sequentially (e.g. translating words in a sentence one after the other) using recurrence (i.e. the output of each step feeds into the next). While RNNs are very powerful at modeling sequences, their sequential nature means that they are quite slow to train, as longer sentences need more processing steps, and their recurrent structure also makes them notoriously difficult to train properly.

In contrast to RNN-based approaches, the Transformer used no recurrence, instead processing all words or symbols in the sequence in parallel while making use of a self-attention mechanism to incorporate context from words farther away. By processing all words in parallel and letting each word attend to other words in the sentence over multiple processing steps, the Transformer was much faster to train than recurrent models. Remarkably, it also yielded much better translation results than RNNs. However, on smaller and more structured language understanding tasks, or even simple algorithmic tasks such as copying a string (e.g. to transform an input of “abc” to “abcabc”), the Transformer does not perform very well. In contrast, models that perform well on these tasks, like the Neural GPU and Neural Turing Machine, fail on large-scale language understanding tasks like translation.

In “Universal Transformers” we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of the Transformer to retain its fast training speed, but we replaced the Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next). Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), the Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is faster than the serial recurrence used in RNNs, and it also makes the Universal Transformer more powerful than the standard feedforward Transformer.
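To make the idea of parallel-in-time recurrence more concrete, here is a minimal pure-Python sketch. It is an illustration under simplifying assumptions, not the Tensor2Tensor implementation: the attention has no learned projections, and the names `self_attention`, `transition`, `refine_step`, and `universal_transformer_encode`, as well as the scalar `weight`, are hypothetical stand-ins for the learned components.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Let every position attend to every position.

    Assumption: plain scaled dot-product attention with no learned
    query/key/value projections, to keep the sketch short.
    """
    dim = len(states[0])
    attended = []
    for query in states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in states
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, states)) for d in range(dim)]
        )
    return attended


def transition(vector, weight):
    """Toy stand-in for the learned transition function (hypothetical)."""
    return [math.tanh(weight * x) for x in vector]


def refine_step(states, weight):
    """One recurrent step: attention across positions, then the shared transition."""
    attended = self_attention(states)
    return [
        transition([s + a for s, a in zip(state, att)], weight)  # residual add
        for state, att in zip(states, attended)
    ]


def universal_transformer_encode(states, num_steps, weight=0.5):
    """Apply the SAME step function (same weight) num_steps times.

    Each step updates all positions together; the recurrence is over
    processing depth, not over positions in the sequence.
    """
    for _ in range(num_steps):
        states = refine_step(states, weight)
    return states


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two dimensions
refined = universal_transformer_encode(states, num_steps=3)
print(len(refined), len(refined[0]))  # 3 2
```

Because `refine_step` takes the same `weight` on every iteration, `num_steps` can be changed freely without adding parameters. A fixed stack of distinct layers would need one set of weights per layer.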

The Universal Transformer repeatedly refines a series of vector representations (shown as h1 to hm) for each position of the sequence in parallel, by combining information from different positions using self-attention and applying a recurrent transition function. Arrows denote dependencies between operations.

At each step, information is communicated from each symbol (e.g. word in the sentence) to all other symbols using self-attention, just like in the original Transformer. However, now the number of times this transformation is applied to each symbol (i.e. the number of recurrent steps) can either be manually set ahead of time (e.g. to some fixed number or to the input length), or it can be decided dynamically by the Universal Transformer itself. To achieve the latter, we added an adaptive computation mechanism to each position which can allocate more processing steps to symbols that are more ambiguous or require more computations.

As an intuitive example of how this could be useful, consider the sentence “I arrived at the bank after crossing the river”. In this case, more context is required to infer the most likely meaning of the word “bank” compared to the less ambiguous meaning of “I” or “river”. When we encode this sentence using the standard Transformer, the same amount of computation is applied unconditionally to each word. However, the Universal Transformer’s adaptive mechanism allows the model to spend increased computation only on the more ambiguous words, e.g. to use more steps to integrate the additional contextual information needed to disambiguate the word “bank”, while spending potentially fewer steps on less ambiguous words.
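The per-position budget described above can be sketched with a simple halting rule. This is a simplified illustration, not the model's actual mechanism: the function `steps_per_position`, the `threshold` and `max_steps` values, and the per-word halting probabilities are all hypothetical, hand-picked numbers. In a trained model, such quantities would be predicted from the data.

```python
def steps_per_position(halting_probs, threshold=0.99, max_steps=8):
    """Return how many refinement steps each position takes before halting.

    halting_probs[i] is an assumed, constant per-step halting probability
    for position i. A position keeps refining until its accumulated
    probability reaches the threshold or it runs out of steps.
    """
    steps = []
    for prob in halting_probs:
        cumulative = 0.0
        used = 0
        while used < max_steps and cumulative < threshold:
            cumulative += prob
            used += 1
        steps.append(used)
    return steps


# Hypothetical values: the ambiguous word "bank" halts slowly,
# the unambiguous words halt quickly.
words = ["I", "bank", "river"]
halting_probs = [0.9, 0.2, 0.5]

for word, used in zip(words, steps_per_position(halting_probs)):
    print(word, used)
# I 2
# bank 5
# river 2
```

With these made-up numbers, “bank” receives five refinement steps while “I” and “river” stop after two, mirroring the intuition that ambiguous symbols get more computation.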

At first it might seem restrictive to allow the Universal Transformer to only apply a single learned function repeatedly to process its input, especially when compared to the standard Transformer which learns to apply a fixed sequence of distinct functions. But learning how to apply a single function repeatedly means the number of applications (processing steps) can now be variable, and this is the crucial difference. Beyond allowing the Universal Transformer to apply more computation to more ambiguous symbols, as explained above, it further allows the model to scale the number of function applications with the overall size of the input (more steps for longer sequences), or to decide dynamically how often to apply the function to any given part of the input based on other characteristics learned during training. This makes the Universal Transformer more powerful in a theoretical sense, as it can effectively learn to apply different transformations to different parts of the input. This is something that the standard Transformer cannot do, as it consists of fixed stacks of learned transformation blocks applied only once.

But while increased theoretical power is desirable, we also care about empirical performance. Our experiments confirm that Universal Transformers are indeed able to learn from examples how to copy and reverse strings and how to perform integer addition much better than a Transformer or an RNN (although not quite as well as Neural GPUs). Furthermore, on a diverse set of challenging language understanding tasks the Universal Transformer generalizes significantly better and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task. But perhaps of most interest is that the Universal Transformer also improves translation quality by 0.9 BLEU¹ over a base Transformer with the same number of parameters, trained in the same way on the same training data. Putting things in perspective, this adds almost another 50% relative improvement on top of the previous 2.0 BLEU improvement that the original Transformer showed over earlier models when it was released last year.
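The "almost 50%" figure follows directly from the two BLEU gains quoted above. As a quick check of the arithmetic (the helper name `relative_gain` is ours, introduced only for illustration):

```python
def relative_gain(new_gain, previous_gain):
    """Size of a new improvement relative to an earlier improvement."""
    return new_gain / previous_gain


universal_gain_bleu = 0.9    # Universal Transformer over base Transformer (from the text)
transformer_gain_bleu = 2.0  # original Transformer over earlier models (from the text)

ratio = relative_gain(universal_gain_bleu, transformer_gain_bleu)
print(f"{ratio:.0%}")  # 45%
```

So the new gain is 45% of the size of the earlier one, which the text rounds to "almost 50%".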

The Universal Transformer thus closes the gap between practical sequence models competitive on large-scale language understanding tasks such as machine translation, and computationally universal models such as the Neural Turing Machine or the Neural GPU, which can be trained using gradient descent to perform arbitrary algorithmic tasks. We are enthusiastic about recent developments on parallel-in-time sequence models, and in addition to adding computational capacity and recurrence in processing depth, we hope that further improvements to the basic Universal Transformer presented here will help us build learning algorithms that are more powerful and more data efficient, and that generalize beyond the current state of the art.

If you’d like to try this for yourself, the code used to train and evaluate Universal Transformers can be found in the open-source Tensor2Tensor repository.

This research was conducted by Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Additional thanks go to Ashish Vaswani, Douglas Eck, and David Dohan for their fruitful comments and inspiration.

¹ A translation quality benchmark widely used in the machine translation community, computed on the standard WMT newstest2014 English to German translation test data set.

Continue reading

Posted in Uncategorized

Check out Hot Toys Black Panther 1/6th scale Letitia Wright as Shuri Collectible Figure Preview

Pre-order Hot Toys MMS501 Shuri Black Panther 1/6th scale Collectible Figure at KGHobby


“The Black Panther fights for us. And I will be there beside him.”

Black Panther, one of the movies released by Marvel Studios earlier this year, keeps smashing records and exceeding box-office expectations. The Princess of Wakanda, Shuri, is the leader of the Wakandan Design Group responsible for developing this African nation’s modern technology. When the young King T’Challa is drawn into a conflict that puts his homeland Wakanda and the entire world at risk, Shuri proves herself a great backup by creating new security systems and weapons, including the Panther Habits and her cool-looking panther-like gauntlets.

Having received a tremendous amount of attention at the recent exhibitions in San Diego and Hong Kong, Hot Toys is more than excited to introduce to fans today the long-awaited 1/6th scale collectible figure featuring T’Challa’s innovative little sister, Shuri of Black Panther.

Beautifully crafted based on the appearance of Letitia Wright as Shuri in the movie, the highly accurate collectible figure features a newly developed head sculpt with detailed hair sculpture, a newly developed body, an elaborately adorned new battle suit and neck ring, a Wakandan pattern sash, and a wide range of weapons and accessories including a pair of LED light-up Vibranium Gauntlets, a spear, a Kimoyo Beads bracelet and a movie-themed figure stand!



Hot Toys MMS501 1/6th scale Shuri Collectible Figure special features: Newly developed head sculpt with authentic and detailed likeness of Letitia Wright as Shuri in Black Panther | Movie-accurate facial expression with detailed skin texture and makeup | Brown hair sculpture with braided hairstyle | Approximately 29 cm tall | Newly developed body with over 28 points of articulation | Seven (7) interchangeable hands including: pair of fists, pair of relaxed hands, pair of hands for holding spear, gesture right hand

Costume: meticulously tailored brown and blue colored patterned jumpsuit with neck ring, silver and blue colored arm bands, yellow colored Wakandan tribe sash with silver colored buckle

Weapons: pair of LED light-up Vibranium Gauntlets (blue light, battery operated), spear

Accessories: Kimoyo Beads bracelet, Specially designed Black Panther themed hexagonal figure stand with character nameplate and movie logo

Release date: Approximately Q3 – Q4, 2019



Is It Really Back?

I thought I was dreaming when I saw a headline recently that asbestos could now legally be used again in manufacturing. Amazingly it was not a fantasy; it is true, and I am pretty thrown by it. For years the push to remove it and deal with it has been a major task, and one that has caused significant issues beyond the serious health risks that kicked the whole ban into motion. So to see it back was jarring. I was, however, relieved to see at least a solid initial pushback by the architectural community. It has begun on social media and I look for it to keep growing. This is going to be one to watch on the building side, because I just can’t see it having legs no matter what the argument for bringing it back is. I guess we will see…

We are now one month away from GlassBuild America and the anticipation for this year’s event is growing nicely. I am expecting very strong attendance and I am loving the diverse range of exhibitors. So much to see there for sure. In addition, the action demos are all “must see” types of events, along with the Express Learning. I seriously recommend looking at the GlassBuild America website and familiarizing yourself with everything that is happening, because it’s a lot different than it was in the past. Next week I’ll start breaking down specific items to see to help you in your planning process…

The latest updated website on the market features one of the best upgrades yet. Diamon-Fusion (DFI) launched a new site that is heavy on video right out of the gate (bold and daring in our usually conservative industry) and it truly blew me away. Congrats to the entire team at DFI for a job well done!

This week’s interview: Alissa Schmidt, Technical Resources Manager, Viracon.

I was very excited that Alissa accepted my request for an interview in this series, as I wanted to get a feel for not only her career journey but also her insight on the technical and project side. She certainly did not disappoint with her answers. Alissa easily has one of the most talented technical minds and approaches in our industry. Overall I continue to be amazed at the incredible amount of personal talent that is amassed at Viracon, and Alissa obviously fits in there perfectly.

Your career started in Marketing (I had a boss tell me no one needs marketing, so good for you for getting out, LOL) and then you seemed to settle into the design and technical side. What was it like to go from promotion of product to having such a crucial hand in the way the product is placed and performs?

I guess I’ve never really thought of my transition as anything more than natural growth with the company: growth in knowledge and experience that led to the role I’m currently in. I love promoting Viracon regardless of whether I’m helping our marketing department with content development, having a conversation directly with an architect or writing a letter to a customer to explain something they need more details about. At the same time, my move to the technical side has allowed me to gain a better understanding of our product development process and how product characteristics tie to performance in the field.

In case more detail is better, here’s a little background about my path at Viracon:

Although I came to Viracon with an interior design degree and experience as a kitchen designer, I also spent four years after college as a marketing coordinator. When I read Viracon’s job posting for an architectural design specialist, I saw they were looking for someone who had design OR marketing experience. Since I had both, I was intrigued and wanted to learn more about the company and position. I recall arriving for the architectural design interview only to be notified that I was going to be interviewing for the position I applied for as well as a marketing position. This was due to my prior experience in marketing and potential reorganization that was going to happen in the department. In the end, I was offered the design position and started with Viracon in that role. The architectural design department was, however, very integrated into the marketing department so my first several years at Viracon included quite a bit of marketing support.

As Viracon grew, the design team grew and we restructured it as a separate entity from our marketing department. Changes in leadership around this same time led to a design management opportunity. I had been with Viracon 7 years, had learned a lot about helping architects design with glass and was ready to take on the challenge of managing the architectural design team. A short time after I moved into the management role, a retirement on the technical side provided an opportunity for me to manage both the design and technical teams. This is the role I’m currently enjoying today.

I also enjoy the challenge of finding ways to improve, both personally and within the departments I manage. I discovered a communications program specifically targeted at communicating technical information to a non-technical audience. This is a great fit with my current position so I am currently working on my master’s degree through this program and anticipate graduating in 2019. 

With your position, and the awesome company you work for, I’d say you are positioned perfectly to be on the cutting edge of the industry.  What are you seeing out there that excites you and conversely keeps you up at night?

The electronic design tools architects have at their hands today are incredible. These tools have facilitated increased complexity of building shapes and forms. I wouldn’t say the complexity was previously impossible but the speed and accuracy of today’s software have expanded its use to a much broader audience.

While this explosion of complexity is super exciting for me as a designer, it keeps our manufacturing and technical experts on their toes. Complex building forms create glass shapes and sizes that were once reserved for high-profile, high-budget projects. Today, it is common for mainstream projects to include glass that poses a variety of fabrication challenges. The twists and turns of the unique building forms also change the way a building interacts with its surroundings. There might be 5 or 10 wind loads on a building rather than the one corner load and one typical load of a basic, rectangular building. This can require extensive glass strength analysis, deflection and sightline calculations. In some cases, the complexity requires finite element analysis because the traditional strength analysis programs do not suffice.

What’s the most fun you’ve had on a project in your career? Was it something that you had a hand in from the start, or maybe a massive signature project where you helped make sure everything clicked… or maybe something else that you can point to as memorable to you…

I hate picking favorites so choosing a single project over all others is nearly impossible. I’ve definitely had many fantastic experiences while I’ve been with Viracon. When I first started, Seven World Trade Center had been recently completed. I remember receiving a lot of calls from architects who wanted to talk about the glass. Even though I hadn’t personally worked on the project, these conversations were a quick introduction into how much fun it can be to talk about glass that comes from a small town in Minnesota and makes its way to a distinctive New York City building.

I’ve been fortunate enough to participate in the design process for everything from our local arts center addition to the Dallas Cowboys Stadium to One World Trade Center. My career here at Viracon has also offered a lot of fantastic opportunities to see our glass in person. One of the most memorable is a trip where I was able to visit One World Trade Center under construction, near the holidays. From the ground the glass looked great, and from the 56th floor the view was beautiful, but the best vantage point of the building during that trip was from across the street, where the construction lights were turned into multi-colored lights for the holidays. This little touch made me think about how a building really does interact with, and influence, people.


Escape the nursing home and first place you go?  Heavy metal concert!
Whenever I think we as an industry communicate badly, I remember there’s certain companies in the airline industry.

Skipping the video of the week… nothing great caught my fancy.  We’ll see what we can find for next week!


See you at Exporail (2018)!

Once again, members of the S Scale Workshop will be at Exporail – the Canadian Railway Museum – to take part in the museum’s annual model railway celebration, A Great Passion for Model Trains. This year’s event takes place August 18-19 – yes, next weeke…
