EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling

Posted by Mingxing Tan, Staff Software Engineer and Quoc V. Le, Principal Scientist, Google AI

Convolutional neural networks (CNNs) are commonly developed at a fixed resource cost, and then scaled up in order to achieve better accuracy when more resources are made available. For example, ResNet can be scaled up from ResNet-18 to ResNet-200 by increasing the number of layers, and recently, GPipe achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline CNN by a factor of four. The conventional practice for model scaling is to arbitrarily increase the CNN depth or width, or to use larger input image resolution for training and evaluation. While these methods do improve accuracy, they usually require tedious manual tuning, and still often yield suboptimal performance. What if, instead, we could find a more principled method to scale up a CNN to obtain better accuracy and efficiency?

In our ICML 2019 paper, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, we propose a novel model scaling method that uses a simple yet highly effective compound coefficient to scale up CNNs in a more structured manner. Unlike conventional approaches that arbitrarily scale network dimensions, such as width, depth and resolution, our method uniformly scales each dimension with a fixed set of scaling coefficients. Powered by this novel scaling method and recent progress on AutoML, we have developed a family of models, called EfficientNets, which surpass state-of-the-art accuracy with up to 10x better efficiency (smaller and faster).

Compound Model Scaling: A Better Way to Scale Up CNNs
In order to understand the effect of scaling the network, we systematically studied the impact of scaling different dimensions of the model. While scaling individual dimensions improves model performance, we observed that balancing all dimensions of the network—width, depth, and image resolution—against the available resources would best improve overall performance.

The first step in the compound scaling method is to perform a grid search to find the relationship between different scaling dimensions of the baseline network under a fixed resource constraint (e.g., 2x more FLOPS). This determines the appropriate scaling coefficient for each of the dimensions mentioned above. We then apply those coefficients to scale up the baseline network to the desired target model size or computational budget.
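The scaling rule above can be sketched in a few lines. This is a minimal sketch, assuming the coefficients reported in the EfficientNet paper (alpha for depth, beta for width, gamma for resolution), chosen by grid search so that FLOPS roughly double with each unit increase of the compound coefficient phi; the helper function itself is hypothetical.

```python
# Coefficients reported in the EfficientNet paper, found by grid search
# under the constraint alpha * beta^2 * gamma^2 ~= 2 (FLOPS roughly
# double per unit increase of phi).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Uniformly scale depth, width, and resolution by the compound coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)              # number of layers
    width = round(base_width * BETA ** phi)               # channels per layer
    resolution = round(base_resolution * GAMMA ** phi)    # input image size
    return depth, width, resolution
```

With phi = 0 the baseline is returned unchanged; larger phi grows all three dimensions together rather than tuning each one by hand.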

Comparison of different scaling methods. Unlike conventional scaling methods (b)-(d) that arbitrarily scale a single dimension of the network, our compound scaling method uniformly scales up all dimensions in a principled way.

This compound scaling method consistently improves model accuracy and efficiency for scaling up existing models such as MobileNet (+1.4% ImageNet accuracy) and ResNet (+0.7%), compared to conventional scaling methods.

EfficientNet Architecture
The effectiveness of model scaling also relies heavily on the baseline network. So, to further improve performance, we have also developed a new baseline network by performing a neural architecture search using the AutoML MNAS framework, which optimizes both accuracy and efficiency (FLOPS). The resulting architecture uses mobile inverted bottleneck convolution (MBConv), similar to MobileNetV2 and MnasNet, but is slightly larger due to an increased FLOP budget. We then scale up the baseline network to obtain a family of models, called EfficientNets.
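To make the MBConv structure concrete, here is an illustrative shape trace through one mobile inverted bottleneck block (1x1 expansion, depthwise convolution, 1x1 projection back down). This is a sketch, not the actual implementation; the function and its stage names are hypothetical.

```python
def mbconv_shapes(in_ch, out_ch, expand_ratio, stride, spatial):
    """Return (stage, channels, spatial size) tuples for an MBConv block."""
    mid = in_ch * expand_ratio
    stages = [("input", in_ch, spatial)]
    if expand_ratio != 1:
        stages.append(("1x1 expand", mid, spatial))       # widen the channels
    spatial_out = spatial // stride
    stages.append(("3x3 depthwise", mid, spatial_out))    # per-channel spatial conv
    stages.append(("1x1 project", out_ch, spatial_out))   # narrow back down
    return stages
```

The "inverted" part is visible in the trace: channels expand in the middle of the block rather than at the ends, the opposite of a classic ResNet bottleneck.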

The architecture for our baseline network EfficientNet-B0 is simple and clean, making it easier to scale and generalize.

EfficientNet Performance
We have compared our EfficientNets with other existing CNNs on ImageNet. In general, the EfficientNet models achieve both higher accuracy and better efficiency over existing CNNs, reducing parameter size and FLOPS by an order of magnitude. For example, in the high-accuracy regime, our EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on CPU inference than the previous GPipe. Compared with the widely used ResNet-50, our EfficientNet-B4 uses similar FLOPS, while improving the top-1 accuracy from 76.3% of ResNet-50 to 82.6% (+6.3%).

Model Size vs. Accuracy Comparison. EfficientNet-B0 is the baseline network developed by AutoML MNAS, while EfficientNet-B1 to B7 are obtained by scaling up the baseline network. In particular, our EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy, while being 8.4x smaller than the best existing CNN.

Though EfficientNets perform well on ImageNet, to be most useful, they should also transfer to other datasets. To evaluate this, we tested EfficientNets on eight widely used transfer learning datasets. EfficientNets achieved state-of-the-art accuracy on 5 of the 8 datasets, including CIFAR-100 (91.7%) and Flowers (98.8%), with an order of magnitude fewer parameters (up to a 21x parameter reduction), suggesting that EfficientNets also transfer well.

By providing significant improvements to model efficiency, we expect that EfficientNets could serve as a new foundation for future computer vision tasks. Therefore, we have open-sourced all EfficientNet models, which we hope can benefit the larger machine learning community. You can find the EfficientNet source code and TPU training scripts here.

Special thanks to Hongkun Yu, Ruoming Pang, Vijay Vasudevan, Alok Aggarwal, Barret Zoph, Xianzhi Du, Xiaodan Song, Samy Bengio, Jeff Dean, and the Google Brain team.


Potential Roadblocks Ahead

So a few weeks ago I was extremely honored to give the keynote address at the Texas Glass Association Glass Conference II.  It really was a wonderful experience, as the folks from the great state of Texas are some of the best around.  They are truly classy and hospitable to the end.  The theme of my hour-long presentation was “State of the Industry,” where I spent around 30 minutes on economic forecasts and the rest on trends, concepts, events, and conclusions.  On the forecast side I pulled data from 11 different sources and went through many different segments and applications.  The main takeaway I provided after all of this research was that there is a softening of the markets coming our way.  It doesn’t look like it will be a long stretch, and there are no indicators that show the weakness being 2008/9-level bad, but it was interesting for me to get into all of the data and see that this is what we have coming.  Basically some lighter volumes into 2020, but things improving towards the end of next year and into 2021.  One of the things I told the attendees was to look at technology and innovation NOW vs. later.  If you can improve yourself or your operation now (meaning efficiencies, etc.), this is the time to do it.  Don’t wait until next year, that is for sure.

The event overall was fantastic.  Dustin Anderson of Anderson Glass had an incredible presentation on the workforce of today and how to reach them.  He’s become a very polished and natural speaker, so he’s more than just a TV star these days.  In addition, I really enjoyed what Nathan McKenna of Vitro and Erica Couch of Tri-Star delivered in their spots.  Great stuff all the way around.  Kudos to Felix Munson, Sam Hill, and everyone at the TGA for a job well done!


–  I did also talk about the Architectural Billings Index (ABI) and was waiting to see if we were back in the black this month after our first down month in 2 years.  Sure enough, we climbed into positive territory, barely, at 50.5.  I had a feeling it would pop up from its low number in the previous month, and now I see it treading water for a while.

–  Glass Magazine review time… the issue has “Protector” on the very snazzy cover and is the May 2019 edition.  The main theme is Glass & Metals 401- Guide to Protective Glazing.  With how important this segment is in our world right now, I strongly recommend you grab the issue or check it out online as the info in here is absolutely fabulous and necessary. 

–  Ad of the month goes to CR Laurence.  “The Building Envelope Simplified” was an excellent ad piece that truly shows the power of glass and smartly showed where CRL’s contributions were.  The picture and callouts did the heavy lifting and impressed me.  Kudos to the minds behind that one!

–  I never fly in or out of JFK in NYC- but I may have to make an exception some day to get to the new TWA hotel there.  Looks incredibly cool!

–  Last this week… another GlassBuild plug from me.  Don’t click away, read on please… have you registered yet?  Have you gotten the hotel taken care of?  If not, do it now… we have now passed Memorial Day and we all know this summer will fly by.  There’s a ton of good pieces in the works for the show and you will need to be there.  Especially if you are looking at the advice I laid out at the top of the post, you HAVE to be there…  Any questions on it, please reach out to me!


–  We see this every year and I never get tired of it!  Dogs in the yearbook!
–  Another story we always see yet people seemingly don’t learn. Please don’t leave your kid or pets in hot cars with the windows up! 
I love good news!  Good job young man!

This is a classic song, classic clip and just awesome dancing… just brings a smile to the face!


Moving Camera, Moving People: A Deep Learning Approach to Depth Prediction

Posted by Tali Dekel, Research Scientist and Forrester Cole, Software Engineer, Machine Perception

The human visual system has a remarkable ability to make sense of our 3D world from its 2D projection. Even in complex environments with multiple moving objects, people are able to maintain a feasible interpretation of the objects’ geometry and depth ordering. The field of computer vision has long studied how to achieve similar capabilities by computationally reconstructing a scene’s geometry from 2D image data, but robust reconstruction remains difficult in many cases.

A particularly challenging case occurs when both the camera and the objects in the scene are freely moving. This confuses traditional 3D reconstruction algorithms that are based on triangulation, which assumes that the same object can be observed from at least two different viewpoints, at the same time. Satisfying this assumption requires either a multi-camera array (like Google’s Jump), or a scene that remains stationary as the single camera moves through it. As a result, most existing methods either filter out moving objects (assigning them “zero” depth values), or ignore them (resulting in incorrect depth values).

Left: The traditional stereo setup assumes that at least two viewpoints capture the scene at the same time. Right: We consider the setup where both camera and subject are moving.

In “Learning the Depths of Moving People by Watching Frozen People”, we tackle this fundamental challenge by applying a deep learning-based approach that can generate depth maps from an ordinary video, where both the camera and subjects are freely moving. The model avoids direct 3D triangulation by learning priors on human pose and shape from data. While there is a recent surge in using machine learning for depth prediction, this work is the first to tailor a learning-based approach to the case of simultaneous camera and human motion. In this work, we focus specifically on humans because they are an interesting target for augmented reality and 3D video effects.

Our model predicts the depth map (right; brighter=closer to the camera) from a regular video (left), where both the people in the scene and the camera are freely moving.

Sourcing the Training Data
We train our depth-prediction model in a supervised manner, which requires videos of natural scenes, captured by moving cameras, along with accurate depth maps. The key question is where to get such data. Generating data synthetically requires realistic modeling and rendering of a wide range of scenes and natural human actions, which is challenging. Further, a model trained on such data may have difficulty generalizing to real scenes. Another approach might be to record real scenes with an RGBD sensor (e.g., Microsoft’s Kinect), but depth sensors are typically limited to indoor environments and have their own set of 3D reconstruction issues.

Instead, we make use of an existing source of data for supervision: YouTube videos in which people imitate mannequins by freezing in a wide variety of natural poses, while a hand-held camera tours the scene. Because the entire scene is stationary (only the camera is moving), triangulation-based methods, such as multi-view stereo (MVS), work, and we can get accurate depth maps for the entire scene, including the people in it. We gathered approximately 2,000 such videos, spanning a wide range of realistic scenes with people naturally posing in different group configurations.

Videos of people imitating mannequins while a camera tours the scene, which we used for training. We use traditional MVS algorithms to estimate depth, which serves as supervision during training of our depth-prediction model.

Inferring the Depth of Moving People
The Mannequin Challenge videos provide depth supervision for moving camera and “frozen” people, but our goal is to handle videos with a moving camera and moving people. We need to structure the input to the network in order to bridge that gap.

A possible approach is to infer depth separately for each frame of the video (i.e., the input to the model is just a single frame). While such a model already improves over state-of-the-art single image methods for depth prediction, we can improve the results further by considering information from multiple frames. For example, motion parallax, i.e., the relative apparent motion of static objects between two different viewpoints, provides strong depth cues. To benefit from such information, we compute the 2D optical flow between each input frame and another frame in the video, which represents the pixel displacement between the two frames. This flow field depends on both the scene’s depth and the relative position of the camera. However, because the camera positions are known, we can remove their dependency from the flow field, which results in an initial depth map. This initial depth is valid only for static scene regions. To handle moving people at test time, we apply a human-segmentation network to mask out human regions in the initial depth map. The full input to our network then includes: the RGB image, the human mask, and the masked depth map from parallax.
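The input construction described above can be sketched as follows. This is a minimal sketch, assuming simple array representations; the function name and the choice of zero as the mask fill value are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def build_network_input(rgb, human_mask, parallax_depth):
    """Stack RGB (H, W, 3), a binary human mask (H, W), and the initial
    parallax depth with human regions zeroed out (for the network to
    inpaint) into one (H, W, 5) input tensor."""
    masked_depth = np.where(human_mask, 0.0, parallax_depth)
    return np.concatenate(
        [rgb,
         human_mask[..., None].astype(rgb.dtype),
         masked_depth[..., None].astype(rgb.dtype)],
        axis=-1,
    )
```

The mask channel tells the network which depth values to trust (static regions, where parallax is valid) and which to fill in (human regions).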

Depth prediction network: The input to the model includes an RGB image (Frame t), a mask of the human region, and an initial depth for the non-human regions, computed from motion parallax (optical flow) between the input frame and another frame in the video. The model outputs a full depth map for Frame t. Supervision for training is provided by the depth map, computed by MVS.

The network’s job is to “inpaint” the depth values for the regions with people, and refine the depth elsewhere. Intuitively, because humans have consistent shape and physical dimensions, the network can internally learn such priors by observing many training examples. Once trained, our model can handle natural videos with arbitrary camera and human motion.

Below are some examples of our depth-prediction model’s results on videos, with comparisons to recent state-of-the-art learning-based methods.

Comparison of depth prediction models on a video clip with moving cameras and people. Top: learning-based monocular depth prediction methods (DORN; Chen et al.). Bottom: a learning-based stereo method (DeMoN), and our result.

3D Video Effects Using Our Depth Maps
Our predicted depth maps can be used to produce a range of 3D-aware video effects. One such effect is synthetic defocus. Below is an example, produced from an ordinary video using our depth map.

Bokeh video effect produced using our estimated depth maps. Video courtesy of Wind Walk Travel Videos.
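A depth-dependent blur like the one above can be approximated naively: blur each pixel with a radius proportional to its distance from a chosen focal plane. The sketch below is illustrative only (a brute-force box blur, not the production bokeh effect), and the function name and radius mapping are assumptions.

```python
import numpy as np

def synthetic_defocus(image, depth, focal_depth, max_radius=3):
    """Naive synthetic defocus: box-blur each pixel of image (H, W, 3)
    with a radius proportional to |depth - focal_depth|. O(H*W*r^2)."""
    h, w = depth.shape
    spread = np.abs(depth - focal_depth)
    # Map distance-from-focus to an integer blur radius in [0, max_radius].
    radii = (spread / (spread.max() + 1e-8) * max_radius).astype(int)
    out = np.empty_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            r = radii[y, x]
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y, x] = image[y0:y1, x0:x1].mean(axis=(0, 1))
    return out
```

Pixels on the focal plane get a radius of zero and pass through unchanged, while pixels far from it are averaged over a larger window, mimicking a shallow depth of field.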

Other possible applications for our depth maps include generating a stereo video from a monocular one, and inserting synthetic CG objects into the scene. Depth maps also provide the ability to fill in holes and disoccluded regions with the content exposed in other frames of the video. In the following example, we have synthetically wiggled the camera at several frames and filled in the regions behind the actor with pixels from other frames of the video.

The research described in this post was done by Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu and Bill Freeman. We would like to thank Miki Rubinstein for his valuable feedback.


POP TOYS 1/6th scale Chivalrous Robin Hood (Russell Crowe) 12-inch Action Figure Preview

Robin Hood is a 2010 British-American epic historical drama film based on the Robin Hood legend, directed by Ridley Scott and starring Russell Crowe, Cate Blanchett, William Hurt, Mark Strong, Mark Addy, Oscar Isaac, Danny Huston, Eileen Atkins, and Max von Sydow. Russell Crowe stars as Robin Longstride, who later in the film becomes Robin Hood.

POP TOYS 1/6th scale Chivalrous Robin Hood 12-inch Action Figure Parts List: Squint version head, Normal version head, Hands x4, Cape, Sponge suit, Green long-sleeve, Pants, Boots, Vambrace, Leather belt, Bow, Arrow x6, Arrow bag, Dagger, Dagger bag

POP TOYS 1/6th scale Chivalrous Robin Hood — War horse Parts List: Horse, Saddle, Stirrup x2, Horse face belt, Rein, Leather liner, Front belt, Back belt
