Big Models are Carbon Efficient if you Share Them

Recently, a research team at the University of Massachusetts Amherst led by Emma Strubell published a paper on the carbon emissions generated by training a high-performing NLP model. Given the growing worldwide concern about the effect of greenhouse gases on our way of life, Strubell's estimate that neural architecture search for a competitive transformer model can emit as much carbon as five cars over their lifetimes is rightly alarming. However, if we take context into account, particularly how the artificial intelligence community shares these models so that a single heavy training job can benefit hundreds of other researchers and millions of customers, we can paint a less dire picture of artificial intelligence's effect on the environment.

Big Models and the Artificial Intelligence Community

Given the recent trend toward larger and larger models for language processing, it should not be surprising that these models consume more energy. However, as the paper notes, these models also cost more money to run. The main villain of the paper, a neural architecture search (NAS) that runs many complete training jobs to select the model with the best architecture, produces five car-lifetimes of carbon but also costs a minimum of $1 million. This matters because, to some extent, the carbon generated by NAS will be limited by simple economics: only industry giants like Google can afford to run such experiments regularly. Strubell's team addresses this point, and in fact they focus their conclusion not on environmental impact but on academic diversity. If no research can occur on the most sophisticated models without access to millions of dollars' worth of resources, advancement in the field will be limited to a narrow subset of the world's research talent.

There is good news on the access front, however. Well-funded AI programs now regularly release significant portions of their results, code, and even trained models to the public, often with friendly licenses that incentivize further innovation. The most recent additions to NLP further support a process called "fine-tuning" that can retarget a heavy model like BERT to a new language task for a fraction of the cost of the original training. For the most part, the artificial intelligence community understands that everyone benefits when researchers from all walks of life are able to contribute to our collective knowledge. Returning to the environmental perspective, the carbon-intensive initial investment in one model seems less extreme when we consider that it can serve as the basis for hundreds of lighter-footprint descendants.
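To see why fine-tuning is so much cheaper, a back-of-the-envelope comparison helps. The downstream-task numbers below are hypothetical assumptions chosen for illustration (a task of 10,000 labeled examples, 3 epochs, batch size 32); only the pretraining step count comes from BERT's authors:

```python
# Rough sketch of why fine-tuning is cheap relative to pretraining.
# Downstream-task numbers are hypothetical; the pretraining step
# count (1 million) is the one BERT's authors report.

pretraining_steps = 1_000_000

# Hypothetical downstream task.
task_examples = 10_000
epochs = 3
batch_size = 32
fine_tuning_steps = epochs * task_examples // batch_size

fraction = fine_tuning_steps / pretraining_steps
print(f"fine-tuning steps: {fine_tuning_steps}")   # 937
print(f"fraction of pretraining: {fraction:.4%}")  # 0.0937%
```

Under these assumptions, fine-tuning touches roughly a thousandth of the optimizer steps of the original run, which is why sharing one pretrained model across many downstream tasks amortizes its carbon cost so effectively.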

The Lifecycle of a Deep Natural Language Processing Model

While we judge models by how much value versus carbon they generate at training time, we should also consider the other half of machine learning: inference. The collective emissions of a deployed model can be much higher than those generated during training. Take the BERT model as an example. In the paper Strubell's team bases their work on, the authors train BERT for 1 million steps with 128,000 words per step. Without delving too deeply into the technical details, we can estimate that inference on one word consumes energy comparable to training the model on one word for half of one step.* By this assumption, the energy consumed by the whole training process of BERT is roughly equivalent to 256 billion words of inference. As large as 256 billion may seem, in 2018 Google Translate evaluated 100 billion words in just one day, suggesting it would not take long for inference energy use to dwarf the original training cost. With this in mind, a heavy architecture search up front may in fact be an energy saver if it can generate a more efficient final product. Indeed, the NAS paper reports a one-third reduction in size from a traditional model without sacrificing accuracy. With open source offering this efficiency boost to everyone who wants it, an intensive model selection process could be a win for both intelligent machines and Mother Nature.
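The arithmetic behind the 256-billion-word figure is straightforward, and under the half-step assumption above, so is the break-even point against a workload on the scale of Google Translate:

```python
# Training cost of BERT in word-steps, from the figures in the paper:
# 1 million steps, 128,000 words per step.
training_word_steps = 1_000_000 * 128_000     # 1.28e11 word-steps

# Assumption from the article: inference on one word costs about as
# much energy as training on one word for half of one step.
inference_cost_per_word = 0.5                 # word-steps per word

breakeven_words = training_word_steps / inference_cost_per_word
print(f"break-even: {breakeven_words:.3g} words")  # 2.56e+11, i.e. 256 billion

# Google Translate handled roughly 100 billion words per day in 2018.
translate_words_per_day = 100e9
print(f"days to break even: {breakeven_words / translate_words_per_day:.2f}")  # 2.56
```

At that rate, inference matches the entire training energy budget in under three days of deployment.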

The Future of Efficiency in Natural Language Processing

At each stage of a model's lifecycle — training, fine-tuning, and inference — the energy costs derive from many of the same factors, foremost model complexity and hardware efficiency. As demonstrated by Lukas Biewald on Medium, hardware efficiency currently increases by an order of magnitude every decade, while model complexity makes the same leap every year. That means that even if the energy impact of optimizing the architecture of a large transformer model is currently small, in a few years it could be hundreds or thousands of times more expensive. This math gets more frightening when we consider that the neural architecture search algorithm generated its carbon while exploring models with hundreds of millions of weights, while a single human brain contains up to one quadrillion synapses.
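A quick sketch makes the divergence concrete: if model complexity grows tenfold per year while hardware efficiency grows tenfold per decade, the net energy cost of a state-of-the-art training run grows by a factor of 10^0.9 per year.

```python
# Net growth of training energy cost if model complexity grows 10x per
# year while hardware efficiency improves only 10x per decade.

def cost_multiplier(years: float) -> float:
    model_growth = 10 ** years          # 10x per year
    hardware_gain = 10 ** (years / 10)  # 10x per decade
    return model_growth / hardware_gain # net 10^(0.9 * years)

for years in (1, 2, 3, 4):
    print(f"after {years} year(s): ~{cost_multiplier(years):,.0f}x")
```

Within three to four years, the net multiplier crosses from the hundreds into the thousands, which is the range of cost growth cited above.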

Strubell's own statistics tell us that a human being, brain included, can train and infer constantly over the course of a whole lifetime of one hundred years while generating emissions equivalent to only 8.7 car-lifetimes. Strubell's statistics estimate a greater cost for an American brain, which comes out to 28.7 car-lifetimes.** On one Google TPUv2 core, the neural architecture search takes about 57 hours, or two and a half days, to generate as much carbon as an American does in her lifetime. Since, by most measures, an American brain understands English better than any of these models, human-level language understanding is possible at a fraction of the cost we currently spend. The promise of efficient natural language understanding is there. With thousands of researchers across the globe sharing and building on each other's progress, we can achieve it.
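Those car-lifetime figures follow directly from the per-year emissions tabulated in Strubell's paper: about 11,023 lbs of CO2 per year for an average human, 36,156 lbs for an average American, against 126,000 lbs for one car over its lifetime, fuel included.

```python
# Car-lifetime equivalents of a 100-year human life, using the
# per-year CO2 figures from the comparison table in Strubell et al.

CAR_LIFETIME_LBS = 126_000      # avg car incl. fuel, one lifetime
HUMAN_LBS_PER_YEAR = 11_023     # avg human, one year
AMERICAN_LBS_PER_YEAR = 36_156  # avg American, one year
LIFETIME_YEARS = 100

human_cars = HUMAN_LBS_PER_YEAR * LIFETIME_YEARS / CAR_LIFETIME_LBS
american_cars = AMERICAN_LBS_PER_YEAR * LIFETIME_YEARS / CAR_LIFETIME_LBS
print(f"average human: {human_cars:.1f} car-lifetimes")       # 8.7
print(f"average American: {american_cars:.1f} car-lifetimes") # 28.7
```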

* In practice, this is likely an underestimate of inference cost.

**These calculations assume that all of this energy is consumed by the brain itself, so they are overestimates of brain energy consumption.

About the Authors

Dr. Samuel Leeman-Munk is a research statistician developer at SAS. He holds a PhD in computer science from North Carolina State University and several US patents on the topics of natural language processing, artificial intelligence and deep learning.

Dr. Xiaolong Li is a staff scientist at SAS. He received his PhD in computer science from the University of Florida. His research interests lie in the area of speech-to-text, natural language processing and machine learning.

Sign up for the free insideBIGDATA newsletter.
