Elite Deep Learning for Natural Language Technologies: Representation Learning

The ascendancy of representation learning for advanced machine learning deployments of natural language technologies is indisputable. When buttressed by the scale and compute of deep learning, this approach has produced a lengthy list of accomplishments in an area of Artificial Intelligence that pertains not only to machines processing human languages, but also to statistical AI in general.

By encompassing certain traits of semi-supervised learning and self-supervised learning, representation learning drastically decreases the amounts of training data required to teach models and, perhaps more importantly, the emphasis on annotated training data that can impede conventional supervised learning use cases.

However, it’s also a cornerstone for furthering the usefulness of connectionist techniques like multitask learning, zero shot learning, manifold layout techniques, and the data multiples concept, all of which have a substantial impact on deep learning’s worth for natural language technologies.

Because of representation learning, Natural Language Processing is not only much quicker and more accessible to organizations, but also more applicable to a considerably broader array of use cases that, frankly, were previously impractical to implement.

“These are techniques that we use at Indico and there are other organizations like Google and Facebook that are obviously using these,” acknowledged Indico Data CTO Slater Victoroff. “But, few and far between, they are difficult and probably still don’t represent the majority of what people are doing in deep learning.”

But, as the subsequent gains derived from them indicate, they should.

Byte Pair Encodings

In broad terms, representation learning works in a way that’s not dissimilar to the notion of key value pairs. It utilizes byte pair encodings that are analogous to the keys; each key has a numerical value that is a representation of it “like a dictionary or a lookup table,” Victoroff observed. Byte pair encodings are at the nucleus of representation learning and are generated for what Victoroff termed “meaningful chunks” of words or language. “‘ing [i-n-g] space’ might be a chunk, or ‘space um [u-m]’ might be a chunk,” Victoroff revealed about byte pair encodings. “They’re sort of one to 10 letters long…they’re mostly like one to two to three letters long.”
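The lookup-table idea above can be sketched in a few lines of Python. This is a minimal toy version of byte pair merging, not Indico's implementation: the four-word corpus, the `_` end-of-word marker standing in for the trailing space, and the names `learn_bpe_merges` and `chunk_to_id` are all assumptions made for illustration. In a real system each chunk id would index a learned vector rather than stand alone.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn byte pair merges from a list of words (toy version)."""
    # Each word becomes a tuple of symbols; "_" marks the trailing space.
    vocab = Counter(tuple(word) + ("_",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one chunk.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical corpus chosen so that an "ing + space" chunk emerges.
corpus = ["running", "jumping", "singing", "ring"]
merges, vocab = learn_bpe_merges(corpus, num_merges=3)

# The "dictionary or lookup table": every chunk gets a numerical key.
chunks = sorted({symbol for symbols in vocab for symbol in symbols})
chunk_to_id = {chunk: index for index, chunk in enumerate(chunks)}
print(merges)
print(chunk_to_id)
```

With this corpus the third merge fuses `ing` with the end-of-word marker, which mirrors the “ing space” chunk Victoroff describes.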

There are separate models for learning the representations of the byte pair encodings and for learning representations of the words the encodings comprise. Two aspects of this learning technique are immediately notable. The first is its linguistic value because, for words, representations can identify “are these things synonyms or not; are these things semantically linked; are these things syntactically linked,” Victoroff commented. The second is its versatility: representations can be for individual words, sentences, or even paragraphs, providing deployment advantages depending on user needs.

Zero Shot Learning

Representation learning produces a profound effect on deep learning in several ways, the most noteworthy of which is in the reduction of training data—and labeled training data, at that—required to create accurate predictions from advanced machine learning models. “If your representations are good enough you can make models with, they call them zero shot learning,” Victoroff pointed out. With this technique, data scientists can leverage a label as the only example on which to train the model.

For example, when building models to predict airplanes, this statistical AI method would use the very label “airplane as your one example,” Victoroff maintained. “And then there’s the corollaries: a few shots, [a] single shot. There’s all sorts of variations.” The applicability of this tenet to enterprise deployments of natural language technologies is enormous, since most use cases involving advanced machine learning necessitate exorbitant training data requirements that some consider prohibitive.
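One way to picture the label-as-only-example idea is nearest-label matching in representation space. The sketch below is an assumption about how that could look, not a description of any vendor's system: the three-dimensional vectors in `TOY_VECTORS` are hand-made placeholders rather than learned representations, and `embed` and `zero_shot_classify` are hypothetical helper names.

```python
import math

# Hypothetical, hand-made word representations (not learned from data).
TOY_VECTORS = {
    "airplane": [0.9, 0.1, 0.0],
    "jet": [0.8, 0.2, 0.1],
    "runway": [0.6, 0.3, 0.1],
    "recipe": [0.0, 0.9, 0.2],
    "oven": [0.1, 0.8, 0.1],
    "bake": [0.1, 0.9, 0.0],
}


def embed(text):
    """Average the vectors of the known words in the text, or return None."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_classify(text, labels):
    """Pick the label whose own representation is closest to the text."""
    text_vector = embed(text)
    if text_vector is None:
        return None
    # The label name itself is the only "training example" for each class.
    return max(labels, key=lambda label: cosine(text_vector, embed(label)))


LABELS = ["airplane", "recipe"]
print(zero_shot_classify("the jet left the runway", LABELS))
print(zero_shot_classify("bake it in the oven", LABELS))
```

No labeled sentences are used anywhere in the sketch; the quality of the result depends entirely on how good the representations are, which is the point Victoroff makes.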

Multitask Learning

If the low quantities of training data are one of the ways in which representation learning shatters the mold for deep learning deployments, another is its propensity to train models for multiple tasks. With more widely used supervised and unsupervised learning approaches, even if there’s a task related to what a specific model was used for (like performing entity extraction for Intelligent Process Automation on marketing data after being trained to do so for sales use cases), modelers have to start from scratch with a new model. Underpinned by representation learning, multitask learning may make this idea obsolete. “Where now you make your sentiment analysis one [model] and you have to make a second one for [text analytics, for example], you can combine all of that understanding in a consistent representation,” Victoroff noted.
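The “consistent representation” in that quote can be illustrated with one shared encoder feeding two small task heads. The sketch below is a simplification under stated assumptions: the two-dimensional vectors in `WORD_VECTORS` are hand-made placeholders, the heads are nearest-centroid classifiers chosen for brevity, and the encoder is fixed, whereas genuine multitask learning would also update the shared representation jointly across tasks.

```python
import math

# Hypothetical word representations: axis 0 loosely tracks sentiment,
# axis 1 loosely tracks topic. These are placeholders, not learned values.
WORD_VECTORS = {
    "great": [1.0, 0.0], "love": [0.9, 0.1],
    "terrible": [-1.0, 0.0], "hate": [-0.9, 0.0],
    "invoice": [0.0, 1.0], "payment": [0.1, 0.9],
    "flight": [0.0, -1.0], "hotel": [0.0, -0.9],
}


def encode(text):
    """Shared encoder used by every task: average the known word vectors."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_head(examples):
    """Fit a task head: one centroid per label in the shared space."""
    grouped = {}
    for text, label in examples:
        grouped.setdefault(label, []).append(encode(text))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(head, text):
    """Return the label whose centroid is nearest to the encoded text."""
    vector = encode(text)
    return min(head, key=lambda label: math.dist(vector, head[label]))


# Two tasks, one representation: neither head needs its own encoder.
sentiment_head = fit_head([("great love", "positive"), ("terrible hate", "negative")])
topic_head = fit_head([("invoice payment", "billing"), ("flight hotel", "travel")])

for text in ["great flight", "terrible payment"]:
    print(text, predict(sentiment_head, text), predict(topic_head, text))
```

Adding a third task here means fitting one more small head rather than building a new model from scratch, which is the efficiency the paragraph describes.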

Both the increased efficiency (on the part of models and their modelers) and the compounding value of these approaches for natural language technologies are evident. The ability of models to multitask, or perform numerous tasks, is gaining credence across the vector-based NLP space. There are competitions in which models have to solve 10 different types of NLP problems, as well as evidence that models can actually apply learning from one NLP task, such as understanding a foreign language, to another, such as understanding English. “We can show and prove that [models] are cross leveraging information across languages,” Victoroff remarked. “This is one of the things that was really surprising to us. In humans this is called the telescoping effect. If you tell a machine to learn Chinese after having taught it English, it is much, much better at Chinese than if you never taught it English in the first place.”

Data Multiples

The word ‘better’, of course, is a relative term with different meanings to different people, depending on what they’re seeking to accomplish with their NLP. The reality is that given the compute and scalability of deep learning, with unlimited data quantities even a poor deep neural network can perform at the level of the best one. The data multiples precept is based on evaluating model performance without such illimitable data amounts, and centers on pinpointing, for specific models, “how well does this work at 100 data points; how well does it work at a 1,000 data points; how well does it work at 10,000 data points?” Victoroff said.

Data multiples offer a way to compare model performance at limited data sizes, and the representation learning techniques described above often deliver “at least 2 to 4x data multiples,” Victoroff specified. A 4x data multiple for a model is one “meaning 4x it reduces the amount of data to learn,” Victoroff indicated; in other words, the model needs roughly a quarter of the training data to reach comparable performance. “Depending on where you are in how much data you have, that can correspond to almost a doubling of accuracy.”
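The arithmetic behind a data multiple can be sketched from two learning curves. Everything numeric below is a placeholder invented for illustration, not a measured result, and `examples_needed` and `data_multiple` are hypothetical helper names. The sketch assumes accuracy was recorded at the same handful of training-set sizes for a baseline model and for a representation-based one.

```python
def examples_needed(curve, target_accuracy):
    """Smallest recorded training-set size that meets the target, or None."""
    for size, accuracy in sorted(curve.items()):
        if accuracy >= target_accuracy:
            return size
    return None


def data_multiple(baseline_curve, candidate_curve, target_accuracy):
    """Ratio of data the baseline needs to data the candidate needs."""
    baseline_size = examples_needed(baseline_curve, target_accuracy)
    candidate_size = examples_needed(candidate_curve, target_accuracy)
    if baseline_size is None or candidate_size is None:
        return None
    return baseline_size / candidate_size


# Placeholder curves: training-set size -> accuracy. NOT real measurements.
baseline_curve = {100: 0.55, 1000: 0.70, 4000: 0.80, 10000: 0.88}
candidate_curve = {100: 0.68, 1000: 0.80, 4000: 0.87, 10000: 0.90}

multiple = data_multiple(baseline_curve, candidate_curve, target_accuracy=0.80)
print(f"data multiple at 80% accuracy: {multiple}x")
```

Because the function only looks at recorded sizes, the result is coarse; a finer grid of training-set sizes, or interpolation between points, would give a more precise multiple.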

Manifold Layout Techniques

For natural language technologies, a representation is a list of numbers to which data scientists can apply varying mathematical concepts for machine understanding of words. Part of the way they create meaning from these numbers (for both linguistic and downstream business purposes) is by mapping them, as points in what’s frequently a high dimensional space, into an embedding. Embeddings are a way to place representations into a definite structure “to assign meaning to those representations,” Victoroff disclosed. Manifolds are one of the most popular types of embeddings for natural language technologies because of what they have “that other [structures] might not, the concept of distance,” Victoroff divulged.

Distance is vital to a granular understanding of language for advanced machine learning models. According to Victoroff, “we do have this notion of synonyms and antonyms and parsed trees when you’re reading through a sentence. And the idea is all of those are distances. So we’ve got this notion of distance: two objects.” Manifolds allow representations to go from high dimensional spaces, including hundreds if not thousands of dimensions, to lower dimensional ones that are more readily visualized and understood in terms of their linguistic worth.
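To make the move from a high dimensional space to a lower one concrete, the sketch below projects toy six-dimensional word vectors onto two dimensions and compares distances before and after. The article does not name a specific manifold method, so a plain linear projection (principal component analysis computed by power iteration) is swapped in here purely for illustration; nonlinear manifold techniques behave differently. The vectors in `WORDS` are hand-made placeholders and the function names are hypothetical.

```python
import math

# Hypothetical 6-dimensional representations (placeholders, not learned).
WORDS = {
    "happy": [0.9, 0.8, 0.1, 0.0, 0.2, 0.1],
    "glad": [0.8, 0.9, 0.2, 0.1, 0.1, 0.1],
    "sad": [0.1, 0.0, 0.9, 0.8, 0.2, 0.1],
    "gloomy": [0.2, 0.1, 0.8, 0.9, 0.1, 0.2],
}


def mean_center(points):
    """Subtract the per-dimension mean from every point."""
    dims = len(points[0])
    means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]


def top_direction(points, iterations=200):
    """Power iteration for the direction of greatest variance."""
    dims = len(points[0])
    v = [float(d + 1) for d in range(dims)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iterations):
        # Apply X^T (X v) without building the covariance matrix.
        scores = [sum(p[d] * v[d] for d in range(dims)) for p in points]
        w = [sum(s * p[d] for s, p in zip(scores, points)) for d in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(points, n_components=2):
    """Project points onto their first n principal directions."""
    residual = mean_center(points)
    coords = [[] for _ in points]
    for _ in range(n_components):
        v = top_direction(residual)
        scores = [sum(p[d] * v[d] for d in range(len(v))) for p in residual]
        for coord, score in zip(coords, scores):
            coord.append(score)
        # Deflate: remove the variance already captured by this direction.
        residual = [
            [p[d] - s * v[d] for d in range(len(v))]
            for p, s in zip(residual, scores)
        ]
    return coords


names = list(WORDS)
low_dim = dict(zip(names, project([WORDS[n] for n in names])))

for a, b in [("happy", "glad"), ("happy", "sad")]:
    high = math.dist(WORDS[a], WORDS[b])
    low = math.dist(low_dim[a], low_dim[b])
    print(f"{a}-{b}: 6-D distance {high:.3f}, 2-D distance {low:.3f}")
```

In both spaces the near-synonyms sit closer together than the antonyms, which is the sense in which distance carries linguistic meaning.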

Growth and Development

The deep learning space is continuing to grow apace. Representation learning is one of the more meritorious approaches for paring the quantity of training data (and amount of labels) involved in natural language technology deployments. It also does so while diversifying the utility of the underlying models for applications of multitask learning. The result is that organizations can achieve far more with these models while decreasing the time and effort required to construct them and simultaneously increasing their accuracy for NLP or the other use cases for which they’re employed.

About the Author

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.

Sign up for the free insideBIGDATA newsletter.
