The Science and Practical Applications of Word Embeddings 


Word embeddings are directly responsible for many of the rapid advancements natural language technologies have made over the past couple of years. They’re foundational to the functionality of popular Large Language Models like ChatGPT and other GPT iterations. These mathematical representations are equally central to textual applications of Generative AI.

From a pragmatic perspective, word embeddings are pivotal to unlocking a trove of business value from contemporary applications of Natural Language Understanding, cognitive search, and Natural Language Generation. When paired with the Generative AI capacity of those language models, they drastically reduce the time required for everything from backend data management to customer-facing applications.

According to Pega CTO Don Schuerman, the results of these practical Generative AI use cases are transformational. Moreover, they’re horizontally applicable across organizations and deployments, underpinning the basics of workflow management and application development in general.

“We can say, ‘what are the steps of a workflow to manage a loan application or onboard a new member of a healthcare plan?’” Schuerman said. “Generative AI will say, ‘here’s what the common processes for that would look like and here’s the data model for it.’”

The ability of language understanding models to properly comprehend such user requests and produce the correct responses hinges on the efficacy of word embeddings.

Mathematical Vectors

Word embeddings are a facet of representation learning, which provides the statistical foundation for many contemporary natural language models. According to Franz CEO Jans Aasman, these embeddings represent words as vectors. “A vector is a series of elements, usually numbers,” Aasman commented. “For example, you can have a vector of 10 numbers.” These mathematical representations capture semantic meaning and, when embeddings are compared to one another in what is often a high-dimensional space, context as well. Words or phrases with similar meanings are represented closer to one another than those with dissimilar meanings are.
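To make the idea concrete, the brief sketch below compares toy word vectors using cosine similarity, a standard measure of how close two embeddings are. The three-dimensional vectors and the example words are invented for illustration; real models learn embeddings with hundreds of dimensions.

```python
import numpy as np

# Toy three-dimensional "embeddings"; the values are invented purely for illustration.
embeddings = {
    "loan":     np.array([0.9, 0.1, 0.3]),
    "mortgage": np.array([0.8, 0.2, 0.4]),
    "banana":   np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Measure how closely two word vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words score higher than unrelated ones.
print(cosine_similarity(embeddings["loan"], embeddings["mortgage"]))  # high
print(cosine_similarity(embeddings["loan"], embeddings["banana"]))    # low
```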

Aasman said representing words as numerical vectors enables machine learning models to ascribe various weights to them. In a specific text, which might include a prompt or, in the case of models like ChatGPT, the contents of the internet itself, “What you can do is take a window of like plus or minus five words or, like ChatGPT, plus or minus 500 words from the word you’re interested in,” Aasman disclosed. “You create a weight for every other word to see to what extent it influences the next word.”
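A rough sense of that windowing can be conveyed in a few lines of code. The sketch below counts co-occurrences within a window of plus or minus five words over a toy sentence; it is a simplification of how embedding models derive weights from context, not the training procedure of any particular model.

```python
from collections import defaultdict

text = (
    "word embeddings represent words as vectors so that words appearing "
    "in similar contexts receive similar vectors"
).split()

WINDOW = 5  # plus or minus five words, as described above

# Count how often each word appears near every other word.
cooccurrence = defaultdict(lambda: defaultdict(int))
for i, target in enumerate(text):
    lo, hi = max(0, i - WINDOW), min(len(text), i + WINDOW + 1)
    for j in range(lo, hi):
        if j != i:
            cooccurrence[target][text[j]] += 1

# These raw counts are the kind of signal embedding models turn into weights
# that indicate how strongly each context word influences the next word.
print(dict(cooccurrence["words"]))
```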

The Prompt Engineering Effect

Word embeddings are critical to prompt engineering, the practice of phrasing the tasks users want Generative AI models to respond to with generated text, because they allow models to understand what users are asking. That understanding is vital for accurate natural language technology applications of question answering, intelligent search, and more. In this context, via word embeddings, “You get a whole bunch of words around the word you’re interested in, and you can see to what extent it predicts the words after your word,” Aasman noted.

The Generative AI tasks prompt engineering initiates are impressive. Some involve what Gartner has termed synthetic data, which might mean “asking Generative AI to make me some sample data so I can test this application quickly,” Schuerman revealed. The same concept easily extends to generating training data (or annotations for such data) for supervised learning models. “Every developer knows the experience of filling out a spreadsheet thinking of their best friend’s dog’s name to fill it in with different data for testing,” Schuerman observed. “That wasted time is now gone.” Users can also prompt Generative AI to write code, devise data models and their fields, and create individual procedures for workflows or applications.
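A minimal sketch of that synthetic data pattern follows. It assumes the openai Python package and an API key in the environment; the prompt wording, field names, and model choice are illustrative rather than part of any vendor’s product.

```python
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

# An illustrative prompt that asks the model to fabricate test records,
# sparing a developer from hand-typing rows into a spreadsheet.
prompt = (
    "Generate 5 rows of realistic but fictional sample data for a loan "
    "application form with the fields applicant_name, loan_amount, "
    "credit_score, and employment_status. Return the rows as CSV."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # model choice is an assumption for this sketch
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```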

Model Restrictions

Word embeddings also affect the ability to tailor language generation models to select responses from a particular source. Because they are the means by which models understand what users are asking for, these embeddings are amenable to prompts that focus on a particular corpus or knowledge base. “A common pattern in a GPT use case is if you want to restrict the model, or give the model a certain set of data, you can actually bake it into the prompt,” Schuerman explained. For example, if an organization wants a language model to use developer documentation for question answering, one of the first steps is to classify that text into discrete concepts, words, phrases, or sections.

According to Schuerman, GPT is useful for providing those classifications. Those components then become part of the word embedding process; it’s incumbent upon users to include those classifications in their prompts. This technique enables users to “do two things,” Schuerman specified. “It allows us to include the most up-to-date information in responses, but also ensure that we’re restricting GPT so it doesn’t go to some other source that we don’t trust to get this answer.” This same methodology can provide timely question answering for customer service documentation, IT help desks, or searching any specific corpus for conversational responses in real time.
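One way to picture that restriction is an embedding-based retrieval step that selects the most relevant documentation passage and bakes it into the prompt. In the sketch below, the embed() function is a bag-of-words stand-in for a real embedding model, and the documentation passages and prompt template are invented for illustration.

```python
import re
import numpy as np

# Classified sections of developer documentation (invented examples).
docs = {
    "authentication": "API clients authenticate with an OAuth 2.0 bearer token.",
    "rate limits": "Each API key is limited to 100 requests per minute.",
}
question = "How do I authenticate my API client?"

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Shared vocabulary for the toy embedding.
vocab = sorted({w for t in list(docs.values()) + [question] for w in tokenize(t)})

def embed(text):
    """Stand-in embedding: bag-of-words counts over a shared vocabulary.
    A production system would call a learned embedding model instead."""
    counts = np.zeros(len(vocab))
    for w in tokenize(text):
        counts[vocab.index(w)] += 1
    return counts

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the passage whose embedding is closest to the question.
q_vec = embed(question)
best_topic = max(docs, key=lambda topic: cosine(q_vec, embed(docs[topic])))

# Bake the trusted passage into the prompt so the model answers only from it.
prompt = (
    "Answer using ONLY the documentation below.\n\n"
    f"Documentation ({best_topic}): {docs[best_topic]}\n\n"
    f"Question: {question}"
)
print(prompt)
```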

Manifold Layout Techniques 

Oftentimes, word embeddings are vectorized in high-dimensional spaces. Depending on the dimensionality of a particular embedding or series of embeddings, the sheer number of dimensions can become cumbersome, slowing computations and delaying Natural Language Processing. Several dimensionality reduction techniques, involving both supervised and unsupervised learning, can redress this issue.

Indico Data CTO Slater Victoroff characterized “manifold layout techniques” as one such approach for bringing an embedding from a higher-dimensional space to a lower-dimensional one. The benefit of doing so is that it largely preserves the semantics and relationships found in the original space in such a way “that you don’t lose a lot,” Aasman indicated. Manifold techniques are regularly employed in contemporary applications of word embeddings to reduce dimensionality, which can accelerate computations and the delivery of NLP results.
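A widely used example of such a technique is t-SNE, available in scikit-learn’s manifold module. The brief sketch below projects a handful of synthetic 50-dimensional vectors down to two dimensions; the data and parameters are illustrative, not a production recipe.

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

# Synthetic stand-ins for 50-dimensional word embeddings.
rng = np.random.default_rng(0)
high_dim_embeddings = rng.normal(size=(20, 50))

# Project to two dimensions while trying to preserve local neighborhoods.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
low_dim_embeddings = tsne.fit_transform(high_dim_embeddings)

print(low_dim_embeddings.shape)  # (20, 2)
```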

Today and Tomorrow 

It appears word embeddings will be part of statistical applications of natural language technologies, including textual representations of Generative AI, for some time. They assist with—if not enable—the prompt engineering process required for getting apposite, timely responses from Generative AI models. They are the conduit by which the enterprise can reap many of the advantages that this form of AI delivers for building applications, interacting with customers, and supplying rapid information retrieval.

Due in no small part to the utility of word embeddings, enterprise applications of Generative AI in low code settings are “an accelerator and starting point for any process; you name it,” Schuerman concluded. “You name the process and we can give you a starting point for it.”

About the Author

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.

