
The Business Value of Deep Text Analytics at Massive Document Scale

In this special guest feature, Dr. Brian Sager, CEO and co-founder of Omnity, provides five examples of the business value of deep text analytics at massive document scale. The examples are drawn from use cases within R&D, competitive strategy, patent law, and knowledge management, as well as M&A and post-merger integration. Omnity is a massively scalable research and discovery service that uncovers hidden patterns of interconnection by fusing documents within and between different areas of knowledge. Brian holds a Ph.D. in the biochemistry and genetics of pattern formation from Stanford University, after which he focused on fusing neuroscience and artificial intelligence while a Whitney fellow at both MIT and Harvard. Sager has more than 120 patents issued or pending for applications ranging from materials science to computational linguistics. He and his cofounders developed Omnity as a rapid way for knowledge workers to cope with the information explosion occurring all around them. By comparing and connecting hundreds of millions of documents through their shared meaning, Omnity helps scientists, clinicians, managers and engineers accelerate the pace and value of their innovations.

Most large organizations employ text mining on massive document sets to make information retrieval more efficient. Common text mining techniques include historical and predictive analytics, clustering, classification, feature extraction, trending, outlier analysis and data association. Given this broad range, a myriad of frameworks focus on the “how” of these processes, articulating a wide range of linguistic and statistical treatments for text processing, which have grown in both sophistication and scale over the past decades.

But “Why?” What forms of business value arise from these complex and often subtly intricate analyses? Why should businesses be motivated to support these types of activities at enterprise scale?

This question can be answered through examples of latent value often discovered as a direct result of performing text analytics at massive scale. While these examples represent just the tip of the proverbial iceberg, they are highlighted to provide a sufficiently broad vantage point from which to extend to many more use cases. These examples are drawn from use cases within R&D, competitive strategy, patent law, and knowledge management, as well as M&A and post-merger integration.

Finding assets hiding in plain sight: Repurposing value

Many high-tech science and engineering oriented companies, including those in the pharmaceutical, biotech, chemical, and semiconductor industries, invest enormous effort in research and development, often sinking hundreds of millions of dollars over several years into discrete projects. Successful projects often result in patents and scientific papers. Given such sunk costs, finding new ways of repurposing those assets can be a substantial revenue opportunity.

Within the pharmaceutical industry, the side effects of a drug used for a particular therapeutic strategy might be repurposed as a central indication in another clinical context. Linguistically based screens of the literature offer a concrete and low-cost means of identifying such repurposing candidates.

More broadly, repurposing of discrete technology components, as articulated in patents and scientific papers, enables broader lateral ideation for new product development, for example leveraging a protected invention in a new product or market context. Further, broadly and systematically synthesizing and integrating experimental data and designs across knowledge domains can lead to valuable hypothesis generation that would not have been apparent with a narrower focus.

Competitive Landscaping: Identifying latent or hidden competitors

Understanding the position of a company in a rapidly evolving competitive landscape is another source of high value for companies operating in highly dynamic environments, especially those vulnerable to the effects of disruptive innovation. For such landscaping, being the first to identify latent or hidden competitors often provides substantial competitive advantage. Systematic and early detection of trends, identification of outliers around those trends, and understanding why those outliers could be potential threats or collaborative partners all enable rapid and data-driven decisions that can serve to grow enterprise value.

Accelerating Patent Prosecution: Deep prior art detection

Fewer than 1% of highly related patents cite one another. Why? Neither inventors nor patent counsel have sufficient time to read an exponentially growing pool of prior patents, and authors can only cite what they know. As a result, following others' knowledge trails leads to increasingly sparse knowledge domains.

For these reasons, detecting relevant prior art during the writing of patents is increasingly challenging, yielding a high risk of unpleasant surprises for inventors as patents move through their examinations. Semantically comparing documents identified through shared meaning rather than shared citation transcends such frustration and surprise, allowing for deeply researched and well-constructed claims that can rapidly move through examination and issuance.
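
To make the idea of comparing documents by content rather than by citation more concrete, here is a minimal sketch in Python. It is not Omnity's method, which is not described in this article. It assumes plain TF-IDF weighting with cosine similarity, which is a lexical stand-in for true semantic comparison, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_prior_art`) are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps weights positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_prior_art(draft, corpus):
    """Return (corpus_index, score) pairs, most similar to the draft first."""
    vectors = tfidf_vectors([draft] + list(corpus))
    draft_vec, corpus_vecs = vectors[0], vectors[1:]
    scores = [(i, cosine(draft_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_prior_art(draft_text, list_of_patent_texts)` returns the corpus ordered by similarity to the draft, with no reference to who cited whom. A production system would likely replace the word-count vectors with richer semantic representations, but the ranking step would look much the same.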

Tracking Institutional Knowledge: Who knows about what?

In both large companies and government agencies, a common source of frustration is finding who within the organization knows what about a particular topic. Tracking institutional knowledge through semantic analyses of the work products of employees enables information synergy, revealing otherwise hidden pockets of specific expertise that can accelerate a project or enrich and streamline a critical enterprise process. Furthermore, semantic articulation of a team’s work products can support the replication of high-performing teams across an organization.

Detecting duplicative or overlapping content: Identifying waste

Whether unintended or not, all businesses and organizations run the risk of duplicative efforts being carried out in different business units or divisions. Semantic comparative analytics around work products and project deliverables help identify closely related efforts and inform management decisions on allocating resources and streamlining processes, reducing cost and improving ROI on related projects scattered through a large enterprise. This is especially helpful during M&A screening as well as for post-merger integration.
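
As a rough illustration of how overlapping work products might be flagged, the sketch below compares documents pairwise using word shingles and Jaccard similarity. This is an assumed, purely lexical technique rather than the semantic analytics described above, and the names `shingles`, `jaccard`, and `find_overlaps`, along with the default threshold of 0.5, are hypothetical choices made for this example.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences found in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_overlaps(docs, threshold=0.5, k=3):
    """Flag document pairs whose shingle overlap meets the threshold.

    `docs` maps a document name to its text. The result is a list of
    (name_a, name_b, score) tuples, highest overlap first.
    """
    shingle_sets = {name: shingles(text, k) for name, text in docs.items()}
    flagged = []
    for name_a, name_b in combinations(sorted(shingle_sets), 2):
        score = jaccard(shingle_sets[name_a], shingle_sets[name_b])
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Comparing every pair scales quadratically with the number of documents, so at enterprise scale this brute-force loop would need to give way to an indexing or hashing scheme. The output, a ranked list of suspiciously similar deliverables, is the kind of signal a manager could review when consolidating projects.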


Sophisticated approaches to text mining and analytics continue to evolve and accrue value with their extended use across massive document sets throughout the enterprise. In each knowledge domain and every functional role, knowledge is expanding at exponential rates, and such approaches – if properly employed – can help large companies and institutions cope with this document explosion, enabling the enterprise to profitably mine text-based data at machine scale.


Sign up for the free insideBIGDATA newsletter.



