Michael Iarrobino, Product Manager at Copyright Clearance Center, explains the benefits of text mining full articles and the limitations that often occur when mining article abstracts.
Text mining enables the rapid review and analysis of large volumes of biomedical literature, giving life science companies valuable insights to drive research and development and inform business decisions. For example, the results of mining projects can provide a greater understanding of the underlying biology behind specific diseases and how they respond to certain drugs, and support the target discovery process.
Because abstracts are easily accessible through databases such as PubMed, many researchers use them to build a collection of articles (or “corpus”) for text mining. But while abstracts provide some valuable pieces of information, there are major advantages to taking steps to obtain and use full-text articles instead.
More Facts and Relationships
One advantage is that full-text articles simply provide more information – including detailed descriptions of methods and protocols and the complete study results. As a result, full-text articles include more named entities and relationships (between those named entities) than abstracts. According to a study published in the Journal of Biomedical Informatics, full-text articles contain far more connections between biological entities than abstracts. In fact, the study’s findings showed that only 8% of the scientific claims made in the full-text articles were found in their abstracts.1
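To make the idea of entities and relationships concrete, here is a minimal Python sketch that counts entity pairs mentioned together in the same sentence, first in an abstract and then in the full text. It is an illustration only, not the method used in the cited study: the entity names (GeneA, GeneB, DrugX, DrugY), the sample text, and the helper functions are all hypothetical, and a real project would rely on curated ontologies and a proper named entity recognition tool rather than a small word list.

```python
import re
from itertools import combinations

# Hypothetical entity dictionary (placeholder names, not real genes or drugs).
ENTITY_TERMS = {"GeneA", "GeneB", "DrugX", "DrugY"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_entities(sentence):
    """Return the dictionary terms that appear as whole words in the sentence."""
    return {
        term
        for term in ENTITY_TERMS
        if re.search(rf"\b{re.escape(term)}\b", sentence)
    }


def cooccurrence_pairs(text):
    """Collect unordered entity pairs that share at least one sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        for a, b in combinations(sorted(find_entities(sentence)), 2):
            pairs.add((a, b))
    return pairs


# Made-up text for illustration only; the full text extends the abstract.
abstract = "GeneA loss increases sensitivity to DrugX."
full_text = (
    abstract
    + " In a secondary analysis, GeneB status modified the response to DrugY."
    + " GeneA and GeneB were often altered together."
)

abstract_pairs = cooccurrence_pairs(abstract)
full_pairs = cooccurrence_pairs(full_text)

print("Pairs in abstract:", len(abstract_pairs))
print("Pairs in full text:", len(full_pairs))
print("Only in full text:", sorted(full_pairs - abstract_pairs))
```

In this toy case the abstract yields one entity pair, while the full text yields that pair plus two more that come from the secondary findings. Sentence-level co-occurrence is a deliberately simple stand-in for relationship extraction; it only shows why a larger text gives a mining pipeline more to work with.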
Access to Secondary Study Findings and Adverse Event Data
While authors often include their most important findings in the abstract, secondary study findings, discoveries, and observations are frequently only found in the full-text article. Because abstracts are short and concise, they often exclude, or underrepresent, mutation data or results that are considered less relevant to, or outside the scope of, the main idea of the publication.
In addition, new discoveries are more likely to be mentioned in the full text of articles before appearing in abstracts. After a new discovery is first published in a particular journal, the research is often repeated and reported in other publications. But there is a substantial delay between when that discovery appears in full articles and when it appears in abstracts. In fact, it can take one to two years for discoveries to get into the abstract of a subsequent article.3
Lastly, full-text articles are more likely to contain information on adverse events, which is often missing from abstracts. According to a study published in BMC Medical Research Methodology, “abstracts published in high impact factor medical journals underreport harm even when the articles provide information in the main body of the article.”4 This missing information can reduce the value of abstracts as the “raw material” for mining, especially in pharmacovigilance use cases or when researchers want to make connections that haven’t yet been a major focus of the literature.
While text mining article abstracts provides some value, there are limitations as to what can be discovered through that process. Researchers need access to the full text of the articles to ensure they don’t miss vital data and novel assertions that can lead to new discoveries.
Michael Iarrobino is Product Manager at Copyright Clearance Center (CCC), the leader in content workflow and rights licensing technology. He oversees CCC’s RightFind™ XML for Mining product, a workflow solution for text mining researchers using peer-reviewed scientific articles. He has previously managed products that solve problems in the marketing technology and content discovery spaces while at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management.
- Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics, Volume 43, Issue 2, April 2010, Pages 173–189.
- Elsevier (2015) Harnessing the Power of Content – Extracting value from scientific literature: the power of mining full-text articles for pathway analysis. Available at
- Enrique Bernal-Delgado and Elliot S Fisher. “Abstracts in high profile journals often fail to report harm.” BMC Medical Research Methodology (2008); 8:14