
The Downside of Converting Full-Text PDFs to XML for Text Mining

Michael Iarrobino, Product Manager at Copyright Clearance Center, discusses the challenges researchers face when converting full-text PDF articles to XML for text mining.

To get the best results from text mining projects, researchers need access to full-text articles. Abstracts often omit essential facts and relationships, secondary study findings, and adverse event data.[1]

 

Michael Iarrobino is Product Manager at Copyright Clearance Center (CCC), the leader in content workflow and rights licensing technology. He oversees CCC’s RightFind™ XML for Mining product, a workflow solution for text mining researchers using peer-reviewed scientific articles. He has previously managed products that solve problems in the marketing technology and content discovery spaces while at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management.


However, when researchers obtain full-text articles through company subscriptions or document delivery, the documents are often provided as PDFs, a suboptimal format for text mining software. The burden then falls on researchers to convert the PDFs (potentially thousands in a bulk delivery) to XML (Extensible Markup Language), the preferred format for text mining software. Tasking highly skilled researchers with format conversion is inefficient and costly, and it introduces a number of problems in the transformed content.

Data Integrity Issues

One major challenge when converting PDFs for text mining is diminished data integrity. The conversion process can introduce errors (e.g., poor character recognition for uncommon fonts) and often removes the tags that mark sections of the article, such as introduction, conclusion, and materials and methods, making the corpus difficult to mine.

While some PDFs have text embedded in the document and apply fonts to render the text readable, they lack the comprehensive metadata and section tagging delivered by original XML documents. Other PDFs present text as an image that includes no metadata and requires OCR (optical character recognition) to identify the text. OCR often introduces bad characters and non-words. All of these issues increase the number of false positives, especially when matches are found within the bibliographic citations. In addition, researchers converting PDFs to XML may lose useful data and tables.
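The citation problem can be shown with a small sketch. It assumes a toy article, invented here for illustration, that is available both as flat text (as after a PDF conversion that lost its section tags) and as labeled sections (as in an XML source). The function name `count_hits` and the section label `references` are hypothetical, not part of any text mining product.

```python
import re

def count_hits(sections, term, skip=("references",)):
    """Count case-insensitive whole-word matches of term, section by section.

    Sections whose label appears in skip are left out of the result.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return {
        name: len(pattern.findall(text))
        for name, text in sections.items()
        if name not in skip
    }

# Toy article, invented for illustration only.
article = {
    "results": "Hepatotoxicity was observed in two patients.",
    "references": "Smith J. Hepatotoxicity of drug X. J Hepatol. 2009.",
}

# Flat text: the section boundaries are gone, so the citation also matches.
flat_text = " ".join(article.values())
flat_hits = count_hits({"all": flat_text}, "hepatotoxicity")

# Tagged text: the bibliography can be excluded before matching.
tagged_hits = count_hits(article, "hepatotoxicity")

print(flat_hits)
print(tagged_hits)
```

With section labels intact, the match inside the bibliography is dropped; with flat text there is no reliable way to tell the two matches apart.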

Given the problems with converting PDFs, many researchers opt to acquire XML feeds from publishers. But this, too, can be fraught because of the variation between publishers in how the data is delivered and the need to then normalize the material (e.g., article section names) into a single standard to text mine against it efficiently.
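As a rough illustration of that normalization step, the sketch below maps section headings from different feeds onto one set of labels. The mapping table and the function name `normalize_section_name` are assumptions made for this example; real publisher feeds use their own schemas and would need a much larger, curated mapping.

```python
import re

# Hypothetical mapping from heading variants to one canonical label.
# The variants are illustrative, not taken from any publisher's schema.
CANONICAL_SECTIONS = {
    "introduction": "introduction",
    "background": "introduction",
    "materials and methods": "methods",
    "methods": "methods",
    "experimental procedures": "methods",
    "results": "results",
    "findings": "results",
    "discussion": "discussion",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
    "concluding remarks": "conclusion",
}

def normalize_section_name(raw_title):
    """Map a raw section heading to a canonical label, or 'other' if unknown."""
    # Drop leading numbering such as "2." or "3.1 ".
    cleaned = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", raw_title)
    # Collapse whitespace and ignore case.
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    # Treat "&" as "and" so "Materials & Methods" matches.
    cleaned = cleaned.replace("&", "and")
    return CANONICAL_SECTIONS.get(cleaned, "other")

for heading in ["2. Materials & Methods", "3.1 Results", "CONCLUSIONS", "Acknowledgements"]:
    print(heading, "->", normalize_section_name(heading))
```

Headings that fall through to "other" are the ones a researcher would have to review and map by hand, which is where much of the effort goes when many publishers are involved.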

Copyright Implications

Researchers must also often negotiate with publishers for the right to text mine content for commercial purposes, as this right is not commonly included in standard subscription agreements. Converting PDFs intended for human consumption into a machine-readable format such as XML creates additional copies, and creating and storing those reformatted copies typically requires additional permission from the publisher. The absence of any mention of text mining in the terms of a subscription agreement does not mean that it is permitted. While the UK has an exception for text and data mining for non-commercial research, no such exception exists at present for researchers working on behalf of corporations.

Text mining full-text articles yields a rich set of relevant results that can help guide research and development. But if researchers must first obtain permissions from individual publishers and convert full-text article PDFs to XML before they can mine the content, the result can be lost productivity (as much as 4 to 8 weeks to prepare a corpus), a slower research and development process, and increased risk of copyright infringement.


Sources

  1. Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics, Volume 43, Issue 2, April 2010, pp. 173–189.

Comments

  1. Hi Michael, nice explanatory article. I was wondering what is the solution to the various problems you underline? Thanks!

    • Thanks, Diana! The best approach is going to be a balance between your ideal content needs and the resources and effort you can expend to meet them. For example, given the problems of PDF-to-XML conversion, obtaining an XML feed directly is the best approach for data quality. However, doing so introduces the headaches of dealing with multiple publishers and conducting data template mapping. This may be manageable if you’re working with a limited number of publishers. The same goes for obtaining permissions. I have seen some researchers go that ‘DIY’ route. When your research program's needs become more robust, you may outgrow the DIY approach and need a solution that addresses permissions and content/data across many publishers. My company, CCC, provides such a solution, RightFind XML for Mining. You can find more information at http://www.copyright.com/xmlformining
