The Downside of Converting Full-Text PDFs to XML for Text Mining

Michael Iarrobino, Product Manager at Copyright Clearance Center, explains the pitfalls of converting full-text PDFs to XML for text mining. 

To get the best results from text mining projects, researchers need access to full-text articles. Abstracts often omit essential facts and relationships, secondary study findings, and adverse event data [1].

Michael Iarrobino is Product Manager at Copyright Clearance Center (CCC), the leader in content workflow and rights licensing technology. He oversees CCC’s RightFind™ XML for Mining product, a workflow solution for text mining researchers using peer-reviewed scientific articles. He has previously managed products that solve problems in the marketing technology and content discovery spaces while at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management.

However, when researchers obtain full-text articles through company subscriptions or document delivery, the documents are often provided as PDFs, a suboptimal format for text mining software. The burden is then on researchers to convert the PDFs (potentially thousands in a bulk delivery) to XML (Extensible Markup Language), the preferred format for text mining software. Tasking highly skilled researchers with converting document formats is inefficient and costly, and it introduces a number of problems into the transformed content.

Data Integrity Issues

One major challenge when converting full-text PDFs to XML for mining is diminished data integrity. The conversion process can introduce errors (e.g., poor character recognition for uncommon fonts) and often removes tags that indicate sections of the article, such as introduction, conclusion, and materials and methods, making the corpus difficult to mine.
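
To illustrate why section tags matter, here is a minimal Python sketch. The sample document is a hypothetical, simplified JATS-like structure (the `sec`, `sec-type`, and `ref-list` names follow that convention, but real publisher XML is considerably richer), and the function name `sections_mentioning` is an assumption made for this example. It compares a section-aware search with a search over flattened text of the kind a PDF conversion might yield.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified JATS-like article; real publisher XML is richer.
SAMPLE_XML = """
<article>
  <body>
    <sec sec-type="intro">
      <title>Introduction</title>
      <p>Aspirin is widely prescribed.</p>
    </sec>
    <sec sec-type="methods">
      <title>Materials and Methods</title>
      <p>Patients received a placebo daily.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref>Smith J. Aspirin and outcomes. 2009.</ref>
    </ref-list>
  </back>
</article>
"""

def sections_mentioning(xml_text, term):
    """Return the sec-type of each body section whose paragraphs mention term."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for sec in root.iter("sec"):
        # Join the text of every paragraph inside this section.
        paragraphs = " ".join("".join(p.itertext()) for p in sec.iter("p"))
        if term.lower() in paragraphs.lower():
            hits.append(sec.get("sec-type"))
    return hits

# Section-aware search: only the introduction matches.
tagged_hits = sections_mentioning(SAMPLE_XML, "aspirin")

# Flattened text with no tags: the bibliographic citation matches as well.
flat_text = "".join(ET.fromstring(SAMPLE_XML.strip()).itertext())
flat_hit_count = flat_text.lower().count("aspirin")
```

With the tags intact, the search can report which section a term appears in and can skip the reference list entirely. Once the text is flattened, every occurrence looks the same.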

Some PDFs embed text in the document and apply fonts to render it readable, but they lack the comprehensive metadata and section tagging that original XML documents deliver. Other PDFs present text as an image with no metadata and require OCR (optical character recognition) to identify the text, and OCR often introduces bad characters and non-words. All of these issues increase false positives, especially when matches are found within the bibliographic citations. In addition, researchers converting full-text PDFs to XML may lose useful data and tables.
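
As a rough illustration of the bad characters and non-words described above, the following Python sketch flags tokens that look OCR-damaged. The pattern and the names `suspicious_tokens` and `noise_ratio` are assumptions made for this example, not an established method, and the heuristic is deliberately crude.

```python
import re

# Heuristic assumption: a digit or stray symbol sandwiched between letters
# (e.g. "pati3nt", "dos~age") suggests OCR damage. Hyphens, apostrophes,
# periods, and slashes are allowed because they occur in ordinary prose.
SUSPICIOUS = re.compile(r"[^\W\d_](?:\d|[^\w\s\-'’./])+[^\W\d_]")

def suspicious_tokens(text):
    """Return whitespace-separated tokens that match the heuristic."""
    return [tok for tok in text.split() if SUSPICIOUS.search(tok)]

def noise_ratio(text):
    """Return the fraction of tokens flagged as suspicious (0.0 if empty)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(suspicious_tokens(text)) / len(tokens)

sample = "The pati3nt received a dos~age of aspirin."
flagged = suspicious_tokens(sample)
ratio = noise_ratio(sample)
```

A score like this could help decide which converted documents need manual review before mining. It will also flag legitimate scientific tokens such as "H2O", so in practice it would need a domain vocabulary or allow-list alongside it.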

Given the problems with converting full-text PDFs, many researchers opt to acquire XML feeds from publishers. But this, too, can be fraught: publishers vary in how they deliver the data, and the material (e.g., article section names) must then be normalized into a single standard before it can be mined efficiently.
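
The normalization step can be sketched in a few lines of Python. The mapping below is a hypothetical example of how section titles from different publishers might be folded into one vocabulary. The variant titles, the canonical labels, and the function name `normalize_section_name` are all assumptions for illustration, and a real corpus would need a much larger mapping.

```python
import re

# Hypothetical mapping from publisher-specific titles to canonical labels.
CANONICAL_SECTIONS = {
    "introduction": "introduction",
    "background": "introduction",
    "materials and methods": "methods",
    "methods": "methods",
    "methodology": "methods",
    "experimental procedures": "methods",
    "results": "results",
    "findings": "results",
    "discussion": "discussion",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
    "concluding remarks": "conclusion",
}

# Leading numbering such as "2.", "3.1", or "II." followed by whitespace.
NUMBER_PREFIX = re.compile(r"^\s*(?:\d+(?:\.\d+)*|[IVXLC]+)[.)]?\s+")

def normalize_section_name(raw_title):
    """Map a raw section title to a canonical label, or "other" if unknown."""
    cleaned = NUMBER_PREFIX.sub("", raw_title)
    cleaned = cleaned.lower().replace("&", " and ")
    cleaned = re.sub(r"[^a-z ]+", " ", cleaned)      # drop punctuation and digits
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse whitespace
    return CANONICAL_SECTIONS.get(cleaned, "other")

labels = [
    normalize_section_name(title)
    for title in ["2. Materials and Methods", "II. Results:", "CONCLUSIONS", "Materials & Methods"]
]
```

Unrecognized titles fall through to "other" rather than being guessed, which keeps unmapped sections visible so the mapping can be extended as new publisher feeds are added.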

Copyright Implications

Researchers must often also negotiate with publishers for the right to text mine content for commercial purposes, as this right is not commonly included in standard subscription agreements. Converting PDFs intended for human consumption into a machine-readable format such as XML creates additional copies, and creating and storing those reformatted copies typically requires additional permission from the publisher. The absence of any mention of text mining in the terms of a subscription agreement does not mean that it is permitted. While the UK has an exception for text and data mining for non-commercial research, no such exception exists at present for researchers working on behalf of corporations.

Text mining full-text articles yields a rich set of relevant results that can help guide research and development. But if researchers must spend time obtaining permissions from individual publishers and converting full-text article PDFs to XML before they can mine the content, the result can be lost productivity (as much as 4 to 8 weeks to prepare a corpus), a slower research and development process, and increased risk of copyright infringement.

Sources

  1. Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics, Volume 43, Issue 2, April 2010, Pages 173–189.

Comments

  1. I agree with you: one major challenge of converting PDF files to XML is diminished data integrity.
    Your post is on point. Nice one.