Posted by: bluesyemre | January 2, 2013

pdfx v1.0 (Fully-automated PDF-to-XML conversion of scientific text)

PDFX is a fully-automated PDF-to-XML converter for scientific articles. It takes a full-text PDF article as input (example) and outputs the hierarchy of its distinct logical elements in an XML format.

The elements that PDFX can currently extract are:

  • Front Matter: title, abstract, author, author footnote

  • Body Matter: body text, h1, h2, h3, image, table, figure/table caption, figure/table reference, bibliographic item, bibliographic reference (citation)

  • Extras: header, footer, side note, page number, email, URI

Note: This system has been designed for processing scientific articles. While virtually any PDF file is accepted as input, output quality may be lower and processing time longer for other kinds of documents, such as entire books, slide presentations, or spreadsheets and other strictly tabular data.
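To give a rough idea of how XML output of this kind could be consumed downstream, here is a minimal Python sketch. The sample document and its tag names (article, front, body, p) are assumptions made up for illustration, loosely following the element list above; they are not PDFX's actual schema, which is not described in this post. The helper names count_elements and headings are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample output. The tag names are illustrative only and
# are NOT taken from PDFX's real schema.
SAMPLE_XML = """<article>
  <front>
    <title>An Example Paper</title>
    <abstract>Short summary.</abstract>
  </front>
  <body>
    <h1>Introduction</h1>
    <p>Some body text.</p>
    <h1>Methods</h1>
    <p>More body text.</p>
  </body>
</article>"""


def count_elements(xml_text):
    """Count how often each tag occurs, including the root element."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def headings(xml_text, tag="h1"):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag)]


if __name__ == "__main__":
    print(count_elements(SAMPLE_XML))
    print(headings(SAMPLE_XML))
```

With real converter output, the tag names passed to these helpers would need to be adjusted to whatever the XML actually contains.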
