PDF processing

Posted on 27-01-2010 by Maarten Marx | results

For the PoliticalMashup project we developed a technique to turn PDF files into well-structured XML. The technique is described in the DutchParl paper.
In that paper, we compare the quality of paragraph splitting obtained by our PDF2XML transformation and the paragraph-split OCRed texts available at the statengeneraaldigitaal.nl project. The results were rather positive for our transformation.
For Hansards (Handelingen) in particular, we can preserve the original paragraphs with high precision. The first pages of documents may need special tuning because their layout is irregular and non-standard.

Here we provide two XML files created from the Proceedings of 27 January 1994.
When comparing their quality, we recommend starting at the second page (page 3412).

  1. XML version created from the OCRed pages available at statengeneraaldigitaal.nl. No text processing was done; we only concatenated all pages of one day and added some metadata in attributes (unique references and URLs referring to the sources).
  2. XML version created with the PDF2XML software using only this input PDF file.
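
As a rough illustration of the first file, the concatenation step could look something like the sketch below. This is not the code we used: the element names (`proceedings`, `page`), the attribute names (`id`, `source`), the reference format, and the example URL and page texts are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def concatenate_pages(pages, date, base_url):
    """Wrap the OCRed text of all pages of one day in a single XML tree.

    `pages` is a list of (page_number, text) pairs. All element and
    attribute names are hypothetical, not the schema of the real files.
    """
    root = ET.Element("proceedings", {"date": date})
    for number, text in pages:
        page = ET.SubElement(
            root,
            "page",
            {
                # Unique reference for this page (assumed format).
                "id": f"{date}.p{number}",
                # URL pointing back to the source page (assumed format).
                "source": f"{base_url}/{number}",
            },
        )
        # The OCRed text is copied as-is; no text processing is applied.
        page.text = text
    return root


# Placeholder input: two pages with dummy text and a dummy base URL.
pages = [(3411, "First page text"), (3412, "Second page text")]
root = concatenate_pages(pages, "1994-01-27", "https://example.org/pages")
print(ET.tostring(root, encoding="unicode"))
```

Keeping the page boundaries as elements means the source of every piece of text stays traceable, which is what the metadata attributes are for.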

Further text quality improvements we would like to make to these files are named entity recognition, reconciliation of speakers and file numbers, OCR error correction, and normalization of spelling variations.