Text preserving transformations

Posted on 31-01-2011 by Anne | Political Mashup, xslt

We take all interviews (by Marianne Winslett) in pdf format from http://www.sigmod.org/publications/interview and pull each of them through a sequence of processors to arrive at clean and structured xml.

Below is an example, the interview with Serge Abiteboul, followed by the tools and schemas we used.

abiteboul.pdf -> abiteboul.xml -> abiteboul.trans.xml

File                  Produced by                       Schema
abiteboul.pdf         (input)
abiteboul.xml         pdftohtml (-xml -hidden)          sigmod.rnc
abiteboul.trans.xml   xslt transformation (sigmod.xsl)  sigmod.trans.rnc

In this example, the input file abiteboul.pdf is transformed with the standard Linux tool pdftohtml. The xml output (abiteboul.xml, schema sigmod.rnc) is then transformed using XSLT (sigmod.xsl) into a cleaner and more structured xml file (abiteboul.trans.xml, schema sigmod.trans.rnc).

When comparing abiteboul.xml and abiteboul.trans.xml, we see that the former has some structure, which is related to layout (pages, font types, positions on the page, etc.), whereas the latter has only semantic structure (question-answer pairs, title, author, etc.). But while the structure is different, the text in the leaf nodes is preserved (although we did glue some of them together and left a few out). Hence the name Text Preserving Transformations.
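To make the idea of text preservation concrete, here is a minimal sketch of a check, not something taken from our pipeline. It treats a transformation as text preserving when the words of the output occur, in the same order, among the words of the input, which allows for leaf nodes being glued together or left out. The two miniature documents and the element names `title`, `question` and `answer` are made up for illustration, and the sketch assumes that glued pieces are separated by whitespace.

```python
import xml.etree.ElementTree as ET


def words(xml_string):
    """Return all words found in the text nodes, in document order."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext()).split()


def is_text_preserving(source_xml, target_xml):
    """True if the target's words occur, in order, among the source's words."""
    remaining = iter(words(source_xml))
    # 'word in remaining' consumes the iterator, so word order is respected.
    return all(word in remaining for word in words(target_xml))


# Made-up miniature documents, not real pdftohtml output.
layout = """<page number="1">
  <text top="10" font="0">An interview</text>
  <text top="40" font="1">What is your</text>
  <text top="55" font="1">favourite topic?</text>
  <text top="70" font="2">Databases.</text>
  <text top="900" font="3">Page 1</text>
</page>"""

semantic = """<interview>
  <title>An interview</title>
  <qa><question>What is your favourite topic?</question>
      <answer>Databases.</answer></qa>
</interview>"""

print(is_text_preserving(layout, semantic))
```

Here the layout document plays the role of abiteboul.xml and the semantic one that of abiteboul.trans.xml: the footer "Page 1" is dropped and the two question fragments are glued, yet the check still succeeds.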

Data Cleaning

Pulling a pdf through pdftohtml -xml -hidden does not give clean xml that can be used right away. For one, the output refers to a non-existent DTD, and tags are not always properly nested. Once the output has been parsed again by Beautiful Stone Soup, it becomes usable and can be parsed by Saxon.
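The two problems can be illustrated with a small sketch that uses only the Python standard library. It is not the Beautiful Stone Soup step described above and it does not repair anything: it only removes the DOCTYPE reference and reports whether the rest is well-formed. The sample strings and the function names are assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration without an internal subset, for example
# <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
DOCTYPE = re.compile(r"<!DOCTYPE[^>\[]*>\s*")


def strip_doctype(xml_string):
    """Remove the DOCTYPE so that no parser tries to fetch the missing DTD."""
    return DOCTYPE.sub("", xml_string, count=1)


def is_well_formed(xml_string):
    """Report whether the string parses as xml (this catches bad nesting)."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


# Made-up samples: the first has improperly nested <b> and <i> tags.
raw = ('<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">\n'
       '<pdf2xml><page><text><b><i>Hello</b></i></text></page></pdf2xml>')
good = ('<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">\n'
        '<pdf2xml><page><text><b><i>Hello</i></b></text></page></pdf2xml>')

for sample in (raw, good):
    print(is_well_formed(strip_doctype(sample)))
```

A file that fails the second check still needs a lenient parser to fix the nesting before an XSLT processor can read it.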

Transformation

Now that we have clean and valid xml (conforming to the schema sigmod.rnc), we can transform it into something more useful (conforming to the schema sigmod.trans.rnc).

In short, we do the following:

  1. We take all text elements except for the last two of each page (those are footers)
  2. We annotate the text elements with the type of speech they belong to (a question, an answer, part of the intro or a new paragraph). This can be recognized from layout features.
  3. We preserve the intro elements. The first one of them is the title, the last one the author and the ones in between the full introduction.
  4. We group the annotated text elements in such a way that questions and answers form pairs (<qa />). Simplified, we do the following (where question and answer denote a text node annotated as a question and as an answer, respectively):
    for each question that is not directly preceded by a question
        glue together this question and all following questions,
            except for those that follow an answer
        glue together the first answer that follows this question and all
            following answers, except for those that follow the next question
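The same logic can be sketched outside XSLT. The Python below is only an illustration of steps 1 and 4, not a translation of sigmod.xsl: the function names, the `(kind, text)` tuples and the sample data are assumptions, and the annotation of step 2 is taken as given.

```python
def drop_footers(pages):
    """Step 1: keep every text element except the last two of each page."""
    return [text for page in pages for text in page[:-2]]


def group_qa(annotated):
    """Step 4: glue runs of questions and runs of answers into pairs.

    `annotated` is a list of (kind, text) tuples in document order.
    """
    pairs = []       # each entry is (question_parts, answer_parts)
    previous = None  # kind of the last question or answer seen
    for kind, text in annotated:
        if kind == "question" and previous != "question":
            # A question not directly preceded by a question opens a pair.
            pairs.append(([text], []))
        elif kind == "question":
            # A directly following question is glued to the open question.
            pairs[-1][0].append(text)
        elif kind == "answer" and pairs:
            # Answers are glued together until the next question starts.
            pairs[-1][1].append(text)
        else:
            continue  # not part of any pair, e.g. intro text
        previous = kind
    return [(" ".join(q), " ".join(a)) for q, a in pairs]


# Made-up sample data.
pages = [["a", "b", "footer", "1"], ["c", "footer", "2"]]
annotated = [
    ("intro", "An interview"),
    ("question", "What is"),
    ("question", "your favourite topic?"),
    ("answer", "Databases,"),
    ("answer", "mostly."),
    ("question", "Why?"),
    ("answer", "They are fun."),
]

print(drop_footers(pages))
for question, answer in group_qa(annotated):
    print("Q:", question, "| A:", answer)
```

Tracking only the kind of the previous question or answer is enough to decide whether a question opens a new pair or continues the current one.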

See our full XSLT stylesheet here: sigmod.xsl.

Data

For all pdfs except one (Avi Silberschatz), our transformations seemed to have the desired effect.
We’ve put the data in an archive and publish it here: data.tgz.

We’ve also added transformed xml documents that have all leaf nodes preserved (where we did not glue the text together). They have been transformed using this XSLT stylesheet: sigmod-with-leaves.xsl.
