DutchParl

DutchParl: A Corpus of Parliamentary Documents in Dutch

Authors: Maarten Marx and Anne Schuth

Abstract

DutchParl is a corpus that aims to contain all digitally available
parliamentary documents written in the Dutch language. The first
version of DutchParl contains documents from the parliaments of the
Netherlands, Flanders and Belgium. The corpus is divided along three
dimensions: by parliament, by source (scanned or digital documents),
and by type (written records of spoken text or other documents). The
digital collection contains more than 800 million tokens; the scanned
collection contains more than 1 billion.

All documents are available as UTF-8 encoded XML files with extensive
metadata in the Dublin Core standard. The text itself is divided into
pages, which are in turn divided into paragraphs. Every document, page
and paragraph has a unique URN which resolves to a web page. Every
page element in the XML files is connected to a facsimile image of
that page in PDF or JPEG format. We created a viewer in which the XML
text and the facsimile image can be inspected side by side. A search
engine for the complete collection is available online.
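
To make the document structure described above concrete, here is a minimal Python sketch of reading one such file. The element and attribute names (document, meta, page, p, urn, facsimile) and the sample URNs are assumptions made for illustration only and are not the actual DutchParl schema. Only the Dublin Core element namespace is the standard one.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample document: the tag names, attribute names and URNs
# are illustrative assumptions, not the real DutchParl schema.
SAMPLE_XML = """
<document urn="urn:example:doc1" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <meta>
    <dc:title>Example proceedings</dc:title>
    <dc:language>nl</dc:language>
  </meta>
  <page urn="urn:example:doc1:p1" facsimile="doc1-p1.pdf">
    <p urn="urn:example:doc1:p1:1">Eerste alinea.</p>
    <p urn="urn:example:doc1:p1:2">Tweede alinea.</p>
  </page>
</document>
""".strip().encode("utf-8")


def read_document(xml_bytes):
    """Return the document URN, Dublin Core metadata, and pages with paragraphs."""
    root = ET.fromstring(xml_bytes)

    # Collect every element in the Dublin Core namespace as a name -> text pair.
    prefix = "{" + DC_NS + "}"
    metadata = {
        el.tag[len(prefix):]: el.text
        for el in root.iter()
        if el.tag.startswith(prefix)
    }

    # Each page keeps its own URN, its facsimile reference, and its paragraphs.
    pages = []
    for page in root.iter("page"):
        paragraphs = [(p.get("urn"), p.text) for p in page.iter("p")]
        pages.append(
            {
                "urn": page.get("urn"),
                "facsimile": page.get("facsimile"),
                "paragraphs": paragraphs,
            }
        )

    return {"urn": root.get("urn"), "metadata": metadata, "pages": pages}


doc = read_document(SAMPLE_XML)
```

With a layout like this, a paragraph-level URN can serve as a stable key when aligning text with its facsimile page. The real files may organize these links differently.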

The corpus is available for download in several formats. The corpus
can be used for corpus-linguistic and political science research,
and is suitable for performing scalability tests for XML information
systems.

Last modified on 30-10-2009 by Maarten Marx
