information and language processing systems
isla, university of amsterdam

Who is associated with Geert Wilders?

We answered this question by collecting news articles from the web that mention Geert Wilders, together with the reactions readers posted to these articles. In the articles and reactions we recognized named entities such as persons, locations and organizations, normalized them to a common form (where possible, the title of their Dutch Wikipedia page), and counted the normalized forms.
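The counting step can be illustrated with a minimal sketch. The input pairs below are made up for illustration (the ID `reaction-1` and the helper name `count_normalized_forms` are assumptions, not part of the dataset); the sketch only shows the idea of tallying normalized forms over articles and reactions.

```python
from collections import Counter

# Hypothetical (location, normalized form) pairs. A location is either an
# article permalink or a reaction ID; "reaction-1" is an invented ID.
mentions = [
    ("http://www.nieuws.nl/456348", 'WIKI-"Jan Marijnissen"'),
    ("http://www.nieuws.nl/456348", 'WIKI-"Geert Wilders"'),
    ("http://www.nieuws.nl/456348", 'WIKI-"Jan Marijnissen"'),
    ("reaction-1", 'WIKI-"Geert Wilders"'),
    ("reaction-1", 'C-"Novum"'),
]


def count_normalized_forms(pairs):
    """Count how often each normalized form occurs over all locations."""
    return Counter(form for _location, form in pairs)


counts = count_normalized_forms(mentions)
for form, n in counts.most_common(3):
    print(form, n)
```

Sorting the counter by frequency gives a ranking of the kind described here, with the most frequently mentioned entities first.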

Among the top-scoring entities are persons such as Geert Wilders, Rita Verdonk, and Adolf Hitler, organizations such as de Partij voor de Vrijheid, and concepts such as Allah, Islam and Mein Kampf.

The techniques used to create this dataset are discussed in the following two papers:

Co-occurrence data is available

We make part of these data available for research, in the form presented in Table 1 below. Each row contains a normalized name, the location where this name was found, and some further information. The location is either a permalink to a news article or a (meaningless) ID number of a reaction to a news article. A full explanation of the data is given below Table 1. The data comes in two tab-separated text files:

Note that in some cases the news article is still available at the permalink, but the reactions have been removed. With de Telegraaf, for example, this seems to happen after about two months.

Description of the data

The Wilders collection consists of 2,673 news articles downloaded from 33 news websites in the period April 2, 2007 to February 2, 2008, each containing the word "Wilders" in its title or body text. Only 435 of these news articles have at least one reaction. The total number of reactions is 48,957, and the average number of reactions per article with at least one reaction is 112.55. The maximum number of reactions on a single article in the corpus is 982, on the Nu.nl article http://www.nu.nl/news/1187311/11/Wilders_wil_Koran_verbieden.html .

Table 1: Some of the entities occurring in "Tichelaar woedend op Marijnissen"

| surface form | type | normalized form | proof | source info |
|---|---|---|---|---|
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Novum | LOC | C-"Novum" | PROOF-NULL | http://www.nieuws.nl/456348 |
| Jan Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS_NGRAM_ | http://www.nieuws.nl/456348 |
| Tweede Kamer | ORG | WIKI-"Tweede Kamer der Staten-Generaal" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| PvdA | PER | WIKI-"Partij van de Arbeid (Nederland)" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Jacques Tichelaar | PER | WIKI-"Jacques Tichelaar" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| Volkskrant | ORG | WIKI-"De Volkskrant" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| Jan Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS_NGRAM_ | http://www.nieuws.nl/456348 |
| Volkskrant | ORG | WIKI-"De Volkskrant" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| De Telegraaf | ORG | WIKI-"De Telegraaf" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Geert Wilders | PER | WIKI-"Geert Wilders" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Nebahat Albayrak | PER | WIKI-"Nebahat Albayrak" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| Ahmed Aboutaleb | PER | WIKI-"Ahmed Aboutaleb" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| wildERS | PER | WIKI-"Geert Wilders" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| Tichelaar | PER | WIKI-"Jacques Tichelaar" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| VVD | ORG | WIKI-"Volkspartij voor Vrijheid en Democratie" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |

Explanation of the fields

In Table 1, the first column shows the surface form, the second its entity type, the third the normalized form, the fourth the proof (how the normalized form was determined), and the last the URL of the news article.
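As an illustration of reading such rows, here is a minimal sketch. It assumes that each line of the tab-separated files carries exactly the five fields of Table 1, in the same order; the actual files may contain further fields. The names `Mention` and `read_mentions` are invented for this example.

```python
import csv
import io
from typing import NamedTuple


class Mention(NamedTuple):
    surface_form: str
    entity_type: str
    normalized_form: str
    proof: str
    source: str


def read_mentions(handle):
    """Yield one Mention per tab-separated line, skipping lines without five fields."""
    # QUOTE_NONE keeps the double quotes inside the normalized form untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 5:
            yield Mention(*row)


# Two rows from Table 1, written out as tab-separated text.
sample = (
    'Marijnissen\tPER\tWIKI-"Jan Marijnissen"\tPROOF-PAGE_TITLE\thttp://www.nieuws.nl/456348\n'
    'Novum\tLOC\tC-"Novum"\tPROOF-NULL\thttp://www.nieuws.nl/456348\n'
)
mentions = list(read_mentions(io.StringIO(sample)))
for m in mentions:
    print(m.surface_form, "->", m.normalized_form)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle; the file names are not given here, so none is assumed.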

There are four standard types of entities: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). Each normalized form starts with a tag, WIKI or C: WIKI means that the normalized form is the title of a Wikipedia article, and C means that the surface form itself serves as the normalized form.
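A normalized form can be split into its tag and its name with a small helper. This is only a sketch: it assumes every normalized form looks like the examples in Table 1 (a tag, a hyphen, and a name in double quotes), and the function names are hypothetical.

```python
import re

# Assumed layout, based on Table 1: TAG-"name", where TAG is WIKI or C.
NORMALIZED_RE = re.compile(r'^(WIKI|C)-"(.*)"$')


def split_normalized_form(value):
    """Return (tag, name) for a string like WIKI-"Jan Marijnissen"."""
    match = NORMALIZED_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected normalized form: {value!r}")
    return match.group(1), match.group(2)


def is_wikipedia_title(value):
    """True if the normalized form carries the WIKI tag."""
    return split_normalized_form(value)[0] == "WIKI"


print(split_normalized_form('WIKI-"Partij van de Arbeid (Nederland)"'))
print(is_wikipedia_title('C-"Novum"'))
```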

The proof column shows how a surface form was normalized. For example, the first row of the table shows that "Marijnissen" is normalized to "Jan Marijnissen", which is the title of a Wikipedia page. If a surface form is normalized to itself (e.g., "Novum"), its proof value is "PROOF-NULL". If the proof contains "IN_DOCS", the surface form was resolved to a normalized form that already exists in the document. For example, the third-to-last row of the table shows that "wildERS" is resolved to "Geert Wilders", which exists in the document. If the proof contains "NGRAM", the surface form was detected and normalized by splitting the named entity recognized by a NER tool into n-grams; for more details see the papers mentioned above.
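The proof values can be turned into simple flags, as in the sketch below. The function name `describe_proof` and the flag names are invented for this example. The reading of "PAGE_TITLE" as a match on a Wikipedia page title is an assumption based on the first row of Table 1; the other three flags follow the description above.

```python
def describe_proof(proof):
    """Map a proof string such as PROOF-IN_DOCS_NGRAM_ to boolean flags."""
    prefix = "PROOF-"
    body = proof[len(prefix):] if proof.startswith(prefix) else proof
    return {
        # Surface form normalized to itself.
        "unchanged": body == "NULL",
        # Assumption: matched against a Wikipedia page title.
        "page_title": "PAGE_TITLE" in body,
        # Resolved to a normalized form already present in the document.
        "in_docs": "IN_DOCS" in body,
        # Found by splitting the recognized entity into n-grams.
        "ngram": "NGRAM" in body,
    }


for value in ("PROOF-NULL", "PROOF-IN_DOCS", "PROOF-PAGE_TITLE_NGRAM_"):
    print(value, describe_proof(value))
```

Substring checks are used instead of exact matching because some proof values in Table 1 combine several markers and end in an underscore.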

Description of the Wilders data

The Wilders collection consists of 2,673 news articles, downloaded from 33 news websites by checking for the word "Wilders" in their title or body text. Only 435 of these news articles have at least one reaction. The total number of reactions is 48,957, and the average number of reactions per article with at least one reaction is 112.55.

In the collection there are 25,545 unique normalized forms, of which only 6,709 (26.3%) are Wikipedia article titles.

The following table shows, for each of the 33 news websites, the number of news articles and the number of reactions to these articles.

| no. | News source | news articles | reactions | avg. reactions per article |
|---|---|---|---|---|
| 1 | http://www.telegraaf.nl/ | 446 | 16,673 | 37.4 |
| 2 | http://www.nu.nl/ | 318 | 12,264 | 38.6 |
| 3 | http://feeds.volkskrant.nl/ | 279 | 0 | 0 |
| 4 | http://www.ad.nl/ | 213 | 15,384 | 72.23 |
| 5 | http://www.refdag.nl/ | 193 | 0 | 0 |
| 6 | http://www.nieuws.nl/ | 145 | 0 | 0 |
| 7 | http://www.ld.nl/ | 115 | 0 | 0 |
| 8 | http://www.leeuwardercourant.nl/ | 114 | 0 | 0 |
| 9 | http://feeds.nos.nl/ | 107 | 0 | 0 |
| 10 | http://feeds.depers.nl/ | 105 | 0 | 0 |
| 11 | http://www.dvhn.nl/ | 103 | 0 | 0 |
| 12 | http://www.nhd.nl/ | 100 | 0 | 0 |
| 13 | http://spitsnet.nl/ | 69 | 0 | 0 |
| 14 | http://www.parool.nl/ | 68 | 0 | 0 |
| 15 | http://www.trouw.nl/ | 56 | 1,680 | 30 |
| 16 | http://nl.sitestat.com/ | 50 | 0 | 0 |
| 17 | http://www.brabantsdagblad.nl/ | 44 | 0 | 0 |
| 18 | http://www.dag.nl/ | 43 | 0 | 0 |
| 19 | http://www.at5.nl/ | 40 | 0 | 0 |
| 20 | http://www.demorgen.be/ | 30 | 43 | 1.43 |
| 21 | http://www.pzc.nl/ | 22 | 0 | 0 |
| 22 | http://weblogs.nrc.nl/ | 14 | 2,858 | 204.14 |
| 23 | http://www.bndestem.nl/ | 11 | 0 | 0 |
| 24 | http://www.tctubantia.nl/ | 10 | 0 | 0 |
| 25 | http://feeds.feedburner.com/ | 10 | 0 | 0 |
| 26 | http://www.fd.nl/ | 9 | 0 | 0 |
| 27 | http://www.haarlemsdagblad.nl/ | 2 | 0 | 0 |
| 28 | http://www.almerevandaag.nl/ | 2 | 0 | 0 |
| 29 | http://www.waarmaarraar.nl/ | 1 | 55 | 55 |
| 30 | http://www.leidschdagblad.nl/ | 1 | 0 | 0 |
| 31 | http://www.gooieneemlander.nl/ | 1 | 0 | 0 |
| 32 | http://www.frieschdagblad.nl/ | 1 | 0 | 0 |
| 33 | http://news.bbc.co.uk/ | 1 | 0 | 0 |
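The per-source averages and the overall reaction total can be recomputed from the table. The sketch below copies the article and reaction counts of the seven sources that have reactions; the remaining sources contribute zero reactions. The variable and function names are chosen for this example only.

```python
# (news articles, reactions) for the sources with at least one reaction,
# copied from the table above.
with_reactions = {
    "http://www.telegraaf.nl/": (446, 16673),
    "http://www.nu.nl/": (318, 12264),
    "http://www.ad.nl/": (213, 15384),
    "http://www.trouw.nl/": (56, 1680),
    "http://www.demorgen.be/": (30, 43),
    "http://weblogs.nrc.nl/": (14, 2858),
    "http://www.waarmaarraar.nl/": (1, 55),
}


def average_reactions(articles, reactions):
    """Reactions per article; 0.0 for a source without articles."""
    return reactions / articles if articles else 0.0


total_reactions = sum(r for _a, r in with_reactions.values())
averages = {
    source: round(average_reactions(a, r), 2)
    for source, (a, r) in with_reactions.items()
}

print(total_reactions)  # 48957, the total stated for the collection
for source, avg in averages.items():
    print(source, avg)
```

The sum of the per-source reaction counts matches the stated total of 48,957 reactions.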