We answered this question by collecting news articles from the web that mention Geert Wilders, together with the reactions that readers posted to these articles. In the articles and the reactions we recognized named entities such as persons, locations and organizations, normalized them to a common form (where possible, the title of their Dutch Wikipedia page), and counted the normalized forms.
Among the top-scoring entities are persons such as Geert Wilders, Rita Verdonk and Adolf Hitler, organizations such as de Partij voor de Vrijheid, and concepts such as Allah, Islam and Mein Kampf.
The techniques used to create this dataset are discussed in the following two papers:
We make part of these data available for research, in the form presented in table 1 below. Each row contains a normalized name, the location where this name was found, and some further information. The location is either a permalink to a news article or a (meaningless) ID number of a reaction to a news article. A full explanation of the data is given below table 1. The data come in two tab-separated text files:
Note that in some cases the news article is still available at the permalink, but the reactions have been removed. With de Telegraaf, for example, this seems to happen after about two months.
The Wilders collection consists of 2,673 news articles that were downloaded from 33 news websites in the period April 2, 2007 to February 2, 2008 and that had the word "Wilders" in their title or in their body text. Only 435 of these news articles have at least one reaction. The total number of reactions is 48,957, and the average number of reactions per article with at least one reaction is 112.55. The maximum number of reactions on one article in the corpus is 982, on the Nu.nl article http://www.nu.nl/news/1187311/11/Wilders_wil_Koran_verbieden.html .
| surface form | type | normalized form | proof | source info |
|---|---|---|---|---|
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Novum | LOC | C-"Novum" | PROOF-NULL | http://www.nieuws.nl/456348 |
| Jan Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS_NGRAM_ | http://www.nieuws.nl/456348 |
| Tweede Kamer | ORG | WIKI-"Tweede Kamer der Staten-Generaal" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| PvdA | PER | WIKI-"Partij van de Arbeid (Nederland)" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Jacques Tichelaar | PER | WIKI-"Jacques Tichelaar" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| Volkskrant | ORG | WIKI-"De Volkskrant" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| Jan Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS_NGRAM_ | http://www.nieuws.nl/456348 |
| Volkskrant | ORG | WIKI-"De Volkskrant" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Marijnissen | PER | WIKI-"Jan Marijnissen" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| De Telegraaf | ORG | WIKI-"De Telegraaf" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Geert Wilders | PER | WIKI-"Geert Wilders" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
| Nebahat Albayrak | PER | WIKI-"Nebahat Albayrak" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| Ahmed Aboutaleb | PER | WIKI-"Ahmed Aboutaleb" | PROOF-PAGE_TITLE_NGRAM_ | http://www.nieuws.nl/456348 |
| wildERS | PER | WIKI-"Geert Wilders" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| Tichelaar | PER | WIKI-"Jacques Tichelaar" | PROOF-IN_DOCS | http://www.nieuws.nl/456348 |
| VVD | ORG | WIKI-"Volkspartij voor Vrijheid en Democratie" | PROOF-PAGE_TITLE | http://www.nieuws.nl/456348 |
In table 1, the first column shows the surface form, the second column its entity type, the third column the normalized form, the fourth column the proof (how the normalized form was determined), and the last column the source: the URL of the news article, or the ID of the reaction, in which the surface form was found.
There are four standard types of entities: person names (PER), organizations (ORG), locations (LOC)
and miscellaneous names (MISC). Each normalized form starts with a tag, either WIKI or C.
WIKI means that the normalized form is the title of a Wikipedia article; C means that the
surface form itself serves as the normalized form.
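The row layout described above can be read with a few lines of Python. This is a minimal sketch, not the code used to build the collection. It assumes that each line of the tab-separated files holds the same five fields, in the same order, as table 1; the names `Mention`, `parse_normalized` and `read_mentions` are hypothetical and chosen for illustration only.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    surface: str      # surface form as found in the text
    entity_type: str  # PER, ORG, LOC or MISC
    tag: str          # "WIKI" or "C"
    name: str         # normalized name without tag and quotes
    proof: str        # e.g. "PROOF-PAGE_TITLE"
    source: str       # permalink of the article or ID of the reaction


def parse_normalized(field: str) -> Tuple[str, str]:
    """Split 'WIKI-"Jan Marijnissen"' into ('WIKI', 'Jan Marijnissen')."""
    # Only the first hyphen separates the tag from the name, so names
    # that contain hyphens themselves stay intact.
    tag, _, quoted = field.partition("-")
    return tag, quoted.strip('"')


def read_mentions(handle) -> List[Mention]:
    # QUOTE_NONE keeps the double quotes inside the normalized form literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    mentions = []
    for row in reader:
        if len(row) != 5:  # assumption: skip lines that do not have five fields
            continue
        surface, entity_type, normalized, proof, source = row
        tag, name = parse_normalized(normalized)
        mentions.append(Mention(surface, entity_type, tag, name, proof, source))
    return mentions


# Two rows from table 1, written out as tab-separated lines.
SAMPLE = (
    'Marijnissen\tPER\tWIKI-"Jan Marijnissen"\tPROOF-PAGE_TITLE\thttp://www.nieuws.nl/456348\n'
    'Novum\tLOC\tC-"Novum"\tPROOF-NULL\thttp://www.nieuws.nl/456348\n'
)
mentions = read_mentions(io.StringIO(SAMPLE))
for m in mentions:
    print(m.surface, "->", m.tag, m.name)
```

For the real files, the `io.StringIO` object would be replaced by an opened file handle.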
The proof column shows how a surface form was normalized.
For example, the first row of the table shows that "Marijnissen" is normalized to "Jan Marijnissen", which is the
title of a Wikipedia page.
If a surface form is normalized to itself (e.g., "Novum"), its proof value is "PROOF-NULL".
Some rows of the table contain "IN_DOCS" in the proof column. This means that the
surface form is resolved to a normalized form that already exists in the document.
For example, the third-to-last row of the table shows that "wildERS" is resolved to "Geert Wilders", which exists in the document.
Some rows of the table contain "NGRAM" in the proof column. This means that the surface form
was detected and then normalized by splitting the named entity recognized by a NER tool into n-grams; for
more details, see the above-mentioned papers.
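The proof values can be turned into simple flags by looking for these keywords. The sketch below is only an illustration: the function name `describe_proof` is hypothetical, and reading "PAGE_TITLE" as "matched a Wikipedia page title" is our interpretation of the first example above, not a documented definition.

```python
def describe_proof(proof: str) -> dict:
    """Decode a proof value such as 'PROOF-IN_DOCS_NGRAM_' into boolean flags."""
    # Drop the leading "PROOF-" marker if it is present.
    body = proof[len("PROOF-"):] if proof.startswith("PROOF-") else proof
    return {
        "self_normalized": body == "NULL",   # surface form is its own normalized form
        "page_title": "PAGE_TITLE" in body,  # assumption: matched a Wikipedia page title
        "in_docs": "IN_DOCS" in body,        # resolved to a form found in the document
        "ngram": "NGRAM" in body,            # normalized via n-grams of the NER output
    }


# The five proof values that occur in table 1.
for value in (
    "PROOF-NULL",
    "PROOF-PAGE_TITLE",
    "PROOF-PAGE_TITLE_NGRAM_",
    "PROOF-IN_DOCS",
    "PROOF-IN_DOCS_NGRAM_",
):
    print(value, describe_proof(value))
```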
The collection contains 25,545 unique normalized forms, of which only 6,709 (26.3%) are Wikipedia article titles.
The following table shows, for each of the 33 news websites, the number of news articles and the number of reactions to these articles.
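Counts of this kind can be reproduced from the normalized-form column. The following sketch counts how often each normalized form occurs and what share of the unique forms carries the WIKI tag. It runs on a handful of normalized forms copied from table 1, not on the full collection, and the function name `wiki_share` is hypothetical.

```python
from collections import Counter


def wiki_share(normalized_forms):
    """Return (counts per form, number of unique forms, share of unique forms tagged WIKI)."""
    counts = Counter(normalized_forms)
    unique = len(counts)
    wiki = sum(1 for form in counts if form.startswith("WIKI-"))
    return counts, unique, wiki / unique


# A few normalized forms taken from table 1.
forms = [
    'WIKI-"Jan Marijnissen"',
    'C-"Novum"',
    'WIKI-"Jan Marijnissen"',
    'WIKI-"Tweede Kamer der Staten-Generaal"',
    'WIKI-"Geert Wilders"',
    'WIKI-"Geert Wilders"',
]
counts, unique, share = wiki_share(forms)
print(counts.most_common(2))
print(unique, share)

# The share reported for the full collection: 6,709 of 25,545 unique forms.
reported_share = 6709 / 25545
print(round(100 * reported_share, 1))
```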