Linking Hansards to related news articles

Posted on 15-04-2014 by Maarten Marx | DiLiPaD, ExPoSe, ODE, parliament

We describe a simple technique for linking news articles to debates in Parliament.
The technique uses the news search engine EMM NewsExplorer.
As search strings we use:

  • the date of the debate
  • the speakers
  • the first ten words from a unigram parsimonious language model created from the debate

Results on oral questions are promising. In this post we explain how we find the relevant news articles and how we evaluate the results. Code is provided.

As an example we consider an oral question about the lack of growth of the Dutch economy.
We want to find news articles which are relevant to this debate.

Step 1: summarize the debate

We first summarize the debate with 10 keywords: the words that best distinguish this debate from all other debates in a given period.
We also extract all named entities from the debate and link them to Wikipedia and to a database of members of parliament and government.
Here is the debate again, preceded by this summary.
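
The keyword step can be sketched as follows. This is a minimal Python illustration of a unigram parsimonious language model estimated with EM, not the XQuery code of this post. The function name, the mixing weight `lam`, and the number of iterations are assumptions chosen for the example, and the background corpus is assumed to include the debate itself.

```python
from collections import Counter


def parsimonious_keywords(doc_tokens, corpus_tokens, lam=0.1, iterations=50, top_k=10):
    """Return the top_k terms of a unigram parsimonious language model.

    doc_tokens:    tokens of one debate
    corpus_tokens: tokens of all debates in the period (assumed to include the debate)
    lam:           weight of the document model in the mixture (assumed value)
    """
    tf = Counter(doc_tokens)
    cf = Counter(corpus_tokens)
    corpus_size = sum(cf.values())
    doc_size = sum(tf.values())

    # Background model P(t|C), and the maximum-likelihood start for P(t|D).
    p_c = {t: cf[t] / corpus_size for t in tf}
    p_d = {t: tf[t] / doc_size for t in tf}

    for _ in range(iterations):
        # E-step: share of each term's count that the document model explains.
        expected = {}
        for t in tf:
            num = lam * p_d[t]
            den = (1 - lam) * p_c[t] + num
            expected[t] = tf[t] * num / den if den > 0 else 0.0
        # M-step: renormalise to obtain the new document model.
        total = sum(expected.values())
        p_d = {t: expected[t] / total for t in expected}

    # Highest probability first; ties broken alphabetically.
    return sorted(p_d, key=lambda t: (-p_d[t], t))[:top_k]
```

Words that are frequent everywhere are explained by the background model and lose probability mass in each iteration, so the words that remain at the top are those specific to this debate.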

Step 2: find relevant news articles

Using the summary we create an advanced search query which we feed to EMM NewsExplorer. The following settings worked well (at least for oral questions):

  • The article should be published on the date of the debate.
  • The article should contain the name of at least one of the speakers (with oral questions these are the MP who asked the question and the member of government who answered it).
  • The article should contain at least one of the ten words from the summary.
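
The three constraints above can be written down as a small data structure, one query per speaker. The sketch below is in Python and is purely illustrative: `NewsQuery`, `build_queries` and `matches` are hypothetical names, no EMM NewsExplorer URL or parameter is modelled, and the substring test in `matches` is a crude stand-in for what the search engine does.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsQuery:
    """One search request (hypothetical structure, not the EMM NewsExplorer API)."""
    published_on: date   # the date of the debate
    speaker: str         # name that must occur in the article
    any_of: tuple        # at least one of these summary words must occur


def build_queries(debate_date, speakers, keywords):
    """Build one query per speaker, each using at most the ten summary words."""
    words = tuple(keywords[:10])
    return [NewsQuery(debate_date, speaker, words) for speaker in speakers]


def matches(query, article_date, article_text):
    """Check the three constraints locally with simple substring tests."""
    text = article_text.lower()
    return (
        article_date == query.published_on
        and query.speaker.lower() in text
        and any(word.lower() in text for word in query.any_of)
    )
```

Issuing one query per speaker and merging the results afterwards expresses the "at least one of the speakers" condition without needing an OR over names in the search interface.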

An example search result for the above oral question is given here. Fortunately, this result is also provided as an XML file in RSS format.
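
Reading such an RSS result could look like the following Python sketch. It assumes the feed uses the standard RSS 2.0 elements `item`, `title`, `link` and `source` without XML namespaces; the actual EMM NewsExplorer feed may differ, and `parse_rss_hits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_rss_hits(rss_text):
    """Extract title, link and source from each <item> of an RSS 2.0 document.

    Assumes plain, un-namespaced RSS 2.0 element names; missing elements
    yield empty strings.
    """
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        hits.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "source": (item.findtext("source") or "").strip(),
        })
    return hits
```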

Step 3: Rerank and restrict the list of hits

We collect the search results for each speaker and combine them. We then restrict the results to those which contain at least one of the speakers as a named entity. We then group hits by simply matching their titles. Finally we rerank the results by the number of entries in a group, breaking ties alphabetically.
The result is presented in a simple table giving the title of the article and the source.
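
The restrict, group and rerank steps can be sketched in Python as below. The input format (dictionaries with `title`, `source` and `entities`) is an assumption, as is the normalisation of titles by case and surrounding whitespace before matching. The text above does not say which field is used for the alphabetical tie-break; the title is used here.

```python
from collections import defaultdict


def group_and_rerank(hits, speakers):
    """Filter hits on speakers, group them by title, and rank groups by size.

    hits:     list of dicts with keys 'title', 'source', 'entities' (assumed format)
    speakers: names of the speakers in the debate
    Returns a list of (title, group size, source of first hit) tuples.
    """
    wanted = {name.lower() for name in speakers}
    groups = defaultdict(list)
    for hit in hits:
        entities = {entity.lower() for entity in hit["entities"]}
        # Keep only hits that mention at least one speaker as a named entity.
        if entities & wanted:
            groups[hit["title"].strip().lower()].append(hit)

    # Largest group first; ties broken alphabetically by title (assumption).
    ranked = sorted(groups.values(), key=lambda g: (-len(g), g[0]["title"].lower()))
    return [(g[0]["title"], len(g), g[0]["source"]) for g in ranked]
```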


Title                                                     Source
Krimpende economie doet zorgen toenemen (3)               bnr
Eurocrisis legt Nederlandse economie lam                  Trends
Nederland maakt zich zorgen om zijn krimpende economie    demorgen
„Stop import product uit Chinees dwangkamp”               refdag
‘Stop import uit dwangkamp’                               telegraaf
Verhagen: ‘We voelen de onrust, de banengroei stagneert’  volkskrant

Evaluation

The above table contains 6 results, of which 2 are not relevant (they are about imports from Chinese labour camps). This gives a precision of 4/6, or about 67%. A better grouping algorithm (e.g. based on cosine similarity of titles) would merge the two non-relevant hits into one group and two of the relevant hits into another. That leaves 4 results of which 3 are relevant, a precision of 75%.
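
The suggested improvement can be tried on the six titles from the table. The Python sketch below groups titles greedily by cosine similarity of their word counts. The threshold of 0.4, the tokenisation, and the greedy comparison against the first title of each group are assumptions made for this example, not a tuned method.

```python
import math
import re
from collections import Counter

TITLES = [
    "Krimpende economie doet zorgen toenemen",
    "Eurocrisis legt Nederlandse economie lam",
    "Nederland maakt zich zorgen om zijn krimpende economie",
    "„Stop import product uit Chinees dwangkamp”",
    "‘Stop import uit dwangkamp’",
    "Verhagen: ‘We voelen de onrust, de banengroei stagneert’",
]


def cosine(title_a, title_b):
    """Cosine similarity between the bag-of-words vectors of two titles."""
    a = Counter(re.findall(r"\w+", title_a.lower()))
    b = Counter(re.findall(r"\w+", title_b.lower()))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_titles(titles, threshold=0.4):
    """Add each title to the first group whose first title is similar enough."""
    groups = []
    for title in titles:
        for group in groups:
            if cosine(group[0], title) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups


def precision(relevant_flags):
    """Fraction of results judged relevant."""
    return sum(relevant_flags) / len(relevant_flags)


groups = group_titles(TITLES)
# Relevance as judged in the text: the labour-camp titles are not relevant.
flags = ["dwangkamp" not in group[0].lower() for group in groups]
print(len(groups), precision(flags))
```

With this threshold the six titles fall into four groups, three of them relevant, which agrees with the 75% figure above.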

Code

The complete code is written as a single XQuery file, LinkHansardsToNews.xquery.
If run locally on the debate mentioned above, it produces the above table, provided that EMM NewsExplorer has not changed its database or search algorithm. The code is also available as an XQuery module.