Another post on data deduplication. In English, because it concerns British data. We downloaded all Commons debates from 1981 to 2001 from the Hansard archives and put them in one big XML file of 365 MB, containing 50 million words.

With a simple XPath query, //speakers, we found all speakers: 327,019 speeches were made.
Good, but how many people spoke in that period? We ran:

bash-3.00$ myxpathx //member The_Official_Report_House_of_Commons_1981_to_2004.xml >speakers
bash-3.00$ cat speakers |sort|uniq >uniqspeakers
bash-3.00$ wc uniqspeakers
7846 32005 220520 uniqspeakers

and found 7846 unique speakers... But are these really unique speakers, or just unique strings?


Write a script that correctly deduplicates the attached data set. Also provide some way of evaluating the correctness of the script. It would be nice if the names were harmonised against an authoritative list; Wikipedia could be one such source.
The uniqspeakers file is attached; for each speaker in uniqspeakers it gives the number of speeches made. The speakers file (which can be used when trying co-reference tricks) can be obtained from Maarten Marx.

Another Challenge contains a list of 30,429 members of parliament during 1803-2001. It is obviously full of mistakes. How to clean this up?
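
For the evaluation part of the first challenge, one common option is to hand-label a sample of the strings with the person they refer to and then score the script pairwise: every pair of strings the script puts together should also be together in the hand-labelled sample, and vice versa. The sketch below assumes two plain dictionaries mapping each string to a cluster label; the function names and the labelling scheme are illustrative choices, not something the data set prescribes.

```python
from itertools import combinations


def same_cluster_pairs(assignment):
    """Return the set of unordered string pairs that share a cluster label."""
    by_label = {}
    for string, label in assignment.items():
        by_label.setdefault(label, []).append(string)
    pairs = set()
    for members in by_label.values():
        # Sort so that each unordered pair has one canonical form.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 over the strings labelled in gold."""
    # Only score strings that a human has actually labelled.
    predicted = {s: predicted[s] for s in gold if s in predicted}
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Low precision means the script merges different people; low recall means it leaves one person split over several strings. Both numbers are only as good as the hand-labelled sample, so it helps to include the awkward cases on purpose.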

Mr Biffen and Ms Short

Let's see how Biffen and Short occur in our data set. We find 12 different ways of referring to Biffen, 11 for Clare Short and 15 for Renee Short. The latter even has two genders in our data!

Also notice how an accent in a name leads to wildly different spellings.

bash-3.00$ grep -i biffen histogram.txt
978 Mr. Biffen
55 The Lord Privy Seal and Leader of the House of Commons (Mr. John Biffen)
12 Mr. John Biffen (Shropshire, North)
8 The Secretary of State for Trade (Mr. John Biffen)
5 Mr. Biffen:
2 Mr. John Biffen
1 The Lord Privy Seal and Leader of the House (Mr. John Biffen)
1 Mr Biffen
1 Biffen
1 (Mr. John Biffen)
1 Biffen
0 Mr. Biffen [pursuant to his reply, 19 October 1981]:
bash-3.00$ grep -i short histogram.txt
1164 Clare Short
484 Mrs. Renée Short
118 Ms. Short
113 Ms. Clare Short
29 The Secretary of State for International Development (Clare Short)
27 Ms. Clare Short (Birmingham, Ladywood)
20 Mr. Renée Short
19 Mrs Renée Short
16 Mrs. Renee Short
8 Mrs. Short
6 Mrs. Renée Short (Wolverhampton, North-East)
3 Mrs. Renee Short (Wolverhampton, North-East)
3 Mrs. Renée Short
2 Ms Clare Short
2 Claire Short
1 Mrs. Reneé Short
1 Mrs. Renée Short (Wolverhampton, North-East)
1 Mrs. René;e Short
1 Mrs. René00e Short
1 Mrs. Renèe Short
1 Mrs. Renée Short (Wolverhampton, North-East)
1 Mrs. Rénee Short
1 Mrs Reneé Short
1 Clare short
1 Clare Short: The UN estimate about 200,000 people have been affected by recent flooding in the Shabelle and Juba river valleys
1 Clare Short:
1 Clare Short

bash-3.00$ grep -i "C[^s]*s*Short" uniqspeakers |wc
11 56 383
bash-3.00$ grep -i " R[^s]*s*Short" uniqspeakers |wc
15 53 440
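
Many of the variants listed above differ only in surface form: a ministerial role wrapped around the name, a constituency in parentheses, a trailing colon or speech fragment, a title, an accent. A first pass could reduce each string to a crude comparison key, as in the sketch below. The function names, the list of titles and the order of the steps are assumptions made for illustration; this is a starting point, not a complete deduplication.

```python
import re
import unicodedata

# Titles to drop from the key; an assumed, incomplete list.
TITLES = {"mr", "mrs", "ms", "miss", "sir", "dr", "dame", "lady", "lord"}


def fold_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(raw):
    """Reduce one speaker string to a crude comparison key."""
    s = raw.strip()
    # Drop everything after the first colon (a stray speech fragment).
    s = s.split(":", 1)[0]
    # Drop bracketed remarks such as "[pursuant to his reply, ...]".
    s = re.sub(r"\[.*?\]", "", s)
    # "The Secretary of State for Trade (Mr. John Biffen)" -> inner name.
    role = re.match(r"^\s*(?:The\b[^()]*)?\(([^()]*)\)\s*$", s)
    if role:
        s = role.group(1)
    else:
        # Otherwise a trailing parenthesis is taken to be a constituency.
        s = re.sub(r"\([^()]*\)\s*$", "", s)
    s = fold_accents(s).lower()
    # Keep alphabetic tokens, allowing hyphenated surnames.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", s)
    return " ".join(t for t in tokens if t not in TITLES)
```

With this key, "Mr. John Biffen (Shropshire, North)" and "(Mr. John Biffen)" both become "john biffen", and the accent variants of Renée all become "renee short". It does not solve everything: "Ms. Short" becomes just "short", "Claire Short" stays different from "Clare Short", and the garbled entity forms of Renée still produce their own keys. Those cases need fuzzy matching or context from the debate.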

Co-reference ideas

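The idea should be obvious from the following data, which lists the speakers in the order in which they spoke in the debate. A speaker is often introduced in full, for example "The Secretary of State for Trade (Mr. John Biffen)", and then referred to by a short form such as "Mr. Biffen". Note also the Oppenheim entries.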

bash-3.00$ more speakers.txt

Mr. Speaker
Sir David Price
The Under-Secretary of State for Trade (Mr. Reginald Eyre)
Sir David Price
Mr. Eyre
Mr. Carter-Jones
Mr. Eyre
Mr. Adley
The Secretary of State for Trade (Mr. John Biffen)
Mr. Adley
Mr. Biffen
Mr. J. Enoch Powell
Mr. Biffen
Mr. Neubert
Mr. Biffen
Mr. John Smith
Mr. Biffen
Mr. Canavan
The Minister for Trade (Mr. Cecil Parkinson)
Mr. Canavan
Mr. Parkinson
Mr. Anthony Grant
Mr. Parkinson
Mr. Gordon Wilson
Mr. Parkinson
Mr. Bowen Wells
Mr. Parkinson
Mr. Clinton Davis
Mr. Parkinson
Mr. Beaumont-Dark
The Minister for Consumer Affairs (Mrs. Sally Oppenheim)
Mr. Beaumont-Dark
Mrs. Oppenheim
Mr. John Fraser
Mrs. Oppenheim
Mr. Soley
Mrs. Sally Oppenheim
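
The ordering above suggests a simple co-reference heuristic: remember the most recent full name seen for each surname, and resolve a bare "Mr. Eyre" to it. The sketch below is one possible reading of that idea. The function names and the title list are assumptions, and it only handles strings shaped like those in the excerpt (it would misread a trailing constituency in parentheses as a name).

```python
import re

# "... (Mr. John Biffen)" at the end of a ministerial introduction.
INTRO = re.compile(r"\(([^()]+)\)\s*$")
# Leading title; an assumed, incomplete list.
TITLE = re.compile(r"^(Mr|Mrs|Ms|Miss|Sir|Dr)\.?\s+", re.IGNORECASE)


def split_name(raw):
    """Return (given names, surname), stripping any role and title."""
    intro = INTRO.search(raw)
    name = intro.group(1) if intro else raw
    name = TITLE.sub("", name.strip())
    parts = name.split()
    if not parts:
        return "", ""
    return " ".join(parts[:-1]), parts[-1]


def resolve(speakers):
    """Map each string, in debate order, to the fullest name seen so far."""
    last_full = {}  # surname -> most recent "given surname" form
    resolved = []
    for raw in speakers:
        given, surname = split_name(raw)
        if given:
            last_full[surname] = f"{given} {surname}"
        # Fall back to the bare surname if no full form has been seen yet.
        resolved.append(last_full.get(surname, surname))
    return resolved
```

On the excerpt this resolves "Mr. Eyre" to "Reginald Eyre" and "Mrs. Oppenheim" to "Sally Oppenheim". The memory should probably be reset at the start of each debate, since a surname such as Short can belong to different members in different sittings.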

To end, “It’s a man’s world”

We let the numbers speak. But the deduplication should bring a bit more balance...

bash-3.00$ grep -c "^Mr\." uniqspeakers.txt
bash-3.00$ grep -c "^Ms\." uniqspeakers.txt
bash-3.00$ grep -c "^Mrs\." uniqspeakers.txt
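
When counting titles this way, note that an unescaped dot in a pattern such as ^Mr. matches any character, so it would also count lines starting with "Mrs.", while an escaped dot misses variants written without one, such as "Mr Biffen". A small sketch that counts the title token itself avoids both problems; the function name and the list of titles are assumptions, and the counts are still counts of strings, not of people.

```python
import re
from collections import Counter

# Longer alternatives first so that "Mrs" is not cut short to "Mr".
TITLE = re.compile(r"^(Mrs|Mr|Ms|Miss)\b")


def count_titles(lines):
    """Count leading titles, with or without a trailing dot."""
    counts = Counter()
    for line in lines:
        found = TITLE.match(line.strip())
        if found:
            counts[found.group(1)] += 1
    return counts
```

It can be applied directly to an open file, for example count_titles(open("uniqspeakers.txt")), and again after deduplication to see how much the balance shifts.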


