Uniqueness of Georgian Names

Posted on 14-11-2013 by Maarten Marx | TI

We were told that Georgian names are often ambiguous, in the sense that many persons share the same combination of first name and last name. Here we investigate to what extent this is true. In a sample of over 20K Georgian persons we found that at least 91.3% are uniquely determined by their first and last name, and 98.4% if we add the date of birth as well.
We can thus conclude that Georgian names are quite good at uniquely identifying persons.
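
One way to read these percentages: a person counts as uniquely determined when nobody else in the sample has the same key. The Python sketch below counts this for two choices of key. The record layout, the field names and the four sample rows are invented for illustration; they are not the data set used here, and the actual computation may have differed.

```python
from collections import Counter

def unique_fraction(records, key_fields):
    """Fraction of records whose key (tuple of key_fields) occurs exactly once."""
    keys = [tuple(r[f] for f in key_fields) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0

# Hypothetical sample rows (placeholders, not real persons).
people = [
    {"first": "A", "last": "X", "born": "1970-01-01"},
    {"first": "A", "last": "X", "born": "1982-05-17"},
    {"first": "B", "last": "Y", "born": "1970-01-01"},
    {"first": "C", "last": "Z", "born": "1991-09-30"},
]

by_name = unique_fraction(people, ["first", "last"])
by_name_and_birth = unique_fraction(people, ["first", "last", "born"])
print(by_name, by_name_and_birth)
```

In this toy sample the two persons named "A X" are ambiguous by name alone, and adding the date of birth separates them.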


Name Harmonisation

Posted on 10-07-2008 by Maarten Marx | data

Another post on data deduplication, in English this time because it concerns British data. We downloaded all Commons debates from 1981 till 2001 from the Hansard archives and put them in one big XML file of 365 MB, containing 50 million words.

With a simple XPath query, //speakers, we found all speakers: 327,019 speeches were made.
Good, but how many persons were speaking in that period? We ran:

bash-3.00$ myxpathx //member The_Official_Report_House_of_Commons_1981_to_2004.xml >speakers
bash-3.00$ cat speakers |sort|uniq >uniqspeakers
bash-3.00$ wc uniqspeakers
7846 32005 220520 uniqspeakers
bash-3.00$

and found 7846 unique speakers. But are these really unique speakers, or just unique strings?
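
To see why the two can differ, here is a minimal Python sketch that groups strings by a normalised key. It assumes that variants differ only in case, punctuation, whitespace and a leading title; the title list and the example strings are invented, and the real speaker file may vary in other ways.

```python
import re
from collections import defaultdict

# Assumed list of honorifics; the real data may use others.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "sir"}

def normalise(name):
    """Lower-case, drop punctuation, collapse whitespace, strip leading titles."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[0] in TITLES:
        tokens.pop(0)
    return " ".join(tokens)

def group_variants(strings):
    """Map each normalised key to the set of raw strings that produce it."""
    groups = defaultdict(set)
    for s in strings:
        groups[normalise(s)].add(s)
    return dict(groups)

# Invented example strings, not taken from the Hansard data.
variants = [
    "Mr. John Example",
    "Mr John Example",
    "mr.  john example",
    "Dr. Jane Sample",
]
groups = group_variants(variants)
print(len(variants), "unique strings,", len(groups), "normalised keys")
```

A key like this only merges spelling variants. It can also merge two different persons who happen to share a name, so it is a starting point rather than a solution.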

Challenge

Write a script that correctly deduplicates the attached data set. Also provide some way of evaluating the correctness of the script. It would be nice if the names were harmonised against a list with some authority. Wikipedia could be an example.
The uniqspeaker file is attached; this file contains for each speaker in uniqspeaker the number of speeches made. The speaker file (which can be used when trying co-reference tricks) can be obtained from Maarten Marx.
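
For the evaluation part, one common option is to score a script against a small hand-labelled sample using pairwise precision and recall: of all pairs of strings the script merges, how many really belong together, and of all pairs that belong together, how many does it find? The sketch below assumes such a gold labelling exists; the cluster ids and item names are placeholders.

```python
from itertools import combinations

def same_cluster_pairs(assignment):
    """Set of unordered item pairs that share a cluster id."""
    by_cluster = {}
    for item, cluster in assignment.items():
        by_cluster.setdefault(cluster, []).append(item)
    pairs = set()
    for members in by_cluster.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs

def pairwise_scores(predicted, gold):
    """Pairwise precision and recall of a predicted clustering against a gold one."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall

# Placeholder data: three strings for person A, one for person B.
gold = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
# A hypothetical script output that wrongly merges a3 with b1.
predicted = {"a1": 1, "a2": 1, "a3": 2, "b1": 2}

precision, recall = pairwise_scores(predicted, gold)
print(precision, recall)
```

Here the script merges two pairs, of which one is correct, and finds one of the three pairs that belong together.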

Data deduplication

Posted on 04-06-2008 by Maarten Marx | data

In the Handelingen (parliamentary proceedings) that are available via Parlando and KB-SGD, the same entities (persons, parties) are often spelled in different ways. This of course makes it very hard to bring all data about a single entity neatly together. The different spellings are caused by

  • OCR errors (data from before 1995)
  • typing errors
  • changing names and varying conventions
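
OCR and typing errors usually change only a character or two, so approximate string matching against a list of known spellings is one possible way to catch them. The Python sketch below uses `difflib.get_close_matches` from the standard library. The authority list, the misspelling and the cutoff of 0.8 are assumptions for illustration, not values tuned on the actual data, and this approach does not handle names that change over time.

```python
from difflib import get_close_matches

def harmonise(name, authority, cutoff=0.8):
    """Return the closest authority spelling, or None if nothing is similar enough."""
    matches = get_close_matches(name, authority, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical authority list of surnames.
authority = ["Jansen", "De Vries", "Bakker"]

print(harmonise("Jansen", authority))   # exact spelling
print(harmonise("Janscn", authority))   # invented OCR-style error: e read as c
print(harmonise("Xyz", authority))      # nothing similar in the list
```

A lower cutoff catches more errors but also merges more names that are really different, so the value has to be checked against a labelled sample.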