Uniqueness of Georgian Names

Posted on 14-11-2013 by Maarten Marx | TI

We were told that Georgian names are often ambiguous, in the sense that many persons share the same combination of first and last name. Here we investigate to what extent this is true. In a sample of over 20K Georgian persons we found that at least 91.3% are uniquely determined by their first and last name, and at least 98.4% if we add the date of birth as well.
We can thus conclude that Georgian names are quite good at uniquely identifying persons.

We proceed as follows. From the Asset Declaration database of TI Georgia we can collect persons for whom we have the following four fields:


First Name, Last Name, Place of Birth, Date of Birth.

We assume that these 4 properties uniquely describe a person.

In our database we find 21,479 such unique persons. These come from the first page of the Asset Declaration, which contains this information about the Official who filled in the declaration and about their relatives.
Here is a short example of the data:

Anano	Jafaridze	Georgia, mestia	2005-04-26 
Anano	Janelidze	Georgia, Tbilisi	2011-10-10 
Anano	Kalandia	სააქრთველო, q tbilisi	2004-09-05 
Anano	Karanadze	Georgia, batumi luka asatianis 79/81 b.37	2009-02-19 
Anano	Khmaladze	Georgia, sighnaghi/nukriani	2007-06-30 
Anano	Maisuradze	Georgia, Tbilisi	2007-11-06 
Anano	Malakmadze	Georgia, batumi	2010-04-25 
Anano	Morchadze	Georgia, qutaisi	1997-05-17 
Anano	Mosidze	Georgia, q. batumi	2009-10-07 
        

First results

Among the 21,479 persons (many of them close family members), there are 19,522 distinct combinations of first and last name. If we add the date of birth, only 437 rows (21479 - 21042) remain that are not unique.
We can thus quite safely conclude that for the vast majority of persons in Georgia, their first and last name together uniquely determine them.
Here is the code giving these numbers.

maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv |wc -l
   21479
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv |awk -F$'\t' '{print $1,$2,$4}'|sort|uniq|wc -l
   21042
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv |awk -F$'\t' '{print $1,$2}'|sort|uniq|wc -l
   19522
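
The three counts above follow one pattern: project the columns of interest, then count the distinct rows. Below is a minimal sketch of that pattern as a small function. It assumes tab-separated input with the columns First Name, Last Name, Place of Birth, Date of Birth; the names `count_distinct` and `sample_rows` are hypothetical, and the five sample rows are copied from examples in this post.

```shell
#!/bin/sh
# count_distinct FIELDS: read tab-separated rows on stdin and print the number
# of distinct combinations of the given columns (a cut-style list, e.g. 1,2).
count_distinct() {
  cut -f "$1" | sort -u | wc -l | tr -d ' '
}

# sample_rows: five rows copied from this post, used here as a hypothetical
# stand-in for UniqListofPersons.csv.
sample_rows() {
  printf '%s\t%s\t%s\t%s\n' \
    Akaki Minashvili 'Georgia, Tbilisi' 1980-09-24 \
    Akaki Minashvili 'Georgia, Tbilisi,' 1980-09-24 \
    Aleqsandre Berdzenishvili 'Georgia, Tbilisi' 1982-09-21 \
    Aleqsandre Berdzenishvili 'Georgia, q tbilisi,' 1966-09-08 \
    Aleqsandre Berdzenishvili 'saqrtvelo, q tbilisi,' 1966-09-08
}

sample_rows | count_distinct 1,2      # first and last name: prints 2
sample_rows | count_distinct 1,2,4    # plus date of birth:  prints 3
```

On the real file this would be called as `count_distinct 1,2 < UniqListofPersons.csv`.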
        

A closer look at the double entries

One possibility that we have to investigate is that the double entries are due to the fact that persons who appear more than once in the database have given their Place of Birth in slightly different wording.
This would mean that they appear as two different persons, but are not so in reality.
Let’s look at the persons who have the largest number of “clones”:

maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv |awk -F$'\t' '{print $1,$2,$4}'|sort|uniq -c|sort -nr |sed 's/^ *//'|grep -v '^1 '|head
3 Sofiko Fifia 1987-01-21 
3 Pavle Okujava 1978-07-02 
3 Nodar Mgeliashvili 2010-03-31 
3 Nino Camalashvili 2009-06-09 
3 Natela Khmaladze 1978-12-05 
3 Mamuka Vacadze 1975-07-07 
3 Kakha Lataria 1969-11-27 
3 Irma Iamanidze 1975-03-20 
3 Gela Tvalchrelidze 1971-04-11 
2 Zviad Miminoshvili 1974-09-24 
        

Now let’s have a look at the top three persons:

maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Sofiko.*Fifia'
Sofiko	Fifia	Georgia, sokhumi	1987-01-21 
Sofiko	Fifia	Russia, irkutski	1987-01-21 
Sofiko	Fifia	რუსეთის ფედერაცია, irkutskis olqi	1987-01-21 
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Pavle.*Okujava'
Pavle	Okujava	Georgia, q.tbilisi	1978-07-02 
Pavle	Okujava	sqartvelo, zugdidi,	1978-07-02 
Pavle	Okujava	sqartvelos, zugdidi,	1978-07-02 
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Nodar.*Mgeliashvili'
Nodar	Mgeliashvili	საქრთველო, gori	2010-03-31 
Nodar	Mgeliashvili	საქრთველო, q. gori	2010-03-31 
Nodar	Mgeliashvili	საქრთველო, tianeti	2010-03-31 
maartens-MacBook-Pro:OUTPUT admin$     

For all of the three persons, two of the three clones are clearly one and the same person.

Let’s look at three random clones which occurred only twice, and see if we can detect a reason:

maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Ada.*Bakhtadze'
Ada	Bakhtadze	Georgia, q. gudauta	1930-11-07 
Ada	Bakhtadze	საქრთველო, sokhumi	1930-11-07 
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Akaki.*Minashvili'
Akaki	Minashvili	Georgia, Tbilisi	1980-09-24 
Akaki	Minashvili	Georgia, Tbilisi,	1980-09-24 
maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv | grep 'Aleqsandre.*Berdzenishvil'
Aleqsandre	Berdzenishvili	Georgia, Tbilisi	1982-09-21 
Aleqsandre	Berdzenishvili	Georgia, q tbilisi,	1966-09-08 
Aleqsandre	Berdzenishvili	saqrtvelo, q tbilisi,	1966-09-08 
maartens-MacBook-Pro:OUTPUT admin$ 
        

Most probably there are two different Ada Bakhtadzes born on the same day. Still, it could be that a mistake was made when filling in the Asset Declaration (e.g., confusing Place of Birth with Place of Residence).
The other two clones are obviously one person each, with a different way of writing their Place of Birth. (Note that there are two different Aleqsandre Berdzenishvilis, but we can distinguish them by their Date of Birth.)
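
Instead of grepping one name at a time, all clone groups can be listed together with their Place of Birth variants. The following is a rough sketch under the same assumptions as before (tab-separated input, four columns); `list_clones` and `sample_rows` are hypothetical names, and the sample rows are copied from the examples above.

```shell
#!/bin/sh
# list_clones: read tab-separated rows on stdin. For every (first name,
# last name, date of birth) key that occurs on more than one row, print
# the row count, the key, and all recorded Places of Birth on one line.
list_clones() {
  awk -F'\t' '
    {
      key = $1 " " $2 " " $4
      count[key]++
      # Join the places of one key with " | ".
      places[key] = (count[key] > 1) ? places[key] " | " $3 : $3
    }
    END {
      for (key in count)
        if (count[key] > 1)
          print count[key] "\t" key "\t" places[key]
    }
  ' | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# sample_rows: five rows copied from this post.
sample_rows() {
  printf '%s\t%s\t%s\t%s\n' \
    Akaki Minashvili 'Georgia, Tbilisi' 1980-09-24 \
    Akaki Minashvili 'Georgia, Tbilisi,' 1980-09-24 \
    Aleqsandre Berdzenishvili 'Georgia, Tbilisi' 1982-09-21 \
    Aleqsandre Berdzenishvili 'Georgia, q tbilisi,' 1966-09-08 \
    Aleqsandre Berdzenishvili 'saqrtvelo, q tbilisi,' 1966-09-08
}

sample_rows | list_clones
```

For the sample this prints two groups (Akaki Minashvili and the Aleqsandre Berdzenishvili born in 1966); the Aleqsandre Berdzenishvili born in 1982 has a key of his own and is not listed.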

Cleaning up the Place of Birth field

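We now investigate how much cleaning helps in reducing these “false double persons”. We lowercase the text and remove slashes, commas, spaces and periods; the command below does this for the whole line, not only for the Place of Birth string.
This leaves us with 108 fewer entities.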

maartens-MacBook-Pro:OUTPUT admin$ cat UniqListofPersons.csv |tr '[:upper:]' '[:lower:]'|sed 's%[/, .]%%g'|sort|uniq > UniqListofPersonsCleaned.csv 
maartens-MacBook-Pro:OUTPUT admin$ wc -l UniqListofPersonsCleaned.csv 
   21371 UniqListofPersonsCleaned.csv
maartens-MacBook-Pro:OUTPUT admin$ wc -l UniqListofPersons.csv 
   21479 UniqListofPersons.csv
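
The command above rewrites every column. A possible variant, sketched here as an assumption rather than as the method used for the numbers in this post, restricts the cleaning to the Place of Birth column and leaves names and dates untouched. The names `clean_places` and `sample_rows` are hypothetical; the counts this variant yields on the real data have not been checked against the ones reported here.

```shell
#!/bin/sh
# clean_places: read tab-separated rows on stdin, lowercase the Place of
# Birth (column 3) and strip slashes, commas, spaces and periods from it.
# All other columns are passed through unchanged.
clean_places() {
  awk -F'\t' -v OFS='\t' '
    {
      $3 = tolower($3)
      gsub("[/, .]", "", $3)
      print
    }
  '
}

# sample_rows: five rows copied from this post.
sample_rows() {
  printf '%s\t%s\t%s\t%s\n' \
    Akaki Minashvili 'Georgia, Tbilisi' 1980-09-24 \
    Akaki Minashvili 'Georgia, Tbilisi,' 1980-09-24 \
    Aleqsandre Berdzenishvili 'Georgia, Tbilisi' 1982-09-21 \
    Aleqsandre Berdzenishvili 'Georgia, q tbilisi,' 1966-09-08 \
    Aleqsandre Berdzenishvili 'saqrtvelo, q tbilisi,' 1966-09-08
}

# The two Akaki Minashvili rows collapse into one: 5 rows become 4.
sample_rows | clean_places | sort -u | wc -l
```

The two 1966 rows for Aleqsandre Berdzenishvili stay separate, because the country is spelled differently in each.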
        

If we repeat the counts from the start of this investigation, we see the following numbers:

Count                                   No cleaning   With cleaning
All                                     21479         21371
First Name, Last Name                   19522         19516
First Name, Last Name, Date of Birth    21042         21034

Thus after cleaning we have only 337 cases (21371 - 21034) which are not uniquely determined by their name and birthday.
As we have seen above, more cleaning could be done, for instance by replacing the Georgian spellings of place names with the English ones.
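
As an illustration of such further cleaning, the sketch below maps the spelling variants of the country name that appear in the examples in this post to "Georgia". It is only a sketch: the variant list is limited to what is visible above, the function name `normalize_country` is hypothetical, and it assumes the country is the text before the first comma in the Place of Birth column.

```shell
#!/bin/sh
# normalize_country: read tab-separated rows on stdin and rewrite the
# country part of the Place of Birth (the text before the first comma in
# column 3) to "Georgia" when it is one of the known spelling variants.
normalize_country() {
  awk -F'\t' -v OFS='\t' '
    BEGIN {
      # Variants seen in the examples in this post; extend as needed.
      n = split("sqartvelo sqartvelos saqrtvelo საქრთველო", v, " ")
      for (i = 1; i <= n; i++) variant[v[i]] = 1
    }
    {
      comma   = index($3, ",")
      country = (comma > 0) ? substr($3, 1, comma - 1) : $3
      rest    = (comma > 0) ? substr($3, comma) : ""
      if (tolower(country) in variant) $3 = "Georgia" rest
      print
    }
  '
}

printf '%s\t%s\t%s\t%s\n' \
  Pavle Okujava 'Georgia, q.tbilisi' 1978-07-02 \
  Pavle Okujava 'sqartvelo, zugdidi,' 1978-07-02 \
  Pavle Okujava 'sqartvelos, zugdidi,' 1978-07-02 |
  normalize_country | sort -u
```

On these three rows the two zugdidi entries become identical, so two rows remain after `sort -u`.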

Conclusion

We have looked at a sample of 21,371 persons from a database of Georgian people which included many close relatives.
Of these, at least 91.3% are uniquely determined by their first and last name. If we add the date of birth, at least 98.4% are uniquely determined.
We say at least, because our dataset still contains false “clones” (due to the fact that persons were recorded twice in the database but with a different way of specifying their place of birth).
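
For reference, the percentages follow directly from the counts after cleaning (19516 and 21034 out of 21371). A small sketch of the arithmetic, with `pct` as a hypothetical helper name:

```shell
#!/bin/sh
# pct PART TOTAL: print PART as a percentage of TOTAL with one decimal.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Counts after cleaning, as reported in this post.
pct 19516 21371   # first and last name:  prints 91.3
pct 21034 21371   # plus date of birth:   prints 98.4
```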

In other words, Georgian names do a pretty good job of uniquely identifying people. Adding in the birthday gives a very high percentage of uniqueness.

Files

The zip file UniquenessOfpersons.zip contains this report, the raw data from the asset declarations, the XQuery with which we created the list of persons, and the two spreadsheets with persons used in the code examples above.
