English Georgian Parallel Corpus

Posted on 26-11-2013 by Maarten Marx | data, TI

We created a Georgian-English parallel corpus by crawling the Georgian news site http://civil.ge. The site contains over 26 thousand news stories in both English and Georgian; the earliest dates from November 2002.
Parallel corpora like this one are the raw material for machine translation systems such as Google Translate.
The fact that Google Translate (at the time of writing) mistranslates საქართველოს (the genitive of “Sakartvelo”, the Georgian word for Georgia) shows that such parallel corpora are still useful.
All data mentioned in this blog post is available in a zip file (32 MB).

Wordcounts

We have 26,671 articles in both languages. There are large differences between the languages in the number of tokens and types:

Language   # Tokens    # Types
English    6,348,986   115,058
Georgian   4,662,318   272,652

The difference in the number of characters, however, is small: 41,134,452 in English and 40,177,653 in Georgian.
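The token and type counts can be reproduced with a few lines of Python. This is a minimal sketch, assuming the corpus is stored with one token per line (as described in the Code section); the function name `count_tokens_and_types` is hypothetical and not part of the original scripts.

```python
from collections import Counter


def count_tokens_and_types(lines):
    """Return (tokens, types) for an iterable of one-token-per-line strings.

    Tokens are all non-empty lines; types are the distinct tokens.
    """
    counts = Counter(tok for tok in (line.strip() for line in lines) if tok)
    return sum(counts.values()), len(counts)


# Tiny in-memory sample standing in for a corpus file (assumption).
sample = ["the", "president", "of", "georgia", "said", "the", "president"]
tokens, types = count_tokens_and_types(sample)
print(tokens, types)  # 7 tokens, 5 types
```

For a real file, something like `count_tokens_and_types(open("ENG-corpus.txt", encoding="utf-8"))` should work, assuming the file is UTF-8 encoded.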

What is the corpus about?

The following word cloud gives a good impression of the topics covered by the corpus. It shows the 50 most used terms in the corpus that are longer than 5 characters.
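The term selection behind such a cloud can be sketched as follows. The sketch assumes a frequency dictionary in the "count word" format of the dictionary files described in the Code section; the function name `top_long_terms` and its parameters are hypothetical, and the original post does not say how the cloud was actually produced.

```python
def top_long_terms(dict_lines, n=50, min_chars=6):
    """Return the n most frequent (word, count) pairs with at least min_chars characters.

    Each input line is expected to look like "49576 georgian".
    len() counts Unicode code points, so it also works for Georgian script.
    """
    terms = []
    for line in dict_lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip lines without both a count and a word
        count, word = int(parts[0]), parts[1].strip()
        if len(word) >= min_chars:
            terms.append((word, count))
    # Sort by descending frequency; the sort is stable, so ties keep input order.
    terms.sort(key=lambda pair: -pair[1])
    return terms[:n]


# A few lines taken from the English frequency list in this post.
sample = ["458524 the", "59818 said", "49576 georgian", "43768 georgia", "43207 was"]
print(top_long_terms(sample))  # [('georgian', 49576), ('georgia', 43768)]
```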

Zipf’s law

Of course, a blog post like this one should contain a log-log plot of word rank against frequency. Here it is (the data and plot are also in this Google Fusion Table):

List of English Georgian parallel corpora

We list websites that publish articles in both English and Georgian, where the two versions are easily linked to each other and the translation quality is good.

If you know of similar sites not listed here, please contact Maarten Marx at maartenmarx@uva.nl.

Most used words

For English, the 20 most used terms are all function words, except for “said” and Georgia(n):

458524 the
228405 of
158403 to
151334 in
137630 and
87284 a
86332 that
85873 on
59818 said
54350 is
53020 for
49576 georgian
45381 with
43768 georgia
43207 was
43055 will
37375 by
36488 as
33176 be
31550 not

For Georgian:

136001 და
79865 რომ
43319 საქართველოს
42299 განაცხადა
39421 არ
25620 ამ
25184 ეს
22461 რომელიც
20104 ასევე
17567 რუსეთის
17496 უნდა
16266 არის
15916  
15213 მისი
14787 მან
14360 შემდეგ
14087 –
14014 ჩვენ
13163 თუმცა
13089 საქმეთა

Google Translate renders these in English as follows (note the mistake in the third line, which should be “Of Georgia”):

And
That
Of
Said
No
In this
This
Which
And
Russia
It should be
Is

The
He
After
-
We have
However, the
Affairs

Code

Harvesting the parallel corpus

# Fetch every article ID in both languages, pausing 1 second between requests.
mkdir -p DATA
for i in $(seq 1 26674); do
    for l in eng geo; do
        url="http://civil.ge/$l/article.php?id=$i"
        echo "$url"
        curl -s "$url" > "DATA/$i-$l.html"
        sleep 1
    done
done

Creating the corpora

Because the HTML was ill-formed and tidy could not repair it, we used the Python BeautifulSoup module. We extracted text only from the div element with id="maintext", then tokenized on whitespace and put each token on a separate line.
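The extraction step can be illustrated as below. This is not the script used for the corpus: to keep the example free of third-party packages, it swaps BeautifulSoup for Python's standard-library `html.parser`, which also tolerates unclosed tags. The names `MainTextExtractor` and `extract_tokens` are hypothetical, and joining text chunks with a space is an assumption that may split a word interrupted by inline markup.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect the text inside <div id="maintext">, including nested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # div nesting depth; > 0 while inside the target div
        self.chunks = []  # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == "div":
                self.depth += 1  # nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == "maintext":
            self.depth = 1       # entering the target div

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_tokens(html):
    """Return the whitespace-separated tokens of the maintext div."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


# Deliberately sloppy HTML: unclosed <p> tags and no closing </html>.
page = ('<html><body><div id="menu">Home</div>'
        '<div id="maintext"><p>Tbilisi hosts <b>talks</b>'
        '<div>on trade</div></div><p>footer</body>')
print("\n".join(extract_tokens(page)))  # one token per line
```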

Creating the dictionaries

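# Georgian: drop '@' and spaces, delete empty and digit-only lines, strip one trailing punctuation mark, then count and sort by frequency
$ cat GEO-corpus.txt | tr -d '@' | tr -d ' ' | sed '/^[0-9]*$/d;s/[.,)?!-]$//' | sort | uniq -c | sort -nr | sed 's/^ *//' > GEO-dict.txt
# English: lowercase, drop '@', delete empty lines, then count and sort by frequency
$ cat ENG-corpus.txt | tr '[:upper:]' '[:lower:]' | tr -d '@' | sed '/^$/d' | sort | uniq -c | sort -nr | sed 's/^ *//' > ENG-dict.txt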

Creating the Zipf plots

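  # Line N of each dictionary holds the word of rank N, so print log(rank),log(frequency)
  $ awk '{print log(NR)","log($1)}' ENG-dict.txt > ENG-loglogdict.txt
  $ awk '{print log(NR)","log($1)}' GEO-dict.txt > GEO-loglogdict.txt
  # Pair the two files by line number (rank); joining on the rounded log(rank) text would duplicate rows wherever neighbouring ranks print the same key
  $ awk -F',' 'NR==FNR {eng[FNR]=$2; next} FNR in eng {print $1","eng[FNR]","$2}' ENG-loglogdict.txt GEO-loglogdict.txt > ENG+GEO-loglogdict.txt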
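Under Zipf's law, frequency is roughly proportional to 1/rank^s, so the log-log plot is close to a straight line with slope -s. The sketch below estimates that slope with an ordinary least-squares fit. It is an illustration added here, not part of the original analysis; the function name `zipf_slope` is hypothetical, and the input is assumed to be a list of frequencies already sorted from most to least frequent.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies must be positive, sorted in descending order, and contain
    at least two values; rank 1 is the first element.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# An idealised Zipf distribution (frequency = 1000 / rank) has slope -1.
ideal = [1000.0 / rank for rank in range(1, 101)]
print(round(zipf_slope(ideal), 6))  # -1.0
```

To apply this to the corpus, the counts in the first column of ENG-dict.txt or GEO-dict.txt could be read into a list and passed to `zipf_slope`.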
