English Georgian Parallel Corpus

Posted on 26-11-2013 by Maarten Marx | data, TI

We created a Georgian-English parallel corpus by crawling the Georgian news site http://civil.ge. The site contains more than 26,000 news stories in both English and Georgian, the earliest dating from November 2002.
Parallel corpora like this one are the raw material from which machine translation systems such as Google Translate are built.
Google Translate (at the time of writing) mistranslates საქართველოს (the genitive of “Sakartvelo”, the Georgian word for Georgia), which shows that such parallel corpora are still useful.
All data mentioned in this blog post is available in a zip file (32M).
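
To give an idea of what a document-aligned parallel corpus looks like in code, here is a minimal Python sketch. It is not the crawler or the pairing procedure used for this corpus. It assumes, purely for illustration, that each story is stored under an identifier shared by its English and Georgian versions; the names `StoryPair` and `pair_stories` and the toy data are hypothetical.

```python
from typing import Dict, List, NamedTuple


class StoryPair(NamedTuple):
    """One aligned document: the same story in both languages."""
    story_id: str
    english: str
    georgian: str


def pair_stories(english: Dict[str, str], georgian: Dict[str, str]) -> List[StoryPair]:
    """Keep only the stories that exist in both languages.

    Assumption: both dictionaries map a shared story id to the story text.
    """
    shared_ids = sorted(english.keys() & georgian.keys())
    return [StoryPair(i, english[i], georgian[i]) for i in shared_ids]


# Toy data for illustration only, not taken from the actual corpus.
english_stories = {"1": "Parliament of Georgia", "2": "Only in English"}
georgian_stories = {"1": "საქართველოს პარლამენტი", "3": "მხოლოდ ქართულად"}

pairs = pair_stories(english_stories, georgian_stories)
for pair in pairs:
    print(pair.story_id, pair.english, "|", pair.georgian)
```

Stories that exist in only one language are dropped, so every remaining pair can serve as one aligned training example.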