Monday March 28, 2005
The Xerox Language Identifier
This post on Linguaphiles mentions an interesting tool available on the web, Xerox's Language Identifier. It does what you think: examines a text and guesses what language the text is in. Upon reading about this, I was immediately inspired to misuse it by feeding it text from invented languages and seeing what language it guessed.
The three languages I tried were Klingon, Quenya, and Sindarin (the latter two being Elvish languages invented by J. R. R. Tolkien). For Klingon, I fed it several recent posts from bo logh. The result was always Maltese, except in three cases: one each of Czech, Slovakian, and English (for a post that contained a lot of English words). I don't know enough about the history of Klingon to know if Maltese would be the expected result, but it seems fairly consistent.
For Quenya, I fed the Language Identifier three texts. It guessed that Namárië and The Markirya Poem were Catalan, and that Fíriel's Song was Hungarian. Given that Quenya is supposed to resemble Finnish, a guess of Hungarian (a related language) is pretty good, but Catalan is surprisingly far off. I wonder if there's some lexical item that occurs in both Catalan and Quenya that makes Catalan seem like a good candidate.
For Sindarin, I tried two Tolkien texts and some compositions by other people. The Language Identifier guessed that Tolkien's A Elbereth Gilthoniel was Irish and that his The King's Letter was Welsh. That's pretty good. Welsh is the closest thing to a correct answer, since the phonology of Sindarin is supposed to be similar to Welsh, and Irish is a fellow Celtic language. Some further texts from this site of Sindarin poetry, including Vi Dýr Ennui, I Gair Vedui (The Last Ship), and The Words of the Seer, all come up Welsh as well.
So, what have we learned? Nothing, really, except that when people who are smarter than we are put their work up on the web, it's fun to poke at it and see if we can break it. The Language Identifier did pretty well under the circumstances, I think.
I am The Tensor, and I approve this post.
06:37 PM
in Linguistics in SF
Comments
I tried it with two posts devoid of loanwords and got one "English" and one "Maltese." I had never seen Maltese before, but examining the Maltese sample text on Xerox's page I see why it thought it the best match.
Maltese: qal li din is-sena, il-gimgha ta'
Klingon: qal lI' DIm 'ISjaH 'Il tagha' ta'
ta' and lI' are very common syllables, and many words in Klingon start with qa.
I don't know what the Maltese says, and the Klingon is nonsense saying "corrupt transmit tunnel-entrance calendar sincere finally accomplishment." In fact, if someone sent me the Maltese I might myself have spent a minute trying to figure out what the beginner had mistyped.
If I had to give the machine some rules for distinguishing between Klingon and Maltese, I would say that if word-initial vowels (' is a consonant) are common, or if g is often neither followed by h nor preceded by n, it is not Klingon. Those would only be present in loanwords or typographical errors. To flag a text as romanized Klingon, watch for the word 'e', the sequence taHvIS, and of course that distinctive mixed case. The orthography borrows from IPA.
Posted by: Qov at Mar 29, 2005 5:02:33 PM
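Qov's rules are concrete enough to try out. What follows is only a rough Python sketch of them, not anything the Language Identifier actually does. The function names (klingon_clues, looks_like_klingon), the 10% threshold standing in for "common", and the definition of "mixed case" as a capital letter after the first letter of a word that also has lowercase letters are all assumptions made for illustration.

```python
import re

def has_internal_capital(word):
    """Assumed definition of 'mixed case': a capital after the first letter,
    in a word that also contains lowercase letters (e.g. lI', DIm)."""
    letters = [c for c in word if c.isalpha()]
    return (any(c.isupper() for c in letters[1:])
            and any(c.islower() for c in letters))

def klingon_clues(text):
    """Count the clues from the comment above in a romanized text."""
    # Strip ordinary punctuation but keep apostrophes, which are consonants.
    words = [w.strip('.,;:!?"()') for w in text.split()]
    words = [w for w in words if w]
    return {
        "words": len(words),
        # Negative clue: words that begin with a vowel letter.
        "vowel_initial": sum(w[0].lower() in "aeiou" for w in words),
        # Negative clue: g neither preceded by n nor followed by h.
        "bare_g": len(re.findall(r"(?<!n)g(?!h)", text)),
        # Positive clues: mixed case, the word 'e', the sequence taHvIS.
        "mixed_case": sum(has_internal_capital(w) for w in words),
        "has_e": "'e'" in words,
        "has_taHvIS": "taHvIS" in text,
    }

def looks_like_klingon(text, max_bad_ratio=0.1):
    """Hypothetical verdict; max_bad_ratio is an arbitrary stand-in for 'common'."""
    clues = klingon_clues(text)
    if clues["words"] == 0:
        return False
    bad = clues["vowel_initial"] + clues["bare_g"]
    if bad / clues["words"] > max_bad_ratio:
        return False
    return clues["mixed_case"] > 0 or clues["has_e"] or clues["has_taHvIS"]

print(looks_like_klingon("qal lI' DIm 'ISjaH 'Il tagha' ta'"))  # True
print(looks_like_klingon("qal li din is-sena, il-gimgha ta'"))  # False
```

On the two sample lines from the comment, the Maltese one is rejected because of its two vowel-initial words and one bare g, while the Klingon one passes on mixed case alone. Real posts with loanwords or typos would need a more forgiving threshold.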
Apparently you can write in Dutch if you just hit the keyboard randomly. Who knew? NB: I did not repeat my test. Perhaps it was only a Dutch-y occasion in the Shakespearean monkey collective.
Posted by: gibberish at Mar 30, 2005 4:23:25 AM
I randomly slapped my keyboard twice and got Dutch both times. . .I can type in Dutch! Amazing skills you never knew you had. . .
Posted by: Aaron Morse at Aug 23, 2005 11:31:52 AM
I tried with my own conlangs:
Adare (a language that strongly resembles Quenya) got Catalan three times, French once, and Turkish twice. That is not so bad, because it is inspired by the euphonic principles of Latin and the Romance languages (as Quenya is) and is sometimes quite agglutinative, like Turkish.
My mother tongue is Spanish, and it is not so far from my phonetic taste, but Adare contains many more 'v' letters than Spanish does, because I don't much like the letter 'b' in texts (except when it is preceded by 'm'), and many of the words in Adare end in -e, as they do in French, English, and of course, Catalan. So, while the syllabic structure is very similar to Spanish (and Basque, Finnish, Japanese, Quenya...), it has 'v' and '-e' much more often, and I suppose that that's basically Catalan, a language very close to Spanish but with 'v' where Spanish has 'b', and '-e' where it has '-o' or '-a'. Also, Adare has double ss, as Catalan and French do, but Spanish does not.
My other elvish-like conlang, Ethire, got Irish.
My medieval-style conlang, Asrordânis, got Welsh (both of them use 'w' as a vowel).
And finally, my oldest conlang, Ayeis, which resulted from a wild mixture of English, Spanish, Basque, Latin, and some Indo-European and Semitic sources in a Germanic style, repeatedly got English :DD
It was fun ^_^
Posted by: Fiondil at Dec 13, 2005 1:30:46 AM
Apparently that software only recognises a language if it is written in its own alphabet.
I tried with five Japanese haikus in Rômaji script and got Swahili :o
Maybe I should learn Swahili...
Posted by: Fiondil at Dec 13, 2005 1:35:13 AM
I tried it on my conlang Xha and the first result was Indonesian, which I thought not so bad, but longer texts consistently give Latvian when the long vowels are represented by â î û etc., and Estonian when written as aa ii uu etc. I am completely unfamiliar with these two languages and I never even saw them before.
Posted by: Folquerto at Dec 23, 2005 4:38:48 PM
I tried my best-developed conlang, Telod, and pretty consistently got Irish, although the language could only be said to bear an outward and unintended resemblance to Irish.
I tried a new conlang I've been working on, and first got Italian and then Czech. I need a longer text I can throw at it, though.
Posted by: Anwulf at Dec 31, 2005 8:30:10 PM
I think the reason you get Dutch from keyboard spasms is because it’s definitely not English and it lacks significant use of diacritics or hyphens. I’ve found that the rapid exposure of monolingual English speakers to a Dutch newspaper often results in the assumption that the text is English. They look rather similar until you actually read them.
I put some Tlingit text into it and it guessed Swahili at first. A longer sample gave me Hungarian. Not too odd, but why it guessed Hungarian without even a single ő or ű in it, I don’t know.
Posted by: James Crippen at Jul 28, 2006 3:07:26 PM