For a computational linguistics class last quarter, I tried applying statistical modeling techniques to the problem of detecting genetic relationships among human languages. To find the relationships, I:
- Downloaded the text of the Universal Declaration of Human Rights in 49 different languages written in some variant of the Latin alphabet.
- Filtered the texts to remove punctuation, convert everything to lower case, and replace every accented or derived character with one of the 26 plain ASCII letters—é with e, ß with s, ł with l, and so forth.
- Then, in a loop:
  - Trained n-gram models of order 5 using SRILM on the sequence of characters in each language.
  - Used each model to calculate the perplexity of the character sequences of every other language, looking for the language pair with the lowest perplexity.
  - Merged the most-similar pair of languages into a single file, treating it as a new language, reducing the number of texts by one.
  - Repeated this process until all languages had been merged into a single file.
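The steps above can be sketched in a few dozen lines of Python. This is only an illustration under stated assumptions, not the code behind the results in this post: SRILM is swapped for a tiny add-k smoothed character model, word boundaries are kept as single spaces, and the names `SPECIAL`, `normalize`, `train`, `perplexity`, and `cluster` are all made up for the example.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical table for letters that have no base-plus-accent decomposition.
SPECIAL = {"ß": "s", "ł": "l", "ø": "o", "đ": "d", "æ": "a", "œ": "o"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space (assumption)


def normalize(text):
    """Lowercase, strip accents, and keep only a-z separated by single spaces."""
    out = []
    for ch in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(ch):
            continue  # drop the accent mark, keep the base letter before it
        ch = SPECIAL.get(ch, ch)
        out.append(ch if "a" <= ch <= "z" else " ")
    return " ".join("".join(out).split())


def train(text, order=5):
    """Count each character together with its preceding order-1 characters."""
    padded = " " * (order - 1) + text
    ngrams, contexts = Counter(), Counter()
    for i in range(order - 1, len(padded)):
        ctx = padded[i - order + 1:i]
        ngrams[ctx + padded[i]] += 1
        contexts[ctx] += 1
    return ngrams, contexts


def perplexity(model, text, order=5, k=0.1):
    """Per-character perplexity of a non-empty text under an add-k model."""
    ngrams, contexts = model
    padded = " " * (order - 1) + text
    log_prob = 0.0
    for i in range(order - 1, len(padded)):
        ctx = padded[i - order + 1:i]
        p = (ngrams[ctx + padded[i]] + k) / (contexts[ctx] + k * len(ALPHABET))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


def cluster(texts, order=5):
    """Greedily merge the lowest-perplexity pair until one text remains.

    texts maps a label to normalized text; returns the merges in order.
    """
    texts = dict(texts)
    merges = []
    while len(texts) > 1:
        # Retrain every model, since the previous merge created a new "language".
        models = {name: train(t, order) for name, t in texts.items()}
        # Score every ordered pair (model language, scored language).
        _, x, y = min(
            (perplexity(models[m], texts[t], order), m, t)
            for m in texts for t in texts if m != t
        )
        texts[f"({x},{y})"] = texts.pop(x) + " " + texts.pop(y)
        merges.append((x, y))
    return merges
```

Calling `cluster` on a dict of normalized texts keyed by language code returns the list of merged pairs in order, which is enough to draw a binary tree bottom-up. Retraining every model on each pass is wasteful but keeps the sketch close to the description.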
The procedure produced a binary-branching tree of inferred relationships between the languages. Want to see what it looked like?
Each language is represented at the bottom of the tree by its two-letter ISO-639-1 language code (with the exception of Sorbian, which has no two-letter code and which I've represented as "sb"). The tree above the languages represents the sequence of merge operations, shown bottom-up: for example, Slovak (sk) and Czech (cs) were merged first, then that merged "language" was merged with Sorbian (sb), and so on. The underlining of the language codes marks portions of the tree that are correct with respect to the language taxonomy in the Ethnologue: subtrees containing groups of languages in exactly the right relationships are double-underlined; subtrees containing languages that are related but with the wrong relationships are single-underlined. Subtrees in color correspond to real language families and super-families (see legend).
What's right:
- It detected the Slavic languages and their relationships correctly. [Update: actually, Polish is in the wrong place; see the comments.]
- The West Germanic languages are similarly correct, except that English is missing.
- The Scandinavian (a.k.a. North Germanic) languages are correctly associated with West Germanic, though the relationships are slightly wrong.
- The Romance languages are all together in a subtree. The West Iberian languages Spanish (es), Portuguese (pt), and Galician (gl) are in the right relationships. Romanian (ro) is correctly peripheral to the family.
- Both languages representing the Baltic family are grouped correctly.
What's wrong:
- As mentioned above, the Scandinavian languages are nearly correct, but Nynorsk (nn) should be over with Icelandic (is) and Faroese (fo). What's more, the whole family should fall into a single subtree that's a sibling of the West Germanic subtree.
- Similarly, the Romance languages are in the wrong relationships. For example, Walloon (wa) and French (fr) should be siblings; Catalan (ca) and Occitan (oc) should be more closely related to West Iberian than to French; and Sardinian (sc) and Corsican (co) should be in their own family separate from Eastern Romance (Romanian) and Italo-Western Romance (everything else).
- English (en) is categorized as closer to the Romance languages than to Germanic. I suspect this is because the register of the Universal Declaration of Human Rights includes a lot of Latinate vocabulary.
- The Brythonic Celtic languages Welsh (cy) and Breton (br) are correctly siblings, as are the Goidelic Irish (ga) and Scots Gaelic (gd), but the relationship between the two families was not detected, so there's no Celtic subtree.
- Estonian (et) and Finnish (fi) are correctly grouped; however, the other Uralic languages Saami (se) and Hungarian (hu) are off by themselves while the unrelated isolate Basque (eu) is attached to the Finnic languages (whose language family id in the online version of the Ethnologue is...90210!).
- Indo-European Kurdish (ku) is a sibling of Turkish (tr), to which it is not genetically related, while Turkic Uzbek (uz) is not. This may be a result of contact and borrowing, but I don't know any Turkish or Kurdish, so that's just a guess.
In general, my program did a good job of detecting close genetic relationships—surprisingly good, I think, given that it's just doing straightforward statistics on letter sequences and training on fairly small files (~10-14K). Slavic and Germanic came out quite well; Romance was more disappointing, but at least it was all in a single subtree.
The program did worse on the larger-scale relationships. The Slavic languages, which are Indo-European, were treated as more distantly-related to Romance and Germanic than such non-Indo-European languages as Estonian (et), Finnish (fi), Basque (eu), Turkish (tr), Uzbek (uz), Hungarian (hu), and Saami (se).
I threw in a single constructed language to see what would happen: Esperanto (eo). Based on letter sequences of five or less, Esperanto looks to the program like a very peripheral Romance language—just slightly more Romancy than English—which sounds about right to me.
As often happens with projects like this, even though I've turned in the final write-up I'm feeling the urge to spend a lot more time and energy trying to improve the results. I suppose that's a good sign: it would be bad if I never wanted to look at it again, given that I'm hoping for a career that will include computational linguistics. The next step would probably be going back to the UDHR web site and downloading every other language written in the Latin alphabet or a variant. I focused on Indo-European and included some nearby unrelated languages as a sanity check, but there's no reason this technique couldn't be applied to various Oceanic languages, for example. It would also be interesting to try to include all the languages written in Cyrillic by Romanizing them in some consistent way. In the course of the project, I tried doing more sophisticated filtering to enable me to train factored language models, which (intuitively) ought to be able to detect some of the more distant relationships while still getting the nearby ones correct, but I never got results with FLMs that were better than the above tree. It would be nice to either get FLMs working or convince myself they really won't help.
I'll probably put up a PDF of this paper on my web page at some point, but I think I'll wait until I get back the professor's comments and have a chance to revise it. Though I suppose if I'm worried about getting some detail publicly wrong I probably shouldn't be writing up the project in a blog post...
[Now playing: "Love Will Tear Us Apart" by Bis]
Have you read the papers by Tandy Warnow, Don Ringe, & others on computational modeling of the relationships among IE branches? They're pretty interesting and might give you some ideas.
Posted by: Bridget | June 10, 2006 at 06:12 AM
That's really cool. One quibble—you give the program full credit for the relationships within the Slavic branch, but shouldn't PL be grouped together with its West-Slavic fellows CS, SK, and SB rather than off by itself?
Posted by: Q. Pheevr | June 10, 2006 at 01:28 PM
...shouldn't PL be grouped together with its West-Slavic fellows...
Oops, good point. I had read the Ethnologue tree wrong and thought those three groups were all siblings. See, I knew I'd need to revise something...
Posted by: The Tensor | June 10, 2006 at 02:43 PM
Have you read the papers by Tandy Warnow, Don Ringe, & others...
I hadn't before, but I just found, downloaded, and skimmed several of them (there's a bunch here). The bad news is their work is clearly relevant and I didn't find it in my literature search, in spite of it being mentioned in a series of Language Log posts about the subject (follow the links from here) that I must have read. The good news is my work doesn't seem to duplicate theirs. Their techniques operate on data sets based on rich linguistic analyses that include things like which sound changes have occurred in which languages and cognates in basic vocabulary, whereas mine uses very surface-oriented (you might even say "shallow") features of the data. They're working on determining the larger-scale relationships between families that my program does poorly at; those older relationships are unknown, and therefore harder to prove and more scientifically interesting.
Posted by: The Tensor | June 10, 2006 at 03:25 PM
I admit I don't know much about computational phylogeny, but what I have heard of Ringe and others' works seemed both promising and doubtful. I don't have the data with me immediately, but I recall one critique being that they were not surfacy enough: in other words, they assigned characteristics (features, etc.) to languages based not on attested sound correspondences but on reconstructed sound changes. For instance, IIRC, they gave German a characteristic feature common to other Germanic languages, all of which underwent sound_change_1; however, this is obscured by a later change in only German (sound_change_2), and so saying that German has the characteristic sound_change_1 has a sort of implicit subgrouping hypothesis built in. Of course the point is not to establish Germanic but to go even further into the past, but still... I guess I'll have to actually read their work.
Posted by: Russell | June 20, 2006 at 05:40 PM
analyzing written language is a limit since orthographical rules can make two closely related languages look pretty different from each other
Posted by: maitreya | June 30, 2006 at 12:54 AM