A while ago, I posted about my dislike of Greek-Latin compounds and mentioned that the common computational-linguistics term unigram is an example of Greek-Latin compounding (or rather, Latin-Greek compounding), and that it should therefore be anathema. It turns out I'm not the first to have noticed this.
In Manning and Schütze's Foundations of Statistical Natural Language Processing, upon introducing the concept of n-grams, they write:
Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram models that people usually use are for n = 2, 3, 4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to cause any Classicists who are reading this book to stop and leave the field to uneducated engineering sorts: gram is a Greek root and so should be put together with Greek number prefixes. Shannon actually did use the term digram, but with the declining levels of education in recent decades, this usage has not survived. As non-prescriptive linguists, however, we think that the curious mixture of English, Greek, and Latin that our colleagues actually use is quite fun. So we will not try to stamp it out.
[footnote] Rather than four-gram, some people do make an attempt at appearing educated by saying quadgram, but this is not really correct use of a Latin number prefix (which should give quadrigram, cf. quadrilateral), let alone correct use of a Greek number prefix, which would give us "a tetragram model." (p. 193)
I have several reactions to this passage. First, good for Shannon for coining digram instead of bigram (which, I am embarrassed to admit, I referred to as a "straightforward Greek compound" in my previous post—see how insidious these hybrid compounds are?). Also, tetragram seems reasonable to me; it does have some theological associations, but I think tetragrammaton is the more usual term in that connection. However, after n = 4 things get tricky. In the same way that replacing unigram with monogram is difficult because monogram already means something, using pentagram or hexagram is hard because they're already allocated. Maybe there's no clean solution to this problem. Sigh.
Well, trigram's already allocated as well - to the eight trigrams of the I Ching, four of which appear on the flag of South Korea. This doesn't seem to have caused a problem. Hexagram has a similar sense.
Posted by: Tim May | April 19, 2005 at 05:41 AM
While I have a mild preference for keeping Greek prefixes and Greek roots together and would do so if coining a term myself, it's silly to deprecate already-existing words for that reason, as silly as complaining about nom de plume because the French say nom de guerre (or for that matter complaining about the French creation of le parking). Unless you're willing to campaign for "teleorasis" as a replacement for "television," you might as well forget it.
Posted by: language hat | April 20, 2005 at 10:24 AM
Oh dear -- I see I already made that point in the comments to your earlier entry. Do I repeat myself? Very well then, I repeat myself. Do I repeat myself? Very well then, I repeat myself. I grow old, I grow old...
Posted by: language hat | April 20, 2005 at 10:25 AM
Teleorasis...I like the sound of that. Campaign initiated!
Posted by: The Tensor | April 20, 2005 at 01:09 PM
"Teleorasis" abbreviates conveniently to "telly", so your campaign has every chance of success.
Posted by: trevor@k’alebøl | April 20, 2005 at 02:47 PM