Friday January 27, 2006

T9 Collisions

There are several different methods for entering text messages into mobile phones.  The least-common-denominator method that all phones seem to have is the one where you hit number keys repeatedly for different letters—i.e. "1" for "a", "11" for "b", "111" for "c".  This works, but it takes a lot of key presses.  To make things easier, there are at least a couple of predictive text methods that try to guess what you're spelling using only a single key for each letter.  The one I'm most familiar with is T9, which uses a dictionary to narrow down the possibilities and guess the word you mean.  Suppose you want to enter the word "of".  Using the multi-tap method, you'd press "666333".  Using T9, you'd just press "63", then software uses a dictionary to exclude words that you didn't mean—there are a lot of nonsensical possiblities like "nf"—and eventually produces "of".  That's two key presses instead of six, which is a nice improvement—but, as you may have guessed, there's a problem.

That problem is "me".  The T9 key presses for "me" are the same as for "of", so that sequence of key presses is ambiguous.  The solution is to show the most likely word (for some value of "most likely"), then allow the user to cycle through the other possible words with a different key.  This Wikipedia article has several names for such ambiguous key sequences—textonyms, adaptonyms, cellodromes, T9-agrams, T9-o—but I just think of them as T9 collisions, by analogy with hash collisions.

I got to wondering what the longest ambiguous sequence of key presses is, so I wrote up a simple Perl script to find out the answer.  The algorithm is straightforward: pass through a list of words, mapping each to its T9 equivalent, and keep track of how many times each T9 sequence occurs.  I first tried it with the venerable Unix word list, but later, after poking around online, I found a web page with a bunch of different word lists, from which I downloaded the somewhat larger Ispell word list for English.

Running my little program across that list produces four examples of T9 key sequences that map to 11 different words, and a single example of a key sequence that maps to 12 different words.  The 11-word sequences are:

72837:  paver, raves, saves, scuds, sates, raver, paves, rates, saver, pater, and rater

7663:   rone, rood, pone, some, Rome, poof, roof, sone, pome, pond, and pood

269:  any, boy, coz, cox, BMW, cow, coy, boz, box, Amy, and bow

2666:  boon, coom, conn, anno, amon, anon, Bonn, como, boom, coon, and ammo

Some of these strike me as obscure (e.g. rone 'thicket', rood 'a cross (of crucifixion)') or possibly not words at all (e.g. amon isn't in the OED).  The winning 12-word sequence, though, is made up entirely of plausible words:

22737: bards, capes, acres, bases, bares, cares, barer, cases, baser, caper, cards, and carer

That's not bad, but we can do a little better.  Playing around with a couple of other word lists, I noticed that a subset of this group of words often appeared at the top of the list, but there were a couple of additions, namely barfs and caser (the person who cases a house).  The sequence "22737", therefore, is 14-ways ambiguous.

Are you ready for the embarrassing part?  When I was putting this post together, I'd thought I'd come up with something clever and original, but then I did a little more Goggling and came across this post on Boing Boing.  Oops.  Turns out I've duplicated work somebody else has already done, and what's worse, I read Boing Boing, so that's probably how I got the idea floating around in my head in the first place.  Fortunately, this is a blog post rather than a journal article, so I can publish it anyway.  In fact, I actually do have a contribution to make: the other guy found the "22737" sequence, but only describes it as an 11-way collision, leaving out carer, caser, and barfs.

An interesting followup question occurs to me (for those of you from academia, this is the "directions for future work" section of the blog post).  The way that letters are assigned to keys ought to have some effect on ambiguity; for example, if all the vowels were on one key, we'd expect to see more ambiguity with sets of words like bat, bet, bit, but.  So, assuming 8 keys with 3 or 4 letters per key, what's the arrangement of letters that produces the least amount of ambiguity for English?  There are at least two ways we could calculate that ambiguity: the arrangement whose largest ambiguous word set is the smallest, or the arrangement that produces the smallest total number of collisions regardless of word set size.

The programming should be straightforward.  I'll get right on it, just as soon as I'm done with all the other real work I'm supposed to be doing...

I am The Tensor, and I approve this post.
03:26 PM in Linguistics | Submit: | Links:

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d8341c88ad53ef00d834a557be69e2

Listed below are links to weblogs that reference T9 Collisions:

Comments

This is interesting mathematically, but I would be more interested if there were some way of knowing the most common T9 collisions considering the kind of text messages most frequently sent. I.e., my text messages tend to be something like "Call Julia after work she met a new guy" and not "Deformed, unfinished, sent before his time into this breathing world..." or even "The paver took the ammo to the roof to hunt some scuds."

Of the listed collisions, 269 seems include words that would cause the most real problems in a text message, with any, boy, BMW, cow, box, Amy,and bow seeming to be the most common.

In fact, I have a bizarre inclination to text message someone with "Amy said that any boy who put a cow and a box into his new BMW should take a bow."

cul8r ;)

Posted by: Andrew at Jan 27, 2006 4:08:30 PM

I'm looking forward to seeing your results!

Posted by: Shreyas at Jan 27, 2006 6:43:15 PM

I'm pretty sure Amon is an Egyptian diety. I also don't think "rood" is that obscure since I've seen it in more than one context.

Posted by: includedmiddle at Jan 27, 2006 6:58:04 PM

For practical purposes, as Andrew suggests, the interesting collisions are those involving very commonly used words (even if only two), like your example of "me" and "of." The one that I come up against most is 4663: home & good. (As parent of teenagers, I need both words frequently.)

Posted by: Margaret S. at Jan 27, 2006 8:46:34 PM

I agree that ranking rearrangements of letters with an eye towards minimizing collisions in common words would be desirable (and I still intend to figure it out at some point). However, I think it's worth mentioning that this is all just theoretical. It's very unlikely that phone manufacturers would be willing to radically rearrange the letters on phone keypads. Too much history.

IM, you're right of course about Amon being an Egyptian diety. It's interesting how the lack of an initial capital letter caused me not to recognize the name—I kept wanting it to be a misspelling of among.

Other T9 collisions that annoyed me tonight: no/on, in/go.

Posted by: The Tensor at Jan 28, 2006 12:36:48 AM

Amon is definitely an odd one to have on a list though. As is rood. Both are fairly "technical" words, though both can occur in non-technical discussions of Egypt's gods or Reformation history (talking about what was done to churches at that time of Henry VIII (of England), and to their contents, is the only place I've come across the word).

The name Amon does raise another possibility, and one that may be interesting. What are the T9 collssions of personal names? It must be an issue because most people (I assume, I don't text personally) add their name to the end of their messages, so common names must be quite common on txts.

And thinking further, what about T9 collisions for common txt short-cuts? Like txt - rut maybe, like I say, I don't text myself.

Posted by: Allan at Jan 28, 2006 12:43:30 PM

Motorola's iTap includes a statistical analysis, based on your usage. I've noticed that the more frequently I use a word, the more likely it is to show up as first in my list of options.

Posted by: Darryl at Jan 29, 2006 9:40:28 PM

When we were in Europe this summer I used a Motorola phone with iTap, and I didn't like it. If I recall correctly, the Select button is used both to accept a predicted word and to send a message, so accidentally double-pressing would send the message prematurely. It also inserts a space after each word, which you then have to delete if you want to put in a period or a comma.

I think usage-based prediction sound like a great idea, but there's a possible downside: the lack of predictability. If the T9 sequence "63" always produces "of" before "me", I can learn to hit "63[next]" to get "me" consistently. If, on the other hand, I sometimes get one and sometimes the other, depending on what word I've recently used more often, I have to look at the screen to see whether I need to hit [next].

Posted by: The Tensor at Jan 29, 2006 9:59:46 PM

It's not the OED you're using, though (the Guardian ran a relevant article fairly recently: http://www.guardian.co.uk/g2/story/0,3604,1685388,00.html)

Two more collisions for you, from b3ta.com: Smirnoff/poisoned, coal/cock/anal.

Posted by: Matt at Jan 30, 2006 11:29:32 AM

In light of your proposed directions for future work, check out the new Blackberry 7100 models, which attempt to fit a QWERTY-style keyboard into the constraints of a normal phone keypad. They use just two characters per key, and while they have the same general problem of ambiguity among predicted words, the number of collisions is greatly reduced.

The major problem I have had since buying one of these for myself is that none of the key mappings correspond to the traditional ones, and so when I'm given a phone number as a cutesy phrase (like 1-800-LETS-SUE), I end up spraining my brain trying to remember what the "right" mappings are. (The phone actually has a function to map letters to digits for just this problem, but it's a bit clumsy.)

Posted by: Semantic Compositions at Jan 31, 2006 12:35:42 PM

There was a puzzle in this year's MIT Mystery Hunt that was based on T9 mappings of letters to numbers. The solution to the puzzle was the word MAXIMUM, but one team outsmarted themselves and attempted to submit the answer NAZI NUN.

Posted by: AJD at Feb 2, 2006 8:05:05 AM

@Blackberry 7100: The one that drives me nuts is that "are" and "see" are the same key combinations, and the BB inevitably goes for the wrong one. Grrrr....

Posted by: Michael at Feb 8, 2006 10:18:18 AM