This quarter I'm working on a machine-translation project. To start, we're using a set of seventeen sentences that exercise some simple grammatical phenomena. I don't speak most of the languages that have landed on my plate, so as a first pass I've been running the sentences through machine translation systems on the web. I realize it's old news that round-trip translations are funny, but the results for the English-Korean-English loop are especially dreadful.
Ordinarily, I try to ignore Cory Doctorow's posts on intellectual property and technology over at Boing Boing, focusing instead on his less hysterical and more entertaining posts about net.weirdness. The man's a science fiction writer, after all, not a technologist, so he can't be expected to get all the details right. He gets them egregiously wrong, though, in his recent post about Windows Vista's built-in restrictions on high-definition video.
An important concept in a recently completed generals paper of mine was mutual information, a measure of how much knowing the value of one random variable tells you about another. Since it's a measure of information, you might expect that the units of mutual information are bits, and you'd be right, much of the time. Bits are the most commonly used units nowadays, but they're not the only possible ones.
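To make the point about units concrete, here's a minimal sketch in Python. It assumes the joint distribution of two discrete variables is given as a nested list, and the function name `mutual_information` and the example probabilities are made up for illustration. The only thing that changes between units is the base of the logarithm: base 2 gives bits, base e gives nats, and base 10 gives hartleys.

```python
import math

def mutual_information(joint, base=2.0):
    """Mutual information I(X; Y) of a discrete joint distribution.

    joint[i][j] is P(X = i, Y = j). The base of the logarithm picks
    the unit: 2 for bits, math.e for nats, 10 for hartleys.
    (Hypothetical helper, written for illustration only.)
    """
    # Marginals: row sums give P(X = i), column sums give P(Y = j).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]

    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                total += pxy * math.log(pxy / (px[i] * py[j]), base)
    return total

# Made-up example: two binary variables that usually agree.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

bits = mutual_information(joint, 2.0)
nats = mutual_information(joint, math.e)
hartleys = mutual_information(joint, 10.0)
print(bits, nats, hartleys)
```

The three numbers describe the same quantity. Converting between them is just a change of logarithm base, so the value in nats is the value in bits multiplied by ln 2.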
Last month I described a program I wrote in Perl for a machine translation project, and how it turned out to be much slower than I expected. As you may recall, I rewrote the program in C++, and it was a hundred times faster. I suspected the difference was due to all the conversion back and forth between strings and floating-point numbers in the Perl version, and to all the string copying necessary for function calls. Some people (in person and in the comments) suggested I should try rewriting it in Python, which has native floating-point numbers, to see if it was any faster.