As I mentioned in a previous post, I took a computational linguistics class in the spring that focused on multilingual processing. One of the two major assignments for the class involved the use of GIZA++ to train alignment models for a bilingual corpus of French and English text. In the course of cobbling together a little software system to complete the assignment, I made an engineering decision that had a drastic effect on the performance of one part of the system: I wrote it in Perl.
OK, first let me back up a bit and explain all that business about alignment models and bilingual corpora. Ready? Statistical machine translation systems are trained on large corpora containing two parallel texts, one in the source language and one in the target language. The two texts are translations of each other, usually translated by humans, and the SMT system "learns" how to translate by comparing a lot of sentences in the source and target languages. (There's a lot of Deep Magic glossed over in that sentence.) A good source of such corpora is the proceedings of legislatures, like Canada's, that have to publish their proceedings in multiple languages; the translation is often very consistent, and legislators love to talk, and talk, and talk... (Aside: There's a special word for such parliamentary proceedings, hansard, which comes from the name of the printer who used to publish them in England.)
One of the many layers in an SMT system is the alignment model, which assigns probabilities to relationships between words in the input and output sentences. For example, the alignment model may notice, after looking at millions of French and English sentences, that when pen occurs in the English sentence, plume is very likely to occur in the French sentence. Depending on the alignment model, it may also learn things about the relative positions of those words in their corresponding sentences.
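(If you want the flavor of what "learning" a lexical association looks like, here's a toy C++ sketch that just counts how often English and French words co-occur in aligned sentence pairs. To be clear, this is nothing like the IBM models GIZA++ actually trains; it's the crudest possible version of the idea, with made-up data.)

```cpp
// Toy illustration only: raw co-occurrence counts over aligned sentence
// pairs. Real alignment models (the IBM models GIZA++ trains) are far
// more sophisticated; all data here is made up.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // A miniature "parallel corpus": English sentences paired with
    // their French translations.
    std::vector<std::pair<std::string, std::string>> corpus = {
        {"the pen is red", "la plume est rouge"},
        {"a red pen", "une plume rouge"},
    };

    // Count every (English word, French word) pairing within a sentence pair.
    std::map<std::pair<std::string, std::string>, int> cooc;
    for (const auto& [en, fr] : corpus) {
        std::istringstream ein(en);
        for (std::string e; ein >> e;) {
            std::istringstream fin(fr);
            for (std::string f; fin >> f;) ++cooc[{e, f}];
        }
    }

    // "pen" and "plume" show up together in both pairs; that kind of
    // signal is what an alignment model turns into a probability.
    std::cout << "pen/plume co-occurrences: " << cooc[{"pen", "plume"}] << "\n";
}
```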
So, in our assignment, we were given a large French-English corpus and asked to train various alignment models, of which GIZA++ provides a staggering variety with lots of knobs to tune, trying to get the best results on a much smaller test corpus. That was the easy part. The hard part came at the end. Having trained our best French-English alignment model, we were asked to use it, somehow, to generate alignments on a small Romanian-English corpus.
Tricky. I ended up doing something pretty straightforward (and possibly even a bit brute-force). Given the two bilingual corpora (F-E and R-E), I had the vocabulary lists for each language. To use the French-English alignment model on the Romanian corpus, I simply made a pass through the Romanian data, replacing each Romanian word with the nearest French word, hoping that Romance-language cognates would produce something reasonable. In effect, I projected the Romanian corpus through a vocabulary mapper, creating a French corpus. The results were OK, but not astonishing—still, anything better than absolute garbage meant the approach was working.
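For concreteness, here's roughly what that projection pass amounts to, sketched in C++ even though my actual script was Perl. The file names and the two-column mapping format are made up for the sketch; the interesting part, building the nearest-word mapping, is the subject of the next paragraph.

```cpp
// Sketch of the projection pass. Assumes a "ro2fr.txt" file of
// "romanian_word french_word" pairs already exists (building it is the
// hard part, described below); both file names are hypothetical.
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    // Load the precomputed Romanian -> nearest-French-word mapping.
    std::ifstream mapFile("ro2fr.txt");
    std::map<std::string, std::string> nearestFrench;
    std::string ro, fr;
    while (mapFile >> ro >> fr) nearestFrench[ro] = fr;

    // One pass over the Romanian corpus, swapping each word for its
    // nearest French neighbor; unseen words pass through unchanged.
    std::ifstream in("corpus.ro");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word, sep;
        while (words >> word) {
            auto it = nearestFrench.find(word);
            std::cout << sep << (it == nearestFrench.end() ? word : it->second);
            sep = " ";
        }
        std::cout << "\n";
    }
}
```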
I left out an important detail of the vocabulary projection. How do you pick the "nearest" French word for each Romanian word? That's the part of the system I mentioned at the beginning of the post. (Still reading?) I wrote a simple Perl script that loaded both vocabulary files and then, for each Romanian word, calculated the Levenshtein edit distance between that word and every French word, choosing the one with the lowest edit distance. I said it was brute-force, didn't I? It also had to do the same operation for the two English vocabularies, but that wasn't nearly as interesting a process. In any case, the total number of edit distances that needed to be calculated was...wait while I do the math...128,970,506. That's a lot of calculations, and the edit distance calculation is O(source length * target length). Still, these processors we have nowadays are running at gigahertz clock speeds. How long could it take? (Guess.)
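(While you're guessing, here's the shape of that computation, sketched in C++ rather than the Perl I actually used. The tiny cost subroutines and the three-word vocabulary are stand-ins, but the doubly nested DP and the linear scan over the whole French vocabulary are exactly the brute force I'm describing.)

```cpp
// Sketch of the brute-force nearest-word search: a textbook O(m*n)
// Levenshtein DP, with the costs factored into tiny subroutines, and a
// linear scan over the entire French vocabulary for every Romanian word.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// The "very short subroutines": unit insertion/deletion costs, and a
// substitution cost of 0 or 1 depending on whether the characters match.
double insCost(char) { return 1.0; }
double delCost(char) { return 1.0; }
double subCost(char a, char b) { return a == b ? 0.0 : 1.0; }

// Classic two-row dynamic-programming edit distance: O(|s| * |t|) time.
double editDistance(const std::string& s, const std::string& t) {
    std::vector<double> prev(t.size() + 1), cur(t.size() + 1);
    for (size_t j = 0; j <= t.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= s.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= t.size(); ++j) {
            // A few fetches, a few adds, a min, and a store.
            cur[j] = std::min({prev[j] + delCost(s[i - 1]),
                               cur[j - 1] + insCost(t[j - 1]),
                               prev[j - 1] + subCost(s[i - 1], t[j - 1])});
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}

// For each Romanian word, scan every French word and keep the closest.
std::string nearestWord(const std::string& ro,
                        const std::vector<std::string>& frVocab) {
    std::string best;
    double bestDist = 1e30;
    for (const auto& f : frVocab) {
        double d = editDistance(ro, f);
        if (d < bestDist) { bestDist = d; best = f; }
    }
    return best;
}

int main() {
    std::vector<std::string> frVocab = {"plume", "stylo", "rouge"};
    std::cout << nearestWord("pluma", frVocab) << "\n";  // prints "plume"
}
```

Each individual distance call is cheap; the pain is purely in making 130 million of them.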
The answer is twenty hours. Twenty. Hours. That's pretty damn slow, but not so slow that I wasn't able to finish the assignment on time; I turned it in with 4 minutes to spare at 11:56PM. After I turned it in, because I just can't let these things go, I kept wondering why it took so long. Suppose the average source and target strings were 10 characters long (which is an overestimate, I think). That would mean something on the order of 100 times 130 million, or 13 billion operations. Shouldn't that be calculable in minutes rather than hours?
To find out, I reimplemented the vocabulary projector in C++. This only took an hour or so, mostly because my Perl tends to look pretty C-ish. When I was done, I ran the C++ version on the same data, producing the same results. Care to guess how long it took?
Twelve minutes. Twelve. Minutes! That means the Perl version, running exactly the same algorithm, was a hundred times slower. A factor of a hundred. Two orders of magnitude. Imagine, if you can, how annoyed I was. Thinking that maybe I'd just done something dumb in my Perl version, I tried commenting out various potential time-sinks to see how much things sped up. (I know there must be a way to profile Perl, but I'm not that savvy.) In the end, there was no single hotspot; it was just the accumulated overhead of Perl's string-oriented data storage. The C++ program would (a) access a character from each string (an array lookup), (b) pass them to a few very short subroutines to calculate the floating point insertion, deletion, or substitution cost, then (c) add that cost to a few previous costs and store the result in an array, all of which can be accomplished with a few fetches, math ops, and stores. The Perl program, on the other hand, would (a) create a new single-character string for each character, (b) pass those to subroutines, which (I think) involves making another copy, (c) calculate a floating point cost, (d) convert that cost into a string and return it, (e) convert that string back into a float, several times, to do the additions, finally storing the result in another string, and (f) store that string (or possibly another copy of it) in an array. That's a lot of string creation and conversion, and the result was terrible performance.
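To make that churn concrete, here's a deliberately perverse C++ caricature of the Perl behavior I just described: the same DP loop, but every character becomes a fresh one-character string and every cost makes a round trip through a string. I'll stress that this illustrates the overhead pattern I think was happening, not Perl's actual internals.

```cpp
// A caricature, not Perl: the same DP loop, but with every character
// copied into a fresh one-character string and every cost stored as a
// string and re-parsed for each addition. Purely to illustrate the
// overhead pattern described above.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

double subCost(const std::string& a, const std::string& b) {
    return a == b ? 0.0 : 1.0;
}

double stringyEditDistance(const std::string& s, const std::string& t) {
    // Costs live as strings, so every read is a parse and every write
    // is a formatting call.
    std::vector<std::string> prev(t.size() + 1), cur(t.size() + 1);
    for (size_t j = 0; j <= t.size(); ++j) prev[j] = std::to_string(double(j));
    for (size_t i = 1; i <= s.size(); ++i) {
        cur[0] = std::to_string(double(i));
        for (size_t j = 1; j <= t.size(); ++j) {
            // Fresh one-character strings for the two characters compared.
            std::string a(1, s[i - 1]), b(1, t[j - 1]);
            double best = std::min({std::stod(prev[j]) + 1.0,
                                    std::stod(cur[j - 1]) + 1.0,
                                    std::stod(prev[j - 1]) + subCost(a, b)});
            cur[j] = std::to_string(best);  // ...and back into a string
        }
        std::swap(prev, cur);
    }
    return std::stod(prev[t.size()]);
}

int main() {
    // Same answer as the plain version, with far more allocation churn.
    std::cout << stringyEditDistance("pluma", "plume") << "\n";  // prints 1
}
```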
Now, don't get me wrong, I like Perl. Whenever I need to do text-file processing, it's my tool of choice. That comparative linguistics project, for example, was implemented in Perl and shell scripts. But now I know that, when it comes to heavy-duty floating point number crunching, it's worth the extra annoyance to code it in C++. Sure, you have to put up with memory allocation, variable declarations, and clumsy string handling, but that's a small price for a 10,000% speed-up, don't you think?
[Update: I got curious and tried porting it to Python as well. It's still slow.]
Now I'm curious how long this would take in Python, which uses native types for floats and ints...
Posted by: Joshua Macy | July 06, 2006 at 06:20 AM
I think you've just shattered my dreams of going into machine translation. \(>-^)/
Posted by: komfo,amonan | July 06, 2006 at 07:14 AM
This might have something to do with the fact that Perl is optimized for string level operations and doesn't really like to look at things on a character level, which is C++'s natural mode of operation.
Posted by: includedmiddle | July 06, 2006 at 07:38 AM
You know, I've just started looking at SWIG for some of these very reasons.
Also there's a lot of C++ libraries in the CL industry that would be nice to have Perlish control over.
Posted by: Trochee | July 06, 2006 at 10:03 AM
I'd prefer Perl to C++ most of the time. The only time I'd go to C++/C is if there was something Perl couldn't do.
Posted by: ruhi | February 10, 2010 at 01:41 AM