Last month I described a program I wrote for a machine translation project in Perl and how it turned out to be much slower than I expected. As you may recall, I rewrote the program in C++, and it was a hundred times faster. I suspected this was due to all the conversion in Perl back and forth from strings to floating point numbers in the program, and to all the string copying necessary for function calls. Some people (in person and in the comments) suggested I should try rewriting it in Python, which has native floating point numbers, to see if it was any faster.
It wasn't. I'd never written a Python program before, so I timed myself as I ported and debugged it—about an hour and a half. When I finally had it working, it ran about as slow as the Perl version. A full run of the Perl version takes 20 hours and I wasn't feeling patient enough to let the Python version run to completion, but eyeballing the program's progress indicator showed the Perl and Python versions were about the same speed, while the C++ version was much faster.
This implies it's not actually all the string conversion and copying that are making the program slow, because Python doesn't do those. So what's slowing Perl and Python down? I decided to find out. I wrote little programs called add, mult, and func, in C++, Python, and Perl—nine programs altogether. Each consists of a single loop, and the names tell you what happens in that loop: add adds integers, mult multiplies floats, and func calls a dummy function. (In fact, mult and func also do an integer add for the loop counter, so they're really like add plus an additional operation.) All of them take a single command-line argument which tells them how many times to loop.
I started running them, increasing the number of iterations by a factor of ten each time, until I got some nice, macroscopic timings that were big enough to swamp any fixed startup cost. The magic number was 10,000,000 iterations. I've collated the timings, which are the "real" times reported by the Unix time command, in the table below. The "Slowdown" column tells how many times slower than the fastest (C++) version each program is:
Program  Language  Time (ms)  Slowdown
add      C++             23        --
add      Perl          1464       64x
add      Python        3242      141x
mult     C++             99        --
mult     Perl          2686       27x
mult     Python        5285       53x
func     C++             43        --
func     Perl          5143      120x
func     Python        6413      149x
As you can see, it's not specifically the floating point operations or the function calls that are causing the trouble. Even a tight loop containing nothing but integer operations is 64 times slower (Perl) or 141 times slower (Python) than the C++ version. Doing floating point multiplications or function calls in the loop affects the timings a bit, but the results are still roughly proportional: C++ is fast, Perl is an order of magnitude or two slower, and Python is roughly another factor of two slower than that.
This leads me to think that the performance differences are due almost entirely to the fact that Python and Perl are interpreted languages. I guess there's a lot of overhead associated with compiling the code (or the bytecode, in the case of Python) repeatedly as it's executed. At least, I'm pretty sure that's what's going on in the interpreters—if anyone knows better, I'd be curious to hear about it.
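Whatever the exact mechanism, the per-iteration overhead is easy to see with Python's dis module, which prints the bytecode the interpreter has to dispatch on every pass through the loop (the exact opcodes vary by Python version):

```python
import dis

def add(count):
    # The same loop as add.py below: every iteration dispatches
    # several separate bytecode instructions (load, compare, add,
    # store, jump) through the interpreter's main loop.
    i = 0
    while i < count:
        i += 2

dis.dis(add)
```

Each line of that disassembly is a separate interpreted instruction, whereas the C++ compiler boils the same loop down to a handful of machine instructions.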
These results are pretty disappointing. I like the high-level scripting languages, with all their built-in convenience, for doing quick-and-dirty programming. As I said, this was my introduction to Python, which I'm about to use for an unrelated project, and it's a pretty nice language—very terse, kind of like pseudo-code that actually runs. Unfortunately, for any project that's going to be computationally intensive—which is pretty much the definition of "computational linguistics"—a 25-to-150-times performance hit just isn't acceptable. With Python (and maybe Perl, I'm not sure) it's possible to write the innermost loops in C or C++, but I don't really like that solution for two reasons. First, as long as I'm going to have to write some C++ for speed, I might as well write it all in C++ and have the whole program be fast. Second, if I ever distribute the code, requiring users to know, and to have installed, two languages instead of one narrows the potential audience.
Hmm, now I'm wondering how Java would perform. No! Must... resist! Must... finish... generals paper...
[Note: This is actually the third version of this post. The numbers reported in the original version were wrong—even though I'd called g++ with the -O0 option, it still did some optimizations that threw the timings off, especially for mult. After this was pointed out in the comments, I tried to do a quick fix and ended up screwing it up much worse. Oops.
In particular, multiplying by -1.0 was being compiled into an exclusive-or instruction instead of a floating point multiplication, so it ran too fast. My quick fix was to multiply by -1.1 instead, but raising -1.1 to the ten millionth power overflowed repeatedly and wasted a bunch of time in floating point exception handling. That made it run way too slow. The real fix was to multiply twice in each loop, once by 10.0 and once by 0.1 (and do the loop half as many times). That correctly compiles to floating point multiplication instructions and never overflows. I also had to make all the other programs increment their loop counters by two in order to get an addition instruction instead of a special-purpose increment instruction, but the performance difference there was very small.
I decided that having an update and an update to that update in this post would be hopelessly confusing, so I rewrote the whole thing instead. Thanks to Joshua "Necro" Macy, Russell "Mimsy was the" Borogove, and the mysterious KK for pointing this stuff out and keeping me honest. I've left all the original comments intact, so if you're interested, you can probably reconstruct most of the sordid history of this post.
Oh, and greetings to any future hiring committees who might be reading this. Now, you might be thinking that the initial errors in this post imply a certain sloppiness on my part, but I prefer to focus on how the final, corrected version demonstrates my commitment to getting the details right... eventually.]
[Update: In response to several requests, here's the source for the various programs:
add.cpp
#include <stdlib.h>

int main(int argc, char** argv)
{
    int i = 0;
    int count = 2 * atoi(argv[1]);
    while (i < count) {
        i += 2;
    }
    return 0;
}
add.pl
#!/usr/bin/perl
$i = 0;
$count = 2 * $ARGV[0];
while ($i < $count) {
    $i += 2;
}
add.py
#!/usr/bin/python
import sys
i = 0
count = 2 * int(sys.argv[1])
while i < count:
    i += 2
mult.cpp
#include <stdlib.h>

int main(int argc, char** argv)
{
    double val = 1.0;
    int i = 0;
    int count = 2 * atoi(argv[1]);
    while (i < count) {
        val *= 10.0;
        i += 2;
        val *= 0.1;
        i += 2;
    }
    return 0;
}
mult.pl
#!/usr/bin/perl
$val = 1.0;
$i = 0;
$count = 2 * $ARGV[0];
while ($i < $count) {
    $val *= 10.0;
    $i += 2;
    $val *= 0.1;
    $i += 2;
}
mult.py
#!/usr/bin/python
import sys
val = 1.0
i = 0
count = 2 * int(sys.argv[1])
while i < count:
    val *= 10.0
    i += 2
    val *= 0.1
    i += 2
func.cpp
#include <stdlib.h>

void func(void)
{
    return;
}

int main(int argc, char** argv)
{
    int i = 0;
    int count = 2 * atoi(argv[1]);
    while (i < count) {
        func();
        i += 2;
    }
    return 0;
}
func.pl
#!/usr/bin/perl
sub func
{
}
$i = 0;
$count = 2 * $ARGV[0];
while ($i < $count) {
    &func();
    $i += 2;
}
func.py
#!/usr/bin/python
import sys
def func():
    return
i = 0
count = 2 * int(sys.argv[1])
while i < count:
    func()
    i += 2
If you have different results on your system, feel free to post the results in the comments.]
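[Update: if you'd rather not shell out to the time command by hand nine times, here's a minimal harness sketch in modern Python. It's not what I actually used (I just ran time), and it assumes the binaries and scripts are built and named as above:

```python
import subprocess
import sys
import time

def time_program(cmd, iterations):
    # Run one benchmark with the given iteration count and return
    # wall-clock milliseconds, roughly what `time` reports as "real".
    start = time.perf_counter()
    subprocess.run(cmd + [str(iterations)], check=True)
    return (time.perf_counter() - start) * 1000.0

# Hypothetical invocations; adjust names and paths for your system:
#   time_program(["./add"], 10000000)
#   time_program(["perl", "add.pl"], 10000000)
#   time_program(["python", "add.py"], 10000000)

# Smoke test: just time an empty interpreter run.
ms = time_program([sys.executable, "-c", "pass"], 0)
print("startup only: %.1f ms" % ms)
```

Adjust the command lists for wherever your copies live.]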
[Update: For some reason, this post seems to be accepting new comments but not displaying them. I'm hoping that this update mysteriously causes the post to republish correctly. (Get in the car and try it again!) If not, my apologies to John-Mark Gurney and Andrew.]
Python compiler package would be the place to go for bytecode compilation.
Posted by: Carson | August 04, 2006 at 06:50 AM
That shouldn't be necessary, though. Python always compiles to bytecode, and even saves the results so it doesn't have to recompile unless the source code has changed (unless you don't have permissions to create the equivalent .pyc files on disk). There's something else going on here, I think, maybe having to do with object creation and deletion?
If you're actually interested in pursuing this (perhaps for future Python projects), I'd post a question and maybe some sample code showing the loop to comp.lang.python.
Posted by: Joshua Macy | August 04, 2006 at 08:59 AM
My understanding is that Perl is compiled at startup, also, though I'm prepared to be completely wrong on this.
Perl's scalar datatype is, most fundamentally, a string, and it has to convert to and from numeric types every time it does a math operation. I'm guessing that's the real killer for you.
You can check out the "math" modules in CPAN; depending on what exactly you're trying to do, you might find a package that helps.
http://search.cpan.org/search?query=math&mode=module
Posted by: Russell Borogove | August 04, 2006 at 10:46 AM
Python always compiles to bytecode, and even saves the results so it doesn't have to recompile unless the source code has changed (unless you don't have permissions to create the equivalent .pyc files on disk).
You know, I read that—that Python always compiles to bytecode and saves it in a .pyc file—but I never saw it happen, and I do have permission to write to the directory I was running from. The code is in a .py file with #!/usr/bin/python as the first line. When I run it, should that produce a .pyc file?
Posted by: The Tensor | August 04, 2006 at 01:19 PM
Uh, what sort of optimization flags did you use for C++? Have you disassembled the binary to make sure the entire program hasn't been optimized away? With such a simple program (unless each number added/multiplied etc. was random), it is entirely possible that C++ simply did all the computations at compile time, leaving a binary that just prints out the answer... (It happened to me when doing similar comparisons.) Disassemble and look for JMP, JNZ etc., or whatever's appropriate for your architecture.
Posted by: KK | August 04, 2006 at 01:54 PM
I ran g++ with -O0 to turn off all the optimizations.
Posted by: The Tensor | August 04, 2006 at 02:24 PM
Ah, according to this, Python doesn't create the .pyc files if you run a script by giving its name on the command line (i.e. if you're taking advantage of the #! magic instead of running python script-name). I hadn't run into that before, 'cause I tend not to use the #! construct except for cgi... The .pyc files only speed up the load, though, not the running (by saving the initial parse to bytecode), so that can't be the source of the slowness you're seeing. That is to say, whether or not Python saves the result, what it's actually operating on internally is the bytecode it created from your source (this is basically the same thing Java does, with its JVM). There is a sort of JIT compiler for Python, to render the bytecode all the way down to machine code, called Psyco, that can sometimes produce some amazing speed-ups...
Posted by: Joshua Macy | August 04, 2006 at 02:50 PM
Whoa! What platform are you on that a floating point multiply in C++ is 100 times slower than XOR?
Posted by: Russell Borogove | August 04, 2006 at 04:50 PM
What platform are you on that a floating point multiply in C++ is 100 times slower than XOR?
arch says i686.
100 does seem high, but I don't see any crazy discrepancies in the code. Here are the innermost loops of both programs:
add:
mult:
Posted by: The Tensor | August 04, 2006 at 05:23 PM
Ugh, I'm not used to AT&T assembly syntax for x86. I assume the CPU has an actual floating point unit on it? Oh, are you accumulating multiplications by -1.1, so that you're computing -1.1 to the 10,000,000th power? In that case you're probably spending a lot of time on internal floating point exceptions, which are swamping the other costs.
Posted by: Russell Borogove | August 04, 2006 at 05:40 PM
...you're probably spending a lot of time on internal floating point exceptions...
Hmm, that could be. I'll poke at it and see...
Doh! You're right. I changed it to multiply by 10.0 and then by 0.1 in each iteration, incrementing the loop counter by two each time, and it runs in only 100ms. And yes, it's definitely doing two fmulps.
So I guess the original results were closer to the truth. I think I'm going to take this post down later tonight and rewrite it. (I'll leave the comments up as a monument to my shame.)
Posted by: The Tensor | August 04, 2006 at 05:51 PM
(I'll leave the comments up as a monument to my shame.)
Not at all. Math is hard.
Posted by: Russell Borogove | August 04, 2006 at 06:13 PM
A related site: Language Shootout.
You might consider donating some code to them or at least suggest a new benchmark, if you think it appropriate.
Posted by: E | August 05, 2006 at 11:39 AM
I used to be a staunch c++ zealot who despised anything interpreted for fear that exactly the kind of performance drop you described above would happen to me. I thought that it was not that much harder to do c++ when you're used to it, and why not do things right to begin with. Then I started grad school in CL and everyone was saying that perl was _the_ language for it. (Heresy!) So eventually they kinda converted me with their LIES. I used perl for a few assignments but ended up doing my thesis in java (a compromise, I guess).
In light of this new evidence I shall return to my proselytizing of the old ways of the c++.
I do wonder what the results would be for java though.
Posted by: BenE | August 07, 2006 at 12:53 AM
Use psyco if you want to be fair to python
Posted by: | August 07, 2006 at 09:47 PM
The comparison shows up the fact that C++ code is compiled to native data types and uses native machine ops. There is much more indirection going on in the interpreters. If nothing else, perl (and probably python) are providing bignum support on the integer ops.
This sort of thing isn't either language's strong suit. You'll also find these data types take up much more memory than a 32-bit int (they also 'do more'). What these languages *do* do well is provide richer data types which allow you to perform higher-level operations. These higher-level ops (e.g. complex string manipulations) are implemented in nicely tuned C, and that's where you can get good performance.
There are some options for dynamic languages which might handle arithmetic well. I understand that some common lisps (the commercial lisps and probably others e.g. sbcl) both compile to assembler and allow you to add (optional) type decorations so that you can inform the system that you don't care about integer overflow etc for these particular values. Some of the schemes (e.g. stalin) are also very optimisation-focussed.
Also, some of the newer dynamic language VMs will JIT code to assembler and then perhaps get closer to this kind of performance. I know that the parrot engine JITs this sort of thing and you could ask on the perl6 lists if you can do the type decoration thing (I think you can), in that case I'd expect you'd get within-a-small-integer-factor of native performance.
JVMs do a lot of JITting, as does (I think) the MS .net system.
So...if you want native-like performance in a dynamic language, choose one with a modern, JITting VM (not traditional python or perl) and try to find a language which allows optional type decorations (or just has a mad-scary optimiser which doesn't need them).
Posted by: jbert | August 08, 2006 at 01:16 AM
Another issue to consider in the Python version is the loop structure. The for and while loop constructs that C programmers know and love incur a significant overhead in Python. The conditions and iterator are evaluated by the interpreter every time around, and the data has to bubble up through the abstraction layers of the interpreted type system.
Many problems can be recast in a functional programming style using "map", where a function is applied to each element in a list. This lets the loop itself run inside the interpreter's C implementation, with next to no interpreted overhead per element. You might consider posting the source to the benchmarks and letting the folks who are interested take a shot at reformulating them into something more amenable to the scripting languages.
The Python wiki has a nice quick overview of tactics to speed things up:
http://wiki.python.org/moin/PythonSpeed/PerformanceTips#head-3a6f19c0b6c701b570f5201c84bdabff5e8c1cbb
I must admit my frustration that optimization tactics seem to be subtly different in every version of every higher-level language.
Posted by: chrisd | August 08, 2006 at 05:20 AM
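[As an illustration of the functional style chrisd describes, here's the mult loop recast so the iteration runs inside reduce at C level. A sketch only: speedups vary by Python version, and you pay for the up-front factor list in memory.

```python
import operator
from functools import reduce

def mult_reduce(count):
    # Same work as mult.py: count/2 multiplications, alternating
    # factors of 10.0 and 0.1. The list is built up front, and reduce
    # applies operator.mul (a C function) across it, so no interpreted
    # bytecode runs per multiplication.
    factors = [10.0, 0.1] * (count // 4)
    return reduce(operator.mul, factors, 1.0)

result = mult_reduce(10000000)
```
]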
Earlier someone said:
Perl's scalar datatype is, most fundamentally, a string, and it has to convert to and from numeric types every time it does a math operation. I'm guessing that's the real killer for you.
That's just not true. In Perl, a scalar can be a string or a number (or other things). A scalar can start life as a number ($foo = 1), and it will not be converted to a string until Perl needs its string representation ($foo .= " bar").
Posted by: Dave Rolsky | August 08, 2006 at 06:13 AM
I'm a big fan of Python, but since Python doesn't ship with Psyco, I think it's perfectly fair to compare the typical tasks you perform using straight Python vs. whatever. Moreover, while it's good to have a handle on what's a fast way to write code if you need to optimize hot-spots, if you have to write non-Pythonic code all the time to get acceptable performance then you're losing the big advantage Python has in having simple readable code (or maybe I just don't like reading programs where all the loops have been replaced by map or generators....)
Posted by: Joshua Macy | August 08, 2006 at 06:27 AM
You might also like to investigate LuaJIT, which can be significantly faster than Perl/Python.
Posted by: Greg Buchholz | August 08, 2006 at 07:07 AM
I second the call for posting the benchmarks. :)
As a first time python programmer (like Tensor said he was) there are some nice ways to shoot yourself in the foot performance-wise.
For example:
i = 0
while i < 1000000:
    i*1.3
    i += 1
is about 50% slower than:
for i in xrange(1000000):
    i*1.3
Posted by: Bob Fleck | August 08, 2006 at 09:17 PM
Bob Fleck is right: in an attempt at cross-language consistency, I did use "while i < count" rather than "for i in xrange(count)" in the Python programs. However, when I replace that line in mult.py, the resulting program is slower by almost a factor of four:
time mult.py 10000000
real 0m5.297s
user 0m5.290s
sys 0m0.000s
time newmult.py 10000000
real 0m19.839s
user 0m19.840s
sys 0m0.000s
Posted by: The Tensor | August 15, 2006 at 05:03 PM
Below are the Lua versions. These run around 6x/10x/4x faster than Python on my machine with plain Lua. With luajit -O you get another 10x speedup for all of them.
But I agree, the shootout features more applicable benchmarks.
add.lua:
mult.lua:
func.lua:
Posted by: Mike Pall | August 16, 2006 at 04:09 PM
To speed up the Python programs by a factor of 2 (at least on my PC), rewrite them so the variables aren't global anymore. So add.py becomes:
#!/usr/bin/python -u
import sys

def doit(count):
    i = 0
    while i < count:
        i += 2

count = int(sys.argv[1])
doit(count)
BTW: compilation to .pyc only happens for imported modules, not for the actual script you run.
cheers,
Peter
Posted by: Peter | September 08, 2006 at 03:39 AM
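[Peter's trick works because names inside a function compile to array-indexed LOAD_FAST lookups, while names at module level go through dictionary lookups on every access. A rough way to see the difference in modern Python (exact ratios vary by version):

```python
import time

N = 1000000

# Module level: i and N are looked up in the module's namespace dict
# on every pass (STORE_NAME / LOAD_NAME style opcodes).
start = time.perf_counter()
i = 0
while i < N:
    i += 2
t_module = time.perf_counter() - start

def doit(count):
    # Function locals: i and count use array-indexed LOAD_FAST,
    # which is considerably cheaper than a dict lookup.
    i = 0
    while i < count:
        i += 2

start = time.perf_counter()
doit(N)
t_function = time.perf_counter() - start

print("module level: %.0f ms, in a function: %.0f ms"
      % (t_module * 1000, t_function * 1000))
```

On most builds the function version comes out noticeably ahead.]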
Mea culpa on my misstatements about Perl datatypes above -- don't ask what was in my head.
Anyway, I found this lovely site full of cross-language benchmarks today:
http://shootout.alioth.debian.org/
Posted by: Russell Borogove | September 13, 2006 at 12:17 PM
First, people have been benchmarking C, Perl, and Python like this for ages; there are plenty of other comparisons out there.
Second, please retitle your article to Python vs. Perl vs. C. All of your "C++" programs can be compiled with a C compiler. If you want to call it C++, declare a class and call that in the loop; the simple examples you gave are about as far from C++ code as you can get.
Posted by: John-Mark Gurney | July 07, 2007 at 09:46 AM
Python may have native floats, but it's also dynamically typed, so each step through your 'for' loop involves type checking. For fast numerical operations in Python, use NumPy, which doesn't use dynamically-typed elements:
http://en.wikipedia.org/wiki/NumPy
http://www.scipy.org/Getting_Started
http://numpy.scipy.org/
Posted by: Andrew | January 26, 2009 at 11:55 AM
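[Along the lines Andrew suggests, here's roughly what the mult benchmark might look like vectorized with NumPy. A sketch, assuming NumPy is installed; the point is that the loop and the per-element type checks happen once, in C, rather than once per interpreted iteration:

```python
import numpy as np

def mult_numpy(count):
    # A typed array of count/2 values, multiplied element-wise in a
    # single C-level pass instead of one interpreted step per element.
    vals = np.full(count // 2, 10.0)
    vals *= 0.1
    return vals

vals = mult_numpy(10000000)
```
]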
I realized that this is an ancient thread; however, I have noticed that perl is about 30-45% faster comparing scalars when you cast them as int(). I am not sure why it is faster, but it makes a big difference.
Posted by: Doug | May 28, 2012 at 10:29 AM