With some helpful hints from Alon Zakai (see the comments) and others, I was able to determine that the tip-of-trunk build of Emscripten produces significantly better results on Firefox 22 and Chrome 30 (canary), as seen below. Firefox 22 definitely includes asm.js optimized code-generation, and based upon the degree of improvement (from a mean of 23ms/f to 7ms/f), Chrome 30 must be doing something similar. Note that Alon is seeing somewhat better results than I am on his desktop machine, for reasons that we haven’t quite teased out, but they’re close enough that I believe we have a fairly accurate representation of reality here.
I find this rather exciting, because it puts asm.js within a stone’s throw of NaCl and raw native code on Firefox, and is within 3x of native on an upcoming Chrome version. If a clever solution to the “allocate all your heap up front” problem can be found in asm.js, it might even turn out to be a good solution on mobile, where it’s badly needed.
|ms/frame||5th %ile||95th %ile||Ratio to C|
|asm.js (Firefox 22)||4.80||4.0||6.0||1.94|
|asm.js (Chrome 30)||7.10||6.0||8.0||2.86|
I’ve been meaning to follow up on this post for a few months now, but as you can imagine, life has a habit of getting in the way. But I finally have a bit of spare time this holiday weekend, so I’ll try to address a few of the issues raised by that post.
First, the updated results, as of today:
|ms/frame||5th %ile||95th %ile||Ratio to C|
|C (gcc 4.8)||2.48||2.17||2.80||1.00|
|asm.js (IE10) (***)||33.7||26.6||42.0||13.6|
|Box2dWeb (IE10) (***)||37.9||35.0||48.3||15.3|
|asm.js (Safari) (****)||-||-||-||-|
[Test platform: MacBook Pro, 2.5 GHz i7, 16G memory, Mac OS X 10.8.4. All platform and compiler versions are latest unless otherwise specified.]
(*) Crossbridge has awful clock() resolution, so I just assumed +/- 1ms percentiles.
(**) Here, ‘asm.js’ refers to the asm.js output running in Firefox Nightly, with full optimizations.
(***) I don’t have any easy way to run Windows natively on my Mac (I’m not
going to setup dual boot partitions just for this benchmark), so I had to try
and back out IE10 numbers using VirtualBox. I calculated a performance penalty
38.5ms / 26.9ms = 1.43x), then used that to adjust the IE10 numbers. There
are all sorts of things that could be wrong with this, but I expect it at
least gives us a rough idea.
(****) asm.js unfortunately hung on Safari and never recovered. Hopefully this will be sorted out at some point.
Scaled to multiples of native performance. The white line denotes the mean, and the edges of the box denote the 5th and 95th %iles.
Statistics and distributions
The first issue is that my statistics didn’t make a great deal of sense, as so helpfully pointed out on Reddit – I assumed a normal distribution, which is of course not the case for this performance data. I’ve fixed the measurements and graph to show 5th and 95th percentiles instead, which should give a pretty clear idea of the distribution. Most of the results aren’t significantly different, with the exception that Box2dWeb on V8 in Chrome has a very long right tail, which I’ll discuss below.
Emscripten and asm.js
I’ve also added Adobe’s Crossbridge (originally Alchemy, then Flascc) as a target. If you’re not familiar with it, this is Adobe’s C++ to AS3 bytecode compiler (via LLVM, very much like Emscripten). The performance is quite good – roughly on par with Java and asm.js, which is impressive. One caveat: Crossbridge’s clock() function appears to have a very coarse resolution (~7.8ms), so while I was able to calculate the mean, I wasn’t able to get valid percentiles.
I updated my JVM to 1.8, and ensured that I had the most flattering values set in jBox2d’s Settings.java (someone pointed out last time that I had failed to enable sin/cos lookup tables). The settings made no noticeable difference, and updating to Java 1.8 made a modest improvement.
I’ve reorganized the repo (especially the /c part) to be a bit less nasty, and added a README that should give you a reasonable idea of how to build and run everything.
A bit more analysis
The overall performance clusters remain largely the same as before. C and NaCl set the standard, with NaCl exacting the expected 20-30% performance penalty over raw native binaries.
Next up, we have Java, optimized asm.js, and Crossbridge clustering around 2.5x the runtime of native code. I’m tempted to call this the “specialized VM” case – in the sense that they’re VMs that implement specialized code generation for input languages whose semantics are restricted for performance reasons. That’s certainly true of the JVM and the optimized asm.js VM, though I’m not 100% certain it applies to Crossbridge (i.e., does the standard AS3 VM just happen to perform well on Crossbridge output, or did they implement specialized code generation paths that depend upon restricted AS3 semantics?).
Finally, AS3 turns in the same solid performance as always. I’ve read that the Flash AS3 VM can generate fairly tight code for typed source code (though quite bad code for untyped source), and this would seem to bear that out.
As always, it’s hard to draw absolute conclusions about something as mercurial as VM performance. But since I’ve begun these measurements, I’ve seen what appear to be some stable patterns. The fully dynamic VMs seem to cluster roughly around an order-of-magnitude slower than native code, and their variance is quite high. Their performance continues to improve, but it’s starting to feel asymptotic and somewhat chaotic, perhaps as various optimizations trade off against one another. Dart may well be able to buck that trend, as the team has specifically restricted the language semantics to be more amenable to optimization. But it will be quite some time before we understand where its performance asymptote lies.
I also note that what I referred to earlier as the “specialized VMs” seem to run quite stably around 2-3x native performance. I doubt the JVM’s performance is going to change significantly in the future. I’m less certain about asm.js and Crossbridge, because the techniques they use at least appear to permit more options for optimization than are available to the JVM (e.g., they receive monolothic input, statically optimized from C++, that usually performs manual memory management). So it seems plausible that they could get within spitting distance of NaCl’s performance.
It also appears to be fairly conclusively proven that even fully dynamic
great detail by Vyacheslav Egorov. Now to be fair, this is only true if
you accept that dumping your entire heap in a
TypedArray is “fair play”
(because doing so allows the VM to drop a lot of primitive type checks), but
it’s still interesting and useful that this works, and holds for all
- A few people asked me about .NET VM targets. I still haven’t done this, as I don’t have a machine handy to run them on. I suspect it will be roughly on par with the other “specialized VMs”.
- Running the benchmarks is still an excruciating process. I’d like to find standalone command-line versions of everything I need, but that’s proving to be far more trouble than it’s worth so far. I welcome any and all patches to mitigate this issue.
- Somebody asked me for VB benchmarks. I’m still waiting on BOX2D.BAS :)