‘grinding out performance improvements

In the last weeks, I’ve been working on various Perl 6 performance improvements. They’ve led to changes in all of MoarVM, NQP, and Rakudo. For each improvement, there are usually three steps involved:

  1. Take some program that performs poorly, and discover a reason that it’s slow.
  2. Design, implement and test potential changes, until one yields an improvement. (Or, if nothing seems to help, move on to a different reason that it’s slow).
  3. Make sure that the improvement doesn’t cause regressions elsewhere.

In all of these steps, getting objective measurements is important. Step 3 is relatively straightforward: just build NQP/Rakudo and run their respective test suites. It’s still possible for problems to find a place to hide even then; test suites are, after all, not full system proofs. But it rules out a lot of bad behaviors.

For step 1, profiling is key. I mean, sure, sometimes I can guess why something is slow, and sometimes I’m even right. But the overwhelming majority of the time, measuring wins hands down. That’s why I seeded tools such as the MoarVM profiler and MoarVM heap snapshot analyzer, both of which have seen contributions from others since. And, when working on MoarVM itself, there are various tools for profiling native code.

That leaves step 2. How do I know if I’ve got an improvement? One easy way is to use something like time, doing thousands or millions of iterations to try and avoid noise, and seeing what happens. It’s a bit of a blunt instrument, however. It’s hard to be confident that anything less than a 2%–3% improvement isn’t just measurement noise, and I’m not even confident about that rule of thumb. :-) And, while it may be reasonable to argue that an improvement of 1% or even a fraction of a percent is just not worth it in some contexts, VM engineering isn’t one of those contexts. A 1% improvement rolled out to all of our users quickly multiplies out to a huge number of saved CPU cycles.
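
As a minimal sketch of the blunt-instrument approach (the script name here is just illustrative):

time perl6 test.p6

Run that a handful of times and the wallclock figures will typically jitter by a percent or so, which is exactly the problem.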

More than that, though, the small improvements add up. Do 5 small improvements that win an average of 1% each, and it adds up to a more satisfying 5% improvement. But with only wallclock times to go on, it’s hard to confidently commit an improvement that wins only 1%, or 0.5%. Why? Simply because it’s hard to be sure that it’s a move in the correct direction. If it’s in the noise range, it may be that things are actually getting a tiny bit worse. Experience shows that just because something looks like it should be an improvement does not, in fact, mean that it will be. So if I’m seeing – to within measurement error – two things coming out the same, and I go ahead and commit the “improvement” anyway, I’m really just guessing.

Enter callgrind, part of the Valgrind suite. It can give a count of the number of CPU instructions executed. And, on two consecutive runs with the same input, it will produce the exact same number. Yes, it’s sloooooow at gathering data; on the other hand, with such precision there’s less need to bump up the number of iterations to hide measurement error. Callgrind also attributes the instruction counts to the functions they came from, meaning it can play a key role in the profiling step too.
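
Gathering the numbers looks something like this (the script name and the PID suffix on the output file are illustrative; depending on how perl6 is installed, you may need to point valgrind at the underlying moar invocation):

valgrind --tool=callgrind perl6 test.p6
callgrind_annotate --inclusive=yes callgrind.out.12345

The total instruction count appears in the summary at the top of the annotated output, and the per-function breakdown below it is what feeds the profiling step.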

Of course, instruction counts are a somewhat blunt instrument too. Instructions executed is not the same as CPU cycles spent. Modern CPUs can execute multiple instructions in a single cycle thanks to having multiple functional units, and can also stall for anything from several to several hundred cycles on an instruction that accesses memory. This measurement also gives little insight into I/O-bound programs, and I could easily imagine being rather misled by such data in contention-bound programs. However, for CPU-bound programs operating on relatively small heaps and running on a single thread, the numbers can be treated something like a very accurate clock.

Here’s a look at a handful of the improvements I’ve been doing, based on looking at callgrind output.

The test program

Here’s the test program I considered. It involves a bunch of method invocations and integer operations in a tight loop.

class A {
    has $.i = 0;
    method m() { $!i++ }
}
my $a = A.new;
for ^5000000 {
    $a.m();
}
say $a.i;

Before starting out, this took 17,855,658,600 instructions, which comes out at around 3,570 instructions per iteration. For comparison, here’s a Perl 5 program that I hope is fairly equivalent (I stuck with the built-in OO to avoid the costs of any sugar):

use feature 'say';

package A;
sub new {
    return bless { i => 0 }, shift;
}
sub m {
    my $self = shift;
    $self->{i}++;
}
package main;
my $a = A->new;
for (1..5000000) {
    $a->m();
}
say $a->{i};

It weighs in at 10,198,184,195 instructions, or around 2,040 instructions per iteration, thus running in around 57% of the CPU instructions the Perl 6 version takes.

A wasted memset

One thing that stood out right away was the amount of time spent in memset, to zero memory. As I looked at the callers, I spotted one that seemed out of place: clearing the arguments buffer. The code gave a reason (make sure the GC never sees a partial args buffer with old data), but that reason turns out to be bogus: there can never be a GC safepoint anywhere inside of the sequence of instructions that put the arguments in place to pass. So, I removed it.

That left a really small function, which was just begging to go away entirely. It was called both from the interpreter (where the C compiler may well have inlined it) and from the JIT. The JIT case was certainly going to be a nice win; instead of putting args into the appropriate registers for a function call, making the call, and then using the return value, I could just emit a couple of simple instructions.

These two fairly simple changes got it down to 17,624,207,945 instructions – a saving of around 230 million CPU instructions (1.3%), or around 50 instructions off every iteration. Very much worth having (it makes every single call cheaper), but it could easily have looked marginal had I been measuring simply by wallclock time.

Exhausting work

The next win would have been easily visible just from wallclock time, but looking at the callgrind output led to it. I noticed we spent a huge amount of time doing a late-bound (by name) lexical lookup. With --inclusive=yes passed to callgrind_annotate, its line in the output looked like this:

1,149,528,043  ???:MVM_frame_find_lexical_by_name [libmoar.so]

Now, late-bound lexical lookups being costly isn’t really a surprise. This is the case where we have to go searching for a symbol by name (doing hash lookups), because we couldn’t resolve it at compile time (or at optimization time). The surprise was that we were doing them at all while executing such a simple program. So I looked into why.

Long ago (back in the Parrot days), we replaced the secret RETURN lexical (which held an object used in implementing return) with the &EXHAUST sub. This would throw a useful error if you tried to return from a sub you had already returned from, for example due to forgetting that map is lazy and that the final statement of a sub is its return value:

sub in(@a, $b) {
    @a.map({ return True if $_ eqv $b })
}
say in([1,2,3], 2); # dies: Attempt to return outside of any Routine

However, this lookup ended up being code-gen’d late-bound. I was considering fixing that, when I got curious whether we even needed it at all any more. And indeed, on the VMs we run on today, I got the same decent error message if I simply set RETURN to null. So I did that, and the win was enormous: down to 11,085,521,589 CPU instructions! That meant, relative to the previous measurement, it ran in 63% of the CPU instructions it used to, or around 2,220 per iteration.

That might seem incredible given that we only stood to gain 1.1 billion instructions by avoiding the indirect lexical lookup. Where did the other 5 billion go? It turns out that blocks which reference symbols lexically outer to them cannot be inlined (unless those symbols have been resolved by optimization time and are known to stay constant). The late-bound lookup of &EXHAUST was thus an inlining blocker for the method m. With it now being inlined, all of the calling overhead went away in favor of a couple of gotos. (Those also could go away in the future if inlining gets smarter.)
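
By way of illustration – a made-up example, not the actual code involved here – a block like the following references a lexical declared outside of it, so unless the optimizer can resolve that symbol to something constant, the rule above means it can’t be inlined at its call sites:

my $threshold = 42;
sub below-threshold($n) {
    # $threshold is declared in the outer scope; an unresolved reference
    # to it is the kind of thing that blocks inlining of this sub
    $n < $threshold
}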

Optimizing get_boxed_ref

Eliminating control flow in favor of data access is usually a win. Branches aren’t the cheapest thing. I spotted an opportunity in a function get_boxed_ref, which is used to extract the memory address of the P6bigint representation that is inlined into a P6opaque (as happens with Int; if you’re wondering why it’s a P6opaque, recall that it’s allowable to mix into Int objects).

The resulting change shaved another 60 million instructions off the program; only 12 per iteration, but since it improves every single Int operation we perform in Perl 6, I’ll take it.

Static elimination of decontainerization

The &EXHAUST change uncovered an inlining issue that I had to fix, and while doing so I took a look at the code we were producing and got an idea. In Perl 6, we have Scalar containers. So when you have:

my $answer = 42;

Then the callframe has a slot for $answer that points to a Scalar object, which has a value attribute that points to the immutable Int object 42. Therefore, when we need the value from a container, we have to get hold of it. That is done with an instruction called decont, which checks if we have a container and dereferences it if so, and otherwise just evaluates to the value. The dynamic optimizer is pretty good at removing unneeded decont instructions to save the cycles/branches involved, and can lower those that remain into simple pointer arithmetic. But it can’t get ’em all, and of course not all code gets hot enough to be optimized in such a way.
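
A small example of the distinction, viewed from Perl 6 itself (.VAR gets at the container rather than the value it holds):

my $answer = 42;
say $answer.^name;       # Int    -- method calls see the decontainerized value
say $answer.VAR.^name;   # Scalar -- the container sitting in the callframe slot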

I spotted that in some cases, we emitted decont instructions against compile-time constants that we could cheaply determine would never need them. A small code-gen patch was sufficient to eliminate a load of them, saving another 35 million instructions. However, the instruction count for an empty program at startup also went down by quite an amount, so it’s likely that little (if any) of the saving is actually in the loop. Additionally, this change shaved 3.6KB off the NQP bytecode size, 16.2KB off the Rakudo compiler bytecode size, and 55.8KB off the CORE.setting bytecode size.

And next time: returning to return

Fixing the &EXHAUST bug reminded me that I’ve long wanted to change the way we implement return, to make it free – rather than just cheap – in the case that we never explicitly return, and much cheaper when we do. I’m still working on getting those changes finished up, and this has already been a fairly long post, so I’ll save the details – and the results – for next time.


3 Responses to ‘grinding out performance improvements

  1. Pingback: 2016.24 Speeding To TPC | Weekly changes in and around Perl 6

  2. Garrett Goebel says:

    Thank you for the update! Exciting progress.

    What was the final number of instructions per iteration for Perl 6 after all your improvements? Based on the 95,000,000 instruction improvement from the last two optimizations, I’m guesstimating ~2,198. I.e., how close did you get to Perl 5’s 2,040 instructions per iteration?

    • Just calculated it based on what I measured after all of these optimizations (total 10,989,973,606), which when divided over the loop iterations rather reassuringly gives the very same number you came up with. So, within 10%. Further work – which I’ll post about soon – has closed it a bit further.
