The new MoarVM dispatch mechanism is here!

Around 18 months ago, I set about working on the largest set of architectural changes that Raku runtime MoarVM has seen since its inception. The work was most directly triggered by the realization that we had no good way to fix a certain semantic bug in dispatch without either causing huge performance impacts across the board or increasingly complexity even further in optimizations that were already riding their luck. However, the need for something like this had been apparent for a while: a persistent struggle to optimize certain Raku language features, the pain of a bunch of performance mechanisms that were all solving the same kind of problem but each for a specific situation, and a sense that, with everything learned since I founded MoarVM, it was possible to do better.

The result is the development of a new generalized dispatch mechanism. An overview can be found in my Raku Conference talk about it (slidesvideo); in short, it gives us a far more uniform architecture for all kinds of dispatch, allowing us to deliver better performance on a range of language features that have thus far been glacial, as well as opening up opportunities for new optimizations.

Today, this work has been merged, along with the matching changes in NQP (the Raku subset we use for bootstrapping and to implement the compiler) and Rakudo (the full Raku compiler and standard library implementation). This means that it will ship in the October 2021 releases.

In this post, I’ll give an overview of what you can expect to observe right away, and what you might expect in the future as we continue to build upon the possibilities that the new dispatch architecture has to offer.

The big wins

The biggest improvements involve language features that we’d really not had the architecture to do better on before. They involved dispatch – that is, getting a call linked to a destination efficiently – but the runtime didn’t provide us with a way to “explain” to it that it was looking at a dispatch, let alone with the information needed to have a shot at optimizing it.

The following graph captures a number of these cases, and shows the level of improvement, ranging from a factor of 3.3 to 13.3 times faster.

Graph showing benchmark results, described textually below

Let’s take a quick look at each of these. The first, new-buf, asks how quickly we can allocate Bufs.

for ^10_000_000 {
    Buf.new
}

Why is this a dispatch benchmark? Because Buf is not a class, but rather a role. When we try to make an instance of a role, it is “punned” into a class. Up until now, it works as follows:

  1. We look up the new method
  2. The find_method method would, if needed, create a pun of the role and cache it
  3. It would return a forwarding closure that takes the arguments and gives them to the same method called on the punned class, or spelt in Raku code, -> $role-discarded, |args { $pun."$name"(|args) }
  4. This closure would be invoked with the arguments

This had a number of undesirable consequences:

  1. While the pun was cached, we still had a bit of overhead to check if we’d made it already
  2. The arguments got slurped and flattened, which costs something, and…
  3. …the loss of callsite shape meant we couldn’t look up a type specialization of the method, and thus lost a chance to inline it too

With the new dispatch mechanism, we have a means to cache constants at a given program location and to replace arguments. So the first time we encounter the call, we:

  1. Get the role pun produced if needed
  2. Resolve the new method on the class punned from the role
  3. Produce a dispatch program that caches this resolved method and also replaces the role argument with the pun

For the next thousands of calls, we interpret this dispatch program. It’s still some cost, but the method we’re calling is already resolved, and the argument list rewriting is fairly cheap. Meanwhile, after we get into some hundreds of iterations, on a background thread, the optimizer gets to work. The argument re-ordering cost goes away completely at this point, and new is so small it gets inlined – at which point the buffer allocation is determined dead and so goes away too. Some remaining missed opportunities mean we still are left with a loop that’s not quite empty: it busies itself making sure it’s really OK to do nothing, rather than just doing nothing.

Next up, multiple dispatch with where clauses.

multi fac($n where $n <= 1) { 1 }
multi fac($n) { $n * fac($n - 1) }
for ^1_000_000 {
    fac(5)
}

These were really slow before, since:

  1. We couldn’t apply the multi-dispatch caching mechanism at all as soon as we had a where clause involved
  2. We would run where clauses twice in the event the candidate was chosen: once to see if we should choose that multi candidate, and once again when we entered it

With the new mechanism, we:

  1. On the first call, calculate a multiple dispatch plan: a linked list of candidates to work through
  2. Invoke the one with the where clause, in a mode whereby if the signature fails to bind, it triggers a dispatch resumption. (If it does bind, it runs to completion)
  3. In the event of a bind failure, the dispatch resumption triggers, and we attempt the next candidate

Once again, after the setup phase, we interpret the dispatch programs. In fact, that’s as far as we get with running this faster for now, because the specializer doesn’t yet know how to translate and further optimize this kind of dispatch program. (That’s how I know it currently stands no chance of turning this whole thing into another empty loop!) So there’s more to be had here also; in the meantime, I’m afraid you’ll just have to settle for a factor of ten speedup.

Here’s the next one:

proto with-proto(Int $n) { 2 * {*} }
multi with-proto(Int $n) { $n + 1 }
sub invoking-nontrivial-proto() {
    for ^10_000_000 {
        with-proto(20)
    }
}

Again, on top form, we’d turn this into an empty loop too, but we don’t quite get there yet. This case wasn’t so terrible before: we did get to use the multiple dispatch cache, however to do that we also ended up having to allocate an argument capture. The need for this also blocked any chance of inlining the proto into the caller. Now that is possible. Since we cannot yet translate dispatch programs that resume an in-progress dispatch, we don’t yet get to further inline the called multi candidate into the proto. However, we now have a design that will let us implement that.

This whole notion of a dispatch resumption – where we start doing a dispatch, and later need to access arguments or other pre-calculated data in order to do a next step of it – has turned out to be a great unification. The initial idea for it came from considering things like callsame:

class Parent {
    method m() { 1 }
}
class Child is Parent {
    method m() { 1 + callsame }
}
for ^10_000_000 {
    Child.m;
}

Once I started looking at this, and then considering that a complex proto also wants to continue with a dispatch at the {*}, and in the case a where clauses fails in a multi it also wants to continue with a dispatch, I realized this was going to be useful for quite a lot of things. It will be a bit of a headache to teach the optimizer and JIT to do nice things with resumes – but a great relief that doing that once will benefit multiple language features!

Anyway, back to the benchmark. This is another “if we were smart, it’d be an empty loop” one. Previously, callsame was very costly, because each time we invoked it, it would have to calculate what kind of dispatch we were resuming and the set of methods to call. We also had to be able to locate the arguments. Dynamic variables were involved, which cost a bit to look up too, and – despite being an implementation details – these also leaked out in introspection, which wasn’t ideal. The new dispatch mechanism makes this all rather more efficient: we can cache the calculated set of methods (or wrappers and multi candidates, depending on the context) and then walk through it, and there’s no dynamic variables involved (and thus no leakage of them). This sees the biggest speedup of the lot – and since we cannot yet inline away the callsame, it’s (for now) measuring the speedup one might expect on using this language feature. In the future, it’s destined to optimize away to an empty loop.

A module that makes use of callsame on a relatively hot path is OO::Monitors,, so I figured it would be interesting to see if there is a speedup there also.

use OO::Monitors;
monitor TestMonitor {
    method m() { 1 }
}
my $mon = TestMonitor.new;
for ^1_000_000 {
    $mon.m();
}

monitor is a class that acquires a lock around each method call. The module provides a custom meta-class that adds a lock attribute to the class and then wraps each method such that it acquires the lock. There are certainly costly things in there besides the involvement of callsame, but the improvement to callsame is already enough to see a 3.3x speedup in this benchmark. Since OO::Monitors is used in quite a few applications and modules (for example, Cro uses it), this is welcome (and yes, a larger improvement will be possible here too).

Caller side decontainerization

I’ve seen some less impressive, but still welcome, improvements across a good number of other microbenchmarks. Even a basic multi dispatch on the + op:

my $i = 0;
for ^10_000_000 {
    $i = $i + $_;
}

Comes out with a factor of 1.6x speedup, thanks primarily to us producing far tighter code with fewer guards. Previously, we ended up with duplicate guards in this seemingly straightforward case. The infix:<+> multi candidate would be specialized for the case of its first argument being an Int in a Scalar container and its second argument being an immutable Int. Since a Scalar is mutable, the specialization would need to read it and then guard the value read before proceeding, otherwise it may change, and we’d risk memory safety. When we wanted to inline this candidate, we’d also want to do a check that the candidate really applies, and so also would deference the Scalar and guard its content to do that. We can and do eliminate duplicate guards – but these guards are on two distinct reads of the value, so that wouldn’t help.

Since in the new dispatch mechanism we can rewrite arguments, we can now quite easily do caller-side removal of Scalar containers around values. So easily, in fact, that the change to do it took me just a couple of hours. This gives a lot of benefits. Since dispatch programs automatically eliminate duplicate reads and guards, the read and guard by the multi-dispatcher and the read in order to pass the decontainerized value are coalesced. This means less repeated work prior to specialization and JIT compilation, and also only a single read and guard in the specialized code after it. With the value to be passed already guarded, we can trivially select a candidate taking two bare Int values, which means there’s no further reads and guards needed in the callee either.

A less obvious benefit, but one that will become important with planned future work, is that this means Scalar containers escape to callees far less often. This creates further opportunities for escape analysis. While the MoarVM escape analyzer and scalar replacer is currently quite limited, I hope to return to working on it in the near future, and expect it will be able to give us even more value now than it would have been able to before.

Further results

The benchmarks shown earlier are mostly of the “how close are we to realizing that we’ve got an empty loop” nature, which is interesting for assessing how well the optimizer can “see through” dispatches. Here are a few further results on more “traditional” microbenchmarks:

Graph showing benchmark results, described textually below

The complex number benchmark is as follows:

my $total-re = 0e0;
for ^2_000_000 {
    my $x = 5 + 2i;
    my $y = 10 + 3i;
    my $z = $x * $x + $y;
    $total-re = $total-re + $z.re
}
say $total-re;

That is, just a bunch of operators (multi dispatch) and method calls, where we really do use the result. For now, we’re tied with Python and a little behind Ruby on this benchmark (and a surprising 48 times faster than the same thing done with Perl’s Math::Complex), but this is also a case that stands to see a huge benefit from escape analysis and scalar replacement in the future.

The hash read benchmark is:

my %h = a => 10, b => 12;
my $total = 0;
for ^10_000_000 {
    $total = $total + %h<a> + %h<b>;
}

And the hash store one is:

my @keys = 'a'..'z';
for ^500_000 {
    my %h;
    for @keys {
        %h{$_} = 42;
    }
}

The improvements are nothing whatsoever to do with hashing itself, but instead look to be mostly thanks to much tighter code all around due to caller-side decontainerization. That can have a secondary effect of bringing things under the size limit for inlining, which is also a big help. Speedup factors of 2x and 1.85x are welcome, although we could really do with the same level of improvement again for me to be reasonably happy with our results.

The line-reading benchmark is:

my $fh = open "longfile";
my $chars = 0;
for $fh.lines { $chars = $chars + .chars };
$fh.close;
say $chars

Again, nothing specific to I/O got faster, but when dispatch – the glue that puts together all the pieces – gets a boost, it helps all over the place. (We are also decently competitive on this benchmark, although tend to be slower the moment the UTF-8 decoder can’t take it’s “NFG can’t possibly apply” fast path.)

And in less micro things…

I’ve also started looking at larger programs, and hearing results from others about theirs. It’s mostly encouraging:

  • The long-standing Text::CSV benchmark test-t has seen roughly 20% improvement (thanks to lizmat for measuring)
  • A simple Cro::HTTP test application gets through about 10% more requests per second
  • MoarVM contributor dogbert did comparative timings of a number of scripts; the most significant improvement saw a drop from 25s to 7s, most are 10%-30% faster, some without change, and only one that slowed down.
  • There’s around 2.5% improvement on compilation of CORE.setting, the standard library. However, a big pinch of salt is needed here: the compiler itself has changed in a number of places as part of the work, and there were a couple of things tweaked based on looking at profiles that aren’t really related to dispatch.
  • Agrammon, an application calculating farming emissions, has seen a slowdown of around 9%. I didn’t get to look at it closely yet, although glancing at profiling output the number of deoptimizations is relatively high, which suggests we’re making some poor optimization decisions somewhere.

Smaller profiler output

One unpredicted (by me), but also welcome, improvement is that profiler output has become significantly smaller. Likely reasons for this include:

  1. The dispatch mechanism supports producing value results (either from constants, input arguments, or attributes read from input arguments). It entirely replaces an earlier mechanism, “specializer plugins”, which could map guards to a target to invoke, but always required a call to something – even if that something was the identity function. The logic was that this didn’t matter for any really hot code, since the identity function will trivially be inlined away. However, since profile size of the instrumenting profiler is a function of the number of paths through the call tree, trimming loads of calls to the identity function out of the tree makes it much smaller.
  2. We used to make lots of calls to the sink method when a value was in sink context. Now, if we see that the type simply inherits that method from Mu, we elide the call entirely (again, it would inline away, but a smaller call graph is a smaller profile).
  3. Multiple dispatch caching would previously always call the proto when the cache was missed, but would then not call an onlystar proto again when it got cache hits in the future. This meant the call tree under many multiple dispatches was duplicated in the profile. This wasn’t just a size issue; it was a bit annoying to have this effect show up in the profile reports too.

To give an example of the difference, I took profiles from Agrammon to study why it might have become slower. The one from before the dispatcher work weighed in at 87MB; the one with the new dispatch mechanism is under 30MB. That means less memory used while profiling, less time to write the profile out to disk afterwards, and less time for tools to load the profiler output. So now it’s faster to work out how to make things faster.

Is there any bad news?

I’m afraid so. Startup time has suffered. While the new dispatch mechanism is more powerful, pushes more complexity out of the VM into high level code, and is more conducive to reaching higher peak performance, it also has a higher warmup time. At the time of writing, the impact on startup time seems to be around 25%. I expect we can claw some of that back ahead of the October release.

What will be broken?

Changes of this scale always come with an amount of risk. We’re merging this some weeks ahead of the next scheduled monthly release in order to have time for more testing, and to address any regressions that get reported. However, even before reaching the point of merging it, we have:

  • Ensured it passes the specification test suite, both in normal circumstances, but also under optimizer stressing (where we force it to prematurely optimize everything, so that we tease out optimizer bugs and – given how many poor decisions we force it to make – deoptimization bugs too)
  • Used blin to run the tests of ecosystem modules. This is a standard step when preparing Rakudo releases, but in this case we’ve aimed it at the new-disp branches. This found a number of regressions caused by the switch to the new dispatch mechanism, which have been addressed.
  • Patched or sent pull requests to a number of modules that were relying on unsupported internal APIs that have now gone away or changed, or on other implementation details. There were relatively few of these, and happily, many of them were fixed up by migrating to supported APIs (which likely didn’t exist at the time the modules were written).

What happens next?

As I’ve alluded to in a number of places in this post, while there are improvements to be enjoyed right away, there are also new opportunities for further improvement. Some things that are on my mind include:

  • Reworking callframe entry and exit. These are still decidedly too costly. Various changes that have taken place while working on the new dispatch mechanism have opened up new opportunities for improvement in this area.
  • Avoiding megamorphic pile-ups. Micro-benchmarks are great at hiding these. In fact, the callsame one here is a perfect example! The point we do the resumption of a dispatch is inside callsame, so all the inline cache entries of resumptions throughout the program stack up in one place. What we’d like is to have them attached a level down the callstack instead. Otherwise, the level of callsame improvement seen in micro-benchmarks will not be enjoyed in larger applications. This applies in a number of other situations too.
  • Applying the new dispatch mechanism to optimize further constructs. For example, a method call that results in invoking the special FALLBACK method could have its callsite easily rewritten to do that, opening the way to inlining.
  • Further tuning the code we produce after optimization. There is an amount of waste that should be relatively straightforward to eliminate, and some opportunities to tweak deoptimization such that we’re able to delete more instructions and still retain the ability to deoptimize.
  • Continuing with the escape analysis work I was doing before, which should now be rather more valuable. The more flexible callstack/frame handling in place should also unblock my work on scalar replacement of Ints (which needs a great deal of care in memory management, as they may box a big integer, not just a native integer).
  • Implementing specialization, JIT, and inlining of dispatch resumptions.

Thank you

I would like to thank TPF and their donors for providing the funding that has made it possible for me to spend a good amount of my working time on this effort.

While I’m to blame for the overall design and much of the implementation of the new dispatch mechanism, plenty of work has also been put in by other MoarVM and Rakudo contributors – especially over the last few months as the final pieces fell into place, and we turned our attention to getting it production ready. I’m thankful to them not only for the code and debugging contributions, but also much support and encouragement along the way. It feels good to have this merged, and I look forward to building upon it in the months and years to come.

Posted in Uncategorized | 3 Comments

Raku multiple dispatch with the new MoarVM dispatcher

I recently wrote about the new MoarVM dispatch mechanism, and in that post noted that I still had a good bit of Raku’s multiple dispatch semantics left to implement in terms of it. Since then, I’ve made a decent amount of progress in that direction. This post contains an overview of the approach taken, and some very rough performance measurements.

My goodness, that’s a lot of semantics

Of all the kinds of dispatch we find in Raku, multiple dispatch is the most complex. Multiple dispatch allows us to write a set of candidates, which are then selected by the number of arguments:

multi ok($condition, $desc) {
    say ($condition ?? 'ok' !! 'not ok') ~ " - $desc";
}
multi ok($condition) {
    ok($condition, '');
}

Or the types of arguments:

multi to-json(Int $i) { ~$i }
multi to-json(Bool $b) { $b ?? 'true' !! 'false' }

And not just one argument, but potentially many:

multi truncate(Str $str, Int $chars) {
    $str.chars < $chars ?? $str !! $str.substr(0, $chars) ~ '...'
}
multi truncate(Str $str, Str $after) {
    with $str.index($after) -> $pos {
        $str.substr(0, $pos) ~ '...'
    }
    else {
        $str
    }
}

We may write where clauses to differentiate candidates on properties that are not captured by nominal types:

multi fac($n where $n <= 1) { 1 }
multi fac($n) { $n * fac($n - 1) }

Every time we write a set of multi candidates like this, the compiler will automatically produce a proto routine. This is what is installed in the symbol table, and holds the candidate list. However, we can also write our own proto, and use the special term {*} to decide at which point we do the dispatch, if at all.

proto mean($collection) {
    $collection.elems == 0 ?? Nil !! {*}
}
multi mean(@arr) {
    @arr.sum / @arr.elems
}
multi mean(%hash) {
    %hash.values.sum / %hash.elems
}

Candidates are ranked by narrowness (using topological sorting). If multiple candidates match, but they are equally narrow, then that’s an ambiguity error. Otherwise, we call narrowest one. The candidate we choose may then use callsame and friends to defer to the next narrowest candidate, which may do the same, until we reach the most general matching one.

Multiple dispatch is everywhere

Raku leans heavily on multiple dispatch. Most operators in Raku are compiled into calls to multiple dispatch subroutines. Even $a + $b will be a multiple dispatch. This means doing multiple dispatch efficiently is really important for performance. Given the riches of its semantics, this is potentially a bit concerning. However, there’s good news too.

Most multiple dispatches are boring

The overwhelmingly common case is that we have:

  • A decision made only by the number of arguments and nominal types
  • No where clauses
  • No custom proto
  • No callsame

This isn’t to say the other cases are unimportant; they are really quite useful, and it’s desirable for them to perform well. However, it’s also desirable to make what savings we can in the common case. For example, we don’t want to eagerly calculate the full set of possible candidates for every single multiple dispatch, because the majority of the time only the first one matters. This is not just a time concern: recall that the new dispatch mechanism stores dispatch programs at each callsite, and if we store the list of all matching candidates at each of those, we’ll waste a lot of memory too.

How do we do today?

The situation in Rakudo today is as follows:

  • If the dispatch is decided by arity and nominal type only, and you don’t call it with flattening args, it’ll probably perform quite decently, and perhaps even enjoy inlining of the candidate and elimination of duplicate type checks that would take place on the slow path. This is thanks to the proto holding a “dispatch cache”, a special-case mechanism implemented in the VM that uses a search tree, with one level per argument.
  • If that’s the case but it has a custom proto, it’s not too bad either, though inlining isn’t going to be happening; it can still use the search tree, though
  • If it uses where clauses, it’ll be slow, because the search tree only deals in finding one candidate per set of nominal types, and so we can’t use it
  • The same reasoning applies to callsame; it’ll be slow too

Effectively, the situation today is that you simply don’t use where clauses in a multiple dispatch if its anywhere near a hot path (well, and if you know where the hot paths are, and know that this kind of dispatch is slow). Ditto for callsame, although that’s less commonly reached for. The question is, can we do better with the new dispatcher?

Guard the types

Let’s start out with seeing how the simplest cases are dealt with, and build from there. (This is actually what I did in terms of the implementation, but at the same time I had a rough idea where I was hoping to end up.)

Recall this pair of candidates:

multi truncate(Str $str, Int $chars) {
    $str.chars < $chars ?? $str !! $str.substr(0, $chars) ~ '...'
}
multi truncate(Str $str, Str $after) {
    with $str.index($after) -> $pos {
        $str.substr(0, $pos) ~ '...'
    }
    else {
        $str
    }
}

We then have a call truncate($message, "\n"), where $message is a Str. Under the new dispatch mechanism, the call is made using the raku-call dispatcher, which identifies that this is a multiple dispatch, and thus delegates to raku-multi. (Multi-method dispatch ends up there too.)

The record phase of the dispatch – on the first time we reach this callsite – will proceed as follows:

  1. Iterate over the candidates
  2. If a candidate doesn’t match on argument count, just discard it. Since the shape of a callsite is a constant, and we calculate dispatch programs at each callsite, we don’t need to establish any guards for this.
  3. If it matches on types and concreteness, note which parameters are involved and what kinds of guards they need.
  4. If there was no match or an ambiguity, report the error without producing a dispatch program.
  5. Otherwise, having established the type guards, delegate to the raku-invoke dispatcher with the chosen candidate.

When we reach the same callsite again, we can run the dispatch program, which quickly checks if the argument types match those we saw last time, and if they do, we know which candidate to invoke. These checks are very cheap – far cheaper than walking through all of the candidates and examining each of them for a match. The optimizer may later be able to prove that the checks will always come out true and eliminate them.

Thus the whole of the dispatch processes – at least for this simple case where we only have types and arity – can be “explained” to the virtual machine as “if the arguments have these exact types, invoke this routine”. It’s pretty much the same as we were doing for method dispatch, except there we only cared about the type of the first argument – the invocant – and the value of the method name. (Also recall from the previous post that if it’s a multi-method dispatch, then both method dispatch and multiple dispatch will guard the type of the first argument, but the duplication is eliminated, so only one check is done.)

That goes in the resumption hole

Coming up with good abstractions is difficult, and therein lies much of the challenge of the new dispatch mechanism. Raku has quite a number of different dispatch-like things. However, encoding all of them directly in the virtual machine leads to high complexity, which makes building reliable optimizations (or even reliable unoptimized implementations!) challenging. Thus the aim is to work out a comparatively small set of primitives that allow for dispatches to be “explained” to the virtual machine in such a way that it can deliver decent performance.

It’s fairly clear that callsame is a kind of dispatch resumption, but what about the custom proto case and the where clause case? It turns out that these can both be neatly expressed in terms of dispatch resumption too (the where clause case needing one small addition at the virtual machine level, which in time is likely to be useful for other things too). Not only that, but encoding these features in terms of dispatch resumption is also quite direct, and thus should be efficient. Every trick we teach the specializer about doing better with dispatch resumptions can benefit all of the language features that are implemented using them, too.

Custom protos

Recall this example:

proto mean($collection) {
    $collection.elems == 0 ?? Nil !! {*}
}

Here, we want to run the body of the proto, and then proceed to the chosen candidate at the point of the {*}. By contrast, when we don’t have a custom proto, we’d like to simply get on with calling the correct multi.

To achieve this, I first moved the multi candidate selection logic from the raku-multi dispatcher to the raku-multi-core dispatcher. The raku-multi dispatcher then checks if we have an “onlystar” proto (one that does not need us to run it). If so, it delegates immediately to raku-multi-core. If not, it saves the arguments to the dispatch as the resumption initialization state, and then calls the proto. The proto‘s {*} is compiled into a dispatch resumption. The resumption then delegates to raku-multi-core. Or, in code:

nqp::dispatch('boot-syscall', 'dispatcher-register', 'raku-multi',
    # Initial dispatch, only setting up resumption if we need to invoke the
    # proto.
    -> $capture {
        my $callee := nqp::captureposarg($capture, 0);
        my int $onlystar := nqp::getattr_i($callee, Routine, '$!onlystar');
        if $onlystar {
            # Don't need to invoke the proto itself, so just get on with the
            # candidate dispatch.
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-multi-core', $capture);
        }
        else {
            # Set resume init args and run the proto.
            nqp::dispatch('boot-syscall', 'dispatcher-set-resume-init-args', $capture);
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-invoke', $capture);
        }
    },
    # Resumption means that we have reached the {*} in the proto and so now
    # should go ahead and do the dispatch. Make sure we only do this if we
    # are signalled to that it's a resume for an onlystar (resumption kind 5).
    -> $capture {
        my $track_kind := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $capture, 0);
        nqp::dispatch('boot-syscall', 'dispatcher-guard-literal', $track_kind);
        my int $kind := nqp::captureposarg_i($capture, 0);
        if $kind == 5 {
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-multi-core',
                nqp::dispatch('boot-syscall', 'dispatcher-get-resume-init-args'));
        }
        elsif !nqp::dispatch('boot-syscall', 'dispatcher-next-resumption') {
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'boot-constant',
                nqp::dispatch('boot-syscall', 'dispatcher-insert-arg-literal-obj',
                    $capture, 0, Nil));
        }
    });

Two become one

Deferring to the next candidate (for example with callsame) and trying the next candidate because a where clause failed look very similar: both involve walking through a list of possible candidates. There’s some details, but they have a great deal in common, and it’d be nice if that could be reflected in how multiple dispatch is implemented using the new dispatcher.

Before that, a slightly terrible detail about how things work in Rakudo today when we have where clauses. First, the dispatcher does a “trial bind”, where it asks the question: would this signature bind? To do this, it has to evaluate all of the where clauses. Worse, it has to use the slow-path signature binder too, which interprets the signature, even though we can in many cases compile it. If the candidate matches, great, we select it, and then invoke it…which runs the where clauses a second time, as part of the compiled signature binding code. There is nothing efficient about this at all, except for it being by far more efficient on developer time, which is why it happened that way.

Anyway, it goes without saying that I’m rather keen to avoid this duplicate work and the slow-path binder where possible as I re-implement this using the new dispatcher. And, happily, a small addition provides a solution. There is an op assertparamcheck, which any kind of parameter checking compiles into (be it type checking, where clause checking, etc.) This triggers a call to a function that gets the arguments, the thing we were trying to call, and can then pick through them to produce an error message. The trick is to provide a way to invoke a routine such that a bind failure, instead of calling the error reporting function, will leave the routine and then do a dispatch resumption! This means we can turn failure to pass where clause checks into a dispatch resumption, which will then walk to the next candidate and try it instead.

Trivial vs. non-trivial

This gets us most of the way to a solution, but there’s still the question of being memory and time efficient in the common case, where there is no resumption and no where clauses. I coined the term “trivial multiple dispatch” for this situation, which makes the other situation “non-trivial”. In fact, I even made a dispatcher called raku-multi-non-trivial! There are two ways we can end up there.

  1. The initial attempt to find a matching candidate determines that we’ll have to consider where clauses. As soon as we see this is the case, we go ahead and produce a full list of possible candidates that could match. This is a linked list (see my previous post for why).
  2. The initial attempt to find a matching candidate finds one that can be picked based purely on argument count and nominal types. We stop there, instead of trying to build a full candidate list, and run the matching candidate. In the event that a callsame happens, we end up in the trivial dispatch resumption handler, which – since this situation is now non-trivial – builds the full candidate list, snips the first item off it (because we already ran that), and delegates to raku-multi-non-trivial.

Lost in this description is another significant improvement: today, when there are where clauses, we entirely lose the ability to use the MoarVM multiple dispatch cache, but under the new dispatcher, we store a type-filtered list of candidates at the callsite, and then cheap type guards are used to check it is valid to use.

Preliminary results

I did a few benchmarks to see how the new dispatch mechanism did with a couple of situations known to be sub-optimal in Rakudo today. These numbers do not reflect what is possible, because at the moment the specializer does not have much of an understanding of the new dispatcher. Rather, they reflect the minimal improvement we can expect.

Consider this benchmark using a multi with a where clause to recursively implement factorial.

multi fac($n where $n <= 1) { 1 }
multi fac($n) { $n * fac($n - 1) }
for ^100_000 {
    fac(10)
}
say now - INIT now;

This needs some tweaks (and to be run under an environment variable) to use the new dispatcher; these are temporary, until such a time I switch Rakudo over to using the new dispatcher by default:

use nqp;
multi fac($n where $n <= 1) { 1 }
multi fac($n) { $n * nqp::dispatch('raku-call', &fac, $n - 1) }
for ^100_000 {
    nqp::dispatch('raku-call', &fac, 10);
}
say now - INIT now;

On my machine, the first runs in 4.86s, the second in 1.34s. Thus under the new dispatcher this runs in little over a quarter of the time it used to – a quite significant improvement already.

A case involving callsame is also interesting to consider. Here it is without using the new dispatcher:

multi fallback(Any $x) { "a$x" }
multi fallback(Numeric $x) { "n" ~ callsame }
multi fallback(Real $x) { "r" ~ callsame }
multi fallback(Int $x) { "i" ~ callsame }
for ^1_000_000 {
    fallback(4+2i);
    fallback(4.2);
    fallback(42);
}   
say now - INIT now;

And with the temporary tweaks to use the new dispatcher:

use nqp;
multi fallback(Any $x) { "a$x" }
multi fallback(Numeric $x) { "n" ~ new-disp-callsame }
multi fallback(Real $x) { "r" ~ new-disp-callsame }
multi fallback(Int $x) { "i" ~ new-disp-callsame }
for ^1_000_000 {
    nqp::dispatch('raku-call', &fallback, 4+2i);
    nqp::dispatch('raku-call', &fallback, 4.2);
    nqp::dispatch('raku-call', &fallback, 42);
}
say now - INIT now;

On my machine, the first runs in 31.3s, the second in 11.5s, meaning that with the new dispatcher we manage it in a little over a third of the time that current Rakudo does.

These are both quite encouraging, but as previously mentioned, a majority of multiple dispatches are of the trivial kind, not using these features. If I make the most common case worse on the way to making other things better, that would be bad. It’s not yet possible to make a fair comparison of this: trivial multiple dispatches already receive a lot of attention in the specializer, and it doesn’t yet optimize code using the new dispatcher well. Of note, in an example like this:

multi m(Int) { }
multi m(Str) { }
for ^1_000_000 {
    m(1);
    m("x");
}
say now - INIT now;

Inlining and other optimizations will turn this into an empty loop, which is hard to beat. There is one thing we can already do, though: run it with the specializer disabled. The new dispatcher version looks like this:

use nqp;
multi m(Int) { }
multi m(Str) { }
for ^1_000_000 {
    nqp::dispatch('raku-call', &m, 1);
    nqp::dispatch('raku-call', &m, "x");
}
say now - INIT now;

The results are 0.463s and 0.332s respectively. Thus, the baseline execution time – before the specializer does its magic – is less using the new general dispatch mechanism than it is using the special-case multiple dispatch cache that we currently use. I wasn’t sure what to expect here before I did the measurement. Given we’re going from a specialized mechanism that has been profiled and tweaked to a new general mechanism that hasn’t received such attention, I was quite ready to be doing a little bit worse initially, and would have been happy with parity. Running in 70% of the time was a bigger improvement than I expected at this point.

I expect that once the specializer understands the new dispatch mechanism better, it will be able to also turn the above into an empty loop – however, since more iterations can be done per-optimization, this should still show up as a win for the new dispatcher.

Final thoughts

With one relatively small addition, the new dispatch mechanism is already handling most of the Raku multiple dispatch semantics. Furthermore, even without the specializer and JIT really being able to make a good job of it, some microbenchmarks already show a factor of 3x-4x improvement. That’s a pretty good starting point.

There’s still a good bit to do before we ship a Rakudo release using the new dispatcher. However, multiple dispatch was the biggest remaining threat to the design: it’s rather more involved than other kinds of dispatch, and it was quite possible that an unexpected shortcoming could trigger another round of design work, or reveal that the general mechanism was going to struggle to perform compared to the more specialized one in the baseline unoptimized, case. So far, there’s no indication of either of these, and I’m cautiously optimistic that the overall design is about right.

Posted in Uncategorized | 2 Comments

Towards a new general dispatch mechanism in MoarVM

My goodness, it appears I’m writing my first Raku internals blog post in over two years. Of course, two years ago it wasn’t even called Raku. Anyway, without further ado, let’s get on with this shared brainache.

What is dispatch?

I use “dispatch” to mean a process by which we take a set of arguments and end up with some action being taken based upon them. Some familiar examples include:

  • Making a method call, such as $basket.add($product, $quantity). We might traditionally call just $product and $qauntity the arguments, but for my purposes, all of $basket, the method name 'add'$product, and $quantity` are arguments to the dispatch: they are the things we need in order to make a decision about what we’re going to do.
  • Making a subroutine call, such as uc($youtube-comment). Since Raku sub calls are lexically resolved, in this case the arguments to the dispatch are &uc (the result of looking up the subroutine) and $youtube-comment.
  • Calling a multiple dispatch subroutine or method, where the number and types of the arguments are used in order to decide which of a set of candidates is to be invoked. This process could be seen as taking place “inside” of one of the above two dispatches, given we have both multiple dispatch subroutines and methods in Raku.

At first glance, perhaps the first two seem fairly easy and the third a bit more of a handful – which is sort of true. However, Raku has a number of other features that make dispatch rather more, well, interesting. For example:

  • wrap allows us to wrap any Routine (sub or method); the wrapper can then choose to defer to the original routine, either with the original arguments or with new arguments
  • When doing multiple dispatch, we may write a proto routine that gets to choose when – or even if – the call to the appropriate candidate is made
  • We can use routines like callsame in order to defer to the next candidate in the dispatch. But what does that mean? If we’re in a multiple dispatch, it would mean the next most applicable candidate, if any. If we’re in a method dispatch then it means a method from a base class. (The same thing is used to implement going to the next wrapper or, eventually, to the originally wrapped routine too). And these can be combined: we can wrap a multi method, meaning we can have 3 levels of things that all potentially contribute the next thing to call!

Thanks to this, dispatch – at least in Raku – is not always something we do and produce an outcome, but rather a process that we may be asked to continue with multiple times!

Finally, while the examples I’ve written above can all quite clearly be seen as examples of dispatch, a number of other common constructs in Raku can be expressed as a kind of dispatch too. Assignment is one example: the semantics of it depend on the target of the assignment and the value being assigned, and thus we need to pick the correct semantics. Coercion is another example, and return value type-checking yet another.

Why does dispatch matter?

Dispatch is everywhere in our programs, quietly tieing together the code that wants stuff done with the code that does stuff. Its ubiquity means it plays a significant role in program performance. In the best case, we can reduce the cost to zero. In the worst case, the cost of the dispatch is high enough to exceed that of the work done as a result of the dispatch.

To a first approximation, when the runtime “understands” the dispatch the performance tends to be at least somewhat decent, but when it doesn’t there’s a high chance of it being awful. Dispatches tend to involve an amount of work that can be cached, often with some cheap guards to verify the validity of the cached outcome. For example, in a method dispatch, naively we need to walk a linearization of the inheritance graph and ask each class we encounter along the way if it has a method of the specified name. Clearly, this is not going to be terribly fast if we do it on every method call. However, a particular method name on a particular type (identified precisely, without regard to subclassing) will resolve to the same method each time. Thus, we can cache the outcome of the lookup, and use it whenever the type of the invocant matches that used to produce the cached result.

Specialized vs. generalized mechanisms in language runtimes

When one starts building a runtime aimed at a particular language, and has to do it on a pretty tight budget, the most obvious way to get somewhat tolerable performance is to bake various hot-path language semantics into the runtime. This is exactly how MoarVM started out. Thus, if we look at MoarVM as it stood several years ago, we find things like:

  • Some support for method caching
  • A multi-dispatch cache highly tied to Raku’s multi-dispatch semantics, and only really able to help when the dispatch is all about nominal types (so using where comes at a very high cost)
  • A mechanism for specifying how to find the actual code handle inside of a wrapping code object (for example, a Sub object has a private attribute in it that holds the low-level code handle identifying the bytecode to run)
  • Some limited attempts to allow us to optimize correctly in the case we know that a dispatch will not be continued – which requires careful cooperation between compiler and runtime (or less diplomatically, it’s all a big hack)

These are all still there today, however are also all on the way out. What’s most telling about this list is what isn’t included. Things like:

  • Private method calls, which would need a different cache – but the initial VM design limited us to one per type
  • Qualified method calls ($obj.SomeType::method-name())
  • Ways to decently optimize dispatch resumption

A few years back I started to partially address this, with the introduction of a mechanism I called “specializer plugins”. But first, what is the specializer?

When MoarVM started out, it was a relatively straightforward interpreter of bytecode. It only had to be fast enough to beat the Parrot VM in order to get a decent amount of usage, which I saw as important to have before going on to implement some more interesting optimizations (back then we didn’t have the kind of pre-release automated testing infrastructure we have today, and so depended much more on feedback from early adopters). Anyway, soon after being able to run pretty much as much of the Raku language as any other backend, I started on the dynamic optimizer. It gathered type statistics as the program was interpreted, identified hot code, put it into SSA form, used the type statistics to insert guards, used those together with static properties of the bytecode to analyze and optimize, and produced specialized bytecode for the function in question. This bytecode could elide type checks and various lookups, as well as using a range of internal ops that make all kinds of assumptions, which were safe because of the program properties that were proved by the optimizer. This is called specialized bytecode because it has had a lot of its genericity – which would allow it to work correctly on all types of value that we might encounter – removed, in favor of working in a particular special case that actually occurs at runtime. (Code, especially in more dynamic languages, is generally far more generic in theory than it ever turns out to be in practice.)

This component – the specializer, known internally as “spesh” – delivered a significant further improvement in the performance of Raku programs, and with time its sophistication has grown, taking in optimizations such as inlining and escape analysis with scalar replacement. These aren’t easy things to build – but once a runtime has them, they create design possibilities that didn’t previously exist, and make decisions made in their absence look sub-optimal.

Of note, those special-cased language-specific mechanisms, baked into the runtime to get some speed in the early days, instead become something of a liability and a bottleneck. They have complex semantics, which means they are either opaque to the optimizer (so it can’t reason about them, meaning optimization is inhibited) or they need special casing in the optimizer (a liability).

So, back to specializer plugins. I reached a point where I wanted to take on the performance of things like $obj.?meth (the “call me maybe” dispatch), $obj.SomeType::meth() (dispatch qualified with a class to start looking in), and private method calls in roles (which can’t be resolved statically). At the same time, I was getting ready to implement some amount of escape analysis, but realized that it was going to be of very limited utility because assignment had also been special-cased in the VM, with a chunk of opaque C code doing the hot path stuff.

But why did we have the C code doing that hot-path stuff? Well, because it’d be too espensive to have every assignment call a VM-level function that does a bunch of checks and logic. Why is that costly? Because of function call overhead and the costs of interpretation. This was all true once upon a time. But, some years of development later:

  • Inlining was implemented, and could eliminate the overhead of doing a function call
  • We could compile to machine code, eliminating interpretation overhead
  • We were in a position where we had type information to hand in the specializer that would let us eliminate branches in the C code, but since it was just an opaque function we called, there was no way to take this opportunity

I solved the assignment problem and the dispatch problems mentioned above with the introduction of a single new mechanism: specializer plugins. They work as follows:

  • The first time we reach a given callsite in the bytecode, we run the plugin. It produces a code object to invoke, along with a set of guards (conditions that have to be met in order to use that code object result)
  • The next time we reach it, we check if the guards are met, and if so, just use the result
  • If not, we run the plugin again, and stack up a guard set at the callsite
  • We keep statistics on how often a given guard set succeeds, and then use that in the specializer

The vast majority of cases are monomorphic, meaning that only one set of guards are produced and they always succeed thereafter. The specializer can thus compile those guards into the specialized bytecode and then assume the given target invocant is what will be invoked. (Further, duplicate guards can be eliminated, so the guards a particular plugin introduces may reduce to zero.)

Specializer plugins felt pretty great. One new mechanism solved multiple optimization headaches.

The new MoarVM dispatch mechanism is the answer to a fairly simple question: what if we get rid of all the dispatch-related special-case mechanisms in favor of something a bit like specializer plugins? The resulting mechanism would need to be a more powerful than specializer plugins. Further, I could learn from some of the shortcomings of specializer plugins. Thus, while they will go away after a relatively short lifetime, I think it’s fair to say that I would not have been in a place to design the new MoarVM dispatch mechanism without that experience.

The dispatch op and the bootstrap dispatchers

All the method caching. All the multi dispatch caching. All the specializer plugins. All the invocation protocol stuff for unwrapping the bytecode handle in a code object. It’s all going away, in favor of a single new dispatch instruction. Its name is, boringly enough, dispatch. It looks like this:

dispatch_o result, 'dispatcher-name', callsite, arg0, arg1, ..., argN

Which means:

  • Use the dispatcher called dispatcher-name
  • Give it the argument registers specified (the callsite referenced indicates the number of arguments)
  • Put the object result of the dispatch into the register result

(Aside: this implies a new calling convention, whereby we no longer copy the arguments into an argument buffer, but instead pass the base of the register set and a pointer into the bytecode where the register argument map is found, and then do a lookup registers[map[argument_index]] to get the value for an argument. That alone is a saving when we interpret, because we no longer need a loop around the interpreter per argument.)

Some of the arguments might be things we’d traditionally call arguments. Some are aimed at the dispatch process itself. It doesn’t really matter – but it is more optimal if we arrange to put arguments that are only for the dispatch first (for example, the method name), and those for the target of the dispatch afterwards (for example, the method parameters).

The new bootstrap mechanism provides a small number of built-in dispatchers, whose names start with “boot-“. They are:

  • boot-value– take the first argument and use it as the result (the identity function, except discarding any further arguments)
  • boot-constant – take the first argument and produce it as the result, but also treat it as a constant value that will always be produced (thus meaning the optimizer could consider any pure code used to calculate the value as dead)
  • boot-code – take the first argument, which must be a VM bytecode handle, and run that bytecode, passing the rest of the arguments as its parameters; evaluate to the return value of the bytecode
  • boot-syscall – treat the first argument as the name of a VM-provided built-in operation, and call it, providing the remaining arguments as its parameters
  • boot-resume – resume the topmost ongoing dispatch

That’s pretty much it. Every dispatcher we build, to teach the runtime about some other kind of dispatch behavior, eventually terminates in one of these.

Building on the bootstrap

Teaching MoarVM about different kinds of dispatch is done using nothing less than the dispatch mechanism itself! For the most part, boot-syscall is used in order to register a dispatcher, set up the guards, and provide the result that goes with them.

Here is a minimal example, taken from the dispatcher test suite, showing how a dispatcher that provides the identity function would look:

nqp::dispatch('boot-syscall', 'dispatcher-register', 'identity', -> $capture {
    nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'boot-value', $capture);
});
sub identity($x) {
    nqp::dispatch('identity', $x)
}
ok(identity(42) == 42, 'Can define identity dispatch (1)');
ok(identity('foo') eq 'foo', 'Can define identity dispatch (2)');

In the first statement, we call the dispatcher-register MoarVM system call, passing a name for the dispatcher along with a closure, which will be called each time we need to handle the dispatch (which I tend to refer to as the “dispatch callback”). It receives a single argument, which is a capture of arguments (not actually a Raku-level Capture, but the idea – an object containing a set of call arguments – is the same).

Every user-defined dispatcher should eventually use dispatcher-delegate in order to identify another dispatcher to pass control along to. In this case, it delegates immediately to boot-value – meaning it really is nothing except a wrapper around the boot-value built-in dispatcher.

The sub identity contains a single static occurrence of the dispatch op. Given we call the sub twice, we will encounter this op twice at runtime, but the two times are very different.

The first time is the “record” phase. The arguments are formed into a capture and the callback runs, which in turn passes it along to the boot-value dispatcher, which produces the result. This results in an extremely simple dispatch program, which says that the result should be the first argument in the capture. Since there’s no guards, this will always be a valid result.

The second time we encounter the dispatch op, it already has a dispatch program recorded there, so we are in run mode. Turning on a debugging mode in the MoarVM source, we can see the dispatch program that results looks like this:

Dispatch program (1 temporaries)
  Ops:
    Load argument 0 into temporary 0
    Set result object value from temporary 0

That is, it reads argument 0 into a temporary location and then sets that as the result of the dispatch. Notice how there is no mention of the fact that we went through an extra layer of dispatch; those have zero cost in the resulting dispatch program.

Capture manipulation

Argument captures are immutable. Various VM syscalls exist to transform them into new argument captures with some tweak, for example dropping or inserting arguments. Here’s a further example from the test suite:

nqp::dispatch('boot-syscall', 'dispatcher-register', 'drop-first', -> $capture {
    my $capture-derived := nqp::dispatch('boot-syscall', 'dispatcher-drop-arg', $capture, 0);
    nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'boot-value', $capture-derived);
});
ok(nqp::dispatch('drop-first', 'first', 'second') eq 'second',
    'dispatcher-drop-arg works');

This drops the first argument before passing the capture on to the boot-value dispatcher – meaning that it will return the second argument. Glance back at the previous dispatch program for the identity function. Can you guess how this one will look?

Well, here it is:

Dispatch program (1 temporaries)
  Ops:
    Load argument 1 into temporary 0
    Set result string value from temporary 0

Again, while in the record phase of such a dispatcher we really do create capture objects and make a dispatcher delegation, the resulting dispatch program is far simpler.

Here’s a slightly more involved example:

my $target := -> $x { $x + 1 }
nqp::dispatch('boot-syscall', 'dispatcher-register', 'call-on-target', -> $capture {
    my $capture-derived := nqp::dispatch('boot-syscall',
            'dispatcher-insert-arg-literal-obj', $capture, 0, $target);
    nqp::dispatch('boot-syscall', 'dispatcher-delegate',
            'boot-code-constant', $capture-derived);
});
sub cot() { nqp::dispatch('call-on-target', 49) }
ok(cot() == 50,
    'dispatcher-insert-arg-literal-obj works at start of capture');
ok(cot() == 50,
    'dispatcher-insert-arg-literal-obj works at start of capture after link too');

Here, we have a closure stored in a variable $target. We insert it as the first argument of the capture, and then delegate to boot-code-constant, which will invoke that code object and pass the other dispatch arguments to it. Once again, at the record phase, we really do something like:

  • Create a new capture with a code object inserted at the start
  • Delegate to the boot code constant dispatcher, which…
  • …creates a new capture without the original argument and runs bytecode with those arguments

And the resulting dispatch program? It’s this:

Dispatch program (1 temporaries)
  Ops:
    Load collectable constant at index 0 into temporary 0
    Skip first 0 args of incoming capture; callsite from 0
    Invoke MVMCode in temporary 0

That is, load the constant bytecode handle that we’re going to invoke, set up the args (which are in this case equal to those of the incoming capture), and then invoke the bytecode with those arguments. The argument shuffling is, once again, gone. In general, whenever the arguments we do an eventual bytecode invocation with are a tail of the initial dispatch arguments, the arguments transform becomes no more than a pointer addition.

Guards

All of the dispatch programs seen so far have been unconditional: once recorded at a given callsite, they shall always be used. The big missing piece to make such a mechanism have practical utility is guards. Guards assert properties such as the type of an argument or if the argument is definite (Int:D) or not (Int:U).

Here’s a somewhat longer test case, with some explanations placed throughout it.

# A couple of classes for test purposes
my class C1 { }
my class C2 { }

# A counter used to make sure we're only invokving the dispatch callback as
# many times as we expect.
my $count := 0;

# A type-name dispatcher that maps a type into a constant string value that
# is its name. This isn't terribly useful, but it is a decent small example.
nqp::dispatch('boot-syscall', 'dispatcher-register', 'type-name', -> $capture {
    # Bump the counter, just for testing purposes.
    $count++;

    # Obtain the value of the argument from the capture (using an existing
    # MoarVM op, though in the future this may go away in place of a syscall)
    # and then obtain the string typename also.
    my $arg-val := nqp::captureposarg($capture, 0);
    my str $name := $arg-val.HOW.name($arg-val);

    # This outcome is only going to be valid for a particular type. We track
    # the argument (which gives us an object back that we can use to guard
    # it) and then add the type guard.
    my $arg := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $capture, 0);
    nqp::dispatch('boot-syscall', 'dispatcher-guard-type', $arg);

    # Finally, insert the type name at the start of the capture and then
    # delegate to the boot-constant dispatcher.
    nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'boot-constant',
        nqp::dispatch('boot-syscall', 'dispatcher-insert-arg-literal-str',
            $capture, 0, $name));
});

# A use of the dispatch for the tests. Put into a sub so there's a single
# static dispatch op, which all dispatch programs will hang off.
sub type-name($obj) {
    nqp::dispatch('type-name', $obj)
}

# Check with the first type, making sure the guard matches when it should
# (although this test would pass if the guard were ignored too).
ok(type-name(C1) eq 'C1', 'Dispatcher setting guard works');
ok($count == 1, 'Dispatch callback ran once');
ok(type-name(C1) eq 'C1', 'Can use it another time with the same type');
ok($count == 1, 'Dispatch callback was not run again');

# Test it with a second type, both record and run modes. This ensures the
# guard really is being checked.
ok(type-name(C2) eq 'C2', 'Can handle polymorphic sites when guard fails');
ok($count == 2, 'Dispatch callback ran a second time for new type');
ok(type-name(C2) eq 'C2', 'Second call with new type works');

# Check that we can use it with the original type too, and it has stacked
# the dispatch programs up at the same callsite.
ok(type-name(C1) eq 'C1', 'Call with original type still works');
ok($count == 2, 'Dispatch callback only ran a total of 2 times');

This time two dispatch programs get produced, one for C1:

Dispatch program (1 temporaries)
  Ops:
    Guard arg 0 (type=C1)
    Load collectable constant at index 1 into temporary 0
    Set result string value from temporary 0

And another for C2:

Dispatch program (1 temporaries)
  Ops:
    Guard arg 0 (type=C2)
    Load collectable constant at index 1 into temporary 0
    Set result string value from temporary 0

Once again, no leftovers from capture manipulation, tracking, or dispatcher delegation; the dispatch program does a type guard against an argument, then produces the result string. The whole call to $arg-val.HOW.name($arg-val) is elided, the dispatcher we wrote encoding the knowledge – in a way that the VM can understand – that a type’s name can be considered immutable.

This example is a bit contrived, but now consider that we instead look up a method and guard on the invocant type: that’s a method cache! Guard the types of more of the arguments, and we have a multi cache! Do both, and we have a multi-method cache.

The latter is interesting in so far as both the method dispatch and the multi dispatch want to guard on the invocant. In fact, in MoarVM today there will be two such type tests until we get to the point where the specializer does its work and eliminates these duplicated guards. However, the new dispatcher does not treat the dispatcher-guard-type as a kind of imperative operation that writes a guard into the resultant dispatch program. Instead, it declares that the argument in question must be guarded. If some other dispatcher already did that, it’s idempotent. The guards are emitted once all dispatch programs we delegate through, on the path to a final outcome, have had their say.

Fun aside: those being especially attentive will have noticed that the dispatch mechanism is used as part of implementing new dispatchers too, and indeed, this ultimately will mean that the specializer can specialize the dispatchers and have them JIT-compiled into something more efficient too. After all, from the perspective of MoarVM, it’s all just bytecode to run; it’s just that some of it is bytecode that tells the VM how to execute Raku programs more efficiently!

Dispatch resumption

A resumable dispatcher needs to do two things:

  1. Provide a resume callback as well as a dispatch one when registering the dispatcher
  2. In the dispatch callback, specify a capture, which will form the resume initialization state

When a resumption happens, the resume callback will be called, with any arguments for the resumption. It can also obtain the resume initialization state that was set in the dispatch callback. The resume initialization state contains the things needed in order to continue with the dispatch the first time it is resumed. We’ll take a look at how this works for method dispatch to see a concrete example. I’ll also, at this point, switch to looking at the real Rakudo dispatchers, rather than simplified test cases.

The Rakudo dispatchers take advantage of delegation, duplicate guards, and capture manipulations all having no runtime cost in the resulting dispatch program to, in my mind at least, quite nicely factor what is a somewhat involved dispatch process. There are multiple entry points to method dispatch: the normal boring $obj.meth(), the qualified $obj.Type::meth(), and the call me maybe $obj.?meth(). These have common resumption semantics – or at least, they can be made to provided we always carry a starting type in the resume initialization state, which is the type of the object that we do the method dispatch on.

Here is the entry point to dispatch for a normal method dispatch, with the boring details of reporting missing method errors stripped out.

# A standard method call of the form $obj.meth($arg); also used for the
# indirect form $obj."$name"($arg). It receives the decontainerized invocant,
# the method name, and the the args (starting with the invocant including any
# container).
nqp::dispatch('boot-syscall', 'dispatcher-register', 'raku-meth-call', -> $capture {
    # Try to resolve the method call using the MOP.
    my $obj := nqp::captureposarg($capture, 0);
    my str $name := nqp::captureposarg_s($capture, 1);
    my $meth := $obj.HOW.find_method($obj, $name);

    # Report an error if there is no such method.
    unless nqp::isconcrete($meth) {
        !!! 'Error reporting logic elided for brevity';
    }

    # Establish a guard on the invocant type and method name (however the name
    # may well be a literal, in which case this is free).
    nqp::dispatch('boot-syscall', 'dispatcher-guard-type',
        nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $capture, 0));
    nqp::dispatch('boot-syscall', 'dispatcher-guard-literal',
        nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $capture, 1));

    # Add the resolved method and delegate to the resolved method dispatcher.
    my $capture-delegate := nqp::dispatch('boot-syscall',
        'dispatcher-insert-arg-literal-obj', $capture, 0, $meth);
    nqp::dispatch('boot-syscall', 'dispatcher-delegate',
        'raku-meth-call-resolved', $capture-delegate);
});

Now for the resolved method dispatcher, which is where the resumption is handled. First, let’s look at the normal dispatch callback (the resumption callback is included but empty; I’ll show it a little later).

# Resolved method call dispatcher. This is used to call a method, once we have
# already resolved it to a callee. Its first arg is the callee, the second and
# third are the type and name (used in deferral), and the rest are the args to
# the method.
nqp::dispatch('boot-syscall', 'dispatcher-register', 'raku-meth-call-resolved',
    # Initial dispatch
    -> $capture {
        # Save dispatch state for resumption. We don't need the method that will
        # be called now, so drop it.
        my $resume-capture := nqp::dispatch('boot-syscall', 'dispatcher-drop-arg',
            $capture, 0);
        nqp::dispatch('boot-syscall', 'dispatcher-set-resume-init-args', $resume-capture);

        # Drop the dispatch start type and name, and delegate to multi-dispatch or
        # just invoke if it's single dispatch.
        my $delegate_capture := nqp::dispatch('boot-syscall', 'dispatcher-drop-arg',
            nqp::dispatch('boot-syscall', 'dispatcher-drop-arg', $capture, 1), 1);
        my $method := nqp::captureposarg($delegate_capture, 0);
        if nqp::istype($method, Routine) && $method.is_dispatcher {
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-multi', $delegate_capture);
        }
        else {
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-invoke', $delegate_capture);
        }
    },
    # Resumption
    -> $capture {
        ... 'Will be shown later';
    });

There’s an arguable cheat in raku-meth-call: it doesn’t actually insert the type object of the invocant in place of the invocant. It turns out that it doesn’t really matter. Otherwise, I think the comments (which are to be found in the real implementation also) tell the story pretty well.

One important point that may not be clear – but follows a repeating theme – is that the setting of the resume initialization state is also more of a declarative rather than an imperative thing: there isn’t a runtime cost at the time of the dispatch, but rather we keep enough information around in order to be able to reconstruct the resume initialization state at the point we need it. (In fact, when we are in the run phase of a resume, we don’t even have to reconstruct it in the sense of creating a capture object.)

Now for the resumption. I’m going to present a heavily stripped down version that only deals with the callsame semantics (the full thing has to deal with such delights as lastcall and nextcallee too). The resume initialization state exists to seed the resumption process. Once we know we actually do have to deal with resumption, we can do things like calculating the full list of methods in the inheritance graph that we want to walk through. Each resumable dispatcher gets a single storage slot on the call stack that it can use for its state. It can initialize this in the first step of resumption, and then update it as we go. Or more precisely, it can set up a dispatch program that will do this when run.

A linked list turns out to be a very convenient data structure for the chain of candidates we will walk through. We can work our way through a linked list by keeping track of the current node, meaning that there need only be a single thing that mutates, which is the current state of the dispatch. The dispatch program mechanism also provides a way to read an attribute from an object, and that is enough to express traversing a linked list into the dispatch program. This also means zero allocations.

So, without further ado, here is the linked list (rather less pretty in NQP, the restricted Raku subset, than it would be in full Raku):

# A linked list is used to model the state of a dispatch that is deferring
# through a set of methods, multi candidates, or wrappers. The Exhausted class
# is used as a sentinel for the end of the chain. The current state of the
# dispatch points into the linked list at the appropriate point; the chain
# itself is immutable, and shared over (runtime) dispatches.
my class DeferralChain {
    has $!code;
    has $!next;
    method new($code, $next) {
        my $obj := nqp::create(self);
        nqp::bindattr($obj, DeferralChain, '$!code', $code);
        nqp::bindattr($obj, DeferralChain, '$!next', $next);
        $obj
    }
    method code() { $!code }
    method next() { $!next }
};
my class Exhausted {};

And finally, the resumption handling.

nqp::dispatch('boot-syscall', 'dispatcher-register', 'raku-meth-call-resolved',
    # Initial dispatch
    -> $capture {
        ... 'Presented earlier;
    },
    # Resumption. The resume init capture's first two arguments are the type
    # that we initially did a method dispatch against and the method name
    # respectively.
    -> $capture {
        # Work out the next method to call, if any. This depends on if we have
        # an existing dispatch state (that is, a method deferral is already in
        # progress).
        my $init := nqp::dispatch('boot-syscall', 'dispatcher-get-resume-init-args');
        my $state := nqp::dispatch('boot-syscall', 'dispatcher-get-resume-state');
        my $next_method;
        if nqp::isnull($state) {
            # No state, so just starting the resumption. Guard on the
            # invocant type and name.
            my $track_start_type := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $init, 0);
            nqp::dispatch('boot-syscall', 'dispatcher-guard-type', $track_start_type);
            my $track_name := nqp::dispatch('boot-syscall', 'dispatcher-track-arg', $init, 1);
            nqp::dispatch('boot-syscall', 'dispatcher-guard-literal', $track_name);

            # Also guard on there being no dispatch state.
            my $track_state := nqp::dispatch('boot-syscall', 'dispatcher-track-resume-state');
            nqp::dispatch('boot-syscall', 'dispatcher-guard-literal', $track_state);

            # Build up the list of methods to defer through.
            my $start_type := nqp::captureposarg($init, 0);
            my str $name := nqp::captureposarg_s($init, 1);
            my @mro := nqp::can($start_type.HOW, 'mro_unhidden')
                ?? $start_type.HOW.mro_unhidden($start_type)
                !! $start_type.HOW.mro($start_type);
            my @methods;
            for @mro {
                my %mt := nqp::hllize($_.HOW.method_table($_));
                if nqp::existskey(%mt, $name) {
                    @methods.push(%mt{$name});
                }
            }

            # If there's nothing to defer to, we'll evaluate to Nil (just don't set
            # the next method, and it happens below).
            if nqp::elems(@methods) >= 2 {
                # We can defer. Populate next method.
                @methods.shift; # Discard the first one, which we initially called
                $next_method := @methods.shift; # The immediate next one

                # Build chain of further methods and set it as the state.
                my $chain := Exhausted;
                while @methods {
                    $chain := DeferralChain.new(@methods.pop, $chain);
                }
                nqp::dispatch('boot-syscall', 'dispatcher-set-resume-state-literal', $chain);
            }
        }
        elsif !nqp::istype($state, Exhausted) {
            # Already working through a chain of method deferrals. Obtain
            # the tracking object for the dispatch state, and guard against
            # the next code object to run.
            my $track_state := nqp::dispatch('boot-syscall', 'dispatcher-track-resume-state');
            my $track_method := nqp::dispatch('boot-syscall', 'dispatcher-track-attr',
                $track_state, DeferralChain, '$!code');
            nqp::dispatch('boot-syscall', 'dispatcher-guard-literal', $track_method);

            # Update dispatch state to point to next method.
            my $track_next := nqp::dispatch('boot-syscall', 'dispatcher-track-attr',
                $track_state, DeferralChain, '$!next');
            nqp::dispatch('boot-syscall', 'dispatcher-set-resume-state', $track_next);

            # Set next method, which we shall defer to.
            $next_method := $state.code;
        }
        else {
            # Dispatch already exhausted; guard on that and fall through to returning
            # Nil.
            my $track_state := nqp::dispatch('boot-syscall', 'dispatcher-track-resume-state');
            nqp::dispatch('boot-syscall', 'dispatcher-guard-literal', $track_state);
        }

        # If we found a next method...
        if nqp::isconcrete($next_method) {
            # Call with same (that is, original) arguments. Invoke with those.
            # We drop the first two arguments (which are only there for the
            # resumption), add the code object to invoke, and then leave it
            # to the invoke or multi dispatcher.
            my $just_args := nqp::dispatch('boot-syscall', 'dispatcher-drop-arg',
                nqp::dispatch('boot-syscall', 'dispatcher-drop-arg', $init, 0),
                0);
            my $delegate_capture := nqp::dispatch('boot-syscall',
                'dispatcher-insert-arg-literal-obj', $just_args, 0, $next_method);
            if nqp::istype($next_method, Routine) && $next_method.is_dispatcher {
                nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-multi',
                        $delegate_capture);
            }
            else {
                nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'raku-invoke',
                        $delegate_capture);
            }
        }
        else {
            # No method, so evaluate to Nil (boot-constant disregards all but
            # the first argument).
            nqp::dispatch('boot-syscall', 'dispatcher-delegate', 'boot-constant',
                nqp::dispatch('boot-syscall', 'dispatcher-insert-arg-literal-obj',
                    $capture, 0, Nil));
        }
    });

That’s quite a bit to take in, and quite a bit of code. Remember, however, that this is only run for the record phase of a dispatch resumption. It also produces a dispatch program at the callsite of the callsame, with the usual guards and outcome. Implicit guards are created for the dispatcher that we are resuming at that point. In the most common case this will end up monomorphic or bimorphic, although situations involving nestings of multiple dispatch or method dispatch could produce a more morphic callsite.

The design I’ve picked forces resume callbacks to deal with two situations: the first resumption and the latter resumptions. This is not ideal in a couple of ways:

  1. It’s a bit inconvenient for those writing dispatch resume callbacks. However, it’s not like this is a particularly common activity!
  2. The difference results in two dispatch programs being stacked up at a callsite that might otherwise get just one

Only the second of these really matters. The reason for the non-uniformity is to make sure that the overwhelming majority of calls, which never lead to a dispatch resumption, incur no per-dispatch cost for a feature that they never end up using. If the result is a little more cost for those using the feature, so be it. In fact, early benchmarking shows callsame with wrap and method calls seems to be up to 10 times faster using the new dispatcher than in current Rakudo, and that’s before the specializer understands enough about it to improve things further!

What’s done so far

Everything I’ve discussed above is implemented, except that I may have given the impression somewhere that multiple dispatch is fully implemented using the new dispatcher, and that is not the case yet (no handling of where clauses and no dispatch resumption support).

Next steps

Getting the missing bits of multiple dispatch fully implemented is the obvious next step. The other missing semantic piece is support for callwith and nextwith, where we wish to change the arguments that are being used when moving to the next candidate. A few other minor bits aside, that in theory will get all of the Raku dispatch semantics at least supported.

Currently, all standard method calls ($obj.meth()) and other calls (foo() and $foo()) go via the existing dispatch mechanism, not the new dispatcher. Those will need to be migrated to use the new dispatcher also, and any bugs that are uncovered will need fixing. That will get things to the point where the new dispatcher is semantically ready.

After that comes performance work: making sure that the specializer is able to deal with dispatch program guards and outcomes. The goal, initially, is to get steady state performance of common calling forms to perform at least as well as in the current master branch of Rakudo. It’s already clear enough there will be some big wins for some things that to date have been glacial, but it should not come at the cost of regression on the most common kinds of dispatch, which have received plenty of optimization effort before now.

Furthermore, NQP – the restricted form of Raku that the Rakudo compiler and other bits of the runtime guts are written in – also needs to be migrated to use the new dispatcher. Only when that is done will it be possible to rip out the current method cache, multiple dispatch cache, and so forth from MoarVM.

An open question is how to deal with backends other than MoarVM. Ideally, the new dispatch mechanism will be ported to those. A decent amount of it should be possible to express in terms of the JVM’s invokedynamic (and this would all probably play quite well with a Truffle-based Raku implementation, although I’m not sure there is a current active effort in that area).

Future opportunities

While my current focus is to ship a Rakudo and MoarVM release that uses the new dispatcher mechanism, that won’t be the end of the journey. Some immediate ideas:

  • Method calls on roles need to pun the role into a class, and so method lookup returns a closure that does that and replaces the invocant. That’s a lot of indirection; the new dispatcher could obtain the pun and produce a dispatch program that replaces the role type object with the punned class type object, which would make the per-call cost far lower.
  • I expect both the handles (delegation) and FALLBACK (handling missing method call) mechanisms can be made to perform better using the new dispatcher
  • The current implementation of assuming – used to curry or otherwise prime arguments for a routine – is not ideal, and an implementation that takes advantage of the argument rewriting capabilities of the new dispatcher would likely perform a great deal better

Some new language features may also be possible to provide in an efficient way with the help of the new dispatch mechanism. For example, there’s currently not a reliable way to try to invoke a piece of code, just run it if the signature binds, or to do something else if it doesn’t. Instead, things like the Cro router have to first do a trial bind of the signature, and then do the invoke, which makes routing rather more costly. There’s also the long suggested idea of providing pattern matching via signatures with the when construct (for example, when * -> ($x) {}; when * -> ($x, *@tail) { }), which is pretty much the same need, just in a less dynamic setting.

In closing…

Working on the new dispatch mechanism has been a longer journey than I first expected. The resumption part of the design was especially challenging, and there’s still a few important details to attend to there. Something like four potential approaches were discarded along the way (although elements of all of them influenced what I’ve described in this post). Abstractions that hold up are really, really, hard.

I also ended up having to take a couple of months away from doing Raku work at all, felt a bit crushed during some others, and have been juggling this with the equally important RakuAST project (which will be simplified by being able to assume the presence of the new dispatcher, and also offers me a range of softer Raku hacking tasks, whereas the dispatcher work offers few easy pickings).

Given all that, I’m glad to finally be seeing the light at the end of the tunnel. The work that remains is enumerable, and the day we ship a Rakudo and MoarVM release using the new dispatcher feels a small number of months away (and I hope writing that is not tempting fate!)

The new dispatcher is probably the most significant change to MoarVM since I founded it, in so far as it sees us removing a bunch of things that have been there pretty much since the start. RakuAST will also deliver the greatest architectural change to the Rakudo compiler in a decade. Both are an opportunity to fold years of learning things the hard way into the runtime and compiler. I hope when I look back at it all in another decade’s time, I’ll at least feel I made more interesting mistakes this time around.

Posted in Uncategorized | 3 Comments

Taking a break from Raku core development

I’d like to thank everyone who voted for me in the recent Raku Steering Council elections. By this point, I’ve been working on the language for well over a decade, first to help turn a language design I found fascinating into a working implementation, and since the Christmas release to make that implementation more robust and performant. Overall, it’s been as fun as it has been challenging – in a large part because I’ve found myself sharing the journey with a lot of really great people. I’ve also tried to do my bit to keep the community around the language kind and considerate. Receiving a vote from around 90% of those who participated in the Steering Council elections was humbling.

Alas, I’ve today submitted my resignation to the Steering Council, on personal health grounds. For the same reason, I’ll be taking a step back from Raku core development (Raku, MoarVM, language design, etc.) Please don’t worry too much; I’ll almost certainly be fine. It may be I’m ready to continue working on Raku things in a month or two. It may also be longer. Either way, I think Raku will be better off with a fully sized Steering Council in place, and I’ll be better off without the anxiety that I’m holding a role that I’m not in a place to fulfill.

Posted in Uncategorized | Comments Off on Taking a break from Raku core development

My Perl 6 wishes for 2019

This evening, I enjoyed the New Year’s fireworks display over the beautiful Prague skyline. Well, the bit of it that was between me and the fireworks, anyway. Rather than having its fireworks display at midnight, Prague puts it at 6pm on New Year’s Day. That makes it easy for families to go to, which is rather thoughtful. It’s also, for those of us with plans to dig back into work the following day, a nice end to the festive break.

Prague fireworks over Narodni Divadlo

So, tomorrow I’ll be digging back into work, which of late has involved a lot of Perl 6. Having spent so many years working on Perl 6 compiler and runtime design and implementation, it’s been fun to spend a good amount of 2018 using Perl 6 for commercial projects. I’m hopeful that will continue into 2019. Of course, I’ll be continuing to work on plenty of Perl 6 things that are for the good of all Perl 6 users too. In this post, I’d like to share some of the things I’m hoping to work on or achieve during 2019.

Partial Escape Analysis and related optimizations in MoarVM

The MoarVM specializer learned plenty of new tricks this year, delivering some nice speedups for many Perl 6 programs. Many of my performance improvement hopes for 2019 center around escape analysis and optimizations stemming from it.

The idea is to analyze object allocations, and find pieces of the program where we can fully understand all of the references that exist to the object. The points at which we can cease to do that is where an object escapes. In the best cases, an object never escapes; in other cases, there are a number of reads and writes performed to its attributes up until its escape.

Armed with this, we can perform scalar replacement, which involves placing the attributes of the object into local registers up until the escape point, if any. As well as reducing memory operations, this means we can often prove significantly more program properties, allowing further optimization (such as getting rid of dynamic type checks). In some cases, we might never need to allocate the object at all; this should be a big win for Perl 6, which by its design creates lots of short-lived objects.

There will be various code-generation and static optimizer improvements to be done in Rakudo in support of this work also, which should result in its own set of speedups.

Expect to hear plenty about this in my posts here in the year ahead.

Decreasing startup time and base memory use

The current Rakudo startup time is quite high. I’d really like to see it fall to around half of what it currently is during 2019. I’ve got some concrete ideas on how that can be achieved, including changing the way we store and deserialize NFAs used by the parser, and perhaps also dealing with the way we currently handle method caches to have less startup impact.

Both of these should also decrease the base memory use, which is also a good bit higher than I wish.

Improving compilation times

Some folks – myself included – are developing increasingly large applications in Perl 6. For the current major project I’m working on, runtime performance is not an issue by now, but I certainly feel myself waiting a bit on compiles. Part of it is parse performance, and I’d like to look at that; in doing so, I’d also be able to speed up handling of all Perl 6 grammars.

I suspect there are some good wins to be had elsewhere in the compilation pipeline too, and the startup time improvements described above should also help, especially when we pre-compile deep dependency trees. I’d also like to look into if we can do some speculative parallel compilation.

Research into concurrency safety

In Perl 6.d, we got non-blocking await and react support, which greatly improved the scalability of Perl 6 concurrent and parallel programs. Now many thousands of outstanding tasks can be juggled across just a handful of threads (the exact number chosen according to demand and CPU count).

For Perl 6.e, which is still a good way off, I’d like to having something to offer in terms of making Perl 6 concurrent and parallel programming safer. While we have a number of higher-level constructs that eliminate various ways to make mistakes, it’s still possible to get into trouble and have races when using them.

So, I plan to spend some time this year quietly exploring and prototyping in this space. Obviously, I want something that fits in with the Perl 6 language design, and that catches real and interesting bugs – probably by making things that are liable to occasionally explode in weird ways instead reliably do so in helpful ways, such that they show up reliably in tests.

Get Cro to its 1.0 release

In the 16 months since I revealed it, Cro has become a popular choice for implementing HTTP APIs and web applications in Perl 6. It has also attracted code contributions from a couple of dozen contributors. This year, I aim to see Cro through to its 1.0 release. That will include (to save you following the roadmap link):

  • Providing flexible HTTP reverse proxying support
  • Providing a number of robustness patterns (timeout, retry, circuit breaker, and so forth)
  • Providing Cro support some message queue protocols
  • Ensuring Cro’s features work on MacOS and Windows
  • Providing some further help for those building server-side web applications (templating, an easier time setting up common login flows, etc.)
  • Conducting a security review
  • Specifying a stability/deprecation policy
  • Lots of polishing

Comma Community, and lots of improvements and features

I founded Comma IDE in order to bring Perl 6 a powerful Integrated Development Environment. We’ve come a long way since the Minimum Viable Product we shipped back in June to the first subscribers to the Comma Supporter Program. In recent months, I’ve used Comma almost daily on my various Perl 6 projects, and by this point honestly wouldn’t want to be without it. Like Cro, I built Comma because it’s a product I wanted to use, which I think is a good place to be in when building any product.

In a few months time, we expect to start offering Comma Community and Comma Complete. The former will be free of charge, and the latter a commercial offering under a subscribe-for-updates model (just like how the supporter program has worked so far). My own Comma wishlist is lengthy enough to keep us busy for a lot more than the next year, and that’s before considering things Comma users are asking for. Expect plenty of exciting new features, as well as ongoing tweaks to make each release feel that little bit nicer to use.

Speaking, conferences, workshops, etc.

This year will see me giving my first keynote at a European Perl Conference. I’m looking forward to being in Riga again; it’s a lovely city to wander around, and I remember having some pretty nice food there too. The keynote will focus on the concurrent and parallel aspects of Perl 6; thankfully, I’ve still a good six months to figure out exactly what angle I wish to take on that, having spoken on the topic many times before!

I also plan to submit a talk or two for the German Perl Workshop, and will probably find the Swiss Perl Workshop hard to resist attending once more. And, more locally, I’d like to try and track down other Perl folks here in Prague, and see if I can help some kind of Praha.pm to happen again.

I need to keep my travel down to sensible levels, but might be able to fit in the odd other bit of speaking during the year, if it’s not too far away.

Teaching

While I want to spend most of my time building stuff rather than talking about it, I’m up for the occasional bit of teaching. I’m considering pitching a 1-day Perl 6 concurrency workshop to the Riga organizers. Then we’ll see if there’s enough folks interested in taking it. It’ll cost something, but probably much less than any other way of getting a day of teaching from me. :-)

So, down to work!

Well, a good night’s sleep first. :-) But tomorrow, another year of fun begins. I’m looking forward to it, and to working alongside many wonderful folks in the Perl community.

Posted in Uncategorized | 1 Comment

Speeding up object creation

Recently, a Perl 6 object creation benchmark result did the rounds on social media. This Perl 6 code:

class Point {
    has $.x;
    has $.y;
}
my $total = 0;
for ^1_000_000 {
    my $p = Point.new(x => 2, y => 3);
    $total = $total + $p.x + $p.y;
}
say $total;

Now (on HEAD builds of Rakudo and MoarVM) runs faster than this roughly equivalent Perl 5 code:

use v5.10;

package Point;

sub new {
    my ($class, %args) = @_;
    bless \%args, $class;
}

sub x {
    my $self = shift;
    $self->{x}
}

sub y {
    my $self = shift;
    $self->{y}
}

package main;

my $total = 0;
for (1..1_000_000) {
    my $p = Point->new(x => 2, y => 3);
    $total = $total + $p->x + $p->y;
}
say $total;

(Aside: yes, I know there’s no shortage of libraries in Perl 5 that make OO nicer than this, though those I tried also made it slower.)

This is a fairly encouraging result: object creation, method calls, and attribute access are the operational bread and butter of OO programming, so it’s a pleasing sign that Perl 6 on Rakudo/MoarVM is getting increasingly speedy at them. In fact, it’s probably a bit better at them than this benchmark’s raw numbers show, since:

  • The measurements include startup time. Probably 10%-15% of the time Rakudo spends executing the program is startup, parsing, compilation, etc. (noting the compiler is itself written in Perl 6). By contrast, Perl 5 has really fast startup and a parser written in C.
  • The math in Perl 6 is arbitrary precision (that is, will upgrade to a big integer if needed); it costs something to handle that, even if it doesn’t happen
  • Every math operation allocates a new Int object; there’s two uses of + here, so that means two new Int allocations per iteration of the loop, which then have to be garbage collected

While dealing with Int values got faster recently, it’s still really making two Int objects every time around that loop and having to GC them later. An upcoming new set of analyses and optimizations will let us get rid of that cost too. And yes, startup will get faster with time also. In summary, while Rakudo/MoarVM is now winning that benchmark against Perl 5, there’s still lots more to be had. Which is a good job, since the equivalent Python and Ruby versions of that benchmark still run faster than the Perl 6 one.

In this post, I’ll look at the changes that allowed this benchmark to end up faster. None of the new work was particularly ground-breaking; in fact, it was mostly a case of doing small things to make better use of optimizations we already had.

What happens during construction?

Theoretically, the default new method in turn calls bless, passing the named arguments along. The bless method then creates an object instance, followed by calling BUILDALL. The BUILDALL method goes through the set of steps needed for constructing the object. In the case of a simple object like ours, that involves checking if an x and y named argument were passed, and if so assigning those values into the Scalar containers of the object attributes.

For those keeping count, that’s at least 3 method calls (newbless, and BUILDALL).

However, there’s a cheat. If bless wasn’t overridden (which would be an odd thing to do anyway), then the default new could just call BUILDALL directly anyway. Therefore, new looks like this:

multi method new(*%attrinit) {
    nqp::if(
      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),
      nqp::create(self).BUILDALL(Empty, %attrinit),
      $bless(|%attrinit)
    )
}

The BUILDALL method was originally a little “interpreter” that went through a per-object build plan stating what needs to be done. However, for quite some time now we’ve instead compiled a per-class BUILDPLAN method.

Slurpy sadness

To figure out how to speed this up, I took a look at how the specializer was handling the code. The answer: not so well. There were certainly some positive moments in there. Of note:

  • The x and y accessor methods were being inlined
  • The + operators were being inlined
  • Everything was JIT-compiled into machine code

However, the new method was getting only a “certain” specialization, which is usually only done for very heavily polymorphic code. That wasn’t the case here; this program clearly constructs overwhelmingly one type of object. So what was going on?

In order to produce an “observed type” specialization – the more powerful kind – it needs to have data on the types of all of the passed arguments. And for the named arguments, that was missing. But why?

Logging of passed argument types is done on the callee side, since:

  • If the callee is already specialized, then we don’t want to waste time in the caller doing throwaway logging
  • If the callee is not specialized, but the caller is, then we don’t want to miss out on logging types (and we only log when running unspecialized code)

The argument logging was done as the interpreter executed each parameter processing instruction. However, when there’s a slurpy, then it would just swallow up all the remaining arguments without logging type information. Thus the information about the argument types was missing and we ended up with a less powerful form of specialization.

Teaching the slurpy handling code about argument logging felt a bit involved, but then I realized there was a far simpler solution: log all of the things in the argument buffer at the point an unspecialized frame is being invoked. We’re already logging the entry to the call frame at that point anyway. This meant that all of the parameter handling instructions got a bit simpler too, since they no longer had logging to do.

Conditional elimination

Having new being specialized in a more significant way was an immediate win. Of note, this part:

      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),

Was quite interesting. Since we were now specializing on the type of self, then the findmethod could be resolved into a constant. The resolution of a method on the constant Mu was also a constant. Therefore, the result of the eqaddr (same memory address) comparison of two constants should also have been turned into a constant…except that wasn’t happening! This time, it was simple: we just hadn’t implemented folding of that yet. So, that was an easy win, and once done meant the optimizer could see that the if would always go a certain way and thus optimize the whole thing into the chosen branch. Thus the new method was specialized into something like:

multi method new(*%attrinit) {
    nqp::create(self).BUILDALL(Empty, %attrinit),
}

Further, the create could be optimized into a “fast create” op, and the relatively small BUILDALL method could be inlined into new. Not bad.

Generating simpler code

At this point, things were much better, but still not quite there. I took a look at the BUILDALL method compilation, and realized that it could emit faster code.

The %attrinit is a Perl 6 Hash object, which is for the most part a wrapper around the lower-level VM hash object, which is the actual hash storage. We were, curiously, already pulling this lower-level hash out of the Hash object and using the nqp::existskey primitive to check if there was a value, but were not doing that for actually accessing the value. Instead, an .AT-KEY('x') method call was being done. While that was being handled fairly well – inlined and so forth – it also does its own bit of existence checking. I realized we could just use the nqp::atkey primitive here instead.

Later on, I also realized that we could do away with nqp::existskey and just use nqp::atkey. Since a VM-level null is something that never exists in Perl 6, we can safely use it as a sentinel to mean that there is no value. That got us down to a single hash lookup.

By this point, we were just about winning the benchmark, but only by a few percent. Were there any more easy pickings?

An off-by-one

My next surprise was that the new method didn’t get inlined. It was within the size limit. There was nothing preventing it. What was going on?

Looking closer, it was even worse than that. Normally, when something is too big to be inlined, but we can work out what specialization will be called on the target, we do “specialization linking”, indicating which specialization of the code to call. That wasn’t happening. But why?

Some debugging later, I sheepishly fixed an off-by-one in the code that looks through a multi-dispatch cache to see if a particular candidate matches the set of argument types we have during optimization of a call instruction. This probably increased the performance of quite a few calls involving passing named arguments to multi-methods – and meant new was also inlined, putting us a good bit ahead on this benchmark.

What next?

The next round of performance improvements – both to this code and plenty more besides – will come from escape analysis, scalar replacement, and related optimizations. I’ve already started on that work, though expect it will keep me busy for quite a while. I will, however, be able to deliver it in stages, and am optimistic I’ll have the first stage of it ready to talk about – and maybe even merge – in a week or so.

Posted in Uncategorized | 5 Comments

Eliminating unrequired guards

MoarVM’s optimizer can perform speculative optimization. It does this by gathering statistics as the program is interpreted, and then analyzing them to find out what types and callees typically show up at given points in the program. If it spots there is at least a 99% chance of a particular type showing up at a particular program point, then it will optimize the code ahead of that point as if that type would always show up.

Of course, statistics aren’t proofs. What about the 1% case? To handle this, a guard instruction is inserted. This cheaply checks if the type is the expected one, and if it isn’t, falls back to the interpreter. This process is known as deoptimization.

Just how cheap are guards?

I just stated that a guard cheaply checks if the type is the expected one, but just how cheap is it really? There’s both direct and indirect costs.

The direct cost is that of the check. Here’s a (slightly simplified) version of the JIT compiler code that produces the machine code for a type guard.

/* Load object that we should guard */
| mov TMP1, WORK[obj];
/* Get type table we expect and compare it with the object's one */
MVMint16 spesh_idx = guard->ins->operands[2].lit_i16;
| get_spesh_slot TMP2, spesh_idx;
| cmp TMP2, OBJECT:TMP1->st;
| jne >1;
/* We're good, no need to deopt */
| jmp >2;
|1:
/* Call deoptimization handler */
| mov ARG1, TC;
| mov ARG2, guard->deopt_offset;
| mov ARG3, guard->deopt_target;
| callp &MVM_spesh_deopt_one_direct;
/* Jump out to the interpreter */
| jmp ->exit;
|2:

Where get_spesh_slot is a macro like this:

|.macro get_spesh_slot, reg, idx;
| mov reg, TC->cur_frame;
| mov reg, FRAME:reg->effective_spesh_slots;
| mov reg, OBJECTPTR:reg[idx];
|.endmacro

So, in the case that the guard matches, it’s 7 machine instructions (note: it’s actually a couple more because of something I omitted for simplicity). Thus there’s the cost of the time to execute them, plus the space they take in memory and, especially, the instruction cache. Further, one is a conditional jump. We’d expect it to be false most of the time, and so the CPU’s branch predictor should get a good hit rate – but branch predictor usage isn’t entirely free of charge either. Effectively, it’s not that bad, but it’s nice to save the cost if we can.

The indirect costs are much harder to quantify. In order to deoptimize, we need to have enough state to recreate the world as the interpreter expects it to be. I wrote on this topic not so long ago, for those who want to dive into the detail, but the essence of the problem is that we may have to retain some instructions and/or forgo some optimizations so that we are able to successfully deoptimize if needed. Thus, the presence of a guard constrains what optimizations we can perform in the code around it.

Representation problems

A guard instruction in MoarVM originally looked like:

sp_guard r(obj) sslot uint32

Where r(obj) is an object register to read containing the object to guard, the sslot is a spesh slot (an entry in a per-block constant table) containing the type we expect to see, and the uint32 indicates the target address after we deoptimize. Guards are inserted after instructions for which we had gathered statistics and determined there was a stable type. Things guarded include return values after a call, reads of object attributes, and reads of lexical variables.

This design has carried us a long way, however it has a major shortcoming. The program is represented in SSA form. Thus, an invoke followed by a guard might look something like:

invoke r6(5), r4(2)
sp_guard r6(5), sslot(42), litui32(64)

Where r6(5) has the return value written into it (and thus is a new SSA version of r6). We hold facts about a value (if it has a known type, if it has a known value, etc.) per SSA version. So the facts about r6(5) would be that it has a known type – the one that is asserted by the guard.

The invoke itself, however, might be optimized by performing inlining of the callee. In some cases, we might then know the type of result that the inlinee produces – either because there was a guard inside of the inlined code, or because we can actually prove the return type! However, since the facts about r6(5) were those produced by the guard, there was no way to talk about what we know of r6(5) before the guard and after the guard.

More awkwardly, while in the early days of the specializer we only ever put guards immediately after the instructions that read values, more recent additions might insert them at a distance (for example, in speculative call optimizations and around spesh plugins). In this case, we could not safely set facts on the guarded register, because those might lead to wrong optimizations being done prior to the guard.

Changing of the guards

Now a guard instruction looks like this:

sp_guard w(obj) r(obj) sslot uint32

Or, concretely:

invoke r6(5), r4(2)
sp_guard r6(6), r6(5), sslot(42), litui32(64)

That is to say, it introduces a new SSA version. This means that we get a way to talk about the value both before and after the guard instruction. Thus, if we perform an inlining and we know exactly what type it will return, then that type information will flow into the input – in our example, r6(5) – of the guard instruction. We can then notice that the property the guard wants to assert is already upheld, and replace the guard with a simple set (which may itself be eliminated by later optimizations).

This also solves the problem with guards inserted away from the original write of the value: we get a new SSA version beyond the guard point. This in turn leads to more opportunities to avoid repeated guards beyond that point.

Quite a lot of return value guards on common operations simply go away thanks to these changes. For example, in $a + $b, where $a and $b are Int, we will be able to inline the + operator, and we can statically see from its code that it will produce an Int. Thus, the guard on the return type in the caller of the operator can be eliminated. This saves the instructions associated with the guard, and potentially allows for further optimizations to take place since we know we’ll never deoptimize at that point.

In summary

MoarVM does lots of speculative optimization. This enables us to optimize in cases where we can’t prove a property of the program, but statistics tell us that it mostly behaves in a certain way. We make this safe by adding guards, and falling back to the general version of the code in cases where they fail.

However, guards have a cost. By changing our representation of them, so that we model the data coming into the guard and after the guard as two different SSA versions, we are able to eliminate many guard instructions. This not only reduces duplicate guards, but also allows for elimination of guards when the broader view afforded by inlining lets us prove properties that we weren’t previously able to.

In fact, upcoming work on escape analysis and scalar replacement will allow us to start seeing into currently opaque structures, such as Scalar containers. When we are able to do that, then we’ll be able to prove further program properties, leading to the elimination of yet more guards. Thus, this work is not only immediately useful, but also will help us better exploit upcoming optimizations.

Posted in Uncategorized | 1 Comment

Faster box/unbox and Int operations

My work on Perl 6 performance continues, thanks to a renewal of my grant from The Perl Foundation. I’m especially focusing on making common basic operations faster, the idea being that if those go faster than programs composed out of them also should. This appears to be working out well: I’ve not been directly trying to make the Text::CSV benchmark run faster, but that’s resulted from my work.

I’ll be writing a few posts in on various of the changes I’ve done. This one will take a look at some related optimizations around boxing, unboxing, and common mathematical operations on Int.

Boxing and unboxing

Boxing is taking a natively typed value and wrapping it into an object. Unboxing is the opposite: taking an object that wraps a native value and getting the native value out of it.

In Perl 6, these are extremely common. Num and Str are boxes around num and str respectively. Thus, unless dealing with natives explicitly, working with floating point numbers and strings will involve lots of box and unbox operations.

There’s nothing particularly special about Num and Str. They are normal objects with the P6opaque representation, meaning they can be mixed into. The only thing that makes them slightly special is that they have attributes marked as being a box target. This indicates the attribute out as being the one to write to or read from in a box or unbox operation.

Thus, a box operations is something like:

  • Create an instance of the box type
  • Find out where in that object to write the value to
  • Write the value there

While unbox is:

  • Find out where in the object to read a value from
  • Read it from there

Specialization of box and unbox

box is actually two operations: an allocation and a store. We know how to fast-path allocations and JIT them relatively compactly, however that wasn’t being done for box. So, step one was to decompose this higher-level op into the allocation and the write into the allocated object. The first step could then be optimized in the usual way allocations are.

In the unspecialized path, we first find out where to write the native value to, and then write it. However, when we’re producing a specialized version, we almost always know the type we’re boxing into. Therefore, the object offset to write to can be calculated once, and a very simple instruction to do a write at an offset into the object produced. This JITs extremely well.

There are a couple of other tricks. Binds into a P6opaque generally have to check that the object wasn’t mixed in to, however since we just allocated it then we know that can’t be the case and can skip that check. Also, a string is a garbage-collectable object, and when assigning one GC-able object into another one, we need to perform a write barrier for the sake of generational GC. However, since the object was just allocated, we know very well that it is in the nursery, and so the write barrier will never trigger. Thus, we can omit it.

Unboxing is easier to specialize: just calculate the offset, and emit a simpler instruction to load the value from there.

I’ve also started some early work (quite a long way from merge) on escape analysis, which will allow us to eliminate many box object allocations entirely. It’s a great deal easier to implement this if allocations, reads, and writes to an object have a uniform representation. By lowering box and unbox operations into these lower level operations, this eases the path to implementing escape analysis for them.

What about Int?

Some readers might have wondered why I talked about Num and Str as examples of boxed types, but not Int. It is one too – but there’s a twist. Actually, there’s two twists.

The first is that Int isn’t actually a wrapper around an int, but rather an arbitrary precision integer. When we first implemented Int, we had it always use a big integer library. Of course, this is slow, so later on we made it so any number fitting into a 32-bit range would be stored directly, and only allocate a big integer structure if it’s outside of this range.

Thus, boxing to a big integer means range-checking the value to box. If it fits into the 32-bit range, then we can write it directly, and set the flag indicating that it’s a small Int. Machine code to perform these steps is now spat out directly by the JIT, and we only fall back to a function call in the case where we need a big integer. Once again, the allocation itself is emitted in a more specialized way too, and the offset to write to is determined once at specialization time.

Unboxing is similar. Provided we’re specializing on the type of the object to unbox, then we calculate the offset at specialization time. Then, the JIT produces code to check if the small Int flag is set, and if so just reads and sign extends the value into a 64-bit register. Otherwise, it falls back to the function call to handle the real big integer case.

For boxing, however, there was a second twist: we have a boxed integer cache, so for small integers we don’t have to repeatedly allocate objects boxing them. So boxing an integer is actually:

  1. Check if it’s in the range of the box cache
  2. If so, return it from the cache
  3. Otherwise, do the normal boxing operation

When I first did these changes, I omitted the use of the box cache. It turns out, however, to have quite an impact in some programs: one benchmark I was looking at suffered quite a bit from the missing box cache, since it now had to do a lot more garbage collection runs.

So, I reinstated use of the cache, but this time with the JIT doing the range checks in the produced machine code and reading directly out of the cache in the case of a hit. Thus, in the cache hit case, we now don’t even make a single function call for the box operation.

Faster Int operations

One might wonder why we picked 32-bit integers as the limit for the small case of a big integer, and not 64-bit integers. There’s two reasons. The most immediate is that we can then use the 32 bits that would be the lower 32 of a 64-bit pointer to the big integer structure as our “this is a small integer” flag. This works reliably as pointers are always aligned to at least a 4-byte boundary, so a real pointer to a big integer structure would never have the lowest bits set. (And yes, on big-endian platforms, we swap the order of the flag and the value to account for that!)

The second reason is that there’s no portable way in C to detect if a calculation overflowed. However, out of the basic math operations, if we have two inputs that fit into a 32-bit integer, and we do them at 64-bit width, we know that the result can never overflow the 64-bit integer. Thus we can then range check the result and decide whether to store it back into the result object as 32-bit, or to store it as a big integer.

Since Int is immutable, all operations result in a newly allocated object. This allocation – you’ll spot a pattern by now – is open to being specialized. Once again, finding the boxed value to operate on can also be specialized, by calculating its offset into the input objects and result object. So far, so familiar.

However, there’s a further opportunity for improvement if we are JIT-compiling the operations to machine code: the CPU has flags for if the last operation overflowed, and we can get at them. Thus, for two small Int inputs, we can simply:

  1. Grab the values
  2. Do the calculation at 32-bit width
  3. Check the flags, and store it into the result object if no overflow
  4. If it overflowed, fall back to doing it wider and storing it as a real big integer

I’ve done this for addition, subtraction, and multiplication. Those looking for a MoarVM specializer/JIT task might like to consider doing it for some of the other operations. :-)

In summary

Boxing, unboxing, and math on Int all came with various indirections for the sake of generality (coping with mixins, subclassing, and things like IntStr). However, when we are producing type-specialized code, we can get rid of most of the indirections, resulting in being able to perform them faster. Further, when we JIT-compile the optimized result into machine code, we can take some further opportunities, reducing function calls into C code as well as taking advantage of access to the overflow flags.

Posted in Uncategorized | 6 Comments

Redesigning Rakudo’s Scalar

What’s the most common type your Perl 6 code uses? I’ll bet you that in most programs you write, it’ll be Scalar. That might come as a surprise, because you pretty much never write Scalar in your code. But in:

my $a = 41;
my $b = $a + 1;

Then both $a and $b point to Scalar containers. These in turn hold the Int objects. Contrast it with:

my $a := 42;
my $b := $a + 1;

Where there are no Scalar containers. Assignment in Perl 6 is an operation on a container. Exactly what it does depending on the type of the container. With an Array, for example, it iterates the data source being assigned, and stores each value into the target Array. Assignment is therefore a copying operation, unlike binding which is a referencing operation. Making assignment the shorter thing to type makes it more attractive, and having the more attractive thing decrease the risk of action at a distance is generally a good thing.

Having Scalar be first-class is used in a number of features:

  • Lazy vivification, so if %a{$x} { ... } will not initialize the hash slot in question, but %a{$x} = 42 will do so (this also works many levels deep)
  • The is rw trait on parameters being able to work together with late-bound dispatch
  • Making l-value routines possible, including every is rw accessor
  • List assignment
  • Using meta-ops on assignment, for example Z=

And probably some more that I forgot. It’s powerful. It’s also torture for those of us building Perl 6 implementations and trying to make them run fast. The frustration isn’t so much the immediate cost of the allocating all of those Scalar objects – that of course costs something, but modern GC algorithms can throw away short-lived objects pretty quickly – but also because of the difficulties it introduces for program analysis.

Despite all the nice SSA-based analysis we do, tracking the contents of Scalar containers is currently beyond that. Rather than any kind of reasoning to prove properties about what a Scalar holds, we instead handle it through statistics, guards, and deoptimization at the point that we fetch a value from a Scalar. This still lets us do quite a lot, but it’s certainly not ideal. Guards are cheap, but not free.

Looking ahead

Over the course of my current grant from The Perl Foundation, I’ve been working out a roadmap for doing better with optimization in the presence of Scalar containers. Their presence is one of the major differences between full Perl 6 and the restricted NQP (Not Quite Perl), and plays a notable part in the performance difference between the two.

I’ve taken the first big step towards improving this situation by significantly re-working the way Scalar containers are handled. I’ll talk about that in this post, but first I’d like to provide an idea of the overall direction.

In the early days of MoarVM, when we didn’t have specialization or compilation to machine code, it made sense to do various bits of special-casing of Scalar. As part of that, we wrote code handling common container operations in C. We’ve by now reached a point where the C code that used to be a nice win is preventing us from performing the analyses we need in order to do better optimizations. At the end of the day, a Scalar container is just a normal object with an attribute $!value that holds its value. Making all operations dealing with Scalar container really be nothing more than some attribute lookups and binds would allow us to solve the problem in terms of more general analyses, which stand to benefit many other cases where programs use short-lived objects.

The significant new piece of analysis we’ll want to do is escape analysis, which tells us which objects have a lifetime bounded to the current routine. We understand “current routine” to incorporate those that we have inlined.

If we know that an object’s usage lies entirely within the current routine, we can then perform an optimization known as scalar replacement, which funnily enough has nothing much to do with Scalar in the Perl 6 sense, even if it solves the problems we’re aiming to solve with Scalar! The idea is that we allocate a local variable inside of the current frame for each attribute of the object. This means that we can then analyze them like we analyze other local variables, subject them to SSA, and so forth. This for one gets rid of the allocation of the object, but also lets us replace attribute lookups and binds with a level of indirection less. It will also let us reason about the contents of the once-attributes, so that we can eliminate guards that we previously inserted because we only had statistics, not proofs.

So, that’s the direction of travel, but first, Scalar and various operations around it needed to change.

Data structure redesign

Prior to my recent work, a Scalar looked something like:

class Scalar {
    has $!value;        # The value in the Scalar
    has $!descriptor;   # rw-ness, type constraint, name
    has $!whence;       # Auto-vivification closure
}

The $!descriptor held the static information about the Scalar container, so we didn’t have to hold it in every Scalar (we usually have many instances of the same “variable” over a programs lifetime).

The $!whence was used when we wanted to do some kind of auto-vivification. The closure attached to it was invoked when the Scalar was assigned to, and then cleared afterwards. In an array, for example, the callback would bind the Scalar into the array storage, so that element – if assigned to – would start to exist in the array. There are various other forms of auto-vivification, but they all work in roughly the same way.

This works, but closures aren’t so easy for the optimizer to deal with (in short, a closure has to have an outer frame to point to, and so we can’t inline a frame that takes a closure). Probably some day we’ll find a clever solution to that, but since auto-vivification is an internal mechanism, we may as well make it one that we can see a path to making efficient in the near term future.

So, I set about considering alternatives. I realized that I wanted to replace the $!whence closure with some kind of object. Different types of object would do different kinds of vivification. This would work very well with the new spesh plugin mechanism, where we can build up a set of guards on objects. It also will work very well when we get escape analysis in place, since we can then potentially remove those guards after performing scalar replacement. Thus after inlining, we might be able to remove the “what kind of vivification does this assignment cause” checking too.

So this seemed workable, but then I also realized that it would be possible to make Scalar smaller by:

  • Placing the new auto-vivification objects in the $!descriptor slot instead
  • Having the vivification objects point to the original descriptor carrying the name, type, etc.
  • Upon first assignment, running the vivification logic and then replacing the Scalar‘s $!descriptor with the simple one carrying the name and value, thus achieving the run-once semantics

This not only makes Scalar smaller, but it means that we can use a single guard check to indicate the course of action we should take with the container: a normal assignment, or a vivification.

The net result: vivification closures go away giving more possibility to inline, assignment gets easier to specialize, and we get a memory saving on every Scalar container. Nice!

C you later

For this to be really worth it from an optimization perspective, I needed to eliminate various bits of C special-case code around Scalar and replace it with standard MoarVM ops. This implicated:

  • Assignment
  • Atomic compare and swap
  • Atomic store
  • Handling of return values, including decontainerization
  • Creation of new Scalar containers from a given descriptor

The first 3 became calls to code registered to perform the operations, using the 6model container API. The second two cases were handled by replacing the calls to C extops with desugars, which is a mechanism that takes something that is used as an nqp::op and rewrites it, as it is compiled, into a more interesting AST, which is then in turn compiled. Happily, this meant I could make all of the changes I needed to without having to go and do a refactor across the CORE.setting. That was nice.

So, now those operations were compiled into bytecode operations instead of ops that were really just calls to C code. Everything was far more explicit. Good! Alas, the downside is that the code we generate gets larger in size.

Optimization with spesh plugins

talked about specializer plugins in a recent post, where I used them to greatly speed up various forms of method dispatch. However, they are also applicable to optimizing operations on Scalar containers.

The change to decontainerizing return values was especially bad at making the code larger, since it had to do quite a few checks. However, with a spesh plugin, we could just emit a use of the plugin, followed by calling whatever the plugin produces.

Here’s a slightly simplified version of the the plugin I wrote, annotated with some comments about what it is doing. The key thing to remember about a spesh plugin is that it is not doing an operation, but rather it’s setting up a set of conditions under which a particular implementation of the operation applies, and then returning that implementation.

nqp::speshreg('perl6', 'decontrv', sub ($rv) {
    # Guard against the type being returned; if it's a Scalar then that
    # is what we guard against here (nqp::what would normally look at
    # the type inside such a container; nqp::what_nd does not do that).
    nqp::speshguardtype($rv, nqp::what_nd($rv));

    # Check if it's an instance of a container.
    if nqp::isconcrete_nd($rv) && nqp::iscont($rv) {
        # Guard that it's concrete, so this plugin result only applies
        # for container instances, not the Scalar type object.
        nqp::speshguardconcrete($rv);

        # If it's a Scalar container then we can optimize further.
        if nqp::eqaddr(nqp::what_nd($rv), Scalar) {
            # Grab the descriptor.
            my $desc := nqp::speshguardgetattr($rv, Scalar, '$!descriptor');
            if nqp::isconcrete($desc) {
                # Has a descriptor, so `rw`. Guard on type of value. If it's
                # Iterable, re-containerize. If not, just decont.
                nqp::speshguardconcrete($desc);
                my $value := nqp::speshguardgetattr($rv, Scalar, '$!value');
                nqp::speshguardtype($value, nqp::what_nd($value));
                return nqp::istype($value, $Iterable) ?? &recont !! &decont;
            }
            else {
                # No descriptor, so it's already readonly. Return as is.
                nqp::speshguardtypeobj($desc);
                return &identity;
            }
        }

        # Otherwise, full slow-path decont.
        return &decontrv;
    }
    else {
        # No decontainerization to do, so just produce identity.
        return &identity;
    }
});

Where &identity is the identity function, &decont removes the value from its container, &recont wraps the value in a new container (so an Iterable in a Scalar stays as a single item), and &decontrv is the slow-path for cases that we do not know how to optimize.

The same principle is also used for assignment, however there are more cases to analyze there. They include:

  • When the type constraint is Mu, and there is a normal (non-vivify) descriptor, then we do a specialization based on the value being the Nil object (in which case we produce the operation that set $!value back to the default value from the descriptor) or non-Nil (just assign a value, with no need to type check)
  • When the type constraint is something else, and there is a normal (non-vivify) descriptor, then we do a specialization based on the type of the descriptor being assigned. Since the optimizer will often know this already, then we can optimize out the type check
  • When it is an array auto-viv, we produce the exact sequence of binds needed to effect the operation, again taking into account a Mu type constraint and a type constraint that needs to be checked

Vivifying hash assignments are not yet optimized by the spesh plugin, but will be in the near future.

The code selected by the plugin is then executed to perform the operation. In most cases, there will only be a single specialization selected. In that case, the optimizer will inline that specialization result, meaning that the code after optimization is just doing the required set of steps needed to do the work.

Next steps

Most immediately, a change to such a foundational part of the the Rakudo Perl 6 implementation has had some fallout. I’m most of the way through dealing with the feedback from toaster (which runs all the ecosystem module tests), being left with a single issue directly related to this work to get to the bottom of. Beyond that, I need to spend some time re-tuning array and hash access to better work with these changes.

Then will come the step that this change was largely in aid of: implementing escape analysis and scalar replacement, which for much Perl 6 code will hopefully give a quite notable performance improvement.

This brings me to the end of my current 200 hours on my Perl 6 Performance and Reliability Grant. Soon I will submit a report to The Perl Foundation, along with an application to continue this work. So, all being well, there will be more to share soon. In the meantime, I’m off to enjoy a week’s much needed vacation.

Posted in Uncategorized | 1 Comment

Dynamic lookups and context introspection with inlining

Inlining is one of the most important optimizations that MoarVM performs. Inlining lets us replace a call to some BlockSub, or Method with the code that is inside of it. The most immediate benefit is to eliminate the overhead of calling, but that’s just the start. Inlined code has often already been specialized for a certain set of argument types. If we already have proven those argument types in the caller, then there’s no need to re-check them. Inlining can also expose pairs of operations that can cancel, such as box/unbox, and bring the point a control exception is thrown into the some body of code where it is caught, which may allow the exception throw to be rewritten to a far cheaper goto.

In a language like Perl 6, where every operator is a call to a multiple dispatch subroutine, inlining can be a significant win. In the best cases, inlining can lead to smaller code, because the thing that is inlined ends up being smaller than the bytecode for the call sequence. Of course, often it leads to bigger code, and so there’s limits to how much of it we really want to do. But still, we’ve been gradually pushing on with increasing the range of things that we’re able to inline.

The problem with inlining is that the very call boundaries it does away with may carry semantic significance for the program. In this post, I’ll talk about a couple of operations that became problematic as we ramped up our inlining capabilities, and discuss a new abstraction I recently added to MoarVM – the frame walker – which provides a common foundation for solving the problem.

A little inlining history

Inlining first showed up in MoarVM back in 2014, not too many months after the type-specializing optimizer was added. MoarVM has done speculative optimizations from the start, performing deoptimization (falling back to the interpreter) in the case that an unexpected situation shows up. But what if we had to deoptimize in code that had been inlined? Then we’d have to pretend we never did the inlines! Therefore, MoarVM can uninline too – that is, untangle the results of inlining and produce a call stack as if we’d been running the unoptimized code all along.

MoarVM has also from the start supported nested inlines – that is, inlining things that themselves contained inlines. However, the initial implementation of inlining was restricted in what it could handle. The first implementation could not inline anything with exception handlers, although that was supported within a couple of months. It also could not inline closures. Only subs in the outermost scope or from the CORE.setting, along with simple method calls, were possible to inline, because those were the only cases where we had enough information about what was being called, which is a decided prerequisite for inlining it.

Aside from bug fixes, things stayed the same until 2017. The focus in that time largely switched away from performance and towards the Perl 6.c release. Summer of 2017 brought some very large changes to how dynamic optimization worked in MoarVM, moving optimization to a background thread, along with changing and extending the statistics that were collected. A new kind of call optimization became possible, whereby if we could not prove what we were going to call, but the statistics showed a pattern, then we could insert a guard and speculatively optimize the call. Speculative inlining fell neatly out of that. Suddenly, a bunch more things could be considered for inlining.

Further work lifted some of the inlining restrictions. Deoptimization learned how to cope if we deoptimized in the middle of processing named arguments, so we could optimize code where that situation occurred. It became possible to inline many closures, by rewriting the lexical lookup operations into an indirection through the code object of the code that we had inlined. It also became possible to inline code involving lexical throws of exceptions and their handlers. Since that is how return works in Perl 6, that again made quite a few more things possible to inline. A more fine-grained analysis allowed us to do some amount of cross-language inlining, meaning bits of the Rakudo internals written in NQP could be inlined into the Perl 6 code calling them, including closure cloning. I’ll add at this point that while it’s easy to write a list of these improvements now, realizing various of them was quite challenging.

Now it’s summer 2018, and my work has delivered some more advances. Previously, we would only do an inlining if we already had produced a specialized version of the callee. This usually worked out, and we sorted by maximum call stack depth and specialized deepest first to help with that. However, sometimes that was not enough, and we missed inlining opportunities. So, during the last month, I added support for producing code to inline on-demand. I also observed that we were only properly doing speculative (that is, based on statistics) inlines of calls made that were expected to return an object, but not those in void context. (If that sounds like an odd oversight, it’s because void calls were previously rare. It was only during the last month, when I improved code-gen to spot a lot more opportunities to emit void context calls, that we got a lot more of them and I spotted the problem.)

More is better, no?

Being able to inline a wider range of calls is a good thing. However, it also made it far more likely that we would run into constructs that don’t cope well with inlining. We’ve got a long way by marking ops that we know won’t cope well with it as :noinline (and then gradually liberalizing that over time where it was beneficial). The improvements over the previous month created a more difficult problem, however. We have a number of ops that allow for introspection and walking of the call stack. These are used to implement Perl 6 features such as the CALLER:: pseudo-package. However, they are also the way that $/ can be set by things like match.

Marking the ctx op as :noinline got us a long way. However, we ran into trouble because once a context handle has been obtained, one could then start traversing from it to callers or outers, and then starting some kind of lookup lookup relative to that point. But what if the caller was an inline? Then we don’t have a callframe to reference in the context object that we return.

A further problem was that a non-introspection form of dynamic lookup, which traverses the lexical chain hanging off each step of the dynamic chain, also was not aware of inlines. In theory, this would have become a problem when we started doing inlining of closures last year. However, since it is used in a tiny number of places, and those places didn’t permit inlining, we didn’t notice until this month, when inlining started to cover more cases.

Normal dynamic lookup, used for $*foo style variables, has been inline-aware for about as long as we’ve had inlining. However, this greatly complicated the lookup code. Replicating such inline compensation code in a bunch of places was clearly a bad idea. It’s a tricky problem, since we’re effectively trying to model a callstack that doesn’t really exist by using information telling us what it would look like if it did. It’s the same problem that deoptimization has to solve, except this time we’re just imagining what the call stack would look like unoptimized, not actually trying to recreate it. It’s certainly not a problem we want solved repeatedly around MoarVM’s implementation.

A new frame walker

To help tackle all of these problems, I introduced a new abstraction: the specialization-aware frame walker. It provides an iterator over the call stack as if no inlining had taken place, figuring out as much as it needs to in order to recreate the information that a particular operation wants.

First, I used it to make caller-dynamic lookup inline-aware. That went pretty well, and immediately fixed one of the module regressions that was “caused” by the recent inlining improvements.

Next, I used it to refactor the normal dynamic lookup. That needed careful work teasing out the details of dynamic lookup and caching of dynamic variable lookups from the inlining traversal. However, the end result was far simpler code, with much less duplication, since the JITted inline, interpreted inline, and non-inline paths largely collapsed and were handled by the frame walker.

Next up, contexts. Alas, this would be trickier.

Embracing laziness

Previously, context traversal had worked eagerly. When we asked for a context object representing the caller or outer of a particular context we already had, we immediately walked one frame in the appropriate direction and produced a result. Of course, this did not go well if there were inlines, since it always walked one real frame, but that may have multiple inlined frames within it.

One possibility was to make the ctx op immediately deoptimize the whole call stack, so that we then had a chain of real call frames to traverse. However, when I looked at some of the places using ctx, it became clear this would have some very negative performance consequences: every regex match causing a global deopt was not going to work!

Another option, that would largely preserve the existing design, was to store extra information about the inline we were inside of at the time we walked to a caller. This, however, had the weakness that we might do a deoptimization between the two points, thus invalidating the information. That was probably also possible to fix up, but the complexity of doing so put me off that approach.

Instead, I switched to a model where “move to caller” and “move to outer” would be stored as displacements to apply when the context object was used in order to obtain information. The frame walker could make these movements, before doing whatever lookup was required. Thus, even if a deoptimization were to take place between obtaining the context and using it, we could still do a correct traversal.

Too much laziness

This helped, but wasn’t quite enough either. If a context handle was taken and used immediately, things worked out fine. However, if the situation was like this:

+-------------------------+
| Frame we used ctx op on | (Frame 1)
+-------------------------+
             |
           Caller
             |
             v
+-------------------------+
|    Frame with inlines   | (Frame 2)
+-------------------------+

And then we used the handle some time later, things worked out less well. The problem was that I used the current return address of Frame 2 in order to understand which inline(s) we were inside of. However, if we’d executed more code in Frame 2, then it could have made another call. The current return address could thus point to the wrong inline. Oops.

However, since the ctx operation at the start of the lookup is never inlined, and a given call frame can only ever be called from one location in the caller, there’s a solution. If the ctx op is used to get a first-class reference to a frame on the call stack, we walk down the call stack and make sure that each frame called from inlined code preserves enough location information that we can later reconstruct what we need. It only needs to walk down the call stack until it sees a point where another ctx operation already preserved that information, so in programs with lots of use of the ctx op, we can avoid doing full stack walks each time, and just walk over recently created frames.

In closing

With those changes, the various Perl 6 modules exhibiting lookup problems since this month’s introduction of more aggressive inlining were fixed. Along the way, a common mechanism was introduced allowing us to walk a call stack as if no inlines had taken place. There’s at least one more place that we can use this: in order to make stack trace output not be sensitive to inlining. I’ll get to that in the coming weeks, or it might make a nice task for somebody looking to get themselves (more) involved with MoarVM development. Last but not least, I’d like to once again thank The Perl Foundation for organizing the funding that made this work possible.

Posted in Uncategorized | 1 Comment