Redesigning Rakudo’s Scalar

What’s the most common type your Perl 6 code uses? I’ll bet you that in most programs you write, it’ll be Scalar. That might come as a surprise, because you pretty much never write Scalar in your code. But in:

my $a = 41;
my $b = $a + 1;

Then both $a and $b point to Scalar containers. These in turn hold the Int objects. Contrast it with:

my $a := 42;
my $b := $a + 1;

Where there are no Scalar containers. Assignment in Perl 6 is an operation on a container. Exactly what it does depends on the type of the container. With an Array, for example, it iterates the data source being assigned, and stores each value into the target Array. Assignment is therefore a copying operation, unlike binding, which is a referencing operation. Making assignment the shorter thing to type makes it more attractive, and having the more attractive thing decrease the risk of action at a distance is generally a good thing.
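
To see the copy-versus-reference difference in action, here’s a tiny example in plain Perl 6 (outputs shown as comments):

my @values = 1, 2, 3;
my @copy   = @values;     # assignment: iterate the source and copy each value
my @alias  := @values;    # binding: @alias is simply another name for @values
@values[0] = 42;
say @copy[0];             # 1  – the copy is unaffected
say @alias[0];            # 42 – the alias sees the change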

Having Scalar be first-class is used in a number of features:

  • Lazy vivification, so if %a{$x} { ... } will not initialize the hash slot in question, but %a{$x} = 42 will do so (this also works many levels deep; there’s a small example just after this list)
  • The is rw trait on parameters being able to work together with late-bound dispatch
  • Making l-value routines possible, including every is rw accessor
  • List assignment
  • Using meta-ops on assignment, for example Z=
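
As promised above, here’s lazy vivification at work (plain Perl 6; outputs shown as comments):

my %h;
if %h<a><b> { }           # reads two levels deep, vivifies nothing
say %h.elems;             # 0
%h<a><b> = 42;            # assignment vivifies %h<a> as a Hash along the way
say %h.elems;             # 1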

And probably some more that I forgot. It’s powerful. It’s also torture for those of us building Perl 6 implementations and trying to make them run fast. The frustration isn’t so much the immediate cost of allocating all of those Scalar objects – that of course costs something, but modern GC algorithms can throw away short-lived objects pretty quickly – as the difficulties it introduces for program analysis.

Despite all the nice SSA-based analysis we do, tracking the contents of Scalar containers is currently beyond that. Rather than any kind of reasoning to prove properties about what a Scalar holds, we instead handle it through statistics, guards, and deoptimization at the point that we fetch a value from a Scalar. This still lets us do quite a lot, but it’s certainly not ideal. Guards are cheap, but not free.

Looking ahead

Over the course of my current grant from The Perl Foundation, I’ve been working out a roadmap for doing better with optimization in the presence of Scalar containers. Their presence is one of the major differences between full Perl 6 and the restricted NQP (Not Quite Perl), and plays a notable part in the performance difference between the two.

I’ve taken the first big step towards improving this situation by significantly re-working the way Scalar containers are handled. I’ll talk about that in this post, but first I’d like to provide an idea of the overall direction.

In the early days of MoarVM, when we didn’t have specialization or compilation to machine code, it made sense to do various bits of special-casing of Scalar. As part of that, we wrote code handling common container operations in C. We’ve by now reached a point where the C code that used to be a nice win is preventing us from performing the analyses we need in order to do better optimizations. At the end of the day, a Scalar container is just a normal object with an attribute $!value that holds its value. Making all operations dealing with Scalar containers really be nothing more than some attribute lookups and binds would allow us to solve the problem in terms of more general analyses, which stand to benefit many other cases where programs use short-lived objects.

The significant new piece of analysis we’ll want to do is escape analysis, which tells us which objects have a lifetime bounded to the current routine. We understand “current routine” to incorporate those that we have inlined.

If we know that an object’s usage lies entirely within the current routine, we can then perform an optimization known as scalar replacement, which funnily enough has nothing much to do with Scalar in the Perl 6 sense, even if it solves the problems we’re aiming to solve with Scalar! The idea is that we allocate a local variable inside of the current frame for each attribute of the object. This means that we can then analyze them like we analyze other local variables, subject them to SSA, and so forth. For one, this gets rid of the allocation of the object, but it also lets us replace attribute lookups and binds with one level of indirection less. It will also let us reason about the contents of what were once attributes, so that we can eliminate guards that we previously inserted because we only had statistics, not proofs.
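
Here’s a conceptual sketch of the effect, written as NQP-flavoured source rather than the low-level IR the optimizer actually works on; do-something and $x are just placeholders:

# Before: a short-lived Scalar object is allocated, and its attribute is
# written and read through the object.
my $s := nqp::create(Scalar);
nqp::bindattr($s, Scalar, '$!value', $x);
do-something(nqp::getattr($s, Scalar, '$!value'));

# After scalar replacement: the attribute becomes a local of the current
# frame, so the allocation disappears and the accesses become plain SSA
# variables we can reason about.
my $s-value := $x;
do-something($s-value);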

So, that’s the direction of travel, but first, Scalar and various operations around it needed to change.

Data structure redesign

Prior to my recent work, a Scalar looked something like:

class Scalar {
    has $!value;        # The value in the Scalar
    has $!descriptor;   # rw-ness, type constraint, name
    has $!whence;       # Auto-vivification closure
}

The $!descriptor held the static information about the Scalar container, so we didn’t have to hold it in every Scalar (we usually have many instances of the same “variable” over a program’s lifetime).

The $!whence was used when we wanted to do some kind of auto-vivification. The closure attached to it was invoked when the Scalar was assigned to, and then cleared afterwards. In an array, for example, the callback would bind the Scalar into the array storage, so that element – if assigned to – would start to exist in the array. There are various other forms of auto-vivification, but they all work in roughly the same way.
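
Sketched out, the array case looked conceptually like this ($descriptor, @storage and $index are placeholders; this is not the actual Rakudo source):

my $scalar := nqp::create(Scalar);
nqp::bindattr($scalar, Scalar, '$!descriptor', $descriptor);
nqp::bindattr($scalar, Scalar, '$!whence', {
    # On first assignment: enter the element into the array storage; the
    # $!whence slot is then cleared, so this only ever runs once.
    nqp::bindpos(@storage, $index, $scalar);
});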

This works, but closures aren’t so easy for the optimizer to deal with (in short, a closure has to have an outer frame to point to, and so we can’t inline a frame that takes a closure). Probably some day we’ll find a clever solution to that, but since auto-vivification is an internal mechanism, we may as well make it one that we can see a path to making efficient in the near term future.

So, I set about considering alternatives. I realized that I wanted to replace the $!whence closure with some kind of object. Different types of object would do different kinds of vivification. This would work very well with the new spesh plugin mechanism, where we can build up a set of guards on objects. It also will work very well when we get escape analysis in place, since we can then potentially remove those guards after performing scalar replacement. Thus after inlining, we might be able to remove the “what kind of vivification does this assignment cause” checking too.

So this seemed workable, but then I also realized that it would be possible to make Scalar smaller by:

  • Placing the new auto-vivification objects in the $!descriptor slot instead
  • Having the vivification objects point to the original descriptor carrying the name, type, etc.
  • Upon first assignment, running the vivification logic and then replacing the Scalar‘s $!descriptor with the simple one carrying the name and type, thus achieving the run-once semantics

This not only makes Scalar smaller, but it means that we can use a single guard check to indicate the course of action we should take with the container: a normal assignment, or a vivification.
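
A conceptual sketch of such a vivification object, for the array case (class, attribute, and method names are invented for illustration, not the real Rakudo ones):

class VivifyArrayPos {
    has $!next-descriptor;   # the plain descriptor carrying name, type, etc.
    has $!target;            # the array storage to bind into
    has int $!index;
    method assigned($scalar) {
        # Run the vivification: enter the Scalar into the array storage...
        nqp::bindpos($!target, $!index, $scalar);
        # ...then swap in the plain descriptor, giving run-once semantics and
        # leaving a Scalar that looks just like any other.
        nqp::bindattr($scalar, Scalar, '$!descriptor', $!next-descriptor);
    }
}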

The net result: vivification closures go away giving more possibility to inline, assignment gets easier to specialize, and we get a memory saving on every Scalar container. Nice!

C you later

For this to be really worth it from an optimization perspective, I needed to eliminate various bits of C special-case code around Scalar and replace it with standard MoarVM ops. This involved:

  • Assignment
  • Atomic compare and swap
  • Atomic store
  • Handling of return values, including decontainerization
  • Creation of new Scalar containers from a given descriptor

The first three became calls to code registered to perform the operations, using the 6model container API. The remaining two cases were handled by replacing the calls to C extops with desugars, which is a mechanism that takes something that is used as an nqp::op and rewrites it, as it is compiled, into a more interesting AST, which is then in turn compiled. Happily, this meant I could make all of the changes I needed to without having to go and do a refactor across the CORE.setting. That was nice.

So, now those operations were compiled into bytecode operations instead of ops that were really just calls to C code. Everything was far more explicit. Good! Alas, the downside is that the code we generate gets larger.

Optimization with spesh plugins

I talked about specializer plugins in a recent post, where I used them to greatly speed up various forms of method dispatch. However, they are also applicable to optimizing operations on Scalar containers.

The change to decontainerizing return values was especially bad at making the code larger, since it had to do quite a few checks. However, with a spesh plugin, we could just emit a use of the plugin, followed by calling whatever the plugin produces.

Here’s a slightly simplified version of the plugin I wrote, annotated with some comments about what it is doing. The key thing to remember about a spesh plugin is that it is not doing an operation, but rather it’s setting up a set of conditions under which a particular implementation of the operation applies, and then returning that implementation.

nqp::speshreg('perl6', 'decontrv', sub ($rv) {
    # Guard against the type being returned; if it's a Scalar then that
    # is what we guard against here (nqp::what would normally look at
    # the type inside such a container; nqp::what_nd does not do that).
    nqp::speshguardtype($rv, nqp::what_nd($rv));

    # Check if it's an instance of a container.
    if nqp::isconcrete_nd($rv) && nqp::iscont($rv) {
        # Guard that it's concrete, so this plugin result only applies
        # for container instances, not the Scalar type object.
        nqp::speshguardconcrete($rv);

        # If it's a Scalar container then we can optimize further.
        if nqp::eqaddr(nqp::what_nd($rv), Scalar) {
            # Grab the descriptor.
            my $desc := nqp::speshguardgetattr($rv, Scalar, '$!descriptor');
            if nqp::isconcrete($desc) {
                # Has a descriptor, so `rw`. Guard on type of value. If it's
                # Iterable, re-containerize. If not, just decont.
                nqp::speshguardconcrete($desc);
                my $value := nqp::speshguardgetattr($rv, Scalar, '$!value');
                nqp::speshguardtype($value, nqp::what_nd($value));
                return nqp::istype($value, $Iterable) ?? &recont !! &decont;
            }
            else {
                # No descriptor, so it's already readonly. Return as is.
                nqp::speshguardtypeobj($desc);
                return &identity;
            }
        }

        # Otherwise, full slow-path decont.
        return &decontrv;
    }
    else {
        # No decontainerization to do, so just produce identity.
        return &identity;
    }
});

Where &identity is the identity function, &decont removes the value from its container, &recont wraps the value in a new container (so an Iterable in a Scalar stays as a single item), and &decontrv is the slow-path for cases that we do not know how to optimize.
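
Roughly speaking – and glossing over details like descriptors – those helpers amount to something like the following (&decontrv is the general slow path, so it has no short form to show):

my &identity := -> \rv { rv };
my &decont   := -> \rv { nqp::getattr(rv, Scalar, '$!value') };
my &recont   := -> \rv {
    my $new := nqp::create(Scalar);
    nqp::bindattr($new, Scalar, '$!value', nqp::getattr(rv, Scalar, '$!value'));
    $new
};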

The same principle is also used for assignment, however there are more cases to analyze there. They include:

  • When the type constraint is Mu, and there is a normal (non-vivify) descriptor, then we do a specialization based on the value being the Nil object (in which case we produce the operation that sets $!value back to the default value from the descriptor) or non-Nil (just assign a value, with no need to type check)
  • When the type constraint is something else, and there is a normal (non-vivify) descriptor, then we do a specialization based on the type of the value being assigned. Since the optimizer will often know this already, we can then optimize out the type check
  • When it is an array auto-viv, we produce the exact sequence of binds needed to effect the operation, again taking into account a Mu type constraint and a type constraint that needs to be checked

Vivifying hash assignments are not yet optimized by the spesh plugin, but will be in the near future.

The code selected by the plugin is then executed to perform the operation. In most cases, there will only be a single specialization selected. In that case, the optimizer will inline that specialization result, meaning that the code after optimization is just doing the required set of steps needed to do the work.

Next steps

Most immediately, a change to such a foundational part of the Rakudo Perl 6 implementation has had some fallout. I’m most of the way through dealing with the feedback from toaster (which runs all the ecosystem module tests), being left with a single issue directly related to this work to get to the bottom of. Beyond that, I need to spend some time re-tuning array and hash access to better work with these changes.

Then will come the step that this change was largely in aid of: implementing escape analysis and scalar replacement, which for much Perl 6 code will hopefully give a quite notable performance improvement.

This brings me to the end of my current 200 hours on my Perl 6 Performance and Reliability Grant. Soon I will submit a report to The Perl Foundation, along with an application to continue this work. So, all being well, there will be more to share soon. In the meantime, I’m off to enjoy a week’s much needed vacation.

Dynamic lookups and context introspection with inlining

Inlining is one of the most important optimizations that MoarVM performs. Inlining lets us replace a call to some Block, Sub, or Method with the code that is inside of it. The most immediate benefit is to eliminate the overhead of calling, but that’s just the start. Inlined code has often already been specialized for a certain set of argument types. If we already have proven those argument types in the caller, then there’s no need to re-check them. Inlining can also expose pairs of operations that can cancel, such as box/unbox, and bring the point a control exception is thrown into the same body of code where it is caught, which may allow the exception throw to be rewritten to a far cheaper goto.

In a language like Perl 6, where every operator is a call to a multiple dispatch subroutine, inlining can be a significant win. In the best cases, inlining can lead to smaller code, because the thing that is inlined ends up being smaller than the bytecode for the call sequence. Of course, often it leads to bigger code, and so there’s limits to how much of it we really want to do. But still, we’ve been gradually pushing on with increasing the range of things that we’re able to inline.

The problem with inlining is that the very call boundaries it does away with may carry semantic significance for the program. In this post, I’ll talk about a couple of operations that became problematic as we ramped up our inlining capabilities, and discuss a new abstraction I recently added to MoarVM – the frame walker – which provides a common foundation for solving the problem.

A little inlining history

Inlining first showed up in MoarVM back in 2014, not too many months after the type-specializing optimizer was added. MoarVM has done speculative optimizations from the start, performing deoptimization (falling back to the interpreter) in the case that an unexpected situation shows up. But what if we had to deoptimize in code that had been inlined? Then we’d have to pretend we never did the inlines! Therefore, MoarVM can uninline too – that is, untangle the results of inlining and produce a call stack as if we’d been running the unoptimized code all along.

MoarVM has also from the start supported nested inlines – that is, inlining things that themselves contained inlines. However, the initial implementation of inlining was restricted in what it could handle. The first implementation could not inline anything with exception handlers, although that was supported within a couple of months. It also could not inline closures. Only subs in the outermost scope or from the CORE.setting, along with simple method calls, were possible to inline, because those were the only cases where we had enough information about what was being called, which is a decided prerequisite for inlining it.

Aside from bug fixes, things stayed the same until 2017. The focus in that time largely switched away from performance and towards the Perl 6.c release. Summer of 2017 brought some very large changes to how dynamic optimization worked in MoarVM, moving optimization to a background thread, along with changing and extending the statistics that were collected. A new kind of call optimization became possible, whereby if we could not prove what we were going to call, but the statistics showed a pattern, then we could insert a guard and speculatively optimize the call. Speculative inlining fell neatly out of that. Suddenly, a bunch more things could be considered for inlining.

Further work lifted some of the inlining restrictions. Deoptimization learned how to cope if we deoptimized in the middle of processing named arguments, so we could optimize code where that situation occurred. It became possible to inline many closures, by rewriting the lexical lookup operations into an indirection through the code object of the code that we had inlined. It also became possible to inline code involving lexical throws of exceptions and their handlers. Since that is how return works in Perl 6, that again made quite a few more things possible to inline. A more fine-grained analysis allowed us to do some amount of cross-language inlining, meaning bits of the Rakudo internals written in NQP could be inlined into the Perl 6 code calling them, including closure cloning. I’ll add at this point that while it’s easy to write a list of these improvements now, realizing various of them was quite challenging.

Now it’s summer 2018, and my work has delivered some more advances. Previously, we would only do an inlining if we already had produced a specialized version of the callee. This usually worked out, and we sorted by maximum call stack depth and specialized deepest first to help with that. However, sometimes that was not enough, and we missed inlining opportunities. So, during the last month, I added support for producing code to inline on-demand. I also observed that we were only properly doing speculative (that is, based on statistics) inlines of calls made that were expected to return an object, but not those in void context. (If that sounds like an odd oversight, it’s because void calls were previously rare. It was only during the last month, when I improved code-gen to spot a lot more opportunities to emit void context calls, that we got a lot more of them and I spotted the problem.)

More is better, no?

Being able to inline a wider range of calls is a good thing. However, it also made it far more likely that we would run into constructs that don’t cope well with inlining. We’ve gotten a long way by marking ops that we know won’t cope well with it as :noinline (and then gradually liberalizing that over time where it was beneficial). The improvements over the previous month created a more difficult problem, however. We have a number of ops that allow for introspection and walking of the call stack. These are used to implement Perl 6 features such as the CALLER:: pseudo-package. However, they are also the way that $/ can be set by things like match.

Marking the ctx op as :noinline got us a long way. However, we ran into trouble because once a context handle has been obtained, one could then start traversing from it to callers or outers, and then start some kind of lookup relative to that point. But what if the caller was an inline? Then we don’t have a callframe to reference in the context object that we return.
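
For a flavour of the kind of introspection involved, here’s roughly what such a traversal looks like in terms of nqp ops (an illustration of the ops being discussed, not code lifted from the internals):

my $ctx    := nqp::ctx();                      # handle for the current frame
my $caller := nqp::ctxcaller($ctx);            # walk one step down the dynamic chain
my $match  := nqp::getlexrel($caller, '$/');   # a lookup relative to that frame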

A further problem was that a non-introspection form of dynamic lookup, which traverses the lexical chain hanging off each step of the dynamic chain, also was not aware of inlines. In theory, this would have become a problem when we started doing inlining of closures last year. However, since it is used in a tiny number of places, and those places didn’t permit inlining, we didn’t notice until this month, when inlining started to cover more cases.

Normal dynamic lookup, used for $*foo style variables, has been inline-aware for about as long as we’ve had inlining. However, this greatly complicated the lookup code. Replicating such inline compensation code in a bunch of places was clearly a bad idea. It’s a tricky problem, since we’re effectively trying to model a callstack that doesn’t really exist by using information telling us what it would look like if it did. It’s the same problem that deoptimization has to solve, except this time we’re just imagining what the call stack would look like unoptimized, not actually trying to recreate it. It’s certainly not a problem we want solved repeatedly around MoarVM’s implementation.

A new frame walker

To help tackle all of these problems, I introduced a new abstraction: the specialization-aware frame walker. It provides an iterator over the call stack as if no inlining had taken place, figuring out as much as it needs to in order to recreate the information that a particular operation wants.

First, I used it to make caller-dynamic lookup inline-aware. That went pretty well, and immediately fixed one of the module regressions that was “caused” by the recent inlining improvements.

Next, I used it to refactor the normal dynamic lookup. That needed careful work teasing out the details of dynamic lookup and caching of dynamic variable lookups from the inlining traversal. However, the end result was far simpler code, with much less duplication, since the JITted inline, interpreted inline, and non-inline paths largely collapsed and were handled by the frame walker.

Next up, contexts. Alas, this would be trickier.

Embracing laziness

Previously, context traversal had worked eagerly. When we asked for a context object representing the caller or outer of a particular context we already had, we immediately walked one frame in the appropriate direction and produced a result. Of course, this did not go well if there were inlines, since it always walked one real frame, but that may have multiple inlined frames within it.

One possibility was to make the ctx op immediately deoptimize the whole call stack, so that we then had a chain of real call frames to traverse. However, when I looked at some of the places using ctx, it became clear this would have some very negative performance consequences: every regex match causing a global deopt was not going to work!

Another option, that would largely preserve the existing design, was to store extra information about the inline we were inside of at the time we walked to a caller. This, however, had the weakness that we might do a deoptimization between the two points, thus invalidating the information. That was probably also possible to fix up, but the complexity of doing so put me off that approach.

Instead, I switched to a model where “move to caller” and “move to outer” would be stored as displacements to apply when the context object was used in order to obtain information. The frame walker could make these movements, before doing whatever lookup was required. Thus, even if a deoptimization were to take place between obtaining the context and using it, we could still do a correct traversal.
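
Conceptually, a context handle under the new model looks something like this (all names invented; frame-walk and get-lexical stand in for the real frame walker machinery):

class ContextHandle {
    has $.root-frame;    # the real frame the ctx op was originally used on
    has @.moves;         # pending displacements, e.g. 'caller', 'caller', 'outer'
    method caller() { self.new(root-frame => $!root-frame, moves => (|@!moves, 'caller')) }
    method outer()  { self.new(root-frame => $!root-frame, moves => (|@!moves, 'outer')) }
    method lookup($name) {
        # Only when the handle is used do we walk, inline-aware, from the
        # root frame through the recorded moves and do the lookup there.
        frame-walk($!root-frame, @!moves).get-lexical($name)
    }
}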

Too much laziness

This helped, but wasn’t quite enough either. If a context handle was taken and used immediately, things worked out fine. However, if the situation was like this:

+-------------------------+
| Frame we used ctx op on | (Frame 1)
+-------------------------+
             |
           Caller
             |
             v
+-------------------------+
|    Frame with inlines   | (Frame 2)
+-------------------------+

If we then used the handle some time later, things worked out less well. The problem was that I used the current return address of Frame 2 in order to understand which inline(s) we were inside of. However, if we’d executed more code in Frame 2, then it could have made another call. The current return address could thus point to the wrong inline. Oops.

However, since the ctx operation at the start of the lookup is never inlined, and a given call frame can only ever be called from one location in the caller, there’s a solution. If the ctx op is used to get a first-class reference to a frame on the call stack, we walk down the call stack and make sure that each frame called from inlined code preserves enough location information that we can later reconstruct what we need. It only needs to walk down the call stack until it sees a point where another ctx operation already preserved that information, so in programs with lots of use of the ctx op, we can avoid doing full stack walks each time, and just walk over recently created frames.

In closing

With those changes, the various Perl 6 modules exhibiting lookup problems since this month’s introduction of more aggressive inlining were fixed. Along the way, a common mechanism was introduced allowing us to walk a call stack as if no inlines had taken place. There’s at least one more place that we can use this: in order to make stack trace output not be sensitive to inlining. I’ll get to that in the coming weeks, or it might make a nice task for somebody looking to get themselves (more) involved with MoarVM development. Last but not least, I’d like to once again thank The Perl Foundation for organizing the funding that made this work possible.

More precise deoptimization usage tracking

In my previous post here, I talked about deoptimization and its implications for usage information. If you didn’t read that post, I suggest reading it before continuing, since the work described in this post builds upon it. Further background on deoptimization and its use in MoarVM may be found in my talk slides from last year’s Swiss Perl Workshop.

An example to consider

To keep things a bit simpler, we’ll look at an NQP program. NQP is a simplified subset of Perl 6, and so its naive compilation – before the optimizer gets at it – is much simpler than that produced by Rakudo. Here’s a small program to consider.

class Wrapper {
    has $!x;
    method x() { $!x }
}
class C { }

sub test($w) {
    my $var := "Used later";
    if nqp::istype($w.x, C) {
        "C"
    }
    else {
        $var
    }
}

my int $i := 0;
my $wrapper := Wrapper.new(x => C.new);
while $i < 1_000_000 {
    say(test($wrapper));
    $i++;
}
say(test(Wrapper.new(x => NQPMu)));

We’ll consider the test subroutine’s optimization. First, let’s walk through the bytecode before optimization. We always have a dummy empty basic block at the start of the graph, used for internal purposes, so we can disregard that.

  BB 0 (0x7f1be817b070):
    line: 7 (pc 0)
    Instructions:
      no_op           
    Successors: 1
    Predecessors: 
    Dominance children: 1

The next basic block starts with a bunch of null instructions, which will be mostly deleted. In the slow path interpreter, we null out registers in case code tries to read them as part of callframe setup. However, we get rid of that in the optimized code by letting the optimizer prove that such work is mostly not needed. Since we didn’t do any optimization yet, here’s all of those instructions.

  BB 1 (0x7f1be817b0f8):
    line: 7 (pc 0)
    Instructions:
      null              r5(1)
      null              r4(1)
      null              r3(1)
      null              r1(1)
      null              r0(1)

Next we receive the parameter.

      checkarity      liti16(1), liti16(1)
      param_rp_o        r0(2), liti16(0)
      paramnamesused  

Then we have the line my $var := "Used later";. Rakudo is smarter than NQP in compiling such a thing: it would just emit a reference to a single constant string rather than boxing it each time like NQP’s simpler code-gen does.

      [Annotation: Line Number: x.nqp:7]
      const_s           r2(1), lits(Used later)
      hllboxtype_s      r3(2)
      box_s             r3(3),   r2(1),   r3(2)
      set               r1(2),   r3(3)

Now we have the code for $w.x. It starts out with a decont, since we may have been passed something in a Scalar container (note that NQP code may be called from full Perl 6 code, so this is possible). We then look up the method and call it.

      [Annotation: INS Deopt One (idx 0 -> pc 46; line 9)]
      [Annotation: Logged (bytecode offset 40)]
      decont            r4(2),   r0(2)
    Successors: 2
    Predecessors: 0
    Dominance children: 2

  BB 2 (0x7f1be817b158):
    line: 9 (pc 46)
    Instructions:
      findmeth          r3(4),   r4(2), lits(x)
    Successors: 3
    Predecessors: 1
    Dominance children: 3

  BB 3 (0x7f1be817b1b8):
    line: 9 (pc 56)
    Instructions:
      [Annotation: INS Deopt One (idx 1 -> pc 56; line 9)]
      prepargs        callsite(0x7f1bf1f0b340, 1 arg, 1 pos, nonflattening, interned)
      arg_o           liti16(0),   r0(2)
      [Annotation: INS Deopt All (idx 3 -> pc 72; line 9)]
      [Annotation: INS Deopt One (idx 2 -> pc 72; line 9)]
      [Annotation: Logged (bytecode offset 66)]
      invoke_o          r3(5),   r3(4)
    Successors: 4
    Predecessors: 2
    Dominance children: 4

Notice how various instructions here are annotated with Deopt points. These are places that we might, after optimization has taken place, insert an instruction that could cause us to deoptimize. The pc 72 refers to the offset into the unoptimized bytecode at which we should continue execution back in the interpreter.

There’s also various Logged annotations, which indicate instructions that may log some statistics – for example, about what code is invoked, and what type of value it returns (for invoke_o) or what kind of type we get out of a decont operation that actually had to read from a container.

Next up is the type check. Again, there’s a decont instruction, just in case the call to $w.x returned something in a container. We then have the istype instruction.

  BB 4 (0x7f1be817b218):
    line: 9 (pc 72)
    Instructions:
      [Annotation: INS Deopt One (idx 4 -> pc 78; line 9)]
      [Annotation: Logged (bytecode offset 72)]
      decont            r4(3),   r3(5)
    Successors: 5
    Predecessors: 3
    Dominance children: 5

  BB 5 (0x7f1be817b278):
    line: 9 (pc 78)
    Instructions:
      wval              r5(2), liti16(0), liti16(5) (P6opaque: C)
      istype            r6(1),   r4(3),   r5(2)
    Successors: 6
    Predecessors: 4
    Dominance children: 6

Next comes the if part of the branch:

  BB 6 (0x7f1be817b2d8):
    line: 9 (pc 94)
    Instructions:
      unless_i          r6(1),   BB(8)
    Successors: 8, 7
    Predecessors: 5
    Dominance children: 7, 8, 9

  BB 7 (0x7f1be817b338):
    line: 9 (pc 102)
    Instructions:
      const_s           r2(2), lits(C)
      hllboxtype_s      r4(4)
      box_s             r4(5),   r2(2),   r4(4)
      set               r5(3),   r4(5)
      goto              BB(9)
    Successors: 9
    Predecessors: 6
    Dominance children: 

Followed by the else part. And what is r1(2)? It’s the “Used later” string from earlier.

  BB 8 (0x7f1be817b398):
    line: 12 (pc 134)
    Instructions:
      set               r5(4),   r1(2)
    Successors: 9
    Predecessors: 6
    Dominance children: 

Finally, we’re done, and return the result of the branch of the if statement that was executed.

  BB 9 (0x7f1be817b3f8):
    line: 12 (pc 140)
    Instructions:
      PHI               r5(5),   r5(3),   r5(4)
      PHI               r4(6),   r4(5),   r4(3)
      PHI               r2(3),   r2(2),   r2(1)
      return_o          r5(5)
    Successors: 
    Predecessors: 7, 8
    Dominance children: 

How we optimize it

Let’s now walk through the optimized output. The argument handling has been reduced to a single instruction that does an unchecked read of the incoming argument. This is because we’re producing a specialization for a particular input callsite shape and set of input arguments. In this case, it will be a single argument of type Wrapper.

  BB 0 (0x7f1be817b070):
    line: 7 (pc 0)
    Instructions:
      no_op           
    Successors: 1
    Predecessors: 
    Dominance children: 1

  BB 1 (0x7f1be817b0f8):
    line: 7 (pc 0)
    Instructions:
      sp_getarg_o       r0(2), liti16(0)

What comes next is the code to store that "Used later" string. The ops look fine, but do you notice something odd?

      const_s           r2(1), lits(Used later)
      hllboxtype_s      r3(2)
      [Annotation: INS Deopt One (idx 0 -> pc 46; line 9)]
      box_s             r1(2),   r2(1),   r3(2)

Yup, there’s a deopt annotation moved on to that box_s. Huh? Well, let’s look at what comes next.

      [Annotation: INS Deopt One (idx 1 -> pc 56; line 9)]
      sp_getspeshslot   r7(0), sslot(2)
    Successors: 2
    Predecessors: 0
    Dominance children: 2

  BB 2 (0x7f1be8356d38):
    Inlined
    line: 7 (pc 0)
    Instructions:
      [Annotation: FH Start (0)]
      [Annotation: Inline Start (0)]
      [Annotation: INS Deopt Inline (idx 5 -> pc 20; line 8)]
      set               r9(1),   r0(2)
      [Annotation: INS Deopt Inline (idx 6 -> pc 42; line 9)]
      sp_p6ogetvt_o    r11(1),   r9(1), liti16(8), sslot(4)
      [Annotation: FH End (0)]
      set               r3(5),  r11(1)
    Successors: 3
    Predecessors: 3
    Dominance children: 3

Recall that in the unoptimized code we next did $w.x by a findmeth instruction, which came after a decont of $w, and then we did an invocation of that method. What’s happened to all of that lot?

First, since $w is the argument we are producing a specialization for, we thus know it’s Wrapper, and we know that’s not a container type, so the decont can go. Since we also know its type and we know the method name, we can just resolve that method once. The resolution of it is then stored in a “spesh slot”, which you can think of as a constants table for this particular specialization. What follows is, instead of the invocation, the code for the method x() { $!x }, which has been inlined. (The sp_p6ogetvt_o instruction is what attribute lookup has been optimized into.)

Oh, and about that Deopt annotation on the box_s? That’s just because code got deleted and it got shifted. We’ll look at the consequences of that later.

Here is the rest of the code:

  BB 3 (0x7f1be817b218):
    line: 9 (pc 72)
    Instructions:
      [Annotation: Inline End (0)]
      [Annotation: FH Goto (0)]
      [Annotation: INS Deopt One (idx 2 -> pc 72; line 9)]
      [Annotation: INS Deopt One (idx 4 -> pc 78; line 9)]
      sp_guardconc      r3(5), sslot(0), litui32(72)
      const_s           r2(2), lits(C)
      hllboxtype_s      r4(4)
      box_s             r5(3),   r2(2),   r4(4)
      PHI               r5(5),   r5(3)
      return_o          r5(5)
    Successors: 
    Predecessors: 2
    Dominance children: 6

Well, that’s pretty different from what we started out with too. What on earth has happened? Where did our if statement go?!

The sp_guardconc instruction is a guard. It checks, in this case, that we have a concrete instance of C in register r3(5). It was inserted because the gathered statistics said that, so far, it had been such 100% of the time. The guard will deoptimize – that is, fall back to the interpreter – if it fails, but otherwise proceed. Since we have guarded that, then the istype will become a constant. That means we know which way the branch would go, and can delete the other part of the branch. A type check, a conditional branch, and a branch all go away, to be replaced by a single cheap guard.

But what about that “Used later” string?

Notice how we executed:

      box_s             r1(2),   r2(1),   r3(2)

But its result value, r1(2), is completely unused anywhere in the code that we have left after optimization. The instruction was, however, retained, for the sake of deoptimization. In the original code, the value was written prior to a guard that might deoptimize. Were we to throw it away, then after we deoptimized the interpreter would try to read a value that wasn’t written, and crash in some interesting way.

The original approach

The original approach taken to this problem was to:

  1. Whenever we see a Deopt annotation, take its index as our current deopt point
  2. Whenever we see a write, label it with the current deopt point
  3. Whenever we see a read, check if the deopt point of the write is not equal to the deopt point of the read. If that is the case, mark the write as needing to be retained for deopt purposes.

Effectively, if a value written before a deopt point might be read after a deopt point, then we retain it. That was originally done by bumping its usage count. In my last post here, I described how we switched to setting a “needed for deopt” flag instead. But in the grand scheme of things, that changed nothing much about the algorithm described above; only step 3 was changed.
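
In Perl 6-flavoured pseudocode – all names invented, since the real implementation lives in MoarVM’s C code – the original approach was roughly:

my int $current-deopt-point = -1;
for @instructions -> $ins {
    # Step 1: a Deopt annotation updates the current deopt point.
    $current-deopt-point = $ins.deopt-index if $ins.has-deopt-annotation;
    # Step 2: label the write with the current deopt point.
    $ins.write.deopt-point = $current-deopt-point if $ins.has-write;
    # Step 3: a read whose writer carries a different deopt point means
    # that write must be retained for deoptimization purposes.
    for $ins.reads -> $read {
        $read.writer.needed-for-deopt = True
            if $read.writer.deopt-point != $current-deopt-point;
    }
}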

Note that this algorithm works in the case of loops – where we might encounter a value being read in a PHI node prior to seeing it being written – because the lack of a deopt point recorded on the writer will make it unequal to the current deopt point.

Correct, but imprecise

The problem with this approach isn’t with correctness, but rather with precision. A deopt retention algorithm is correct if it doesn’t throw away anything that is needed after a deoptimization. Of course, the simplest possible algorithm would be to mark everything as required, allowing no instruction deletions! The method described above is also correct, and marks fewer things. And for a while, it was enough. However, it came to be a blocker for various other optimizations we wish to do.

There are two particular problems that motivated looking for a more precise way to handle deopt usage. First of all, many instructions that may be replaced with a guard and deoptimize are actually replaced with something else or even deleted. For example, decont will often be replaced by a set because we know that it’s not a container type. A set can never trigger deoptimization. However, we had no way to update our deopt usage information based on this change. Therefore, something written before the set that used to be a decont and read after it, but otherwise not needed, would be kept alive because the decont could have had a guard inserted, even though we know it did not.

A larger problem is that even when we might insert a guard, we might later be able to prove it is not needed. Consider:

my int $i = $str.chars;

The chars method will be tiny, so we can inline it. Here’s the code that we currently produce; I’ve shown the end of the inlining of the chars method together with the assignment into $i.

      chars            r15(1),  r14(1)
      hllboxtype_i     r13(1)
      [Annotation: INS Deopt Inline (idx 7 -> pc 134; line -1)]
      box_i            r13(2),  r15(1),  r13(1)
      [Annotation: FH End (0)]
      set               r2(5),  r13(2)
    Successors: 3
    Predecessors: 4
    Dominance children: 3

  BB 3 (0x7efe0479b5f0):
    line: 1 (pc 100)
    Instructions:
      [Annotation: Inline End (0)]
      [Annotation: FH Goto (0)]
      [Annotation: INS Deopt One (idx 3 -> pc 100; line 1)]
      sp_guardconc      r2(5), sslot(2), litui32(100)
      set               r2(6),   r2(5)
      set               r5(1),  r15(1)
      bindlex         lex(idx=0,outers=0,$i),   r5(1)

Since $i is a native integer lexical, we don’t need to box the native integer result of the chars op at all here. And you can see that we have done a rewrite such that r15(1) is used to assign to $i, not the boxed result. However, the box_i instruction is retained. Why?

The immediate reason is that it’s used by the guard instruction. And indeed, I will do some work in the future to eliminate that. It’s not a hugely difficult problem. But doing that still wouldn’t have been enough. Why? Because there is a deopt point on the guard, and the boxed value is written before it and used after it. This example convinced me it was time to improve our deopt handling: it was directly in the way of optimizations that could provide a significant benefit.

A more precise algorithm

It took me three attempts to reach a solution to this. The first simple thing that I tried follows from the observation that everything written after the last deopt instruction in the optimized code can never possibly be used for deoptimization purposes. This was far from a general solution, but it did help a bit with very small functions that are free of control flow and have no guards, or perhaps just some very early ones. This was safe, easy to reason about, easy to implement – but ultimately not powerful enough. However, it was helpful in letting me frame the problem and start to grapple with it, plus it gave me a set of cases that a more powerful solution should be able to take care of.

Attempt number two was to do a repeat of the initial deopt analysis process, but after the optimizations had taken place. Thus, cases where a Deopt annotation was on an instruction that was turned into something that could never deoptimize would not be counted. This quickly fell apart, however, since in the case where entire branches of a conditional were deleted, reads could disappear entirely. They simply weren’t there to analyze any more. So, this was an utter failure, but it did drive home that any analysis that was going to work had to build up a model of deoptimization usages before we performed any significant optimizations, and then manipulate that model safely even in the light of a mutated program graph.

Attempt three took much longer to come up with and implement, though thankfully was rather more successful. The new algorithm proceeds as follows.

  1. When we see a write instruction – and provided it has at least one reader – place it into a set OW of writes with still outstanding reads.
  2. When we see a deopt point, take the set OW and record the index of this deopt point on each of those writes. This means that we are now associating the writes with which deopt points are keeping them alive.
  3. Whenever we see a read, mark it as processed. Check if the writer has now had all of its reads processed. If so, remove it from the set OW.
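
In the same invented-names pseudocode style as before, the single pass looks roughly like:

my @ow;                                        # OW: writes with outstanding reads
for @instructions -> $ins {
    # Step 2: record this deopt point on every write still outstanding.
    if $ins.has-deopt-annotation {
        .deopt-users.push($ins.deopt-index) for @ow;
    }
    # Step 3: process reads; a write with all reads processed leaves OW.
    .mark-processed for $ins.reads;
    @ow .= grep({ !.all-reads-processed });
    # Step 1: a write with at least one reader joins OW.
    @ow.push($ins.write) if $ins.has-write && $ins.write.has-readers;
}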

This algorithm works in a single pass through the program graph. However, what about loops? In a loop, no matter what order we traverse the graph in, we will always see some reads that happen before writes, and that breaks the algorithm that I described above.

After some amount of scribbling graphs and staring at them, I hit upon a way to solve it that left me with a single pass through the graph, rather than having to iterate to a fixed point. When we see a read whose writer was not yet processed, we put it into a set of reads that are to be processed later. (As an aside, we know thanks to the SSA form that all such instructions are PHI (merge) instructions, and that these are always placed at the start of the graph.) We will then process a pending read when we have processed all of the basic blocks that are its predecessors – which means that by then all possible writes will have been processed.

The result is that we now have a list of all of the deopt points that make use of a particular write. Then, after optimization, we can go through the graph again and see which deopt points actually had guards or other potentially deoptimizing instructions placed at them. For all the cases where we have no such instruction under that Deopt annotation, we can delete the deopt usage. That way, if all normal usages and deopt usages of a value are gone, and the writing instruction is pure, we can delete that instruction.

This further means that once we gain the ability to delete guards that we can prove are not required any longer – perhaps because of new information we have after inlining – we will also be able to delete the deopt usages associated with them.

Last but not least, the specializer’s log output also includes which deopt points are keeping a value alive, so we will be able to inspect the graph in cases where we aren’t entirely sure and understand what’s happening.

Future work

With this done, it will make sense to work on guard elimination, so that is fairly high on my list of upcoming tasks. Another challenge is that while the new deopt algorithm is far more precise, it’s also far more costly. Its use of the DU chains means we have to run it as a second pass after the initial facts and usage pass. Further, the algorithm to eliminate unrequired deopt usages is a two pass algorithm; with some engineering we can likely find a way to avoid having to make the first of those. The various sets are represented as linked lists too, and we can probably do better than that.

One other interesting deoptimization improvement to explore in the future is to observe that any pure instructions leading up to a deopt point can be replayed by the interpreter. Therefore, we can deopt not to the instruction mapping to the place where we put the deopt point, but to the start of a run of pure instructions before that. That would in some cases allow us to delete more code in the optimized version.

Next time, I’ll be looking at how increasingly aggressive inlining caused chaos with our context introspection, and how I made things better. Thanks go to The Perl Foundation for making this work possible.

Better usage information in the MoarVM specializer

I’ve been doing lots of work on the MoarVM specializer of late, and will be writing a few posts here to explain it. This work has been covered by my grant from The Perl Foundation.

This post covers the recent addition of DU (Define-Use) chains. I’ll explain what they are, what kind of optimizations they have helped with so far, and how they can help us ensure the specializer is free of certain kinds of bug.

A little background

The MoarVM specializer helps programs run faster by ripping out as much unrequired generality as it knows how. This involves a bunch of different analyses, and those are aided by the program being turned into SSA (Static Single Assignment) form. This happens not at the Perl 6 program level, but rather at the bytecode level. MoarVM’s interpreter is a register machine, so something like:

($a + $b) * $c

Could, assuming these are all native integer variables, compile into something like:

getlex r0, '$a'
getlex r1, '$b'
add_i r0, r0, r1
getlex r1, '$c'
mul_i r0, r0, r1

Notice how registers r0 and r1 are re-used for multiple things. Now imagine that these registers hold objects and we are trying to track types and other such information. Since a register may be used for completely different things over its lifetime, we can’t just associate information with the register.

Transforming the bytecode into SSA form helps. We give each use of the register a version number:

getlex r0(1), '$a'
getlex r1(1), '$b'
add_i r0(2), r0(1), r1(1)
getlex r1(2), '$c'
mul_i r0(3), r0(2), r1(2)

Now we can associate information with each version of the register, greatly easing analysis of the program.

Defines

When a program is in SSA form, every versioned register has one definition: the instruction that writes it. Since it can never be written again anywhere in the SSA form of the bytecode, the writer is a single instruction. There are no MoarVM instructions that write more than one register, and so an instruction defines at most one value, and every versioned register has precisely one instruction that defines it.

So, defines are easy. For as long back as I can remember, we’ve stored a reference to the writing instruction of each versioned register, so whenever we see a read of it then we can always quickly find the defining instruction.

Uses

Until recently, we stored a counter of how many times each versioned register was used. We made an initial pass through the graph representing the bytecode to be optimized, bumping the usage count each time we saw a versioned register being used. Then, as we optimized, we could update those counts.

Usage information is especially useful taken together with knowledge of which instructions are pure – that is to say, they produce a result, but don’t have any effects besides that. If the usage count of such an instruction drops to zero, then we can delete it.

For example, if we have an attribute has str $!value in a class, it would be compiled into something like this:

  wval              r4(2), liti16(1), liti16(36) (P6opaque: Str)
  getattr_s         r5(1),   r0(2),   r4(2), lits($!value)

The wval instruction grabs the type object of the class that declares the attribute. This is used together with the attribute name to do a lookup (since parent and child classes may have attributes of the same name, but they are different attributes since they are in different classes). Provided we know the type of r0(2) – which is holding self – then we might optimize it into:

  wval              r4(2), liti16(1), liti16(36) (P6opaque: Str)
  sp_p6oget_s       r5(1),   r8(3), liti16(8)

Where the 8 is an offset in bytes indicating where the attribute lives in the memory of the object. We’ve turned a lookup by name into pointer chasing, which will later JIT into some pretty simple machine code (not quite as simple as we want yet, but still vastly faster than the normal lookup path).

But wait! What about that wval there? We don’t need it now. And so, after the various optimizations have taken place, we do Dead Instruction Elimination. So long as the usage count has dropped to zero – that is, nothing else is using the value – we can delete the instruction, meaning the end result is just:

  sp_p6oget_s       r5(1),   r8(3), liti16(8)

Deoptimization complication

So far, so relatively simple. Alas, there’s complications. Some values might become unused after we optimize the code, but we still can’t delete them. We use statistics to drive optimization, and do a great deal of speculation. For example, if we see 99% of the time a particular type shows up in the program, we optimize it assuming that type. But what if the 1% case shows up? Or what if we saw a certain type 100% of the time so far, but there’s a different one in the future? In that case, we drop back to the normal interpreter to handle it. For that to work out, however, we must make sure that the values the interpreter needs are still available after this deoptimization has taken place.

Up until recently, whenever we detected that a value might be needed if deoptimization happens, we simply gave its usage count an extra bump. This meant that even if we deleted all of its uses in the graph, we’d still not delete it. (This was a very coarse-grained analysis. I’ll discuss that more in a future post.)

You can’t count on this

The usage count was a fine enough approach to start out with, but it gradually came to be insufficient.

One bug that it’s quite possible to make is to forget to increment or decrement the usage count. The cases where it ended up too high could prevent us from deleting an instruction we didn’t need, leading to worse code. This wasn’t very serious, though a bit sub-optimal. The other way round – failing to increment the count – is of course more dangerous, since it may lead to an instruction being deleted that we really need. This didn’t happen often – we’re relatively careful – but it’d be nice if we had a way to verify it wasn’t happening at all. However, the +1 for the sake of deoptimization would have frustrated doing such an analysis.

A further issue is that while finding the place that a given versioned register was defined was easy, there was no cheap way to find its usages. Having such information would make some optimizations we already did easier and more effective, as well as make it easier to do some that we’re keen to add in the near future.

Beyond that, a single number was uninformative for those of us working on the optimizer. We could see the number, but what was it telling us? Why was the register still in use?

Adding use chains

So, instead of storing a count, we’ve started storing a linked list that points to each instruction that uses the versioned register. Often we only care about used, used precisely once, or unused; in fact, the only place we need the exact count is for debug output. Therefore, we can answer all the common questions we could with the usage count without having to traverse the chain. Building the chain is easy: everywhere we used to bump the counter, we now add an entry into the chain.

This chain works well for instructions that use the value, but what about the deoptimization usage? This was handled by storing that piece of information as a separate flag. It could then be displayed alongside the real usage count in the debug output, so we could quickly understand which registers were in use only for the purpose of deoptimization.
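
In data structure terms, the per-versioned-register information now looks conceptually like this (the real thing is a set of C structs in MoarVM’s spesh subsystem; these names are illustrative):

class UseChainEntry {
    has $.user;                     # the instruction that reads the value
    has UseChainEntry $.next;       # the next entry in the chain, if any
}
class VersionedRegisterFacts {
    has $.writer;                   # the single defining instruction (SSA)
    has UseChainEntry $.use-chain;  # linked list of all reading instructions
    has Bool $.deopt-required;      # kept alive purely for deoptimization?
}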

Checking the chains

Along with this, I implemented a chain checker. It goes through the instruction graph and the use chains, and makes sure:

  • Every versioned register that is used in the instructions has an entry in the chain
  • There are no entries in the chain that don’t appear in the graph
  • Every used value has its writing instruction set correctly

This isn’t done by default – it costs something – but is available as a flag MoarVM developers can turn on when implementing new optimizations to aid with verifying they are, at least in this regard, correct.

Improving elimination of set instructions

Often in Perl 6, code has to deal with both values and values held in Scalar containers – that is to say, it’s polymorphic over the two cases. In the case that we have a Scalar container, we have to remove the value from it. This is an incredibly common operation in Perl 6, and there is a single op – decont – that checks if we have a container and takes the value out of it if so. Code generation conservatively inserts quite a lot of these.

Often, we simply have a value, and so there’s nothing to do. And often, the specializer can tell there will be nothing to do. Thus, something like this:

  decont            r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

Is turned into this:

  set               r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

Where set simply sets the value of one register into another. For this and various other reasons, it’s quite common that – after optimizations – we end up with code chock full of set instructions. They’re cheap, but they certainly aren’t free – on two counts. Firstly, there’s the execution cost of them. Secondly, they make the optimized code larger than it needs to be. This both makes less efficient use of the CPU’s code cache once we JIT the optimized result, but also can push the code over the inlining size limit, and thus it might miss out on further powerful optimizations.

We did have some code to try and get rid of set instructions. It was less than awesome on multiple counts. Firstly, it still left quite a few behind that we could see by inspection of the code could go away. Secondly, it could make a mess of the SSA form. Since it was one of the very last optimizations we did, that wasn’t a big deal, but it did make the debug output confusing, plus we will be adding more optimizations to this second pass in the future. Thirdly, it was somewhat ad hoc, mostly written to handle peephole patterns that commonly showed up.

The usage chains provide a way to do better. The new set elimination algorithm covers the previous cases and new ones, and yet only does two fairly straightforward things.

Firstly, it checks whether the writer of the set‘s second operand has only one usage, which is that set instruction, and no deopt usages. If so, and if there are no interfering uses of different versions of the register that the set writes, then it can have the writing instruction changed to write to the register that the set would, and the set instruction can then be deleted.

Failing that, it uses the use chain to check if there is a single user of the versioned register that the set instruction writes to. Again, given no conflicts, it can eliminate the set instruction by arranging for the user of the set instruction to instead use the value that the set would read. So in our case:

  set               r5(2),   r0(2)
  findmeth          r4(2),   r5(2), lits(chars)

We’d end up with:

  findmeth          r4(2),   r0(2), lits(chars)
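
As a rough sketch of that second rule, here it is applied to the example above, over a toy instruction representation invented for illustration (the Ins class and try-eliminate-set are made up for the example, the interfering-version checks are omitted, and MoarVM’s real implementation is C code over its spesh graph):

# Toy instructions: an op name, the register written, and the registers read.
class Ins {
    has Str $.op;
    has Str $.writes;
    has Str @.reads;
}

# If the register a set writes has exactly one using instruction, rewrite that
# user to read the set's source register directly; the caller then deletes the set.
sub try-eliminate-set(Ins $set, @users) {
    return False unless $set.op eq 'set' && @users.elems == 1;
    my $source = $set.reads[0];
    my $target = $set.writes;
    for @users[0].reads.kv -> $i, $reg {
        @users[0].reads[$i] = $source if $reg eq $target;
    }
    True;
}

my $set      = Ins.new(op => 'set',      writes => 'r5(2)', reads => ['r0(2)']);
my $findmeth = Ins.new(op => 'findmeth', writes => 'r4(2)', reads => ['r5(2)', 'lits(chars)']);
say try-eliminate-set($set, [$findmeth]);   # True
say $findmeth.reads;                        # [r0(2) lits(chars)]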

To give a practical example of this, here is how the optimized code of the chars method called on a Scalar holding a Str looks without the set elimination:

  sp_getarg_o       r1(2), liti16(0)
  set               r8(2),   r1(2)
  set               r1(3),   r8(2)
  [Annotation: Logged (bytecode offset 24)]
  sp_p6oget_o       r8(3),   r1(3), liti16(16)
  [Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
  sp_guardconc      r8(3), sslot(1), litui32(30)
  set              r11(2),   r8(3)
  set               r0(2),  r11(2)
  [Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
  takedispatcher    r3(2)
  sp_p6oget_s       r5(1),   r0(2), liti16(8)
  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  [Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
  box_i             r4(4),   r6(1),   r4(3)
  return_o          r4(4)

Notice the four set instructions in there. With the new set elimination algorithm, we end up with:

  sp_getarg_o       r1(3), liti16(0)
  [Annotation: Logged (bytecode offset 24)]
  sp_p6oget_o       r8(3),   r1(3), liti16(16)
  [Annotation: INS Deopt One (idx 0 -> pc 30; line 2838)]
  sp_guardconc      r8(3), sslot(1), litui32(30)
  [Annotation: Line Number: SETTING::src/core/Str.pm6:2838]
  takedispatcher    r3(2)
  sp_p6oget_s       r5(1),   r8(3), liti16(8)
  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  [Annotation: INS Deopt One (idx 1 -> pc 134; line 2839)]
  box_i             r4(4),   r6(1),   r4(3)
  return_o          r4(4)

Elimination of box/unbox pairs

Another interesting use of DU chains is to eliminate boxing of native values into objects only to unbox them again a short time later. This can happen due to the compiler not being smart enough, but when it happens across two subs or methods – especially with multiple dispatch or polymorphic method dispatch involved – there’s not much the compiler can do about it at that stage.

However, MoarVM does inlining, including speculative inlining. We can therefore see across call boundaries that we cannot see across at compile time. Recall how chars produced this boxing code, as it is declared to return an Int:

  chars             r6(1),   r5(1)
  hllboxtype_i      r4(3)
  box_i             r4(4),   r6(1),   r4(3)

What if we were to write:

my int $chars = $str.chars;

Then the boxing happens just over the boundary. It turns out that there’s quite a lot to do in order to get rid of the boxing instruction, but with use chains we can already make a start. When we encounter a box, we check whether any of its users is an unbox. After inlining, we’d see that there are such cases. Therefore, that unbox instruction can be rewritten to use r6(1) – the unboxed value.
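
Sketched in the same toy style – plain hashes this time, with the unbox_i instruction and the extra register names made up for the example rather than taken from real spesh output:

# A box_i whose result is consumed by an unbox_i, as might be seen after inlining.
my $box   = { op => 'box_i',   writes => 'r4(4)', reads => ['r6(1)', 'r4(3)'] };
my $unbox = { op => 'unbox_i', writes => 'r7(1)', reads => ['r4(4)'] };

# For each user of the boxed register that is an unbox, read the native value instead.
sub redirect-unbox-users($box, @users) {
    my $native = $box<reads>[0];
    for @users.grep({ .<op>.starts-with('unbox') }) -> $user {
        for $user<reads>.kv -> $i, $reg {
            $user<reads>[$i] = $native if $reg eq $box<writes>;
        }
    }
}

redirect-unbox-users($box, [$unbox]);
say $unbox<reads>;   # [r6(1)]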

That much works now. For reasons I’ll dig into in my next post, that’s not yet quite enough to eliminate the box_i instruction. So in this case, the saving is minor. Once we can get rid of the boxing operation, however, it will be a notable saving in such cases.

Coming in the future: native ref/deref pairs

One current performance challenge we have is that if we call a method and pass it a variable declared with a native type:

my int $foo = $a + $b;
$obj.meth($foo);

Then we don’t know whether that method is declared as taking an is rw parameter or not. Therefore, we cannot simply pass the native integer value; instead, we form a reference that points to where $foo lives, so the method can update it. Of course, in most cases is rw is not used.

After inlining, we’ll be able to see this, and so will be able to use the use chain to discover when a formed reference is used for nothing more than a dereference. Then we can eliminate that reference-taking process entirely.
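
To illustrate the kind of code in question, here is a small made-up example of the pattern, assuming the usual Rakudo behaviour of forming a native reference when a native lexical is passed to an is rw parameter:

class Tally {
    method bump($count is rw) {
        $count = $count + 1      # needs to write back through the reference
    }
}

my int $foo = 41;
Tally.bump($foo);                # a reference to $foo's storage is formed for the call
say $foo;                        # 42

Had bump not been declared is rw – the common case – the reference would still be formed at the call site, only to be dereferenced immediately; that is precisely the pair the planned optimization would remove.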

In summary

Adding use chains has allowed us to detect and fix a small number of usage handling bugs, given us a way to prevent such bugs happening in the future, allowed us to improve an existing optimization, provided for efficiently implementing a new one, and will be an important part of improving the performance of code using native types in the future. Furthermore, it means those of us working on MoarVM have more detailed information about why an operation has to take place in the optimized code, so we can better understand if we have missed opportunities.

However, that’s not the end of the usage story. It turned out that a single flag for deopt usage would not suffice. Next time, I’ll look at why, and what I’ve done to address that.

Faster dispatches with MoarVM specializer plugins

One of the goals for the current round of my Perl Foundation Performance and Reliability grant is to speed up private method calls in roles, as well as assignments into Scalar containers. What I didn’t expect at the time I wrote the grant application is that these two would lead to a single new mechanism in MoarVM to make them possible.

The Scalar container assignment improvements are still to come; currently I have a plan and hope to make good progress on it next week. I do, however, have a range of dispatch-related performance improvements to show, including the private method case.

Background

MoarVM runs programs faster by analyzing how they run and producing specialized versions of parts of the program based on that information. It takes note of which code is run often (frequently called methods and hot loops), which types a block of code is called with, what types are returned from calls, what code a closure points to, and more. Note that it observes the runtime behavior, and so is not dependent on whether the program has type annotations or not.

Calls are one of the most important things that the optimizer considers, be they method calls, subroutine calls or invoking a received closure. Method calls are especially interesting, because with a call like $obj.meth($arg), the method to be called depends on the exact type of $obj. Often, we end up producing a version of the code that is specialized for a particular type of $obj. We can therefore resolve the method once in this specialization, saving the method lookup overhead.

But there’s more. Once we know exactly what method we’ll be calling, and if the method is fairly small, we can inline it into the caller, thus eliminating the call overhead too. Further, since we are inlining a specialized version of the code and have already proved that we meet the conditions for using that specialization, we can eliminate type checks on parameters. Inlining is even more powerful than that: it opens the door to a wider range of analyses that would not be possible without it, which lead to further program optimizations.
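
As a contrived example of the kind of call site this helps with: every .magnitude call below is on a Point, so a specialization of the loop body can resolve the method once and quite possibly inline it (exactly what gets specialized and inlined depends on the statistics MoarVM gathers at run time):

class Point {
    has $.x;
    has $.y;
    method magnitude() { sqrt($!x * $!x + $!y * $!y) }
}

my $total = 0;
for ^1_000_000 -> $i {
    # The invocant type here is always Point, so the dispatch is monomorphic.
    $total += Point.new(x => $i, y => 2).magnitude;
}
say $total;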

The problem

We can do this kind of optimization with method calls because MoarVM understands about method calls. It knows that if it is holding the type of the invocant constant, then the result of the dispatch can also be considered a constant.

Unfortunately, there’s more than one case of method calling in Perl 6. While the majority of calls take the familiar $obj.foo form, we also have:

  • Private method calls, of the form $obj!foo or $obj!TrustsMe::foo
  • Qualified method calls, of the form $obj.Some::Role.meth() (it may be a base class too, though this form is at its most useful in doing conflict resolution for roles)
  • Duck method calls, of the form $obj.?foo, where if the method doesn’t exist then we just accept that and move along

In the first case, if the call is in a class, then we can resolve it at compilation time, since private methods aren’t virtual. Such calls are thus pretty fast. But what if the private method call is in a role? Well, then – until now – it was far slower. It took a method call on the meta-object, which then did a hash lookup to find the method, followed by invoking that method. This work was done by a call to a dispatch:<!> utility method. It was the same story for qualified calls and duck calls.

So, let’s extend MoarVM to understand these kinds of calls?

If normal method calls are faster because MoarVM understands them, surely we can do better by teaching it to understand these other forms of calling too? Perhaps we could add some new ops to the VM to represent these kinds of calls?

Maybe, but all of them come with their own rules. And those rules are already implemented in the metamodel, so we’d be duplicating logic. We make normal method calls fast by precomputing a method cache, which is just a hash table, and have the specializer do its lookups in that. While such an approach might work for private methods, it gets decidedly trickier in the other two cases. Plus those precomputed hashes take up a lot of space. There are hundreds of exception types in CORE.setting, and every one of them has a precomputed hash table of all of its methods, with the methods from base classes denormalized into it. This means hundreds of hashes containing mappings for all of the methods inherited from Mu, Any, and Exception. We do lazily deserialize these, which helps, but it’s still fairly costly. Introducing more such things, when I already want to be rid of that one, didn’t feel like a good direction.

Let’s make MoarVM teachable

Earlier in the post, I wrote this:

It [the optimizer] knows that if it is holding the type of the invocant [of a method call] constant, then the result of the dispatch can also be considered a constant.

And this is the key. The important thing isn’t that the specializer knows the precise semantics of the method dispatch. The important thing is that it knows the relationship between the arguments to a dispatch (e.g. the type that we’re calling the method on) and the result of the dispatch.

This, along with considering the challenges of optimizing Scalar assignments, led me to the idea of introducing a mechanism in MoarVM where we can tell it about these relationships. This enables the specializer to insert guards as needed and then simply use the calculated result of the dispatch.

Specializer plugins

The new mechanism is known as “spesh plugins”, and I merged it into MoarVM’s master branch today. It works in a few steps. The first is that one registers a spesh plugin. Here’s the one for helping optimize private method calls:

nqp::speshreg('perl6', 'privmeth', -> $obj, str $name {
    nqp::speshguardtype($obj, $obj.WHAT);
    $obj.HOW.find_private_method($obj, $name)
});

The registration provides the language the plugin is for, the name of the plugin, and a callback. The callback takes an object and a method name. The second line is the key to how the mechanism works. It indicates that the result that will be returned from this plugin will be valid provided the type of $obj precisely matches (that is, with no regard to subtyping relationships) the type of the $obj we are currently considering. Therefore, it establishes a relationship between the invocant type and the private method call result.

Then, we just need to compile a private method call like:

self!foo($bar, $baz)

Into:

nqp::speshresolve('privmeth', self, 'foo')(self, $bar, $baz)

Taking care to only evaluate self once (obviously not a problem for self, but in general it can be any expression, and may have side-effects).

And that’s it. So what happens at runtime?

When the interpreter encounters this call for the first time, it calls the plugin. It then stores the result along with the conditions. On later calls made in the interpreter, it uses this mapping table to quite quickly map the invocant type into the appropriate result. It’s a little cache. (Aside: this is a little more involved because we want lookups without locking, but also need to cope with multiple threads creating resolution races. Thanks to a generalized free-at-safepoint mechanism in MoarVM, this isn’t so hard.)
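
Conceptually – and only conceptually, since the real table lives in MoarVM’s C code, is lock-free, and guards on whatever speshguardtype was invoked with rather than always on the first argument – the per-call-site cache behaves a bit like this toy version:

# Toy model of a spesh plugin call site: remember (guarded type, result) pairs.
my @entries;

sub resolve-with-cache(&plugin, $obj, *@rest) {
    for @entries -> [$type, $result] {
        return $result if $obj.WHAT === $type;     # guard: exact type match
    }
    my $result = plugin($obj, |@rest);             # first time: run the plugin...
    @entries.push: [$obj.WHAT, $result];           # ...and record the mapping
    $result;
}

# For example, with a plugin that resolves a method on the invocant's type:
my &find-chars = -> $obj { $obj.^find_method('chars') };
say resolve-with-cache(&find-chars, 'hello')('hello');   # 5 (plugin runs)
say resolve-with-cache(&find-chars, 'world')('world');   # 5 (cache hit)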

So that’s nice, and on its own would already be an improvement over what it replaced. But we haven’t even got to the exciting part yet! Each time we use this mapping, it records which mapping was used for the benefit of the optimizer. This information is stored in such a way that the specializer can work out which mappings are used with a particular set of parameter types to the method. So, in:

role R {
    method foo() {
        self!bar
    }
}
class C1 does R {
    method !bar() { 1 }
}
class C2 does R {
    method !bar() { 2 }
}

The method foo might be invoked with invocants of type C1 and C2. Thus the mapping table for the call self!bar will have two entries. We may (if the code is hot) produce two specializations of method foo, and if we do, then we will also be able to see that there is only ever one target of the private method call in each case. Thus, we can inline the appropriate !bar into the matching specialization of foo.

Results

Writing a module PM.pm6 that contains:

role R {
    method m() { self!p }
    method !p() { 42 }
}
class C does R {
}
for ^10_000_000 {
    C.m;
}

And then running it with perl6 -I. -e 'use PM' used to run in 5.5s on my development machine. That’s only 1.8 million iterations of the loop per second, which means each is eating a whopping 1,650 CPU cycles, assuming a 3GHz CPU.

With the new spesh plugin mechanism, it runs in 0.83s – over 6.5x faster. It’s over 12 million iterations of the loop per second, or around 250 CPU cycles per iteration. That’s still a good bit more than it ought to cost, but it’s a heck of a lot better.

Note that, due to the way roles are handled in non-precompiled code, the spesh plugin will not currently be used for a role in a script, which is why I put the code into a module for this benchmark. This restriction can be lifted later.

But wait, there’s more

I also wrote a spesh plugin for qualified dispatches, like $obj.Foo::meth(). This one guards on two of its inputs, and has an error case to handle. Notice how we can avoid replicating this logic inside of MoarVM itself and just write it in NQP code.

nqp::speshreg('perl6', 'qualmeth', -> $obj, str $name, $type {
    nqp::speshguardtype($obj, $obj.WHAT);
    if nqp::istype($obj, $type) {
        # Resolve to the correct qualified method.
        nqp::speshguardtype($type, $type.WHAT);
        $obj.HOW.find_method_qualified($obj, $type, $name)
    }
    else {
        # We'll throw an exception; return a thunk that will delegate to the
        # slow path implementation to do the throwing.
        -> $inv, *@pos, *%named {
            $inv.'dispatch:<::>'($name, $type, |@pos, |%named)
        }
    }
});

This gave an even more dramatic speedup. The program:

role R1 {
    method m() { 1 }
}
role R2 {
    method m() { 2 }
}
class C does R1 does R2 {
    method m() {
        self.R2::m()
    }
}
for ^10_000_000 {
    C.m
}

Used to take 13.3s. With the spesh plugin in effect, it now takes 1.07s, a factor of more than 12x improvement.

And even a little more…

I also wondered if I could get $obj.?foo duck dispatches to do better using a spesh plugin too. The answer turned out to be yes. First of all, here’s the plugin:

sub discard-and-nil(*@pos, *%named) { Nil }
nqp::speshreg('perl6', 'maybemeth', -> $obj, str $name {
    nqp::speshguardtype($obj, $obj.WHAT);
    my $meth := $obj.HOW.find_method($obj, $name);
    nqp::isconcrete($meth)
        ?? $meth
        !! &discard-and-nil
});

There are a couple of cases I decided to measure here. The first is where we wrote code with a .? call to handle the general case (for example, in a module), but the program using the module always (or more than 99% of the time) provides an object on which we can call the method.

class C {
}
class D {
    method m() { 42 }
}
for ^10_000_000 {
    (rand > 0.999 ?? C !! D).?m()
}

The rand call, compare, and conditional are all costs in this code besides the call I wanted to measure, so it’s not such a direct measurement of the real speedup of .?. Still, this program went from taking 10.9s before to 4.29s with the spesh plugin in place – an improvement of 2.5x. It achieves this by doing a speculative inline of the method m anyway, and then using deoptimization to fall back to the interpreter to handle the 0.1% of cases where we get C and not D. (It then, at the end of the loop body, falls back into the hot code again.) Note that the inlining and deopt just naturally fell out of things the specializer already knew how to do.

But had this come at the cost of making really polymorphic cases slower? Here’s another benchmark:

class C {
}
class D {
    method m() { 42 }
}
for ^10_000_000 {
    (rand > 0.5 ?? C !! D).?m()
}

This one goes from 7.60s to 4.92s, a 1.5x speedup. Spesh can’t just punt this to doing a deopt for the uncommon case, because there is no uncommon case. Still, the guard table scan comes out ahead.

(By the way, I think a lot of the slowness in this code – though I didn’t think of it when I wrote the benchmark – is that rand returns a Num, but 0.5 and 0.999 are Rats, so it is doing a costly type coercion before comparing.)
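
If that is indeed the culprit, writing the constants as Num literals would presumably sidestep the coercion – something like:

class C {
}
class D {
    method m() { 42 }
}
for ^10_000_000 {
    (rand > 0.5e0 ?? C !! D).?m()   # 0.5e0 is a Num literal, so no Rat is involved
}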

And what next?

Next I’ll be taking on Scalar containers and assignment, seeing what I can do with spesh plugins there, and hoping my ideas lead to results as positive as those seen here.

Also, this isn’t the final word on the various benchmarks in this post either. I know full well that the current spesh plugin implementation is inserting some redundant guards, and a bit of effort on that front can probably get us another win.

Remote Debug Support for MoarVM

Some years back, I worked on bringing basic debug support to Rakudo Perl 6. It works by taking the program AST and instrumenting it. This approach produced a somewhat useful result, but also carries a number of limitations:

  • It required all code that is to be debugged to be recompiled, which gets slow when there are many modules – especially since there isn’t a way to cache this debug version. Back when I wrote it, all Perl 6 programs were pretty small. Today, larger systems are being built, with more significant module dependency chains.
  • It caused a fairly significant slowdown to the code that was being debugged.
  • It wasn’t useful to Rakudo core developers, and less useful than would be ideal for Perl 6 power users, as it could not debug into the built-ins and compiler. It’s great that so much of Perl 6 is written in Perl 6, but that was sadly off-limits for the existing debugger. There wasn’t much we could easily do to address that, either.
  • There wasn’t any way to debug remotely. Technically, we probably could have built this, as the debugger had pluggable frontends. However, to my knowledge, the only one ever written was the CLI (Command Line Interface) that exists to this day.
  • That CLI never learned to deal with threads. Given one of Perl 6’s selling points is its concurrency support, that’s a significant limitation.

Enter the Edument Team

At Edument, the company where I work, we’re keen to support Perl 6 and drive it forward. Last summer, we released Cro. In the autumn, we started work on our next Perl 6 product – which we’ll be announcing in just a couple more weeks. Along the way, we realized that it really needed Perl 6 to have a better debugging solution. So what did we do?

We decided to pitch in and fund its development, of course! During the winter, we’ve worked on adding remote debug support to MoarVM, and today we’ve opened a pull request with it.

Some details

With our additions, MoarVM can be started with a flag indicating it should run a debug server, along with a port that it should listen on. It can then either wait for a connection before doing anything, or run the program as normal but allow for a connection to be made later.

The debug protocol is defined in terms of MessagePack, which you can think of as being like a binary form of JSON. The PR includes documentation of the protocol, and we hope that basing it on an existing serialization format will make implementing it easy for those who wish to do so.

The implemented features available through the protocol include:

  • Suspending and resuming all threads, or individual threads
  • Enumerating threads and, when they are suspended, getting their stack traces
  • Reading the lexical variables of a callframe
  • Setting breakpoints and getting notified if they are hit (and, optionally, suspending execution)
  • Stepping
  • Fetching object attributes, array elements, and hash keys/values

We’ve more plans for the future, but this is already a quite decent feature set that opens up a lot of possibilities.

A Perl 6 library and basic debug CLI

Along with the debug server implementation in MoarVM, we’re also releasing a Perl 6 module that speaks the debug protocol, along with a basic, but useful, command line application built using it. These aren’t things that we directly needed, but we hope the CLI will be useful (and, of course, contributions are welcome – I’m sure I’ll be patching it some), and that the library will pave the way for further interesting tools that might be built in terms of the debug protocol.

In Closing

All being well, the next MoarVM release will include remote debug support. With time, this will lead to much better debugging support for Perl 6, to keep pace with the growing size and concurrent nature of applications people are building in Perl 6 today. Enjoy!

Of sisters, stacks, and CPAN

Recently, an open letter was published by Elizabeth Mattijsen, a long-time contributor to the Perl community who has in recent years contributed copiously to Rakudo Perl 6, as well as working to promote Perl (both 5 and 6) at numerous events. The letter made a number of points, some of them yielding decidedly unhappy responses. I’ve been asked a number of times by now for my take on the letter and, having had time to consider things, I’ve decided to write up my own thoughts on the issues at hand.

Oh sister

A number of years back, I played a part in forming the “sister language narrative”. The reality – then and now – is that both Perl 5 and Perl 6 have userbases invested in them and a team willing to develop the languages and their implementations. These userbases partly overlap, and partly consist of people with an interest in only one or the other.

Following the Perl mantra that “there’s more than one way to do it”, I saw then – and see now – no reason for the advent of Perl 6 to impose an “end of life” on Perl 5. During the many years that Perl 6 took to converge on a stable language release with a production-ready implementation, Perl 5 continued to evolve to better serve its userbase. And again, I see no reason why it should not continue to do so. For one, Perl 6 is not a drop-in replacement for Perl 5, and it’s unreasonable to expect everyone using Perl 5 today to have a business case for migrating their existing systems to use Perl 6. When there’s a team eager to carry Perl 5 forwards, what’s to lose in continuing to work towards serving that Perl 5 userbase better? Caring about and supporting an existing userbase is a sign of a mature language development community. It shows that we’re not about “move fast and break stuff”, but “you can trust us with your stuff”.

Moreover, it’s very much the case that Perl 5 and Perl 6 make different trade-offs. To pick one concrete example, Perl 6 makes it easy to run code across multiple threads, and even uses multiple threads internally (for example, performing optimization and JIT compilation on a background thread). Which is great…except the only winning move in a game involving both threads and fork() is not to play. Sometimes one just can’t have their cake and eat it, and if you’re wanting a language that more directly gives you your POSIX, that’s probably always going to be a strength of Perl 5 over Perl 6.

For these reasons and more, it was clear to me that the best way forward was to simply accept, and rejoice in, the Perl community having multiple actively developed languages to offer the world. So where did the sister language narrative come in?

The number 6 happens to be a larger number than 5, and this carries some implications. I guess at the outset of the Perl 6 project, it was indeed imagined that Perl 6 would be a successor to Perl 5. By now, it’s instead like – if you’ll excuse me a beer analogy – Rochefort 8 and Rochefort 10: both excellent beers, from the same brewery, who have no reason to stop producing the 8 simply because they produce the 10. I buy both, and they’re obviously related, though different, and of course I have my preference, but I’m glad they both exist.

The point of the “sister language” narrative was to encourage those involved with Perl 5 and Perl 6 to acknowledge that both languages will continue to move forward, and to choose their words and actions so as to reflect this.

I continue to support this narrative, both in a personal capacity, and as the Rakudo Perl 6 Pumpking. (For those curious, “pumpking” is a cute Perl-y word for “project leader”, although one could be forgiven for guessing it means “flatulent monarch”.) Therefore, I will continue to choose my words and actions so as to support it, unless a new consensus with equally broad community buy-in comes to replace it.

I accept that this narrative may not be perfect for everyone, but it has been quite effective in encouraging those who feel strongly about just Perl 5 or just Perl 6 to focus their efforts constructively on building things, rather than trying to tear down the work of others. Therefore, it was no surprise to me that, when Liz’s open letter and follow-up comments went against that narrative, the consequences were anything but constructive.

I can’t, and don’t feel I should try to, control the views of others who contribute to Perl 6. Within reason, a diversity of views is part of a healthy community. I do, however, have a few things to ask of everyone. Firstly, when expressing a view that is known not to have widespread community consensus, please go to some lengths to make it clear it is a personal position. Secondly, when reading somebody’s expressed position on a matter, don’t assume it is anything more than their position unless clearly stated. And – perhaps most importantly – please also go to lengths to discuss topics calmly and politely. I was deeply disappointed by the tone of many of the discussions I saw taking place over the last week. We can do better.

Perl 5 on the Rakudo stack?

The next section of the letter considered the possibility of a “Butterfly Perl 5” project: effectively, a port of Perl 5 to run atop of the Rakudo stack (in reality, this doesn’t involve Rakudo much at all, but rather NQP and MoarVM). As a compiler geek, this of course piques my interest, because what could be more fun than writing a compiler? :-) And before anyone gets excited – no, I don’t have the time to work on such a thing. But, in common with Liz, I’d be supportive of anyone wishing to experiment in that direction. There will be some deep challenges, and I’ll issue my usual warning that syntax is what gets all the attention but semantics are what will eat you alive.

Where I will disagree, however, is on the idea of a moratorium on new features in Perl 5. These appear slowly anyway (not because Perl 5 Porters are inefficient, but because adding new features to such a widely used language with a 30-year-old runtime is just a really darn hard thing to be doing). Given the immense technical challenges a Perl 5-on-new-VM effort would already face, the odd new feature would be a drop in the bucket.

My one piece of unsolicited advice to those working on Perl 5 would be to borrow features from Perl 6 very carefully, because they are generally predicated on a different type system and many other details that differ between the Perls. (My diagnosis of why smartmatch has presented such a challenge in Perl 5 is because it was designed assuming various aspects of the Perl 6 type system, which don’t map neatly to Perl 5. This isn’t surprising, given Perl 6 very consciously set out to do things differently here.) But should Perl 5 simply stop exploring ways to make things better for its userbase? No, I don’t think so. From a Perl 5 user point of view (which is the only one I have), adding subroutine signatures – which deliver existing semantics with less boilerplate – feels like a sensible thing to do. And adding features to keep up with the needs of the latest versions of the Unicode specification is a no-brainer. “Perl (5 or 6) is great at Unicode” is a meme that helps us all.

The CPAN butterfly plan

The final part of the letter proposes a focused porting effort of Perl 5 modules to Perl 6. This idea has my support. While the Perl 6 module ecosystem has been growing – and that’s wonderful – there’s still a good number of things missing. Efforts to address that expand the range of tasks to which Perl 6 can be conveniently applied. Of course, one can already reach such modules with the excellent Inline::Perl5. Some folks have reservations about dual runtimes in production, however, and interop between two runtimes has some marshalling cost – although with recent optimization work, the cost has been brought well down from what it was.

Of course, in some areas it’s strategic to build something afresh that takes into account the opportunities offered by Perl 6. That’s why I designed Cro by reading RFCs and considering the best way to implement them in Perl 6. Even then, I did it with the benefit of 20 years’ experience doing web stuff – without which, I suspect the result would have been rather less useful. Porting existing modules – taking time to tweak their APIs to feel more comfortable in Perl 6 – means that existing knowledge built up by the authors of those modules can be carried forward, which is surely a win.

In closing

I’ve had the privilege of working with Liz for a number of years. While her open letter makes a number of points I disagree with, I have no reason to believe it wasn’t written out of good intentions. Some people have wondered if any good can come of it. That’s up to us as a community. I think we can, at least, take away:

  • A reminder of the importance that the “sister language” narrative plays in our community. This should encourage all of us to – at the very least – quietly tolerate it.
  • Further evidence that the only way that narrative will be replaced is by following the same path that created it: through quiet, thoughtful, discussions with all major stakeholders, to reach a consensus.
  • That a Perl 5 atop of NQP/MoarVM effort – should anybody wish to try it – will receive some support. As I’ve noted, the technical challenges are considerable, and deep. But they were for implementing Perl 6 too, and we’ve managed that. :-)
  • A much-needed injection of energy into increasing the number of modules in the Perl 6 ecosystem, taking inspiration from (and so hopefully avoiding mistakes already made and rectified by) Perl 5 modules.

And, last but not least, it has been clear that – while it has in the last days often been expressed in raw and heated ways – we, as the Perl community, have two languages we’re passionate about and are keen to drive forward. Let’s do that together, in peace, not in pieces.
