My Perl 6 wishes for 2019

This evening, I enjoyed the New Year’s fireworks display over the beautiful Prague skyline. Well, the bit of it that was between me and the fireworks, anyway. Rather than having its fireworks display at midnight, Prague puts it at 6pm on New Year’s Day. That makes it easy for families to go to, which is rather thoughtful. It’s also, for those of us with plans to dig back into work the following day, a nice end to the festive break.

Prague fireworks over Narodni Divadlo

So, tomorrow I’ll be digging back into work, which of late has involved a lot of Perl 6. Having spent so many years working on Perl 6 compiler and runtime design and implementation, it’s been fun to spend a good amount of 2018 using Perl 6 for commercial projects. I’m hopeful that will continue into 2019. Of course, I’ll be continuing to work on plenty of Perl 6 things that are for the good of all Perl 6 users too. In this post, I’d like to share some of the things I’m hoping to work on or achieve during 2019.

Partial Escape Analysis and related optimizations in MoarVM

The MoarVM specializer learned plenty of new tricks this year, delivering some nice speedups for many Perl 6 programs. Many of my performance improvement hopes for 2019 center around escape analysis and optimizations stemming from it.

The idea is to analyze object allocations, and find pieces of the program where we can fully understand all of the references that exist to the object. The point at which we can no longer do that is where the object escapes. In the best cases, an object never escapes; in other cases, there are a number of reads and writes performed to its attributes up until its escape.

Armed with this, we can perform scalar replacement, which involves placing the attributes of the object into local registers up until the escape point, if any. As well as reducing memory operations, this means we can often prove significantly more program properties, allowing further optimization (such as getting rid of dynamic type checks). In some cases, we might never need to allocate the object at all; this should be a big win for Perl 6, which by its design creates lots of short-lived objects.
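
As a hypothetical illustration – the class and routine here are mine, and whether a given allocation actually gets removed will depend on how the analysis lands – this is the sort of code that stands to benefit:

class Point {
    has $.x;
    has $.y;
}
sub offset-length($dx, $dy) {
    # This Point never leaks out of the routine, so escape analysis could
    # prove that, scalar-replace $!x and $!y into registers, and ideally
    # avoid allocating the object at all.
    my $p = Point.new(x => $dx, y => $dy);
    sqrt($p.x ** 2 + $p.y ** 2)
}
say offset-length(3, 4);    # 5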

There will be various code-generation and static optimizer improvements to be done in Rakudo in support of this work also, which should result in its own set of speedups.

Expect to hear plenty about this in my posts here in the year ahead.

Decreasing startup time and base memory use

The current Rakudo startup time is quite high. I’d really like to see it fall to around half of what it currently is during 2019. I’ve got some concrete ideas on how that can be achieved, including changing the way we store and deserialize NFAs used by the parser, and perhaps also dealing with the way we currently handle method caches to have less startup impact.

Both of these should also decrease the base memory use, which is likewise a good bit higher than I would like.

Improving compilation times

Some folks – myself included – are developing increasingly large applications in Perl 6. For the current major project I’m working on, runtime performance is not an issue at this point, but I certainly find myself waiting a bit on compiles. Part of it is parse performance, and I’d like to look at that; in doing so, I’d also be able to speed up handling of all Perl 6 grammars.

I suspect there are some good wins to be had elsewhere in the compilation pipeline too, and the startup time improvements described above should also help, especially when we pre-compile deep dependency trees. I’d also like to look into whether we can do some speculative parallel compilation.

Research into concurrency safety

In Perl 6.d, we got non-blocking await and react support, which greatly improved the scalability of Perl 6 concurrent and parallel programs. Now many thousands of outstanding tasks can be juggled across just a handful of threads (the exact number chosen according to demand and CPU count).
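
As a small illustrative sketch (the numbers are arbitrary), something like this can now keep ten thousand tasks in flight without tying up ten thousand threads, because the await inside start releases the thread while waiting:

my @jobs = (1..10_000).map: -> $n {
    start {
        await Promise.in(1);   # non-blocking: the thread is released while waiting
        $n * 2
    }
}
my @results = await @jobs;
say @results.elems;            # 10000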

For Perl 6.e, which is still a good way off, I’d like to have something to offer in terms of making Perl 6 concurrent and parallel programming safer. While we have a number of higher-level constructs that eliminate various ways to make mistakes, it’s still possible to get into trouble and have races when using them.

So, I plan to spend some time this year quietly exploring and prototyping in this space. Obviously, I want something that fits in with the Perl 6 language design, and that catches real and interesting bugs – probably by making things that are liable to occasionally explode in weird ways instead fail reliably in helpful ways, so that they show up in tests.
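
For instance, even with only high-level constructs in play, this sketch harbours a classic data race on a shared Scalar – exactly the kind of thing I’d like future tooling to surface reliably:

my $total = 0;
my @workers = (^10).map: { start { $total++ for ^1_000 } };
await @workers;
say $total;   # often less than 10000: the unsynchronized $total++ races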

Get Cro to its 1.0 release

In the 16 months since I revealed it, Cro has become a popular choice for implementing HTTP APIs and web applications in Perl 6. It has also attracted code contributions from a couple of dozen contributors. This year, I aim to see Cro through to its 1.0 release. That will include (to save you following the roadmap link):

  • Providing flexible HTTP reverse proxying support
  • Providing a number of robustness patterns (timeout, retry, circuit breaker, and so forth)
  • Providing Cro support for some message queue protocols
  • Ensuring Cro’s features work on macOS and Windows
  • Providing some further help for those building server-side web applications (templating, an easier time setting up common login flows, etc.)
  • Conducting a security review
  • Specifying a stability/deprecation policy
  • Lots of polishing

Comma Community, and lots of improvements and features

I founded Comma IDE in order to bring Perl 6 a powerful Integrated Development Environment. We’ve come a long way since the Minimum Viable Product we shipped back in June to the first subscribers to the Comma Supporter Program. In recent months, I’ve used Comma almost daily on my various Perl 6 projects, and by this point honestly wouldn’t want to be without it. Like Cro, I built Comma because it’s a product I wanted to use, which I think is a good place to be in when building any product.

In a few months’ time, we expect to start offering Comma Community and Comma Complete. The former will be free of charge, and the latter a commercial offering under a subscribe-for-updates model (just like how the supporter program has worked so far). My own Comma wishlist is lengthy enough to keep us busy for a lot more than the next year, and that’s before considering things Comma users are asking for. Expect plenty of exciting new features, as well as ongoing tweaks to make each release feel that little bit nicer to use.

Speaking, conferences, workshops, etc.

This year will see me giving my first keynote at a European Perl Conference. I’m looking forward to being in Riga again; it’s a lovely city to wander around, and I remember having some pretty nice food there too. The keynote will focus on the concurrent and parallel aspects of Perl 6; thankfully, I’ve still a good six months to figure out exactly what angle I wish to take on that, having spoken on the topic many times before!

I also plan to submit a talk or two for the German Perl Workshop, and will probably find the Swiss Perl Workshop hard to resist attending once more. And, more locally, I’d like to try and track down other Perl folks here in Prague, and see if I can help some kind of Praha.pm to happen again.

I need to keep my travel down to sensible levels, but might be able to fit in the odd other bit of speaking during the year, if it’s not too far away.

Teaching

While I want to spend most of my time building stuff rather than talking about it, I’m up for the occasional bit of teaching. I’m considering pitching a 1-day Perl 6 concurrency workshop to the Riga organizers. Then we’ll see if there’s enough folks interested in taking it. It’ll cost something, but probably much less than any other way of getting a day of teaching from me. :-)

So, down to work!

Well, a good night’s sleep first. :-) But tomorrow, another year of fun begins. I’m looking forward to it, and to working alongside many wonderful folks in the Perl community.


Speeding up object creation

Recently, a Perl 6 object creation benchmark result did the rounds on social media. This Perl 6 code:

class Point {
    has $.x;
    has $.y;
}
my $total = 0;
for ^1_000_000 {
    my $p = Point.new(x => 2, y => 3);
    $total = $total + $p.x + $p.y;
}
say $total;

Now (on HEAD builds of Rakudo and MoarVM) runs faster than this roughly equivalent Perl 5 code:

use v5.10;

package Point;

sub new {
    my ($class, %args) = @_;
    bless \%args, $class;
}

sub x {
    my $self = shift;
    $self->{x}
}

sub y {
    my $self = shift;
    $self->{y}
}

package main;

my $total = 0;
for (1..1_000_000) {
    my $p = Point->new(x => 2, y => 3);
    $total = $total + $p->x + $p->y;
}
say $total;

(Aside: yes, I know there’s no shortage of libraries in Perl 5 that make OO nicer than this, though those I tried also made it slower.)

This is a fairly encouraging result: object creation, method calls, and attribute access are the operational bread and butter of OO programming, so it’s a pleasing sign that Perl 6 on Rakudo/MoarVM is getting increasingly speedy at them. In fact, it’s probably a bit better at them than this benchmark’s raw numbers show, since:

  • The measurements include startup time. Probably 10%-15% of the time Rakudo spends executing the program is startup, parsing, compilation, etc. (noting the compiler is itself written in Perl 6). By contrast, Perl 5 has really fast startup and a parser written in C.
  • The math in Perl 6 is arbitrary precision (that is, will upgrade to a big integer if needed); it costs something to handle that, even if it doesn’t happen
  • Every math operation allocates a new Int object; there’s two uses of + here, so that means two new Int allocations per iteration of the loop, which then have to be garbage collected

While dealing with Int values got faster recently, it’s still really making two Int objects every time around that loop and having to GC them later. An upcoming new set of analyses and optimizations will let us get rid of that cost too. And yes, startup will get faster with time also. In summary, while Rakudo/MoarVM is now winning that benchmark against Perl 5, there’s still lots more to be had. Which is a good job, since the equivalent Python and Ruby versions of that benchmark still run faster than the Perl 6 one.

In this post, I’ll look at the changes that allowed this benchmark to end up faster. None of the new work was particularly ground-breaking; in fact, it was mostly a case of doing small things to make better use of optimizations we already had.

What happens during construction?

Theoretically, the default new method in turn calls bless, passing the named arguments along. The bless method then creates an object instance, followed by calling BUILDALL. The BUILDALL method goes through the set of steps needed for constructing the object. In the case of a simple object like ours, that involves checking if an x and y named argument were passed, and if so assigning those values into the Scalar containers of the object attributes.

For those keeping count, that’s at least 3 method calls (new, bless, and BUILDALL).

However, there’s a cheat. If bless wasn’t overridden (which would be an odd thing to do anyway), then the default new can just call BUILDALL directly. Therefore, new looks like this:

multi method new(*%attrinit) {
    nqp::if(
      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),
      nqp::create(self).BUILDALL(Empty, %attrinit),
      $bless(|%attrinit)
    )
}

The BUILDALL method was originally a little “interpreter” that went through a per-object build plan stating what needs to be done. However, for quite some time now we’ve instead compiled that build plan into a per-class BUILDALL method.

Slurpy sadness

To figure out how to speed this up, I took a look at how the specializer was handling the code. The answer: not so well. There were certainly some positive moments in there. Of note:

  • The x and y accessor methods were being inlined
  • The + operators were being inlined
  • Everything was JIT-compiled into machine code

However, the new method was getting only a “certain” specialization, which is usually only done for very heavily polymorphic code. That wasn’t the case here; this program clearly constructs overwhelmingly one type of object. So what was going on?

In order to produce an “observed type” specialization – the more powerful kind – it needs to have data on the types of all of the passed arguments. And for the named arguments, that was missing. But why?

Logging of passed argument types is done on the callee side, since:

  • If the callee is already specialized, then we don’t want to waste time in the caller doing throwaway logging
  • If the callee is not specialized, but the caller is, then we don’t want to miss out on logging types (and we only log when running unspecialized code)

The argument logging was done as the interpreter executed each parameter processing instruction. However, when there was a slurpy, it would just swallow up all the remaining arguments without logging type information. Thus the information about the argument types was missing, and we ended up with a less powerful form of specialization.

Teaching the slurpy handling code about argument logging felt a bit involved, but then I realized there was a far simpler solution: log all of the things in the argument buffer at the point an unspecialized frame is being invoked. We’re already logging the entry to the call frame at that point anyway. This meant that all of the parameter handling instructions got a bit simpler too, since they no longer had logging to do.

Conditional elimination

Having new specialized in a more significant way was an immediate win. Of note, this part:

      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),

Was quite interesting. Since we were now specializing on the type of self, then the findmethod could be resolved into a constant. The resolution of a method on the constant Mu was also a constant. Therefore, the result of the eqaddr (same memory address) comparison of two constants should also have been turned into a constant…except that wasn’t happening! This time, it was simple: we just hadn’t implemented folding of that yet. So, that was an easy win, and once done meant the optimizer could see that the if would always go a certain way and thus optimize the whole thing into the chosen branch. Thus the new method was specialized into something like:

multi method new(*%attrinit) {
    nqp::create(self).BUILDALL(Empty, %attrinit)
}

Further, the create could be optimized into a “fast create” op, and the relatively small BUILDALL method could be inlined into new. Not bad.

Generating simpler code

At this point, things were much better, but still not quite there. I took a look at the BUILDALL method compilation, and realized that it could emit faster code.

The %attrinit is a Perl 6 Hash object, which is for the most part a wrapper around the lower-level VM hash object, which is the actual hash storage. We were, curiously, already pulling this lower-level hash out of the Hash object and using the nqp::existskey primitive to check if there was a value, but were not doing that for actually accessing the value. Instead, an .AT-KEY('x') method call was being done. While that was being handled fairly well – inlined and so forth – it also does its own bit of existence checking. I realized we could just use the nqp::atkey primitive here instead.

Later on, I also realized that we could do away with nqp::existskey and just use nqp::atkey. Since a VM-level null is something that never exists in Perl 6, we can safely use it as a sentinel to mean that there is no value. That got us down to a single hash lookup.
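
To give an idea of the shape of the result, here is a rough, hand-written approximation of a per-class BUILDALL for the Point class from the benchmark. This is only a sketch: the real generated code differs in detail, and the sigilless names are purely illustrative.

method BUILDALL(@auto, %attrinit) {
    my \storage := nqp::getattr(%attrinit, Map, '$!storage');  # low-level VM hash
    my \x-val := nqp::atkey(storage, 'x');                     # VM-level null if absent
    nqp::getattr(self, Point, '$!x') = x-val unless nqp::isnull(x-val);
    my \y-val := nqp::atkey(storage, 'y');
    nqp::getattr(self, Point, '$!y') = y-val unless nqp::isnull(y-val);
    self
}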

By this point, we were just about winning the benchmark, but only by a few percent. Were there any more easy pickings?

An off-by-one

My next surprise was that the new method didn’t get inlined. It was within the size limit. There was nothing preventing it. What was going on?

Looking closer, it was even worse than that. Normally, when something is too big to be inlined, but we can work out what specialization will be called on the target, we do “specialization linking”, indicating which specialization of the code to call. That wasn’t happening. But why?

Some debugging later, I sheepishly fixed an off-by-one in the code that looks through a multi-dispatch cache to see if a particular candidate matches the set of argument types we have during optimization of a call instruction. This probably increased the performance of quite a few calls involving passing named arguments to multi-methods – and meant new was also inlined, putting us a good bit ahead on this benchmark.

What next?

The next round of performance improvements – both to this code and plenty more besides – will come from escape analysis, scalar replacement, and related optimizations. I’ve already started on that work, though expect it will keep me busy for quite a while. I will, however, be able to deliver it in stages, and am optimistic I’ll have the first stage of it ready to talk about – and maybe even merge – in a week or so.


Eliminating unrequired guards

MoarVM’s optimizer can perform speculative optimization. It does this by gathering statistics as the program is interpreted, and then analyzing them to find out what types and callees typically show up at given points in the program. If it spots there is at least a 99% chance of a particular type showing up at a particular program point, then it will optimize the code ahead of that point as if that type would always show up.

Of course, statistics aren’t proofs. What about the 1% case? To handle this, a guard instruction is inserted. This cheaply checks if the type is the expected one, and if it isn’t, falls back to the interpreter. This process is known as deoptimization.
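
For example – a sketch; the real thresholds and decisions are up to the specializer:

sub double($x) { $x * 2 }
double($_) for ^100_000;   # statistics will show $x is always an Int at this site
say double(4.2);           # the rare case: the Int guard fails, we deoptimize and
                           # carry on, correctly but more slowly, in the interpreter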

Just how cheap are guards?

I just stated that a guard cheaply checks if the type is the expected one, but just how cheap is it really? There are both direct and indirect costs.

The direct cost is that of the check. Here’s a (slightly simplified) version of the JIT compiler code that produces the machine code for a type guard.

/* Load object that we should guard */
| mov TMP1, WORK[obj];
/* Get type table we expect and compare it with the object's one */
MVMint16 spesh_idx = guard->ins->operands[2].lit_i16;
| get_spesh_slot TMP2, spesh_idx;
| cmp TMP2, OBJECT:TMP1->st;
| jne >1;
/* We're good, no need to deopt */
| jmp >2;
|1:
/* Call deoptimization handler */
| mov ARG1, TC;
| mov ARG2, guard->deopt_offset;
| mov ARG3, guard->deopt_target;
| callp &MVM_spesh_deopt_one_direct;
/* Jump out to the interpreter */
| jmp ->exit;
|2:

Where get_spesh_slot is a macro like this:

|.macro get_spesh_slot, reg, idx;
| mov reg, TC->cur_frame;
| mov reg, FRAME:reg->effective_spesh_slots;
| mov reg, OBJECTPTR:reg[idx];
|.endmacro

So, in the case that the guard matches, it’s 7 machine instructions (note: it’s actually a couple more because of something I omitted for simplicity). Thus there’s the cost of the time to execute them, plus the space they take in memory and, especially, the instruction cache. Further, one is a conditional jump. We’d expect it to be false most of the time, and so the CPU’s branch predictor should get a good hit rate – but branch predictor usage isn’t entirely free of charge either. Effectively, it’s not that bad, but it’s nice to save the cost if we can.

The indirect costs are much harder to quantify. In order to deoptimize, we need to have enough state to recreate the world as the interpreter expects it to be. I wrote on this topic not so long ago, for those who want to dive into the detail, but the essence of the problem is that we may have to retain some instructions and/or forgo some optimizations so that we are able to successfully deoptimize if needed. Thus, the presence of a guard constrains what optimizations we can perform in the code around it.

Representation problems

A guard instruction in MoarVM originally looked like:

sp_guard r(obj) sslot uint32

Where r(obj) is an object register to read containing the object to guard, the sslot is a spesh slot (an entry in a per-block constant table) containing the type we expect to see, and the uint32 indicates the target address after we deoptimize. Guards are inserted after instructions for which we had gathered statistics and determined there was a stable type. Things guarded include return values after a call, reads of object attributes, and reads of lexical variables.

This design has carried us a long way; however, it has a major shortcoming. The program is represented in SSA form. Thus, an invoke followed by a guard might look something like:

invoke r6(5), r4(2)
sp_guard r6(5), sslot(42), litui32(64)

Where r6(5) has the return value written into it (and thus is a new SSA version of r6). We hold facts about a value (if it has a known type, if it has a known value, etc.) per SSA version. So the facts about r6(5) would be that it has a known type – the one that is asserted by the guard.

The invoke itself, however, might be optimized by performing inlining of the callee. In some cases, we might then know the type of result that the inlinee produces – either because there was a guard inside of the inlined code, or because we can actually prove the return type! However, since the facts about r6(5) were those produced by the guard, there was no way to talk about what we know of r6(5) before the guard and after the guard.

More awkwardly, while in the early days of the specializer we only ever put guards immediately after the instructions that read values, more recent additions might insert them at a distance (for example, in speculative call optimizations and around spesh plugins). In this case, we could not safely set facts on the guarded register, because those might lead to wrong optimizations being done prior to the guard.

Changing of the guards

Now a guard instruction looks like this:

sp_guard w(obj) r(obj) sslot uint32

Or, concretely:

invoke r6(5), r4(2)
sp_guard r6(6), r6(5), sslot(42), litui32(64)

That is to say, it introduces a new SSA version. This means that we get a way to talk about the value both before and after the guard instruction. Thus, if we perform an inlining and we know exactly what type it will return, then that type information will flow into the input – in our example, r6(5) – of the guard instruction. We can then notice that the property the guard wants to assert is already upheld, and replace the guard with a simple set (which may itself be eliminated by later optimizations).

This also solves the problem with guards inserted away from the original write of the value: we get a new SSA version beyond the guard point. This in turn leads to more opportunities to avoid repeated guards beyond that point.

Quite a lot of return value guards on common operations simply go away thanks to these changes. For example, in $a + $b, where $a and $b are Int, we will be able to inline the + operator, and we can statically see from its code that it will produce an Int. Thus, the guard on the return type in the caller of the operator can be eliminated. This saves the instructions associated with the guard, and potentially allows for further optimizations to take place since we know we’ll never deoptimize at that point.

In summary

MoarVM does lots of speculative optimization. This enables us to optimize in cases where we can’t prove a property of the program, but statistics tell us that it mostly behaves in a certain way. We make this safe by adding guards, and falling back to the general version of the code in cases where they fail.

However, guards have a cost. By changing our representation of them, so that we model the data coming into the guard and after the guard as two different SSA versions, we are able to eliminate many guard instructions. This not only reduces duplicate guards, but also allows for elimination of guards when the broader view afforded by inlining lets us prove properties that we weren’t previously able to.

In fact, upcoming work on escape analysis and scalar replacement will allow us to start seeing into currently opaque structures, such as Scalar containers. When we are able to do that, then we’ll be able to prove further program properties, leading to the elimination of yet more guards. Thus, this work is not only immediately useful, but also will help us better exploit upcoming optimizations.


Faster box/unbox and Int operations

My work on Perl 6 performance continues, thanks to a renewal of my grant from The Perl Foundation. I’m especially focusing on making common basic operations faster, the idea being that if those go faster, then programs composed out of them should too. This appears to be working out well: I’ve not been directly trying to make the Text::CSV benchmark run faster, but that’s resulted from my work.

I’ll be writing a few posts on various of the changes I’ve made. This one will take a look at some related optimizations around boxing, unboxing, and common mathematical operations on Int.

Boxing and unboxing

Boxing is taking a natively typed value and wrapping it into an object. Unboxing is the opposite: taking an object that wraps a native value and getting the native value out of it.

In Perl 6, these are extremely common. Num and Str are boxes around num and str respectively. Thus, unless dealing with natives explicitly, working with floating point numbers and strings will involve lots of box and unbox operations.
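
For instance (an illustrative sketch):

my num $distance = 1.5e3;       # a native num; no object involved
my $boxed = $distance;          # storing it in an untyped variable boxes it into a Num
my num $doubled = $boxed * 2e0; # using it as a native again requires an unbox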

There’s nothing particularly special about Num and Str. They are normal objects with the P6opaque representation, meaning they can be mixed into. The only thing that makes them slightly special is that they have an attribute marked as being a box target. This singles the attribute out as the one to write to or read from in a box or unbox operation.

Thus, a box operation is something like:

  • Create an instance of the box type
  • Find out where in that object to write the value to
  • Write the value there

While unbox is:

  • Find out where in the object to read a value from
  • Read it from there

Specialization of box and unbox

box is actually two operations: an allocation and a store. We know how to fast-path allocations and JIT them relatively compactly, however that wasn’t being done for box. So, step one was to decompose this higher-level op into the allocation and the write into the allocated object. The first step could then be optimized in the usual way allocations are.

In the unspecialized path, we first find out where to write the native value to, and then write it. However, when we’re producing a specialized version, we almost always know the type we’re boxing into. Therefore, the offset to write to can be calculated once, and a very simple instruction emitted that writes at that offset into the object. This JITs extremely well.

There are a couple of other tricks. Binds into a P6opaque generally have to check that the object wasn’t mixed into; however, since we just allocated it, we know that can’t be the case and can skip that check. Also, a string is a garbage-collectable object, and when assigning one GC-able object into another one, we need to perform a write barrier for the sake of generational GC. However, since the object was just allocated, we know very well that it is in the nursery, and so the write barrier will never trigger. Thus, we can omit it.

Unboxing is easier to specialize: just calculate the offset, and emit a simpler instruction to load the value from there.

I’ve also started some early work (quite a long way from merge) on escape analysis, which will allow us to eliminate many box object allocations entirely. It’s a great deal easier to implement this if allocations, reads, and writes to an object have a uniform representation, so lowering box and unbox operations into these lower-level operations eases the path to implementing escape analysis for them.

What about Int?

Some readers might have wondered why I talked about Num and Str as examples of boxed types, but not Int. It is one too – but there’s a twist. Actually, there’s two twists.

The first is that Int isn’t actually a wrapper around an int, but rather an arbitrary precision integer. When we first implemented Int, we had it always use a big integer library. Of course, this is slow, so later on we made it so any number fitting into a 32-bit range would be stored directly, and only allocate a big integer structure if it’s outside of this range.

Thus, boxing into an Int means range-checking the value to box. If it fits into the 32-bit range, then we can write it directly, and set the flag indicating that it’s a small Int. Machine code to perform these steps is now spat out directly by the JIT, and we only fall back to a function call in the case where we need a big integer. Once again, the allocation itself is emitted in a more specialized way too, and the offset to write to is determined once at specialization time.

Unboxing is similar. Provided we’re specializing on the type of the object to unbox, then we calculate the offset at specialization time. Then, the JIT produces code to check if the small Int flag is set, and if so just reads and sign extends the value into a 64-bit register. Otherwise, it falls back to the function call to handle the real big integer case.

For boxing, however, there was a second twist: we have a boxed integer cache, so for small integers we don’t have to repeatedly allocate objects boxing them. So boxing an integer is actually:

  1. Check if it’s in the range of the box cache
  2. If so, return it from the cache
  3. Otherwise, do the normal boxing operation

When I first did these changes, I omitted the use of the box cache. It turns out, however, to have quite an impact in some programs: one benchmark I was looking at suffered quite a bit from the missing box cache, since it now had to do a lot more garbage collection runs.

So, I reinstated use of the cache, but this time with the JIT doing the range checks in the produced machine code and reading directly out of the cache in the case of a hit. Thus, in the cache hit case, we now don’t even make a single function call for the box operation.

Faster Int operations

One might wonder why we picked 32-bit integers as the limit for the small case of a big integer, and not 64-bit integers. There’s two reasons. The most immediate is that we can then use the 32 bits that would be the lower 32 of a 64-bit pointer to the big integer structure as our “this is a small integer” flag. This works reliably as pointers are always aligned to at least a 4-byte boundary, so a real pointer to a big integer structure would never have the lowest bits set. (And yes, on big-endian platforms, we swap the order of the flag and the value to account for that!)

The second reason is that there’s no portable way in C to detect if a calculation overflowed. However, for the basic math operations, if we have two inputs that each fit into a 32-bit integer, and we perform the operation at 64-bit width, we know that the result can never overflow a 64-bit integer. Thus we can range-check the result and decide whether to store it back into the result object as a 32-bit value, or to store it as a big integer.
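
A rough sketch of that idea follows; store-small and store-big are hypothetical helpers standing in for the real storage code.

sub add-ints(int32 $a, int32 $b) {
    my int64 $wide = $a + $b;                 # cannot overflow at 64-bit width
    -2_147_483_648 <= $wide <= 2_147_483_647
        ?? store-small($wide)                 # hypothetical: fits, store directly
        !! store-big($wide);                  # hypothetical: allocate a big integer
}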

Since Int is immutable, all operations result in a newly allocated object. This allocation – you’ll spot a pattern by now – is open to being specialized. Once again, finding the boxed value to operate on can also be specialized, by calculating its offset into the input objects and result object. So far, so familiar.

However, there’s a further opportunity for improvement if we are JIT-compiling the operations to machine code: the CPU has flags for if the last operation overflowed, and we can get at them. Thus, for two small Int inputs, we can simply:

  1. Grab the values
  2. Do the calculation at 32-bit width
  3. Check the flags, and store it into the result object if no overflow
  4. If it overflowed, fall back to doing it wider and storing it as a real big integer

I’ve done this for addition, subtraction, and multiplication. Those looking for a MoarVM specializer/JIT task might like to consider doing it for some of the other operations. :-)

In summary

Boxing, unboxing, and math on Int all came with various indirections for the sake of generality (coping with mixins, subclassing, and things like IntStr). However, when we are producing type-specialized code, we can get rid of most of the indirections, resulting in being able to perform them faster. Further, when we JIT-compile the optimized result into machine code, we can take some further opportunities, reducing function calls into C code as well as taking advantage of access to the overflow flags.


Redesigning Rakudo’s Scalar

What’s the most common type your Perl 6 code uses? I’ll bet you that in most programs you write, it’ll be Scalar. That might come as a surprise, because you pretty much never write Scalar in your code. But in:

my $a = 41;
my $b = $a + 1;

Then both $a and $b point to Scalar containers. These in turn hold the Int objects. Contrast it with:

my $a := 42;
my $b := $a + 1;

Where there are no Scalar containers. Assignment in Perl 6 is an operation on a container. Exactly what it does depends on the type of the container. With an Array, for example, it iterates the data source being assigned, and stores each value into the target Array. Assignment is therefore a copying operation, unlike binding, which is a referencing operation. Making assignment the shorter thing to type makes it more attractive, and having the more attractive option be the one that decreases the risk of action at a distance is generally a good thing.
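
To make the copying-versus-referencing distinction concrete:

my @source = 1, 2, 3;
my @copy   = @source;    # assignment: iterates @source, copying each value
my @alias  := @source;   # binding: @alias refers to the very same Array
@source[0] = 42;
say @copy[0];            # 1  – the copy is unaffected
say @alias[0];           # 42 – the alias sees the change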

Having Scalar be first-class is used in a number of features:

  • Lazy vivification, so if %a{$x} { ... } will not initialize the hash slot in question, but %a{$x} = 42 will do so (this also works many levels deep)
  • The is rw trait on parameters being able to work together with late-bound dispatch
  • Making l-value routines possible, including every is rw accessor
  • List assignment
  • Using meta-ops on assignment, for example Z=

And probably some more that I forgot. It’s powerful. It’s also torture for those of us building Perl 6 implementations and trying to make them run fast. The frustration isn’t so much the immediate cost of allocating all of those Scalar objects – that of course costs something, but modern GC algorithms can throw away short-lived objects pretty quickly – but rather the difficulties it introduces for program analysis.

Despite all the nice SSA-based analysis we do, tracking the contents of Scalar containers is currently beyond that. Rather than any kind of reasoning to prove properties about what a Scalar holds, we instead handle it through statistics, guards, and deoptimization at the point that we fetch a value from a Scalar. This still lets us do quite a lot, but it’s certainly not ideal. Guards are cheap, but not free.

Looking ahead

Over the course of my current grant from The Perl Foundation, I’ve been working out a roadmap for doing better with optimization in the presence of Scalar containers. Their presence is one of the major differences between full Perl 6 and the restricted NQP (Not Quite Perl), and plays a notable part in the performance difference between the two.

I’ve taken the first big step towards improving this situation by significantly re-working the way Scalar containers are handled. I’ll talk about that in this post, but first I’d like to provide an idea of the overall direction.

In the early days of MoarVM, when we didn’t have specialization or compilation to machine code, it made sense to do various bits of special-casing of Scalar. As part of that, we wrote code handling common container operations in C. We’ve by now reached a point where the C code that used to be a nice win is preventing us from performing the analyses we need in order to do better optimizations. At the end of the day, a Scalar container is just a normal object with an attribute $!value that holds its value. Making all operations dealing with Scalar containers really be nothing more than some attribute lookups and binds would allow us to solve the problem in terms of more general analyses, which stand to benefit many other cases where programs use short-lived objects.

The significant new piece of analysis we’ll want to do is escape analysis, which tells us which objects have a lifetime bounded to the current routine. We understand “current routine” to incorporate those that we have inlined.

If we know that an object’s usage lies entirely within the current routine, we can then perform an optimization known as scalar replacement, which funnily enough has nothing much to do with Scalar in the Perl 6 sense, even if it solves the problems we’re aiming to solve with Scalar! The idea is that we allocate a local variable inside of the current frame for each attribute of the object. This means that we can then analyze them like we analyze other local variables, subject them to SSA, and so forth. This not only gets rid of the allocation of the object, but also lets us do attribute lookups and binds with one level of indirection less. It will also let us reason about the contents of the once-attributes, so that we can eliminate guards that we previously inserted because we only had statistics, not proofs.

So, that’s the direction of travel, but first, Scalar and various operations around it needed to change.

Data structure redesign

Prior to my recent work, a Scalar looked something like:

class Scalar {
    has $!value;        # The value in the Scalar
    has $!descriptor;   # rw-ness, type constraint, name
    has $!whence;       # Auto-vivification closure
}

The $!descriptor held the static information about the Scalar container, so we didn’t have to hold it in every Scalar (we usually have many instances of the same “variable” over a program’s lifetime).

The $!whence was used when we wanted to do some kind of auto-vivification. The closure attached to it was invoked when the Scalar was assigned to, and then cleared afterwards. In an array, for example, the callback would bind the Scalar into the array storage, so that element – if assigned to – would start to exist in the array. There are various other forms of auto-vivification, but they all work in roughly the same way.
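
From the language level, the array case looks something like this:

my @a;
my $slot := @a[2];   # a fresh Scalar; its $!whence closure knows how to enter
                     # it into @a's storage when assigned to
say @a.elems;        # 0 – nothing has been vivified yet
$slot = 'here';      # assignment runs the whence logic, then clears it
say @a.elems;        # 3 – the element now exists in the array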

This works, but closures aren’t so easy for the optimizer to deal with (in short, a closure has to have an outer frame to point to, and so we can’t inline a frame that takes a closure). Probably some day we’ll find a clever solution to that, but since auto-vivification is an internal mechanism, we may as well make it one that we can see a path to making efficient in the near term future.

So, I set about considering alternatives. I realized that I wanted to replace the $!whence closure with some kind of object. Different types of object would do different kinds of vivification. This would work very well with the new spesh plugin mechanism, where we can build up a set of guards on objects. It also will work very well when we get escape analysis in place, since we can then potentially remove those guards after performing scalar replacement. Thus after inlining, we might be able to remove the “what kind of vivification does this assignment cause” checking too.

So this seemed workable, but then I also realized that it would be possible to make Scalar smaller by:

  • Placing the new auto-vivification objects in the $!descriptor slot instead
  • Having the vivification objects point to the original descriptor carrying the name, type, etc.
  • Upon first assignment, running the vivification logic and then replacing the Scalar’s $!descriptor with the simple one carrying the name and type, thus achieving the run-once semantics

This not only makes Scalar smaller, but it means that we can use a single guard check to indicate the course of action we should take with the container: a normal assignment, or a vivification.

The net result: vivification closures go away giving more possibility to inline, assignment gets easier to specialize, and we get a memory saving on every Scalar container. Nice!

C you later

For this to be really worth it from an optimization perspective, I needed to eliminate various bits of C special-case code around Scalar and replace it with standard MoarVM ops. The affected operations were:

  • Assignment
  • Atomic compare and swap
  • Atomic store
  • Handling of return values, including decontainerization
  • Creation of new Scalar containers from a given descriptor

The first three became calls to code registered to perform the operations, using the 6model container API. The final two cases were handled by replacing the calls to C extops with desugars, which is a mechanism that takes something that is used as an nqp::op and rewrites it, as it is compiled, into a more interesting AST, which is then in turn compiled. Happily, this meant I could make all of the changes I needed to without having to go and do a refactor across the CORE.setting. That was nice.

So, now those operations were compiled into bytecode operations instead of ops that were really just calls to C code. Everything was far more explicit. Good! Alas, the downside is that the code we generate gets larger in size.

Optimization with spesh plugins

I talked about specializer plugins in a recent post, where I used them to greatly speed up various forms of method dispatch. However, they are also applicable to optimizing operations on Scalar containers.

The change to decontainerizing return values was especially bad at making the code larger, since it had to do quite a few checks. However, with a spesh plugin, we could just emit a use of the plugin, followed by calling whatever the plugin produces.

Here’s a slightly simplified version of the plugin I wrote, annotated with some comments about what it is doing. The key thing to remember about a spesh plugin is that it is not doing an operation, but rather it’s setting up a set of conditions under which a particular implementation of the operation applies, and then returning that implementation.

nqp::speshreg('perl6', 'decontrv', sub ($rv) {
    # Guard against the type being returned; if it's a Scalar then that
    # is what we guard against here (nqp::what would normally look at
    # the type inside such a container; nqp::what_nd does not do that).
    nqp::speshguardtype($rv, nqp::what_nd($rv));

    # Check if it's an instance of a container.
    if nqp::isconcrete_nd($rv) && nqp::iscont($rv) {
        # Guard that it's concrete, so this plugin result only applies
        # for container instances, not the Scalar type object.
        nqp::speshguardconcrete($rv);

        # If it's a Scalar container then we can optimize further.
        if nqp::eqaddr(nqp::what_nd($rv), Scalar) {
            # Grab the descriptor.
            my $desc := nqp::speshguardgetattr($rv, Scalar, '$!descriptor');
            if nqp::isconcrete($desc) {
                # Has a descriptor, so `rw`. Guard on type of value. If it's
                # Iterable, re-containerize. If not, just decont.
                nqp::speshguardconcrete($desc);
                my $value := nqp::speshguardgetattr($rv, Scalar, '$!value');
                nqp::speshguardtype($value, nqp::what_nd($value));
                return nqp::istype($value, $Iterable) ?? &recont !! &decont;
            }
            else {
                # No descriptor, so it's already readonly. Return as is.
                nqp::speshguardtypeobj($desc);
                return &identity;
            }
        }

        # Otherwise, full slow-path decont.
        return &decontrv;
    }
    else {
        # No decontainerization to do, so just produce identity.
        return &identity;
    }
});

Where &identity is the identity function, &decont removes the value from its container, &recont wraps the value in a new container (so an Iterable in a Scalar stays as a single item), and &decontrv is the slow-path for cases that we do not know how to optimize.

The same principle is also used for assignment, however there are more cases to analyze there. They include:

  • When the type constraint is Mu, and there is a normal (non-vivify) descriptor, then we do a specialization based on the value being the Nil object (in which case we produce the operation that sets $!value back to the default value from the descriptor) or non-Nil (just assign the value, with no need to type check)
  • When the type constraint is something else, and there is a normal (non-vivify) descriptor, then we do a specialization based on the type of the descriptor being assigned. Since the optimizer will often know this already, then we can optimize out the type check
  • When it is an array auto-viv, we produce the exact sequence of binds needed to effect the operation, again taking into account a Mu type constraint and a type constraint that needs to be checked

Vivifying hash assignments are not yet optimized by the spesh plugin, but will be in the near future.

The code selected by the plugin is then executed to perform the operation. In most cases, there will only be a single specialization selected. In that case, the optimizer will inline that specialization result, meaning that the code after optimization is just doing the required set of steps needed to do the work.

Next steps

Most immediately, a change to such a foundational part of the Rakudo Perl 6 implementation has had some fallout. I’m most of the way through dealing with the feedback from toaster (which runs all the ecosystem module tests), being left with a single issue directly related to this work to get to the bottom of. Beyond that, I need to spend some time re-tuning array and hash access to better work with these changes.

Then will come the step that this change was largely in aid of: implementing escape analysis and scalar replacement, which for much Perl 6 code will hopefully give a quite notable performance improvement.

This brings me to the end of my current 200 hours on my Perl 6 Performance and Reliability Grant. Soon I will submit a report to The Perl Foundation, along with an application to continue this work. So, all being well, there will be more to share soon. In the meantime, I’m off to enjoy a week’s much needed vacation.


Dynamic lookups and context introspection with inlining

Inlining is one of the most important optimizations that MoarVM performs. Inlining lets us replace a call to some Block, Sub, or Method with the code that is inside of it. The most immediate benefit is to eliminate the overhead of calling, but that’s just the start. Inlined code has often already been specialized for a certain set of argument types. If we have already proven those argument types in the caller, then there’s no need to re-check them. Inlining can also expose pairs of operations that can cancel, such as box/unbox, and can bring the point where a control exception is thrown into the same body of code where it is caught, which may allow the exception throw to be rewritten to a far cheaper goto.

In a language like Perl 6, where every operator is a call to a multiple dispatch subroutine, inlining can be a significant win. In the best cases, inlining can lead to smaller code, because the thing that is inlined ends up being smaller than the bytecode for the call sequence. Of course, often it leads to bigger code, and so there’s limits to how much of it we really want to do. But still, we’ve been gradually pushing on with increasing the range of things that we’re able to inline.
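
For instance, each use of + below is such a call to the infix:<+> multi – exactly the kind of small, hot call that inlining pays off for:

my $a = 2;
my $b = 3;
say $a + $b;               # really a call to the multi sub infix:<+>
say &infix:<+>($a, $b);    # the same dispatch, written out explicitly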

The problem with inlining is that the very call boundaries it does away with may carry semantic significance for the program. In this post, I’ll talk about a couple of operations that became problematic as we ramped up our inlining capabilities, and discuss a new abstraction I recently added to MoarVM – the frame walker – which provides a common foundation for solving the problem.

A little inlining history

Inlining first showed up in MoarVM back in 2014, not too many months after the type-specializing optimizer was added. MoarVM has done speculative optimizations from the start, performing deoptimization (falling back to the interpreter) in the case that an unexpected situation shows up. But what if we had to deoptimize in code that had been inlined? Then we’d have to pretend we never did the inlines! Therefore, MoarVM can uninline too – that is, untangle the results of inlining and produce a call stack as if we’d been running the unoptimized code all along.

MoarVM has also from the start supported nested inlines – that is, inlining things that themselves contained inlines. However, the initial implementation of inlining was restricted in what it could handle. The first implementation could not inline anything with exception handlers, although that was supported within a couple of months. It also could not inline closures. Only subs in the outermost scope or from the CORE.setting, along with simple method calls, were possible to inline, because those were the only cases where we had enough information about what was being called, which is a decided prerequisite for inlining it.

Aside from bug fixes, things stayed the same until 2017. The focus in that time largely switched away from performance and towards the Perl 6.c release. Summer of 2017 brought some very large changes to how dynamic optimization worked in MoarVM, moving optimization to a background thread, along with changing and extending the statistics that were collected. A new kind of call optimization became possible, whereby if we could not prove what we were going to call, but the statistics showed a pattern, then we could insert a guard and speculatively optimize the call. Speculative inlining fell neatly out of that. Suddenly, a bunch more things could be considered for inlining.

Further work lifted some of the inlining restrictions. Deoptimization learned how to cope if we deoptimized in the middle of processing named arguments, so we could optimize code where that situation occurred. It became possible to inline many closures, by rewriting the lexical lookup operations into an indirection through the code object of the code that we had inlined. It also became possible to inline code involving lexical throws of exceptions and their handlers. Since that is how return works in Perl 6, that again made quite a few more things possible to inline. A more fine-grained analysis allowed us to do some amount of cross-language inlining, meaning bits of the Rakudo internals written in NQP could be inlined into the Perl 6 code calling them, including closure cloning. I’ll add at this point that while it’s easy to write a list of these improvements now, realizing various of them was quite challenging.

Now it’s summer 2018, and my work has delivered some more advances. Previously, we would only do an inlining if we already had produced a specialized version of the callee. This usually worked out, and we sorted by maximum call stack depth and specialized deepest first to help with that. However, sometimes that was not enough, and we missed inlining opportunities. So, during the last month, I added support for producing code to inline on-demand. I also observed that we were only properly doing speculative (that is, based on statistics) inlines of calls made that were expected to return an object, but not those in void context. (If that sounds like an odd oversight, it’s because void calls were previously rare. It was only during the last month, when I improved code-gen to spot a lot more opportunities to emit void context calls, that we got a lot more of them and I spotted the problem.)

More is better, no?

Being able to inline a wider range of calls is a good thing. However, it also made it far more likely that we would run into constructs that don’t cope well with inlining. We’ve got a long way by marking ops that we know won’t cope well with it as :noinline (and then gradually liberalizing that over time where it was beneficial). The improvements over the previous month created a more difficult problem, however. We have a number of ops that allow for introspection and walking of the call stack. These are used to implement Perl 6 features such as the CALLER:: pseudo-package. However, they are also the way that $/ can be set by things like match.
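
As a reminder of the kind of feature at stake, here is a small example of CALLER:: reaching into the calling frame’s scope (the variable must be declared dynamic for this to be permitted); the real uses, of course, live inside Rakudo and modules:

sub log-x() {
    say CALLER::<$x>;               # look up $x in the immediate caller
}
sub demo() {
    my $x is dynamic = 'hello';     # must be dynamic to be visible via CALLER::
    log-x();                        # prints: hello
}
demo();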

Marking the ctx op as :noinline got us a long way. However, we ran into trouble because once a context handle has been obtained, one could then traverse from it to callers or outers, and then start some kind of lookup relative to that point. But what if the caller was an inline? Then we don’t have a call frame to reference in the context object that we return.

A further problem was that a non-introspection form of dynamic lookup, which traverses the lexical chain hanging off each step of the dynamic chain, also was not aware of inlines. In theory, this would have become a problem when we started doing inlining of closures last year. However, since it is used in a tiny number of places, and those places didn’t permit inlining, we didn’t notice until this month, when inlining started to cover more cases.

Normal dynamic lookup, used for $*foo style variables, has been inline-aware for about as long as we’ve had inlining. However, this greatly complicated the lookup code. Replicating such inline compensation code in a bunch of places was clearly a bad idea. It’s a tricky problem, since we’re effectively trying to model a callstack that doesn’t really exist by using information telling us what it would look like if it did. It’s the same problem that deoptimization has to solve, except this time we’re just imagining what the call stack would look like unoptimized, not actually trying to recreate it. It’s certainly not a problem we want solved repeatedly around MoarVM’s implementation.

A new frame walker

To help tackle all of these problems, I introduced a new abstraction: the specialization-aware frame walker. It provides an iterator over the call stack as if no inlining had taken place, figuring out as much as it needs to in order to recreate the information that a particular operation wants.

First, I used it to make caller-dynamic lookup inline-aware. That went pretty well, and immediately fixed one of the module regressions that was “caused” by the recent inlining improvements.

Next, I used it to refactor the normal dynamic lookup. That needed careful work teasing out the details of dynamic lookup and caching of dynamic variable lookups from the inlining traversal. However, the end result was far simpler code, with much less duplication, since the JITted inline, interpreted inline, and non-inline paths largely collapsed and were handled by the frame walker.

Next up, contexts. Alas, this would be trickier.

Embracing laziness

Previously, context traversal had worked eagerly. When we asked for a context object representing the caller or outer of a particular context we already had, we immediately walked one frame in the appropriate direction and produced a result. Of course, this did not go well if there were inlines, since it always walked one real frame, but that may have multiple inlined frames within it.

One possibility was to make the ctx op immediately deoptimize the whole call stack, so that we then had a chain of real call frames to traverse. However, when I looked at some of the places using ctx, it became clear this would have some very negative performance consequences: every regex match causing a global deopt was not going to work!

Another option, that would largely preserve the existing design, was to store extra information about the inline we were inside of at the time we walked to a caller. This, however, had the weakness that we might do a deoptimization between the two points, thus invalidating the information. That was probably also possible to fix up, but the complexity of doing so put me off that approach.

Instead, I switched to a model where “move to caller” and “move to outer” would be stored as displacements to apply when the context object was used in order to obtain information. The frame walker could make these movements, before doing whatever lookup was required. Thus, even if a deoptimization were to take place between obtaining the context and using it, we could still do a correct traversal.

Too much laziness

This helped, but wasn’t quite enough either. If a context handle was taken and used immediately, things worked out fine. However, if the situation was like this:

+-------------------------+
| Frame we used ctx op on | (Frame 1)
+-------------------------+
             |
           Caller
             |
             v
+-------------------------+
|    Frame with inlines   | (Frame 2)
+-------------------------+

And then we used the handle some time later, things worked out less well. The problem was that I used the current return address of Frame 2 in order to understand which inline(s) we were inside of. However, if we’d executed more code in Frame 2, then it could have made another call. The current return address could thus point to the wrong inline. Oops.

However, since the ctx operation at the start of the lookup is never inlined, and a given call frame can only ever be called from one location in the caller, there’s a solution. If the ctx op is used to get a first-class reference to a frame on the call stack, we walk down the call stack and make sure that each frame called from inlined code preserves enough location information that we can later reconstruct what we need. It only needs to walk down the call stack until it sees a point where another ctx operation already preserved that information, so in programs with lots of use of the ctx op, we can avoid doing full stack walks each time, and just walk over recently created frames.

In closing

With those changes, the various Perl 6 modules exhibiting lookup problems since this month’s introduction of more aggressive inlining were fixed. Along the way, a common mechanism was introduced allowing us to walk a call stack as if no inlines had taken place. There’s at least one more place that we can use this: in order to make stack trace output not be sensitive to inlining. I’ll get to that in the coming weeks, or it might make a nice task for somebody looking to get themselves (more) involved with MoarVM development. Last but not least, I’d like to once again thank The Perl Foundation for organizing the funding that made this work possible.


More precise deoptimization usage tracking

In my previous post here, I talked about deoptimization and its implications for usage information. If you didn’t read that post, I suggest reading it before continuing, since the work described in this post builds upon it. Further background on deoptimization and its use in MoarVM may be found in my talk slides from last year’s Swiss Perl Workshop.

An example to consider

To keep things a bit simpler, we’ll look at an NQP program. NQP is a simplified subset of Perl 6, and so its naive compilation – before the optimizer gets at it – is much simpler than that produced by Rakudo. Here’s a small program to consider.

class Wrapper {
    has $!x;
    method x() { $!x }
}
class C { }

sub test($w) {
    my $var := "Used later";
    if nqp::istype($w.x, C) {
        "C"
    }
    else {
        $var
    }
}

my int $i := 0;
my $wrapper := Wrapper.new(x => C.new);
while $i < 1_000_000 {
    say(test($wrapper));
    $i++;
}
say(test(Wrapper.new(x => NQPMu)));

We’ll consider the test subroutine’s optimization. First, let’s walk through the bytecode before optimization. We always have a dummy empty basic block at the start of the graph, used for internal purposes, so we can disregard that.

  BB 0 (0x7f1be817b070):
    line: 7 (pc 0)
    Instructions:
      no_op           
    Successors: 1
    Predecessors: 
    Dominance children: 1

The next basic block starts with a bunch of null instructions, which will be mostly deleted. In the slow path interpreter, we null out registers in case code tries to read them as part of callframe setup. However, we get rid of that in the optimized code by letting the optimizer prove that such work is mostly not needed. Since we haven’t done any optimization yet, here are all of those instructions.

  BB 1 (0x7f1be817b0f8):
    line: 7 (pc 0)
    Instructions:
      null              r5(1)
      null              r4(1)
      null              r3(1)
      null              r1(1)
      null              r0(1)

Next we receive the parameter.

      checkarity      liti16(1), liti16(1)
      param_rp_o        r0(2), liti16(0)
      paramnamesused  

Then we have the line my $var := "Used later";. Rakudo is smarter than NQP in compiling such a thing: it would just emit a reference to a single constant string, rather than boxing it each time as NQP’s simpler code-gen does.

      [Annotation: Line Number: x.nqp:7]
      const_s           r2(1), lits(Used later)
      hllboxtype_s      r3(2)
      box_s             r3(3),   r2(1),   r3(2)
      set               r1(2),   r3(3)

Now we have the code for $w.x. It starts out with a decont, since we may have been passed something in a Scalar container (note that NQP code may be called from full Perl 6 code, so this is possible). We then look up the method and call it.

      [Annotation: INS Deopt One (idx 0 -> pc 46; line 9)]
      [Annotation: Logged (bytecode offset 40)]
      decont            r4(2),   r0(2)
    Successors: 2
    Predecessors: 0
    Dominance children: 2

  BB 2 (0x7f1be817b158):
    line: 9 (pc 46)
    Instructions:
      findmeth          r3(4),   r4(2), lits(x)
    Successors: 3
    Predecessors: 1
    Dominance children: 3

  BB 3 (0x7f1be817b1b8):
    line: 9 (pc 56)
    Instructions:
      [Annotation: INS Deopt One (idx 1 -> pc 56; line 9)]
      prepargs        callsite(0x7f1bf1f0b340, 1 arg, 1 pos, nonflattening, interned)
      arg_o           liti16(0),   r0(2)
      [Annotation: INS Deopt All (idx 3 -> pc 72; line 9)]
      [Annotation: INS Deopt One (idx 2 -> pc 72; line 9)]
      [Annotation: Logged (bytecode offset 66)]
      invoke_o          r3(5),   r3(4)
    Successors: 4
    Predecessors: 2
    Dominance children: 4

Notice how various instructions here are annotated with Deopt points. These are places where we might, after optimization has taken place, insert an instruction that could cause us to deoptimize. The pc 72 refers to the offset into the unoptimized bytecode at which we should continue execution back in the interpreter.

There are also various Logged annotations, which indicate instructions that may log some statistics – for example, about what code is invoked and what type of value it returns (for invoke_o), or what type we get out of a decont operation that actually had to read from a container.

Next up is the type check. Again, there’s a decont instruction, just in case the call to $w.x returned something in a container. We then have the istype instruction.

  BB 4 (0x7f1be817b218):
    line: 9 (pc 72)
    Instructions:
      [Annotation: INS Deopt One (idx 4 -> pc 78; line 9)]
      [Annotation: Logged (bytecode offset 72)]
      decont            r4(3),   r3(5)
    Successors: 5
    Predecessors: 3
    Dominance children: 5

  BB 5 (0x7f1be817b278):
    line: 9 (pc 78)
    Instructions:
      wval              r5(2), liti16(0), liti16(5) (P6opaque: C)
      istype            r6(1),   r4(3),   r5(2)
    Successors: 6
    Predecessors: 4
    Dominance children: 6

Next comes the if part of the branch:

  BB 6 (0x7f1be817b2d8):
    line: 9 (pc 94)
    Instructions:
      unless_i          r6(1),   BB(8)
    Successors: 8, 7
    Predecessors: 5
    Dominance children: 7, 8, 9

  BB 7 (0x7f1be817b338):
    line: 9 (pc 102)
    Instructions:
      const_s           r2(2), lits(C)
      hllboxtype_s      r4(4)
      box_s             r4(5),   r2(2),   r4(4)
      set               r5(3),   r4(5)
      goto              BB(9)
    Successors: 9
    Predecessors: 6
    Dominance children: 

Followed by the else part. And what is r1(2)? It’s the “Used later” string from earlier.

  BB 8 (0x7f1be817b398):
    line: 12 (pc 134)
    Instructions:
      set               r5(4),   r1(2)
    Successors: 9
    Predecessors: 6
    Dominance children: 

Finally, we’re done, and return the result of the branch of the if statement that was executed.

  BB 9 (0x7f1be817b3f8):
    line: 12 (pc 140)
    Instructions:
      PHI               r5(5),   r5(3),   r5(4)
      PHI               r4(6),   r4(5),   r4(3)
      PHI               r2(3),   r2(2),   r2(1)
      return_o          r5(5)
    Successors: 
    Predecessors: 7, 8
    Dominance children: 

How we optimize it

Let’s now walk through the optimized output. The argument handling has been reduced to a single instruction that does an unchecked read of the incoming argument. This is because we’re producing a specialization for a particular input callsite shape and set of input arguments. In this case, it will be a single argument of type Wrapper.

  BB 0 (0x7f1be817b070):
    line: 7 (pc 0)
    Instructions:
      no_op           
    Successors: 1
    Predecessors: 
    Dominance children: 1

  BB 1 (0x7f1be817b0f8):
    line: 7 (pc 0)
    Instructions:
      sp_getarg_o       r0(2), liti16(0)

What comes next is the code to store that "Used later" string. The ops look fine, but do you notice something odd?

      const_s           r2(1), lits(Used later)
      hllboxtype_s      r3(2)
      [Annotation: INS Deopt One (idx 0 -> pc 46; line 9)]
      box_s             r1(2),   r2(1),   r3(2)

Yup, there’s a deopt annotation moved on to that box_s. Huh? Well, let’s look at what comes next.

      [Annotation: INS Deopt One (idx 1 -> pc 56; line 9)]
      sp_getspeshslot   r7(0), sslot(2)
    Successors: 2
    Predecessors: 0
    Dominance children: 2

  BB 2 (0x7f1be8356d38):
    Inlined
    line: 7 (pc 0)
    Instructions:
      [Annotation: FH Start (0)]
      [Annotation: Inline Start (0)]
      [Annotation: INS Deopt Inline (idx 5 -> pc 20; line 8)]
      set               r9(1),   r0(2)
      [Annotation: INS Deopt Inline (idx 6 -> pc 42; line 9)]
      sp_p6ogetvt_o    r11(1),   r9(1), liti16(8), sslot(4)
      [Annotation: FH End (0)]
      set               r3(5),  r11(1)
    Successors: 3
    Predecessors: 3
    Dominance children: 3

Recall that in the unoptimized code we next did $w.x by a findmeth instruction, which came after a decont of $w, and then we did an invocation of that method. What’s happened to all of that lot?

First, since $w is the argument we are producing a specialization for, we thus know it’s Wrapper, and we know that’s not a container type, so the decont can go. Since we also know its type and we know the method name, we can just resolve that method once. The resolution of it is then stored in a “spesh slot”, which you can think of as a constants table for this particular specialization. What follows is, instead of the invocation, the code for the method x() { $!x }, which has been inlined. (The sp_p6ogetvt_o instruction is what attribute lookup has been optimized into.)
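
As a rough user-level analogy (my own, and only an analogy; spesh slots are an internal per-specialization constants table, not something visible from Perl 6 code), the effect is like resolving the method a single time up front and then reusing that resolution on every call:

class Wrapper { has $.x }                     # Perl 6 stand-in for the NQP class
my $cached-x := Wrapper.^find_method('x');    # resolved once, "stored in a slot"
my $w = Wrapper.new(x => 'the attribute');
say $w.$cached-x();                           # later calls reuse the cached resolution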

Oh, and about that Deopt annotation on the box_s? That’s just because code got deleted and it got shifted. We’ll look at the consequences of that later.

Here is the rest of the code:

  BB 3 (0x7f1be817b218):
    line: 9 (pc 72)
    Instructions:
      [Annotation: Inline End (0)]
      [Annotation: FH Goto (0)]
      [Annotation: INS Deopt One (idx 2 -> pc 72; line 9)]
      [Annotation: INS Deopt One (idx 4 -> pc 78; line 9)]
      sp_guardconc      r3(5), sslot(0), litui32(72)
      const_s           r2(2), lits(C)
      hllboxtype_s      r4(4)
      box_s             r5(3),   r2(2),   r4(4)
      PHI               r5(5),   r5(3)
      return_o          r5(5)
    Successors: 
    Predecessors: 2
    Dominance children: 6

Well, that’s pretty different from what we started out with too. What on earth has happened? Where did our if statement go?!

The sp_guardconc instruction is a guard. It checks, in this case, that we have a concrete instance of C in register r3(5). It was inserted because the gathered statistics said that, so far, that had been the case 100% of the time. The guard will deoptimize – that is, fall back to the interpreter – if it fails, but otherwise proceed. Once we have guarded that, the result of the istype becomes a constant. That means we know which way the branch would go, and can delete the other arm. A type check, a conditional branch, and an unconditional branch all go away, replaced by a single cheap guard.
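
If it helps to see that spelled out, here is a hand-written Perl 6 approximation of what the specialized code now does. This is my own illustration of the semantics, not actual specializer output, and deoptimization is modelled by simply calling an unoptimized version of the routine.

class C { }
class Wrapper { has $.x }

# Stand-in for falling back to the interpreter and the original bytecode.
sub unoptimized-test($w) { $w.x ~~ C ?? "C" !! "Used later" }

sub specialized-test($w) {
    my \result = $w.x;                 # the inlined attribute read
    # sp_guardconc: result must be a concrete C, otherwise we "deoptimize".
    return unoptimized-test($w) unless result.DEFINITE && result ~~ C;
    "C"                                # the only branch left after optimization
}

say specialized-test(Wrapper.new(x => C.new));       # C
say specialized-test(Wrapper.new(x => "not a C"));   # Used later (guard fails)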

But what about that “Used later” string?

Notice how we executed:

      box_s             r1(2),   r2(1),   r3(2)

But its result value, r1(2), is completely unused anywhere in the code that we have left after optimization. The instruction was, however, retained, for the sake of deoptimization. In the original code, the value was written prior to a guard that might deoptimize. Were we to throw it away, then after we deoptimized the interpreter would try to read a value that wasn’t written, and crash in some interesting way.

The original approach

The original approach taken to this problem was to:

  1. Whenever we see a Deopt annotation, take its index as our current deopt point
  2. Whenever we see a write, label it with the current deopt point
  3. Whenever we see a read, check whether the deopt point of the write differs from the deopt point of the read. If it does, mark the write as needing to be retained for deopt purposes.

Effectively, if a value written before a deopt point might be read after a deopt point, then we retain it. That was originally done by bumping its usage count. In my last post here, I described how we switched to setting a “needed for deopt” flag instead. But in the grand scheme of things, that changed nothing much about the algorithm described above; only step 3 was changed.

Note that this algorithm works in the case of loops – where we might encounter a value being read in a PHI node prior to seeing it being written – because the lack of a deopt point recorded on the writer will make it unequal to the current deopt point.
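
Spelled out as a toy Perl 6 model of those three steps (the real pass runs over MoarVM’s spesh graphs in C; the Ins class and the little instruction list below are mine, and control flow is ignored for brevity):

class Ins {
    has Str $.writes;          # register version written, if any
    has @.reads;               # register versions read
    has Bool $.deopt = False;  # does this instruction carry a Deopt annotation?
}

my @program =
    Ins.new(writes => 'r1'),                   # e.g. box_s into r1
    Ins.new(deopt => True),                    # a potential deopt point
    Ins.new(writes => 'r2', reads => ['r1']);  # r1 read after the deopt point

my %deopt-point-of-write;      # register version => deopt point when written
my %needed-for-deopt;
my $current-deopt = -1;        # no deopt point seen yet

for @program -> $ins {
    $current-deopt++ if $ins.deopt;                          # step 1
    for $ins.reads -> $r {                                   # step 3
        %needed-for-deopt{$r} = True
            if (%deopt-point-of-write{$r} // -1) != $current-deopt;
    }
    %deopt-point-of-write{$ins.writes} = $current-deopt
        if $ins.writes;                                      # step 2
}
say %needed-for-deopt.keys;    # (r1): written before the deopt point, read after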

Correct, but imprecise

The problem with this approach isn’t with correctness, but rather with precision. A deopt retention algorithm is correct if it doesn’t throw away anything that is needed after a deoptimization. Of course, the simplest possible algorithm would be to mark everything as required, allowing no instruction deletions! The method described above is also correct, and marks fewer things. And for a while, it was enough. However, it came to be a blocker for various other optimizations we wish to do.

There are two particular problems that motivated looking for a more precise way to handle deopt usage. First of all, many instructions that may be replaced with a guard and deoptimize are actually replaced with something else, or even deleted. For example, decont will often be replaced by a set, because we know the operand is not a container type. A set can never trigger deoptimization. However, we had no way to update our deopt usage information based on this change. Therefore, something written before the set that used to be a decont, and read after it but otherwise not needed, would be kept alive because the decont could have had a guard inserted, even though we know it did not.

A larger problem is that even when we might insert a guard, we might later be able to prove it is not needed. Consider:

my int $i = $str.chars;

The chars method will be tiny, so we can inline it. Here’s the code that we currently produce; I’ve shown the end of the inlining of the chars method together with the assignment into $i.

      chars            r15(1),  r14(1)
      hllboxtype_i     r13(1)
      [Annotation: INS Deopt Inline (idx 7 -> pc 134; line -1)]
      box_i            r13(2),  r15(1),  r13(1)
      [Annotation: FH End (0)]
      set               r2(5),  r13(2)
    Successors: 3
    Predecessors: 4
    Dominance children: 3

  BB 3 (0x7efe0479b5f0):
    line: 1 (pc 100)
    Instructions:
      [Annotation: Inline End (0)]
      [Annotation: FH Goto (0)]
      [Annotation: INS Deopt One (idx 3 -> pc 100; line 1)]
      sp_guardconc      r2(5), sslot(2), litui32(100)
      set               r2(6),   r2(5)
      set               r5(1),  r15(1)
      bindlex         lex(idx=0,outers=0,$i),   r5(1)

Since $i is a native integer lexical, we don’t need to box the native integer result of the chars op at all here. And you can see that we have done a rewrite such that r15(1) is used in the assignment to $i, not the boxed result. However, the box_i instruction is retained. Why?

The immediate reason is that it’s used by the guard instruction. And indeed, I will do some work in the future to eliminate that. It’s not a hugely difficult problem. But doing that still wouldn’t have been enough. Why? Because there is a deopt point on the guard, and the boxed value is written before it and used after it. This example convinced me it was time to improve our deopt handling: it was directly in the way of optimizations that could provide a significant benefit.

A more precise algorithm

It took me three attempts to reach a solution to this. The first simple thing that I tried follows from the observation that everything written after the last deopt instruction in the optimized code can never possibly be needed for deoptimization purposes. This was far from a general solution, but it did help a bit with very small functions that are free of control flow and have no guards, or only very early ones. It was safe, easy to reason about, easy to implement – but ultimately not powerful enough. However, it was helpful in letting me frame the problem and start to grapple with it, and it gave me a set of cases that a more powerful solution should be able to take care of.

Attempt number two was to do a repeat of the initial deopt analysis process, but after the optimizations had taken place. Thus, cases where a Deopt annotation was on an instruction that had been turned into something that could never deoptimize would not be counted. This quickly fell apart, however, since where entire branches of a conditional had been deleted, reads could disappear entirely; they simply weren’t there to analyze any more. So, this was an utter failure, but it did drive home that any analysis that was going to work had to build up a model of deoptimization usages before we performed any significant optimizations, and then manipulate that model safely even in the light of a mutated program graph.

Attempt three took much longer to come up with and implement, though thankfully was rather more successful. The new algorithm proceeds as follows.

  1. When we see a write instruction – and provided it has at least one reader – place it into a set OW of writes with still outstanding reads.
  2. When we see a deopt point, take the set OW and record the index of this deopt point on each of those writes. This means that we are now associating each write with the deopt points that are keeping it alive.
  3. Whenever we see a read, mark it as processed. Check if the writer has now had all of its reads processed. If so, remove it from the set OW.

This algorithm works in a single pass through the program graph. However, what about loops? In a loop, no matter what order we traverse the graph in, we will always see some reads that happen before writes, and that breaks the algorithm that I described above.

After some amount of scribbling graphs and staring at them, I hit upon a way to solve it that left me with a single pass through the graph, rather than having to iterate to a fixed point. When we see a read whose writer has not yet been processed, we put it into a set of reads that are to be processed later. (As an aside, we know thanks to the SSA form that all such instructions are PHI (merge) instructions, and that these are always placed at the start of a basic block.) We then process a pending read once we have processed all of the basic blocks that are its predecessors – which means that by then all possible writes will have been processed.
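
Here is a variant of that toy model updated for the new algorithm. Again, this is my own Perl 6 sketch of the idea rather than the real C implementation, and the deferred handling of PHI reads in loops is left out for brevity.

class Ins {
    has Str $.writes;
    has Int $.remaining-reads is rw = 0;   # how many reads of the write exist
    has @.reads;
    has Bool $.deopt = False;
}

my @program =
    Ins.new(writes => 'r1', remaining-reads => 1),
    Ins.new(deopt => True),                    # deopt point 0
    Ins.new(writes => 'r2', reads => ['r1']),  # the last read of r1
    Ins.new(deopt => True);                    # deopt point 1

my %writer-of;       # register version => the instruction that wrote it
my %deopt-users;     # register version => deopt points that keep it alive
my %outstanding;     # the set OW: writes with reads still unprocessed
my $deopt-point = -1;

for @program -> $ins {
    if $ins.deopt {                                          # step 2
        $deopt-point++;
        %deopt-users{$_}.push($deopt-point) for %outstanding.keys;
    }
    for $ins.reads -> $r {                                   # step 3
        %writer-of{$r}.remaining-reads--;
        %outstanding{$r}:delete unless %writer-of{$r}.remaining-reads;
    }
    if $ins.writes && $ins.remaining-reads {                 # step 1
        %writer-of{$ins.writes} = $ins;
        %outstanding{$ins.writes} = True;
    }
}
say %deopt-users;    # {r1 => [0]}: only the first deopt point keeps r1 alive

The difference from the original approach shows up in that final line: r1 is associated only with the deopt point that can actually need it, rather than being unconditionally marked as needed for deopt.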

The result is that we now have a list of all of the deopt points that make use of a particular write. Then, after optimization, we can go through the graph again and see which deopt points actually had guards or other potentially deoptimizing instructions placed at them. For all the cases where we have no such instruction under that Deopt annotation, we can delete the deopt usage. That way, if all normal usages and deopt usages of a value are gone, and the writing instruction is pure, we can delete that instruction.

This further means that once we gain the ability to delete guards that we can prove are not required any longer – perhaps because of new information we have after inlining – we will also be able to delete the deopt usages associated with them.

Last but not least, the specializer’s log output also includes which deopt points are keeping a value alive, so in cases where we aren’t entirely sure what is happening, we can inspect the graph and understand it.

Future work

With this done, it will make sense to work on guard elimination, so that is fairly high on my list of upcoming tasks. Another challenge is that while the new deopt algorithm is far more precise, it’s also far more costly. Its use of the DU chains means we have to run it as a second pass after the initial facts and usage pass. Further, the algorithm to eliminate unrequired deopt usages is a two pass algorithm; with some engineering we can likely find a way to avoid having to make the first of those. The various sets are represented as linked lists too, and we can probably do better than that.

One other interesting deoptimization improvement to explore in the future is to observe that any pure instructions leading up to a deopt point can be replayed by the interpreter. Therefore, we can deopt not to the instruction mapping to the place where we put the deopt point, but to the start of a run of pure instructions before that. That would in some cases allow us to delete more code in the optimized version.

Next time, I’ll be looking at how increasingly aggressive inlining caused chaos with our context introspection, and how I made things better. Thanks go to The Perl Foundation for making this work possible.
