Speeding up object creation

Recently, a Perl 6 object creation benchmark result did the rounds on social media. This Perl 6 code:

class Point {
    has $.x;
    has $.y;
}
my $total = 0;
for ^1_000_000 {
    my $p = Point.new(x => 2, y => 3);
    $total = $total + $p.x + $p.y;
}
say $total;

Now (on HEAD builds of Rakudo and MoarVM) runs faster than this roughly equivalent Perl 5 code:

use v5.10;

package Point;

sub new {
    my ($class, %args) = @_;
    bless \%args, $class;
}

sub x {
    my $self = shift;
    $self->{x}
}

sub y {
    my $self = shift;
    $self->{y}
}

package main;

my $total = 0;
for (1..1_000_000) {
    my $p = Point->new(x => 2, y => 3);
    $total = $total + $p->x + $p->y;
}
say $total;

(Aside: yes, I know there’s no shortage of libraries in Perl 5 that make OO nicer than this, though those I tried also made it slower.)

This is a fairly encouraging result: object creation, method calls, and attribute access are the operational bread and butter of OO programming, so it’s a pleasing sign that Perl 6 on Rakudo/MoarVM is getting increasingly speedy at them. In fact, it’s probably a bit better at them than this benchmark’s raw numbers show, since:

The measurements include startup time. Probably 10%-15% of the time Rakudo spends executing the program is startup, parsing, compilation, etc. (noting the compiler is itself written in Perl 6). By contrast, Perl 5 has really fast startup and a parser written in C.
The math in Perl 6 is arbitrary precision (that is, will upgrade to a big integer if needed); it costs something to handle that, even if it doesn’t happen
Every math operation allocates a new Int object; there’s two uses of + here, so that means two new Int allocations per iteration of the loop, which then have to be garbage collected

While dealing with Int values got faster recently, it’s still really making two Int objects every time around that loop and having to GC them later. An upcoming new set of analyses and optimizations will let us get rid of that cost too. And yes, startup will get faster with time also. In summary, while Rakudo/MoarVM is now winning that benchmark against Perl 5, there’s still lots more to be had. Which is a good job, since the equivalent Python and Ruby versions of that benchmark still run faster than the Perl 6 one.

In this post, I’ll look at the changes that allowed this benchmark to end up faster. None of the new work was particularly ground-breaking; in fact, it was mostly a case of doing small things to make better use of optimizations we already had.

What happens during construction?

Theoretically, the default new method in turn calls bless, passing the named arguments along. The bless method then creates an object instance, followed by calling BUILDALL. The BUILDALL method goes through the set of steps needed for constructing the object. In the case of a simple object like ours, that involves checking if an x and y named argument were passed, and if so assigning those values into the Scalar containers of the object attributes.

For those keeping count, that’s at least 3 method calls (new, bless, and BUILDALL).

However, there’s a cheat. If bless wasn’t overridden (which would be an odd thing to do anyway), then the default new could just call BUILDALL directly anyway. Therefore, new looks like this:

multi method new(*%attrinit) {
    nqp::if(
      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),
      nqp::create(self).BUILDALL(Empty, %attrinit),
      $bless(|%attrinit)
    )
}

The BUILDALL method was originally a little “interpreter” that went through a per-object build plan stating what needs to be done. However, for quite some time now we’ve instead compiled a per-class BUILDPLAN method.

Slurpy sadness

To figure out how to speed this up, I took a look at how the specializer was handling the code. The answer: not so well. There were certainly some positive moments in there. Of note:

The x and y accessor methods were being inlined
The + operators were being inlined
Everything was JIT-compiled into machine code

However, the new method was getting only a “certain” specialization, which is usually only done for very heavily polymorphic code. That wasn’t the case here; this program clearly constructs overwhelmingly one type of object. So what was going on?

In order to produce an “observed type” specialization – the more powerful kind – it needs to have data on the types of all of the passed arguments. And for the named arguments, that was missing. But why?

Logging of passed argument types is done on the callee side, since:

If the callee is already specialized, then we don’t want to waste time in the caller doing throwaway logging
If the callee is not specialized, but the caller is, then we don’t want to miss out on logging types (and we only log when running unspecialized code)

The argument logging was done as the interpreter executed each parameter processing instruction. However, when there’s a slurpy, then it would just swallow up all the remaining arguments without logging type information. Thus the information about the argument types was missing and we ended up with a less powerful form of specialization.

Teaching the slurpy handling code about argument logging felt a bit involved, but then I realized there was a far simpler solution: log all of the things in the argument buffer at the point an unspecialized frame is being invoked. We’re already logging the entry to the call frame at that point anyway. This meant that all of the parameter handling instructions got a bit simpler too, since they no longer had logging to do.

Conditional elimination

Having new being specialized in a more significant way was an immediate win. Of note, this part:

      nqp::eqaddr(
        (my $bless := nqp::findmethod(self,'bless')),
        nqp::findmethod(Mu,'bless')
      ),

Was quite interesting. Since we were now specializing on the type of self, then the findmethod could be resolved into a constant. The resolution of a method on the constant Mu was also a constant. Therefore, the result of the eqaddr (same memory address) comparison of two constants should also have been turned into a constant…except that wasn’t happening! This time, it was simple: we just hadn’t implemented folding of that yet. So, that was an easy win, and once done meant the optimizer could see that the if would always go a certain way and thus optimize the whole thing into the chosen branch. Thus the new method was specialized into something like:

multi method new(*%attrinit) {
    nqp::create(self).BUILDALL(Empty, %attrinit),
}

Further, the create could be optimized into a “fast create” op, and the relatively small BUILDALL method could be inlined into new. Not bad.

Generating simpler code

At this point, things were much better, but still not quite there. I took a look at the BUILDALL method compilation, and realized that it could emit faster code.

The %attrinit is a Perl 6 Hash object, which is for the most part a wrapper around the lower-level VM hash object, which is the actual hash storage. We were, curiously, already pulling this lower-level hash out of the Hash object and using the nqp::existskey primitive to check if there was a value, but were not doing that for actually accessing the value. Instead, an .AT-KEY('x') method call was being done. While that was being handled fairly well – inlined and so forth – it also does its own bit of existence checking. I realized we could just use the nqp::atkey primitive here instead.

Later on, I also realized that we could do away with nqp::existskey and just use nqp::atkey. Since a VM-level null is something that never exists in Perl 6, we can safely use it as a sentinel to mean that there is no value. That got us down to a single hash lookup.

By this point, we were just about winning the benchmark, but only by a few percent. Were there any more easy pickings?

An off-by-one

My next surprise was that the new method didn’t get inlined. It was within the size limit. There was nothing preventing it. What was going on?

Looking closer, it was even worse than that. Normally, when something is too big to be inlined, but we can work out what specialization will be called on the target, we do “specialization linking”, indicating which specialization of the code to call. That wasn’t happening. But why?

Some debugging later, I sheepishly fixed an off-by-one in the code that looks through a multi-dispatch cache to see if a particular candidate matches the set of argument types we have during optimization of a call instruction. This probably increased the performance of quite a few calls involving passing named arguments to multi-methods – and meant new was also inlined, putting us a good bit ahead on this benchmark.

What next?

The next round of performance improvements – both to this code and plenty more besides – will come from escape analysis, scalar replacement, and related optimizations. I’ve already started on that work, though expect it will keep me busy for quite a while. I will, however, be able to deliver it in stages, and am optimistic I’ll have the first stage of it ready to talk about – and maybe even merge – in a week or so.

5 Responses to Speeding up object creation

Pingback: 2018.41 Merged the JS! | Weekly changes in and around Perl 6
Alex says:

October 14, 2018 at 10:23 AM

Silly question about the Perl 5 vs Perl 6 comparison code:

In the Perl 6 code is ^1_000_000 expanded into a large array, or is it some kind of enumerator or range object?

In the Perl 5 code 1..1_000_000 does immediately expand into a large list, right? Could that be part of the slowness of the P5 solution?

Maybe change the Perl 5 for loop into a C-style loop? If that’s closer to how the Perl 6 one behaves.

Apologies if I’m off base on any of my assumptions there. Great work on Rakudo and thanks so much for writing all this up into blog posts too, I love reading it :)

- jnthnwrthngtn says:
  
  October 15, 2018 at 5:25 PM
  
  I believe that in the case of a `for (range) { … }` loop, Perl 5 is smart enough to optimize it into something more like a C-style for loop, just as Rakudo can. To check, I re-worked the Perl 5 version to do this:
  
  for (my $i = 0; $i < 1_000_000; $i++) {
  my $p = Point->new(x => 2, y => 3);
  $total = $total + $p->x + $p->y;
  }
  
  And it had no significant impact on the execution time.
  
  - Alex says:
    
    October 16, 2018 at 10:39 PM
    
    Ah awesome! Thanks for letting me know :)
- hobbified (@hobbified) says:
  
  December 5, 2018 at 6:06 PM
  
  Actually perl5 has an optimization (since 5.6 or so) that makes `for (1..1_000_000)` *faster* than the C-style loop. The range loop compiles down to basically a single “counting loop” opcode with fastpath code for incrementing things, while the C-style loop has to compile `$i = 1`; `$i <= 1_000_000`; `$i ++` independently (and with no specialization, so to speak) and slot them into the generic loop opcode, which ends up being measurably slower.

	wendyga on Recollections from the Raku Co…
	Релиз компилятора Ra… on The new MoarVM dispatch mechan…
	Peter Lund on The new MoarVM dispatch mechan…
	Peter Lund on The new MoarVM dispatch mechan…
	2021.40 It’s h… on The new MoarVM dispatch mechan…