Concurrency bug squishing: part 1

Most of my recent Perl 6 development time has been spent hunting down and fixing various concurrency bugs. We’ve got some nice language features in this area, and they’ve been pretty well received so far. However, compared with many areas of Perl 6, they have been implemented relatively recently. Therefore, they have had less time to mature – which, of course, means more bugs and other rough edges.

Concurrency bugs tend to be rather tedious to hunt down. This is in no small part because reproducing them reliably can be challenging. Even once a fairly reliable reproduction is available, working out what’s going on can be tricky. Like all debugging, being methodical and patient is the key. I tend to keep notes of things I’ve tried and observed, and output produced by instrumenting programs with logging (fancy words for “adding prints and exception throws when sanity checks fail”). I will use interactive debuggers when needed, but even then the data from them tends to end up in my editor on any extended bug hunt. Debugging is in some ways the experimental science of programming. Even with a good approach, being sufficiently patient – or perhaps just downright stubborn – matters plenty too.

In the next 2-3 posts here, I’ll discuss a few of the bugs I recently hunted down. It will not be anything close to an exhaustive list of them, which would surely be as boring for you to read as it would be for me to write. This is just the “greatest hits”, if you like.

A tale of silly suspicions and dodgy data

This hunt started out looking in to various hangs that had been reported. Deadlocks are a broad category of bug (there’s another completely unrelated one I’ll cover in this little series, even). However, a number of ones that had shown up looked very similar. All of them showed a number of threads trying to enter GC, stuck in the consensus loop (which, yes, really is just a loop that we go around asking, “did every thread agree we’re going to do GC yet?”)

I’ll admit I went into this one with a bit of a bias. The GC sync-up code in question is a tad clever-looking. That doesn’t mean it’s wrong, but it makes it harder to be comfortable it’s correct. I also worried a bit that the cost of the consensus loop might well be enormous. So I let myself become a bit distracted for some minutes doing a profiling run to find out. I mean, who doesn’t want to be the person who fixed concurrency bug in the GC and made it faster at the same time?!

I’ve found a lot of useful speedups over the years using callgrind. I’ve sung its praises on this blog at least once before. By counting CPU cycles, it can give very precise and repeatable measurements on things that – measured using execution time – I’d consider within noise. With such accuracy, it must be the only profiler I need in my toolbox, right?

Well, no, of course not. Any time somebody says that X tool is The Best and doesn’t explain the context when it’s The Best, be very suspicious. Running Callgrind on a multi-threaded benchmark gave me all the ammunition I could ever want for replacing the GC sync-up code. It seemed that a whopping 30% of CPU cycles were spent in the GC consensus loop! This was going to be huuuuge…

Or was it? I mean, 35% seems just a little too huge. Profiling is vulnerable to the observer effect: the very act of measuring a program’s performance inevitably changes the program’s performance. And, while on the dubious physics analogies (I shouldn’t; I was a pretty awful physicist once I reached the relativity/QM stuff), there’s a bit of an uncertainty principle thing going on. The most invasive profilers can tell you very precisely where in your program time is spent, but you’re miles off knowing how fast you’re normally going in those parts of the program. They do this by instrumenting the program (like the current Perl 6 on MoarVM profiler does) or running it on a synthetic CPU, as Callgrind does. By contrast, a sampling profiler lets your program run pretty much as normal, and just takes regular samples of the call stack. The data is much less precise about where the program is spending its time, but it is a much better reflection of how fast the program is normally going.

So what would, say, the perf sampling profiler make of the benchmark? Turns out, well less than 1% of time was spent in the GC consensus loop in question. (Interestingly, around 5% was spent in a second GC termination consensus loop, which wasn’t the one under consideration here. That will be worth looking into in the future.) The Visual Studio sampling profiler on Windows – which uses a similar methodology – also gave a similar figure, which was somewhat reassuring also.

Also working against Callgrind is the way the valgrind suite of tools deal with multi-threaded applications – in short by serializing all operations onto a single thread. For a consensus loop, which expects to be dealing with syncing up a number of running threads, this could make a radical difference.

Finally, I decided to check in on what The GC Handbook had to say on the matter. It turns out that it suggests pretty much the kind of consensus loop we already have in MoarVM, albeit rather simpler because it’s juggling a bit less (a simplification I’m all for us having in the future too). So, we’re not doing anything so unusual, and with suitable measurements it’s performing as expected.

So, that was an interesting but eventually fairly pointless detour. What makes this all the more embarassing, however, is what came next. Running an example I knew to hang under GDB, I waited for it to do so, hit Ctrl + C, and started looking at all of the threads. Here’s a summary of their states:

  • 17: blocking on concurrent queue read (cond var wait)
  • 16: in mark thread unblocked; yielded
  • 15: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 14: blocking on concurrent queue read (cond var wait)
  • 13: blocking on concurrent queue read (cond var wait)
  • 12: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 11: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 10: blocking on concurrent queue read (cond var wait)
  • 09: blocking on concurrent queue read (cond var wait)
  • 08: trying to acquire lock on concurrent queue read (at src/6model/reprs/ConcBlockingQueue.c:160)
  • 07: blocking on concurrent queue read (cond var wait)
  • 06: in mark thread unblocked; yielded
  • 05: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 04: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 03: blocking on concurrent queue read (cond var wait)
  • 02: in mark thread unblocked; yielded
  • 01: triggered GC; MVM_gc_enter_from_allocator (tc=tc@entry=0x6037c0) at src/gc/orchestrate.c:384

And yes, for the sake of this being a nice example for the blog I perhaps should not have picked one with 17 threads. Anyway, we’ll cope. First up, the easy to explain stuff. All the threads that are “blocking on concurrent queue read (cond var wait)” are fairly uninteresting. They are Perl 6 thread pool threads waiting for their next task (that is, wanting to pull an item from the scheduler’s queue, and waiting for it to be non-empty).

Thread 01 is the thread that has triggered GC. It is trying to get consensus from other threads to begin GC. A number of other threads have already been interrupted and are also in the consensus loop (those marked “in AO_load_read at MVM_gc_enter_from_interrupt”). This is where I had initially suspected the problem would be. That leaves 4 other threads.

You might wonder how we cope with threads that are simply not in a position to participate in the consensus process, because they’re stuck in OS land, blocked waiting on I/O, a lock, a condition variable, a semaphore, a thread join, and so forth. The answer is that before they hand over control, they mark themselves as blocked. Another thread will steal their work if a GC happens while the thread is blocked. When the thread becomes unblocked, it marks itself as such. However, if a GC is already happening at that point, it’s not safe for the thread to proceed. Thus, it yields until GC is done. This accounts for the 3 threads described as “in mark thread unblocked; yielded”.

Which left one thread, which was trying to acquire a lock in order to peek a queue. The code looked like this:

    if (kind != MVM_reg_obj)
        MVM_exception_throw_adhoc(tc, "Can only shift objects from a ConcBlockingQueue");

    uv_mutex_lock(&cbq->locks->head_lock);

    while (MVM_load(&cbq->elems) == 0) {
        MVMROOT(tc, root, {

Spot anything missing?

Here’s the corrected version of the code:

    if (kind != MVM_reg_obj)
        MVM_exception_throw_adhoc(tc, "Can only shift objects from a ConcBlockingQueue");

    MVM_gc_mark_thread_blocked(tc);
    uv_mutex_lock(&cbq->locks->head_lock);
    MVM_gc_mark_thread_unblocked(tc);

    while (MVM_load(&cbq->elems) == 0) {
        MVMROOT(tc, root, {

Yup, it was failing to mark itself as blocked while contending for a lock, meaning the GC could not steal its work. So, the GC’s consensus algorithm wasn’t to blame after all. D’oh.

To be continued…

I actually planned to cover a second issue in this post. But, it’s already gone midnight, and perhaps that’s enough fun for one post anyway. :-) Tune in next time, for more GC trouble and another cute deadlock!

Posted in Uncategorized | 3 Comments

Assorted fixes

I’ve had a post in the works for a while about my work to make return faster (as well as routines that don’t return), as well as some notable multi-dispatch performance improvements. While I get over my writer’s block on that, here’s a shorter post on a number of small fixes I did on Thursday this week.

I’m actually a little bit “between things” at the moment. After some recent performance work, my next focus will be on concurrency stability fixes and improvements, especially to hyper and race. However, a little down on sleep thanks to the darned warm summer weather, I figured I’d spend a day picking a bunch of slightly less demanding bugs off from the RT queue. Some days, it’s about knowing what you shouldn’t work on…

A nasty string bug

MoarVM is somewhat lazy about a number of string operations. If you ask it to concatenate two simple strings, it will produce a string consisting of a strand list, with two strands pointing to the two strings. Similarly, a substring operation will produce a string with one strand and an offset into the original, and a repetition (using the x operator) will just produce a string with one strand pointing to the original string and having a repetition count. Note that it doesn’t currently go so far as allowing trees of strand strings, but it’s enough to prevent a bunch of copying – or at least delay it until a bunch of it can be done together and more cheaply.

The reason not to implement such cleverness is because it’s of course a whole lot more complex than simple immutable strings. And both RT #123602 and RT #127782 were about a sequence of actions that could trigger a bug. The precise sequence of actions were a repeat, followed by a concatenation, followed by a substring with certain offsets. It was caused by an off-by-one involving the repetition optimization, which was a little tedious to find but easy to fix.

Constant folding Seqs is naughty

RT #127749 stumbled across a case where an operation in a loop would work fine if its input was variable (like ^$n X ^$n), but fail if it were constant (such as ^5 X ^5). The X operator returns a Seq, which is an iterator that produces values once, throwing them away. Thus iterating it a second time won’t go well. The constant folding optimization is used so that things like 2 + 2 will be compiled into 4 (silly in this case, but more valuable if you’re doing things with constants). However, given the 1-shot nature of a Seq, it’s not suitable for constant folding. So, now it’s disallowed.

We are anonymous

RT #127540 complained that an anon sub whose name happened to match that of an existing named sub in the same scope would trigger a bogus redeclaration error. Wait, you ask. Anonymous sub…whose name?! Well, it turns out that what anon really means is that we don’t install it anywhere. It can have a name that it knows itself by, however, which is useful should it show up in a backtrace, for example. The bogus error was easily fixed up.

Charset, :ignoremark, :global, boom

Yes, it’s the obligatory “dive into the regex compiler” that all bug fixing days seem to come with. RT #128270 mentioned that that "a" ~~ m:g:ignoremark/<[á]>/ would whine about chr being fed a negative codepoint. Digging into the MoarVM bytecode this compiled into was pretty easy, as chr only showed up one time, so the culprit had to be close to that. It turned out to be a failure to cope with end of string, and as regex bugs go wasn’t so hard to fix.

Hang, crash, wallop

This is one of those no impact on real code, but sorta embarrassing bugs. A (;) would cause an infinite loop of errors, and (;;) and [0;] would emit similar errors also. The hang was caused by a loop that did next but failed to consider that the iteration variable needed updating in the optimizer. The second was because of constructing bad AST with integers hanging around in it rather than AST nodes, which confused all kinds of things. And that was RT #127473.

Improving an underwhelming error

RT #128581 pointed out that my Array[Numerix] $x spat out an error that fell rather short of the standards we aim for in Perl 6. Of course, the error should complain that Numerix isn’t known and suggest that maybe you wanted Numeric. Instead, it spat out this:

===SORRY!=== Error while compiling ./x.pl6
An exception occurred while parameterizing Array
at ./x.pl6:1
Exception details:
  ===SORRY!=== Error while compiling
  Cannot invoke this object (REPR: Null; VMNull)
  at :

Which is ugly. The line number was at least correct, but still… Anyway, a small tweak later, it produced the much better:

$ ./perl6-m -e 'my Array[Numerix] $x;'
===SORRY!=== Error while compiling -e
Undeclared name:
    Numerix used at line 1. Did you mean 'Numeric'?

Problems mixing unit sub MAIN with where constraints

RT #127785 observed that using a unit sub MAIN – which takes the entire program body as the contents of the MAIN subroutine – seemed to run into trouble if the signature contained a where clause:

% perl6 -e 'unit sub MAIN ($x where { $^x > 1 } );  say "big"'  4
===SORRY!===
Expression needs parens to avoid gobbling block
at -e:1
------> unit sub MAIN ($x where { $^x > 1 }⏏ );  say "big"
Missing block (apparently claimed by expression)
at -e:1
------> unit sub MAIN ($x where { $^x > 1 } );⏏  say "big"

The error here is clearly bogus. Finding a way to get rid of it wasn’t too hard, and it’s what I ended up committing. I’ll admit that I’m not sure why the check involved was put there in the first place, however. After some playing around with other situations that it might have aided, I failed to find any. There were also no spectests that depended on it. So, off it went.

$?MODULE

The author of RT #128552 noticed that the docs talked about $?MODULE (“what module am I currently in”), to go with $?PACKAGE and $?CLASS. However, trying it out let to an undeclared variable error. It seems to have been simply overlooked. It was easy to add, so that’s what I did. I also found some old, provisional tests and brought them up to date in order to cover it.

Subtypes, definedness types, and parameters

The submitter of RT #127394 was creative enough to try -> SomeSubtype:D $x { }. That is, take a subset type and stick a :D on it, which adds the additional constraint that the value must be defined. This didn’t go too well, resulting in some rather strange errors. It turns out that, while picking the type apart so we can code-gen the parameter binding, we failed to consider such interesting cases. Thankfully, a small refactor made the fix easy once I’d figured out what was happening.

1 day, 10 RTs

Not bad going. Nothing earth-shatteringly exciting, but all things that somebody had run into – and so others would surely run into again in the future. And, while I’ll be getting back to the bigger, hairier things soon, spending a day making Perl 6 a little nicer in 10 different ways was pretty fun.

Posted in Uncategorized | 1 Comment

‘grinding out performance improvements

In the last weeks, I’ve been working on various Perl 6 performance improvements. They’ve led to changes in all of MoarVM, NQP, and Rakudo. For each improvement, there are usually three steps involved:

  1. Take some program that performs poorly, and discover a reason that it’s slow.
  2. Design, implement and test potential changes, until one yields an improvement. (Or, if nothing seems to help, move on to a different reason that it’s slow).
  3. Make sure that the improvement doesn’t cause regressions elsewhere.

In all of these steps, getting some objective measurements are important. Step 3 is relatively straightforward: just build NQP/Rakudo and run their respective test suites. It’s possible for problems to find a place to hide even with this; test suites are, after all, not full system proofs. But it rules out a lot of bad behaviors.

For step 1, profiling is key. I mean, sure, sometimes I can guess why something is slow, and sometimes I’m even right. But the overwhelming majority of the time, measuring wins hands down. That’s why I seeded tools such as the MoarVM profiler and MoarVM heap snapshot analyzer, both of which have seen contributions from others since. And, when working on MoarVM itself, there are various tools for profiling native code.

That leaves step 2. How do I know if I’ve got an improvement? One easy way is to use something like time, doing thousands or millions of iterations to try and avoid noise, and seeing what happens. It’s a bit of a blunt instrument, however. It’s hard to be confident that anything less than a 2%-3% improvement isn’t just measurement noise, and I’m not even confident about that rule of thumb. :-) And, while it may be reasonable to argue that an improvement of just 1% or even a fraction of a percent is just not worth it in some contexts, VM engineering isn’t one of those contexts. A 1% improvement rolled out to all of our users quickly multiplies out to a huge number of saved CPU cycles.

More than that, though, the small improvements add up. Do 5 small improvements that win an average of 1% each, and it adds up to a more satisfying 5% improvement. But with only wallclock times to go on, it’s hard to confidently commit an improvement that wins only 1%, or 0.5%. Why? Simply because it’s hard to be sure that it’s a move in the correct direction. If it’s in the noise range, it may be that things are actually getting a tiny bit worse. Experience tells that just because something looks like it should be an improvement does not, in fact, mean that it will be. So if I’m seeing – to the degree of measurement error – two things coming out the same, and I go ahead and commit the “improvement” anyway, I’m really just guessing.

Enter callgrind, part of the Valgrind suite. It can give a count of the number of CPU instructions executed. And, on two consecutive runs with the same input, it will produce the exact same number. Yes, it’s sloooooow at gathering data; on the other hand, with such precision there’s less need to bump up the number of iterations to hide measurement error. Callgrind also explains what functions the CPU instructions are from, meaning it can play a key role in the profiling step too.

Of course, CPU cycles are also a somewhat blunt instrument too. Instructions executed is not the same as CPU cycles spent. Modern CPUs can both execute multiple instructions in a single cycle due to having multiple function units, as well as stall for anything from several to several hundred cycles on an instruction that accesses memory. This measurement also, of course, gives little insight into I/O bound programs, and I could easily imagine getting rather misled by such data in contention-bound programs. However, for CPU bound programs operating on relatively small heaps and running on a single thread, the numbers can be treated something like a very accurate clock.

Here’s a look at handful of the improvements I’ve been doing, based off looking at callgrind output.

The test program

Here’s the test program I considered. It involves a bunch of invocation and integer operations in a tight loop.

class A {
    has $.i = 0;
    method m() { $!i++ }
}
my $a = A.new;
for ^5000000 {
    $a.m();
}
say $a.i;

Before starting out, this took 17,855,658,600 instructions, which comes out at around 3,580 instructions per iteration. For comparison, here’s a Perl 5 program that I hope is fairly equivalent (I stuck with the built-in OO to avoid the costs of any sugar):

package A;
sub new {
    return bless { i => 0 }, shift;
}
sub m() {
    my $self = shift;
    $self->{i}++;
}
package main;
my $a = A->new;
for (1..5000000) {
    $a->m();
};
say $a->{i};

It weighs in at 10,198,184,195 cycles, or 2040 instructions per iteration, thus running in 56% of the CPU instructions the Perl 6 version takes.

A wasted memset

One thing that stood out right away was the amount of time spent in memset, to zero memory. As I looked at the callers, I spotted one that seemed out of place: clearing the arguments buffer. The code gave a reason (make sure the GC never sees a partial args buffer with old data), but that reason turns out to be bogus: there can never be a GC safepoint anywhere inside of the sequence of instructions that put the arguments in place to pass. So, I removed it.

Now I was down to a really small function. That was just begging to go away. It was called both from the interpreter (where the C compiler may well have inlined it) and also from the JIT. The JIT case was certainly going to be a nice win; instead putting args into the appropriate registers for a function call, making it, and then using the return value, I could just do a couple of simple instructions.

These two fairly simple changes got it down to 17,624,207,945 instructions – a saving of 230 million CPU instructions (1.3%), or around 50 cycles off every iteration. Very much worth having (it makes every single call cheaper), but could have been marginal if measuring simply wallclock time.

Exhausting work

The next win would have been easily visible just from wallclock time, but looking at the callgrind output led to it. I noticed we spent a huge amount of time doing a late-bound (by name) lexical lookup. With --inclusive=yes passed to callgrind_annotate, its line in the output looked like this:

1,149,528,043  ???:MVM_frame_find_lexical_by_name [libmoar.so]

Now, late-bound lexical lookups being costly isn’t really a surprise. This is the case where we have to go searching for a symbol by name (doing hash lookups), because we couldn’t do its resolution at compile-time (or at optimization time). The surprise was that we were doing them at all while executing such a simple program. So I looked into why, and it surprised me.

Long ago (back in the Parrot days), we replaced the secret RETURN lexical (which held an object used in implementing return) with the &EXHAUST sub. This would throw a useful error if you tried to return from a sub you had already returned from, for example due to forgetting that map is lazy and the final statement of a sub is its return value:

sub in(@a, $b) {
    @a.map({ return True if $_ eqv $b })
}
say in([1,2,3], 2); # dies: Attempt to return outside of any Routine

However, this lookup ended up being code-gen’d late-bound. I was considering fixing that, when I got curious if we even needed it at all any more. And indeed, on the VMs we run on today, I got the same decent error message if I simply set RETURN to null. So, I did that and the win was enormous: down to 11,085,521,589 CPU instructions! That meant, relative to the previous measurement, it ran in 63% of the CPU cycles it used to, or 2,220 per iteration.

That might seem incredible given that we only stood to gain 1.1 billion cycles by avoiding the indirect lexical lookup. Where did the other 5 billion go to? It turns out that blocks that have references to symbols lexically outer to them are not able to be inlined (unless they have been resolved by optimization time and are known to stay constant). The late-bound lookup of &EXHAUST was thus an inlining blocker for the method m. With it now being inlined, all of the calling overhead went away in favor of a couple of gotos. (Those also could go away in the future if inlining gets smarter.)

Optimizing get_boxed_ref

Eliminating control flow in favor of data access is usually a win. Branches aren’t the cheapest thing. I spotted an opportunity in a function get_boxed_ref, which is used to extract the memory address of the P6bigint representation that is inlined into a P6opaque (as happens with Int; if you’re wondering why it’s a P6opaque, recall that it’s allowable to mix into Int objects).

The resulting change shaved another 60 million cycles off the program; only 12 per iteration, but since this improves every single Int operation we perform in Perl 6, I’ll take it.

Static elimination of decontainerization

The &EXHAUST change uncovered an inlining issue that I had to fix, and while doing so I took a look at the code we were producing and got an idea. In Perl 6, we have Scalar containers. So when you have:

my $answer = 42;

Then the callframe has a slot for $answer that points to a Scalar object, which has a value attribute that points to the immutable Int object 42. Therefore, when we need the value from a container, we have to get hold of it. That is done with an instruction called decont, which checks if we have a container and dereferences it if so, and otherwise just evaluates to the value. The dynamic optimizer is pretty good at removing unneeded decont instructions to save the cycles/branches involved, and can lower those that remain into simple pointer arithmetic. But it can’t get ’em all, and of course not all code gets hot enough to be optimized in such a way.

I spotted that in some cases, we emitted decont instructions against compile time constants that we could cheaply determine, at compile time, would never need them. A small code-gen patch was sufficient to eliminate a load of them. This led to another 35 million cycles saved. However, the number of cycles in an empty program at startup also went down by quite an amount too, so it’s likely not that we’re saving so much in the loop (if anything). Additionally, this change shaved 3.6KB off the NQP bytecode size, 16.2KB off the Rakudo compiler bytecode size, and 55.8KB off CORE.setting bytecode size.

And next time: returning to return

Fixing the &EXHAUST bug reminded me that I’ve long wanted to change the way we implement return, to make it free – rather than just cheap – in the case that we never explicitly return, and much cheaper when we do. I’m still working on getting those changes finished up, and this has already been a fairly long post, so I’ll save the details – and the results – for next time.

Posted in Uncategorized | 3 Comments

Refactoring and torture

This post covers the previous two weeks of my Perl 6 grant work. Last time I wrote here, I plotted changes to call frames in MoarVM. Or, as the wonderful Perl Weekly put it, provided way too much information about call frames. :-)

That there is so much to say about call frames reflects their rather central role. They play a big part in supporting numerous language features (function calls, recursion, closures, continuations, dynamic variables, and pseudo-packages like OUTER and CALLER). The garbage collector scans them to find live objects. Both the interpreter and JIT compiler have to relate to them in various ways. The dynamic optimizer performs transforms over them when doing OSR (On Stack Replacement) and uninlining (the deoptimization that enables us to speculatively perform inlining optimizations).

All of which makes a major refactor of call frames a rather scary prospect. While they have picked up additional bits of state as MoarVM has evolved, they have been reference counted since, well, the day I first implemented call frames, which means “before MoarVM could even run any code”. Being reference counted, rather than handled by the garbage collector, gave them a property that is easy to rely on, and rather less easy to discover reliance on: that they never move over their lifetime.

I like to move it move it

Wait, what, move? Why would they even move?

Here’s a little Perl 6 program to illustrate. It declares a class, makes an instance of it, prints its memory address, does a load of further, throwaway, memory allocations, and then again prints the address of the object instance we made.

class A { }
my $obj = A.new;
say $obj.WHERE;
A.new for ^10000;
say $obj.WHERE;

When I ran this locally just now, I got:

39391392
95466848

If you get the same number twice, just make the 10000 something bigger. What’s interesting to note here is that an object’s location in memory can change over time. This is a consequence of MoarVM’s garbage collector, which is both generational and manages its young generation using semi-space copying. (This is nothing special to MoarVM; many modern VMs do it.)

Being able to move objects relies on being able to find and update all of the references to them. And, since MoarVM is written in C, that includes those references on the C stack. Consider this bit of code, which is the (general, unoptimized) path for boxing strings:

MVMObject * MVM_repr_box_str(MVMThreadContext *tc, MVMObject *type, MVMString *val) {
    MVMObject *res;
    MVMROOT(tc, val, {
        res = MVM_repr_alloc_init(tc, type);
        MVM_repr_set_str(tc, res, val);
    });
    return res;
}

It receives val, which is a string to box. Note that strings are garbage-collectable objects in MoarVM, and so may move. It then allocates a box of the specified type (for example, Perl 6’s Str), and puts the string inside of it. Since MVM_repr_alloc_init allocates an object, it may trigger garbage collection. And that in turn may move the object pointed to by val – meaning that the val pointer needs updating. The MVMROOT macro is used in order to add the memory address of val on the C stack to the set of roots that the GC considers and updates, thus ensuring that even if the allocation of the box triggers garbage collection, this code won’t end up with an old val pointer.

Coping with moving frames

Last time, I discussed how reference counting could be eliminated in favor of a “linear” call stack for frames that don’t escape (that is, become heap referenced), and promoting those that do escape to being garbage collected. As an intermediate step there, I’ve been working to make all frames GC-managed. This means that frames can move, and that they are part of the generational scheme. Therefore, every piece of code that both holds a reference to a frame and takes a code path that can allocate would need updating with MVMROOT. Further, all assignments of frames into other objects, and other objects into frames, would need write barriers (aside from the working area, which is handled specially).

In part, this just needs a lot of care. Going through the places frames show up, updating things as needed, and so forth. But even then, would that really be enough to be confident things weren’t broken? After all, my refactor was changing the rules for one of the most important data structures in the VM.

Of course, building NQP and Rakudo and passing the spectest suite is one good way to exercise MoarVM after the changes. Doing this showed up some issues, which I fixed. But even that doesn’t offer a huge amount of confidence. A simple script, or a short test, might trigger no garbage collections at all, or just the odd one. And the collections are highly likely to be triggered on the most common code paths in the VM.

GC torture testing

When faced with something scary, a surprisingly good approach is to tackle it by doing it really often. For example, are software releases scary? If yes, then do time-based releases every month, and with time they’ll become automatic and boring enough not to be scary. Is deploying changes to a production system scary? If yes, then adopt continuous delivery, deploying lots of times in a day and with easy rollback if things don’t go well.

Garbage collection is pretty scary. I mean, we take this world of objects the program has created, move them around, throw a bunch of them out, and then have the program continue running as if nothing happened. So…let’s try doing it really often!

This is exactly what GC torture testing involves.

--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -1,7 +1,7 @@
 /* How big is the nursery area? Note that since it's semi-space copying, we
  * actually have double this amount allocated. Also it is per thread. (In
  * the future, we'll make this adaptive rather than a constant.) */
-#define MVM_NURSERY_SIZE 4194304
+#define MVM_NURSERY_SIZE 13000

Rather than doing a collection every 4MB worth of allocations, let’s do one every 13KB worth of allocations! That’s around 320 times more often. Combined with a few debugging checks enabled, to catch references to objects that are out of date, bugs resulting from missing MVMROOTs and write barriers can be driven out of their hiding places into the light of day.

It’s a rather effective technique. It’s also a very time-consuming one. The NQP and Rakudo builds easily take an hour between them, and running spectest this way takes over 12 hours. It’s cheap compared to shipping a MoarVM with new and nasty bugs that waste a bunch of people’s time, of course!

It’s been a while since we did such a torture test. I’ve decided we should do them more often. It found issues. So far, from the spectest run torture test results, I’ve fixed 9 bugs (I didn’t go back and count those discovered while building NQP and Rakudo). What’s interesting is that of the 9, only 3 of them were clearly attributable to my refactors, one was potentially related to them, and 5 were bugs that must have been around a good while. One of the bugs that did relate to the frames changes caused deadlocks in multi-threaded code quite reliably under torture testing, but would have likely caused them very rarely under normal use (and so been extremely frustrating to reproduce and track down if it made it into the wild). 2 of the fixed non-frame bugs exclusively affected multi-threaded programs and would have doomed them. One was in the CUnion representation, and probably was the cause of some previously unresolved occasional failures of the NativeCall union tests.

What next?

By this point, I’m reasonably confident that regressions due to the first step of the frame changes have been shaken out. The GC torture testing has, however, shed light on some other issues that will want addressing in the near future.

I intend to put those aside for a little while, and complete the frame changes, introducing the linear stack. Compared with the first step, this feels like a lower risk change, in that mistakes should be a lot easier and cheaper to detect. I’d like to try and land this in the next week or so, in order that it can get some real-world testing before it makes it into a MoarVM and Rakudo release.

Once that’s out the way, I’ll be returning to other issues turned up in GC torture testing. I’d also like to look into a way to be able to run it automatically and regularly (once a week, perhaps). It’s a good bit too intensive to be able to farm it out to Travis. The sensible solution is probably to do it in the cloud, on some elastic compute thing where it just uses a machine once a week for a day or so. The silly but fun way is to build a Raspberry Pi cluster on my desk, and hack up something to distribute the tests across them. :-)

Posted in Uncategorized | 4 Comments

Framing the problem

In this post I’ll be talking a lot about call frames, also known as invocation records. Just to be clear about what they are, consider a sub:

sub mean(@values) {
    @values.sum / @values
}

Whenever we call mean, we create a call frame. This holds the storage for the incoming @values parameter. It also holds some temporary storage we use in executing the sub, holding, for example, the sum method object we get back when looking up the method, and the result of calling @values.sum, which we then pass to infix:</>. Call frames also record outer and caller references (so we can resolve lexical and dynamic variables), the place to store the return value and go to on return, and other bits. It’s important to note that call frames are not 1:1 with subs/methods/blocks. Perhaps the best way to understand why is to consider a recursive sub:

sub fac($n) {
    $n <= 1
        ?? 1
        !! $n * fac($n - 1)
}

There’s one fac sub but we need a call frame for each invocation of (that is, call to) fac, since the $n parameter will vary in each call. (Threads are another example where you’re “in” a sub multiple times at the same time.)

All complex software systems evolve from simple systems. MoarVM is no exception. Back when MoarVM started out, I knew I wanted to have invocation be cheap, and call frames be fairly lightweight. I also didn’t want them to be GC-allocated. I figured that code sat in a loop, only using native types and only calling things involving native types, should not create garbage that needed collecting. All good goals.

Fast forward a few years, and where are we? Let’s start out with the easy one to assess: frames aren’t GC-allocated. So that’s good, right? Well, sure, in that I got the natives property that I was after. However, since things like closures and continuations exist, not to mention that you can get a first-class reference to a call frame and traverse the outer/caller chain, the lifetime of frames is interesting. They most certainly don’t always just go away at the point of return. Therefore, they need to have their memory managed in some way. I went with reference counts, figuring that since we’d only need to twiddle them fairly occasionally, it’d be fairly OK. Trouble is, thanks to MoarVM supporting concurrent threads of execution, those counts need to be incremented and decremented using atomic operations. Those are CPU native, but they’re still a bit costly (more so on some CPUs that others).

There’s another, more hidden, cost, however – one I didn’t really see coming. MoarVM has a generational garbage collector, as discussed in my previous post. But frames are not garbage collectable objects. They’re managed by reference counts. So what happens when a reference counted frame is referenced by a second generation object? Well, there’s no risk of the frames going away too early; the reference count won’t be decremented until the gen2 object itself is collected. The problem is about the objects the frame references. Frames, not being garbage collectable, don’t have write barriers applied on binds into them. This means that they can come at any time to point to nursery objects. We solved this by keeping all objects referencing frames in the inter-generational root set. This is perfectly safe. Unfortunately, it also greatly increases the cost of garbage collection for programs that build up large numbers of closures in memory and keep them around. Of course, since write barriers are cheap but not free, we get a performance win on all programs by not having to apply them to writes to working registers or lexical.

So, how about invocation cost? Is invocation cheap? Well, first of all lets turn off inlining:

SET MVM_SPESH_INLINE_DISABLE=1

And measure 10 million invocations passing/receiving one argument using Perl 5, NQP, and Rakudo. Perl 5 does them in 2.85s. NQP comes out a little ahead, at 2.45s. Rakudo strolls through them in an altogether too leisurely 6.14s. (Turn inlining back on, and Rakudo manages it in 3.39s.) So, if NQP is already ahead, is MoarVM really so bad? Well, it could certainly be better. On an idealized 3GHz GPU, each invocation is costing around 735 CPU cycles. That’s pricey. The other issue here is that just matching Perl 5 on invocation speed isn’t really enough, because tons of things that aren’t invocations in Perl 5 actually are in Perl 6 (like, every array and hash index). In a “Perl 6 is implemented in Perl 6” world, we need to squeeze a good bit more out of invocation performance.

And finally, what about size? An MVMFrame comes at a cost of 296 bytes. It points to a chunk of working space together with a lexical environment (both arrays). Every single closure we take also pays that fixed 296 byte cost (and, of course, the cost of the lexical environment storage, since that’s what we actually take closures for). Again, not staggeringly huge, but it adds up very quickly.

These are all areas that need improvement. In fact, they make up two of the entries in the performance section of theproposal for the grant I’m doing this work under. So, I decided it was time to start thinking about how I’ll address them.

Some measurements

I was curious how many frames end up referenced by garbage collectable objects against how many never end up in this situation. So, I quickly patched MoarVM to keep track of if a frame ever came to be referenced by a GC-able object:

diff --git a/src/core/frame.c b/src/core/frame.c
index ca1a4d2..f392aca 100644
--- a/src/core/frame.c
+++ b/src/core/frame.c
@@ -114,7 +114,10 @@ MVMFrame * MVM_frame_dec_ref(MVMThreadContext *tc, MVMFrame *frame) {
      * to zero, so we look for 1 here. */
     while (MVM_decr(&frame->ref_count) == 1) {
         MVMFrame *outer_to_decr = frame->outer;
-
+if (frame->refd_by_object)
+    tc->instance->refd_frames++;
+else
+    tc->instance->non_refd_frames++;
         /* If there's a caller pointer, decrement that. */
         if (frame->caller)
             frame->caller = MVM_frame_dec_ref(tc, frame->caller);
diff --git a/src/core/instance.h b/src/core/instance.h
index b14f11d..4f61000 100644
--- a/src/core/instance.h
+++ b/src/core/instance.h
@@ -365,6 +365,9 @@ struct MVMInstance {

     /* Cached backend config hash. */
     MVMObject *cached_backend_config;
+
+MVMuint64 refd_frames;
+MVMuint64 non_refd_frames;
 };

 /* Returns a true value if we have created user threads (and so are running adiff --git a/src/main.c b/src/main.c
index 5458912..1df4fe3 100644
--- a/src/main.c
+++ b/src/main.c
@@ -189,7 +189,9 @@ int main(int argc, char *argv[])

     if (dump) MVM_vm_dump_file(instance, input_file);
     else MVM_vm_run_file(instance, input_file);
-
+printf("Ref'd frames: %d\nNon-ref'd frames: %d\n",
+    instance->refd_frames,
+    instance->non_refd_frames);
     if (full_cleanup) {
         MVM_vm_destroy_instance(instance);
         return EXIT_SUCCESS;

And measured a few things (the names from the latter ones are benchmark names from perl6-bench):

Measured                    Ref'd       Non-ref'd       % Ref'd
========                    =====       =========       =======
NQP startup                 0           5259            0.0%
NQP regex tests             28065       1682655         1.6%
Compile Perl 6 actions      115092      6100770         1.7%
Compile Perl 6 grammar      130716      5451120         2.3%
Compile CORE.setting        2065214     55771097        3.8%
Perl 6 startup              35          12822           0.3%
Compiling Test.pm6          39639       860474          4.4%
Compiling NativeCall.pm6    145426      1887682         7.2%
while_array_set             993701      6024920         14.1%
while_hash_set              1804        2024016         0.1%
for_assign                  1654        1020831         0.2%
for_concat_2                1743        2023589         0.1%
split_string_regex          8992750     19089026        32.0%
create_and_iterate_hash_kv  14990870    40027814        27.2%
parse_json                  10660068    42364909        20.1%
rc-forest-fire              3740096     16202368        18.8%
rc-mandelbrot               89989       5523439         1.6%
rc-man-or-boy-test          791961      7091381         10%

What can we infer from this? First of all, most NQP programs have at most just a few percent of their frames referenced by GC-able objects. With the Perl 6 benchmarks, it’s all over the map, with split_string_regex being the “worst” case. NQP’s optimizer is much better doing lexical to local lowering, and flattening away scopes that we don’t really need. In Rakudo, we’re pretty weak at that. Clearly, some more work on this area could benefit Rakudo (and yes, it’s also on the list of things to do under my grant).

Secondly, since – even in the worst cases – the majority of frames never get themselves tied up with any “interesting” situations that causes them to become GC-referenced, a strategy that handles them differently – and hopefully far more efficiently – would give us a win.

What GC-able things reference frames?

It was fairly easy to grep through the MoarVM source and make a list. I did so to help me think through the cases:

  • A frame being closed over (closures)
  • A frame being captured in a continuation
  • A frame being referenced by an exception
  • A local/lexical being referenced by a native reference
  • A frame becoming the default outer thanks to “auto-close” (rare)
  • A frame getting wrapped in a context object, to use it as a first-class object (also, hopefully, rare in any hot-path code)

It’s also interesting to note that a frame only ever “escapes” such that it can be touched by another thread if it becomes referenced by a GC-able object.

What makes frames take up space?

Next, I decided to to through the MVMFrame data structure and see where the space is going, and what options might exist for saving that space. What follows is an analysis of all the fields in an MVMFrame.

/* The thread that is executing, or executed, this frame. */
MVMThreadContext *tc;

Interestingly, this one gets cleared after a certain point in the frame’s life, except if it’s captured in a continuation. Exception handling uses it to know if the frame is still on the call stack, which is interesting in various cases. GC marking uses it to know if it should mark ->work (see below).

Interestingly, nothing seems to care overly much at the moment that it points to a particular thread context; they all want it for a flag. So, it’s certainly a candidate for removal. It’s also interesting to note that in every case where a frame is not referenced by an object, it is alive solely by being in a thread’s “call stack” – that is, the call chain from following the ->caller pointer from the currently executing frame of a thread. So, the flag will only matter for frames that are GC-referenced.

/* The environment for this frame, which lives beyond its execution.
* Has space for, for instance, lexicals. */
MVMRegister *env;

Relevant for frames in whatever state.

/* The temporary work space for this frame. After a call is over, this
* can be freed up. Must be NULLed out when this happens. */
MVMRegister *work;

Relevant for frames that are still executing, or that are captured by a continuation. Cross-cuts whether they are GC-referenced.

/* The args buffer. Actually a pointer into an area inside of *work, to
* decrease number of allocations. */
MVMRegister *args;

Possibly could go away through a level of indirection, but it’s performance sensitive. Used together with…

/* Callsite that indicates how the current args buffer is being used, if
* it is. */
MVMCallsite *cur_args_callsite;

…this one.

/* The outer frame, thus forming the static chain. */
MVMFrame *outer;

Pretty much everything has an outer.

/* The caller frame, thus forming the dynamic chain. */
MVMFrame *caller;

Pretty much everything has a caller too.

/* The static frame information. Holds all we statically know about
* this kind of frame, including information needed to GC-trace it. */
MVMStaticFrame *static_info;

As you might guess, this is pretty important and useful. However, it’s also possible to obtain it – at the cost of a level of indirection – through the ->code_ref below. Would need to measure carefully, since it’d increase the cost of things like lexical lookups from outer frames (and, once we get better at optimizing, that will be “most of them”).

/* The code ref object for this frame. */
MVMObject *code_ref;

The particular closure we were invoked as. Not something we can obviously lose, and needed for the lifetime of the frame in general.

/* Parameters received by this frame. */
MVMArgProcContext params;

Argument processing context. Every frame uses it to process its arguments. It’s only useful while ->work is active, however, and so could be allocated as a part of that instead, which would reduce the cost of closures.

/* Reference count for the frame. */
AO_t ref_count;

Can go away provided we stop reference counting frames.

/* Is the frame referenced by a garbage-collectable object? */
MVMint32 refd_by_object;

Could also go away provided we stop reference counting frames and have some scheme for optimizing the common, non-referenced case.

/* Address of the next op to execute if we return to this frame. */
MVMuint8 *return_address;

/* The register we should store the return value in, if any. */
MVMRegister *return_value;

/* The type of return value that is expected. */
MVMReturnType return_type;

/* The 'entry label' is a sort of indirect return address
* for the JIT */
void * jit_entry_label;

These four are only used when the frame is currently on the call stack, or may be re-instated onto the call stack by a continuation being invoked. Could also live with ->work, thus making closures cheaper.

/* If we want to invoke a special handler upon a return to this
* frame, this function pointer is set. */
MVMSpecialReturn special_return;

/* If we want to invoke a special handler upon unwinding past a
* frame, this function pointer is set. */
MVMSpecialReturn special_unwind;

/* Data slot for the special return handler function. */
void *special_return_data;

/* Flag for if special_return_data need to be GC marked. */
MVMSpecialReturnDataMark mark_special_return_data;

Used relatively occasionally (and the more common uses are candidates for spesh, the dynamic optimizer, to optimize out anyway). A candidate for hanging off an “extra stuff” pointer in a frame. Also, only used when a frame is on the call stack, with the usual continuation caveat.

/* Linked list of any continuation tags we have. */
MVMContinuationTag *continuation_tags;

Used if this frame has been tagged as a possible continuation “base” frame. Only relevant if that actually happens (which is quite rare in the scheme of things), and can only happen when a frame is on the call stack. A candidate for similar treatment to the special return stuff.

/* Linked MVMContext object, so we can track the
* serialization context and such. */
/* note: used atomically */
MVMObject *context_object;

This is used when a context goes first-class. Thus, it implies the frame is referenced by at least one GC-able object (in fact, this points to said object). That’s fairly rare. It can happen independently of whether the frame is currently executing (so, unrelated to ->work lifetime).

/* Effective bytecode for the frame (either the original bytecode or a
* specialization of it). */
MVMuint8 *effective_bytecode;

/* Effective set of frame handlers (to go with the effective bytecode). */
MVMFrameHandler *effective_handlers;

/* Effective set of spesh slots, if any. */
MVMCollectable **effective_spesh_slots;

/* The spesh candidate information, if we're in one. */
MVMSpeshCandidate *spesh_cand;

These are all related to running optimized/specialized code. Only interesting for frames currently on the call stack or captured in a continuation (so, ->work lifetime once again).

/* Effective set of spesh logging slots, if any. */
MVMCollectable **spesh_log_slots;

/* If we're in a logging spesh run, the index to log at in this
* invocation. -1 if we're not in a logging spesh run, junk if no
* spesh_cand is set in this frame at all. */
MVMint8 spesh_log_idx;

/* On Stack Replacement iteration counter; incremented in loops, and will
* trigger if the limit is hit. */
MVMuint8 osr_counter;

These 3 play part a part in dynamic optimization too, though more in the stage where we’re gathering information. Again, they have ->work lifetime. The top may well go away in future optimizer changes, so not worth worrying over too much now.

/* GC run sequence number that we last saw this frame during. */
AO_t gc_seq_number;

This one is certainly a candidate for going away, post-refactoring. It serves as the equivalent of a “mark bit” when doing GC.

/* Address of the last op executed that threw an exception; used just
* for error reporting. */
MVMuint8 *throw_address;

May be something we can move inside of exception objects, and have them pay for it, not every frame. Worth looking in to.

/* Cache for dynlex lookup; if the name is non-null, the cache is valid
* and the register can be accessed directly to find the contextual. */
MVMString   *dynlex_cache_name;
MVMRegister *dynlex_cache_reg;
MVMuint16    dynlex_cache_type;

These also have ->work lifetime. Give a huge speed-up on dynlex access, so (aside from re-designing that) they can stay.

/* The allocated work/env sizes. */
MVMuint16 allocd_work;
MVMuint16 allocd_env;

These exist primarily because we allocate work and env using the fixed size allocator, and so we need the sizes to free the memory.

/* Flags that the caller chain should be kept in place after return or
* unwind; used to make sure we can get a backtrace after an exception. */
MVMuint8 keep_caller;

/* Flags that the frame has been captured in a continuation, and as
* such we should keep everything in place for multiple invocations. */
MVMuint8 in_continuation;

/* Assorted frame flags. */
MVMuint8 flags;

It appears the top two could be nicely folded into flags. Also, the flags may only be relevant for currently executing frames, or those captured in a continuation, so this lot is a candidate to move to something with ->work lifetime.

Observations

Here are some things that stand out to me, and that point the way to an alternate design.

  1. An MVMFrame presently carries a bunch of things in it that aren’t relevant unless the frame is either currently on a thread’s call stack or captured in a continuation.
  2. This is an orthogonal axis to whether the frame is referenced by something that is garbage-collectable.
  3. It’s further orthogonal to one of a number of relatively rare things that can happen and need storage in the frame.
  4. Frames that are never referenced by a garbage collectable object will only ever have a reference count of 1, because they will only be alive by virtue of being either the currently executing frame of a thread, or in its caller chain.
  5. Frames only become referenced by something garbage collectable in cases where we’d end up with some other garbage-collectable allocation anyway. For example, in the closure case, we allocate the code-ref that points to the referenced outer frame.
  6. Let’s assume we were to allocate all frames using the GC, and consider the analysis that would let us known when we are able to avoid those allocations. The analysis needed would be escape analysis.

A new approach: the big picture

Taking these into account, I arrived at a way forward that should, I hope, address most of the issues at hand.

Every thread will have a chunk of memory that we’ll refer to as its “call stack”. Every new frame created during normal program execution will be allocated by making space for it, including its ->work and ->env, on this stack. This will need:

  • No reference count, because we know it’s 1
  • No gc_seq_number, because we can use the stack topology to make sure we only mark each frame once

Should this frame ever become referenced by a garbage collectable object, then we will GC-allocate a frame on the garbage-collected heap – as a totally normal garbage-collectable object. The frame state will be copied into this. The work space and environment will also be allocated from the fixed-size allocator, and the data migrated there.

Since this frame is now garbage-collectable, we have to check its ->caller to see if it’s on the thread-local stack, or already been promoted to the heap. If the former, we repeat the above process for it too. This is in order to uphold the key invariant in this design: the thread-local stack may point to things in the garbage-collectable heap, but never vice-versa.

This means the reference counting and its manipulation goes away entirely, and that frames that are heap-promoted become subject to the usual generational rules. Frames that would never be heap-referenced never end up on the heap, don’t add to GC pressure, and can be cleaned up immediately and cheaply.

There are some details to care about, of course. Since generational collection involves write barriers, then binds into frames on the garbage-collectable heap will also be subject to write barriers. Is that OK? There are two cases to consider.

  1. Binding of lexicals. Since most lexicals in Perl 6 point to a Scalar, Array, or Hash in my declarations, or point directly to a read-only object if parameters, this is relatively rare (of course, write barriers apply to the Scalar itself). In NQP, loads of lexicals are lowered to locals already, and we’ll do some more of that in Rakudo too, making it rarer still. Long story short, we can afford write barriers on lexical binds.
  2. Binding of stuff in ->work, which basically means every write into the register set of the interpreter. This, we cannot afford to barrier. However, there are only two cases where a frame is promoted to the heap and has ->work. One case is when it’s still executing, and so in the call chain of a thread. In this case, we can take care to always walk the objects in ->work by simply following the call chain . The second case is when a continuation is taken. But here, there are no binds to registers until the continuation is invoked again – at which point things are back in a thread’s call chain.

Refactoring towards it

The thing that makes this a somewhat scary piece of work is that, in making call frames potentially collectable objects, we break an assumption that has been there since week 1 of MoarVM’s development: that call frames never move. To maximize the chances of discovering problems with this refactor, I decided that step 1 would be to always allocate every single call frame on the heap. Only when that is working would I move on to optimizing away most of those heap allocations by adding the thread-local call stack.

MoarVM currently has 3 kinds of collectable:

  • Objects
  • Type objects
  • STables

So, I added a fourth: call frames. As a result, MVMFrame gains an MVMCollectable at the start of the data structure – which will be present whether it’s stack or heap allocated. This will start out zeroed when a frame is born on the call stack. This does two nice things: it gives us a way to know if a frame is GC-able or not, and also means the write barrier – without modification – will do the right thing on both stack and heap frames.

There were two more easy things to do. First was to add a function to allocate a heap frame. Second was to factor out frame destruction from reference decrement, since the latter was going away.

Beyond that, there was nothing for it besides diving in, breaking the world, and then trying to put it back together again. I got a good start towards it – but the conclusion of this first step will have to wait for next week’s installment! See you then.

Posted in Uncategorized | 5 Comments

Heap heap hooray!

Last week, I finally hunted down and fixed the EVAL memory leak, with the help of the heap snapshot analyzer I wrote about recently. I also hunted down a hang in parallel runs of the Perl 6 specification test suite that showed up recently – but only on Windows – and fixed that too.

Before we begin: a generational GC primer

A few posts ago I talked a bit about how MoarVM’s garbage collector works – and hand-waved on just about all of the interesting details. Some folks on IRC expressed a bit of curiosity about those details, so I figured I’d explain some of them as they come up in my performance/reliability work. In this post I’ll talk about generational collection, because it plays a part in the EVAL leak story.

Recall that the essence of a tracing garbage collector is that we start with a set of roots: global symbols, the current call frame that each thread is executing, thread-local symbols, and so forth. We “mark” each object we find in these roots as “alive”, and stick them onto a todo list. We work our way through this todo list, asking each object what it references and putting those onto the todo list also. With a little care to never revisit objects we already considered, we terminate having marked all reachable objects alive. The memory associated with unmarked objects can then be freed.

You might worry that if we have millions of objects in memory, this could be a rather time-consuming process to do again and again. You’d be right. Worse, memory access performance depends heavily on the CPU caches getting high hit rates. When we need to walk huge numbers of objects, we end up getting loads of CPU cache misses, and have to spend time fetching objects from main memory. (To give you an idea of the numbers: a level 1 cache hit means memory access in several CPU cycles, while having to go to main memory can easily cost a couple of hundred cycles or worse).

So, how might we do better? The key insight behind generational GC, often known as the generational hypothesis, is that in most programs objects are either very short-lived (surviving zero or one garbage collections) or long-lived (perhaps staying around for the lifetime of the entire program). Therefore, it’s reasonable to assume that once an object survives a couple of collections, it will stay around for a good few more.

Generational collection works by dividing the heap – that is, the space where objects are allocated – into generations (typically two or three). In MoarVM we have two generations, which we tend to refer to in the code as “nursery” and “gen2”. The nursery, as the name suggests, is where objects begin their life. We allocate them cheaply there using a “bump-the-pointer” scheme. The nursery in MoarVM is a fixed-size chunk of memory, and after a while we fill it up. This is what triggers garbage collection.

In a 2-generation collector, there are two types of collection, which we know as nursery collection and full collection. In a full collection, we do exactly what I described earlier: visit all objects, marking them alive, and freeing up those that aren’t. A nursery collection is similar, except as soon as we see an object is not in the nursery, we don’t put it on the todo list. Instead, we simply ignore it. This greatly cuts down on the number of objects we need to consider, making nursery collections hugely cheaper.

There are two things we must do to make this really work. The first is ensure that we only free memory associated with objects living in the nursery, not the old generation, since we didn’t do the analysis needed to free anything there. That’s fine; most objects “die young” anyway. The second is more subtle. There may be objects in the nursery that are only alive because something in the old generation references them. However, since we’re not considering any old generation objects, we won’t discover this liveness and so wrongly free things. This is resolved by maintaining a set of objects that are in the old generation but pointing to objects in the nursery. Whenever we assign a reference from one object to another, we check if this would establish an old generation to nursery reference, and stick the old generation object into the set, ensuring we will visit it and mark the nursery object. This check is known as a “write barrier”.

So, back to the EVAL story…

The EVAL leak

I reviewed, fixed up, and merged various patches from Timo to improve the heap snapshot data dumps by annotating them with more data. Then, I looked at a few paths to leaked objects (recall that I was using EVAL 'my class ABC { }' to demonstrate the leak). The paths looked something like this:

> path 38199
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
BOOTCompUnit (Object)
    --[ Index 81 ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
ABC (STable)

This shows the objects along the path, but those “unknowns” were hiding what I really wanted to know. So, I did some further patches, and got out a rather more useful result:

> path 6466
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh guard match ]-->
ABC (STable)

So, here we see that it’s a type specializer guard that is keeping the object alive. “Wait…a what?!” MoarVM does a bunch of dynamic optimization, watching out for types that occur at runtime and generating specialized versions of the code by type. And, sometimes, we have an unfortunate situation where code is discovered “hot”, but the type it was invoked with is fairly transient. In this case, the specialization matching table will end up referring to that type, keeping it alive.

However, since for any given bit of code we only generate a handful of these specializations, eventually we’d saturate them and stop leaking memory. I looked at another path:

> path 10594
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh log slots ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This is a case where the optimizer is tracing what objects show up. In fact, most of the paths looked this way. However, that should saturate at some point, yet I know that it goes on leaking. Finally, I found another path to a leak:

> path 59877
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

However, 9 out of the 10 leaked objects were leaked because of the dynamic optimizer keeping things alive that it had seen while tracing. But that, while awkward, should eventually saturate – as should the integer cache issue. But the memory use grew forever, suggesting that things go on and on leaking. So, I tried a snapshot after disabling dynamic optimization. And:

> find stables type="ABC"
Object Id  Description
=========  ===========
368871     ABC

Just the one! 9 out of 10 objects were not on the heap. And yes, it was the integer box cache problem that kept the 1 alive:

> path 368871
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

So in theory, with dynamic optimization disabled, this suggested that we did not leak any more, and all the blame was on the optimizer. To check that out, I tried a long-running EVAL loop and…it still leaked heavily. My theory that dynamic optimization couldn’t account for all of the leaking, just the first bit of it, seemed to hold up.

To investigate it further, I did a loop of 100 EVALs, as opposed to the 10 I had used so far. This took a snapshot every GC run, plus one more that I forced at the end. So, how did the final snapshot look?

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 1`.
Type `help` for available commands, or `exit` to exit.

> snapshot 8
Loading that snapshot. Carry on...
> summary
Wait a moment, while I finish loading the snapshot...

    Total heap size:              33,417,225 bytes

    Total objects:                369,541
    Total type objects:           1,960
    Total STables (type tables):  1,961
    Total frames:                 2,077
    Total references:             1,302,511

> find stables type="ABC"
Object Id  Description
=========  ===========
368872     ABC

Only the one. So, according to this, we’re not leaking. But that’s when the loop is over. What about a mid-loop GC? I switched to a snapshot in the middle, and:

> find stables type="ABC"
Object Id  Description
=========  ===========
130353     ABC
376552     ABC

> path 130353
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

> path 376552
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Result ]-->
NQPMatch (Object)
    --[ Unknown ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

OK, that’s reasonable too: it’s alive because it’s referred to by the compiler, which is run as part of EVALing the code. So what are we leaking? I tried this:

> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,114,800 bytes
BOOTStaticFrame                        4,806,668 bytes
BOOTInt                                4,642,720 bytes
VMString                               2,859,188 bytes
BOOTHash                               2,253,016 bytes
SCRef                                  1,891,768 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,448,208 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,936 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,400 bytes
Signature                              208,440 bytes

Then compared with an earlier snapshot

> snapshot 2
Loading that snapshot. Carry on...
> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,110,920 bytes
BOOTStaticFrame                        4,806,152 bytes
BOOTInt                                4,642,624 bytes
VMString                               2,858,472 bytes
BOOTHash                               2,241,320 bytes
SCRef                                  1,891,696 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,447,776 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,136 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,112 bytes
Signature                              208,296 bytes

Again, nothing very interesting to note. This highly suggested that either there was some missing information in the heap snapshot, or something else in the VM state – but that got cleaned up at exit – that was also getting leaked.

So I pondered a bit, and compared the GC marking code with the heap snapshot code. And…had an “aha!” moment. Remember I talked about the inter-generational root set that we keep thanks to generational collection? This was not being accounted for in heap snapshots. I fixed it, and the size of the resulting heap snapshot files was a dead giveaway that it made a huge difference:

04/06/2016  14:46       184,272,062 rak-heap-6
04/06/2016  15:21       262,846,653 rak-heap-7

And, back in the analyzer:

> snapshot 2
Loading that snapshot. Carry on...
> find stables type="ABC"
Wait a moment, while I finish loading the snapshot...

Object Id  Description
=========  ===========
21266      ABC
21312      ABC
22359      ABC
23317      ABC
24275      ABC
25233      ABC
26191      ABC
27149      ABC
28107      ABC
29065      ABC
30023      ABC
30981      ABC
361108     ABC
363007     ABC
364903     ABC

So, the objects were there, after all. I took a look at some of them:

> path 24275
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 24269 ]-->
ABC (STable)

> path 30981
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 30975 ]-->
ABC (STable)

> path 363007
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 21244 ]-->
BOOTCompUnit (Object)
    --[ Serialization context dependency ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This would explain a prolonged lifetime, but not an unending leak. I found myself missing a count command so I could easily see how things varied between the snapshots. I implemented it, then observed this:

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
34
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
72
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
97
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"
100

The number of objects in the inter-generational set just kept on growing! So, either this workload was just not triggering a gen-2 collection, or there was a bug. How to find out? By doing a normal --profile on the same code, and looking at the output. The summary page stated:

The profiled code did 9 garbage collections. There were 0 full collections involving the entire heap.

OK, so we really never did a full collection. That explains this particular snapshot, but not the leak over time, which surely would end up triggering a full collection at some point. To test that theory, I tweaked a MoarVM header to make full collections happen more often, to see it helped:

diff --git a/src/gc/collect.h b/src/gc/collect.h
index b31a112..af6c456 100644
--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -6,7 +6,7 @@
 /* How many bytes should have been promoted into gen2 before we decide to
  * do a full GC run? The numbers below are used as a base amount plus an
  * extra amount per extra thread we have running. */
-#define MVM_GC_GEN2_THRESHOLD_BASE      (30 * 1024 * 1024)
+#define MVM_GC_GEN2_THRESHOLD_BASE      (1 * 1024 * 1024)
 #define MVM_GC_GEN2_THRESHOLD_THREAD    (2 * 1024 * 1024)

This change made it do a full collection for every 1MB promoted, not every 30MB. Hopefully that would be enough to trigger an some runs. And it did:

The profiled code did 9 garbage collections. There were 4 full collections involving the entire heap.

Much better. So, over in the heap profiler:

Considering the snapshot...looks reasonable!

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 5`.
Type `help` for available commands, or `exit` to exit.

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
34
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
72
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
97
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"
100

Well, ouch. That implies that the inter-generational root set keeps on growing and growing, for some reason. The profiler, which reports this number, agrees with this assessment (note the numbers of gen2 roots on the right):

profile-screenie

So, how could this possibly be happening?

I like to rule out the really silly things first. Like, if the thing that cleans up the inter-generational roots list after a full collection is being called. How? With a printf, of course! :P

diff --git a/src/gc/roots.c b/src/gc/roots.c
index b771106..361b472 100644
--- a/src/gc/roots.c
+++ b/src/gc/roots.c
@@ -301,7 +301,7 @@ void MVM_gc_root_gen2_cleanup(MVMThreadContext *tc) {
     MVMuint32        num_roots    = tc->num_gen2roots;
     MVMuint32        i = 0;
     MVMuint32        cur_survivor;
-
+printf("cleaning up gen2 roots\n");
     /* Find the first collected object. */
     while (i < num_roots && gen2roots[i]->flags & MVM_CF_GEN2_LIVE)
         i++;

And…no output. I couldn’t quite belive it. So, I went to the place that this function is called, and noticed there was some logging I could switch on that describes all the ins and outs of a organizing a GC run. That produced plenty of output, and showed it was indeed not reaching the place where it would call the gen2 roots cleanup either.

After some hunting, I discovered that an addition a while ago that tracked the amount of data promoted to the old generation, and used it to decide whether to do a full collection, had resulted in a nasty accident. It did the calculation to check if a full collection was needed in two different places, and the answer could change between them. This led to us not going through all the steps that a full collection would need.

I patched it and…the leak was gone. Even with dynamic optimization re-enabled and full collections back to being run after every promoted 30MB, the ever-increasing memory use of EVAL in a loop was no more. It climbed for a short while, then flattened out.

This fix will likely help numerous longer-running programs that have medium-lifetime objects. It also shaved around 45MB off CORE.setting compilation memory use. Unfortunately, it also caused a 5% slowdown in the Rakudo build-time, presumably because we were now actually doing all the work we should be on full collections!

A bit of tuning

With the leak out of the way – or at least, the big one – I wanted my 5% back. So I took a look at our logic for how we decide whether or not to do a full collection. The strategy so far had been to expect a fixed amount of memory to have been promoted to the old generation (with some additions per thread). However, this was lacking in a couple of ways.

For one, it only accounted for the direct size of objects, not any extra unmanaged memory they held. So, a compilation unit would not factor in the size of the compiled bytecode it held on to, and a dynamic array would not factor in the size of its storage. This was now an easy enough fix thanks to additions made while implementing heap snapshots.

For two, a fixed limit doesn’t behave too well with programs that really do build up a large heap over time. There, we can afford to promote a good bit more beforehand, and percentage wise it won’t make a lot of difference since they’re growing anyway. Building Rakudo’s CORE.setting is like this, as it builds up the program tree. So, I switched to a percentage-based scheme with a minimum threshold, which could afford to be a bit lower than the 30MB from before. These changes not only got Rakudo’s CORE.setting build time back down again (without having to give up much of the memory savings from before), but also had the EVAL loop example having a lower memory ceiling.

So, not only did I hunt down and fix the memory leak, I ended up tuning things to achieve a lower memory ceiling for applications whose memory footprint is fairly fixed over their life them, and less full GC runs for those growing over time.

Hunting down a parallel spectest hang

On Windows, a parallel spectest of high enough degree could cause hangs. I use TEST_JOBS=12, a number I picked after some measuring in the past. In the last couple of weeks, I started seeing hangs – and they went away if I cut down TEST_JOBS to just running 2 or 3 in parallel.

Eventually, helped by make localtest (where you can specify a file listing the tests to run), I managed to get it down to just 2 tests that, when run together, would reliably hang. It turned out one of them spawned another Rakudo process as part of the test, so there were 3 processes involved. Attaching the debugger to each in turn, I saw one was hung on getting a file lock, one was hung waiting for process termination (and was waiting on the process that was blocked on a file lock), and the other was blocked trying to write the STDOUT.

I don’t know for sure, but so far as I can tell, the way the test harness does parallel testing on Windows involves running batches of tests and reading a bit of TAP from them one at a time. The process blocked writing to STDOUT was also the one holding the file lock, but the test harness was, I presume, waiting to read output from the process blocked waiting for the process that was in turn waiting for the file lock. So, a nice circular wait involving 4 processes, one of them being the test harness! Typical darn Friday bug hunting… :-)

This also explained nicely why the issue didn’t crop up away from Windows: parallel spectest works differently (read: better) on other platforms. :-) While we will at some point switch to using a Perl 6 test harness for running the spectests that hopefully behaves consistently everywhre, I figured that Rakudo was probably doing something wrong with regard to the file lock.

File locks are used in only one place in Rakudo: managing pre-compilations. A little instrumentation of the lock/unlock code later, I saw a mis-match. A dump of the stack trace at each place we did a lock/unlock eventually led me to the problem, which I was able to fix. This bug likely didn’t just affect spectest on Windows; I suspect I could construct various hangs on other platforms too from it. So, a good fix to have in there.

As a side-note, the reason this bug was both likely to happen and hard to identify was because the lock and unlock were not placed on the same code-path. This is a good idea for any kind of locking. Locks are horrible to work with anyway; putting lock/unlock far apart in the code is just asking for problems (and this is far from the first problem I’ve hunted down in code with such a structure). So, I’ve asked nine++, as part of his great work to keep improving our installation and precomp handling, to look into addressing this, so we’re structurally in a better place to not have such bugs.

And, for all of you out there using Rakudo’s Lock class: first, don’t, and second, if you must use it, always prefer the $lock.protect({ ... }) form over manual lock/unlock method calls.

Two icky problems less…

So, not a bad last week’s Perl 6 work – though it took me all of this week to get around to writing it up. Maybe I’ll be a little faster writing up this week’s happenings. :-)

Posted in Uncategorized | 4 Comments

Small, but welcome, fixes

Last week wasn’t one of my most productive on Perl 6, thanks to a mix of Easter holiday, a little more work than expected on another project, and feeling a tad under the weather for a day or so. In the time I did find, though, I managed to pick off some worthwhile bits and pieces that needed doing.

Heap snapshot issues triage

It’s generally very clean and pleasant that our meta-objects are, so far as the VM is concerned, just objects. This means the answer to “can types be GC’d” is easy: sure they can, so long as nothing is using them any more. The downside is that the VM has very little insight into this huge pile of objects, which is why we have various “protocols” where the MOP can provide some hints. A lot of performance wins come from this. When I was working on heap snapshots, I found it would also be rather useful if there was a way to let the VM know about a “debug name” for a type. A name that isn’t used for any kind of resolution, but that can be included in things like heap snapshots. It’ll no doubt also be useful for debugging. So, I added an op for that, nqp::setdebugtypename, and used it in a branch of NQP, meaning I got more useful heap snapshots. This week, I also added this op to the JVM backend, meaning that I could merge the branch that uses it. This means that you can now do a normal build of Rakudo/NQP/Moar and get useful typenames in the MoarVM heap snapshots.

Last time, I mentioned that the heap snapshot viewer took an age to start up because we were super slow at reading files with really long lines. This week, I fixed this performance bug with two patches. Now, the heap snapshot analyzer picks apart a 25MB snapshot file into its various big pieces (just looking at how each line starts) and reaches its prompt in under a second on my box. That’s a rather large improvement from before, when it took well over a minute.

I also looked at a C-level profile of where MoarVM spends time doing the detailed parsing of a heap snapshot, which happens in the background while the user types their first command, and then blocks execution of that command if it’s not completed. The processing uses a few threads. The profile showed up a couple of areas in need of improvement: we have a lot of contention on the fixed size allocator, and the GC’s world-stopping logic seems to have room for improvement too. So, those will be on my todo list for the future.

Fixing a couple of crashes

SIGSEGV is one of the least inspiring ways to crash and burn, so I’m fairly keen to hunt down cases where that happens and fix them. This week I fixed two. The first was a bug in the UTF8 Clean 8-bit encoding, which turned out to be one of those “duh, what was I thinking” off-by-ones. The second was a little more fun to track down, but turned out to be memory corruption thanks to missing GC rooting.

And…that’s it!

But, this week I’m set to have a good bit more time, so hope to have some more interesting things to talk about in the next report. :-)

Posted in Uncategorized | 1 Comment