Complex cocktail causes cunning crash

I did a number of things for Perl 6 yesterday. It was not, however, hard to decide which of them to write up for the blog. So, let’s dig in.

Horrible hiding heisenbugs

It all started when I was looking into this RT, reporting a segfault. It was filed a while ago, and I could not reproduce it. So, add a test and case closed? Well, not so fast. As I discussed last week, garbage collection happens when a thread fills up its nursery. Plenty of bugs only show up when the timing is just right (or is that just wrong?). What can influence when GC runs? How many allocations we’ve done. And what can influence that? Pretty much any change to the program being run, the compiler, or the environment. The two that most often have people tearing their hair out are:

  • How many things the compiler allocates compiling the code. This is why even adding or removing comments can reveal or hide bugs. And, of course, changes to the compiler over time will make all kinds of differences, so the bug can appear or disappear from Rakudo commit to commit. Argh!
  • How many things are in the environment, because each one of them results in an allocation of a string at startup. So, yes, adding an environment variable to turn on a debugging feature can hide or show a problem. Running under a test harness that sticks something into the environment can do the very same. Eek!

While GC-related issues are not the only cause of SEGVs, in such a simple piece of code using common features it’s far and away the most likely cause. So, to give myself more confidence that the bug truly was gone, I adjusted the nursery size to be just 32KB instead of 4MB, which causes GC to run much more often. This, of course, is a huge slowdown, but it’s good for squeezing out bugs.
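
For the curious, the nursery size is just a constant in MoarVM’s GC headers, so the stress build is a one-line change. As a sketch (assuming the MVM_NURSERY_SIZE define; treat the exact name and location as an implementation detail):

#define MVM_NURSERY_SIZE (32 * 1024)   /* stress-test size; the usual default is 4MB */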

And…no bug! So, in goes the test. Simples!

In even better news, it was lunch time. Well, actually, that was only sort of good news. A few days ago I cooked a rather nice chicken and berry pulao. It came out pretty good, but cooking 6 portions worth of it when I’m home alone for the week wasn’t so smart. I don’t want to see chicken pulao for a couple of months now. Anyway, while I was devouring some of my pulao mountain, I set off a spectest run on the 32KB nursery stress build of MoarVM, just to see if it showed up anything.

Putrid pointers

A couple of failures did, in fact, show up, one of them in constant.t. This rang a bell. I was sure somebody had, in the last couple of weeks, reported a crash in that test file, which had then vanished. I checked in with the person who I vaguely recalled mentioning it and…sure enough, it was that very test file. In their normal test runs, the bug had long since vanished. I figured, having now got a reproduction of it, I should probably hunt it down right away. Otherwise, we’d probably end up playing “where’s Wally” with it for another month or ten.

So, how did the failure look?

$ ./perl6-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
Segmentation fault (core dumped)

It actually segfaulted while compiling the test file. Sad! So, where?

$ ./perl6-gdb-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
[boring output omitted]
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()

That looks…ungood. That final line is meant to be a code location, which means something decided to tell the CPU to go execute code living at the NULL address. At this point, things could go two ways: the JIT spat out something terrible, or a function pointer somewhere was NULL. But which?

(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff78cacbc in MVM_coerce_smart_stringify (tc=0x6037c0, obj=0x605c10, res_reg=0x56624d8)
    at src/core/coerce.c:214
#2  0x00007ffff789dff4 in MVM_interp_run (tc=tc@entry=0x6037c0, initial_invoke=0x60ea80, 
    invoke_data=0x56624d8) at src/core/interp.c:827
#3  0x00007ffff7978b21 in MVM_vm_run_file (instance=0x603010, 
    filename=0x7fffffffe09f "/home/jnthn/dev/rakudo/perl6.moarvm") at src/moar.c:309
#4  0x000000000040108b in main (argc=9, argv=0x7fffffffdbc8) at src/main.c:192

Phew, it looks like the second case, given there’s no JIT entry stub on the stack. So, we followed a NULL function pointer. Really?

(gdb) frame 1
#1  0x00007ffff78cacbc in MVM_coerce_smart_stringify (tc=0x6037c0, obj=0x605c10, res_reg=0x56624d8)
at src/core/coerce.c:214
214     ss = REPR(obj)->get_storage_spec(tc, STABLE(obj));

Yes, really. Presumably, that get_storage_spec is bogus. (I did a p to confirm it.) So, how is obj looking?

(gdb) p *obj
$1 = {header = {sc_forward_u = {forwarder = 0x48000000000001, sc = {sc_idx = 1, idx = 4718592}, 
  st = 0x48000000000001}, owner = 6349760, flags = 0, size = 0}, st = 0x6d06c0}

Criminally corrupt; let me count the ways. For one, 6349760 looks like a very high thread ID for a program that’s only running a single thread (they are handed out sequentially). For two, 0 is not a valid object size. And for three, idx is just a nuts value too (even Rakudo’s CORE.setting isn’t made up of 4 million objects). So, where does this object live? Well, let’s try out last week’s handy object locator to find out:

(gdb) p MVM_gc_debug_find_region(tc, obj)
In tospace of thread 1

Well. Hmpfh. That’s actually an OK place for an object to be. Of course, the GC spaces swap often enough at this nursery size that a pointer could fail to be updated, point into fromspace after one GC run, not be used until a later GC run, and then come to point into some random bit of tospace again. How to test this hypothesis? Well, instead of 32768 bytes of nursery, what if I make it…well, 40000 maybe?

Here we go again:

$ ./perl6-gdb-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
[trust me, this omitted stuff is boring]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78b00db in MVM_interp_run (tc=tc@entry=0x6037c0, initial_invoke=0x0, invoke_data=0x563a450)
    at src/core/interp.c:2855
2855                    if (obj && IS_CONCRETE(obj) && STABLE(obj)->container_spec)

Aha! A crash…somewhere else. But where is obj this time?

(gdb) p MVM_gc_debug_find_region(tc, obj)
In fromspace of thread 1

Hypothesis confirmed.

Dump diving

So…what now? Well, just turn on that wonderful MVM_GC_DEBUG flag and the bug will make itself clear, of course. Alas, no. It didn’t trip a single one of the sanity checks added by enabling the flag. So, what next?

The where in gdb tells us where in the C code we are. But what high level language code was MoarVM actually running at the time? Let’s dump the VM frame stack and find out:

(gdb) p MVM_dump_backtrace(tc)
   at <unknown>:1  (./blib/Perl6/Grammar.moarvm:initializer:sym<=>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:initializer)
 from src/Perl6/Grammar.nqp:3140  (./blib/Perl6/Grammar.moarvm:type_declarator:sym<constant>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:type_declarator)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:term:sym<type_declarator>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from src/Perl6/Grammar.nqp:3825  (./blib/Perl6/Grammar.moarvm:termish)
 from gen/moar/stage2/NQPHLL.nqp:886  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/NQPHLL.moarvm:EXPR)
 from src/Perl6/Grammar.nqp:3871  (./blib/Perl6/Grammar.moarvm:EXPR)
...

I’ve snipped out a good chunk of a fairly long stack trace. But look! We were parsing and compiling a constant at the time of the crash. That’s somewhat interesting, and explains why constant.t was a likely test file to show this bug up. But MoarVM has little idea about parsing or Perl 6’s idea of constants. Rather, something on that codepath of the compiler must have been running into a bug of sorts.

Looking at the location in interp.c, the op being interpreted at the time was decont, which takes a value out of a Scalar container, if it happens to be in one. Combined with knowing what code we were in, I could invoke moar --dump blib/Perl6/Grammar.moarvm, and then locate the disassembly of initializer:sym<=>.

There were a few uses of the decont op in that function. All of them seemed to be on things looked up lexically or dynamically. So, I instrumented those ops with a fromspace check.
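
The check was along these lines (a sketch, reusing the MVM_ASSERT_NOT_FROMSPACE debug macro that appears later in this post; the exact patch may have looked a little different):

MVMObject *obj = GET_REG(cur_op, 2).o;   /* the value being decont'd */
MVM_ASSERT_NOT_FROMSPACE(tc, obj);       /* panic right away if it's stale */

Re-compiled, and…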

(gdb) break MVM_panic
Breakpoint 1 at 0x7ffff78a19a0: file src/core/exceptions.c, line 779.
(gdb) r
Starting program: /home/jnthn/dev/MoarVM/install/bin/moar --execname=./perl6-gdb-m --libpath=/home/jnthn/dev/MoarVM/install/share/nqp/lib --libpath=/home/jnthn/dev/MoarVM/install/share/nqp/lib --libpath=. /home/jnthn/dev/rakudo/perl6.moarvm --nqp-lib=blib -Ilib t/spec/S04-declarations/constant.rakudo.moar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, MVM_panic (exitCode=1, 
    messageFormat=0x7ffff799bc58 "Collectable %p in fromspace accessed") at src/core/exceptions.c:779
779 void MVM_panic(MVMint32 exitCode, const char *messageFormat, ...) {
(gdb) where
#0  MVM_panic (exitCode=1, messageFormat=0x7ffff799bc58 "Collectable %p in fromspace accessed")
    at src/core/exceptions.c:779
#1  0x00007ffff78ba657 in MVM_interp_run (tc=0x1, tc@entry=0x6037c0, initial_invoke=0x0, 
    invoke_data=0x604b80) at src/core/interp.c:374
#2  0x00007ffff7979071 in MVM_vm_run_file (instance=0x603010, 
    filename=0x7fffffffe09f "/home/jnthn/dev/rakudo/perl6.moarvm") at src/moar.c:309
#3  0x000000000040108b in main (argc=9, argv=0x7fffffffdbc8) at src/main.c:192

And what’s in interp.c around that line? The getdynlex op. That’s the one that is used to look up things like $*FOO in Perl 6. So, a dynamic lexical lookup seemed to be handing back an outdated object. How could that happen?

Interesting idea is insufficient

My next idea was to see if I could catch the moment that something bad was put into the lexical. I’d already instrumented the obvious places with no luck. But…what if I could intercept every single VM register access and see if an object from fromspace was read? Hmmm… It turned out that I could make that happen with a sufficiently cunning patch. I made it opt-in rather than the default for MVM_GC_DEBUG because it’s quite a slow-down. I’m sure that this will come in really useful for finding some GC bug some day. But for this bug? It was no direct help.

It was of some indirect help, however. It suggested strongly that at the time the lexical was set (actually, it turned out to be $*LEFTSIGIL), everything was valid. Somewhere between then and the lookup of it using the getdynlex op, things went bad.

Cache corruption

So what does getdynlex actually do? It checks if the current frame declares a lexical of the specified name. If so, it returns the value. If not, it looks in the caller for the value. If that fails, it goes on to the caller’s caller, until it runs out of stack to search and then gives up.
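
In sketch form (simplified pseudocode rather than the actual MoarVM source; try_get_lexical is an invented helper, and the real thing also reports the kind of lexical found):

MVMRegister * dynlex_lookup_slow(MVMThreadContext *tc, MVMString *name) {
    MVMFrame *f = tc->cur_frame;
    while (f) {
        MVMRegister *found = try_get_lexical(tc, f, name);
        if (found)
            return found;
        f = f->caller;   /* not declared here; try the caller */
    }
    return NULL;         /* ran out of stack to search */
}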

If that’s what it really did, then this bug would never have happened. But no, people actually want Perl 6 to run fast and stuff, so we can’t just implement the simplest possible thing and go chill. Instead, there’s a caching mechanism. And, as we all know, the two hardest problems in computer science are cache invalidation and cache invalidation.

The caching is relatively simple: each frame has slots for sticking a name, register pointer, and type in it.

MVMString   *dynlex_cache_name;
MVMRegister *dynlex_cache_reg;
MVMuint16    dynlex_cache_type;

When getdynlex finds something the slow way, it then looks down the stack again and finds a frame with an empty dynlex_cache_name. It then sticks the name of the dynamic lexical into the name slot, a pointer to the MoarVM lexical into the reg slot, and what type of lexical it was (native int, native num, object, etc.) into the type slot. The most interesting of these is the reg slot. The MVMRegister type is actually a union of different types that we may store in a register or lexical. We re-use the union for registers that live while the frame is on the callstack and lexicals that may need to live longer thanks to closures. So, each frame has two arrays of these:

MVMRegister *env;   /* The lexical environment */
MVMRegister *work;  /* Working space that dies with the frame */

And so the dynlex_cache_reg ends up pointing into the env of the frame that we found the lexical in.
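
So, when the slow-path search succeeds, the cache fill amounts to something like this (a sketch; found_frame and found_index stand in for whatever the search produced):

cache_frame->dynlex_cache_name = name;
cache_frame->dynlex_cache_reg  = &found_frame->env[found_index];  /* an interior pointer! */
cache_frame->dynlex_cache_type = type;

That interior pointer into an env buffer is worth keeping in mind for later.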

So, the big question: was the caching to blame? I shoved in a way to disable it and…the bug vanished.

Note that by this point we’re up to two pieces that contribute to the bug: the GC and the dynamic lexical cache. The thing is, the dynamic lexical cache is used very heavily. My gut feeling told me there must be at least one more factor at play here.

Suspicious specialization

So, what could the other factor be? I re-enabled the cache, verified the crash came back, and then stuck MVM_SPESH_DISABLE=1 into the environment. And…no bug. So, it appeared that dynamic optimization was somehow involved too. That’s the magic that looks at what types actually show up at runtime, and compiles specialized versions of the code that fast-paths a bunch of operations based on that (this specialization being where the name “spesh” comes from). Unfortunately, MVM_SPESH_DISABLE is a rather blunt instrument. It disables a huge range of things, leaving a massive surface area to consider for the bug. Thankfully, there are some alternative environment variables that just turn off parts of spesh.

First, I tried MVM_JIT_DISABLE=1, which results in spesh interpreting the specialized version of the code rather than turning it into machine code to remove the interpreter overhead. The bug remained.

Next, I tried MVM_SPESH_OSR_DISABLE, which disables On Stack Replacement. This is a somewhat fiddly optimization that detects hot loops as they are being interpreted, pauses execution, produces an optimized version of the code, and then recalculates the program counter so it points to the appropriate point in the optimized code and continues execution. Basically, the interpreter gets the code it’s executing replaced under it – perhaps with machine code, which the interpreter is instructed to jump into immediately. Since this also fiddles with frames “in flight”, it seemed like a good candidate. But…nope. Bug remained.

Finally, I tried MVM_SPESH_INLINE_DISABLE, which disables inlining. That’s where we spot a call to a relatively small subroutine or method, and just replace the call with the code of the sub or method itself, saving the cost of setting up callframes. And…bug vanished!

So, inlining was apparently a factor too. The trouble is, that also didn’t seem to add up to an obvious bug. Consider:

sub foo($a) {
    bar($a);
}
sub bar($b) {
    my $c = $b + 6;
    $c * 6
}

Imagine that bar was to be inlined into foo. Normally they’d have lexical slots in ->env as follows:

A:  | $_ | $! | $/ | $a |
B:  | $_ | $! | $/ | $b | $c |

The environment for the frame inline(A, B) would look like:

inline(A, B):  | $_ | $! | $/ | $a | $_ | $! | $/ | $b | $c |
                \---- from A ----/  \------- from B -------/

Now, it’s easy to imagine various bugs that could arise in the initial lookup of a dynamic lexical in such a frame. Indeed, the dynamic lexical lookup code actually has two bunches of code that deal with such frames, one for the case where the specialized code is being interpreted and one for the case where it has been JIT compiled. But by the time we are hitting the cache, there’s nothing smart going on at all: it’s just a cheap pointer dereference.

Dastardly deoptimization

So, it seems we need a fourth ingredient to explain the bug. By now, I had a pretty good idea what it was. MoarVM doesn’t just do optimizations based on properties it can prove will always hold. It can also do speculative optimization based on properties that it expects will probably hold up. For example, suppose we have:

sub foo($a, $b) {
    $b.some-huge-complex-call($a);
    return $a.Str;
}

Imagine we’re generating a specialization of this routine for the case $a is an object of type Product. The Str method is tiny, so we go ahead and inline it. However, some-huge-complex-call takes all kinds of code paths. We can’t be sure, from our analysis, that at some point it won’t mix in to the object in $a. What if it mixes in a role that has an alternate Str method? Our inlining would break stuff! We’d end up calling the Product.Str method, not the one from the mixin.

One reaction is to say “well, we’ll just not ever optimize stuff unless we can be REALLY sure”, which is either hugely limiting or relies on much more costly analyses. The other path, which MoarVM does, is to say “eh, let’s just assume mixins won’t happen, and if they do, we’ll fix things then!” The process of fixing things up is called deoptimization. We walk the call stack, rewriting return addresses to point to the original interpreted code instead of the optimized version of the code.

But what, you might wonder, do we do if there’s a frame on the stack that is actually the result of an inlining operation? What if we’re in the code that resulted from inline(A,B), in the bit that corresponds to the code of B? Well, we have to perform – you guessed it – uninlining! The composite call frame has to be dissected, and the call stack rewritten to look like it would have if we’d been running the original interpreted code. To do this, we’d create a call frame for B, complete with space for its lexicals, and copy the lexicals from inline(A,B) that belong to B into that new buffer.
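
The lexical-copying part of that amounts to something like the following sketch (all names invented for illustration; the real code has rather more bookkeeping to do):

/* Give B a frame of its own, then copy B's slice of the composite
 * lexical environment into it. */
MVMFrame *frame_b = create_frame_for(tc, static_frame_b);
memcpy(frame_b->env, composite->env + num_lexicals_a,
       num_lexicals_b * sizeof(MVMRegister));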

The code that does this is one of the very few parts of MoarVM that frightens me.

For good reason, it turns out. This deoptimization, together with uninlining, was the final ingredient needed for the bug. Here’s what happened:

  1. The method EXPR in Perl6::Grammar was inlined into one of its callers. This EXPR method declares a $*LEFTSIGIL variable.
  2. While parsing the constant, the $*LEFTSIGIL is assigned to the sigil of the constant being declared, if it has one (so, in constant $a = 42 it would be set to $).
  3. Something does a lookup of $*LEFTSIGIL. It is located and cached. The cache entry points into a region of the ->env of the frame that inlined, and thus incorporated, the lexical environment of EXPR.
  4. At some point, a mixin happens, causing a deoptimization of the call stack. The frame that inlined EXPR gets pulled apart. A new EXPR frame comes to exist, with the lexicals that used to live in the composite frame copied into it. Execution continues.
  5. A GC happens. The object containing the $ substring moves. The new EXPR frame’s lexical environment is updated.
  6. Another lookup of $*LEFTSIGIL happens. It hits the cache. The cache, however, still points to the place the lexical used to live in the composite frame. This memory has not been freed, because the first part of it is still being used. However, the GC no longer cares about its contents, because that content is unreachable. Therefore, it contains an outdated pointer, and following that pointer means accessing memory that is by now being used for something else entirely, leading to the eventual segmentation fault.

The most natural fix was to invalidate the cache during deoptimization.
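
In sketch form, using the cache fields shown earlier (the actual patch may differ in the details):

/* For each frame the deoptimizer touches: */
frame->dynlex_cache_name = NULL;
frame->dynlex_cache_reg  = NULL;
frame->dynlex_cache_type = 0;

With the entry gone, the next getdynlex falls back to the slow search and caches a pointer into the valid, post-deopt lexical environment.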

Lessons learned

The bug I wrote up last week was thanks to a comparatively simple oversight made within the scope of a few lines of a single C function. While this one could be fixed with a small amount of code added in a single file, the segfault arose from the interaction of four distinct features existing in MoarVM:

  • Garbage collection
  • Dynamic lexical lookup caching
  • Inlining
  • Deoptimization

Even when a segfault was not produced, thanks to “lucky” GC timing, the bug would lead to reading of stale data. It just so turned out that the data wasn’t ever stale enough in reality to break things on this particular code path.

All of garbage collection, inlining, and deoptimization are fairly complicated. By contrast, the dynamic lexical lookup cache is fairly easy. Interestingly, it was the addition of this easy feature that introduced the bug – not because the code that was added was wrong, but rather because it did something that some other far-flung piece of code – the deoptimizer – had quietly relied on not happening.

So, what might be learned for the future?

The most immediate practical learning is that taking interior pointers into mutable data structures is risky. In this case, that data structure was a composite lexical environment, which later got taken apart. Conceptually, the environment was resized, and the interior pointer ended up off the end of the new size. This suggests either providing a safe way to acquire such a reference, or an alternative design for the dynamic lexical cache that avoids needing one.

Looking at the bigger picture, this is all about managing complexity. Folks who work with me tend to observe I worry a good bit about loose coupling, to the degree that I’m much more hesitant than the typical developer when it comes to code re-use. Acutely aware that re-use means use, and use means dependency, and dependency means coupling, I tend to want things to prove they really are the same thing rather than just looking like they might be the same thing. MoarVM reflects this in various ways: to the degree I can, I try to organize it as a collection of parts that either keep themselves very much to themselves, or that collaborate over a small number of very stable data structures. One of the reasons Git works architecturally is because while all the components of it are messing with the same data structure, it’s a very stable and well-understood data structure.

In this bug, MVMFrame is the data structure in question. A whole load of components know about it and work with it because – so the theory went – it’s one of the key stable data structures of the VM. Contrast it with the design of things like the Unicode normalizer or the fixed size allocator, which nothing ever pokes into directly. These are likely to want to evolve over time to choose smarter data structures, or to get extra bits of state to cope with Unicode’s latest and greatest emoji boundary specification. Therefore, all work with them is hidden nicely behind an API.

In reality, MVMFrame has grown to contain quite a fair few things as MoarVM has evolved. At the same time, it’s treated as a known quantity by lots of parts of the codebase. This is only sustainable if every addition to MVMFrame is followed by considering how every other part of the VM that interacts with it will be affected by the change, and making compensating changes to those components. In this case, the addition of the dynamic lexical cache into the frame data structure was not accompanied by sufficient analysis of which other parts of the VM may need compensating changes.

The bug I wrote up last week isn’t really the kind that causes an architect a headache. It was a localized coding slip-up that could happen to anyone on a bad day. It’s a pity we didn’t catch it in code review, but code reviewers are also human. This bug, by contrast, arose as a result of the complexity of the VM – or, more to the point, insufficient management of that complexity. And no, I’m not beating myself up over this. But, as MoarVM architect, this is exactly the type of bug that catches my eye, and causes me to question assumptions. In the immediate, it tells me what kinds of patches I should be reviewing really carefully. In the longer run, the nature of the MVMFrame data structure and its level of isolation from the rest of the codebase deserves some questioning.


Taking a couple of steps backwards to fix a GC bug

When I popped up with a post here on Perl 6 OO a few days ago, somebody noted in the comments that they missed my write-ups of my bug hunting and fixing work in Rakudo and MoarVM. The good news is that the absence of posts doesn’t mean an absence of progress; I’ve fixed dozens of things over the last months. It was rather something between writer’s block and simply not having the energy, after a day of fixing things, to write about it too. Anyway, it seems I’ve got at least some of my desire to write back, so here goes. (Oh, and I’ll try and find a moment in the coming days to reply to the other comments people wrote on my OO post too.)

Understanding a cryptic error

There are a number of ways MoarVM can come tumbling down when memory gets corrupted. Some cases show up as segmentation faults. In other cases, the VM comes across something that simply doesn’t make any kind of sense, and can infer that memory has become corrupted. Two panics commonly associated with this are “zeroed target thread ID in work pass” and “invalid thread ID XXX in GC work pass”, where XXX tends to be a sizable integer. At the start of a garbage collection – where we free up memory associated with dead objects – we do something like this:

  1. Go through all the threads that have been started, and signal those that are not blocked (e.g. waiting for I/O, a lock acquisition, or for native code to finish) to come and participate in the garbage collection run.
  2. Assign each non-blocked thread itself to work on.
  3. Assign each blocked thread’s work to a non-blocked thread.

So, every thread – blocked or not – ends up assigned to a running thread to take care of its collection work. It’s the participation of multiple threads that makes the MoarVM GC parallel (which is a different thing to having a concurrent GC; MoarVM’s GC can barely claim to be that).

The next important thing to know is that every object, at creation, is marked with the ID of the thread that allocated it. This means that, as we perform GC, we know whether the object under consideration “belongs” to the current thread we’re doing GC work for, or some other one. In the case that the ID in the object header doesn’t match up with the thread ID we’re doing GC work for, we stick it into a list of work to pass off to the thread that is responsible. To avoid synchronization overhead, we pass them off in batches (so there’s only synchronization overhead per batch). This is far from the only way to do parallel GC (other schemes include racing to write forwarding pointers), but it keeps the communication between participating threads down and leaves little surface area for data races in the GC.
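
In sketch form, the per-object dispatch during a work pass looks something like this (illustrative only; process_work_item and add_to_batch are invented names, and the real code is rather more careful):

if (item->owner == tc->thread_id)
    process_work_item(tc, item);          /* ours: scan it on this thread */
else
    add_to_batch(tc, item->owner, item);  /* theirs: pass off once the batch fills */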

The funny thing is that if none of that really made any sense to you, it doesn’t actually matter at all, because I only told you about it all so you’d have a clue what the “work pass” in the error message means – and even that doesn’t matter much for understanding the bug I’ll eventually get around to discussing. Anyway, TL;DR version (except you did just read it all, hah!) is that if the owner ID in an object header is either zero or an out-of-range thread ID, then we can be pretty sure there’s memory corruption afoot. The pointer under consideration is either to zeroed memory, or to somewhere in memory that does not correspond to an object header.

So, let’s debug the panic!

Getting the panic is, perhaps, marginally better than a segmentation fault. I mean, sure, I’m a bit less embarrassed when Moar panics than when it SEGVs, and perhaps it’s mildly less terrifying for users too. But at the end of the day, it’s not much better from a debugging perspective. At the point we spot the memory corruption, we have…a pointer. That points somewhere wrong. And, this being the GC, it just came off the worklist, which is full of a ton of pointers.

If only we could know where the pointer came from, I hear you think. Well, it turns out we can: we just need to detect the problem some steps back, where the pointer is added to the worklist. In src/gc/debug.h there’s this:

#define MVM_GC_DEBUG 0

Flip that to a 1, recompile, and magic happens. Here’s a rather cut down snippet from in worklist.h:

#if MVM_GC_DEBUG
#define MVM_gc_worklist_add(tc, worklist, item) \
    do { \
        MVMCollectable **item_to_add = (MVMCollectable **)(item); \
        if (*item_to_add) { \
            if ((*item_to_add)->owner == 0) \
                MVM_panic(1, "Zeroed owner in item added to GC worklist"); \
            /* Various other checks here.... */ \
        } \
        if (worklist->items == worklist->alloc) \
            MVM_gc_worklist_add_slow(tc, worklist, item_to_add); \
        else \
            worklist->list[worklist->items++] = item_to_add; \
    } while (0)
#else
#define MVM_gc_worklist_add(tc, worklist, item) \
    do { \
        MVMCollectable **item_to_add = (MVMCollectable **)(item); \
        if (worklist->items == worklist->alloc) \
            MVM_gc_worklist_add_slow(tc, worklist, item_to_add); \
        else \
            worklist->list[worklist->items++] = item_to_add; \
    } while (0)
#endif

So, in the debug version of the macro, we do some extra checks – including the one to detect a zeroed owner. This means that when MoarVM panics, the GC code that is placing the bad pointer into the list is on the stack. Then it’s a case of using GDB (or your favorite debugger), sticking a breakpoint on MVM_panic (spelled break MVM_panic in GDB), running the code that explodes, and then typing where. In this case, I was pointed at the last line of this bit of code from roots.c:

void MVM_gc_root_add_frame_roots_to_worklist(MVMThreadContext *tc, MVMGCWorklist *worklist,
                                             MVMFrame *cur_frame) {
    /* Add caller to worklist if it's heap-allocated. */
    if (cur_frame->caller && !MVM_FRAME_IS_ON_CALLSTACK(tc, cur_frame->caller))
        MVM_gc_worklist_add(tc, worklist, &cur_frame->caller);

    /* Add outer, code_ref and static info to work list. */
    MVM_gc_worklist_add(tc, worklist, &cur_frame->outer);

So, this tells me that the bad pointer is to an outer. The outer pointer of a call frame points to the enclosing lexical scope, which is how closures work. This provides a bit of inspiration for bug hunting; for example, it would now make sense to consider codepaths that assign outer, to see if they could ever fail to keep a pointer up to date. The trouble is, for such an incredibly common language feature to be broken in that way, we’d be seeing it everywhere. It didn’t fit the pattern. In fact, what my private $dayjob application that was afflicted with this and the whateverable set of IRC bots had in common was that they both did a bunch of concurrency work and spawned quite a lot of subprocesses using Proc::Async.

But where does the pointer point to?

Sometimes I look at a pointer and it’s obviously totally bogus (a small integer usually suggests this). But this one looked feasible; it was relatively similar to the addresses of other valid pointers. But where exactly does it point to?

There are only a few places that a GC-managed object can live. They are:

  • In a thread’s tospace – that is, the region of memory for young objects that we allocate in, and copy objects into during a GC run
  • In a thread’s fromspace – that is, the region of memory for young objects that we evacuate and copy objects out of during a GC run
  • In one of the memory blocks that make up the old generation, where long-lived objects eventually end up

So, it would be very interesting to know if the pointer was into one of those. Now, I could just go examining it in the debugger, but with a dozen running threads, that’s tedious as heck. Laziness is of course one of the virtues of a programmer, so I wrote a function to do the search for me. Another re-compile, reproducing the bug in GDB again, and then calling that routine from the debugger told me that the pointer was into the tospace of another thread.
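
Such a locator boils down to bounds checks against each thread’s spaces (a sketch; the field names follow MoarVM conventions but should be treated as illustrative, and the real version also walks the old generation’s blocks):

char *ts = (char *)other_tc->nursery_tospace;
char *fs = (char *)other_tc->nursery_fromspace;
if ((char *)p >= ts && (char *)p < ts + MVM_NURSERY_SIZE)
    fprintf(stderr, "In tospace of thread %d\n", other_tc->thread_id);
else if ((char *)p >= fs && (char *)p < fs + MVM_NURSERY_SIZE)
    fprintf(stderr, "In fromspace of thread %d\n", other_tc->thread_id);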

Unfortunately, thinking is now required

Things get just a tad mind-bending here. Normally, when a program is running, if we see a pointer into fromspace we know we’re in big trouble. It means that the pointer points to where an object used to be, but was then moved into either tospace or the old generation. But when we’re in the middle of a GC run, the two spaces are flipped. The old tospace is now fromspace, the old fromspace becomes the new tospace, and we start evacuating living objects in to it. The space left at the end will then be zeroed later.

I should mention at this point that the crash only showed up a fraction of the time in my application. The vast majority of the time, it ran just fine. The odd time, however, it would panic – usually over a zeroed thread owner, but sometimes over a junk value being in the thread owner too. This all comes down to timing: different threads are working on GC, in different runs of the program they make progress at different paces, or get head starts, or whatever, and so whether the zeroing of the unused part of tospace has happened yet or not will vary.

But wait…why didn’t it catch the problem even sooner?

When the MVM_GC_DEBUG flag is turned on, it introduces quite a few different sanity checks. One of them is in MVM_ASSIGN_REF, which happens whenever we assign a reference to one object into another. (The reason we don’t simply use the C assignment operator for that is because the inter-generational write barrier is needed.) Here’s how it looks:

#if MVM_GC_DEBUG
#define MVM_ASSIGN_REF(tc, update_root, update_addr, referenced) \
    { \
        void *_r = referenced; \
        if (_r && ((MVMCollectable *)_r)->owner == 0) \
            MVM_panic(1, "Invalid assignment (maybe of heap frame to stack frame?)"); \
        MVM_ASSERT_NOT_FROMSPACE(tc, _r); \
        MVM_gc_write_barrier(tc, update_root, (MVMCollectable *)_r); \
        update_addr = _r; \
    }
#else
#define MVM_ASSIGN_REF(tc, update_root, update_addr, referenced) \
    { \
        void *_r = referenced; \
        MVM_gc_write_barrier(tc, update_root, (MVMCollectable *)_r); \
        update_addr = _r; \
    }
#endif

Once again, the debug version does some extra checks. Those reading carefully will have spotted MVM_ASSERT_NOT_FROMSPACE in there. So, if we used this macro to assign to the ->outer that had the outdated pointer, why did it not trip this check?

It turns out, because it only cared about checking if it was in fromspace of the current thread, not all threads. (This is in turn because the GC debug bits only really get any love when I’m hunting a GC bug, and once I find it then they go back in the drawer until next time around.) So, I enriched that check and…the bug hunt came to a swift end.

Right back to the naughty deed

The next time I caught it under the debugger was not at the point that the bad ->outer assignment took place. It was even earlier than that – lo and behold, inside of some of the guts that power Proc::Async. Once I got there, the problem was clear and fixed in a minute. The problem was that the callback pointer was not rooted while an allocation took place. The function MVM_repr_alloc_init can trigger GC, which can move the object pointed to by callback. Without an MVMROOT to tell the GC where the callback pointer is so it can be updated, it’s left pointing to where the callback used to be.
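
The shape of the fix looks like this (a sketch of the pattern rather than the literal diff; type and use_callback are stand-ins):

/* Broken: a GC during the allocation can move callback, leaving us
 * with a pointer to where it used to live. */
result = MVM_repr_alloc_init(tc, type);
use_callback(tc, callback, result);

/* Fixed: MVMROOT registers &callback with the GC, so the pointer is
 * updated if a collection moves the object. */
MVMROOT(tc, callback, {
    result = MVM_repr_alloc_init(tc, type);
});
use_callback(tc, callback, result);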

So, bug fixed, but you may still be wondering how exactly this bug could have led to a bad ->outer pointer in a callframe some way down the line. Well, callback is a code object, and code objects point to an outer scope (it’s actually code objects that we clone to make closures). Since we held on to an outdated code object pointer, it in turn would point to an outdated pointer to the outer frame it closed over. When we invoked callback, the outer from the code object would be copied to be the outer of the call frame. Bingo.

Less is Moar

The hard part about GCs is not just building the collector itself. It’s that collectors bring invariants that are to be upheld, and a momentary lapse in concentration by somebody writing or reviewing a patch can let a bug like this slip through. At least 95% of the time when I handwavily say, “it was a GC bug”, what I really mean was “it was a bug that arose because some code didn’t play by the rules the GC requires”. A comparatively tiny fraction of the time, there’s actually something wrong in the code living under src/gc/.

People sometimes ask me about my plans for the future of MoarVM. I often tell them that I plan for there to be less of it. In this case, the code with the bug is something that I hope we’ll eventually write in, say, NQP, where we don’t have to worry about low-level details like getting write barriers correct. It’s just binding code to libuv, a C library, and we should be able to do that using the MoarVM native calling support (which is likely mature enough by now). Alas, that also has its own set of costs, and I suspect we’d need to improve native calling performance to not come out at a measurable loss, and that means teaching the JIT to emit native calls, but we only JIT on x64 so far. “You’re in a maze of twisty VM design trade-offs, and their funny smells are all alike.”


Perl 6 is biased towards mutators being really simple. That’s a good thing.

I’ve been meaning to write this post for a couple of years, but somehow never quite got around to it. Today, the topic of mutator methods came up again on the #perl6 IRC channel, and – at long last – coincided with me having the spare time to write this post. Finally!

At the heart of the matter is a seemingly simple question: why does Perl 6 not have something like the C# property syntax for writing complex setters? First of all, here are some answers that are either wrong or sub-optimal:

  • “We didn’t get around to it yet” – actually, it’s a conscious choice. We’ve had plenty of time to ponder it over the years, and compared with many of the other language features we’ve put in, it would have been a soft feature.
  • “Just use a Proxy” – that is, a kind of container that gives you callbacks on FETCH and STORE. This is sometimes a reasonable answer, but not the right answer in most cases.
  • “We’re leaving it to module space to wrap Proxy more nicely” – this supposes Proxy is already a good starting point, when a bunch of the time it is not.

Back to OO basics

The reason the question doesn’t have a one-sentence answer is because it hinges on the nature of object orientation itself. Operationally, objects consist of:

  • A bunch of state (known as attributes in Perl 6, and fields in Java/C#)
  • A mechanism for sending a message (consisting of a name and arguments) to an object, and having the object act upon it

If your eyes glazed over on the second bullet point, then I’m glad you’re reading. If I wasn’t trying to make a point, I’d have simply written “a mechanism for calling a method on an object”. So what is my point? Here’s a quote from Alan Kay, who coined the term “object oriented”:

I’m sorry that I long ago coined the term “objects” for this topic because it gets many people to focus on the lesser idea. The big idea is “messaging”…

For years, I designed OO systems primarily thinking about what objects I’d have. In class-based languages, this really meant what classes I’d have. How did I figure that out? Well, by thinking about what fields go in which objects. Last of all, I’d write the methods.

Funnily enough, this looks very much like procedural design. How do I build a C program? By modeling the state into various structs, and then writing functions that work with those structs. Seen this way, OO looks a lot like procedural. Furthermore, since OO is often taught as “the next step up” after procedural styles of programming, this way of thinking about objects is extremely widespread.

It’s little surprise, then, that a lot of OO code in the wild might as well have been procedural code in the first place. Many so-called OO codebases are full of DTOs (“Data Transfer Objects”), which are just bundles of state. These are passed to classes with names like DogManager. And a manager is? Something that meddles with stuff – in this case, probably the Dog DTO.

Messaging thinking

This is a far cry from how OO was originally conceived: autonomous objects, with their own inner state, reacting to messages received from the outside world, and sending messages to other objects. This thinking can still be found today. Of note, it’s alive and well in the actor model. These days, when people ask me how to get better at OO, one of my suggestions is that they take a look at actors.

Since I grasped that the messages are the important thing in OO, however, the way I design objects has changed dramatically. The first question I ask is: what are the behaviors? This in turn tells me what messages will be sent. I then consider the invariants – that is, rules that the behaviors must adhere to. Finally, by grouping invariants by the state they care about, I can identify the objects that will be involved, and thus classes. In this approach, the methods come first, and the state comes last, usually discovered as I TDD my way through implementing the methods.

Accessors should carry a health warning

An accessor method is a means to access, or mutate, the state held within a particular attribute of an object. This is something I believe we should do far more hesitantly than is common. Objects are intended to hide state behind a set of interesting operations. The moment the underlying state model is revealed to the outside world, our ability to refactor is diminished. The world outside of our object couples to that view of it, and it becomes far too tempting to put operations that belong inside of the object on the outside. Note that a get-accessor is a unidirectional coupling, while a mutate-accessor implies a bidirectional (and so tighter) coupling.

But it’s not just refactoring that suffers. Mutable state is one of the things that makes programs difficult to understand and reason about. Functional programming suggests abstinence. OO suggests you just stick to a pint or two, so your side-effects will be at least somewhat less obnoxious. It does this by having objects present a nice message-y view to the outside world, and keeping mutation of state locked up inside of objects. Ideas such as value objects and immutable objects take things a step further. These have objects build new objects that incorporate changes, as opposed to mutating objects in place. Perl 6 encourages these in various ways (notice how clone lets you tweak data in the resulting object, for example).

Furthermore, Perl 6 supports concurrent and parallel programming. Value objects and immutable objects are a great fit for that. But what about objects that have to mutate their state? This is where state leakage will really, really, end up hurting. Using OO::Monitors or OO::Actors, turning an existing class into a monitor (method calls are synchronous but enforce mutual exclusion) or an actor (method calls are asynchronous and performed one at a time on a given object) is – in theory – easy. It’s only that easy, however, if the object does not leak its state, and if all complex operations on the object are expressed as a single method. Contrast:

unless $seat.passenger {
    $seat.passenger = $passenger;
}

With:

$seat.assign-to($passenger);

Where the method does:

method assign-to($passenger) {
    die "Seat already taken!" if $!passenger;
    $!passenger = $passenger;
}

Making the class of which $seat is an instance into a monitor won’t do a jot of good in the accessor/mutator case; there’s still a gaping data race. With the second approach, we’d be safe.
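
For completeness, here’s roughly how the safe version looks as a monitor, using the monitor declarator from OO::Monitors (a sketch):

use OO::Monitors;

monitor Seat {
    has $!passenger;
    method assign-to($passenger) {
        die "Seat already taken!" if $!passenger;
        $!passenger = $passenger;
    }
}

Since every method call on a monitor instance implies mutual exclusion, the check and the assignment can never interleave between threads.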

So if mutate accessors are so bad, why does Perl 6 have them at all?

To me, the best use of is rw on attribute accessors is for procedural programming: it makes it easy to create mutable record types. I’d also like to be absolutely clear that there’s no shame in procedural programming. Good OO design is hard. There’s a reason Perl 6 has sub and method, rather than calling everything a method and coining the term static method on the grounds that subroutine sounds procedural and “that was the past”. It’s OK to write procedural code. I’d choose to deal with well organized procedural code over sort-of-but-not-really-OO code any day. OO badly used tends to put the moving parts further from each other, rather than encapsulating them.

Put another way, class is there to serve more than one purpose. As in many languages, it doubles up as the thing used for doing real OO programming, and a way to define a record type.

So what to do instead of a fancy mutator?

Write methods for semantically interesting operations that just happen to set an attribute among their various other side-effects. Give the methods appropriate and informative names so the consumer of the class knows what they will do. And please do not try to hide complex operations, potentially with side-effects like I/O, behind something that looks like an assignment. This:

$analyzer.file = 'foo.csv';

Will lead most readers of the code to think they’re simply setting a property. The = is the assignment operator. In Perl 6, we make + always mean numeric addition, and pick ~ to always mean string concatenation. It’s a language design principle that operators should have predictable semantics, because in a dynamic language you don’t statically know the types of the operands. This kind of predictability is valuable. In a sense, languages that make it easy to provide custom mutator behavior are essentially making it easy to overload the assignment operator with additional behaviors. (And no, I’m not saying that’s always wrong, simply that it’s inconsistent with how we view operators in Perl 6.)

By the way, this is also the reason Perl 6 allows definition of custom operators. It’s not because we thought building a mutable parser would be fun (I mean, it was, but in a pretty masochistic way). It’s to discourage operators from being overloaded with unrelated and surprising meanings.

And when to use Proxy?

When you really do just want more control over something that behaves like an assignment. A language binding for a C library that has a bunch of get/set functions to work with various members of a struct would be a good example.
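
A sketch of that kind of use (get-flags and set-flags standing in for hypothetical bindings to the C library’s get/set functions):

class StructBinding {
    has $!struct;   # handle to the underlying C struct

    method flags() is rw {
        Proxy.new(
            FETCH => -> $ { get-flags($!struct) },
            STORE => -> $, $value { set-flags($!struct, $value) }
        )
    }
}

Here, reads and writes really are just reads and writes of a member, so assignment semantics are preserved.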

In summary…

Language design is difficult, and involves making all manner of choices where there is no universally right or wrong answer, but just trade-offs. The aim is to make choices that form a consistent whole – which is far, far easier said than done, because there’s usually a dozen different ways to be consistent too. The choice to dehuffmanize (that is, make longer) the writing of complex mutators is because it:

  • Helps keep = predictably assignment-like, just as all other operators are expected to have consistent semantics
  • Helps mark out the distinction between procedural and object oriented design, by introducing some friction when the paradigms are confused
  • Discourages object designs that will lead to logic leaks, feature envy, violation of the “tell, don’t ask” principle, and probably a bunch of other OO buzzwords I’m too tired to remember at 2am
  • Helps encourage OO designs that will be far more easily refactorable into concurrent objects

Concurrency bug squishing: part 1

Most of my recent Perl 6 development time has been spent hunting down and fixing various concurrency bugs. We’ve got some nice language features in this area, and they’ve been pretty well received so far. However, compared with many areas of Perl 6, they have been implemented relatively recently. Therefore, they have had less time to mature – which, of course, means more bugs and other rough edges.

Concurrency bugs tend to be rather tedious to hunt down. This is in no small part because reproducing them reliably can be challenging. Even once a fairly reliable reproduction is available, working out what’s going on can be tricky. Like all debugging, being methodical and patient is the key. I tend to keep notes of things I’ve tried and observed, and output produced by instrumenting programs with logging (fancy words for “adding prints and exception throws when sanity checks fail”). I will use interactive debuggers when needed, but even then the data from them tends to end up in my editor on any extended bug hunt. Debugging is in some ways the experimental science of programming. Even with a good approach, being sufficiently patient – or perhaps just downright stubborn – matters plenty too.

In the next 2-3 posts here, I’ll discuss a few of the bugs I recently hunted down. It will not be anything close to an exhaustive list of them, which would surely be as boring for you to read as it would be for me to write. This is just the “greatest hits”, if you like.

A tale of silly suspicions and dodgy data

This hunt started out looking in to various hangs that had been reported. Deadlocks are a broad category of bug (there’s another completely unrelated one I’ll cover in this little series, even). However, a number of ones that had shown up looked very similar. All of them showed a number of threads trying to enter GC, stuck in the consensus loop (which, yes, really is just a loop that we go around asking, “did every thread agree we’re going to do GC yet?”)

I’ll admit I went into this one with a bit of a bias. The GC sync-up code in question is a tad clever-looking. That doesn’t mean it’s wrong, but it makes it harder to be comfortable it’s correct. I also worried a bit that the cost of the consensus loop might well be enormous. So I let myself become a bit distracted for some minutes doing a profiling run to find out. I mean, who doesn’t want to be the person who fixed a concurrency bug in the GC and made it faster at the same time?!

I’ve found a lot of useful speedups over the years using callgrind. I’ve sung its praises on this blog at least once before. By counting CPU cycles, it can give very precise and repeatable measurements on things that – measured using execution time – I’d consider within noise. With such accuracy, it must be the only profiler I need in my toolbox, right?

Well, no, of course not. Any time somebody says that X tool is The Best and doesn’t explain the context when it’s The Best, be very suspicious. Running Callgrind on a multi-threaded benchmark gave me all the ammunition I could ever want for replacing the GC sync-up code. It seemed that a whopping 30% of CPU cycles were spent in the GC consensus loop! This was going to be huuuuge…

Or was it? I mean, 30% seems just a little too huge. Profiling is vulnerable to the observer effect: the very act of measuring a program’s performance inevitably changes the program’s performance. And, while we’re on dubious physics analogies (I shouldn’t; I was a pretty awful physicist once I reached the relativity/QM stuff), there’s a bit of an uncertainty principle thing going on. The most invasive profilers can tell you very precisely where in your program time is spent, but you’re miles off knowing how fast you’re normally going in those parts of the program. They do this by instrumenting the program (like the current Perl 6 on MoarVM profiler does) or running it on a synthetic CPU, as Callgrind does. By contrast, a sampling profiler lets your program run pretty much as normal, and just takes regular samples of the call stack. The data is much less precise about where the program is spending its time, but it is a much better reflection of how fast the program is normally going.

So what would, say, the perf sampling profiler make of the benchmark? Turns out, well less than 1% of time was spent in the GC consensus loop in question. (Interestingly, around 5% was spent in a second GC termination consensus loop, which wasn’t the one under consideration here. That will be worth looking into in the future.) The Visual Studio sampling profiler on Windows – which uses a similar methodology – also gave a similar figure, which was somewhat reassuring also.

Also working against Callgrind is the way the valgrind suite of tools deals with multi-threaded applications – in short, by serializing all operations onto a single thread. For a consensus loop, which expects to be dealing with syncing up a number of running threads, this could make a radical difference.

Finally, I decided to check in on what The GC Handbook had to say on the matter. It turns out that it suggests pretty much the kind of consensus loop we already have in MoarVM, albeit rather simpler because it’s juggling a bit less (a simplification I’m all for us having in the future too). So, we’re not doing anything so unusual, and with suitable measurements it’s performing as expected.

So, that was an interesting but eventually fairly pointless detour. What makes this all the more embarrassing, however, is what came next. Running an example I knew to hang under GDB, I waited for it to do so, hit Ctrl + C, and started looking at all of the threads. Here’s a summary of their states:

  • 17: blocking on concurrent queue read (cond var wait)
  • 16: in mark thread unblocked; yielded
  • 15: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 14: blocking on concurrent queue read (cond var wait)
  • 13: blocking on concurrent queue read (cond var wait)
  • 12: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 11: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 10: blocking on concurrent queue read (cond var wait)
  • 09: blocking on concurrent queue read (cond var wait)
  • 08: trying to acquire lock on concurrent queue read (at src/6model/reprs/ConcBlockingQueue.c:160)
  • 07: blocking on concurrent queue read (cond var wait)
  • 06: in mark thread unblocked; yielded
  • 05: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 04: in AO_load_read at MVM_gc_enter_from_interrupt (tc=0x33ee010) at src/gc/orchestrate.c:481
  • 03: blocking on concurrent queue read (cond var wait)
  • 02: in mark thread unblocked; yielded
  • 01: triggered GC; MVM_gc_enter_from_allocator (tc=tc@entry=0x6037c0) at src/gc/orchestrate.c:384

And yes, for the sake of this being a nice example for the blog I perhaps should not have picked one with 17 threads. Anyway, we’ll cope. First up, the easy to explain stuff. All the threads that are “blocking on concurrent queue read (cond var wait)” are fairly uninteresting. They are Perl 6 thread pool threads waiting for their next task (that is, wanting to pull an item from the scheduler’s queue, and waiting for it to be non-empty).

Thread 01 is the thread that has triggered GC. It is trying to get consensus from other threads to begin GC. A number of other threads have already been interrupted and are also in the consensus loop (those marked “in AO_load_read at MVM_gc_enter_from_interrupt”). This is where I had initially suspected the problem would be. That leaves 4 other threads.

You might wonder how we cope with threads that are simply not in a position to participate in the consensus process, because they’re stuck in OS land, blocked waiting on I/O, a lock, a condition variable, a semaphore, a thread join, and so forth. The answer is that before they hand over control, they mark themselves as blocked. Another thread will steal their work if a GC happens while the thread is blocked. When the thread becomes unblocked, it marks itself as such. However, if a GC is already happening at that point, it’s not safe for the thread to proceed. Thus, it yields until GC is done. This accounts for the 3 threads described as “in mark thread unblocked; yielded”.

Which left one thread, which was trying to acquire a lock in order to peek a queue. The code looked like this:

    if (kind != MVM_reg_obj)
        MVM_exception_throw_adhoc(tc, "Can only shift objects from a ConcBlockingQueue");

    uv_mutex_lock(&cbq->locks->head_lock);

    while (MVM_load(&cbq->elems) == 0) {
        MVMROOT(tc, root, {

Spot anything missing?

Here’s the corrected version of the code:

    if (kind != MVM_reg_obj)
        MVM_exception_throw_adhoc(tc, "Can only shift objects from a ConcBlockingQueue");

    MVM_gc_mark_thread_blocked(tc);
    uv_mutex_lock(&cbq->locks->head_lock);
    MVM_gc_mark_thread_unblocked(tc);

    while (MVM_load(&cbq->elems) == 0) {
        MVMROOT(tc, root, {

Yup, it was failing to mark itself as blocked while contending for a lock, meaning the GC could not steal its work. So, the GC’s consensus algorithm wasn’t to blame after all. D’oh.

To be continued…

I actually planned to cover a second issue in this post. But, it’s already gone midnight, and perhaps that’s enough fun for one post anyway. :-) Tune in next time, for more GC trouble and another cute deadlock!


Assorted fixes

I’ve had a post in the works for a while about my work to make return faster (together with routines that never explicitly return), along with some notable multi-dispatch performance improvements. While I get over my writer’s block on that, here’s a shorter post on a number of small fixes I did on Thursday this week.

I’m actually a little bit “between things” at the moment. After some recent performance work, my next focus will be on concurrency stability fixes and improvements, especially to hyper and race. However, a little down on sleep thanks to the darned warm summer weather, I figured I’d spend a day picking a bunch of slightly less demanding bugs off from the RT queue. Some days, it’s about knowing what you shouldn’t work on…

A nasty string bug

MoarVM is somewhat lazy about a number of string operations. If you ask it to concatenate two simple strings, it will produce a string consisting of a strand list, with two strands pointing to the two strings. Similarly, a substring operation will produce a string with one strand and an offset into the original, and a repetition (using the x operator) will just produce a string with one strand pointing to the original string and having a repetition count. Note that it doesn’t currently go so far as allowing trees of strand strings, but it’s enough to prevent a bunch of copying – or at least delay it until a bunch of it can be done together and more cheaply.

The catch with such cleverness is, of course, that it’s a whole lot more complex than simple immutable strings. Both RT #123602 and RT #127782 were about a sequence of actions that could trigger a bug. The precise sequence was a repetition, followed by a concatenation, followed by a substring with certain offsets. It was caused by an off-by-one involving the repetition optimization, which was a little tedious to find but easy to fix.
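To give a feel for the shape of the trigger – the strings and offsets here are invented, not the exact ones from the tickets – each step below builds strands rather than copying:

    my $rep = "ab" x 4;       # one strand with a repetition count
    my $cat = $rep ~ "cd";    # a strand list pointing at both strings
    say $cat.substr(3, 4);    # offsets reaching across the repetitions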

Constant folding Seqs is naughty

RT #127749 stumbled across a case where an operation in a loop would work fine if its input was variable (like ^$n X ^$n), but fail if it were constant (such as ^5 X ^5). The X operator returns a Seq, which is an iterator that produces values once, throwing them away. Thus iterating it a second time won’t go well. The constant folding optimization is used so that things like 2 + 2 will be compiled into 4 (silly in this case, but more valuable if you’re doing things with constants). However, given the 1-shot nature of a Seq, it’s not suitable for constant folding. So, now it’s disallowed.
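The one-shot behavior is easy to demonstrate (binding to a sigilless name, so nothing re-containerizes the Seq):

    my \seq = ^3 X ^3;    # X returns a Seq
    .say for seq;         # first iteration is fine
    .say for seq;         # dies: this Seq has already been iterated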

We are anonymous

RT #127540 complained that an anon sub whose name happened to match that of an existing named sub in the same scope would trigger a bogus redeclaration error. Wait, you ask. Anonymous sub…whose name?! Well, it turns out that what anon really means is that we don’t install it anywhere. It can have a name that it knows itself by, however, which is useful should it show up in a backtrace, for example. The bogus error was easily fixed up.
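Here’s the distinction in action (names invented for the example):

    my &f = anon sub double($x) { 2 * $x };
    say f(21);      # 42 – callable through the variable we bound it to
    say &f.name;    # double – the name it knows itself by
    # ...but nothing was installed as &double, so calling double(21) would not compile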

Charset, :ignoremark, :global, boom

Yes, it’s the obligatory “dive into the regex compiler” that all bug fixing days seem to come with. RT #128270 mentioned that "a" ~~ m:g:ignoremark/<[á]>/ would whine about chr being fed a negative codepoint. Digging into the MoarVM bytecode this compiled into was pretty easy, as chr only showed up one time, so the culprit had to be close to that. It turned out to be a failure to cope with the end of the string, and as regex bugs go it wasn’t so hard to fix.
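For the curious: :ignoremark matches on base characters, ignoring any marks, which is why the example is expected to match at all. Post-fix:

    say "á" ~~ m:ignoremark/a/;          # ｢á｣
    say "a" ~~ m:g:ignoremark/<[á]>/;    # (｢a｣) – the case from the RT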

Hang, crash, wallop

This is one of those no-impact-on-real-code, but sorta embarrassing, bugs. A (;) would send the compiler into an infinite loop of errors, while (;;) and [0;] emitted similar bogus errors. The hang was caused by a loop in the optimizer that did next without updating its iteration variable. The bogus errors came from constructing bad AST, with plain integers hanging around in it rather than AST nodes, which confused all kinds of things. And that was RT #127473.

Improving an underwhelming error

RT #128581 pointed out that my Array[Numerix] $x spat out an error that fell rather short of the standards we aim for in Perl 6. Of course, the error should complain that Numerix isn’t known and suggest that maybe you wanted Numeric. Instead, it spat out this:

===SORRY!=== Error while compiling ./x.pl6
An exception occurred while parameterizing Array
at ./x.pl6:1
Exception details:
  ===SORRY!=== Error while compiling
  Cannot invoke this object (REPR: Null; VMNull)
  at :

Which is ugly. The line number was at least correct, but still… Anyway, a small tweak later, it produced the much better:

$ ./perl6-m -e 'my Array[Numerix] $x;'
===SORRY!=== Error while compiling -e
Undeclared name:
    Numerix used at line 1. Did you mean 'Numeric'?

Problems mixing unit sub MAIN with where constraints

RT #127785 observed that using a unit sub MAIN – which takes the entire program body as the contents of the MAIN subroutine – seemed to run into trouble if the signature contained a where clause:

% perl6 -e 'unit sub MAIN ($x where { $^x > 1 } );  say "big"'  4
===SORRY!===
Expression needs parens to avoid gobbling block
at -e:1
------> unit sub MAIN ($x where { $^x > 1 }⏏ );  say "big"
Missing block (apparently claimed by expression)
at -e:1
------> unit sub MAIN ($x where { $^x > 1 } );⏏  say "big"

The error here is clearly bogus. Finding a way to get rid of it wasn’t too hard, and it’s what I ended up committing. I’ll admit that I’m not sure why the check involved was put there in the first place, however. After some playing around with other situations that it might have aided, I failed to find any. There were also no spectests that depended on it. So, off it went.
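With the bogus check gone, the original one-liner behaves as its author intended:

% perl6 -e 'unit sub MAIN ($x where { $^x > 1 } );  say "big"'  4
big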

$?MODULE

The author of RT #128552 noticed that the docs talked about $?MODULE (“what module am I currently in”), to go with $?PACKAGE and $?CLASS. However, trying it out led to an undeclared variable error. It seems to have been simply overlooked. It was easy to add, so that’s what I did. I also found some old, provisional tests and brought them up to date in order to cover it.
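So now (module name invented for the example) this works:

    module Kitchen {
        say $?MODULE;    # (Kitchen)
    }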

Subtypes, definedness types, and parameters

The submitter of RT #127394 was creative enough to try -> SomeSubtype:D $x { }. That is, take a subset type and stick a :D on it, which adds the additional constraint that the value must be defined. This didn’t go too well, resulting in some rather strange errors. It turns out that, while picking the type apart so we can code-gen the parameter binding, we failed to consider such interesting cases. Thankfully, a small refactor made the fix easy once I’d figured out what was happening.
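A sketch of the case (subset invented); after the fix, a failing value gives an ordinary constraint type check error instead of something strange:

    subset Positive of Int where * > 0;
    my &half = -> Positive:D $x { $x div 2 };
    say half(42);    # 21
    half(Int);       # now cleanly fails the :D/constraint check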

1 day, 10 RTs

Not bad going. Nothing earth-shatteringly exciting, but all things that somebody had run into – and so others would surely run into again in the future. And, while I’ll be getting back to the bigger, hairier things soon, spending a day making Perl 6 a little nicer in 10 different ways was pretty fun.


‘grinding out performance improvements

In the last weeks, I’ve been working on various Perl 6 performance improvements. They’ve led to changes in all of MoarVM, NQP, and Rakudo. For each improvement, there are usually three steps involved:

  1. Take some program that performs poorly, and discover a reason that it’s slow.
  2. Design, implement and test potential changes, until one yields an improvement. (Or, if nothing seems to help, move on to a different reason that it’s slow).
  3. Make sure that the improvement doesn’t cause regressions elsewhere.

In all of these steps, getting some objective measurements is important. Step 3 is relatively straightforward: just build NQP/Rakudo and run their respective test suites. It’s still possible for problems to find a place to hide even then; test suites are, after all, not full system proofs. But it rules out a lot of bad behaviors.

For step 1, profiling is key. I mean, sure, sometimes I can guess why something is slow, and sometimes I’m even right. But the overwhelming majority of the time, measuring wins hands down. That’s why I seeded tools such as the MoarVM profiler and MoarVM heap snapshot analyzer, both of which have seen contributions from others since. And, when working on MoarVM itself, there are various tools for profiling native code.

That leaves step 2. How do I know if I’ve got an improvement? One easy way is to use something like time, doing thousands or millions of iterations to try and avoid noise, and seeing what happens. It’s a bit of a blunt instrument, however. It’s hard to be confident that anything less than a 2%-3% improvement isn’t just measurement noise, and I’m not even confident about that rule of thumb. :-) And, while it may be reasonable to argue that an improvement of just 1% or even a fraction of a percent is just not worth it in some contexts, VM engineering isn’t one of those contexts. A 1% improvement rolled out to all of our users quickly multiplies out to a huge number of saved CPU cycles.

More than that, though, the small improvements add up. Do 5 small improvements that win an average of 1% each, and it adds up to a more satisfying 5% improvement. But with only wallclock times to go on, it’s hard to confidently commit an improvement that wins only 1%, or 0.5%. Why? Simply because it’s hard to be sure that it’s a move in the correct direction. If it’s in the noise range, it may be that things are actually getting a tiny bit worse. Experience tells that just because something looks like it should be an improvement does not, in fact, mean that it will be. So if I’m seeing – to the degree of measurement error – two things coming out the same, and I go ahead and commit the “improvement” anyway, I’m really just guessing.

Enter callgrind, part of the Valgrind suite. It can give a count of the number of CPU instructions executed. And, on two consecutive runs with the same input, it will produce the exact same number. Yes, it’s sloooooow at gathering data; on the other hand, with such precision there’s less need to bump up the number of iterations to hide measurement error. Callgrind also explains what functions the CPU instructions are from, meaning it can play a key role in the profiling step too.
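For reference, a typical session is just a couple of commands (the program and the PID in the output filename are illustrative):

    $ valgrind --tool=callgrind ./perl6-m -e 'my $i = 0; $i++ for ^100000; say $i'
    $ callgrind_annotate --inclusive=yes callgrind.out.12345 | less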

Of course, instruction counts are a somewhat blunt instrument too. Instructions executed is not the same as CPU cycles spent: modern CPUs can execute multiple instructions in a single cycle thanks to having multiple functional units, yet can also stall for anything from several to several hundred cycles on an instruction that accesses memory. This measurement also gives little insight into I/O-bound programs, and I could easily imagine being misled by such data in contention-bound programs. However, for CPU-bound programs operating on relatively small heaps and running on a single thread, the numbers can be treated something like a very accurate clock.

Here’s a look at a handful of the improvements I’ve been making, based on the callgrind output.

The test program

Here’s the test program I considered. It involves a bunch of invocation and integer operations in a tight loop.

class A {
    has $.i = 0;
    method m() { $!i++ }
}
my $a = A.new;
for ^5000000 {
    $a.m();
}
say $a.i;

Before starting out, this took 17,855,658,600 instructions, which comes out at around 3,570 instructions per iteration. For comparison, here’s a Perl 5 program that I hope is fairly equivalent (I stuck with the built-in OO to avoid the costs of any sugar):

use v5.10;  # needed for say
package A;
sub new {
    return bless { i => 0 }, shift;
}
sub m {
    my $self = shift;
    $self->{i}++;
}
package main;
my $a = A->new;
for (1..5000000) {
    $a->m();
}
say $a->{i};

It weighs in at 10,198,184,195 instructions, or around 2,040 instructions per iteration, thus running in 57% of the CPU instructions the Perl 6 version takes.

A wasted memset

One thing that stood out right away was the amount of time spent in memset, to zero memory. As I looked at the callers, I spotted one that seemed out of place: clearing the arguments buffer. The code gave a reason (make sure the GC never sees a partial args buffer with old data), but that reason turns out to be bogus: there can never be a GC safepoint anywhere inside of the sequence of instructions that put the arguments in place to pass. So, I removed it.

Now I was down to a really small function, which was just begging to go away. It was called both from the interpreter (where the C compiler may well have inlined it) and from the JIT. The JIT case was certainly going to be a nice win: instead of putting the args into the appropriate registers for a function call, making the call, and then using the return value, I could emit just a couple of simple instructions.

These two fairly simple changes got it down to 17,624,207,945 instructions – a saving of around 230 million CPU instructions (1.3%), or around 46 instructions off every iteration. Very much worth having (it makes every single call cheaper), but it could have looked marginal had I been measuring wallclock time alone.

Exhausting work

The next win would have been easily visible just from wallclock time, but looking at the callgrind output led to it. I noticed we spent a huge amount of time doing a late-bound (by name) lexical lookup. With --inclusive=yes passed to callgrind_annotate, its line in the output looked like this:

1,149,528,043  ???:MVM_frame_find_lexical_by_name [libmoar.so]

Now, late-bound lexical lookups being costly isn’t really a surprise. This is the case where we have to go searching for a symbol by name (doing hash lookups), because we couldn’t do its resolution at compile-time (or at optimization time). The surprise was that we were doing them at all while executing such a simple program. So I looked into why, and it surprised me.

Long ago (back in the Parrot days), we replaced the secret RETURN lexical (which held an object used in implementing return) with the &EXHAUST sub. This would throw a useful error if you tried to return from a sub you had already returned from, for example due to forgetting that map is lazy and the final statement of a sub is its return value:

sub in(@a, $b) {
    @a.map({ return True if $_ eqv $b })
}
say in([1,2,3], 2); # dies: Attempt to return outside of any Routine

However, this lookup ended up being code-gen’d late-bound. I was considering fixing that when I got curious whether we even needed it at all any more. And indeed, on the VMs we run on today, I got the same decent error message if I simply set RETURN to null. So, I did that, and the win was enormous: down to 11,085,521,589 CPU instructions! Relative to the previous measurement, the program ran in 63% of the instructions it used to, or around 2,220 per iteration.

That might seem incredible given that we only stood to gain around 1.1 billion instructions by avoiding the indirect lexical lookup. Where did the other 5 billion go? It turns out that blocks holding references to symbols lexically outer to them cannot be inlined (unless those symbols have been resolved by optimization time and are known to stay constant). The late-bound lookup of &EXHAUST was thus an inlining blocker for the method m. With it now being inlined, all of the calling overhead went away in favor of a couple of gotos. (Those too could go away in the future, if inlining gets smarter.)

Optimizing get_boxed_ref

Eliminating control flow in favor of data access is usually a win. Branches aren’t the cheapest thing. I spotted an opportunity in a function get_boxed_ref, which is used to extract the memory address of the P6bigint representation that is inlined into a P6opaque (as happens with Int; if you’re wondering why it’s a P6opaque, recall that it’s allowable to mix into Int objects).

The resulting change shaved another 60 million instructions off the program – only 12 per iteration, but since it improves every single Int operation we perform in Perl 6, I’ll take it.

Static elimination of decontainerization

The &EXHAUST change uncovered an inlining issue that I had to fix, and while doing so I took a look at the code we were producing and got an idea. In Perl 6, we have Scalar containers. So when you have:

my $answer = 42;

Then the callframe has a slot for $answer that points to a Scalar object, which has a value attribute that points to the immutable Int object 42. Therefore, when we need the value from a container, we have to get hold of it. That is done with an instruction called decont, which checks if we have a container and dereferences it if so, and otherwise just evaluates to the value. The dynamic optimizer is pretty good at removing unneeded decont instructions to save the cycles/branches involved, and can lower those that remain into simple pointer arithmetic. But it can’t get ’em all, and of course not all code gets hot enough to be optimized in such a way.
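The container is visible from Perl 6 itself, which makes the distinction easy to see:

    my $answer = 42;
    say $answer.VAR.^name;    # Scalar – the container in the callframe slot
    say $answer.^name;        # Int – method calls decont the invocant first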

I spotted that in some cases, we emitted decont instructions against compile-time constants that we could cheaply determine, at compile time, would never need them. A small code-gen patch was enough to eliminate a load of them, saving another 35 million instructions. However, the instruction count of an empty program at startup also went down by quite an amount, so it’s likely we’re not saving all that much in the loop itself (if anything). Additionally, this change shaved 3.6KB off the NQP bytecode size, 16.2KB off the Rakudo compiler bytecode size, and 55.8KB off the CORE.setting bytecode size.

And next time: returning to return

Fixing the &EXHAUST bug reminded me that I’ve long wanted to change the way we implement return, to make it free – rather than just cheap – in the case that we never explicitly return, and much cheaper when we do. I’m still working on getting those changes finished up, and this has already been a fairly long post, so I’ll save the details – and the results – for next time.


Refactoring and torture

This post covers the previous two weeks of my Perl 6 grant work. Last time I wrote here, I plotted changes to call frames in MoarVM. Or, as the wonderful Perl Weekly put it, provided way too much information about call frames. :-)

That there is so much to say about call frames reflects their rather central role. They play a big part in supporting numerous language features (function calls, recursion, closures, continuations, dynamic variables, and pseudo-packages like OUTER and CALLER). The garbage collector scans them to find live objects. Both the interpreter and JIT compiler have to relate to them in various ways. The dynamic optimizer performs transforms over them when doing OSR (On Stack Replacement) and uninlining (the deoptimization that enables us to speculatively perform inlining optimizations).

All of which makes a major refactor of call frames a rather scary prospect. While they have picked up additional bits of state as MoarVM has evolved, they have been reference counted since, well, the day I first implemented call frames, which means “before MoarVM could even run any code”. Being reference counted, rather than handled by the garbage collector, gave them a property that is easy to rely on, and rather less easy to discover reliance on: that they never move over their lifetime.

I like to move it move it

Wait, what, move? Why would they even move?

Here’s a little Perl 6 program to illustrate. It declares a class, makes an instance of it, prints its memory address, does a load of further, throwaway, memory allocations, and then again prints the address of the object instance we made.

class A { }
my $obj = A.new;
say $obj.WHERE;
A.new for ^10000;
say $obj.WHERE;

When I ran this locally just now, I got:

39391392
95466848

If you get the same number twice, just make the 10000 something bigger. What’s interesting to note here is that an object’s location in memory can change over time. This is a consequence of MoarVM’s garbage collector, which is both generational and manages its young generation using semi-space copying. (This is nothing special to MoarVM; many modern VMs do it.)

Being able to move objects relies on being able to find and update all of the references to them. And, since MoarVM is written in C, that includes those references on the C stack. Consider this bit of code, which is the (general, unoptimized) path for boxing strings:

MVMObject * MVM_repr_box_str(MVMThreadContext *tc, MVMObject *type, MVMString *val) {
    MVMObject *res;
    MVMROOT(tc, val, {
        res = MVM_repr_alloc_init(tc, type);
        MVM_repr_set_str(tc, res, val);
    });
    return res;
}

It receives val, which is a string to box. Note that strings are garbage-collectable objects in MoarVM, and so may move. It then allocates a box of the specified type (for example, Perl 6’s Str), and puts the string inside of it. Since MVM_repr_alloc_init allocates an object, it may trigger garbage collection. And that in turn may move the object pointed to by val – meaning that the val pointer needs updating. The MVMROOT macro is used in order to add the memory address of val on the C stack to the set of roots that the GC considers and updates, thus ensuring that even if the allocation of the box triggers garbage collection, this code won’t end up with an old val pointer.

Coping with moving frames

Last time, I discussed how reference counting could be eliminated in favor of a “linear” call stack for frames that don’t escape (that is, become heap referenced), and promoting those that do escape to being garbage collected. As an intermediate step there, I’ve been working to make all frames GC-managed. This means that frames can move, and that they are part of the generational scheme. Therefore, every piece of code that both holds a reference to a frame and takes a code path that can allocate would need updating with MVMROOT. Further, all assignments of frames into other objects, and other objects into frames, would need write barriers (aside from the working area, which is handled specially).

In part, this just needs a lot of care. Going through the places frames show up, updating things as needed, and so forth. But even then, would that really be enough to be confident things weren’t broken? After all, my refactor was changing the rules for one of the most important data structures in the VM.

Of course, building NQP and Rakudo and passing the spectest suite is one good way to exercise MoarVM after the changes. Doing this showed up some issues, which I fixed. But even that doesn’t offer a huge amount of confidence. A simple script, or a short test, might trigger no garbage collections at all, or just the odd one. And the collections are highly likely to be triggered on the most common code paths in the VM.

GC torture testing

When faced with something scary, a surprisingly good approach is to tackle it by doing it really often. For example, are software releases scary? If yes, then do time-based releases every month, and with time they’ll become automatic and boring enough not to be scary. Is deploying changes to a production system scary? If yes, then adopt continuous delivery, deploying lots of times in a day and with easy rollback if things don’t go well.

Garbage collection is pretty scary. I mean, we take this world of objects the program has created, move them around, throw a bunch of them out, and then have the program continue running as if nothing happened. So…let’s try doing it really often!

This is exactly what GC torture testing involves.

--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -1,7 +1,7 @@
 /* How big is the nursery area? Note that since it's semi-space copying, we
  * actually have double this amount allocated. Also it is per thread. (In
  * the future, we'll make this adaptive rather than a constant.) */
-#define MVM_NURSERY_SIZE 4194304
+#define MVM_NURSERY_SIZE 13000

Rather than doing a collection every 4MB worth of allocations, let’s do one every 13KB worth of allocations! That’s around 320 times more often. Combine that with a few debugging checks enabled to catch references to out-of-date objects, and bugs resulting from missing MVMROOTs and write barriers are driven out of their hiding places into the light of day.

It’s a rather effective technique. It’s also a very time-consuming one. The NQP and Rakudo builds easily take an hour between them, and running spectest this way takes over 12 hours. It’s cheap compared to shipping a MoarVM with new and nasty bugs that waste a bunch of people’s time, of course!

It’s been a while since we did such a torture test. I’ve decided we should do them more often. It found issues. So far, from the spectest run torture test results, I’ve fixed 9 bugs (I didn’t go back and count those discovered while building NQP and Rakudo). What’s interesting is that of the 9, only 3 of them were clearly attributable to my refactors, one was potentially related to them, and 5 were bugs that must have been around a good while. One of the bugs that did relate to the frames changes caused deadlocks in multi-threaded code quite reliably under torture testing, but would have likely caused them very rarely under normal use (and so been extremely frustrating to reproduce and track down if it made it into the wild). 2 of the fixed non-frame bugs exclusively affected multi-threaded programs and would have doomed them. One was in the CUnion representation, and probably was the cause of some previously unresolved occasional failures of the NativeCall union tests.

What next?

By this point, I’m reasonably confident that regressions due to the first step of the frame changes have been shaken out. The GC torture testing has, however, shed light on some other issues that will want addressing in the near future.

I intend to put those aside for a little while, and complete the frame changes, introducing the linear stack. Compared with the first step, this feels like a lower risk change, in that mistakes should be a lot easier and cheaper to detect. I’d like to try and land this in the next week or so, in order that it can get some real-world testing before it makes it into a MoarVM and Rakudo release.

Once that’s out the way, I’ll be returning to other issues turned up in GC torture testing. I’d also like to look into a way to be able to run it automatically and regularly (once a week, perhaps). It’s a good bit too intensive to be able to farm it out to Travis. The sensible solution is probably to do it in the cloud, on some elastic compute thing where it just uses a machine once a week for a day or so. The silly but fun way is to build a Raspberry Pi cluster on my desk, and hack up something to distribute the tests across them. :-)
