Refactoring and torture

This post covers the previous two weeks of my Perl 6 grant work. Last time I wrote here, I plotted changes to call frames in MoarVM. Or, as the wonderful Perl Weekly put it, provided way too much information about call frames. :-)

That there is so much to say about call frames reflects their rather central role. They play a big part in supporting numerous language features (function calls, recursion, closures, continuations, dynamic variables, and pseudo-packages like OUTER and CALLER). The garbage collector scans them to find live objects. Both the interpreter and JIT compiler have to relate to them in various ways. The dynamic optimizer performs transforms over them when doing OSR (On Stack Replacement) and uninlining (the deoptimization that enables us to speculatively perform inlining optimizations).

All of which makes a major refactor of call frames a rather scary prospect. While they have picked up additional bits of state as MoarVM has evolved, they have been reference counted since, well, the day I first implemented call frames, which means “before MoarVM could even run any code”. Being reference counted, rather than handled by the garbage collector, gave them a property that is easy to rely on, and rather less easy to discover reliance on: that they never move over their lifetime.

I like to move it move it

Wait, what, move? Why would they even move?

Here’s a little Perl 6 program to illustrate. It declares a class, makes an instance of it, prints its memory address, does a load of further, throwaway, memory allocations, and then again prints the address of the object instance we made.

class A { }
my $obj = A.new;
say $obj.WHERE;
A.new for ^10000;
say $obj.WHERE;

When I ran this locally just now, I got:


If you get the same number twice, just make the 10000 something bigger. What’s interesting to note here is that an object’s location in memory can change over time. This is a consequence of MoarVM’s garbage collector, which is both generational and manages its young generation using semi-space copying. (This is nothing special to MoarVM; many modern VMs do it.)
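To make the copying idea concrete, here is a tiny C sketch of a semi-space nursery (names and sizes invented here; MoarVM's real collector is far more involved). Objects are bump-allocated in from-space; a collection copies survivors into to-space and swaps the two spaces, which is exactly why an object's address can change over time.

```c
#include <string.h>

/* A toy semi-space nursery: bump-allocate in from-space; on collection,
 * copy each live object into to-space and swap the spaces. Illustrative
 * only; everything here is invented for the example. */
#define SPACE_SIZE 1024

static char space_a[SPACE_SIZE], space_b[SPACE_SIZE];
static char *from_space = space_a, *to_space = space_b;
static size_t alloc_offset = 0;

void *toy_alloc(size_t size) {
    void *p = from_space + alloc_offset;  /* bump allocation: just advance a pointer */
    alloc_offset += size;
    return p;
}

/* "Collect": copy one live object into to-space and swap the spaces.
 * The object's address changes, which is the effect .WHERE shows. */
void *toy_collect(void *live, size_t size) {
    memcpy(to_space, live, size);
    char *tmp = from_space;
    from_space = to_space;
    to_space = tmp;
    alloc_offset = size;
    return from_space;
}
```

After a "collection", the surviving object lives at a new address but carries the same contents.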

Being able to move objects relies on being able to find and update all of the references to them. And, since MoarVM is written in C, that includes those references on the C stack. Consider this bit of code, which is the (general, unoptimized) path for boxing strings:

MVMObject * MVM_repr_box_str(MVMThreadContext *tc, MVMObject *type, MVMString *val) {
    MVMObject *res;
    MVMROOT(tc, val, {
        res = MVM_repr_alloc_init(tc, type);
        MVM_repr_set_str(tc, res, val);
    });
    return res;
}
It receives val, which is a string to box. Note that strings are garbage-collectable objects in MoarVM, and so may move. It then allocates a box of the specified type (for example, Perl 6’s Str), and puts the string inside of it. Since MVM_repr_alloc_init allocates an object, it may trigger garbage collection. And that in turn may move the object pointed to by val – meaning that the val pointer needs updating. The MVMROOT macro is used in order to add the memory address of val on the C stack to the set of roots that the GC considers and updates, thus ensuring that even if the allocation of the box triggers garbage collection, this code won’t end up with an old val pointer.
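A minimal model of what MVMROOT accomplishes, with invented names: we record the address of the C-stack pointer itself, so that when the collector moves the object, it can rewrite the pointer through that recorded address.

```c
/* A toy temporary-roots table, modeling the idea behind MVMROOT: record
 * the *address* of a C-stack pointer so the collector can update that
 * pointer when the object it refers to moves. All names are invented. */
#define MAX_ROOTS 16
static void **temp_roots[MAX_ROOTS];
static int num_roots = 0;

void root_push(void **slot) { temp_roots[num_roots++] = slot; }
void root_pop(void)         { num_roots--; }

/* When the GC moves an object, it walks the root table and rewrites
 * any rooted pointer that still refers to the old address. */
void gc_moved(void *old_addr, void *new_addr) {
    for (int i = 0; i < num_roots; i++)
        if (*temp_roots[i] == old_addr)
            *temp_roots[i] = new_addr;
}
```

The key point is the double indirection: the root set holds pointer-to-pointer slots, not object addresses, so updates reach the local variable on the C stack.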

Coping with moving frames

Last time, I discussed how reference counting could be eliminated in favor of a “linear” call stack for frames that don’t escape (that is, become heap referenced), and promoting those that do escape to being garbage collected. As an intermediate step there, I’ve been working to make all frames GC-managed. This means that frames can move, and that they are part of the generational scheme. Therefore, every piece of code that both holds a reference to a frame and takes a code path that can allocate would need updating with MVMROOT. Further, all assignments of frames into other objects, and other objects into frames, would need write barriers (aside from the working area, which is handled specially).

In part, this just needs a lot of care. Going through the places frames show up, updating things as needed, and so forth. But even then, would that really be enough to be confident things weren’t broken? After all, my refactor was changing the rules for one of the most important data structures in the VM.

Of course, building NQP and Rakudo and passing the spectest suite is one good way to exercise MoarVM after the changes. Doing this showed up some issues, which I fixed. But even that doesn’t offer a huge amount of confidence. A simple script, or a short test, might trigger no garbage collections at all, or just the odd one. And the collections are highly likely to be triggered on the most common code paths in the VM.

GC torture testing

When faced with something scary, a surprisingly good approach is to tackle it by doing it really often. For example, are software releases scary? If yes, then do time-based releases every month, and with time they’ll become automatic and boring enough not to be scary. Is deploying changes to a production system scary? If yes, then adopt continuous delivery, deploying lots of times in a day and with easy rollback if things don’t go well.

Garbage collection is pretty scary. I mean, we take this world of objects the program has created, move them around, throw a bunch of them out, and then have the program continue running as if nothing happened. So…let’s try doing it really often!

This is exactly what GC torture testing involves.

--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -1,7 +1,7 @@
 /* How big is the nursery area? Note that since it's semi-space copying, we
  * actually have double this amount allocated. Also it is per thread. (In
  * the future, we'll make this adaptive rather than a constant.) */
-#define MVM_NURSERY_SIZE 4194304
+#define MVM_NURSERY_SIZE 13000

Rather than doing a collection every 4MB worth of allocations, let’s do one every 13KB worth of allocations! That’s around 320 times more often. Combined with a few debugging checks enabled, to catch references to objects that are out of date, bugs resulting from missing MVMROOTs and write barriers can be driven out of their hiding places into the light of day.

It’s a rather effective technique. It’s also a very time-consuming one. The NQP and Rakudo builds easily take an hour between them, and running spectest this way takes over 12 hours. It’s cheap compared to shipping a MoarVM with new and nasty bugs that waste a bunch of people’s time, of course!

It’s been a while since we did such a torture test. I’ve decided we should do them more often. It found issues. So far, from the spectest run torture test results, I’ve fixed 9 bugs (I didn’t go back and count those discovered while building NQP and Rakudo). What’s interesting is that of the 9, only 3 of them were clearly attributable to my refactors, one was potentially related to them, and 5 were bugs that must have been around a good while. One of the bugs that did relate to the frames changes caused deadlocks in multi-threaded code quite reliably under torture testing, but would have likely caused them very rarely under normal use (and so been extremely frustrating to reproduce and track down if it made it into the wild). 2 of the fixed non-frame bugs exclusively affected multi-threaded programs and would have doomed them. One was in the CUnion representation, and probably was the cause of some previously unresolved occasional failures of the NativeCall union tests.

What next?

By this point, I’m reasonably confident that regressions due to the first step of the frame changes have been shaken out. The GC torture testing has, however, shed light on some other issues that will want addressing in the near future.

I intend to put those aside for a little while, and complete the frame changes, introducing the linear stack. Compared with the first step, this feels like a lower risk change, in that mistakes should be a lot easier and cheaper to detect. I’d like to try and land this in the next week or so, in order that it can get some real-world testing before it makes it into a MoarVM and Rakudo release.

Once that’s out of the way, I’ll be returning to other issues turned up in GC torture testing. I’d also like to look into a way to be able to run it automatically and regularly (once a week, perhaps). It’s a good bit too intensive to be able to farm it out to Travis. The sensible solution is probably to do it in the cloud, on some elastic compute thing where it just uses a machine once a week for a day or so. The silly but fun way is to build a Raspberry Pi cluster on my desk, and hack up something to distribute the tests across them. :-)


Framing the problem

In this post I’ll be talking a lot about call frames, also known as invocation records. Just to be clear about what they are, consider a sub:

sub mean(@values) {
    @values.sum / @values
}

Whenever we call mean, we create a call frame. This holds the storage for the incoming @values parameter. It also holds some temporary storage we use in executing the sub, holding, for example, the sum method object we get back when looking up the method, and the result of calling @values.sum, which we then pass to infix:</>. Call frames also record outer and caller references (so we can resolve lexical and dynamic variables), the place to store the return value and go to on return, and other bits. It’s important to note that call frames are not 1:1 with subs/methods/blocks. Perhaps the best way to understand why is to consider a recursive sub:

sub fac($n) {
    $n <= 1
        ?? 1
        !! $n * fac($n - 1)
}

There’s one fac sub but we need a call frame for each invocation of (that is, call to) fac, since the $n parameter will vary in each call. (Threads are another example where you’re “in” a sub multiple times at the same time.)
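The same one-frame-per-invocation rule is visible in C, where each recursive activation gets its own storage for its parameter. This illustrative snippet (invented for this post) records where each call's n lives, showing that simultaneously live calls occupy distinct frames:

```c
#include <stdint.h>

/* Each recursive call to fac gets its own activation record, so every
 * depth has its own n, at a distinct address. This mirrors MoarVM's
 * one-call-frame-per-invocation rule. */
long fac(long n, uintptr_t *slot_addrs, int depth) {
    slot_addrs[depth] = (uintptr_t)&n;  /* where *this* call's n lives */
    return n <= 1 ? 1 : n * fac(n - 1, slot_addrs, depth + 1);
}
```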

All complex software systems evolve from simple systems. MoarVM is no exception. Back when MoarVM started out, I knew I wanted to have invocation be cheap, and call frames be fairly lightweight. I also didn’t want them to be GC-allocated. I figured that code sat in a loop, only using native types and only calling things involving native types, should not create garbage that needed collecting. All good goals.

Fast forward a few years, and where are we? Let’s start out with the easy one to assess: frames aren’t GC-allocated. So that’s good, right? Well, sure, in that I got the property I was after: native-only code doesn’t create collectable garbage. However, since things like closures and continuations exist, not to mention that you can get a first-class reference to a call frame and traverse the outer/caller chain, the lifetime of frames is interesting. They most certainly don’t always just go away at the point of return. Therefore, they need to have their memory managed in some way. I went with reference counts, figuring that since we’d only need to twiddle them fairly occasionally, it’d be fairly OK. Trouble is, thanks to MoarVM supporting concurrent threads of execution, those counts need to be incremented and decremented using atomic operations. Those are CPU native, but they’re still a bit costly (more so on some CPUs than others).
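A stripped-down model of that reference counting scheme, using C11 atomics (the struct and function names here are invented; MoarVM's real frame code is more involved):

```c
#include <stdatomic.h>

/* A minimal model of the old scheme: a frame with an atomically
 * maintained reference count. The atomic read-modify-write operations
 * are what made this costly on hot paths. */
typedef struct {
    atomic_uint ref_count;
} ToyFrame;

void frame_inc_ref(ToyFrame *f) {
    atomic_fetch_add(&f->ref_count, 1);
}

/* Returns 1 if this decrement released the last reference,
 * i.e. the count was 1 before we decremented it. */
int frame_dec_ref(ToyFrame *f) {
    return atomic_fetch_sub(&f->ref_count, 1) == 1;
}
```

Note that `atomic_fetch_sub` returns the value held *before* the subtraction, which is why the last-reference check looks for 1 rather than 0, the same convention the real `MVM_decr` patch later in this post relies on.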

There’s another, more hidden, cost, however – one I didn’t really see coming. MoarVM has a generational garbage collector, as discussed in my previous post. But frames are not garbage collectable objects; they’re managed by reference counts. So what happens when a reference counted frame is referenced by a second generation object? Well, there’s no risk of the frame going away too early; the reference count won’t be decremented until the gen2 object itself is collected. The problem is the objects the frame references. Frames, not being garbage collectable, don’t have write barriers applied on binds into them. This means that they can, at any time, come to point to nursery objects. We solved this by keeping all objects referencing frames in the inter-generational root set. This is perfectly safe. Unfortunately, it also greatly increases the cost of garbage collection for programs that build up large numbers of closures in memory and keep them around. Of course, since write barriers are cheap but not free, we get a performance win on all programs by not having to apply them to writes to working registers or lexicals.
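The barrier being discussed can be sketched like this (an illustrative model, not MoarVM's actual barrier code): on a bind into an old-generation object, check whether the value lives in the nursery, and if so record the container in the inter-generational root set so a nursery collection can find the reference.

```c
#include <stddef.h>

/* A toy generational write barrier. When an old-generation object is
 * made to point at a nursery object, remember the container in a
 * remembered set so the next nursery collection can find the reference.
 * Field and flag names are invented for illustration. */
typedef struct Obj {
    int in_gen2;             /* 1 if promoted to the old generation */
    int in_remembered_set;   /* 1 if in the inter-generational root set */
    struct Obj *field;
} Obj;

void barriered_bind(Obj *container, Obj *value) {
    if (container->in_gen2 && value && !value->in_gen2)
        container->in_remembered_set = 1;  /* inter-generational reference */
    container->field = value;
}
```

The check runs on every barriered write, which is why skipping barriers on register and lexical writes into frames was a performance win, at the price described above.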

So, how about invocation cost? Is invocation cheap? Well, first of all let’s turn off inlining:


And measure 10 million invocations passing/receiving one argument using Perl 5, NQP, and Rakudo. Perl 5 does them in 2.85s. NQP comes out a little ahead, at 2.45s. Rakudo strolls through them in an altogether too leisurely 6.14s. (Turn inlining back on, and Rakudo manages it in 3.39s.) So, if NQP is already ahead, is MoarVM really so bad? Well, it could certainly be better. On an idealized 3GHz CPU, each invocation is costing around 735 CPU cycles. That’s pricey. The other issue here is that just matching Perl 5 on invocation speed isn’t really enough, because tons of things that aren’t invocations in Perl 5 actually are in Perl 6 (like, every array and hash index). In a “Perl 6 is implemented in Perl 6” world, we need to squeeze a good bit more out of invocation performance.
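The 735-cycle figure is simple arithmetic: run time, times clock rate, divided by the number of invocations. Expressed as a tiny helper (invented for this post):

```c
/* Back-of-the-envelope cycles-per-invocation: seconds for the run,
 * times the clock rate in Hz, divided by the number of invocations,
 * rounded to the nearest whole cycle. */
long cycles_per_invocation(double seconds, double hz, long invocations) {
    return (long)(seconds * hz / invocations + 0.5);
}
```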

And finally, what about size? An MVMFrame comes at a cost of 296 bytes. It points to a chunk of working space together with a lexical environment (both arrays). Every single closure we take also pays that fixed 296 byte cost (and, of course, the cost of the lexical environment storage, since that’s what we actually take closures for). Again, not staggeringly huge, but it adds up very quickly.

These are all areas that need improvement. In fact, they make up two of the entries in the performance section of the proposal for the grant I’m doing this work under. So, I decided it was time to start thinking about how I’ll address them.

Some measurements

I was curious how many frames end up referenced by garbage collectable objects against how many never end up in this situation. So, I quickly patched MoarVM to keep track of if a frame ever came to be referenced by a GC-able object:

diff --git a/src/core/frame.c b/src/core/frame.c
index ca1a4d2..f392aca 100644
--- a/src/core/frame.c
+++ b/src/core/frame.c
@@ -114,7 +114,11 @@ MVMFrame * MVM_frame_dec_ref(MVMThreadContext *tc, MVMFrame *frame) {
      * to zero, so we look for 1 here. */
     while (MVM_decr(&frame->ref_count) == 1) {
         MVMFrame *outer_to_decr = frame->outer;
+if (frame->refd_by_object)
+    tc->instance->refd_frames++;
+else
+    tc->instance->non_refd_frames++;
         /* If there's a caller pointer, decrement that. */
         if (frame->caller)
             frame->caller = MVM_frame_dec_ref(tc, frame->caller);
diff --git a/src/core/instance.h b/src/core/instance.h
index b14f11d..4f61000 100644
--- a/src/core/instance.h
+++ b/src/core/instance.h
@@ -365,6 +365,9 @@ struct MVMInstance {

     /* Cached backend config hash. */
     MVMObject *cached_backend_config;
+MVMuint64 refd_frames;
+MVMuint64 non_refd_frames;

     /* Returns a true value if we have created user threads (and so are running a

diff --git a/src/main.c b/src/main.c
index 5458912..1df4fe3 100644
--- a/src/main.c
+++ b/src/main.c
@@ -189,7 +189,9 @@ int main(int argc, char *argv[])

     if (dump) MVM_vm_dump_file(instance, input_file);
     else MVM_vm_run_file(instance, input_file);
+printf("Ref'd frames: %d\nNon-ref'd frames: %d\n",
+    instance->refd_frames,
+    instance->non_refd_frames);
     if (full_cleanup) {
         return EXIT_SUCCESS;

And measured a few things (the names from the latter ones are benchmark names from perl6-bench):

Measured                    Ref'd       Non-ref'd       % Ref'd
========                    =====       =========       =======
NQP startup                 0           5259            0.0%
NQP regex tests             28065       1682655         1.6%
Compile Perl 6 actions      115092      6100770         1.7%
Compile Perl 6 grammar      130716      5451120         2.3%
Compile CORE.setting        2065214     55771097        3.8%
Perl 6 startup              35          12822           0.3%
Compiling Test.pm6          39639       860474          4.4%
Compiling NativeCall.pm6    145426      1887682         7.2%
while_array_set             993701      6024920         14.1%
while_hash_set              1804        2024016         0.1%
for_assign                  1654        1020831         0.2%
for_concat_2                1743        2023589         0.1%
split_string_regex          8992750     19089026        32.0%
create_and_iterate_hash_kv  14990870    40027814        27.2%
parse_json                  10660068    42364909        20.1%
rc-forest-fire              3740096     16202368        18.8%
rc-mandelbrot               89989       5523439         1.6%
rc-man-or-boy-test          791961      7091381         10.0%

What can we infer from this? First of all, most NQP programs have at most just a few percent of their frames referenced by GC-able objects. With the Perl 6 benchmarks, it’s all over the map, with split_string_regex being the “worst” case. NQP’s optimizer is much better at lowering lexicals to locals, and at flattening away scopes that we don’t really need. In Rakudo, we’re pretty weak at that. Clearly, some more work on this area could benefit Rakudo (and yes, it’s also on the list of things to do under my grant).

Secondly, since – even in the worst cases – the majority of frames never get themselves tied up with any “interesting” situations that cause them to become GC-referenced, a strategy that handles them differently – and hopefully far more efficiently – would give us a win.

What GC-able things reference frames?

It was fairly easy to grep through the MoarVM source and make a list. I did so to help me think through the cases:

  • A frame being closed over (closures)
  • A frame being captured in a continuation
  • A frame being referenced by an exception
  • A local/lexical being referenced by a native reference
  • A frame becoming the default outer thanks to “auto-close” (rare)
  • A frame getting wrapped in a context object, to use it as a first-class object (also, hopefully, rare in any hot-path code)

It’s also interesting to note that a frame only ever “escapes” such that it can be touched by another thread if it becomes referenced by a GC-able object.

What makes frames take up space?

Next, I decided to go through the MVMFrame data structure and see where the space is going, and what options might exist for saving that space. What follows is an analysis of all the fields in an MVMFrame.

/* The thread that is executing, or executed, this frame. */
MVMThreadContext *tc;

Interestingly, this one gets cleared after a certain point in the frame’s life, except if it’s captured in a continuation. Exception handling uses it to know if the frame is still on the call stack, which is interesting in various cases. GC marking uses it to know if it should mark ->work (see below).

Interestingly, nothing seems to care overly much at the moment that it points to a particular thread context; they all want it for a flag. So, it’s certainly a candidate for removal. It’s also interesting to note that in every case where a frame is not referenced by an object, it is alive solely by being in a thread’s “call stack” – that is, the call chain from following the ->caller pointer from the currently executing frame of a thread. So, the flag will only matter for frames that are GC-referenced.

/* The environment for this frame, which lives beyond its execution.
* Has space for, for instance, lexicals. */
MVMRegister *env;

Relevant for frames in whatever state.

/* The temporary work space for this frame. After a call is over, this
* can be freed up. Must be NULLed out when this happens. */
MVMRegister *work;

Relevant for frames that are still executing, or that are captured by a continuation. Cross-cuts whether they are GC-referenced.

/* The args buffer. Actually a pointer into an area inside of *work, to
* decrease number of allocations. */
MVMRegister *args;

Possibly could go away through a level of indirection, but it’s performance sensitive. Used together with…

/* Callsite that indicates how the current args buffer is being used, if
* it is. */
MVMCallsite *cur_args_callsite;

…this one.

/* The outer frame, thus forming the static chain. */
MVMFrame *outer;

Pretty much everything has an outer.

/* The caller frame, thus forming the dynamic chain. */
MVMFrame *caller;

Pretty much everything has a caller too.

/* The static frame information. Holds all we statically know about
* this kind of frame, including information needed to GC-trace it. */
MVMStaticFrame *static_info;

As you might guess, this is pretty important and useful. However, it’s also possible to obtain it – at the cost of a level of indirection – through the ->code_ref below. Would need to measure carefully, since it’d increase the cost of things like lexical lookups from outer frames (and, once we get better at optimizing, that will be “most of them”).

/* The code ref object for this frame. */
MVMObject *code_ref;

The particular closure we were invoked as. Not something we can obviously lose, and needed for the lifetime of the frame in general.

/* Parameters received by this frame. */
MVMArgProcContext params;

Argument processing context. Every frame uses it to process its arguments. It’s only useful while ->work is active, however, and so could be allocated as a part of that instead, which would reduce the cost of closures.

/* Reference count for the frame. */
AO_t ref_count;

Can go away provided we stop reference counting frames.

/* Is the frame referenced by a garbage-collectable object? */
MVMint32 refd_by_object;

Could also go away provided we stop reference counting frames and have some scheme for optimizing the common, non-referenced case.

/* Address of the next op to execute if we return to this frame. */
MVMuint8 *return_address;

/* The register we should store the return value in, if any. */
MVMRegister *return_value;

/* The type of return value that is expected. */
MVMReturnType return_type;

/* The 'entry label' is a sort of indirect return address
* for the JIT */
void * jit_entry_label;

These four are only used when the frame is currently on the call stack, or may be re-instated onto the call stack by a continuation being invoked. Could also live with ->work, thus making closures cheaper.

/* If we want to invoke a special handler upon a return to this
* frame, this function pointer is set. */
MVMSpecialReturn special_return;

/* If we want to invoke a special handler upon unwinding past a
* frame, this function pointer is set. */
MVMSpecialReturn special_unwind;

/* Data slot for the special return handler function. */
void *special_return_data;

/* Flag for if special_return_data need to be GC marked. */
MVMSpecialReturnDataMark mark_special_return_data;

Used relatively occasionally (and the more common uses are candidates for spesh, the dynamic optimizer, to optimize out anyway). A candidate for hanging off an “extra stuff” pointer in a frame. Also, only used when a frame is on the call stack, with the usual continuation caveat.

/* Linked list of any continuation tags we have. */
MVMContinuationTag *continuation_tags;

Used if this frame has been tagged as a possible continuation “base” frame. Only relevant if that actually happens (which is quite rare in the scheme of things), and can only happen when a frame is on the call stack. A candidate for similar treatment to the special return stuff.

/* Linked MVMContext object, so we can track the
* serialization context and such. */
/* note: used atomically */
MVMObject *context_object;

This is used when a context goes first-class. Thus, it implies the frame is referenced by at least one GC-able object (in fact, this points to said object). That’s fairly rare. It can happen independently of whether the frame is currently executing (so, unrelated to ->work lifetime).

/* Effective bytecode for the frame (either the original bytecode or a
* specialization of it). */
MVMuint8 *effective_bytecode;

/* Effective set of frame handlers (to go with the effective bytecode). */
MVMFrameHandler *effective_handlers;

/* Effective set of spesh slots, if any. */
MVMCollectable **effective_spesh_slots;

/* The spesh candidate information, if we're in one. */
MVMSpeshCandidate *spesh_cand;

These are all related to running optimized/specialized code. Only interesting for frames currently on the call stack or captured in a continuation (so, ->work lifetime once again).

/* Effective set of spesh logging slots, if any. */
MVMCollectable **spesh_log_slots;

/* If we're in a logging spesh run, the index to log at in this
* invocation. -1 if we're not in a logging spesh run, junk if no
* spesh_cand is set in this frame at all. */
MVMint8 spesh_log_idx;

/* On Stack Replacement iteration counter; incremented in loops, and will
* trigger if the limit is hit. */
MVMuint8 osr_counter;

These 3 play a part in dynamic optimization too, though more in the stage where we’re gathering information. Again, they have ->work lifetime. The top one may well go away in future optimizer changes, so it’s not worth worrying over too much now.

/* GC run sequence number that we last saw this frame during. */
AO_t gc_seq_number;

This one is certainly a candidate for going away, post-refactoring. It serves as the equivalent of a “mark bit” when doing GC.

/* Address of the last op executed that threw an exception; used just
* for error reporting. */
MVMuint8 *throw_address;

May be something we can move inside of exception objects, and have them pay for it, not every frame. Worth looking in to.

/* Cache for dynlex lookup; if the name is non-null, the cache is valid
* and the register can be accessed directly to find the contextual. */
MVMString   *dynlex_cache_name;
MVMRegister *dynlex_cache_reg;
MVMuint16    dynlex_cache_type;

These also have ->work lifetime. Give a huge speed-up on dynlex access, so (aside from re-designing that) they can stay.

/* The allocated work/env sizes. */
MVMuint16 allocd_work;
MVMuint16 allocd_env;

These exist primarily because we allocate work and env using the fixed size allocator, and so we need the sizes to free the memory.

/* Flags that the caller chain should be kept in place after return or
* unwind; used to make sure we can get a backtrace after an exception. */
MVMuint8 keep_caller;

/* Flags that the frame has been captured in a continuation, and as
* such we should keep everything in place for multiple invocations. */
MVMuint8 in_continuation;

/* Assorted frame flags. */
MVMuint8 flags;

It appears the top two could be nicely folded into flags. Also, the flags may only be relevant for currently executing frames, or those captured in a continuation, so this lot is a candidate to move to something with ->work lifetime.
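The suggested folding is a routine C bitfield trick. A sketch, with invented flag values and names:

```c
/* Folding the keep_caller and in_continuation bytes into the existing
 * flags field, as suggested above. Flag bit values are invented here. */
#define TOY_FRAME_KEEP_CALLER     (1 << 0)
#define TOY_FRAME_IN_CONTINUATION (1 << 1)

typedef struct {
    unsigned char flags;   /* one byte in place of three separate MVMuint8s */
} ToyFrameFlags;

int frame_keeps_caller(const ToyFrameFlags *f) {
    return (f->flags & TOY_FRAME_KEEP_CALLER) != 0;
}
```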


Here are some things that stand out to me, and that point the way to an alternate design.

  1. An MVMFrame presently carries a bunch of things in it that aren’t relevant unless the frame is either currently on a thread’s call stack or captured in a continuation.
  2. This is an orthogonal axis to whether the frame is referenced by something that is garbage-collectable.
  3. It’s further orthogonal to one of a number of relatively rare things that can happen and need storage in the frame.
  4. Frames that are never referenced by a garbage collectable object will only ever have a reference count of 1, because they will only be alive by virtue of being either the currently executing frame of a thread, or in its caller chain.
  5. Frames only become referenced by something garbage collectable in cases where we’d end up with some other garbage-collectable allocation anyway. For example, in the closure case, we allocate the code-ref that points to the referenced outer frame.
  6. Let’s assume we were to allocate all frames using the GC, and consider the analysis that would let us know when we are able to avoid those allocations. The analysis needed would be escape analysis.

A new approach: the big picture

Taking these into account, I arrived at a way forward that should, I hope, address most of the issues at hand.

Every thread will have a chunk of memory that we’ll refer to as its “call stack”. Every new frame created during normal program execution will be allocated by making space for it, including its ->work and ->env, on this stack. This will need:

  • No reference count, because we know it’s 1
  • No gc_seq_number, because we can use the stack topology to make sure we only mark each frame once

Should this frame ever become referenced by a garbage collectable object, then we will GC-allocate a frame on the garbage-collected heap – as a totally normal garbage-collectable object. The frame state will be copied into this. The work space and environment will also be allocated from the fixed-size allocator, and the data migrated there.

Since this frame is now garbage-collectable, we have to check its ->caller to see if it’s on the thread-local stack, or already been promoted to the heap. If the former, we repeat the above process for it too. This is in order to uphold the key invariant in this design: the thread-local stack may point to things in the garbage-collectable heap, but never vice-versa.

This means the reference counting and its manipulation goes away entirely, and that frames that are heap-promoted become subject to the usual generational rules. Frames that would never be heap-referenced never end up on the heap, don’t add to GC pressure, and can be cleaned up immediately and cheaply.
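The promotion step described above might be sketched like this (a toy model with invented structures; the real code must also migrate ->work and ->env, and update everything pointing at the old stack location):

```c
#include <stdlib.h>
#include <string.h>

/* A toy model of frame promotion: when a stack frame escapes, copy it
 * to the heap, and recursively promote each still-on-stack caller too,
 * upholding the invariant that heap frames never point at stack frames. */
typedef struct ToyPromoFrame {
    int on_heap;
    struct ToyPromoFrame *caller;
} ToyPromoFrame;

ToyPromoFrame *promote(ToyPromoFrame *f) {
    if (f == NULL || f->on_heap)
        return f;                       /* nothing to do: absent or already heap */
    ToyPromoFrame *heap_f = malloc(sizeof(ToyPromoFrame));
    memcpy(heap_f, f, sizeof(ToyPromoFrame));
    heap_f->on_heap = 1;
    heap_f->caller = promote(f->caller);  /* walk the caller chain */
    return heap_f;
}
```

Promoting an inner frame drags its whole on-stack caller chain to the heap with it, which is exactly the invariant-preserving walk the text describes.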

There are some details to care about, of course. Since generational collection involves write barriers, then binds into frames on the garbage-collectable heap will also be subject to write barriers. Is that OK? There are two cases to consider.

  1. Binding of lexicals. Since most lexicals in Perl 6 point to a Scalar, Array, or Hash in my declarations, or point directly to a read-only object if parameters, this is relatively rare (of course, write barriers apply to the Scalar itself). In NQP, loads of lexicals are lowered to locals already, and we’ll do some more of that in Rakudo too, making it rarer still. Long story short, we can afford write barriers on lexical binds.
  2. Binding of stuff in ->work, which basically means every write into the register set of the interpreter. This, we cannot afford to barrier. However, there are only two cases where a frame is promoted to the heap and has ->work. One case is when it’s still executing, and so in the call chain of a thread. In this case, we can take care to always walk the objects in ->work by simply following the call chain. The second case is when a continuation is taken. But here, there are no binds to registers until the continuation is invoked again – at which point things are back in a thread’s call chain.

Refactoring towards it

The thing that makes this a somewhat scary piece of work is that, in making call frames potentially collectable objects, we break an assumption that has been there since week 1 of MoarVM’s development: that call frames never move. To maximize the chances of discovering problems with this refactor, I decided that step 1 would be to always allocate every single call frame on the heap. Only when that is working would I move on to optimizing away most of those heap allocations by adding the thread-local call stack.

MoarVM currently has 3 kinds of collectable:

  • Objects
  • Type objects
  • STables

So, I added a fourth: call frames. As a result, MVMFrame gains an MVMCollectable at the start of the data structure – which will be present whether it’s stack or heap allocated. This will start out zeroed when a frame is born on the call stack. This does two nice things: it gives us a way to know if a frame is GC-able or not, and also means the write barrier – without modification – will do the right thing on both stack and heap frames.

There were two more easy things to do. First was to add a function to allocate a heap frame. Second was to factor out frame destruction from reference decrement, since the latter was going away.

Beyond that, there was nothing for it besides diving in, breaking the world, and then trying to put it back together again. I got a good start towards it – but the conclusion of this first step will have to wait for next week’s installment! See you then.


Heap heap hooray!

Last week, I finally hunted down and fixed the EVAL memory leak, with the help of the heap snapshot analyzer I wrote about recently. I also hunted down a hang in parallel runs of the Perl 6 specification test suite that showed up recently – but only on Windows – and fixed that too.

Before we begin: a generational GC primer

A few posts ago I talked a bit about how MoarVM’s garbage collector works – and hand-waved on just about all of the interesting details. Some folks on IRC expressed a bit of curiosity about those details, so I figured I’d explain some of them as they come up in my performance/reliability work. In this post I’ll talk about generational collection, because it plays a part in the EVAL leak story.

Recall that the essence of a tracing garbage collector is that we start with a set of roots: global symbols, the current call frame that each thread is executing, thread-local symbols, and so forth. We “mark” each object we find in these roots as “alive”, and stick them onto a todo list. We work our way through this todo list, asking each object what it references and putting those onto the todo list also. With a little care to never revisit objects we already considered, we terminate having marked all reachable objects alive. The memory associated with unmarked objects can then be freed.

You might worry that if we have millions of objects in memory, this could be a rather time-consuming process to do again and again. You’d be right. Worse, memory access performance depends heavily on the CPU caches getting high hit rates. When we need to walk huge numbers of objects, we end up getting loads of CPU cache misses, and have to spend time fetching objects from main memory. (To give you an idea of the numbers: a level 1 cache hit means memory access in several CPU cycles, while having to go to main memory can easily cost a couple of hundred cycles or worse).

So, how might we do better? The key insight behind generational GC, often known as the generational hypothesis, is that in most programs objects are either very short-lived (surviving zero or one garbage collections) or long-lived (perhaps staying around for the lifetime of the entire program). Therefore, it’s reasonable to assume that once an object survives a couple of collections, it will stay around for a good few more.

Generational collection works by dividing the heap – that is, the space where objects are allocated – into generations (typically two or three). In MoarVM we have two generations, which we tend to refer to in the code as “nursery” and “gen2”. The nursery, as the name suggests, is where objects begin their life. We allocate them cheaply there using a “bump-the-pointer” scheme. The nursery in MoarVM is a fixed-size chunk of memory, and after a while we fill it up. This is what triggers garbage collection.

In a 2-generation collector, there are two types of collection, which we know as nursery collection and full collection. In a full collection, we do exactly what I described earlier: visit all objects, marking them alive, and freeing up those that aren’t. A nursery collection is similar, except as soon as we see an object is not in the nursery, we don’t put it on the todo list. Instead, we simply ignore it. This greatly cuts down on the number of objects we need to consider, making nursery collections hugely cheaper.

There are two things we must do to make this really work. The first is ensure that we only free memory associated with objects living in the nursery, not the old generation, since we didn’t do the analysis needed to free anything there. That’s fine; most objects “die young” anyway. The second is more subtle. There may be objects in the nursery that are only alive because something in the old generation references them. However, since we’re not considering any old generation objects, we won’t discover this liveness and so wrongly free things. This is resolved by maintaining a set of objects that are in the old generation but pointing to objects in the nursery. Whenever we assign a reference from one object to another, we check if this would establish an old generation to nursery reference, and stick the old generation object into the set, ensuring we will visit it and mark the nursery object. This check is known as a “write barrier”.

So, back to the EVAL story…

The EVAL leak

I reviewed, fixed up, and merged various patches from Timo to improve the heap snapshot data dumps by annotating them with more data. Then, I looked at a few paths to leaked objects (recall that I was using EVAL 'my class ABC { }' to demonstrate the leak). The paths looked something like this:

> path 38199
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
BOOTCompUnit (Object)
    --[ Index 81 ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
ABC (STable)

This shows the objects along the path, but those “unknowns” were hiding what I really wanted to know. So, I did some further patches, and got out a rather more useful result:

> path 6466
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh guard match ]-->
ABC (STable)

So, here we see that it’s a type specializer guard that is keeping the object alive. “Wait…a what?!” MoarVM does a bunch of dynamic optimization, watching out for types that occur at runtime and generating specialized versions of the code by type. And, sometimes, we have an unfortunate situation where code is discovered “hot”, but the type it was invoked with is fairly transient. In this case, the specialization matching table will end up referring to that type, keeping it alive.

However, since for any given bit of code we only generate a handful of these specializations, eventually we’d saturate them and stop leaking memory. I looked at another path:

> path 10594
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh log slots ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This is a case where the optimizer is tracing what objects show up. In fact, most of the paths looked this way. However, that should saturate at some point, yet I know that it goes on leaking. Finally, I found another path to a leak:

> path 59877
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

All told, 9 out of the 10 leaked objects were being kept alive by the dynamic optimizer holding on to things it had seen while tracing. That, while awkward, should eventually saturate – as should the integer cache issue. Yet the memory use grew forever, suggesting that things go on and on leaking. So, I tried a snapshot after disabling dynamic optimization. And:

> find stables type="ABC"
Object Id  Description
=========  ===========
368871     ABC

Just the one! 9 out of 10 objects were not on the heap. And yes, it was the integer box cache problem that kept the 1 alive:

> path 368871
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

So, with dynamic optimization disabled, the snapshot suggested that we did not leak any more, putting all of the blame on the optimizer. To check that out, I tried a long-running EVAL loop and…it still leaked heavily. My theory that dynamic optimization couldn’t account for all of the leaking, just the first bit of it, seemed to hold up.

To investigate further, I did a loop of 100 EVALs, as opposed to the 10 I had used so far, taking a snapshot on every GC run, plus one more that I forced at the end. So, how did the final snapshot look?

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 1`.
Type `help` for available commands, or `exit` to exit.

> snapshot 8
Loading that snapshot. Carry on...
> summary
Wait a moment, while I finish loading the snapshot...

    Total heap size:              33,417,225 bytes

    Total objects:                369,541
    Total type objects:           1,960
    Total STables (type tables):  1,961
    Total frames:                 2,077
    Total references:             1,302,511

> find stables type="ABC"
Object Id  Description
=========  ===========
368872     ABC

Only the one. So, according to this, we’re not leaking. But that’s when the loop is over. What about a mid-loop GC? I switched to a snapshot in the middle, and:

> find stables type="ABC"
Object Id  Description
=========  ===========
130353     ABC
376552     ABC

> path 130353
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

> path 376552
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Result ]-->
NQPMatch (Object)
    --[ Unknown ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

OK, that’s reasonable too: it’s alive because it’s referred to by the compiler, which is run as part of EVALing the code. So what are we leaking? I tried this:

> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,114,800 bytes
BOOTStaticFrame                        4,806,668 bytes
BOOTInt                                4,642,720 bytes
VMString                               2,859,188 bytes
BOOTHash                               2,253,016 bytes
SCRef                                  1,891,768 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,448,208 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,936 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,400 bytes
Signature                              208,440 bytes

Then I compared it with an earlier snapshot:

> snapshot 2
Loading that snapshot. Carry on...
> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,110,920 bytes
BOOTStaticFrame                        4,806,152 bytes
BOOTInt                                4,642,624 bytes
VMString                               2,858,472 bytes
BOOTHash                               2,241,320 bytes
SCRef                                  1,891,696 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,447,776 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,136 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,112 bytes
Signature                              208,296 bytes

Again, nothing very interesting to note. This strongly suggested that either some information was missing from the heap snapshots, or that something else in the VM state – something that did get cleaned up at exit – was keeping the leaked objects alive.

So I pondered a bit, and compared the GC marking code with the heap snapshot code. And…had an “aha!” moment. Remember I talked about the inter-generational root set that we keep thanks to generational collection? This was not being accounted for in heap snapshots. I fixed it, and the size of the resulting heap snapshot files was a dead giveaway that it made a huge difference:

04/06/2016  14:46       184,272,062 rak-heap-6
04/06/2016  15:21       262,846,653 rak-heap-7

And, back in the analyzer:

> snapshot 2
Loading that snapshot. Carry on...
> find stables type="ABC"
Wait a moment, while I finish loading the snapshot...

Object Id  Description
=========  ===========
21266      ABC
21312      ABC
22359      ABC
23317      ABC
24275      ABC
25233      ABC
26191      ABC
27149      ABC
28107      ABC
29065      ABC
30023      ABC
30981      ABC
361108     ABC
363007     ABC
364903     ABC

So, the objects were there, after all. I took a look at some of them:

> path 24275
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 24269 ]-->
ABC (STable)

> path 30981
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 30975 ]-->
ABC (STable)

> path 363007
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 21244 ]-->
BOOTCompUnit (Object)
    --[ Serialization context dependency ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This would explain a prolonged lifetime, but not an unending leak. I found myself missing a count command so I could easily see how things varied between the snapshots. I implemented it, then observed this:

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"

The number of objects in the inter-generational set just kept on growing! So, either this workload was just not triggering a gen-2 collection, or there was a bug. How to find out? By doing a normal --profile on the same code, and looking at the output. The summary page stated:

The profiled code did 9 garbage collections. There were 0 full collections involving the entire heap.

OK, so we really never did a full collection. That explains this particular snapshot, but not the leak over time, which surely would end up triggering a full collection at some point. To test that theory, I tweaked a MoarVM header to make full collections happen more often, to see if that helped:

diff --git a/src/gc/collect.h b/src/gc/collect.h
index b31a112..af6c456 100644
--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -6,7 +6,7 @@
 /* How many bytes should have been promoted into gen2 before we decide to
  * do a full GC run? The numbers below are used as a base amount plus an
  * extra amount per extra thread we have running. */
-#define MVM_GC_GEN2_THRESHOLD_BASE      (30 * 1024 * 1024)
+#define MVM_GC_GEN2_THRESHOLD_BASE      (1 * 1024 * 1024)
 #define MVM_GC_GEN2_THRESHOLD_THREAD    (2 * 1024 * 1024)

This change made it do a full collection for every 1MB promoted, not every 30MB. Hopefully that would be enough to trigger some full runs. And it did:

The profiled code did 9 garbage collections. There were 4 full collections involving the entire heap.

Much better. So, over in the heap profiler:

Considering the snapshot...looks reasonable!

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 5`.
Type `help` for available commands, or `exit` to exit.

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"

Well, ouch. That implies that the inter-generational root set keeps on growing and growing, for some reason. The profiler, which reports this number, agreed with this assessment: the count of gen2 roots climbed with every collection.


So, how could this possibly be happening?

I like to rule out the really silly things first. Like, whether the thing that cleans up the inter-generational roots list after a full collection is even being called. How? With a printf, of course! :P

diff --git a/src/gc/roots.c b/src/gc/roots.c
index b771106..361b472 100644
--- a/src/gc/roots.c
+++ b/src/gc/roots.c
@@ -301,7 +301,8 @@ void MVM_gc_root_gen2_cleanup(MVMThreadContext *tc) {
     MVMuint32        num_roots    = tc->num_gen2roots;
     MVMuint32        i = 0;
     MVMuint32        cur_survivor;
+printf("cleaning up gen2 roots\n");
     /* Find the first collected object. */
     while (i < num_roots && gen2roots[i]->flags & MVM_CF_GEN2_LIVE)

And…no output. I couldn’t quite believe it. So, I went to the place where this function is called, and noticed there was some logging I could switch on that describes all the ins and outs of organizing a GC run. That produced plenty of output, and showed that we were indeed never reaching the place where the gen2 roots cleanup would be called.

After some hunting, I discovered that an addition a while ago that tracked the amount of data promoted to the old generation, and used it to decide whether to do a full collection, had resulted in a nasty accident. It did the calculation to check if a full collection was needed in two different places, and the answer could change between them. This led to us not going through all the steps that a full collection would need.

I patched it and…the leak was gone. Even with dynamic optimization re-enabled and full collections back to being run after every promoted 30MB, the ever-increasing memory use of EVAL in a loop was no more. It climbed for a short while, then flattened out.

This fix will likely help numerous longer-running programs that have medium-lifetime objects. It also shaved around 45MB off CORE.setting compilation memory use. Unfortunately, it also caused a 5% slowdown in the Rakudo build-time, presumably because we were now actually doing all the work we should be on full collections!

A bit of tuning

With the leak out of the way – or at least, the big one – I wanted my 5% back. So I took a look at our logic for how we decide whether or not to do a full collection. The strategy so far had been to expect a fixed amount of memory to have been promoted to the old generation (with some additions per thread). However, this was lacking in a couple of ways.

For one, it only accounted for the direct size of objects, not any extra unmanaged memory they held. So, a compilation unit would not factor in the size of the compiled bytecode it held on to, and a dynamic array would not factor in the size of its storage. This was now an easy enough fix thanks to additions made while implementing heap snapshots.

For two, a fixed limit doesn’t behave too well with programs that really do build up a large heap over time. There, we can afford to promote a good bit more beforehand, and percentage wise it won’t make a lot of difference since they’re growing anyway. Building Rakudo’s CORE.setting is like this, as it builds up the program tree. So, I switched to a percentage-based scheme with a minimum threshold, which could afford to be a bit lower than the 30MB from before. These changes not only got Rakudo’s CORE.setting build time back down again (without having to give up much of the memory savings from before), but also had the EVAL loop example having a lower memory ceiling.

So, not only did I hunt down and fix the memory leak, I ended up tuning things to achieve a lower memory ceiling for applications whose memory footprint is fairly fixed over their lifetime, and fewer full GC runs for those growing over time.

Hunting down a parallel spectest hang

On Windows, a parallel spectest of high enough degree could cause hangs. I use TEST_JOBS=12, a number I picked after some measuring in the past. In the last couple of weeks, I started seeing hangs – and they went away if I cut down TEST_JOBS to just running 2 or 3 in parallel.

Eventually, helped by make localtest (where you can specify a file listing the tests to run), I managed to get it down to just 2 tests that, when run together, would reliably hang. It turned out one of them spawned another Rakudo process as part of the test, so there were 3 processes involved. Attaching the debugger to each in turn, I saw one was hung on getting a file lock, one was hung waiting for process termination (and was waiting on the process that was blocked on the file lock), and the other was blocked trying to write to STDOUT.

I don’t know for sure, but so far as I can tell, the way the test harness does parallel testing on Windows involves running batches of tests and reading a bit of TAP from them one at a time. The process blocked writing to STDOUT was also the one holding the file lock, but the test harness was, I presume, waiting to read output from the process blocked waiting for the process that was in turn waiting for the file lock. So, a nice circular wait involving 4 processes, one of them being the test harness! Typical darn Friday bug hunting… :-)

This also explained nicely why the issue didn’t crop up away from Windows: parallel spectest works differently (read: better) on other platforms. :-) While we will at some point switch to using a Perl 6 test harness for running the spectests that hopefully behaves consistently everywhere, I figured that Rakudo was probably doing something wrong with regard to the file lock.

File locks are used in only one place in Rakudo: managing pre-compilations. A little instrumentation of the lock/unlock code later, I saw a mis-match. A dump of the stack trace at each place we did a lock/unlock eventually led me to the problem, which I was able to fix. This bug likely didn’t just affect spectest on Windows; I suspect I could construct various hangs on other platforms too from it. So, a good fix to have in there.

As a side-note, the reason this bug was both likely to happen and hard to identify was that the lock and unlock were not placed on the same code path. Keeping them together is a good idea for any kind of locking. Locks are horrible to work with anyway; putting lock/unlock far apart in the code is just asking for problems (and this is far from the first problem I’ve hunted down in code with such a structure). So, I’ve asked nine++, as part of his great work to keep improving our installation and precomp handling, to look into addressing this, so we’re structurally in a better place to not have such bugs.

And, for all of you out there using Rakudo’s Lock class: first, don’t, and second, if you must use it, always prefer the $lock.protect({ ... }) form over manual lock/unlock method calls.

Two icky problems less…

So, not a bad last week’s Perl 6 work – though it took me all of this week to get around to writing it up. Maybe I’ll be a little faster writing up this week’s happenings. :-)


Small, but welcome, fixes

Last week wasn’t one of my most productive on Perl 6, thanks to a mix of Easter holiday, a little more work than expected on another project, and feeling a tad under the weather for a day or so. In the time I did find, though, I managed to pick off some worthwhile bits and pieces that needed doing.

Heap snapshot issues triage

It’s generally very clean and pleasant that our meta-objects are, so far as the VM is concerned, just objects. This means the answer to “can types be GC’d” is easy: sure they can, so long as nothing is using them any more. The downside is that the VM has very little insight into this huge pile of objects, which is why we have various “protocols” where the MOP can provide some hints. A lot of performance wins come from this.

When I was working on heap snapshots, I found it would also be rather useful if there was a way to let the VM know about a “debug name” for a type: a name that isn’t used for any kind of resolution, but that can be included in things like heap snapshots. It’ll no doubt also be useful for debugging. So, I added an op for that, nqp::setdebugtypename, and used it in a branch of NQP, meaning I got more useful heap snapshots. This week, I also added this op to the JVM backend, meaning that I could merge the branch that uses it. This means that you can now do a normal build of Rakudo/NQP/Moar and get useful typenames in the MoarVM heap snapshots.

Last time, I mentioned that the heap snapshot viewer took an age to start up because we were super slow at reading files with really long lines. This week, I fixed this performance bug with two patches. Now, the heap snapshot analyzer picks apart a 25MB snapshot file into its various big pieces (just looking at how each line starts) and reaches its prompt in under a second on my box. That’s a rather large improvement from before, when it took well over a minute.

I also looked at a C-level profile of where MoarVM spends time doing the detailed parsing of a heap snapshot, which happens in the background while the user types their first command, and then blocks execution of that command if it’s not completed. The processing uses a few threads. The profile showed up a couple of areas in need of improvement: we have a lot of contention on the fixed size allocator, and the GC’s world-stopping logic seems to have room for improvement too. So, those will be on my todo list for the future.

Fixing a couple of crashes

SIGSEGV is one of the least inspiring ways to crash and burn, so I’m fairly keen to hunt down cases where that happens and fix them. This week I fixed two. The first was a bug in the UTF8 Clean 8-bit encoding, which turned out to be one of those “duh, what was I thinking” off-by-ones. The second was a little more fun to track down, but turned out to be memory corruption thanks to missing GC rooting.

And…that’s it!

But, this week I’m set to have a good bit more time, so hope to have some more interesting things to talk about in the next report. :-)


Happy heapster!

In last week’s report, I was working away at a heap snapshot mechanism for MoarVM. If you don’t know what that is, I suggest taking a look at last week’s post for an explanation, and then this one will probably make a whole lot more sense. :-)

This week, I did the rest of the work needed to get heap snapshots being dumped. I’m not going to talk any more about that, since I discussed the ideas behind it last time, and I encountered nothing unexpected along the way.

Instead, I want to discuss my initial work on building a tool to analyze the heap snapshots and get useful data out of them. The snapshots themselves are mostly just megabytes and megabytes worth of integers separated by commas and semicolons. Big data it ain’t, but it’s big enough that processing it needs at least a bit of thought if we’re going to do it fast enough and without using huge amounts of memory.

I decided early on that I wanted to write this analysis tool in Perl 6. Overall, I’m working on performance, memory use, and reliability. The results from analyzing heap snapshots will help cut down on memory use directly, and identify leaks. Building the analyzer in Perl 6 means I’ll be able to identify and address the performance issues that I run into along the way, helping performance. Further, I could see various opportunities to do parallel processing, giving concurrency things a bit of a workout and maybe shaking some issues out there.

The design

I decided to build a command line application, which would execute queries on the heap snapshot and display the results. Thinking a bit about that, I realized it would break nicely into two pieces: a model that held the snapshot data and provided various methods to interrogate it, and a shell that would parse the queries and nicely display the results. All the stuff that needed performance engineering would lie within the model. In the shell, I didn’t have to worry about that.

I also realized that there was a nice perceived speed trick I could pull. While waiting for the user to type a query, there’s no reason not to get to work on parsing the snapshot and building up a data model of it. In the best case, we’re done by the time they’ve typed their query and hit enter. Even if not, we’ve shaved some time off their wait.

Looking at heap snapshots

A heap snapshot file is really a heap snapshot collection file, in that it can hold multiple snapshots. It starts with some common data that is shared among all the snapshots:

strings: ["Permanent Roots","VM Instance Roots", ...]
types: 18,19;21,22;25,26;27,28;30,31;33,33;30,34;...
static_frames: 10,11,1,12;63,64,13,65;10,67,1,65;...

The first, strings, is just a JSON array of all the strings we refer to anywhere else in the snapshot. The types data set can be seen as representing a table with two columns, both of them indexes into the strings array. The first column is the name of the representation, and the second the name of the type (for example, a Perl 6 Rat would have the P6opaque representation and the Rat type). This is so we can understand what the objects in the snapshot are. The static_frames data set does a similar thing for frames on the heap, so we can see the name, line number, and file of the sub, method, or block that was closed over.

These are followed by the snapshots, which look like this:

snapshot 0
collectables: 9,0,0,0,0,4;5,0,0,0,125030,165;...
references: 2,0,1;2,1,2;2,2,3;2,3,4;2,4,5;2,5,6;...

These describe a graph, rooted at the very first collectable. The collectables are the nodes, and the references the edges between them. The integers for a collectable describe what kind of node it is (object, frame, STable, etc.), its size, how many edges originate from it and where in the references table they are located (we take care to emit those in a contiguous way, meaning we can save on a from field in the references table). The references have a to field – which is actually the third integer – along with some data to describe the reference when we know something about it (for example, array indexes and attribute names in objects can be used to label the edges).

To give you an idea of the number of nodes and edges we might be considering, the snapshot I took to start understanding the EVAL memory leak has 501,684 nodes and 1,638,375 edges – and that is on a fairly well golfed down program. We can expect heap snapshots of real, interesting applications to be larger than this.

We could in theory represent every node and edge with an object. It’d “only” be 3 million objects and, if you strip away all the Perl 6 machinery we’re still pretty bad at optimizing, MoarVM itself can chug through making those allocations and sticking them into a pre-sized array in about 0.6s on my box. Unfortunately, though, the memory locality we’ll get when looking through those objects will be pretty awful. The data we want will be spread out over the heap, with lots of object headers we don’t care about polluting the CPU caches. Even with well-optimized object creation, we’d still have a bad design.

Native arrays to the rescue

Thankfully, Perl 6 has compact native arrays. These are perfect for storing large numbers of integers in a space-efficient way, and tightly packed together to be cache-friendly. So, I decided that my objects would instead be about the data sets that we might query: types, static frames, and snapshots. Then, the code would be mostly focused around ploughing through integer arrays.

The types data set makes a fairly easy example. It has two int arrays of indexes into the strings array, meaning the data set will be stored in two contiguous blobs of memory. Also note the careful use of binding in the BUILD submethod. Perl 6’s array assignment semantics are copying in nature, which is great for avoiding action at a distance (a good default) but not quite what we want when passing around large volumes of data. The type-name and repr-name methods just resolve indexes to strings. More interesting are the all-with-type and all-with-repr methods, which simply resolve the search string – if it exists – to an index, and then go through the relevant native int array of string indexes to find what we’re after.
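A rough Python analogue of that design (method and type names invented for illustration) might look like this: two flat integer arrays indexing into a shared strings table, bound rather than copied so the big arrays are shared, with lookups done by scanning contiguous memory.

```python
from array import array

strings = ['NQPArray', 'VMArray', 'BOOTInt', 'P6opaque']

class Types:
    def __init__(self, type_name_idx, repr_name_idx):
        # Bind, don't copy: we want to share the big arrays, not clone them.
        self.type_name_idx = type_name_idx
        self.repr_name_idx = repr_name_idx

    def type_name(self, type_id):
        """Resolve a type's name index to the string it denotes."""
        return strings[self.type_name_idx[type_id]]

    def all_with_repr(self, repr_name):
        """Resolve the search string to an index, then scan the packed
        int array for types using that representation."""
        if repr_name not in strings:
            return []
        wanted = strings.index(repr_name)
        return [i for i, s in enumerate(self.repr_name_idx) if s == wanted]

types = Types(array('q', [0, 2]), array('q', [1, 3]))
```

The key property is that a query touches two tightly packed buffers rather than chasing pointers all over the heap.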

Parallel, and background, processing

The BUILD submethod for the overall model starts out by picking the file apart, which is some simple line-based parsing. Given the heap snapshot format has very few lines (though they are very long), there’s nothing too performance sensitive in this code.

Then, at the end, it sets off 3 bits of parallel work to decode the JSON strings, and to parse the type and static frame tables. It’s worth noting that the Types dataset does also need to be constructed with the JSON strings; a simple await is used to declare that dependency. With regard to the snapshots, there’s no point processing all of them. Often, the user will only care about the final one, or a handful of them. At the same time, some background processing is still desirable. Thus, the model has arrays of unparsed snapshots and parsed snapshot promises. When the user indicates interest in a snapshot, we will start to prepare it. At the point it’s actually needed, we then await the Promise for that snapshot.
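The lazy-parse pattern just described can be sketched in Python, using futures in place of Perl 6 Promises (all names here are invented for illustration): kick off background parsing when interest is indicated, and only block on the result at the point it is actually needed.

```python
from concurrent.futures import ThreadPoolExecutor

class Model:
    def __init__(self, unparsed_snapshots):
        self.unparsed = unparsed_snapshots
        self.parsed = [None] * len(unparsed_snapshots)  # Futures, lazily created
        self.pool = ThreadPoolExecutor()

    def prepare(self, i):
        # Start background parsing as soon as interest is indicated.
        if self.parsed[i] is None:
            self.parsed[i] = self.pool.submit(self._parse, self.unparsed[i])

    def snapshot(self, i):
        # Await the result only at the point it's actually needed.
        self.prepare(i)
        return self.parsed[i].result()

    def _parse(self, raw):
        # Stand-in for the real (expensive) snapshot parsing.
        return {'nodes': raw.count(',') + 1}

model = Model(['1,2,3', '4,5'])
model.prepare(0)   # start work in the background while the user types
```

This way, snapshots nobody ever looks at cost nothing beyond holding their raw text.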

Over in the shell, we special-case having 1 snapshot in the collection by just selecting it right away, getting on with the processing of it immediately while the user types their query. We also take a little care to let the user know when they’ve requested an operation but we’re not yet done parsing the snapshot.

Early results

The first things I implemented were getting a summary of the heap snapshot, along with queries to see the top frames and objects, both by space and count. Here’s how a session with it on a Rakudo heap snapshot starts out:

$ perl6-m -Ilib bin\moar-ha rak-heap
Considering the snapshot...looks reasonable!

This file contains 1 heap snapshot. I've selected it for you.
Type `help` for available commands, or `exit` to exit.

I ask for a summary:

> summary
Wait a moment, while I finish loading the snapshot...

    Total heap size:              28,727,776 bytes

    Total objects:                372,884
    Total type objects:           2,248
    Total STables (type tables):  2,249
    Total frames:                 2,523

And then can ask about the kinds of objects taking up the most space:

> top objects
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,184,816 bytes
BOOTInt                                4,639,680 bytes
BOOTStaticFrame                        4,195,168 bytes
VMString                               2,804,344 bytes
BOOTCode                               1,486,872 bytes
<anon>                                 1,331,744 bytes
Parameter                              847,552 bytes
BOOTNum                                827,904 bytes
BOOTStr                                546,368 bytes
Perl6::Metamodel::ContainerDescriptor  367,848 bytes
BOOTArray                              344,056 bytes
QAST::Op                               262,656 bytes
<anon>                                 255,032 bytes
Method                                 238,792 bytes
Signature                              228,096 bytes

And by number of objects:

> top objects by count
Name                                   Count
=====================================  =======
BOOTInt                                144,990
NQPArray                               74,270
VMString                               34,440
BOOTNum                                25,872
BOOTCode                               20,651
BOOTStr                                17,074
BOOTStaticFrame                        16,916
Parameter                              6,232
Perl6::Metamodel::ContainerDescriptor  5,109
BOOTHash                               4,775
Signature                              3,168
<anon>                                 2,848
QAST::Op                               2,736
BOOTIntArray                           1,725
Method                                 1,571

Figuring out where those huge numbers of boxed integers come from, along with the arrays we might guess they’re being stored in, will no doubt be the subject of a future week’s post here. In fact, I can already see there’s going to be a lot of really valuable data to mine.

But…that EVAL leak

This kind of analysis doesn’t help find leaks, however. That needs something else. I figured that if I did something like this:

C:\consulting\rakudo>perl6-m --profile=heap -e "for ^20 { EVAL 'my class ABC { }' }"
Recording heap snapshot
Recording completed
Writing heap snapshot to heap-snapshot-1458770262.66355

I could then search for the type tables for the class ABC (which we could expect to be GC’d). That would confirm that they are staying around. So, I implemented a find query, which gave me output like this:

> find stables type="ABC"
Object Id  Description
=========  ===========
42840      ABC
42845      ABC
45288      ABC
64994      ABC
71335      ABC
76824      ABC
78599      ABC
82535      ABC
86105      ABC
146765     ABC
404166     ABC

Those integers on the left are unique IDs for each object on the heap. So, given one of those, I then needed to calculate a path through the heap from the root to this object, to understand why it could not be collected and freed. This would be most useful if I could be told the shortest path. Thankfully, there exists a well-known algorithm that runs in O(Nodes + Edges) time that can annotate all graph nodes with data that lets you then determine the shortest path to the root in O(maximum path length). (That gives an upper bound of O(Edges) in general, however unlikely that is in the real world. To see why, consider a heap snapshot that consists entirely of a linked list.)
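Such a predecessor-annotating pass might look like this minimal Python sketch (names invented): a single O(Nodes + Edges) sweep records each node's predecessor, after which any shortest path back to the root falls out of a simple walk.

```python
from collections import deque

def annotate_predecessors(adjacency, root=0):
    """One pass over the graph: record, for every reachable node, the
    node we first reached it from. First visit = shortest route."""
    pred = {root: None}
    todo = deque([root])
    while todo:
        node = todo.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in pred:
                pred[neighbour] = node
                todo.append(neighbour)
    return pred

def path_from_root(pred, node):
    """Walk the predecessor annotations back to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return list(reversed(path))

heap = {0: [1, 2], 1: [3], 2: [3]}
pred = annotate_predecessors(heap)
```

The annotation pass runs once; every subsequent path query is then just a cheap walk of at most path-length steps.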

The algorithm, if you didn’t guess yet, is the breadth-first search. The nice thing is that we only have to compute it once, and can then answer shortest path queries really quickly. Here’s the patch that added it. And here’s what it found:

> path 42840
    --[ Thread Roots ]-->
Thread Roots
    --[ Compiling serialization contexts ]-->
BOOTArray (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ Unknown ]-->
ABC (STable)

Bingo! We’re somehow, in an EVAL, leaving a reference to the serialization context in the “stuff we’re still compiling” array. (Serialization contexts being the things that hold objects we create at compile-time, but want to refer to at runtime, and so must serialize when pre-compiling.) This discovery quickly led to an NQP patch. Which…helped, but we still leak, for a different reason. The good news, however, is that the heap snapshot analyzer could very quickly show that the fix had been successful, and identify the next thing to investigate. You can only imagine how much more frustrating it would be to not have tooling that can do this! Sadly, I didn’t have time this week to dig into the next problem.

In one final bit of heap snapshot analyzer news, Timo forked it and added more features. I love it when I build something and others dig in to make it better. \o/

So, how was Perl 6’s performance?

I discovered 3 things that basically killed performance. The first was Str.Int being horribly slow, and since reading in a heap snapshot is basically all about converting strings to integers, that was a real killer. For example:

$ timecmd perl6-m -e "my $a; for ^1000000 { $a = '100'.Int }"
command took 0:1:09.83 (69.83s total)

Awful! But, a 15-minute patch later, I had it down to:

$ timecmd perl6-m -e "my $a; for ^1000000 { $a = '100'.Int }"
command took 0:0:1.87 (1.87s total)

A factor of 37 faster. Not super-fast, so I’ll return to it at some point, but tolerable for now. I guess this’ll help out plenty of other folk’s scripts, so it was very nice to fix it up a bit.
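For illustration, the kind of tight digit loop such a string-to-integer fast path boils down to can be sketched in Python (this is not the actual Rakudo patch, just the general shape of avoiding generic machinery and intermediate allocations):

```python
def ascii_to_int(s):
    """Parse a plain decimal string with a simple digit loop."""
    value = 0
    negative = s.startswith('-')
    for ch in (s[1:] if negative else s):
        d = ord(ch) - ord('0')
        if not 0 <= d <= 9:
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + d
    return -value if negative else value
```

The win comes from doing only the work the common case needs, leaving the grammar-driven general numeric parsing for the strings that actually require it.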

The second discovery was that a thread blocked on synchronous I/O could end up blocking garbage collection from taking place by any thread, which completely thwarted my plans to do the snapshot parsing in the background. This wasn’t a design problem in MoarVM, thankfully. It’s long had a way for a thread to mark itself as blocked, so another can steal its GC work. I just had to call it on a couple of extra code paths.

There’s a third issue, which is that something in MoarVM’s line reading gets costly when you have really long (like, multi-megabyte) lines. Probably not a common thing to be doing, but still, I’ll make sure to fix it. Didn’t have time for it this week, though. Next time!

Posted in Uncategorized | 3 Comments

A whole heap of work

The majority of my Perl 6 grant time in the last week went on building a heap snapshot mechanism in MoarVM. But, before I dug into that, I worked on…

More memory savings!

While I expect we’ll find plenty of interesting memory savings – not to mention the source of the EVAL memory leak – through the heap snapshots I’m working on, there’s still a good bit of info to be had from C-level heap profiling tools and taking a careful look at various data structures. This week, I found and improved 3 such things.

Every block, routine, or even thunk that we compile gets an ID that is unique within its compilation unit (that is, a body of code that is compiled at once). This is used to be able to refer to it unambiguously. Historically – dating all the way back to the Parrot days – these needed to be a bit more unique than just within a compilation unit. This made them fairly long, and since something like the Perl 6 builtins run into the tens of thousands of them, they added up quickly. So, I shortened them a good bit, knocking 188KB off NQP’s base memory and 1424KB off Rakudo’s base memory, and making the compiled CORE.setting a nice 288KB smaller.

These routines, blocks, and thunks are all represented down in MoarVM as static frames – a data structure I’ve talked about a bit in previous posts here. Aware that we have a lot of them, I looked to see if there were any more relatively easy savings to be made. I discovered a couple.

First up, we were keeping around a small map of the bytecode that was used to indicate instruction boundaries. This was used for precisely one thing: outputting instruction offsets when showing low-level backtraces. We never actually show that data to any normal Perl 6 end user. It was useful to have around early on in MoarVM’s development, but I can’t remember the last time I needed that information to debug something. It’s also hugely misleading these days, as the VM is often running specialized bytecode with lots of virtualization and indirection stripped out. So, I just got rid of it. We only generated these maps for the bytecode of static frames that were actually run, but removing them also saved a pointer in every static frame. In total, it was another 316KB off NQP’s base memory, and 1088KB off Rakudo’s.

Finally, back when I added the profiler, static frames got a few new fields. They’re totally unused when we’re not applying any kind of instrumentation, however. So, I pulled them out into a single struct, reducing us to just one pointer per static frame for this state. That was another 144KB off NQP, and 1036KB off Rakudo.

These total to around 3.5MB off the base size of Rakudo (measured by running the empty loop program), and all give further savings once extra modules are loaded or more of CORE.setting is used.

Multi caches get a size boost

I analyzed some code from RabidGravy doing sine wave generation, and noticed it was doing a lot of slow-path multi-dispatch. It turns out we were overflowing the multi caches, which in turn would be harming our ability to inline. While this points at a design issue I want to look into a little further, giving the caches a size boost provided some instant relief, getting the code down to 60% of its previous runtime. This does mean we use a bit more memory, but only for multis that we actually call, which isn’t all that many just for a normal startup.

Heap snapshots

This week I got a lot of work in on heap snapshots. First of all, what is a heap snapshot? It’s a representation of the memory allocated by a program, and under the management of the VM. In MoarVM this consists of three things:

  • Objects (both type objects and instances)
  • Shared Tables (aka “STables”), which hold the common things for a given meta-object and representation pairing
  • Call frames

The first two are managed by the GC; in fact, most code in the GC doesn’t care which it’s dealing with. The third is managed for now by reference counts (the reasons why, and how this gives a mix of great properties and awful properties, is a discussion for another day!)

The point isn’t to capture all of the data on the heap, but rather to capture the size, identity, and reference structure of the data. Size is easy enough to understand – it’s just about how many bytes are used. Identity is about what type an object has (the class name, for example), what type an STable represents, and what block or routine in the code a particular call frame is an invocation of. Together with counting how many items we have, we can use size and identity data to answer questions like, “what kinds of objects are using up loads of memory”. I’ll be interested to use this to see what’s eating up memory in NQP and Rakudo, before you even start doing any interesting work.

Reference structure is more interesting. It’s about what things point to, or refer to, other things. MoarVM has a reachability based GC. When we garbage collect, we start from a set of roots, which consist of things like the call stacks of each running thread along with other per-thread and per-VM state. We then – conceptually, at least – “mark” all the objects that these roots reference and put them on a todo list. We then visit them, asking each of them what they reference, and putting those objects onto the todo list. By taking care never to put an already marked object onto the todo list, we can be confident we’ll eventually terminate with an empty list. At the end, all of the unmarked objects are declared unreachable, and we can free them up. (And just to drive it home: this is a good way to conceptually understand what a reachability GC does, but does not reflect the way modern VMs that need to efficiently manage a large number of objects actually do it. And, like nearly all abstractions, it hides details that can matter in reality. Again, a topic for another day. :-))
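The conceptual mark phase described above can be written down as a small worklist sweep (illustrative only – as the paragraph notes, a real modern collector works quite differently):

```python
def mark_reachable(roots, references):
    """Starting from the roots, mark everything reachable; anything
    left unmarked at the end can be freed."""
    marked = set()
    todo = list(roots)
    while todo:
        obj = todo.pop()
        if obj in marked:
            continue
        marked.add(obj)
        for referee in references.get(obj, []):
            # Never re-queue an already marked object, so the todo
            # list is guaranteed to empty out eventually.
            if referee not in marked:
                todo.append(referee)
    return marked

# A toy heap: 'c' points at live data but nothing points at 'c'.
heap = {'root': ['a'], 'a': ['b'], 'b': ['a'], 'c': ['b']}
live = mark_reachable(['root'], heap)
unreachable = set(heap) - live
```

Note that 'c' is unreachable even though it references a live object – reachability is about who points at you, not who you point at.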

Anyway, the reference structure is all about answering why a given object is still in memory. What path from the roots is keeping it alive? This is the sort of data that will help me understand the leaky EVAL bug I mentioned a week or so back.

Taking a heap snapshot involves capturing all of the raw data. It’s rather important that the process of taking a heap snapshot does not impact the liveness of the things in the snapshot. The instrumenting profiler, which was primarily built for giving us insight into microbenchmarks and how the dynamic optimizer works with them, references real type objects and static frames in its data set, which is then post-processed. In doing so, it keeps them alive. But the heap snapshotter doing this would be a disaster! It would distort the set of live objects, which it exists to measure. So, rule 1 is that the heap snapshot data structure must never hold any references to GC-managed objects. In fact, that’s probably the only really important rule. Of course, it’s going to affect the amount of memory the VM uses, but that must not show up in any way in the results. Most of the remaining challenge in taking heap snapshots is about trying to capture the data in a sufficiently compact way that it’s just large rather than utterly huge.

Another interesting question is when to take heap snapshots. A fairly good time is just after a GC run. Why? Because we’re in a state of “stop the world” at that point anyway, so we really are taking a snapshot of an atomic point in time with regard to the execution of the program. So, that’s where it’s done. (Note that this doesn’t bias things, in so far as we’d never be wanting to include unreachable things into the snapshot anyway.)

Anyway, by this point I’ve got pretty far with the code that does the snapshotting. It happily finds all object and STable references, and I’m part of the way through frames, which will mean it can snapshot the memory associated with closures properly. So, not much more work to go on producing the snapshots themselves.

Once I’ve got complete snapshots being produced, the next task will be to build some tools to analyze them. There will also be some work to make the snapshots a bit more detailed (for example, including the names of object attributes, so it’s easier to see the path keeping an object alive). I’ll probably take on some basic analysis tooling first, however, because I think even without that added context the snapshots will still be hugely revealing. Stay tuned for next week’s report for more!

Posted in Uncategorized | 5 Comments

Can’t forget about memory issues

I pondered at the end of last week’s post that I might work on speeding up object construction this week. I didn’t, but I did get a bunch of other useful stuff done – and got started on one of the big tooling deliverables of my grant.

Lazy string decoding

Last week, I’d been working on lazily decoding strings, so we could save some memory. I nearly got it working, and noted that Timo dropped a patch that got rid of the SEGV. This week I reviewed that patch. It was correct in so far as it fixed the issue, but I noted that other code was vulnerable to running into the same problem. In summary, the JIT compiler has a dynasm macro for grabbing a string from the string heap, so a robust fix was to ensure the string was always decoded before JITting an unconditional lookup of it. So, I fixed it that way. Code review for the win!

With the patches merged, Rakudo’s memory use for the infinite loop program went down by 1272 KB, which is a welcome win. You might imagine that we’d have saved something on startup time by not doing the decoding work also. The answer is yes, but it’s in noise if you simply use something like time. Thankfully, there is a good way to measure more accurately: use callgrind to count the number of CPU instructions retired running the empty program. That shows a reduction of about 2.7 million instructions – significant, though less than 1%, which is why it’s noise on the wallclock time. Still, such changes add up.

Other little memory savings

Glancing at the massif heap profiler output after the previous fix, it struck me that another huge use of memory allocations at startup is static frames. Static frames represent the static things we know about call frames. Call frames are the things created whenever you enter a routine or block, and keep hold of your lexical variables. The actual lexical storage is created for every call frame (to understand why this is needed, consider recursion). However, things like the name of a routine, its bytecode, the mapping of lexical names to lexical storage slots in the call frame, and the ranges of code covered by exception handlers are “static” data. Therefore, we store it once in the “static frame”, and each call frame just points to its static frame.

You never actually see a static frame from MoarVM’s user-space. They’re an implementation detail. What you do see are coderefs (or “code objects”), which point to a static frame and also have an outer pointer, which is what makes closures work.

We already deserialize static frames partially lazily. In short, we always deserialize the part you need for operations that just want to talk about a frame, and only deserialize the rest of the data if you ever actually call that code. That was a really quite good saving, but it looks from the profile like there may be value in trying to push that laziness even further. Again, for a typical Perl 6 program, only a tiny fraction of the code in CORE.setting will ever be mentioned or invoked.

So, that may be a future task, but it will be a little tricky compared to strings. What I also did notice, however, is that in a compilation unit, we kept two arrays: one of static frames, and one of code objects. We almost never indexed the static frames array, and none of the places we did would be on a hot path. Since it was always possible to reach the static frames through the code objects, I did the refactor to get rid of the static frames array. It’s only a 16KB saving for NQP, but with Rakudo’s sizable CORE.setting and the compiler, we save a more satisfying 134KB off the memory needed for the infinite loop program. We also save ourselves walking through those long arrays of code objects during full GC runs (you’ll note the removed GC calls in the patch), which is also good for the CPU cache – and of course saves some instructions.

Another slightly cheeky saving came from lazily allocating GC nursery semispaces. The MoarVM GC, like most modern GCs that need to manage sizable heaps, is generational. This means it manages objects that live a long time differently from those that live a short time. For short-lived objects, we use a technique known as semi-space copying. The idea is simple: you allocate two equal size regions of memory (4MB each, per thread, in MoarVM). You do super-cheap bump-the-pointer allocation in the currently active semi-space. Eventually, you fill it, which triggers garbage collection. All the living objects are then copied, either to the start of the other semi-space if we never saw them before, or into the old generation storage if they survived a previous GC and so are probably going to be fairly long-lived.

For simple programs, or short-lived threads, we never need to ask for the second semi-space. In the past we always did. Now we don’t. It actually took a lot less time to write/test that patch than it did to explain it now. :-) But also, it’s worth noting that it’s a bit of a con on…well…every modern OS, I’d guess. While a heap profiler like massif will happily show that we just saved 4MB off each small program or short-lived thread, in reality operating systems don’t really swallow up pages of physical RAM until programs touch them. So, this is mostly a perception win for those who see numbers that factor in “what we asked for”, but it’s an easy one, and may save a brk system call or so.
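Both the bump-the-pointer allocation and the lazily created second semi-space can be captured in a toy sketch (sizes shrunk from MoarVM's 4MB per space, and all names invented for illustration):

```python
class SemiSpaceHeap:
    def __init__(self, size=64):
        self.size = size
        self.active = bytearray(size)
        self.other = None          # second semi-space allocated lazily
        self.top = 0

    def allocate(self, nbytes):
        if self.top + nbytes > self.size:
            self.collect()         # space full: trigger a GC run
        addr = self.top
        self.top += nbytes         # the super-cheap bump allocation
        return addr

    def collect(self):
        if self.other is None:     # only ask for the second space on
            self.other = bytearray(self.size)  # the first collection
        # A real collector would copy survivors into self.other (or
        # promote them to the old generation); this sketch just swaps
        # the spaces and resets the allocation pointer.
        self.active, self.other = self.other, self.active
        self.top = 0

heap = SemiSpaceHeap()
a = heap.allocate(16)
b = heap.allocate(16)
```

A short-lived program that never fills the first space never pays for the second one at all, which is the point of the lazy allocation.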

Assorted small fixes

My work on speeding up accessors turned out to cause a regression in a module that was enjoying introspecting accessor methods. It turns out I didn’t mark the generated methods rw when I should have, even though the generated code was correct. An easy fix.

Next up, I looked into a concurrency bug, where channels coerced to supplies and supplies coerced to channels would behave in the wrong way with exceptions. I guess it’s a pretty symmetry in some sense that we ended up with both directions broken, but both directions working is even prettier, so I fixed both cases.

It turned out that writing a submethod Bool didn’t work out right in all cases, while a method Bool always did. Boolification is fairly hot-path, so is fairly optimized (on the other hand, with our increasingly smart dynamic optimizer, we might be able to rid ourselves of the special cases in the future). For now, it was another easy fix (at least, if you know the MOP code well enough!)

Finally, I took on a nasty SEGV found by Larry, which turned out to golf to a crash when concatenating strings that had Unicode characters in a fairly particular range at the end of the first string (there were likely a few other paths to the same bug). It turned out to be a silly braino in the generation of the primary composite table used for NFC transformation.

The EVAL memory monster

An RT ticket reported that EVAL of a regex appears to leak (it in fact asked about the <$x> regex construct, but that golfs to an EVAL):

for ^500 { EVAL 'regex { abcdef }' }

My first tool against leaks is Valgrind, which shows what memory is allocated but not freed. So, I tried it and…no dice. We leak very little, and it all seems to be missing cleanup at exit. The size of the leak was constant, not proportional to the number of EVALs. Therefore, I wasn’t looking at a real leak in the sense that MoarVM loses track of memory. All the memory the EVALs tie up is released at shut-down – it’s just not released as early as it should be.

My next theory was that perhaps we seem to leak because we end up with lots of objects promoted to the old generation of the GC, which we collect irregularly. That quickly became unlikely, though. The symptom of this is typically that memory use has a sawtooth shape, with lots getting allocated then a sudden big release on full collections. However, that didn’t seem to happen; memory use just grew. To completely rule it out, I tweaked the settings to make us do full collections really often, and it didn’t help.

While doing this analysis, I was looking at profiles some. The primary reason was because the profiler records all the GC runs that happen, so I could easily see that full heap collects really were happening. However, looking at the profile pointed out a few easy optimizations to compilation.

The first was noting that the QAST::Block.symbol method – used by the compiler to keep various interesting bits of data about symbols – was doing a huge number of allocations of hashes and hash iterators. I noted in many cases we don’t need to.

This was a big win – in fact, far too big. The profiler seemed to report too few allocations now. I dug in and noticed that it indeed was missing various allocations that could happen during parameter binding, both with slurpies and auto-boxing. Fixing that in turn showed up that the very same symbol method accounted for a huge amount of allocation as a result of boxing the $name parameter. So, I did a few fixes so that we could turn that to str.

The profiling data also showed up a few other easy wins in the AST to MoarVM compiler. One was opting NQP in to a multiple dispatch optimization in MoarVM that Rakudo was already using. The other was adding a few more native type annotations to cut out some more allocations.

These together meant that my leaking example from earlier now ran in 75% of the time it used to. These improvements will help compilation speed generally too. The bad, if unsurprising, news is they did nothing at all to help with the leak.

A little further analysis showed that we never release the compilation unit and serialization context objects. That only leads to another question: what is referencing them, and so keeping them alive and preventing their garbage collection? That’s not an easy question to answer without better tooling. Happily, in my grant I’d already proposed to write a heap profiler, which can record snapshots of the GC-managed heap.

So, I’ve started work on building that. It’s not trivial, but on the other hand at least a good bit less difficult than writing a garbage collector. :-) Since I seem to have found plenty to talk about in this post already, I’ll save discussing the heap profiler until next week’s post – when it might actually be at the point of spitting out some interesting data. See you then!

Posted in Uncategorized | 3 Comments