Heap heap hooray!

Last week, I finally hunted down and fixed the EVAL memory leak, with the help of the heap snapshot analyzer I wrote about recently. I also hunted down a hang in parallel runs of the Perl 6 specification test suite that showed up recently – but only on Windows – and fixed that too.

Before we begin: a generational GC primer

A few posts ago I talked a bit about how MoarVM’s garbage collector works – and hand-waved on just about all of the interesting details. Some folks on IRC expressed a bit of curiosity about those details, so I figured I’d explain some of them as they come up in my performance/reliability work. In this post I’ll talk about generational collection, because it plays a part in the EVAL leak story.

Recall that the essence of a tracing garbage collector is that we start with a set of roots: global symbols, the current call frame that each thread is executing, thread-local symbols, and so forth. We “mark” each object we find in these roots as “alive”, and stick them onto a todo list. We work our way through this todo list, asking each object what it references and putting those onto the todo list also. With a little care to never revisit objects we already considered, we terminate having marked all reachable objects alive. The memory associated with unmarked objects can then be freed.

You might worry that if we have millions of objects in memory, this could be a rather time-consuming process to do again and again. You’d be right. Worse, memory access performance depends heavily on the CPU caches getting high hit rates. When we need to walk huge numbers of objects, we end up getting loads of CPU cache misses, and have to spend time fetching objects from main memory. (To give you an idea of the numbers: a level 1 cache hit means memory access in several CPU cycles, while having to go to main memory can easily cost a couple of hundred cycles or worse).

So, how might we do better? The key insight behind generational GC, often known as the generational hypothesis, is that in most programs objects are either very short-lived (surviving zero or one garbage collections) or long-lived (perhaps staying around for the lifetime of the entire program). Therefore, it’s reasonable to assume that once an object survives a couple of collections, it will stay around for a good few more.

Generational collection works by dividing the heap – that is, the space where objects are allocated – into generations (typically two or three). In MoarVM we have two generations, which we tend to refer to in the code as “nursery” and “gen2”. The nursery, as the name suggests, is where objects begin their life. We allocate them cheaply there using a “bump-the-pointer” scheme. The nursery in MoarVM is a fixed-size chunk of memory, and after a while we fill it up. This is what triggers garbage collection.

In a 2-generation collector, there are two types of collection, which we know as nursery collection and full collection. In a full collection, we do exactly what I described earlier: visit all objects, marking them alive, and freeing up those that aren’t. A nursery collection is similar, except as soon as we see an object is not in the nursery, we don’t put it on the todo list. Instead, we simply ignore it. This greatly cuts down on the number of objects we need to consider, making nursery collections hugely cheaper.

There are two things we must do to make this really work. The first is ensure that we only free memory associated with objects living in the nursery, not the old generation, since we didn’t do the analysis needed to free anything there. That’s fine; most objects “die young” anyway. The second is more subtle. There may be objects in the nursery that are only alive because something in the old generation references them. However, since we’re not considering any old generation objects, we won’t discover this liveness and so wrongly free things. This is resolved by maintaining a set of objects that are in the old generation but pointing to objects in the nursery. Whenever we assign a reference from one object to another, we check if this would establish an old generation to nursery reference, and stick the old generation object into the set, ensuring we will visit it and mark the nursery object. This check is known as a “write barrier”.

So, back to the EVAL story…

The EVAL leak

I reviewed, fixed up, and merged various patches from Timo to improve the heap snapshot data dumps by annotating them with more data. Then, I looked at a few paths to leaked objects (recall that I was using EVAL 'my class ABC { }' to demonstrate the leak). The paths looked something like this:

> path 38199
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
BOOTCompUnit (Object)
    --[ Index 81 ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Unknown ]-->
ABC (STable)

This shows the objects along the path, but those “unknowns” were hiding what I really wanted to know. So, I did some further patches, and got out a rather more useful result:

> path 6466
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh guard match ]-->
ABC (STable)

So, here we see that it’s a type specializer guard that is keeping the object alive. “Wait…a what?!” MoarVM does a bunch of dynamic optimization, watching out for types that occur at runtime and generating specialized versions of the code by type. And, sometimes, we have an unfortunate situation where code is discovered “hot”, but the type it was invoked with is fairly transient. In this case, the specialization matching table will end up referring to that type, keeping it alive.

However, since for any given bit of code we only generate a handful of these specializations, eventually we’d saturate them and stop leaking memory. I looked at another path:

> path 10594
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Static Frame ]-->
BOOTStaticFrame (Object)
    --[ Compilation Unit ]-->
BOOTCompUnit (Object)
    --[ Code refs array entry ]-->
BOOTCode (Object)
    --[ Unknown ]-->
BOOTStaticFrame (Object)
    --[ Spesh log slots ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This is a case where the optimizer is tracing what objects show up. In fact, most of the paths looked this way. However, that should saturate at some point, yet I know that it goes on leaking. Finally, I found another path to a leak:

> path 59877
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

However, 9 out of the 10 leaked objects were leaked because of the dynamic optimizer keeping things alive that it had seen while tracing. But that, while awkward, should eventually saturate – as should the integer cache issue. But the memory use grew forever, suggesting that things go on and on leaking. So, I tried a snapshot after disabling dynamic optimization. And:

> find stables type="ABC"
Object Id  Description
=========  ===========
368871     ABC

Just the one! 9 out of 10 objects were not on the heap. And yes, it was the integer box cache problem that kept the 1 alive:

> path 368871
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

So in theory, with dynamic optimization disabled, this suggested that we did not leak any more, and all the blame was on the optimizer. To check that out, I tried a long-running EVAL loop and…it still leaked heavily. My theory that dynamic optimization couldn’t account for all of the leaking, just the first bit of it, seemed to hold up.

To investigate it further, I did a loop of 100 EVALs, as opposed to the 10 I had used so far. This took a snapshot every GC run, plus one more that I forced at the end. So, how did the final snapshot look?

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 1`.
Type `help` for available commands, or `exit` to exit.

> snapshot 8
Loading that snapshot. Carry on...
> summary
Wait a moment, while I finish loading the snapshot...

    Total heap size:              33,417,225 bytes

    Total objects:                369,541
    Total type objects:           1,960
    Total STables (type tables):  1,961
    Total frames:                 2,077
    Total references:             1,302,511

> find stables type="ABC"
Object Id  Description
=========  ===========
368872     ABC

Only the one. So, according to this, we’re not leaking. But that’s when the loop is over. What about a mid-loop GC? I switched to a snapshot in the middle, and:

> find stables type="ABC"
Object Id  Description
=========  ===========
130353     ABC
376552     ABC

> path 130353
Root
    --[ Permanent Roots ]-->
Permanent roots
    --[ Boxed integer cache entry ]-->
Int (Object)
    --[ <SC> ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

> path 376552
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Lexotic cache entry ]-->
Lexotic (Object)
    --[ Result ]-->
NQPMatch (Object)
    --[ Unknown ]-->
QAST::CompUnit (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

OK, that’s reasonable too: it’s alive because it’s referred to by the compiler, which is run as part of EVALing the code. So what are we leaking? I tried this:

> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,114,800 bytes
BOOTStaticFrame                        4,806,668 bytes
BOOTInt                                4,642,720 bytes
VMString                               2,859,188 bytes
BOOTHash                               2,253,016 bytes
SCRef                                  1,891,768 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,448,208 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,936 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,400 bytes
Signature                              208,440 bytes

Then compared with an earlier snapshot

> snapshot 2
Loading that snapshot. Carry on...
> top objects by size
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,110,920 bytes
BOOTStaticFrame                        4,806,152 bytes
BOOTInt                                4,642,624 bytes
VMString                               2,858,472 bytes
BOOTHash                               2,241,320 bytes
SCRef                                  1,891,696 bytes
NFAType                                1,886,272 bytes
BOOTCode                               1,447,776 bytes
BOOTNum                                832,096 bytes
Parameter                              783,360 bytes
BOOTStr                                567,136 bytes
BOOTCompUnit                           513,149 bytes
Perl6::Metamodel::ContainerDescriptor  341,496 bytes
QAST::Op                               266,112 bytes
Signature                              208,296 bytes

Again, nothing very interesting to note. This highly suggested that either there was some missing information in the heap snapshot, or something else in the VM state – but that got cleaned up at exit – that was also getting leaked.

So I pondered a bit, and compared the GC marking code with the heap snapshot code. And…had an “aha!” moment. Remember I talked about the inter-generational root set that we keep thanks to generational collection? This was not being accounted for in heap snapshots. I fixed it, and the size of the resulting heap snapshot files was a dead giveaway that it made a huge difference:

04/06/2016  14:46       184,272,062 rak-heap-6
04/06/2016  15:21       262,846,653 rak-heap-7

And, back in the analyzer:

> snapshot 2
Loading that snapshot. Carry on...
> find stables type="ABC"
Wait a moment, while I finish loading the snapshot...

Object Id  Description
=========  ===========
21266      ABC
21312      ABC
22359      ABC
23317      ABC
24275      ABC
25233      ABC
26191      ABC
27149      ABC
28107      ABC
29065      ABC
30023      ABC
30981      ABC
361108     ABC
363007     ABC
364903     ABC

So, the objects were there, after all. I took a look at some of them:

> path 24275
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 24269 ]-->
ABC (STable)

> path 30981
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 30975 ]-->
ABC (STable)

> path 363007
Root
    --[ Inter-generational Roots ]-->
Inter-generational Roots
    --[ Index 21244 ]-->
BOOTCompUnit (Object)
    --[ Serialization context dependency ]-->
SCRef (Object)
    --[ STable root set ]-->
ABC (STable)

This would explain a prolonged lifetime, but not an unending leak. I found myself missing a count command so I could easily see how things varied between the snapshots. I implemented it, then observed this:

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
34
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
72
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
97
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"
100

The number of objects in the inter-generational set just kept on growing! So, either this workload was just not triggering a gen-2 collection, or there was a bug. How to find out? By doing a normal --profile on the same code, and looking at the output. The summary page stated:

The profiled code did 9 garbage collections. There were 0 full collections involving the entire heap.

OK, so we really never did a full collection. That explains this particular snapshot, but not the leak over time, which surely would end up triggering a full collection at some point. To test that theory, I tweaked a MoarVM header to make full collections happen more often, to see it helped:

diff --git a/src/gc/collect.h b/src/gc/collect.h
index b31a112..af6c456 100644
--- a/src/gc/collect.h
+++ b/src/gc/collect.h
@@ -6,7 +6,7 @@
 /* How many bytes should have been promoted into gen2 before we decide to
  * do a full GC run? The numbers below are used as a base amount plus an
  * extra amount per extra thread we have running. */
-#define MVM_GC_GEN2_THRESHOLD_BASE      (30 * 1024 * 1024)
+#define MVM_GC_GEN2_THRESHOLD_BASE      (1 * 1024 * 1024)
 #define MVM_GC_GEN2_THRESHOLD_THREAD    (2 * 1024 * 1024)

This change made it do a full collection for every 1MB promoted, not every 30MB. Hopefully that would be enough to trigger an some runs. And it did:

The profiled code did 9 garbage collections. There were 4 full collections involving the entire heap.

Much better. So, over in the heap profiler:

Considering the snapshot...looks reasonable!

This file contains 9 heap snapshots. To select one to look
at, type something like `snapshot 5`.
Type `help` for available commands, or `exit` to exit.

> snapshot 2
Loading that snapshot. Carry on...
> count stables type="ABC"
34
> snapshot 5
Loading that snapshot. Carry on...
> count stables type="ABC"
72
> snapshot 7
Loading that snapshot. Carry on...
> count stables type="ABC"
97
> snapshot 8
Loading that snapshot. Carry on...
> count stables type="ABC"
100

Well, ouch. That implies that the inter-generational root set keeps on growing and growing, for some reason. The profiler, which reports this number, agrees with this assessment (note the numbers of gen2 roots on the right):

profile-screenie

So, how could this possibly be happening?

I like to rule out the really silly things first. Like, if the thing that cleans up the inter-generational roots list after a full collection is being called. How? With a printf, of course! :P

diff --git a/src/gc/roots.c b/src/gc/roots.c
index b771106..361b472 100644
--- a/src/gc/roots.c
+++ b/src/gc/roots.c
@@ -301,7 +301,7 @@ void MVM_gc_root_gen2_cleanup(MVMThreadContext *tc) {
     MVMuint32        num_roots    = tc->num_gen2roots;
     MVMuint32        i = 0;
     MVMuint32        cur_survivor;
-
+printf("cleaning up gen2 roots\n");
     /* Find the first collected object. */
     while (i < num_roots && gen2roots[i]->flags & MVM_CF_GEN2_LIVE)
         i++;

And…no output. I couldn’t quite belive it. So, I went to the place that this function is called, and noticed there was some logging I could switch on that describes all the ins and outs of a organizing a GC run. That produced plenty of output, and showed it was indeed not reaching the place where it would call the gen2 roots cleanup either.

After some hunting, I discovered that an addition a while ago that tracked the amount of data promoted to the old generation, and used it to decide whether to do a full collection, had resulted in a nasty accident. It did the calculation to check if a full collection was needed in two different places, and the answer could change between them. This led to us not going through all the steps that a full collection would need.

I patched it and…the leak was gone. Even with dynamic optimization re-enabled and full collections back to being run after every promoted 30MB, the ever-increasing memory use of EVAL in a loop was no more. It climbed for a short while, then flattened out.

This fix will likely help numerous longer-running programs that have medium-lifetime objects. It also shaved around 45MB off CORE.setting compilation memory use. Unfortunately, it also caused a 5% slowdown in the Rakudo build-time, presumably because we were now actually doing all the work we should be on full collections!

A bit of tuning

With the leak out of the way – or at least, the big one – I wanted my 5% back. So I took a look at our logic for how we decide whether or not to do a full collection. The strategy so far had been to expect a fixed amount of memory to have been promoted to the old generation (with some additions per thread). However, this was lacking in a couple of ways.

For one, it only accounted for the direct size of objects, not any extra unmanaged memory they held. So, a compilation unit would not factor in the size of the compiled bytecode it held on to, and a dynamic array would not factor in the size of its storage. This was now an easy enough fix thanks to additions made while implementing heap snapshots.

For two, a fixed limit doesn’t behave too well with programs that really do build up a large heap over time. There, we can afford to promote a good bit more beforehand, and percentage wise it won’t make a lot of difference since they’re growing anyway. Building Rakudo’s CORE.setting is like this, as it builds up the program tree. So, I switched to a percentage-based scheme with a minimum threshold, which could afford to be a bit lower than the 30MB from before. These changes not only got Rakudo’s CORE.setting build time back down again (without having to give up much of the memory savings from before), but also had the EVAL loop example having a lower memory ceiling.

So, not only did I hunt down and fix the memory leak, I ended up tuning things to achieve a lower memory ceiling for applications whose memory footprint is fairly fixed over their life them, and less full GC runs for those growing over time.

Hunting down a parallel spectest hang

On Windows, a parallel spectest of high enough degree could cause hangs. I use TEST_JOBS=12, a number I picked after some measuring in the past. In the last couple of weeks, I started seeing hangs – and they went away if I cut down TEST_JOBS to just running 2 or 3 in parallel.

Eventually, helped by make localtest (where you can specify a file listing the tests to run), I managed to get it down to just 2 tests that, when run together, would reliably hang. It turned out one of them spawned another Rakudo process as part of the test, so there were 3 processes involved. Attaching the debugger to each in turn, I saw one was hung on getting a file lock, one was hung waiting for process termination (and was waiting on the process that was blocked on a file lock), and the other was blocked trying to write the STDOUT.

I don’t know for sure, but so far as I can tell, the way the test harness does parallel testing on Windows involves running batches of tests and reading a bit of TAP from them one at a time. The process blocked writing to STDOUT was also the one holding the file lock, but the test harness was, I presume, waiting to read output from the process blocked waiting for the process that was in turn waiting for the file lock. So, a nice circular wait involving 4 processes, one of them being the test harness! Typical darn Friday bug hunting… :-)

This also explained nicely why the issue didn’t crop up away from Windows: parallel spectest works differently (read: better) on other platforms. :-) While we will at some point switch to using a Perl 6 test harness for running the spectests that hopefully behaves consistently everywhre, I figured that Rakudo was probably doing something wrong with regard to the file lock.

File locks are used in only one place in Rakudo: managing pre-compilations. A little instrumentation of the lock/unlock code later, I saw a mis-match. A dump of the stack trace at each place we did a lock/unlock eventually led me to the problem, which I was able to fix. This bug likely didn’t just affect spectest on Windows; I suspect I could construct various hangs on other platforms too from it. So, a good fix to have in there.

As a side-note, the reason this bug was both likely to happen and hard to identify was because the lock and unlock were not placed on the same code-path. This is a good idea for any kind of locking. Locks are horrible to work with anyway; putting lock/unlock far apart in the code is just asking for problems (and this is far from the first problem I’ve hunted down in code with such a structure). So, I’ve asked nine++, as part of his great work to keep improving our installation and precomp handling, to look into addressing this, so we’re structurally in a better place to not have such bugs.

And, for all of you out there using Rakudo’s Lock class: first, don’t, and second, if you must use it, always prefer the $lock.protect({ ... }) form over manual lock/unlock method calls.

Two icky problems less…

So, not a bad last week’s Perl 6 work – though it took me all of this week to get around to writing it up. Maybe I’ll be a little faster writing up this week’s happenings. :-)

Posted in Uncategorized | 4 Comments

Small, but welcome, fixes

Last week wasn’t one of my most productive on Perl 6, thanks to a mix of Easter holiday, a little more work than expected on another project, and feeling a tad under the weather for a day or so. In the time I did find, though, I managed to pick off some worthwhile bits and pieces that needed doing.

Heap snapshot issues triage

It’s generally very clean and pleasant that our meta-objects are, so far as the VM is concerned, just objects. This means the answer to “can types be GC’d” is easy: sure they can, so long as nothing is using them any more. The downside is that the VM has very little insight into this huge pile of objects, which is why we have various “protocols” where the MOP can provide some hints. A lot of performance wins come from this. When I was working on heap snapshots, I found it would also be rather useful if there was a way to let the VM know about a “debug name” for a type. A name that isn’t used for any kind of resolution, but that can be included in things like heap snapshots. It’ll no doubt also be useful for debugging. So, I added an op for that, nqp::setdebugtypename, and used it in a branch of NQP, meaning I got more useful heap snapshots. This week, I also added this op to the JVM backend, meaning that I could merge the branch that uses it. This means that you can now do a normal build of Rakudo/NQP/Moar and get useful typenames in the MoarVM heap snapshots.

Last time, I mentioned that the heap snapshot viewer took an age to start up because we were super slow at reading files with really long lines. This week, I fixed this performance bug with two patches. Now, the heap snapshot analyzer picks apart a 25MB snapshot file into its various big pieces (just looking at how each line starts) and reaches its prompt in under a second on my box. That’s a rather large improvement from before, when it took well over a minute.

I also looked at a C-level profile of where MoarVM spends time doing the detailed parsing of a heap snapshot, which happens in the background while the user types their first command, and then blocks execution of that command if it’s not completed. The processing uses a few threads. The profile showed up a couple of areas in need of improvement: we have a lot of contention on the fixed size allocator, and the GC’s world-stopping logic seems to have room for improvement too. So, those will be on my todo list for the future.

Fixing a couple of crashes

SIGSEGV is one of the least inspiring ways to crash and burn, so I’m fairly keen to hunt down cases where that happens and fix them. This week I fixed two. The first was a bug in the UTF8 Clean 8-bit encoding, which turned out to be one of those “duh, what was I thinking” off-by-ones. The second was a little more fun to track down, but turned out to be memory corruption thanks to missing GC rooting.

And…that’s it!

But, this week I’m set to have a good bit more time, so hope to have some more interesting things to talk about in the next report. :-)

Posted in Uncategorized | 1 Comment

Happy heapster!

In last week’s report, I was working away at a heap snapshot mechanism for MoarVM. If you don’t know what that is, I suggest taking a look at last week’s post for an explanation, and then this one will probably make a whole lot more sense. :-)

This week, I did the rest of the work needed to get heap snapshots being dumped. I’m not going to talk any more about that, since I discussed the ideas behind it last time, and I encountered nothing unexpected along the way.

Instead, I want to discuss my initial work on building a tool to analyze the heap snapshots and get useful data out of them. The snapshots themeselves are mostly just megabytes and megabytes worth of integers separated by commas and semicolons. Big data it ain’t, but it’s big enough that processing it needs at least a bit of thought if we’re going to do it fast enough and without using huge amounts of memory.

I decided early on that I wanted to write this analysis tool in Perl 6. Overall, I’m working on performance, memory use, and reliability. The results from analyzing heap snapshots will help cut down on memory use directly, and identify leaks. Building the analyzer in Perl 6 means I’ll be able to identify and address the performance issues that I run into along the way, helping performance. Further, I could see various opportunities to do parallel processing, giving concurrency things a bit of a workout and maybe shaking some issues out there.

The design

I decided to build a command line application, which would execute queries on the heap snapshot and display the results. Thinking a bit about that, I realized it would break nicely into two pieces: a model that held the snapshot data and provided various methods to interrogate it, and a shell that would parse the queries and nicely display the results. All the stuff that needed performance engineering would lie within the model. In the shell, I didn’t have to worry about that.

I also realized that there was a nice perceived speed trick I could pull. While waiting for the user to type a query, there’s no reason not to get to work on parsing the snapshot and building up a data model of it. In the best case, we’re done by the time they’ve typed their query and hit enter. Even if not, we’ve shaved some time off their wait.

Looking at heap snapshots

A heap snapshot file is really a heap snapshot collection file, in that it can hold multiple snapshots. It starts with some common data that is shared among all the snapshots:

strings: ["Permanent Roots","VM Instance Roots", ...]
types: 18,19;21,22;25,26;27,28;30,31;33,33;30,34;...
static_frames: 10,11,1,12;63,64,13,65;10,67,1,65;...

The first, strings, is just a JSON array of all the strings we refer to anywhere else in the snapshot. The types data set can be seen as representing a table with two columns, both of them being indexes into the strings array. The first column is the name of the representation, and the second is type names (for example, a Perl 6 Rat would have the P6opaque representation and the Rat type). This is so we can understand what the objects in the snapshot are. The static_framesdata set does a similar thing for frames on the heap, so we can see the name, line number, and file of the sub, method, or block that was closed over.

These are followed by the snapshots, which look like this:

snapshot 0
collectables: 9,0,0,0,0,4;5,0,0,0,125030,165;...
references: 2,0,1;2,1,2;2,2,3;2,3,4;2,4,5;2,5,6;...

These describe a graph, rooted at the very first collectable. The collectables are the nodes, and the references the edges between them. The integers for a collectable describe what kind of node it is (object, frame, STable, etc.), its size, how many edges originate from it and where in the references table they are located (we take care to emit those in a contiguous way, meaning we can save on a from field in the references table). The references have a to field – which is actually the third integer – along with some data to describe the reference when we know something about it (for example, array indexes and attribute names in objects can be used to label the edges).

To give you an idea of the number of nodes and edges we might be considering, the snapshot I took to start understanding the EVAL memory leak has 501,684 nodes and 1,638,375 edges – and that is on a fairly well golfed down program. We can expect heap snapshots of real, interesting applications to be larger than this.

We could in theory represent every node and edge with an object. It’d “only” be 3 million objects and, if you strip away all the Perl 6 machinery we’re still pretty bad at optimizing, MoarVM itself can chug through making those allocations and sticking them into a pre-sized array in about 0.6s on my box. Unfortunately, though, the memory locality we’ll get when looking through those objects will be pretty awful. The data we want will be spread out over the heap, with lots of object headers we don’t care about polluting the CPU caches. Even with well-optimized object creation, we’d still have a bad design.

Native arrays to the rescue

Thankfully, Perl 6 has compact native arrays. These are perfect for storing large numbers of integers in a space-efficient way, and tightly packed together to be cache-friendly. So, I decided that my objects would instead be about the data sets that we might query: types, static frames, and snapshots. Then, the code would be mostly focused around ploughing through integer arrays.

The types data set makes a fairly easy example. It has two int arrays of indexes into the strings array, meaning the data set will be stored in two contiguous blobs of memory. Also note the careful use of binding in the BUILD submethod. Perl 6’s array assignment semantics are copying in nature, which is great for avoiding action at a distance (a good default) but not quite what we want when passing around large volumes of data. The type-name and repr-name method just resolve indexes to strings. More interesting are the all-with-type and all-with-repr methods, which simply resolve the search string – if it exists – to an index, and then go through the native int array of type name string indexes to find what we’re after.

Parallel, and background, processing

The BUILD submethod for the overall model starts out by picking the file apart, which is some simple line-based parsing. Given the heap snapshot format has very few lines (though they are very long), there’s nothing too performance sensitive in this code.

Then, at the end, it sets off 3 bits of parallel work to decode the JSON strings, and to parse the type and static frame tables. It’s worth noting that the Types dataset does also need to be constructed with the JSON strings; a simple await is used to declare that dependency. With regard to the snapshots, there’s no point processing all of them. Often, the user will only care about the final one, or a handful of them. At the same time, some background processing is still desriable. Thus, the model has arrays of unparsed snapshots and parsed snapshot promises. When the user indicates interest in a snapshot, we will start to prepare it. At the point it’s actually needed, we then await the Promise for that snapshot.

Over in the shell, we special-case having 1 snapshot in the collection by just selecting it right away, getting on with the processing of it immediately while the user types their query. We also take a little care to let the user know when they performed an operation, but we’re not done with parsing the snapshot yet.

Early results

The first things I implemented were getting a summary of the heap snapshot, along with queries to see the top frames and objects, both by space and count. Here’s how a session with it on a Rakudo heap snapshot starts out:

$ perl6-m -Ilib bin\moar-ha rak-heap
Considering the snapshot...looks reasonable!

This file contains 1 heap snapshot. I've selected it for you.
Type `help` for available commands, or `exit` to exit.

I ask for a summary:

> summary
Wait a moment, while I finish loading the snapshot...

    Total heap size:              28,727,776 bytes

    Total objects:                372,884
    Total type objects:           2,248
    Total STables (type tables):  2,249
    Total frames:                 2,523

And then can ask about the kinds of objects taking up the most space:

> top objects
Name                                   Total Bytes
=====================================  ===============
NQPArray                               7,184,816 bytes
BOOTInt                                4,639,680 bytes
BOOTStaticFrame                        4,195,168 bytes
VMString                               2,804,344 bytes
BOOTCode                               1,486,872 bytes
<anon>                                 1,331,744 bytes
Parameter                              847,552 bytes
BOOTNum                                827,904 bytes
BOOTStr                                546,368 bytes
Perl6::Metamodel::ContainerDescriptor  367,848 bytes
BOOTArray                              344,056 bytes
QAST::Op                               262,656 bytes
<anon>                                 255,032 bytes
Method                                 238,792 bytes
Signature                              228,096 bytes

And by number of objects:

> top objects by count
Name                                   Count
=====================================  =======
BOOTInt                                144,990
NQPArray                               74,270
VMString                               34,440
BOOTNum                                25,872
BOOTCode                               20,651
BOOTStr                                17,074
BOOTStaticFrame                        16,916
Parameter                              6,232
Perl6::Metamodel::ContainerDescriptor  5,109
BOOTHash                               4,775
Signature                              3,168
<anon>                                 2,848
QAST::Op                               2,736
BOOTIntArray                           1,725
Method                                 1,571

Figuring out where those huge numbers of boxed integers come from, along with the arrays we might guess they’re being stored in, will no doubt be the subject of a future week’s post here. In fact, I can already see there’s going to be a lot of really valuable data to mine.

But…that EVAL leak

This kind of analysis doesn’t help find leaks, however. That needs something else. I figured that if I did something like this:

C:\consulting\rakudo>perl6-m --profile=heap -e "for ^20 { EVAL 'my class ABC { }' }"
Recording heap snapshot
Recording completed
Writing heap snapshot to heap-snapshot-1458770262.66355

I could then search for the type tables for the class ABC (which we could expect to be GC’d). That would confirm that they are staying around. So, I implemented a find query, which gave me output like this:

> find stables type="ABC"
Object Id  Description
=========  ===========
42840      ABC
42845      ABC
45288      ABC
64994      ABC
71335      ABC
76824      ABC
78599      ABC
82535      ABC
86105      ABC
146765     ABC
404166     ABC

Those integers on the left are unique IDs for each object on the heap. So, given one of those, I then needed to calculate a path through the heap from the root to this object, to understand why it could not be collected and freed. This would be most useful if I could be told the shortest path. Thankfully, there exists a well-known algorithm that runs in O(Nodes + Edges) time that can annotate all graph nodes with data that lets you then determinte the shortest path to the root in O(maximum path length). (That gives an upper bound of O(Edges) in general, however unlikely that is in the real world. To see why, consider a heap snapshot that consists entirely of a linked list.)

The algorithm, if you didn’t guess yet, is the breadth-first search. The nice thing is that we only have to compute it once, and can then answer shortest path queries really quickly. Here’s the patch that added it. And here’s what it found:

> path 42840
Root
    --[ Thread Roots ]-->
Thread Roots
    --[ Compiling serialization contexts ]-->
BOOTArray (Object)
    --[ Unknown ]-->
SCRef (Object)
    --[ Unknown ]-->
ABC (STable)

Bingo! We’re somehow, in an EVAL, leaving a reference to the serialization context in the “stuff we’re still compiling” array. (Serialization contexts being the things that hold objects we create at compile-time, but want to refer to at runtime, and so must serialize when pre-compiling.) This discovery quickly led to an NQP patch. Which…helped, but we still leak, for a different reason. The good news, however, is that the heap snapshot analyzer could very quickly show that the fix had been successful, and identify the next thing to investigate. You can only imagine how much more frustrating it would be to not have tooling that can do this! Sadly, I didn’t have time this week to dig into the next problem.

In one final bit of heap snapshot analyzer news, Timo forked it and added more features. I love it when I build something and others dig in to make it better. \o/

So, how was Perl 6’s performance?

I discovered 3 things that basically killed performance. The first was Str.Int being horrid slow, and since reading in a heap snapshot is basically all about convering strings to integers that was a real killer. For example:

$ timecmd perl6-m -e "my $a; for ^1000000 { $a = '100'.Int }"
command took 0:1:09.83 (69.83s total)

Awful! But, a 15-minute patch later, I had it down to:

$ timecmd perl6-m -e "my $a; for ^1000000 { $a = '100'.Int }"
command took 0:0:1.87 (1.87s total)

A factor of 37 faster. Not super-fast, so I’ll return to it at some point, but tolerable for now. I guess this’ll help out plenty of other folk’s scripts, so it was very nice to fix it up a bit.

The second discovery was that a thread blocked on synchronous I/O could end up blocking garbage collection from taking place by any thread, which completely thwarted my plans to do the snapshot parsing in the background. This wasn’t a design problem in MoarVM, thankfully. It’s long had a way for a thread to mark itself as blocked, so another can steal its GC work. I just had to call it on a couple of extra code paths.

There’s a third issue, which is that something in MoarVM’s line reading gets costly when you have really long (like, multi-megabyte) lines. Probably not a common thing to be doing, but still, I’ll make sure to fix it. Didn’t have time for it this week, though. Next time!

Posted in Uncategorized | 3 Comments

A whole heap of work

The majority of my Perl 6 grant time in the last week went on building a heap snapshot mechanism in MoarVM. But, before I dug into that, I worked on…

More memory savings!

While I expect we’ll find plenty of interesting memory savings – not to mention the source of the EVAL memory leak – through the heap snapshots I’m working on, there’s still a good bit of info to be had from C-level heap profiling tools and taking a careful look at various data structures. This week, I found and improved 3 such things.

Every block, routine, or even thunk that we compile gets an ID that is unique within its compilation unit (that is, a body of code that is compiled at once). This is used to be able to refer to it unambiguously. Historically – dating all the way back to the Parrot days – these needed to be a bit more unique than just within a compilation unit. This made them fairly long, and since something like the Perl 6 builtins have into the tens of thousands of them, they added up quickly. So, I shortened them a good bit, knocking 188KB of NQP’s base memory, 1424KB off Rakudo’s base memory, and making the compiled CORE.setting a nice 288KB smaller.

These routines, blocks, and thunks are all represented down in MoarVM as static frames – a data structure I’ve talked about a bit in previous posts here. Aware that we have a lot of them, I looked to see if there were any more relatively easy savings to be made. I discovered a couple.

First up, we were keeping around a small map of the bytecode that was used to indicate instruction boundaries. This was used for precisely one thing: outputting instruction offsets when showing low-level backtraces. We never actually show that data to any normal Perl 6 end user. It was useful to have around early on in MoarVM’s development, but I can’t remember the last time I needed that information to debug something. It’s also hugely misleading these days, as the VM is often running specialized bytecode with lots of virtualization and indirection stripped out. So, I just got rid of it. We only generated these for the bytecode of static frames that were actually run, but it also saved a pointer for all static frames. In total, it was another 316KB off NQP’s base memory, and 1088KB off Rakudo’s.

Finally, back when I added the profiler, static frames got a few new fields. They’re totally unused when we’re not applying any kind of instrumentation, however. So, I pulled them out into a single struct, reducing us to just one pointer per static frame for this state. That was another 144KB off NQP, and 1036KB off Rakudo.

These total to around 3.5MB off the base size of Rakudo (measured by running the empty loop program), and all give further savings once extra modules are loaded or more of CORE.setting is used.

Multi caches get a size boost

I analyzed some code from RabidGravy doing sine wave generation, and noticed it was doing a lot of slow-path multi-dispatch. It turns out we were overflowing the multi caches, which in turn would be harming our ability to inline. While this points at a design issue I want to look into a little further, giving the caches a size boost provided some instant relief, getting the code down to 60% of its previous runtime. This does mean we use a bit more memory, but only for multis that we actually call, which isn’t all that many just for a normal startup.

Heap snapshots

This week I got a lot of work in on heap snapshots. First of all, what is a heap snapshot? It’s a representation of the memory allocated by a program, and under the management of the VM. In MoarVM this consists of three things:

  • Objects (both type objects and instances)
  • Shared Tables (aka “STables”), which hold the common things for a given meta-object and representation pairing
  • Call frames

The first two are managed by the GC; in fact, most code in the GC doesn’t care which it’s dealing with. The third is managed for now by reference counts (the reasons why, and how this gives a mix of great properties and awful properties, is a discussion for another day!)

The point isn’t to capture all of the data on the heap, but rather to capture the size, identity, and reference structure of the data. Size is easy enough to understand – it’s just about how many bytes are used. Identity is about what type an object has (the class name, for example), what type an STable represents, and what block or routine in the code a particular call frame is an invocation of. Together with counting how many items we have, we can use size and idendity data to answer questions like, “what kinds of objects are using up loads of memory”. I’ll be intersted to use this to see what’s eating up memory in NQP and Rakudo, before you even start doing any interesting work.

Reference structure is more interesting. It’s about what things point to, or refer to, other things. MoarVM has a reachability based GC. When we garbage collect, we start from a set of roots, which consist of things like the call stacks of each running thread along with other per-thread and per-VM state. We then – conceptually, at least – “mark” all the objects that these roots reference and put them on a todo list. We then visit them, asking each of them what they reference, and putting those objects onto the todo list. By taking care never to put an already marked object onto the todo list, we can be confident we’ll eventually terminate with an empty list. At the end, all of the unmarked objects are declared unreachable, and we can free them up. (And just to drive it home: this is a good way to conceptually understand what a reachability GC does, but does not reflect the way modern VMs that need to efficiently manage a large number of objects actually do it. And, like nearly all abstractions, it hides details that can matter in reality. Again, a topic for another day. :-))

Anyway, the reference structure is all about answering why a given object is still in memory. What path from the roots is keeping it alive? This is the sort of data that will help me understand the leaky EVAL bug I mentioned a week or so back.

Taking a heap snapshot involves capturing all of the raw data. It’s rather important that the process of taking a heap snapshot does not impact the liveness of the things in the snapshot. The instrumenting profiler, which was primarily built for giving us insight into microbenchmarks and how the dynamic optimizer works with them, references real type objects and static frames in its data set, which is then post-processed. In doing so, it keeps them alive. But the heap snapshotter doing this would be a disaster! It would distort the set of live objects, which it exists to measure. So, rule 1 is that the heap snapshot data structure must never hold any references to GC-managed objects. In fact, that’s probably the only really important rule. Of course, it’s going to affect the amount of memory the VM uses, but that must not show up in any way in the results. Most of the remaining challenge in taking heap snapshots is about trying to capture the data in a sufficiently compact way that it’s just large rather than utterly huge.

Another interesting question is when to take heap snapshots. A fairly good time is just after a GC run. Why? Because we’re in a state of “stop the world” at that point anyway, so we really are taking a snapshot of an atomic point in time with regard to the execution of the program. So, that’s where it’s done. (Note that this doesn’t bias things, in so far as we’d never be wanting to include unreachable things into the snapshot anyway.)

Anyway, by this point I’ve got pretty far with the code that does the snapshotting. It happily finds all object and STable references, and I’m part of the way through frames, which will mean it can snapshot the memory associated with closures properly. So, not much more work to go on producing the snapshots themselves.

Once I’ve got complete snapshots being produced, the next task will be to build some tools to analyze them. There will also be some work to make the snapshots a bit more detailed (for example, including the names of object attributes, so it’s easier to see the path keeping an object alive). I’ll probably take on some basic analysis tooling first, however, because I think even without that added context the snapshots will still be hugely revealing. Stay tuned for next week’s report for more!

Posted in Uncategorized | 5 Comments

Can’t forget about memory issues

I pondered at the end of last week’s post that I might work on speeding up object construction this week. I didn’t, but I did get a bunch of other useful stuff done – and got started on one of the big tooling deliverables of my grant.

Lazy string decoding

Last week, I’d been working on lazily decoding strings, so we could save some memory. I nearly got it working, and noted that Timo dropped a patch that got rid of the SEGV. This week I reviewed that patch. It was correct in so far as it fixed the issue, but I noted that other code was vulnerable to running into the same problem. In summary, the JIT compiler has a dynasm macro for grabbing a string from the string heap, and a robust fix would ensure we always made sure the string was decoded before JITting an unconditional lookup to it. So, I fixed it that way. Code review for the win!

With the patches merged, Rakudo’s memory use for the infinite loop program went down by 1272 KB, which is a welcome win. You might imagine that we’d have saved something on startup time by not doing the decoding work also. The answer is yes, but it’s in noise if you simply use something like time. Thankfully, there is a good way to measure more accurately: use callgrind to count the number of CPU instructions retired running the empty program. That shows a reduction of about 2.7 million instructions – significant, though less than 1%, which is why it’s noise on the wallclock time. Still, such changes add up.

Other little memory savings

Glancing the massif heap profiler output after the previous fix, it struck me that another huge use of memory allocations at startup is static frames. Static frames represent the static things we know about call frames. Call frames are the things created whenever you enter a routine or block, and keep hold of your lexical variables. The actual lexical storage is created every call frame (to understand why this is needed, consider recursion). However, things like the name of a routine, its bytecode, the mapping of lexical names to lexical storage slots in the call frame, and the ranges of code covered by exception handlers is “static” data. Therefore, we store it once in the “static frame”, and each call frame just points to its static frame.

You never actually see a static frame from MoarVM’s user-space. They’re an implementation detail. What you do see are coderefs (or “code objects”), which point to a static frame and also have an outer pointer, which is what makes closures work.

We already do partially lazily deserialize static frames. In short, we always deserialize the part you need for operations that just want to talk about it, and only deserialize the rest of the data if you ever actually call that code. That was a really quite good saving, but it looks from the profile like there may be value in trying to push that laziness evey further. Again, for a typical Perl 6 program, only a tiny fraction of the code in CORE.setting will ever be mentioned or invoked.

So, that may be a future task, but it will be a little tricky compared to strings. What I also did notice, however, is that in a compilation unit, we kept two arrays: one of static frames, and one of code objects. We almost never indexed the static frames array, and none of the places we did would be on a hot path. Since it was always possible to reach the static frames through the code objects, I did the refactor to get rid of the static frames array. It’s only a 16KB saving for NQP, but with Rakudo’s sizable CORE.setting and the compiler, we save a more satisfying 134KB off the memory needed for the infinite loop program. We also save ourselves walking through those long arrays of code objects during full GC runs (you’ll note the removed GC calls in the patch), which is also good for the CPU cache – and of course saves some instructions.

Another slightly cheeky saving came from lazily allocating GC nursery semispaces. The MoarVM GC, like most modern GCs that need to manage sizable heaps, is generational. This means it manages objects that live a long time differently from those that live a short time. For short-lived objects, we use a technique known as semi-space copying. The idea is simple: you allocate two equal size regions of memory (4MB each, per thread, in MoarVM). You do super-cheap bump-the-pointer allocation in the currently active semi-space. Eventually, you fill it, which triggers garbage collection. All the living objects are then copied, either to the start of the other semi-space if we never saw them before, or into the old generation storage if they survived a previous GC and so are probably going to be fairly long-lived.

For simple programs, or short-lived threads, we never need to ask for the second semi-space. In the past we always did. Now we don’t. It actually took a lot less time to write/test that patch than it did to explain it now. :-) But also, it’s worth noting that it’s a bit of a con on…well…every modern OS, I’d guess. While a heap profiler like massif will happily show that we just saved 4MB off each small program or short-lived thread, in reality operating systems don’t really swallow up pages of physical RAM until programs touch them. So, this is mostly a perception win for those who see numbers that factor in “what we asked for”, but it’s an easy one, and may save a brk system call or so.

Assorted small fixes

My work on speeding up accessors turned out to cause a regression in a module that was enjoying intropsecting accessor methods. It turns out I didn’t mark the generated methods rw when I should have, even if the generated code was correct. An easy fix.

Next up, I looked into a concurrency bug, where channels coerced to supplies and supplies coerced to channels would behave in the wrong way with exceptions. I guess it’s a pretty symmetry in some sense that we ended up with both directions broken, but both directions working is even prettier, so I fixed both cases.

It turned out that writing a submethod Bool didn’t work out right in all cases, while a method Bool always did. Boolification is fairly hot-path, so is fairly optimized (on the other hand, with our increasingly smart dynamic optimizer, we might be able to rid ourselves of the special cases in the future). For now, it was another easy fix (at least, if you know the MOP code well enough!)

Finally I took on a nasty SEGV found by Larry, which turned out to golf to a crash when concatenating strings that had Unicode characters in a fairly particular range at the end of the first string (there we likely a few other paths to the same bug). It turned out to be a silly braino in the generation of the primary composite table used for NFC transformation.

The EVAL memory monster

An RT ticket reported that EVAL of a regex appears to leak (it in fact asked about the <$x> regex construct, but that golfs to an EVAL):

for ^500 { EVAL 'regex { abcdef }' }

My first tool against leaks is Valgrind, which shows what memory is allocated but not freed. So, I tried it and…no dice. We leak very little, and it all seems to be missing cleanup at exit. The size of the leak was constant, not proportional to the number of EVALs. Therefore, I wasn’t looking at a real leak in the sense that MoarVM loses track of memory. All the memory the EVALs tie up is released at shut-down – it’s just not released as early as it should be.

My next theory was that perhaps we seem to leak because we end up with lots of objects promoted to the old generation of the GC, which we collect irregularly. That quickly became unlikely, though. The symptom of this is typically that memory use has a sawtooth shape, with lots getting allocated then a sudden big release on full collections. However, that didn’t seem to happen; memory use just grew. To completely rule it out, I tweaked the settings to make us do full collections really often, and it didn’t help.

While doing this analysis, I was looking at profiles some. The primary reason was because the profiler records all the GC runs that happen, so I could easily see that full heap collects really were happening. However, looking at the profile pointed out a few easy optimizations to compilation.

The first was noting that the QAST::Block.symbol method – used by the compiler to keep various interesting bits of data about symbols – was doing a huge number of allocations of hashes and hash iterators. I noted in many cases we don’t need to.

This was a big win – in fact, far too big. The profiler seemed to report too few allocations now. I dug in and noticed that it indeed was missing various allocations that could happen during parameter binding, both with slurpies and auto-boxing. Fixing that in turn showed up that the very same symbol method accounted for a huge amount of allocation as a result of boxing the $name parameter. So, I did a few fixes so that we could turn that to str.

The profiling data also showed up a few other easy wins in the AST to MoarVM compiler. One was opting NQP in to a multiple dispatch optimization in MoarVM that Rakudo was already using. The other was a few more native type annotationsto cut out some more allocations.

These together meant that my leaking example from earlier now ran in 75% of the time it used to. These improvements will help compilation speed generally too. The bad, if unsurprising, news is they did nothing at all to help with the leak.

A little further analysis showed that we never release the compilation unit and serialization context objects. That only leads to another question: what is referencing them, and so keeping them alive and preventing their garbage collection? That’s not an easy question to answer without better tooling. Happily, in my grant I’d already proposed to write a heap profiler, which can record snapshots of the GC-managed heap.

So, I’ve started work on building that. It’s not trivial, but on the other hand at least a good bit less difficult than writing a garbage collector. :-) Since I seem to have found plenty to talk about in this post already, I’ll save discussing the heap profiler until next week’s post – when it might actually be at the point of spitting out some interesting data. See you then!

Posted in Uncategorized | 3 Comments

Digging into the performance/reliability work

This week, I finally started digging into Rakudo and MoarVM things again, as part of my performance and reliability grant. Here’s what I got up to.

Faster accessors and an inlining fix

Making object accessors and object construction faster was one of the bullet points on the grant, and I decided to start out with accessors – largely because I’d already figured out pretty much exactly what I wanted to do with them. Here’s a small microbenchmark:

class A {
    has $.x;
}
my $a = A.new(x => 42);
my $y;
loop (my int $i = 0; $i < 10000000; $i++) {
    $y = $a.x;
}
say $y;

The hot loop is just calling an accessor method ten million times and assigning the result. Before I started with any improvements, this code took 3.34s to run, including startup/compilation time. That was around 0.15s (measured by the time to do one iteration of the loop), so let’s call it 3.2s for the iterations themselves. That makes for about 0.32 microseconds per iteration. Being under a millionth of a second might not seem something to complain about, but it’s actually a little shy of a thousand CPU clock cycles on a 3 GHz CPU. Not cheap, for something you would expect to be very cheap indeed!

So, where is the cost? Here’s a bit of the profile output:

prof-before

That <anon> is the accessor, which is called ten million times. It’s been JIT-compiled (the green), which is good. The column on the right, however, tells a less happy story: the accessor method was not inlined. Inlining is an optimization that involves taking the body of a called method and splicing it into the callee, such that you completely eliminate the dispatch and calling overhead. One other point on my grant is to look into decreasing call overhead, which really wants to be lower, but being able to inline and get rid of it altogether is still preferable. And accessor methods are tiny, so should be great inlining candidates. So what gives?

Well, at the heart of it was the way that we’d implemented accessor methods. Down in the metamodel code, we did something like this (simplified):

$package.^add_method($accessor-name, method () {
    nqp::getattr(self, $package, $attr-name)
});

The method being added there is a closure. It closes over the package the attribute is from and the name of the attribute, which together are used in the lookup. MoarVM’s inliner can not inline things that access lexicals from the surrounding scope, because then you actually need a callframe with a correctly bound outer frame to resolve them. There’s a further cost, though. If we write an accessor method by hand:

method x() { $!x }

Then we look at the generated code, we see these two instructions:

 wval r4(3), liti16(0), liti16(12)
 getattr_o r5(1), r0(1), r4(3), lits($!x), liti16(0)

The wval is just grabbing a constant symbol, and the name of the attribute is a literal string. Since all those things are known, the MoarVM code specializer is able to re-write those two into:

 sp_p6ogetvc_o r5(1), r0(1), liti16(8), sslot(3)

This is known as a “spesh op”, which only the optimizer is allowed to emit. The 8 there is an offset from the start of the object’s memory in bytes, and this op is interpreted by taking the object in register 0, adding 8 bytes to its address, and deferencing – without having to do much in the way of “safety checks”. This JIT compiles into machine code rather cheaply also.

But when we have a late-bound package and attribute name, this optimization cannot happen. So not only are we not getting inlining to take place, we’re also losing out by having the code for accessors that we generate not being so good as the code you’d get if you wrote them all by hand.

So, what to do? Well, since most of the time the MOP is invoked by the compiler, and the compiler’s job is code generation, the neat solution is to have it provide a compiler services object to the MOP. It can use this to generate the accessor methods. When there’s no compiler services object supplied, it falls back to the existing way (for example, this would happen when you’re doing dynamic generation of classes). So, that’s what I implemented.

The result? 2.43s, or about 2.3s with startup time removed. So, it now was running in 72% of the time it did before. That’s an improvement, but not what I was looking for. Disappointed, I looked at the new profiling data and found that we still weren’t actually inlining the accessor methods. Looking at their code, they were free of outer lexical references and otherwise really simple. So what was going on?

It turned out to be an inlining bug that was also leading to many more missed opportunities. I’d heard from other folks looking at profiles that they were surprised certain things didn’t get inlined, but in the push to get things to work before Christmas, rather than to get them fast, I didn’t look into it. Now I have chance to take care of such things, and so investigated. It turned out to be an unfortunate accident that occurred when teaching inlining about multiple dispatch, so it could inline multi subs and methods. It led to us, oddly, being pretty good at inlining multi-dispatch calls, but missing out on lots of (supposedly easier) single-dispatch calls. Once I figured out what was going on, the patch was short and simple. This should have much wider benefits than just for accessors.

But, back to accessors. With inlining of various single-dispatch things fixed up, the benchmark now ran in 0.63s, so let’s call it around 0.5s once startup time is removed. That’s 0.05 microseconds per iteration of the loop, or around 150 CPU cycles per loop. That’s still too high, but running that loop in 15% of the time we used to is a nice step forward.

Just to put that figure on the map a little, I wrote the following Perl 5 program, which I hope is reasonably equivalent:

package A;
use Moo;
has x => ( is => 'ro' ); 

package Main;
my $a = A->new(x => 42);
my $y;
for (my $i = 0; $i < 10000000; $i++) {
    $y = $a->x; 
}
say $y;

This ran in 5.12 seconds, with 0.08s startup/compilation time, so let’s call it 5 seconds for the iterations, or around 0.5 microseconds, or 1500 CPU cycles on our over-idealized 3GHz CPU. (For those wondering why I didn’t stick use integer into the Perl 5 benchmark to reflect the native int I used in the loop of the Perl 6 one: I tried, and on my box that made things slower.)

There was another small benefit: CORE.setting got a bit smaller. While we’re producing a bit more bytecode now, we’re serializing less closures, and it turns out that the bytecode approach weighs a little less. My guess going into this was that I’d about break even, so coming out with a 130KB lighter CORE.setting was very nice. That means a smaller resident memory size also.

A big leak

Away from performance work, I spent some time hunting down a reported memory leak. Tracing things back from the valgrind output I was sent, I recreated a one-liner to reproduce it. The leak needed certain combinations of features to be used; there were various paths to trigger it, though my reproduction of it involved a multiple dispatch sub with a where constraint, called with flattening arguments. If such situations happened a lot, we could leak rather heavily. It’s gone now.

The analysis of this issue was helped by my having spent some time in the run-up to this grant clearing up MoarVM’s –full-cleanup mode. Usually we don’t waste time neatly clearing everything up at exit, because the OS can do it way faster (yes, I measured). However, MoarVM invoked with the –full-cleanup flag will try to do such cleanup, freeing everything it allocated. This is useful because it makes actual leaks, not just incomplete cleanup, clear to see. I’m not all the way there with this work yet, however NQP and most of its test suite are already valgrind-clean. I’ll be continuing to pick off the missing bits of cleanup over the course of the grant, and at some point want to set up a spectest run with valgrind + –full-cleanup to give ourselves a good chance of hunting all the leaks down.

Work in process: lazy string heap decoding

A decent chunk of the CORE.setting compiled output, along with various other bits of the compiler, is string literals. Of course, not all of them are equally used (think about the hundreds of error messages you never normally encounter). To date, we’ve always taken all of the strings and decoded them at bytecode load time, creating an NFG MVMString data structure (and so doing the bit of analysis needed to see if they need any synthetic codepoints generating). Looking at our memory use, it became clear there might be a win from deserializing them on-demand. The patch doing so isn’t quite ready for prime-time yet (though a follow-up fix by Timo may have addressed the crash my initial patch produced in an NQP test), but it revealed that we can knock around 1.3MB off Rakudo’s base memory usage with such an approach. (Fun fact: with the patch, on my box NQP running while 1 { } uses 1.8MB less memory than the JVM running the equivalent Java program – except NQP actually has the NQP compiler in memory and compiled the script too!) Anyways, expect this improvement to land in the next couple of days.

What next?

There’s so much to do! One task will be applying the same approach I took for accessors to object construction, so see how far that helps. Our object construction speed is a known bottleneck, and a simple benchmark against Perl 5 + Moo shows Rakudo being a disappointing 7 times slower. So, some work will be needed for us to reach towards parity, then get ahead, there. And on the reliability track, I’ll be picking out some bugs, leaks, etc. to squish too.

Posted in Uncategorized | 3 Comments

A few words on Perl 6 versioning and compatibility

Recently, I put together a set of guidelines for Perl 6 versioning and backward compatibility. They have been refined with input from the Perl 6 community. Together, they amount to a read of over 4,000 words, and are fairly detailed. You can consider this post a TL;DR version of the guidelines, giving the big picture idea of what they mean.

There’s more than one thing to version

In common with many languages, Perl 6 has a clean separation of language version and compiler version. Just as C has language versions like C89, C99, and C11, Perl 6 has versions 6.c, 6.d, 6.e, and so forth. To compile C you might be using GCC, Clang, MSVC, or something else, and these have their own version schemes. To compile and run Perl 6 you’ll for now probably be using Rakudo, which has monthly releases with a boringly predictable naming scheme: 2015.11, 2015.12, 2016.01, etc.

Perl 6 language version = a test suite

A given Perl 6 language version is defined by a test suite. An implementation that passes that test suite can claim to provide that version of the Perl 6 language. Rakudo 2015.12 was the first release to provide Perl 6.c, because it passed the test suite. Rakudo 2016.01 also passes the test suite, and thus provides Perl 6.c too.

The master branch of the test suite represents the latest development of the Perl 6 language. Language releases are represented by tags (implying that 6.c, for example, is immutable).

Programs may declare which version they want

The first statement of a program (either a script or a module) may contain a use statement requesting a particular language version:

use v6.c;

Implementations are required to refuse to run code for versions of Perl 6 they do not provide.

No language version declaration implies latest released version

Pretty much self explanatory. Asking for nothing in particular gets you the latest and greatest Perl 6 version the implementation has to offer. Meaning that 20 years down the line, people wanting to show off the latest things won’t have to prefix all their example snippets with use v6.n.

Module installers/directories may be more demanding, and require modules to specify a language version. That’s a decision for the those making such tooling.

An implementation may support many language versions at once

Assuming that we release a Perl 6.d sometime during 2016 (remember that this means “release a test suite defining what 6.d is”), then Rakudo 2016.12 will provide both Perl 6.c and Perl 6.d. To do that, it will be required to pass both the 6.c and 6.d test suites.

Implementations will need to go to moderate effort to do this

New syntax (new operators, new phasers, etc.) introduced by Perl 6.d should not be available to a program declaring use v6.c. This is to ensure that the syntactic additions do not cause breakage. For example, adding a DONE phaser could break any code that had a DONE sub and called it with listop syntax. Rakudo will, for now, support this by guarding such syntax in the parser with a version assertion. (Another strategy is to ship multiple parsers, presumably loading “on demand” those that are actually needed.)

Lexical aspects of the CORE.setting (the builtins) get the same treatment. Recall that in Perl 6, the builtins are not imported, but rather are in the outer lexical scope of the program. If Perl 6.d adds new subs, different implementations of subs, new classes or roles, or new constants, they should go in a CORE.d.setting, or some equivalent. Implementations are free to choose exactly how they will structure things. In Rakudo, we’ll retain CORE.setting with the “base” set of things, and have a CORE.d.setting that overrides and adds or overrides things as needed for Perl 6.d. This means that new Perl 6 language versions can change the behavior of existing operators or subs in ways that are not backward compatible, without breaking code that declared in a use statement it wants a previous version of Perl 6.

Late-bound things (methods, dynamic variables) have different rules. See the full guidelines for the details, but the upshot is that Perl 6 designers have a bit less flexibility in how they evolve methods on built-in types compared to subs, syntax, etc. Backward incompatible changes (as judged by the test suite) are not allowed; new methods or new overloads that don’t conflict with existing behaviors are. The other key difference is that no effort is made to “hide” methods added as part of newer versions of Perl 6 from callers using an earlier version of the Perl 6 language.

TL;DR, supposing you have a module that declares use v6.c:

  • The onus is on Perl 6 implementations to not let your program run into 6.d syntax changes or new subs/constants in the builtins
  • The onus is on you to not use methods that did not exist in 6.c. However, tooling will probably come to exist to help out (for example some kind of Module::FlightCheck that uses rakudobrew to grab the 2016.01 release, which is known to support nothing later than 6.c, and run your module’s tests on it).

If you think the second part of this sucks, then read the guidelines to get an idea why the alternatives all suck harder. :-)

Long Term Support and security releases

Rakudo releases every month, and since each release will provide a range of Perl 6 language versions by passing their official test suites, then in a sense every one is an “official” release. To be very clear, there is no release we should talk about as being “the 6.c release of Rakudo” or “the 6.d release of Rakudo” (though some folks probably will anyway, no matter what I say, and they’ll most likely mean “the first Rakudo release that supported 6.c” or so).

What we will do is declare some releases as “Long Term Support” releases. This label will be applied to releases some time after they have been made, so we can support releases that we know behaved reasonably well in the wild – at least in their first month or so “out there”. For example, suppose that 2016.02 is fairly well received. We might declare it a LTS release, and we’ll declare with that the period of time we intend to “support” it for.

What does support mean? It means that, in the event of security patches or serious bug fixes, we’ll produce bug compatible releases of all current Long Term Support versions of Rakudo. For example, suppose 2016.02 and 2016.10 were marked as LTS releases for a period of 12 months. In December 2016, we find a serious security bug in Rakudo. We’d release a 2016.02.1 and 2016.10.1, which would branch from the 2016.02 and 2016.10 tags and have the required patch(es) cherry-picked in. This would allow upgrades to get the security fix with a very high degree of confidence that existing code will not break.

Minor language versions

One thing we’d prefer to avoid is people declaring dependencies in their Perl 6 code on particular compiler versions. There’s no way to prevent it, but we can try to reduce the temptation to do so. The typical use case would be wanting to depend on a particular bug fix. Fixes get coverage in the language test suite, and so will be part of the next language release – but since the major language versions will tend to have at least a year between them, that could be a bit too long to wait.

Therefore, we’ll also have minor language versions, named 6.c.1, 6.c.2, etc. These give something implementation-independent to depend on. Chances are these will be needed more in the short term than in the long term.

What will we market?

Major language versions, primarily. We’ll use minor language versions to focus on incremental improvement and refinement. The interesting “next big thing” will be the major language versions. Each major language version will get a name. We’ve picked celebrations as the naming theme; 6.c was “Christmas”, and 6.d will be “Diwali”. (That doesn’t mean we’ll be pushing ourselves to actually ship it anywhere near where Diwali falls in the calendar. We already did that with Christmas, taking care to release on Christmas day.) So, look out for a talk on “What’s coming in Perl 6 Diwali” at some conference later on in the year. :-)

Trying out the future

One question the above raises is how to try out the latest implementation work towards the next major version. For that, we’ll use versions such as 6.d.a (6.d “alpha”). So:

use v6.d.a;

Will get you access to the stuff we expect to be in 6.d. Note these lettered versions will really be giving you the current work in progress and come with absolutely no backward-compatibility or stability promises, and support for them will be dropped swiftly after the actual language release of 6.d.

Where will new spectests go? Do I need to tag them somehow?

In the master branch of the repository of all spectests, and they need no tagging up besides the usual fudging. Released language versions are handled as tags, which are immutable.

Are the spectests really enough to specify a language?

I think they’re the best tool we have for the job at the moment. We might want to look towards property based testing ala QuickCheck some more, but that’s still a test-based approach. Natural language doesn’t have the precision, but more critically lacks the verifiability (that is, you can’t run a Perl 6 compiler against a natural language specification). Formal methods, such as operational or denotational semantics, offer greater precision than tests, but the intersection of people who know how to apply those and who want to contribute to Perl 6 is probably tiny. Certainly they lack the accessibility of a test suite expressed in Perl 6, and so would take us away from the goal of Perl 6 being a community’s language.

All that said, it’s fairly clear we need anywhere between 2 and 10 times the current number of tests to have comfortable coverage of both the language and its wide array of built-ins. We’ll be looking into coverage analyses to help us understand where those tests are most lacking.

You didn’t answer my really important question about Perl 6 versioning!

Then leave a comment, and maybe I’ll do a follow-up post to answer it. :-)

Posted in Uncategorized | 6 Comments