Optimizing reading lines from a file

Reading lines from a file and processing them one at a time is a hugely common scripting task. However, to date our performance at this task has been somewhat underwhelming. Happily, a grateful Perl 6 fan stepped up in response to my recent call for funding, offering 25 hours of funding to work on whatever I felt was most pressing, but with a suggestion that perhaps I could look at some aspect of I/O performance. Having recently been working on refactoring I/O anyway, this was very timely. So, I crafted a benchmark and dug in.

The benchmark and a baseline

Perl 5 is considered to have very good I/O performance, so I figured I’d use that as a rough measure of how close Perl 6 was to performing well at this task. A completely equivalent benchmark isn’t quite possible, but I tried to pick something representative of what the average programmer would write. The task for the benchmark was to take a file with one million lines, each having 60 characters, loop over them, and add up the number of characters on each line. That number would then be printed out at the end (it’s important that benchmarks calculating results return or consume the result in some way, as a sufficiently smart optimizer may otherwise manage to eliminate the work we think we’re measuring). The rules were that:

  • The file should be read as UTF-8
  • The number of characters in the line should exclude the line ending

The Perl 5 benchmark for this came out as follows:

perl -e 'open my $fh, "<:encoding(UTF-8)", "longfile";
         my $chars = 0;
         while ($_ = <$fh>) { chomp; $chars = $chars + length($_) };
         close $fh;
         print "$chars\n"'

With the Perl 6 one looking like this:

perl6 -e 'my $fh = open "longfile";
          my $chars = 0;
          for $fh.lines { $chars = $chars + .chars };
          $fh.close;
          say $chars'

I’ll note right off that in Perl 6 there are likely ways, today, to do a bit better. For example, the $chars variable could be given a native int type, and it’s entirely possible that a while loop might come out faster than the for loop. Neither of those is representative of what a typical programmer looking at the documentation and diving in to implementing stuff would do, however. I suspect that Perl 5 experts could similarly point out some trick I’ve missed, but I’m trying to benchmark typical use.
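For the curious, a less typical variant along those lines might look something like this – just a sketch, and I haven’t carefully measured whether the while loop really wins:

perl6 -e 'my $fh = open "longfile";
          my int $chars = 0;
          while (my $line = $fh.get).defined { $chars = $chars + $line.chars };
          $fh.close;
          say $chars'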

One slight unfairness is that the Perl 6 solution will actually count the number of grapheme clusters, since strings are at grapheme level. This entails some extra processing work, even in the case that there are no multi-codepoint clusters in the input file (as there were not in this case). But again, the average user making comparisons won’t much care for such technicalities.
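As a tiny illustration (not part of the benchmark) of what grapheme-level counting means, an e followed by COMBINING ACUTE ACCENT is two codepoints but one grapheme:

say "e\x[0301]".chars;   # 1 (graphemes)
say "e\x[0301]".codes;   # 2 (codepoints)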

All measurements were made on modern hardware with an Intel Xeon 6-core CPU and a fast SSD, and on Linux.

At the point I started work, the Perl 6 solution clocked in at 2.87s, to just 1.13s for Perl 5. This made Perl 6 a factor of 2.5 times slower.

First hints from the profile

The whole I/O stack recently got a good overhaul, and this was the first time I’d looked at a profile since that work was completed. Looking at the output from --profile immediately showed up some rather disappointing numbers. Of all callframes, 57.13% were JIT-compiled. Worse, basically nothing was being inlined.

At this point, it’s worth recalling that Perl 6 is implemented in Perl 6, and that there’s quite a bit going on between the code in the benchmark and ending up in either things implemented in C or a system call. The call to lines returns an Iterator object. Reading a line means calling the pull-one method on that Iterator. That in turn calls the consume-line-chars method on a $!decoder object, and that method is what actually calls down to the VM-backed decoder to read a line (so there’s a level of indirection here to support user provided decoders). The return value of that method then has defined called on it to check we actually got a line back. If yes, then it can be returned. If not, then read-internal should be called in order to fetch data from the file handle (given buffering, this happens relatively rarely). Running the loop body is a further invocation, passing the read line as a parameter. Getting the chars count is also a method call (which, again, actually calls down to the VM guts to access the string’s grapheme count).

That’s quite a lot of method calling. While the VM provides I/O, decoding, and finding input up to a separator, the coordination of that whole process is implemented in Perl 6, and involves a bunch of method calls. Seen that way, it’s perhaps not surprising that Perl 6 would come in slower.

There are, however, things that we can do to make it fast anyway. One of them is JIT compilation, where instead of having to interpret the bytecode that Perl 6 is compiled into, we further translate it into machine code that runs on the CPU. That cuts out the interpreter overhead. Only doing that for 57% of the methods or blocks we’re in is a missed opportunity.

The other really important optimization is inlining. This is where small methods or subroutines are taken and copied into their callers by the optimizer. This isn’t something we can do by static analysis; the point of method calls is polymorphism. It is something a VM doing dynamic analysis and type specialization can do, however. And the savings can be rather significant, since it cuts out the work of creating and tearing down call frames, as well as opening the door to further optimization.

The horrors in the logs

There are a couple of useful logs that can be written by MoarVM in order to get an idea of how it is optimizing, or failing to optimize, code. The JIT log’s main point of interest for the purpose of optimization is that it can indicate why code is not being JIT-compiled – most commonly because it contains something the JIT doesn’t know about. The first thing it showed up in this case was the call into the VM-backed decoder to extract a line, which was, happily, easily handled. Oddly, however, we still didn’t seem to be running the JIT-compiled version of the code. Further investigation uncovered an unfortunate mistake. When a specialized version of a method calls a specialized version of another method, we don’t need to repeat the type checks guarding the second method. This was done correctly. However, the code path that was taken in this case failed to check if there was a JIT-compiled version of the target rather than just a specialized bytecode version, and always ran the latter. I fixed that, and went from 57.13% of frames JIT-compiled to 99.86%. Far better.

My next point of investigation was why the tiny method to grab a line from the decoder was not being inlined. When I took a look at the post-optimization code for it, it turned out to be no surprise at all: while the logic of the method was very few instructions, it was bulked out by type checking of the incoming arguments and return values. The consume-line-chars method looks like this:

method consume-line-chars(Bool:D :$chomp = False, Bool:D :$eof = False --> Str) {
    my str $line = nqp::decodertakeline(self, $chomp, $eof);
    nqp::isnull_s($line) ?? Str !! $line
}

Specializations are always tied to a callsite object, from which we can know whether we’re being passed a parameter or not. Therefore, we should be able to optimize out those checks and, in the case the parameter is being passed, throw out the code setting the default value. Further, the *%_ that all methods get automatically should have been optimized out, but was not being.

The latter problem was fixed largely by moving code, although tests showed a regression that needed a little more care to handle – namely, that a sufficiently complex default value might do something that causes a deoptimization, and we need to make sure we can fall back into the interpreter and have things work correctly in that case.

While these changes weren’t enough to get consume-line-chars inlined, they did allow an inlining elsewhere, taking the inline ratio up to 28.49% of call frames.

This initial round of changes took the Perl 6 benchmark from 2.87s to 2.77s, so about 3.5% off. Not much, but something.

Continuing to improve code quality

The code we were producing even pre-optimization was disappointing in a few ways. Firstly, even though a simple method like consume-line-chars, or chars, would never possibly do a return, we were still spitting out a return exception handler. A little investigation revealed that we were only doing analysis and elimination of this for subs but not methods. Adding that analysis for methods too took the time down to 2.58s. Knocking 7% off with such a small change was nice.

Another code generation problem lay in consume-line-chars. Access to a native lexical can be compiled in two ways: either just by reading the value (fine if it’s only used as an r-value) or by taking a reference to it (which is correct if it will be used as an l-value). Taking a reference is decidedly costly compared to just reading the value. However, it’s always going to have the correct behavior, so it’s the default. We optimize the reference-taking away whenever we can (in fact, all the most common l-value usages of it never need a reference either).
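To illustrate the r-value/l-value distinction at the source level (this is only about which kinds of usage need what; it is not the compiler’s actual output):

my int $count = 0;
say $count + 1;        # r-value use: reading the int's value is enough
$count = $count + 1;   # the assignment target is an l-value use; the read on the right is an r-value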

Looking at consume-line-chars again:

method consume-line-chars(Bool:D :$chomp = False, Bool:D :$eof = False --> Str) {
    my str $line = nqp::decodertakeline(self, $chomp, $eof);
    nqp::isnull_s($line) ?? Str !! $line
}

We can see that the read of $line here is an r-value, since consume-line-chars is not marked is rw. Unfortunately, it was compiled as an l-value because the conditional compilation lost that context information. So, I addressed that and taught Rakudo to pass along the return value’s r-value context.

A native reference means an allocation, and this change cut the number of GC runs enormously, from 182 of them to 41. That sounds like it should make a sensational difference. In fact, it got things down to 2.45s, a drop of just 5%. Takeaway lesson: allocating less stuff is good, but MoarVM’s GC is also pretty good at throwing away short-lived things.

Meanwhile, back in the specializer…

With the worst issues of the code being fed into MoarVM addressed, it was back to seeing why the specializer wasn’t doing a better job of stripping out type checks. First of all, it turned out that optional named arguments were not properly marking the default code dead when the argument was actually passed.

Unfortunately, that wasn’t enough to get the type tests stripped out for the named parameters to consume-line-chars. In fact, this turned out to be an issue for all optional parameters. When doing type analysis, the type information from different branches has to be merged at join points in the control flow graph. So the analysis might see something like this in the case that the argument was not passed:

    Bool (default path) \   / Unknown (from passed path)
                         \ /
                   Result: Unknown

Or maybe this in the case that it was passed:

    Bool (default path) \   / Scalar holding Bool (from passed path)
                         \ /
                   Result: Unknown

In both cases, the types disagree, so they merge to unknown. This is silly, as we’ve already thrown out one of the two branches, so in fact there’s really no merge to do at all! To fix this up, I marked variables (in single static assignment form) that died as a result of a basic block being removed. To make the dead basic blocks from argument analysis actually be removed, we needed to do the dead code removal earlier as well as doing it at the very end of the optimization process. With that marking in place, it was then possible to ignore now-dead code’s contribution to a merge, which meant a whole load of type checks could now be eliminated. Well, in fact, only in the case where the optional was passed; a further patch to mark the writers of individual instructions dead for the purpose of merges was needed to handle the case where it was not.

That left the return type being checked on the way out, which also seemed a bit of a waste as we could clearly see it was a Str. After a tweak to Rakudo to better convey type information in one of its VM extension ops, that check was optimized out too.

And for all of this effort, the time went from…2.45s to 2.41s, just 2% off. While it’s cheaper to not type check things, it’s only so costly in the first place.

A further win was that, with the code for consume-line-chars now being so tiny, it should have been an inlining candidate. Alas, it was not, because the optional arguments were still having tracking information recorded just in case we needed to deoptimize. This seemed odd. It turned out that my earlier fix for this was too simplistic: it would leave the tracking in if the method would ever deoptimize, not just if it would do so while handling arguments. I tightened that up and the time dropped to 2.37s, another 2% win. Again, very much worth it, but it shows that invocation – while not super cheap – is also only so costly.

With consume-line-chars inlining now conquered, another area of the code we were producing caught my eye: boolification was, in some cases, managing to box an int into an Int only to then immediately unbox it and turn it into a Bool. Clearly this was quite a waste! It turned out that an earlier optimization to avoid taking native references had unexpected consequences. But even nicer was that my earlier work to pass down r-value context meant I could delete some analysis and just use that mechanism instead. That was worth 4%, bringing us to 2.28s.

Taking stock

None of these optimizations so far were specific to I/O or touched the I/O code itself. Instead, they are general optimization and code quality improvements that will benefit most Perl 6 programs. Together, they had taken the lines benchmark from 2.87s to 2.28s. Each may have been just some percent, but together they had knocked 20% off.

By this point, the code quality – especially after optimization – was far more pleasing. It was time to look for some other sources of improvement.

Beware associativity

Perhaps one of the easiest wins came from spotting that the pull-one method of the lines iterator seemed to be doing two calls to the defined method. See if you can spot them:

method pull-one() {
    $!decoder.consume-line-chars(:$!chomp) // $!handle.get // IterationEnd
}

The // operator calls .defined to test for definedness. But why two calls in the common case? Because of associativity! Without parentheses, the expression parses as ($!decoder.consume-line-chars(:$!chomp) // $!handle.get) // IterationEnd, so the – usually already defined – result of the first // gets its definedness tested a second time. Adding two parentheses:

method pull-one() {
    $!decoder.consume-line-chars(:$!chomp) // ($!handle.get // IterationEnd)
}

was worth a whopping 8%. At 2.09s, the 2 second mark was in sight.

Good idea, but…

My next idea for an improvement was a long-planned change to the way that simple for loops are compiled. With for being defined in terms of map, this is also how it had been implemented. However, for simple cases, we can just compile:

for some-iterable { blah }

Not into:

some-iterable.map({ blah }).sink-all;

But instead into something more like:

my \i = some-iterable.iterator;
while (my \v = i.pull-one) !=:= IterationEnd {
    blah
}

Why is this an advantage? Because – at least in theory – now the pull-one and loop body should become possible to inline. This is not the case if we call map, since that is used with dozens of different closures and iterator types. Unfortunately, however, due to limitations in MoarVM’s specializer, it was not actually possible to achieve this inlining even after the change. In short, because we don’t yet handle inlining of closure-y things, and the way the on-stack replacement works means the optimizer lacks the type information it would need to have a chance of doing better with pull-one. Both of these are now being investigated, but were too big to take on as part of this work.

Even without those larger wins being possible (when they are, we’ll achieve a tasty near-100% inlining rate in this benchmark), it brought the time down to the 2.00s mark. Here’s the patch.

Optimizing line separation and decoding

Profiling at the C level (using callgrind) showed up some notable hot spots in the string handling code inside of MoarVM, which seemed to offer the chance to get further wins. At this point, I also started taking measurements of CPU instructions using callgrind, which makes it easier to see the effects of changes that may come out as noise on a simple time measurement (even with taking a number of them and averaging).

Finding the separator happens in a couple of steps. First, individual encodings are set up to decode to the point that they see the final character of any of the line separators (noting these are configurable, and multi-char separators are allowed). Then, a second pass checks whether a full multi-char separator was actually found. This is complicated by needing to handle the case where a separator was not found, and another read needs to be done from a file handle.

It turns out that this second pass was re-scanning the entire buffer of chars, rather than just looking close to the end of it. After checking there should not be a case where just jumping to look at the end would ever be a problem, I did the optimization and got a reduction from 18,245,144,315 instructions to 16,226,602,756, or 11%.

A further minor hot-spot was re-resolving the CRLF grapheme each time it was needed. It turned out caching that value saved around 200 million instructions. Caching the maximum separator length saved another 78 million instructions. The wallclock time now stood at 1.79s.

The identification of separators when decoding chars seemed the next place to find some savings. CPUs don’t much like having to do loops and dereferences on hot paths. To do better, I made a compact array of the final separator graphemes that could be quickly scanned through, and also introduced a maximum separator codepoint filter, which – given that separators are usually control characters – works out really quite well. These were worth 420 million and 845 million instructions respectively.

Next, I turned to the UTF-8 decoding and NFG process. A modest 56 million instruction win came from tweaking this logic, given that we can never both be looking for a separator and have a target number of characters to decode at the same time. But a vast win came from adding a normalization fast path for the common case where we don’t have any normalization work to do. In the case we do encounter such work, we simply fall into the slow path. One nice property of the way I implemented this is that, when reading line by line, one line may cause a drop into the slow path, but the next line will start back in the fast path. This change was worth a whopping 3,200 million decrease in the instruction count. Wallclock time now stood at 1.37s.

Better memory re-use

Another look at the profile now showed malloc/free as significant costs. Could anything be done to reduce the number of those we did?

Yes, it turned out. Firstly, keeping around a decoding result data structure instead of freeing and allocating it every single line saved a handy 450 million instructions. It turned out that we were also copying the decoded chars into a new buffer when taking a line, but in the common case that buffer would contain precisely the chars that make up the line. Therefore, this buffer could simply be stolen to use as the memory for the string. Another 400 million instructions dropped away thanks to one less malloc/free per line.

Micro-optimizations

A few further micro-optimizations in the new UTF-8 decoding fast-path were possible. By lifting some tests out of the main loop, reading a value into a local because the compiler couldn’t figure out it was invariant, and moving some position updates so they only happen on loop exit, a further 470 million instructions were removed. If you’re thinking that sounds like a lot, this is a loop that runs for every single codepoint we decode. A million line file with 60 chars per line plus a separator is 61 million iterations. These changes between them only save around 7 instructions per codepoint; that just turns out to be a lot when multiplied by the number of codepoints!

The final result

With these improvements, the Perl 6 version of the benchmark now ran in 1.25s, which is just 44% of the time it used to run in. The Perl 5 version still wins, but by a factor of 1.1 times, not 2.5 times. While some of the changes performed during this work were specific to the benchmark in question, many were much more general. For example, the separator finding improvements will help with this benchmark in all encodings, and the code generation and specializer improvements will have far more cross-cutting effects.

Actually, not so final…

There’s still a decent amount of room for improvement yet. Once MoarVM’s specializer can perform the two inlinings it is not currently able to, we can expect a further improvement. That work is coming up soon. And beyond that, there will be more ways to shave off some instructions here and there. One less pleasing result is that asking Perl 5 to skip the UTF-8 decoding gives it a huge saving, whereas asking Perl 6 for ASCII or Latin-1 instead only saves a little. This would be a good target for some future optimization work. In the meantime, these are a nice bunch of speedups to have.


Sorting out synchronous I/O

A few weeks back, I put out a call for funding. So far, two generous individuals have stepped up to help, enabling me to spend much more time on Perl 6 than would otherwise have been possible.

First in line was Nick Logan (ugexe), who is funding 60 hours to get a longstanding I/O bug resolved. The problem, in short, has been that a synchronous socket (that is, an IO::Socket::INET instance) accepted or connected on one thread could not be read from or written to by another thread. The same has been true of synchronous file handles and processes.

The typical answer for the socket case has been to use IO::Socket::Async instead. While that’s the best answer from a scalability perspective, many people new to Perl 6 will first reach for the more familiar synchronous socket API and, having heard about Perl 6’s concurrency support, will pass those off to a thread, probably using a start block. Having that fail to work is a bad early impression.
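For example, code along these lines (a sketch; the port number and trivial echo protocol are made up) is a natural first attempt, and it is exactly where the old restriction would bite, since the accepted connection is used from a thread pool thread rather than the one that accepted it:

my $listener = IO::Socket::INET.new(:listen, :localhost<127.0.0.1>, :localport(4242));
loop {
    my $conn = $listener.accept;
    start {
        # this block runs on a thread pool thread, not the accepting thread
        my $line = $conn.get;
        $conn.print("echo: $line\n");
        $conn.close;
    }
}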

For processes, the situation has been similar; a Proc was bound to the thread it was created on, and the solution was to use Proc::Async. For situations where dealing with more than one of the input or output streams is desired, I’d argue that it’s actually easier to deal with it correctly using Proc::Async. However, Proc has also ended up with a few features that Proc::Async has not so far offered.

Finally, there are “file handles” – that is, instances of IO::Handle. Here, we typically pass handles to ordinary files around between threads and, thanks to an implementation detail, could get away with doing so. The situation of an IO::Handle backed by a TTY or pipe was much less pleasing, however, which was especially unfortunate because it afflicted all of $*IN, $*OUT, and $*ERR. The upshot was that you could only read from $*IN from the main thread. While writing to $*OUT and $*ERR worked, it was skating on thin ice (and occasionally falling through it).

How did we get into this situation?

To understand the issue, we need to take a look at the history of MoarVM I/O. In the first year of its life, MoarVM was designed and built on a pretty extreme budget – that is to say, there wasn’t one. Building platform abstractions was certainly not a good use of limited time, so a library was brought in to handle this. Initially, MoarVM used the Apache Portable Runtime, which served us well.

As the concurrent and parallel language features of Perl 6 came into focus, together with asynchronous I/O, it became clear that libuv – the library that provides I/O and platform abstractions for Node.js – was a good option for supporting this. Since the APR and libuv had substantially overlapping feature sets, we decided to move over to using libuv. In the months that followed, a bunch of the asynchronous features found in Perl 6 today quickly fell into place: Proc::Async, IO::Socket::Async, IO::Notification, asynchronous timers, and signal handlers. All seemed to be going well.

Alas, we’d made a problem for ourselves. libuv is centered around event loops. Working with an event loop is conceptually simple: we provide callbacks to be run when certain things happen (like data arriving over a socket, or a timer elapsing), and enter the event loop. The loop waits for some kind of event to happen, and then invokes the correct callback. Those callbacks may lead to other callbacks being set up (for example, the callback reacting to an incoming socket connection can set up a callback to be called whenever data arrives over that socket). This is a fine low-level programming model – and a good bit nicer than dealing with things like poll sets. However, a libuv event loop runs on a single thread. And handles are tied to the libuv event loop that they were created on.

For IO::Socket::Async and Proc::Async, this is not a problem. MoarVM internally runs a single event loop for all asynchronous I/O, timers, signals, and so forth. Whenever something happens, it pushes a callback into the queue of a scheduler (most typically that provided by ThreadPoolScheduler), where the worker threads lie in wait to handle the work. Since this event loop serves as a pure dispatcher, not running any user code or even such things as Unicode decoding, it’s not really limiting to have it only on a single thread.

When the synchronous I/O was ported from the APR, none of this was in place, however. Therefore, each thread got its own libuv event loop for handling the synchronous I/O. At the time, of course, there wasn’t really much in the way of threading support available in Rakudo on MoarVM. Therefore, at that point in time, the fact that an event loop was tied to a thread was not problematic. It only came to be an issue as the concurrency support in Perl 6 became mature enough that people started using it…and then running into the limitation.

Seizing an opportunity

Having to deal with this was a chance to spend some quality time improving the lower level bits of I/O in the MoarVM/NQP/Rakudo stack, which have had very little love over the last couple of years. Recently, Zoffix did a bunch of great work on the higher level parts of I/O. Most helpfully for my endeavor, he also greatly improved the test coverage of I/O, meaning I could refactor the lower level bits with increased confidence.

A while back, I took the step of fully decoupling MoarVM’s asynchronous I/O from character encoding/decoding. For some months now, MoarVM has only provided support for asynchronous byte-level I/O. It also introduced a streaming decoder, which can turn bytes into characters for a bunch of common encodings (ASCII, UTF-8, and friends). This means that while the decoding hot paths are provided by the VM, the coordination is moved up to the Perl 6 level.

With synchronous I/O, the two were still coupled, with the runtime directly offering both character level and byte level I/O. While this is in some ways convenient, it is also limiting. It blocks us from supporting user-provided encodings – at least, in a way that lets them just be plugged in and used with a normal IO::Handle. That aside, there are also situations where one might like to access the services provided by a streaming decoder when not dealing with an I/O handle. (A streaming decoder is one you can feed bytes to incrementally and pull characters out, trusting that it will do the right thing with regard to multi-byte and multi-codepoint sequences.)
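To make the streaming decoder idea concrete, here is a toy line-level equivalent in Perl 6 (purely illustrative: the class and method names are made up, and the real decoder lives in the VM, supports many encodings, and produces NFG strings):

class ToyStreamingDecoder {
    has $!buffer = Buf.new;

    method add-bytes(Blob $bytes --> Nil) {
        $!buffer.append($bytes.list);                # stash whatever bytes arrived
    }

    method take-line(--> Str) {
        my $nl = -1;
        for ^$!buffer.elems -> $i {                  # scan for a "\n" byte
            if $!buffer[$i] == 0x0A { $nl = $i; last }
        }
        return Str if $nl < 0;                       # no complete line buffered yet
        my $line = $!buffer.subbuf(0, $nl).decode('utf-8');
        $!buffer .= subbuf($nl + 1);
        $line
    }
}

my $decoder = ToyStreamingDecoder.new;
$decoder.add-bytes('hello wo'.encode);               # half a line so far
say $decoder.take-line.defined;                      # False
$decoder.add-bytes("rld\n".encode);
say $decoder.take-line;                              # hello world

Since a UTF-8 multi-byte sequence can never contain the byte 0x0A, only decoding complete lines means a sequence split across two add-bytes calls still comes out correctly – which is the essence of what a streaming decoder has to guarantee.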

Whatever changes were going to have to happen to fix the thread limitations of synchronous I/O, it was quite clear that only having to deal with binary I/O there would make it easier. Therefore, a plan was hatched:

  1. Re-work the synchronous I/O handles to use only binary I/O, and coordinate with the VM-backed decoder to handle char I/O.
  2. Rip out the synchronous char I/O.
  3. Re-implement the remaining synchronous binary I/O, so as not to be vulnerable to the threading limitations.

Making it possible to support user-defined encodings would be a future step, but the work done in these refactors would make it a small step – indeed, one that will only require work at a Perl 6 level, not any further down the stack. While it may well end up being me that does this anyway, it’s at least now in reach for a bunch more members of the Perl 6 development team.

Sockets first

I decided to start out with sockets. From a technical perspective, this made sense because they were the most isolated; break IO::Handle and suddenly such things as say are busted, and both it and Proc are used in the pre-compilation management code too. Conveniently, sockets were also Nick’s primary interest, so it was nice to get the most important goal of the work delivered first.

The streaming decode API, while fully implemented on MoarVM, had only been partially implemented on the JVM. Therefore, to complete the work on sockets without busting them on the JVM, I had to implement the missing pieces of the VM-backed streaming decode API. This meant dealing with NIO (“New IO”), the less said about which, the better. I’m pretty sure the buffer API wasn’t designed to trip me up at every turn, but it seems to reliably manage to do so. Since file handles on the JVM would soon also come to depend on this code, it was nice to be able to get it at least straight enough to be passing all of the sockets tests, plus another set of lower-level tests that live in the NQP repository.

Refactoring the socket code itself gave a good opportunity for cleanups. The IO::Socket::INET class does the IO::Socket role, the idea being that at some point other implementations that provide things like domain sockets will also do that role, and just add the domain socket specific parts. In reviewing what methods were where, it became clear that some things that really belonged in the IO::Socket role were not, so I moved them there as part of the work. I also managed to eliminate all of the JVM-specific workarounds in the socket code along the way, which was also a happy outcome.

With that refactored, I could rip out the character I/O support from sockets in MoarVM. This left me with a relatively small body of code doing binary socket I/O using libuv, implementing synchronous socket I/O atop of its asynchronous event loop. When it comes to sockets, there are two APIs to worry about: the Berkeley/BSD/POSIX one, and Winsock. Happily, in my case, there was a lot of overlap and just a handful of places that I had to deal with divergence. The extra lines spent coping with the difference were more than paid back by not faking synchronous I/O in terms of an asynchronous event loop.

File handles next

Buoyed by this success, it was time to dig into file handles. The internals of IO::Handle were poked into from outside of the class: a little in IO::Path and more so in Proc, which was actually setting up IO::Pipe, a subclass of IO::Handle. Thankfully, with a little refactoring, the encapsulation breakage could be resolved. Then it was a case of refactoring to use the streaming decode API rather than the character I/O. This went relatively smoothly, though it also uncovered a previously hidden bug in the JVM implementation of the streaming decoder, which I got fixed up.

So, now to rip the character file I/O out of MoarVM and refactor the libuv away in synchronous file handles too? Alas, not so fast. NQP was also using these operations. Worse, it didn’t even have an IO handle object! Everything was done in terms of nqp::ops instead. So, I introduced an NQP IO handle class, gave it many of the methods that its Perl 6 big sister has, and refactored stuff to use it.

With that blocker out of the way, I could move on to sorting things out down in MoarVM. Once again, synchronous file I/O looks similar enough everywhere to not need all that much in the way of an abstraction layer. On Windows, it turned out to be important to put all of the handles into binary mode, however, since we do our own \n <-> \r\n mapping. (Also, yes, it is very much true that it’s only very similar on Windows if you use their POSIX API emulation. It’s possible there may be a performance win from not using that, but I can’t imagine it’ll be all that much.)

Not entirely standard

So, all done? Well, not quite. For the standard handles, we only used the synchronous file code path when the handle was actually a regular file. This is not the common case; it’s often a pipe or a TTY. These used a completely different code path in libuv, using its streams API instead of the files API.

Thankfully, there isn’t much reason to retain this vast implementation difference. With a little work on EOF handling, and re-instating the faking of tell, it was possible to use the same code I had written to handle regular files. This worked out very easily on Linux. On Windows, however, a read operation would blow up if reading from the console.

Amazingly, the error string was “out of space”, which was a real head-scratcher given it was coming from a read operation! It turned out that the error string was a tad misleading; the error code is #defined as ENOMEM, so a better error string would have been “out of memory”. That still made little sense. So I went digging into the native console APIs on Windows, and discovered that reads from the console are allocated out of a 64KB buffer, which is also used for various other things. 64KB should be enough for anybody, I guess. Capping read requests to a console on Windows to 16KB was enough to alleviate this.

Job done!

At least, for IO::Socket::INET and IO::Handle, which now are not tied to a thread on HEAD of MoarVM/NQP/Rakudo. The work on processes is ongoing, although due to be completed in the coming week. So, I’ll talk about that in my next post here.


Looking for Perl 6, Rakudo, and MoarVM development funding

Note for regular 6guts readers: this post isn’t about Perl 6 guts themselves, but rather about seeking funding for my Perl 6 work. It is aimed at finding medium-size donations from businesses who are interested in supporting Perl 6, Rakudo, and MoarVM by funding my work on these projects.

I started contributing to the Rakudo Perl 6 compiler back in 2008, and since then have somehow managed to end up as architect for both Rakudo and MoarVM, together with playing a key design role in Perl 6’s concurrency features. Over the years, I’ve made time for Perl 6 work by:

  • Spending a bunch of my free time on it.
  • Only working an 80% day-job for five years, effectively self-funding a day a week for Perl 6 things. During this time I did also take various “milestone” grants (for delivering notable features); this mostly ended up paying for my travel to a bunch of workshops and conferences, which was a great deal of fun and helped keep the community informed during Perl 6’s long gestation period.
  • Most recently, taking funding at a rate that – while less than I could make doing other stuff – lets me comfortably work on Perl 6 as a part of my normal daily job.

I’m still greatly enjoying doing Perl 6 stuff and, while I’ve less free time these days for a variety of reasons, I still spend a decent chunk of that on Perl 6 things too. That’s enough for piecing together various modules I find we’re missing in the ecosystem, and for some core development work. However, the majority of Perl 6, Rakudo, and MoarVM issues that end up on my plate are both complex and time-consuming. For various areas of MoarVM, I’m the debugger of last resort. Making MoarVM run Perl 6 faster means working on the dynamic optimizer, which needs a good deal of care to avoid doing the wrong thing really fast. And driving forward the design and implementation of Perl 6’s concurrent and parallel features also requires careful consideration. Being funded through The Perl Foundation over the last couple of years has enabled me to spend quality time working on these kinds of issues (and plenty more besides).

So what’s up?

I’ve been without funding since early-mid February. Unfortunately, my need to renew my funding has come at a time when The Perl Foundation has been undergoing quite a lot of changes. I’d like to make very clear that I’m hugely supportive and thankful for all that TPF have done and are doing, both for me personally and for Perl 6 in general. Already this year, two Perl 6 grants have been made to others for important work. These were made through the normal TPF grants process. By contrast, my work has been funded through a separate Perl 6 Core Development Fund. As a separate fund, it thus needs funds raising specifically for it, and has its own operation separate from the mainstream grant process.

Between the fund being almost depleted, and various new volunteers stepping up to new roles in TPF and needing to get up to speed on quite enough besides the Perl 6 Core Development Fund, unfortunately it’s not been possible to make progress on my funding situation in the last couple of months. I’m quite sure we can get there with time – but at the same time I’m keen to get back to having more time to spend on Perl 6 again.

So, I’ve decided to try out an alternative model. If it works, I potentially get funded faster, and TPF’s energies are freed up to help others spend more time on Perl. If not, well, it’ll hopefully only cost me the time it took to write this blog post.

The Offer

I’m looking for businesses willing to help fund my Perl 6 development work. I can offer in return:

  • To spend quality time on Perl 6, Rakudo, and MoarVM.
  • To either spend the time on the areas that I feel are most pressing based on my own judgement and input from the community, or to spend it focusing on areas of the sponsor’s interest (for example: improving native call performance, delivering concurrency improvements such as completing non-blocking await or hyper/race, building a sampling profiler to give better insight into long-running programs, improving the MoarVM dynamic optimizer so hot code can run faster, etc.)
  • To write at least one blog post here per 25 hours of work completed. Rather than a boring list of stuff that I got done, I prefer to write posts that focus around particularly interesting problems I solved or that dig into the details of things that I have implemented. I will be happy to mention the sponsor of the work in these posts, together with a link, logo, etc.
  • Per 100 hours funded, provide 5 extra hours of my time that the sponsor may use as they wish (within reason!) That may be for consultation on any topic you figure I might be clueful about (not just Perl 6), writing an article for you on a topic of your choice, and so forth. Or, if you wish to donate the bonus time back to Perl 6 related work, that’s of course fine too (and I’ll be sure to mention you did so!)
  • Billing by invoice issued from a VAT-registered company with its seat in Prague, Czech Republic.
  • When I get a Perl 6 related book done (shhhh…don’t tell anyone I said that I’m working on one), I’ll send along a few signed copies. :-)

I’m setting a rate for this work of 55 EUR / hour with a minimum order of 25 hours. This need not be billed all in one go; for example, if you happened to be a company wishing to donate 1000 EUR a month to open source and wished to be invoiced that amount each month, this is entirely possible. After all, if 3-4 companies did that, we’d have me doing Perl 6 stuff for 2 full days every week.

If you’re interested in helping, please get in contact with me, either by email or on freenode (I’m jnthn there). Thank you!


Massively reducing MoarVM Fixed Size Allocator contention

The latest MoarVM release, 2017.04, contains a significant improvement for multi-threaded applications, especially those that are CPU-bound. I mentioned the improvement briefly on Twitter, because it’s hard to be anything other than brief on Twitter. This post contains a decidedly less brief, though hopefully somewhat interesting, description of what I did. Oh, and also a bonus footnote in which I attempt to prove the safety (or lack of safety) of a couple of lock free algorithms, because that’s fun, right?

The Fixed Size Allocator

The most obvious way to allocate memory in a C program is through functions like malloc and calloc. Indeed, we do this plenty in MoarVM. The malloc and calloc implementations in C libraries have certainly been tuned a bunch, but at the same time they have to have good behavior for a very wide range of programs. They also need to keep track of the sizes of allocations, since a call to free does not pass the size of the memory being released. And they need to try to avoid fragmentation, which can lead to out-of-memory errors occurring because the heap ends up with lots of small gaps, but none big enough to allocate a larger object.

When we know a few more properties of the memory usage of a program, and we have information around to know the size of the memory block we are freeing, it’s possible to do a little better. MoarVM does this in multiple ways.

One of them is by using a bump-the-pointer allocator for memory that is managed by the garbage collector. Objects allocated this way have a header that points to a type table that knows the size of the object that was allocated, meaning the size information is readily available. And the GC can move objects around in memory, since it can find all of the references to an object and update them, meaning there is a way out of the fragmentation trap too.

The call stack is another example. In the absence of closures, it is possible to allocate a block of memory and use it like a stack. When a program makes a call, the current location in the memory block is taken as the address for the call frame memory, and the location is bumped by the frame size. This could be seen as a “push”, in stack terms. Because call frames are left in the opposite order to which they are entered, freeing them is just subtraction. This could be seen as a “pop”. Since holes are impossible, fragmentation cannot occur.

A third case is covered by the fixed size allocator. This is the most difficult of the three. It tries to do a better job than malloc and friends in the case that, at the point when memory is freed, we readily know the size of the memory. This allows it to create regions of memory that consist of N blocks of a fixed size, and allocate the memory out of those regions (which it calls “pages”). When a memory request comes in, the allocator first checks if it’s within the size range that the fixed size allocator is willing to handle. If it isn’t, it’s passed along to malloc. Otherwise, the size is rounded up to the nearest “bin size” (which are 8 bytes, 16 bytes, 24 bytes, and so forth). A given bin consists of:

  • 1 or more pages, each with space for a certain number of allocations of the bin size
  • A free list of available memory locations of that size, which is maintained as a linked list (that is, each location contains a pointer to the next free location, with a NULL marking the end)

If the free list contains any entries, then one of them will be taken. If not, then the pages will be considered. If the current page is not full, then the allocation will be made from it. Otherwise, another page will be allocated. When memory is freed, it is always added to the free list of the appropriate bin. Therefore, a longer-running program, in steady state, will typically end up getting all of its allocations from the free list.
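As a rough model of the bin selection described above – the real code is C inside MoarVM; this Perl 6 snippet just mirrors the rounding arithmetic, assuming the 8-byte bin granularity mentioned earlier:

sub bin-for(Int $size --> Int) {
    my $rounded = ($size + 7) +& +^7;   # round up to the next multiple of 8
    $rounded div 8 - 1                  # bin index: 8 bytes is bin 0, 16 is bin 1, 24 is bin 2, ...
}
say bin-for(20);   # 2, i.e. the 24-byte bin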

Enter threads

Building a fixed size allocator for a single-threaded environment isn’t all that difficult. But what happens when it needs to cope with being used in a multi-threaded program? Well…it’s complicated. Clearly, it is not possible to have a single global fixed size allocator and have all of the threads just use it without any kind of concurrency control. Taking an item off the freelist is a multi-step process, and allocating from a page – or allocating a new page – is even more steps. Without concurrency control, there will be data races all over, and we’ll be handed a SIGSEGV in record time.

It’s worth stopping to consider what would happen if we were to give every thread its own fixed size allocator. This turns out to get messy fast, as memory allocated on one thread may be freed by another. A seemingly simple scheme is to say that the freed memory is simply appended to the freelist of the freeing thread’s fixed size allocator. Unfortunately, this has two bad properties.

  1. When the thread ends, we can’t just throw away the pages – because bits of them may still be in use by other threads, or referenced in the free lists of other threads. So they’d need to be somehow “re-homed”, which is going to need some kind of coordination. Further measures may be needed to mitigate memory fragmentation in programs that spawn and join many threads during their lifetimes.
  2. Imagine a producer/consumer setup, where one thread does allocations and passes the allocated memory to another thread, which processes the data in the memory and frees it. The producing thread will build up a lot of pages to allocate out of. The consuming thread will build up an ever longer free list. Memory runs out. D’oh.

So, MoarVM went with a single global fixed size allocator. Of course, this has the drawback of needing concurrency control.

Concurrency control

The easiest possible form of concurrency control is to have threads acquire a mutex on every allocate and free operation. This has the benefit of being very straightforward to understand and reason about. It has the disadvantage of being extremely costly. Mutex acquisition can be relatively cheap, but it gets expensive when there is high contention – that is, lots of threads trying to obtain the lock. And since all CPU-bound threads will typically allocate some working memory, particularly in a VM for a dynamic language that doesn’t yet do escape analysis, that adds up to a lot of contention.

So, MoarVM did something more sophisticated.

First, the easy part. It’s possible to append to a free list with a CPU-provided atomic operation, provided taking from the freelist is also using one. So, no mutex acquisition is required for freeing memory. However, an atomic operation still requires a kind of locking down at the CPU level. It’s cheaper than a mutex acquire/release for sure, but there will still be contention between CPU cores for the cache line holding the head of the free list.

What about allocation? It turns out that we can not just take from a free list using an atomic operation without hitting the ABA problem (gory details in footnote). Therefore, some kind of locking is needed to ensure an ordering on the operations. In most cases, the atomic operation will work on the first attempt (it’s competing with frees, which happen without any kind of locking, meaning a retry will sometimes be needed). In cases where something will complete very rapidly, a spinlock may be used in place of a full-on mutex. So, the MoarVM fixed size allocator allocation scheme boiled down to:

  1. Acquire the spin lock.
  2. Try to take from the free list in a loop, until either we succeed or the free list is seen to be empty.
  3. Release the spin lock.
  4. If we failed to obtain memory from the free list, take the slow path to get memory from a page, allocating another page if needed. This slow path does acquire a real mutex.

Contention

First up, I’ll note that the strategy outlined above does beat the “just take a mutex for every allocate/free” approach – at least, in all of the benchmarks I’ve considered. Frees end up being lock free, and most of the allocations just do a spin lock and an atomic operation.

At the same time, contention means contention, and no lock free data structure or spinlock changes that. If multiple threads are constantly scrambling to work on the same memory location – such as the head of a free list – it’s going to get expensive. How expensive? On an Intel Core i7, obtaining a cache line that is held by another core exclusively – which it typically will be under contention – costs somewhere around 70 CPU cycles. It gets worse in a multi-CPU setup, where it could easily be hundreds of CPU cycles. Note this is just for one operation; the spinlock is a further atomic operation and, of course, it uses some cycles as it spins.

But how much could this end up costing in a real world Perl 6 application? I recently had chance to find out, and the numbers were ugly. Measurements obtained by perf showed that a stunning 40% of the application’s runtime was spent inside of the fixed size allocator. (Side note: perf is a sampling profiler, which – in my handwavey understanding – snapshots the callstack at regular intervals to figure out where time is being spent. My experience has been that sampling profilers tend to be better at showing up surprising costs like this than instrumenting profilers are, even if they are in some senses less precise.)

Making things better

Clearly, there was significant room for improvement. And, happily, things are now muchly improved and my real-world program did get something close to a 40% performance boost.

To make things better, I introduced per-thread freelists, while leaving pages global and retaining global free lists also.

Memory is allocated in the first place from global pages, as before. However, when it is freed, it is instead placed on a per-thread free list (with one free list per thread per size bin). When a thread needs memory, it first checks its thread-local free list to see if there is anything there. It will only then look at the global free list, or the global pages, if the thread-local free list cannot satisfy the memory request. The upshot of this is that the vast majority of allocations and frees performed by the fixed size allocator no longer have any contention.

However, as I mentioned earlier, one needs to be very careful when introducing things like thread-local freelists to not create bad behavior when a thread terminates or in producer/consumer scenarios. Therefore:

  • When a thread terminates, it will donate all the locations on its free list back to the global free list. Since pages are entirely global, there isn’t a fragmentation issue.
  • When a thread’s local free list for a particular size hits a limit, it will instead start to free to the global free list, which prevents unbounded free list growth in producer/consumer style programs.

So, I believe this improvement is good for performance without being badly behaved in any of the cases that previously would have worked out fine.

Can we do better?

Always! While the major contention bottleneck is gone, there are further opportunities for improvement that are worth exploring in the future.

  • It would probably make sense to donate a bunch of free list entries to the global free list, rather than donating them individually. This would mean the contention of doing so only happened every N entries, not every entry, for the producer/consumer case. This is a fairly easy fix. (Homework, anyone? :-))
  • The current scheme is vulnerable to false sharing. The strategy of packing allocations of the same size together in pages is, in a single-threaded program, typically good for the CPU cache. But in a multi-threaded program, we can end up placing allocations that individual threads use next to each other. This means they may end up on the same cache line (the 32-byte or 64-byte memory regions that CPU caches hold). This is a harder nut to crack.

In summary…

If you have CPU-bound multi-threaded Perl 6 programs, MoarVM 2017.04 could offer a big performance improvement. For my case, it was close to 40%. And the design lesson from this: on modern hardware, contention is really costly, and using a lock free data structure or picking the “best kind of lock” will not overcome that.


Footnote on the ABA vulnerability: It’s decidedly interesting – at least to me – that prepending to a free list can be safely done with a single atomic operation, but taking from it cannot be. Here I’ll attempt to provide a proof for these claims.

We’ll consider a single free list whose head lives at memory location F, and two threads, T1 and T2. We assume the existence of an atomic operation, TRY-CAS(location, old, new), which will – in a single CPU instruction that may not be interrupted – compare the value in memory pointed to by location with old and, if they match, replace it with new. (CAS is short for Compare And Swap.) The TRY-CAS function evaluates to true if the replacement took place, and false if not. The threads may be preempted (that is, taken off the CPU) at any point in time.

To show that allocation is vulnerable to the ABA problem, we just need to find an execution where it happens. First of all, we’ll define the operation ALLOCATE as:

1: do
2:     allocated = *F
3:     if allocated != NULL
4:         next = allocated.next    
5: while allocated != NULL && !TRY-CAS(F, allocated, next)
6: return allocated

And FREE(C) as:

1: do
2:     current = *F
3:     C.next = current;
4: while !TRY-CAS(F, current, C)

Let’s consider a case where we have 3 memory cells, C1, C2, and C3. The free list head F points to C1, which in turn points to C2, which in turn points to C3.

Thread T1 enters ALLOCATE, but is preempted immediately after the execution of line 4. At this point, allocated contains C1 and next contains C2.

Next, T2 calls ALLOCATE, and succeeds in making an allocation. F now points to C2. It again calls ALLOCATE, meaning that F now points to C3. It then calls FREE(C1). At this point, F points to C1 again, and C1 points to C3. Notice that at this point, cell C2 is considered to be allocated and in use.

Consider what happens if T1 is resumed. It performs TRY-CAS(F, C1, C2). This operation will succeed, because F does indeed currently point to C1. This means that F now comes to point to C2. However, we earlier stated that C2 is allocated and in use, and therefore should not be in the free list. Therefore we have demonstrated the code to be buggy, and shown how the bug arises as a result of the ABA problem.

What of the claim that the FREE(C) is not vulnerable to the ABA problem? To be vulnerable to the ABA problem, another thread must be able to change the state of something that the correctness of the operation depends upon, but that is not tested by the TRY-CAS operation. Looking at FREE(C) again:

1: do
2:     current = *F
3:     C.next = current;
4: while !TRY-CAS(F, current, C)

We need to consider C and current. We can very reasonably make the assumption that the calling program is well-behaved, and will never use the cell C again after passing it to FREE(C) (unless it obtains it again in the future through another call to ALLOCATE, which cannot happen until FREE has inserted it into the free list). Therefore, C cannot be changed in any way other than the code in FREE changes it. The FREE operation holds the sole reference to C at this point.

Life is much more complicated for current. It is possible for a preemption at line 3 of FREE, followed by another thread allocating the cell pointed to by current and then freeing it again, which is certainly a case of an ABA state change. However, unlike the situation we saw in ALLOCATE, the FREE operation does not depend on the content of current. We can see this by noticing how it never looks inside of it, and instead just holds a reference to it. An operation cannot depend upon a value it never accesses. Therefore, FREE is not vulnerable to the ABA problem.


Considering hyper/race semantics

We got a lot of nice stuff into Perl 6.c, the version of the language released on Christmas of 2015. Since then, a lot of effort has gone into polishing the things we already had in place, and also into optimization. By this point, we’re starting to think about Perl 6.d, the next language release. Perl 6 is defined by its test suite. Even before considering additional features, the 6.d test suite will tie down a whole bunch of things that we didn’t have covered in the 6.c one. In that sense, we’ve already got a lot done towards it.

In this post, I want to talk about one of the things I’d really like to get nailed down as part of 6.d, and that is the semantics of hyper and race. Along with that I will, of course, be focusing on getting the implementation in much better shape. These two methods enable parallel processing of list operations. hyper means we can perform operations in parallel, but we must retain and respect ordering of results. For example:

say (1, 9, 6).hyper.map(* + 5); # (6 14 11)

Should always give the same results as if the hyper was not there, even if a thread computing 6 + 5 gave its result before the one computing 1 + 5. (Obviously, this is not a particularly good real-world example, since the overhead of setting up parallel execution would dwarf doing 3 integer operations!) Note, however, that the order of side-effects is not guaranteed, so:

(1..1000).hyper.map(&say);

Could output the numbers in any order. By contrast, race is so keen to give you results that it doesn’t even try to retain the order of results:

say (1, 9, 6).race.map(* + 5); # (14 6 11) or (6 11 14) or ...

Back in 2015, when I was working on the various list handling changes we did in the run up to the Christmas release, my prototyping work included an initial implementation of the map operation in hyper and race mode, done primarily to figure out the API. This then escaped into Rakudo, and even ended up with a handful of tests written for it. In hindsight, that code perhaps should have been pulled out again, but it lives on in Rakudo today. Occasionally somebody shows a working example on IRC using the eval bot – usually followed by somebody just as swiftly showing a busted one!

At long last, getting these fixed up and implemented more fully has made it to the top of my todo list. Before digging into the implementation side of things, I wanted to take a step back and work out the semantics of all the various operations that might be part of or terminate a hyper or race pipeline. So, today I made a list of those operations, and then went through every single one of them and proposed the basic semantics.

The results of that effort are in this spreadsheet. Along with describing the semantics, I’ve used a color code to indicate where the result leaves you in the hyper or race paradigm afterwards (that is, a chained operation will also be performed in parallel).

I’m sure some of these will warrant further discussion and tweaks, so feel free to drop me feedback, either on the #perl6-dev IRC channel or in the comments here.


Complex cocktail causes cunning crash

I did a number of things for Perl 6 yesterday. It was not, however, hard to decide which of them to write up for the blog. So, let’s dig in.

Horrible hiding heisenbugs

It all started when I was looking into this RT, reporting a segfault. It was filed a while ago, and I could not reproduce it. So, add a test and case closed? Well, not so fast. As I discussed last week, garbage collection happens when a thread fills up its nursery. Plenty of bugs only show up when the timing is just right (or is that just wrong?). What can influence when GC runs? How many allocations we’ve done. And what can influence that? Pretty much any change to the program being run, the compiler, or the environment. The two that most often have people tearing their hair out are:

  • How many things the compiler allocates compiling the code. This is why even adding or removing comments can reveal and hide bugs. And, of course, changes to the compiler over time will make all kinds of differences, so the bug can appear or disappear from Rakudo commit to commit. Argh!
  • How many things are in the environment, because each one of them results in an allocation of a string at startup. So, yes, adding an environment variable to turn on a debugging feature can hide or show a problem. Running under a test harness that sticks something into the environment can do the very same. Eek!

While GC-related issues are not the only cause of SEGVs, in such a simple piece of code using common features they are far and away the most likely cause. So, to give myself more confidence that the bug truly was gone, I adjusted the nursery size to be just 32KB instead of 4MB, which causes GC to run much more often. This, of course, is a huge slowdown, but it’s good for squeezing out bugs.
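For reference, the knob involved is a compile-time constant in MoarVM’s GC headers; the file and default value below are from memory, so treat them as approximate rather than gospel.

/* src/gc/collect.h (approximate): shrink the per-thread nursery so that
 * collections happen constantly, shaking out GC timing bugs. */
#define MVM_NURSERY_SIZE (32 * 1024)    /* instead of the usual 4MB */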

And…no bug! So, in goes the test. Simples!

In even better news, it was lunch time. Well, actually, that was only sort of good news. A few days ago I cooked a rather nice chicken and berry pulao. It came out pretty good, but cooking 6 portions worth of it when I’m home alone for the week wasn’t so smart. I don’t want to see chicken pulao for a couple of months now. Anyway, while I was devouring some of my pulao mountain, I set off a spectest run on the 32KB nursery stress build of MoarVM, just to see if it showed up anything.

Putrid pointers

A couple of failures did, in fact, show up, one of them in constant.t. This rang a bell. I was sure somebody had, in the last couple of weeks, reported a crash in that test file, which had then vanished. I checked in with the person who I vaguely recalled mentioning it and…sure enough, it was that very test file. In their normal test runs, the bug had long since vanished. I figured, having now got a reproduction of it, I should probably hunt it down right away. Otherwise, we’d probably end up playing “where’s Wally” with it for another month or ten.

So, how did the failure look?

$ ./perl6-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
Segmentation fault (core dumped)

It actually segfaulted while compiling the test file. Sad! So, where?

$ ./perl6-gdb-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
[boring output omitted]
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()

That looks…ungood. That final line is meant to be a code location, which means something decided to tell the CPU to go execute code living at the NULL address. At this point, things could go two ways: the JIT spat out something terrible, or a function pointer somewhere was NULL. But which?

(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff78cacbc in MVM_coerce_smart_stringify (tc=0x6037c0, obj=0x605c10, res_reg=0x56624d8)
    at src/core/coerce.c:214
#2  0x00007ffff789dff4 in MVM_interp_run (tc=tc@entry=0x6037c0, initial_invoke=0x60ea80, 
    invoke_data=0x56624d8) at src/core/interp.c:827
#3  0x00007ffff7978b21 in MVM_vm_run_file (instance=0x603010, 
    filename=0x7fffffffe09f "/home/jnthn/dev/rakudo/perl6.moarvm") at src/moar.c:309
#4  0x000000000040108b in main (argc=9, argv=0x7fffffffdbc8) at src/main.c:192

Phew, it looks like the second case, given there’s no JIT entry stub on the stack. So, we followed a NULL function pointer. Really?

(gdb) frame 1
#1  0x00007ffff78cacbc in MVM_coerce_smart_stringify (tc=0x6037c0, obj=0x605c10, res_reg=0x56624d8)
at src/core/coerce.c:214
214     ss = REPR(obj)->get_storage_spec(tc, STABLE(obj));

Yes, really. Presumably, that get_storage_spec is bogus. (I did a p to confirm it.) So, how is obj looking?

(gdb) p *obj
$1 = {header = {sc_forward_u = {forwarder = 0x48000000000001, sc = {sc_idx = 1, idx = 4718592}, 
  st = 0x48000000000001}, owner = 6349760, flags = 0, size = 0}, st = 0x6d06c0}

Criminally corrupt; let me count the ways. For one, 6349760 looks like a very high thread ID for a program that’s only running a single thread (they are handed out sequentially). For two, 0 is not a valid object size. And for three, idx is just a nuts value too (even Rakudo’s CORE.setting isn’t made up of 4 million objects). So, where does this object live? Well, let’s try out last week’s handy object locator to find out:

(gdb) p MVM_gc_debug_find_region(tc, obj)
In tospace of thread 1

Well. Hmpfh. That’s actually an OK place for an object to be. Of course, the GC spaces swap often enough at this nursery size that a pointer could fail to be updated, point into fromspace after one GC run, not be used until a later GC run, and then come to point into some random bit of tospace again. How to test this hypothesis? Well, instead of 32768 bytes of nursery, what if I make it…well, 40000 maybe?

Here we go again:

$ ./perl6-gdb-m -Ilib t/spec/S04-declarations/constant.rakudo.moar 
[trust me, this omitted stuff is boring]
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff78b00db in MVM_interp_run (tc=tc@entry=0x6037c0, initial_invoke=0x0, invoke_data=0x563a450)
    at src/core/interp.c:2855
2855                    if (obj && IS_CONCRETE(obj) && STABLE(obj)->container_spec)

Aha! A crash…somewhere else. But where is obj this time?

(gdb) p MVM_gc_debug_find_region(tc, obj)
In fromspace of thread 1

Hypothesis confirmed.

Dump diving

So…what now? Well, just turn on that wonderful MVM_GC_DEBUG flag and the bug will make itself clear, of course. Alas, no. It didn’t trip a single one of the sanity checks added by enabling the flag. So, what next?

The where in gdb tells us where in the C code we are. But what high level language code was MoarVM actually running at the time? Let’s dump the VM frame stack and find out:

(gdb) p MVM_dump_backtrace(tc)
   at <unknown>:1  (./blib/Perl6/Grammar.moarvm:initializer:sym<=>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:initializer)
 from src/Perl6/Grammar.nqp:3140  (./blib/Perl6/Grammar.moarvm:type_declarator:sym<constant>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:type_declarator)
 from <unknown>:1  (./blib/Perl6/Grammar.moarvm:term:sym<type_declarator>)
 from gen/moar/stage2/QRegex.nqp:1378  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/QRegex.moarvm:!protoregex)
 from src/Perl6/Grammar.nqp:3825  (./blib/Perl6/Grammar.moarvm:termish)
 from gen/moar/stage2/NQPHLL.nqp:886  (/home/jnthn/dev/MoarVM/install/share/nqp/lib/NQPHLL.moarvm:EXPR)
 from src/Perl6/Grammar.nqp:3871  (./blib/Perl6/Grammar.moarvm:EXPR)
...

I’ve snipped out a good chunk of a fairly long stack trace. But look! We were parsing and compiling a constant at the time of the crash. That’s somewhat interesting, and explains why constant.t was a likely test file to show this bug up. But MoarVM has little idea about parsing or Perl 6’s idea of constants. Rather, something on that codepath of the compiler must run into a bug of sorts.

Looking at the location in interp.c the op being interpreted at the time was decont, which takes a value out of a Scalar container, if it happens to be in one. Combined with knowing what code we were in, I can invoke moar --dump blib/Perl6/Grammar.moarvm, and then locate the disassembly of initializer:sym<=>.

There were a few uses of the decont op in that function. All of them seemed to be on things looked up lexically or dynamically. So, I instrumented those ops with a fromspace check. Re-compiled, and…

(gdb) break MVM_panic
Breakpoint 1 at 0x7ffff78a19a0: file src/core/exceptions.c, line 779.
(gdb) r
Starting program: /home/jnthn/dev/MoarVM/install/bin/moar --execname=./perl6-gdb-m --libpath=/home/jnthn/dev/MoarVM/install/share/nqp/lib --libpath=/home/jnthn/dev/MoarVM/install/share/nqp/lib --libpath=. /home/jnthn/dev/rakudo/perl6.moarvm --nqp-lib=blib -Ilib t/spec/S04-declarations/constant.rakudo.moar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, MVM_panic (exitCode=1, 
    messageFormat=0x7ffff799bc58 "Collectable %p in fromspace accessed") at src/core/exceptions.c:779
779 void MVM_panic(MVMint32 exitCode, const char *messageFormat, ...) {
(gdb) where
#0  MVM_panic (exitCode=1, messageFormat=0x7ffff799bc58 "Collectable %p in fromspace accessed")
    at src/core/exceptions.c:779
#1  0x00007ffff78ba657 in MVM_interp_run (tc=0x1, tc@entry=0x6037c0, initial_invoke=0x0, 
    invoke_data=0x604b80) at src/core/interp.c:374
#2  0x00007ffff7979071 in MVM_vm_run_file (instance=0x603010, 
    filename=0x7fffffffe09f "/home/jnthn/dev/rakudo/perl6.moarvm") at src/moar.c:309
#3  0x000000000040108b in main (argc=9, argv=0x7fffffffdbc8) at src/main.c:192

And what’s in interp.c around that line? The getdynlex op. That’s the one that is used to look up things like $*FOO in Perl 6. So, a dynamic lexical lookup seemed to be handing back an outdated object. How could that happen?

Interesting idea is insufficient

My next idea was to see if I could catch the moment that something bad was put into the lexical. I’d already instrumented the obvious places with no luck. But…what if I could intercept every single VM register access and see if an object from fromspace was read? Hmmm… It turned out that I could make that happen with a sufficiently cunning patch. I made it opt-in rather than the default for MVM_GC_DEBUG because it’s quite a slow-down. I’m sure that this will come in really useful for finding some GC bug some day. But for this bug? It was no direct help.

It was of some indirect help, however. It suggested strongly that at the time the lexical was set (actually, it turned out to be $*LEFTSIGIL), everything was valid. Somewhere between then and the lookup of it using the getdynlex op, things went bad.

Cache corruption

So what does getdynlex actually do? It checks if the current frame declares a lexical of the specified name. If so, it returns the value. If not, it looks in the caller for the value. If that fails, it goes on to the caller’s caller, until it runs out of stack to search and then gives up.

If that’s what it really did, then this bug would never have happened. But no, people actually want Perl 6 to run fast and stuff, so we can’t just implement the simplest possible thing and go chill. Instead, there’s a caching mechanism. And, as we all know, the two hardest problems in computer science are cache invalidation and cache invalidation.

The caching is relatively simple: each frame has slots for sticking a name, register pointer, and type in it.

MVMString   *dynlex_cache_name;
MVMRegister *dynlex_cache_reg;
MVMuint16    dynlex_cache_type;

When getdynlex finds something the slow way, it then looks down the stack again and finds a frame with an empty dynlex_cache_name. It then sticks the name of the dynamic lexical into the name slot, a pointer to the MoarVM lexical into the reg slot, and what type of lexical it was (native int, native num, object, etc.) into the type slot. The most interesting of these is the reg slot. The MVMRegister type is actually a union of different types that we may store in a register or lexical. We re-use the union for registers that live while the frame is on the callstack and lexicals that may need to live longer thanks to closures. So, each frame has two arrays of these:

MVMRegister *env;   /* The lexical environment */
MVMRegister *work;  /* Working space that dies with the frame */

And so dynlex_cache_reg ends up pointing into the env of whichever frame we found the lexical in.
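To make that concrete, here’s a toy model of the lookup and the cache fill. It uses plain C strings and a simplified frame struct in place of MVMString and MVMFrame, so it’s a sketch of the idea rather than MoarVM’s actual getdynlex code:

#include <stddef.h>
#include <string.h>

typedef struct Frame Frame;
typedef struct { void *o; } Register;   /* stand-in for MVMRegister */

struct Frame {
    Frame       *caller;
    const char **lex_names;          /* names of this frame's lexicals */
    Register    *env;                /* values of this frame's lexicals */
    size_t       num_lexicals;
    const char  *dynlex_cache_name;  /* cache: name that was looked up */
    Register    *dynlex_cache_reg;   /* cache: where it was found */
};

/* Walk the caller chain looking for name, consulting and filling the
 * per-frame cache along the way. */
Register *getdynlex(Frame *start, const char *name) {
    for (Frame *f = start; f; f = f->caller) {
        /* Cache hit: just a cheap pointer dereference. */
        if (f->dynlex_cache_name && strcmp(f->dynlex_cache_name, name) == 0)
            return f->dynlex_cache_reg;
        /* Slow path: search this frame's own lexicals. */
        for (size_t i = 0; i < f->num_lexicals; i++) {
            if (strcmp(f->lex_names[i], name) == 0) {
                Register *found = &f->env[i];
                /* Found the slow way: look down the stack again for a
                 * frame with an empty cache slot and record it there. */
                for (Frame *c = start; c; c = c->caller) {
                    if (!c->dynlex_cache_name) {
                        c->dynlex_cache_name = name;
                        c->dynlex_cache_reg  = found;
                        break;
                    }
                }
                return found;
            }
        }
    }
    return NULL;   /* ran out of stack to search */
}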

So, the big question: was the caching to blame? I shoved in a way to disable it and…the bug vanished.

Note by this point we’re up to two pieces that contribute to the bug: the GC and the dynamic lexical cache. The thing is, the dynamic lexical cache is used very heavily. My gut feeling told me there must be at least one more factor at play here.

Suspicious specialization

So, what could the other factor be? I re-enabled the cache, verified the crash came back, and then stuck MVM_SPESH_DISABLE=1 into the environment. And…no bug. So, it appeared that dynamic optimization was somehow involved too. That’s the magic that looks at what types actually show up at runtime, and compiles specialized versions of the code that fast-path a bunch of operations based on that (this specialization being where the name “spesh” comes from). Unfortunately, MVM_SPESH_DISABLE is a rather blunt instrument. It disables a huge range of things, leaving a massive surface area to consider for the bug. Thankfully, there are some alternative environment variables that just turn off parts of spesh.

First, I tried MVM_JIT_DISABLE=1, which results in spesh interpreting the specialized version of the code rather than turning it into machine code to remove the interpreter overhead. The bug remained.

Next, I tried MVM_SPESH_OSR_DISABLE, which disables On Stack Replacement. This is a somewhat fiddly optimization that detects hot loops as they are being interpreted, pauses execution, produces an optimized version of the code, and then recalculates the program counter so it points to the appropriate point in the optimized code and continues execution. Basically, the interpreter gets the code it’s executing replaced under it – perhaps with machine code, which the interpreter is instructed to jump into immediately. Since this also fiddles with frames “in flight”, it seemed like a good candidate. But…nope. Bug remained.

Finally, I tried MVM_SPESH_INLINE_DISABLE, which disables inlining. That’s where we spot a call to a relatively small subroutine or method, and just replace the call with the code of the sub or method itself, saving the cost of setting up callframes. And…bug vanished!

So, inlining was apparently a factor too. The trouble is, that also didn’t seem to add up to an obvious bug. Consider:

sub foo($a) {
    bar($a);
}
sub bar($b) {
    my $c = $b + 6;
    $c * 6
}

Imagine that bar was to be inlined into foo. Normally, foo (A below) and bar (B below) would have lexical slots in ->env as follows:

A:  | $_ | $! | $/ | $a |
B:  | $_ | $! | $/ | $b | $c |

The environment for the frame inline(A, B) would look like:

inline(A, B):  | $_ | $! | $/ | $a | $_ | $! | $/ | $b | $c |
                \---- from A ----/  \------- from B -------/

Now, it’s easy to imagine various bugs that could arise in the initial lookup of a dynamic lexical in such a frame. Indeed, the dynamic lexical lookup code actually has two bunches of code that deal with such frames, one for the case where the specialized code is being interpreted and one for the case where it has been JIT compiled. But by the time we are hitting the cache, there’s nothing smart going on at all: it’s just a cheap pointer dereference.

Dastardly deoptimization

So, it seems we need a fourth ingredient to explain the bug. By now, I had a pretty good idea what it was. MoarVM doesn’t just do optimizations based on properties it can prove will always hold. It can also do speculative optimization based on properties that it expects will probably hold up. For example, suppose we have:

sub foo($a, $b) {
    $b.some-huge-complex-call($a);
    return $a.Str;
}

Imagine we’re generating a specialization of this routine for the case where $a is an object of type Product. The Str method is tiny, so we go ahead and inline it. However, some-huge-complex-call takes all kinds of code paths. We can’t be sure, from our analysis, that it won’t, at some point, mix something in to the object in $a. What if it mixes in a role that has an alternate Str method? Our inlining would break stuff! We’d end up calling the Product.Str method, not the one from the mixin.

One reaction is to say “well, we’ll just not ever optimize stuff unless we can be REALLY sure”, which is either hugely limiting or relies on much more costly analyses. The other path, which MoarVM does, is to say “eh, let’s just assume mixins won’t happen, and if they do, we’ll fix things then!” The process of fixing things up is called deoptimization. We walk the call stack, rewriting return addresses to point to the original interpreted code instead of the optimized version of the code.

But what, you might wonder, do we do if there’s a frame on the stack that is actually the result of an inlining operation? What if we’re in the code that resulted from inline(A,B), in the bit that corresponds to the code of B? Well, we have to perform – you guessed it – uninlining! The composite call frame has to be dissected, and the call stack rewritten to look like it would have if we’d been running the original interpreted code. To do this, we’d create a call frame for B, complete with space for its lexicals, and copy the lexicals from inline(A,B) that belong to B into that new buffer.
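Using the A/B layout above, the copy step amounts to something like the following sketch; the types, offsets, and function name are my own invention for illustration, not MoarVM’s actual deopt code:

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Toy model of registers/lexicals; MVMRegister is really a union of the
 * possible value types, but a plain struct is enough to show the copy. */
typedef struct { void *o; } Register;

typedef struct {
    Register *env;       /* lexical environment */
    size_t    num_env;   /* number of lexical slots */
} Frame;

/* Uninlining step: B's lexicals occupy slots [b_start, b_start + b_count)
 * of the composite frame's env; give B its own frame and copy them over. */
Frame *uninline_frame_for_B(const Frame *composite, size_t b_start, size_t b_count) {
    Frame *b = malloc(sizeof(Frame));
    b->num_env = b_count;
    b->env = malloc(b_count * sizeof(Register));
    memcpy(b->env, composite->env + b_start, b_count * sizeof(Register));
    return b;
    /* Note: any dynamic lexical cache entry that pointed into
     * composite->env for one of B's lexicals is now stale. */
}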

The code that does this is one of the very few parts of MoarVM that frightens me.

For good reason, it turns out. This deoptimization, together with uninlining, was the final ingredient needed for the bug. Here’s what happened:

  1. The method EXPR in Perl6::Grammar was inlined into one of its callers. This EXPR method declares a $*LEFTSIGIL variable.
  2. While parsing the constant, the $*LEFTSIGIL is assigned to the sigil of the constant being declared, if it has one (so, in constant $a = 42 it would be set to $).
  3. Something does a lookup of $*LEFTSIGIL. It is located and cached. The cache entry points into a region of the ->env of the frame that inlined, and thus incorporated, the lexical environment of EXPR.
  4. At some point, a mixin happens, causing a deoptimization of the call stack. The frame that inlined EXPR gets pulled apart. A new EXPR frame comes to exist, with the lexicals that used to live in the composite frame copied into it. Execution continues.
  5. A GC happens. The object containing the $ substring moves. The new EXPR frame’s lexical environment is updated.
  6. Another lookup of $*LEFTSIGIL happens. It hits the cache. The cache, however, still points to the place the lexical used to live in the composite frame. This memory has not been freed, because the first part of it is still being used. However, the GC no longer cares about its contents because that content is unreachable. Therefore, the cache entry holds an outdated pointer, which leads to accessing memory that, by that point, is being used for something else entirely – and hence the eventual segmentation fault.

The most natural fix was to invalidate the cache during deoptimization.
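In sketch form, against the same kind of toy frame structure as above (and emphatically not the actual patch), the fix amounts to:

#include <stddef.h>

typedef struct Frame Frame;
struct Frame {
    Frame      *caller;
    const char *dynlex_cache_name;   /* cached dynamic lexical name */
    void       *dynlex_cache_reg;    /* cached pointer into some frame's env */
};

/* When deoptimization (and uninlining) rearranges frames, any cached
 * register pointer may be about to go stale, so drop the cache entries;
 * the next lookup pays the slow-path cost but gets the right answer. */
void invalidate_dynlex_cache(Frame *top) {
    for (Frame *f = top; f; f = f->caller) {
        f->dynlex_cache_name = NULL;
        f->dynlex_cache_reg  = NULL;
    }
}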

Lessons learned

The bug I wrote up last week was thanks to a comparatively simple oversight made within the scope of a few lines of a single C function. While this one could be fixed with a small amount of code added in a single file, the segfault arose from the interaction of four distinct features existing in MoarVM:

  • Garbage collection
  • Dynamic lexical lookup caching
  • Inlining
  • Deoptimization

Even when a segfault was not produced, thanks to “lucky” GC timing, the bug would lead to reading of stale data. It just so turned out that the data wasn’t ever stale enough in reality to break things on this particular code path.

All of garbage collection, inlining, and deoptimization are fairly complicated. By contrast, the dynamic lexical lookup cache is fairly easy. Interestingly, it was the addition of this easy feature that introduced the bug – not because the code that was added was wrong, but rather because it did something that some other far flung piece of code – the deoptimizer – had quietly relied on not happening.

So, what might be learned for the future?

The most immediate practical learning is that taking interior pointers into mutable data structures is risky. In this case, that data structure was a composite lexical environment, that later got taken apart. Conceptually, the environment was resized and the interior pointer was off the end of the new size. This suggests either providing a safe way to acquire such a reference, or an alternative design for the dynamic lexical cache to avoid needing to do so.

Looking at the bigger picture, this is all about managing complexity. Folks who work with me tend to observe I worry a good bit about loose coupling, to the degree that I’m much more hesitant than the typical developer when it comes to code re-use. Acutely aware that re-use means use, and use means dependency, and dependency means coupling, I tend to want things to prove they really are the same thing rather than just looking like they might be the same thing. MoarVM reflects this in various ways: to the degree I can, I try to organize it as a collection of parts that either keep themselves very much to themselves, or that collaborate over a small number of very stable data structures. One of the reasons Git works architecturally is because while all the components of it are messing with the same data structure, it’s a very stable and well-understood data structure.

In this bug, MVMFrame is the data structure in question. A whole load of components know about it and work with it because – so the theory went – it’s one of the key stable data structures of the VM. Contrast it with the design of things like the Unicode normalizer or the fixed size allocator, which nothing ever pokes into directly. These are likely to want to evolve over time to choose smarter data structures, or to get extra bits of state to cope with Unicode’s latest and greatest emoji boundary specification. Therefore, all work with them is hidden nicely behind an API.

In reality, MVMFrame has grown to contain a fair few things as MoarVM has evolved. At the same time, it’s treated as a known quantity by lots of parts of the codebase. This is only sustainable if every addition to MVMFrame is followed by considering how every other part of the VM that interacts with it will be affected by the change, and making compensating changes to those components. In this case, the addition of the dynamic lexical cache into the frame data structure was not accompanied by sufficient analysis of which other parts of the VM may need compensating changes.

The bug I wrote up last week isn’t really the kind that causes an architect a headache. It was a localized coding slip-up that could happen to anyone on a bad day. It’s a pity we didn’t catch it in code review, but code reviewers are also human. This bug, by contrast, arose as a result of the complexity of the VM – or, more to the point, insufficient management of that complexity. And no, I’m not beating myself up over this. But, as MoarVM architect, this is exactly the type of bug that catches my eye, and causes me to question assumptions. In the immediate term, it tells me what kinds of patches I should be reviewing really carefully. In the longer run, the nature of the MVMFrame data structure and its level of isolation from the rest of the codebase deserves some questioning.


Taking a couple of steps backwards to fix a GC bug

When I popped up with a post here on Perl 6 OO a few days ago, somebody noted in the comments that they missed my write-ups of my bug hunting and fixing work in Rakudo and MoarVM. The good news is that the absence of posts doesn’t mean an absence of progress; I’ve fixed dozens of things over the last months. It was rather something between writer’s block and simply not having the energy, after a day of fixing things, to write about it too. Anyway, it seems I’ve got at least some of my desire to write back, so here goes. (Oh, and I’ll try and find a moment in the coming days to reply to the other comments people wrote on my OO post too.)

Understanding a cryptic error

There are a number of ways MoarVM can come tumbling down when memory gets corrupted. Some cases show up as segmentation faults. In other cases, the VM comes across something that simply does not make any kind of sense and can infer that memory has become corrupted. Two panics commonly associated with this are “zeroed target thread ID in work pass” and “invalid thread ID XXX in GC work pass”, where XXX tends to be a sizable integer. At the start of a garbage collection – where we free up memory associated with dead objects – we do something like this:

  1. Go through all the threads that have been started, and signal those that are not blocked (e.g. waiting for I/O, a lock acquisition, or for native code to finish) to come and participate in the garbage collection run.
  2. Assign each non-blocked thread itself to work on.
  3. Assign each blocked thread’s work to a non-blocked thread.

So, every thread – blocked or not – ends up assigned to a running thread to take care of its collection work. It’s the participation of multiple threads that makes the MoarVM GC parallel (which is a different thing to having a concurrent GC; MoarVM’s GC can barely claim to be that).

The next important thing to know is that every object, at creation, is marked with the ID of the thread that allocated it. This means that, as we perform GC, we know whether the object under consideration “belongs” to the current thread we’re doing GC work for, or some other one. In the case that the ID in the object header doesn’t match up with the thread ID we’re doing GC work for, then we stick it into a list of work to pass off to the thread that is responsible. To avoid synchronization overhead, we pass them off in batches (so there’s only synchronization overhead per batch). This is far from the only way to do parallel GC (other schemes include racing to write forwarding pointers), but it keeps the communication between participating threads down and leaves little surface area for data races in the GC.
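In rough C, with toy types standing in for MoarVM’s real collectable header and work lists, and with batching and synchronization elided, the per-object decision looks something like this:

#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64

/* Toy collectable header: every object records the thread that allocated it. */
typedef struct {
    uint32_t owner;
} Collectable;

/* Pending work to hand to another thread (batch flushing elided). */
typedef struct {
    Collectable **items;
    size_t        count;
    size_t        alloc;
} Worklist;

/* During a work pass for thread my_tid: process what we own, pass the rest
 * to the owning thread, and treat bad owner IDs as memory corruption. */
void consider(Collectable *obj, uint32_t my_tid, Worklist *per_thread) {
    if (obj->owner == 0 || obj->owner >= MAX_THREADS) {
        /* This is the situation behind "zeroed target thread ID in work
         * pass" and "invalid thread ID XXX in GC work pass". */
        return;
    }
    if (obj->owner == my_tid) {
        /* mark/copy the object and scan its references */
    } else {
        Worklist *wl = &per_thread[obj->owner];
        if (wl->count < wl->alloc)
            wl->items[wl->count++] = obj;
        /* else: flush this batch to the owning thread first */
    }
}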

The funny thing is that if none of that really made any sense to you, it doesn’t actually matter at all, because I only told you about it all so you’d have a clue what the “work pass” in the error message means – and even that doesn’t matter much for understanding the bug I’ll eventually get around to discussing. Anyway, TL;DR version (except you did just read it all, hah!) is that if the owner ID in an object header is either zero or an out-of-range thread ID, then we can be pretty sure there’s memory corruption afoot. The pointer under consideration is either to zeroed memory, or to somewhere in memory that does not correspond to an object header.

So, let’s debug the panic!

Getting the panic is, perhaps, marginally better than a segmentation fault. I mean, sure, I’m a bit less embarrassed when Moar panics than SEGVs, and perhaps it’s mildly less terrifying for users too. But at the end of the day, it’s not much better from a debugging perspective. At the point we spot the memory corruption, we have…a pointer. That points somewhere wrong. And, this being the GC, it just came off the worklist, which is full of a ton of pointers.

If only we could know where the pointer came from, I hear you think. Well, it turns out we can: we just need to detect the problem some steps back, where the pointer is added to the worklist. In src/gc/debug.h there’s this:

#define MVM_GC_DEBUG 0

Flip that to a 1, recompile, and magic happens. Here’s a rather cut-down snippet from worklist.h:

#if MVM_GC_DEBUG
#define MVM_gc_worklist_add(tc, worklist, item) \
    do { \
        MVMCollectable **item_to_add = (MVMCollectable **)(item); \
        if (*item_to_add) { \
            if ((*item_to_add)->owner == 0) \
                MVM_panic(1, "Zeroed owner in item added to GC worklist"); \
                /* Various other checks here.... */ 
        } \
        if (worklist->items == worklist->alloc) \
            MVM_gc_worklist_add_slow(tc, worklist, item_to_add); \
        else \
            worklist->list[worklist->items++] = item_to_add; \
    } while (0)
#else
#define MVM_gc_worklist_add(tc, worklist, item) \
    do { \
        MVMCollectable **item_to_add = (MVMCollectable **)(item); \
        if (worklist->items == worklist->alloc) \
            MVM_gc_worklist_add_slow(tc, worklist, item_to_add); \
        else \
            worklist->list[worklist->items++] = item_to_add; \
    } while (0)
#endif

So, in the debug version of the macro, we do some extra checks – including the one to detect a zeroed owner. This means that when MoarVM panics, the GC code that is placing the bad pointer into the list is on the stack. Then it’s a case of using GDB (or your favorite debugger), sticking a breakpoint on MVM_panic (spelled break MVM_panic in GDB), running the code that explodes, and then typing where. In this case, I was pointed at the last line of this bit of code from roots.c:

void MVM_gc_root_add_frame_roots_to_worklist(MVMThreadContext *tc, MVMGCWorklist *worklist,
                                             MVMFrame *cur_frame) {
    /* Add caller to worklist if it's heap-allocated. */
    if (cur_frame->caller && !MVM_FRAME_IS_ON_CALLSTACK(tc, cur_frame->caller))
        MVM_gc_worklist_add(tc, worklist, &cur_frame->caller);

    /* Add outer, code_ref and static info to work list. */
    MVM_gc_worklist_add(tc, worklist, &cur_frame->outer);

So, this tells me that the bad pointer is to an outer. The outer pointer of a call frame points to the enclosing lexical scope, which is how closures work. This provides a bit of inspiration for bug hunting; for example, it would now make sense to consider codepaths that assign outer to see if they could ever fail to keep a pointer up to date. The trouble is, for such an incredibly common language feature to be broken in that way, we’d be seeing it everywhere. It didn’t fit the pattern. In fact, my private $dayjob application that was afflicted with this and the whateverable set of IRC bots had something in common: both did a bunch of concurrency work and spawned quite a lot of subprocesses using Proc::Async.

But where does the pointer point to?

Sometimes I look at a pointer and it’s obviously totally bogus (a small integer usually suggests this). But this one looked feasible; it was relatively similar to the addresses of other valid pointers. But where exactly does it point to?

There are only a few places that a GC-managed object can live. They are:

  • In a thread’s tospace – that is, the region of memory for young objects that we allocate in, and copy objects into during a GC run
  • In a thread’s fromspace – that is, the region of memory for young objects that we evacuate and copy objects out of during a GC run
  • In one of the memory blocks that make up the old generation, where long-lived objects eventually end up

So, it would be very interesting to know if the pointer was into one of those. Now, I could just go examining it in the debugger, but with a dozen running threads, that’s tedious as heck. Laziness is of course one of the virtues of a programmer, so I wrote a function to do the search for me. Another re-compile, reproducing the bug in GDB again, and then calling that routine from the debugger told me that the pointer was into the tospace of another thread.
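In simplified form, such a search looks something like this; the field names and the fixed nursery size are illustrative, and the old generation’s pages are left out entirely:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (32 * 1024)   /* the stress-test size from the text */

typedef struct {
    uint32_t thread_id;
    char    *tospace;     /* base of the region we copy objects into */
    char    *fromspace;   /* base of the region we evacuate objects from */
} ThreadInfo;

/* Report which thread's nursery, if any, the pointer falls into. */
void find_region(const void *p, const ThreadInfo *threads, size_t n) {
    const char *c = p;
    for (size_t i = 0; i < n; i++) {
        if (c >= threads[i].tospace && c < threads[i].tospace + NURSERY_SIZE) {
            printf("In tospace of thread %u\n", (unsigned)threads[i].thread_id);
            return;
        }
        if (c >= threads[i].fromspace && c < threads[i].fromspace + NURSERY_SIZE) {
            printf("In fromspace of thread %u\n", (unsigned)threads[i].thread_id);
            return;
        }
    }
    printf("Not in any nursery (old generation, or not a GC-managed object)\n");
}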

Unfortunately, thinking is now required

Things get just a tad mind-bending here. Normally, when a program is running, if we see a pointer into fromspace we know we’re in big trouble. It means that the pointer points to where an object used to be, but was then moved into either tospace or the old generation. But when we’re in the middle of a GC run, the two spaces are flipped. The old tospace is now fromspace, the old fromspace becomes the new tospace, and we start evacuating living objects in to it. The space left at the end will then be zeroed later.

I should mention at this point that the crash only showed up a fraction of the time in my application. The vast majority of the time, it ran just fine. The odd time, however, it would panic – usually over a zeroed thread owner, but sometimes over a junk value being in the thread owner too. This all comes down to timing: different threads are working on GC, in different runs of the program they make progress at different paces, or get head starts, or whatever, and so whether the unused part of tospace has been zeroed yet or not will vary.

But wait…why didn’t it catch the problem even sooner?

When the MVM_GC_DEBUG flag is turned on, it introduces quite a few different sanity checks. One of them is in MVM_ASSIGN_REF, which happens whenever we assign a reference to one object into another. (The reason we don’t simply use the C assignment operator for that is because the inter-generational write barrier is needed.) Here’s how it looks:

#if MVM_GC_DEBUG
#define MVM_ASSIGN_REF(tc, update_root, update_addr, referenced) \
    { \
        void *_r = referenced; \
        if (_r && ((MVMCollectable *)_r)->owner == 0) \
            MVM_panic(1, "Invalid assignment (maybe of heap frame to stack frame?)"); \
        MVM_ASSERT_NOT_FROMSPACE(tc, _r); \
        MVM_gc_write_barrier(tc, update_root, (MVMCollectable *)_r); \
        update_addr = _r; \
    }
#else
#define MVM_ASSIGN_REF(tc, update_root, update_addr, referenced) \
    { \
        void *_r = referenced; \
        MVM_gc_write_barrier(tc, update_root, (MVMCollectable *)_r); \
        update_addr = _r; \
    }
#endif

Once again, the debug version does some extra checks. Those reading carefully will have spotted MVM_ASSERT_NOT_FROMSPACE in there. So, if we used this macro to assign to the ->outer that had the outdated pointer, why did it not trip this check?

It turns out, because it only cared about checking if it was in fromspace of the current thread, not all threads. (This is in turn because the GC debug bits only really get any love when I’m hunting a GC bug, and once I find it then they go back in the drawer until next time around.) So, I enriched that check and…the bug hunt came to a swift end.

Right back to the naughty deed

The next time I caught it under the debugger was not at the point that the bad ->outer assignment took place. It was even earlier than that – lo and behold, inside of some of the guts that power Proc::Async. Once I got there, the problem was clear and fixed in a minute. The problem was that the callback pointer was not rooted while an allocation took place. The function MVM_repr_alloc_init can trigger GC, which can move the object pointed to by callback. Without an MVMROOT to tell the GC where the callback pointer is so it can be updated, it’s left pointing to where the callback used to be.
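The shape of the fix is the usual MVMROOT pattern. The snippet below sketches that pattern rather than reproducing the actual patch; async_task and some_type are stand-in names:

/* Before (sketch): MVM_repr_alloc_init may trigger GC, which can move the
 * object callback points to, leaving callback dangling. */
MVMObject *async_task = MVM_repr_alloc_init(tc, some_type);

/* After (sketch): rooting callback tells the GC where the pointer lives,
 * so it gets updated if the object moves during the allocation. */
MVMROOT(tc, callback, {
    async_task = MVM_repr_alloc_init(tc, some_type);
});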

So, bug fixed, but you may still be wondering how exactly this bug could have led to a bad ->outer pointer in a callframe some way down the line. Well, callback is a code object, and code objects point to an outer scope (it’s actually code objects that we clone to make closures). Since we held on to an outdated code object pointer, it in turn contained an outdated pointer to the outer frame it closed over. When we invoked callback, the outer from the code object would be copied to be the outer of the call frame. Bingo.

Less is Moar

The hard part about GCs is not just building the collector itself. It’s that collectors bring invariants that are to be upheld, and a momentary lapse in concentration by somebody writing or reviewing a patch can let a bug like this slip through. At least 95% of the time when I handwavily say, “it was a GC bug”, what I really mean was “it was a bug that arose because some code didn’t play by the rules the GC requires”. A comparatively tiny fraction of the time, there’s actually something wrong in the code living under src/gc/.

People sometimes ask me about my plans for the future of MoarVM. I often tell them that I plan for there to be less of it. In this case, the code with the bug is something that I hope we’ll eventually write in, say, NQP, where we don’t have to worry about low-level details like getting write barriers correct. It’s just binding code to libuv, a C library, and we should be able to do that using the MoarVM native calling support (which is likely mature enough by now). Alas, that also has its own set of costs, and I suspect we’d need to improve native calling performance to not come out at a measurable loss, and that means teaching the JIT to emit native calls, but we only JIT on x64 so far. “You’re in a maze of twisty VM design trade-offs, and their funny smells are all alike.”
