Racing to writeness to wrongness leads

In the Perl world I’m mostly known as a guy who hacks on Perl 6 stuff. Less known is that outside of the Perl world, I spend a lot of my time with the .Net platform. C#, despite a rather uninspiring initial couple of releases, has escaped Java-think and grown into a real multi-paradigm language. It’s not Perl, but it’s certainly not unpleasant, and can even be a good bit of fun to work with nowadays. My work with it right now typically involves teaching, along with various mentoring and trouble-shooting tasks.

The Windows world has always been rather into threads – at least as long as I’ve been around. As a result, so is .Net. Want to do some computation in your GUI app? Well, better farm it off to a thread, so the main thread can keep the UI responsive to the user’s needs. Want to do network I/O? Well, that could probably use some asynchronous programming – and the completion handler will be run on some thread or other. Then the results will probably want marshaling somehow. (That used to hurt; now things are better.) Building a web application? Better learn to love threads. You don’t actually get any choice in the matter: having multiple request-processing threads is the unspoken, unquestioned default in a web application on .Net.

Of course, just because threads are almost ubiquitous doesn’t mean the average developer – or even the above-average developer – gets things right. A bunch of my recent trouble-shooting gigs have boiled down to dealing with a lack of understanding of multi-threaded programming. “So, we embed this 80s library in our web application, but things tend to crash under load.” “How do you deal with the fact that 80s library likely isn’t threadsafe?” “It…what?” “Oh, my…”

So anyway, back to Perl 6. Last year, we managed to get Rakudo on the JVM. And, given we now ran on a VM where folks deploy heavily threaded software every day, and with no particular progress to be seen on concurrency in Perl 6 for years, I did what I usually seem to end up doing: got fed up with the way things were and figured I should try to make them better. Having spent a bunch of years working with and teaching about parallel, asynchronous, and concurrent programming outside of the Perl world, it was time for worlds to collide.

And oh hell, collide they do. Let’s go word counting:

my %word_counts;
for @files -> $filename {
    for slurp($filename).words {
        %word_counts{$_}++;
    }
}

OK, so that’s the sequential implementation. But how about one that processes the files in parallel? Well, there are a bunch of files, and handling each file separately seems like a natural way to parallelize the work. So, here goes:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            %word_counts{$_}++;
        }
    }
}

Here, start creates a Promise, which is kept when the code inside of it completes. That work is scheduled to be done on the thread pool, and the code calling start continues onward, creating another Promise for the next file. Soon enough, the thread pool’s input queue is nicely occupied with work, and threads are chugging through it. The loop is in a context that means it produces results – the Promise objects – thanks to our use of the do keyword. We give them to await, which waits for them all to get done. Perfect, right?

Well, not so fast. First of all, are hashes thread safe? That is, if I try to write to a hash from multiple threads, what will happen? Well, good question. And the answer, if you try this out on Rakudo on JVM today, is that in all likelihood you’ll end up with a hosed hash. OK. Go on. Say what you’re thinking. Here’s one guess at a common response: “But…but…but…OH NO WE MUST MAKE IT WORK, this is Perl, not Java, dammit!” Well, OK, OK, let’s try to make it work…
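
First, though, it’s easy enough to watch it go wrong for yourself (a sketch; the exact failure mode – lost counts, a mangled hash, or a crash – varies from run to run):

my %h;
await do for ^4 {
    start { %h{$_}++ for ^10_000 }
}
# Four threads each bump keys 0..9999, so every count should be 4 and
# elems should be 10_000. With unsynchronized access, don't bet on it.
say %h.elems;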

So the hash ain’t threadsafe. Let’s go put implicit locking in hash access. We’ll slow down everything for it, but maybe with biased locking it won’t be so bad. Maybe we can build a smart JIT that invalidates the JITted code when you start a thread. Maybe escape analysis will save the common cases, because we can prove that we’ll never share things. Maybe we can combine escape analysis and trace JIT! (Hey, anybody know if there’s a paper on that?) Heck, we gotta build smart optimizations to make Perl 6 perform anyway…

So anyway, a patch or two later and our hashes are now nicely thread safe. We’re good, right? Well, let’s run it and…ohhhh…wrong answer. Grr. Tssk. Why, oh why? Well, look at this:

%word_counts{$_}++;

What does the post-increment operator do? It reads a value out of a scalar, gets its successor, and shoves the result back in the scalar. Two threads enter. Both read a 41. Both add 1. Both store 42. D’oh. So, how do we fix this? Hm. Well, maybe we could make ++ take a lock on the scalar. Now we’re really, really going to need some good optimization, if we ever want tight loops doing ++ to perform. Like, inlining and then lifting locks…if we can get away with it semantically. Or one of the tricks mentioned earlier. Anyway, let’s suppose we do it. Hmm. For good measure, maybe we’d better ponder some related cases.

%word_counts{$_} += 1;

Not idiomatic here, of course, but we can easily imagine other scenarios where we want something like this. So, we’d better make all the assignment meta-ops lock the target too…uh…and hold the lock during the invocation of the + operator. Heck, maybe we can skip locks in the spin-lock or mutex sense and go with optimistic concurrency control instead, given + is pure and we can always retry it if it fails. So, fine, that’s the auto-increment and the assignment meta-ops sorted. But wait…what about this:

%word_counts{$_} = %word_counts{$_} + 1;

Well, uhh…darn. I dunno. Maybe we can figure something out here, because having that behave differently from the += case feels really darn weird. But let’s not get bogged down with side problems; let’s get back to our original one. My hash is thread safe! My ++ is atomic, by locks, or some other technique.
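
As for that optimistic “retry if it fails” idea, its shape would be something like this – with a hypothetical cas (compare-and-swap) primitive, every bit as imaginary as the atomic block further down:

sub atomic-incr($container is rw) {
    loop {
        my $seen = $container;
        # cas is hypothetical here: it stores $seen + 1 only if the
        # container still holds $seen, and reports what it actually held.
        last if cas($container, $seen, $seen + 1) === $seen;
        # Otherwise somebody beat us to it: retry against the new value.
    }
}

Anyway, suppose we get all of that sorted, by locks or by retries. We’re good now, aren’t we?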

Nope, still not. Why? Turns out, there’s a second data race on this line:

%word_counts{$_}++;

Why does this work when we never saw the word before? Auto-vivification, of course. We go to look up the current scalar to auto-increment it. But it doesn’t exist. So we create one, but we can’t install it unless we know it will be assigned; just looking for a key shouldn’t make it come to exist. So we put off the installation of the scalar in the hash until it’s assigned. So, two threads come upon the word “manatee”. Both go and ask the hash for the scalar under that key. Access to the hash is already protected, so the requests are serialized (in the one-after-the-other sense). The hash each time notices that there’s no scalar in that slot. It makes one, attaches to it the fact that it should be stored into the hash if the scalar is assigned to, and hands it back. The ++ sees the undefined value in the scalar, and sticks a 1 in there. The assignment causes the scalar to be bound to the hash…uh…wait, that’s two scalars. We made two. So, we lose a word count. Manatees may end up appearing a little less popular than dugongs as a result.

[Image: hue-manatee]

How do we fix this one? Well, that’s kinda tricky. At first, we might wonder if it’s not possible to just hold some lock on something for the whole line. But…how do we figure that out? Trying to work out a locking scheme for the general case of auto-viv – once we mix it with binding – feels really quite terrifying, as this REPL session reveals:

> my %animals; my $gerenuk := %animals<gerenuk>; say %animals.perl;
().hash
> $gerenuk = 'huh, what is one of those?'; say %animals.perl;
("gerenuk" => "huh, what is one of those?").hash

So, what’s my point in all of this? Simply, that locking is not just about thread safety, but also about the notion of transaction scope. Trying to implicitly lock stuff to ensure safe mutation on behalf of the programmer means you’ll achieve thread safety at a micro level. However, it’s very, very unlikely that will overlap with the unspoken and uncommunicated transaction scope the programmer had in mind – or didn’t even know they needed to have in mind. What achieving safety at the micro level will most certainly achieve, however, is increasing the time it takes for the programmer to discover the real problems in their program. If anything, we want such inevitably unreliable programs to reliably fail, not reliably pretend to work.

I got curious and googled for transaction scope inference, wondering if there is a body of work out there on trying to automatically figure these things out. My conclusion is that either it’s called something else, I’m crap at Google today, or I just created a thankless PhD topic for somebody. (If I did: I’m sorry. Really. :-) My hunch is that the last of those is probably the case, though. Consider this one:

while @stuff {
    my $work = @stuff.pop;
    ...
}

Where should the implicit transaction go here? Well, it should take in the boolification of @stuff and the call to pop. So any such general analysis is clearly inter-statement, except that we don’t want to hard-code it for boolification and popping, so it’s interprocedural, but then method calls are late-bound, so it’s undecidable. Heck, it’d be that way even in boring Java. With Perl you can go meta-programming, and then even your method dispatch algorithm might be late bound.
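
The programmer, on the other hand, can state the intended scope directly. Here’s a sketch using the Lock type from the concurrency work, whose protect method runs a block with the lock held:

my $lock = Lock.new;
loop {
    # The emptiness test and the pop form one explicitly chosen scope;
    # no other thread taking the same lock can interleave with them.
    my $work = $lock.protect({ @stuff ?? @stuff.pop !! Nil });
    last unless $work.defined;
    ...
}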

At this point, we might ponder software transactional memory. That’s very much on-topic, and only serves to reinforce my point: in STM, you’re given a mechanism to define your transaction scope:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            # THE FOLLOWING IS IMAGINARY SYNTAX. No, I can't 
            # hack you up a prototype down the pub, it's *hard*!
            atomic { %word_counts{$_}++ }
        }
    }
}

This looks very nice, but let’s talk about the hardware for a bit.

Yes, the hardware. The shiny multi-core thing we’re trying to exploit in all of this. The thing that really, really, really, hates on code that writes to shared memory. How so? It all comes down to caches. To make this concrete, we’ll consider the Intel i7. I’m going to handwave like mad, because I’m tired and my beer’s nearly finished, but if you want the gory details see this PDF. Each core has a Level 1 cache (actually, two: one for instructions and one for data). If the data we need is in it, great: we stall for just 4 cycles to get hold of it. The L1 cache is fast, but also kinda small (generally, memory that is fast needs more transistors per byte we store, meaning you can’t have that much of it). The second level cache – also per core – is larger. It’s a bit slower, but not too bad; you’ll wait about 10 cycles for it to give you the data. (Aside: modern performance programming is thus more about cache efficiency than it is about instruction count.) There’s then a level 3 cache, which is shared between the cores. And here’s where things get really interesting.

As a baseline, a hit in the level 3 cache is around 40 cycles if the memory is unshared between cores. Let’s imagine I’m a CPU core wanting to write to memory at 0xDEADBEEF. I need to get exclusive access to that bit of memory in order to do so. That means before I can safely write it, I need to make sure that any other core with it in its caches (L1/L2) tosses what it knows, because that will be outdated after my write. If some other core shares it, the cost of obtaining the cache line from L3 goes up to around 65 cycles. But what if the other core has modified it? Then it’s around 75 cycles. From this, we can see that pretty much any write to shared memory, if another core was last to write, is going to be incurring a cost of around 75 cycles. Compare that to just several cycles for unshared memory.

So how does our approach to parallelizing our word count look in the light of this? Let’s take a look at it again:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            %word_counts{$_}++;
        }
    }
}

Locks are just memory, so if we inserted those automatically – even if we did work out a good way to do so – then taking the lock is a shared memory write. That’s before we go updating the memory associated with the hash table to install entries, and the memory of the scalars to update the counts. What if we STM it? Even if we keep modifications in a local modification buffer, we still have to commit at some point, and that’s going to have to be a write to shared memory. In fact, that’s the thing that bothers me about STM. It’s a really, really great mechanism – way superior to locks, composable, and I imagine not too hard to teach – but its reason for existing is to make writes to shared memory happen in a safe, transactional, way. And it’s those writes that the hardware makes costly. Anyway, I’m getting side-tracked again. The real point is that our naive parallelization of our program – even if we can find ways to make it work reliably – is a disaster when considered in the light of how the hardware works.

So, what to do? Here’s an alternative.

# Produce a word counts hash per file - totally unshared!
my @all_counts = await do for @files -> $filename {
    start {
        my %word_counts;
        for slurp($filename).words {
            %word_counts{$_}++;
        }
        %word_counts
    }
}

# Bring them together into a single result.
my %totals;
for @all_counts -> %counts {
    for %counts {
        %totals{.key} += .value;
    }
}
say %totals.elems;

Those familiar with map-reduce will probably have already recognized the pattern here. The first part of the program does the work for each file, producing its own word count hash (the map). This is completely thread local. Afterwards, we bring all of the results together into a single hash (the reduce). This does involve reads of data written by another thread, of course. But that’s the cheaper case, and once we get hold of the cache lines with the hash and scalars, and start to chug through them, we’re not going to be competing for them with anything else.

Of course, the program we get at the end is a bit longer. However, it’s also not hard to imagine having some built-ins that make patterns like this shorter to get in place. In fact, I think that’s where we need to be expending effort in the Perl 6 concurrency work. Yes, we need to harden MoarVM so that you can’t segfault it even if you do bad things. Yes, we should write a module that introduces a monitor keyword, for declaring a class that automatically takes a lock around each of its method calls:

monitor ThreadSafeLoggingThingy {
    has @!log;

    method log($msg) {
        push @!log, $msg;
    }

    method latest($n) {
        $n < @!log
            ?? @!log[*-$n .. *]
            !! @!log[]
    }
}
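
Under the hood, that’s little more than sugar for taking a lock around each method body. A hand-written equivalent might look like this (a sketch, again using Lock):

class ThreadSafeLoggingThingy {
    has @!log;
    has $!lock = Lock.new;

    method log($msg) {
        $!lock.protect({ push @!log, $msg });
    }

    method latest($n) {
        $!lock.protect({
            $n < @!log
                ?? @!log[*-$n .. *]
                !! @!log[]
        });
    }
}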

Yes, we should do an Actor one too. We could even provide a trait:

my @a is monitor;

Which would take @a and wrap it up in a monitor that locks and delegates all its calls to the underlying array. However, by this point, we’re treading dangerously close to forgetting the importance of transaction scope. At the start of the post, I told the story of the hopelessly unsafe calls to a legacy library from a multi-threaded web application. I had it hunted down and fixed in a morning because it exploded, loud and clear, once I started subjecting it to load tests. Tools to help find such bugs exist. By contrast, hunting bugs in code that is threadsafe, non-explosive, but subtly wrong in the placing of its transaction boundaries, is typically long and drawn out – and it’s where automated tools can help less.

In closing, we most certainly should take the time to offer newbie-friendly concurrent, parallel, and asynchronous programming experiences in Perl 6. However, I feel that needs to be done by guiding folks towards safe, teachable, understandable patterns of a CSP (Communicating Sequential Processes) nature. Perl may be about Doing The Right Thing, and Doing What I Mean. But nobody means their programs to do what the hardware hates, and the right thing isn’t to make broken things sort-of-work by sweeping complex decisions on transaction scope under the carpet. “I did this thing I thought was obvious and it just blew up,” can be answered with, “here’s a nice tutorial on how to do it right; ask if you need help.” By contrast, “your language’s magic to make stuff appear to work just wasted days of my life” is a sure-fire way to get a bad reputation among the competent. And that’s the last thing we want.


Optimization, concurrency, and Moar

It’s been a while since I wrote an update here. Happily, timotimo has taken up the role of being our weekly Perl 6 reporter, so there’s a good place to follow for regular updates. However, I wanted to take a moment to provide the bigger picture of what’s been happening in the last couple of months, share my perspectives on it, and talk a bit about where things are headed from here.

Optimization, optimization, optimization!

A lot of recent effort has gone on optimization. NQP, the subset of Perl 6 that we use to implement much of the Rakudo Perl 6 compiler, has performance on MoarVM that starts to approach that of Perl 5, and on JVM sometimes exceeds that of Perl 5 for longer running things (it typically runs the forest fire benchmark from our benchmark suite faster once the JIT has had time to get going, for example). By contrast, Rakudo’s performance itself has been comparatively awful. Thankfully, things have been getting better, as we’ve worked to improve optimization, reduce costs of common things, and gradually begun to close the gap. This has involved work on both Rakudo and NQP’s optimization phases, along with work on improving the built-ins and some performance-oriented factoring changes. There’s still plenty of work to go, but anybody using Rakudo on MoarVM will likely feel the results in the next release. To give an idea of the improvement in HEAD Rakudo on MoarVM, which will appear in the April monthly:

  • Array and hash access is more than 3 times faster
  • Most multi-dispatches are now enormously cheaper
  • Many, many unrequired scalar allocations are optimized away
  • The forest fire benchmark on Rakudo can render twice as many frames per second
  • Compiler performance is around 15% better (estimated from CORE.setting compile time)

Compared to Rakudo on Parrot the difference is more marked. On compiler performance alone, the difference is enormous: you can build the whole of Rakudo on MoarVM on my box in around 80s (which I’m not happy with yet, though you rarely change so much that you have to do the whole thing). That is less time than it takes for Rakudo on Parrot to complete the parse/AST phases of compiling CORE.setting (the built-ins). Running the spectest suite happens in half the time. Both of these times are important because they influence how quickly those of us working on Perl 6 implementation can try out our changes. Unsurprisingly, most of us do the majority of our development on MoarVM first these days.

One consequence of this work is that Rakudo on MoarVM is often sneaking ahead of Rakudo on JVM on some benchmarks now, even once the mighty JVM JIT kicks in. This won’t last long, though; a couple of the optimizations done will not be a great deal of work to port to the JVM, and then it can reclaim its crown. For now! :-)

Introducing “spesh”

A whole other level of performance related work has been taking place in MoarVM itself. The first goal for the project was “just run NQP and Perl 6”, and the VM simply got on with interpreting the bytecode we threw at it. That doesn’t mean it wasn’t carefully designed along the way – just that the focus in terms of execution was to be simple, correct and sufficiently complete to serve as a solid Rakudo backend. With that goal achieved, the next phase of the project is underway: implementing dynamic optimization based on program information available at runtime, speculative optimizations that can be undone if things they rely on are broken, and so forth.

The first steps in that direction will be included in this month’s MoarVM release, and are to thank for much of the improved compiler performance (since the compiler is a program running on the VM too). The easiest way to get an overview is for me to walk you through the pieces in src/spesh in MoarVM.

  • graph is about building a representation of the bytecode (at a frame level) suitable for analysis and transformation (the two steps involved in optimization). It starts out by building a Control Flow Graph. It then computes the dominance tree, which it uses to rename variables so as to produce a Static Single Assignment form of the bytecode. This is a form whereby a given name is only written to once, which eases many, many aspects of analysis; there’s a miniature example of it just after this list.
  • args takes a tuple of incoming arguments, considers their types, arity, and so forth. It produces a set of guard clauses that indicate when a given specialization of the code applies (that is, a version of the code improved by making assumptions about what was passed), and then re-writes various argument access instructions to “unsafe” but fast ones that it can prove will always work out.
  • facts takes the graph, looks through it for sources of type information (including the incoming arguments) and does an initial propagation of that information through the graph. It creates usage counts to be later used in dead code elimination.
  • optimize takes this annotated graph, and makes a pass through it, applying a number of optimizations. Granted, there are not so many yet; so far we’ve mostly worked on getting to the point of having a framework to prove safety of transformations, and adding more of them comes next. However, those in place so far can do things like:
    • Resolve methods to avoid hash lookups
    • Install monomorphic method caching if that doesn’t work out
    • Resolve type checks, possibly eliminating entire branches of code
    • Re-write attribute binds into “unsafe” pointer operations
    • Eliminate dead code
  • codegen takes the optimized graph and produces bytecode from it again. However, in the future (if we’re lucky, then hopefully through a GSoC project), this is where we would produce machine code instead.
  • deopt deals with the situation where some invariant that specialized code may be relying on gets broken, and yet that specialized code is still on the call stack. It walks the call stack, looking for specialized code on it and tweaking return addresses and other data structures so that when we return into the code, we’re back in the safe (though of course slower) unspecialized code that checks things as needed.
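
To make that SSA step concrete, here’s the idea in miniature, written as Perl-ish pseudocode rather than actual MoarVM bytecode:

# Before SSA:          # After SSA renaming:
my $x = 1;             my $x1 = 1;
$x = $x + 2;           my $x2 = $x1 + 2;
say $x;                say $x2;

Because each SSA name is written exactly once, a fact the optimizer derives about a name – its type, say, or its concreteness – can never be invalidated by a later assignment to it.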

By and large, this is essentially giving MoarVM a JIT. Of course, it’s not producing machine code yet, but rather JITing back to improved bytecode. While we tend to think of JITs primarily as “turn the program into machine code”, that’s really just one small part of any modern JIT. Analysis and specialization of the program before the machine code is produced is just as important; with this approach, we get to focus in on that part first and get some of the benefits now.

Concurrency

Progress on Perl 6 concurrency continues. The JVM implementation of the concurrency features has had various performance improvements since the March release, and MoarVM now has support for most of the Perl 6 concurrency features also. However, the MoarVM support for these features is early and most certainly not yet production ready. We’ll include it in the April release, but stick with Rakudo on the JVM for concurrency things if you’re doing anything that matters. If you just want to play with the basics, either will do.

Rakudo on MoarVM takes the spectest crown

Rakudo on its various backends holds the top spots on the Perl 6 specification test suite pass rate, and nowadays Rakudo on MoarVM has worked its way into the lead. How so? Because it has the best Unicode database introspection support, opening up a range of tests that no other backend handles yet. Additionally, because it gets right some of the Unicode things that Parrot does, but the JVM doesn’t. And, finally, because on top of that it can now pass a bunch of the concurrency tests.

A multi-backend Rakudo Star

I’d hoped we would get a Rakudo Star release with support for all three backends out in March. It didn’t happen; the module tests showed up some holes. We’ve by now largely fixed those for Rakudo on MoarVM, and we’re looking quite good for the April Rakudo Star coming with support for both Parrot and MoarVM. With some effort, I’m optimistic we’ll have a JVM Star release in good shape for April too. This will provide users who want the “batteries included” release a choice of backends, and, of note, give those using Parrot a chance to switch over to MoarVM, getting substantial performance improvements on most programs, along with lower startup time and memory use.

Where next?

In the coming months, I’m going to be focusing on the following areas:

  • More improvements to the Rakudo and NQP optimizers, so we generate better (faster, smaller) code.
  • More improvements to the Rakudo built-ins, so they operate more efficiently.
  • Making the MoarVM concurrency support more robust, and improving the parallel GC.
  • Advancing the state of asynchronous I/O, so we’ll have good support for it on both the JVM and MoarVM.
  • Teaching spesh to specialize better. There are a bunch of data access things that can be made cheaper, as well as being able to better optimize multiple dispatch. Beyond that, both inlining and escape analysis are on the menu.
  • Improving native type support, including providing native arrays.

I’ve been hidden away coding for much of the year so far, apart from putting in an appearance at FOSDEM. But I’m getting on the road soon! I’ll be at the Dutch, Polish and Czech Perl Workshops, and look forward to seeing folks and sharing what I’ve been working on. Hope to see some of you out there!


January Rakudo Compiler Release: MoarVM support and much more

This month’s Rakudo compiler release was cut today, and there’s a bunch of good stuff in there. In this post I’ll take a quick look at what’s been done.

MoarVM Support

This is the first Rakudo compiler release to have support for building and running on MoarVM, a new VM being built especially for Perl 6 and NQP (the Perl 6 subset a sizable chunk of Rakudo is written in). MoarVM support is not quite up to the level of the JVM and Parrot backends yet. It passes fewer specification tests than either of them – though it’s getting awfully close (Rakudo on MoarVM passes over 99% of the specification tests that Rakudo on the JVM – the current leader – does). Thus, you can actually run a heck of a lot of Perl 6 code just fine on it already. I used it recently in a pair programming session and we only hit one bug in the couple of hours we were using it.

The fast-path for signature binding that I mentioned in my previous post has also been put in place. It did, as hoped, lead to a fairly dramatic speedup. The workload of building Rakudo’s built-ins and running the specification test suite was also a good basis for doing some GC tuning, which led to further improvements. By this point, on my box, Rakudo on MoarVM now has:

  • The lowest startup time of any Rakudo backend
  • The shortest spectest time of any Rakudo backend
  • For the CORE.setting build and spectests, the smallest memory footprint of any Rakudo backend

Other Rakudo developers have reported similar findings. I need more time to look into the exact numbers, but it would appear that Rakudo on MoarVM is also the fastest to build. CORE.setting build time is roughly competitive with the JVM now (though how roughly seems to vary quite widely – I think it depends on which JVM, or even which version, is being used), but startup time for NQP on MoarVM is rather lower, meaning that those parts of the build go by faster.

The focus for the next month or two will be getting into a position where we can produce a Rakudo Star release that uses MoarVM. This means digging through the last 1% of failing spectests and dealing with them, finishing the work of getting Panda (our module installer) to work with Rakudo on MoarVM, and then hunting bugs that keep us from running the modules. Getting NativeCall working will also be a notable task, although given we already have a NativeCall in C working against 6model (the one we built for Parrot), there is a lot of prior art this time – unlike on the JVM.

On performance – we’re not even scratching the surface of what’s possible. MoarVM’s design means it has a lot of information to hand to do a good amount of runtime specialization and optimization, but none of this is implemented yet. I aim to have a first cut of it in place within the next few months. Once we have this analysis and specialization framework in place, we can start thinking about things such as JIT compilation.

Rakudo on JVM Improvements

Probably the best news in this release for anybody working with Rakudo on JVM is that the gather/take stack overflow bug is now fixed. It was a fun one involving continuations and a lack of tailcall semantics in an important place, but with doing the MoarVM implementation of continuations in my recent past, I was in a good place to hunt it down and get a fix in. A few other pesky issues are resolved, including a regex/closure interaction issue and sometimes sub-optimal line number reporting.

The other really big piece of JVM-specific progress this month has been arnsholt++ continuing to work on the plumbing to get us towards full NativeCall support for JVM. This month, a number of the big missing pieces landed. Getting NativeCall working, and with it the modules that depend on it, is the last big blocker for a Rakudo Star on JVM release, and it’s now looking quite possible that we’ll see that happen in the February one.

General Rakudo Improvements

While a lot of energy went on the things already mentioned, we did get some nice things in place that are independent of any particular backend: improvements to the Nil type, the sequence operator, sets and bags, adverb syntax parsing, regex syntax errors, aliased captures in regexes, and numerics. MoarVM’s rather stricter interpretation of closure semantics than we’ve had in place on other backends has also led to various code-gen fixes, which may lead to better performance in certain scenarios across the board too (one of those “I know it probably should, but I didn’t benchmark” situations).

I’d like to take a moment to thank everyone who contributed to this month’s release. This month had the highest Rakudo contributor count in a good while – and I’m hopeful we can maintain and exceed it in the months to come.


A Rakudo on MoarVM update

Almost exactly a month ago, I reported that Rakudo on MoarVM could do “Hello, world”. Despite the general slow-down as folks took Christmas and New Year breaks (myself very much included), progress has been very good. Here are the key things to know:

  • All of the Rakudo sanity tests pass on Rakudo on MoarVM
  • At present, Rakudo on MoarVM passes over 92% of the specification tests that the current leader (Rakudo on the JVM) does
  • We’re on course to have the January compiler release be the first Rakudo release with some level of MoarVM support; I’m hopeful we’ll be passing over 95% of the specification tests Rakudo on the JVM does by that point

The work has been taking place in the moar-support branch. We haven’t merged it yet, but that should happen within the next week or so.

MoarVM itself has been improving steadily, also. Here are a few of the features that have landed in the last weeks:

  • Block exit handlers, used to implement LEAVE and friends
  • Continuations, used to make gather/take properly lazy
  • An updated Unicode database (Unicode 6.3)
  • Sized native arrays
  • State variables

However, another extremely important bit of work has been taking place that is focused on stability. Nicholas Clark has been conducting garbage collection torture tests, by now down to forcing a GC run on every single object allocation and using memory protection to catch illegal accesses to moved objects. Most of the things we call GC bugs are not actually inside the garbage collector implementation, but rather are in other places in the codebase where mistakes have been made that break invariants the GC relies on to do its job correctly. I’ve not been counting, but I’d estimate that a bit over a dozen bugs have been turned up by this work so far – bugs that would have been a real pain to find if they had just happened to crop up some weeks, months or years down the line in some user’s code. At this point, NQP can be built and tested under the toughest torture test that exists so far without showing any issues. The Rakudo build is the current torture subject. I’m incredibly happy this work is taking place; it means that by the time Rakudo on MoarVM starts getting used more widely, we can be reasonably confident that users are unlikely to run into GC-related issues.

So, what’s the path from here? Here’s what I’m hoping for in the coming month:

  • We reach the point of passing more than 99% of the spectests that Rakudo on the JVM does
  • Progress towards Panda being able to run and install some modules (not sure we’ll get all the way there, but I hope we’ll get reasonably close)
  • Dealing with the very-slow-path signature binding. If you try anything on Rakudo on MoarVM, you’ll discover it’s slow. This is thanks to a hot path being implemented in a slow way, in the name of getting stuff correct before optimizing. As soon as we’re beyond the 99% point with the spectests, I’ll dig into this. It should make a dramatic difference. I’m aiming to do some work in this area to make things better for Rakudo on JVM also.
  • Whatever fixes are needed to get the Rakudo build and sanity tests surviving the GC torture

I’ll be back with some updates in a couple of weeks, to let you know how we’re getting on. :-)


A few quick updates

I’ve been rather snowed under with work in November, thus the absence of posts here. Sorry about that. It turns out the sales folks at $dayjob did their job a little too well, and besides my own heavy teaching load I had to step in to rescue a few other things. Anyway, that’s done now, and I’m back to having a bit more time. (It’s not that I had absolutely no time at all since the last post. It was just more important to spend the bits I had clearing blockers in the way of others rather than writing blog posts. :-)) So, some quick bits of news.

Concurrency

At the Austrian Perl Workshop, I worked with lizmat on getting the Perl 6 concurrency spec aligned with the things I’d been working on for Rakudo on JVM. Happily, this had some good consequences. Firstly, Larry did a load of work on it, vastly improving naming of things and giving some syntactic relief in various areas where I’d just done simple function and method APIs. Secondly, both lizmat and timotimo dug in to bring the implementation in line with these spec changes, doing some other improvements to boot. So, now the bus number on the Rakudo concurrency support has increased. You can find the latest spec here. Also, you can see my Nordic Perl Workshop slides about it.

Rakudo on MoarVM progress

Last time I wrote here, we had NQP support for MoarVM pretty much complete, and were ready to dig into working on getting Rakudo running on MoarVM. The first step was to get the core of the compiler itself building. This wasn’t a terrible amount of work, since the majority of it is just NQP code. There’s more of it than is found in the NQP compiler, but that aside it doesn’t do too much that’s new. Next came the Perl 6 MOP, which is written in NQP. Things went very smoothly with that. Beyond there, things got more interesting.

The next big piece to make work was the BOOTSTRAP. This uses the MOP to start piecing together the key types at the heart of Perl 6, doing enough so we can write the rest of the built-ins in Perl 6 itself. Most of it is one huge BEGIN block. Here, there were various unimplemented things to deal with, plus some VM-specific bits of Rakudo that needed porting. And beyond that lay…the setting. When I did the JVM port, we had around 14,000 lines of built-ins there. These days it’s closer to 17,000 – and that’s excluding the concurrency stuff. Since compiling the setting actually involves running little bits of Perl 6 code, thanks to traits and BEGIN blocks, getting through it means making quite a lot of things work. We got there a week or so back.

That still didn’t give us “Hello, world”, however. The setting does various bits of initialization work as it loads, which naturally hit more things that didn’t yet work. Finally, yesterday, we reached the point where setting loading worked and we could do “Hello, world”. Actually, once the setting loaded, this worked right off.

Of course, there’s still lots to do from here. The next step will be to work on the sanity tests. There’s a couple of large and complex features that will need some porting work; of note, gather/take will need doing. Also there are a couple of stability and algorithmic things that need taking care of in MoarVM itself. Building CORE.setting is easily the biggest workout it’s had so far, and naturally it highlighted a couple of issues. And, of course, beyond the sanity tests lie the spectests…

Better use of invokedynamic, and other optimization

I’ve got a talk at Build Stuff next week on invokedynamic, the instruction added to the JVM in JDK7 to support those implementing languages of a more dynamic nature. So, in the last week I spent some time tweaking our usage of it to get some further wins in various areas, and to make sure I was nicely familiar with it again in time for my talk. That work got merged into the mainline development branches today. I did some other optimizations along the way that are a win for all backends, too; lizmat noted a 12.5% decrease in spectest time. Not bad.


NQP gets MoarVM support, cursor reduction, and other news

I thought it was high time for a round-up of news about the Perl 6 related stuff that I’ve been working on – as usual, with the help of many others. So, here goes!

MoarVM Backend in NQP master

In the last progress update, I mentioned that MoarVM could host NQP and pass most of the NQP language test suite. We got to that point by using a cross-compiler running on Parrot to compile the NQP sources into MoarVM bytecode. Fast forward a month, and we’ve reached the point of having NQP bootstrapped on MoarVM. What’s the difference? Simply, that you can use NQP running on MoarVM in order to build an NQP from source. This means that both bytecode generation and serialization are now working. Since some of the source files we’re compiling are fairly large (4,000 lines), it’s also been a good hardening exercise; many bugs have been fixed.

The MoarVM support has now been added to the NQP repository. Just as you only need a JVM to build NQP for the JVM, you also only need MoarVM to build an NQP for MoarVM. That is, it’s bootstrapped and can stand alone now. NQP monthly releases from here on will thus come with support for three backends: Parrot, JVM and MoarVM.

Better still, NQP on MoarVM now passes the full NQP test suite. It also does so faster than any other backend. This doesn’t mean MoarVM is faster for all things; if you write NQP code with a tight integer loop, or something long running, the JVM’s JIT will kick in and typically come out ahead after the slow start. MoarVM just gets moving a lot faster than the JVM, which is what matters for running lots of short-lived tests.

This means we’re now ready to start work on getting Rakudo running on MoarVM. I don’t have any predictions just yet on precisely when we might land this; the rate of progress in the next couple of weeks, as we dig into it, will provide an indication. Reaching the point of being able to bootstrap NQP on MoarVM is a significant achievement along the way, though. While NQP is both smaller and simpler than full Perl 6, it still requires being able to execute Perl 6 grammars, eval code, do a whole range of OO-related things (including classes, roles and meta-programming), perform multiple dispatch, handle BEGIN time, and so on. These are all things that a full Perl 6 needs, so it’s very good to have them in place.

Allocating Fewer Cursors

Allocation profiling work by hoelzro and timotimo (I hope I remembered right; correct me if not!) indicated that both NQP and Rakudo were allocating a huge number of Cursor objects. So what are these? They are objects that keep track of parsing state. One is created as we enter a rule or token in the grammar, thrown away if the rule fails, or typically placed into a tree of cursors if it passes (though sometimes we only care about pass/fail, and so quickly throw it away again). Naturally, we’re going to allocate quite a few of these, but 2.14 million of them being allocated while parsing Rakudo’s CORE.setting seemed decidedly excessive.
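
To make that concrete: each time the engine enters TOP, greeting, or name below during the parse, a fresh Cursor is allocated to track the match state at that point (an illustrative toy, of course – the grammars in question are NQP’s and Rakudo’s own):

grammar Greeting {
    token TOP      { <greeting> \s+ <name> }
    token greeting { hello | hi }
    token name     { \w+ }
}
Greeting.parse('hello world');   # a handful of Cursor allocations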

I decided to spend some time trying to understand where they all came from. NQP itself tends to be a bit easier to analyze than Rakudo, so I started there, and added some instrumentation to record the number of times each production rule or token was hit and led to allocation of a Cursor. I then used this to gather statistics on a fairly large source file from the NQP build. It started out at 284,742 Cursor allocations.

The first big win from this came when I realized that a huge number of these Cursors were just allocated and returned, to indicate failure. A Cursor is born in a failed state, and many places were just creating them and returning them without any further work, to say “I failed to match”. Thing is, failed Cursor objects all look about the same. The only thing they differ by is type, and for each Cursor type we have a ParseShared instance. Thus, it wasn’t too hard to create a shared “fail Cursor” and just look it up in a bunch of places, rather than allocating. That shaved off over 50,000 allocations. A related realization about improving MARKED and MARKER led to another 30,000 or so allocations chopped. Note that none of this really leads to overall reduced memory usage; all of these objects were quickly collectable. But allocation makes GC run more often, and thus carries some amount of runtime cost.
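
The trick is just flyweight-style sharing of a single immutable “failed” instance. In miniature (illustrative Perl 6 – the real change lives in the NQP internals):

class DemoCursor {
    my %fail-for{Mu};    # one shared failure instance per Cursor type
    method fail-cursor() {
        # Allocate the failed Cursor at most once per type; every
        # later "I failed to match" hands back the same object.
        %fail-for{self.WHAT} //= self.new;
    }
}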

Further examination of the statistics showed hotspots that seemed unreasonable. The problem wasn’t that they allocated a Cursor needlessly, but rather that we should never have been calling the grammar rule in question so many times. This led to a number of grammar optimizations. In one case, just making sure that a certain rule got a declarative prefix meant the LTM mechanism could very quickly rule it out, saving 15,000 calls to the production rule, and thus cursors. The other big discovery was that the ident built-in was not having an NFA properly generated for it. This is notable because ident features in various paths through the grammar, and thus meant we were failing to quickly rule out a lot of impossible paths using the NFA (which is the cheap way to rule things out). With a few other tweaks, all told, we were down to 172,424 Cursor allocations, just 60% of what we started out with.
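
To picture what a declarative prefix buys, consider a made-up grammar (nothing from NQP itself). Literals and character classes are declarative, so they are compiled into the NFA and used to prune alternations without calling the production rules at all:

grammar Args {
    token word { \w+ }          # \w+ is declarative: it goes into the NFA
    token dash { '--' \w+ }     # as does the literal prefix '--'

    # Matching '--verbose': the NFA alone rules out <word>, since \w+
    # cannot match '-', so its production rule is never entered during
    # the longest-token decision.
    token arg  { <word> | <dash> }
}

say Args.parse('--verbose', :rule<arg>);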

The statistics for Rakudo’s CORE.setting showed we started out doing 2,196,370 Cursor allocations. Some of the fixes above also aided Rakudo directly, making a good cut into this number. However, analysis of the statistics revealed there were more wins to be had. So far, we’ve managed to bring it down to 1,231,085 – a huge reduction. There are likely a lot more wins to be had here, but I think we’ve picked all of the large, low-hanging fruit by now.

Rakudo Debugger on JVM

I spent some time getting the Rakudo Debugger to also work on the JVM. I’ve still got some work to do on making it easily buildable, so that it can be installed with Panda on the JVM like it can on Parrot. But the rest of the work is done, and I was able to step through programs, set breakpoints, examine variables and so forth. So, that’s another of the “things missing on the JVM port” mostly gone.

Rakudo on JVM spectest

I also did a little work to fix more of the spectests that pass on Rakudo Parrot, but fail on Rakudo JVM. The result: 99.9% of the spectests that pass on Rakudo Parrot also pass on Rakudo JVM. To put an exact number on it, that’s 27 tests different. Around half are Unicode related, which isn’t so surprising since we’ve borrowed Java strings for the time being; in the end, we’ll need to do an NFG implementation on the JVM. The others still need to be hunted down. I think it’s fair to say that you’re rather unlikely to run into this 0.1% in day-to-day usage of Rakudo on the JVM, however. In fact, the thing you’re most likely to miss today is NativeCall – which arnsholt has been working on, and I plan to join in with during the coming days.

All in all, things are coming along very nicely with Rakudo’s JVM support. It was just 11 months ago – not even a year – that I started to build a little bit of infrastructure so we’d be able to create a tree-like data structure in NQP and get JVM bytecode made from it. By now, thanks to the contributions of many besides me, Rakudo runs on the JVM and is very capable there. When it comes to concurrency, it’s the most capable Rakudo you can get. And, once it gets past the slow startup, it’s usually the fastest. It’ll be interesting to see where we stand on these things in another six months or a year’s time. Here’s hoping for another productive time ahead!


Material from the Rakudo and NQP Internals course

A little over a month ago, lizmat++ contacted my employer, Edument AB, to discuss using our training services to deliver a 2-day workshop aimed at current and potential Rakudo and NQP contributors. For those of you who don’t know, aside from working on Perl 6 I also work teaching/mentoring on various topics (typically software architecture, TDD, and advanced C# programming). The goal was for me to spend a couple of days explaining a bunch of topics – including NQP, grammars, QAST, 6model, and bounded serialization – to help people know where to start or to help them get into unfamiliar areas of the code.

The workshop took place this last weekend in Frankfurt. I’d expected we might have 5-6 people sign up; in the end, we had around 15 people attending! The workshop involved a combination of teaching and exercises, with plenty of chances to ask questions. Happily, there were tasty lunches and dinners too (though I can’t take any of the credit for this side of things). I greatly enjoyed teaching the course; I like both working on Perl 6 and doing my teaching work at Edument, and being able to do both at once was delightful! :-)

One aim of developing the course was to help with the state of documentation of the NQP and Rakudo internals. Typically at Edument, we only give out our material to those attending a delivery of our courses, and even then not the original source! However, for this course, we’ve made an exception and released the course material under a Creative Commons license. So, while I hope that we’ll be able to put together future “live” deliveries of the material for other new (or potential) contributors, this means it will now always be available to the community at large. :-)

I hope this will prove a valuable resource for those contributing or interested in contributing to Perl 6 development, and would like to take a moment to thank Edument for once again being supportive of my work on the Perl 6 project!
