Last week: smaller hashes, faster startup, and many fixes

I’m a little behind with writing up these weekly-ish reports, mostly thanks to attending and giving about 5 hours’ worth of talks/tutorials. That, along with spending time with the various other Perl 6 folks who were in town, turned out to be quite exhausting. Anyway, I’m gradually catching up on sleep. I’ll write up what I did in the first week of May in this post, and I’ll do one in a couple of days for the second week of May.

Shrinking hashes

In MoarVM, we’ve been using the UT Hash library since the earliest days of the VM. There was no point spending time re-implementing a common data structure for what, at the time, was decidedly an experimental project, especially when there was an option with an appropriate license and all contained in a single header file. While we started out with it fairly “vanilla”, we’ve already tweaked it in various ways (for example, caching the hash code in the MVMString objects). UT Hash does one thing we didn’t really need or want for Perl 6: retain the insertion order. Naturally, we paid for this: to the tune of a pointer per hash, plus two pointers per hash item (for a doubly linked list). I ripped these out (which involved re-writing a few things that depended on them), as well as gutting the library of other functionality we don’t need. This means hashes on MoarVM on 64-bit builds are now 8 bytes smaller per hash, and 16 bytes smaller per hash item. Since we use hashes for plenty of things, this is a welcome saving; it even produces a measurable decrease in Rakudo’s base memory use. This also paves the way for further improvements to hashes and string storage. (For what it’s worth, the time savings as a result of these changes were tiny compared to the memory savings: measurable under callgrind, but a drop in the ocean. Turns out a doubly linked list is a rather cheap thing to maintain.)

Decreasing startup time

I did a number of things to improve Rakudo’s startup time on MoarVM. One of them was changing the way we store string literals in the bytecode file. Previously, they were all stored as UTF-8. Now, those that are in the Latin-1 range are stored as Latin-1. Since this doesn’t need any kind of decoding or normalization, it is far cheaper to process them. This gave a 10% decrease in instructions executed at startup. Related to this, I aligned the serialization blob string heap with the bytecode file one, meaning that we don’t have to store a bunch of mappings between the two (another 2% startup win, and a decrease in bytecode file size, probably 100KB over all the bytecode files we load into memory as part of Rakudo).

Another 2% fell off when I realized that some careful use of static inlines and a slightly better design would lower the cost of temporary root handling (used to maintain the “GC knows where everything is” invariant). This helps not just startup, but will be an improvement in a lot of other code too.

Another small win came from noticing that we built the backend configuration data hash every time we were asked for it, rather than building it once and caching it. This was an easy fix, and saved building it 7 times over during Rakudo’s startup (churning out more garbage for the GC each time, of course). Finally, I did a little optimization of bounds checking when validating bytecode. These two improvements were both below the 1% threshold, though together added up to about that.
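
The fix itself is just the classic build-once-and-cache pattern; roughly this shape in Perl 6 terms (the names here are made up for illustration):

sub backend-config() {
    # Build the configuration hash on the first request only; the state
    # variable then keeps the cached version around for later calls.
    state %config = build-backend-config-hash();
    %config
}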

The revenge of NFG

Shortly after landing NFG, somebody reported a fairly odd output-related bug. After a while, it was isolated to UTF-8 encoding, and when I looked I saw…a comment saying the encoder code would need updating post-NFG. D’oh. So, I took care of that, saving us some memory along the way. I also noticed some mess with NULL-terminated string handling (MoarVM strings aren’t NULL-terminated, of course, but we do have to deal with the C world now and then), which I took the opportunity to get cleaned up.

Digging into concurrency bugs

Far too much of the concurrency work I did in Rakudo and the rest of the stack was conducted under a good amount of time pressure. Largely, I focused on getting things sufficiently in place that we could play with them and explore semantics. This has served us well; there was a chance to show running examples to the community at conferences and workshops and gather feedback, and for Larry and others to work on the API and language level of things. The result is that we’ve got a fairly nice bunch of concurrency features in the language now, but anybody who has tried to use them much will have noticed they behave in a rather flaky way in a number of circumstances. With NFG largely taken care of, I’m now going to be turning more of my attention to these issues.

Here are some of the early things I’ve done to start making things better:

  • Fixing a couple of bugs so the t/concurrency set of tests in NQP work much better on MoarVM
  • Hunting down and resolving an ABA problem in the MoarVM fixed size allocator free list handling, which occurred rarely but could cause fairly nasty memory corruption when it did (RT #123883)
  • Eliminating a bad assumption in the ThreadPoolScheduler which caused some oddities in Promise execution order (RT #123520)
  • Adding missing concurrency control to the multi-dispatch cache. While it was designed carefully to permit reads (concurrently with both other reads and additions), so you never need to lock when doing a multi-dispatch cache lookup, the additions need to be done one at a time (something foreseen when it was designed, but not handled). These additions are now serialized with a lock.

Other assorted bits

Here are a few other small things I did, worth a quick mention.

  • Fix and tests for RT #123641 and #114966 (cases where we got a block, but should have got an anonymous hash)
  • Fix RT #77616 (/a ~ (c) (b)/ reversed positional capture order)
  • Review and reject RTs #116525 and #117109 (out of line with decided flattening semantics)
  • Add test for and resolve RT #118785 (“use fatal” semantics were already fixed, but good to have an explicit test case)
  • Tests for RT #122756 and #114668 (already fixed mixin bugs)

This week: the big NFG switch on, and many fixes

During my Perl 6 grant time in the last week, I continued the work on Normal Form Grapheme, along with fixing a range of RT tickets.

Rakudo on MoarVM uses NFG now

Since the 24th April, all strings you work with in Rakudo on MoarVM are at grapheme level. This includes string literals in program source code and strings obtained through some kind of I/O.

my $str = "D\c[COMBINING DOT BELOW]\c[COMBINING DOT ABOVE]";
say $str.chars;     # 1
say $str.codes;     # 2

This sounds somewhat dramatic. Thankfully, my efforts to ensure my NFG work had good test coverage before I enabled it so widely meant it was, for most Rakudo users, something of a non-event. (Like so much in compiler development, you’re often doing a good job when people don’t notice that your work exists, and instead are happily getting on with solving the problems they care about.)

The only notable fallout reported (and fixed) so far was a bizarre bug that showed up when you wrote a one-byte file, then read it in using .get (which reads a single line), which ended up duplicating that byte and giving back a 2-character string! You’re probably guessing off-by-one at this point, which is what I was somewhat expecting too. However, it turned out to be a small thinko of mine while integrating the normalization work with the streaming UTF-8 decoder. In case you’re wondering why this is a somewhat tricky problem, consider two packets arriving to a socket that we’re reading text from: the bytes representing a given codepoint might be spread over two packets, as may the codepoints representing a grapheme.
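
To make the problem concrete, here’s a sketch of what the streaming decoder has to cope with (the bytes are simply what UTF-8 produces for these codepoints):

my $str = "D\c[COMBINING DOT ABOVE]";
say $str.encode('utf-8').list.map(*.base(16)); # 44 CC 87
# If 44 arrives in one packet and CC 87 in the next, the decoder must
# neither split the codepoint (CC 87) nor emit the D as a finished
# grapheme before knowing whether a combining character follows it.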

There were a few other interesting NFG things that needed attending to before I switched it on. One was ensuring that NFG is closed over concatenation operations:

my $str = "D\c[COMBINING DOT ABOVE]";
say $str.chars; # 1
say $str.chars; # 1

Another is making sure you can do case changes on synthetics, which has to do the case change on the base character, and then produce a new synthetic:

my $str = "D\c[COMBINING DOT BELOW]\c[COMBINING DOT ABOVE]";
say $str.NFD;      # NFD:0x<0044 0323 0307>
say $str.NFC;      # NFC:0x<1e0c 0307>
say $str.lc.chars; # 1
say $str.lc.NFD;   # NFD:0x<0064 0323 0307>
say $str.lc.NFC;   # NFC:0x<1e0d 0307>

(Aside: I’m aware there are some cases where Unicode has defined precomposed characters in one case, but not in another. Yes, it’s an annoying special case (pun intended). No, we don’t get this right yet.)

Uni improvements

I’ve been using the Uni type quite a bit to test the various normalization algorithms as I’ve implemented them. Now I’ve also filled out various missing bits of functionality:

  • You can access the codepoints using array indexing
  • It responds to .elems, and behaves that way when coerced to a number also
  • It provides .gist and .perl methods, working like Buf
  • It boolifies like other array-ish things: true if non-empty

I also added tests for these.
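
By way of a quick sketch, here’s how I’d expect those bits to behave, going on the descriptions above (my example):

my $u = Uni.new(0x1E0A, 0x0323);
say $u[1].base(16); # 323 - codepoint access via array indexing
say $u.elems;       # 2
say +$u;            # 2 - numifies to the number of codepoints
say ?$u;            # True - non-empty
say ?Uni.new;       # False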

What the thunk?

There were various scoping related bugs when the compiler had to introduce thunks. This led to things like:

try say $_ for 1..5;

not having $_ correctly set. There were similar issues that sometimes made you put braces around gather blocks when it should have worked out fine with just a statement. These issues were at the root of 5 different RT tickets.

Regex sub assertions with args

I fixed the lexical assertion syntax in regexes, so you can now say:

<&some-rule(1, 2)>
<&some-rule: 1, 2>

Previously, trying to pass arguments resulted in a parse error.
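
For example, something along these lines should now parse and match (my made-up rule, by way of a sketch):

my regex enclosed($open, $close) { $open \w+ $close }
say "[hi]" ~~ /<&enclosed('[', ']')>/; # ｢[hi]｣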

Other assorted fixes

Here are a few other RTs I attended to:

  • Fix and tests for RT #124144 (uniname explodes in a weird way when given negative codepoints)
  • Fix and test for RT #124333 (SEGV in MoarVM dynamic optimizer when optimizing huge alternation)
  • Fix and test RT #124391 (use after free bug in error reporting)
  • Fix and test RT #114100 (Capture.perl flattened hashes in positionals, giving misleading output)
  • Fix and test RT #78142 (statement modifier if + bare block interaction; also fixed it for unless and given)

This week: digging into NFG, fixing “use fatal”, and more

It’s time for this week’s grant report! What have I been up to?


Digging into NFG

Last time, I talked about normalization forms in Unicode, and how Rakudo on MoarVM now lets you move between them and examine them, using the Uni type and its subclasses. We considered an example involving 3 codepoints:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
.say for $codepoints.list>>.uniname;

They are (as printed by the code above):

LATIN CAPITAL LETTER D
COMBINING DOT BELOW
COMBINING DOT ABOVE

We also noted that if we put them into NFC (Normal Form Composed – where we take codepoint sequences and identify where we can use precomposed codepoints), using this code:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
.say for $codepoints.NFC.list>>.uniname;

Then we get this:

LATIN CAPITAL LETTER D WITH DOT BELOW
COMBINING DOT ABOVE

Now, if you actually render that, and presuming you have a co-operating browser, it comes out as Ḍ̇ (you should hopefully be seeing a D with a dot above and below). If I were to stop people on the street, show them a “Ḍ̇”, and ask how many characters they see, they’re almost certainly all going to say “1”. Yet if we work at codepoint level, even having applied NFC, we’re going to consider those as 2 “thingies”. Worse, we could do a substr operation and chop off the combining dot above. This is actually the situation in most programming languages.

Enter NFG. While it’s not on for all strings by default yet, if you take any kind of Uni and coerce it to a Str, you have a string represented in Normal Form Grapheme. The first thing to note is that if we ask .chars of it, we get 1:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
say $codepoints.Str.chars; # 1

How did it do this? To get Normal Form Grapheme, we first calculate NFC. Then, if we are left with combining characters (formally, anything with a non-zero value for Canonical_Combining_Class), we compute a synthetic codepoint and use it inside of the string. You’ll never actually see these. When we output the string, or turn it back into codepoints, the synthetics unravel. But when you’re working with a Str, you’re at grapheme level. Better still, operations like .chars and .substr are O(1).
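
That unraveling is easy to observe by turning the string back into codepoints (assuming Str’s NFC coercion mirrors the Uni one shown previously):

my $str = Uni.new(0x0044, 0x0323, 0x0307).Str;
say $str.chars;                    # 1 - one grapheme, one synthetic inside
say $str.NFC.list.map(*.base(16)); # 1E0C 307 - the synthetic unravels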

Under the hood, synthetics are represented by negative integers. This gives a cheap and easy way to know if we’re looking at a synthetic codepoint or not. And because we intern the synthetics, things like string equality testing can be a cheap and fast memory compare operation. On a further implementation note, I went with a partially lock-free trie to do the interning, meaning that no locks are needed to do lookups, and we only acquire one if we have to register a new synthetic – giving thread safety with very little overhead for real-world use cases.

I’m gradually assembling test cases for NFG. Some can be generated systematically from the Unicode NormalizationTests.txt file, though they are rather more involved since they have to be derived from the normalization test data by considering the canonical combining class. Others are being written by hand.

For now, the only way to get an NFG string is to call .Str on a Uni. After the 2015.04 release, I’ll move towards enabling it for all strings, along with dealing with some of the places where we still need to work on supporting NFG properly (such as in the leg operator, string bitwise operators, and LTM handling in the regex/grammar engine).

use fatal

For a while, Larry has noted that a $*FATAL dynamic variable was the wrong way to handle use fatal – the pragma that makes lazy exceptions (produced by fail) act like die. The use fatal pragma was specified to automatically apply inside of a try block or expression, but until this week a Failure could escape a try, with the sad consequence that things like:

say try +'omg-not-anumber';

would blow up with an exception saying you can’t sanely numify that. When I tried to make the existing implementation of use fatal apply inside of try blocks, it caused sufficient fireworks and action at a distance that it became very clear that not only was Larry’s call right, but we had to fix how use fatal works before we could enable it in try blocks. After some poking, I managed to get an answer that pointed me in the direction of doing it as an AST re-write in the lexical scope that did use fatal. That worked out far better. I applied it to try, and found myself with only one spectest file that needed attention (which had code explicitly relying on the old behavior). The fallout in the ecosystem seems to have been minimal, so we’re looking good on this. A couple of RTs were resolved thanks to this work, and it’s another bit of semantics cleaned up ahead of this year’s release.
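
A small sketch of the now-consistent behavior (my own example, not code from the patch):

sub parse-num($s) {
    fail "not a number: $s" unless $s ~~ /^ \d+ $/;
    +$s
}
say try parse-num("42");  # 42
say try parse-num("omg"); # Nil - the Failure now throws inside the try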

Smaller things

Finally, here are a handful of smaller things I’ve worked on:

  • Fix and test for RT #77786 (poor error reporting on solitary backtrack control)
  • Make ‘soft’ and ‘MONKEY-TYPING’ work lexically
  • Fix and test for RT #124304 (interaction of ‘my \a = …’ declarations with EVAL)
  • Fix for RT #122914 (binding in REPL didn’t persist to the next line); also add tests and make REPL tests work on Win32

Have a good week!


This week: Unicode normalization, many RTs

After some months where every tuit was sacred, and so got spent on implementing Perl 6 rather than writing about implementing Perl 6, I’m happy to report that for the rest of 2015 I’ll now have time to do both again. \o/ I recently decided to take a step back from my teaching/mentoring work at Edument and dedicate a good bit more time to Perl 6. The new Perl 6 core development fund is a big part of what has enabled this, and a huge thanks must go to WenZPerl for providing the initial round of funding to make this happen. I wrote up a grant application, which contains some details on what I plan to be doing. In short: all being well with funding, I’ll be spending 50% of my working time for the rest of 2015 doing Perl 6 things, to help us towards our 2015 release goals.

As part of the grant, I’ll be writing roughly weekly progress reports here. While the grant application is open for comments still, to make sure the community are fine with it (so far, the answer seems to be “yes!”), I’ve already been digging into the work. And when I looked at the list of things I’ve already got done, it felt like time already to write it up. So, here goes!

Uni and NF-whatever

One of the key goals of the grant is to get Normal Form Grapheme support in place. What does that mean? When I consider Unicode stuff, I find it helps greatly to be clear about 3 different levels we might be working at, or moving between:

  • Bytes: the way things look on disk, on the wire, and in memory. We represent things at this level using some kind of encoding: UTF-8, UTF-16, and so forth. ASCII and Latin-1 are also down at this level. In Perl 6, we represent this level with the Buf and Blob types.
  • Code points: things that Unicode gives a number to. That includes letters, diacritics, mathematical symbols, and even a cat face with tears of joy. In Perl 6, we represent a bunch of code points with the Uni type.
  • Graphemes: things that humans would usually call “a character”. Note that this is not the same as code points; there are code points for diacritics, but a typical human would tell you that it’s something that goes on a character, rather than being a character itself. These things that “go on a character” are known in Unicode as combining characters (though a handful of them, when rendered, visually enclose another character). In Perl 6, we represent a bunch of graphemes with the Str type – or at least, we will when I’m done with NFG!

Thus, in Perl 6 we let you pick your level. But Str – the type of things you’d normally refer to as strings – should work in terms of graphemes. When you specify an offset for computing a substring, then you should never be at risk of your “cut the crap” grapheme (which is, of course, made from two codepoints: pile of poo with a combining enclosing circle backslash) getting cut into two pieces. Working at the codes level, that could easily happen. Working at the bytes level – or the UTF-16 code unit level, as C# and Java strings do – you can even cut a code point in half, dammit. We’ve been better than that in Perl 6 pretty much forever, at least. Now it’s time to make the final push and get Str up to the grapheme level.
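
To put numbers on the three levels, here’s roughly how things should come out once Str is grapheme-level (my example; I picked a combiner with no precomposed form, so it survives NFC):

my $x = "x\c[COMBINING DOT BELOW]";
say $x.encode('utf-8').elems; # 3 - bytes
say $x.codes;                 # 2 - code points
say $x.chars;                 # 1 - graphemes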

Those who have played with Perl 6 for a while will have probably noticed we do have Str (albeit at the code point level) and Buf (correctly at the bytes level), but will not have seen Uni. That’s because it’s been missing so far. However, today I landed some bits of Uni support in Rakudo. So far this only works on the MoarVM backend. What can we do with a Uni? Well, we can create it with some code points, and then list them:

my $codepoints = Uni.new(0x1E0A, 0x0323);
say $codepoints.list.map(*.base(16)); # 1E0A 323

That’s a Latin capital letter D with dot above, followed by a combining dot below. Thing is, Unicode also contains a Latin capital letter D with dot below, and a combining dot above. By this point, you’re probably thinking, “oh great, how do we even compare strings!” The answer is normalization: transforming equivalent sequences of code points into a canonical representation. One of these is known as Normal Form Decomposed (aka. NFD), which takes all of the pre-composed characters apart, and then performs a stable sort (to handwave a little, it only re-orders things that don’t render at the same location relative to the base character; look up Canonical Combining Class and the Unicode Canonical Sorting Algorithm if you want the gory details). We can compute this using the Uni type:

say $codepoints.NFD.list.map(*.base(16)); # 44 323 307

We can understand this a little better by doing:

say uniname($_) for $codepoints.NFD.list;

Which tells us:

LATIN CAPITAL LETTER D
COMBINING DOT BELOW
COMBINING DOT ABOVE

The other form you’ll usually encounter is Normal Form Composed (aka. NFC). This is (logically, at least) computed by first computing NFD, and then putting things back together into composed forms. There are a number of reasons this is not a no-op, but the really important one is that computing NFD involves a sorting operation. We can see that NFC has clearly done some kind of transform just by trying it:

my $codepoints = Uni.new(0x1E0A, 0x0323);
say $codepoints.NFC.list.map(*.base(16)); # 1E0C 307

We can again dump out the code point names:

LATIN CAPITAL LETTER D WITH DOT BELOW
COMBINING DOT ABOVE

And see that the normal form is a Latin capital letter D with dot below followed by a combining dot above.

There are a number of further details – not least that Hangul (the beautiful Korean writing system) needs special algorithmic handling. But that’s the essence of it. The reason I started out working on Uni is because NFG is basically going a step further than NFC. I’ll talk more about that in a future post, but essentially NFG needs NFC, and NFC needs NFD. NFD and NFC are well defined by the Unicode consortium. And there’s more: they provide an epic test suite! 18,500 test cases for each of NFD, NFKD, NFC, and NFKC (K is for “compatibility”, and performs some extra mappings upon decomposition).
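
As a quick illustration of what the K forms add, using the Uni methods from above (the ligature example is my own):

my $lig = Uni.new(0xFB01);          # LATIN SMALL LIGATURE FI
say $lig.NFD.list.map(*.base(16));  # FB01 - canonically, it's atomic
say $lig.NFKD.list.map(*.base(16)); # 66 69 - compatibility splits it into "fi"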

Implementing the Uni type gave me a good way to turn the Unicode conformance tests into a bunch of Perl 6 tests – which is what I’ve done. Of course, we don’t really want to run all 70,000 or so of them on every single spectest run, so I’ve marked the full set up as stress tests (which we run as part of a release) and also generated 500 sanity tests per normalization form, which are now passing and included in a normal spectest run. These tests drove the implementation of the four normalization forms in MoarVM (header, implementation). The inline static function in the header avoids doing the costly Unicode property lookups for a lot of the common cases.

So far the normalization stuff is only used for the Uni type. Next up will be producing a test suite for mapping between Uni and (NFG) Str, then Str to Uni and Buf, and only once that is looking good will I start getting the bytes to Str direct path producing NFG strings also.

Unicode 7.0 in MoarVM

While I was doing all of the above, I took the opportunity to upgrade MoarVM’s Unicode database to Unicode 7.0.

Unsigned type support in NativeCall

Another goal of my work is to try and fix things blocking others. Work on Perl 6 database access was in need of support for unsigned native integer types in NativeCall, so I jumped in and implemented that, and added tests. The new functionality was put to use within 24 hours!
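
Usage looks just like the signed cases; here’s a sketch against a hypothetical C library (the function and library names are made up):

use NativeCall;

# C side: unsigned int add_u(unsigned int a, unsigned int b);
sub add_u(uint32 $a, uint32 $b) returns uint32 is native('demo') { * }

say add_u(2, 3); # 5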


Assorted RT fixes

Along with the big tasks like NFG, I’m also spending time taking care of a bunch of the little things. This means going through the RT queue and picking out things to fix, to make the overall experience of using Perl 6 more pleasant. Here are the ones I’ve dealt with in the last week or so:

  • Fix, add test for, and resolve RT #75850 (a capture \( (:a(2)) ) with a colonpair in parens should produce a positional argument but used to wrongly produce a named one)
  • Analyze, fix, write test for, and resolve RT #78112 (illegal placeholder use in attribute initializer)
  • Fix RT #93988 (“5.” error reporting bug), adding a typed exception to allow testability
  • Fixed RT #81502 (missing undeclared-routine detection for BEGIN blocks), and included location info about BEGIN-time errors
  • Fixed RT #123967; provided good error reporting when an expression in a constant throws an exception
  • Tests for RT #123053 (Failure – lazy exceptions – can escape try), but no fix yet. I did find a few candidate fixes that didn’t work out, though they led to two other small improvements along the way.
  • Typed exception for bad type for variable/parameter declaration; test for and resolve RT #123397
  • Fix and test for RT #123627 (bad error reporting of “use Foo Undeclared;”)
  • Fix and test for RT #123789 (SEGV when assigning type object to native variable), plus harden against similar bugs
  • Implemented “where” post-constraints on attributes/variables. Resolves RT #122109; unfudged existing test case.
  • Handle “my (Int $x, Num $y)” (RT #102414) and complain on “my Int (Str $x)” (RT #73102)

What I’ve been working on, and what’s coming up

It’s been a little while since I wrote an update here, and with a spare moment post-teaching I figured it was a nice time to write one. I was lucky enough to end up being sent to the arctic (or, near enough) for this week’s work, meaning I’m writing this at 11pm, sat outside – and it’s decidedly still very light. It just doesn’t do dark at this time of year in these parts. Delightful.

Asynchronous Bits

In MoarVM and Rakudo 2014.05, basic support for asynchronous sockets landed. By now, they have also been ported to the JVM backend. In 2014.06, there were various improvements – especially with regard to cancellation and fixing a nasty race condition. Along the way, I also taught MoarVM about asynchronous timers, bringing time-based scheduling up to feature parity with the JVM backend. I then went a step further and added basic file watching support and some signals support; these two sadly didn’t yet make it to the JVM backend. On signals, I saw a cute lightning talk by lizmat using signal handling and phasers in loops to arrange for Ctrl+C to exit a program once a loop had completed its current iteration.
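
The shape of the idea from that talk is roughly this (my simplified sketch, leaving out the phasers; do-work stands in for the loop body):

my $stop = False;
signal(SIGINT).tap({ $stop = True });
until $stop {
    do-work(); # Ctrl+C only takes effect once this iteration completes
}
say "Exited cleanly after finishing the current iteration.";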

While things basically work, they are not yet as stable as they need to be – as those folks implementing multi-threaded asynchronous web servers and then punishing them with various load-testing tools are discovering. So I’ll be working in the next weeks on hunting down the various bugs here. And once it’s stable, I may look into optimizations, depending on if they’re needed.

Various Rakudo fixes and optimizations

I’ve also done various assorted optimizations and fixes in Rakudo. They’re all over the map: optimizing for 1..100000 { } style loops into cheaper while loops, dealing with various small bugs reported in the ticket system, implementing a remaining missing form of colonpair syntax (:42nd meaning nd => 42), implementing Supply.on_demand for creating supplies out of…well…most things, optimizing push/unshift of single items…it goes on. I’ll keep trucking away at these sorts of things over the coming months; there’s some bugs I really want to nail.
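
That colonpair form in action, for the curious (a tiny sketch):

my %opts = :42nd, :3rd;
say %opts<nd>; # 42
say %opts<rd>; # 3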

MoarVM’s dynamic optimizer

A bunch of my recent and current work revolves around MoarVM’s dynamic bytecode optimizer, known as “spesh” (because its primary – though not only – strategy is to specialize bytecode by type). Spesh first appeared in the 2014.04 release of MoarVM. Since then, it’s improved in numerous ways. It now has a logging phase, where it gathers extra type information at a number of places in the code. After this, it checks if the type information is stable – which is often the case, as most potentially polymorphic code is monomorphic (or put another way, dynamic languages are mostly just eventually-static). Provided we did get consistent types recorded, then guard clauses are inserted (which cheaply check we really did get the expected type, and if not, trigger deoptimization – falling back to the safe but slower bytecode). The types can then be assumed by the code that follows, allowing a bunch of optimizations to code that the initial specializer just couldn’t do much with.

Another important optimization spesh learned was optimizing dispatch based on the type information. By the time the code gets hot enough to specialize, multiple dispatch caches are primed. These, in combination with type information, are used to resolve many multiple dispatches, meaning that they become as cheap as single dispatches. Furthermore, if the callee has also been specialized – which is likely – then we can pick the appropriate specialization candidate right off, eliminating a bunch of duplicate guard checks. Everything described so far was on 2014.05.

So, what about 2014.06? Well, the 2014.05 work on dispatch – working out exactly what code we’re going to be running – was really paving the way for a much more significant optimization: inlining. 2014.06 thus brought basic support for inlining. It can mostly only inline NQP code at the moment, but by 2014.07 it’ll be handling inlining of the majority of basic ops in Rakudo that the static optimizer can’t already nail. Implementing inlining was tricky in places. I decided to go straight for the jugular and support multi-level inlining – that is, inlining things that also inline things. There are a bunch of places in Perl 6 this will be useful; for example, ?$some_int compiles to prefix:<?>($some_int), which in turn calls $some_int.Bool. That is implemented as nqp::p6bool(nqp::bool_I(self)). With inlining, we’ll be able to flatten away those calls. In fact, we’re just a small number of patches off that very case actually being handled.

The other thing that made it tricky to implement inlining is that spesh is a speculative optimizer. It looks at what types it’s seeing, and optimizes assuming it will always get those types. Those optimizations include inlining. So, what if we’re inlining a couple of levels deep, are inside one of the inlined bits of code, and something happens that invalidates our assumptions? This triggers de-optimization. However, in the case of inlining, it has to go back and actually create the stack frames that it elided creating thanks to having applied inlining. This was a bit fiddly, but it’s done and was part of the 2014.06 release.

Another small but significant thing in 2014.06 is that we started optimizing some calls involving named parameters to pull the parameters out of the caller’s callsite by index. When the optimization applies, it saves doing any string comparisons whatsoever when handling named parameters.

So 2014.07 will make inlining able to cope better with Rakudo’s generated code, but anything else? Well, yes: I’m also working on OSR (On Stack Replacement). One problem today is that if the main body of the program is a hot loop doing thousands of iterations, we never actually get to specialize the loop code (because we only enter the code once, and so far it is repeated calls to a body of code that triggers optimization). This is especially an issue in benchmarks, but can certainly show up in real-life code too. OSR will allow us to detect that such a hot loop exists in unoptimized code, go off and optimize it, and then replace the running code with the optimized version. Those following closely might wonder if this isn’t just a kind of inverse de-optimization, and that is exactly how I’ll implement it: just use the deopt table backwards to work out where to shove the program counter. Last but not least, I also plan to work on a range of optimizations to generic code written in roles, to take away the genericity as part of specialization. Given the grammar engine uses roles in various places, this should be a healthy optimization for parsing.

JIT Compilation for MoarVM

I’m not actually implementing this one; rather, I’m mentoring brrt++, who is working on it for his Google Summer Of Code project. My work on spesh was in no small part to enable a good JIT. Many of the optimizations that spesh does turn expensive operations with various checks into cheap operations that just go and grab or store data, and thus should be nicely expressible in machine code. The JIT that brrt is working on goes from the graph produced by spesh. It “just” turns the graph into machine code, rather than improved bytecode. Of course, that’s still a good bit of work, especially given de-optimization has to be factored into the whole thing too. Still, progress is good, and I expect the 2014.08 MoarVM release will include the fruits of brrt’s hard work.


Racing to writeness to wrongness leads

In the Perl world I’m mostly known as a guy who hacks on Perl 6 stuff. Less known is that outside of the Perl world, I spend a lot of my time with the .Net platform. C#, despite a rather uninspiring initial couple of releases, has escaped Java-think and grown into a real multi-paradigm language. It’s not Perl, but it’s certainly not unpleasant, and can even be a good bit of fun to work with nowadays. My work with it right now typically involves teaching, along with various mentoring and trouble-shooting tasks.

The Windows world has always been rather into threads – at least as long as I’ve been around. .Net is also, as a result. Want to do some computation in your GUI app? Well, better farm it off to a thread, so the main thread can keep the UI responsive to the user’s needs. Want to do network I/O? Well, that could probably use some asynchronous programming – and the completion handler will be run on some thread or other. Then the results will probably want marshaling somehow. (That used to hurt; now things are better.) Building a web application? Better learn to love threads. You don’t actually get any choice in the matter: having multiple request-processing threads is the unspoken, unquestioned, default in a web application on .Net.

Of course, just because threads are almost ubiquitous doesn’t mean the average developer – or even the above-average developer – gets things right. A bunch of my recent trouble-shooting gigs have boiled down to dealing with a lack of understanding of multi-threaded programming. “So, we embed this 80s library in our web application, but things tend to crash under load.” “How do you deal with the fact that 80s library likely isn’t threadsafe?” “It…what?” “Oh, my…”

So anyway, back to Perl 6. Last year, we managed to get Rakudo on the JVM. And, given we now ran on a VM where folks deploy heavily threaded software every day, and with no particular progress to be seen on concurrency in Perl 6 for years, I did what I usually seem to end up doing: getting fed up with the way things were and figuring I should try to make them better. Having spent a bunch of years working with and teaching about parallel, asynchronous, and concurrent programming outside of the Perl world, it was time for worlds to collide.

And oh hell, collide they do. Let’s go word counting:

my %word_counts;
for @files -> $filename {
    for slurp($filename).words {
        %word_counts{$_}++;
    }
}

OK, so that’s the sequential implementation. But how about one that processes the files in parallel? Well, it seems there are a bunch of files, and that seems like a natural way to parallelize the work. So, here goes:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            %word_counts{$_}++;
        }
    }
}

Here, start creates a Promise, which is kept when the code inside of it completes. That work is scheduled to be done on the thread pool, and the code calling start continues onward, moving on to create another Promise for the next file. Soon enough, the thread pool’s input queue is nicely occupied with work, and threads are chugging through it. The loop is in a context that means it produces results – the Promise objects – thanks to our use of the do keyword. We give them to await, which waits for them all to get done. Perfect, right?

Well, not so fast. First of all, are hashes thread safe? That is, if I try to write to a hash from multiple threads, what will happen? Well, good question. And the answer, if you try this out on Rakudo on JVM today, is you’ll end up with a hosed hash, in all likelihood. OK. Go on. Say what you’re thinking. Here’s one guess at a common response: “But…but…but…OH NO WE MUST MAKE IT WORK, this is Perl, not Java, dammit!” Well, OK, OK, let’s try to make it work…

So the hash ain’t threadsafe. Let’s go put implicit locking in hash access. We’ll slow down everything for it, but maybe with biased locking it won’t be so bad. Maybe we can build a smart JIT that invalidates the JITted code when you start a thread. Maybe escape analysis will save the common cases, because we can prove that we’ll never share things. Maybe we can combine escape analysis and trace JIT! (Hey, anybody know if there’s a paper on that?) Heck, we gotta build smart optimizations to make Perl 6 perform anyway…

So anyway, a patch or two later and our hashes are now nicely thread safe. We’re good, right? Well, let’s run it and…ohhhh…wrong answer. Grr. Tssk. Why, oh why? Well, look at this:

%word_counts{$_}++;

What does the post-increment operator do? It reads a value out of a scalar, gets its successor, and shoves the result in the scalar. Two threads enter. Both read a 41. Both add 1. Both store 42. D’oh. So, how do we fix this? Hm. Well, maybe we could make ++ take a lock on the scalar. Now we’re really, really going to need some good optimization, if we ever want tight loops doing ++ to perform. Like, inlining and then lifting locks…if we can get away with it semantically. Or one of the tricks mentioned earlier. Anyway, let’s suppose we do it. Hmm. For good measure, maybe we’d better ponder some related cases.

%word_counts{$_} += 1;

Not idiomatic here, of course, but we can easily imagine other scenarios where we want something like this. So, we’d better make all the assignment meta-ops lock the target too…uh…and hold the lock during the invocation of the + operator. Heck, maybe we can not do locks in the spin-lock or mutex sense, but go with optimistic concurrency control, given + is pure and we can always retry it if it fails. So, fine, that’s the auto-increment and the assignment meta-ops sorted. But wait…what about this:

%word_counts{$_} = %word_counts{$_} + 1;

Well, uhh…darn. I dunno. Maybe we can figure something out here, because having that behave differently than the += case feels really darn weird. But let’s not get bogged down with side-problems, let’s get back to our original one. My hash is thread safe! My ++ is atomic, by locks, or some other technique. We’re good now, aren’t we?

Nope, still not. Why? Turns out, there’s a second data race on this line:

%word_counts{$_}++;

Why does this work when we never saw the word before? Auto-vivification, of course. We go to look up the current scalar to auto-increment it. But it doesn’t exist. So we create one, but we can’t install it unless we know it will be assigned; just looking for a key shouldn’t make it come to exist. So we put off the installation of the scalar in the hash until it’s assigned. So, two threads come upon the word “manatee”. Both go and ask the hash for the scalar under that key. Access to the hash is already protected, so the requests are serialized (in the one-after-the-other sense). The hash each time notices that there’s no scalar in that slot. It makes one, attaches to it the fact that it should be stored into the hash if the scalar is assigned to, and hands it back. The ++ sees the undefined value in the scalar, and sticks a 1 in there. The assignment causes the scalar to be bound to the hash…uh…wait, that’s two scalars. We made two. So, we lose a word count. Manatees may end up appearing a little less popular than dugongs as a result.


How do we fix this one? Well, that’s kinda tricky. At first, we might wonder if it’s not possible to just hold some lock on something for the whole line. But…how do we figure that out? Trying to work out a locking scheme for the general case of auto-viv – once we mix it with binding – feels really quite terrifying, as this REPL session reveals:

> my %animals; my $gerenuk := %animals<gerenuk>; say %animals.perl;
().hash
> $gerenuk = 'huh, what is one of those?'; say %animals.perl;
("gerenuk" => "huh, what is one of those?").hash

So, what’s my point in all of this? Simply, that locking is not just about thread safety, but also about the notion of transaction scope. Trying to implicitly lock stuff to ensure safe mutation on behalf of the programmer means you’ll achieve thread safety at a micro level. However, it’s very, very unlikely that will overlap with the unspoken and uncommunicated transaction scope the programmer had in mind – or didn’t even know they needed to have in mind. What achieving safety at the micro level will most certainly achieve, however, is increasing the time it takes for the programmer to discover the real problems in their program. If anything, we want such inevitably unreliable programs to reliably fail, not reliably pretend to work.

I got curious and googled for transaction scope inference, wondering if there is a body of work out there on trying to automatically figure these things out. My conclusion is that either it’s called something else, I’m crap at Google today, or I just created a thankless PhD topic for somebody. (If I did: I’m sorry. Really. :-) My hunch is that the latter is probably the case, though. Consider this one:

while @stuff {
    my $work = @stuff.pop;
    # ...do something with $work...
}

Where should the implicit transaction go here? Well, it should take in the boolification of @stuff and the call to pop. So any such general analysis is clearly inter-statement, except that we don’t want to hard-code it for boolification and popping, so it’s interprocedural, but then method calls are late-bound, so it’s undecidable. Heck, it’d be that way even in boring Java. With Perl you can go meta-programming, and then even your method dispatch algorithm might be late bound.

At this point, we might ponder software transactional memory. That’s very much on-topic, and only serves to reinforce my point: in STM, you’re given a mechanism to define your transaction scope:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            # hack you up a prototype down the pub, it's *hard*!
            atomic { %word_counts{$_}++ }
        }
    }
}

This looks very nice, but let’s talk about the hardware for a bit.

Yes, the hardware. The shiny multi-core thing we’re trying to exploit in all of this. The thing that really, really, really, hates on code that writes to shared memory. How so? It all comes down to caches. To make this concrete, we’ll consider the Intel i7. I’m going to handwave like mad, because I’m tired and my beer’s nearly finished, but if you want the gory details see this PDF. Each core has a Level 1 cache (actually, two: one for instructions and one for data). If the data we need is in it, great: we stall for just 4 cycles to get hold of it. The L1 cache is fast, but also kinda small (generally, memory that is fast needs more transistors per byte we store, meaning you can’t have that much of it). The second level cache – also per core – is larger. It’s a bit slower, but not too bad; you’ll wait about 10 cycles for it to give you the data. (Aside: modern performance programming is thus more about cache efficiency than it is about instruction count.) There’s then a level 3 cache, which is shared between the cores. And here’s where things get really interesting.

As a baseline, a hit in the level 3 cache is around 40 cycles if the memory is unshared between cores. Let’s imagine I’m a CPU core wanting to write to memory at 0xDEADBEEF. I need to get exclusive access to that bit of memory in order to do so. That means before I can safely write it, I need to make sure that any other core with it in its caches (L1/L2) tosses what it knows, because that will be outdated after my write. If some other core shares it, the cost of obtaining the cache line from L3 goes up to around 65 cycles. But what if the other core has modified it? Then it’s around 75 cycles. From this, we can see that pretty much any write to shared memory, if another core was last to write, is going to be incurring a cost of around 75 cycles. Compare that to just several cycles for unshared memory.

So how does our approach to parallelizing our word count look in the light of this? Let’s take a look at it again:

my %word_counts;
await do for @files -> $filename {
    start {
        for slurp($filename).words {
            %word_counts{$_}++;
        }
    }
}

Locks are just memory, so if we inserted those automatically – even if we did work out a good way to do so – then taking the lock is a shared memory write. That’s before we go updating the memory associated with the hash table to install entries, and the memory of the scalars to update the counts. What if we STM it? Even if we keep modifications in a local modification buffer, we still have to commit at some point, and that’s going to have to be a write to shared memory. In fact, that’s the thing that bothers me about STM. It’s a really, really great mechanism – way superior to locks, composable, and I imagine not too hard to teach – but its reason for existing is to make writes to shared memory happen in a safe, transactional, way. And it’s those writes that the hardware makes costly. Anyway, I’m getting side-tracked again. The real point is that our naive parallelization of our program – even if we can find ways to make it work reliably – is a disaster when considered in the light of how the hardware works.

So, what to do? Here’s an alternative.

# Produce a word counts hash per file - totally unshared!
my @all_counts = await do for @files -> $filename {
    start {
        my %word_counts;
        for slurp($filename).words {
            %word_counts{$_}++;
        }
        %word_counts
    }
}

# Bring them together into a single result.
my %totals;
for @all_counts -> %word_counts {
    for %word_counts {
        %totals{.key} += .value;
    }
}
say %totals.elems;

Those familiar with map-reduce will probably have already recognized the pattern here. The first part of the program does the work for each file, producing its own word count hash (the map). This is completely thread local. Afterwards, we bring all of the results together into a single hash (the reduce). This is doing reads of data written by another thread, of course. But that’s the cheaper case, and once we get hold of the cache lines with the hash and scalars, and start to chug through it, we’re not going to be competing for it with anything else.

Of course, the program we get at the end is a bit longer. However, it’s also not hard to imagine having some built-ins that make patterns like this shorter to get in place. In fact, I think that’s where we need to be expending effort in the Perl 6 concurrency work. Yes, we need to harden MoarVM so that you can’t segfault it even if you do bad things. Yes, we should write a module that introduces a monitor keyword, which is a class that automatically takes a lock around each of its method calls:

monitor ThreadSafeLoggingThingy {
    has @!log;

    method log($msg) {
        push @!log, $msg;
    }

    method latest($n) {
        $n < @!log
            ?? @!log[*-$n .. *]
            !! @!log[]
    }
}

Yes, we should do an Actor one too. We could even provide a trait:

my @a is monitor;

Which would take @a and wrap it up in a monitor that locks and delegates all its calls to the underlying array. However, by this point, we’re treading dangerously close to forgetting the importance of transaction scope. At the start of the post, I told the story of the hopelessly unsafe calls to a legacy library from a multi-threaded web application. I had it hunted down and fixed in a morning because it exploded, loud and clear, once I started subjecting it to load tests. Tools to help find such bugs exist. By contrast, having to hunt bugs in code that is threadsafe, non-explosive, but subtly wrong in the placing of its transaction boundaries, is typically long and drawn out – and where automated tools can help less.

In closing, we most certainly should take the time to offer newbie-friendly concurrent, parallel, and asynchronous programming experiences in Perl 6. However, I feel that needs to be done by guiding folks towards safe, teachable, understandable patterns of a CSP (Communicating Sequential Processes) nature. Perl may be about Doing The Right Thing, and Doing What I Mean. But nobody means their programs to do what the hardware hates, and the right thing isn’t to make broken things sort-of-work by sweeping complex decisions on transaction scope under the carpet. “I did this thing I thought was obvious and it just blew up,” can be answered with, “here’s a nice tutorial on how to do it right; ask if you need help.” By contrast, “your language’s magic to make stuff appear to work just wasted days of my life” is a sure-fire way to get a bad reputation among the competent. And that’s the last thing we want.


Optimization, concurrency, and Moar

It’s been a while since I wrote an update here. Happily, timotimo has taken up the role of being our weekly Perl 6 reporter, so there’s a good place to follow for regular updates. However, I wanted to take a moment to provide the bigger picture of what’s been happening in the last couple of months, share my perspectives on it, and talk a bit about where things are headed from here.

Optimization, optimization, optimization!

A lot of recent effort has gone on optimization. NQP, the subset of Perl 6 that we use to implement much of the Rakudo Perl 6 compiler, has performance on MoarVM that starts to approach that of Perl 5, and on JVM sometimes exceeds that of Perl 5 for longer running things (it typically runs the forest fire benchmark from our benchmark suite faster once the JIT has had time to get going, for example). By contrast, Rakudo’s performance itself has been comparatively awful. Thankfully, things have been getting better, as we’ve worked to improve optimization, reduce costs of common things, and gradually begun to close the gap. This has involved work on both Rakudo and NQP’s optimization phases, along with work on improving the built-ins and some performance-oriented factoring changes. There’s still plenty of work to go, but anybody using Rakudo on MoarVM will likely feel the results in the next release. To give an idea of the improvement in HEAD Rakudo on MoarVM, which will appear in the April monthly:

  • Array and hash access is more than 3 times faster
  • Most multi-dispatches are now enormously cheaper
  • Many, many unrequired scalar allocations are optimized away
  • The forest fire benchmark on Rakudo can render twice as many frames per second
  • Compiler performance is around 15% better (estimate going on CORE.setting compile time)

Compared to Rakudo on Parrot the difference is more marked. On compiler performance alone, the difference is enormous: you can build the entirety of Rakudo on MoarVM on my box in around 80s (which I’m not happy with yet, though you rarely change so much that you have to do the whole thing). That is less time than it takes for Rakudo on Parrot to complete the parse/AST phases of compiling CORE.setting (the built-ins). Running the spectest suite happens in half the time. Both of these times are important because they influence how quickly those of us working on Perl 6 implementation can try out our changes. Unsurprisingly, most of us do the majority of our development on MoarVM first these days.

One consequence of this work is that Rakudo on MoarVM is often sneaking ahead of Rakudo on JVM on some benchmarks now, even once the mighty JVM JIT kicks in. This won’t last long, though; a couple of the optimizations done will not be a great deal of work to port to the JVM, and then it can reclaim its crown. For now! :-)

Introducing “spesh”

A whole other level of performance related work has been taking place in MoarVM itself. The first goal for the project was “just run NQP and Perl 6”, and the VM simply got on with interpreting the bytecode we threw at it. That doesn’t mean it wasn’t carefully designed along the way – just that the focus in terms of execution was to be simple, correct and sufficiently complete to serve as a solid Rakudo backend. With that goal achieved, the next phase of the project is underway: implementing dynamic optimization based on program information available at runtime, speculative optimizations that can be undone if things they rely on are broken, and so forth.

The first steps in that direction will be included in this month’s MoarVM release, and are to thank for much of the improved compiler performance (since the compiler is a program running on the VM too). The easiest way to get an overview is for me to walk you through the pieces in src/spesh in MoarVM.

  • graph is about building a representation of the bytecode (at a frame level) suitable for analysis and transformation (the two steps involved in optimization). It starts out by building a Control Flow Graph. It then computes the dominance tree, which it uses to rename variables so as to produce a Static Single Assignment form of the bytecode. This is a form whereby a given name is only written to once, which eases many, many aspects of analysis.
  • args takes a tuple of incoming arguments, considers their types, arity, and so forth. It produces a set of guard clauses that indicate when a given specialization of the code applies (that is, a version of the code improved by making assumptions about what was passed), and then re-writes various argument access instructions to “unsafe” but fast ones that it can prove will always work out.
  • facts takes the graph, looks through it for sources of type information (including the incoming arguments) and does an initial propagation of that information through the graph. It creates usage counts to be later used in dead code elimination.
  • optimize takes this annotated graph, and makes a pass through it, applying a number of optimizations. Granted, there are not so many yet; so far we’ve mostly worked on getting to the point of having a framework to prove safety of transformations, and adding more of them comes next. However, those there so far can do things like:
    • Resolve methods to avoid hash lookups
    • Install monomorphic method caching if that doesn’t work out
    • Resolve type checks, possibly eliminating entire branches of code
    • Re-write attribute binds into “unsafe” pointer operations
    • Eliminate dead code
  • codegen takes the optimized graph and produces bytecode from it again. However, in the future (if we’re lucky, then hopefully through a GSoC project), this is where we would produce machine code instead.
  • deopt deals with the situation where some invariant that specialized code may be relying on gets broken, and yet that specialized code is still on the call stack. It walks the call stack, looking for specialized code on it and tweaking return addresses and other data structures so that when we return into the code, we’re back in the safe (though of course slower) unspecialized code that checks things as needed.

By and large, this is essentially giving MoarVM a JIT. Of course, it’s not producing machine code yet, but rather JITing back to improved bytecode. While we tend to think of JITs primarily as “turn the program into machine code”, that’s really just one small part of any modern JIT. Analysis and specialization of the program before the machine code is produced is just as important; with this approach, we get to focus in on that part first and get some of the benefits now.
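
To give a feel for the sort of code this helps: the addition below is potentially polymorphic, but in practice only ever sees Int – exactly the “eventually-static” case that gets logged, guarded, and then specialized (my illustrative example, not output from the tooling):

sub sum-to($n) {
    my $total = 0;
    $total = $total + $_ for 1..$n; # monomorphic in practice: always Int
    $total
}
say sum-to(100_000); # 5000050000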


Concurrency

Progress on Perl 6 concurrency continues. The JVM implementation of the concurrency features has had various performance improvements since the March release, and MoarVM now has support for most of the Perl 6 concurrency features also. However, the MoarVM support for these features is early and most certainly not yet production ready. We’ll include it in the April release, but stick with Rakudo on the JVM for concurrency things if you’re doing anything that matters. If you just want to play with the basics, either will do.

Rakudo on MoarVM takes the spectest crown

Rakudo on its various backends holds the top spots on the Perl 6 specification test suite pass rate. However, nowadays Rakudo on MoarVM has worked its way into the lead. How so? Because it has the best Unicode database introspection support, opening up a range of tests that no other backend handles yet. Additionally, because it gets right some of the Unicode stuff that Parrot does, but the JVM doesn’t. And, finally, because on top of that it can now pass a bunch of the concurrency tests.

A multi-backend Rakudo Star

I’d hoped we would get a Rakudo Star release with support for all three backends out in March. It didn’t happen; the module tests showed up some holes. We’ve by now largely fixed those for Rakudo on MoarVM, and we’re looking quite good for the April Rakudo Star coming with support for both Parrot and MoarVM. With some effort, I’m optimistic we’ll have a JVM Star release in good shape for April too. This will provide users who want the “batteries included” release a choice of backends, and, of note, give those using Parrot a chance to switch over to using MoarVM, getting some substantial performance improvements on most programs and lower startup time and memory use.

Where next?

In the coming months, I’m going to be focusing on the following areas:

  • More improvements to the Rakudo and NQP optimizers, so we generate better (faster, smaller) code.
  • More improvements to the Rakudo built-ins, so they operate more efficiently.
  • Making the MoarVM concurrency support more robust, and improving the parallel GC.
  • Advancing the state of asynchronous I/O, so we’ll have good support for it on both the JVM and MoarVM.
  • Teaching spesh to specialize better. There are a bunch of data access things that can be made cheaper, as well as being able to better optimize multiple dispatch. Beyond that, both inlining and escape analysis are on the menu.
  • Improving native type support, including providing native arrays.

I’ve been hidden away coding for much of the year so far, apart from putting in an appearance at FOSDEM. But I’m getting on the road soon! I’ll be at the Dutch, Polish and Czech Perl Workshops, and look forward to seeing folks and sharing what I’ve been working on. Hope to see some of you out there!
