This week: fixing lots of things

My first week of Perl 6 work during June was spent hunting down and fixing bugs, some of them decidedly tricky and time-consuming to locate. I also got a couple of small features in place.

Unicode-related bugs

RT #120651 complained that our UTF-8 encoder wrongly choked on non-characters. While these are defined as “not for interchange” in the Unicode spec, this does not, it turns out, mean that we should refuse to encode them, as clarified by Unicode Corrigendum 9. So, I did what was needed to bring us in line with this.

RT #125161 counts as something of an NFG stress test, putting a wonderful set of combiners onto some base characters. While this was read into an NFG string just fine, and we got the number of graphemes correct, and you could output it again, asking for its NFD representation exploded. It turned out to be a quite boring bug in the end, where a buffer was not allowed to grow sufficiently to hold the decomposition of such epic graphemes.

Finally, I spent some time with the design documents and test suite with regard to strings and Unicode. Now that we know how implementing NFG turned out, various bits of the design could be cleaned up, and several things we turned out not to need could go away. This in turn meant I could clean them out of the test suite, and close the RT tickets that had pointed us towards reviewing or implementing the behavior those tests covered.

Two small features

In Perl 6 we have EVAL for evaluating code in a string. We’ve also had EVALFILE in the design and test suite for a while, which evaluates the code in the specified file. This was easily implemented, resolving RT #125294.
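Here's a quick sketch of the pair (the filename is made up for illustration):

say EVAL '6 * 7';           # 42 – evaluate the code in a string
say EVALFILE 'answer.p6';   # evaluate the code in the (hypothetical) file answer.p6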

Another relatively easy, but long-missing, feature was the quietly statement prefix. This swallows any warnings emitted by the code run in its dynamic scope. Having recently improved various things relating to control exceptions, this one didn’t take long to get in place.
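A tiny example of its effect (just a sketch):

quietly {
    warn "you will never see this";   # the warning is swallowed
    say "but this still runs";        # ordinary output is unaffected
}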

Concurrency bug squishing continues

In the last report, I was working on making various asynchronous features, such as timers, not accidentally swallow a whole CPU core. I’d fixed this, but mentioned that it exposed some other problems, so we had to back the fix out. The problem was, quite simply, that by not being so wasteful, we ran into an already existing race condition much more frequently. The race involved lazy deserialization: something we do in order to only spend time and memory deserializing meta-objects that we will actually use. (This is one of the things that we’ve used to get Rakudo’s startup time and base memory down.) While there was some amount of concurrency control in place to prevent conflicts over the serialization reader, there was a way that one thread could see an object that another one was part way through deserializing. This needed a lock-free solution for the common “it’s not a problem” path, so we would not slow down every single wval instruction (the one in MoarVM used to get a serialized object, deserializing it on demand if needed). Anyway, it’s now fixed, and the core-eating async fix is enabled again.

I also spent some time looking into and resolving a deadlock that could show up, involving the way the garbage collector and event loop were integrated. The new approach not only eliminates the deadlock, but also is a little more efficient, allowing the event loop thread to also participate in the collection.

Finally, a little further up the stack, I found the way we compiled the given, when, and default constructs could lead to lexical leakage between threads. I fixed this, eliminating another class of threading issues.

The case of the vanishing exception handler

Having our operators in Perl 6 be multi-dispatch subroutines is a really nice bit of language design, but also means that we depend heavily on optimization to get rid of the overhead on hot paths. One thing that helps a good bit is inlining: working out which piece of code we’re going to call, and then splicing it into the caller, eliminating the overhead of the call. For native types we are quite good at doing this at compile time, but for others we need to wait until runtime – that is, performing dynamic optimization based upon the types that actually show up.

Dynamic optimization is fairly tricky stuff, and of course once in a while we get caught cheating. RT #124191 was such an example: inlining caused us to end up missing some exception handlers. The issue boiled down to us failing to locate the handlers in effect at the point of the inlined call when the code we had inlined was the source of the exception. Now, a tempting solution would have been to fix the exception handler location code. However, this would mean that it would have to learn about how callframes with inlining look – and since it already has to work differently for the JIT, there was a nice combinatoric explosion waiting to happen in the code there. Thankfully, there was a far easier fix: creating extra entries in the handlers table. This meant the code to locate exception handlers could stay simple (and so fast), and the code to make the extra entries was far, far simpler to write. Also, it covered JIT and non-JIT cases.

So, job done, right? Well…sadly not. Out of the suite of tens of thousands of specification tests, the fix introduced…6 new failures. I disabled JIT to see if I could isolate it and…the test file exploded far earlier. I spent a while trying to work out what on earth was going on, and eventually took the suggestion of giving up for the evening and sleeping on it. The next morning, I thankfully had the idea of checking what happened if I took away my changes and ran the test file in question without the JIT. It broke, even without my changes – meaning I’d been searching for a problem in my code that wasn’t at all related to it. It’s so easy to forget, once in a while, to validate our assumptions when debugging…

Anyway, fixing that unrelated problem, and re-applying the exception handler with inlining fix that this all started out with, got me a passing test file. Happily, I flipped the JIT back on and…same six failures. So I did have a real failure in my changes after all, right? Well, depends how you look at it. The final fix I made was actually in the JIT, and was a more general issue, though in reality we could not have encountered it with any code our current toolchain will produce. It took my inlining fixes to produce a situation where we tripped over it.

So, with that fix applied too, I could finally declare RT #124191 resolved. I added a test to cover it, and was happy to be a tricky bug down. Between it and the rest of the things I had to fix along the way, it had taken the better part of a full day’s worth of time.

Don’t believe all a bug report tells you

RT #123686 & RT #124318 complained about the dynamic optimizer making a mess of temp variables, and linked it also to the presence of a where clause in the signature. Having just worked on a dynamic optimizer bug, I was somewhat inclined to believe it – but figured I should verify the assumption. It’s good I did; while you indeed could get the thing to fail sooner when the dynamic optimizer was turned on, running the test case for some extra iterations made it explode with all of the optimization turned off also. So, something else was amiss.

Some bugs just smell of memory corruption. This one certainly did; changing things that shouldn’t matter changed the exact number of iterations the bug needed to occur. So, out came Valgrind’s excellent memcheck tool, which I could make whine about reads of uninitialized memory with…exactly one iteration of the test that eventually exploded. It also nicely showed where. In the end, there were three ingredients needed: a block with an exit handler (so, a LEAVE, KEEP, UNDO, POST, temp variable or let variable), a parameter with a where clause, and all of this had to be in a multi-dispatch subroutine. Putting all the pieces together, I soon realized that we were incorrectly trying to run exit handlers on fake callframes we make in order to test if a given multi-candidate will bind a certain set of arguments. Since we never run code in the callframe, and only use it for storage while we test if the signature could bind, the frame was never set up completely enough for running the handlers to work out well. It may have taken some work to find, but it was thankfully very easy to fix.
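For the curious, here is a hedged sketch of the kind of code that combines the three ingredients – not the actual reported test case, just an illustration:

multi sub half(Int $n where * %% 2) {
    LEAVE { }           # any exit handler (LEAVE, KEEP, UNDO, POST, temp, let) will do
    $n div 2;
}
multi sub half(Int $n) { ($n - 1) div 2 }
say half(7) for ^100;   # trial-binding the where-clause candidate over and over hit the bug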

Other assorted bits

I did various other little things that warrant a quick mention:

  • Fix RT #125260 (sigils of placeholder parameters did not enforce type constraints)
  • Fix RT #124842 (junction tests fudged for wrong reason, and also incorrect)
  • Fix attempts to auto-thread Junction type objects, which led to weird exceptions rather than a dispatch failure; add tests
  • Review a patch to fix a big endian issue in MoarVM exception optimization; reject it as not the right fix
  • Fix Failures that were fatalized reporting themselves as leaked/unhandled
  • Investigate RT #125331, which is seemingly due to reading bogus annotations
  • Fix a MoarVM build breakage on some compilers

That week: concurrency fixes, control exceptions, and more

I’m back! I was able to rest fairly well in the days leading up to my wedding, and had a wonderful day. Of course, after the organization leading up to it, I was pretty tired again afterwards. I’ve also had a few errands to run getting our new apartment equipped with various essentials – including a beefy internet connection. And, in the last couple of days, I’ve been steadily digging back into work on my Perl 6 grant. :-) So, there will be stuff to tell about that soon; in the meantime, here’s the May progress report I never got around to writing.

Before fixing something, be sure you need it

I continued working on improving stability of our concurrent, async, and parallel programming features. One of the key things I found to be problematic was the frame cache. The frame cache was introduced to MoarVM very early on, to avoid memory allocations for call frames by keeping them cached. Back then, we allocated them using malloc. The frames were cached per thread, so in theory no thread safety issues, right? Well, wrong, unfortunately. The MoarVM garbage collector works in parallel, and while it does have to pause all threads for some of the work, it frees them to go on their way when it can. Also, it steals work from threads that are blocked on doing I/O, acquiring a lock, waiting on a condition variable, and so forth. One of the things that we don’t need to block all threads for is clearing out the per-thread nursery, and running any cleanup functions. And some of those cleanups are for closures, and those hold references to frames, which when freed go into the frame pool. This is enough for a nasty bug: thread A is unable to participate in GC, thread B steals its GC work, the GC gets mostly done and things resume, thread A comes back from its slumber and begins executing frames, touching the frame pool…which thread B is also now touching because it is freeing closures that got collected. Oops.

I started pondering how best to deal with this…and then had a thought. Back when we introduced the frame cache, we used malloc to allocate frames. However, now we use MoarVM’s own fixed size allocator – which is a good bit faster (and of course makes some trade-offs to achieve that – there’s no free lunch in allocation land). Was the frame cache worth it? I did some benchmarking and discovered that it wasn’t worth it in terms of runtime (and timotimo++ did some measurements that suggested we might even be better off without it, though it was in the “noise” region). So, we could get less code, and be without a nasty threading bug. Good, right? But it gets better: the frame cache kept hold of memory for frames that we only ever executed during startup, or only ran during the compile phase and not at runtime. So eliminating it would also shave a bit off Rakudo’s memory consumption too. With that data in hand, it was a no-brainer: the frame cache went away, and with it an icky concurrency bug.

However, the frame cache had one last “gift” for us: its holding on to memory had hidden a reference counting mistake for deserialized closures. Thankfully, I was able to identify it fairly swiftly and fix it.

More concurrency fixes

I also landed a few more concurrency-related fixes. Two of them related to parametric role handling and role specialization, which needed some concurrency control. To ease this I added a Lock class to NQP, with pretty much the same API as the Perl 6 one. (You could use locks from NQP before, but there was some boilerplate. Now it’s easy.) One other long-standing issue we’ve had is that using certain features (such as timers) could drive a single CPU core up to 100% usage. This came from using an idle handler on the async event loop thread to look for new work and check if we needed to join in with a garbage collection run. It did yield to the OS if it had nothing to do – but on an unloaded system, a yield to the OS scheduler will simply get you scheduled again in pretty short order. Anyway, I spent some time looking into a better way and implemented it. Now we use libuv async handlers to wake up the event loop when there’s new work it needs to be aware of. And with that, we stopped swallowing up a core. Unfortunately, while this was a good fix, our CPU hunger had been making another nasty race involving lazy deserialization only show up occasionally. Without all of that resource waste, this race started showing up all over the place. This is now fixed – but it happened just in these last couple of days, so I’ll save talking about it until the next report.
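For reference, the Perl 6-level Lock API that the new NQP class mentioned above mirrors looks like this (a minimal sketch):

my $lock = Lock.new;
$lock.protect({
    # code here runs with the lock held; it is released again even if we throw
});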

Startup time improvements

Our startup time has continued to see improvements. One of the blockers was a pre-compilation bug that showed up when various lazily initialized symbols were used (of note, $*KERNEL, $*VM, and $*DISTRO). It turns out that setting these up poked things into the special PROCESS package. While it makes no sense to serialize an updated version of this per-process symbol table…we did. And there was no good mechanism to prevent that from happening. Now there is one, resolving RT #125090. This also unblocked some startup time optimizations. I also got a few more percent off startup by deferring construction of a rarely used DateTime object for the compiler build date.

Control exception improvements

Operations like next, last, take, and warn throw “control exceptions”. These are not caught by a normal CATCH handler, and you don’t normally need to think of them as exceptions. In fact, MoarVM goes to some effort to allow their implementation without ever actually creating an exception object (something we take advantage of in NQP, though not yet in Rakudo). If you do want to catch them and process them, you can use a CONTROL block. Trouble is, there was no way to talk about what sort of control exceptions were interesting. Also, we had some weird issues when a control exception went unhandled (RT #124255), manifesting in a segfault a while back, though by the time I looked at it we had got as far as merely giving a poor error. Anyway, I fixed that, and introduced types for the different kinds of control exception, so you can now do things like:

CONTROL {
    when CX::Warn {
        log(.message);
    }
}

To do custom logging of warnings. I also wrote various tests for these things.

Collaboration

I helped out others, and they helped me. :-)

  • FROGGS++ was working on using the built-in serialization support we have for the module database, but was running into a few problems. I helped resolve them…and now we have faster module loading. Yay.
  • TimToady++ realized that auto-generated proto subs were being seen by CALLER, while auto-generated method protos were not, and nor were hand-written protos that only delegated to the correct multi. He found a way to fix it; I reviewed it and tweaked it to use a more appropriate and far cheaper mechanism.
  • I noticed that calling fail was costly partly because we constructed a full backtrace. It turns out that we don’t need to fully construct it, but rather can defer most of the work (for the common case where the failure is handled). I mentioned this on channel, and lizmat++ implemented it.

Other bits

I also did a bunch of smaller, but useful things.

  • Fixed two tests in S02-literals/quoting.t to work on Win32
  • Did a Panda Windows fix
  • Fixed a very occasional SEGV during CORE.setting compilation due to a GC marking bug (RT #124196)
  • Fixed a JVM build breakage
  • Fixed RT #125155 (.assuming should be supported on blocks)
  • Fixed a crash in the debugger, which made it explode on entry for certain scripts

More next time!


Taking a short break

Just a little update. When I last posted here, I expected to write the post about what I got up to in the second week of May within a couple of days. Unfortunately, in the next couple of days I got quite sick, then after a few days a little better again, then – most likely due to my reluctance to cancel anything – promptly got sick again. This time, I’ve been trying to do a better job of resting so I can make a more complete recovery.

By now I’m into another round of “getting better”, which is good. However, I’m taking it as easy as I can these days; at the weekend I will have my wedding, and – as it’s something I fully intend to only do once in my life – I would like to be in as good health as possible so I can enjoy the day! :-)

Thanks for understanding, and I hope to be back in Perl 6 action around the middle of next week.


Last week: smaller hashes, faster startup, and many fixes

I’m a little behind with writing up these weekly-ish reports, mostly thanks to attending OSDC.no and giving about 5 hours worth of talks/tutorials. That, along with spending time with the various other Perl 6 folks who were in town, turned out to be quite exhausting. Anyway, I’m gradually catching up on sleep. I’ll write up what I did in the first week of May in this post, and I’ll do one in a couple of days for the second week of May.

Shrinking hashes

In MoarVM, we’ve been using the UT Hash library since the earliest days of the VM. There was no point spending time re-implementing a common data structure for what, at the time, was decidedly an experimental project, especially when there was an option with an appropriate license and all contained in a single header file. While we started out with it fairly “vanilla”, we’ve already tweaked it in various ways (for example, caching the hash code in the MVMString objects). UT Hash does one thing we didn’t really need or want for Perl 6: retain the insertion order. Naturally, we paid for this: to the tune of a pointer per hash, plus two pointers per hash item (for a doubly linked list). I ripped these out (which involved re-writing a few things that depended on them), as well as gutting the library of other functionality we don’t need. This means hashes on MoarVM on 64-bit builds are now 8 bytes smaller per hash, and 16 bytes smaller per hash item. Since we use hashes for plenty of things, this is a welcome saving; it even produces a measurable decrease in Rakudo’s base memory. This also paves the way for further improvements to hashes and string storage. (For what it’s worth, the time savings as a result of these changes were tiny compared to the memory savings: measurable under callgrind, but a drop in the ocean. Turns out a doubly linked list is a rather cheap thing to maintain.)

Decreasing startup time

I did a number of things to improve Rakudo’s startup time on MoarVM. One of them was changing the way we store string literals in the bytecode file. Previously, they were all stored as UTF-8. Now, those that are in the Latin-1 range are stored as Latin-1. Since this doesn’t need any kind of decoding or normalization, it is far cheaper to process them. This gave a 10% decrease in instructions executed at startup. Related to this, I aligned the serialization blob string heap with the bytecode file one, meaning that we don’t have to store a bunch of mappings between the two (another 2% startup win, and a decrease in bytecode file size, probably 100KB over all the bytecode files we load into memory as part of Rakudo).

Another 2% fell off when I realized that some careful use of static inlines and a slightly better design would lower the cost of temporary root handling (used to maintain the “GC knows where everything is” invariant) – a change that will also be an improvement in a lot of other code beyond startup.

Another small win came from noticing that we built the backend configuration data hash every time we were asked for it, rather than building it once and caching it. This was easy to fix, and saved building it 7 times over during Rakudo’s startup (creating more work for the GC each time, of course). Finally, I did a little optimization of bounds checking when validating bytecode. These two improvements were each below the 1% threshold, though together added up to about that.

The revenge of NFG

Shortly after landing NFG, somebody reported a fairly odd output related bug. After a while, it was isolated to UTF-8 encoding, and when I looked I saw…a comment saying the encoder code would need updating post-NFG. D’oh. So, I took care of that, saving us some memory along the way. I also noticed some mess with NULL-terminated string handling (MoarVM ones aren’t, of course, but we do have to deal with the C world now and then), which I took the opportunity to get cleaned up.

Digging into concurrency bugs

Far too much of the concurrency work I did in Rakudo and the rest of the stack was conducted under a good amount of time pressure. Largely, I focused on getting things sufficiently in place that we could play with them and explore semantics. This has served us well; there was a chance to show running examples to the community at conferences and workshops and gather feedback, and for Larry and others to work on the API and language level of things. The result is that we’ve got a fairly nice bunch of concurrency features in the language now, but anybody who has tried to use them much will have noticed they behave in a rather flaky way in a number of circumstances. With NFG largely taken care of, I’m now going to be turning more of my attention to these issues.

Here are some of the early things I’ve done to start making things better:

  • Fixing a couple of bugs so the t/concurrency set of tests in NQP work much better on MoarVM
  • Hunting down and resolving an ABA problem in the MoarVM fixed size allocator free list handling, which occurred rarely but could cause fairly nasty memory corruption when it did (RT #123883)
  • Eliminating a bad assumption in the ThreadPoolScheduler which caused some oddities in Promise execution order (RT #123520)
  • Adding missing concurrency control to the multi-dispatch cache. While it was designed carefully to permit reads (concurrent with both other reads and additions), so you never need to lock when doing a multi-dispatch cache lookup, the additions need to be done one at a time (something foreseen when it was designed, but not handled). These additions are now serialized with a lock.

Other assorted bits

Here are a few other small things I did, worth a quick mention.

  • Fix and tests for RT #123641 and #114966 (cases where we got a block, but should have got an anonymous hash)
  • Fix RT #77616 (/a ~ (c) (b)/ reversed positional capture order)
  • Review and reject RTs #116525 and #117109 (out of line with decided flattening semantics)
  • Add test for and resolve RT #118785 (“use fatal” semantics were already fixed, but good to have an explicit test case)
  • Tests for RT #122756 and #114668 (already fixed mixin bugs)

This week: the big NFG switch on, and many fixes

During my Perl 6 grant time over the last week, I have continued with the work on Normal Form Grapheme, along with fixing a range of RT tickets.

Rakudo on MoarVM uses NFG now

Since the 24th April, all strings you work with in Rakudo on MoarVM are at grapheme level. This includes string literals in program source code and strings obtained through some kind of I/O.

my $str = "D\c[COMBINING DOT ABOVE]\c[COMBINING DOT BELOW]";
say $str.chars;     # 1
say $str.NFC.codes; # 2

This sounds somewhat dramatic. Thankfully, my efforts to ensure my NFG work had good test coverage before I enabled it so widely meant it was, for most Rakudo users, something of a non-event. (Like so much in compiler development, you’re often doing a good job when people don’t notice that your work exists, and instead are happily getting on with solving the problems they care about.)

The only notable fallout reported (and fixed) so far was a bizarre bug that showed up when you wrote a one byte file, then read it in using .get (which reads a single line), which ended up duplicating that byte and giving back a 2-character string! You’re probably guessing off-by-one at this point, which is what I was somewhat expecting too. However, it turned out to be a small thinko of mine while integrating the normalization work with the streaming UTF-8 decoder. In case you’re wondering why this is a somewhat tricky problem, consider two packets arriving to a socket that we’re reading text from: the bytes representing a given codepoint might be spread over two packets, as may the codepoints representing a grapheme.

There were a few other interesting NFG things that needed attending to before I switched it on. One was ensuring that NFG is closed over concatenation operations:

my $str = "D\c[COMBINING DOT ABOVE]";
say $str.chars; # 1
$str ~= "\c[COMBINING DOT BELOW]";
say $str.chars; # 1

Another is making sure you can do case changes on synthetics, which has to do the case change on the base character, and then produce a new synthetic:

my $str = "D\c[COMBINING DOT ABOVE]\c[COMBINING DOT BELOW]";
say $str.NFD; # NFD:0x<0044 0323 0307>
say $str.NFC; # NFC:0x<1e0c 0307>
say $str.lc.chars; # 1
say $str.lc.NFD; # NFD:0x<0064 0323 0307>
say $str.lc.NFC; # NFC:0x<1e0d 0307>

(Aside: I’m aware there are some cases where Unicode has defined precomposed characters in one case, but not in another. Yes, it’s an annoying special case (pun intended). No, we don’t get this right yet.)

Uni improvements

I’ve been using the Uni type quite a bit to test the various normalization algorithms as I’ve implemented them. Now I’ve also filled out various missing bits of functionality:

  • You can access the codepoints using array indexing
  • It responds to .elems, and behaves that way when coerced to a number also
  • It provides .gist and .perl methods, working like Buf
  • It boolifies like other array-ish things: true if non-empty

I also added tests for these.
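Here's a small sketch of those conveniences (the outputs follow from the descriptions above):

my $u = Uni.new(0x0044, 0x0323, 0x0307);
say $u[1].base(16);   # 323 – array indexing gives the individual codepoints
say $u.elems;         # 3
say ?$u;              # True – it boolifies like other array-ish things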

What the thunk?

There were various scoping related bugs when the compiler had to introduce thunks. This led to things like:

try say $_ for 1..5;

Not having $_ correctly set. There were similar issues that sometimes made you put braces around gather blocks when it should have worked out fine with just a statement. These issues were at the base of 5 different RT tickets.

Regex sub assertions with args

I fixed the lexical assertion syntax in regexes, so you can now say:

<&some-rule(1, 2)>
<&some-rule: 1, 2>

Previously, trying to pass arguments resulted in a parse error.
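Here is a hedged example of the now-working syntax in context (the regex and input are made up):

my regex delimited($l, $r) { $l \w+ $r }
say "x [abc] y" ~~ / <&delimited('[', ']')> /;   # matches the "[abc]" part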

Other assorted fixes

Here are a few other RTs I attended to:

  • Fix and tests for RT #124144 (uniname explodes in a weird way when given negative codepoints)
  • Fix and test for RT #124333 (SEGV in MoarVM dynamic optimizer when optimizing huge alternation)
  • Fix and test RT #124391 (use after free bug in error reporting)
  • Fix and test RT #114100 (Capture.perl flattened hashes in positionals, giving misleading output)
  • Fix and test RT #78142 (statement modifier if + bare block interaction; also fixed it for unless and given)

This week: digging into NFG, fixing “use fatal”, and more

It’s time for this week’s grant report! What have I been up to?

NFG

Last time, I talked about normalization forms in Unicode, and how Rakudo on MoarVM now lets you move between them and examine them, using the Uni type and its subclasses. We considered an example involving 3 codepoints:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
.say for $codepoints.list>>.uniname;

They are (as printed by the code above):

LATIN CAPITAL LETTER D
COMBINING DOT BELOW
COMBINING DOT ABOVE

We also noted that if we put them into NFC (Normal Form Composed – where we take codepoint sequences and identify where we can use precomposed codepoints), using this code:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
.say for $codepoints.NFC.list>>.uniname;

Then we get this:

LATIN CAPITAL LETTER D WITH DOT BELOW
COMBINING DOT ABOVE

Now, if you actually render that, and presuming you have a co-operating browser, it comes out as Ḍ̇ (you should hopefully be seeing a D with a dot above and below). If I were to stop people on the street and ask them how many characters they see when I show them a “Ḍ̇”, they’re almost certainly all going to say “1”. Yet if we work at codepoint level, even having applied NFC, we’re going to consider those as 2 “thingies”. Worse, we could do a substr operation and chop off the combining dot above. This is actually the situation in most programming languages.

Enter NFG. While it’s not on for all strings by default yet, if you take any kind of Uni and coerce it to a Str, you have a string represented in Normal Form Grapheme. The first thing to note is that if we ask .chars of it, we get 1:

my $codepoints = Uni.new(0x0044, 0x0323, 0x0307);
say $codepoints.Str.chars; # 1

How did it do this? To get Normal Form Grapheme, we first calculate NFC. Then, if we are left with combining characters (formally, anything with a non-zero value for Canonical_Combining_Class), we compute a synthetic codepoint and use it inside of the string. You’ll never actually see these. When we output the string, or turn it back into codepoints, the synthetics unravel. But when you’re working with a Str, you’re at grapheme level. Better still, operations like .chars and .substr are O(1).
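That also means a substr can no longer split such a grapheme apart, and turning the string back into codepoints gives back the originals (a sketch):

my $str = Uni.new(0x0044, 0x0323, 0x0307).Str;
say $str.chars;                                   # 1
say $str.substr(0, 1).NFD.list.map(*.base(16));   # 44 323 307 – the synthetic unravels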

Under the hood, synthetics are represented by negative integers. This gives a cheap and easy way to know if we’re looking at a synthetic codepoint or not. And because we intern the synthetics, things like string equality testing can be a cheap and fast memory compare operation. On a further implementation note, I went with a partially lock-free trie to do the interning, meaning that no locks are needed to do lookups, and we only acquire one if we have to register a new synthetic – giving thread safety with very little overhead for real-world use cases.

I’m gradually assembling test cases for NFG. Some can be generated systematically from the Unicode NormalizationTest.txt file, though they are rather more involved since they have to be derived from the normalization test data by considering the canonical combining class. Others are being written by hand.

For now, the only way to get an NFG string is to call .Str on a Uni. After the 2015.04 release, I’ll move towards enabling it for all strings, along with dealing with some of the places where we still need to work on supporting NFG properly (such as in the leg operator, string bitwise operators, and LTM handling in the regex/grammar engine).

use fatal

For a while, Larry has noted that a $*FATAL dynamic variable was the wrong way to handle use fatal – the pragma that makes lazy exceptions (produced by fail) act like die.  The use fatal pragma was specified to automatically apply inside of a try block or expression, but until this week a Failure could escape a try, with the sad consequence that things like:

say try +'omg-not-anumber';

Would blow up with an exception saying you can’t sanely numify that. When I tried to make the existing implementation of use fatal apply inside of try blocks, it caused sufficient fireworks and action at a distance that it became very clear that not only was Larry’s call right, but we had to fix how use fatal works before we could enable it in try blocks. After some poking, I managed to get an answer that pointed me in the direction of doing it as an AST re-write in the lexical scope that did use fatal. That worked out far better. I applied it to try, and found myself with only one spectest file that needed attention (which had code explicitly relying on the old behavior). The fallout in the ecosystem seems to have been minimal, so we’re looking good on this. A couple of RTs were resolved thanks to this work, and it’s another bit of semantics cleaned up ahead of this year’s release.
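With the fix in place, the example above behaves sanely – a sketch, with the key point being that nothing escapes the try any more:

my $n = try +'omg-not-anumber';   # use fatal applies inside the try, so the Failure
                                  # throws there and the try itself catches it
say $n.defined;                   # False – an undefined result rather than an explosion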

Smaller things

Finally, here are a handful of smaller things I’ve worked on:

  • Fix and test for RT #77786 (poor error reporting on solitary backtrack control)
  • Make ‘soft’ and ‘MONKEY-TYPING’ work lexically
  • Fix and test for RT #124304 (interaction of ‘my \a = …’ declarations with EVAL)
  • Fix for RT #122914 (binding in REPL didn’t persist to the next line); also add tests and make REPL tests work on Win32

Have a good week!


This week: Unicode normalization, many RTs

After some months where every tuit was sacred, and so got spent on implementing Perl 6 rather than writing about implementing Perl 6, I’m happy to report that for the rest of 2015 I’ll now have time to do both again. \o/ I recently decided to take a step back from my teaching/mentoring work at Edument and dedicate a good bit more time to Perl 6. The new Perl 6 core development fund is a big part of what has enabled this, and a huge thanks must go to WenZPerl for providing the initial round of funding to make this happen. I wrote up a grant application, which contains some details on what I plan to be doing. In short: all being well with funding, I’ll be spending 50% of my working time for the rest of 2015 doing Perl 6 things, to help us towards our 2015 release goals.

As part of the grant, I’ll be writing roughly weekly progress reports here. While the grant application is open for comments still, to make sure the community are fine with it (so far, the answer seems to be “yes!”), I’ve already been digging into the work. And when I looked at the list of things I’d already got done, it felt like time to write it up. So, here goes!

Uni and NF-whatever

One of the key goals of the grant is to get Normal Form Grapheme support in place. What does that mean? When I consider Unicode stuff, I find it helps greatly to be clear about 3 different levels we might be working at, or moving between:

  • Bytes: the way things look on disk, on the wire, and in memory. We represent things at this level using some kind of encoding: UTF-8, UTF-16, and so forth. ASCII and Latin-1 are also down at this level. In Perl 6, we represent this level with the Buf and Blob types.
  • Code points: things that Unicode gives a number to. That includes letters, diacritics, mathematical symbols, and even a cat face with tears of joy. In Perl 6, we represent a bunch of code points with the Uni type.
  • Graphemes: things that humans would usually call “a character”. Note that this is not the same as code points; there are code points for diacritics, but a typical human would tell you that it’s something that goes on a character, rather than being a character itself. These things that “go on a character” are known in Unicode as combining characters (though a handful of them, when rendered, visually enclose another character). In Perl 6, we represent a bunch of graphemes with the Str type – or at least, we will when I’m done with NFG!

Thus, in Perl 6 we let you pick your level. But Str – the type of things you’d normally refer to as strings – should work in terms of graphemes. When you specify an offset for computing a substring, then you should never be at risk of your “cut the crap” grapheme (which is, of course, made from two codepoints: pile of poo with a combining enclosing circle backslash) getting cut into two pieces. Working at the codes level, that could easily happen. Working at the bytes level (as C# and Java strings do, for example), you can even cut a code point in half, dammit. We’ve been better than that in Perl 6 pretty much forever, at least. Now it’s time to make the final push and get Str up to the grapheme level.

Those who have played with Perl 6 for a while will have probably noticed we do have Str (albeit at the code point level) and Buf (correctly at the bytes level), but will not have seen Uni. That’s because it’s been missing so far. However, today I landed some bits of Uni support in Rakudo. So far this only works on the MoarVM backend. What can we do with a Uni? Well, we can create it with some code points, and then list them:

my $codepoints = Uni.new(0x1E0A, 0x0323);
say $codepoints.list.map(*.base(16)); # 1E0A 323

That’s a Latin capital letter D with dot above, followed by a combining dot below. Thing is, Unicode also contains a Latin capital letter D with dot below, and a combining dot above. By this point, you’re probably thinking, “oh great, how do we even compare strings!” The answer is normalization: transforming equivalent sequences of code points into a canonical representation. One of these is known as Normal Form Decomposed (aka. NFD), which takes all of the pre-composed characters apart, and then performs a stable sort (to handwave a little, it only re-orders things that don’t render at the same location relative to the base character; look up Canonical Combining Class and the Unicode Canonical Sorting Algorithm if you want the gory details). We can compute this using the Uni type:

say $codepoints.NFD.list.map(*.base(16)); # 44 323 307

We can understand this a little better by doing:

say uniname($_) for $codepoints.NFD.list;

Which tells us:

LATIN CAPITAL LETTER D
COMBINING DOT BELOW
COMBINING DOT ABOVE

The other form you’ll usually encounter is Normal Form Composed (aka. NFC). This is (logically, at least) computed by first computing NFD, and then putting things back together into composed forms. There are a number of reasons this is not a no-op, but the really important one is that computing NFD involves a sorting operation. We can see that NFC has clearly done some kind of transform just by trying it:

my $codepoints = Uni.new(0x1E0A, 0x0323);
say $codepoints.NFC.list.map(*.base(16)); # 1E0C 307

We can again dump out the code point names:

LATIN CAPITAL LETTER D WITH DOT BELOW
COMBINING DOT ABOVE

And see that the normal form is a Latin capital letter D with dot below followed by a combining dot above.

There are a number of further details – not least that Hangul (the beautiful Korean writing system) needs special algorithmic handling. But that’s the essence of it. The reason I started out working on Uni is because NFG is basically going a step further than NFC. I’ll talk more about that in a future post, but essentially NFG needs NFC, and NFC needs NFD. NFD and NFC are well defined by the Unicode consortium. And there’s more: they provide an epic test suite! 18,500 test cases for each of NFD, NFKD, NFC, and NFKC (K is for “compatibility”, and performs some extra mappings upon decomposition).

Implementing the Uni type gave me a good way to turn the Unicode conformance tests into a bunch of Perl 6 tests – which is what I’ve done. Of course, we don’t really want to run all 70,000 or so of them every single spectest, so I’ve marked the full set up as stress tests (which we run as part of a release) and also generated 500 sanity tests per normalization form, which are now passing and included in a normal spectest run. These tests drove the implementation of the four normalization forms in MoarVM (header, implementation). The inline static function in the header avoids doing the costly Unicode property lookups for a lot of the common cases.

So far the normalization stuff is only used for the Uni type. Next up will be producing a test suite for mapping between Uni and (NFG) Str, then Str to Uni and Buf, and only once that is looking good will I start getting the bytes to Str direct path producing NFG strings also.

Unicode 7.0 in MoarVM

While I was doing all of the above, I took the opportunity to upgrade MoarVM’s Unicode database to Unicode 7.0.

Unsigned type support in NativeCall

Another goal of my work is to try and fix things blocking others. Work on Perl 6 database access was in need of support for unsigned native integer types in NativeCall, so I jumped in and implemented that, and added tests. The new functionality was put to use within 24 hours!
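In NativeCall terms, that means declarations along these lines now work (the library and function here are made up):

use NativeCall;

# Hypothetical C function: uint32_t add_u32(uint32_t a, uint32_t b)
sub add_u32(uint32 $a, uint32 $b) returns uint32 is native('somelib') { * }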

RTs

Along with the big tasks like NFG, I’m also spending time taking care of a bunch of the little things. This means going through the RT queue and picking out things to fix, to make the overall experience of using Perl 6 more pleasant. Here are the ones I’ve dealt with in the last week or so:

  • Fix, add test for, and resolve RT #75850 (a capture \( (:a(2)) ) with a colonpair in parens should produce a positional argument but used to wrongly produce a named one)
  • Analyze, fix, write test for, and resolve RT #78112 (illegal placeholder use in attribute initializer)
  • Fix RT #93988 (“5.” error reporting bug), adding a typed exception to allow testability
  • Fixed RT #81502: added missing undeclared routine detection for BEGIN blocks, and included location info about BEGIN-time errors
  • Fixed RT #123967; provided good error reporting when an expression in a constant throws an exception
  • Tests for RT #123053 (Failure – lazy exceptions – can escape try), but no fix yet. I did try a few fixes that didn’t work out, but they led to two other small improvements along the way.
  • Typed exception for bad type for variable/parameter declaration; test for and resolve RT #123397
  • Fix and test for RT #123627 (bad error reporting of “use Foo Undeclared;”)
  • Fix and test for RT #123789 (SEGV when assigning type object to native variable), plus harden against similar bugs
  • Implemented “where” post-constraints on attributes/variables (see the sketch after this list). Resolves RT #122109; unfudged existing test case.
  • Handle “my (Int $x, Num $y)” (RT #102414) and complain on “my Int (Str $x)” (RT #73102)
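A hedged sketch of the “where” post-constraints on variables and attributes mentioned a couple of items up (names made up):

my Int $even where { $_ %% 2 } = 42;   # the constraint is checked on assignment
# $even = 43;                          # would die with a type check failure
class Server {
    has Int $.port where { 1 <= $_ <= 65535 };   # attributes can take them too
}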