This week: the big NFG switch on, and many fixes

During my Perl 6 grant time over the last week, I continued work on Normal Form Grapheme (NFG), along with fixing a range of RT tickets.

Rakudo on MoarVM uses NFG now

Since the 24th April, all strings you work with in Rakudo on MoarVM are at grapheme level. This includes string literals in program source code and strings obtained through some kind of I/O.

my $str = "D\c[COMBINING DOT BELOW]";
say $str.chars;     # 1
say $str.codes;     # 2

This sounds somewhat dramatic. Thankfully, my efforts to ensure my NFG work had good test coverage before I enabled it so widely meant it was, for most Rakudo users, something of a non-event. (Like so much in compiler development, you’re often doing a good job when people don’t notice that your work exists, and instead are happily getting on with solving the problems they care about.)

The only notable fallout reported (and fixed) so far was a bizarre bug that showed up when you wrote a one-byte file, then read it in using .get (which reads a single line), which ended up duplicating that byte and giving back a 2-character string! You’re probably guessing off-by-one at this point, which is what I was somewhat expecting too. However, it turned out to be a small thinko of mine while integrating the normalization work with the streaming UTF-8 decoder. In case you’re wondering why this is a somewhat tricky problem, consider two packets arriving at a socket that we’re reading text from: the bytes representing a given codepoint might be spread over two packets, as may the codepoints representing a grapheme.
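To make the packet-boundary problem concrete, here is a small Python sketch (illustrating the concept only, not Rakudo’s actual decoder) using an incremental UTF-8 decoder. The two bytes encoding U+0323 arrive in separate chunks, and the decoder must buffer the dangling partial sequence rather than emit garbage; a grapheme-aware layer on top has the analogous problem of holding back a base character until it knows no combining character follows.

```python
import codecs

# UTF-8 bytes for "D" + COMBINING DOT BELOW (U+0323), split so that the
# two-byte sequence for U+0323 straddles a "packet" boundary.
packet1 = b"D\xcc"   # 'D' plus the first byte of U+0323
packet2 = b"\xa3"    # the second byte of U+0323

decoder = codecs.getincrementaldecoder("utf-8")()

# The decoder emits only completed characters, buffering the rest:
out1 = decoder.decode(packet1)   # "D" (the dangling 0xcc is held back)
out2 = decoder.decode(packet2)   # "\u0323" once the sequence completes

print(repr(out1 + out2))
```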

There were a few other interesting NFG things that needed attending to before I switched it on. One was ensuring that NFG is closed over concatenation operations:

my $str = "D\c[COMBINING DOT BELOW]" ~ "\c[COMBINING DOT ABOVE]";
say $str.chars; # 1
say $str.codes; # 3
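Rakudo handles this at the grapheme level inside MoarVM; as a rough Python illustration of why concatenation needs renormalization at the join point, note that NFC composition can reach across the boundary of the two operands:

```python
import unicodedata

left = "D\u0323"   # "D" + COMBINING DOT BELOW; alone it NFC-composes to U+1E0C
right = "\u0307"   # COMBINING DOT ABOVE on its own

# Naive codepoint concatenation is not normalized: U+0044 U+0323 U+0307
joined = left + right

# NFC composes across the former join: U+1E0C U+0307
print([hex(ord(c)) for c in unicodedata.normalize("NFC", joined)])
```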

Another is making sure you can do case changes on synthetics, which has to do the case change on the base character, and then produce a new synthetic:

say $str.NFD; # NFD:0x<0044 0323 0307>
say $str.NFC; # NFC:0x<1e0c 0307>
my $lower = $str.lc;
say $lower.chars; # 1
say $lower.NFD; # NFD:0x<0064 0323 0307>
say $lower.NFC; # NFC:0x<1e0d 0307>

(Aside: I’m aware there are some cases where Unicode has defined precomposed characters in one case, but not in another. Yes, it’s an annoying special case (pun intended). No, we don’t get this right yet.)
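The base-character case change can be checked against standard Unicode case mapping; here is a hedged Python illustration (standard library behaviour, not the Rakudo implementation) showing that lowercasing the composed form stays consistent with the decomposed form above:

```python
import unicodedata

s = "\u1E0C\u0307"      # Ḍ (U+1E0C) + COMBINING DOT ABOVE
lowered = s.lower()      # case-maps the base; the combining mark is untouched

print([hex(ord(c)) for c in lowered])          # expect U+1E0D U+0307
print(unicodedata.normalize("NFD", lowered))   # d + dot below + dot above
```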

Uni improvements

I’ve been using the Uni type quite a bit to test the various normalization algorithms as I’ve implemented them. Now I’ve also filled in various missing bits of functionality:

  • You can access the codepoints using array indexing
  • It responds to .elems, and coerces to that value in numeric context
  • It provides .gist and .perl methods, working like Buf
  • It boolifies like other array-ish things: true if non-empty

I also added tests for these.

What the thunk?

There were various scoping-related bugs when the compiler had to introduce thunks. This led to things like:

try say $_ for 1..5;

not having $_ correctly set inside the try. There were similar issues that sometimes made you put braces around gather blocks when things should have worked out fine with just a statement. These issues were at the base of five different RT tickets.
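The underlying problem is analogous to the classic closure-capture pitfall: when a compiler wraps code in a deferred thunk, that thunk must capture the right lexical environment, not whatever the enclosing variables happen to hold when it finally runs. A Python analogy (illustrating the pitfall only, not the Rakudo fix):

```python
# Each lambda is a "thunk": deferred code that must see the right value
# of the loop variable when it eventually runs.
wrong = [lambda: i for i in range(1, 6)]
print([f() for f in wrong])   # every thunk sees the final i

# Binding the value at thunk-creation time gives each thunk its own scope:
right = [lambda i=i: i for i in range(1, 6)]
print([f() for f in right])
```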

Regex sub assertions with args

I fixed the lexical assertion syntax in regexes, so you can now say:

<&some-rule(1, 2)>
<&some-rule: 1, 2>

Previously, trying to pass arguments resulted in a parse error.

Other assorted fixes

Here are a few other RTs I attended to:

  • Fix and tests for RT #124144 (uniname explodes in a weird way when given negative codepoints)
  • Fix and test for RT #124333 (SEGV in MoarVM dynamic optimizer when optimizing huge alternation)
  • Fix and test RT #124391 (use after free bug in error reporting)
  • Fix and test RT #114100 (Capture.perl flattened hashes in positionals, giving misleading output)
  • Fix and test RT #78142 (statement modifier if + bare block interaction; also fixed it for unless and given)

5 Responses to This week: the big NFG switch on, and many fixes

  1. betterworld says:

    How many synthetic codepoints is rakudo able to hold? Can there be some kind of a DOS attack by sending strings with all kinds of combining characters until the synthetics are exhausted?

    • We use 32-bit integers to represent graphemes, and the synthetics are all the negatives. So, you get 2**31 synthetics. That’ll take a while to exhaust, though it’s not unimaginable somebody could construct an attack.

      There are various schemes that can be used to tackle this risk, enabled by the fact that we take care to never leak synthetics into user-space, keeping the negatives entirely internal. Since we know where all the strings are, we can reclaim and re-use unused synthetic codepoints (at the cost of walking all strings, though immutability of strings and our allocation scheme means that we have a chance of doing it mostly concurrently with execution). We can even go so far as adopting a per-string set of synthetics (but it makes things like equality comparison a good bit more costly).
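The interning scheme described in this reply — negative integers standing in for multi-codepoint graphemes, never leaked to user space — can be sketched as follows. This is my own simplified Python illustration, not MoarVM’s actual data structures; the class and method names are assumptions:

```python
class SyntheticTable:
    """Toy interning table mapping multi-codepoint graphemes to negative ids."""

    def __init__(self):
        self._by_seq = {}   # codepoint tuple -> synthetic id
        self._by_id = {}    # synthetic id -> codepoint tuple

    def grapheme_id(self, codepoints):
        """A lone codepoint is its own id; sequences intern to a negative id."""
        if len(codepoints) == 1:
            return codepoints[0]
        key = tuple(codepoints)
        if key not in self._by_seq:
            synth = -(len(self._by_seq) + 1)   # -1, -2, ... (2**31 available)
            self._by_seq[key] = synth
            self._by_id[synth] = key
        return self._by_seq[key]

    def codepoints(self, gid):
        """Expand a grapheme id back to its codepoint sequence."""
        return self._by_id[gid] if gid < 0 else (gid,)

t = SyntheticTable()
g = t.grapheme_id([0x1E0C, 0x307])   # Ḍ + combining dot above
print(g, t.codepoints(g))
```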

  2. betterworld says:

    Ok, that sounds good.

Here’s another thing I keep thinking about, and correct me if I am wrong: if I read a file into a string and write that string out into another file, the files might have different bytes in them. In most languages other than Perl 6 it will be very hard to write a program to find out whether the two files are identical.

    I tried this snippet:

perl6 -e '$*IN.get.say'

Feeding it the bytes e1 b8 8c cc 87 0a will result in a very different byte output, which qualifies as unexpected behaviour, imho…

  3. betterworld says:

    I’m sorry, the input is 44 cc 87 cc a3 0a, the output is e1 b8 8c cc 87 0a
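For what it’s worth, the transformation observed here is exactly NFC normalization, which Rakudo applies when decoding input to graphemes: the combining marks get canonically reordered (dot below, combining class 220, sorts before dot above, 230), and D plus dot below composes to U+1E0C. A Python check of the reported bytes (illustrative; Rakudo performs this normalization internally):

```python
import unicodedata

raw = bytes.fromhex("44 cc 87 cc a3")   # D + dot above + dot below, in UTF-8
text = raw.decode("utf-8")              # U+0044 U+0307 U+0323

# NFC reorders the marks canonically, then composes D + dot below -> U+1E0C
nfc = unicodedata.normalize("NFC", text)
print(nfc.encode("utf-8").hex(" "))     # matches the reported output bytes
```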
