Rakudo gets a new thread pool

Vienna.pm have funded me to work 50 hours on Perl 6. After some discussion, we decided I would first work on improving the thread pool scheduler, and then move on to continuing the work around non-blocking await, a feature of the upcoming Perl 6.d. In this post I’ll discuss the work on the thread pool scheduler, which was merged shortly after the latest Rakudo release to maximize testing time, and thus will appear in the next release (2017.10).

What was wrong with the ThreadPoolScheduler before?

My (by now slightly hazy) memory is that I wrote the initial Perl 6 thread pool implementation in the space of an hour or two on a train, probably heading to some Perl event or other. It was one of those “simplest thing that could possibly work” implementations that turned out to work well enough that it survived with only isolated tweaks and fixes until January this year. At that point, I added the initial bits of support for non-blocking await. Even that change was entirely additive, leaving the existing scheduling mechanism completely intact.

When I first implemented the thread pool, there were approximately no people writing concurrent Perl 6 programs. Happily, that’s changed, but with it came a few bug reports that could be traced back to the thread pool. Along with those, there were things that, while not being outright bugs, were far from ideal.

Before I dug into a re-design, I first summarized all of the problems I was aware of with the thread pool, so as I considered new designs I could “crash test” them against the problems that afflicted the previous design. Here’s my list of problems with the previous design.

  1. It was wasteful, spawning far too many threads in quite a lot of cases. A single use of Proc::Async would cause some threads to be created. A second use after that would add some more. A third use would add more. Even if these uses were not concurrent, threads would still be added, up until the maximum pool size. It was a very conservative way to make sure a thread would be available for each piece of work, provided the pool didn’t reach exhaustion point. But really, the pool doesn’t need more than a thread or two for many programs. RT #131915 was a ticket resulting from this behavior: each thread added to the memory consumption of the program, making memory use a good bit higher than it should have been for programs that really didn’t need more than a thread or two.
  2. Very active async I/O could swamp the thread pool with events. Since there was a single queue, timer events – which might be being used to kill off a process after a timeout – might not fire in a timely manner at all, because they would be stuck behind all of the I/O events. RT #130370
  3. Sometimes, despite the conservative mechanism described in #1, not enough threads got started and this led to occasional deadlocks. RT #122709
  4. It didn’t try to scale the number of threads to match the available CPU cores in any way. Just because there is a lot of work in the queue does not mean that we need more threads; if we are making progress and the CPU load is pretty high then adding more threads is unlikely to lead to more progress.
  5. In high-core-count systems, the default limit of 16 was too low. 32-core CPUs at tolerable prices very much exist by now!
  6. For programs that manage to really block a lot of threads with non-CPU bound work, we could deadlock. Most of that will go away with non-blocking await, but there will still be ways to write code that really blocks a bunch of OS threads and can’t make progress without yet more threads being made available to run something. At some point, people just have to write their programs differently, but the default limit of 16 threads was not very generous.
  7. Despite wishing to raise the maximum default pool size a good bit, we couldn’t do it because issues #1 and #4 meant we’d likely end up hitting the maximum in most programs, and memory use would become totally unreasonable.
  8. We suffered from poor thread affinity for events that would inevitably queue up due to serial supplies. For example, if two packets arrived from a socket then they might be picked up by different threads in the pool. If the first packet was still being processed, then the second would contend for the lock used to enforce serial processing of messages by supplies, and block until it was available.

3 kinds of queue

Problems 2 (active I/O swamping the queue and delaying timers) and 8 (poor thread affinity) are resolved in the new thread pool by recognizing that not all work given to the pool is equal. Timers really want to be dealt with in a timely manner, so in programs that have timers it makes sense to have a queue, and one or more worker threads, exclusively for time-based events. Work items that will need processing sequentially anyway should be assigned to a single thread, implying that each such “affinity worker” gets a queue of its own. Finally, there is a general queue for everything else, and one or more threads eat from it.
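
As a concrete (and purely illustrative) example of the kind of program these queues are carved up for, consider a process being watched with a timeout. The command and the timings below are arbitrary, not taken from Rakudo or the article; the point is just that timer events, I/O events, and serially processed supply messages all land on the scheduler at once.

my $proc = Proc::Async.new('find', '/tmp', '-type', 'f');
react {
    # I/O events: output lines arrive on a serial supply, so successive
    # messages are best handled by the same affinity worker.
    whenever $proc.stdout.lines -> $line {
        say "Saw: $line";
    }
    # A timer event: with a dedicated timer queue, this can fire on time
    # even if the general queue is swamped with I/O events.
    whenever Promise.in(2) {
        say 'Timeout reached; killing the process';
        $proc.kill;
    }
    whenever $proc.start {
        done;
    }
}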

Adding a supervisor

The other problems were addressed by the addition of a Sufficiently Smart Supervisor. The supervisor is a thread created by the thread pool upon its first use, and living from then until the end of the program. It spends most of its time sleeping, waking up around 100 times a second to check how things are going. It is the supervisor that makes most of the decisions about how many threads to add to the pool. It factors in:

  • Demand – that is, how much work there is that needs doing. It determines this by looking at the lengths of the timer queue and general queue. If they are empty, then there’s clearly no demand for extra workers. If they have something in them, that’s a sign that we might want more workers.
  • Available workers – if there’s a worker thread that’s currently not doing anything, then there’s little motivation to add another worker.
  • CPU core count – for CPU-bound work there’s not much point in having more threads than there are CPU cores, because otherwise the OS will have to juggle them and it won’t do CPU cache hit rates any good. Modern OS schedulers are good at dealing with lots of threads, and these days the MoarVM GC copes alright with having a load of threads to sync up also. Still, each extra thread brings memory overhead. Making the CPU core count a “soft limit” for the number of threads to spawn is usually a reasonable choice.
  • Resource usage – there’s work in the queue, but the scheduler already spawned as many threads as there are CPU cores. At this point, a couple of situations may apply. Perhaps there’s a queue because all the threads are hard at work on CPU-bound tasks, in which case adding more beyond the core count will not help matters. Alternatively, the thread pool may be full of threads that are somehow blocked, and so there isn’t much, or any, progress at all. In the worst case, there may be a deadlock due to the thread pool being full of blocked threads, and adding an extra thread would break it. To determine this, the supervisor looks at the program’s CPU usage. If it’s really low, then the deadlock situation is relatively likely, so threads may be added beyond the CPU core count.

Having the pool soft-limited to the number of CPU cores, and only reluctantly able to go beyond that, means that we can raise the maximum default pool size quite a lot; it now stands at 64 threads. In reality, many programs don’t even reach the CPU core count, as there’s never enough work to trigger the addition of more threads.
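
To give a feel for the sort of decision being made, here is a much-simplified sketch of the per-wakeup heuristic for the general queue. The sub name, parameter names, and the 0.2 usage threshold are invented for illustration; this is not the code that lives in Rakudo’s ThreadPoolScheduler.

sub should-add-general-worker(
    Int $queued,          # items waiting in the general queue
    Int $idle-workers,    # workers observed doing nothing right now
    Int $cur-workers,     # general workers already spawned
    Int $cpu-cores,       # CPU core count (the soft limit)
    Numeric $cpu-usage,   # recent CPU utilization, 0 .. 1
    Int $max-threads      # hard upper limit (64 by default)
    --> Bool
) {
    return False unless $queued;                # no demand at all
    return False if $idle-workers;              # an existing worker can take it
    return False if $cur-workers >= $max-threads;
    return True  if $cur-workers < $cpu-cores;  # still under the soft limit
    # Past the core count, only add a thread if we appear to be blocked
    # rather than CPU-bound: work is queued, yet CPU usage is very low.
    $cpu-usage < 0.2;
}

say should-add-general-worker(5, 0, 4, 4, 0.05, 64);  # True: likely blocked
say should-add-general-worker(5, 0, 4, 4, 0.95, 64);  # False: cores already busy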

Affinity workers

Affinity worker threads aren’t added by the supervisor. Instead, when an affinity queue is requested, any existing affinity worker threads will have their queue lengths inspected. The one with the shortest queue will be picked, provided that the queue length is below a threshold. If it’s over the threshold (which increases per affinity worker thread added), then another affinity worker thread will be spawned and the work allocated to it.
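
As a rough sketch of that selection logic (illustrative only; the sub name, the threshold value, and the Nil-means-spawn convention are mine, not Rakudo’s):

sub pick-affinity-worker(@queue-lengths, Int $per-worker-threshold = 4) {
    return Nil unless @queue-lengths;                   # no workers yet: spawn one
    my $threshold = $per-worker-threshold * @queue-lengths.elems;
    my $shortest  = @queue-lengths.pairs.min(*.value);  # index => queue length
    $shortest.value < $threshold ?? $shortest.key !! Nil;
}

say pick-affinity-worker([3, 1, 5]);  # 1: the worker with the shortest queue
say pick-affinity-worker([20, 22]);   # Nil: all over threshold, so spawn a new worker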

Outcomes

For all the cases I’ve tried so far, the new scheduler seems to do at least as well as the one it replaced – and sometimes much better. The deadlock due to insufficient threads bug is gone, thanks to the supervisor. Programs that do the occasional bit of work needing pool threads end up with just a thread or two, greatly reducing their memory consumption. A Cro web application that is processing just a handful of requests a second will now spawn just a few threads (and so use much less memory and get better locality), while one faced with some hundreds of requests per second will spawn more. And a program doing a lot of CPU-bound work now spawns as many threads as there are cores, which gives a small speedup compared to oversubscribing the CPU under the previous scheduler. Finally, timer events are delivered and handled in a timely way, even when there is a lot of output from a process.

And next?

Well, most immediately I should write some extra regression tests, so I can get the RT tickets mentioned earlier in the article closed up. Feel free to beat me to doing that, though! :-)

That aside, I’d like to mention that while the new scheduler is a decided improvement, it’s also improvable. The heuristics it uses to decide when to add further threads can surely be tuned some more. The code doing that decision making is written in Perl 6, which I hope makes it accessible to those who would like to experiment and tweak.

Once again, thanks to Vienna.pm for making this work possible. And, finally, a mention that I’m still looking for funding to help me keep on doing Perl 6 things for a sizable chunk of my work time; a handful of companies each sponsoring 10 hours a month would soon add up!


MoarVM Specializer Improvements Part 2: Planning

It’s been a good while since I posted part 1 of this little series. In the meantime, I paid a visit to the Swiss Perl Workshop, situated in a lovely hotel in the beautiful village of Villars. Of course, I was working on my slides until the last minute – but at least I could do it with this great view!

I gave two talks at the workshop, one of them about the MoarVM specializer (slides, video). Spoiler alert: the talk covers much of what I will write about in this blog series. :-) Now that I’ve recovered from the talk writing and the travel, it’s time to get back to writing here. Before I dig in to it, I’d like to say thanks to The Perl Foundation and their sponsors, who have made this work possible.

Where we left off

In the last part I discussed how we collected statistics about running code, in order to understand what to optimize and how to optimize it. The running example, which I’ll continue with, was this:

sub shorten($text, $limit) is export {
    $text.chars > $limit
        ?? $text.substr(0, $limit) ~ '...'
        !! $text
}

And an example of the kind of information collected is:

Latest statistics for 'infix:<~>' (cuid: 4245, file: SETTING::src/core/Str.pm:2797)

Total hits: 367

Callsite 0x7f1cd6231ac0 (2 args, 2 pos)
Positional flags: obj, obj

    Callsite hits: 367

    Maximum stack depth: 32

    Type tuple 0
        Type 0: Str (Conc)
        Type 1: Str (Conc)
        Hits: 367
        Maximum stack depth: 32

Which, to recap, tells us that the ~ infix operator was called 367 times so far, and was always passed two Str instances, neither of which was held in a Scalar container.

The job of the planner

The planner is the component that takes this statistical data and decides what code to optimize and what cases to optimize it for. It turns out to be, at least at the time of writing, one of the smaller parts of the whole specialization subsystem; subtract memory management code and debug bits and it’s not much over 100 lines.

Its inputs are a list of static frames whose statistics were updated by the latest data. There’s no point considering anything whose statistics did not change, since there will be nothing new since the last time the planner saw them and so the conclusion would be the same. In Perl 6 terms, a static frame may represent a routine (Sub or Method), Block, or Code thunk.

Initial filtering

The first thing the planner looks at is whether the static frame got enough hits to be worth even considering investing time on. So if it sees something like this:

Latest statistics for 'command_eval' (cuid: 8, file: src/Perl6/Compiler.nqp:31)

Total hits: 1

No interned callsite
    Callsite hits: 1

    Maximum stack depth: 5

It can simply ignore it based on the total hits alone. Something that gets a single hit isn’t worth spending optimization effort on.

There are cases where the total hits are very low – perhaps just a single hit – but optimization is still interesting, however. Here is the loop that I used to make shorten hot so we could look at its statistics and plan:

for 'x' x 50 xx 100000 {
    shorten($_, 42)
}

The statistics for the mainline of the program, containing the loop, start out like this:

Latest statistics for '<unit>' (cuid: 4, file: xxx.p6:1)

    Total hits: 1
    OSR hits: 273

    Callsite 0x7f1cd6241020 (0 args, 0 pos)

        Callsite hits: 1

        OSR hits: 273

        Maximum stack depth: 10

The total hits may well be 1, but obviously the loop inside of this code is hot and would be worth optimizing. MoarVM knows how to perform On Stack Replacement, where code running in a frame already on the callstack can be replaced with an optimized version. To figure out whether this is worth doing, it records the hits at “OSR points” – that is, each time we cross the location in the loop where we might replace the code with an optimized version. Here, the loop has racked up 273 OSR hits, which – despite only one hit in terms of entering the code at the top – makes it an interesting candidate for optimization.

Hot callsites

In MoarVM, a callsite (of type MVMCallsite) represents the “shape” of a set of arguments being passed. This includes:

  • How many positional arguments
  • How many named arguments and what their names are
  • Whether there is any flattening of arguments
  • The register types of arguments (object, native int, native num, or native str)

When bytecode is produced, callsites are interned – that is to say, if we have two calls that are both passing two positional object arguments, then they will share the callsite data. This makes them unique within a compilation unit. When bytecode is loaded, any callsites without flattening are further interned by the VM, making them unique across all compilation units in the VM instance. This means that they can then be used as a kind of “key” for things like multi-dispatch caching. The specializer’s statistics are also aggregated by callsite.
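
As a quick illustration of what “shape” means here (the subs and calls are invented for the example), the first two calls below share one interned callsite, while the third has a different one:

sub trace($a, $b)          { }
sub record($x, $y)         { }
sub notify($who, :$urgent) { }

trace('a', 'b');         # 2 positional object args -> callsite A
record(1, 2);            # same shape, so the same interned callsite A
notify('ops', :urgent);  # 1 positional + named 'urgent' -> callsite B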

The planner, therefore, takes a look at what callsites showed up. Often, code will be called with a consistent callsite, but things with optional parameters may not be. For example, if we were to run:

sub foo(:$bar) { }
for ^100000 { rand < 0.9 ?? foo(bar => 42) !! foo() }

The statistics of foo might end up looking like this:

Latest statistics for 'foo' (cuid: 1, file: -e:1)

Total hits: 367

Callsite 0x506d920 (2 args, 0 pos)
  - bar

    Callsite hits: 331

    Maximum stack depth: 13

    Type tuple 0
        Type 0: Int (Conc)
        Hits: 331
        Maximum stack depth: 13

Callsite 0x7f3fd7461620 (0 args, 0 pos)

    Callsite hits: 36

    Maximum stack depth: 13

    Type tuple 0
        Hits: 36
        Maximum stack depth: 13

Unsurprisingly, the case where we pass bar is significantly more common than the case where we don’t, so the planner would in this case favor spending time making a specialization for the case where bar was passed.

As an aside, specialization by optional argument tends to be a good investment. A common pattern is:

sub foo($bar, Bool :$some-option) {
    if $some-option {
        extra-stuff();
    }
}

In the case that $some-option is not passed, the specializer:

  • Knows it’s a type object
  • Can therefore know that the boolification will come out to False
  • Can therefore strip out the conditional and all the code in the if block

That stripping out of code might in turn bring the optimized version below the inlining limit, whereas before it might have been too big, which could bring about an additional win.
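
In effect, the specialization for the callsite that omits :$some-option behaves as if the sub had been written like this (a conceptual view only, not actual specializer output):

sub foo-for-no-option-callsite($bar) {
    # $some-option is known to be the Bool type object, which boolifies to
    # False, so the conditional and the call to extra-stuff() are gone.
}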

Hot type tuples

Once one or more hot callsites have been identified, the types that were passed will be considered. In the best case, that turns out to be very boring. For example, here:

Latest statistics for 'chars' (cuid: 4219, file: SETTING::src/core/Str.pm:2728)

Total hits: 273

Callsite 0x7f1cd6231ae0 (1 args, 1 pos)
Positional flags: obj

    Callsite hits: 273

    Maximum stack depth: 14

    Type tuple 0
        Type 0: Scalar (Conc) of Str (Conc)
        Hits: 273
        Maximum stack depth: 14

The code is consistently called with a Scalar container holding a Str instance. Therefore, we can see this code is monomorphic in usage. It is quite clearly going to be worth spending time producing a specialization for this case.

In other cases, the code might be somewhat polymorphic:

Latest statistics for 'sink' (cuid: 4795, file: SETTING::src/core/List.pm:694)

Total hits: 856

Callsite 0x7f1cd6231ae0 (1 args, 1 pos)
Positional flags: obj

    Callsite hits: 856

    Maximum stack depth: 31

    Type tuple 0
        Type 0: Slip (Conc)
        Hits: 848
        Maximum stack depth: 31

    Type tuple 1
        Type 0: List (Conc)
        Hits: 7
        Maximum stack depth: 26

Here we can see the sink method is being called on both Slip and List. But it turns out that the Slip case is far more common. Thus, the planner would conclude that the Slip case is worth a specialization, and the List case is – at least for now – not worth the bother.

Sometimes, a situation like this comes up where the type distribution is more balanced. Any situation where a given type tuple accounts for 25% or more of the hits is considered to be polymorphic. In this case, we produce the various specializations that are needed.

Yet another situation is where there are loads of different type tuples, and none of them are particularly common. One place this shows up is in multiple dispatch candidate sorting, which is seeing all kinds of types. This case is megamorphic. It’s not worth investing the time on specializing the code for any particular type, because the types are all over the place. It is still worth specialization by callsite shape, however, and there may be other useful optimizations that can be performed. It can also be compiled into machine code to get rid of the interpreter overhead.

Sorting

By this point, the planner will have got a list of specializations to produce. Specializations come in two forms: observed type specializations, which are based on matching a type tuple, and certain specializations, which are keyed only on an interned callsite (or the absence of one). Certain specializations are produced for megamorphic code. The list of specializations is then sorted. But why?

One of the most valuable optimizations the specializer performs is inlining. When one specialized static frame calls another, and the callee is small, then it can often be incorporated into the caller directly. This avoids the need to create and tear down an invocation record. Being able to do this relies on a matching specialization being available – that is, given a set of types that will be used in the call, there should be a specialization for those types.

The goal of the sorting is to try and ensure we produce specializations of callees ahead of specializations of their callers. You might have wondered why every type tuple and callsite was marked up with the maximum call depth. Its role is for use in sorting: specializing things with the deepest maximum call stack depth first means we will typically end up producing specialized code for potential inlinees ahead of the potential inliners. (It’s not perfect, but it’s cheap to compute and works out pretty well. Perhaps more precise would be to form a graph and do a topological sort.)
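
Using the maximum stack depths that show up in the dumps below, the idea behind the ordering looks roughly like this (a toy sketch, not the planner’s code):

my @planned =
    { name => 'infix:<~>', max-depth => 32 },
    { name => 'chars',     max-depth => 14 },
    { name => 'shorten',   max-depth => 13 };
# Deepest first, so the specialization of infix:<~> exists by the time we
# come to specialize shorten and consider inlining into it.
say .<name> for @planned.sort({ $_<max-depth> }).reverse;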

Dumping

The plan is then dumped into the specialization log, if that is enabled. Here are some examples from our shorten program:

==========

Observed type specialization of 'infix:<~>' (cuid: 4245, file: SETTING::src/core/Str.pm:2797)

The specialization is for the callsite:
Callsite 0x7f1cd6231ac0 (2 args, 2 pos)
Positional flags: obj, obj

It was planned for the type tuple:
    Type 0: Str (Conc)
    Type 1: Str (Conc)
Which received 367 hits (100% of the 367 callsite hits).

The maximum stack depth is 32.

==========

Observed type specialization of 'substr' (cuid: 4316, file: SETTING::src/core/Str.pm:3061)

The specialization is for the callsite:
Callsite 0x7f1cd6231a60 (3 args, 3 pos)
Positional flags: obj, obj, obj

It was planned for the type tuple:
    Type 0: Str (Conc)
    Type 1: Scalar (Conc) of Int (Conc)
    Type 2: Scalar (Conc) of Int (Conc)
Which received 274 hits (100% of the 274 callsite hits).

The maximum stack depth is 15.

==========

Observed type specialization of 'chars' (cuid: 4219, file: SETTING::src/core/Str.pm:2728)

The specialization is for the callsite:
Callsite 0x7f1cd6231ae0 (1 args, 1 pos)
Positional flags: obj

It was planned for the type tuple:
    Type 0: Scalar (Conc) of Str (Conc)
Which received 273 hits (100% of the 273 callsite hits).

The maximum stack depth is 14.

==========

Observed type specialization of 'substr' (cuid: 2562, file: SETTING::src/core/Cool.pm:74)

The specialization is for the callsite:
Callsite 0x7f1cd6231a60 (3 args, 3 pos)
Positional flags: obj, obj, obj

It was planned for the type tuple:
    Type 0: Scalar (Conc) of Str (Conc)
    Type 1: Int (Conc)
    Type 2: Scalar (Conc) of Int (Conc)
Which received 273 hits (100% of the 273 callsite hits).

The maximum stack depth is 14.

==========

Observed type specialization of 'infix:«>»' (cuid: 3165, file: SETTING::src/core/Int.pm:365)

The specialization is for the callsite:
Callsite 0x7f1cd6231ac0 (2 args, 2 pos)
Positional flags: obj, obj

It was planned for the type tuple:
    Type 0: Int (Conc)
    Type 1: Scalar (Conc) of Int (Conc)
Which received 273 hits (100% of the 273 callsite hits).

The maximum stack depth is 14.

==========

Observed type specialization of 'shorten' (cuid: 1, file: xxx.p6:1)

The specialization is for the callsite:
Callsite 0x7f1cd6231ac0 (2 args, 2 pos)
Positional flags: obj, obj

It was planned for the type tuple:
    Type 0: Str (Conc)
    Type 1: Int (Conc)
Which received 273 hits (100% of the 273 callsite hits).

The maximum stack depth is 13.

==========

Compared to before

So how does this compare to the situation before my recent work in improving specialization? Previously, there was no planner at all, nor the concept of certain specializations. Once a particular static frame hit a (bytecode size based) threshold, the type tuple of the current call was taken and a specialization produced for it. There was no attempt to discern monomorphic, polymorphic, and megamorphic code. The first four distinct type tuples that were seen got specializations. The rest got nothing: no optimization, and no compilation into machine code.

The sorting issue didn’t come up, however, because specializations were always produced at the point of a return from a frame. This meant that we naturally would produce them in deepest-first order. Or at least, that’s the theory. In practice, if we had a callee in a conditional that was true 90% of the time, then under the previous scheme there was a 10% chance we’d miss the opportunity. That’s a lot less likely to happen with the new design.

Another problem the central planning avoids is that of a bunch of threads, set off executing the same code, all trying to produce and install specializations. Previously this was handled by letting threads race to install a specialization. However, it could lead to a lot of throwaway work. Now the planner can produce them once; when it analyzes the logs of other threads, it can see that the specialization was already produced, and not plan anything at all.

Future plans

One new opportunity that I would like to exploit in the future is for derived type specializations. Consider assigning into an array or hash. Of course, the types of the assigned values will, in a large application, be hugely variable. Today we’d consider ASSIGN-POS megamorphic and so not try and do any of the type-based optimizations. But if we looked more closely, we’d likely see the type tuples are things like (Array, Int, Str), (Array, Int, Piranha), (Array, Int, Catfish), and so forth. Effectively, it’s monomorphic in its first two arguments, and megamorphic in its third argument. Therefore, a specialization assuming the first two arguments are of type Array and Int, but without specialization of the third argument, would be a win.
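
For instance (purely illustrative), each of these assignments reaches ASSIGN-POS with the same first two argument types but a different third one:

my @a;
@a[0] = 42;       # (Array, Int, Int)
@a[1] = 'fish';   # (Array, Int, Str)
@a[2] = 3.14;     # (Array, Int, Rat)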

Next time…

We now have a specialization plan. Let’s go forth and optimize stuff!


MoarVM Specializer Improvements Part 1: Gathering Data

Over the last weeks I’ve had the chance to work full time on Perl 6, and have dedicated this time to improving the MoarVM specializer. Since “specializer” is such a lot to type, but shortening it to “spec” would result in it being confused with specification (or perhaps bacon), we refer to it as “spesh”. But what is it?

Specialization

Specialization is the process of taking code with lots of late-binding in it, and producing one or more versions of the code with as much of that stripped out as we can. We take late binding in a very broad sense, really using it to mean any place where we defer a decision until the code is executed (that is, runtime) because we don’t know the context it will be executed in. So, if we write a module with this in it:

sub shorten($text, $limit) is export {
    $text.chars > $limit
        ?? $text.substr(0, $limit) ~ '...'
        !! $text
}

Then this code isn’t committed to the types of $text and $limit. Thus, we must:

  • Resolve the chars and substr methods at runtime
  • Decide which multi dispatch candidate of > and ~ to call
  • Since substr is also a multi, decide what multi dispatch candidate to call

Even if we put Str and Int type constraints on the sub, that still isn’t really enough: these could be subclassed (for example, by mixing in to them) and thus override the methods that we call. (Granted in this particular case that’d be a tad odd, but in the general case, it’s not.)

Even with nice things like method caches and multi-dispatch caches, which do provide a huge speedup, we still must do the cache lookups, call the methods, and so forth. The job of the specializer is to notice what types we actually call shorten with, and the types methods like chars return, and thus strip out the need to even do cache lookups. It can then go further, for example inlining the body of the small chars method (and maybe substr too) into the caller so that there will be no invocation overhead for them at all!

Before we go any further, a quick thank you

This work is funded by my current TPF grant, which was approved some weeks back (so I’ve already been working away at it), though only recently publicly announced due to an internal TPF procedure that had to be completed first. So, thanks go to TPF for administering the funding, and to those who have donated to TPF and the Perl 6 Core Development Fund for providing the funding.

A little series

I could just write up a list of the improvements I’ve done, but I figure that many readers here don’t know a great deal about spesh. So instead, I will write a series of posts working through the way spesh works today, after my changes, together with some notes on how it used to work and why that changed. Hopefully that will be a more interesting and useful read.

What to optimize?

When I’ve talked people through how spesh works in the past, I’ve tended to start by discussing how the bytecode is turned into a control flow graph in single static assignment form, and then we go on and do the various bits of analysis and transformation. We’ll get to that, but in fact there has always been a step before it.

Constructing a CFG and putting it in SSA form takes time; transforming it by doing a load of optimizations takes more. How much depends on the size and structure of the code. So we need to decide what code to spend time on. This was done in two ways.

The first was just incrementing a counter every time the sub, block, method, or regex was called. If the count passed a threshold (which was chosen based upon the size of the bytecode) then an optimized version of it would be produced. Easy enough.

Trouble is that this doesn’t catch an important case: that where we have a program that spends most of its lifetime in a loop. In that case, we only enter the block holding the loop once, which is never going to trigger optimization. Thus we also count the number of iterations of loops. If that counter hits a high enough value, then we produce the optimized code and replace the running code with it, potentially moving from the interpreter into machine code. This replacement of code with an optimized version is known as “on stack replacement”, because it’s replacing code already running on the call stack.

Gathering data

Just throwing bytecode representing a highly dynamic program into an optimizer won’t achieve much, however. The cost is in the late binding, and to remove that we need an idea of what kinds of things are showing up when the code is run. This data came from two sources.

The first source was the incoming parameter types. When code was determined as hot, then the shape of the callsite (how many arguments, which named arguments) and the types of the arguments were taken and turned into a “key” for the specialization. These could then be assumed inside of the optimizer.

This was not enough on its own to produce good optimizations, however, because a lot of data comes from attribute accesses, lexical accesses, and the return values of things the code calls. Therefore, before doing any significant work, spesh would go through the code and insert logging instructions. These would record the values that were observed. After 8 runs, the specialization process would continue, using this data.

Bad decisions and missing information

This worked relatively well, but had notable shortcomings. The main one concerns code that is highly polymorphic in nature – that is, it is called with many different types. In this case, whichever type it was called with on, say, its 100th call, would have a specialization produced for it. If the next call used a different type, another specialization would be produced for that. And so forth, up until we had four specializations, which was the number picked as the limit.

This meant that things like defined or sink, which are called very often but on a lot of different types, would often go unoptimized and un-JITted. Worse, the way inlining works is to inline any existing specialization. Inlining all the defined and sink calls would be ideal. But in reality, for any non-trivial program, that would not happen in most cases. Clearly, if only the specializer were able to take a step back, it could see this situation and do something smarter. But the data just wasn’t there to do it.

The 8 samples taken when logging could also get lucky or unlucky about what values they saw. If a value returned from a method is a certain type over 99% of the time, and less than 1% of the time it is something else, then it makes sense to stick a guard in, optimize for the 99% case, and let the guard fail and trigger deoptimization for the less than 1% of the cases. (A deoptimization is like the opposite of OSR: we swap the optimized version of the code for the slower one that can handle all of the cases, and let it take care of the rare cases).

With only 8 values, however, 1 of them being different suggests we might have to deoptimize over 10% of the time. Deoptimization takes a bit of time to do, but leaves us interpreting the code with a bunch of late binding. The stakes are too high if all the data can tell us is that there’s a 1 in 8 chance.

Another problem in the data was a lack of knowing about the stability of the types on the callee side of a callsite. We relied on being able to infer them based on things we did know, or had inserted checks to guard against. However, what if we were just one more guard clause away from being able to inline something? Previously we could never determine that, so had to miss out on the opportunity. With better data, it would be possible to do better.

Interruptions and stampedes

Spesh needed to interrupt the execution of the program twice for each frame it wished to optimize: once to parse the bytecode into the SSA CFG, insert the logging and produce a logged version of the code, and again after the logging to produce an optimized version of the code. This introduced pauses into the execution. So, every time spesh makes a program run faster, it has to do so sufficiently to overcome the time it stole to do its optimization work. And if it speeds something up that doesn’t get run much in the future, it can make the program slower overall.

This is pretty poor use of the parallel hardware available in pretty much all computers nowadays. Better would be to spend time doing the optimization work on a thread separate from the execution of the user’s program. This not only gets rid of the pauses; it also means that if the work is completed by the interpreter before the optimized version is ready then we didn’t slow the program down by stopping it to do optimization work that was never going to pay off anyway.

Having a thread taking care of specialization work can also resolve another problem. Imagine that a bunch of threads are set off executing the same code on a range of different data items (so, data parallelism). In this case, more than one thread could see that the code is hot, notice there’s not yet any specialization, and get to work on producing it. There was detection of any duplicate ones at installation time, but that still meant there was wasted work producing specializations. If 3 different threads wasted their time this way, then the amount of time before the specializer had paid for itself was also increased.

A new approach to data collection

Clearly, there was plenty of room for improvement. It was fairly clear that having a better data model representing the execution of the program, which the specializer would use to make smarter decisions, was a key part of this. At the same time, it was important that threads executing user code didn’t spend much of their time building this. Ideally, they would just throw data into a sequential buffer, and toss it over to another thread – a specializer thread – when it was full.

So, that’s the direction I headed in. Each thread would log interesting events into a thread local buffer. The append-only and thread local (for writes) nature of it should make it fairly cache friendly. It was desirable that the entries were fairly small; the main place this had an impact was parameter type logging, where we wish to log both the container type and the type held inside of the container. This was taken care of by just writing two entries, rather than all the rest being padded out (fixed size entries weren’t all that important, I guess, but it did make things easier).

I converged on a set of events for the log. It’s 24 bytes per entry, which isn’t too bad. One early mistake I made was logging invokes. I was keen that we start being able to inline calls to closures in the future, when it’s always the same code we call but with a different environment. So I logged the invoked code object, figuring the specializer thread could then pick the interesting parts out. This turned out to quite notably extend the lifetime of the closed over data, however. As the things we cared about were what code was invoked and whether the calling frame was also the outer frame, it was quite cheap to just extract those at the point of writing the log anyway.

The end result only logs references to static frames, type objects, and code objects where we’re told the result of the lookup can be cached because it will always be the same. It never logs values. This fixes an issue in the previous logging mechanism: during its 8 runs it would keep values alive for longer, and – much worse – if the 8 runs were never completed, it could keep them alive indefinitely. It was a bounded leak, but certainly one I’m glad to no longer have.

A hugely important part of the log is the correlation ID. This is a per-thread incrementing counter. It is bumped each time a frame is entered with logging turned on. It is then included in all events occurring within the execution of that frame. This allows the specializer thread to work out what events go with what code.

The spesh worker thread

The spesh worker thread sits in a loop, reading a log from a blocking queue (meaning it waits in an efficient way, and – once the program has been fully specialized – just never gets woken up again). Once a thread running code fills up a log buffer, it sticks it into the blocking queue, the spesh worker receives it, and it gets to work. I’ll go through the various steps it takes later in this series; for today I’ll focus only on the first one (updating the statistics model, below) and what it does before going to wait for another data buffer.
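
The pattern is the classic producer/consumer one. Conceptually (and only conceptually, since the real implementation is C inside MoarVM) it looks a bit like this in Perl 6 terms; the sub and variable names are made up for the sketch:

my Channel $log-queue .= new;

sub update-statistics($buffer) {
    say "processed {$buffer.elems} log entries";
}

# The worker blocks efficiently until a buffer arrives, and the loop simply
# ends once the queue is closed.
my $worker = start {
    for $log-queue.list -> $buffer {
        update-statistics($buffer);
    }
}

# A thread running user code fills a log buffer, then hands it over.
$log-queue.send([1, 2, 3]);
$log-queue.close;
await $worker;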

To prevent runaway memory use if the threads running code are producing logs way faster than the specializer can make use of them, threads are given a quota of log buffers they are allowed to send. On sending, they decrement the quota. If it’s still greater than zero, then they allocate another log buffer and continue logging. Otherwise, logging is disabled for that thread. Once the specialization worker thread is done processing a buffer and acting upon its content, it increments the sending thread’s quota. If it incremented it from zero, then it also installs a spesh log buffer for the thread, so it will continue logging.

Nailing the main loop

This scheme almost worked out fine, but there was an awkward problem. If we weren’t logging at the point that the outermost frame of the program started running (perhaps because we were specializing bits of the compiler still), then the outermost body of the program wouldn’t get a correlation ID, and so wouldn’t log any events. This would be especially unfortunate, since the first thing people measuring performance tend to do is write a hot loop in the mainline of a program! (More usefully, any perl6 -ne ... invocation has the same kind of program structure.)

Thankfully, there was an easy way to deal with this: when we enter a new compilation unit for the first time, if there is no spesh log, then grant a temporary quota boost. Alternatively, if the log is almost full, we send it off and make a new one (again with a temporary boost if needed).

Every heuristic written with good intentions can go rogue, and this one is no exception: imagine a program doing a load of EVALs. So, there is a fixed limit on how many times this quota boosting can happen, to ensure it handles the cases it is aimed at but doesn’t do harm.

Building a model

Over on the specialization thread, the events are fed into a simulation that recreates something much like the call stack of the logged program, so it can understand the relationships between callers and callees. This produces a set of statistics, hung off each invoked piece of code (known as a “static frame” in MoarVM); doing it this way has the advantage that the statistics will be garbage collected should the code also be garbage collected (this can matter in EVAL-heavy programs).

One of the things I like most about this new approach is that, with the MVM_SPESH_LOG environment variable set to the name of a file to log to, the assembled statistical data will be dumped. This makes it possible to see the data being used to make decisions. Enough with me describing stuff, though: let’s see the stats! Here’s an example program:

sub shorten($text, $limit) is export {
    $text.chars > $limit
        ?? $text.substr(0, $limit) ~ '...'
        !! $text
}
for ^10000 {
    shorten 'foo' x 100, (20..500).pick;
}

Here’s the statistics output after a while for the shorten sub:

Latest statistics for 'shorten' (cuid: 1, file: xxx.p6:1)

Total hits: 156

Callsite 0x7fdb8a1249e0 (2 args, 2 pos)
Positional flags: obj, obj

    Callsite hits: 156

    Maximum stack depth: 13

    Type tuple 0
        Type 0: Str (Conc)
        Type 1: Int (Conc)
        Hits: 156
        Maximum stack depth: 13
        Logged at offset:
            226:
                156 x type Int (Conc)
                156 x static frame 'chars' (4197) (caller is outer: 0)
                156 x type tuple:
                    Type 0: Scalar (Conc) of Str (Conc)
            260:
                156 x type Bool (Conc)
                1 x static frame 'infix:«>»' (2924) (caller is outer: 0)
                155 x static frame 'infix:«>»' (3143) (caller is outer: 0)
                156 x type tuple:
                    Type 0: Int (Conc)
                    Type 1: Scalar (Conc) of Int (Conc)
            340:
                91 x type Str (Conc)
                91 x static frame 'substr' (2541) (caller is outer: 0)
                91 x type tuple:
                    Type 0: Scalar (Conc) of Str (Conc)
                    Type 1: Int (Conc)
                    Type 2: Scalar (Conc) of Int (Conc)
            382:
                91 x type Str (Conc)
                91 x static frame 'infix:<~>' (4223) (caller is outer: 0)
                91 x type tuple:
                    Type 0: Str (Conc)
                    Type 1: Str (Conc)

Static values:
    - Sub+{<anon|77645888>} (0x2a70d48) @ 194
    - Sub+{<anon|77645888>}+{Precedence} (0x23a90c0) @ 288

From this we can see:

  • It’s always called with two arguments
  • Those are always a Str and an Int respectively
  • We always pass a Scalar variable holding a Str to chars, and it always returns an Int
  • We always pass an Int and a Scalar variable holding an Int to >, and it always returns a Bool
  • When we call substr and ~ the type tuples and return values are also consistent

At compile time it was also worked out that some lexical lookups will always resolve to the very same object; the results of those have been logged also (the static values section), so the optimization process can elide the lookups and do further optimizations by knowing exactly what will be called. (The lookups in question are for the > and ~ operations, which are lexical, but in this case not overridden in a nested scope and so will come from the CORE setting.)

Here’s another example, this time for chars:

Latest statistics for 'chars' (cuid: 4197, file: SETTING::src/core/Str.pm:2728)

Total hits: 157

Callsite 0x7fdb8a124a00 (1 args, 1 pos)
Positional flags: obj

    Callsite hits: 157

    Maximum stack depth: 33

    Type tuple 0
        Type 0: Str (Conc)
        Hits: 1
        Maximum stack depth: 33

    Type tuple 1
        Type 0: Scalar (Conc) of Str (Conc)
        Hits: 156
        Maximum stack depth: 14

Here we can see that it was called once with a Str value, but a bunch of times with a Scalar holding a Str. Therefore, it’s worth producing optimized code for the latter case, but not worth bothering about the former.

Clearing up

But what about all of the statistics we record on one-off cold paths, such as at startup? We’ll never do optimizations based on them, and so they’d just sit around using memory. Or what about after we’ve optimized a frame in all the needed ways, and it’s no longer logging new data? We don’t really need the statistics any more.

To alleviate this, each time a log buffer is received a statistics version number is incremented. The statistics are marked with the current version when updated. Any statistics that are not updated in a while will be presumed to be no longer interesting, and so thrown out.

Debuggability

When I first mentioned moving specialization work off to its own thread on #moarvm, brrt (who leads development of the JIT) was immediately concerned about the unpredictability this would introduce. Threads are scheduled a bit differently between runs, which would make it a nightmare to reproduce and hunt down specializer and JIT compiler bugs, including using the bisection approach to work out exactly which specialization made things go wrong. It was an excellent point.

Therefore, I introduced the MVM_SPESH_BLOCKING environment variable. When set, a thread executing code will send off its log, and then block until the spesh worker thread has finished processing it and installing specializations. This means that, for a single threaded program, the behavior will again be fully deterministic.

You won’t believe what happens next!

Alas, that’s all I’m covering this time. Next time, I’ll talk about how the statistics are used.


Shrinking MoarVM call frames

Last week, I did some work to greatly decrease the size of call frames, also known as invocation records, in MoarVM. In theory, a call frame is created whenever a sub, method, regex, or block is entered. In reality, scopes may be flattened away at compile time, which decreases the number of call frames needed. Further to that, dynamic optimization at runtime leads to inlining, which has the same result except it can do it with late-bound calls, even with callees in different compilation units. Even with these optimizations (and in part because of their limitations), most programs will still need to create and destroy many call frames over their lifetime, making their setup and tear-down a hot path, and their size a factor in program memory performance.

The work I’ll describe in this post has been funded by OETIKER+PARTNER AG, who responded to my recent funding call. In fact, they’re funding around 10 hours of work per month over the space of a year, so this will be just the first of a number of posts describing work that they are making possible. So, thanks!

Background

The MVMFrame data structure has been there since the very earliest days of MoarVM. My memory is hazy, but I suspect it was among the first dozen data structures that I sketched out when starting to design and implement the VM. Initially, call frames were reference counted and allocated out of a special pool. Now they live either in a per-thread call stack region allocated by incrementing a pointer and deallocated by decrementing the pointer, much like a traditional call stack, or on the heap if they “escape” (as a result of a closure, exception throw, or continuation). At present, that heap promotion is always lazy (so call frames are always born on the call stack).
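
As a Perl 6-level illustration of the closure case (nothing MoarVM-specific here, just the language behavior that forces the promotion):

sub make-counter() {
    my $n = 0;
    return { $n++ };   # the returned block closes over $n, so this call's
                       # frame must outlive the call and moves to the heap
}
my &counter = make-counter();
say counter();  # 0
say counter();  # 1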

Therefore, the size of an MVMFrame impacts:

  • How much memory we have to allocate in the call stack region, and thus how quickly we fill up that region if we’re recursing deeply
  • How much memory we have to touch when setting up a frame (which impacts on CPU cache utilization); until the work described in this post the frame memory was cleared by a memset call, so the size also impacted how much work that had to do
  • How much we have to copy if promoting the frame onto the heap
  • How soon we’ll have to GC again (smaller frames on the heap mean longer until we fill the nursery and have to collect)
  • How much memory closures take

Over the years, MVMFrame has grown. Various things were added in support of additional features. However, not all of those things are used by all frames. Additionally, some of them, while used widely, were both quite rarely read and very cheap to compute when they were needed, meaning it was better to just re-calculate them on demand.

Effective handlers and bytecode

Two pointers were taken up for storing the effective handlers and bytecode. The bytecode holds the instructions to execute, and the handlers are the regions of that bytecode covered by exception handlers. At first, these fields did not exist in MVMFrame. The bytecode and handlers weren’t properties of a given call, but of the code being called (known as the “static frame” in MoarVM). So why these fields?

They were introduced with the dynamic optimizer. This produces one or more versions of the frame’s bytecode specialized by callsite and type. The specialized version of the code is selected based upon what the caller passes. Since the specialized bytecode contains different (typically far fewer) instructions, the code offsets covered by exception handlers move also, so we need to use an updated table of those when locating an exception handler.

It was certainly convenient to hang pointers to these two off the frame. But we could always locate them just by following the spesh_cand pointer in the frame, if it was set, or the static_info pointer if not. And this isn’t something that we needed to do so often that this extra dereference was going to add up. The only common path we needed to do it on was on return. But we’d also be dropping the instruction that set it during invocation, so it about balances out anyway – and that’s before considering the memory savings.

With those gone, MVMFrame shrank by 2 pointers, or 16 bytes in a 64-bit environment.

Throw address

When an exception was thrown, MoarVM stored the address in the program that it was thrown at into…a pointer in the currently executing frame. It then referenced the frame from the exception object. And that was pretty much all the throw address field was used for. This was a very easy 8 bytes (64-bit pointer) to win: just store the address in the exception object, where it belongs anyway. D’oh.

Rarely used things

Some things are used by just a small handful of frames:

  • Continuation tags (used to find the base of a continuation when taking it)
  • Slots to cache dynamic lexical lookups (so $*FOO can be found faster)
  • Special return handlers (a way for the VM to set up internal callbacks when returning into a particular frame, which is a technique used to avoid nested interpreter instances, which are a huge problem when combined with continuations)
  • A slot to keep an invoked call capture alive

These added up to 8 pointers and one 16-bit integer. By moving them into an “extras” data structure hung off the frame, and allocated on demand, space equivalent to 7 pointers could be saved off the frame. With 64-bit pointers, that’s 56 bytes of savings for most frames. The CORE.setting compilation ends up with only 6% of frames needing this extra storage space.

Alignment

With a 16-bit integer moved off into the extras I realized that, with a little care, we could re-order things, force an enum to only use a single byte, and save another 8 bytes off MVMFrame, simply by not needing some empty wasted padding space in the struct (C compilers insert this to make sure memory reads are aligned to correct boundaries).

Saving a memset

That’s 88 bytes of savings, which is around a cache line and a half on a typical CPU. It also means that nearly all of the things left in MVMFrame were being initialized on every invocation as part of the callframe setup. Meaning? That the memset of MVMFrame could go away, at the cost of just inserting a couple of instructions to manually zero or NULL things out (some of them on rarely taken paths).

But wait, there’s more…

While I was working on this, and looking at profiles, I noticed that a large number of allocations came from creating a buffer to keep track of which named arguments had been used and which had not (for the sake of error reporting and slurpy argument handling). We allocated a byte array, but only really need a bit per named argument. So, I turned the field into a union: a 64-bit bit field for when there are at most 64 named arguments (which covers just about every real world use case), falling back to the old byte array approach otherwise.

All in all…

These changes provide some memory use reductions, but more importantly are good for CPU cache locality. They also knock 2% off the number of CPU instructions run during CORE.setting compilation; many other programs should see a similar improvement (how much depending on how much of a hot path invocation is, and how many closures they take).


Optimizing reading lines from a file

Reading lines from a file and processing them one at a time is a hugely common scripting task. However, to date our performance at this task has been somewhat underwhelming. Happily, a grateful Perl 6 fan stepped up in response to my recent call for funding, offering 25 hours of funding to work on whatever I felt was most pressing, but with a suggestion that perhaps I could look at some aspect of I/O performance. Having recently been working on refactoring I/O anyway, this was very timely. So, I crafted a benchmark and dug in.

The benchmark and a baseline

Perl 5 is considered to have very good I/O performance, so I figured I’d use that as a rough measure of how close Perl 6 was to performing well at this task. A completely equivalent benchmark isn’t quite possible, but I tried to pick something representative of what the average programmer would write. The task for the benchmark was to take a file with one million lines, each having 60 characters, loop over them, and add up the number of characters on each line. That number would then be printed out at the end (it’s important that benchmarks calculating results return or consume the result in some way, as a sufficiently smart optimizer may otherwise manage to eliminate the work we think we’re measuring). The rules were that:

  • The file should be read as UTF-8
  • The number of characters in the line should exclude the line ending

The Perl 5 benchmark for this came out as follows:

perl -e 'open my $fh, "<:encoding(UTF-8)", "longfile";
         my $chars = 0;
         while ($_ = <$fh>) { chomp; $chars = $chars + length($_) };
         close $fh;
         print "$chars\n"'

With the Perl 6 one looking like this:

perl6 -e 'my $fh = open "longfile";
          my $chars = 0;
          for $fh.lines { $chars = $chars + .chars };
          $fh.close;
          say $chars'

I’ll note right off that in Perl 6 there are likely ways, today, to do a bit better. For example, the $chars variable could be given a native int type, and it’s entirely possible that a while loop might come out faster than the for loop. Neither of those are representative of what a typical programmer looking at the documentation and diving in to implementing stuff would do, however. I suspect that Perl 5 experts could similarly point out some trick I’ve missed, but I’m trying to benchmark typical use.
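
For reference, the first of those tweaks, giving $chars a native int type, would look like the following. This is just the benchmark above with that one change; it is not something measured in this post.

perl6 -e 'my $fh = open "longfile";
          my int $chars = 0;
          for $fh.lines { $chars = $chars + .chars };
          $fh.close;
          say $chars'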

One slight unfairness is that the Perl 6 solution will actually count the number of grapheme clusters, since strings are at grapheme level. This entails some extra processing work, even in the case that there are no multi-codepoint clusters in the input file (as there were not in this case). But again, the average user making comparisons won’t much care for such technicalities.

All measurements were made on modern hardware with an Intel Xeon 6-core CPU and a fast SSD, and on Linux.

At the point I started work, the Perl 6 solution clocked in at 2.87s, to just 1.13s for Perl 5. This made Perl 6 a factor of 2.5 times slower.

First hints from the profile

The whole I/O stack recently got a good overhaul, and this was the first time I’d looked at a profile since that work was completed. Looking at the output from --profile immediately showed up some rather disappointing numbers. Of all callframes, 57.13% were JIT-compiled. Worse, basically nothing was being inlined.

At this point, it’s worth recalling that Perl 6 is implemented in Perl 6, and that there’s quite a bit going on between the code in the benchmark and ending up in either things implemented in C or a system call. The call to lines returns an Iterator object. Reading a line means calling the pull-one method on that Iterator. That in turn calls the consume-line-chars method on a $!decoder object, and that method is what actually calls down to the VM-backed decoder to read a line (so there’s a level of indirection here to support user provided decoders). The return value of that method then has defined called on it to check we actually got a line back. If yes, then it can be returned. If not, then read-internal should be called in order to fetch data from the file handle (given buffering, this happens relatively rarely). Running the loop body is a further invocation, passing the read line as a parameter. Getting the chars count is also a method call (which, again, actually calls down to the VM guts to access the string’s grapheme count).

That’s quite a lot of method calling. While the VM provides I/O, decoding, and finding input up to a separator, the coordination of that whole process is implemented in Perl 6, and involves a bunch of method calls. Seen that way, it’s perhaps not surprising that Perl 6 would come in slower.

There are, however, things that we can do to make it fast anyway. One of them is JIT compilation, where instead of having to interpret the bytecode that Perl 6 is compiled into, we further translate it into machine code that runs on the CPU. That cuts out the interpreter overhead. Only doing that for 57% of the methods or blocks we’re in is a missed opportunity.

The other really important optimization is inlining. This is where small methods or subroutines are taken and copied into their callers by the optimizer. This isn’t something we can do by static analysis; the point of method calls is polymorphism. It is something a VM doing dynamic analysis and type specialization can do, however. And the savings can be rather significant, since it cuts out the work of creating and tearing down call frames, as well as opening the door to further optimization.
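
To make that concrete, here’s a small, self-contained sketch (my own example, not code from Rakudo’s internals) of the kind of call that inlining helps with: width is tiny, so once the specializer has proven the types involved it can copy the method body straight into the hot loop, avoiding a call frame per element.

class Line {
    has Str $.text;
    method width() { $!text.chars }   # small enough to be an inlining candidate
}

my @lines = Line.new(:text<hello>), Line.new(:text<world>);
my $total = 0;
$total += .width for @lines;          # after inlining, effectively $_.text.chars per element
say $total;                           # 10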

The horrors in the logs

There are a couple of useful logs that can be written by MoarVM in order to get an idea of how it is optimizing, or failing to optimize, code. The JIT log’s main point of interest for the purpose of optimization is that it can indicate why code is not being JIT-compiled – most commonly because it contains something the JIT doesn’t know about. The first thing in this case was the call into the VM-backed decoder to extract a line, which was, happily, easily handled. Oddly, however, we still didn’t seem to be running the JIT-compiled version of the code. Further investigation uncovered an unfortunate mistake. When a specialized version of a method calls a specialized version of another method, we don’t need to repeat the type checks guarding the second method. This was done correctly. However, the code path that was taken in this case failed to check whether there was a JIT-compiled version of the target rather than just a specialized bytecode version, and always ran the latter. I fixed that, and went from 57.13% of frames JIT-compiled to 99.86%. Far better.

My next point of investigation was why the tiny method to grab a line from the decoder was not being inlined. When I took a look at the post-optimization code for it, it turned out to be no surprise at all: while the logic of the method was very few instructions, it was bulked out by type checking of the incoming arguments and return values. The consume-line-chars method looks like this:

method consume-line-chars(Bool:D :$chomp = False, Bool:D :$eof = False --> Str) {
    my str $line = nqp::decodertakeline(self, $chomp, $eof);
    nqp::isnull_s($line) ?? Str !! $line
}

Specializations are always tied to a callsite object, from which we can know whether we’re being passed a parameter or not. Therefore, we should be able to optimize out those checks and, in the case the parameter is being passed, throw out the code setting the default value. Further, the *%_ that all methods get automatically should have been optimized out, but was not.
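
As a minimal, hypothetical sketch of why the callsite matters (take-line is made up for illustration; it merely mimics the shape of consume-line-chars): each distinct callsite can get its own specialization, so at the first call the specializer knows statically that :$chomp was passed and the default-value code is dead, while at the second it knows the opposite.

sub take-line(Bool:D :$chomp = False --> Str) {
    my $line = "a line\n";
    $chomp ?? $line.chomp !! $line
}

say take-line(:chomp).perl;   # callsite A: named argument present
say take-line().perl;         # callsite B: named argument absent, default applies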

The *%_ problem was fixed largely by moving code, although tests showed a regression that needed a little more care to handle – namely, that a sufficiently complex default value might do something that causes a deoptimization, and we need to make sure we can fall back into the interpreter and have things work correctly in that case.

While these changes weren’t enough to get consume-line-chars inlined, they did allow an inlining elsewhere, taking the inline ratio up to 28.49% of call frames.

This initial round of changes took the Perl 6 benchmark from 2.87s to 2.77s, so about 3.5% off. Not much, but something.

Continuing to improve code quality

The code we were producing even pre-optimization was disappointing in a few ways. Firstly, even though a simple method like consume-line-chars, or chars, would never possibly do a return, we were still spitting out a return exception handler. A little investigation revealed that we were only doing analysis and elimination of this for subs but not methods. Adding that analysis for methods too took the time down to 2.58s. Knocking 7% off with such a small change was nice.
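
Here’s a hedged, self-contained illustration of the distinction (the class and methods are invented for this example): the first method just falls off the end of its last expression, so it never needs a handler, while the second uses return, which is implemented via a control exception that the routine must be prepared to catch.

class Example {
    has Str $.line;
    method len()        { $!line.chars }               # no return: handler can be elided
    method first-word() { return $!line.words.head }   # explicit return: handler still needed
}

say Example.new(:line("hello world")).len;          # 11
say Example.new(:line("hello world")).first-word;   # hello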

Another code generation problem lay in consume-line-chars. Access to a native lexical can be compiled in two ways: either just by reading the value (fine if it’s only used as an r-value) or by taking a reference to it (which is correct if it will be used as an l-value). Taking a reference is decidedly costly compared to just reading the value. However, it always has the correct behavior, so it’s the default. We optimize it away whenever we can (in fact, all the most common l-value usages of it never need a reference either).
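
A hedged illustration of the two kinds of access (bump and $count are invented for this example): a plain read of the native is cheap, while passing it somewhere that might write to it forces the compiler to take a heap-allocated native reference.

sub bump(int $x is rw) { $x++ }

my int $count = 0;
say $count + 1;    # r-value use: just read the int
bump($count);      # l-value use: needs a native reference to $count
say $count;        # 1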

Looking at consume-line-chars again:

method consume-line-chars(Bool:D :$chomp = False, Bool:D :$eof = False --> Str) {
    my str $line = nqp::decodertakeline(self, $chomp, $eof);
    nqp::isnull_s($line) ?? Str !! $line
}

We can see the read of $line here is, since consume-line-chars is not marked is rw, an r-value. Unfortunately, it was compiled as an l-value because the conditional compilation lost that context information. So, I addressed that and taught Rakudo to pass along the return value’s r-value context.

A native reference means an allocation, and this change cut the number of GC runs enormously, from 182 of them to 41. That sounds like it should make a sensational difference. In fact, it got things down to 2.45s, a drop of just 5%. Takeaway lesson: allocating less stuff is good, but MoarVM’s GC is also pretty good at throwing away short-lived things.

Meanwhile, back in the specializer…

With the worst issues of the code being fed into MoarVM addressed, it was back to seeing why the specializer wasn’t doing a better job of stripping out type checks. First of all, it turned out that optional named arguments were not properly marking the default code dead when the argument was actually passed.

Unfortunately, that wasn’t enough to get the type tests stripped out for the named parameters to consume-line-chars. In fact, this turned out to be an issue for all optional parameters. When doing type analysis, and there are two branches, the type information has to be merged at join points in the control flow graph. So it might see something like this in the case that the argument was not passed:

    Bool (default path) \   / Unknown (from passed path)
                         \ /
                   Result: Unknown

Or maybe this in the case that it was passed:

    Bool (default path) \   / Scalar holding Bool (from passed path)
                         \ /
                   Result: Unknown

In both cases, the types disagree, so they merge to unknown. This is silly, as we’ve already thrown out one of the two branches, so in fact there’s really no merge to do at all! To fix this up, I marked variables (in single static assignment form) that died as a result of a basic block being removed. To make the dead basic blocks from argument analysis actually be removed, we needed to do the dead code removal earlier as well as doing it at the very end of the optimization process. With that marking in place, it was then possible to ignore now-dead code’s contribution to a merge, which meant a whole load of type checks could now be eliminated. Well, in fact, only in the case where the optional was passed; a further patch to mark the writers of individual instructions dead for the purpose of merges was needed to handle the case where it was not.

That left the return type being checked on the way out, which also seemed a bit of a waste as we could clearly see it was a Str. After a tweak to Rakudo to better convey type information in one of its VM extension ops, that check was optimized out too.

And for all of this effort, the time went from…2.45s to 2.41s, just 2% off. While it’s cheaper not to type check things, type checks are only so costly in the first place.

A further win was that, with the code for consume-line-chars now being so tiny, it should have been an inlining candidate. Alas, it was not, because the optional arguments were still having tracking information recorded just in case we needed to deoptimize. This seemed odd. It turned out that my earlier fix for this was too simplistic: it would leave the tracking in if the method would ever deoptimize, not just if it would do so while handling arguments. I tightened that up and the time dropped to 2.37s, another 2% win. Again, very much worth it, but it shows that invocation – while not super cheap – is also only so costly.

With consume-line-chars inlining now conquered, another area of the code we were producing caught my eye: boolification was, in some cases, managing to box an int into an Int only to then immediately unbox it and turn it into a Bool. Clearly this was quite a waste! It turned out that an earlier optimization to avoid taking native references had unexpected consequences. But even nicer was that my earlier work to pass down r-value context meant I could delete some analysis and just use that mechanism instead. That was worth 4%, bringing us to 2.28s.
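
To illustrate the wasteful pattern (the variable here is my own example, not the exact code involved): the truth test of a native int only needs an integer comparison, but the unoptimized path boxed the int into an Int object and then asked that object for its Bool value.

my int $n = "hello world".chars;
say "non-empty" if $n;    # conceptually just a check that $n != 0; no Int box needed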

Taking stock

None of these optimizations so far were specific to I/O or touched the I/O code itself. Instead, they are general optimization and code quality improvements that will benefit most Perl 6 programs. Together, they had taken the lines benchmark from 2.87s to 2.28s. Each may have been just some percent, but together they had knocked 20% off.

By this point, the code quality – especially after optimization – was far more pleasing. It was time to look for some other sources of improvement.

Beware associativity

Perhaps one of the easiest wins came from spotting that the pull-one method of the lines iterator seemed to be doing two calls to the defined method. See if you can spot them:

method pull-one() {
    $!decoder.consume-line-chars(:$!chomp) // $!handle.get // IterationEnd
}

The // operator calls .defined to test for definedness. But why two calls in the common case? Because of associativity! Two added parentheses:

method pull-one() {
    $!decoder.consume-line-chars(:$!chomp) // ($!handle.get // IterationEnd)
}

Were worth a whopping 8%. At 2.09s, the 2 second mark was in sight.
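
To spell out the grouping (the values here are illustrative stand-ins): the un-parenthesized chain was effectively evaluated as ($a // $b) // $c, so the already-defined result of the inner // got its definedness tested a second time on the hot path.

my $a = "line from decoder";
my $b = "line from handle";
my $c = "fallback";
my $r1 = ($a // $b) // $c;   # how the chain grouped: $a is tested, returned, then tested again
my $r2 = $a // ($b // $c);   # with the added parentheses: $a is tested once and returned
say $r1; say $r2;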

Good idea, but…

My next idea for an improvement was a long-planned change to the way that simple for loops are compiled. With for being defined in terms of map, this is also how it had been implemented. However, for simple cases, we can just compile:

for some-iterable { blah }

Not into:

some-iterable.map({ blah }).sink-all;

But instead in to something more like:

my \i = some-iterable.iterator;
while (my \v = i.pull-one) !=:= IterationEnd {
    blah
}

Why is this an advantage? Because – at least in theory – now the pull-one and loop body should become possible to inline. This is not the case if we call map, since that is used with dozens of different closures and iterator types. Unfortunately, however, due to limitations in MoarVM’s specializer, it was not actually possible to achieve this inlining even after the change. In short, we don’t yet handle inlining of closure-y things, and the way on-stack replacement works leaves the optimizer without the type information it would need to do better with pull-one. Both of these are now being investigated, but were too big to take on as part of this work.

Even without those larger wins being possible (when they are, we’ll achieve a tasty near-100% inlining rate in this benchmark), it brought the time down to the 2.00s mark. Here’s the patch.

Optimizing line separation and decoding

Profiling at the C level (using callgrind) showed up some notable hot spots in the string handling code inside of MoarVM, which seemed to offer the chance to get further wins. At this point, I also started taking measurements of CPU instructions using callgrind, which makes it easier to see the effects of changes that may come out as noise in a simple time measurement (even when taking a number of measurements and averaging them).
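
For reference, a hedged sketch of that kind of measurement setup (the exact invocation is illustrative; only the standard valgrind/callgrind tooling is assumed): instruction counts are deterministic enough that wins of a fraction of a percent remain visible.

valgrind --tool=callgrind perl6 -e 'my $fh = open "longfile"; my $chars = 0; for $fh.lines { $chars = $chars + .chars }; $fh.close; say $chars'
callgrind_annotate callgrind.out.*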

Finding the separator happens in a couple of steps. First, individual encodings are set up to decode to the point that they see the final character of any of the line separators (noting these are configurable, and multi-char separators are allowed). Then, a second pass checks whether the full multi-char separator was found. This is complicated by needing to handle the case where a separator was not found, and another read needs to be done from the file handle.

It turns out that this second pass was re-scanning the entire buffer of chars, rather than just looking close to the end of it. After checking there should not be a case where just jumping to look at the end would ever be a problem, I did the optimization and got a reduction from 18,245,144,315 instructions to 16,226,602,756, or 11%.

A further minor hot-spot was re-resolving the CRLF grapheme each time it was needed. It turned out caching that value saved around 200 million instructions. Caching the maximum separator length saved another 78 million instructions. The wallclock time now stood at 1.79s.

The identification of separators when decoding chars seemed the next place to find some savings. CPUs don’t much like having to do loops and dereferences on hot paths. To do better, I made a compact array of the final separator graphemes that could be quickly scanned through, and also introduced a maximum separator codepoint filter, which – given that separators are usually control characters – works out really quite well. These were worth 420 million and 845 million instructions respectively.

Next, I turned to the UTF-8 decoding and NFG process. A modest 56 million instruction win came from tweaking this logic given we can never be looking for a separator and have a target number of characters to decode. But a vast win came from adding a normalization fast path for the common case where we don’t have any normalization work to do. In the case we do encounter such work, we simply fall into the slow path. One nice property of the way I implemented this is that, when reading line by line, one line may cause a drop into the slow path, but the next line will start back in the fast path. This change was worth a whopping 3,200 million decrease in the instruction count. Wallclock time now stood at 1.37s.

Better memory re-use

Another look at the profile now showed malloc/free as significant costs. Could anything be done to reduce the number of those we did?

Yes, it turned out. Firstly, keeping around a decoding result data structure, instead of freeing and allocating it every single line, saved a handy 450 million instructions. It turned out that we were also copying the decoded chars into a new buffer when taking a line, but in the common case that buffer would contain precisely the chars that make up the line. Therefore, this buffer could simply be stolen to use as the memory for the string. Another 400 million instructions dropped away thanks to one less malloc/free per line.

Micro-optimizations

A few further micro-optimizations in the new UTF-8 decoding fast path were possible. By lifting some tests out of the main loop, reading a value into a local because the compiler couldn’t figure out it was invariant, and moving some position updates so they only happen on loop exit, a further 470 million instructions were removed. If you’re thinking that sounds like a lot, remember this is a loop that runs for every single codepoint we decode. A million-line file with 60 chars per line plus a separator is 61 million iterations. These changes between them only save around 7 instructions per codepoint; that just turns out to be a lot when multiplied by the number of codepoints!

The final result

With these improvements, the Perl 6 version of the benchmark now ran in 1.25s, which is just 44% of the time it used to run in. The Perl 5 version still wins, but by a factor of 1.1 times, not 2.5 times. While some of the changes made during this work were specific to the benchmark in question, many were much more general. For example, the separator finding improvements will help with this task in all encodings, and the code generation and specializer improvements will have far more cross-cutting effects.

Actually, not so final…

There’s still a decent amount of room for improvement. Once MoarVM’s specializer can perform the two inlinings it is not currently able to, we can expect a further improvement. That work is coming up soon. And beyond that, there will be more ways to shave off some instructions here and there. A less pleasing result is that if Perl 5 is not asked to do UTF-8 decoding, this represents a huge saving; ask Perl 6 for ASCII or Latin-1 instead, however, and it’s just a small saving. This would be a good target for some future optimization work. In the meantime, these are a nice bunch of speedups to have.


Sorting out synchronous I/O

A few weeks back, I put out a call for funding. So far, two generous individuals have stepped up to help, enabling me to spend much more time on Perl 6 than would otherwise have been possible.

First in line was Nick Logan (ugexe), who is funding 60 hours to get a longstanding I/O bug resolved. The problem, in short, has been that a synchronous socket (that is, an IO::Socket::INET instance) accepted or connected on one thread could not be read from or written to by another thread. The same has been true of synchronous file handles and processes.

The typical answer for the socket case has been to use IO::Socket::Async instead. While that’s the best answer from a scalability perspective, many people new to Perl 6 will first reach for the more familiar synchronous socket API and, having heard about Perl 6’s concurrency support, will pass those off to a thread, probably using a start block. Having that fail to work makes for a bad early impression.
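
A minimal sketch of the pattern in question (the port and the echo protocol are made up for illustration): accept on one thread, then hand the connection over to the thread pool with start – which is exactly what used to go wrong.

my $listener = IO::Socket::INET.new(:listen, :localhost<localhost>, :localport(3333));
loop {
    my $conn = $listener.accept;
    start {
        my $line = $conn.get;          # previously, reading off the accepting thread could fail
        $conn.print("echo: $line\n");
        $conn.close;
    }
}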

For processes, the situation has been similar; a Proc was bound to the thread it was created on, and the solution was to use Proc::Async. For situations where dealing with more than one of the input or output streams is desired, I’d argue that it’s actually easier to deal with it correctly using Proc::Async. However, Proc has also ended up with a few features that Proc::Async has not so far offered.
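
For completeness, a hedged sketch of the asynchronous alternative mentioned above (the command is arbitrary): Proc::Async exposes each output stream as a Supply, which is what makes handling more than one stream at a time straightforward.

my $proc = Proc::Async.new('ls', '-l');
$proc.stdout.tap(-> $chunk { print $chunk });   # stdout arrives as a Supply of chunks
$proc.stderr.tap(-> $chunk { note $chunk });    # as does stderr
await $proc.start;                              # start returns a Promise kept on process completion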

Finally, there are “file handles” – that is, instances of IO::Handle. Here, we would typically pass handles of ordinary files around between threads and, thanks to an implementation detail, get away with it. The situation of an IO::Handle backed by a TTY or pipe was much less pleasing, however, which was especially unfortunate because it afflicted all of $*IN, $*OUT, and $*ERR. The upshot was that you could only read from $*IN from the main thread. While writing to $*OUT and $*ERR worked, it was skating on thin ice (and occasionally falling through it).

How did we get into this situation?

To understand the issue, we need to take a look at the history of MoarVM I/O. In the first year of its life, MoarVM was designed and built on a pretty extreme budget – that is to say, there wasn’t one. Building platform abstractions was certainly not a good use of limited time, so a library was brought in to handle this. Initially, MoarVM used the Apache Portable Runtime (APR), which served us well.

As the concurrent and parallel language features of Perl 6 came into focus, together with asynchronous I/O, it became clear that libuv – the library that provides I/O and platform abstractions for Node.js – was a good option for supporting this. Since the APR and libuv had substantially overlapping feature sets, we decided to move over to using libuv. In the months that followed, a bunch of the asynchronous features found in Perl 6 today quickly fell into place: Proc::Async, IO::Socket::Async, IO::Notification, asynchronous timers, and signal handlers. All seemed to be going well.

Alas, we’d made a problem for ourselves. libuv is centered around event loops. Working with an event loop is conceptually simple: we provide callbacks to be run when certain things happen (like data arriving over a socket, or a timer elapsing), and enter the event loop. The loop waits for some kind of event to happen, and then invokes the correct callback. Those callbacks may lead to other callbacks being set up (for example, the callback reacting to an incoming socket connection can set up a callback to be called whenever data arrives over that socket). This is a fine low-level programming model – and a good bit nicer than dealing with things like poll sets. However, a libuv event loop runs on a single thread. And handles are tied to the libuv event loop that they were created on.

For IO::Socket::Async and Proc::Async, this is not a problem. MoarVM internally runs a single event loop for all asynchronous I/O, timers, signals, and so forth. Whenever something happens, it pushes a callback into the queue of a scheduler (most typically that provided by ThreadPoolScheduler), where the worker threads lie in wait to handle the work. Since this event loop serves as a pure dispatcher, not running any user code or even such things as Unicode decoding, it’s not really limiting to have it only on a single thread.

When the synchronous I/O was ported from the APR, none of this was in place, however. Therefore, each thread got its own libuv event loop for handling synchronous I/O. At the time, of course, there wasn’t really much in the way of threading support available in Rakudo on MoarVM, so tying an event loop to a thread was not a problem back then. It only came to be an issue as the concurrency support in Perl 6 became mature enough that people started using it…and then running into the limitation.

Seizing an opportunity

Having to deal with this was a chance to spend some quality time improving the lower level bits of I/O in the MoarVM/NQP/Rakudo stack, which have had very little love over the last couple of years. Recently, Zoffix did a bunch of great work on the higher level parts of I/O. Most helpfully for my endeavor, he also greatly improved the test coverage of I/O, meaning I could refactor the lower level bits with increased confidence.

A while back, I took the step of fully decoupling MoarVM’s asynchronous I/O from character encoding/decoding. For some months now, MoarVM has only provided support for asynchronous byte-level I/O. It also introduced a streaming decoder, which can turn bytes into characters for a bunch of common encodings (ASCII, UTF-8, and friends). This means that while the decoding hot paths are provided by the VM, the coordination is moved up to the Perl 6 level.

With synchronous I/O, the two were still coupled, with the runtime directly offering both character level and byte level I/O. While this is in some ways convenient, it is also limiting. It means we can’t support user-provided encodings – at least, not in such a way that they can just be plugged in and used with a normal IO::Handle. That aside, there are also situations where one might like to access the services provided by a streaming decoder when not dealing with an I/O handle. (A streaming decoder is one you can feed bytes to incrementally and pull characters out of, trusting that it will do the right thing with regard to multi-byte and multi-codepoint sequences.)

Whatever changes were going to have to happen to fix the thread limitations of synchronous I/O, it was quite clear that only having to deal with binary I/O there would make it easier. Therefore, a plan was hatched:

  1. Re-work the synchronous I/O handles to use only binary I/O, and coordinate with the VM-backed decoder to handle char I/O.
  2. Rip out the synchronous char I/O.
  3. Re-implement the remaining synchronous binary I/O, so as not to be vulnerable to the threading limitations.

Making it possible to support user-defined encodings would be a future step, but the work done in these refactors would make it a small step – indeed, one that will only require work at a Perl 6 level, not any further down the stack. While it may well end up being me that does this anyway, it’s at least now in reach for a bunch more members of the Perl 6 development team.

Sockets first

I decided to start out with sockets. From a technical perspective, this made sense because they were the most isolated; break IO::Handle and suddenly such things as say are busted, and both it and Proc are used in the pre-compilation management code too. Conveniently, sockets were also Nick’s primary interest, so it was nice to get the most important goal of the work delivered first.

The streaming decode API, while fully implemented on MoarVM, had only been partially implemented on the JVM. Therefore, to complete the work on sockets without busting them on the JVM, I had to implement the missing pieces of the VM-backed streaming decode API. This meant dealing with NIO (“New IO”), the less said about which, the better. I’m pretty sure the buffer API wasn’t designed to trip me up at every turn, but it seems to reliably manage to do so. Since file handles on the JVM would soon also come to depend on this code, it was nice to be able to get it at least straight enough to pass all of the socket tests, plus another set of lower-level tests that live in the NQP repository.

Refactoring the socket code itself gave a good opportunity for cleanups. The IO::Socket::INET class does the IO::Socket role, the idea being that at some point other implementations that provide things like domain sockets will also do that role, and just add the domain socket specific parts. In reviewing what methods were where, it became clear that some things that really belonged in the IO::Socket role were not, so I moved them there as part of the work. I also managed to eliminate all of the JVM-specific workarounds in the socket code along the way, which was also a happy outcome.
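
As an illustrative sketch of that arrangement (the names Socket-ish and INET-Socket-ish are hypothetical; this is not the actual Rakudo source): shared behavior lives in the role, while the composing class supplies only the transport-specific parts.

role Socket-ish {
    method print(Str:D $s) { self.write($s.encode) }   # shared behavior: char I/O atop binary write
    method write(Blob:D $b) { ... }                     # stub; the composing class must provide it
}

class INET-Socket-ish does Socket-ish {
    method write(Blob:D $b) { note "would send {$b.elems} bytes over TCP" }
}

INET-Socket-ish.new.print("hi\n");   # would send 3 bytes over TCP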

With that refactored, I could rip out the character I/O support from sockets in MoarVM. This left me with a relatively small body of code doing binary socket I/O using libuv, implementing synchronous socket I/O atop its asynchronous event loop. When it comes to sockets, there are two APIs to worry about: the Berkeley/BSD/POSIX one, and Winsock. Happily, in my case, there was a lot of overlap and just a handful of places where I had to deal with divergence. The extra lines spent coping with the difference were more than paid back by not faking synchronous I/O in terms of an asynchronous event loop.

File handles next

Buoyed by this success, it was time to dig into file handles. The internals of IO::Handle were poked into from outside of the class: a little in IO::Path and more so in Proc, which was actually setting up IO::Pipe, a subclass of IO::Handle. Thankfully, with a little refactoring, the encapsulation breakage could be resolved. Then it was a case of refactoring to use the streaming decode API rather than the character I/O. This went relatively smoothly, though it also uncovered a previously hidden bug in the JVM implementation of the streaming decoder, which I got fixed up.

So, now to rip the character file I/O out of MoarVM and refactor the libuv away in synchronous file handles too? Alas, not so fast. NQP was also using these operations. Worse, it didn’t even have an IO handle object! Everything was done in terms of nqp::ops instead. So, I introduced an NQP IO handle class, gave it many of the methods that its Perl 6 big sister has, and refactored stuff to use it.

With that blocker out of the way, I could move on to sorting things out down in MoarVM. Once again, synchronous file I/O looks similar enough everywhere to not need all that much in the way of an abstraction layer. On Windows, it turned out to be important to put all of the handles into binary mode, however, since we do our own \n <-> \r\n mapping. (Also, yes, it is very much true that it’s only very similar on Windows if you use their POSIX API emulation. It’s possible there may be a performance win from not using that, but I can’t imagine it’ll be all that much.)

Not entirely standard

So, all done? Well, not quite. For the standard handles, we only used the synchronous file code path when the handle was actually a regular file. This is not the common case; it’s often a pipe or a TTY. These used a completely different code path in libuv, using its streams API instead of the files API.

Thankfully, there isn’t much reason to retain this vast implementation difference. With a little work on EOF handling, and re-instating the faking of tell, it was possible to use the same code I had written to handle regular files. This worked out very easily on Linux. On Windows, however, a read operation would blow up if reading from the console.

Amazingly, the error string was “out of space”, which was a real head-scratcher given it was coming from a read operation! It turned out that the error string was a tad misleading; the error code is #defined as ENOMEM, so a better error string would have been “out of memory”. That still made little sense. So I went digging into the native console APIs on Windows, and discovered that reads from the console are allocated out of a 64KB buffer, which is also used for various other things. 64KB should be enough for anybody, I guess. Capping read requests to a console on Windows to 16KB was enough to alleviate this.

Job done!

At least, for IO::Socket::INET and IO::Handle, which now are not tied to a thread on HEAD of MoarVM/NQP/Rakudo. The work on processes is ongoing, although due to be completed in the coming week. So, I’ll talk about that in my next post here.


Looking for Perl 6, Rakudo, and MoarVM development funding

Note for regular 6guts readers: this post isn’t about Perl 6 guts themselves, but rather about seeking funding for my Perl 6 work. It is aimed at finding medium-size donations from businesses who are interested in supporting Perl 6, Rakudo, and MoarVM by funding my work on these projects.

I started contributing to the Rakudo Perl 6 compiler back in 2008, and since then have somehow managed to end up as architect for both Rakudo and MoarVM, together with playing a key design role in Perl 6’s concurrency features. Over the years, I’ve made time for Perl 6 work by:

  • Spending a bunch of my free time on it.
  • Only working an 80% day-job for five years, effectively self-funding a day a week for Perl 6 things. During this time I did also take various “milestone” grants (for delivering notable features); this mostly ended up paying for my travel to a bunch of workshops and conferences, which was a great deal of fun and helped keep the community informed during Perl 6’s long gestation period.
  • Most recently, taking funding at a rate that – while less than I could make doing other stuff – lets me comfortably work on Perl 6 as a part of my normal daily job.

I’m still greatly enjoying doing Perl 6 stuff and, while I’ve less free time these days for a variety of reasons, I still spend a decent chunk of that on Perl 6 things too. That’s enough for piecing together various modules I find we’re missing in the ecosystem, and for some core development work. However, the majority of Perl 6, Rakudo, and MoarVM issues that end up on my plate are both complex and time-consuming. For various areas of MoarVM, I’m the debugger of last resort. Making MoarVM run Perl 6 faster means working on the dynamic optimizer, which needs a good deal of care to avoid doing the wrong thing really fast. And driving forward the design and implementation of Perl 6’s concurrent and parallel features also requires careful consideration. Being funded through The Perl Foundation over the last couple of years has enabled me to spend quality time working on these kinds of issues (and plenty more besides).

So what’s up?

I’ve been without funding since early-mid February. Unfortunately, my need to renew my funding has come at a time when The Perl Foundation has been undergoing quite a lot of changes. I’d like to make very clear that I’m hugely supportive and thankful for all that TPF have done and are doing, both for me personally and for Perl 6 in general. Already this year, two Perl 6 grants have been made to others for important work. These were made through the normal TPF grants process. By contrast, my work has been funded through a separate Perl 6 Core Development Fund. As a separate fund, it thus needs funds raising specifically for it, and has its own operation separate from the mainstream grant process.

Between the fund being almost depleted, and various new volunteers stepping up to new roles in TPF and needing to get up to speed on quite enough besides the Perl 6 Core Development Fund, unfortunately it’s not been possible to make progress on my funding situation in the last couple of months. I’m quite sure we can get there with time – but at the same time I’m keen to get back to having more time to spend on Perl 6 again.

So, I’ve decided to try out an alternative model. If it works, I potentially get funded faster, and TPF’s energies are freed up to help others spend more time on Perl. If not, well, it’ll hopefully only cost me the time it took to write this blog post.

The Offer

I’m looking for businesses willing to help fund my Perl 6 development work. I can offer in return:

  • To spend quality time on Perl 6, Rakudo, and MoarVM.
  • To either spend the time on the areas that I feel are most pressing based on my own judgement and input from the community, or to spend it focusing on areas of the sponsor’s interest (for example: improving native call performance, delivering concurrency improvements such as completing non-blocking await or hyper/race, building a sampling profiler to give better insight into long-running programs, improving the MoarVM dynamic optimizer so hot code can run faster, etc.)
  • To write at least one blog post here per 25 hours of work completed. Rather than a boring list of stuff that I got done, I prefer to write posts that focus around particularly interesting problems I solved or that dig into the details of things that I have implemented. I will be happy to mention the sponsor of the work in these posts, together with a link, logo, etc.
  • Per 100 hours funded, provide 5 extra hours of my time that the sponsor may use as they wish (within reason!) That may be for consultation on any topic you figure I might be clueful about (not just Perl 6), writing an article for you on a topic of your choice, and so forth. Or, if you wish to donate the bonus time back to Perl 6 related work, that’s of course fine too (and I’ll be sure to mention you did so!)
  • Billing by invoice issued from a VAT-registered company with its seat in Prague, Czech Republic.
  • When I get a Perl 6 related book done (shhhh…don’t tell anyone I said that I’m working on one), I’ll send along a few signed copies. :-)

I’m setting a rate for this work of 55 EUR / hour with a minimum order of 25 hours. This need not be billed all in one go; for example, if you happened to be a company wishing to donate 1000 EUR a month to open source and wished to be invoiced that amount each month, this is entirely possible. After all, if 3-4 companies did that, we’d have me doing Perl 6 stuff for 2 full days every week.

If you’re interested in helping, please get in contact with me, either by email or on freenode (I’m jnthn there). Thank you!
