This article was first published in Overload Journal #97 (June 2010) and is also available separately on the ACCU web site.
So the 'multi-core revolution' is finally here [Merritt07, Suess07, Martin10] (some might argue that it has already been here for several years, but that's beside the point now). Without arguing whether it is good or bad, we should agree that it is a reality which we cannot possibly change. The age of CPU frequency doubling every two years is long gone, and we shouldn't expect any substantial frequency increases in the near future; while there are still improvements in single-core performance unrelated to raw frequency increases, from now on we should count mostly on multiple cores to improve performance. Will it make development easier? Certainly not. Does it mean that everybody will need to deal with mutexes, atomics, in-memory transactions (with both optimistic and pessimistic locking), memory barriers, deadlocks, and the rest of the really scary multi-threading stuff, or switch to functional languages merely to cope with multi-threading? Not exactly, and we'll try to show why below. Please note that this article does not try to present anything substantially new; it is merely an analysis of existing (but unfortunately way too often overlooked) mechanisms and techniques.
How long is 'as long as possible'?
It is more or less commonly accepted that multi-threading is a thing which should be avoided for as long as possible. While writing multi-threaded code might not look too bad at first glance, debugging it is a very different story. Because of its very nature, multi-threaded code is non-deterministic (i.e. its behavior can easily differ for every run), and as a result finding all the bugs in testing becomes infeasible even if you know where to look and what exactly you want to test; in addition, code coverage metrics aren't much use for evaluating coverage of possible race scenarios in multi-threaded code. To make things worse, even if a multi-threaded bug is reproducible, it will happen on a different iteration every time, so there is no way to restart the program and stop it a few steps before the bug; and step-by-step debugging usually disturbs the fragile race conditions involved, so it is rarely helpful for finding multi-threaded bugs. With post-mortem analysis of multi-threaded races using log files usually being impossible too, the developer is left with almost no reliable means to debug multi-threaded code, making it more of a trial-and-error exercise based on 'gut feeling', without a real understanding of what is going on (until the bug is identified). Maintaining multi-threaded code is even worse: heavily multi-threaded code tends to be very rigid and fragile, and making any changes requires careful analysis and lots and lots of debugging, making any mix of frequently changed business logic with heavy multi-threading virtually suicidal (unless multi-threading and business logic are clearly separated into different levels of abstraction).
With all these (and many other) problems associated with multi-threaded code, it is easy to agree that multi-threading should be avoided. On the other hand, there is some disagreement on how long we can avoid it. In this article we will try to discuss how performance issues (if there are any) can be handled without going into too much detail of multi-threading. While it is not always possible, the number of cases where multi-threading can be avoided is extensive. And, as discussed above, whenever you can avoid it, you should, despite the potential fear that programs without multi-threading aren't 'cool' anymore. After all, the end-user (the guy we are all working for) couldn't care less how many threads the program has or whether it utilizes all the available cores, as long as the program works correctly and is fast enough. In fact, using fewer cores is often beneficial for the end-user, who is then able to do something else at the same time; we also need to keep in mind that the overhead incurred by multi-threading/multi-coring can be huge, and that Amdahl's Law provides only the theoretical maximum speedup from parallelization, with realized gains often being nowhere near that. If a single-threaded program does something in a minute on one core, and a multi-threaded one does the same thing in 55 seconds on 8 cores (which can easily happen if the granularity of context switching is suboptimal for any reason), it looks quite likely that the user would prefer the single-threaded program.
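To put Amdahl's Law into perspective, here is a quick back-of-the-envelope illustration (the numbers are ours, purely illustrative, not from the original article). If a fraction p of a program's work can be parallelized over n cores, the theoretical maximum speedup is

    S(n) = 1 / ((1 - p) + p/n)

So even a program which is 90% parallelizable (p = 0.9) cannot exceed 1 / (0.1 + 0.9/8), which is roughly 4.7x, on 8 cores - and that is before subtracting any synchronization, cache-coherency, or context-switching overhead.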
'No Multi-Threaded Bugs' Bunny & 'Multithreaded Gorrillazz'
Let us consider a developer who really hates dealing with those elusive multi-threading bugs. As he is our positive hero, we need to make him somewhat cute, so let's make him a bunny rabbit: our hero becomes the 'No Multi-Threaded Bugs' Bunny. Opposing him is a whole bunch of reasons which try to push him into heavy multi-threading with all its associated problems; let's picture them as the defending team of 'Multithreaded Gorrillazz'. To win, our 'No MT Bugs' Bunny needs to rush through a whole field full of Gorrillazz and score a touchdown. While it might seem hopeless, he has one advantage on his side: while extremely powerful and heavy, Gorrillazz are usually very slow, so in many cases he can escape them before they reach him.
A few minor notes before the game starts: first of all, in this article we will address only programs which concentrate on interacting with the user in one way or another (it can be a web, desktop, or mobile phone program, but interaction should be a substantial part of it). Scientific calculations, HPC, video rendering farms, etc. are a completely different game played on a very different field, so we will not discuss them here. The second important note is that we're making a distinction between 'a bit of multi-threaded code in a very limited number of isolated places' and 'massive multi-threading all over the code'. While the former can usually be managed with limited effort and (given that there are no better options) we'll consider it acceptable, the latter is exactly the thing we're aiming to avoid.
Houston, have we got a problem?
So, our 'No MT Bugs' Bunny is standing all alone against a whole field of fierce Gorrillazz. What should his first move be? First of all, we need to see whether he's writing client-side code or server-side code. In the first two quarters of the game we will concentrate on client-side code, with the server side considered in quarters 3 & 4 (coming in the next issue). And if the application is client-side, then the very first question for our hero is the following: does his application really experience any performance problems (in other words, would users actually care if the application ran faster)? If not, he can simply ignore all the Gorrillazz, at least for the moment, and stay single-threaded. While Moore's Law no longer works for frequencies, and modern CPUs are stuck at a measly 3-4GHz, that is still about 1000 times the frequency of the first PCs, which (despite a 1000-times-less-than-measly-3-4-GHz speed) were indeed able to do a thing or three. This is especially true if Bunny's application is a business-like one, with logic like 'if the user clicks button "OK", close this dialog and open dialog "ZZZ" with field "XXX" set to the appropriate value from the previous dialog'; in cases like this it is virtually impossible to imagine how such logic (taken alone, without something such as an associated MP4 playing in one of the dialogs - we'll deal with that scenario a bit later) could possibly require more than one modern core, regardless of the size of the code implementing it.
To block or not to block - there is no question
If our 'No MT Bugs' Bunny does indeed have performance problems with his client application, then there is a big chance that they are related to I/O (if in doubt, a profiler should help clarify, although the results of profiling are not always 100% obvious). If the user experiences delays while the CPU isn't loaded, the cause often has nothing to do with using more cores, or with the CPU in general, but is all about I/O. While hard disk capacities have gone up tremendously in recent years, typical access time has seen much more modest improvement, and is still on the order of 10ms, or more than 10^7 CPU clock cycles. No wonder that if Bunny's program accesses the disk heavily, the user can experience delays. Network access can incur even larger delays: while a typical round-trip time within a LAN is normally under 1ms, a typical transatlantic round-trip time is around 100-200ms, and that is only if everything on the way works perfectly; if something goes wrong, one can easily run into delays on the order of seconds (such as a DNS timeout or a TCP retransmit), or even minutes (for example, the typical BGP convergence time [Maennel02]); that last number is about eleven orders of magnitude larger than the CPU clock period. As Garfield the Cat would have put it: 'Programmers who block the UI while waiting for network I/O should be dragged out into the street and shot'.
The way to avoid I/O delays for the user without going into multi-threading has been well known since at least the 1970s, but is unfortunately rarely used in practice: non-blocking I/O. The concept itself is very simple: instead of telling the system 'get me this piece of information and wait until it's here', say 'start getting this piece of information, return control to me right away, and let me know when you're done'. These days non-blocking I/O is almost universally supported (even the originally 100%-thread-oriented Java eventually gave up and started to support it), and even if it is not supported for some specific and usually rather exotic API function (such as FlushFileBuffers() on Windows), it is usually not that big a problem for our Bunny to implement it himself via threads created specially for this purpose. While implementing non-blocking I/O via threads does involve some multi-threaded coding, it is normally not too complicated, and most importantly it stays confined to one single place, without the need to spread it over the rest of the code.
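As a minimal sketch of such a thread-based wrapper (in C++11; read_file_blocking() and read_file_async() are illustrative names of ours, not any standard API), one might write something along these lines:

#include <fstream>
#include <functional>
#include <sstream>
#include <string>
#include <thread>

// A blocking call we want to hide behind a non-blocking facade;
// any synchronous API without a native async equivalent would do.
std::string read_file_blocking(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Start the blocking operation on a dedicated thread and return
// control to the caller immediately; on_done fires when the data
// is ready.
void read_file_async(const std::string& path,
                     std::function<void(std::string)> on_done)
{
    std::thread([path, on_done] {
        on_done(read_file_blocking(path));
    }).detach();
}

The caller gets control back immediately; the one subtlety is that on_done fires on the worker thread, so in a real program it should do nothing except post an 'I/O completed' event back to the main thread (more on this below).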
Non-blocking I/O vs heavy multi-threading
Unfortunately, doing things in parallel (using non-blocking I/O or any other means) inherently has a few not-so-pleasant implications. In particular, it means that after the main thread has started the I/O, an essentially new program state is created. It also means that program runs are no longer 100% deterministic, which opens up the potential for races (program behavior can differ depending on the point at which the I/O ends). Despite all this, doing things in parallel with non-blocking I/O is still much more manageable than a generic heavily multi-threaded program with shared variables, mutexes, etc. Compared to the heavily multi-threaded approach, a program based on non-blocking I/O usually has fewer chances for races to occur, and step-by-step debugging has more chance of working. This is because, while there is some non-determinism in a non-blocking I/O program, the number of significantly different scenarios ('the I/O completed earlier or later than some other event') is usually orders of magnitude smaller than the potential number of different scenarios in a heavily multi-threaded program (where a context switch after every instruction can potentially cause a substantially different scenario and lead to a race). It can even be possible to perform a formal analysis of all the substantially different scenarios due to different I/O timings, providing a theoretical proof of correctness (a similar proof is definitely not feasible for any heavily multi-threaded program more complicated than 'Hello, World!'). But the most important advantage of the non-blocking I/O approach is that, with a proper level of logging, it is possible to reconstruct the exact sequence of events which led to a problem, and to identify the bug based on this information. This means we still have a regular way to identify bugs, without relying on trial and error (which can easily take years if we're trying to identify a problem which manifests itself only in production, and only once in a while); it also means that we can perform post-mortem analysis in production environments.
These improvements in code quality and debugging/testing efficiency don't come for free. While adding a single non-blocking I/O is usually simple, handling lots of them can require quite a substantial effort. There are two common ways of handling this complexity. One approach is event-driven programming, ubiquitous in the GUI programming world for user events; for non-blocking I/O it needs to be extended to include 'I/O has been completed' events. The other approach is to use finite state machines (which can vary in many aspects, including, for example, hierarchical state machines). We will not address the differences between these approaches here, but note that any such implementation will have all the debugging benefits described above.
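For illustration, here is a minimal sketch of an event-driven main loop extended with 'I/O has been completed' events (the Event and EventQueue types are our own simplification, not taken from any particular framework); note how logging every event as it is dispatched gives exactly the reconstruction-friendly record discussed above:

#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>

// One event type covering both user input and 'I/O completed'
// notifications; real frameworks have richer structures.
struct Event { std::string kind; std::string payload; };

// Thread-safe queue: worker threads push completion events;
// only the main thread pops and handles them.
class EventQueue {
public:
    void push(Event e) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(e); }
        cv_.notify_one();
    }
    Event pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Event e = q_.front(); q_.pop_front();
        return e;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Event> q_;
};

// The main loop: single-threaded from the application's point of
// view. Logging every event is what makes post-mortem
// reconstruction of the exact sequence of events possible.
void run(EventQueue& q)
{
    for (;;) {
        Event e = q.pop();
        std::printf("LOG event=%s payload=%s\n",
                    e.kind.c_str(), e.payload.c_str());
        if (e.kind == "quit") break;
        // dispatch to the appropriate handler / state machine here
    }
}

int main()
{
    EventQueue q;
    q.push(Event{"user", "button OK clicked"});
    q.push(Event{"io_done", "config.xml loaded"});
    q.push(Event{"quit", ""});
    run(q);
    return 0;
}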
One common problem for both approaches above is that if our Bunny has lots of small pieces of I/O, making all of them non-blocking can be quite tedious. For example, if his program performs a long search in a file, then while the whole operation can be very long, it will consist of many smaller pieces of I/O, and handling all the associated states will be quite a piece of work. It is often very tempting to combine a whole bunch of such micro-I/Os into a single macro-operation to simplify coding. This approach often works pretty well, but only as long as two conditions are met: (a) the whole operation is treated as a kind of large custom non-blocking I/O; (b) until the macro-operation is completed, there is absolutely no interaction between it and the main thread, except for the ability to cancel it from the main thread. Fortunately, these two conditions can usually be met, but as soon as at least some interaction is added, this approach won't work anymore and will need to be reconsidered (for example, by splitting the big macro-operation into two non-blocking I/O operations at the point of interaction, or by introducing some kind of message-based interaction between the I/O operation and the main thread; normally this is not too difficult, though if the interaction is extensive it can become rather tedious).
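A minimal sketch of such a cancellable macro-operation might look as follows (again with our own illustrative names; post_completion_event() stands in for pushing an event onto the main loop's queue, and the cancel flag, which must outlive the operation, is the only interaction while it runs):

#include <atomic>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

// Stand-in for posting an 'I/O completed' event to the main
// thread's queue (see the event-loop sketch above).
void post_completion_event(bool found)
{
    std::printf("LOG search completed, found=%d\n", (int)found);
}

// A long search consisting of many micro-I/Os, packaged as one
// custom non-blocking macro-operation. Everything is reported via
// a single completion event at the end.
void search_file_async(const std::string& path,
                       const std::string& needle,
                       std::atomic<bool>& cancelled)
{
    std::thread([path, needle, &cancelled] {
        std::ifstream in(path.c_str(), std::ios::binary);
        std::string chunk(64 * 1024, '\0');
        bool found = false;
        while (!found && !cancelled && in) {
            in.read(&chunk[0], chunk.size());   // one micro-I/O
            std::streamsize got = in.gcount();
            found = std::string(chunk, 0, (std::size_t)got)
                        .find(needle) != std::string::npos;
            // (matches straddling chunk boundaries are ignored
            // here, to keep the sketch short)
        }
        if (!cancelled)
            post_completion_event(found);
    }).detach();
}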
Still, despite all the associated complexities, one of these approaches, namely the event-driven approach, has an excellent record of success, at least in GUI programming (it would be quite difficult to find a GUI framework which isn't event-driven at least to a certain extent).
If it walks like a duck, swims like a duck, and quacks like a duck...
If, after escaping the 'Long I/O' Gorrilla, Bunny's client-side program is still not working as fast as the user would like, then there are chances that it is indeed related to the lack of CPU power of a single core. Let's come back to our earlier example with business-like dialogs, but now let's assume that somewhere in one of the dialogs there is an MP4 playing (we won't ask why it's necessary; maybe because a manager said it's cute, or maybe marketing found that it increases sales - our Bunny just needs to implement it). If Bunny called a synchronous function play_mp4() at the point of creating the dialog, it would stop the program from going any further until the MP4 ended. To deal with the problem, he clearly needs some kind of asynchronous solution.
Let's think about it a bit. What we need is a way to start rendering, wait for it to end, and be able to cancel it when necessary... Wait, but this is exactly what non-blocking I/O is all about! If so, what prevents our Bunny from representing this MP4 playback as yet another kind of non-blocking I/O? (In fact, it is non-blocking output, just using the screen instead of a file as the output device.) As soon as we can call start_playing_mp4_and_notify_us_when_you_re_done(), we can safely treat MP4 playback as a custom non-blocking I/O operation, just like the custom file-search operation we discussed above. There might be a multi-threaded wrapper needed to wrap play_mp4() into a non-blocking API, but it needs to be done only once, so multi-threading still stays restricted to a very limited number of places. The very same approach will also cover lots of situations where heavy calculations are necessary within the client. How to optimize the calculations (or the MP4 playback) themselves to split the work over multiple cores is another story, and if our Bunny is writing yet another video codec he still has more Gorrillazz to deal with (with chances remaining that one of them will get him).
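A sketch of that once-only wrapper could be as simple as the following (play_mp4() is the hypothetical synchronous call from the text; the stub body is ours):

#include <functional>
#include <string>
#include <thread>

// Stub standing in for the real synchronous player from the text.
void play_mp4(const std::string& file)
{
    (void)file; // pretend to decode and render until the clip ends
}

// The once-only multi-threaded wrapper: blocking playback runs on
// its own worker thread, and the rest of the program sees it as
// just another custom non-blocking I/O operation.
void start_playing_mp4_and_notify_us_when_you_re_done(
        const std::string& file, std::function<void()> notify)
{
    std::thread([file, notify] {
        play_mp4(file); // blocks, but only this worker thread
        notify();       // post a completion event to the main loop
    }).detach();
}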
Single-thread: back to the future
If our 'No MT Bugs' Bunny has managed to write a client-side program which relies on its main GUI thread as a main loop, and treats everything else as non-blocking I/O, he can afford to know absolutely nothing about threads, mutexes and other scary stuff, making the program, from his point of view, essentially a single-threaded one (while there might be threads in the background, our Bunny doesn't actually need to know about them to do his job). Some may argue that in 2010 going single-threaded sounds 'way too 1990-ish' (or even 'way too 1970-ish'). On the other hand, our single thread with non-blocking I/O is not exactly the single thread of linear programs of K&R times. Instead, we can consider it the result of taking into account the strengths and weaknesses of both previous approaches (classical single-threaded and classical heavily multi-threaded) and taking a small step further, trying to address the issues specific to each of them. In some very wide sense, we can even consider the single-thread → multi-thread → single-thread-with-non-blocking-I/O transition to be similar to the Hegelian bud → blossom → fruit [Hegel1807]. In practice, architectures based on non-blocking I/O are usually more straightforward, can be understood more easily, and, most importantly, are orders of magnitude easier to test and debug than their heavily multi-threaded counterparts.
Last-second attempt
Our 'No MT Bugs' Bunny has already got far across the client side of the field, but if he hasn't scored his touchdown yet, he now faces the mightiest of the remaining Gorrillazz, and unfortunately he has almost no room to maneuver. Still, there is one more chance for him to escape the horrible fate of heavily multi-threaded programming: good old algorithm optimization. While a speed-up of a few percent might not be enough to keep you single-threaded, certain major kinds of optimization can make all the multi-threading (and multi-coring) unnecessary (unless, of course, you're afraid that a program without multi-core support won't look 'cool' anymore, regardless of its speed). If our Bunny's bottleneck is a bubble sort on a 10M-element array, or if he's looking for primes by checking every number N by dividing it by every number in the 3..sqrt(N) range [Blair-Chappell10], there are significant chances that he doesn't really need any multi-coring, just a better algorithm. Of course, your code obviously doesn't have any dreadfully inefficient stuff like that, but maybe it's still worth another look just to be 100% sure? What about that repeated linear scan of a million-element list? And when was the last time you ran a profiler over your program?
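For instance (our example, in the spirit of the prime-checking one above), replacing per-number trial division with a sieve of Eratosthenes changes the complexity class entirely, a gain that no fixed number of extra cores can match:

#include <cstddef>
#include <vector>

// Sieve of Eratosthenes: all primes up to n in O(n log log n)
// time, versus roughly O(n*sqrt(n)) overall for dividing every
// candidate by everything in the 3..sqrt(N) range. As n grows,
// this gap dwarfs anything an 8-core parallelization of the
// naive version could buy.
std::vector<bool> primes_up_to(std::size_t n)
{
    std::vector<bool> is_prime(n + 1, true);
    is_prime[0] = false;
    if (n >= 1) is_prime[1] = false;
    for (std::size_t i = 2; i * i <= n; ++i)
        if (is_prime[i])
            for (std::size_t j = i * i; j <= n; j += i)
                is_prime[j] = false; // i divides j, so j isn't prime
    return is_prime;
}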
Being gotcha-ed
Unfortunately, if our Bunny hasn't scored his touchdown yet, he's not too likely to score it anymore. He has been gotcha-ed by one of the Multithreaded Gorrillazz, and multi-threading seems inevitable for him. If such a situation arises, some developers may consider themselves lucky that they get to write multi-threaded programs, while others will hate the very thought of it; it is just a matter of personal preference. What is clear, though, is that (even if everything is done properly) it will be quite a piece of work, with more than a fair share of bugs to deal with.
Tools like OpenMP or TBB won't provide too much help in this regard: while they indeed make thread and mutex creation much easier and hide the details of inter-thread communication, it is not thread creation but thread synchronization which causes most of the problems with multi-threaded code. And while OpenMP provides certain tools to help detect race conditions a bit earlier, the resulting code will still remain very rigid and fragile, and will still be extremely difficult to test and debug, especially in production environments.
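A tiny illustration of that point (our own example): with OpenMP, making the loop below run on all cores is a one-line affair, but it is the synchronization of the shared accumulator - the reduction clause - that decides whether the answer is correct or a race:

#include <cstdio>

int main()
{
    const int N = 1000000;
    long long sum = 0;

    // Creating the threads is the trivial part: one pragma and the
    // loop runs in parallel. But with a plain
    // '#pragma omp parallel for' the unsynchronized 'sum += i'
    // would be a data race on the shared variable, giving a
    // different (wrong) result on different runs.
    //
    // The fix is about synchronization, not creation:
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += i;

    std::printf("%lld\n", sum); // always 499999500000
    return 0;
}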
Quarter 1&2 summary
While our Bunny didn't score a touchdown every time, he still did pretty well. As we have seen, he scored 4 times and was gotcha-ed only once. The first half of the game has ended with the score 'No Multi-Threaded Bugs' Bunny: 4, 'Multithreaded Gorrillazz': 1. Stay tuned for the remaining two quarters of this exciting match.
References
[Blair-Chappell10] Stephen Blair-Chappell, 'How to become a parallel programming expert in 9 minutes', ACCU Conference, 2010
[Hegel1807] G.W.F. Hegel, Phenomenology of Spirit, 1807 (translation by Terry Pinkard, 2008)
[Maennel02] Olaf Maennel, Anja Feldmann, 'Realistic BGP traffic for test labs', ACM SIGCOMM Computer Communication Review, 2002
[Martin10] Robert Martin, 'The Language Stew', ACCU Conference, 2010
[Merritt07] Rick Merritt, 'M'soft: Parallel programming model 10 years off', 2007, http://www.eetimes.com/showArticle.jhtml?articleID=201200019
[Suess07] Michael Suess, 'Is the Multi-Core Revolution a Hype?', 2007, http://www.thinkingparallel.com/2007/08/21/is-the-multi-core-revolution-a-hype/