Wednesday, September 22, 2010

The join calculus

How do you solve the dining philosophers problem in your favorite concurrency paradigm? Can you do it in a decentralized way, so that philosophers desiring unrelated chopsticks never get in each other's way?

The join calculus, originally developed for distributed programming, offers a truly elegant solution to this classic concurrency problem. In fact, the solution requires little more than stating the problem:

channel hungry[1..n]
channel chops[1..n]

when hungry[1], chops[1], chops[2] do
  eat(1); chops[1]; chops[2]; think(1); hungry[1]
...
when hungry[n], chops[n], chops[1] do
  eat(n); chops[n]; chops[1]; think(n); hungry[n]

initially
  hungry[1] ; ... ; hungry[n]
  chops[1] ; ... ; chops[n]


This bit of pseudocode treats each chopstick as a separate channel, and also uses channels to represent requests for chopsticks (hungry). There is a collection of joins, each providing an event handler that fires when messages are available on a given set of channels. When a join fires, all the relevant messages are removed from their queues in a single atomic step.

Aside from a few important details (e.g. are channels synchronous? it depends), that example provides a pretty complete picture of what the join calculus offers. It's possible to code up solutions to a number of classic problems in a similarly succinct way: reader-writer locks, barriers, condition variables, semaphores, Santa Claus...

Of course, many synchronization constructs (even the lowly lock) are capable of encoding a large variety of other constructs, though perhaps not as elegantly. The real question is whether such encodings provide the same parallel scalability that a direct solution does.

To answer this question for the join-based encodings, we need to know how joins are implemented. An implementation must provide the atomicity guarantee described above, and every implementation I'm aware of does this by using a single lock per group of joined channels. This centralized lock destroys scalability because it sequentializes use of the channels and firing of events. In the case of the dining philosophers, it means that two philosophers sitting far from one another must still contend for the same lock before picking up their independent chopsticks.
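To make the issue concrete, here's a minimal sketch (in Java, with an invented API rather than any particular joins library) of that coarse-grained strategy. Every channel in the group shares one lock; every send acquires it, bumps a message count, and scans the patterns for one that can fire. Two philosophers whose chopsticks never overlap still fight over that one lock:

import java.util.ArrayList;
import java.util.List;

final class JoinGroup {
    private final Object lock = new Object();           // the single group-wide lock
    private final int[] pending;                        // pending message count per channel
    private final List<int[]> patternChannels = new ArrayList<>();
    private final List<Runnable> patternBodies = new ArrayList<>();

    JoinGroup(int channels) { pending = new int[channels]; }

    // "when chans... do body"
    void when(int[] chans, Runnable body) {
        patternChannels.add(chans);
        patternBodies.add(body);
    }

    // Send a message on channel c, e.g. putting a chopstick back on the table.
    void send(int c) {
        Runnable fire = null;
        synchronized (lock) {                           // every send in the group contends here
            pending[c]++;
            outer:
            for (int i = 0; i < patternChannels.size(); i++) {
                for (int ch : patternChannels.get(i))
                    if (pending[ch] == 0) continue outer;
                for (int ch : patternChannels.get(i))   // consume the matched messages atomically
                    pending[ch]--;
                fire = patternBodies.get(i);
                break;
            }
        }
        if (fire != null) fire.run();                   // run the handler outside the lock
    }
}

A finer-grained algorithm has to provide the same atomic-consumption guarantee without funneling every send through that one synchronized block.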

I've been working on an algorithm and implementation that allows for a much finer grain of concurrency. Stay tuned...

Tuesday, September 21, 2010

ICFP pregame: Reduceron

Some PRL members have been meeting to discuss ICFP papers prior to the conference. Since we worked hard to grok the essence of each paper, it seems worth posting the takeaways here.

The Reduceron Reconfigured
Matthew Naylor and Colin Runciman

Long ago (the '80s) there was a great deal of interest in building architectures geared toward functional languages (taking Backus's Turing Lecture literally). These efforts burned out, though, because industry heavyweights were able to make unsurpassable progress with traditional von Neumann machines.

This paper revisits functional architecture in the world of FPGAs. I can't say how much their architecture differs from those proposed in the '80s, but the big emphasis in the paper is on a certain kind of parallelism. This parallelism stems from the purity of the language, but not in the way you might think: it is not at the level of program expressions but at the level of instructions.

How do you execute a beta-reduction with several arguments? On a von Neumann architecture, you do it in many sequential steps. The proposal here is to provide instructions that can, in a single cycle, execute such a beta-reduction. That single instruction interacts with memory in several ways: it loads the definition of the function, it pops the arguments off the stack, and it does appropriate pointer surgery to replace formals by actuals. In parallel, and in one step.
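For a sense of what gets collapsed, here is a rough sketch (in Java, with made-up data structures, not the paper's actual instruction set) of what instantiating a function body costs on a conventional machine: one memory write per node of the body, plus a stack access per argument. The Reduceron's wide, parallel memories do the whole template at once.

final class SequentialReducer {
    static final int APP = 0, ARG = 1;            // minimal node tags
    int[][] heap = new int[1 << 20][];            // heap of {tag, a, b} nodes
    int hp = 0;                                   // heap pointer
    int[] stack = new int[1024];                  // addresses of the actual arguments
    int sp = 0;                                   // stack pointer

    // body[i] = {tag, a, b}; an ARG node's field a names a formal parameter.
    int instantiate(int[][] body, int arity) {
        int base = hp;
        for (int[] node : body) {
            int a = node[0] == ARG ? stack[sp - 1 - node[1]] : node[1];  // formal -> actual
            heap[hp++] = new int[]{node[0], a, node[2]};                 // one memory write per node
        }
        sp -= arity;                              // pop the consumed arguments
        return base;
    }
}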

The paper proposes a compilation strategy and instruction set, and discusses an FPGA implementation. On the benchmarks they give, the FPGA comes within an order of magnitude of GHC running on a modern Intel processor -- and the FPGA clocks 30x slower than the Intel. Given how much engineering has gone into GHC, that's pretty impressive.

A complaint: the takeaway wasn't clear. Even with their technique, the FPGA is a lot slower (in absolute terms) than what we have with GHC. So having an FPGA coprocessor, as they suggest, seems fruitless. Should we design ASICs along the lines of this paper?

Thursday, September 16, 2010

Shared-state versus message-passing

One of the most basic distinctions made in concurrency theory is between shared-state concurrency and message-passing concurrency. I'm finally getting around to reading John Reppy's book on Concurrent ML, and he has a succinct characterization of the two:
Shared-memory languages rely on imperative features ... for interprocess communication, and provide separate synchronization primitives. Message-passing languages provide a single unified mechanism for both synchronization and communication.

Tuesday, September 14, 2010

Hacking on, and with, Racket

Things have been a bit quiet on the research front: I've been doing a bit of hacking and a lot of yak shaving.

Today's accomplishment: I've added a new primitive to Racket for compare-and-set. In hindsight, this is a pretty simple patch, maybe a couple dozen lines. But getting there required some quality time with the 14Kloc jit.c file.

The upshot is that, with this patch, Racket can be used for writing non-blocking algorithms, modulo ongoing work to increase concurrency in its runtime system. I'm planning to design and build a comprehensive library for parallelism, inspired by java.util.concurrent but with somewhat different goals. I hope to have more to say about this soon.
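For a flavor of what this enables (shown in Java rather than Racket, since the patch itself isn't the point here): the essence of a non-blocking algorithm is a read-modify-CAS loop that retries on interference instead of holding a lock.

import java.util.concurrent.atomic.AtomicLong;

final class NonBlockingCounter {
    private final AtomicLong value = new AtomicLong(0);

    long increment() {
        while (true) {
            long current = value.get();            // read the current value
            long next = current + 1;               // compute the update locally
            if (value.compareAndSet(current, next))
                return next;                       // CAS succeeded: our update is visible
            // CAS failed: another thread got in first. No lock was held, so we
            // simply re-read and retry.
        }
    }
}

Exposing compare-and-set in Racket makes this style of algorithm expressible there, which is what the planned library will build on.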

Tuesday, September 7, 2010

Code → paper

I'm working on a paper where the main contribution is an algorithm, and the interesting aspect of the algorithm is not a complexity bound, or even a liveness property. Its appeal is harder to formalize: it enables fine-grained parallelism.

The evidence here is empirical, not theoretical. Consequently, when I started simplifying the real code into pseudocode, my coauthor balked. There are at least two reasons pseudocode is a bad idea. First, it's easy to introduce mistakes. Second, it's a bit dishonest: your numbers relate to the real code, not a platonic version of it.

But there are good arguments in favor of pseudocode, too. In our case, it's relatively painless to treat the simplification I wanted to do as a refactoring of the real code -- so our library improves along the way. That's partly due to writing the code in a relatively high-level language. But in more complex, low-level cases, making the code presentable while keeping it executable could be much trickier, or even impossible.

What's a reasonable guideline for using pseudocode?

Wednesday, September 1, 2010

Specification safari: it's not fair

Lock-freedom is great: it guarantees that a system will make progress regardless of the vagaries of scheduling. The delay of one thread -- due to a cache miss, or I/O, or preemption -- doesn't force a delay on other threads. In practice, lock-free algorithms permit a greater degree of concurrency, and by Amdahl's law that translates to increased parallel speedup.

But what if we do know something about the scheduler? Most commonly, we know that the scheduler is fair. Fairness and lock-freedom seem orthogonal, since lock-freedom assumes nothing about the scheduler. So I was startled to realize that, in practice, the two properties do interact.

Most well-known lock-free algorithms do not preserve fairness! Concretely, if a thread using a lock-free queue calls enqueue, that call may never return, despite enqueue being a "nonblocking" operation. What gives?

The problem is that other threads in the system may be repeatedly competing to enqueue, a race condition made safe by the fact that they're all using CAS to do this. But CAS does not guarantee fairness, so a thread can repeatedly lose the race. The scheduler has fulfilled its obligations, since the thread is continually given chances to try CAS. At one level of abstraction (the queue implementation), the thread is not being starved. But at a higher level of abstraction (the queue client), it is.
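To make that concrete, here is the push operation of a Treiber-style lock-free stack in Java (the same shape as the queue's enqueue, just shorter). The structure is lock-free because some CAS always succeeds, but nothing says it is ever this thread's CAS:

import java.util.concurrent.atomic.AtomicReference;

final class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        while (true) {
            Node<T> old = head.get();
            Node<T> fresh = new Node<>(value, old);
            if (head.compareAndSet(old, fresh))
                return;                            // we won the race
            // We lost: another thread's push or pop landed first. A fair
            // scheduler keeps giving us CPU time, yet this loop may never exit.
        }
    }
}

So lock-freedom guarantees system-wide progress, while the fairness the client might have expected quietly evaporates one level up.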