
3. Parallelism and Granularity Control

(← Hello World) (Trees →)

Preliminaries

Make sure that you've already done the setup. If you're using Docker to run the tutorial, all commands below should be run within the container in directory ~/mpl-tutorial/how-to-par/:

$ cd path/to/mpl-tutorial
$ ./start-container.sh
<container># cd how-to-par
<container># <enter commands here>

Running Example: Parallel Fibonacci

For a running example, we'll use the "hello world" of parallelism: Fibonacci numbers, calculated using the naive recursive definition fib(n) = fib(n-1) + fib(n-2). The following code defines this function, taking a number n as input and returning the nth Fibonacci number. The base cases are n = 0 and n = 1.

mpl-tutorial/how-to-par/sequential/fib.sml:

fun fib n =
  if n = 0 then
    0
  else if n = 1 then
    1
  else
    fib (n-1) + fib (n-2)
Question: I'm new to SML. How do I read this code?
In the code above, the first line begins defining a function named fib that takes an argument n. We then write the body of the function, which in this case is a conditional expression.

Conditional expressions are written if B then X else Y, where B is a boolean expression and X and Y are expressions of the same type. Note that we compare equality with a single "=", i.e. n = 0 is a boolean expression.
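For instance, here is a small standalone snippet (our own illustration, not one of the tutorial's files) showing that if/then/else is an expression that produces a value:

```sml
(* if/then/else is an expression, so it can appear on the
 * right-hand side of a binding. *)
val n = 7
val parity = if n mod 2 = 0 then "even" else "odd"
val _ = print (parity ^ "\n")  (* prints "odd" *)
```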

If you are coming from a language such as C, Java, Python, JavaScript, etc., then SML is going to feel a bit different. It's a functional language, so functions are defined by expressions instead of sequences of statements.

Parallelizing it

We can parallelize this code by doing the two recursive calls in parallel. MPL provides a function for this: ForkJoin.par, which takes two functions as arguments, executes them in parallel, and returns their results as a pair.

Below is a first attempt, which is "correct" but has a performance issue that we will discuss below. Notice that we make two recursive calls, just as before, but now they are packaged up as anonymous functions and passed as arguments to par.

mpl-tutorial/how-to-par/bad-par/bad-par-fib.sml:

fun badParFib n =
  if n = 0 then
    0
  else if n = 1 then
    1
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => badParFib (n-1),
                      fn () => badParFib (n-2))
    in
      a + b
    end
Question: I'm new to SML. How do I read this code?
There are three things in this code we haven't seen before:
  1. val (a, b) = ... introduces two variables by unpacking a tuple. The right hand side needs to be an expression that returns a tuple of two things.
  2. let ... in ... end lets us introduce new variables locally. In the above code, the variables a and b can be used only until we get to the corresponding end.
  3. fn () => ... is an anonymous (a.k.a. "lambda") function whose argument is the unit value (), i.e., it takes no meaningful input. A more general form is fn x => A, where A is an expression that may use the variable x.
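Putting these three pieces together, here is a small standalone snippet (our own illustration, not part of the tutorial's files) with an anonymous function, tuple unpacking, and a let block:

```sml
(* An anonymous function bound to a name, tuple unpacking with
 * val (a, b) = ..., and local bindings via let ... in ... end. *)
val double = fn x => 2 * x
val sum =
  let
    val (a, b) = (double 3, double 4)  (* a = 6, b = 8 *)
  in
    a + b
  end
val _ = print (Int.toString sum ^ "\n")  (* prints "14" *)
```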

Code to run badParFib. The code below runs badParFib on input 35 and prints the result. The function Int.toString converts the resulting number into a string, and the operator ^ concatenates strings.

mpl-tutorial/how-to-par/bad-par/main.sml:

val n = 35
val _ = print ("Computing fib(" ^ Int.toString n ^ ")\n")
val result = badParFib n
val _ = print ("fib(" ^ Int.toString n ^ ") = " ^ Int.toString result ^ "\n")

Compile and run it. Here is an appropriate .mlb file for compilation. The line $(SML_LIB)/basis/fork-join.mlb makes it possible to use ForkJoin.par.

mpl-tutorial/how-to-par/bad-par/main.mlb:

$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/basis/fork-join.mlb
bad-par-fib.sml
main.sml

We can now compile and run the code. To use more than one processor, the syntax is ./program @mpl procs N --, where N is the number of processors.

<container># mpl bad-par/main.mlb

<container># time bad-par/main
Computing fib(35)
fib(35) = 9227465

real	0m2.432s
user	0m1.843s
sys	0m0.586s

<container># time bad-par/main @mpl procs 2 --
Computing fib(35)
fib(35) = 9227465

real	0m1.337s     # about 2x faster on 2 processors!
user	0m1.902s
sys	0m0.579s

And check it out: above we can see that this code gets about 2x faster when we use 2 processors instead of 1.

It's parallel! But is it fast?

Observed Work-Efficiency and Granularity Control

Ideally, a parallel program should be work-efficient: the total amount of work it performs should be approximately the same as the fastest known sequential alternative.

We've named badParFib suggestively because it has a performance problem. On one processor, badParFib is approximately 10x slower than the simple sequential fib program.

<container># mpl sequential/main.mlb
<container># time sequential/main
Computing fib(35)
fib(35) = 9227465

real	0m0.216s
user	0m0.213s
sys	0m0.001s

<container># mpl bad-par/main.mlb
<container># time bad-par/main
Computing fib(35)
fib(35) = 9227465

real	0m2.432s     # 10x slower than the sequential code!
user	0m1.843s
sys	0m0.586s

The only difference between the two programs is ForkJoin.par. This function call isn't free! The cost of ForkJoin.par can be fairly significant, and we need to amortize this overhead.

The simplest way to amortize the cost of ForkJoin.par is to ensure that the parallel tasks we create are not too small. This approach is called granularity control, because we are controlling the so-called granularity of tasks (where the "granularity" of a task is just the amount of work the task performs).

Making It Fast with Granularity Control

A simple way of performing granularity control for the parallel Fibonacci function is to switch to a fast sequential algorithm below some threshold. Here, we hardcode the threshold at n = 20: for any n < 20, we'll use the fast sequential fib(n) instead of the parallel version.

mpl-tutorial/how-to-par/fast-par/fast-par-fib.sml:

fun fastParFib n =
  if n < 20 then
    fib n    (* do the sequential code instead *)
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => fastParFib (n-1),
                      fn () => fastParFib (n-2))
    in
      a + b
    end

This is now just as fast as the sequential code on one processor, but is still parallel. We get the best of both worlds.

<container># mpl fast-par/main.mlb
<container># time fast-par/main
Computing fib(35)
fib(35) = 9227465

real	0m0.211s      # almost exactly the same as sequential fib!
user	0m0.209s
sys	0m0.001s

<container># time fast-par/main @mpl procs 2 --
Computing fib(35)
fib(35) = 9227465

real	0m0.110s      # still gets 2x faster on 2 processors!
user	0m0.215s
sys	0m0.001s

Tuning Granularity

Above, we chose a constant threshold n = 20. How did we arrive at this number? What if we used n = 21 instead? Does that make a difference?

Well, there's no magic here. We just have to try it and measure it. Time for an experiment!

Below, we generalize our parallel Fibonacci function to take an additional argument, g, which is the grain size, and switch to the sequential algorithm when n < g. We then run this code on a variety of grain sizes, and report their times. To loop through multiple grain sizes, we define a useful helper function, forloop, which takes a function as argument and runs it on a range of indices.

mpl-tutorial/how-to-par/tuning/main.sml:

fun parFibWithGrain (g, n) =
  if n < g then
    fib n
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => parFibWithGrain (g, n-1),
                      fn () => parFibWithGrain (g, n-2))
    in
      a + b
    end

fun timeFibWithGrain g =
  let
    val n = 35

    val t0 = Time.now ()
    val result = parFibWithGrain (g, n)
    val t1 = Time.now ()

    val elapsed = Time.- (t1, t0)
  in
    print ("grain " ^ Int.toString g ^ ": " ^ Time.toString elapsed ^ "\n")
  end

(* run f(i), f(i+1), ..., f(j-1) *)
fun forloop (i, j, f) =
  if i >= j then () else (f i; forloop (i+1, j, f))

(** this is the same as
  *   (timeFibWithGrain 5;
  *    timeFibWithGrain 10;
  *    timeFibWithGrain 15;
  *    timeFibWithGrain 20;
  *    timeFibWithGrain 25;
  *    timeFibWithGrain 30)
  *)
val _ = forloop (1, 7, fn i => timeFibWithGrain (5*i))

When we run it, we see that the running time improves significantly as the grain size increases, up to a grain size of roughly 15-20. The difference between 20 and 30 is small, so choosing n = 20 as the threshold seems good enough.

<container># mpl tuning/main.mlb
<container># tuning/main
grain 5: 0.868
grain 10: 0.271
grain 15: 0.219
grain 20: 0.218
grain 25: 0.213
grain 30: 0.214

Keep in mind that these measurements are subject to statistical noise. A proper experiment would perform many trials and compare averages; we're being a bit sloppy here for the sake of simplicity.
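As a sketch of what a more careful experiment might look like (the helper name averageTime and the trial count are our own, not part of the tutorial's code), we could time each grain size several times and report the mean. This uses only the SML Basis Time structure:

```sml
(* Run f () `trials` times and return the mean elapsed time in
 * seconds. Assumes trials > 0. *)
fun averageTime trials f =
  let
    fun loop (i, acc) =
      if i >= trials then acc
      else
        let
          val t0 = Time.now ()
          val _ = f ()
          val t1 = Time.now ()
        in
          loop (i + 1, Time.+ (acc, Time.- (t1, t0)))
        end
  in
    Time.toReal (loop (0, Time.zeroTime)) / Real.fromInt trials
  end

(* e.g.: val avg = averageTime 10 (fn () => parFibWithGrain (20, 35)) *)
```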