Make sure that you've already done the setup. If
you're using Docker to run the tutorial, all commands below should be
run within the container in directory ~/mpl-tutorial/how-to-par/:
$ cd path/to/mpl-tutorial
$ ./start-container.sh
<container># cd how-to-par
<container># <enter commands here>
For a running example, we'll use the "hello world" of parallelism:
Fibonacci numbers,
calculated using the naive recursive definition
fib(n) = fib(n-1) + fib(n-2). The following code defines this function,
taking a number n as input and returning the nth Fibonacci number.
The base cases are n = 0 and n = 1.
mpl-tutorial/how-to-par/sequential/fib.sml:
fun fib n =
  if n = 0 then
    0
  else if n = 1 then
    1
  else
    fib (n-1) + fib (n-2)

Question: I'm new to SML. How do I read this code?
In the code above, the first line begins defining a function named fib that takes an argument n. We then write the body of the function, which in this case is a conditional expression.
Conditional expressions are written if B then X else Y, where B is a boolean expression and X and Y are expressions of the same type. Note that we compare for equality with a single "=", i.e. n = 0 is a boolean expression.
If you are coming from a language such as C, Java, Python, JavaScript, etc., then SML is going to feel a bit different. It's a functional language, so functions are defined by expressions instead of sequences of statements.
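For instance, here is a small sketch (the names sign and label are illustrative, not part of the tutorial's code) showing that an if/then/else in SML is itself an expression that produces a value, rather than a statement:

```sml
(* In SML, a conditional is an expression that produces a value,
 * so it can appear anywhere a value is expected. *)
fun sign n =
  if n < 0 then ~1      (* ~ is SML's unary minus *)
  else if n = 0 then 0
  else 1

(* The conditional's result can be bound directly to a variable. *)
val label = if sign 42 = 1 then "positive" else "not positive"
```

Since both branches must have the same type, the compiler can check the whole expression; there is no separate "return" statement.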
We can parallelize this code by doing the two recursive calls
in parallel. MPL provides a function for this: ForkJoin.par, which takes two
functions as arguments and executes them in parallel.
Below is a first attempt, which is "correct" but has a performance issue that
we will discuss below. Notice that we make two recursive
calls, just like before, but now these are packaged up as anonymous functions
and passed as arguments to par.
mpl-tutorial/how-to-par/bad-par/bad-par-fib.sml:
fun badParFib n =
  if n = 0 then
    0
  else if n = 1 then
    1
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => badParFib (n-1),
                      fn () => badParFib (n-2))
    in
      a + b
    end

Question: I'm new to SML. How do I read this code?
There are three things in this code we haven't seen before:
1. val (a, b) = ... introduces two variables by unpacking a tuple. The right-hand side needs to be an expression that returns a tuple of two things.
2. let ... in ... end lets us introduce new variables locally. In the above code, the variables a and b can be used only until we get to the corresponding end.
3. fn () => ... is an anonymous (a.k.a. "lambda") function that takes no interesting arguments. A more general form is fn x => A, where A is an expression that uses variable x.
Now let's write some code that runs badParFib on input
35 and prints out the result. In the code below, the function
Int.toString converts the resulting number into a string, and the operator
^ concatenates strings.
mpl-tutorial/how-to-par/bad-par/main.sml:
val n = 35
val _ = print ("Computing fib(" ^ Int.toString n ^ ")\n")
val result = badParFib n
val _ = print ("fib(" ^ Int.toString n ^ ") = " ^ Int.toString result ^ "\n")

Compile and run it. Here is an appropriate .mlb file for compilation.
The line $(SML_LIB)/basis/fork-join.mlb makes it possible to use
ForkJoin.par.
mpl-tutorial/how-to-par/bad-par/main.mlb:
$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/basis/fork-join.mlb
bad-par-fib.sml
main.sml

We can now compile and run the code. To use more than one processor,
the syntax is ./program @mpl procs N --.
<container># mpl bad-par/main.mlb
<container># time bad-par/main
Computing fib(35)
fib(35) = 9227465
real 0m2.432s
user 0m1.843s
sys 0m0.586s
<container># time bad-par/main @mpl procs 2 --
Computing fib(35)
fib(35) = 9227465
real 0m1.337s # about 2x faster on 2 processors!
user 0m1.902s
sys 0m0.579s
And check it out: above we can see that this code gets about 2x faster when we use 2 processors instead of 1.
It's parallel! But is it fast?
Ideally, a parallel program should be work-efficient: the total amount of work it performs should be approximately the same as the fastest known sequential alternative.
We've named badParFib suggestively because it has a performance problem.
On one processor, badParFib is approximately 10x slower than the simple
sequential fib program.
<container># mpl sequential/main.mlb
<container># time sequential/main
Computing fib(35)
fib(35) = 9227465
real 0m0.216s
user 0m0.213s
sys 0m0.001s
<container># mpl bad-par/main.mlb
<container># time bad-par/main
Computing fib(35)
fib(35) = 9227465
real 0m2.432s # 10x slower than the sequential code!
user 0m1.843s
sys 0m0.586s
The only difference between the two programs is ForkJoin.par. This function
call isn't free! The cost of ForkJoin.par can be fairly significant, and
we need to amortize this overhead.
The simplest way to amortize the cost of ForkJoin.par is to ensure that the
parallel tasks we create are not too small. This approach is called
granularity control, because we are controlling the granularity of
tasks, where the "granularity" of a task is just the amount of work
the task performs.
A simple way of performing granularity control for the parallel Fibonacci
function is to switch to a fast sequential algorithm below some
threshold. Here, we hardcode the threshold at n = 20: for any n < 20, we'll
use the fast sequential fib(n) instead of the parallel version.
mpl-tutorial/how-to-par/fast-par/fast-par-fib.sml:
fun fastParFib n =
  if n < 20 then
    fib n  (* do the sequential code instead *)
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => fastParFib (n-1),
                      fn () => fastParFib (n-2))
    in
      a + b
    end

This is now just as fast as the sequential code on one processor, but is still parallel. We get the best of both worlds.
<container># mpl fast-par/main.mlb
<container># time fast-par/main
Computing fib(35)
fib(35) = 9227465
real 0m0.211s # almost exactly the same as sequential fib!
user 0m0.209s
sys 0m0.001s
<container># time fast-par/main @mpl procs 2 --
Computing fib(35)
fib(35) = 9227465
real 0m0.110s # still gets 2x faster on 2 processors!
user 0m0.215s
sys 0m0.001s
Above, we chose a constant threshold n = 20. How did we arrive at this
number? What if we used n = 21 instead? Does that make a difference?
Well, there's no magic here. We just have to try it and measure it. Time for an experiment!
Below, we generalize our parallel Fibonacci function to take an additional
argument, g, which is the grain size, and switch to the sequential
algorithm when n < g. We then run this code on a variety of grain sizes,
and report their times. To loop through multiple grain sizes, we define
a useful helper function, forloop, which takes a function as argument
and runs it on a range of indices.
mpl-tutorial/how-to-par/tuning/main.sml:
fun parFibWithGrain (g, n) =
  if n < g then
    fib n
  else
    let
      val (a, b) =
        ForkJoin.par (fn () => parFibWithGrain (g, n-1),
                      fn () => parFibWithGrain (g, n-2))
    in
      a + b
    end
fun timeFibWithGrain g =
  let
    val n = 35
    val t0 = Time.now ()
    val result = parFibWithGrain (g, n)
    val t1 = Time.now ()
    val elapsed = Time.- (t1, t0)
  in
    print ("grain " ^ Int.toString g ^ ": " ^ Time.toString elapsed ^ "\n")
  end
(* run f(i), f(i+1), ..., f(j-1) *)
fun forloop (i, j, f) =
  if i >= j then () else (f i; forloop (i+1, j, f))

(** this is the same as
  * (timeFibWithGrain 5;
  *  timeFibWithGrain 10;
  *  timeFibWithGrain 15;
  *  timeFibWithGrain 20;
  *  timeFibWithGrain 25;
  *  timeFibWithGrain 30)
  *)
val _ = forloop (1, 7, fn i => timeFibWithGrain (5*i))

When we run it, we see that the running time improves significantly as the
grain size increases, up to a grain size of around 15-20. The difference
between 20 and 30 is small, so choosing n = 20 as the threshold
seems good enough.
<container># mpl tuning/main.mlb
<container># tuning/main
grain 5: 0.868
grain 10: 0.271
grain 15: 0.219
grain 20: 0.218
grain 25: 0.213
grain 30: 0.214
Keep in mind there is statistical noise to take into account here. A proper experiment would perform many trials and compare averages. We're being a bit sloppy, just for the sake of keeping things simple.