Part IV: Parallelized Version (page 2)

Other approaches to data hazard handling

So our reads of the best utility are holding up our parallel version, creating lots of contention. As explained, we can’t just read/write shared data from any parallel thread without some form of coordination or synchronization, so consider the Overview of Synchronization Primitives in .NET, which describes these mechanisms as “roughly divided into three categories: locking, signaling, and interlocked operations”:

The categories are not tidy nor clearly defined: Some synchronization mechanisms have characteristics of multiple categories; events that release a single thread at a time are functionally like locks; the release of any lock can be thought of as a signal; and interlocked operations can be used to construct locks. However, the categories are still useful.

If you are not familiar with this topic, it’s an interesting read. I’ll focus on two of these mechanisms, quoting from the link above:

ReaderWriterLockSlim Class The ReaderWriterLockSlim class addresses the case where a thread that changes data, the writer, must have exclusive access to a resource. When the writer is not active, any number of readers can access the resource (for example, by calling the EnterReadLock method). When a thread requests exclusive access (for example, by calling the EnterWriteLock method), subsequent reader requests block until all existing readers have exited the lock, and the writer has entered and exited the lock.

Interlocked Operations Interlocked operations are simple atomic operations performed on a memory location by static methods of the Interlocked class. Those atomic operations include addition, increment and decrement, exchange, conditional exchange depending on a comparison, and read operations for 64-bit values on 32-bit platforms.

The ReaderWriterLockSlim class gives us just what we need for the best utility and node data: a lock for exclusive writes and shared access for multiple readers. So that will work for this data. For the task count, which is just an integer, the Interlocked class fits perfectly; all we need to do with this value is increment, decrement, and read it. Great. Note that these operations are atomic, meaning you can consider each as happening as a single indivisible operation. You might think that something like a long int read would naturally be atomic, i.e. happen as a single machine-level operation, but that is not necessarily the case (I think it is mostly true on modern 64-bit CPUs, but we can’t assume this). For example, a CPU may read the 64-bit int in two separate 32-bit read operations at the machine level; if the thread happened to be suspended between these and another thread did an update, then invalid data would be read when execution returned to the reading thread. Most likely this would happen very rarely, but it would lead to a difficult-to-find bug.
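
To make the torn-read risk concrete, here is a minimal sketch (the counter name is illustrative, not part of our program) contrasting a plain read of a shared 64-bit value with Interlocked.Read, which is guaranteed to be atomic on all platforms:

    open System.Threading

    let counter = ref 0L    // a shared 64-bit counter

    // On a 32-bit platform a plain read may be split into two 32-bit
    // machine-level reads and so can observe a half-updated value.
    let unsafeRead () = !counter

    // Interlocked.Read reads all 64 bits as one indivisible operation.
    let safeRead () = Interlocked.Read(counter)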

Here is our new implementation of the class for managing the values and task count:

    type MyStorage<'a, 'b when 'b:comparison> (compareFunc: 'b -> 'b -> bool, 
                                               initSearchNode: 'a,
                                               initUtility: 'b
                                               ) =

        let rwSlimData = new ReaderWriterLockSlim()  // guards BestUtility and BestNode
        
        let TaskCount = ref 0L  // Interlocked.Read is for 64 bit ints
        let mutable BestUtility = initUtility
        let mutable BestNode = initSearchNode

        // methods for use
        member this.Save(node, util) = 
            rwSlimData.EnterWriteLock()
            try
                if compareFunc util BestUtility then 
                    BestUtility <- util
                    BestNode <- node
            finally
                rwSlimData.ExitWriteLock()  // always release, even if compareFunc throws

        member this.Get() = 
            rwSlimData.EnterReadLock()  // multiple threads can hold the read lock at the same time!
            try
                (BestNode, BestUtility)   // return the current data
            finally
                rwSlimData.ExitReadLock()

        member this.IncrTaskCount () = Interlocked.Increment(TaskCount) |> ignore
        member this.DecrTaskCount () = Interlocked.Decrement(TaskCount) |> ignore
        member this.TasksComplete () = 
            let activeTasks = Interlocked.Read(TaskCount)
            activeTasks = 0L  
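
As a quick illustrative exercise of the class (the node type, comparison function, and values here are made up for the example), a maximization problem over integer utilities might use it like this:

    // storage for a maximization problem: a higher utility wins
    let storage = MyStorage<int list, int>((fun a b -> a > b), [], System.Int32.MinValue)

    storage.IncrTaskCount ()
    storage.Save([1; 0; 1], 42)            // recorded only if 42 beats the best so far
    let bestNode, bestUtil = storage.Get()
    storage.DecrTaskCount ()
    printfn "best so far: %A (utility %d)" bestNode bestUtil
    printfn "all tasks done: %b" (storage.TasksComplete())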

I don’t show the revised ParBandB again here, as the only change is that the line let agent = new MyMailBoxAgent<_,_> (compareFunc, initSearchNode, initValue) becomes let agent = new MyStorage<_,_> (compareFunc, initSearchNode, initValue). The downloaded code has this change, of course.

With these changes we get much better performance:

Knapsack times:
     normal BandB time  =       0.294 ms
         parallel time  =       0.392 ms   MAX:      30.642 ms

MAXSAT times:
     normal BandB time  =     580.885 ms
         parallel time  =      13.999 ms   MAX:      28.461 ms

TSP times:
     normal BandB time  =      83.133 ms
         parallel time  =     494.117 ms   MAX:     736.295 ms 

We got a whopping 41× speedup for our best case on MAXSAT, and even the worst case speedup was about 20×. We improved our times on the other problems as well, but we never beat the serial version for those. For the knapsack case, I think the problem instance is just too small to overcome the overheads of task initialization and data coordination required in the parallel version. For TSP, note that the f and branch functions, which involve Array2D operations, take much longer than for MAXSAT.

If you run this code from Visual Studio you may see the message EventSourceException: No Free Buffers available from the operating system (e.g. event rate too fast). There isn’t much info to be found on this message, but I believe it occurs because, when launched attached to VS, the .NET library tries to log events, here the task creations and runs, and we are simply creating them too quickly. Note that before we reach the ‘all variables set’ condition, lots of tasks are being created very fast. The appearance of this error message does not seem to alter the results, but it does change the timing of the code. If you run the .exe outside of VS, as suggested before for timing runs, then there are no warnings or errors.

We could also do timing tests with hyperthreading disabled, but I am avoiding getting into too many system-dependent issues.

The while loop that waits for the tasks to end becomes a CPU hog in the main thread, so it would be nice to change it to a blocking operation; for example, there is a Task.WaitAll function in the task library to wait for a collection of tasks to complete. But we are creating tasks dynamically in our recursive routine and don’t have a simple list of tasks to wait for. Instead of using an active task counter (TaskCount in MyStorage) we could keep a list of tasks and wait for them with Task.WaitAll, but as tasks are created dynamically this would require synchronizing access to that list, rather than to a simple integer value. I therefore think that the interlocked int is the better approach.
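
To see why the list idea gets messy, here is a hedged sketch of that alternative (not the article’s code; allTasks, spawn, and waitForAll are illustrative names). A thread-safe collection holds the tasks, but since workers keep adding tasks while we wait, the wait itself must loop:

    open System.Collections.Concurrent
    open System.Threading.Tasks

    let allTasks = ConcurrentBag<Task>()

    // every dynamically spawned task is recorded in the shared bag
    let spawn (work: unit -> unit) =
        allTasks.Add(Task.Run(work))

    // waiting is awkward: new tasks may appear while we wait on a snapshot,
    // so we must repeat until a completed snapshot gained no new members
    let waitForAll () =
        let mutable lastCount = -1
        while allTasks.Count <> lastCount do
            lastCount <- allTasks.Count
            Task.WaitAll(allTasks.ToArray())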

Profiling shows that we spend the most time on the line let activeTasks = Interlocked.Read(TaskCount), which the loop uses in TasksComplete to detect when all tasks are complete.

Our next idea, given that there doesn’t seem to be a clean way of gathering a fixed list of Tasks to await, is to alter the wait-for-completion loop as follows:

        while not (agent.TasksComplete()) do
            Thread.Sleep(0) // release thread so others can do work! 

The Thread.Sleep() function suspends the current thread for a given number of milliseconds. With an argument of 0 it simply tells the scheduler that we relinquish the rest of our time slice so that other threads can do work; all the other threads are workers implementing the branch and bound search. Using a 1 ms argument for the sleep function we get the following timings:

Knapsack times:
     normal BandB time  =       0.280 ms
         parallel time  =       0.406 ms   MAX:      31.158 ms

MAXSAT times:
     normal BandB time  =     584.140 ms
         parallel time  =      13.902 ms   MAX:      34.285 ms

TSP times:
     normal BandB time  =      82.181 ms
         parallel time  =     422.173 ms   MAX:     674.768 ms 

Not much different, and I also tested with sleep arguments from 0 to 10 milliseconds, which also didn’t seem to change things much. I suspect that on a system with fewer cores the sleep idea may help. A more substantial approach would be to not spawn so many tasks; right now a task is created for every search node, and recall that a search with n 2-state variables has at most 2^(n+1) − 1 search nodes. You could look back at the actual number of search nodes generated, which is far fewer than this maximum. What we are relying on now is good performance from the scheduler for our tasks; instead we could take more control and limit the number of tasks so that each takes on more than one search node, that is, a portion of the search space. Another idea would be to halt scheduling of new tasks until the number of active tasks is below a threshold, as sketched below; this might help the scheduler and lower overhead. But these are subjects well beyond learning F#.
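
For the curious, here is a hedged sketch of the threshold idea (not in the article’s code; maxWorkers, throttle, and runMaybeParallel are made-up names). A SemaphoreSlim caps the number of in-flight tasks, and when no slot is free the work simply runs inline on the current thread, which avoids deadlock:

    open System.Threading
    open System.Threading.Tasks

    let maxWorkers = System.Environment.ProcessorCount
    let throttle = new SemaphoreSlim(maxWorkers)

    let runMaybeParallel (work: unit -> unit) =
        // try to take a worker slot without blocking
        if throttle.Wait(0) then
            Task.Run(fun () ->
                try work ()
                finally throttle.Release()   // free the slot even if work throws
            ) |> ignore
        else
            work ()   // no slot free: do the work inline instead of spawning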

Unhandled Exceptions

There is one last technical issue that needs to be addressed: suppose one of our tasks crashes? It then never reaches the final agent.DecrTaskCount () statement, so the agent’s task count never reaches zero and we would be in a deadlock situation. Fortunately, the situation is simple to correct by changing bandbProcess:

        let rec bandbProcess searchNode =
            try
                if unFinished searchNode then
                    let util = g searchNode
                    let bestUtil = snd (agent.Get())        // second tuple member is bestUtil
                    if compareFunc util bestUtil then       // ok to branch
                        let branches:'a list = branch searchNode  
                        numberOfQueues := !numberOfQueues + branches.Length  // statistics counter (not synchronized)
                        for branch in branches do
                            agent.IncrTaskCount ()
                            Task.Run( fun () -> bandbProcess branch; () ) |> ignore
                else    // we can evaluate
                    let util = f searchNode         
                    agent.Save( searchNode, util )
                    incr numberOfEvals
            finally
                agent.DecrTaskCount ()   // always runs, even if an exception occurs above

With the try / finally construct the agent.DecrTaskCount () statement is always executed. F#’s try / with / finally system should be familiar to those programming in other languages, with one wrinkle: F# does not allow a with and a finally on the same try expression, so to get both you nest one try inside another. I don’t catch the exception here; we could log it to an error log or arrange to have it rethrown on the parent thread, but I don’t show that here. What are the ramifications if an exception did occur? If the task was an ‘unFinished’ one then part of the search tree would go unsearched; if not, then a final evaluation with all variables set would be missed. In either case we would risk missing the true optimum, though the best solution otherwise found would still be returned; essentially we could end the run with a suboptimal solution.
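
If you did want to catch and log, here is a hedged sketch of the nesting (runGuarded and logError are made-up names, not part of the article’s code):

    // hypothetical logging helper; substitute whatever logging you use
    let logError (msg: string) = eprintfn "%s" msg

    // run a task body, logging any exception, and always run the cleanup;
    // note the nested try: F# needs separate try/with and try/finally
    let runGuarded (body: unit -> unit) (cleanup: unit -> unit) =
        try
            try
                body ()
            with ex ->
                logError (sprintf "search task failed: %s" ex.Message)
        finally
            cleanup ()

A task would then be launched as Task.Run( fun () -> runGuarded work (fun () -> agent.DecrTaskCount ()) ), with work being the search step; again, the downloaded code does not include this.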

A related consideration: since the agent task count is incremented outside of and before the task launch, what if the failure happens not in the task body but before the task even starts, that is, Task.Run fails without ever running our process? According to the Task.Run documentation this doesn’t seem possible, as the only exception thrown is when the action (our bandbProcess) is null, which it certainly isn’t. So for now I am not addressing this possible issue, though I would have thought there would be some limit on the total number of tasks and that Task.Run would then throw an exception before starting the worker code.
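
If you wanted to guard against a failing launch anyway, a small defensive sketch (not in the article’s code) would undo the count when Task.Run itself throws:

    agent.IncrTaskCount ()
    try
        Task.Run( fun () -> bandbProcess branch ) |> ignore
    with _ ->
        agent.DecrTaskCount ()   // the task never started: undo the count
        reraise ()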

The last test

I close the section with our final testing image; note the ParBandB tests that I didn’t show on the previous page:

Final testing

Remarks

There are still items that we haven’t covered in F#; remember the list from part II, page 6? In that list I indicated that some of them would be covered in these last two sections, and we did include them in these last two parts, but those topics are large subjects in their own right and we’ve mostly just scratched the surface. Still, we have pretty thoroughly covered developing an algorithm into a functional form, testing it, parallelizing it, and doing some performance optimization; quite a long trip.

Having gone through this, I do find F# very likeable. Once you get used to how generic typing and automatic generalization work and how to solve type problems, it does seem to save time and errors. You often don’t need to create templated functions and classes, nor create subclasses just to extend functionality. If you get in trouble with the types, my suggestion is to explicitly type whatever is causing trouble, be it a function argument or a variable, and then see what else breaks; this may lead you in the right direction. I included a few questions for the community along the way, but please feel free to contact me on those items or anything else at: