Friday, November 25, 2016

Parallelizing data-processing with the TPL DataFlow library

I highly recommend the TPL (Task Parallel Library) Dataflow library. It's a very good abstraction on top of the TPL itself, and easy to use. I was in a situation where I had to parallelize the execution of a file converter which, run as a single instance, used only 15% CPU. By parallelizing it I was able to utilize 100% CPU and finish the conversion job much, much quicker.

It works with .NET 4.5 and onwards, and I believe I saw a .NET Core version, too. But here's the .NET 4.5 version:

Install with NuGet and look to the web for examples of use. Note that many of the examples deal with async/await methods, but the library works quite well with synchronous work as well. I had no need for async use, so my example below, offered as inspiration, uses synchronous calls only:

using System;
using System.Threading.Tasks.Dataflow;

public void ConvertFilesInFolder(string sourceFilesFolderPath)
{
    string[] filePathsAndNames = getFilePathsAndNames(sourceFilesFolderPath);

    // Define a new 'ActionBlock' that you can push work items to.
    var block = new ActionBlock<string>(filePathAndName =>
    {
        ConvertAndMoveTheFile(filePathAndName);
    }, new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 6 // 6 simultaneous conversions (limit of my 3rd-party conversion library licence)
    });

    // Go ahead and add conversion jobs to the action block:
    foreach (string filePathAndName in filePathsAndNames)
    {
        block.Post(filePathAndName);
    }

    block.Complete(); // that's enough jobs...
    block.Completion.Wait(); // ... now block here until they're all done.

    /* Note that as I set the max-degree-of-parallelism to 6, at most 6 conversions run in parallel at any one time. As soon as one completes, the next is taken from the action block's queue. */
}

public void ConvertAndMoveTheFile(string filePathAndName)
{
    try
    {
        // ... convert the file and move it ...
    }
    catch (Exception ex)
    {
        // log, but otherwise suppress and move to next.
    }
}
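
For comparison, here is a rough sketch of what the async/await style mentioned earlier might look like. It is not code I used. `AsyncConversionSketch`, `ConvertFilesAsync` and `ConvertAndMoveTheFileAsync` are hypothetical names, and the `Task.Delay` call merely stands in for a real awaitable conversion. It assumes the `System.Threading.Tasks.Dataflow` NuGet package and C# 6 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // from the System.Threading.Tasks.Dataflow NuGet package

public static class AsyncConversionSketch
{
    // Hypothetical awaitable conversion; Task.Delay only simulates the work.
    private static async Task ConvertAndMoveTheFileAsync(string filePathAndName, ConcurrentBag<string> done)
    {
        await Task.Delay(50);
        done.Add(filePathAndName); // record that this file was processed
    }

    // Processes all files with at most 6 conversions in flight; returns how many were processed.
    public static async Task<int> ConvertFilesAsync(string[] filePathsAndNames)
    {
        var done = new ConcurrentBag<string>();

        // An async lambda selects the Func<string, Task> constructor overload.
        var block = new ActionBlock<string>(async filePathAndName =>
        {
            try
            {
                await ConvertAndMoveTheFileAsync(filePathAndName, done);
            }
            catch (Exception ex)
            {
                // Log and carry on; an unhandled exception would fault the whole block.
                Console.Error.WriteLine(filePathAndName + ": " + ex.Message);
            }
        }, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 6
        });

        foreach (string filePathAndName in filePathsAndNames)
        {
            await block.SendAsync(filePathAndName); // awaitable counterpart to Post
        }

        block.Complete();       // no more items will be added
        await block.Completion; // wait without blocking the calling thread
        return done.Count;
    }
}
```

From an async method you would call it as `int count = await AsyncConversionSketch.ConvertFilesAsync(paths);`. The shape is the same as the synchronous version; the main differences are `SendAsync` instead of `Post` and awaiting `Completion` instead of calling `Wait()`.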

I found this blog-post very helpful in getting introduced and started with the library.
