Parallel programming in .NET¶

Introduction¶

Many personal computers and workstations have multiple CPU cores that enable multiple threads to be executed simultaneously ¹. To take advantage of the hardware, you can parallelize your code to distribute work across multiple processors.

In the past, parallelization required low-level manipulation of threads and locks. .NET enhances support for parallel programming by providing a runtime, class library types, and diagnostic tools. You can write efficient, fine-grained, and scalable parallel code in a natural idiom without having to work directly with threads or the thread pool.

The following illustration provides a high-level overview of the parallel programming architecture in .NET.

Parallel programming provides mechanisms to work efficiently in a variety of scenarios, but there are situations where traditional methods may be a better fit ². Below are some of the strengths and weaknesses of parallel programming.

When to Use Parallelism¶

If you have several tasks that do not depend on each other and are time-consuming, parallelism can speed things up a lot. Example: calculating statistics for different large files.
In services that serve many users, parallelism allows you to handle multiple requests at the same time. Common examples: web servers, cloud applications, REST APIs.
If you need to run an algorithm that consumes a lot of CPU power, dividing it into smaller parts that run in parallel can be a great help. Examples: heavy analysis or calculations, scientific simulations, training AI models, Big Data (map-reduce).
If the system has multiple CPU cores, you can take advantage of these cores to use parallelism.

When to Avoid Parallelism¶

Tasks are highly dependent on each other. Example: A task that needs to wait for the result of the previous one.
There is not a large volume of data or workload, in which case it may not be worth the effort.
It may cause concurrency or synchronization issues that are difficult to debug. Example: Financial processes that need to be synchronized.
The execution environment is single-core or limited.

Parallelism vs Asynchronous execution¶

Parallelism focuses on executing multiple tasks at the same time, usually using multiple CPU cores. It is very common in situations that involve heavy processing and CPU-bound operations, that is, tasks that require a lot of CPU processing time. (Image and video processing, intensive mathematical algorithms (machine learning, simulations), graphic rendering in games and animations.)

Asynchronous execution focuses on not blocking execution in a program while it waits for an operation to be completed. These tasks are called I/O-bound because the wait time is not for the CPU, but for some external resource (such as a disk or network server). When an input/output operation (such as accessing a file or making an HTTP call) is initiated, the task does not block waiting for the result. Instead, it releases the thread and allows the system to perform other tasks while the external operation is in progress. When the response is ready, the program resumes execution from where it left off. (HTTP requests to external services, reading/writing files, querying databases.)

Task Parallel Library (TPL)¶

The Task Parallel Library (TPL) is a set of public types and APIs in the System.Threading and System.Threading.Tasks namespaces ³. The purpose of the TPL is to make developers more productive by simplifying the process of adding parallelism and concurrency to applications. The TPL dynamically scales the degree of concurrency to use all the available processors most efficiently. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details.

Data parallelism refers to scenarios in which the same operation is performed concurrently on elements in a source collection or array ⁴. In data parallel operations, the source collection is partitioned so that multiple threads can operate on different segments concurrently.

The TPL supports data parallelism through the System.Threading.Tasks.Parallel class. This class provides method-based parallel implementations of for and foreach loops. You write the loop logic for a Parallel.For or a Parallel.ForEach loop much as you would write a sequential loop. You do not have to create threads or queue work items. In basic loops, you do not have to take locks. The TPL handles all the low-level work for you.

// Sequential
foreach (var item in sourceCollection)
{
    Process(item);
}

// Parallel
Parallel.ForEach(sourceCollection, item => Process(item));

When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.

The example demonstrates Parallel.ForEach for CPU-intensive operations ⁵. When you run the example, it randomly generates 2 million numbers and tries to filter to prime numbers. The resulting time taken by each iteration is displayed when the application is finished.

static void Main()
{
    // 2 million
    var limit = 2_000_000;
    var numbers = Enumerable.Range(0, limit).ToList();

    var watch = Stopwatch.StartNew();
    var primeNumbersFromForeach = GetPrimeList(numbers);
    watch.Stop();

    var watchForParallel = Stopwatch.StartNew();
    var primeNumbersFromParallelForeach = GetPrimeListWithParallel(numbers);
    watchForParallel.Stop();

    Console.WriteLine($"Sequential foreach loop | Total prime numbers : {primeNumbersFromForeach.Count} | Time Taken : {watch.ElapsedMilliseconds} ms.");
    Console.WriteLine($"Parallel.ForEach loop   | Total prime numbers : {primeNumbersFromParallelForeach.Count} | Time Taken : {watchForParallel.ElapsedMilliseconds} ms.");
}

private static IList<int> GetPrimeList(IList<int> numbers) => numbers.Where(IsPrime).ToList();

private static IList<int> GetPrimeListWithParallel(IList<int> numbers)
{
    var primeNumbers = new ConcurrentBag<int>();

    Parallel.ForEach(numbers, number =>
    {
        if (IsPrime(number))
        {
            primeNumbers.Add(number);
        }
    });

    return primeNumbers.ToList();
}

private static bool IsPrime(int number)
{
    if (number < 2)
        return false;

    for (var divisor = 2; divisor <= Math.Sqrt(number); divisor++)
    {
        if (number % divisor == 0)
        {
            return false;
        }
    }
    return true;
}

The loop partitions the source collection and schedules the work on multiple threads based on the system environment. The more processors on the system, the faster the parallel method runs. For some source collections, a sequential loop might be faster, depending on the size of the source and the kind of work the loop performs.

Thread-Local Variables. By using thread-local data, you can avoid the overhead of synchronizing a large number of accesses to shared state ⁶. Instead of writing to a shared resource on each iteration, you compute and store the value until all iterations for the task are complete. You can then write the final result once to the shared resource, or pass it to another method.

static void Main()
{
    int[] nums = Enumerable.Range(0, 1_000_000).ToArray();
    long total = 0;

    // Use type parameter to make subtotal a long, not an int
    Parallel.For<long>(0, nums.Length, () => 0,
        (j, loop, subtotal) =>
        {
            subtotal += nums[j];
            return subtotal;
        },
        subtotal => Interlocked.Add(ref total, subtotal));

    Console.WriteLine($"The total is {total:N0}");
    Console.WriteLine("Press any key to exit");
    Console.ReadKey();
}

The first two parameters of every For method specify the beginning and ending iteration values. In this overload of the method, the third parameter is where you initialize your local state. In this context, local state means a variable whose lifetime extends from just before the first iteration of the loop on the current thread, to just after the last iteration. The type argument tells the compiler the type of the temporary variable that will be used to store the thread-local state. In this example, the expression () => 0 initializes the thread-local variable to zero. The fourth parameter defines the loop logic. It must be a delegate or lambda expression whose signature is Func<int, ParallelLoopState, long, long>. The first parameter is the value of the loop counter for that iteration of the loop. The second is a ParallelLoopState object that can be used to break out of the loop; this object is provided by the Parallel class to each occurrence of the loop. The third parameter is the thread-local variable. The last parameter is the return type. In this case, the type is Int64 because that is the type we specified in the For type argument. That variable is named subtotal and is returned by the lambda expression. The return value is used to initialize subtotal on each subsequent iteration of the loop.

Partition-local Variables. When a ForEach loop executes, it divides its source collection into multiple partitions ⁷. Each partition has its own copy of the partition-local variable. A partition-local variable is similar to a thread-local variable, except that multiple partitions can run on a single thread. To use a partition-local variable in a ForEach loop, you must call one of the method overloads that takes two type parameters. The first type parameter, TSource, specifies the type of the source element, and the second type parameter, TLocal, specifies the type of the partition-local variable.

void Main()
{
    int[] nums = Enumerable.Range(0, 1000000).ToArray();
    long total = 0;

    Parallel.ForEach<int, long>(
        nums, () => 0, (j, loop, subtotal) =>
        {
            subtotal += j;
            return subtotal;
        },
        (finalResult) => Interlocked.Add(ref total, finalResult));

    Console.WriteLine($"The total from Parallel.ForEach is {total:N0}");
}

Task-based asynchronous programming¶

The TPL is based on the concept of a task, which represents an asynchronous operation ⁸. In some ways, a task resembles a thread or ThreadPool work item but at a higher level of abstraction. The term task parallelism refers to one or more independent tasks running concurrently. Tasks provide two primary benefits:

More efficient and more scalable use of system resources.
More programmatic control than is possible with a thread or work item.

The Parallel.Invoke method provides a convenient way to run any number of arbitrary statements concurrently. Just pass in an Action delegate for each item of work. The easiest way to create these delegates is to use lambda expressions. The lambda expression can either call a named method or provide the code inline.

Parallel.Invoke(() => DoSomeWork(), () => DoSomeOtherWork());

A task that doesn't return a value is represented by the System.Threading.Tasks.Task class. A task that returns a value is represented by the System.Threading.Tasks.Task<TResult> class, which inherits from Task. The task object handles the infrastructure details and provides methods and properties that are accessible from the calling thread throughout the lifetime of the task. For example, you can access the Status property of a task at any time to determine whether it has started running, ran to completion, was canceled, or has thrown an exception. The status is represented by a TaskStatus enumeration ⁹:

Name	Value	Description
Created	0	The task has been initialized but has not yet been scheduled.
WaitingForActivation	1	The task is waiting to be activated and scheduled internally by the .NET infrastructure.
WaitingToRun	2	The task has been scheduled for execution but has not yet begun executing.
Running	3	The task is running but has not yet completed.
WaitingForChildrenToComplete	4	The task has finished executing and is implicitly waiting for attached child tasks to complete.
RanToCompletion	5	The task completed execution successfully.
Canceled	6	The task acknowledged cancellation by throwing an OperationCanceledException with its own CancellationToken while the token was in signaled state, or the task's `CancellationToken` was already signaled before the task started executing.
Faulted	7	The task completed due to an unhandled exception.

When you create a task, you give it a user delegate that encapsulates the code that the task will execute. The delegate can be expressed as a named delegate, an anonymous method, or a lambda expression. Lambda expressions can contain a call to a named method, as shown in the following example. The example includes a call to the Task.Wait method to ensure that the task completes execution before the console mode application ends.

public static void Main()
{
  Thread.CurrentThread.Name = "Main";

  Task taskA = new Task(() => Console.WriteLine("Hello from taskA."));
  taskA.Start();

  Console.WriteLine($"Hello from thread '{Thread.CurrentThread.Name}'.");
  taskA.Wait();
}

You can also use the Task.Run methods to create and start a task in one operation. To manage the task, the Run methods use the default task scheduler, regardless of which task scheduler is associated with the current thread. The Run methods are the preferred way to create and start tasks when more control over the creation and scheduling of the task isn't needed.

public static void Main()
{
  Thread.CurrentThread.Name = "Main";

  Task taskA = Task.Run(() => Console.WriteLine("Hello from taskA."));

  Console.WriteLine($"Hello from thread '{Thread.CurrentThread.Name}'.");
  taskA.Wait();
}

public static void Main()
{
    Task<Double>[] taskArray = { Task<Double>.Factory.StartNew(() => DoComputation(1.0)),
                                 Task<Double>.Factory.StartNew(() => DoComputation(100.0)),
                                 Task<Double>.Factory.StartNew(() => DoComputation(1000.0)) };

    var results = new Double[taskArray.Length];
    Double sum = 0;

    for (int i = 0; i < taskArray.Length; i++) {
        results[i] = taskArray[i].Result;
        Console.Write("{0:N1} {1}", results[i], i == taskArray.Length - 1 ? "= " : "+ ");
        sum += results[i];
    }
    Console.WriteLine($"{sum:N1}");
}

private static Double DoComputation(Double start)
{
  Double sum = 0;
  for (var value = start; value <= start + 10; value += .1)
      sum += value;

  return sum;
}

The tasks run asynchronously and might complete in any order. If the Result property is accessed before the computation finishes, the property blocks the calling thread until the value is available.

Parallel LINQ (PLINQ)¶

Parallel LINQ (PLINQ) is a parallel implementation of the Language-Integrated Query (LINQ) pattern ¹⁰. PLINQ implements the full set of LINQ standard query operators as extension methods for the System.Linq namespace and has additional operators for parallel operations. PLINQ combines the simplicity and readability of LINQ syntax with the power of parallel programming.

A PLINQ query in many ways resembles a non-parallel LINQ to Objects query. PLINQ queries, just like sequential LINQ queries, operate on any in-memory IEnumerable or IEnumerable<T> data source, and have deferred execution, which means they do not begin executing until the query is enumerated. The primary difference is that PLINQ attempts to make full use of all the processors on the system. It does this by partitioning the data source into segments, and then executing the query on each segment on separate worker threads in parallel on multiple processors. In many cases, parallel execution means that the query runs significantly faster.

Through parallel execution, PLINQ can achieve significant performance improvements over legacy code for certain kinds of queries, often just by adding the AsParallel query operation to the data source. However, parallelism can introduce its own complexities, and not all query operations run faster in PLINQ. In fact, parallelization actually slows down certain queries.

The System.Linq.ParallelEnumerable class exposes almost all of PLINQ's functionality. It and the rest of the System.Linq namespace types are compiled into the System.Core.dll assembly. The default C# projects in Visual Studio reference the assembly and import the namespace.

In addition to the standard query operators, the ParallelEnumerable class contains a set of methods that enable behaviors specific to parallel execution. These PLINQ-specific methods are listed in the following table.

ParallelEnumerable Operator	Description
AsParallel	The entry point for PLINQ. Specifies that the rest of the query should be parallelized, if it is possible.
AsSequential	Specifies that the rest of the query should be run sequentially, as a non-parallel LINQ query.
AsOrdered	Specifies that PLINQ should preserve the ordering of the source sequence for the rest of the query, or until the ordering is changed, for example by the use of an orderby (Order By in Visual Basic) clause.
AsUnordered	Specifies that PLINQ for the rest of the query is not required to preserve the ordering of the source sequence.
WithCancellation	Specifies that PLINQ should periodically monitor the state of the provided cancellation token and cancel execution if it is requested.
WithDegreeOfParallelism	Specifies the maximum number of processors that PLINQ should use to parallelize the query.
WithMergeOptions	Provides a hint about how PLINQ should, if it is possible, merge parallel results back into just one sequence on the consuming thread.
WithExecutionMode	Specifies whether PLINQ should parallelize the query even when the default behavior would be to run it sequentially.
ForAll	A multithreaded enumeration method that, unlike iterating over the results of the query, enables results to be processed in parallel without first merging back to the consumer thread.
Aggregate overload	An overload that is unique to PLINQ and enables intermediate aggregation over thread-local partitions, plus a final aggregation function to combine the results of all partitions.

var source = Enumerable.Range(1, 10000);

var evenNums = from num in source.AsParallel()
               where num % 2 == 0
               select num;
Console.WriteLine($"{evenNums.Count()} even numbers out of {source.Count()} total");

By default, PLINQ is conservative. At runtime, the PLINQ infrastructure analyzes the overall structure of the query. If the query is likely to yield speedups by parallelization, PLINQ partitions the source sequence into tasks that can be run concurrently. If it is not safe to parallelize a query, PLINQ just runs the query sequentially. If PLINQ has a choice between a potentially expensive parallel algorithm or an inexpensive sequential algorithm, it chooses the sequential algorithm by default. You can use the WithExecutionMode method and the System.Linq.ParallelExecutionMode enumeration to instruct PLINQ to select the parallel algorithm. This is useful when you know by testing and measurement that a particular query executes faster in parallel.

Name	Value	Description
Default	0	This is the default setting. PLINQ will examine the query's structure and will only parallelize the query if will likely result in speedup. If the query structure indicates that speedup is not likely to be obtained, then PLINQ will execute the query as an ordinary LINQ to Objects query.
ForceParallelism	1	Parallelize the entire query, even if that means using high-overhead algorithms. Use this flag in cases where you know that parallel execution of the query will result in speedup, but PLINQ in the Default mode would execute it as sequential.

By default, PLINQ uses all of the processors on the host computer. You can instruct PLINQ to use no more than a specified number of processors by using the WithDegreeOfParallelism method. This is useful when you want to make sure that other processes running on the computer receive a certain amount of CPU time.

var query = from item in source.AsParallel().WithDegreeOfParallelism(2)
            where Compute(item) > 42
            select item;

In cases where a query is performing a significant amount of non-compute-bound work such as File I/O, it might be beneficial to specify a degree of parallelism greater than the number of cores on the machine.

In some queries, a query operator must produce results that preserve the ordering of the source sequence. PLINQ provides the AsOrdered operator for this purpose. An AsOrdered sequence is still processed in parallel, but its results are buffered and sorted. Because order preservation typically involves extra work, an AsOrdered sequence might be processed more slowly than the default AsUnordered sequence. Whether a particular ordered parallel operation is faster than a sequential version of the operation depends on many factors.

var evenNums =
    from num in numbers.AsParallel().AsOrdered()
    where num % 2 == 0
    select num;

In sequential LINQ queries, execution is deferred until the query is enumerated either in a foreach loop or by invoking a method such as ToList, ToArray, or ToDictionary. In PLINQ, you can also use foreach to execute the query and iterate through the results. However, foreach itself does not run in parallel, and therefore, it requires that the output from all parallel tasks be merged back into the thread on which the loop is running. In PLINQ, you can use foreach when you must preserve the final ordering of the query results, and also whenever you are processing the results in a serial manner, for example when you are calling Console.WriteLine for each element.

For faster query execution when order preservation is not required and when the processing of the results can itself be parallelized, use the ForAll method to execute a PLINQ query. The following code example shows how to use the ForAll method. System.Collections.Concurrent.ConcurrentBag<T> is used here because it is optimized for multiple threads adding concurrently without attempting to remove any items.

var nums = Enumerable.Range(10, 10000);
var query =
    from num in nums.AsParallel()
    where num % 10 == 0
    select num;

query.ForAll(e => concurrentBag.Add(Compute(e)));