Parallel Reduction

Reduction Using OpenMP

Sahith Reddy
5 min read · Dec 7, 2022

Even after a thorough look at parallel reductions, you might still have some unanswered questions about how OpenMP actually turns your sequential code into parallel code. In particular, you might wonder how OpenMP finds the part of the loop body that performs the reduction. As an illustration, a certain kind of code snippet is frequently seen in code samples.
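A minimal sketch of such a snippet (the article's original code block is not reproduced here, so a simple sum over the loop index stands in for it):

```c
#include <stdio.h>

int main(void) {
    int x = 0;

    // The update inside the body (+=) matches the operator declared
    // in the reduction clause.
    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < 1024; i++) {
        x += i;
    }

    printf("%d\n", x);   // prints 523776 regardless of the number of threads
    return 0;
}
```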

In addition to +, you could also use - as a reduction operator. But how does OpenMP identify the x -= some_value update step? The unsettling answer is that OpenMP does not even look at the update! The compiler simply treats the body of the loop as an opaque chunk of code.
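Conceptually, the transformation looks like the hand-written sketch below (an illustration of the idea, not the compiler's literal output): each thread receives a private copy of x initialized with the neutral element of the declared operator, executes its share of the loop body unchanged on that copy, and the private copies are combined with the declared operator at the end.

```c
#include <stdio.h>

int main(void) {
    const int n = 1024;
    int x = 0;

    // Manual expansion of "#pragma omp parallel for reduction(+:x)":
    #pragma omp parallel
    {
        int x_private = 0;        // neutral element of the declared operator (+)
        #pragma omp for
        for (int i = 0; i < n; i++) {
            x_private += i;       // the body runs unchanged on the private copy
        }
        #pragma omp critical
        x += x_private;           // partial results combined with the operator
    }

    printf("%d\n", x);            // 523776, the same result as the reduction clause
    return 0;
}
```

At no point does the compiler check whether the update inside the body actually matches the operator named in the clause.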

As a result, the mutation of x might just as well be hidden behind an opaque function call. From a compiler developer's perspective, this choice is understandable. Unfortunately, it also means that it is your responsibility to ensure that every change to x is compatible with the operator specified in the reduction clause.
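For example, the following sketch (add_sample is a made-up helper used only for illustration) hides the update behind a function call and still produces the expected result, precisely because OpenMP merely privatizes x and never inspects the body:

```c
#include <stdio.h>

// The update of x is hidden inside a function; OpenMP never sees the +=.
void add_sample(double *acc, double value) {
    *acc += value;               // consistent with the declared + operator
}

int main(void) {
    double x = 0.0;

    #pragma omp parallel for reduction(+:x)
    for (int i = 0; i < 1000; i++) {
        add_sample(&x, 0.5);     // inside the region, &x refers to the private copy
    }

    printf("%.1f\n", x);         // 500.0; an update that did not match + (say *=)
                                 // would silently produce a wrong result
    return 0;
}
```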

Let's test that behavior with an obviously pointless but instructive OpenMP fragment.
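The fragment itself is not reproduced here, but from the analysis below it must look roughly like this (a reconstruction: an unsigned x initialized to 5, ten iterations, a min reduction, and a team of two threads):

```c
#include <stdio.h>

int main(void) {
    unsigned int x = 5;

    // Phases 1 and 2: each of the two threads gets a private copy of x
    // initialized to UINT_MAX, the neutral element of min for unsigned int.
    #pragma omp parallel for num_threads(2) reduction(min:x)
    for (int i = 0; i < 10; i++) {
        x++;                 // Phase 3: each private copy is incremented five times
    }

    // Phase 4: x = min(5, 4, 4)
    printf("%u\n", x);       // prints 4
    return 0;
}
```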

What do you predict the final value of x to be? Let's walk through the phases one at a time. The original value of x plays no role until Phase 4, where it is combined with the partial results. Since the OpenMP directive does not specify a schedule explicitly, we assume the default (static) schedule, so the ten iterations are split evenly between the team of two threads. Each thread declares a private variable initialized with the largest unsigned integer, the neutral element of min. Each private copy is then incremented five times, which results in 4 after the overflow. The final value of x is the minimum of the partial results (4 and 4) and the original value (5). As a result, the program prints 4.

Let's take a closer look at why the reduction operation should be commutative. As stated before, we cannot assume that the iterations are processed in any particular order, owing to the unpredictable scheduling. Moreover, the final combination of the partial results in Phase 4 does not have to be carried out in a fixed order either.

Memory Management Using CUDA

Parallel reduction is a popular method for solving this kind of problem; a min operation is only one of the operations it can be used for. It works by employing half as many threads as there are elements in the dataset. Each thread computes the minimum of its own element and one (or more) other elements, and the result is carried into the next round. The number of threads is then halved and the procedure repeated until only a single element, the result of the operation, remains.

When using CUDA, keep in mind that the warp is the unit of execution on an SM (streaming multiprocessor). Any thread count below one warp therefore underutilizes the hardware. Also remember that a warp in which no thread takes a branch does not have to execute that branch at all, but threads that diverge within a warp force both paths to be executed.

Depending on how you choose the "other element" a given thread works on, you can end up with considerable branch divergence within a warp. Since every divergent branch doubles the work for the SM, this hurts performance. A better strategy is to pick the partner element from the other half of the dataset, so that entire warps drop out of the computation rather than individual threads within a warp; the idea is sketched below.
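Here is a minimal sketch of that strategy (illustrative code, not the article's original listing): a block-level min reduction in shared memory in which every active thread compares its element with one from the opposite half of the remaining range.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block-level min reduction: in every round each active thread compares its
// element with one from the opposite half of the remaining range, so the
// active threads stay contiguous and whole warps drop out of the computation.
__global__ void block_min(const unsigned int *in, unsigned int *out)
{
    extern __shared__ unsigned int sdata[];
    const unsigned int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * blockDim.x + tid];  // stage one element per thread
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sdata[tid + s] < sdata[tid])  // partner from the other half
            sdata[tid] = sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                                    // thread zero holds the minimum
        out[blockIdx.x] = sdata[0];
}

int main(void)
{
    const unsigned int n = 256;                      // one block, power-of-two size
    unsigned int h_in[256], h_out;
    for (unsigned int i = 0; i < n; i++) h_in[i] = 1000u - i;

    unsigned int *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_out, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    block_min<<<1, n, n * sizeof(unsigned int)>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("min = %u\n", h_out);                     // prints min = 745

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because the active threads always occupy the lowest thread IDs, complete warps become idle after each round and can simply be skipped by the scheduler.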

In Figure A, each element is compared with one from the opposite half of the dataset. The shaded cells show the active threads.

Figure A: Final stages of parallel reduction

In each cycle, the function places one value from each of the num_lists lists into a temporary array that it creates in shared memory. If a list has already been emptied, its slot is filled with 0xFFFFFFFF, which effectively removes it from consideration. The while loop then steadily halves the number of active threads until only one thread, thread zero, remains. Thread zero copies out the resulting value and increments the corresponding list index so that the same value is not processed twice.
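The function itself is not shown in this article; the sketch below (the name merge_min_reduction, the MAX_LISTS limit, and the interleaved list layout are assumptions, and host setup is omitted) implements the behavior just described.

```cuda
// Assumed layout: num_lists sorted sub-lists stored interleaved, so element j
// of list tid sits at src[tid + j * num_lists]. Assumed launch: a single block
// of num_lists threads, with num_lists a power of two no larger than MAX_LISTS.
#define MAX_LISTS 256u

__global__ void merge_min_reduction(const unsigned int *src,
                                    unsigned int *dest,
                                    const unsigned int num_lists,
                                    const unsigned int num_elements)
{
    __shared__ unsigned int list_index[MAX_LISTS];  // how far each list is consumed
    __shared__ unsigned int red_val[MAX_LISTS];     // temporary values being reduced
    __shared__ unsigned int red_idx[MAX_LISTS];     // which list each value came from

    const unsigned int tid = threadIdx.x;
    const unsigned int elems_per_list = num_elements / num_lists;

    list_index[tid] = 0;
    __syncthreads();

    for (unsigned int i = 0; i < num_elements; i++) {
        // Each thread offers the head of its list, or 0xFFFFFFFF once the list
        // has been emptied so that it can never win the min reduction.
        red_val[tid] = (list_index[tid] < elems_per_list)
                           ? src[tid + list_index[tid] * num_lists]
                           : 0xFFFFFFFF;
        red_idx[tid] = tid;
        __syncthreads();

        // Halve the number of active threads until only thread zero remains.
        unsigned int tid_max = num_lists >> 1;
        while (tid_max != 0) {
            if (tid < tid_max && red_val[tid + tid_max] < red_val[tid]) {
                red_val[tid] = red_val[tid + tid_max];
                red_idx[tid] = red_idx[tid + tid_max];
            }
            tid_max >>= 1;
            __syncthreads();
        }

        // Thread zero copies out the winner and advances that list's index so
        // the same value is not processed twice.
        if (tid == 0) {
            dest[i] = red_val[0];
            list_index[red_idx[0]]++;
        }
        __syncthreads();
    }
}
```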

Note how __syncthreads is used both inside the loop and at the end. When more than 32 threads (one warp) are in use, the kernel needs to synchronize between warps.

Figure B shows how much slower this method is than the atomicMin version: the fastest reduction takes 8.4 ms, compared with 5.86 ms for the atomicMin version, making it roughly 43% slower.

Figure B: Parallel reduction graph

This version has the advantage of not requiring the atomicMin function, at the cost of being slower. That atomicMin is only supported on devices of compute capability 1.2 or higher is usually only a concern if you need to support the consumer market or very old Tesla systems. The fundamental drawback, however, is that atomicMin works only on integer values. Many real-world problems require floating-point computations, and in such cases we need both algorithms.

As both the atomicMin and the parallel reduction approach show, the classic two-list merge sort is not the ideal case for a GPU. Increasing the number of lists makes the radix-sort phase more parallel, which improves its speed; once you go beyond 16 lists, however, the performance of the merge step starts to decline.
