Executive Summary
The transition from serial to parallel computing, necessitated by the physical limitations of frequency scaling, has established the Graphics Processing Unit (GPU) as the premier engine for high-throughput computational workloads. Within this paradigm, the NVIDIA Compute Unified Device Architecture (CUDA) provides the interface through which developers orchestrate massive parallelism. However, the theoretical peak performance of GPU hardware is rarely achieved through naive translation of serial algorithms. Instead, it requires the fundamental reimagining of computational primitives—Reduction, Scan, Histograms, Sorting, and Matrix Operations—to align with the underlying hardware characteristics: the Single Instruction Multiple Thread (SIMT) execution model, the hierarchical memory structure, and the constraints of memory bandwidth and latency.
This report delivers an exhaustive technical analysis of these parallel primitives. It synthesizes decades of algorithmic evolution, from the early shared-memory reduction trees of the Tesla architecture to the asynchronous, hardware-accelerated Warpgroup Matrix Multiply-Accumulate (WGMMA) operations of the Hopper H100 architecture. The analysis explores the tension between algorithmic work complexity and hardware execution efficiency, detailing how “work-inefficient” algorithms like Hillis-Steele scans can outperform “work-efficient” counterparts due to architectural factors like warp divergence and bank conflicts. Furthermore, it examines state-of-the-art developments such as the Decoupled Look-back strategy for single-pass global scans and the Onesweep radix sort algorithm, which redefine the performance limits of data-parallel operations. By dissecting these algorithms alongside the evolution of GPU microarchitectures—from Kepler’s shuffle instructions to Maxwell’s atomic improvements and Hopper’s Tensor Memory Accelerator—this document serves as a definitive reference for high-performance CUDA algorithm design.
1. Architectural Foundations of Parallel Algorithms
To comprehend the optimization of parallel algorithms in CUDA, one must first dissect the execution environment. The GPU is not merely a collection of cores but a throughput-oriented machine designed to hide memory latency through massive multithreading. The efficacy of any parallel algorithm is determined by how well it exploits the specific features of the Streaming Multiprocessor (SM) and the memory hierarchy.
1.1 The SIMT Execution Model and Warp Dynamics
The fundamental unit of execution on an NVIDIA GPU is the warp, comprising 32 threads that execute instructions in lockstep.1 This Single Instruction, Multiple Threads (SIMT) model is critical for algorithmic design. While the programming model exposes individual threads with unique IDs (threadIdx), the hardware executes them in groups.
- Instruction Cohesion: When all threads in a warp execute the same instruction, the hardware operates at peak efficiency.
- Warp Divergence: Algorithmic logic that introduces conditional branching (e.g., if-else blocks) dependent on thread data can cause divergence. If threads within a warp take different paths, the hardware serializes the execution paths, disabling threads not on the current path. For algorithms like parallel reduction, avoiding divergence is a primary optimization target.2
- Active Masking: Modern architectures (Volta and later) utilize Independent Thread Scheduling, allowing for more flexible divergence handling, but the performance penalty of serialization remains. Algorithms must be designed to ensure that conditional branches align with warp boundaries whenever possible.
1.2 Memory Hierarchy and Access Patterns
The discrepancy between compute throughput and memory bandwidth is the primary bottleneck in GPU computing. Algorithms are classified as compute-bound or memory-bound based on their arithmetic intensity.
- Global Memory: The largest but slowest memory space. Achieving high bandwidth requires memory coalescing, where simultaneous memory accesses by threads in a warp can be combined into a single transaction. For example, if threads 0 through 31 access consecutive floating-point values in memory, the memory controller services this as a single 128-byte transaction. Strided or random access patterns break coalescing, resulting in multiple transactions for the same amount of data, drastically reducing effective bandwidth.3
- Shared Memory: A high-speed, user-managed scratchpad memory located on the SM. It is approximately 100x faster than global memory and serves as the primary mechanism for inter-thread communication within a block.5 However, shared memory is divided into 32 banks. If multiple threads in a warp access different addresses that map to the same bank, the accesses are serialized, causing a bank conflict. Optimizing parallel algorithms often involves padding data structures or altering indexing schemes to distribute accesses across banks evenly.2
- Registers: The fastest memory, private to each thread. Register usage is a critical resource; excessive use limits occupancy, the number of warps that can be active on an SM. High occupancy is generally desired to hide the latency of global memory accesses, although recent trends in matrix multiplication favor instruction-level parallelism (ILP) over pure occupancy.7
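To make the coalescing point above concrete, the following sketch contrasts a coalesced copy with a strided one; the kernel names are illustrative.
C++
// Coalesced: the 32 threads of a warp touch 32 consecutive floats, which the
// memory system can service as a single wide transaction.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                  // thread k reads element k
}

// Strided: adjacent threads are 'stride' elements apart, so a warp's request
// splinters into many transactions and effective bandwidth collapses.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}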
1.3 Latency Hiding and Occupancy
The GPU hides the latency of memory operations by switching context to other runnable warps. If Warp A stalls waiting for data from global memory, the SM switches to Warp B.
- Occupancy Metrics: Occupancy is the ratio of active warps to the maximum supported warps per SM. It is determined by register usage and shared memory allocation. For instance, if a kernel uses too many registers, the SM cannot schedule the maximum number of blocks, potentially leaving compute resources idle during memory stalls.7
- The ILP Trade-off: While high occupancy is the traditional goal, sophisticated algorithms (like matrix multiplication on Hopper) may intentionally limit occupancy to maximize the number of registers available per thread, relying on asynchronous copy instructions and instruction-level parallelism to hide latency.8
2. Parallel Reduction: Algorithmic Evolution and Optimization
Parallel reduction is the process of combining an array of values into a single result (e.g., summation, maximum, logical AND) using a binary associative operator. It is a canonical example of how a simple $O(N)$ sequential algorithm must be transformed into a tree-based parallel structure to exploit the GPU.
2.1 Theoretical Framework and the Reduction Tree
A sequential reduction on a CPU iterates through the array, accumulating values. This approach has a time complexity of $O(N)$. In a parallel context, the operation is visualized as a binary tree.
- Step Complexity: $O(\log N)$.
- Work Complexity: $O(N)$.
In the first step, $N/2$ threads perform operations. In the next, $N/4$, and so on. This structure implies that parallelism decreases geometrically as the algorithm proceeds, leading to the “tail effect” where hardware resources become underutilized in the final stages of the reduction.2
2.2 The “7-Step” Optimization Strategy
The evolution of reduction kernels in CUDA literature is often described through a series of optimizations that systematically address hardware bottlenecks.
2.2.1 Interleaved Addressing (The Naive Approach)
The most basic parallel implementation assigns threads to elements based on a stride that doubles at each iteration.
- Implementation: Thread tid sums data[tid] and data[tid + stride].
- Problem: This approach uses a modulo operator (if (tid % (2*stride) == 0)) to determine active threads. This creates high warp divergence: in the second iteration, only even-numbered threads are active, but the entire warp must execute the instruction, with odd threads disabled. Furthermore, the memory access pattern is strided, which is inefficient for global memory (though less critical for shared memory).2
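A minimal sketch of this naive kernel, assuming shared memory for blockDim.x floats is provided at launch; the kernel name is illustrative.
C++
// Launch: reduce_interleaved_naive<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);
__global__ void reduce_interleaved_naive(const float* in, float* out, int n) {
    extern __shared__ float sdata[];            // blockDim.x floats
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // The modulo test leaves a scattered subset of each warp active, so every
    // warp diverges at every level of the reduction tree.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}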
2.2.2 Interleaved Addressing (Divergence Free)
To resolve divergence, the indexing is restructured.
- Optimization: Instead of disabling threads based on ID, the data indices are compacted. Thread tid operates on index tid * 2 * stride.
- Result: Only the first blockDim.x / (2*stride) threads are active. Since threads are grouped into warps by ID, this ensures that early warps are fully active and later warps are fully inactive. The hardware can skip the execution of inactive warps entirely.2
- New Problem: This introduces shared memory bank conflicts. As the stride increases (1, 2, 4…), the distance between memory addresses accessed by adjacent threads (e.g., thread 0 and thread 1) aligns with the bank width, causing multiple threads to hit the same bank.2
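Only the inner loop of the kernel sketched above changes; a sketch of the divergence-free variant:
C++
// Compacted indexing: thread tid works on index 2 * stride * tid, so whole
// warps are either fully active or fully idle.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    unsigned int index = 2 * stride * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + stride];  // strided shared access: bank conflicts
    __syncthreads();
}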
2.2.3 Sequential Addressing
To solve bank conflicts, the algorithm switches to sequential addressing.
- Implementation: In a block of size 512, step 1 has stride 256. Thread 0 adds elements 0 and 256; thread 1 adds elements 1 and 257.
- Benefit: Adjacent threads access adjacent memory locations. Since shared memory banks are interleaved (word 0 in bank 0, word 1 in bank 1), this sequential access pattern maps perfectly to different banks, eliminating conflicts.2
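Again only the loop changes; a sketch of the sequential-addressing variant:
C++
// Adjacent threads touch adjacent shared-memory words, which fall into
// distinct banks, so every step is conflict-free.
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride)
        sdata[tid] += sdata[tid + stride];
    __syncthreads();
}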
2.2.4 First Add During Load
The efficiency of the reduction can be doubled by performing the first level of the reduction tree during the load from global to shared memory.
- Mechanism: Each thread loads two elements from global memory, adds them, and stores the result in shared memory.
- Impact: This reduces the required shared memory size by half and ensures that all threads are active during the most expensive part of the operation (the global load).2
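Only the load changes relative to the kernel above; a sketch assuming the input length is a multiple of 2 * blockDim.x:
C++
// Each block now covers 2 * blockDim.x elements; the first level of the
// reduction tree happens while the data is in flight from global memory.
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = in[i] + in[i + blockDim.x];
__syncthreads();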
2.2.5 Unrolling the Last Warp
When the number of active elements drops to 32 (or fewer), only a single warp is active.
- Optimization: Loop control overhead (incrementing counters, checking conditions) becomes significant relative to the single addition instruction. The loop can be unrolled.
- Synchronization: On pre-Volta architectures, threads within a warp executed in lockstep, so these final steps could omit __syncthreads() (using a volatile shared-memory pointer to prevent register caching). On Volta and later, warp-synchronous execution can no longer be assumed, so __syncwarp() is required for correctness, although warp shuffle functions (discussed below) are the modern replacement.9
2.3 Warp Shuffle Instructions: The Modern Standard
The introduction of the Kepler architecture (Compute Capability 3.0) fundamentally changed reduction optimization with Warp Shuffle instructions (__shfl_down_sync, __shfl_xor_sync).
- Mechanism: Shuffle instructions allow threads within a warp to read values directly from each other’s registers. This bypasses shared memory entirely.9
- Implementation: A warp-level reduction can be performed without any shared memory access:
C++
for (int offset = warpSize/2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
- Performance: This reduces latency (registers are faster than shared memory) and reduces instruction count (no load/store instructions). Benchmarks indicate shuffle-based reductions are 20-25% faster than shared-memory variants.10 The use of __shfl_xor_sync allows for a “butterfly” reduction pattern, which can be efficient for all-reduce operations where every thread needs the final result.9
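Building on this loop, a complete block-level reduction can be assembled from warp shuffles plus a small shared-memory staging area for the per-warp partials. The helper names below (warp_reduce_sum, block_reduce_sum) are illustrative, and the sketch assumes blockDim.x is a multiple of 32.
C++
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    return val;                                 // valid in lane 0 of the warp
}

__inline__ __device__ float block_reduce_sum(float val) {
    __shared__ float shared[32];                // one partial sum per warp
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);                 // reduce within each warp
    if (lane == 0) shared[wid] = val;           // warp leaders publish partials
    __syncthreads();

    // Warp 0 reduces the per-warp partials.
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0.0f;
    if (wid == 0) val = warp_reduce_sum(val);
    return val;                                 // valid in thread 0 of the block
}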
2.4 Device-Wide Reduction and Synchronization
A single CUDA kernel launch is limited to the reduction of elements within a thread block. Reducing a massive array (e.g., millions of elements) requires inter-block communication.
- Multi-Pass Approach: The standard method involves multiple kernel launches.
- Kernel 1: Each block reduces a tile and writes the partial sum to global memory.
- Kernel 2: Recursively reduces the partial sums until one value remains.
This incurs the overhead of kernel launch and global memory traffic for partial sums.9
- Atomic Accumulation: A single kernel can reduce blocks and then use atomicAdd to update a global counter. This is faster but introduces contention if not managed carefully.
- Thread Block Retirement (Last Block Optimization): A sophisticated technique involves using a global atomic counter to track the number of finished blocks. The last block to finish (determined by the counter reaching gridDim.x - 1) performs the final reduction of the partial sums generated by all other blocks. This allows for a single-pass global reduction, keeping the data on-chip (L2 cache).9
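A sketch of the last-block pattern, reusing the block_reduce_sum helper sketched above; the symbols g_partials, g_retire_count, and the kernel name are illustrative.
C++
__device__ unsigned int g_retire_count = 0;     // finished-block counter

__global__ void reduce_single_pass(const float* in, float* g_partials,
                                   float* result, int n) {
    // Grid-stride accumulation followed by a block-level reduction.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];
    sum = block_reduce_sum(sum);

    __shared__ bool is_last;
    if (threadIdx.x == 0) {
        g_partials[blockIdx.x] = sum;
        __threadfence();                        // make the partial visible grid-wide
        unsigned int ticket = atomicInc(&g_retire_count, gridDim.x);
        is_last = (ticket == gridDim.x - 1);    // true only for the final block
    }
    __syncthreads();

    if (is_last) {                              // the last block reduces all partials
        float total = 0.0f;
        for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            total += g_partials[i];
        total = block_reduce_sum(total);
        if (threadIdx.x == 0) {
            *result = total;
            g_retire_count = 0;                 // reset for the next launch
        }
    }
}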
3. Parallel Scan (Prefix Sum): Algorithms and Complexity
The scan operation transforms an array $[a_0, a_1, \dots, a_n]$ into $[a_0, (a_0 \oplus a_1), \dots, (a_0 \oplus \dots \oplus a_n)]$. It is a cornerstone primitive for sorting, stream compaction, and lexical analysis.11 Unlike reduction, scan produces an output for every input, creating a complex dependency chain.
3.1 Step-Efficient vs. Work-Efficient Algorithms
The challenge in parallel scan is balancing the number of parallel steps (latency) against the total number of operations (work).
3.1.1 Hillis-Steele Algorithm (Step-Efficient)
Proposed by Hillis and Steele (1986), this algorithm optimizes for the depth of the dependency graph.
- Mechanism: In step $d$ (where $d$ goes $1, 2, 4, \dots$), every thread $k \ge d$ updates its value: $x[k] = x[k] + x[k-d]$; threads with $k < d$ keep their value unchanged.
- Complexity:
- Steps: $O(\log N)$.
- Work: $O(N \log N)$.
- Analysis: While it has a shallow depth, the work complexity is asymptotically worse than the sequential $O(N)$ algorithm. On a GPU, where throughput (total operations per second) is the bottleneck, the extra factor of $\log N$ operations consumes valuable bandwidth and power. However, for small arrays (e.g., within a warp), its simple indexing and lack of bank conflicts (with proper padding) make it efficient.11
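A minimal single-block sketch of the inclusive Hillis-Steele scan, double-buffered in shared memory; it assumes n equals blockDim.x and that 2 * n ints of shared memory are provided at launch.
C++
// Launch: scan_hillis_steele<<<1, n, 2 * n * sizeof(int)>>>(d_in, d_out, n);
__global__ void scan_hillis_steele(const int* in, int* out, int n) {
    extern __shared__ int temp[];               // 2 * n ints: a double buffer
    int tid  = threadIdx.x;
    int pout = 0, pin = 1;

    temp[tid] = in[tid];                        // buffer 0 holds the input
    __syncthreads();

    for (int d = 1; d < n; d <<= 1) {
        pout = 1 - pout;                        // swap the roles of the buffers
        pin  = 1 - pout;
        if (tid >= d)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - d];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    out[tid] = temp[pout * n + tid];
}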
3.1.2 Blelloch Algorithm (Work-Efficient)
The Blelloch algorithm (1990) achieves $O(N)$ work complexity, matching the sequential algorithm. It uses a tree-based approach similar to reduction.11
- Phase 1: Up-Sweep (Reduction): A binary tree is built to compute partial sums. This takes $N-1$ additions. The root contains the total sum.
- Phase 2: Down-Sweep: The tree is traversed from root to leaves. The root is zeroed (for exclusive scan). At each node, the value is passed to the left child, and the sum of the value and the left child is passed to the right child. This takes roughly $N$ operations.
- Total Work: roughly $2N$ additions/swaps.
- Bank Conflicts: The Blelloch algorithm involves strides that grow exponentially ($2, 4, 8 \dots$). This causes severe shared memory bank conflicts.
- Solution: Bank conflict avoidance is achieved by padding the shared memory array. By inserting one padding word for every $M$ data words (where $M$ is the number of banks), the effective indices are skewed: with 16 banks, for example, indices 0, 16, and 32, which would all fall into bank 0, are shifted to 0, 17, and 34 and land in banks 0, 1, and 2.12
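A compact single-block sketch of the two phases (exclusive scan), shown without the padding described above for brevity; it assumes n is a power of two equal to 2 * blockDim.x and that n ints of shared memory are provided at launch.
C++
// Launch: scan_blelloch<<<1, n / 2, n * sizeof(int)>>>(d_in, d_out, n);
__global__ void scan_blelloch(const int* in, int* out, int n) {
    extern __shared__ int temp[];               // n ints
    int tid = threadIdx.x;
    temp[2 * tid]     = in[2 * tid];            // each thread loads two elements
    temp[2 * tid + 1] = in[2 * tid + 1];

    // Up-sweep: build partial sums in place; the root ends up with the total.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Down-sweep: clear the root, then distribute prefixes to the leaves.
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t    = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}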
3.2 Large-Scale Scans and Decoupled Look-back
Scanning arrays that exceed the size of a single thread block requires global coordination.
- Traditional Scan-then-Propagate:
- Block-level scan of data tiles.
- Write block aggregates (sums) to a separate array.
- Scan the array of block aggregates.
- Add the scanned block aggregates to the block-level results.
This requires three passes through the data (Read-Write, Read-Write, Read-Write), generating roughly $3N$ words of global memory traffic.11
3.2.1 Decoupled Look-back (CUB Library)
The CUB library introduced the Decoupled Look-back algorithm, which enables a single-pass global scan.15
- Concept: Instead of a strict multi-pass dependency, blocks process tiles dynamically. When a block finishes its local scan, it needs the prefix sum from the preceding blocks.
- State Management: Each block publishes its state in a global status array. A block can be in one of three states:
- X (Not Ready): The block has not processed its data.
- A (Aggregate Available): The block has computed its total sum, but not its global prefix (because it doesn’t know the prefix of its predecessor).
- P (Prefix Available): The block has computed its full global prefix.
- Look-back Mechanism: When Block $i$ finishes its local tasks, it inspects Block $i-1$.
- If Block $i-1$ is P, Block $i$ adds that value to its local aggregate to get its own prefix, transitions to P, and finishes.
- If Block $i-1$ is A, Block $i$ accumulates that aggregate and moves to Block $i-2$ (looking back further).
- Performance: This allows the scan to proceed at practically the speed of memory copy (memcpy). The data movement is reduced to roughly $2N$ (Read Input, Write Output), with negligible overhead for the look-back logic in high-bandwidth L2 cache.17 This innovation makes CUB’s DeviceScan significantly faster than older implementations like Thrust’s original multi-pass backend (though Thrust now uses CUB).18
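In practice the algorithm is consumed through cub::DeviceScan. A typical call site, sketched with error handling elided, uses the library's two-phase convention in which the first call only queries the required temporary-storage size:
C++
#include <cub/cub.cuh>

void exclusive_sum(const int* d_in, int* d_out, int num_items) {
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    // First call: query the size of the scratch buffer.
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: the single-pass decoupled look-back scan.
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}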
4. High-Performance Histogram Computation
Histogram computation (binning) is a scatter operation where the output location is data-dependent ($Bin = f(Data)$). This introduces the challenge of output interference or write collisions.
4.1 Contention and Atomicity
The correctness of a parallel histogram is ensured by atomic operations (atomicAdd).
- Global Atomics: A naive kernel where every thread atomically updates a global histogram is performance-limited by the serialization of conflicting writes. If the input image is uniform (low entropy), thousands of threads may attempt to write to global_hist simultaneously, effectively serializing the machine.19
- Entropy Analysis:
- High Entropy (White Noise): Data is uniformly distributed across bins. Collisions are rare. Global atomics (especially on L2-cached architectures like Kepler+) perform reasonably well.
- Low Entropy (Solid Images): Collisions are frequent. Performance degrades catastrophically.5
4.2 Privatization Strategies
To reduce contention, the histogram is “privatized” into shared memory.20
4.2.1 Per-Block Histogram
Each thread block allocates a private histogram in shared memory. Threads atomically update this local copy. Once the block finishes, the local histogram is atomically added to the global histogram.
- Reduction in Contention: Contention is limited to the threads in one block (e.g., 256 or 1024), rather than the entire grid.
- Hardware Support: The Maxwell architecture introduced native hardware support for shared memory atomics. Before Maxwell (e.g., Kepler), shared atomics were emulated in software (lock-update-unlock), making them slower. On Maxwell and later, shared memory atomics are extremely fast, making this the default strategy.19
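A sketch of the per-block privatized histogram for 256 bins of 8-bit data, using a grid-stride accumulation loop; names are illustrative.
C++
#define NUM_BINS 256

__global__ void histogram_privatized(const unsigned char* data, int n,
                                     unsigned int* global_hist) {
    __shared__ unsigned int local_hist[NUM_BINS];
    // Cooperatively zero the block-private histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Accumulate: contention is confined to one block's shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Merge the private histogram into the global one.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}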
4.2.2 Replication (R-per-Block)
To further dilute contention within a block, the local histogram can be replicated.
- Mechanism: A block maintains $R$ copies of the histogram in shared memory. Thread $tid$ updates copy $tid \% R$.
- Benefit: This reduces the probability of a collision by a factor of $R$.
- Cost: Shared memory capacity is limited (e.g., 48KB or 64KB per SM). Replicating a 256-bin histogram (1KB) is easy. Replicating a 4096-bin histogram (16KB) restricts the number of active blocks (occupancy).
- Padding: To avoid bank conflicts when accessing replicated histograms, padding must be inserted between the copies so that bin $i$ of copy 0 and bin $i$ of copy 1 fall into different memory banks.21
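The replication and padding amount to an indexing change inside such a kernel; a fragment, with R and the padding width chosen purely for illustration:
C++
const int R   = 4;                              // replication factor
const int PAD = 1;                              // one extra word per copy
__shared__ unsigned int local_hist[R * (NUM_BINS + PAD)];
// With these constants, copy c is offset by c banks, so bin i of different
// copies falls into different banks.
unsigned int* my_copy = &local_hist[(threadIdx.x % R) * (NUM_BINS + PAD)];
// ... accumulation phase:
atomicAdd(&my_copy[bin], 1u);
// ... before the global merge, the R copies of each bin are summed together.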
4.2.3 Per-Thread Histograms (The Extreme Case)
In this approach, each thread maintains its own tiny histogram in registers or local memory. This completely eliminates atomics during the accumulation phase.
- Limitation: Register pressure. A thread cannot keep a 256-bin histogram in registers. This is only viable for very small histograms (e.g., genetic sequencing with 4 bins).21
5. Sorting Algorithms on GPU
Sorting is a complex primitive involving both comparison and data movement. The evolution of GPU sorting has shifted from comparison-based networks (Bitonic) to bandwidth-optimized radix sorts.
5.1 Comparison-Based Sorts: Bitonic and Merge
Early GPU implementations utilized Bitonic Sort because of its oblivious sorting network property—the sequence of comparisons is fixed and data-independent, which maps well to the SIMT model.
- Complexity: $O(N (\log N)^2)$. This high computational complexity makes it inefficient for large arrays compared to $O(N \log N)$ merge sorts or $O(N)$ radix sorts.22
- Merge Sort: Modern comparison-based sorts (like those in ModernGPU) use Merge Sort. They partition data, sort tiles using register-level bitonic networks, and then merge the sorted tiles. Optimized implementations use vectorized loads (loading 4 ints as an int4) to maximize memory bandwidth.22
5.2 Radix Sort: The Bandwidth Champion
Radix Sort is the standard for high-performance sorting of primitive types (keys). It relies on bit-wise bucketization rather than comparison.24
5.2.1 Algorithm Structure (LSD)
Least Significant Digit (LSD) radix sort processes keys from the lowest bits to the highest (e.g., 4 bits at a time). It is stable, meaning the relative order of equal keys is preserved, which is required for the logic to hold across iterations.
A single pass (e.g., for one 4-bit digit) involves three kernels:
- Histogram (Counting): Each block counts the frequency of each digit (0-15).
- Scan (Prefix Sum): A global scan of the counts determines the starting offset for each digit bucket in the output array.
- Scatter (Reorder): Blocks read the keys again and write them to the computed offsets.
5.2.2 Onesweep: The Single-Pass Revolution
Recent research (2022) has introduced Onesweep, a radix sort that fuses the histogram and scatter steps using a decoupled look-back strategy similar to CUB Scan.25
- Innovation: Traditionally, the scatter step required a global prefix sum of the counts (step 2 above) to know where to write. Onesweep allows blocks to calculate their write offsets dynamically by “looking back” at the counters of previous blocks during the scatter phase itself.
- Impact: This eliminates the separate scan pass and the second read of the data required by traditional implementations. It reduces global memory traffic from roughly $3N$ per digit pass to roughly $2N$.
- Performance: Onesweep outperforms CUB’s radix sort by approximately 1.5x on A100 GPUs, sorting 256 million keys at 29.4 GKey/s.25
5.3 Library Ecosystem: Thrust vs. CUB
- Thrust: A high-level C++ template library resembling the STL (thrust::sort). Historically it used a slower backend, but modern versions often dispatch to CUB. However, Thrust adds overhead, such as allocating and freeing temporary buffers (cudaMalloc/cudaFree), which can dominate execution time for small, repeated sorts.26
- CUB (CUDA Unbound): A lower-level library providing block-level and device-level primitives. It exposes the DeviceRadixSort API which allows users to manage temporary memory explicitly, avoiding allocation overheads. CUB is the industry standard for performance-critical sorting.18
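A typical cub::DeviceRadixSort call site, sketched with the same explicit two-phase temporary-storage convention (error handling elided) so the scratch buffer can be reused across repeated sorts:
C++
#include <cub/cub.cuh>

void sort_keys(const unsigned int* d_keys_in, unsigned int* d_keys_out,
               int num_items) {
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    // First call: query the size of the scratch buffer.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: the sort itself.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}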
6. Matrix Operations and Tiling Strategies
General Matrix Multiplication (GEMM), defined as $C = \alpha (A \times B) + \beta C$, is the workload powering the modern AI revolution. Its optimization history tracks the evolution of GPU architecture from scalar instructions to Tensor Cores.
6.1 Tiling and Shared Memory (The Software Era)
A naive implementation of matrix multiplication is memory-bound. Each thread reads a row of A and a column of B.
- Global Memory Access: For $N \times N$ matrices, the naive approach performs $2N^3$ loads.
- Tiling: The loop is structured to load a small block (tile) of A and B into shared memory. Threads synchronize, compute the partial product using the shared tile, and then move to the next tile.3
- Coalescing: Loading tiles requires coalesced access. For row-major matrices, loading a row of B is coalesced, but a column of A is strided. This necessitates loading A as a row-tile and potentially transposing it in shared memory or designing the dot product to accommodate the layout.4
- Bank Conflicts: When threads access the shared memory tile to compute the dot product, strided access (e.g., accessing a column of the tile) can cause 32-way bank conflicts. Padding the shared memory array (e.g., declaring the tile as __shared__ float tile[TILE][TILE + 1]) skews the columns across banks, resolving the conflict; a sketch follows this list.28
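A minimal tiled SGEMM sketch along these lines, for row-major matrices whose dimensions are assumed to be multiples of the tile size; the +1 padding column shows the idiom described above (it matters most when a tile is traversed column-wise).
C++
#define TILE 32

// C = A * B, with A of size M x K, B of size K x N, C of size M x N.
// Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, M / TILE);
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE + 1];        // padded to skew columns across banks

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)          // dot product over the shared tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}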
6.2 The Tensor Core Era (Volta/Ampere)
The Volta architecture introduced Tensor Cores, specialized hardware units capable of performing a 4×4 matrix multiply-accumulate in a single cycle.29
- mma.sync: This warp-level instruction requires threads to hold data in opaque register “fragments”.
- Complexity: The programmer is responsible for loading data from global to shared, and then from shared to registers (fragments) with specific layouts. This introduced a complex software pipeline to hide global memory latency, often requiring multi-stage buffering in shared memory.30
6.3 Hopper Architecture: WGMMA and Asynchronous Pipelines
The NVIDIA Hopper H100 architecture introduced a paradigm shift with Warpgroup Matrix Multiply-Accumulate (WGMMA) and the Tensor Memory Accelerator (TMA).30
6.3.1 Warpgroups and WGMMA
Hopper defines a Warpgroup as 4 contiguous warps (128 threads).
- Direct Shared Memory Access: The wgmma.mma_async instruction allows the Tensor Cores to read the $A$ and $B$ matrices directly from shared memory. This eliminates the explicit load-to-register step required in Volta/Ampere, reducing register pressure and instruction overhead.
- Operand Constraints: Operand B must be in shared memory. Operand A can be in shared memory or registers. The accumulator C is in registers.30
- Layouts: The shared memory layout must be swizzled (permuted) in specific patterns (e.g., 128-byte swizzle) to align with the Tensor Core access patterns and avoid bank conflicts.30
6.3.2 Tensor Memory Accelerator (TMA)
TMA is a hardware copy engine that moves data from global to shared memory asynchronously.
- The Pipeline:
- TMA Issue: The kernel issues a TMA copy command. This is non-blocking.
- Background Copy: The TMA engine handles the transfer and address calculation.
- WGMMA Issue: The warpgroup issues wgmma.mma_async.
- Synchronization: The warpgroup waits (wgmma.wait_group) only when it needs the result.
- Impact: This pipeline achieves nearly perfect overlap of memory transfer and computation. It allows GEMM kernels to reach >80% of the theoretical peak FLOPs of the H100, a significant jump from the ~50-60% achievable with software-managed pipelines.
7. Advanced Optimization and Analysis
7.1 Occupancy vs. Instruction-Level Parallelism (ILP)
A common metric for CUDA optimization is Occupancy: keeping as many warps active as possible to hide latency. However, modern high-performance algorithms often trade occupancy for ILP.7
- The Trade-off: To maximize ILP, a thread might unroll loops and maintain many accumulators in registers. This increases register pressure (e.g., utilizing 255 registers per thread).
- Consequence: High register usage limits the number of warps that can fit on an SM (low occupancy).
- Resolution: On modern architectures (Volta+), low occupancy is acceptable if the thread has enough independent instructions (ILP) to keep the pipeline busy. For example, a thread can issue multiple independent wgmma instructions back-to-back. The hardware pipelines them, effectively hiding the latency without needing to context-switch to other warps.8
7.2 Cooperative Groups and Synchronization
As algorithms become more complex (e.g., device-wide reductions), the need for flexible synchronization grows.
- Cooperative Groups: Introduced in CUDA 9, this API allows defining groups of threads (smaller than a block or spanning the whole grid) and synchronizing them.
- Grid Synchronization: this_grid().sync() allows a global barrier across all blocks. This enables single-kernel implementations of algorithms that previously required multiple launches (like iterative solvers or global reductions). However, it requires launching the kernel with cudaLaunchCooperativeKernel and is limited by the number of resident blocks the GPU can support simultaneously.1
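A sketch of the pattern; the kernel and buffer names are illustrative, and the kernel must be launched through cudaLaunchCooperativeKernel with a grid small enough to be fully resident.
C++
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void grid_sync_sketch(float* block_vals, float* result) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: each block publishes a per-block value (a stand-in for a partial result).
    if (threadIdx.x == 0)
        block_vals[blockIdx.x] = (float)blockIdx.x;

    grid.sync();                                // global barrier across all blocks

    // Phase 2: any block may now safely consume every block's value.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += block_vals[b];
        *result = total;
    }
}

// Host side (illustrative):
//   void* args[] = { &d_block_vals, &d_result };
//   cudaLaunchCooperativeKernel((void*)grid_sync_sketch, grid_dim, block_dim, args, 0, 0);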
8. Conclusion
The landscape of parallel algorithms in CUDA is dynamic, driven by a continuous feedback loop between hardware innovation and algorithmic adaptation.
- Reduction: Has evolved from shared-memory trees to register-level warp shuffles, minimizing memory traffic.
- Scan: Has shifted from multi-pass “scan-then-propagate” to single-pass “decoupled look-back,” fundamentally changing the cost model of prefix sums to match memory copy speeds.
- Histograms: Rely on privatization and the increasing throughput of native shared memory atomics, with replication strategies handling low-entropy data.
- Sorting: Remains dominated by Radix Sort, with innovations like Onesweep pushing bandwidth utilization to the physical limits of the hardware by fusing analysis and scatter steps.
- Matrix Operations: Have largely abstracted away from manual tiling code to hardware-intrinsic pipelines (WGMMA + TMA) on Hopper, where the focus is now on data layout management and asynchronous pipeline orchestration rather than compute scheduling.
For the practitioner, the key insight is that memory movement is the bottleneck. The most successful algorithms—Decoupled Look-back Scan, WGMMA GEMM, and Onesweep Radix Sort—share a common trait: they minimize Global Memory round-trips and maximize the utilization of on-chip hierarchies (L2, Shared Memory, Registers). As GPUs continue to evolve towards asynchronous, heterogeneous compute engines, algorithms that decouple data movement from computation will define the next generation of high-performance computing.
Comparative Summary Table
| Primitive | Naive Approach | Bottleneck | State-of-the-Art Approach | Key Optimization |
| --- | --- | --- | --- | --- |
| Reduction | Interleaved Addressing | Warp Divergence | Warp Shuffle | Register-to-register communication (no Shared Mem) 9 |
| Scan | Hillis-Steele | Work Inefficiency | Decoupled Look-back | Single-pass global scan via status flags 15 |
| Histogram | Global Atomic Add | Write Contention | Privatized Shared Atomics | Local accumulation + R-factor replication 21 |
| Sort | Bitonic Sort | High Complexity | Onesweep Radix Sort | Fused look-back scatter step 25 |
| GEMM | Global Memory Tiling | Bandwidth | Hopper WGMMA + TMA | Asynchronous Copy + Direct Shared-to-Tensor exec |
