1. Introduction: The Paradigm of Throughput-Oriented Execution
The graphics processing unit (GPU) has transcended its origins as a fixed-function rendering device to become the preeminent engine of modern high-performance computing (HPC) and artificial intelligence. This transformation was not merely a result of increasing transistor counts but a fundamental architectural divergence from the latency-oriented design of the Central Processing Unit (CPU) to the throughput-oriented design of the GPU. At the heart of this paradigm lies the kernel execution model—a sophisticated hardware-software contract that allows massive parallelism to be expressed abstractly by the programmer and managed efficiently by the hardware.
The execution model of a GPU is predicated on the concept of massive multithreading to hide latency. Unlike a CPU, which relies on large caches and complex branch prediction mechanisms to minimize the latency of a single thread, a GPU accepts that latency is inevitable. It compensates by maintaining thousands of active threads, rapidly switching between them to keep execution units busy while others wait for long-latency memory operations. This approach, formalized as the Single Instruction, Multiple Threads (SIMT) architecture, requires a rigorous definition of how software threads are grouped, launched, and mapped to physical hardware.
This report provides an exhaustive analysis of the GPU execution model, specifically within the context of the NVIDIA CUDA (Compute Unified Device Architecture) ecosystem. It explores the logical hierarchy of grids, blocks, and threads; the physical mapping to Streaming Multiprocessors (SMs) and cores; and the complex dynamics of warp scheduling, occupancy, and resource partitioning. Furthermore, it examines the evolution of this model from the stack-based reconvergence of early architectures to the Independent Thread Scheduling (ITS) of Volta and the cluster-based hierarchies of Hopper. By understanding the intricate mathematical and architectural relationships between kernel launch configurations and hardware behavior, developers can unlock the full throughput potential of modern accelerators.
2. The Logical Thread Hierarchy
The fundamental challenge in massively parallel computing is scalability. A program written for a device with 10 cores must ideally scale without modification to a device with 10,000 cores. The CUDA execution model achieves this through a hierarchical decomposition of threads, separating the logical correctness of the program from the physical capacity of the hardware.
2.1 The Grid: A Domain of Independent Execution
At the highest level of the execution hierarchy sits the Grid. When a host (CPU) initiates a computation on the device (GPU), it launches a kernel. This kernel executes as a grid of thread blocks. The grid represents the total problem domain—whether it be the pixels of an image, the cells of a simulation mesh, or the elements of a tensor.1
The defining characteristic of the grid is the independence of its constituent blocks. In the standard execution model, there is no guarantee regarding the order in which blocks execute. Block 0 and Block 1000 may run concurrently on different multiprocessors, or they may run sequentially on the same multiprocessor. This independence allows the hardware to schedule blocks onto any available Streaming Multiprocessor (SM), enabling the same compiled binary to run on a small embedded GPU or a massive data center accelerator.1
The grid dimensions are specified at launch time and can be one-, two-, or three-dimensional. This dimensionality is purely logical, designed to simplify the mapping of threads to multi-dimensional data structures. Internally, the hardware linearizes these dimensions, but the API preserves the 3D abstraction to reduce the arithmetic burden on the programmer for index calculation.3
2.2 The Thread Block: The Unit of Cooperation
Beneath the grid lies the Thread Block (or Cooperative Thread Array, CTA). A block is a collection of threads that execute on the same Streaming Multiprocessor (SM). While threads in different blocks are largely isolated from one another (barring global memory operations), threads within a single block have access to low-latency shared resources.1
The thread block is the primary unit of resource allocation. When a block is dispatched to an SM, the hardware must reserve all necessary resources—registers, shared memory, and warp slots—for the entire lifetime of that block. If the SM does not have sufficient resources to accommodate a block, the block cannot launch. This “all-or-nothing” allocation strategy is central to the occupancy model discussed later in this report.
Threads within a block can cooperate via:
- Shared Memory: A user-managed L1 cache that allows for high-bandwidth, low-latency communication between threads.
- Barrier Synchronization: The __syncthreads() intrinsic creates a barrier where all threads in the block must arrive before any can proceed. This ensures memory visibility and ordering, allowing threads to safely exchange data through shared memory.3
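As a minimal sketch of this cooperation (the 256-element tile and the in-block reversal are illustrative choices, not prescribed above; the kernel assumes it is launched with 256-thread blocks), each thread stages one element in shared memory, and __syncthreads() guarantees every write is visible before any thread reads a neighbor's element:
C++
__global__ void reverse_in_block(int *data) {
    __shared__ int tile[256];             // statically allocated shared memory (one tile per block)
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];             // each thread stages one element
    __syncthreads();                      // barrier: all writes to tile are now visible

    // Safe to read an element written by a different thread in the same block.
    data[base + t] = tile[blockDim.x - 1 - t];
}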
The size of a thread block is limited by the hardware architecture. On modern GPUs (Compute Capability 2.0 and later), a block can contain up to 1024 threads.5 However, simply maximizing the block size is rarely the optimal strategy, as it reduces the granularity of scheduling and can exacerbate the “tail effect” where hardware resources are underutilized at the end of a grid launch.6
2.3 The Thread: The Scalar Abstraction
At the finest granularity is the Thread. In the CUDA model, a thread is a scalar unit of execution with its own Program Counter (PC), register file, and local stack.4 Ideally, the programmer views the thread as an independent entity capable of unrestricted control flow and memory access.
This scalar view is a powerful abstraction known as Single Instruction, Multiple Threads (SIMT). It distinguishes CUDA from traditional SIMD (Single Instruction, Multiple Data) vector processing. In a SIMD model, the programmer explicitly manages vector width (e.g., AVX-512) and must handle data alignment manually. In SIMT, the programmer writes code for a single thread, and the hardware aggregates these threads into groups (warps) for execution. This allows the GPU to handle divergent control flow—where some threads take an if branch and others take an else—automatically, albeit with a performance penalty.7
The built-in coordinate variables allow each thread to identify its position within the hierarchy:
- threadIdx: A dim3 vector (.x, .y, .z) identifying the thread within its block.
- blockIdx: A dim3 vector (.x, .y, .z) identifying the block within the grid.
- blockDim: A dim3 vector giving the dimensions of the block.
- gridDim: A dim3 vector giving the dimensions of the grid.3
To calculate a unique global index for linear memory access, threads typically flatten these coordinates. For a 1D grid of 1D blocks, the global index $i$ is derived as:
$$i = \text{blockIdx}.x \times \text{blockDim}.x + \text{threadIdx}.x$$
For 2D or 3D grids, the calculation involves strides based on the dimensions, reflecting the row-major layout of memory. This coordinate system is fundamental to the programming model, bridging the gap between the multi-dimensional logic of the application and the linear addressing of the DRAM.8
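As an illustration of the 2D case (a sketch; the image layout, the `width`/`height` parameters, and the scaling operation are assumptions for the example), each thread computes its (x, y) coordinate and flattens it into a row-major linear offset:
C++
__global__ void scale_image(float *img, int width, int height, float s) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {                   // guard against partial blocks at the edges
        img[y * width + x] *= s;                     // row-major flattening: stride = width
    }
}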
3. Kernel Launch Configuration and Syntax
The interface between the host CPU and the device execution model is the kernel launch. This configuration determines how the grid is instantiated and provides the initial state for the execution machinery.
3.1 The Execution Configuration Syntax
The standard mechanism for launching a kernel in C++ CUDA is the triple-chevron syntax <<<…>>>. This operator encapsulates the execution configuration and takes up to four arguments:
C++
kernel_name<<<Dg, Db, Ns, S>>>(args…);
- Dg (Grid Dimensions): Specifies the number of blocks in the grid. It can be of type dim3 or unsigned int. This defines the total scope of work.
- Db (Block Dimensions): Specifies the number of threads per block. This is a critical tuning parameter that affects occupancy and resource utilization.
- Ns (Shared Memory Bytes): An optional size_t argument specifying the number of bytes of dynamic shared memory to allocate per block. This is in addition to any statically allocated shared memory in the kernel code.
- S (Stream): An optional cudaStream_t argument specifying the stream in which the kernel will execute. If 0 (or omitted), the kernel runs in the default null stream, which implies strict synchronization with other legacy stream operations.8
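A concrete configuration for a 2D image kernel like the one sketched in Section 2 (the 16x16 block shape and the variables d_img, width, height, and stream are illustrative, not taken from the text) might look like:
C++
dim3 block(16, 16);                                   // 256 threads per block
dim3 grid((width  + block.x - 1) / block.x,           // ceiling division so the grid
          (height + block.y - 1) / block.y);          // covers the whole image
scale_image<<<grid, block, 0, stream>>>(d_img, width, height, 2.0f);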
Under the hood, this syntax is transformed by the nvcc compiler into calls to the CUDA Runtime API, specifically cudaLaunchKernel or cudaLaunchDevice.9 The arguments are marshaled into a buffer, and the command is pushed to the GPU’s push buffer for execution.
3.2 Dynamic Cluster Configuration (cudaLaunchKernelEx)
With the introduction of the NVIDIA Hopper architecture (Compute Capability 9.0), the execution model expanded to include Thread Block Clusters. A Cluster is a group of thread blocks that are guaranteed to be co-scheduled on the same Graphics Processing Cluster (GPC), enabling distributed shared memory access and hardware-accelerated barriers between blocks.11
The traditional triple-chevron syntax was insufficient to express the configuration of clusters dynamically. While a fixed cluster size can be specified at compile-time using the __cluster_dims__(x, y, z) attribute, runtime flexibility required a new API: cudaLaunchKernelEx.
This API utilizes a configuration structure, cudaLaunchConfig_t, which accepts a list of attributes (cudaLaunchAttribute).
- cudaLaunchAttributeClusterDimension: Allows the programmer to specify the X, Y, and Z dimensions of the cluster at runtime.
- The grid dimensions must be divisible by the cluster dimensions to ensure a regular tiling of the iteration space.12
The use of cudaLaunchKernelEx represents a shift toward more explicit control over the physical placement of blocks, allowing advanced optimization where the locality of data processing spans boundaries larger than a single block but smaller than the entire grid.
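A minimal sketch of a runtime cluster launch (the kernel name cluster_kernel, its arguments, and the chosen dimensions are placeholders) follows the documented cudaLaunchKernelEx pattern:
C++
cudaLaunchConfig_t cfg = {};
cfg.gridDim  = dim3(128, 1, 1);        // must be divisible by the cluster dimensions
cfg.blockDim = dim3(256, 1, 1);

cudaLaunchAttribute attr[1];
attr[0].id = cudaLaunchAttributeClusterDimension;
attr[0].val.clusterDim.x = 2;          // 2 blocks per cluster along x
attr[0].val.clusterDim.y = 1;
attr[0].val.clusterDim.z = 1;
cfg.attrs    = attr;
cfg.numAttrs = 1;

cudaLaunchKernelEx(&cfg, cluster_kernel, d_data, n);   // kernel arguments passed variadically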
3.3 Asynchronous Execution and Streams
The kernel launch is inherently asynchronous. The host CPU issues the launch command and immediately returns to execution, often before the GPU has even begun processing the kernel. This decoupling allows for CPU-GPU concurrency.9
CUDA Streams manage concurrency on the device. A stream is a sequence of operations (kernel launches, memory copies) that execute in issue-order. Operations in different streams may run concurrently or out-of-order with respect to each other.
- Latency Hiding via Concurrency: By launching independent kernels in separate streams, the GPU scheduler can fill idle SMs. If a small kernel does not utilize the entire GPU, a second kernel in a different stream can run on the remaining SMs (Spatial Multitasking).13
- Overlap of Data Transfer and Compute: Streams allow the overlap of cudaMemcpyAsync in one stream with kernel execution in another. This is crucial for pipelining large workloads where data transfer over the PCIe bus is a bottleneck.14
The complexity of stream management increases with the introduction of hardware-accelerated scheduling. Modern GPUs (via Hyper-Q, introduced with Kepler) maintain multiple hardware work queues, allowing independent streams to proceed concurrently without false serialization dependencies.9
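A typical two-stream pipeline is sketched below (the chunking scheme, the buffers h_buf and d_buf, and the process_chunk kernel are illustrative; h_buf must be pinned host memory for the copies to be truly asynchronous):
C++
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int c = 0; c < numChunks; ++c) {
    int i = c % 2;                                            // alternate between the two streams
    cudaMemcpyAsync(d_buf[i], h_buf + c * chunk,              // copy chunk c while the other
                    chunk * sizeof(float),                    // stream is computing
                    cudaMemcpyHostToDevice, s[i]);
    process_chunk<<<blocks, 256, 0, s[i]>>>(d_buf[i], chunk); // compute on chunk c in the same stream
}
for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);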
4. Physical Architecture: Mapping Software to Silicon
To optimize kernel launch configurations, one must understand the physical destination of the thread blocks: the Streaming Multiprocessor (SM). The software hierarchy maps directly to hardware structures, but the ratio is not 1:1.
4.1 The Streaming Multiprocessor (SM)
The SM is the workhorse of the NVIDIA GPU. It is a multicore processor in its own right, containing:
- Execution Units (Cores): Specialized ALUs for different data types.
  - FP32 Cores (CUDA Cores): The primary floating-point units.
  - FP64 Cores: Double-precision units (typically fewer in number, e.g., a 1:2 or 1:32 ratio depending on the SKU).
  - INT32 Cores: Integer arithmetic units. In architectures like Turing and Ampere, these can execute concurrently with FP32 cores, allowing address calculations to occur in parallel with math.15
  - Tensor Cores: Specialized systolic arrays for matrix multiply-accumulate operations, critical for AI workloads.16
- Register File: A massive on-chip memory (e.g., 256 KB per SM on A100/H100) that holds thread state. This is the fastest memory in the hierarchy but also a critical bottleneck for occupancy.17
- L1 Cache / Shared Memory: A configurable block of SRAM (e.g., up to 228 KB on H100) partitioned between shared memory and L1 cache. This resource determines how many blocks can physically reside on the SM.11
- Warp Schedulers: The control logic that issues instructions. Modern SMs (e.g., Ampere, Hopper) typically have 4 warp schedulers. Each scheduler manages a specific partition of the warps resident on the SM.18
4.2 Mapping Blocks to SMs
When a grid is launched, the global GigaThread Engine distributes thread blocks to the available SMs.
- Block Residency: An SM can host multiple blocks concurrently. The maximum number is architectural (e.g., 32 blocks per SM on A100/H100).19
- Resource Constraints: The scheduler will only assign a block to an SM if all required resources are available. If a block needs 48 KB of shared memory and the SM has 164 KB total, the SM can host at most $\lfloor 164/48 \rfloor = 3$ blocks, regardless of the block limit or thread slots.21
- Persistence: Once assigned, a block stays on that SM until all of its threads complete. In the standard execution model, blocks are not migrated or swapped out mid-execution; they run to completion.
4.3 Warp Formation and Sub-Partitioning
Inside the SM, the threads of a block are aggregated into Warps. A warp is a group of 32 threads (a hardware constant across all CUDA architectures) that execute in lockstep.3
- Linearization: Threads are grouped linearly: threads 0-31 form Warp 0, threads 32-63 form Warp 1, and so on. This is why block dimensions should be multiples of 32: a block of 33 threads consumes two full warps' worth of resources (64 slots), leaving 31 slots idle in the second warp, a significant waste.23
- Scheduler Assignment: The warps are distributed among the 4 warp schedulers. If an SM has 48 resident warps, each scheduler manages 12 warps. In every clock cycle, each scheduler checks its pool of 12 warps to see which ones are ready to execute (i.e., not stalled on memory or dependencies) and issues an instruction for one of them.18
This hierarchical mapping—Grid $\to$ GPU, Block $\to$ SM, Warp $\to$ Scheduler—is the foundation of the GPU’s throughput. The hardware relies on having a sufficient pool of resident warps to keep the execution pipelines full.
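For a 1D block, the warp and lane of a thread follow directly from this linearization (a small sketch; warpSize is the built-in constant, equal to 32 on all current architectures):
C++
#include <cstdio>

__global__ void warp_coords() {
    int linear = threadIdx.x;            // 1D block: the linear thread index
    int warpId = linear / warpSize;      // which warp within the block
    int laneId = linear % warpSize;      // position of the thread within its warp
    printf("thread %d -> warp %d, lane %d\n", linear, warpId, laneId);
}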
5. The Execution Model: SIMT and Divergence
The Single Instruction, Multiple Threads (SIMT) model is the mechanism that allows the GPU to scale to thousands of cores. It abstracts the vector nature of the hardware, presenting scalar threads to the user, while executing them as vectors (warps) on the silicon.
5.1 Lockstep Execution and Predication
In the classic SIMT formulation (pre-Volta), all 32 threads of a warp share a single Program Counter (PC). In a given cycle, the warp fetches the instruction pointed to by the PC and broadcasts it to the active threads.
- Convergence: When all threads in a warp execute the same instruction, the warp is converged. This is the optimal state, utilizing 100% of the compute resources.
- Predication: To handle conditional logic (if (tid < 16)…), the GPU uses hardware predication. Threads that evaluate the condition as false are predicated off (masked). They do not execute the instruction; essentially, they execute NOPs (No Operations) while the active threads execute the body of the if.
5.2 Warp Divergence
When threads in a warp take different control flow paths, Warp Divergence occurs. The hardware cannot execute two different instructions for the same warp simultaneously.
- The warp first executes the path taken by the threads satisfying the condition (e.g., the if block). The threads in the else block are inactive.
- The warp then executes the else block. The threads from the if block are now inactive.
- The threads reconverge at the immediate post-dominator of the branch logic.26
This serialization effectively halves the throughput for that section of code. If a warp diverges 32 ways (e.g., a switch statement with 32 unique cases), the execution is fully serialized, running at 1/32 of the peak throughput.27
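The following sketch contrasts a condition that diverges within a warp with one that is uniform across each warp (the lane-parity predicate and the trigonometric work are illustrative only):
C++
__global__ void divergence_demo(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges: even and odd lanes of the same warp take different paths,
    // so the two paths execute one after the other.
    if (threadIdx.x % 2 == 0) out[tid] = sinf((float)tid);
    else                      out[tid] = cosf((float)tid);

    // Does not diverge: the condition is uniform across each group of 32
    // threads, because every lane of a warp shares the same (threadIdx.x / 32).
    if ((threadIdx.x / 32) % 2 == 0) out[tid] += 1.0f;
}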
5.3 Architectural Evolution: From Stacks to Independent Scheduling
The mechanism for handling divergence has evolved significantly, fundamentally changing the execution model’s capabilities.
Pre-Volta (Pascal and earlier): Stack-Based Reconvergence
In early architectures, divergence was managed using a hardware Reconvergence Stack. When a warp diverged, the hardware pushed the PC and active mask of the alternative path onto a stack. The warp executed one path until it reached the reconvergence point, then popped the stack to execute the other path.
- Limitation: This enforced a strict lockstep behavior. It was impossible for threads in the same warp to communicate or synchronize within a divergent branch because the “waiting” threads might be physically masked off on the stack, leading to deadlocks.28
Volta and Beyond: Independent Thread Scheduling (ITS)
With the Volta architecture (Compute Capability 7.0), NVIDIA introduced Independent Thread Scheduling (ITS). This microarchitecture maintains the execution state (PC and Call Stack) per thread, rather than per warp.30
- Mechanism: While the warp scheduler still attempts to issue instructions for threads together to maximize SIMT efficiency, it can schedule threads independently. This allows for interleaved execution of divergent paths.31
- Starvation Freedom: ITS guarantees that even in divergent code, threads will eventually make progress. This enables the use of spin-locks and complex synchronization primitives within a warp, which would have deadlocked on Pascal.32
- Software Impact: This freedom broke the implicit assumption of “warp-synchronous” programming (the belief that threads in a warp execute in lockstep). Developers must now use explicit synchronization intrinsics like __syncwarp() to enforce lockstep behavior where data dependencies exist between threads in a warp.30
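A minimal sketch of the explicit-synchronization style required under ITS (the warp-level sum reduction is illustrative and assumes the block size is a multiple of 32 with all lanes active): every shuffle carries an explicit member mask instead of relying on implicit lockstep.
C++
__global__ void warp_reduce(const int *data, int *out) {
    int v = data[blockIdx.x * blockDim.x + threadIdx.x];

    // Warp-level tree reduction. The *_sync intrinsics take an explicit mask;
    // pre-Volta "warp-synchronous" code relied on implicit lockstep instead.
    unsigned mask = 0xffffffffu;                       // all 32 lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(mask, v, offset);

    if (threadIdx.x % 32 == 0)                         // lane 0 of each warp holds the sum
        atomicAdd(out, v);
}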
6. Occupancy, Latency Hiding, and Little’s Law
The performance of a GPU kernel is rarely limited by pure arithmetic throughput (FLOPs). More often, it is limited by memory latency. The execution model is designed to hide this latency through occupancy.
6.1 Little’s Law and the Need for Concurrency
Little’s Law relates concurrency to throughput and latency:
$$\text{Concurrency} = \text{Throughput} \times \text{Latency}$$
In the context of a GPU:
- Throughput: The rate at which the SM can execute instructions (e.g., 1 instruction per cycle per scheduler).
- Latency: The time it takes for an operation to complete (e.g., 400+ cycles for a global memory load).
To keep the pipeline full (hide the latency), the SM needs enough active warps to issue instructions while others are waiting. If memory latency is 400 cycles and each scheduler issues 1 instruction per cycle, roughly 400 independent instructions must be in flight per scheduler. Because each warp typically exposes several independent instructions (instruction- and memory-level parallelism), in practice roughly 12-16 active warps per scheduler are enough to hide memory latency.34
6.2 Occupancy: The Metric of Utilization
Occupancy is defined as the ratio of active warps on an SM to the maximum number of warps supported by the SM.
- Example (A100): Max warps = 64. If an SM has 32 active warps, Occupancy = 50%.19
High occupancy is generally desirable because it increases the pool of warps available to the scheduler, minimizing the probability of a “no-issue” cycle (a stall). However, occupancy is constrained by resource availability: Registers and Shared Memory.
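Occupancy for a specific kernel and launch configuration can be queried at runtime through the occupancy API (a sketch; my_kernel and the 256-thread block size are placeholders):
C++
int numBlocks = 0;     // resident blocks per SM for this kernel and block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, my_kernel,
                                              256 /*block size*/, 0 /*dynamic smem*/);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (numBlocks * 256.0f) / prop.maxThreadsPerMultiProcessor;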
6.3 Register Pressure and Spilling
Registers are the scarcest resource on the GPU. The Register File (RF) is partitioned among the threads.
- A100/H100 Spec: 64K (65,536) 32-bit registers per SM.11
- Calculation: If a kernel uses 64 registers per thread, the max threads the SM can host is:
$$\frac{65,536 \text{ registers}}{64 \text{ registers/thread}} = 1024 \text{ threads}$$
Since the SM supports up to 2048 threads, this register pressure limits theoretical occupancy to 50% (1024/2048).19
If the compiler cannot fit the thread’s variables into the allocated register count, it performs Register Spilling. The excess variables are moved to Local Memory. Despite the name, Local Memory is physically located in Global Memory (DRAM), meaning it is slow. Spilling can destroy performance due to the massive latency penalty and increased memory traffic.36
Optimization – Launch Bounds:
Developers can control register usage using the __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) qualifier. This informs the compiler of the intended launch configuration, allowing it to cap register usage to ensure the specified occupancy is achievable, potentially by spilling more aggressively or reordering instructions to reduce live variable ranges.38
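A sketch of the qualifier in use (the numbers are illustrative): promising at most 256 threads per block and requesting at least 4 resident blocks per SM caps register usage at 65,536 / (256 x 4) = 64 registers per thread on an A100-class SM.
C++
__global__ void __launch_bounds__(256, 4)   // maxThreadsPerBlock, minBlocksPerMultiprocessor
tuned_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;             // placeholder per-element work
}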
6.4 Shared Memory Constraints
Shared Memory is the second limiter.
- H100 Spec: 228 KB per SM.
- Calculation: If a block requires 100 KB of shared memory, the SM can host $\lfloor 228/100 \rfloor = 2$ blocks.
- If blockDim is 256 threads, total threads = 512.
- Occupancy = 512 / 2048 = 25%.
This creates a discrete “step function” for occupancy. Increasing shared memory usage by 1 byte could drop the number of resident blocks from 3 to 2, causing a massive drop in occupancy (the “occupancy cliff”).19
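One practical wrinkle (a sketch; my_kernel, d_data, grid, and block are placeholders): dynamic shared memory allocations above the 48 KB per-block default require an explicit opt-in before launch.
C++
// Opt in to a 100 KB dynamic shared memory allocation per block; without this
// attribute, launches requesting more than the 48 KB default limit fail.
cudaFuncSetAttribute(my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                     100 * 1024);
my_kernel<<<grid, block, 100 * 1024>>>(d_data, n);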
7. Wave Quantization and The Tail Effect
While occupancy focuses on the utilization of a single SM, Wave Quantization analyzes utilization across the entire GPU.
The total number of blocks in a grid is executed in “waves.” A wave is the set of blocks that are executing concurrently on the GPU at any given moment.
- Wave Size: $\text{Total SMs} \times \text{Blocks per SM}$.
- Example: A full H100 implementation has 144 SMs. If the kernel achieves 4 blocks/SM, the Wave Size is $144 \times 4 = 576$ blocks.
7.1 The Tail Effect
If the grid size is not a multiple of the wave size, the final wave will be partial.
- Scenario: Launch 577 blocks on the H100 described above.
- Wave 1: 576 blocks run. The GPU is 100% utilized.
- Wave 2 (The Tail): 1 block runs. The GPU is 0.17% utilized (1/576).
- The entire massive GPU remains powered up, waiting for this single block to finish. This “tail” drastically reduces the average throughput of the kernel.6
7.2 Mitigation: Grid-Stride Loops
To mitigate tail effects and launch overhead, experienced developers use the Grid-Stride Loop pattern. Instead of mapping one thread to one data element (which couples the grid size to the data size), the kernel launches a fixed grid size (typically equal to the device’s wave size) and has threads loop over the data elements.
C++
__global__ void kernel(int *data, int N) {
    // Global index of this thread and the total number of threads in the grid.
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Each thread strides through the array; the grid size is independent of N.
    for (int i = idx; i < N; i += stride) {
        process(data[i]);   // process() stands in for the per-element work
    }
}
This decouples the launch configuration from the problem size, ensuring optimal wave quantization and allowing the device to amortize the launch cost over more work per thread.40
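The launch that pairs with a grid-stride kernel is typically sized from the device rather than from N (a sketch; the blocks-per-SM factor and d_data are assumptions, and the factor can instead be derived from the occupancy API):
C++
int numSMs = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

int blockSize   = 256;
int blocksPerSM = 4;                        // tunable; roughly the achieved residency
int gridSize    = numSMs * blocksPerSM;     // approximately one full wave

kernel<<<gridSize, blockSize>>>(d_data, N); // the grid-stride kernel shown above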
8. Advanced Hopper Architecture Features
The NVIDIA Hopper architecture (H100) introduces features that fundamentally extend the execution model beyond the limits of the SM.
8.1 Thread Block Clusters and Distributed Shared Memory
As discussed in the launch-configuration section, clusters co-schedule their blocks on a single GPC. This physical grouping enables Distributed Shared Memory (DSMEM).
- SM-to-SM Network: Hopper introduces a dedicated interconnect between SMs in a cluster. A thread in Block A can issue a load instruction for an address in the shared memory of Block B.
- Implication: This allows for cooperative algorithms (e.g., large-tile matrix multiplications or stencil computations) that exceed the shared memory capacity of a single SM. It essentially creates a new level of cache hierarchy: L1 (Local Shared) < L1.5 (Cluster Shared) < L2 (Global).11
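At the CUDA C++ level, DSMEM is exposed through the cooperative groups cluster API. The following is a rough sketch of the access pattern (sm_90 only; the two-block cluster, the 256-thread block assumption, and the exchange performed are illustrative, and the exact API details should be checked against the current programming guide):
C++
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) dsmem_demo(int *out) {
    __shared__ int local[256];                      // assumes 256-thread blocks
    cg::cluster_group cluster = cg::this_cluster();

    local[threadIdx.x] = blockIdx.x * 1000 + threadIdx.x;
    cluster.sync();                                 // every block in the cluster has written

    // Map our shared-memory array into the address space of the other block in
    // the cluster and read its element over the SM-to-SM network.
    unsigned other = cluster.block_rank() ^ 1u;
    int *remote = cluster.map_shared_rank(local, other);
    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];

    cluster.sync();                                 // keep remote smem alive until all reads finish
}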
8.2 Tensor Memory Accelerator (TMA)
The Tensor Memory Accelerator (TMA) is a dedicated hardware engine in the Hopper SM designed to offload data movement.
- The Problem: In previous architectures, threads had to spend cycles issuing Load/Store instructions to move data from Global to Shared memory. This burned register file bandwidth and instruction issue slots.
- The TMA Solution: A thread issues a single “Copy Descriptor” to the TMA. The TMA engine then asynchronously handles the entire transfer of a large tensor (1D-5D) from Global Memory directly into Shared Memory (or DSMEM).
- Async Execution: The threads are free to perform other work (e.g., math) while the data arrives. The synchronization is handled via mbarrier objects. This allows for near-perfect overlap of memory and compute without the complexity of manual software pipelining or register pressure.11
9. Case Study: Optimizing a Kernel Launch
To synthesize these concepts, consider the optimization of a matrix multiplication kernel on an NVIDIA A100.
Initial State:
- Kernel: Naive implementation using blockDim = (32, 32).
- Grid: Sufficient to cover a $4096 \times 4096$ matrix.
- Registers: Compiler uses 40 registers per thread.
- Shared Mem: 0 bytes.
Analysis:
- Block Size: $32 \times 32 = 1024$ threads. This hits the max threads/block limit.
- Register Pressure: $1024 \text{ threads} \times 40 \text{ regs} = 40,960 \text{ regs}$. The SM has 65,536 registers. 1 block fits easily.
- Can we fit 2 blocks? $2 \times 40,960 = 81,920 > 65,536$. No.
- Resulting Occupancy: 1 block/SM = 1024 threads/SM. Max is 2048. Occupancy is 50%.
Optimization Step 1: Reduce Block Size
- Change blockDim to (16, 16) = 256 threads.
- Regs per block: $256 \times 40 = 10,240$.
- Max blocks by registers: $65,536 / 10,240 = 6$ blocks.
- Max blocks by SM limit: A100 allows 32 blocks. 6 is fine.
- Total threads: $6 \times 256 = 1536$ threads.
- New Occupancy: $1536 / 2048 = 75\%$. This significantly improves the SM's latency-hiding capacity.
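The runtime can also suggest a block size directly (a sketch; matmul_kernel stands in for the kernel being tuned), which is a useful cross-check on the hand analysis above:
C++
int minGridSize = 0;   // minimum grid size needed to saturate the device
int blockSize   = 0;   // suggested block size that maximizes occupancy
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, matmul_kernel, 0, 0);
// For the register budget above, a block size such as 256 or 512 is expected,
// both of which reach the 75% occupancy ceiling computed by hand.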
Optimization Step 2: Use Clusters (Hopper H100)
- If migrating to H100, we can use __cluster_dims__(2, 2, 1).
- This groups 4 blocks. Threads can now preload data for their neighbors into DSMEM using TMA, reducing global memory traffic and leveraging the higher bandwidth of the SM-to-SM network.
10. Conclusion
The execution model of modern GPUs is a complex layering of abstractions, from the logical Grid down to the physical nanoseconds of instruction issue. The kernel launch configuration—the <<<Dg, Db>>> syntax—is the control knob that governs this machinery.
Efficient GPU computing is not merely about writing parallel code; it is about writing code that aligns with the physical reality of the architecture. It requires respecting the granularity of warps to avoid divergence, managing register pressure to maintain occupancy, and sizing grids to avoid tail effects. As architectures evolve with features like Independent Thread Scheduling and Thread Block Clusters, the model becomes more powerful but also demands a deeper understanding from the programmer.
By mastering the relationships between thread hierarchy, resource partitioning, and memory latency defined in this report, developers can transform theoretical TFLOPS into realized application performance, fully exploiting the throughput-oriented paradigm of the GPU.
11. Appendix: Comparative Architectural Specifications
The following table summarizes the key execution model parameters for recent NVIDIA Data Center architectures.
| Feature | NVIDIA Volta (V100) | NVIDIA Ampere (A100) | NVIDIA Hopper (H100) |
| Compute Capability | 7.0 | 8.0 | 9.0 |
| SM Count (Full) | 84 | 128 | 144 |
| Max Warps / SM | 64 (2048 threads) | 64 (2048 threads) | 64 (2048 threads) |
| Max Blocks / SM | 32 | 32 | 32 |
| Register File / SM | 256 KB (64K regs) | 256 KB (64K regs) | 256 KB (64K regs) |
| Max Shared Mem / SM | 96 KB | 164 KB | 228 KB |
| Scheduling Model | Independent Thread Scheduling (ITS) | ITS | ITS |
| Reconvergence | Sub-warp | Sub-warp | Sub-warp |
| Async Copy | No | cp.async (LDGSTS, Global to Shared) | TMA (Tensor Memory Accelerator) |
| Cluster Support | No | No | Yes (Max 8 blocks portable) |
| L2 Cache Size | 6 MB | 40 MB | 50 MB |
