The Architectonics of High-Throughput Computing: A Comprehensive Analysis of CUDA Shared Memory, Bank Conflicts, and Optimization Paradigms

1. Introduction: The Imperative of On-Chip Memory in Massively Parallel Architectures

The trajectory of high-performance computing (HPC) over the last two decades has been defined by a fundamental divergence: the rate of arithmetic throughput growth has consistently outpaced the growth of off-chip memory bandwidth. This phenomenon, widely recognized as the “Memory Wall,” presents a critical bottleneck for modern Graphics Processing Units (GPUs) which rely on massive parallelism to achieve petaflop-scale performance. In the context of NVIDIA’s CUDA (Compute Unified Device Architecture) platform, the mitigation of this bottleneck relies heavily on the efficient utilization of the memory hierarchy. While registers provide the fastest access and global memory (High Bandwidth Memory or GDDR) offers the largest capacity, it is the Shared Memory—a user-managed, on-chip scratchpad—that serves as the linchpin for high-performance kernel design.1

Shared memory distinguishes itself from traditional CPU caches by offering deterministic latency and explicit programmer control. Located physically within each Streaming Multiprocessor (SM), it enables low-latency data reuse and efficient inter-thread communication within a Thread Block or Cooperative Thread Array (CTA).3 However, the architectural implementation of shared memory introduces a complex set of constraints, most notably the system of memory banks. The performance of a CUDA kernel is frequently dictated not by the raw floating-point capability of the hardware, but by the developer’s ability to navigate the intricacies of bank access patterns.

When multiple threads within a warp (the fundamental unit of execution in CUDA, comprising 32 threads) attempt to access different memory addresses that map to the same physical bank during a single cycle, a bank conflict occurs.5 The hardware must serialize these conflicting accesses, reducing the effective bandwidth by a factor proportional to the degree of conflict. As architectures have evolved from Fermi to the modern Hopper H100, the mechanisms for handling shared memory have grown in sophistication, introducing features such as configurable bank widths, asynchronous copy instructions (cp.async), and the Tensor Memory Accelerator (TMA).7

This report provides an exhaustive examination of CUDA shared memory architecture. It dissects the microarchitectural mechanics of bank conflicts, explores the mathematical foundations of conflict-free access patterns, and details advanced optimization strategies ranging from algorithmic padding to hardware-accelerated swizzling. Furthermore, it integrates profiling methodologies using NVIDIA Nsight Compute to provide a verifiable, data-driven approach to optimization.

2. Architectural Fundamentals of Shared Memory

To master shared memory optimization, one must first deconstruct its physical organization. Unlike global memory, which is accessed via a sophisticated cache hierarchy (L1/L2) and memory controllers, shared memory is a direct-access SRAM (Static Random Access Memory) structure located on the SM die.

2.1 Physical Organization and Bank Structure

The fundamental unit of bandwidth scaling in shared memory is the bank. Rather than being a monolithic block of memory where only one access can occur at a time, shared memory is divided into equally sized modules that can function independently.

2.1.1 The 32-Bank Standard

Across all modern NVIDIA architectures—from Maxwell (Compute Capability 5.x) through Pascal, Volta, Ampere, and Hopper (Compute Capability 9.0)—shared memory is organized into 32 banks.9 This number is not arbitrary; it corresponds exactly to the number of threads in a standard CUDA warp (32 threads).

This 1:1 correspondence is the architectural ideal: in a perfectly optimized scenario, every thread in a warp issues a memory request to a unique bank. If this condition is met, the memory subsystem can service all 32 requests simultaneously in a single clock cycle, achieving maximum theoretical bandwidth.10

2.1.2 Bank Width and Bandwidth

Each bank has a width of 4 bytes (32 bits).6 A bank therefore services one 32-bit word per access, and the shared memory address space is laid out as a sequence of 32-bit words distributed across the banks. The implication for bandwidth is significant:

  • Per-Bank Bandwidth: 4 bytes per clock cycle.
  • Aggregate Bandwidth: With 32 banks, the total theoretical throughput is $32 \times 4 \text{ bytes} = 128 \text{ bytes per cycle}$ per SM.14

It is crucial to note the distinction between “bandwidth” and “latency.” Shared memory latency is roughly 20-30 cycles (comparable to L1 cache), whereas global memory latency can exceed 400 cycles.16 While the latency is low, the throughput (bandwidth) is limited by bank parallelism. If bank conflicts occur, the effective bandwidth drops precipitously, potentially starving the CUDA Cores or Tensor Cores waiting for data.

2.2 Address Mapping and Interleaving Logic

The system determines which bank holds a specific byte address using a modulo mapping scheme. This is often referred to as “word interleaving.”

2.2.1 The Mapping Formula

For the standard 32-bit mode, the bank index for a given byte address $A$ is calculated as:

 

$$\text{Bank Index} = \left( \frac{A}{4} \right) \pmod{32}$$

This formula implies that successive 32-bit words are assigned to successive banks in a round-robin fashion.13

  • Word 0 (Bytes 0-3) maps to Bank 0.
  • Word 1 (Bytes 4-7) maps to Bank 1.
  • Word 31 (Bytes 124-127) maps to Bank 31.
  • Word 32 (Bytes 128-131) wraps around to Bank 0.17

This interleaved layout is designed to optimize linear access patterns. If a warp reads a contiguous array of float or int values (e.g., Thread[i] reads Array[i]), the addresses naturally stride across the banks. Thread 0 accesses Bank 0, Thread 1 accesses Bank 1, and so on, resulting in a conflict-free transaction.
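
To make the mapping concrete, the sketch below (a minimal illustration; bank_of is a hypothetical helper, not part of the CUDA API) expresses the formula in code:

C++

// Hypothetical helper: bank index of a shared-memory byte offset under the
// standard configuration of 32 banks, each 4 bytes wide.
__host__ __device__ inline unsigned bank_of(unsigned byte_offset) {
    return (byte_offset / 4u) % 32u;   // word index modulo bank count
}

// bank_of(0)   == 0   // Word 0  -> Bank 0
// bank_of(4)   == 1   // Word 1  -> Bank 1
// bank_of(124) == 31  // Word 31 -> Bank 31
// bank_of(128) == 0   // Word 32 wraps to Bank 0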

2.2.2 Historical Context: Configurable Bank Widths

While modern architectures have standardized on 4-byte banks, it is informative to acknowledge that the Kepler architecture (Compute Capability 3.x) introduced a configurable bank width feature. Developers could use cudaDeviceSetSharedMemConfig() to toggle between 4-byte and 8-byte bank modes.3 The 8-byte mode was intended to optimize double-precision (64-bit) workloads by allowing a single bank to deliver 64 bits per cycle.

However, starting with Maxwell, NVIDIA reverted to a fixed 4-byte bank width.13 On modern hardware, a 64-bit load (e.g., double or long long) is effectively split into two 32-bit requests. Because the banks are 4 bytes wide, a 64-bit value resides in two consecutive banks (e.g., Bank $k$ and Bank $k+1$). The hardware is sufficiently advanced to handle these multi-bank accesses efficiently, provided appropriate alignment is maintained, rendering the manual configuration obsolete.3

3. The Phenomenology of Bank Conflicts

A bank conflict is a microarchitectural event that occurs when the memory subsystem receives multiple requests for the same bank index within a single transaction window (typically one warp instruction).5

3.1 Mechanics of Serialization

The shared memory hardware provides one access port per bank per cycle: it physically cannot retrieve two different 32-bit words from the same bank at the exact same instant. When the warp scheduler issues a load instruction, the memory unit inspects the target addresses of all active threads.

If threads $T_A$ and $T_B$ request addresses that map to Bank $X$, but the addresses are distinct (e.g., Word 0 and Word 32), a conflict exists. The hardware resolves this by serialization.5 The request is split into separate “wavefronts” or transactions.

  1. Cycle 1: The hardware services $T_A$’s request (and any other non-conflicting threads).
  2. Cycle 2: The hardware services $T_B$’s request.

This serialization increases the latency of the instruction and, more importantly, reduces the aggregate throughput. In an $N$-way bank conflict (where $N$ threads hit the same bank), the effective bandwidth is reduced to $1/N$ of the peak.14 In the worst-case scenario (32-way conflict), the warp executes serially, behaving essentially like a single thread performance-wise for that instruction.

3.2 Exceptions: Broadcast and Multicast

Not all simultaneous accesses to a bank constitute a conflict. The architecture includes specific logic to handle data reuse.

3.2.1 Broadcast Mechanism

If multiple threads in a warp access the exact same address (e.g., every thread reads the same scalar parameter or lookup-table entry), the hardware recognizes this redundancy. It performs a single read from the bank and broadcasts the data to all requesting threads via the interconnect network (crossbar).3

  • Cost: Single transaction.
  • Result: No conflict penalty.

3.2.2 Multicast Capability

Starting with Compute Capability 2.0 and refined in later generations, the “Broadcast” concept was generalized into Multicast. Multicast handles cases where subsets of threads access the same address.

  • Scenario: Threads 0-15 access Address $A$ (Bank 0). Threads 16-31 access Address $B$ (Bank 1).
  • Behavior: The hardware performs one read from Bank 0 (multicast to 0-15) and one read from Bank 1 (multicast to 16-31).
  • Result: Since Bank 0 and Bank 1 are distinct, these operations occur simultaneously. The instruction completes in a single cycle.21

Therefore, the precise definition of a performance-degrading bank conflict is: Multiple threads accessing different addresses within the same bank in the same cycle.

3.3 The Role of the Warp

Conflict detection occurs at the warp level.

  • Legacy (Pre-Volta): Some documentation references “half-warps” regarding conflicts on very old hardware (CC 1.x). This is obsolete.
  • Modern (Volta/Ampere/Hopper): Conflicts are evaluated across the full 32-thread warp. If Thread 0 and Thread 16 access different addresses in Bank 0, it is a 2-way conflict.10

NVIDIA’s Nsight Compute profiler refers to the serialized requests resulting from conflicts as “Excessive Wavefronts” or “Replays”.13 An ideal shared memory instruction generates 1 wavefront. A 2-way conflict generates 2 wavefronts, effectively stalling the warp for an additional cycle.

4. Analytical Deconstruction of Access Patterns

To avoid conflicts, one must understand how geometric access patterns map to linear memory addresses. The interplay between the logical stride of the algorithm and the physical modulo-32 mapping of the hardware is the determinant of performance.

4.1 Linear vs. Strided Access

4.1.1 Unit Stride (Conflict-Free)

The most efficient pattern is unit stride, where Thread[tid] accesses Data[tid].

  • Address for thread $t$: $A_t = 4 \times t$.
  • Bank index: $(4t / 4) \pmod{32} = t \pmod{32}$.
  • Since $t$ ranges from 0 to 31, every result is unique. This is conflict-free.3

4.1.2 Odd Strides (Conflict-Free)

Accessing memory with a stride $S$ means Thread[tid] accesses Data[tid * S].

  • Bank index: $(t \times S) \pmod{32}$.
  • Mathematical Insight: The number of banks, 32, is $2^5$. Its only prime factor is 2. Therefore, any odd integer $S$ is coprime to 32.
  • Number Theory Consequence: If $S$ and $N$ are coprime, the mapping $t \mapsto (t \times S) \pmod N$ is a bijection (a one-to-one mapping) for $t$ in $0..N-1$.
  • Result: Any odd stride (1, 3, 5, etc.) permutes the bank assignments but ensures that all 32 threads still map to 32 unique banks. There are no bank conflicts with odd strides.12

4.1.3 Even Strides (Conflict Prone)

If the stride $S$ is even, it shares a common factor (2) with the modulus 32.

  • Stride 2: Threads $t$ and $t+16$ map to the same bank.
    • $t=0 \rightarrow 0 \pmod{32}$.
    • $t=16 \rightarrow 16 \times 2 = 32 \rightarrow 0 \pmod{32}$.
    • Result: 2-way bank conflict.18
  • Stride 4: Threads $0, 8, 16, 24$ map to Bank 0. Result: 4-way bank conflict.
  • Stride 32: All 32 threads map to Bank 0. Result: 32-way bank conflict (Sequential execution).18
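
These cases are easy to reproduce with a micro-benchmark. The kernel below is a hypothetical sketch (not from the source) in which each thread of a single warp reads shared memory at word index tid × STRIDE; odd strides complete in one wavefront, while even strides serialize as described above:

C++

// Hypothetical micro-benchmark: shared-memory read with a compile-time stride.
// STRIDE odd  -> all 32 threads hit distinct banks (conflict-free).
// STRIDE = 2  -> 2-way conflict; STRIDE = 32 -> 32-way conflict.
template <int STRIDE>
__global__ void strided_read(float* out) {
    __shared__ float smem[32 * 32];
    int tid = threadIdx.x;                      // launch with one warp (32 threads)

    // Populate shared memory with unit stride (conflict-free).
    for (int i = tid; i < 32 * 32; i += 32) smem[i] = static_cast<float>(i);
    __syncwarp();

    // Word index tid * STRIDE maps to bank (tid * STRIDE) % 32.
    out[tid] = smem[tid * STRIDE];
}

// Example launch: strided_read<32><<<1, 32>>>(d_out);  // worst case: 32-way conflict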

4.2 The “Power of Two” Trap in 2D Tiling

The most pervasive source of bank conflicts in CUDA programming arises from 2D data structures, particularly when processing images or matrices with dimensions that are powers of two (e.g., $32 \times 32$, $64 \times 64$).

Consider a thread block loading a $32 \times 32$ tile of floats into shared memory:

C++

__shared__ float tile[32][32];

4.2.1 Row-Major Access (Good)

When threads access a row (e.g., performing a reduction across columns), Thread[x] accesses tile[row][x].

  • Address offset is linear with $x$.
  • Stride is 1.
  • Status: Conflict-Free.

4.2.2 Column-Major Access (Bad)

When threads access a column (e.g., during a matrix transpose or vertical stencil), Thread[y] accesses tile[y][col].

  • In row-major memory layout, tile[y][col] is physically located at tile + y * 32 + col.
  • Logical Stride: As Thread[y] increments $y$, the address jumps by 32 words (the width of the row).
  • Bank Mapping (taking col = 0 for illustration):
    • Thread 0 reads tile[0][0] (word offset 0) $\rightarrow$ Bank 0.
    • Thread 1 reads tile[1][0] (word offset 32) $\rightarrow$ Bank $32 \pmod{32} = 0$.
    • Thread 2 reads tile[2][0] (word offset 64) $\rightarrow$ Bank $64 \pmod{32} = 0$.
  • Status: 32-way Bank Conflict. Every thread in the warp fights for Bank 0. The hardware serializes this, effectively reducing the shared memory bandwidth to that of a single bank (4 bytes/cycle).21

This “Columnar Access” problem is the primary adversary in optimizing GEMM (General Matrix Multiply) and FFT kernels.
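
The hazard is easiest to see in a naive transpose fragment. The kernel below is a hypothetical sketch: the row-major store is conflict-free, while the column-major load serializes 32 ways exactly as described above.

C++

// Hypothetical naive 32x32 tile transpose illustrating the column hazard.
// Launch with blockDim = (32, 32).
__global__ void transpose_naive(const float* in, float* out) {
    __shared__ float tile[32][32];
    int x = threadIdx.x, y = threadIdx.y;

    // Row-major store: a warp (fixed y, x = 0..31) touches 32 distinct banks.
    tile[y][x] = in[y * 32 + x];
    __syncthreads();

    // Column-major load: a warp varies the row index, so consecutive threads
    // are 32 words apart and all hit the same bank -> 32-way conflict.
    out[y * 32 + x] = tile[x][y];
}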

5. Software Optimization Strategies

Before resorting to specialized hardware features, several algorithmic transformations can mitigate or eliminate bank conflicts.

5.1 Memory Padding

The most straightforward solution to the power-of-two stride problem is padding. By inserting unused “dummy” elements into the data structure, we can alter the physical stride without changing the logical access logic significantly.

  • Implementation: Change the shared memory declaration from tile[32][32] to tile[32][33] (i.e., [32][32+1]).26
  • Mechanism:
    • The logical row width remains 32 for the application’s data.
    • The physical stride between rows becomes 33 words.
    • 33 is odd and therefore coprime to 32.
  • Result: When accessing a column:
    • Thread 0 accesses Row 0 (word offset 0) $\rightarrow$ Bank 0.
    • Thread 1 accesses Row 1 (word offset 33) $\rightarrow$ Bank 1.
    • Thread $i$ accesses Row $i$ (word offset $33i$) $\rightarrow$ Bank $33i \pmod{32} = i$.
  • Status: Conflict-Free.
  • Trade-offs: Padding consumes additional shared memory, which is a limited resource (typically 48KB – 163KB per SM, depending on architecture and configuration). While effective for 2D arrays, it complicates index calculations (y * 33 + x) and may not be feasible for very complex high-dimensional tensors.26
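
Applied to the transpose example, the fix is a single change to the declaration (a sketch of the standard padding idiom):

C++

// Padded tile: each logical row occupies 33 physical words; 33 is coprime to
// 32, so a column access walks through 32 distinct banks.
__global__ void transpose_padded(const float* in, float* out) {
    __shared__ float tile[32][32 + 1];          // +1 column of padding
    int x = threadIdx.x, y = threadIdx.y;

    tile[y][x] = in[y * 32 + x];                // still conflict-free
    __syncthreads();

    // Column read: thread x touches word offset x * 33 + y,
    // i.e. bank (x * 33 + y) % 32 = (x + y) % 32 -> unique per thread.
    out[y * 32 + x] = tile[x][y];
}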

5.2 XOR Swizzling (Software Implementation)

For cases where memory conservation is critical or padding is difficult to implement (e.g., when viewing the same memory buffer as different shapes), swizzling is the superior technique. Swizzling involves permuting the mapping between logical coordinates and physical addresses using bitwise XOR operations.

5.2.1 The Logic of XOR

The bitwise XOR operation ($\oplus$) has a unique property useful for conflict resolution: $A \oplus B \neq A \oplus C$ whenever $B \neq C$. By XORing the row index into the column index (which determines the bank), we can “disturb” the bank assignment in a deterministic way.

A standard swizzling pattern for a matrix tile might look like this:

C++

// Standard conflict-prone index
int linear_idx = row * 32 + col;

// Swizzled index
int swizzled_col = col ^ row;
int swizzled_idx = row * 32 + swizzled_col;

In this scheme:

  • When accessing a row (constant row, variable col), swizzled_col is a permutation of 0..31 (XOR with a fixed value is a bijection), so the warp still touches all 32 banks exactly once (conflict-free).
  • When accessing a column (constant col, variable row), swizzled_col changes as row changes. Even though the base address steps by 32 words per thread, the XOR component shifts the bank index:
    • Row 0, Col 0 $\rightarrow$ Col $0 \oplus 0 = 0$ $\rightarrow$ Bank 0.
    • Row 1, Col 0 $\rightarrow$ Col $0 \oplus 1 = 1$ $\rightarrow$ Bank 1.
    • Row 2, Col 0 $\rightarrow$ Col $0 \oplus 2 = 2$ $\rightarrow$ Bank 2.
  • Status: Conflict-Free.

This technique effectively diagonalizes the access pattern in physical memory while maintaining the logical structure.28
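
Applied to the same $32 \times 32$ tile, the swizzle replaces the padding at zero memory cost. The kernel below is a hypothetical sketch; the XOR is folded into both the store and the load so the logical contents are unchanged.

C++

// Hypothetical XOR-swizzled tile: same footprint as the unpadded version.
__global__ void transpose_swizzled(const float* in, float* out) {
    __shared__ float tile[32][32];
    int x = threadIdx.x, y = threadIdx.y;

    // Store logical (row y, col x) at physical column x ^ y. Within a row
    // this is a permutation of 0..31, so the warp still spans 32 banks.
    tile[y][x ^ y] = in[y * 32 + x];
    __syncthreads();

    // Load logical (row x, col y) from physical column y ^ x. The bank is
    // (x * 32 + (y ^ x)) % 32 = y ^ x, distinct for each thread x in the warp.
    out[y * 32 + x] = tile[x][y ^ x];
}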

5.2.2 CuTe and Layout Algebra

Implementing complex swizzling manually is error-prone. NVIDIA’s CuTe library (part of CUTLASS 3.0) formalizes this using “Layout Algebra.” CuTe represents data layouts as a composition of Shape, Stride, and Swizzle functors.

The Swizzle<B, M, S> functor abstracts the bit manipulation:

  • B (BBits): The number of bits in the mask (usually 3 for 8 banks, 5 for 32 banks).
  • M (MBase): The number of least-significant bits to ignore (shifts the mask to preserve vector atomicity).
  • S (SShift): The shift distance for the XOR operation.31

For example, Swizzle<3, 3, 3> is a common pattern that permutes the layout to avoid conflicts for 64-bit accesses on specific architectures. CuTe allows the compiler to mathematically prove that a layout is conflict-free at compile time.31

5.3 Vectorized Load Optimization

Modern CUDA optimization heavily relies on vectorized instructions (LDS.64, LDS.128) which load 64 or 128 bits per thread in a single instruction. This reduces the total number of instructions (instruction fetch/decode overhead) and improves bandwidth efficiency.

  • 128-bit Loads (float4): A warp executing LDS.128 requests $32 \text{ threads} \times 16 \text{ bytes} = 512 \text{ bytes}$.
  • Bank Usage: Since each bank delivers 4 bytes per cycle, a 512-byte request covers $512 / 4 = 128$ words. With only 32 banks, the request wraps around the bank array 4 times and is serviced in multiple conflict-free phases rather than a single one.
  • Alignment: If the data is 128-bit aligned, Thread 0 accesses Banks 0-3, Thread 1 accesses Banks 4-7, etc. This is conflict-free.
  • Risk: If the base address is not aligned, or if the stride causes threads to overlap their 4-bank windows (e.g., a stride of 2 float4s), conflicts will re-emerge with high penalties (quadruple conflicts). Optimization using float4 types requires rigorous adherence to alignment constraints.13
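
A minimal sketch (hypothetical, assuming a single warp and a 16-byte-aligned shared buffer) of a conflict-free 128-bit load:

C++

// Hypothetical LDS.128 example: each thread reads one float4 (16 bytes).
// With a 16-byte-aligned base, the warp's request is serviced in
// conflict-free phases (threads 0-7, 8-15, ... each cover banks 0-31 once).
__global__ void vectorized_read(float4* out) {
    __shared__ __align__(16) float smem[32 * 4];
    int tid = threadIdx.x;                      // launch with 32 threads

    for (int i = tid; i < 32 * 4; i += 32) smem[i] = static_cast<float>(i);
    __syncwarp();

    // The aligned pointer lets the compiler emit a single LDS.128 per thread.
    out[tid] = reinterpret_cast<const float4*>(smem)[tid];
}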

6. The Asynchronous Revolution: Ampere and Beyond

The introduction of the NVIDIA Ampere architecture (Compute Capability 8.0) marked a paradigm shift in how shared memory is utilized, moving from a synchronous “load-store” model to an asynchronous “copy-compute” model.

6.1 cp.async: Bypassing the Register File

In pre-Ampere architectures (Volta and older), loading data from global memory to shared memory required an intermediate step:

 

$$\text{Global} \xrightarrow{\text{LDG}} \text{Register} \xrightarrow{\text{STS}} \text{Shared Memory}$$

 

This consumed register file bandwidth and required the thread to execute load and store instructions explicitly.

Ampere introduced the cp.async (asynchronous copy) instruction. This instruction initiates a transfer directly from Global Memory to Shared Memory, bypassing the register file entirely.35

 

$$\text{Global} \xrightarrow{\text{L2/L1}} \text{Shared Memory}$$

6.1.1 Pipeline Stages and Latency Hiding

Because cp.async is non-blocking, a thread can issue a batch of copy commands and then immediately proceed to perform arithmetic computations (e.g., FFMA) on previously loaded data. This allows for perfect overlapping of memory transfer and compute (software pipelining).
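
A minimal sketch of this pattern using the CUDA pipeline primitives (available on CC 8.0+); the kernel shape and buffer sizes here are assumptions for illustration, and a real kernel would defer the wait behind independent arithmetic:

C++

#include <cuda_pipeline.h>

// Hypothetical single-stage example: each thread issues one 16-byte cp.async
// from global to shared memory, then waits before consuming the data.
__global__ void async_stage(const float4* gmem, float* out) {
    __shared__ float4 smem[32];
    int tid = threadIdx.x;                      // launch with 32 threads

    __pipeline_memcpy_async(&smem[tid], &gmem[tid], sizeof(float4));
    __pipeline_commit();

    // ... independent computation could overlap with the in-flight copy here ...

    __pipeline_wait_prior(0);                   // all committed copies complete
    __syncthreads();

    out[tid] = smem[tid].x + smem[tid].y + smem[tid].z + smem[tid].w;
}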

6.1.2 Conflict Implications in Async Copies

While cp.async offloads the data movement, bank conflicts are still relevant. The data must eventually be written into the shared memory banks. If the target addresses in shared memory have bank conflicts, the cp.async commitment (the point where the data becomes visible) is delayed.

  • L1-to-Shared Bottleneck: The path from L1 cache to shared memory has finite bandwidth. Heavy bank conflicts during the write phase of cp.async can saturate this path, causing back-pressure that eventually stalls the cp.async.wait_group or mbarrier instructions.37
  • Resolution: Developers must still ensure that the destination pointers for cp.async are swizzled or padded to avoid conflicts, just as they would for synchronous stores.

7. The Hopper Tensor Memory Accelerator (TMA)

With the H100 (Hopper, Compute Capability 9.0), NVIDIA introduced the Tensor Memory Accelerator (TMA), a hardware unit that supersedes cp.async for bulk tensor transfers. TMA represents the ultimate evolution of shared memory management.7

7.1 Hardware-Accelerated Swizzling

TMA allows the programmer to define a “Tensor Map” (using the CUtensorMap API) that describes the layout of a tensor in global memory. When loading a tile of this tensor into shared memory, the TMA hardware automatically swizzles the data as it writes it to the banks.

This removes the need for software to calculate XOR indices. The developer simply selects a swizzle mode in the descriptor, and the hardware arranges the bytes in shared memory to ensure conflict-free reading by the Tensor Cores.38

The supported swizzle modes are specific enums in the CUDA driver API:

  1. CU_TENSOR_MAP_SWIZZLE_NONE: Linear layout.
  2. CU_TENSOR_MAP_SWIZZLE_32B: Swizzles for 32-byte access atoms.
  3. CU_TENSOR_MAP_SWIZZLE_64B: Swizzles for 64-byte access.
  4. CU_TENSOR_MAP_SWIZZLE_128B: Swizzles for 128-byte access.38

The 128B mode is particularly critical because it aligns with the access pattern of the WGMMA (Warp Group Matrix Multiply Accumulate) instruction. WGMMA allows Hopper Tensor Cores to read operands directly from shared memory. By using TMA with 128B swizzling, the data lands in shared memory in exactly the permutation required for WGMMA to read it without any bank conflicts.36
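
For reference, the host-side descriptor setup looks roughly as follows. This is a hedged sketch using the driver API: the matrix shape, tile size, and the make_tile_map helper are assumptions, and error checking plus the device-side bulk tensor copy are omitted.

C++

#include <cuda.h>   // driver API: cuTensorMapEncodeTiled

// Hypothetical helper: build a TMA descriptor for a row-major M x N float
// matrix, requesting the 128-byte swizzle so 32x32-float tiles land in
// shared memory in a conflict-free, WGMMA-friendly permutation.
// N is assumed to be a multiple of 4 so the row pitch is 16-byte aligned.
CUtensorMap make_tile_map(void* d_matrix, cuuint64_t M, cuuint64_t N) {
    CUtensorMap map;
    cuuint64_t globalDim[2]     = {N, M};                // innermost dimension first
    cuuint64_t globalStrides[1] = {N * sizeof(float)};   // row pitch in bytes
    cuuint32_t boxDim[2]        = {32, 32};              // 32 floats = 128 B inner extent
    cuuint32_t elemStrides[2]   = {1, 1};

    cuTensorMapEncodeTiled(&map,
                           CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*tensorRank=*/2, d_matrix,
                           globalDim, globalStrides, boxDim, elemStrides,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_128B,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return map;
}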

7.2 Distributed Shared Memory (DSM) and Multicast

Hopper introduces Thread Block Clusters, allowing multiple thread blocks (e.g., a cluster of 8 blocks) to run on adjacent SMs and communicate directly.

  • DSM: A thread in Block A can directly access the shared memory of Block B within the same cluster. This creates a unified “Distributed Shared Memory” space.
  • TMA Multicast: The TMA can fetch a tile from global memory and write it simultaneously to the shared memories of all thread blocks in the cluster.8
  • Impact: This eliminates redundant global memory loads. If 8 blocks need the same weight matrix for a GEMM, TMA loads it once and multicasts it to 8 SMs. This effectively multiplies the logical bandwidth of the global memory link by the cluster size.
  • Bank Conflicts in Multicast: The multicast writes must be bank-conflict free across all target SMs. The hardware swizzling ensures this consistency.8

8. Performance Profiling and Verification: Nsight Compute

Theoretical optimization is insufficient; verification via profiling is mandatory. NVIDIA Nsight Compute is the definitive tool for diagnosing shared memory issues.

8.1 Critical Metrics

The Memory Workload Analysis section of Nsight Compute provides granular data on bank conflicts.11

8.1.1 Bank Conflict Counters

  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum: The total number of bank conflicts generated by load instructions.
  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum: The total number of bank conflicts generated by store instructions.
  • Insight: A non-zero value here confirms that conflicts are occurring. However, the absolute number must be contextualized by the total number of instructions.
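
These counters can be collected directly from the command line; for example (assuming the application binary is ./app):

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./app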

8.1.2 Wavefront Analysis

A deeper insight comes from analyzing Wavefronts.

  • Metric: L1 Wavefronts Shared Excessive.24
  • Concept: A “Wavefront” is a packet of work sent to the memory unit. Ideally, 1 instruction = 1 wavefront. If conflicts occur, the hardware splits the instruction into multiple wavefronts.
  • Interpretation: If L1 Wavefronts Shared is significantly higher than L1 Wavefronts Shared Ideal, the kernel is paying a serialization penalty. If the “Excessive” count equals the “Ideal” count, the total wavefronts are twice the ideal, which implies an average 2-way conflict (performance halved).42

8.2 Correlating SASS to Source Code

Nsight Compute allows developers to navigate from these high-level metrics down to the SASS (the GPU’s native assembly) view.

  1. Open the “Source” page in Nsight.
  2. Sort by “Bank Conflicts” or “Excessive Wavefronts”.
  3. The tool highlights the exact line of CUDA C++ (and the corresponding LDS/STS assembly instruction) causing the bottleneck.24

This correlation is vital because a single line of un-swizzled code (e.g., a column-major store of the tile[y][col] form from Section 4.2.2) can silently cripple the performance of an entire kernel.

8.3 Distinguishing Data Conflicts from Pipeline Stalls

It is crucial not to confuse bank conflicts with other stalls.

  • Data Conflict: The warp is ready to execute, but the memory unit is busy resolving serialization for the current instruction. (Metric: shared_mem_bank_conflict stall reason).
  • Pipeline Stall: The warp is waiting for a previous instruction (e.g., a global load) to return data. (Metric: long_scoreboard or MIO-related stalls).
  • Throttle: The shared memory pipe is full because too many warps are issuing requests, even if they are conflict-free. (Metric: shared_mem_throttle).11

Optimization of bank conflicts only yields performance gains if the kernel is actually stalled on shared_mem_bank_conflict or shared_mem_throttle.

9. Conclusion

The effective utilization of shared memory remains the defining characteristic of expert-level CUDA programming. While the abstraction of the “bank” has remained constant—32 modules of 32 bits—the ecosystem surrounding it has transformed. From the manual padding required in the Fermi era to the mathematical elegance of XOR swizzling, and finally to the hardware-automated data movement of the Hopper TMA, the trend is clear: Data layout is not an implementation detail; it is a structural primitive.

For the modern practitioner, the path to performance involves a hierarchy of strategies:

  1. Algorithmic Design: Prefer unit-stride and vector-friendly access patterns from the outset.
  2. Layout Abstraction: Adopt libraries like CuTe to manage the complexity of tensor layouts and swizzling math, ensuring correctness by construction.
  3. Hardware Alignment: Leverage cp.async and TMA to offload data movement and hide latency, ensuring that the layouts used match the hardware’s swizzling capabilities (e.g., 128B).
  4. Data-Driven Verification: Rely on Nsight Compute to validate “conflict-free” assumptions, looking specifically for excessive wavefronts and shared memory throttling.

By mastering these elements, developers can unlock the tremendous bandwidth potential of the GPU, ensuring that the compute units are never starved of data, and bridging the gap between theoretical peaks and realized application performance.

Table 1: Comparative Analysis of Shared Memory Conflict Resolution

| Resolution Strategy | Mechanism | Complexity | Hardware Support | Use Case |
|---|---|---|---|---|
| Padding | Adds dummy elements ([N][N+1]) to offset strides. | Low | Universal | Simple 2D arrays; legacy code. |
| XOR Swizzling | Permutes indices using bitwise XOR (col ^ row). | Medium | Universal (software) | Complex tensor layouts; avoiding power-of-2 conflicts. |
| Vectorization | Uses float4 (128-bit) to reduce instruction count. | High (alignment) | Universal | High-bandwidth kernels; strictly aligned data. |
| Async Copy | cp.async bypasses the register file. | Medium | Ampere+ | Latency hiding; pipelined loops. |
| TMA Swizzling | Hardware engine handles address mapping. | High (API) | Hopper+ | Tensor Core workloads (WGMMA); H100 optimization. |

Table 2: Nsight Compute Metrics for Bank Conflict Analysis

| Metric ID | Description | Diagnosis |
|---|---|---|
| l1tex__data_bank_conflicts…sum | Raw count of conflict events. | Positive values indicate presence of conflicts. |
| smsp__pcsamp_warps_issue_stalled_shared_mem_throttle | Warp stalls due to SMEM back-pressure. | High values indicate SMEM is the bottleneck. |
| L1 Wavefronts Shared Excessive | Wavefronts created purely by serialization. | Direct measure of performance loss (excessive equal to ideal = 2x slower). |
| l1tex__throughput.avg.pct_of_peak | Achieved bandwidth vs. peak. | If low (<60%) while conflicts are high, optimization is high-priority. |