Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing

The Imperative for In-Memory Computation

Deconstructing the “Memory Wall”: Performance and Energy Bottlenecks in von Neumann Architectures

For decades, the advancement of computing has been governed by the processor-centric von Neumann architecture, which fundamentally separates processing units (CPUs, GPUs) from memory units. This separation necessitates the constant movement of data between where it is stored and where it is processed. The performance of processors has historically improved at a much faster rate than that of memory, creating an ever-widening disparity known as the “memory wall”.1 In modern data-intensive workloads, such as high-performance computing (HPC) and artificial intelligence (AI), this gap has become a critical system bottleneck, forcing powerful processors to spend a significant portion of their execution cycles idle, waiting for data to arrive from memory.1

However, the memory wall is no longer just a performance or latency problem; it has evolved into a severe energy crisis. The energy consumed by moving data across the chip and between the processor and main memory can vastly exceed the energy required for the actual computation.3 In some modern systems, data movement has been reported to account for as much as 62% of the total system energy, creating what is often termed the “von Neumann bottleneck”.5 As AI models and datasets grow exponentially, this energy expenditure has become a primary limiting factor, driving an urgent need for a paradigm shift in computer architecture.3 The economic and environmental costs associated with powering large-scale data centers for AI have elevated energy efficiency from a secondary concern to a primary driver for architectural innovation.


The PIM Paradigm: Shifting Computation Closer to Data

Processing-in-Memory (PIM), also referred to as Compute-in-Memory (CIM), offers a fundamental solution to the data movement bottleneck by challenging the processor-centric paradigm.9 Instead of moving massive amounts of data to a distant processor, PIM integrates computational capabilities directly within or near the memory arrays.10 By making memory systems “compute-capable,” PIM drastically reduces or eliminates the long and energy-intensive journey data must take to be processed.10

This concept is not new, with roots tracing back to the 1960s, but its practical realization has been catalyzed by two recent developments: the maturation of advanced semiconductor packaging technologies, particularly 3D stacking, and the insatiable demand for performance and energy efficiency from AI workloads.6 It is important to distinguish this hardware-level architectural concept from the software-level practice of “in-memory processing” used in applications like in-memory databases, where data is held in RAM to avoid slower disk access.9 Architectural PIM represents a more profound change, blurring the traditional lines between storage and computation.

 

A Taxonomy of PIM: Processing-Near-Memory (PNM) vs. Processing-Using-Memory (PUM)

The PIM paradigm encompasses a spectrum of approaches that can be broadly classified into two main categories, distinguished by the proximity and nature of the integrated computation.

Processing-Near-Memory (PNM) involves placing discrete, conventional logic units near the memory arrays. In modern implementations, this often means integrating processing elements (PEs) onto the logic layer of a 3D-stacked memory device or at the periphery of memory banks on a 2D chip.11 This approach is pragmatic, as it minimizes modifications to the highly optimized and dense core memory arrays, thereby reducing design risk and manufacturing complexity. Most commercial PIM products, such as Samsung’s HBM-PIM, follow the PNM model by placing compute units at the bank boundary.15

Processing-Using-Memory (PUM) represents a more radical and deeply integrated approach. PUM leverages the intrinsic analog operational properties of the memory cells themselves to perform computation.6 For instance, by simultaneously activating multiple rows in a DRAM array, the resulting charge sharing on the bitlines can be used to perform massively parallel bitwise logic operations.10 While PUM offers the highest potential for parallelism and efficiency by turning every memory column into a parallel ALU, it typically requires more significant modifications to the core memory cell and peripheral circuitry. Due to the higher design and manufacturing risks, most current commercial efforts are focused on the PNM strategy.20

The emergence of these two distinct technological paths signals a strategic divergence in the evolution of memory. The industry is moving away from a one-size-fits-all memory hierarchy toward a future of domain-specific memory, where different memory components are optimized for distinct roles—some for pure storage, others for specific computational tasks. This specialization is evident in the development of DRAM-based PIM for high-bandwidth applications and SRAM-based CIM for low-latency, high-efficiency tasks, as summarized in Table 1. This trend will fundamentally reshape system design, requiring architects to build systems from a heterogeneous mix of memory and compute components.

| Characteristic | DRAM-Based PIM | SRAM-Based CIM |
| --- | --- | --- |
| Primary Goal | High Bandwidth / Capacity | Low Latency / High Energy Efficiency |
| Storage Density | High | Low |
| Latency | Higher (tens of ns) | Lower (single-digit ns) |
| Compute Granularity | Coarse-grained (e.g., vector operations) | Fine-grained (e.g., bitwise, MAC) |
| Technology Maturity | Mature (highly optimized process) | Mature (CMOS-compatible) |
| Primary Compute Function | Vector ALU, Floating Point | Multiply-Accumulate (MAC) |
| Target Applications | HPC, Data Center AI (LLMs), Databases | Edge AI, Accelerators, On-device ML |
Table 1: Comparative Overview of DRAM-PIM vs. SRAM-CIM. This table provides a high-level comparison of the two main technological branches of PIM, establishing the fundamental trade-offs that define their respective roles in modern computing systems.[5, 13, 20]

 

DRAM-Based Processing-in-Memory Architectures

 

Architectural Principles: Leveraging DRAM’s Internal Bandwidth and Parallelism

 

Dynamic Random Access Memory (DRAM) has long been the cornerstone of main memory in computing systems. The primary motivation for developing DRAM-based PIM is to harness the massive internal bandwidth available within a DRAM chip. This internal bandwidth, accessible between the memory arrays and the chip’s periphery, can be an order of magnitude or more greater than the bandwidth of the external memory channel that connects the DRAM to the processor.6

To understand how PIM exploits this, it is essential to consider the hierarchical structure of a modern DRAM device, which consists of channels, ranks, banks, and subarrays.14 By integrating small processing units at or near each memory bank, a PIM architecture can activate and process data from all banks in parallel. This bank-level parallelism allows the system to perform computations at a throughput that is dictated by the vast internal bus width, effectively bypassing the narrow off-chip interface for PIM-accelerated operations.15 PIM units can be integrated at various levels of this hierarchy, but near-bank computing has emerged as the most commercially viable approach, offering a balance between performance gains and design feasibility.14
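To make the bandwidth argument concrete, the toy model below compares the time to stream a dataset once over the external channel against processing it with near-bank units that each see only their own slice. Every figure (channel bandwidth, per-bank bandwidth, bank count, dataset size) is an illustrative assumption, not a specification of any product.

```cpp
// Minimal bandwidth model: a kernel that touches every byte once is limited by
// the aggregate per-bank bandwidth when run near the banks, but by the external
// channel when run on the host. All numbers below are assumptions.
#include <cstdio>

int main() {
    const double external_channel_GBps = 64.0;   // assumed off-chip channel bandwidth
    const double per_bank_GBps         = 32.0;   // assumed internal bandwidth per bank
    const int    banks                 = 16;     // assumed banks processed in parallel
    const double dataset_GB            = 8.0;    // data the kernel must stream once

    double host_time_s = dataset_GB / external_channel_GBps;
    double pim_time_s  = dataset_GB / (per_bank_GBps * banks);

    std::printf("host-side streaming time: %.3f s\n", host_time_s);
    std::printf("near-bank PIM time:       %.4f s (%.1fx faster)\n",
                pim_time_s, host_time_s / pim_time_s);
}
```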

 

Commercial Implementations and Case Studies

 

Several major memory vendors have introduced commercial or near-commercial DRAM-PIM products, each tailored to a specific market segment and leveraging a different type of DRAM technology.

 

Samsung HBM-PIM: Architecture of the Programmable Computing Unit (PCU)

 

Samsung’s “Aquabolt-XL” was the industry’s first commercially fabricated High Bandwidth Memory (HBM) device with integrated PIM capabilities.15 This HBM2-PIM architecture embodies the PNM approach by integrating a Programmable Computing Unit (PCU) within each memory bank.12 The PCU is an AI-focused engine, architecturally a 16-lane Single Instruction, Multiple Data (SIMD) array capable of performing 16-bit floating-point (FP16) operations, complete with its own lightweight control logic and register files.15

A key aspect of Samsung’s strategy was to design the HBM-PIM as a “drop-in replacement” for conventional HBM2 modules. This was achieved by placing the PCUs at the bank boundary and preserving the standard JEDEC HBM2 interface and timing protocols.12 This design choice significantly lowers the barrier to adoption for system integrators. In terms of performance, Samsung has reported a 2x performance improvement in applications like speech recognition and an energy reduction of over 70% compared to standard HBM.12 When integrated into accelerator systems, such as the AMD MI-100 GPU or the Xilinx Alveo FPGA, HBM-PIM has demonstrated system-level performance gains of up to 2.5x and energy savings exceeding 60% for workloads dominated by General Matrix-Vector multiplication (GEMV) and Long Short-Term Memory (LSTM) operations.4
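Functionally, the PCU's datapath can be pictured as a 16-lane lock-step multiply-accumulate, as in the sketch below. This is a behavioral illustration only, not Samsung's implementation: FP16 arithmetic is emulated with single-precision floats for portability, and the function name pcu_fma is ours.

```cpp
// Functional sketch of one "PCU tick" as a 16-lane fused multiply-add.
// FP16 operands are emulated here with float for portability.
#include <array>
#include <cstdio>

constexpr int kLanes = 16;  // lane count reported for the PCU's SIMD array

using Vec16 = std::array<float, kLanes>;

// acc[i] += a[i] * b[i] for all 16 lanes in lock-step (SIMD semantics).
void pcu_fma(Vec16& acc, const Vec16& a, const Vec16& b) {
    for (int i = 0; i < kLanes; ++i) acc[i] += a[i] * b[i];
}

int main() {
    Vec16 acc{}, a, b;
    a.fill(0.5f);
    b.fill(2.0f);
    pcu_fma(acc, a, b);                                   // one SIMD MAC step
    std::printf("lane 0 accumulator = %.2f\n", acc[0]);   // 1.00
}
```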

 

SK Hynix Accelerator-in-Memory (AiM): GDDR6 for High-Throughput Compute

 

SK Hynix has pursued a different path with its Accelerator-in-Memory (AiM) technology, which is based on high-speed GDDR6 memory.24 AiM is explicitly designed as a “domain-specific memory” to accelerate memory-intensive machine learning workloads, particularly the GEMV operations that are fundamental to modern transformer models and Large Language Models (LLMs).24

Each GDDR6-AiM chip integrates 32 Processing Units (PUs) and is capable of delivering 1 TFLOPS of compute throughput using Brain Floating Point 16 (BF16) precision.25 The core architectural innovation is the concept of “all-bank parallelism,” which is enabled through an extended set of DRAM commands. These new commands, such as MACAB (MAC Across all Banks), allow a host controller to orchestrate simultaneous computation across all PUs in the chip.24 This allows the architecture to fully leverage the massive internal memory bandwidth (rated at 0.5 TB/s) for computation, which is approximately 8x greater than the chip’s external I/O bandwidth (64 GB/s).27 SK Hynix’s AiMX accelerator card, which populates a board with multiple AiM chips, is designed to function as a co-processor to a GPU, offloading the memory-bound stages of LLM inference to improve overall system efficiency and throughput.26
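The sketch below gives a host's-eye view of how a GEMV might be striped across banks so that an all-bank MAC command can advance every partial dot product in the same step. The loop structure, matrix sizes, and row-wise striping are assumptions chosen for illustration; the actual AiM command protocol and data layout are defined by SK Hynix.

```cpp
// Host-side view of "all-bank" GEMV (a sketch, not SK Hynix's command protocol):
// the weight matrix is striped row-wise across banks, a broadcast MAC command
// lets every bank compute its partial dot products concurrently, and the host
// only reads back the short result vector.
#include <cstdio>
#include <vector>

int main() {
    const int banks = 32, rows = 256, cols = 512;     // sizes are illustrative
    std::vector<float> W(rows * cols, 0.01f), x(cols, 1.0f), y(rows, 0.0f);

    // Each bank owns a contiguous slice of rows; in hardware these outer
    // iterations would run concurrently under a single all-bank MAC command.
    const int rows_per_bank = rows / banks;
    for (int b = 0; b < banks; ++b) {
        for (int r = b * rows_per_bank; r < (b + 1) * rows_per_bank; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) acc += W[r * cols + c] * x[c];
            y[r] = acc;                                // stays local to the bank
        }
    }
    std::printf("y[0] = %.2f\n", y[0]);                // 512 * 0.01 = 5.12
}
```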

 

DIMM-Based PIM: AXDIMM and the Path to Mainstream Adoption

 

While HBM and GDDR6 represent high-performance niches, PIM technology is also being integrated into the more ubiquitous Dual In-line Memory Module (DIMM) form factor for mainstream servers. Samsung’s Acceleration DIMM (AXDIMM) is a prime example, placing an AI engine on the buffer chip of a standard DIMM.15 This allows the AXDIMM to perform parallel processing across the multiple memory ranks (sets of DRAM chips) on the module, a task not possible with conventional DIMMs.23 While less tightly integrated than HBM-PIM, this approach provides a more straightforward upgrade path for existing server infrastructure. For AI-based recommendation applications, AXDIMM has demonstrated an approximate 2x performance gain with a 40% reduction in system-wide energy consumption.23

Similarly, the French company UPMEM has commercialized a DIMM-based PIM solution that integrates multiple general-purpose 64-bit in-order cores, which they call DRAM Processing Units (DPUs), onto their DRAM chips.6 These DIMMs are compatible with standard commodity servers and can provide hundreds of gigabytes of compute-capable memory, targeting data-intensive applications like genomics, analytics, and search.29

The convergence of major vendors and academic research on the near-bank PNM architecture suggests that the industry has, for the present, settled on a pragmatic architectural template. This approach involves placing small, specialized compute units like SIMD or MAC arrays at the periphery of each memory bank. This strategy allows memory manufacturers to exploit the vast internal parallelism of DRAM to add computational value, all while preserving their core intellectual property and leveraging their highly optimized manufacturing processes without disruptive changes to the memory array itself.17

| Specification | Samsung HBM-PIM (Aquabolt-XL) | SK Hynix GDDR6-AiM |
| --- | --- | --- |
| Base Memory Type | HBM2 | GDDR6 |
| Compute Throughput | 4.9 TFLOPS (per GPU with 4 cubes) | 1 TFLOPS per chip |
| Numeric Precision | FP16 | BF16 |
| PIM Unit Architecture | 16-lane SIMD array (PCU) per 2 banks | MAC-based Processing Unit (PU) per bank |
| Number of PIM Units | 16 per stack | 32 per chip |
| Key Architectural Feature | Drop-in replacement, JEDEC compliance | All-bank parallelism via extended commands |
| Target Workloads | GEMV, Speech Recognition, LLMs | GEMV, Transformers, LLMs |
Table 2: Technical Specifications of Commercial PIM Solutions. This table provides a direct comparison of the leading commercial DRAM-PIM products, highlighting their capabilities and target use cases based on published specifications.[4, 15, 24, 25]

 

Circuit-Level Modifications and Design Constraints

 

The integration of logic into DRAM is a non-trivial engineering challenge, governed by the unique and highly optimized DRAM manufacturing process. The chosen integration strategy has profound implications for design complexity, cost, and performance.

Most commercial PIM architectures have adopted the near-bank PNM approach precisely because it avoids modifications to the core DRAM subarray.17 The subarray, which contains the 1-transistor-1-capacitor (1T1C) memory cells, is the densest part of the chip and is manufactured using a specialized process that is not optimized for high-performance logic. By placing compute units in the peripheral region alongside sense amplifiers and I/O circuitry, manufacturers can implement a “dual-mode” functionality, where the chip can operate as either a standard memory or a PIM accelerator.20 However, this peripheral area is extremely constrained in terms of physical space and power budget, which limits the complexity and performance of the integrated PIM logic.17

More experimental PUM approaches venture into the subarray itself, proposing circuit-level modifications to enable computation. One prominent technique involves leveraging the analog charge-sharing properties of DRAM. By activating multiple wordlines simultaneously (e.g., Triple-Row Activation or TRA), the collective charge from multiple cells is shared on the bitline, effectively performing a bitwise majority function, which can be used as a basis for other logic operations.18 Other proposals, such as ReDRAM, suggest a “Dual-Row Activation” mechanism coupled with modest modifications to the sense amplifier circuitry to implement a complete set of bulk bitwise operations (AND, OR, XOR, NOT).19 While these PUM techniques offer the highest degree of parallelism, they carry the significant risk of compromising the density, yield, and reliability of the commodity DRAM process, which has thus far limited their commercial adoption.20
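The logic behind multi-row activation can be captured in a few lines: the charge-sharing result of activating three rows is the bitwise majority of their contents, and presetting one row to all zeros or all ones collapses that majority into AND or OR. The bit-level sketch below mirrors this behavior in software; it models the logical effect only, not the analog circuit.

```cpp
// Bit-level sketch of the charge-sharing idea behind triple-row activation (TRA):
// activating three rows yields the bitwise majority of their contents, and with
// one row preloaded to all-0s or all-1s the majority collapses to AND or OR.
#include <cstdint>
#include <cstdio>

uint64_t maj(uint64_t a, uint64_t b, uint64_t c) {    // bitwise MAJ(a, b, c)
    return (a & b) | (b & c) | (a & c);
}

int main() {
    uint64_t A = 0b1100, B = 0b1010;
    uint64_t AND = maj(A, B, 0x0ULL);                  // MAJ(A, B, 0) == A & B
    uint64_t OR  = maj(A, B, ~0x0ULL);                 // MAJ(A, B, 1) == A | B
    std::printf("AND=%llx OR=%llx\n",
                (unsigned long long)AND, (unsigned long long)OR);  // 8, e
}
```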

 

Performance Analysis: Throughput, Latency, and Energy Efficiency Benchmarks

 

The performance of DRAM-PIM systems is highly dependent on the nature of the workload. These architectures excel in applications that are fundamentally memory-bound, where the ratio of data access to computation is high. For such workloads, PIM can provide substantial performance gains. For instance, an analysis of the UPMEM PIM system showed a 23x performance improvement over a high-end GPU for a GEMV kernel, but only when the GPU’s memory was oversubscribed, forcing it to access system memory and highlighting PIM’s strength in large-dataset scenarios.31

However, PIM is not a panacea. For applications with significant computational requirements, the relatively simple processing units within PIM can become the new bottleneck, shifting the workload from memory-bound to compute-bound.29 The effectiveness of PIM is also deeply tied to the software implementation, including how data is laid out across the memory banks, how parallelism is managed across the many PIM cores, and how synchronization is handled.10

Roofline models, which plot computational performance against arithmetic intensity, clearly illustrate PIM’s value proposition: it dramatically raises the “memory roofline,” increasing the peak performance achievable for memory-bound applications.14 For data analytics workloads, multi-level PIM architectures that exploit parallelism at the bank, chip, and rank levels have demonstrated throughput gains of up to 528x over baseline CPU systems.8 Despite these impressive figures, a recurring challenge is load imbalance, where the work is not evenly distributed across all PIM cores. This is particularly problematic in applications with irregular data access patterns, such as graph mining, and can significantly degrade overall performance if not managed by sophisticated scheduling algorithms.33
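The roofline relation itself is simple: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. The sketch below applies it to a GEMV-like kernel; the peak and bandwidth figures are placeholders chosen only to show how a higher memory roof lifts memory-bound performance.

```cpp
// Roofline sketch: attainable throughput = min(peak compute, intensity x bandwidth).
// Raising memory bandwidth (as PIM does) lifts the sloped part of the roof, which
// is exactly where memory-bound kernels such as GEMV sit.
#include <algorithm>
#include <cstdio>

double attainable(double peak_gflops, double bw_GBps, double flops_per_byte) {
    return std::min(peak_gflops, bw_GBps * flops_per_byte);
}

int main() {
    const double ai_gemv = 0.25;  // ~2 flops per 8-byte (FP64) weight: memory-bound
    // Peak compute and bandwidth values below are illustrative assumptions.
    std::printf("host  (100 GB/s): %6.1f GFLOP/s\n", attainable(10000, 100, ai_gemv));
    std::printf("PIM   (2 TB/s)  : %6.1f GFLOP/s\n", attainable(10000, 2000, ai_gemv));
}
```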

 

SRAM-Based Compute-in-Memory for AI Acceleration

 

Architectural Principles: Optimizing for Low Latency and High Efficiency

 

While DRAM-PIM targets high-bandwidth computing, Static Random Access Memory (SRAM)-based CIM is engineered for a different set of goals: ultra-low latency and extreme energy efficiency. SRAM’s inherent advantages—high speed, robustness, and compatibility with standard CMOS logic fabrication processes—make it an ideal substrate for building specialized AI accelerators, particularly for inference tasks at the network edge.5

The architectural principle of SRAM-CIM is to perform the most fundamental and frequent operation in neural networks—the Multiply-Accumulate (MAC) operation—directly within the memory array.5 By storing the neural network weights within the SRAM cells and streaming the input activations through the array, SRAM-CIM can perform thousands of parallel MAC operations in place, virtually eliminating the energy and latency costs associated with fetching weights and intermediate feature maps from off-chip memory.3 This results in latency measured in single-digit nanoseconds and energy efficiency, often expressed in Tera-Operations per Second per Watt (TOPS/W), that can be orders of magnitude better than conventional processors.3

 

Memory Cell and Peripheral Circuitry Modifications for Computation

 

Standard 6-transistor (6T) SRAM cells, which are optimized for density and stability in caches, are not well-suited for CIM. To enable in-memory computation, the bitcell and its surrounding peripheral circuits must be significantly modified.

A primary modification is the move from 6T to 8-transistor (8T), 9T, or even 10T bitcell designs.5 The key innovation in these larger cells is the addition of a dedicated read port, which decouples the read and write data paths.39 This separation is critical for CIM because it allows multiple wordlines to be activated simultaneously to perform a computation without causing the “read disturb” instability that would corrupt the stored data in a standard 6T cell.39

The peripheral circuitry surrounding the SRAM array is equally, if not more, critical. Since many SRAM-CIM designs operate in the analog domain to maximize efficiency, they require Digital-to-Analog Converters (DACs) to encode the digital input activations into analog signals (e.g., varying voltage levels or pulse widths on the wordlines).41 After the computation is performed in the array—typically as a summation of currents on the bitlines—the resulting analog value must be converted back to the digital domain. This requires Analog-to-Digital Converters (ADCs) at the end of each column.41 The design of these ADCs is a major challenge, as their power consumption and area can grow exponentially with the required precision, often becoming the dominant cost factor in the entire CIM macro.41 Additional circuits, such as current mirrors for precise current control and sophisticated timing logic, are also necessary to ensure accurate MAC operations.5
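The sketch below isolates the ADC's role: the column computes an exact dot product (standing in for the summed bitline current), and an N-bit converter quantizes it. The conductance and input values are arbitrary and no noise model is included; the point is simply that lowering ADC resolution coarsens the result that leaves the macro.

```cpp
// Analog column sketch with an explicit ADC stage (simplified, no noise model):
// the column "sums currents" as a plain dot product, and an N-bit ADC quantizes
// that sum, which is where precision is lost as resolution is reduced.
#include <algorithm>
#include <cstdio>
#include <vector>

double adc(double value, double full_scale, int bits) {
    int levels = (1 << bits) - 1;
    double code = std::clamp(value / full_scale, 0.0, 1.0) * levels;
    return (static_cast<long>(code + 0.5) / static_cast<double>(levels)) * full_scale;
}

int main() {
    std::vector<double> g = {0.2, 0.7, 0.4, 0.9};   // cell conductances (weights)
    std::vector<double> v = {1.0, 0.5, 1.0, 0.0};   // wordline inputs (activations)
    double i_col = 0.0;
    for (size_t k = 0; k < g.size(); ++k) i_col += g[k] * v[k];   // bitline summation

    std::printf("analog sum : %.4f\n", i_col);                    // 0.9500
    std::printf("8-bit ADC  : %.4f\n", adc(i_col, 4.0, 8));
    std::printf("4-bit ADC  : %.4f\n", adc(i_col, 4.0, 4));       // visibly coarser
}
```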

 

SRAM-CIM in Practice: Macro Designs for CNN and Transformer Acceleration

 

These modified bitcells and complex peripherals are assembled into functional blocks known as CIM macros. A typical CIM macro consists of an array of thousands of modified SRAM cells that store a portion of a neural network’s weight matrix. When input activations are applied to the wordlines, the entire array performs a parallel matrix-vector multiplication in a single cycle.3

A vast body of academic and industrial research is focused on designing and optimizing these macros for specific AI workloads. Numerous designs have been proposed to accelerate Convolutional Neural Networks (CNNs), Transformers, and Large Language Models (LLMs), with a strong emphasis on power-constrained edge devices.46 Many advanced architectures are heterogeneous, combining SRAM-CIM with other memory technologies like Magnetoresistive RAM (MRAM) for non-volatility or Read-Only Memory (ROM) for fixed-function acceleration.13 Others employ hybrid analog-digital techniques within the macro itself to strike a better balance between the raw efficiency of analog compute and the precision of digital logic.46 The design of SRAM-CIM is not just about the memory itself; it is about creating a hyper-specialized accelerator where the memory cells act as programmable arithmetic elements, prioritizing computational function over pure storage density.

 

Performance Analysis: Measuring Computational Density and TOPS/W

 

The primary metric for evaluating SRAM-CIM performance is energy efficiency, measured in TOPS/W. By performing computation in place, SRAM-CIM architectures dramatically reduce the data movement that dominates energy consumption in traditional systems. Published results for digital CIM architectures report efficiencies in the range of 1–100 TOPS/W, which represents a 100x to 1000x improvement over conventional CPUs.3

Specific research prototypes have demonstrated even more impressive figures, with some reporting efficiencies as high as 249.1 TOPS/W or, in highly optimized cases, even 3707.84 TOPS/W.46 It is crucial to note that these figures are highly dependent on the fabrication technology node, the numerical precision of the data (e.g., 8-bit integer vs. 4-bit integer), and the specific operation being benchmarked.5 While the energy efficiency is a major strength, the primary weakness of SRAM-CIM is its low storage density compared to DRAM.20 This low density means that only small models or portions of larger models can fit within the CIM macro at one time. For large models, this can create a new bottleneck: the latency and energy required to frequently reload weights from off-chip DRAM into the SRAM array, potentially offsetting some of the gains from the in-memory computation itself.48
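A rough energy model makes the reload penalty visible. In the sketch below, every constant (in-macro energy per MAC, DRAM energy per byte, MACs performed per reloaded byte) is an assumption chosen for illustration; the structure of the calculation, not the particular numbers, is the point.

```cpp
// Rough effective-efficiency model (all constants are assumptions): when the
// working set exceeds the SRAM macro, the energy to reload weights from DRAM is
// amortized over the MACs performed per load and erodes the headline TOPS/W.
#include <cstdio>

int main() {
    const double macro_pj_per_op       = 0.05;  // assumed in-macro energy per MAC
    const double dram_pj_per_byte      = 20.0;  // assumed off-chip DRAM access energy
    const double ops_per_reloaded_byte = 8.0;   // reuse: MACs per weight byte loaded

    // 1 TOPS/W corresponds to 1 pJ per operation, so TOPS/W = 1 / (pJ/op).
    double eff_pj_per_op = macro_pj_per_op + dram_pj_per_byte / ops_per_reloaded_byte;
    std::printf("in-macro only : %.1f TOPS/W\n", 1.0 / macro_pj_per_op);
    std::printf("with reloads  : %.1f TOPS/W\n", 1.0 / eff_pj_per_op);
}
```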

 

Comparative Analysis: Analog vs. Digital In-Memory Computing

 

Within the domain of SRAM-based CIM, the choice of computational paradigm—analog or digital—represents a fundamental design trade-off with profound implications for performance, efficiency, and accuracy. This choice dictates the architecture from the circuit level to the system level and is one of the most critical decisions for PIM designers.

 

Principles of Analog CIM (AIMC): Leveraging Physics for Computation

 

Analog Compute-in-Memory (AIMC) performs calculations by directly harnessing the physical laws governing electrical circuits, most notably Kirchhoff’s current law and Ohm’s law.3 In a typical AIMC macro, the digital values of an input vector are first converted into analog quantities, such as varying voltage levels or pulses of varying duration, by DACs. These analog signals are then applied to the wordlines of the memory array. Each memory cell, which stores a weight value as a variable conductance, modulates the current flowing from the wordline to the bitline. According to Kirchhoff’s law, all the currents from the cells in a single column are naturally summed together on the shared bitline.51 This summed current, which is an analog representation of a dot-product result, is then converted back to a digital number by an ADC.

This approach offers the potential for extreme energy efficiency and massive parallelism, as an entire matrix-vector multiplication across thousands of memory cells can be completed in what is effectively a single analog operation.51

 

Principles of Digital CIM (DCIM): Integrating Logic within the Memory Array

 

In contrast, Digital Compute-in-Memory (DCIM) performs all computations using standard digital logic gates that are physically embedded within the memory array’s structure.54 Instead of relying on analog current summation, a DCIM architecture might, for example, use XNOR gates integrated with each bitcell to perform bitwise multiplication. The partial products generated along a column are then accumulated using a tree of digital adders that runs parallel to the bitlines.54 This approach preserves the key advantages of digital computation: high precision, deterministic behavior, and immunity to the noise and process variations that plague analog circuits.51
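For binarized weights and activations this digital scheme reduces to XNOR plus popcount, which the short sketch below reproduces in software. It illustrates the DCIM principle described above rather than any specific macro.

```cpp
// Digital CIM sketch for binarized weights/activations (+1/-1 encoded as bits):
// XNOR per bitcell implements the multiply, and a popcount plays the role of the
// per-column adder tree.
#include <bitset>
#include <cstdint>
#include <cstdio>

int binary_dot(uint64_t w, uint64_t x, int n_bits) {
    uint64_t mask = (n_bits == 64) ? ~0ULL : ((1ULL << n_bits) - 1);
    int matches = std::bitset<64>((~(w ^ x)) & mask).count();  // XNOR + popcount
    return 2 * matches - n_bits;  // map agreement count back to a +/-1 dot product
}

int main() {
    uint64_t w = 0b1011, x = 0b1001;                  // 4-element binary vectors
    std::printf("dot = %d\n", binary_dot(w, x, 4));   // 3 agreements -> dot = 2
}
```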

 

A Multi-faceted Trade-off Analysis: Precision, Noise, Power, Area, and Scalability

 

The decision between AIMC and DCIM involves a complex set of engineering trade-offs, summarized in Table 3.

  • Precision and Noise: AIMC’s greatest weakness is its susceptibility to analog non-idealities. Thermal noise, shot noise, and minute variations in transistor manufacturing can introduce errors into the computation, limiting precision and potentially degrading the accuracy of AI models.3 The ADC is a particularly critical component, as its limited resolution quantizes the final result and can be a major source of error.43 DCIM, being entirely digital, is immune to these issues and offers high, predictable precision, though the hardware cost (area and power) increases with the number of bits required.51
  • Energy Efficiency and Area: AIMC generally boasts higher peak energy efficiency and computational density, particularly for medium-precision operations (e.g., 3 to 8 bits).51 This is because it avoids the significant area and power overhead of the digital adder trees required by DCIM.54 However, this advantage is often diminished by the substantial cost of the peripheral ADCs and DACs, which can dominate the macro’s total power and area budget, especially as precision requirements increase.41 The ADC thus represents the Achilles’ heel of AIMC, where the theoretical efficiency of the core computation is held hostage by the practical cost of I/O conversion.
  • Scalability and Flexibility: DCIM benefits more directly from Moore’s Law, as its digital logic components scale well with advancements in semiconductor manufacturing processes.55 In contrast, designing high-performance analog circuits becomes progressively more difficult in advanced FinFET nodes, making AIMC scalability more challenging.57 Furthermore, the digital nature of DCIM provides greater flexibility for reconfiguring dataflows and mapping different types of computations onto the hardware.51

The optimal choice between these paradigms is not absolute but is highly dependent on the specific workload. Research indicates that AIMC can be more energy-efficient for neural network layers, like standard convolutions, that can exploit high spatial parallelism across a large memory array.42 Conversely, DCIM may outperform AIMC for layers with limited parallelism, such as depthwise convolutions, where smaller, more flexible compute units are advantageous.51 This suggests that the most effective future accelerators may be heterogeneous systems that incorporate both AIMC and DCIM macros, using a sophisticated compiler to map different parts of a neural network to the most suitable hardware.

| Metric | Analog CIM (AIMC) | Digital CIM (DCIM) |
| --- | --- | --- |
| Computational Principle | Kirchhoff's Law (current/charge summation) | Integrated digital logic (adders/multipliers) |
| Precision | Limited by noise, device variation, ADC resolution | High, deterministic, limited by bit-width |
| Noise Immunity | Low; susceptible to analog non-idealities | High; robust digital operation |
| Peak Energy Efficiency (TOPS/W) | Potentially higher (especially at medium precision) | Lower due to digital logic overhead |
| Area Efficiency (Compute Density) | Potentially higher (fewer transistors per MAC) | Lower due to area of adder trees |
| Key Bottleneck | ADC/DAC power, area, and speed | Adder tree power and area |
| Technology Scaling | Less direct benefit; analog design challenges increase | Benefits directly from transistor scaling |
| Flexibility | More rigid dataflow | More flexible spatial mapping and reconfiguration |
Table 3: Trade-off Analysis of Analog vs. Digital CIM. This table summarizes the key engineering trade-offs between the two primary paradigms for compute-in-memory, based on a synthesis of multiple sources.[41, 51, 53, 55, 58]

 

Hybrid Approaches: Seeking the Best of Both Paradigms

 

Recognizing the complementary strengths of AIMC and DCIM, researchers are actively exploring hybrid architectures that aim to achieve a superior balance of efficiency and accuracy. A common strategy is to partition the computation by bit significance: the Most Significant Bits (MSBs) of a multiplication, which have the largest impact on the final result, are processed in the digital domain for high accuracy, while the Less Significant Bits (LSBs) are processed in the analog domain to save power.55
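The sketch below imitates this bit-significance split on a tiny dot product: the upper nibble of each 8-bit weight goes through an exact path, the lower nibble through a path perturbed by a placeholder Gaussian noise term, and the two partial sums are recombined with the appropriate power-of-two weighting. The noise model and all values are assumptions.

```cpp
// Bit-significance partitioning sketch (hedged illustration of the MSB/LSB idea):
// the 4 MSBs of each 8-bit weight contribute through an exact "digital" path,
// the 4 LSBs through a noisy "analog" path, and the partial sums are recombined.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::vector<int> w = {200, 57, 130, 14};     // unsigned 8-bit weights
    std::vector<int> x = {3, 1, 2, 5};           // small integer activations
    std::mt19937 rng(42);
    std::normal_distribution<double> analog_noise(0.0, 1.0);  // assumed noise model

    double digital_msb = 0.0, analog_lsb = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        digital_msb += (w[i] >> 4) * x[i];                      // exact partial sum
        analog_lsb  += (w[i] & 0xF) * x[i] + analog_noise(rng); // noisy partial sum
    }
    double y = digital_msb * 16 + analog_lsb;    // recombine with MSB weight of 2^4
    std::printf("approx dot = %.1f (exact = %d)\n", y,
                200 * 3 + 57 * 1 + 130 * 2 + 14 * 5);           // exact = 987
}
```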

A more dynamic approach involves using “saliency” to guide the computation. In this model, the system can perform a quick, low-precision calculation to estimate the importance of different parts of the data (e.g., identifying the main subject in an image versus the background). It then dynamically allocates more precise digital compute resources to the salient regions and uses highly efficient analog compute for the non-salient regions.58 These hybrid systems represent a promising frontier in CIM design, offering a path to build accelerators that are both highly efficient and robust enough for a wide range of applications.46

 

System-Level Integration and Software Enablement

 

The availability of PIM hardware is only the first step; making this hardware usable and integrating it effectively into existing computing systems presents a host of complex challenges that span the entire software and hardware stack. The “drop-in replacement” strategy pursued by some vendors, while intended to ease adoption, creates significant downstream complexity. By adhering to existing memory interfaces, these PIM devices cannot fundamentally alter the host-memory interaction model. This necessitates a sophisticated and often proprietary software layer to bridge the gap between a processor-centric system and a memory-centric device, creating a fragmented ecosystem.15 The software gap—the chasm between hardware capability and software usability—remains the single greatest barrier to the widespread adoption of PIM technology.

 

The PIM-Aware Memory Controller: New Commands, Scheduling, and Address Mapping

 

The memory controller (MC) is the crucial interface between the host processor and the PIM device, and it must be substantially enhanced to support PIM operations. First, the MC must be capable of issuing new, non-standard commands to the memory to initiate computations. Examples include the extended command set from SK Hynix, which features instructions like MACAB (MAC Across all Banks), or Samsung’s proposed PIM-ACT command, which simultaneously activates multiple banks and specifies a PIM operation type.22

Second, the MC requires new scheduling algorithms. In a PIM-enabled system, the MC must arbitrate between conventional memory read/write requests from the host CPU and PIM computation requests. Naive scheduling can lead to resource conflicts and degrade overall memory bandwidth. Advanced schedulers are needed to balance fairness and throughput, potentially by limiting the number of consecutive requests of the same type or by using more sophisticated policies.22 Proposals like DEAR-PIM introduce a disaggregated command queue to better manage the issuance of all-bank PIM commands and mitigate peak power consumption issues.60

Finally, address mapping becomes more complex. PIM systems may employ separate physical address spaces for standard DRAM and PIM-capable memory regions. This requires a PIM-aware Memory Management Unit (MMU) or a specialized hardware block within the controller to manage these distinct spaces and translate addresses accordingly.61
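The toy scheduler below illustrates one such policy: the controller alternates between the host and PIM queues but caps the number of consecutive requests it issues from either class so neither is starved. The queue contents, the cap value, and the command mnemonics borrowed from the text (PIM-ACT, MACAB) are used purely for illustration.

```cpp
// Scheduler sketch (an assumption-laden illustration, not a vendor policy):
// drain the host and PIM queues, but cap how many consecutive requests of one
// class are issued so that neither traffic type is starved.
#include <cstdio>
#include <deque>
#include <queue>
#include <string>

int main() {
    std::deque<std::string> host_reqs = {"RD A", "RD B", "WR C", "RD D"};
    std::deque<std::string> pim_reqs  = {"PIM-ACT bank0-15", "MACAB", "PIM-RD result"};
    std::queue<std::string> host(host_reqs), pim(pim_reqs);

    const int kMaxStreak = 2;          // consecutive-issue cap per request class
    int streak = 0;
    bool serving_host = true;

    while (!host.empty() || !pim.empty()) {
        std::queue<std::string>& q     = serving_host ? host : pim;
        std::queue<std::string>& other = serving_host ? pim : host;
        // Switch class when the current queue is drained or its streak cap is
        // hit, provided the other class actually has pending work.
        if ((q.empty() || streak >= kMaxStreak) && !other.empty()) {
            serving_host = !serving_host;
            streak = 0;
            continue;
        }
        std::printf("[%s] %s\n", serving_host ? "HOST" : "PIM ", q.front().c_str());
        q.pop();
        ++streak;
    }
}
```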

 

Programming Models, Compilers, and APIs for PIM

 

For PIM to be adopted by the broader software development community, its complexity must be abstracted away by high-level programming models, compilers, and APIs.63 Currently, programming PIM hardware often requires expert knowledge and manual management of data placement, kernel execution, and synchronization.29

Several approaches are being explored to solve this programmability challenge:

  • Instruction Set Architecture (ISA) Extensions: At the lowest level, the host processor’s ISA can be extended with custom instructions that directly target PIM functional units. This allows for tight integration of PIM into the processor pipeline.13
  • APIs and Directives: A more portable approach involves providing libraries and compiler directives. For example, the widely used OpenMP standard for parallel programming, which uses #pragma directives to mark parallel regions, could be extended to support offloading specific code blocks to PIM devices (see the sketch following this list).65
  • PIM-Aware Compilers: The ultimate goal is a compiler that can automatically analyze standard source code (e.g., in C++ or Python), identify data-intensive sections suitable for PIM execution, partition the application between the host and PIM, optimize data layouts for PIM’s parallel architecture, and generate the necessary PIM instructions.68 Research projects like PRIMO and PIMCOMP are developing end-to-end compiler toolchains specifically for this purpose.68
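
As a concrete illustration of the directive-based style mentioned above, the sketch below uses the existing OpenMP target construct to offload a memory-bound element-wise kernel to an accelerator device. Treating a PIM module as that device is an assumption: current OpenMP defines no PIM-specific device type or data-placement clauses, so this is stand-in syntax rather than a supported PIM programming model.

```cpp
// Hedged sketch of directive-based offload using the standard OpenMP "target"
// construct as stand-in syntax for a future PIM-capable toolchain.
// Compile with -fopenmp; without it, the pragma is ignored and the loop runs serially.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload a memory-bound element-wise kernel to an accelerator device; with
    // a hypothetical PIM backend, the device would be the compute-capable memory.
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: y)
    for (int i = 0; i < n; ++i)
        y[i] = a[i] * b[i];

    std::printf("y[0] = %.1f\n", y[0]);   // 2.0
}
```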

 

Runtime Systems for Heterogeneous PIM Architectures

 

In a modern heterogeneous system comprising CPUs, GPUs, and various PIM accelerators, a sophisticated runtime system is essential for orchestrating execution. This runtime is responsible for dynamically scheduling tasks across the different processing elements, managing resource allocation, and handling data movement and synchronization.69

For PIM, the runtime must manage PIM-aware memory mapping, keeping track of which data resides in which type of memory and handling the translation between virtual and physical addresses.13 It also plays a critical role in maintaining data consistency between the host’s caches and the PIM’s local memory. Research systems like HEMPS demonstrate how a runtime can profile an application and adaptively partition its execution across available heterogeneous resources to optimize performance.73

 

The Role of Interconnects: CXL as an Enabler for Scalable PIM Systems

 

The emergence of the Compute Express Link (CXL) interconnect standard is set to be a transformative enabler for PIM. CXL provides a high-bandwidth, low-latency, cache-coherent protocol for connecting processors, memory, and accelerators.74 Unlike the rigid master-slave relationship of a traditional DDR memory channel, CXL allows memory devices to be treated as peer endpoints in the system.

This has profound implications for PIM. Instead of being confined to the processor’s local memory bus, PIM devices can be attached via CXL, enabling more flexible and scalable system designs. Architectures like CENT showcase this potential by proposing a GPU-free system for LLM inference built from a host CPU connected to a network of CXL-based PIM devices via a CXL switch.74 This allows for peer-to-peer communication directly between PIM devices and enables the construction of large, disaggregated systems where pools of PIM resources can be dynamically composed and allocated to workloads. CXL is therefore poised to shift the PIM paradigm from simply being “in-memory” to enabling a new class of “composable memory” systems, transforming the architecture of the data center.

 

PIM for Domain-Specific Acceleration

 

The practical value of PIM is most evident in its application to specific, data-intensive domains where the limitations of the von Neumann architecture are most acute. PIM is not a general-purpose accelerator; it thrives on workloads that are “embarrassingly memory-bound”—those with a very high ratio of memory access to computation. In these domains, where powerful GPUs are often underutilized and starved for data, PIM’s ability to deliver massive memory bandwidth directly to simple, parallel compute units provides a compelling advantage.

 

Large Language Model (LLM) Inference

 

The inference process for large language models, especially the autoregressive generation of tokens one by one, is a prime example of a memory-bound workload.74 The dominant computations are General Matrix-Vector multiplications (GEMV), which have low arithmetic intensity and cannot fully exploit the massive parallelism of a GPU’s compute engines.27 This makes LLM inference an ideal candidate for PIM acceleration.
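A back-of-the-envelope calculation shows why token generation is bandwidth-bound: each generated token must stream essentially all of the model's weights once. The parameter count, weight precision, and bandwidth in the sketch below are assumptions picked only to illustrate the shape of the bound.

```cpp
// Back-of-the-envelope for autoregressive decode (all numbers are assumptions):
// each generated token streams essentially every weight once, so the token rate
// is capped by memory bandwidth long before compute throughput matters.
#include <cstdio>

int main() {
    const double params      = 7e9;    // assumed model size (parameters)
    const double bytes_per_w = 2.0;    // FP16 weights
    const double bw_GBps     = 900.0;  // assumed HBM-class bandwidth per device

    double bytes_per_token  = params * bytes_per_w;              // ~14 GB per token
    double max_tokens_per_s = bw_GBps * 1e9 / bytes_per_token;   // bandwidth-bound rate
    std::printf("bandwidth-bound decode rate: ~%.0f tokens/s\n", max_tokens_per_s);
}
```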

Performance results from commercial and research systems are compelling. Integrating Samsung’s HBM-PIM with an AMD MI-100 GPU has been shown to more than double the performance and energy efficiency for GPT-J model inference.4 Projections for SK Hynix’s AiM-based system suggest a potential 13x performance improvement for GPT-3 inference compared to a GPU-only system, while consuming only 17% of the energy.27 Looking further, novel architectures like CENT, which replace GPUs entirely with a CXL-based network of PIM devices, project a reduction in the total cost of ownership (TCO) per query by up to 6.94x, highlighting PIM’s potential to fundamentally change the economics of deploying LLMs at scale.74

 

Graph Processing and Analytics

 

Graph processing algorithms, which are central to social network analysis, logistics, and bioinformatics, are notoriously difficult to accelerate on conventional hardware. Their characteristic features are massive datasets, irregular memory access patterns (pointer chasing), and very low data reuse, which render traditional cache hierarchies ineffective.33 PIM is a natural fit for these workloads because it brings computation directly to the large graph data structures stored in main memory, minimizing costly random data accesses.33

While PIM can accelerate memory-intensive graph operations like set intersection and subtraction, simply offloading existing algorithms is often not enough.33 The irregular structure of graph data can lead to severe workload imbalance across the parallel PIM cores, limiting performance gains.33 Achieving good performance requires algorithm-hardware co-design, including the development of new graph partitioning and scheduling strategies specifically for PIM architectures. Frameworks like PIMMiner are being developed to address these challenges and better utilize PIM for graph mining tasks.34

 

Database and Query Acceleration

 

In-memory database systems, which rely on large DRAM capacities to hold entire datasets, are frequently bottlenecked by the bandwidth of the memory bus during analytical query processing.30 PIM can be used to accelerate key database primitives, particularly table scans (filtering) and joins.

The Membrane framework, for example, demonstrates how filtering operations can be offloaded to simple comparison units placed at each DRAM bank. This approach achieved a 3-4x query speedup in the DuckDB analytics database with only a modest memory overhead.30 For the more complex hash join operation, simply running the CPU-based algorithm on PIM is suboptimal.78 The JSPIM architecture shows the power of co-design by redesigning the hash table data structure to be PIM-friendly and deploying parallel search engines within each memory subarray. This allows for constant-time ($O(1)$) lookups and effectively mitigates the problem of data skew, fully exploiting the fine-grained parallelism of PIM.78
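The sketch below captures the spirit of near-bank scan offload: each bank evaluates a predicate over its own slice of a column and returns only the matching row identifiers, so the full column never crosses the memory channel. The function, data, and two-bank split are illustrative, not Membrane's actual interface.

```cpp
// Near-bank filtering sketch (illustrative only): each bank applies the predicate
// to its own column slice and returns just the matching row indices, so the host
// never streams the full column over the memory channel.
#include <cstdio>
#include <vector>

std::vector<int> bank_filter(const std::vector<int>& column_slice,
                             int base_row, int threshold) {
    std::vector<int> hits;
    for (size_t i = 0; i < column_slice.size(); ++i)
        if (column_slice[i] > threshold) hits.push_back(base_row + (int)i);
    return hits;   // only row IDs cross the memory channel
}

int main() {
    // Two "banks", each holding a slice of the same column.
    std::vector<int> bank0 = {5, 42, 7, 99}, bank1 = {1, 63, 8, 77};
    auto h0 = bank_filter(bank0, 0, 40);
    auto h1 = bank_filter(bank1, 4, 40);
    h0.insert(h0.end(), h1.begin(), h1.end());   // host merges the small result sets
    for (int r : h0) std::printf("row %d matches\n", r);   // rows 1, 3, 5, 7
}
```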

 

Deep Learning Recommendation Models (DLRM)

 

Deep learning recommendation models, which power personalization across e-commerce and media streaming, are among the most memory-demanding workloads in the data center. Their performance is dominated by the need to access massive embedding tables, which can be tens or even hundreds of gigabytes in size.79 The core operation involves a sparse and irregular “gather-reduce” access pattern to these tables, which is an ideal match for PIM’s ability to provide high-bandwidth, parallel access to memory.
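The core gather-reduce primitive is easy to state in code, as the sketch below does for a single pooled lookup: sparse rows of a (here, tiny) embedding table are gathered and summed locally, so only the pooled vector needs to leave memory. Shapes and values are placeholders.

```cpp
// Gather-reduce sketch for embedding lookups (illustrative shapes and values):
// the PIM side gathers sparse rows of a large embedding table and reduces them
// locally, so only one pooled vector per sample leaves memory.
#include <cstdio>
#include <vector>

int main() {
    const int rows = 1000, dim = 4;                    // tiny stand-in for a huge table
    std::vector<float> table(rows * dim);
    for (int r = 0; r < rows; ++r)
        for (int d = 0; d < dim; ++d) table[r * dim + d] = 0.001f * r;

    std::vector<int> indices = {3, 512, 777};          // sparse, irregular accesses
    std::vector<float> pooled(dim, 0.0f);
    for (int idx : indices)                            // gather ...
        for (int d = 0; d < dim; ++d) pooled[d] += table[idx * dim + d];  // ... reduce

    std::printf("pooled[0] = %.3f\n", pooled[0]);      // (3+512+777)*0.001 = 1.292
}
```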

Several PIM-based solutions have been proposed to accelerate DLRMs. The UpDLRM framework utilizes the commercial UPMEM PIM hardware to accelerate embedding lookups, demonstrating lower inference latency compared to both CPU and hybrid CPU-GPU systems.79 More specialized architectures like ProactivePIM are co-designed with specific DLRM algorithms, such as those using weight-sharing for model compression. By incorporating an in-PIM cache and an intelligent prefetching scheme tailored to the algorithm’s access patterns, ProactivePIM achieved a 4.8x speedup over previous PIM-based approaches.83 These examples underscore a key theme: unlocking PIM’s full potential requires moving beyond simply offloading existing code and instead co-designing algorithms and hardware to work in concert.

 

Critical Challenges and Future Research Directions

 

Despite its rapid progress from academic concept to commercial product, Processing-in-Memory faces significant system-level hurdles that must be overcome to achieve widespread adoption. These challenges span hardware design, system software, and the broader industry ecosystem. Addressing them will be the focus of PIM research and development for the foreseeable future.

 

Data Coherency and Consistency in Heterogeneous Systems

 

Perhaps the most formidable technical challenge for PIM is maintaining data coherency in a system where both a host processor and multiple PIM cores can read and write to the same shared data.86 Traditional cache coherence protocols, such as snooping or directory-based schemes, are designed for tightly coupled processors and are unworkable for PIM. They would require a constant stream of coherence messages (e.g., invalidations, requests for ownership) to be sent over the narrow and high-latency off-chip memory bus, completely overwhelming it and negating any performance benefit from PIM.88

Early workarounds, such as marking PIM-accessible memory regions as non-cacheable by the host CPU or using coarse-grained software locks, are too restrictive and severely degrade performance, especially for applications with fine-grained data sharing.88 A more promising direction is represented by hardware mechanisms like LazyPIM. This approach leverages speculation: the PIM core executes its task assuming it has the necessary coherence permissions. At the end of the task, it sends a single, compressed “coherence signature” to the host, summarizing all the memory locations it accessed. The host processor then checks this signature for conflicts with its own cache activity. If a conflict is found, the PIM task is rolled back and re-executed; otherwise, the results are committed.88 By batching coherence checks, LazyPIM dramatically reduces off-chip traffic. Until an efficient and standardized solution to the coherence problem is adopted, PIM will likely remain confined to the role of a specialized accelerator for workloads with minimal or easily managed data sharing, rather than acting as a seamless extension of the host’s coherent memory space.
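The sketch below reduces the batched-coherence idea to a toy: the PIM side records the cache lines it touched in a small Bloom-filter-style signature, and the host tests its own dirty lines against it to choose between commit and rollback. The hashing scheme, signature size, and addresses are assumptions and do not reflect LazyPIM's actual signature encoding.

```cpp
// Toy version of batched coherence checking (not LazyPIM's signature format):
// the PIM side hashes every cache line it touched into a small Bloom filter, and
// the host tests its own dirty lines against that filter to decide between
// committing the PIM results and rolling the kernel back.
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kBits = 256;

void add(std::bitset<kBits>& sig, uint64_t line) {
    sig.set((line * 0x9E3779B97F4A7C15ULL) % kBits);   // two simple hash functions
    sig.set((line * 0xC2B2AE3D27D4EB4FULL) % kBits);
}

bool maybe_contains(const std::bitset<kBits>& sig, uint64_t line) {
    return sig.test((line * 0x9E3779B97F4A7C15ULL) % kBits) &&
           sig.test((line * 0xC2B2AE3D27D4EB4FULL) % kBits);
}

int main() {
    std::bitset<kBits> pim_sig;                        // lines touched by the PIM kernel
    for (uint64_t line : {0x1000ULL, 0x1040ULL, 0x1080ULL}) add(pim_sig, line);

    std::vector<uint64_t> host_dirty = {0x2000, 0x1040};   // host-cached dirty lines
    bool conflict = false;
    for (uint64_t line : host_dirty)
        conflict |= maybe_contains(pim_sig, line);          // may be a false positive

    std::printf(conflict ? "conflict: roll back PIM kernel\n"
                         : "no conflict: commit results\n");
}
```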

 

Thermal Management and Power Delivery in 3D-Stacked PIM

 

The 3D-stacking technology that enables high-bandwidth memories like HBM-PIM also introduces severe thermal challenges.89 Vertically stacking multiple DRAM dies on top of a logic die increases power density and creates long thermal paths, making it difficult to extract heat from the layers furthest from the heat sink.89 Thermal failures are a leading cause of reliability issues in 3D-stacked chips, and the non-uniform power distribution of PIM workloads can create localized “hotspots” that limit overall system performance.89

Addressing these thermal constraints requires co-design across the hardware and system management layers. Proposed solutions include the coordinated use of Dynamic Voltage and Frequency Scaling (DVFS) for the logic cores and Low Power Modes (LPM) for the DRAM banks to dynamically manage the thermal budget.90 More advanced architectural concepts, like the Tasa architecture, propose using heterogeneous cores within the PIM logic layer—mixing high-performance cores for compute-intensive tasks with high-efficiency cores for memory-intensive tasks—coupled with thermal-aware task scheduling to balance the temperature distribution across the die and maximize performance under a fixed thermal envelope.89

 

Standardization, Scalability, and the PIM Ecosystem

 

The current PIM landscape is fragmented. Each major vendor, including Samsung, SK Hynix, and UPMEM, has developed its own proprietary architecture, instruction set, and software development kit.6 This lack of standardization is a major impediment to building a robust software ecosystem, as it prevents the development of portable applications and tools that can run across different PIM hardware.59 Industry-wide standardization efforts, such as those being pursued within JEDEC for future HBM generations, are critical for fostering broader adoption and enabling a competitive marketplace.15

 

Emerging Research Frontiers and Long-Term Outlook

 

The long-term vision for PIM is likely one that is both heterogeneous and composable. Future systems will not rely on a single, monolithic PIM architecture but will instead be assembled from a diverse set of computational resources. This points to a future of flexible, disaggregated systems composed of different types of PIM and traditional compute, orchestrated by intelligent software. Key research frontiers include:

  • Advanced System Software: Continued innovation in PIM-aware compilers, runtimes, and operating systems is paramount to abstracting hardware complexity and making PIM accessible to mainstream programmers.63
  • Heterogeneous PIM Architectures: Research is actively exploring hybrid designs that combine the strengths of DRAM-PIM (bandwidth) and SRAM-CIM (latency), and potentially integrate emerging non-volatile memories for new capabilities.13
  • Composable Systems via CXL: Fully exploiting the CXL interconnect will be key to building scalable, rack-level systems from pools of disaggregated PIM resources, moving beyond the single-node accelerator model.74
  • New Application Domains: While AI has been the primary driver, researchers are exploring PIM’s potential to accelerate other data-intensive fields, such as bioinformatics, scientific simulation, and cryptography.59

 

Conclusion and Strategic Recommendations

 

Synthesizing the State of PIM Technology

 

Processing-in-Memory has successfully transitioned from a long-standing academic concept into a commercial reality, fundamentally driven by the untenable energy and performance costs of data movement in modern AI-centric computing. The technological landscape has bifurcated into two primary streams: high-bandwidth, high-capacity DRAM-based PIM, exemplified by commercial HBM and GDDR6 products, which targets large-scale AI and HPC workloads in the data center; and low-latency, high-efficiency SRAM-based CIM, which is poised to dominate the market for specialized AI inference acceleration at the edge.

While the hardware is maturing at a rapid pace, with demonstrable, order-of-magnitude improvements in energy efficiency and significant performance gains for targeted memory-bound workloads, the PIM paradigm is still in its early stages. The most significant challenges are no longer at the circuit or device level but at the system level. The lack of mature and standardized software ecosystems, including PIM-aware compilers, operating systems, and programming models, remains the single greatest barrier to widespread adoption. Furthermore, complex hardware issues such as cache coherence and thermal management in 3D-stacked implementations must be solved to enable PIM’s use in more general-purpose computing contexts.

 

Recommendations for Architects, Developers, and Technology Strategists

 

Based on this analysis, the following strategic recommendations are proposed for key stakeholders in the computing ecosystem:

  • For Hardware and System Architects: The focus should shift from demonstrating point solutions to solving systemic integration challenges. This includes developing and standardizing efficient cache coherence mechanisms that do not rely on the off-chip bus, designing novel thermal management solutions for 3D-stacked PIM, and embracing heterogeneous designs that combine DRAM-PIM, SRAM-CIM, and conventional cores. Investing in architectures that leverage the CXL interconnect is critical for enabling the next generation of scalable, composable memory-centric systems. Algorithm-hardware co-design should be a guiding principle, as tailoring both the hardware and the software to a specific problem domain has consistently yielded the most significant performance breakthroughs.
  • For Software Developers and Compiler Researchers: The greatest opportunity for impact lies in bridging the software gap. This requires a concerted effort to build a robust PIM software ecosystem. Key priorities include developing PIM-aware compilers that can automatically partition applications and optimize data layouts, creating user-friendly APIs and programming model extensions (e.g., for OpenMP or C++) that abstract away hardware complexity, and designing intelligent runtime systems that can dynamically schedule tasks across heterogeneous PIM and CPU/GPU resources.
  • For Technology Strategists and Decision-Makers: It is crucial to recognize that PIM is not a single, monolithic technology but a paradigm shift toward a more diverse and specialized computing landscape. In the near term, organizations should evaluate current commercial PIM solutions for specific, well-defined, and acutely memory-bound workloads where they can provide an immediate and measurable return on investment—LLM inference and recommendation systems are prime candidates. Concurrently, they should closely monitor the development of industry standards (e.g., via JEDEC) and the maturation of the software ecosystem, as these will be the key indicators for when PIM is ready to move from a niche accelerator to a foundational component of mainstream computing infrastructure.