{"id":7948,"date":"2025-11-28T15:36:18","date_gmt":"2025-11-28T15:36:18","guid":{"rendered":"https:\/\/uplatz.com\/blog\/?p=7948"},"modified":"2025-11-28T16:25:09","modified_gmt":"2025-11-28T16:25:09","slug":"processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing","status":"publish","type":"post","link":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/","title":{"rendered":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing"},"content":{"rendered":"<h2><b>The Imperative for In-Memory Computation<\/b><\/h2>\n<h3><b>Deconstructing the &#8220;Memory Wall&#8221;: Performance and Energy Bottlenecks in von Neumann Architectures<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">For decades, the advancement of computing has been governed by the processor-centric von Neumann architecture, which fundamentally separates processing units (CPUs, GPUs) from memory units. This separation necessitates the constant movement of data between where it is stored and where it is processed. The performance of processors has historically improved at a much faster rate than that of memory, creating an ever-widening disparity known as the &#8220;memory wall&#8221;.<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> In modern data-intensive workloads, such as high-performance computing (HPC) and artificial intelligence (AI), this gap has become a critical system bottleneck, forcing powerful processors to spend a significant portion of their execution cycles idle, waiting for data to arrive from memory.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, the memory wall is no longer just a performance or latency problem; it has evolved into a severe energy crisis. 
The energy consumed by moving data across the chip and between the processor and main memory can vastly exceed the energy required for the actual computation.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> In some modern systems, data movement has been reported to account for as much as 62% of the total system energy, creating what is often termed the &#8220;von Neumann bottleneck&#8221;.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> As AI models and datasets grow exponentially, this energy expenditure has become a primary limiting factor, driving an urgent need for a paradigm shift in computer architecture.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> The economic and environmental costs associated with powering large-scale data centers for AI have elevated energy efficiency from a secondary concern to a primary driver for architectural innovation.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-7960\" src=\"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing-1024x576.jpg\" alt=\"\" width=\"840\" height=\"473\" srcset=\"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing-1024x576.jpg 1024w, https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing-300x169.jpg 300w, https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing-768x432.jpg 768w, 
https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg 1280w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><\/p>\n<h3><a href=\"https:\/\/uplatz.com\/course-details\/career-path-application-architect\">career-path-application-architect By Uplatz<\/a><\/h3>\n<h3><b>The PIM Paradigm: Shifting Computation Closer to Data<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Processing-in-Memory (PIM), also referred to as Compute-in-Memory (CIM), offers a fundamental solution to the data movement bottleneck by challenging the processor-centric paradigm.<\/span><span style=\"font-weight: 400;\">9<\/span><span style=\"font-weight: 400;\"> Instead of moving massive amounts of data to a distant processor, PIM integrates computational capabilities directly within or near the memory arrays.<\/span><span style=\"font-weight: 400;\">10<\/span><span style=\"font-weight: 400;\"> By making memory systems &#8220;compute-capable,&#8221; PIM drastically reduces or eliminates the long and energy-intensive journey data must take to be processed.<\/span><span style=\"font-weight: 400;\">10<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This concept is not new, with roots tracing back to the 1960s, but its practical realization has been catalyzed by two recent developments: the maturation of advanced semiconductor packaging technologies, particularly 3D stacking, and the insatiable demand for performance and energy efficiency from AI workloads.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> It is important to distinguish this hardware-level architectural concept from the software-level practice of &#8220;in-memory processing&#8221; used in applications like in-memory databases, where data is held in RAM to avoid slower disk access.<\/span><span style=\"font-weight: 400;\">9<\/span><span style=\"font-weight: 400;\"> 
Architectural PIM represents a more profound change, blurring the traditional lines between storage and computation.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A Taxonomy of PIM: Processing-Near-Memory (PNM) vs. Processing-Using-Memory (PUM)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The PIM paradigm encompasses a spectrum of approaches that can be broadly classified into two main categories, distinguished by the proximity and nature of the integrated computation.<\/span><\/p>\n<p><b>Processing-Near-Memory (PNM)<\/b><span style=\"font-weight: 400;\"> involves placing discrete, conventional logic units <\/span><i><span style=\"font-weight: 400;\">near<\/span><\/i><span style=\"font-weight: 400;\"> the memory arrays. In modern implementations, this often means integrating processing elements (PEs) onto the logic layer of a 3D-stacked memory device or at the periphery of memory banks on a 2D chip.<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> This approach is pragmatic, as it minimizes modifications to the highly optimized and dense core memory arrays, thereby reducing design risk and manufacturing complexity. Most commercial PIM products, such as Samsung&#8217;s HBM-PIM, follow the PNM model by placing compute units at the bank boundary.<\/span><span style=\"font-weight: 400;\">15<\/span><\/p>\n<p><b>Processing-Using-Memory (PUM)<\/b><span style=\"font-weight: 400;\"> represents a more radical and deeply integrated approach. 
PUM leverages the intrinsic <\/span><i><span style=\"font-weight: 400;\">analog operational properties<\/span><\/i><span style=\"font-weight: 400;\"> of the memory cells themselves to perform computation.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> For instance, simultaneously activating multiple rows in a DRAM array produces charge sharing on the bitlines that can be used to perform massively parallel bitwise logic operations.<\/span><span style=\"font-weight: 400;\">10<\/span><span style=\"font-weight: 400;\"> While PUM offers the highest potential for parallelism and efficiency by turning every memory column into a parallel ALU, it typically requires more significant modifications to the core memory cell and peripheral circuitry. Due to the higher design and manufacturing risks, most current commercial efforts are focused on the PNM strategy.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The emergence of these two distinct technological paths signals a strategic divergence in the evolution of memory. The industry is moving away from a one-size-fits-all memory hierarchy toward a future of domain-specific memory, where different memory components are optimized for distinct roles\u2014some for pure storage, others for specific computational tasks. This specialization is evident in the development of DRAM-based PIM for high-bandwidth applications and SRAM-based CIM for low-latency, high-efficiency tasks, as summarized in Table 1. 
This trend will fundamentally reshape system design, requiring architects to build systems from a heterogeneous mix of memory and compute components.<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Characteristic<\/b><\/td>\n<td><b>DRAM-Based PIM<\/b><\/td>\n<td><b>SRAM-Based CIM<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Primary Goal<\/b><\/td>\n<td><span style=\"font-weight: 400;\">High Bandwidth \/ Capacity<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Low Latency \/ High Energy Efficiency<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Storage Density<\/b><\/td>\n<td><span style=\"font-weight: 400;\">High<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Low<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Latency<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Higher (tens of ns)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Lower (single-digit ns)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Compute Granularity<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Coarse-grained (e.g., vector operations)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Fine-grained (e.g., bitwise, MAC)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Technology Maturity<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Mature (highly optimized process)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Mature (CMOS-compatible)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Primary Compute Function<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Vector ALU, Floating Point<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Multiply-Accumulate (MAC)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Target Applications<\/b><\/td>\n<td><span style=\"font-weight: 400;\">HPC, Data Center AI (LLMs), Databases<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Edge AI, Accelerators, On-device ML<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Table 1: Comparative Overview of DRAM-PIM vs. SRAM-CIM. 
This table provides a high-level comparison of the two main technological branches of PIM, establishing the fundamental trade-offs that define their respective roles in modern computing systems.[5, 13, 20]<\/span><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2><b>DRAM-Based Processing-in-Memory Architectures<\/b><\/h2>\n<p>&nbsp;<\/p>\n<h3><b>Architectural Principles: Leveraging DRAM&#8217;s Internal Bandwidth and Parallelism<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Dynamic Random Access Memory (DRAM) has long been the cornerstone of main memory in computing systems. The primary motivation for developing DRAM-based PIM is to harness the massive internal bandwidth available within a DRAM chip. This internal bandwidth, accessible between the memory arrays and the chip&#8217;s periphery, can be an order of magnitude or more greater than the bandwidth of the external memory channel that connects the DRAM to the processor.<\/span><span style=\"font-weight: 400;\">6<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To understand how PIM exploits this, it is essential to consider the hierarchical structure of a modern DRAM device, which consists of channels, ranks, banks, and subarrays.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> By integrating small processing units at or near each memory bank, a PIM architecture can activate and process data from all banks in parallel. 
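<\/span><\/p>
<p><span style=\"font-weight: 400;\">As a back-of-envelope illustration, the Python sketch below models why this matters for a bandwidth-bound kernel; the bank count and per-bank bandwidth figures are illustrative assumptions, not vendor specifications.<\/span><\/p>

```python
# Toy model of near-bank PIM: every bank feeds its own compute unit,
# so aggregate bandwidth scales with the bank count instead of being
# capped by the single off-chip channel. All figures are illustrative
# assumptions, not vendor specifications.

BANKS = 16                  # banks processed in parallel per chip
BW_PER_BANK_GBS = 32        # internal bandwidth at each bank interface
EXTERNAL_BW_GBS = 64        # off-chip channel bandwidth

def effective_bandwidth_gbs(pim_enabled):
    if pim_enabled:
        return BANKS * BW_PER_BANK_GBS   # banks stream concurrently
    return EXTERNAL_BW_GBS               # all traffic funnels off-chip

speedup = effective_bandwidth_gbs(True) / effective_bandwidth_gbs(False)
print(speedup)   # 8.0: close to an order-of-magnitude gain for memory-bound work
```

<p><span style=\"font-weight: 400;\">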
This bank-level parallelism allows the system to perform computations at a throughput that is dictated by the vast internal bus width, effectively bypassing the narrow off-chip interface for PIM-accelerated operations.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> PIM units can be integrated at various levels of this hierarchy, but near-bank computing has emerged as the most commercially viable approach, offering a balance between performance gains and design feasibility.<\/span><span style=\"font-weight: 400;\">14<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Commercial Implementations and Case Studies<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Several major memory vendors have introduced commercial or near-commercial DRAM-PIM products, each tailored to a specific market segment and leveraging a different type of DRAM technology.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h4><b>Samsung HBM-PIM: Architecture of the Programmable Computing Unit (PCU)<\/b><\/h4>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Samsung&#8217;s &#8220;Aquabolt-XL&#8221; was the industry&#8217;s first commercially fabricated High Bandwidth Memory (HBM) device with integrated PIM capabilities.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> This HBM2-PIM architecture embodies the PNM approach by integrating a <\/span><b>Programmable Computing Unit (PCU)<\/b><span style=\"font-weight: 400;\"> within each memory bank.<\/span><span style=\"font-weight: 400;\">12<\/span><span style=\"font-weight: 400;\"> The PCU is an AI-focused engine, architecturally a 16-lane Single Instruction, Multiple Data (SIMD) array capable of performing 16-bit floating-point (FP16) operations, complete with its own lightweight control logic and register files.<\/span><span style=\"font-weight: 400;\">15<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A key aspect of Samsung&#8217;s strategy was to design the HBM-PIM as a 
&#8220;drop-in replacement&#8221; for conventional HBM2 modules. This was achieved by placing the PCUs at the bank boundary and preserving the standard JEDEC HBM2 interface and timing protocols.<\/span><span style=\"font-weight: 400;\">12<\/span><span style=\"font-weight: 400;\"> This design choice significantly lowers the barrier to adoption for system integrators. In terms of performance, Samsung has reported a 2x performance improvement in applications like speech recognition and an energy reduction of over 70% compared to standard HBM.<\/span><span style=\"font-weight: 400;\">12<\/span><span style=\"font-weight: 400;\"> When integrated into accelerator systems, such as the AMD MI-100 GPU or the Xilinx Alveo FPGA, HBM-PIM has demonstrated system-level performance gains of up to 2.5x and energy savings exceeding 60% for workloads dominated by General Matrix-Vector multiplication (GEMV) and Long Short-Term Memory (LSTM) operations.<\/span><span style=\"font-weight: 400;\">4<\/span><\/p>\n<p>&nbsp;<\/p>\n<h4><b>SK Hynix Accelerator-in-Memory (AiM): GDDR6 for High-Throughput Compute<\/b><\/h4>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">SK Hynix has pursued a different path with its Accelerator-in-Memory (AiM) technology, which is based on high-speed GDDR6 memory.<\/span><span style=\"font-weight: 400;\">24<\/span><span style=\"font-weight: 400;\"> AiM is explicitly designed as a &#8220;domain-specific memory&#8221; to accelerate memory-intensive machine learning workloads, particularly the GEMV operations that are fundamental to modern transformer models and Large Language Models (LLMs).<\/span><span style=\"font-weight: 400;\">24<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Each GDDR6-AiM chip integrates 32 Processing Units (PUs) and is capable of delivering 1 TFLOPS of compute throughput using Brain Floating Point 16 (BF16) precision.<\/span><span style=\"font-weight: 400;\">25<\/span><span style=\"font-weight: 400;\"> The core architectural 
innovation is the concept of &#8220;all-bank parallelism,&#8221; which is enabled through an extended set of DRAM commands. These new commands, such as MACAB (MAC Across all Banks), allow a host controller to orchestrate simultaneous computation across all PUs in the chip.<\/span><span style=\"font-weight: 400;\">24<\/span><span style=\"font-weight: 400;\"> This allows the architecture to fully leverage the massive internal memory bandwidth (rated at 0.5 TB\/s) for computation, which is approximately 8x greater than the chip&#8217;s external I\/O bandwidth (64 GB\/s).<\/span><span style=\"font-weight: 400;\">27<\/span><span style=\"font-weight: 400;\"> SK Hynix&#8217;s AiMX accelerator card, which populates a board with multiple AiM chips, is designed to function as a co-processor to a GPU, offloading the memory-bound stages of LLM inference to improve overall system efficiency and throughput.<\/span><span style=\"font-weight: 400;\">26<\/span><\/p>\n<p>&nbsp;<\/p>\n<h4><b>DIMM-Based PIM: AXDIMM and the Path to Mainstream Adoption<\/b><\/h4>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">While HBM and GDDR6 represent high-performance niches, PIM technology is also being integrated into the more ubiquitous Dual In-line Memory Module (DIMM) form factor for mainstream servers. Samsung&#8217;s Acceleration DIMM (AXDIMM) is a prime example, placing an AI engine on the buffer chip of a standard DIMM.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> This allows the AXDIMM to perform parallel processing across the multiple memory ranks (sets of DRAM chips) on the module, a task not possible with conventional DIMMs.<\/span><span style=\"font-weight: 400;\">23<\/span><span style=\"font-weight: 400;\"> While less tightly integrated than HBM-PIM, this approach provides a more straightforward upgrade path for existing server infrastructure. 
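<\/span><\/p>
<p><span style=\"font-weight: 400;\">The rank-level parallelism such buffer-chip engines exploit can be sketched functionally; the row-to-rank interleaving and table sizes below are illustrative assumptions, not the actual AXDIMM data layout.<\/span><\/p>

```python
# Sketch of a rank-parallel gather-plus-sum, the access pattern behind
# DIMM-based PIM acceleration of embedding lookups. The row-to-rank
# mapping and sizes are illustrative assumptions, not a product spec.

RANKS = 2

def embedding_sum(table, indices):
    dim = len(table[0])
    partials = []
    # In hardware each rank reduces its locally held rows concurrently;
    # Python serializes the per-rank loops.
    for rank in range(RANKS):
        local = [0] * dim
        for i in indices:
            if i % RANKS == rank:        # row-to-rank interleaving
                local = [a + b for a, b in zip(local, table[i])]
        partials.append(local)
    # The host (or buffer chip) merges one partial vector per rank.
    return [sum(col) for col in zip(*partials)]

table = [[1, 1], [2, 2], [3, 3], [4, 4]]
print(embedding_sum(table, [0, 1, 3]))   # [7, 7]
```

<p><span style=\"font-weight: 400;\">Because each rank reduces its own rows locally, only one small partial vector per rank crosses the memory channel, which is why gather-heavy workloads benefit from this organization.<\/span><\/p>
<p><span style=\"font-weight: 400;\">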
For AI-based recommendation applications, AXDIMM has demonstrated an approximate 2x performance gain with a 40% reduction in system-wide energy consumption.<\/span><span style=\"font-weight: 400;\">23<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Similarly, the French company UPMEM has commercialized a DIMM-based PIM solution that integrates multiple general-purpose 64-bit in-order cores, which they call DRAM Processing Units (DPUs), onto their DRAM chips.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> These DIMMs are compatible with standard commodity servers and can provide hundreds of gigabytes of compute-capable memory, targeting data-intensive applications like genomics, analytics, and search.<\/span><span style=\"font-weight: 400;\">29<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The convergence of major vendors and academic research on the near-bank PNM architecture suggests that the industry has, for the present, settled on a pragmatic architectural template. This approach involves placing small, specialized compute units like SIMD or MAC arrays at the periphery of each memory bank. 
This strategy allows memory manufacturers to exploit the vast internal parallelism of DRAM to add computational value, all while preserving their core intellectual property and leveraging their highly optimized manufacturing processes without disruptive changes to the memory array itself.<\/span><span style=\"font-weight: 400;\">17<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Specification<\/b><\/td>\n<td><b>Samsung HBM-PIM (Aquabolt-XL)<\/b><\/td>\n<td><b>SK Hynix GDDR6-AiM<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Base Memory Type<\/b><\/td>\n<td><span style=\"font-weight: 400;\">HBM2<\/span><\/td>\n<td><span style=\"font-weight: 400;\">GDDR6<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Compute Throughput<\/b><\/td>\n<td><span style=\"font-weight: 400;\">4.9 TFLOPS (per GPU with 4 cubes)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">1 TFLOPS\/Chip<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Numeric Precision<\/b><\/td>\n<td><span style=\"font-weight: 400;\">FP16<\/span><\/td>\n<td><span style=\"font-weight: 400;\">BF16<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>PIM Unit Architecture<\/b><\/td>\n<td><span style=\"font-weight: 400;\">16-lane SIMD Array (PCU) per 2 banks<\/span><\/td>\n<td><span style=\"font-weight: 400;\">MAC-based Processing Unit (PU) per bank<\/span><\/td>\n<\/tr>\n<tr>\n<td><b># of PIM Units<\/b><\/td>\n<td><span style=\"font-weight: 400;\">16 per stack<\/span><\/td>\n<td><span style=\"font-weight: 400;\">32 per chip<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Key Architectural Feature<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Drop-in replacement, JEDEC compliance<\/span><\/td>\n<td><span style=\"font-weight: 400;\">All-bank parallelism via extended commands<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Target Workloads<\/b><\/td>\n<td><span style=\"font-weight: 400;\">GEMV, Speech Recognition, LLMs<\/span><\/td>\n<td><span style=\"font-weight: 400;\">GEMV, Transformers, LLMs<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Table 2: Technical Specifications of 
Commercial PIM Solutions. This table provides a direct comparison of the leading commercial DRAM-PIM products, highlighting their capabilities and target use cases based on published specifications.[4, 15, 24, 25]<\/span><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><b>Circuit-Level Modifications and Design Constraints<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The integration of logic into DRAM is a non-trivial engineering challenge, governed by the unique and highly optimized DRAM manufacturing process. The chosen integration strategy has profound implications for design complexity, cost, and performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Most commercial PIM architectures have adopted the near-bank PNM approach precisely because it avoids modifications to the core DRAM subarray.<\/span><span style=\"font-weight: 400;\">17<\/span><span style=\"font-weight: 400;\"> The subarray, which contains the 1-transistor-1-capacitor (1T1C) memory cells, is the densest part of the chip and is manufactured using a specialized process that is not optimized for high-performance logic. By placing compute units in the peripheral region alongside sense amplifiers and I\/O circuitry, manufacturers can implement a &#8220;dual-mode&#8221; functionality, where the chip can operate as either a standard memory or a PIM accelerator.<\/span><span style=\"font-weight: 400;\">20<\/span><span style=\"font-weight: 400;\"> However, this peripheral area is extremely constrained in terms of physical space and power budget, which limits the complexity and performance of the integrated PIM logic.<\/span><span style=\"font-weight: 400;\">17<\/span><\/p>\n<p><span style=\"font-weight: 400;\">More experimental PUM approaches venture into the subarray itself, proposing circuit-level modifications to enable computation. One prominent technique involves leveraging the analog charge-sharing properties of DRAM. 
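<\/span><\/p>
<p><span style=\"font-weight: 400;\">Functionally, this charge-sharing computation can be modeled in a few lines of Python (analog charge dynamics abstracted to pure logic): activating three rows at once leaves each bitline at the majority of the three stored bits, and pinning a control row to all-zeros or all-ones turns that majority into AND or OR.<\/span><\/p>

```python
# Functional model of charge-sharing logic in DRAM (PUM). The analog
# behavior is abstracted away: the bitline is assumed to settle to
# the majority value of the three simultaneously activated cells.

def majority3(a, b, c):
    return 1 if a + b + c >= 2 else 0

def bitwise_and(row_a, row_b):
    # control row pinned to all-zeros: MAJ(a, b, 0) == a AND b
    return [majority3(a, b, 0) for a, b in zip(row_a, row_b)]

def bitwise_or(row_a, row_b):
    # control row pinned to all-ones: MAJ(a, b, 1) == a OR b
    return [majority3(a, b, 1) for a, b in zip(row_a, row_b)]

A = [1, 1, 0, 0]
B = [1, 0, 1, 0]
print(bitwise_and(A, B))  # [1, 0, 0, 0]
print(bitwise_or(A, B))   # [1, 1, 1, 0]
```

<p><span style=\"font-weight: 400;\">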
By activating multiple wordlines simultaneously (e.g., Triple-Row Activation or TRA), the collective charge from multiple cells is shared on the bitline, effectively performing a bitwise majority function, which can be used as a basis for other logic operations.<\/span><span style=\"font-weight: 400;\">18<\/span><span style=\"font-weight: 400;\"> Other proposals, such as ReDRAM, suggest a &#8220;Dual-Row Activation&#8221; mechanism coupled with modest modifications to the sense amplifier circuitry to implement a complete set of bulk bitwise operations (AND, OR, XOR, NOT).<\/span><span style=\"font-weight: 400;\">19<\/span><span style=\"font-weight: 400;\"> While these PUM techniques offer the highest degree of parallelism, they carry the significant risk of compromising the density, yield, and reliability of the commodity DRAM process, which has thus far limited their commercial adoption.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Performance Analysis: Throughput, Latency, and Energy Efficiency Benchmarks<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The performance of DRAM-PIM systems is highly dependent on the nature of the workload. These architectures excel in applications that are fundamentally memory-bound, where the ratio of data access to computation is high. For such workloads, PIM can provide substantial performance gains. For instance, an analysis of the UPMEM PIM system showed a 23x performance improvement over a high-end GPU for a GEMV kernel, but only when the GPU&#8217;s memory was oversubscribed, forcing it to access system memory and highlighting PIM&#8217;s strength in large-dataset scenarios.<\/span><span style=\"font-weight: 400;\">31<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, PIM is not a panacea. 
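<\/span><\/p>
<p><span style=\"font-weight: 400;\">This workload dependence follows directly from roofline arithmetic, sketched below with illustrative machine numbers (not measured figures): a low-intensity kernel like GEMV is capped by bandwidth on the host, while the PIM side may instead run into the ceiling of its simpler compute units.<\/span><\/p>

```python
# Roofline sketch: attainable throughput is the lesser of the compute
# roof and (memory bandwidth x arithmetic intensity). All machine
# numbers are illustrative assumptions.

def attainable_gflops(intensity_flop_per_byte, peak_gflops, mem_bw_gbs):
    return min(peak_gflops, mem_bw_gbs * intensity_flop_per_byte)

# GEMV: roughly 2 FLOPs per 4-byte weight fetched -> 0.5 FLOP/byte
gemv_intensity = 0.5

host = attainable_gflops(gemv_intensity, peak_gflops=10000, mem_bw_gbs=100)
pim = attainable_gflops(gemv_intensity, peak_gflops=1000, mem_bw_gbs=4000)
print(host, pim)   # 50.0 vs 1000: PIM wins, but is now compute-limited
```

<p><span style=\"font-weight: 400;\">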
For applications with significant computational requirements, the relatively simple processing units within PIM can become the new bottleneck, shifting the workload from memory-bound to compute-bound.<\/span><span style=\"font-weight: 400;\">29<\/span><span style=\"font-weight: 400;\"> The effectiveness of PIM is also deeply tied to the software implementation, including how data is laid out across the memory banks, how parallelism is managed across the many PIM cores, and how synchronization is handled.<\/span><span style=\"font-weight: 400;\">10<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Roofline models, which plot computational performance against arithmetic intensity, clearly illustrate PIM&#8217;s value proposition: it dramatically raises the &#8220;memory roofline,&#8221; increasing the peak performance achievable for memory-bound applications.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> For data analytics workloads, multi-level PIM architectures that exploit parallelism at the bank, chip, and rank levels have demonstrated throughput gains of up to 528x over baseline CPU systems.<\/span><span style=\"font-weight: 400;\">8<\/span><span style=\"font-weight: 400;\"> Despite these impressive figures, a recurring challenge is load imbalance, where the work is not evenly distributed across all PIM cores. 
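<\/span><\/p>
<p><span style=\"font-weight: 400;\">One standard mitigation, sketched here with made-up task costs, is a greedy longest-task-first schedule that always places the next task on the least-loaded core.<\/span><\/p>

```python
# Sketch: greedy longest-task-first placement of irregular work across
# PIM cores to curb load imbalance. Task costs are made-up examples.
import heapq

def assign(tasks, n_cores):
    heap = [(0, core) for core in range(n_cores)]   # (load, core id)
    heapq.heapify(heap)
    placement = {core: [] for core in range(n_cores)}
    for cost in sorted(tasks, reverse=True):
        load, core = heapq.heappop(heap)            # least-loaded core
        placement[core].append(cost)
        heapq.heappush(heap, (load + cost, core))
    return placement

tasks = [9, 7, 6, 5, 4, 3, 2]    # e.g., per-vertex work in graph mining
plan = assign(tasks, n_cores=3)
loads = sorted(sum(v) for v in plan.values())
print(loads)   # [11, 12, 13]: near-even split of 36 units of work
```

<p><span style=\"font-weight: 400;\">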
This is particularly problematic in applications with irregular data access patterns, such as graph mining, and can significantly degrade overall performance if not managed by sophisticated scheduling algorithms.<\/span><span style=\"font-weight: 400;\">33<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>SRAM-Based Compute-in-Memory for AI Acceleration<\/b><\/h2>\n<p>&nbsp;<\/p>\n<h3><b>Architectural Principles: Optimizing for Low Latency and High Efficiency<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">While DRAM-PIM targets high-bandwidth computing, Static Random Access Memory (SRAM)-based CIM is engineered for a different set of goals: ultra-low latency and extreme energy efficiency. SRAM&#8217;s inherent advantages\u2014high speed, robustness, and compatibility with standard CMOS logic fabrication processes\u2014make it an ideal substrate for building specialized AI accelerators, particularly for inference tasks at the network edge.<\/span><span style=\"font-weight: 400;\">5<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The architectural principle of SRAM-CIM is to perform the most fundamental and frequent operation in neural networks\u2014the Multiply-Accumulate (MAC) operation\u2014directly within the memory array.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> By storing the neural network weights within the SRAM cells and streaming the input activations through the array, SRAM-CIM can perform thousands of parallel MAC operations in place, virtually eliminating the energy and latency costs associated with fetching weights and intermediate feature maps from off-chip memory.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> This results in latency measured in single-digit nanoseconds and energy efficiency, often expressed in Tera-Operations per Second per Watt (TOPS\/W), that can be orders of magnitude better than conventional processors.<\/span><span style=\"font-weight: 
400;\">3<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Memory Cell and Peripheral Circuitry Modifications for Computation<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Standard 6-transistor (6T) SRAM cells, which are optimized for density and stability in caches, are not well-suited for CIM. To enable in-memory computation, the bitcell and its surrounding peripheral circuits must be significantly modified.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A primary modification is the move from 6T to 8-transistor (8T), 9T, or even 10T bitcell designs.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> The key innovation in these larger cells is the addition of a dedicated read port, which decouples the read and write data paths.<\/span><span style=\"font-weight: 400;\">39<\/span><span style=\"font-weight: 400;\"> This separation is critical for CIM because it allows multiple wordlines to be activated simultaneously to perform a computation without causing the &#8220;read disturb&#8221; instability that would corrupt the stored data in a standard 6T cell.<\/span><span style=\"font-weight: 400;\">39<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The peripheral circuitry surrounding the SRAM array is equally, if not more, critical. Since many SRAM-CIM designs operate in the analog domain to maximize efficiency, they require Digital-to-Analog Converters (DACs) to encode the digital input activations into analog signals (e.g., varying voltage levels or pulse widths on the wordlines).<\/span><span style=\"font-weight: 400;\">41<\/span><span style=\"font-weight: 400;\"> After the computation is performed in the array\u2014typically as a summation of currents on the bitlines\u2014the resulting analog value must be converted back to the digital domain. 
This requires Analog-to-Digital Converters (ADCs) at the end of each column.<\/span><span style=\"font-weight: 400;\">41<\/span><span style=\"font-weight: 400;\"> The design of these ADCs is a major challenge, as their power consumption and area can grow exponentially with the required precision, often becoming the dominant cost factor in the entire CIM macro.<\/span><span style=\"font-weight: 400;\">41<\/span><span style=\"font-weight: 400;\"> Additional circuits, such as current mirrors for precise current control and sophisticated timing logic, are also necessary to ensure accurate MAC operations.<\/span><span style=\"font-weight: 400;\">5<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>SRAM-CIM in Practice: Macro Designs for CNN and Transformer Acceleration<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">These modified bitcells and complex peripherals are assembled into functional blocks known as CIM macros. A typical CIM macro consists of an array of thousands of modified SRAM cells that store a portion of a neural network&#8217;s weight matrix. When input activations are applied to the wordlines, the entire array performs a parallel matrix-vector multiplication in a single cycle.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A vast body of academic and industrial research is focused on designing and optimizing these macros for specific AI workloads. 
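<\/span><\/p>
<p><span style=\"font-weight: 400;\">At a behavioral level, a single column of such a macro can be modeled as follows; the 1-bit weights, unit cell currents, and idealized ADC are illustrative assumptions rather than a circuit-accurate description.<\/span><\/p>

```python
# Behavioral sketch of one analog CIM column (not a circuit model).
# Weights sit in the cells, the shared bitline sums the per-cell
# currents, and a column ADC digitizes the settled value.

def adc(analog_value, n_bits):
    # Idealized ADC: one code per unit of bitline current, clipped to
    # the converter's range. Real ADC cost grows steeply with n_bits.
    max_code = 2 ** n_bits - 1
    return min(round(analog_value), max_code)

def cim_column(weights, activations, adc_bits=4):
    # The multiply happens in each cell; the add happens on the wire.
    bitline_current = sum(w * a for w, a in zip(weights, activations))
    return adc(bitline_current, adc_bits)

w = [1, 0, 1, 1]            # 1-bit weights stored in the column's cells
x = [1, 1, 0, 1]            # 1-bit activations driven on the wordlines
print(cim_column(w, x))     # 2: the dot product, recovered digitally
```

<p><span style=\"font-weight: 400;\">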
Numerous designs have been proposed to accelerate Convolutional Neural Networks (CNNs), Transformers, and Large Language Models (LLMs), with a strong emphasis on power-constrained edge devices.<\/span><span style=\"font-weight: 400;\">46<\/span><span style=\"font-weight: 400;\"> Many advanced architectures are heterogeneous, combining SRAM-CIM with other memory technologies like Magnetoresistive RAM (MRAM) for non-volatility or Read-Only Memory (ROM) for fixed-function acceleration.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> Others employ hybrid analog-digital techniques within the macro itself to strike a better balance between the raw efficiency of analog compute and the precision of digital logic.<\/span><span style=\"font-weight: 400;\">46<\/span><span style=\"font-weight: 400;\"> The design of SRAM-CIM is not just about the memory itself; it is about creating a hyper-specialized accelerator where the memory cells act as programmable arithmetic elements, prioritizing computational function over pure storage density.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Performance Analysis: Measuring Computational Density and TOPS\/W<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The primary metric for evaluating SRAM-CIM performance is energy efficiency, measured in TOPS\/W. By performing computation in place, SRAM-CIM architectures dramatically reduce the data movement that dominates energy consumption in traditional systems. 
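<\/span><\/p>
<p><span style=\"font-weight: 400;\">The TOPS\/W metric itself reduces to simple arithmetic: one picojoule per operation corresponds to one TOPS\/W. The per-operation energy figures in this sketch are illustrative order-of-magnitude assumptions.<\/span><\/p>

```python
# Back-of-envelope: TOPS/W is the reciprocal of energy per operation
# in picojoules (1 pJ/op == 1e12 ops/J == 1 TOPS/W). Energy figures
# below are illustrative order-of-magnitude assumptions.

def tops_per_watt(pj_per_op):
    return 1.0 / pj_per_op

# A conventional accelerator pays for an off-chip weight fetch on top
# of each MAC; an in-array MAC leaves the weight where it is stored.
conventional = tops_per_watt(100.0 + 0.5)   # DRAM fetch + MAC
cim = tops_per_watt(0.5)                    # in-array MAC only
print(conventional, cim)   # ~0.01 vs 2.0 TOPS/W, a roughly 200x gap
```

<p><span style=\"font-weight: 400;\">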
Published results for digital CIM architectures report efficiencies in the range of 1\u2013100 TOPS\/W, which represents a 100x to 1000x improvement over conventional CPUs.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Specific research prototypes have demonstrated even more impressive figures, with some reporting efficiencies as high as 249.1 TOPS\/W or, in highly optimized cases, even 3707.84 TOPS\/W.<\/span><span style=\"font-weight: 400;\">46<\/span><span style=\"font-weight: 400;\"> It is crucial to note that these figures are highly dependent on the fabrication technology node, the numerical precision of the data (e.g., 8-bit integer vs. 4-bit integer), and the specific operation being benchmarked.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> While the energy efficiency is a major strength, the primary weakness of SRAM-CIM is its low storage density compared to DRAM.<\/span><span style=\"font-weight: 400;\">20<\/span><span style=\"font-weight: 400;\"> This low density means that only small models or portions of larger models can fit within the CIM macro at one time. For large models, this can create a new bottleneck: the latency and energy required to frequently reload weights from off-chip DRAM into the SRAM array, potentially offsetting some of the gains from the in-memory computation itself.<\/span><span style=\"font-weight: 400;\">48<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Comparative Analysis: Analog vs. Digital In-Memory Computing<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Within the domain of SRAM-based CIM, the choice of computational paradigm\u2014analog or digital\u2014represents a fundamental design trade-off with profound implications for performance, efficiency, and accuracy. 
This choice dictates the architecture from the circuit level to the system level and is one of the most critical decisions for PIM designers.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Principles of Analog CIM (AIMC): Leveraging Physics for Computation<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Analog Compute-in-Memory (AIMC) performs calculations by directly harnessing the physical laws governing electrical circuits, most notably Kirchhoff&#8217;s current law and Ohm&#8217;s law.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> In a typical AIMC macro, the digital values of an input vector are first converted into analog quantities, such as varying voltage levels or pulses of varying duration, by DACs. These analog signals are then applied to the wordlines of the memory array. Each memory cell, which stores a weight value as a variable conductance, modulates the current flowing from the wordline to the bitline. According to Kirchhoff&#8217;s law, all the currents from the cells in a single column are naturally summed together on the shared bitline.<\/span><span style=\"font-weight: 400;\">51<\/span><span style=\"font-weight: 400;\"> This summed current, which is an analog representation of a dot-product result, is then converted back to a digital number by an ADC.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This approach offers the potential for extreme energy efficiency and massive parallelism, as an entire matrix-vector multiplication across thousands of memory cells can be completed in what is effectively a single analog operation.<\/span><span style=\"font-weight: 400;\">51<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Principles of Digital CIM (DCIM): Integrating Logic within the Memory Array<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In contrast, Digital Compute-in-Memory (DCIM) performs all computations using standard digital logic gates that are physically embedded within the memory 
array&#8217;s structure.<\/span><span style=\"font-weight: 400;\">54<\/span><span style=\"font-weight: 400;\"> Instead of relying on analog current summation, a DCIM architecture might, for example, use XNOR gates integrated with each bitcell to perform bitwise multiplication. The partial products generated along a column are then accumulated using a tree of digital adders that runs parallel to the bitlines.<\/span><span style=\"font-weight: 400;\">54<\/span><span style=\"font-weight: 400;\"> This approach preserves the key advantages of digital computation: high precision, deterministic behavior, and immunity to the noise and process variations that plague analog circuits.<\/span><span style=\"font-weight: 400;\">51<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A Multi-faceted Trade-off Analysis: Precision, Noise, Power, Area, and Scalability<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The decision between AIMC and DCIM involves a complex set of engineering trade-offs, summarized in Table 3.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Precision and Noise:<\/b><span style=\"font-weight: 400;\"> AIMC&#8217;s greatest weakness is its susceptibility to analog non-idealities. 
Thermal noise, shot noise, and minute variations in transistor manufacturing can introduce errors into the computation, limiting precision and potentially degrading the accuracy of AI models.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> The ADC is a particularly critical component, as its limited resolution quantizes the final result and can be a major source of error.<\/span><span style=\"font-weight: 400;\">43<\/span><span style=\"font-weight: 400;\"> DCIM, being entirely digital, is immune to these issues and offers high, predictable precision, though the hardware cost (area and power) increases with the number of bits required.<\/span><span style=\"font-weight: 400;\">51<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Energy Efficiency and Area:<\/b><span style=\"font-weight: 400;\"> AIMC generally boasts higher peak energy efficiency and computational density, particularly for medium-precision operations (e.g., 3 to 8 bits).<\/span><span style=\"font-weight: 400;\">51<\/span><span style=\"font-weight: 400;\"> This is because it avoids the significant area and power overhead of the digital adder trees required by DCIM.<\/span><span style=\"font-weight: 400;\">54<\/span><span style=\"font-weight: 400;\"> However, this advantage is often diminished by the substantial cost of the peripheral ADCs and DACs, which can dominate the macro&#8217;s total power and area budget, especially as precision requirements increase.<\/span><span style=\"font-weight: 400;\">41<\/span><span style=\"font-weight: 400;\"> The ADC thus represents the Achilles&#8217; heel of AIMC, where the theoretical efficiency of the core computation is held hostage by the practical cost of I\/O conversion.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Scalability and Flexibility:<\/b><span style=\"font-weight: 400;\"> DCIM benefits more directly from Moore&#8217;s Law, as its digital logic components scale well with 
advancements in semiconductor manufacturing processes.<\/span><span style=\"font-weight: 400;\">55<\/span><span style=\"font-weight: 400;\"> In contrast, designing high-performance analog circuits becomes progressively more difficult in advanced FinFET nodes, making AIMC scalability more challenging.<\/span><span style=\"font-weight: 400;\">57<\/span><span style=\"font-weight: 400;\"> Furthermore, the digital nature of DCIM provides greater flexibility for reconfiguring dataflows and mapping different types of computations onto the hardware.<\/span><span style=\"font-weight: 400;\">51<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The optimal choice between these paradigms is not absolute but is highly dependent on the specific workload. Research indicates that AIMC can be more energy-efficient for neural network layers, like standard convolutions, that can exploit high spatial parallelism across a large memory array.<\/span><span style=\"font-weight: 400;\">42<\/span><span style=\"font-weight: 400;\"> Conversely, DCIM may outperform AIMC for layers with limited parallelism, such as depthwise convolutions, where smaller, more flexible compute units are advantageous.<\/span><span style=\"font-weight: 400;\">51<\/span><span style=\"font-weight: 400;\"> This suggests that the most effective future accelerators may be heterogeneous systems that incorporate both AIMC and DCIM macros, using a sophisticated compiler to map different parts of a neural network to the most suitable hardware.<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Metric<\/b><\/td>\n<td><b>Analog CIM (AIMC)<\/b><\/td>\n<td><b>Digital CIM (DCIM)<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Computational Principle<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Kirchhoff&#8217;s Law (Current\/Charge Summation)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Integrated Digital Logic (Adders\/Multipliers)<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Precision<\/b><\/td>\n<td><span style=\"font-weight: 
400;\">Limited by noise, device variation, ADC resolution<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High, deterministic, limited by bit-width<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Noise Immunity<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Low; susceptible to analog non-idealities<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High; robust digital operation<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Peak Energy Efficiency (TOPS\/W)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Potentially higher (especially at medium precision)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Lower due to digital logic overhead<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Area Efficiency (Compute Density)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Potentially higher (fewer transistors per MAC)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Lower due to area of adder trees<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Key Bottleneck<\/b><\/td>\n<td><span style=\"font-weight: 400;\">ADC\/DAC power, area, and speed<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Adder tree power and area<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Technology Scaling<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Less direct benefit; analog design challenges increase<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Benefits directly from transistor scaling<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Flexibility<\/b><\/td>\n<td><span style=\"font-weight: 400;\">More rigid dataflow<\/span><\/td>\n<td><span style=\"font-weight: 400;\">More flexible spatial mapping and reconfiguration<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Table 3: Trade-off Analysis of Analog vs. Digital CIM. 
This table summarizes the key engineering trade-offs between the two primary paradigms for compute-in-memory, based on a synthesis of multiple sources.[41, 51, 53, 55, 58]<\/span><\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><b>Hybrid Approaches: Seeking the Best of Both Paradigms<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Recognizing the complementary strengths of AIMC and DCIM, researchers are actively exploring hybrid architectures that aim to achieve a superior balance of efficiency and accuracy. A common strategy is to partition the computation by bit significance: the Most Significant Bits (MSBs) of a multiplication, which have the largest impact on the final result, are processed in the digital domain for high accuracy, while the Least Significant Bits (LSBs) are processed in the analog domain to save power.<\/span><span style=\"font-weight: 400;\">55<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A more dynamic approach involves using &#8220;saliency&#8221; to guide the computation. In this model, the system can perform a quick, low-precision calculation to estimate the importance of different parts of the data (e.g., identifying the main subject in an image versus the background). 
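The bit-significance partitioning can be sketched numerically: below, the high nibble of each 8-bit weight is multiplied exactly (the "digital" path), while the low nibble takes a noisy "analog" path. The nibble split, noise model, and magnitudes are purely illustrative:

```python
import random

random.seed(1)
w = [random.randrange(256) for _ in range(1000)]  # 8-bit weights
x = [random.randrange(16) for _ in range(1000)]   # 4-bit activations

errors = []
for wi, xi in zip(w, x):
    msb, lsb = wi >> 4, wi & 0xF
    digital = (msb * xi) << 4                  # exact digital MAC on the high nibble
    analog  = lsb * xi + random.gauss(0, 2)    # noisy "analog" MAC on the low nibble
    hybrid  = digital + round(analog)
    errors.append(abs(hybrid - wi * xi))

mean_exact = sum(wi * xi for wi, xi in zip(w, x)) / len(w)
print(f"mean absolute error: {sum(errors) / len(errors):.2f} "
      f"(vs mean exact product {mean_exact:.0f})")
```

Because the noisy path carries only the low-order bits, its errors perturb the final product far less than the same noise applied to a full-width analog multiply would.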
It then dynamically allocates more precise digital compute resources to the salient regions and uses highly efficient analog compute for the non-salient regions.<\/span><span style=\"font-weight: 400;\">58<\/span><span style=\"font-weight: 400;\"> These hybrid systems represent a promising frontier in CIM design, offering a path to build accelerators that are both highly efficient and robust enough for a wide range of applications.<\/span><span style=\"font-weight: 400;\">46<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>System-Level Integration and Software Enablement<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The availability of PIM hardware is only the first step; making this hardware usable and integrating it effectively into existing computing systems presents a host of complex challenges that span the entire software and hardware stack. The &#8220;drop-in replacement&#8221; strategy pursued by some vendors, while intended to ease adoption, creates significant downstream complexity. By adhering to existing memory interfaces, these PIM devices cannot fundamentally alter the host-memory interaction model. This necessitates a sophisticated and often proprietary software layer to bridge the gap between a processor-centric system and a memory-centric device, creating a fragmented ecosystem.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> The software gap\u2014the chasm between hardware capability and software usability\u2014remains the single greatest barrier to the widespread adoption of PIM technology.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>The PIM-Aware Memory Controller: New Commands, Scheduling, and Address Mapping<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The memory controller (MC) is the crucial interface between the host processor and the PIM device, and it must be substantially enhanced to support PIM operations. 
First, the MC must be capable of issuing new, non-standard commands to the memory to initiate computations. Examples include the extended command set from SK Hynix, which features instructions like MACAB (MAC Across all Banks), or Samsung&#8217;s proposed PIM-ACT command, which simultaneously activates multiple banks and specifies a PIM operation type.<\/span><span style=\"font-weight: 400;\">22<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Second, the MC requires new scheduling algorithms. In a PIM-enabled system, the MC must arbitrate between conventional memory read\/write requests from the host CPU and PIM computation requests. Naive scheduling can lead to resource conflicts and degrade overall memory bandwidth. Advanced schedulers are needed to balance fairness and throughput, potentially by limiting the number of consecutive requests of the same type or by using more sophisticated policies.<\/span><span style=\"font-weight: 400;\">22<\/span><span style=\"font-weight: 400;\"> Proposals like DEAR-PIM introduce a disaggregated command queue to better manage the issuance of all-bank PIM commands and mitigate peak power consumption issues.<\/span><span style=\"font-weight: 400;\">60<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Finally, address mapping becomes more complex. PIM systems may employ separate physical address spaces for standard DRAM and PIM-capable memory regions. 
This requires a PIM-aware Memory Management Unit (MMU) or a specialized hardware block within the controller to manage these distinct spaces and translate addresses accordingly.<\/span><span style=\"font-weight: 400;\">61<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Programming Models, Compilers, and APIs for PIM<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For PIM to be adopted by the broader software development community, its complexity must be abstracted away by high-level programming models, compilers, and APIs.<\/span><span style=\"font-weight: 400;\">63<\/span><span style=\"font-weight: 400;\"> Currently, programming PIM hardware often requires expert knowledge and manual management of data placement, kernel execution, and synchronization.<\/span><span style=\"font-weight: 400;\">29<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Several approaches are being explored to solve this programmability challenge:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Instruction Set Architecture (ISA) Extensions:<\/b><span style=\"font-weight: 400;\"> At the lowest level, the host processor&#8217;s ISA can be extended with custom instructions that directly target PIM functional units. This allows for tight integration of PIM into the processor pipeline.<\/span><span style=\"font-weight: 400;\">13<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>APIs and Directives:<\/b><span style=\"font-weight: 400;\"> A more portable approach involves providing libraries and compiler directives. 
For example, the widely used OpenMP standard for parallel programming, which uses #pragma directives to mark parallel regions, could be extended to support offloading specific code blocks to PIM devices.<\/span><span style=\"font-weight: 400;\">65<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>PIM-Aware Compilers:<\/b><span style=\"font-weight: 400;\"> The ultimate goal is a compiler that can automatically analyze standard source code (e.g., in C++ or Python), identify data-intensive sections suitable for PIM execution, partition the application between the host and PIM, optimize data layouts for PIM&#8217;s parallel architecture, and generate the necessary PIM instructions.<\/span><span style=\"font-weight: 400;\">68<\/span><span style=\"font-weight: 400;\"> Research projects like PRIMO and PIMCOMP are developing end-to-end compiler toolchains specifically for this purpose.<\/span><span style=\"font-weight: 400;\">68<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3><b>Runtime Systems for Heterogeneous PIM Architectures<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In a modern heterogeneous system comprising CPUs, GPUs, and various PIM accelerators, a sophisticated runtime system is essential for orchestrating execution. This runtime is responsible for dynamically scheduling tasks across the different processing elements, managing resource allocation, and handling data movement and synchronization.<\/span><span style=\"font-weight: 400;\">69<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For PIM, the runtime must manage PIM-aware memory mapping, keeping track of which data resides in which type of memory and handling the translation between virtual and physical addresses.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> It also plays a critical role in maintaining data consistency between the host&#8217;s caches and the PIM&#8217;s local memory. 
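One simple policy such a runtime might apply is a roofline-style offload test: keep compute-bound kernels on the host and send bandwidth-bound ones to PIM. A minimal sketch, with a ridge point and kernel figures invented purely for illustration:

```python
# Roofline-style offload heuristic: send a kernel to PIM when its arithmetic
# intensity (FLOPs per byte moved) falls below the host's compute-to-bandwidth
# ratio. The ridge point and kernel figures are invented for illustration.

HOST_RIDGE_POINT = 10.0  # FLOPs/byte at which a hypothetical host becomes compute-bound

def choose_device(flops, bytes_moved):
    return "pim" if flops / bytes_moved < HOST_RIDGE_POINT else "host"

# GEMV reads every 4-byte weight once for ~2 FLOPs -> 0.5 FLOPs/byte -> PIM.
print(choose_device(flops=2 * 4096 * 4096, bytes_moved=4 * 4096 * 4096))
# Dense GEMM reuses operands heavily -> high intensity -> stays on the host.
print(choose_device(flops=2 * 4096**3, bytes_moved=3 * 4 * 4096**2))
```

Production runtimes refine this with profiling, data-placement cost, and synchronization overheads, but the underlying intensity test captures why GEMV-like kernels are the natural PIM targets.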
Research systems like HEMPS demonstrate how a runtime can profile an application and adaptively partition its execution across available heterogeneous resources to optimize performance.<\/span><span style=\"font-weight: 400;\">73<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>The Role of Interconnects: CXL as an Enabler for Scalable PIM Systems<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The emergence of the Compute Express Link (CXL) interconnect standard is set to be a transformative enabler for PIM. CXL provides a high-bandwidth, low-latency, cache-coherent protocol for connecting processors, memory, and accelerators.<\/span><span style=\"font-weight: 400;\">74<\/span><span style=\"font-weight: 400;\"> Unlike the rigid master-slave relationship of a traditional DDR memory channel, CXL allows memory devices to be treated as peer endpoints in the system.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This has profound implications for PIM. Instead of being confined to the processor&#8217;s local memory bus, PIM devices can be attached via CXL, enabling more flexible and scalable system designs. Architectures like CENT showcase this potential by proposing a GPU-free system for LLM inference built from a host CPU connected to a network of CXL-based PIM devices via a CXL switch.<\/span><span style=\"font-weight: 400;\">74<\/span><span style=\"font-weight: 400;\"> This allows for peer-to-peer communication directly between PIM devices and enables the construction of large, disaggregated systems where pools of PIM resources can be dynamically composed and allocated to workloads. 
CXL is therefore poised to shift the PIM paradigm from simply being &#8220;in-memory&#8221; to enabling a new class of &#8220;composable memory&#8221; systems, transforming the architecture of the data center.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>PIM for Domain-Specific Acceleration<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The practical value of PIM is most evident in its application to specific, data-intensive domains where the limitations of the von Neumann architecture are most acute. PIM is not a general-purpose accelerator; it thrives on workloads that are &#8220;embarrassingly memory-bound&#8221;\u2014those with a very high ratio of memory access to computation. In these domains, where powerful GPUs are often underutilized and starved for data, PIM&#8217;s ability to deliver massive memory bandwidth directly to simple, parallel compute units provides a compelling advantage.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Large Language Model (LLM) Inference<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The inference process for large language models, especially the autoregressive generation of tokens one by one, is a prime example of a memory-bound workload.<\/span><span style=\"font-weight: 400;\">74<\/span><span style=\"font-weight: 400;\"> The dominant computations are General Matrix-Vector multiplications (GEMV), which have low arithmetic intensity and cannot fully exploit the massive parallelism of a GPU&#8217;s compute engines.<\/span><span style=\"font-weight: 400;\">27<\/span><span style=\"font-weight: 400;\"> This makes LLM inference an ideal candidate for PIM acceleration.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Performance results from commercial and research systems are compelling. 
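The memory-bound nature of autoregressive decoding can be made concrete with a back-of-the-envelope bound on token rate: every weight must be streamed from memory once per generated token, so bandwidth, not FLOPs, sets the ceiling. The model size and bandwidth figures below are illustrative:

```python
# Upper bound on decode throughput when every parameter is read once per token.
def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A hypothetical 7B-parameter model in fp16 streamed over ~1 TB/s of HBM:
print(f"{max_tokens_per_sec(7, 2, 1000):.1f} tokens/s upper bound")
```

Raising this ceiling requires either more bandwidth or less data movement per token, which is precisely the lever PIM pulls by computing the GEMV where the weights already reside.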
Integrating Samsung&#8217;s HBM-PIM with an AMD MI-100 GPU has been shown to more than double the performance and energy efficiency for GPT-J model inference.<\/span><span style=\"font-weight: 400;\">4<\/span><span style=\"font-weight: 400;\"> Projections for SK Hynix&#8217;s AiM-based system suggest a potential 13x performance improvement for GPT-3 inference compared to a GPU-only system, while consuming only 17% of the energy.<\/span><span style=\"font-weight: 400;\">27<\/span><span style=\"font-weight: 400;\"> Looking further, novel architectures like CENT, which replace GPUs entirely with a CXL-based network of PIM devices, project a reduction in the total cost of ownership (TCO) per query by up to 6.94x, highlighting PIM&#8217;s potential to fundamentally change the economics of deploying LLMs at scale.<\/span><span style=\"font-weight: 400;\">74<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Graph Processing and Analytics<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Graph processing algorithms, which are central to social network analysis, logistics, and bioinformatics, are notoriously difficult to accelerate on conventional hardware. 
Their characteristic features are massive datasets, irregular memory access patterns (pointer chasing), and very low data reuse, which render traditional cache hierarchies ineffective.<\/span><span style=\"font-weight: 400;\">33<\/span><span style=\"font-weight: 400;\"> PIM is a natural fit for these workloads because it brings computation directly to the large graph data structures stored in main memory, minimizing costly random data accesses.<\/span><span style=\"font-weight: 400;\">33<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While PIM can accelerate memory-intensive graph operations like set intersection and subtraction, simply offloading existing algorithms is often not enough.<\/span><span style=\"font-weight: 400;\">33<\/span><span style=\"font-weight: 400;\"> The irregular structure of graph data can lead to severe workload imbalance across the parallel PIM cores, limiting performance gains.<\/span><span style=\"font-weight: 400;\">33<\/span><span style=\"font-weight: 400;\"> Achieving good performance requires algorithm-hardware co-design, including the development of new graph partitioning and scheduling strategies specifically for PIM architectures. 
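The imbalance problem, and why PIM-specific partitioning matters, can be seen with a toy degree distribution. The distribution and the greedy heuristic here are illustrative, not any published framework's actual strategy:

```python
# Hub-heavy degree distribution typical of real graphs (values are synthetic).
degrees = [1000 // (i + 1) + 1 for i in range(4096)]
BANKS = 16

# Naive partitioning: contiguous vertex ranges per PIM bank.
chunk = len(degrees) // BANKS
naive_loads = [sum(degrees[b * chunk:(b + 1) * chunk]) for b in range(BANKS)]

# Degree-aware partitioning: heaviest vertices go to the least-loaded bank.
greedy_loads = [0] * BANKS
for d in sorted(degrees, reverse=True):
    greedy_loads[greedy_loads.index(min(greedy_loads))] += d

mean = sum(degrees) / BANKS
print(f"naive  imbalance (max/mean): {max(naive_loads) / mean:.2f}")
print(f"greedy imbalance (max/mean): {max(greedy_loads) / mean:.2f}")
```

Since the slowest bank gates each superstep, reducing the max/mean load ratio translates almost directly into end-to-end speedup on a bank-parallel PIM system.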
Frameworks like PIMMiner are being developed to address these challenges and better utilize PIM for graph mining tasks.<\/span><span style=\"font-weight: 400;\">34<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Database and Query Acceleration<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In-memory database systems, which rely on large DRAM capacities to hold entire datasets, are frequently bottlenecked by the bandwidth of the memory bus during analytical query processing.<\/span><span style=\"font-weight: 400;\">30<\/span><span style=\"font-weight: 400;\"> PIM can be used to accelerate key database primitives, particularly table scans (filtering) and joins.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The Membrane framework, for example, demonstrates how filtering operations can be offloaded to simple comparison units placed at each DRAM bank. This approach achieved a 3\u20134x query speedup in the DuckDB analytics database with only a modest memory overhead.<\/span><span style=\"font-weight: 400;\">30<\/span><span style=\"font-weight: 400;\"> For the more complex hash join operation, simply running the CPU-based algorithm on PIM is suboptimal.<\/span><span style=\"font-weight: 400;\">78<\/span><span style=\"font-weight: 400;\"> The JSPIM architecture shows the power of co-design by redesigning the hash table data structure to be PIM-friendly and deploying parallel search engines within each memory subarray. This allows for constant-time (O(1)) lookups and effectively mitigates the problem of data skew, fully exploiting the fine-grained parallelism of PIM.<\/span><span style=\"font-weight: 400;\">78<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Deep Learning Recommendation Models (DLRM)<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Deep learning recommendation models, which power personalization across e-commerce and media streaming, are among the most memory-demanding workloads in the data center. 
Their performance is dominated by the need to access massive embedding tables, which can be tens or even hundreds of gigabytes in size.<\/span><span style=\"font-weight: 400;\">79<\/span><span style=\"font-weight: 400;\"> The core operation involves a sparse and irregular &#8220;gather-reduce&#8221; access pattern to these tables, which is an ideal match for PIM&#8217;s ability to provide high-bandwidth, parallel access to memory.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Several PIM-based solutions have been proposed to accelerate DLRMs. The UpDLRM framework utilizes the commercial UPMEM PIM hardware to accelerate embedding lookups, demonstrating lower inference latency compared to both CPU and hybrid CPU-GPU systems.<\/span><span style=\"font-weight: 400;\">79<\/span><span style=\"font-weight: 400;\"> More specialized architectures like ProactivePIM are co-designed with specific DLRM algorithms, such as those using weight-sharing for model compression. By incorporating an in-PIM cache and an intelligent prefetching scheme tailored to the algorithm&#8217;s access patterns, ProactivePIM achieved a 4.8x speedup over previous PIM-based approaches.<\/span><span style=\"font-weight: 400;\">83<\/span><span style=\"font-weight: 400;\"> These examples underscore a key theme: unlocking PIM&#8217;s full potential requires moving beyond simply offloading existing code and instead co-designing algorithms and hardware to work in concert.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>Critical Challenges and Future Research Directions<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Despite its rapid progress from academic concept to commercial product, Processing-in-Memory faces significant system-level hurdles that must be overcome to achieve widespread adoption. These challenges span hardware design, system software, and the broader industry ecosystem. 
Addressing them will be the focus of PIM research and development for the foreseeable future.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Data Coherency and Consistency in Heterogeneous Systems<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Perhaps the most formidable technical challenge for PIM is maintaining data coherency in a system where both a host processor and multiple PIM cores can read and write to the same shared data.<\/span><span style=\"font-weight: 400;\">86<\/span><span style=\"font-weight: 400;\"> Traditional cache coherence protocols, such as snooping or directory-based schemes, are designed for tightly coupled processors and are unworkable for PIM. They would require a constant stream of coherence messages (e.g., invalidations, requests for ownership) to be sent over the narrow and high-latency off-chip memory bus, completely overwhelming it and negating any performance benefit from PIM.<\/span><span style=\"font-weight: 400;\">88<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Early workarounds, such as marking PIM-accessible memory regions as non-cacheable by the host CPU or using coarse-grained software locks, are too restrictive and severely degrade performance, especially for applications with fine-grained data sharing.<\/span><span style=\"font-weight: 400;\">88<\/span><span style=\"font-weight: 400;\"> A more promising direction is represented by hardware mechanisms like <\/span><b>LazyPIM<\/b><span style=\"font-weight: 400;\">. This approach leverages speculation: the PIM core executes its task assuming it has the necessary coherence permissions. At the end of the task, it sends a single, compressed &#8220;coherence signature&#8221; to the host, summarizing all the memory locations it accessed. The host processor then checks this signature for conflicts with its own cache activity. 
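A minimal sketch of such a signature check follows; the hash functions and bitmap size are invented for illustration, and LazyPIM's actual encoding differs. Note that false positives only trigger an unnecessary (but safe) rollback:

```python
# Signature-based conflict check in the spirit of LazyPIM: each side summarizes
# the cache lines it touched in a Bloom-filter-like bitmap, and the host
# intersects the two summaries instead of exchanging per-access messages.

SIG_BITS = 1 << 16

def signature(addresses):
    sig = 0
    for addr in addresses:
        line = addr >> 6  # 64-byte cache-line granularity
        for seed in (0x9E3779B1, 0x85EBCA77):
            sig |= 1 << ((line * seed) % SIG_BITS)
    return sig

def may_conflict(pim_accesses, host_accesses):
    return signature(pim_accesses) & signature(host_accesses) != 0

pim_writes = [0x1000 + 64 * i for i in range(32)]  # lines written by a PIM task
host_lines = [0x8000 + 64 * i for i in range(32)]  # lines the host cached

print(may_conflict(pim_writes, host_lines))      # no real overlap
print(may_conflict(pim_writes, pim_writes[:1]))  # genuine overlap -> always True
```

One compressed signature per task replaces a per-access stream of coherence messages over the off-chip bus, which is the source of LazyPIM's traffic savings.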
If a conflict is found, the PIM task is rolled back and re-executed; otherwise, the results are committed.<\/span><span style=\"font-weight: 400;\">88<\/span><span style=\"font-weight: 400;\"> By batching coherence checks, LazyPIM dramatically reduces off-chip traffic. Until an efficient and standardized solution to the coherence problem is adopted, PIM will likely remain confined to the role of a specialized accelerator for workloads with minimal or easily managed data sharing, rather than acting as a seamless extension of the host&#8217;s coherent memory space.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Thermal Management and Power Delivery in 3D-Stacked PIM<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The 3D-stacking technology that enables high-bandwidth memories like HBM-PIM also introduces severe thermal challenges.<\/span><span style=\"font-weight: 400;\">89<\/span><span style=\"font-weight: 400;\"> Vertically stacking multiple DRAM dies on top of a logic die increases power density and creates long thermal paths, making it difficult to extract heat from the layers furthest from the heat sink.<\/span><span style=\"font-weight: 400;\">89<\/span><span style=\"font-weight: 400;\"> Thermal failures are a leading cause of reliability issues in 3D-stacked chips, and the non-uniform power distribution of PIM workloads can create localized &#8220;hotspots&#8221; that limit overall system performance.<\/span><span style=\"font-weight: 400;\">89<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Addressing these thermal constraints requires co-design across the hardware and system management layers. 
Proposed solutions include the coordinated use of Dynamic Voltage and Frequency Scaling (DVFS) for the logic cores and Low Power Modes (LPM) for the DRAM banks to dynamically manage the thermal budget.<\/span><span style=\"font-weight: 400;\">90<\/span><span style=\"font-weight: 400;\"> More advanced architectural concepts, like the Tasa architecture, propose using heterogeneous cores within the PIM logic layer\u2014mixing high-performance cores for compute-intensive tasks with high-efficiency cores for memory-intensive tasks\u2014coupled with thermal-aware task scheduling to balance the temperature distribution across the die and maximize performance under a fixed thermal envelope.<\/span><span style=\"font-weight: 400;\">89<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Standardization, Scalability, and the PIM Ecosystem<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The current PIM landscape is fragmented. Each major vendor, including Samsung, SK Hynix, and UPMEM, has developed its own proprietary architecture, instruction set, and software development kit.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> This lack of standardization is a major impediment to building a robust software ecosystem, as it prevents the development of portable applications and tools that can run across different PIM hardware.<\/span><span style=\"font-weight: 400;\">59<\/span><span style=\"font-weight: 400;\"> Industry-wide standardization efforts, such as those being pursued within JEDEC for future HBM generations, are critical for fostering broader adoption and enabling a competitive marketplace.<\/span><span style=\"font-weight: 400;\">15<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Emerging Research Frontiers and Long-Term Outlook<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The long-term vision for PIM is likely one that is both heterogeneous and composable. 
Future systems will not rely on a single, monolithic PIM architecture but will instead be assembled from a diverse set of computational resources. This points to a future of flexible, disaggregated systems composed of different types of PIM and traditional compute, orchestrated by intelligent software. Key research frontiers include:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Advanced System Software:<\/b><span style=\"font-weight: 400;\"> Continued innovation in PIM-aware compilers, runtimes, and operating systems is paramount to abstracting hardware complexity and making PIM accessible to mainstream programmers.<\/span><span style=\"font-weight: 400;\">63<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Heterogeneous PIM Architectures:<\/b><span style=\"font-weight: 400;\"> Research is actively exploring hybrid designs that combine the strengths of DRAM-PIM (bandwidth) and SRAM-CIM (latency), and potentially integrate emerging non-volatile memories for new capabilities.<\/span><span style=\"font-weight: 400;\">13<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Composable Systems via CXL:<\/b><span style=\"font-weight: 400;\"> Fully exploiting the CXL interconnect will be key to building scalable, rack-level systems from pools of disaggregated PIM resources, moving beyond the single-node accelerator model.<\/span><span style=\"font-weight: 400;\">74<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>New Application Domains:<\/b><span style=\"font-weight: 400;\"> While AI has been the primary driver, researchers are exploring PIM&#8217;s potential to accelerate other data-intensive fields, such as bioinformatics, scientific simulation, and cryptography.<\/span><span style=\"font-weight: 400;\">59<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2><b>Conclusion and Strategic Recommendations<\/b><\/h2>\n<p>&nbsp;<\/p>\n<h3><b>Synthesizing the State of PIM 
Technology<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Processing-in-Memory has successfully transitioned from a long-standing academic concept into a commercial reality, fundamentally driven by the untenable energy and performance costs of data movement in modern AI-centric computing. The technological landscape has bifurcated into two primary streams: high-bandwidth, high-capacity DRAM-based PIM, exemplified by commercial HBM and GDDR6 products, which targets large-scale AI and HPC workloads in the data center; and low-latency, high-efficiency SRAM-based CIM, which is poised to dominate the market for specialized AI inference acceleration at the edge.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While the hardware is maturing at a rapid pace, with demonstrable, order-of-magnitude improvements in energy efficiency and significant performance gains for targeted memory-bound workloads, the PIM paradigm is still in its early stages. The most significant challenges are no longer at the circuit or device level but at the system level. The lack of mature and standardized software ecosystems, including PIM-aware compilers, operating systems, and programming models, remains the single greatest barrier to widespread adoption. 
Furthermore, complex hardware issues such as cache coherence and thermal management in 3D-stacked implementations must be solved to enable PIM&#8217;s use in more general-purpose computing contexts.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Recommendations for Architects, Developers, and Technology Strategists<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Based on this analysis, the following strategic recommendations are proposed for key stakeholders in the computing ecosystem:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>For Hardware and System Architects:<\/b><span style=\"font-weight: 400;\"> The focus should shift from demonstrating point solutions to solving systemic integration challenges. This includes developing and standardizing efficient cache coherence mechanisms that do not rely on the off-chip bus, designing novel thermal management solutions for 3D-stacked PIM, and embracing heterogeneous designs that combine DRAM-PIM, SRAM-CIM, and conventional cores. Investing in architectures that leverage the CXL interconnect is critical for enabling the next generation of scalable, composable memory-centric systems. Algorithm-hardware co-design should be a guiding principle, as tailoring both the hardware and the software to a specific problem domain has consistently yielded the most significant performance breakthroughs.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>For Software Developers and Compiler Researchers:<\/b><span style=\"font-weight: 400;\"> The greatest opportunity for impact lies in bridging the software gap. This requires a concerted effort to build a robust PIM software ecosystem. 
Key priorities include developing PIM-aware compilers that can automatically partition applications and optimize data layouts, creating user-friendly APIs and programming model extensions (e.g., for OpenMP or C++) that abstract away hardware complexity, and designing intelligent runtime systems that can dynamically schedule tasks across heterogeneous PIM and CPU\/GPU resources.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>For Technology Strategists and Decision-Makers:<\/b><span style=\"font-weight: 400;\"> It is crucial to recognize that PIM is not a single, monolithic technology but a paradigm shift toward a more diverse and specialized computing landscape. In the near term, organizations should evaluate current commercial PIM solutions for specific, well-defined, and acutely memory-bound workloads where they can provide an immediate and measurable return on investment\u2014LLM inference and recommendation systems are prime candidates. Concurrently, they should closely monitor the development of industry standards (e.g., via JEDEC) and the maturation of the software ecosystem, as these will be the key indicators for when PIM is ready to move from a niche accelerator to a foundational component of mainstream computing infrastructure.<\/span><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The Imperative for In-Memory Computation Deconstructing the &#8220;Memory Wall&#8221;: Performance and Energy Bottlenecks in von Neumann Architectures For decades, the advancement of computing has been governed by the processor-centric von <span class=\"readmore\"><a href=\"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/\">Read More 
&#8230;<\/a><\/span><\/p>\n","protected":false},"author":2,"featured_media":7960,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2374],"tags":[3412,3410,3409,3408,3413,3411],"class_list":["post-7948","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deep-research","tag-dram","tag-near-memory-computing","tag-pim","tag-processing-in-memory","tag-sram","tag-von-neumann-bottleneck"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz Blog<\/title>\n<meta name=\"description\" content=\"Processing-in-Memory (PIM) places compute inside DRAM\/SRAM. We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz Blog\" \/>\n<meta property=\"og:description\" content=\"Processing-in-Memory (PIM) places compute inside DRAM\/SRAM. 
We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/\" \/>\n<meta property=\"og:site_name\" content=\"Uplatz Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Uplatz-1077816825610769\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-11-28T15:36:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-28T16:25:09+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1280\" \/>\n\t<meta property=\"og:image:height\" content=\"720\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"uplatzblog\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@uplatz_global\" \/>\n<meta name=\"twitter:site\" content=\"@uplatz_global\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"uplatzblog\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"33 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/\"},\"author\":{\"name\":\"uplatzblog\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/person\\\/8ecae69a21d0757bdb2f776e67d2645e\"},\"headline\":\"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing\",\"datePublished\":\"2025-11-28T15:36:18+00:00\",\"dateModified\":\"2025-11-28T16:25:09+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/\"},\"wordCount\":7259,\"publisher\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg\",\"keywords\":[\"DRAM\",\"Near-Memory Computing\",\"PIM\",\"Processing-in-Memory\",\"SRAM\",\"Von Neumann Bottleneck\"],\"articleSection\":[\"Deep 
Research\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/\",\"name\":\"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg\",\"datePublished\":\"2025-11-28T15:36:18+00:00\",\"dateModified\":\"2025-11-28T16:25:09+00:00\",\"description\":\"Processing-in-Memory (PIM) places compute inside DRAM\\\/SRAM. 
We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#primaryimage\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg\",\"contentUrl\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2025\\\/11\\\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg\",\"width\":1280,\"height\":720},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\",\"name\":\"Uplatz Blog\",\"description\":\"Uplatz is a global IT Training &amp; Consulting 
company\",\"publisher\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\",\"name\":\"uplatz.com\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/Uplatz-Logo-Copy-2.png\",\"contentUrl\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/Uplatz-Logo-Copy-2.png\",\"width\":1280,\"height\":800,\"caption\":\"uplatz.com\"},\"image\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Uplatz-1077816825610769\\\/\",\"https:\\\/\\\/x.com\\\/uplatz_global\",\"https:\\\/\\\/www.instagram.com\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/7956715?trk=tyah&amp;amp;amp;amp;trkInfo=clickedVertical:company,clickedEntityId:7956715,idx:1-1-1,tarId:1464353969447,tas:uplatz\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/person\\\/8ecae69a21d0757bdb2f776e67d2645e\",\"name\":\"uplatzblog\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4
418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"caption\":\"uplatzblog\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz Blog","description":"Processing-in-Memory (PIM) places compute inside DRAM\/SRAM. We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/","og_locale":"en_US","og_type":"article","og_title":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz Blog","og_description":"Processing-in-Memory (PIM) places compute inside DRAM\/SRAM. We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.","og_url":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/","og_site_name":"Uplatz Blog","article_publisher":"https:\/\/www.facebook.com\/Uplatz-1077816825610769\/","article_published_time":"2025-11-28T15:36:18+00:00","article_modified_time":"2025-11-28T16:25:09+00:00","og_image":[{"width":1280,"height":720,"url":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg","type":"image\/jpeg"}],"author":"uplatzblog","twitter_card":"summary_large_image","twitter_creator":"@uplatz_global","twitter_site":"@uplatz_global","twitter_misc":{"Written by":"uplatzblog","Est. 
reading time":"33 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#article","isPartOf":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/"},"author":{"name":"uplatzblog","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/person\/8ecae69a21d0757bdb2f776e67d2645e"},"headline":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing","datePublished":"2025-11-28T15:36:18+00:00","dateModified":"2025-11-28T16:25:09+00:00","mainEntityOfPage":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/"},"wordCount":7259,"publisher":{"@id":"https:\/\/uplatz.com\/blog\/#organization"},"image":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#primaryimage"},"thumbnailUrl":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg","keywords":["DRAM","Near-Memory Computing","PIM","Processing-in-Memory","SRAM","Von Neumann Bottleneck"],"articleSection":["Deep Research"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/","url":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/","name":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing | Uplatz 
Blog","isPartOf":{"@id":"https:\/\/uplatz.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#primaryimage"},"image":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#primaryimage"},"thumbnailUrl":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg","datePublished":"2025-11-28T15:36:18+00:00","dateModified":"2025-11-28T16:25:09+00:00","description":"Processing-in-Memory (PIM) places compute inside DRAM\/SRAM. We analyze system-level architectures tackling the Von Neumann bottleneck for next-gen computing.","breadcrumb":{"@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/#primaryimage","url":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg","contentUrl":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/Processing-in-Memory-A-System-Level-Analysis-of-DRAM-and-SRAM-Architectures-for-Next-Generation-Computing.jpg","width":1280,"height":720},{"@type":"BreadcrumbList","@id":"https:\/\/uplatz.com\/blog\/processing-in-memory-a-system-level-analysis-of-dram-and-sram-architectures-for-next-generation-computing\/
#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/uplatz.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Processing-in-Memory: A System-Level Analysis of DRAM and SRAM Architectures for Next-Generation Computing"}]},{"@type":"WebSite","@id":"https:\/\/uplatz.com\/blog\/#website","url":"https:\/\/uplatz.com\/blog\/","name":"Uplatz Blog","description":"Uplatz is a global IT Training &amp; Consulting company","publisher":{"@id":"https:\/\/uplatz.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/uplatz.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/uplatz.com\/blog\/#organization","name":"uplatz.com","url":"https:\/\/uplatz.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2016\/11\/Uplatz-Logo-Copy-2.png","contentUrl":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2016\/11\/Uplatz-Logo-Copy-2.png","width":1280,"height":800,"caption":"uplatz.com"},"image":{"@id":"https:\/\/uplatz.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Uplatz-1077816825610769\/","https:\/\/x.com\/uplatz_global","https:\/\/www.instagram.com\/","https:\/\/www.linkedin.com\/company\/7956715?trk=tyah&amp;amp;amp;amp;trkInfo=clickedVertical:company,clickedEntityId:7956715,idx:1-1-1,tarId:1464353969447,tas:uplatz"]},{"@type":"Person","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/person\/8ecae69a21d0757bdb2f776e67d2645e","name":"uplatzblog","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/ava
tar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","caption":"uplatzblog"}}]}},"_links":{"self":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/7948","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/comments?post=7948"}],"version-history":[{"count":3,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/7948\/revisions"}],"predecessor-version":[{"id":7962,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/7948\/revisions\/7962"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/media\/7960"}],"wp:attachment":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/media?parent=7948"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/categories?post=7948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/tags?post=7948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}