The Computational Challenge of Simulating Light
The pursuit of photorealism in computer graphics has been a decades-long endeavor, marked by a fundamental tension between visual fidelity and real-time performance. For years, rasterization has been the dominant rendering technique, but its inherent limitations have paved the way for a more physically accurate method: ray tracing. The immense computational cost of ray tracing, however, long confined it to the realm of offline, non-real-time rendering for film and visual effects. The advent of specialized hardware in modern Graphics Processing Units (GPUs) has finally made real-time ray tracing a reality, but this achievement was predicated on solving profound algorithmic and architectural challenges.
From Rasterization to Ray Tracing: A Paradigm Shift in Realism
Traditional real-time 3D graphics are built upon the rasterization pipeline. This technique efficiently projects 3D models, composed of a mesh of virtual triangles, onto a 2D screen, converting them into pixels.1 While computationally efficient, rasterization is fundamentally an approximation. It struggles to naturally simulate global lighting effects; phenomena like physically accurate reflections, refractions through transparent materials, and soft shadows are typically “faked” using a variety of clever but ultimately limited techniques such as cube maps and shadow maps.2 These methods often break down in complex scenes, leading to visual artifacts and a loss of immersion.3
Ray tracing represents a paradigm shift from this approximate model to one based on physical simulation.4 Instead of projecting triangles onto the screen, the algorithm works in reverse, tracing the path of light from a virtual camera through each pixel and out into the 3D scene.1 When a ray intersects an object, its color is determined by the object’s material properties and its interaction with light sources. New rays can be spawned to calculate reflections, refractions, and shadows, recursively building a complete picture of how light behaves in the environment.4 This physically-based approach produces far more realistic and immersive results, with effects like global illumination, ambient occlusion, and accurate reflections emerging naturally from the simulation.4
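This backward-tracing recursion can be captured in a toy example. The C++ sketch below traces a ray against a single hard-coded sphere, shades it with one assumed light direction, and spawns a mirror reflection up to a fixed bounce budget; it is purely illustrative and bears no relation to how production renderers or GPU pipelines are written.

```cpp
// Minimal backward ray tracer sketch: one sphere, a few mirror bounces.
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };
static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Ray { Vec3 origin, dir; };   // dir is assumed normalized

// Ray-sphere intersection: returns distance along the ray, or -1 on a miss.
float hitSphere(Vec3 center, float radius, const Ray& r) {
    Vec3 oc = r.origin - center;
    float b = dot(oc, r.dir);
    float c = dot(oc, oc) - radius * radius;
    float disc = b * b - c;
    return disc < 0.0f ? -1.0f : -b - std::sqrt(disc);
}

// Trace one ray: shade the sphere if hit, then recursively spawn a reflection
// ray until the bounce budget runs out; otherwise return a flat sky color.
Vec3 trace(const Ray& r, int depth) {
    const Vec3 center{0, 0, -3};
    float t = hitSphere(center, 1.0f, r);
    if (t < 0.0f) return {0.5f, 0.7f, 1.0f};                // miss: sky color
    Vec3 p = r.origin + r.dir * t;                          // hit point
    Vec3 n = p - center;                                    // unit normal (radius == 1)
    float diffuse = std::fmax(0.0f, dot(n, {0, 1, 0}));     // assumed light from above
    Vec3 color{0.8f * diffuse, 0.2f * diffuse, 0.2f};       // direct shading
    if (depth > 0) {                                        // spawn a reflection ray
        Vec3 reflected = r.dir - n * (2.0f * dot(r.dir, n));
        color = color + trace({p + n * 1e-3f, reflected}, depth - 1) * 0.3f;
    }
    return color;
}

int main() {
    // Fire one camera ray through the center of the image.
    Vec3 c = trace({{0, 0, 0}, {0, 0, -1}}, 2);
    std::printf("color = (%.2f, %.2f, %.2f)\n", c.x, c.y, c.z);
}
```

Real scenes replace the single `hitSphere` call with a search over millions of triangles, which is precisely the cost discussed next.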
Given the performance cost, most modern real-time applications do not replace rasterization entirely. Instead, they employ a hybrid rendering model that combines the efficiency of rasterization for determining initial object visibility with the physical accuracy of ray tracing for specific, high-impact lighting effects. This hybrid approach seeks to deliver the best of both worlds: high performance and a new level of visual fidelity.1
The Algorithmic Bottlenecks: Deconstructing the Cost of BVH Traversal and Ray-Primitive Intersection
The core challenge of ray tracing lies in its computational complexity. A naive implementation would require testing every ray against every single triangle in a scene to find the closest intersection. For a modern game scene containing millions of triangles, this brute-force method, whose per-ray cost is $O(N)$ in the number of primitives $N$, is far too expensive to execute in real time.9
To make this problem tractable, ray tracers employ acceleration data structures. The most common of these is the Bounding Volume Hierarchy (BVH), a tree-like structure that spatially organizes the scene’s geometry.1 In a BVH, groups of triangles are enclosed within larger, simpler volumes, typically Axis-Aligned Bounding Boxes (AABBs). These boxes are then grouped into even larger boxes, creating a hierarchy that extends from the entire scene at its root down to individual triangles at its leaves.12 When a ray is traced, it is first tested against the large bounding boxes at the top of the hierarchy. If a ray does not intersect a box, the GPU can immediately discard all the geometry contained within it, saving millions of unnecessary calculations.11 This culling process reduces the effective complexity from linear to logarithmic, making the problem far more manageable.3
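The ray-box test performed at every level of this hierarchy is usually the classic "slab" method: intersect the ray with the three pairs of axis-aligned planes and check that the resulting intervals overlap. The sketch below is the generic textbook formulation of that test, not the fixed-function logic of any particular GPU.

```cpp
#include <algorithm>

// Axis-aligned bounding box and a ray prepared for box testing. Precomputing
// the reciprocal direction turns the plane intersections into multiplies.
struct AABB { float lo[3], hi[3]; };
struct Ray  { float origin[3], invDir[3]; };   // invDir[i] = 1 / direction[i]

// Slab test: for each axis, find where the ray enters and exits the pair of
// bounding planes, then check that the three intervals share a common overlap.
bool rayHitsBox(const Ray& r, const AABB& b, float tMax) {
    float tMin = 0.0f;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (b.lo[axis] - r.origin[axis]) * r.invDir[axis];
        float t1 = (b.hi[axis] - r.origin[axis]) * r.invDir[axis];
        if (t0 > t1) std::swap(t0, t1);        // ray points in the negative direction
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMin > tMax) return false;         // intervals do not overlap: miss
    }
    return true;                               // the ray passes through the box
}
```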
However, this solution introduces two new, distinct computational bottlenecks:
- BVH Traversal: The process of “walking” the BVH tree, performing a series of ray-box intersection tests to navigate from the root node down to the relevant leaf nodes.
- Ray-Primitive Intersection: Once a leaf node is reached, the ray must be tested against the actual geometric primitives (triangles) contained within it to find the precise point of intersection.
These two tasks form the computational core of hardware-accelerated ray tracing and are the primary targets for specialized silicon.
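A stack-based traversal loop makes these two bottlenecks concrete. In the sketch below the flattened node layout is hypothetical (real drivers use vendor-specific formats), and the two intersection helpers are only declared, standing in for the slab test above and a standard ray-triangle test; the point is the structure of the work: a loop of ray-box tests to walk the tree, plus ray-triangle tests at each leaf.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flattened BVH node; real drivers use vendor-specific formats,
// but the traversal below has the same shape in spirit.
struct BVHNode {
    float   lo[3], hi[3];        // bounding box of everything beneath this node
    int32_t leftChild;           // index of the first child, or -1 for a leaf
    int32_t firstTri, triCount;  // triangle range when this node is a leaf
};

// Assumed helpers, declared but not defined here: the ray-box slab test shown
// earlier and a standard ray-triangle test (e.g. Moller-Trumbore).
bool rayHitsBox(const float org[3], const float invDir[3],
                const float lo[3], const float hi[3], float tMax);
bool rayHitsTriangle(const float org[3], const float dir[3],
                     int triIndex, float& tHit);

// Bottleneck 1 is the loop of ray-box tests that walks the tree; bottleneck 2
// is the batch of ray-triangle tests performed whenever a leaf is reached.
float traceClosestHit(const std::vector<BVHNode>& nodes,
                      const float org[3], const float dir[3],
                      const float invDir[3]) {
    float closest = 1e30f;       // distance to the nearest hit found so far
    int   stack[64];
    int   top = 0;
    stack[top++] = 0;                                        // start at the root
    while (top > 0) {
        const BVHNode& n = nodes[stack[--top]];
        if (!rayHitsBox(org, invDir, n.lo, n.hi, closest))
            continue;                                        // cull the whole subtree
        if (n.leftChild < 0) {                               // leaf: test its triangles
            for (int i = 0; i < n.triCount; ++i) {
                float t;
                if (rayHitsTriangle(org, dir, n.firstTri + i, t) && t < closest)
                    closest = t;
            }
        } else {                                             // interior: visit both children
            stack[top++] = n.leftChild;
            stack[top++] = n.leftChild + 1;
        }
    }
    return closest;              // 1e30f still means "no hit"
}
```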
The Performance Imperative: Why Software-Based Ray Tracing is Insufficient for Real-Time Rendering
Even with the algorithmic optimization provided by a BVH, executing the traversal and intersection tests on general-purpose GPU shader cores remains profoundly inefficient. Without dedicated hardware, each ray cast can require thousands of individual software instruction slots to navigate the BVH and test for intersections.14 Early real-time ray tracing demonstrations underscored this challenge, requiring immensely powerful and expensive systems, such as an NVIDIA DGX Station with four Volta GPUs, just to render a single scene in real time.15
The fundamental issue is not merely the volume of calculations, but the nature of the workload. BVH traversal is an inherently irregular process characterized by divergent control flow and incoherent memory access patterns.11 Unlike traditional shader workloads where a block of threads (a “warp” or “wavefront”) executes the same instructions on similar data, rays traced into a complex scene will quickly diverge, each taking a unique path through the BVH. This forces the GPU to fetch disparate nodes and triangle data from scattered locations in VRAM, leading to poor cache utilization and frequent pipeline stalls as the processing units wait for data to arrive.16 Studies on the computational cost of ray tracing have confirmed that both time and energy consumption are highly correlated with this data movement, with the memory subsystem often being the dominant performance limiter.16
Consequently, simply increasing the number of general-purpose shader cores does not effectively solve the problem. The task demands a different kind of hardware: a specialized processing engine designed to handle irregular tree traversal and fixed-function intersection math with maximum efficiency. This realization drove both NVIDIA and AMD to develop dedicated hardware accelerators, which became the key to unlocking real-time ray tracing for the consumer market and avoiding the severe performance penalties, often a 35-50% drop in frame rate, associated with software-based emulation.4
NVIDIA’s RTX Architecture: The Path of the Dedicated Core
NVIDIA’s introduction of the RTX platform marked the first mainstream implementation of dedicated hardware for real-time ray tracing. This strategy is centered on the Ray Tracing Core (RT Core), a specialized processing block designed to autonomously handle the most computationally intensive aspects of the ray tracing pipeline. The architecture of the RT Core has evolved significantly across three generations—Turing, Ampere, and Ada Lovelace—reflecting a sophisticated engineering roadmap that has progressively identified and eliminated key performance bottlenecks.
The Turing Revolution (1st Gen RT Core)
The Turing microarchitecture, launched in 2018 with the GeForce RTX 20 series, was the first to integrate dedicated RT Cores.20 Each Streaming Multiprocessor (SM), the fundamental processing block of the GPU, was equipped with one RT Core, alongside traditional CUDA cores for shading and new Tensor Cores for AI workloads.14 This integration occurred within a redesigned SM that, unlike its Pascal predecessor, featured independent integer and floating-point (FP32) datapaths, allowing for concurrent execution and a significant boost in overall shader efficiency.23
The first-generation RT Core is a fixed-function hardware block, in effect a small Application-Specific Integrated Circuit (ASIC) embedded within each SM, containing two specialized units, each targeting one of the primary ray tracing bottlenecks:14
- A Box Intersection Unit: This hardware is dedicated to performing ray-AABB (Axis-Aligned Bounding Box) intersection tests with extreme efficiency. This is the core operation required for traversing a BVH.
- A Triangle Intersection Unit: Once the traversal unit identifies a leaf node in the BVH, this hardware performs the final, precise mathematical test to determine if the ray intersects with any of the triangles contained within that node.
The defining characteristic of the Turing RT Core is its autonomous operation. When a shader program running on the SM issues a ray tracing instruction (e.g., via Microsoft DXR’s TraceRay() call), the entire workload of BVH traversal and ray-triangle intersection testing is offloaded to the RT Core.14 The RT Core independently traverses the BVH, performs the necessary tests, and only returns a simple “hit” or “miss” result back to the SM.14 This design frees the SM’s programmable CUDA cores to continue executing other tasks, such as pixel shading or compute workloads, in parallel.14 This concurrent execution model is the foundation of NVIDIA’s hybrid rendering strategy, enabling the fusion of rasterization and ray tracing in a single frame.19
Turing’s hardware delivered a monumental performance uplift, capable of processing up to 10 GigaRays per second and providing an estimated 8x improvement in ray tracing performance over software-based approaches on the preceding Pascal architecture.20 It established a clear architectural blueprint: offload the irregular, computationally expensive traversal and intersection tasks to dedicated hardware, thereby allowing the highly parallel shader engines to focus on shading.
The Ampere Refinement (2nd Gen RT Core)
The Ampere architecture, powering the GeForce RTX 30 series, introduced the second-generation RT Core.25 This generation was not a reinvention but a critical refinement focused on addressing bottlenecks observed in the first-generation hardware. The Ampere SM was also enhanced, featuring a flexible datapath that allowed it to execute two FP32 operations per clock, doubling its theoretical peak FP32 throughput over Turing.25
The key improvements in the 2nd Gen RT Core included:
- Doubled Intersection Throughput: The hardware’s ability to perform ray-triangle intersection tests was doubled compared to Turing, a direct enhancement to one of its core functions.25
- Concurrent Traversal and Intersection: A more subtle but crucial optimization was the ability for the box intersection (traversal) and triangle intersection units to operate concurrently.31 In the Turing architecture, these two stages could become serialized, with the pipeline stalling while waiting for triangle intersection tests to complete. Ampere’s design allows for parallel execution, significantly increasing the core’s internal efficiency and overall throughput.28
- Hardware-Accelerated Motion Blur: The 2nd Gen RT Core added dedicated logic to calculate intersections with geometry in motion. It does this by interpolating triangle vertex positions between two points in time, enabling physically accurate ray-traced motion blur with far greater efficiency than was possible in software (a sketch of this interpolation follows the list).
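A minimal sketch of that interpolation, assuming a simple two-keyframe vertex layout (the actual hardware data format is not public):

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// A triangle whose vertices are stored at two points in time
// (shutter open and shutter close), as used for motion-blurred geometry.
struct MotionTriangle {
    std::array<Vec3, 3> v0;   // vertex positions at time t = 0
    std::array<Vec3, 3> v1;   // vertex positions at time t = 1
};

static Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + (b.x - a.x) * t,
             a.y + (b.y - a.y) * t,
             a.z + (b.z - a.z) * t };
}

// Each ray carries a time sample within the shutter interval. Before the
// intersection test, the triangle is reconstructed at that instant by
// linearly interpolating its two keyframes, which is conceptually the
// operation the 2nd Gen RT Core performs in hardware.
std::array<Vec3, 3> triangleAtTime(const MotionTriangle& tri, float rayTime) {
    return { lerp(tri.v0[0], tri.v1[0], rayTime),
             lerp(tri.v0[1], tri.v1[1], rayTime),
             lerp(tri.v0[2], tri.v1[2], rayTime) };
}
```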
These enhancements resulted in a claimed performance increase of up to 2x in ray tracing workloads compared to the Turing generation.25 In benchmarks, an RTX 3080 demonstrated roughly double the ray tracing performance of the previous flagship, the RTX 2080 Ti.32 The Ampere architecture effectively matured the RT Core, optimizing its internal pipeline to make ray tracing at higher resolutions and with more complex effects, such as motion blur, a more practical reality for gamers.
The Ada Lovelace Optimization (3rd Gen RT Core)
Launched with the GeForce RTX 40 series, the Ada Lovelace architecture represents another major leap forward, introducing the third-generation RT Core alongside significant gains in clock speed and transistor density from the move to TSMC’s custom 4N process node.33 This generation shifted focus from purely accelerating raw intersection throughput to solving higher-level inefficiencies in the ray tracing pipeline.
The 3rd Gen RT Core introduced three transformative hardware features:
- Shader Execution Reordering (SER): SER is a novel scheduling technology that dynamically reorganizes ray tracing workloads on the fly.34 Ray tracing often produces highly divergent and incoherent shader workloads, which are inefficient for the SMs to process. SER groups similar shader tasks together, improving execution coherence and GPU efficiency. NVIDIA claims this feature alone can improve overall ray tracing performance by up to 2x.34
- Opacity Micromap (OMM) Engine: This specialized hardware unit is designed to accelerate ray tracing of complex geometry that uses transparent or “alpha-tested” textures, such as foliage, chain-link fences, or smoke sprites.35 Previously, a ray hitting a transparent part of a texture would still require the SM to run a shader to discard the hit and continue traversal. The OMM engine encodes a low-resolution map of a texture’s opacity directly into the BVH. This allows the RT Core to quickly identify and discard hits on transparent areas using dedicated hardware, without ever involving the SM, potentially doubling traversal performance in scenes with heavy use of such assets.35 A simplified sketch of the alpha-test step this engine replaces appears after this list.
- Displaced Micro-Mesh (DMM) Engine: DMM tackles the problem of highly complex geometry created by techniques like tessellation and displacement mapping. Such techniques can increase a model’s triangle count by orders of magnitude, making the BVH enormous and slow to build and traverse. The DMM engine allows the GPU to build the primary BVH using a simpler base mesh while efficiently encoding the complex, displaced micro-triangles. This results in BVH build times that can be over 10x faster and acceleration structures that are over 10x smaller, drastically reducing memory usage and stutter during gameplay.35
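To make the OMM engine's benefit concrete, the sketch below shows the alpha-test step it replaces: without opacity information baked into the acceleration structure, every candidate hit on alpha-tested geometry has to return to shader code to decide whether the texel is opaque before traversal can resume. The opacity-mask type is a hypothetical stand-in for that baked texture lookup.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 1-bit opacity map baked from a leaf or fence texture.
// One byte per texel here purely for simplicity.
struct OpacityMask {
    int width = 1, height = 1;
    std::vector<uint8_t> opaque = {1};
    bool isOpaque(float u, float v) const {
        int x = static_cast<int>(u * (width - 1));
        int y = static_cast<int>(v * (height - 1));
        return opaque[static_cast<size_t>(y) * width + x] != 0;
    }
};

// Without opacity micromaps, a candidate hit on alpha-tested geometry bounces
// back to shader code: sample the opacity at the hit's texture coordinates and,
// if the texel is transparent, discard the hit so traversal can resume. The OMM
// engine performs this classification inside the RT Core for most texels.
bool acceptAlphaTestedHit(const OpacityMask& mask, float hitU, float hitV) {
    return mask.isOpaque(hitU, hitV);   // false means "ignore this hit, keep traversing"
}
```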
In addition to these new engines, the 3rd Gen RT Core once again doubles the raw ray-triangle intersection throughput compared to the Ampere generation.35 The combination of these hardware advancements allows the Ada Lovelace architecture to deliver up to 4x the ray tracing performance of Ampere in certain workloads.39 With these innovations, NVIDIA demonstrated a clear strategic evolution: after making the core ray tracing pipeline fast (Turing) and internally efficient (Ampere), the focus shifted to optimizing the GPU’s handling of the most difficult and inefficient real-world rendering scenarios.
AMD’s RDNA Architecture: An Integrated Approach
In contrast to NVIDIA’s strategy of creating large, dedicated processing blocks, AMD pursued a more integrated approach to hardware-accelerated ray tracing with its RDNA architecture. This design philosophy, which debuted in RDNA 2, prioritizes die area efficiency and flexibility by embedding smaller acceleration units directly within the existing compute framework. This approach has its own set of trade-offs and has also seen generational refinement.
RDNA 2’s Debut (1st Gen Ray Accelerator)
The RDNA 2 architecture, which powers the Radeon RX 6000 series GPUs and serves as the foundation for the PlayStation 5 and Xbox Series X/S consoles, was AMD’s first to feature hardware support for DirectX Raytracing (DXR).40 Rather than designing a separate, autonomous core, AMD implemented a fixed-function block known as a Ray Accelerator (RA) within each of the GPU’s Compute Units (CUs).41
This design leads to a hybrid hardware/software model for ray tracing. The workload is divided as follows:
- Ray-Intersection (Hardware): The dedicated Ray Accelerator hardware is responsible for calculating both ray-box (AABB) and ray-triangle intersections. Each RA in the RDNA 2 architecture can perform four ray-box tests or one ray-triangle test per clock cycle.12
- BVH Traversal (Software): The logic for traversing the BVH—deciding which nodes to visit next based on the results of the ray-box tests—is not handled by the RA. Instead, this task is executed as shader code on the CU’s general-purpose SIMD (Single Instruction, Multiple Data) execution units.42
This division of labor is critically supported by another key RDNA 2 innovation: the Infinity Cache. This is a large, on-die L3 cache (up to 128 MB) that acts as a massive bandwidth amplifier for the entire GPU.41 Since BVH traversal is executed on the shaders, it is a memory-intensive process that can be bottlenecked by latency to the main GDDR6 video memory. The Infinity Cache mitigates this by holding a large portion of the BVH data close to the CUs, enabling much faster access and reducing power consumption.42
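Compared with the generic traversal loop sketched earlier, the interesting question for RDNA is where the hardware/software boundary sits. The sketch below is a rough illustration under invented names: the two `hwIntersect` functions stand in for the Ray Accelerator's fixed-function tests (note the four box tests per node), while the surrounding loop, stack, and branching represent the traversal work that runs as ordinary shader code on the Compute Unit.

```cpp
#include <cstdint>

// Hypothetical 4-wide BVH node: the Ray Accelerator tests a ray against four
// child boxes (or one triangle) per clock, so nodes are laid out with four
// children. Field names are illustrative, not AMD's actual node format.
struct Node4 {
    float   childLo[4][3], childHi[4][3];
    int32_t child[4];            // child node index, or -(triangleIndex + 1) for a leaf
};

struct RayState { float org[3], dir[3], invDir[3], tMax; };

// Stand-ins for the Ray Accelerator's fixed-function tests. On real hardware
// these are single instructions issued by the shader; everything else in this
// sketch (the loop, the stack, the branching) runs as ordinary shader code on
// the Compute Unit's SIMD units.
uint32_t hwIntersectBox4(const Node4& node, const RayState& ray);   // returns a 4-bit hit mask
bool     hwIntersectTriangle(int triIndex, RayState& ray);          // updates ray.tMax on a hit

// The software half of RDNA 2/3 ray tracing: the traversal loop itself.
void traverse(const Node4* nodes, RayState& ray) {
    int stack[64];
    int top = 0;
    stack[top++] = 0;                                      // root node
    while (top > 0) {
        const Node4& n = nodes[stack[--top]];
        uint32_t mask = hwIntersectBox4(n, ray);           // issued to the Ray Accelerator
        for (int i = 0; i < 4; ++i) {
            if (!(mask & (1u << i))) continue;             // this child box was missed
            if (n.child[i] < 0)
                hwIntersectTriangle(-n.child[i] - 1, ray); // leaf: hardware triangle test
            else
                stack[top++] = n.child[i];                 // interior: keep walking in software
        }
    }
}
```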
The RDNA 2 approach represents an area-efficient implementation of ray tracing acceleration. By integrating smaller RAs into the existing CU structure and leveraging the Infinity Cache, AMD was able to add full DXR compliance without the significant die space and cost associated with a completely separate, larger core. This design philosophy is particularly well-suited for the cost- and power-sensitive environments of game consoles. However, its reliance on programmable shaders for a core part of the ray tracing pipeline means that RT performance is intrinsically tied to the performance and availability of the shader cores and the memory subsystem.
RDNA 3’s Evolution (2nd Gen Ray Accelerator)
The RDNA 3 architecture, featured in the Radeon RX 7000 series, introduced a groundbreaking chiplet design for consumer GPUs, separating the main Graphics Compute Die (GCD) from multiple Memory Cache Dies (MCDs).47 The CUs themselves were also updated to support dual-issue instructions, increasing their overall computational throughput.47
The ray tracing hardware in RDNA 3 is an evolutionary update, featuring a second-generation Ray Accelerator.49 This new RA delivers an approximately 50% improvement in ray intersection performance over its RDNA 2 predecessor.50 The fundamental hybrid architecture, however, remains unchanged: the RAs handle intersection calculations, while the shader cores continue to manage BVH traversal.49 While the total number of RAs increased on flagship models due to a higher CU count, the ratio of one RA per CU was maintained.47
A significant addition to the RDNA 3 architecture was the inclusion of new Wave Matrix Multiply-Accumulate (WMMA) instructions.47 While not dedicated AI cores like NVIDIA’s Tensor Cores, WMMA instructions allow the standard shader ALUs to execute matrix math operations with much greater efficiency. This is critically important for accelerating the AI-based denoising algorithms that are essential for cleaning up the noisy output of real-time ray tracing.48 The introduction of WMMA shows a clear recognition from AMD of the symbiotic relationship between ray tracing and AI acceleration.
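A WMMA-style instruction computes a small fused matrix multiply-accumulate, $D = A \times B + C$, over fixed-size tiles. The scalar reference below shows what that tile operation computes; the tile size is chosen for illustration, and the hardware issues the whole tile as a short sequence of wave-level instructions rather than this triple loop.

```cpp
#include <array>

// Scalar reference for a WMMA-style tile operation: D = A * B + C, where each
// operand is a small fixed-size matrix tile (tile size illustrative only).
constexpr int TILE = 16;
using Tile = std::array<std::array<float, TILE>, TILE>;

Tile wmmaReference(const Tile& A, const Tile& B, const Tile& C) {
    Tile D{};
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j) {
            float acc = C[i][j];                 // start from the accumulator tile
            for (int k = 0; k < TILE; ++k)
                acc += A[i][k] * B[k][j];        // multiply-accumulate
            D[i][j] = acc;
        }
    return D;
}
```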
Despite these improvements, the ray tracing performance of RDNA 3 GPUs generally continues to trail that of their direct NVIDIA competitors, particularly in graphically intensive ray tracing and path tracing scenarios.52 The performance gap tends to widen as the ray tracing workload increases, suggesting that the shader-based traversal component remains a comparative architectural bottleneck when stressed with a high volume of incoherent rays.46 RDNA 3 refined AMD’s integrated approach, but it did not fundamentally alter the architectural trade-offs established by its predecessor.
A Tale of Two Architectures: Comparative Analysis of RTX and RDNA
The distinct hardware implementations for ray tracing acceleration from NVIDIA and AMD reflect fundamentally different design philosophies. NVIDIA’s approach prioritizes peak ray tracing performance through dedicated, autonomous hardware, while AMD’s strategy emphasizes die-area efficiency and flexibility by integrating acceleration features into its existing compute architecture. These divergent paths result in unique performance characteristics and architectural trade-offs.
Core Design Philosophy: Dedicated vs. Integrated
- NVIDIA’s Dedicated Approach: The RTX architecture is built around highly specialized hardware blocks. The RT Core is a distinct processing engine within the Streaming Multiprocessor (SM) that is engineered to handle the entire ray tracing acceleration pipeline—from BVH traversal to final triangle intersection—with minimal involvement from the programmable shader cores.14 This philosophy offloads the specific, irregular workloads of ray tracing to hardware that is purpose-built for the task, maximizing performance even if it increases die size and complexity. The silicon dedicated to the RT Core is largely idle during purely rasterized workloads.54
- AMD’s Integrated Approach: The RDNA architecture, by contrast, integrates ray tracing capabilities more deeply into its standard Compute Unit (CU). The Ray Accelerator (RA) is a smaller, fixed-function component that works in concert with the CU’s programmable shader ALUs.45 This hybrid model leverages the powerful and flexible shaders for BVH traversal while using the RA to accelerate the most mathematically intensive intersection tests.42 This design prioritizes resource utilization and die-area efficiency, as the shader cores can be used for graphics, compute, or ray tracing traversal as needed by the workload.
BVH Traversal and Intersection: A Head-to-Head Implementation
The most significant architectural difference lies in the handling of Bounding Volume Hierarchy (BVH) traversal.
- NVIDIA’s RT Cores execute BVH traversal entirely in fixed-function hardware. This process is extremely fast and power-efficient for the specific task of navigating the tree structure to find potential ray-object intersections.3
- AMD’s RDNA GPUs rely on the programmable shader cores to execute the traversal algorithm in software.42 While this approach is more flexible, it consumes valuable shader resources and is more susceptible to memory latency. The introduction of the large Infinity Cache in RDNA 2 was a direct architectural response to mitigate this specific bottleneck by keeping more of the BVH data on-chip.43
For ray-intersection testing, both vendors use dedicated hardware. However, the throughput and capabilities have evolved differently. NVIDIA has doubled its ray-triangle intersection rate with each successive generation (Ampere over Turing, Ada over Ampere).25 AMD’s RDNA 2 Ray Accelerator was designed to perform four ray-box tests or one ray-triangle test per clock, while RDNA 3 improved overall intersection performance by approximately 50%.45 The raw throughput of NVIDIA’s fully dedicated units, especially in the latest Ada Lovelace generation, remains significantly higher.
Performance Profiling: Light vs. Heavy Workloads
These architectural differences manifest directly in real-world performance. In traditional rasterization, AMD’s RDNA cards are highly competitive, often outperforming their NVIDIA counterparts at similar price points.58 When light ray tracing effects are enabled, such as ray-traced shadows or simple reflections, AMD’s GPUs can deliver a playable and visually compelling experience, with a manageable performance gap.61
However, as the ray tracing workload intensifies, the performance delta widens considerably in NVIDIA’s favor. In demanding scenarios that involve multiple light bounces, ray-traced global illumination, or full path tracing (as seen in titles like Cyberpunk 2077‘s Overdrive Mode or Alan Wake 2), NVIDIA’s GPUs maintain a decisive performance advantage.52 This behavior is a direct consequence of the architectural trade-offs. The sheer volume of BVH traversal operations in a path-traced scene can saturate the shader-based traversal capabilities of the RDNA architecture, whereas NVIDIA’s fully hardware-based pipeline is designed to scale more effectively under such heavy, specialized loads.
Architectural Trade-offs and Future Outlook
Each approach comes with inherent trade-offs. NVIDIA’s strategy yields superior performance in ray tracing and AI-accelerated tasks but results in larger, more complex, and more expensive dies. AMD’s integrated design offers greater die-area efficiency and flexibility but at the cost of lower peak performance in heavy ray tracing workloads.
The future trajectory of both companies suggests a potential convergence of these philosophies. Rumors and technology previews indicate that AMD’s future architectures, such as RDNA 4 and beyond, may incorporate more comprehensive dedicated hardware, potentially named “Radiance Cores,” to handle ray traversal and further unburden the shader engines.62 This acknowledges that shader-based traversal is a current performance limiter. Concurrently, NVIDIA’s latest Blackwell architecture continues to deepen the fusion of AI and ray tracing through concepts like neural rendering, indicating that the future lies in an even tighter coupling of these specialized hardware units.65
The following table provides a comparative summary of the key architectural features across recent GPU generations from both NVIDIA and AMD.
| Feature | NVIDIA Turing (1st Gen) | NVIDIA Ampere (2nd Gen) | NVIDIA Ada Lovelace (3rd Gen) | AMD RDNA 2 (1st Gen) | AMD RDNA 3 (2nd Gen) |
| --- | --- | --- | --- | --- | --- |
| RT Hardware | RT Core | RT Core | RT Core | Ray Accelerator (RA) | Ray Accelerator (RA) |
| Integration | 1 per SM | 1 per SM | 1 per SM | 1 per CU | 1 per CU |
| BVH Traversal | Hardware (in RT Core) | Hardware (in RT Core) | Hardware (in RT Core) | Software (on Shader Cores) | Software (on Shader Cores) |
| Intersection Test | Hardware (in RT Core) | Hardware (in RT Core) | Hardware (in RT Core) | Hardware (in RA) | Hardware (in RA) |
| Intersection Rate | Baseline (1x) | ~2x Ray-Triangle vs. Turing | ~2x Ray-Triangle vs. Ampere | 4x Ray-Box, 1x Ray-Triangle per clock | ~1.5x vs. RDNA 2 |
| Key New Features | Concurrent INT/FP paths | Concurrent RT/Shading, Motion Blur Accel. | SER, OMM, DMM | Infinity Cache, DXR Support | Chiplet Design, WMMA |
| Process Node | TSMC 12nm | Samsung 8nm | TSMC 4N | TSMC 7nm | TSMC 5nm (GCD) / 6nm (MCD) |
The Software Bridge: APIs and the Future of Ray Tracing
The revolutionary capabilities of dedicated ray tracing hardware would be inaccessible without a robust software ecosystem to expose them to developers. Standardized Application Programming Interfaces (APIs) serve as the critical bridge between game engines and the underlying silicon. Furthermore, the practical application of real-time ray tracing is inextricably linked with advancements in artificial intelligence, creating a symbiotic relationship that defines the current and future trajectory of high-performance graphics.
Exposing the Hardware: The Role of DirectX Raytracing (DXR) and Vulkan Ray Tracing
The widespread adoption of hardware-accelerated ray tracing was enabled by two key industry-standard APIs: Microsoft’s DirectX Raytracing (DXR), an extension of DirectX 12, and the Khronos Group’s ray tracing extensions for Vulkan.67 These APIs provide a hardware-agnostic abstraction layer, allowing developers to write ray tracing code that can run on hardware from different vendors, including NVIDIA, AMD, and Intel.70
Both DXR and Vulkan RT introduce a common set of concepts necessary for ray tracing:73
- Acceleration Structures (AS): The API provides a standardized way to build and manage the two-level BVH (Bottom-Level AS for geometry and Top-Level AS for scene instances) that the hardware will traverse.76
- New Shader Stages: A new ray tracing pipeline is introduced with specialized shader types, including Ray Generation, Closest Hit, Any Hit, and Miss shaders, which define the logic for what happens at different stages of a ray’s life.67
- Shader Binding Table (SBT): This is a data structure that maps geometry in the scene to the specific hit shaders that should be executed upon intersection, allowing for complex material systems.77
- Ray Tracing Pipeline State Objects (PSO): These objects bundle the compiled shaders and pipeline configurations needed for a ray tracing dispatch call.74
The GPU driver is responsible for translating these high-level API commands into the specific instructions that control the underlying hardware, whether it be an NVIDIA RT Core or an AMD Ray Accelerator. The evolution of these APIs often mirrors advances in hardware. For instance, DXR 1.1 introduced inline ray tracing, a more flexible approach that better suits AMD’s architecture, while the upcoming DXR 1.2 adds standardized support for features like Shader Execution Reordering (SER) and Opacity Micromaps (OMM), which were pioneered in NVIDIA’s Ada Lovelace architecture.67 This standardization is crucial, as it encourages all hardware vendors to implement similar features to maintain compatibility and performance, fostering a competitive and interoperable ecosystem.
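The relationships between these objects can be pictured with a few conceptual C++ structs. These are not the actual DXR or Vulkan types; the field names and layout are invented for illustration, but they capture how a top-level structure instances bottom-level geometry and how the shader binding table maps instances to hit shaders.

```cpp
#include <cstdint>
#include <vector>

// Conceptual mirror of the API objects listed above; NOT the actual DXR or
// Vulkan structures, just their shape. A bottom-level AS holds geometry; a
// top-level AS holds instances that reference bottom-level structures with a
// transform and an offset into the shader binding table.
struct BottomLevelAS {
    std::vector<float>    vertices;       // triangle geometry, built into a BVH by the driver
    std::vector<uint32_t> indices;
};

struct Instance {
    float    transform[3][4];             // object-to-world placement
    uint32_t blasIndex;                   // which bottom-level AS this instance uses
    uint32_t sbtRecordOffset;             // which hit-shader record applies to it
};

struct TopLevelAS {
    std::vector<Instance> instances;      // the scene, rebuilt or refit per frame
};

// One shader binding table record: which hit shaders run for an instance,
// plus per-material constants those shaders read.
struct SbtRecord {
    uint32_t closestHitShader;            // index into the pipeline's shader table
    uint32_t anyHitShader;
    uint32_t materialId;
};

struct ShaderBindingTable {
    uint32_t               rayGenShader;
    uint32_t               missShader;
    std::vector<SbtRecord> hitGroups;     // indexed via Instance::sbtRecordOffset
};
```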
The Symbiosis of AI and Ray Tracing: Denoising and Upscaling
Real-time ray tracing, even with hardware acceleration, is a compromise. To maintain interactive frame rates, applications can only afford to trace a very small number of rays per pixel—often just one for primary visibility and one for each subsequent effect like shadows or reflections. This sparse sampling results in a final image that is filled with stochastic noise, appearing grainy and unfinished.20
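A toy numerical example makes the problem concrete: estimate the light arriving at a surface point from one random sample versus many. The "lighting environment" below is a made-up function, and the whole example is only meant to show the variance a denoiser has to hide.

```cpp
#include <cstdio>
#include <random>

// Toy illustration of why one ray per pixel is noisy: estimate incoming light
// at a surface point by sampling one random direction versus averaging many.
float incomingLight(float directionAngle) {
    return directionAngle < 0.3f ? 5.0f : 0.2f;   // a small, bright light source
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> angle(0.0f, 1.0f);

    // Ground truth: average of many samples (what offline renderers converge to).
    double reference = 0.0;
    for (int i = 0; i < 100000; ++i) reference += incomingLight(angle(rng));
    reference /= 100000.0;
    std::printf("reference = %.3f\n", reference);

    // Real-time budget: one sample per pixel. Neighbouring pixels get wildly
    // different estimates; that spread is the grain a denoiser must remove.
    for (int pixel = 0; pixel < 8; ++pixel)
        std::printf("pixel %d one-sample estimate = %.3f\n",
                    pixel, incomingLight(angle(rng)));
}
```

Running it, the many-sample reference converges to a stable value while individual one-sample estimates swing between the bright and dark extremes, which is exactly the per-pixel grain visible in a one-sample-per-pixel ray-traced image.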
This is where artificial intelligence becomes indispensable. The noisy, ray-traced image is fed into an AI-powered denoising filter, which intelligently removes the noise to produce a clean final image. This denoising process is itself computationally intensive and relies on specialized hardware for acceleration:
- NVIDIA GPUs use their dedicated Tensor Cores, which are designed for high-throughput matrix multiplication, the core operation of neural networks.1
- AMD GPUs, starting with RDNA 3, use Wave Matrix Multiply-Accumulate (WMMA) instructions to accelerate these AI workloads on their standard shader processors.47
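AI denoisers learn far more sophisticated, content-aware filters, but the simplest classical baseline, temporal accumulation, illustrates the underlying idea of amortizing samples over time: blend each new noisy frame into a running history buffer. The sketch below is that classical filter, not the neural networks executed on Tensor Cores or via WMMA.

```cpp
#include <vector>

// Temporal accumulation: blend the new, noisy ray-traced frame into a running
// history buffer. Over many frames this approximates averaging many samples
// per pixel, the classical baseline that AI denoisers greatly improve upon.
void accumulate(std::vector<float>& history,           // previous accumulated frame
                const std::vector<float>& noisyFrame,  // 1-sample-per-pixel result
                float alpha)                           // e.g. 0.1: weight of the new frame
{
    for (size_t i = 0; i < history.size(); ++i)
        history[i] = history[i] * (1.0f - alpha) + noisyFrame[i] * alpha;
}
```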
Furthermore, to recover the performance lost to the high cost of ray tracing, the scene is typically rendered at a lower internal resolution and then intelligently upscaled to the target output resolution. Technologies like NVIDIA’s Deep Learning Super Sampling (DLSS) and AMD’s FidelityFX Super Resolution (FSR) use AI and other advanced algorithms to reconstruct a high-quality, high-resolution image from the lower-resolution input.65 NVIDIA’s DLSS 3, exclusive to the Ada Lovelace architecture, takes this a step further with AI Frame Generation, using the GPU’s Optical Flow Accelerator to generate entirely new intermediate frames, significantly boosting perceived smoothness.33
It is therefore crucial to understand that hardware-accelerated ray tracing and hardware-accelerated AI are not independent features but two deeply intertwined components of a single system. The ray tracing hardware produces a physically accurate but imperfect image, and the AI hardware cleans it up and makes it performant. The overall ray tracing capability of a GPU is a function of both its raw ray-casting power and its efficiency at AI-driven reconstruction and upscaling.
The Road Ahead: Path Tracing, Neural Rendering, and Next-Gen Hardware
The ultimate goal for real-time graphics is full path tracing, a more advanced form of ray tracing that simulates numerous light bounces per pixel to achieve true, physically-based global illumination and photorealism.1 This is already achievable in some of the most advanced games, such as Cyberpunk 2077 and Alan Wake 2, demonstrating that a technique once exclusive to offline rendering has crossed into real-time interactivity.80
The future of this technology lies in Neural Rendering, a paradigm where AI is integrated even more deeply into the graphics pipeline. Instead of being used solely for post-processing tasks like denoising, AI will be used to generate graphical elements, predict light transport, and intelligently fill in details, bridging the final gap between real-time visuals and cinematic CGI.14 Cutting-edge research presented at premier academic conferences like SIGGRAPH explores techniques such as generative material enhancement, neural radiance fields (NeRFs), and advanced spatiotemporal sampling algorithms like ReSTIR (Reservoir-based Spatiotemporal Importance Resampling).85
Next-generation hardware architectures are being designed specifically for this future. AMD is developing “Radiance Cores” and “Neural Arrays,” which represent a move towards a more dedicated and efficient pipeline for handling both ray tracing and AI workloads in tandem.63 Similarly, NVIDIA’s Blackwell architecture is expected to further enhance its neural rendering capabilities, tightening the fusion between its RT Cores and Tensor Cores.65 This suggests a future where the architectural approaches of the two major GPU vendors may converge, as both recognize that the next leap in graphical fidelity will require tightly-coupled, high-performance, and deeply specialized hardware for both physical light simulation and artificial intelligence.
