GeForce RTX 4060 Ti is an RTX 3060 Ti with OC
Bringing the Ada Lovelace microarchitecture to the RTX 4060 Ti is a significant step for a segment that matters a great deal to gamers, one it shares with the RTX 4070. These models aim to balance high performance with a less prohibitive price.
Photo: Allbreaknews.com
The RTX 40 series, codenamed Ada Lovelace, is built on TSMC's 4N process, a custom 5nm-class node. The microarchitecture enhances the RT cores, provides more L2 cache and, most notably, introduces support for DLSS 3, a technology capable of generating complete frames using AI. The card is modestly more efficient than its previous-generation counterpart and delivers an average performance increase of around 10%.
Like other models in the RTX 40 series, this one is equipped with NVIDIA's latest cutting-edge technologies. The DLSS 3 frame generator, as well as AV1 encoding and decoding, are among the highlighted features. Positioned as an intermediate/entry-level offering, the primary focus of the RTX 4060 Ti is smooth gameplay at 1080p or 1440p, particularly when bolstered by DLSS 3 in compatible games.
Compared to its previous generation counterpart, the most standout feature of the 4060 Ti is its support for DLSS 3.
Ada Lovelace Microarchitecture
The RTX 40 series represents the third (or fourth, if we include the RTX 20 Super refresh) generation of NVIDIA's heterogeneous GPU design, built to enable real-time ray tracing. In addition to the “do-it-all” structures of traditional shaders, which NVIDIA introduced over 20 years ago and which have since become an industry standard, these GPUs also feature specialized structures for other functions.
Starting with the more traditional Streaming Multiprocessors (SMs), the CUDA cores, or shaders, have been optimized to deliver up to twice the performance per watt of the Ampere microarchitecture (RTX 30). One of the most significant changes in these structures is the increase in L2 cache: an RTX 4090 offers 72MB of Level 2 cache, while the RTX 3090 Ti had only 6MB, a twelvefold increase.
The major advancement in the Lovelace generation for these “tried and true” shaders is the introduction of Shader Execution Reordering (SER). According to NVIDIA, the impact of this technology is comparable to the one out-of-order execution had on CPUs. With SER, the graphics card can change the order in which shading work is executed, grouping similar instructions so they run more efficiently. The main beneficiary is ray tracing, where NVIDIA claims shader performance up to 3 times higher, resulting in frame rates up to 25% higher.
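The grouping idea behind SER can be illustrated with a toy model. The sketch below is not NVIDIA's implementation: it assumes a hypothetical warp size of 4 and made-up shader IDs, and simply counts how many distinct shaders each group of threads has to run before and after the work is sorted by shader.

```python
# Toy model of reordering work for coherence (not NVIDIA's actual SER logic).
# Assumption: a "warp" is a group of WARP_SIZE threads that run in lockstep,
# so a warp pays once for every distinct shader its threads need.

WARP_SIZE = 4  # hypothetical value chosen to keep the example small


def divergence_cost(shader_ids, warp_size=WARP_SIZE):
    """Sum, over all warps, of the number of distinct shaders in each warp."""
    cost = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = shader_ids[start:start + warp_size]
        cost += len(set(warp))
    return cost


def reorder_by_shader(shader_ids):
    """Place threads that need the same shader next to each other."""
    return sorted(shader_ids)


# Hypothetical hit shaders for 8 rays, in arrival order.
hits = [0, 1, 2, 3, 0, 1, 2, 3]
print(divergence_cost(hits))                     # 8
print(divergence_cost(reorder_by_shader(hits)))  # 4
```

In this toy case the sorted order halves the number of shader switches, which is the kind of coherence gain the reordering is meant to provide.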
Another updated structure is the RT cores, the part of the GPU dedicated to accelerating ray tracing. These components specialize in calculating ray intersections, tracing the path of a light ray as it hits an object, changes its trajectory, and ultimately reaches the observer. The third-generation RT cores found in the RTX 40 series can perform twice as many of these calculations as their RTX 30 counterparts, resulting in double the RT-FLOPS.
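To give a sense of what a single ray intersection involves, here is a minimal software sketch of a ray-triangle test using the well-known Möller-Trumbore algorithm. It is a generic illustration, not a description of how RT cores are implemented in hardware; the function and helper names are chosen for this example.

```python
# Ray-triangle intersection via the Möller-Trumbore algorithm.
# Generic software illustration; not a model of RT core hardware.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_hit(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance t along the ray to the hit point, or None on a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:  # ray is parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


triangle = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
print(ray_triangle_hit((0.25, 0.25, -1), (0, 0, 1), *triangle))  # 1.0
print(ray_triangle_hit((2, 2, -1), (0, 0, 1), *triangle))        # None
```

A ray-traced scene repeats tests like this one for a very large number of rays and triangles per frame, which is why dedicated hardware for it pays off.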
Architecture improvements lead to more significant jumps in games with ray tracing, allowing for more intensive use of this technology and the calculation of more rays, thereby creating a more realistic scene.
These ray tracing cores also incorporate a new Opacity Micromap (OMM) Engine, which accelerates the processing of structures like foliage and fences. Additionally, the Displaced Micro-Mesh (DMM) Engine can build the bounding volume hierarchy (BVH) up to 10 times faster while using 20 times less storage. Combined, these improvements enable gains of more than 2 times in scenarios involving intensive ray tracing when comparing an Ampere-based card (RTX 30) to an Ada Lovelace (RTX 40) card.
However, the boldest initiative in the RTX 40 series comes with the tensor cores. These structures specialize in matrix calculations, making them extremely efficient for machine learning inference and training. The main innovation in Ada Lovelace is the introduction of instructions for the FP8 data format, where previous generations used FP16, which lets the hardware reach usable results with less work per value. RTX 40 cards use half the storage space per value and deliver twice the AI instructions, with an RTX 4090 delivering twice the processing power of an RTX 3090 Ti, for example. This hardware is part of what's needed for the new version of DLSS, but that topic deserves a separate discussion, as it's the main innovation of the RTX 40 series.
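The "half the storage space" figure follows directly from the bit widths of the two formats. A quick sketch of the arithmetic, where the number of values is a hypothetical figure chosen only for illustration:

```python
# Storage needed for the same number of values in FP16 versus FP8.
BITS_FP16 = 16
BITS_FP8 = 8


def storage_bytes(num_values, bits_per_value):
    """Bytes required to store num_values at the given bit width."""
    return num_values * bits_per_value // 8


n = 1_000_000  # hypothetical number of values, e.g. network weights
fp16 = storage_bytes(n, BITS_FP16)
fp8 = storage_bytes(n, BITS_FP8)
print(fp16, fp8, fp16 / fp8)  # 2000000 1000000 2.0
```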
Nvidia DLSS 3
Up until version 2.1 of DLSS, Nvidia used the technology to increase performance as follows: the frame is rendered at a lower resolution than the final one, conserving system resources and delivering a new frame more quickly. The tensor cores and their AI capabilities then fill in the missing pixels to reach the final resolution, making adjustments along the way such as increasing sharpness and reducing aliasing. More modern DLSS versions also use motion vector information, tracking the direction in which each object in the scene is moving to improve accuracy and graphical quality.
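The saving from rendering at a lower resolution is easy to quantify. The sketch below uses hypothetical render and output resolutions (the article does not state which internal resolutions DLSS uses) to show what fraction of the final pixels the game engine actually has to render.

```python
def rendered_fraction(render_res, output_res):
    """Fraction of the output pixels that are rendered by the game engine."""
    render_w, render_h = render_res
    output_w, output_h = output_res
    return (render_w * render_h) / (output_w * output_h)


# Hypothetical example: render at 1280x720, display at 2560x1440.
print(rendered_fraction((1280, 720), (2560, 1440)))  # 0.25
```

In this example the engine renders only a quarter of the pixels, and the remaining three quarters are reconstructed by the upscaler.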
The GeForce 40 series takes a bolder step. Instead of only supplementing the image, DLSS 3 generates entire frames on its own through Optical Multi Frame Generation. For this, it relies on a dedicated hardware unit: the Optical Flow Accelerator. This hardware can analyze a scene and determine the direction in which objects on the screen are moving, processing the movement of each pixel from one frame to the next.
This feature is not new, as Nvidia has included it in its GPUs since the Turing (GTX 16 and RTX 20) series. The difference now is that the RTX 40 series has a much more refined system for reading pixel movement between frames, delivering faster and more accurate information about where each object is moving in the image.
With this additional information comes the leap of DLSS 3: using the progression of the previous two frames, RTX 40 graphics cards can create a third frame, combining the positions of previous pixels with their displacements as indicated by the Optical Flow Accelerator. The result is refined by a neural network trained by Nvidia and run on the tensor cores. From there, the frame production cycle alternates between a frame rendered traditionally by the game engine and one generated exclusively by DLSS.
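The core idea of moving pixels along their measured displacement can be sketched in one dimension. This is a deliberately simplified toy, not how DLSS 3 works internally: it assumes each moving pixel has a unique value so it can be matched between frames, and it leaves out the trained neural network entirely. The function names are hypothetical.

```python
# Toy 1D frame generation: estimate per-pixel motion from two frames, then
# move the latest frame's pixels along that motion to build a new frame.
# Hypothetical illustration only; no neural network is modeled here.

def estimate_flow(prev, curr, background=0):
    """Displacement of each pixel in curr, found by matching unique values."""
    position_in_prev = {v: x for x, v in enumerate(prev) if v != background}
    flow = []
    for x, v in enumerate(curr):
        if v != background and v in position_in_prev:
            flow.append(x - position_in_prev[v])
        else:
            flow.append(0)
    return flow


def generate_frame(curr, flow, background=0):
    """Shift every non-background pixel of curr by its displacement."""
    out = [background] * len(curr)
    for x, v in enumerate(curr):
        target = x + flow[x]
        if v != background and 0 <= target < len(out):
            out[target] = v
    return out


frame1 = [7, 0, 0, 0, 0, 0]  # object at position 0
frame2 = [0, 0, 7, 0, 0, 0]  # object has moved 2 pixels to the right
flow = estimate_flow(frame1, frame2)
print(flow)                          # [0, 0, 2, 0, 0, 0]
print(generate_frame(frame2, flow))  # [0, 0, 0, 0, 7, 0]
```

Note that the generated frame required no input from the game engine, only the two earlier frames, which is the property the following paragraphs build on.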
This significantly raises the frame rate, as many of the heaviest processes, such as ray tracing, are not performed at all on the frame generated by DLSS 3. Other potential bottlenecks, like CPU-bound scenarios, can also benefit, as the additional frames created by the tensor cores place no demand on the processor.
The main issue with having an image that doesn't come from the game engine is that this frame carries no new gameplay information. That would increase the interval between the player's actions and their effects appearing on the screen, making gameplay feel unresponsive. To mitigate this effect, Nvidia pairs DLSS 3 with Nvidia Reflex, a series of optimizations designed to minimize delays in the rendering steps, aiming for the shortest possible interval between issuing a command and seeing its effect on the screen.
Optical Multi Frame Generation won't be useful in all games. Games already running at high frame rates, like 100fps or more, might not benefit much from the technology, as the interval between frames is already very short, leaving little opportunity for DLSS 3 to further improve the experience by adding more frames. Its strength will lie in scenarios ranging from 20 to 60fps, especially those involving heavy effects such as intensive ray tracing, where performance jumps can reach up to 5 times.
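The point about frame intervals can be checked with simple arithmetic: the time between frames is the inverse of the frame rate. A short sketch for the frame rates mentioned above:

```python
def frame_interval_ms(fps):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


for fps in (20, 60, 100):
    print(fps, round(frame_interval_ms(fps), 1))
# 20 50.0
# 60 16.7
# 100 10.0
```

At 20fps there is a 50ms gap for a generated frame to fill, while at 100fps the gap is already down to 10ms.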