GeForce GTX 480 And 470: From Fermi And GF100 To Actual Cards!

Additional Reading: Geometry, Raster, ROP, And GPGPU

Geometry Shader And Raster Engine

After our little sidebar on tessellation, let’s continue the tour along Nvidia's geometry pipeline. The last stage consists of the Geometry Shader, which first appeared with Direct3D 10, allowing vertices to be added to or taken away from primitives. We’re a far cry from T&L with our dear old GeForce (it’s been 10 years already).

This is an area where the new GeForce GTX 480 has evolved the most compared to the previous architecture, with no less than a 311% improvement in RightMark's Hyperlight.

Finally, the PolyMorph Engine performs the viewport transform and perspective correction calculations before passing the vertices and all of their attributes to the Raster Engine. The Raster Engine is made up of three main stages. First, the equations for the edges of the triangle are calculated, and the triangles that are not facing the camera are rejected. Then, the rasterizer generates the pixels (and the samples in the case of MSAA) covered by the triangle before passing all the data to a Z-Cull unit, now a familiar entity, which avoids performing pixel shading on hidden pixels through the use of a hierarchical Z-buffer.
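
The first two raster stages described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of Nvidia's hardware: it assumes that a positive signed area means the triangle faces the camera, that pixels are sampled at their centres, and it ignores fill-rule tie-breaking and MSAA. The names `edge` and `rasterize` are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge equation: positive when (px, py) lies on the inner side of a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Return the pixels whose centres are covered by a front-facing triangle.

    Assumption: a positive signed area means the triangle faces the camera.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Twice the signed area; zero or negative means back-facing or degenerate.
    area = edge(x0, y0, x1, y1, x2, y2)
    if area <= 0:
        return []  # rejected before any pixel is generated
    covered = []
    for y in range(height):
        for x in range(width):
            cx, cy = x + 0.5, y + 0.5  # sample at the pixel centre
            inside = (edge(x0, y0, x1, y1, cx, cy) >= 0
                      and edge(x1, y1, x2, y2, cx, cy) >= 0
                      and edge(x2, y2, x0, y0, cx, cy) >= 0)
            if inside:
                covered.append((x, y))
    return covered


front = [(0, 0), (4, 0), (0, 4)]  # positive signed area: kept
back = [(0, 0), (0, 4), (4, 0)]   # same triangle, opposite winding: rejected

print(len(rasterize(front, 4, 4)), "pixels covered;",
      len(rasterize(back, 4, 4)), "for the back-facing copy")
```

In a real pipeline, the covered pixels would then go to the Z-Cull stage, which this sketch leaves out.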

Reworked ROP Units

As we saw earlier, Nvidia increased the number of ROPs, but it also made a few changes to them. The units' "graphics" performance hasn’t changed (one 32-bit integer pixel per clock cycle, one FP16 pixel over two clocks, or one FP32 pixel over four clocks), but Nvidia has greatly optimized atomic operations (that is, memory operations carried out in a single transaction, with no interruption possible). This type of operation is extremely useful in parallel programming, when several threads may attempt to access the same resource. Nvidia claims very big gains: up to 20x for atomic operations to a single address and 7.5x for contiguous memory regions. In practice, though, these gains are probably more the result of the L2 cache than of any substantial modification to the ROP units.
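
To see why uninterruptible updates matter when several threads touch the same resource, here is a small CPU-side analogy in Python. It is a sketch only: the lock stands in for the hardware guarantee, and `AtomicCounter` and `hammer` are hypothetical names, not part of CUDA or any Nvidia API.

```python
import threading


class AtomicCounter:
    """A counter whose read-modify-write cannot be interrupted."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # The lock makes load, add, and store behave as one transaction.
        with self._lock:
            self._value += amount
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, num_threads, adds_per_thread):
    """Have several threads increment the same counter, then report it."""
    def worker():
        for _ in range(adds_per_thread):
            counter.add(1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(hammer(AtomicCounter(), 8, 1000))
```

Because every increment is a single transaction, no update is lost, whatever order the threads run in.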

Nvidia also claims a substantial improvement to its compression algorithms, resulting in better efficiency with 8x anti-aliasing.

Finally, GF100 has a new 32x Coverage Sampling Anti-Aliasing (CSAA) mode and the possibility of combining CSAA and transparency multi-sampling to improve anti-aliasing of transparent surfaces. An interesting detail: until now, the number of pixels generated by the setup engine and the number of ROP units increased together. Now, GF100 has 48 ROP units and its rasterizer can generate only 32 pixels per cycle. That may seem like an odd choice at first, but in practice the latest games usually aren't played on high-end hardware without MSAA, and they often use floating-point frame buffers, thereby imposing a heavier workload on the ROP units, which take several cycles to process the pixels. So, the increase in the number of ROP units is justified, but in certain very simple rendering passes they’ll be underutilized by the rasterizers.
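
The 48-versus-32 argument comes down to simple arithmetic, sketched below in Python. The per-format costs are the ones quoted earlier in this section (one, two, or four clocks per pixel). The model is deliberately crude and is an assumption of this sketch: it ignores MSAA, blending, and memory bandwidth, and the names are hypothetical.

```python
ROP_UNITS = 48                 # GF100 ROP count, from the text
RASTER_PIXELS_PER_CLOCK = 32   # rasterizer output, from the text

# Clocks a single ROP needs per pixel, by frame-buffer format (from the text).
CLOCKS_PER_PIXEL = {"int32": 1, "fp16": 2, "fp32": 4}


def rop_pixels_per_clock(fmt):
    """Peak pixels per clock that all ROPs together can absorb."""
    return ROP_UNITS / CLOCKS_PER_PIXEL[fmt]


def bottleneck(fmt):
    """Which stage limits throughput in this simplified model."""
    if rop_pixels_per_clock(fmt) >= RASTER_PIXELS_PER_CLOCK:
        return "rasterizer"
    return "rop"


for fmt in CLOCKS_PER_PIXEL:
    print(fmt, rop_pixels_per_clock(fmt), bottleneck(fmt))
```

With 32-bit integer pixels the ROPs could absorb 48 pixels per clock but only receive 32, so some sit idle. With FP16 or FP32 targets they can absorb only 24 or 12, so the extra units are useful.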


It’s hard to discuss the GF100 without talking about GPGPU, since Nvidia focused on it so much during its presentations of the new architecture. When the company designed G80, the GPGPU market was in its infancy. Nvidia’s choices proved to be good ones, both from the point of view of hardware (with Shared Memory) and software (Compute Shaders and OpenCL have programming paradigms that are very close to CUDA). But it’s impossible to offer a perfect solution on the first try, and Nvidia continued to develop CUDA, adding support for double-precision and atomic instructions with GT200. Those were just incremental improvements, however. With its Fermi architecture, Nvidia was able to take advantage of all the expertise gained from several years of work with CUDA to offer a much more powerful solution.

The first point that has been greatly improved with GF100, and that will directly benefit GPGPU applications, is support for double-precision arithmetic. As we saw earlier, the addition of double-precision support to the GT200 really looked like a quick and dirty solution, added just to stake out the territory. With a single 64-bit unit for every eight 32-bit units in each multiprocessor, the GT200’s double-precision performance wasn’t really up to snuff. Without using dedicated units, AMD was even able to gain the advantage: its GPU's performance dropped by "only" a factor of four with DP, which was half the performance hit on Nvidia's architecture.

But with GF100, Nvidia completely reworked the architecture, and the dedicated MAD unit has been done away with. Now, the same units handle single- and double-precision calculation, and performance is reduced only by half with double-precision. So, the impact on performance is much more reasonable, comparable to what’s observed on CPUs when using SSE. The advantage of using a GPU for this type of calculation should now be attractive enough to motivate programmers to do the necessary recoding.
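
The ratios quoted above can be lined up side by side. The sketch below uses only the figures from the text (one-eighth for GT200, one-quarter for AMD, one-half for GF100) and a single-precision rate normalized to an arbitrary value. The dictionary and function names are hypothetical, and shipping products may not match these architectural ratios.

```python
from fractions import Fraction

# Double-precision rate as a fraction of single-precision rate (from the text).
DP_RATE = {
    "GT200": Fraction(1, 8),  # one 64-bit unit per eight 32-bit units
    "AMD": Fraction(1, 4),    # "only" a factor of four
    "GF100": Fraction(1, 2),  # same units, half rate
}


def dp_throughput(sp_throughput, chip):
    """Double-precision throughput for a given single-precision throughput."""
    return sp_throughput * DP_RATE[chip]


for chip in DP_RATE:
    print(chip, dp_throughput(1000, chip))
```

For the same single-precision rate, this model gives GF100 four times the double-precision rate of GT200.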

While we’re on the subject of floating-point calculation, note that GF100 supports the most recent IEEE 754-2008 standard, with all of the required rounding modes and the fused multiply-add (FMA) instruction. FMA maintains the full precision of the intermediate product and rounds only once, unlike the classic multiply-add (MAD) instruction, which performs two roundings. Note, however, that its direct competitor, AMD's RV870, is not really behind in this area because it also supports the latest floating-point standard and the FMA instruction.
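
The difference between one rounding and two is easy to demonstrate. The Python sketch below is an emulation in double precision on the CPU, not GPU code. It uses exact rational arithmetic to model the single-rounding operation, and the input values are chosen specifically so that the intermediate rounding loses the whole result. The function names are hypothetical.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply then add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Exact product and sum, rounded once at the end.

    A Fraction represents every finite float exactly, and converting the
    final Fraction back to float performs the single rounding step.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact result: (1 + 2^-30)(1 - 2^-30) - 1 = -2^-60
print("two roundings:", mad(a, b, c))
print("one rounding: ", fused_multiply_add(a, b, c))
```

With two roundings, the product 1 - 2^-60 is first rounded to exactly 1.0, so the final result collapses to zero. With a single rounding, the tiny but correct value -2^-60 survives.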

We described another advantage favoring Fermi in GPGPU environments on the previous page: a new memory hierarchy. In certain cases, a little scratchpad RAM can work miracles, but there are situations where nothing can replace cache memory. Also, GF100 has optional support for ECC memory, and all of the GPU’s internal memory (L1 and L2 cache, plus shared memory) is also protected.

Nvidia corrected two bottlenecks that could slow the performance of its chips in GPGPU mode. Like a CPU, a GPU gives the illusion of executing several tasks in parallel by alternating between them, with each task being given a portion of the GPU time. The main difference between CPUs and GPUs is that, on the latter, switching tasks is extremely expensive. Nvidia grappled with this issue in designing GF100, optimizing these operations. Context-switch time is now down to less than 25 microseconds. With that degree of improvement, frequent inter-kernel communication is now feasible where it wasn’t before.

Another big improvement: up until now, the GPU could only execute one kernel at a time on the entire GPU. With large kernels, that wasn’t a problem, and all resources were used. But with small kernels, it was possible for part of the GPU to go unused. GF100 is now capable of executing several kernels in parallel (in practice, up to one per multiprocessor), which results in more efficient use of the GPU, even with large kernels. That’s because when larger kernels reach the end of their execution, it’s possible for them to have an insufficient number of blocks to occupy the entire GPU.
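
A toy occupancy model shows both effects: small kernels leaving most of the chip idle, and large kernels wasting their last wave. Everything here is an assumption made for illustration: 16 multiprocessors, one block per multiprocessor at a time, identical block durations, and perfect packing when kernels run concurrently. The function names are hypothetical.

```python
import math

MULTIPROCESSORS = 16  # assumed for this sketch


def waves(blocks, sms=MULTIPROCESSORS):
    """Number of rounds needed if each multiprocessor runs one block per round."""
    return math.ceil(blocks / sms)


def utilization_serial(kernels, sms=MULTIPROCESSORS):
    """One kernel at a time: each kernel's last wave may be partly empty."""
    total_blocks = sum(kernels)
    total_waves = sum(waves(blocks, sms) for blocks in kernels)
    return total_blocks / (total_waves * sms)


def utilization_concurrent(kernels, sms=MULTIPROCESSORS):
    """Idealized concurrency: blocks from any kernel fill idle multiprocessors."""
    total_blocks = sum(kernels)
    return total_blocks / (waves(total_blocks, sms) * sms)


small = [4, 4, 4, 4]  # four kernels of four blocks each
large = [20, 20]      # two kernels whose second wave has only four blocks

for name, kernels in (("small", small), ("large", large)):
    print(name, utilization_serial(kernels), utilization_concurrent(kernels))
```

In this model the four small kernels go from 25% to 100% utilization, and the two large kernels from 62.5% to about 83%, because the tail of one kernel overlaps with the start of the next.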

Management of branching has been optimized too, with support for predicated instructions. With predication, both sides of a divergent branch are issued, and a per-thread predicate then determines which results are kept. This avoids the additional cost of the branching instruction, which can be advantageous when the code to be executed is short.
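
The idea can be illustrated by comparing a branching loop with a branch-free version that computes both outcomes and selects one with a predicate. This is a conceptual Python sketch, not GPU code: each list element stands in for one thread, the arithmetic select stands in for the predicate mechanism, and the function names are hypothetical.

```python
def branchy(values):
    """Reference version: each element takes one side of the branch."""
    out = []
    for v in values:
        if v < 0:
            out.append(-v)
        else:
            out.append(v * 2)
    return out


def predicated(values):
    """Branch-free version: both sides are computed, the predicate selects."""
    out = []
    for v in values:
        p = int(v < 0)       # predicate: 1 if the condition holds, else 0
        if_true = -v         # result of the "then" side, always computed
        if_false = v * 2     # result of the "else" side, always computed
        out.append(p * if_true + (1 - p) * if_false)
    return out


lanes = [-3, 2, 0, -1]
print(branchy(lanes), predicated(lanes))
```

Both versions give the same results. The predicated one does more arithmetic but contains no jump, which is why it only pays off when each side of the branch is short.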

The final new feature is a unified memory space. Until now, PTX ISA 1.0 (the virtual instruction set to which CUDA programs are compiled) had three address spaces: the global, device- and system-wide space, the private local space for each thread, and the space shared by all the threads in a given block. The target of a load/store instruction had to be determined at compile time, which made it difficult to completely implement pointers whose target can change dynamically at run time. With the PTX ISA 2.0 supported by GF100, a single address space is used, which, among other things, makes support for C++ programs possible. C++ objects largely depend on the use of pointers to implement virtual functions, whose behavior can change at run time depending on the dynamic type of the object.
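
One way to picture a single address space is as a set of windows: any pointer value can be inspected at run time to find out which memory it refers to. The Python sketch below illustrates that idea only. The window bases and sizes are invented for the example and do not reflect GF100's actual layout, and `resolve` is a hypothetical name.

```python
# Hypothetical windows inside one flat address space (values are made up).
SHARED_BASE, SHARED_SIZE = 0x1000_0000, 48 * 1024
LOCAL_BASE, LOCAL_SIZE = 0x2000_0000, 512 * 1024


def resolve(address):
    """Map a generic address to (space, offset) at run time."""
    if SHARED_BASE <= address < SHARED_BASE + SHARED_SIZE:
        return ("shared", address - SHARED_BASE)
    if LOCAL_BASE <= address < LOCAL_BASE + LOCAL_SIZE:
        return ("local", address - LOCAL_BASE)
    # Everything outside the windows is treated as global memory.
    return ("global", address)


# The same load routine can now follow a pointer into any space.
for pointer in (0x100, SHARED_BASE + 16, LOCAL_BASE + 8):
    print(hex(pointer), resolve(pointer))
```

Because the target is decided when the pointer is dereferenced rather than when the code is compiled, one load or store instruction can serve pointers that change at run time, which is what C++ virtual dispatch relies on.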

Let’s give credit where credit is due. After several months of stagnation, the little world of 3D has finally awakened, and we now have two Direct3D 11 architectures, designed with quite different approaches.

Comment from the forums
  • infra
    Great review guys! As for GTX 470/480 - It's not as bad as I expected.The cards show some pretty decent numbers compared to 5870 even without its tessellation power used to its best.Perhaps next-gen Fermi will be a true champion - power and heat will be optimized and games will use the architecture of the GPU to its full potential.All in all it's a great architecture, maybe a bit ahead of it's time if you ask me.
  • Anonymous
    Power hungry, noisy, the fight is on. Glad I got the 5870. The driver-updates will see us through.
  • N19h7M4r3
    Power consuption is really high, but i think that efficiency if actually pretty good, but in the end what will matter is $$$ and not everyone will pay to have the best card on the block.
  • Dandalf
    Do we expect AMD to drop its prices in response? Don’t count on it.

    Dammit I was waiting for these cards SOLELY so ATI drop their prices! Aaaarrgghhh
  • Anonymous
    5000 series will keep their prices for a long time
  • mapleo
    Fermi could be a tragedy in NV's history.
    It seems I have to use HD5870 untill HD6870 or GTX580 release.
  • memeroot
    looks god if it came out 6 months back.... as a 3d vision fan thoug it looks like another wait for the right card
  • Dandalf
    Thanks for translation Rabid, wish i saw it before I started rating him down as a bot :| oops
  • Anonymous
    GTX480 buy it!!! Send stove!!! sorry my english is poor!!!
    Wow!!!!!! It's the fastest single GPU card on the planet. And it's a toaster oven and space heater too. What will Nvidia think of next?

    I wonder if it will qualify for any exemptions under forthcoming "cap and trade" regulations?
  • FanterA
    it should also be noted that for UK customers (like myself) that a 5870 can be had for less than the asking price of a 470, and for the prices on the 480, you could have a pair of 5850s in crossfire. Add to this the heat and power concerns, and i think I'll forgo Thermi and get another 5850 when I deem it necessary. so glad i didn't wait :D
  • mapleo

    I'm not a fan for any brand. I only choose products base on my needs. That's my point.
  • Anonymous
    haha Fermi you are out!!
  • carlos0248

    I thought the GTX480 just like editor said that the best performance but the price and power consumption was higher. Don't count it can cause ait drop their price.
  • Anonymous
    It's a true fact that NV is always good at Games becouse of its "way" plan. Viedo card is often used to play video games after all.
  • goozaymunanos
    sod this..i'm gonna buy a 5850..

    the GTX470 should be retailing at £250.


    p.s. stuff & nonsense:
  • marney_5
    How much are the Fermi cards in the US again? On overclockers UK the 480 prices around £450! Where the 5870 is around £320! Is this correct? Because Fermi is sh*t value if its only slightly faster and £100 extra!

    I only waited for this card so the ATI prices would go down!!! Dammit!
  • my_jacks
    Sparkle GeForce GTX 480 1536MB GDDR5 PCI-Express Graphics Card
    £445.99 (inc VAT)

    Sparkle GeForce GTX 470 1280MB GDDR5 PCI-Express Graphics Card
    £309.99 (inc VAT)

    Powercolor ATI Radeon HD 5970 2048MB GDDR5 PCI-Express Graphics Card
    £499.99 (inc VAT)

    Sapphire ATI Radeon HD 5870 1024MB GDDR5 PCI-Express Graphics Card
    £299.99 (inc VAT)

    Sapphire ATI Radeon HD 5850 1024MB GDDR5 PCI-Express Graphics Card
    £220.99 (inc VAT)

    - Overclockers UK (29/3/10)
  • 13thmonkey
    what happens to power and heat if v-sync is on, i.e. if the card can do 120+ fps on a game but is limited to 60fps by v-sync, does that reduce the power and thermals as it is only calculating 50% of the frames.

    I assume it calculates a frame, waits for 60hz refresh (idles) displays it, calcs another one waits (idles), calcs another one, etc.

    or does it just calc and calc and calc and then show the one frame that was most recently completed on the refresh, then calc calc calc and show the most recent on the refresh, ignoring the results of the nondisplayed calcs.
  • damian86
    ATI is still being your 'daddy'