GeForce GTX 480 And 470: From Fermi And GF100 To Actual Cards!

Additional Reading: Memory Hierarchy, Setup Engine, Tessellation

A Memory Hierarchy Worthy Of A CPU

For a long time, the memory subsystem was the GPU’s “red-headed stepchild.” Admittedly, memory access with early GPUs was so simple that a memory hierarchy wasn’t really needed. A simple texture cache and a small vertex cache (pre- and post-transformation) were all it took. With the arrival of GPGPU, things have become more complex, and a memory subsystem that basic isn't enough.

To meet these new requirements, Nvidia introduced Shared Memory with the G80. Shared Memory is a small, extremely fast memory area built into each multiprocessor that enables data sharing among threads. Even though using this memory area was entirely the responsibility of the programmer (and consequently required additional effort), the concept was well-received and reappeared in subsequent AMD GPUs. Finally, Microsoft integrated it into Direct3D 11, and Khronos into OpenCL.

But while this kind of memory, managed manually by the programmer, offers excellent performance with algorithms where memory access is highly predictable, it can’t supplant a cache when memory accesses are irregular. With GF100, Nvidia tried to combine the best of both worlds by offering a 64KB memory area per multiprocessor. This memory area can be configured in two different ways:

  • 16KB of L1 cache and 48KB of Shared Memory
  • 48KB of L1 cache and 16KB of Shared Memory

This L1 cache can also serve as a buffer to back up the registers when register space runs short, which is welcome. While the number of registers has increased compared to GT200, the increase isn’t commensurate with the increase in the number of stream processors.
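
As a rough illustration of what the two splits mean for a kernel, the sketch below counts how many thread blocks could be resident on one multiprocessor if Shared Memory were the only limit. The function name and the 12KB-per-block figure are assumptions made for the example; real occupancy also depends on registers, thread counts, and other hardware limits that are not modeled here.

```python
KB = 1024
TOTAL_PER_SM = 64 * KB  # configurable on-chip memory per multiprocessor

# The two splits described above: (L1 cache bytes, Shared Memory bytes).
CONFIGS = {
    "16KB L1 / 48KB shared": (16 * KB, 48 * KB),
    "48KB L1 / 16KB shared": (48 * KB, 16 * KB),
}


def blocks_limited_by_shared(shared_per_block, shared_available):
    """Resident blocks per multiprocessor if Shared Memory were the only limit."""
    if shared_per_block <= 0:
        raise ValueError("shared_per_block must be positive")
    return shared_available // shared_per_block


# Hypothetical kernel that needs 12KB of Shared Memory per thread block.
SHARED_PER_BLOCK = 12 * KB

for name, (l1_bytes, shared_bytes) in CONFIGS.items():
    # Both splits must add up to the same 64KB area.
    assert l1_bytes + shared_bytes == TOTAL_PER_SM
    blocks = blocks_limited_by_shared(SHARED_PER_BLOCK, shared_bytes)
    print(f"{name}: {blocks} block(s) by Shared Memory alone")
```

Under these assumptions, a kernel that leans heavily on Shared Memory would favor the first split, while one with irregular accesses and little Shared Memory use would have more to gain from the larger L1.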

But that’s not all. As we saw earlier, Nvidia also adds a true L2 cache to the GPU. GT200 did have a 256KB L2, but it was read-only and accessible only to the texture units. With GF100, Nvidia gives its GPU a general-purpose L2 cache (that is, one accessible in read/write mode) that serves both texture instructions and load/store instructions. This cache also replaces the ROP cache and the pre- and post-transform vertex caches of preceding GPUs. The L1 texture cache, due to its somewhat specialized operation, is still present and is the same size as on GT200 (12KB).

Finally, A Parallelized Setup Engine!

The benchmark of one triangle generated per clock cycle has finally been surpassed! We’ve been waiting for this for a long time (since the arrival of unified architectures, in fact), and it represents a great leap forward in geometry power. For a short while, we even thought that RV870 had paved the way, but that was an unfortunate letdown. Now Nvidia is stepping up by multiplying the number of setup units, and it's being generous. There are not two, but four rasterizers!

Here’s where the new organization of the units shows its full potential. As we said, each GPC contains most of the functions of a GPU (except for the ROP units), and can consequently function independently. What that means is that each GPC has its own setup unit, which can rasterize one triangle per cycle (at 8 pixels per cycle). In practice, peak pixel throughput is unchanged: 32 pixels are still generated per cycle, as with GT200 and RV870. The effectiveness of the new architecture will be evident when numerous small triangles are being handled.
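
The arithmetic behind that claim can be sketched with a deliberately simple model. It assumes each rasterizer starts at most one triangle per clock and emits at most a fixed number of pixels per clock; the function names are made up for the example, and the "single setup unit" baseline is a simplification of earlier designs rather than a measured figure. Culling, load balancing across GPCs, and shader limits are all ignored.

```python
def pixels_per_clock(num_rasterizers, pixels_per_rasterizer, pixels_per_triangle):
    """Pixels generated per clock when every triangle covers the same area.

    Each rasterizer starts at most one triangle per clock and emits at most
    `pixels_per_rasterizer` pixels per clock.
    """
    per_unit = min(pixels_per_triangle, pixels_per_rasterizer)
    return num_rasterizers * per_unit


def single_setup_unit(pixels_per_triangle):
    # Assumed baseline: one triangle per clock, 32 pixels per clock.
    return pixels_per_clock(1, 32, pixels_per_triangle)


def four_rasterizers(pixels_per_triangle):
    # GF100 as described above: four rasterizers at 8 pixels per clock each.
    return pixels_per_clock(4, 8, pixels_per_triangle)


for area in (2, 8, 32, 200):
    print(f"{area:>3} px/triangle: "
          f"single={single_setup_unit(area):>2} px/clk, "
          f"four={four_rasterizers(area):>2} px/clk")
```

In this model both designs top out at 32 pixels per clock on large triangles, but with 2-pixel triangles the single unit delivers 2 pixels per clock against 8 for the four rasterizers.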

Like most tech companies, Nvidia has a marketing department that's always eager to come up with exotic names for familiar concepts, so it groups all of the operations performed on vertices within a GPC under the PolyMorph Engine moniker.

The geometric pipeline is as follows: the vertices are taken from a buffer, then passed to the multiprocessors, which perform the vertex shading. The result is then passed on to the next stage of the PolyMorph Engine (the Tessellator), which is not directly programmable, but whose operation is steered by the Hull Shader upstream and whose output is processed by the Domain Shader downstream, both running on the multiprocessors.


A little review of tessellation is appropriate here. In Direct3D 11, Microsoft introduced three new stages to its graphics pipeline (in green in the illustration): the Hull Shader, the Tessellator, and the Domain Shader. Unlike the other stages in the pipeline, they don’t operate with triangles as primitives, but with patches. The Hull Shader receives the control points of a patch as input and determines certain parameters of the Tessellator (for example, the TessFactor, which indicates the degree of tessellation). The Tessellator is a fixed-function unit, and consequently the programmer doesn’t control how the tessellation is performed; it sends the points it generates to the Domain Shader, which can apply operations to them.
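
The division of labor between the three stages can be mimicked with a one-dimensional toy: a "patch" made of four control points defining a cubic Bezier curve. This is only a sketch of the data flow, not Direct3D 11 code; the function names, the fixed factor of 4, and the evenly spaced sampling are assumptions chosen to keep the example short.

```python
def hull_shader(control_points):
    """Pass the control points through and choose a tessellation factor."""
    tess_factor = 4  # hypothetical fixed value; a real shader would compute it
    return control_points, tess_factor


def tessellator(tess_factor):
    """Fixed-function stage: emit evenly spaced parametric coordinates in [0, 1]."""
    return [i / tess_factor for i in range(tess_factor + 1)]


def domain_shader(control_points, u):
    """Turn one parametric coordinate into a vertex on a cubic Bezier curve."""
    p0, p1, p2, p3 = control_points
    a = (1 - u) ** 3
    b = 3 * u * (1 - u) ** 2
    c = 3 * u ** 2 * (1 - u)
    d = u ** 3
    return tuple(a * w + b * x + c * y + d * z
                 for w, x, y, z in zip(p0, p1, p2, p3))


# Only these four control points would need to live in memory.
patch = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]

points, factor = hull_shader(patch)
vertices = [domain_shader(points, u) for u in tessellator(factor)]
print(vertices)
```

Note that the programmer writes `hull_shader` and `domain_shader`, while `tessellator` stands in for the fixed-function unit: the only way to influence it here is through the factor it receives.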

Tessellation has multiple advantages. It enables very compact storage of geometric data. Only the control points of the mesh are stored in memory, and the additional vertices generated remain in the graphics pipeline, which saves memory and bandwidth. Another advantage is that tessellation enables the use of dynamic LOD (levels of detail). The closer a mesh is to the camera, the more detailed it needs to be. Conversely, if it’s farther away, using a huge number of triangles is not only superfluous, but it can also be a huge drag on performance since the rasterizers and shader units aren’t at ease with triangles that are too small.
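
Dynamic LOD usually comes down to choosing a tessellation factor from the distance to the camera. A minimal sketch follows, assuming a simple inverse-distance falloff; the function name and the `full_detail_distance` parameter are hypothetical, and only the upper limit of 64 comes from Direct3D 11 itself.

```python
MAX_TESS_FACTOR = 64.0  # upper limit on tessellation factors in Direct3D 11


def tess_factor_for_distance(distance, full_detail_distance=5.0):
    """Full detail up to `full_detail_distance`, then fall off as 1/distance."""
    if distance <= full_detail_distance:
        return MAX_TESS_FACTOR
    factor = MAX_TESS_FACTOR * full_detail_distance / distance
    # Never go below 1.0, which leaves the patch unsubdivided.
    return max(1.0, factor)


for distance in (2.0, 10.0, 40.0, 1000.0):
    print(f"distance {distance:>6}: factor {tess_factor_for_distance(distance)}")
```

A real engine would more likely work from projected screen-space size than raw distance, but the principle is the same: distant patches get low factors so the pipeline isn't flooded with tiny triangles.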

Tessellation also has advantages in terms of animation; animation data is necessary only for the control points (the "coarse" mesh, in simple terms), whereas on a very complex mesh, it has to be available for each vertex. Consequently, the savings at this level can be spent on more animation data per control point, improving the quality of the animation.

But let’s be clear: while Microsoft, AMD, and Nvidia try to make the situation sound idyllic, in practice, tessellation is not a magic wand that automatically increases a game's level of geometric detail. Implementing tessellation properly in an engine is still complex and requires work to avoid the appearance of visual artifacts, especially at the intersections between two surfaces with different tessellation factors.
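
One common way to avoid cracks at such boundaries is to derive each edge's factor only from data belonging to that edge, so that the two patches sharing it arrive at the same value. The following is a minimal sketch of that idea, not a description of any particular engine; the function name, the `scale` constant, and the distance-based formula are assumptions.

```python
import math


def edge_tess_factor(a, b, camera, scale=32.0, max_factor=64.0):
    """Factor for the edge a-b, computed from the edge midpoint only.

    Because the result does not depend on which patch the edge belongs to
    (or on the order of its endpoints), neighbors agree on shared edges.
    """
    midpoint = tuple((p + q) / 2 for p, q in zip(a, b))
    distance = math.dist(midpoint, camera)
    return min(max_factor, max(1.0, scale / max(distance, 1e-6)))


camera = (0.0, 0.0, 0.0)
a, b = (0.0, 0.0, 4.0), (0.0, 0.0, 12.0)   # edge shared by two patches
c, d = (5.0, 0.0, 8.0), (-50.0, 0.0, 8.0)  # the third vertex of each patch

# Patch 1 walks the shared edge as a->b, patch 2 as b->a.
patch_1_shared = edge_tess_factor(a, b, camera)
patch_2_shared = edge_tess_factor(b, a, camera)
print(patch_1_shared, patch_2_shared)

# The non-shared edges are free to differ.
print(edge_tess_factor(b, c, camera), edge_tess_factor(a, d, camera))
```

Had each patch instead used a single factor based on its own center, the far-away vertex `d` would have pulled the second patch to a lower factor along the shared edge, and the mismatch would show up as gaps.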

For an analysis of tessellation performance, check out this page earlier in our review.

Comment from the forums
  • infra
    Great review guys! As for GTX 470/480 - It's not as bad as I expected.The cards show some pretty decent numbers compared to 5870 even without its tessellation power used to its best.Perhaps next-gen Fermi will be a true champion - power and heat will be optimized and games will use the architecture of the GPU to its full potential.All in all it's a great architecture, maybe a bit ahead of it's time if you ask me.
  • Anonymous
    Power hungry, noisy, the fight is on. Glad I got the 5870. The driver-updates will see us through.
  • N19h7M4r3
    Power consuption is really high, but i think that efficiency if actually pretty good, but in the end what will matter is $$$ and not everyone will pay to have the best card on the block.
  • Dandalf
    Do we expect AMD to drop its prices in response? Don’t count on it.

    Dammit I was waiting for these cards SOLELY so ATI drop their prices! Aaaarrgghhh
  • Anonymous
    5000 series will keep their prices for a long time
  • mapleo
    Fermi could be a tragedy in NV's history.
    It seems I have to use HD5870 untill HD6870 or GTX580 release.
  • memeroot
    looks god if it came out 6 months back.... as a 3d vision fan thoug it looks like another wait for the right card
  • Dandalf
    Thanks for translation Rabid, wish i saw it before I started rating him down as a bot :| oops
  • Anonymous
    GTX480 buy it!!! Send stove!!! sorry my english is poor!!!
    Wow!!!!!! It's the fastest single GPU card on the planet. And it's a toaster oven and space heater too. What will Nvidia think of next?

    I wonder if it will qualify for any exemptions under forthcoming "cap and trade" regulations?
  • FanterA
    it should also be noted that for UK customers (like myself) that a 5870 can be had for less than the asking price of a 470, and for the prices on the 480, you could have a pair of 5850s in crossfire. Add to this the heat and power concerns, and i think I'll forgo Thermi and get another 5850 when I deem it necessary. so glad i didn't wait :D
  • mapleo

    I'm not a fan for any brand. I only choose products base on my needs. That's my point.
  • Anonymous
    haha Fermi you are out!!
  • carlos0248

    I thought the GTX480 just like editor said that the best performance but the price and power consumption was higher. Don't count it can cause ait drop their price.
  • Anonymous
    It's a true fact that NV is always good at Games becouse of its "way" plan. Viedo card is often used to play video games after all.
  • goozaymunanos
    sod this..i'm gonna buy a 5850..

    the GTX470 should be retailing at £250.


    p.s. stuff & nonsense:
  • marney_5
    How much are the Fermi cards in the US again? On overclockers UK the 480 prices around £450! Where the 5870 is around £320! Is this correct? Because Fermi is sh*t value if its only slightly faster and £100 extra!

    I only waited for this card so the ATI prices would go down!!! Dammit!
  • my_jacks
    Sparkle GeForce GTX 480 1536MB GDDR5 PCI-Express Graphics Card
    £445.99 (inc VAT)

    Sparkle GeForce GTX 470 1280MB GDDR5 PCI-Express Graphics Card
    £309.99 (inc VAT)

    Powercolor ATI Radeon HD 5970 2048MB GDDR5 PCI-Express Graphics Card
    £499.99 (inc VAT)

    Sapphire ATI Radeon HD 5870 1024MB GDDR5 PCI-Express Graphics Card
    £299.99 (inc VAT)

    Sapphire ATI Radeon HD 5850 1024MB GDDR5 PCI-Express Graphics Card
    £220.99 (inc VAT)

    - Overclockers UK (29/3/10)
  • 13thmonkey
    what happens to power and heat if v-sync is on, i.e. if the card can do 120+ fps on a game but is limited to 60fps by v-sync, does that reduce the power and thermals as it is only calculating 50% of the frames.

    I assume it calculates a frame, waits for 60hz refresh (idles) displays it, calcs another one waits (idles), calcs another one, etc.

    or does it just calc and calc and calc and then show the one frame that was most recently completed on the refresh, then calc calc calc and show the most recent on the refresh, ignoring the results of the nondisplayed calcs.
  • damian86
    ATI is still being your 'daddy'