ATI Radeon HD 5870: DirectX 11, Eyefinity, And Serious Speed

Stepping Through The Architecture

Achieving this generation’s five design goals required notable tweaks to ATI’s architecture, though many of the cues are clearly taken from the Radeon HD 4800-series (and indeed the 3800-series before that).

Before we even get to the GPU’s shader processing capabilities, we have to take a look at the graphics engine, which includes ATI’s sixth-generation tessellation engine. We’ve seen the company evangelize tessellation in the past. But as with most features that only one competitor supports, adoption in actual games was limited. Now it’s part of the DirectX 11 pipeline, sandwiched between the hull and domain shaders. The tessellator is a fixed-function component that may or may not be used, depending on the tessellation technique in play.

In its architectural description, AMD ambiguously claims to include dual rasterizers. As you probably know, current GPUs are capable of rasterizing a single triangle per cycle, and that very serial approach has become the main reason for the performance bottlenecks that show up in synthetic geometry tests on unified-shader architectures.

At first, we thought AMD had found a way to parallelize the setup, which would have been particularly well-suited to a GPU that places a lot of importance on tessellation. There are any number of options for rasterizing several triangles in parallel, but they’re very complex. So, we were curious to see how AMD had solved this puzzle. Unfortunately, the answer was disappointing: AMD was playing fast and loose with its wording. In practice, there’s still only a single rasterizer, handling a single triangle per cycle. But now there are twice as many scan conversion units, generating 32 pixels per cycle in order to match the increase in ROPs. Rather than dual rasterizers, it would be more accurate to call this a single, more powerful rasterizer.
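For a rough sense of scale, the scan-conversion figure can be turned into a peak pixel rate. This is only a back-of-the-envelope sketch. It assumes the reference core clocks of 850 MHz (Radeon HD 5870) and 750 MHz (Radeon HD 4870), which are not stated on this page, and it takes 16 pixels per cycle for the older chip simply as half of Cypress's 32, per the doubling described above. The function name is ours.

```python
def peak_pixel_rate_gpix(pixels_per_cycle: int, core_mhz: float) -> float:
    """Peak scan-conversion output in gigapixels per second."""
    return pixels_per_cycle * core_mhz / 1000.0

# Assumed reference clocks (not taken from this page).
cypress_rate = peak_pixel_rate_gpix(32, 850)
rv770_rate = peak_pixel_rate_gpix(16, 750)

print(f"Cypress: {cypress_rate:.1f} Gpixels/s")
print(f"RV770:   {rv770_rate:.1f} Gpixels/s")
print(f"Ratio:   {cypress_rate / rv770_rate:.2f}x")
```

Triangle throughput does not appear in this calculation, which is the point: with one triangle per cycle, only the pixel side of the rasterizer got wider.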

While we’re on the subject of setup and rasterization, we should point out one other change. The fixed-function units that handled interpolation calculations have disappeared, and that job is now addressed by the shader processing units. AMD claims that the impact on performance is negligible, and this is in line with the current trend towards getting rid of as many fixed units as possible and taking advantage of the enormous processing power of modern GPUs.

As mentioned on the previous page, the organization of ATI’s stream cores hasn’t changed. They have learned to operate more efficiently, though. We established in our first look at the Radeon HD 4850 that ATI’s VLIW architecture depends on an efficient compiler in order to maximize performance—otherwise those ALUs idle. With RV770, each of the five instructions in a VLIW bundle had to be independent of the others. Now, Cypress is able to execute a multiplication and addition dependent on the outcome of the previous operation in the same cycle. Take the following example:


These wouldn’t have been able to share the same instruction bundle on the RV770. But here they can, since these two operations are turned into one MAD, while conserving the result of the intermediate multiplication operation. Similarly, the RV770 only had a DP4 (four-component scalar product) instruction, with the DP2 and DP3 instructions implemented using DP4. As a result, certain slots of the bundle were wasted on unnecessary operations. AMD says that handling of scalar products is now more flexible, though it hasn't elaborated further. We suppose that the engineers have implemented DP2 and DP3 instructions natively to allow execution of other calculations in parallel.
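The packing rule described above can be sketched with a toy model. This is not AMD's shader compiler. The `Op` type, the `pack` function, and the greedy in-order strategy are all hypothetical, chosen only to show how relaxing the "no dependencies inside a bundle" rule for a multiply followed by a dependent add changes the bundle count.

```python
from typing import NamedTuple

class Op(NamedTuple):
    kind: str     # "mul" or "add"
    dest: str     # register written
    srcs: tuple   # registers read

def pack(ops, allow_dependent_mul_add: bool):
    """Greedy in-order packing into bundles of at most five operations."""
    bundles, current = [], []
    for op in ops:
        # Operations already in the bundle whose result this op reads.
        producers = [p for p in current if p.dest in op.srcs]
        # Relaxed rule: an add may follow the single mul it depends on.
        fusable = (allow_dependent_mul_add and op.kind == "add"
                   and len(producers) == 1 and producers[0].kind == "mul")
        fits = len(current) < 5 and (not producers or fusable)
        if not fits:
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles

# t = a * b, then r = t + c: the add depends on the multiply.
ops = [Op("mul", "t", ("a", "b")), Op("add", "r", ("t", "c"))]

print(len(pack(ops, allow_dependent_mul_add=False)))  # strict rule
print(len(pack(ops, allow_dependent_mul_add=True)))   # relaxed rule
```

Under the strict rule the dependent pair needs two bundles; under the relaxed rule it fits in one, leaving the remaining slots free for other work.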

Handling of integer operations has also been changed. Before, each of the four stream cores could execute one addition or bit shift operation on a 32-bit integer per cycle, and the special-function core could perform a multiplication or a bit shift (also on a 32-bit integer). Now the four cores are capable of performing a multiplication or addition per cycle, but only on 24-bit integers. This choice is the result of a compromise between increasing overall performance and not sacrificing too many resources to do it, as having a complete 32-bit integer multiplier in each of the stream cores would have done. By limiting themselves to 24-bit operations, the engineers can re-use resources used for handling single-precision floating-point numbers, while still maximizing utilization of the shader.
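The reason 24 bits is a natural width is that a single-precision float carries a 24-bit significand (23 stored bits plus the implicit leading one), so the multiplier built for floats can also serve integers of that size. The article does not spell out the exact instruction semantics, so the sketch below rests on an assumption: an unsigned multiply that reads only the low 24 bits of each operand and keeps the low 32 bits of the product. The function names are hypothetical.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1

def umul24(a: int, b: int) -> int:
    """Assumed 24-bit multiply: upper operand bits are ignored."""
    return ((a & MASK24) * (b & MASK24)) & MASK32

def umul32(a: int, b: int) -> int:
    """Full 32-bit multiply, keeping the low 32 bits of the product."""
    return (a * b) & MASK32

# Operands that fit in 24 bits: both paths agree.
print(umul24(1000, 1000), umul32(1000, 1000))

# An operand that needs bit 24: the 24-bit path silently drops it.
print(umul24(1 << 24, 3), umul32(1 << 24, 3))
```

Code that indexes arrays or computes addresses rarely needs more than 24 bits per operand, which is why the compromise costs little in practice.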

In addition to these optimizations, ATI's design team has introduced two new instructions. The first is a fused multiply-add (FMAD), which keeps the full precision of the intermediate product and performs a single final rounding, unlike a standard multiply-add (MAD), which rounds twice. The second is a sum of absolute differences (SAD), an operation frequently used in video processing, in particular for comparing blocks of pixels. We verified these increases in processing power using different shaders.
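To make the rounding difference concrete, here is a small sketch that emulates both behaviors in software. It uses Python's double-precision floats rather than the GPU's single precision, and it emulates the fused operation with exact rational arithmetic. The function names are ours, and nothing here reflects how the hardware is actually built. A simple SAD over two pixel blocks is included for comparison.

```python
from fractions import Fraction

def mad(a: float, b: float, c: float) -> float:
    """Unfused: the product is rounded, then the sum is rounded again."""
    return a * b + c

def fmad(a: float, b: float, c: float) -> float:
    """Fused (emulated): exact product and sum, one rounding at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def sad(block_a, block_b) -> int:
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(x - y) for x, y in zip(block_a, block_b))

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mad(a, b, c))   # the small term is lost in the first rounding
print(fmad(a, b, c))  # the small term survives

print(sad([10, 20, 30, 40], [12, 18, 30, 45]))
```

In video encoding, a lower SAD between two blocks means a closer match, which is how motion estimation picks candidate blocks.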

For most of the simple DirectX 9 shaders we ran, the Radeon HD 5870 held a 2.26x advantage over the Radeon HD 4870, but that dropped to 1.68x once we added per-pixel lighting.

Now let’s move to more complex DirectX 10 shaders:

With procedural textures, the theoretical gain in raw power is almost completely realized, as the Radeon HD 5870 is 2.24x faster than the 4870. There’s no doubt that the 1,600 stream processors are present and accounted for here.
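The theoretical figure these results are measured against can be worked out from unit counts and clocks. The sketch below assumes the reference core clocks of 850 MHz for the Radeon HD 5870 and 750 MHz for the Radeon HD 4870, and 800 stream processors for the older card. Those numbers are not stated on this page, so treat them as assumptions.

```python
def shader_throughput(stream_processors: int, core_mhz: float) -> float:
    """Relative raw shader throughput: units times clock."""
    return stream_processors * core_mhz

# Assumed reference specifications (not taken from this page).
hd5870 = shader_throughput(1600, 850)
hd4870 = shader_throughput(800, 750)

theoretical_gain = hd5870 / hd4870
print(f"Theoretical gain: {theoretical_gain:.2f}x")

# Measured gains quoted in the text, as a fraction of the theoretical one.
for label, measured in [("simple DX9 shaders", 2.26),
                        ("DX9 per-pixel lighting", 1.68),
                        ("DX10 procedural textures", 2.24)]:
    print(f"{label}: {measured / theoretical_gain:.0%} of theoretical")
```

Under these assumptions the ceiling is roughly 2.27x, so the 2.24x and 2.26x results sit within a couple of percent of it.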

If the stream processing units haven’t changed fundamentally, the texture units have barely changed at all compared to the RV770. In practice, except for support for 16Kx16K textures and two new texture compression formats (both of which were necessary for DirectX 11 compatibility), there’s nothing new. Steep Parallax Mapping shows that fairly clearly:

The drivers also seem to have been slightly optimized, because the gain we measured here (and with other shaders, like Fur) is as high as 2.35x between the two Radeons.

Geometry shader performance, on the other hand, has improved by only 42%.

This last test measures texture fetching performance (important for displacement mapping, for example). It shows a modest improvement of 34%.

You should keep in mind that while the total bandwidth of the L1 cache has increased, it’s only by a factor of two, which is commensurate with the increase in the number of texture units. In the same way, the size of the L2 cache has doubled. But again, that's only because the number of units has also doubled. Worse, the L1/L2 bandwidth has increased only in proportion to the GPU's frequency, whereas there are now twice as many units to supply. We may have just put our finger on one reason why Cypress failed to show two times the performance of its predecessor in the preceding texture tests.
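The per-unit consequence of that last point is easy to quantify. This is a sketch under stated assumptions: core clocks of 850 MHz for Cypress and 750 MHz for RV770 (not given on this page), L1/L2 bandwidth scaling only with clock, and exactly twice as many texture units to feed, as the paragraph above describes.

```python
# Assumed reference core clocks in MHz (not taken from this page).
CYPRESS_MHZ = 850
RV770_MHZ = 750

# L1/L2 bandwidth grows only with frequency.
l1_l2_bandwidth_gain = CYPRESS_MHZ / RV770_MHZ

# The number of units drawing on that bandwidth doubles.
unit_count_gain = 2.0

per_unit_bandwidth = l1_l2_bandwidth_gain / unit_count_gain

print(f"Total L1/L2 bandwidth: {l1_l2_bandwidth_gain:.2f}x")
print(f"Per-unit L1/L2 bandwidth: {per_unit_bandwidth:.2f}x")
```

Under these assumptions each unit has a little over half the L1/L2 bandwidth it had on RV770, which is consistent with fetch-heavy tests scaling well below 2x.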

Comment from the forums
  • Anonymous
    anyone seen these in the uk?
  • jimishtar
    now this is something. can't wait to see what nvidia will come up with.
  • jimishtar
    does this mean that the prices of existing 48xx cards will go down?
  • siunit
    They are scheduled for release on the 25th in the UK according to major online retailers expected stock date.
  • cyber_jockey
    My god the 4870x2 finally got a rival
  • Anonymous
    Smooth gaming again!?!? YAY
  • plasmastorm
    available for pre order on now £320
  • ainarssems
    All those connectors are nice but first thing I thought when I first saw pictures was there is not enough vents on the back to cool that card properly. Hopefully some board partners will come up with versions with full PCI slot cooling. Eyefinity looks better, but all I need is two DVI connectors one for monitor and other for projector.

    Now let's see what Nvidia brings out, let's hope something competitive in performance and price. I hope prices drop by Christmas to £200-259 for 5870
  • ainarssems
    At guru3d they overclocked to 925 core/ 5400 memory could not go further because of temp problems. I wonder what would they do with better cooling. 1GHZ/6GHz?...Now that would be sweet.Link:
  • atomdrift
    ainarssems wrote: "All those connectors are nice but first thing I thought when I first saw pictures was there is not enough vents on the back to cool that card properly. Hopefully some board partners will come up with versions with full PCI slot cooling."

    Anandtech addressed this concern in their review: "As far as the 5870 is concerned, this is solid proof that the half-slot exhaust vent isn’t going to cause any issues with cooling."
  • atomdrift
    Here's the source link to the above quotation.
  • ainarssems
    I have seen Anandtech's article, however they did hit 100C and started to throttle in Crossfire on Tom's review and Guru3d temps limited overclocking so there is room for improvement in cooling.

    Also I wonder if 2GB version would perform better at high resolutions with AA
  • jcwbnimble
    No Crossfire? Really? What was ATI thinking releasing a top of the line video card that can't support a major feature set? One of the major selling points is that you can run 3 displays off this one card, yet you need to Crossfire two of these to get playable frame rates. Problem, you can't Crossfire these cards (yet?).

    They should have dispensed with the third video connection in favor of extra ventilation, which it sounds like this card needs. If users are so gung ho about running 3 or more displays, then wait for the Eyefinity card.

    Glad to see ATI releasing a product that puts a boot up Nvidia's arse, but they shouldn't have released it without solving the Crossfire issue.
  • ainarssems
    Must be mistake. Check benchmarks, they include crossfire results so it is working. They probably had the cards for some while for testing and started writing article and drivers did not support crossfire at the start and does support now. They just forgot to edit part of the article where it says that it does not support crossfire.
  • shrex
    any word on when the 5670 is gonna come out
  • deepblue69uk
    Yes, try
  • chockimon
    £320 ouch, what a rip off and no Physx. I think this time round Nvidia's mid range card, the GTX360 will trounce all over this card from a great height.
  • wikkus
    chockimon wrote: "£320 ouch, what a rip off and no Physx. I think this time round Nvidia's mid range card, the GTX360 will trounce all over this card from a great height."

    Wow, 8 minutes before the first fanboi commentard...

    As someone who holds allegiance with neither vendor -- both have a place in our house -- this looks great from a stirring-up the market perspective.