Too Hot to Last? Investigating Intel's Claims About Ryzen Reliability

Credit: Tom's Hardware

AMD's Ryzen 3000-Series processors landed two months ago, bringing with them an incredible increase in real-world performance and upsetting the pricing paradigm with an impressive increase in performance-per-dollar, but the launch has been marred by reports that many users aren’t receiving the rated boost speeds. AMD announced this week that it had identified an issue with its firmware that reduces performance in some situations and that it would update the community on an incoming fix on September 10. 

As we often see in marketing, Intel has chosen to attack during AMD's perceived time of weakness. At the IFA tradeshow this week, Intel presented a slide deck to members of the press that includes information from a recent survey conducted by YouTuber Der8auer in which a surprising number of respondents reported they have been unable to reach the rated boost frequencies with their Ryzen 3000 processors.

Interestingly, Intel then pressed further on the issue, citing a report that speculates about AMD's apparent, but unproven, reasons for reducing its chips' frequencies.

We were already investigating the claims Intel cited in regards to the relationship between Ryzen's clock frequencies and longevity and had secured comment from AMD before its admission that there was an issue with its firmware. Today we'll present some of the testing we conducted to investigate those claims.

Ryzen 3000 Longevity Concerns

We've already delved into the boost behavior of the Ryzen 3000 processors, making several key discoveries along the way that AMD has confirmed. The most important finding is that Ryzen 3000 series processors come with a mix of fast and slow cores, meaning not all cores are capable of hitting the rated single-threaded boost frequencies, and that the new Ryzen-aware Windows 10 scheduler targets the fastest cores with lightly-threaded workloads.

Credit: Intel

We chose to look into the matter further based on a comment made by legendary overclocker and Asus engineer Shamino on the Overclock.net forums, which is the same comment that spurred the article Intel cited in the slide above. Shamino stated that AMD had dialed back the boost frequencies to bring its long-term reliability metrics more into line with the company's expectations.

"every new bios i get asked the boost question all over again, i have not tested a newer version of AGESA that changes the current state of 1003 boost, not even 1004. if i do know of changes, i will specifically state this. They were being too aggressive with the boost previously, the current boost behavior is more in line with their confidence in long term reliability and i have not heard of any changes to this stance, tho i have heard of a 'more customizable' version in the future"

It's noteworthy that Shamino made the statement from his private forum account, so we can't take it as a definitive statement from AMD on whether or not the company reduced the boost frequencies to extend the longevity of its chips. This is likely his opinion. Shamino also claimed that AMD might have a 'more customizable' version of its boost mechanisms in the future, but how that changes the already-existing settings is unclear.

However, Shamino's comments are even more interesting in light of an earlier post by The Stilt, a well-known hardware reviewer and member of the enthusiast community who works for an unidentified motherboard vendor:

Setting the thermal limits below stock (95°C) make no difference, since the boost algorithm already uses lower limits.

The original limits for Ryzen 3000 SKUs were:

- 3600 = 4100MHz (80-95°C) / 4200MHz (< 80°C)
- 3600X = 4200MHz (80-95°C) / 4400MHz (< 80°C)
- 3700X = 4200MHz (80-95°C) / 4400MHz (< 80°C)
- 3800X = 4300MHz (80-95°C) / 4550MHz (< 80°C)
- 3900X = 4400MHz (80-95°C) / 4650MHz (< 80°C)

Since then, it appears that the HighTemperature limit has been reduced further to 75°C (from 80°C).
New SMUs also have introduced "MiddleTemperature" limit, but that gets disabled when PBO is enabled.

HWInfo is also able to display these limits (fused values).

The Stilt claimed that AMD had lowered the temperature threshold for boost activity from 80C to 75C, which has the side effect of reining in the boost activity when the chip reaches higher temperatures. This is an important distinction due to the nature of chip aging, which we'll do our best to simplify.
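The limits The Stilt listed amount to a two-step lookup: one boost ceiling below the high-temperature threshold and a lower one between that threshold and 95C. Here is a minimal Python sketch of that idea, assuming a simple two-tier model. The names `ORIGINAL_LIMITS_MHZ` and `boost_limit_mhz` are ours, and the real boost algorithm weighs many more inputs than temperature alone.

```python
# Per-SKU boost ceilings as quoted from The Stilt's post:
# (limit from the high-temperature threshold up to 95C, limit below the threshold)
ORIGINAL_LIMITS_MHZ = {
    "3600": (4100, 4200),
    "3600X": (4200, 4400),
    "3700X": (4200, 4400),
    "3800X": (4300, 4550),
    "3900X": (4400, 4650),
}

MAX_TEMP_C = 95  # stock thermal limit cited in the post


def boost_limit_mhz(sku, temp_c, high_temp_c=80):
    """Return the boost ceiling for a SKU at a given temperature.

    Simplified two-tier model (our assumption). Returns None above
    MAX_TEMP_C because the quoted post lists no ceiling there.
    """
    hot_limit, cool_limit = ORIGINAL_LIMITS_MHZ[sku]
    if temp_c < high_temp_c:
        return cool_limit
    if temp_c <= MAX_TEMP_C:
        return hot_limit
    return None


# A 3700X at 77C keeps its full ceiling with the original 80C threshold,
# but falls to the reduced ceiling if the threshold is lowered to 75C.
print(boost_limit_mhz("3700X", 77, high_temp_c=80))  # 4400
print(boost_limit_mhz("3700X", 77, high_temp_c=75))  # 4200
```

In this model, moving the threshold from 80C to 75C changes nothing for a chip that stays cool, but costs a 3700X running between 75C and 80C up to 200 MHz of headroom.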

Given enough time, even the largest mountains in the world will erode. Zoom in on the almost unimaginably-small nanometer-scale transistors inside of a CPU that rapidly switch on and off at billions of cycles per second (for example, at 5.0 GHz the transistors switch at a rate of 5 billion cycles per second), and it's easy to understand that these will also erode, even under optimal operating conditions. This also includes the incredibly tiny interconnects (wires) that connect the transistors.

Some factors, such as higher current and thermal density, increase the rate of wear and accelerate electromigration (the gradual displacement of metal atoms in a conductor, caused by the momentum of the electrons flowing through it). Because increasing frequency requires pumping more power through the chip, thus generating more heat, higher frequencies typically result in faster aging, and thus a shorter life span. These problems become more pronounced with smaller feature sizes, such as when transistors become smaller inside modern chips (like AMD's shrink to a 7nm process and Intel's shrink to 10nm), simply because the chip is pushing more current through smaller transistors and interconnects.

So, like the carton of milk in your refrigerator, your chip has an expiration date. It's the job of smart semiconductor engineers to predict that expiration date with some accuracy, which is a difficult proposition given the unique characteristics of each and every individual piece of silicon that comes out of the fab. Given that switching the transistors at higher frequencies increases the rate of wear on them and the surrounding structures, frequency is one of the primary levers that engineers pull to control the lifespan of your chip. In short, reducing frequency can slow the aging process, thus increasing longevity.
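One common way to reason about this relationship is Black's equation, an empirical electromigration model in which mean time to failure (MTTF) scales as J^(-n) * exp(Ea / kT), where J is current density and T is absolute temperature. To be clear, this is our illustration, not a model AMD has said it uses, and the activation energy (0.9 eV) and current-density exponent (2) in the sketch below are placeholder assumptions rather than measured values for any real chip.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(current_density_ratio, temp_c, ref_temp_c,
                  activation_energy_ev=0.9, exponent_n=2.0):
    """MTTF relative to a reference operating point, per Black's equation.

    MTTF = A * J**(-n) * exp(Ea / (k * T)); the constant A cancels in the
    ratio. activation_energy_ev and exponent_n are illustrative placeholders.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    thermal = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t - 1.0 / t_ref)
    )
    return current_density_ratio ** (-exponent_n) * thermal


# Same current density, 75C versus an 80C reference point.
print(f"{relative_mttf(1.0, 75, 80):.2f}x")
# 5% lower current density at an unchanged 80C.
print(f"{relative_mttf(0.95, 80, 80):.2f}x")
```

The exact multipliers depend entirely on the assumed constants, but the direction matches the argument above: in this model, both a lower temperature and a lower current density lengthen the predicted lifespan.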

That means Shamino's assumption that AMD's frequency reductions in newer BIOS versions are related to longevity is a logical conclusion. It also means that The Stilt's claim that temperature thresholds have been changed to reduce frequency at higher temperatures is incredibly relevant, as that would also reduce the rate of wear during the times at which the processor is most vulnerable to aging (higher heat and power). However, without proper explanation from AMD we don't know if that is the primary factor behind the changes, one facet of a much more complicated equation, or if it has nothing to do with the situation at all.

We reached out to AMD for comment on August 27, 2019, asking if AMD had reduced Ryzen 3000's boost frequencies in newer motherboard firmwares to meet longevity requirements. AMD provided us with the following official statement on August 28, 2019:

AMD is committed to providing improvements and optimizations in AGESA updates as we have done with all of our processors. At present, AGESA 1003ABB is the latest release to customers and partners. As shared before, we will provide updates on enhancements when available in future AGESA versions.

Remember, we received this statement before AMD revealed that it had identified an issue in its firmware, and also before Intel quoted the reporting in its slide deck. It's noteworthy that AMD's statement doesn't address our question in any fashion.

Having reached a dead-end in our attempts to get an official explanation from AMD about the matter, the only thing left to do was to test to see if we can spot a change to the temperature thresholds in the various firmware revisions. We can't verify the impact on reliability, as CPUs don't have a wearout indicator like we see with SSDs. However, the most logical starting point is to determine if there were intentional changes to boost behavior.

Ryzen 3000 Boost Frequency Temperature Limits by BIOS and SMU Version

Again, these tests are strictly limited to ascertaining if there has been a meaningful change to the boost temperature thresholds of the Ryzen processors.

The first step is to check whether the temperature setting itself has visibly changed. To do this, we used HWInfo to check the CPU's High Temperature Clock Limit. As you can see in the graphic, this threshold is listed at 80C, not 75C. We tested multiple versions of motherboard firmwares on multiple motherboards, and with multiple chips, and every configuration exposed the same 80C temperature limit.

That's because this is a fused value, programmed into the chip during manufacturing. These readings prove that AMD, at least at some point, intended for the chip to continue to boost within normal parameters until it reaches 80C. However, they don't reveal the actual setting the chip uses.

That's because the system management unit (SMU), a small unit inside the processor, can override several parameters inside the chip, including the high-temperature limit. But those override values aren't exposed to the user or monitoring tools, meaning that we can't see the actual values in use. The SMU is updated with different versions that are installed during the BIOS update process, and AMD has issued several SMU versions both before and after the Ryzen 3000 launch. The latest versions purportedly feature the changed temperature limits.
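The distinction between the value a monitoring tool reports and the value the chip acts on can be sketched in a few lines of Python. This is a conceptual model only: the `HighTempLimit` class and its field names are hypothetical, and the 75C override is the purported figure under investigation, not something we can read from the SMU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HighTempLimit:
    fused_c: int                          # value programmed at manufacturing
    smu_override_c: Optional[int] = None  # firmware override, if any

    def reported_c(self) -> int:
        """What a tool that reads the fused value would display."""
        return self.fused_c

    def effective_c(self) -> int:
        """What the boost algorithm would act on in this model."""
        if self.smu_override_c is not None:
            return self.smu_override_c
        return self.fused_c


limit = HighTempLimit(fused_c=80, smu_override_c=75)
print(limit.reported_c(), limit.effective_c())  # 80 75
```

The reported and effective values only agree when no override is present, which is why reading the fused value alone can't settle the question.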

Credit: Gigabyte

We chose to use Gigabyte's X570 Aorus Master for our testing. Unlike many of the other motherboard vendors, Gigabyte still hosts its NPRP (AMD approved) BIOS that was distributed to reviewers prior to launch. The company also still has the original BIOS release posted to its site, as well as the latest revision. The company also recently posted a beta BIOS version to Overclock.net, which we'll also test.

It's noteworthy that the maximum 4.4 GHz boost frequency of the Ryzen 7 3700X appeared fleetingly with the F4 BIOS, but these boosts occur for such a short period of time that they are hardly meaningful. Instead, the chip most often peaked at 4.375 GHz across all BIOS revisions.

| Gigabyte BIOS Revision | Max Freq. Recorded | SMU Version | AGESA |
| --- | --- | --- | --- |
| F4 | 4.4 GHz | 46.32.0 | 1.0.0.3 |
| N11 | 4.375 GHz | 46.37.0 | 1.0.0.2CA |
| F5 | 4.375 GHz | 46.40.0 | 1.0.0.3ABB |
| F5P | 4.375 GHz | 46.40.0 | 1.0.0.3ABB |

Sources close to the matter tell us that the SMU revisions with the 80C limit were never made public, including in the BIOS versions provided to reviewers. However, after adjusting the temperature threshold to 75C with 46.37.0, AMD continued to adjust the temperature limits in SMU versions 46.38.0 and 46.39.0, slowly dialing back the limits, and those changes should carry over to the latest 46.40.0, as well.

Because the temperature threshold changes remain behind the SMU black box, we'll have to conduct a few tests to prove there have been meaningful adjustments.

It's noteworthy that test results could vary with other motherboards and boost behavior could vary based on the quality of your chip. However, these results with our AMD-provided sample should be sufficient for our purposes.

Testing Ryzen 3000 Boost Frequency Temperature Limits

The test itself is simple. We began the tests with all fans and the pump on a Corsair H115i running at full speed. We then kicked off a single-threaded Cinebench test to expose the maximum boost attainable with our Ryzen 7 3700X sample. We allowed this test to run for 60 seconds so the chip could settle into its 'natural' state during the workload, then unplugged the fans and pump and allowed the chip's temperature to rise to 95C. This allows us to record changes to the frequency as temperature increases. We did our best to isolate all parameters, and due to the nature of this testing, the cooling solution and ambient temps aren't a factor. We followed the general testing methodology we outlined in our previous piece (bottom of page 1).

Here we have the results of our tests with the F4 BIOS. This is the original Ryzen 3000-enabling BIOS that was made publicly available before launch. You'll notice that we're zooming in on the results of the test, cutting off the first 100 seconds of the run and constraining both the frequency and temperature axes. We aren't fans of using non-zero axes because they can exaggerate performance differences, but we need to closely examine the relatively small variances in clock rate during a very specific portion of the test. We will include full-length charts of these same tests, but with zeroed axes, later in the article.

We have plotted the frequency of all eight cores on the left axis, while the temperature is plotted on the right. The temperature reading is the rising red line, and we've added markers to note where the temperature first reaches 75C and 80C, which are the focus areas. This is the first Ryzen-enabling BIOS revision made available before launch and comes with AGESA version 1.0.0.3 and SMU version 46.32.0. Here we can see that temperature fluctuates for a short time after reaching 75C, but the chip downshifts to ~4.125 GHz once the chip exceeds 75C for a duration longer than two seconds.
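For readers who want to apply the same analysis to their own logs, here is a minimal Python sketch of the two measurements described above: finding the moment the temperature has stayed above a threshold for more than a set duration, and comparing peak clocks before and after that point. The function names are ours, and the log is synthetic data made up purely to exercise the code. It is not taken from our test runs.

```python
def first_sustained_crossing(samples, threshold_c, hold_s=2.0):
    """Return the time at which temperature has stayed above threshold_c
    for longer than hold_s seconds, or None if that never happens.

    samples: iterable of (time_s, temp_c) tuples in chronological order.
    """
    start = None  # time of the first sample in the current run above threshold
    for time_s, temp_c in samples:
        if temp_c > threshold_c:
            if start is None:
                start = time_s
            if time_s - start > hold_s:
                return time_s
        else:
            start = None  # temperature dipped back, so restart the clock
    return None


def peak_clock_ghz(log, start_s, end_s):
    """Highest per-core clock seen in the window [start_s, end_s)."""
    clocks = [max(cores) for t, _, cores in log if start_s <= t < end_s]
    return max(clocks) if clocks else None


# Synthetic log for illustration only (NOT measured data):
# (time_s, temp_c, per-core clocks in GHz)
log = [
    (100, 73.0, [4.375, 4.300]),
    (101, 75.5, [4.375, 4.300]),
    (102, 74.8, [4.350, 4.300]),
    (103, 75.2, [4.350, 4.275]),
    (104, 75.9, [4.325, 4.275]),
    (105, 76.4, [4.300, 4.250]),
    (106, 77.0, [4.125, 4.100]),
    (107, 77.8, [4.125, 4.100]),
]

t_cross = first_sustained_crossing([(t, temp) for t, temp, _ in log], 75.0)
print(t_cross)                            # 106
print(peak_clock_ghz(log, 100, t_cross))  # 4.375
print(peak_clock_ghz(log, t_cross, 108))  # 4.125
```

Resetting the timer whenever the temperature dips back under the threshold matters here, because the temperature tends to hover around 75C for a moment before climbing for good.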

It's important to note that AMD has thousands of temperature sensors spread across the compute die, so we aren't seeing recordings of localized hot spots that may also impact boosting behavior. There is also some run-to-run variation between separate tests, but this performance trend is persistent.

Here we see three runs of the N11 NPRP (AMD approved) BIOS provided to reviewers for their Ryzen 3000 testing. This BIOS features 1.0.0.2 CA AGESA code (yes, an older AGESA than the initial BIOS) paired with the 46.37.0 SMU revision. Motherboard vendors recommended these BIOS versions to reviewers for their testing, but most outlets, like ourselves, chose to go with the newer AGESA 1.0.0.3 BIOS versions made available in the waning hours of the pre-launch window.

It's noteworthy that Gigabyte specifies the SMU version with its publicly-posted N11 NPRP BIOS, but just for the sake of being thorough, we tested with both the publicly-posted BIOS (bottom two results) and the BIOS included in the reviewer download folder hosted by AMD (top two test runs) that doesn't have a listed SMU version. We do see one run with the publicly-facing N11 BIOS (bottom left) that doesn't exhibit as much of an aggressive boost behavior after the 75C threshold, but the successive run (bottom right) exhibits the same behavior as the BIOS provided on the AMD portal, meaning we can probably chalk that up to run-to-run variance.

In either case, the trend here is undeniable: In contrast to the F4 BIOS, the Ryzen 7 3700X with the N11 BIOS stays above 4.2 GHz after it reaches the 75C threshold, with peaks that vary (4.225 to 4.25 GHz) on a per-run basis. Peak clocks drop to 4.2 GHz after the chip exceeds the 80C threshold.

The heightened clock speeds are relatively small, an increase of 225 to 250 MHz, but as we've seen with Der8auer's survey, many users who aren't reaching peak speeds are falling short by these relatively slim variances. Also, while these variances are comparatively small, we're looking at a difference of 225 to 250 million cycles per second, which adds up to a billion extra cycles in ~four seconds. Then consider the number of additional cycles that the transistors and interconnects will experience throughout their three-year warranty period. In other words, it's safe to assume that these small alterations, which occur during the time of peak stress with high heat/current density, could have a meaningful impact on long-term reliability. Of course, that doesn't mean that is the intention behind any temperature threshold adjustments, but it is a possibility.
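The arithmetic above is easy to reproduce. The short Python sketch below uses the 225 and 250 MHz figures quoted in the text. The three-year total is strictly an upper bound that assumes the chip holds the higher clock every second of the warranty period, which no real workload does, and the function names are ours.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_to_extra_cycles(delta_mhz, extra_cycles=1_000_000_000):
    """Seconds needed to accumulate extra_cycles at a clock delta in MHz."""
    return extra_cycles / (delta_mhz * 1_000_000)


def extra_cycles_upper_bound(delta_mhz, years):
    """Extra cycles if the clock delta were sustained nonstop for `years`."""
    return delta_mhz * 1_000_000 * years * SECONDS_PER_YEAR


for delta in (225, 250):
    to_billion = seconds_to_extra_cycles(delta)
    bound = extra_cycles_upper_bound(delta, 3)
    print(f"{delta} MHz: {to_billion:.2f} s per billion cycles, "
          f"up to {bound:.3e} extra cycles in three years")
```

At 250 MHz the billion-cycle mark arrives in exactly four seconds, and at 225 MHz it takes a little under four and a half.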

The next step is to see the current state of the BIOS. Here we're testing Gigabyte's latest, the F5 BIOS with AGESA 1.0.0.3 ABB. Gigabyte doesn't specify the SMU revision on its site, but this is 46.40.0. Here we can see the chip drop to 4.2 GHz after it reaches 76C, where it remains well after 80C. This is markedly different from the 'reviewer BIOS.'

Finally, these are the results of F5P, a beta BIOS that Gigabyte posted a few days ago to Overclock.net. We're testing this latest revision to see if there are any measurable changes in what is expected to be the last pre-AGESA-fix BIOS. The chip remains above 4.2 GHz until it exceeds 77C for five seconds. Again, this is markedly different behavior from the reviewer BIOS.

As promised, here is an album with 'zeroed' axes. As noted above, in the grand scheme of things these deltas may appear small, but they can be incredibly impactful from a long-term reliability perspective. Also, many of these small deltas are representative of the various sub-specification boost clock reports we've seen from AMD's customers.

Thoughts

It's easy to vilify Intel for attacking AMD about these changes, especially given the unsubstantiated nature of the reports. We're accustomed to seeing unsavory marketing tactics from both AMD and Intel alike, among many others, but there should be some awareness at Intel that promoting unproven theories with the company logo next to them is risky. It lends credibility to reports that might not have any real merit. Instead, Intel should work to put proven metrics behind statements that call into question the reliability of competing products.

Intel is quick to point out that it still holds the performance leadership position in gaming, but Ryzen does the most damage in the high-volume mid-range, and there Intel can't claim the same advantage. We fully expect the Core i9-9900KS to be a beast in most workloads, but it does nothing to address the company's shortcomings in the mid-range, which all boil down to curtailed features for the sake of keeping margins healthy. It will be fun to watch the tit-for-tat unfold when Intel has a new lineup for the mid-range.

Marketing hijinks aside, it's hard to determine if AMD has adjusted the temperature limits to rein in reliability metrics. But our tests show that there have been alterations, even after the company initially cut back to the 75C limit. It's noteworthy that many chips aren't reaching their full boost potential even with the N11 'reviewer BIOS,' so AMD's fix might consist of returning to the original hard 80C boost temperature limit, which we're told hasn't been officially seen in the wild.

If AMD did adjust the threshold to align the chips with its reliability projections, and that's a big if, returning to the 80C limit would expose it to a higher failure rate. However, that doesn't mean Ryzen chips are going to die in droves. Chip reliability targets are set to hold the number of RMAs to a tenable rate, dictated by the financial impact to the company, so the alterations could ultimately equate to a comparatively small increase in the number of failures over time. At this point, a minor increase in failure rates would certainly be preferable to a class-action lawsuit for false advertising, not to mention the damage to the Ryzen brand.

And AMD's failure rate calculations are likely very complex, especially given that the company is using the Windows 10 scheduler to target workloads at the faster cores (which the operating system sees as favored cores).

It's a basic fact: Some cores wear out faster than others, particularly if they are utilized more frequently than other cores. Remember, cores often drop into lower power states/frequencies when they aren't active, which naturally will reduce wear. However, with the scheduler targeting certain Ryzen cores more than others, that could result in faster wear on a core that is consistently more active than others. Intel also uses a similar tactic with its Turbo Boost 3.0 feature, so this isn't entirely unexplored ground, but it certainly also factors into a very complex failure rate matrix. (Coincidentally, the next Windows 10 insider build has another scheduler alteration that rotates workloads more efficiently across favored cores to improve performance and reliability, but whether that just represents a broader implementation of the changes made for Ryzen processors remains to be seen.)
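To illustrate why rotation matters, the Python sketch below counts how much busy time lands on each core when a single-threaded task is always pinned to the fastest core versus alternated between two favored cores. The core labels and one-second time slices are hypothetical, busy time is only a crude proxy for wear, and nothing here reflects how the Windows scheduler actually makes its decisions.

```python
from collections import Counter
from itertools import cycle


def busy_seconds(core_for_slice, slices):
    """Count how many one-second slices land on each core."""
    return Counter(next(core_for_slice) for _ in range(slices))


favored = ["core0", "core1"]  # hypothetical favored-core labels
slices = 3600                 # one hour of a single-threaded task

pinned = busy_seconds(cycle(favored[:1]), slices)  # always the fastest core
rotated = busy_seconds(cycle(favored), slices)     # alternate favored cores

print(dict(pinned))   # {'core0': 3600}
print(dict(rotated))  # {'core0': 1800, 'core1': 1800}
```

Rotation halves the busy time on the most-used core in this toy case, which is the kind of effect a failure-rate model would have to account for.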

Ultimately, AMD's temperature adjustments aren't the only reason that users aren't reaching rated clock speeds. Much of it could come down to users running older versions of Windows that don't target the favored cores correctly, or just general user error, but the altered thresholds are almost certainly a factor. AMD is binning these chips to the very limits of the silicon, so it has precious little wiggle room to play with.

To be clear, we stand by the recommendations we've made in both our reviews and our Best CPU articles. The Ryzen 3000 series processors bring a new class of performance, and value, to the mainstream desktop. But we also expect the products we purchase to reach their rated specifications, so we're happy to hear that AMD is busy working on a fix. 

We now have a good base of knowledge to determine if AMD's forthcoming fix, which it will unveil September 10, involves adjusting the thermal threshold further. We won't know more until the new BIOS and/or SMU revisions are in hand, but we'll be ready at our test benches when it lands.