AMD Ryzen Threadripper 1950X Review

Game Modes & Architecture, Infinity Fabric Latency Testing

We've covered AMD's Zen architecture in depth, and also covered the Infinity Fabric at length. Head over to those articles for more coverage.

The Zeppelin Die Primer

Threadripper's massive package hides much complexity underneath, but we'll do our best to simplify and outline how it relates to AMD's innovative Creator and Game Mode features.

The Zen architecture employs a four-core CCX (CPU Complex) building block. AMD adorns each CCX with 8MB of L3 cache split into four slices; each core in the CCX accesses all L3 slices with the same average latency. Two CCXes come together to create an eight-core Ryzen 7 die (the large orange blocks in the second image below), and they communicate via AMD’s Infinity Fabric interconnect. The CCXes share the same dual-channel memory controller. This is basically two quad-core CPUs talking to each other over the Infinity Fabric pathway that also handles northbridge and PCIe traffic.

All Ryzen 7, 5, and 3 models feature the same single Zeppelin die. Although each core in a four-core CCX can access the local cache with the same average latency, trips to fetch data in the adjacent CCX incur a latency penalty. Communication between threads on cores located in disparate CCXes also suffers, which is of particular importance for gaming. Many game engines split out various tasks to different threads, but they rely on constant synchronization between them. Developers can defray some of the communication latency by tuning for the Ryzen architecture.

Building The Threadripper

The graphic below represents AMD's EPYC data center processor die, which shares Threadripper's basic design. We can see four separate Zeppelin dies connected via the Infinity Fabric, and the two CCXes inside each die. This creates a 32-core Multi-Chip Module (MCM). Of course, Threadripper is "only" a 16-core processor. To create this configuration, AMD substitutes in two 'dummy dies,' which are non-functional fillers that ensure the heat spreader's structural integrity and consistent mating with the socket's pins. Without these dark dies, the IHS would either cave in when you tighten your cooling solution, or the chip would warp and not make full contact with the pins. AMD notes that Threadripper's functional dies are always placed diagonally from each other, which makes sense considering the fabric's design.
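The core counts above follow directly from the building blocks. As a quick sanity check, here is a minimal sketch of the arithmetic; the function name `package_totals` is ours, and the two threads per core reflect Zen's simultaneous multi-threading.

```python
# Building blocks described in the article: four cores per CCX, two CCXes per Zeppelin die.
CORES_PER_CCX = 4
CCXES_PER_DIE = 2
THREADS_PER_CORE = 2  # SMT: two logical threads per physical core


def package_totals(active_dies):
    """Return (cores, threads) for a package with the given number of functional dies."""
    cores = active_dies * CCXES_PER_DIE * CORES_PER_CCX
    return cores, cores * THREADS_PER_CORE


for name, dies in (("EPYC (four active dies)", 4),
                   ("Threadripper 1950X (two active dies)", 2)):
    cores, threads = package_totals(dies)
    print(f"{name}: {cores} cores, {threads} threads")
# prints:
# EPYC (four active dies): 32 cores, 64 threads
# Threadripper 1950X (two active dies): 16 cores, 32 threads
```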

Remember, each Zeppelin die has its own memory and PCIe controllers. That means that if a workload executing on a die needs to access data resident in the memory of the other die (remote memory), it has to traverse a much larger gap. This introduces a level of latency we haven't seen from previous Ryzen models, and its effect on gaming performance is profound. The impact isn't as severe with most professional workloads, but some do suffer. 

The New Toggles

To defray the impact of remote memory access, AMD introduces a new memory access mode that you can toggle either in the BIOS or with the Ryzen Master software. The Local and Distributed settings switch between NUMA (Non-Uniform Memory Access) and UMA (Uniform Memory Access), respectively.

UMA (distributed) is pretty simple; it allows the dies to access all of the attached memory. NUMA mode (local) attempts to keep all data for the process executing on the die confined to its directly attached memory controller. It establishes one NUMA node per die (visible in the task manager). This reduces, and even possibly eliminates, data fetches from the remote memory connected to another die, though the die can still access it if needed. NUMA has deep roots in the enterprise, but the technique works best if programs are designed specifically to utilize it. It's a rarity on the desktop, but even though almost no desktop applications are designed to support it entirely, there can be performance advantages for non-NUMA applications.
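If you want to confirm how many NUMA nodes the operating system actually sees, Task Manager does the job on Windows. On Linux, the kernel exposes the same information through sysfs. The following is a minimal sketch under that assumption; the helper name `list_numa_nodes` is ours, and it simply returns an empty result on systems without the sysfs node directory.

```python
import os
import re


def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist} for each NUMA node exposed by Linux sysfs.

    Returns an empty dict if the directory does not exist (for example,
    on non-Linux systems), so the sketch is safe to run anywhere.
    """
    nodes = {}
    if not os.path.isdir(sysfs_root):
        return nodes
    for entry in sorted(os.listdir(sysfs_root)):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip files such as "possible" or "online"
        cpulist_path = os.path.join(sysfs_root, entry, "cpulist")
        try:
            with open(cpulist_path) as handle:
                nodes[int(match.group(1))] = handle.read().strip()
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node_id, cpus in list_numa_nodes().items():
        print(f"node {node_id}: CPUs {cpus}")
```

With the Local setting, we would expect a Threadripper system to report two nodes, one per die. With Distributed, we would expect a single node.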

AMD's Threadripper introduces more cores to the desktop than we've ever seen, and some programs are caught ill-prepared. In fact, a few games, like Far Cry Primal and the DiRT series, won't even run when the full complement of Threadripper's threads is brought to bear. That's obviously a problem, so AMD created a Legacy Compatibility mode that executes a "bcdedit /set numproc XX" command in Windows to disable half of the processor's cores. Luckily, due to the operating system's core assignments, the command disables all of the cores/threads on the second die. That has the side benefit of eliminating thread-to-thread communication between disparate dies, which serves as a great solution to the constant synchronization between threads during most gaming workloads.

Because the change is made in software, the "disabled" die still has power fed to it, so the system can still access the memory and PCIe controllers connected to the inactive die.
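To make the "XX" concrete, here is a small sketch that builds the command strings without running them. It assumes XX is half of the logical processor count, which matches the behavior described above but is not confirmed by AMD's documentation. The helper name `legacy_mode_commands` is ours, and the revert command uses bcdedit's `/deletevalue` option.

```python
import os


def legacy_mode_commands(logical_cpus=None):
    """Build (but do not execute) the bcdedit commands for toggling the processor limit.

    Assumption: the limit is half of the logical processors, per the
    article's description of Legacy Compatibility mode.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    half = max(1, logical_cpus // 2)
    return {
        "enable": f"bcdedit /set numproc {half}",
        "disable": "bcdedit /deletevalue numproc",
    }


# A Threadripper 1950X exposes 32 logical processors with SMT enabled.
print(legacy_mode_commands(32)["enable"])
# prints: bcdedit /set numproc 16
```

Either command would need an elevated command prompt and a reboot to take effect, which is why the sketch only prints the strings.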

Game Mode And Creator Mode

So what do you do with all these knobs? There are four separate combinations that will impact each application or game differently, so you have to cycle through them to find the best possible combination for your workload. That's a godsend to tuners looking to squeeze out every last drop of performance, but an absolute nightmare for the other 99%.

AMD decided to simplify the process by specifying two combinations that should work best for either games or standard applications. Creator mode, which is the stock configuration, exposes the full might of 32 threads. It should naturally provide excellent performance for most productivity applications.

Game mode cuts half the threads via compatibility mode and reduces memory and die-to-die latency with the Local memory mode. We're going to test both configurations with our gaming suite, and try another configuration that also offers the full complement of threads.
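The two toggles yield four combinations, which can be enumerated as below. This is an illustrative sketch: the dictionary keys and the function name are ours, and we take Creator mode to pair the Distributed memory setting with legacy mode off, since it is distinct from the Local/SMT configuration we test separately.

```python
from itertools import product

MEMORY_MODES = ("Distributed (UMA)", "Local (NUMA)")
LEGACY_MODES = ("off", "on")

# Named combinations: AMD's two presets plus the extra one we test.
NAMED_CONFIGS = {
    ("Distributed (UMA)", "off"): "Creator Mode",
    ("Local (NUMA)", "on"): "Game Mode",
    ("Local (NUMA)", "off"): "Local/SMT",
}


def enumerate_configurations(total_threads=32):
    """List every memory-mode / legacy-mode combination with its active thread count."""
    configs = []
    for memory_mode, legacy in product(MEMORY_MODES, LEGACY_MODES):
        # Legacy Compatibility mode disables half of the logical processors.
        threads = total_threads // 2 if legacy == "on" else total_threads
        configs.append({
            "memory_mode": memory_mode,
            "legacy_mode": legacy,
            "threads": threads,
            "name": NAMED_CONFIGS.get((memory_mode, legacy), "(unnamed)"),
        })
    return configs


for config in enumerate_configurations():
    print(f'{config["name"]:<12} memory={config["memory_mode"]:<17} '
          f'legacy={config["legacy_mode"]:<3} threads={config["threads"]}')
```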

Infinity Fabric Latency Testing

Die-to-die communication adds another layer of latency to Ryzen’s complicated architecture. As you can see, those same latency metrics don’t apply to the earlier Ryzen models. They also present challenges to some applications, such as those with synchronized threads or frequent fetches from remote memory, but have less impact on others.

| Processor | Intra-Core Latency | Intra-CCX Core-to-Core Latency | Cross-CCX Core-to-Core Latency | Cross-CCX Average Latency | Die-to-Die Latency | Die-to-Die Average Latency | Average Transfer Bandwidth |
|---|---|---|---|---|---|---|---|
| TR 1950X Creator Mode DDR4-2666 | 13.7 - 14.1ns | 39.4 - 43.2ns | 157.6 - 171.3ns | 168ns | 180.6 - 256.7ns | 238.47ns | 90.26 GB/s |
| TR 1950X Creator Mode DDR4-3200 | 13.8 - 14.9ns | 39.2 - 45.4ns | 144.9 - 167.2ns | 160.1ns | 213.1 - 227.8ns | 216.9ns | 91.67 GB/s |
| TR 1950X Game Mode DDR4-2666 | 13.9 - 14.2ns | 39.5 - 42.3ns | 149.2 - 164.1ns | 159.66ns | X | X | 46.58 GB/s |
| TR 1950X Game Mode DDR4-3200 | 14.3 - 14.9ns | 41.2 - 46.2ns | 123 - 150.6ns | 145.44ns | X | X | 45.52 GB/s |
| TR 1950X Local/SMT DDR4-2666 | 13.9 - 14.4ns | 39.6 - 43.1ns | 168.7 - 175.4ns | 171.48ns | 232.4 - 240.8ns | 235.38ns | 92.7 GB/s |
| TR 1950X Local/SMT DDR4-3200 | 13.9 - 14.4ns | 39.9 - 44.5ns | 146.7 - 159.4ns | 153.89ns | 209.3 - 220.9ns | 212.53ns | 91 GB/s |
| Ryzen 7 1800X | 14.8ns | 40.5 - 82.8ns | 120.9 - 126.2ns | 122.96ns | X | X | 48.1 GB/s |
| Ryzen 5 1600X | 14.7 - 14.8ns | 40.6 - 82.8ns | 121.5 - 128.2ns | 123.48ns | X | X | 43.88 GB/s |

The intra-core latency measurements represent communication between two logical threads resident on the same physical core, and they're unaffected by memory speed. Intra-CCX measurements quantify latency between threads that are on the same CCX but not resident on the same core. In the past, we observed slight performance variances, but intra-CCX latency is also largely unaffected by memory speed. However, we've seen a large decrease in cross-CCX latency, which denotes latency between threads located on two separate CCXes, by increasing the memory data transfer rate from DDR4-1333 to DDR4-3200 on Ryzen 5 and 7 models.
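To make the four latency categories concrete, the sketch below classifies a pair of logical CPUs by the topology level that separates them. The numbering scheme is a hypothetical simplification (SMT siblings are adjacent, cores fill one CCX before the next, and CCXes fill one die before the next). Real operating systems may enumerate logical processors differently, so treat `locate` and `classify_pair` as illustrative names rather than a description of our test tool.

```python
THREADS_PER_CORE = 2
CORES_PER_CCX = 4
CCXES_PER_DIE = 2


def locate(logical_cpu):
    """Map a logical CPU index to (core, ccx, die) under the assumed numbering."""
    core = logical_cpu // THREADS_PER_CORE
    ccx = core // CORES_PER_CCX
    die = ccx // CCXES_PER_DIE
    return core, ccx, die


def classify_pair(cpu_a, cpu_b):
    """Name the latency category for communication between two logical CPUs."""
    if cpu_a == cpu_b:
        raise ValueError("need two distinct logical CPUs")
    core_a, ccx_a, die_a = locate(cpu_a)
    core_b, ccx_b, die_b = locate(cpu_b)
    if die_a != die_b:
        return "die-to-die"
    if ccx_a != ccx_b:
        return "cross-CCX"
    if core_a != core_b:
        return "intra-CCX"
    return "intra-core"


for pair in ((0, 1), (0, 2), (0, 8), (0, 16)):
    print(pair, classify_pair(*pair))
# prints:
# (0, 1) intra-core
# (0, 2) intra-CCX
# (0, 8) cross-CCX
# (0, 16) die-to-die
```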

The same general trend continues with Threadripper. As we can see, toggling game mode removes the die-to-die latency for threads by effectively disabling one die, but it also reduces host processing resources. It’s an interesting feature that will benefit some workloads, but hamstring others.

We also notice that the Local/SMT combination, which consists of the local setting and leaves all cores active (legacy off), offers the best overall latency improvement via memory overclocking. We also recorded higher Cross-CCX latency with the Threadripper processors.
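The memory scaling claim can be checked against the averages in the Threadripper table. The short sketch below computes the percentage drop in average latency when moving from DDR4-2666 to DDR4-3200 for each mode; the figures are copied from the table, and the variable and function names are ours.

```python
# (DDR4-2666, DDR4-3200) average latencies in nanoseconds, from the table above.
CROSS_CCX_AVG_NS = {
    "Creator Mode": (168.0, 160.1),
    "Game Mode": (159.66, 145.44),
    "Local/SMT": (171.48, 153.89),
}
DIE_TO_DIE_AVG_NS = {
    "Creator Mode": (238.47, 216.9),
    "Local/SMT": (235.38, 212.53),
}


def percent_reduction(before, after):
    """Percentage by which latency falls going from `before` to `after`."""
    return (before - after) / before * 100.0


for label, table in (("Cross-CCX", CROSS_CCX_AVG_NS), ("Die-to-die", DIE_TO_DIE_AVG_NS)):
    for mode, (slow, fast) in table.items():
        print(f"{label:<10} {mode:<12} {percent_reduction(slow, fast):.1f}% lower")
# prints:
# Cross-CCX  Creator Mode 4.7% lower
# Cross-CCX  Game Mode    8.9% lower
# Cross-CCX  Local/SMT    10.3% lower
# Die-to-die Creator Mode 9.0% lower
# Die-to-die Local/SMT    9.7% lower
```

Local/SMT posts the largest reduction in both categories, which is consistent with the observation above.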

| Processor | Intra-Core Latency | Core-to-Core Latency | Core-to-Core Average Latency | Average Transfer Bandwidth |
|---|---|---|---|---|
| Core i9-7900X | 14.5 - 16ns | 69.3 - 82.3ns | 75.56ns | 83.21 GB/s |
| Core i9-7900X @ 3200 MT/s | 16 - 16.1ns | 76.8 - 91.3ns | 83.93ns | 87.31 GB/s |
| Core i7-6950X | 13.5 - 15.4ns | 54.5 - 70.3ns | 64.64ns | 65.67 GB/s |
| Core i7-7700K | 14.7 - 14.9ns | 36.8 - 45.1ns | 42.63ns | 35.84 GB/s |

We are in the midst of a broader set of tests to quantify how these modes impact memory latency and bandwidth, among other factors. Stay tuned.


Comment from the forums
  • vMax
    Very good in depth review. Threadripper looks like a great productivity CPU especially for the price to performance... At last AMD are back at the big table and thats not to put Intel down as they still make the fastest CPU's overall...just you have pay that much more for the privilege, but finaly we have a real choice at all price points... Great for the consumer. Good job Tom's and AMD.
  • HEXiT
    so happy this has happend. intel wont be, but for amd, its great news. 95%+ the perfomance for half the price makes threadripper a very attractive cpu for virtual machines and even office environments that use terminals rather than every 1 using a desktop.
    so much potential for a £2k build that just wasnt there a year ago...
  • fla56
    where is XFR testing?

    surely it's clear by now that the worst way to overclock a Ryzen is to try and overclock the cores?
  • gferrin2012
    Hmm. I will first state I have no preference to AMD nor Intel. I can buy any CPU, or GPU that suits my needs. I looked over you "benches" and I will be very upfront. I do not believe yours. I think the "threadripper" benches a lot better than your are showing. Your test methods seem to favor Intel. I have long suspected and heard of Toms Hardware of being an Intel fanboy site. I have owned Intel and swore by them for years. In am considering AMD for the first time. I will continue to look at, what I believe, to be more honest test sites. Lets see how this plays out. I happen to have a friend who has an 1800x and and another associate that has an I7-7700k. I am familiar with Blenders Bmw benchmark. I would like an explanation as to why, in CPU rendering using blender 2.78c, the 1800x literally destroys the I7-7700k.
    If you look at my account (go ahead) here on Toms Hardware, you will notice I am not an AMD fanboy at all. When I start feeling I am getting biased reviews and have the sneaking suspicion of Intel slipping the 'ol kickback to you, it's time to delete my membership.
  • HEXiT
    i think the variance between these and other benches is the way they have done em... maybe they didnt set the cpu to productivity and just used gaming mode or vies-versa. but there is some discrepancy between what im seeing here and elsewhere. whats up toms. normally your stuff is accurate.