Apple A9 CPU And System Performance
The biggest change for Apple’s A9 SoC is the move from TSMC’s HKMG 20nm node to the new FinFET process technology. Using a 3D structure—where the silicon channel gets extended into a vertical wall or fin and then wrapped by the gate—decouples the transistor from the bulk of the silicon substrate. This allows the operating current to be controlled via three channel surfaces instead of one like in a 2D planar structure and also significantly reduces leakage current. The improved electrical characteristics allow 3D transistors to operate at a lower voltage, reducing dynamic power and heat, which the chip designer can take advantage of to improve battery life and/or performance by ramping up the clock frequency.
With all of its advantages, it’s no surprise to see the A9 jump to FinFET. What is surprising, however, is Apple using two different suppliers (a practice called dual sourcing). Dual sourcing is common among OEMs for many smartphone components, including displays, RAM, and NAND, but generally not for SoCs because of the additional engineering cost of taping out at two different foundries, each with its own libraries and quirks. We can only speculate as to why Apple spent the extra money going this route, but since this is the first generation of FinFET, perhaps Apple wanted to play it safe in case one or both suppliers encountered yield issues or otherwise could not meet Apple’s demand.
Ultimately, dual sourcing means there are two different variants of the A9: one made on Samsung’s 14nm LPE (Low Power Early) FinFET process and one on TSMC’s 16nm FinFET process. While the chip architecture and clock frequencies are the same for both, there are some minor differences. First, the Samsung die is a bit smaller, as seen above, thanks to its tighter gate pitch. More importantly, the two processes naturally have different voltage-frequency curves, meaning an A9 built on one process will consume more or less power at a given frequency than one built on the other. This leads to an obvious question: Which A9 variant is better?
While we cannot answer this question definitively due to our limited sample size, our testing shows that performance is identical between the two versions, with only a small difference in battery life. In tests that run the CPU or GPU at 100% for extended periods (an extreme scenario not seen in real-world use), our test unit with a Samsung-made A9 shows 3% to 11% better battery life. In a video playback test, where the CPU and GPU are mostly idle, our TSMC A9 sample shows a 2% to 3% battery life advantage, the same difference quoted by Apple. Based on these results, along with similar results obtained by other testers, we consider the differences between the two A9 versions a non-issue.
Regardless of foundry, the A9 still employs a dual-core 64-bit CPU compatible with the ARMv8-A architecture. In many respects, Apple’s custom-designed Twister CPU core resembles the Typhoon core in the A8 SoC. Based on AnandTech’s analysis, Twister has the same peak IPC (instructions per cycle) as Typhoon, capable of decoding, issuing, and executing up to six instructions per cycle, and retains a comparatively large out-of-order reorder buffer. The execution stage sees a few minor tweaks, though, including a one-cycle reduction in floating-point latency (a 20% improvement for multiplies and 25% for additions) and the ability to execute three FP32 floating-point multiply instructions in parallel instead of two.
Apple A9 (Samsung) Die: Modified by Tom’s Hardware [Image Source: Chipworks]
In addition to improving floating-point performance, Apple increases the L2 cache from 1MB to 3MB and changes the 4MB L3 cache from an inclusive design (the L3 retains a copy of the L2’s contents) to a victim cache (the L3 holds only data evicted from the L2). Cache latency also improves, thanks in part to the significant increase in clock frequency.
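The inclusive-versus-victim distinction can be illustrated with a toy Python model. To be clear, this is purely illustrative: the line counts, LRU replacement policy, and class structure below are our assumptions, not Apple’s actual cache design. The key property it shows is that a victim L3 holds only lines evicted from the L2, so the two levels store disjoint data and their capacities effectively add:

```python
from collections import OrderedDict

class VictimCacheModel:
    """Toy model of an L2 backed by a victim L3 (illustrative only).

    Lines enter the L3 only when evicted from the L2, so the two
    levels hold disjoint data and their capacities effectively add.
    """

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # address -> None, kept in LRU order
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        """Return which level served the access: 'L2', 'L3', or 'MEM'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)  # mark as most recently used
            return "L2"
        if addr in self.l3:
            del self.l3[addr]          # promote the line back into L2
            self._fill_l2(addr)
            return "L3"
        self._fill_l2(addr)            # miss: fetch from memory into L2
        return "MEM"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line...
            self.l3[victim] = None                   # ...into the victim L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        self.l2[addr] = None

cache = VictimCacheModel(l2_lines=2, l3_lines=2)
print(cache.access("A"))  # MEM
print(cache.access("B"))  # MEM
print(cache.access("C"))  # MEM: evicts A from L2 into the victim L3
print(cache.access("A"))  # L3: A was a victim, promoted back into L2
```

Note how line A, pushed out of the tiny L2 when C arrives, is later served from the victim L3 instead of going all the way to memory.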
One of AnandTech’s more interesting findings is a significant reduction in the branch mispredict penalty, anywhere from 33% to 50%. It’s not clear how this is achieved, but it’s likely a combination of the larger cache, lower cache latency, and more aggressive prefetching. Because real-world code is generally branch-heavy, this change could have a noticeable impact on user experience.
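A back-of-the-envelope model shows why this matters: the average cost mispredicts add per branch is simply the miss rate times the penalty, so cutting the penalty cuts that overhead proportionally. The miss rate and cycle counts below are illustrative assumptions, not measured A9 values:

```python
def avg_branch_overhead(miss_rate, penalty_cycles):
    """Average extra cycles per branch caused by mispredicts."""
    return miss_rate * penalty_cycles

# Illustrative assumptions only: 5% of branches mispredict, and the
# penalty drops by 50% (the upper end of AnandTech's 33-50% range).
before = avg_branch_overhead(0.05, 14)  # hypothetical A8-class penalty
after = avg_branch_overhead(0.05, 7)    # penalty halved
print(f"{before:.2f} -> {after:.2f} extra cycles per branch")  # 0.70 -> 0.35
```

Under these assumed numbers, per-branch overhead is halved, which on branch-heavy code translates directly into fewer wasted cycles.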
Along with the architectural tweaks, Apple takes full advantage of its move to FinFET, boosting CPU frequency from 1.4GHz to 1.85GHz, a 32.1% increase. Historically, Apple has been conservative with clock frequency in an effort to conserve power, relying on its wide architecture for performance. Now with FinFET, Apple gets to erase most of the clock frequency advantage held by its narrower-architecture competitors and really ramp up performance.
Like most current-generation flagship phones, the iPhone 6s and 6s Plus use LPDDR4 RAM, which consumes 35-40% less power than LPDDR3 (according to Samsung) and delivers greater bandwidth through higher operating frequencies.
| Geekbench 3 Pro Memory Bandwidth (Single-Core) | STREAM Copy (GB/s) | STREAM Scale (GB/s) | STREAM Add (GB/s) | STREAM Triad (GB/s) |
|---|---|---|---|---|
| iPhone 5s (A7) | 8.46 | 5.25 | 5.75 | 5.74 |
| iPhone 6 (A8) | 9.92 | 6.00 | 6.35 | 6.35 |
| iPhone 6 Plus (A8) | 9.47 | 5.82 | 6.16 | 6.16 |
| iPhone 6s (A9) | 14.00 | 9.51 | 10.55 | 10.60 |
| iPhone 6s Plus (A9) | 13.60 | 9.26 | 10.30 | 10.35 |
| A9 Advantage (based on 6/6s) | 41.1% | 58.5% | 66.1% | 66.9% |
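The advantage row can be recomputed directly from the iPhone 6 (A8) and iPhone 6s (A9) figures as a quick sanity check of the table:

```python
# Recompute the "A9 Advantage" row from the iPhone 6 and iPhone 6s
# results in the table above.
a8 = {"Copy": 9.92, "Scale": 6.00, "Add": 6.35, "Triad": 6.35}
a9 = {"Copy": 14.00, "Scale": 9.51, "Add": 10.55, "Triad": 10.60}

advantage = {t: round((a9[t] / a8[t] - 1) * 100, 1) for t in a8}
print(advantage)  # {'Copy': 41.1, 'Scale': 58.5, 'Add': 66.1, 'Triad': 66.9}
```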
The STREAM benchmark in Geekbench 3 gives us some insight into the evolution of memory performance over the last few generations. Both the A7 and A8 have the same size L2/L3 caches and use LPDDR3-933 (14.9GB/s) RAM, so it’s no surprise to see only a small increase in STREAM performance (roughly 8%), likely due to memory controller optimizations or lower RAM latencies. The STREAM Copy metric, which is the most indicative of memory bus performance because it simply copies the contents of one large array to another, shows that the LPDDR4-1600 (25.6GB/s) RAM gives the A9 a significant boost in throughput. The STREAM Scale, Add, and Triad tests all see some additional benefit from the reduced floating-point add/multiply latency. Throughput in the Scale test lags behind the Add and Triad tests in part because floating-point add instructions have one cycle less latency than the multiply instruction used exclusively in the Scale test.
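For reference, the four STREAM kernels are simple one-line array operations. The sketch below is only a functional illustration (real STREAM is C/Fortran code operating on arrays far larger than the last-level cache), but it makes clear why Scale depends on multiply latency while Add depends on add latency:

```python
def stream_kernels(a, b, q):
    """Functional sketch of the four STREAM kernels."""
    copy = [x for x in a]                      # Copy:  out[i] = a[i]
    scale = [q * x for x in a]                 # Scale: out[i] = q * a[i]
    add = [x + y for x, y in zip(a, b)]        # Add:   out[i] = a[i] + b[i]
    triad = [x + q * y for x, y in zip(a, b)]  # Triad: out[i] = a[i] + q * b[i]
    return copy, scale, add, triad

copy, scale, add, triad = stream_kernels([1.0, 2.0], [10.0, 20.0], q=3.0)
print(triad)  # [31.0, 62.0]
```

Copy touches no arithmetic at all, which is why it tracks raw memory bus throughput most closely.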
It’s interesting to see the iPhone 6s and 6s Plus deliver more throughput here than the Galaxy S6, another phone that uses LPDDR4 memory. We cannot extrapolate these results to general memory performance, however, because the STREAM tests only use sequential data structures. Based on the results we’ve collected in other tests, we know that the memory controllers used in Apple’s A7 and A8 both perform well with serial access patterns and poorly when dealing with more random memory access. In contrast, the Exynos 7420 used in Samsung’s Galaxy S6 is more of a generalist and is not optimized for any specific type of memory access. What the STREAM data tells us then is that the A9’s memory controller continues Apple’s tradition of optimizing for serial memory access, perhaps to boost performance when reading or writing images (think burst performance), video, and GPU textures, or just writing to the frame buffer.
To better understand how the STREAM benchmark included in Geekbench works, or to learn more about our testing methodology, please read our article about how we test mobile device system performance.
Apple’s A8 SoC delivers the best single-core integer and floating-point performance of any current flagship phone. Well, except for the A9 in the iPhone 6s. The fact that the A8’s Typhoon CPU core, used in last year’s iPhone 6 and running at only 1.4GHz with LPDDR3 RAM, is still faster than anything Samsung, Qualcomm, MediaTek, etc. produce is impressive.
Even more impressive are the 55% and 60% gains in integer and floating-point performance, respectively, by the latest A9 SoC. The 32.1% increase in CPU frequency explains most of the A9’s performance boost, but there’s clearly more going on here. As expected, floating-point performance sees a larger increase, relative to the A8, than integer performance because of the architectural improvements to the floating-point units, although the difference is small. It appears that both integer and floating-point performance benefit significantly from the larger, lower-latency L2 cache, and, to a lesser extent, the LPDDR4 memory, since not all of the Geekbench data sets fit within cache.
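One way to separate the two contributions is to divide the overall gain by the clock ratio. This assumes the factors compose multiplicatively, which is only a rough model since cache and memory effects do not scale cleanly with clock:

```python
# Rough decomposition: overall speedup = clock ratio x per-clock gain.
# Assumes the factors compose multiplicatively, an approximation that
# breaks down for memory-sensitive workloads.
clock_ratio = 1.85 / 1.4        # A9 vs. A8 CPU frequency
int_gain, fp_gain = 1.55, 1.60  # overall Geekbench speedups cited above

print(f"clock alone: +{(clock_ratio - 1) * 100:.1f}%")  # +32.1%
print(f"per-clock integer: +{(int_gain / clock_ratio - 1) * 100:.1f}%")
print(f"per-clock FP: +{(fp_gain / clock_ratio - 1) * 100:.1f}%")
```

Under this model, roughly 17% (integer) to 21% (floating-point) of the gain comes from per-clock improvements such as the larger L2, lower latencies, and faster FP units, consistent with the architectural tweaks described above.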
Despite having only two CPU cores, versus six to eight cores for the non-Apple phones in these charts, the A9 still posts the best multi-core floating-point performance. It does fall behind the Exynos 7420 used in the Galaxy S6 devices and the Snapdragon 810 v2.1 in the OnePlus 2 in integer performance, however. Still, the A9’s multi-core integer performance is good enough to outpace the Snapdragon 808 (which has two fewer Cortex-A57 cores than the 810) used in the LG G4 and Moto X Pure Edition. It also beats the older revision Snapdragon 810 SoC used in HTC’s One M9, which falls back to the lower-performing Cortex-A53 cores due to thermal constraints.
The difference between the iPhone 5s and 6s serves as a good example of just how quickly mobile systems are advancing. In just two generations, Apple has nearly doubled performance in the Basemark OS II System test, which performs a number of subtests to gauge CPU and memory performance. In the OpenGL ES 2.0 based Graphics test, the iPhone 6s is 324% faster than the 5s! Imagine the possibilities if desktop systems could advance at a similar rate.
Going back just one generation, the iPhone 6s is still 43% faster than the iPhone 6 in the System test, 18% faster in the Web test, and 81% faster in the Graphics test (the 6s Plus is only 69% faster than the 6 Plus, since the GPU was clocked higher on the 6 Plus).
Ignoring the broken Memory test, the new iPhones deliver noticeably better performance here. In the System test, the iPhone 6s is 43% faster than the Galaxy S6, currently the fastest Android phone when it comes to CPU/memory performance, and it’s 33% faster than the Qualcomm Adreno 430 in HTC’s One M9.
When discussing system performance, you cannot forget the importance of internal storage. Back when the Galaxy S6 was released, we praised Samsung for being the first to implement a UFS 2.0 based storage solution, improving NAND performance over the traditional eMMC interface. Now it appears that Samsung’s moment of glory has passed. According to AnandTech’s sleuthing, Apple includes a custom NVMe-based PCI-E storage controller in the new iPhones, which boosts sequential read and write performance significantly. We’re not talking laptop/desktop-class SSD performance here; this solution still has to work within the power and size limits of a smartphone. But this is yet another move that closes the gap between phones and laptops just a little more.
Unfortunately, we do not have any iOS benchmarks that gauge performance in real-world usage scenarios like we do on the Android platform. One thing that’s easy to measure, and something we do quite often, is app launch time. NAND performance is the obvious bottleneck in this scenario, but CPU and RAM performance come into play too. In the charts above, we show the time to launch several common apps as well as how long it takes to boot the OS. Since the iPhones all run the same OS and apps, we can draw some conclusions about hardware performance. Once we move beyond Apple hardware, however, this becomes impossible, since we’re dealing with completely different software. These comparisons are still useful, though, as a measure of the performance difference a user actually perceives.
The cumulative time to load all apps on the iPhone 6s and 6s Plus is 16% lower than on the previous generation. Not a big improvement, but a noticeable one, particularly when opening Maps. Given the large increase in NAND, CPU, and memory performance, it’s disappointing not to see larger gains here.
Casting a wider net, we see that Samsung’s Galaxy S6 devices, along with LG’s G4, offer a better experience when launching apps, although the iPhone 6s gets you into the camera the fastest. The iPhones are generally quick to boot too, but for some reason the 6s Plus consistently takes longer to boot than the 6s.
Based on the data above, Apple’s tweaked Twister CPU in the A9 is clearly the fastest mobile CPU available today. That’s not really surprising considering that Apple’s competitors still cannot match the A8 SoC’s performance in some tests. The A9 benefits most from the move to FinFET, which enables a higher CPU frequency, a larger L2 cache, and lower cache latencies. Combining these benefits with higher-bandwidth LPDDR4 RAM and Apple’s custom NVMe-based PCI-E storage controller makes the iPhone 6s the smartphone performance king, a title it should hold for the foreseeable future (at least in CPU performance), since it’s unlikely that any SoC based on ARM’s Cortex-A72 CPU, which is optimized more for power efficiency than peak performance, will surpass the A9 in the coming year. Even Qualcomm’s custom 64-bit Kryo CPU cannot touch the A9’s integer and floating-point performance, based on the results from our Snapdragon 820 Performance Preview.