A Three-Level Cache Hierarchy
The memory hierarchy of Conroe was extremely simple, and Intel was able to concentrate on the performance of the shared L2 cache, which was the best solution for an architecture aimed mostly at dual-core implementations. But with Nehalem, the engineers started from scratch and came to the same conclusion as their competitors: a shared L2 cache is not suited to a native quad-core architecture. The cores would too frequently evict data needed by another core, and it would have been very difficult to design internal buses and arbitration that give all four cores sufficient bandwidth while keeping latency low. To solve the problem, the engineers gave each core a Level 2 cache of its own. Since it’s dedicated to a single core and relatively small (256 KB), they were able to make it very fast; latency, in particular, has reportedly improved significantly over Penryn, from 15 cycles to approximately 10 cycles.
Then comes an enormous Level 3 cache (8 MB) that manages communication between cores. While at first glance Nehalem’s cache hierarchy is reminiscent of Barcelona’s, the operation of the Level 3 cache is very different from AMD’s: it is inclusive of all lower levels of the cache hierarchy. That means that if a core tries to access a data item and it’s not present in the Level 3 cache, there’s no need to look in the other cores’ private caches, because the data item won’t be there either. Conversely, if the data item is present, four bits associated with each cache line (one bit per core) show whether it is potentially present (potentially, but not with certainty) in the private cache of another core, and which one.
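The lookup logic described above can be sketched in a few lines of Python. This is only an illustrative model, not Intel’s implementation: the class `InclusiveL3`, its methods, and the dictionary-based storage are all hypothetical names chosen for the example.

```python
# Minimal model of an inclusive L3 with one "core valid" bit per core.
# All names are hypothetical; this is a sketch, not Intel's design.

NUM_CORES = 4


class InclusiveL3:
    def __init__(self):
        # Maps a line address to a 4-bit mask, one bit per core.
        # A set bit means "this core MAY hold the line in its L1/L2".
        self.lines = {}

    def fill(self, addr, core):
        """Core `core` loads line `addr`; inclusion places it in the L3 too."""
        self.lines[addr] = self.lines.get(addr, 0) | (1 << core)

    def cores_to_snoop(self, addr, requester):
        """Return the other cores whose private caches must be checked."""
        if addr not in self.lines:
            # L3 miss: inclusion guarantees no private cache holds the line.
            return []
        mask = self.lines[addr]
        return [c for c in range(NUM_CORES)
                if c != requester and mask & (1 << c)]


l3 = InclusiveL3()
l3.fill(0x1000, core=2)
print(l3.cores_to_snoop(0x1000, requester=0))  # [2]
print(l3.cores_to_snoop(0x2000, requester=0))  # []
```

In this model a set bit is only a hint: the sketch never clears bits when a private cache drops a line, which is one plausible reason the article says the data is "potentially, but not with certainty" present.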
This technique is effective for ensuring the coherency of the private caches because it limits the need for exchanges between cores. Its disadvantage is that part of the L3 cache is wasted on data that is already held in other cache levels. That’s somewhat mitigated, however, by the fact that the L1 and L2 caches are small compared to the L3 cache: all the data in the L1 and L2 caches takes up a maximum of 1.25 MB of the 8 MB available. As on Barcelona, the Level 3 cache doesn’t operate at the same frequency as the rest of the chip. Consequently, access latency at this level is variable, but it should be in the neighborhood of 40 cycles.
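The 1.25 MB figure can be checked with a short calculation. The 256 KB L2 and 8 MB L3 sizes come from the article; the L1 split of 32 KB instruction plus 32 KB data per core is an assumption not stated on this page, and the function name `private_cache_bytes` is made up for the example.

```python
KB = 1024
MB = 1024 * KB


def private_cache_bytes(cores, l1_instr, l1_data, l2):
    """Total capacity of all private (L1 + L2) caches, in bytes."""
    return cores * (l1_instr + l1_data + l2)


# Assumed: 4 cores, each with 32 KB L1I + 32 KB L1D (not stated in the text).
# From the article: 256 KB L2 per core, 8 MB shared L3.
duplicated = private_cache_bytes(cores=4, l1_instr=32 * KB,
                                 l1_data=32 * KB, l2=256 * KB)
l3_size = 8 * MB

print(duplicated / MB)       # 1.25
print(duplicated / l3_size)  # 0.15625
```

Under these assumptions, at most about 16% of the L3 capacity holds copies of lines that also sit in a private cache.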
The only real disappointment with Nehalem’s new cache hierarchy is its L1 cache. The bandwidth of the instruction cache hasn’t been increased: it’s still 16 bytes per cycle, compared to 32 on Barcelona. This could create a bottleneck in a server-oriented architecture, since 64-bit instructions are larger than 32-bit ones, especially as Nehalem has one more decoder than Barcelona, which puts that much more pressure on the cache. As for the data cache, its latency has increased to four cycles, compared to three on Conroe, which facilitates higher clock frequencies. To end on a positive note, though, Intel’s engineers have increased the number of Level 1 data cache misses that the architecture can process in parallel.
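To get a rough feel for the instruction-fetch concern, one can compute the average instruction length above which the fetch bandwidth can no longer keep every decoder busy. The fetch widths (16 and 32 bytes per cycle) come from the article; the decoder counts of four for Nehalem and three for Barcelona are assumptions (the text only says Nehalem has one more), and the model deliberately ignores alignment, instruction fusion, and any buffering between fetch and decode.

```python
def max_avg_instr_bytes(fetch_bytes_per_cycle, decoders):
    """Largest average instruction size (bytes) at which fetch
    can still supply one instruction per decoder per cycle."""
    return fetch_bytes_per_cycle / decoders


# Assumed decoder counts: 4 (Nehalem) and 3 (Barcelona).
nehalem = max_avg_instr_bytes(16, 4)
barcelona = max_avg_instr_bytes(32, 3)

print(nehalem)              # 4.0
print(round(barcelona, 2))  # 10.67
```

In this simplified view, Nehalem’s fetch stage becomes the limit once instructions average more than 4 bytes, while Barcelona has headroom up to roughly 10.7 bytes, which is why longer 64-bit encodings are the worry raised above.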


While this will undoubtedly create a whole new level of performance, I imagine it will be prohibitively expensive, coming in just as the global economy hits a trough.
For this reason I think AMD has a brighter future when it releases its new 45nm cores. They will provide a good performance increase and, I am willing to bet, will still trump Intel on the price/performance scale.
Fantastic article, very insightful.
First off, I have not read the entire article but I just want to comment on the name.
I've been saying this since they announced the design of Nehalem: it's Intel's take on AMD's design, which means you're getting the best of both companies, as AMD's designs have been so much better than Intel's but AMD could not challenge what Intel already had.
It's been a long time coming for Intel to adopt AMD's designs but I really do look forward to the release (Well 6 months after when I might be able to afford a Core i7 system!), but feel AMD really needs to pull something out the hat to compete.
Anyways, from what I have read, it's a good article lol.
good...progress!
btw, where are the 8-core systems we were promised for 2008?
..and where are all the recompiled apps to take advantage of all this processing parallelism?!
p.s. stuff and nonsense: http://www.eupeople.net/forum
My credit card is restless...
just hope the bank is still around to honour your credit card...
Now that's more like it!! A well informed article, that is well written and imparts some useful information... More of the same please THG!!
I'm just off to sell those AMD shares...
Bob
While the article is sound, it did upset me that the first two pages talk about the 'Conroe' architecture. 'Core 2' is the name of the architecture used in the Conroe line of processors. 'Conroe' is the name given to the first desktop iteration of the Core 2 architecture, just as Allendale is the value version and Kentsfield the quad-core version (along with all the new iterations that utilize different cache sizes or manufacturing processes).
It is difficult to inspire confidence in your readers when such obvious mistakes are apparent.
Jammydodger: I think the usage may be a little off, but to say the Conroe architecture just means the uarch used by the Conroe chips, which is common to all chips of the generation. Also, the architecture was referred to by the code name Merom. Core 2 is a retail brand name. Either way, this is a minor mistake and not something that would make me doubt the validity of the article.
at last a quality oriented article!!!
Complete and detailed i want to see more in the future!
KingGreatYat: I do realise that I could be seen to be splitting hairs, but when an article goes into such detail about an upcoming processor architecture but begins by failing to recognise the distinction between an architecture and a core, then it does raise the question of whether the writer has fully understood what it is that he is trying to impart to us. If I were to begin an article by talking about Intel's 'Northwood' architecture then I would be talking nonsense; Northwood was a chip based around Intel's 'Netburst' architecture. The Merom is, as far as I am aware, the first mobile variant of the Core 2 architecture; it was preceded by the Yonah, based on Intel's 'Core' architecture, which was itself based on the 'P6' architecture.
Competition is good for the market; end users like us have many choices to pick from, AMD or Intel. I agree with americanbrian. Let's wait for the counterattack from AMD with the latest technologies and, of course, the lowest price.
[quote=Article]Intel says the problem is solved now, but provides no details on the operation of the new prefetch algorithms[/quote]
Something tells me this is going to be pivotal if Deneb proves to be any good...
Great article, by the way, minor niggles aside!