Reading And Decoding Instructions
Unlike the changes made in moving from Core to Core 2, Intel hasn’t done much to Nehalem’s front end. It has the same four decoders that made their appearance with the Conroe—three simple and one complex. It is still capable of macro-ops fusion and so offers a theoretical maximum throughput of 4+1 x86 instructions per cycle.
Though there are no revolutionary changes at first glance, the devil is in the details. As we noted in our article on the Barcelona architecture, increasing the number of processing units is an extremely inefficient way to boost performance. The cost is high and the gains shrink more and more with each addition, following the law of diminishing returns. So instead of adding a new decoder, the engineers concentrated on making the existing ones more efficient.
First, they added support for macro-ops fusion in 64-bit mode, which is justified for an architecture like Nehalem that makes no attempt to hide its ambitions in the server market segment. But the engineers didn’t stop there. Where the Conroe architecture could fuse only a limited number of instructions, the Nehalem architecture supports a greater number of variations, making it possible to use macro-ops fusion more frequently.
Another new feature introduced by the Conroe has also been improved: the Loop Stream Detector. Behind this name lies what is in fact a data buffer that holds a few instructions (18 x86 instructions on Core 2s). When the processor detects a loop, it disables certain parts of the pipeline. Since a loop consists of executing the same instructions a given number of times, it’s not necessary to perform branch prediction or to recover the instruction from the L1 cache at each iteration of the loop. So the Loop Stream Detector acts as a small cache memory that short-circuits the first stages of the pipeline in such situations. The gains made via this technique are twofold: it decreases power consumption by avoiding useless tasks and it improves performance by reducing the pressure on the L1 instruction cache.
With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. The Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops. In this sense, it’s similar to the Pentium 4’s trace cache concept. It’s no surprise to find certain innovations ushered in by that architecture in the Nehalem, given that the Hillsboro team now in charge of the Nehalem was responsible for the Pentium 4 project. However, where the Pentium 4 used the trace cache exclusively, since it could only count on one decoder in case of a data cache miss, Nehalem has the benefit of the power of its four decoders, while the Loop Stream Detector is only an additional optimization for certain situations. In a way, it’s the best of both worlds.


While undoubtedly this will create a whole new level of performance. I imagine it will be prohibitively expensive. Coming in just as the global economy hits a trough.
For this reason I think AMD has a brighter future when it releases it's new 45nm cores. They will provide a good performance increase and I am willing to bet will still trump intel on the price/performace scale.
Fantastic article, very insightful.
First off, I have not read the entire article but I just want to comment on the name.
I've been saying this since they announced the design of Nehalem, its Intels take on AMD design, which means your getting the best of both companies as AMD designs have been so much better than Intel but AMD could not challenge what Intel already had.
It's been a long time coming for Intel to adopt AMD's designs but I really do look forward to the release (Well 6 months after when I might be able to afford a Core i7 system!), but feel AMD really needs to pull something out the hat to compete.
Anyways, from what I have read, its a good article lol.
good...progress!
btw, where's the 8-core systems we were promised for 2008?
..and where's all the re complied apps to take advantage of all this processing parallelism?!
p.s. stuff and nonsense: http://www.eupeople.net/forum
My credit card is restless...
just hope the bank is still around to honour your credit card...
Now that's more like it!! A well informed article, that is well written and imparts some useful information... More of the same please THG!!
I'm just off to sell those AMD shares...
Bob
While the article is sound, it did upset me the at the first two pages talk about the 'Conroe' architecture. 'Core 2' is the name of the architecture used in the Conroe line of processors. 'Conroe' is the name given to the first desktop iteration of the core2 architecture, just as Allendale is the value version and Kentsfield the quad core version (along with all the new iterations that utilize different cache sizes or manufacturing process).
It is difficult to inspire confidence in your readers when such obvious mistakes are apparent.
Jammydodger : I think the usage may be a little off, but to say the conroe architecture, just means the uarch used by the conroe chips - which is in common with all chips of the generation. Also, the architecture was refered to by the code name Merom . Core 2 is a retail brand name. Either way, this is a minor mistake and not something that would make me doubt the validity of the article.
at last a quality oriented article!!!
Complete and detailed i want to see more in the future!
KingGreatYat: I do realise that I could be seen to be splitting hairs, but when an article goes in to such detail about an upcoming processor architecture but begins the article by failing to recognise the distinction between an architecture and a core then it does raise the question of whether the writer has fully understood what it is that he is trying to impart upon us. If I were to begin an article by talking about intel's 'Northwood' architecture then I would be talking non-sense, Northwood was a chip based around Intel's 'Netburst' architecture. The Merom is, as far as I am aware, the first mobile variant of the Core2 architecture, it was proceeded by the Yonah based on Intel's 'Core' architecture, which was itself based on the 'P6' architecture.
competition is good for the market,end user like us have many choises to pick,AMD or intel.i agreed with americanbrian.lets wait the counter attack from AMD with the lates technologies n off course with lowest price.
[quote=Article]Intel says the problem is solved now, but provides no details on the operation of the new prefetch algorithms[/quote]
Something tells me this is going to be pivotal if Deneb proves to be any good...
Great article, by the way, minor niggles aside!