Intel is FIGHTING BACK – Lunar Lake + Arrow Lake Explained – Everything we know about 15th Gen

Andrew McDonald | 21st June 2024 | Intel

Strap yourselves in because this is going to be a hefty deep dive into everything we know about Intel’s latest lineup of CPUs, and their next gen GPUs too – and there is a lot to talk about. Let’s start with what “Lunar Lake” and “Arrow Lake” mean. Lunar Lake is what Intel has talked the most about, and that’s the next generation of mobile CPUs. It seems like they’ll be 4P 4E core designs at most, so these are likely to be the mid to low end chips, rather than the next set you’ll find in gaming laptops. Arrow Lake is the next generation of desktop CPUs, and that’s the one Intel hasn’t talked as much about, although Lunar Lake and Arrow Lake are set to be rather similar in their designs, so talking about Lunar Lake should give us a pretty good picture of what’s to come on the desktop. One interesting thing is that the current generation, codenamed Meteor Lake, never came to the desktop. I made a video talking through the frankly major changes Intel made with Meteor Lake, and my expectation was that we’d see those changes on the desktop front too, but instead Intel re-released the same chips on the desktop with new names, and held fire for the further changes we can see in Lunar Lake. So, what’s new? Well, a hell of a lot…

Starting just with the packaging, there are some substantial changes to the design already – and not just from the usual setup, but from Meteor Lake designs too. The biggest change by far is that Intel is building system RAM onto the CPU package – specifically up to 32GB of LPDDR5X RAM will be built right onto the chip. This is quite the departure from the other news we saw at Computex with the introduction of the CAMM2 memory module standard. The main benefit there is finally having upgradable RAM on thin and light notebooks – but it seems like Intel is dead set against that, as they will include two LPDDR5X packages right on the chip itself. This should cut access latency, being so damn close, at the cost of upgradability later. Their main focus for this seems to be power consumption – moving the memory physically closer means you don’t need memory controllers that are as power hungry as the ones that talk to off-chip RAM. They claim 40% lower power, plus 250mm² of space saved on the motherboard, so quite the difference. It seems like that means there won’t even be the option of having extra external RAM either, which is a shame.

One of the other changes is who makes the chips. Intel has been on a mission to make their sub 14 nanometer processes work since, what, 11th gen? They’ve been using what they call “Intel 7” since Alder Lake (that’s 12th gen), which is what they renamed “10nm Enhanced SuperFin” to as they say that their process node has the same density as the 7nm nodes from TSMC and Samsung. Realistically the naming schemes for these process nodes have long since detached from the actual feature size so I can’t fault them for aligning themselves with the rest of the foundry market. Regardless, with Meteor Lake, they were very excited to tell us that they will be using their even more advanced Intel 4 process node (formerly known as 7nm) for the compute tile. This time though, both the compute tile, and the platform tile, are made by TSMC. Specifically their N3B node for the compute tile and N6 for the platform controller tile. Intel is still making the base tile and doing the assembly, but the fact that TSMC are making the two actual logic parts of the chip is really quite surprising. As pure speculation, I wonder if that means Intel is struggling to get the yields they need on their own nodes, or scaling up production enough. Either way, that’s quite interesting!

The cores themselves have also been overhauled – both the P cores and E cores are new, and this is what we can expect to see on the desktop too. I’ll start with the performance cores – these are the new Lion Cove architecture, which is significantly different from the Golden Cove designs we’ve seen on the desktop, and from the newer Redwood Cove design found in Meteor Lake chips, with the biggest change by far being the death of Hyper-Threading. Intel first introduced Hyper-Threading back in 2002, and now in 2024 it’s sending it to its grave – all in pursuit of efficiency. Hyper-Threading, or simultaneous multi-threading (SMT), basically lets a single physical core have two queues of work lined up – two threads. The benefit here is that while the core would otherwise sit idle waiting for, say, a call to memory, it can start working on the second queue’s instructions and be idle for less time, meaning you effectively get more work done in the same time. Intel claims this can be up to 30% more instructions per clock, or IPC, while only needing 20% more power. With those sorts of improvements, it seems crazy to want to remove that then, doesn’t it?
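To put those two figures side by side, here’s a quick back-of-envelope sketch. It assumes both percentages are measured against the same core with SMT switched off, which Intel’s claim doesn’t spell out, and it treats the “up to 30%” as if it were achieved in full. The function name is just mine for illustration.

```python
# Back-of-envelope check using only the two figures quoted above.
# Assumption: both gains are relative to the same core running without SMT.

def perf_per_watt_ratio(perf_gain: float, power_gain: float) -> float:
    """Relative performance-per-watt, given fractional gains in perf and power."""
    return (1.0 + perf_gain) / (1.0 + power_gain)


# "Up to 30% more IPC" for "20% more power", taken at face value.
smt_ratio = perf_per_watt_ratio(perf_gain=0.30, power_gain=0.20)
print(f"SMT perf/W relative to no SMT: {smt_ratio:.3f}x")
```

On those numbers SMT is a best-case efficiency win of roughly 8%, which is a much smaller margin than the headline 30% suggests.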

Well, Intel doesn’t think so. The main thing here is that the new Lion Cove cores don’t just have Hyper-Threading disabled, they’ve had all the SMT components removed, meaning Intel can physically shrink the cores by 10% and optimise the power consumption, so you get similar performance for less power. Intel claims 5% more performance per watt, essentially, although you do get 15% less performance per area. The big reason for this is that these chips are a hybrid architecture. With the current OS task scheduler, the second (hyper-threaded) thread on each P core was assigned tasks last, only once the primary P core threads and all the E cores were saturated. That means those threads don’t get as much use as you might think, and for the mobile workloads these chips are designed for, it makes sense to prioritise area and power consumption over outright performance.

Actually, speaking of power consumption, Intel has gone wild on the AI hype train, claiming the new P cores literally have “AI BUILT IN” with a new “AI self-tuning controller”. Now if I’m being honest I can’t quite work out how an actual neural network is involved here, but the core idea is pretty simple. Previously the thermal limits, and the controls like voltage and frequency, have all been statically set on every chip based on testing Intel does in its labs. These Lion Cove cores have more dynamic control over those values, meaning each individual chip can basically run closer to the edge and work out for itself what does and doesn’t work best. It’s a decent idea, it’s just a shame they had to slap the word “AI” on it… One way that controller can now work better is in the frequency department, where Intel has pushed for more fine-grained clock control. Instead of only being able to set 100MHz intervals – i.e. 2.9GHz, 3GHz, 3.1GHz – they can now control the cores in 16.67MHz intervals, allowing for ever-so-slightly more performance within a finer band.
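Here’s a rough sketch of why the finer steps matter. It assumes the 16.67MHz figure is really 100MHz divided by six, and that the controller rounds down to the highest allowed step under whatever clock it decides the chip can sustain. Neither detail is confirmed by Intel, and the 3060MHz target is a number I made up.

```python
def snap_down_mhz(target_mhz: int, steps_per_100mhz: int) -> float:
    """Highest allowed clock at or below target_mhz.

    steps_per_100mhz=1 models the old 100MHz granularity.
    steps_per_100mhz=6 models ~16.67MHz steps (assumed to be 100/6).
    """
    # Integer maths here avoids floating point drift from dividing by 16.67.
    whole_steps = target_mhz * steps_per_100mhz // 100
    return whole_steps * 100 / steps_per_100mhz


target = 3060  # hypothetical clock the controller thinks this chip can hold
coarse = snap_down_mhz(target, 1)
fine = snap_down_mhz(target, 6)

print(f"100MHz steps:   {coarse:.0f} MHz")
print(f"16.67MHz steps: {fine:.0f} MHz")
print(f"Gained:         {fine - coarse:.0f} MHz")
```

With coarse steps the core has to drop all the way to 3000MHz, while the finer steps let it sit at 3050MHz. That is the “ever-so-slightly more performance” in practice.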

As for the core design itself, that is, as they say, a complete “overhaul of the uArch”. Seriously, this is massive. The biggest thing is that the vector and integer pipelines are now completely split with their own schedulers – this is a big deal as it means you should get more throughput as each type of operation can execute independently, rather than having a single monolithic scheduler that has to try and handle both sets of uops. There are also inherent power savings, as if only one side is executing, only that side needs to be active, rather than the whole block staying active. The front end is new too, with up to eight times wider branch prediction blocks, so the fetch side is faster than the backend logic can handle and the core should never be waiting around for instructions. The uop cache and queue grew too, again keeping the logic fed and the idle time down – which is especially important now there is no Hyper-Threading! The other fairly major change is to the memory subsystem, and specifically cache. Intel has introduced an intermediary cache that sits between the old L1 and L2 – the old L1 has been renamed L0, the new cache takes the L1 name, and L2 stays L2 – that tries to hit the ‘best-of-both-worlds’ with a 9 cycle access latency, while still being physically larger than L0 at 192KB and about as free bandwidth-wise as the regular L2 cache. They have actually sped up the now-L0 cache access time by a cycle too, and they’ve increased the L2 cache per core from 2MB to 2.5MB on Lunar Lake or 3MB on Arrow Lake, meaning a core should spend less time hitting up off-core cache or memory, which should in turn increase IPC.
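To show why an extra cache level can help, here’s a toy average-latency model. Only the 9 cycle figure for the new intermediate cache comes from Intel’s material. The other latencies (5 and 4 cycles for the first level, 16 and 17 for L2, 80 for anything further out) and every hit rate are placeholder assumptions I picked so that both setups miss out of L2 equally often.

```python
def average_latency(levels, miss_penalty):
    """Average cycles per load for a simple cache hierarchy.

    levels: list of (hit_rate, latency_cycles), nearest cache first.
            hit_rate is the share of accesses reaching that level that hit.
    miss_penalty: cycles for anything that misses every listed level.
    """
    total = 0.0
    reach = 1.0  # fraction of all loads that get as far as this level
    for hit_rate, latency in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * miss_penalty


# Old layout: first-level cache, then straight to L2 (assumed numbers).
old = average_latency([(0.90, 5), (0.80, 16)], miss_penalty=80)

# New layout: faster first level, the 9 cycle intermediate cache, then L2.
new = average_latency([(0.90, 4), (0.50, 9), (0.60, 17)], miss_penalty=80)

print(f"old hierarchy: {old:.2f} cycles per load")
print(f"new hierarchy: {new:.2f} cycles per load")
```

The absolute numbers mean nothing here, but the shape of the result is the point: catching some first-level misses at 9 cycles instead of sending them all to L2 pulls the average down, even if L2 itself gets a touch slower.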

There are lots of other changes to the core micro-architecture, but one thing that caught my eye was this slide, talking about how Intel actually designs these cores. In their presentation they let the key thing slip, which is that they’ve finally moved from proprietary designs, processes and tools, to industry standard tools, and are moving from designing each functional block manually to using software-generated blocks that are not only easier to design and work with, but are process-node-agnostic. That’s a big deal, because Intel is now not the one actually making their own chips – at least for this current design. The ability to design something and easily transplant it between Intel 4, TSMC N3 and even Samsung 5nm is a huge step for Intel, and should mean we will see more of this mix-and-match type chip design in the future.

Finally for the P cores, we should talk about the performance. Intel claims this new Lion Cove core will offer 14% IPC uplift on average compared to the cores found in Meteor Lake chips – although since Intel wasn’t exactly forthcoming about how much faster those cores were compared to the Golden Cove designs found in today’s desktop chips, we’ll have to assume this is perhaps a touch conservative when comparing desktop to desktop. The main thing they want to shout about though is the efficiency – with around 18% more performance at the same power level on the low end, and still more than 10% more performance at the higher end. That’s impressive!

Right, that’s the P cores, now let’s talk E cores. The new design is called “Skymont” and is a bit of a mix between the ‘full fat’ E cores from Alder Lake/Raptor Lake and the Low Power Island E cores from Meteor Lake. The main goal here seems to be to “increase workload coverage” – as in to make it so you need the P cores for as few tasks as possible. The new E core cluster is actually kind of modular. In Lunar Lake they act as a low power island – isolated from the P cores, acting as an independent cluster of 4 cores. But, when it comes to sticking these in a desktop CPU, in Arrow Lake, they are still able to run on the ring in conjunction with the P cores, potentially with multiple E core groups as we’ve seen on the desktop chips previously. In Lunar Lake, with them being an “island”, I think that means they don’t have access to L3 cache, which might slow down the move between E and P cores, but with these chips efficiency is the word of the day so that doesn’t matter as much, and when it comes to the desktop chips it won’t matter at all as they aren’t in the “island” format.

The core design itself is pretty new – it’s larger, both in front and back end, with now 4MB of shared L2 cache between 4 E cores, which means 68% more floating point IPC performance, and 38% more integer IPC performance – compared to the low power island E cores in Meteor Lake anyway. While that technically is a fair comparison – both are low power island E cores – it isn’t a wholly fair comparison, as these are really regular E cores that just happen to be in an island config. You can see that on the power graph, where while there’s no doubt these new Skymont cores rip by comparison, it isn’t exactly a fair fight. Still, offering 70% more performance for the same power is great to see.

One of the most surprising stats Intel provided was a comparison to the Raptor Cove P cores – the ones found in the 13th and 14th gen chips – where these Skymont efficiency cores offer 2% better single threaded performance – in both integer and floating point – than the desktop P cores! That’s frankly incredible, especially considering the Skymont cores are drawing what looks like less than half the power to do the same work! Hot damn that’s some fast cores…

One major change Intel has made is the Thread Director and scheduler – Meteor Lake introduced a pretty significant shift (again) in how the OS should schedule tasks, but it seems that’s been changed AGAIN for Lunar Lake, and likely Arrow Lake too. In short, with the original hybrid design work was loaded into the P cores, then downgraded if it didn’t need all that performance. Meteor Lake introduced the low power island E cores, and with that meant all work had to be loaded into them first, then moved to the compute tile E cores, and THEN to a P core if needed. With Lunar Lake it’s simpler – stick it on the E cores, and if you need more oomph, upgrade to P cores. That’s much simpler, and should help with more demanding workloads. For Arrow Lake it’s still unclear if we’ll go back to having some LP E cores AND regular E cores, so that might be subject to change too…
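The three placement orders described above can be written down as a tiny sketch. This is purely my own toy model of the description, not how Thread Director or the OS scheduler is actually implemented, and the names are made up for illustration.

```python
# Order in which each generation tries core types, as described above.
# First entry is where new work lands; later entries are where it moves next.
PLACEMENT_ORDER = {
    "original_hybrid": ["P", "E"],          # start on P, demote if not needed
    "meteor_lake": ["LP-E", "E", "P"],      # low power island first
    "lunar_lake": ["E", "P"],               # E cores first, upgrade if needed
}


def moves_to_reach(generation: str, core_type: str) -> int:
    """How many migrations a task makes before it lands on core_type."""
    return PLACEMENT_ORDER[generation].index(core_type)


for gen in PLACEMENT_ORDER:
    print(f"{gen}: {moves_to_reach(gen, 'P')} move(s) before a heavy task reaches a P core")
```

In this toy view a demanding task on Meteor Lake has to be bumped twice before it gets a P core, while on Lunar Lake it is only bumped once, which is the simplification being described.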

Moving on to what Intel spent probably 40% of their entire time at Computex talking about – AI – and specifically their Neural Processing Unit or NPU. We first saw the NPU built into CPUs with Meteor Lake, although apparently this is actually their fourth generation design. The core design hasn’t changed too much, it’s still a massive array of matrix multiplication and convolution units packed into a tile, but this time instead of two of those tiles, there are now six. Couple that with the smaller process node and therefore higher clock speeds, and you get up to 48 trillion operations per second. Unsurprisingly with three times more cores on hand AND higher clock speeds, the new NPU is up to 4 times faster, or twice as fast at the same power. Interestingly power consumption was listed as only 11.2W of power for 5.8 seconds to do 20 iterations in Stable Diffusion. That’s impressive, considering how mental graphics cards tend to go when you run AI models on them!
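A bit of arithmetic on those NPU figures helps put them in context. This sketch assumes the 11.2W is an average held for the full 5.8 seconds, and it takes the “up to 4 times faster” claim at face value. The variable names are mine.

```python
# Figures quoted above.
tiles_old, tiles_new = 2, 6
claimed_speedup = 4.0            # "up to 4 times faster"
avg_power_w = 11.2               # assumed to be the average over the whole run
duration_s = 5.8
iterations = 20

# How much of the speedup comes from simply having more tiles.
tile_scaling = tiles_new / tiles_old
# Whatever is left must come from clocks and other per-tile improvements.
other_scaling = claimed_speedup / tile_scaling

# Energy used for the Stable Diffusion run: power multiplied by time.
energy_j = avg_power_w * duration_s
energy_per_iteration_j = energy_j / iterations

print(f"From extra tiles:     {tile_scaling:.2f}x")
print(f"From everything else: {other_scaling:.2f}x")
print(f"Total energy:         {energy_j:.1f} J")
print(f"Per iteration:        {energy_per_iteration_j:.2f} J")
```

So tripling the tiles accounts for most of the gain, with roughly another third on top from everything else, and the whole 20 iteration run costs about 65 joules.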

So far we’ve only spoken about the compute tile, and there is actually more to go there, but first I want to quickly cover the new “Platform Controller Tile”. This is basically the IO die, and houses 4 PCIe Gen 5 lanes, 4 PCIe Gen 4 lanes, THREE Thunderbolt 4 connections, WiFi 7 built in alongside Bluetooth 5.4 and a single gigabit ethernet connection, and a couple USB 2 and 3 connections too. The triple Thunderbolt 4 is what caught my eye though – especially considering Thunderbolt is just PCIe with a wrapper – that means they are reserving six PCIe Gen 4 lanes for Thunderbolt, hence why you only get enough lanes for a GPU and an SSD directly – although I suppose if these are the low power chips maybe you won’t even get a GPU with these. Either way, that’s an awful lot of connectivity.

The last thing I want to touch on is the graphics – specifically the new Xe2 Battlemage graphics cores that are going into these Lunar Lake chips, although much like the CPU side, what we learn about these iGPUs will inform us about the next generation of desktop GPUs too. So, what’s new? Well, actually quite a lot. Intel has made quite a few fundamental changes to the GPU core design that should, at least in theory, bring them a lot closer to the likes of AMD and NVIDIA. They claim the Xe2 core in Lunar Lake will be 50% faster than the core you’ll find in Meteor Lake chips – that’s significant, and will be a big boost to mobile and ultra-mobile devices like the MSI Claw. Internally, Intel is claiming some significant performance improvements in specific DirectX functions – like 12.5 times better performance in draw indirect execute, a key function of DirectX 12, 4.1 times better performance in mesh shader dispatch, and even 20% better performance with tessellation. Those are some mighty impressive improvements.

A lot of that comes from the new core design. The biggest change by far is Intel switching from SIMD8 to SIMD16 – SIMD stands for single instruction, multiple data, and is basically parallel operations on the vector data you need to draw a frame. Most common engines seem to prefer SIMD16, so Intel switching to that means more native compatibility straight out the box, plus more performance to boot. The primary benefit should be compatibility though, with Intel claiming much better day one support for games on these new cores and cards. They’re also supporting SIMD32 operation calls, which should again help with compatibility. Something else that’s new is the XMX engine, or Xe Matrix Extension engine, which houses direct support for INT8 and Floating Point 16 workloads, with a frankly insane 2048 floating point operations per clock and 4096 integer operations per clock. They tout this as an AI benefit, but it should benefit gaming workloads too. Comparing between the graphics decks from Meteor Lake and this Lunar Lake one, even if you assume the operations per clock count they list is from all eight vector engines (aka compute units), you are still looking at twice the number of FP32, FP16 and INT8 operations per clock compared to the last gen. That is significant.
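If SIMD width is a bit abstract, here’s a minimal sketch of what it changes. It only counts how many times one instruction has to be issued to cover a batch of work items at a given width. It ignores everything else about the real hardware (how many vector engines there are, clock speeds, scheduling), so treat it as an illustration of the idea and not a performance model.

```python
def issues_needed(work_items: int, simd_width: int) -> int:
    """Times one instruction must be issued to cover every work item."""
    return -(-work_items // simd_width)  # ceiling division on integers


def lane_utilisation(work_items: int, simd_width: int) -> float:
    """Share of SIMD lanes doing useful work across all issues."""
    return work_items / (issues_needed(work_items, simd_width) * simd_width)


pixels = 64  # hypothetical batch, e.g. an 8x8 block of pixels
for width in (8, 16, 32):
    print(f"SIMD{width}: {issues_needed(pixels, width)} issues "
          f"for {pixels} pixels")

# A batch that doesn't divide evenly leaves some lanes idle on the last issue.
print(f"SIMD16 on 100 items: {lane_utilisation(100, 16):.0%} of lanes used")
```

Wider SIMD means fewer instruction issues for the same batch, at the cost of wasted lanes when the batch doesn’t fill the last one.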

They’ve also beefed up the ray tracing cores – there are still eight, a one to one match for the Xe cores – but they are now wider, with three traversal pipelines, 18 box intersections and two triangle intersections. While it’s unlikely that these will be quite as good as NVIDIA’s RT Cores, Intel is already somewhat ahead of team red here, and these improved cores might be worrying to AMD. Of course, it’s unlikely that you’ll be playing on ray tracing ultra on a mobile CPU and iGPU, but at least once these things come to the desktop, we have something to look forward to testing.

And with that, I think we’re done. If you’ve made it this far, congratulations, because that was a lot of incredibly dense material. Hopefully it was still interesting for you, especially since I’m planning on doing a video talking all about AMD’s next generation of CPUs soon too! If you want to see that, make sure to hit the subscribe button and turn on the bell notification icon so you don’t miss it.

Tags:arrow lake, lunar lake
