The specifications of NVIDIA’s AD102 “Lovelace” GPU die have finally been confirmed. The fully enabled core will pack an incredible 18,432 FP32 cores, a sizable increase over the AMD Navi 31’s 12,288 shaders. As reported the other day, the RTX 4090 will come with a couple of GPCs partially fused off, bringing the effective core count down to 16,128. If NVIDIA ever launches a 4090 Ti, that behemoth will leverage the full-fat AD102 core and its 18,432 shaders.
According to Kopite7kimi, the AD102 die will be able to touch the 100 TFLOPs single-precision performance mark with its core running at 2.8GHz. What he still isn’t sure about is the SM structure. NVIDIA has a habit of mildly restructuring its SM (Compute Unit) every generation. This time around it might be thoroughly overhauled, much like with Maxwell roughly eight years back, or left largely intact.
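That 100 TFLOPs figure is easy to sanity-check: peak FP32 throughput is conventionally quoted as cores × 2 FLOPs per clock (one fused multiply-add) × clock speed. A quick back-of-the-envelope calculation using the rumored numbers:

```python
# Back-of-the-envelope FP32 throughput estimate for AD102.
# Assumes the standard convention of 2 FLOPs per core per clock (one FMA);
# the core counts and 2.8 GHz clock are the rumored figures, not confirmed.
def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    flops_per_clock = 2  # one fused multiply-add = 2 floating-point ops
    return cores * flops_per_clock * clock_ghz / 1000  # GFLOPs -> TFLOPs

print(peak_fp32_tflops(18_432, 2.8))  # full AD102: ~103.2 TFLOPs
print(peak_fp32_tflops(16_128, 2.8))  # cut-down RTX 4090 config: ~90.3 TFLOPs
```

The full die lands just over the 100 TFLOPs mark at 2.8GHz, while the 16,128-core RTX 4090 configuration works out to roughly 90 TFLOPs.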
I’ll recap what I had shared a while back about the Maxwell SM and the possible SM design of Lovelace:
With Maxwell, the warp schedulers and the resulting threads per SM / clock were quadrupled, yielding a 135% performance gain per core. It looks like NVIDIA wants to pull another Maxwell: a generation known for exceptional performance and power efficiency that absolutely crushed rival AMD’s Radeon offerings.
This would mean that the overall core count per SM would remain unchanged (128) but the resources accessible to each cluster would increase drastically. Most notably, the number of concurrent threads would double from 128 to 256. It’s hard to say how much of a performance increase this will translate into, but we’ll certainly see a fat gain. Unfortunately, this layout takes up a lot of expensive die space, something NVIDIA is already paying a lot of dough to acquire (TSMC N4). So, whether Jensen’s team actually managed to pull this off or shelved it for future designs remains to be seen.
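The concurrent-thread figures fall out of the warp scheduler count: each scheduler issues one 32-wide warp per clock. A small sketch of that arithmetic; the doubled scheduler count is the speculated layout discussed above, not a confirmed spec:

```python
WARP_WIDTH = 32  # threads per warp on NVIDIA GPUs

def threads_issued_per_clock(schedulers: int) -> int:
    # Each warp scheduler issues one full warp (32 threads) per clock.
    return schedulers * WARP_WIDTH

print(threads_issued_per_clock(4))  # current 4-scheduler SM: 128 threads
print(threads_issued_per_clock(8))  # speculated doubled layout: 256 threads
```

Doubling the schedulers from four to eight is what would take an SM from 128 to 256 concurrent threads per clock while leaving the 128-core count untouched.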
There’s also a chance that Team Green decides to go with a coupled SM design, something already introduced with Hopper. In case you missed the Hopper whitepaper, here’s a small primer on Thread Block Clusters and Distributed Shared Memory (DSM). To make scheduling on GPUs with over 100 SMs more efficient, Hopper and Lovelace will group every two thread blocks in a GPC into a cluster. The primary aim of Thread Block Clusters is to improve multithreading and SM utilization. These clusters run concurrently across SMs within a GPC.
Thanks to an SM-to-SM network between the two thread blocks in a cluster, data can be shared between them efficiently. This is going to be one of the key features promoting scalability on Hopper and Lovelace, and a necessity when you’re increasing the core / ALU count by over 50%.
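In scheduling terms, pairing thread blocks roughly halves the number of independent units the GPU’s work distributor has to juggle, while letting each pair pool its shared memory over the SM-to-SM network. A rough illustration; the 144-SM count follows from 18,432 cores at 128 per SM, but the per-SM shared-memory size is a placeholder value, not a confirmed AD102 spec:

```python
def cluster_stats(sm_count: int, smem_per_sm_kb: int, cluster_size: int = 2):
    # Grouping thread blocks into clusters of `cluster_size` reduces the
    # number of independently scheduled units and pools shared memory
    # across the SMs in each cluster (Distributed Shared Memory).
    clusters = sm_count // cluster_size
    pooled_smem_kb = smem_per_sm_kb * cluster_size
    return clusters, pooled_smem_kb

# 144 SMs (18,432 / 128); 100 KB per-SM shared memory is a placeholder.
print(cluster_stats(144, 100))  # -> (72, 200)
```

The point is qualitative: fewer, fatter scheduling units with a larger effective shared-memory pool per unit, which is what keeps utilization up as SM counts climb past 100.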
| Arch | Turing | Ampere | Ada Lovelace | Ada Lovelace | Ada Lovelace |
|---|---|---|---|---|---|
| Process | TSMC 12nm | Samsung 8nm LPP | TSMC 5nm | TSMC 5nm | TSMC 5nm |
| FP32 TFLOPs | 16.1 | 37.6 | ~90? | ~50 | ~35 |
| Memory | 11GB GDDR6 | 24GB GDDR6X | 24GB GDDR6X | 16GB GDDR6 | 16GB GDDR6 |
| Launch | Sep 2018 | Sep 2020 | Aug-Sep 2022 | Q4 2022 | Q4 2022 |
These are the two potential ways NVIDIA can (nearly) double the core counts without crippling scaling or leaving some of the shaders underutilized. Of course, there’s always a chance that Jensen’s team comes up with something entirely new and unexpected.