In brief: Nvidia's skyrocketing success over the last few years has been down to the company's hardware dominating the lucrative AI market. With its next-gen Blackwell AI chips, however, Team Green is experiencing some rare slip-ups. New reports say the GPUs, which have already been delayed, are overheating when installed in high-capacity server racks.
Claims that Blackwell GPUs designed for AI tasks and HPC are overheating come from sources who spoke to The Information.
The problem occurs when the chips are integrated into Nvidia's customized server racks, each of which houses 72 processors and consumes up to 120kW. Nvidia has reportedly told suppliers to redesign the racks on several occasions to improve the cooling and address the problem. Unfortunately, this is further delaying Blackwell's launch.
Overheating can severely impact the performance of the chips, and it also has the potential to damage the very expensive hardware.
Nvidia is playing down the report. Speaking to Reuters, a spokesperson said the company is working with leading cloud providers and that engineering redesigns are normal and to be expected.
It was reported in August that the Blackwell AI chips were facing significant delays due to design flaws discovered late in manufacturing. Manufacturer TSMC identified an issue in the processor die connecting two Blackwell GPUs on the GB100 and GB200 chips. These chips employ TSMC's CoWoS-L packaging, which utilizes an RDL interposer with local silicon interconnect bridges to achieve data transfer rates of about 10 TB/s. The problem arose from a mismatch in thermal expansion properties between various components, which caused warping and system failures.
To fix the previous Blackwell problem, Nvidia had to alter the chips' top metal layers and bump structures, which pushed the mass production date to the end of October and shipping to late January. The chips were originally slated to ship in the second quarter of 2024.
We still don't know if the latest problem with Blackwell will cause any further shipment delays. Nvidia CEO Jensen Huang has described demand for Blackwell as being "insane," so another setback would come as a huge blow to customers such as Microsoft, Google, and Meta.