Nvidia Blackwell data center GPUs could face further delays due to overheating problems

midian182

In brief: Nvidia's skyrocketing success over the last few years has been down to the company's hardware dominating the lucrative AI market. With its next-gen Blackwell AI chips, however, Team Green is experiencing some rare slip-ups. Having already been delayed, new reports say the GPUs are experiencing overheating issues when installed in high-capacity server racks.

Claims that Blackwell GPUs designed for AI tasks and HPC are overheating come from sources who spoke to The Information.

The problem occurs when the chips are integrated into Nvidia's customized server racks, each of which houses 72 processors and consumes up to 120kW. Nvidia has reportedly told suppliers to redesign the racks several times to improve cooling. The repeated redesigns risk further delaying Blackwell's launch.

Overheating can not only severely degrade the chips' performance but also damage the very expensive hardware.

Nvidia is playing down the report. Speaking to Reuters, a spokesperson said the company is working with leading cloud providers and that engineering redesigns are normal and to be expected.

It was reported in August that the Blackwell AI chips were facing significant delays due to design flaws discovered late in manufacturing. Manufacturer TSMC identified an issue in the processor die connecting two Blackwell GPUs on the GB100 and GB200 chips. These chips employ TSMC's CoWoS-L packaging, which uses an RDL interposer with local silicon interconnect bridges to achieve data transfer rates of about 10 TB/s. A mismatch in thermal expansion properties between the various components caused warping and system failures.

Nvidia had to alter the chips' top metal layers and bump structures to fix the previous Blackwell problem, delaying the chips' mass production date to the end of October and shipping time to late January – they were originally slated to ship in the second quarter of 2024.

We still don't know if the latest problem with Blackwell will cause any further shipment delays. Nvidia CEO Jensen Huang has described demand for Blackwell as being "insane," so another setback would come as a huge blow to customers such as Microsoft, Google, and Meta.

If you are getting a 5090, run FurMark on that b for a day. Who knows how many PC cards are gonna ship with the same problems.

If the rumors regarding power increase are true, how would they achieve the same GPU longevity?
550-600 watt power draw is a lot. I would be very worried if the standard 3 year warranty got shortened as well.
 
I expect this kind of thing to start happening more and more going forward. We're headed into the silicon wall and there don't seem to be a lot of good material science solutions to get around it.
Makes me appreciate how hard Nvidia is pushing to be #1. There is not much room left to make their cards much faster, and yet they have come out on top for many years.
 