The big picture: It turns out that if you completely uproot the way data centers have been built for the past 10 years, there are bound to be some growing pains. While headlines are all about the rise of AI, the reality on the ground involves plenty of headaches.
When speaking to systems integrators and others scaling up large compute systems, we hear a constant stream of complaints about the difficulties in getting large GPU clusters operational.
The main issue is liquid cooling. GPU systems run hot, with racks consuming tens of thousands of watts of power. Traditional air cooling is insufficient, which has led to widespread adoption of liquid cooling systems. This shift has driven up the stock prices of companies like Vertiv, which deploys these systems.
Editor's Note:
Guest author Jonathan Goldberg is the founder of D2D Advisory, a multi-functional consulting firm. Jonathan has developed growth strategies and alliances for companies in the mobile, networking, gaming, and software industries.
However, liquid cooling is still relatively new for data centers, and there aren't enough people familiar with installing these systems. As a result, liquid cooling has become the leading cause of failures in data centers. There are all kinds of reasons for this, but they all essentially boil down to the fact that water and electronics don't mix well. The industry will sort this out eventually, but it's a prime example of the growing pains data centers are experiencing.
There are also many challenges in configuring GPUs. This isn't surprising – most data center professionals have a wealth of experience configuring CPUs, but for many of them, GPUs are unfamiliar territory.
On top of that, Nvidia tends to sell complete designs, which introduces a whole new set of complications. For instance, Nvidia's firmware and BIOS systems aren't entirely new, but they are just different and underdeveloped enough to cause delays and an unusually high number of bugs. Add Nvidia's networking layer into the mix, and it's easy to see how frustrating the process has become. There's simply a lot of new technology for professionals to master in a very short timeframe.
In the grand scheme of things, these are just speed bumps. None of these issues are serious enough to halt AI development, but in the near term, they will likely become more pronounced and more high-profile. We expect hyperscalers to delay or slow down their GPU rollouts to address these challenges. To be more precise, we're likely to hear more about these delays because they've already begun.
AMD's recent $5 billion bet on the data center
We have recently been asked about the logic behind AMD's acquisition of ZT Systems. Because the deal and the growing complexities of installing AI clusters are closely related, we can use ZT as a lens to view the broader problems in the industry.
Let's say Acme Semiconductor wants to enter the data center market. They spend a few hundred million dollars to design a processor. Then they try to sell it to their hyperscaler customer, but the hyperscaler doesn't want just a chip – they want a working system to test their software.
So, Acme goes to an ODM (Original Design Manufacturer) and pays a few hundred thousand dollars to design a working server, complete with storage, power, cooling, networking, and everything else. Acme builds a few dozen of these servers and hands them out to their top sales prospects. At this point, Acme is out around $1 million, and they notice that their chip accounts for only 20% of the system's cost.
The hyperscalers then spend a few months testing the system. One of them likes Acme's performance enough to put it through a more rigorous test, but they don't want a standard server; they want one designed specifically for their data center operations. This means a new server design with a completely different configuration of storage, networking, cooling, and more. The hyperscaler also wants Acme to build these test systems with their preferred ODM.
Eager to close the deal, Acme foots the bill for this new design, though at least the hyperscaler pays for the test systems – Acme finally has some revenue, maybe $100,000. While the first hyperscaler is running their multi-month evaluation, a second customer expresses interest. Of course, they want their own server configuration with their own preferred ODM. Acme, needing the business, covers the cost of this design as well.
Acme approaches all the OEMs to see if any will design a catalog system to streamline the process. The OEMs are all very friendly and interested in what Acme is doing ("Great job, guys!"), but they'll only commit to a design once Acme secures more business.
Finally, a customer wants to buy in volume – a big win for Acme. This time, because there's real volume involved, the ODM agrees to do the design. However, the new server will use the hyperscaler's internally designed networking and security chips, which have been kept secret. Acme has never seen them and knows little about the new server, which was designed directly between the customer and the ODM. The ODM builds a bunch of servers, wires them up inside the hyperscaler's data center, and flips the power switch, at which point things immediately start to break.
This is expected; bugs are everywhere. But quickly, everyone starts blaming Acme for the problems, ignoring the fact that Acme was largely excluded from the design process. Their chip is the least familiar component to the ODM and the customer. Acme worked with the customer to iron out bugs during the evaluation cycle, but this is different.
Much of the system is new, and the stakes are much higher, so everyone is operating under stress. Acme sends its field engineers to the super-remote data center to get hands-on with the system. The three teams work through the bugs, finding more along the way. Eventually, it turns out Acme's processor enters an obscure error mode when interacting with the hyperscaler's security chip, the networking components are fragile and perform well below spec, and of course, every chip is running different firmware that is incompatible with the others.
To top it off, liquid cooling – something no one on the debugging team has worked with before – probably causes 50% of the problems. The deployment drags on as the teams work through the issues. At some point, something significant needs to be entirely replaced, adding more delays and costs. But after months of work, the system finally enters production. Then Acme's second customer decides they want to do a deeper evaluation, and the whole process starts all over.
And if that doesn't sound painful enough, we haven't even mentioned the lawyers.
Just to start the project, Acme had to spend nine months negotiating onerous terms with the hyperscaler from a very weak position. When it came to designing the custom server, the three companies (Acme, the ODM, and the customer) likely spent six weeks negotiating the NDA.
This is how servers have been built for years. Then Nvidia entered the market, bringing their own server designs. Not only that, but they brought designs for entire racks. Nvidia has been designing systems for 25 years, dating back to their work on graphics cards. Their team also builds their own data centers, so they have an in-house team experienced in handling all of these issues.
To compete with Nvidia, AMD can either spend five years replicating Nvidia's team or buy ZT. In theory, ZT can help AMD eliminate almost all of the friction outlined above. It's too soon to tell how well this will work in practice, but AMD has gotten pretty good at merger integration. And honestly, we would gladly pay $5 billion to avoid negotiating a three-way NDA and Master Service Agreement ever again.