Tech companies race to build AI superclusters with 100,000+ GPUs in high-stakes competition

Skye Jacobs

The big picture: The AI industry is entering uncharted territory, even as questions remain about the practical limits of scalability and the return on these massive investments. Yet companies like Meta, OpenAI, Microsoft, xAI, and Google continue to push the boundaries of what's possible in AI computing.

A new benchmark for AI prowess has emerged: the ability to amass the most Nvidia chips in a single location. This competition among tech giants is reshaping the AI industry, driving unprecedented investments in computing infrastructure and pushing the boundaries of machine learning.

At the forefront of this technological arms race are companies like Elon Musk's xAI and Mark Zuckerberg's Meta. These firms are building massive superclusters of computer servers, each housing an astounding number of Nvidia's specialized AI processors. The scale of these projects is staggering, with costs running into billions of dollars and chip counts reaching into the hundreds of thousands.

xAI's entry into this high-stakes game is particularly noteworthy. In a remarkably short span of time, the company has built a supercomputer dubbed "Colossus" in Memphis. It boasts 100,000 Nvidia Hopper AI chips, a number that was considered extraordinary just a year ago when clusters of tens of thousands of chips were seen as very large.

Meanwhile, Zuckerberg recently announced that Meta is already training its most advanced AI models on a chip conglomeration that he claims surpasses anything reported by competitors.

The motivation behind these massive investments is clear: larger clusters of interconnected chips have thus far translated into more capable AI models developed at faster rates, with some industry leaders already envisioning clusters containing millions of GPUs.

Nvidia, the company at the center of this technological race, stands to benefit enormously from this trend, and CEO Jensen Huang sees no end in sight for this growth trajectory. He envisions future clusters starting at around 100,000 Blackwell chips.

However, this race towards ever-larger chip clusters is not without its challenges and uncertainties. As the size of these superclusters grows, so do the engineering hurdles. Keeping tens of thousands of power-hungry chips cool is a major concern, leading to innovations in cooling technology. Liquid cooling, where coolant is piped directly to the chips, is becoming increasingly common in these massive setups.

Reliability is another significant challenge. Meta researchers have found that a cluster of more than 16,000 Nvidia GPUs experienced routine failures of chips and other components during a 54-day training period for an advanced version of their Llama model.

Despite these challenges, the push towards larger and more powerful AI clusters shows no signs of slowing. Elon Musk has already announced plans to expand xAI's Colossus from 100,000 chips to 200,000 in a single building, with ambitions to reach 300,000 of Nvidia's newest chips by next summer.

The race for AI supremacy is also driving demand for Nvidia's networking equipment, which is rapidly becoming a significant business in its own right. The company's networking revenue reached $3.13 billion in 2024, a 51.8 percent increase from the previous year. Nvidia's networking offerings, including Accelerated Ethernet Switching for AI and the Cloud, Quantum InfiniBand for AI and Scientific Computing, and BlueField Network Accelerators, are crucial in connecting and managing these massive chip clusters.

Despite these huge expenditures, the question of scalability remains unresolved. Dylan Patel, chief analyst at SemiAnalysis, told the Wall Street Journal that while there's no evidence that these systems will scale effectively to a million chips or a $100 billion system, they have demonstrated impressive scalability from dozens of chips to 100,000.

I'm wondering what the practical limitations of this are, because it seems like these companies are already running out of data to train AI on.
Biggest limitation is going to be profit motive, once the investors start demanding returns. Like we're seeing right now with the previous big boom, companies are now shatting themselves because their profits are nowhere to be seen.
 
I won't deny that AI is interesting and even "useful" to some extent, but this "AI everywhere, all the time" is just ridiculous. It is useful in scientific research, and I see it as one tier above procedural generation for game development.

That said, I don't think there is really any market for selling AI to anyone aside from students who are too lazy to do their homework.
 
These investments will fail.

Companies assumed AI would keep getting smarter linearly with the size of the models (thus an AI god would just need a big enough data center).

Like all things, the law of diminishing returns kicked in, and the latest (bigger) models are not any smarter. Thus, all the companies are quickly pivoting to applying current AI and hoping no one notices.
 
So as someone who was down on AI, in my current job I am assigned to my first-ever AI related project and I am finally seeing an answer to the question I've always had: what problem is AI the actual solution for? Now that I've seen it (I obviously can't go into details for business confidentiality reasons), I have become at least a genuine believer who sees AI's potential as a work tool if not a complete convert or evangelist.

The major problem with AI as a business right now is that these corporations have poured, and continue to pour, so much money into it that they are offering it up as a magical solution to every problem, because they are desperate to recoup the investment. But the truth is that it takes a specific business case to apply it to, and right now there aren't enough of those. They are coming, certainly, but not at the pace people believe, which means AI won't be profitable for a long time.

Regarding the running-out-of-data-to-scrape problem that somebody else posted here: the issue is that AI is pumping out hallucinated data even as it tries to scrape existing truthful data. The problem is that when it runs out of truthful data, it will start scraping its own twisted output and then using that to produce output that is even more twisted. It's essentially data inbreeding. And we all know what effects inbreeding had on the British Royal Family throughout history, right? We're going to get the data equivalent of that sooner or later with AI. And once that happens, its whole business model really will go belly up and crash.
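The "data inbreeding" effect described above is what researchers call model collapse, and a toy sketch can show the flavor of it. The Gaussian "model", the tiny sample size, and the generation count below are arbitrary illustrative choices, not anything from the article: each generation is fitted only to samples drawn from the previous generation, and the fitted spread tends to shrink away from the original distribution over time.

```python
import random
import statistics

random.seed(42)

# Generation 0: the "truth" distribution the first model was trained on.
mu, sigma = 0.0, 1.0

for gen in range(1, 101):
    # Each new generation sees only a small sample of the previous
    # generation's output, then replaces that generation entirely.
    samples = [random.gauss(mu, sigma) for _ in range(20)]
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # max-likelihood estimate, biased low

print(f"sigma after 100 generations: {sigma:.4f}")
```

In this toy setup, small samples plus the biased spread estimate make the distribution narrow generation after generation, so the final sigma ends up well below the original 1.0; the tails of the "truth" distribution are the first thing to disappear, which is roughly the dynamic the comment is warning about.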
 
All the data in the universe won't be of any use without the fairy dust, and they know it. Well, maybe not Musk and Altman, but everybody else.
 