In brief: Elon Musk's wild foray into the AI business has resulted in the construction of a massive supercomputer in record time. Curiously, Nvidia notes that this supersystem doesn't utilize the traditional InfiniBand networking standard to transfer data as one might expect.

The high-performance computing system built by xAI, featuring 100,000 Hopper GPUs, is named Colossus. The system uses Nvidia's Spectrum-X Ethernet networking platform instead of InfiniBand, a technology Nvidia brought in-house when it bought Mellanox, the last independent InfiniBand supplier, in a deal announced in 2019.

Nvidia stated that the designers of Colossus achieved the system's massive scale largely thanks to Spectrum-X. The technology significantly improves the performance of the remote direct memory access (RDMA) network while relying on "standards-based" Ethernet communication devices. Colossus was constructed in record time, and the xAI team is now in the process of doubling the system's size by installing an additional 100,000 Hopper GPUs.

According to Nvidia, standard Ethernet devices would be insufficient for Colossus, as they would cause thousands of flow collisions and deliver a meager 60 percent data throughput. In contrast, the company says Spectrum-X delivers "zero application latency degradation" and no packet loss due to flow collisions, maintaining a significantly higher 95 percent data throughput through its "congestion control" system. Colossus is training large language models belonging to the Grok family and requires "unprecedented" network performance to do so.
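To put the 60 percent and 95 percent throughput figures in perspective, here is a minimal back-of-the-envelope sketch in Python. The 400 Gbps link rate and the 100 GB payload are assumptions chosen purely for illustration and are not figures from Nvidia or xAI; the function names are hypothetical as well. The sketch ignores protocol overhead and latency.

```python
def effective_gbps(line_rate_gbps: float, utilization: float) -> float:
    """Usable bandwidth on a link, given the fraction of line rate actually delivered."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return line_rate_gbps * utilization


def transfer_seconds(payload_gigabytes: float, line_rate_gbps: float, utilization: float) -> float:
    """Time to move a payload over one link, ignoring protocol overhead and latency."""
    payload_gigabits = payload_gigabytes * 8
    return payload_gigabits / effective_gbps(line_rate_gbps, utilization)


LINE_RATE_GBPS = 400.0  # assumed per-link rate, not a figure from the article
PAYLOAD_GB = 100.0      # assumed payload size, purely illustrative

standard = transfer_seconds(PAYLOAD_GB, LINE_RATE_GBPS, 0.60)    # standard Ethernet figure
spectrum_x = transfer_seconds(PAYLOAD_GB, LINE_RATE_GBPS, 0.95)  # Spectrum-X figure
speedup = standard / spectrum_x

print(f"Standard Ethernet: {standard:.2f} s")
print(f"Spectrum-X:        {spectrum_x:.2f} s")
print(f"Speedup:           {speedup:.2f}x")
```

Whatever link rate and payload are assumed, the ratio between the two cases is fixed at 95/60, or roughly 1.58 times more data moved in the same time.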

Spectrum-X isn't your run-of-the-mill Ethernet technology. The core of the platform is the Spectrum SN5600 Ethernet switch, which Nvidia claims can support up to 800 Gbps per single port. This switch is built on a Spectrum-4 custom ASIC, and xAI has paired it with Nvidia BlueField-3 SuperNICs to effectively accelerate GPU-to-GPU communication.

InfiniBand was specifically designed to meet the communication needs of HPC systems, keeping packet loss to an absolute minimum. While Ethernet has a significantly higher rate of packet loss, it remains extremely popular – even in the speed-sensitive HPC market – due to factors such as broad compatibility, vendor choice, and potentially higher bandwidth per port.

Nvidia stated that its Spectrum-X Ethernet networking platform can accelerate the development of powerful AI systems like Colossus, reducing the time needed to bring massive HPC machines online. Spectrum-X technology is scalable and can potentially provide networking features that were previously available only through InfiniBand solutions.