How Crusoe was the First to Virtualize AMD’s Instinct™ MI300X GPUs on Linux KVM

June 18, 2025

By Andrew Carp, Staff Software Engineer at Crusoe

In October 2024, Crusoe became the first cloud provider to offer virtualized AMD AI hardware in a Linux environment. At the time, high-performance GPU virtualization was (and still is) a relatively new avenue of exploration in HPC, particularly in the AI space. Most providers either containerize their hardware or offer a bare-metal solution. Each approach has its benefits and drawbacks, but virtualization comes out ahead as the most well-balanced option, maintaining platform efficiency and flexibility while still keeping security a priority.

The Compute Virtualization Team at Crusoe was tasked with the go-to-market launch of AMD’s newest platform, the AMD Instinct MI300X Series. This new platform was seen as a key opportunity for Crusoe to diversify its underlying platform offering and to take advantage of emerging infrastructure within the GPU Cloud space.

This presented an extremely interesting challenge for our team. Not only did we need to integrate AMD’s entire software and hardware stack into our large and quickly scaling cloud offering, but we also needed to figure out how to add the required support to our underlying platform. Specifically, the two major challenges were:

  • Crusoe and AMD needed to work together to add Linux-based virtualization support

  • Crusoe had to onboard an entirely new hardware and software stack onto our platform

From a systems engineering perspective, both challenges were interesting ones to tackle. However, with a go-to-market deadline of just a few months, Crusoe and AMD needed to be swift in their partnership and communication to address both the virtualization and ecosystem integration problems. 

Background

Virtualization at Crusoe

When it comes to providing users efficient access to compute resources, virtualization can provide several distinct advantages:

  • Cost savings from reduced hardware and operating costs

  • Enhanced security derived from hardware isolation in a multi-tenant environment

  • Enhanced scalability, availability, and resilience from dynamic resource allocation and live migration

  • Low latency and high performance with direct access to fully passed through hardware resources

  • Improved user experience with managed software and hardware infrastructure

At Crusoe, our Compute Nodes utilize Linux KVM (Kernel-based Virtual Machine) as the Type I VMM (Virtual Machine Monitor), with Cloud Hypervisor as the Type II VMM.

Together, these two components pair to create Virtual Machine processes, enabling clients to access and run workloads inside a secure, partitioned environment. We’ve made multiple upstream contributions to our Type II VMM, Cloud Hypervisor, most notably an optimized boot time via memory prefaulting at preload (which reduced Crusoe’s VM boot time for HGX H100 VMs to 90 seconds on certain platforms).
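For context, Cloud Hypervisor runs as an ordinary userspace process and drives KVM through /dev/kvm. A quick, generic way to confirm a node is ready for this pairing (a standard Linux check, not Crusoe-specific tooling) is:

# Confirm the KVM modules are loaded and the KVM device node exists
lsmod | grep -E '^kvm'
ls -l /dev/kvm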

Devices are exposed to the VM process running within Cloud Hypervisor via various virtualization methods. For additional security, performance, and ease of use, we fully pass through most devices on our Compute Nodes using VFIO. This includes all GPUs, ephemeral drives, and Mellanox InfiniBand HCAs.

Our goal here was to virtualize any AMD-specific hardware using full passthrough VFIO. VFIO allows Cloud Hypervisor to utilize the physical IOMMU hardware to directly map device memory into the VM. This sets up direct memory access from the device to the VM, ensuring limited memory translations are needed to interact with the device. It also ensures that this memory is secure, and that IOMMU mappings are not shared across devices utilized by other VMs.
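As a rough illustration, fully passing a GPU through to a guest amounts to handing Cloud Hypervisor the device’s sysfs path. The sketch below is a minimal example under our assumptions; the kernel image, disk path, resource sizes, and PCI address are placeholders, and the exact flags depend on the Cloud Hypervisor version in use.

# Minimal sketch: boot a guest with one VFIO-passed-through GPU (paths are placeholders)
cloud-hypervisor \
    --kernel ./vmlinux \
    --cmdline "console=hvc0 root=/dev/vda rw" \
    --disk path=./guest-rootfs.raw \
    --cpus boot=16 \
    --memory size=64G \
    --device path=/sys/bus/pci/devices/0000:c6:00.0/

Cloud Hypervisor opens the device through the vfio-pci driver, and the host IOMMU provides the direct DMA mappings described above.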

Unfortunately, when we attempted to virtualize the AMD Instinct MI300X GPU using VFIO, we found that the VFIO driver and the MI300X GPU were not able to communicate. The resolution to this was a combined effort by AMD and Crusoe, and is detailed further below.

MI300X Full-Passthrough Virtualization

Generally, to virtualize a physical device with the best performance, we need the vfio-pci driver in the host kernel to manage the device. The VFIO (Virtual Function I/O) driver provides optimal performance by fully passing devices through to a guest VM. Within this guest VM, the user is then free to use whatever application driver they need to interact with the device.

As an example, for a GPU located at PCI address 0000:c6:00.0, you would run the below to bind the device to the vfio-pci driver:

# Unbind from the current driver, mark the device for vfio-pci, then bind it
echo 0000:c6:00.0 > /sys/bus/pci/devices/0000:c6:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:c6:00.0/driver_override
echo 0000:c6:00.0 > /sys/bus/pci/drivers/vfio-pci/bind

This then shows the device as bound to VFIO in our lspci output:

c6:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 74a1
   Kernel driver in use: vfio-pci

However, when we attempted to pass the bound device through to the VM via Cloud Hypervisor, we ran into the below output in lspci:

[AMD/ATI] Device 74a1 (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: vfio-pci

Oh no! It appeared that VFIO didn’t work for the MI300X. Digging into our kernel, we could see that vfio_reset was being called on driver load. This reset signal, sent to the hardware by the VFIO driver, caused the device to immediately go offline, which didn’t look great for our virtualization chances.

After reaching out to AMD, we began the mountainous process of adding and validating VFIO support for MI300X GPUs. This involved a months-long back-and-forth of design, build-out, and validation between the engineers at AMD and Crusoe. To ensure proper support, this process involved changes to the host kernel, BIOS, GPU firmware, and Cloud Hypervisor. We set up weekly syncs with the engineers at AMD, who were extremely knowledgeable, quick to act, and easy to work with.

AMD engineers worked diligently alongside Crusoe to tackle the various challenges related to VM reboots. The server platform employed PCIe switch firmware that would block PCI config writes to the GPU in certain scenarios, causing VM reboots to fail. The AMD team debugged this issue and collaborated with the PCIe switch vendor to develop a fix. Additionally, AMD engineers provided a kernel patch to bypass the issue as a stopgap until the PCIe switch vendor delivered the official fix, ensuring smoother operations for Crusoe.

AMD engineers also worked in tandem with Crusoe to address VM boot times, implementing kernel optimizations that significantly improved both the boot and reboot times of the VMs. These efforts highlight the strong partnership between AMD and Crusoe and our commitment to delivering enhanced virtualization solutions.

Validation

Once we received the multiple upgrades required, we were able to set up our lab environment to validate the virtualization. Our lab environment utilized two MI300X Compute Nodes, each supporting the hardware detailed in the Compute Node configuration discussed previously. Additionally, these two nodes were connected via a small InfiniBand mesh, allowing for collective operations across a 2-node cluster.

AMD ROCm™ Validation Suite

AMD provides an extensive suite of tools known as the ROCm Validation Suite (RVS). It allows you not only to validate your system setup, ensuring driver, device, and firmware configurations are optimal, but also to stress the system extensively.

One such test we found extremely useful at Crusoe was the GPU Stress Test. It hammers the GPUs with a variety of matrix operations, running anywhere from 10 minutes to 9 hours. We thought it would be the equivalent of previous GPU-based stress tests we’ve run, but once we executed the test across our 64-node production MI300X cluster, we realized it stressed the system even harder than expected.
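For reference, a GPU Stress Test run is driven by a configuration file handed to the rvs binary. The config path below is an assumption and differs between ROCm releases; check your rocm-validation-suite install for the bundled gst configs.

# Run the GPU Stress Test module via the ROCm Validation Suite CLI
# (the bundled config path is an assumption and varies by ROCm release)
rvs -c /opt/rocm/share/rocm-validation-suite/conf/gst_single.conf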

We saw temperatures spike to 100°C across our fleet, which immediately triggered alerting and automatic power drains on the MI300X nodes in the datacenter. When we reached out to AMD, they confirmed that the maximum operating temperature for these GPUs is an impressive 105°C and that the test was operating as expected. Quite a powerful test overall, and impressive that the GPUs were able to withstand such high temperatures.

Infinity Fabric Testing

To stress test the Infinity Fabric on the server and ensure GPU connectivity was adequate, we ran an RDMA loopback test using AMD’s perftest suite. This intra-node test was performed within a single Crusoe VM and was specifically used to validate the Infinity Fabric connectivity between GPUs. In one such test, we ran ib_write_bw to validate that the Infinity Fabric bandwidth between the GPUs matched bandwidth expectations.
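A minimal sketch of that kind of loopback invocation is below, assuming a ROCm-enabled perftest build. The HCA names, GPU indices, and the --use_rocm flag are assumptions that vary by build and platform, so treat this as illustrative rather than our exact command line.

# Server side: register GPU 0 memory against one HCA (device names are placeholders)
ib_write_bw -d mlx5_0 --use_rocm=0 &
# Client side: loop back over localhost against GPU 1 memory
ib_write_bw -d mlx5_1 --use_rocm=1 localhost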

The test output can be seen below, with results in GB/sec.

     GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
GPU0 387.39  387.39  387.63  387.2   387.12  387.23  387.08  386.91
GPU1 387.2   387.01  386.88  386.77  386.82  386.67  386.75  386.9
GPU2 387.02  387.16  387.09  386.92  386.84  386.96  387.07  386.92
GPU3 387.1   387.12  387.41  386.97  386.76  387.21  387.07  386.63
GPU4 386.8   387.14  387.41  386.97  385.45  387     387.06  386.62
GPU5 387.14  387.14  387.43  386.82  387.13  386.97  387.1   386.92
GPU6 387.14  387.14  387.44  386.62  386.98  386.85  386.96  386.63
GPU7 386.94  387.01  387.31  386.93  386.84  386.98  387.04  386.01

You can see the bandwidth was extremely uniform, around 387 GB/sec between all GPUs, close to the maximum aggregate operating bandwidth of the Infinity Fabric itself (400 GB/sec).

RCCL

RCCL (ROCm Communication Collectives Library) is a standalone library that supports collective communications (all-reduce, all-gather, reduce, broadcast, etc.) across clusters of AMD GPUs. Here we were able to leverage RCCL Tests to run a series of all-reduce operations across our GPU clusters and verify bandwidth. Similar to other collective communications libraries, RCCL leverages open source communication libraries like Open MPI and UCX to efficiently send data over various types of interfaces (including InfiniBand).
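As a rough sketch, these runs use the all_reduce_perf binary from rccl-tests launched under MPI. The host names, rank layout, and message-size sweep below are illustrative assumptions, not our exact benchmark configuration.

# Sweep all-reduce from 128 MiB to 2 GiB across 16 GPUs on two nodes
# (host names and rank layout are placeholders)
mpirun -np 16 -H node1:8,node2:8 \
    ./build/all_reduce_perf -b 128M -e 2G -f 2 -g 1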

Initially, when we ran RCCL across our 2-node lab environment on bare metal, we saw adequate performance. However, once we moved our RCCL test into our virtual machines, we ran into a hiccup: the performance didn’t match our expected benchmark. Was there an issue with our MI300X GPU virtualization stack?

Happily, after some investigation, we discovered that RCCL could not determine which GPUs and HCAs were paired together on the same PCIe switches in the Crusoe VM. RCCL can implicitly determine the PCIe topology of the Compute Node when it runs on bare metal. In a virtualized environment, however, you need to provide a topology file that “simulates” the underlying PCIe topology of the bare-metal node (see the sketch below). If this file is missing or incorrect, the RCCL test won’t run at full performance. Whether or not you virtualize the PCIe switches bridging the GPUs and HCAs, those switches are still physically there, and RCCL needs to know about them.
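As a hedged illustration of the fix, RCCL reads NCCL-style environment variables, so a hand-written topology XML can be supplied roughly as shown below. The file path is a placeholder, and whether a given RCCL release honors NCCL_TOPO_FILE should be confirmed against the RCCL documentation; the XML itself must mirror the bare-metal switch layout.

# Point RCCL at a topology file that mirrors the bare-metal PCIe switch layout
# (path is a placeholder; confirm the variable name against your RCCL release)
export NCCL_TOPO_FILE=/etc/rccl/vm-topology.xml
mpirun -np 16 -H node1:8,node2:8 -x NCCL_TOPO_FILE \
    ./build/all_reduce_perf -b 128M -e 2G -f 2 -g 1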

Once we generated the correct topology file for our VMs, we were able to run a successful RCCL test, confirming that we had a fully functional MI300X GPU cluster in a virtualized Linux environment!

#                                                              out-of-place                        in-place
#       size         count      type    redop    root     time    algbw    busbw  #wrong     time    algbw    busbw  #wrong
#        (B)    (elements)                                  (us)   (GB/s)   (GB/s)            (us)   (GB/s)   (GB/s)
134217728      33554432     float     sum      -1   1046.6  128.24  240.44      0   1044.4  128.51  240.95      0
268435456      67108864     float     sum      -1   1531.6  175.27  328.63      0   1548.2  173.39  325.10      0
536870912      134217728    float     sum      -1   2944.0  182.36  341.93      0   2919.4  183.90  344.81      0
1073741824     268435456    float     sum      -1   5761.7  186.36  349.43      0   5626.3  190.84  357.83      0
2147483648     536870912    float     sum      -1   11385   188.62  353.67      0   10935  196.39  368.23      0

Conclusion

Despite entering the project without virtualization support, we were able to successfully qualify and launch our first 64-node MI300X cluster in October 2024. Our virtualization stack has remained exceptionally stable, and customers have found our platform an excellent place to interact with and explore AMD’s ROCm environment. We’d like to thank AMD for the rapid support and quick responses they gave us while validating their new stack. Crusoe is proud to have become an AMD partner, and excited to be the first cloud provider to virtualize AMD hardware in a Linux environment.

MI300X GPUs are now publicly available on Crusoe Cloud. Reach out to our sales team to start testing today!

About the Author

Andrew is a Staff Software Engineer on Crusoe’s Compute Team in San Francisco. He has a decade of experience building high-performance infrastructure, specializing in AI and High Frequency Trading. At Crusoe, Andrew focuses on virtualization and kernel development, helping to rapidly scale Crusoe’s clean-tech compute on the latest AI hardware.

