Crusoe Powers PyTorch Innovation: Driving AI Performance and Efficiency

May 01, 2025

crusoe meta

crusoe meta

At Crusoe, we're committed to not only providing enterprise-scale AI infrastructure but also actively contributing to the open-source community to drive the future of AI. We're proud to collaborate with the PyTorch community, and work alongside leading researchers and engineers to push the boundaries of AI performance and efficiency.

Our Crusoe Cloud platform, purpose-built for the most demanding AI workloads, provides an ideal environment for developing and testing these advancements. Here are just a couple of recent examples of how Crusoe and PyTorch are working together:

Accelerating Large-Scale Training with PyTorch Float8 Rowwise

A collaboration between Meta and Crusoe explored the benefits of PyTorch's new Float8 rowwise training on Crusoe's 2K H200 clusters in Iceland. The results were impressive, showcasing training accelerations of 34-43% at scale compared to BF16, with comparable convergence and stability.    

This work detailed how Float8 rowwise, a finer-grained resolution for Float8, improves quantization precision for large-scale workloads. By leveraging Crusoe's H200 infrastructure, the team was able to demonstrate the real-world impact of this PyTorch innovation.

Boosting Checkpointing Efficiency with PyTorch DCP

Another collaboration between Meta and Crusoe focused on optimizing checkpointing efficiency with PyTorch Distributed Checkpointing (DCP). Using Crusoe's 2K H200 cluster and TorchTitan, the team verified that new DCP optimizations significantly reduced background processing time for asynchronous checkpoints – from approximately 436 seconds to a mere 67 seconds at 1856 GPU scale.    

This research highlighted how asynchronous checkpointing minimizes GPU downtime during the saving of training progress, a crucial factor for large-scale AI training.  Crusoe's infrastructure played a key role in validating these improvements and their potential to save valuable training time.   

These are just two examples of how Crusoe is working hand-in-hand with the PyTorch community to accelerate AI development. Together, we're excited to continue pushing the limits of what's possible and empowering the next generation of AI innovation.


Liked what you just read? Share it

Ready To Build Something Amazing?