How to Deploy Your Own SLURM Cluster on Crusoe Cloud: A Step-by-Step Guide

June 05, 2025

Deploy Your Own SLURM Cluster on Crusoe Cloud

By Martin Cala, Senior Solutions Engineer at Crusoe

For many data scientists and ML engineers, the idea of running training jobs on a large cluster can feel like a big step. If you're used to working in managed notebook environments or cloud-hosted labs, the world of distributed training, job queues, and GPU scheduling might seem like uncharted territory.

But as your models get bigger and your team starts running more experiments in parallel, it's a step worth taking — and one that can unlock a whole new level of productivity and scale.

That’s where SLURM comes in. It’s a trusted, widely used system for managing compute jobs in high-performance environments. And with the right setup, it can feel just as approachable as the tools you’re used to—while giving you the control and power to run distributed workloads at scale.

To help you get there, we’ve built a SLURM solution for Crusoe Cloud that makes it easy to get started. It uses Terraform to spin up your infrastructure and Ansible playbooks to configure everything automatically. There’s even a built-in Prometheus + Grafana stack for monitoring your jobs and keeping an eye on system performance.

This guide will walk you through our published SLURM solution, showing you step-by-step how to deploy your own SLURM cluster on Crusoe Cloud. 

Gather the Essentials

Before we begin, ensure you have the following tools ready to go. These are the foundational components for a smooth SLURM cluster setup on Crusoe Cloud:

  • Crusoe CLI configured

  • Terraform installed

  • Ansible installed locally
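
To confirm these are ready before you start, you can run a quick check from your terminal. The crusoe locations list command doubles as a test that your CLI is authenticated against your account:

terraform version
ansible --version
crusoe locations list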

Deploying Your SLURM Cluster Step-by-Step

With your toolkit assembled, you're ready to launch your SLURM environment. Follow these steps to get your cluster operational:

1) Clone the SLURM GitHub Repository: Grab our pre-configured solution from GitHub.

# git clone https://github.com/crusoecloud/slurm.git

2) Configure Your Deployment Variables: Navigate into the cloned directory. In the same directory as the main.tf file, create a terraform.tfvars file. This file will hold the specific metadata for your Crusoe project and SLURM deployment. You can gather the necessary information using both the Crusoe Cloud UI and the CLI.

For example, to find available locations, VPC subnets, and VM types, run:

# crusoe locations list 
# crusoe networking vpc-subnets list
# crusoe compute vms types
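
If you prefer to pull these values straight from the CLI, the same JSON-plus-jq pattern used for shared disks below can be applied here as well. Treat this as a convenience sketch: it assumes the -f json flag and the id field behave for this subcommand the way they do for crusoe storage disks list.

# crusoe networking vpc-subnets list -f json | jq -r '.[].id'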


Optional Petabyte-Scale Storage: You can enhance your cluster with a pre-created shared disk for massive storage capacity, complementing the default persistent /home directory (powered by NFS). To get your shared disk ID, use the CLI and jq:

# crusoe storage disks list -f json | jq -r '.[] |select(.type |contains("shared")).id'

3) Populate terraform.tfvars: Once you have all your variables, fill out the terraform.tfvars file using the template below. This example provisions two compute nodes of type h200-141gb-sxm-ib.8x. You can customize head-node and login-node high availability, adjust CPU sizes, or add more SLURM users by modifying the slurm_users variable.

# common configuration
location = "<crusoe-region>"
project_id = "<project-id>"
ssh_public_key_path = "~/path/to/public_key"
vpc_subnet_id = "<vpc-subnet-id>"

# head node
slurm_head_node_count = 1
slurm_head_node_type = "c1a.8x"

# login node
slurm_login_node_count = 1
slurm_login_node_type = "c1a.8x"

# nfs node
slurm_nfs_node_type = "s1a.80x"
slurm_nfs_home_size = "10240GiB"

# slurm-compute-node configuration
slurm_compute_node_type = "h200-141gb-sxm-ib.8x"
slurm_compute_node_ib_partition_id="<ib-partition-id>"
slurm_compute_node_count = 2

slurm_shared_volumes = [{
   id = "<shared-disk-id>"
   name = "<name-of-shared-disk>"
   mode = "read-write"
   mount_point = "/data"
}]

# slurm users configuration
slurm_users = [{
   name = "user1"
   uid = 1001
   ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDQD5doToJjyyq0BH8TDlHZqqVy+kZpuGgJP5gbDanpF"
}]

# observability
# enable_observability = true
# grafana_admin_password = "admin123"

Note on Observability: If you wish to enable the Prometheus + Grafana monitoring stack, uncomment the last two lines (enable_observability and grafana_admin_password). You will also need to configure four firewall rules to allow access to the metrics and dashboards:

  • Inbound TCP port 3000 for Grafana access

  • Inbound TCP port 9400 for the NVIDIA DCGM Exporter

  • Inbound TCP port 9090 for Prometheus access

  • Inbound TCP port 9100 for Node Exporter
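
Before provisioning, it can also be worth letting Terraform normalize the file you just wrote. terraform fmt rewrites terraform.tfvars into canonical style and will surface basic syntax mistakes, such as a missing quote, before you run a plan:

terraform fmt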

4) Provision Your SLURM Cluster: With terraform.tfvars correctly configured, execute the Terraform script to provision your resources and automatically set up the cluster.

terraform init
terraform plan 
terraform apply 
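
Once the apply completes, terraform output lists any values the module exports. Exactly which values are exported depends on the version of the repository you cloned, but if outputs are defined, this is typically where you would find details such as node IP addresses for SSH access:

terraform output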

5) Verify Cluster Operation: The Terraform and Ansible automation will take a few minutes to complete. Once finished, SSH into the login node. You can then run sinfo to verify that your SLURM environment is running correctly. You should see your compute nodes listed in an idle state, ready to accept jobs via srun or sbatch.

ubuntu@slurm-login-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 idle slurm-compute-node-[0-1]
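
As a quick smoke test, you can submit a small batch job that prints the hostname of every allocated node. The script below is a minimal sketch; the job name, output file, and resource counts are illustrative and can be adjusted to your cluster:

#!/bin/bash
#SBATCH --job-name=hello-cluster
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello-%j.out

# Each task prints the hostname of the node it landed on
srun hostname

Save it as hello.sbatch, submit it with sbatch hello.sbatch, and watch it move through the queue with squeue. The output file should contain one hostname per compute node.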

6) (Optional) Validate InfiniBand Performance: If your compute nodes are equipped with InfiniBand for high-speed interconnect, you can run a simple two-node NCCL test to validate inter-node RDMA performance. This is particularly useful for distributed AI training.

srun -N 2 --ntasks-per-node=8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
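
If you want a lighter check first, you can confirm that every GPU is visible on both nodes with a one-line srun; this assumes the NVIDIA driver is present on the compute images, which the GPU nodes and the solution’s automated setup should already provide:

srun -N 2 --ntasks-per-node=1 nvidia-smi -L

In the NCCL test itself, watch the busbw column of the output; it should climb toward the expected bandwidth of your InfiniBand fabric as the message sizes grow.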

Scale Your AI Training with SLURM Workloads


By following these steps, you’ve set up a functional SLURM cluster on Crusoe Cloud — giving your team a reliable way to run and manage large-scale training jobs.

This setup is designed to simplify your deployment and let you focus on the work: training models, running experiments, and collaborating more effectively. Whether you're just starting to scale up or already supporting a busy research team, this gives you a solid foundation to build on.

If you need any guidance refining your setup or tailoring it to your specific workflow, we're here to support you every step of the way.

