May 08, 2024

Achieving 60 t/s with DBRX in 15 minutes


In this blog, we’ll show you how to quickly and cheaply test out the latest model from Databricks, DBRX-Instruct, a massive MoE with 36B active parameters and 132B total parameters. As open-source models keep getting larger (the largest Llama 3 clocks in at 405B parameters), the challenge and cost of running them grow with every release. Luckily, we can easily run high-performance distributed inference on low-cost NVIDIA L40S instances with Crusoe Cloud. An L40S.8x instance is ~$10/hour (as opposed to an H100.8x, which can cost $37.60/hr). For fast experimentation, we care mostly about $/hour and developer time. Please reference the GitHub repo for the complete code accompanying this tutorial.

Getting Started

Crusoe CLI

In this tutorial, we will be running DBRX-Instruct on NVIDIA L40S instances provided by Crusoe Cloud using the CLI. First, ensure that you have the CLI installed by following the instructions in the Crusoe Cloud documentation, then verify your installation with crusoe whoami.

Starting a VM

We'll run DBRX-Instruct on an L40S.8x instance with our batteries-included NVIDIA image. To create the VM using the CLI, run the following command:

crusoe compute vms create \
  --name dbrx-inference \
  --type l40s-48gb.8x \
  --location us-east1-a \
  --image ubuntu22.04-nvidia-sxm-docker:latest \
  --keyfile ~/.ssh/id_ed25519.pub

Wait a few minutes for the VM to be created, then note the public IP. Verify that you are able to access the instance with ssh ubuntu@<public ip address>. Then, exit the VM and we'll set up a storage disk to load our massive MoE model. If you didn't log the public IP, simply open up your Instances tab on the Crusoe Console and copy the address from there.

Creating and Attaching a Persistent Disk

Creating a separate disk is always recommended so you don't fill up the boot disk (128 GiB), but with LLMs in particular you can run out of storage very quickly. The DBRX-Instruct repo is ~490 GiB, so we'll create a 1 TiB disk for some breathing room. Back on your local machine with the crusoe CLI installed, run the following to create a disk:

crusoe storage disks create \
  --name dbrx-data \
  --size 1TiB \
  --location us-east1-a

Now, let's attach the disk to our instance with:

crusoe compute vms attach-disks dbrx-inference \
  --disk name=dbrx-data,mode=read-write

SSH into your instance (ssh ubuntu@<public ip address>) and run lsblk. The persistent disk will show up as vd[b-z]. Now, create the filesystem by running:

sudo mkfs.ext4 /dev/vdb

Create a directory to mount the volume. For this tutorial, we'll run `sudo mkdir /workspace/`. Finally, mount the volume by running:

sudo mount -t ext4 /dev/vdb /workspace && \
sudo chown -R ubuntu:ubuntu /workspace

You can verify that the volume was mounted by running lsblk again and seeing /workspace attached to vdb under MOUNTPOINTS.

Clone DBRX-Instruct and DBRX-Instruct-Tokenizer

For simplicity, we will clone the repos for both the instruct model and tokenizer (as opposed to letting HF handle caching) and provide local paths when loading our resources. Navigate to /workspace and run the command mkdir models && cd models/.

DBRX-Instruct is a gated model, so you will need to request permission in order to interact with the model. Please refer to the DBRX-Instruct repo for steps on how to do so.

Git LFS

Hugging Face uses Git LFS to manage large files, so we'll need to run a couple of commands to set it up:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

sudo apt-get install git-lfs

You can verify Git LFS is installed with git lfs --version. If the clone below only pulls small pointer files instead of the full weights, run git lfs install first to register the LFS filters with Git.

Clone DBRX-Instruct

Now, clone the repository with git clone https://huggingface.co/databricks/dbrx-instruct. NOTE: you will be prompted for your Hugging Face username and password; provide your access token when prompted for the password.

This will kick off the download of the entire repo, which is ~490 GiB. Luckily, this is running on our VM on a site with high-speed networking 😁. Even so, download speed can be limited by demand on the host server, so feel free to grab a coffee and come back when the download is done.

Clone DBRX-Instruct-Tokenizer

We'll use the fast tokenizer provided by Xenova, so again navigate to /workspace/models/ and clone the dbrx-instruct-tokenizer repository.

Clone this Repo

We'll make a directory to hold code on our boot disk. Run mkdir ~/dev && cd ~/dev, then clone this repository with git clone git@github.com:crusoecloud/dbrx_inference_tutorial.git && cd dbrx_inference_tutorial/.

Peripherals

Before we jump into our inference tutorials, let's install some quality-of-life peripherals. First, run sudo apt-get update, then sudo apt-get install tmux. We'll often have two or more processes running, so it's nice to have multiple windows to monitor each, and tmux is a great solution for session and window management.

To manage dependencies, we'll use virtualenv, which can be installed with sudo apt install python3-virtualenv.

If you run into issues with storage, ncdu is a useful tool for easy navigation.

Additionally, I recommend using the Remote - SSH extension with VS Code to connect and interact with remote code (unless you're a vim wizard).

vLLM

The fastest way to get up and running with DBRX-Instruct is vLLM. In a few steps, we'll have a high-performance, OpenAI-API compatible server up and running capable of generating tokens at ~80-90 t/s. First, navigate to vLLM/ in this tutorial’s repo.

Dependencies

Initialize a virtualenv with virtualenv venv and activate it with source venv/bin/activate. Run pip install packaging (required to build flash attention), then install the core dependencies with pip install -r requirements.txt.

vllm_example.py

The simplest example, pulled from the vLLM repo, only requires a couple of changes to work with our setup. We point the model and tokenizer paths to our local directories and set tensor_parallel_size to 8 when initializing our LLM to shard the model across the GPUs in the NVIDIA L40S node.
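For reference, here's a minimal sketch of what that script looks like, based on the vLLM quickstart (the prompt and sampling parameters are illustrative):

from vllm import LLM, SamplingParams

# Illustrative prompt and sampling settings
prompts = ["What is a mixture-of-experts model?"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Point at our local copies and shard across the 8 L40S GPUs
llm = LLM(
    model="/workspace/models/dbrx-instruct",
    tokenizer="/workspace/models/dbrx-instruct-tokenizer",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)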

OpenAI API Server

With our virtual environment activated, run pip install openai. Now, start a tmux session with tmux new -s vllm. After executing that command, you'll be automatically attached to the session and can detach at any time with ctrl+b then d. To re-attach, run tmux a -t vllm.

Now, we'll split our tmux session into two panes (one for the client and one for the server). Make sure you are attached to the tmux session, then press ctrl+b then %. This splits the session into two side-by-side panes. To move between them, press `ctrl+b` then the corresponding arrow key. In both panes, activate the virtual environment.

Switch to the right pane and run this command to start the server:

python -m vllm.entrypoints.openai.api_server \
--model /workspace/models/dbrx-instruct \
--tokenizer /workspace/models/dbrx-instruct-tokenizer \
--dtype auto \
--tensor-parallel-size 8 \
--trust-remote-code

Switch back to the left pane with ctrl+b and the left arrow key. After the server has started (you'll see logs indicating that the Uvicorn server is running on http://0.0.0.0:8000), run python client.py.

In the server pane, you'll see running metrics for average throughput and resource usage. On the left, you'll see a print-out of the server's response.
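For reference, client.py boils down to something like the following sketch using the OpenAI Python client (the prompt and max_tokens are illustrative; the model name must match the --model path passed to the server):

from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the api_key is unused by vLLM but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/workspace/models/dbrx-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)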

Congrats! You have a scalable backend with continuous batching capable of serving concurrent users with DBRX-Instruct. Note that this defaults to 4k context length and requires some modifications to reach 32k. In the next part of this tutorial, we'll look at how to optimize inference.
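If you want to see continuous batching at work, you can fire several requests at the server at once; here's a rough sketch using the same OpenAI client (the prompts and thread count are illustrative):

from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="/workspace/models/dbrx-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return response.choices[0].message.content

prompts = [f"Give me one fun fact about the number {i}." for i in range(8)]

# vLLM batches these overlapping requests together on the fly (continuous batching).
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)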

TGI

To serve DBRX-Instruct through text-generation-inference (TGI) by Hugging Face, first navigate to tgi/ in this repo.

Configure tmux

As with vLLM, we'll set up two panes to monitor our client and server. Create a tmux session with tmux new -s tgi, which will automatically attach. Create a vertical split with ctrl+b then %. Navigate to the right pane, where we will start the server, and run the command below to download and run the Docker container.

docker run \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /workspace/models:/models \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /models/dbrx-instruct \
  --tokenizer-config-path /models/dbrx-instruct-tokenizer/tokenizer_config.json \
  --num-shard 8

Now, navigate back to the left pane and create a virtual environment with virtualenv venv && source venv/bin/activate. Install requirements with pip install -r requirements.txt and run our client with python client.py. Opening this file, you'll see that we point the familiar OpenAI Python client at our TGI server by changing just a couple of lines, reusing the OpenAI Messages API.
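The change boils down to swapping the base URL to the TGI port we mapped above, roughly like this sketch (the prompt is illustrative, and the model name is just a placeholder since TGI serves a single model):

from openai import OpenAI

# TGI exposes an OpenAI-compatible Messages API; we mapped it to port 8080 above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # placeholder name; TGI serves whatever --model-id was loaded
    messages=[{"role": "user", "content": "Explain mixture-of-experts in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)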

Cleaning up resources

To delete our VM and the disk, we can simply run the following commands using the CLI:

crusoe compute vms delete --name dbrx-inference && \
crusoe storage disks delete dbrx-data







