Aug 02, 2024

QServe: Llama3 at 3,500 tokens/s on a single NVIDIA L40S GPU


Background

QServe is an inference engine for W4A8KV4 quantized models that delivers strong performance on NVIDIA L40S GPUs. Developed by the MIT Han Lab, a group focused on efficient machine learning, QServe and QoQ quantization are two of their latest releases aimed at serving large language models at blazing speeds. Llama3-8B-Instruct, a model with over 2 million downloads on HuggingFace, runs at 3,500 tokens per second on a single L40S GPU using QServe. In this blog, we'll show you how to quickly deploy QServe, create QoQ quants, and discuss use cases unlocked by such speed.

QServe

Before we jump in, ensure that you have the Crusoe CLI installed and configured to work with your account. We’ll use this tool to provision our resources and tear them down at the end.

First, clone this repository (llama3-qserve) to your local machine. Navigate to the root directory. Then, provision resources with:

crusoe storage disks create \
  --name qserve-disk-1 \
  --size 200GiB \
  --location us-east1-a

crusoe compute vms create \
  --name qserve-vm \
  --type l40s-48gb.1x \
  --disk name=qserve-disk-1,mode=read-write \
  --location us-east1-a \
  --image ubuntu22.04-nvidia-pcie-docker \
  --keyfile ~/.ssh/id_ed25519.pub \
  --startup-script startup.sh

The startup script takes care of creating a filesystem, mounting the disk, and installing dependencies. After creation has completed, ssh into the public IP address shown in the output of crusoe compute vms create.

Once in the VM, check on the startup script's status by running journalctl -u lifecycle-script.service -f. If you see Finished Run lifecycle scripts. at the bottom, then you're ready to proceed. Otherwise, wait until setup has completed. It can take ~10 minutes, as kernels are being compiled for the L40S GPU and large model files are being downloaded.

Benchmarking

After setup has completed, let's run a quick benchmark! Navigate to /workspace/llama3-qserve/qserve and run the below commands:

conda activate QServe
export MODEL=qserve_checkpoints/Llama-3-8B-Instruct-QServe-g128
GLOBAL_BATCH_SIZE=128 NUM_GPU_PAGE_BLOCKS=3200 python qserve_benchmark.py --model $MODEL --benchmarking --precision w4a8kv4 --group-size 128

This will run a few rounds of benchmarking with a sequence length of 1024, an output length of 512, and a batch size of 128. The throughput is logged to stdout and the results are saved to results.csv. Once completed, you should see something like Round 2 Throughput: 3568.7845477930728 tokens / second (your numbers may differ slightly).
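If you'd like to inspect the saved results programmatically, a minimal sketch is below. It only reads whatever the CSV contains; the exact columns written to results.csv depend on the QServe version, so check the file's header row rather than assuming specific names.

# Minimal sketch: read the benchmark output written by qserve_benchmark.py.
# No column names are assumed -- we simply print whatever the CSV contains.
import csv

with open("results.csv") as f:
    for row in csv.DictReader(f):
        print(row)  # one dict per benchmarking round (batch size, lengths, throughput, ...)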

Chat.py

We've included a simple chat script to show how to use the QServe Python library. First, copy the script to the VM with scp chat.py ubuntu@<vm-ip-address>:/workspace/llama3-qserve/qserve/. Then, from the qserve root directory, run the below command:

python chat.py --model $MODEL --ifb-mode --precision w4a8kv4 --quant-path $MODEL --group-size 128

This will bring up a command-line chat interface; simply type a prompt and hit enter to send it to the QServe engine. You'll see the assistant's response in stdout and can continue the conversation. Type exit and hit enter when you want to terminate the script.

Within chat.py, you can see that we begin by parsing the engine arguments which dictate the model being used, quantization configuration, etc.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Demo on using the LLMEngine class directly"
    )
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)

Then, we instantiate the engine.

def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)

In main, we register a conversation template (in this case, Llama3-8B-Instruct) and configure our sampling parameters.

def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    conv_t = get_conv_template_name(args.model)
    conv = get_conv_template(conv_t)
    sampling_params = SamplingParams(
        temperature=0.7, top_p=1.0, stop_token_ids=[128001, 128009], max_tokens=1024
    )

Then, we enter a loop where the bulk of the functionality is defined. To send a request to the engine, we first append the message to our conversation, which takes care of formatting and applying the model's template. By calling get_prompt(), we receive the conversation history in a format the LLM can generate from. Finally, we add the request to the engine along with a request_id that identifies it.

conv.append_message(conv.roles[0], user_input)
conv.append_message(conv.roles[1], "")
prompt = conv.get_prompt()
engine.add_request(0, prompt, sampling_params)

If ifb_mode is on, the engine will automatically schedule and pack requests for continuous/in-flight batching with no changes to the code. For this single-user application you won't notice a difference, but it is a drastic improvement when serving multiple concurrent users, as sketched below.
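For example, here is a minimal sketch of queueing several requests at once with --ifb-mode enabled. It reuses the engine and sampling_params objects from chat.py; the prompts and integer request IDs are illustrative (in chat.py, prompts are first formatted through the conversation template).

# Sketch: queue several requests and let in-flight batching schedule them together.
# Reuses the engine and sampling_params from chat.py; prompts are illustrative.
prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a haiku about GPUs.",
    "Explain W4A8KV4 quantization in one paragraph.",
]
for request_id, prompt in enumerate(prompts):
    engine.add_request(request_id, prompt, sampling_params)
# With --ifb-mode, each engine.step() call advances all pending requests,
# packing them into shared batches instead of serving them one at a time.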

To progress the engine, we call engine.step() and log the current outputs. We then check their status to see if any have finished. If we were serving concurrent users, we would use the request identifier to match results and route them back to the correct user (see the sketch after the snippet below).

request_outputs = engine.step()
for request_output in request_outputs:
    if request_output["finished"]:
        response = request_output["text"]
        ext_response = extract_llama3_assistant(response)
        print(f"Assistant: {ext_response}")
        conv.update_last_message(ext_response)
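To make the routing idea concrete, here is a hedged sketch of matching finished outputs back to the users who issued them. It assumes each output dict carries a request_id field alongside finished and text (verify the field name against your QServe version), and deliver_to_user is a hypothetical helper, not part of QServe.

# Sketch: route finished outputs back to the users who issued them.
# Assumes request_output exposes a "request_id" field; deliver_to_user is hypothetical.
pending = {}  # request_id -> user/session identifier

def submit(user_id, request_id, prompt):
    pending[request_id] = user_id
    engine.add_request(request_id, prompt, sampling_params)

for request_output in engine.step():
    if request_output["finished"]:
        user_id = pending.pop(request_output["request_id"], None)
        if user_id is not None:
            deliver_to_user(user_id, request_output["text"])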

Clean Up

To clean up the resources used, run the below commands:

crusoe compute vms stop qserve-vm
crusoe compute vms delete qserve-vm
crusoe storage disks delete qserve-disk-1

Use-Cases

One important metric for user-facing LLM applications, such as ChatGPT and Perplexity, is the average human reading speed: 200-300 words per minute, or roughly 8.33 tokens per second. This sets the floor for what counts as acceptably “fast” in chat applications. At 3,500 tokens per second, we’re far above this floor. So what does that enable, beyond being perceived as fast?

Reflection

It’s well-known at this point that allowing an LLM to reflect on or explain its actions improves performance. However, this comes at the cost of generating extra tokens. The overhead can be small, as with chain-of-thought, or significant, as when working with a code interpreter. Code interpreters take generated code from an LLM, execute it within a sandbox, then send the result back to the LLM. If the result is an error, the LLM can propose modifications and re-run this loop until the expected output is achieved. Naturally, the longer this loop can run, the better the results. However, there may be a user on the other end waiting for an answer, and this is where blazing speed becomes much more relevant. At 3,500 tokens/s, a code interpreter can correct itself many times over while still responding in a reasonable amount of time. In this case, the overall experience is not just faster but also more accurate.
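As a rough illustration of that loop, here is a minimal sketch; generate() and run_in_sandbox() are hypothetical stand-ins for a call into the serving engine and a real sandboxed interpreter, not QServe APIs.

# Sketch of the generate -> execute -> reflect loop for a code interpreter.
# generate() and run_in_sandbox() are hypothetical stand-ins, not QServe APIs.
def solve_with_interpreter(task, max_attempts=5):
    feedback = ""
    for _ in range(max_attempts):
        code = generate(f"Task: {task}\nPrevious error (if any): {feedback}\nWrite Python code:")
        ok, output = run_in_sandbox(code)  # returns (success, stdout or traceback)
        if ok:
            return output
        feedback = output  # feed the error back so the model can try again
    return None  # give up after max_attempts rounds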

Agents

In the previous section, the code interpreter can be thought of as an agent that acts on behalf of the primary language model and is greatly accelerated by the speed of the underlying LLM inference engine. This pattern extends to agents in general, as most involve some flavor of a Plan-Act-Reflect loop in which the model develops or consumes a plan, executes actions, and then reflects on the results of those actions.

As a simple example, a flight reservation agent could take a general instruction, “Book a flight in the morning from SFO to SEA”, and use a set of functions to carry out the instruction. As the agent repeatedly queries the environment for available flights, it can refine its queries and produce better plans over repeated iterations. When this process runs at 3,500 tokens/s, the agent can very quickly carry out an extensive set of tasks and explore a broad tree of possible actions. By the time a result is returned, the user can have greater confidence in the work performed by the agent, and they have saved their own time along the way.
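For illustration, the Plan-Act-Reflect pattern for this example might be sketched as follows; plan(), act(), and reflect() are hypothetical wrappers around LLM calls and tool invocations (such as a flight-search API) and are not part of QServe.

# Sketch of a Plan-Act-Reflect loop for the flight-booking example.
# plan(), act(), and reflect() are hypothetical LLM/tool wrappers.
def run_agent(instruction, max_steps=10):
    state = {"instruction": instruction, "observations": []}
    for _ in range(max_steps):
        action = plan(state)                       # LLM proposes the next action
        observation = act(action)                  # e.g. query available flights
        state["observations"].append(observation)
        done, answer = reflect(state)              # LLM decides whether the goal is met
        if done:
            return answer
    return None  # no satisfactory result within max_steps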

