Sep 16, 2024

Training 1M Context Length Models on NVIDIA L40S GPUs

by Leo Pekelis, Chief Scientist at Gradient AI

In the past month, the enthusiasm surrounding our long context models has sparked engaging discussions within the open-source community. While we have had the chance to share some of our initial thoughts on scaling long context models, we've yet to explore the hardware that made it possible. Join us as we take a deep dive into our decision to leverage NVIDIA L40S GPUs and how our partners at Crusoe made the process seamless. This work was also previously featured in VentureBeat – “How Gradient created an open LLM with a million-token context window.”

The Goal: Extending Llama 3

The launch of the Llama 3 models has garnered a lot of excitement, and for good reason. Their performance - the 70B model currently sits in 2nd place on the newest Open LLM Leaderboard - combined with Meta's committed support for open models and NVIDIA's API infrastructure support makes Llama 3 an excellent foundation model to build on.

However, one thing that stood out at release was their noticeably small context length. As a refresher, a model's context length determines how much text the model can consider across its input and output at one time, measured in tokens. Frontier models now offer long context windows of up to 128K tokens - roughly 90,000 words, or 150 pages (12 pt font, single spaced) - which may suffice for typical workloads but falls short for enterprise solutions that often call for considerably longer context (e.g. financial institutions routinely ingest company 10-K reports of 250K+ words each).
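As a quick sanity check on those conversions, here is a back-of-envelope sketch; the words-per-token and words-per-page ratios are rough assumptions, not exact figures:

```python
# Rough token-to-page arithmetic; the ratios below are assumptions
# (~0.7 words per token, ~600 words per single-spaced 12 pt page).
tokens = 128_000
words = int(tokens * 0.7)   # ~90,000 words
pages = words / 600         # ~150 pages

print(f"{tokens:,} tokens ~ {words:,} words ~ {pages:.0f} pages")
```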

Our team jumped on this opportunity and used Gradient’s context-extension training strategy to increase the context window of both the 8B and 70B models from 8K to over 1M tokens - making them the longest-context open source models, and second only to Google’s Gemini among all models, open or closed. These models scored perfect retrieval on NIAH (Needle-in-a-Haystack) and have shown tremendous results on both RULER and the Open LLM Leaderboard.

Llama 3 8B: 262K, 1M, and 4M

Llama 3 70B: 262K, 524K, and 1M

The Infrastructure That Made it Possible

When we discussed the infrastructure requirements for this project with our compute partner, Crusoe, two considerations surfaced as paramount:

  1. Efficiency:

    While the training costs of open source models are amortized across the community, we at Gradient develop technology for in-house enterprise deployment, so efficient training is essential to keeping up our pace of innovation for our customers.

  2. Speed:

    New model launches on the scale of Llama3 don’t happen every day. We wanted to strike while the iron was hot.

Luckily, Crusoe’s platform offers a large degree of customizability, and through great collaboration with their team we built a system that hit both targets. The key infrastructure components were:

[Hardware] NVIDIA L40S GPUs for quick availability and standout FP8 performance

[Training] Memory & network optimized distributed training utilizing RingAttention

[Infra] Crusoe’s scalable, frictionless architecture

Hardware Selection

While many models are trained on NVIDIA H100 Tensor Core or A100 Tensor Core GPUs, we decided to use L40S GPUs because they were more readily available in a highly constrained market - allowing us to race to long context - and, at comparable cost, offer better performance than A100s when their FP8 support is utilized.

The following graph shows the tokens per second (TPS) we observed when running NVIDIA’s TensorRT-LLM benchmark on L40S vs. A100 GPUs with increasing batch size and an 8,092-token context length. While bfloat16 FLOPS are comparable between the two GPUs on spec sheets, leveraging FP8 precision, which is not supported on A100s, gives the L40S markedly higher throughput, especially as token loads increase.
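To make the measurement concrete - this is not NVIDIA's TensorRT-LLM benchmark itself, just a minimal sketch of how decode throughput can be timed with Hugging Face transformers; the checkpoint name, batch size, and token counts are placeholders:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

batch_size, new_tokens = 8, 128  # assumed benchmark settings
prompts = ["Summarize the quarterly report in one paragraph."] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Total newly generated tokens across the batch divided by wall-clock time
generated = (out.shape[1] - inputs["input_ids"].shape[1]) * batch_size
print(f"throughput: {generated / elapsed:.1f} tokens/sec")
```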

Training Technology

The remaining challenge with L40S GPUs was the 48 GB VRAM limit. In practice, this ended up being less of a limitation than we initially thought, because no available GPU can hold a very long context in memory - the attention calculation itself is the bottleneck. Even for the most efficient attention implementations, FLOPS grow quadratically and VRAM use grows linearly with context length, putting an effective cap on how long a context a single GPU can handle.
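A rough illustration of that scaling, assuming a FlashAttention-style kernel (so the L x L score matrix is never materialized) and hypothetical Llama-3-8B-like shapes:

```python
# Scaling of self-attention cost with context length L, assuming a
# FlashAttention-style kernel: FLOPs ~ O(L^2), activation memory ~ O(L).
# Shapes are hypothetical Llama-3-8B-like values, batch size 1.
n_layers, n_heads, head_dim = 32, 32, 128
d_model = n_heads * head_dim
bytes_per_elem = 2  # bf16

for L in [8_192, 65_536, 262_144, 1_048_576]:
    # QK^T and (softmax @ V): two L x L x d_model matmuls per layer
    flops = n_layers * 4 * L * L * d_model
    # Q, K, V, O activations per layer (no L x L score matrix is stored)
    act_bytes = n_layers * 4 * L * d_model * bytes_per_elem
    print(f"L={L:>9,}: attention ~{flops / 1e15:9.2f} PFLOPs, "
          f"activations ~{act_bytes / 2**30:7.1f} GiB")
```

Even without the quadratic score matrix in memory, a 1M-token forward pass puts attention activations alone well beyond a single 48 GB card, which is what motivates sharding the sequence itself across GPUs.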

Ring Attention breaks this bottleneck, allowing effectively unbounded context length by parallelizing the attention calculation across many GPUs arranged in a ring topology so that computation and network communication overlap. It is primarily a memory optimization, though; in practice, the throughput drop-off as GPUs are added makes training time impractical without careful optimization.
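Our training stack is proprietary, but the core mechanic of Ring Attention itself can be sketched in a toy, single-process form: each "rank" owns one block of queries, key/value blocks logically rotate around the ring, and a streaming softmax accumulates the result so no rank ever holds the full attention matrix. Names and shapes below are illustrative only (single head, no causal mask):

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Toy single-process sketch of Ring Attention (single head, no causal mask).

    Each "rank" i owns one query block; KV blocks rotate around the ring and
    are folded into a running online-softmax state, so no rank ever needs the
    full L x L score matrix. On real hardware, the KV block for the next step
    is received from the neighboring GPU while the current block is processed.
    """
    world = len(q_blocks)
    d = q_blocks[0].shape[-1]
    outs = []
    for i in range(world):
        q = q_blocks[i]
        m = np.full(q.shape[0], -np.inf)   # running row-wise max of scores
        denom = np.zeros(q.shape[0])       # running softmax denominator
        num = np.zeros_like(q)             # running weighted sum of values
        for step in range(world):
            j = (i + step) % world         # KV block "arriving" at this step
            scores = q @ k_blocks[j].T / np.sqrt(d)
            m_new = np.maximum(m, scores.max(axis=-1))
            scale = np.exp(m - m_new)      # rescale previous partial results
            p = np.exp(scores - m_new[:, None])
            denom = denom * scale + p.sum(axis=-1)
            num = num * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outs.append(num / denom[:, None])
    return np.concatenate(outs, axis=0)

# Toy usage: 4 "ranks", each holding a 256-token block of a 1,024-token sequence
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((256, 64)) for _ in range(4)]
print(ring_attention(blocks, blocks, blocks).shape)  # (1024, 64)
```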

At Gradient, we took attention parallelization further by load balancing the computation across GPUs, and layered on other optimization strategies (e.g. data & tensor parallelism) to give us a fully configurable training topology that we could then match to Crusoe’s L40S network topology, keeping throughput high.
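As an illustration of one well-known balancing idea (not necessarily the exact scheme we use): with a causal mask, later sequence chunks attend to many more keys than earlier ones, so a "zig-zag" assignment that pairs chunk i with chunk 2N-1-i gives every rank roughly the same amount of attention work:

```python
def zigzag_assignment(num_ranks):
    """Split a sequence into 2 * num_ranks chunks and pair the i-th chunk with
    the (2N-1-i)-th on rank i, so causal-attention work (which grows with chunk
    position) sums to roughly the same amount on every rank."""
    return {rank: (rank, 2 * num_ranks - 1 - rank) for rank in range(num_ranks)}

# With 4 ranks the sequence is cut into 8 chunks:
# rank 0 -> (0, 7), rank 1 -> (1, 6), rank 2 -> (2, 5), rank 3 -> (3, 4)
print(zigzag_assignment(4))
```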

While our technology is proprietary, we invite you to try out EasyContext, an excellent open source implementation of some techniques discussed here!

Frictionless Infrastructure

Lastly, Crusoe Cloud’s infrastructure was efficient, allowing us to get up and running with an automated and scalable deployment in hours. We were pleased to see that VMs were allocated and initialized in seconds and could be bulk-created and configured via API, which made our deployment very amenable to tooling like Ansible.
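For a sense of what that automation looks like, here is a purely hypothetical sketch of bulk provisioning against a REST API; the endpoint, instance type name, and response fields are placeholders, not Crusoe's actual API:

```python
import requests

API = "https://api.example-cloud.com/v1/instances"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}        # placeholder credential

def create_vms(count, vm_type="l40s.8x"):            # hypothetical type name
    """Bulk-create GPU VMs and collect their IPs for an Ansible inventory."""
    ips = []
    for i in range(count):
        resp = requests.post(
            API,
            headers=HEADERS,
            json={"name": f"train-node-{i}", "type": vm_type},
            timeout=30,
        )
        resp.raise_for_status()
        ips.append(resp.json()["ip_address"])         # placeholder field name
    return ips

if __name__ == "__main__":
    for ip in create_vms(4):
        print(ip)  # feed these into an Ansible inventory file
```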

It should also be noted that Crusoe’s network was exceptionally reliable with no dropped connections across days of training runs.

Cost Benchmarking

Having hit our release goals (our first model launched on Hugging Face one week after Meta’s), let’s consider our efficiency.

Using Crusoe’s on-demand L40S price of $1.45/GPU-hour, we estimate the cost of fine-tuning the 70B model at $167.56 per million tokens and the 8B at $12.83 per million tokens. For reference, fine-tuning GPT-4o via OpenAI’s API costs $25 per million tokens. Taking into account that such API fine-tuning is limited to domain specialization and does not allow for context extension, our approach compares favorably.
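For readers who want to reproduce this kind of estimate, the arithmetic is just hourly cluster cost divided by training throughput; the GPU count and throughput below are placeholders, not Gradient's actual cluster configuration:

```python
def cost_per_million_tokens(price_per_gpu_hour, num_gpus, tokens_per_sec):
    """$/million training tokens from hourly GPU price and total throughput."""
    cluster_cost_per_sec = price_per_gpu_hour * num_gpus / 3600
    return cluster_cost_per_sec / tokens_per_sec * 1e6

# Example: 32 L40S GPUs at $1.45/GPU-hour sustaining 1,000 training tokens/sec
print(f"${cost_per_million_tokens(1.45, 32, 1_000):.2f} per million tokens")
```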

Reducing Carbon Emissions

The demand for more powerful AI models is skyrocketing, leading to a dramatic expansion of supporting infrastructure. In 2023, data centers consumed 1-1.5% of global electricity, with energy widely recognized as a limiting factor in developing larger, more capable models.

Crusoe Cloud addresses this challenge by powering its data centers with a mix of wasted, stranded, and clean energy resources. This approach enables large-scale AI workloads while aligning with climate goals. In 2023, Crusoe's operations avoided 55% more emissions than they generated.

Conclusion

We hope this was an illustrative look into the infrastructure behind large model training and how the L40S can be a powerful and efficient workhorse for even the most demanding applications. As always, a huge thank you to Crusoe for the continued partnership and the compute infrastructure.

Crusoe and Gradient recently presented on this topic together at SIGGRAPH 2024 in a session entitled “How Gradient Extended Llama 3's Context Length to 1M on Crusoe.” For the full presentation materials, see below.

How Gradient Extended Llama 3's Context Length to 1M on Crusoe Presentation Materials

