Building a custom chat assistant with open-source LLMs and RAG

As an AI cloud infrastructure company, Crusoe is constantly testing emerging AI technologies. So, we decided to build our own chat assistant using an LLM and the increasingly-ubiquitous RAG approach on an NVIDIA A100.

Today, we’re sharing how we did it – and how you can set one up with your own data.

While chatbots have been around for decades - the famous ELIZA was released all the way back in 1966 - it was only recently that Large Language Models (LLMs) started breathing new life into conversational experiences, with models that can seemingly understand, reason, rhyme, and even make jokes. But besides being entertaining, chat interfaces can provide an easy and efficient way to explore heaps of unstructured data.

The Secret to Knowledge: Retrieval-Augmented Generation (aka RAG)

Pre-trained foundation models such as Llama, MosaicML’s MPT-30B collection, and Together’s RedPajama collection can each serve as a highly capable LLM. However, while such models can answer a wide breadth of questions, they’re drawing from the data that was in their training set – not necessarily data tailored to a specific use case. And while it’s possible to train existing models on new data via fine-tuning, the dataset preparation and compute required can still be prohibitive.

Fortunately, the transformer architecture that powers most of today’s LLMs (but not all; check out Together’s Striped Hyena for one new approach) excels at reusing information given in the prompt when generating new output. This means one can inject knowledge into the model with each prompt.

Imagine if someone were to ask our sales team “What GPUs are available on Crusoe Cloud?” and we could intercept and augment the prompt to our model so it looked more like:

Answer the following customer question to the best of your ability:

What GPUs are available on Crusoe Cloud?
Potentially helpful context to reference:
Crusoe Cloud offers NVIDIA H100s, NVIDIA A100s, NVIDIA A40s, NVIDIA L40S’s and soon, NVIDIA Blackwell instances and AMD MI300X’s.

It turns out we can! To an extent. Transformer models have to contend with a context window, which is a cap on how many tokens (or, effectively, words) can fit inside a model’s “attention span”. The solution is to be clever about picking what information we inject into each prompt, using a popular approach called RAG - Retrieval-Augmented Generation. Because of its efficacy and ease of setup, RAG has already become a staple in many systems using LLMs.
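To make the context-window constraint concrete, here is a minimal sketch of prompt augmentation under a budget. The function name, the 200-token budget, and the rule of counting one token per whitespace-separated word are all assumptions for illustration; real tokenizers count differently.

```python
def build_augmented_prompt(question, snippets, max_tokens=200):
    """Append context snippets to the question until a rough token budget is hit."""
    prompt = (
        "Answer the following customer question to the best of your ability:\n\n"
        + question
        + "\nPotentially helpful context to reference:\n"
    )
    # Crude assumption: one token per whitespace-separated word.
    used = len(prompt.split())
    for snippet in snippets:
        cost = len(snippet.split())
        if used + cost > max_tokens:
            break  # stop before overflowing the (pretend) context window
        prompt += snippet + "\n"
        used += cost
    return prompt


snippets = [
    "Crusoe Cloud offers NVIDIA H100s, NVIDIA A100s, NVIDIA A40s and NVIDIA L40S's.",
    "filler " * 500,  # an oversized chunk that does not fit in the budget
]
prompt = build_augmented_prompt("What GPUs are available on Crusoe Cloud?", snippets)
print(prompt)
```

The oversized second snippet is left out, which is exactly why the choice of what to inject matters.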

A RAG pipeline involves the following steps:

1. Rank documents or document chunks in terms of similarity to tokens in the user’s prompt.
2. Augment the given prompt with the most relevant data chunks retrieved in step 1.
3. Generate a response using the augmented prompt from step 2.
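The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the pipeline we built: ranking here is simple word overlap rather than a vector store, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
def rank(question, chunks, top_k=2):
    """Step 1: order chunks by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(chunk):
        return len(question_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]


def augment(question, top_chunks):
    """Step 2: prepend the retrieved chunks to the user's question."""
    context = "\n".join(top_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt, llm):
    """Step 3: hand the augmented prompt to the model."""
    return llm(prompt)


def fake_llm(prompt):
    # Hypothetical stand-in so the sketch runs without a GPU.
    return f"(model answer based on {len(prompt)} prompt characters)"


chunks = [
    "Crusoe Cloud offers NVIDIA A100 and H100 GPUs.",
    "ELIZA was an early chatbot released in 1966.",
    "RAG stands for Retrieval-Augmented Generation.",
]
question = "Which GPUs does Crusoe Cloud offer?"
top = rank(question, chunks, top_k=1)
answer = generate(augment(question, top), fake_llm)
print(top)
print(answer)
```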

Visualized RAG Pipeline

The vector store in the diagram is a database purpose-built for storing embeddings: numeric vectors that place pieces of text with related meaning close to each other. This allows us to quickly match the user’s prompt to relevant text in our documents, including synonyms and paraphrases that we’d miss with a regular text search.
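One common way a vector store compares text is to represent each chunk as a numeric vector and rank chunks by cosine similarity to the query's vector. The sketch below uses made-up three-dimensional vectors purely for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings: the first axis loosely means "about GPUs".
store = {
    "Crusoe Cloud offers NVIDIA GPUs.": [0.9, 0.1, 0.0],
    "Our ESG report covers emissions.": [0.1, 0.9, 0.1],
    "Graphics cards available to rent.": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "what GPUs can I use?"

results = top_k(query_vector, store, k=2)
print(results)
```

Note that the "graphics cards" sentence ranks highly even though it never uses the word "GPU", which is the behavior a plain keyword search would miss.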

Let’s Get Chatting

The first step in setting all of this up is getting an LLM running. For now, we’re going to go with https://huggingface.co/NousResearch/Nous-Capybara-34B, a very popular 34B parameter model. If you’re following along on less powerful hardware, we recommend trying a smaller model like https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B.

An aside: For finding the latest and most interesting models, two great starting places are HuggingFace and Reddit’s r/LocalLLaMA. Your choice of model can have a massive impact not only on output quality, but also style, creativity, and verbosity. For production and customer-facing applications, less creative, censored (also known as aligned) models are your best bet.

Now, we have to get our model up and running and ready to receive queries (to perform inference). There are several options for inference, including popular engines like llama.cpp and vLLM, as well as tools with GUIs. For Crusoe’s chat assistant, we used HuggingFace’s text-generation-inference engine, which we found to be a great balance between performance, ease of configuration, and deployability. To get our model up and running, all we need to do is SSH into our GPU instance (we used an NVIDIA A100 on Crusoe Cloud) and create a docker-compose.yaml file with the following:

version: '3'
services:
  inference-engine:
    container_name: inference
    image: ghcr.io/huggingface/text-generation-inference:1.4.0
    network_mode: "host"
    shm_size: '1g'
    restart: unless-stopped
    # most popular LLMs from https://huggingface.co/ will also work
    command: --model-id NousResearch/Nous-Capybara-34B
    environment:
      # different models might require tweaking input length and total tokens
      MAX_INPUT_LENGTH: 4096
      MAX_TOTAL_TOKENS: 8192
      CUDA_MEMORY_FRACTION: 0.9
      HUGGINGFACE_HUB_CACHE: /data
      PORT: 8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      # directory to download models. ensure you have space for the
      # number of parameters in your chosen model!
      - ./data:/data

Once we have this file, running docker compose up should get our model up and running from our device, with an OpenAI API-compatible web server exposing it. It’ll take several minutes. To verify it’s working, we can test it with

curl 127.0.0.1:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is AI good for?","parameters":{"max_new_tokens":48}}'

After a few moments, we should get a response that looks more or less like

{"generated_text":"\n\nArtificial intelligence (AI) is a powerful tool that can be used to automate tasks, improve efficiency, and enhance decision-making. AI can be used to analyze large amounts of data, identify patterns, and make predictions."}
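If you'd rather send the same test request from Python, a minimal sketch using only the standard library might look like this. It assumes the server is listening on the PORT set in the compose file (8080); the helper name `build_generate_request` is our own, not part of any library.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=48, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the inference server's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("What is AI good for?")
print(request.full_url)

# Uncomment on a machine where the inference server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["generated_text"])
```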

Great! We now have a working LLM that we can talk to. But at this point, it still only knows about whatever data happened to be in its training set. Next, we’ll configure a RAG pipeline around it to imbue it with new knowledge.

From Riches to RAGs

Whether for a person or for a machine, the first step in imbuing knowledge is collecting knowledge. A RAG pipeline needs a document collection to traverse. For Crusoe’s assistant, we started with articles from our blog, some documentation pages, and the text from our latest ESG Report (a great read!), among a few others. Then, we placed all of these in text files under a new directory called docs.

Next, we’ll write the Python script responsible for running through our RAG pipeline. Fortunately, open source libraries like LangChain have made it incredibly easy to spin up sophisticated inference pipelines. Let’s start by installing the libraries we’ll be using:

pip install chromadb langchain sentence-transformers text_generation unstructured

With these libraries at hand, we can then

load our documents into chromadb, a popular, in-memory vector store for efficient retrieval
configure parameters for making requests to our LLM
create a RAG pipeline using LangChain

all in under 80 lines of code!

# chain.py
import logging

from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms.huggingface_text_gen_inference import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# NOTE: you'll want to modify this template for your use case
# Certain models also respond better to different message start/end tags.
# Usually, that info will be in the model's Hugging Face details card or release notes.
RAG_PROMPT_TEMPLATE = """<|im_start|>system
Here is context from different documents that may be useful.
---
{context}
---
Answer the question as an eager, highly professional spokesperson for Crusoe Cloud. Try to only reference information in the context above, and give thorough answers. Don't make any references to having been given the above documents.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

# Log at INFO level so the progress messages below are actually printed
logging.basicConfig(level=logging.INFO)

# Configure our document loader, including how we'll split documents into smaller chunks
loader = DirectoryLoader("./docs")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)

# Configure the embeddings model that we'll use to embed our docs
# This BAAI model has been found to work well across popular open models
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

logging.info("Creating Vector Store...")
docs = loader.load_and_split(text_splitter)
vectorstore = Chroma.from_documents(documents=docs, embedding=hf)
logging.info("Vector Store Created!")

# Define model parameters. More on these in the blog below under Tweaking and Tuning!
# For an explanation of each, see https://huggingface.co/docs/api-inference/detailed_parameters
llm = HuggingFaceTextGenInference(
    inference_server_url="http://0.0.0.0:8080/",
    max_new_tokens=4096,
    top_k=10,
    top_p=0.9,
    typical_p=0.9,
    temperature=0.5,
    repetition_penalty=1.03,
    streaming=True,
)

# Configure our langchain pipeline to log verbosely,
# so we can see which chunks are being selected for each prompt
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.DEBUG)

# Define the pipeline
retriever = vectorstore.as_retriever()
rag_prompt = PromptTemplate.from_template(RAG_PROMPT_TEMPLATE)
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | rag_prompt | llm

# Prompt user for a question and stream response out
question = input('Ask a question: ')
for chunk in rag_chain.stream(question):
    print(chunk, end='')

Now we should be able to run it with python chain.py. Go ahead and try out your new, custom assistant!
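One optional refinement, which is our suggestion rather than part of the script above: the retriever returns document objects, and joining only their text before it fills {context} keeps file metadata out of the prompt. The sketch below uses a stand-in Document class and hypothetical file names so it runs without LangChain installed; LangChain's own documents expose their text as page_content in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a retrieved document; only page_content and metadata are modeled."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, leaving metadata out of the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical retrieved chunks and file names, for illustration only.
docs = [
    Document("Crusoe Cloud offers NVIDIA A100s.", {"source": "docs/gpus.txt"}),
    Document("RAG augments prompts with retrieved text.", {"source": "docs/rag.txt"}),
]
context = format_docs(docs)
print(context)
```

In a LangChain pipeline, a helper like this is typically piped after the retriever (for example, retriever | format_docs as the "context" entry), though you should check that against the LangChain version you have installed.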

With a little more code, we can wrap this with our own API server for integration into a larger system, like a web app, using something like FastAPI.

Tweaking and Tuning

The script above includes a number of fixed constants which influence the way your assistant works. While the above values worked well for us, different use cases, data collections, and models can benefit from some tweaking. A quick rundown of a few of the more important constants:

Model Temperature: The temperature setting can be thought of as how much randomness to introduce during generation. A temperature setting of 0 will always yield the same output given the same prompt, and everything else unchanged. Conversely, a temperature setting of 1 (or higher) will consistently yield different outputs given the same prompt. Too high of a temperature setting tends to cause nonsensical output.
Chunk Size: The size of blocks that documents should be split into. Splitting documents into multiple blocks, or chunks, helps ensure that only the most relevant excerpts are included in the augmented context window. However, too small of a chunk can cause helpful context to be dropped.
Chunk Overlap: How much chunks should overlap when a document is split. A setting of 0 will cause no overlap, and potentially drop helpful context, whereas too high of a setting will quickly fill the model’s precious context window with redundant excerpts.
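To build some intuition for temperature, here is a small sketch of the usual formulation: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The three logits are made-up numbers, and real inference engines layer further sampling controls (such as top_k and top_p) on top of this.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.5)  # sharper: favors the top token
hot = softmax_with_temperature(logits, 2.0)   # flatter: more randomness

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

A lower temperature concentrates probability on the most likely token, while a higher one spreads it out. A temperature of exactly 0 cannot be plugged into this formula; it is commonly handled as a special case that always picks the top token.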

For more details on document splitting (including more sophisticated approaches like lexical analysis), LangChain’s documentation is a great resource.
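As a rough illustration of how chunk size and chunk overlap interact, here is a deliberately simplified character-based splitter. It is a sketch only: LangChain's RecursiveCharacterTextSplitter is smarter and tries to break on natural separators such as paragraphs and sentences, whereas this helper (a name of our own) cuts at fixed offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))
```

With an overlap of 1, each chunk repeats the last character of the previous one; with an overlap of 0, nothing is shared, so a sentence straddling a boundary would be cut cleanly in two.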

Closing Thoughts

This blog post would be incomplete without any input from our chat assistant. I asked our assistant for help, telling it “I'm writing a blog post about Crusoe. Give me a strong, short conclusion paragraph telling me why using Crusoe Cloud is different from using any other cloud.”

At Crusoe Cloud, we're not just another cloud provider—we're the future of computing aligned with the future of the climate. Our high-performance, cloud platform is designed specifically for innovators, offering unparalleled performance at a fraction of the cost of other providers. By colocating data centers with energy production and utilizing stranded sources of energy, we're not only revolutionizing cloud computing but also making a tangible impact on the environment. With Crusoe Cloud, innovators can focus on solving the world's most complex problems while making measurable progress towards their environmental goals.

Nailed it! We’re incredibly pleased with the results and are already looking at accelerating our internal productivity with similar assistants built for coding, navigating our internal documentation and more. After all, as the cloud provider for many incredible AI teams, we have a front row seat to the industry’s most exciting advancements, and an obligation to deeply understand their workloads. This means keeping AI at the forefront of everything we do.

At Crusoe, our access to cutting-edge compute hardware, such as the upcoming Blackwell GPUs, gives us a unique opportunity to experiment with the latest innovations in AI. While our assistant excels at extolling the story of Crusoe, the underlying architecture only requires a few tweaks to empower all kinds of different use cases: trawling through dense and technical docs, helping onboard new users or employees, acting as first-tier support, and more.

Lastly, if building an AI cloud sounds like the kind of work that captivates you, we’re hiring!
