Quick Primer on Running GGUF models on Ollama
Bhaskar S | 01/04/2025
GPT-Generated Unified Format (or GGUF for short) is a binary file format for efficient storage, distribution, and deployment of LLM models.
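Every GGUF file starts with a fixed 4-byte magic value (the ASCII string GGUF) followed by a format version number. The following is a minimal Python sketch (the file path is merely an example) to sanity-check that a downloaded file is indeed in the GGUF format:

```python
import struct

# Example path to a downloaded GGUF file - adjust to your setup
gguf_path = "model.gguf"

with open(gguf_path, "rb") as f:
    magic = f.read(4)                             # the first 4 bytes must be b'GGUF'
    version = struct.unpack("<I", f.read(4))[0]   # little-endian uint32 format version

if magic == b"GGUF":
    print(f"Valid GGUF file (format version {version})")
else:
    print("NOT a GGUF file !!!")
```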
In the article on Ollama, we demonstrated how one can deploy LLM model(s) on a local desktop.
In this article, we will demonstrate how one can deploy and use any LLM model in the GGUF format using Ollama.
The installation and setup will be on an Ubuntu 24.04 LTS based Linux desktop. Ensure that Docker is installed and set up on the desktop (see instructions).
Also, ensure the command-line utility curl is installed on the Linux desktop.
The following are the steps one can follow to download, deploy, and use a GGUF model in Ollama:
Open two Terminal windows (referred to as term-1 and term-2).
Create a directory for downloading and storing the LLM model GGUF file by executing the following command in term-1:
$ mkdir -p $HOME/.ollama/GGUF
For this demonstration, we will deploy and test the open source 8-bit quantized DeepSeek-v3 model, which has caused quite a buzz recently by challenging the popular proprietary AI models.
We will download the 8-bit quantized DeepSeek-v3 model in the GGUF format from HuggingFace by executing the following commands in term-1:
$ cd $HOME/.ollama/GGUF
$ curl -L -O https://huggingface.co/LoupGarou/deepseek-coder-6.7b-instruct-pythagora-v3-gguf/resolve/main/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf
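Once the download completes, one can optionally verify the size of the downloaded file (which should be about 7.2 GB for this 8-bit quantized model) by executing the following command in term-1:

$ ls -lh $HOME/.ollama/GGUF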
With a 1 Gbps internet connection, the download will take between 3 and 5 minutes !!!
Create a model file for Ollama for the just downloaded GGUF file by executing the following command in term-1:
$ cd $HOME/.ollama
$ echo 'FROM /root/.ollama/GGUF/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf' > deepseek-v3-Q8_0_gguf.txt
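Note that the model file contains just a single FROM instruction pointing at the GGUF file (the path is as seen from inside the Ollama container we will start next). If desired, the model file could be enriched with additional Ollama directives such as PARAMETER and SYSTEM; the following is a hypothetical example (the parameter values are purely illustrative):

```
FROM /root/.ollama/GGUF/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf
PARAMETER temperature 0.6
PARAMETER num_ctx 4096
SYSTEM "You are a helpful coding assistant."
```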
Start the Ollama platform by executing the following docker command in term-1:
$ cd $HOME
$ docker run --rm --name ollama --gpus=all -p 192.168.1.25:11434:11434 -v $HOME/.ollama:/root/.ollama ollama/ollama:0.5.4
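Once the Ollama platform is up and running, one can verify it is reachable by executing the following command in term-2 (adjust the IP address to match your setup):

$ curl http://192.168.1.25:11434/api/version

The response should be a small JSON payload containing the Ollama version (0.5.4 in this case).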
To list all the LLM models that are deployed in the Ollama platform, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama list
The following would be the typical output:
NAME ID SIZE MODIFIED
To deploy the just downloaded GGUF model into the Ollama platform, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama create deepseek-v3-Q8_0 -f /root/.ollama/deepseek-v3-Q8_0_gguf.txt
The above command would take about a minute to execute and generate the following typical output on completion:
transferring model data 100%
using existing layer sha256:636545fc45204417c1c38ce42126b807f126d80dddc912e07c3a8d90ecdfcd00
using autodetected template alpaca
using existing layer sha256:afa0ae3294fbad4c6b60d110ae6e034b3dfdd5e0acf4d2f3eaa0b888633f7ffe
creating new layer sha256:6e6eb6f365d1c295f24b2bf7e7db63a37d5da88dda6a453a84f0c140476a377b
writing manifest
success
To verify the 8-bit quantized DeepSeek-v3 model was deployed successfully, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama list
The following was the output from my desktop:
NAME                       ID              SIZE      MODIFIED
deepseek-v3-Q8_0:latest    5af2e8c42525    7.2 GB    About a minute ago
To test the 8-bit quantized DeepSeek-v3 model, execute the following docker command in term-2:
$ docker exec -it ollama ollama run deepseek-v3-Q8_0:latest
After the LLM model is loaded (which will take a few seconds the first time), the command will wait for user input at the prompt >>> Send a message (/? for help).
To test the just-loaded LLM model, enter the following user prompt:
>>> assuming ollama llm chat, generate python code using langchain to chat with the llm with message history
Using DeepSeek-v3, the following would be the typical response:
```python
from langchain.chat_models import OLLAMA_LLM
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

class LLMWithHistory:
    def __init__(self):
        self.llm = OLLAMA_LLM()
        self.messages = []

    def add_message(self, message):
        if isinstance(message, str):
            message = HumanMessage(content=message)
        self.messages.append(message)

    def get_response(self):
        ai_message = AIMessage(content=str(self.llm([msg.content for msg in self.messages], return_generated=True)))
        self.add_message(ai_message)
        return ai_message

chatbot = LLMWithHistory()
chatbot.add_message("Hello, how are you?")
print(chatbot.get_response().content)  # "Hello!"
```
Interesting - there is no class called OLLAMA_LLM !!!
Using ChatGPT, the following was the generated response:
```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# Initialize the chat model
# Replace 'YOUR_API_KEY' with your actual API key or authentication method
llm = ChatOpenAI(model="ollama-llm", temperature=0.7)

# Initialize the message history
message_history = [
    SystemMessage(content="You are a helpful assistant."),
]

def chat_with_llm(user_input):
    global message_history

    # Add the user's message to the history
    message_history.append(HumanMessage(content=user_input))

    # Generate a response from the LLM
    response = llm(message_history)

    # Add the AI's response to the history
    message_history.append(AIMessage(content=response.content))

    # Return the AI's response
    return response.content

# Example usage
if __name__ == "__main__":
    print("Chat with the LLM. Type 'exit' to end the conversation.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            print("Ending the chat. Goodbye!")
            break
        ai_response = chat_with_llm(user_input)
        print(f"AI: {ai_response}")
```
Interesting - the initialization of the class ChatOpenAI is not correct !!!
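Neither of the generated code snippets works as-is. For reference, the following is a minimal working sketch (assuming the langchain-ollama Python package is installed; the model name and base URL match the deployment above) for chatting with the local Ollama model while preserving the message history:

```python
from langchain_ollama import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Point the chat model at the local Ollama instance (adjust base_url to your setup)
llm = ChatOllama(model="deepseek-v3-Q8_0", base_url="http://192.168.1.25:11434")

# The message history starts with a system message
messages = [SystemMessage(content="You are a helpful assistant.")]

def chat(user_input: str) -> str:
    # Append the user's message, invoke the model with the full history,
    # and append the model's reply so the conversation context is preserved
    messages.append(HumanMessage(content=user_input))
    response = llm.invoke(messages)  # returns an AIMessage
    messages.append(AIMessage(content=response.content))
    return response.content

if __name__ == "__main__":
    print("Chat with the LLM. Type 'exit' to end the conversation.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break
        print(f"AI: {chat(user_input)}")
```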
On to the next test: solving a calculus derivative problem using the following prompt:
>>> find the derivative of y = (x + 1) * sqrt(x)
Using DeepSeek-v3, the following would be the typical response:
Hmm - the answer is WRONG !!!
Using ChatGPT, the following was the generated response:
Good - the answer is CORRECT !!!
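For reference, the correct derivative can be worked out by expanding the expression before differentiating:

```latex
y = (x + 1)\sqrt{x} = x^{3/2} + x^{1/2}

\frac{dy}{dx} = \frac{3}{2}\,x^{1/2} + \frac{1}{2}\,x^{-1/2} = \frac{3x + 1}{2\sqrt{x}}
```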
To exit the interactive prompt, enter the following command:
>>> /bye
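As an aside, one can also query the deployed model non-interactively via the Ollama REST API by executing the following command in term-2:

$ curl http://192.168.1.25:11434/api/generate -d '{"model": "deepseek-v3-Q8_0", "prompt": "find the derivative of y = (x + 1) * sqrt(x)", "stream": false}'

Setting "stream" to false returns the complete response as a single JSON payload instead of a stream of tokens.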
With this, we conclude this article on downloading, deploying, and using LLM models in the GGUF format !!!