Quick Primer on Running GGUF models on Ollama
Bhaskar S | 01/04/2025
GPT-Generated Unified Format (or GGUF for short) is a binary file format for efficient storage, distribution, and deployment of LLM models.
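Every GGUF file starts with a fixed 4-byte magic value (the ASCII string GGUF) followed by a format version number. The following is a minimal Python sketch (the file path is merely an example) to sanity-check that a downloaded file is indeed in the GGUF format:

```python
import struct

# Example path to a downloaded GGUF file - adjust to your setup
gguf_path = "model.gguf"

with open(gguf_path, "rb") as f:
    magic = f.read(4)                             # the first 4 bytes must be b'GGUF'
    version = struct.unpack("<I", f.read(4))[0]   # little-endian uint32 format version

if magic == b"GGUF":
    print(f"Valid GGUF file (format version {version})")
else:
    print("NOT a GGUF file !!!")
```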
In the article on Ollama, we demonstrated how one can deploy LLM model(s) on a local desktop.
In this article, we will demonstrate how one can deploy and use any LLM model in the GGUF format using Ollama.
The installation and setup will be on an Ubuntu 24.04 LTS based Linux desktop. Ensure that Docker is installed and set up on the desktop (see instructions).
Also, ensure the command-line utility curl is installed on the Linux desktop.
The following are the steps one can follow to download, deploy, and use a GGUF model in Ollama:
Open two Terminal windows (referred to as term-1 and term-2).
Create a directory for downloading and storing the LLM model GGUF file by executing the following command in term-1:
$ mkdir -p $HOME/.ollama/GGUF
For this demonstration, we will deploy and test the open source 8-bit quantized DeepSeek-v3 model, which has caused quite a buzz recently by challenging the popular proprietary AI models.
We will download the 8-bit quantized DeepSeek-v3 model in the GGUF format from HuggingFace by executing the following commands in term-1:
$ cd $HOME/.ollama/GGUF
$ curl -L -O https://huggingface.co/LoupGarou/deepseek-coder-6.7b-instruct-pythagora-v3-gguf/resolve/main/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf
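Once the download completes, one can optionally verify the size of the downloaded file (which should be about 7.2 GB for this 8-bit quantized model) by executing the following command in term-1:

$ ls -lh $HOME/.ollama/GGUF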
With a 1 Gbps internet connection, the download will take between 3 and 5 minutes !!!
Create a model file for Ollama for the just downloaded GGUF file by executing the following command in term-1:
$ cd $HOME/.ollama
$ echo 'FROM /root/.ollama/GGUF/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf' > deepseek-v3-Q8_0_gguf.txt
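Note that the model file contains just a single FROM instruction pointing at the GGUF file (the path is as seen from inside the Ollama container we will start next). If desired, the model file could be enriched with additional Ollama directives such as PARAMETER and SYSTEM; the following is a hypothetical example (the parameter values are purely illustrative):

```
FROM /root/.ollama/GGUF/deepseek-coder-6.7b-instruct-pythagora-v3-Q8_0.gguf
PARAMETER temperature 0.6
PARAMETER num_ctx 4096
SYSTEM "You are a helpful coding assistant."
```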
Start the Ollama platform by executing the following docker command in term-1:
$ cd $HOME
$ docker run --rm --name ollama --gpus=all -p 192.168.1.25:11434:11434 -v $HOME/.ollama:/root/.ollama ollama/ollama:0.5.4
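Once the Ollama platform is up and running, one can verify it is reachable by executing the following command in term-2 (adjust the IP address to match your setup):

$ curl http://192.168.1.25:11434/api/version

The response should be a small JSON payload containing the Ollama version (0.5.4 in this case).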
To list all the LLM models that are deployed in the Ollama platform, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama list
The following would be the typical output:
NAME ID SIZE MODIFIED
To deploy the just downloaded GGUF model into the Ollama platform, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama create deepseek-v3-Q8_0 -f /root/.ollama/deepseek-v3-Q8_0_gguf.txt
The above command would take about a minute to execute and generate the following typical output on completion:
transferring model data 100%
using existing layer sha256:636545fc45204417c1c38ce42126b807f126d80dddc912e07c3a8d90ecdfcd00
using autodetected template alpaca
using existing layer sha256:afa0ae3294fbad4c6b60d110ae6e034b3dfdd5e0acf4d2f3eaa0b888633f7ffe
creating new layer sha256:6e6eb6f365d1c295f24b2bf7e7db63a37d5da88dda6a453a84f0c140476a377b
writing manifest
success
To verify the 8-bit quantized DeepSeek-v3 model was deployed successfully, execute the following docker command in term-2:
$ cd $HOME
$ docker exec -it ollama ollama list
The following was the output from my desktop:
NAME                       ID              SIZE      MODIFIED
deepseek-v3-Q8_0:latest    5af2e8c42525    7.2 GB    About a minute ago
To test the 8-bit quantized DeepSeek-v3 model, execute the following docker command in term-2:
$ docker exec -it ollama ollama run deepseek-v3-Q8_0:latest
After the LLM model is loaded (which will take a few seconds the first time), the command will wait for user input at the prompt >>> Send a message (/? for help).
To test the just-loaded LLM model, enter the following user prompt:
>>> assuming ollama llm chat, generate python code using langchain to chat with the llm with message history
Using DeepSeek-v3, the following would be the typical response:
```python
from langchain.chat_models import OLLAMA_LLM
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

class LLMWithHistory:
    def __init__(self):
        self.llm = OLLAMA_LLM()
        self.messages = []

    def add_message(self, message):
        if isinstance(message, str):
            message = HumanMessage(content=message)
        self.messages.append(message)

    def get_response(self):
        ai_message = AIMessage(content=str(self.llm([msg.content for msg in self.messages], return_generated=True)))
        self.add_message(ai_message)
        return ai_message

chatbot = LLMWithHistory()
chatbot.add_message("Hello, how are you?")
print(chatbot.get_response().content)  # "Hello!"
```
Interesting - there is no class called OLLAMA_LLM !!!
Using ChatGPT, the following was the generated response:
```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage, SystemMessage

# Initialize the chat model
# Replace 'YOUR_API_KEY' with your actual API key or authentication method
llm = ChatOpenAI(model="ollama-llm", temperature=0.7)

# Initialize the message history
message_history = [
    SystemMessage(content="You are a helpful assistant."),
]

def chat_with_llm(user_input):
    global message_history

    # Add the user's message to the history
    message_history.append(HumanMessage(content=user_input))

    # Generate a response from the LLM
    response = llm(message_history)

    # Add the AI's response to the history
    message_history.append(AIMessage(content=response.content))

    # Return the AI's response
    return response.content

# Example usage
if __name__ == "__main__":
    print("Chat with the LLM. Type 'exit' to end the conversation.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            print("Ending the chat. Goodbye!")
            break
        ai_response = chat_with_llm(user_input)
        print(f"AI: {ai_response}")
```
Interesting - the initialization of the class ChatOpenAI is not correct !!!
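Neither of the generated code snippets works as-is. For reference, the following is a minimal working sketch (assuming the langchain-ollama Python package is installed; the model name and base URL match the deployment above) for chatting with the local Ollama model while preserving the message history:

```python
from langchain_ollama import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

# Point the chat model at the local Ollama instance (adjust base_url to your setup)
llm = ChatOllama(model="deepseek-v3-Q8_0", base_url="http://192.168.1.25:11434")

# The message history starts with a system message
messages = [SystemMessage(content="You are a helpful assistant.")]

def chat(user_input: str) -> str:
    # Append the user's message, invoke the model with the full history,
    # and append the model's reply so the conversation context is preserved
    messages.append(HumanMessage(content=user_input))
    response = llm.invoke(messages)  # returns an AIMessage
    messages.append(AIMessage(content=response.content))
    return response.content

if __name__ == "__main__":
    print("Chat with the LLM. Type 'exit' to end the conversation.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break
        print(f"AI: {chat(user_input)}")
```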
On to the next test: solving a calculus derivative problem using the following prompt:
>>> find the derivative of y = (x + 1) * sqrt(x)
Using DeepSeek-v3, the following would be the typical response:
Hmm - the answer is WRONG !!!
Using ChatGPT, the following was the generated response:
Good - the answer is CORRECT !!!
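For reference, the correct derivative can be worked out by expanding the expression before differentiating:

```latex
y = (x + 1)\sqrt{x} = x^{3/2} + x^{1/2}

\frac{dy}{dx} = \frac{3}{2}\,x^{1/2} + \frac{1}{2}\,x^{-1/2} = \frac{3x + 1}{2\sqrt{x}}
```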
To exit the interactive prompt, enter the following command:
>>> /bye
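As an aside, one can also query the deployed model non-interactively via the Ollama REST API by executing the following command in term-2:

$ curl http://192.168.1.25:11434/api/generate -d '{"model": "deepseek-v3-Q8_0", "prompt": "find the derivative of y = (x + 1) * sqrt(x)", "stream": false}'

Setting "stream" to false returns the complete response as a single JSON payload instead of a stream of tokens.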
With this, we conclude this article on downloading, deploying, and using LLM models in the GGUF format !!!