Introduction

This is a page about ollama and you guest it LLM. I have downloaded several models and got a UI going over them locally. The plan is to build something like Claude desktop in Typescript to Golang. First some theory in Python from here. Here is the problem I am trying to solve.

Using the Remote Ollama

You can connect by setting the host with

export OLLAMA_HOST=192.blah.blah.blah

Now you can use with

ollama run llama3.2:latest

Taking lamma3.2b as an example

Model Info

Architecture:llama Who made it
Parameters:3.2B - Means 3.2 billion parameters (bigger requires more resources)
Context Length:131072 - Number of tokens it can injest
Embedding Length:3072 - Size of the vector for each token in the input text
Quantization:Q4_K_M - Too complex to explain

You can customize the mode with a Modelfile and running create with ollama. For example

FROM llama3.2

# set the temperature where higher is more creative
PARAMETER temperature 0.3

SYSTEM """
   You are Bill, a very smart assistant who answers questions succintly and informatively
"""

Now we can create a copy with

ollama create bill -f ./Modelfile

Rest API Interaction

So we can send questions to llama using the the rest endpoint to 11434

curl http://192.blah.blah.blah:11434/api/generate -d '{ 
  "model": "llama3.2", 
  "prompt": "Why is the sky blue?",
  "stream": false 
}'

We can chat by changing the endpoint and the format by adding format in the playload

curl http://192.blah.blah.blah:11434/api/chat -d '{ 
  "model": "llama3.2", 
  "prompt": "Why is the sky blue?",
  "stream": false,
  "format": "json" 
}'

All of the options are at here

UI Based Client Msty

Seems that Msty was a good choice. You specify a provide and you can then put in 192.blah.blah.blah:11434. It support deepseek and other provider too.

RAG Retrieval-Augmented Generation

This allows us to converse with out own documents/data and solves the bizarre statement LLMs produce

LLM
Document Corpus (Knowledge Base)
Document Embeddings
Vecto Store (Vector DB, Faiss, Pinecone, Chromadb)
Retrieval Mechanism

LangChain is a tool to make this easier

Loading and parsing documents
Splitting documents
Generating embeddings
Provides a unified abstraction for working with LLMs and Apps

This is referred to as a simple RAG system. We shall see

Example 1

This was a bit of strange experience. The code is at here but really the thing consisted of the boxes in the diagram.

## 
# 1. Ingest PDF Files
# 2. Extract Text from PDF Files and split into small chunks
# 3. Send the chunks to the embedding model
# 4. Save the embeddings to a vector database
# 5. Perform similarity search on the vector database to find similar documents
# 6. retrieve the similar documents and present them to the user
## run pip install -r requirements.txt to install the required packages

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

doc_path = "./data/BOI.pdf"
model = "llama3.2"

# Local PDF file uploads
if doc_path:
    loader = UnstructuredPDFLoader(file_path=doc_path)
    data = loader.load()
    print("done loading....")
else:
    print("Upload a PDF file")

    # Preview first page
content = data[0].page_content
# print(content[:100])


# ==== End of PDF Ingestion ====


# ==== Extract Text from PDF Files and Split into Small Chunks ====

from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Split and chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = text_splitter.split_documents(data)
print("done splitting....")

# print(f"Number of chunks: {len(chunks)}")
# print(f"Example chunk: {chunks[0]}")

# ===== Add to vector database ===
import ollama

ollama.pull("nomic-embed-text")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="simple-rag",
)
print("done adding to vector database....")


## === Retrieval ===
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain_ollama import ChatOllama

from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# set up our model to use
llm = ChatOllama(model=model)

# a simple technique to generate multiple questions from a single question and then retrieve documents
# based on those questions, getting the best of both worlds.
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), llm, prompt=QUERY_PROMPT
)


# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)


chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


# res = chain.invoke(input=("what is the document about?",))
# res = chain.invoke(
#     input=("what are the main points as a business owner I should be aware of?",)
# )
res = chain.invoke(input=("how to report BOI?",))

print(res)

I did like the streamlit ui version.

Example 2

Spent a bit more time unpicking this and now have a better understanding. This time around it consists of two parts

Get the Data
Build a query tool

Most the time was spent trying to find a working version of chromadb which turned out to be 0.6.1

Get the Data

So we get the data by

Build a list of URLs
Build documents from the URLs
Split the data into chunks
Persist the Collected Data

"""Generate unique id for ChromaDB."""

import time
import uuid
import chromadb
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter
import ollama


def load_data() -> list:
    """
    Load data from the specified URLs using WebBaseLoader.
    """

    urls = ["https://angular.love/angular-19-whats-new"]

    loader = WebBaseLoader(urls)
    loaded_documents = loader.load()
    return loaded_documents


text_splitter = CharacterTextSplitter(
    chunk_size=3400,
    chunk_overlap=300,
    is_separator_regex=False,
)

COLLECTION_NAME = "buildragwithpython"

documents = load_data()

# Initialize the ChromaDB client with explicit settings
client = chromadb.HttpClient(host="localhost", port=8000)

# Get list for collection names
collection_names = client.list_collections()

# If the collection already exists, delete it
if COLLECTION_NAME in collection_names:
    print("deleting collection")
    client.delete_collection(COLLECTION_NAME)

collection = client.get_or_create_collection(
    name=COLLECTION_NAME, metadata={"hnsw:space": "cosine"}
)

starttime = time.time()

# Iterate through the documents
for doc in documents:
    content = doc.page_content
    texts = text_splitter.create_documents([content])

    # Adding source metadata to each chunk with unique IDs
    for i, text in enumerate(texts):
        text.metadata["source"] = doc.metadata.get("source", "default_source")
        DOC_ID = str(uuid.uuid4())

        # Generate embedding using Ollama directly
        response = ollama.embeddings(model="nomic-embed-text", prompt=text.page_content)
        embedding = response["embedding"]

        collection.add(
            documents=[text.page_content],
            metadatas=[{"source": text.metadata["source"], "chunk_id": i}],
            ids=[DOC_ID],
            embeddings=[embedding],
        )

print(f"--- {time.time() - starttime:.6f} seconds ---")

Build Query Tool

I guess I need to work on this but here to tie the two up. I am starting to like Jupyter Books as a way to prototype.

import sys
import chromadb
import ollama
from utilities import getconfig

embedmodel = getconfig()["embedmodel"]
mainmodel = getconfig()["mainmodel"]
chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_or_create_collection("buildragwithpython")

query = " ".join(sys.argv[1:])
queryembed = ollama.embeddings(model=embedmodel, prompt=query)['embedding']
if not queryembed:
    print("Error: No embedding returned")
    sys.exit(1)

relevantdocs = collection.query(query_embeddings=[queryembed], n_results=5)["documents"][0]
DOCS = "\n\n".join(relevantdocs)
MODEL_QUERY = f"{query} - Answer that question using the following text as a resource: {DOCS}"

stream = ollama.generate(model=mainmodel, prompt=MODEL_QUERY, stream=True)

for chunk in stream:
    if chunk["response"]:
        print(chunk['response'], end='', flush=True)

Fine Tuning

Introduction

This is a way to make an existing model focus on what you want and not all the things you don't want. E.g. I am interested in Python but not snakes. To do this they recommended you have

Data to train the model
A tool

Axolotl

I am going to trying this. I think running through the process will be worthwhile. As I don't have training data I will look to Matt Williams to guide me. He cheered me up in confirming setting up a python environment is a nightmare I will start there

python -m env ax
source ax/bin/activate

Now the packages. Let the games begin

pip install  torchvision torchaudio

We need pytorch and pytorch-cuda but they fail. PyTorch fails as it should be torch, maybe they had a bun fight. pytorch-cuda. Not sure I need this as I asked the robot and it said

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

It is not listed with pip list but I will progress and see what happens. To verify with make a folder parallel to you env. I starting with axapp as an approach. Make the main.py_

import torch
print(torch.__version__)
print(torch.cuda.is_available())

Set the interpreter with shift+ctl+p python:select interpreter and press F5 and here is what good looks like.

Next install axolotl by cloning it

git clone https://github.com/axolotl-ai-cloud/axolotl

Now to install with pip

pip install packaging
cd axolotl
pip install -e '.[flash-attn, deepspeed]'

Guess I was not hopeful it failed with

ModuleNotFoundError: No module named 'torch'

So this was a challenge. To build flash-attn-2.7.4.post1 took around 3 hours. I was maybe super cautious.

pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
export CMAKE_GENERATOR=Ninja
export MAX_JOBS=2
pip install flash-attn --no-cache-dir --verbose

So now we need to configure the stuff. The documentation can be found at [lora]. The parameters the video used are shown here

base_model: NousResearch/Meta-Llama-3.1-8B
load_in_4bit: true
strict: false

chat_template: llama3
datasets:
  - path: winglian/pirate-ultrachat-10k
    type: chat_template
    message_field_role: role
    message_field_content: content
dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./outputs/lora-out

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: qlora
lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_modules_to_save:
  - embed_tokens
  - lm_head
peft_use_dora: true

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
learning_rate: 0.0002

train_on_inputs: false
bf16: true
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ration: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0
deepspeed: /home/ubuntu/axolotl/pirate/deepspeed_configs/zero2.json # Replace with actual full path
special_tokens:
  pad_token: "<|finetune_right_pad_id|>"
save_safetensors: true

I kept hearing you don't need an array of machine and tons of memory and so far it did seem the case but running axolotl for this task on a L40S GPU (48GB)($25k USD card) takes approximately 6 hours. Alternatively the present had 8 H100 (8 x $693 NZD) and it took 10 mins

LLM and Ollama

Contents

Introduction

Using the Remote Ollama

Model Info

Rest API Interaction

UI Based Client Msty

RAG Retrieval-Augmented Generation

Example 1

Example 2

Get the Data

Build Query Tool

Fine Tuning

Introduction

Axolotl

Navigation menu

LLM and Ollama

Introduction

Using the Remote Ollama

Model Info

Rest API Interaction

UI Based Client Msty

RAG Retrieval-Augmented Generation

Example 1

Example 2

Get the Data

Build Query Tool

Fine Tuning

Introduction

Axolotl

Navigation menu

Search