LLM and Ollama
=Introduction=
This is a page about Ollama and, you guessed it, LLMs. I have downloaded several models and got a UI going over them locally. The plan is to build something like Claude Desktop in TypeScript or Golang, but first some theory in Python. Here is the problem I am trying to solve.
=Using the Remote Ollama=
You can connect by setting the host with
<syntaxhighlight lang="bash">
export OLLAMA_HOST=192.blah.blah.blah
</syntaxhighlight>
Now you can use it with
<syntaxhighlight lang="bash">
ollama run llama3.2:latest
</syntaxhighlight>
Taking llama3.2 as an example:
==Model Info==
*Architecture: llama - who made it
*Parameters: 3.2B - 3.2 billion parameters (bigger requires more resources)
*Context Length: 131072 - number of tokens it can ingest
*Embedding Length: 3072 - size of the vector for each token in the input text
*Quantization: Q4_K_M - too complex to explain
You can customize the model with a Modelfile and then running ollama create. For example
<syntaxhighlight lang="txt">
FROM llama3.2
# set the temperature where higher is more creative
PARAMETER temperature 0.3
SYSTEM """
You are Bill, a very smart assistant who answers questions succinctly and informatively
"""
</syntaxhighlight>
Now we can create a copy with
<syntaxhighlight lang="bash">
ollama create bill -f ./Modelfile
</syntaxhighlight>
==REST API Interaction==
So we can send questions to llama using the REST endpoint on port 11434
<syntaxhighlight lang="bash">
curl http://192.blah.blah.blah:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
</syntaxhighlight>
We can chat by changing the endpoint to /api/chat (which takes a messages array rather than a prompt), and we can ask for structured output by adding format to the payload
<syntaxhighlight lang="bash">
curl http://192.blah.blah.blah:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false,
  "format": "json"
}'
</syntaxhighlight>
All of the options are in the Ollama API documentation.
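Since the rest of these notes drive Ollama from Python, here is a minimal sketch of the same two calls using the ollama Python package. This is an aside of mine, assuming <code>pip install ollama</code> and that OLLAMA_HOST points at the server (it defaults to localhost:11434).
<syntaxhighlight lang="py">
# Minimal sketch using the ollama Python client; assumes the `ollama` package
# is installed and OLLAMA_HOST points at the server (default localhost:11434).
import ollama

# Equivalent of the /api/generate call above
gen = ollama.generate(model="llama3.2", prompt="Why is the sky blue?")
print(gen["response"])

# Equivalent of the /api/chat call above
chat = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(chat["message"]["content"])
</syntaxhighlight>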
=UI Based Client Msty=
It seems that Msty was a good choice. You specify a provider and can then point it at 192.blah.blah.blah:11434. It supports DeepSeek and other providers too.
=RAG Retrieval-Augmented Generation=
This allows us to converse with our own documents/data and reins in the bizarre statements LLMs produce. A RAG system consists of:
*LLM
*Document Corpus (Knowledge Base)
*Document Embeddings
*Vector Store (Vector DB, e.g. Faiss, Pinecone, ChromaDB)
*Retrieval Mechanism
LangChain is a tool to make this easier. It handles:
*Loading and parsing documents
*Splitting documents
*Generating embeddings
*Provides a unified abstraction for working with LLMs and Apps
This is referred to as a simple RAG system. We shall see.<br>
[[File:Simple Rag.png|800px]]<br>
==Example 1==
This was a bit of a strange experience. The code is at [https://github.com/pdichone/ollama-fundamentals/blob/main/pdf-rag.py here], but really it just consists of the boxes in the diagram above.
<syntaxhighlight lang="py">
##
# 1. Ingest PDF files
# 2. Extract text from PDF files and split into small chunks
# 3. Send the chunks to the embedding model
# 4. Save the embeddings to a vector database
# 5. Perform similarity search on the vector database to find similar documents
# 6. Retrieve the similar documents and present them to the user
## run pip install -r requirements.txt to install the required packages

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader

doc_path = "./data/BOI.pdf"
model = "llama3.2"

# Local PDF file uploads
if doc_path:
    loader = UnstructuredPDFLoader(file_path=doc_path)
    data = loader.load()
    print("done loading....")
else:
    print("Upload a PDF file")

# Preview first page
content = data[0].page_content
# print(content[:100])

# ==== End of PDF Ingestion ====

# ==== Extract Text from PDF Files and Split into Small Chunks ====
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Split and chunk
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = text_splitter.split_documents(data)
print("done splitting....")
# print(f"Number of chunks: {len(chunks)}")
# print(f"Example chunk: {chunks[0]}")

# ===== Add to vector database =====
import ollama

ollama.pull("nomic-embed-text")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="simple-rag",
)
print("done adding to vector database....")

# ===== Retrieval =====
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# set up our model to use
llm = ChatOllama(model=model)

# a simple technique to generate multiple questions from a single question and then retrieve documents
# based on those questions, getting the best of both worlds.
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from
a vector database. By generating multiple perspectives on the user question, your
goal is to help the user overcome some of the limitations of the distance-based
similarity search. Provide these alternative questions separated by newlines.
Original question: {question}""",
)

retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), llm, prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# res = chain.invoke(input=("what is the document about?",))
# res = chain.invoke(
#     input=("what are the main points as a business owner I should be aware of?",)
# )
res = chain.invoke(input=("how to report BOI?",))
print(res)
</syntaxhighlight>
I did like the Streamlit UI version.
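For reference, here is a minimal sketch of how the chain above could be wrapped in a Streamlit page. This is my own guess at the shape of it, not the code from the repo; <code>build_chain</code> is an assumed refactor of the script above into a reusable helper.
<syntaxhighlight lang="py">
# Hypothetical minimal Streamlit wrapper around the chain built above.
# Assumes the pdf-rag script has been refactored into a build_chain() helper
# in a module called pdf_rag. Run with: streamlit run app.py
import streamlit as st

from pdf_rag import build_chain  # assumed helper, not part of the original repo

st.title("Chat with the BOI PDF")

question = st.text_input("Ask a question about the document")
if st.button("Ask") and question:
    chain = build_chain()
    with st.spinner("Thinking..."):
        answer = chain.invoke(question)
    st.write(answer)
</syntaxhighlight>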
==Example 2==
I spent a bit more time unpicking this and now have a better understanding. This time around it consists of two parts:
*Get the Data
*Build a query tool
Most of the time was spent trying to find a working version of chromadb, which turned out to be 0.6.1.
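Before running the ingest script below, I found it useful to confirm the pinned chromadb version and that the server is actually up on port 8000. This is a small sanity check of my own, not part of the original scripts.
<syntaxhighlight lang="py">
# Sanity check (my own addition): confirm the chromadb version and that a
# Chroma server is reachable on localhost:8000 before ingesting anything.
import chromadb

print(chromadb.__version__)  # expecting 0.6.1 here

client = chromadb.HttpClient(host="localhost", port=8000)
client.heartbeat()  # this (or the client construction) fails if the server is not running
print("Chroma server is up")
</syntaxhighlight>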
===Get the Data===
We get the data by:
*Building a list of URLs
*Building documents from the URLs
*Splitting the data into chunks
*Persisting the collected data
<syntaxhighlight lang="py">
"""Load web pages, chunk them, and store their embeddings in ChromaDB."""
import time
import uuid

import chromadb
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import CharacterTextSplitter
import ollama


def load_data() -> list:
    """Load data from the specified URLs using WebBaseLoader."""
    urls = ["https://angular.love/angular-19-whats-new"]
    loader = WebBaseLoader(urls)
    loaded_documents = loader.load()
    return loaded_documents


text_splitter = CharacterTextSplitter(
    chunk_size=3400,
    chunk_overlap=300,
    is_separator_regex=False,
)

COLLECTION_NAME = "buildragwithpython"
documents = load_data()

# Initialize the ChromaDB client with explicit settings
client = chromadb.HttpClient(host="localhost", port=8000)

# Get the list of collection names (chromadb 0.6.x returns names, not objects)
collection_names = client.list_collections()

# If the collection already exists, delete it
if COLLECTION_NAME in collection_names:
    print("deleting collection")
    client.delete_collection(COLLECTION_NAME)

collection = client.get_or_create_collection(
    name=COLLECTION_NAME, metadata={"hnsw:space": "cosine"}
)

starttime = time.time()

# Iterate through the documents
for doc in documents:
    content = doc.page_content
    texts = text_splitter.create_documents([content])
    # Add source metadata to each chunk and give it a unique ID
    for i, text in enumerate(texts):
        text.metadata["source"] = doc.metadata.get("source", "default_source")
        DOC_ID = str(uuid.uuid4())
        # Generate the embedding using Ollama directly
        response = ollama.embeddings(model="nomic-embed-text", prompt=text.page_content)
        embedding = response["embedding"]
        collection.add(
            documents=[text.page_content],
            metadatas=[{"source": text.metadata["source"], "chunk_id": i}],
            ids=[DOC_ID],
            embeddings=[embedding],
        )

print(f"--- {time.time() - starttime:.6f} seconds ---")
</syntaxhighlight>
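A quick way to confirm the ingest worked is to count what landed in the collection. This is my own addition, not part of the original script.
<syntaxhighlight lang="py">
# Quick check (my own addition): confirm the chunks actually landed in Chroma.
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_collection("buildragwithpython")
print(collection.count(), "chunks stored")
</syntaxhighlight>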
===Build Query Tool===
I still need to work on this, but here it is to tie the two parts together. I am starting to like Jupyter notebooks as a way to prototype.
<syntaxhighlight lang="py">
import sys

import chromadb
import ollama

from utilities import getconfig

embedmodel = getconfig()["embedmodel"]
mainmodel = getconfig()["mainmodel"]

chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_or_create_collection("buildragwithpython")

query = " ".join(sys.argv[1:])
queryembed = ollama.embeddings(model=embedmodel, prompt=query)["embedding"]
if not queryembed:
    print("Error: No embedding returned")
    sys.exit(1)

relevantdocs = collection.query(query_embeddings=[queryembed], n_results=5)["documents"][0]
DOCS = "\n\n".join(relevantdocs)
MODEL_QUERY = f"{query} - Answer that question using the following text as a resource: {DOCS}"

stream = ollama.generate(model=mainmodel, prompt=MODEL_QUERY, stream=True)
for chunk in stream:
    if chunk["response"]:
        print(chunk["response"], end="", flush=True)
</syntaxhighlight>
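The script imports <code>getconfig</code> from a utilities module that is not shown above. A minimal sketch of what I assume it does (reading the model names from a config.ini) would be something like:
<syntaxhighlight lang="py">
# utilities.py - assumed helper, not the original code: reads model names
# from a config.ini that looks something like
#
#   [main]
#   embedmodel=nomic-embed-text
#   mainmodel=llama3.2
import configparser


def getconfig() -> dict:
    """Return the [main] section of config.ini as a plain dict."""
    config = configparser.ConfigParser()
    config.read("config.ini")
    return dict(config["main"])
</syntaxhighlight>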
=Fine Tuning=
==Introduction==
This is a way to make an existing model focus on what you want and not all the things you don't want, e.g. I am interested in Python but not snakes. To do this they recommended you have:
*Data to train the model (see the sketch after this list for the shape this typically takes)
*A tool
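As an illustration of the first bullet, here is a minimal sketch of what chat-style training data often looks like: role/content messages written out as JSONL. This is an invented example for shape only, not the dataset used below.
<syntaxhighlight lang="py">
# Hypothetical example of a chat-style fine-tuning record (one JSON object per
# line). Purely illustrative; it is not the dataset referenced in the config below.
import json

records = [
    {
        "messages": [
            {"role": "user", "content": "How do I read a file in Python?"},
            {
                "role": "assistant",
                "content": "Use open() in a with block and call .read() on the file object.",
            },
        ]
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
</syntaxhighlight>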
==Axolotl==
I am going to try this; I think running through the process will be worthwhile. As I don't have training data I will look to '''Matt Williams''' to guide me. He cheered me up by confirming that setting up a Python environment is a nightmare, so I will start there.
<syntaxhighlight lang="bash">
python -m venv ax
source ax/bin/activate
</syntaxhighlight>
Now the packages. Let the games begin.
<syntaxhighlight lang="bash">
pip install torchvision torchaudio
</syntaxhighlight>
We also need pytorch and pytorch-cuda, but installing them under those names fails: the PyPI package is torch, not pytorch (maybe they had a bun fight), and I am not sure I need pytorch-cuda at all. I asked the robot and it suggested
<syntaxhighlight lang="bash">
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
</syntaxhighlight>
It is not listed by pip list, but I will press on and see what happens. To verify, make a folder parallel to your env; I am starting with axapp as an approach. Create main.py with
<syntaxhighlight lang="py">
import torch

print(torch.__version__)
print(torch.cuda.is_available())
</syntaxhighlight>
Set the interpreter with Ctrl+Shift+P, Python: Select Interpreter, then press F5, and here is what good looks like.<br>
[[File:Python torch.png|500px]]<br>
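For good measure, you can also print which GPU torch can actually see. Again, my own addition rather than part of the guide.
<syntaxhighlight lang="py">
# Optional extra check (my addition): report the visible GPU, if any.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("CUDA not available")
</syntaxhighlight>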
Next install axolotl by cloning it
<syntaxhighlight lang="bash">
git clone https://github.com/axolotl-ai-cloud/axolotl
</syntaxhighlight>
Now to install with pip
<syntaxhighlight lang="bash">
pip install packaging
cd axolotl
pip install -e '.[flash-attn, deepspeed]'
</syntaxhighlight>
I guess I was not hopeful; it failed with
<code>ModuleNotFoundError: No module named 'torch'</code>
So this was a challenge: building flash-attn 2.7.4.post1 took around 3 hours, though I was maybe being super cautious with the settings below.
<syntaxhighlight lang="bash">
pip3 install -U packaging==23.2 setuptools==75.8.0 wheel ninja
export CMAKE_GENERATOR=Ninja
export MAX_JOBS=2
pip install flash-attn --no-cache-dir --verbose
</syntaxhighlight>
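Once that finally finishes, a quick way to confirm the build actually produced a usable package (my own check, not from the video):
<syntaxhighlight lang="py">
# Quick check (my own addition): confirm flash-attn built and imports cleanly.
import flash_attn

print(flash_attn.__version__)
</syntaxhighlight>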
Now we need to configure the training run. Useful background on the LoRA parameters can be found in [https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2 this Anyscale LoRA article]. The parameters the video used are shown here:
<syntaxhighlight lang="yaml">
base_model: NousResearch/Meta-Llama-3.1-8B
load_in_4bit: true
strict: false

chat_template: llama3
datasets:
  - path: winglian/pirate-ultrachat-10k
    type: chat_template
    message_field_role: role
    message_field_content: content

dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./outputs/lora-out

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: qlora
lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_modules_to_save:
  - embed_tokens
  - lm_head
peft_use_dora: true

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
learning_rate: 0.0002

train_on_inputs: false
bf16: true
tf32: true
gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0
deepspeed: /home/ubuntu/axolotl/pirate/deepspeed_configs/zero2.json # Replace with actual full path
special_tokens:
  pad_token: "<|finetune_right_pad_id|>"
save_safetensors: true
</syntaxhighlight>
I kept hearing that you don't need an array of machines and tons of memory, and so far that did seem to be the case, but running axolotl for this task on an L40S GPU (48GB, roughly a $25k USD card) takes approximately 6 hours. The presenter, by contrast, had 8 x H100s (8 x $693 NZD) and it took 10 minutes.