Building a Local RAG Pipeline with Ollama and Open WebUI

TL;DR

Retrieval-augmented generation (RAG) lets your local LLM answer questions using your own documents instead of relying on its training data. This guide walks through building a fully local RAG pipeline: document ingestion, embedding, vector storage, and retrieval through Open WebUI.

What you will build:

  • Document ingestion for PDFs, markdown, and text files
  • Local embedding with nomic-embed-text via Ollama
  • Vector storage with ChromaDB (runs locally, no cloud dependency)
  • Query interface through Open WebUI with automatic retrieval
  • Everything runs on your hardware with no data leaving your network

Requirements:

  • Ollama installed and running
  • Open WebUI deployed (Docker recommended)
  • 8GB+ VRAM (GPU) or 16GB+ RAM (CPU-only)
  • 20GB+ free disk space for models and vector storage

What RAG Is and Why It Matters

LLMs have a fixed knowledge cutoff. They cannot answer questions about your private documents, recent events, or domain-specific data they were not trained on. When you ask a question outside their training data, they either refuse to answer or hallucinate.

RAG solves this by adding a retrieval step before generation:

  1. Index: Split your documents into chunks and convert each chunk into a numerical vector (embedding).
  2. Retrieve: When a user asks a question, convert the question into an embedding and find the most similar document chunks.
  3. Generate: Pass the retrieved chunks to the LLM as context along with the question. The LLM generates an answer grounded in your actual documents.
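The three steps above can be sketched end to end without any model at all. The example below is a toy illustration, not the pipeline built later in this guide: toy_embed is a hypothetical bag-of-words stand-in for a real embedding model, and the final prompt is printed rather than sent to an LLM.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# 1. Index: embed every chunk once, up front.
chunks = [
    "The API rate limit is 100 requests per minute.",
    "Backups run nightly at 2 AM server time.",
    "Password resets are handled by the identity service.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(question, index, k=1):
    """2. Retrieve: rank chunks by similarity to the question."""
    query_vec = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """3. Generate: build the prompt that would be sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "What is the API rate limit?"
top = retrieve(question, index)
print(build_prompt(question, top))
```

A real embedding model captures meaning rather than exact word overlap, but the control flow is the same.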

This is particularly valuable for:

  • Internal documentation and knowledge bases
  • Legal, medical, or compliance documents
  • Codebases and technical specifications
  • Any private data you cannot send to cloud APIs

The entire pipeline runs locally, so sensitive documents never leave your machine.

Architecture Overview

User Query
    |
    v
[Embedding Model] --> query vector
    |
    v
[Vector Database] --> top-k similar chunks
    |
    v
[LLM] + retrieved chunks --> generated answer
    |
    v
Response to User

Each component can be swapped independently. You can change the embedding model, vector store, or LLM without rebuilding the entire pipeline.
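One way to keep the components swappable in a custom pipeline is to code against small interfaces rather than concrete backends. The sketch below is an assumption about how you might structure your own code; the class names (Embedder, VectorStore, Generator, RagPipeline) and the toy implementations are hypothetical and do not come from Ollama, ChromaDB, or Open WebUI.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    def embed(self, text: str) -> Sequence[float]: ...


class VectorStore(Protocol):
    def add(self, text: str, vector: Sequence[float]) -> None: ...
    def search(self, vector: Sequence[float], k: int) -> list: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class RagPipeline:
    """Depends only on the three interfaces, not on concrete backends."""

    def __init__(self, embedder: Embedder, store: VectorStore, generator: Generator):
        self.embedder = embedder
        self.store = store
        self.generator = generator

    def ingest(self, text: str) -> None:
        self.store.add(text, self.embedder.embed(text))

    def ask(self, question: str, k: int = 3) -> str:
        hits = self.store.search(self.embedder.embed(question), k)
        prompt = "Context:\n" + "\n".join(hits) + f"\n\nQuestion: {question}"
        return self.generator.generate(prompt)


# Toy stand-ins so the sketch runs without Ollama or a vector database.
class LetterCountEmbedder:
    def embed(self, text):
        return [float(text.count("a")), float(text.count("e"))]


class InMemoryStore:
    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, list(vector)))

    def search(self, vector, k):
        def squared_distance(item):
            return sum((x - y) ** 2 for x, y in zip(vector, item[1]))

        return [text for text, _ in sorted(self.items, key=squared_distance)[:k]]


class EchoGenerator:
    def generate(self, prompt):
        return prompt  # a real generator would call the LLM here


pipeline = RagPipeline(LetterCountEmbedder(), InMemoryStore(), EchoGenerator())
pipeline.ingest("aaa")
pipeline.ingest("eee")
answer = pipeline.ask("aa", k=1)
print(answer)
```

Swapping the vector store then means writing one new class with add and search methods; the pipeline itself stays unchanged.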

Step 1: Install Ollama and Pull Models

If you do not already have Ollama installed:

curl -fsSL https://ollama.com/install.sh | sh

Pull the models you need – one for embeddings and one for generation:

# Embedding model (small, fast, good quality)
ollama pull nomic-embed-text

# Generation model (choose based on your VRAM)
ollama pull llama3.1:8b        # 5GB VRAM, good for most tasks
ollama pull llama3.1:8b-instruct-q4_K_M  # Optional: same model family with the quantization pinned explicitly

Verify both models are available:

ollama list

You should see both nomic-embed-text and your chosen generation model in the output.

Embedding Model Options

Model                    Dimensions   Size     VRAM      Quality
nomic-embed-text         768          274MB    ~300MB    Good all-around
all-minilm               384          46MB     ~100MB    Lighter, slightly lower quality
mxbai-embed-large        1024         670MB    ~700MB    Higher quality, more resource use
snowflake-arctic-embed   1024         670MB    ~700MB    Strong retrieval performance

For most use cases, nomic-embed-text is the right choice. It balances quality and resource usage well. Use all-minilm if you are running on CPU-only or have very limited VRAM.

# Pull your chosen embedding model
ollama pull nomic-embed-text

# Test it
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "This is a test document about local AI systems."
}'

The response returns a JSON object with a 768-dimension embedding vector.

Step 2: Document Preparation and Chunking

RAG quality depends heavily on how you prepare your documents. Poor chunking leads to poor retrieval, which leads to poor answers.

Supported Document Formats

  • PDF: Most common format for business documents. Use pypdf (the successor to the deprecated PyPDF2) or pdfplumber for extraction.
  • Markdown: Ideal format. Clean text with structure preserved.
  • Plain text: Works directly with no conversion needed.
  • HTML: Strip tags and convert to text or markdown first.

Chunking Strategy

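Split documents into chunks of 500-1000 tokens with 50-100 tokens of overlap between chunks. Overlap ensures that concepts spanning a chunk boundary are captured in at least one chunk. The script below counts characters rather than tokens to stay simple, so its 800-character default is well under the token target (a common rule of thumb is about four characters per English token); raise chunk_size if you want chunks closer to 500-1000 tokens.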

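# chunk_documents.py
from pathlib import Path

def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into overlapping chunks by character count."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting an overlap-only tail
        start = end - overlap  # step back so adjacent chunks share `overlap` characters
    return chunks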

def load_and_chunk_directory(directory):
    """Load all text and markdown files from a directory and chunk them."""
    all_chunks = []
    for filepath in Path(directory).rglob("*"):
        if filepath.suffix in [".txt", ".md", ".markdown"]:
            text = filepath.read_text(encoding="utf-8")
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                all_chunks.append({
                    "text": chunk,
                    "source": str(filepath),
                    "chunk_index": i
                })
    return all_chunks

# Usage
chunks = load_and_chunk_directory("/path/to/your/documents")
print(f"Created {len(chunks)} chunks from documents")
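The chunking guidance above is given in tokens, while chunk_text counts characters. Exact token counts depend on the model's tokenizer, so the sketch below uses whitespace-separated words as a rough proxy. The function name chunk_by_words and its defaults are illustrative assumptions, not part of the original script.

```python
def chunk_by_words(text, chunk_words=400, overlap_words=50):
    """Split text into overlapping chunks measured in whitespace-separated words."""
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be >= 0 and smaller than chunk_words")
    words = text.split()
    chunks = []
    step = chunk_words - overlap_words  # how far each new chunk advances
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_by_words(sample, chunk_words=4, overlap_words=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Splitting on whitespace collapses the original line breaks, so prefer the character-based version if you need to preserve formatting such as markdown tables or code.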

PDF Extraction

For PDFs, install pdfplumber, which handles complex layouts better than most alternatives:

pip install pdfplumber

import pdfplumber

def extract_pdf_text(pdf_path):
    """Extract text from a PDF file."""
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n\n"
    return text

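# Extract and chunk a PDF, using the same dict shape as load_and_chunk_directory
pdf_path = "/path/to/document.pdf"
text = extract_pdf_text(pdf_path)
chunks = [
    {"text": chunk, "source": pdf_path, "chunk_index": i}
    for i, chunk in enumerate(chunk_text(text))
]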

Web Page Ingestion

To scrape and ingest web pages:

pip install beautifulsoup4 requests

import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    """Extract main text content from a web page."""
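    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of indexing an error page
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove scripts, styles, and page chrome (navigation, header, footer)
    for tag in soup(["script", "style", "nav", "footer", "header"]):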
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)
    return text

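# Scrape and chunk a page, using the same dict shape as load_and_chunk_directory
url = "https://example.com/docs/api-reference"
text = scrape_webpage(url)
chunks = [
    {"text": chunk, "source": url, "chunk_index": i}
    for i, chunk in enumerate(chunk_text(text))
]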

Step 3: Set Up ChromaDB as Your Vector Store

ChromaDB is a lightweight, local vector database that runs embedded in your Python process or as a standalone server. No cloud account, no API keys.

Install ChromaDB

pip install chromadb

Create a Collection and Store Embeddings

import chromadb
import requests

# Initialize ChromaDB with persistent storage
client = chromadb.PersistentClient(path="/path/to/chromadb-storage")

# Create or get a collection
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}
)

def get_embedding(text, model="nomic-embed-text"):
    """Get embedding vector from Ollama."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60
    )
    response.raise_for_status()  # surface errors such as a model that has not been pulled
    return response.json()["embedding"]

def index_chunks(chunks, collection):
    """Index document chunks into ChromaDB."""
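    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk["text"])
        # Derive the ID from source and chunk index so chunks from different
        # files or indexing runs do not collide; upsert overwrites on re-runs.
        collection.upsert(
            ids=[f"{chunk['source']}:{chunk['chunk_index']}"],
            embeddings=[embedding],
            documents=[chunk["text"]],
            metadatas=[{
                "source": chunk["source"],
                "chunk_index": chunk["chunk_index"]
            }]
        )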
        if (i + 1) % 100 == 0:
            print(f"Indexed {i + 1}/{len(chunks)} chunks")

    print(f"Indexing complete. {len(chunks)} chunks stored.")

# Index your documents
index_chunks(chunks, collection)

Query the Vector Store

def search_documents(query, collection, n_results=5):
    """Search for relevant document chunks."""
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    return results

# Test a query
results = search_documents("How do I configure the API?", collection)
for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
    print(f"Source: {metadata['source']}")
    print(f"Text: {doc[:200]}...")
    print("---")
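By default, ChromaDB query results also include a "distances" list; with the cosine space configured above, a smaller distance means a closer match. The sketch below drops weak matches before they reach the LLM. The helper name filter_by_distance and the 0.5 cutoff are assumptions to tune for your own data, and fake_results is a hand-written stand-in with made-up values, not real output.

```python
def filter_by_distance(results, max_distance=0.5):
    """Keep only hits whose distance is at or below max_distance.

    `results` is the dict returned by collection.query() for a single query,
    so the hits for that query live at index 0 of each list.
    """
    kept = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        if dist <= max_distance:
            kept.append({"text": doc, "source": meta["source"], "distance": dist})
    return kept


# Hand-written stand-in for a query result (values are made up for illustration)
fake_results = {
    "documents": [["Set the API key in config.yaml.", "Rate limits apply.", "Office hours."]],
    "metadatas": [[{"source": "setup.md"}, {"source": "limits.md"}, {"source": "hr.md"}]],
    "distances": [[0.12, 0.48, 0.91]],
}

for hit in filter_by_distance(fake_results):
    print(f"{hit['distance']:.2f}  {hit['source']}")
```

If every hit is filtered out, it is usually better to tell the user that nothing relevant was found than to pass unrelated chunks to the model.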

Alternative: Qdrant

If you need more advanced features like filtering, multi-tenancy, or larger-scale deployments, Qdrant is a strong alternative:

# Run Qdrant in Docker
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v qdrant_storage:/qdrant/storage \
  qdrant/qdrant

pip install qdrant-client

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="my_documents",
    vectors_config=VectorParams(
        size=768,  # nomic-embed-text dimensions
        distance=Distance.COSINE
    )
)

# Index a chunk
client.upsert(
    collection_name="my_documents",
    points=[
        PointStruct(
            id=1,
            vector=get_embedding("Your document text here"),
            payload={"source": "doc.pdf", "text": "Your document text here"}
        )
    ]
)

# Search (newer qdrant-client releases deprecate search() in favor of query_points())
results = client.search(
    collection_name="my_documents",
    query_vector=get_embedding("your search query"),
    limit=5
)

Qdrant uses more resources than ChromaDB but scales better for large document collections (100K+ chunks).

Step 4: Build the RAG Query Pipeline

Now connect retrieval to generation:

import requests

def rag_query(question, collection, model="llama3.1:8b", n_context=5):
    """Answer a question using RAG."""
    # Step 1: Retrieve relevant chunks
    results = search_documents(question, collection, n_results=n_context)
    context_chunks = results["documents"][0]

    # Step 2: Build the prompt with retrieved context
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Use the following context to answer the question. If the context
does not contain enough information to answer, say so clearly.

Context:
{context}

Question: {question}

Answer:"""

    # Step 3: Generate answer with Ollama
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,
                "num_ctx": 4096
            }
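        },
        timeout=300  # generation can take a while, especially on CPU
    )
    response.raise_for_status()

    answer = response.json()["response"]
    sources = [m["source"] for m in results["metadatas"][0]]

    return {
        "answer": answer,
        # dict.fromkeys removes duplicates while keeping retrieval order
        "sources": list(dict.fromkeys(sources))
    }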

# Use it
result = rag_query("What are the API rate limits?", collection)
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

Tuning Retrieval Quality

Several parameters affect RAG quality:

  • n_results (number of retrieved chunks): Start with 5. Increase to 8-10 for complex questions. Too many chunks can overwhelm the context window.
  • temperature: Use 0.1-0.3 for factual RAG. Higher temperatures increase creativity but also hallucination.
  • num_ctx: Set high enough to fit all retrieved chunks plus the question and answer. 4096 is usually sufficient for 5 chunks.
  • Chunk size: Smaller chunks (300-500 chars) give more precise retrieval. Larger chunks (800-1200 chars) give more context per result.
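To check whether a given combination of n_results and chunk size fits inside num_ctx, a back-of-the-envelope estimate is often enough. The sketch below assumes roughly four characters per token, a common rule of thumb for English text rather than a measurement of any specific tokenizer; the function names and the default allowances for the question, prompt template, and answer are also assumptions.

```python
def estimate_prompt_tokens(n_chunks, chunk_chars, question_chars=200,
                           template_chars=200, chars_per_token=4):
    """Rough token count for the prompt: chunks + question + prompt template."""
    total_chars = n_chunks * chunk_chars + question_chars + template_chars
    return -(-total_chars // chars_per_token)  # ceiling division


def fits_in_context(n_chunks, chunk_chars, num_ctx=4096, answer_tokens=512):
    """True if the estimated prompt plus room for the answer fits in num_ctx."""
    return estimate_prompt_tokens(n_chunks, chunk_chars) + answer_tokens <= num_ctx


print(estimate_prompt_tokens(5, 800))   # 1100
print(fits_in_context(5, 800))          # True
print(fits_in_context(10, 4000))        # False
```

When the estimate lands close to num_ctx, either raise num_ctx or retrieve fewer chunks, since text beyond the context window is not seen by the model.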

Step 5: Connect to Open WebUI’s Built-in RAG

Open WebUI has built-in RAG support that handles document ingestion, embedding, and retrieval through its web interface. This is the easiest path for end users.

Enable RAG in Open WebUI

  1. Open WebUI in your browser (default: http://localhost:3000).
  2. Navigate to Workspace -> Knowledge.
  3. Click Create Knowledge Base.
  4. Give it a name and description.
  5. Upload your documents (PDF, TXT, MD, DOCX supported).

Open WebUI will automatically:

  • Extract text from uploaded documents
  • Chunk the documents
  • Generate embeddings using the configured embedding model
  • Store vectors in its built-in ChromaDB instance

Configure the Embedding Model

In Open WebUI settings, set the embedding model:

  1. Go to Admin Panel -> Settings -> Documents.
  2. Under Embedding Model, select your Ollama embedding model.
  3. Set the Embedding Model Engine to Ollama.
  4. Configure chunk size (default 1000) and overlap (default 100).

Embedding Model Engine: Ollama
Embedding Model: nomic-embed-text
Chunk Size: 1000
Chunk Overlap: 100

Using RAG in Conversations

Once documents are uploaded to a Knowledge Base:

  1. Start a new conversation.
  2. Click the + button in the message input area.
  3. Select your Knowledge Base or upload documents directly.
  4. Ask questions about your documents.

Open WebUI will automatically retrieve relevant chunks and inject them into the LLM’s context. The model’s response will be grounded in your documents.

Docker Compose for the Full Stack

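Here is a complete docker-compose.yml for Ollama + Open WebUI with RAG support. The deploy block assumes an NVIDIA GPU with the NVIDIA Container Toolkit installed on the host; remove it to run on CPU only: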

version: "3.8"

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_EMBEDDING_ENGINE=ollama
      - CHUNK_SIZE=1000
      - CHUNK_OVERLAP=100
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama-data:
  open-webui-data:

Deploy with:

docker compose up -d

After startup, pull the required models:

docker exec ollama ollama pull nomic-embed-text
docker exec ollama ollama pull llama3.1:8b

Performance Optimization

Embedding Speed

Embedding throughput determines how fast you can ingest documents:

Model                GPU (RTX 3090)     CPU (Ryzen 5900X)
nomic-embed-text     ~500 chunks/sec    ~20 chunks/sec
all-minilm           ~800 chunks/sec    ~50 chunks/sec
mxbai-embed-large    ~300 chunks/sec    ~10 chunks/sec

For initial bulk ingestion of large document sets, GPU acceleration makes a significant difference. Ongoing incremental updates are fast enough on CPU.
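To turn the throughput figures above into a time estimate for your own corpus, divide the chunk count by the rate. The sketch below uses the nomic-embed-text rates from the table; the 100,000-chunk corpus size is an assumed example, and real throughput will vary with chunk length and hardware.

```python
def ingestion_seconds(num_chunks, chunks_per_second):
    """Estimated embedding time for a corpus at a given throughput."""
    return num_chunks / chunks_per_second


num_chunks = 100_000  # assumed corpus size for illustration
gpu_seconds = ingestion_seconds(num_chunks, 500)  # nomic-embed-text on GPU
cpu_seconds = ingestion_seconds(num_chunks, 20)   # nomic-embed-text on CPU

print(f"GPU: {gpu_seconds / 60:.1f} min")  # GPU: 3.3 min
print(f"CPU: {cpu_seconds / 60:.1f} min")  # CPU: 83.3 min
```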

Query Latency Breakdown

For a typical RAG query (5 chunks retrieved, 8B model):

Embedding query:        ~5ms (GPU), ~50ms (CPU)
Vector search:          ~2ms (ChromaDB, <100K chunks)
LLM generation:         ~2-5 seconds (depending on response length)
Total:                  ~2-5 seconds

The LLM generation step dominates total latency. Optimizing retrieval speed has minimal impact on user-perceived performance.
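To confirm where the time goes on your own hardware, you can time each stage separately. The sketch below is one possible approach using a small context manager; the timed helper is hypothetical, and the time.sleep calls are placeholders standing in for the real embedding, search, and generation calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage, sink=timings):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = time.perf_counter() - start


# Placeholders: replace each sleep with the real call for that stage.
with timed("embed"):
    time.sleep(0.001)   # e.g. get_embedding(question)
with timed("search"):
    time.sleep(0.001)   # e.g. collection.query(...)
with timed("generate"):
    time.sleep(0.05)    # e.g. the request to /api/generate

for stage, seconds in timings.items():
    print(f"{stage:<10}{seconds * 1000:8.1f} ms")

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest}")
```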

Memory Usage

Plan your RAM and VRAM allocation:

Ollama (8B model, Q4):      ~5GB VRAM
nomic-embed-text:           ~300MB VRAM (shared with Ollama)
ChromaDB (100K chunks):     ~500MB RAM
Open WebUI:                 ~500MB RAM
OS + overhead:              ~2GB RAM

Minimum: 8GB VRAM + 16GB RAM
Recommended: 12GB+ VRAM + 32GB RAM

Troubleshooting

Documents Not Being Retrieved

Symptom: The LLM responds without referencing your documents.

Fixes:

  • Verify the Knowledge Base is attached to your conversation in Open WebUI.
  • Check that the embedding model is installed: ollama list should show nomic-embed-text. Use ollama ps to see which models are currently loaded.
  • Re-upload documents if the embedding model was changed after initial upload. Embeddings from different models are not compatible.
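In a custom pipeline, a model mismatch can be caught early by recording which embedding model built the index and checking it at query time. The sketch below assumes you store that information yourself (for example in collection metadata); the function name and the "embedding_model" and "dimensions" keys are hypothetical. Matching dimensions alone is not enough, because two different models can produce vectors of the same length.

```python
def check_index_compatibility(index_info, query_model, query_embedding):
    """Raise ValueError if a query embedding should not be compared with the index.

    index_info is a plain dict recorded at indexing time, for example
    {"embedding_model": "nomic-embed-text", "dimensions": 768}.
    """
    if index_info["embedding_model"] != query_model:
        raise ValueError(
            f"Index was built with {index_info['embedding_model']!r}, "
            f"but the query uses {query_model!r}. Re-index the documents."
        )
    if len(query_embedding) != index_info["dimensions"]:
        raise ValueError(
            f"Expected {index_info['dimensions']} dimensions, "
            f"got {len(query_embedding)}."
        )


index_info = {"embedding_model": "nomic-embed-text", "dimensions": 768}

# Passes silently: same model, same dimensionality
check_index_compatibility(index_info, "nomic-embed-text", [0.0] * 768)

# Raises: the index was built with a different model
try:
    check_index_compatibility(index_info, "all-minilm", [0.0] * 384)
except ValueError as err:
    print(err)
```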

Poor Answer Quality

Symptom: Retrieved chunks are relevant but the LLM’s answer is wrong or incomplete.

Fixes:

  • Increase the number of retrieved chunks (try 8-10 instead of 5).
  • Reduce chunk size to 500 characters for more precise retrieval.
  • Use a larger LLM (13B instead of 8B) for better comprehension.
  • Lower temperature to 0.1-0.2 for more factual responses.

Slow Ingestion

Symptom: Uploading documents takes a very long time.

Fixes:

  • Ensure the embedding model is running on GPU, not CPU.
  • For bulk uploads, ingest documents via the API rather than the web interface.
  • Large PDFs (100+ pages) should be split into smaller files before upload.

Bottom Line

A local RAG pipeline gives your LLM access to your private documents without sending data to external services. The combination of Ollama for inference and embedding, ChromaDB for vector storage, and Open WebUI for the user interface creates a fully self-hosted system that runs on a single machine with a modern GPU.

Start with Open WebUI’s built-in RAG for simplicity. Move to a custom Python pipeline when you need more control over chunking, embedding, or retrieval logic. Either way, the core components are the same: chunk your documents, embed them, store the vectors, and retrieve relevant context at query time.


FAQ

How much VRAM do I need for a local RAG pipeline?

The embedding model and the LLM share VRAM. nomic-embed-text uses about 300MB of VRAM. An 8B LLM at Q4 quantization uses about 5GB. Total minimum is around 6GB, making an 8GB GPU workable for basic RAG. For 13B models with RAG, plan for 12GB+ VRAM.

Can I use RAG with documents that contain images or tables?

Text-based RAG extracts and indexes only the text content from documents. Tables are often poorly extracted from PDFs. For best results, convert tables to markdown format before ingestion. Image content requires multimodal RAG pipelines, which are more complex and not covered by basic Open WebUI RAG.

How many documents can a local RAG system handle?

ChromaDB and Qdrant can handle millions of document chunks on a single machine. The practical limits are storage space and query latency. A collection of 100,000 chunks with 768-dimension embeddings needs roughly 300MB for the raw float32 vectors alone (100,000 × 768 × 4 bytes), plus space for the chunk text, metadata, and index, and returns results in under 100ms on modest hardware.
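The storage figure above can be checked with simple arithmetic. The sketch below assumes each vector component is stored as a 4-byte float32 and counts only the raw vectors, so chunk text, metadata, and index structures come on top of it.

```python
def raw_vector_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Bytes needed for the embedding vectors alone (float32 by default)."""
    return num_chunks * dimensions * bytes_per_value


size = raw_vector_bytes(100_000, 768)
print(f"{size / 1_000_000:.0f} MB")  # 307 MB
```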

Is local RAG as good as cloud-based RAG services?

For most use cases, yes. The quality depends primarily on your chunking strategy, embedding model choice, and the LLM’s ability to synthesize retrieved context. Local embeddings from nomic-embed-text perform comparably to OpenAI’s ada-002 on standard benchmarks. The main advantage of cloud services is managed infrastructure, not quality.

How do I update documents in my RAG pipeline?

Delete the old document chunks from your vector store and re-ingest the updated document. In Open WebUI, you can remove and re-upload files in the Knowledge section. For custom pipelines with ChromaDB, delete by metadata filter and re-add. ChromaDB can update or upsert individual IDs in place, but an edited document usually re-chunks into different boundaries and counts, so deleting and re-adding is the more reliable approach.
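For a custom ChromaDB pipeline, the delete-and-re-add step might look like the sketch below. It assumes each chunk was stored with a "source" metadata field, as in the indexing code earlier in this guide; the replace_document helper and its source-plus-index ID scheme are illustrative assumptions. The collection argument is expected to be a ChromaDB collection, and embed_fn any function that maps text to an embedding vector.

```python
def replace_document(collection, source, new_chunks, embed_fn):
    """Drop every stored chunk for `source`, then add the fresh chunks.

    Returns the number of chunks written.
    """
    # Remove all chunks whose metadata points at this source file
    collection.delete(where={"source": source})
    if not new_chunks:
        return 0
    collection.add(
        # Assumed ID scheme: source path plus chunk position
        ids=[f"{source}:{i}" for i in range(len(new_chunks))],
        embeddings=[embed_fn(text) for text in new_chunks],
        documents=list(new_chunks),
        metadatas=[
            {"source": source, "chunk_index": i} for i in range(len(new_chunks))
        ],
    )
    return len(new_chunks)
```

With the helpers defined earlier, a call would look like replace_document(collection, "/path/to/doc.md", chunk_text(new_text), get_embedding).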