TL;DR
Fine-tuning takes a general-purpose AI model like Llama 3 and trains it further on your business data. The result is a model that responds in your company’s voice, knows your products, and follows your rules — not a generic chatbot.
What you need:
- 200-500 question/answer pairs from your business
- A GPU with 24GB VRAM (RTX 3090, ~$800 used) or a MacBook with 32GB
- 2-6 hours of training time
- QLoRA + Hugging Face tools (all free and open source)
What you get:
- A small adapter file (200-500MB) that customizes the base model
- A private AI that sounds like your business
- No data sent to OpenAI, Google, or anyone else
The technique is called QLoRA (Quantized Low-Rank Adaptation). It only modifies a tiny percentage of the model’s parameters, which is why it’s fast, cheap, and works on consumer hardware.
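To see why the modified fraction is tiny, here is the back-of-envelope arithmetic for a single adapted matrix. The 4096x4096 dimension is illustrative (typical of an 8B model's attention projections), and rank 32 is the upper end of the value this guide recommends:

```python
# Hypothetical numbers for illustration: one 4096x4096 projection
# matrix, LoRA rank 32. LoRA adds two small matrices A (r x d_in)
# and B (d_out x r) instead of retraining the full matrix.
d_in, d_out, rank = 4096, 4096, 32

full_params = d_in * d_out           # weights in the frozen base matrix
lora_params = rank * (d_in + d_out)  # trainable adapter weights

print(full_params)                   # 16777216
print(lora_params)                   # 262144
print(lora_params / full_params)     # 0.015625 -> ~1.6% per adapted matrix
```

At rank 32 the adapter is about 1/64 the size of the matrix it customizes, which is where the small file sizes and modest VRAM needs come from.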
Fine-Tuning vs. RAG vs. Prompt Engineering
Before diving in, understand the three main approaches to customizing an LLM:
Prompt Engineering is just writing better instructions. You tell the model “You are a customer support agent for Acme Plumbing” in the system prompt. It’s free and instant but limited — the model doesn’t actually know your business, it’s just role-playing.
RAG (Retrieval-Augmented Generation) feeds relevant documents to the model at query time. When a customer asks a question, you search your knowledge base, find the relevant paragraph, and paste it into the prompt. The model answers based on that context. Good for factual Q&A, but the model still doesn’t learn your style or behavior.
Fine-tuning actually changes the model’s weights. It learns patterns from your data. After training, the model responds in your company’s tone, references your products naturally, and follows your policies without needing to be reminded every time. It’s the deepest level of customization.
| Approach | Model learns? | Knows your style? | Setup time | Best for |
|---|---|---|---|---|
| Prompt engineering | No | No | Minutes | Quick experiments |
| RAG | No | No | Hours | Document Q&A |
| Fine-tuning | Yes | Yes | Hours-days | Custom behavior |
For most business use cases, fine-tuning + RAG together gives the best results. Fine-tuning handles tone and behavior, RAG handles current facts and documents.
What Kind of Business Data Works
Fine-tuning needs example conversations: an input and the ideal output. Here are real sources of training data from businesses:
Customer support logs — Your existing chat transcripts and email threads are gold. The questions are real, and your team’s responses show exactly how you want the AI to behave.
FAQ pages — Already in Q&A format. Easy to convert directly into training examples.
Employee training manuals — These describe how your staff should handle different situations. Perfect for teaching an AI your standard procedures.
Product/service descriptions — Teach the model to describe your offerings accurately with correct pricing, features, and specifications.
Standard operating procedures — Policies like return windows, service guarantees, and escalation procedures.
Preparing Training Data
Training data is a JSONL file — one JSON object per line. Each example has a conversation with roles (system, user, assistant):
{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "How much does it cost to fix a leaking faucet?"}, {"role": "assistant", "content": "A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that's typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet -- there's no charge for the estimate if you book the repair."}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "Do you offer emergency plumbing services?"}, {"role": "assistant", "content": "Yes, we offer 24/7 emergency plumbing service. Our emergency rate is $175 for the first hour plus parts. We typically arrive within 45-60 minutes for emergencies like burst pipes, major leaks, or sewer backups. Call our emergency line anytime at 555-0199."}]}
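If you build the file programmatically, a quick validation pass catches malformed lines before they reach training. A minimal sketch using only the standard library (the file name and example content are illustrative):

```python
import json

# Build one training example and append it to the JSONL file.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant for Acme Plumbing."},
        {"role": "user", "content": "Do you offer free estimates?"},
        {"role": "assistant", "content": "Yes -- estimates are free if you book the repair with us."},
    ]
}

with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")

# Validation pass: every line must parse as JSON and follow the
# system/user/assistant role order the trainer expects.
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        roles = [m["role"] for m in json.loads(line)["messages"]]
        assert roles == ["system", "user", "assistant"], roles
```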
How many examples do you need?
| Amount | Result |
|---|---|
| 50-100 | Noticeable improvement |
| 200-500 | Strong results (recommended starting point) |
| 1,000-5,000 | Very polished, handles edge cases well |
| 10,000+ | Diminishing returns for QLoRA |
Quality matters more than quantity. 200 well-written examples beat 2,000 sloppy ones. Each example should be a real scenario with a specific, accurate response.
If you have raw data in CSV format (question and answer columns), you can convert it automatically:
# Install pandas, used by the conversion script
pip install pandas
# Convert CSV to JSONL training format
python prepare_data.py --input faq.csv --output training_data.jsonl --preview 3
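The conversion script itself isn't shown in this guide, so here is a minimal standard-library sketch of what such a converter might do. It assumes the CSV has `question` and `answer` column headers and that a single fixed system prompt applies to every example:

```python
import csv
import json

SYSTEM = "You are a helpful assistant for Acme Plumbing."  # assumption: one fixed system prompt

def csv_to_jsonl(csv_path, jsonl_path, question_col="question", answer_col="answer"):
    """Convert a two-column FAQ CSV into chat-format JSONL training data."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            record = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": row[question_col].strip()},
                {"role": "assistant", "content": row[answer_col].strip()},
            ]}
            dst.write(json.dumps(record) + "\n")

# Tiny demo with a synthetic one-row CSV:
with open("faq.csv", "w", encoding="utf-8") as f:
    f.write("question,answer\nDo you work weekends?,Yes -- Saturdays 8am-4pm.\n")
csv_to_jsonl("faq.csv", "training_data_from_csv.jsonl")
```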
Hardware Requirements
QLoRA is designed to run on consumer hardware. Here’s what you need:
| Hardware | Model size | Training time (500 examples) |
|---|---|---|
| RTX 3090 (24GB) | 8B | ~1 hour |
| RTX 3090 (24GB) | 13B | ~2-3 hours |
| RTX 4090 (24GB) | 8B | ~30 minutes |
| MacBook Pro M2/M3 (32GB) | 8B | ~2-3 hours |
| 2x RTX 3090 (48GB) | 70B | ~4-6 hours |
The RTX 3090 at ~$800 used is the sweet spot. 24GB of VRAM handles most fine-tuning jobs comfortably.
Step-by-Step QLoRA Training
Install dependencies
pip install torch transformers peft trl datasets bitsandbytes accelerate
For Apple Silicon Macs, use MLX instead:
pip install mlx mlx-lm numpy
Run the training
On Linux with an NVIDIA GPU:
python train.py \
--base-model meta-llama/Llama-3.1-8B-Instruct \
--dataset training_data.jsonl \
--output ./output/my-business-adapter \
--epochs 3 \
--lr 2e-4 \
--rank 32
On a MacBook with Apple Silicon:
python train_mac.py \
--dataset training_data.jsonl \
--output ./output/my-business-adapter \
--epochs 3
What happens during training
- The base model loads in 4-bit quantization (fits in less VRAM)
- Small LoRA adapter layers attach to the model
- The model sees each training example multiple times (epochs)
- Only the adapter weights update — the base model stays unchanged
- After training, you get a small adapter file (200-500MB)
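The steps above can be sketched in pure Python. This toy example (2x2 weights, rank 1, illustrative values throughout) shows the core LoRA mechanic: the base weight `W` is never touched, and the effective weight is `W + (alpha/r) * B @ A`:

```python
def matmul(X, Y):
    """Naive matrix multiply for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity, for clarity)
A = [[0.5, 0.5]]               # trainable, shape (r x d_in), rank r = 1
B = [[2.0], [0.0]]             # trainable, shape (d_out x r)
alpha, r = 2, 1

BA = matmul(B, A)              # rank-1 update, shape 2x2
scale = alpha / r
W_eff = [[W[i][j] + scale * BA[i][j] for j in range(2)] for i in range(2)]

print(W_eff)  # [[3.0, 2.0], [0.0, 1.0]] -- update layered on the untouched base
```

Training adjusts only `A` and `B`; saving them (plus `alpha` and `r`) is the small adapter file, while the base model on disk is byte-for-byte unchanged.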
Test the results
python test_adapter.py \
--base-model meta-llama/Llama-3.1-8B-Instruct \
--adapter ./output/my-business-adapter
This drops you into an interactive chat where you can ask questions and see how the fine-tuned model responds.
Key Training Settings
| Setting | What it does | Good starting value |
|---|---|---|
| LoRA rank | Capacity of adapter layers | 16-32 |
| Learning rate | How fast the model learns | 1e-4 to 2e-4 |
| Epochs | Passes through the dataset | 2-4 |
| Batch size | Examples per training step | 4-8 (GPU), 2 (Mac) |
If the model memorizes but can’t generalize: Lower epochs, lower learning rate, add more diverse data.
If the model barely changed: Increase epochs, increase learning rate, increase LoRA rank.
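For a feel of how these settings combine, the arithmetic is simple (illustrative numbers; gradient accumulation, if used, would divide the step count further):

```python
import math

# Total optimizer steps = steps per epoch x epochs.
examples, batch_size, epochs = 500, 4, 3

steps_per_epoch = math.ceil(examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 125 375
```

Doubling epochs or halving batch size doubles total steps, which is the knob you are actually turning when the model has barely changed or has memorized too aggressively.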
Before vs. After: Real Examples
Base Llama 3 (no fine-tuning):
User: How much does it cost to fix a leaking faucet?
Model: The cost of fixing a leaking faucet depends on several factors including the type of faucet, the nature of the leak, and your location. Generally, you can expect to pay between $100 and $350 for a professional plumber…
Generic, not wrong, but not helpful for a specific business.
After fine-tuning on plumbing company data:
User: How much does it cost to fix a leaking faucet?
Model: A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that’s typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet — there’s no charge for the estimate if you book the repair.
Specific prices, company policies, natural upsell to booking. This is what businesses want.
Deploying Your Fine-Tuned Model
Once you’re happy with the results, deploy the adapter for inference:
With vLLM (production):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-business=./output/my-business-adapter
With Ollama (simpler):
Create a Modelfile:
FROM llama3.1:8b
ADAPTER ./output/my-business-adapter
ollama create my-business -f Modelfile
ollama run my-business
The model is now accessible via the standard OpenAI-compatible API at localhost:11434.
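Because the endpoint speaks the OpenAI chat-completions format, client code is the same regardless of which server you chose. A sketch of the request payload (the HTTP call is commented out so the snippet runs without a live server; the Ollama URL below assumes the default port):

```python
import json

# OpenAI-compatible chat request for the fine-tuned model.
payload = {
    "model": "my-business",
    "messages": [
        {"role": "user", "content": "How much does it cost to fix a leaking faucet?"},
    ],
}
body = json.dumps(payload).encode("utf-8")

# To actually send it to the Ollama server started above:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping to the vLLM deployment is just a different base URL; the payload and response shape stay the same.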
When Is Fine-Tuning Worth It?
Fine-tune when:
- You need the model to sound like your business
- Responses must include specific products, prices, policies
- The model needs to follow company-specific rules consistently
- You’re building a customer-facing bot or assistant
- Data privacy is non-negotiable (legal, medical, financial)
Skip fine-tuning when:
- Generic AI responses are good enough
- You just need document Q&A (use RAG instead)
- Your use case changes daily (fine-tuning is for stable knowledge)
- You have fewer than 50 training examples
For most businesses offering customer support, lead qualification, or knowledge base access, fine-tuning delivers clear value. The model works 24/7, responds consistently, and costs a fraction of human staffing for routine inquiries.
Don’t want to manage the training pipeline yourself? We offer managed fine-tuning and private AI hosting — we handle the data prep, training, and deployment so you get a custom model without the infrastructure work.