TL;DR
Fine-tuning takes a general-purpose AI model like Llama 3 and trains it further on your business data. The result is a model that responds in your company’s voice, knows your products, and follows your rules — not a generic chatbot.
What you need:
- 200-500 question/answer pairs from your business
- A GPU with 24GB VRAM (RTX 3090, ~$800 used) or a MacBook with 32GB
- 2-6 hours of training time
- QLoRA + Hugging Face tools (all free and open source)
What you get:
- A small adapter file (200-500MB) that customizes the base model
- A private AI that sounds like your business
- No data sent to OpenAI, Google, or anyone else
The technique is called QLoRA (Quantized Low-Rank Adaptation). It only modifies a tiny percentage of the model’s parameters, which is why it’s fast, cheap, and works on consumer hardware.
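To see why the modified fraction is tiny, here is the back-of-envelope arithmetic for a single adapted matrix. The 4096x4096 dimension is illustrative (typical of an 8B model's attention projections), and rank 32 is the upper end of the value this guide recommends:

```python
# Hypothetical numbers for illustration: one 4096x4096 projection
# matrix, LoRA rank 32. LoRA adds two small matrices A (r x d_in)
# and B (d_out x r) instead of retraining the full matrix.
d_in, d_out, rank = 4096, 4096, 32

full_params = d_in * d_out           # weights in the frozen base matrix
lora_params = rank * (d_in + d_out)  # trainable adapter weights

print(full_params)                   # 16777216
print(lora_params)                   # 262144
print(lora_params / full_params)     # 0.015625 -> ~1.6% per adapted matrix
```

At rank 32 the adapter is about 1/64 the size of the matrix it customizes, which is where the small file sizes and modest VRAM needs come from.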
Fine-Tuning vs. RAG vs. Prompt Engineering
Before diving in, understand the three main approaches to customizing an LLM:
Prompt Engineering is just writing better instructions. You tell the model “You are a customer support agent for Acme Plumbing” in the system prompt. It’s free and instant but limited — the model doesn’t actually know your business, it’s just role-playing.
RAG (Retrieval-Augmented Generation) feeds relevant documents to the model at query time. When a customer asks a question, you search your knowledge base, find the relevant paragraph, and paste it into the prompt. The model answers based on that context. Good for factual Q&A, but the model still doesn’t learn your style or behavior.
Fine-tuning actually changes the model’s weights. It learns patterns from your data. After training, the model responds in your company’s tone, references your products naturally, and follows your policies without needing to be reminded every time. It’s the deepest level of customization.
| Approach | Model learns? | Knows your style? | Setup time | Best for |
|---|---|---|---|---|
| Prompt engineering | No | No | Minutes | Quick experiments |
| RAG | No | No | Hours | Document Q&A |
| Fine-tuning | Yes | Yes | Hours-days | Custom behavior |
For most business use cases, fine-tuning + RAG together gives the best results. Fine-tuning handles tone and behavior, RAG handles current facts and documents.
What Kind of Business Data Works
Fine-tuning needs example conversations: an input and the ideal output. Here are real sources of training data from businesses:
Customer support logs — Your existing chat transcripts and email threads are gold. The questions are real, and your team’s responses show exactly how you want the AI to behave.
FAQ pages — Already in Q&A format. Easy to convert directly into training examples.
Employee training manuals — These describe how your staff should handle different situations. Perfect for teaching an AI your standard procedures.
Product/service descriptions — Teach the model to describe your offerings accurately with correct pricing, features, and specifications.
Standard operating procedures — Policies like return windows, service guarantees, and escalation procedures.
Preparing Training Data
Training data is a JSONL file — one JSON object per line. Each example has a conversation with roles (system, user, assistant):
{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "How much does it cost to fix a leaking faucet?"}, {"role": "assistant", "content": "A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that's typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet -- there's no charge for the estimate if you book the repair."}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "Do you offer emergency plumbing services?"}, {"role": "assistant", "content": "Yes, we offer 24/7 emergency plumbing service. Our emergency rate is $175 for the first hour plus parts. We typically arrive within 45-60 minutes for emergencies like burst pipes, major leaks, or sewer backups. Call our emergency line anytime at 555-0199."}]}
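If you build the file programmatically, a quick validation pass catches malformed lines before they reach training. A minimal sketch using only the standard library (the file name and example content are illustrative):

```python
import json

# Build one training example and append it to the JSONL file.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant for Acme Plumbing."},
        {"role": "user", "content": "Do you offer free estimates?"},
        {"role": "assistant", "content": "Yes -- estimates are free if you book the repair with us."},
    ]
}

with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")

# Validation pass: every line must parse as JSON and follow the
# system/user/assistant role order the trainer expects.
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        roles = [m["role"] for m in json.loads(line)["messages"]]
        assert roles == ["system", "user", "assistant"], roles
```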
How many examples do you need?
| Amount | Result |
|---|---|
| 50-100 | Noticeable improvement |
| 200-500 | Strong results (recommended starting point) |
| 1,000-5,000 | Very polished, handles edge cases well |
| 10,000+ | Diminishing returns for QLoRA |
Quality matters more than quantity. 200 well-written examples beat 2,000 sloppy ones. Each example should be a real scenario with a specific, accurate response.
If you have raw data in CSV format (question and answer columns), you can convert it automatically:
# Install pandas, used by the conversion script
pip install pandas
# Convert CSV to JSONL training format
python prepare_data.py --input faq.csv --output training_data.jsonl --preview 3
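The conversion script itself isn't shown in this guide, so here is a minimal standard-library sketch of what such a converter might do. It assumes the CSV has `question` and `answer` column headers and that a single fixed system prompt applies to every example:

```python
import csv
import json

SYSTEM = "You are a helpful assistant for Acme Plumbing."  # assumption: one fixed system prompt

def csv_to_jsonl(csv_path, jsonl_path, question_col="question", answer_col="answer"):
    """Convert a two-column FAQ CSV into chat-format JSONL training data."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            record = {"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": row[question_col].strip()},
                {"role": "assistant", "content": row[answer_col].strip()},
            ]}
            dst.write(json.dumps(record) + "\n")

# Tiny demo with a synthetic one-row CSV:
with open("faq.csv", "w", encoding="utf-8") as f:
    f.write("question,answer\nDo you work weekends?,Yes -- Saturdays 8am-4pm.\n")
csv_to_jsonl("faq.csv", "training_data_from_csv.jsonl")
```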
Hardware Requirements
QLoRA is designed to run on consumer hardware. Here’s what you need:
| Hardware | Model size | Training time (500 examples) |
|---|---|---|
| RTX 3090 (24GB) | 8B | ~1 hour |
| RTX 3090 (24GB) | 13B | ~2-3 hours |
| RTX 4090 (24GB) | 8B | ~30 minutes |
| MacBook Pro M2/M3 (32GB) | 8B | ~2-3 hours |
| 2x RTX 3090 (48GB) | 70B | ~4-6 hours |
The RTX 3090 at ~$800 used is the sweet spot. 24GB of VRAM handles most fine-tuning jobs comfortably.
Step-by-Step QLoRA Training
Install dependencies
pip install torch transformers peft trl datasets bitsandbytes accelerate
For Apple Silicon Macs, use MLX instead:
pip install mlx mlx-lm numpy
Run the training
On Linux with an NVIDIA GPU:
python train.py \
--base-model meta-llama/Llama-3.1-8B-Instruct \
--dataset training_data.jsonl \
--output ./output/my-business-adapter \
--epochs 3 \
--lr 2e-4 \
--rank 32
On a MacBook with Apple Silicon:
python train_mac.py \
--dataset training_data.jsonl \
--output ./output/my-business-adapter \
--epochs 3
What happens during training
- The base model loads in 4-bit quantization (fits in less VRAM)
- Small LoRA adapter layers attach to the model
- The model sees each training example multiple times (epochs)
- Only the adapter weights update — the base model stays unchanged
- After training, you get a small adapter file (200-500MB)
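The steps above can be sketched in pure Python. This toy example (2x2 weights, rank 1, illustrative values throughout) shows the core LoRA mechanic: the base weight `W` is never touched, and the effective weight is `W + (alpha/r) * B @ A`:

```python
def matmul(X, Y):
    """Naive matrix multiply for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity, for clarity)
A = [[0.5, 0.5]]               # trainable, shape (r x d_in), rank r = 1
B = [[2.0], [0.0]]             # trainable, shape (d_out x r)
alpha, r = 2, 1

BA = matmul(B, A)              # rank-1 update, shape 2x2
scale = alpha / r
W_eff = [[W[i][j] + scale * BA[i][j] for j in range(2)] for i in range(2)]

print(W_eff)  # [[3.0, 2.0], [0.0, 1.0]] -- update layered on the untouched base
```

Training adjusts only `A` and `B`; saving them (plus `alpha` and `r`) is the small adapter file, while the base model on disk is byte-for-byte unchanged.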
Test the results
python test_adapter.py \
--base-model meta-llama/Llama-3.1-8B-Instruct \
--adapter ./output/my-business-adapter
This drops you into an interactive chat where you can ask questions and see how the fine-tuned model responds.
Key Training Settings
| Setting | What it does | Good starting value |
|---|---|---|
| LoRA rank | Capacity of adapter layers | 16-32 |
| Learning rate | How fast the model learns | 1e-4 to 2e-4 |
| Epochs | Passes through the dataset | 2-4 |
| Batch size | Examples per training step | 4-8 (GPU), 2 (Mac) |
If the model memorizes but can’t generalize: Lower epochs, lower learning rate, add more diverse data.
If the model barely changed: Increase epochs, increase learning rate, increase LoRA rank.
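For a feel of how these settings combine, the arithmetic is simple (illustrative numbers; gradient accumulation, if used, would divide the step count further):

```python
import math

# Total optimizer steps = steps per epoch x epochs.
examples, batch_size, epochs = 500, 4, 3

steps_per_epoch = math.ceil(examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 125 375
```

Doubling epochs or halving batch size doubles total steps, which is the knob you are actually turning when the model has barely changed or has memorized too aggressively.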
Before vs. After: Real Examples
Base Llama 3 (no fine-tuning):
User: How much does it cost to fix a leaking faucet?
Model: The cost of fixing a leaking faucet depends on several factors including the type of faucet, the nature of the leak, and your location. Generally, you can expect to pay between $100 and $350 for a professional plumber…
Generic, not wrong, but not helpful for a specific business.
After fine-tuning on plumbing company data:
User: How much does it cost to fix a leaking faucet?
Model: A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that’s typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet — there’s no charge for the estimate if you book the repair.
Specific prices, company policies, natural upsell to booking. This is what businesses want.
Deploying Your Fine-Tuned Model
Once you’re happy with the results, deploy the adapter for inference:
With vLLM (production):
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-business=./output/my-business-adapter
With Ollama (simpler):
Create a Modelfile:
FROM llama3.1:8b
ADAPTER ./output/my-business-adapter
ollama create my-business -f Modelfile
ollama run my-business
The model is now accessible via the standard OpenAI-compatible API at localhost:11434.
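Because the endpoint speaks the OpenAI chat-completions format, client code is the same regardless of which server you chose. A sketch of the request payload (the HTTP call is commented out so the snippet runs without a live server; the Ollama URL below assumes the default port):

```python
import json

# OpenAI-compatible chat request for the fine-tuned model.
payload = {
    "model": "my-business",
    "messages": [
        {"role": "user", "content": "How much does it cost to fix a leaking faucet?"},
    ],
}
body = json.dumps(payload).encode("utf-8")

# To actually send it to the Ollama server started above:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swapping to the vLLM deployment is just a different base URL; the payload and response shape stay the same.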
When Is Fine-Tuning Worth It?
Fine-tune when:
- You need the model to sound like your business
- Responses must include specific products, prices, policies
- The model needs to follow company-specific rules consistently
- You’re building a customer-facing bot or assistant
- Data privacy is non-negotiable (legal, medical, financial)
Skip fine-tuning when:
- Generic AI responses are good enough
- You just need document Q&A (use RAG instead)
- Your use case changes daily (fine-tuning is for stable knowledge)
- You have fewer than 50 training examples
For most businesses offering customer support, lead qualification, or knowledge base access, fine-tuning delivers clear value. The model works 24/7, responds consistently, and costs a fraction of human staffing for routine inquiries.
Don’t want to manage the training pipeline yourself? We offer managed fine-tuning and private AI hosting — we handle the data prep, training, and deployment so you get a custom model without the infrastructure work.