TL;DR

Fine-tuning takes a general-purpose AI model like Llama 3 and trains it further on your business data. The result is a model that responds in your company’s voice, knows your products, and follows your rules — not a generic chatbot.

What you need:

  • 200-500 question/answer pairs from your business
  • A GPU with 24GB VRAM (RTX 3090, ~$800 used) or a MacBook with 32GB
  • 2-6 hours of training time
  • QLoRA + Hugging Face tools (all free and open source)

What you get:

  • A small adapter file (200-500MB) that customizes the base model
  • A private AI that sounds like your business
  • No data sent to OpenAI, Google, or anyone else

The technique is called QLoRA (Quantized Low-Rank Adaptation). It only modifies a tiny percentage of the model’s parameters, which is why it’s fast, cheap, and works on consumer hardware.
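To make "tiny percentage" concrete, here is a back-of-the-envelope calculation of LoRA adapter size. The dimensions below are illustrative assumptions based on a Llama-3.1-8B-style architecture (hidden size 4096, 32 layers, grouped-query attention with a KV projection dimension of 1024), not measured from a real run:

```python
# Rough estimate of LoRA adapter size vs. the base model.
# All dimensions are assumptions for a Llama-3.1-8B-style model.
RANK = 16
HIDDEN, KV_DIM, LAYERS = 4096, 1024, 32
BASE_PARAMS = 8_000_000_000  # ~8B-parameter base model

# A LoRA adapter on a (d_in x d_out) weight adds rank * (d_in + d_out) params.
# Adapters on the q, k, v, and o attention projections:
projections = [(HIDDEN, HIDDEN), (HIDDEN, KV_DIM), (HIDDEN, KV_DIM), (HIDDEN, HIDDEN)]
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in projections)
adapter_params = per_layer * LAYERS

print(f"{adapter_params:,} trainable params")  # 13,631,488
print(f"{100 * adapter_params / BASE_PARAMS:.2f}% of the base model")  # 0.17%
```

Even at rank 16 across all attention projections, the adapter trains well under 1% of the model's parameters, which is why the whole job fits on a single consumer GPU.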

Fine-Tuning vs. RAG vs. Prompt Engineering

Before diving in, understand the three main approaches to customizing an LLM:

Prompt Engineering is just writing better instructions. You tell the model “You are a customer support agent for Acme Plumbing” in the system prompt. It’s free and instant but limited — the model doesn’t actually know your business, it’s just role-playing.

RAG (Retrieval-Augmented Generation) feeds relevant documents to the model at query time. When a customer asks a question, you search your knowledge base, find the relevant paragraph, and paste it into the prompt. The model answers based on that context. Good for factual Q&A, but the model still doesn’t learn your style or behavior.

Fine-tuning actually changes the model’s weights. It learns patterns from your data. After training, the model responds in your company’s tone, references your products naturally, and follows your policies without needing to be reminded every time. It’s the deepest level of customization.

| Approach | Model learns? | Knows your style? | Setup time | Best for |
|---|---|---|---|---|
| Prompt engineering | No | No | Minutes | Quick experiments |
| RAG | No | No | Hours | Document Q&A |
| Fine-tuning | Yes | Yes | Hours-days | Custom behavior |

For most business use cases, fine-tuning + RAG together gives the best results. Fine-tuning handles tone and behavior, RAG handles current facts and documents.

What Kind of Business Data Works

Fine-tuning needs example conversations: an input and the ideal output. Here are real sources of training data from businesses:

Customer support logs — Your existing chat transcripts and email threads are gold. The questions are real, and your team’s responses show exactly how you want the AI to behave.

FAQ pages — Already in Q&A format. Easy to convert directly into training examples.

Employee training manuals — These describe how your staff should handle different situations. Perfect for teaching an AI your standard procedures.

Product/service descriptions — Teach the model to describe your offerings accurately with correct pricing, features, and specifications.

Standard operating procedures — Policies like return windows, service guarantees, and escalation procedures.

Preparing Training Data

Training data is a JSONL file — one JSON object per line. Each example has a conversation with roles (system, user, assistant):

{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "How much does it cost to fix a leaking faucet?"}, {"role": "assistant", "content": "A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that's typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet -- there's no charge for the estimate if you book the repair."}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant for Acme Plumbing."}, {"role": "user", "content": "Do you offer emergency plumbing services?"}, {"role": "assistant", "content": "Yes, we offer 24/7 emergency plumbing service. Our emergency rate is $175 for the first hour plus parts. We typically arrive within 45-60 minutes for emergencies like burst pipes, major leaks, or sewer backups. Call our emergency line anytime at 555-0199."}]}

How many examples do you need?

| Amount | Result |
|---|---|
| 50-100 | Noticeable improvement |
| 200-500 | Strong results (recommended starting point) |
| 1,000-5,000 | Very polished, handles edge cases well |
| 10,000+ | Diminishing returns for QLoRA |

Quality matters more than quantity. 200 well-written examples beat 2,000 sloppy ones. Each example should be a real scenario with a specific, accurate response.

If you have raw data in CSV format (question and answer columns), you can convert it automatically:

# Install the data prep tool
pip install pandas

# Convert CSV to JSONL training format
python prepare_data.py --input faq.csv --output training_data.jsonl --preview 3

Hardware Requirements

QLoRA is designed to run on consumer hardware. Here’s what you need:

| Hardware | Model size | Training time (500 examples) |
|---|---|---|
| RTX 3090 (24GB) | 8B | ~1 hour |
| RTX 3090 (24GB) | 13B | ~2-3 hours |
| RTX 4090 (24GB) | 8B | ~30 minutes |
| MacBook Pro M2/M3 (32GB) | 8B | ~2-3 hours |
| 2x RTX 3090 (48GB) | 70B | ~4-6 hours |

The RTX 3090 at ~$800 used is the sweet spot. 24GB of VRAM handles most fine-tuning jobs comfortably.

Step-by-Step QLoRA Training

Install dependencies

pip install torch transformers peft trl datasets bitsandbytes accelerate

For Apple Silicon Macs, use MLX instead:

pip install mlx mlx-lm numpy

Run the training

On Linux with an NVIDIA GPU:

python train.py \
  --base-model meta-llama/Llama-3.1-8B-Instruct \
  --dataset training_data.jsonl \
  --output ./output/my-business-adapter \
  --epochs 3 \
  --lr 2e-4 \
  --rank 32

On a MacBook with Apple Silicon:

python train_mac.py \
  --dataset training_data.jsonl \
  --output ./output/my-business-adapter \
  --epochs 3

What happens during training

  1. The base model loads in 4-bit quantization (fits in less VRAM)
  2. Small LoRA adapter layers attach to the model
  3. The model sees each training example multiple times (epochs)
  4. Only the adapter weights update — the base model stays unchanged
  5. After training, you get a small adapter file (200-500MB)
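Under the hood, steps 1 and 2 correspond to a quantization config plus a LoRA config in the Hugging Face stack. This is a hedged sketch of that setup, mirroring the CLI flags above; it is not the actual `train.py` source, and running it requires a GPU and access to the gated Llama weights:

```python
# Sketch of the QLoRA setup: 4-bit base model + LoRA adapters (assumed
# internals of train.py, using standard transformers/peft APIs).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # step 1: quantize the base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=32,                                   # matches --rank 32
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)         # step 2: attach adapter layers
model.print_trainable_parameters()          # only adapter weights update
```

The training loop itself (steps 3-4) is typically handled by a trainer such as TRL's `SFTTrainer`, which is what the `pip install trl` dependency above is for.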

Test the results

python test_adapter.py \
  --base-model meta-llama/Llama-3.1-8B-Instruct \
  --adapter ./output/my-business-adapter

This drops you into an interactive chat where you can ask questions and see how the fine-tuned model responds.

Key Training Settings

| Setting | What it does | Good starting value |
|---|---|---|
| LoRA rank | Capacity of adapter layers | 16-32 |
| Learning rate | How fast the model learns | 1e-4 to 2e-4 |
| Epochs | Passes through the dataset | 2-4 |
| Batch size | Examples per training step | 4-8 (GPU), 2 (Mac) |

If the model memorizes but can’t generalize: Lower epochs, lower learning rate, add more diverse data.

If the model barely changed: Increase epochs, increase learning rate, increase LoRA rank.

Before vs. After: Real Examples

Base Llama 3 (no fine-tuning):

User: How much does it cost to fix a leaking faucet?

Model: The cost of fixing a leaking faucet depends on several factors including the type of faucet, the nature of the leak, and your location. Generally, you can expect to pay between $100 and $350 for a professional plumber…

Generic, not wrong, but not helpful for a specific business.

After fine-tuning on plumbing company data:

User: How much does it cost to fix a leaking faucet?

Model: A basic faucet repair runs $95-$150 depending on the type of faucet. If the faucet needs to be replaced entirely, that’s typically $150-$350 including parts and labor. We can give you an exact quote once we see the faucet — there’s no charge for the estimate if you book the repair.

Specific prices, company policies, natural upsell to booking. This is what businesses want.

Deploying Your Fine-Tuned Model

Once you’re happy with the results, deploy the adapter for inference:

With vLLM (production):

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-business=./output/my-business-adapter

With Ollama (simpler):

Create a Modelfile:

FROM llama3.1:8b
ADAPTER ./output/my-business-adapter

Then build and run the model:

ollama create my-business -f Modelfile
ollama run my-business

The model is now accessible via the standard OpenAI-compatible API at localhost:11434.
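Any OpenAI-compatible client can then talk to it. Here is a minimal sketch using only the standard library; the model name `my-business` matches the Ollama example above, and the request assumes the server is running locally:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(model, user_msg, base_url="http://localhost:11434/v1"):
    """Build an OpenAI-style chat completion request for a local Ollama server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }).encode("utf-8")
    return Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Sending the request (requires Ollama running on localhost:11434):
# req = build_chat_request("my-business", "Do you offer emergency plumbing?")
# with urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the API shape matches OpenAI's, existing chatbot frontends and SDKs can usually be pointed at the local endpoint by changing only the base URL.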

When Is Fine-Tuning Worth It?

Fine-tune when:

  • You need the model to sound like your business
  • Responses must include specific products, prices, policies
  • The model needs to follow company-specific rules consistently
  • You’re building a customer-facing bot or assistant
  • Data privacy is non-negotiable (legal, medical, financial)

Skip fine-tuning when:

  • Generic AI responses are good enough
  • You just need document Q&A (use RAG instead)
  • Your use case changes daily (fine-tuning is for stable knowledge)
  • You have fewer than 50 training examples

For most businesses offering customer support, lead qualification, or knowledge base access, fine-tuning delivers clear value. The model works 24/7, responds consistently, and costs a fraction of human staffing for routine inquiries.


Don’t want to manage the training pipeline yourself? We offer managed fine-tuning and private AI hosting — we handle the data prep, training, and deployment so you get a custom model without the infrastructure work.