TL;DR

You can run your own OpenAI-compatible API on a single machine with a GPU. Your data never leaves your hardware, costs are fixed instead of per-token, and you can serve custom fine-tuned models.

What you get:

  • A drop-in replacement for the OpenAI API (change one line of code to switch)
  • Complete data privacy — nothing sent to external servers
  • Fixed monthly cost instead of unpredictable per-token billing
  • Custom models fine-tuned on your business data
  • No per-seat licensing

Minimum setup:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:8b

# Your API is now running at localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

That’s a working private AI API in 3 commands. The rest of this guide covers making it production-ready with authentication, usage tracking, and remote access.

Why Run Your Own AI API

Data Privacy

Every prompt you send to OpenAI, Anthropic, or Google passes through their servers. For many businesses, this is a non-starter:

  • Law firms can’t risk client data in third-party systems
  • Healthcare providers face regulatory requirements about data handling
  • Financial services have strict data residency requirements
  • Any business with proprietary information benefits from keeping it private

With a self-hosted API, your prompts and responses never leave your hardware.

Cost Predictability

OpenAI charges per token. A busy application can rack up thousands of dollars per month, with unpredictable spikes. Self-hosted AI has a fixed cost:

Approach                  | Monthly cost (moderate use) | Monthly cost (heavy use)
OpenAI GPT-4o             | $200-500                    | $1,000-5,000+
OpenAI GPT-3.5            | $50-200                     | $500-2,000
Self-hosted (1x RTX 3090) | ~$50 electricity            | ~$50 electricity

The GPU costs the same whether you send 100 requests or 100,000 requests per day. After the hardware purchase (~$800 for a used RTX 3090), your only ongoing cost is electricity.

No Vendor Lock-In

The OpenAI API format has become a de facto standard. By exposing the same format from your own server, nearly every tool, library, and integration that works with OpenAI also works with your private API. You switch providers by changing a base URL and key:

from openai import OpenAI

# OpenAI
client = OpenAI(api_key="sk-...")

# Your private API (same code, different URL)
client = OpenAI(
    base_url="https://api.yourdomain.com/v1",
    api_key="sk-your-private-key"
)
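
Once the client points at your endpoint, the rest of your code stays the same. A minimal request, assuming your key is assigned the llama3.1:8b model from the setup above:

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
)
print(response.choices[0].message.content)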

Architecture Overview

A production private AI API has these components:

Customers / Your Team
        |
    HTTPS (TLS)
        |
    Reverse Proxy (Caddy)
    - Auto SSL certificates
    - Serves landing page
        |
    API Gateway (FastAPI)
    - API key authentication
    - Rate limiting
    - Usage tracking
    - Request routing
        |
    Inference Engine (Ollama or vLLM)
    - Runs the actual AI model
    - GPU-accelerated
    - OpenAI-compatible API

You can run all of this on a single machine, or move the gateway onto a cheap cloud server and keep the GPU work at home.

Setting Up the Inference Engine

Option A: Ollama (simpler, great for getting started)

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

Ollama runs as a service and automatically serves an API on port 11434. It handles model management, GPU detection, and quantization automatically.
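
You can point the same OpenAI Python client at it to confirm the OpenAI-compatible endpoints are live; Ollama ignores the API key, so a placeholder value is enough:

from openai import OpenAI

# Local Ollama endpoint; the key is a placeholder because Ollama does not check it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
print([m.id for m in client.models.list().data])  # lists the models you have pulled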

Option B: vLLM (higher performance, production use)

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8080 \
  --max-model-len 4096

vLLM adds continuous batching (many concurrent requests handled efficiently), LoRA adapter hot-loading (serve multiple fine-tuned models from one base), and noticeably higher throughput than Ollama under load.

Recommendation: Start with Ollama for development and testing. Switch to vLLM when you need production performance.
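
To see continuous batching in action, you can send several requests at once and watch them finish far faster than they would if processed one at a time. A small concurrency sketch with httpx and asyncio (URL and model name match the vLLM command above):

import asyncio
import httpx

async def ask(client: httpx.AsyncClient, prompt: str) -> str:
    # Send one chat completion request to the local vLLM server
    resp = await client.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    return resp.json()["choices"][0]["message"]["content"]

async def main():
    prompts = [f"Give me one fact about the number {i}." for i in range(8)]
    async with httpx.AsyncClient() as client:
        # All eight requests are in flight at once; vLLM batches them on the GPU
        answers = await asyncio.gather(*(ask(client, p) for p in prompts))
    for answer in answers:
        print(answer[:80])

asyncio.run(main())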

Adding Authentication

A bare Ollama instance has no authentication — anyone who can reach it can use it. In production, you need an API gateway that validates keys before forwarding requests.

Here’s the core of a FastAPI gateway with API key authentication:

from fastapi import FastAPI, Request, HTTPException, Depends
import httpx

app = FastAPI()
BACKEND_URL = "http://localhost:11434"

async def verify_api_key(request: Request):
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer sk-"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Look up the key in your database
    # Return the customer's allowed model
    return {"model": "llama3.1:8b"}

@app.post("/v1/chat/completions")
async def chat(request: Request, auth=Depends(verify_api_key)):
    body = await request.json()
    body["model"] = auth["model"]  # Force the assigned model

    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{BACKEND_URL}/v1/chat/completions",
            json=body, timeout=300
        )
    return resp.json()

Each customer gets their own API key, locked to a specific model. They can’t switch models or access other customers’ fine-tuned models.
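
The commented-out lookup in verify_api_key can be as small as a SQLite query. A hedged sketch, assuming an illustrative api_keys table that maps each key to a customer name and an assigned model (the file name and schema are made up for this example):

import sqlite3

def lookup_key(api_key: str) -> dict | None:
    # Illustrative schema: api_keys(key TEXT PRIMARY KEY, customer TEXT, model TEXT)
    conn = sqlite3.connect("gateway.db")
    try:
        row = conn.execute(
            "SELECT customer, model FROM api_keys WHERE key = ?", (api_key,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return None
    return {"customer": row[0], "model": row[1]}

verify_api_key would then strip the "Bearer " prefix, call lookup_key, and raise the 401 when it returns None; the returned dict is where the usage tracking below gets the customer name.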

Usage Tracking

Log every request to know how much each customer uses:

from datetime import datetime, timezone

# After each request, build a usage record (response is the parsed backend JSON):
log = {
    "customer": auth["customer"],                 # from your API key lookup
    "model": body["model"],
    "prompt_tokens": response["usage"]["prompt_tokens"],
    "completion_tokens": response["usage"]["completion_tokens"],
    "response_time_ms": elapsed_ms,               # measured around the backend call
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

This data is essential for billing, capacity planning, and knowing when you need more GPUs.
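
One simple way to persist those records is a small SQLite table next to the gateway. A sketch (the database file and schema are illustrative, not part of the stack above):

import sqlite3

def record_usage(log: dict) -> None:
    # Illustrative schema: usage(customer, model, prompt_tokens,
    #                            completion_tokens, response_time_ms, timestamp)
    conn = sqlite3.connect("gateway.db")
    try:
        conn.execute(
            "INSERT INTO usage (customer, model, prompt_tokens, completion_tokens,"
            " response_time_ms, timestamp) VALUES (?, ?, ?, ?, ?, ?)",
            (
                log["customer"],
                log["model"],
                log["prompt_tokens"],
                log["completion_tokens"],
                log["response_time_ms"],
                log["timestamp"],
            ),
        )
        conn.commit()
    finally:
        conn.close()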

Making It Accessible Remotely

For a production API accessible from anywhere, you need:

  1. A domain name ($12/year)
  2. A reverse proxy with SSL (Caddy handles this automatically)
  3. Either a static IP or a cloud gateway

The simplest production setup: a cheap cloud VPS ($5-10/month) running Caddy + your API gateway, connected to your GPU machine via WireGuard VPN. Your home IP is never exposed.

Customer --> api.yourdomain.com (Cloud VPS)
                    |
            WireGuard VPN tunnel
                    |
            Your GPU machine at home

Caddy automatically gets SSL certificates from Let’s Encrypt:

api.yourdomain.com {
    handle /v1/* {
        reverse_proxy gateway:8000
    }
}

Model Selection

Choose your model based on your hardware and quality needs:

Model              | VRAM  | Quality   | Speed     | Best for
Llama 3.1 8B       | 8 GB  | Good      | Fast      | Customer support, simple Q&A
Qwen 2.5 14B       | 14 GB | Better    | Moderate  | Professional content, analysis
Llama 3.1 70B (Q4) | 40 GB | Excellent | Slower    | Complex tasks, legal, medical
Mistral 7B         | 5 GB  | Good      | Very fast | High-volume, simple tasks

A single RTX 3090 (24GB) can run the 8B model with headroom, or a 14B model comfortably. Two 3090s can run the 70B model (4-bit quantized) with tensor parallelism.
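
A quick way to sanity-check these VRAM numbers: weights take roughly parameter count times bits per weight divided by 8 bytes, plus overhead for the KV cache and activations. A back-of-the-envelope sketch (the 1.2 overhead factor is an assumption, not a measured value):

def estimate_vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weights plus ~20% for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weight_gb * overhead

print(estimate_vram_gb(8, 8))    # ~9.6 GB  -> fits a 24 GB RTX 3090 easily
print(estimate_vram_gb(70, 4))   # ~42 GB   -> needs two 24 GB cards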

Scaling

Multiple customers on one GPU

A single RTX 3090 running Llama 3.1 8B can handle 30-60 requests per minute. For most business customers with bursty usage patterns, one GPU can serve 10-20 customers.

Adding GPUs

When you need more capacity:

  • Add a second GPU to the same machine
  • Add a second machine and load-balance between them
  • Your API gateway routes requests to healthy backends automatically

Failover

If one backend goes down, the gateway routes all traffic to the remaining backends. Customers see reduced capacity at worst, not an outage:

BACKENDS = [
    "http://10.100.0.10:11434",  # Machine A
    "http://10.100.0.11:11434",  # Machine B
]
# Health checks every 30 seconds
# Automatic failover to healthy backends
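
A minimal sketch of how the gateway might run those checks and pick a backend per request (the /v1/models health probe and the 30-second interval are assumptions; adapt them to your setup):

import asyncio
import itertools
import httpx

BACKENDS = [
    "http://10.100.0.10:11434",  # Machine A
    "http://10.100.0.11:11434",  # Machine B
]
healthy: set[str] = set(BACKENDS)
_round_robin = itertools.cycle(BACKENDS)

async def health_check_loop() -> None:
    # Mark a backend unhealthy if its models endpoint stops answering.
    while True:
        async with httpx.AsyncClient(timeout=5) as client:
            for backend in BACKENDS:
                try:
                    ok = (await client.get(f"{backend}/v1/models")).status_code == 200
                except httpx.HTTPError:
                    ok = False
                if ok:
                    healthy.add(backend)
                else:
                    healthy.discard(backend)
        await asyncio.sleep(30)

def pick_backend() -> str:
    # Round-robin over backends, skipping any that failed the last health check.
    for _ in range(len(BACKENDS)):
        candidate = next(_round_robin)
        if candidate in healthy:
            return candidate
    raise RuntimeError("No healthy backends available")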

Cost Analysis for a Real Business

Running a private AI service for a small team or external customers:

Startup costs:

Item                | Cost
RTX 3090 (used)     | ~$800
Domain name         | ~$12/year
Cloud VPS (gateway) | ~$5/month

Monthly operating costs:

Item                | Cost
Electricity (1 GPU) | $30-60/month
Cloud VPS           | $5/month
Internet            | (existing)
Total               | ~$40-70/month

Equivalent OpenAI cost for the same workload: $200-2,000+/month depending on usage volume.

The self-hosted approach pays for the GPU in 2-4 months compared to cloud API costs.

Who Benefits Most

Law firms — Confidential client data stays on-premise. Draft contracts, review documents, and research case law without exposing anything to third parties.

Healthcare — Patient data never leaves the facility. Triage assistance, documentation help, and clinical Q&A with data privacy built in.

Financial services — Trading strategies, client portfolios, and proprietary analysis stay private. Regulatory compliance is simpler when data doesn’t leave your infrastructure.

Any business with proprietary knowledge — Product roadmaps, pricing strategies, internal procedures. Keep your competitive advantage private while still using AI to work faster.


Don’t want to set this up yourself? We offer managed private AI hosting — your own API endpoint with custom models, authentication, and usage tracking, all hosted and maintained for you.