Running Ollama Serve: Complete Setup Guide for Local AI

TL;DR

The ollama serve command launches the Ollama daemon that exposes a REST API on port 11434 for running local LLM inference. Unlike the simpler ollama run command for interactive chat, serve mode is designed for persistent server deployments where multiple applications need programmatic access to your models.

After installing Ollama with curl -fsSL https://ollama.com/install.sh | sh, the service typically starts automatically via systemd on Linux. You can verify it’s running with systemctl status ollama or by checking if port 11434 responds to API requests. The daemon loads models on-demand when applications request them through the HTTP API.

Key configuration happens through environment variables rather than config files. Set OLLAMA_HOST to bind to specific interfaces – the default 127.0.0.1:11434 only accepts local connections, while 0.0.0.0:11434 allows network access. Use OLLAMA_MODELS to change where model files are stored from the default ~/.ollama/models directory. The OLLAMA_ORIGINS variable controls CORS headers for web applications.

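For GPU acceleration, Ollama detects supported NVIDIA and AMD GPUs automatically and offloads as many model layers as fit in VRAM. The number of offloaded layers can be overridden per request with the num_gpu option, and the set of visible devices can be restricted with CUDA_VISIBLE_DEVICES (NVIDIA) or ROCR_VISIBLE_DEVICES (AMD). OLLAMA_NUM_GPU is not among the documented server environment variables; run ollama serve --help to see the list your version supports. Getting GPU offload right can dramatically improve inference speed on systems with NVIDIA or AMD GPUs.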

The serve command runs in the foreground by default, making it suitable for Docker containers or systemd units. You’ll typically interact with it through API calls rather than the CLI. Common troubleshooting involves checking if the port is already bound, verifying model downloads completed successfully, and ensuring sufficient disk space in your models directory.

Caution: When using AI-generated scripts to automate Ollama server management, always review the commands manually before running them in production. Verify environment variables match the documented names and that port bindings align with your security requirements.

Understanding the Ollama Server Architecture

The Ollama server operates as a lightweight daemon that exposes a REST API for model inference requests. When you run ollama serve, the process binds to port 11434 by default and listens for HTTP requests from client applications. This architecture separates the heavy lifting of model loading and inference from your application code, allowing multiple clients to share the same model instance.

The Ollama daemon loads models into memory on demand when the first inference request for them arrives. Once loaded, a model stays in memory for a keep-alive period (5 minutes of inactivity by default, adjustable with the OLLAMA_KEEP_ALIVE variable or the keep_alive request field), or until the server restarts or you unload it explicitly. This design optimizes for response time – subsequent requests to the same model skip the loading phase entirely. The server queues incoming requests and limits how many run at once per model (controlled by OLLAMA_NUM_PARALLEL), which keeps concurrent clients from exhausting GPU memory.
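How long a model stays resident can also be influenced per request. The sketch below builds request bodies for /api/generate using the API's keep_alive field; the helper names are hypothetical and the duration shown is only an example value.

```python
import json
from typing import Optional, Union


def build_generate_payload(
    model: str,
    prompt: str,
    keep_alive: Optional[Union[str, int]] = None,
) -> dict:
    """Build a non-streaming /api/generate body (helper name is illustrative)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if keep_alive is not None:
        # Accepts a duration string such as "30m" or a number of seconds.
        payload["keep_alive"] = keep_alive
    return payload


def build_unload_payload(model: str) -> dict:
    """A request without a prompt and with keep_alive=0 asks the server to unload the model."""
    return {"model": model, "keep_alive": 0}


print(json.dumps(build_generate_payload("llama3.2", "Hello", keep_alive="30m")))
print(json.dumps(build_unload_payload("llama3.2")))
```

Either dictionary can be sent as the JSON body of a POST to /api/generate.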

API Endpoint Structure

The server exposes several key endpoints at http://localhost:11434:

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking"
}'

# Chat endpoint for conversation
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is systemd?"}]
}'

# List locally installed models
curl http://localhost:11434/api/tags

The server communicates with clients using JSON over HTTP, making it straightforward to integrate with Python scripts, shell tools, or web applications. Unlike some AI services, Ollama does not expose a Prometheus metrics endpoint, so monitoring requires parsing server logs or tracking API response times at the application layer.
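One application-layer option for monitoring is to derive throughput from the timing fields carried by the final /api/generate response object (eval_count, and eval_duration in nanoseconds). A minimal sketch; the function name and the sample values are made up for illustration.

```python
def tokens_per_second(final_response: dict) -> float:
    """Compute generation throughput from a final /api/generate response object."""
    eval_count = final_response.get("eval_count", 0)
    eval_duration_ns = final_response.get("eval_duration", 0)
    if eval_duration_ns <= 0:
        # Field missing or zero: avoid dividing by zero.
        return 0.0
    return eval_count / (eval_duration_ns / 1_000_000_000)


# Illustrative values only, not measured output.
sample = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Logging this value per request gives a rough performance baseline without any server-side metrics support.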

Caution: When building automation that calls these endpoints, validate model outputs before executing any generated commands. AI models can hallucinate invalid syntax or dangerous operations, especially when generating system administration scripts.

Manual Server Launch and Daemon Management

When you need direct control over the Ollama server process, manual launch provides flexibility for testing configurations and troubleshooting connection issues. The ollama serve command starts the API server in the foreground, making it ideal for development environments where you want immediate log visibility.

Launch the server directly from your terminal:

ollama serve

The server binds to port 11434 by default and displays startup logs immediately. You’ll see model loading messages and API endpoint initialization. Keep this terminal session open – closing it terminates the server.

For custom network binding, set the OLLAMA_HOST environment variable before launching:

OLLAMA_HOST=0.0.0.0:8080 ollama serve

This configuration allows external network access on port 8080 instead of the default localhost-only binding.
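Note that 0.0.0.0 is a bind address, not an address clients should connect to. The sketch below shows one way a client script might turn an OLLAMA_HOST value into a base URL; it is a simplified illustration, not Ollama's own parsing logic, and the function name is hypothetical. It does not handle IPv6 literals.

```python
from typing import Optional
from urllib.parse import urlsplit

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 11434


def client_base_url(ollama_host: Optional[str]) -> str:
    """Derive a base URL for API calls from an OLLAMA_HOST-style value."""
    value = (ollama_host or "").strip()
    if not value:
        return f"http://{DEFAULT_HOST}:{DEFAULT_PORT}"
    if "://" not in value:
        value = "http://" + value
    parts = urlsplit(value)
    host = parts.hostname or DEFAULT_HOST
    if host == "0.0.0.0":
        # The wildcard bind address is reachable locally via loopback.
        host = DEFAULT_HOST
    port = parts.port or DEFAULT_PORT
    return f"{parts.scheme}://{host}:{port}"


print(client_base_url("0.0.0.0:8080"))
```

A client on another machine would pass the server's real IP address or hostname instead.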

Background Process Management

Run the server as a background process using standard Unix job control:

nohup ollama serve > /var/log/ollama.log 2>&1 &

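This approach writes logs to a dedicated file and survives terminal disconnection. Writing to /var/log usually requires root, so choose a path your user can write to if you launch the server unprivileged. Track the process ID for later management: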

ps aux | grep "ollama serve"
kill <PID>
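Scripts that start the server in the background usually need to wait until the port accepts connections before sending requests. A minimal readiness check using only the standard library might look like this; the function name and default timings are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("127.0.0.1", 11434) right after the nohup command returns True once the server is listening. A successful TCP connection only shows that the port is open, so a follow-up request to /api/tags is still a better health check.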

Systemd Service Configuration

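For production deployments, run the server under systemd. The official install script normally creates this unit for you; if you installed the binary manually, create one at /etc/systemd/system/ollama.service: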

[Unit]
Description=Ollama Local LLM Server
After=network.target

[Service]
Type=simple
User=ollama
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama

Check server status and recent logs:

sudo systemctl status ollama
sudo journalctl -u ollama -f

Caution: When using AI-generated systemd configurations, verify user permissions and file paths match your system before enabling automatic startup.

Server Configuration with Environment Variables

Ollama’s server behavior is controlled through environment variables that must be set before launching the daemon. These variables affect network binding, model storage, GPU allocation, and CORS policies.

The OLLAMA_HOST variable controls which network interface and port the server binds to. By default, Ollama listens on 127.0.0.1:11434, restricting access to localhost only. To expose the API across your network:

export OLLAMA_HOST=0.0.0.0:11434
ollama serve

For custom ports, specify the full address:

export OLLAMA_HOST=0.0.0.0:8080
ollama serve

The OLLAMA_MODELS variable changes where downloaded models are stored. This is essential for systems with limited root partition space:

export OLLAMA_MODELS=/mnt/storage/ollama-models
ollama serve

GPU Configuration

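Ollama detects supported GPUs automatically. To limit which devices it uses on a multi-GPU system, or to reserve GPUs for other workloads, set CUDA_VISIBLE_DEVICES (NVIDIA) or ROCR_VISIBLE_DEVICES (AMD) to a comma-separated list of device IDs before starting the server. OLLAMA_NUM_GPU is not among the documented server variables; the similarly named num_gpu is a per-request option that sets how many model layers are offloaded to the GPU:

export CUDA_VISIBLE_DEVICES=0
ollama serve

Setting the variable to an invalid ID such as -1 forces CPU-only inference, useful for testing or when GPU memory is exhausted.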

CORS and Security

The OLLAMA_ORIGINS variable configures Cross-Origin Resource Sharing for web applications. When building browser-based interfaces that call the Ollama API directly:

export OLLAMA_ORIGINS="https://myapp.local,https://chat.internal"
ollama serve

Caution: Setting OLLAMA_ORIGINS to “*” allows any website to access your local AI server. Only use wildcard origins in isolated development environments, never on network-accessible servers.
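A deployment script can check the origin list before exporting it. The sketch below splits a comma-separated value and flags wildcard entries; both helper names are hypothetical, and the check is deliberately simple (it treats any entry containing * as risky).

```python
from typing import List


def parse_origins(value: str) -> List[str]:
    """Split a comma-separated OLLAMA_ORIGINS value into trimmed entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def risky_origins(origins: List[str]) -> List[str]:
    """Return the entries that contain a wildcard."""
    return [origin for origin in origins if "*" in origin]


origins = parse_origins("https://myapp.local, https://chat.internal")
print(origins)
print(risky_origins(origins + ["*"]))
```

Refusing to start when risky_origins returns a non-empty list is one way to keep a wildcard from reaching a network-accessible server by accident.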

Persistence with Systemd

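Make environment variables permanent by adding them to the [Service] section of the unit. You can edit /etc/systemd/system/ollama.service directly, but a drop-in override created with sudo systemctl edit ollama keeps your settings separate from the unit file the installer manages:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="CUDA_VISIBLE_DEVICES=0,1"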

Reload and restart after changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama
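If you manage several hosts, the [Service] block can be generated from a dictionary rather than edited by hand. This is a minimal sketch under the assumption that values contain no double quotes or newlines; the function name is hypothetical.

```python
from typing import Dict


def render_override(env: Dict[str, str]) -> str:
    """Render a systemd [Service] block with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        if '"' in value or "\n" in value:
            # Quoting rules for such values are out of scope for this sketch.
            raise ValueError(f"unsupported character in value for {name}")
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


print(
    render_override(
        {
            "OLLAMA_HOST": "127.0.0.1:11434",
            "OLLAMA_MODELS": "/mnt/storage/ollama-models",
        }
    ),
    end="",
)
```

Review the generated text before installing it, then run the daemon-reload and restart commands shown above.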

API Endpoints and Testing Server Connectivity

Once your Ollama server is running, you can interact with it through several REST API endpoints. The primary endpoint for generating completions is /api/generate, which accepts POST requests with model name and prompt data.

Test server connectivity with a simple curl command:

curl http://localhost:11434/api/tags

This returns a JSON list of installed models. If you receive a connection refused error, verify the daemon is running with ps aux | grep ollama or check systemd status if running as a service.
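The same check can be scripted. The sketch below fetches /api/tags with the standard library and tests whether a model is installed; the helper names are hypothetical, and the sample dictionary only imitates the shape of a real response.

```python
import json
import urllib.request
from typing import List


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET /api/tags and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_names(tags: dict) -> List[str]:
    """Extract model names from an /api/tags response."""
    return [entry["name"] for entry in tags.get("models", [])]


def is_installed(tags: dict, model: str) -> bool:
    """Check for a model, assuming the :latest tag when none is given."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in model_names(tags)


# Illustrative sample shaped like an /api/tags response (values are made up).
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "codellama:7b"}]}
print(model_names(sample))
print(is_installed(sample, "llama3.2"))
```

Against a running server, is_installed(fetch_tags(), "llama3.2") performs the real lookup. fetch_tags raises urllib.error.URLError when the connection is refused.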

Generating Completions

Send prompts to your local models using the generate endpoint:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain how REST APIs work",
  "stream": false
}'

The stream parameter controls whether responses arrive incrementally or as a complete payload. Set to true for real-time streaming output, useful when building chat interfaces.
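With streaming enabled, /api/generate sends one JSON object per line, each carrying a fragment in its response field, with done set to true on the last object. A minimal sketch of assembling those lines follows; the function name is hypothetical and the sample lines are hand-written, not captured output.

```python
import json
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Concatenate the response fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written lines imitating the streaming format.
sample_lines = [
    b'{"response": "REST ", "done": false}\n',
    b'{"response": "APIs", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(collect_stream(sample_lines))
```

With the requests library, passing stream=True to requests.post and feeding response.iter_lines() into this function handles a live response. The /api/chat endpoint nests its text under message.content instead, so it needs a slightly different accessor.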

Testing from Python

Integrate Ollama into Python applications using the requests library:

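import requests

# Non-streaming request: the server returns one JSON object when generation finishes.
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'codellama',  # must already be pulled locally
        'prompt': 'Write a bash script to backup /home',
        'stream': False,
    },
    timeout=300,  # generation can be slow, especially on first model load
)
response.raise_for_status()  # surface HTTP errors such as an unknown model

result = response.json()
print(result['response'])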

Caution: Always review AI-generated code before execution. Models can produce syntactically correct but functionally incorrect commands, especially for system administration tasks.

Remote Access Configuration

By default, Ollama only accepts connections from localhost. To allow remote clients, set the OLLAMA_HOST environment variable before starting the server:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

This binds to all network interfaces. For production deployments, place Ollama behind a reverse proxy like nginx with authentication rather than exposing it directly to untrusted networks.

Troubleshooting Common Server Issues

When ollama serve fails to start, check if port 11434 is already occupied by another process. Run sudo lsof -i :11434 to identify what’s using the port. If an orphaned Ollama process is running, kill it with pkill ollama before restarting. To use a different port, set the OLLAMA_HOST environment variable. Keep the 127.0.0.1 address unless you also want network access, and set the same variable for the ollama CLI so it connects to the new port:

export OLLAMA_HOST=127.0.0.1:11435
ollama serve
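A launcher script can test the port before starting the server. The sketch below tries to bind the port briefly; if the bind fails, something else already holds it. The function name is hypothetical, and the check is inherently racy, since another process may take the port between the test and the real launch.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


print(port_is_free(11434))
```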

Connection Refused Errors

If clients cannot reach the API, verify the server is listening on the correct interface. By default, Ollama binds to localhost only. For network access, explicitly set:

export OLLAMA_HOST=0.0.0.0:11434
ollama serve

Check firewall rules with sudo ufw status on Ubuntu systems. Allow the port if needed: sudo ufw allow 11434/tcp. Test connectivity from another machine using curl http://your-server-ip:11434/api/tags.

GPU Not Detected

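When Ollama runs on CPU despite having a GPU available, verify CUDA or ROCm drivers are properly installed. Check nvidia-smi output for NVIDIA cards or rocm-smi for AMD, and look for GPU detection messages in the server log at startup. Also confirm that no variable is hiding the device: if CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES on AMD) is set, it must include the GPU you expect Ollama to use:

export CUDA_VISIBLE_DEVICES=0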
ollama serve

Monitor GPU usage during inference with watch -n 1 nvidia-smi to confirm the model loads onto the GPU.
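Another way to confirm placement is the /api/ps endpoint, which lists currently loaded models along with their total size and the portion held in GPU memory (the size and size_vram fields). A minimal sketch of interpreting one entry follows; the function name is hypothetical and the sample values are invented.

```python
def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model that resides in GPU memory, per /api/ps fields."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


# Invented values shaped like one entry of the "models" list from /api/ps.
sample_entry = {"name": "llama3.2:latest", "size": 4_000_000_000, "size_vram": 3_000_000_000}
print(f"{gpu_fraction(sample_entry):.0%} of the model is in GPU memory")
```

A value of 0 means the model is running entirely on CPU. A value below 1 means only part of the model fit in VRAM.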

Model Download Failures

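Network timeouts during model pulls often occur with large models over slow connections; retry the pull once the connection is stable. Pulls also fail when the models directory runs out of space. The OLLAMA_MODELS variable moves model storage to a larger volume, and the user running the server needs write access to that directory: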

export OLLAMA_MODELS=/mnt/storage/ollama-models
ollama serve

Ensure the directory has sufficient space – models like llama3.1:70b require over 40GB. Check available space with df -h before pulling large models.
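The space check can be automated before a pull. This sketch uses shutil.disk_usage from the standard library; the function name and the 45 GiB threshold are assumptions chosen for illustration.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


# Check the current directory; substitute your OLLAMA_MODELS path in practice.
print(has_free_space(".", 45 * GIB))
```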

Caution: When using AI-generated troubleshooting commands, always verify syntax and understand their impact before running with elevated privileges. Test configuration changes in development environments first.

Installation and Configuration Steps

The official installation script handles both the binary and systemd service configuration automatically. Run it as a regular user; the script calls sudo itself for the steps that need elevated privileges:

curl -fsSL https://ollama.com/install.sh | sh

This script places the ollama binary in /usr/local/bin/ and creates a systemd service file at /etc/systemd/system/ollama.service. The service runs under a dedicated ollama user account for security isolation.

Configuring Server Environment Variables

The installer usually starts the service right away, so apply configuration changes through a systemd override and restart the service afterwards. Open the override file:

sudo systemctl edit ollama.service

Add your configuration in the override file:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="OLLAMA_ORIGINS=http://localhost:3000,http://192.168.1.100:8080"

The OLLAMA_HOST variable controls the bind address and port. Use 0.0.0.0 to accept connections from other machines on your network, or keep the default 127.0.0.1 for localhost-only access. The OLLAMA_MODELS variable redirects model storage to a custom path, useful when your root partition has limited space. Set CUDA_VISIBLE_DEVICES (ROCR_VISIBLE_DEVICES on AMD) to choose which GPUs Ollama may use on multi-GPU systems. The OLLAMA_ORIGINS variable configures CORS for web applications that need API access.

Starting and Verifying the Service

Reload systemd, enable the service at boot, and restart it so the new environment takes effect:

sudo systemctl daemon-reload
sudo systemctl enable ollama.service
sudo systemctl restart ollama.service
sudo systemctl status ollama.service

Verify the API responds on the configured port:

curl http://localhost:11434/api/tags

Caution: When exposing Ollama on 0.0.0.0, implement firewall rules to restrict access. The API has no built-in authentication, so network-level controls are essential for production deployments.
