TL;DR
You can run Claude-quality coding models on your own hardware using Ollama and Open WebUI, keeping your code and conversations completely private. This guide walks you through deploying models like DeepSeek Coder, Qwen2.5-Coder, and CodeLlama that rival proprietary services for code generation, debugging, and refactoring tasks.
The setup requires a Linux machine with at least 16GB RAM for 7B models or 32GB+ for 34B models. You’ll install Ollama as the model runtime, pull coding-focused models, then connect Open WebUI as your chat interface. The entire stack runs locally—no API keys, no data leaving your network.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a coding model
ollama pull deepseek-coder-v2:16b
# Run Open WebUI with Docker
docker run -d -p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Once running, you can generate Ansible playbooks, debug Python scripts, write Terraform modules, or refactor legacy code through a ChatGPT-like interface. The models understand context across multiple files and can explain complex codebases.
Critical safety note: Always review AI-generated system commands before execution. Models can hallucinate package names, misremember flag syntax, or suggest destructive operations. Test generated Kubernetes manifests in staging clusters, validate Prometheus queries against test data, and never pipe AI output directly to bash or kubectl apply on production systems.
Performance depends heavily on your GPU. An RTX 4090 handles 34B models comfortably, while older cards or CPU-only setups can still run 7B models, though with noticeably slower response times. For teams, you can expose Open WebUI behind Nginx with authentication, creating a shared internal coding assistant that respects your data sovereignty requirements.
Core Steps
Start by installing Ollama on your Linux system. The installation script handles dependencies automatically:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, pull a coding-focused model. DeepSeek Coder and CodeLlama are strong alternatives to Claude for code generation:
ollama pull deepseek-coder:33b
ollama pull codellama:34b-code
Verify the model runs correctly:
ollama run deepseek-coder:33b "Write a Python function to parse JSON logs"
Deploy Open WebUI for a Chat Interface
Open WebUI provides a familiar chat interface similar to Claude’s web experience. Deploy it using Docker:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Access the interface at http://localhost:3000 and configure it to connect to your local Ollama instance at http://host.docker.internal:11434.
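Before wiring Open WebUI to Ollama, it helps to confirm the Ollama API is actually answering. A minimal sketch using only the standard library; it queries /api/tags, Ollama's model-listing endpoint:

```python
import json
import urllib.request
import urllib.error

def check_ollama(base_url="http://localhost:11434"):
    """Return the list of locally pulled model names, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None

models = check_ollama()
if models is None:
    print("Ollama is not reachable -- is the service running?")
else:
    print("Available models:", models)
```

If this prints None, fix connectivity before touching Open WebUI's settings; the web interface fails silently when the backend URL is wrong.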
Configure Model Parameters for Code Generation
Adjust temperature and context settings for more deterministic code output. In Open WebUI’s model settings:
temperature: 0.2
top_p: 0.9
context_length: 8192
Lower temperatures reduce randomness in generated code, making outputs more predictable.
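The same settings apply when you bypass the UI and call Ollama's API directly: they go into the `options` object of a /api/generate request. A sketch that assembles such a payload (model name and prompt here are just placeholders):

```python
import json

def build_generate_payload(model, prompt, temperature=0.2, top_p=0.9, num_ctx=8192):
    """Assemble a request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": temperature,  # lower = more deterministic code
            "top_p": top_p,
            "num_ctx": num_ctx,          # context window in tokens
        },
    }

payload = build_generate_payload("deepseek-coder:33b",
                                 "Write a Python function to parse JSON logs")
print(json.dumps(payload, indent=2))
```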
Caution: Always review AI-generated infrastructure code before execution. Models can hallucinate package names, API endpoints, or configuration syntax. Test Ansible playbooks with --check mode, validate Terraform plans with terraform plan, and run generated scripts in isolated environments first. Never pipe AI output directly to bash or kubectl apply on production systems without manual inspection.
Implementation
With Ollama installed (see Core Steps above), pull a coding-focused model:
ollama pull qwen2.5-coder:7b
ollama pull deepseek-coder-v2:16b
With Open WebUI running (see Core Steps above), access the interface at http://localhost:3000 and configure model parameters.
Configuring for Code Generation
For coding tasks, adjust temperature and context length:
temperature: 0.2
num_ctx: 8192
top_p: 0.9
repeat_penalty: 1.1
Create a custom system prompt for infrastructure tasks:
system_prompt = """You are an expert DevOps engineer. Generate production-ready code for:
- Ansible playbooks with proper error handling
- Terraform modules following best practices
- Bash scripts with input validation
- Python automation with type hints
Always include comments explaining critical sections."""
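Rather than pasting this prompt into every chat, you can bake it into a custom model with a Modelfile's SYSTEM instruction. A sketch that renders such a Modelfile (the shortened prompt text and the `devops-coder` name are illustrative):

```python
# Abbreviated version of the system prompt above
SYSTEM_PROMPT = (
    "You are an expert DevOps engineer. Generate production-ready code "
    "with error handling, input validation, and comments on critical sections."
)

def make_modelfile(base_model, system_prompt, num_ctx=8192, temperature=0.2):
    """Render a Modelfile that bakes the system prompt into a custom model.

    Build it afterwards with: ollama create devops-coder -f Modelfile
    """
    return (
        f"FROM {base_model}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f"PARAMETER temperature {temperature}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )

with open("Modelfile", "w") as f:
    f.write(make_modelfile("qwen2.5-coder:7b", SYSTEM_PROMPT))
```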
Safety Considerations
Critical: Never execute AI-generated system commands without manual review. Models can hallucinate dangerous operations like rm -rf / or incorrect iptables rules that lock you out.
Implement a validation workflow:
# Save AI output to file first
ollama run qwen2.5-coder:7b "Generate Ansible playbook for nginx" > playbook.yml
# Review with syntax checker
ansible-playbook --syntax-check playbook.yml
# Test in isolated environment
vagrant up test-vm && ansible-playbook -i test-inventory playbook.yml
For Terraform, always run terraform plan before apply. Use Prometheus alerting rules to monitor unexpected infrastructure changes from automated deployments.
Verification and Testing
Once your model is running, verify it can handle coding tasks before relying on it for production work. Start with simple prompts that test core capabilities.
Test the model’s ability to generate working code:
curl http://localhost:11434/api/generate -d '{
"model": "codellama:34b",
"prompt": "Write a Python function to validate IPv4 addresses using regex",
"stream": false
}'
Compare outputs against known-good implementations. Test edge cases like malformed input handling and error conditions.
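For that IPv4 prompt, a hand-written reference implementation gives you something concrete to diff the model's answer against. A sketch with the kind of edge cases worth throwing at any generated version:

```python
import re

# Each octet: 0-255, rejecting leading-zero forms like "01"
_OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"^{_OCTET}(\.{_OCTET}){{3}}$")

def is_valid_ipv4(addr: str) -> bool:
    return bool(IPV4_RE.match(addr))

# Edge cases that model-generated regexes frequently get wrong
cases = {
    "192.168.1.1": True,
    "255.255.255.255": True,
    "0.0.0.0": True,
    "256.1.1.1": False,   # octet out of range
    "1.2.3": False,       # too few octets
    "1.2.3.4.5": False,   # too many octets
    "01.2.3.4": False,    # leading zero
    "1.2.3.4 ": False,    # trailing whitespace
}
for addr, expected in cases.items():
    assert is_valid_ipv4(addr) == expected, addr
print("all edge cases pass")
```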
Code Review Simulation
Feed the model actual code from your projects:
import requests
prompt = """Review this Ansible playbook for security issues:
---
- hosts: webservers
become: yes
tasks:
- name: Install nginx
apt: name=nginx state=present
- name: Copy config
copy: src=/tmp/nginx.conf dest=/etc/nginx/
"""
response = requests.post('http://localhost:11434/api/generate',
                         json={'model': 'deepseek-coder:33b',
                               'prompt': prompt,
                               'stream': False})  # without this, Ollama streams JSON lines
print(response.json()['response'])
Caution: AI models can hallucinate security vulnerabilities that don’t exist or miss real ones. Always validate suggestions against official documentation and security scanning tools like ansible-lint or trivy.
Integration Testing
Test the model through Open WebUI’s interface with realistic scenarios:
- Debugging Terraform state issues
- Generating Prometheus alerting rules
- Writing systemd service files
- Creating Docker Compose configurations
Critical: Never execute AI-generated system commands directly on production infrastructure. Run them in isolated test environments first. Models trained on older datasets may suggest deprecated flags or incompatible syntax for tools like kubectl or docker.
Maintain a validation checklist: syntax checking with language-specific linters, dry-run modes where available (terraform plan, ansible-playbook --check), and manual review of any commands that modify system state or access credentials.
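That checklist can be scripted so no step gets skipped. A sketch that runs each checker only if it is installed and collects a pass/fail report; the tool names come from the checklist above, the exact arguments and `playbook.yml` filename are assumptions:

```python
import shutil
import subprocess

# (tool, args) pairs from the validation checklist; a tool that is not
# installed is reported as skipped rather than failing the whole run.
CHECKS = [
    ("terraform", ["validate"]),
    ("ansible-playbook", ["--syntax-check", "playbook.yml"]),
    ("ansible-lint", ["playbook.yml"]),
]

def run_checks(checks):
    results = {}
    for tool, args in checks:
        if shutil.which(tool) is None:
            results[tool] = "skipped (not installed)"
            continue
        proc = subprocess.run([tool, *args], capture_output=True, text=True)
        results[tool] = "ok" if proc.returncode == 0 else f"failed ({proc.returncode})"
    return results

for tool, status in run_checks(CHECKS).items():
    print(f"{tool}: {status}")
```

Wire this into a pre-commit hook or CI job so AI-generated files never reach a branch unchecked.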
Best Practices
Allocate at least 16GB RAM for coding-focused models like DeepSeek Coder or CodeLlama 34B. Smaller 7B parameter models work adequately for code review and documentation tasks but struggle with complex refactoring. Monitor GPU memory with nvidia-smi or watch -n 1 ollama ps to prevent OOM crashes during long generation sessions.
# Create a Modelfile with custom context length
echo 'FROM deepseek-coder:33b
PARAMETER num_ctx 8192' > Modelfile
ollama create deepseek-coder-ctx8k -f Modelfile
ollama run deepseek-coder-ctx8k
Prompt Engineering for Code Generation
Structure prompts with explicit constraints and output formats. Include language specifications, framework versions, and architectural requirements to reduce hallucination.
prompt = """Generate a FastAPI endpoint for user authentication.
Requirements:
- Python 3.11, FastAPI 0.104
- JWT tokens with 24h expiration
- PostgreSQL via SQLAlchemy
- Include input validation with Pydantic
"""
Validation Workflows
Never execute AI-generated system commands directly in production environments. Implement a review pipeline where generated code passes through static analysis tools before deployment.
# Validate Terraform configs from AI output
terraform fmt generated_infrastructure.tf
terraform validate
tflint   # recent tflint versions lint the whole module directory, not single files
For Ansible playbooks, use ansible-playbook --syntax-check and ansible-lint before running against inventory. Test infrastructure changes in isolated containers or VMs first.
Version Control Integration
Commit AI-generated code with descriptive messages indicating the model and prompt used. This creates an audit trail for debugging and helps identify patterns in model behavior.
git commit -m "feat: add user auth endpoint (deepseek-coder:33b)"
Handling Hallucinated Dependencies
AI models frequently suggest outdated packages or non-existent library methods. Cross-reference generated imports against official documentation. Use pip show or language-specific package managers to verify versions before integrating suggestions into your codebase.
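This verification step is easy to script. A sketch using the standard library's `importlib.metadata` to check whether each suggested package actually resolves to an installed distribution; the fake package name is deliberate:

```python
from importlib import metadata

def verify_packages(names):
    """Map each package name to its installed version, or None if absent.

    Catches hallucinated or misspelled package names before they land
    in requirements.txt.
    """
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed -- possibly hallucinated
    return report

# 'pip' should resolve; 'fastapi-magic-auth' is a deliberately fake name
for pkg, version in verify_packages(["pip", "fastapi-magic-auth"]).items():
    print(f"{pkg}: {version or 'NOT FOUND'}")
```

Note this only proves a package is installed locally, not that the methods the model invoked on it exist; still read the library's documentation for anything unfamiliar.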
FAQ
Can I run coding models with only 16GB RAM?
Yes, but you’ll need to use quantized models. The deepseek-coder-v2:16b-lite-instruct-q4_K_M variant runs comfortably in 10-12GB of RAM, leaving headroom for your OS and browser. Avoid the 33B parameter models unless you have 32GB+ RAM. Check available memory before loading:
ollama run deepseek-coder-v2:16b-lite-instruct-q4_K_M
If the model loads but responses are sluggish, you’re likely hitting swap. Use htop to monitor memory pressure during inference.
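You can check headroom programmatically before loading a model. A Linux-only sketch that reads `MemAvailable` from /proc/meminfo; the 12 GiB threshold is an assumption based on the quant size quoted above:

```python
def available_ram_gb(meminfo_path="/proc/meminfo"):
    """Return available RAM in GiB by parsing /proc/meminfo (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # value is reported in kB
                return kb / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

NEEDED_GB = 12  # rough footprint of the 16b q4_K_M quant (assumption)
free = available_ram_gb()
print(f"{free:.1f} GiB available")
if free < NEEDED_GB:
    print("Model will likely spill into swap -- expect sluggish responses")
```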
Why are responses slower than Claude or ChatGPT?
Cloud services run on enterprise GPUs with massive parallelism and request batching; your local CPU or consumer GPU has a fraction of that throughput. A typical laptop generates 5-15 tokens per second, while cloud APIs deliver 50-100+ tokens per second. The tradeoff is complete data privacy: your code never leaves your machine.
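You can measure your own throughput from the timing fields Ollama attaches to every non-streamed /api/generate response (`eval_count` is generated tokens, `eval_duration` is in nanoseconds). A sketch using made-up sample numbers rather than a live call:

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from Ollama's response metrics.

    eval_count  = tokens generated
    eval_duration = generation time in nanoseconds
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Sample numbers for illustration, not real measurements: 240 tokens in 24 s
sample = {"eval_count": 240, "eval_duration": 24_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # → 10.0 tokens/s
```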
Can I use these models for production code review?
Treat them as junior developers, not senior architects. They excel at catching syntax errors, suggesting refactors, and generating boilerplate. Always validate generated Terraform configurations, Ansible playbooks, or Kubernetes manifests in a staging environment before applying to production infrastructure.
Critical warning: AI models hallucinate plausible-looking commands that may not exist or may have dangerous side effects. Never run suggested rm, dd, or database migration commands without manual verification.
How do I switch between models in Open WebUI?
Click the model dropdown in the chat interface and select from your pulled models. You can compare responses by opening multiple chat windows with different models. For coding tasks, keep both deepseek-coder-v2 and codellama:34b pulled so you can cross-reference their suggestions; Ollama loads each into memory on demand.
Does this work offline?
Completely. Once models are pulled via ollama pull, they’re cached locally in /usr/share/ollama/.ollama/models/. Disconnect your network and everything continues functioning—ideal for air-gapped development environments or travel.