TL;DR
# Run Tabby with NVIDIA GPU using Docker
docker run -d --name tabby \
--gpus all \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model StarCoder-1B --device cuda
# Verify it is running
curl http://localhost:8080/v1/health
# Test a completion
curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"language": "python", "segments": {"prefix": "def fibonacci(n):\n    "}}'
Install the Tabby plugin in your IDE, point it at http://localhost:8080, and get Copilot-style completions backed entirely by local hardware.
What Is Tabby
Tabby is an open-source, self-hosted code completion server. It acts as a drop-in replacement for GitHub Copilot’s backend: your IDE sends code context to Tabby, Tabby runs inference on a local model, and returns completion suggestions. The key differentiator from other tools is that Tabby is specifically designed as a server – it handles model management, request queuing, and repository indexing out of the box.
Tabby is built in Rust, which gives it low overhead and fast startup times. It ships as a single binary or Docker image, supports NVIDIA and Apple Silicon GPUs, and provides IDE plugins for VS Code, JetBrains, Vim, and Neovim.
Installation
Docker (Recommended)
Docker is the fastest path to a working installation. For NVIDIA GPUs:
docker run -d --name tabby \
--gpus all \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model StarCoder-1B --device cuda
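For Apple Silicon Macs (Docker Desktop on macOS does not normally expose the GPU to containers, so if Metal is not detected, run the native binary with --device metal instead):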
docker run -d --name tabby \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model StarCoder-1B --device metal
For CPU-only (slow, but works for testing):
docker run -d --name tabby \
-p 8080:8080 \
-v $HOME/.tabby:/data \
tabbyml/tabby \
serve --model StarCoder-1B --device cpu
The $HOME/.tabby volume persists downloaded models and configuration between container restarts.
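The first start can take a while because Tabby downloads the model before it serves requests, so scripts that depend on the server may want to wait until the health endpoint responds. The following is a minimal sketch of such a wait loop; the function names `wait_until` and `tabby_ready`, the attempt count, and the delay are illustrative assumptions, not part of Tabby.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND...
# Re-run COMMAND until it succeeds or ATTEMPTS tries have been used up.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# tabby_ready [BASE_URL] [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical wrapper that probes the health endpoint used in this guide.
tabby_ready() {
  wait_until "${2:-30}" "${3:-2}" curl -fsS "${1:-http://localhost:8080}/v1/health"
}
```

A deployment script could call `tabby_ready || docker logs tabby` to print the container logs when the server does not come up in time.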
Binary Installation
Download the binary from the Tabby GitHub releases page:
# Linux x86_64 with CUDA
curl -L https://github.com/TabbyML/tabby/releases/latest/download/tabby_x86_64-manylinux2014-cuda -o tabby
chmod +x tabby
sudo mv tabby /usr/local/bin/
# Start the server
tabby serve --model StarCoder-1B --device cuda
For a persistent service, create a systemd unit:
# /etc/systemd/system/tabby.service
[Unit]
Description=Tabby Code Completion Server
After=network.target
[Service]
Type=simple
User=tabby
ExecStart=/usr/local/bin/tabby serve --model StarCoder-1B --device cuda
Restart=always
RestartSec=10
Environment="TABBY_ROOT=/var/lib/tabby"
[Install]
WantedBy=multi-user.target
Create the service user and data directory, then enable the unit:
sudo useradd -r -s /bin/false tabby
sudo mkdir -p /var/lib/tabby
sudo chown tabby:tabby /var/lib/tabby
sudo systemctl daemon-reload
sudo systemctl enable --now tabby
Supported Models
Tabby supports a curated list of models optimized for code completion. Unlike Ollama, you do not pull arbitrary models – you specify a model identifier and Tabby downloads it automatically on first run.
| Model | Parameters | VRAM Required | Languages | Notes |
|---|---|---|---|---|
| StarCoder-1B | 1B | ~2 GB | 80+ languages | Fast, good for tab completion |
| StarCoder-3B | 3B | ~4 GB | 80+ languages | Better quality, still fast |
| StarCoder-7B | 7B | ~8 GB | 80+ languages | Best StarCoder quality |
| CodeLlama-7B | 7B | ~8 GB | Multiple | Strong on Python, C++ |
| CodeLlama-13B | 13B | ~16 GB | Multiple | High quality, needs large GPU |
| DeepseekCoder-1.3B | 1.3B | ~2 GB | Multiple | Good accuracy for size |
| DeepseekCoder-6.7B | 6.7B | ~8 GB | Multiple | Strong all-around |
| Qwen2.5-Coder-1.5B | 1.5B | ~2 GB | Multiple | Newest, competitive with 3B models |
To switch models, stop the server and restart with a different --model flag:
tabby serve --model DeepseekCoder-6.7B --device cuda
Tabby downloads the model on first use and caches it in the data directory.
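With the Docker setup, switching models means removing the container and starting a new one with a different --model value. The sketch below prints the matching docker run command for review instead of executing it. `tabby_run_cmd` is a hypothetical helper that mirrors the commands from the installation section and adds --gpus all only for the cuda device.

```shell
# tabby_run_cmd MODEL [DEVICE]
# Print the docker run command for MODEL on DEVICE (default: cuda).
# The command is only printed, so it can be inspected before running it.
tabby_run_cmd() {
  model=$1
  device=${2:-cuda}
  gpu_flag=""
  if [ "$device" = "cuda" ]; then
    gpu_flag="--gpus all "
  fi
  printf 'docker run -d --name tabby %s-p 8080:8080 -v %s/.tabby:/data tabbyml/tabby serve --model %s --device %s\n' \
    "$gpu_flag" "$HOME" "$model" "$device"
}
```

After `docker stop tabby && docker rm tabby`, running `tabby_run_cmd DeepseekCoder-6.7B cuda` shows the command to start the replacement container. The sketch assumes $HOME contains no spaces.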
GPU Requirements
Code completion must return results fast – under 500ms for a good experience. This constrains your hardware choices.
| GPU | VRAM | Recommended Max Model | Approximate Latency |
|---|---|---|---|
| RTX 3060 | 12 GB | 7B | ~200ms |
| RTX 3090 / 4090 | 24 GB | 13B | ~150ms |
| RTX 4060 Ti | 16 GB | 7B | ~150ms |
| A100 40GB | 40 GB | 13B | ~80ms |
| M1/M2 Pro 16GB | 16 GB (unified) | 7B | ~250ms |
| M3 Max 96GB | 96 GB (unified) | 13B | ~200ms |
| CPU only | N/A | 1B | ~2000ms+ |
For a small team (2-5 developers), a single RTX 3090 running StarCoder-3B handles concurrent requests well. For larger teams, run multiple instances behind a load balancer or use a larger GPU.
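The two tables can be combined into a rough sizing rule. The sketch below picks the largest model whose approximate VRAM requirement fits the card after reserving some headroom. The function name `pick_model`, the 2 GB default headroom, and the choice of one representative model per size class are assumptions made for illustration; with that headroom the results happen to agree with the "Recommended Max Model" column for the NVIDIA cards listed.

```shell
# pick_model VRAM_GB [HEADROOM_GB]
# Print the largest model whose approximate VRAM requirement (from the
# Supported Models table) fits into VRAM_GB minus HEADROOM_GB.
# HEADROOM_GB defaults to 2, which is an assumption, not a Tabby setting.
pick_model() {
  vram_gb=$1
  headroom_gb=${2:-2}
  usable=$((vram_gb - headroom_gb))
  if [ "$usable" -ge 16 ]; then
    echo "CodeLlama-13B"
  elif [ "$usable" -ge 8 ]; then
    echo "StarCoder-7B"
  elif [ "$usable" -ge 4 ]; then
    echo "StarCoder-3B"
  elif [ "$usable" -ge 2 ]; then
    echo "StarCoder-1B"
  else
    echo "none"
    return 1
  fi
}

# Total VRAM in MiB can be read with:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
```

For example, `pick_model 12` prints StarCoder-7B and `pick_model 24` prints CodeLlama-13B. Treat the output as a starting point and confirm latency on your own hardware.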
IDE Plugins
VS Code
Install from the marketplace: search for “Tabby” by TabbyML. Open settings and configure:
{
"tabby.api.endpoint": "http://localhost:8080",
"tabby.api.authToken": ""
}
If you set up authentication (recommended for team deployments), set tabby.api.authToken to your token; otherwise leave it empty.
JetBrains (IntelliJ, PyCharm, GoLand, etc.)
Settings > Plugins > Marketplace > search “Tabby”. After installation:
Settings > Tools > Tabby > Server endpoint: http://localhost:8080
Vim / Neovim
Tabby provides a Vim plugin via its official repository:
" vim-plug
Plug 'TabbyML/vim-tabby'
" Configuration
let g:tabby_server_url = 'http://localhost:8080'
For Neovim with lazy.nvim:
{
"TabbyML/vim-tabby",
config = function()
vim.g.tabby_server_url = "http://localhost:8080"
end,
}
Repository Indexing
One of Tabby’s strongest features is repository indexing. Tabby can index your Git repositories and use that context when generating completions. This means suggestions are aware of your project’s types, function signatures, and patterns – not just the current file.
Configure repositories in ~/.tabby/config.toml:
[[repositories]]
name = "my-project"
git_url = "file:///home/user/projects/my-project"
[[repositories]]
name = "shared-lib"
git_url = "https://github.com/org/shared-lib.git"
After adding repositories, trigger indexing:
tabby scheduler --now
Tabby builds a code search index that it queries during completion. This is particularly valuable for:
- Using project-specific types and interfaces in suggestions
- Following existing code patterns and naming conventions
- Referencing functions from other modules in the same project
Security note: If you index repositories via HTTPS URLs, Tabby clones them locally. Ensure your data directory ($HOME/.tabby or /var/lib/tabby) has appropriate permissions. For private repositories, use SSH URLs or local file paths.
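When many local checkouts need indexing, the [[repositories]] blocks can be generated rather than typed by hand. The sketch below prints one block per Git checkout found directly under a parent directory, in the same format as the config.toml example above. `repo_entry` and `local_repo_entries` are hypothetical helper names, and the parent directory should be an absolute path so that the file:// URLs are valid.

```shell
# repo_entry NAME GIT_URL
# Print one [[repositories]] block in the config.toml format shown above.
repo_entry() {
  printf '[[repositories]]\nname = "%s"\ngit_url = "%s"\n' "$1" "$2"
}

# local_repo_entries PARENT_DIR
# Print an entry for every direct subdirectory of PARENT_DIR that contains
# a .git directory, using file:// URLs.
local_repo_entries() {
  for dir in "$1"/*/; do
    dir=${dir%/}
    if [ -d "$dir/.git" ]; then
      repo_entry "$(basename "$dir")" "file://$dir"
    fi
  done
}
```

Review the output of `local_repo_entries "$HOME/projects"` before appending it to ~/.tabby/config.toml, then trigger indexing as described above.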
Authentication and Team Deployment
For team use, enable authentication:
# Start the server with a shared access token
tabby serve --model StarCoder-3B --device cuda --token <your-secret-token>
Distribute the token to team members for their IDE configurations. For production team deployments, place Tabby behind a reverse proxy:
# /etc/nginx/sites-available/tabby
server {
listen 443 ssl;
server_name tabby.internal.company.com;
ssl_certificate /etc/ssl/certs/tabby.pem;
ssl_certificate_key /etc/ssl/private/tabby.key;
location / {
proxy_pass http://127.0.0.1:8080;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 300s;
}
}
Security note: Never expose Tabby directly to the public internet. It processes your source code and should be treated as a sensitive internal service. Use VPN, mTLS, or a private network.
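Once the proxy and token are in place, it helps to confirm that an authenticated request gets through before distributing the token. The sketch below assumes the token is sent as an `Authorization: Bearer` header; verify this against your Tabby version. `tabby_get`, `TABBY_URL`, `TABBY_TOKEN`, and the `CURL` override are hypothetical names introduced for this example.

```shell
# tabby_get PATH
# Send an authenticated GET request to the Tabby endpoint.
# TABBY_URL and TABBY_TOKEN are hypothetical environment variables. CURL can
# be overridden (for example with "echo") to inspect the request instead of
# sending it. The Bearer scheme is an assumption about the server.
tabby_get() {
  "${CURL:-curl}" -fsS \
    -H "Authorization: Bearer ${TABBY_TOKEN:?TABBY_TOKEN is not set}" \
    "${TABBY_URL:-http://localhost:8080}$1"
}
```

With `TABBY_URL=https://tabby.internal.company.com` and `TABBY_TOKEN` set, `tabby_get /v1/health` should succeed from inside the private network, and the same request without a valid token should be rejected.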
Comparison with Continue.dev and Copilot
| Feature | Tabby | Continue.dev | GitHub Copilot |
|---|---|---|---|
| Architecture | Dedicated server | IDE extension | Cloud service |
| Model management | Built-in | Relies on Ollama/etc | Managed |
| Repository indexing | Built-in | Basic | Strong |
| Chat interface | Limited | Full chat + inline edit | Full chat |
| Team support | Multi-user, auth | Single user | Organization plans |
| Offline | Yes | Yes (with Ollama) | No |
| Cost | Free + hardware | Free + hardware | $10-19/month |
| IDE support | VS Code, JetBrains, Vim | VS Code, JetBrains | VS Code, JetBrains, Vim |
| Setup complexity | Low | Medium | Minimal |
When to Choose Tabby
Tabby is the right choice when:
- You need a team solution. Tabby’s server architecture means one GPU server serves multiple developers. Continue.dev runs per-machine.
- Repository-aware completions matter. Tabby’s built-in indexing is more mature than Continue’s codebase search.
- You want a focused tool. Tabby does code completion well and does not try to be a general-purpose LLM chat interface.
- Operational simplicity. One binary, one Docker container. No separate model server needed.
Continue.dev is better when:
- You want chat, inline editing, and completions in one tool.
- You already run Ollama and want to reuse it.
- You want to use the same models for coding and general tasks.
Monitoring and Maintenance
Tabby's health endpoint reports basic server status, such as the loaded model and device:
curl http://localhost:8080/v1/health
For long-running deployments, monitor:
- Disk usage in the data directory (models and indexes grow over time)
- GPU memory with nvidia-smi – ensure no OOM conditions
- Response latency – if completions slow down, the model may be too large for your hardware
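GPU memory checks are easy to automate because nvidia-smi can print used and total memory as CSV. The sketch below turns one such line into a percentage and compares it against a limit. `gpu_mem_pct` and `gpu_mem_ok` are hypothetical helpers, and the 90 percent default limit is an arbitrary choice.

```shell
# gpu_mem_pct "USED, TOTAL"
# Convert one CSV line of used and total memory (same unit) into a
# whole-number percentage, rounded down.
gpu_mem_pct() {
  printf '%s\n' "$1" | awk -F', *' '{ printf "%d\n", $1 * 100 / $2 }'
}

# gpu_mem_ok "USED, TOTAL" [LIMIT_PCT]
# Succeed while usage stays below LIMIT_PCT (default 90, an assumption).
gpu_mem_ok() {
  [ "$(gpu_mem_pct "$1")" -lt "${2:-90}" ]
}

# Input for the first GPU can be obtained with:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | head -n 1
```

A cron job could feed the nvidia-smi line into `gpu_mem_ok` and send an alert when it fails.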
Update Tabby by pulling the latest Docker image or downloading the newest binary:
docker pull tabbyml/tabby
docker stop tabby && docker rm tabby
# Re-run the docker run command
Model data persists in the mounted volume, so updates are non-destructive.
Troubleshooting
No completions in IDE: Verify the server is running (curl http://localhost:8080/v1/health), then check that the IDE plugin is configured with the correct endpoint URL.
Slow completions: Check GPU utilization with nvidia-smi. If the GPU is saturated, use a smaller model or upgrade hardware. CPU inference is not practical for interactive use with models larger than 1B.
Docker GPU not detected: Ensure nvidia-container-toolkit is installed and the Docker daemon is configured to use the NVIDIA runtime. Test with docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi.
Model download fails: Check network connectivity and disk space. Models range from 1-15 GB. Tabby stores them in the data directory.
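A quick free-space check before switching to a larger model can rule out the disk as the cause of a failed download. This is a minimal sketch using df; `has_free_gb` is a hypothetical helper, and the directory passed to it should be wherever your data directory lives.

```shell
# has_free_gb DIR GB
# Succeed if the filesystem that holds DIR has at least GB gibibytes free.
has_free_gb() {
  # df -Pk prints sizes in 1024-byte blocks; column 4 is the available space.
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(($2 * 1024 * 1024)) ]
}
```

For example, `has_free_gb "$HOME/.tabby" 15 || echo "not enough space for the largest models"` covers the upper end of the model size range mentioned above.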
