TL;DR

# Run Tabby with NVIDIA GPU using Docker
docker run -d --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder-1B --device cuda

# Verify it is running
curl http://localhost:8080/v1/health

# Test a completion
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def fibonacci(n):\n    ", "language": "python"}'

Install the Tabby plugin in your IDE, point it at http://localhost:8080, and get Copilot-style completions backed entirely by local hardware.


What Is Tabby

Tabby is an open-source, self-hosted code completion server. It acts as a drop-in replacement for GitHub Copilot’s backend: your IDE sends code context to Tabby, Tabby runs inference on a local model, and returns completion suggestions. The key differentiator from other tools is that Tabby is specifically designed as a server – it handles model management, request queuing, and repository indexing out of the box.
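
The request flow described above can be sketched from the client side. This is a minimal sketch, not Tabby's own client code: the `/v1/completions` path and the `prompt`/`language` fields simply mirror the curl example in this guide, the helper name `build_completion_request` is hypothetical, and the exact request schema may differ between Tabby versions.

```python
import json
import urllib.request


def build_completion_request(prompt, language, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request asking for a code completion.

    Assumption: the endpoint path and JSON fields mirror the curl example
    in this guide; check your Tabby version's API docs for the exact schema.
    """
    body = json.dumps({"prompt": prompt, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, passing the returned object to `urllib.request.urlopen(req, timeout=10)` sends it. Building the request separately from sending it keeps the payload easy to inspect and test without a GPU.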

Tabby is built in Rust, which gives it low overhead and fast startup times. It ships as a single binary or Docker image, supports NVIDIA and Apple Silicon GPUs, and provides IDE plugins for VS Code, JetBrains, Vim, and Neovim.

Installation

Docker is the fastest path to a working installation. For NVIDIA GPUs:

docker run -d --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder-1B --device cuda

For Apple Silicon Macs, skip Docker: containers on macOS run inside a Linux VM that cannot reach the Metal GPU. Install natively with Homebrew instead:

brew install tabbyml/tabby/tabby
tabby serve --model StarCoder-1B --device metal

For CPU-only (slow, but works for testing):

docker run -d --name tabby \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder-1B --device cpu

The $HOME/.tabby volume persists downloaded models and configuration between container restarts.
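
If you script deployments, the Docker variants differ only in the GPU flag and the `--device` value. The sketch below assembles the argument list for the NVIDIA and CPU commands shown in this section; the function name `docker_run_args` and its parameters are hypothetical, and it uses only flags that already appear in this guide.

```python
import os


def docker_run_args(device, model="StarCoder-1B", data_dir=None, port=8080):
    """Return the argv for `docker run`, mirroring the commands in this guide.

    Only the "cuda" and "cpu" variants are covered by this sketch.
    """
    if device not in ("cuda", "cpu"):
        raise ValueError(f"unsupported device for this sketch: {device!r}")
    if data_dir is None:
        # Same default location as the -v $HOME/.tabby:/data mount above.
        data_dir = os.path.expanduser("~/.tabby")
    args = ["docker", "run", "-d", "--name", "tabby"]
    if device == "cuda":
        # Only the NVIDIA variant passes GPUs through to the container.
        args += ["--gpus", "all"]
    args += [
        "-p", f"{port}:8080",
        "-v", f"{data_dir}:/data",
        "tabbyml/tabby",
        "serve", "--model", model, "--device", device,
    ]
    return args
```

The list can be handed to `subprocess.run(docker_run_args("cuda"), check=True)`. Returning a list rather than a shell string avoids quoting problems with paths that contain spaces.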

Binary Installation

Download the binary from the Tabby GitHub releases page:

# Linux x86_64 with CUDA
curl -L https://github.com/TabbyML/tabby/releases/latest/download/tabby_x86_64-manylinux2014-cuda -o tabby
chmod +x tabby
sudo mv tabby /usr/local/bin/

# Start the server
tabby serve --model StarCoder-1B --device cuda

For a persistent service, create a systemd unit:

# /etc/systemd/system/tabby.service
[Unit]
Description=Tabby Code Completion Server
After=network.target

[Service]
Type=simple
User=tabby
ExecStart=/usr/local/bin/tabby serve --model StarCoder-1B --device cuda
Restart=always
RestartSec=10
Environment="TABBY_ROOT=/var/lib/tabby"

[Install]
WantedBy=multi-user.target

Then create the service user and data directory, and enable the service:

sudo useradd -r -s /bin/false tabby
sudo mkdir -p /var/lib/tabby
sudo chown tabby:tabby /var/lib/tabby
sudo systemctl daemon-reload
sudo systemctl enable --now tabby

Supported Models

Tabby supports a curated list of models optimized for code completion. Unlike with Ollama, you do not pull arbitrary models – you specify a model identifier and Tabby downloads it automatically on first run.

| Model              | Parameters | VRAM Required | Languages     | Notes                              |
|--------------------|------------|---------------|---------------|------------------------------------|
| StarCoder-1B       | 1B         | ~2 GB         | 80+ languages | Fast, good for tab completion      |
| StarCoder-3B       | 3B         | ~4 GB         | 80+ languages | Better quality, still fast         |
| StarCoder-7B       | 7B         | ~8 GB         | 80+ languages | Best StarCoder quality             |
| CodeLlama-7B       | 7B         | ~8 GB         | Multiple      | Strong on Python, C++              |
| CodeLlama-13B      | 13B        | ~16 GB        | Multiple      | High quality, needs large GPU      |
| DeepseekCoder-1.3B | 1.3B       | ~2 GB         | Multiple      | Good accuracy for size             |
| DeepseekCoder-6.7B | 6.7B       | ~8 GB         | Multiple      | Strong all-around                  |
| Qwen2.5-Coder-1.5B | 1.5B       | ~2 GB         | Multiple      | Newest, competitive with 3B models |

To switch models, stop the server and restart with a different --model flag:

tabby serve --model DeepseekCoder-6.7B --device cuda

Tabby downloads the model on first use and caches it in the data directory.
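
To narrow the choice for a given GPU, the table above can be turned into a small lookup. This is a rough sketch: the VRAM figures are the approximate values from the table, and the function name `models_that_fit` and the `headroom_gb` parameter are hypothetical conveniences, not part of Tabby.

```python
# (model id, parameters in billions, approximate VRAM in GB), from the table above.
MODELS = [
    ("StarCoder-1B", 1.0, 2),
    ("StarCoder-3B", 3.0, 4),
    ("StarCoder-7B", 7.0, 8),
    ("CodeLlama-7B", 7.0, 8),
    ("CodeLlama-13B", 13.0, 16),
    ("DeepseekCoder-1.3B", 1.3, 2),
    ("DeepseekCoder-6.7B", 6.7, 8),
    ("Qwen2.5-Coder-1.5B", 1.5, 2),
]


def models_that_fit(vram_gb, headroom_gb=0.0):
    """List model ids whose approximate VRAM need fits, smallest model first.

    headroom_gb reserves memory for anything else sharing the GPU
    (an assumption of this sketch; tune it for your machine).
    """
    usable = vram_gb - headroom_gb
    fitting = [m for m in MODELS if m[2] <= usable]
    # sorted() is stable, so models with equal size keep their table order.
    fitting = sorted(fitting, key=lambda m: m[1])
    return [name for name, _params, _vram in fitting]
```

For example, `models_that_fit(4)` lists the three ~2 GB models followed by StarCoder-3B. The last entry is the largest model that fits, which is usually the one to try first.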

GPU Requirements

Code completion must return results fast – under 500ms for a good experience. This constrains your hardware choices.

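| GPU             | VRAM            | Recommended Max Model | Approximate Latency |
|-----------------|-----------------|-----------------------|---------------------|
| RTX 3060        | 12 GB           | 7B                    | ~200ms              |
| RTX 3090 / 4090 | 24 GB           | 13B                   | ~150ms              |
| RTX 4060 Ti     | 16 GB           | 7B                    | ~150ms              |
| A100 40GB       | 40 GB           | 13B                   | ~80ms               |
| M1/M2 Pro 16GB  | 16 GB (unified) | 7B                    | ~250ms              |
| M3 Max 96GB     | 96 GB (unified) | 13B                   | ~200ms              |
| CPU only        | N/A             | 1B                    | ~2000ms+            |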

For a small team (2-5 developers), a single RTX 3090 running StarCoder-3B handles concurrent requests well. For larger teams, run multiple instances behind a load balancer or use a larger GPU.
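
The latency figures above are approximate, so it is worth measuring your own setup against the 500ms target. The sketch below times any callable and checks the 95th percentile against a budget. The helper names are hypothetical, and using p95 rather than the mean is a choice of this sketch, made because occasional slow completions are what users notice.

```python
import time


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def time_call_ms(fn, repeats=20):
    """Call fn() `repeats` times and return each call's latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def within_budget(samples_ms, budget_ms=500.0):
    """True if the p95 latency is at or under the budget."""
    return p95(samples_ms) <= budget_ms
```

In practice `fn` would be a function that sends one completion request to your server, so `within_budget(time_call_ms(fn))` reports whether the chosen model is fast enough on your hardware.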

IDE Plugins

VS Code

Install from the marketplace: search for “Tabby” by TabbyML. Open settings and configure:

{
  "tabby.api.endpoint": "http://localhost:8080",
  "tabby.api.authToken": ""
}

If you set up authentication (recommended for team deployments), add the token here.

JetBrains (IntelliJ, PyCharm, GoLand, etc.)

Settings > Plugins > Marketplace > search “Tabby”. After installation:

Settings > Tools > Tabby > Server endpoint: http://localhost:8080

Vim / Neovim

Tabby provides a Vim plugin via its official repository:

" vim-plug
Plug 'TabbyML/vim-tabby'

" Configuration
let g:tabby_server_url = 'http://localhost:8080'

For Neovim with lazy.nvim:

{
  "TabbyML/vim-tabby",
  config = function()
    vim.g.tabby_server_url = "http://localhost:8080"
  end,
}

Repository Indexing

One of Tabby’s strongest features is repository indexing. Tabby can index your Git repositories and use that context when generating completions. This means suggestions are aware of your project’s types, function signatures, and patterns – not just the current file.

Configure repositories in ~/.tabby/config.toml:

[[repositories]]
name = "my-project"
git_url = "file:///home/user/projects/my-project"

[[repositories]]
name = "shared-lib"
git_url = "https://github.com/org/shared-lib.git"

After adding repositories, trigger indexing:

tabby scheduler --now

Tabby builds a code search index that it queries during completion. This is particularly valuable for:

  • Using project-specific types and interfaces in suggestions
  • Following existing code patterns and naming conventions
  • Referencing functions from other modules in the same project

Security note: If you index repositories via HTTPS URLs, Tabby clones them locally. Ensure your data directory ($HOME/.tabby or /var/lib/tabby) has appropriate permissions. For private repositories, use SSH URLs or local file paths.
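
When many repositories need indexing, the `[[repositories]]` entries shown earlier in this section can be generated rather than typed by hand. This is a minimal sketch that only produces text in the same shape as the example config. The function name `render_repositories` is hypothetical, and it assumes ordinary ASCII names and URLs.

```python
import json


def render_repositories(repos):
    """Render [[repositories]] tables for config.toml from (name, git_url) pairs."""
    blocks = []
    for name, git_url in repos:
        # json.dumps yields a double-quoted, escaped string, which is also a
        # valid TOML basic string for ordinary ASCII names and URLs.
        blocks.append(
            "[[repositories]]\n"
            f"name = {json.dumps(name)}\n"
            f"git_url = {json.dumps(git_url)}\n"
        )
    return "\n".join(blocks)
```

Write the returned text into the config file yourself and review it before triggering indexing. The sketch deliberately does not touch the file, so it cannot overwrite other settings.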

Authentication and Team Deployment

For team use, enable authentication:

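# Start the server with a shared access token
# (authentication options differ between Tabby versions; confirm with `tabby serve --help`)
tabby serve --model StarCoder-3B --device cuda --token <your-secret-token>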

Distribute the token to team members for their IDE configurations. For production team deployments, place Tabby behind a reverse proxy:

# /etc/nginx/sites-available/tabby
server {
    listen 443 ssl;
    server_name tabby.internal.company.com;

    ssl_certificate /etc/ssl/certs/tabby.pem;
    ssl_certificate_key /etc/ssl/private/tabby.key;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}

Security note: Never expose Tabby directly to the public internet. It processes your source code and should be treated as a sensitive internal service. Use VPN, mTLS, or a private network.

Comparison with Continue.dev and Copilot

| Feature             | Tabby                   | Continue.dev           | GitHub Copilot          |
|---------------------|-------------------------|------------------------|-------------------------|
| Architecture        | Dedicated server        | IDE extension          | Cloud service           |
| Model management    | Built-in                | Relies on Ollama/etc   | Managed                 |
| Repository indexing | Built-in                | Basic                  | Strong                  |
| Chat interface      | Limited                 | Full chat + inline edit | Full chat              |
| Team support        | Multi-user, auth        | Single user            | Organization plans      |
| Offline             | Yes                     | Yes (with Ollama)      | No                      |
| Cost                | Free + hardware         | Free + hardware        | $10-19/month            |
| IDE support         | VS Code, JetBrains, Vim | VS Code, JetBrains     | VS Code, JetBrains, Vim |
| Setup complexity    | Low                     | Medium                 | Minimal                 |

When to Choose Tabby

Tabby is the right choice when:

  • You need a team solution. Tabby’s server architecture means one GPU server serves multiple developers. Continue.dev runs per-machine.
  • Repository-aware completions matter. Tabby’s built-in indexing is more mature than Continue’s codebase search.
  • You want a focused tool. Tabby does code completion well and does not try to be a general-purpose LLM chat interface.
  • You want operational simplicity. One binary, one Docker container. No separate model server needed.

Continue.dev is better when:

  • You want chat, inline editing, and completions in one tool.
  • You already run Ollama and want to reuse it.
  • You want to use the same models for coding and general tasks.

Monitoring and Maintenance

Tabby reports basic server status, such as the loaded model and device, at its health endpoint:

curl http://localhost:8080/v1/health

For long-running deployments, monitor:

  • Disk usage in the data directory (models and indexes grow over time)
  • GPU memory with nvidia-smi – ensure no OOM conditions
  • Response latency – if completions slow down, the model may be too large for your hardware
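
The disk-usage check in the list above is easy to automate. The sketch below uses only the Python standard library. The function names are hypothetical, and the path you pass in is whichever data directory you chose ($HOME/.tabby or /var/lib/tabby in this guide).

```python
import os
import shutil


def directory_size_bytes(path):
    """Total size of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def disk_free_fraction(path):
    """Fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

A cron job could call both and alert when `disk_free_fraction` drops below a threshold you pick, for example 0.1. That threshold is a judgment call, not a Tabby recommendation.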

Update Tabby by pulling the latest Docker image or downloading the newest binary:

docker pull tabbyml/tabby
docker stop tabby && docker rm tabby
# Re-run the docker run command

Model data persists in the mounted volume, so updates are non-destructive.

Troubleshooting

No completions in IDE: Verify the server is running (curl http://localhost:8080/v1/health). Check the IDE plugin is configured with the correct endpoint URL.
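
The first check above can be scripted so that it separates "server not reachable" from "server reachable but rejecting the request". This is a sketch under assumptions: the `/v1/health` path comes from this guide, the function name `probe_health` and its `opener` parameter are hypothetical, and a 200 response is treated as healthy without inspecting the body.

```python
import urllib.error
import urllib.request


def probe_health(base_url="http://localhost:8080", opener=urllib.request.urlopen):
    """Return a short diagnosis string for the server's health endpoint.

    `opener` is injectable so the logic can be exercised without a server.
    """
    url = base_url.rstrip("/") + "/v1/health"
    try:
        with opener(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 401).
        return f"server reachable but returned HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, timeout and similar.
        return f"server unreachable: {err}"
    if status == 200:
        return "ok"
    return f"unexpected HTTP {status}"
```

An "unreachable" result points at the server or the network, while an HTTP error points at configuration such as a missing token or a wrong path in the IDE plugin.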

Slow completions: Check GPU utilization with nvidia-smi. If the GPU is saturated, use a smaller model or upgrade hardware. CPU inference is not practical for interactive use beyond 1B models.

Docker GPU not detected: Ensure nvidia-container-toolkit is installed and the Docker daemon is configured to use the NVIDIA runtime. Test with docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi.

Model download fails: Check network connectivity and disk space. Models range from 1-15 GB. Tabby stores them in the data directory.