TL;DR

llama.cpp remains the fastest way to run quantized LLMs locally in 2026, but choosing the right command-line flags makes the difference between a sluggish 2 tokens/second and a responsive 30+ tokens/second experience. This guide covers the essential flags you need for optimal performance on consumer hardware.

The most impactful flags control resource allocation: --n-gpu-layers offloads model layers to your GPU (start with -ngl 35 for 8GB VRAM), --threads sets CPU cores for processing (use physical cores minus 2), and --ctx-size defines context window length (2048 for chat, 8192 for document analysis). Getting these three right solves most performance issues.

For inference quality, --temp controls randomness (0.7 for factual responses, 1.2 for creative writing), --repeat-penalty prevents loops (1.1 is safe), and --top-p manages diversity (0.9 default). The new --flash-attn flag in 2026 builds enable faster attention mechanisms on compatible hardware.

Memory management flags prevent crashes: --batch-size controls processing chunks (512 default, reduce to 128 on 16GB RAM systems), --ubatch-size handles microbatching (256 works well), and --mlock pins model weights in RAM to avoid swapping. For GGUF models, quantization level matters more than flags – Q4_K_M balances quality and speed for most use cases.

The llama-server binary uses identical flags but adds --host 0.0.0.0 and --port 8080 for network access. This provides an OpenAI-compatible API endpoint that works with existing tools.

Caution: AI-generated flag combinations often suggest incompatible options or outdated syntax. Always test commands with small models first, monitor resource usage with htop, and verify flags against current llama.cpp documentation before deploying to production systems. The flag syntax changes between releases, so commands from 2024 guides may not work with 2026 builds.

Understanding llama.cpp Server Flags: The Foundation

The llama-server binary in llama.cpp serves as your primary interface for running local LLMs with HTTP API access. Understanding its core flags determines whether your deployment runs smoothly or crashes under load. These flags control everything from memory allocation to network binding, and incorrect combinations can cause silent failures or degraded performance.

The most critical flag is -m or --model, which specifies your GGUF model file path. Without this, llama-server cannot start:

./llama-server -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf

The -c or --ctx-size flag sets context window size in tokens. Default values are often too conservative for modern use cases. A 4096 token context works for most chat applications, while document analysis requires 8192 or higher:

./llama-server -m models/llama-3.1-8b.Q5_K_M.gguf -c 8192

Network binding uses --host and --port. The default localhost binding (127.0.0.1) prevents external access. For homelab deployments accessible across your network, bind to 0.0.0.0:

./llama-server -m models/qwen2.5-7b.Q4_K_M.gguf --host 0.0.0.0 --port 8080

Caution: Binding to 0.0.0.0 exposes your server to your entire network. Use firewall rules or reverse proxies with authentication for production deployments. Never expose llama-server directly to the internet without proper security measures.

The --threads or -t flag controls CPU thread count for inference. Setting this to your physical core count (not hyperthreads) typically yields optimal performance:

./llama-server -m models/phi-3-mini.Q4_0.gguf -t 8

These foundation flags appear in virtually every llama-server command. Master them before exploring advanced optimization flags covered in subsequent sections.

Performance Optimization Flags: CPU and GPU Configuration

The -t flag controls CPU thread allocation and directly impacts inference speed. Most modern systems benefit from setting threads to match physical cores rather than logical threads:

llama-server -m llama-3.1-8b-instruct.Q4_K_M.gguf -t 8 -c 4096

For systems with hyperthreading, using physical core count prevents thread contention. A 16-core CPU with 32 logical threads typically performs better with -t 16 than -t 32 for inference workloads.

The -b flag sets batch size for prompt processing. Larger batches improve throughput when processing long contexts:

llama-server -m mistral-7b-instruct.Q5_K_M.gguf -t 12 -b 512 -c 8192

Default batch size is 512, but systems with substantial RAM can push to 1024 or 2048 for faster prompt ingestion on multi-turn conversations.

GPU Acceleration with Layer Offloading

The -ngl flag offloads transformer layers to GPU. Each layer moved to VRAM reduces CPU load and accelerates generation:

llama-server -m codellama-13b.Q4_K_M.gguf -ngl 35 -c 4096

For a 13B parameter model, offloading 35-40 layers to an 8GB GPU typically provides optimal performance. Monitor VRAM usage with nvidia-smi or rocm-smi to avoid overflow.

Full GPU offloading uses -ngl 99 to move all layers:

llama-server -m llama-3.1-8b.Q8_0.gguf -ngl 99 -c 16384

Caution: Always validate GPU memory capacity before setting high layer counts. Exceeding VRAM causes system swapping and severe performance degradation.

The --n-gpu-layers long form provides identical functionality to -ngl. Use whichever fits your scripting preference. Combine with -c to balance context window size against available memory – larger contexts require more VRAM per layer offloaded.

Context and Generation Control Flags

Context window and generation parameters directly impact response quality and resource consumption. The -c flag sets context size in tokens, while generation flags control output behavior.

The -c flag determines maximum context length. Default is 512 tokens, but modern models support much larger windows:

llama-server -m llama-3.1-8b-instruct.Q4_K_M.gguf -c 8192

Larger context windows consume more RAM. An 8K context with Q4_K_M quantization typically requires 2-3GB additional memory compared to 2K context. For code analysis or document processing, 8K-16K contexts work well. Chat applications often run fine with 4K contexts.

Generation Parameter Flags

The -n flag limits maximum tokens generated per request:

llama-server -m mistral-7b-instruct.Q5_K_M.gguf -n 512

Setting -n 512 prevents runaway generation that consumes CPU cycles without useful output. Combine with --temp for temperature control:

llama-server -m codellama-13b.Q4_0.gguf -n 1024 --temp 0.3

Lower temperature values (0.1-0.4) produce more deterministic outputs suitable for code generation. Higher values (0.7-1.0) increase creativity for writing tasks.

The --repeat-penalty flag reduces repetitive output:

llama-server -m neural-chat-7b.Q4_K_M.gguf --repeat-penalty 1.1

Values between 1.05-1.15 work well for most models. Higher penalties can break coherence.

Caution: AI-generated flag combinations may suggest invalid parameter ranges. Always test configurations with small workloads before production deployment. Monitor memory usage when increasing context size – systems without swap can crash if RAM exhausts. The llama-server process does not gracefully handle OOM conditions on most Linux distributions.

Memory and Resource Management Flags

Managing memory efficiently determines whether your local LLM runs smoothly or crashes your system. The llama.cpp engine provides several flags to control resource allocation and prevent out-of-memory errors during inference.

The -c flag sets the context window size in tokens. Default is 512, but modern models support much larger contexts:

./llama-server -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -c 4096

Larger context windows consume more RAM. A 7B parameter model at Q4_K_M quantization uses approximately 4GB base memory, plus 1-2MB per context token. Setting -c 8192 adds roughly 16GB to your memory footprint.

Batch Size Control

The -b flag controls prompt processing batch size, while --ubatch sets the physical batch size for computation:

./llama-server -m models/llama-3-8b.Q5_K_M.gguf -b 512 --ubatch 128

Lower batch sizes reduce memory spikes during prompt ingestion but increase processing time. Systems with limited RAM benefit from -b 256 or lower.

GPU Memory Management

The -ngl flag offloads model layers to GPU. Each layer requires VRAM – typically 200-400MB per layer for 7B models at Q4 quantization:

./llama-server -m models/phi-3-mini.Q4_K_M.gguf -ngl 20

Start with -ngl 10 and increase until you hit VRAM limits. Monitor with nvidia-smi or rocm-smi. Partial offloading works well – CPU handles overflow layers automatically.

Memory Mapping

The --mmap flag (enabled by default) memory-maps model files instead of loading them entirely into RAM. Disable with --no-mmap only if you experience file system issues:

./llama-server -m models/codellama-13b.Q8_0.gguf --no-mmap

Caution: Always test memory configurations with your specific model and hardware before deploying to production environments. Monitor system resources during initial runs to establish safe limits.

Advanced 2026 Features: New Flags and Capabilities

The --flash-attn flag enables Flash Attention v3 for compatible models, dramatically reducing memory bandwidth requirements during inference. This works best with newer architectures like Llama 3.2 and Qwen2.5:

llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf --flash-attn --n-gpu-layers 35

Flash Attention v3 provides substantial speed improvements on modern GPUs with tensor cores, though CPU-only setups see no benefit. Combine with --cont-batching for multi-request scenarios.

Speculative Decoding

Speculative decoding uses a smaller draft model to predict tokens, then validates with your main model. The --draft flag accepts a path to a smaller GGUF model:

llama-server -m llama-3.2-90B-q4_k_m.gguf --draft llama-3.2-1b-q8_0.gguf --draft-max 16

The draft model should share the same tokenizer as your main model. Set --draft-max between 8 and 32 tokens – higher values increase speculation depth but may waste computation if predictions fail frequently.

Structured Output Enforcement

The --grammar-file flag now supports extended BNF syntax for enforcing JSON schemas, function calls, and custom formats:

llama-cli -m mistral-7b-instruct-q5_k_m.gguf --grammar-file json-schema.gbnf -p "Extract entities:"

Create grammar files defining valid output structures. This prevents hallucinated JSON brackets and ensures parseable responses for downstream automation. Particularly valuable when chaining llama.cpp with tools like jq or Python scripts.

Caution: Always validate AI-generated grammar files before production use. Test with sample inputs to verify the grammar accepts valid outputs and rejects malformed ones. Overly restrictive grammars can cause generation failures or infinite loops.

Mamba Architecture Support

The --mamba flag enables optimized inference for Mamba state-space models, which offer linear-time complexity versus transformer quadratic scaling:

llama-server -m mamba-2.8b-q4_0.gguf --mamba --ctx-size 32768

Mamba models handle extremely long contexts efficiently, making them ideal for document analysis and extended conversations on resource-constrained hardware.

Installation and Configuration Steps

Clone the llama.cpp repository and compile with cmake to enable GPU acceleration and optimization flags specific to your hardware:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON
cmake --build . --config Release

For CPU-only builds, omit the CUDA flag. AMD GPU users should replace -DLLAMA_CUDA=ON with -DLLAMA_HIPBLAS=ON. The compiled binaries appear in the build directory, including llama-cli for command-line inference and llama-server for HTTP API access.

Using Pre-Built Binaries

Download platform-specific binaries from the GitHub releases page when you need faster deployment without compilation dependencies. Extract the archive and verify the binary works:

wget https://github.com/ggerganov/llama.cpp/releases/download/b1234/llama-b1234-bin-ubuntu-x64.zip
unzip llama-b1234-bin-ubuntu-x64.zip
cd llama-b1234-bin-ubuntu-x64
./llama-cli --version

Model Acquisition

Download GGUF models from Hugging Face repositories. Most models offer multiple quantization levels – Q4_K_M provides balanced performance for general use, while Q8_0 preserves more quality at higher memory cost:

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

Validation Testing

Run a basic inference test to confirm your installation works before exploring advanced flags:

./llama-cli -m llama-2-7b.Q4_K_M.gguf -p "Explain quantum computing" -n 50

Caution: When using AI-generated installation scripts or flag combinations, manually review each command before execution. Incorrect memory allocation flags can cause system instability, and GPU layer settings exceeding your hardware capabilities will trigger fallback to CPU with significant performance degradation.