TL;DR
Both Ollama and llama.cpp let you control how your local LLMs behave through runtime parameters. Understanding these settings helps you balance response quality, speed, and resource usage without sending data to external APIs.
Temperature controls randomness – lower values like 0.1 produce focused, deterministic outputs while higher values like 0.9 generate creative but less predictable text. Top-p (nucleus sampling) filters token choices by cumulative probability, typically set between 0.7 and 0.95. Context window size determines how much conversation history the model remembers, ranging from 2048 to 128000 tokens depending on your model and available VRAM.
Ollama Parameter Syntax
Pass parameters via the REST API on port 11434, or set them interactively in the CLI with /set parameter (ollama run does not accept sampling flags directly):

ollama run llama3.2
>>> /set parameter temperature 0.7
>>> /set parameter num_ctx 4096
For API calls, include parameters in the JSON payload:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking",
  "options": {
    "temperature": 0.3,
    "num_ctx": 8192
  }
}'

Note that sampling parameters must be nested inside the options object; top-level fields like "temperature" are ignored by the Ollama API.
llama.cpp Server Configuration
The llama-server binary accepts parameters as command-line flags:
./llama-server -m models/llama-3.2-3b-Q4_K_M.gguf \
  --ctx-size 8192 \
  --temp 0.7 \
  --top-p 0.9 \
  --n-gpu-layers 35
You can also pass parameters per-request through the OpenAI-compatible API endpoint.
Production Considerations
Start with conservative settings – temperature 0.3, context 4096 – then adjust based on your use case. Code generation benefits from lower temperatures while creative writing needs higher values. Monitor your system resources since larger context windows consume significantly more RAM and VRAM. Always validate AI-generated configurations against official documentation before deploying to production systems, as model behavior varies across quantization levels and hardware configurations.
Understanding LLM Parameters: What They Control
LLM parameters control how models generate text during inference. These settings affect output quality, randomness, length, and computational cost. Understanding them helps you tune models for specific tasks – whether you need creative writing, precise code generation, or factual responses.
Temperature controls randomness in token selection. Lower values (0.1-0.5) produce focused, deterministic output suitable for code generation or factual queries. Higher values (0.7-1.2) increase creativity and variation, useful for brainstorming or creative writing. Setting temperature to 0 makes output fully deterministic.
Top-p (nucleus sampling) limits token selection to the smallest set whose cumulative probability exceeds the threshold. A top-p of 0.9 considers only the most likely tokens that together represent 90% probability mass. This produces more coherent output than temperature alone.
# Ollama API call with temperature and top-p
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking",
  "options": {
    "temperature": 0.3,
    "top_p": 0.9
  }
}'
Context Length and Token Limits
Context window defines how much text the model remembers. Most models support 2048-8192 tokens by default, with some extending to 32k or 128k tokens. Longer contexts consume more VRAM and slow inference.
The num_predict parameter (exposed as -n or --n-predict in llama.cpp) limits output length. Set this to prevent runaway generation:
# llama-server with context and prediction limits
./llama-server -m models/llama-3.2-3b-Q4_K_M.gguf \
  -c 4096 \
  -n 512
Caution: When integrating LLMs into automation scripts, always validate generated commands before execution. Use temperature below 0.5 for technical tasks and implement output parsing to catch malformed responses. Test parameter combinations with your specific model and workload before deploying to production environments.
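One way to implement that output-parsing step is to refuse to act on a reply unless the expected field parses cleanly. The sketch below is illustrative, not part of either tool: the extract_response helper is hypothetical, and the canned reply stands in for a real curl call so the parsing logic can be shown on its own.

```shell
#!/bin/sh
# Hypothetical helper: pull the "response" field out of an Ollama
# /api/generate reply and fail if the JSON is malformed or the field
# is missing/empty, so a wrapper script never acts on a bad reply.
extract_response() {
  printf '%s' "$1" | python3 -c '
import json, sys

try:
    reply = json.load(sys.stdin)
except json.JSONDecodeError:
    sys.exit(1)
text = reply.get("response", "")
if not text.strip():
    sys.exit(1)
print(text)
'
}

# Canned reply standing in for: curl -s http://localhost:11434/api/generate -d ...
reply='{"model":"llama3.2","response":"Docker bridge networks connect containers.","done":true}'
extract_response "$reply" || echo "malformed reply" >&2
# → Docker bridge networks connect containers.
```

Because the helper returns a non-zero exit status on bad input, it composes cleanly with set -e in larger automation scripts.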
Ollama Parameter Configuration
Ollama exposes inference parameters through its REST API on port 11434. You can set these when making requests to /api/generate or /api/chat endpoints, or configure defaults in a Modelfile for custom model variants.
Pass parameters directly in API requests using the options object:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking",
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 512
  }
}'
Common parameters include temperature (randomness, 0.0-2.0), top_p (nucleus sampling), top_k (token selection pool), num_predict (max output tokens), and repeat_penalty (reduces repetition). Lower temperature values produce more deterministic outputs – useful for code generation or structured data extraction.
Modelfile Configuration
Create persistent parameter defaults by defining a custom Modelfile:
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER top_p 0.85
PARAMETER num_predict 1024
SYSTEM You are a technical documentation assistant specializing in Linux system administration.
Build and run the customized model:
ollama create llama3.2-docs -f Modelfile
ollama run llama3.2-docs
This approach works well when integrating Ollama with automation tools or CI/CD pipelines that need consistent model behavior across runs.
GPU Memory Allocation
Control GPU usage with the num_gpu option, which sets how many model layers Ollama offloads to the GPU. Higher values use more VRAM but improve inference speed. Set it per request in the options object, or persistently with PARAMETER num_gpu in a Modelfile:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker networking",
  "options": { "num_gpu": 35 }
}'
Caution: When using AI-generated parameter recommendations, validate them against your hardware constraints before production deployment. Test memory usage with nvidia-smi or rocm-smi to avoid OOM crashes during peak loads.
llama.cpp Parameter Configuration
llama.cpp uses command-line flags and JSON payloads to control inference behavior. The llama-server binary accepts parameters at startup and through API requests, giving you fine-grained control over model behavior.
Launch llama-server with context window and thread settings:
./llama-server \
  --model models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 35 \
  --port 8080
The --ctx-size flag sets the context window in tokens. The --threads parameter controls CPU thread count for inference. Use --n-gpu-layers to offload layers to GPU – higher values improve speed but require more VRAM.
Runtime Parameters via API
Send inference parameters in the request body when calling the completion endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "Explain Docker networking"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512,
    "repeat_penalty": 1.1
  }'
The repeat_penalty parameter reduces repetitive output – values above 1.0 penalize token repetition. The top_p setting controls nucleus sampling, affecting response diversity.
Quantization Selection
Choose quantization levels based on available RAM. Q4_K_M provides good quality-to-size ratio for most use cases. Q5_K_M offers better quality with moderate size increase. Q8_0 approaches full precision but requires substantially more memory.
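These trade-offs follow directly from bits per weight, so you can sanity-check a quantization choice before downloading anything. A rough sketch, assuming approximate average bits-per-weight figures for each scheme (real usage adds KV-cache and runtime overhead on top of the weights):

```shell
#!/bin/sh
# Rough weight-only memory estimate in GB:
#   parameters (billions) * bits per weight / 8
# Bits-per-weight values are approximate averages, not exact:
#   Q4_K_M ~4.85, Q5_K_M ~5.69, Q8_0 ~8.5
for q in "Q4_K_M 4.85" "Q5_K_M 5.69" "Q8_0 8.50"; do
  set -- $q
  awk -v name="$1" -v bpw="$2" \
    'BEGIN { printf "%s: ~%.1f GB for a 7B model\n", name, 7 * bpw / 8 }'
done
# → Q4_K_M: ~4.2 GB for a 7B model
# → Q5_K_M: ~5.0 GB for a 7B model
# → Q8_0: ~7.4 GB for a 7B model
```

These estimates line up with the roughly 4GB figure commonly quoted for 7B models at Q4_K_M, and make the Q8_0 memory cost concrete.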
Caution: When integrating llama.cpp with AI-assisted deployment tools, always validate generated server commands before production use. Incorrect --n-gpu-layers values can cause out-of-memory errors or fall back to CPU-only inference, significantly impacting performance. Test parameter combinations in development environments first.
GPU Offloading and Memory Management
Both Ollama and llama.cpp let you control how many model layers run on your GPU versus CPU. This balance determines inference speed and memory usage.
Ollama automatically calculates layer distribution based on available VRAM. To override it, set the num_gpu option, which controls how many layers are sent to the GPU – for example, interactively:

ollama run llama3.2:3b
>>> /set parameter num_gpu 35
With llama.cpp’s llama-server, use the -ngl flag to set GPU layers explicitly:
./llama-server -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 35 -c 4096
Start with -ngl 35 for 7B models on 8GB VRAM cards. Increase for larger VRAM or decrease if you hit out-of-memory errors. Monitor nvidia-smi or rocm-smi during inference to watch VRAM usage.
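The query form of nvidia-smi is easier to script than the full dashboard: nvidia-smi --query-gpu=memory.used,memory.total --format=csv prints one CSV row per GPU. The sketch below parses a canned sample of that CSV so the logic can be shown without a GPU present; on a real system, pipe the actual nvidia-smi output in instead.

```shell
#!/bin/sh
# Canned stand-in for:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv
sample='memory.used [MiB], memory.total [MiB]
6144 MiB, 8192 MiB'

# Skip the header row, strip the " MiB" units via awk numeric coercion,
# and report VRAM headroom.
printf '%s\n' "$sample" | awk -F', ' 'NR == 2 {
  used = $1 + 0; total = $2 + 0
  printf "VRAM: %d/%d MiB used (%d MiB free)\n", used, total, total - used
}'
# → VRAM: 6144/8192 MiB used (2048 MiB free)
```

A loop around this check makes it easy to watch headroom while you raise -ngl incrementally.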
Memory Pressure and Quantization Trade-offs
Quantization reduces model size at the cost of output quality. Q4_K_M offers the best balance for most use cases – a 7B model fits in roughly 4GB RAM. Q8_0 preserves more quality but nearly doubles memory requirements relative to Q4_K_M.
If your system runs out of memory during inference, reduce GPU layers or switch to a lower quantization. For example, moving from Q5_K_M to Q4_K_M cuts memory usage by roughly 15 percent while maintaining reasonable output quality for most tasks.
Caution: When using AI coding assistants to generate llama.cpp commands, always verify the -ngl value matches your hardware before running in production. An incorrect layer count can crash the server or cause severe performance degradation.
Test different configurations with your actual workload. A model that runs smoothly with short prompts may exhaust memory when processing long context windows. Start conservative with layer counts and increase incrementally while monitoring system resources.
Quantization Levels and Model Selection
Quantization reduces model size by representing weights with fewer bits. GGUF models use quantization schemes like Q4_0, Q4_K_M, Q5_K_M, and Q8_0. Lower quantization means smaller memory footprint but reduced output quality. Most homelab setups run Q4_K_M or Q5_K_M variants as they balance quality and resource usage effectively.
For systems with 16GB RAM, Q4_K_M quantization works well for 7B parameter models. A 13B model at Q4_K_M typically requires 8-10GB RAM. Q8_0 quantization preserves more quality but doubles memory requirements compared to Q4_0.
When pulling models with Ollama, specify the quantization in the model tag. Available tags vary by model – check the Tags list on the model's page in the Ollama library:

ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull mistral:7b-instruct-q5_K_M
For llama.cpp, download GGUF files directly from model repositories. The filename indicates quantization:
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
./llama-server -m llama-2-7b.Q4_K_M.gguf --port 8080
Testing Different Quantizations
Run the same prompt across quantization levels to compare output quality. Create a test script:
#!/bin/bash
for quant in q4_0 q4_K_M q5_K_M q8_0; do
  echo "Testing $quant"
  ollama run llama3.2:3b-instruct-$quant "Explain Docker networking in 50 words"
done
Caution: AI-generated quantization recommendations may not account for your specific hardware. Always test memory usage with htop or nvidia-smi before deploying models in production environments. Start with conservative quantization levels and increase only after verifying stability.
Q4_K_M provides the best starting point for most local deployments. Upgrade to Q5_K_M or Q8_0 only when quality issues appear in your specific use case.
Installation and Configuration Steps
Install Ollama with the official script on Linux systems:
curl -fsSL https://ollama.com/install.sh | sh
After installation, pull a model to test your setup:
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain quantum computing"
The service starts automatically and listens on port 11434. When Ollama runs under systemd, exports in your shell profile do not reach the service; they only apply when you run ollama serve manually. For the service, add Environment= lines to a unit override with sudo systemctl edit ollama.service:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/storage/ollama-models"

Then reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama
Installing llama.cpp
Download pre-built binaries from the GitHub releases page or build from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
Download a GGUF model file. Quantization levels like Q4_K_M balance quality and memory usage:
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
Start the llama-server with your model:
./build/bin/llama-server \
  -m llama-2-7b.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32
The -ngl flag controls GPU layer offloading. Higher values use more VRAM but improve speed.
Caution: When using AI assistants to generate installation commands, always verify paths, URLs, and flags against official documentation before running them in production environments. Model files can be large – ensure adequate disk space before downloading.