Setting LLM Parameters in Ollama and llama.cpp for Local AI Models
TL;DR

Both Ollama and llama.cpp let you control how your local LLMs behave through runtime parameters. Understanding these settings helps you balance response quality, speed, and resource usage without sending data to external APIs.

- Temperature controls randomness: lower values like 0.1 produce focused, deterministic outputs, while higher values like 0.9 generate creative but less predictable text.
- Top-p (nucleus sampling) filters token choices by cumulative probability, typically set between 0.7 and 0.95.
- Context window size determines how much conversation history the model remembers, ranging from 2048 to 128000 tokens depending on your model and available VRAM.

...
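As a rough sketch of how these parameters are set in practice: Ollama reads them from `PARAMETER` directives in a Modelfile, while llama.cpp takes them as CLI flags. The model name `llama3` and the path `model.gguf` below are placeholders; substitute whatever model you actually run.

```shell
# --- Ollama: bake parameters into a custom model via a Modelfile ---
# Save the following three-line block (without the leading '# ') as "Modelfile":
#   FROM llama3
#   PARAMETER temperature 0.1
#   PARAMETER top_p 0.9
#   PARAMETER num_ctx 4096
# Then build and run the customized model:
ollama create mymodel -f Modelfile
ollama run mymodel

# --- llama.cpp: pass the same settings as flags at launch ---
# --temp sets temperature, --top-p sets nucleus sampling,
# -c (--ctx-size) sets the context window in tokens.
./llama-cli -m model.gguf --temp 0.1 --top-p 0.9 -c 4096
```

Baking parameters into an Ollama Modelfile is convenient when you want consistent defaults across sessions; the llama.cpp flags are handier for quick per-run experimentation.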