GGUF Quantization Explained: Choosing the Right Format for Local AI
TL;DR

    # Check the quantization of an Ollama model
    ollama show llama3.2:3b | grep -i quant

    # Inspect a GGUF file directly (requires the gguf Python package)
    python3 -c "from gguf import GGUFReader; r = GGUFReader('model.gguf'); print(list(r.fields))"

    # List the quantization types llama.cpp supports
    ./llama-quantize --help

    # Quantize an existing GGUF with llama.cpp
    ./llama-quantize input.gguf output-Q4_K_M.gguf Q4_K_M

GGUF is the file format used by llama.cpp and tools built on it, such as Ollama, and the de facto standard for running quantized LLMs locally. Quantization reduces model size and VRAM usage by representing weights with fewer bits. The tradeoff is a loss in output quality that is usually small at moderate bit widths and grows as the bit width drops. Choosing the right quantization level depends on your available VRAM, the model size, and your quality requirements.

...
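To make the VRAM tradeoff concrete, here is a minimal Python sketch that estimates weight memory as parameters times bits per weight, then picks the highest-precision format that fits. The bits-per-weight values, the `overhead_gib` allowance for KV cache and runtime buffers, and the function names are all illustrative assumptions, not figures from llama.cpp or Ollama; real file sizes vary by model architecture.

```python
# Approximate bits per weight for a few formats.
# These are illustrative assumptions; actual values vary by model.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def estimate_weight_gib(n_params: float, quant: str) -> float:
    """Estimate the memory needed for the weights alone, in GiB."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1024**3


def pick_quant(n_params: float, vram_gib: float, overhead_gib: float = 1.5):
    """Return the highest-precision format that fits in VRAM, or None.

    overhead_gib is a hypothetical allowance for KV cache and buffers.
    """
    by_precision = sorted(
        APPROX_BITS_PER_WEIGHT, key=APPROX_BITS_PER_WEIGHT.get, reverse=True
    )
    for quant in by_precision:
        if estimate_weight_gib(n_params, quant) + overhead_gib <= vram_gib:
            return quant
    return None


if __name__ == "__main__":
    n_params = 3e9  # assumed parameter count for a roughly 3B model
    for vram in (2, 4, 8):
        print(f"{vram} GiB VRAM -> {pick_quant(n_params, vram)}")
```

Treat the result as a starting point only: long context windows enlarge the KV cache well beyond a fixed allowance, so leave headroom or measure actual usage.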
