Choosing the right open-source model for your server
A practical sizing guide for Llama 3.1, Mistral, Qwen, and friends.
Most decisions about which model to deploy come down to two questions: what hardware are you running on, and what latency budget can your workflow tolerate.
For most teams on a 16 GB box without a GPU, Llama 3.1 8B at 4-bit quantisation is the sweet spot. The weights take roughly 4 to 5 GB, which leaves headroom for the context cache and the rest of the system. It answers most general queries well and has a healthy fine-tuning ecosystem.
Mistral 7B is a close second to Llama 3.1 8B: it's a touch faster on a similar footprint and strong at instruction following. If you have a 32 GB box, Qwen 2.5 14B is also viable, particularly for languages beyond English.
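With around 48 GB of VRAM (one large card or two 24 GB cards), Llama 3.1 70B at 4-bit becomes practical. The weights alone take about 35 GB at 4 bits, so a single 24 GB GPU only works if part of the model is offloaded to the CPU, at a substantial cost in speed. At this size the quality gap to frontier hosted models narrows considerably. For RAG-heavy use cases where the retrieved context does the heavy lifting, the 70B is overkill: stay on 8B and spend the saved capacity on more workflows.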
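As a rough sanity check on these pairings, the memory taken by the weights is approximately the parameter count times the bits per weight, divided by eight. The sketch below is a back-of-the-envelope estimate, not a measurement: the function names, the nominal parameter counts read off the model names, and the 1.2 overhead factor for context cache and runtime buffers are all assumptions for illustration, and real quantised files vary by format.

```python
# Back-of-the-envelope memory estimate for a quantised model.
# All names and the overhead factor are illustrative assumptions.


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    # (params_billion * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


def estimated_footprint_gb(
    params_billion: float,
    bits_per_weight: float = 4,
    overhead: float = 1.2,  # assumed allowance for context cache and buffers
) -> float:
    """Weights plus an assumed multiplicative allowance for runtime overhead."""
    return weight_memory_gb(params_billion, bits_per_weight) * overhead


if __name__ == "__main__":
    # Nominal sizes taken from the model names; true counts differ slightly.
    for name, size in [
        ("Llama 3.1 8B", 8),
        ("Qwen 2.5 14B", 14),
        ("Llama 3.1 70B", 70),
    ]:
        print(
            f"{name}: ~{weight_memory_gb(size, 4):.0f} GB weights, "
            f"~{estimated_footprint_gb(size):.0f} GB with overhead"
        )
```

On these assumptions the 8B needs about 4 GB for its weights and the 70B about 35 GB, which is why the larger model is out of reach for a 16 GB or 32 GB box once the operating system and context cache are accounted for.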
What we don't recommend: running anything below 4-bit quantisation. The quality cliff is steep and unpredictable. If you can't fit a 4-bit model, switch to a smaller one rather than dropping the precision.
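The rule above can be written as a small selection step: hold the precision at 4 bits and walk down the model sizes until one fits. This is a minimal sketch under stated assumptions. `pick_model`, the candidate table with nominal parameter counts, and the 1.2 overhead factor are hypothetical, and `budget_gb` means the memory you are willing to give the model, not the machine's total RAM.

```python
from typing import Optional

MIN_BITS = 4  # the floor from the text: never quantise below 4 bits

# Nominal sizes in billions of parameters, read off the model names
# (illustrative values, not exact counts).
CANDIDATES = {
    "Mistral 7B": 7,
    "Llama 3.1 8B": 8,
    "Qwen 2.5 14B": 14,
    "Llama 3.1 70B": 70,
}


def pick_model(
    budget_gb: float,
    candidates: dict = CANDIDATES,
    overhead: float = 1.2,  # assumed allowance for context cache and buffers
) -> Optional[str]:
    """Return the largest candidate that fits at 4-bit, or None if none fit.

    The precision is fixed at MIN_BITS; the only thing that shrinks to meet
    the budget is the model size.
    """
    fitting = {
        name: size
        for name, size in candidates.items()
        if size * MIN_BITS / 8 * overhead <= budget_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (3, 6, 24, 48):
        print(f"{budget} GB budget -> {pick_model(budget)}")
```

Fitting in memory is a necessary condition, not a recommendation in itself: the latency budget may still favour a smaller model than the largest one that fits.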