Engineering Apr 29, 2026 6 min read

Choosing the right open-source model for your server

A practical sizing guide for Llama 3.1, Mistral, Qwen, and friends.

SelfAiWizard Engineering

Most decisions about which model to deploy come down to two questions: what hardware are you running on, and what latency budget can your workflow tolerate.

For most teams on a 16 GB box without a GPU, Llama 3.1 8B at 4-bit quantisation is the sweet spot. Its weights take roughly 5 GB, so it runs comfortably alongside the OS and your other services, answers most general queries well, and has a healthy fine-tuning ecosystem.
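As a back-of-the-envelope check on sizing claims like this one, the weight footprint of a quantised model is roughly parameters times bits per weight, divided by eight. The sketch below is only a rule of thumb: the function name `estimate_memory_gb` is ours, and the 1.2 overhead factor for KV cache and runtime buffers is an assumption, not a measured value. Real usage depends on the runtime, context length, and quantisation format.

```python
def estimate_memory_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough memory estimate in GB (10^9 bytes) for a quantised model.

    params_billion: nominal parameter count in billions (8.0 for an "8B" model).
    bits: bits per weight after quantisation.
    overhead: multiplier for KV cache and runtime buffers (assumed, not measured).
    """
    if params_billion <= 0 or bits <= 0 or overhead < 1.0:
        raise ValueError("params_billion and bits must be positive, overhead >= 1.0")
    # 1e9 parameters * (bits / 8) bytes each = that many GB of weights
    weights_gb = params_billion * bits / 8
    return weights_gb * overhead


if __name__ == "__main__":
    # Nominal 8B model at 4-bit, with the assumed 20% overhead
    print(f"8B @ 4-bit: ~{estimate_memory_gb(8.0):.1f} GB")
```

Treat the result as a lower bound when planning: long context windows grow the KV cache well beyond a fixed percentage.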

Mistral 7B is a close second: it's a touch faster on the same footprint and excels at instruction following. If you have a 32 GB box, Qwen 2.5 14B is also viable, particularly for languages beyond English.

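With roughly 48 GB of VRAM (one 48 GB card or two 24 GB cards), Llama 3.1 70B at 4-bit becomes practical. The weights alone take at least 35 GB, so a single 24 GB card can only run it with part of the model offloaded to the CPU, at a significant cost in speed. At this size the quality gap to frontier hosted models narrows dramatically. For RAG-heavy use cases where the retrieved context does the heavy lifting, the 70B is overkill; stay on 8B and spend the saved capacity on more workflows.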

What we don't recommend: running anything below 4-bit quantisation. The quality cliff is steep and unpredictable. If you can't fit a 4-bit model, switch to a smaller one rather than dropping the precision.
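The rule above (hold precision at 4-bit and step down in model size) can be sketched as a small picker. This is an illustration under stated assumptions, not a benchmark-backed selector: the parameter counts are the nominal figures from the model names, the 1.2 overhead factor is a guess, and `pick_model` is a hypothetical helper. It ranks by size only, so it ignores the RAG and language considerations discussed earlier.

```python
from typing import Optional

# Nominal parameter counts in billions, read off the model names (assumption).
CANDIDATES = {
    "Llama 3.1 70B": 70.0,
    "Qwen 2.5 14B": 14.0,
    "Llama 3.1 8B": 8.0,
    "Mistral 7B": 7.0,
}

BITS = 4        # fixed: we never go below 4-bit
OVERHEAD = 1.2  # assumed multiplier for KV cache and runtime buffers


def pick_model(usable_gb: float) -> Optional[str]:
    """Return the largest candidate whose 4-bit footprint fits in usable_gb.

    usable_gb is the memory left for the model after the OS and other
    workloads. Returns None when nothing fits at 4-bit, rather than
    suggesting a lower precision.
    """
    fitting = {
        name: params
        for name, params in CANDIDATES.items()
        if params * BITS / 8 * OVERHEAD <= usable_gb
    }
    if not fitting:
        return None
    # Largest parameter count among the models that fit
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (3.0, 6.0, 10.0):
        print(f"{budget:>5.1f} GB usable -> {pick_model(budget)}")
```

A `None` result means the candidate list needs a smaller model, not that the quantisation should drop to 3-bit or 2-bit.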

Tagged: Models, Ollama, Hardware