Run popular 7B–13B models locally without breaking the bank.
GPU: RTX 4070 Super 12GB
12GB of VRAM and Ada-generation CUDA performance keep 7B–13B models responsive.
CPU: AMD Ryzen 5 5600
Six high-clock cores feed the GPU without bottlenecking inference threads.
RAM: 32GB DDR4-3200
Room for the model, OS, and tooling – upgradeable to 64GB in two clicks.
Complete system
Ready to assemble with standard tools. Boots local AI workloads on day one.
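If you want to verify "day one" for yourself, here is a minimal sanity check. It assumes you use Ollama as the model runner and have already pulled a model; the endpoint, model tag, and prompt are only examples, and any local server with an HTTP API works the same way.

```python
# Minimal day-one sanity check. Assumes Ollama as the runner (default port
# 11434) and that you've already pulled a model, e.g. `ollama pull llama3.1:8b`;
# the model tag and prompt are just examples.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",      # fits comfortably in 12GB VRAM at Q4
        "prompt": "Say hello in one sentence.",
        "stream": False,             # return a single JSON object, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

If a short reply prints, the GPU, drivers, and runner are all talking to each other.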
Real-world throughput for popular models, plus how this build compares to our other configurations.
| Model tier | Example model | Budget (This build) | Recommended | Premium |
|---|---|---|---|---|
| Small (7B–8B) | Qwen 2.5 7B | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| | Llama 3.1 8B | ~58 tok/s | ~105 tok/s | ~142 tok/s |
| | Mistral 7B v0.2 | ~70 tok/s | ~125 tok/s | ~165 tok/s |
| Medium (13B–32B) | DeepSeek 33B (Q4; higher latency, but big gains for reasoning) | ~35 tok/s | ~62 tok/s | ~89 tok/s |
| | Llama 3.1 13B | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| Large (70B) | Llama 3.1 70B (requires Q4 on budget builds) | ~12 tok/s | ~25 tok/s | ~45 tok/s |

Benchmark figures represent Q4 quantization. Expect ~40% slower speeds for FP16 / full-precision runs. The short script below shows how to measure your own numbers.
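The sketch below assumes Ollama, whose non-streaming /api/generate response reports how many tokens were generated and how long generation took; the model tag and prompt are placeholders, and your results will vary with quantization, context length, and drivers.

```python
# Rough throughput check against the table above. Assumes Ollama: its
# non-streaming /api/generate response includes eval_count (tokens generated)
# and eval_duration (nanoseconds spent generating them).
import requests

data = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",   # most Ollama tags default to a Q4 quantization
        "prompt": "Explain what quantization does to an LLM in about 200 words.",
        "stream": False,
    },
    timeout=300,
).json()

tokens = data["eval_count"]
seconds = data["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```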
Every component is intentionally chosen to balance performance, thermals, and future upgrades. Start with these essentials and expand as your workloads grow.
GPU: RTX 4070 Super 12GB
12GB of VRAM and Ada-generation CUDA performance keep 7B–13B models responsive.
CPU: AMD Ryzen 5 5600
Six high-clock cores feed the GPU without bottlenecking inference threads.
RAM: 32GB DDR4-3200
Room for the model, OS, and tooling – upgradeable to 64GB in two clicks.
Speeds are based on Q4 quantization benchmarks. Use the filters to explore what runs best on this hardware.
| Model | Size | Min VRAM (Q4) | Est. speed | Context window | Best for |
|---|---|---|---|---|---|
| OLMo 2 0425 1B (allenai) | 1.0B | 1 GB | 76 tok/s | N/A | Fast chat |
| Llama 3.2 1B Instruct (unsloth) | 1.0B | 1 GB | 76 tok/s | N/A | Fast chat |
| Llama 3.2 1B Instruct (meta-llama) | 1.0B | 1 GB | 73 tok/s | N/A | Fast chat |
| Llama Guard 3 1B (meta-llama) | 1.0B | 1 GB | 71 tok/s | N/A | Fast chat |
| SAM 3 (facebook) | 860M | 1 GB | 70 tok/s | 4K | Fast chat |
| Gemma 3 1B IT (google) | 1.0B | 1 GB | 69 tok/s | N/A | Fast chat |
| EmbeddingGemma 300M (google) | 303M | 1 GB | 69 tok/s | 4K | Fast chat |
| TinyLlama 1.1B Chat v1.0 (TinyLlama) | 1.0B | 1 GB | 68 tok/s | N/A | Fast chat |
| OpenELM 1.1B Instruct (apple) | 1.0B | 1 GB | 68 tok/s | N/A | Fast chat |
| Gemma 3 1B IT (unsloth) | 1.0B | 1 GB | 67 tok/s | N/A | Fast chat |
Top pick for daily chat: OLMo 2 0425 1B (76 tok/s).
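The Min VRAM (Q4) column can be sanity-checked with back-of-the-envelope arithmetic. The constants below are rough rules of thumb rather than measurements, not values from the table itself.

```python
# Back-of-the-envelope check for the "Min VRAM (Q4)" column. Rules of thumb,
# not measurements: Q4 weights cost roughly 0.6 bytes per parameter (including
# quantization metadata), plus well under a gigabyte of KV cache and runtime
# overhead at modest context lengths.
def min_vram_q4_gb(params_billion: float, overhead_gb: float = 0.8) -> float:
    weights_gb = params_billion * 0.6
    return weights_gb + overhead_gb

for p in (1.0, 7.0, 8.0, 13.0):
    print(f"{p:>4.1f}B -> ~{min_vram_q4_gb(p):.1f} GB")
# ~1.4, ~5.0, ~5.6 and ~8.6 GB respectively: 7B-13B fits the 12GB card with
# headroom for context; 33B+ needs heavier offloading or more VRAM.
```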
Real-world scenarios where this hardware shines. Each card includes the model we recommend and what to expect for responsiveness.
Keep conversations private with models like Qwen 7B or Llama 3.1 8B running entirely offline.
Use DeepSeek Coder or similar local models for reliable completions that respect your codebase.
Draft blog posts, documentation, and emails quickly with Mistral 7B or Gemma 9B.
Swap models in minutes, experiment with quantizations, and build intuition for local AI.
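Most of these workflows boil down to pointing an existing tool at a local endpoint instead of a cloud API. A minimal sketch follows, assuming your runner exposes an OpenAI-compatible endpoint (Ollama and llama.cpp's server both do); the base URL, placeholder API key, and model tag are illustrative, not required values.

```python
# Wiring a chat, coding, or drafting workflow to a local model through an
# OpenAI-compatible endpoint. Assumes such an endpoint exists locally plus the
# `openai` Python package; swap the model tag for whatever you have pulled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

reply = client.chat.completions.create(
    model="deepseek-coder:6.7b",   # or qwen2.5:7b, mistral:7b, etc.
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."},
    ],
)
print(reply.choices[0].message.content)
```

Because the endpoint speaks the same protocol as the hosted APIs, editor plugins and chat UIs that accept a custom base URL can usually be pointed at it unchanged.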
Spot the trade-offs between tiers and know exactly when it makes sense to step up.
| Feature | Budget (This build) | Recommended | Premium |
|---|---|---|---|
| Total cost | $1,377.95 | $3,847.94 | $8,706.94 |
| GPU | RTX 4070 Super 12GB | NVIDIA RTX 4090 24GB | 2x NVIDIA RTX 4090 24GB |
| VRAM | 12GB | 24GB | 48GB (2× 24GB) |
| System memory | 32GB DDR4-3200 | 128GB DDR5-5600 | 256GB DDR4 ECC |
| 7B models | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| 13B models | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| 70B models | ~12 tok/s | ~25 tok/s | ~45 tok/s |
| Best for | Daily AI tasks, coding assistants | Power users, heavier experimentation | Production workloads, agents |
The three questions we hear most often about this build and who it's for.
What can I actually run on this build?
Check the compatible models table. Anything up to 13B runs smoothly; 32B+ models work in Q4 quantization, with slower responses on budget hardware.
Is it enough for professional work?
It's excellent for personal productivity and prototyping. For shared production workloads or enterprise SLAs, step up to the Premium build with dual RTX 4090s.
How long does assembly take?
If you've built a PC before, plan ~2 hours. First time? Budget 4 hours and follow our assembly guide. All parts are standard ATX with no proprietary connectors.
Still have questions? Join our Discord or read the full documentation.
Outgrowing the build? These upgrades deliver the biggest gains for the money.
GPU: Jump to RTX 4080/4090
Adds 4–12GB of VRAM and unlocks much faster 13B+ inference (~$800–$1,500); see the memory sketch after this list.
RAM: Expand to 64GB
Keeps large contexts and tooling responsive when multitasking (~$80).
Storage: Add 2TB NVMe
Room for multiple quantizations and datasets (~$150).
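To see what the GPU and RAM upgrades above buy you, it helps to estimate how much memory a model plus its context actually needs. The sketch below is a rough lower bound with illustrative constants (Llama-3.1-8B-like attention geometry, FP16 KV cache); real usage adds activations and runtime buffers, so treat the output as a floor, not a guarantee.

```python
# Rough "what fits after an upgrade" estimate: quantized weights plus KV cache
# for a given context length. Defaults are illustrative, not per-model specs.
def est_vram_gb(params_b: float, context_len: int, n_layers: int = 32,
                n_kv_heads: int = 8, head_dim: int = 128,
                bytes_per_weight: float = 0.6, kv_bytes: int = 2) -> float:
    weights = params_b * 1e9 * bytes_per_weight
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9

for params, ctx in [(8, 8_192), (8, 32_768), (13, 8_192)]:
    print(f"{params}B @ {ctx:>6} tokens of context -> ~{est_vram_gb(params, ctx):.1f} GB")
# Roughly 5.9, 9.1 and 8.9 GB with these defaults: long contexts eat VRAM fast,
# which is why the jump to a 16-24GB card matters for 13B+ models.
```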