Double the speed on 13B models and unlock comfortable 32B workloads.
GPU: RTX 4060 Ti 16GB
Ada-generation VRAM and CUDA performance keep 7B–13B models responsive.
CPU: AMD Ryzen 5 7600
Six high-clock cores feed the GPU without bottlenecking inference threads.
RAM: 32GB DDR5-5600
Room for the model, OS, and tooling — upgradeable to 64GB in two clicks.
Complete system
Ready to assemble with standard tools. Boots local AI workloads on day one.
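Once assembled, a quick sanity check confirms the card is visible to your tooling before you download any models. A minimal sketch, assuming PyTorch with CUDA support is installed:

```python
import torch

# Confirm the RTX 4060 Ti is visible before loading any models.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA device found -- check drivers and CUDA toolkit install.")
```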
Real-world throughput for popular models, plus how this build compares to our other configurations.
| Model tier | Example model | Budget | Recommended (this build) | Premium |
|---|---|---|---|---|
| Small (7B–8B) | Qwen 2.5 7B | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| Small (7B–8B) | Llama 3.1 8B | ~58 tok/s | ~105 tok/s | ~142 tok/s |
| Small (7B–8B) | Mistral 7B v0.2 | ~70 tok/s | ~125 tok/s | ~165 tok/s |
| Medium (13B–32B) | DeepSeek 33B (Q4; higher latency but big gains for reasoning) | ~35 tok/s | ~62 tok/s | ~89 tok/s |
| Medium (13B–32B) | Llama 3.1 13B | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| Large (70B) | Llama 3.1 70B (requires Q4 on budget builds) | ~12 tok/s | ~25 tok/s | ~45 tok/s |
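Tokens per second is easy to measure on your own hardware. A minimal sketch using llama-cpp-python; the model path and prompt are placeholders, not part of this guide:

```python
import time
from llama_cpp import Llama

# Hypothetical local GGUF path -- substitute your own Q4 model file.
llm = Llama(model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Explain KV caching in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```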
Every component is intentionally chosen to balance performance, thermals, and future upgrades. Start with these essentials and expand as your workloads grow.
Speeds are based on Q4 quantization benchmarks. The table below shows what runs best on this hardware.
| Model | Size | Min VRAM (Q4) | Est. speed | Context window | Best for |
|---|---|---|---|---|---|
| OpenELM 1.1B Instruct (apple) | 1.0B | 1 GB | 66 tok/s | N/A | Fast chat |
| BERT Base Uncased (google-bert) | 110M | 1 GB | 65 tok/s | 512 | Embeddings / classification |
| OLMo 2 0425 1B (allenai) | 1.0B | 1 GB | 64 tok/s | N/A | Fast chat |
| Llama Guard 3 1B (meta-llama) | 1.0B | 1 GB | 64 tok/s | N/A | Content moderation |
| SAM 3 (facebook) | 860M | 1 GB | 64 tok/s | N/A | Image segmentation |
| HunyuanOCR (tencent) | 996M | 1 GB | 63 tok/s | 4K | OCR |
| Gemma 3 1B IT (unsloth) | 1.0B | 1 GB | 62 tok/s | N/A | Fast chat |
| Llama 3.2 1B Instruct (unsloth) | 1.0B | 1 GB | 62 tok/s | N/A | Fast chat |
| EmbeddingGemma 300M (google) | 303M | 1 GB | 62 tok/s | 4K | Text embeddings |
| Llama 3.2 1B Instruct (meta-llama) | 1.0B | 1 GB | 60 tok/s | N/A | Fast chat |
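The Min VRAM column follows from simple arithmetic: Q4 stores roughly half a byte per parameter, plus overhead for the KV cache and runtime buffers. A back-of-the-envelope sketch (the bits-per-weight and overhead figures are assumptions, not measured values):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate: weights at the given quantization width,
    plus a flat allowance for KV cache and runtime buffers (assumed)."""
    weights_gb = params_billion * bits_per_weight / 8  # GB for weights
    return weights_gb + overhead_gb

# Worked examples against this build's 16GB card:
for p in (1.0, 7.0, 13.0, 33.0):
    print(f"{p:>5.1f}B params -> ~{estimate_vram_gb(p):.1f} GB at Q4")
# 7B and 13B fit comfortably; 33B needs partial CPU offload on 16GB.
```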
Real-world scenarios where this hardware shines. Each card includes the model we recommend and what to expect for responsiveness.
Daily chat
OpenELM 1.1B Instruct (66 tok/s)
Keep 13B–32B models responsive for demanding coding or research sessions.
Run Mixtral 8x22B or DeepSeek 32B Q4 for better reasoning and analysis workloads.
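Models at this size exceed 16GB of VRAM at Q4, but llama.cpp-style runtimes can split layers between GPU and CPU. A minimal sketch with llama-cpp-python; the file name and layer split are assumptions to illustrate the idea:

```python
from llama_cpp import Llama

# Hypothetical 32B Q4 GGUF; ~20GB of weights won't fit in 16GB of VRAM,
# so offload only part of the layer stack to the GPU.
llm = Llama(
    model_path="models/deepseek-32b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,   # assumed split; raise it until VRAM is nearly full
    n_ctx=8192,
)
out = llm("Summarize the trade-offs of Q4 quantization.", max_tokens=128)
print(out["choices"][0]["text"])
```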
Spot the trade-offs between tiers and know exactly when it makes sense to step up.
| Feature | Budget | Recommended (this build) | Premium |
|---|---|---|---|
| Total cost | $1,377.95 | $3,847.94 | $8,706.94 |
| GPU | RTX 4070 Super 12GB | NVIDIA RTX 4090 24GB | 2x NVIDIA RTX 4090 24GB |
| VRAM | 12GB | 24GB | 48GB (2x 24GB) |
| System memory | 32GB DDR5-5600 | 128GB DDR5-5600 | 256GB DDR4 ECC |
| 7B models | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| 13B models | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| 70B models | ~12 tok/s | ~25 tok/s | ~45 tok/s |
| Best for | Daily AI tasks, coding assistants | Power users, heavier experimentation | Production workloads, agents |
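One way to read this table is dollars per unit of throughput. A quick sketch using the 13B figures above:

```python
# Total cost and ~13B tok/s, taken from the comparison table above.
tiers = {
    "Budget":      (1377.95, 28),
    "Recommended": (3847.94, 52),
    "Premium":     (8706.94, 67),
}
for name, (cost, tps) in tiers.items():
    print(f"{name:>12}: ${cost / tps:,.0f} per 13B tok/s")
# Budget ~= $49, Recommended ~= $74, Premium ~= $130 per tok/s:
# diminishing returns unless you need the larger models to fit at all.
```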
The three questions we hear most often about this build and who it's for.
Check the compatible models table. Anything up to 13B runs smoothly. 32B+ models work in Q4 quantization, with slower responses on budget hardware.
It's excellent for personal productivity and prototyping. For shared production workloads or enterprise SLAs, step up to the Premium build with dual RTX 4090s.
If you've built a PC before, plan ~2 hours. First time? Budget 4 hours and follow our assembly guide. All parts are standard ATX with no proprietary connectors.
Still have questions? Join our Discord or read the full documentation.
Benchmark figures represent Q4 quantization. Expect ~40% slower speeds for FP16 / full-precision runs.
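For example, Qwen 2.5 7B's ~118 tok/s at Q4 on this build works out to roughly 70 tok/s at FP16.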
GPU: Step up to RTX 4090
24GB VRAM unlocks 70B at viable speeds and FP16 13B runs.
Memory: 96GB+ DDR5
Helpful for heavy datasets, RAG pipelines, and multi-model agents.