Double the speed on 13B models and unlock comfortable 32B workloads.

Every component is intentionally chosen to balance performance, thermals, and future upgrades. Start with these essentials and expand as your workloads grow.
GPU: NVIDIA RTX 4080 Super 16GB
Ada-generation VRAM and CUDA performance keep 7B–13B models responsive.
CPU: AMD Ryzen 7 7700
Eight high-clock cores feed the GPU without bottlenecking inference threads.
RAM: 64GB DDR5-5600
Room for the model, OS, and tooling, and upgradeable to 128GB in two clicks.
Complete system
Ready to assemble with standard tools. Boots local AI workloads on day one.
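Once the build boots, a quick way to confirm the GPU is visible to your inference stack is a short PyTorch check. A minimal sketch, assuming you've installed PyTorch with CUDA support (not covered by this guide):

```python
# Day-one smoke test: confirm CUDA sees the RTX 4080 Super and its 16GB of VRAM.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU detected: {name} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA device found - check your driver installation.")
```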
Real-world throughput for popular models, plus how this build compares to our other configurations.
| Model tier | Example model | Budget | Recommended (this build) | Premium |
|---|---|---|---|---|
| Small (7B–8B) | Qwen 2.5 7B | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| | Llama 3.1 8B | ~58 tok/s | ~105 tok/s | ~142 tok/s |
| | Mistral 7B v0.2 | ~70 tok/s | ~125 tok/s | ~165 tok/s |
| Medium (13B–32B) | DeepSeek 33B (Q4; higher latency, but big gains for reasoning) | ~35 tok/s | ~62 tok/s | ~89 tok/s |
| | Llama 2 13B | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| Large (70B) | Llama 3.1 70B (Q4 required on budget builds) | ~12 tok/s | ~25 tok/s | ~45 tok/s |

Benchmark figures reflect Q4 quantization. Expect ~40% slower speeds for FP16 / full-precision runs.
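If you want to reproduce numbers like these on your own hardware, a rough tokens-per-second measurement takes a few lines with llama-cpp-python. A sketch under assumptions: the package is installed with CUDA support, and the GGUF model path is a hypothetical placeholder:

```python
# Rough tok/s benchmark with llama-cpp-python (pip install llama-cpp-python).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain quantization in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
# Timing includes prompt processing, so this slightly understates pure generation speed.
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```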
Speeds are based on Q4 quantization benchmarks. Use the table to see what runs best on this hardware.
| Model | Size | Min VRAM (Q4) | Est. speed | Context window | Best for |
|---|---|---|---|---|---|
| Llama 2 13B Chat (Meta) | 13B | 7 GB | — | 4K | General chat |
| Qwen2.5 14B Instruct (Alibaba) | 14B | 7 GB | — | 131K | General chat |
| Llama 3.1 8B Instruct (Meta) | 8B | 4 GB | — | 128K | Fast chat |
| Mixtral 8x22B Instruct v0.1 (Mistral AI) | 141B | 71 GB | — | 64K | Reasoning & agents |
| DeepSeek V3 0324 (DeepSeek) | 671B | 336 GB | — | 128K | General chat |
| Gemma 2 27B IT (Google) | 27B | 14 GB | — | 8K | General chat |
| DeepSeek R1 0528 (DeepSeek) | 671B | 336 GB | — | 128K | Reasoning & agents |
| Llama 3.1 70B Instruct (Meta) | 70B | 35 GB | — | 128K | General chat |
| Phi-3 Medium 128K Instruct (Microsoft) | 14B | 7 GB | — | 128K | General chat |
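The Min VRAM (Q4) column follows a simple rule of thumb: 4-bit weights take about half a byte per parameter, so a model needs roughly half its parameter count in gigabytes just to load. A sketch of that arithmetic (weights only; real sessions need extra headroom for the KV cache and runtime buffers):

```python
# Rule of thumb behind the Min VRAM (Q4) column: ~0.5 bytes per parameter.
def q4_min_vram_gb(params_billions: float) -> float:
    """4-bit weights ~= 0.5 GB per billion parameters (weights only;
    KV cache and runtime buffers need additional headroom on top)."""
    return params_billions * 0.5

for name, size_b in [("Llama 3.1 8B", 8), ("Qwen2.5 14B", 14),
                     ("Gemma 2 27B", 27), ("Llama 3.1 70B", 70)]:
    print(f"{name}: ~{q4_min_vram_gb(size_b):.0f} GB minimum")
```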
Daily chat: Llama 3.1 8B Instruct
Complex tasks: Mixtral 8x22B Instruct v0.1
Real-world scenarios where this hardware shines, with the model we recommend for each and what to expect for responsiveness.
Keep 13B–32B models responsive for demanding coding or research sessions.
Run Mixtral 8x22B or DeepSeek 32B Q4 for better reasoning and analysis workloads.
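Models in this range rarely fit entirely in 16GB of VRAM, so the usual approach is to offload as many layers as fit on the GPU and let the CPU handle the rest. With llama-cpp-python that's a single parameter; a sketch where the model path and layer count are placeholders you'd tune for your card:

```python
# Partial GPU offload for a model larger than VRAM (e.g. a 33B at Q4).
from llama_cpp import Llama

llm = Llama(
    model_path="models/deepseek-33b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=40,   # as many layers as fit in 16GB; lower this if you hit OOM
    n_threads=8,       # matches the Ryzen 7 7700's eight physical cores
    n_ctx=8192,
)
out = llm("Summarize the trade-offs of partial offload.", max_tokens=128)
print(out["choices"][0]["text"])
```

Each layer moved off the GPU costs speed, so start high and step `n_gpu_layers` down until the model loads cleanly.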
Spot the trade-offs between tiers and know exactly when it makes sense to step up.
| Feature | Budget | Recommended (this build) | Premium |
|---|---|---|---|
| Total cost | $1,377.95 | $3,847.94 | $8,706.94 |
| GPU | RTX 4070 Super 12GB | RTX 4080 Super 16GB | 2x RTX 4090 24GB |
| VRAM | 12GB | 16GB | 48GB (2x 24GB) |
| System memory | 32GB DDR5-5600 | 64GB DDR5-5600 | 256GB DDR4 ECC |
| 7B models | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| 13B models | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| 70B models | ~12 tok/s | ~25 tok/s | ~45 tok/s |
| Best for | Daily AI tasks, coding assistants | Power users, heavier experimentation | Production workloads, agents |
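One way to read this table is throughput per dollar. By that metric the Budget tier wins at every model size, so the reason to step up is absolute responsiveness on larger models, not value. The arithmetic, using the costs and the 7B and 70B rows above:

```python
# Tokens/sec per $1,000 for each tier, straight from the comparison table.
tiers = {
    "Budget":      {"cost": 1377.95, "tok_7b": 65,  "tok_70b": 12},
    "Recommended": {"cost": 3847.94, "tok_7b": 118, "tok_70b": 25},
    "Premium":     {"cost": 8706.94, "tok_7b": 156, "tok_70b": 45},
}

for name, t in tiers.items():
    per_k = 1000 / t["cost"]  # scale to tok/s per $1,000 for readability
    print(f"{name}: {t['tok_7b'] * per_k:.1f} tok/s per $1k on 7B, "
          f"{t['tok_70b'] * per_k:.1f} on 70B")
```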
The three questions we hear most often about this build and who it's for.
What can this build actually run? Check the compatible models table. Anything up to 13B runs smoothly; 32B+ models work at Q4 quantization, with slower responses on budget hardware.

Is it ready for production use? It's excellent for personal productivity and prototyping. For shared production workloads or enterprise SLAs, step up to the Premium build with dual RTX 4090s.

How long does assembly take? If you've built a PC before, plan ~2 hours. First time? Budget 4 hours and follow our assembly guide. All parts are standard ATX with no proprietary connectors.
Still have questions? Join our Discord or read the full documentation.
GPU: Step up to RTX 4090
24GB VRAM unlocks 70B at viable speeds and FP16 13B runs.
Memory: 96GB+ DDR5
Helpful for heavy datasets, RAG pipelines, and multi-model agents.
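Why the extra memory helps: long contexts cost real memory through the KV cache, on top of the weights themselves. A back-of-the-envelope sketch, assuming an FP16 cache and Llama 3.1 70B's published architecture (80 layers, 8 KV heads of dimension 128):

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token.
def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / 1024**3

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

At full 128K context that's roughly 40GB of cache alone, which is why RAG pipelines and multi-model agents appreciate the 96GB+ headroom.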