Serve 70B models with confidence and headroom for agents.
GPU: 2x NVIDIA RTX 4090 24GB
Ada-generation CUDA performance and 48GB of combined VRAM keep 70B-class models responsive.
CPU: AMD Threadripper PRO 5965WX
24 high-clock cores feed the GPUs without bottlenecking inference threads.
RAM: 256GB DDR4 ECC
Room for models, the OS, and tooling, with headroom for caching and partial offload of models larger than VRAM.
Complete system
Ready to assemble with standard tools. Boots local AI workloads on day one.
Real-world throughput for popular models, plus how this build compares to our other configurations.
| Model tier | Example model | Budget | Recommended | Premium (this build) |
|---|---|---|---|---|
| Small (7B–8B) | Qwen 2.5 7B | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| | Llama 3.1 8B | ~58 tok/s | ~105 tok/s | ~142 tok/s |
| | Mistral 7B v0.2 | ~70 tok/s | ~125 tok/s | ~165 tok/s |
| Medium (13B–32B) | DeepSeek 33B (Q4); higher latency but big gains for reasoning | ~35 tok/s | ~62 tok/s | ~89 tok/s |
| | Llama 3.1 13B | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| Large (70B) | Llama 3.1 70B; requires Q4 on budget builds | ~12 tok/s | ~25 tok/s | ~45 tok/s |

Benchmark figures represent Q4 quantization. Expect ~40% slower speeds for FP16 / full-precision runs.
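You can sanity-check numbers like these with any local runner. Below is a minimal sketch using llama-cpp-python to time a Q4 GGUF generation; the model path, prompt, and context size are placeholders, and real throughput depends on the runner, quantization level, and batch settings.

```python
# Minimal throughput check for a Q4-quantized GGUF model.
# Assumes llama-cpp-python built with CUDA support; the model path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the GPU(s)
    n_ctx=8192,
    verbose=False,
)

prompt = "Explain, in two sentences, why quantization speeds up inference."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```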
Every component is intentionally chosen to balance performance, thermals, and future upgrades. Start with these essentials and expand as your workloads grow.
Speeds are based on Q4 quantization benchmarks. The table below shows what runs best on this hardware.
| Model | Size | Min VRAM (Q4) | Est. speed | Context window | Best for |
|---|---|---|---|---|---|
| Llama 3.1 70B Instruct (Meta) | 70.0B | 35 GB | — | 8K | General chat |
| Mixtral 8x22B Instruct v0.1 (Mistral AI) | 176.0B | 88 GB | — | 64K | Reasoning & agents |
| DeepSeek R1 0528 (DeepSeek) | 671.0B | 336 GB | — | 128K | Reasoning & agents |
| Gemma 2 27B IT (Google) | 27.0B | 14 GB | — | 32K | General chat |
| Mistral 7B Instruct v0.2 (Mistral AI) | 7.3B | 4 GB | — | 8K | Fast chat |
| Llama 3.1 13B Instruct (Meta) | 13.0B | 7 GB | — | 8K | General chat |
| Phi 3 Medium 128k Instruct (Microsoft) | 14.0B | 7 GB | — | 128K | General chat |
| Qwen2.5 14B Instruct (Alibaba) | 14.0B | 7 GB | — | 131K | General chat |
| Llama 3.1 8B Instruct (Meta) | 8.0B | 4 GB | — | 8K | Fast chat |
| DeepSeek R1 Distill Qwen 32B (DeepSeek) | 32.0B | 16 GB | — | 128K | Reasoning & agents |
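The "Min VRAM (Q4)" column follows a simple rule of thumb: at 4-bit quantization the weights take roughly half a byte per parameter, so minimum VRAM is about the parameter count in billions times 0.5 GB, before the KV cache and runtime buffers. A minimal sketch of that estimate:

```python
# Rule-of-thumb weights-only VRAM for Q4 quantization (~0.5 bytes per parameter).
# KV cache and runtime buffers add more on top, especially at long context windows.
def min_vram_gb_q4(params_billion: float) -> float:
    return params_billion * 0.5

for name, size_b in [("Llama 3.1 8B", 8.0), ("Gemma 2 27B", 27.0), ("Llama 3.1 70B", 70.0)]:
    print(f"{name}: ~{min_vram_gb_q4(size_b):.0f} GB at Q4")
```

That is why a 70B model fits comfortably in this build's 48GB of combined VRAM, while Mixtral 8x22B's ~88 GB calls for partial offload to system memory.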
Daily chat: Mistral 7B Instruct v0.2
Complex tasks: Llama 3.1 70B Instruct
Real-world scenarios where this hardware shines. Each card includes the model we recommend and what to expect for responsiveness.
Serve Llama 70B with reliable latency and ample VRAM; larger MoE models like Mixtral 8x22B can run with partial offload to the 256GB of system memory.
Run multiple models concurrently for planning, reasoning, and execution tasks.
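A minimal sketch of that pattern, assuming two OpenAI-compatible local servers (for example vLLM or Ollama's OpenAI endpoint) keeping a 70B planner and a 7B executor loaded side by side; the ports, model names, and planner/executor split are placeholders for illustration.

```python
# Query a planner model and an executor model served on separate local endpoints.
# Ports, model names, and the planner/executor roles are assumptions, not fixed settings.
import asyncio
from openai import AsyncOpenAI

planner = AsyncOpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")
executor = AsyncOpenAI(base_url="http://localhost:8002/v1", api_key="not-needed")

async def plan_and_execute(task: str) -> str:
    plan = await planner.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": f"Break this task into numbered steps: {task}"}],
    )
    steps = plan.choices[0].message.content
    result = await executor.chat.completions.create(
        model="mistral-7b-instruct",
        messages=[{"role": "user", "content": f"Carry out these steps and report back:\n{steps}"}],
    )
    return result.choices[0].message.content

if __name__ == "__main__":
    print(asyncio.run(plan_and_execute("Summarize yesterday's server logs")))
```

Keeping both models resident avoids reload latency between planning and execution turns, which is where the extra VRAM and system memory pay off.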
Spot the trade-offs between tiers and know exactly when it makes sense to step up.
| Feature | Budget | Recommended | Premium (this build) |
|---|---|---|---|
| Total cost | $1,377.95 | $3,847.94 | $8,706.94 |
| GPU | RTX 4070 Super 12GB | NVIDIA RTX 4090 24GB | 2x NVIDIA RTX 4090 24GB |
| VRAM | 12GB | 24GB | 48GB (2 × 24GB) |
| System memory | 32GB DDR5-5600 | 128GB DDR5-5600 | 256GB DDR4 ECC |
| 7B models | ~65 tok/s | ~118 tok/s | ~156 tok/s |
| 13B models | ~28 tok/s | ~52 tok/s | ~67 tok/s |
| 70B models | ~12 tok/s | ~25 tok/s | ~45 tok/s |
| Best for | Daily AI tasks, coding assistants | Power users, heavier experimentation | Production workloads, agents |
The three questions we hear most often about this build and who it's for.
Check the compatible models table. With 48GB of combined VRAM, anything up to 70B runs smoothly in Q4 quantization; larger models need partial offload to system memory and respond more slowly.
It's excellent for personal productivity and prototyping, and this Premium configuration's dual RTX 4090s and 256GB of ECC memory also give it headroom for shared production workloads and agent pipelines.
If you've built a PC before, plan ~2 hours. First time? Budget 4 hours and follow our assembly guide. All parts are standard ATX with no proprietary connectors.
Still have questions? Join our Discord or read the full documentation.
Add more GPUs or move to multi-node
Scale to training, fine-tuning, or higher concurrency.
Expand storage to 4–8TB NVMe
Store full-precision checkpoints and larger datasets.
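As a rough sizing guide for that storage: full-precision (FP16/BF16) checkpoints take about 2 bytes per parameter, so weights alone for a 70B model are roughly 140 GB, before optimizer states or datasets. A quick sketch of the arithmetic, using the parameter counts from the models table above:

```python
# FP16/BF16 checkpoint size ≈ 2 bytes per parameter (weights only, no optimizer states).
def checkpoint_gb_fp16(params_billion: float) -> float:
    return params_billion * 2.0

for name, size_b in [("Llama 3.1 8B", 8.0), ("Llama 3.1 70B", 70.0), ("Mixtral 8x22B", 176.0)]:
    print(f"{name}: ~{checkpoint_gb_fp16(size_b):.0f} GB in FP16")
```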