Most people think 2 GPUs means 2× the performance. It doesn't. This is what actually happens when you run large language models across multiple GPUs, and when it's worth the complexity.
Multi-GPU lets you run models that won't fit on a single GPU. You split the model across 2 or more GPUs and they work together. The catch is that 2 GPUs doesn't give you 2× performance. With consumer hardware (RTX 4090, RTX 3090) connected over PCIe, expect around 1.5× speedup with 2 GPUs. Sometimes less.
This isn't marketing. These are conservative estimates based on real benchmarks from llama.cpp and vLLM users, plus the physical limitations of PCIe bandwidth. Professional setups with NVLink do better, but they cost 2-3× more. If you're reading this, you probably care about consumer hardware.
The honest truth: only use multi-GPU when your model literally won't fit on one GPU. If you're close to fitting, try a more aggressive quantization first. It's simpler.
How multiple GPUs work together for LLM inference
Tensor parallelism: split the model weights across GPUs. Each GPU handles part of every layer, and they all work at the same time.
Pipeline parallelism: split the layers across GPUs. The first GPU handles the early layers, the second handles the later layers, like an assembly line.
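To make the two strategies concrete, here is a minimal PyTorch sketch. The sizes, layer count, and device handling are illustrative only (it falls back to CPU so it runs without two GPUs); real frameworks like Megatron-LM, vLLM, and llama.cpp do this partitioning for you.

```python
import torch

# Use two GPUs if they exist, otherwise fall back to CPU so the sketch
# still runs. (Illustrative only; real frameworks manage placement.)
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

hidden = 4096
W = torch.randn(hidden, hidden)   # one layer's weight matrix
x = torch.randn(1, hidden)        # one token's activation

# Tensor parallelism: split *within* a layer. Each GPU holds half the
# columns of W and computes half the output; the halves are then gathered.
# That gather is the per-layer communication that PCIe makes expensive.
W0 = W[:, : hidden // 2].to(dev0)
W1 = W[:, hidden // 2 :].to(dev1)
y = torch.cat([(x.to(dev0) @ W0).cpu(), (x.to(dev1) @ W1).cpu()], dim=-1)

# Pipeline parallelism: split *across* layers. GPU 0 runs the first half of
# the layers, GPU 1 the second half; only the activations at the boundary
# cross the interconnect.
layers = [torch.nn.Linear(hidden, hidden) for _ in range(8)]
first, second = layers[:4], layers[4:]
h = x.to(dev0)
for layer in first:
    h = layer.to(dev0)(h)
h = h.to(dev1)                    # the single inter-GPU hop per forward pass
for layer in second:
    h = layer.to(dev1)(h)
```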
Sources: Shoeybi et al., "Megatron-LM" (2019); Huang et al., "GPipe" (2019); DeepSpeed documentation; llama.cpp community benchmarks.
| Technology | Bandwidth | Availability | Use Case |
|---|---|---|---|
| NVLink 4.0 | 900 GB/s per GPU | H100 | Datacenter tensor parallelism |
| NVLink 3.0 | 600 GB/s per GPU (A100); 112.5 GB/s via bridge (RTX A6000) | A100, RTX A6000 | Datacenter and workstation multi-GPU |
| NVSwitch | 900 GB/s any-to-any | DGX/HGX H100 systems | Large-scale clusters |
| PCIe 4.0 x16 | ~32 GB/s per direction | RTX 40-series consumer | Consumer pipeline parallelism |
| PCIe 5.0 x16 | ~64 GB/s per direction | Limited support (2025) | Future consumer setups |
Consumer GPU Reality
Consumer RTX 40-series GPUs (4090, 4080, 4070 Ti) don't have NVLink. They only have PCIe, which is roughly 28× slower than NVLink for GPU-to-GPU communication. This is why multi-GPU doesn't scale linearly on consumer hardware: the bottleneck isn't the GPUs, it's the connection between them.
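Where that 28× comes from, as a quick sanity check (bandwidth figures from the table above, treated as ballpark):

```python
# Ratio of interconnect bandwidths from the table above (ballpark figures).
nvlink_gb_s = 900        # NVLink 4.0, per GPU
pcie_gb_s = 32           # PCIe 4.0 x16, per direction
vram_gb_s = 1008         # RTX 4090 local memory bandwidth, for comparison

print(f"PCIe is ~{nvlink_gb_s / pcie_gb_s:.0f}x slower than NVLink")          # ~28x
print(f"...and ~{vram_gb_s / pcie_gb_s:.0f}x slower than the GPU's own VRAM")  # ~32x
```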
Sources: NVIDIA NVLink Technical Blog (developer.nvidia.com); PCIe specifications (pcisig.com).
Understanding effective vs theoretical VRAM capacity
Simple math. Two 24GB GPUs, you get 48GB total. Right?
Not quite. You lose about 13% to overhead: framework buffers, CUDA contexts, and inter-GPU communication buffers.
| Interconnect | Efficiency | Overhead | Example (2×24GB) |
|---|---|---|---|
| NVLink | 92% | 8% | 44GB effective |
| PCIe 4.0 | 87% | 13% | 41GB effective |
| PCIe 3.0 | 85% | 15% | 40GB effective |
Sources: These efficiency numbers come from 50+ real user reports on r/LocalLLaMA and llama.cpp GitHub. The overhead breakdown is from PyTorch memory profiling and framework docs.
We use 87% efficiency (13% overhead) for consumer PCIe setups. This is conservative. Real numbers vary ±3% depending on your software stack and configuration.
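If you want to run the math for your own setup, it's one line. The 0.87 factor is the conservative PCIe figure above; 0.92 is the NVLink one.

```python
def effective_vram_gb(gb_per_gpu: float, n_gpus: int, efficiency: float = 0.87) -> float:
    """Usable pooled VRAM after framework buffers, CUDA contexts, and
    communication overhead. 0.87 is the conservative PCIe figure used here;
    ~0.92 is typical for NVLink-class setups."""
    return gb_per_gpu * n_gpus * efficiency

print(effective_vram_gb(24, 2))        # 2x 24GB over PCIe  -> ~41.8 GB usable
print(effective_vram_gb(24, 2, 0.92))  # same cards, NVLink -> ~44.2 GB usable
```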
Why 2 GPUs ≠ 2× performance
Reality Check
2 GPUs don't give you 2× performance. Expect 1.4-1.6× speedup on consumer hardware. If someone tells you they get 2× speedup, they either have professional gear with NVLink, or they're not measuring accurately.
| GPU Count | Theoretical | llama.cpp (PCIe) | vLLM (PCIe) | Efficiency |
|---|---|---|---|---|
| 1 | 1.0× | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.45× | 1.60× | 73-80% |
| 3 | 3.0× | 1.85× | 2.05× | 62-68% |
| 4 | 4.0× | 2.10× | 2.45× | 53-61% |
Conservative estimate: 2 GPUs on PCIe = 1.5× speedup (average of llama.cpp and vLLM)
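A rough way to plan throughput: multiply your single-GPU tokens/sec by the measured scaling factor, not by the GPU count. The factors below are the rounded averages of the llama.cpp and vLLM columns above, so treat them as ballpark.

```python
# Rounded averages of the llama.cpp and vLLM PCIe scaling columns above.
PCIE_SCALING = {1: 1.0, 2: 1.5, 3: 1.95, 4: 2.3}

def estimated_tps(single_gpu_tps: float, n_gpus: int) -> float:
    """Ballpark multi-GPU throughput on consumer PCIe hardware."""
    return single_gpu_tps * PCIE_SCALING[n_gpus]

print(estimated_tps(30, 2))   # a 30 t/s model -> ~45 t/s on two GPUs, not 60
```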
With NVLink (datacenter-class GPUs), the same vLLM scaling looks like this:

| GPU Count | Theoretical | vLLM (NVLink) | Efficiency |
|---|---|---|---|
| 1 | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.75× | 88% |
| 4 | 4.0× | 3.10× | 78% |
| 8 | 8.0× | 5.80× | 73% |
NVLink performs significantly better than PCIe but still shows sublinear scaling due to synchronization overhead.
Memory bandwidth doesn't multiply. Each GPU still loads its own slice of the weights from its own VRAM at the same speed. More GPUs means more total capacity, but each GPU stays individually bandwidth-limited.
Interconnect bottleneck. The GPUs have to talk to each other after every layer. PCIe gives you 32 GB/s for this, while each GPU has 1,000+ GB/s of internal memory bandwidth. See the problem?
Pipeline bubbles. When you split layers across GPUs, the first GPU finishes its layers and then sits idle while the second GPU works. This happens on every forward pass; see the toy model below.
Framework overhead. PyTorch, CUDA, llama.cpp, vLLM: they all need to coordinate multiple GPUs, and that coordination takes time. A single GPU needs none of it.
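Here is the pipeline bubble as a toy model. The millisecond values are invented for illustration; the point is the idle fraction.

```python
# Toy model of the pipeline bubble: layers split 50/50 across two GPUs,
# one request being decoded. Timings are made up for illustration.
half_layers_ms = 10.0   # time for one GPU's half of the layers
pcie_hop_ms = 0.5       # activation transfer between the halves

token_latency_ms = half_layers_ms + pcie_hop_ms + half_layers_ms
busy_fraction = half_layers_ms / token_latency_ms

print(f"Each GPU is busy only {busy_fraction:.0%} of the time")   # ~49%
print("Two GPUs' worth of silicon, roughly one GPU's worth of compute per token")
```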
Sources: These speedup numbers come from llama.cpp GitHub discussions, vLLM benchmarks, and r/LocalLLaMA user reports. PCIe bandwidth limits are from the official PCI-SIG specs.
Your mileage may vary ±20-30%. Depends on your model, quantization, context length, and how well your drivers are configured. These numbers assume typical single-user inference.
Decision matrix and cost-benefit analysis
| Option | Cost | VRAM | Perf | Complexity | Verdict |
|---|---|---|---|---|---|
| 2× RTX 4090 | $3,200 | 41GB | ~45 t/s | High | Best value if technical |
| RTX 6000 Ada | $6,800 | 48GB | ~50 t/s | Low | Simplest, 2× cost |
| 2× RTX 3090 (used) | $1,400 | 40GB | ~38 t/s | High | Best $/perf, risky |
| A6000 (used) | $4,500 | 48GB | ~42 t/s | Low | Middle ground |
These performance numbers assume llama.cpp; vLLM is 10-15% faster. Used GPUs have no warranty. Cost per token/s of throughput: 2× 3090 ($37) beats 2× 4090 ($71) beats A6000 ($107) beats 6000 Ada ($136).
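The cost-per-throughput ranking is simple division. Here it is spelled out so you can plug in current prices; the price and t/s figures are the estimates from the table above.

```python
# Price and tokens/sec estimates from the decision matrix above.
options = {
    "2x RTX 3090 (used)": (1400, 38),
    "2x RTX 4090":        (3200, 45),
    "A6000 (used)":       (4500, 42),
    "RTX 6000 Ada":       (6800, 50),
}
for name, (price, tps) in sorted(options.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:<20} ${price / tps:>4.0f} per token/s")
# 2x 3090 ~$37, 2x 4090 ~$71, A6000 ~$107, 6000 Ada ~$136
```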
Llama 3 8B, Mistral 7B, Llama 2 13B (all Q4/Q8)
→ Single GPU sufficient (RTX 4090, RTX 3090)
Llama 2 70B Q4 (~38GB), Qwen 72B Q4 (~42GB)
→ 2× RTX 4090 feasible (41GB effective, PCIe)
Mixtral 8×7B Q8 (~55GB), Llama 2 70B Q8 (~60GB)
→ 3× RTX 4090 or workstation GPUs (A6000, 6000 Ada)
Llama 2 70B FP16 (~140GB), Mixtral 8×22B FP16 (~280GB)
→ Datacenter GPUs (H100, A100) or cloud APIs only
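The tiers above boil down to one calculation: quantized weight size plus some KV-cache headroom, divided by effective VRAM per GPU. A sketch; the 3 GB KV-cache allowance is an assumption you should tune for your context length.

```python
import math

def gpus_needed(weights_gb: float, gb_per_gpu: float = 24,
                efficiency: float = 0.87, kv_cache_gb: float = 3.0) -> int:
    """How many consumer 24GB GPUs a model needs, using the 87% PCIe
    efficiency figure from above. The 3 GB KV-cache allowance is an
    assumption; long contexts need considerably more."""
    return math.ceil((weights_gb + kv_cache_gb) / (gb_per_gpu * efficiency))

print(gpus_needed(38))    # Llama 2 70B Q4   -> 2 GPUs
print(gpus_needed(55))    # Mixtral 8x7B Q8  -> 3 GPUs
print(gpus_needed(140))   # Llama 2 70B FP16 -> 7 GPUs: datacenter territory
```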
What actually happens with 2× RTX 4090 on PCIe
The PCIe Problem
llama.cpp: pipeline (layer-split) parallelism via the --n-gpu-layers flag.
vLLM: tensor parallelism via the --tensor-parallel-size flag.
Most other frontends use llama.cpp or transformers as the backend, so the same limits apply.
You'll also be watching both GPUs with nvidia-smi and tuning --n-gpu-layers per model. Expect to spend 1-2 weeks getting this working the first time. PCIe lane conflicts, driver errors, uneven GPU usage: these are common. A single GPU just works.
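For what it's worth, the vLLM side is short in code; the time goes into drivers and hardware, not the script. A minimal sketch, with the model ID as a placeholder (pick something that fits in your ~41GB effective):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="<your-model-id>",        # placeholder: use a quantized model that fits
    tensor_parallel_size=2,         # split every layer across both GPUs
    gpu_memory_utilization=0.90,    # leave headroom for CUDA contexts and buffers
)
outputs = llm.generate(
    ["Explain the difference between PCIe and NVLink in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```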
Research papers, community benchmarks, and hardware specifications
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). ArXiv preprint arXiv:1909.08053. arxiv.org/abs/1909.08053
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Huang, Y., et al. (2019). ArXiv preprint arXiv:1811.06965. arxiv.org/abs/1811.06965
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, T., et al. (2023). Stanford University. arxiv.org/abs/2307.08691
NVIDIA NVLink 4.0 Multi-GPU System Scalability
NVIDIA Developer Blog. developer.nvidia.com/blog
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
vLLM Project Documentation. docs.vllm.ai
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G., et al. (2023-2025). github.com/ggerganov/llama.cpp
Microsoft DeepSpeed: Pipeline Parallelism
DeepSpeed Documentation. deepspeed.ai/tutorials/pipeline
llama.cpp GitHub Discussions
Multi-GPU performance reports from users with 2-4× RTX 4090/3090 setups. github.com/ggerganov/llama.cpp/discussions
r/LocalLLaMA Community Reports
Real-world multi-GPU setups, cost analysis, and troubleshooting guides. reddit.com/r/LocalLLaMA
PCI-SIG Specifications
PCIe 4.0 and 5.0 bandwidth specifications. pcisig.com/specifications
NVIDIA GPU Datasheets
RTX 4090, RTX 6000 Ada, A100, H100 technical specifications. nvidia.com/datasheets
What we don't know and what can vary
You Might Do Better:
You Might Do Worse:
Multi-GPU setup isn't for beginners:
Only try this if you're comfortable with command line tools and fixing broken configs. For most people, one good GPU is simpler and less painful.
Last updated: November 29, 2025