Multi-GPU Inference Methodology
Most people think 2 GPUs means 2× the performance. It doesn't. This is what actually happens when you run large language models across multiple GPUs, and when it's worth the complexity.
Overview
Multi-GPU lets you run models that won't fit on a single GPU. You split the model across 2 or more GPUs and they work together. The catch is that 2 GPUs doesn't give you 2× performance. With consumer hardware (RTX 4090, RTX 3090) connected over PCIe, expect around 1.5× speedup with 2 GPUs. Sometimes less.
This isn't marketing. These are conservative estimates based on real benchmarks from llama.cpp and vLLM users, plus the physical limitations of PCIe bandwidth. Professional setups with NVLink do better, but they cost 2-3× more. If you're reading this, you probably care about consumer hardware.
The honest truth: only use multi-GPU when your model literally won't fit on one GPU. If you're close to fitting, try a more aggressive quantization first. It's simpler.
Multi-GPU Fundamentals
How multiple GPUs work together for LLM inference
Parallelism Strategies
Tensor Parallelism
Split the model weights across GPUs. Each GPU handles part of each layer, all at the same time.
- Needs high-bandwidth interconnect (NVLink at 600-900 GB/s)
- Works well on professional GPUs with NVLink
- Consumer reality: PCIe 4.0 x16 is only about 32 GB/s, up to 28× slower than NVLink
- Expect roughly 1.75× speedup on 2 GPUs with vLLM and NVLink (closer to 1.5-1.7× over PCIe)
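Here's a minimal sketch of the idea, simulated with NumPy instead of real GPUs. The layer size and the even two-way split are made-up illustration values; the point is that each device holds only part of the weight matrix, and the partial results have to be gathered over the interconnect after (almost) every layer.

```python
# Tensor parallelism in miniature: one linear layer's weight matrix is split
# column-wise across two "GPUs" (simulated with NumPy arrays). Each device
# computes a partial output; concatenating them reproduces the single-device
# result. Shapes here are toy values, not from any real model.
import numpy as np

hidden = 8
x = np.random.randn(1, hidden)          # one token's activation
W = np.random.randn(hidden, hidden)     # full weight matrix

# Column split: each GPU holds half of the output features.
W_gpu0, W_gpu1 = np.split(W, 2, axis=1)

y_gpu0 = x @ W_gpu0                     # computed on GPU 0
y_gpu1 = x @ W_gpu1                     # computed on GPU 1

# The gather step is where the NVLink/PCIe traffic happens in a real system.
y_parallel = np.concatenate([y_gpu0, y_gpu1], axis=1)

assert np.allclose(y_parallel, x @ W)   # same result as a single GPU
```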
Pipeline Parallelism
Split layers across GPUs. First GPU handles early layers, second GPU handles later layers, like an assembly line.
- Works on PCIe, doesn't need fancy interconnects
- This is what you get with consumer GPUs (RTX 4090, RTX 3090)
- Trade-off: sequential processing means higher latency
- Expect 1.4-1.5× speedup on 2 GPUs with llama.cpp
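A minimal sketch of what the layer split looks like, with an illustrative 32-layer model and an even two-way split (llama.cpp lets you bias the ratio with --tensor-split):

```python
# Pipeline (layer) splitting in miniature: a 32-layer model divided into
# contiguous chunks, one chunk per GPU. Layer count and the even split are
# illustrative values only.
n_layers, n_gpus = 32, 2

assignment = {
    gpu: list(range(gpu * n_layers // n_gpus, (gpu + 1) * n_layers // n_gpus))
    for gpu in range(n_gpus)
}

print(assignment[0])   # layers 0-15 live on GPU 0
print(assignment[1])   # layers 16-31 live on GPU 1
# For each token, GPU 1 has to wait for GPU 0's activations before it can
# start -- the assembly-line behaviour (and idle time) described above.
```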
Sources: Shoeybi et al., "Megatron-LM" (2019); Huang et al., "GPipe" (2019); DeepSpeed documentation; llama.cpp community benchmarks.
Interconnect Technologies
| Technology | Bandwidth | Availability | Use Case |
|---|---|---|---|
| NVLink 4.0 | 900 GB/s per GPU | H100 (Hopper) | Datacenter tensor parallelism |
| NVLink 3.0 | 600 GB/s per GPU | A100, RTX A6000 | Datacenter and workstation multi-GPU |
| NVSwitch | 900 GB/s any-to-any | H100, DGX systems | Large-scale clusters |
| PCIe 4.0 x16 | 32 GB/s (per direction) | RTX 40-series consumer | Consumer pipeline parallelism |
| PCIe 5.0 x16 | 64 GB/s (per direction) | Limited support (2025) | Future consumer setups |
Consumer GPU Reality
Consumer RTX 40-series GPUs (4090, 4080, 4070 Ti) don't have NVLink. They only have PCIe, which is up to 28× slower for GPU-to-GPU communication. This is why multi-GPU doesn't scale linearly on consumer hardware: the bottleneck isn't the GPUs, it's the connection between them.
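The back-of-the-envelope ratios behind that claim, using approximate datasheet numbers (RTX 4090 memory bandwidth, PCIe 4.0 x16 per-direction bandwidth, top-end NVLink):

```python
# Rough ratio sketch of why the interconnect, not the GPUs, is the bottleneck.
# Numbers are approximate spec-sheet values, not measurements.
gpu_mem_bw_gbs = 1008   # RTX 4090 VRAM bandwidth, GB/s
pcie4_x16_gbs = 32      # PCIe 4.0 x16, per direction
nvlink_gbs = 900        # top-end NVLink, per GPU

print(f"VRAM vs PCIe link:   {gpu_mem_bw_gbs / pcie4_x16_gbs:.0f}x")   # ~32x
print(f"NVLink vs PCIe link: {nvlink_gbs / pcie4_x16_gbs:.0f}x")       # ~28x
```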
Sources: NVIDIA NVLink Technical Blog (developer.nvidia.com); PCIe specifications (pcisig.com).
VRAM Pooling Calculations
Understanding effective vs theoretical VRAM capacity
Theoretical vs Effective VRAM
What People Think
Simple math. Two 24GB GPUs, you get 48GB total. Right?
What Actually Happens
You lose roughly 8-15% to overhead, depending on the interconnect: framework buffers, per-GPU CUDA contexts, and communication buffers. On a typical consumer PCIe 4.0 setup, figure about 13%.
Formula
Effective VRAM = (number of GPUs × VRAM per GPU) × efficiency factor. For 2× 24GB over PCIe 4.0: 2 × 24GB × 0.87 ≈ 41GB usable.
Efficiency Factors
| Interconnect | Efficiency | Overhead | Example (2×24GB) |
|---|---|---|---|
| NVLink | 92% | 8% | 44GB effective |
| PCIe 4.0 | 87% | 13% | 41GB effective |
| PCIe 3.0 | 85% | 15% | 40GB effective |
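The same table as a quick calculation, if you want to plug in your own GPU count or VRAM size. The efficiency factors are the ones above; results are rounded down the way the table is:

```python
# Effective VRAM = pooled VRAM x interconnect-dependent efficiency factor.
def effective_vram_gb(gpus: int, vram_per_gpu_gb: float, efficiency: float) -> float:
    return gpus * vram_per_gpu_gb * efficiency

EFFICIENCY = {"NVLink": 0.92, "PCIe 4.0": 0.87, "PCIe 3.0": 0.85}

for interconnect, eff in EFFICIENCY.items():
    usable = int(effective_vram_gb(2, 24, eff))   # round down, as in the table
    print(f"2 x 24GB over {interconnect}: ~{usable} GB usable")
# NVLink -> 44 GB, PCIe 4.0 -> 41 GB, PCIe 3.0 -> 40 GB
```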
Overhead Sources
- Framework buffers (activation and scratch memory each GPU keeps resident)
- A CUDA context on every GPU
- Communication and synchronization buffers for GPU-to-GPU transfers
Sources: These efficiency numbers come from 50+ real user reports on r/LocalLLaMA and llama.cpp GitHub. The overhead breakdown is from PyTorch memory profiling and framework docs.
We use 87% efficiency (13% overhead) for consumer PCIe setups. This is conservative. Real numbers vary ±3% depending on your software stack and configuration.
Performance Scaling
Why 2 GPUs ≠ 2× performance
Reality Check
2 GPUs don't give you 2× performance. Expect 1.4-1.6× speedup on consumer hardware. If someone tells you they get 2× speedup, they either have professional gear with NVLink, or they're not measuring accurately.
Actual Speedup Factors
PCIe 4.0 (Consumer GPUs)
| GPU Count | Theoretical | llama.cpp | vLLM | Efficiency |
|---|---|---|---|---|
| 1 | 1.0× | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.45× | 1.60× | 73-80% |
| 3 | 3.0× | 1.85× | 2.05× | 62-68% |
| 4 | 4.0× | 2.10× | 2.45× | 53-61% |
Conservative estimate: 2 GPUs on PCIe = 1.5× speedup (average of llama.cpp and vLLM)
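If you want to turn these factors into a rough throughput estimate, here's a small sketch. The single-GPU tokens/sec input is whatever you measure on your own hardware; the scaling factors are the community-derived numbers from the table, not guarantees:

```python
# Rough multi-GPU throughput estimate from the PCIe scaling table above.
SPEEDUP = {
    "llama.cpp": {1: 1.0, 2: 1.45, 3: 1.85, 4: 2.10},
    "vllm":      {1: 1.0, 2: 1.60, 3: 2.05, 4: 2.45},
}

def estimate_tps(single_gpu_tps: float, n_gpus: int, backend: str) -> float:
    return single_gpu_tps * SPEEDUP[backend][n_gpus]

# Example with an assumed 30 t/s single-GPU baseline.
print(estimate_tps(30, 2, "llama.cpp"))   # ~43.5 t/s
print(estimate_tps(30, 2, "vllm"))        # ~48 t/s
```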
NVLink (Professional GPUs)
| GPU Count | Theoretical | vLLM (Actual) | Efficiency |
|---|---|---|---|
| 1 | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.75× | 88% |
| 4 | 4.0× | 3.10× | 78% |
| 8 | 8.0× | 5.80× | 73% |
NVLink performs significantly better than PCIe but still shows sublinear scaling due to synchronization overhead.
Why Sublinear Scaling?
1. Memory Bandwidth Bottleneck
Each GPU still has to load weights from its own VRAM at the same speed. Adding more GPUs gives you more total capacity, but each GPU is still individually bandwidth-limited.
2. Synchronization Overhead
The GPUs have to talk to each other after every layer. PCIe gives you 32 GB/s for this. Each GPU has 1,000+ GB/s of internal memory bandwidth. See the problem?
3. Pipeline Bubbles
When you split layers across GPUs, the first GPU finishes its layers while the second GPU is still working. The first GPU sits idle. This happens in every forward pass (a toy model of the bubble is sketched after this list).
4. Framework Overhead
PyTorch, CUDA, llama.cpp, vLLM, they all need to coordinate multiple GPUs. This coordination takes time. A single GPU doesn't need any of this.
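For point 3, the classic GPipe bubble formula gives a feel for how much idle time to expect: with p pipeline stages and m requests (or micro-batches) in flight, roughly (p - 1) / (m + p - 1) of the time is spent waiting. Real frameworks overlap work more aggressively, so treat this as intuition, not a prediction:

```python
# Pipeline bubble fraction, per the GPipe paper's analysis.
def bubble_fraction(stages: int, in_flight: int) -> float:
    # (p - 1) / (m + p - 1): rough share of time a stage spends waiting
    return (stages - 1) / (in_flight + stages - 1)

print(bubble_fraction(stages=2, in_flight=1))   # 0.50 -> one request, 2 GPUs: ~half idle
print(bubble_fraction(stages=2, in_flight=8))   # ~0.11 -> batching hides most of the bubble
print(bubble_fraction(stages=4, in_flight=1))   # 0.75 -> more stages hurt a single user
```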
Sources: These speedup numbers come from llama.cpp GitHub discussions, vLLM benchmarks, and r/LocalLLaMA user reports. PCIe bandwidth limits are from the official PCI-SIG specs.
Your mileage may vary ±20-30% depending on your model, quantization, context length, and how well your drivers are configured. These numbers assume typical single-user inference.
When Multi-GPU Makes Sense
Decision matrix and cost-benefit analysis
Decision Criteria
✓ Good Fit for Multi-GPU
- Your model needs more than 24GB (Llama 2 70B Q4 at 38GB, Mixtral 8×7B Q8 at 55GB)
- You're okay with BIOS configuration and troubleshooting drivers
- $6,800 for a single RTX 6000 Ada is more than you're willing to spend
- You already have one high-VRAM GPU and want to add another
✗ Bad Fit for Multi-GPU
- Your model fits on one GPU (Llama 3 8B at 6GB, Llama 2 13B at 10GB)
- This is your first time running LLMs locally (too complex)
- You value simplicity over saving money
- You're at 22-23GB on a 24GB GPU (just use more aggressive quantization)
Cost-Benefit Analysis: 30GB Model (Q4)
| Option | Cost | VRAM | Perf | Complexity | Verdict |
|---|---|---|---|---|---|
| 2× RTX 4090 | $3,200 | 41GB | ~45 t/s | High | Best value if technical |
| RTX 6000 Ada | $6,800 | 48GB | ~50 t/s | Low | Simplest, 2× cost |
| 2× RTX 3090 (used) | $1,400 | 40GB | ~38 t/s | High | Best $/perf, risky |
| A6000 (used) | $4,500 | 48GB | ~42 t/s | Low | Middle ground |
These performance numbers assume llama.cpp. vLLM is 10-15% faster. Used GPUs have no warranty. Cost per tokens/sec: 2×3090 ($37) beats 2×4090 ($71) beats A6000 ($107) beats 6000 Ada ($136).
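The cost-per-throughput ranking, computed directly from the price and tokens/sec columns above:

```python
# Cost per token/sec of sustained throughput, from the table's listed prices
# and llama.cpp throughput estimates.
options = {
    "2x RTX 3090 (used)": (1400, 38),
    "2x RTX 4090":        (3200, 45),
    "A6000 (used)":       (4500, 42),
    "RTX 6000 Ada":       (6800, 50),
}

for name, (price_usd, tps) in sorted(options.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:20s} ~${price_usd / tps:.0f} per token/sec")
# 2x3090 ~$37, 2x4090 ~$71, A6000 ~$107, 6000 Ada ~$136
```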
Recommendations by Model Size
<24GB Models
Llama 3 8B, Mistral 7B, Llama 2 13B (all Q4/Q8)
→ Single GPU sufficient (RTX 4090, RTX 3090)
24-48GB Models
Llama 2 70B Q4 (~38GB), Qwen 72B Q4 (~42GB)
→ 2× RTX 4090 feasible (41GB effective, PCIe)
48-72GB Models
Mixtral 8×7B Q8 (~55GB), Llama 2 70B Q8 (~60GB)
→ 3× RTX 4090 or workstation GPUs (A6000, 6000 Ada)
>72GB Models
Llama 2 70B FP16 (~140GB), Mixtral 8×22B FP16 (~280GB)
→ Datacenter GPUs (H100, A100) or cloud APIs only
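A rough way to turn a quantized model size into a GPU count, using the 87% PCIe pooling efficiency from earlier plus a small allowance for KV cache (the 2GB headroom figure is our assumption, not a measured value):

```python
# How many 24GB GPUs does a quantized model need, roughly?
import math

def gpus_needed(model_gb: float, vram_per_gpu_gb: float = 24,
                efficiency: float = 0.87, kv_cache_headroom_gb: float = 2) -> int:
    # Pooling overhead doesn't really apply to a single GPU, so the result is
    # slightly conservative for models that already fit on one card.
    return math.ceil((model_gb + kv_cache_headroom_gb) / (vram_per_gpu_gb * efficiency))

print(gpus_needed(38))   # Llama 2 70B Q4   -> 2
print(gpus_needed(55))   # Mixtral 8x7B Q8  -> 3
print(gpus_needed(10))   # Llama 2 13B Q4   -> 1
```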
Consumer GPU Multi-GPU Reality
What actually happens with 2× RTX 4090 on PCIe
The PCIe Problem
- Consumer RTX 40-series cards have no NVLink, only PCIe
- PCIe 4.0 x16 gives you 32 GB/s, shared across everything
- With 2 GPUs, a display output, and an NVMe drive, each GPU gets maybe 20-25 GB/s
- NVLink on professional and datacenter cards is 600-900 GB/s, up to 28× faster
- This is why consumer multi-GPU doesn't scale well
Software Support
llama.cpp (Easiest)
Layer (pipeline-style) splitting: --n-gpu-layers controls how many layers are offloaded to the GPUs, and --tensor-split sets the ratio between cards.
- Pros: Simple CLI, widely used, stable
- Cons: Pipeline only (not tensor parallel), moderate performance
- Expected: 1.4-1.5× speedup on 2× RTX 4090
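A minimal sketch using the llama-cpp-python bindings, which expose the same options as the CLI flags. The model path and the even 50/50 split are placeholders; adjust them for your cards and model:

```python
# Layer splitting across two GPUs via llama-cpp-python (placeholder paths/values).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload every layer to GPU
    tensor_split=[0.5, 0.5],    # roughly even split across GPU 0 and GPU 1
    n_ctx=4096,
)

out = llm("Q: Why doesn't 2 GPUs mean 2x speed? A:", max_tokens=64)
print(out["choices"][0]["text"])
```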
vLLM (Best Performance)
Tensor parallelism via --tensor-parallel-size flag.
- Pros: Tensor parallel, PagedAttention, best throughput
- Cons: More complex setup (API server), Python dependencies
- Expected: 1.5-1.7× speedup on 2× RTX 4090
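A minimal sketch using vLLM's Python API; tensor_parallel_size is the same setting the --tensor-parallel-size server flag controls. The quantized checkpoint name is a placeholder (an FP16 70B won't fit in 2×24GB, so you'd point this at an AWQ or GPTQ build):

```python
# Tensor parallelism across two GPUs with vLLM (placeholder model name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",   # placeholder quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,                  # split the model across both GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```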
text-generation-webui
Uses llama.cpp or transformers backend.
- Pros: User-friendly web UI
- Cons: Performance depends on backend, less control
- Expected: Similar to llama.cpp (1.4-1.5×)
Setup Complexity
BIOS Configuration
- Enable PCIe bifurcation (split x16 lanes)
- Enable "Above 4G Decoding"
- Set Resizable BAR (if supported)
- Verify PCIe lane allocation per slot
Driver & Software
- Install CUDA toolkit (match PyTorch version)
- Configure CUDA_VISIBLE_DEVICES
- Test with nvidia-smi (and the Python check below)
- Tune --n-gpu-layers per model
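Before blaming the inference framework, it's worth confirming that both GPUs are actually visible to CUDA. A quick PyTorch check (run with CUDA_VISIBLE_DEVICES unset, or set to "0,1"):

```python
# Sanity check: are both GPUs visible, with the VRAM you expect?
import torch

print(torch.cuda.device_count())          # expect 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.0f} GB")
```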
Power & Cooling
- PSU: 1000W+ for 2× RTX 4090 (900W GPU + 100W system)
- Case airflow: Both GPUs need adequate cooling
- PCIe slot spacing: Minimum 3-slot gap between GPUs
- Thermal throttling: Monitor GPU temps under load
Expect to spend 1-2 weeks getting this working the first time. PCIe lane conflicts, driver errors, and uneven GPU utilization are all common. A single GPU just works.
Data Sources & References
Research papers, community benchmarks, and hardware specifications
Academic Papers
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). ArXiv preprint arXiv:1909.08053. arxiv.org/abs/1909.08053
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Huang, Y., et al. (2019). ArXiv preprint arXiv:1811.06965. arxiv.org/abs/1811.06965
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, T. (2023). ArXiv preprint arXiv:2307.08691. arxiv.org/abs/2307.08691
Industry Documentation
NVIDIA NVLink 4.0 Multi-GPU System Scalability
NVIDIA Developer Blog. developer.nvidia.com/blog
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
vLLM Project Documentation. docs.vllm.ai
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G., et al. (2023-2025). github.com/ggerganov/llama.cpp
Microsoft DeepSpeed: Pipeline Parallelism
DeepSpeed Documentation. deepspeed.ai/tutorials/pipeline
Community Benchmarks
llama.cpp GitHub Discussions
Multi-GPU performance reports from users with 2-4× RTX 4090/3090 setups. github.com/ggerganov/llama.cpp/discussions
r/LocalLLaMA Community Reports
Real-world multi-GPU setups, cost analysis, and troubleshooting guides. reddit.com/r/LocalLLaMA
Hardware Specifications
PCI-SIG Specifications
PCIe 4.0 and 5.0 bandwidth specifications. pcisig.com/specifications
NVIDIA GPU Datasheets
RTX 4090, RTX 6000 Ada, A100, H100 technical specifications. nvidia.com/datasheets
Limitations & Caveats
What we don't know and what can vary
These Estimates Are Conservative
- Assumes consumer PCIe setups (RTX 4090, RTX 3090), which most people use
- Baseline llama.cpp performance, not optimized vLLM or TensorRT
- Single user, batch size of 1
- Normal context length (2-4K tokens)
- Doesn't account for FlashAttention-2 or PagedAttention speedups
Real Performance Varies ±30-50%
You Might Do Better:
- vLLM with tensor parallelism
- Professional cards with NVLink
- Batch processing multiple requests
- FlashAttention-2 compiled in
You Might Do Worse:
- Long context (more than 8K tokens)
- Misconfigured drivers or CUDA
- GPU thermal throttling
- Other PCIe devices competing for bandwidth
We Can't Test Everything
- Don't have every GPU combination (a pair of RTX 6000 Adas costs $13,600)
- Most community benchmarks don't document their exact setup
- Software changes fast (llama.cpp and vLLM ship updates weekly)
- Your specific combination of hardware, drivers, and model might behave differently
Complexity Warning
Multi-GPU setup isn't for beginners:
- BIOS configuration (PCIe bifurcation, above 4G decoding)
- Getting the right CUDA toolkit version for your PyTorch
- Learning framework flags (--n-gpu-layers, --tensor-parallel-size)
- Debugging why your GPUs won't talk to each other
- Making sure your PSU and cooling can handle 900W of GPUs
Only try this if you're comfortable with command line tools and fixing broken configs. For most people, one good GPU is simpler and less painful.
Last updated: November 29, 2025
Questions about multi-GPU inference? Contact us or see our main methodology page