Multi-GPU Inference Methodology
Most people think 2 GPUs means 2× the performance. It doesn't. This is what actually happens when you run large language models across multiple GPUs, and when it's worth the complexity.
Overview
Multi-GPU lets you run models that won't fit on a single GPU. You split the model across 2 or more GPUs and they work together. The catch is that 2 GPUs doesn't give you 2× performance. With consumer hardware (RTX 4090, RTX 3090) connected over PCIe, expect around 1.5× speedup with 2 GPUs. Sometimes less.
This isn't marketing. These are conservative estimates based on real benchmarks from llama.cpp and vLLM users, plus the physical limitations of PCIe bandwidth. Professional setups with NVLink do better, but they cost 2-3× more. If you're reading this, you probably care about consumer hardware.
The honest truth: only use multi-GPU when your model literally won't fit on one GPU. If you're close to fitting, try a more aggressive quantization first. It's simpler.
Multi-GPU Fundamentals
How multiple GPUs work together for LLM inference
Parallelism Strategies
Tensor Parallelism
Split the model weights across GPUs. Each GPU handles part of each layer, all at the same time.
- Needs high-bandwidth interconnect (NVLink at 600-900 GB/s)
- Works well on professional GPUs with NVLink
- Consumer reality: PCIe 4.0 x16 is only about 32 GB/s, up to 28× slower than NVLink
- Expect roughly 1.75× speedup on 2 GPUs with vLLM and NVLink (closer to 1.5-1.7× over PCIe)
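Here's a minimal sketch of the idea, simulated with NumPy instead of real GPUs. The layer size and the even two-way split are made-up illustration values; the point is that each device holds only part of the weight matrix, and the partial results have to be gathered over the interconnect after (almost) every layer.

```python
# Tensor parallelism in miniature: one linear layer's weight matrix is split
# column-wise across two "GPUs" (simulated with NumPy arrays). Each device
# computes a partial output; concatenating them reproduces the single-device
# result. Shapes here are toy values, not from any real model.
import numpy as np

hidden = 8
x = np.random.randn(1, hidden)          # one token's activation
W = np.random.randn(hidden, hidden)     # full weight matrix

# Column split: each GPU holds half of the output features.
W_gpu0, W_gpu1 = np.split(W, 2, axis=1)

y_gpu0 = x @ W_gpu0                     # computed on GPU 0
y_gpu1 = x @ W_gpu1                     # computed on GPU 1

# The gather step is where the NVLink/PCIe traffic happens in a real system.
y_parallel = np.concatenate([y_gpu0, y_gpu1], axis=1)

assert np.allclose(y_parallel, x @ W)   # same result as a single GPU
```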
Pipeline Parallelism
Split layers across GPUs. First GPU handles early layers, second GPU handles later layers, like an assembly line.
- Works on PCIe, doesn't need fancy interconnects
- This is what you get with consumer GPUs (RTX 4090, RTX 3090)
- Trade-off: sequential processing means higher latency
- Expect 1.4-1.5× speedup on 2 GPUs with llama.cpp
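A minimal sketch of what the layer split looks like, with an illustrative 32-layer model and an even two-way split (llama.cpp lets you bias the ratio with --tensor-split):

```python
# Pipeline (layer) splitting in miniature: a 32-layer model divided into
# contiguous chunks, one chunk per GPU. Layer count and the even split are
# illustrative values only.
n_layers, n_gpus = 32, 2

assignment = {
    gpu: list(range(gpu * n_layers // n_gpus, (gpu + 1) * n_layers // n_gpus))
    for gpu in range(n_gpus)
}

print(assignment[0])   # layers 0-15 live on GPU 0
print(assignment[1])   # layers 16-31 live on GPU 1
# For each token, GPU 1 has to wait for GPU 0's activations before it can
# start -- the assembly-line behaviour (and idle time) described above.
```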
Sources: Shoeybi et al., "Megatron-LM" (2019); Huang et al., "GPipe" (2019); DeepSpeed documentation; llama.cpp community benchmarks.
Interconnect Technologies
| Technology | Bandwidth | Availability | Use Case |
|---|---|---|---|
| NVLink 4.0 | 900 GB/s per GPU | H100 (Hopper) | Datacenter tensor parallelism |
| NVLink 3.0 | 600 GB/s per GPU | A100, RTX A6000 | Datacenter and workstation multi-GPU |
| NVSwitch | 900 GB/s any-to-any | H100, DGX systems | Large-scale clusters |
| PCIe 4.0 x16 | 32 GB/s (per direction) | RTX 40-series consumer | Consumer pipeline parallelism |
| PCIe 5.0 x16 | 64 GB/s (per direction) | Limited support (2025) | Future consumer setups |
Consumer GPU Reality
Consumer RTX 40-series GPUs (4090, 4080, 4070 Ti) don't have NVLink. They only have PCIe, which is up to 28× slower for GPU-to-GPU communication. This is why multi-GPU doesn't scale linearly on consumer hardware: the bottleneck isn't the GPUs, it's the connection between them.
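The back-of-the-envelope ratios behind that claim, using approximate datasheet numbers (RTX 4090 memory bandwidth, PCIe 4.0 x16 per-direction bandwidth, top-end NVLink):

```python
# Rough ratio sketch of why the interconnect, not the GPUs, is the bottleneck.
# Numbers are approximate spec-sheet values, not measurements.
gpu_mem_bw_gbs = 1008   # RTX 4090 VRAM bandwidth, GB/s
pcie4_x16_gbs = 32      # PCIe 4.0 x16, per direction
nvlink_gbs = 900        # top-end NVLink, per GPU

print(f"VRAM vs PCIe link:   {gpu_mem_bw_gbs / pcie4_x16_gbs:.0f}x")   # ~32x
print(f"NVLink vs PCIe link: {nvlink_gbs / pcie4_x16_gbs:.0f}x")       # ~28x
```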
Sources: NVIDIA NVLink Technical Blog (developer.nvidia.com); PCIe specifications (pcisig.com).
VRAM Pooling Calculations
Understanding effective vs theoretical VRAM capacity
Theoretical vs Effective VRAM
What People Think
Simple math. Two 24GB GPUs, you get 48GB total. Right?
What Actually Happens
You lose roughly 8-15% to overhead, depending on the interconnect: framework buffers, per-GPU CUDA contexts, and communication buffers. On a typical consumer PCIe 4.0 setup, figure about 13%.
Formula
Effective VRAM = (number of GPUs × VRAM per GPU) × efficiency factor. For 2× 24GB over PCIe 4.0: 2 × 24GB × 0.87 ≈ 41GB usable.
Efficiency Factors
| Interconnect | Efficiency | Overhead | Example (2×24GB) |
|---|---|---|---|
| NVLink | 92% | 8% | 44GB effective |
| PCIe 4.0 | 87% | 13% | 41GB effective |
| PCIe 3.0 | 85% | 15% | 40GB effective |
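The same table as a quick calculation, if you want to plug in your own GPU count or VRAM size. The efficiency factors are the ones above; results are rounded down the way the table is:

```python
# Effective VRAM = pooled VRAM x interconnect-dependent efficiency factor.
def effective_vram_gb(gpus: int, vram_per_gpu_gb: float, efficiency: float) -> float:
    return gpus * vram_per_gpu_gb * efficiency

EFFICIENCY = {"NVLink": 0.92, "PCIe 4.0": 0.87, "PCIe 3.0": 0.85}

for interconnect, eff in EFFICIENCY.items():
    usable = int(effective_vram_gb(2, 24, eff))   # round down, as in the table
    print(f"2 x 24GB over {interconnect}: ~{usable} GB usable")
# NVLink -> 44 GB, PCIe 4.0 -> 41 GB, PCIe 3.0 -> 40 GB
```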
Overhead Sources
- Framework buffers (activation and scratch memory each GPU keeps resident)
- A CUDA context on every GPU
- Communication and synchronization buffers for GPU-to-GPU transfers
Sources: These efficiency numbers come from 50+ real user reports on r/LocalLLaMA and llama.cpp GitHub. The overhead breakdown is from PyTorch memory profiling and framework docs.
We use 87% efficiency (13% overhead) for consumer PCIe setups. This is conservative. Real numbers vary ±3% depending on your software stack and configuration.
Performance Scaling
Why 2 GPUs ≠ 2× performance
Reality Check
2 GPUs don't give you 2× performance. Expect 1.4-1.6× speedup on consumer hardware. If someone tells you they get 2× speedup, they either have professional gear with NVLink, or they're not measuring accurately.
Actual Speedup Factors
PCIe 4.0 (Consumer GPUs)
| GPU Count | Theoretical | llama.cpp | vLLM | Efficiency |
|---|---|---|---|---|
| 1 | 1.0× | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.45× | 1.60× | 73-80% |
| 3 | 3.0× | 1.85× | 2.05× | 62-68% |
| 4 | 4.0× | 2.10× | 2.45× | 53-61% |
Conservative estimate: 2 GPUs on PCIe = 1.5× speedup (average of llama.cpp and vLLM)
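If you want to turn these factors into a rough throughput estimate, here's a small sketch. The single-GPU tokens/sec input is whatever you measure on your own hardware; the scaling factors are the community-derived numbers from the table, not guarantees:

```python
# Rough multi-GPU throughput estimate from the PCIe scaling table above.
SPEEDUP = {
    "llama.cpp": {1: 1.0, 2: 1.45, 3: 1.85, 4: 2.10},
    "vllm":      {1: 1.0, 2: 1.60, 3: 2.05, 4: 2.45},
}

def estimate_tps(single_gpu_tps: float, n_gpus: int, backend: str) -> float:
    return single_gpu_tps * SPEEDUP[backend][n_gpus]

# Example with an assumed 30 t/s single-GPU baseline.
print(estimate_tps(30, 2, "llama.cpp"))   # ~43.5 t/s
print(estimate_tps(30, 2, "vllm"))        # ~48 t/s
```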
NVLink (Professional GPUs)
| GPU Count | Theoretical | vLLM (Actual) | Efficiency |
|---|---|---|---|
| 1 | 1.0× | 1.0× | 100% |
| 2 | 2.0× | 1.75× | 88% |
| 4 | 4.0× | 3.10× | 78% |
| 8 | 8.0× | 5.80× | 73% |
NVLink performs significantly better than PCIe but still shows sublinear scaling due to synchronization overhead.
Why Sublinear Scaling?
1. Memory Bandwidth Bottleneck
Each GPU still has to load weights from its own VRAM at the same speed. Adding more GPUs gives you more total capacity, but each GPU is still individually bandwidth-limited.
2. Synchronization Overhead
The GPUs have to talk to each other after every layer. PCIe gives you 32 GB/s for this. Each GPU has 1,000+ GB/s of internal memory bandwidth. See the problem?
3. Pipeline Bubbles
When you split layers across GPUs, the first GPU finishes its layers while the second GPU is still working. The first GPU sits idle. This happens in every forward pass (a toy model of the bubble is sketched after this list).
4. Framework Overhead
PyTorch, CUDA, llama.cpp, vLLM, they all need to coordinate multiple GPUs. This coordination takes time. A single GPU doesn't need any of this.
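For point 3, the classic GPipe bubble formula gives a feel for how much idle time to expect: with p pipeline stages and m requests (or micro-batches) in flight, roughly (p - 1) / (m + p - 1) of the time is spent waiting. Real frameworks overlap work more aggressively, so treat this as intuition, not a prediction:

```python
# Pipeline bubble fraction, per the GPipe paper's analysis.
def bubble_fraction(stages: int, in_flight: int) -> float:
    # (p - 1) / (m + p - 1): rough share of time a stage spends waiting
    return (stages - 1) / (in_flight + stages - 1)

print(bubble_fraction(stages=2, in_flight=1))   # 0.50 -> one request, 2 GPUs: ~half idle
print(bubble_fraction(stages=2, in_flight=8))   # ~0.11 -> batching hides most of the bubble
print(bubble_fraction(stages=4, in_flight=1))   # 0.75 -> more stages hurt a single user
```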
Sources: These speedup numbers come from llama.cpp GitHub discussions, vLLM benchmarks, and r/LocalLLaMA user reports. PCIe bandwidth limits are from the official PCI-SIG specs.
Your mileage may vary ±20-30% depending on your model, quantization, context length, and how well your drivers are configured. These numbers assume typical single-user inference.
When Multi-GPU Makes Sense
Decision matrix and cost-benefit analysis
Decision Criteria
✓ Good Fit for Multi-GPU
- Your model needs more than 24GB (Llama 2 70B Q4 at 38GB, Mixtral 8×7B Q8 at 55GB)
- You're okay with BIOS configuration and troubleshooting drivers
- $6,800 for a single RTX 6000 Ada is more than you're willing to spend
- You already have one high-VRAM GPU and want to add another
✗ Bad Fit for Multi-GPU
- Your model fits on one GPU (Llama 3 8B at 6GB, Llama 2 13B at 10GB)
- This is your first time running LLMs locally (too complex)
- You value simplicity over saving money
- You're at 22-23GB on a 24GB GPU (just use more aggressive quantization)
Cost-Benefit Analysis: 30GB Model (Q4)
| Option | Cost | VRAM | Perf | Complexity | Verdict |
|---|---|---|---|---|---|
| 2× RTX 4090 | $3,200 | 41GB | ~45 t/s | High | Best value if technical |
| RTX 6000 Ada | $6,800 | 48GB | ~50 t/s | Low | Simplest, 2× cost |
| 2× RTX 3090 (used) | $1,400 | 40GB | ~38 t/s | High | Best $/perf, risky |
| A6000 (used) | $4,500 | 48GB | ~42 t/s | Low | Middle ground |
These performance numbers assume llama.cpp. vLLM is 10-15% faster. Used GPUs have no warranty. Cost per tokens/sec: 2×3090 ($37) beats 2×4090 ($71) beats A6000 ($107) beats 6000 Ada ($136).
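The cost-per-throughput ranking, computed directly from the price and tokens/sec columns above:

```python
# Cost per token/sec of sustained throughput, from the table's listed prices
# and llama.cpp throughput estimates.
options = {
    "2x RTX 3090 (used)": (1400, 38),
    "2x RTX 4090":        (3200, 45),
    "A6000 (used)":       (4500, 42),
    "RTX 6000 Ada":       (6800, 50),
}

for name, (price_usd, tps) in sorted(options.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:20s} ~${price_usd / tps:.0f} per token/sec")
# 2x3090 ~$37, 2x4090 ~$71, A6000 ~$107, 6000 Ada ~$136
```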
Recommendations by Model Size
<24GB Models
Llama 3 8B, Mistral 7B, Llama 2 13B (all Q4/Q8)
→ Single GPU sufficient (RTX 4090, RTX 3090)
24-48GB Models
Llama 2 70B Q4 (~38GB), Qwen 72B Q4 (~42GB)
→ 2× RTX 4090 feasible (41GB effective, PCIe)
48-72GB Models
Mixtral 8×7B Q8 (~55GB), Llama 2 70B Q8 (~60GB)
→ 3× RTX 4090 or workstation GPUs (A6000, 6000 Ada)
>72GB Models
Llama 2 70B FP16 (~140GB), Mixtral 8×22B FP16 (~280GB)
→ Datacenter GPUs (H100, A100) or cloud APIs only
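A rough way to turn a quantized model size into a GPU count, using the 87% PCIe pooling efficiency from earlier plus a small allowance for KV cache (the 2GB headroom figure is our assumption, not a measured value):

```python
# How many 24GB GPUs does a quantized model need, roughly?
import math

def gpus_needed(model_gb: float, vram_per_gpu_gb: float = 24,
                efficiency: float = 0.87, kv_cache_headroom_gb: float = 2) -> int:
    # Pooling overhead doesn't really apply to a single GPU, so the result is
    # slightly conservative for models that already fit on one card.
    return math.ceil((model_gb + kv_cache_headroom_gb) / (vram_per_gpu_gb * efficiency))

print(gpus_needed(38))   # Llama 2 70B Q4   -> 2
print(gpus_needed(55))   # Mixtral 8x7B Q8  -> 3
print(gpus_needed(10))   # Llama 2 13B Q4   -> 1
```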
Consumer GPU Multi-GPU Reality
What actually happens with 2× RTX 4090 on PCIe
The PCIe Problem
- Consumer RTX 40-series cards have no NVLink, only PCIe
- PCIe 4.0 x16 gives you 32 GB/s, shared across everything
- With 2 GPUs, a display output, and an NVMe drive, each GPU gets maybe 20-25 GB/s
- NVLink on professional and datacenter cards is 600-900 GB/s, up to 28× faster
- This is why consumer multi-GPU doesn't scale well
Software Support
llama.cpp (Easiest)
Layer (pipeline-style) splitting: --n-gpu-layers controls how many layers are offloaded to the GPUs, and --tensor-split sets the ratio between cards.
- Pros: Simple CLI, widely used, stable
- Cons: Pipeline only (not tensor parallel), moderate performance
- Expected: 1.4-1.5× speedup on 2× RTX 4090
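A minimal sketch using the llama-cpp-python bindings, which expose the same options as the CLI flags. The model path and the even 50/50 split are placeholders; adjust them for your cards and model:

```python
# Layer splitting across two GPUs via llama-cpp-python (placeholder paths/values).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload every layer to GPU
    tensor_split=[0.5, 0.5],    # roughly even split across GPU 0 and GPU 1
    n_ctx=4096,
)

out = llm("Q: Why doesn't 2 GPUs mean 2x speed? A:", max_tokens=64)
print(out["choices"][0]["text"])
```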
vLLM (Best Performance)
Tensor parallelism via --tensor-parallel-size flag.
- Pros: Tensor parallel, PagedAttention, best throughput
- Cons: More complex setup (API server), Python dependencies
- Expected: 1.5-1.7× speedup on 2× RTX 4090
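A minimal sketch using vLLM's Python API; tensor_parallel_size is the same setting the --tensor-parallel-size server flag controls. The quantized checkpoint name is a placeholder (an FP16 70B won't fit in 2×24GB, so you'd point this at an AWQ or GPTQ build):

```python
# Tensor parallelism across two GPUs with vLLM (placeholder model name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",   # placeholder quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,                  # split the model across both GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```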
text-generation-webui
Uses llama.cpp or transformers backend.
- Pros: User-friendly web UI
- Cons: Performance depends on backend, less control
- Expected: Similar to llama.cpp (1.4-1.5×)
Setup Complexity
BIOS Configuration
- Enable PCIe bifurcation (split x16 lanes)
- Enable "Above 4G Decoding"
- Set Resizable BAR (if supported)
- Verify PCIe lane allocation per slot
Driver & Software
- Install CUDA toolkit (match PyTorch version)
- Configure CUDA_VISIBLE_DEVICES
- Test with nvidia-smi (and the Python check below)
- Tune --n-gpu-layers per model
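Before blaming the inference framework, it's worth confirming that both GPUs are actually visible to CUDA. A quick PyTorch check (run with CUDA_VISIBLE_DEVICES unset, or set to "0,1"):

```python
# Sanity check: are both GPUs visible, with the VRAM you expect?
import torch

print(torch.cuda.device_count())          # expect 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.0f} GB")
```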
Power & Cooling
- PSU: 1000W+ for 2× RTX 4090 (900W GPU + 100W system)
- Case airflow: Both GPUs need adequate cooling
- PCIe slot spacing: Minimum 3-slot gap between GPUs
- Thermal throttling: Monitor GPU temps under load
Expect to spend 1-2 weeks getting this working the first time. PCIe lane conflicts, driver errors, and uneven GPU utilization are all common. A single GPU just works.
Data Sources & References
Research papers, community benchmarks, and hardware specifications
Academic Papers
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). ArXiv preprint arXiv:1909.08053. arxiv.org/abs/1909.08053
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Huang, Y., et al. (2019). ArXiv preprint arXiv:1811.06965. arxiv.org/abs/1811.06965
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, T. (2023). ArXiv preprint arXiv:2307.08691. arxiv.org/abs/2307.08691
Industry Documentation
NVIDIA NVLink 4.0 Multi-GPU System Scalability
NVIDIA Developer Blog. developer.nvidia.com/blog
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
vLLM Project Documentation. docs.vllm.ai
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G., et al. (2023-2025). github.com/ggerganov/llama.cpp
Microsoft DeepSpeed: Pipeline Parallelism
DeepSpeed Documentation. deepspeed.ai/tutorials/pipeline
Community Benchmarks
llama.cpp GitHub Discussions
Multi-GPU performance reports from users with 2-4× RTX 4090/3090 setups. github.com/ggerganov/llama.cpp/discussions
r/LocalLLaMA Community Reports
Real-world multi-GPU setups, cost analysis, and troubleshooting guides. reddit.com/r/LocalLLaMA
Hardware Specifications
PCI-SIG Specifications
PCIe 4.0 and 5.0 bandwidth specifications. pcisig.com/specifications
NVIDIA GPU Datasheets
RTX 4090, RTX 6000 Ada, A100, H100 technical specifications. nvidia.com/datasheets
Limitations & Caveats
What we don't know and what can vary
These Estimates Are Conservative
- Assumes consumer PCIe setups (RTX 4090, RTX 3090), which most people use
- Baseline llama.cpp performance, not optimized vLLM or TensorRT
- Single user, batch size of 1
- Normal context length (2-4K tokens)
- Doesn't account for FlashAttention-2 or PagedAttention speedups
Real Performance Varies ±30-50%
You Might Do Better:
- vLLM with tensor parallelism
- Professional cards with NVLink
- Batch processing multiple requests
- FlashAttention-2 compiled in
You Might Do Worse:
- Long context (more than 8K tokens)
- Misconfigured drivers or CUDA
- GPU thermal throttling
- Other PCIe devices competing for bandwidth
We Can't Test Everything
- Don't have every GPU combination (a pair of RTX 6000 Adas costs $13,600)
- Most community benchmarks don't document their exact setup
- Software changes fast (llama.cpp and vLLM ship updates weekly)
- Your specific combination of hardware, drivers, and model might behave differently
Complexity Warning
Multi-GPU setup isn't for beginners:
- BIOS configuration (PCIe bifurcation, above 4G decoding)
- Getting the right CUDA toolkit version for your PyTorch
- Learning framework flags (--n-gpu-layers, --tensor-parallel-size)
- Debugging why your GPUs won't talk to each other
- Making sure your PSU and cooling can handle 900W of GPUs
Only try this if you're comfortable with command line tools and fixing broken configs. For most people, one good GPU is simpler and less painful.
Last updated: November 29, 2025
Questions about multi-GPU inference? Contact us or see our main methodology page