Can RTX 5090 run meta-llama/Llama-3.1-70B-Instruct?

Q4 not recommended | 32GB VRAM available | Requires 34GB+

RTX 5090 does not meet the minimum VRAM requirement for Q4 inference of meta-llama/Llama-3.1-70B-Instruct. Review the quantization breakdown below to see how higher precision settings impact VRAM and throughput.

What this means for you

RTX 5090 lacks sufficient VRAM for comfortable meta-llama/Llama-3.1-70B-Instruct operation with Q4 quantization.

Your 32GB GPU is 2GB short of the 34GB minimum.

Options: (1) Try Q2 or Q3 quantization for lower VRAM requirements, (2) Consider cloud GPU rental, (3) Upgrade to a GPU with at least 34GB VRAM.
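To see why Q2 or Q3 might fit where Q4 does not, here is a rough weights-only estimate. It is a sketch under stated assumptions: the parameter count is taken as 70 billion from the model name, 1 GB is treated as 1e9 bytes, and KV cache and runtime overhead are ignored. The page does not state how it derives its own figures, so these numbers will not match its table exactly, and real quantization formats often store scales and other metadata that push the effective bits per weight above the nominal value.

```python
# Weights-only VRAM estimate. PARAMS_BILLIONS is an assumption taken
# from the "70B" in the model name, not a figure stated on this page.
PARAMS_BILLIONS = 70.0


def weights_gb(bits_per_weight: float, params_billions: float = PARAMS_BILLIONS) -> float:
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def max_bits_per_weight(vram_gb: float, params_billions: float = PARAMS_BILLIONS) -> float:
    """Largest bits-per-weight whose weights alone would fit in vram_gb."""
    return vram_gb * 8 / params_billions


if __name__ == "__main__":
    for bits in (2, 3, 4):
        size = weights_gb(bits)
        print(f"{bits}-bit: {size:.2f} GB of weights, fits in 32 GB: {size <= 32}")
    print(f"break-even: {max_bits_per_weight(32):.2f} bits per weight")
```

Under these assumptions the break-even point sits between 3 and 4 bits per weight, which is consistent with the suggestion to drop below Q4. Any headroom left over still has to cover the KV cache and activations, so a configuration that fits on paper may remain tight in practice.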

Quantization breakdown

Quantization	VRAM needed	VRAM available	Estimated speed	Verdict
Q4	34GB	32GB	64.97 tok/s	❌ Not recommended
Q8	69GB	32GB	41.65 tok/s	❌ Not recommended
FP16	138GB	32GB	21.35 tok/s	❌ Not recommended
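The VRAM column appears to scale linearly with bits per weight: halving the 138GB FP16 figure gives 69GB for Q8, and halving again gives 34.5GB, which the table shows as 34GB. The sketch below reproduces that scaling and the resulting shortfall against 32GB. The function names are illustrative, and the linear rule is inferred from the table rather than documented by the page.

```python
# Figures taken from the table above.
FP16_GB = 138
AVAILABLE_GB = 32


def scaled_requirement_gb(bits_per_weight: float, fp16_gb: float = FP16_GB) -> float:
    """Scale the FP16 (16-bit) requirement linearly to another bit width."""
    return fp16_gb * bits_per_weight / 16


def shortfall_gb(required_gb: float, available_gb: float = AVAILABLE_GB) -> float:
    """How much VRAM is missing; 0.0 when the requirement fits."""
    return max(0.0, required_gb - available_gb)


if __name__ == "__main__":
    for name, bits in (("Q4", 4), ("Q8", 8), ("FP16", 16)):
        need = scaled_requirement_gb(bits)
        print(f"{name}: needs {need:.1f} GB, short by {shortfall_gb(need):.1f} GB")
```

The same two functions can be reused for any other GPU by passing a different `available_gb`.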

Suitable alternatives

GPU	VRAM	Estimated speed	Price
AMD Instinct MI300X	192GB	160.26 tok/s	not listed
NVIDIA H200 SXM 141GB	141GB	127.91 tok/s	not listed
AMD Instinct MI300X	192GB	115.04 tok/s	not listed
NVIDIA H200 SXM 141GB	141GB	95.14 tok/s	not listed
AMD Instinct MI250X	128GB	94.63 tok/s	not listed
