Benchmark Methodology
Research-backed approach to estimating GPU performance for large language model inference. This document provides complete transparency on our data sources, calculation methods, and known limitations.
Overview
LocalAI.Computer provides theoretically derived performance estimates based on established computer architecture principles (Roofline Model, IEEE floating-point standards) and community-validated observations from llama.cpp and related projects.
These are not laboratory measurements. Real-world performance typically varies ±25-50% depending on software implementation (llama.cpp vs vLLM vs TensorRT-LLM), context length, batch size, quantization quality, and thermal conditions.
Our philosophy: provide honest, conservative estimates using widely-available tools (llama.cpp) as the baseline. Users should view these numbers as minimum expected performance, not guarantees. Modern optimizations like FlashAttention and PagedAttention can improve performance 2-3× beyond our conservative estimates.
VRAM Requirement Calculation
How we determine minimum GPU memory needed for each model
Formula
Estimated VRAM (GB) = parameter count × bytes per parameter × overhead factor / 1024³
Bytes Per Parameter (By Quantization)
- • FP16 (16-bit): 2.0 bytes/param
- • Q8 (8-bit): 1.0 bytes/param
- • Q4 (4-bit): 0.5 bytes/param
Source: IEEE 754 floating-point standard; quantization definitions from llama.cpp.
Overhead Factor (Size-Dependent)
- • <10B params: 1.15 (15% overhead)
- • 10-70B params: 1.10 (10% overhead)
- • >70B params: 1.05 (5% overhead)
Rationale: Smaller models have proportionally larger KV cache; larger models are dominated by weight memory.
Example: 20B-Class Model (21.5B actual parameters)
- • FP16: 21.5B × 2.0 × 1.10 / 1024³ = 44GB
- • Q8: 21.5B × 1.0 × 1.10 / 1024³ = 22GB
- • Q4: 21.5B × 0.5 × 1.10 / 1024³ = 11GB
Sources: Data-type sizes follow from IEEE 754-2019 and the llama.cpp quantization formats. Overhead factors are derived from community observations on the llama.cpp GitHub and the text-generation-webui wiki, and cross-referenced with NVIDIA guidance, which recommends roughly 2× the parameter count in GB for FP16 models.
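For readers who prefer code, here is a minimal Python sketch of the calculation above. The thresholds and byte sizes mirror the tables on this page; the function name and structure are illustrative rather than taken from our codebase.

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Minimum VRAM estimate in GB: weight bytes times a size-dependent overhead factor."""
    # Overhead covers KV cache, activations, and runtime buffers (see table above).
    if params_billions < 10:
        overhead = 1.15
    elif params_billions <= 70:
        overhead = 1.10
    else:
        overhead = 1.05
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return weight_bytes * overhead / 1024**3

# Worked example: nominal 20B model with 21.5B actual parameters
for label, bpp in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{label}: {estimate_vram_gb(21.5, bpp):.0f} GB")  # 44 GB, 22 GB, 11 GB
```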
Known Limitations
- • Context length affects KV cache size (assumes 2K-4K tokens)
- • MoE models load all parameters but only activate a subset
- • Multi-modal models have additional VRAM requirements
- • FlashAttention-2 can reduce VRAM usage 20-30% (not modeled)
Performance Estimation Methodology
How we estimate tokens per second for GPU + model combinations
Foundational Principle: Memory-Bandwidth Bottleneck
LLM autoregressive inference is primarily memory-bandwidth bound, not compute-bound. This principle was formalized by the Roofline Model (Williams et al., 2009) and remains valid for transformer architectures. Each token generation requires loading the entire model from VRAM, making memory bandwidth—not CUDA cores or TFLOPS—the primary performance bottleneck.
Recent research (LLMPerf 2024, FlashAttention 2023) shows that modern optimizations can improve performance significantly. Our estimates use a traditional llama.cpp baseline without these advanced optimizations.
Formula Components
1. Baseline Speed (Calibration Point)
RTX 4090 + 7B model (Q4): ~180 tokens/sec
Representative value from llama.cpp community benchmarks and r/LocalLLaMA reports. Used as calibration point for relative performance calculations.
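As a quick sanity check on this calibration point, the memory-bandwidth ceiling can be computed directly. The short sketch below assumes the RTX 4090's ~1008 GB/s peak bandwidth and a ~3.5 GB Q4 weight footprint for a 7B model; both figures are back-of-the-envelope assumptions, not measurements.

```python
# Roofline-style ceiling: each generated token streams the full weight set from VRAM once.
bandwidth_gb_s = 1008.0        # RTX 4090 peak memory bandwidth (GB/s)
weights_gb = 7e9 * 0.5 / 1e9   # 7B parameters at Q4 (0.5 bytes/param) = 3.5 GB

ceiling_tps = bandwidth_gb_s / weights_gb
print(f"theoretical ceiling = {ceiling_tps:.0f} tok/s")  # 288 tok/s
```

The observed ~180 tokens/sec is roughly 60% of that ceiling, which is plausible given that real kernels do not reach peak bandwidth and that KV cache reads add further memory traffic.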
2. GPU Performance Factor
Calculated as: (bandwidth_ratio × 0.75) + (compute_ratio × 0.25)
Memory bandwidth dominates LLM inference performance. The 75% weighting on bandwidth reflects that token generation is primarily memory-bound (loading weights from VRAM) rather than compute-bound, following the Roofline Model (Williams et al., 2009, Communications of the ACM), which established that low arithmetic-intensity operations are limited by memory bandwidth.
3. Model Size Multiplier
Non-linear scaling based on parameter count:
- • 1-3B: 1.20× (better cache locality)
- • 4-8B: 1.00× (baseline)
- • 9-15B: 0.75×
- • 16-30B: 0.55×
- • 31-70B: 0.35×
- • >70B: 0.20× (cache misses dominate)
Larger models exceed GPU cache hierarchies, causing more cache misses. These multipliers are directional estimates drawn from community benchmarks and are pending systematic validation.
4. Quantization Multiplier
- • Q4: 1.00× (0.5 bytes/param, fastest)
- • Q8: 0.70× (1.0 bytes/param, ~30% slower)
- • FP16: 0.38× (2.0 bytes/param, ~62% slower)
Performance scales inversely with memory bandwidth requirements. llama.cpp benchmarks show FP16 typically runs at 35-45% the speed of Q4, consistent with our 38% multiplier.
5. Architecture Bonus
- • Dense (Llama, Mistral): 1.00×
- • MoE (Mixtral): 1.30×
- • Efficient MoE (DeepSeek): 1.25×
MoE models only activate a subset of parameters per token (e.g., 2 of 8 experts), reducing effective compute and memory bandwidth requirements.
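Putting the five components together, the estimate reduces to a chain of multipliers applied to the calibration point. The Python sketch below is illustrative only and is not our production code; the bracket boundaries and multipliers come from the tables above, while the reference compute figure is a nominal FP32 value chosen because only the ratio to the reference GPU matters.

```python
BASELINE_TPS = 180.0  # calibration point: RTX 4090 + 7B model at Q4

# Reference GPU for the ratios (RTX 4090). The compute figure is a nominal FP32
# value; only the ratio of a GPU's specs to this reference enters the formula.
REF_BANDWIDTH_GB_S = 1008.0
REF_COMPUTE_TFLOPS = 82.6

# Bracket upper bounds (billions of parameters) and multipliers, from the tables above.
SIZE_MULT = [(3, 1.20), (8, 1.00), (15, 0.75), (30, 0.55), (70, 0.35), (float("inf"), 0.20)]
QUANT_MULT = {"Q4": 1.00, "Q8": 0.70, "FP16": 0.38}
ARCH_MULT = {"dense": 1.00, "moe": 1.30, "efficient_moe": 1.25}

def estimate_tps(bandwidth_gb_s, compute_tflops, params_b, quant="Q4", arch="dense"):
    """Estimated tokens/sec at batch size 1 with a llama.cpp-style baseline."""
    gpu_factor = (0.75 * bandwidth_gb_s / REF_BANDWIDTH_GB_S
                  + 0.25 * compute_tflops / REF_COMPUTE_TFLOPS)
    size_mult = next(mult for limit, mult in SIZE_MULT if params_b <= limit)
    return BASELINE_TPS * gpu_factor * size_mult * QUANT_MULT[quant] * ARCH_MULT[arch]

# Hypothetical GPU with half the reference bandwidth and compute, running a 13B model at Q8:
print(round(estimate_tps(504.0, 41.3, 13, quant="Q8")))  # 180 x 0.5 x 0.75 x 0.70 = ~47 tok/s
```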
Expected Variance & Confidence
- • ±20-30%: Common GPUs (RTX 40-series) + common models (7B, 13B, 70B) at Q4/Q8
- • ±30-40%: Less common GPUs, uncommon model sizes, FP16 quantization
- • ±50%+: Apple Silicon, models >100B, multi-modal, MoE
These are variance ranges based on architectural principles and community observations, not statistically validated error measurements.
What We Don't Account For
- • Software optimization differences (llama.cpp vs vLLM vs TensorRT-LLM: 2-3× variation)
- • Batch size (batching can increase throughput 10-50×)
- • Context length (longer context = slower, non-linear)
- • Prompt vs generation speed (prefill faster than decode)
- • FlashAttention, PagedAttention, other memory optimizations
- • Thermal throttling (10-20% performance reduction under sustained load)
- • Driver version and CUDA/ROCm toolkit differences
Our estimates represent typical single-user inference with llama.cpp at batch size 1, moderate context (2-4K tokens), and good thermal conditions.
Hardware Specification Data Sources
Where we get GPU specifications and how we validate them
Primary Sources (Tier 1)
- • NVIDIA Official Datasheets: nvidia.com/datasheets
- • AMD Official Specifications: amd.com/specifications
- • Intel Ark Database: ark.intel.com
- • Apple Technical Specifications: support.apple.com/specs
Secondary Sources (Tier 2)
- • TechPowerUp GPU Database: techpowerup.com/gpu-specs (when primary sources unavailable)
- • AnandTech Reviews: Professional hardware testing with detailed specifications
Secondary sources are cross-referenced against at least two independent sources. Discrepancies are resolved using manufacturer documentation.
Model Information Sources
- • Hugging Face Model Hub: huggingface.co/models (parameter counts, architecture, context length)
- • Official Model Repositories: GitHub repos from Meta, Mistral AI, etc.
- • Research Papers: ArXiv preprints and peer-reviewed publications
Community Benchmark Collection
How we collect and verify real-world performance data
Trusted Community Sources
- • llama.cpp GitHub Discussions: github.com/ggerganov/llama.cpp/discussions
- • r/LocalLLaMA: reddit.com/r/LocalLLaMA
- • User Submissions: Direct contributions via our benchmark submission form
Verification Process
- Sanity Check: Result must fall within 2× of the estimated value (see the sketch after this list)
- Hardware Verification: GPU model, VRAM, driver version documented
- Software Specification: Runtime version, quantization, context length specified
- Reproducibility: Preference for benchmarks with reproducible commands/configs
- Outlier Detection: Results deviating >3σ flagged for review
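For illustration, the sanity check and outlier detection steps could be automated along the lines of the hedged Python sketch below; the function names and structure are hypothetical and do not describe our actual review pipeline.

```python
from statistics import mean, stdev

def passes_sanity_check(measured_tps: float, estimated_tps: float) -> bool:
    # Accept results within 2x of the estimate, in either direction.
    return estimated_tps / 2 <= measured_tps <= estimated_tps * 2

def flag_outliers(samples: list[float], sigma: float = 3.0) -> list[float]:
    # Flag submissions more than `sigma` standard deviations from the sample mean.
    if len(samples) < 3:
        return []
    mu, sd = mean(samples), stdev(samples)
    return [x for x in samples if sd > 0 and abs(x - mu) > sigma * sd]
```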
Help Improve Accuracy
We need YOUR benchmarks. Real measurements always replace estimates. The more community data we collect, the more accurate recommendations become for everyone. Submit your benchmark →
Modern LLM Inference Research
Recent advances beyond our baseline methodology
Why Our Methodology Uses 2009 Foundations
Our estimates are based on the Roofline Model (Williams et al., 2009), which established the fundamental principle that memory bandwidth limits performance for operations with low arithmetic intensity. While this foundational work is 16 years old, the core principle, that LLM inference is memory-bound, remains valid.
What's Changed Since 2009
1. FlashAttention & FlashAttention-2 (2023)
Research: Tri Dao et al., Stanford University
Impact: 2-3× speedup through memory-efficient attention
Why we don't model it: Requires specific implementations, benefits vary by hardware and context length
2. PagedAttention (2023)
Research: vLLM Project
Impact: Reduces KV cache memory waste by 50-80%
Why we don't model it: Specific to vLLM engine, depends on batch size and request patterns
3. LLMPerf (2024)
Research: arXiv:2503.11244
Finding: Demonstrates transformer inference requires updated performance models
Status: Emerging research; methodologies not yet standardized
Our Conservative Approach
We intentionally use conservative baseline estimates for several reasons:
- Most users run llama.cpp, not optimized inference engines like vLLM or TensorRT
- Advanced optimizations are deployment-specific and vary by hardware
- Better to underestimate than overestimate performance
- Baseline llama.cpp represents a reproducible, widely-available reference point
Our estimates represent typical single-user inference with standard tools. Optimized production deployments can achieve 2-3× better performance.
Limitations & Ongoing Work
What we don't capture and how we're improving
Known Limitations
- • Estimates assume llama.cpp (not vLLM, TensorRT-LLM, etc.)
- • Assumes batch size = 1 (real deployments often batch)
- • Context length assumed ~2-4K tokens
- • Thermal throttling not modeled
- • PCIe bandwidth not considered (relevant for offloading)
- • Multi-GPU setups not yet supported
- • FlashAttention-2 benefits not quantified
In Progress
- ✅ Improved VRAM calculation (Phase 1 complete)
- ✅ Multi-GPU configuration support — See Multi-GPU Methodology
- 🔄 Populating memory bandwidth for all GPUs
- 🔄 Collecting community-verified benchmarks
- 🔄 Adding confidence intervals to UI
- 📅 Software runtime multipliers (vLLM, TensorRT-LLM)
- 📅 Context length impact modeling
This methodology is continuously refined as we collect more real-world data. Major changes are documented with effective dates. Last updated: November 9, 2025 (Phase 1: VRAM formula correction).
Found an error or have better data? Contact us at feedback@localai.computer
References & Sources
Foundational Computer Architecture
Roofline Model (2009)
Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An insightful visual performance model for multicore architectures." Communications of the ACM, 52(4), 65-76. DOI: 10.1145/1498765.1498785
IEEE Floating-Point Standard (2019)
IEEE Computer Society. (2019). IEEE Standard for Floating-Point Arithmetic (IEEE 754-2019). IEEE Standards Association
Modern LLM Inference Research (2023-2024)
LLMPerf (2024)
"LLMPerf: GPU Performance Modeling meets Large Language Models." ArXiv preprint. arXiv:2503.11244
FlashAttention-2 (2023)
Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Stanford University.
PagedAttention / vLLM (2023)
"Efficient Memory Management for Large Language Model Serving with PagedAttention." vLLM Project.
Industry Documentation & Implementation
NVIDIA Developer Resources
"GPU Memory Essentials for AI Performance" (2024). developer.nvidia.com/blog
llama.cpp Implementation
Gerganov, G. et al. (2023-2024). "llama.cpp: Inference of LLaMA model in pure C/C++." github.com/ggerganov/llama.cpp
Community Documentation
text-generation-webui System Requirements Wiki. github.com/oobabooga/text-generation-webui
All sources are cited with specific URLs or DOIs. When community observations are used (e.g., overhead factors, model size multipliers), they are explicitly labeled as "directional estimates" rather than laboratory measurements. We prioritize primary sources (manufacturer datasheets, peer-reviewed papers) over secondary aggregators.
Questions about our methodology? Contact us