Research-backed approach to estimating GPU performance for large language model inference. This document provides complete transparency on our data sources, calculation methods, and known limitations.
LocalAI.Computer provides theoretically-derived performance estimates based on established computer architecture principles (Roofline Model, IEEE floating-point standards) and community-validated observations from llama.cpp and related projects.
These are not laboratory measurements. Real-world performance typically varies ±25-50% depending on software implementation (llama.cpp vs vLLM vs TensorRT-LLM), context length, batch size, quantization quality, and thermal conditions.
Our philosophy: provide honest, conservative estimates using widely-available tools (llama.cpp) as the baseline. Users should view these numbers as minimum expected performance, not guarantees. Modern optimizations like FlashAttention and PagedAttention can improve performance 2-3× beyond our conservative estimates.
How we determine minimum GPU memory needed for each model
Source: IEEE 754 floating-point standard; quantization definitions from llama.cpp.
Rationale: Smaller models have proportionally larger KV cache; larger models are dominated by weight memory.
Sources: Data-type sizes from IEEE 754-2019. Overhead factors derived from community observations on the llama.cpp GitHub and the text-generation-webui wiki. Cross-referenced with NVIDIA guidance, which recommends provisioning roughly 2× the parameter size in memory.
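The sketch below illustrates the kind of calculation this section describes, assuming weights plus an FP16 KV cache plus a flat overhead factor. The bytes-per-parameter values follow the standard data-type and llama.cpp quantization block sizes; the default layer/head counts, the 1.2× overhead factor, and the function itself are illustrative assumptions, not our exact production formula.

```python
# Illustrative VRAM estimate: weights + FP16 KV cache + a flat overhead factor.
# All constants are assumptions for the sketch, not the site's production formula.

BYTES_PER_PARAM = {      # from data-type sizes / llama.cpp quantization block layouts
    "fp16": 2.0,
    "q8_0": 1.0625,      # 8.5 bits per weight
    "q4_0": 0.5625,      # 4.5 bits per weight
}

def estimate_vram_gb(params_b: float, quant: str = "q4_0",
                     n_layers: int = 32, n_kv_heads: int = 32,
                     head_dim: int = 128, context: int = 4096,
                     overhead: float = 1.2) -> float:
    """Rough minimum VRAM in GB for a dense model (defaults resemble a 7B)."""
    weights = params_b * 1e9 * BYTES_PER_PARAM[quant]
    # KV cache: K and V tensors per layer, stored in FP16 (2 bytes per element)
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context * 2
    return (weights + kv_cache) * overhead / 1e9

print(f"{estimate_vram_gb(7):.1f} GB")   # 7B at Q4, 4K context -> about 7.3 GB
```

Note how the KV-cache term (about 2.1 GB here) is a large fraction of the 3.9 GB of Q4 weights for a 7B model, while the same term is a small fraction of the weights for a 70B model, which is the proportionality point made in the rationale above.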
Known Limitations
How we estimate tokens per second for GPU + model combinations
LLM autoregressive inference is primarily memory-bandwidth bound, not compute-bound. This principle was formalized by the Roofline Model (Williams et al., 2009) and remains valid for transformer architectures. Each token generation requires loading the entire model from VRAM, making memory bandwidth—not CUDA cores or TFLOPS—the primary performance bottleneck.
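As a worked example of that ceiling: if every generated token must stream the full set of weights from VRAM once, tokens/sec cannot exceed memory bandwidth divided by model size. The 1008 GB/s figure is the RTX 4090's published bandwidth; the model sizes are illustrative.

```python
# Roofline-style upper bound: each token streams the full weights from VRAM once,
# so tokens/sec <= memory_bandwidth / model_bytes. Model sizes are illustrative.

def bandwidth_bound_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

RTX_4090_BW = 1008                               # GB/s, published spec
print(bandwidth_bound_tps(RTX_4090_BW, 3.9))     # 7B @ Q4   -> ~258 tok/s ceiling
print(bandwidth_bound_tps(RTX_4090_BW, 14.0))    # 7B @ FP16 -> ~72 tok/s ceiling
```

The ~180 tokens/sec calibration figure cited below sits under the ~258 tok/s ceiling, which is expected once attention, kernel-launch, and sampling overheads are included.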
Recent research (LLMPerf 2024, FlashAttention 2023) shows modern optimizations can improve performance significantly. Our estimates use the traditional llama.cpp baseline without these advanced optimizations.
RTX 4090 + 7B model (Q4): ~180 tokens/sec
Representative value from llama.cpp community benchmarks and r/LocalLLaMA reports. Used as calibration point for relative performance calculations.
Calculated as: (bandwidth_ratio × 0.75) + (compute_ratio × 0.25), where each ratio is taken relative to the RTX 4090 calibration point.
Memory bandwidth dominates LLM inference performance. The 75% weighting on bandwidth reflects that token generation is primarily memory-bound (loading weights from VRAM) rather than compute-bound. Based on Roofline Model (Williams et al., 2009, Communications of the ACM), which established that low arithmetic-intensity operations are memory-bandwidth limited.
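A minimal sketch of that blend, scaling the 180 tok/s calibration point to another card. The RTX 4090 bandwidth and the 180 tok/s figure come from this page; the ~82.6 TFLOPS reference value is the commonly quoted shader-throughput spec and is our assumption about which compute metric the ratio uses, and the second card's numbers are placeholders, not measurements.

```python
# Sketch of the 75/25 bandwidth/compute blend relative to the RTX 4090 baseline.
# Reference TFLOPS and the example card's specs are assumptions for illustration.

REF_BW, REF_TFLOPS, REF_TPS = 1008.0, 82.6, 180.0   # RTX 4090 + 7B Q4 calibration

def relative_score(bw_gb_s: float, tflops: float) -> float:
    """75% memory-bandwidth ratio + 25% compute ratio, both vs. the RTX 4090."""
    return 0.75 * (bw_gb_s / REF_BW) + 0.25 * (tflops / REF_TFLOPS)

def estimate_tps(bw_gb_s: float, tflops: float) -> float:
    return REF_TPS * relative_score(bw_gb_s, tflops)

# Hypothetical card with half the bandwidth and a third of the compute:
print(f"{estimate_tps(504, 27.5):.0f} tok/s")   # ~82 tok/s vs. 180 for the 4090
```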
Non-linear scaling based on parameter count:
Larger models exceed GPU cache hierarchies, causing more cache misses. The scaling curve is drawn from community benchmarks and is a directional estimate pending systematic validation.
Performance scales inversely with memory bandwidth requirements. llama.cpp benchmarks show FP16 typically runs at 35-45% of the speed of Q4, consistent with our 38% multiplier.
MoE models only activate a subset of parameters per token (e.g., 2 of 8 experts), reducing effective compute and memory bandwidth requirements.
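The sketch below combines the three adjustments just described: a non-linear model-size factor, a quantization speed multiplier, and an MoE active-parameter fraction. Only the 38% FP16-vs-Q4 multiplier is stated on this page; the size-scaling exponent, the Q8 value, and the MoE example parameters are illustrative assumptions.

```python
# Adjustment factors applied on top of the relative-performance score.
# Only the 0.38 FP16 multiplier is from this page; everything else is assumed.

QUANT_SPEED_VS_Q4 = {"q4": 1.00, "q8": 0.60, "fp16": 0.38}

def size_scaling(params_b: float, ref_b: float = 7.0, exponent: float = 1.1) -> float:
    """Non-linear penalty: bigger models spill out of cache, so throughput
    falls slightly faster than the linear weight-size increase (exponent assumed)."""
    return (ref_b / params_b) ** exponent

def effective_params_b(total_b: float, active_experts: int | None = None,
                       total_experts: int | None = None) -> float:
    """MoE models only touch a fraction of expert weights per token."""
    if active_experts and total_experts:
        return total_b * active_experts / total_experts
    return total_b

def throughput_multiplier(params_b: float, quant: str) -> float:
    return size_scaling(params_b) * QUANT_SPEED_VS_Q4[quant]

# Dense 13B at FP16 vs. the 7B Q4 calibration point:
print(f"{throughput_multiplier(13, 'fp16'):.2f}x")                           # ~0.19x
# MoE with ~47B total parameters, 2 of 8 experts active, at Q4:
print(f"{throughput_multiplier(effective_params_b(47, 2, 8), 'q4'):.2f}x")   # ~0.57x
```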
Expected Variance & Confidence
These are variance ranges based on architectural principles and community observations, not statistically validated error measurements.
What We Don't Account For
Our estimates represent typical single-user inference with llama.cpp at batch size 1, moderate context (2-4K tokens), and good thermal conditions.
Where we get GPU specifications and how we validate them
Specifications taken from secondary sources are cross-referenced against at least two independent sources; discrepancies are resolved in favor of manufacturer documentation.
How we collect and verify real-world performance data
Help Improve Accuracy
We need YOUR benchmarks. Real measurements always replace estimates. The more community data we collect, the more accurate recommendations become for everyone. Submit your benchmark →
Recent advances beyond our baseline methodology
Our estimates are based on the Roofline Model (Williams et al., 2009), which established the fundamental principle that memory-bandwidth limits performance for operations with low arithmetic intensity. While this foundational work is 16 years old, the core principle—that LLM inference is memory-bound—remains valid.
FlashAttention-2 (2023)
Research: Tri Dao et al., Stanford University
Impact: 2-3× speedup through memory-efficient attention
Why we don't model it: Requires specific implementations, benefits vary by hardware and context length
PagedAttention / vLLM (2023)
Research: vLLM Project
Impact: Reduces KV cache memory waste by 50-80%
Why we don't model it: Specific to vLLM engine, depends on batch size and request patterns
LLMPerf (2024)
Research: arXiv preprint 2503.11244
Finding: Demonstrates that transformer inference requires updated performance models
Status: Emerging research; methodologies not yet standardized
We intentionally use conservative baseline estimates. Our numbers represent typical single-user inference with standard tools; optimized production deployments can achieve 2-3× better performance.
What we don't capture and how we're improving
This methodology is continuously refined as we collect more real-world data. Major changes are documented with effective dates. Last updated: November 9, 2025 (Phase 1: VRAM formula correction).
Found an error or have better data? Contact us at feedback@localai.computer
Roofline Model (2009)
Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An insightful visual performance model for multicore architectures." Communications of the ACM, 52(4), 65-76. DOI: 10.1145/1498765.1498785
IEEE Floating-Point Standard (2019)
IEEE Computer Society. (2019). IEEE Standard for Floating-Point Arithmetic (IEEE 754-2019). IEEE Standards Association.
LLMPerf (2024)
"LLMPerf: GPU Performance Modeling meets Large Language Models." ArXiv preprint. arXiv:2503.11244
FlashAttention-2 (2023)
Dao, T. et al. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Stanford University.
PagedAttention / vLLM (2023)
"Efficient Memory Management for Large Language Model Serving with PagedAttention." vLLM Project.
NVIDIA Developer Resources
"GPU Memory Essentials for AI Performance" (2024). developer.nvidia.com/blog
llama.cpp Implementation
Gerganov, G. et al. (2023-2024). "llama.cpp: Inference of LLaMA model in pure C/C++." github.com/ggerganov/llama.cpp
Community Documentation
text-generation-webui System Requirements Wiki. github.com/oobabooga/text-generation-webui
All sources are cited with specific URLs or DOIs. When community observations are used (e.g., overhead factors, model size multipliers), they are explicitly labeled as "directional estimates" rather than laboratory measurements. We prioritize primary sources (manufacturer datasheets, peer-reviewed papers) over secondary aggregators.
Questions about our methodology? Contact us