Benchmark Methodology

Research-backed approach to estimating GPU performance for large language model inference. This document provides complete transparency on our data sources, calculation methods, and known limitations.

Overview

LocalAI.Computer provides theoretically-derived performance estimates based on established computer architecture principles (Roofline Model, IEEE floating-point standards) and community-validated observations from llama.cpp and related projects.

These are not laboratory measurements. Real-world performance typically varies ±25-50% depending on software implementation (llama.cpp vs vLLM vs TensorRT-LLM), context length, batch size, quantization quality, and thermal conditions.

Our philosophy: provide honest, conservative estimates using widely-available tools (llama.cpp) as the baseline. Users should view these numbers as minimum expected performance, not guarantees. Modern optimizations like FlashAttention and PagedAttention can improve performance 2-3× beyond our conservative estimates.

VRAM Requirement Calculation

How we determine minimum GPU memory needed for each model

Formula

VRAM (GB) = (Parameters × Bytes_Per_Parameter × Overhead_Factor) / 1024³

Bytes Per Parameter (By Quantization)

  • FP16 (16-bit): 2.0 bytes/param
  • Q8 (8-bit): 1.0 bytes/param
  • Q4 (4-bit): 0.5 bytes/param

Source: IEEE 754 floating-point standard; quantization definitions from llama.cpp.

Overhead Factor (Size-Dependent)

  • <10B params: 1.15 (15% overhead)
  • 10-70B params: 1.10 (10% overhead)
  • >70B params: 1.05 (5% overhead)

Rationale: Smaller models have proportionally larger KV cache; larger models are dominated by weight memory.

Example: 20B-Class Model (21.5B Parameters)

  • FP16: 21.5B × 2.0 × 1.10 / 1024³ = 44 GB
  • Q8: 21.5B × 1.0 × 1.10 / 1024³ = 22 GB
  • Q4: 21.5B × 0.5 × 1.10 / 1024³ = 11 GB

Sources: Byte sizes per parameter follow IEEE 754-2019 and llama.cpp quantization formats. Overhead factors are derived from community observations on the llama.cpp GitHub and the text-generation-webui wiki, cross-referenced with NVIDIA guidance (which recommends roughly 2× the parameter count in GB for FP16 weights).
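
For concreteness, a minimal sketch of this calculation in Python (the byte sizes and overhead factors are exactly the values listed above; the function name and rounding are illustrative, not part of any published tool):

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Estimate minimum VRAM in GB: model weights plus size-dependent overhead."""
    # Size-dependent overhead factor (KV cache, activations, runtime buffers)
    if params_billions < 10:
        overhead = 1.15
    elif params_billions <= 70:
        overhead = 1.10
    else:
        overhead = 1.05
    return params_billions * 1e9 * bytes_per_param * overhead / 1024**3

# 20B-class example (21.5B parameters), matching the figures above
for label, bpp in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{label}: {estimate_vram_gb(21.5, bpp):.0f} GB")
# FP16: 44 GB, Q8: 22 GB, Q4: 11 GB
```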

Known Limitations

  • Context length affects KV cache size (assumes 2K-4K tokens)
  • MoE models load all parameters but only activate a subset
  • Multi-modal models have additional VRAM requirements
  • FlashAttention-2 can reduce VRAM usage 20-30% (not modeled)

Performance Estimation Methodology

How we estimate tokens per second for GPU + model combinations

Foundational Principle: Memory-Bandwidth Bottleneck

LLM autoregressive inference is primarily memory-bandwidth bound, not compute-bound. This principle was formalized by the Roofline Model (Williams et al., 2009) and remains valid for transformer architectures. Each token generation requires loading the entire model from VRAM, making memory bandwidth—not CUDA cores or TFLOPS—the primary performance bottleneck.
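
As a back-of-envelope illustration (the bandwidth figure is the published RTX 4090 spec of roughly 1 TB/s; the arithmetic is ours), the per-token ceiling implied by this principle is approximately memory bandwidth divided by model size in bytes:

```python
# Rough upper bound: each generated token streams the full weight set from VRAM once.
bandwidth_gb_s = 1008           # RTX 4090 published memory bandwidth (~1 TB/s)
model_gb = 7e9 * 0.5 / 1e9      # 7B parameters at Q4 (0.5 bytes/param) = 3.5 GB

ceiling = bandwidth_gb_s / model_gb
print(f"theoretical ceiling: ~{ceiling:.0f} tokens/sec")  # ~288 tokens/sec
```

Observed llama.cpp throughput (~180 tokens/sec, the calibration point below) sits below this ceiling because of KV-cache traffic, kernel overheads, and imperfect bandwidth utilization, which is consistent with the memory-bound picture.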

Recent research (LLMPerf 2024, FlashAttention 2023) shows that modern optimizations can improve performance significantly. Our estimates use a plain llama.cpp baseline without these advanced optimizations.

Formula Components

tokens/sec = baseline_speed × gpu_performance × model_size × quantization × architecture

1. Baseline Speed (Calibration Point)

RTX 4090 + 7B model (Q4): ~180 tokens/sec

Representative value from llama.cpp community benchmarks and r/LocalLLaMA reports. Used as calibration point for relative performance calculations.

2. GPU Performance Factor

Calculated as: (bandwidth_ratio × 0.75) + (compute_ratio × 0.25)

Memory bandwidth dominates LLM inference performance. The 75% weighting on bandwidth reflects that token generation is primarily memory-bound (loading weights from VRAM) rather than compute-bound. Based on Roofline Model (Williams et al., 2009, Communications of the ACM), which established that low arithmetic-intensity operations are memory-bandwidth limited.
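
A sketch of this weighting, using the RTX 4090 as the reference point (the specification values below are approximate published figures; the function name and the RTX 4070 comparison are illustrative):

```python
def gpu_performance_factor(bandwidth_gb_s: float, tflops_fp32: float,
                           ref_bandwidth: float = 1008.0,   # RTX 4090, approx. GB/s
                           ref_tflops: float = 82.6) -> float:  # RTX 4090, approx. FP32 TFLOPS
    """Weighted ratio vs. the reference GPU: 75% memory bandwidth, 25% compute."""
    bandwidth_ratio = bandwidth_gb_s / ref_bandwidth
    compute_ratio = tflops_fp32 / ref_tflops
    return 0.75 * bandwidth_ratio + 0.25 * compute_ratio

# e.g. an RTX 4070-class card (~504 GB/s, ~29 TFLOPS FP32)
print(round(gpu_performance_factor(504, 29), 2))  # ≈ 0.46
```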

3. Model Size Multiplier

Non-linear scaling based on parameter count:

  • 1-3B: 1.20× (better cache locality)
  • 4-8B: 1.00× (baseline)
  • 9-15B: 0.75×
  • 16-30B: 0.55×
  • 31-70B: 0.35×
  • 70B+: 0.20× (cache misses dominate)

Larger models exceed GPU cache hierarchies, causing more cache misses. This scaling is observed in community benchmarks; treat the multipliers as directional estimates pending systematic validation.

4. Quantization Multiplier

  • Q4: 1.00× (0.5 bytes/param, fastest)
  • Q8: 0.70× (1.0 bytes/param, ~30% slower)
  • FP16: 0.38× (2.0 bytes/param, ~62% slower)

Performance scales inversely with memory bandwidth requirements. llama.cpp benchmarks show FP16 typically runs at 35-45% the speed of Q4, consistent with our 38% multiplier.

5. Architecture Bonus

  • Dense (Llama, Mistral): 1.00×
  • MoE (Mixtral): 1.30×
  • Efficient MoE (DeepSeek): 1.25×

MoE models only activate a subset of parameters per token (e.g., 2 of 8 experts), reducing effective compute and memory bandwidth requirements.
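
Putting the five components together, a minimal sketch of the full estimate (the multiplier tables are the values listed above; function names are illustrative, and the GPU factor comes from the sketch in component 2):

```python
BASELINE_TOK_S = 180.0  # calibration point: RTX 4090 + 7B model at Q4

QUANT_MULT = {"Q4": 1.00, "Q8": 0.70, "FP16": 0.38}
ARCH_MULT = {"dense": 1.00, "moe": 1.30, "efficient_moe": 1.25}

def size_multiplier(params_b: float) -> float:
    """Non-linear scaling with parameter count (directional estimates)."""
    if params_b <= 3:
        return 1.20
    if params_b <= 8:
        return 1.00
    if params_b <= 15:
        return 0.75
    if params_b <= 30:
        return 0.55
    if params_b < 70:
        return 0.35
    return 0.20  # 70B and above: cache misses dominate

def estimate_tokens_per_sec(gpu_factor: float, params_b: float,
                            quant: str, arch: str = "dense") -> float:
    return (BASELINE_TOK_S * gpu_factor * size_multiplier(params_b)
            * QUANT_MULT[quant] * ARCH_MULT[arch])

# RTX 4090 (gpu_factor = 1.0) running a 13B dense model at Q4
print(round(estimate_tokens_per_sec(1.0, 13, "Q4")))  # ≈ 135 tokens/sec
```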

Expected Variance & Confidence

  • ±20-30%: Common GPUs (RTX 40-series) + common models (7B, 13B, 70B) at Q4/Q8
  • ±30-40%: Less common GPUs, uncommon model sizes, FP16 quantization
  • ±50%+: Apple Silicon, models >100B, multi-modal, MoE

These are variance ranges based on architectural principles and community observations, not statistically validated error measurements.

What We Don't Account For

  • Software optimization differences (llama.cpp vs vLLM vs TensorRT-LLM: 2-3× variation)
  • Batch size (batching can increase throughput 10-50×)
  • Context length (longer context = slower, non-linear)
  • Prompt vs generation speed (prefill faster than decode)
  • FlashAttention, PagedAttention, other memory optimizations
  • Thermal throttling (10-20% performance reduction under sustained load)
  • Driver version and CUDA/ROCm toolkit differences

Our estimates represent typical single-user inference with llama.cpp at batch size 1, moderate context (2-4K tokens), and good thermal conditions.

Hardware Specification Data Sources

Where we get GPU specifications and how we validate them

Primary Sources (Tier 1)

  • Manufacturer documentation: official datasheets and product specification pages (memory capacity, bandwidth, compute throughput)

Secondary Sources (Tier 2)

  • TechPowerUp GPU Database: techpowerup.com/gpu-specs (when primary sources unavailable)
  • AnandTech Reviews: Professional hardware testing with detailed specifications

Secondary sources are cross-referenced against at least two independent sources. Discrepancies are resolved using manufacturer documentation.

Model Information Sources

  • Hugging Face Model Hub: huggingface.co/models (parameter counts, architecture, context length)
  • Official Model Repositories: GitHub repos from Meta, Mistral AI, etc.
  • Research Papers: ArXiv preprints and peer-reviewed publications

Community Benchmark Collection

How we collect and verify real-world performance data

Trusted Community Sources

  • llama.cpp GitHub discussions and benchmark issues
  • r/LocalLLaMA community benchmark reports
  • text-generation-webui wiki and community documentation

Verification Process

  1. Sanity Check: Result must be within 2× of estimated value
  2. Hardware Verification: GPU model, VRAM, driver version documented
  3. Software Specification: Runtime version, quantization, context length specified
  4. Reproducibility: Preference for benchmarks with reproducible commands/configs
  5. Outlier Detection: Results deviating >3σ flagged for review
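
A small sketch of how the sanity check (step 1) and outlier flagging (step 5) above could be applied (the 2× and 3σ thresholds are the ones listed; function names and the minimum-sample guard are illustrative):

```python
from statistics import mean, stdev

def passes_sanity_check(reported_tok_s: float, estimated_tok_s: float) -> bool:
    """Step 1: accept results within 2x of our estimate, in either direction."""
    return estimated_tok_s / 2 <= reported_tok_s <= estimated_tok_s * 2

def flag_outliers(results: list[float], sigma: float = 3.0) -> list[float]:
    """Step 5: flag results more than 3 standard deviations from the sample mean."""
    if len(results) < 3:  # too few samples to estimate spread meaningfully
        return []
    mu, sd = mean(results), stdev(results)
    return [r for r in results if sd > 0 and abs(r - mu) > sigma * sd]
```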

Help Improve Accuracy

We need YOUR benchmarks. Real measurements always replace estimates. The more community data we collect, the more accurate recommendations become for everyone. Submit your benchmark →

Modern LLM Inference Research

Recent advances beyond our baseline methodology

Why Our Methodology Uses 2009 Foundations

Our estimates are based on the Roofline Model (Williams et al., 2009), which established the fundamental principle that memory-bandwidth limits performance for operations with low arithmetic intensity. While this foundational work is 16 years old, the core principle—that LLM inference is memory-bound—remains valid.

What's Changed Since 2009

1. FlashAttention & FlashAttention-2 (2023)

Research: Tri Dao et al., Stanford University
Impact: 2-3× speedup through memory-efficient attention
Why we don't model it: Requires specific implementations, benefits vary by hardware and context length

2. PagedAttention (2023)

Research: vLLM Project
Impact: Reduces KV cache memory waste by 50-80%
Why we don't model it: Specific to vLLM engine, depends on batch size and request patterns

3. LLMPerf (2024)

Research: ArXiv 2503.11244
Finding: Demonstrates transformer inference requires updated performance models
Status: Emerging research; methodologies not yet standardized

Our Conservative Approach

We intentionally use conservative baseline estimates for several reasons:

  • Most users run llama.cpp, not optimized inference engines like vLLM or TensorRT
  • Advanced optimizations are deployment-specific and vary by hardware
  • Better to underestimate than overestimate performance
  • Baseline llama.cpp represents a reproducible, widely-available reference point

Our estimates represent typical single-user inference with standard tools. Optimized production deployments can achieve 2-3× better performance.

Limitations & Ongoing Work

What we don't capture and how we're improving

Known Limitations

  • Estimates assume llama.cpp (not vLLM, TensorRT-LLM, etc.)
  • Assumes batch size = 1 (real deployments often batch)
  • Context length assumed ~2-4K tokens
  • Thermal throttling not modeled
  • PCIe bandwidth not considered (relevant for offloading)
  • Multi-GPU setups not yet supported
  • FlashAttention-2 benefits not quantified

In Progress

  • ✅ Improved VRAM calculation (Phase 1 complete)
  • ✅ Multi-GPU configuration support — See Multi-GPU Methodology
  • 🔄 Populating memory bandwidth for all GPUs
  • 🔄 Collecting community-verified benchmarks
  • 🔄 Adding confidence intervals to UI
  • 📅 Software runtime multipliers (vLLM, TRT)
  • 📅 Context length impact modeling

This methodology is continuously refined as we collect more real-world data. Major changes are documented with effective dates. Last updated: November 9, 2025 (Phase 1: VRAM formula correction).

Found an error or have better data? Contact us at feedback@localai.computer

References & Sources

Foundational Computer Architecture

Roofline Model (2009)

Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An insightful visual performance model for multicore architectures." Communications of the ACM, 52(4), 65-76. DOI: 10.1145/1498765.1498785

IEEE Floating-Point Standard (2019)

IEEE Computer Society. (2019). IEEE Standard for Floating-Point Arithmetic (IEEE 754-2019). IEEE Standards Association

Modern LLM Inference Research (2023-2024)

LLMPerf (2024)

"LLMPerf: GPU Performance Modeling meets Large Language Models." ArXiv preprint. arXiv:2503.11244

FlashAttention-2 (2023)

Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Stanford University. arXiv:2307.08691.

PagedAttention / vLLM (2023)

"Efficient Memory Management for Large Language Model Serving with PagedAttention." vLLM Project.

Industry Documentation & Implementation

NVIDIA Developer Resources

"GPU Memory Essentials for AI Performance" (2024). developer.nvidia.com/blog

llama.cpp Implementation

Gerganov, G. et al. (2023-2024). "llama.cpp: Inference of LLaMA model in pure C/C++." github.com/ggerganov/llama.cpp

Community Documentation

text-generation-webui System Requirements Wiki. github.com/oobabooga/text-generation-webui

All sources are cited with specific URLs or DOIs. When community observations are used (e.g., overhead factors, model size multipliers), they are explicitly labeled as "directional estimates" rather than laboratory measurements. We prioritize primary sources (manufacturer datasheets, peer-reviewed papers) over secondary aggregators.

Questions about our methodology? Contact us