
Lean LLMs: Cut Costs, Keep Quality, and Ship Fast

March 04, 2026

LLM bills creep up quietly. One day you ship a prototype. A week later the graph looks like a ski slope. The good news: you can drive costs down without slowing users or cutting features. This guide shows how to do it with clear metrics, specific engineering moves, and safe defaults that stick in production.

Why LLM spend grows fast

LLM usage scales along three axes at once: users, tokens per request, and requests per session. You can keep DAU constant and still double spend if prompts bloat and retries pile up. You can halve tokens per call yet overshoot the budget if you trigger twice as many calls inside a new workflow. FinOps for LLMs is about shaping demand (tokens and calls), pricing intelligently (model choice and routing), and keeping quality with quick feedback loops.

Token math you can use

Think in tokens per user per day. If your median session uses 8 calls at 1,200 input tokens and 300 output tokens, that’s 12,000 tokens per session. On a $5 per million input and $15 per million output model, each session is about $0.084. That sounds fine until you grow to 40,000 daily sessions and add retrieval that doubles input length. Your daily run rate jumps past $5,000. The fixes below target the handful of multipliers that matter most.
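The arithmetic above is worth keeping as a small helper so the whole team computes costs the same way. A minimal sketch, using the example's prices and volumes (not any particular provider's):

```python
def session_cost(calls, in_tokens, out_tokens,
                 in_price_per_m=5.0, out_price_per_m=15.0):
    """Dollar cost of one session given per-call token counts and
    per-million-token prices."""
    total_in = calls * in_tokens
    total_out = calls * out_tokens
    return total_in / 1e6 * in_price_per_m + total_out / 1e6 * out_price_per_m

base = session_cost(8, 1_200, 300)     # about $0.084 per session
doubled = session_cost(8, 2_400, 300)  # retrieval doubles input length
daily = doubled * 40_000               # about $5,280/day at 40k sessions
```

Parameterizing prices makes it trivial to re-run the model when a provider changes its price table.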

Start with observability, not optimization

You can’t tune what you can’t see. Before touching prompts, wire up request tracing and cost accounting. Capture the data once and you’ll stop guessing which knob to turn.

Essential metrics to log on every call

  • Token counts: input, output, and total. Log the tokenizer used.
  • Latency: first token latency and total completion time.
  • Retry info: count, reasons (rate limit, server error, safety refusal).
  • Model and version: keep a stable identifier.
  • Prompt hash: a stable hash of system + template to track drift.
  • Cache hit/miss: plus the cache key and TTL remaining on hit.
  • Attribution: user, org, feature flag, A/B bucket.
  • Billable cost: compute using the provider’s current price table at ingest time and store as a number.

A minimal schema that scales

Emit OpenTelemetry spans for each LLM call with attributes for the metrics above. In your data warehouse, define a fact table like llm_calls keyed by request ID, with dimensions for model, feature, and user plan. This powers queries like “top five prompts by monthly spend” or “token growth per feature after last release.”
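One way to keep the schema honest is to define the record once in code. A minimal sketch with hypothetical field names (`LlmCall`, `prompt_hash` are illustrative, not a standard):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class LlmCall:
    """One row in a hypothetical llm_calls fact table."""
    request_id: str
    model: str            # stable model identifier + version
    feature: str          # attribution dimension
    input_tokens: int
    output_tokens: int
    latency_ms: float
    retries: int
    prompt_hash: str      # tracks prompt drift across releases
    cache_hit: bool
    cost_usd: float       # computed from the price table at ingest

def prompt_hash(system: str, template: str) -> str:
    """Stable hash of system prompt + template text."""
    return hashlib.sha256((system + "\x00" + template).encode()).hexdigest()[:16]

row = LlmCall("req-1", "model-b-2026-01", "summarize", 1_200, 300,
              840.0, 0, prompt_hash("You are terse.", "Summarize: {text}"),
              False, 0.0105)
```

Emit the same fields as OpenTelemetry span attributes and the warehouse table stays in sync with the traces.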

Reduce tokens first: make prompts lighter

Optimizing prompts is the cheapest win. You keep the same model and calls, just with less to chew on.

Trim system prompts and templates

  • Shorten role instructions: Replace a 250-word style guide with five bullets. Use examples only when they change outcomes, and move them to a separate cacheable reference if possible.
  • Prefer explicit output schemas: If you need JSON, define a minimal schema and ask for nothing else. Use “Do not include commentary”. This removes verbose “analysis” text.
  • Kill repetition: If you echo the user’s input and your own instructions in every step, you’re burning tokens. Log a prompt hash and deduplicate boilerplate across calls.

Use tools over few-shot when possible

Few-shot examples inflate prompts. Tool calling can externalize knowledge. For instance, instead of embedding two examples of date math, expose a parse_date function or a lightweight rules service. Let the model decide to call it. The prompt shrinks and output stabilizes.

Bound conversation memory

Chats accrete context. Adopt a sliding window with summarization checkpoints. Summaries should be task-aware (“keep only user preferences for tone and constraints, drop small talk”). Cap historical tokens and store two tiers: a terse persistent profile and a short rolling window.
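The sliding window with summarization checkpoints can be sketched as follows; `summarize` and `count_tokens` are injected callables (assumptions, not a specific API), so any model and tokenizer plug in:

```python
def bound_history(messages, max_tokens, summarize, count_tokens):
    """Keep the newest messages under max_tokens; fold everything older
    into one summary produced by the injected `summarize` callable."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest-first
        t = count_tokens(msg)
        if used + t > max_tokens:
            break
        kept.append(msg)
        used += t
    older = messages[: len(messages) - len(kept)]
    summary = summarize(older) if older else ""
    return summary, list(reversed(kept))    # (terse tier, rolling window)
```

The returned pair maps directly onto the two tiers above: a terse persistent summary plus a short rolling window.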

Cache what’s predictable

Caching often cuts spend by 20–60% within days. The trick is to decide what to cache and how to invalidate without breaking correctness.

Deterministic caching for templates

Whenever a prompt template and inputs are identical, the output is often equivalent enough to reuse—especially for classification, extraction, or simple rewrites. Build a cache key from: model ID, prompt hash, normalized inputs, and “settings” (temperature, top-p). Store both the complete response and token counts. Use a moderate TTL (e.g., 7–30 days) and a version field that bumps when you change prompt logic.
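A key built exactly that way might look like the sketch below (the field layout is an assumption; the point is determinism plus an explicit version for invalidation):

```python
import hashlib
import json

def cache_key(model_id, prompt_hash, inputs, settings, version=1):
    """Deterministic cache key from model, prompt hash, normalized
    inputs, and sampling settings. Bump `version` when prompt logic
    changes to invalidate old entries wholesale."""
    normalized = json.dumps(
        {"inputs": inputs, "settings": settings, "v": version},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(
        f"{model_id}|{prompt_hash}|{normalized}".encode()
    ).hexdigest()

k1 = cache_key("model-b", "abc123", {"text": "hello"}, {"temperature": 0})
k2 = cache_key("model-b", "abc123", {"text": "hello"}, {"temperature": 0})
assert k1 == k2  # identical inputs always produce identical keys
```

`json.dumps` with `sort_keys=True` is what makes the key order-insensitive; without it, two semantically identical input dicts could miss the cache.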

Semantic caching for near-duplicates

Users ask the same question in many ways. A semantic cache hashes the meaning, not the surface form. Compute an embedding of the user query, then do a vector search (cosine or dot product) to find prior responses above a similarity threshold. Return the cached answer if it clears a second validation step (for example, a smaller LLM that checks “is this answer still accurate for this phrasing?”). Keep thresholds conservative at first (0.92+) and lower slowly with monitoring.
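The lookup logic is simple enough to sketch in a few lines. This toy version does a linear scan; at scale you would swap in a real vector index, and `embed` is an injected callable, not a specific provider's API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: embed the query, return the best prior
    answer if it clears the similarity threshold."""
    def __init__(self, embed, threshold=0.92):
        self.embed, self.threshold = embed, threshold
        self.entries = []                      # list of (vector, answer)

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None                            # miss: call the model

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

In production you would add the second validation step described above before returning a hit, plus per-tenant namespacing.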

What not to cache

  • Personalized outputs that contain user PII or private IDs unless you scope the cache to that user or tenant.
  • Time-sensitive answers such as prices or schedules. Attach a short TTL or a freshness predicate.
  • Stochastic generations where variation is a feature (creative brainstorming). Cache summaries or outlines instead of full prose.

Route to the right model

Not every call needs your most capable model. Build a small router that picks the cheapest model that meets the task’s quality bar.

Define quality tiers

  • Tier A (complex reasoning): your top model reserved for tasks that fail obvious tests on smaller models.
  • Tier B (structured transforms): mid-tier model for rewriting, classification, and tool planning.
  • Tier C (boilerplate): local or open-weight model for template filling, short summaries, and embeddings.

Make the default Tier B. Only escalate when an automated check flags uncertainty, such as a low self-rating score, a validator LLM disagreement, or downstream parse failure.

Put a governor in front

Wrap your router with budget-aware logic: if the day’s spend crosses a threshold, disable escalations where the quality hit is acceptable, or require explicit user opt-in for premium generations. Users understand optional “enhanced mode” when latency and accuracy trade-offs are transparent.
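Tiered routing plus the governor fits in one small class. A sketch under the assumptions above (tier names and the 80% cutoff are illustrative):

```python
class Router:
    """Default to Tier B; escalate to Tier A only when a check flags
    uncertainty, and only while the daily budget governor allows it."""

    def __init__(self, daily_budget_usd, escalation_cutoff=0.8):
        self.daily_budget = daily_budget_usd
        self.cutoff = escalation_cutoff      # stop escalating at this fraction
        self.spent_today = 0.0

    def escalations_allowed(self) -> bool:
        return self.spent_today < self.cutoff * self.daily_budget

    def pick_model(self, uncertainty_flagged: bool) -> str:
        if uncertainty_flagged and self.escalations_allowed():
            return "tier-a"
        return "tier-b"

    def record_cost(self, usd: float):
        self.spent_today += usd
```

Resetting `spent_today` on a daily timer and persisting it across restarts are left out for brevity but matter in practice.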

Batch, stream, and reuse context

You pay overhead for every network hop. Batching and streaming cut that overhead while smoothing latency for users.

Batch embeddings and RAG prework

  • Bulk embed on ingest: Don’t embed on every read. Compute embeddings when content enters the system, not at query time.
  • Group small documents: Concatenate short texts with separators and a max token target to reduce per-call overhead while keeping chunk boundaries trackable.
  • Use continuous batching servers: For high throughput, run an inference server that supports dynamic batching to keep GPUs full without hurting p95 latency.
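Grouping short documents under a token budget, while keeping chunk boundaries trackable, can be sketched like this (`count_tokens` is again an injected callable):

```python
def group_docs(docs, max_tokens, count_tokens, sep="\n---\n"):
    """Pack short documents into batches under max_tokens each,
    returning (batch_text, original_doc_indices) pairs so results
    can be mapped back to source documents."""
    batches, cur_text, cur_ids, used = [], [], [], 0
    for i, doc in enumerate(docs):
        t = count_tokens(doc)
        if cur_text and used + t > max_tokens:
            batches.append((sep.join(cur_text), cur_ids))
            cur_text, cur_ids, used = [], [], 0
        cur_text.append(doc)
        cur_ids.append(i)
        used += t
    if cur_text:
        batches.append((sep.join(cur_text), cur_ids))
    return batches
```

Keeping the index list alongside each batch is what makes the boundaries trackable after a bulk call returns.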

Stream to the UI

Streaming token-by-token feels faster even when total compute is the same. Start rendering as soon as the first tokens arrive. Give users a “Stop” button; canceled completions don’t burn extra output tokens, and average spend falls when people stop early.

Reuse summaries across steps

If a workflow has five LLM calls that all need a distilled context, generate that summary once and pass a single compact block forward. You’ll cut both latency and cost while improving consistency.

Make retrieval efficient

Retrieval-augmented generation (RAG) is powerful but easy to overspend on. Focus on chunking, indexing, and query shaping before you “upgrade” the generator.

Chunking that fits the question

  • Overlapping windows: 150–300 tokens with 10–20% overlap is a good start. Oversized chunks waste tokens and degrade re-ranking.
  • Structure-aware splits: Split on headings, paragraphs, or code blocks. Keep metadata (author, date) as separate fields to avoid polluting the text embedding.
  • Title-as-boost: Store titles and short abstracts separately for weighted retrieval (lightweight hybrid search).
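The overlapping-window scheme from the first bullet reduces to a few lines. A sketch that works on any token list (words are a rough stand-in for real tokenizer output):

```python
def chunk(tokens, size=200, overlap=0.15):
    """Split a token list into windows of `size` tokens, with `overlap`
    fraction shared between neighboring windows."""
    step = max(1, int(size * (1 - overlap)))   # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):        # last window reached the end
            break
    return chunks
```

With size=200 and 15% overlap, each window shares its last 30 tokens with the next window's first 30, which keeps sentences that straddle a boundary retrievable from both sides.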

Pick fit-for-purpose embedding models

Big isn’t always better for embeddings. Smaller, cheaper models often outperform on domain-specific tasks when fine-tuned or instruction-aligned. Benchmark on your own queries using a few dozen labeled pairs. Track recall@k and reranker hit rate, not just static leaderboard positions.

Rerank little, not a lot

Reranking improves precision but adds cost. Feed the generator 5–8 top chunks after a single rerank step. If you must run a second step (e.g., cross-encoder), apply it to only the top 20 candidates.

Hybrid local + hosted: offload the easy work

You don’t need a giant GPU fleet to save money with local or edge models. Push predictable, low-risk tasks close to the user or into your own infrastructure, then escalate to hosted models when needed.

What to run locally or in your VPC

  • Embeddings: A solid open-weight embedding model on CPU or a single GPU can handle large volumes at low cost.
  • Light summarization and redaction: Short content transforms (TL;DR, PII masking) work well on small models.
  • Draft-first workflows: Generate a coarse draft locally and ask a hosted model to edit and verify. You save tokens on long outputs.

Keep clear quality gates: if a local step produces malformed JSON or fails a small validator model, route up to a stronger hosted model automatically.

Guardrails that pay for themselves

Good guardrails reduce retries, which are silent cost multipliers. You can do this without heavy-handed policies.

Schema enforcement and small validators

  • JSON schemas: Ask for JSON, then validate strictly. If it fails, either repair with a small local model or retry with a short “fix to schema” prompt.
  • Reference checks: For RAG answers, require citations and verify that quoted spans actually exist in retrieved text. Reject and regenerate if they don’t.
  • Disallow empty loops: Cap retries to 1–2. If you hit the cap, surface an actionable error with a “Try again” button instead of auto-churning tokens.
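The validate-repair-cap loop from the bullets above can be sketched as one function; `generate` and `validate` are injected callables, and the repair instruction text is just an example:

```python
import json

def call_with_repair(generate, validate, max_retries=2):
    """Call the model, validate strictly, retry at most `max_retries`
    times with a short fix-to-schema instruction, then fail loudly
    instead of auto-churning tokens."""
    prompt_suffix = ""
    for _attempt in range(max_retries + 1):
        raw = generate(prompt_suffix)
        try:
            parsed = json.loads(raw)
            if validate(parsed):
                return parsed
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the repair retry
        prompt_suffix = "\nReturn ONLY valid JSON matching the schema."
    raise ValueError("schema validation failed after retries")
```

Surfacing the final `ValueError` to the UI as an actionable error is what replaces the silent retry loop.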

Forecasting and budgets that teams respect

Spreadsheets keep you honest. Turn your telemetry into a budget model that product and finance can understand.

Build a simple unit economics model

  • Inputs: DAU, sessions per user, calls per session, input/output tokens per call by feature, model mix, cache hit rates.
  • Outputs: Daily and monthly cost, p95 latency per feature, and sensitivity to modest changes (±10%).
  • Controls: Levers like max context tokens, cache TTLs, and router thresholds.

Review it weekly at first. Tie alerting to both actual spend and leading indicators like average input tokens and cache misses.
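The unit economics model above fits in one function. A simplified sketch that treats cache hits as free (real hits still cost storage and lookup, so this slightly understates spend):

```python
def monthly_cost(dau, sessions_per_user, calls_per_session,
                 in_tokens, out_tokens, in_price_per_m, out_price_per_m,
                 cache_hit_rate=0.0, days=30):
    """Monthly spend estimate from the levers in the bullet list.
    Cache hits are assumed free for simplicity."""
    daily_calls = (dau * sessions_per_user * calls_per_session
                   * (1 - cache_hit_rate))
    per_call = (in_tokens / 1e6 * in_price_per_m
                + out_tokens / 1e6 * out_price_per_m)
    return daily_calls * per_call * days

base = monthly_cost(10_000, 2, 8, 1_200, 300, 5.0, 15.0, cache_hit_rate=0.3)
low = monthly_cost(10_000, 2, 8, 1_080, 300, 5.0, 15.0, cache_hit_rate=0.3)
sensitivity = base - low   # effect of a 10% cut in input tokens
```

Run it with ±10% on each input to see which lever dominates; that is usually cache hit rate or input tokens, not model price.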

Price negotiations and provider diversity

If you cross a provider’s commit threshold, ask for a discount below the published price, or for usage credits. Keep at least two viable models per tier in your router. This protects you from outages and strengthens your hand in negotiations.

Security and privacy choices that lower total cost

Privacy sometimes seems like extra work, but it reduces compliance risk and data handling overhead later.

Practical steps

  • Minimal retention: Don’t store raw prompts or outputs longer than you need. Store hashes, token counts, and model IDs for analytics.
  • PII scrubbing at the edge: Redact obvious PII before hitting external APIs. This keeps caches safer and reduces vendor exposure.
  • Tenant-scoped caches: Use separate namespaces per customer to avoid cross-tenant leakage and simplify deletion.

A 30‑day rollout plan

Here’s a pragmatic schedule to bring costs under control without stalling momentum.

Week 1: Instrument and benchmark

  • Wire OpenTelemetry spans around all LLM calls. Capture tokens, latency, model IDs, retries, and prompt hashes.
  • Backfill a week of historical spend per feature from your logs. Identify the top three cost sinks.
  • Write a one-page SLO: target p95 latency, acceptable degradation under budget guardrails, and maximum tokens per call by feature.

Week 2: Prompt diet and deterministic cache

  • Shorten the top three system prompts by 30–50% without changing output format. Add strict JSON schemas.
  • Deploy a deterministic cache keyed on model ID + prompt hash + normalized inputs. Start with a 14-day TTL.
  • Add a “Stop generating” UI control to all long outputs. Track early stops.

Week 3: Routing and retrieval tuning

  • Introduce model tiers. Make Tier B your default; route to Tier A only on validator failure or user opt-in.
  • Re-chunk your RAG corpus with 200-token windows and 15% overlap. Benchmark recall and latency.
  • Reduce reranked candidates to 5–8. Require citations and verify spans exist in retrieved context.

Week 4: Semantic cache and batching

  • Deploy a semantic cache for public, non-personal queries with a high similarity threshold (≥0.92) and validator checks.
  • Batch embeddings on ingest and group short docs. If traffic is high, adopt a server with dynamic batching.
  • Enable budget-aware routing. When the daily cost hits 80% of target, pause automatic escalations.

Common pitfalls and how to avoid them

  • Overfitting to a benchmark: Always A/B against real user tasks. A tiny drop in BLEU or ROUGE may be invisible to users, while the price cut is very visible to your budget.
  • Cache poisoning: Namespaces by tenant, high similarity thresholds, and validator checks reduce risk. Log cache origins.
  • Unbounded retries: Retries hide reliability issues and multiply costs. Enforce a hard cap and exponential backoff with jitter.
  • Prompt sprawl: Without versioning, you’ll never know which change saved or cost you money. Treat prompts like code with hashes and change logs.
  • Ignoring first-token latency: Users feel the time to first character more than total time. Stream early and often.
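The hard cap with exponential backoff and jitter from the retries bullet can be sketched as a small wrapper (catching bare `Exception` is a simplification; in practice you would retry only on rate limits and server errors):

```python
import random
import time

def with_backoff(call, max_retries=2, base_delay=0.5, rng=random.random):
    """Retry a flaky call at most `max_retries` times with exponential
    backoff plus jitter, then re-raise instead of retrying forever."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise                       # hard cap reached: surface it
            # delay doubles each attempt; jitter spreads thundering herds
            delay = base_delay * (2 ** attempt) * (0.5 + rng())
            time.sleep(delay)
```

The jitter term matters at fleet scale: without it, every client that hit the same rate limit retries in lockstep and trips it again.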

A note on culture: share the numbers

FinOps works when everyone sees the graph. Give product and design a simple dashboard: spend per feature, tokens per call trend, cache hit rate, and p95 time to first token. Celebrate reductions the same way you celebrate feature launches. Postmortem regressions quickly. This makes cost a shared constraint, not a surprise bill.

What “done” looks like

You’ll know you’ve landed when your daily spend is predictable within ±10%, quality is stable or improved, and new features come with a cost forecast attached. At that point, your stack can grow without nasty shocks, and you can make bold bets—like a premium “enhanced” mode—because the guardrails are in place.

Summary:

  • Instrument costs before you optimize: tokens, latency, retries, cache hits, and prompt hashes.
  • Trim prompts and enforce schemas to cut tokens without hurting quality.
  • Use deterministic and semantic caching with safe keys, TTLs, and validator checks.
  • Route to model tiers and add budget-aware governors to prevent runaway spend.
  • Batch embeddings, stream outputs, and reuse summaries to reduce overhead.
  • Tune RAG with smart chunking, targeted reranking, and citation verification.
  • Offload easy work to local or VPC-hosted models with clear escalation rules.
  • Build a simple unit economics model and review it weekly with the team.
  • Adopt privacy practices that lower compliance risk and simplify caching.
  • Roll out in 30 days: instrument, prompt diet, cache, route, then batch and semantically cache.


Andy Ewing, originally from coastal Maine, is a tech writer fascinated by AI, digital ethics, and emerging science. He blends curiosity and clarity to make complex ideas accessible.