# LLM Pricing Operations
Use this playbook to run LLM pricing in production without constant fire drills. It focuses on operational safety: model churn, unknown IDs, gradual rollouts, and measurable tuning.
## 1) Keep model tables current (and safe by default)

For `type: token-based` routes, keep a route-level `models` map for the models you care about most. Treat it as your operator override layer on top of tollbooth defaults.

```yaml
routes:
  "POST /v1/chat/completions":
    upstream: openai
    type: token-based
    models:
      gpt-4o: "$0.05"
      gpt-4o-mini: "$0.005"
      claude-sonnet-4-5: "$0.02"
    fallback: "$0.02" # safe default for unknown models
```

Why this helps:
- New model IDs appear frequently and may not match your target margin.
- A conservative `fallback` prevents silent undercharging.
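The resolution order can be pictured as a two-step lookup — a minimal sketch of the assumed semantics; `resolvePrice` is a hypothetical name for illustration, not a tollbooth API:

```typescript
// Hypothetical illustration of route-level model pricing resolution:
// explicit entries in the models map win; any unknown ID gets the fallback.
function resolvePrice(
  models: Record<string, string>,
  fallback: string,
  model: string
): string {
  return models[model] ?? fallback;
}

const models = { "gpt-4o": "$0.05", "gpt-4o-mini": "$0.005" };
resolvePrice(models, "$0.02", "gpt-4o");        // known model → "$0.05"
resolvePrice(models, "$0.02", "gpt-5-preview"); // unknown model → "$0.02"
```

Because the fallback price is what every brand-new model ID earns, setting it near your most expensive mapped model keeps unknowns from silently undercharging.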
## 2) Handle fine-tunes and custom model IDs

Fine-tunes often use custom IDs (`ft:*`, provider-specific suffixes, or deployment names). Map known IDs explicitly and keep a safe fallback.

```yaml
routes:
  "POST /v1/chat/completions":
    upstream: openai
    type: token-based
    models:
      gpt-4o: "$0.05"
      "ft:gpt-4o:acme-support-v2": "$0.08"
      "ft:gpt-4o-mini:triage": "$0.015"
    fallback: "$0.03"
```

If you want fail-closed behavior for unknown models, reject them in a request hook:
```yaml
routes:
  "POST /v1/chat/completions":
    upstream: openai
    type: token-based
    hooks:
      onRequest: "hooks/enforce-model-allowlist.ts"
```

```typescript
const allowed = new Set(["gpt-4o", "gpt-4o-mini", "ft:gpt-4o:acme-support-v2"]);

export default async (ctx) => {
  const body = (ctx.req.body ?? {}) as { model?: string };
  const model = body.model;
  if (!model || !allowed.has(model)) {
    return { reject: true, status: 400, body: "Unsupported model" };
  }
};
```

## 3) Choose the right pricing basis
Use the pricing basis that matches cost variance:

| Basis | Best for | Tradeoff |
|---|---|---|
| Token-based (`type: token-based`) | Chat/completion APIs with variable output length | Price is less predictable per request |
| Flat per request (`price` or `match`) | Short, bounded requests | Heavy requests can erode margin |
| Time/window tiers (via `match` + policy) | Batch or session products | More operational complexity |
Practical rule:
- Start with token-based for general LLM APIs.
- Use flat pricing only when output size is tightly bounded.
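The rule follows from output variance. Here's a quick illustration of how a long reply erodes a flat price — the per-million-token rates and token counts below are made up for the example, not provider prices:

```typescript
// Illustrative only: rates ($/1M tokens) and token counts are invented.
function tokenCost(
  tokensIn: number,
  tokensOut: number,
  inRate: number,
  outRate: number
): number {
  return (tokensIn / 1e6) * inRate + (tokensOut / 1e6) * outRate;
}

const flatPrice = 0.01; // a hypothetical flat per-request price in dollars

// A bounded reply stays well under the flat price...
const shortReply = tokenCost(500, 200, 2.5, 10);
// ...but a long reply at the same flat price wipes out the margin.
const longReply = tokenCost(500, 4000, 2.5, 10);

console.log(shortReply < flatPrice, longReply > flatPrice); // prints "true true"
```

With token-based pricing both requests are charged in proportion to cost, so the variance is the buyer's problem rather than yours.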
## 4) Autopricing loop (analytics-driven)

Treat pricing as a control loop, not a one-time config:
- Capture usage and payment outcomes.
- Compute margin and conversion per model.
- Propose a bounded price update.
- Roll out gradually and compare results.
Recommended constraints for automatic changes:
- Max upward change per deploy: +15%
- Max downward change per deploy: -10%
- Global floor/ceiling per model family (for safety)
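The constraints above amount to a clamp applied before any deploy. A hedged sketch — `boundedUpdate` and the bounds shape are assumptions for illustration, not part of tollbooth:

```typescript
// Hypothetical bounded-update helper: applies the per-deploy step limits,
// then the per-family floor/ceiling, matching the constraints listed above.
interface FamilyBounds {
  floor: number;   // global minimum price for the model family
  ceiling: number; // global maximum price for the model family
}

function boundedUpdate(current: number, proposed: number, b: FamilyBounds): number {
  const maxUp = current * 1.15;  // max +15% per deploy
  const maxDown = current * 0.9; // max -10% per deploy
  const stepped = Math.min(Math.max(proposed, maxDown), maxUp);
  // Never step outside the family's global floor/ceiling.
  return Math.min(Math.max(stepped, b.floor), b.ceiling);
}

// A proposed doubling from $0.01 to $0.02 is capped at +15% (→ $0.0115):
const next = boundedUpdate(0.01, 0.02, { floor: 0.005, ceiling: 0.05 });
```

Running the clamp inside the pipeline (rather than trusting the proposal step) means a buggy margin calculation can never move prices more than one bounded step per deploy.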
## 5) Roll out price changes gradually

Avoid global flips. Canary with headers or route variants, then expand.

```yaml
routes:
  "POST /v1/chat/completions":
    upstream: openai
    match:
      - where: { headers.x-pricing-version: "v2" }
        price: "$0.012" # canary cohort
      - where: { headers.x-pricing-version: "v1" }
        price: "$0.01" # control cohort
    fallback: "$0.01"
```

During rollout, compare:
- Payment success rate
- Upstream success rate
- Conversion after the first `402`
- Margin per request and per model
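One way to encode that comparison is a promotion gate over cohort aggregates. The record shape, field names, and tolerance below are assumptions for illustration, not tollbooth output:

```typescript
// Hypothetical cohort aggregates, one per pricing version.
interface CohortStats {
  paymentSuccessRate: number;  // settled payments / payment attempts
  upstreamSuccessRate: number; // 2xx upstream responses / forwarded requests
  conversionAfter402: number;  // paid retries / 402 responses
  marginPerRequest: number;    // dollars
}

// Promote only if the canary holds the health metrics within a small
// tolerance while actually improving margin.
function canaryHealthy(canary: CohortStats, control: CohortStats): boolean {
  const tol = 0.02; // 2 points of acceptable regression (assumed threshold)
  return (
    canary.paymentSuccessRate >= control.paymentSuccessRate - tol &&
    canary.upstreamSuccessRate >= control.upstreamSuccessRate - tol &&
    canary.conversionAfter402 >= control.conversionAfter402 - tol &&
    canary.marginPerRequest > control.marginPerRequest
  );
}
```

Requiring a strict margin improvement (not just "no worse") keeps the loop from promoting price changes that add risk without adding revenue.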
## 6) Guardrails you should always have

- Set a conservative `fallback` for unknown models.
- Define per-model floors and ceilings in your pricing pipeline.
- Keep an allowlist for approved model IDs in high-risk routes.
- Add a denylist for deprecated/blocked models.
- Prefer `settlement: after-response` on expensive upstreams where failures are common. See Refund Protection.
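The denylist guardrail can mirror the allowlist hook shown earlier — a hedged sketch that reuses that example's hook shape (`ctx`, `reject`/`status`/`body`); the model IDs listed are placeholders:

```typescript
// Hypothetical denylist hook for deprecated/blocked models. IDs are
// placeholders; the hook shape follows the allowlist example above.
const denied = new Set(["text-davinci-003", "gpt-3.5-turbo-0301"]);

export function isDenied(model?: string): boolean {
  return !!model && denied.has(model);
}

export default async (ctx: { req: { body?: unknown } }) => {
  const body = (ctx.req.body ?? {}) as { model?: string };
  if (isDenied(body.model)) {
    return { reject: true, status: 400, body: "Model deprecated" };
  }
  // Fall through: unknown-but-not-denied models proceed to pricing,
  // where the conservative fallback price still applies.
};
```

Unlike the allowlist (fail-closed), a denylist fails open for unlisted models, so keep it paired with the `fallback` price rather than relying on it alone.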
## Logging and metrics for tuning

At minimum, log these fields on each request:

- `timestamp`
- `route`
- `provider`
- `requested_model`
- `resolved_pricing_mode` (`token-based`, `match`, or `flat`)
- `resolved_price`
- `pricing_source` (`route.models`, `default-table`, `fallback`, or `price.fn`)
- `tokens_in`, `tokens_out` (when available)
- `payment_required`
- `amount`
- `payment_settled` (boolean) and settlement latency
- `upstream_status` and upstream latency
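As a concrete shape, the fields above might be typed like this. The field names come from the list; the types, optionality, and `_ms` unit suffixes are assumptions:

```typescript
// Assumed log-record shape for the fields listed above.
type PricingMode = "token-based" | "match" | "flat";
type PricingSource = "route.models" | "default-table" | "fallback" | "price.fn";

interface PricingLogRecord {
  timestamp: string;                  // ISO 8601
  route: string;                      // e.g. "POST /v1/chat/completions"
  provider: string;
  requested_model: string;
  resolved_pricing_mode: PricingMode;
  resolved_price: string;             // e.g. "$0.02"
  pricing_source: PricingSource;
  tokens_in?: number;                 // when available
  tokens_out?: number;
  payment_required: boolean;
  amount: string;
  payment_settled: boolean;
  settlement_latency_ms: number;
  upstream_status: number;
  upstream_latency_ms: number;
}

// Example record for an unknown model that hit the fallback price:
const record: PricingLogRecord = {
  timestamp: "2025-01-01T00:00:00Z",
  route: "POST /v1/chat/completions",
  provider: "openai",
  requested_model: "gpt-5-preview",
  resolved_pricing_mode: "token-based",
  resolved_price: "$0.02",
  pricing_source: "fallback",
  payment_required: true,
  amount: "$0.02",
  payment_settled: true,
  settlement_latency_ms: 120,
  upstream_status: 200,
  upstream_latency_ms: 850,
};
```

Keeping `pricing_source` as a closed union makes the fallback-hit-rate dashboard a simple count over one field.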
Dashboards to build first:
- Margin by model and provider
- 402-to-paid conversion by model and client cohort
- Unknown-model rate over time
- Fallback-price hit rate (should trend down as mappings improve)
- Settlement failure and timeout rates
## Operational checklist

- Confirm provider model list updates and deprecations.
- Add/adjust `models` overrides for top traffic models.
- Review unknown-model and fallback hit rates.
- Review fine-tune IDs in production traffic and map hot paths.
- Recompute margin and conversion by model.
- Apply bounded price updates (respect floors/ceilings).
- Roll out to canary cohort, compare against control.
- Promote rollout only after conversion and settlement checks pass.
- Archive pricing change notes (what changed, why, measured impact).