LLM model routing: stop paying premium prices for tiny jobs
LLM model routing helps you cut AI spend without wrecking output quality. Use this founder routing matrix before your bill eats margin.
Model routing is founder finance, not engineering trivia.
If every customer request goes to your most expensive model, your AI product is leaking margin quietly while the demo smiles at you.
That sounds harsh. Good. AI bills rarely punch you on day one. They wait until the product works, usage rises, retries multiply, tool calls hide in the background, and the founder suddenly discovers that "AI margin" was mostly wishful thinking.
TL;DR: LLM model routing means sending each request to the cheapest model, tool, cached path, batch job, or human review step that can pass your quality bar. For bootstrapped founders, the goal is not using the smartest model everywhere. The goal is matching task risk, buyer value, delay tolerance, privacy needs, and answer difficulty to the right model path, then proving the decision with evaluation, logs, and cost per completed job.
I am Violetta Bonenkamp, founder of Mean CEO, CADChain, and F/MS Startup Game. I love AI tools when they make a small team dangerous. I dislike AI tools when they turn a founder into a reseller of expensive tokens with no margin discipline.
The F/MS AI for startups workshop has the right spirit: combine models, automation, and distribution-first systems, but keep the work practical and tested. AI can save time. It can also make founders lazy with numbers.
Here is the founder filter:
Do not ask, "Which model is best?"
Ask, "Which model is enough for this job, and can I prove it before the customer or the invoice proves me wrong?"
What LLM Model Routing Actually Means
LLM model routing is the logic that decides where an AI request should go.
The destination might be:
- A small language model.
- A frontier model.
- A domain-specific model.
- A cached answer.
- A retrieval step.
- A batch job.
- A rule-based check.
- A human reviewer.
- A fallback provider.
- No model at all.
That last one matters.
Some tasks should not touch a large language model. If a customer asks for order status, calculate it from your system. If a support ticket needs a password reset link, call the right tool. If a finance workflow needs a missing invoice number, parse the field. Do not summon a premium model like a priest for every little admin task.
Amazon Bedrock intelligent prompt routing shows where the market is going: a single serverless endpoint can route between foundation models in the same family by predicting response quality and cost for each request. That is the grown-up version of a truth bootstrappers should learn early.
Not every request deserves the same brain.
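The destination list above can be sketched as a tiny dispatcher. This is a minimal sketch, not a recommended design: the intent names, tiers, and routes are hypothetical illustrations, and a real router would classify intent rather than match strings.

```python
from enum import Enum, auto

class Route(Enum):
    NO_MODEL = auto()        # deterministic logic, DB lookup, or tool call
    SMALL_MODEL = auto()
    FRONTIER_MODEL = auto()
    HUMAN_REVIEW = auto()

def route_request(intent: str) -> Route:
    """Toy dispatcher: cheap deterministic paths first, models last."""
    if intent in {"order_status", "password_reset", "invoice_lookup"}:
        return Route.NO_MODEL        # answer from your own system
    if intent in {"tag_ticket", "extract_fields"}:
        return Route.SMALL_MODEL     # narrow task, cheap model is enough
    if intent in {"contract_review"}:
        return Route.HUMAN_REVIEW    # trust beats speed here
    return Route.FRONTIER_MODEL      # genuine open-ended reasoning
```

The point of the sketch is the ordering: the router tries to avoid a model entirely before it decides which model to pay for.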
Why Founders Overspend On LLMs
Founders overspend on LLMs because the first working version usually uses the strongest model.
That is rational during testing.
It becomes dangerous after launch.
The usual overspend pattern looks like this:
- The founder builds with the strongest model because it behaves well.
- The product starts to work.
- The prompt grows.
- Retrieval adds more text.
- Tool calls add hidden extra calls.
- Output gets longer because nobody limited it.
- Retries happen in the background.
- Support asks for more safety checks.
- A second workflow gets copied from the first.
- The customer pays one price while model cost grows per request.
Suddenly, the product is popular and still financially stupid.
AI evaluation and observability make model routing defensible. If you do not measure output quality, route decisions, token use, retries, and cost per completed job, you are guessing with a credit card.
The product may still be useful.
The business may still be weak.
The Pricing Reality Founders Must Watch
LLM pricing changes often, so founders should link their cost model to current provider pages and update assumptions monthly.
On the current OpenAI API pricing page, frontier, mini, cached-input, batch, and flex processing prices differ sharply. OpenAI also lists Batch API savings of 50% for asynchronous work completed within 24 hours.
Claude API pricing shows the same founder lesson from another angle: Opus, Sonnet, and Haiku sit at different price levels, cache writes cost differently from cache hits, and Anthropic notes that the Opus 4.7 tokenizer may use up to 35% more tokens for the same fixed text.
The Gemini Developer API pricing page also separates free, paid, and enterprise tiers, and it names context caching plus Batch API with a 50% cost reduction for paid production work.
Do not memorize the numbers.
Memorize the shape:
- Input tokens cost money.
- Output tokens usually cost more.
- Cached input can be much cheaper.
- Batch work can be cheaper when the buyer can wait.
- Smaller models can be enough for many steps.
- Some free tiers use submitted content to improve provider products.
- Regional or data-residency choices can change the bill.
- Tool calls, search calls, storage, and code execution can sit outside the chat price.
That is why the routing layer should know more than "send request to model."
It should know:
- What task is being asked.
- How risky the answer is.
- How fast the answer must be.
- Whether the prompt includes repeated context.
- Whether retrieval is needed.
- Whether the user pays enough for the route.
- Whether a cheaper model passed the same test before.
- Whether a human should approve the step.
This is not overengineering.
It is margin hygiene.
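The eight questions above are the inputs a routing layer needs on every request. As a sketch, they fit in one small structure; the field names and value conventions here are illustrative assumptions, not a schema from any library.

```python
from dataclasses import dataclass

@dataclass
class RoutingSignals:
    """What the routing layer should know before picking a route.
    Field names and value conventions are illustrative."""
    task: str                       # what is being asked
    risk: str                       # "low" | "medium" | "high" | "irreversible"
    max_latency_s: float            # how fast the answer must be
    repeated_context: bool          # same large prompt or docs as earlier calls?
    needs_retrieval: bool           # must the answer come from sources?
    plan_tier: str                  # does the user pay enough for the route?
    cheap_model_passed_eval: bool   # did a cheaper model pass this test before?
    needs_human_approval: bool      # should a human sign off on this step?
```

Filling this structure per request is cheap; the expensive mistake is routing without knowing these answers at all.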
The Founder Routing Matrix
Use this table before you pick a default model.
| Default route | Escalate or upgrade when | What to log |
| --- | --- | --- |
| Rules or small model | Low confidence or ambiguous intent | Input, route, confidence, final label |
| Small model with schema check | Missing fields, money data, legal data | Source, extracted fields, validation errors |
| Small model | Brand-sensitive copy or high-value buyer | Before text, after text, reviewer edits |
| Retrieval plus cheaper model | Angry user, refund risk, policy conflict | Source links, answer, escalation reason |
| Mid-tier model | Named account, sensitive claim, custom pricing | Account data used, draft, human approval |
| Intake model plus human review | Any advice, diagnosis, contract judgment | Intake summary, boundary note, reviewer |
| Coding model | Production change, security area, failed tests | Prompt, files touched, tests run |
| Stronger model | Multi-step action or tool write access | Plan, tool rights, stop condition |
| Batch route | Buyer needs live response | Job size, turnaround time, cost per item |
| Cache or retrieval answer | Source changed or answer not found | Cache hit, source version, fallback route |
The table forces one useful question:
What is the cheapest route that still protects the buyer, the margin, and the founder’s reputation?
If you cannot answer, do not add another model. Add measurement.
Model Routing Is A Product Decision
Many teams treat model choice as a back-end setting.
That is too small.
Model routing changes the product promise.
If you route everything through a small model, the product may be cheap and fast, but weaker on hard reasoning. If you route everything through a premium model, the product may look smarter, but the price may collapse under usage. If you route sensitive work through a provider the buyer does not approve, the sale may die before the demo.
So the routing decision should sit inside product strategy.
Ask:
- Is the buyer paying for speed, accuracy, privacy, audit trail, or saved labor?
- Which errors can be fixed later, and which errors create a serious incident?
- Does the customer need an instant answer, or can the job run in the background?
- Can a human approval step raise trust enough to win the sale?
- Which parts of the workflow are repetitive enough for caching?
- Which parts deserve a small model or local model?
- Which parts should be blocked instead of routed?
If you are building inside a narrow industry, many profitable workflows do not need a frontier model. Use Small language models for cheaper and private AI to ask whether a smaller, cheaper, more private model can do the paid job. They need a narrow task, clean data, strict boundaries, and proof that the cheaper path works.
The CADChain angle is similar. CAD files, design rights, supplier access, and industrial data do not become safer because a model is large. They become safer when access, logs, permissions, and evidence are handled properly. The CADChain April 2026 AI model release analysis is useful for founders because it maps model releases to budget pressure, not benchmark applause alone.
The Four Routing Layers
A founder-friendly routing system has four layers.
Task layer
This layer asks what the user is really trying to do.
It might classify the request as support, sales, analysis, code, content, data extraction, decision support, or tool action.
Use cheap logic here where possible.
If the request is "reset my password," do not use a large model to compose poetry about account access. Route to the account flow.
Risk layer
This layer asks what can go wrong.
Risk includes money, legal exposure, health claims, security, personal data, customer anger, brand claims, and tool write access.
Low-risk content can use cheaper routes.
High-risk work may need stronger reasoning, extra retrieval, a refusal policy, or a named human.
Context layer
This layer asks what the model must know.
Some tasks need no context. Some need recent product docs. Some need user history. Some need company policy. Some need source files. Some need a database call instead of text generation.
Prompt caching helps when the same large instructions or documents repeat. Retrieval helps when the answer must come from approved sources. A small model may work when the context is structured and the output is narrow.
Business layer
This layer asks whether the route makes money.
It should know plan tier, customer value, usage cap, margin target, free-trial limits, and whether the task creates paid value.
This is where founders get shy.
Do not be shy.
If a user pays EUR19 per month and sends 5,000 premium-model requests, the product is subsidizing bad behavior. Either cap usage, route cheaply, charge for volume, batch work, cache repeats, or change the promise.
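A per-plan gate for the business layer can be this small. The plan names, caps, and route labels below are made-up illustrations; the idea is simply that premium routes are budgeted per plan instead of silently subsidized.

```python
def allowed_route(plan: str, premium_calls_this_month: int) -> str:
    """Hypothetical business-layer gate: cap premium-model usage per plan.
    Plan names and limits are illustrative, not recommendations."""
    premium_caps = {"free": 0, "starter": 200, "pro": 2000}  # calls per month
    if premium_calls_this_month < premium_caps.get(plan, 0):
        return "premium_model"
    return "batch_or_small_model"  # over cap: cheaper or delayed route

# The EUR19 user sending 5,000 premium requests gets the cheaper route:
allowed_route("starter", 5000)  # -> "batch_or_small_model"
```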
A Simple Routing Rule Set For Your First Version
You do not need a giant AI gateway on day one.
Start with a small set of rules.
Use this:
Rule 1: No model when deterministic logic works. If a database, calculator, search query, or business rule can answer, use that first.
Rule 2: Small model for sorting, tagging, cleanup, extraction, and short rewrites. These jobs often need consistency more than genius.
Rule 3: Retrieval before bigger reasoning. If the model lacks the right source, a stronger model may hallucinate more elegantly. Fix source access first.
Rule 4: Premium model only for high-value uncertainty. Use it when the request needs multi-step reasoning, high-value synthesis, serious ambiguity, coding depth, or agent planning with consequences.
Rule 5: Human review for irreversible actions. If the AI sends money, updates customer records, changes production code, sends legal text, or touches sensitive files, add approval.
Rule 6: Batch anything that can wait. Research jobs, content drafts, document summaries, data labeling, nightly checks, and internal reports often do not need instant output.
Rule 7: Cache repeated context. Long system prompts, policy packs, product docs, and repeated help-center context should not be paid for again and again if a provider supports caching.
Rule 8: Log the route reason. A route without a reason becomes unfixable later.
This first version is enough to save money and teach you where the real routing logic should live.
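The eight rules above are an ordered chain where the first match wins. Here is one way that first version might look in code; the dictionary keys, task names, and route labels are assumptions for illustration.

```python
def first_route(req: dict) -> str:
    """Rules 1-8 as an ordered chain; first match wins.
    Keys, task names, and route labels are illustrative assumptions."""
    if req.get("irreversible"):              # Rule 5: approval before action
        return "human_review"
    if req.get("deterministic"):             # Rule 1: no model when logic works
        return "rules_engine"
    if req["task"] in {"tag", "sort", "extract", "rewrite_short"}:  # Rule 2
        return "small_model"
    if req.get("needs_sources"):             # Rule 3: fix source access first
        return "retrieval_plus_cheap_model"
    if req.get("can_wait"):                  # Rule 6: batch what can wait
        return "batch"
    # Rule 7 (cache repeated context) and Rule 8 (log the route reason)
    # apply across every branch above; omitted here for brevity.
    return "premium_model"                   # Rule 4: high-value uncertainty
```

Keeping the chain ordered this way means a request only reaches the premium model after every cheaper route has declined it.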
When To Use An AI Gateway
At some point, a simple provider call becomes messy.
You may need an AI gateway when:
- You use more than one provider.
- You need fallbacks when a provider fails.
- You need spend caps per customer.
- You need model aliases.
- You need retries without chaos.
- You need logs across all model calls.
- You need tenant-level budgets.
- You need a consistent API for the team.
- You want to test two model routes safely.
LiteLLM routing docs show the practical shape: load balancing across deployments, queueing, cooldowns, fallbacks, timeouts, retries, and Redis-backed tracking for token and request limits. That is useful when you have enough usage to justify the extra layer.
But do not buy or build a gateway because the architecture diagram looks serious.
Buy or build it when your bill, reliability, provider mix, or audit trail needs it.
Until then, a simple routing table and honest logs may beat a fancy setup nobody understands.
The Unit Cost Formula Founders Should Track
A founder does not need a PhD in model pricing.
She needs a small cost model.
Start here:
Cost per completed job =
model input cost
+ model output cost
+ tool calls
+ search or retrieval calls
+ code or container cost
+ retries
+ cache writes
+ human review minutes
+ failed job cost
Then add:
Gross margin per job =
customer revenue per job - cost per completed job
Keep it crude at first.
Crude numbers beat fantasy.
If a customer pays EUR1 for a completed workflow and the model path costs EUR0.40 before support, infrastructure, refunds, and staff time, the product is already in trouble.
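The formula above, turned into a crude spreadsheet-in-code. The parameter names and the failed-job treatment are one possible sketch, not accounting advice; all figures are in EUR.

```python
def cost_per_job(input_cost, output_cost, tool_calls=0.0, retrieval=0.0,
                 compute=0.0, retries=0.0, cache_writes=0.0,
                 review_minutes=0.0, review_rate_per_min=0.0,
                 failed_job_share=0.0):
    """Crude cost model mirroring the formula above. EUR per completed job.
    failed_job_share spreads the cost of failed jobs over completed ones."""
    base = (input_cost + output_cost + tool_calls + retrieval + compute
            + retries + cache_writes + review_minutes * review_rate_per_min)
    return base * (1 + failed_job_share)

def gross_margin_per_job(revenue, cost):
    return revenue - cost

# The article's example: EUR1.00 revenue against a EUR0.40 model path
margin = gross_margin_per_job(1.00, cost_per_job(0.25, 0.15))
# margin is about 0.60 before support, infrastructure, refunds, staff time
```

Crude, as promised. But running real token counts through even this will expose a margin trap faster than any pricing debate.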
Compute spend has become part of pricing. Use GPU FinOps for AI startups to connect model calls, GPU spend, pricing, and margin before usage grows. You cannot set pricing by vibes when the product calls models in the background.
The Mean CEO guide to AI tools for solo founders is useful context too because small teams can get more done with AI, but only if tools reduce cost and time instead of hiding new bills.
How To Test A Cheaper Model Without Breaking Trust
The lazy version of model routing is dangerous.
It says, "Let’s send more traffic to the cheap model and see if customers complain."
No.
Do this instead.
Take 50 to 200 real prompts, tickets, files, or workflow inputs. Remove private data where needed.
Write what a correct answer, refusal, tool route, or escalation should look like.
Do not expose the cheaper route to customers yet.
Use human review for quality, source fit, risk, tone, and task completion. Use automated checks for field shape, missing fields, toxicity, policy words, and link presence.
Do not average everything into one score. A cheap model may be excellent for extraction and bad for reasoning.
Start with the request classes where the cheaper model passed cleanly.
Monitor route, model, input size, output size, refusal, fallback, cost, edit rate, and support complaints.
OpenTelemetry’s GenAI semantic conventions are worth knowing because the industry is moving toward shared names for GenAI events, metrics, model spans, agent spans, and provider attributes. You do not need to become an observability vendor. You need enough trace data to explain what happened.
If you want the deeper article cluster, observability for distributed AI applications is the natural next stop after routing because route decisions without traces become folklore.
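For a feel of what those shared names buy you, here is a rough sketch of the attributes a model-call span might carry. The `gen_ai.*` keys are from the OpenTelemetry GenAI semantic conventions (still marked experimental, so names may shift); the `app.*` keys and all values are my own illustrative additions.

```python
# gen_ai.* attribute names per the OTel GenAI semantic conventions
# (experimental); values are made up for illustration.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "small-model-v1",   # hypothetical model alias
    "gen_ai.usage.input_tokens": 1840,
    "gen_ai.usage.output_tokens": 212,
}

# Routing context worth recording alongside, under your own namespace:
span_attributes.update({
    "app.route.reason": "low_risk_extraction",  # Rule 8: log the route reason
    "app.retry.count": 0,
    "app.customer.plan": "pro",
})
```

The standard names matter less than the habit: every model span should carry enough attributes to explain, later, why that route was chosen.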
The Pricing Model Has To Match The Route
Many AI founders price like SaaS and spend like usage-based infrastructure.
That mismatch hurts.
Seat pricing can work when model usage is predictable.
Usage pricing can work when customers understand volume.
Outcome pricing can work when you control the workflow and know your cost per completed job.
Flat pricing is dangerous when:
- Customers can generate unlimited work.
- Output length is not capped.
- The route always uses premium models.
- The product does not cache repeated context.
- Free users can trigger expensive calls.
- You do not separate low-risk from high-risk work.
- Failed jobs still cost money.
Founders love simple pricing because buyers like it.
Buyers also like companies that survive.
So keep simple pricing on the page, but put strict route controls underneath:
- Monthly included jobs.
- Fair-use limits.
- Premium route only on paid tiers.
- Batch route for low-price plans.
- Human review as an add-on.
- Stronger model only when the product promise requires it.
- Shorter output defaults.
- Source retrieval before expensive reasoning.
This is not being cheap.
This is being alive.
Common Routing Mistakes
The first mistake is routing by model brand instead of task class.
"Use the famous model" is not a routing plan.
The second mistake is chasing public benchmarks without private evals.
Benchmarks can guide curiosity. Buyer tasks guide the business.
The third mistake is ignoring output tokens.
A verbose answer can cost more than the input. Cap answer length and train the product to answer the job, not perform intelligence.
The fourth mistake is treating retries as free.
Retries can double cost while hiding product failure. Log them.
The fifth mistake is using a premium model to fix bad retrieval.
If your source pack is wrong, a smarter model may only create a more convincing wrong answer.
The sixth mistake is routing sensitive work without a buyer promise.
If the customer cares about data location, human review, logging, or vendor choice, your route becomes a sales issue.
The seventh mistake is never revisiting routes.
New model prices, model quality, context windows, caching rules, and provider terms change. Your route should be reviewed monthly or after any major model release.
The eighth mistake is giving free users the same route as paid customers.
Generosity is nice. Unpriced compute is not a growth strategy.
The Europe Angle: Privacy, Data Location, And Margin
European founders face an extra routing question:
Where does the data go?
For many buyers, especially in health, finance, government, engineering, education, and industry, model routing is tied to data location, vendor trust, and records.
That does not mean every European startup must self-host from day one.
It means the founder should know:
- Which data goes to which provider.
- Which routes use free tiers.
- Which providers may use submitted content for product improvement.
- Which customers need paid tiers, data controls, or regional routing.
- Which workflow can run on smaller or local models.
- Which logs prove the route.
The broader AI infrastructure gap in Europe creates a strange advantage for bootstrappers. Scarcity forces better discipline. Founders who cannot outspend funded rivals can still out-route them.
Smaller model where possible.
Premium model where needed.
Human review where trust needs it.
No model where rules are enough.
That is not glamorous. It is how a small AI company stays alive.
A 7-Day Routing Audit
Use this if your AI bill already feels weird.
Day 1: Export every model call. Capture model name, route, input tokens, output tokens, provider, customer, request type, retry count, and price estimate.
Day 2: Rank calls by total spend. Find the top 20 request types by cost, not by volume.
Day 3: Mark risk level. Low risk, medium risk, high risk, irreversible. Do not let low-risk jobs steal premium model budget.
Day 4: Test a cheaper route. Run a side-by-side eval on the top expensive low-risk class.
Day 5: Add caps. Limit output length, retry count, tool calls, and free-tier usage.
Day 6: Add one cache. Cache repeated instructions, policy packs, product docs, or frequent Q&A where provider rules make sense.
Day 7: Change pricing or route access. If the product still loses money, change the plan limits or reserve premium routes for paid usage.
At the end, write one page:
- What changed?
- What got cheaper?
- What stayed risky?
- Which route needs human review?
- Which customer plan needs a new limit?
- Which task deserves its own eval set?
That one page is more useful than a meeting about AI strategy.
FAQ About LLM Model Routing
What is LLM model routing?
LLM model routing is the decision logic that sends each AI request to the right model, tool, cache, batch job, or human path. A good router looks at task type, risk, context, customer plan, answer difficulty, delay tolerance, data rules, and cost. The goal is to use the cheapest route that still passes the product’s quality bar. For a founder, routing is where product, engineering, finance, and trust meet.
Why does LLM model routing matter for bootstrapped startups?
Bootstrapped startups cannot hide weak unit economics behind a large funding round. If an AI product uses a premium model for every request, the business may lose money as usage grows. Routing helps founders protect margin by using smaller models, cached context, batch jobs, rules, or human review where each path makes sense. It also makes the product easier to explain to buyers who ask where data goes and how hard tasks are handled.
When should a startup use a smaller model?
A startup should test a smaller model for tasks with narrow output, stable structure, low risk, and clear pass or fail checks. Good candidates include classification, tagging, short rewrites, field extraction, formatting, simple support drafts, and internal summaries. The founder should not switch blindly. Run the smaller model against real tasks, score the results, then route only the request classes where it performs well enough.
When should a startup pay for a premium model?
Use a premium model when the task has high uncertainty, high buyer value, multi-step reasoning, agent planning, complex code, sensitive synthesis, or serious consequences if the answer is wrong. Premium models should earn their place in the route. They are useful when they protect revenue, reduce human work, or handle cases cheaper models fail. They are wasteful when they rewrite short text, classify easy tickets, or answer from a source a cheaper path can read.
How do caching and batching reduce LLM spend?
Caching reduces spend when the same instructions, documents, or context are reused across requests and the provider offers a cheaper cached-input path. Batching reduces spend when the work can wait and the provider gives a lower price for asynchronous processing. Founders should use caching for repeated system prompts, policy packs, help-center context, and recurring analysis. Use batching for nightly reports, content drafts, data cleanup, internal review, and bulk document jobs.
What should founders log for model routing?
Founders should log the request type, route chosen, model used, provider, input size, output size, cache hit, retry count, tool calls, fallback path, human review status, estimated cost, and final outcome. The route reason matters too. Without it, the team cannot tell whether the system chose the model because of risk, customer tier, answer difficulty, provider failure, or a bug. Good logs turn routing from guesswork into product evidence.
Should I build my own model router or use an AI gateway?
Start with your own simple router if you have one provider, a few request classes, low volume, and clear rules. Use an AI gateway when you need several providers, fallbacks, spend caps, model aliases, retry rules, centralized logs, or per-customer budgets. A gateway is not a badge of seriousness. It is useful when it removes real operating pain. Before that, a small routing table and clean logs are enough.
How does model routing affect pricing?
Model routing should shape pricing because different customer plans can trigger different costs. Free or low-price users may need cheaper routes, batch routes, shorter outputs, and strict caps. Paid users may deserve faster routes, larger context, premium models, or human review. If the pricing page promises unlimited usage while the product spends money per request, the founder has created a margin trap. The route and the price must agree.
How often should a startup review model routes?
Review routes monthly, after any major model release, after a pricing change, after a product workflow change, and after a bad customer incident. AI models, token prices, context caching rules, batch terms, provider policies, and regional options change often. A route that was sensible three months ago may become expensive or weak today. Treat routing as part of product maintenance, not a one-time engineering setting.
What is the fastest way to start model routing this week?
Export your last 100 to 500 model calls, group them by request type, and rank them by spend. Pick the most expensive low-risk task class. Build a small eval set from real examples, test a cheaper model or cached route, and ship that route only if it passes. Then add logging for route reason, token use, retry count, and cost per completed job. One clean route can pay for the audit.
The Bottom Line
LLM model routing is how a founder stops treating AI like magic and starts treating it like costed production work.
Use the strongest model when the job earns it.
Use a smaller model when the task is narrow.
Use caching when context repeats.
Use batch work when the buyer can wait.
Use human review when trust is worth more than speed.
Use no model when rules solve the job.
Bootstrapped founders do not need the loudest AI stack. They need an AI product that works, sells, and keeps enough margin to survive the next invoice.
