Vibe Coding | Premium | May 22, 2026 | 6 min read

The Multi-Model Routing System I Use to Cut AI Costs Without Slowing Down

Running one model for everything is expensive and slow. Here's the routing logic I use to match the right model to the right task.

Most builders pick a model and stick with it. Claude for everything, or GPT-4 for everything, or whatever feels familiar. That works until you look at your API bill and realize you've been using a sledgehammer to crack walnuts for the last 30 days.

The smarter play is routing. Not complicated orchestration. Just a clear decision layer that sends each task to the model that handles it best at the lowest cost. That's it.

Here's exactly how I think about it and what we run at 47.

Why Routing Matters

Token costs are not equal across models. A task that costs you $0.12 with GPT-4o might cost $0.004 with GPT-4o-mini or Haiku. If that task runs 500 times a day in your app or workflow, that difference is real money.
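To make that gap concrete, here is a quick back-of-the-envelope calculation using the per-call figures above. The helper name and the 30-day month are my own assumptions for illustration; plug in your real per-call costs.

```python
def daily_cost(cost_per_call: float, calls_per_day: int) -> float:
    """Total daily spend for one task at a given per-call cost."""
    return cost_per_call * calls_per_day


calls_per_day = 500                          # volume from the example above
large = daily_cost(0.12, calls_per_day)      # larger model, $0.12 per call
small = daily_cost(0.004, calls_per_day)     # smaller model, $0.004 per call

daily_savings = large - small
monthly_savings = daily_savings * 30         # assumes a 30-day month

print(f"daily: ${large:.2f} vs ${small:.2f} (saves ${daily_savings:.2f})")
print(f"over 30 days: ${monthly_savings:.2f}")
```

At these numbers the single task goes from about $60 a day to about $2 a day.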

But it's not just cost. Some models are faster. Some are better at specific tasks. Claude Sonnet is exceptional at code comprehension and nuanced instruction-following. GPT-4o-mini is fast and cheap for classification, simple extraction, and structured output. Haiku is nearly instant for routing decisions themselves. Gemini Flash is wild fast for long-context summarization.

Treating all of these as interchangeable is waste. Routing fixes that.

The Three Task Tiers

Before you can route, you need a mental model for what tasks actually look like. I break everything into three tiers.

Tier 1: Lightweight Tasks

These are fast, cheap, and don't require deep reasoning. Classification. Intent detection. Summarizing short text. Extracting structured data from a clean input. Reformatting. Simple yes/no decisions.

Models that work here: GPT-4o-mini, Claude Haiku, Gemini Flash. You want speed and low cost. The output doesn't need to be elegant. It needs to be accurate and fast.

Tier 2: Mid-Weight Tasks

These need more reasoning but aren't open-ended. Writing a function given a clear spec. Reviewing a small code block for bugs. Drafting a structured response from a template. Generating SQL from a schema. These tasks have a right answer, but getting there takes some thinking.

Models that work here: Claude Sonnet, GPT-4o. The cost bump is worth it. These models follow complex instructions reliably and produce output you can actually use without heavy review.

Tier 3: Heavy Tasks

Open-ended reasoning. Architectural decisions. Long-form generation where quality matters. Tasks where the model needs to hold a lot of context and make judgment calls throughout. Debugging a subtle issue with no clear root cause. Planning a multi-step agent workflow.

Models that work here: Claude Opus, GPT-4o at high context, o3 for hard reasoning problems. You use these sparingly. They're expensive. They're also worth it when the task actually needs them.
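If you want the three tiers in code rather than in your head, a small lookup table is enough. This is a minimal sketch: the strings are display names taken from this article, not exact API model IDs, and treating the first entry in each list as the default is my own convention.

```python
from enum import Enum


class Tier(Enum):
    LIGHT = "tier_1"
    MID = "tier_2"
    HEAVY = "tier_3"


# Display names from the article. Swap in the exact model IDs your
# providers expect; these strings are placeholders, not API identifiers.
CANDIDATES = {
    Tier.LIGHT: ["GPT-4o-mini", "Claude Haiku", "Gemini Flash"],
    Tier.MID: ["Claude Sonnet", "GPT-4o"],
    Tier.HEAVY: ["Claude Opus", "GPT-4o (high context)", "o3"],
}


def default_model(tier: Tier) -> str:
    """Return the first candidate for a tier (assumed to be the default)."""
    return CANDIDATES[tier][0]


print(default_model(Tier("tier_2")))
```

Keeping the tier labels as enum values means the same strings can come straight back from a classifier and be looked up without extra parsing.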

The Routing Logic

Here's the decision flow I actually use. It's not an algorithm. It's a set of questions you answer before you fire the API call.

Question 1: Is the output schema known? If you know exactly what format you want back, a lighter model can usually handle it. Structured output tasks with a clear schema lean Tier 1 or Tier 2.

Question 2: How long is the context? Short context with a simple task, go light. Long context where the model needs to synthesize across a lot of text, bump up. But also check Gemini Flash here since its long-context performance is underrated and the cost is low.

Question 3: How bad is a wrong answer? If the output feeds directly into a user-facing feature or a downstream action with real consequences, spend the extra tokens. If it's an internal classification that gets reviewed anyway, save the cost.

Question 4: How often does this task run? A task that runs once a day in a human-triggered workflow can afford an expensive model. A task that runs 10,000 times a day needs to be as cheap as possible without sacrificing reliability.
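The four questions can also be written down as a rough scoring function. Treat this as one possible encoding, not a formula: the `TaskProfile` fields, the thresholds, and the one-point-per-question scoring are all assumptions I picked for illustration.

```python
from dataclasses import dataclass

# Assumed thresholds; tune them against your own workloads.
LONG_CONTEXT_TOKENS = 50_000
HIGH_VOLUME_RUNS_PER_DAY = 10_000


@dataclass
class TaskProfile:
    schema_known: bool      # Question 1: is the output schema known?
    context_tokens: int     # Question 2: how long is the context?
    high_stakes: bool       # Question 3: how bad is a wrong answer?
    runs_per_day: int       # Question 4: how often does this run?


def suggest_tier(task: TaskProfile) -> str:
    """Start at tier 1 and move up or down one step per question."""
    score = 1
    if not task.schema_known:
        score += 1
    if task.context_tokens > LONG_CONTEXT_TOKENS:
        score += 1
    if task.high_stakes:
        score += 1
    # High volume pushes toward cheaper models, but not when mistakes are costly.
    if task.runs_per_day >= HIGH_VOLUME_RUNS_PER_DAY and not task.high_stakes:
        score -= 1
    score = max(1, min(3, score))  # clamp to the three tiers
    return f"tier_{score}"


print(suggest_tier(TaskProfile(True, 500, False, 10_000)))
print(suggest_tier(TaskProfile(False, 80_000, True, 1)))
```

The point is less the exact scoring and more that the answers are cheap to compute before any model gets called.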

How I Implement This in n8n and Make

In practice, I build the routing layer as an early node in any AI workflow. Before any model gets called, there's a simple classification step.

In n8n, I'll run a lightweight model call (Haiku or GPT-4o-mini) that takes the incoming task description and returns a tier label: tier_1, tier_2, or tier_3. The prompt for this is about 6 lines. Then a Switch node branches to the right model configuration.

The routing prompt looks roughly like this:

You are a task classifier. Given the following task description, return exactly one of: tier_1, tier_2, or tier_3.
Tier 1: Simple extraction, classification, reformatting, or yes/no decisions with clear inputs.
Tier 2: Structured generation, code writing, or analysis with a defined scope.
Tier 3: Open-ended reasoning, architectural thinking, or tasks requiring sustained judgment.
Task: {{task_description}}
Return only the tier label.

That's it. One model call that costs fractions of a cent decides which model handles the actual work. On high-volume workflows, this routing call pays for itself in the first hour.
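Outside of n8n, the same classify-then-branch step fits in a few lines of Python. This is a sketch under assumptions: `call_model` is a hypothetical function you supply that wraps whichever provider SDK you use, the model names are placeholders, and falling back to tier_2 when the label is unreadable is my own choice.

```python
from typing import Callable

ROUTING_PROMPT = """You are a task classifier. Given the following task description, return exactly one of: tier_1, tier_2, or tier_3.
Tier 1: Simple extraction, classification, reformatting, or yes/no decisions with clear inputs.
Tier 2: Structured generation, code writing, or analysis with a defined scope.
Tier 3: Open-ended reasoning, architectural thinking, or tasks requiring sustained judgment.
Task: {{task_description}}
Return only the tier label."""

# Placeholder names, not exact API model IDs.
MODEL_FOR_TIER = {
    "tier_1": "claude-haiku",
    "tier_2": "claude-sonnet",
    "tier_3": "claude-opus",
}

# call_model(model_name, prompt) -> response text. Hypothetical signature.
ModelCall = Callable[[str, str], str]


def classify(task_description: str, call_model: ModelCall,
             classifier_model: str = "claude-haiku") -> str:
    """Ask a cheap model for a tier label and normalize the reply."""
    prompt = ROUTING_PROMPT.replace("{{task_description}}", task_description)
    label = call_model(classifier_model, prompt).strip().lower()
    # Assumption: an unrecognized label goes to the middle tier.
    return label if label in MODEL_FOR_TIER else "tier_2"


def route(task_description: str, call_model: ModelCall) -> str:
    """Classify the task, then send it to the model for that tier."""
    tier = classify(task_description, call_model)
    return call_model(MODEL_FOR_TIER[tier], task_description)
```

The dictionary lookup plays the role of the Switch node: one label in, one model configuration out.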

In Make.com, I do the same thing with a Router module after the classifier call. Each branch has its own HTTP module pointed at the appropriate model endpoint with its own system prompt and parameter config.

The Fallback Pattern

Sometimes the light model gets it wrong. The output is malformed, the reasoning is off, or the task was harder than the router predicted. You need a fallback.

I add a validation node after every Tier 1 and Tier 2 call. It checks the output against the expected schema or a basic quality heuristic. If it fails, it re-runs with the next tier up. This happens automatically without human intervention.

On average, I see fallback trigger rates of 3 to 8 percent on well-defined tasks. If you're seeing higher than that, your tier classification prompt or your task definition needs work.
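Here is roughly what that validate-and-escalate loop looks like in code. It is a minimal sketch: `call_tier` is a hypothetical function that runs the prompt on whatever model you mapped to a tier, and the JSON key check stands in for whatever schema or quality heuristic fits your task.

```python
import json
from typing import Callable, Tuple

TIERS = ["tier_1", "tier_2", "tier_3"]


def has_required_keys(output: str, required: set) -> bool:
    """Basic schema check: output must be a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)


def run_with_fallback(
    prompt: str,
    start_tier: str,
    call_tier: Callable[[str, str], str],   # hypothetical: (tier, prompt) -> text
    validate: Callable[[str], bool],
) -> Tuple[str, str]:
    """Run at start_tier and move up one tier each time validation fails.

    Returns (tier_used, output).
    """
    for tier in TIERS[TIERS.index(start_tier):]:
        output = call_tier(tier, prompt)
        # The top tier has nowhere to escalate, so its output is returned as-is.
        if tier == TIERS[-1] or validate(output):
            return tier, output
    raise AssertionError("unreachable: the last tier always returns")
```

Logging the returned tier next to the starting tier gives you the fallback rate directly: count the runs where the two differ.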

What This Looks Like in a Real Project

We built a client intake workflow that processes inbound project requests. It does several things: extracts budget range, project type, timeline, and tech stack from a freeform submission. It scores the lead. It drafts an initial response. It flags anything that needs human review.

Without routing, every step went through Claude Sonnet. Cost per submission was around $0.09.

After routing, extraction and scoring go through Haiku. The draft response goes through Sonnet. The flagging decision goes through a rule-based check first, then Haiku if rules don't resolve it.
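The rules-first flagging step is worth sketching, because it is the cheapest routing decision of all: no model call unless the rules can't decide. The specific rules and field names below are hypothetical stand-ins, not the ones from the actual workflow, and `ask_model` is an assumed callback wrapping the Haiku call.

```python
from typing import Callable, Optional


def rule_check(submission: dict) -> Optional[bool]:
    """Return True or False when a rule decides, None when the rules can't."""
    # Hypothetical rule: no budget given means a human should look at it.
    if not submission.get("budget"):
        return True
    # Hypothetical rule: a complete, ordinary request needs no review.
    if submission.get("project_type") and submission.get("timeline"):
        return False
    return None


def needs_review(submission: dict, ask_model: Callable[[dict], bool]) -> bool:
    """Use the rules first and only fall through to the model when undecided."""
    decision = rule_check(submission)
    if decision is not None:
        return decision
    return ask_model(submission)
```

Every submission the rules resolve costs nothing, which is part of why the per-submission number drops as far as it does.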

Cost per submission dropped to $0.021. Same output quality. Same accuracy. Just smarter about which model does what.
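For scale, the arithmetic on those two numbers works out as follows. The 1,000 submissions figure is a hypothetical volume for illustration, not the client's actual traffic.

```python
before = 0.09    # cost per submission, everything on Sonnet
after = 0.021    # cost per submission, with routing

saved_per_submission = before - after
reduction = saved_per_submission / before

submissions = 1_000  # hypothetical volume
print(f"reduction: {reduction:.1%}")
print(f"saved per {submissions} submissions: ${saved_per_submission * submissions:.2f}")
```

That is a reduction of roughly 77 percent per submission.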

The Model-Specific Strengths Worth Knowing

Routing only works if you understand what each model actually does well. Here's my honest take based on what we use day to day.

Claude Haiku: Fastest Claude. Excellent for classification and extraction. Follows structured output instructions well. Gets confused on ambiguous multi-step tasks.

Claude Sonnet: The workhorse. Best balance of intelligence and cost for code tasks, nuanced instruction-following, and anything that requires real judgment without going full Opus. This is the default for most Tier 2 work.

Claude Opus: Reserve for when you genuinely need it. Complex debugging, architectural review, anything where the quality delta actually matters and cost isn't the primary concern.

GPT-4o-mini: Great for high-volume structured tasks. Fast. Cheap. Reliable on well-defined prompts. Falls apart faster than Haiku on ambiguity in my experience.

GPT-4o: Strong reasoning, good at following complex multi-part instructions. Competitive with Sonnet on many tasks. Mix this in on Tier 2 when you want a second opinion baked into the routing.

Gemini Flash: Underused. The long-context window at that price point is genuinely useful for summarization workflows and document processing. Worth testing if you have tasks with 50k+ token inputs.

The Setup Cost Is an Afternoon

I'm not going to oversell this. Building a routing layer takes a few hours the first time. You need to define your tiers, write the classifier prompt, set up the branches, and add validation. It's not complicated but it's not zero effort.

The ROI calculation is simple though. If you're running any AI workflow at meaningful volume, the cost reduction starts in week one. And once you've built the pattern once, you reuse it on every project after.

The real compounding benefit is that your workflows get more reliable. You stop asking expensive models to do cheap work, which means they're not burning context on stuff beneath them. And you stop asking cheap models to do hard work and wondering why the output is garbage.

Match the model to the task. Build the router once. Let it handle the rest.

Written by KZZY

47 Industries has been home since the beginning, from 3D printing operations to leading all software development across MotoRev, BookFade, and the 47 platform.
