The open-source models I switched to (and the one task I kept on Claude)

I built my own podcast summarizer — it transcribes ~40 podcasts a day, pulls out the interesting claims, tags themes, enriches the people and companies that come up, and writes the summaries I read over coffee. Every step started as a Claude call, and the bill kept creeping.

So I set out to do the obvious 2026 move: rip out Claude, drop in cheap open models. That’s the timeline’s advice — AI’s too expensive, GLM-5.2 just dethroned everyone, switch and save. It’s the wrong move. Not because open models are bad — they’re shockingly good now — but because “switch to open source” is the wrong unit of thought.

Don’t switch. Route.

Per-token prices fall ~10x a year. The price of a token isn’t the problem; paying frontier rates for mechanical grunt work is. (Uber burned its whole 2026 AI budget in four months; Microsoft pulled Claude Code licenses in a division when bills hit four figures per engineer.) A pipeline isn’t one job — it’s a dozen jobs with a dozen tolerances for being wrong. You don’t switch it; you route it.

The map

I listed every workload, put a dollar figure and an error budget on each, and routed by both: accuracy-graded work goes to the cheapest model that clears the bar; trust-graded work stays on Claude. Where it stands today:

Workload	Was	Now	$/mo	Status
Intel extraction	Claude Sonnet	GLM-4.7	$10 → ~$1	✅ live
Theme classification	Claude Sonnet	GLM-4.7	$19 → ~$2	✅ live
Person / company bios	Claude Sonnet	GLM-5.2 +search	$3 → ~$1	✅ live
Web research (agentic)	Claude API	Claude (Max plan)	$35 → ~$0	✅ live
Podcast summaries	Claude Sonnet	Claude Sonnet	$22	kept
Theme analysis	Claude Sonnet	MiniMax-M3	$5 → ~$1	⏳ queued
Newsletter / daily email	Claude Sonnet	open (eval gate)	$38 → ~$4	⏳ queued
Embeddings	Voyage	Voyage	$3	kept
Total			~$135 → ~$34	partway

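The lines I’ve moved dropped ~6–7x each. End-state is ~$34/mo, down from ~$135, about a 4x cut overall. I’m honestly only partway there: with the two queued rows still on Sonnet, today’s run-rate is closer to ~$72. The interesting part isn’t the number; it’s how each cell got decided.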
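The table's arithmetic is easy to re-check. The sketch below re-keys the rounded monthly figures from the table; the `ROUTES` name and the tuple layout are illustrative, not the pipeline's actual config.

```python
# Rounded monthly figures copied from the table: (workload, was_usd, now_usd).
# "now" is the end-state figure, so it includes the two queued rows.
ROUTES = [
    ("Intel extraction", 10, 1),
    ("Theme classification", 19, 2),
    ("Person / company bios", 3, 1),
    ("Web research (agentic)", 35, 0),
    ("Podcast summaries", 22, 22),
    ("Theme analysis", 5, 1),
    ("Newsletter / daily email", 38, 4),
    ("Embeddings", 3, 3),
]

before = sum(was for _, was, _ in ROUTES)
after = sum(now for _, _, now in ROUTES)
print(f"before: ${before}/mo, end-state: ${after}/mo, ratio: {before / after:.1f}x")

# Largest absolute savings first: that is where routing effort pays off.
for name, was, now in sorted(ROUTES, key=lambda r: r[1] - r[2], reverse=True):
    print(f"{name:26s} ${was:>3} -> ${now:>3}  (saves ${was - now}/mo)")
```

Sorting by dollars saved rather than by ratio puts the newsletter and web-research rows at the top, which is a reminder that the biggest wins are not always the most dramatic multiples.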

How I tested (the step everyone skips)

People migrating to open source rarely eval. They read a leaderboard and vibe-migrate. I ran an actual bake-off, three layers deep:

A rubric per task. “Good” means something different for each job, so I wrote the criteria down first — for extraction, precision/recall and valid JSON; for theme classification, F1 and exact-match; for summaries, insight coverage, an engaging read, and a hard zero-fabricated-quotes rule. No universal “which is better” — a rubric tied to what that cell is actually for.
Opus 4.8 as the judge. I had the most capable model I’ve got write a gold-standard answer for each input, then score every candidate against the rubric — blind and pairwise, Sonnet vs. the challenger, sides flipped to kill position bias. (Run on my Max plan, so the judge itself was ~free.)
Then I read the winners myself. A model judging models is a great filter, not a verdict. So for anything that scored its way to the top, I built a little side-by-side viewer and actually read the output — Sonnet next to the candidate — before cutting anything over. That’s how I caught what the score missed.
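The blind, sides-flipped comparison can be sketched in a few lines. This is a minimal illustration, not the actual harness: the `Judge` signature is hypothetical and stands in for a call to the judge model, and the "baseline" / "challenger" / "tie" labels are made up for the example. The judge only ever sees "A" and "B", never a model name.

```python
from typing import Callable

# Hypothetical judge signature: (rubric, gold, answer_a, answer_b) -> "A" or "B".
# In practice this would wrap a call to the judge model.
Judge = Callable[[str, str, str, str], str]

def pairwise_verdict(judge: Judge, rubric: str, gold: str,
                     baseline: str, challenger: str) -> str:
    """Judge the pair twice with sides flipped; a win counts only if both runs agree."""
    first = judge(rubric, gold, baseline, challenger)    # baseline shown as A
    second = judge(rubric, gold, challenger, baseline)   # challenger shown as A
    baseline_wins = (first == "A") + (second == "B")
    if baseline_wins == 2:
        return "baseline"
    if baseline_wins == 0:
        return "challenger"
    return "tie"  # the verdict moved with position, so treat it as position bias

# Toy judges for demonstration only.
def always_first(rubric, gold, a, b):
    return "A"  # pure position bias

def prefers_gold_overlap(rubric, gold, a, b):
    score = lambda ans: len(set(ans.split()) & set(gold.split()))
    return "A" if score(a) >= score(b) else "B"

print(pairwise_verdict(always_first, "rubric", "x y z", "x", "x y z"))
print(pairwise_verdict(prefers_gold_overlap, "rubric", "x y z", "x", "x y z"))
```

A judge that always picks the first slot comes out as a tie, which is the point of flipping sides: position bias cancels instead of being counted as a win.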

Two rules kept me honest: confirm every win on a second independent sample, and iterate the cheap model’s prompt before writing it off.
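One way to enforce the second-sample rule is to hold out half the inputs before any prompt tuning and require the win on both halves. The sketch below assumes that reading; the function names, the verdict labels ("challenger", "baseline", "tie") and the 0.5 bar are all illustrative choices, not the setup described here.

```python
import random

def split_two_samples(item_ids, seed=0):
    """Shuffle deterministically and split into two disjoint halves."""
    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]

def win_rate(verdicts, ids):
    """Share of decided comparisons the challenger won (ties excluded)."""
    decided = [verdicts[i] for i in ids if verdicts[i] != "tie"]
    if not decided:
        return 0.0
    return sum(v == "challenger" for v in decided) / len(decided)

def confirmed_win(verdicts, sample_a, sample_b, bar=0.5):
    """A win counts only if the challenger clears the bar on both samples."""
    return win_rate(verdicts, sample_a) > bar and win_rate(verdicts, sample_b) > bar
```

Tune the cheap model's prompt on the first half only; the second half stays untouched until the final check, so a prompt that merely overfit the first sample fails the confirmation.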

What moved — and the surprise

Intel extraction and theme classification → GLM-4.7. On my data it pulled entities at ~88–91% recall against Sonnet’s ~75%, returned clean structured output via forced tool calls, and cost ~7x less. The theme classifier alone went from ~$228/yr to ~$24/yr.
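For the extraction rubric (valid JSON plus precision and recall), a scorer for one response might look like the sketch below. The `{"entities": [...]}` payload shape and the case-insensitive matching are assumptions made for illustration; the real schema isn't shown here.

```python
import json

def score_extraction(raw_output: str, gold_entities: set[str]) -> dict:
    """Score one model response: is it valid JSON, and how do its entities compare to gold?"""
    try:
        payload = json.loads(raw_output)
        predicted = {e.strip().lower() for e in payload["entities"]}
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Unparseable or wrongly shaped output fails the hard gate outright.
        return {"valid_json": False, "precision": 0.0, "recall": 0.0}
    gold = {e.strip().lower() for e in gold_entities}
    hits = len(predicted & gold)
    return {
        "valid_json": True,
        "precision": hits / len(predicted) if predicted else 0.0,
        "recall": hits / len(gold) if gold else 0.0,
    }

# Toy example with made-up entity names.
gold = {"Acme Corp", "Jane Doe", "Globex", "John Roe"}
raw = '{"entities": ["Acme Corp", "Jane Doe", "Initech"]}'
print(score_extraction(raw, gold))
print(score_extraction("not json at all", gold))
```

Treating malformed output as a zero rather than skipping it keeps a model that only sometimes emits clean structure from looking better than it is.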

Bio enrichment → GLM-5.2. It matched Claude on my eval, so it shipped.

Here’s the surprise, and it’s the whole point: GLM-5.2 is newer and tops the open-weights leaderboards, and I deliberately did not use it for theme classification. On that task it scored higher, but it got there by quietly dropping theme assignments that really applied; GLM-4.7’s mistakes were the safe kind. Newest model, best benchmark, wrong choice for that cell. A leaderboard measures someone else’s job. (DeepSeek lost everywhere I tried it: in my tests its reasoning models wouldn’t emit a clean tool call. Cheap and wrong is the most expensive option there is.)

What I’m keeping on Claude

Podcast episode summaries. This is where the cost-optimizers fall over.

The cheapest strong open model in my tests, MiniMax-M3, actually won my theme-analysis eval — beat Sonnet 7 of 10, and fabricated less (it caught Sonnet inventing a fund’s AUM). So it’s queued to take that task. But I tested it on summaries too, and there it misattributed quotes ~2.7x more than Sonnet, and no prompt fixed it. For a product whose whole value is “here’s who said the interesting thing,” a misquote is the one error I can’t ship. Same model, opposite verdict on two writing tasks. It’s per-task, not per-vibe.
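A misattribution rate needs a mechanical definition before it can be compared across models. The sketch below is one possible check, assuming the transcript is available as (speaker, text) turns and the summary's quotes have already been pulled out as (speaker, quote) pairs; both shapes are hypothetical. It only catches verbatim quotes, so a paraphrase inside quotation marks would show up as fabricated.

```python
import re

def check_quotes(summary_quotes, transcript_turns):
    """Classify each (speaker, quote) pair in a summary against the transcript.

    transcript_turns: list of (speaker, text) tuples in spoken order.
    Returns a dict with lists of 'ok', 'misattributed', and 'fabricated' quotes.
    """
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    report = {"ok": [], "misattributed": [], "fabricated": []}
    for speaker, quote in summary_quotes:
        speakers_who_said_it = {
            who for who, text in transcript_turns if norm(quote) in norm(text)
        }
        if not speakers_who_said_it:
            report["fabricated"].append(quote)      # nobody said this
        elif speaker in speakers_who_said_it:
            report["ok"].append(quote)
        else:
            report["misattributed"].append(quote)   # real words, wrong person
    return report

# Toy transcript with placeholder speakers.
turns = [
    ("Host", "So what changed this year?"),
    ("Guest", "We stopped paying frontier rates for grunt work."),
]
quotes = [
    ("Guest", "stopped paying frontier rates"),
    ("Host", "stopped paying frontier rates"),
    ("Guest", "we doubled revenue"),
]
print(check_quotes(quotes, turns))
```

Running the same check over both models' summaries for the same episodes gives a like-for-like misattribution count, independent of what the judge model thought of the prose.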

The rule I run now

Claude for the subjective and attribution-critical. Open weights for the structured and mechanical. And the agentic, Claude-grade lines go through my Max subscription instead of the metered API: same quality, ~$0 extra.
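Read literally, the rule is a small decision function. This is a sketch under assumptions: the `Task` fields are one way to encode "subjective", "attribution-critical" and "agentic", and the returned strings are placeholder labels, not real endpoint or model names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    subjective: bool = False            # graded on taste or trust, not a metric
    attribution_critical: bool = False  # a misquote or wrong source is unshippable
    agentic: bool = False               # multi-step tool use

def route(task: Task, open_model_passed_eval: bool) -> str:
    """Return a routing label; the labels are placeholders, not real endpoints."""
    if task.agentic:
        return "claude-subscription"
    if task.subjective or task.attribution_critical:
        return "claude-api"
    # Structured, mechanical work moves only after an open model clears the eval.
    return "open-weights" if open_model_passed_eval else "claude-api"

print(route(Task("podcast summaries", attribution_critical=True), True))
print(route(Task("intel extraction"), True))
print(route(Task("web research", agentic=True), False))
```

The eval result is an explicit argument on purpose: a structured task with no passing open model stays where it is until one clears the bar.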

Two caveats:

Don’t self-host to save money. The cheapest 24/7 GPU costs several times my entire Claude bill to sit idle 99% of the time. Serverless endpoints, not your own hardware.
“Open source” is a license, not a vibe. I call hosted GLM through an API like any other vendor. Want real sovereignty? Run the weights yourself — and you’re back to the idle-GPU math.
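The idle-GPU math is worth doing once with your own numbers. In the sketch below the only sourced figure is the ~$135/mo pre-routing bill; the $0.50/hr rate is a placeholder assumption, not a quoted price, so substitute a real quote from whichever provider you are considering.

```python
HOURS_PER_MONTH = 24 * 365 / 12   # 730 hours in an average month

def always_on_gpu_cost(hourly_rate_usd: float) -> float:
    """Monthly cost of one GPU billed around the clock, regardless of utilization."""
    return hourly_rate_usd * HOURS_PER_MONTH

def break_even_hourly_rate(monthly_bill_usd: float) -> float:
    """Hourly GPU price at which an always-on box merely matches the API bill."""
    return monthly_bill_usd / HOURS_PER_MONTH

# $135/mo is the pre-routing total from the table.
print(f"break-even rate: ${break_even_hourly_rate(135):.3f}/hr")

# Placeholder rate, NOT a quoted price: plug in a real number.
print(f"at $0.50/hr: ${always_on_gpu_cost(0.50):.0f}/mo")
```

The break-even figure is the useful one: any always-on GPU priced above it costs more than the entire old bill before it has served a single request.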