The open-source models I switched to (and the one task I kept on Claude)
A migration in progress: what I’ve moved off Claude, what I’m keeping on it, and the rule that decides.
I built my own podcast summarizer — it transcribes ~40 podcasts a day, pulls out the interesting claims, tags themes, enriches the people and companies that come up, and writes the summaries I read over coffee. Every step started as a Claude call, and the bill kept creeping.
So I set out to do the obvious 2026 move: rip out Claude, drop in cheap open models. That’s the timeline’s advice — AI’s too expensive, GLM-5.2 just dethroned everyone, switch and save. It’s the wrong move. Not because open models are bad — they’re shockingly good now — but because “switch to open source” is the wrong unit of thought.
Don’t switch. Route.
Per-token prices fall ~10x a year. The price of a token isn’t the problem; paying frontier rates for mechanical grunt work is. (Uber burned its whole 2026 AI budget in four months; Microsoft pulled Claude Code licenses in a division when bills hit four figures per engineer.) A pipeline isn’t one job — it’s a dozen jobs with a dozen tolerances for being wrong. You don’t switch it; you route it.
The map
I listed every workload, put a dollar figure and an error budget on each, and routed by both: accuracy-graded work (structured, mechanical, checkable against a rubric) goes to the cheapest model that clears the bar; trust-graded work (subjective or attribution-critical) stays on Claude. Where it stands today:
| Workload | Was | Now | $/mo | Status |
|---|---|---|---|---|
| Intel extraction | Claude Sonnet | GLM-4.7 | $10 → ~$1 | ✅ live |
| Theme classification | Claude Sonnet | GLM-4.7 | $19 → ~$2 | ✅ live |
| Person / company bios | Claude Sonnet | GLM-5.2 +search | $3 → ~$1 | ✅ live |
| Web research (agentic) | Claude API | Claude (Max plan) | $35 → ~$0 | ✅ live |
| Podcast summaries | Claude Sonnet | Claude Sonnet | $22 | kept |
| Theme analysis | Claude Sonnet | MiniMax-M3 | $5 → ~$1 | ⏳ queued |
| Newsletter / daily email | Claude Sonnet | open (eval gate) | $38 → ~$4 | ⏳ queued |
| Embeddings | Voyage | Voyage | $3 | kept |
| Total | | | ~$135 → ~$34 | partway |
The lines I’ve moved dropped ~6–7x each. End-state is ~$34, down from ~$135 — and I’m honestly only partway there. The interesting part isn’t the number; it’s how each cell got decided.
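The totals can be checked straight from the table. This is a minimal sketch that treats the table's rounded "~" figures as exact dollars (an assumption), so it shows the end state rather than today's bill; the `WORKLOADS` name and the `totals` helper are illustrative, not part of the pipeline.

```python
# Monthly cost per workload in USD, copied from the table: (before, after).
# The "after" values are the table's rounded estimates, used here as-is.
WORKLOADS = {
    "Intel extraction": (10, 1),
    "Theme classification": (19, 2),
    "Person / company bios": (3, 1),
    "Web research (agentic)": (35, 0),
    "Podcast summaries": (22, 22),
    "Theme analysis": (5, 1),
    "Newsletter / daily email": (38, 4),
    "Embeddings": (3, 3),
}


def totals(workloads):
    """Return (before, after) monthly totals across all workloads."""
    before = sum(b for b, _ in workloads.values())
    after = sum(a for _, a in workloads.values())
    return before, after


before, after = totals(WORKLOADS)
print(f"${before}/mo -> ${after}/mo ({before / after:.1f}x cheaper overall)")
```

Most of the remaining $34 is the two lines that stay put: summaries ($22) and embeddings ($3).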
How I tested (the step everyone skips)
People migrating to open source rarely eval. They read a leaderboard and vibe-migrate. I ran an actual bake-off, three layers deep:
- A rubric per task. “Good” means something different for each job, so I wrote the criteria down first — for extraction, precision/recall and valid JSON; for theme classification, F1 and exact-match; for summaries, insight coverage, an engaging read, and a hard zero-fabricated-quotes rule. No universal “which is better” — a rubric tied to what that cell is actually for.
- Opus 4.8 as the judge. I had the most capable model I’ve got write a gold-standard answer for each input, then score every candidate against the rubric — blind and pairwise, Sonnet vs. the challenger, sides flipped to kill position bias. (Run on my Max plan, so the judge itself was ~free.)
- Then I read the winners myself. A model judging models is a great filter, not a verdict. So for anything that scored its way to the top, I built a little side-by-side viewer and actually read the output — Sonnet next to the candidate — before cutting anything over. That’s how I caught what the score missed.
Two rules kept me honest: confirm every win on a second independent sample, and iterate the cheap model’s prompt before writing it off.
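The "sides flipped" step is easy to get wrong, so here is a rough sketch of one way to do it. It assumes a hypothetical `judge` callable that sees the input, the rubric, and two anonymous answers, and returns "A" or "B"; the real judge call, prompt, and scoring are not shown in this post. A win only counts when the judge picks the same answer in both orders.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (source, rubric, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def pairwise_verdict(judge: Judge, source: str, rubric: str,
                     baseline: str, challenger: str) -> str:
    """Judge the pair twice with positions swapped; only a consistent pick counts."""
    first = judge(source, rubric, baseline, challenger)   # baseline shown as A
    second = judge(source, rubric, challenger, baseline)  # baseline shown as B
    baseline_wins_first = first == "A"
    baseline_wins_second = second == "B"
    if baseline_wins_first and baseline_wins_second:
        return "baseline"
    if not baseline_wins_first and not baseline_wins_second:
        return "challenger"
    return "tie"  # the pick followed the position, not the answer


def win_counts(judge: Judge, rubric: str,
               samples: Iterable[Tuple[str, str, str]]) -> dict:
    """Tally verdicts over (source, baseline_answer, challenger_answer) samples."""
    counts = {"baseline": 0, "challenger": 0, "tie": 0}
    for source, baseline, challenger in samples:
        counts[pairwise_verdict(judge, source, rubric, baseline, challenger)] += 1
    return counts
```

Running `win_counts` again on a second, independent set of samples is the cheap way to apply the "confirm every win" rule.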
What moved — and the surprise
Intel extraction and theme classification → GLM-4.7. On my data it pulled entities at ~88–91% recall against Sonnet’s ~75%, returned clean structured output via forced tool calls, and cost ~7x less. The theme classifier alone went from ~$228/yr to ~$24.
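For reference, this is roughly what the recall number means. A minimal sketch, assuming entities are compared as exact strings against a gold set (real matching would need some normalization); the entity names are made up, and the last line is just the table's monthly figures annualized.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall for one episode's extracted entities."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Made-up entities for illustration only.
gold = {"Acme Corp", "Jane Doe", "Globex", "John Roe"}
predicted = {"Acme Corp", "Jane Doe", "Globex", "Initech"}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")

# Annualizing the theme-classifier line: $19/mo before, ~$2/mo after.
print(19 * 12, 2 * 12)
```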
Bio enrichment → GLM-5.2. It matched Claude on my eval, so it shipped.
Here’s the surprise, and it’s the whole point: GLM-5.2 is newer and tops the open-weights leaderboards — and I deliberately did not use it for theme classification. On that task it scored higher by quietly dropping real assignments; GLM-4.7’s mistakes were safe. Newest model, best benchmark, wrong choice for that cell. A leaderboard measures someone else’s job. (DeepSeek lost everywhere I tried it — its reasoning models won’t emit a clean tool call. Cheap and wrong is the most expensive option there is.)
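"Clean tool call" can be made checkable. A small sketch of a validity gate, assuming the tool arguments arrive as a raw JSON string; the `REQUIRED_FIELDS` schema is hypothetical and stands in for whatever the extraction tool actually expects.

```python
import json

# Hypothetical schema for an extraction tool call: field name -> expected type.
REQUIRED_FIELDS = {"entities": list, "claims": list}


def is_clean_tool_call(raw_arguments, required=REQUIRED_FIELDS):
    """True if the arguments parse as a JSON object with the required typed fields."""
    try:
        payload = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(name), kind) for name, kind in required.items())


def valid_rate(raw_outputs):
    """Share of outputs that pass the gate; 0.0 for an empty batch."""
    if not raw_outputs:
        return 0.0
    return sum(is_clean_tool_call(raw) for raw in raw_outputs) / len(raw_outputs)
```

A model that wraps its JSON in prose or reasoning text fails this gate no matter how good the content inside is.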
What I’m keeping on Claude
Podcast episode summaries. This is where the cost-optimizers fall over.
The cheapest strong open model in my tests, MiniMax-M3, actually won my theme-analysis eval — beat Sonnet 7 of 10, and fabricated less (it caught Sonnet inventing a fund’s AUM). So it’s queued to take that task. But I tested it on summaries too, and there it misattributed quotes ~2.7x more than Sonnet, and no prompt fixed it. For a product whose whole value is “here’s who said the interesting thing,” a misquote is the one error I can’t ship. Same model, opposite verdict on two writing tasks. It’s per-task, not per-vibe.
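Attribution errors are one of the few summary failures that can be counted mechanically. The sketch below is one possible check, not the method used for the numbers above: it assumes a diarized transcript as (speaker, text) turns and a summary's quotes as (speaker, quote) pairs, both hypothetical structures, and flags any quote that does not appear in a turn by the credited speaker. It also catches quotes nobody said.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def misattributed(quotes, transcript):
    """Return the (speaker, quote) pairs not found in any turn by that speaker."""
    by_speaker = {}
    for speaker, text in transcript:
        by_speaker.setdefault(speaker, []).append(normalize(text))
    bad = []
    for speaker, quote in quotes:
        turns = by_speaker.get(speaker, [])
        if not any(normalize(quote) in turn for turn in turns):
            bad.append((speaker, quote))
    return bad


def misattribution_rate(quotes, transcript):
    """Fraction of quotes that fail the check; 0.0 when there are no quotes."""
    if not quotes:
        return 0.0
    return len(misattributed(quotes, transcript)) / len(quotes)
```

Comparing `misattribution_rate` for two models over the same episodes gives the kind of ratio quoted above. Paraphrased quotes would need fuzzier matching than this exact-substring test.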
The rule I run now
Claude for the subjective and attribution-critical. Open weights for the structured and mechanical. And the agentic, Claude-grade lines go through my Max subscription instead of the metered API: same quality, ~$0.
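Written down as code, the rule is only a few branches. This is a sketch under assumptions: the `Workload` fields, the `route` function, and the "claude-subscription" / "claude-api" labels are all hypothetical names, and `open_model_that_passed` stands for whichever open model cleared that task's eval, if any.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    trust_graded: bool      # subjective or attribution-critical
    agentic: bool = False   # multi-step, Claude-grade work


def route(workload: Workload, open_model_that_passed: Optional[str] = None) -> str:
    """Pick a destination for one workload under the rule above."""
    if workload.agentic:
        return "claude-subscription"   # flat-rate plan instead of the metered API
    if workload.trust_graded or open_model_that_passed is None:
        return "claude-api"            # nothing cheaper is allowed or has cleared the bar
    return open_model_that_passed      # cheapest open model that passed the eval


print(route(Workload("Theme classification", trust_graded=False), "GLM-4.7"))
print(route(Workload("Podcast summaries", trust_graded=True), "MiniMax-M3"))
print(route(Workload("Web research", trust_graded=False, agentic=True)))
```

The default matters: a workload with no passing open model falls back to Claude, so nothing moves on leaderboard reputation alone.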
Two caveats:
- Don’t self-host to save money. The cheapest 24/7 GPU costs several times my entire Claude bill to sit idle 99% of the time. Serverless endpoints, not your own hardware.
- “Open source” is a license, not a vibe. I call hosted GLM through an API like any other vendor. Want real sovereignty? Run the weights yourself — and you’re back to the idle-GPU math.