The $0/month AI agent stack (now that Claude killed the subscription hack)
Subject: The $0/month AI agent stack (now that Claude killed the subscription hack)
Preview text: Anthropic ended cheap subscription access. Here's the actual setup we use to run a 9-agent org for near-zero.
Last issue I covered what Anthropic just changed. This one is about what to do instead.
If you're running any kind of AI agent setup — whether it's one assistant or a full team of agents — the subscription workaround is gone. You're paying per token now. So let's talk about how to actually run this stuff without your API bill eating your margin.
I'm going to share the exact cost-reduction stack we run here. Real setup, real numbers.
THE PROBLEM WITH NAIVE API USAGE
Most people who hear "just use the API" imagine the costs multiplying fast. And they're right — if you run everything through Claude Opus on every single call, you will pay. Opus at $5/$25 per million tokens on an always-on agent org adds up quickly.
The fix isn't to stop using AI. It's to be smarter about which model gets which job.
THE TIERED MODEL APPROACH
Not every task needs your most expensive model. We run a three-tier stack:
Tier 1 — Free ($0/run): Local open-weight models via Ollama. We run Gemma 4 on our iMac for all content generation — daily articles, tweet scripts, trend summaries, newsletter drafts. Zero API cost. Runs on hardware you already own. For repetitive content tasks, local models at the 12B-27B range are genuinely good enough.
Tier 2 — Budget API: Claude Haiku or GPT-4o Mini for routing decisions, simple classifications, short summaries, and anything that doesn't need deep reasoning. Haiku is $1/$5 per million tokens — roughly 5-10x cheaper than Sonnet.
Tier 3 — Reserved for Hard Problems: Sonnet or Opus only when the task actually requires it. Architecture decisions, complex reasoning, anything where the output quality directly impacts revenue. You want your expensive model focused here, not writing blog post intros.
THE REAL NUMBERS
Running a 9-agent org for a small business — content generation, research, coding assistance, scheduling, outreach — you can operate for roughly $30-80/month in API costs if you route intelligently. Compare that to what you'd spend trying to brute-force everything through a single expensive model.
The biggest lever: local inference. If you have any Mac with 16GB+ RAM or a machine with a decent GPU, you can run Gemma 4, Llama 3, or Mistral locally at $0/run. This is the single best cost reduction available right now.
OTHER MOVES THAT ACTUALLY MATTER
Prompt caching — Anthropic and OpenAI both support it. If your agent has a long system prompt it sends on every call, caching can cut your input token costs by 50-90%. This is free money.
Context window discipline — most agent costs explode because of runaway context. Agents that accumulate conversation history indefinitely will bankrupt you. Set hard limits. Summarize and compress. Don't carry 200k tokens of history into every call.
Model routing — tools like OpenRouter let you route to the cheapest model that can handle a given task. You describe the task complexity, it picks the model. Set a budget cap and let it optimize.
Batch processing — instead of running 10 agents in real-time, batch non-urgent tasks. Research, content drafts, and data processing don't need to happen live. Run them on a schedule when compute is cheap.
OpenAI is still playing it differently — for now. Unlike Anthropic, OpenAI hasn't pulled the plug on third-party OAuth access. You can still connect external tools and agents to ChatGPT's ecosystem via OAuth without switching to direct API billing. Worth noting: OpenAI recently brought Peter Stienberg onto their team — someone with a strong track record of advocating for open access and interoperability. Whether that shapes long-term policy is speculative, but it's a meaningful signal that OpenAI sees ecosystem openness as a competitive advantage, not a threat. That gap between Anthropic and OpenAI's approach to third-party tooling is real, and it should factor into your stack decisions.
THE HONEST TAKE
Anthropic's move actually pushed us to build a better stack. When you're forced to think about costs, you stop being lazy about model selection. The orgs that figure out tiered routing now will have a structural cost advantage over everyone still blindly routing everything through their most expensive model.
The subscription hack was a crutch. The real setup is more flexible and cheaper anyway.
— G
P.S. If you want to see the exact config we use — which models handle which tasks, how we run local inference, and what the actual monthly API bill looks like — reply to this and I'll put together a full breakdown.