Ollama + Claude Code: How to Run It 99% Cheaper (Or Free)

Q: What are the best free models for coding on OpenRouter?

Qwen 3.6 (1M context, free) is currently one of the strongest options. The OpenRouter Free router automatically picks the most available free model at any given moment. For near-free options, Google Gemma 4 at $0.14 per million input tokens delivers excellent coding performance at roughly 35x less than Opus's $5 per million input tokens.

And no, this isn't against Anthropic's terms of service. You're using their agent framework and just plugging in a different model. It's a supported use case.

Open Source vs Closed Source Models: What You Need to Know

Before we get into setup, you need to understand why this works — and where the trade-offs are.

Closed source models (Opus, Sonnet, GPT, Gemini) are locked down. You can only access them through the company's API, and that means paying per token. The hood is welded shut — you can't download, inspect, or modify these models.

Open source models (Qwen, Gemma, Llama, DeepSeek) are published openly. Anyone can download them, run them locally, modify them, or host them on their own infrastructure. No per-token fees. No usage limits beyond your hardware.

The gap between open and closed source model performance is shrinking fast

The question everyone asks: why doesn't everyone just use open source? Because historically, closed source models have been better. Opus 4.6 and Sonnet 4.6 still sit at the top of coding benchmarks.

But here's what's changed: the gap is shrinking fast. If you look at SWE-bench Verified scores (a coding benchmark built from real GitHub issues), some open-weight models today outperform Claude 3.7 Sonnet, the model that had everyone freaking out when it dropped. Qwen 3.6, Google Gemma 4, and several others are legitimately competitive for coding tasks.

That said, open source models in Claude Code can misbehave because:

They might not have been trained on Claude Code's specific tool-calling protocol
Their context window might be too small for Claude Code's system prompt
They might not follow the exact JSON format that Claude Code expects

Think of it like putting a motorcycle engine into a truck. It'll work, but you might need to fiddle with it. Some models handle tool calling perfectly; others need some coaxing.

Method 1: Run Local Models with Ollama (Completely Free)

This is the fully local, fully private, fully free method. The model runs on your hardware, nothing leaves your machine, and there are zero ongoing costs.

The entire Ollama setup is three steps

Step 1: Download Ollama

Head to ollama.com and download it for your operating system. Install it like any other app.

Ollama's homepage — download for your OS

Step 2: Choose and Pull a Model

Click on "Models" at ollama.com to browse the model library. You'll see dozens of options with different sizes, context windows, and capabilities.

The Ollama model library — look for models with the "tools" tag for best Claude Code compatibility

The key thing to look at is model size versus your hardware. A good starting point:

Your RAM	Recommended Model Size	Example
8 GB	Up to 7B parameters	Qwen 3.5 7B (~4 GB)
16 GB	Up to 14B parameters	Qwen 3.5 14B (~9 GB)
32 GB+	30B+ parameters	Gemma 4 27B, Qwen 3.5 32B
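
The table above can be turned into a quick lookup. This is a minimal sketch, assuming each RAM figure in the table acts as a minimum threshold; recommend_model_tier is a hypothetical helper name, and the cutoffs are this article's rules of thumb, not anything Ollama enforces.

```shell
# Map installed RAM (whole GB) to the size tier from the table above.
# Assumption: each RAM figure in the table is treated as a minimum threshold.
recommend_model_tier() {
  ram_gb=$1
  if [ "$ram_gb" -ge 32 ]; then
    echo "30B+ parameters"
  elif [ "$ram_gb" -ge 16 ]; then
    echo "up to 14B parameters"
  elif [ "$ram_gb" -ge 8 ]; then
    echo "up to 7B parameters"
  else
    echo "below the smallest tier in the table"
  fi
}

recommend_model_tier 16   # prints: up to 14B parameters
```

To find your RAM, `free -g` works on Linux and `sysctl -n hw.memsize` (which reports bytes) works on macOS.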

Once you've picked a model, open your terminal and pull it:

ollama pull qwen3.5:9b

This downloads the model to your local machine. The 9B model is about 6.6 GB — larger models take longer to download but perform better.

Each model page shows available sizes, benchmarks, and the exact pull command

You can verify it works by chatting with it directly:

ollama run qwen3.5:9b

If it responds, you're good. Hit Ctrl+D to exit the chat.

Step 3: Launch Claude Code with Your Local Model

Now the magic part. Ollama has a built-in integration with Claude Code. Run:

ollama launch claude

This opens Claude Code and lets you pick which local model to use as the engine. Select your downloaded model and you're running Claude Code — same interface, same tools, same file operations — but powered by a free local model.

You'll see "API usage billing" show as free. Every prompt, every tool call, every file read — zero cost.

A Note on the Initial $5 Setup

I need to be upfront about this: you do still need an Anthropic account to use Claude Code. If you don't already have one, the sign-up flow requires either a subscription ($20/month) or a one-time API credit purchase of $5. If you go the API route, that $5 sits in your account but won't be consumed when you're using local models. Think of it as a one-time activation fee for accessing the Claude Code harness.

Fixing the Context Window Issue

One gotcha: Ollama loads models with its own default context length, which is often much smaller than the maximum the model advertises. If your model says it supports 200K tokens but Claude Code seems to lose track of the conversation, you may need to create a custom Modelfile that explicitly sets the context size:

# Create a Modelfile with custom context
echo 'FROM qwen3.5:9b
PARAMETER num_ctx 65536' > Modelfile

# Create the custom model
ollama create qwen3.5-64k -f Modelfile

# Now launch Claude Code with it
ollama launch claude
# Select: qwen3.5-64k

After increasing the context window, you should see proper tool-call visibility in Claude Code — the model can now hold enough context to show you what it's doing step by step instead of just spinning until it responds.
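
If you want to try several context sizes, the Modelfile step can be wrapped in a small helper. This is a sketch under stated assumptions: write_modelfile is a hypothetical name, and it only writes the two directives used above (FROM and PARAMETER num_ctx). It doesn't call Ollama itself.

```shell
# Write a Modelfile that pins the context size for a base model.
# Usage: write_modelfile BASE_MODEL NUM_CTX [OUTPUT_PATH]
write_modelfile() {
  base_model=$1
  num_ctx=$2
  out_path=${3:-Modelfile}
  # Refuse anything that is not made up purely of digits (e.g. "64k").
  case "$num_ctx" in
    ''|*[!0-9]*) echo "num_ctx must be a whole number" >&2; return 1 ;;
  esac
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base_model" "$num_ctx" > "$out_path"
}

# Same settings as the example above, written to a scratch directory.
work_dir=$(mktemp -d)
write_modelfile qwen3.5:9b 65536 "$work_dir/Modelfile"
cat "$work_dir/Modelfile"
```

From there, run ollama create against the generated file as before. Running ollama show on the new model should list num_ctx among its parameters if the setting took effect.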

What About Ollama Cloud Models?

Ollama also offers cloud-hosted models. For example, you can run MiniMax M2.7 through Ollama's cloud without downloading anything:

ollama launch claude --model minimax-m2.7

The cloud models are significantly faster than local ones and feel much closer to using Sonnet. The catch? Ollama's free tier has limits, and you'll eventually need a subscription for heavy use. But for getting started and testing models, the free tier is generous enough.

Method 2: Free Cloud Models with OpenRouter

This method gives you access to free models running in the cloud — faster than local, no hardware requirements, and still zero cost per token.

OpenRouter routes your requests to various AI model providers — including free ones

Step 1: Create an OpenRouter Account

Go to openrouter.ai and sign up. Here's the important part: load your account with at least $10 in credits. You won't spend this on free models, but OpenRouter uses the credits you've purchased to determine your rate limits:

Credits Purchased	Free Model Rate Limit
Under $10	50 requests/day
$10 or more	1,000 requests/day

50 requests/day is basically nothing for real Claude Code usage. The $10 bump to 1,000 requests is worth it, and again, the money just sits there since free models cost $0.
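
To put the two limits in perspective, here is a rough per-hour budget. This is a sketch that assumes an 8-hour working day (an assumption for illustration, not an OpenRouter figure); requests_per_hour is a hypothetical helper. Keep in mind that a single Claude Code prompt usually fans out into several API requests, since each tool call is its own round trip.

```shell
# Integer requests available per hour, given a daily limit and hours worked.
requests_per_hour() {
  echo $(( $1 / $2 ))
}

echo "50/day limit:    $(requests_per_hour 50 8) requests/hour"     # 6 requests/hour
echo "1,000/day limit: $(requests_per_hour 1000 8) requests/hour"   # 125 requests/hour
```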

Step 2: Generate an API Key

In your OpenRouter account, go to Credits → API Keys → Create New Key. Copy the key — you'll need it in a moment.

Step 3: Configure Claude Code's Environment Variables

This is where you tell Claude Code to stop talking to Anthropic and start talking to OpenRouter instead. Open your project's .claude/settings.local.json and add these environment variables:

The full OpenRouter configuration — make sure you override ALL model variables

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://openrouter.ai/api",
    "ANTHROPIC_AUTH_TOKEN": "YOUR_OPEN_ROUTER_API_KEY",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "openrouter/auto",
    "ANTHROPIC_SMALL_FAST_MODEL": "openrouter/auto",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}

Critical: Override ALL Model Variables

If you only set ANTHROPIC_MODEL and skip ANTHROPIC_SMALL_FAST_MODEL, Claude Code will default to Haiku for tool calls and sub-agent tasks. You'll get silently charged without realizing it. One tester noticed Haiku charges piling up in their OpenRouter logs even though the main model was free. Set every model variable to a free model.
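
A quick way to catch a missing override is to scan the settings file for each variable name. The sketch below is an illustrative check, not an official tool: check_model_overrides is a hypothetical helper, its default list covers only the two model variables used in this article's config, and it does a plain text match rather than parsing the JSON.

```shell
# Report which model variables are missing from a Claude Code settings file.
# Usage: check_model_overrides SETTINGS_FILE [VAR ...]
# Assumption: the default list covers only the variables this article sets.
check_model_overrides() {
  settings_file=$1
  shift
  [ "$#" -gt 0 ] || set -- ANTHROPIC_MODEL ANTHROPIC_SMALL_FAST_MODEL
  missing=0
  for var in "$@"; do
    # Match the quoted key so ANTHROPIC_MODEL does not match longer names.
    if ! grep -q "\"$var\"[[:space:]]*:" "$settings_file"; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}

# Demo on a scratch file that sets only the main model.
demo_file=$(mktemp)
printf '{ "env": { "ANTHROPIC_MODEL": "openrouter/auto" } }\n' > "$demo_file"
check_model_overrides "$demo_file" || echo "fix the settings before launching"
```

Point it at .claude/settings.local.json in your project, and pass extra variable names as arguments if your Claude Code version reads others. Newer releases document per-tier variables such as ANTHROPIC_DEFAULT_HAIKU_MODEL, so it's worth checking the current settings documentation for the full list.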

Step 4: Choose Your Free Model

Head to openrouter.ai/collections/free-models to see what's available.

OpenRouter's free model collection — all of these cost $0 per token

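The openrouter/auto option is a meta-model that picks a model for each request on your behalf. It helps you avoid per-model rate limits, but you lose control over which specific model handles your request. Because a router can also select paid models, check your OpenRouter logs after the first few requests; if anything shows a non-zero cost, switch to the Free router or to a specific :free model ID from the collection page.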

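If you want a specific model, replace openrouter/auto with the model ID shown on that model's OpenRouter page. For example, to pin a free Qwen variant (the ID below is one such :free listing; check the collection page for the current Qwen 3.6 ID, since free listings rotate):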

"ANTHROPIC_MODEL": "qwen/qwen3-235b-a22b:free",
"ANTHROPIC_SMALL_FAST_MODEL": "qwen/qwen3-235b-a22b:free"

Step 5: Launch and Verify

Open a terminal and type claude. The header should show API usage billing (now routed through OpenRouter) instead of your Max or Pro plan. Run a test command:

# In Claude Code:
> Create a file called openrouter-test.txt with a joke inside it

Then check your OpenRouter logs. You should see the requests listed at $0.00 cost. If you see any charges for Haiku or Sonnet, go back and make sure you've overridden all the model variables.

When to Use Open Source Models (And When Not To)

I want to be honest here because I've seen too many "free AI" guides that oversell the experience. Open source models are good. They're not always good enough.

Open Source Models Work Well For:

Reading and summarizing files — scan your codebase, summarize what functions do, prep context for a smarter model
Searching through code — grep through repos, find relevant files and functions
Generating scaffolding — boilerplate code, repetitive structures, test templates
Research and information gathering — web searches, summarizing docs, pulling references
Organizing and classifying — categorizing tasks, triaging issues, organizing files
Simple bug fixes and code reviews — straightforward fixes where the scope is clear

Stick With Paid Models For:

Complex architectural decisions — when you need the model to reason through trade-offs across an entire system
Production-critical code — anything that can't have subtle bugs or edge case failures
Multi-step tool calling chains — open source models sometimes lose the thread during complex agent workflows
Large context reasoning — Opus's 1M token context is still unmatched by most open alternatives

The Fallback Use Case

Even if you primarily use paid models, having a local or OpenRouter setup is valuable for two situations:

When Claude's servers are down — check status.claude.com if you're having issues, then switch to local rather than sitting idle for 2 hours
When you hit your session limit — instead of waiting for the cooldown, fire up a local model and keep working on lower-stakes tasks

Cost Comparison: What You're Actually Saving

Let's put real numbers on this.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Savings vs Opus
Opus 4.6	$5.00	$25.00	—
Gemma 4 (31B)	$0.14	$0.40	~97% cheaper
Qwen 3.6 (free tier)	$0.00	$0.00	100% free
OpenRouter Free router	$0.00	$0.00	100% free
Ollama local	$0.00	$0.00	100% free (+ electricity)

Even if you don't go fully free, using something like Gemma 4 through OpenRouter at $0.14/million input tokens versus Opus at $5/million is a 97% cost reduction. For most coding tasks that don't need peak intelligence, that's a no-brainer.
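
The percentages come straight from the table. As a sanity check, here is a small sketch that recomputes them with awk; savings_pct and price_ratio are hypothetical helper names, and the prices are the ones quoted above, not live figures.

```shell
# Percentage saved when paying PRICE instead of BASELINE (same unit, e.g. $ per 1M tokens).
savings_pct() {
  awk -v base="$1" -v price="$2" 'BEGIN { printf "%.1f\n", (1 - price / base) * 100 }'
}

# How many times cheaper PRICE is than BASELINE.
price_ratio() {
  awk -v base="$1" -v price="$2" 'BEGIN { printf "%.1f\n", base / price }'
}

echo "input:  $(savings_pct 5.00 0.14)% cheaper, $(price_ratio 5.00 0.14)x"     # 97.2%, 35.7x
echo "output: $(savings_pct 25.00 0.40)% cheaper, $(price_ratio 25.00 0.40)x"   # 98.4%, 62.5x
```

At these prices, Gemma 4 works out to roughly 36x cheaper than Opus on input tokens and about 62x cheaper on output tokens.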

The honest take: there's really no such thing as completely free. If you run local models, you're paying in hardware costs and electricity. If you use cloud free tiers, you're paying with rate limits and less model control. The real win is finding the right balance between quality and price for your specific use case.

Pair This With Token Management

Whether you're on free models or paid ones, managing your context window is critical. Open source models typically have smaller context windows than Opus's 1M tokens, which makes token optimization hacks even more important. Use /compact early, keep your CLAUDE.md lean, and start fresh conversations between unrelated tasks.

Related Guides

If you're looking to push Claude Code further without breaking the bank, these will help:

Claude Code Token Hacks: How to 5x Your Usage Without Upgrading — 18 optimization techniques organized by difficulty tier
Run OpenClaw Free Forever — another approach to free Claude Code alternatives
Open Claude: Use Any Model with Claude Code — deep dive into model swapping

Frequently Asked Questions

Can I really run Claude Code for free?

Yes. You can run Claude Code with free open-source models either locally through Ollama or via OpenRouter's free model tier. You still need an Anthropic account with a $5 minimum credit purchase, but that balance won't be consumed when using open-source models instead of Anthropic's paid models.

Is using Ollama with Claude Code against Anthropic's terms of service?

No. Claude Code is an agent harness, and swapping the underlying model is a supported use case. You're using Anthropic's tooling framework but pointing it at a different model provider. Anthropic's documentation even acknowledges third-party model integrations.

What hardware do I need to run local models with Ollama?

It depends on the model size. A 9B parameter model like Qwen 3.5 requires about 6-7 GB of RAM and runs on most modern machines. Larger models (30B+) need 32 GB of RAM or more and ideally a dedicated GPU. Ask Claude Code to analyze your hardware specs and recommend appropriate model sizes.

Why is Claude Code slower with local models?

Local models run on your hardware instead of Anthropic's data center GPUs. A 9B parameter model on a consumer laptop can take 3-4 minutes for tasks that Opus handles in seconds. Larger models or cloud-hosted options through Ollama Cloud are significantly faster.

How do I avoid getting charged for Haiku when using OpenRouter?

You must override ALL model environment variables, not just ANTHROPIC_MODEL. If you only set the main model, Claude Code still falls back to Anthropic's own models for sub-agent tasks and tool calls. Set ANTHROPIC_SMALL_FAST_MODEL alongside ANTHROPIC_MODEL, as shown in the configuration above, and check your OpenRouter logs to confirm nothing is being billed.

What are the best free models for coding on OpenRouter?

Qwen 3.6 with its 1M context window is currently one of the strongest free options. The openrouter/auto router automatically picks the most available free model at any given moment. For near-free options, Google Gemma 4 at $0.14 per million input tokens delivers excellent coding performance at roughly 35x less than Opus's $5 per million input tokens.

What are the limitations of open-source models in Claude Code?

Open-source models may not have been trained on Claude Code's specific tool-calling protocol, can have smaller context windows than Opus's 1M tokens, and might struggle with native web search. They work well for file operations, code scaffolding, and simple edits, but for complex multi-step tasks you may notice quality differences.

When should I use open-source models vs paid Claude models?

Use open-source models for low-stakes tasks like reading files, summarizing code, generating scaffolding, running searches, and simple bug fixes. Use paid models (Opus, Sonnet) for complex architectural work, production-critical code, and tasks where accuracy is non-negotiable. Also use local models as a fallback when Claude's servers are down or you've hit your session limit.