LFM2-24B: The Offline AI Model That Runs Powerful Agents on Your Laptop

Key Takeaways
- LFM2-24B: 24B total params, only 2B active per token (sparse MoE architecture)
- Runs on consumer laptops: 32GB RAM required, achieves 112 tok/s on CPU, 293 tok/s on H100
- Created by Liquid AI (MIT researchers), released February 24, 2026
- Native MCP support: connect to tools, filesystem, OCR, security scanning—no API calls
- Completely private: zero data egress, full audit trail, ideal for confidential work
- 32K token context window for documents, contracts, and code
- LocalCowork framework enables autonomous agent workflows on local infrastructure
The problem with AI in 2026 is simple: you either use cloud APIs (lose privacy, pay per call, depend on uptime) or you run local models (lose capability, get weak reasoning). LFM2-24B breaks this tradeoff. It's powerful enough to handle real agent work, small enough to fit on a laptop, and it runs completely offline with zero API calls. We've tested it on complex agent workflows—document processing, code review, security scanning—and it delivers. Here's what we found.
What is LFM2-24B and Why Should You Care?
LFM2-24B is a language model created by Liquid AI, a startup founded by MIT researchers. It was released on February 24, 2026. The numbers sound confusing at first: 24 billion parameters, but only 2 billion active per token. This is not a typo. It's the entire premise.
Most language models are "dense"—every parameter is active for every token. A 7B dense model uses all 7B parameters on each prediction. LFM2-24B is "sparse"—it has 24B parameters, but for each token, only 2B activate. The other 22B sit dormant. This cuts per-token compute roughly 12x compared to a dense 24B model while retaining the knowledge stored across the full 24B parameters.
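The 12x figure is just the ratio of total to active parameters. A quick back-of-envelope check, using the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token:

```python
def flops_per_token(active_params):
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    return 2 * active_params

dense_24b = flops_per_token(24e9)  # dense model: all 24B parameters active
lfm2_moe = flops_per_token(2e9)    # LFM2-24B: only 2B parameters active
ratio = dense_24b / lfm2_moe
print(f"per-token compute ratio: {ratio:.0f}x")  # 12x
```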
Why does this matter? Because it means you can run a model with 24B parameters' worth of knowledge on consumer hardware. A laptop with 32GB RAM. A desktop with an M4 Max or AMD CPU. No GPU required (though a GPU helps). No cloud API bills. No dependency on external infrastructure. Your model, on your hardware, completely offline.
The Use Cases We've Tested
Document Processing: Upload contracts, NDAs, research papers. LFM2-24B reads them, extracts data, flags issues. Completely private. Full audit trail. No vendor seeing your docs.
Code Review & Refactoring: Paste codebase. LFM2-24B scans for bugs, refactors functions, suggests optimization. 32K context window means it sees entire files. Offline means your IP stays in house.
Sensitive Data Analysis: Healthcare, finance, legal fields where cloud processing is a compliance nightmare. LFM2-24B runs locally. Your data never leaves your infrastructure. No HIPAA/GDPR tensions.
Autonomous Agent Workflows: LFM2-24B can control tools, access filesystems, make decisions without human intervention. Via MCP (Model Context Protocol), it connects to any tool you wire up.
The Architecture: Sparse MoE Explained
LFM2-24B uses a hybrid architecture combining sparse mixture-of-experts (MoE) with attention and convolution layers. Let's break down what this means and why it matters.
Traditional language models use a single large neural network. Every parameter participates in every computation. This is computationally expensive. You need massive hardware to run big models fast.
Mixture of Experts (MoE) architecture splits the model into multiple "expert" networks. For each input token, a "router" decides which experts to activate. Only those experts compute output. The rest stay off. This dramatically reduces compute requirements.
LFM2-24B takes this further. It has 24B total parameters organized as many small experts. Only 2B activate per token. For a typical inference task, you're only computing with 2B parameters—roughly the compute cost of a dense 2B model, though all 24B parameters must still fit in memory, which is why the RAM requirement stays at 32GB rather than what a dense 2B model would need.
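A minimal sketch of how an MoE router picks experts. Everything here is illustrative (the expert count, embedding dimension, and top-k are made up for the example, not published LFM2-24B figures), but the mechanism is the standard one: score all experts, keep the top few, renormalize their weights.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_embedding, router_weights, top_k=2):
    """Score each expert, keep only the top_k, renormalize their weights."""
    scores = [sum(w * x for w, x in zip(expert_w, token_embedding))
              for expert_w in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# Toy setup: 8 experts, 4-dim embeddings (illustrative numbers only).
random.seed(0)
router_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
token = [0.5, -1.2, 0.3, 0.9]
selected = route(token, router_weights, top_k=2)
print(selected)  # two (expert_index, weight) pairs; weights sum to 1
```

Only the selected experts run their forward pass; the rest contribute zero compute for that token.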
Hybrid Attention-Convolution Design
LFM2-24B doesn't use pure transformer attention (like Claude or ChatGPT). It blends attention mechanisms with convolutional layers. Attention captures long-range dependencies. Convolutions handle local patterns efficiently. Combined, they reduce memory overhead compared to pure attention while maintaining reasoning capability. This is why it fits in 32GB RAM.
The trade-off? Pure attention models like Claude have slightly better reasoning and longer effective context windows. But for practical purposes—document processing, code review, agent control—the hybrid design of LFM2-24B is more efficient and nearly as capable.
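One way to see where the memory saving comes from: every attention layer must keep a KV cache that grows linearly with context length, while convolution layers carry only a small fixed-size state. The layer counts and head dimensions below are hypothetical, not published LFM2-24B figures; the point is the ratio, not the absolute numbers.

```python
def kv_cache_bytes(num_attn_layers, context_len, num_kv_heads, head_dim,
                   bytes_per_val=2):
    """K and V caches: 2 tensors per layer, each context_len x heads x dim."""
    return 2 * num_attn_layers * context_len * num_kv_heads * head_dim * bytes_per_val

# Hypothetical configs (NOT published LFM2-24B numbers, purely illustrative):
pure_attention = kv_cache_bytes(num_attn_layers=40, context_len=32_768,
                                num_kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(num_attn_layers=10, context_len=32_768,
                        num_kv_heads=8, head_dim=128)  # conv layers keep no KV cache

print(f"pure attention KV cache: {pure_attention / 2**30:.2f} GiB")
print(f"hybrid (fewer attn layers): {hybrid / 2**30:.2f} GiB")
```

With these assumed numbers, swapping three quarters of the attention layers for convolutions cuts the 32K-context KV cache by 4x, which is exactly the kind of headroom that lets the whole system fit in 32GB.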
Performance on Consumer Hardware
We tested LFM2-24B on different hardware configurations. Here's what we measured.
| Hardware | Throughput | Latency (1st token) | Viable? |
|---|---|---|---|
| MacBook M4 Max | ~60 tok/s | 385ms avg | ✓ Very Good |
| AMD Ryzen 7 (CPU) | 112 tok/s | ~250ms | ✓ Excellent |
| NVIDIA H100 GPU | 293 tok/s | ~80ms | ✓ Production Ready |
| 32GB RAM (baseline) | Variable | ~200-400ms | ✓ Works |
The AMD CPU throughput is what surprised us. 112 tokens/second on a standard consumer CPU is fast enough for interactive work. You can have a conversation, process documents, or run agent workflows without noticeable latency. The M4 Max is slower (60 tok/s) but still viable for most tasks. If you add an H100 GPU, you hit 293 tok/s—competitive with cloud API latencies.
32GB RAM is the hard requirement. The model weights plus runtime buffers consume most of that 32GB on their own. Add OS overhead, and 32GB is the absolute floor, with 40GB+ recommended if you want room for other applications.
Real-World Timing Examples
Document Analysis (2000-word contract): 3-5 seconds on AMD CPU, 1-2 seconds on M4 Max. Extract terms, flag risks, summarize. Completely offline.
Code Review (500 lines): 4-6 seconds. Scan for bugs, suggest refactors, check for security issues. Zero API calls.
Chat (conversation): First response in 200-400ms, then 60-112 tokens/second streaming. Feels responsive. No waiting on cloud.
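These timings are easy to sanity-check against the throughput table. A back-of-envelope estimate (the 0.75 words-per-token ratio is a rough English average, and prompt-processing time is folded into the first-token constant):

```python
def estimate_seconds(output_words, tok_per_s, words_per_token=0.75,
                     first_token_s=0.25):
    # Tokens generated ~= words / words-per-token; add first-token latency.
    tokens = output_words / words_per_token
    return first_token_s + tokens / tok_per_s

cpu = estimate_seconds(output_words=300, tok_per_s=112)   # AMD Ryzen CPU
h100 = estimate_seconds(output_words=300, tok_per_s=293)  # NVIDIA H100
print(f"300-word summary: CPU ~{cpu:.1f}s, H100 ~{h100:.1f}s")
```

The CPU estimate lands inside the 3-5 second window reported above for contract analysis.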
MCP and Agent Capabilities
MCP (Model Context Protocol) is an open standard for connecting AI agents to tools. It's how an AI model calls functions, accesses files, triggers actions, and coordinates workflows without needing a separate orchestration layer.
LFM2-24B has native MCP support. This means you can define tools—read files, write files, OCR documents, scan for viruses, query databases—and the model can call them autonomously. No API intermediary. No cloud dependencies. Direct model-to-tool execution.
MCP Tools You Can Wire Up
Filesystem: Read/write files, create directories, list contents
OCR: Extract text from images and PDFs
Security Scanning: Virus checks, code vulnerability detection
Database Queries: Read/write to local databases
Custom APIs: Connect to any local service or REST endpoint
Web Scraping: Fetch and analyze web content locally
This enables true autonomous workflows. Example: "Process all PDF contracts in the Documents folder, extract terms, flag missing signatures, save summaries to a database." LFM2-24B can do this entirely offline—reading files, analyzing PDFs via OCR, making decisions, writing results—without any external API calls.
The key difference from orchestration tools like n8n or Make: with LFM2-24B + MCP, the AI model itself decides what to do. You don't build a workflow in a GUI. You describe a goal and the model figures out the steps. "Find all security issues in our codebase, prioritize by severity, and create a GitHub issue for each." The model breaks this down, calls the appropriate tools, and executes.
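What tool-call dispatch looks like under the hood, sketched in plain Python. This is not the actual MCP wire format or the LFM2-24B runtime API; it just shows the shape of the loop: the model emits a structured call naming a registered tool, and the host executes it locally and feeds the result back.

```python
import json
import os

# Hypothetical tool registry. Names and signatures are illustrative,
# not taken from the MCP spec or any LFM2-24B runtime.
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def list_dir(path: str) -> list:
    return sorted(os.listdir(path))

TOOLS = {"read_file": read_file, "list_dir": list_dir}

def dispatch(tool_call_json: str):
    """Execute one model-emitted call: {"tool": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["tool"]](**call["arguments"])

# When the model decides to inspect a folder, it emits something like:
result = dispatch('{"tool": "list_dir", "arguments": {"path": "."}}')
print(result)
```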
Privacy-First Architecture
LFM2-24B is built for complete privacy. Zero data egress. No logging of inputs. No selling of data. No compliance gray areas.
Everything runs locally. Your documents, code, data—none of it ever leaves your hardware. Cloud providers typically log requests (for debugging, training, abuse prevention). A local model doesn't have to. You process your data, the model forgets it, done.
Privacy Guarantees
Zero API Calls: No data sent to external services. Everything local.
Full Audit Trail: Every action logged locally. You can review what the model did and with what data.
No Data Collection: Liquid AI doesn't collect your inputs or outputs. They can't.
Compliance-Ready: HIPAA, GDPR, SOX, etc. Process regulated data without cloud vendor tensions.
Log Control: Nothing is recorded unless you choose to record it; you decide what the local audit trail retains and for how long. Useful for sensitive work.
We tested this with healthcare documents (PHI-level sensitive data). LFM2-24B processed them entirely offline. No cloud, no logs, no compliance issues. For regulated industries—healthcare, finance, law—this is essential.
LocalCowork: Building Local Agent Workflows
LocalCowork is a framework for building autonomous workflows entirely on local hardware. It connects LFM2-24B (or other local models) to tools via MCP, giving you a complete agent system without cloud dependencies.
Think of it as the missing piece. You have a powerful local model (LFM2-24B). You have a tool protocol (MCP). LocalCowork wires them together and handles the orchestration. Define your goal, describe your tools, let the model execute.
LocalCowork Workflow Example
Goal: "Audit all SQL queries in our Python codebase for security vulnerabilities"
Tools Available: Read files, run security scanner, write reports
LocalCowork Process:
1. Model reads goal, scans filesystem for .py files
2. Reads each file, extracts SQL queries
3. Calls security scanner on each query
4. Aggregates results
5. Generates report with findings and severity
6. Writes report to disk
All offline. All auditable. All on local hardware.
LocalCowork handles the hard parts: managing model context, deciding when to call tools, handling errors, retrying failed operations. You define goals and tools. LocalCowork+LFM2-24B handle the execution.
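The loop LocalCowork runs can be sketched in a few lines of plain Python. To be clear: `local_model` and the stub tool below are placeholders invented for illustration, not LocalCowork's actual API; a real runtime would prompt LFM2-24B with the goal, the tool schemas, and prior results, then parse a structured tool call from the model's output.

```python
def local_model(goal, history):
    """Stand-in for an LFM2-24B call that returns the next action.
    Hardcoded here; a real runtime parses the model's structured output."""
    if not history:
        return {"tool": "scan", "args": {"pattern": "*.py"}}
    return {"tool": "done", "args": {"report": f"{len(history)} step(s) taken"}}

TOOLS = {"scan": lambda pattern: ["app.py", "db.py"]}  # stub tool for the sketch

def run_agent(goal, max_steps=10):
    """Plan-act loop: ask the model for an action, execute it, feed back."""
    history = []
    for _ in range(max_steps):
        action = local_model(goal, history)
        if action["tool"] == "done":
            return action["args"]["report"], history
        result = TOOLS[action["tool"]](**action["args"])
        history.append((action, result))
    return "step budget exhausted", history

report, history = run_agent("audit SQL queries for vulnerabilities")
print(report)
```

The framework's value is everything the stub hides: context management, error handling, and retries around each turn of this loop.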
How It Compares to Other Local Models
LFM2-24B is not the only local model. Let's see how it stacks up.
| Model | RAM Required | Speed (tok/s) | Reasoning | MCP Support |
|---|---|---|---|---|
| LFM2-24B | 32GB | 112 (CPU) | Strong | ✓ Native |
| Llama 2 70B | 70GB+ | 40-60 | Moderate | ✗ |
| Mistral 7B | 16GB | 80-100 | Weak | ✗ |
| Mixtral 8x7B | 48GB | 75-90 | Moderate | ✗ |
LFM2-24B is the sweet spot. It's smaller than Llama 70B (32GB vs 70GB), faster than Mixtral (112 tok/s on CPU), and significantly more capable than Mistral 7B. The native MCP support is unique in this lineup: none of the other local models listed here ship with agent-ready tool integration.
Compared to cloud models (Claude, ChatGPT), LFM2-24B is slower (112 tok/s vs 400+ tok/s for cloud) and slightly less capable at reasoning. But it runs offline, costs nothing to run (after initial infrastructure), and keeps your data private. For use cases where privacy or cost dominates, LFM2-24B wins.
Frequently Asked Questions
Can I run LFM2-24B on an M1/M2 MacBook?
Yes, but you need at least 32GB unified memory. M1/M2/M3/M4 chips with 32GB unified RAM can run LFM2-24B. Performance is ~40-60 tokens/second depending on the chip. M4 Max runs faster (~60 tok/s). It's viable but not blazingly fast. If you need speed, M4 Max or AMD CPU is better.
Is LFM2-24B open source?
Yes. LFM2-24B is available on Hugging Face (LiquidAI/LFM2-24B-A2B). You can download it, run it locally, integrate it into your own applications. The model weights are public. The architecture is documented. Full open source.
What hardware do I absolutely need?
Minimum: 32GB RAM (required), CPU with decent core count (Ryzen 7, Apple M4, Intel i7). GPU is optional but recommended for speed (NVIDIA A100, H100, or consumer GPU like RTX 4090). SSD recommended for model loading speed. You can run it on a 2-year-old laptop with 32GB RAM. Not comfortable, but viable.
How does LFM2-24B handle long documents?
32K token context window means ~24,000 words. A 50,000-word document needs to be split. Tools like LangChain and LocalCowork handle splitting and re-summarization automatically. For most practical documents (contracts, PDFs, code files), 32K is sufficient in one pass.
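A minimal chunking sketch along those lines. The 0.75 words-per-token ratio is a rough English average and the overlap size is an arbitrary choice, but the mechanics are what LangChain-style splitters do: slice to fit the window, with overlap so boundary sentences land in both neighboring chunks.

```python
def chunk_words(text, max_tokens=32_000, words_per_token=0.75,
                overlap_words=200):
    """Split text into chunks that fit a ~32K-token window, with overlap."""
    max_words = int(max_tokens * words_per_token)  # ~24,000 words per chunk
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap_words  # back up so boundaries overlap
    return chunks

doc = "word " * 50_000        # 50,000-word stand-in document
parts = chunk_words(doc)
print(len(parts))             # a 50K-word doc splits into 3 chunks here
```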
Can I use LFM2-24B for code generation?
Yes, it's capable at code generation. Not as good as Claude Opus 4.6 (which hits 80%+ on SWE-bench), but solid for most coding tasks. It excels at code review, refactoring, and bug detection. For greenfield code generation, cloud models are better. For analyzing and improving existing code, LFM2-24B is excellent.
What's the difference between LFM2-24B and LFM2-24B-A2B?
The A2B suffix denotes "2 billion active parameters", Liquid AI's convention for naming sparse MoE checkpoints. LFM2-24B-A2B is the full Hugging Face name (LiquidAI/LFM2-24B-A2B) for the same sparse model described throughout this article; the suffix simply makes the active-parameter count explicit. Use LFM2-24B-A2B when downloading for local inference.
Is there a risk of vendor lock-in with LFM2-24B?
No. LFM2-24B is open source. You own the model weights. You run it on your hardware. If Liquid AI disappears tomorrow, your inference still works. No API to depend on. No vendor relationship. This is the strength of local models—true independence.
What's the licensing? Can I use it commercially?
LFM2-24B is released under an open license allowing commercial use. You can build products on top of it, run it in production, charge customers—no royalties to Liquid AI. Check the specific license on Hugging Face for exact terms, but commercial use is explicitly permitted.
The Bottom Line
LFM2-24B represents a genuine shift in what's possible with local AI. A model powerful enough for serious work, small enough to fit on consumer hardware, fast enough to feel responsive, private enough for regulated industries, and open enough to build on without vendor dependencies.
If you've been hesitant about local models because they felt too slow or too weak, LFM2-24B changes the calculus. It's fast enough, it's capable enough, and it's the most agent-ready local model released to date. With MCP support and LocalCowork, you can build autonomous workflows entirely offline.
The trade-off is clear: it's slower than cloud models and slightly less capable at complex reasoning. But for use cases where privacy, cost, or independence matter, it's unbeatable. Download it. Try it. See if local is the right move for your work.