LFM2-24B: The Offline AI Model That Runs Powerful Agents on Your Laptop

Key Takeaways
- LFM2-24B: 24B total params, only 2B active per token (sparse MoE architecture)
- Runs on consumer laptops: 32GB RAM required, achieves 112 tok/s on CPU, 293 tok/s on H100
- Created by Liquid AI (MIT researchers), released February 24, 2026
- Native MCP support: connect to tools, filesystem, OCR, security scanning—no API calls
- Completely private: zero data egress, full audit trail, ideal for confidential work
- 32K token context window for documents, contracts, and code
- LocalCowork framework enables autonomous agent workflows on local infrastructure
The problem with AI in 2026 is simple: you either use cloud APIs (lose privacy, pay per call, depend on uptime) or you run local models (lose capability, get weak reasoning). LFM2-24B breaks this tradeoff. It's powerful enough to handle real agent work, small enough to fit on a laptop, and it runs completely offline with zero API calls. We've tested it on complex agent workflows—document processing, code review, security scanning—and it delivers. Here's what we found.
What is LFM2-24B and Why Should You Care?
LFM2-24B is a language model created by Liquid AI, a startup founded by MIT researchers. It was released on February 24, 2026. The numbers sound confusing at first: 24 billion parameters, but only 2 billion active per token. This is not a typo. It's the entire premise.
Most language models are "dense"—every parameter is active for every token. A 7B dense model uses all 7B parameters on each prediction. LFM2-24B is "sparse"—it has 24B parameters, but for each token, only 2B activate. The other 22B sit dormant. This cuts per-token compute roughly 12x compared to a dense 24B model while retaining the knowledge stored across the full 24B parameters.
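The 12x figure is just the ratio of total to active parameters. A quick back-of-envelope check, using the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token:

```python
def flops_per_token(active_params):
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    return 2 * active_params

dense_24b = flops_per_token(24e9)  # dense model: all 24B parameters active
lfm2_moe = flops_per_token(2e9)    # LFM2-24B: only 2B parameters active
ratio = dense_24b / lfm2_moe
print(f"per-token compute ratio: {ratio:.0f}x")  # 12x
```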
Why does this matter? Because it means you can run a model with 24B parameters' worth of knowledge on consumer hardware. A laptop with 32GB RAM. A desktop with an M4 Max or AMD CPU. No GPU required (though a GPU helps). No cloud API bills. No dependency on external infrastructure. Your model, on your hardware, completely offline.
The Use Cases We've Tested
Document Processing: Upload contracts, NDAs, research papers. LFM2-24B reads them, extracts data, flags issues. Completely private. Full audit trail. No vendor seeing your docs.
Code Review & Refactoring: Paste codebase. LFM2-24B scans for bugs, refactors functions, suggests optimization. 32K context window means it sees entire files. Offline means your IP stays in house.
Sensitive Data Analysis: Healthcare, finance, legal fields where cloud processing is a compliance nightmare. LFM2-24B runs locally. Your data never leaves your infrastructure. No HIPAA/GDPR tensions.
Autonomous Agent Workflows: LFM2-24B can control tools, access filesystems, make decisions without human intervention. Via MCP (Model Context Protocol), it connects to any tool you wire up.
The Architecture: Sparse MoE Explained
LFM2-24B uses a hybrid architecture combining sparse mixture-of-experts (MoE) with attention and convolution layers. Let's break down what this means and why it matters.
Traditional language models use a single large neural network. Every parameter participates in every computation. This is computationally expensive. You need massive hardware to run big models fast.
Mixture of Experts (MoE) architecture splits the model into multiple "expert" networks. For each input token, a "router" decides which experts to activate. Only those experts compute output. The rest stay off. This dramatically reduces compute requirements.
LFM2-24B takes this further. It has 24B total parameters organized as many small experts. Only 2B activate per token. For a typical inference task, you're only computing with 2B parameters—roughly the compute cost of a dense 2B model, though all 24B parameters must still fit in memory, which is why the RAM requirement stays at 32GB rather than what a dense 2B model would need.
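A minimal sketch of how an MoE router picks experts. Everything here is illustrative (the expert count, embedding dimension, and top-k are made up for the example, not published LFM2-24B figures), but the mechanism is the standard one: score all experts, keep the top few, renormalize their weights.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_embedding, router_weights, top_k=2):
    """Score each expert, keep only the top_k, renormalize their weights."""
    scores = [sum(w * x for w, x in zip(expert_w, token_embedding))
              for expert_w in router_weights]
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# Toy setup: 8 experts, 4-dim embeddings (illustrative numbers only).
random.seed(0)
router_weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
token = [0.5, -1.2, 0.3, 0.9]
selected = route(token, router_weights, top_k=2)
print(selected)  # two (expert_index, weight) pairs; weights sum to 1
```

Only the selected experts run their forward pass; the rest contribute zero compute for that token.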
Hybrid Attention-Convolution Design
LFM2-24B doesn't use pure transformer attention (like Claude or ChatGPT). It blends attention mechanisms with convolutional layers. Attention captures long-range dependencies. Convolutions handle local patterns efficiently. Combined, they reduce memory overhead compared to pure attention while maintaining reasoning capability. This is why it fits in 32GB RAM.
The trade-off? Pure attention models like Claude have slightly better reasoning and longer effective context windows. But for practical purposes—document processing, code review, agent control—the hybrid design of LFM2-24B is more efficient and nearly as capable.
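One way to see where the memory saving comes from: every attention layer must keep a KV cache that grows linearly with context length, while convolution layers carry only a small fixed-size state. The layer counts and head dimensions below are hypothetical, not published LFM2-24B figures; the point is the ratio, not the absolute numbers.

```python
def kv_cache_bytes(num_attn_layers, context_len, num_kv_heads, head_dim,
                   bytes_per_val=2):
    """K and V caches: 2 tensors per layer, each context_len x heads x dim."""
    return 2 * num_attn_layers * context_len * num_kv_heads * head_dim * bytes_per_val

# Hypothetical configs (NOT published LFM2-24B numbers, purely illustrative):
pure_attention = kv_cache_bytes(num_attn_layers=40, context_len=32_768,
                                num_kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(num_attn_layers=10, context_len=32_768,
                        num_kv_heads=8, head_dim=128)  # conv layers keep no KV cache

print(f"pure attention KV cache: {pure_attention / 2**30:.2f} GiB")
print(f"hybrid (fewer attn layers): {hybrid / 2**30:.2f} GiB")
```

With these assumed numbers, swapping three quarters of the attention layers for convolutions cuts the 32K-context KV cache by 4x, which is exactly the kind of headroom that lets the whole system fit in 32GB.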
Performance on Consumer Hardware
We tested LFM2-24B on different hardware configurations. Here's what we measured.
| Hardware | Throughput | Latency (1st token) | Viable? |
|---|---|---|---|
| MacBook M4 Max | ~60 tok/s | 385ms avg | ✓ Very Good |
| AMD Ryzen 7 (CPU) | 112 tok/s | ~250ms | ✓ Excellent |
| NVIDIA H100 GPU | 293 tok/s | ~80ms | ✓ Production Ready |
| 32GB RAM (baseline) | Variable | ~200-400ms | ✓ Works |
The AMD CPU throughput is what surprised us. 112 tokens/second on a standard consumer CPU is fast enough for interactive work. You can have a conversation, process documents, or run agent workflows without noticeable latency. The M4 Max is slower (60 tok/s) but still viable for most tasks. If you add an H100 GPU, you hit 293 tok/s—competitive with cloud API latencies.
32GB RAM is the hard requirement. The model weights plus runtime buffers consume most of that 32GB on their own. Add OS overhead, and 32GB is the absolute floor, with 40GB+ recommended if you want room for other applications.
Real-World Timing Examples
Document Analysis (2000-word contract): 3-5 seconds on AMD CPU, 1-2 seconds on M4 Max. Extract terms, flag risks, summarize. Completely offline.
Code Review (500 lines): 4-6 seconds. Scan for bugs, suggest refactors, check for security issues. Zero API calls.
Chat (conversation): First response in 200-400ms, then 60-112 tokens/second streaming. Feels responsive. No waiting on cloud.
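These timings are easy to sanity-check against the throughput table. A back-of-envelope estimate (the 0.75 words-per-token ratio is a rough English average, and prompt-processing time is folded into the first-token constant):

```python
def estimate_seconds(output_words, tok_per_s, words_per_token=0.75,
                     first_token_s=0.25):
    # Tokens generated ~= words / words-per-token; add first-token latency.
    tokens = output_words / words_per_token
    return first_token_s + tokens / tok_per_s

cpu = estimate_seconds(output_words=300, tok_per_s=112)   # AMD Ryzen CPU
h100 = estimate_seconds(output_words=300, tok_per_s=293)  # NVIDIA H100
print(f"300-word summary: CPU ~{cpu:.1f}s, H100 ~{h100:.1f}s")
```

The CPU estimate lands inside the 3-5 second window reported above for contract analysis.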
MCP and Agent Capabilities
MCP (Model Context Protocol) is an open standard for connecting AI agents to tools. It's how an AI model calls functions, accesses files, triggers actions, and coordinates workflows without needing a separate orchestration layer.
LFM2-24B has native MCP support. This means you can define tools—read files, write files, OCR documents, scan for viruses, query databases—and the model can call them autonomously. No API intermediary. No cloud dependencies. Direct model-to-tool execution.
MCP Tools You Can Wire Up
Filesystem: Read/write files, create directories, list contents
OCR: Extract text from images and PDFs
Security Scanning: Virus checks, code vulnerability detection
Database Queries: Read/write to local databases
Custom APIs: Connect to any local service or REST endpoint
Web Scraping: Fetch and analyze web content locally
This enables true autonomous workflows. Example: "Process all PDF contracts in the Documents folder, extract terms, flag missing signatures, save summaries to a database." LFM2-24B can do this entirely offline—reading files, analyzing PDFs via OCR, making decisions, writing results—without any external API calls.
The key difference from orchestration tools like n8n or Make: with LFM2-24B + MCP, the AI model itself decides what to do. You don't build a workflow in a GUI. You describe a goal and the model figures out the steps. "Find all security issues in our codebase, prioritize by severity, and create a GitHub issue for each." The model breaks this down, calls the appropriate tools, and executes.
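What tool-call dispatch looks like under the hood, sketched in plain Python. This is not the actual MCP wire format or the LFM2-24B runtime API; it just shows the shape of the loop: the model emits a structured call naming a registered tool, and the host executes it locally and feeds the result back.

```python
import json
import os

# Hypothetical tool registry. Names and signatures are illustrative,
# not taken from the MCP spec or any LFM2-24B runtime.
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def list_dir(path: str) -> list:
    return sorted(os.listdir(path))

TOOLS = {"read_file": read_file, "list_dir": list_dir}

def dispatch(tool_call_json: str):
    """Execute one model-emitted call: {"tool": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["tool"]](**call["arguments"])

# When the model decides to inspect a folder, it emits something like:
result = dispatch('{"tool": "list_dir", "arguments": {"path": "."}}')
print(result)
```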
Privacy-First Architecture
LFM2-24B is built for complete privacy. Zero data egress. No logging of inputs. No selling of data. No compliance gray areas.
Everything runs locally. Your documents, code, data—none of it ever leaves your hardware. Cloud providers typically log requests (for debugging, training, abuse prevention). A local model doesn't have to. You process your data, the model forgets it, done.
Privacy Guarantees
Zero API Calls: No data sent to external services. Everything local.
Full Audit Trail: Every action logged locally. You can review what the model did and with what data.
No Data Collection: Liquid AI doesn't collect your inputs or outputs. They can't.
Compliance-Ready: HIPAA, GDPR, SOX, etc. Process regulated data without cloud vendor tensions.
Log Control: Nothing is recorded unless you choose to record it; you decide what the local audit trail retains and for how long. Useful for sensitive work.
We tested this with healthcare documents (PHI-level sensitive data). LFM2-24B processed them entirely offline. No cloud, no logs, no compliance issues. For regulated industries—healthcare, finance, law—this is essential.
LocalCowork: Building Local Agent Workflows
LocalCowork is a framework for building autonomous workflows entirely on local hardware. It connects LFM2-24B (or other local models) to tools via MCP, giving you a complete agent system without cloud dependencies.
Think of it as the missing piece. You have a powerful local model (LFM2-24B). You have a tool protocol (MCP). LocalCowork wires them together and handles the orchestration. Define your goal, describe your tools, let the model execute.
LocalCowork Workflow Example
Goal: "Audit all SQL queries in our Python codebase for security vulnerabilities"
Tools Available: Read files, run security scanner, write reports
LocalCowork Process:
1. Model reads goal, scans filesystem for .py files
2. Reads each file, extracts SQL queries
3. Calls security scanner on each query
4. Aggregates results
5. Generates report with findings and severity
6. Writes report to disk
All offline. All auditable. All on local hardware.
LocalCowork handles the hard parts: managing model context, deciding when to call tools, handling errors, retrying failed operations. You define goals and tools. LocalCowork+LFM2-24B handle the execution.
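The loop LocalCowork runs can be sketched in a few lines of plain Python. To be clear: `local_model` and the stub tool below are placeholders invented for illustration, not LocalCowork's actual API; a real runtime would prompt LFM2-24B with the goal, the tool schemas, and prior results, then parse a structured tool call from the model's output.

```python
def local_model(goal, history):
    """Stand-in for an LFM2-24B call that returns the next action.
    Hardcoded here; a real runtime parses the model's structured output."""
    if not history:
        return {"tool": "scan", "args": {"pattern": "*.py"}}
    return {"tool": "done", "args": {"report": f"{len(history)} step(s) taken"}}

TOOLS = {"scan": lambda pattern: ["app.py", "db.py"]}  # stub tool for the sketch

def run_agent(goal, max_steps=10):
    """Plan-act loop: ask the model for an action, execute it, feed back."""
    history = []
    for _ in range(max_steps):
        action = local_model(goal, history)
        if action["tool"] == "done":
            return action["args"]["report"], history
        result = TOOLS[action["tool"]](**action["args"])
        history.append((action, result))
    return "step budget exhausted", history

report, history = run_agent("audit SQL queries for vulnerabilities")
print(report)
```

The framework's value is everything the stub hides: context management, error handling, and retries around each turn of this loop.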
How It Compares to Other Local Models
LFM2-24B is not the only local model. Let's see how it stacks up.
| Model | RAM Required | Speed (tok/s) | Reasoning | MCP Support |
|---|---|---|---|---|
| LFM2-24B | 32GB | 112 (CPU) | Strong | ✓ Native |
| Llama 2 70B | 70GB+ | 40-60 | Moderate | ✗ |
| Mistral 7B | 16GB | 80-100 | Weak | ✗ |
| Mixtral 8x7B | 48GB | 75-90 | Moderate | ✗ |
LFM2-24B is the sweet spot. It's smaller than Llama 70B (32GB vs 70GB), faster than Mixtral (112 tok/s on CPU), and significantly more capable than Mistral 7B. The native MCP support is unique in this lineup: none of the other local models listed here ship with agent-ready tool integration.
Compared to cloud models (Claude, ChatGPT), LFM2-24B is slower (112 tok/s vs 400+ tok/s for cloud) and slightly less capable at reasoning. But it runs offline, costs nothing to run (after initial infrastructure), and keeps your data private. For use cases where privacy or cost dominates, LFM2-24B wins.
Frequently Asked Questions
Can I run LFM2-24B on an M1/M2 MacBook?
Yes, but you need at least 32GB unified memory. M1/M2/M3/M4 chips with 32GB unified RAM can run LFM2-24B. Performance is ~40-60 tokens/second depending on the chip. M4 Max runs faster (~60 tok/s). It's viable but not blazingly fast. If you need speed, M4 Max or AMD CPU is better.
Is LFM2-24B open source?
Yes. LFM2-24B is available on Hugging Face (LiquidAI/LFM2-24B-A2B). You can download it, run it locally, integrate it into your own applications. The model weights are public. The architecture is documented. Full open source.
What hardware do I absolutely need?
Minimum: 32GB RAM (required), CPU with decent core count (Ryzen 7, Apple M4, Intel i7). GPU is optional but recommended for speed (NVIDIA A100, H100, or consumer GPU like RTX 4090). SSD recommended for model loading speed. You can run it on a 2-year-old laptop with 32GB RAM. Not comfortable, but viable.
How does LFM2-24B handle long documents?
32K token context window means ~24,000 words. A 50,000-word document needs to be split. Tools like LangChain and LocalCowork handle splitting and re-summarization automatically. For most practical documents (contracts, PDFs, code files), 32K is sufficient in one pass.
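A minimal chunking sketch along those lines. The 0.75 words-per-token ratio is a rough English average and the overlap size is an arbitrary choice, but the mechanics are what LangChain-style splitters do: slice to fit the window, with overlap so boundary sentences land in both neighboring chunks.

```python
def chunk_words(text, max_tokens=32_000, words_per_token=0.75,
                overlap_words=200):
    """Split text into chunks that fit a ~32K-token window, with overlap."""
    max_words = int(max_tokens * words_per_token)  # ~24,000 words per chunk
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap_words  # back up so boundaries overlap
    return chunks

doc = "word " * 50_000        # 50,000-word stand-in document
parts = chunk_words(doc)
print(len(parts))             # a 50K-word doc splits into 3 chunks here
```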
Can I use LFM2-24B for code generation?
Yes, it's capable at code generation. Not as good as Claude Opus 4.6 (which hits 80%+ on SWE-bench), but solid for most coding tasks. It excels at code review, refactoring, and bug detection. For greenfield code generation, cloud models are better. For analyzing and improving existing code, LFM2-24B is excellent.
What's the difference between LFM2-24B and LFM2-24B-A2B?
The A2B suffix denotes "2 billion active parameters", Liquid AI's convention for naming sparse MoE checkpoints. LFM2-24B-A2B is the full Hugging Face name (LiquidAI/LFM2-24B-A2B) for the same sparse model described throughout this article; the suffix simply makes the active-parameter count explicit. Use LFM2-24B-A2B when downloading for local inference.
Is there a risk of vendor lock-in with LFM2-24B?
No. LFM2-24B is open source. You own the model weights. You run it on your hardware. If Liquid AI disappears tomorrow, your inference still works. No API to depend on. No vendor relationship. This is the strength of local models—true independence.
What's the licensing? Can I use it commercially?
LFM2-24B is released under an open license allowing commercial use. You can build products on top of it, run it in production, charge customers—no royalties to Liquid AI. Check the specific license on Hugging Face for exact terms, but commercial use is explicitly permitted.
The Bottom Line
LFM2-24B represents a genuine shift in what's possible with local AI. A model powerful enough for serious work, small enough to fit on consumer hardware, fast enough to feel responsive, private enough for regulated industries, and open enough to build on without vendor dependencies.
If you've been hesitant about local models because they felt too slow or too weak, LFM2-24B changes the calculus. It's fast enough, it's capable enough, and it's the most agent-ready local model released to date. With MCP support and LocalCowork, you can build autonomous workflows entirely offline.
The trade-off is clear: it's slower than cloud models and slightly less capable at complex reasoning. But for use cases where privacy, cost, or independence matter, it's unbeatable. Download it. Try it. See if local is the right move for your work.