
Needle: A 26M Parameter Model That Beats Models 23x Its Size at Function Calling

A 26M parameter model that beats Qwen-0.6B at function calling. The trick: strip every MLP from the architecture.

Needle is a 26M parameter model from Cactus Compute that hits 6000 tok/s prefill and 1200 tok/s decode on consumer hardware, outperforms models up to 23x its size on single-shot function calling, and runs locally on phones without any server infrastructure. The trick: strip out every feed-forward network in the architecture and replace the whole thing with pure attention. It sounds radical, but the reasoning behind it is hard to argue with.

The Core Insight: Tool Calling Isn't Reasoning

Most small models try to be general-purpose. Needle doesn't. The team at Cactus built around a specific architectural hypothesis:

Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning.

If that's true, then the feed-forward networks that dominate transformer parameter counts are doing work you don't need. FFN layers are where models store factual knowledge absorbed during pretraining. But for tool calling, the facts are already in context: the tool definitions tell the model everything it needs to know. The model just has to match a query to a schema and extract the right values.
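
To see what's at stake, a quick back-of-envelope count shows how much of a standard transformer layer the FFN occupies, assuming the common 4x hidden expansion (an assumption; exact ratios vary by model family):

```python
# Back-of-envelope parameter count for one standard transformer layer,
# assuming the common d_ff = 4 * d_model expansion (an assumption; the
# exact ratio varies by model family).
d_model = 512
attn_params = 4 * d_model * d_model        # Q, K, V, O projections
ffn_params = 2 * d_model * (4 * d_model)   # up- and down-projections

total = attn_params + ffn_params
print(f"attention: {attn_params:,} ({attn_params / total:.0%})")
print(f"ffn:       {ffn_params:,} ({ffn_params / total:.0%})")
# The FFN comes out to ~67% of the layer: that's the weight Needle reclaims.
```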

So Cactus stripped the MLPs entirely. The result is what they call a Simple Attention Network: just attention and gating, no MLPs anywhere in the model.

The bet is twofold: cross-attention is the right primitive for matching a query against in-context tool schemas, and FFN parameters are wasted at this scale.

Architecture and Training

Needle uses an encoder-decoder structure. The encoder has 12 layers, the decoder has 8, and neither contains feed-forward networks. Each encoder layer runs ZCRMSNorm, self-attention with grouped query attention (GQA) and RoPE positional embeddings, and gated residuals. Decoder layers add masked self-attention and cross-attention on top of the same pattern. Encoder and decoder share embeddings, and a dedicated tool-calling head outputs softmax-normalized JSON.
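
The general shape is easy to sketch, even if the released details differ. Below is a minimal, illustrative PyTorch version of a decoder layer with no FFN: masked self-attention plus cross-attention, each behind a gated residual. Everything here is an assumption, not the actual implementation: standard nn.RMSNorm stands in for the undocumented ZCRMSNorm, the gating is a guess, and GQA and RoPE are omitted for brevity.

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    """Per-channel learned gate on the residual branch. This is a guess at
    what 'gated residuals' means here; the released code may differ."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # ReZero-style: branch starts silent

    def forward(self, x, branch):
        return x + torch.tanh(self.gate) * branch

class NoFFNDecoderLayer(nn.Module):
    """Decoder layer with masked self-attention and cross-attention but no
    feed-forward network. nn.RMSNorm stands in for ZCRMSNorm; GQA and RoPE
    are omitted to keep the sketch short."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)
        self.norm2 = nn.RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.res1 = GatedResidual(dim)
        self.res2 = GatedResidual(dim)

    def forward(self, x, enc_out, causal_mask):
        h = self.norm1(x)
        sa, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = self.res1(x, sa)
        h = self.norm2(x)
        ca, _ = self.cross_attn(h, enc_out, enc_out, need_weights=False)
        return self.res2(x, ca)

layer = NoFFNDecoderLayer(dim=256, n_heads=8)
tokens = torch.randn(1, 16, 256)    # decoder input
encoded = torch.randn(1, 64, 256)   # encoded tool definitions + query
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
out = layer(tokens, encoded, mask)  # -> (1, 16, 256)
```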

Key specs:

- 26M parameters in an encoder-decoder layout (12 encoder layers, 8 decoder layers)
- No feed-forward networks anywhere in the model
- 6000 tok/s prefill, 1200 tok/s decode on consumer hardware
- Shared encoder/decoder embeddings plus a dedicated tool-calling head

Training ran in two phases. Pretraining covered 200 billion tokens across 16 TPU v6e chips over 27 hours. Post-training on 2 billion synthesized function-calling examples took 45 minutes. The dataset synthesis pipeline is fully open-source, so you can generate your own variants.
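
The post doesn't reproduce the pipeline's record format, but single-shot function-calling data generally pairs tool schemas with a query and a gold-standard call, something like this hypothetical record:

```python
# Hypothetical shape of one synthesized training record. Field names are
# illustrative; the open-source pipeline's actual schema may differ.
example = {
    "tools": [{
        "name": "set_alarm",
        "description": "Set an alarm for a given time.",
        "parameters": {
            "type": "object",
            "properties": {
                "time": {"type": "string", "description": "24h HH:MM"},
                "label": {"type": "string"},
            },
            "required": ["time"],
        },
    }],
    "query": "Wake me up at 6:30 for the flight",
    "target": {"name": "set_alarm", "arguments": {"time": "06:30", "label": "flight"}},
}
```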

Performance vs. Alternatives

According to the Needle repo, the model beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling. To put that in perspective:

- Qwen-0.6B is roughly 23x Needle's size
- Granite-350m and LFM2.5-350m are each roughly 13x
- FunctionGemma-270m is roughly 10x

One caveat: the repo states Needle "beats" these models but doesn't publish exact benchmark scores, so treat the rankings as directional until a formal paper or reproducible eval drops. The comparison is also scoped to single-shot function calling. On conversational tasks, math, coding, or anything requiring world knowledge, those larger models will win. Needle isn't trying to compete there.

Open-Source, Local, Finetunable

The weights are fully open on Cactus-Compute/needle. The dataset generation pipeline is open too. You can finetune locally on a Mac or PC, and the model runs at production speeds on the Cactus inference framework. The team's stated goal is to build agentic models for budget phones, watches, and glasses, where spinning up a cloud API for every tool call is a non-starter on latency and cost.

There's also a playground UI for testing function calls directly, which lowers the friction for evaluating fit before you integrate.

One thing worth watching: a thread on r/LocalLLaMA raised questions about the pickle format used for weight distribution, which has known security and ecosystem compatibility concerns. If you're deploying this in a production pipeline, investigate that before you commit.
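
If the checkpoint is a standard PyTorch pickle holding a flat state_dict (an assumption; verify against the repo, and note the filenames below are illustrative), one mitigation is to load it with unpickling restricted to tensor data and re-save it as safetensors:

```python
# Assumption: the checkpoint is a plain PyTorch pickle containing a flat
# state_dict of tensors; nested checkpoints would need flattening first.
import torch
from safetensors.torch import save_file

state_dict = torch.load("needle.pt", weights_only=True, map_location="cpu")
# weights_only=True restricts unpickling to tensor data, refusing the
# arbitrary objects that make plain pickle loading a code-execution risk.
save_file(state_dict, "needle.safetensors")
```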

Where This Fits (and Where It Doesn't)

Needle is purpose-built for one job: routing a user query to the right tool and emitting a valid JSON payload. If your agent does that on a mobile or edge device, a 26M-parameter model decoding at 1200 tok/s is a compelling alternative to an API round-trip.
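
The integration surface is correspondingly small: parse the emitted JSON and dispatch to a local function. A sketch, assuming the model emits a {"name", "arguments"} object (an assumption about the output schema, not Needle's documented interface):

```python
import json

# Hypothetical glue code for the app side: parse the emitted tool call and
# dispatch to a local function. set_alarm and the output schema are
# illustrative, not part of Needle's documented interface.
def set_alarm(time: str, label: str = "") -> str:
    return f"alarm set for {time} ({label})"

TOOLS = {"set_alarm": set_alarm}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "set_alarm", "arguments": {"time": "06:30"}}'))
```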

The Cactus team also claims the no-FFN architecture generalizes to other tasks where the model has access to external structured knowledge, specifically RAG and other retrieval-augmented workflows. Experimental results on that are pending publication, so it's an interesting direction to watch, not a validated capability yet.

What Needle explicitly can't do: reasoning, math, coding, or answering questions that require baked-in world knowledge. Treat Needle as infrastructure, not a general assistant.

Bottom Line

If you're building an on-device agent that calls tools and you're tired of paying API latency and cost for what is essentially JSON routing, Needle is worth a serious look. The architectural bet (attention handles retrieval, FFNs are waste at this scale) is well-reasoned, and beating models up to 23x its size on the specific task it was built for is a real result. Hold off on betting a production system on it until formal benchmarks are published, and check the weight-format situation if security matters in your stack.
