PR #22673, which adds Multi-Token Prediction (MTP) support, has just been merged into the llama.cpp main branch, and the benchmark numbers are hard to ignore. On an RTX 3090, Qwen3-27B decode speed jumps from 22.97 tok/s to 42.45 tok/s. On a 2080 Ti, community members report decode speed roughly doubling, from 23 tok/s to about 47 tok/s. If you're running large models locally and your bottleneck is token generation, this is worth paying attention to.
How MTP Actually Works
MTP is a form of speculative decoding, but instead of loading a separate smaller draft model, the main model itself contains specialized MTP heads that predict multiple tokens in parallel. The llama.cpp implementation loads the MTP component as a separate model within the same GGUF file, with its own independent KV-cache and context. No second file to download or manage.
The basic mechanism: instead of generating one token at a time, the MTP heads propose 2 or 3 draft tokens ahead. The main model then verifies them. Accepted tokens are essentially free. Rejected ones trigger a rollback and regeneration from that point. Acceptance rates from the PR benchmarks are solid:
- 3 draft tokens: 72.18% aggregate acceptance rate (952 accepted / 1,319 total)
- 2 draft tokens: 82.58% acceptance rate
Most proposed tokens land on the first try. That's what makes the throughput gains real rather than theoretical.
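To make the draft-then-verify loop concrete, here is a minimal sketch of one speculative step with greedy verification. It is not the llama.cpp implementation: `draft_next` and `target_next` are hypothetical stand-ins for the MTP heads and the main model, and a real implementation verifies all drafts in a single batched forward pass rather than calling the target once per token.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" maps a token context to its next token.
NextToken = Callable[[Sequence[int]], int]

def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, n_draft: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. Draft: the cheap predictor proposes n_draft tokens in a row.
    drafted: List[int] = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))

    # 2. Verify: keep drafts only while they match what the target would emit.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # Mismatch: drop the remaining drafts, keep the target's own token.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted

def generate(context: List[int], draft_next: NextToken,
             target_next: NextToken, n_draft: int, n_tokens: int) -> List[int]:
    """Generate n_tokens by repeating speculative steps."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, n_draft))
    return out[len(context):len(context) + n_tokens]
```

The property worth noticing is that the output is identical to what the target model would produce on its own with greedy decoding; the draft only changes how many target-verified tokens each step yields.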
The Benchmark Numbers
Across hardware configurations, using Qwen3-27B, the PR benchmarks show:
- DGX Spark (baseline): 7.0 tok/s without MTP, 15.6 tok/s with 2 draft tokens (2.23x), 18.0 tok/s with 3 draft tokens (2.57x)
- RTX 3090 (Q6_K): 22.97 tok/s baseline, 42.45 tok/s with MTP (1.85x)
- RTX 2080 Ti (community-reported): 23 tok/s baseline, ~47 tok/s with MTP
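The multipliers quoted in this list follow directly from the tok/s figures. A quick recomputation, using only the numbers reported above (the dictionary labels are just for illustration):

```python
# Decode speeds in tok/s copied from the list above: (baseline, with MTP).
reported = {
    "DGX Spark, 2 draft tokens": (7.0, 15.6),
    "DGX Spark, 3 draft tokens": (7.0, 18.0),
    "RTX 3090 (Q6_K)": (22.97, 42.45),
    "RTX 2080 Ti (community-reported)": (23.0, 47.0),
}

def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP decode speed to baseline decode speed."""
    return mtp_tps / baseline_tps

for name, (base, mtp) in reported.items():
    print(f"{name}: {speedup(base, mtp):.2f}x")
```

The community-reported 2080 Ti figure comes out slightly above 2x, though it is an approximate number and not a controlled measurement.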
There's a real tradeoff buried in these numbers. Prefill speed takes a significant hit with MTP enabled: 665 tok/s versus 1,315 tok/s on the RTX 3090 setup, roughly 0.51x of the baseline, or about half the speed. The cause is Device-to-Host embedding transfers in the MTP path. The PR flags this as a known issue with optimization planned for a future update.
Practically speaking: MTP shines on long-form generation tasks where you're producing thousands of tokens after a prompt. It's a poor fit for short completions, RAG pipelines doing rapid repeated prefills, or anything where prompt processing dominates wall time.
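One rough way to reason about this tradeoff is a two-term wall-time model: prompt tokens divided by prefill speed, plus output tokens divided by decode speed. The sketch below plugs in the RTX 3090 figures quoted above. It assumes both rates stay constant regardless of context length and that a single benchmark run generalizes, neither of which is guaranteed, so treat the result as an estimate and not a measurement. The prompt length is a hypothetical example.

```python
def wall_time(prompt_tokens: float, output_tokens: float,
              prefill_tps: float, decode_tps: float) -> float:
    """Seconds to process the prompt and then generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# RTX 3090 figures quoted above, in tok/s.
BASE = {"prefill_tps": 1315.0, "decode_tps": 22.97}
MTP = {"prefill_tps": 665.0, "decode_tps": 42.45}

def break_even_output_tokens(prompt_tokens: float) -> float:
    """Output length at which MTP and baseline take the same wall time."""
    # Extra prefill seconds MTP costs per prompt token.
    extra_prefill = 1 / MTP["prefill_tps"] - 1 / BASE["prefill_tps"]
    # Decode seconds MTP saves per generated token.
    saved_decode = 1 / BASE["decode_tps"] - 1 / MTP["decode_tps"]
    return prompt_tokens * extra_prefill / saved_decode

prompt = 8000  # hypothetical prompt length, for illustration only
print(round(break_even_output_tokens(prompt)))
```

Under these assumptions, an 8,000-token prompt breaks even at around 300 output tokens: shorter completions finish sooner without MTP, longer ones finish sooner with it. Workloads that prefill repeatedly and generate very little sit on the wrong side of that line.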
Memory Overhead
The memory cost is manageable. On the RTX 3090 test setup, MTP adds approximately 2.49 GiB of VRAM (24.96 GiB total versus 22.47 GiB baseline), which the PR characterizes as less than 10% additional overhead. Measured against the 22.47 GiB baseline it works out to about 11%; it is just under 10% of the 24.96 GiB total. Either way, that's a reasonable price for an 85-160% throughput gain on generation.
If you're already close to your VRAM ceiling, check your headroom before enabling it. For most people running 27B models on 24GB cards, it should fit without dropping to a lower quant.
MTP vs. Speculative Decoding with a Draft Model
The obvious question: how does this compare to the existing approach of using a separate small draft model? According to the PR benchmarks, both methods reach roughly the same wall-time improvement (around 2.4x on the same Qwen3-27B task), but the tradeoffs differ.
- Draft model approach (Qwen3.5-0.8B with --spec-draft-n-max 16): 81.39 seconds total wall time versus 201.07 seconds baseline. Requires loading and managing a separate model file. Supports larger draft depths.
- MTP approach: Similar wall-time improvement, no separate model download, MTP heads bundled inside the main GGUF, less than 10% memory overhead, simpler setup.
For most local users, MTP wins on convenience: it delivers competitive speedups without juggling a second model file. The draft model approach may still be preferable if you want to experiment with higher draft depths or your target model doesn't have MTP weights available yet.
How to Enable It
MTP is opt-in and disabled by default. Once you're on a build that includes the merged PR, enable it with:
--spec-type draft-mtp --spec-draft-n-max 2
Use --spec-draft-n-max 3 to push for higher throughput at the cost of a lower per-token acceptance rate. The 2-token setting is the safer default for consistent acceptance rates.
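If you launch llama.cpp from a script, the flags can be assembled programmatically. The sketch below only builds and prints the command; it does not run anything. The two speculative flags come from this post, while the `llama-server` binary name, the `-m` model flag, and the model path are assumptions about your setup, so adjust them to match your build.

```python
import shlex
from typing import List

def build_command(model_path: str, draft_tokens: int = 2,
                  binary: str = "llama-server") -> List[str]:
    """Build an argument list that enables MTP drafting.

    Assumes `binary` accepts `-m` for the model path; check your build's
    --help output if unsure.
    """
    if draft_tokens < 1:
        raise ValueError("draft_tokens must be at least 1")
    return [
        binary,
        "-m", model_path,
        "--spec-type", "draft-mtp",              # flag as given in this post
        "--spec-draft-n-max", str(draft_tokens),  # 2 is the safer default
    ]

# Hypothetical model path, for illustration only.
cmd = build_command("models/qwen3-27b-mtp.gguf")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the server; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.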
MTP-capable GGUF files are currently available for Qwen3-27B and Qwen3-30B-A3B (the MoE variant) on Hugging Face. The PR notes that MTP should work for any model trained with MTP capability, but no other publicly available MTP GGUFs exist yet. Accuracy results with MTP enabled matched Qwen's reported AIME2026 benchmark values, so output quality isn't being traded for speed.
Known Limitations
- Vulkan backend is broken. MTP relies on partial rollback support (tracked in #22400) that isn't yet implemented for Vulkan. Skip this on Vulkan setups for now.
- Prefill regression is real. The ~0.51x prefill slowdown is not minor. If your workflow involves repeatedly processing long system prompts or large context windows, benchmark your specific use case before committing.
- Parallel decoding isn't fully optimized. The implementation supports it, but multi-sequence performance is noted as not yet optimized in the PR.
- Metal status is unclear. CUDA is confirmed working. Vulkan is broken. No Metal performance data is available yet.
- Model availability is limited. Right now it's the Qwen3 series only. Broader adoption depends on model authors releasing MTP-capable weights.
Bottom Line
If you run Qwen3-27B or Qwen3-30B-A3B locally on CUDA hardware with a bit of VRAM headroom, enable MTP today. Two flags, no new model files, and you're looking at nearly double the generation speed for long outputs. Don't expect it to help with prefill-heavy workloads, and stay away from it on Vulkan until that backend gets the required rollback support.
Sources
- llama.cpp PR #22673: MTP support implementation, benchmarks, and technical details
- r/LocalLLaMA: MTP support merged into llama.cpp (community benchmarks and discussion)