Same model. Same feature. Same settings. One task gets nearly 3x faster. Another gets slower. That's the finding from a benchmark study of more than 300 runs on Multi-Token Prediction (MTP), posted to r/LocalLLaMA, and it has real implications for anyone running local inference across more than one kind of task.
What MTP Actually Does
Multi-Token Prediction is a speculative inference technique. Instead of generating one token at a time, a lightweight draft head predicts several tokens ahead by reusing the main model's layers and hidden states. The main model then validates those predictions in parallel. If the predictions are good, you get multiple tokens for roughly the cost of one forward pass. If they're bad, you discard them and absorb the overhead for nothing.
That last part is the key. Whether MTP makes things faster or slower comes down almost entirely to one metric: the acceptance rate.
"Even when drafting is 'free' (or rather cheap as with MTP), you cannot have a decent speed-up without a high acceptance rate."
The Core Finding: Task Type Dominates Everything
The researcher tested four task types across temperature settings of 0.0, 0.3, and 0.7, comparing F16 and Q4_K_M quantizations, with the MTP layers quantized either to Q8 or to match the model quant. After 300+ benchmarks, one variable rose above all others.
"The nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close."
Not quantization size. Not temperature alone. Not hardware. The task type.
- Coding with F16 + MTP: nearly 3x speedup
- Creative writing with Q4_K_M + MTP: slower than baseline
Why Coding Benefits and Creative Writing Doesn't
Code is predictable. Syntax follows strict rules, variable names repeat, boilerplate patterns are everywhere. When you're generating code at low temperatures (0.0 to 0.3), the draft head can predict the next several tokens with high accuracy. Acceptance rates stay high, speculation pays off, and you get the throughput gains MTP promises.
Creative writing is the opposite. Higher temperatures introduce intentional randomness. The next word in a story is much harder to guess than the next token in a function signature. Acceptance rates drop, draft predictions get discarded, and all that speculative overhead becomes pure waste. You end up slower than if you'd just generated tokens one at a time.
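A tiny sampling sketch makes the intuition visible. The logit values below are invented stand-ins: a peaked, "code-like" distribution where one continuation clearly dominates, and a flatter, "prose-like" one. At temperature 0.7, the draft's most obvious guess matches the sampled token far less often in the flat case:

```python
import math
import random

def sample(logits, temperature):
    """Sample an index from softmax(logits / temperature); greedy at T=0."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    weights = [math.exp(l / temperature) for l in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

# Invented logit profiles: peaked like code, flat like open-ended prose.
code_like = [6.0, 1.0, 0.5, 0.0]
prose_like = [2.0, 1.8, 1.6, 1.5, 1.4]

for name, logits in (("code-like", code_like), ("prose-like", prose_like)):
    draws = [sample(logits, temperature=0.7) for _ in range(20_000)]
    top1_rate = draws.count(0) / len(draws)  # how often the obvious guess wins
    print(f"{name}: draft's top guess matches ~{top1_rate:.0%} of samples")
```

This is a deliberate simplification (a real draft head predicts full distributions, not just the argmax), but it shows the mechanism: flatter next-token distributions plus higher temperature mean lower acceptance, and acceptance is the whole game.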
This isn't a quirk of one model. The study notes that users reported identical behavior across AMD Radeon, Nvidia 3090, and Nvidia 4090 hardware, and similar task-dependent patterns have appeared in Google's MTP assistant model for Gemma 4 31B and RedHat's EAGLE3.
The Hidden Cost: Prompt Processing Takes a Hit
Generation speed is the headline, but there's a penalty on the other side that matters just as much: prompt processing (PP) speed degrades significantly on some hardware when MTP is enabled.
One user on a Radeon AI Pro 9700 reported PP dropping from 1400 t/s to 650 t/s with MTP enabled, roughly 46% of baseline. Another user running a 27B Q8 model on dual 3090s saw PP fall from 2400 t/s to 1400 t/s. The suspected cause is that the current MTP implementation maintains a full redundant context for the MTP head and processes the prompt twice instead of reusing the cached context from the main model pass.
"The system presently maintains a full redundant context for the MTP head which doesn't actually need it... It is really processing the prompt twice, probably."
For agentic coding workflows where you're loading large contexts before generation even begins, this PP penalty can easily cancel out any generation speedup. If your prompt is long and your output is short, MTP might make total wall-clock time worse even on coding tasks.
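A quick worked example shows how the prompt-side penalty can dominate. The PP figures are the Radeon numbers quoted above; the generation speeds (30 t/s baseline, a generous 2x with MTP) are assumptions chosen purely for illustration:

```python
# Wall-clock time = prompt processing + generation.
# PP figures are the reported 1400 -> 650 t/s drop; the generation speeds
# (30 t/s baseline, 60 t/s with MTP) are illustrative assumptions.
def wall_clock(prompt_tokens, output_tokens, pp_tps, gen_tps):
    return prompt_tokens / pp_tps + output_tokens / gen_tps

long_prompt, short_output = 8000, 300
baseline = wall_clock(long_prompt, short_output, pp_tps=1400, gen_tps=30)
with_mtp = wall_clock(long_prompt, short_output, pp_tps=650, gen_tps=60)
print(f"baseline: {baseline:.1f}s   with MTP: {with_mtp:.1f}s")
# -> baseline ~15.7s, with MTP ~17.3s: slower overall despite 2x generation.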
A few other costs worth knowing: MTP currently prevents parallel request handling and image decoding. Mac users have also reported a bug causing the model to consume double the expected memory when MTP is active.
When MTP Works Best
The benchmarks point to a clear profile for when MTP delivers. You want:
- Dense models over MoE. Mixture-of-Experts models add computation overhead when speculating across multiple experts. Dense models see consistently better gains.
- Larger dense models. Bigger models benefit more. Drafting a 70B model with a 3B draft model delivered consistent 2x to 3x speedups across all generation types, while smaller MoE models saw little benefit.
- Low-temperature, structured output. Coding, JSON generation, templated text. Anything where the next token is predictable.
- Short prompts, long outputs. The PP penalty hurts less when the context is small and you're generating a lot of tokens.
Traditional draft-model speculative decoding with a separate small model achieved comparable 2x to 3x speedups on the 70B case as well, so MTP's integrated approach isn't uniquely superior here. It's more convenient when it fits your workload, not categorically faster.
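As a rule of thumb, that profile collapses into a simple checklist. This is a sketch of the decision logic implied by the benchmarks, not an API from any inference framework, and the thresholds are assumptions:

```python
def mtp_likely_helps(
    dense_model: bool,
    temperature: float,
    prompt_tokens: int,
    expected_output_tokens: int,
) -> bool:
    """Heuristic distilled from the benchmark profile: dense model,
    low-temperature structured output, and output-heavy rather than
    prompt-heavy requests. Thresholds are illustrative assumptions."""
    low_temp_structured = temperature <= 0.3
    output_heavy = expected_output_tokens > prompt_tokens
    return dense_model and low_temp_structured and output_heavy

# A coding request: short prompt, long low-temperature completion -> True.
print(mtp_likely_helps(True, 0.2, prompt_tokens=500, expected_output_tokens=2000))
# A creative request at temperature 0.7 -> False, speculation likely hurts.
print(mtp_likely_helps(True, 0.7, prompt_tokens=500, expected_output_tokens=2000))
```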
The Practical Problem: You Can't Always Choose
Right now, MTP is typically a server-level configuration. You turn it on or off for everything. That's fine for a single-purpose setup like a dedicated coding assistant, but it's a real problem if your inference server handles mixed workloads. There's no standardized per-request MTP toggle in current frameworks like vLLM, which means you're stuck making a blanket choice that will hurt at least some of your tasks.
Until per-request control lands, the practical guidance is straightforward: enable MTP only if your primary workload is low-temperature structured generation on a dense model with short prompts. Disable it for everything else, and especially for creative or conversational use cases where temperature runs above 0.5.
Bottom Line
MTP is a genuine speedup for the right workload, and nearly 3x on coding tasks is worth taking seriously. But it's not a universal go-faster switch. Task type determines whether you benefit or regress, and the prompt processing penalty is a real cost that current implementations haven't solved. If you're running a dedicated coding assistant on a large dense model with short prompts, turn it on. If you're running mixed or creative workloads, leave it off until per-request control becomes available.
Sources
- r/LocalLLaMA: MTP Benchmark Results - The Nature of the Generative Task Dictates Whether You Benefit