The LocalLLaMA community has been converging on one model this week: Qwen 3.6. Not because of a press release, but because people are running it on consumer hardware and getting numbers that were hard to believe a few months ago. 80 tokens per second, 128K context, on 12GB of VRAM. If you have an RTX 4070 Super sitting idle between coding sessions, this is worth a serious look.
The Specs That Actually Matter
Qwen 3.6 comes in a few configurations. The two getting the most attention are the 27B base model and the 35B A3B variant. Both support reasoning (chain-of-thought) and vision capabilities out of the box. Context goes up to 128K on 12GB VRAM setups, and further with more headroom. Quantization options include Q5 and Q4_K_XL, with Q5 being the sweet spot for quality versus speed on 24GB cards.
- Qwen 3.6 35B A3B: 80+ tok/sec, 128K context, RTX 4070 Super (12GB VRAM)
- Qwen 3.6 27B Q5: 135 tok/sec peak, 200K context, RTX 3090 (24GB VRAM)
- Draft acceptance rate: 80%+ with MTP on the 35B A3B
These numbers come from community benchmarks on r/LocalLLaMA, not a controlled lab. Treat them as realistic ceilings with optimal setups, not guaranteed baselines.
Hardware Reality Check
Two reference setups dominate the discussion. On an RTX 4070 Super with 12GB VRAM, the 35B A3B hits 80 tok/sec with 128K context using llama.cpp with MTP enabled. That's fast enough to feel interactive. On an RTX 3090 with 24GB, the 27B Q5 peaks at 135 tok/sec with 200K context when running through BeeLlama.cpp.
The 12GB result is the more interesting one. Getting a 35B-class model to run this fast inside a 12GB VRAM envelope is a legitimate milestone. A year ago, 12GB meant 7B models with context trade-offs. Now it means running something competitive with mid-tier cloud models at real throughput.
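For reference, here is roughly what the 24GB path looks like through llama-cpp-python, the Python bindings for llama.cpp. This is a minimal sketch, not a tested config: the GGUF filename is hypothetical, and the context length should be set to whatever your KV cache budget actually allows. (The 12GB A3B path additionally depends on the MTP build covered below.)

```python
# Minimal sketch: load a Qwen 3.6 27B Q5 GGUF fully onto the GPU.
# Model path and context size are placeholders -- adjust to your download and VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q5_k_m.gguf",  # hypothetical filename
    n_ctx=32768,       # push toward 128K+ only if the KV cache still fits
    n_gpu_layers=-1,   # offload every layer to the GPU
    flash_attn=True,   # helps memory use at long context on supported cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

The same knobs map onto llama.cpp's server flags (-m, -c, -ngl) if you'd rather run it as a local API endpoint.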
How It Compares to the Alternatives
The comparison that keeps coming up is Qwen 3.6 versus DeepSeek V4. Locally, DeepSeek V4 tops out around 10 tok/sec for token generation and 40 tok/sec for prompt processing on the same consumer hardware. That's CPU-level performance in practical terms. Community reports also flag a tendency for DeepSeek V4 to overthink coding tasks, adding unnecessary chain-of-thought overhead where a direct answer would do. Qwen 3.6 doesn't have that problem.
GLM 5.1 handles debugging tasks well and gets favorable mentions, but it's larger, harder to quantize cleanly, and has less community tooling behind it. For most workflows, that friction adds up.
NVIDIA's Star Elastic (Nemotron Nano v3) is worth watching. It's a single 30B-parameter checkpoint with 3.6B active parameters that produces nested sub-models: a 23B (2.8B active) and a 12B (2.0B active). The matryoshka architecture is clever and efficient. But it's newer, deployment tooling is thinner, and Qwen 3.6 has a much larger base of tested quantizations and community configurations. Star Elastic is interesting; Qwen 3.6 is ready.
Gemma 4 gets mentions, but MTP acceleration is unreliable with it, which undercuts the speed advantage that makes Qwen 3.6 worth running in the first place.
The Optimization Stack
The raw model isn't the whole story. Getting to these numbers requires stacking a few optimizations.
MTP (Multi-Token Prediction) in llama.cpp is the biggest lever. It drives the 80%+ draft acceptance rate on the 35B A3B and accounts for a large share of the throughput gains. The catch: MTP support requires building llama.cpp from source using a pull request that hasn't been merged to master yet. If you're not comfortable building from a branch, you'll need to wait or skip this one for now.
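If you do want the unmerged branch, the build itself is ordinary git-plus-CMake work. Here's a rough sketch, wrapped in Python to keep one language across the examples; the PR number is deliberately left as a placeholder to look up on the llama.cpp repository, and the CUDA flag assumes an NVIDIA card.

```python
# Sketch: fetch an unmerged llama.cpp pull request and build it with CUDA.
# MTP_PR is a placeholder -- find the actual PR number on the llama.cpp repo.
import subprocess

MTP_PR = "<pr-number>"  # placeholder, fill in yourself

def run(cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

run(["git", "clone", "https://github.com/ggml-org/llama.cpp.git"])
# Pull the PR head into a local branch and switch to it.
run(["git", "fetch", "origin", f"pull/{MTP_PR}/head:mtp"], cwd="llama.cpp")
run(["git", "switch", "mtp"], cwd="llama.cpp")
# Standard CUDA build; drop -DGGML_CUDA=ON for a CPU-only build.
run(["cmake", "-B", "build", "-DGGML_CUDA=ON"], cwd="llama.cpp")
run(["cmake", "--build", "build", "--config", "Release", "-j"], cwd="llama.cpp")
```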
BeeLlama.cpp is a fork that combines DFlash speculative decoding with TurboQuant quantization. Community reports put it at 2-3x the throughput of baseline llama.cpp for Qwen 3.6 models on an RTX 3090, and the 135 tok/sec peak on the 27B Q5 at 200K context comes from this setup. That's not a small margin. There are no independent benchmarks yet, so treat the 2-3x figure as a strong signal rather than a guarantee, but the direction is clear.
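Whichever backend you land on, throughput claims like these are easy to sanity-check on your own card. A quick timing loop with stock llama-cpp-python, again with an illustrative model path, looks something like this:

```python
# Quick decode-throughput check: generate a fixed number of tokens and time it.
# Keep the prompt short so prompt processing doesn't dominate the measurement.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3.6-27b-q5_k_m.gguf", n_ctx=8192, n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Explain how speculative decoding works, in detail.", max_tokens=512)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/sec")
```

Run the same script against two builds and you have your own 2-3x comparison, for your quant and your context size.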
Quantization choice matters more than people expect. Q5 is about the best quality you can fit on a 24GB card; Q4_K_XL costs some quality but frees up headroom for longer context. Which way to go depends on whether you're doing long-document work or interactive coding.
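The back-of-envelope math behind that choice: weight size is roughly parameter count times bits per weight divided by eight, and whatever VRAM remains after the weights has to hold the KV cache, activations, and runtime overhead. A rough sketch, with approximate bits-per-weight figures that vary by quant mix:

```python
# Back-of-envelope VRAM budget: weight size vs. what's left for the KV cache.
# Bits-per-weight values are rough averages; real GGUF files mix tensor types.
PARAMS_B = 27    # Qwen 3.6 27B, illustrative
VRAM_GB = 24     # RTX 3090
BPW = {"Q5_K_M": 5.5, "Q4_K_XL": 4.8}  # approximate, not exact

for quant, bpw in BPW.items():
    weights_gb = PARAMS_B * bpw / 8          # billions of params * bits / 8 = GB
    leftover_gb = VRAM_GB - weights_gb
    print(f"{quant}: ~{weights_gb:.1f} GB of weights, "
          f"~{leftover_gb:.1f} GB left for KV cache and overhead")
```

Those few extra gigabytes of headroom are what buy longer context before the KV cache spills, which is why the quant decision ends up mattering nearly as much as the model choice.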
What People Are Actually Using It For
One use case that's gotten attention is pairing Qwen 3.6 27B with Pi, a local coding agent, for system configuration tasks. The specific example, setting up Arch Linux, is a useful proxy for how the model handles multi-step technical workflows with real dependencies and feedback loops. By the community account, it worked.
For coding work, community consensus puts Qwen 3.6 35B A3B in the same league as most larger open models, with GLM 5.1 being the notable exception that edges it out on debugging. One caveat worth flagging: there are reports that Qwen 3.6 gets cautious and unimaginative on non-coding creative or exploratory tasks. If your workload is primarily code and technical writing, that probably doesn't matter. If you want a model that takes risks on creative tasks, this isn't it.
Bottom Line
If you have an RTX 4070 Super or RTX 3090 and want a local model for coding and technical work, Qwen 3.6 35B A3B is the current answer. The performance-per-VRAM ratio is genuinely strong, the community tooling is ahead of the alternatives, and the optimization ecosystem is active. Getting the best numbers requires building llama.cpp from a source branch, which is a real setup hurdle. But if you're already running local models, that's a one-time afternoon of work for a speed gain that sticks around.
Sources
- r/LocalLLaMA: 80 tok/sec and 128K context on 12GB VRAM with Qwen 3.6 35B A3B and llama.cpp MTP
- r/LocalLLaMA: BeeLlama.cpp advanced DFlash + TurboQuant with Qwen 3.6 27B Q5 on RTX 3090
- r/LocalLLaMA: NVIDIA AI releases Star Elastic (Nemotron Nano v3) single checkpoint
- r/LocalLLaMA: Pi and Qwen 3.6 27B for Arch Linux setup via coding agent