Qwen3.5 – How to Run Locally Guide


Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B and 122B-A10B), the Small models (Qwen3.5-0.8B, 2B, 4B and 9B), and 397B-A17B!

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a Mac (or other device) with 22GB of RAM/unified memory. See all GGUFs here.


All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so in the 4-bit quants important layers are upcasted to 8 or 16-bit. Thank you to Qwen for providing Unsloth with day-zero access. You can also fine-tune Qwen3.5 with Unsloth.


⚙️ Usage Guide

Table: Inference hardware requirements per Qwen3.5 model and quantization level (3-bit, 4-bit, 6-bit, 8-bit, BF16); units = total memory: RAM + VRAM, or unified memory.
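Memory requirements scale roughly with bits per weight. Here is a rough back-of-the-envelope sketch; these are not Unsloth's exact GGUF sizes, since real files include overhead and keep some layers at higher precision:

```python
def approx_model_memory_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters * bits / 8, in GB.

    Real GGUF files are somewhat larger (embeddings, important layers
    upcasted to 8/16-bit by dynamic quantization, KV cache not included).
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a 27B model at 4-bit needs very roughly 13.5 GB for weights alone
print(f"{approx_model_memory_gb(27, 4):.1f} GB")  # -> 13.5 GB
```

Add a few GB on top for the KV cache and runtime overhead when sizing your device.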


Between 27B and 35B-A3B, use 27B if you want slightly more accurate results or can’t fit 35B-A3B on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 tokens (extendable to 1M via YaRN)

  • presence_penalty: 0.0 to 2.0 (off by default at 0.0). Raising it reduces repetition, but a higher value may cause a slight decrease in performance.

  • Adequate output length: 32,768 tokens for most queries
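To illustrate what presence_penalty does mechanically, here is a minimal sketch of an OpenAI-style presence penalty applied to logits; the exact formula inside any given inference engine may differ:

```python
def apply_presence_penalty(logits, generated_token_ids, presence_penalty=0.0):
    """Subtract a flat penalty from the logit of every token id that has
    already appeared in the output, discouraging repetition."""
    penalized = list(logits)
    for tok in set(generated_token_ids):
        penalized[tok] -= presence_penalty
    return penalized

logits = [2.0, 1.0, 0.5]
# token 0 was already generated; with penalty 1.5 its logit drops 2.0 -> 0.5
print(apply_presence_penalty(logits, [0], presence_penalty=1.5))
# -> [0.5, 1.0, 0.5]
```

Because the penalty is flat (unlike a frequency penalty, it does not grow with repeat count), values near 2.0 can noticeably distort sampling, which is why the default keeps it off.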


If you’re getting gibberish, your context length might be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which may help.
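Putting these troubleshooting flags together, a llama.cpp invocation might look like the following command fragment (the model filename is a placeholder for whichever GGUF quant you downloaded; flag names are llama.cpp's):

```shell
# Placeholder model path: download the GGUF quant you want first.
# --ctx-size: raise this if output turns to gibberish (up to 262144).
# --cache-type-k/v bf16: can help with garbled output, per the note above.
./llama-cli -m Qwen3.5-27B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16
```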

Because Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes use different settings.

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

Thinking mode for general tasks: