Qwen3.5 – How to Run Locally Guide


Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B and 122B-A10B), the Small models (Qwen3.5-0.8B, 2B, 4B and 9B), and 397B-A17B!

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a Mac (or other device) with 22GB of RAM/unified memory. See all GGUFs here.


All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so in the 4-bit quants important layers are upcasted to 8 or 16-bit. Thank you to Qwen for providing Unsloth with day-zero access. You can also fine-tune Qwen3.5 with Unsloth.


⚙️ Usage Guide

Table: Inference hardware requirements per Qwen3.5 model and quantization level (3-bit, 4-bit, 6-bit, 8-bit, BF16); units = total memory: RAM + VRAM, or unified memory.
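Memory requirements scale roughly with bits per weight. Here is a rough back-of-the-envelope sketch; these are not Unsloth's exact GGUF sizes, since real files include overhead and keep some layers at higher precision:

```python
def approx_model_memory_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters * bits / 8, in GB.

    Real GGUF files are somewhat larger (embeddings, important layers
    upcasted to 8/16-bit by dynamic quantization, KV cache not included).
    """
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# e.g. a 27B model at 4-bit needs very roughly 13.5 GB for weights alone
print(f"{approx_model_memory_gb(27, 4):.1f} GB")  # -> 13.5 GB
```

Add a few GB on top for the KV cache and runtime overhead when sizing your device.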


Between 27B and 35B-A3B, use 27B if you want slightly more accurate results or can’t fit 35B-A3B on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 tokens (extendable to 1M via YaRN)

  • presence_penalty: 0.0 to 2.0 (off by default at 0.0). Raising it reduces repetition, but a higher value may cause a slight decrease in performance.

  • Adequate output length: 32,768 tokens for most queries
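To illustrate what presence_penalty does mechanically, here is a minimal sketch of an OpenAI-style presence penalty applied to logits; the exact formula inside any given inference engine may differ:

```python
def apply_presence_penalty(logits, generated_token_ids, presence_penalty=0.0):
    """Subtract a flat penalty from the logit of every token id that has
    already appeared in the output, discouraging repetition."""
    penalized = list(logits)
    for tok in set(generated_token_ids):
        penalized[tok] -= presence_penalty
    return penalized

logits = [2.0, 1.0, 0.5]
# token 0 was already generated; with penalty 1.5 its logit drops 2.0 -> 0.5
print(apply_presence_penalty(logits, [0], presence_penalty=1.5))
# -> [0.5, 1.0, 0.5]
```

Because the penalty is flat (unlike a frequency penalty, it does not grow with repeat count), values near 2.0 can noticeably distort sampling, which is why the default keeps it off.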


If you’re getting gibberish, your context length might be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which may help.
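Putting these troubleshooting flags together, a llama.cpp invocation might look like the following command fragment (the model filename is a placeholder for whichever GGUF quant you downloaded; flag names are llama.cpp's):

```shell
# Placeholder model path: download the GGUF quant you want first.
# --ctx-size: raise this if output turns to gibberish (up to 262144).
# --cache-type-k/v bf16: can help with garbled output, per the note above.
./llama-cli -m Qwen3.5-27B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --cache-type-k bf16 \
  --cache-type-v bf16
```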

Because Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes use different settings.

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| repeat penalty | disabled or 1.0 | disabled or 1.0 |

Thinking mode for general tasks: