Qwen3 releases yet another new model, Qwen3-Coder-Flash: a small MoE (30B-A3B) pitched as a budget alternative to the 480B

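The Qwen team has also learned to ship updates every day of the week: starting with the split of hybrid thinking into separate models, it has released a model on several consecutive working days. Hybrid thinking apparently is no free lunch, and it remains to be seen whether the other vendors will drop it as well.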

The original text from the HF page follows:

Highlight

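Qwen3-Coder-30B-A3B-Instruct has been released under the official name Qwen3-Coder-Flash. With only about 3B activated parameters, it balances quality and efficiency. The main improvements are:

Strong performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks.
Native support for a 256K-token long context, extendable to 1M tokens with YaRN, optimized for repository-scale understanding.
Support for Agentic Coding on mainstream platforms such as Qwen Code and CLINE, with a specially designed function call format.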
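
The jump from the native window to 1M tokens comes down to a scaling factor. A minimal sketch of that arithmetic, assuming the `rope_scaling` layout commonly used in transformers configs (`rope_type`, `factor`, `original_max_position_embeddings`). The helper name `yarn_rope_scaling` and the choice to round the factor up to a whole number are illustrative assumptions, not settings taken from the model card:

```python
import math

NATIVE_CONTEXT = 262_144  # native context length stated in the model card


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a hypothetical rope_scaling dict that stretches the native window."""
    if target_context <= native_context:
        raise ValueError("target already fits in the native context; no scaling needed")
    # Round up so the scaled window is at least as long as the target.
    factor = float(math.ceil(target_context / native_context))
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }


scaling = yarn_rope_scaling(1_000_000)
print(scaling["factor"])                        # 4.0
print(int(scaling["factor"]) * NATIVE_CONTEXT)  # 1048576
```

Whether the serving stack expects exactly these keys should be checked against its own documentation before relying on this.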

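Model Overview

Qwen3-Coder-30B-A3B-Instruct has the following features:

Type: Causal Language Models
Training stage: Pretraining & Post-training
Number of parameters: 30.5B in total, 3.3B activated
Number of layers: 48
Attention heads (GQA): 32 for Q, 4 for KV
Number of experts: 128
Number of activated experts: 8
Context length: 262,144 natively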
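
To put these figures in proportion, the ratios can be worked out directly. This is only arithmetic on the numbers listed above; the variable names are illustrative:

```python
# Figures copied from the model overview.
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 8
Q_HEADS = 32
KV_HEADS = 4

active_param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B  # share of weights used per token
expert_share = ACTIVE_EXPERTS / NUM_EXPERTS            # share of experts routed per token
queries_per_kv_head = Q_HEADS // KV_HEADS              # GQA group size

print(f"active parameters per token: {active_param_share:.1%}")
print(f"experts routed per token:    {expert_share:.2%}")
print(f"query heads per KV head:     {queries_per_kv_head}")
```

Roughly a tenth of the weights take part in each token, and every KV head is shared by 8 query heads, so the KV cache holds 4 heads per layer rather than 32.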

Note: this model supports only non-thinking mode and does not generate <think></think> blocks in its output, so setting enable_thinking=False is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, see our blog, GitHub, and documentation.

Quickstart

We recommend using the latest version of transformers.
With transformers<4.51.0, you will encounter the following error:

KeyError: 'qwen3_moe'

The following code snippet shows how to use the model to generate content from given inputs.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype="auto",
  device_map="auto"
)

# Prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
  {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Run text completion
generated_ids = model.generate(
  **model_inputs,
  max_new_tokens=65536
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Tip: if you run into out-of-memory (OOM) issues, consider reducing the context length to a smaller value such as 32,768.
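
One way to act on this tip is to cap the prompt and the generation budget together so that their sum stays within the chosen window. A minimal sketch, assuming token ids held in plain Python lists. `fit_to_context` and its `min_new_tokens` reserve are hypothetical, and keeping the most recent tokens is only one possible truncation policy:

```python
def fit_to_context(
    input_ids: list[int],
    max_new_tokens: int,
    context_limit: int = 32_768,
    min_new_tokens: int = 1_024,
) -> tuple[list[int], int]:
    """Trim a prompt and cap the generation budget so both fit in context_limit."""
    if not 0 < min_new_tokens < context_limit:
        raise ValueError("min_new_tokens must be positive and below context_limit")
    # Always leave at least min_new_tokens of room for the answer.
    max_prompt_tokens = context_limit - min_new_tokens
    # Keep the most recent tokens when the prompt is too long.
    trimmed = input_ids[-max_prompt_tokens:]
    # Whatever the prompt does not use is available for generation.
    budget = min(max_new_tokens, context_limit - len(trimmed))
    return trimmed, budget


long_prompt = list(range(40_000))  # stand-in for real token ids
trimmed, budget = fit_to_context(long_prompt, max_new_tokens=65_536)
print(len(trimmed), budget)  # 31744 1024
```

Cutting from the front of a chat-templated prompt can remove turn markers, so in practice it is safer to shorten the message content before applying the template.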

For local use, Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers all support Qwen3 already.

Agentic Coding

Qwen3-Coder excels at tool calling.

You can define or use any tool as simply as in the following example.

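# Your tool implementation
def square_the_number(input_num: float) -> float:
    # The parameter name matches the "input_num" field declared in the tool schema
    return input_num ** 2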

# Define the tools
tools = [
  {
    "type": "function",
    "function": {
      "name": "square_the_number",
      "description": "output the square of the number.",
      "parameters": {
        "type": "object",
        "required": ["input_num"],
        "properties": {
          "input_num": {
            "type": "number",
            "description": "input_num is a number that will be squared"
          }
        },
      }
    }
  }
]

from openai import OpenAI

# Define the LLM client
client = OpenAI(
  # Use a custom endpoint compatible with the OpenAI API
  base_url="http://localhost:8000/v1",  # api_base
  api_key="EMPTY"
)

messages = [{'role':'user','content':'square the number 1024'}]

completion = client.chat.completions.create(
  messages=messages,
  model="Qwen3-Coder-30B-A3B-Instruct",
  max_tokens=65536,
  tools=tools,
)

print(completion.choices[0])
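
The request above only returns the tool call the model proposes; running it is left to the caller. A minimal sketch of that step, assuming the OpenAI-style response shape in which each tool call carries a function name and a JSON-encoded argument string. `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the sample payload is hand-written rather than real model output:

```python
import json


def square_the_number(input_num: float) -> float:
    return input_num ** 2


# Hypothetical registry mapping tool names (as declared in `tools`) to callables.
TOOL_REGISTRY = {"square_the_number": square_the_number}


def run_tool_call(name: str, arguments_json: str, call_id: str) -> dict:
    """Execute one tool call and wrap the result as a `tool` role message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**kwargs)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Hand-written stand-in for the name, arguments, and id of a returned tool call.
tool_message = run_tool_call("square_the_number", '{"input_num": 1024}', call_id="call_0")
print(tool_message["content"])  # 1048576
```

In a full loop, the assistant message containing the tool call and this `tool` message would both be appended to `messages` and sent back through `client.chat.completions.create` to obtain the final answer.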

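Best Practices

To get the best performance, we recommend the following settings:

Sampling parameters:

We suggest temperature=0.7, top_p=0.8, top_k=20, and repetition_penalty=1.05.
Adequate output length: we recommend an output length of 65,536 tokens for most queries, which is sufficient for instruct models.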
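
These settings can be kept in one place and handed to whichever backend is in use. A sketch, assuming the Hugging Face `generate` keyword names (`do_sample`, `temperature`, `top_p`, `top_k`, `repetition_penalty`, `max_new_tokens`). For the OpenAI-compatible client, `top_k` and `repetition_penalty` are not standard request fields, so passing them through `extra_body` is an assumption that depends on the serving backend. The function names are hypothetical:

```python
# Recommended values from the best-practices list.
SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.05}
MAX_OUTPUT_TOKENS = 65_536


def transformers_generate_kwargs() -> dict:
    """Keyword arguments intended for model.generate(**model_inputs, **kwargs)."""
    # do_sample=True is needed for temperature/top_p/top_k to take effect.
    return {"do_sample": True, "max_new_tokens": MAX_OUTPUT_TOKENS, **SAMPLING}


def openai_request_kwargs() -> dict:
    """Keyword arguments intended for client.chat.completions.create(...)."""
    return {
        "temperature": SAMPLING["temperature"],
        "top_p": SAMPLING["top_p"],
        "max_tokens": MAX_OUTPUT_TOKENS,
        # Assumption: the server accepts these non-standard fields in the request body.
        "extra_body": {
            "top_k": SAMPLING["top_k"],
            "repetition_penalty": SAMPLING["repetition_penalty"],
        },
    }


print(transformers_generate_kwargs())
print(openai_request_kwargs())
```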

Citation

If you find our work helpful, feel free to cite it.

@misc{qwen3technicalreport,
   title={Qwen3 Technical Report},
   author={Qwen Team},
   year={2025},
   eprint={2505.09388},
   archivePrefix={arXiv},
   primaryClass={cs.CL},
   url={https://arxiv.org/abs/2505.09388},
}