Hello, Qwen2!


01

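Qwen2: A Powerful, Multilingual, State-of-the-Art Language Model

Today the Tongyi Qianwen (Qwen) team released the Qwen2 series of models, a major upgrade over the Qwen1.5 series. The release includes:

Pretrained and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B;
High-quality training data in 27 additional languages on top of Chinese and English;
Leading performance on multiple evaluation benchmarks;
Significantly improved coding and mathematics capabilities;
Longer context support, up to 128K tokens (Qwen2-72B-Instruct).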

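The Qwen2-72B model

Compared with Qwen1.5, Qwen2 delivers a very large improvement at large model scales. We evaluate Qwen2-72B below. In evaluations of pretrained language models against the current best open-source models, Qwen2-72B significantly outperforms leading models such as Llama-3-70B and Qwen1.5-110B (the largest Qwen1.5 model) across natural language understanding, knowledge, coding, mathematics, and multilingual ability. This is due to improvements in its pretraining data and training methods.

Qwen2-72B's advantage is most pronounced in natural language understanding and logical reasoning, especially on science questions. It also scores well on coding tests, with strong results across several programming languages. Its mathematical ability improved substantially thanks to optimization of the mathematics portion of the pretraining data. On multilingual performance, an area of wide interest, Qwen2-72B holds an edge on multilingual evaluations across several domains. This suggests that Qwen2 has the potential to be deployed in more countries and regions.

The team invested heavily in research on fine-tuning and alignment. Qwen2's strategy includes collecting a broad range of instructions and prompts, and using synthetic data produced by methods such as rejection sampling, code-execution feedback, and back-translation. To align further with human preferences, Qwen2 uses DPO. Beyond standard DPO and variants such as IPO and KTO, Qwen2 also explores combining DPO with online learning to raise the ceiling of model capability. To reduce the "alignment tax" that alignment incurs, Qwen2 uses model merging. Together, these efforts substantially improved the fundamental capabilities and intelligence of the instruction-tuned models.

[Figure: evaluation results of the instruction-tuned models]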
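As a rough illustration of the DPO objective mentioned above, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name `dpo_loss`, the `beta` value, and the example numbers are illustrative assumptions; this is the textbook formulation, not the Qwen team's training code.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy has moved toward the chosen response
# and away from the rejected one, relative to the reference model.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy prefers the chosen response more strongly than the reference does.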

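At smaller scales, Qwen2 models are likewise among the best at each size. For details, see each model's page on the ModelScope community.

Model license

Qwen2 uses different licenses for different models. Qwen2-72B continues to use the earlier Qianwen License, while all other models (Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and Qwen2-57B-A14B) are released under Apache 2.0. The Qwen team hopes that this greater openness will accelerate the adoption and commercial use of Qwen2 around the world.

Next steps

The Qwen team is training larger models and continuing to explore scaling laws for models and data. The team will also extend Qwen2 into a multimodal model that incorporates vision and speech understanding. More open-source models will follow in the near future. Stay tuned!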

02

Model list and download links

Model name	Model link
Qwen2-72B-Instruct	https://modelscope.cn/models/qwen/Qwen2-72B-Instruct
Qwen2-72B	https://modelscope.cn/models/qwen/Qwen2-72B
Qwen2-57B-A14B-Instruct	https://modelscope.cn/models/qwen/Qwen2-57B-A14B-Instruct
Qwen2-57B-A14B	https://modelscope.cn/models/qwen/Qwen2-57B-A14B
Qwen2-7B-Instruct	https://modelscope.cn/models/qwen/Qwen2-7B-Instruct
Qwen2-7B	https://modelscope.cn/models/qwen/Qwen2-7B
Qwen2-1.5B-Instruct	https://modelscope.cn/models/qwen/Qwen2-1.5B-Instruct
Qwen2-1.5B	https://modelscope.cn/models/qwen/Qwen2-1.5B
Qwen2-0.5B-Instruct	https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct
Qwen2-0.5B	https://modelscope.cn/models/qwen/Qwen2-0.5B
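The links in the table follow one pattern, so they can be generated rather than typed. The sketch below rebuilds the table from the five sizes; `MODEL_SIZES`, `BASE_URL`, and `model_links` are illustrative names, and the URL pattern is inferred only from the rows above.

```python
MODEL_SIZES = ["72B", "57B-A14B", "7B", "1.5B", "0.5B"]
BASE_URL = "https://modelscope.cn/models/qwen"


def model_links() -> dict[str, str]:
    """Map each Qwen2 model name to its ModelScope page, in table order."""
    links = {}
    for size in MODEL_SIZES:
        # Each size ships as an instruction-tuned model and a base model
        for suffix in ("-Instruct", ""):
            name = f"Qwen2-{size}{suffix}"
            links[name] = f"{BASE_URL}/{name}"
    return links


for name, url in model_links().items():
    print(f"{name}\t{url}")
```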

03

Model inference

Inference with Transformers

from modelscope import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen2-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen2-7B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
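The list comprehension near the end of the snippet above is easy to misread. For a causal language model, `generate` returns the prompt tokens followed by the new tokens, so the prompt prefix is sliced off before decoding. The sketch below shows the same slicing on plain Python lists; the helper name `strip_prompt_tokens` and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(input_ids_batch, output_ids_batch):
    """Keep only the tokens generated after each prompt."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Toy token IDs (not real Qwen2 vocabulary IDs)
prompts = [[101, 7, 8, 9]]
outputs = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(prompts, outputs)
print(new_tokens)
```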

Inference with vLLM

(Optional) To enable YARN for longer context support, edit config.json and add the following snippet:

{
  "architectures": ["Qwen2ForCausalLM"],
  // ...
  "vocab_size": 152064,

  // adding the following snippets
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
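The same edit can be made programmatically. The sketch below adds the `rope_scaling` entry to an already-parsed config dict and works out the resulting context window; the helper name `enable_yarn` is an illustrative assumption, and the stand-in dict contains only two of the fields a real config.json holds. Reading `factor` as a multiplier on the original window is an interpretation, though it matches the 128K figure quoted earlier.

```python
import json


def enable_yarn(config: dict, factor: float = 4.0,
                original_max_position_embeddings: int = 32768) -> dict:
    """Return a copy of a config.json dict with the YARN rope_scaling entry added."""
    patched = dict(config)
    patched["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original_max_position_embeddings,
        "type": "yarn",
    }
    return patched


# Stand-in for the parsed contents of config.json (most fields omitted)
config = {"architectures": ["Qwen2ForCausalLM"], "vocab_size": 152064}
patched = enable_yarn(config)

scaling = patched["rope_scaling"]
extended_context = int(scaling["factor"] * scaling["original_max_position_embeddings"])

print(json.dumps(patched, indent=2))
print(extended_context)  # 4.0 * 32768 = 131072 tokens, i.e. 128K
```

For a real model directory, load the file with `json.load`, pass the dict through the helper, and write it back with `json.dump`.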

vLLM inference command

python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-72B-Instruct \
    --model="/cache_dir/Qwen2-72B-Instruct" \
    --tensor-parallel-size 4

Calling the OpenAI-format API

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2-72B-Instruct",
        "messages": [
            {"role": "system", "content": "you are a helpful assistant."},
            {"role": "user", "content": "讲一下大语言模型的特点"}
        ]
    }'
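The same request can be built from Python with only the standard library. The sketch below constructs the request object without sending it, so it runs even when no server is up; `build_chat_request` is an illustrative helper name, and the host, port, and model name are taken from the commands above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       system: str, user: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen2-72B-Instruct",
    "You are a helpful assistant.",
    "Describe the characteristics of large language models.",
)

# Sending the request needs the vLLM server above to be running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```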

On-device inference on Apple hardware with MLX

Install dependencies

pip install mlx-lm mlx -U

from mlx_lm import load, generate
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen2-0.5B-Instruct-MLX")
model, tokenizer = load(model_dir, tokenizer_config={"eos_token": "<|im_end|>"})

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=text,
    verbose=True,
    top_p=0.8,
    temp=0.7,
    repetition_penalty=1.05,
    max_tokens=512,
)
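For readers unfamiliar with the `temp` and `top_p` arguments passed to `generate` above, the sketch below shows the usual meaning of the two knobs on a toy distribution: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose total probability reaches `top_p`. This is a generic illustration with made-up logits, not mlx_lm's actual implementation, and it does not cover `repetition_penalty`.

```python
import math


def apply_temperature(logits, temp):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalise.

    Returns a dict mapping the kept token indices to their new probabilities.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Made-up logits for a four-token vocabulary
probs = apply_temperature([2.0, 1.0, 0.5, -1.0], temp=0.7)
print(top_p_filter(probs, top_p=0.8))
```

A token would then be sampled from the filtered distribution; lower temperatures and lower `top_p` values both make the output more deterministic.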

[Figure: reference inference speed]

Using the DashInfer CPU inference engine

DashInfer is an inference engine for large models released by the ModelScope community and built on a native runtime. It supports efficient LLM inference on diverse hardware, including CPUs and ARM processors.

The ModelScope community provides a set of Qwen2 models in DashInfer format:

Qwen2-0.5B-Instruct-DI:
https://modelscope.cn/models/dash-infer/Qwen2-0.5B-Instruct-DI/summary
Qwen2-1.5B-Instruct-DI:
https://modelscope.cn/models/dash-infer/Qwen2-1.5B-Instruct-DI/summary
Qwen2-7B-Instruct-DI:
https://modelscope.cn/models/dash-infer/Qwen2-7B-Instruct-DI/summary

Python dependencies:

pip install modelscope dashinfer jinja2 tabulate torch transformers

Inference code:

import copy
import random

from modelscope import snapshot_download
from dashinfer.helper import EngineHelper, ConfigManager

# You may also choose "dash-infer/Qwen2-0.5B-Instruct-DI"
# or "dash-infer/Qwen2-1.5B-Instruct-DI" alternatively.
model_path = snapshot_download("dash-infer/Qwen2-7B-Instruct-DI")

config_file = model_path + "/" + "di_config.json"
config = ConfigManager.get_config_from_json(config_file)
config["model_path"] = model_path

## init EngineHelper class
engine_helper = EngineHelper(config)
engine_helper.verbose = True
engine_helper.init_tokenizer(model_path)

## init engine
engine_helper.init_engine()

## prepare inputs and generation configs
user_input = "浙江的省会在哪"  # "Where is the capital of Zhejiang?"
prompt = "<|im_start|>" + "system\n" + "You are a helpful assistant." + "<|im_end|>\n" + \
         "<|im_start|>" + "user\n" + user_input + "<|im_end|>\n" + \
         "<|im_start|>" + "assistant\n"
gen_cfg = copy.deepcopy(engine_helper.default_gen_cfg)
gen_cfg["seed"] = random.randint(0, 10000)
request_list = engine_helper.create_request([prompt], [gen_cfg])

## inference
engine_helper.process_one_request(request_list[0])
engine_helper.print_inference_result_all(request_list)

engine_helper.uninit_engine()
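The prompt in the DashInfer example is assembled by hand in the `<|im_start|>` / `<|im_end|>` layout. A small helper makes that layout reusable for multi-turn conversations. The function name `build_chatml_prompt` is an illustrative assumption; the layout itself is copied from the string concatenation above, and whether it matches `tokenizer.apply_chat_template` in every case is not verified here.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} messages in the layout used by the hand-built prompt."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "浙江的省会在哪"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The resulting string can be passed to `engine_helper.create_request([prompt], [gen_cfg])` in place of the concatenated version.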

Demo: https://modelscope.cn/studios/dash-infer/Qwen2-7B-Instruct-DashInfer-Demo

04

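Model fine-tuning

We use swift to run self-cognition fine-tuning on Qwen2-72B-Instruct, so that the model identifies itself as XiaoHu (小胡), trained by ModelScope (魔搭). swift is the ModelScope community's official toolbox for LLM inference, fine-tuning, quantization, and deployment. swift source: https://github.com/modelscope/swift

Before fine-tuning, prepare the environment: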

# Install with pip
pip install 'ms-swift[llm]' -U

# Or install from source
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

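For self-cognition fine-tuning we use the self-cognition dataset provided by swift, which contains placeholders for the model name and author. We also use the alpaca-zh and alpaca-en datasets to preserve the model's general capabilities. The whole fine-tuning run takes about 30 minutes. The fine-tuning script is: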

# Experimental environment: 2 * A100
# 2 * 75GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path qwen/Qwen2-72B-Instruct \
    --sft_type lora \
    --dtype AUTO \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh#500 AI-ModelScope/alpaca-gpt4-data-en#500 swift/self-cognition#500 \
    --model_name 小胡 XiaoHu \
    --model_author 魔搭 ModelScope \
    --num_train_epochs 1 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --use_flash_attn true
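A few quantities implied by the script can be worked out by hand. The sketch below does the arithmetic under several assumptions that the post does not state: that the `#500` suffix means 500 examples are sampled from each dataset, that no examples are held out for validation, that the two GPUs hold one sharded copy of the model (so there is a single data-parallel worker), and that LoRA uses the common `alpha / rank` scaling. `parse_dataset_spec` is an illustrative helper, not a swift function.

```python
import math


def parse_dataset_spec(spec: str):
    """Split 'name#N' into (name, N); N is None when no sample count is given."""
    name, sep, count = spec.partition("#")
    return name, int(count) if sep else None


datasets = [
    "AI-ModelScope/alpaca-gpt4-data-zh#500",
    "AI-ModelScope/alpaca-gpt4-data-en#500",
    "swift/self-cognition#500",
]
total_samples = sum(parse_dataset_spec(d)[1] for d in datasets)

# Values copied from the command-line flags above
batch_size = 1
gradient_accumulation_steps = 16
effective_batch = batch_size * gradient_accumulation_steps
optimizer_steps = math.ceil(total_samples / effective_batch)

lora_rank, lora_alpha = 8, 32
lora_scale = lora_alpha / lora_rank

print(total_samples, effective_batch, optimizer_steps, lora_scale)
```

Under these assumptions one epoch covers 1500 examples in 94 optimizer steps. The real step count reported by swift may differ if it reserves a validation split.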

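For the meaning of the fine-tuning hyperparameters, see the command-line parameters documentation: https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.md

[Figure: loss curve during fine-tuning]

[Figure: GPU memory usage during fine-tuning]

The inference script after fine-tuning is shown below. Change ckpt_dir to the checkpoint folder produced by fine-tuning: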

# Experimental environment: 2 * A100

# Direct inference with PyTorch
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir "output/qwen2-72b-instruct/vx-xxx/checkpoint-xxx"

# Merge LoRA and use vLLM to accelerate inference
CUDA_VISIBLE_DEVICES=0,1 swift export \
    --ckpt_dir "output/qwen2-72b-instruct/vx-xxx/checkpoint-xxx" \
    --merge_lora true

pip install vllm -U
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir "output/qwen2-72b-instruct/vx-xxx/checkpoint-xxx-merged" \
    --infer_backend vllm --tensor_parallel_size 2 \
    --max_model_len 8192 --gpu_memory_utilization 0.95
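To show what merging LoRA means conceptually, the sketch below folds a rank-1 adapter into a tiny weight matrix using the standard LoRA update W + (alpha / rank) * B A. It is a toy on plain Python lists with made-up numbers, assuming the standard LoRA formulation; it is not the code path that `swift export --merge_lora true` actually runs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * B @ A as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, 2 x 2
lora_a = [[1.0, 2.0]]              # adapter A, rank x in_features
lora_b = [[0.5], [0.0]]            # adapter B, out_features x rank

merged = merge_lora(weight, lora_a, lora_b, rank=1, alpha=2)
print(merged)
```

After merging, the adapter no longer needs to be applied at inference time, which is why the merged checkpoint can be served directly by vLLM.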

[Figure: results after training]

05

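Deploying a Qwen2 API with function calling support

Modelscope-Agent (https://github.com/modelscope/modelscope-agent) uses its built-in agent to extend a base model's capabilities, quickly add tool calling, and serve a chat/completions endpoint. It is used in much the same way as vLLM, and the endpoint can be called with the openai SDK.

The following shows how to use the Modelscope-Agent project to deploy a Qwen2-7B-Instruct model service with function calling.

First, prepare the environment by cloning the modelscope-agent project: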

git clone https://github.com/modelscope/modelscope-agent.git
cd modelscope-agent

Next, download the qwen2-7b-instruct model from a Python prompt and note model_dir for later use:

from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen2-7B-Instruct')
print(model_dir)

Then, in a shell, start the service with function calling, pointing it at model_dir. The options match those used when invoking vLLM.

sh scripts/run_assistant_server.sh --served-model-name Qwen2-7B-Instruct --model path/to/weights

After the command runs, the service listens on port 31512. It can then be tested with a standard tool-calling request:

curl -X POST 'http://localhost:31512/v1/chat/completions' \
    -H 'Content-Type: application/json' \
    -d '{
        "tools": [{
            "type": "function",
            "function": {
                "name": "amap_weather",
                "description": "amap weather tool",
                "parameters": [{
                    "name": "location",
                    "type": "string",
                    "description": "城市/区具体名称，如`北京市海淀区`请描述为`海淀区`",
                    "required": true
                }]
            }
        }],
        "tool_choice": "auto",
        "model": "Qwen2-7B-Instruct",
        "messages": [
            {"content": "海淀区天气", "role": "user"}
        ]
    }'

Alternatively, the service can be called with the openai SDK:

from openai import OpenAI

api_base = "http://localhost:31512/v1/"
model = 'Qwen2-7B-Instruct'

tools = [{
    "type": "function",
    "function": {
        "name": "amap_weather",
        "description": "amap weather tool",
        "parameters": [{
            "name": "location",
            "type": "string",
            "description": "城市/区具体名称，如`北京市海淀区`请描述为`海淀区`",
            "required": True
        }]
    }
}]

tool_choice = 'auto'

client = OpenAI(
    base_url=api_base,
    api_key="empty",
)
chat_completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "海淀区天气是什么？"}],
    model=model,
    tools=tools,
    tool_choice=tool_choice
)
print(chat_completion)

The following result is returned:

{"request_id":"chatcmpl_3f020464-e98d-4c7b-8717-9fca56784fe6","message":"","output":null,"id":"chatcmpl_3f020464-e98d-4c7b-8717-9fca56784fe6","choices":[{"index":0,"message":{"role":"assistant","content":"好的，我已经调用了amap_weather工具查询了海淀区的天气情况。现在，让我为您展示一下查询结果吧。\n\n工具调用\nAction:amap_weather\nActionInput:{\"location\":\"海淀区\"}\n","tool_calls":[{"type":"function","function":{"name":"amap_weather","arguments":"{\"location\":\"海淀区\"}"}}]},"finish_reason":"tool_calls"}],"created":1717485704,"model":"Qwen2-7B-Instruct","system_fingerprint":"chatcmpl_3f020464-e98d-4c7b-8717-9fca56784fe6","object":"chat.completion","usage":{"prompt_tokens":237,"completion_tokens":48,"total_tokens":285}}
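Once a response like the one above comes back, the caller still has to run the requested tool. The sketch below pulls the tool name and arguments out of the `tool_calls` field and dispatches to a local function. It works on the raw JSON as a dict, trimmed to the fields it uses; `fake_amap_weather`, `TOOLS`, and `run_tool_calls` are hypothetical names for illustration, and the real amap_weather tool is not called.

```python
import json

# Trimmed copy of the response above, keeping only the fields used here
response = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "amap_weather",
                    "arguments": json.dumps({"location": "海淀区"}, ensure_ascii=False),
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}


def fake_amap_weather(location: str) -> str:
    """Hypothetical local stand-in for the real weather tool."""
    return f"(weather lookup for {location} would run here)"


TOOLS = {"amap_weather": fake_amap_weather}


def run_tool_calls(response: dict) -> list:
    """Execute every tool call requested in the first choice of a response."""
    choice = response["choices"][0]
    if choice["finish_reason"] != "tool_calls":
        return []
    results = []
    for call in choice["message"]["tool_calls"]:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        results.append(TOOLS[function["name"]](**arguments))
    return results


results = run_tool_calls(response)
print(results)
```

In a full loop, each result would be sent back to the model in a follow-up message so it can produce the final answer.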

With this, modelscope-agent can quickly add tool-calling capability to a model. For the full example, see: https://github.com/modelscope/modelscope-agent/blob/master/docs/llms/qwen2_tool_calling.md