Qwen2-0.5B-instuct-int4-inc
Qwen2-1.5B-instuct-int4-inc
Qwen2-7B-int4-inc
# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Load the full-precision model and its tokenizer.
model_name = "Qwen/Qwen2-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 4-bit symmetric quantization with one scale per group of 128 weights.
bits, group_size, sym = 4, 128, True
device = "cpu"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym, device=device)
autoround.quantize()

# The directory name is only a label; the inference script loads from this same path.
output_dir = "./qwen2_7b_autoround"
autoround.save_quantized(output_dir)
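The `bits`, `group_size`, and `sym` settings in the quantization script describe the storage format of the compressed weights. To illustrate what that format means, here is a minimal pure-Python sketch of symmetric, group-wise round-to-nearest quantization. It is a simplified stand-in and not AutoRound's algorithm, which additionally tunes the rounding with signed gradient descent. The helper names (`quantize_group_sym`, `quantize_groupwise`, `dequantize_groupwise`) are hypothetical and exist only for this example.

```python
import random

def quantize_group_sym(values, bits=4):
    """Quantize one group of floats to signed integer codes with a single scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    # Symmetric: zero maps to code 0 and the largest magnitude maps to +/-qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale

def quantize_groupwise(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each group separately."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    return [quantize_group_sym(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

def dequantize_groupwise(groups):
    """Rebuild approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy stand-in for a weight row
groups = quantize_groupwise(weights, bits=4, group_size=128)
restored = dequantize_groupwise(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.5f}")
```

Because each group of 128 weights gets its own scale, an outlier only degrades precision inside its own group. A smaller `group_size` usually lowers the error at the cost of storing more scales.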
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# AutoRound-quantized model saved by the quantization step
model_name = "./qwen2_7b_autoround"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
# Decode the generated token IDs back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
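In Qwen2, all instruction-tuned models are trained on contexts of 32K tokens and extrapolate to longer context lengths using techniques such as YARN or Dual Chunk Attention. With Intel Extension for Transformers, next-token latency stays below 50 ms even when the input length reaches 32K.

Intel Extension for Transformers offers several ways to minimize token latency while maintaining acceptable accuracy. Its user-friendly, Transformers-like API requires only minor code changes. It also provides more advanced features, such as streaming LLM, tensor parallelism, and low-bit compression (down to 1 bit), so we encourage you to check out Intel Extension for Transformers.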
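To make the latency terms concrete: first-token latency covers processing the whole prompt, while next-token latency is the time for each later decoding step. The sketch below shows one way to measure the two separately. It assumes a hypothetical `step` callable that produces one token per call; `fake_step` merely sleeps to imitate a model and is not part of any library.

```python
import time
from statistics import mean

def measure_token_latency(step, num_tokens):
    """Time each call to `step`; return (first_token_ms, mean_next_token_ms)."""
    if num_tokens < 2:
        raise ValueError("need at least 2 tokens to separate first- and next-token latency")
    latencies_ms = []
    for i in range(num_tokens):
        start = time.perf_counter()
        step(i)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms[0], mean(latencies_ms[1:])

def fake_step(i):
    # Hypothetical stand-in for a decode step: the first call imitates
    # prompt processing and is slower than the later calls.
    time.sleep(0.02 if i == 0 else 0.002)

first_ms, next_ms = measure_token_latency(fake_step, num_tokens=5)
print(f"first token: {first_ms:.1f} ms, next token (mean): {next_ms:.1f} ms")
```

With a real model, `step` would wrap a single decoding call. Reporting the mean over many steps smooths out timer noise.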