Since Qwen has not published these models officially on Ollama, you need to use third-party builds. Run whichever of the commands below you need to download and run a model:
# 0.6B model
ollama run dengcao/Qwen3-Embedding-0.6B:Q8_0
ollama run dengcao/Qwen3-Embedding-0.6B:F16
# 4B model
ollama run dengcao/Qwen3-Embedding-4B:Q4_K_M
ollama run dengcao/Qwen3-Embedding-4B:Q5_K_M
ollama run dengcao/Qwen3-Embedding-4B:Q8_0
ollama run dengcao/Qwen3-Embedding-4B:F16
# 8B model
ollama run dengcao/Qwen3-Embedding-8B:Q4_K_M
ollama run dengcao/Qwen3-Embedding-8B:Q5_K_M
ollama run dengcao/Qwen3-Embedding-8B:Q8_0
ollama run dengcao/Qwen3-Embedding-8B:F16
A note on the quantized versions:
Empirically, Q5_K_M is recommended, since it retains most of the model's performance; if you need to save some memory, Q4_K_M is a reasonable alternative.
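Once one of the models above is running, Ollama exposes it over a local REST API. Below is a minimal sketch of requesting an embedding in Python, assuming an Ollama server on the default port 11434; the model tag used is illustrative and should match whichever build you pulled.
import requests

# Assumes a local Ollama server on the default port and that the model
# has been pulled, e.g. via `ollama run dengcao/Qwen3-Embedding-0.6B:Q5_K_M`
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "dengcao/Qwen3-Embedding-0.6B:Q5_K_M",
        "prompt": "What is the capital of China?",
    },
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # a list of floats
print(len(embedding))  # embedding dimension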
You can pre-download the models locally with huggingface-cli, or fetch them from ModelScope (a sketch follows the commands below). The official Qwen models are available on Hugging Face:
huggingface-cli download Qwen/Qwen3-Embedding-0.6B-GGUF --local-dir /home/models/Qwen3-Embedding-0.6B-GGUF
huggingface-cli download Qwen/Qwen3-Embedding-4B-GGUF --local-dir /home/models/Qwen3-Embedding-4B-GGUF
huggingface-cli download Qwen/Qwen3-Embedding-8B-GGUF --local-dir /home/models/Qwen3-Embedding-8B-GGUF
huggingface-cli download Qwen/Qwen3-Embedding-0.6B --local-dir /home/models/Qwen3-Embedding-0.6B
huggingface-cli download Qwen/Qwen3-Embedding-4B --local-dir /home/models/Qwen3-Embedding-4B
huggingface-cli download Qwen/Qwen3-Embedding-8B --local-dir /home/models/Qwen3-Embedding-8B
huggingface-cli download Qwen/Qwen3-Reranker-0.6B-GGUF --local-dir /home/models/Qwen3-Reranker-0.6B-GGUF
huggingface-cli download Qwen/Qwen3-Reranker-4B-GGUF --local-dir /home/models/Qwen3-Reranker-4B-GGUF
huggingface-cli download Qwen/Qwen3-Reranker-8B-GGUF --local-dir /home/models/Qwen3-Reranker-8B-GGUF
The usage code in the sections that follow is fairly simple, so it is not explained in detail here; the key points are noted directly in the code comments.
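As an alternative to huggingface-cli, the same repositories can be fetched from ModelScope. A minimal sketch, assuming the modelscope package is installed; the cache directory is an illustrative local path.
from modelscope import snapshot_download

# Download the model repository from ModelScope; the repo id mirrors
# the Hugging Face one. cache_dir is an illustrative local path.
local_path = snapshot_download("Qwen/Qwen3-Embedding-0.6B", cache_dir="/home/models")
print(local_path)  # directory usable with SentenceTransformer / AutoModel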
Version requirement (per the official Qwen3-Embedding model card): sentence-transformers>=2.7.0. Usage with the sentence-transformers library:
from sentence_transformers import SentenceTransformer
# Load the model; a local path can be used here to load a pre-downloaded model
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
# Enabling flash_attention_2 and setting `padding_side` to "left" can speed up inference and reduce memory use
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-8B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )
# Queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
# Encode the queries and documents into embedding vectors
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)
# Compute the cosine similarity between queries and documents
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# Output:
# tensor([[0.7493, 0.0751],
# [0.0880, 0.6318]])
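The similarity matrix can be used for retrieval directly, for example by picking the best-scoring document for each query. A small sketch, building on the variables defined above:
# For each query, select the document with the highest cosine similarity
best = similarity.argmax(dim=1)
for q, d in zip(queries, best):
    print(f"{q!r} -> {documents[d.item()]!r}")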
Version requirement (per the official model card): transformers>=4.51.0. Usage with the transformers library:
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # If the last position of every sequence is a real token, padding is on the left
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: index the hidden state of each sequence's last real token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'
# Each query must be accompanied by a one-sentence instruction describing the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No instruction is needed for the retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-8B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B')
# Enabling flash_attention_2 and setting `padding_side` to "left" can speed up inference and reduce memory use
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()
max_length = 8192
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7493016123771667, 0.0750647559762001], [0.08795969933271408, 0.6318399906158447]]
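Note that the forward pass above runs with gradient tracking enabled; for pure inference, wrapping it in torch.no_grad() reduces memory use. A minimal variant of the same step:
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)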
Version requirement (per the official model card): vllm>=0.8.5. Usage with vLLM:
import torch
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'
# Each query must be accompanied by a one-sentence instruction describing the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No instruction is needed for the retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
model = LLM(model="Qwen/Qwen3-Embedding-8B", task="embed")
outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7482624650001526, 0.07556197047233582], [0.08875375241041183, 0.6300010681152344]]
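vLLM can also serve the model behind an OpenAI-compatible HTTP endpoint. A sketch, assuming a recent vLLM with the serve CLI and the openai Python client installed; exact flags and defaults can vary across vLLM versions.
# Start the server first, e.g.:
#   vllm serve Qwen/Qwen3-Embedding-8B --task embed
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=["What is the capital of China?", "Explain gravity"],
)
print(len(resp.data[0].embedding))  # embedding dimension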