AI Beginner's Village: Hugging Face

Hugging Face

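Hugging Face was founded in 2016 and first became known as a community hub for NLP models. With the rise of LLMs, the pretrained weights and related tooling for most mainstream LLMs can now be found on the platform. It also hosts a wide range of computer vision and audio models.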

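Hugging Face is often called the GitHub of AI models. It maintains three core libraries: Transformers (wraps Transformer models behind an easier-to-use API), Tokenizers (splits text into the smallest units a model can process, i.e. tokens), and Datasets (loads and processes datasets, including external data files).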

The figure below shows the Hugging Face homepage. The features used most often are the Models and Datasets sections marked in the figure.

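Loading data

The Datasets page on Hugging Face offers a large collection of datasets covering text, audio, and images, and it provides an intuitive viewer for browsing them.

Using a dataset is simple: call load_dataset with the name of the dataset you need. The same load_dataset function can also load your own custom dataset from local files.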

from datasets import load_dataset

# Load a public dataset from the Hub by its repository id
ds = load_dataset("clapAI/MultiLingualSentiment")

from datasets import load_dataset
# Load custom training and test data from local JSON files
data_files = {"train": "./day014/datas/train_data.json", "test": "./day014/datas/test_data.json"}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# Inspect the first training example
print(dataset["train"][0])
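
The post does not show what train_data.json and test_data.json contain. As a rough sketch, assuming each record carries a `text` field (which the fine-tuning code later reads) and an integer `label` field, the files could be written as JSON Lines, one object per line. The sample sentences, the label values, and the temporary directory below are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: the field names "text" and "label" and the label ids
# (0 = negative, 1 = neutral, 2 = positive) are assumptions for illustration.
train_records = [
    {"text": "I am very happy today", "label": 2},
    {"text": "The meeting starts at nine", "label": 1},
    {"text": "This is a terrible experience", "label": 0},
]
test_records = [{"text": "What a wonderful day", "label": 2}]


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line (the JSON Lines layout)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSON Lines file back into a list of dicts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


data_dir = Path(tempfile.mkdtemp())
write_jsonl(data_dir / "train_data.json", train_records)
write_jsonl(data_dir / "test_data.json", test_records)

# Same mapping shape as the data_files argument passed to load_dataset("json", ...)
data_files = {
    "train": str(data_dir / "train_data.json"),
    "test": str(data_dir / "test_data.json"),
}
print(read_jsonl(data_dir / "train_data.json")[0])
```

A `data_files` mapping like this one can then be passed to `load_dataset("json", data_files=data_files)` as in the snippet above.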

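Using models

Take text classification (sentiment analysis) as an example. With the pipeline function, specifying the task name is enough to call a model. The default model for this task is distilbert/distilbert-base-uncased-finetuned-sst-2-english, and a specific model can be selected with the model parameter.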

from transformers import pipeline

# Use the default model
# pipe = pipeline("text-classification")

# Use a specific model, which can be found on the Models page
# (the default model was trained on English data, so a multilingual model is used here)
pipe = pipeline("text-classification", model="lxyuan/distilbert-base-multilingual-cased-sentiments-student")

string_arr = [
    "从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
    "我一路向北，离开有你的季节，你说你好累，已无法再爱上谁。风在山路吹，过往的画面全都是不对，细数惭愧，我伤你几回。",
    "我很开心",
]

results = pipe(string_arr)

print(results)
# Output:
# [{'label': 'positive', 'score': 0.5694631338119507}, {'label': 'negative', 'score': 0.9576570987701416}, {'label': 'positive', 'score': 0.9572104811668396}]
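
The pipeline returns one dict per input, holding only the top label and its score. The first sentence scores about 0.57, which is close to a coin flip, so it can help to flag such borderline predictions. A minimal sketch, reusing the output printed above; the helper name `flag_uncertain` and the 0.6 threshold are arbitrary choices for illustration, not part of the library.

```python
# Output copied from the run above (one dict per input sentence).
results = [
    {"label": "positive", "score": 0.5694631338119507},
    {"label": "negative", "score": 0.9576570987701416},
    {"label": "positive", "score": 0.9572104811668396},
]


def flag_uncertain(results: list, threshold: float = 0.6) -> list:
    """Attach the input index and mark predictions scoring below the threshold."""
    return [
        {
            "index": i,
            "label": r["label"],
            "score": r["score"],
            "uncertain": r["score"] < threshold,
        }
        for i, r in enumerate(results)
    ]


for row in flag_uncertain(results):
    print(row)
```

With this threshold only the first sentence is marked as uncertain.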

The first time the program above runs, the model is downloaded automatically. The default download location is ~/.cache/huggingface/hub.
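
The cache location can be changed through environment variables. The sketch below approximates the lookup order as I understand it: `HF_HUB_CACHE` wins if set, otherwise the `hub` folder under `HF_HOME`, otherwise the default under the home directory. The function `resolve_hub_cache` is a hypothetical helper, not a library call, and it ignores other variables such as `XDG_CACHE_HOME`.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_hub_cache(env: Optional[dict] = None) -> Path:
    """Approximate how the Hub cache directory is chosen from environment variables."""
    env = dict(os.environ) if env is None else env
    # An explicit cache path takes priority.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise fall back to <HF_HOME>/hub, with HF_HOME defaulting to ~/.cache/huggingface.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache())
```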

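Besides the pipeline function, a model can also be called through the hosted inference API. This requires an access token created on the website beforehand. When a model is called this way it is not downloaded to the local machine, which makes this approach more convenient than pipeline if you do not want to store or run the model yourself.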

from utils.common_config import config  # the author's local config module that holds the token
import requests

# Despite its name, this function returns sentiment classification scores, not embeddings.
def generate_embedding(text: list[str]) -> list:
    embedding_url = "https://api-inference.huggingface.co/models/lxyuan/distilbert-base-multilingual-cased-sentiments-student"
    response = requests.post(
        embedding_url,
        headers={"Authorization": f"Bearer {config.hg_token}"},
        json={"inputs": text})

    if response.status_code != 200:
        raise ValueError(f"Request failed with status code {response.status_code}: {response.text}")

    return response.json()

string_arr = [
    "从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
    "我一路向北，离开有你的季节，你说你好累，已无法再爱上谁。风在山路吹，过往的画面全都是不对，细数惭愧，我伤你几回。",
    "我很开心",
]
a = generate_embedding(string_arr)
print(a)

# Output:
# [[{'label': 'positive', 'score': 0.5694631934165955}, {'label': 'neutral', 'score': 0.2743554711341858}, {'label': 'negative', 'score': 0.15618135035037994}], [{'label': 'negative', 'score': 0.9576572179794312}, {'label': 'neutral', 'score': 0.0352189838886261}, {'label': 'positive', 'score': 0.007123854476958513}], [{'label': 'positive', 'score': 0.9572104811668396}, {'label': 'neutral', 'score': 0.03854822367429733}, {'label': 'negative', 'score': 0.004241317044943571}]]
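
Unlike the pipeline call, the API response is a nested list: one inner list per input, containing the score for every label. A small sketch of reducing it to the single best label per input, which reproduces the shape of the pipeline output; `top_labels` is a hypothetical helper name, and the data is copied from the output above.

```python
# Response copied from the run above: one list of label scores per input sentence.
api_output = [
    [
        {"label": "positive", "score": 0.5694631934165955},
        {"label": "neutral", "score": 0.2743554711341858},
        {"label": "negative", "score": 0.15618135035037994},
    ],
    [
        {"label": "negative", "score": 0.9576572179794312},
        {"label": "neutral", "score": 0.0352189838886261},
        {"label": "positive", "score": 0.007123854476958513},
    ],
    [
        {"label": "positive", "score": 0.9572104811668396},
        {"label": "neutral", "score": 0.03854822367429733},
        {"label": "negative", "score": 0.004241317044943571},
    ],
]


def top_labels(api_output: list) -> list:
    """Keep only the highest-scoring label dict for each input."""
    return [max(candidates, key=lambda c: c["score"]) for candidates in api_output]


print([r["label"] for r in top_labels(api_output)])
# ['positive', 'negative', 'positive']
```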

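Fine-tuning models

If the model's results are not good enough, we can fine-tune it: train it further on our own data so that the model parameters are adjusted to the task.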

from datasets import load_dataset
# Load the training and test data
import os
data_files = {
    "train": os.path.join(os.path.dirname(__file__), "datas/train_data.json"),
    "test": os.path.join(os.path.dirname(__file__), "datas/test_data.json"),
}
dataset = load_dataset("json", data_files=data_files)
print(dataset)
# Inspect the first training example
print(dataset["train"][0])

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")
model = (
    DistilBertForSequenceClassification.from_pretrained(
        "lxyuan/distilbert-base-multilingual-cased-sentiments-student",
        num_labels=3,
        id2label={0: "negative", 1: "neutral", 2: "positive"},
        label2id={"negative": 0, "neutral": 1, "positive": 2},
        # ignore_mismatched_sizes=True
    ).to(device)
)
# Also used below as the output directory for checkpoints
model_name = "sentiment_model"
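
The `id2label` and `label2id` arguments should be exact inverses of each other, and the ids should match the integer labels stored in the training data. One way to avoid a mismatch is to write only one mapping and derive the other. A minimal sketch; the `decode` helper is hypothetical and only illustrates how class ids map back to label names.

```python
# Single source of truth for the label set (same mapping as in the snippet above).
id2label = {0: "negative", 1: "neutral", 2: "positive"}

# Derive the inverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}


def decode(class_ids: list) -> list:
    """Translate predicted class ids back into human-readable labels."""
    return [id2label[i] for i in class_ids]


print(label2id)
print(decode([2, 0, 1]))
```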


from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score

def preprocess_function(example):
    # Tokenize the "text" field, truncating inputs that exceed the model's maximum length
    return tokenizer(example["text"], truncation=True, padding=True)

train_dataset = dataset["train"].map(preprocess_function, batched=True)
test_dataset = dataset["test"].map(preprocess_function, batched=True)

# Pads every batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(pred):
    labels = pred.label_ids
    # Pick the class with the highest logit for each example
    predictions = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}
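
To make the metric concrete: compute_metrics takes the index of the largest logit in each row as the predicted class and then measures the fraction of rows where that index equals the true label. The sketch below repeats that logic in plain Python on made-up logits; the numbers and the helper names are illustrative only.

```python
def argmax(row: list) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits: list, labels: list) -> float:
    """Fraction of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples over three classes, plus their true labels.
logits = [
    [0.1, 0.2, 2.5],
    [1.8, 0.3, -0.4],
    [0.2, 1.1, 0.9],
    [0.0, 0.4, 0.3],
]
labels = [2, 0, 1, 2]

print(accuracy_from_logits(logits, labels))  # 3 of 4 correct -> 0.75
```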

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=model_name,
    eval_strategy="epoch",  # evaluate at the end of every epoch (older releases name this evaluation_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=60,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# Accuracy on the training set
train_results = trainer.evaluate(eval_dataset=train_dataset)
train_accuracy = train_results.get("eval_accuracy")
print(f"Training Accuracy: {train_accuracy}")

# Accuracy on the held-out test set
test_results = trainer.evaluate(eval_dataset=test_dataset)
test_accuracy = test_results.get("eval_accuracy")
print(f"Testing Accuracy: {test_accuracy}")

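Once training finishes, we can load the new model to evaluate how well it works. Because the local training set is small, the final results of the new model may be less than ideal.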

from transformers import pipeline
# Load the fine-tuned checkpoint from its local directory
classifier = pipeline(task="sentiment-analysis", model="/Users/shaoyang/.cache/huggingface/hub/sentiment_model/checkpoint-120")
a = classifier(["从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
                "我很开心"])

print(a)
# Output:
# [{'label': 'negative', 'score': 0.532397449016571}, {'label': 'neutral', 'score': 0.9187697768211365}]
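
The checkpoint directory name gives a hint about how small the training set is. Trainer names checkpoints after the global optimizer step, so checkpoint-120 corresponds to step 120. Assuming it is the final checkpoint of the 60-epoch run with batch size 4, on a single device and without gradient accumulation (assumptions not stated in the post), the arithmetic below shows which training set sizes would produce exactly 120 steps. The helper `total_training_steps` is a simplified illustration, not the library's own calculation.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Which training set sizes give 120 steps with batch size 4 and 60 epochs?
matching_sizes = [
    n for n in range(1, 21)
    if total_training_steps(n, batch_size=4, epochs=60) == 120
]
print(matching_sizes)  # [5, 6, 7, 8]
```

Under those assumptions the training set would hold only five to eight examples, which fits the remark above that the local data is small and the results may be unreliable.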