|
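Once a RAG system is built, we usually need a set of metrics to judge how well it actually performs. Evaluating the system as a whole has long been a hard problem, so today we introduce an open-source project for evaluating a RAG system end to end.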
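Ragas is a framework that helps you evaluate retrieval-augmented generation (RAG) pipelines. RAG refers to a class of LLM applications that use external data to augment the LLM's context. Existing tools and frameworks help you build these pipelines, but evaluating them and quantifying their performance can be hard. This is where Ragas (RAG Assessment) comes in. Ragas provides tools based on the latest research for evaluating LLM-generated text, giving you insight into your RAG pipeline. Ragas can also be integrated with your CI/CD to run continuous checks that keep performance from regressing.

How Ragas works

Ragas provides several metrics that evaluate different aspects of a RAG system:

- Retriever: context_precision and context_recall measure the performance of the retrieval system.
- Generator: faithfulness measures hallucination, and answer_relevancy measures how relevant the answer is to the question.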

The metrics mentioned above, and how they are implemented:

- Faithfulness: measures whether the generated answer can be inferred from the provided context. It is computed in two steps, statement extraction and statement verification. An LLM breaks the generated answer into a set of concise statements, then checks whether each statement can be inferred from the given context.
- Answer Relevance
- Context Precision
- Context Recall

We will cover the exact formulas for these metrics in a dedicated follow-up post, so stay tuned. Ragas also offers a number of other metrics; look them up if you need them.
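
The two faithfulness steps described above (statement extraction, then statement verification) can be sketched roughly as follows. This is only an illustration, not Ragas's implementation. The helper names `split_into_statements`, `is_supported`, and `faithfulness_score` are hypothetical. Both LLM steps are replaced with toy string operations so the example runs offline. The sketch also assumes the final score is the fraction of statements supported by the context.

```python
from typing import Callable, List


def split_into_statements(answer: str) -> List[str]:
    """Toy stand-in for the LLM extraction step: split the answer on periods."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def is_supported(statement: str, contexts: List[str]) -> bool:
    """Toy stand-in for the LLM verification step: case-insensitive substring match."""
    needle = statement.lower()
    return any(needle in ctx.lower() for ctx in contexts)


def faithfulness_score(
    answer: str,
    contexts: List[str],
    extract: Callable[[str], List[str]] = split_into_statements,
    verify: Callable[[str, List[str]], bool] = is_supported,
) -> float:
    """Assumed scoring rule: supported statements divided by total statements."""
    statements = extract(answer)
    if not statements:
        return 0.0  # nothing to verify; treated as 0.0 in this sketch
    supported = sum(1 for s in statements if verify(s, contexts))
    return supported / len(statements)


# Made-up example: the second statement is deliberately absent from the context.
contexts = ["Paris is the capital of France. It lies on the Seine."]
answer = "Paris is the capital of France. Paris has ten million residents."
print(faithfulness_score(answer, contexts))  # one of two statements is supported
```

In a real setup, `extract` and `verify` would each call an LLM, and the surrounding arithmetic would stay the same.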

Now let's get the evaluation up and running:
Environment setup

Install the required dependency:

    pip install ragas
    # pip install git+https://github.com/explodinggradients/ragas

Code example

Here is a small example program you can run to see Ragas in action. (By default Ragas uses OpenAI's models for evaluation; you can modify part of the code to swap in your own model.)

    import os

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_correctness

    # Ragas calls OpenAI by default; replace the placeholder with your own key.
    os.environ["OPENAI_API_KEY"] = "your-openai-key"

    # One entry per sample; 'contexts' holds a list of retrieved passages per question.
    data_samples = {
        'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
        'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
        'contexts': [
            ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
            ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference'],
        ],
        'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
    }

    dataset = Dataset.from_dict(data_samples)

    # Score every sample on the selected metrics.
    score = evaluate(dataset, metrics=[faithfulness, answer_correctness])

    # Per-sample results as a pandas DataFrame (displayed automatically in a notebook).
    score.to_pandas()

The four metrics introduced above are the ones most commonly used today. Give them a try and see how well your RAG system performs.
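
Since Ragas can be wired into CI/CD for continuous checks, one possible way to turn scores into a pass/fail gate is sketched below. This is not a Ragas API. The helper `find_regressions` and the threshold values are hypothetical, and the score numbers are made up for illustration rather than taken from a real run. The sketch assumes you have already reduced the evaluation result to a plain dict mapping metric names to averages.

```python
from typing import Dict, List


def find_regressions(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return a message for every metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        # `not (value >= minimum)` also catches NaN, which compares False to everything.
        if value is None or not (value >= minimum):
            failures.append(f"{metric}: got {value}, need >= {minimum}")
    return failures


# Illustrative numbers only; choose thresholds that suit your own system.
example_scores = {"faithfulness": 0.92, "answer_correctness": 0.61}
example_thresholds = {"faithfulness": 0.80, "answer_correctness": 0.70}

failures = find_regressions(example_scores, example_thresholds)
for line in failures:
    print(line)
# In a CI job, a non-empty `failures` list would typically fail the build,
# for example with `raise SystemExit(1)`.
```

Keeping the gate separate from the evaluation call lets the thresholds live in configuration and makes the check testable without calling an LLM.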
|