Ragas: An Open-Source RAG Evaluation Tool with 6.1K Stars


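Once a RAG system is built, we usually need a set of metrics to judge how well it actually performs. Evaluating the whole system has long been a hard problem, so today we introduce an open-source project for evaluating a RAG system end to end.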

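Ragas is a framework that helps you evaluate Retrieval-Augmented Generation (RAG) pipelines. RAG refers to a class of LLM applications that use external data to augment the LLM's context. Existing tools and frameworks help you build these pipelines, but evaluating them and quantifying their performance can be difficult. This is where Ragas (RAG Assessment) comes in.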

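Ragas provides tools based on the latest research for evaluating LLM-generated text, giving you insight into your RAG pipeline. It can also be integrated into your CI/CD to run continuous checks that guard against performance regressions.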
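As a rough illustration of what such a continuous check could look like, here is a minimal sketch of a threshold gate. The helper `failing_metrics`, the metric scores, and the thresholds are all hypothetical and are not part of the Ragas API; the sketch assumes the evaluation scores have already been collected into a plain dictionary.

```python
from typing import Dict


def failing_metrics(scores: Dict[str, float],
                    thresholds: Dict[str, float]) -> Dict[str, float]:
    """Return the metrics whose score is below the required minimum.

    A metric that is missing from `scores` is treated as 0.0, so it fails.
    """
    return {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


# Hypothetical scores and thresholds, for illustration only.
scores = {"faithfulness": 0.91, "answer_relevancy": 0.62}
thresholds = {"faithfulness": 0.8, "answer_relevancy": 0.7}

failures = failing_metrics(scores, thresholds)
print(failures)
# A CI job could then fail the build when `failures` is non-empty,
# for example with: sys.exit(1 if failures else 0)
```

Keeping the gate separate from the evaluation call makes it easy to tune thresholds per metric without touching the evaluation code.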

How Ragas works

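Ragas provides several metrics that evaluate different aspects of a RAG system:

Retriever: context_precision and context_recall measure the performance of the retrieval system.
Generator: faithfulness measures hallucination, and answer_relevancy measures how relevant the answer is to the question.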

The metrics mentioned above, and how each one is computed:

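Faithfulness:

The faithfulness metric measures whether the generated answer can be inferred from the provided context. It is computed in two steps: statement extraction and statement verification. A language model (LLM) breaks the generated answer into a set of concise statements, and each statement is then checked to see whether it can be inferred from the given context.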
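The final scoring step can be sketched as below. This is a minimal illustration, not the Ragas implementation: it assumes the score is the fraction of statements that pass verification, and the function name `faithfulness_score` and the verdicts are hypothetical. In a real pipeline, both extraction and verification would be done by an LLM.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Fraction of answer statements judged to be supported by the context.

    Each entry in `verdicts` is the result of verifying one extracted
    statement against the context (True means it can be inferred).
    """
    if not verdicts:
        raise ValueError("need at least one statement")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer split into three statements,
# two of which are supported by the retrieved context.
example_verdicts = [True, True, False]
print(faithfulness_score(example_verdicts))
```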
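Answer Relevance:

The answer relevance metric evaluates how well the generated answer addresses the original question, regardless of factual accuracy. To compute it, an LLM generates potential questions from the given answer, an embedding model computes the cosine similarity between each generated question and the original question, and the similarities are averaged into the final score.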
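The similarity-averaging step can be sketched as follows. The embeddings here are toy 2-D vectors made up for illustration, and the function names are hypothetical. A real setup would obtain the vectors from an embedding model and the candidate questions from an LLM.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec: Sequence[float],
                     generated_question_vecs: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between the original question and each
    question generated from the answer."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original exactly,
# the other is orthogonal to it.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(original, generated))
```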

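Context Precision:

The context precision metric evaluates how relevant the retrieved context chunks are to the given question. It is computed from the proportion of relevant information that appears at the top ranks, so relevant chunks ranked higher yield a better score.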
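One common way to turn "relevant items near the top" into a number is to average precision@k over the ranks that hold a relevant chunk. The sketch below uses that rank-weighted formulation as an assumption, since the article does not give the exact formula. The relevance labels are hypothetical and would normally come from an LLM judgement.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k taken at every rank k that holds a relevant chunk.

    `relevance[i]` says whether the chunk at rank i + 1 is relevant.
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank
    return precision_sum / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked earlier.
print(context_precision([True, False, True]))
print(context_precision([False, True, True]))
```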

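Context Recall:

The context recall metric measures how well the retrieved context agrees with the ground-truth answer. It is computed by checking whether each sentence in the ground-truth answer can be traced back to the retrieved context.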
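A minimal sketch of this check, assuming the score is the fraction of ground-truth sentences that can be attributed to the context. The `naive_judge` function is a crude, hypothetical stand-in (verbatim substring match) for the judgement an LLM would make in practice, and the sample sentences are made up.

```python
from typing import Callable, Sequence


def context_recall(ground_truth_sentences: Sequence[str],
                   contexts: Sequence[str],
                   is_supported: Callable[[str, Sequence[str]], bool]) -> float:
    """Fraction of ground-truth sentences that the judge attributes to the contexts."""
    if not ground_truth_sentences:
        raise ValueError("need at least one ground-truth sentence")
    supported = sum(1 for s in ground_truth_sentences if is_supported(s, contexts))
    return supported / len(ground_truth_sentences)


def naive_judge(sentence: str, contexts: Sequence[str]) -> bool:
    """Stand-in judge: the sentence must appear verbatim (case-insensitive)
    in at least one context chunk."""
    needle = sentence.lower()
    return any(needle in chunk.lower() for chunk in contexts)


contexts = ["The game was played on January 15, 1967, in Los Angeles."]
truth = ["played on January 15, 1967", "The Packers won"]
print(context_recall(truth, contexts, naive_judge))
```

Passing the judge in as a parameter keeps the scoring logic independent of how attribution is decided.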

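We will cover the exact formulas for these metrics in a separate post. Ragas also offers several other metrics, which you can look up if you need them.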

Now let's get the evaluation running:

Environment setup

The dependency needed to run the example is:

pip install ragas
# Or install from source:
# pip install git+https://github.com/explodinggradients/ragas

Code example

Here is a small example program you can run to see Ragas in action. (By default Ragas uses OpenAI models for evaluation; you can modify part of the code to substitute your own model.)

import os

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness

# Ragas calls OpenAI models by default, so an API key is required.
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Two samples: question, generated answer, retrieved contexts, and reference answer.
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [
        ['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference'],
    ],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)

# Score every sample on the selected metrics, then view the results as a DataFrame.
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
score.to_pandas()

The four metrics introduced above are the ones most commonly used today. Give them a try and see how well your RAG system performs.