Is Semantic Chunking Really Effective?


I recently came across an interesting paper, "Is Semantic Chunking Worth the Computational Cost?" [1]. It compares semantic chunking with traditional fixed-size chunking in Retrieval-Augmented Generation (RAG) systems, in terms of both efficiency and performance.

Semantic chunking aims to improve retrieval performance by splitting documents into semantically coherent segments. It is increasingly popular, but its practical benefit over fixed-size chunking remains unclear. The study systematically evaluates semantic chunking on three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation.

To test whether semantic chunking actually helps, the authors designed three chunking strategies, described below.

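Fixed-size Chunker: the baseline. It splits a document sequentially into fixed-size chunks, based on a predefined or user-specified number of sentences per chunk.
Breakpoint-based Semantic Chunker: it measures the semantic distance between consecutive sentences and splits the text wherever that distance crosses a threshold, so each chunk stays coherent.
Clustering-based Semantic Chunker: it uses a clustering algorithm to group sentences by meaning, capturing global relationships and allowing non-contiguous text to be grouped together.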
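The three strategies can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real sentence embedding model, the `0.8` threshold and chunk size of 2 are arbitrary, and `clustering_chunks` uses a greedy one-pass grouping as a simplified stand-in for a real clustering algorithm. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def fixed_size_chunks(sentences, size=2):
    """Split sequentially into chunks of `size` sentences (the last may be shorter)."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def breakpoint_chunks(sentences, threshold=0.8):
    """Start a new chunk when consecutive sentences are farther apart than `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sent])
        else:
            chunks[-1].append(sent)
    return chunks


def clustering_chunks(sentences, threshold=0.8):
    """Greedy grouping: a sentence joins the first group that already holds a
    sentence within `threshold`, otherwise it starts a new group.
    Groups may contain non-contiguous sentences."""
    vectors = [embed(s) for s in sentences]
    groups = []  # each group is a list of sentence indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if any(cosine_distance(vec, vectors[j]) <= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return [[sentences[i] for i in group] for group in groups]


sentences = [
    "Cats chase mice.",
    "Cats sleep a lot.",
    "Stocks fell sharply today.",
    "Mice fear cats.",
]
print(fixed_size_chunks(sentences))
print(breakpoint_chunks(sentences))
print(clustering_chunks(sentences))
```

On this toy input, the fixed-size chunker pairs the stock sentence with an unrelated cat sentence, the breakpoint chunker isolates it but also cuts off the last cat sentence, and the greedy grouping pulls all three cat sentences together even though they are not adjacent. Both semantic variants call `embed` for every sentence, which is where the extra cost comes from once a real model is used.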

Document Retrieval

The document retrieval results are shown in the table below. Most datasets show no clear gap. The exceptions are Miracl and NQ, where the breakpoint-based chunker leads fixed-size by about 12 and 20 points respectively. Datasets marked with * were built by stitching short texts together, so the stitched pieces are largely independent of one another.

Dataset	Fixed-size	Breakpoint	Clustering
Miracl*	69.45	81.89	67.35
NQ*	43.79	63.93	41.01
Scidocs*	16.82	17.60	19.87
Scifact*	35.27	36.27	35.70
BioASQ*	61.86	61.87	62.49
NFCorpus*	21.36	21.07	22.12
HotpotQA	90.59	87.37	84.79
MSMARCO	93.58	92.23	93.18
ConditionalQA	68.11	64.44	65.94
Qasper	90.99	89.27	90.77
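To make the gaps easier to read, the scores can be compared against the fixed-size baseline directly. The sketch below copies the numbers from the table above. The helper names `gaps_vs_fixed` and `best_chunker` are hypothetical, introduced only for this illustration.

```python
# Scores copied from the document retrieval table above, in the order
# (Fixed-size, Breakpoint, Clustering). A trailing '*' marks a stitched dataset.
SCORES = {
    "Miracl*": (69.45, 81.89, 67.35),
    "NQ*": (43.79, 63.93, 41.01),
    "Scidocs*": (16.82, 17.60, 19.87),
    "Scifact*": (35.27, 36.27, 35.70),
    "BioASQ*": (61.86, 61.87, 62.49),
    "NFCorpus*": (21.36, 21.07, 22.12),
    "HotpotQA": (90.59, 87.37, 84.79),
    "MSMARCO": (93.58, 92.23, 93.18),
    "ConditionalQA": (68.11, 64.44, 65.94),
    "Qasper": (90.99, 89.27, 90.77),
}
CHUNKERS = ("Fixed-size", "Breakpoint", "Clustering")


def gaps_vs_fixed(scores):
    """Map each dataset to (breakpoint - fixed, clustering - fixed), rounded to 2 places."""
    return {
        name: (round(bp - fx, 2), round(cl - fx, 2))
        for name, (fx, bp, cl) in scores.items()
    }


def best_chunker(scores):
    """Map each dataset to the name of the chunker with the highest score."""
    return {
        name: CHUNKERS[max(range(len(row)), key=lambda i: row[i])]
        for name, row in scores.items()
    }


gaps = gaps_vs_fixed(SCORES)
winners = best_chunker(SCORES)
for name in SCORES:
    bp_gap, cl_gap = gaps[name]
    print(f"{name:<14} breakpoint {bp_gap:+6.2f}  clustering {cl_gap:+6.2f}  best: {winners[name]}")
```

On these numbers, fixed-size has the top score on all four non-stitched datasets, while the stitched datasets split between the breakpoint and clustering chunkers.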

Evidence Retrieval

The evidence retrieval results are shown in the table below. Here the three chunkers are nearly indistinguishable: on every dataset their scores differ by at most 1.6 points.

Dataset	Fixed-size	Breakpoint	Clustering
ExpertQA	47.11	47.08	46.87
DelucionQA	43.05	43.24	43.36
TechQA	28.98	28.49	27.96
ConditionalQA	18.23	19.83	19.14
Qasper	8.66	8.16	8.50

Answer Generation

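The retrieval-based answer generation results are shown in the table below. There is effectively no difference: scores on each dataset differ by at most 0.01.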

Dataset	Fixed-size	Breakpoint	Clustering
ExpertQA	0.65	0.65	0.65
DelucionQA	0.76	0.76	0.76
TechQA	0.68	0.68	0.68
ConditionalQA	0.42	0.43	0.43
Qasper	0.49	0.49	0.50

Summary

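The results indicate that the computational cost of semantic chunking is not justified by consistent performance gains. These findings challenge earlier assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems. Overall, fixed-size chunking remains the more efficient and reliable choice for practical RAG applications.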