I looked around online for performance benchmarking tools, but most of them want to launch the model themselves and keep trying to download it from HuggingFace, which makes them awkward to use. Since essentially every inference service today, open source or closed, follows the OpenAI API, a simple script written against that interface should be able to benchmark any model's performance. For an old coder with 20+ years of experience like me, this is of course a job for AI :) So I had AI generate an initial version, and after an hour or so of debugging it ran. But the output numbers were wrong, caused by a hidden logic error that neither OpenAI nor DeepSeek spotted. After fixing that, revising the code, polishing the results to my satisfaction, and doing rounds of retesting and parameter tuning, another day and a half was gone... Still, it was much faster overall than writing from scratch. The bulk of coding time goes into debugging and refinement anyway, regardless of whether the first draft comes from AI or from yourself. Sometimes the AI draft even takes longer at this stage, because you have to find and fill in the pits the AI quietly dug.

Here is a usage example of the finished tool:

```
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --api_key "any-string-if-no-apikey-on-server" \
  --concurrency 1 \
  --prompt "Tell me a story" \
  --max_tokens 100 \
  --duration_seconds 30
```

Parameter meanings:

- --url: base URL of the inference service; the path ends with /v1
- --concurrency: number of concurrent requests; if set, --concurrencys is ignored
- --duration_seconds: duration of the run, in seconds
- --concurrencys: comma-separated list of concurrency levels, e.g. 1,5,10,15,20,30; only takes effect when --concurrency is not set. The script applies load at each level in turn, with a 5-second pause between consecutive batches.

Sample output for a single concurrency level:

```
Benchmark results:
Concurrency:                  1
Total requests:               9
Success rate:                 100.00%
Average latency:              3.3685s
Max latency:                  3.4171s
Min latency:                  3.3369s
Average time to first token:  0.0767s
P90 latency:                  3.3893s
P95 latency:                  3.4032s
P99 latency:                  3.4143s
Total generated tokens:       918
Min per-channel throughput:   30.71 tokens/s
Max per-channel throughput:   31.26 tokens/s
Avg per-channel throughput:   30.99 tokens/s
Overall throughput:           30.24 tokens/s
```
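For readers who want the gist without the full script: below is a minimal sketch of how these numbers can be collected against any OpenAI-compatible endpoint. It is not the actual simple-bench-to-api.py; the helper names and structure are my own, and it assumes the official `openai` Python package (v1+) with a streaming chat completion.

```python
# Minimal sketch (NOT the actual simple-bench-to-api.py): measure latency,
# time-to-first-token (TTFT) and generated tokens for one streaming request,
# then fan out a fixed-concurrency batch with a thread pool.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://10.96.0.188:11434/v1",
                api_key="any-string-if-no-apikey-on-server")

def bench_once(prompt: str, max_tokens: int) -> dict:
    start = time.perf_counter()
    ttft, tokens = None, 0
    stream = client.chat.completions.create(
        model="deepseek-r1:32b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:              # first content chunk arrived
                ttft = time.perf_counter() - start
            tokens += 1                   # ~1 token per chunk (approximation)
    latency = time.perf_counter() - start
    return {"latency": latency, "ttft": ttft, "tokens": tokens}

def bench_batch(concurrency: int, duration_s: float) -> list[dict]:
    # One fixed-concurrency batch: each worker loops until the time is up.
    deadline = time.perf_counter() + duration_s
    def worker() -> list[dict]:
        samples = []
        while time.perf_counter() < deadline:
            samples.append(bench_once("Tell me a story", 100))
        return samples
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        return [s for f in futures for s in f.result()]
```

Counting one token per streamed chunk is an approximation; a real script could read the server-reported usage field instead.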
If multiple concurrency levels are set via --concurrencys, then besides the per-level results above, the script prints a Markdown summary table at the end.

A few concepts deserve explanation:

"Latency": the time from sending a request to receiving the last token or character.

"P90 latency": the 90th-percentile latency. Latencies are sorted in ascending order; the value is linearly interpolated between the largest latency within the first 90% and the next latency in the sorted list. It means 90% of requests complete within this time. P95 and P99 latency follow the same pattern.

"Time to first token" (first-character latency): the time from sending a request to receiving the first returned character (or token).

"Per-channel throughput": a name I made up to distinguish this from the other metrics; I could not find an established term for it. It is the token generation rate seen from a single concurrent user/channel, counted from the first token onward, i.e. excluding the time to first token: channel throughput = tokens generated on that channel / (generation time minus time to first token). In my view, this metric together with the average time to first token best reflects what a real user experiences.

The specific metrics:

- Min per-channel throughput: the throughput of the slowest channel among all concurrent channels
- Max per-channel throughput: the throughput of the fastest channel among all concurrent channels
- Overall throughput: total tokens generated by all channels during the run / wall-clock time from start to end of the run
- P90 latency 3.3893s: 90% of requests finished below this value
- P95 latency 3.4032s: 95% of requests finished below this value
- P99 latency 3.4143s: 99% of requests finished below this value
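To make the percentile and throughput definitions concrete, here is a sketch of the aggregation step, consuming the per-request samples from the `bench_batch()` sketch above. The `percentile()` helper uses the same linear interpolation as numpy.percentile's default method; computing "per-channel" throughput per request rather than per channel is a simplification of mine.

```python
# Sketch: aggregate per-request samples into the metrics reported above.
def percentile(sorted_vals: list[float], p: float) -> float:
    # Linear interpolation between closest ranks (numpy's default method).
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def summarize(samples: list[dict], wall_time_s: float) -> dict:
    lats = sorted(s["latency"] for s in samples)
    # Per-channel throughput: tokens / (generation time minus TTFT).
    # Simplified to per-request here; the real script tracks each channel.
    chans = [s["tokens"] / (s["latency"] - s["ttft"]) for s in samples]
    return {
        "avg_latency": sum(lats) / len(lats),
        "p90": percentile(lats, 90),
        "p95": percentile(lats, 95),
        "p99": percentile(lats, 99),
        "avg_ttft": sum(s["ttft"] for s in samples) / len(samples),
        "chan_min": min(chans),
        "chan_max": max(chans),
        "chan_avg": sum(chans) / len(chans),
        # Overall throughput: all generated tokens over the whole window.
        "overall_tps": sum(s["tokens"] for s in samples) / wall_time_s,
    }
```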
Next, I benchmarked the DeepSeek distills of qwen2.5-7B and qwen2.5-32B, each deployed on a single RTX 4090 (24 GB VRAM).

DeepSeek-R1-Distill-Qwen-7B

Parameter set 1

Server:

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
```

Client benchmark command:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 100 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

Benchmark results (the script aggregates the per-concurrency results into a Markdown table). From the summary, as concurrency goes from 1 to 200:

- Average time to first token (avg_ttft) grows from 0.0363s to 0.6999s — still very fast.
- Average per-channel throughput (avg) drops from 60.14 tokens/s to 23.27 tokens/s, meaning each user is more than twice as slow at 200 concurrent requests, as expected.
- Overall throughput rises from 58.76 tokens/s to 3730.98 tokens/s; the higher the concurrency, the slower the growth, which matches the usual pattern.
- Average latency (avg_latency) grows from 1.6825s to 4.9803s — quite decent.

Overall, a single 4090 running the R1-7B model stays smooth all the way up to 200 concurrent requests.

Server logs:

```
# 1 concurrency
INFO 03-03 03:59:21 metrics.py:455] Avg prompt throughput: 1.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:26 metrics.py:455] Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 59.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:31 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:36 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:41 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:46 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:51 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
```
```
# 10 concurrency
INFO 03-03 03:59:57 metrics.py:455] Avg prompt throughput: 3.4 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:02 metrics.py:455] Avg prompt throughput: 50.3 tokens/s, Avg generation throughput: 561.9 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:07 metrics.py:455] Avg prompt throughput: 53.9 tokens/s, Avg generation throughput: 553.2 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:12 metrics.py:455] Avg prompt throughput: 54.0 tokens/s, Avg generation throughput: 553.3 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:17 metrics.py:455] Avg prompt throughput: 54.0 tokens/s, Avg generation throughput: 554.3 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:22 metrics.py:455] Avg prompt throughput: 39.6 tokens/s, Avg generation throughput: 557.7 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:27 metrics.py:455] Avg prompt throughput: 50.3 tokens/s, Avg generation throughput: 556.0 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:32 metrics.py:455] Avg prompt throughput: 6.1 tokens/s, Avg generation throughput: 50.9 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
```
```
# 50 concurrency
INFO 03-03 04:00:37 metrics.py:455] Avg prompt throughput: 172.2 tokens/s, Avg generation throughput: 1967.8 tokens/s, Running: 49 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.5%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:42 metrics.py:455] Avg prompt throughput: 180.0 tokens/s, Avg generation throughput: 1775.5 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:47 metrics.py:455] Avg prompt throughput: 178.7 tokens/s, Avg generation throughput: 1923.5 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:52 metrics.py:455] Avg prompt throughput: 181.2 tokens/s, Avg generation throughput: 1939.9 tokens/s, Running: 47 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:58 metrics.py:455] Avg prompt throughput: 186.5 tokens/s, Avg generation throughput: 1942.9 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:03 metrics.py:455] Avg prompt throughput: 172.2 tokens/s, Avg generation throughput: 1946.6 tokens/s, Running: 45 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%.
```
```
# 100 concurrency
INFO 03-03 04:01:10 metrics.py:455] Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 321.9 tokens/s, Running: 62 reqs, Swapped: 0 reqs, Pending: 38 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:15 metrics.py:455] Avg prompt throughput: 352.8 tokens/s, Avg generation throughput: 3194.2 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:20 metrics.py:455] Avg prompt throughput: 190.3 tokens/s, Avg generation throughput: 2817.9 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:25 metrics.py:455] Avg prompt throughput: 352.9 tokens/s, Avg generation throughput: 3163.5 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:30 metrics.py:455] Avg prompt throughput: 197.2 tokens/s, Avg generation throughput: 2960.1 tokens/s, Running: 13 reqs, Swapped: 0 reqs, Pending: 2 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:35 metrics.py:455] Avg prompt throughput: 358.0 tokens/s, Avg generation throughput: 3230.8 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:40 metrics.py:455] Avg prompt throughput: 185.1 tokens/s, Avg generation throughput: 2847.4 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
```
```
# 150 concurrency
INFO 03-03 04:01:47 metrics.py:455] Avg prompt throughput: 2.7 tokens/s, Avg generation throughput: 25.5 tokens/s, Running: 39 reqs, Swapped: 0 reqs, Pending: 22 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:52 metrics.py:455] Avg prompt throughput: 535.9 tokens/s, Avg generation throughput: 3599.7 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:57 metrics.py:455] Avg prompt throughput: 268.1 tokens/s, Avg generation throughput: 3613.7 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:02 metrics.py:455] Avg prompt throughput: 273.3 tokens/s, Avg generation throughput: 3520.0 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:07 metrics.py:455] Avg prompt throughput: 280.3 tokens/s, Avg generation throughput: 3703.8 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:12 metrics.py:455] Avg prompt throughput: 278.9 tokens/s, Avg generation throughput: 3685.7 tokens/s, Running: 11 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:17 metrics.py:455] Avg prompt throughput: 406.6 tokens/s, Avg generation throughput: 3176.2 tokens/s, Running: 81 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.0%, CPU KV cache usage: 0.0%.
```
```
# 200 concurrency
INFO 03-03 04:02:25 metrics.py:455] Avg prompt throughput: 13.0 tokens/s, Avg generation throughput: 867.2 tokens/s, Running: 58 reqs, Swapped: 0 reqs, Pending: 44 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:30 metrics.py:455] Avg prompt throughput: 472.7 tokens/s, Avg generation throughput: 4012.6 tokens/s, Running: 157 reqs, Swapped: 0 reqs, Pending: 29 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:35 metrics.py:455] Avg prompt throughput: 244.4 tokens/s, Avg generation throughput: 3969.8 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:40 metrics.py:455] Avg prompt throughput: 439.9 tokens/s, Avg generation throughput: 4022.8 tokens/s, Running: 93 reqs, Swapped: 0 reqs, Pending: 32 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:45 metrics.py:455] Avg prompt throughput: 258.5 tokens/s, Avg generation throughput: 3944.5 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 20 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:50 metrics.py:455] Avg prompt throughput: 405.5 tokens/s, Avg generation throughput: 4000.0 tokens/s, Running: 68 reqs, Swapped: 0 reqs, Pending: 22 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:55 metrics.py:455] Avg prompt throughput: 320.2 tokens/s, Avg generation throughput: 4016.7 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
```
GPU KV cache usage stays within about 1%, which means the cache is far from fully utilized — the system still has headroom.

Steady-state resource consumption:

```
# 1 concurrency
|=========================================+======================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 30%   50C    P2            291W / 450W  | 22736MiB / 24564MiB  |     94%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 10 concurrency
|=========================================+======================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 30%   56C    P2            297W / 450W  | 22736MiB / 24564MiB  |     89%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 50 concurrency
|=========================================+======================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 30%   57C    P2            298W / 450W  | 22736MiB / 24564MiB  |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 150 concurrency
|=========================================+======================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 46%   61C    P2            342W / 450W  | 22736MiB / 24564MiB  |     83%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 200 concurrency
|=========================================+======================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 35%   60C    P2            323W / 450W  | 22736MiB / 24564MiB  |     78%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
As expected: the higher the concurrency, the more scheduling and context-switching overhead the system incurs, so GPU compute utilization drops.

Parameter set 2

Server (unchanged):

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
```

Client:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 1024 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

The server stays the same; the client max_tokens goes from 100 (parameter set 1) to 1024. Benchmark results:

With 100-token outputs, the average latency at a single concurrent request was only about 1.7 seconds; when the output length grows 10x to 1k tokens, the single-concurrency average latency rises to 16-17 seconds — also almost exactly 10x, as expected.

With 1k-token outputs, going from 1 to 200 concurrent requests follows essentially the same pattern as with 100-token outputs:

- Average time to first token (avg_ttft) grows from 0.0404s to 0.7275s — still very fast.
- Average per-channel throughput (avg) drops from 60.18 tokens/s to 20.89 tokens/s, meaning each user is more than twice as slow at 200 concurrent requests, as expected.
- Overall throughput rises from 59.95 tokens/s to 3157.15 tokens/s; the higher the concurrency, the slower the growth, matching the usual pattern.
- Average latency (avg_latency) grows from 16.3665s to 49.7491s, as expected.
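A quick back-of-the-envelope check of that 10x scaling: at a single concurrent request, latency should be roughly the time to first token plus the number of generated tokens divided by the per-channel decode speed (~60 tokens/s measured above).

```python
# Sanity check: latency ~= TTFT + max_tokens / per-channel decode speed.
# Inputs are the single-concurrency numbers measured above.
decode_speed = 60.0  # tokens/s, approximate per-channel rate at concurrency 1
for ttft, max_tokens in [(0.0363, 100), (0.0404, 1024)]:
    est = ttft + max_tokens / decode_speed
    print(f"max_tokens={max_tokens:5d} -> estimated latency {est:6.2f}s")
# Prints ~1.70s and ~17.11s, close to the measured 1.6825s and 16.3665s.
```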
Server logs: omitted. As concurrency went from 1 to 200, GPU KV cache usage grew from single digits to 99.9%.

Steady-state resource consumption: omitted.

Parameter set 3

Server (unchanged):

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
```

Client:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 16384 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

The server stays the same; the client max_tokens becomes 16k. Benchmark results:

At 16k-token outputs, going from 150 to 200 concurrent requests no longer brings a clear throughput gain. Average latency reaches 66 seconds, and P99 latency is already quite high. With 16k outputs, the 7B model stays solid at up to about 100 concurrent requests. Also, a few colleagues were occasionally calling the service while I was testing, so some numbers may not be fully accurate.

Server logs: omitted.

Resources: omitted.
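The 99.9% KV cache usage seen above is easy to rationalize with a rough estimate. The per-token KV size formula below is standard; the Qwen2.5-7B config values (28 layers, 4 KV heads via GQA, head_dim 128) and the fp16 cache dtype are my assumptions, not stated in the setup above.

```python
# Rough KV cache sizing (assumed Qwen2.5-7B config, fp16 KV cache).
layers, kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
print(f"{kv_per_token} bytes/token ~= {kv_per_token / 1024:.0f} KiB/token")
# ~56 KiB per cached token: with fp16 weights (~15 GiB) on a 24 GiB card,
# only a few GiB remain for KV cache, so 100-200 concurrent requests each
# holding thousands of tokens of context can fill it completely.
```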
Parameter set 4

Server:

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
```

Client:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 100 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

Server max-num-seqs set to 256 and max-model-len set to 103632; client max_tokens set to 100. Benchmark results:

Server logs: omitted.

Resources: omitted.

Parameter set 5

Server (unchanged):

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
```

Client:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 1024 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

The server stays the same; the client max_tokens is set to 1024. Benchmark results:

Server logs: omitted.

Steady-state resource consumption: omitted.

Parameter set 6

Server (unchanged):

```
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
```

Client:

```
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 16384 \
  --api_key <the API key configured on the server> \
  --duration_seconds 30
```

The server stays the same; the client max_tokens is set to 16384 (16k outputs). Benchmark results:

With the same 16k max_tokens, after the server-side parameter change the throughput at 200 concurrent requests is 934 tokens/s versus 794 before — an increase of over 17% — and P99 latency drops by nearly half. I am still analyzing the cause; I cannot rule out that the two runs simply experienced different background interference.

Server logs: omitted.

Resources: omitted.

DeepSeek-R1-Distill-Qwen-32B

Model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

Deployment: k8s + ollama, ollama version 0.5.10

Quantization: the official ollama model deepseek-r1:32b, a 4-bit (Q4_K_M) quantized GGUF build of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.

Load generator location: the same machine as the inference service, which eliminates remote network overhead.
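Because ollama exposes an OpenAI-compatible API under /v1, the same benchmark script works unchanged. Before a long run it is worth a quick connectivity check; here is a sketch using the official `openai` package against the endpoint used in this post:

```python
# Sanity-check the ollama OpenAI-compatible endpoint before benchmarking.
from openai import OpenAI

client = OpenAI(base_url="http://10.96.0.188:11434/v1",
                api_key="any-string-if-no-apikey-on-server")
print([m.id for m in client.models.list()])  # should include "deepseek-r1:32b"
resp = client.chat.completions.create(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```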
Parameter set A

```
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 100 \
  --api_key <the API key set on the server; any string if there is none> \
  --duration_seconds 30
```

Resources:

```
# 1 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 30%   49C    P2            340W / 450W  | 23012MiB / 24564MiB  |     79%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 5 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 30%   53C    P2            379W / 450W  | 23012MiB / 24564MiB  |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 10 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 42%   61C    P2            372W / 450W  | 23012MiB / 24564MiB  |     76%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 15 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 53%   58C    P2            356W / 450W  | 23012MiB / 24564MiB  |     72%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 20 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 53%   61C    P2            378W / 450W  | 23012MiB / 24564MiB  |     76%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 30 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 55%   62C    P2            382W / 450W  | 23012MiB / 24564MiB  |     76%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
Parameter set B

```
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 1024 \
  --api_key <the API key set on the server; any string if there is none> \
  --duration_seconds 30
```

Resources:

```
# 1 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 30%   52C    P2            333W / 450W  | 23012MiB / 24564MiB  |     87%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 5 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 33%   57C    P2            370W / 450W  | 23012MiB / 24564MiB  |     79%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 10 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 53%   60C    P2            376W / 450W  | 23012MiB / 24564MiB  |     77%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 15 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 54%   63C    P2            385W / 450W  | 23012MiB / 24564MiB  |     78%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 20 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 65%   62C    P2            376W / 450W  | 23012MiB / 24564MiB  |     77%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
```
# 30 concurrency
|=========================================+======================+======================|
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 63%   59C    P2            371W / 450W  | 23012MiB / 24564MiB  |     76%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
Parameter set C

```
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 16384 \
  --api_key <the API key set on the server; any string if there is none> \
  --duration_seconds 30
```

Resources: omitted.

Conclusions:

- A single 4090 running deepseek-ai/DeepSeek-R1-Distill-Qwen-7B is quite smooth at up to 100 concurrent requests: with 16k outputs, overall throughput reaches 2151.35 tokens/s while the perceived per-channel/per-user throughput holds steady around 30 tokens/s. It can even withstand 200 concurrent requests, thanks largely to the stability of the vLLM engine.
- A single 4090 running deepseek-ai/DeepSeek-R1-Distill-Qwen-32B is quite smooth at up to 20 concurrent requests: with 16k outputs, overall throughput reaches 93 tokens/s while the perceived per-channel/per-user throughput holds steady around 24 tokens/s. At 30 concurrent requests latency becomes noticeably high, but perceived throughput can still hold at a decent level.