Recently I deployed DeepSeek-R1-AWQ (671B) and the latest DeepSeek-V3-0324 (685B), one after the other, on an 8-card H20 machine, and measured both serving performance and math benchmark scores. The server was provided by Volcengine. First, the machine configuration:

GPU (nvidia-smi summary):

NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2
8× NVIDIA H20, 97871 MiB of memory each; all cards idle before the tests
(0 MiB used, 0% utilization, 71–75 W / 500 W, 29–33 °C, MIG disabled).

One pitfall here: this initial driver version was buggy. It worked fine on an RTX 4090, but on the H20, DeepSeek-R1-AWQ crashed as soon as inference started, no matter which configurations or software versions I tried. After switching to the driver NVIDIA's website recommends for the H20 (Driver Version: 550.144.03, CUDA 12.4), everything worked without changing any other settings.

GPU interconnect:

      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0   X    OK   OK   OK   OK   OK   OK   OK
GPU1   OK   X    OK   OK   OK   OK   OK   OK
GPU2   OK   OK   X    OK   OK   OK   OK   OK
GPU3   OK   OK   OK   X    OK   OK   OK   OK
GPU4   OK   OK   OK   OK   X    OK   OK   OK
GPU5   OK   OK   OK   OK   OK   X    OK   OK
GPU6   OK   OK   OK   OK   OK   OK   X    OK
GPU7   OK   OK   OK   OK   OK   OK   OK   X
Legend:
  X   = Self
  OK  = Status Ok
  CNS = Chipset not supported
  GNS = GPU not supported
  TNS = Topology not supported
  NS  = Not supported
  U   = Unknown
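The all-OK matrix above can also be verified programmatically rather than by eyeballing it. Below is a minimal sketch of a parser for this kind of peer-to-peer status matrix (the format matches what `nvidia-smi topo` prints, though the exact command used for this post isn't shown); the function name and the 4-GPU sample matrix are illustrative:

```python
# Hypothetical helper: parse an OK/X peer-to-peer status matrix (as printed by
# nvidia-smi topo) and confirm every GPU pair reports P2P support.
def parse_p2p_matrix(text: str) -> dict:
    """Return {(src, dst): status} for every off-diagonal GPU pair."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    header = lines[0]              # ['GPU0', 'GPU1', ...]
    result = {}
    for row in lines[1:]:
        src = row[0]               # row label, e.g. 'GPU3'
        for dst, status in zip(header, row[1:]):
            if src != dst:         # skip the X on the diagonal
                result[(src, dst)] = status
    return result

# Small illustrative 4-GPU matrix in the same layout as the 8-GPU one above.
matrix = """\
      GPU0 GPU1 GPU2 GPU3
GPU0  X    OK   OK   OK
GPU1  OK   X    OK   OK
GPU2  OK   OK   X    OK
GPU3  OK   OK   OK   X
"""
statuses = parse_p2p_matrix(matrix)
print(all(s == "OK" for s in statuses.values()))  # True when every pair supports P2P
```

On a healthy box like this one, every off-diagonal entry should be `OK`; any `CNS`/`GNS`/`TNS`/`NS` entry would point at a topology problem worth fixing before multi-GPU serving.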
Memory:

# free -g
              total  used  free  shared  buff/cache  available
Mem:           1929    29  1891       0           9       1892
Swap:             0     0     0

Disk:

vda      252:0   0   100G  0 disk
├─vda1   252:1   0   200M  0 part /boot/efi
└─vda2   252:2   0  99.8G  0 part /
nvme3n1  259:0   0   3.5T  0 disk
nvme2n1  259:1   0   3.5T  0 disk
nvme0n1  259:2   0   3.5T  0 disk
nvme1n1  259:3   0   3.5T  0 disk

OS:

# uname -a
Linux H20 5.4.0-162-generic #179-Ubuntu SMP Mon Aug 14 08:51:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Launching inference

Inference was served with vLLM v0.8.2, loading the following two models in turn:

- DeepSeek-R1-AWQ: https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ
- DeepSeek-V3-0324: https://modelscope.cn/models/deepseek-ai/DeepSeek-V3-0324
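Once vLLM is up, it exposes an OpenAI-compatible API, which is what the benchmark script below talks to. As a minimal sketch, this is roughly what one request looks like; the URL, port (7800), model name, and `sk-xxx` key mirror the ones used later in this post, and `build_chat_request` is just an illustrative helper:

```python
# Sketch of one request to the vLLM OpenAI-compatible endpoint used in the
# benchmarks below. build_chat_request is a hypothetical helper; the endpoint
# URL and API key are the ones appearing later in this post.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int) -> dict:
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # streaming is needed to measure time to first token
    }

payload = build_chat_request("DeepSeek-R1", "Introduce the history of China", 100)

# Sending it would look like this (not executed here, since it needs a live server):
# req = urllib.request.Request(
#     "http://localhost:7800/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": "Bearer sk-xxx",
#              "Content-Type": "application/json"},
# )
# resp = urllib.request.urlopen(req)
```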
H20 performance benchmark

Launching the benchmark:

nohup python3 -u simple-bench-to-api.py --url http://localhost:7800/v1 \
  --model DeepSeek-R1 \
  --concurrencys 1,10,20,30,40,50 \
  --prompt "Introduce the history of China" \
  --max_tokens 100,1024,16384,32768,65536,131072 \
  --api_key sk-xxx \
  --duration_seconds 30 \
  > benth-DeepSeek-R1-AWQ-8-H20.log 2>&1 &

This command runs batch tests for each max_tokens value (100, 1024, 16384, 32768, 65536, 131072) at concurrency levels of 1, 10, 20, 30, 40, and 50, producing one table per max_tokens value across the concurrency levels. The benchmark script simple-bench-to-api.py and detailed parameter descriptions are in the previous post, "Concurrency performance of a small DeepSeek-R1 model deployed on a single RTX 4090" — feel free to grab it there.

Benchmark results:

DeepSeek-R1-AWQ on 8× H20: measured performance

----- max_tokens=100 results summary -----

A few terms need explaining first:

- "Latency": the time from sending a request to receiving the last token/character (it includes the time to first token).
- "P90 latency": the 90th-percentile latency. Sort all latencies in ascending order, take the largest latency within the first 90% and the next latency after it, and linearly interpolate between the two.
- "Time to first token (TTFT)": the time from sending a request to receiving the first returned character.
- "Per-concurrency throughput": the token generation speed seen by each concurrent user/channel after its first token arrives. The measurement window excludes TTFT, so a channel's throughput = tokens generated by that channel / generation time excluding TTFT. In my view, this metric together with the average TTFT best reflects what a real user experiences.
The specific metrics:

- Average latency: the mean latency across all channels (includes TTFT).
- Minimum per-concurrency throughput: the throughput of the slowest channel among all concurrent channels (TTFT excluded).
- Maximum per-concurrency throughput: the throughput of the fastest channel among all concurrent channels (TTFT excluded).
- Average per-concurrency throughput: the mean throughput across all concurrent channels (TTFT excluded).
- Overall throughput: the total number of tokens generated by all channels during the test, divided by the time from test start to test end.
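The metric definitions above can be sketched directly in code. This is not the actual implementation from simple-bench-to-api.py, just a minimal sketch assuming each "channel" records its total latency, TTFT, and generated token count:

```python
# Minimal sketch of the metrics defined above (not the actual benchmark code).
def p90_latency(latencies: list) -> float:
    """90th-percentile latency via linear interpolation between the largest
    latency within the first 90% and the next latency after it."""
    xs = sorted(latencies)
    pos = 0.9 * (len(xs) - 1)      # fractional index into the sorted list
    lo = int(pos)
    frac = pos - lo
    if lo + 1 >= len(xs):
        return float(xs[-1])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

def per_channel_throughput(tokens: int, latency: float, ttft: float) -> float:
    """Tokens per second after the first token arrives (TTFT excluded)."""
    return tokens / (latency - ttft)

def overall_throughput(total_tokens: int, wall_seconds: float) -> float:
    """All tokens generated by all channels over the whole test window."""
    return total_tokens / wall_seconds
```

For example, a channel that produced 100 tokens with an 11 s total latency and a 1 s TTFT has a per-concurrency throughput of 100 / (11 - 1) = 10 tokens/s.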
See the previous post, "Concurrency performance of a small DeepSeek-R1 model deployed on a single RTX 4090", for details.
----- max_tokens=1024 results summary -----
----- max_tokens=16384 (16K) results summary -----
----- max_tokens=32768 (32K) results summary -----
----- max_tokens=65536 (64K) results summary -----
----- max_tokens=131072 (128K) results summary -----

DeepSeek-V3-0324 on 8× H20: measured performance

----- max_tokens=100 results summary -----
----- max_tokens=1024 results summary -----
----- max_tokens=16384 (16K) results summary -----
----- max_tokens=32768 (32K) results summary -----
----- max_tokens=65536 (64K) results summary -----

Peak resource usage during the benchmark (nvidia-smi summary):

NVIDIA-SMI 550.144.03    Driver Version: 550.144.03    CUDA Version: 12.4
8× NVIDIA H20, 95070–95096 MiB / 97871 MiB used per card,
23–98% utilization, 176–184 W / 500 W, 39–46 °C.

Peak KV cache usage (vLLM logs):

INFO 03-31 23:22:50 [loggers.py:80] Avg prompt throughput: 45.0 tokens/s, Avg generation throughput: 166.9 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:20 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:30 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:40 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:50 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
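To track peaks over a long run, vLLM stats lines like the ones above can be parsed mechanically. A minimal sketch (the regex and function name are my own, not part of vLLM):

```python
# Small parser for vLLM periodic stats lines, pulling out generation
# throughput and GPU KV-cache usage so peaks can be tracked over a run.
import re

LOG_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s.*?"
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)

def parse_vllm_stats(line: str):
    """Return (generation tokens/s, KV cache usage %) or None if not a stats line."""
    m = LOG_RE.search(line)
    if not m:
        return None
    return float(m.group("gen")), float(m.group("kv"))

line = ("INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, "
        "GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%")
print(parse_vllm_stats(line))  # (355.0, 15.4)
```

Feeding the whole log through this and taking the max of each field gives the peak generation throughput and peak KV cache usage directly.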
Math benchmark scores

I used lighteval (GitHub - huggingface/lighteval: "Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends") to score DeepSeek-R1-AWQ and DeepSeek-V3-0324, each deployed on the 8× H20 machine, on math test sets. I modified a small amount of lighteval code so that instead of launching model inference itself, it calls the OpenAI API endpoint of the already-deployed model. The results:

DeepSeek-R1-AWQ on 8× H20: benchmark scores

math_500 evaluation

The modified evaluation command:

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0"

Results:

| Task                 | Version | Metric           | Value |   | Stderr |
|----------------------|--------:|------------------|------:|---|-------:|
| all                  |         | extractive_match | 0.818 | ± | 0.0173 |
| lighteval:math_500:0 |       1 | extractive_match | 0.818 | ± | 0.0173 |

DeepSeek-V3-0324 on 8× H20: benchmark scores

math_500 evaluation

The modified evaluation command:

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0" --max-samples 20

To save time, only 20 problems were sampled.

Results:

| Task                 | Version | Metric           | Value |   | Stderr |
|----------------------|--------:|------------------|------:|---|-------:|
| all                  |         | extractive_match |  0.95 | ± |   0.05 |
| lighteval:math_500:0 |       1 | extractive_match |  0.95 | ± |   0.05 |

Peak resource usage during the test:
nvidia-smi summary: 8× NVIDIA H20, 97022–97048 MiB / 97871 MiB used per card, 21–97% utilization, 159–167 W / 500 W, 36–42 °C.

aime25 evaluation

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|aime25|0|0" --max-samples 20

To save time, only 20 problems were sampled.

Results:

| Task               | Version | Metric           | Value |   | Stderr |
|--------------------|--------:|------------------|------:|---|-------:|
| all                |         | extractive_match |   0.4 | ± | 0.1124 |
| lighteval:aime25:0 |       1 | extractive_match |   0.4 | ± | 0.1124 |
aime25 is fairly new, but this score seems lower than evaluation scores others have published. It may be an issue with the evaluation method, or context truncation during the evaluation may have affected the result.
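As a sanity check on the result tables above: the Stderr column appears consistent with the sample standard error of a binary accuracy metric, sqrt(p·(1-p)/(n-1)) — this formula is my inference from the numbers, not something lighteval's docs confirm here. A quick cross-check:

```python
# Cross-check the Stderr columns above against the sample standard error
# of a binary metric: sqrt(p * (1 - p) / (n - 1)).
# The formula is an assumption inferred from the reported numbers.
import math

def sample_stderr(p: float, n: int) -> float:
    """Sample standard error for accuracy p over n graded problems."""
    return math.sqrt(p * (1 - p) / (n - 1))

print(round(sample_stderr(0.818, 500), 4))  # 0.0173 (math_500, R1-AWQ, 500 problems)
print(round(sample_stderr(0.95, 20), 4))    # 0.05   (math_500, V3-0324, 20 problems)
print(round(sample_stderr(0.4, 20), 4))     # 0.1124 (aime25, 20 problems)
```

All three reported Stderr values match, which also confirms that the 20-sample runs carry wide error bars: ±0.1124 on a 0.4 aime25 score means the gap to previously published numbers may not be statistically meaningful.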