I recently deployed DeepSeek-R1-AWQ (671B) and then the newer DeepSeek-V3-0324 (685B) on an 8x H20 machine, and measured both serving performance and math-benchmark scores. The server was provided by Volcengine. First, the machine configuration:
GPU:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H20                     On  | 00000000:65:02.0 Off |                    0 |
| N/A   29C    P0             71W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H20                     On  | 00000000:65:03.0 Off |                    0 |
| N/A   32C    P0             72W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H20                     On  | 00000000:67:02.0 Off |                    0 |
| N/A   32C    P0             74W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H20                     On  | 00000000:67:03.0 Off |                    0 |
| N/A   30C    P0             73W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H20                     On  | 00000000:69:02.0 Off |                    0 |
| N/A   30C    P0             74W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H20                     On  | 00000000:69:03.0 Off |                    0 |
| N/A   33C    P0             74W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H20                     On  | 00000000:6B:02.0 Off |                    0 |
| N/A   33C    P0             73W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H20                     On  | 00000000:6B:03.0 Off |                    0 |
| N/A   29C    P0             75W / 500W  |    0MiB / 97871MiB   |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
One pitfall here: the driver version the machine originally shipped with was problematic. It worked fine on an RTX 4090, but on the H20, DeepSeek-R1-AWQ crashed the moment inference started, no matter which configurations or software versions I tried. After switching to the driver that NVIDIA's website recommends for the H20, Driver Version 550.144.03 (CUDA 12.4), everything worked without changing any other configuration.
GPU interconnect:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    OK    OK    OK    OK    OK    OK    OK
GPU1    OK     X    OK    OK    OK    OK    OK    OK
GPU2    OK    OK     X    OK    OK    OK    OK    OK
GPU3    OK    OK    OK     X    OK    OK    OK    OK
GPU4    OK    OK    OK    OK     X    OK    OK    OK
GPU5    OK    OK    OK    OK    OK     X    OK    OK
GPU6    OK    OK    OK    OK    OK    OK     X    OK
GPU7    OK    OK    OK    OK    OK    OK    OK     X

Legend:
  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
Memory:
# free -g
              total        used        free      shared  buff/cache   available
Mem:           1929          29        1891           0           9        1892
Swap:             0           0           0
Disk:
vda     252:0    0   100G  0 disk
├─vda1  252:1    0   200M  0 part /boot/efi
└─vda2  252:2    0  99.8G  0 part /
nvme3n1 259:0    0   3.5T  0 disk
nvme2n1 259:1    0   3.5T  0 disk
nvme0n1 259:2    0   3.5T  0 disk
nvme1n1 259:3    0   3.5T  0 disk
OS:
# uname -a
Linux H20 5.4.0-162-generic #179-Ubuntu SMP Mon Aug 14 08:51:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Inference was served with vLLM v0.8.2, bringing up the two models one after the other.
Starting the performance benchmark:
nohup python3 -u simple-bench-to-api.py --url http://localhost:7800/v1 \
  --model DeepSeek-R1 \
  --concurrencys 1,10,20,30,40,50 \
  --prompt "Introduce the history of China" \
  --max_tokens 100,1024,16384,32768,65536,131072 \
  --api_key sk-xxx \
  --duration_seconds 30 \
  > benth-DeepSeek-R1-AWQ-8-H20.log 2>&1 &
This command sweeps max_tokens over 100, 1024, 16384, 32768, 65536 and 131072, and for each value runs batch tests at 1, 10, 20, 30, 40 and 50 concurrent requests. Each max_tokens value yields one table of results across concurrency levels. The benchmark script simple-bench-to-api.py and the detailed meaning of its parameters are in the previous article, "Concurrency Performance of a Small DeepSeek-R1 Model Deployed on a Single RTX 4090", for anyone who needs them.
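The sweep that the command above performs can be sketched as follows. This is a hypothetical illustration, not the actual simple-bench-to-api.py code; the names sweep_plan, MAX_TOKENS and CONCURRENCYS are invented here:

```python
# Hypothetical sketch of the benchmark sweep: for each max_tokens value,
# test every concurrency level, producing one result table per max_tokens.
from itertools import product

CONCURRENCYS = [1, 10, 20, 30, 40, 50]
MAX_TOKENS = [100, 1024, 16384, 32768, 65536, 131072]

def sweep_plan(max_tokens_list=MAX_TOKENS, concurrencys=CONCURRENCYS):
    """Return (max_tokens, concurrency) pairs in the order they are run."""
    return [(m, c) for m, c in product(max_tokens_list, concurrencys)]

print(len(sweep_plan()))  # 6 max_tokens values x 6 concurrency levels = 36 runs
```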
Benchmark results:
----- max_tokens=100 benchmark summary -----
A few of the terms in these tables need explanation.
Meaning of the specific metrics:
For details, see the previous article, "Concurrency Performance of a Small DeepSeek-R1 Model Deployed on a Single RTX 4090".
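For readers without the previous article at hand, the usual way such metrics are derived from per-request timings can be sketched as below. This is a hedged illustration, not the script's actual code; RequestResult and summarize() are names made up for this sketch:

```python
# Hedged illustration: deriving common serving-benchmark metrics
# (time to first token, end-to-end latency, aggregate throughput)
# from per-request timing records.
from dataclasses import dataclass

@dataclass
class RequestResult:
    start: float        # time the request was sent (seconds)
    first_token: float  # time the first output token arrived (seconds)
    end: float          # time the last output token arrived (seconds)
    output_tokens: int  # number of tokens generated

def summarize(results):
    """Aggregate per-request timings into the usual benchmark metrics."""
    ttft = [r.first_token - r.start for r in results]   # time to first token
    latency = [r.end - r.start for r in results]        # end-to-end latency
    total_tokens = sum(r.output_tokens for r in results)
    wall = max(r.end for r in results) - min(r.start for r in results)
    return {
        "avg_ttft_s": sum(ttft) / len(ttft),
        "avg_latency_s": sum(latency) / len(latency),
        "throughput_tok_s": total_tokens / wall,        # aggregate tokens/s
    }

stats = summarize([RequestResult(0.0, 0.5, 10.0, 950),
                   RequestResult(0.0, 0.7, 10.0, 930)])
print(stats)
```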
----- max_tokens=1024 benchmark summary -----
----- max_tokens=16384 (16k) benchmark summary -----
----- max_tokens=32768 (32k) benchmark summary -----
----- max_tokens=65536 (64k) benchmark summary -----
----- max_tokens=131072 (128k) benchmark summary -----
----- max_tokens=100 benchmark summary -----
----- max_tokens=1024 benchmark summary -----
----- max_tokens=16384 (16k) benchmark summary -----
----- max_tokens=32768 (32k) benchmark summary -----
----- max_tokens=65536 (64k) benchmark summary -----
Peak resource usage during the benchmark:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                    Off  |   00000000:65:02.0 Off |                    0 |
| N/A   39C    P0            176W / 500W  |  95096MiB / 97871MiB   |     95%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                    Off  |   00000000:65:03.0 Off |                    0 |
| N/A   46C    P0            184W / 500W  |  95070MiB / 97871MiB   |     23%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20                    Off  |   00000000:67:02.0 Off |                    0 |
| N/A   45C    P0            178W / 500W  |  95070MiB / 97871MiB   |     95%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20                    Off  |   00000000:67:03.0 Off |                    0 |
| N/A   41C    P0            180W / 500W  |  95070MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20                    Off  |   00000000:69:02.0 Off |                    0 |
| N/A   40C    P0            180W / 500W  |  95070MiB / 97871MiB   |     95%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20                    Off  |   00000000:69:03.0 Off |                    0 |
| N/A   45C    P0            182W / 500W  |  95070MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20                    Off  |   00000000:6B:02.0 Off |                    0 |
| N/A   46C    P0            184W / 500W  |  95070MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20                    Off  |   00000000:6B:03.0 Off |                    0 |
| N/A   40C    P0            182W / 500W  |  95078MiB / 97871MiB   |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Peak KV cache usage:
INFO 03-31 23:22:50 [loggers.py:80] Avg prompt throughput: 45.0 tokens/s, Avg generation throughput: 166.9 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:20 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:30 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:40 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:50 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
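The peak figure can be pulled out of log lines like these mechanically. A minimal sketch; peak_kv_usage is a hypothetical helper, not part of vLLM:

```python
import re

def peak_kv_usage(log_lines):
    """Return the highest 'GPU KV cache usage' percentage seen in vLLM logs."""
    pat = re.compile(r"GPU KV cache usage: ([\d.]+)%")
    vals = [float(m.group(1)) for line in log_lines for m in pat.finditer(line)]
    return max(vals) if vals else None

sample = [
    "INFO 03-31 23:23:30 [loggers.py:80] GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%",
    "INFO 03-31 23:24:00 [loggers.py:80] GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%",
]
print(peak_kv_usage(sample))  # 38.6
```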
I used GitHub - huggingface/lighteval (an all-in-one toolkit for evaluating LLMs across multiple backends) to score the DeepSeek-R1-AWQ and DeepSeek-V3-0324 deployments on the 8x H20 machine against math test sets. We modified a small amount of lighteval code so that, instead of launching model inference itself, it calls the OpenAI-compatible API of the already-deployed model. Results below:
Evaluation command (with our modifications):
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0"
Evaluation results:
| Task                 | Version | Metric           | Value |    | Stderr |
|----------------------|--------:|------------------|------:|----|-------:|
| all                  |         | extractive_match | 0.818 | ±  | 0.0173 |
| lighteval:math_500:0 |       1 | extractive_match | 0.818 | ±  | 0.0173 |
Evaluation command (with our modifications):
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0" --max-samples 20
To save time, only 20 problems were used.
Evaluation results:
| Task                 | Version | Metric           | Value |    | Stderr |
|----------------------|--------:|------------------|------:|----|-------:|
| all                  |         | extractive_match |  0.95 | ±  |   0.05 |
| lighteval:math_500:0 |       1 | extractive_match |  0.95 | ±  |   0.05 |
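As a sanity check, the reported standard errors are consistent with the plain binomial standard error (this is presumably what lighteval reports). For the 20-sample run:

```latex
\mathrm{SE} = \sqrt{\frac{p(1-p)}{n}}
            = \sqrt{\frac{0.95 \times 0.05}{20}}
            \approx 0.0487 \approx 0.05
```

The full 500-sample math_500 run checks out the same way: sqrt(0.818 × 0.182 / 500) ≈ 0.0173, matching its table.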
Peak resource usage during the evaluation:
|=========================================+========================+======================|
|   0  NVIDIA H20                    Off  |   00000000:65:02.0 Off |                    0 |
| N/A   36C    P0            159W / 500W  |  97048MiB / 97871MiB   |     96%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                    Off  |   00000000:65:03.0 Off |                    0 |
| N/A   42C    P0            167W / 500W  |  97022MiB / 97871MiB   |     91%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20                    Off  |   00000000:67:02.0 Off |                    0 |
| N/A   40C    P0            160W / 500W  |  97022MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20                    Off  |   00000000:67:03.0 Off |                    0 |
| N/A   38C    P0            161W / 500W  |  97022MiB / 97871MiB   |     95%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20                    Off  |   00000000:69:02.0 Off |                    0 |
| N/A   37C    P0            161W / 500W  |  97022MiB / 97871MiB   |     21%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20                    Off  |   00000000:69:03.0 Off |                    0 |
| N/A   41C    P0            162W / 500W  |  97022MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20                    Off  |   00000000:6B:02.0 Off |                    0 |
| N/A   42C    P0            164W / 500W  |  97022MiB / 97871MiB   |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20                    Off  |   00000000:6B:03.0 Off |                    0 |
| N/A   37C    P0            163W / 500W  |  97030MiB / 97871MiB   |     95%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Evaluation command (with our modifications):
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|aime25|0|0" --max-samples 20
To save time, only 20 problems were used.
Evaluation results:
| Task               | Version | Metric           | Value |    | Stderr |
|--------------------|--------:|------------------|------:|----|-------:|
| all                |         | extractive_match |   0.4 | ±  | 0.1124 |
| lighteval:aime25:0 |       1 | extractive_match |   0.4 | ±  | 0.1124 |
aime25 is fairly new, but this score seems lower than evaluation results others have published. That may come down to the evaluation method, or to context truncation during evaluation skewing the results.