For weight and activation quantization, the paper Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models measures the per-channel quantization error of the weights in every layer of LLaMA-65B and sorts the results. The results show that INT8 is clearly more accurate than FP8-E4 for weight quantization across all layers.
The paper also measures the per-tensor quantization error of the activations in every layer of LLaMA-65B and sorts the results, which show that FP8-E4 is more accurate than INT8 for activation quantization in most layers. The paper explains this as follows: activations are dynamic and change with every input, so a calibration set is needed to determine the quantization scale. Because calibration picks the maximum value over all batches to compute the scale, the quantized values in the batches that do not contain that maximum tend to be small, and FP8-E4 offers better precision for small values.
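This effect is easy to reproduce with a small simulation. The sketch below is a minimal illustration, not the paper's protocol: the tensor shape, distribution, outlier value, and helper names are all made up. It quantizes an activation-like tensor whose per-tensor scale is dominated by a single large outlier, and compares the mean absolute error of symmetric INT8 against a simulated FP8-E4M3.

import numpy as np

def quantize_int8(x, scale):
    # Symmetric per-tensor INT8: 256 evenly spaced levels.
    return np.clip(np.round(x / scale), -128, 127) * scale

def quantize_fp8_e4m3(x, scale):
    # Simulated FP8-E4M3 (4 exponent bits, 3 mantissa bits, max 448):
    # the step size shrinks for values close to zero.
    y = np.clip(x / scale, -448.0, 448.0)
    e = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** -9)))
    e = np.clip(e, -6, 8)            # normal/subnormal exponent range of E4M3
    step = 2.0 ** (e - 3)            # 3 mantissa bits -> 8 steps per power of two
    return np.round(y / step) * step * scale

# Activation-like tensor: mostly small values plus one large outlier,
# so the calibration maximum is far above the typical magnitude.
rng = np.random.default_rng(0)
act = rng.normal(0.0, 0.1, size=10_000)
act[0] = 20.0

amax = np.abs(act).max()             # "calibration": maximum over all batches
err_int8 = np.abs(act - quantize_int8(act, amax / 127.0)).mean()
err_fp8  = np.abs(act - quantize_fp8_e4m3(act, amax / 448.0)).mean()
print(f"mean abs error  INT8: {err_int8:.5f}   FP8-E4M3: {err_fp8:.5f}")

Because FP8-E4M3 spends its bits on a wide exponent range, the many small values keep a fine step size even though the scale is set by the outlier, which matches the behavior the paper reports for activations.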
The same paper also compares INT and FP arithmetic units at different bit widths, such as adders, multipliers, and multiply-accumulate (MAC) units. For an 8-bit MAC, the multiplier is 8 bits and the accumulator is 16 bits to prevent overflow.
For the MAC unit, which is the basic building block of matrix multiplication in DNNs, FP operations generally require more area than INT operations. However, this gap narrows as the bit width decreases; interestingly, at 8 bits the area requirements of FP8 and INT8 MAC units are almost identical. This observation suggests that INT8 and FP8 have similar hardware cost and inference performance, which is consistent with the specifications of GPUs such as the H800, where the quoted peak INT8 and FP8 Tensor Core throughputs are the same.
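As a rough illustration of the MAC configuration analyzed in the paper (a sketch only; real hardware implements this as a fixed-function datapath, and production INT8 GEMM kernels usually accumulate in 32 bits), the snippet below multiplies 8-bit operands and accumulates them into a wider 16-bit register.

import numpy as np

def mac_int8(a_vec, b_vec):
    # 8-bit multiply, 16-bit accumulate: each int8 x int8 product fits in
    # 16 bits, and the running sum is kept in a 16-bit register
    # (it would wrap around if the dot product overflowed).
    acc = np.int16(0)
    for a, b in zip(a_vec, b_vec):
        acc = np.int16(acc + np.int16(a) * np.int16(b))
    return acc

a = np.array([100, -50, 37, 12], dtype=np.int8)
b = np.array([-3, 25, 90, -7], dtype=np.int8)
print(mac_int8(a, b))                 # 1696, equal to the exact dot product here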
The paper FP8 Quantization: The Power of the Exponent from Qualcomm AI Research provides an in-depth analysis of the FP8 quantization format, covering both theoretical foundations and experimental validation. It proposes a new method for simulating FP8 quantization on FP32 hardware. The method speeds up FP8 quantization simulation, is easy to implement in common deep learning frameworks, and thus enables fast PTQ and QAT. It also exposes the parameters of the FP8 quantizer (i.e., the mantissa/exponent bits and the exponent bias), allowing them to be learned through backpropagation.
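A rough sketch of what such a simulated (fake) FP8 quantizer can look like in PyTorch is shown below. The function name and the details (NaN/Inf encodings ignored, straight-through estimator for rounding) are my own simplifications, not the paper's implementation; the point is only that the mantissa bits, exponent bits, and exponent bias appear as explicit parameters that a training loop could adjust.

import torch

def fake_quant_fp8(x, n_mantissa=3, n_exponent=4, bias=7.0):
    # Simulate an FP8 format on FP32 hardware. Mantissa bits, exponent bits,
    # and exponent bias are explicit arguments, mirroring the idea that these
    # quantizer parameters can be exposed and learned. NaN/Inf encodings are
    # ignored, so the max value is slightly optimistic.
    max_exp = 2 ** n_exponent - 1 - bias
    max_val = (2 - 2.0 ** -n_mantissa) * 2.0 ** max_exp
    x = torch.clamp(x, -max_val, max_val)
    e = torch.floor(torch.log2(x.abs().clamp_min(1e-30)))
    e = torch.clamp(e, min=1 - bias)              # subnormal boundary
    step = 2.0 ** (e - n_mantissa)                # spacing of representable values
    q = torch.round(x / step) * step
    return x + (q - x).detach()                   # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)
w_q = fake_quant_fp8(w)                           # FP8-E4M3-like fake quantization
w_q.sum().backward()                              # gradients flow back through the STE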
The paper Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models, jointly published by Shanghai Jiao Tong University, Peking University, and Microsoft Research Asia, presents a comparative analysis of INT and FP quantization. It finds that the best quantization format differs from layer to layer due to the complexity and diversity of tensor distributions, and that no single format consistently outperforms the others. Based on this, it proposes Mixture of Formats Quantization (MoFQ), which selects the optimal quantization format layer-wise and achieves W8A8 quantization with minimal accuracy loss. The method performs well in both weight-only and weight-activation quantization scenarios.
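The core of layer-wise format selection can be sketched in a few lines: quantize each layer's tensor with every candidate 8-bit format, measure the error, and keep the format with the smallest error. The helper below is illustrative only (the function names are made up, and the real MoFQ also covers the weight-activation case and more refined selection criteria).

import numpy as np

def select_format_per_layer(layer_weights, quantizers):
    # MoFQ-style layer-wise format selection (sketch):
    #   layer_weights: {layer_name: np.ndarray}
    #   quantizers:    {format_name: callable(tensor) -> dequantized tensor},
    #                  e.g. per-tensor INT8 and FP8-E4M3 quantizers like the
    #                  helpers sketched earlier.
    chosen = {}
    for name, w in layer_weights.items():
        errors = {fmt: float(np.abs(w - quant(w)).mean())
                  for fmt, quant in quantizers.items()}
        chosen[name] = min(errors, key=errors.get)   # lowest-error format wins
    return chosen

# Example: chosen = select_format_per_layer(weights, {"int8": int8_fn, "fp8_e4m3": fp8_fn})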
from vllm import LLM

model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"   # example source model
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"   # example output path

# Define the quantization config with dynamic activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
# For dynamic activation scales, there is no need for calibration examples
examples = []
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
from vllm import LLM

model = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Finally, the quantized model checkpoint can be loaded directly in vLLM and used as-is.
from vllm import LLM

model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM with an FP8 KV cache.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
# output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
# output w/o scaling factors: England, located in the southeastern part of the country. It is known