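KV-Cache significantly reduces the computation needed for K and V, but it requires GPU memory to store the cached values. Each Decoder Block needs $2 \cdot B \cdot L \cdot H \cdot D \cdot P$ bytes to store K and V, where B is the batch size, L is the sequence length, H is the number of heads, D is the size of each head, and P is the number of bytes needed per element in the KV data format (for example, fp16 needs 2 bytes). The factor of 2 accounts for storing both K and V. If N is the number of Blocks, the total storage needed for the model's KV cache is $2 \cdot N \cdot B \cdot L \cdot H \cdot D \cdot P$ bytes.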
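To make the size formula concrete, here is a minimal C++ sketch that evaluates it. The struct and function names (`KvCacheShape`, `kv_bytes_per_block`, `kv_bytes_total`) are hypothetical helpers introduced only for illustration, and the example configuration in the note below is an assumed one, not taken from any particular model.

```cpp
#include <cstdint>

// Hypothetical helper type: field names mirror the symbols used in the text.
struct KvCacheShape {
    std::uint64_t batch_size;      // B
    std::uint64_t seq_len;         // L
    std::uint64_t num_heads;       // H
    std::uint64_t head_dim;        // D
    std::uint64_t bytes_per_elem;  // P, e.g. 2 for fp16
    std::uint64_t num_blocks;      // N
};

// Bytes needed by one decoder block: 2 * B * L * H * D * P.
// The leading 2 accounts for storing both K and V.
constexpr std::uint64_t kv_bytes_per_block(const KvCacheShape& s) {
    return 2 * s.batch_size * s.seq_len * s.num_heads * s.head_dim *
           s.bytes_per_elem;
}

// Bytes needed by the whole model: N times the per-block size.
constexpr std::uint64_t kv_bytes_total(const KvCacheShape& s) {
    return s.num_blocks * kv_bytes_per_block(s);
}
```

For an assumed configuration of B = 1, L = 4096, H = 32, D = 128, P = 2 (fp16) and N = 32, `kv_bytes_per_block` gives 67108864 bytes (64 MiB) and `kv_bytes_total` gives 2147483648 bytes (2 GiB). Because the size grows linearly with both B and L, the cache can quickly dominate GPU memory usage.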
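CUDA 10.2 introduced the virtual memory management API, a set of fine-grained APIs that separates virtual address allocation from physical memory allocation, so that the two can be combined freely.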
CUresult cuMemAddressFree ( CUdeviceptr ptr, size_t size )
    Free an address range reservation.

CUresult cuMemAddressReserve ( CUdeviceptr* ptr, size_t size, size_t alignment, CUdeviceptr addr, unsigned long long flags )
    Allocate an address range reservation.

CUresult cuMemCreate ( CUmemGenericAllocationHandle* handle, size_t size, const CUmemAllocationProp* prop, unsigned long long flags )
    Create a CUDA memory handle representing a memory allocation of a given size described by the given properties.

CUresult cuMemExportToShareableHandle ( void* shareableHandle, CUmemGenericAllocationHandle handle, CUmemAllocationHandleType handleType, unsigned long long flags )
    Exports an allocation to a requested shareable handle type.

CUresult cuMemGetAccess ( unsigned long long* flags, const CUmemLocation* location, CUdeviceptr ptr )
    Get the access flags set for the given location and ptr.

CUresult cuMemGetAllocationGranularity ( size_t* granularity, const CUmemAllocationProp* prop, CUmemAllocationGranularity_flags option )
    Calculates either the minimal or recommended granularity.

CUresult cuMemGetAllocationPropertiesFromHandle ( CUmemAllocationProp* prop, CUmemGenericAllocationHandle handle )
    Retrieve the contents of the property structure defining properties for this handle.

CUresult cuMemImportFromShareableHandle ( CUmemGenericAllocationHandle* handle, void* osHandle, CUmemAllocationHandleType shHandleType )
    Imports an allocation from a requested shareable handle type.

CUresult cuMemMap ( CUdeviceptr ptr, size_t size, size_t offset, CUmemGenericAllocationHandle handle, unsigned long long flags )
    Maps an allocation handle to a reserved virtual address range.

CUresult cuMemMapArrayAsync ( CUarrayMapInfo* mapInfoList, unsigned int count, CUstream hStream )
    Maps or unmaps subregions of sparse CUDA arrays and sparse CUDA mipmapped arrays.

CUresult cuMemRelease ( CUmemGenericAllocationHandle handle )
    Release a memory handle representing a memory allocation which was previously allocated through cuMemCreate.

CUresult cuMemRetainAllocationHandle ( CUmemGenericAllocationHandle* handle, void* addr )
    Given an address addr, returns the allocation handle of the backing memory allocation.

CUresult cuMemSetAccess ( CUdeviceptr ptr, size_t size, const CUmemAccessDesc* desc, size_t count )
    Set the access flags for each location specified in desc for the given virtual address range.

CUresult cuMemUnmap ( CUdeviceptr ptr, size_t size )
    Unmap the backing memory of a given address range.