| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU(@Stanford University etc) | https://arxiv.org/pdf/2303.06865.pdf | https://github.com/FMInference/FlexGen | ⭐️ |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification(@Peking University etc) | https://arxiv.org/pdf/2305.09781.pdf | https://github.com/flexflow/FlexFlow/tree/inference | ⭐️ |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models(@Peking University etc) | https://arxiv.org/pdf/2305.05920.pdf | ⚠️ | ⭐️ |
| 2023.09 | [vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) | https://arxiv.org/pdf/2309.06180.pdf | https://github.com/vllm-project/vllm | ⭐️⭐️ |
| 2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks(@Meta AI etc) | https://arxiv.org/pdf/2309.17453.pdf | https://github.com/mit-han-lab/streaming-llm | ⭐️ |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads(@Tianle Cai etc) | https://sites.google.com/view/medusa-llm | https://github.com/FasterDecoding/Medusa | ⭐️ |
| 2023.10 | [TensorRT-LLM] NVIDIA TensorRT LLM(@NVIDIA) | https://nvidia.github.io/TensorRT-LLM/ | https://github.com/NVIDIA/TensorRT-LLM | ⭐️⭐️ |
| 2023.11 | [DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference(@Microsoft) | https://arxiv.org/pdf/2401.08671.pdf | https://github.com/microsoft/DeepSpeed | ⭐️⭐️ |
| 2023.12 | [PETALS] Distributed Inference and Fine-tuning of Large Language Models Over The Internet(@HSE University etc) | https://arxiv.org/pdf/2312.08361.pdf | https://github.com/bigscience-workshop/petals | ⭐️⭐️ |
| 2023.12 | [PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU) | https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf | https://github.com/SJTU-IPADS/PowerInfer | ⭐️ |
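
The core idea behind the vLLM/PagedAttention entry above can be sketched in a few lines: instead of reserving one contiguous KV-cache slab per sequence, the cache is carved into small fixed-size blocks mapped through a per-sequence block table, so memory grows on demand and freed blocks are reusable. This is a minimal illustrative sketch only; all class and variable names here are hypothetical, and the real vLLM implementation lives in CUDA/C++ kernels.

```python
# Toy sketch of PagedAttention-style KV-cache paging (hypothetical names,
# not the real vLLM API). Blocks act like OS memory pages for the KV cache.

BLOCK_SIZE = 4  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


class Sequence:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one is full: memory
        # grows in small steps instead of one large pre-reserved slab.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free(self) -> None:
        self.allocator.release(self.block_table)
        self.block_table = []


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(6):               # cache 6 tokens -> ceil(6/4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))      # 2
print(len(allocator.free))       # 6
seq.free()
print(len(allocator.free))       # 8 (all blocks reusable by other sequences)
```

Because many short sequences share one pool of blocks, fragmentation stays near zero, which is what lets PagedAttention batch far more concurrent requests than slab-per-sequence allocation.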