最近,来自清华大学、无问芯穹和上海交通大学的研究团队发表了《MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression》,提出通过混合不同稀疏度的注意力头,使用 25% 的注意力稠密度,就可以记忆几乎 100% 的上下文。
在大语言模型的自回归推理过程中,分为两个阶段:预填充和解码。在预填充阶段,模型会处理整个输入序列,以生成初始的响应令牌。随后进入解码阶段,模型利用新产生的令牌以及之前缓存的 K 和 V 矩阵,逐步生成后续令牌,直至完成整个序列的生成。虽然这种迭代方法效果显著,但随着 KV-Cache 的不断扩展,它也带来了内存和计算资源的需求增加。
同时,模型响应和人类编写的监督之间存在显著的不对齐。例如,对于同一个问题,人类可能会回答 'Blue',而模型可能会回答 'The blue color'。使用人类的答案进行监督,注意力影响是基于预测 'Blue' 的概率转移量化的,这与最终目标背道而驰,即难以保持原始模型预测 'The' 的关键注意力。
[1] Chen, Shouyuan, et al. "Extending Context Window of Large Language Models via Positional Interpolation." ArXiv, 2023, abs/2306.15595, https://api.semanticscholar.org/CorpusID:259262376.
[2] Tworkowski, Szymon, et al. "Focused Transformer: Contrastive Training for Context Scaling." ArXiv, 2023, abs/2307.03170, https://api.semanticscholar.org/CorpusID:259360592.
[3] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems, vol. 30, 2017.
[4] Xiao, Guangxuan, et al. "Efficient Streaming Language Models with Attention Sinks." The Twelfth International Conference on Learning Representations, 2024.
[5] Han, Chi, et al. "Lm-infinite: Simple on-the-fly length generalization for large language models." arXiv preprint arXiv:2308.16137, 2023.
[6] Zaheer, Manzil, et al. "Big bird: Transformers for longer sequences." Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 17283-17297.
[7] Li, Dacheng, et al. "How Long Can Open-Source LLMs Truly Promise on Context Length?" lmsys.org, June 2023, https://lmsys.org/blog/2023-06-29-longchat.
[8] Fu, Yao, et al. "Data Engineering for Scaling Language Models to 128K Context." ArXiv, 2024, abs/2402.10171, https://api.semanticscholar.org/CorpusID:267682361.
[9] Zhang, Zhenyu (Allen), et al. "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." ArXiv, 2023, abs/2306.14048, https://api.semanticscholar.org/CorpusID:259263947.
[10] Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems, 2022.
[11] Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles, 2023, https://api.semanticscholar.org/CorpusID:261697361.
[12] Xiao, Chaojun et al. “InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory.” (2024)