CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values guide pruning and also serve as trainable parameters for efficient fine-tuning, allowing the model to recover its pre-pruning performance. After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70% of the Q-K head dimension in GPT-2 XL results in a perplexity comparable to that of pruning just 8% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1x over LLaMA-2-7B.
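
The core idea for a single Q-K pair can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `clover_decompose_qk` and `merge_singular_values`, the weight shapes, and the choice to fold the singular values into the query side are assumptions made for illustration. The product of one head's query and key projections is a matrix of rank at most the head dimension, so its SVD yields orthogonal directions whose singular values indicate which directions can be pruned.

```python
import numpy as np


def clover_decompose_qk(w_q, w_k, keep):
    """Decompose one head's Q-K pair and keep the top `keep` directions.

    w_q, w_k: arrays of shape (d_model, d_head) (assumed layout).
    Returns (u, s, v) with shapes (d_model, keep), (keep,), (d_model, keep).
    """
    # The attention logits depend only on the product W_Q W_K^T,
    # a (d_model, d_model) matrix of rank at most d_head.
    w_qk = w_q @ w_k.T
    u, s, vt = np.linalg.svd(w_qk, full_matrices=False)
    # Singular values come back sorted in descending order, so
    # truncating keeps the most important directions.
    return u[:, :keep], s[:keep], vt[:keep, :].T


def merge_singular_values(u, s, v):
    """Fold the (possibly fine-tuned) singular values back into the
    query factor, so no extra parameters remain after merging."""
    return u * s, v


rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # toy sizes, hypothetical
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

# Prune the head dimension from 4 to 2.
u, s, v = clover_decompose_qk(w_q, w_k, keep=2)
w_q_pruned, w_k_pruned = merge_singular_values(u, s, v)

rel_err = (np.linalg.norm(w_q @ w_k.T - w_q_pruned @ w_k_pruned.T)
           / np.linalg.norm(w_q @ w_k.T))
print("pruned shapes:", w_q_pruned.shape, w_k_pruned.shape)
print("relative error of the Q-K product:", rel_err)
```

In this sketch the vector `s` plays the role of the trainable parameters during fine-tuning, while `u` and `v` stay fixed. The same procedure would apply to a V-O pair by decomposing the product of the value and output projections. A practical implementation would likely avoid forming the full `d_model x d_model` product and would need to account for positional encodings applied between the query and key projections, which this toy example ignores.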