Practical Low Rank Decomposition for Large Language Models
Modern Large Language Models contain billions of parameters, which makes them hard to deploy: they require substantial memory and compute resources.
Many techniques for reducing the number of parameters have been proposed, but they often significantly degrade model quality or apply only to a narrow set of architectures.
We present a low-rank compression method that is easy to apply to any transformer network without changes to the model code. The method improves on existing approaches to low-rank compression by using an adaptive search to choose the optimal rank and by interleaving decomposition with few-shot fine-tuning, which keeps the quality loss small.
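The basic building block of such methods is replacing a dense weight matrix with a product of two thin factors obtained from a truncated SVD. The sketch below illustrates only this core idea on a toy matrix; the function name and the fixed rank are ours for illustration, and this is not the adaptive rank search or interleaved fine-tuning described above.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by A @ B with A (m x rank), B (rank x n).

    A dense layer with m*n parameters is replaced by two layers with
    rank*(m + n) parameters, which is a saving whenever
    rank < m*n / (m + n).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# Toy example: a 256 x 256 weight matrix that is exactly rank 32.
W = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 256))
A, B = low_rank_factorize(W, rank=32)

print((A.size + B.size) / W.size)  # 0.25: four times fewer parameters
print(np.allclose(W, A @ B))       # True: rank-32 structure recovered exactly
```

In practice the weight matrices of a trained transformer are only approximately low-rank, so truncation introduces an error; this is why choosing the rank per layer and recovering quality with fine-tuning, as in the method above, matter.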
For the Phi-2 and Llama-2 7B models we produced models that are 30% and 27% smaller, respectively, while preserving 90% of the performance on sample reasoning tasks. At the same time, the throughput of the compressed Llama-2 model increased by over 25% on an RTX 3090 GPU. The whole compression procedure, including the final short recovery fine-tuning, takes less than 20 hours on a single GPU.
Bio
PhD in Mathematics from the University of Warsaw (Probability Theory). Since then, working on AI at TCL Research Europe, specializing in model optimization and compression.