LongLLaMA – Focused Transformer: Contrastive Training for Context Scaling
Large language models can incorporate new information in a contextual manner. However, the full potential of this approach is often constrained by the limited effective context length. That is, as the number of input documents grows, the proportion of relevant to irrelevant tokens decreases, leading the model to focus on the irrelevant parts of the input. In our work, we identify a significant challenge, dubbed the distraction issue, where attention keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT). FoT employs a training process inspired by contrastive learning to enhance the structure of the (key, value) space and scale the effective context length. FoT requires only small changes to the attention mechanism and data loading. We have successfully used FoT to fine-tune pre-existing, large-scale models to lengthen their effective context. In particular, we have created LongLLaMAs by fine-tuning the 3B OpenLLaMA and 7B Code Llama models.
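One way to picture the attention-side change is as attention over an extended (key, value) space: each query attends jointly to the local context and to pairs drawn from an external memory, and during training that memory deliberately mixes in pairs from unrelated documents, so the model must learn keys that stay distinguishable. The sketch below is purely illustrative (a plain single-head PyTorch function with a made-up name, `focused_attention`), not the actual LongLLaMA implementation.

```python
import torch
import torch.nn.functional as F

def focused_attention(q, k_local, v_local, k_mem, v_mem):
    """Single-head attention over the local context plus external memory.

    Illustrative sketch: k_mem/v_mem stand for (key, value) pairs pulled
    from memory; in FoT-style training they mix pairs from the current
    document with pairs from unrelated documents ("negatives"), which
    pushes keys with different semantics apart in the (key, value) space.
    Shapes (unbatched, for clarity): q [t, d], k_*/v_* [s, d].
    """
    k = torch.cat([k_local, k_mem], dim=0)       # extend the key space with memory
    v = torch.cat([v_local, v_mem], dim=0)
    scores = q @ k.T / k.shape[-1] ** 0.5        # scaled dot-product scores
    weights = F.softmax(scores, dim=-1)          # one softmax over local + memory
    return weights @ v                           # weighted sum of local and memory values
```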
Bio
Konrad Staniszewski is a Ph.D. student at IDEAS NCBR and the Doctoral School of Exact and Natural Sciences of the University of Warsaw. His research focuses on finding efficient ways to extend the context of Large Language Models. His interests include machine learning, natural language processing, and algorithmics. He received his master's degree from the Faculty of Mathematics, Informatics, and Mechanics at the University of Warsaw.