Rethinking Attention with Performers – Towards a New Transformer Revolution.
Transformers took the field of machine learning by storm, marking the eclipse of LSTM architectures in Natural Language Processing and becoming the gold standard for dealing with generic sequential data. Despite this unprecedented success, the wider impact of Transformers has so far been limited by the quadratic space and time complexity of their main components – attention modules. Several solutions to this problem have been proposed, yet they all impose additional structural assumptions on the attention (such as sparsity) which do not always hold in practice. In this talk we will present Performers – a new class of Transformer architectures whose attention modules have linear space and time complexity and which provide, in particular, the first effective linear attention mechanisms fully compatible with regular dense softmax attention and requiring no structural priors on attention. We demonstrate their effectiveness on standard applications – text and image data – as well as on novel ones where Transformers are not usually applied due to their compute requirements – robotics and bioinformatics. We will also explain the new theoretical results behind the algorithms used in Performers, which are of interest in their own right, even beyond the scope of attention-based neural network architectures.
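To give a flavor of how linear-complexity attention can remain compatible with dense softmax attention, below is a minimal NumPy sketch of a positive-random-feature approximation in the spirit of the mechanism used by Performers. The function names (softmax_kernel_features, performer_attention) and parameters such as num_features are illustrative assumptions, not the authors' implementation; the actual method includes further refinements (e.g. orthogonal random features and additional numerical stabilization).

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    # Map rows of x (shape L x d) to positive random features whose inner
    # products approximate exp(q . k / sqrt(d)), the softmax attention kernel.
    d = x.shape[-1]
    x = x / d ** 0.25                                  # absorb the 1/sqrt(d) scaling
    proj = x @ projection.T                            # (L, m) random projections
    sq_norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(proj - sq_norm) / np.sqrt(projection.shape[0]) + eps

def performer_attention(Q, K, V, num_features=256, seed=0):
    # Approximate softmax attention in O(L * m * d) time and memory,
    # i.e. linear in the sequence length L.
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((num_features, d))  # Gaussian feature directions
    q_prime = softmax_kernel_features(Q, projection)      # (L, m)
    k_prime = softmax_kernel_features(K, projection)      # (L, m)
    kv = k_prime.T @ V                                    # (m, d_v), computed once
    normalizer = q_prime @ k_prime.sum(axis=0)            # row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]
```

The key point of such a sketch is that the L x L attention matrix is never materialized: queries and keys are mapped to m random features, and attention is recovered from an m x d_v summary of the keys and values, which is what brings the cost down from quadratic to linear in sequence length.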
Bio:
Krzysztof Choromanski is a research scientist at Google Brain Robotics New York and an adjunct assistant professor at Columbia University. Prior to joining Google, he completed his Ph.D. at Columbia University, working on structural graph theory (Ramsey-type results, in particular the Erdős-Hajnal Conjecture). His interests include structural & random graph theory, robotics & reinforcement learning, quasi-Monte Carlo methods for machine learning, Riemannian optimization and, more recently, attention-based architectures for sequential data with applications in robotics (memorization, lifelong learning), bioinformatics and vision.