Sideways: Depth-Parallel Training of Video Models
I will present our recent research on Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized: the forward activations must be stored until the backward pass is executed, which prevents inter-layer (depth) parallelization. But is this required for smooth, redundant input streams such as videos? Here, we explore an alternative: we overwrite network activations whenever new ones, i.e., those computed from new frames, become available. This more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading, in theory, to noisier weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also exhibit better generalization than standard synchronized backpropagation, providing an unexpected regularization effect.
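To make the mechanism concrete, below is a minimal, hypothetical NumPy sketch of the idea: each layer keeps a single input buffer that is overwritten as new frames arrive, and forward activations and backward gradients each advance one layer per clock tick, so a gradient can end up paired with an activation from a newer frame than the one that produced it. The names (`SidewaysLinear`, `train_stream`) and the toy dense-layer, squared-error setup are illustrative simplifications, not the actual implementation for convolutional video models.

```python
import numpy as np

rng = np.random.default_rng(0)

class SidewaysLinear:
    """Toy dense layer with a single input buffer that is overwritten
    whenever a new frame arrives (no per-frame activation storage)."""

    def __init__(self, n_in, n_out, lr=1e-3):
        self.W = rng.normal(0.0, 0.1, (n_out, n_in))
        self.lr = lr
        self.x = None  # most recent input activation only

    def forward(self, x):
        self.x = x  # overwrite: older activations are lost
        return np.maximum(self.W @ x, 0.0)

    def backward(self, grad_out):
        # The gradient may stem from an older frame, but it is paired with
        # the *latest* buffered input -- the Sideways approximation.
        pre = self.W @ self.x
        grad_pre = grad_out * (pre > 0.0)
        grad_in = self.W.T @ grad_pre
        self.W -= self.lr * np.outer(grad_pre, self.x)
        return grad_in


def train_stream(layers, frames, targets):
    """Depth-parallel simulation: per clock tick, activations advance one
    layer up and gradients advance one layer down, without waiting for a
    synchronized forward/backward pair."""
    L = len(layers)
    fwd = [None] * L  # fwd[i]: output of layer i from the previous tick
    bwd = [None] * L  # bwd[i]: gradient w.r.t. layer i's output
    for x, y in zip(frames, targets):
        # Forward wavefront: each layer consumes its neighbour's last output.
        new_fwd = [layers[0].forward(x)]
        for i in range(1, L):
            new_fwd.append(layers[i].forward(fwd[i - 1])
                           if fwd[i - 1] is not None else None)
        # A loss gradient enters at the top once the pipeline is full; for a
        # smooth stream, pairing it with the current target is the whole bet.
        new_bwd = [None] * L
        if new_fwd[-1] is not None:
            new_bwd[-1] = new_fwd[-1] - y  # grad of 0.5 * ||out - y||^2
        # Backward wavefront: each gradient hops one layer down, using
        # whatever activation that layer holds *now*.
        for i in range(L - 1, 0, -1):
            if bwd[i] is not None:
                new_bwd[i - 1] = layers[i].backward(bwd[i])
        if bwd[0] is not None:
            layers[0].backward(bwd[0])
        fwd, bwd = new_fwd, new_bwd


# Toy usage on a slowly drifting "video" stream (hypothetical data).
layers = [SidewaysLinear(16, 32), SidewaysLinear(32, 8)]
x0 = rng.normal(size=16)
frames = [x0 + 0.01 * t for t in range(200)]  # smooth, redundant stream
targets = [np.ones(8)] * len(frames)
train_stream(layers, frames, targets)
```

Note the design consequence: because each layer stores only its latest activation, memory stays constant in the stream length, and since no layer waits for a synchronized forward/backward pair, the layers could in principle run concurrently on separate devices, which is the depth parallelism in the title.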
Bio
My research focus lies at the intersection of computer vision, natural language understanding, and reinforcement learning. Currently, I am interested in
1) using deep learning to learn semantics from multimodal sensory inputs,
2) grounding language in vision and actions,
3) transferring policies between training and test environments,
4) symbolic reasoning with Deep Reinforcement Learning,
5) temporal abstraction,
6) scalable learning.
My main work laid the foundations of visual question answering. I have also worked on intuitive physics, zero-shot learning, text-to-image retrieval, and image classification.
Before DeepMind, I was a Ph.D. student in computer vision in the Multimodal Computing group at the Max Planck Institute for Informatics and Saarland University. I graduated summa cum laude and was awarded the Eduard-Martin-Preis for an outstanding doctoral dissertation. Research that I led or contributed to has been featured in Bloomberg Business, Wikipedia, New Scientist, and MIT Technology Review.