LLM in MLPerf: Accelerating LLM training at scale
LLM pre-training is a costly and lengthy process. Depending on the model size, it can take months on thousands of GPUs, at a cost equivalent to millions of dollars. Any performance improvement therefore not only lowers the cost, but also enables faster model turnaround for optimization and research. At NVIDIA, we apply state-of-the-art software and hardware solutions to achieve superior LLM training performance. In this presentation, I will describe the most important performance improvements we have deployed in NeMo, our LLM training vehicle, and show how they impact training on 10k-GPU clusters, as demonstrated in our MLPerf v3.1 submission.
Bio
Michal Marcinkiewicz is a Senior Deep Learning Engineer and a Team Leader at NVIDIA. He received his MSc from the University of Warsaw and his PhD from the University of Montpellier for research on topological phase transitions. After obtaining his PhD, he switched fields to deep learning and now drives the development and optimization of GPU-accelerated software.