How ML Helps Manage Google's Cloud Infrastructure
When submitting a job to a cloud, a user specifies limits on the job's resource usage. However, humans are notoriously bad at such predictions, especially for intangible resources such as CPU cores or gigabytes of memory. Under-prediction has potentially disastrous consequences, particularly for user-facing jobs: a job exceeding its limits might be throttled or killed, resulting in delayed or dropped end-user requests. Thus, human operators tend to err on the side of caution and over-allocate. Summed over the whole infrastructure, such widespread over-allocation and the resulting low hardware utilization translate into megawatts of wasted electricity and millions of dollars spent over-expanding the infrastructure.
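To make the cost of over-allocation concrete, here is a toy back-of-the-envelope sketch in Python; the helper function and the numbers are purely illustrative, not Google data.

    def slack_fraction(limit, usage_samples):
        # Fraction of the reserved capacity that sits idle: the gap
        # between the configured limit and the average actual usage.
        avg = sum(usage_samples) / len(usage_samples)
        return max(0.0, (limit - avg) / limit)

    # A task limited to 4 CPU cores that averages 1 core of real usage
    # wastes 75% of its reservation; summed across a fleet, this slack
    # is where the megawatts and the millions go.
    print(slack_fraction(4.0, [0.8, 1.0, 1.2, 1.0]))  # prints 0.75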
In my talk I will describe Autopilot, a production automation tool that Google uses for its internal cloud. Autopilot relies on ML to automatically set resource limits, such as CPU and memory limits, for individual tasks. The context, a low-level production system on which many higher-level services depend, translates into ambitious reliability and efficiency requirements that are challenging for an ML application.
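For intuition on how limits can be derived from data rather than guessed, the sketch below recommends a limit as an exponentially weighted percentile of recent usage. This is a minimal illustration of a window-based recommender under assumptions of my own (the function name, half-life, percentile, and safety margin are all invented), not Autopilot's actual algorithm.

    import numpy as np

    def recommend_limit(usage_samples, percentile=95.0,
                        half_life=12.0, safety_margin=1.1):
        # Weight recent samples more heavily: the newest sample has
        # weight 1, a sample half_life periods older has weight 0.5.
        usage = np.asarray(usage_samples, dtype=float)
        ages = np.arange(len(usage) - 1, -1, -1)
        weights = 0.5 ** (ages / half_life)
        # Weighted percentile: sort the samples, accumulate weights,
        # and take the smallest usage value that covers `percentile`
        # percent of the total weight.
        order = np.argsort(usage)
        cum = np.cumsum(weights[order])
        idx = np.searchsorted(cum, (percentile / 100.0) * cum[-1])
        return usage[order][min(idx, len(usage) - 1)] * safety_margin

    # 24 hourly CPU-usage samples (in cores): the recommendation tracks
    # recent peaks plus a margin instead of a human's worst-case guess.
    samples = [0.8, 0.9, 1.1, 1.0, 0.7, 0.9, 1.2, 1.0] * 3
    print(round(recommend_limit(samples), 2))  # ~1.32 cores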
Bio
Krzysztof Rzadca is a visiting researcher at Google Warsaw and an associate professor in the Institute of Informatics, University of Warsaw. He received his PhD in computer science in 2008 from the Institut National Polytechnique de Grenoble (INPG), France, as a French government fellow. Between 2008 and 2010, he worked as a research fellow at Nanyang Technological University (NTU), Singapore. He was awarded grants from the Polish National Science Centre and the Foundation for Polish Science, as well as a faculty research award from Google. His research focuses on resource management and scheduling in large-scale distributed systems.