Abstract

Jobstats is a free and open-source job monitoring platform designed for CPU and GPU clusters. It has been adopted by over one hundred institutions worldwide. The platform periodically measures and stores up to tens of metrics for each job. The compute nodes and filesystems are also monitored. Monitoring data is combined with workload manager data to produce job efficiency reports. Each report provides metadata about the job as well as CPU/GPU utilization, CPU/GPU memory usage, and job-specific notes to guide the user. The Jobstats platform includes an Open OnDemand helper app that can be used to generate a URL to a Grafana dashboard showing the various metrics versus time. We will present ways that an institution can leverage the Jobstats data to get the most out of its GPUs: auto-cancelling jobs with idle GPUs, automatically emailing users whose jobs have low GPU efficiency, calculating the optimal amount of GPU fractionalization (e.g., MIG), and analyzing detailed metrics to identify jobs with high GPU utilization that are in fact underperforming. Lastly, we will present a preview of a promising GPU-sharing approach.

Here is the public livestream link. Staff members can look for a Google Calendar invitation for the talk. 

Please reach out to Sam Scalice (sscalice@ucar.edu) with any questions you may have.