LLload: Simplifying Real-Time Job Monitoring for HPC Users
Author(s)
Byun, Chansup; Mullen, Julia; Reuther, Albert Iwersen; Arcand, William; Bergeron, William; Bestor, David; Burrill, Daniel; Gadepally, Vijay; Houle, Michael; Hubbell, Matthew; Jananthan, Hayden; Jones, Michael; Michaleas, Peter; Morales, Guillermo; Prout, Andrew; Rosa, Antonio; Yee, Charles; Kepner, Jeremy; Milechin, Lauren
Download: 3626203.3670565.pdf (508.8Kb)
Publisher with Creative Commons License
Creative Commons Attribution
Terms of use
Abstract
One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a useful view of an application's performance but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but pose a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is built from standard HPC tools and provides an easy way for a researcher to track the resource usage of active jobs. We explain how the tool was designed and implemented, and provide insight into how it is used to help new researchers develop their performance monitoring skills and to guide researchers in their resource requests.
Description
PEARC ’24, July 21–25, 2024, Providence, RI, USA
Date issued
2024-07-17
Department
Lincoln Laboratory; Massachusetts Institute of Technology. Office of Research Computing and Data
Publisher
ACM|Practice and Experience in Advanced Research Computing
Citation
Byun, Chansup, Mullen, Julia, Reuther, Albert Iwersen, Arcand, William, Bergeron, William et al. 2024. "LLload: Simplifying Real-Time Job Monitoring for HPC Users."
Version: Author's final manuscript
ISBN
979-8-4007-0419-2