- Please refer to the slides with HPC service updates.
- Nils gave a short introduction and described the HPC service context. Applications that can run on a single host with 1-48 cores should run under the HTCondor batch service. The HPC SLURM cluster, the focus of this meeting, is intended for MPI jobs spanning multiple nodes.
- This was followed by a short explanation of the CephFS HPC storage back-end by Dan.
- Pablo then outlined recent and upcoming changes to the SLURM HPC batch service. The default storage back end, "/hpcscratch", will move to the CephFS cluster currently called "/bescratch" in early May. (The date of the intervention will be announced on the IT Service Status Board.) Pablo described the possibility of using the Intel tool suite to profile MPI applications. (This works with the recommended MVAPICH2 as well as with Intel MPI.) These Intel tools can be useful for users who develop their own MPI applications.
- The SLURM partitions (queues) will be reviewed, and a maximum run-time of 1 week is proposed. As some applications do not have checkpointing and a longer run-time would be desirable, a possible compromise would be to keep a run-time of 1 week for the inf-long partition and still allow 2-3 weeks on batch-long.
- Note that for HTCondor, a run-time beyond the 1 week offered by the NextWeek job flavour can be requested by setting the job run time (+MaxRunTime = {number of seconds}) in the Condor job submission file.
- Q: How to copy files to EOS from the job script? A: Following the deployment of AUKS, Kerberos credentials are available to the job, so this can be achieved with the eos cp command. Please refer to the EOS FAQ for information about EOS.
- Q: For the profiling: how to see what a CPU is doing when one MPI rank takes longer? A: By using CPU-level profiling tools. How to do this for Python-level code (to see which Python functions take more time in a parallel environment) is not clear, however. This kind of CPU profiling requires some debug information and output from the application; it does not come out of the box.
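
On the Python-level profiling question above, one possible starting point is the standard-library `cProfile` module, run separately in each rank so that the slow rank's report can be compared with the others. The sketch below is only an illustration under stated assumptions: it assumes the rank can be read from the `SLURM_PROCID` environment variable that Slurm sets for tasks started with `srun`, and `work` and `profile_rank` are hypothetical names, not part of any service tooling.

```python
import cProfile
import io
import os
import pstats


def work(n):
    """Hypothetical stand-in for the per-rank computation."""
    return sum(i * i for i in range(n))


def profile_rank(func, *args, top=5):
    """Run func under cProfile; return (result, report text) for this rank."""
    # Assumption: SLURM_PROCID identifies the task (rank) when launched
    # with srun; fall back to 0 when running outside a Slurm job.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, f"--- rank {rank} ---\n{buffer.getvalue()}"


if __name__ == "__main__":
    _, report = profile_rank(work, 100_000)
    print(report)
```

Each rank prints its own report, so the per-function cumulative times of the slow rank can be compared side by side with those of a fast one. This only covers time spent in Python functions; it does not replace the CPU-level tools mentioned above.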
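
For the HTCondor run-time note above, the value of +MaxRunTime is given in seconds, which is easy to get wrong by hand. The following sketch converts a run time in days to seconds and builds a minimal submit-file text. The helper names and the executable name `run_job.sh` are hypothetical, and whether a given value is accepted depends on the pool configuration, which is not covered here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_runtime_seconds(days):
    """Convert a run time in days to the seconds expected by +MaxRunTime."""
    return int(days * SECONDS_PER_DAY)


def submit_description(executable, days):
    """Build a minimal submit-file text; the executable is a placeholder."""
    lines = [
        f"executable = {executable}",
        f"+MaxRunTime = {max_runtime_seconds(days)}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two weeks, i.e. beyond the 1 week of the NextWeek job flavour.
    print(submit_description("run_job.sh", 14))
```

For a two-week job this yields `+MaxRunTime = 1209600` in the generated submit text.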