- Added batch queue status monitoring to the portal
- Shows the number of running jobs, used cores, and pending and held jobs
- Lately, memory has been exhausted before cores. Prometheus monitoring (not currently available in the portal) indicates that in most cases actual memory usage is much lower than the requested memory. Some user engagement is needed, and we will also improve the monitoring to help users better understand the status
- Hardware issue
- Node c010 had a fan issue; reseating the fan cage and cables solved the problem
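
For the memory observation above, one possible way to identify which jobs to raise with users is to compare peak memory used against memory requested per job. The sketch below is illustrative only: the `JobMemory` record, the field names, and the 50% threshold are all assumptions, and the sample values are made up rather than taken from the cluster. In practice the numbers would come from the scheduler accounting or the Prometheus data.

```python
from dataclasses import dataclass


@dataclass
class JobMemory:
    """Hypothetical per-job record; field names are assumptions."""
    job_id: str
    requested_gb: float
    peak_used_gb: float


def memory_efficiency(job: JobMemory) -> float:
    """Fraction of the requested memory that the job actually used."""
    if job.requested_gb <= 0:
        raise ValueError(f"job {job.job_id}: requested memory must be positive")
    return job.peak_used_gb / job.requested_gb


def over_requesting(jobs: list[JobMemory], threshold: float = 0.5) -> list[str]:
    """Return IDs of jobs that used less than `threshold` of their request."""
    return [j.job_id for j in jobs if memory_efficiency(j) < threshold]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        JobMemory("job-a", requested_gb=64, peak_used_gb=8),
        JobMemory("job-b", requested_gb=16, peak_used_gb=12),
    ]
    for job_id in over_requesting(sample):
        print(f"{job_id}: requested far more memory than it used")
```

A fixed ratio threshold is the simplest choice; small jobs may deserve a looser one, since a few GB of headroom is a large fraction of a small request.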