Titan continues production using custom backfill and allocation queues, with 15.2M events processed in the last two weeks. Harvester deployment is advancing, aiming for production with mini-pilots by the end of February. Singularity container testing continues with the current custom workflows. Some discussion will be needed on how to integrate containers into the Harvester workflow in a unified way.
NERSC continues production with standard grid pilots on Edison, with 5.7M events. Harvester is currently being tested on the Cori queues with the mini-pilot, which is why Cori is no longer in production. As with Titan, the goal is to have Harvester in production by the end of February, here using Shifter containers.
ALCF processed 9.5M simulated events in the last two weeks. Harvester is being used with mini-pilots here. We are testing Yoda with jumbo jobs, but are still ironing out the details. One aspect of jumbo jobs that we did not anticipate is that all 150k input files must reach the source RSE before Harvester will begin launching jobs. We have been using Rucio at Theta lately, which is slow because it transfers one file at a time. We are now swapping in the Globus Online plugins to make this transfer much faster. Hopefully this will be in place by next week and we can get some idea of the scaling on Theta using Harvester + Yoda (Event Service) + Jumbo Jobs. We are also going to begin using Singularity containers on Theta, now that Singularity has been installed on all nodes.
The AGLT2_SL7 queue has been working fine for the past 2 weeks.
Yesterday we added AGLT2_MCORE_SL7, which is online but is not being brokered any jobs. Investigating.
Working on moving to LCMAPS mapping for dCache, which will in turn allow us to turn off our GUMS servers. We expect to do that later this afternoon, as the initial change-over seems to be working.
We have now ordered hardware from Dell: C6420 sleds plus the switch infrastructure needed to support them. Delivery is expected near the end of February.
Site is now performing well and full of jobs
Various software updates
Stampede2 integration via CONNECT
Two unrelated incidents took Illinois offline
Illinois will have the monthly PM on Wednesday February 20
Ongoing issues:
1. Brian Lin is helping us track down HTCondor/Slurm/SGE issues on the Harvard and BU sides, respectively.
2. Working on dropping our use of edg-mkgridmap.pl and on migrating from BeStMan to GridFTP.
3. Ongoing GPFS operations: repairing some bad LUNs, file system maintenance, migration of system pool to warrantied equipment.
4. Preparations for first big NESE deployment.
5. HU squid issue just came up this morning. Dan is working on it.
6. Re-enabled LHCONE peering, but there were immediate problems. The networking team is working with MANLAN to investigate and fix.
- Mostly smooth running
- Some storage issues at OSCER; Dell is investigating the storage server that keeps crashing.
- Still seeing 'Auth failed' stage-in errors at OSCER. These are most likely not site related, since they also happened at Lucille; according to Wei, the error appears because the xrdcp command does not have a VOMS proxy.
- Lucille is still working on reconfiguring their storage and has HA issues.
- Lucille A/R numbers for January are incorrect: the site was in a scheduled downtime for most of the month, so the R number should be higher. Opening a ticket.
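
For reference, the reasoning behind the Lucille A/R remark can be sketched as follows. This is a minimal illustration, assuming the usual WLCG-style definitions in which availability counts scheduled downtime against the site while reliability excludes it from the denominator. The function names and the hour counts are hypothetical and are not taken from the actual January report.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known period during which the site was up.

    Assumed definition: up / (total - unknown).
    """
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours, unknown_hours=0.0):
    """Like availability, but scheduled downtime is removed from the denominator.

    Assumed definition: up / (total - scheduled_down - unknown).
    """
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("site was scheduled down for the whole period")
    return up_hours / expected


# Hypothetical numbers for a 31-day month (744 hours): the site is in a
# scheduled downtime for 500 hours and up for 220 of the remaining 244.
JANUARY_HOURS = 31 * 24
a = availability(up_hours=220, total_hours=JANUARY_HOURS)
r = reliability(up_hours=220, total_hours=JANUARY_HOURS, scheduled_down_hours=500)
print(f"A = {a:.3f}, R = {r:.3f}")
```

Under these assumptions, a site that spends most of the month in a declared downtime should show a low A but a comparatively high R. A low R would suggest that the downtime was not registered as scheduled in the accounting.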