SWT2_CPB:
New Storage Deployment
Improvements to the migration scripts have been completed, allowing us to resume migration in the production cluster.
We have been monitoring closely and holding discussions about the resumed migration. So far, no data has been lost, and the first migration of one old storage unit has been completed. We are now discussing steps for speeding up the process for the next MD3460s to be migrated.
The main migration script is now more robust, safer, and more informative. We will continue to watch it closely and make improvements as needed. We also developed ideas for major future improvements for the next time we perform data migrations.
We found two broken files that are zero bytes in size on the source storage. One file has been restored by DDM (after being declared bad), and the second is on scratchdisk, is very old, and will be removed.
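For illustration, zero-byte files like these can be located with a POSIX `find` sweep over the storage path. This is a minimal sketch using a throwaway demo directory; on the real system the argument would be the source storage mount point (hypothetical here):

```shell
# Minimal sketch: locate zero-byte (potentially broken) files under a path.
demo=$(mktemp -d)
: > "$demo/broken.root"                 # simulate a zero-byte broken file
printf 'payload\n' > "$demo/good.root"  # a healthy file with content
# -type f -size 0 matches regular files of exactly zero bytes
find "$demo" -type f -size 0
rm -rf "$demo"
```

Only the broken file is printed; a scan like this can be cross-checked against the DDM catalog before declaring files bad.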
GGUS-Ticket-ID: #683657: Varnish
We continued to coordinate with Ilija on the remaining Frontier accesses to our squid, which need to be stopped.
We checked the squid monitoring and found that the remaining 5% of accesses are due to CVMFS.
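The kind of breakdown described above can also be spot-checked directly from the squid access log. This is a hedged sketch, not our actual monitoring: it assumes the native squid access.log format (URL in field 7) and that CVMFS requests are recognizable by a `/cvmfs/` path component; the hostnames and log lines below are invented for the demo:

```shell
# Sketch: estimate what fraction of squid requests are CVMFS traffic.
log=$(mktemp)
cat > "$log" <<'EOF'
1693400000.123    45 10.0.0.1 TCP_HIT/200 512 GET http://stratum1.example.org/cvmfs/atlas.cern.ch/.cvmfspublished - HIER_NONE/- text/plain
1693400001.456    80 10.0.0.2 TCP_MISS/200 2048 GET http://frontier.example.org/Frontier/type=frontier_request - HIER_DIRECT/backend text/xml
1693400002.789    30 10.0.0.1 TCP_HIT/200 512 GET http://stratum1.example.org/cvmfs/atlas.cern.ch/data/xyz - HIER_NONE/- application/octet-stream
EOF
# Field 7 is the request URL in the native squid log format.
awk '{ total++; if ($7 ~ /\/cvmfs\//) cvmfs++ }
     END { printf "cvmfs %d/%d (%.0f%%)\n", cvmfs, total, 100*cvmfs/total }' "$log"
rm -f "$log"
```

Running the same awk over a day of real logs gives the CVMFS share of total squid accesses.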
As requested by Ilija, we updated the Frontier Varnish version to address the malformed-XML bug discovered previously. The update changed how quickly Varnish evicts old objects, resolving the malformed-XML issue.
Since we found no additional nodes experiencing routing problems to our site, the ticket has now been marked resolved.
Brief Reduction in Capacity
The chilled water plant experienced power issues on 8/30 at 12:55 p.m. due to severe weather (thunderstorms), causing the data center to become very warm. To help control the temperature, we temporarily drained 20% of our WNs, consisting of the oldest models, and monitored closely. We opened a ticket with the chilled water plant and contacted them for updates; the issue was resolved on 8/31, and we returned the drained WNs to service that evening.
An unscheduled warning downtime entry was created in CRIC since we were at risk of experiencing an outage.
We also experienced some unexpected issues with the high-temperature alerting and are investigating them. Unfortunately, this delayed our awareness of the problem.
Failed Jobs
Over 600 jobs failed over the past week because they hit the maxtime parameter set for jobs in CRIC. This was adjusted to 49 hours, but we continued to see jobs hitting the limit and receiving a SIGTERM. This appears to be a central issue, as it affects multiple sites.
OU: