From ROC DECH (backup: ROC SouthEast Europe) to ROC UK/I (backup: ROC AsiaPacific)
Tickets:
Opened new: 92
Closed: 52
1st mail: 33
Quarantine: 20
2nd mail: 9
Unsolved: 3
Issues:
Tests for RGMA, LFC, SRM and SE are creating lots of alarms (COD teams are still learning to handle them).
Issues with certificate tests: most of the tests fail with timeouts. The sites probably have a valid certificate; possible reasons are the use of a nonstandard port for certain services, the service simply not being available, or the cert test itself timing out.
Several months ago I requested that a page detailing the most common causes of gLiteCE job failures be created, based on the experience gained by PPS. The answer was basically that we should wait until experience has been gathered in the GGUS tickets that CODs are going to open on PPS or production sites, and then produce this web page. So far there is one page in gocwiki, detailing one site-specific cause:
"Sometimes this is due to the fact that the user is not authorised on the CE."
I do not believe this is the only site-specific cause ever found for this error.
I have the feeling that these host-cert-valid tests generate a lot of alarms, and I am not sure they must be handled by CODs, especially before a certificate has expired. Some CAs already send e-mails about expiring certificates one to two months in advance, which is a sensible period.
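To illustrate the suggestion above, here is a minimal sketch of a policy in which a certificate that is only approaching expiry gets a warning, and only an expired one raises a COD alarm. The function names, the 30-day window and the policy itself are assumptions made for illustration; they are not how the host-cert-valid test currently behaves.

```python
from datetime import datetime, timedelta

# Hypothetical warning window; the CAs mentioned above warn one to two months ahead.
WARNING_DAYS = 30


def classify_host_cert(not_after, now, warning_days=WARNING_DAYS):
    """Return 'expired', 'expiring' or 'ok' for a certificate end date."""
    if not_after <= now:
        return "expired"
    if not_after - now <= timedelta(days=warning_days):
        return "expiring"
    return "ok"


def needs_cod_alarm(not_after, now):
    """Assumed policy: only an already expired certificate goes to the COD."""
    return classify_host_cert(not_after, now) == "expired"
```

With such a split, the "expiring" state could be left to the site and its CA, while the COD dashboard would only show certificates that have actually expired.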
Note that there is some lack of synchronisation in the COD dashboard, so it shows older tests that are in error while the newer tests are OK. Ordering the alarms by status in the Monitoring/Alarms page shows the last test date as 11th May in most cases, only a few times as the 12th, and never newer.
This problem appeared at the beginning of our shift, and it seems to me it reappeared on the 13th.
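The behaviour one would expect from the dashboard can be sketched as follows: for each site and test, only the newest result counts, so an old error is hidden once a newer test is OK. This is a rough illustration only; the record layout (`site`, `test`, `date`, `status`) and the function name are hypothetical and do not reflect the dashboard's real data model.

```python
from datetime import date


def current_alarms(results):
    """Keep only the newest result per (site, test) and return those in error."""
    latest = {}
    for r in results:
        key = (r["site"], r["test"])
        # Replace the stored result only if this one is strictly newer.
        if key not in latest or r["date"] > latest[key]["date"]:
            latest[key] = r
    return [r for r in latest.values() if r["status"] == "error"]
```

If the dashboard's input stops updating, as seemed to happen here, a filter like this would still show the stale errors, so the last test date needs to be checked as well.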