- ROD team update
- EGI ops
Several items to report from yesterday's EGI ops meeting.
EMI update 14 should fix the critical problems with the EMI WMS. Update 15 should fix the major annoyance with wildcards in the gacl.
EMI-WN tarball packages are expected around Wednesday this week. Testers would be appreciated.
EMI-WN testbed: do we have one / do we want one / do we have anyone who would use one (T2K / SnoPlus?)?
OPS members - a general reminder to check that the two registered members are both active. Also: is the OPS VOMS server failing on the 2B root CA, or does it reject everyone? If the former, that will need updating when the OPS members' certs are cycled.
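A quick sanity check (a minimal sketch, not a prescribed procedure): an OPS member can confirm their registration is still honoured by requesting an OPS proxy with the standard VOMS client tools. If members holding certs from different root CAs each try this, it should also tell us whether the failure is specific to the 2B root CA or affects everyone.

    voms-proxy-init --voms ops      # request a proxy carrying the ops VO attributes
    voms-proxy-info --all           # confirm the ops attributes are present in the proxy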
Sites should check that the GlueSubClusterWNTmpDir that they are publishing in the BDII matches what they expect. This is /tmp, except for RALPP, where it's /scratch.
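One way to check (a sketch; the BDII host below is a placeholder, use your site BDII or a top-level BDII) is a direct LDAP query for the published attribute:

    ldapsearch -x -LLL -H ldap://<your-bdii-host>:2170 -b o=grid \
        '(objectClass=GlueSubCluster)' GlueSubClusterWNTmpDir GlueChunkKey

GlueChunkKey is included so the tmp dir value can be matched back to the right subcluster.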
[EGI/EMI updates see attachment]
- Nagios status
- Tier-1 update
We have had no significant planned interventions this week. At the moment we have none declared in the GOC DB either, with no major outages planned before LHC startup. There are both minor things we need to do (e.g. an update to MyProxy) and some things that will be done during a GOC DB At Risk (or "Warning"), such as updates to the backup Castor database system. More significant longer-term changes include a minor Castor update that will be needed to move its databases to Oracle 11, along with the introduction of new networking equipment.
Two operational issues to report this last week.
- There have been problems on the network link at RAL used for data traffic to Tier-2s. This is being worked on. However, the effect of the interruptions is limited: not only does the networking team get to the problem quickly, but the impact falls on file transfers, which the FTS can retry. We do also fail some Nagios SAM tests from Oxford when it happens.
- Some of the old SAM infrastructure was decommissioned last Tuesday (e.g. part of Gridview no longer works). Notably, the old programmatic interface to SAM stopped. Our Tier-1 dashboard and Nagios tests have now been modified to pick the results up using the newer interface, but for a while the Tier-1 dashboard was not showing the test results.
- Security update
- T2 issues
- General notes.
The March GDB: https://indico.cern.ch/conferenceDisplay.py?confId=155066. Covering change of chair, Vidyo usage and experiment ops update.
- Tickets
Some tickets (all from CMS as far as I can see) have been using the Savannah/GGUS interface. There have been a couple of ticket misfires, but largely it seems to be working.
This morning Brian sent out tickets to a number of sites requesting details of those sites' plans to upgrade their "below baseline" SEs. A good few of these sites have already responded.
LHCb has ticketed a few sites (Glasgow, Edinburgh) about job problems; the tickets seem well in hand. Birmingham was also ticketed (80117), but that seems to be a slightly different problem (no job slots free for LHCb).
Site specific:
Bristol:
https://ggus.eu/ws/ticket_info.php?ticket=80125
CMS transfer problems to Bristol.
RAL Tier-1
https://ggus.eu/ws/ticket_info.php?ticket=80119
SNO+ software install at RAL failed. It could be a problem with their code, something wrong with their install method, or something different at RAL that they haven't taken into account (the install worked at QMUL). It looks like they could do with some help. (This ticket is an offshoot of https://ggus.eu/ws/ticket_info.php?ticket=79428).
Brunel
https://ggus.eu/ws/ticket_info.php?ticket=80146
Biomed ticket: a user is getting an authorisation error. The error messages are similar to ones we saw late last year when users were on out-of-date UIs that couldn't handle the newer UK CA's format (I don't quite understand the details myself, but the problem was definitely at the user's end).
This was indeed user error and the ticket has now been closed. --- Daniela
Durham
https://ggus.eu/ws/ticket_info.php?ticket=79880
LHCb jobs are getting the infamous "Maradona" error. Some bad worker nodes?
Cambridge
https://ggus.eu/ws/ticket_info.php?ticket=79728
For some reason, after their upgrade a user's files went missing. ATLAS file clean-up has been invoked. Is Cambridge's SE in the clear now?
QMUL
https://ggus.eu/ws/ticket_info.php?ticket=77959
QMUL are plagued with ATLAS deletion errors. Chris has updated to StoRM 1.8.2 but the errors continue. In the ticket he suggests reassigning it to ATLAS, and I agree with him.
Interesting Finished Tickets from the last week:
https://ggus.eu/ws/ticket_info.php?ticket=80120
Confirmation that you shouldn't panic if you see "FILE_EXISTS" errors in your site's FTS transfers.
https://ggus.eu/ws/ticket_info.php?ticket=80052
https://ggus.eu/ws/ticket_info.php?ticket=80061
These two tickets chronicle Chris's quest to have QMUL's availability/reliability stats amended. They contain some interesting links and information for other sites wanting to do the same. The reason for the miscalculation is in https://ggus.eu/ws/ticket_info.php?ticket=79929