- ROD team update
-- Also note: UKI ROC is being decommissioned. dteam groups/roles should be under NGI_UK.
- Nagios status
- EGI ops
BDII Instability: Did we observe problems with BDII on April 12? (One for RoD?)
Out now: BDII Core, Gridsite, VOMS
Expected this week: BLAH, DPM, Hydra, GFAL/lcg_util, StoRM and WMS
This WMS release should be the one that resolves the outstanding configuration issues we've noticed with it.
The final list of what will be in EMI-2 should be released soon (current plan: 7th May). This will give the expected timelines for the SL6 and Debian releases. Current status: the SL5 release is looking fine; SL6 has a few sticking points but should get there. Debian gets only the client parts, UI and WN - and maybe not all of the UI's clients.
The EMI-2 release is due in two weeks. We advise against rushing to update production systems until the components have been through staged rollout - which means they will need volunteers to run them through staged rollout. It looks like the UMD versions will migrate to the EMI-2 releases transparently on SL5, so if you installed from UMD you should not need to do anything in particular.
Hydra is expected to be included in EMI-2.
UI/WN tarball: There are testing releases of the tarballs, linked off this ticket: https://ggus.eu/tech/ticket_show.php?ticket=74675
Feedback should go back into that ticket (for both UI and WN, despite what the title implies).
LFC, and the IGE SAGA adaptors.
Mostly waiting for EMI-2, at which point there will be a lot of testing to do (for both SL5 and SL6).
A call for staged rollout testers for the Debian clients - if we have anyone called Chris who might be interested... or indeed anyone whose surname is not Walker...
With the problems with GEANT networking on April 12th, Ibergrid and NGI_IT both observed site BDIIs falling over, hard.
The suspicion is that during the GEANT problems lots of clients connected and dropped, each leaving a pending connection on the BDII LDAP server. The clients then reconnected and left another pending connection behind, until the effect amounted to a DoS scenario. This was seen on both EMI and gLite 3.2 releases (i.e. it appears with network instability, not with a clean network outage).
Did we observe similar problems in the UK?
BDII Site 1.1.0 includes an update to OpenLDAP 2.4, which should prevent problems in this sort of scenario - not by directly addressing the root cause, but because it is noticeably more performant, it can handle many more clients before the pile-up starts to affect service delivery.
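For anyone wanting to check whether their site BDII suffered the same pile-up, here is a minimal diagnostic sketch (not from the meeting; it assumes the standard BDII LDAP port 2170 and a Linux /proc filesystem). A glut of CLOSE_WAIT or SYN_RECV entries would be consistent with the scenario described above:

```python
#!/usr/bin/env python
"""Rough diagnostic sketch: count TCP connections to the local BDII
port, broken down by state. Assumes port 2170 (the usual BDII port)
and Linux /proc/net/tcp*; adjust BDII_PORT for your setup."""
from collections import Counter

BDII_PORT = 2170  # standard site/top BDII LDAP port

# TCP state codes as they appear in /proc/net/tcp
STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_states(port):
    counts = Counter()
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(table) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    # local_address is hex-ip:hex-port; field 3 is the state
                    local_port = int(fields[1].split(":")[1], 16)
                    if local_port == port:
                        counts[STATES.get(fields[3], fields[3])] += 1
        except IOError:
            pass  # e.g. no IPv6 table on this host
    return counts

if __name__ == "__main__":
    for state, n in sorted(count_states(BDII_PORT).items()):
        print("%-12s %d" % (state, n))
```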
- GridPP middleware status [placeholder]
- Tier-1 update
1. On Thursday 12th April we had a series of Atlas disk servers lose network connectivity. Although not confirmed, we believe the problem is fixed by a newer kernel (and network driver), and this was rolled out to the affected Atlas disk servers (those with a particular 10Gbit network card) that afternoon and the following morning. We have just (yesterday evening) seen what looks like a similar thing on an LHCb disk server, and are planning a further rollout of the newer kernel.
2. We had a problem on one of the Atlas Castor headnodes caused by time drift. We had been checking that the ntp daemon was running, but that was not sufficient. We have now rolled out a Nagios test for the drift itself (see the sketch after this list), which has picked up a number of systems that were out by several seconds.
3. We had a problem with xrootd access to the AtlasStripDeg service class - traced to a configuration problem.
4. We found an unnecessary restriction on our 4GB batch queue - a limit that we have raised.
5. We have added two new FTS front-end systems on virtual machines. We initially backed out of this change as a number of problems were encountered (one being that sites which had not updated their CA certificates since the new UK one was released were unable to submit FTS transfers). We have since re-applied the update, i.e. the two new FTS front ends are now in the alias.
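As a flavour of what such a time-drift check might look like, here is an illustrative stdlib-Python sketch (not the Tier-1 plugin; the server name and thresholds are placeholders) of a Nagios-style plugin that measures the local clock offset with a single SNTP query:

```python
#!/usr/bin/env python
"""Minimal Nagios-style clock-drift check (illustrative sketch only).
Sends one SNTP query and compares the server's transmit timestamp with
the local clock. Exit codes follow the Nagios plugin convention."""
import socket
import struct
import sys
import time

NTP_SERVER = "ntp.example.org"   # placeholder: use your local NTP server
WARN_SECS = 1.0                  # placeholder thresholds
CRIT_SECS = 5.0
NTP_EPOCH_OFFSET = 2208988800    # seconds between 1900 (NTP) and 1970 (Unix)

def query_offset(server):
    # LI=0, VN=3, Mode=3 (client) in the first byte of a 48-byte request
    packet = b"\x1b" + 47 * b"\0"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    try:
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
    finally:
        sock.close()
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at byte 40
    secs, frac = struct.unpack("!II", data[40:48])
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2.0**32
    return server_time - time.time()

if __name__ == "__main__":
    try:
        offset = query_offset(NTP_SERVER)
    except (socket.error, struct.error) as exc:
        print("UNKNOWN: NTP query failed: %s" % exc)
        sys.exit(3)
    drift = abs(offset)
    if drift >= CRIT_SECS:
        status, code = "CRITICAL", 2
    elif drift >= WARN_SECS:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0
    print("%s: clock offset %.3fs from %s" % (status, offset, NTP_SERVER))
    sys.exit(code)
```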
- Security update [to be placed as first item]
- T2 issues
- General notes.
There was a GDB last week: https://indico.cern.ch/conferenceDisplay.py?confId=155067. A summary for the next ops meeting is being put together.
- Documentation review [placeholder this week]
Sage Matt says: "I'd appreciate it if everyone checked to see if their site has any crusty looking tickets that need a spring clean. I'll be chasing you from next week otherwise."
The new neuroscience VO nearly has a name. Nearly. The devil's in the details (as always).
This got sent to Liverpool by accident. John redirected it to the right place, but it may have slipped under the radar. The ticket is from LHCb; it sounds like CVMFS problems are causing job failures.
Biomed are complaining about negative space being advertised by the CE.
This ticket can be put to bed - the user no longer sees the problems. I'm not sure what Santanu did to fix things, though.
Looks like this old ticket can be closed too (with the appropriate saga recorded in the solution).
Has the heavy load on the WMS evened itself out?
If the WMS has started to behave, will you be able to look at enabling SNO+ soon?
https://ggus.eu/ws/ticket_info.php?ticket=80527 - CE stability
https://ggus.eu/ws/ticket_info.php?ticket=81434 - CVMFS?
There are a couple of tickets here, likely caused (or at least not helped) by the extreme transition going on at Birmingham. It might help Mark to put these On Hold if they can't be solved right away.
Is there anything anybody can do to help get your SE back up? We stand ready to assist.
There could be useful information here (if your problem is similar to Lancaster's and those at other crashing sites):
Or it could be easier to upgrade (1.8.3 should be out soon; I'm not sure if the storage group have a stance on this).
From the Solved Case pile:
The only one that jumps out at me is:
Another case where the renewal of a VO Admin's certificate under the new CA cert causes shenanigans (no other word for it). One for other UK folk to watch out for over the coming months as they renew their certs.
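For anyone wanting a quick sanity check when their renewed cert arrives, this minimal sketch (illustrative only; the certificate path is a placeholder) prints the subject and issuer DNs so you can see whether the issuer has changed to the new CA:

```python
#!/usr/bin/env python
"""Print subject and issuer DNs for a certificate (illustrative sketch).
A cert renewed under the new UK CA will show a different issuer DN,
which is what trips up existing VOMS registrations."""
import subprocess
import sys

# Placeholder path: point this at your own usercert
cert = sys.argv[1] if len(sys.argv) > 1 else "usercert.pem"
subprocess.call(["openssl", "x509", "-in", cert,
                 "-noout", "-subject", "-issuer"])
```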