Experiment problems/issues
=====================
LHCb
--------
RAL and UK in general are fine.
Bristol appears to have been in downtime for a long time? It left downtime 5 days ago, so it is worth checking LHCb internals on that.
Minor issue at Glasgow, ticketed and resolved.
On 22-23 Feb the DIRAC servers at CERN will be down for maintenance, so no jobs will run then. All jobs should finish before that.
CMS
-------
In general, all OK.
There is a historical problem with DPM, for which a kludge is in place. In the process of removing it, a problem has appeared with the SSL libs and Frontier, which ATLAS have probably resolved already (GGUS ticket 67491). This should be resolved fairly soon. Sites are still working; this is forward-looking work.
ATLAS
-------
No problems.
Brian notes that the sites still needing to clean MCDISK and provide an SRM dump are Oxford, Lancaster, RAL PPD and Durham. Oxford should have it done in the next couple of days.
Other
-------
Approval request for NA62, NEISS, and cernatschool.
Euan noted some slight confusion between LFC and SE in the NEISS setup. Sam notes that he's on top of it, and it's being resolved. There was a question on CPU efficiency if they're downloading lots of data from the Web: no jobs of that type have run yet, so there is no data. Approval is primarily for LFC access; not all sites are expected to enable it if they are worried about job efficiencies.
Should new VOs use space tokens? Sam indicated NEISS will, and recommends that in general.
No opposition, therefore these three VOs are approved by dteam.
More sites requested for cernatschool: expressions of interest from Glasgow and Birmingham.
Pheno issues taken to separate section.
Blacklisted sites
------------------------
None
Known events
--------------------
LHCb gap in jobs on 22-23 Feb.
Site performance issues
-----------------------------------
None
Meetings and Updates
================
ROD updates
-------------------
Nothing to say
EGI Ops
-------------
SGE support from Imperial - site config is distinct; so probably best not to list them.
Best not to change publishing until we initiate the ROC -> NGI change - therefore no change expected until that is started (expected mid Feb).
SP to feed these back to EGI.
Tier-1 update
-------------------
Some downtime last week for Oracle updates, and disk server updates for Castor. Generally all quiet.
Security
------------
An RT ticket was opened by EGI CSIRT to Durham, with no response from that site. Can sites please respond to these tickets? Pakiti only checks a single WN, so ticket tracking is used to find cases where worker nodes have varying configs.
Query on possibly out of date contact on page. For security challenges, an alternate email address will be in use - this will be included in the Heads Up message for the challenge.
GDB
-------
Concerns over best effort support for batch systems.
Top-level BDIIs for Tier-1s, with semi-static information - seems a little muddled. Some high-availability BDIIs, with longer expiry cycles. This would mean that a short blip on a site's BDII would not result in the site disappearing from the top-level BDIIs.
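The effect of a longer expiry cycle can be illustrated with a small sketch. This is a toy model in Python, not how the BDII is implemented: the `TopLevelCache` class, the `survives_blip` helper, the polling interval and the expiry values are all hypothetical and chosen only to show the idea.

```python
class TopLevelCache:
    """Toy model of a top-level index that keeps the last known site record.

    Hypothetical illustration only: this is not the BDII implementation.
    """

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self._last_seen = {}  # site name -> time of last successful refresh

    def refresh(self, site, now, reachable):
        # Record a timestamp only when the site-level service answered.
        if reachable:
            self._last_seen[site] = now

    def visible_sites(self, now):
        # A site stays listed until its last good record is older than the expiry.
        return sorted(
            site for site, seen in self._last_seen.items()
            if now - seen <= self.expiry_seconds
        )


def survives_blip(expiry_seconds, blip_seconds, poll_interval=60):
    """Return True if a site stays listed throughout an outage of blip_seconds."""
    cache = TopLevelCache(expiry_seconds)
    cache.refresh("SITE-A", now=0, reachable=True)
    t = poll_interval
    while t <= blip_seconds:
        cache.refresh("SITE-A", now=t, reachable=False)
        if "SITE-A" not in cache.visible_sites(now=t):
            return False
        t += poll_interval
    return True
```

With these made-up numbers, a ten-minute outage drops the site when the expiry is two minutes (`survives_blip(120, 600)`), but not when it is an hour (`survives_blip(3600, 600)`).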
UK sites not moved to glite-Apel: No sites mentioned.
glexec/Argus
------------------
Chris (?) has installed argus/glexec, but not announced.
Tests for glexec are running against all sites - link from Maarten's slide in GDB:
Pheno
=====
Glasgow were not aware that the problem affected more than one user, in particular because other pheno users were running jobs at the same time. It is difficult to see how they could have raised it as a bigger issue given the available data.
The part (1) problem wasn't (directly) ICE related - it was only solved by updating both the MyProxy service and the WMS.
Noted that the situation is best reported by a team ticket, which wasn't done - this contributed to the rather light solutions recorded.
Without further testing, it's not easy to ensure that the system works end to end. Testing this is tricky, and doing it properly would take about 2.5 days of CPU time per site.
We need earlier and better feedback from the VO about problems, and need to work out how to get it.
EGI trust anchors
=============
The first announcement was 8 months ago, with a couple more after that. The lcg-CA release notes contained a reference to it, but it's suspected that not all sites read all the release notes. It was noted that there probably should have been a policy stating that we were going to follow that change in trust anchor. Policy by GridPP, NGI? Not totally clear.
With the change to the OpenSSL hashing algorithm, there are now 2 different aliases pointing to the same certificate. Mingchao to forward some technical details to the TB-Support list.
VO shares
=========
It's noted that RAL, ECDF and Oxford used to be working; Glasgow has updated recently.
ECDF have one CE publishing - is that enough, or does it need to be every CE? Glasgow publish from one CE, so that appears to be enough, at least in one case.
It might be just coincidence that sometimes it picks the 'correct' CE, and sometimes not.
Liverpool have all but one CE publishing correctly - but are listed (that one CE is attached to a separate cluster, and provides opportunistic access only). It is suspected that this is a false positive on the test.
Versions of site BDII: the newest glite 3.2 site BDII plus a newer OpenLDAP seems to work. The OpenLDAP version is important for avoiding crashes. Older versions of the site BDII don't forward the shares (without manual patching).
AOB
===
EGI SA1 middleware requirements. SP to add summary.
Glasgow got tickets about PhysicalCPUs, and it's not clear what the problem was (67457). Did we miss something here? (Someone?) noted that they responded because they were reporting 0 for one of the CEs, which was thought to be the correct way to do it. Stephen Burke to speak to Flavia on this.
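The trade-off between the two publishing approaches that came up (one CE reporting the full CPU count with the others reporting 0, versus splitting the count evenly across the CEs in front of the same cluster, as mentioned in the chat log) can be sketched as below. This is an illustrative Python sketch only: the function names are hypothetical, and the assumption that a consumer simply sums the per-CE values is ours, not a statement of how gstat works.

```python
def publish_on_one(total_cpus, n_ces):
    """One CE publishes the full count; the others publish 0."""
    return [total_cpus] + [0] * (n_ces - 1)


def publish_split(total_cpus, n_ces):
    """Split the count across CEs so the per-CE values still sum to the total."""
    base, extra = divmod(total_cpus, n_ces)
    return [base + 1 if i < extra else base for i in range(n_ces)]


def summed_capacity(published, down=()):
    """Sum the values from CEs that are up (indices in `down` are skipped)."""
    return sum(v for i, v in enumerate(published) if i not in down)
```

Under that summing assumption, both schemes give the right total when every CE is up. If the first CE of four is down, the split scheme reports three quarters of the capacity, while the single-publisher scheme reports 0, which is wrong by more but easier to spot as anomalous.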
Chat log:
-----------
[10:57:30] Jeremy Coles joined
[10:57:38] Alessandra Forti joined
[10:57:52] John Bland joined
[10:58:41] Peter Grandi joined
[10:58:50] Daniela Bauer joined
[10:58:52] Raja Nandakumar joined
[10:58:54] Brian Davies joined
[10:59:21] Rob Fay joined
[10:59:46] Elena Korolkova joined
[11:00:12] Peter Grandi left
[11:00:58] Stuart Wakefield joined
[11:01:01] Andrew Washbrook joined
[11:01:05] Mark Mitchell joined
[11:01:07] Peter Grandi joined
[11:02:02] Chris Brew joined
[11:02:25] Duncan Rand joined
[11:02:37] Pete Gronbech joined
[11:02:48] Duncan Rand left
[11:03:31] Govind Songara joined
[11:03:35] Stephen Burke joined
[11:03:40] Duncan Rand joined
[11:03:50] Mohammad kashif joined
[11:04:26] Andrew McNab joined
[11:05:35] Rob Harper joined
[11:07:26] Winnie Lacesso joined
[11:07:55] Ewan Mac Mahon joined
[11:09:15] Stuart Wakefield https://gus.fzk.de/ws/ticket_info.php?ticket=67491
[11:09:33] Ben Waugh joined
[11:10:25] Richard Hellier joined
[11:11:05] Jeremy Coles https://svnweb.cern.ch/trac/panda/browser/panda-autopyfactory/current/libexec/runpilot3-wrapper.sh
[11:13:33] Stephen Burke The neiss VO must be almost as good as the swetest VO
[11:14:05] Mingchao Ma joined
[11:14:35] Brian Davies update on sites still needing to clean mcdisk and to provide a dump of their SRM for ATLAS consistency checking is that Lancaster, Oxford, PPD and Durham are the only remaining sites
[11:20:36] Ewan Mac Mahon On that ^ We've just been re-arranging a couple of storagey things, but that's basically done, so now it's just a matter of getting a round to it; shouldn't take long now.
[11:23:28] Graeme Stewart joined
[11:25:23] Wahid Bhimji joined
[11:35:20] Mingchao Ma it is UKI-SCOTGRID-DURHAM, and the ticket was sent to oper.ip3@durham.ac.uk on 01 Feb, the ticket is still open!
[11:39:34] Stuart Purdie At that point we all move to ARC!
[11:39:44] Alessandra Forti indeed
[11:44:47] Mohammad kashif https://samnag023.cern.ch/nagios/
[11:45:06] Mohammad kashif nagios instance for glexec test
[11:45:11] Pete Gronbech and https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail
[11:49:51] Ewan Mac Mahon Do we possibly need a 'minor VOs' meeting once a month or something?
[11:50:18] Alessandra Forti or... we could have a minor VOs section once a month in this meeting
[11:50:46] Sam Skipsey We do have an 'other' VOs section in this meeting!
[11:51:10] Alessandra Forti but they never come and probably they don't have enough manpower to be always here
[11:51:37] Sam Skipsey That doesn't stop them turning up *sometimes*.
[11:52:11] Alessandra Forti that's not encouraging
[11:52:39] Sam Skipsey Well, how do you suggest we get small VOs to turn up? t2k is quite talkative on the email lists, but otherwise...
[11:53:48] Stephen Burke If they're invited and still don't turn up they can't complain ...
[11:54:41] Ewan Mac Mahon Even if they don't complain, if things don't end up working we get people going back to local clusters and so forth.
[11:55:07] Alessandra Forti maybe if we organise a specific sessions for them they will come
[11:55:25] Ewan Mac Mahon The trouble (of a sort) with this meeting is that there's a lot of both LHC stuff, and 'internal' stuff that might be a bit much for the minor VO people to deal with.
[11:56:00] Stephen Burke The VO part is at the start and they can leave after that
[11:56:08] Ewan Mac Mahon If there was a specific meeting we could just give them any relevant news update type things, and have an obvious place for them to bring things up.
[11:56:33] Alessandra Forti this is why I'm saying organising a specific session for them so they are sure that if they turn up they don't get swamped
[11:56:43] Ewan Mac Mahon Either way, we should do something - are we actually inviting them to this meeting in a realistic manner?
[11:57:03] Sam Skipsey Well, I have no problem with the smaller VOs turning up here...
[11:57:10] Sam Skipsey (at least, for the VO bit)
[11:57:22] Stephen Jones joined
[11:57:23] Stephen Burke They never used to turn up to the tier-1 meeting either ...
[11:57:41] Alessandra Forti @Ewan: I cannot comment on that, just telling them to turn up is probably not enough
[11:57:47] Sam Skipsey In which case, one might question how we are expected to know about their problems...
[11:58:07] Sam Skipsey Alessandra: "turn up, or we can't know about your concerns or issues"?
[11:58:47] Alessandra Forti of course it depends how you put it....
[12:06:26] Jeremy Coles Andrew M: Please could you expand on "Very good week: apart from some decommissioned machines at the start of the week," thanks.
[12:09:19] Andrew McNab This was about 2 weeks ago now: there were a couple of sites that had machines in the database still that were in fact decommissioned, so that produced some alarms in the GridPP Nagios, but other than that were no persistent alarms that week
[12:10:08] Stephen Jones Liverpool: both our main CEs publish Share: values.
[12:10:45] Pete Gronbech oxford is in the capacities page too
[12:11:00] Pete Gronbech http://gstat-wlcg.cern.ch/apps/capacities/sites/
[12:13:21] Wahid Bhimji where is the share on gstat
[12:13:58] Pete Gronbech select a vo such as atlas to see if you are showing some number of logical cpu's for that vo
[12:14:15] Ewan Mac Mahon We're running the newish glite 3.2 site-BDII one, but with a newer openldap package. That seems to work fine.
[12:14:22] Stephen Jones Liverpool: 3.2.9-0, as of last week. Older version (3.2.3?) did not forward Share: values.
[12:14:23] Richard Hellier We are running 3.2.9-0
[12:14:49] Ewan Mac Mahon Last time we tried the 3.2 site BDII but with the standard SL5 openldap we had problems.
[12:14:50] Richard Hellier glite-BDII_site-3.2.10-1.sl5
[12:15:10] Richard Hellier We have seen occasional lockups but certainly not every 15 mins!
[12:15:14] Winnie Lacesso That's what Bristol saw pre-openldap24. Have to use > 5.0.8 7 openldap24.
[12:15:26] Winnie Lacesso s/7/&/
[12:15:27] Ewan Mac Mahon The newer openldap does seem to be important.
[12:15:34] Ewan Mac Mahon Not just the BDII version.
[12:15:53] Ewan Mac Mahon (for the not crashing)
[12:16:09] Chris Brew glite-BDII-3.2.4-0 doesn't
[12:16:19] Chris Brew glite-BDII_site-3.2.10-1.sl5 does
[12:16:32] Stephen Burke does what?!
[12:16:40] Sam Skipsey lockup
[12:16:42] Stephen Jones forward Share: values, I think
[12:16:44] Sam Skipsey I assume.
[12:16:48] Chris Brew yes
[12:16:48] Wahid Bhimji Pete - so for oxford
[12:16:49] Wahid Bhimji http://gstat-prod.cern.ch/gstat/site/UKI-SOUTHGRID-OX-HEP/
[12:17:00] Wahid Bhimji I don't see anything under Vo-based.
[12:17:01] Sam Skipsey ...which, Chris?
[12:17:03] Pete Gronbech left
[12:17:14] Wahid Bhimji For ecdf we do have stuff ...
[12:17:22] Wahid Bhimji ah pete gone
[12:17:23] Pete Gronbech joined
[12:17:53] Mingchao Ma https://rt.egi.eu/rt/Dashboards/1749/SA1%20Middleware%20requirements
[12:18:03] Ben Waugh left
[12:18:32] Pete Gronbech note it's not the normal gstat page but the gstat-wlcg one in my last url
[12:18:35] Elena Korolkova I can't open it
[12:19:05] Govind Songara it ask for username/password
[12:20:14] Wahid Bhimji pete sorry - I still don't see where it says atlas at
[12:20:15] Wahid Bhimji http://gstat-wlcg.cern.ch/apps/capacities/sites/
[12:20:23] Jeremy Coles If you can't look at the page do not worry. I wanted to check the situation. If not in EGI that makes sense. We'll make a summary for TB-SUPPORT.
[12:20:23] Wahid Bhimji ah now I see it
[12:20:46] Elena Korolkova thanks. Jeremy
[12:20:51] Rob Fay https://gus.fzk.de/ws/ticket_info.php?ticket=67439 -- Liverpool ticket on GlueSubClusterLogicalCPUs
[12:21:49] Daniela Bauer I split my number of CPUs by 4 (4 ces in front of the same cluster), that way gstat is happy.
[12:22:14] Stephen Burke Until one of your CEs is down ...
[12:22:40] Ewan Mac Mahon But at least you just shrink, not shrink to nothing.
[12:23:06] Stephen Burke But 0 is easier to recognise as an anomalous value
[12:23:26] Raja Nandakumar left
[12:23:27] Richard Hellier left
[12:23:27] Wahid Bhimji left
[12:23:29] Sam Skipsey left
[12:23:30] Andrew Washbrook left
[12:23:30] John Bland left
[12:23:30] Elena Korolkova left
[12:23:31] Mohammad kashif left
[12:23:32] Brian Davies left
[12:23:32] Winnie Lacesso left
[12:23:32] Stuart Wakefield left
[12:23:32] Alessandra Forti left
[12:23:32] Mingchao Ma left
[12:23:33] David Crooks left
[12:23:34] Graeme Stewart left
[12:23:34] Andrew McNab left
[12:23:34] Stephen Jones left
[12:23:34] Rob Harper left
[12:23:37] Rob Fay left
[12:23:42] Govind Songara left
[12:23:42] Stephen Burke left
[12:23:43] Chris Brew left
[12:23:45] Ewan Mac Mahon left