Deployment team & sites

Europe/London
EVO - GridPP Deployment team & sites meeting

Jeremy Coles
Description
- This is the biweekly DTEAM & sites meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 77907 with code: 4880.
Minutes

Experiment problems/issues
=====================

LHCb
--------
  RAL and UK in general are fine.
  Bristol seems to have been in downtime for a long time?  It left downtime 5 days ago, so it is worth checking the LHCb internals on that.
  Minor issue at Glasgow, ticketed and resolved.
  On 22-23 Feb the Dirac servers at CERN will be down for maintenance, so there will be no jobs then.  All jobs should finish before then.

CMS
-------
  In general, all Ok.
  There is a bit of a historical problem with DPM, for which a kludge is in place.  In the process of removing this, a problem has appeared with the SSL libs and Frontier, which ATLAS have probably resolved already (GGUS 67491).  Should be resolved fairly soon.  Sites are still working; this is forward-looking work.
 
ATLAS
-------
  No problems.
  Brian notes that the sites still needing to clean MCDISK and provide an SRM dump are Oxford, Lancaster, RAL PPD and Durham.  Oxford should have it done in the next couple of days.

Other
-------
  Approval request for NA62, NEISS, and cernatschool.
  Euan noted some slight confusion between LFC and SE in the NEISS stuff?  Sam notes that he's on top of it, and it's being resolved.  Question on CPU efficiency if they're downloading lots of data from the Web - no jobs of that type have run yet, so no data.  Approval is primarily for LFC access; not expecting all sites to enable it if they are worried about job efficiencies.
  Should new VOs use space tokens?  Sam indicated NEISS will, and recommends that in general.
  No opposition, therefore these three VOs are approved by dteam.

  More sites requested for cernatschool: expressions of interest from Glasgow and Birmingham.
  Pheno issues taken to separate section.

Blacklisted sites
------------------------
  None

Known events
--------------------
  LHCb gap in jobs on 22-23 Feb.

Site performance issues
-----------------------------------
  None


Meetings and Updates
================

ROD updates
-------------------
   Nothing to say

EGI Ops
-------------
  SGE support from Imperial - site config is distinct; so probably best not to list them.
  Best not to change publishing until we initiate the ROC -> NGI change - therefore no change expected until that is started (expected mid Feb).
   SP to feed these back to EGI.

Tier-1 update
-------------------
  Some downtime last week for Oracle updates, and disk server updates for Castor.  Generally all quiet.

Security
------------
  An RT ticket was opened by EGI CSIRT to Durham, with no response from that site.  Can sites please respond to these tickets?  Pakiti only checks a single WN, so ticket tracking is used to find cases where worker nodes have varying configs.
  Query on possibly out of date contact on page.  For security challenges, an alternate email address will be in use - this will be included in the Heads Up message for the challenge.

GDB
-------
  Concerns over best effort support for batch systems.
  Top level BDIIs for Tier-1s, with semi-static information - seems a little muddled.  Some high availability BDIIs, with longer expiry cycles.  This would mean that a short blip on a site's BDII would not result in the site disappearing from the top BDIIs.
  UK sites not moved to glite-APEL: No sites mentioned.
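
  The effect of a longer expiry cycle on a top-level BDII can be illustrated with a small sketch. This is not how the BDII is implemented; the class name, the `max_age` parameter and the timestamps are all hypothetical, chosen only to show why a short outage of a site BDII need not remove the site from the top level.

```python
import time


class SiteCache:
    """Keep the last good record for each site until it is older than max_age seconds.

    Hypothetical illustration only; not the BDII implementation.
    """

    def __init__(self, max_age):
        self.max_age = max_age
        self._records = {}  # site name -> (timestamp of last good query, data)

    def update(self, site, data, now=None):
        """Store fresh data from a successful query of the site BDII."""
        stamp = time.time() if now is None else now
        self._records[site] = (stamp, data)

    def visible_sites(self, now=None):
        """Return the sites whose cached record has not yet expired."""
        now = time.time() if now is None else now
        return sorted(
            site
            for site, (stamp, _) in self._records.items()
            if now - stamp <= self.max_age
        )
```

  With `max_age` set to an hour, a site that last answered ten minutes ago is still listed; it only drops out once the outage lasts longer than the expiry time.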

glexec/Argus
------------------
  Chris (?) has installed argus/glexec, but not announced.
  Tests for glexec are running against all sites - link from Maarten's slide in GDB:

Pheno
=====
  Glasgow were not aware that the problem affected more than one user, in particular because other pheno users were running jobs at the same time.  It is difficult to see how they could have raised it as a bigger issue from the available data.
  The part (1) problem wasn't (directly) ICE related - it was only solved by updating both the MyProxy service and the WMS.
   Noted that the situation is best reported by a team ticket, which wasn't done - this contributed to the rather thin solutions recorded.
   Without further testing, it's not easy to ensure that the system works end to end; testing this is tricky, and doing it properly would take about 2.5 days of CPU time per site.
    We need better and earlier feedback from the VO about problems - and need to work out how to get it.


EGI trust anchors
=============
  The first announcement was 8 months ago, with a couple more after that. The lcg-CA release notes contained a reference to it, but it's suspected that not all sites read all the release notes.  It was noted that there probably should have been a policy stating that we were going to follow that change in trust anchor.  Policy by GridPP, NGI?  Not totally clear.
  With changes to OpenSSL hashing algorithm, there are now 2 different aliases pointing to the same certificate. Mingchao to forward some technical details to the TB-Support list.
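
  One way to see the two aliases on a node is to group the symbolic links in the trust anchor directory by the file they point to. The sketch below assumes a layout in which hash-named links point at the CA certificate files; the function name and the file names used in the usage note are hypothetical.

```python
import os
from collections import defaultdict


def find_aliases(cert_dir):
    """Group the symlinks in cert_dir by the file they resolve to.

    Returns only the targets that have more than one link pointing at them,
    which is what is expected when both the old-style and new-style hash
    links exist for the same certificate.
    """
    targets = defaultdict(list)
    for name in sorted(os.listdir(cert_dir)):
        path = os.path.join(cert_dir, name)
        if os.path.islink(path):
            target = os.path.basename(os.path.realpath(path))
            targets[target].append(name)
    return {target: links for target, links in targets.items() if len(links) > 1}
```

  Called on a directory where `aaaa1111.0` and `bbbb2222.0` both link to `ExampleCA.pem`, this returns a mapping from `ExampleCA.pem` to those two link names.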


VO shares
=========
  It's noted that RAL, ECDF and Oxford used to be working; Glasgow has updated recently.
  ECDF have one CE publishing - is that enough, or does it need to be every CE?  Glasgow publish from one CE, so that appears to be enough, at least in one case.
  It might be just coincidence that sometimes it picks the 'correct' CE, and sometimes not.
  Liverpool have all but one CE publishing correctly - but are listed (that one CE is in front of a separate cluster, and provides opportunistic resources only).  It is suspected that this is a false positive on the test.

  Versions of Site BDII.  The newest glite 3.2 site BDII plus a newer OpenLDAP seems to work.  The OpenLDAP version is important for avoiding crashes.  Older versions of the site BDII don't forward the shares (without manual patching).
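
  A site can check whether shares survive the trip through its site BDII by looking for the share entries in the LDIF it publishes. The sketch below assumes the shares appear as `GlueCECapability: Share=<vo>:<percent>` lines; that attribute format, the function name and the sample values are assumptions for illustration, so adjust to what the site actually publishes.

```python
def parse_vo_shares(ldif_text):
    """Extract VO shares from LDIF text.

    Assumes lines of the form 'GlueCECapability: Share=<vo>:<percent>'.
    Returns a dict mapping VO name to its integer percentage share.
    """
    shares = {}
    for line in ldif_text.splitlines():
        attr, sep, value = line.partition(":")
        if not sep or attr.strip() != "GlueCECapability":
            continue
        value = value.strip()
        if value.startswith("Share="):
            vo, _, percent = value[len("Share="):].partition(":")
            shares[vo] = int(percent)
    return shares


# Hypothetical sample of what a CE entry might contain.
sample = """\
GlueCEUniqueID: ce01.example.ac.uk:8443/cream-pbs-atlas
GlueCECapability: CPUScalingReferenceSI00=2500
GlueCECapability: Share=atlas:60
GlueCECapability: Share=lhcb:40
"""
shares = parse_vo_shares(sample)
```

  An empty result from a site BDII that should be carrying shares would match the behaviour reported above for the older versions.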

AOB
===
  EGI SA1 middleware requirements.  SP to add summary.
  Glasgow got tickets about PhysicalCPUs, and it's not clear what the problem was (67457).  Did we miss something here?  (Someone?) noted that they responded because they were reporting 0 for one of the CEs - which was thought to be the correct way to do it. Stephen Burke to speak to Flavia on this.
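
  The trade-off between the two ways of publishing CPU counts for several CEs in front of one cluster (raised in the chat log) can be shown with simple arithmetic. The sketch assumes that a consumer sums the per-CE values and that a CE which is down contributes nothing; the CE names and the total of 400 CPUs are hypothetical.

```python
def published_total(per_ce_values, ces_up):
    """Sum the CPU counts published by the CEs that are currently up."""
    return sum(count for ce, count in per_ce_values.items() if ce in ces_up)


total_cpus = 400  # hypothetical cluster size
ces = ["ce1", "ce2", "ce3", "ce4"]

# Scheme A: split the total evenly across the four CEs.
split = {ce: total_cpus // len(ces) for ce in ces}
# Scheme B: publish the full total on one CE and 0 on the others.
single = {"ce1": total_cpus, "ce2": 0, "ce3": 0, "ce4": 0}

all_up = set(ces)
ce1_down = {"ce2", "ce3", "ce4"}

split_all_up = published_total(split, all_up)
single_all_up = published_total(single, all_up)
split_ce1_down = published_total(split, ce1_down)
single_ce1_down = published_total(single, ce1_down)
```

  Both schemes give the right total when every CE is up. If `ce1` is down, the split scheme under-reports by a quarter, while the single-CE scheme reports 0, which is wrong but easier to spot as an anomaly.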

 
Chat log:
-----------

[10:57:30] Jeremy Coles joined
[10:57:38] Alessandra Forti joined
[10:57:52] John Bland joined
[10:58:41] Peter Grandi joined
[10:58:50] Daniela Bauer joined
[10:58:52] Raja Nandakumar joined
[10:58:54] Brian Davies joined
[10:59:21] Rob Fay joined
[10:59:46] Elena Korolkova joined
[11:00:12] Peter Grandi left
[11:00:58] Stuart Wakefield joined
[11:01:01] Andrew Washbrook joined
[11:01:05] Mark Mitchell joined
[11:01:07] Peter Grandi joined
[11:02:02] Chris Brew joined
[11:02:25] Duncan Rand joined
[11:02:37] Pete Gronbech joined
[11:02:48] Duncan Rand left
[11:03:31] Govind Songara joined
[11:03:35] Stephen Burke joined
[11:03:40] Duncan Rand joined
[11:03:50] Mohammad kashif joined
[11:04:26] Andrew McNab joined
[11:05:35] Rob Harper joined
[11:07:26] Winnie Lacesso joined
[11:07:55] Ewan Mac Mahon joined
[11:09:15] Stuart Wakefield https://gus.fzk.de/ws/ticket_info.php?ticket=67491
[11:09:33] Ben Waugh joined
[11:10:25] Richard Hellier joined
[11:11:05] Jeremy Coles https://svnweb.cern.ch/trac/panda/browser/panda-autopyfactory/current/libexec/runpilot3-wrapper.sh
[11:13:33] Stephen Burke The neiss VO must be almost as good as the swetest VO  
[11:14:05] Mingchao Ma joined
[11:14:35] Brian Davies update on sites still needing to clean mcdisk and to provide a dump of their SRM for ATLAS consistency checking is that Lancaster,oxford, PPD and durham are th eonly remaining sites
[11:20:36] Ewan Mac Mahon On that ^ We've just been re-arranging a couple of storagey things, but that's basically done, so now it's just a matter of getting a round to it; shouldn't take long now.
[11:23:28] Graeme Stewart joined
[11:25:23] Wahid Bhimji joined
[11:35:20] Mingchao Ma it is UKI-SCOTGRID-DURHAM, and the ticket was sent to oper.ip3@durham.ac.uk on 01 Feb, the ticket is still open!
[11:39:34] Stuart Purdie At that point we all move to ARC!
[11:39:44] Alessandra Forti indeed
[11:44:47] Mohammad kashif https://samnag023.cern.ch/nagios/
[11:45:06] Mohammad kashif nagios instance for glexec test
[11:45:11] Pete Gronbech and https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail
[11:49:51] Ewan Mac Mahon Do we possibly need a 'minor VOs' meeting once a month or something?
[11:50:18] Alessandra Forti or... we could have a minor VOs section once a month in this meeting
[11:50:46] Sam Skipsey We do have an 'other' VOs section in this meeting!
[11:51:10] Alessandra Forti but they never come and probably they don't have enough manpower to be always here
[11:51:37] Sam Skipsey That doesn't stop them turning up *sometimes*.
[11:52:11] Alessandra Forti that's not encouraging
[11:52:39] Sam Skipsey Well, how do you suggest we get small VOs to turn up? t2k is quite talkative on the email lists, but otherwise...
[11:53:48] Stephen Burke If they're invited and still don't turn up they can't complain ...
[11:54:41] Ewan Mac Mahon Even if they don't complain, if things don't end up working we get people going back to local clusters and so forth.
[11:55:07] Alessandra Forti maybe if we organise a specific sessions for them they will come
[11:55:25] Ewan Mac Mahon The trouble (of a sort) with this meeting is that there's a lot of both LHC stuff, and 'internal' stuff that might be a bit much for the minor VO people to deal with.
[11:56:00] Stephen Burke The VO part is at the start and they can leave after that
[11:56:08] Ewan Mac Mahon If there was a specific meeting we could just give them any relevant news update type things, and have an obvious place for them to bring things up.
[11:56:33] Alessandra Forti this is why I'm saying organising a specific ession for them so they are sure that if they turn up they don't get swamped
[11:56:43] Ewan Mac Mahon Either way, we should do something - are we actually inviting them to this meeting in a realistic manner?
[11:57:03] Sam Skipsey Well, I have no problem with the smaller VOs turning up here...
[11:57:10] Sam Skipsey (at least, for the VO bit)
[11:57:22] Stephen Jones joined
[11:57:23] Stephen Burke They never used to turn up to the tier-1 meeting either ...
[11:57:41] Alessandra Forti @Ewan: I cannot comment on that just tellin them to turn up is probably not enough
[11:57:47] Sam Skipsey In which case, one might question how we are expected to know about their problems...
[11:58:07] Sam Skipsey Alessandra: "turn up, or we can't know about your concerns or issues"?
[11:58:47] Alessandra Forti of course it depends how you put it....  
[12:06:26] Jeremy Coles Andrew M: Please could you expand on "Very good week: apart from some decommissioned machines at the start of the week," thanks.
[12:09:19] Andrew McNab This was about 2 weeks ago now: there were a couple of sites that had machines in the database still that were in fact decommissioned, so that produced some alarms in the GridPP Nagios, but other than that were no persistent alarms that week
[12:10:08] Stephen Jones Liverpool: both our main CEs publish Share: values.
[12:10:45] Pete Gronbech oxford is in the capacities page too
[12:11:00] Pete Gronbech http://gstat-wlcg.cern.ch/apps/capacities/sites/
[12:13:21] Wahid Bhimji where is the share on gstat
[12:13:58] Pete Gronbech select a vo such as atlas to see if you are showing some number of logical cpu's for that vo
[12:14:15] Ewan Mac Mahon We're running the newish glite 3.2 site-BDII one, but with a newer openldap package. That seems to work fine.
[12:14:22] Stephen Jones Liverpool: 3.2.9-0, as of last week. Older version (3.2.3?) did not forward Share: values.
[12:14:23] Richard Hellier We are running 3.2.9-0
[12:14:49] Ewan Mac Mahon Last time we tried the 3.2 site BDII but with the standard SL5 openldap we had problems.
[12:14:50] Richard Hellier glite-BDII_site-3.2.10-1.sl5

[12:15:10] Richard Hellier We have seen occasional lockups but certainly not every 15 mins!
[12:15:14] Winnie Lacesso That's what Bristol saw pre-openldap24. Have to use > 5.0.8 7 openldap24.
[12:15:26] Winnie Lacesso s/7/&/
[12:15:27] Ewan Mac Mahon The newer openldap does seem to be important.
[12:15:34] Ewan Mac Mahon Not just the BDII version.
[12:15:53] Ewan Mac Mahon (for the not crashing)
[12:16:09] Chris Brew glite-BDII-3.2.4-0 doesn;t
[12:16:19] Chris Brew glite-BDII_site-3.2.10-1.sl5 does
[12:16:32] Stephen Burke does what?!
[12:16:40] Sam Skipsey lockup
[12:16:42] Stephen Jones forward Share: values, I think
[12:16:44] Sam Skipsey I assume.
[12:16:48] Chris Brew yes
[12:16:48] Wahid Bhimji Pete - so for oxford
[12:16:49] Wahid Bhimji http://gstat-prod.cern.ch/gstat/site/UKI-SOUTHGRID-OX-HEP/
[12:17:00] Wahid Bhimji I don't see anything under Vo-based.
[12:17:01] Sam Skipsey ...which, Chris?
[12:17:03] Pete Gronbech left
[12:17:14] Wahid Bhimji For ecdf we do have stuff ...
[12:17:22] Wahid Bhimji ah pete gone
[12:17:23] Pete Gronbech joined
[12:17:53] Mingchao Ma https://rt.egi.eu/rt/Dashboards/1749/SA1%20Middleware%20requirements
[12:18:03] Ben Waugh left
[12:18:32] Pete Gronbech note it's not the normal gstat page but the gstat-wlcg one in my last url
[12:18:35] Elena Korolkova I can't open it
[12:19:05] Govind Songara it ask for username/password
[12:20:14] Wahid Bhimji pete sorry - I still don't see where it says atlas at
[12:20:15] Wahid Bhimji http://gstat-wlcg.cern.ch/apps/capacities/sites/
[12:20:23] Jeremy Coles If you can't look at the page do not worry. I wanted to check the situation. If not in EGI that makes sense. W'll make a summary for TB-SUPPORT.
[12:20:23] Wahid Bhimji ah now I see it
[12:20:46] Elena Korolkova thanks. Jeremy
[12:20:51] Rob Fay https://gus.fzk.de/ws/ticket_info.php?ticket=67439 -- Liverpool ticket on GlueSubClusterLogicalCPUs
[12:21:49] Daniela Bauer I split my number of CPUs by 4 (4 ces in front of the same cluster), that way gstat is happy.
[12:22:14] Stephen Burke Until one of your CEs is down ...
[12:22:40] Ewan Mac Mahon But at least you just shrink, not shrink to nothing.
[12:23:06] Stephen Burke But 0 is easier to recognise as an anomalous value
[12:23:26] Raja Nandakumar left
[12:23:27] Richard Hellier left
[12:23:27] Wahid Bhimji left
[12:23:29] Sam Skipsey left
[12:23:30] Andrew Washbrook left
[12:23:30] John Bland left
[12:23:30] Elena Korolkova left
[12:23:31] Mohammad kashif left
[12:23:32] Brian Davies left
[12:23:32] Winnie Lacesso left
[12:23:32] Stuart Wakefield left
[12:23:32] Alessandra Forti left
[12:23:32] Mingchao Ma left
[12:23:33] David Crooks left
[12:23:34] Graeme Stewart left
[12:23:34] Andrew McNab left
[12:23:34] Stephen Jones left
[12:23:34] Rob Harper left
[12:23:37] Rob Fay left
[12:23:42] Govind Songara left
[12:23:42] Stephen Burke left
[12:23:43] Chris Brew left
[12:23:45] Ewan Mac Mahon left

    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
      - ATLAS
        -- job memory limits https://svnweb.cern.ch/trac/panda/browser/panda-autopyfactory/current/libexec/runpilot3-wrapper.sh
      - Other
        -- Requests to approve: NA62; NEISS (http://www.geog.leeds.ac.uk/projects/neiss/)
        -- Requests to LFC enable: NA62; NEISS and cernatschool
        -- Looking for a couple more sites to enable cernatschool
        -- pheno (see later item)
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance issues
    • 11:20 11:35
      Meetings & updates 15m
      - ROD team status (any points to raise to sites or issues to follow up?) "Very good week: apart from some decommissioned machines at the start of the week"
      - EGI operations (Stuart)
        -- Summary of SR procedures https://wiki.egi.eu/wiki/Staged-rollout-procedures
        -- Batch system support/integration into the MW and in EMI in particular, all on Best Effort. For SGE, Imperial was listed with a question mark - should be clarified.
        -- After the lcg-CA ... distribution, there will be an announcement 2 weeks in advance of major changes in procedures, tools, repositories affecting sites. (That is: as close to exactly 2 weeks as possible, to prevent such notices getting lost in future).
        -- Proposal to retire the EGEE attributes in BDII. The suggestion was:
           GlueSiteOtherInfo: EGEE_ROC=value --> to be replaced by GlueSiteOtherInfo: EGI_NGI=value
           GlueSiteOtherInfo: GRID=EGEE --> GRID=EGI
      - Tier-1 update
      Operational security
        -- Checking results at https://pakiti.egi.eu
      Summary of GDB (Wednesday 9th Feb)
      - Agenda http://indico.cern.ch/conferenceDisplay.py?confId=106641
      - Availability calculations to include CREAM by end of March
      - APEL should have closed RGMA at the end of the year. What is our UK status?
      ARGUS
      - The current 1.2 release will work with glite 3.2 and EMI so no need to wait for the new release. 1.3 due in April
      MUPJs
      - Ops test soon (but WLCG requirement). T1s end March. T2s end June for glexec
      - Probes are already in regional nagios (configured?)
      Information system
      - Plan to have highly available top-BDIIs
      - Slight issue with introduction of new config. option concerning refresh time
      Installed capacity
      - Sites need to check what they publish is correct
      EMI
      - First EMI release due in April
      - Life time of a release is 2 years, full support for 18 months, security for 6 months more
      - Change to using EPEL
      EGI
      - Unified Middleware Distribution approach. Will test with staged rollout
      - Still concern about number of early adopters
      - Discussion focused on who will put middleware in repository (EMI/EGI)
      WLCG m/w support
      - UI/WN and VO Box not in EMI at the moment. Markus skeptical that sites will upgrade to EMI 1 as it's on SL5.
      Middleware
      - Glite 3.1 retirement
      - New glite web pages active http://glite.cern.ch/
      - No more lcg-CA releases. This is now done by EGI
      - Installation and configuration guide will be updated. https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320
      - Batch system integration all BEST EFFORT!
      SLC5.6
      - tcmalloc.so is glibc dependent and hit LHCb
      - Plan to make more test nodes available at CERN with upgrades
      CREAM 1.6.4 is released today; 1.7 due March, same as EMI release
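
The proposed retirement of the EGEE attributes, listed under EGI operations in the item above, amounts to a small rewrite of the `GlueSiteOtherInfo` values. A minimal sketch of that mapping follows; the function name is hypothetical, and the sample values (including the NGI name passed in) are placeholders rather than what any site should publish.

```python
def migrate_other_info(values, ngi):
    """Rewrite GlueSiteOtherInfo values from the EGEE form to the proposed EGI form.

    EGEE_ROC=<anything> becomes EGI_NGI=<ngi>; GRID=EGEE becomes GRID=EGI.
    All other values are passed through unchanged.
    """
    migrated = []
    for value in values:
        key, _, val = value.partition("=")
        if key == "EGEE_ROC":
            migrated.append("EGI_NGI=" + ngi)
        elif key == "GRID" and val == "EGEE":
            migrated.append("GRID=EGI")
        else:
            migrated.append(value)
    return migrated


# Hypothetical set of values for one site.
before = ["EGEE_ROC=UK/I", "GRID=EGEE", "GRID=WLCG", "TIER=2"]
after = migrate_other_info(before, "NGI_UK")
```

Here `after` keeps `GRID=WLCG` and `TIER=2` untouched and replaces only the two EGEE-specific entries.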
    • 11:35 11:45
      Lessons from pheno VO problems/support issues 10m
      -- Review of what happened for the pheno VO.
      -- Discussion of the handling of VO problems in general
      The overall summary suggestions:
      - Where possible checking following changes
      - Improved communication on local/Tier-2 ongoing technical issues
      - Closer monitoring of tickets after specified periods
      - Have support queries directed to a core list not an individual
      - Perform intermittent reviews of ticket “solutions” and training material
      - Ensuring that middleware issues encountered lead to Savannah/bug tickets
      Recommendations and observations:
      1) Earlier escalation of problems being discussed in Tier-2 technical meetings would help as would a more regular review of tickets remaining open beyond 1 week. This problem could have been resolved quicker had there been more awareness outside of the ticketed site(s).
      2) The UKI/NGI helpdesk staff need to have a list of individuals to query about tickets where there is uncertainty. There needs to be an escalation of tickets to a wider team if there is no response within 24hrs. Sites may prompt VOs to revalidate their services if an upgrade is undertaken.
      3) The support team prompts to sites/users are clearly important and we will look at whether after two prompts the ticket can be escalated to a core UK team. The “solution” text for many tickets was very unhelpful and sites will be reminded to explain the outcome. Given that so many sites were involved we need to discuss with the users whether there was a reason the matter was not escalated within deployment & operations.
      4) Reminders can be sent to the service maintainers and a request to check updates using submissions via a GridPP/NGI_UK hosted VO. The underlying problem for updating host certificates is being improved - including support for .lsc files in Quattor.
      QOS-review
    • 11:45 11:55
      EGI-trust anchors 10m
      https://wiki.egi.eu/wiki/EGI_IGTF_Release
      - The changes
      - The GridPP/WLCG position (i.e. how do we fit with EGI)
    • 11:55 12:00
      VO shares 5m
      - Check on the status of publishing CPU VO shares
      - Are there issues with site-BDII versions that stop people running YAIM here? Can all sites check their BDII version and indicate why they are not at a recent version if this is the case.
      We got another ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=67365
      Please could these sites check and comment in the ticket:
        UKI-LT2-UCL-HEP
        UKI-NORTHGRID-LIV-HEP
        UKI-SCOTGRID-DURHAM
        UKI-SCOTGRID-ECDF
        UKI-SCOTGRID-GLASGOW
        UKI-SOUTHGRID-CAM-HEP
        UKI-SOUTHGRID-OX-HEP
        UKI-SOUTHGRID-RALPP
        RAL-LCG2
      VO-shares-snapshot-100211
    • 12:00 12:05
      AOB 5m
      - Deployment planning.
      - SA1 middleware requirements (for EMI). https://rt.egi.eu/rt/Dashboards/1749/SA1%20Middleware%20requirements