Deployment team & sites

EVO - GridPP Deployment team & sites meeting

Jeremy Coles
- This is the biweekly DTEAM & sites meeting.
- The intention is to run the meeting in EVO: join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 64688 with code 4880.
    • 1
      Experiment problems/issues
      Review of weekly issues by experiment/VO
      - LHCb
      -- Check on transfer problems
      -- 1) RAL Tier 1. Reasonable running over the last week (since the problems of the previous weekend), although load has been somewhat lower. Investigations continue, but a lot of SRM hits appear to come from FTS. The plan is to upgrade the LHCb SRM machines to increase performance. LHCb reprocessing is due to start mid-November, so the aim is to upgrade/test before this.
      -- 2) UK Tier 2. Some problems with the shared area at Bristol and Birmingham. Issue with queue length parameters at UCL causing jobs to be killed.
      - CMS
      - ATLAS
      -- Deletion policy for ATLAS LOCALGROUPDISK being worked on
      - Other
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance issues -> Follow up on action to document procedures/priorities in case of Tier-2 disk server loss.
    • 2
      Meetings & updates
      - ROD team status (any points to raise to sites or issues to follow up?)
      -- UKI ops list not on COD circulation list at the moment.
      -- BDII freshness becomes a critical test from 1st December (see attached note). Current problems can be seen via and using in the search box.
      - Tier-1 update
      - Operational security - the recent vulnerability and response
      -- Some sites show in Pakiti. False positives?
      -- Checking results at
      - Escalated tickets
      -- 2 tickets on hold.
      -- 56316 - NGS-RAL not in BDII
      -- 58733 - RAL-PP biomed. dcache issue.
      - Discussion yesterday with those handling UK tickets. Agreed that common point of entry for NGI/UK service tickets will be GGUS. Looking at ways to flag unassigned tickets. Would like status put to "in progress" as soon as it is allocated.
      - There is a GDB tomorrow:
    • 3
      WMS situation
      "The problem I have seen is that jobs stay in READY state for a very long time, so Nagios cancels the job and sends a new one, and after two consecutive failures it sends a CRITICAL alert to the dashboard. It generates a lot of false alarms in the dashboard and ultimately it affects the availability and reliability figures. CREAM CEs are more affected, as there is an additional ICE component in the WMS which seems to cause a lot of trouble. WLCG Nagios does not have an inbuilt mechanism to check for WMS problems, so we have to keep an eye on status."
      Those running WMSes want to "get clear what is expected of the WMSs".
      Related tickets: Tier-1 Glasgow
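      The alerting behaviour described in the quoted report above can be sketched as follows. This is a minimal illustration only, not the actual WLCG Nagios probe code: the names `ProbeState`, `record_probe_result` and `CRITICAL_AFTER` are hypothetical, and the only rule taken from the report is that two consecutive test-job failures (including jobs cancelled after sitting in READY) raise a CRITICAL alert.

```python
from dataclasses import dataclass

# Assumption drawn from the report: two consecutive failures raise CRITICAL.
CRITICAL_AFTER = 2


@dataclass
class ProbeState:
    """Hypothetical per-CE alert state kept between test jobs."""
    consecutive_failures: int = 0
    status: str = "OK"


def record_probe_result(state: ProbeState, final_job_state: str, timed_out: bool) -> str:
    """Update the alert state after one test job finishes or is cancelled.

    A job cancelled because it sat in READY too long (timed_out=True) counts
    as a failure, even though the fault may lie in the WMS and not in the CE.
    """
    failed = timed_out or final_job_state != "DONE"
    if failed:
        state.consecutive_failures += 1
    else:
        state.consecutive_failures = 0

    if state.consecutive_failures >= CRITICAL_AFTER:
        state.status = "CRITICAL"
    else:
        state.status = "OK"
    return state.status


if __name__ == "__main__":
    state = ProbeState()
    # Two test jobs in a row get stuck in READY and are cancelled.
    print(record_probe_result(state, "READY", timed_out=True))
    print(record_probe_result(state, "READY", timed_out=True))
```

      In this sketch the probe has no way to tell a WMS-side delay from a genuine CE failure, which is how the false alarms described in the report would arise.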
    • 4
      WLCG Tier-2 availability for October
    • 5
      gstat - installed capacities & pledges
      - A reminder that some sites still do not tag to appear here
      - Check understanding of new resources
      - Confirm gstat reporting (in)accuracy
      Installed capacity
    • 6
      - Next HEPSYSMAN meeting Monday 22nd November in Birmingham. Please register here
      - A reminder of deployment timelines!
      -- CREAM. Testing continues by the experiments with positive results. We should anticipate a switch from the soon-to-be unsupported LCG-CE.
      -- APEL - RGMA will be switched off at the end of the year, so moving to the new gLite-MON is becoming urgent.
      -- ARGUS - The security policy suspension will be reviewed again in December/January.