Operations team & Sites

EVO - GridPP Operations team meeting

- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 78425 with code: 4880.

Apologies: Mark M
    • 11:00 11:20
      Meetings & updates 20m
      - ROD team update
      - Nagios status
      - Tier-1 update
        Note that RAL was closed on both Monday and Tuesday (29/30 August) of last week, so there was no Tier-1 representative at the last DTEAM meeting. During the long weekend services ran as normal. There were some intermittent SAM test failures (on the ATLAS SRM and on the non-CREAM CE, CE06).
        Over the weekend (Sunday 4th Sep) there were load issues on the Castor ATLAS instance (MCTape service class). The ATLAS FTS channels to RAL were reduced (in the end down to 25% of nominal values). These were raised back to 50% of nominal values on Monday, and to 100% this morning.
        There was a failure of the RAL Site Access Router that broke network connectivity into RAL from 01:10 to 08:10 on the morning of Monday 5th September. The callout mechanisms that should have notified someone of this failure did not work, resulting in the long site outage. The problem was resolved when staff returned to work on Monday. This also made the GOCDB unavailable for the same time window.
        We have an At Risk tomorrow on the LFC, FTS & 3D services while regular Oracle updates are applied. These are done in a rolling manner across the nodes so should not result in any downtime.
        We have seen intermittent errors for the SAM tests on our one non-CREAM CE (lcgce06) for the last week or so. Cause not yet understood.
      - Security update
      -- T2 issues
         For how long have sites been using hyperthreading?
      -- General notes
         We are reviewing GOCDB roles ahead of the NGI_UK move. Please check your site entries and report any needed updates to Jeremy by Wednesday this week.
         BDII crashes - openLDAP versions.
      - Tickets
        Direct link: http://tinyurl.com/3jjnvca
        If not working, indirect link: https://ggus.eu/ws/ticket_search.php (select support unit 'ROC_UK/Ireland' and Creation Date 'Any')
    • 11:20 11:40
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
        1) Machine development so no new data. Recorded luminosity at 688.6 pb-1. HLT rate established at 2.5-3 kHz.
        2) Mainly MC simulation and user jobs; 19.2% of LHCb jobs run in the UK over the last week.
        3) Maximum job limit at RAL T1 increased from 1500 to 2000 for LHCb.
        4) Old Viglen07 disk servers being replaced at RAL Tier 1.
        5) Old shared software area being retired at RAL Tier 1 now that we are running CVMFS smoothly.
        6) A few T2 problems: pilots aborted at RAL-PPD, EFDA-JET and QMUL; a shared area problem at Manchester; and rogue worker nodes at Glasgow and Manchester (both solved).
      - CMS
      - ATLAS
      - Other
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance/accounting issues
      - Metrics review
      Atlas Report
    • 11:40 11:50
      Middleware at sites 10m
      - In preparation for discussion at GridPP27
      - Need to know what every site is running and what your plans for upgrading are.
      - All sites are requested to fill out their site middleware status and plans here: https://www.gridpp.ac.uk/wiki/Middleware_transition
      - The WLCG baseline is here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      - Please add to the comments section at the end and comment on:
        * Any confusions that you have about the migration to EMI/UMD
        * Any concerns that you have about the migration to EMI/UMD
        * Anything else you would like raised in the discussion at GridPP27
      The panel will include:
        Jamie Shiers (WLCG)
        Michel Drescher (EGI) - Activity Manager for WP5/SA2; oversees the provisioning of EGI's software infrastructure; provides and collects roadmaps to and from technology providers; compiles and prioritises technical requirements
        Alberto Di Meglio (EMI) - Inc. "Lead the EMI project through a successful execution of its objectives, ensuring consistency of the overall resources used and the work performed and control the progress of the work so that the results of the project adhere to the grant agreement".
    • 11:50 11:55
      Site publishing 5m
      - EGEE -> EGI - NGI_UK
    • 11:55 12:00
      Actions 5m
      - http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
    • 12:00 12:01
      AOB 1m
      - ATLAS and glexec