Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting


Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 126540 with code 4880.

Apologies:
Minutes
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb: No major news from LHCb today. Mostly reprocessing and user jobs, with a low level of Monte Carlo simulation. Various small issues in the UK have had GGUS tickets opened and closed quite quickly by the sites. We (LHCb) are planning to change the way we distribute the conditions information to the jobs and are testing out the options needed.
      - CMS
      - ATLAS
      - Other: The GridPP PMB agreed it would be good to support a psychology/neuroscience pilot project being driven by Birmingham. Stage 1 is to set up the VO and demonstrate working jobs at Birmingham; following that, sites are asked to consider supporting the VO, which is expected to be very light (jobs undertake parameter sweeps). Requirements (a minimal Ganga job sketch is given at the end of these minutes):
        - job submission via Ganga;
        - 1 GByte of space in the software area of each site where jobs run;
        - about 1 GByte of space per worker node at runtime;
        - minimal space (less than 5 GByte, perhaps none) on the storage element at each site where jobs run;
        - memory required up to a few hundred MB;
        - compiled code will be a few MB, with perhaps a few linked scientific libraries of the same size.
    • 11:20 11:40
      Meetings & updates 20m
      - ROD team update
      - EGI ops: Many areas were covered at yesterday's EGI ops meeting. EMI update 14 should fix the critical problems for the EMI WMS. Update 15 should fix the major annoyance with wildcards in the gacl. EMI-WN tarball packages are expected around Wednesday this week; testers would be appreciated. EMI-WN testbed: do we have one / do we want one / do we have anyone that would use one (T2K / SNO+?)? OPS members: a general reminder to check that the two registered members are both active. Also: is the OPS VOMS server failing on the 2B root CA, or does it reject everyone? If the former, that will need updating when the OPS members' certificates are cycled. Sites should check that the GlueSubClusterWNTmpDir they are publishing in the BDII matches what they expect; a sample query is sketched at the end of these minutes. This is /tmp, except for RALPP, where it is /scratch. [EGI/EMI updates: see attachment]
      - Nagios status
      - Tier-1 update: We have had no significant planned interventions this week. At the moment we have none declared in the GOC DB either, with no major outages planned before LHC startup. There are minor things we need to do (e.g. update MyProxy) as well as some things to be done during a GOC DB At Risk (or "Warning"), such as updates to the backup Castor database system. More significant longer-term changes include a minor Castor update that will be needed to move its databases to Oracle 11, along with the introduction of new networking equipment. Two operational issues to report from the last week:
        - There have been problems on the network link at RAL used by data traffic to Tier-2s. This is being worked on; however, the effect of the interruptions is limited. Not only do the networking team get to the problem quickly, but the impact is on file transfers, which the FTS can retry. We also fail some Nagios SAM tests from Oxford when it happens.
        - Some of the old SAM infrastructure was decommissioned last Tuesday (e.g. part of Gridview no longer works); notably, the old programmatic interface to SAM stopped. Our Tier-1 dashboard and Nagios tests have now been modified to pick these results up using the newer interface, but for a while the Tier-1 dashboard was not showing them.
      - Security update
      - T2 issues
      - General notes: The March GDB: https://indico.cern.ch/conferenceDisplay.py?confId=155066, covering the change of chair, Vidyo usage and an experiment ops update.
      - Tickets: Some tickets (all from CMS as far as I can see) have been using the Savannah/GGUS interface. There have been a couple of ticket misfires, but largely it seems to be working. This morning Brian sent out tickets to a number of sites requesting details of those sites' plans to upgrade their "below baseline" SEs; a good few of these sites have already responded. LHCb have ticketed a few sites (Glasgow, Edinburgh) about job problems; the tickets seem well in hand. Birmingham was also ticketed (80117), but that seems to be a slightly different problem (no job slots free for LHCb).
        Site specific:
        - Bristol: https://ggus.eu/ws/ticket_info.php?ticket=80125 CMS transfer problems to Bristol.
        - RAL Tier-1: https://ggus.eu/ws/ticket_info.php?ticket=80119 SNO+ software install at RAL failed. It could be a problem with their code, something wrong with their install method, or something different at RAL that they haven't taken into account (the install worked at QMUL). It looks like they could do with some help. (This ticket is an offshoot of https://ggus.eu/ws/ticket_info.php?ticket=79428.)
        - Brunel: https://ggus.eu/ws/ticket_info.php?ticket=80146 Biomed ticket; a user is having an authorisation error. The error messages are similar to ones we saw late last year when users were using out-of-date UIs that couldn't handle the newer UK CA's format (I don't quite understand the details myself, but the problem was definitely at the user's end). This was indeed user error and the ticket has now been closed. --- Daniela
        - Durham: https://ggus.eu/ws/ticket_info.php?ticket=79880 LHCb jobs are getting the infamous "Maradona" error. Some bad workers?
        - Cambridge: https://ggus.eu/ws/ticket_info.php?ticket=79728 For some reason, after their upgrade a user's files went missing. An ATLAS file clean-up has been invoked. Is Cambridge's SE in the clear now?
        - QMUL: https://ggus.eu/ws/ticket_info.php?ticket=77959 QMUL are plagued with ATLAS deletion errors. Chris has updated to StoRM 1.8.2 but the errors continue. In the ticket he suggests reassigning it to ATLAS, and I agree with him.
        Interesting finished tickets from the last week:
        - https://ggus.eu/ws/ticket_info.php?ticket=80120 Confirmation that you shouldn't panic if you see "FILE_EXISTS" errors in your site's FTS transfers.
        - https://ggus.eu/ws/ticket_info.php?ticket=80052 and https://ggus.eu/ws/ticket_info.php?ticket=80061 These two tickets chronicle Chris' quest to have QMUL's availability/reliability stats amended; some interesting links and information for other sites wanting to do the same. The reason for the miscalculation is in https://ggus.eu/ws/ticket_info.php?ticket=79929.
      EMI-update details
    • 11:40 11:55
      EGI survey (1) 15m
      https://www.surveymonkey.com/s/SW2WK6K
    • 11:55 12:00
      Actions 5m
      https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
    • 12:00 12:01
      AOB 1m
      - Sites that have not responded to EGI survey (2): see email.
      - HEPSYSMAN in May?
      - GridPP29 is now confirmed for 26th/27th September in Oxford.
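
Appendix: Ganga job sketch (pilot VO)
A minimal, illustrative sketch of what "job submission via Ganga" for the Birmingham pilot VO might look like, assuming the standard GPI names (Job, Executable, File, LCG) available inside a Ganga session. The script name and argument below are hypothetical placeholders, not part of the actual pilot setup.

    # Run inside a Ganga session (started with the `ganga` command), where the
    # GPI names Job, Executable, File and LCG are already defined.
    j = Job(name='sweep-test')
    # 'sweep.py' and its argument are placeholders for the parameter-sweep code;
    # passing a File() object lets Ganga ship the script with the job.
    j.application = Executable(exe=File('sweep.py'), args=['--point', '1'])
    j.backend = LCG()   # submit through the grid (gLite/LCG) backend
    j.submit()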
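
Appendix: checking the published GlueSubClusterWNTmpDir
A minimal Python sketch of the BDII check mentioned under "Meetings & updates", assuming the python-ldap module is installed. The BDII hostname below is a placeholder; substitute whichever top-level or site BDII you normally query.

    import ldap  # python-ldap

    # Placeholder endpoint: replace with your usual BDII host (port 2170).
    con = ldap.initialize('ldap://your-top-level-bdii.example.org:2170')
    results = con.search_s(
        'o=grid',                        # standard Glue 1.x base DN in the BDII
        ldap.SCOPE_SUBTREE,
        '(objectClass=GlueSubCluster)',  # one entry per published subcluster
        ['GlueSubClusterName', 'GlueSubClusterWNTmpDir'],
    )
    for dn, attrs in results:
        name = attrs.get('GlueSubClusterName', [b'?'])[0]
        tmpdir = attrs.get('GlueSubClusterWNTmpDir', [b'not published'])[0]
        print(name.decode(), tmpdir.decode())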