Operations team & Sites

Name: Operations team & Sites
Start: 2011-10-25T11:00:00+01:00
End: 2011-10-25T12:16:00+01:00
Location: EVO - GridPP Operations team meeting

Tuesday 25 Oct 2011, 11:00 → 12:16 Europe/London

Description

- This is the weekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 108203 with code: 4880.

Apologies: Raul, Raja

- 1
  
  Meetings & updates
  
  - ROD team update
    Durham and UCL have recurrent problems.
  - Nagios status
  - EGI
    From Stuart P: "Glue2: Sites not publishing Glue2 in the UK: UKI-LT2-Brunel; UKI-LT2-UCL-HEP; UKI-SCOTGRID-DURHAM; UKI-SOUTHGRID-BRIS-HEP and EFDA-JET. If running SL4 site BDII's, an upgrade should be planned, otherwise a closer look should be taken at the glue publishing paths for those sites. CREAM reliability: First impressions that a restart every 2-3 weeks is, indeed, typical - more detail to follow."
  - Tier-1 update
    Problems with Castor, or rather the database infrastructure behind Castor, over the weekend:
    - At around 4am on Saturday morning, three (out of five) nodes in one of the Oracle RACs that host the Castor databases rebooted. Some downtime for Atlas & CMS (a few hours).
    - Later on Saturday: nodes in the other Oracle "RAC" cluster crashed (and did not reboot). In the end we stopped the CMS & GEN Castor instances towards the end of Saturday.
    - Overnight Sat/Sun: another crash of a node in the first cluster. We took the remaining (Atlas, LHCb) Castor instances down.
    - Services were restored around 20:30 on Sunday.

    Summary so far: The problems were caused by instabilities in the Oracle database infrastructure behind Castor. The Castor databases are divided across two Oracle RACs, and both RACs suffered nodes crashing and, in some cases, failing to reboot. The failures of nodes to reboot were caused by corrupt areas on a disk array used to stage backups. Investigations are ongoing into the root cause and a SIR is being produced. Since Sunday we have been gradually (cautiously) opening up limits on FTS & Batch.

    On Thursday afternoon (20th) the CMS Castor instance was unavailable for an hour or so. It looks like a recurrence of the old Castor "JobManager" hang (not seen for some months).

    On Wednesday morning (19th) there was a hang of one of the Oracle RAC nodes in the database behind the LFC/FTS & 3D services. Apart from an outage of a few minutes on the LFC (during a failover), there was an outage of the FTS for a couple of hours.

    Also: We are updating the disk controller firmware on a batch of disk servers. The older version reports a lot of 'SMART' errors on disk drives, but in many cases they are spurious and mask real disk errors.
  - Security update
  - T2 issues
    Emyr's repo issue.
  - General notes
    New accounting portal: http://www4.egee.cesga.es/accounting/egee_view.php
    Checking Red/Amber tickets for NGI_UK: http://tinyurl.com/5wtnxh5
    Or go to https://ggus.eu/ws/ticket_search.php and select Support Unit: NGI_UK, Creation date: Any and Status: open states, then click Go.
- 2
  
  Experiment problems/issues
  
  Review of weekly issues by experiment/VO
  - LHCb
    Manchester mentioned at daily WLCG ops meeting: https://ggus.eu/ws/ticket_info.php?ticket=75614.
    1. RAL downtime over the last weekend.
    2. Manchester and QMUL: LHCb had run out of job slots. Both sites kindly increased the availability for LHCb, I believe.
  - CMS
    DC: " ...castor problems and CREAM instability (especially at RALPP - although we are only doing better at Imperial by restarting every 4 hours)."
    Any conclusions to share from CMS F2F last week?
  - ATLAS
  - Other
    HEPiX talks from today may be of interest: https://indico.cern.ch/conferenceTimeTable.py?confId=138424#20111025.
  - Experiment blacklisted sites
  - Experiment known events affecting job slot requirements
  - Site performance/accounting issues
  - Metrics review
- 3
  
  Site updates
  
  Brunel
  - Local issues
    Storage pool down. RAID controller to be replaced today, depending on Viglen.
  - Tickets
    We have a ticket for the Atlas migration to CVMFS. My plan is to start the migration tomorrow with the first cluster and complete it for the whole site in November.

  ...an update on plans from 4 weeks ago.

  EMI:
  - Moving one cluster from lcg-CE to EMI CREAM this week.
  - New EMI BDII going live today.
  - Over the last few weeks, deployed a staged rollout for Argus and another for glexec.
- 4
  
  Urgent topics to follow-up
- 5
  
  Actions
  
  - http://hepwww.rl.ac.uk/sysman/Nov2011/main.html
- 6
  
  AOB
