- ROD team update
Durham and UCL have recurrent problems.
- Nagios status
- EGI
From Stuart P:
"Glue2: Sites not publishing Glue2 in the UK: UKI-LT2-Brunel; UKI-LT2-UCL-HEP; UKI-SCOTGRID-DURHAM; UKI-SOUTHGRID-BRIS-HEP and EFDA-JET. If running SL4 site BDIIs, an upgrade should be planned; otherwise a closer look should be taken at the glue publishing paths for those sites.
CREAM reliability: First impressions are that a restart every 2-3 weeks is, indeed, typical - more detail to follow."
- Tier-1 update
Problems with Castor, or rather database infrastructure behind Castor, over the weekend:
- At around 4am on Saturday morning, three (out of five) nodes in one of the Oracle RACs that host the Castor databases rebooted. This caused a few hours of downtime for Atlas & CMS.
- Later on Saturday: Nodes in the other Oracle RAC crashed (and did not reboot). We stopped the CMS & GEN Castor instances towards the end of Saturday.
- Overnight Sat/Sun: Another node in the first RAC crashed. We took the remaining (Atlas, LHCb) Castor instances down.
- Services were restored around 20:30 on Sunday.
Summary so far: The problems were caused by instabilities in the Oracle database infrastructure behind Castor. The Castor databases are divided across two Oracle RACs, and both RACs suffered nodes crashing and, in some cases, failing to reboot. The failures to reboot were caused by corrupt areas on a disk array used to stage backups. Investigations into the root cause are ongoing and a SIR is being produced.
Since Sunday we have been gradually (cautiously) opening up limits on FTS & Batch.
On Thursday afternoon (20th) the CMS Castor instance was unavailable for an hour or so. It looks like a recurrence of the old Castor "JobManager" hang (not seen for some months).
On Wednesday morning (19th) there was a hang of one of the Oracle RAC nodes in the database behind the LFC/FTS & 3D services. Apart from an outage of a few minutes on the LFC (during a failover), there was an outage of the FTS for a couple of hours.
Also:
We are updating the disk controller firmware on a batch of disk servers. The older version reports a lot of 'SMART' errors on disk drives, but in many cases these are spurious and mask real disk errors.
- Security update
-- T2 issues
Emyr's repo issue.
-- General notes.
New accounting portal: http://www4.egee.cesga.es/accounting/egee_view.php
Checking Red/Amber tickets for NGI_UK: http://tinyurl.com/5wtnxh5
Or go to https://ggus.eu/ws/ticket_search.php and select Support Unit: NGI_UK, Creation date: Any, and Status: open states, then click Go.