Operations team & Sites
EVO - GridPP Operations team meeting
OPS Minutes 2015-04-14
=======================
Present:
Chris Brew
Daniel Traynor
Daniela Bauer
David Crooks
Elena Korolkova
Ewan MacMahon
Federico Melaccio
Gareth Roy
Gareth Smith
Gordon Stewart
Govind Songara
Jeremy Coles
John Bland
Kashif Mohammad
Matt Doidge
Matt Williams
Oliver Smith
Paige Lacesso
Raja Nandakumar
Robert Fay
Robert Frank
Steve Jones
Tom Whyntie (Vidyo Issues)
Experiment Reports:
===========================
LHCB
====
- Nothing to report.
- Monte Carlo jobs running on the Grid, no problems.
CMS
===
- Bristol was flagged for low site readiness; however, the problem was solved a week ago. Awaiting an update to the readiness figures.
- Nothing else to report.
ATLAS
=====
- ASAP metrics for most sites are above 90%, with the goal being an ASAP greater than 80%.
- Sussex and UCL have low numbers (however, the Sussex queue is in broker-off status and should not be included).
- UCL has placed itself into downtime due to issues (lack of support because of staff leave).
- Most ATLAS sites should have been full of production work last week; although there is a reduction in work at the moment, ATLAS is planning to fill the grid when new tasks are fully validated.
- Problems were reported with the FTS3 server located at RAL due to network issues.
Other VO
========
- no comments
DIRAC
=====
- SAM results for VAC showed that:
- IC Cloud not updated since 11/4
- CERN Cloud not updated since 5/4
- Lancaster not updated since 16/2
- Lancaster reported that their VAC instance may have been turned off.
Meetings & Updates
===========================
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
General Updates
===========================
- CHEP 2015 is currently ongoing; there will be a session at GridPP34 to provide an opportunity to discuss interesting presentations from the conference.
- A number of GOCDB service types appear to be no longer used:
- OpsTool;
- dg.TargetSystemFactory;
- dg.ARC-CE;
- eu.egi.cloud.broker.proprietary.slipstream;
- globus-RLS;
- RGMA-IC;
- eu.egi.egran;
- SRM.online;
- egi.VODashboard;
- CUSTOM.egi.HTTPserver;
- egi.NetworkPortal;
- ch.cern.cvmfs.stratum.1;
- CUSTOM.pl.plgrid.Bazaar.
- None of these service types appear to be in use by GridPP; JC to feed back to EGI (a quick way for sites to double-check usage is sketched at the end of this section).
- A DIRAC meeting is taking place in May; there is currently a survey open for communities that use DIRAC to comment. If anyone hosts such a VO, please contact Jeremy for the URL to the survey.
- TW is responsible for a VO clean-up; please look at https://www.gridpp.ac.uk/wiki/VO_Cleanup_Campaign and let Jeremy know if you know of obsolete VOs.
- Q115 (Q1 2015) reports are now due; please make sure you have supplied the required information to your Tier-2 representative.
- March Availability & Reliability reports are now available.
- QMUL and Sussex have both asked for re-computations (QMUL's figures again affected by the inclusion of a test SE).
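- As referenced above, a quick way to double-check whether any of the listed GOCDB service types is still declared anywhere is to query GOCDB's public programmatic interface. A minimal sketch, assuming the standard public PI endpoint and using one type from the list purely as an illustration:

    # Ask GOCDB for all service endpoints registered with a given service type;
    # an empty <results/> reply means nothing is declared with that type.
    curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&service_type=globus-RLS'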
WLCG Operations
===========================
- nothing to report
Tier-1
===========================
- A network outage took place on Wednesday 8th April when a network upgrade caused issues. This forced the planned Castor update to be postponed; it has been rescheduled.
- All CREAM CEs at the RAL Tier-1 will be turned off on the 5th of May.
- Concern was expressed about ALICE; GS stated that ALICE is updating its monitoring so that it can take advantage of the ARC CEs. The CREAM CE end-of-life date has been agreed with ALICE.
- From the chat transcript:
Gareth Smith: (11:56 AM)
“Coming back to ALICE's use of ARC-CEs. I just confirmed my understanding: ALICE are successfully running jobs using our CEs - and have been for some months. We do not have a special configuration for them. The problem that remains is 'just' their monitoring which has only been able to monitor CREAM-CEs. (At the moment ALICE SAM is still only looking at our CREAM CEs.) ALICE are aware of, and are working to fix, this limitation. We have agreed with them the date we will stop our CREAM CEs.”
- The outage also caused issues with APEL, GOCDB and other services located at RAL.
Storage & Data Management
===========================
- nothing to report
Accounting
===========================
- APEL delays for Sheffield, RALPP and Bristol
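- (From the chat transcript below: the RALPP ARC CEs had stopped sending accounting records after their SSM sender lost its broker connection; restarting a-rex recovered them, and a Nagios log-freshness check on the send log was suggested.) A minimal sketch of such a check, assuming the standard nagios-plugins check_file_age plugin; the log path and thresholds here are illustrative only:

    # Warn if the APEL/SSM send log has not been written for a day, critical after two days;
    # point -f at wherever the site's ssmsend log actually lives.
    /usr/lib64/nagios/plugins/check_file_age -w 86400 -c 172800 -f /var/log/apel/ssmsend.log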
Documentation
===========================
- Progress being made on security documentation.
Interoperation
===========================
- nothing to report
Monitoring
===========================
- nothing to report
On-Duty
===========================
- nothing to report
Rollout
===========================
- nothing to report
Security
===========================
- nothing to report
Services
===========================
- nothing to report
Tickets
===========================
RALPP
=====
111703 - HammerCloud tests failing. Awaiting feedback from ATLAS; difficulties in catching problem jobs.
BIRMINGHAM
==========
112875 - Low availability; now climbing, waiting for it to get into the green.
GLASGOW
=======
112967 - Closed. The problem was due to an overloaded information system publishing old data.
113010 - Closed
EDINBURGH
=========
95303 - Awaiting glexec tarball
LANCASTER
=========
1005666 - Matt doesn't believe the results he's seeing. Going to re-run tests.
95299 - Awaiting glexec tarball
BRUNEL
======
112966 - Fixed by restarting Torque.
100IT
=====
112948 - Ongoing
108356 - Ticket can likely be closed, as the remaining work needed is external to the system.
TIER1
=====
108944 - Likely needs to involve the software developers after CHEP.
112713 - Files need to be deleted, likely after CHEP; awaiting feedback from AL.
109694 - SNO+ gfal copy problems at the Tier-1, potentially an oddity with CASTOR.
112977 - Hot files causing failures (1,000,000 access attempts over 18 hours).
111699 - HammerCloud tests failing; awaiting feedback from ATLAS.
112866 - Another hot file causing job failures.
112721 - ATLAS file access ticket.
UCL
===
112371 - ROD Low availability
112841 - ATLAS 0% transfer efficiency
112873 - ROD SRM put failures
95298 - Awaiting glexec tarball
112722 - ATLAS checksum timeouts
112966 - ROD job submit failures
There was a lengthy discussion on the status of UCL.
EM felt that UCL could not continue as a full Tier-2 site and that services that were not functioning should be decommissioned and the site placed into an uncertified state. Then, if more effort became apparent, the site could be re-certified either as a VAC site or with reduced services. JC pointed out that it has already been agreed with UCL that the site should become a VAC-only site and that the PMB are aware of the situation. JC also pointed out that to uncertify the site we would still need to follow GridPP procedures. SUSX was briefly discussed; EM stated that SUSX was a completely different matter as there was engaged effort there, and the main issue was its inclusion in reporting/monitoring metrics where it perhaps should not be. Returning to UCL, EM felt we should decommission the services now, rather than waiting for the VAC migration. JC pointed out that UCL is currently in downtime and would likely remain so until the issues were resolved. EM stated he would like approval from the PMB to begin work decommissioning UCL and, if not, would like a reason why not. JC stated he would make sure the issue was raised at GridPP34 if it had not been taken care of by that point.
TRIUMF
======
Slow connection from TRIUMF to RAL was briefly discussed, JC checked to see if a ticket had been raised.
Tools
===========================
- nothing to report
VOs
===========================
- nothing to report
Discussion
============================
- See the discussion on UCL recorded under Tickets above.
Actions in Progress:
=============================
- no update to the actions
AOB
===========================
- no AOB
Chat Log
============================
Jeremy Coles: (14/04/2015 11:02)
Gareth R is taking minutes today.
Ewan Mac Mahon: (11:05 AM)
The ASAP stats aren't entirely applicable to the way we're trying to run Sussex - they made it look a LOT more broken than it really is.
Daniela Bauer: (11:07 AM)
Can we not close down UCL as by now it's a drain to resources ?
Ewan Mac Mahon: (11:07 AM)
We really should.
And we should have this conversation and make a decision today,
Federico Melaccio: (11:07 AM)
I agree
Ewan Mac Mahon: (11:08 AM)
It's an operational issue.
Let's come back to this later in the meeting.
None of those look familiar or useful to me, but the cvmfs one might be important to someone - presumably it's been created fairly recently?
Paige Winslowe Lacesso: (11:27 AM)
Thot this morning APEL said Bristol was all ok
No, I'm wrong but it's not red error text.
Tom Whyntie: (11:34 AM)
Hi - sorry I've been having problems connecting and for some reason I'm getting a low bandwidth warning (which I haven't had before). Anyway - just to update from UCLan - they're running more test jobs and getting their heads around the CVMFS and Storage Element concepts in order to rewrite their code to work on the grid. All in progress. I'm in the process of writing more examples to help with this using CERN@school code - see, for example, https://github.com/CERNatschool/running-allpix
Jeremy Coles: (11:34 AM)
Thanks Tom.
Tom Whyntie: (11:34 AM)
I can't hear anything else really so I'm going to leave - any questions email me. Cheers, Tom
Chris Brew: (11:38 AM)
Hmmm, both our ArcCEs have:
2015-03-11 15:06:36,826 - stomp.py - ERROR - Lost connection
2015-03-11 15:06:36,826 - ssm2 - INFO - Disconnected from broker.
as the last entries in their log files within a second of each other then nothing else.
Does anyone know what needs restarting?
(to get ssmsend running again)
Gareth Douglas Roy: (11:41 AM)
a-rex
service a-rex restart
sorry :)
Chris Brew: (11:47 AM)
thanks, just come to that conclusion from google, and now seeing stuff in the log file again. Might be worth us putting a log file freshness nagios check on that log file.
David Crooks: (11:47 AM)
Yeah, we were thinking the same thing
Ewan Mac Mahon: (11:48 AM)
You can agree with e too :-)
As I say, I don't want this to be seen as a BAD thing, or an anti-UCL thing; a bit like just removing the LHCb support from JET rather than fixing it, I think this should be an amicable separation as far as possible, but it's a separation that needs to happen.
Gareth Smith: (11:56 AM)
Coming back to ALICE's use of ARC-CEs. I just confirmed my understanding: ALICE are successfully running jobs using our CEs - and have been for some months. We do not have a special configuration for them. The problem that remains is 'just' their monitoring which has only been able to monitor CREAM-CEs. (At the moment ALICE SAM is still only looking at our CREAM CEs.) ALICE are aware of, and are working to fix, this limitation. We have agreed with them the date we will stop our CREAM CEs.