Operations team & Sites
EVO - GridPP Operations team meeting
OPS Minutes 2015-04-14
=======================
Present:
Chris Brew
Daniel Traynor
Daniela Bauer
David Crooks
Elena Korolkova
Ewan MacMahon
Federico Melaccio
Gareth Roy
Gareth Smith
Gordon Stewart
Govind Songara
Jeremy Coles
John Bland
Kashif Mohammad
Matt Doidge
Matt Williams
Oliver Smith
Paige Lacesso
Raja Nandakumar
Robert Fay
Robert Frank
Steve Jones
Tom Whyntie (Vidyo Issues)
Experiment Reports:
===========================
LHCB
====
- Nothing to report.
- Monte Carlo jobs running on the Grid, no problems.
CMS
===
- Bristol was flagged for low site readiness; however, the problem was solved a week ago. Awaiting an update to the readiness figures.
- Nothing else to report.
ATLAS
=====
- ASAP metrics for most sites are above 90%, with the goal being an ASAP greater than 80%.
- Sussex and UCL have low numbers (however, the Sussex queue is in broker-off status and should not be included).
- UCL has placed itself into downtime due to issues (lack of support because of staff leave).
- Most ATLAS sites should have been full of production work last week; although there is a reduction in work at the moment, ATLAS is planning to fill the grid when new tasks are fully validated.
- Problems were reported with the FTS3 server located at RAL due to network issues.
Other VO
========
- no comments
DIRAC
=====
- SAM results for VAC showed that:
- IC Cloud not updated since 11/4
- CERN Cloud not updated since 5/4
- Lancaster not updated since 16/2
- Lancaster reported that their VAC instance may have been turned off.
Meetings & Updates
===========================
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
General Updates
===========================
- CHEP 2015 is currently ongoing; there will be a session at GridPP34 to provide an opportunity to discuss interesting presentations from the conference.
- A number of GOCDB service types appear to be no longer used:
- OpsTool;
- dg.TargetSystemFactory;
- dg.ARC-CE;
- eu.egi.cloud.broker.proprietary.slipstream;
- globus-RLS;
- RGMA-IC;
- eu.egi.egran;
- SRM.online;
- egi.VODashboard;
- CUSTOM.egi.HTTPserver;
- egi.NetworkPortal;
- ch.cern.cvmfs.stratum.1;
- CUSTOM.pl.plgrid.Bazaar.
- None of these service types appear to be in use by GridPP; JC to feed back to EGI (a quick way for sites to double-check usage is sketched at the end of this section).
- A DIRAC meeting is taking place in May; there is currently a survey open for communities that use DIRAC to comment. If anyone hosts such a VO, please contact Jeremy for the URL to the survey.
- TW is responsible for a VO clean-up; please look at https://www.gridpp.ac.uk/wiki/VO_Cleanup_Campaign and let Jeremy know if you know of obsolete VOs.
- Q115 (Q1 2015) reports are now due; please make sure you have supplied the required information to your Tier-2 representative.
- March Availability & Reliability reports are now available.
- QMUL and Sussex have both asked for re-computations (QMUL's figures again affected by the inclusion of a test SE).
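- As referenced above, a quick way to double-check whether any of the listed GOCDB service types is still declared anywhere is to query GOCDB's public programmatic interface. A minimal sketch, assuming the standard public PI endpoint and using one type from the list purely as an illustration:

    # Ask GOCDB for all service endpoints registered with a given service type;
    # an empty <results/> reply means nothing is declared with that type.
    curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint&service_type=globus-RLS'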
WLCG Operations
===========================
- nothing to report
Tier-1
===========================
- A network outage took place on Wednesday 8th April when a network upgrade caused issues. This forced the planned Castor update to be postponed; it has been rescheduled.
- All CREAM CEs at the RAL Tier-1 will be turned off on the 5th of May.
- Concern was expressed about ALICE; GS stated that ALICE is updating its monitoring so that it can take advantage of the ARC CEs. The CREAM CE end-of-life date has been agreed with ALICE.
- From the chat transcript:
Gareth Smith: (11:56 AM)
“Coming back to ALICE's use of ARC-CEs. I just confirmed my understanding: ALICE are successfully running jobs using our CEs - and have been for some months. We do not have a special configuration for them. The problem that remains is 'just' their monitoring which has only been able to monitor CREAM-CEs. (At the moment ALICE SAM is still only looking at our CREAM CEs.) ALICE are aware of, and are working to fix, this limitation. We have agreed with them the date we will stop our CREAM CEs.”
- The outage also caused issues with APEL, GOCDB and other services located at RAL.
Storage & Data Management
===========================
- nothing to report
Accounting
===========================
- APEL delays for Sheffield, RALPP and Bristol
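- (From the chat transcript below: the RALPP ARC CEs had stopped sending accounting records after their SSM sender lost its broker connection; restarting a-rex recovered them, and a Nagios log-freshness check on the send log was suggested.) A minimal sketch of such a check, assuming the standard nagios-plugins check_file_age plugin; the log path and thresholds here are illustrative only:

    # Warn if the APEL/SSM send log has not been written for a day, critical after two days;
    # point -f at wherever the site's ssmsend log actually lives.
    /usr/lib64/nagios/plugins/check_file_age -w 86400 -c 172800 -f /var/log/apel/ssmsend.log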
Documentation
===========================
- Progress being made on security documentation.
Interoperation
===========================
- nothing to report
Monitoring
===========================
- nothing to report
On-Duty
===========================
- nothing to report
Rollout
===========================
- nothing to report
Security
===========================
- nothing to report
Services
===========================
- nothing to report
Tickets
===========================
RALPP
=====
111703 - HammerCloud tests failing. Awaiting feedback from ATLAS; difficulties in catching problem jobs.
BIRMINGHAM
==========
112875 - Low availability; now climbing, waiting for it to get into the green.
GLASGOW
=======
112967 - Closed. The problem was due to an overloaded information system publishing old data.
113010 - Closed
EDINBURGH
=========
95303 - Awaiting glexec tarball
LANCASTER
=========
1005666 - Matt doesn't believe the results he's seeing. Going to re-run tests.
95299 - Awaiting glexec tarball
BRUNEL
======
112966 - Fixed by restarting Torque.
100IT
=====
112948 - Ongoing
108356 - Ticket can likely be closed, as the remaining work needed is external to the system.
TIER1
=====
108944 - Likely needs to involve the software developers after CHEP.
112713 - Files need to be deleted, likely after CHEP; awaiting feedback from AL.
109694 - SNO+ gfal copy problems at the Tier-1, potentially an oddity with CASTOR.
112977 - Hot files causing failures (1,000,000 access attempts over 18 hours).
111699 - HammerCloud tests failing; awaiting feedback from ATLAS.
112866 - Another hot file causing job failures.
112721 - ATLAS file access ticket.
UCL
===
112371 - ROD Low availability
112841 - ATLAS 0% transfer efficiency
112873 - ROD SRM put failures
95298 - Awaiting glexec tarball
112722 - ATLAS checksum timeouts
112966 - ROD job submit failures
There was a lengthy discussion on the status of UCL.
EM felt that UCL could not continue as a full Tier-2 site and that services that were not functioning should be decommissioned and the site placed into an uncertified state. Then, if more effort became apparent, the site could be re-certified either as a VAC site or with reduced services. JC pointed out that it has already been agreed with UCL that the site should become a VAC-only site and that the PMB are aware of the situation. JC also pointed out that to uncertify the site we would still need to follow GridPP procedures. SUSX was briefly discussed; EM stated that SUSX was a completely different matter as there was engaged effort there, and the main issue was its inclusion in reporting/monitoring metrics where it perhaps should not be. Returning to UCL, EM felt we should decommission the services now, rather than waiting for the VAC migration. JC pointed out that UCL is currently in downtime and would likely remain so until the issues were resolved. EM stated he would like approval from the PMB to begin work decommissioning UCL and, if not, would like a reason why not. JC stated he would make sure the issue was raised at GridPP34 if it had not been taken care of by that point.
TRIUMF
======
Slow connection from TRIUMF to RAL was briefly discussed, JC checked to see if a ticket had been raised.
Tools
===========================
- nothing to report
VOs
===========================
- nothing to report
Discussion
============================
- See the discussion on UCL recorded under Tickets above.
Actions in Progress:
=============================
- no update to the actions
AOB
===========================
- no AOB
Chat Log
============================
Jeremy Coles: (14/04/2015 11:02)
Gareth R is taking minutes today.
Ewan Mac Mahon: (11:05 AM)
The ASAP stats aren't entirely applicable to the way we're trying to run Sussex - they made it look a LOT more broken than it really is.
Daniela Bauer: (11:07 AM)
Can we not close down UCL as by now it's a drain to resources ?
Ewan Mac Mahon: (11:07 AM)
We really should.
And we should have this conversation and make a decision today,
Federico Melaccio: (11:07 AM)
I agree
Ewan Mac Mahon: (11:08 AM)
It's an operational issue.
Let's come back to this later in the meeting.
None of those look familiar or useful to me, but the cvmfs one might be important to someone - presumably it's been created fairly recently?
Paige Winslowe Lacesso: (11:27 AM)
Thot this morning APEL said Bristol was all ok
No, I'm wrong but it's not red error text.
Tom Whyntie: (11:34 AM)
Hi - sorry I've been having problems connecting and for some reason I'm getting a low bandwidth warning (which I haven't had before). Anyway - just to update from UCLan - they're running more test jobs and getting their heads around the CVMFS and Storage Element concepts in order to rewrite their code to work on the grid. All in progress. I'm in the process of writing more examples to help with this using CERN@school code - see, for example, https://github.com/CERNatschool/running-allpix
Jeremy Coles: (11:34 AM)
Thanks Tom.
Tom Whyntie: (11:34 AM)
I can't hear anything else really so I'm going to leave - any questions email me. Cheers, Tom
Chris Brew: (11:38 AM)
Hmmm, both our ArcCEs have:
2015-03-11 15:06:36,826 - stomp.py - ERROR - Lost connection
2015-03-11 15:06:36,826 - ssm2 - INFO - Disconnected from broker.
as the last entries in their log files within a second of each other then nothing else.
Does anyone know what needs restarting?
(to get ssmsend running again)
Gareth Douglas Roy: (11:41 AM)
a-rex
service a-rex restart
sorry :)
Chris Brew: (11:47 AM)
thanks, just come to that conclusion from google, and now seeing stuff in the log file again. Might be worth us putting a log file freshness nagios check on that log file.
David Crooks: (11:47 AM)
Yeah, we were thinking the same thing
Ewan Mac Mahon: (11:48 AM)
You can agree with e too :-)
As I say, I don't want this to be seen as a BAD thing, or an anti-UCL thing; a bit like just removing the LHCb support from JET rather than fixing it, I think this should be an amicable separation as far as possible, but it's a separation that needs to happen.
Gareth Smith: (11:56 AM)
Coming back to ALICE's use of ARC-CEs. I just confirmed my understanding: ALICE are successfully running jobs using our CEs - and have been for some months. We do not have a special configuration for them. The problem that remains is 'just' their monitoring which has only been able to monitor CREAM-CEs. (At the moment ALICE SAM is still only looking at our CREAM CEs.) ALICE are aware of, and are working to fix, this limitation. We have agreed with them the date we will stop our CREAM CEs.