Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
  -- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
  -- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002
  -- The meeting extension is 109308582. PIN 1234

Chair: Jeremy
Minutes:
Apologies:

GridPP Operations Team and Sites Meeting 5th December 2017
==========================================================

Attended
========

Alessandra Forti, Andrew Lahiff, Andrew McNab, Chris Brew, Dan Traynor,
Daniela Bauer, David Crooks, Elena Korolkova, Gordon Stewart, Govind,
Ian Loader, Jens Jensen, John Hill, John Kelly, Kashif Mohammed, Leo Rojas,
Linda Cornwall, Lukasz Kreczko, Matt Doidge, Robert Frank, Sam Skipsey,
Steve Jones, Raja Nandakumar, Winnie Lacesso, Teng Li, Vip Davda.

Experiment Problems/Issues
==========================
LHCb
----
No major problems; some upload problems, but these may be part of a problem
with a DIRAC update

CMS
---
All OK at the moment, a number of sites with low availability for November.
Is anyone else using MOUNT_UNDER_SCRATCH apart from T1 and Bristol?

Atlas
-----
Manchester and Glasgow single-core work queues set offline; only multicore
queues active now

Direct access for high-memory queues at QM

ADC meeting last Tuesday - changes to HammerCloud notification: no more e-mails
for prod queue failures (see attachment), send mail for Atlas support

Changes to the protection of the Atlas Panda page; if you have problems please
report to atlas-uk-cloud-support

Other
-----

No reports

GridPP Dirac Issues
-------------------

Page looks quite good, VAC sites all running fine, only 2-3 grid sites with
non-recent results.

Meetings and Updates
====================

General Updates
---------------

WLCG Ops Co-ordination
----------------------

Next meeting Thursday
Anything to raise, get in touch

Tier 1 Status
-------------
WMS service at RAL due to be decommissioned at the end of December
IPv6 Connectivity took some time to recover after the power cut

Storage and Data Management
---------------------------

Nothing concrete to report

Tier 2 Evolution
----------------

Nothing to report

Accounting
----------

Nothing to report

Documentation
-------------
Nothing to report

Interoperation
--------------

Meeting on Monday - David won't be able to attend

Monitoring
----------

Nothing to report

On Duty
-------

No report

Rollout
-------

Nothing to report

Security
--------

Link to list of CA roots on the Ops Dashboard - TACAR website
https://www.tacar.org/cert/list

Plans for integration/co-operation between different projects presented at
EOSC, link on Dashboard

Services
--------

No report

Tickets
-------

Link to Tickets - http://tinyurl.com/nwgrnys

IPv6 deployment - put on hold if no movement for Xmas.
Is Karin the latest Sno+ contact? Yes.

RALPP - Hoping ArcCE fix makes it to mainline


Tools
-----

Nothing to report

VOs
---

Nothing to report

Site Updates
------------

RALPP has installed Singularity by installing singularity-runtime on all the
WNs; no additional config needed

Discussion Topics
=================
The 1.88 update and fallout (lessons)
-------------------------------------
Triggered a bug in an ancient Bouncy Castle version.
Different signing chain at both ends; the lookup for CRL extensions returns a
null pointer
Need to move off SHA1 to SHA2
Should be completely equivalent apart from this BC bug.
Jens will add a non-critical extension to the CRL (safer than adding a critical
extension) today, which may work around the issue
Robert Frank has an rpm with the fix; he also states that according to the Java
docs the non-critical CRL extension should "fix" the issue.
Jens could not replicate the bug, so he needs people to test it for him.
Different bug in CA update code: services need restarting to ensure the fix works
UMD4: Argus 1.7 has a newer Bouncy Castle version.
Kashif ran into problems downgrading with yum downgrade (complaints about the
CRL); Steve manually removed the rpms and reinstalled
(yum history rollback might work)
Can we have the SAM version tests extended for the UK?
Jens: we need to go through the upgrade
Options:
1) Non-critical CRL extension
2) Use patched version of BC
3) Upgrade Argus - (now might be a good time)
Jens will produce the CRL extension today
Sites should hold off doing anything until the non-critical CRL extension has
been tested - Needs someone to test it!
Is there anything else that is broken?
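
The manual downgrade route noted above (and the removal loop quoted in the
chat) could be made a little safer by listing the affected packages before
removing anything. A minimal sketch, assuming the CA packages carry "ca_" in
their names and the version string 1.88-1 as in the chat; the function name
select_ca_packages is hypothetical, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print the CA packages at a given version.
# It reads one package name per line on stdin, so it can be tried on saved
# `rpm -qa` output before touching a live node. Nothing is removed here.
select_ca_packages() {
    version="$1"
    # -F treats the version as a fixed string, so the dot in "1.88-1"
    # cannot match an arbitrary character.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -F -- "-${version}" | grep -F 'ca_' || true
}

# Dry run on a node (assumes an rpm-based system):
#   rpm -qa | select_ca_packages 1.88-1
# Once the list looks right, removal can follow the loop from the chat,
# then check for other 1.88-1 packages by hand and reinstall the meta package.
```

Whichever route is used, daemons still need restarting afterwards, as noted in
the chat.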

The Steve Lloyd Tests
---------------------

Actions and AOB
===============

ACTION: Steve Jones to write a quick wiki page with options for sites re 1.88 update

AOB: HepSysMan meeting, Tuesday 16th January in Glasgow

Chat
====

David Crooks: (05/12/2017 11:02)
Chris is taking minutes?
Agenda: https://indico.cern.ch/event/686303/
Daniel Peter Traynor: (11:12 AM)
how long can we not update the certs (running 1.87-1 now)? ops tests already complain
Chris Brew: (11:12 AM)
Robert do you have a link for the other CA update bug? That sounds like the problem with remote dCaches, If I send it to the dCache team then they can recommend service restarts to everyone
Daniel Peter Traynor: (11:13 AM)
not aware of centos 7 production versions of storm or cream
David Crooks: (11:13 AM)
Yeah
That was a more longterm plan
John Hill: (11:13 AM)
After downgrade you have to restart daemons (or reboot)
Daniel Peter Traynor: (11:14 AM)
downgrade worked for me, discovered how to use yum history and needed our own repo which had the old cert version
Robert Frank: (11:15 AM)
yum history rollback <id>
Daniel Peter Traynor: (11:15 AM)
yum history, yum history info [id], yum history undo [id]
Steve Jones: (11:16 AM)
To those who don't trust YUM to do the right thing!

for p in `rpm -qa | grep 1.88-1 | grep ca_`; do yum -y remove $p; done

Then check for other packages of version 1.88-1 and remove those too, by hand!

rpm -qa | grep 1.88-1
Then put the meta package back. It worked for me.
Daniela Bauer: (11:20 AM)
How are you going to get the rest of the grid to update?
Paige Winslowe Lacesso: (11:21 AM)
s/best choice/least bad/?
Andrew David Lahiff: (11:24 AM)
But note that some systems at RAL have been upgraded.
Robert Frank: (11:24 AM)
All mine are patched now
I'd have to set one up
John Hill: (11:26 AM)
I've not seen any other issues with 1.88
Elena Korolkova: (11:26 AM)
Can we have at the end of the discussion a list of actions for a site admin and send it to tb support, please
Steve Jones: (11:26 AM)
Yes; Elena, we'll make a list.
Chris Brew: (11:26 AM)
Options:
1) Non-critical CRL extension
2) Use patched version of BC
3) Upgrade Argus - (now might be a good time)
Lukasz Kreczko: (11:27 AM)
Upgrade to 1.7.0-1 ?
Elena Korolkova: (11:27 AM)
Thanks, Steve
Chris Brew: (11:30 AM)
Power cut hit us
Lukasz Kreczko: (11:30 AM)
MOUNT_UNDER_SCRATCH hit us ;)
Chris Brew: (11:31 AM)
Rest seems to be xrootd issues
Daniela Bauer: (11:32 AM)
Don't know.
Raja Nandakumar: (11:37 AM)
Apologies - need to leave now.

David Crooks: (11:38 AM)
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMeetingWeek171204
Chris Brew: (11:48 AM)
TACAR website
David Crooks: (11:48 AM)
https://www.gridpp.ac.uk/wiki/GridRootCertificates
Chris Brew: (11:48 AM)
https://www.tacar.org/cert/list
Paige Winslowe Lacesso: (11:57 AM)
Sorrysorry must go
Alessandra Forti: (12:03 PM)
it's noon
 

    • 11:00 - 11:01
      Ops meeting minutes 1m
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't, the task will keep coming to them!).

      • Upcoming allocations:

      5th Dec: Chris Brew
      12th Dec: TBC
      19th Dec: Vip
      8th Jan??
      15th Jan: TBC

    • 11:01 - 11:20
      Experiment problems/issues 19m

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
        T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      Could you please let me know if you cannot access these pages and, if so, what the error is? Thanks.
      Singularity according to CMS: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesDocumentation#Singularity
      (and a very bad plot to show which sites have Singularity: https://monit-kibana.cern.ch/kibana/app/kibana#/visualize/edit/CMS-glideins-singularity?_g=h@114715d&_a=h@af4d5e4)

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 11:20 - 11:40
      Meetings & updates 20m

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
        Imperial accounting never recovered from adding an ARCCE and/or the move to EL7, and we are out of ideas (well, I'm going to send them a dump of our raw accounting data and see if they can make sense of it):
        https://ggus.eu/index.php?mode=ticket_info&ticket_id=130896
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 11:40 - 12:20
      Discussion topics 40m
      • The 1.88 update and fallout (lessons)
      • The Steve Lloyd tests
    • 12:20 - 12:25
      Actions & AOB 5m
      • https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items