Operations team & Sites
EVO - GridPP Operations team meeting
- This is the weekly GridPP ops & sites meeting
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002
-- The meeting extension is 109308582. PIN 1234
Chair: Jeremy
Minutes:
Apologies:
GridPP Operations Team and Sites Meeting 5th December 2017
==========================================================
Attended
========
Alessandra Forti, Andrew Lahiff, Andrew McNab, Chris Brew, Dan Traynor,
Daniela Bauer, David Crooks, Elena Korolkova, Gordon Stewart, Govind,
Ian Loader, Jens Jensen, John Hill, John Kelly, Kashif Mohammed, Leo Rojas,
Linda Cornwall, Lukasz Kreczko, Matt Doidge, Robert Frank, Sam Skipsey,
Steve Jones, Raja Nandakumar, Winnie Lacesso, Teng Li, Vip Davda.
Experiment Problems/Issues
==========================
LHCb
----
No major problems. Some upload problems, but these may be part of a problem
with a DIRAC update
CMS
---
All OK at the moment, a number of sites with low availability for November.
Is anyone else using MOUNT_UNDER_SCRATCH apart from T1 and Bristol?
Atlas
-----
Manchester and Glasgow single-core work queues set offline, only multicore
queues active now
Direct access for high memory queues at QM
ADC meeting last Tuesday - changes to hammercloud notification, no longer e-mails
for prod queue failures (see attachment), send mail for Atlas support
Changes to the protection of the Atlas Panda page; if you have problems please
report to atlas-uk-cloud-support
Other
-----
No reports
GridPP Dirac Issues
-------------------
Page looks quite good, VAC sites all running fine, only 2-3 grid sites with
non-recent results.
Meetings and Updates
====================
General Updates
---------------
WLCG Ops Co-ordination
----------------------
Next meeting Thursday
Anything to raise, get in touch
Tier 1 Status
-------------
WMS service at RAL due to be decommissioned at the end of December
IPv6 Connectivity took some time to recover after the power cut
Storage and Data Management
---------------------------
Nothing concrete to report
Tier 2 Evolution
----------------
Nothing to report
Accounting
----------
Nothing to report
Documentation
-------------
Nothing to report
Interoperation
--------------
Meeting on Monday - David won't be able to attend
Monitoring
----------
Nothing to report
On Duty
-------
No report
Rollout
-------
Nothing to report
Security
--------
Link to list of CA roots on the Ops Dashboard - TACAR website
https://www.tacar.org/cert/list
Plans for integration/co-operation between different projects presented at
EOSC, link on Dashboard
Services
--------
No report
Tickets
-------
Link to Tickets - http://tinyurl.com/nwgrnys
IPv6 deployment - put on hold if no movement for Xmas.
Is Karin the latest Sno+ contact? Yes.
RALPP - Hoping ArcCE fix makes it to mainline
Tools
-----
Nothing to report
VOs
---
Nothing to report
Site Updates
------------
RALPP has installed singularity by installing singularity-runtime on all the
WNs, no additional config needed
Discussion Topics
=================
The 1.88 update and fallout (lessons)
-------------------------------------
Triggered a bug in ancient Bouncy Castle Version.
Different signing chain at both ends; the lookup for CRL extensions returns a
null pointer
Need to move off SHA1 to SHA2
Should be completely equivalent apart from this BC bug.
Jens will add non-critical extension to the CRL (safer than adding critical
extension) today, may work around the issue
Robert Frank has rpm with the fix, he also states that according to the Java
docs the non-critical CRL extension should "fix" the issue.
Jens could not replicate the bug, he needs people to test it for him.
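A minimal sketch of how a site might check whether a given CRL already carries
extensions, and which signature algorithm it uses, before and after the change.
It assumes the CRL has first been dumped to text with openssl; the function
names and the file name are hypothetical, and the parsing relies on the usual
layout of the OpenSSL text dump, which may differ between versions.

```shell
#!/bin/sh
# Sketch only. The first two functions read, on stdin, the text dump from e.g.
#   openssl crl -in some-crl.pem -noout -text
# ("some-crl.pem" is a placeholder; add -inform DER for a binary CRL).

# Print the name of the first signature algorithm found in the dump.
crl_signature_algorithm() {
    awk -F': *' '/Signature Algorithm:/ { print $2; exit }'
}

# Succeed (exit status 0) if the dump contains a "CRL extensions:" section.
crl_has_extensions() {
    grep -q 'CRL extensions:'
}

# Succeed if the given algorithm name looks SHA1-based (assumed naming,
# e.g. "sha1WithRSAEncryption").
is_sha1_algorithm() {
    case "$1" in
        sha1*|SHA1*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

For example, `openssl crl -in some-crl.pem -noout -text | crl_has_extensions`
could be run against the old and the new CRL to compare them.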
Different bug in CA update code, need to restart services to ensure fix works
UMD4, Argus 1.7 has newer bouncycastle version.
Kashif ran into problems downgrading (using yum downgrade), with complaints
about the CRL; Steve manually removed the rpms and reinstalled
(yum history rollback might work)
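The manual route can be sketched roughly as below. This is an illustration
under assumptions, not a tested procedure: it assumes the affected CA packages
have "ca_" in their name and carry version 1.88-1, the function names are
hypothetical, and it only prints the removal commands so they can be reviewed
before anything is run.

```shell
#!/bin/sh
# Sketch only: list the yum commands for removing the 1.88-1 CA packages.
# Assumption: affected packages contain "ca_" and the version string below.
VERSION="${VERSION:-1.88-1}"

# Read `rpm -qa` style output on stdin; print only CA packages at $VERSION.
# grep -F treats the version as a fixed string, so "." is not a wildcard.
select_ca_packages() {
    grep -F -- "$VERSION" | grep 'ca_' || true
}

# Print (do not run) one removal command per selected package.
print_remove_commands() {
    select_ca_packages | while IFS= read -r pkg; do
        printf 'yum -y remove %s\n' "$pkg"
    done
}

# Possible use on a node:
#   rpm -qa | print_remove_commands     # review, then run the printed lines
#   rpm -qa | grep -F "1.88-1"          # anything left over, to remove by hand
# The yum history route mentioned at the meeting:
#   yum history                         # find the transaction id
#   yum history info <id>
#   yum history undo <id>               # or: yum history rollback <id>
```

Printing rather than executing is a deliberate choice here: packages at the
same version whose names do not contain "ca_" are not selected and still need
checking by hand.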
Can we have the SAM version tests extended for the UK?
Jens: we need to go through the upgrade
Options:
1) Non-critical CRL extension
2) Use patched version of BC
3) Upgrade Argus - (now might be a good time)
Jens will produce the CRL extension today
Sites should hold off doing anything until the non-critical CRL extension has
been tested - Needs someone to test it!
Is there anything else that is broken?
The Steve Lloyd Tests
---------------------
Actions and AOB
===============
ACTION: Steve Jones to write a quick wiki page with options for sites re 1.88 update
AOB: HepSysMan meeting, Tuesday 16th January in Glasgow
Chat
====
David Crooks: (05/12/2017 11:02)
Chris is taking minutes?
Agenda: https://indico.cern.ch/event/686303/
Daniel Peter Traynor: (11:12 AM)
how long can we not update the certs (running 1.87-1 now)? ops tests already complain
Chris Brew: (11:12 AM)
Robert do you have a link for the other CA update bug? That sounds like the problem with remote dCaches, If I send it to the dCache team then they can recommend service restarts to everyone
Daniel Peter Traynor: (11:13 AM)
not aware of centos 7 production versions of storm or cream
David Crooks: (11:13 AM)
Yeah
That was a more longterm plan
John Hill: (11:13 AM)
After downgrade you have to restart daemons (or reboot)
Daniel Peter Traynor: (11:14 AM)
downgrade worked for me, discovered how to use yum history and needed our own repo which had the old cert version
Robert Frank: (11:15 AM)
yum history rollback <id>
Daniel Peter Traynor: (11:15 AM)
yum history, yum history info [id] , yum history undo [id]
Steve Jones: (11:16 AM)
To those who don't trust YUM to do the right thing!
for p in `rpm -qa | grep 1.88-1 | grep ca_`; do yum -y remove $p; done
Then check for other packages of version 1.88-1 and remove those too, by hand!
rpm -qa | grep 1.88-1
Then put the meta package back. It worked for me.
Daniela Bauer: (11:20 AM)
How are you going to get the rest of the grid to update ?
Paige Winslowe Lacesso: (11:21 AM)
s/best choice/least bad/?
Andrew David Lahiff: (11:24 AM)
But note that some systems at RAL have been upgraded.
Robert Frank: (11:24 AM)
All mine are patched now
I'd have to set one up
John Hill: (11:26 AM)
I've not seen any other issues with 1.88
Elena Korolkova: (11:26 AM)
Can we have at the end of the discussion a list of actions for a site admin and send it to tb support, please
Steve Jones: (11:26 AM)
Yes; Elena, we'll make a list.
Chris Brew: (11:26 AM)
Options:
1) Non-critical CRL extension
2) Use patched version of BC
3) Upgrade Argus - (now might be a good time)
Lukasz Kreczko: (11:27 AM)
Upgrade to 1.7.0-1 ?
Elena Korolkova: (11:27 AM)
Thanks, Steve
Chris Brew: (11:30 AM)
Power cut hit us
Lukasz Kreczko: (11:30 AM)
MOUNT_UNDER_SCRATCH hit us ;)
Chris Brew: (11:31 AM)
Rest seems to be xrootd issues
Daniela Bauer: (11:32 AM)
Don't know.
Raja Nandakumar: (11:37 AM)
Apologies - need to leave now.
David Crooks: (11:38 AM)
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMeetingWeek171204
Chris Brew: (11:48 AM)
TACAR website
David Crooks: (11:48 AM)
https://www.gridpp.ac.uk/wiki/GridRootCertificates
Chris Brew: (11:48 AM)
https://www.tacar.org/cert/list
Paige Winslowe Lacesso: (11:57 AM)
Sorrysorry must go
Alessandra Forti: (12:03 PM)
it's noon