UKI Monthly Operations Meeting (TB-SUPPORT)
EVO
Description
Monthly review and discussion meeting for those involved with GridPP deployment and operations. To join via EVO go to http://evo.caltech.edu.
To join by phone call +41 22 76 71400. The phone bridge ID is 451099 and the code: 4880.
Present: Jens (mins), Ewan, Yves, Brian, Duncan, Derek, Matt, Jeremy
(chair+mins), John, Elena, Mike, Sam, Winnie, Santanu, Peter, Graeme,
Greig, Phil, David, Andrew, Simon
Uptime discussed: RHUL didn't install host certificates - they will install them soon. Cambridge
is upgrading and needs Condor info & docs: there are huge differences between
series. The Cambridge CE does not accept jobs at the moment (it looks like 6.8.5 was used for testing). Storage mostly OK;
some sites failed to update certificates, and there is a problem with the dCache pin
manager.
Blackhole nodes: many reasons why this happens, eventually new
monitoring written to catch when this happens - for example SSH to head node or node disk is full.
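As an illustration of one of the conditions mentioned above (a full disk on a node), a check could be sketched roughly as follows. This is only a minimal Python sketch and not the monitoring discussed at the meeting; the function name, the default path and the 95% threshold are assumptions chosen for the example.

```python
import shutil


def disk_nearly_full(path="/tmp", threshold=0.95):
    """Return True if the filesystem holding `path` is at least `threshold` full.

    `path` and `threshold` are illustrative defaults, not site policy.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # A zero-sized filesystem cannot hold job output; treat it as full.
        return True
    return usage.used / usage.total >= threshold


if __name__ == "__main__":
    # Hypothetical use on a worker node: flag the node before it eats jobs.
    if disk_nearly_full("/tmp"):
        print("WARNING: scratch area nearly full, node may act as a blackhole")
    else:
        print("scratch area OK")
```

A site would presumably run something like this from its existing monitoring or from a batch-system health check, so that a node with no scratch space can be taken offline before it fails jobs in quick succession.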
Steve's transfer tests were not close to what experiments do. Test jobs being
updated to reflect more what apps will do. There is an Atlas test
showing default job locations and failure rates. Curious why jobs
attempt to run when site is in downtime, but jobs may have been
submitted before site entered downtime.
Jeremy will check the current job flow - is FCR consulted?
Santanu asked about the Cambridge failures. Graeme suggested looking at the failure links as the message contained in them gives a clue.
There was a bug in the accounting portal last week that prevented the Gantt chart display of accounting history. This has now been fixed. Lancaster looks to have a problem following the cert updates.
Experiments issues: LHCb currently reviewing software, Dirac upgrade.
Atlas (Graeme): Cosmic data (from sub-systems) transferred T0->T1. Rolling functional tests take place each Monday - tickets raised against problem sites. Atlas prod problems with
CASTOR RAC upgrade. 10 TeV MC data ready to go. For T2s, new or
additional space tokens required, 2TB ATLASPRODDISK, .5TB
ATLASDATADISK, ATLASGROUPDISK with complicated ACLs (script for DPM to
set them), ATLASUSERDISK scratch cleaned centrally,
ATLASLOCALGROUPDISK (no garbage collection). Brian will ticket sites with a best guess for space requirements in each token (depends on local resources etc.).
Will switch Panda to PRODDISK soon. Small issue that the one spacetoken has 5 associated roles. Graeme has a script for DPM sites to use for this setup.
Greig reported that Edinburgh PRODDISK is full again and more data is coming in. Graeme will discuss offline - it is odd.
No CMS news. SuperNEMO enabled.
CA discussion. Tricky external constraints imposed due to suspected
root key compromise, Jens explained some of the background involving
peer CAs and non-UK grids trusting the UK CA. Imposed as a compromise
between the two means of closing a CA: to drop it immediately or to
let everything expire.
Thus, we have been reissuing certificates under the new hierarchy
since November. The only difference in this case has been that
certificates were re-signed - a new invention (in the grid context)
which should make life easier in the future - and then of course the
shorter timescale, see below.
In the UK the main problems were due to communication: people
don't read their emails, and the ones that did felt that the mails did
not adequately convey a sense of urgency, since the
mails were sent out later than planned because of delays. That was the
second problem: the short timescale due to delays in signing, partly
because the planning was not adequate for the task. The core task of
signing, which ordinarily would have taken a few hours at most, now took
a week; one difficulty was ensuring that certificates went to helpdesks.
Final set of problems encountered were incidental: the VOMS bug
hitting some VOs due to VO admins not having deployed the bugfix: Jens
followed up on behalf of some VOs. Some certificates were re-signed
which should have been revoked; this caused some confusion. A few
were left out; this was discovered only on Friday afternoon after
Ewan reported a missing Oxford host cert - possibly due to a gap in
the original sql query, perhaps corresponding to a single day's
certificates.
Jeremy asked if the deadline for switching could have been brought forward by 1-2 weeks (before CRL expiry).
Graeme thought a central problem was the lack of warning and that most users were clueless about the impact - made worse by the 48hr changeover window. Stephen remarked that the normal renewal was reasonable. Several in the meeting were happy about the new renewal process (copying the CRT is easier than copying and extracting across all WNs, for example). Jens explained that it would be possible to subscribe to host certs which could be renewed for up to 3-5 years.
Suggestions on how to improve communications will be discussed.
Further policy changes will be required to bring the UK back into
compliance with changing minimal requirements for Grid CAs (the
re-signing as opposed to normal renewals was a positive sign of this
change). The new requirements will also impose a need for
re-identification with RAs every N years (where N is 3 or 5 or so,
depending on how the private key is stored).
How to check whether certificate matches private key? Jens will send
a recipe to tb-support (ACTION).
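Pending that recipe, one common way to do the check for RSA keys is to compare the modulus that openssl reports for the certificate with the one it reports for the private key. The Python sketch below wraps that comparison; it is not the recipe from the action above, the function names are made up for the example, and it assumes RSA keys, an unencrypted key file and an `openssl` binary on the PATH.

```python
import subprocess


def openssl_modulus(kind, pem_path):
    """Run `openssl <kind> -noout -modulus -in <pem_path>` and return its stdout.

    `kind` is "x509" for a certificate or "rsa" for an RSA private key.
    """
    result = subprocess.run(
        ["openssl", kind, "-noout", "-modulus", "-in", pem_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_modulus(openssl_output):
    """Extract the hex modulus from a line of the form 'Modulus=ABCDEF...'."""
    for line in openssl_output.splitlines():
        if line.startswith("Modulus="):
            return line[len("Modulus="):].strip().upper()
    raise ValueError("no 'Modulus=' line found in openssl output")


def cert_matches_key(cert_path, key_path):
    """Return True if the certificate and the RSA private key share a modulus."""
    cert_modulus = parse_modulus(openssl_modulus("x509", cert_path))
    key_modulus = parse_modulus(openssl_modulus("rsa", key_path))
    return cert_modulus == key_modulus
```

For a host certificate this might be called as `cert_matches_key("hostcert.pem", "hostkey.pem")`, where both file names are placeholders for wherever the site keeps them. A mismatch means the certificate was issued for a different key pair.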
Purchases to be documented to enable other sites to benefit from
experiences. Also new benchmarking tests, Unis will need to buy
licenses but may have some already.
Oxford (and several other sites) seeing specific ports on grid service nodes having odd connection attempts - some 300 bytes transferred every 2 hrs or
so. Mingchao looking into it. See mail to gridpp-storage for further
information.
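For sites that want to look for this pattern in their own logs, the following Python sketch shows one possible approach: group connection attempts by source address and flag sources whose attempts arrive at a roughly fixed interval. The input format (pairs of Unix timestamp and source address), the function name and the tolerance values are all assumptions; extracting the pairs from a particular service log is left to the site, and the addresses in the example are documentation addresses, not the ones observed.

```python
from collections import defaultdict


def periodic_sources(events, period_s=7200, tolerance_s=600, min_hits=3):
    """Return the sources whose connection attempts recur about every `period_s`.

    `events` is an iterable of (unix_timestamp, source_address) pairs.
    A source is flagged when it has at least `min_hits` attempts and every
    gap between consecutive attempts is within `tolerance_s` of `period_s`.
    """
    by_source = defaultdict(list)
    for timestamp, source in events:
        by_source[source].append(timestamp)

    suspects = []
    for source, times in by_source.items():
        if len(times) < min_hits:
            continue
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if all(abs(gap - period_s) <= tolerance_s for gap in gaps):
            suspects.append(source)
    return sorted(suspects)


if __name__ == "__main__":
    # Made-up events: one source every ~2 hours, one in a short burst.
    sample = [
        (0, "192.0.2.1"), (7200, "192.0.2.1"), (14450, "192.0.2.1"),
        (100, "192.0.2.2"), (200, "192.0.2.2"), (300, "192.0.2.2"),
    ]
    print(periodic_sources(sample))
```

Requiring every gap to match keeps the sketch simple but strict; a source that misses one cycle would not be flagged, so a real check might allow a fraction of irregular gaps.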
Which version of torque/maui are people running? See responses in
chat transcript below, around 12:02.
[10:27:14] Jeremy Coles joined
[10:27:20] John Bland joined
[10:28:18] Elena Korolkova joined
[10:29:30] Mike Kenyon joined
[10:29:33] Sam Skipsey joined
[10:30:14] Winnie Lacesso joined
[10:31:03] Santanu Das joined
[10:31:19] Peter Love joined
[10:35:11] Graeme Stewart joined
[10:35:21] Santanu Das hang on, probably you can't hear me
[10:35:40] Santanu Das I'm just trying to fix the audio
[10:35:48] Greig Cowan joined
[10:36:05] Alessandra Forti joined
[10:36:09] Simon George joined
[10:36:22] IPPP1 Durham joined
[10:37:02] Chris Brew joined
[10:37:06] Phone Bridge joined
[10:37:19] Simon George '409429' on the agenda is not a valid evo meeting id. Could someone check it please?
[10:39:12] Ewan Mac Mahon left
[10:39:12] Ewan Mac Mahon joined
[10:39:14] Ewan Mac Mahon left
[10:39:33] Jeremy Coles It is 451099
[10:40:22] Brian Davies http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring
[10:43:12] Stephen Burke joined
[10:43:21] Rob Fay joined
[10:43:49] Phone Bridge joined
[10:51:26] Ewan Mac Mahon Essentially you fail the tests, but if you're in downtime no-one minds.
[10:51:37] Derek Ross Downtime should stop COD from opening tickets about failing tests
[10:51:49] Ewan Mac Mahon But if the RB is matching jobs to a downed site it shouldn't.
[10:56:30] Jens Jensen New CASTOR GIP is ready - publishing everything
[10:56:53] Jens Jensen Not in production yet at RAL
[10:58:39] Winnie Lacesso I've lost all sound, has anyone else?
[10:59:00] Jens Jensen I;ve got sound and I'm on a US Panda...
[10:59:20] Andrew Elwell joined
[10:59:38] Andrew Elwell Hi Gang - Sorry I'm late
[11:27:45] Simon George is it just me or has it gone quiet?
[11:28:22] Jeremy Coles Ok for me
[11:28:23] Phone Bridge left
[11:29:15] Phone Bridge joined
[11:30:07] Simon George phone bridge line went dead. I've redailed and can hear you again
[11:31:51] IPPP1 Durham That was Phil Roffe and David Ambrose-Griffith
[11:32:12] Peter Love left
[11:36:55] Jeremy Coles We'll continue for another 15 mins. Is there any other AOB?
[11:40:34] Ewan Mac Mahon Seconded. I didn't need to run the new certs through a browser at all.
[11:52:38] Winnie Lacesso Jeremy 0- so do all sites to run this benchmark need to buy a Spec2006 license??
[11:54:43] Winnie Lacesso Or if we have CPU ENNN & MM GB RAM & someone publishes Spec2006 results for that, can we use (publish) their results for free?
[11:56:13] Andrew Elwell Ewan - Do you run anything like snort or iptables rules to pick these up or just log greppage?
[11:58:24] Graeme Stewart left
[11:59:08] Winnie Lacesso Can anyone check on Rollout if it's UK only or wider than that?
[11:59:45] Andrew Elwell drop all connections from them and see if they complain?
[11:59:51] Andrew Elwell /bofh
[12:00:31] Chris Brew I don't see that IP in globus-gatekeeper.log or catalina.out on dCache at RALPP as a numeric IP. Should I if I've been contacted
[12:02:19] Ewan Mac Mahon Back now - sorry
[12:02:24] Chris Brew glite
[12:02:30] IPPP1 Durham glite
[12:02:31] Rob Fay also glite
[12:02:40] Chris Brew since I think thay've taken steve's fixes
[12:02:41] Derek Ross our own builds for the server, clients are glite
[12:03:54] Winnie Lacesso left
[12:03:55] IPPP1 Durham left
[12:03:56] Ewan Mac Mahon Bye.
[12:03:57] Chris Brew left
[12:03:58] Brian Davies left
[12:03:58] Andrew Elwell left
[12:03:59] Derek Ross left
[12:04:00] Sam Skipsey left
[12:04:02] Phone Bridge left
[12:04:02] Phone Bridge left
[12:04:02] John Bland left
[12:04:03] Elena Korolkova left
[12:04:03] Matthew Doidge left
[12:04:03] Mike Kenyon left
[12:04:05] Duncan Rand left
[12:04:05] Rob Fay left
[12:04:06] Ewan Mac Mahon left
[12:04:06] Stephen Burke left
[12:04:07] Yves Coppens left
[12:04:13] Alessandra Forti left
There are minutes attached to this event.