Operations team & Sites
EVO - GridPP Operations team meeting
GridPP Operations 7 July 2015
Present:
A Forti
A Lahiff
A McNab
A Washbrook
B Davies
C Brew
D Bauer
D Crooks
D Rand
D Traynor (and Terry)
E Korolkova
F Melaccio
Gang Qin
G Roy
G Smith
G Songara
G Stewart
I Loader
J Bland
J Coles (chair)
J Hill
K Mohammed
L Skinner
M Doidge
M Raso-Barnett
R Frank
R Lopes
R Nandakumar
S Skipsey (minutes)
W Lacesso
-
Experiment updates:
LHCb:
RN: To first order, things are fine. A few jobs from Monte Carlo productions, picked up by most sites.
3 sites with issues: 1) Bristol, which is publishing a max CPU time of zero.
Other 2 sites are RHUL, QMUL - problems submitting to the CEs (there are GGUS tickets for these).
CMS:
DB: Some minor xrootd issues - one was a CERN xrootd redirector getting overloaded (but fixed quickly). Chris Brew updated, and then some OSG libraries caused a problem with xrootd?
Chris noted that they have a strange setup: an xrootd proxy in front of dCache, with both the CMS trivial file catalogue plugin and the xrootd simple filesystem on. These come from OSG, and the UMD repo had accelerated past the OSG releases for xrootd [except those plugins], so a trivial update broke those packages until they were re-pulled from OSG.
ATLAS:
EK: Nothing to report today. Last week was ATLAS Software & Computing week [summary next week].
There was an issue with QMUL, which does not have production jobs: release validations are missing for multicore, according to Alessandra.
Problem with Cambridge, not understood.
Other VOs:
DiRAC:
SS: Robot Certificate arranged for DIRAC for transfers. Have transferred 60TB or so of data. Ongoing discussion of how best to preserve ownership metadata for files in archive.
LIGO:
[Catalin on leave]
LOFAR:
GS: nothing to report (waiting on VO meeting)
JC noted that we need to make sure we keep things moving forward.
LSST:
AF: gave presentation to an LSST meeting, went well, were asked to make a similar presentation within a GridPP meeting [Pete Clark made this request].
JC noted that the intent of the next GridPP is to focus on these other VOs. There was a meeting between the program conveners for PP&A and Astronomy to see if we can adopt a joint approach. Ongoing discussions about such architecture.
LZ:
[D Colling absent]
EK noted that LZ is waiting for approval for OSG, but has seen no update for 10 days.
JC noted that there was funding approved for LZ, but not explicitly for computing. It was suggested they might wish to attend this meeting or f
UKQCD:
JC: last update, some issues with output files being added to WMS and not retrieved [now fixed]
Tom Whyntie is pursuing use of DIRAC etc.
CT Scan simulation [Pravda]
Similar use case to the cernatschool and lucid VOs.
Mark/Matt at Birmingham talked to them [along with TW].
UCLan/Galaxy Dynamics:
JC: they have now done their first analyses, via the northgrid VO, mostly on Liverpool. Deployed on CVMFS. They want to create their own GalaxyDynamics VO.
DIRAC Status:
AMcN: As far as the ARC tests done so far go, the issues are understood and fixed. Glasgow situation understood. Have not tried testing with other jobs.
GR noted that DBauer fixed an issue for the gridpp VO by asking for pilot roles to be added.
AMcN will liaise with GR/DBauer/Janusz about this.
Meetings and Updates:
[See: https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest ]
General Updates:
Multicore accounting - Durham situation? [MD noted that there's a ticket on this]
Some discussion with JGordon on this.
AF followed up with Oxford about this issue - puppet had overwritten the original change, so the puppet configuration has been fixed.
Daniela noted that IC's issue is that there are two ARC CEs - the prod one works [and is a newer release]; the other one is a bit of a hack on top of the small cloud resource, a proof of principle, and therefore fragile. "Fixing" the latter by upgrading it might also break it. [DB will talk to SFayer about this and see what the risk analysis looks like. Might just turn it off, due to the lack of resource behind it.]
LS on Durham - https://ggus.eu/?mode=ticket_info&ticket_id=114381 This is dependent on SLURM expertise to fix (reliant on Oliver Smith, returning from holiday tomorrow) - due to, we think, job records timing out.
DPM Workshop dates: [no comments]
T2 Reliability/Avail:
CB notes that he queries the values for RALPP.
Sheffield response?
[From chat log: Elena Korolkova: (11:32 AM)
in Sheffield I've overlooked a bad wn which was failing tests for LHCb it didn't cause troubles for atlas]
Ops Coord Updates:
UMD new release: MD: this is the first version of the WN-Tarball with gfal2-utils in. There was a problem that DBauer spotted with the UI that was released at the same time. Seems to work fine.
BD: asked if MD has tested copying from Castor with the WN-Tarball [re the gfal2-utils issues in earlier releases]. MD has not tested yet, but will.
*Action: MD to test against Castor with the current gfal2-utils in the WN-Tarball. BD to help.
See details for Site Actions: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes150702#Specific_actions_for_sites
Ongoing action for multicore sites - ATLAS sites should cap at 80% resources for multicore [at T2s, multicore should get 80% of the 50% weighting for their job class]
T1 share adjustment for CMS roles.
CMS Space Monitoring information request [Raul has a GGUS ticket open with CMS about the complexity of this process for T2s, and bugs in the implementation: https://ggus.eu/index.php?mode=ticket_info&ticket_id=113415 ]
T1 Update:
GSmith: report has been updated [re testing of WN configurations which have caused issues for some VOs]. Trying to schedule an intervention on the problematic router for Extreme - tentative date, 4th August.
RN noted that solid confirmation of the downtime is important for the VOs.
Documentation: has been improving in the last week.
EGI Interoperation meeting:
DCrooks: Ongoing discussion about SL5 decommissioning. If anyone has specific, detailed plans for this then please report back to JC and DCrooks. JC notes that many sites had SL5 storage nodes, which were being left for upgrade until necessary [to avoid disruption to service].
On-duty:
Do we want to feed back the issues with the fake ARC and fake BDII alarms masking real alarms?
KM has applied a patch on the GridPP Nagios to fix the alarms. There has been no update on the ticket from the person who should update it:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=114742
Security:
[There was a security team meeting last Wednesday, no attendees present in this meeting]
Services: RF - there's a problem with the VOMS server replicating between the master and the clients, which caused the replicated servers to be out of date. Should not be a problem if the client fails over to the master. There's a fix - Imperial and Oxford fixed. [Issue was an OpenSSL update which fixed the logjam problem by restricting the Diffie-Hellman bits, but this is hardcoded in the VOMS servers' MySQL! Official fix to be released today, but our workaround was to just not use Diffie-Hellman]
Perfsonar:
DR [issues with audio]
Tickets: [MD]
Bristol 114485 ticket [updated by WL today]
Durham 114536 - LS doesn't have access to fix ticket, waiting on OS to return. GGUS Access will be fixed for LS for next time.
Sheffield 114460 EK will do this week.
UCL tickets - in general, who to poke?
QMUL 114573 - JC asked RN to help them with debugging as LHCb rep. DT noted that this might not be an IPv6 problem, possibly just the release versions installed.
[Note for minutes - at some point, MD lost the ability to hear anyone else in the meeting, so responses are limited]
JC: who is the SNO+ liaison? (and T2K liaison) Post Chris Walker leaving?
[DT volunteered]
OUTSTANDING ACTIONS:
LSST update [everyone to talk to LSST] ongoing.
Incubator page now up and has links for all the new VOs we're working on [thanks to TWhyntie]
Contacts within VOs [ongoing]
-
Chat Logs:
Jeremy Coles: (07/07/2015 11:03)
a. Bristol : MaxCPUTime is 0. Needs to be fixed by Bristol. For more information : http://lhcbproject.web.cern.ch/lhcbproject/Operations/queues.html
b. RHUL : Errors submitting jobs to the CEs over the last few days. Could the admins have a look please.
c. QMUL : Same as RHUL above
Paige Winslowe Lacesso: (11:04 AM)
Will look into it pronto!
Queen Mary: (11:04 AM)
mic not working
terry and dan
Daniela Bauer: (11:05 AM)
I'm here
Alessandra Forti: (11:15 AM)
sorry I'm late
Brian Davies @RAL-LCG2: (11:16 AM)
for dirac, now have robot ticket fro FT transfers
65TB copied
Matt Doidge: (11:24 AM)
Durham have another ticket about this
One CE at Oxford, and the "test" CE at IC, are the only ones left in the July list
Daniela Bauer: (11:25 AM)
But Imperial is only cetst02, no ?
Matt Doidge: (11:28 AM)
https://ggus.eu/?mode=ticket_info&ticket_id=114381
<- The durham ticket
Jeremy Coles: (11:29 AM)
ce3.dur.scotgrid.ac.uk
ce4.dur.scotgrid.ac.uk
Elena Korolkova: (11:32 AM)
in Sheffield I've overlooked a bad wn which was failing tests for LHCb
it didn't cause troubles for atlas
Daniela Bauer: (11:37 AM)
There goes my sound, no doubt Vidyo will follow shortly after..
Jeremy Coles: (11:43 AM)
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes150702#Specific_actions_for_sites
raul: (11:47 AM)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=113415
That's the ticket I am using to talk to Tony about the CMS space dump. There is also a long email thread. too long!
Jeremy Coles: (11:50 AM)
https://wiki.egi.eu/wiki/Agenda-13-07-2015
Kashif: (11:54 AM)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=114742
Daniela Bauer: (11:54 AM)
Though right now I am locked out from the ROD portal
https://ggus.eu/index.php?mode=ticket_info&ticket_id=114881
John Hill: (11:57 AM)
Ow
Federico Melaccio: (11:57 AM)
noisy
Duncan Rand: (11:58 AM)
Sorry...
Mic problems
Ha!
Touche
Jeremy Coles: (11:59 AM)
Please go ahead Matt.
I can hear you.
I was talking. Humm.
Paige Winslowe Lacesso: (12:01 PM)
Updated it today1
Helloooooo!
Jeremy Coles: (12:04 PM)
Liam - have you applied for access as a supporter in GGUS?
Elena Korolkova: (12:06 PM)
Will do this this week
Liam Skinner: (12:07 PM)
Hi Jeremy, as far as I know - no, I'll ask about that, there's a few things I need to be added to - just noting that down - thanks
Samuel Cadellin Skipsey: (12:08 PM)
Jeremy: I heard you, but I don't thnk Matt did.
Jeremy Coles: (12:08 PM)
Fine. I will repeat later.
I was worried somehow there was an auto mute ....
Brian Davies @RAL-LCG2: (12:09 PM)
matt
Samuel Cadellin Skipsey: (12:10 PM)
Matt, we think you can't hear anyone else speaking.
Federico Melaccio: (12:10 PM)
Matt
Queen Mary: (12:10 PM)
if rhul also have an issue qmuls lhcb ticket it might be related to ipv6,
Jeremy Coles: (12:10 PM)
We hear you
Brian Davies @RAL-LCG2: (12:10 PM)
WE CAN HERE
Daniela Bauer: (12:10 PM)
you are very quiet
Liam Skinner: (12:10 PM)
I can hear you fine thanks 8-)
Jeremy Coles: (12:10 PM)
You do not hear us
Federico Melaccio: (12:10 PM)
Yes, but you can't hear us
Brian Davies @RAL-LCG2: (12:10 PM)
BUT CAN YOU HEAR US
Jeremy Coles: (12:11 PM)
Matt you may need to turn your audio out to active... or perhaps you need to rejoin. when you read this!......
John Bland: (12:12 PM)
as the last fragments of humanity ebbed away, the last lone signal was Matt Doidge narrating GGUS tickets to the empty void
Samuel Cadellin Skipsey: (12:12 PM)
It is very soothing.
Jeremy Coles: (12:13 PM)
Like the shipping forecast?
Andrew McNab: (12:14 PM)
Send him a ticket?
Matt Doidge: (12:16 PM)
Back...but still no sound. *shakes fist at Vidyo
Queen Mary: (12:17 PM)
ok
Matt Doidge: (12:17 PM)
Back and have sound!
Jeremy Coles: (12:18 PM)