Operations team & Sites

EVO - GridPP Operations team meeting

- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/ (join the meeting in the Janet(UK) Community area).
- Meeting URL (SeeVogh R.N.): http://research.seevogh.com/join?meeting=M2MvMB2a2iDDDl929BDM92
- Meeting URL (EVO International): http://evo.caltech.edu/evoNext/koala.jnlp?meeting=M2MvMB2a2iDDDl929BDM92
- Phone Bridge: +44 (0)161 306 6802, ID: 100 3161, Password: 4880
Apologies: Mark
Operations team and Sites meeting minutes
Tuesday, 4 March 2014
B Davies
A Forti
A Lahiff
A McNab
C Brew
C Walker
D Traynor
D Bauer
D Crooks
E Korolkova
E MacMahon
E Steele
G Qin
G Roy
G Smith
J Coles
J Bland
J Hill
L Kreczko
M Doidge
M Raso-Barnett
M Kashif
P Gronbech
R Nandakumar
R Fay
R Frank
S Skipsey
S Jones
T Whyntie
W Bhimji
Review of weekly issues by experiment/VO
- LHCb
ECDF and Glasgow have problems; still investigating.
Long-standing issue with EFDA-JET.
Had encouraged users to use FTS3 for debug traffic. Caught up with FTS3 issues.
Only RALPP has site availability less than 90%. Could C Brew explain why?
Lancaster: discrepancy between new and old plots of availability. Needs to be investigated.
Lack of production due to full RAL-LCG2 DATADISK.
RHUL, Sheffield and Southgrid sites.
Brunel is full, which seems strange as it is only a T3.
FTS3 RAL issues. ATLAS have moved all FTS3 servers to CERN.
ATLAS SW&C week last week. EK will present summary next week.
- Other
Meetings & updates       
General updates
    A Vidyo test meeting room is available for testing linked from this agenda. Headsets are recommended.
    No problems uncovered with the GridPP test website.
    From Monday's WLCG ops: intermittent problems with RAL's virtualization cluster, affecting many services (including FTS3).
    GGUS update - is there feedback we want to collate?
MD: There is no longer an option to sort by site.
    For information: APARSEN-EGI Community Workshop on Managing, Computing and Preserving Big Data for Research will take place next week, 4-6 March.
    There will be a pre-GDB on batch systems next Tuesday, and a GDB covering various update areas.
    There was an EGI OMB meeting last Thursday.
    Steve produced an overview of getting ARGUS working at Liverpool.
    GOCDBv5.2 was released last week. This release adds an extensibility mechanism which allows Services, ServiceGroups and Sites to be extended using custom key-value pairs (following the GLUE2 extensibility mechanism).
Tier-1 - Status Page
Tuesday 4th March
    There have been problems with part of the Virtual Machine infrastructure on the Tier1. This has caused problems for a number of services, including both FTS2 & 3. These are largely worked around now. However, we have asked Atlas to temporarily move their file transfers (for everything except UK transfers) off our FTS3 server.
    The software server used by the small VOs will be withdrawn from service (aiming for June).
    A replacement MyProxy server has been put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). This new service is called myproxy.gridpp.rl.ac.uk. Sites and VOs need to make appropriate reconfigurations to use this. We plan to turn the old one (lcgrbp01.gridpp.rl.ac.uk) off at the end of March.
CW will announce this (liaising with GS).
    Atlas disk space at the RAL Tier1 is full. One factor in this is a slow deletion rate that is being investigated.
    The Tier1 will move to use the new site firewall on Monday 17th March. We will stop FTS transfers while the change takes place (07:00-08:00) and stop new batch work from starting. Otherwise services will be up for the day but will be At Risk.
Documentation - KeyDocs
Some updated, please keep reviewing.
Interoperation - EGI ops agendas
(DC) Thanks for the feedback on monitoring amalgamation.
HC functional test update.
(AF) ATLAS and CMS methods are similar.

Monitoring - Links MyWLCG
Tuesday 4th March
    Summary on HC functional tests
    Overview of feedback
On-duty - Dashboard ROD rota
Monday 3rd March
    A new dashboard is available for testing.
    The ROD rota has been extended to April
    Brunel sub-sites caused a problem leading to EMI-3 APEL alarms which is now fixed.

Security - Incident Procedure Policies Rota
Tuesday 4th March
    Questionnaires have been produced for EGI federated cloud sites.
    The next security team meeting is Wednesday 5th March at 11am.
Services - PerfSonar dashboard | GridPP VOMS
Tuesday 4th March
    The full UK perfSONAR view is given on this dashboard.
    When perfSONAR is performing in a stable fashion the site will appear on the main monitoring page.
Monday 3rd March 2014, 14.30 GMT
44 Open UK NGI tickets this week.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
ILC moving to cvmfs, so those of us seeking to continue support will need to enable it. IC and Cambridge have already moved and been confirmed working. It might be easier if we collate any other sites who have moved into a single list to give to ILC. The working plan is to open tickets against sites who haven't moved after giving them a suitable grace period. In progress (26/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=99556 (6/12/13)
The NGI Argus ticket. There's been great progress on this, can we reflect some of this in the ticket? Or perhaps close it if we're satisfied. In progress (13/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101491 (23/2)
The RAL perfsonar latency box is being troublesome. It crashed and was brought back up again, but has crashed again so Duncan has reopened the ticket. Reopened (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101716 (28/2)
This cms transfer ticket has INFN as the "notified site", surely it should be RAL-LCG2 instead? I didn't change it myself in case I missed some nuance. Transfer problems appear to be linked to the virtualisation problems RAL have been experiencing affecting FTS3. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101729 (1/3)
LHCB pilots failing on a RAL CE. Being looked into. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101701 (28/2)
ILC having troubles with the RAL ARC CEs. Looks to be a user group for ilc (production) missing. In progress (28/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101052 (6/2)
Biomed having trouble retrieving results from RAL cream CEs. Tracked down to the RAL EMI2 argus not handling RFC proxies. An update to EMI3 is hoped to fix this, although Dan reports that this isn't the case at QM (see 101639). In progress (27/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101532 (25/2)
LHCB noting that RAL is publishing the default MaxCPUtime. Fixed but Orlin notes some caching behaviour. Maria AP chimed in that you might have a buggy bdii version in the chain. In progress (26/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100114 (8/1)
Chris W's ticket concerning jobs failing to get from RAL to Imperial. Catalin asked for some testing, but Chris has been busy. The ticket hit its second reminder though. Waiting for reply (11/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97025 (3/9/13)
Longstanding myproxy issue. Andrew reports that the new myproxy service is up and running, so I assume this ticket can be closed soon? Or at least put back in progress. On hold (25/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
ARC CEs having a default SE of 0 and not being able to tune this per VO. Andrew is figuring out a fix to this. In progress (25/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
cvmfs for Sno+. Ticket on hold whilst tarballs are created. Been that way for a while. On hold (29/1).
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100569 (28/1)
ECDF's perfsonar box refusing MA connections. Wahid has rebooted the box but no joy, Duncan linked some instructions as requested. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=99794 (16/12/13)
Access to the ECDF perfsonar pages. There's a big ACL overhaul going on at the moment, Andy apologises and will chase the central IT chaps about it. On hold (28/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101659 (27/2)
44444 jobs publishing on some ECDF CEs (as part of information system cleanup campaign). These CEs are due for retirement (replicant style) today, so this and the related tickets will be done with soon. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100840 (29/1)
Apel-Pub nagios test failures at ECDF. The guys are working on it, but sadly the ticket is escalating. Daniela posted a note that if you have a support ticket with APEL open (which I think is advisable) to link that into this ticket. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)
glexec deployment ticket. The ECDF lads are waiting on the tarball (i.e. me). Still. On hold (27/1)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101726 (1/3)
LHCB ticket about the default CPU time (999999) being published at RALPP. I thought that RALPP had solved something like this recently, but maybe I dreamt it? Assigned (1/3) Update - Solved, something was being published that shouldn't be any more.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101727 (1/3)
Info system cleanup campaign, 4444444 job at RALPP. Assigned (1/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101398 (19/2)
LHCB would like xrootd holes poked in the RALPP firewall. As mentioned last week I believe this requires holes poked in the RAL firewall, which is undergoing an overhaul. This ticket could do with some attention mentioning these problems, and possibly putting on hold. In progress (19/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101136 (11/2)
Request to upgrade the RALPP perfsonar to the latest version. Due to a lack of hands on deck Chris postponed this work, with a reminder date of today. On hold (21/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101367 (18/2)
A cms user having trouble srmcping in his jobs at IC. Looks to be a java 1.7 mismatch problem. Simon has asked some questions, no answer yet (user has set notify to "on solution" so might not have got the update). Waiting for reply (24/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101752 (3/3)
LHCB jobs having problems at Durham. Ewan S. has asked if the problems persist. Waiting for reply (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101763 (3/3)
Part of the campaign to clean up the information system, Durham have been asked to update their BDIIs (site and resource) to not-buggy versions. Assigned (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101177 (12/2)
Durham trying to wash the biomed out of their SE's information system. No joy yet. I advise asking at the storage meeting if stuck. In progress (26/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=99621 (10/12/13)
enmr noticed a bad WN, which was promptly quarantined. It hasn't been fixed, but I maintain that the problem itself is contained and solved if you want to close the ticket... On hold (28/1)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101710 (28/2)
Nagios SRM-Put test failures. The problem is known (it's DPM being odd with its space reporting whilst a pool is readonly; Sam describes it better). In progress (28/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
LHCB sees that Glasgow is also publishing default max CPU time for some (all? one?) of their queues. Sam points out that this is on purpose (due in part to multicore jobs, jobs are limited by Wall time only), and asks if LHCB can't make educated guesses. Stefen replies with a point about the difference in "MaxCPUTime" and "MaxTotalCPUTime", but I'm not sure that covers the Glasgow concerns. Worth discussing to get a UK stance on this. In progress (3/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100568 (28/1)
Perfsonar MA problem. Raul has been working steadily at this and it looks to be progressing nicely. In progress (28/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101676 (27/2)
One of QM's perfsonar boxes is having problems, missing services. Likely to be caused by running a bleeding edge version of perfsonar. In progress (27/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101682 (27/2)
Brian has asked for a SE dump of QM atlas files. Assigned (27/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101557 (25/2)
Matt from SNO+ having trouble on a QM UI, delegating proxies to the FTS. The same works on lxplus though. This ticket needs a home, but there's an argument that it isn't a site problem (as a UI isn't necessarily part of a site). Assigned (26/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=94746 (10/6/13)
Biomed haunting the QM SE's info system. I believe Chris is waiting on his changes to seep into the Storm release (100290). On hold (14/1)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101669 (27/2)
lhcb ticketed Bristol, but the CE in question is in scheduled downtime. Possibly worth keeping this open whilst downtime is on to avoid a duplicate. In progress (27/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101516 (24/2)
Bristol's perfsonar ticket. Bristol upgraded which seems to have solved some of their problems, but their other server is having trouble now. Maybe the same again will fix it? In progress (25/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
glexec at UCL. No news for a while from Ben. Daniela reminds him that the EMI3 upgrade is also imminent. On hold (26/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
A perfsonar ticket for UCL. A power outage looks to have brutalised their box. No word yet on if Ben has been able to save it. On hold (22/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101374 (19/2)
Sheffield's LHCB maxcputime ticket. Elena has set in progress but no news. In progress (25/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
A perfsonar ticket for Sheffield, whose perfsonar needs updating. No news for a while. On hold (3/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/13)
Lancaster's glexec ticket. Whilst there's been some progress in the glexec tarball (not as much as there should be, as tarball time keeps being redirected, particularly with EMI3), no movement on the ticket. On hold (31/1)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Lancaster suffering Poor Perfsonar Performance (I couldn't resist the childish alliteration). It doesn't seem to be an artificial cap (the rate has peeped over the 1Gb/s mark now and again). Looking for bottlenecks, but not had any time to investigate. On hold (17/2)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
LHCB jobs failing at JET due to openssl problems. No progress for a while, after the JET guys exhausted everything. On hold (11/2)
EGI OMB news/updates (15')   
Quick overview of EGI OMB on 27th February
Notes from Jeremy Coles
Batch systems (10')       
* The pre-GDB workshop agenda: https://indico.cern.ch/event/272785/
CW is also going to the pre-GDB, so please provide feedback.
* The GridPP status: https://www.gridpp.ac.uk/wiki/Batch_system_status
* Issues we would like raised.
* Questions we would like asked or discussed.
AOB (1')
* Cross-check that the LHCb ARC issues are now resolved
(Link does not work on IE)

* UK CA SHA-2 switchover. 2 switches need to be flipped so this can happen quickly. We are following the problems at CERN.     
Actions review: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
Chat Window:
[11:03:04] Jeremy Coles Brian is taking minutes today. Ewan next week.
[11:07:10] Daniela Bauer FTS debug
[11:14:22] Tom Whyntie Not from CERN@school, sorry - all fine at the mo!
[11:16:44] Elena Korolkova Info from atlas; All new transfers will now use CERN FTS3 except for UK endpoints which remain on RAL FTS3. We are letting the non-UK transfers drain rather than cancelling them.
[11:17:41] Robert Frank Andrew will be a bit late
[11:21:27] John Hill It's OK for me
[11:29:56] Steve Jones ARGUS Central Banning: https://www.gridpp.ac.uk/wiki/Argus_Server#Configuring_Argus_for_Central_Banning
[11:30:40] Christopher Walker Gareth, I've been on holiday, but have you announced the myproxy change to users?
[11:30:50] Christopher Walker Also the software area change.
[11:38:46] Alessandra Forti there's a meeting tomorrow
[11:38:53] Ewan Mac Mahon Everything's fine, Linda's on this week, I'm testing people's banning setups.
[11:40:04] Wahid Bhimji I will try and do ECDF (perfsonar upgrade ) this week - thanks for the instructions
[11:41:08] Jeremy Coles Hi Wahid - Incidentally earlier in the meeting Raja mentioned a problem with pilots for LHCb at ECDF. I'm sure you are aware but just closing the loop.
[11:42:40] Elena Korolkova After talking to Peter I added multicloud option to RHUL, Shef, Cam, Bham and Ox.
[11:43:01] Elena Korolkova I hope it will help to get prod jobs.
[11:47:37] Wahid Bhimji yeah will upgrade this week
[11:47:59] Wahid Bhimji (PS Elena why not add ECDF to multicloud)
[11:48:31] Wahid Bhimji I closed that one yesterday (probably after you looked)
[11:49:23] Ewan Mac Mahon Why aren't all the sites multicloud anyway? As far as I can see the option is equivalent to "Do you want this site to break in the event of RAL problems? Y/N"
[11:49:31] Wahid Bhimji not really waiting on you
[11:50:04] Wahid Bhimji in that we'd rather you never did it (as mentioned in the ticket). I might also close the glexec ticket as no point in a ticket for something that can't be actioned
[11:50:37] Daniela Bauer https://ggus.eu/index.php?mode=ticket_info&ticket_id=101367 is closed - the user was happy to use lcg-cp instead.
[11:51:45] Elena Korolkova @Ewan: there is a move to remove the concept of clouds in atlas
[11:52:23] Alessandra Forti the problem is the network connectivity to the T1s
[11:52:55] Wahid Bhimji OK meanwhile why not make all UK T2Ds (at least) multicloud
[11:53:16] Alessandra Forti I thought that's what Rod did when I complained last week
[11:53:35] Alessandra Forti when I checked all the sites were starting to get jobs
[11:54:10] Ewan Mac Mahon We still have basically no ATLAS work at Oxford. We're mostly alice, some LHCb, and a smattering of CMS.
[11:54:56] Sam Skipsey For reference with LHCb MaxCPU ticket, I am following Jeff Templon's example: https://ggus.eu/?mode=ticket_info&ticket_id=101322 here
[11:55:28] Ewan Steele https://ggus.eu/index.php?mode=ticket_info&ticket_id=101177 is now resolved
[11:56:39] Elena Korolkova @Wahid : sorry, I missed ECDF. It didn't look empty last week.
[11:56:53] Elena Korolkova I've added it now.
[11:58:28] Wahid Bhimji thanks Elena -
[11:58:38] Wahid Bhimji got to skip out now sorry ...
[11:59:01] Christopher Walker SE dump provided
[12:03:10] Christopher Walker I've had another go at uploading the file and it isn't there still.
[12:04:06] Ewan Mac Mahon Put it on your SE  
[12:04:14] Jeremy Coles Is that for 94746 Chris?
[12:05:24] Elena Korolkova perfsonar will be updated this week
[12:05:28] Ewan Mac Mahon If you're trying to attach it to the ticket, is it under the GGUS size limit? The Oxford dumps are always well over - I'd be slightly surprised if yours were smaller.
[12:05:45] Elena Korolkova and lhcb i hope today
[12:06:18] Christopher Walker No idea - but I didn't get a warning - but that's a good point.
[12:07:47] Elena Korolkova I put a dump on the website and gave a link in the ticket
[12:10:30] Chris Brew That sounds like a good reason to spec a 10G card for my laptop  
[12:11:12] Alessandra Forti Brian: a link to Manc dump is in the ticket now resolved.
[12:11:19] Sam Skipsey I think that only works if you are sufficiently close to the pursestrings, Chris  
[12:12:03] Chris Brew I control the departmental laptop budget
[12:12:57] Matt Doidge https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639
[12:13:31] Matt Doidge I missed it as it was submitted 20 minutes after I started looking at tickets
[12:14:22] Ewan Mac Mahon Chris - looks like you might actually be able to do that with a mac and a thunderbolt adapter; there are hints of them existing in 10GbE versions.
[12:14:37] Chris Brew http://www.small-tree.com/Thunderbolt_Products_for_Mac_OS_X_s/192.htm
[12:15:00] Chris Brew I've just ordered a new macbook pro
[12:15:48] Chris Brew They all seem to be external boxes rather than a small dongle  
[12:16:23] Alessandra Forti that sad face is really sad
[12:16:26] Ewan Mac Mahon Indeed, they look like they're basically thunderbolt to PCIE adaptors with a 10GbE card plugged in.
[12:16:45] Ewan Mac Mahon Still, quite some bragging rights.
[12:23:34] Raja Nandakumar http://lhcb-web-dirac.cern.ch/DIRAC/LHCb-Production/undefined/grid/SiteStatus/display?name=ARC.RAL.uk
[12:25:14] Christopher Walker No new issues, but should the atlas pilot jobs use sha2
[12:26:03] Alessandra Forti I think they are still in the middle of moving away from VOMRS which is not sha2 compatible
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - Tier-1 status - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates
    • 11:40 11:55
      EGI OMB news/updates 15m
    • 11:55 12:05
      Batch systems 10m
    • 12:05 12:06
      AOB 1m