Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting


Description
- This is the biweekly ops & sites meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 14 0782 with code: 4880.
Apologies: Kashif
Latest bulletin
Experiment problems/issues
========================== 
 
LHCb
----
 
Mark Slater reported that Durham is blacklisted. Manchester was blacklisted yesterday with CVMFS issues.
 
CMS
---
 
Stuart W had "not much to say"; the CMS security challenge has started, with at least Imperial involved.
Alessandra (as security person) has been invited to join the SSC and the coordination chat channel for the challenge. It was noted that the GridPP security response procedure requires an email to the NGI security list, which doesn't seem to have happened to date.
 
ATLAS
-----
 
Alessandra reported:
UCL has had storage problems since 24th August.
Durham … is as it has been. Both sites have limited manpower, so it was noted it might take time for them to sort things out.
 
The ATLAS job recovery seems to be leaving world-writable directories behind, which is not needed. Work is underway to get that restricted; a quick response is expected.
 
Network tests with FZK have problems: FZK shows poor rates due to firewall problems. There was some discussion on whether joining LHCONE is the best way to solve this, or whether other approaches are better.
 
Other
-----
 
Chris W noted that COMET were making progress in setting up, and asked people going to GridPP29 in Oxford to let Chris know in advance.
 
Looking for a candidate VO to try out the CVMFS Stratum 0 at RAL; SP said he'd poke the NA62 people about that.
 
Meetings & updates
==================
 
Stephen is experimenting with ways of updating all site-specific parts across multiple pages; this is just at the testing stage at the moment.
 
Trying to compile a list of areas in which GridPP is involved in technology development: GridSite, Ganga, APEL, DPM tools.
 
Agenda for next GDB up: https://indico.cern.ch/conferenceDisplay.py?confId=155072 
Anything to be raised/followed up should go to Chris (as T2 rep).
 
GPGPU questionnaire, deadline 13th September. Some discussion of what use the VOs will make of GPGPUs. It was noted that sites won't deploy until VOs explicitly want GPGPUs, and they don't at the moment.
 
Tier-1 status
-------------
 
One worker node from each hardware generation has been deployed into production as an EMI-2 WN, so real jobs should be seen on them over the coming week. EMI-1 WNs are being skipped, because the GDB noted that the effort should be behind EMI-2.
 
On-duty
-------
 
Stuart P notes he made an error and did not have the fact that he was on duty in his calendar last week. He caught up with the open tickets on Friday, after a COD ticket. Some problems with issuing tickets on Thursday were all tidied up on Friday.
 
Jeremy notes that we might need to look at how the rota is communicated. Daniela notes that there were COD-visible tickets this morning, and wonders whether John Walsh was aware that he was on duty.
 
Services
--------
 
The old CA certificate for the UK will expire, and a few bits of work are being done to ensure no problems arise from that.
 
ECDF's perfSONAR is installed and needs to be added (along with Imperial, Brunel and Glasgow). Sites with only one box should have the latency tests enabled, although people need to be aware that the results might be meaningless, as they can have significant noise within the signal.
 
Tickets
-------
 
Large deep analysis.
 
Matt noted that reminders don't seem to have been sent out.
 
UK
https://ggus.eu/ws/ticket_info.php?ticket=84408 (20/7)
Setting up of the neurogrid.incf.org WMS & LFC. Both have been put in place; Catalin wonders if the LFC can be tested. Waiting for reply (29/8)
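For reference, a basic smoke test of a new LFC can be done with the standard LFC client tools. The sketch below is illustrative only: the hostname is a placeholder (the actual neurogrid LFC endpoint isn't given here), and the path just follows the usual /grid/<vo> convention.

```shell
# Sketch of an LFC smoke test; assumes a valid grid proxy and the lfc-* clients.
export LFC_HOST=lfc.example.org                # placeholder for the new LFC endpoint
lfc-mkdir -p /grid/neurogrid.incf.org/tests    # create a test directory in the namespace
lfc-ls -l /grid/neurogrid.incf.org/tests       # list it back to confirm the write worked
lfc-rm -r /grid/neurogrid.incf.org/tests       # tidy up afterwards
```

A fuller test would register an actual replica (e.g. with lcg-cr), but the directory round-trip catches most configuration problems.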
https://ggus.eu/ws/ticket_info.php?ticket=80259 (14/3)
neurogrid.incf.org creation ticket. Nearly finished now. In Progress (29/8)
https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)
Brian's ticket to track older DPMs in the UK. Still have Durham, Bristol and Brunel to go at last update (but Brunel are retiring their old SE). On Hold (30/7)
https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)
Setting up the COMET VO. Registering in EU Ops Portal (ticket 85736), On hold till this is done (3/9).
https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)
Chris' ticket to change the reminder periods for the GridPP VOMS server. Assigned to Robert Frank; On Hold during the VOMS transition (28/8)
 
TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=85438 (23/8)
ATLAS were seeing FTS transfer failures from RAL. Some files have been corrupted; replacements may have to be fetched from tape. Waiting for Reply (31/8)
https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)
Biomed were seeing their Nagios tests fail to register files at RAL, but it looks to be a (peculiar) problem with their SAM jobs. Other units are involved. In Progress (3/9).
https://ggus.eu/ws/ticket_info.php?ticket=85023 (9/8)
SNO+ having trouble with one of the RAL WMSes. No reply after a request to attempt job submission to lcgwms02. Waiting for Reply (10/8)
https://ggus.eu/ws/ticket_info.php?ticket=84492 (24/7)
SNO+ having job-matching problems at RAL. Some odd behaviour, but In Progress (31/8)
 
GLITE 3.1 Upgrade tickets (14/8):
https://ggus.eu/ws/ticket_info.php?ticket=85189 (UCL) In Progress (29/8)
https://ggus.eu/ws/ticket_info.php?ticket=85185 (CAMBRIDGE) In Progress (29/8)
https://ggus.eu/ws/ticket_info.php?ticket=85183 (GLASGOW) On hold (14/8)
https://ggus.eu/ws/ticket_info.php?ticket=85181 (DURHAM) In Progress (On hold?) (14/8)
https://ggus.eu/ws/ticket_info.php?ticket=85179 (Brunel) In Progress (22/8)
 
UK/SAM/GOCDB
https://ggus.eu/ws/ticket_info.php?ticket=85449 (23/8)
Bristol cancelled an ongoing downtime but weren't brought out of it by the system, thus penalising them. Winnie is trying to find the cause of the problem and get back the lost uptime. Reset to "In Progress" after some ticket tennis (3/9)
 
PHENO/BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=85011 (28/8)
Pheno seem to be surprised that they have data on the retiring Brunel SE. In Progress (28/8)
 
SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
The Sussex Certification Chronicle. Jeremy wants to push getting Sussex out of downtime this week to avoid having to re-certify. In Progress (3/9)
 
UCL
https://ggus.eu/ws/ticket_info.php?ticket=85467 (24/8)
ATLAS transfer errors to UCL. Clock skew on the head node took some of the blame, but more failures are being seen with "Error reading token data header" messages. In Progress (30/8)
https://ggus.eu/ws/ticket_info.php?ticket=85549 (28/8)
Last of the User DN accounting tickets (the last child of 85547). In Progress (28/8)
 
DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=85679 (31/8)
se01 failing Ops tests.
https://ggus.eu/ws/ticket_info.php?ticket=85731 (3/9)
ce01 failing APEL Pub tests.
https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
ATLAS production failures. On hold as Mike expects slow progress (3/9).
https://ggus.eu/ws/ticket_info.php?ticket=83950 (7/7)
LHCb CVMFS errors. On hold (7/8)
https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)
SE Upgrade ticket. Probably should be On Hold. (28/8).
https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/2011)
CompChem job failures at Durham. On hold due to the other problems, but once out of the woods it is worth checking whether the problem persists. (8/8).
 
GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)
SNO+ were having problems with one of the Glasgow WMSes (twinned ticket to 85023). Stuart asked for the FQAN used for the jobs as the problems seemed VOMS-related, but no news since. Waiting for Reply (10/8)
https://ggus.eu/ws/ticket_info.php?ticket=83283 (14/6)
LHCb seeing a high rate of job failures, likely caused by CVMFS. Glasgow upgraded all their nodes to the latest CVMFS, but failures are still seen on the "high-core" nodes, correlated with high numbers of ATLAS jobs starting up. Investigation continues. In Progress (30/8)
 
Lancaster see something similar, but they only have 12-core boxes, and don't observe a correlation with the number of cores.
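For context, the per-node checks in this kind of CVMFS investigation are typically along these lines (a sketch using the standard cvmfs_config tool from the CVMFS client; the repository list is illustrative):

```shell
# Sketch: basic CVMFS health check on a worker node under suspicion.
# Requires the CVMFS client to be installed; run as root on the node.
cvmfs_config probe atlas.cern.ch lhcb.cern.ch   # try to mount and stat each repository
cvmfs_config stat -v lhcb.cern.ch               # cache usage, catalog revision, proxy in use
```

On the multi-core nodes in question, comparing the cache statistics under heavy concurrent job start-up against an idle node is one way to test the suspected correlation.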
 
OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=85496 (25/8)
LHCb had job failures that were not CVMFS-related (they reckoned a lack of 32-bit gcc RPMs or some OS difference). The problem seemed to evaporate though; did anything change? In progress, probably can be closed (31/8)
 
It is suspected that this was an environment setup issue.
 
IC
https://ggus.eu/ws/ticket_info.php?ticket=85524 (27/8)
Hone had problems submitting jobs through the Imperial WMSes due to "System load is too high" errors. Some magic was worked, and Hone see a massive improvement and propose to close the ticket. Can be closed (31/8).
 
LANCASTER (to my shame)
https://ggus.eu/ws/ticket_info.php?ticket=85412 (22/8)
JobSubmit tests failing on one of Lancaster's CEs. With help from LCG-SUPPORT, tracked to a desync between ICE on the WMS and the CREAM CE. The best solution is a CREAM reinstall, which is being planned. On hold (3/9)
https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
Lancaster's other CE isn't working well for ILC. Would like to reinstall, but will wait until ticket 85412 is solved. On hold (3/9)
https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)
Similarly, LHCb are having problems on the same node. Lancaster is suffering a ticket pileup. On hold (3/9)
https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)
T2K transfers fail from RAL to Lancaster. Looks to be a networking problem; with new routing to be put in place soon, hopefully the problem will disappear, as it has eluded understanding. On hold (3/9)
 
BRISTOL
https://ggus.eu/ws/ticket_info.php?ticket=85286 (17/8)
CMS transfers to Bristol failing. Winnie tracked it to a maxed-out data link. In Progress (20/8)
https://ggus.eu/ws/ticket_info.php?ticket=80155 (12/3/11)
SE upgrade ticket. Bristol are prepping for the upgrade, with a test server. On hold (17/8)
 
RALPP
https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)
ILC were having problems running jobs at RALPP. Needed a lot of configuration work, but progress made. In Progress (23/8)
 
RHUL
https://ggus.eu/ws/ticket_info.php?ticket=83627 (27/6)
Biomed seeing negative published space. Repeat of ticket 81439. Despite great efforts this remains unsolved so far. On hold (31/8)
 
 
Site updates
------------
 
 
UK NGI - monthly discussion
=========================== 
 
Revisit of NGI ticket assignment workflow
-----------------------------------------
 
There are a few small problems.
EGI broadcasts were not being sent to sites with multiple email addresses; this is now fixed.
ROD tickets (flagged as CIC tickets) have a different workflow and don't trigger the notification. The GGUS people say this should be done this month; the CIC people reckon it's next year, which is not so handy…
Nagios alerts have the same issue with multiple email addresses; fixed in update 19 of SAM/Nagios.
 
An additional point on tickets not raising reminders for Matt: Jeremy to check whether this is replicated, and if it is, a ticket will be raised.
 
UserDNs status
--------------
 
We would like sites to publish user DNs; 5 weren't, 4 of them now have, and UCL said they would but haven't had the opportunity as yet.
 
EMI WNs and non-LHC VO testing
------------------------------
 
A go-around to identify candidate sites that can set up EMI-2 SL5 WNs, looking for sites to match with SEs other than DPM. QMUL with StoRM and RALPP with dCache were 'volunteered'. It was noted that the lack of a tarball WN install is causing problems for some sites.
 
Sites with VO reps should poke those VOs to encourage them to test.
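As a sketch, VO testers can bypass the WMS and submit directly to one of the EMI-2 CEs listed in the agenda using the CREAM CLI. The JDL below is a minimal illustrative example, and a valid VOMS proxy for a VO the test queue supports is assumed.

```shell
# Sketch: direct CREAM submission to an EMI-2 test queue (here the RAL
# "gridTest" queue quoted in the agenda). Requires the glite-ce-* CLI
# and a valid VOMS proxy.
cat > wn-test.jdl <<'EOF'
[
  Executable = "/bin/uname";
  Arguments = "-a";
  StdOutput = "out.txt";
  StdError = "err.txt";
  OutputSandbox = {"out.txt", "err.txt"};
]
EOF
glite-ce-job-submit -a -r lcgce03.gridpp.rl.ac.uk:8443/cream-pbs-gridTest wn-test.jdl
```

glite-ce-job-status with the returned job ID then shows whether the job actually ran on the new WNs.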
 
AOB
===
 
Last reminder to register for GridPP 29 at Oxford.
 
 
Chat window:
 
[11:01:17] Mark Slater Raja wasn't at LHCb meeting - EVO's not happy so I'm going to rejoin
[11:01:26] Mark Slater left
[11:02:31] Duncan Rand joined
[11:02:46] Ewan Mac Mahon https://www.gridpp.ac.uk/wiki/Report_Security_Incident
[11:02:46] Ewan Mac Mahon It's also worth noting that the gridpp security response procedure requires an email to the UK Ngi Security team...
[11:02:57] Ewan Mac Mahon And that doesn't seem to have been happening.
[11:03:13] Andrew McNab joined
[11:03:43] Mark Slater joined
[11:04:06] Gareth Smith joined
[11:04:39] Ewan Mac Mahon Yes; I wouldn't blame sites for doing it wrong, but I think it's worth thinking about. The organisation has been less than smooth.
[11:05:21] Ewan Mac Mahon Which page are you looking at?
[11:05:46] Ewan Mac Mahon 'cos the one I linked says "Report by e-mail to: UKNGI-SECURITY at JISCMAIL.ac.uk and abuse at egi.eu and Your local site security team "
[11:06:13] Stuart Wakefield http://www.gridpp.ac.uk/deployment/security/inchand/
[11:06:16] John Gordon joined
[11:06:34] Ewan Mac Mahon Thanks; we'll clearly need to get that synced up.
[11:07:37] Wahid Bhimji joined
[11:08:03] Andrew Washbrook joined
[11:08:03] Andrew Washbrook left
[11:08:08] John Gordon Are we on ATLAS Issues?
[11:08:15] Mark Slater yes
[11:09:45] Andrew McNab left
[11:10:46] Andrew McNab joined
[11:11:15] Ewan Mac Mahon I rather think FZK should just stop putting low performance firewalls in high performance paths; we shouldn't all have to join LHC one to bodge round their silly networking.
[11:15:37] Sam Skipsey (Strictly: you can get firewalls that process that kind of traffic, but they're horribly horribly expensive.)
[11:15:46] Brian Davies same head set!
[11:15:47] Stuart Purdie Well, you _can_ get firewalls that can take the traffic - but they start at the $1million mark...
[11:16:23] Ewan Mac Mahon Indeed; but I think then using a low spec firewall in front of your storage qualifies as bollocksing it up - you just can't do that and expect it to work.
[11:16:51] Ewan Mac Mahon You have to do exactly what RAL-LCG2 did and not firewall the storage.
[11:16:58] Sam Skipsey No, I agree with you, Ewan. It's known that sticking (esp. stateful) firewalls in the way is a bad idea.
[11:17:16] Andrew McNab as far as I can tell, that cvmfs problem was just on a single machine here
[11:17:25] Daniel Traynor joined
[11:18:30] Mark Slater THanks Andrew - Just seen your ticket update as well. I suspect Vladimir will put you back online by the end of the day!
[11:18:46] Mark Slater I'll give him a kick if not...
[11:19:29] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
[11:22:00] raul lopes joined
[11:22:39] Jeremy Coles GridSite, Ganga, APEL but we have contributed a lot to storage, workload management etc even if they are not UK 'projects' per se
[11:24:25] Ewan Mac Mahon Running gridppnagios has felt quite a lot lilke alpha testing at times.....
[11:24:57] Ewan Mac Mahon Ooh; Cambridge and their lcg-ce/condor interface?
[11:25:25] Queen Mary, U London London, U.K. I will be unable to attend next week.
[11:28:43] Linda Cornwall joined
[11:32:31] Matt Doidge It should be possible to put some GPUs behind a CE at Lancaster
[11:33:03] Matt Doidge the shared cluster has some GPU nodes, and I'm sure I could wrangle access to a few of them
[11:33:05] Linda Cornwall http://www.gridpp.ac.uk/deployment/security/inchand/ IS VERY OUT OF DATE. I'm not sure where it links from. I will put link in to new page in there. You should use https://www.gridpp.ac.uk/wiki/Report_Security_Incident
[11:33:06] Andrew Washbrook apart from my TMVA GPU analysis presented at CHEP that is *shameless plug cough* 
[11:36:15] Linda Cornwall I can't edit http://www.gridpp.ac.uk/deployment/security/inchand/ The info is very out of date/from EGEE i.e. before EGI. so I'll act Neason
[11:43:02] Andrew Washbrook thanks
[11:43:42] Andrew Washbrook yep Glasgow is in the GridPP community
[11:43:56] Andrew Washbrook (added Gla this morning)
[11:43:57] Ewan Mac Mahon I can see the glasgow nodes now; I couldn't yesterday
[11:44:06] Ewan Mac Mahon (in the gridpp community)
[11:45:39] Ewan Mac Mahon At worst it's harmless but meaningless, at best, it's useful.
[11:46:01] John Hill OK
[11:46:02] Rob Harper Just checked with Chris and the RALPP perfsonar is currently on out test network, which isn't visible externally. We're working on making it public.
[11:46:14] Ian Collier Agree with Ewan, treat the results with scepticism.
[11:46:19] Rob Harper s/out/our/
[11:46:47] Ewan Mac Mahon OK; the Perfsonar should be on as similar as possible a connection as the gridftp endpoints, for preference.
[11:49:37] Rob Harper We will be migrating the rest of the site to the new net when we have it working, so ps and gridftp will be together.
[11:57:12] Ewan Mac Mahon Right; I'll have a closer look at Pana.
[11:57:17] Ewan Mac Mahon Panda, even.
[11:58:27] Andrew McNab left
[12:03:17] Jeremy Coles https://wiki.egi.eu/wiki/NGI-VO_WN_tests
[12:07:10] Ewan Mac Mahon I'll just check the dates, but yes, I won't be doing the update
[12:08:06] Ewan Mac Mahon OK; not sure exactly, but Kashif is certainly expected back 'soon'.
[12:11:35] Ewan Mac Mahon I think it's best to just have the NOTIFIED SITE field set for ordinary site tickets, and just have emails sent to the contact address(es) from the gocdb.
[12:11:41] Mark Slater I'm afraid I'm going to have to head - apologies!
[12:11:45] Mark Slater left
[12:20:17] Ewan Mac Mahon We'd have to check which if any of the test sites support na62 as well; I'm not sure we do.
[12:20:33] Ewan Mac Mahon We could, if required, I expect.
[12:20:37] John Bland I think Liverpool will be supporting NA62
 
 
There are minutes attached to this event.
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest WLCG security working group - Tier-1 status - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - SIte updates
    • 11:40 12:00
      UK NGI - monthly discussion 20m
      - Revisit of NGI ticket assignment workflow - UserDNs status - EMI WNs and non-LHC VO testing

      RAL Tier-1
      ************
      EMI-2 SL5 queue consisting of 4 worker nodes (32 job slots in total). It's behind the "gridTest" queue available on each CE:
      lcgce03.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
      lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
      lcgce07.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
      lcgce08.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
      lcgce09.gridpp.rl.ac.uk:8443/cream-pbs-gridTest

      UKI-NORTHGRID-LIV-HEP
      *****************************
      EMI2/SL5 CREAM, TORQUE and WN with 1 node, 10 slots (available for testing from 2/09).

      UKI-SOUTHGRID-OX-HEP
      *****************************
      EMI2 on SL5 test system behind the CE 't2ce02.physics.ox.ac.uk'. It's an EMI 2 on SL5 CREAM CE, with a pair of 8-core EMI 2 on SL5 worker nodes (so a grand total of sixteen cores).

      UKI-LT2-Brunel
      *****************
      EMI-1 CREAM, EMI-1 WN, glexec, Argus: dc2-grid-68, dc2-grid-70, dgc-grid-43.

      Beyond SL5
      *************
      One test cluster available at UKI-LT2-Brunel: EMI-2 CREAM (with glexec) running on SL6. It is dc2-grid-65 and has 16 job slots.
    • 12:00 12:05
      Actions 5m
      To be completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items Completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Completed_Actions
    • 12:05 12:06
      AOB 1m
      - Reminder and last chance to register for GridPP29: http://www.gridpp.ac.uk/gridpp29/.