Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 108203 with code: 4880.

Apologies: Mark M

Meetings & updates (20')     

- ROD team update
ROD Durham

Stuart: Durham Green, fine. Don't get tickets automatically - Kashif?

Kashif: Opened ticket, both CEs failing, but this morning saw things were fine, maybe transient, so closed.

Kashif: UCL. Problem was old, fixed, came back again. Opened ticket and have been in touch.


- Nagios status
Nagios:
-- Kashif's backup....
Ewan: Going to sign up for Ops VO, will do so as soon as practicable to become backup for Kashif

- Tier-1 update


Catalin: Issue two weekends ago which affected Castor. This weekend there was another half-day Castor downtime (Saturday/Sunday). Yesterday we seemed to understand the problem and applied a workaround.
At the moment things are starting up again. Also had a tape library microcode update.

Jeremy: Oracle issue affecting more than Castor. Is the throttling of FTS affecting Tier-2s?

Chris: Panda independently having issues, but not getting production this morning. Starting some now.

Alessandra/Elena: Clouds offline again yesterday. There are no backlogs at sites.

Catalin: channel limits at RAL set to 50%.


- Security update

Mingchao: Probes fixed so that all alerts on the dashboard should be genuine. Sites to report any false positives. Security incident at a site which is in process but not yet certified for EGI. Discussion on access controls.

ACTION Mingchao: Follow up information provided by dashboard and access control/information
-- T2 issues

Status of

UKI-SOUTHGRID-SUSX: in process of joining
Pete spoke to Emyr this morning - working on CREAM, setting up pool accounts and so on.

Jeremy: Fully UMD1?

Ewan: Might deviate for CREAM for SGE. The plan is to be as close to UMD as possible, noting where it is not.

UKI-ScotGrid-Gla-PPS

Stuart: Nodes turned off
Jeremy: Manual process to close.
Stuart: It's uncertified, probably should be closed.

PPS-RAL

Should be closed.

UKI-SOUTHGRID-RDG

Reading: Might be one that NGS need to deal with.

GILDA-NeSC

Jeremy: Set up for jobs as a test/training infrastructure. No longer there? Emailed listed contact, no response.
Sam/Wahid: Follow up with Steve Thorne

-- Networking.

From Mark: I was wondering if anyone has managed to record their outbound/inbound bandwidth figures for the month of October. I had mentioned this in September and several people came back to say they were going to do this. As October is nearly over, it would be great if the figures were made available next week to Jeremy or myself.
These will be used as a rough benchmark for a month's traffic usage by GridPP in the UK, for discussions with JANET on improving or monitoring pinch points within the collaboration's network environment. We should probably repeat this exercise for November to see if there are monthly differences.


Discussion of which sites are monitoring/keeping data (see transcript).
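
For sites that only have average link rates to hand (for example read off a Cacti graph), a rough monthly volume can be derived from them. The sketch below is illustrative only: the function name and the example rates are hypothetical and are not figures reported by any site.

```python
def monthly_volume_tb(avg_bits_per_second: float, days: int = 31) -> float:
    """Total volume in terabytes (10**12 bytes) for a sustained average rate."""
    seconds = days * 24 * 3600
    return avg_bits_per_second * seconds / 8 / 1e12


# Hypothetical October averages for one site's external link (not real data).
inbound_bps = 400e6
outbound_bps = 250e6

print(f"inbound:  {monthly_volume_tb(inbound_bps):.1f} TB")
print(f"outbound: {monthly_volume_tb(outbound_bps):.1f} TB")
```

The default of 31 days matches October; pass days=30 when repeating the exercise for November.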


- NGI

See John Gordon's email: "When we recently migrated UK sites from the UKI ROC to the NGI_UK APEL didn’t correctly recreate the T2 structure the way it was before. Before just getting this fixed I thought I should check how people would like to see it...."

- No dissenters.

- Tickets

Lots here to go through.....

Checking Red/Amber tickets for NGI_UK: http://tinyurl.com/6etw8gm

Or go to https://ggus.eu/ws/ticket_search.php and select Support Unit:NGI_UK and Creation date: Any and Status: open states - then click Go.


75671: LHCb/MissingLibrary/ECDF: Status should be changed to waiting for reply (Raja will follow up with Vladimir)

Space token size tickets: The ones that are still there still need fixing. Brian: How does a VO submit a change request to a site?

75488: CompChem/Durham: Sam to update this with progress

75395: Catalin to pass it to LB/WMS specialist at CERN/update ticket.

75393: T2K: Matt: Think it's been fixed.

75320: Enmr: David: We're looking into the intermittent lcg-tags failures.

74353: pheno: Jeremy will look into this.

 11:20         
Experiment problems/issues (20')     

Review of weekly issues by experiment/VO

- LHCb

Discussion with T2 sites about requirement for 1.1MHS06 of CPU time for reconstruction jobs. Some UK T2 sites have queues a little shorter than this.

UK performed well in reconstruction at Tier-2s. Next batch of reprocessing to start later this week. Daniela had an issue with CVMFS; current status is unclear.
Daniela: The Error message is not clear. It was a Nagios test - it was a warning not failure.
Alessandra: Nagios probes had problem this week with permissions, so could be fixed now.
Jeremy: Requirement for 1.1MHS06?
Raja: Think that sites have lengthened queues. Once Imperial LCG-CE switched off, OK there.
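
As a rough guide to what the 1.1MHS06 requirement means for queue limits, the arithmetic can be sketched as below. This assumes the figure is meant as HS06-seconds of CPU work per job, which the minutes do not state explicitly, and the per-core HS06 ratings are illustrative values rather than measurements from any site.

```python
def min_queue_hours(work_hs06_seconds: float, hs06_per_core: float) -> float:
    """CPU hours one core needs to deliver the given amount of HS06-seconds."""
    if hs06_per_core <= 0:
        raise ValueError("hs06_per_core must be positive")
    return work_hs06_seconds / hs06_per_core / 3600.0


# 1.1 MHS06, read here as HS06-seconds (assumption, see lead-in).
REQUIREMENT = 1.1e6

# Illustrative per-core ratings; substitute the benchmarked value for your nodes.
for rating in (6.0, 8.0, 10.0, 12.0):
    hours = min_queue_hours(REQUIREMENT, rating)
    print(f"{rating:5.1f} HS06/core -> {hours:5.1f} h CPU limit needed")
```

Slower cores need proportionally longer queues, so a site should check the limit against its slowest worker nodes.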

- CMS

Stuart: Couple of power problems at IC.
Jeremy: Issue at RALPP
Stuart: Problem with CREAM CEs going down and killing several jobs.
Brian: What's the status of the rate test to Imperial and RAL and PPD?
Stuart: Maxed out 10G link, no plans to do that again soon.
Jeremy: Plans to use regional redirectors for xrootd?
Stuart: Thought about using it for wide area transfers.
Sam: Wahid and I tried using this without success, so we'll be interested to see how this goes.
Brian: I've been asking DPM/CMS sites about success with xrootd.
Alessandra: We've used it in a local environment.


- ATLAS


Problems with RAL over the last week. All seem to be solved or in the process of being solved. Looking at production/analysis this morning, things look back to normal. Transfer rates are OK now, whereas yesterday they were at worst red and at best blue for the UK cloud.

Alessandra: RHUL CVMFS status?
Govind: Some things coming up - will talk to you.

CVMFS plans.
How to: https://www.gridpp.ac.uk/wiki/UK_CVMFS_Deployment

- Other

SNO+ are currently ramping up their usage of grid computing - current needs are modest and mainly CPU at T2s. It has an approved VO, please consider supporting it at your site.

Catalin: working on supporting it on LFC/middleware instances at RAL T1.

- Experiment blacklisted sites

- Experiment known events affecting job slot requirements

- Site performance/accounting issues

- Metrics review

The first assessed accounting period has finished and the next has started! Steve will rerun his algorithm later this week over the whole period (to pick up late uploaded data). Take a look at the table and report any major errors: http://pprc.qmul.ac.uk/~lloyd/gridpp/metrics.html.

No obvious gaps during the period shown in http://www4.egee.cesga.es/accounting/egee_view.php [selecting NGI_UK; show data for SITE as a function of DATE].

 11:40         
Site updates (10')     

- What are you currently working on at your site ...

 11:50         
HEPiX fall workshop (5')     

Agenda: https://indico.cern.ch/conferenceTimeTable.py?confId=138424#all.detailed

- Site status reports
- Scientific linux status

Continue to have security updates for all releases of SL 4, until February 2012. Continue to have fastbug updates for only SL 4.9 until February 2012.
Continue to have security updates for all releases of SL 5 and 6. Continue to have fastbug updates for only the latest releases of SL 5 and 6. Decommissioning SL 4 February 2012.


- Virtualisation and clouds
- LHCONE; perfSONAR; IPv6
- Thursday = Storage day
- Security updates
- Historical perspectives

 11:55         
Actions (5')     

- https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

 12:00         
AOB (1')     

- Please register for HEPSYSMAN http://hepwww.rl.ac.uk/sysman/ ... what topics would people like covered? There are few confirmed talks at the moment!

Discussion of possibility of talks for next week.  Ewan noted that free flowing discussion is often very useful.

- For the core team: Inform Jeremy if you are (not) attending the core team F2F meeting at QMUL on Friday 11th. If NOT please explain! Also the webpages for individual areas appear to lack updates from the October discussions: https://www.gridpp.ac.uk/wiki/Category:GridPP_Operations. ... for those attending have you booked accommodation for the night of 10th if needed?

Mark: Checking summary of network data - cacti data is fine.

Chat Transcript:

[11:00:06] Pete Gronbech joined
[11:00:08] Queen Mary, U London London, U.K. joined
[11:00:09] Govind Songara joined
[11:00:11] Elena Korolkova joined
[11:00:11] Rob Fay joined
[11:00:13] Santanu Das joined
[11:00:14] Stuart Purdie joined
[11:00:15] John Bland joined
[11:00:25] Daniela Bauer joined
[11:00:56] Catalin Condurache joined
[11:01:22] Chris Brew joined
[11:01:41] Rob Harper joined
[11:01:46] Raja Nandakumar joined
[11:01:58] Wahid Bhimji joined
[11:02:09] Daniela Bauer Duncan is on holiday.
[11:02:43] Stuart Wakefield joined
[11:02:56] Alessandra Forti joined
[11:03:00] Mark Slater joined
[11:03:21] Ewan Mac Mahon joined
[11:05:20] Ewan Mac Mahon :-P
[11:06:02] Matthew Doidge joined
[11:06:59] RECORDING David joined
[11:07:57] raul lopes joined
[11:08:34] Mingchao Ma joined
[11:10:29] Ben Waugh joined
[11:13:07] Mingchao Ma CVE-2011-0536
[11:14:02] Elena Korolkova Because of the problem in RAL UK cloud was set in brokeroff site twice: on Sunday and on Monday. On Monday I said to the shifter that he should set UK cloud online and he did that but three hours later all clouds have been set offline until 5 am today because there was a huge backlog of files which should be registered.
[11:14:35] Elena Korolkova This problem was caused by the error in schedconfig for FR T1
[11:14:38] Mingchao Ma https://operations-portal.egi.eu/csiDashboard/issues/ngi/NGI_UK
[11:16:22] Mingchao Ma https://operations-portal.egi.eu/csiDashboard
[11:16:56] Stephen Jones I also have no access to the dashboard.
[11:17:21] Rob Harper First link no, 2nd link yes.
[11:17:48] Govind Songara Only 2nd link OK for me
[11:18:00] Andrew McNab both links work for me, just showing NGI_UK sites
[11:18:25] Stephen Jones First, link no, 2nd link "no data"
[11:20:18] Ewan Mac Mahon I'm not sure I'm security officer for all of those sites.
[11:20:26] Ewan Mac Mahon Let me check the gocdb......
[11:21:15] Rob Harper No, you're not on RALPP
[11:21:51] Elena Korolkova left
[11:22:23] Elena Korolkova joined
[11:24:11] Mingchao Ma https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts
[11:25:09] Mingchao Ma https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/glibc-2011-04-12
[11:31:25] Alessandra Forti there is also a manchester local that shouldn't even exist.
[11:33:01] Wahid Bhimji we are collecting statistics for ECDF.
[11:33:19] Mark Slater We're also doing it for BHAM
[11:33:20] Rob Harper We didn't record this.
[11:33:34] Matthew Doidge Lancaster's got a new Cacti, can collect data for November
[11:33:37] Mark Slater I've still got to check the data for BHAM though....
[11:34:50] Rob Harper (Though we have cacti, so if data from there is good enough we can do something)
[11:35:32] Sam Skipsey Cacti data is useful, Rob, as long as you have data on external link.
[11:35:38] Sam Skipsey But email Mark  
[11:36:48] Brian Davies joined
[11:36:57] Alessandra Forti left
[11:37:02] Alessandra Forti joined
[11:39:48] Elena Korolkova it's a request to install CVMFS until the end of the year.
[11:40:19] Sam Skipsey Elena: the tickets we're talking about are Brian's Spacetoken tickets, not Alessandra's CVMFS tracking tickets.
[11:40:24] Sam Skipsey Although both are tracking tickets...
[11:40:42] Elena Korolkova It's not a problem for ATLAS that sites doesn't have CVMFS installed ATM
[11:41:10] Stephen Jones What does "red" mean? Once I schedule it, it's not urgent anymore. It'll be done in due course.
[11:41:15] Elena Korolkova Even spacetokens
[11:41:25] Sam Skipsey "red" means hasn't had any updates, I think.
[11:41:50] Stephen Jones Then I'll update to say "it'll be done in due course" and that's that.
[11:42:05] Elena Korolkova I think red means that tickets are open for more than n days.
[11:42:18] Elena Korolkova I don't know the n.
[11:47:25] Jeremy Coles Red means no updates in 5 days. It is not a very good or accurate system and we are giving feedback to GGUS.
[11:50:56] Alessandra Forti 48 hours or more
[11:51:02] Alessandra Forti the reco ones
[11:51:10] Sam Skipsey Yeah, on a modern system, they'll need 2 day queues.
[11:54:49] Alessandra Forti atlas us T3s have a similar scheme
[11:56:57] Alessandra Forti /us/US/
[11:57:56] Wahid Bhimji stuart/ please do let us know how it goes. Would be interested to try it out / help whatever
[12:00:23] Elena Korolkova many atlas jobs failed with timeouts errors for file transfer
[12:00:39] Elena Korolkova that 's because of all the problem
[12:03:19] Rob Harper GGUS Ticket 75410
[12:06:58] Mark Mitchell joined
[12:11:54] Alessandra Forti it never was like that
[12:11:59] Ewan Mac Mahon IMHO, of course.
[12:12:43] Matthew Doidge I concur!
[12:15:35] Ewan Mac Mahon *tumbleweed*
[12:15:49] Ewan Mac Mahon Does anyone actually know anything about vidyo though?
[12:18:07] Jeremy Coles A little but we have no access to test at the moment. More in a few months!
[12:20:43] Wahid Bhimji bye
[12:20:44] Wahid Bhimji left
[12:21:01] raul lopes left
[12:21:06] Mohammad kashif left
[12:20:42] Wahid Bhimji bye
[12:23:55] Santanu Das bye
[12:24:19] Alessandra Forti bye
