Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 14 0782 with code: 4880.

Apologies: Kashif, Pete, Daniela, Elena
Dave Britton got a response from RT re gridmon boxes - can just leave them in the rack.

LHCb: Raja

Nothing much to report from the LHCb side. Had a few problems over the last week, but primarily at CERN and with application software. Accounting went down over the weekend; nothing affecting the UK. UK jobs are running fine, save for the old problem with CREAM zombie jobs: right now the CEs at RAL report 2000 jobs running, but only 1000 are actually running.

CMS: Stuart, Chris

Nothing to report

ATLAS: Alessandra

Few things.

1) How to control memory leaks - what I said last week is not possible if you are overallocating jobs. We are now looking at how to limit this from the ATLAS perspective. Can set a memory property in the ATLAS broker. Glasgow has a mixture of these parameters. Set it for Manchester to be larger than Glasgow/Oxford. Will check on it in a few days.

2) This morning ATLAS installed a new DDM client which seems to have a problem; some of the sites in the UK have been set offline because of this. A new client is being developed. Sites offline today need to be taken into account.

3) Will be preparing end of month report tomorrow

4) Sheffield has had a power cut.

Dave: Glasgow catching up on accounting

Other VOs:

T1 update

Waiting for CVMFS volunteer.

SCAS to ARGUS was done

Storage:

DPM and possible alternatives were discussed at PMB. Should pursue what we're currently contributing for DPM; need to know more about CERN contributions.

Accounting:

Remember to check HS06 values.

Documentation:

Update to KeyDocs: documentation stats on the different core ops pages, shown on bulletin board.

Steve making it easier to add site specific updates. A couple of problems where admins couldn't identify what needed to be kept up to date. So, created site specific stuff.

Admins to have a look at this and report back what they think.

Interoperations

EGI ops meeting: Stuart: only one point to raise. The point about EMI2 WN still exists, because part of what has been discussed is to allow people to move from gLite 3.2 SL5 to EMI2 SL6. Looking for a list of VOs that have tested the EMI1 WN package, to make sure all VOs are covered.

Expect to update CA package next week, in rollout now.

JC: Believes Brunel has SL6 cluster setup.

What's more important is to know who has tried it; the hard part is finding people who tried it and found it just worked - need specifics.

Monitoring:

SR:

Daniela has split out pages, see bulletin board for details for different areas: EMI1 rollout, EMI2 rollout and "State of the nation": discussion of versions across sites.

Stuart: note that gLite 3.1 WMS is not included for GLA, but will be retired. Note that 3.3.5 should be updated.

Andy: WN is out of date - what should we upgrade to?
Sam: Note that ECDF would need WN tarball install.
JC: There are some issues with WN tarball, no testing in UK. Opportunity/gap where one site could test WN tarball.
Matt: Would like to try this, might not be for a bit.

Catalin: EMI1 WN releases: There is a test queue, which we announced. Not aware of how ATLAS are doing with this.
Brian: We've had a suggestion that the main VOs don't
JC: Maybe still recommended not to upgrade.
JC: Follow up with ATLAS about preparedness

Security:

Security discussion

Services:

PerfSonar -

Mark: We should do some larger scale testing. Couple of things to roll out. There's a file we can install on the bandwidth monitor to rate limit. We can't put that out on TB-SUPPORT, so will send it out to Jeremy, Duncan and Alessandra. From there we can start doing trial tests. Will send an email out shortly.

VOMS -

JC: What's the status of VOMS?
Andy: Running as before, services hosted by IT services department.

Tickets:

See bulletin board

JC: Neurogrid?
Catalin: Progressing

A few snoplus tickets around software installs
Memory sizes.

JC: Any sites not marking tickets as in progress?
Matt: Happened once.
Snoplus tickets not being assigned

Ewan: Who is meant to do the assignment? Submitter?
JC: I'll check.

Sites roundtable:

Manchester: Alessandra: Memory issues, SL58.
Liverpool: PerfSonar is online, poised to upgrade everything to EMI1
RHUL: EMI1 CREAM
Lancs: Looking into Networking, ARGUS, EMI1
Glasgow: Infrastructure; 256 cores online. Storage. Federated xrootd testing, DPM updates needed for that.
ECDF: SRM update last week, seemed to go OK, failed a few tests. Looking to upgrade a few middleware services. PS, 2 boxes up but need ports open.
Oxford: Thinking about moving test cluster over to SL2 EMI2 for SE, maybe WN, upgrade remaining DPM pool nodes to same level.
T1: Brian: Dark data cleanup under way, looking at SONAR rates for different experiments. Currently a gap for a CMS Perfsonar box in the UK.


AOB:


Transcript:


[10:59:41] John Bland joined
[10:59:44] Stuart Purdie joined
[11:00:06] Govind Songara joined
[11:00:16] Raja Nandakumar joined
[11:01:56] RECORDING David joined
[11:02:14] Jeremy Coles David is taking minutes.
[11:03:02] John Bland steve's on his way
[11:03:24] Stephen Jones joined
[11:03:34] Brian Davies joined
[11:03:37] Andrew Washbrook joined
[11:03:47] Alessandra Forti joined
[11:04:04] John Bland does it have to be turned on in this 'rack'?
[11:04:09] Jeremy Coles No!
[11:04:23] John Bland right then, I've got a nice big 'rack' ours can go in...
[11:04:46] Ewan Mac Mahon joined
[11:07:38] Gareth Smith joined
[11:07:57] Catalin Condurache joined
[11:11:38] Raja Nandakumar left
[11:14:00] Andrew McNab joined
[11:14:56] Mark Mitchell joined
[11:18:15] Stephen Jones https://www.gridpp.ac.uk/wiki/Separate_Site_Status_Pages#Site_Specific_Pages
[11:27:32] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html
[11:32:19] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
[11:40:44] Matthew Doidge Lancaster's also currently on 1g so this is relevant to our interests!
[11:42:24] Ewan Mac Mahon I think I got most of it.
[11:42:25] Andrew McNab fine for me
[11:43:27] Alessandra Forti nothing changed
[11:46:14] John Hill I have to leave - the gasman cometh (I hope...)
[11:46:19] John Hill left
[11:55:03] Ewan Mac Mahon So it's looking good, but it's been looking good before.
[11:55:21] Matthew Doidge Reading the ticket I thought Bright was an unhelpful local admin
[11:55:53] Sam Skipsey no, it's an unhelpful Cluster Vision management system.
[11:57:59] Chris Brew left
[12:00:33] Ewan Mac Mahon For reference that t2k-oxford ticket was: https://ggus.eu/ws/ticket_info.php?ticket=84487
[12:02:40] Matthew Doidge It could have been a one off, but it's worth keeping an eye out
[12:10:52] Catalin Condurache left
[12:10:57] Brian Davies left
[12:10:59] Andrew McNab left
[12:11:01] Ewan Mac Mahon left
[12:11:05] Alessandra Forti bye
[12:11:06] Matthew Doidge left
[12:11:07] Govind Songara left
[12:11:07] Alessandra Forti left
[12:11:15] Stuart Purdie left
[12:11:21] Sam Skipsey left
[12:11:29] Rob Fay left
[12:11:38] John Bland left
[12:11:39] Andrew Washbrook left
[12:11:50] Mark Mitchell left
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - Tier-1 status - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates
    • 11:40 12:00
      Site roundtable 20m
    • 12:00 12:05
      Actions 5m
      To be completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items Completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Completed_Actions
    • 12:05 12:06
      AOB 1m