Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234.
- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +442030510622
-- The meeting extension is 9308582.

Apologies:

Participants
------------

A Forti, D Traynor, D Bauer, D Crooks (minutes), F Melaccio, G Qin, G Roy, G Stewart, J Coles (chair), J Hill, J Bland, M Doidge, M Raso-Barnett, W Bhimji, S Skipsey, C Walker, C Condurache, C Brew, R Nandakumar, A McNab, S Jones, P Gronbech, R Frank, B Davies, E Korolkova, G Songara, I Loader, O Smith

Tuesday, 11 November 2014
11:00 - 11:20 Experiment problems/issues 20' 
Review of weekly issues by experiment/VO

- LHCb

Raja: We just started our next restripping campaign. For now primarily T1s. Test of campaign: first running over 2000 input files from 2012. Friday should have confirmation that everything OK. If so, full flow by Friday or earlier. Grid: otherwise quiet. Monte Carlo: struggling with fixing issues with ROOT 6. Latest versions are just about to come out, should be safe to run LHCb stack. Soon start 2015 MC run. UK everything fine.


- CMS

Daniela: All quiet. Looked yesterday, Bristol green all last week. 

- ATLAS


-- Summary of the user threaded jobs status
-- MC status at sites.

Elena: MC Production. Bunch of multicore jobs but grid not too busy in recent weeks. DC14 request sent to production system. Wait for more jobs to come. New pilot. Problem observed with validation jobs for multicore queues solved. Rucio migration: ATLAS is at final stages. Only remaining step is to have all data managed by Rucio. Migration plan is to migrate 6 million data sets, which will take a long time. Plans to start it in mid Nov. Operations: there are no issues as far as I can see.

RHUL, ECDF and Sheffield are now running latest multicore jobs submitted. 

- Other

Chris W: Couple of chats with Matt, ticket on srmcp [discussed later]. Issue: WMS will not renew a proxy on ARC CEs. Might cause some small VOs an issue. Talk about it at T1 meeting.

Jeremy: Sounds reasonable to bring it up there. 

Chris: TB-SUPPORT, renew proxy manually as workaround. Fix WMS would be ideal. 

Steve Jones: WMS or ARC?

Chris: Currently WMS every so often renews proxy on CREAM. AIUI can't/won't do this on ARC. Don't know how this affects DIRAC. [see ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=109915 ]

Steve Jones: Get WMS/ARC devs to talk about this. 

Jeremy: Ticket is a good route for this. Different topic: status of cern@school data?

Chris: As of last week, didn't have data from satellite but did from schools. 

-  DIRAC status
-- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view .

Andrew: All cloud/VAC sites running happily. Gone through LCG sites, added all in GridPP to monitoring, including some not listed in DIRAC like UCL. What sites do we know don't run ordinary GridPP LCG batch jobs? I'd like to know before I check why they're not running DIRAC jobs.

Jeremy: Do you mean which don't support Gridpp VO?

Andrew: Yes

Jeremy: Followup through Chris's page [http://pprc.qmul.ac.uk/~walker/votable.html] RAL/UCL/ECDF/Birmingham/Sussex not showing support. Could request some of those sites to support it. Any comments?

Chris: Page updates every hour or two, just looking in BDII. Just about possible that it will return a positive result from SE supporting VO but not CE.
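The false positive Chris describes can be illustrated with a small filter. This is a minimal sketch, not the logic of the votable page itself: the record layout, the site names and the function name are hypothetical, and real BDII data would need an LDAP query that is not shown here.

```python
def sites_supporting(vo, records, service_types=None):
    """Return sorted site names publishing `vo` for any service whose type
    is in `service_types` (all service types if None)."""
    sites = set()
    for record in records:
        if service_types is not None and record["service"] not in service_types:
            continue
        if vo in record["vos"]:
            sites.add(record["site"])
    return sorted(sites)


# Hypothetical published records: SITE-B supports the VO on storage only.
records = [
    {"site": "SITE-A", "service": "CE", "vos": ["gridpp", "atlas"]},
    {"site": "SITE-A", "service": "SE", "vos": ["gridpp", "atlas"]},
    {"site": "SITE-B", "service": "CE", "vos": ["atlas"]},
    {"site": "SITE-B", "service": "SE", "vos": ["gridpp", "atlas"]},
]

any_service = sites_supporting("gridpp", records)
ce_only = sites_supporting("gridpp", records, service_types={"CE"})
print(any_service)  # SITE-B shows up because of its SE
print(ce_only)      # restricting to CEs drops SITE-B
```

Restricting the check to CE records is one way to avoid counting a site that only publishes the VO on its SE.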

Matt RB: I would for Sussex but don't know what I'd have to do.

Jeremy: Can point you at that. 

Andrew: Assume that if sites advertise gridpp that it should work for DIRAC as well. Won't pester people who don't claim to support gridpp vo, but will follow up if you do.

Chris W: Experience shows that there are a few ways it might not work. 

Jeremy: Follow up with RAL later. Any other sites? ECDF: [Wahid commenting in transcript - rather not support it]. 

Jeremy: Any other sites with plans for cloud-like infrastructure?

Steve Jones: Plans for Vac, no timeline. Intention at the moment, added to worklist.

Andrew: If you have a machine I can help, it's very easy to start with one machine.

Jeremy: From Friday, plans for CMS to start using Vac/Vcycle - which site would do that, T1?

Andrew: T1 doing it with Condor based VM solution. Offered to set it up on Vcycle instance at Manchester which manages the GridPP Vcycle tenancy at IC. At the moment runs ATLAS and LHCb jobs in VMs. If I add CMS (VMs almost identical) can show 3 experiments running in same OpenStack tenancy with target shares mechanism sharing resources out. If one experiment not doing much work the others will take up the slack.
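The idea of target shares with slack being taken up by busier experiments can be sketched as a toy allocation. This is an illustration only and is not Vcycle's actual algorithm: the function name, the share values and the demand figures are all hypothetical, and shares are assumed to be positive integers.

```python
def allocate_slots(total_slots, target_shares, demand):
    """Split total_slots in proportion to target_shares, capping each
    experiment at its demand and re-offering unused slots to the
    experiments that still want more (hypothetical toy model)."""
    allocation = {name: 0 for name in target_shares}
    remaining = total_slots
    # Experiments that still want more slots than they hold.
    active = {n for n in target_shares if demand.get(n, 0) > 0}
    while remaining > 0 and active:
        share_sum = sum(target_shares[n] for n in active)
        granted = 0
        for name in sorted(active):
            # Proportional offer for this round, rounded down.
            offer = remaining * target_shares[name] // share_sum
            grant = min(offer, demand[name] - allocation[name])
            allocation[name] += grant
            granted += grant
        remaining -= granted
        active = {n for n in active if allocation[n] < demand[n]}
        if granted == 0:
            # Rounding left a few slots: hand them out singly, largest share first.
            for name in sorted(active, key=lambda n: -target_shares[n]):
                if remaining == 0:
                    break
                allocation[name] += 1
                remaining -= 1
            active = {n for n in active if allocation[n] < demand[n]}
    return allocation


shares = {"atlas": 50, "lhcb": 30, "cms": 20}
busy = allocate_slots(100, shares, {"atlas": 100, "lhcb": 100, "cms": 100})
quiet_lhcb = allocate_slots(100, shares, {"atlas": 100, "lhcb": 5, "cms": 100})
print(busy)
print(quiet_lhcb)
```

When every experiment has plenty of work the allocation follows the target shares; when one experiment only wants a few slots, the rest of its share is split between the others in proportion to their own shares.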

- Update needed for https://www.gridpp.ac.uk/wiki/GridPP_Cloud?

11:20 - 11:40 Meetings & updates 20'
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates

See bulletin with additional notes:

OLAs: As before but including clouds.

Jeremy: Is there a use case for "Monitoring=N and production=Y" in GOCDB?

Gareth Roy: Need "fake" glite-APEL publisher entry for ARC CEs as a use case for "Monitoring=N and production=Y" - need the entry for publishing but monitoring breaks if you set Monitoring=Y.

Ewan: Looking at OX, VOMS mirror set up as "Monitoring=N and production=Y". Monitoring tries to probe admin interfaces: backup VOMS don't have them, this might be why we need this.

Steve: Only thing LIV has in that state is ARGUS (M=N, P=Y). Don't know why, don't know why we would have changed this. We have ARC with a glite-APEL with monitoring=Y which seems to be OK - different from Glasgow? Might be that you can set M=Y and P=Y safely.

Gareth Roy: Set up a while ago, gave red nagios tests. They may have fixed it.

Chris Brew: RALPP ARCs with Monitoring=Y since they went in, no problems.

Gareth Roy: OK, might be changing something then.

Gareth Smith: Remember guidance on turning off monitoring, might have changed. Tempted to turn off monitoring for ARGUS. Monitoring assumes ports open purely for monitoring, not needed for functioning. Having ports open solely for monitoring is debatable. We have a couple of cases that we're looking at like this. 

Jeremy: Noted that Robert had been trying to speak but it came over at very low level. 

Jeremy: Anyone interested in SME areas please get in touch following EGI request. Record what we're doing in this space - Glasgow and QMUL.

Jeremy: HTCondor workshop, Ian/Andrew/Lukasz registered. Email me if interested in attending.

Pete Gronbech: People should register for HEPSYSMAN.

Chris W: We need list for lunch numbers.

Gareth S: A couple more registrations came in just now. 

- WLCG ops coordination

Alessandra: Feedback on survey welcome until tomorrow. 

GFAL2
------

Jeremy: Worth noting that GFAL not supported as of 1 Nov; worth checking where our other VOs are with this.

Chris W: SNO+ stuck with GFAL2 not working with Castor.

Brian: Might be fix for that. Trying at RAL to get correct versions of the gfal libraries on UI. Apparently gfal 2.7 should have that working but need to confirm. Will affect any VO trying to copy from Castor to file.

ATLAS MULTICORE
---------------

Steve Jones: Half LIV cluster on multicore has been somewhat underused recently - not down to us that things weren't running, just didn't attract jobs to fill it.

Alessandra: Not referring to you, referring to sites that haven't set up multicore yet. 

Jeremy: Was Steve asking why set it up if you don't get the jobs?

Steve: Yes, for the record we'd have got more accounting credit if we hadn't had it set up. 

Alessandra: I agree, it has been repeated several times that sites are empty. There is also a problem on inefficiencies which needs to be looked at. I fed back 5 pages of notes on how multicore is going. 

Steve: Have had some issues setting up for small VOs which has contributed.

- Tier-1 status
- Storage and data management
- Accounting
- Documentation

KeyDocs to be updated:

Security Team 
Monitoring Links (Alessandra)
Grid Storage (Wahid)
Interoperation (David/Raul)
Wider VO (Chris W)
dCache (Brian)
Grid Services/BDII (Jeremy)

- Interoperation

David: Note that I'll be at HEPSYSMAN for next meeting.
Jeremy: That will affect a few people, might need to send apologies. 

- Monitoring
- On-duty
- Rollout
- Security

Security was discussed. 

- Services

Catalin: Sent email/reminder about CVMFS keys package which auto configures EGI domain for CVMFS. Most sites rolled out package, haven't heard of problems. Reminder to encourage sites to do it. Then it will be a simple way to maintain repos. Get in touch if you have questions but I don't foresee problems. Next step is to redefine VO software variable from gridpp to egi.eu. Transparent for everyone save system admins. Once I have UK done, I have excuse to do it elsewhere.

Jeremy: How many sites left?

Catalin: Submitted jobs a few days ago, identified 4 sites. Handful haven't yet. Haven't heard any particular problems. 

Jeremy: If you have a list we can follow up with tickets. 

Catalin: I will if it doesn't happen in another week.

- Tickets

Matt: Much the same view as last week. If no progress on ticket after ~week, set it on hold.

With reference to https://ggus.eu/index.php?mode=ticket_info&ticket_id=...

109712: Now some update on this.

108356: Not sure if things are going on offline. Don't know if anyone else is planning on installing vmcatcher. Jeremy: Does everyone in FedCloud need this? Matt: Need it to get secure images automatically. Ideally need it to have a proper cloud site. Jeremy: Add to Friday meeting? Matt: Might be worth it. Interesting to hear 100IT perspective as first fully cloud site.

109906: My advice is to put ticket on hold while waiting for figures.

109207: Elena: Was busy with Nagios Ops ticket, haven't got to it yet.

108273: On hold, but should look at it. Matt: Had trouble with iDRAC interfaces yesterday. Note: if you have iDRAC problems, try clearing browser history.

108715: On hold, but should look at it.

107880: Chris: I think it's srmcp that should fix this, there might be a workaround but that's where it should be fixed. Matt: srmcp is as old as the hills. Sam notes from transcript that this is why we have GFAL tools. Matt: Wonder about the precedent we set over the amount of work for essentially small subset of a user group. Ticket should maybe be assigned to UK in general, but put back in progress?

[Sam: The problem here is really that there's not a lot of grid stuff in general that is guaranteed to work on non-RHELx systems.]

Jeremy: Who is going to put it back in progress?

Matt: Bit of an odd ticket, assigned to T1 for help, bounced to QMUL where Chris is...

Brian: Did ticket pivot from an original to new issue?

Matt: No, it was a request for help. 

Jeremy: I'll have a look at the ticket offline.

- Tools
- VOs
- Site updates

11:40 - 12:00 Discussion 20'
- perfSONAR 3.4

Jeremy: Note RALPP db question around PS 3.4 install from TB-SUPPORT. What do others think?

Ewan: Unusual to have done it this way, clean installer sets up the db. No-one else may have tried this?

Brian: May also become an issue for the T1.

Jeremy: Who else has installed this?

Ewan: Worked OK using NetInstall method. 

Brian: Mesh config still works with 3.4?

Ewan: The new ones work, but note that the URLs have completely changed. The instructions have the details. Note in general - this applies if you're doing a yum update from 3.3, so be aware.

Jeremy: Posted instructions (see transcript).

- multicore

Jeremy: Haven't discussed this in a while: ongoing long thread on passing parameters to batch systems. 

Alessandra: No summary. I would appreciate some help here: sites that have batch systems other than SGE and Torque, could they send the parameters they prefer to use in their batch system? In the UK I'm referring to HTCondor and Slurm at this point. For example, to pass memory to Torque you use the mem parameter; in SGE it's s_rss. In HTCondor and Slurm what do you use?

Jeremy: Slurm at Durham and Brunel?

Alessandra: Brunel not on Grid; more concerned with HTCondor currently.

Chris Brew: Will look it up. 

Alessandra: If you can send me the list that would be good. Other sites have memory, virtual memory but that's contentious, wallclock and CPU time. 

Memory: mapped to RSS
VMEM: Contentious, virtual mem used to be RSS + swap but kernel has changed underneath and doesn't report that any more. Will need work.
Wallclock time
CPU time

Chris Brew: This is what the jobs request?

Alessandra: Yes, what the job may request.

Chris: I think I can get those. 

Jeremy: Oliver, can you do that for Slurm?

Oliver: I'll have a look. 

Jeremy: How about other HTCondor sites - should there be more parameters than these 4?

Jeremy: Any conclusion/arguments worth exploring from thread?

Alessandra: Discussion yesterday moved towards what to do with virtual memory but more complicated than passing parameters to the batch system. Requires using cgroups. For HTCondor straightforward, looking at Glasgow configuration. SGE waiting for a patch apparently. Univa GE has patch in latest version but sites have to install it and not everyone using Univa version. Not sure about SGE. 

Jeremy: Matt's installed it at Sussex and had some problems, not quite compatible with the middleware. 

Matt: Worth noting that UGE is closed source/proprietary so that might be why it's not so popular. Our problems are more to do with an issue in a recent version, but there is native support. I'll send details as it will be quite different for UGE than for SGE. It might be that I take the script that Manfred posted, fork it for UGE and propose that for us: there are new parameters to control memory using cgroups, so it would be better for us to use that.

Alessandra: Script not tailored to use cgroups, has been standard for a couple of years. We need to work on that. Also, from chat: Son of Grid Engine and other versions not pushing cgroups that fast. Torque/Maui doesn't have support - I don't think there are plans for this.

Cgroups is contentious - ATLAS would like this for all sites but might not be possible in short term. Might need alternatives/workarounds working on values of limits. Needs discussion. Fundamentally need to know what is possible with the batch systems. For sites that can do it get them to do it, and for sites that can't discuss it. No conclusion - long, but no conclusion.
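The parameter collection Alessandra asks for in this item amounts to a per-batch-system lookup table. A minimal sketch follows, filled in only with names quoted in these minutes (including the HTCondor names Chris Brew posted in the chat, which are unverified). The table layout and function names are hypothetical, and None means "not recorded at this meeting", not "unsupported".

```python
# Parameter names quoted in these minutes. None marks a value nobody has
# reported yet; it does NOT mean the batch system lacks such a parameter.
# The HTCondor entries come from the meeting chat and are unverified.
BATCH_PARAMETERS = {
    "memory": {"torque": "mem", "sge": "s_rss",
               "htcondor": "RequestMemory", "slurm": None},
    "vmem": {"torque": None, "sge": None,
             "htcondor": None, "slurm": None},
    "walltime": {"torque": None, "sge": None,
                 "htcondor": "JobTimeLimit", "slurm": None},
    "cputime": {"torque": None, "sge": None,
                "htcondor": "JobCpuLimit", "slurm": None},
}


def parameter_for(resource, batch_system):
    """Return the batch-system parameter name for a job resource request,
    or None if the minutes do not record one."""
    return BATCH_PARAMETERS[resource][batch_system]


def still_to_collect(batch_system):
    """List the resource requests with no parameter name recorded yet."""
    return sorted(resource for resource, names in BATCH_PARAMETERS.items()
                  if names[batch_system] is None)


print(parameter_for("memory", "torque"))
print(still_to_collect("slurm"))
```

Keeping the gaps explicit makes it easy to see which sites still need to report values, for example everything for Slurm.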

- HEPSYSMAN topics (https://indico.cern.ch/event/350917/)

Jeremy: Reminder to register for HEPSYSMAN and noted the agenda.

12:00 - 12:01 AOB 1'

Chris W: Machine evaluation workshop in Coventry, Lustre and OpenStack are topics if people are interested. Traditionally sponsored by Daresbury Labs and held in Liverpool, this year in exhibition centre near Coventry. [https://eventbooking.stfc.ac.uk/news-events/mew25] : 2 and 3 December 2014.


Transcript
----------

Elena Korolkova: (11/11/2014 11:09)
Sorry, I have problem with eduroam
Christopher J. Walker: (11:10 AM)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109915
Elena Korolkova: (11:10 AM)
Next Tuesday I'm staying at home:)
Jeremy Coles: (11:15 AM)
http://pprc.qmul.ac.uk/~walker/votable.html
wahid: (11:16 AM)
support what 
Jeremy Coles: (11:16 AM)
The gridpp VO.
It is used by DIRAC.
Ewan Mac Mahon: (11:16 AM)
vo-nagios.physics.ox.ac.uk runs tests for the gridpp vo as well, if that's helpful.
wahid: (11:17 AM)
rather not thanks. we can remove the publishing
Ewan Mac Mahon: (11:20 AM)
Move it to the top of the list - it's very quick and easy to get VAC running, and it comes with a nice warm fuzzy feeling of productivity. Then you can tick it off the list, and that always feels good.
Jeremy Coles: (11:21 AM)
http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Christopher J. Walker: (11:26 AM)
That's not a use case, that's a bug in the monitoring
Gareth Douglas Roy: (11:26 AM)
I agree
Christopher J. Walker: (11:26 AM)
It ought to be fixed by fixing the monitoring
Ewan Mac Mahon: (11:29 AM)
Indeed; in principle, this state should make no sense. But I'm not sure it needs to be made completely impossible to use it; it might be enough to have it as a "don't do this unless you're really, really, sure"
Our ARC's 'fake apel' entry is also production and monitored, FWIW.
Christopher J. Walker: (11:31 AM)
Surely monitoring requiring ports open just for monitoring is also arguably a bug. 
So perhaps one should be allowed to do this as long as one has a ggus ticket open about the original bug. 
Ewan Mac Mahon: (11:32 AM)
I'm not sure I think that's a problem, necessarily. It has proven useful to have central monitoring of things (like squids) that might normally only be used internally.
I don't see any particular reason why monitoring shouldn't 'count' - in a distributed system it can be just as important as any other part of the service.
Matt Doidge: (11:34 AM)
Will there be a Vidyo broadcast of HEPSYSMAN? I'm coming but Robin can't make it, but would like to listen in.
Daniela Bauer: (11:34 AM)
well, someone tried arguing the ports opening for monitoring and this is what they got:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101304
Ewan Mac Mahon: (11:37 AM)
Hmm. That seems slightly a special case for argus not having any development effort, but it does reflect an existing assumption that it's reasonable to expect sites to configure things that are just for monitoring.
Daniela Bauer: (11:42 AM)
Well, it's the reason we don't admit to having an argus server...
Jeremy Coles: (11:44 AM)
https://www.gridpp.ac.uk/wiki/HEPSPEC06
wahid: (11:45 AM)
OK - will fix 
Jeremy Coles: (11:47 AM)
http://repository.egi.eu/2014/11/10/release-umd-3-9-0/
Samuel Cadellin Skipsey: (11:59 AM)
(this is why we have the GFAL tools, I note.)
The problem here is really that there's not a lot of grid stuff in general that is guaranteed to work on non-RHELx systems.
Christopher J. Walker: (12:03 PM)
Quite
IMHO that's a bug
but not one that's likely to be easy to solve. 
Matt Doidge: (12:04 PM)
Do we need something like a "UK Advice Helpdesk" Support Unit in GGUS?
Robert Frank: (12:04 PM)
you don't need mysql for 3.4 it should be using postgresql
I don't have mysql installed on our 3.4
and I've installed it using rpm installed
had other problems though
John Hill: (12:06 PM)
I just used the netinstall without any obvious problem
Jeremy Coles: (12:06 PM)
Thanks.
Federico Melaccio: (12:06 PM)
thanks
Jeremy Coles: (12:06 PM)
The install instructions https://twiki.opensciencegrid.org/bin/view/Documentation/InstallUpdatePS
Robert Frank: (12:07 PM)
https://github.com/HEP-Puppet/perfsonar
Federico Melaccio: (12:08 PM)
thanks Robert
that might be very useful
just for the record: the 3.4 perfSONAR package lists mysql-server as a dependency
Chris Brew: (12:12 PM)
CPUTime = JobCpuLimit
Memory = RequestMemory
Number of CPUs = RequestCpu
Not sure whether Condor kills on walltime
Ewan Mac Mahon: (12:14 PM)
Well, my cleanly re-installed 3.4 perfsonar bandwidth box seems to be running mysql and postgres.
Chris Brew: (12:14 PM)
Oh no it's JobTimeLimit
Samuel Cadellin Skipsey: (12:14 PM)
(In particular, since Univa is Paid)
Matt Doidge: (12:14 PM)
We don't know what's happening with Son of Grid Engine either!
Samuel Cadellin Skipsey: (12:14 PM)
SoGE seems to not be pushing the cgroups stuff as hard.
(Sadly, OGE was where that was happening in the opensource stuff)
Matt Doidge: (12:14 PM)
There are various patches floating about the place.
wahid: (12:15 PM)
sorry I have to go now -see you soon
Daniela Bauer: (12:15 PM)
It's gone all quiet for me.
Matt Doidge: (12:15 PM)
And I know there is some cgroup support in some of them.
Daniela Bauer: (12:15 PM)
I think I will bow out now.
Jeremy Coles: (12:15 PM)
@Daniela - the meeting continues with sound for us.
Daniela Bauer: (12:16 PM)
But I am hungry...
Daniel Traynor: (12:16 PM)
Son of Grid Engine claims Linux cpuset support (aka cgroups) since 8.1.2 but it still looks experimental
Samuel Cadellin Skipsey: (12:16 PM)
is that monitoring or controlling, Dan?
Jeremy Coles: (12:16 PM)
ALL: please check https://www.gridpp.ac.uk/wiki/Batch_system_status is up-to-date. Thanks.
Daniel Traynor: (12:17 PM)
process containment
Samuel Cadellin Skipsey: (12:18 PM)
Sure, but that's just "assign to a cgroup": what is it using the cgroup for? Just monitoring, or also control/limits?
Alessandra Forti: (12:18 PM)
@Chris let me know if you find out about the walltime
Jeremy Coles: (12:18 PM)
https://indico.cern.ch/event/350917/
Matt Raso-Barnett: (12:18 PM)
probably for core binding in sge
@sam/dan
Daniel Traynor: (12:19 PM)
http://arc.liv.ac.uk/SGE/htmlman/htmlman5/sge_conf.html
Samuel Cadellin Skipsey: (12:19 PM)
Right, so no memory cgroup stuff?
Robert Frank: (12:19 PM)
older 3.4 version have mysql-server as dependency, newer don't
Daniel Traynor: (12:19 PM)
search for USE_CGROUPS
Robert Frank: (12:19 PM)
looks like 3.4-14 needs it, 3.4-16 doesn't
Samuel Cadellin Skipsey: (12:20 PM)
Yeah, so it looks like it is just using it to properly track/control all of the child processes for a job.
It doesn't *look* like its applying actual limits with them?
Christopher J. Walker: (12:21 PM)
https://eventbooking.stfc.ac.uk/news-events/mew25
Daniel Traynor: (12:21 PM)
ok - it just replaces the old cpu binding 
Ewan Mac Mahon: (12:21 PM)
So the MEW is 2nd and 3rd December.
Christopher J. Walker: (12:23 PM)
They are sending the sysadmins to Coventry
Matt Doidge: (12:24 PM)
Bye all, see most of you on Monday!

Agenda
------

11:00 - 11:20 Experiment problems/issues 20'
Review of weekly issues by experiment/VO
- LHCb
- CMS
- ATLAS
-- Summary of the user threaded jobs status
-- MC status at sites.
- Other
- DIRAC status
-- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view .
- Update needed for https://www.gridpp.ac.uk/wiki/GridPP_Cloud?

11:20 - 11:40 Meetings & updates 20'
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
- General updates
- WLCG ops coordination
- Tier-1 status
- Storage and data management
- Accounting
- Documentation
- Interoperation
- Monitoring
- On-duty
- Rollout
- Security
- Services
- Tickets
- Tools
- VOs
- Site updates

11:40 - 12:00 Discussion 20'
- perfSONAR 3.4
-- "The only way I found to prepare the db was to install the 3.3 version of perfSONAR_PS-perfSONARBUOY package, which provides two scripts for that, owdb.pl and bwdb.pl...."
- multicore status
-- developments/discussion on passing parameters to batch systems
-- summary on the long-thread related to which variable are set, mem vs vmem etc.
- HEPSYSMAN topics (https://indico.cern.ch/event/350917/)

12:00 - 12:01 AOB 1'