Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234.
- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +442030510622
-- The meeting extension is 9308582.

Apologies:

LHCb
====

Raja: There's pretty much nothing to say; we are very quiet on the grid. We have updated to a new version of DIRAC and are working to resolve issues with that. This morning there were a few failures, possibly due to DIRAC.

Jeremy: Can you say more about the new DIRAC?

Raja: The new release has a few fixes and supports IPv6. Services should run on dual stack. LHCb customisation is being looked at.

CMS
===

Daniela: Not much to report. There is a GGUS ticket: data transfers between Fermilab and RAL (or the UK) are timing out at the moment. Blame is being shuffled around and it is not clear if this will get resolved soon: see transcript for ticket.

Jeremy: Is this being followed up?

Daniela: Both Fermilab and RAL have a ticket - hopefully between them they can sort it out.

Jeremy: If it's a network issue...

Daniela: Not sure if it's a network issue.

ATLAS
=====

Elena: We have several open tickets for ATLAS. One is 114153, which we looked at last time, for Manchester: failed to get source files. Could someone from Manchester update the ticket?

There is also 114208, for the OX IPv6 test not working. We can close the ticket saying that it's a test instance. The host itself is not reachable. What are Ewan's feelings?

Ewan: Not sure why Andrew sees the results he does - I think it's accessible. However we will close the ticket - concerned that it will be reopened if the tests are run again. [Minutes note: Accessibility question resolved in ticket]

Elena: Can make a note of that. 

Ewan: The node is set up as not in production and not monitored in the GOC. The broader point is maybe not about IPv6 but about not sending automated tests to things marked as not to be tested.

Elena: RHUL: jobs failed on stage-in (114289); the queue was set in test mode for some time. Sent Govind a reminder; think it has been resolved as I saw the queue go online yesterday, but check A/R. Could Govind update the ticket as solved?

114157: IC: Discussed last time, Simon cleared the dark files, checked that IC is not listed as site for functional tests (as is correct), we can close this ticket.

112721: RAL, discussed last time. Brian commented, and there are comments from shifters who noticed that there are some failures as well. Andrew L commented that there is a JIRA ticket, shouldn't be FTS multiple transfers. ATLAS DDM support is involved; Elena can send a reminder to ask the experts to update the ticket.

Andrew L: I'll have a look at the ticket. 

Elena: Follow up offline by email. 

OTHER VOS

DIRAC: Jeremy: We are making some progress, some activity, Durham. Believe there have been some tests using FTS. 

Ewan: Jens mentioned that he was planning an update in storage. 

LIGO: Catalin: I saw some messages from Paul Fri/Mon, he didn't manage yet to test Condor at RAL, for DIRAC other people on that. Successful at some point using VM to access disk points, [Jeremy: some issues with NATing]. Trying Mac laptop with less success. Tom encouraged him to use CernVM image. Latest update has new features. 

Jeremy: Making good progress?

Catalin: Not sure we can do more, also need input from users - maybe send a team to Cardiff?

Ewan: Maybe have a user training day? Maybe Birmingham? 

Jeremy: Good idea.

Sam: One thing that would have improved Paul's experience would have been having more of the disks attached to DIRAC working. It looks like the Imperial disk isn't working with the DIRAC DMS. Paul noted that he tried all the disks and only Birmingham, Glasgow and QMUL were working. This could be a red herring for users.

Jeremy: GGUS ticket against Janusz?

Tom: If you have been testing it, please report it.

Jeremy: We can raise a ticket now.

LOFAR

George: No update, waiting for user to get back to us. 

LSST

Alessandra: Simon, a PhD student, had some problems with submission; one VOMS server was missing from the configuration. Waiting for LIV and ECDF to fix theirs as well. Then he should be able to use all three sites & storage. We didn't see this before; for some reason he's the only one who was getting a proxy from the missing VOMS.

One of the things that I'm worried about for other sites hosting LSST is that I foresee other similar changes, eg enabling the pilot role.

LZ

Jeremy: I know that Dave Britton has written to LZ expressing GridPP's interest in supporting them further. 

Elena: US people submitted request for LZ VO to be set up in OSG.

Jeremy will follow up with Dave Colling.

UKQCD

Jeremy: Haven't communicated, except to hear of some progress.

Sam: Gang and I have been talking to Craig, bringing him up to date with modern Grid tools.

UCLAN Galaxy Dynamics

Tom: Going great guns. Thanks to Catalin we have a CVMFS repo set up. If you have a previous CernVM image and you're using CVMFS, look at the new image.

Ewan: Get the new users to subscribe to the gridpp-support list, bring conversations into that forum.

Previous actions: Establish plan for LIGO: Being taken forward? Do we have a plan, timeline?

Catalin: No, once Paul is ready for testing at RAL, I'm not sure what the DIRAC test status is. Happy to report from emails.

Jeremy: Followup offline.

Catalin: Ask Paul to join a meeting?

Jeremy: Think we should get a lot of them involved - Jeremy will look at this.

DIRAC status: Andrew: SAM tests: Ben has got tests working at UCL; it turned out to be a networking problem, both machines had lost their network connection. Other issues: Lancaster hasn't been working for a few days; also the DIRAC RAL setup, which uses a different way of creating VMs.

Andrew L: I'm aware of that, haven't had a chance to look at it.

Andrew M: If the sites are busy they're not getting a look in.

Jeremy: Same at Birmingham?

Andrew M: Possibly yes. 

Jeremy: Surprised, almost a month. Also LCG.Glasgow 

Andrew: Not getting anything for 8 days.

From Bulletin:
==============

Along with https://www.gridpp.ac.uk/wiki/Operations_Bulletin_220615

Alessandra and Jeremy discussed Sussex, A/R results and ATLAS test results. 

Gareth R: How are other sites doing publishing from ARC? Only way to fix it is to insert VO shares into perl script. Any suggestions or experience welcome.

Steve Jones: Have you compared the output of your BDII to others (eg LIV)?

Gareth: Checked output of some ARC CEs; some had VO shares, some didn't. OX didn't, but they have CREAM CEs that do the publishing. 

Chris B: I manually hacked the glue-generator.pl

Gareth: That's what I did as well. Think your site was one I saw publishing shares. 

Chris: Don't think it would be a huge effort to write a cluster publisher, but not sure if that's useful.

Andrew L: Is this info used?

Gareth: Yes, the points system that the metrics are based on uses VO shares, so if you don't publish them you get zero.

Ewan: If that's what it's used for, should this be done another way (quarterly report, for example)? If people have done this, publishing the hacks would be useful.

Gareth: We found it when we switched off last CREAM

Chris: I'll mail it round (it has some other tweaks as well)
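
[Minutes note: for sites wanting to repeat Gareth's check of which CEs publish VO shares, a minimal Python sketch follows. It assumes the shares appear in GLUE 1.x as GlueCECapability values of the form Share=<vo>:<percent>, and that the LDIF text has already been fetched from a BDII by some other means. The sample data, host names and function names are hypothetical, and LDIF line folding is not handled.]

```python
# Hypothetical LDIF fragment in GLUE 1.x style; hosts and values are made up.
SAMPLE_LDIF = """\
dn: GlueCEUniqueID=ce01.example.ac.uk:2811/nordugrid-Condor-grid,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce01.example.ac.uk:2811/nordugrid-Condor-grid
GlueCECapability: CPUScalingReferenceSI00=2500
GlueCECapability: Share=atlas:60
GlueCECapability: Share=lhcb:40

dn: GlueCEUniqueID=ce02.example.ac.uk:2811/nordugrid-Condor-grid,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce02.example.ac.uk:2811/nordugrid-Condor-grid
GlueCECapability: CPUScalingReferenceSI00=2500
"""


def vo_shares_by_ce(ldif_text):
    """Map each CE ID to a dict of {vo: percent} taken from Share= capabilities."""
    shares = {}
    current = None
    for raw in ldif_text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current LDIF entry
        elif line.startswith("GlueCEUniqueID:"):
            current = line.split(":", 1)[1].strip()
            shares.setdefault(current, {})
        elif line.startswith("GlueCECapability:") and current is not None:
            value = line.split(":", 1)[1].strip()
            if value.startswith("Share="):
                vo, _, percent = value[len("Share="):].partition(":")
                shares[current][vo] = int(percent)
    return shares


def ces_without_shares(shares):
    """Return the sorted list of CE IDs that publish no VO share at all."""
    return sorted(ce for ce, per_vo in shares.items() if not per_vo)


if __name__ == "__main__":
    for ce in ces_without_shares(vo_shares_by_ce(SAMPLE_LDIF)):
        print("no VO shares published:", ce)
```

Running it on a real dump would mean replacing SAMPLE_LDIF with the text returned by the site or top BDII.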

Next WLCG Workshop Jan/Feb

Tier-1 outage: Ewan: Shutting down Top BDIIs before and bringing them back up afterwards? There is previous experience of a time when the Top BDIIs came back empty, which caused some issues.

Gareth S: Rings a bell with me. Not sure if 15 minutes is going to affect this, last time was longer, but this is a precaution we could take. 

Catalin: We could do that.

Gareth S: As a statement: we will do that.

Jeremy: We tend to send out updates - it is a short one

Gareth: The reason it's 15 minutes is to give time to flush through the network. 

Accounting: Another push to correct sites that are not publishing accounting by number of cores - see list in bulletin. Ticket against NGI, will cascade if the issues aren't resolved.

Gareth R: Glasgow: The one at Glasgow is being retired and will come back as an ARC imminently.

Oliver: Durham: On the ARC CE there's a bug that's misreporting.

Steve: Liv: Could you clarify the issue?

Jeremy: Accounting change, not publishing by number of cores

Ewan: Oxford: No idea, but it's a CREAM CE which is going to be decommissioned. 

Gareth: Issue was that they never ran multicore but the publishing still needs upgrading.

Steve: How do we fix this?

Alessandra: It's simple - I can recirculate the fix, it's one line.

EGI Ops:

One or two sites looking at SL5 longer term, mostly storage. See Agenda in bulletin link.

ACTION: Check early adopters contact details: "Some sites have the contact points for the EA adopters outdated so please check in table if all contacts and products are still correct"

Federico & David: Created a FAQ to help people get guidance on useful monitoring pages for different purposes - request made for people to have a look and, if they had something they'd like added/clarified please get in touch.

There was a security discussion.

Services: Duncan has added Manchester to the dual stack Perfsonar mesh. 

Duncan: IC VRF, need to send an email to find out what the situation is.

Jeremy: OMB has prepared a position paper on this.

Tickets:

See Bulletin.

114517: Elena: Closed it.
114513: Alessandra: that's on me.

GDB: See Notes: https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20150610


BDII discussion, intro from Alessandra:

WLCG Workshop simplifying the BDII (Alessandra) - easier to maintain it

OSG sites are going beyond that and eliminating it all together, esp CMS sites. Raises questions over tools like REBUS and SAM tests and occasions in which gather info about sites...

Discussion over at Ops coord on Thursday about this, experiments and sites use the BDII, what are the pros and cons, pros to keep it in simplified way, or not to keep it at all. 

Thoughts?

Steve Jones: Four options, leave it as it is, simplify it, get rid of it, make a different one? What about making a different one and phase out the current one over time?

Alessandra: Mix of AGIS and GOCDB - it took years to arrive at the BDII, a new service would need a non-trivial amount of dev time

Jeremy: BDII became quite bloated, with the rise of static information and expts doing their own thing

Alessandra: The expts don't seem to use it that much. CMS clearly doesn't use it; ATLAS is using it to collect a limited set of info from queues: HEPSPEC, max queue length etc. Depending on the size of the site you can have thousands of entries. If it is only used for 4 or 5 parameters, or even including REBUS 10 parameters per site, is it overkill for sites? It was useful for gathering information about the OS during the SL6 campaign. REBUS even if we shut it down, perhaps not reliable since BDII is not reliable?

Jeremy: Focus is on LHC VOs, chat about DiRAC users

Alessandra: This is why I'm talking about simplifying it.

Jeremy: Going to go the way of the LFC, not used by big expts?

Alessandra: We don't hear about tickets about jobs failing because LFC not working. Still using it but at a different level of service level agreement.

Gareth R: Understand the technical reasons for it, never really seen the need for it, dynamic source which people ignored because they wanted static data. If you need it for other reasons then simplifying it seems good. 

Steve: It is complicated.

Sam: I would be sad if we got rid of it from a storage perspective as from that side it's not that complicated. There are lots of useful things, discovery etc. that you don't need for a CE. Would be very sad to see it go - replacing it with something else would be a very long process.

Gareth: GLUE2? Why are we still using GLUE1.2? We have a mixture - keep getting tickets about info incorrectly published that goes away an hour later. Something needs to be done to tidy it up. Need a distributed info source - most new technologies have key-value stores that they stick data in - don't know that what we have is nice.

Ewan: Bit of split between two things - nuts and bolts stuff, what port is this SRM talking on, vs. statistics like "what number of cores are you publishing" which is not being used by automated systems. Former has value and also BDII tends to be correct for those uses - when we advertise SRM on particular port, normally true. The more human facing side is more often wrong and difficult to make correct taking into account future technologies. Eg in cloud landscape the current set of information might not make sense.

Sam: We can't implement all of GLUE1.3

Chris B: The whole idea of subclusters is broken

Sam: Yes (not in GLUE2 but can't use that). How much is published for political reasons - does that need to be published by BDII?

Chris: If we use it to publish ports, publishing static isn't too bad - underlying software is OK, frustration is with the cluster/CE part, which as Ewan says is broken at the conceptual level.

Ewan: Publishing a mix of genuinely dynamic and genuinely static - VO shares should be the latter. Then we wouldn't need to worry. When it comes as a semi-derived value, that's where the problem is. This is this service with details about this service, and separately static stuff. The problem is when you try to do things that are a mix of both.

Chris: Useful knowledge to push back if, eg, we'll be full for the next week. Clusters/subclusters, maybe not appropriate for use.

Jeremy: If we shouldn't use it for clusters but should use it for VO shares, we should say that.

Ewan: I don't think we should publish VO shares (eg) through the BDII: but if we are they should come from static configuration and not quasi-static derived data.

Chris B: Static anyway, aren't they, end up in dynamic as it gets added to cluster info, but not dynamically generated?

Steve: Discussing implementation details, need to figure out requirements. Maybe underlying tech, but need to look at requirements.

Jeremy: That's the idea of the entry on Thursday, to talk about that. 

Alessandra: Are we consumers of this or only providers?

Steve: That's part of the question. Find out what the stakeholders are and what their requirements are.

Jeremy: That's the idea - involve [this group] as a user community and also provider community

Alessandra: Two aspects: as a provider community, is it a service that we'd be happy to get rid of, and would it make a difference? As a user community, do you use it as site admin or in another role (VO Manager, eg, I found it useful to use lcg-infosites)?
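
[Minutes note: as an illustration of the "nuts and bolts" machine-consumed use that Ewan describes (which port is this SRM talking on), here is a minimal Python sketch. It assumes GLUE 1.x service entries carrying GlueServiceType and GlueServiceEndpoint attributes, already retrieved as LDIF text. The sample entries, host names and function names are hypothetical; this is not how gfal or DIRAC do the lookup.]

```python
from urllib.parse import urlparse

# Hypothetical GLUE 1.x service entries; hosts and site name are made up.
SAMPLE_SERVICES = """\
dn: GlueServiceUniqueID=httpg://se01.example.ac.uk:8446/srm/managerv2,mds-vo-name=resource,o=grid
GlueServiceUniqueID: httpg://se01.example.ac.uk:8446/srm/managerv2
GlueServiceType: SRM
GlueServiceEndpoint: httpg://se01.example.ac.uk:8446/srm/managerv2

dn: GlueServiceUniqueID=bdii.example.ac.uk_bdii_site,mds-vo-name=resource,o=grid
GlueServiceUniqueID: bdii.example.ac.uk_bdii_site
GlueServiceType: bdii_site
GlueServiceEndpoint: ldap://bdii.example.ac.uk:2170/mds-vo-name=EXAMPLE-SITE,o=grid
"""


def parse_entries(ldif_text):
    """Split LDIF text into entries; keep the first value seen for each attribute."""
    entries = []
    for block in ldif_text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                entry.setdefault(key.strip(), value.strip())
        entries.append(entry)
    return entries


def srm_endpoints(ldif_text):
    """Return (host, port) pairs for every entry whose service type is SRM."""
    found = []
    for entry in parse_entries(ldif_text):
        if entry.get("GlueServiceType") == "SRM":
            url = urlparse(entry.get("GlueServiceEndpoint", ""))
            found.append((url.hostname, url.port))
    return found


if __name__ == "__main__":
    for host, port in srm_endpoints(SAMPLE_SERVICES):
        print(host, port)
```

The point of the sketch is that this kind of lookup needs only a handful of attributes per service, which is the part of the BDII that several people argued for keeping.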

AOB: David C: www.monitorama.com : Live stream, archive available later.

Transcript:

Federico Melaccio: (16/06/2015 11:01)
I did not get any email as well
and I hear an echo from Jeremy
as soon as someone joins
John Bland: (11:03 AM)
jeremy: we're hearing all your desktop sounds, quite loud
Steve Jones: (11:03 AM)
I'm getting a lot of Ding-a-lings from somewhere.
Federico Melaccio: (11:04 AM)
yes it looks good
Daniela Bauer: (11:05 AM)
https://ggus.eu/?mode=ticket_info&ticket_id=114275
Jeremy Coles: (11:11 AM)
https://ggus.eu/?mode=ticket_info&ticket_id=114289
https://ggus.eu/?mode=ticket_info&ticket_id=114157
https://ggus.eu/?mode=ticket_info&ticket_id=112721
Ewan Mac Mahon: (11:19 AM)
You've got to love the optimism in trying to install the grid tools natively on OSX.
Tom Whyntie: (11:19 AM)
Ha!
Ewan Mac Mahon: (11:20 AM)
And yes, I completely agree with that assessment - actual progress is being fairly slow, but he's being well supported.
Samuel Cadellin Skipsey: (11:20 AM)
It certainly surprised me that he tried OSX given Ubuntu hadn't even worked!
Jeremy Coles: (11:21 AM)
Present: Oliver; Winnie; Pete G; Raja; Raul; Robert F; Sam S; Steve J; Alessandra F; Andrew L; Andrew M; Catalin C; Chris B; Dan T; Daniela B; David C; Duncan R; Elena K; Ewan M; Federico M; Gang Q; Gareth R; George R; Gordon S; Govind; Ian L; Jeremy C; John B; Liam S; Matt D; Matt RB
Daniela Bauer: (11:21 AM)
There goes my Vidyo, sigh#
Jeremy Coles: (11:22 AM)
Action: Consider user training at Birmingham.
Ewan Mac Mahon: (11:26 AM)
Not to rule out having it somewhere else, of course, but Birmingham looks good to me - onsite B&B, onsite train station, and an onsite Matt & Mark.
Jeremy Coles: (11:27 AM)
+ close to Oxford?
Ewan Mac Mahon: (11:28 AM)
Thought hadn't crossed my mind :-)
But actually, good train links to most of England, and good air links to Scotland too, so it's close to everyone.
Alessandra Forti: (11:30 AM)
I think the time reflects the fact that the VOs are not established and there is some work going on
Ewan Mac Mahon: (11:32 AM)
Indeed, we shouldn't avoid spending time on doing good stuff, but we might want to consider relegating some of the regular standing agenda items out of the meeting and to bulletin-only updates for a while to save the meetings getting longer.
Some of the /other/ standing items, that is; stuff that's not this.
Jeremy Coles: (11:33 AM)
Action: Add UCLAN to other VOs list.
Action: Get users on to GridPP-support@...
Tom Whyntie: (11:35 AM)
Action: Jeremy to open up gridpp-support list so that it is visible on Jiscmail
Ewan Mac Mahon: (11:35 AM)
:-) That too.
Peter Gronbech: (11:47 AM)
That's importat becuase RHUL also had issues with multiple ARC ce's doubling published CPU counts
Ewan Mac Mahon: (11:49 AM)
The point about not (ab)using the information system for this sort of information is one of the things that will be part of this next discussion.
Alessandra Forti: (11:53 AM)
not us I think
for cream it is simple
for arc it should be automatic
Chris Brew: (11:56 AM)
shut em down
Ewan Mac Mahon: (11:57 AM)
Do we need to worry about historical accounting though - do we need to fix this, republish stuff, and then kill them, or is just killing them actually Ok?
Matt Doidge: (11:58 AM)
We'll have to ask John G
David Crooks: (11:59 AM)
https://indico.egi.eu/indico/getFile.py/access?resId=0&materialId=minutes&confId=2540
Federico Melaccio: (12:03 PM)
yeah sorry, I lost the audio at the wrong moment apparently
thanks David
David Crooks: (12:04 PM)
np :-)
Ewan Mac Mahon: (12:10 PM)
And closed.
Catalin Condurache: (12:11 PM)
i have to leave now. bye

Jeremy Coles: (12:12 PM)
https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20150610
Ewan Mac Mahon: (12:23 PM)
I think that's AGIS.
Part of the reason we can consider turning the BDII off at all is because much of its function has already been replaced.
Daniela Bauer: (12:24 PM)
The small VOs will need the bdii
or similar
Ewan Mac Mahon: (12:24 PM)
I think our plan is that they use dirac, so does their use of dirac rely on the bdii?
Daniela Bauer: (12:25 PM)
yes
How do you think we work out which VO can run where ?
Samuel Cadellin Skipsey: (12:25 PM)
You also , for example, can't use the gfal2 utils as well without it (they lookup storage endpoint configuration from it, just as the gfal tools do)
Ewan Mac Mahon: (12:25 PM)
By people poking Janusz to tweak static config files, mostly.
Daniela Bauer: (12:25 PM)
Yes, but that's not sustainable
the real dirac uses the bdii
If you want to look at a full dirac config you could try here
Samuel Cadellin Skipsey: (12:26 PM)
(I've literally just this week been talking to NA62 people about this kind of thing for storage, and their solution needs the BDII)
Chris Brew: (12:26 PM)
don't all the gfal tools us the BDII info
Daniela Bauer: (12:26 PM)
https://dirac.gridpp.ac.uk/
Samuel Cadellin Skipsey: (12:26 PM)
Yup, Chris, that's what I said :D
Ewan Mac Mahon: (12:27 PM)
i think we should use the BDII for things where there is a machine consuming the data - so for the information that the gfal/lcg-utils tools use, and for what dirac needs.
Chris Brew: (12:27 PM)
they're pretty unusable with the no bdii option
Ewan Mac Mahon: (12:27 PM)
We should strip out things that are only consumed by humans though.
Daniela Bauer: (12:27 PM)
we only use very basic data, like is teh site enabled for a given VO
Samuel Cadellin Skipsey: (12:27 PM)
Well, you can use them without the BDII, but you need to know a lot of things that it is hard to find out otherwise.
Daniela Bauer: (12:27 PM)
and "what is the storage path"
Ewan Mac Mahon: (12:28 PM)
Which is why Alessandra is thinking about simplification rather than elimination, I'd have thought.
Raja Nandakumar: (12:28 PM)
BTW - for LHCb the bdii is still needed.
Samuel Cadellin Skipsey: (12:28 PM)
and how does your implementation of SRM need to be talked to (BeStMan is highly nonstandard, for example)
Daniela Bauer: (12:28 PM)
and ports and stuff like that
Raja Nandakumar: (12:28 PM)
At least we need an information service of one type or another to allow automatic service discovery
Ewan Mac Mahon: (12:29 PM)
I actually think the BDII is wuite nice for some very specific uses.
er - quite nice.
Daniela Bauer: (12:29 PM)
You need some kind of central repo
otherwise each VO will have to maintain their own list of resources - hilarity will ensue
Matt Doidge: (12:30 PM)
I'm pro-BDII, but it could do with a clean up.
Daniela Bauer: (12:30 PM)
We could ditch glue2 and it wouldn't be missed 
Tom Whyntie: (12:31 PM)
Have to leave now - thanks, bye.
Jeremy Coles: (12:31 PM)
The ops coordination agenda item is "Information System Use Cases".
Ewan Mac Mahon: (12:33 PM)
The idea is broken at the conceptual level.
You shouldn't expect to have visibility of the internal hardware layout or the site, you should be able to see the port numbers of the endpoints.
Samuel Cadellin Skipsey: (12:35 PM)
The lat/long isn't that forgotten about (we got told off once for having too little resolution in ours...)
Alessandra Forti: (12:36 PM)
bloody google so precise... ;)
Ewan Mac Mahon: (12:37 PM)
I think our lat/long was pretty much rack specific at one point.
I believe you're correct Chris.
Gareth Douglas Roy: (12:39 PM)
GlueCECapability
:D
Ewan Mac Mahon: (12:40 PM)
Maybe there's another clean split there - all static config to be on the site BDII, all service node BDIIs to only publish autogenerated data with machine consumers.
And if something doesn't have a tool that uses it, then drop it.
David Crooks: (12:41 PM)
I agree - particularly a split between quasi-static data (shares) and dynamic data


David Crooks: (16/06/2015 12:44)
http://monitorama.com/
Ewan Mac Mahon: (12:45 PM)
Also, I agree with Alessandra - things like lcg-infosites are useful, but that (and the data tools) are the sort of things that would be kept under the principle of maintaining parts of the system that have tools that use them.
Federico Melaccio: (12:45 PM)
there is the watch live stream thing
Alessandra Forti: (12:46 PM)
bye
Federico Melaccio: (12:46 PM)
thanks,bye

 

 

 

 

 

 

    • 11:00 11:01
      Ops meeting minutes 1m
      * This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
      * The team composition has been changing. If everybody contributes then the task comes around less often.
      * From the start of GridPP4+ those in fully funded GridPP positions will be expected to contribute. Others are welcome to volunteer!
      * The minutes should contain a list of who attended; apologies; note who took the minutes and highlight actions.
      * A count is maintained at https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items.
      * After uploading minutes to the agenda page the minute taker is expected to:
      ** Update the list of ops actions.
      ** Update their 'count' so the task can be shared fairly.
      Thank you for your support!
    • 11:01 11:20
      Experiment problems/issues 19m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel See comment last week: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
      - ATLAS
      - Other
      -- DIRAC: Jens
      -- LIGO: Catalin
      -- LOFAR: George
      -- LSST: Alessandra
      -- LZ: David
      -- UKQCD: Jeremy Coles
      Previous actions:
      a) Catalin, Jens, Sam, Ewan, Tom etc. to work with Paul to establish plan (strategy, tests, timeline) for LIGO and report back on progress, outstanding issues etc. next week.
      b) Jeremy (Ewan, Jens...) to update AUP documents to ensure GridPP contacts should be in the VO, at least at the start. Pending.
      c) Jeremy (Ewan ...) Make a new mailing list for nascent VOs. Name shall be... requested/done.
      - DIRAC status
      -- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
      - General updates
      - WLCG ops coordination
      - Tier-1 status
      - Storage and data management
      - Accounting
      - Documentation
      - Interoperation
      - Monitoring
      - On-duty
      - Rollout
      - Security
      - Services
      - Tickets
      - Tools
      - VOs
      - Site updates
    • 11:40 11:55
      GDB review and BDII discussion 15m
      - GDB review (https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20150610)
      - BDII options
    • 11:55 11:57
      AOB 2m