Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +442030510622
-- The meeting extension is 9308582.

Apologies: Jeremy Coles, Ian Neilson
Minutes: David Crooks

Present
=======

Alessandra Forti, Andrew Washbrook, Andrew Lahiff, Andrew McNab, Brian Davies, Chris Brew, Dan Traynor, David Crooks (Minutes), Ewan MacMahon, Federico Melaccio, Gang Qin, Gareth Smith, Gordon Stewart, Govind Songara, John Hill, Kashif Mohammad, Lukasz Kreczko, Marcus Ebert, Matt Doidge, Oliver Smith, Winnie Lacesso, Pete Clarke, Pete Gronbech (Chair), Raul Lopes, Robert Fay, Robert Frank, Sam Skipsey, Steve Jones

Experiments
===========

LHCb: Nothing that I'm aware of from Ops meeting yesterday that's UK specific. Sporadically running MC when we can get it.

CMS: IC and RALPP got pulled up for having out of date PHEDEX, will have to update this. CMS intends to make xrootd fallback tests critical, at least for T2. 3 UK sites have this in warning state, need to sort out. Doesn't help that the docs are sparse.

Pete: Bristol looks a bit red in site readiness; also RALPP

Winnie: Storage was broken, is now fixed.

Chris: CMS overloading their storage. SAM tests don't show same as availability numbers, following up.

Pete: As long as you're aware and following up.

DIRAC: progress in Nov/Dec. Brian?

Brian: In process of deleting 1 million small files. Aim is to get to the point of having tar files. Not aware of how the work at Leicester is progressing.

Pete: Tarring files is manual or auto?

Brian: Not sure, that's Lydia's work. Needs space on her system to create tar files. In progress.
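
As an illustration of the tarring step discussed above, the following is a minimal sketch of how many small files could be bundled into a single tar archive automatically. It is not the procedure used at Leicester, which the minutes do not describe; the function name `bundle_small_files` and the `max_bytes` threshold are assumptions made for this example.

```python
import os
import tarfile


def bundle_small_files(source_dir, archive_path, max_bytes=1024 * 1024):
    """Pack regular files under source_dir that are at most max_bytes into one tar.

    Hypothetical helper for illustration only. Returns the sorted list of
    paths (relative to source_dir) that were added to the archive.
    """
    added = []
    archive_abs = os.path.abspath(archive_path)
    with tarfile.open(archive_path, "w") as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) == archive_abs:
                    continue  # never add the archive to itself
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks and anything that is not a regular file
                if os.path.getsize(path) > max_bytes:
                    continue  # leave large files as they are
                relative = os.path.relpath(path, source_dir)
                archive.add(path, arcname=relative)
                added.append(relative)
    return sorted(added)
```

Until the originals are deleted the data exists twice, once as individual files and once inside the archive, which is consistent with the note above that space is needed to create the tar files.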

GalDyn: No new entries

LIGO: Andrew L: nothing since last week. Tried to use condor to submit to ARC CEs; submission wasn't working, so need to look into that.

LSST: Various things from December, mostly fixed. 

Alessandra: There is a problem with gfal within the jobs run with Ganga and Dirac; the problem is conflicting python libraries, one 2.6 and the other 2.7. Thought it was the Dirac UI but need to contact LHCb.

ATLAS: https://indico.cern.ch/event/485673/contribution/1/attachments/1213524/1770874/atlas-report.txt. Ticket at Glasgow.

Sam: We can't get a dump of files that we've never had.

Alessandra: Without the dump ATLAS can't know which files aren't there.

Sam: They know that it's the files that they can't get.

Alessandra: Can be temporary that files aren't accessible. Using dump from sites automatically is to cover this space.

Bulletin
========

https://www.gridpp.ac.uk/wiki/Operations_Bulletin_180116

A/R: Lancaster at 0%

Matt: We're going for a recalculation, there was a JIRA ticket about it, puts us back to 90%. Nice that ATLAS doesn't count natural disaster!

Ops Coord: Next meeting on 27th Jan, agenda linked. Alessandra, any comments? [No]

T1: Gareth S: Just a couple of things - had some issues with processing of tape for CMS, disk servers in front of tape had some problems. Added one, going to add another. The other thing, procurement, disk procurement is out, CPU is in EU standstill until later in the week.

Storage: Sam: Obviously things at Oxford more limited, Alistair keen to do some testing. Have had a few sites expressing some interest. Other VOs: have had discussion about provision for other VOs. LIGO interested in using compute to run stuff on, question being how to get files to those jobs. How best to provide catalogue and data location services? A longer conversation. Already have catalogues, would be nice to be able to use them. Dirac file catalogue.

T2 evolution: [in addition to update in bulletin] Running new VAC at scale has shown up lots of problems, which is really good - timeouts etc. Are testing new VAC release and will start encouraging sites to update once testing is done, after a week and a full release.

Accounting: Benchmarking discussed at HEPSYSMAN; Alessandra

Documentation: VO ID Cards have changed for cdf, planck, superbvo and lsst and enmr.

Discussion of whether superbvo is extant. 

Pete Clarke: That was the Italian one? They moved to working on LHCb.

Steve Jones: Supernemo might be coming back.

Discussion of how to handle VOs that are not in operation. Be careful about declaring VOs dead. Ewan noted an argument for VOs being removed when they expire and reinstated if need be, rather than reusing VOs for different users. Steve noted that adding a VO isn't trivial - Ewan suggested deauthorising the VOs on a site but not necessarily removing pool accounts etc.

Pete G: Found an email from 2013, SuperB VO isn't extant.

Ewan: Can we clean up the VO card, if the main VO managers aren't available?

Steve: Can be difficult.

Discussion of publishing tutorial: Steve noted that the tutorial describes how Liverpool did the accounting, so other sites could compare; the ticket some sites have about accounting is potentially related to ARC CEs. There are different ways that core count is reported. Issues with heterogeneous clusters and reporting have been around for some time.

Interoperation/EGI Ops: David: Nothing to report, next meeting 8/2, same topics as previously.

ROD: Kashif: I'm on duty this week. Only thing is glue 2 alarms [related to ticket]

Rollout:

Security: Nothing to report. Ian made a note of the IGTF release candidate.

Services: 

Tickets:

Pete: ARC bug?

Sam: We believe that you can't publish multiple subclusters in one ARC. As noted, have never perfectly accurately reported a heterogeneous cluster, worked to satisfy tests. Different entities need different workarounds. E.g. REBUS removes a subcluster if it's the same as another, even if this accurately reflects reality. Different workarounds in place to work with different entities.
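
To make the REBUS point above concrete, here is a small sketch of why dropping a subcluster that is identical to another under-reports a site. This is not REBUS's actual code; the `SubCluster` fields and all numbers are hypothetical and chosen only to show the arithmetic.

```python
from collections import namedtuple

# Hypothetical, simplified description of a published subcluster.
SubCluster = namedtuple("SubCluster", ["cpu_model", "logical_cpus", "hs06_per_cpu"])


def total_logical_cpus(subclusters):
    """Sum logical CPUs over every published subcluster."""
    return sum(sc.logical_cpus for sc in subclusters)


def total_after_dedup(subclusters):
    """Sum logical CPUs after collapsing subclusters with identical attributes."""
    return sum(sc.logical_cpus for sc in set(subclusters))


# Two real but identical subclusters plus one different one (made-up figures).
published = [
    SubCluster("model-a", 512, 10.0),
    SubCluster("model-a", 512, 10.0),
    SubCluster("model-b", 256, 12.0),
]

print(total_logical_cpus(published))  # 1280
print(total_after_dedup(published))   # 768
```

In this made-up case a consumer that deduplicates sees 768 logical CPUs where 1280 exist, which is the kind of mismatch that leads sites to publish workarounds rather than an accurate description.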

Pete G: Pragmatic solution, this is not up to the job. Is there a recipe to satisfy tests?

Chris Brew: Thought I had a patch for this, working out why we're supposedly failing.

Andrew Lahiff: Looked at this, but saw different numbers in different places.

Ewan: Close the ticket; might not be the best use of effort to fix this as it has historically been problematic and may well continue to be so.

Pete G: Other places?

Sam: Suspect the rest of the world has been ticketed as well.

Kashif: Alerts show up on dashboard, so have to create ticket. Can we make these non-critical? Procedure is to create ticket within 24 hours of alarm; if you don't, you get ticketed by (well, used to be) COD. If we can make these not critical it would solve this.

Matt: Agree we want to close ticket, need to state our reasoning.

Pete G: Can Kashif pass up feedback that we'd like to make this not critical?

Kashif: Might be better to go through OMB; ask Jeremy?

ACTION: Ask Jeremy to pass on feedback asking for alarm to be made non-critical. [https://ggus.eu/?mode=ticket_info&ticket_id=118930]

Meetings: GDB

Andy McNab: The thing that I really noticed is this AARC idea, EU funded to try to replace certificates as user visible creds with federated identity. Doesn't seem short term, might start looking at command line things in a couple of years, starting with interactive websites. Seemed to have political backing from CERN people. One of the suggestions is that it would remove the need for people to have CERN accounts. Would require all institutions to have specific additions made to their identity management.

Security work: nothing too controversial.

Data management plans for WLCG and HNSciCloud: This was about data preservation, describing the landscape: in the US there had been requirements to handle publicly funded data in a better way, which generated lots of policies. This is trying to see how we would fit into that, first steps in doing this kind of thing in the WLCG environment. Practicalities of handing over access to data - raw data is large. Matching policies in other domains to our domain. Implications if we have to comply with something we're giving input to.

Romain talked about landscape of how cyber attacks are evolving and things that we need to think about.

Other meetings: Ganga tutorial, useful?

Matt: Very useful, and it was a nice day! Good to get a user eye view. 

Pete G: Discussed in past that might open up to new VOs. Was the format useful for them or need to be tailored?

Matt: Make sure everyone is in same place beforehand, accounts etc.

HEPSYSMAN:

Pete G: I thought it went well - other feedback?

Matt: Good meeting, Grid dominated. Can't remember all the actions?

Ewan: Have some notes, turn into TB Support post.

Pete G: Should come out with list of items, push forward after the meeting. If people have things they remember let me know.

Ewan: CernVM, Ganga, DIRAC as a stack we're close to having that. Some polish left, but following the list you can submit jobs, reassuring. How is that a sustainable approach? At the moment our only plan, interesting stuff to be done, Ganga development. 

ACTIONS:

O-151215-03    Clarify process for declaring data loss to ATLAS.: Ongoing?

Sam: Don't think it's still ongoing, I will double check to make sure there's nothing I've missed.

Pete: If it is closed, please update the wiki.

O-150327-01: Done, involved updating VMs to work with new pilot framework, done at end of last year.

O-150327-03: progress in background, puppet module for HTC CE, changes to run it included in HTC 2.0, from last week. Next time I have timeslot will follow up.

AOB:

David C: Noted end of Romain's GDB talk covering SOC work, and that work was starting on looking at a platform for sharing intelligence in the UK.

Transcript
==========

Matt Doidge: (19/01/2016 10:58)
How are you recording Vidyo Dave?
Ewan Mac Mahon: (11:04 AM)
OMG, audio Winnie.
Alessandra Forti: (11:06 AM)
I've uploaded the atlas report from elena
sorry if you asked about it, I had a sound mute still from the hepsysman
Ewan Mac Mahon: (11:08 AM)
He did, we'll probably come back to it.
Govind: (11:10 AM)
RHUL in test mode for Atlas last few days and asked cloud support but no reply..

Matt Doidge: (11:14 AM)
https://its.cern.ch/jira/browse/ADCMONITOR-412
Alessandra Forti: (11:16 AM)
I don't have updtes on that Martina has moved on and we need to discuss with ATLAS ops the ASAP maintenance.
Ewan Mac Mahon: (11:18 AM)
Someone (Gareth?) might need to mute.
I certainly thought superB was long since dead.
But I'm not quite sure why I think that.
Alessandra Forti: (11:25 AM)
I withdraw my comment on Martina. Clearly it was my projection for the future
Federico Melaccio: (11:26 AM)
https://en.wikipedia.org/wiki/SuperB
cancelled in 2012
Alessandra Forti: (11:30 AM)
Belle II is using wlcg resources indeed they are also more integrated with us since CHEP15
not in UK though
Samuel Cadellin Skipsey: (11:32 AM)
(In fact, Belle 2 is actually using LFCs actively!)
Federico Melaccio: (11:37 AM)
It's me
Nope
Daniela Bauer: (11:41 AM)
Close with WONT FIX.
Ewan Mac Mahon: (11:41 AM)
We've never got it right.
We've got it not-tripping-the-alarms
It's not the same thing.
Daniela Bauer: (11:42 AM)
And what kind of ticket is this anyway. It's like the cliche user "It doesn't work" ticket.
Steve Jones: (11:42 AM)
You can't make a silk purse from a sow's ear.
Daniela Bauer: (11:42 AM)
Bah Humbug#
Chris Brew: (11:43 AM)
Is gstat still going?
Ewan Mac Mahon: (11:43 AM)
And fundamentally no-one is attempting to make any use of the fine-grained subcluster information anyway.
Alessandra Forti: (11:43 AM)
there were plans to resurrect it from EGI
Ewan Mac Mahon: (11:43 AM)
It's all /completely/ pointless.
The glue schema can't cope with elasti resources.
It can't tell the difference between a logical CPU and a hyperthreaded unit.
It's al crap.
Er, all.
Lukasz Kreczko: (11:44 AM)
yeah...
Alessandra Forti: (11:45 AM)
when Glue 1 it was started there were only logical cpus i.e. cores
the accounting stuff was hammered in the schema that wasn't designed for it
Daniela Bauer: (11:46 AM)
Fix the test !!! Fix the test!!!
Alessandra Forti: (11:46 AM)
It is the reason why we are going to have the session at the WLCG workshop
where weare going to hopefully discuss how to move on from the BDII
Matt Doidge: (11:47 AM)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=118932
12 children!
Alessandra Forti: (11:48 AM)
problem the children are more verbose.... 
Daniela Bauer: (11:48 AM)
I currently don't see an alert for Imperial in teh dashboard
so I think I can just ignore this for now...
I second Kashif. We should get rid of this alarm and be done with it.
Ewan Mac Mahon: (11:49 AM)
Our case is "it's all bullshit"
Lukasz Kreczko: (11:49 AM)
maybe there is a nicer way to say it in the ticket
Steve Jones: (11:50 AM)
This is what we do at L'Pool, for better or worse.

https://www.gridpp.ac.uk/wiki/Publishing_tutorial
Ewan Mac Mahon: (11:50 AM)
I think there's value in being unequivocal.
But yes, the case is that there is no realistic prospect, nor practical purpose, in the figures bing 'right', and that there is no benefit in raising alarms and tickets for them being wrong in a particular subset of ways, but not others, therefore please stop.
Alessandra Forti: (11:55 AM)
I'mjeremy
Ewan Mac Mahon: (11:55 AM)
Fairly sure you're not :-)
Matt Doidge: (11:55 AM)
We're all Jeremy?
Ewan Mac Mahon: (11:56 AM)
It's a distributed parallel Jeremy.
Matt Doidge: (11:56 AM)
I can field the ticket if needed.
Alessandra Forti: (11:56 AM)
it seems there is consensus that publishing these numbers is rubbish
what I wonder is if Glue2 will solve at least partially the problems we are seeing or if we have to be more "unequivocal"
Samuel Cadellin Skipsey: (11:59 AM)
Alessandra: well, the problem Glue1.3 had wasn't even entirely Glue1.3 (it was that WMSen aren't smart enough to understand SubClusters). So, getting rid of WMSen would also seem to help...
Steve Jones: (11:59 AM)
It's not necessarily rubbish to publish the values. But it is rubbish if the meaning of the values, and hence how they are derived, is not universally agreed.

Alessandra Forti: (11:59 AM)
no if then everything is based on subclusters description
/no/not/
Steve Jones: (12:00 PM)
In short - GIGO.
Alessandra Forti: (12:01 PM)
The problem with WMS is that EGI doesn't move at all if nobody complains
Ewan Mac Mahon: (12:01 PM)
There's still an underlying assumption that it is both useful and possible to exactly describe the cluster. 
Which isn't true on either count.
Alessandra Forti: (12:01 PM)
or say something. They think the BDII is perfect as it is. 
gone
Steve Jones: (12:02 PM)
There is the practical task of a) showing the power of the cluster, b) knowing how many slots it has and c) and measuring how much work has been done by it.

At present, the info system is the only way to do these things.


Alessandra Forti: (12:03 PM)
for me quite well
Gareth Smith: (12:03 PM)
Sorry - have to leave meeting now.
Ewan Mac Mahon: (12:03 PM)
We don't need to know the first half of that, and the accounting barely uses the information system.
Paige Winslowe Lacesso: (12:03 PM)
Apologies, I must leave meeting now.
Steve Jones: (12:04 PM)
I'm not so sure. Knowing the power tell you the importance of the site.
Federico Melaccio: (12:04 PM)
it looks like ARGO also needs some information, according to this child ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=118928
Steve Jones: (12:04 PM)
Slots? Perhaps that's not so important. 
Alessandra Forti: (12:04 PM)
not true Ewan. The accounting uses all that information. there are different types of accounting
Steve Jones: (12:04 PM)
But the benchmark/scaling factor goes through the IS as well. 
Alessandra Forti: (12:05 PM)
and often operations people in the experiment reason in terms of slots not HS06...
Ewan Mac Mahon: (12:05 PM)
We can do accounting from a VAC-only system that has none of this. It might be being used in places, but there's nothing it's necessary for.
Steve Jones: (12:05 PM)
It's true, but this is the system we have today.
Alessandra Forti: (12:05 PM)
VAC mimics PBS apel publishing and uses the same numbers
Steve Jones: (12:06 PM)
And this is the system _they_ use, as far as I know. Hence, we have to use it. I'd write a new system , but it's not my job.
Alessandra Forti: (12:06 PM)
in Manchester they are reported in REBUS because I included them in the total
Matt Doidge: (12:07 PM)
Also thanks to Alessandra for making sure we had cake with our coffee.
Best coffee-break snacks I've had at a meeting for a long time.
Steve Jones: (12:09 PM)
Nice fig rolls too.
Peter Gronbech: (12:10 PM)
https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
Matt Doidge: (12:11 PM)
I'd like to add Lancaster to the httpd TF "volunteers"
Alessandra Forti: (12:11 PM)
cakes were teamwork with Sabah. I brought the mince pies, he bought the rest :)
Ewan Mac Mahon: (12:12 PM)
And I think we can probably drop Oxford from the HTTP TF list too - certainly for the IPv6-only endpoint which has very little future.
Matt Doidge: (12:13 PM)
Also on the LZ UI action it's worth mentioning that the cvmfs UI should be configured for LZ too.
Ewan Mac Mahon: (12:14 PM)
'Should be' in the future sense of being something we should do, or in the sense of we think it already has been?
Matt Doidge: (12:14 PM)
think it already has been
Elena gave it a quick test which seemed okay.
I'm working on a new tarball UI today and when I put that in cvmfs I want it to be well polished.
With all the UK supported VOs in it.
Ewan Mac Mahon: (12:16 PM)
In that case, if it works from the cvmfs tarball UI I think I might just declare that done on the Oxford UIs since they have access to the cvmfs, rather than adding it to the local UI setup.
Federico Melaccio: (12:17 PM)
thanks, bye

    • 11:00 11:01
      Ops meeting minutes 1m
      * This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
      * The team composition has been changing. If everybody contributes then the task comes around less often.
      * From the start of GridPP4+ those in fully funded GridPP positions will be expected to contribute. Others are welcome to volunteer!
      * The minutes should contain a list of who attended; apologies; note who took the minutes and highlight actions.
      * A count is maintained at https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items.
      * After uploading minutes to the agenda page the minute taker is expected to:
      ** Update the list of ops actions.
      ** Update their 'count' so the task can be shared fairly.
      Thank you for your support!
    • 11:01 11:20
      Experiment problems/issues 19m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
      - ATLAS
      - Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
      - GridPP DIRAC status [Andrew McNab]
      -- https://www.gridpp.ac.uk/gridpp-dirac-sam
      - Status of pilot enabling across sites.
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
      - General updates
      - WLCG ops coordination
      - Tier-1 status
      - Storage and data management
      - Tier-2 Evolution
      - Accounting
      - Documentation
      - Interoperation
      - Monitoring
      - On-duty
      - Rollout
      - Security
      - Services
      - Tickets
      - Tools
      - VOs
      - Site updates
    • 11:40 12:00
      Feedback/discussion on... 20m
      - The January GDB: https://indico.cern.ch/event/394776/
      - The ganga workshop: https://indico.cern.ch/event/465558/
      - HEPSYSMAN: https://indico.cern.ch/event/465560/
    • 12:00 12:05
      Actions & AOB 5m
      * https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items