Attending: Jeremy C, David Crooks, Daniela, Federico, Gang Qin, Gareth R, Ian Loader, John Bland, John Hill, Marcus Ebert, Raja, Winnie, Gordon Stewart, Robert Frank, Oliver Smith, Sam Skipsey, Catalin, Andrew Lahiff, Tom Whyntie, Kashif, Pete Gronbech, Ewan, Andy McNab, Govind, Gareth Smith, Chris Brew
Chair: Jeremy's Evil Clone
Minutes: Matt D
Apologies: Alessandra, Pete Clarke, Elena

Experiment problems/issues 19'

Review of weekly issues by experiment/VO

- LHCb
Low number of grid jobs, but no UK problems. Some jobs are run everywhere, and no problems have snuck in at any site.

- CMS
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
Problem with xrootd fallback for 2 Brunel CEs, mentioned in the computing meeting yesterday.
Raul: the tests are nonsensical; the central redirector is slow. Has been talking to Marian about this.

- ATLAS
No one to report anything. A lot of analysis jobs from over Christmas need to be rerun, so expect more jobs soon. Jamboree later this week.

- Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

- GridPP DIRAC status [Andrew McNab] -- https://www.gridpp.ac.uk/gridpp-dirac-sam
This link has been updated, with a lot more information for regular and VAC/cloud sites on one page. It also includes how long a job waits between being requested and running. The next phase is to split this information down to the VO, via tabs. Colours will be used to present the information more clearly.
Ewan: We need to slightly watch meaningful colours - not everyone can see them all. Coloured background vs white background is usually fairly obvious though.

- Status of pilot enabling across sites.
Daniela: I've started aggregating all the results of my pilot tests here: http://www.hep.ph.ic.ac.uk/~dbauer/dirac/site_pilot_status.html I hope I'll be done by the end of the week.
Cambridge not on the list - Daniela looking at why - something wrong with the macro.

- Other VOs
Not much of interest. Green light from Pete Clarke about working with Euclid - expect it to be added to the "Incubator VO" list soon.
Tom: added things to the UserGuide on data management.

General Updates

RHUL: spacetoken for snoplus?
Sam: DPM directory-level "quotaing" incoming, if SRM isn't used to write to a directory... Somewhat orthogonal to the SRM implementation though - need to disable SRM.

Winnie: CREAM-CEs red, "No handlers could be found for logger "stomp.py"". Kashif will update later.

Elena: how to limit the number of running jobs per user in condor -> Concurrency Limits. Chris came back to this on TB-Support.

DIRAC File Catalog Command Line Interface guide added to the GridPP User Guide by Tom. https://www.gridpp.ac.uk/userguide/data-on-the-grid/dirac-dfc-cli.html
Tom: If you want to give feedback on these docs, raise an issue on the GitHub page, or fork and pull if you want to contribute directly. Emails get lost!

What to update when adding a VO (the LSST example!).
Ewan: the reason this is poorly documented is that it really depends on what each site has! No checklist exists (or can exist).
Matt: Something a little handwavey?
Ewan: VAC VAC VAC.
Tom: Could instructions for searching TB-SUPPORT be added to the relevant Wiki page then?

Notes from the January GDB are now available. https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20160113
Publishing CPUs...!
Sam: WLCG WORKSHOP Site Feedback on Storage technologies. All feedback gratefully received!

WLCG Operations Coordination https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160121
Read the link!
Machine Job Features - RPM to add MJF prologue for lhcb jobs - pbs torque out soon, ht-condor next. If anyone wants to volunteer to test this, let Andy know. Andy will send a message to TB-SUPPORT to ask for volunteers once he has a pbs torque rpm. ht-condor will be later, but will require some work to develop.
More torque/maui sites than Matt thought - but several on their way out.
Volunteers to attend next MW readiness meeting appreciated: Wed 27th Jan. http://indico.cern.ch/e/MW-Readiness_15

Tier 1

A reminder that there is a weekly Tier-1 experiment liaison meeting. Notes from the last meeting here:
http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2016-01-20

We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and sometimes fail to write remotely as well). A recent change has improved, but not fixed, this problem.
We will migrate Atlas and LHCb data from the T10KC (5TB) to T10KD (8TB) generation tapes. Details (including timings) are yet to be finalized, but it is likely to start with Atlas.
We have seen a higher rate of problems on some disk servers in recent weeks. These are mainly individual disk failures. This is being reviewed.
We had significant data loss from a disk server in AtlasScratchDisk when it suffered a triple disk failure a week ago.
We are working on a refresh of the database system behind the LFC.
Both CPU and Disk capacity orders have been placed following a recent tender exercise.
Failure rates between different types of drives over the years have been tested - interesting results but inconclusive. No names are being named.

Tier 2 Evolution: VAC

Vac 00.20.00 released. Emulates OpenStack environment for VMs, Cloud Init, contextualization from HTTP. Restarted testing of Cloud Init ATLAS VMs, and now getting jobs running to Finished state.
New release yesterday - more VM types, simplified dependencies (no more NFS). Supports cloud init VMs - which hopefully will be the suggested atlas VMs.
In a position to encourage more sites to run a balanced atlas/lhcb/cms setup, and more sites to have some VM presence.
Liverpool volunteered to offer a significant amount of resources as VMs. Good time to try things out as metrics "don't count" at the moment.
Govind: What is the period gridpp accounting metrics not considered ?
Pete G: It is likely to restart on 1st April TBC

Accounting: Slight delay for Sheffield.

Documentation: Review in a core ops meeting soon.

On Duty:
Kashif: Message broker fixed now. SAM nagios is going this quarter, so no effort to fix. The problem is that the message broker is chosen at random without checking, and just hangs if it can't get a connection. Kashif received no support on the admin lists - those lists are completely silent.
Monthly EGI Operation MB meeting could do with someone to attend it from the UK.
New ROTA out, could people respond!

Security:
CVE-2016-0728 Linux kernel: use after free in keyring facility, local privilege escalation. EGI SVG Advisory in the works. Affects RH7 and derivatives/similar. RH5, RH6 and derivatives are not affected. RH/SL/CentOS updates published 25/01/2016.
The IGTF has released a regular update to the trust anchor repository (1.71) - for distribution ON OR AFTER January 25th.
Ewan: good news is that the kernel issue does not seem to affect SL5 and SL6 - but does affect recent Ubuntu, so keep an eye out on desktops and the like. Container type VMs need to watch out as well if running on an SL7 host.

Services: Come back to perfsonar another day.

Tickets
Jeremy will reply to the parent of the publishing ticket.
Birmingham and delete the biomed dark data
Jeremy M has taken control of the Sussex tickets.

Inputs to the WLCG workshop 20'

- Topics we would like mentioned/discussed next week: https://indico.cern.ch/event/433164/other-view?view=standard
Anything we particularly want raised?

Using HPC - ECDF and Lancaster stand out.
Ewan: But what happens if HPC runs something we can't use?
Sam: Many HPC systems can't talk externally - so not useful.
Matt: Andy Washbrook's talk from last GridPP is a useful resource.
Andy M: Experiments could do with being more flexible with OS requirements.
Ewan: chroot or container into SL6. glexec also a problem. So need to have pool accounts or glexec on hardware or containerisation/virtualisation.
Andy M: Openstack interest, particularly with SKA.

Security Lightning Talks:
Ewan: Don't move away from certificates! Layer sarongs type thing over certificates to make it "more palatable".
Matt: Can't empathise with a fear of x509!
Ewan: Need to change the pitch, don't explain it too much as that's when we give them the fear. Keep with Cert Wizard in principle.
Tom: https://www.gridpp.ac.uk/userguide/getting-on-the-grid/grid-certificate.html https://www.gridpp.ac.uk/userguide/grid-first-steps/grid-first-steps.html
Paraphrasing Ewan: Changing the backend so we have a prettier front end is madness.
Gareth: Has seen positive experiences from users recently.
Chris: Couldn't the tools use p12 rather than having to create the pem files?
Andy M: I think recent versions of voms-proxy-init do understand p12 too.

Federated Authentication Evolution
Andy M: call from some for central IT to run Shibboleth as an alternative to cern accounts and the like.
All: Madness! Need to keep CAs!
Sam, Andy: comment on the odd cyclic nature of proposed authentication solutions.

Trust evolution between sites and VOs
Ewan: passwords bad! Maybe we need to give out hardware tokens?
Andy: FNAL was always keen on this.
Jeremy: WLCG phone app!

Medium Term Evolution:
Data - skip over this one, section set up as a discussion.

Information systems, accounting and benchmarking.
Alessandra would like feedback. There's a gigantic thread on this. Many scenarios where site/top bdiis aren't around anymore.

Accounting
Sam: storage accounting resurgent.
Andy M: Emphasis being pushed to wall clock accounting.
Sam: This is what HPC systems use.
Ewan: And indeed commercial clouds.
Sam: Storage accounting would like to be installed at more sites, but the DPM publisher needs some fixes committed. Not sure on status for other storage systems. Progress quite slow. Future dependent on what happens to storage at sites. Call likely to be for a finer grained accounting.
Early days yet if this all actually works. Move to generic record format?

Benchmarking
Ewan: does it make sense to use benchmarking anymore? Move to "units processed". Benchmarking for dealing with vendors or accounting? Need to define scope.
Jeremy: benchmarks are also used for pledging.
Andy: Experiments need some method to quantify how much CPU they want/need/have. Experiments use both sides of benchmarking: quantifying the compute available and accounting for the compute used. Heavily involved, peer reviewed process that needs a common factor between experiments.
Ewan: Real metric is job slots. Speed/hepspec per slot fairly flat.

Next day - High Luminosity Era.
Evolution to deal with greater data, and reducing infrastructure costs. Using specialised resources.
Ewan: Return to benchmarking for a sec - using novel resources does not mix with a flat benchmark (example: GPUs).
Sam: hard to predict more than 10 years into the future. Bit silly to even try. CPU scaling doesn't happen anymore, processor width scaling is slowing down. Either magic crystal computing or it'll be MIC/GPGPU everywhere.
HSF/concurrency - desire for a very focused testing forum. But will need cash! Storage is a thing these guys are testing too.

Meeting - Running out of people now.
Ewan: future is diversity! Backends to commercial clouds and the like are deliberately obfuscated. Need to make the frontend as flexible as possible.
Sam: Which was the whole point of Grid Computing. We're finally there after more than a decade.
Sam: Beware of wrong models!
Ewan: When thinking about where we're going, remember that what we've done so far is actually really good. We've rolled with vast increases in network and storage, and replaced entire pieces (SEs, batch systems) without reworking everything from the ground up. We've muddled through quite well, smoothing over disruptive changes.
Jeremy: a "step change" within the ability of the software to cope is coming though!

Close meeting.
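
As a rough illustration of the accounting and benchmarking points above (wall clock accounting, job slots, and a fairly flat hepspec per slot), here is a minimal sketch of how the quantities relate. It is not the accounting publisher's actual code: the `JobRecord` fields, the function names and the HS06-per-core figure in the usage lines are all hypothetical, and the scaling "wall-clock core-hours times a per-core benchmark" is stated here as an assumption.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job usage record (field names are illustrative only)."""
    cores: int           # job slots held by the job
    wall_seconds: float  # elapsed time for which the slots were held
    cpu_seconds: float   # CPU time actually consumed, summed over all cores


def wall_core_hours(job: JobRecord) -> float:
    """Wall-clock usage: slots held multiplied by elapsed hours."""
    return job.cores * job.wall_seconds / 3600.0


def hs06_hours(job: JobRecord, hs06_per_core: float) -> float:
    """Benchmark-scaled usage, assuming one flat HS06 value per slot."""
    return wall_core_hours(job) * hs06_per_core


def cpu_efficiency(job: JobRecord) -> float:
    """Fraction of the held wall-clock capacity that was spent on CPU."""
    held = job.cores * job.wall_seconds
    if held <= 0:
        raise ValueError("job must hold at least one core for a positive time")
    return job.cpu_seconds / held


if __name__ == "__main__":
    # Made-up example: an 8-core job that ran for two hours.
    job = JobRecord(cores=8, wall_seconds=7200.0, cpu_seconds=46080.0)
    print(wall_core_hours(job))
    print(hs06_hours(job, hs06_per_core=10.0))  # 10.0 is a hypothetical value
    print(cpu_efficiency(job))
```

If hepspec per slot really is fairly flat across a site, the benchmark-scaled figure is just the wall-clock figure times a constant, which is one way to read the remark that job slots are the real metric.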
Raja Nandakumar: (26/01/2016 11:00) Cool - we have a clone of Jeremy?
Paige Winslowe Lacesso: (11:01 AM) Either Wonderful or Terrifying, unsure which
Matt Doidge: (11:01 AM) Depends if it's an evil clone. Does it have a moustache?
raul: (11:05 AM) this is non-sense Nope the central redirector is slow no Mic I've been talking to Marian
Jeremy Coles: (11:09 AM) https://indico.cern.ch/event/4408 21/
Gareth Douglas Roy: (11:10 AM) https://indico.cern.ch/event/440821/
Matt Doidge: (11:10 AM) https://indico.cern.ch/event/4408 21/
Federico Melaccio: (11:10 AM) https://indico.cern.ch/event/440821/
Gareth Douglas Roy: (11:10 AM) yay! noooooo
Daniela Bauer: (11:13 AM) I've started aggregating all the results of my pilot tests here: http://www.hep.ph.ic.ac.uk/~dbauer/dirac/site_pilot_status.html I hope I'll be done by the end of the week
Ewan Mac Mahon: (11:17 AM) We need to slightly watch meaningful colours - not everyone can see them all. Coloured background vs white background is usually fairly obvious though.
Tom Whyntie: (11:18 AM) None from me, but I'm adding things to the UserGuide on data management
John Hill: (11:20 AM) Any particular reason why Cambridge isn't on Daniela's pilot test page?
Jeremy Coles: (11:21 AM) I'll ask!
Ewan Mac Mahon: (11:23 AM) There's a lot covered by the words 'have to disable SRM'
Samuel Cadellin Skipsey: (11:23 AM) Ewan: yeah, I know, It's not totally clear that SRM actually gives people a lot more than just spacetokens, for T2 sites, though. So if you could replace spacetokens with quotas.... (is the intent, at least)
Ewan Mac Mahon: (11:24 AM) I mean, it's easy enough at the site level, but are any of our major VOs actually ready? I know they're mostly almost ready, but actually ready?
Samuel Cadellin Skipsey: (11:24 AM) There's a test "srmless" ATLAS DPM being tested somewhere (in France?)
Tom Whyntie: (11:24 AM) np Raise an issue on the GitHub page please :-) https://github.com/gridpp/user-guides/issues
Samuel Cadellin Skipsey: (11:25 AM) (But, yes, Catch-22s are terrible things, Ewan)
Ewan Mac Mahon: (11:25 AM) preferably with a fork and a pull request, right?) ^ That last on Tom's Guide, not the SRM thing.
Daniela Bauer: (11:26 AM) I found Cambridge. I need to learn not to let my macros silently fail...
Tom Whyntie: (11:27 AM) You're very welcome to fork and pull request, but simply noting or making suggestions works too. I just find the issues easier to track than emails that can get lost. It also means everyone watching the repo can see if they've had the same issue, and we have a public record of what's been changed and how the guide is evolving.
John Hill: (11:28 AM) @Daniela: Thanks for looking so quickly
Tom Whyntie: (11:29 AM) @Ewan: yup
Ewan Mac Mahon: (11:31 AM) VAC, VAC, VAC, VAC
Tom Whyntie: (11:31 AM) Would something like a JIRA group help? ie. a forum where people can search the posts
Ewan Mac Mahon: (11:31 AM) tb-support more-or-less is that. It does have an archive. As well as the liveliness to actually have useful content.
Tom Whyntie: (11:33 AM) Could instructions for searching TB-SUPPORT be added to the relevant Wiki page then?
Ewan Mac Mahon: (11:33 AM) Hmm, that shouldn't be a problem. Which wiki page is that?
Peter Gronbech: (11:33 AM) https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=TB-SUPPORT
Jeremy Coles: (11:34 AM) https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160121
Ewan Mac Mahon: (11:34 AM) (also, in practice I tend to find that Googling most matters griddy actually hits the tb-support archives a large fraction of the time anyway) VAC?
John Hill: (11:41 AM) We are
Paige Winslowe Lacesso: (11:41 AM) Bristol has 2 tq/maui CEs
John Bland: (11:42 AM) torque on half our nodes alongside condor
raul: (11:42 AM) 1 torque to be retired at brunel.
John Bland: (11:42 AM) ours may well be replaced with VAC
John Hill: (11:43 AM) I can look at it
Paige Winslowe Lacesso: (11:43 AM) Bristol plans to replace SL5 CREAM-CEs doing tq/maui w/SL6 condor CEs (or whatever they are)
Ewan Mac Mahon: (11:43 AM) ^ VAC :-)
Andrew McNab: (11:45 AM) Vac conveniently supplies MJF by default :-)
Ewan Mac Mahon: (11:47 AM) Making the deployment options here once again vac or faffing? Hmm. Have we got anything left in the GridPP4+ dissemination budget? We should get t-shirts with 'VAC or faffing' on. And the logo.
Matt Doidge: (11:49 AM) Oops, wrong window. Can I put it in without mentioning names?
Ewan Mac Mahon: (11:51 AM) Erm. What /are/ you doing in this other window?
Matt Doidge: (11:51 AM) Minutes.. but I just realised what I typed - it was a question to Gareth about the drives tests! Nothing dodgy. Honest
Gareth Smith: (11:53 AM) I would just leave it that we do see different failure rates between the disks in different batches of servers. Although that is so bland as to be rather useless. We will dig into the stats a bit more.
Ewan Mac Mahon: (11:54 AM) Is there any particular need to avoid naming names? If you've got the data to back it up, what's the downside? As you say - not mentioning names does make it rather less useful.
Govind: (11:56 AM) What is the period gridpp accounting metrics not considered ?
Peter Gronbech: (11:57 AM) It is likely to restart on 1st April TBC
Ewan Mac Mahon: (11:57 AM) The subtext of this, of course, is that the most economically rational approach here is to turn all the worker nodes off for a couple of months to save on the electricity bills. Yup; not aware of anything unusual with the IGTF release.
John Bland: (12:05 PM) fs issues on the CE, looking into it
Jeremy Coles: (12:13 PM) https://indico.cern.ch/event/433164/other-view?view=standard
Ewan Mac Mahon: (12:22 PM) Cloudy interfaces are definitely going to be the way to go for sharing resources among diverse user groups.
Whether the back ends are VMs, containers, or bare metal.
John Hill: (12:25 PM) The annual renewal confuses many of my users
Ewan Mac Mahon: (12:26 PM) Certwizard writes stuff out for you in the various formats already.
Samuel Cadellin Skipsey: (12:26 PM) (I don't have to look up the incantations, but I was an RA Operator at Edinburgh, so I'm just really odd.)
Matt Doidge: (12:26 PM) You are the Cert Wizard Sam.
Tom Whyntie: (12:27 PM) https://www.gridpp.ac.uk/userguide/getting-on-the-grid/grid-certificate.html https://www.gridpp.ac.uk/userguide/grid-first-steps/grid-first-steps.html
Samuel Cadellin Skipsey: (12:27 PM) Matt: Certificus Separatus!
Tom Whyntie: (12:27 PM) I've done my best with the UserGuide...
Chris Brew: (12:27 PM) I would say the problem is the other way round - you get the p12 file and have to convert it to use it. Couldn't the tools use that rather than having to create the pem files
Tom Whyntie: (12:28 PM) And the students and Steve J's got on OK (and got some good feedback to help improve it).
Samuel Cadellin Skipsey: (12:28 PM) Well, yes, it turns out that the NGS actually had a client which understood p12 files. So, yeah, Chris, you're quite right, that would've been a reasonable approach.
Andrew McNab: (12:29 PM) I think recent versions of voms-proxy-init do understand p12 too
Ewan Mac Mahon: (12:30 PM) To which: a ha, ha, ha, ha, ha. If anyone thinks it's hard to get things added to HPC clusters, try convincing your IT Services to dick around with their Shibboleth IdP
Federico Melaccio: (12:31 PM) slides from last GDB: http://indico.cern.ch/event/394776/contribution/3/attachments/1210272/1765020/GDB_Federated_Access.pdf
Ewan Mac Mahon: (12:32 PM) This is a (nother) specific case of the general principle we have with new user communities - it is clearly better to use the 90% we already have, and build the missing 10%, than to trash the lot and build 100% of a new thing. But it is more exciting to build 100% of a new thing.
Samuel Cadellin Skipsey: (12:34 PM) MICE has one, do they not?
Andrew McNab: (12:34 PM) FNAL was always keen on this Ewan, or sends you a text
Ewan Mac Mahon: (12:36 PM) Yeah; I'm not a massive fan of SMS based 2-factor because it's a monumental pain in the arse if you don't have reception.
Matt Doidge: (12:36 PM) Like if you're in a machine room
Ewan Mac Mahon: (12:36 PM) TOTP seems a pretty good compromise, and (with caveats around client support) U2F looks really good.
Federico Melaccio: (12:36 PM) i think the online-banking-like token is the best
Ewan Mac Mahon: (12:37 PM) ^ Which more-or-less is TOTP. But doing it on phones is cheaper than giving people dedicated hardware tokens.
Federico Melaccio: (12:37 PM) true
Ewan Mac Mahon: (12:51 PM) I think we have a reasonable handle on the next couple of years of CPU advances, and they're not going to be returning to increasing clock speeds and per core performance. It's all about the cores.
Tom Whyntie: (12:59 PM) #lol
Federico Melaccio: (01:00 PM) the model is not real the model is a lie
Gareth Douglas Roy: (01:00 PM) n dimensional curve fit
Federico Melaccio: (01:03 PM) thanks, bye
Tom Whyntie: (01:03 PM) Thanks, bye
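
For readers unfamiliar with the TOTP scheme mentioned in the two-factor discussion above, the following is a minimal sketch of how a phone app or hardware token and a server can derive the same short-lived code from a shared secret and the clock, following RFC 4226 (HOTP) and RFC 6238 (TOTP). It uses only the Python standard library; the function names and the `window` drift tolerance are illustrative choices, not anything proposed in the meeting, and a production service should rely on a maintained library rather than this sketch.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP code (RFC 4226) for a shared secret and an event counter."""
    # HMAC-SHA1 over the counter encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """TOTP code (RFC 6238): HOTP with the counter derived from the clock."""
    return hotp(secret, unix_time // step, digits)


def verify(secret: bytes, candidate: str, unix_time: int,
           step: int = 30, window: int = 1) -> bool:
    """Check a code, accepting neighbouring time steps to allow clock drift."""
    counter = unix_time // step
    return any(
        hmac.compare_digest(hotp(secret, counter + i), candidate)
        for i in range(-window, window + 1)
        if counter + i >= 0
    )


if __name__ == "__main__":
    import time

    # The ASCII secret used in the RFC test vectors; a real deployment would
    # use a random per-user secret provisioned to the phone or token.
    shared_secret = b"12345678901234567890"
    print(totp(shared_secret, int(time.time())))
```

Nothing here needs network reception at login time, which is the practical difference from SMS codes raised in the chat: both sides only need the shared secret and a roughly synchronised clock.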