Ops Meeting Minutes, Tuesday, 15 July 2014

Attendees: Alessandra, Andrew M, Brian D, Chris B, Dan T, Daniela B, David C, Elena K, Ewan M, Federico M, Gareth R, Gareth S, Ian C, Ian L, Jeremy C, John B, John H, Kashif M, Mark S, Matt RB, Raja, Rob F, Sam S, Steve J, Wahid B

Experiment problems/issues

Review of weekly issues by experiment/VO

- LHCb
Raja: Going smoothly. Low level Monte Carlo.

- CMS
Daniela: Limited news. Bristol have been struggling. There is a DPM issue at Brunel concerning inefficiencies. Raul had tried to address this actively with CMS before, but only now have they followed up. Wahid has been talking to Raul about the issues. There was already a well-known problem that was fixed in the latest release, but this current CMS problem may be new.

- ATLAS
Lost contact for first bit, then...
Elena (describing some ops meeting): HPC was discussed. AGIS is very reliable; ATLAS has used it for 2 years. It offers dynamic views. Group is working with it to set up new queues. A new system for assessing site usability is to be brought in. It will create an automatic report and categorise sites into A, B, C. Liverpool and Sheffield are not T2Ds, thus they are automatically demoted. This will complement the existing assessment, for the time being at least.
Alessandra: The ATLAS DC14 programme will include a mix of single and multi-core jobs. She will discuss baseline considerations for setting this up next week. She will talk about a solution to the draining problem.

- Other
n/a

Meetings & updates

With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates
n/a

- WLCG ops coordination
Jeremy: What about ILC, who are dropping a VOMS server? Short discussion. ILC will be informed of our consensus, i.e. better just to turn off the VOMS server, as the system can cope with that. Once proxies have expired (24/48 hours), remove it from the Operations Portal. Sites will then get alerts and can respond.

------------------------------------------- VIDYO BROWNOUT (one of several) -------------------------------------------

- Tier-1 status
Gareth: Castor updates are complete. FTS2 will be stopped 2 Sept.

- Storage and data management
n/a

- Accounting
Jeremy: UCL is now the only site using APEL 2. Ticket raised.

- Documentation
Jeremy: Problem with stale documents alert has been fixed.

- Interoperation
David Crooks: There will be a meeting next Monday. There will be a new CREAM. On migration to central SAMs: some (non-UK) sites had version issues. On APEL: will follow up at UCL. On monitoring reliability: there will be a manual re-computation. On UMD: thanks for survey response - about 80 so far.

- Monitoring
David Crooks: 4th July meeting. Discussed SAM3, visualisation. Sites are reminded about the site monitoring wiki page.

- On-duty
n/a

- Roll-out
n/a

- Security
JC: Sites are reminded about EGI-ADV-20140625, high risk.

- Services
n/a

- Tickets
Matt: 29 open UK tickets today.

FNAL VOMS TICKETS
As seen on TB-SUPPORT, a number of sites got tickets concerning jobs still contacting the FNAL VOMS server for CMS/ILC. Birmingham, RHUL, Liverpool and the Tier 1's tickets are still being worked on. RHUL's ticket might not have been spotted yet (still assigned).

DECOMMISSIONING THE FTS2 SERVICE
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106615 (2/7)
Gareth opened a ticket to document the retirement, in accordance with ancient grid laws. As naught is happening until the 2nd of September, I put it on hold till nearer the time. On Hold (14/7)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106770 (10/7)
enmr.eu wanted to add tags to one of the Tier 1's ARC CEs, which of course didn't work. There was an interesting exchange about why a VO would still want to have a site publish tags in the age of cvmfs (essentially so they can minimise changes to the submission gubbins).
Andrew offered to add the tag "VO-enmr.eu-CVMFS" by hand to his CE. It's likely that other sites might be asked to do the same, and it's a solution worth noting for other VOs. In progress (14/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106610 (2/7)
Enabling HyperK at the Tier 1. Ticket looks a little stalled after Chris commented that it was wise for HyperK to be enabled only on ARC CEs (in light of RAL going dairy free). In progress (2/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (4/7)
UCL are still having trouble with Nagios tests after a pool node died. Ben is having trouble getting the new disk server set up. I tried to give him some tips and advised shouting out for help. In progress (8/7)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Bristol are having trouble with CMS transfers. Lukasz noticed StoRM was being odd (believing there to be no free space when there was). The SE was kicked but the problem (or a similar one) showed up again. Anyone seen similar? (Looking at Chris Walker: StoRM Sage again here.) In Progress (9/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
cf. TIER 1 ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324
CMS pilots losing contact with their home base. Looks similar to the issue at RAL, where they seem to have had some success (still waiting to see if it was complete). If the RAL chaps could elaborate on the firewall tweaks that brought about this improvement it would be greatly appreciated (the RAL ticket could do with an update too)! In Progress (14/7)

- Tools
Kashif: On 12th July, there was chaos with Nagios after a power cut. Duff results got into the top BDII and made sites look bad. After the system was restarted, sites continued to be impacted due to the backlog. Various suggestions were made.
Jeremy: Why not flush the BDII cache?
Kashif: A manual restart does that. It will be discussed further at the ops meeting and so forth.

- VOs
LHCb: An alert was sent out by Andrew Lahiff last week alerting sites that an environment variable is needed on ARC CEs for LHCb. It directs jobs to specific queues.
Biomed: Discussion on biomed. JC to have a word with VO managers. Stephen Burke suggests adding a requirement like TotalJobs < MaxTotalJobs.
HyperK: The VO will only use storage at QMUL, London. It runs properly on SGE/CREAM CE. Only to be implemented on ARC/Condor at RAL, as CREAM is being phased out.

- Site updates
n/a

Review of WLCG workshop

See http://indico.cern.ch/event/330837/contribution/3/material/0/0.pdf

Comments:
Alessandra commented on CREAM fixes to pass walltime through to Torque/Maui, to allow multi-core jobs to work. More on this next week.
David Crooks commented on Monitoring Technology. The MonALISA model will be used. There is also a proposal to put in heavy hardware monitoring and, e.g., cvmfs monitoring. Discussion is required to check if this should remain a site task. Will be discussed at next consolidation meeting. Please contact David Crooks to share your views on this.

AOB

- A reminder to register for GridPP33, 20th-22nd August: http://www.gridpp.ac.uk/gridpp33/registration.html

----------------------------------------------------------
CHAT WINDOW CAPTURE (After Vidyo Crash)

John Bland: (15/07/2014 11:33) I'm on 117.103.105.125
wahid: (11:33 AM) 109.105.124.84
David Crooks: (11:33 AM) I'm 141.52.27.20 as well
Robert Fay: (11:33 AM) I'm on the same one as Jeremy
Federico Melaccio: (11:33 AM) I'm on the same too
Mark Slater: (11:34 AM) 109.105....
is NDGF
Chris Brew: (11:34 AM) 141.52.27.20, got kicked off the first time but not the second time
Raja: (11:34 AM) The conference status button is not clickable for me
David Crooks: (11:34 AM) I'm the same as Chris
John Hill: (11:34 AM) I'm "not in a conference"
Mark Slater: (11:34 AM) I've been kicked off once and lost Comms twice :(
John Hill: (11:34 AM) so I can't find out the router
Daniel Traynor: (11:35 AM) vidyourouter.ndgf.org now, kicked off the second time only.
Alessandra Forti: (11:36 AM) yes but you need to be in the meeting to change it. i didn't have any problem with other meeting rooms... using the same router
Gareth Douglas Roy: (11:37 AM) https://ggus.eu/?mode=ticket_info&ticket_id=103577
Ewan Mac Mahon: (11:37 AM) You don't actually have to support biomed if they're more hassle than they're worth.
Alessandra Forti: (11:38 AM) may as well be
Steve Jones: (11:38 AM) They are very good "fillers" when we have some slack!!!
Alessandra Forti: (11:38 AM) it is enough 1 some of them also send jobs with the wallclock set....
Daniel Traynor: (11:40 AM) hyperk working fine at QM with gridengine and creamce
Jeremy Coles: (11:42 AM) To find your router click on the config icon and then go to the 'status' page.
David Crooks: (11:46 AM) Site monitoring wiki :-) https://www.gridpp.ac.uk/wiki/Site_monitoring_status
Ewan Mac Mahon: (11:47 AM) Sorry - this doesn't seem to be quite working. Security stuff is SOP - update and reboot, but soon/now. Including you, ECDF.
Jeremy Coles: (11:49 AM) EGI-ADV-20140625
Ewan Mac Mahon: (11:59 AM) A major CERN deployment moving to CentOS is sort of a big deal isn't it? Not a surprise as such, but still.
wahid: (12:02 PM) well I tried it (package reporter) straight after the mtg and it didn't work - now it does. it's simple enough but I still object to them ever asking for everyone to install it everywhere
Samuel Cadellin Skipsey: (12:03 PM) wahid: I actually rather more object to the hint I heard that it doesn't actually tell the *user* what it is sending to the remote service.
wahid: (12:03 PM) Ewan - acknowledged - andy is poking systems team - they are always slow as they like to consult every user for some reason. Sam - that's true
Ewan Mac Mahon: (12:04 PM) I'm slightly dubious about the package reporter, but on a quick look it basically just seems to ship off the results of an 'rpm -qa'.
Samuel Cadellin Skipsey: (12:05 PM) It would be nice if the package reporter did local logging.
wahid: (12:05 PM) but it's a perl script so you could get it to print
Samuel Cadellin Skipsey: (12:05 PM) Sure, to both of you, but it would be nice if the person who wrote it showed they cared.
Ewan Mac Mahon: (12:05 PM) Which won't work for non-RPM things, and will trawl up unrelated RPMs on (say) shared clusters.
Samuel Cadellin Skipsey: (12:05 PM) You shouldn't have to tweak it to make it behave with the correct respect for sysadmins
wahid: (12:05 PM) That wasn't the only time he said the "one more rpm" line. he also said, as he often does, that they only want "90% of the sites" but then that quickly turns into a MB mandate
Alessandra Forti: (12:12 PM) Sam: "it would be nice if the person who wrote it showed they cared" you ask too much.... ;)
Ewan Mac Mahon: (12:13 PM) have they talked to the shoal folks? Because if you squint a bit that's a squid monitoring system too.
Alessandra Forti: (12:13 PM) we could also feed back the request for printing and logging
wahid: (12:23 PM) they WILL !
Ewan Mac Mahon: (12:25 PM) 'Non-SRM' isn't necessarily helpful if they still need weird stuff. If we can give them (e.g.) bare S3 interfaces, then that's one thing.
Alessandra Forti: (12:26 PM) the major problem with non-srm is the space tokens used as quotas
Ewan Mac Mahon: (12:26 PM) If we move from one set of grid specific tooling to another, we might as well not.
Jeremy Coles: (12:28 PM) I'll try to get through the remaining talks in 5-10 minutes. I appreciate people will want to leave soon... if you do please note the AOB about GridPP33 on the agenda... please register! Thanks.
Ewan Mac Mahon: (12:28 PM) What I want to do is DPM/dmlite -> dmlite with a (probably) ceph backend -> just the ceph. S3 has a strong advantage in multiple implementations existing.
Samuel Cadellin Skipsey: (12:29 PM) Ewan: I may have a plugin or two to throw at you in a week or three
Ewan Mac Mahon: (12:29 PM) Ooh. Jolly good. Does someone want to list all the times where volunteer support for grid middleware has actually worked well? Now I can't tell if no-one's answered that question or everyone is.
Steve Jones: (12:44 PM) Thanks