Present (mostly surnames): Lahiff Davies Brew Traynor Bauer Crooks Rand Korolkova Mahon Melaccio Qin Roy Gordon Stewart Govind Ian Loader Coles Bland Hill John Kelly Mohammad Kreczko Raso-Barnett Smith Gronbech Nandakumar Raul Frank Skipsey Jones Wahid Wash

Experiment problems/issues 20'
Review of weekly issues by experiment/VO

- LHCb
Raja: 3 things -
1) Andrew McNab will be the new LHCb UK Computing Coordinator (the division of labour between Raja and Andrew is not yet decided). In practice, for now, things remain unchanged for the purpose of Ops meetings.
2) Still anticipating next round of restripping early Nov.
3) Problem with Sheffield - cannot submit jobs
( Raja Nandakumar: (11:03 AM) glite-ce-job-status -a -e lcgce2.shef.ac.uk
2014-10-28 11:47:38,475 ERROR - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP error] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[HTTP/1.1 404 Not Found] )
This occurs for all CEs at Sheffield. The last job to be picked up was on the 21st. LHCb do direct submission to CEs.

- CMS
Daniela: Not much to report. Bristol still has issues (keeping an eye on it).
Lukasz notes that it has taken a while to get CE01 back online at Bristol. Most of the changes in the downtime have been towards sharing with the rest of the University. Discussed potential firewall issues with the security team at the University, and there is no deep packet inspection at that level on storage traffic at Bristol (so this is not the cause of the issues).

- ATLAS
Elena: several problems caused issues for ATLAS running (these were discussed at ADC Weekly & ATLAS UK meeting) - VOMS server issue at CERN, hammercloud issue with duplicate output names (which then failed and set sites to test), huge backlog of transferring jobs, problem with DQ2 catalogs. All issues resolved.
At the moment, there are many merge jobs running (with many inputs and outputs); these are filling up proddisk spacetokens.
ADC is aware of the problem. Decreased default lifetime in proddisk to 4 days to help ameliorate the issue.
Multicore issue with software release validation at RHUL, Shef, ECDF. (These are the 3 most recently added multicore queues in the UK.) Following up with Alessandro di Salvo + PanDA experts. Very low activity on multicore in general, however. Will be discussed in ADC Weekly this afternoon.

- Other
See Chris' notes in Bulletin.

- DIRAC status -- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view .
Andrew McNab: (11:13 AM) Could you hear ok? There was no change with the DIRAC monitoring

- Update needed for https://www.gridpp.ac.uk/wiki/GridPP_Cloud?

11:20 - 11:40
Meetings & updates 20'
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates
EGI have asked about dteam use in APEL (which has been higher than historical levels since June). Probably correlated with increased testing of various things - CASTOR at RAL, etc.
Jeremy asked what tests run via dteam. Steve's tests run using dteam, but should affect every site equally (other than the ipv6 tests, which only test ipv6 enabled endpoints). The network testing has been shifted to dteam from ATLAS. The Oxford figure is Oliver testing things against Oxford.
- New aggregator on the blog, comments please. (It aggregates the usual T2 + T1 blogging outputs.)
RIPE probes still to be installed. RIPE conference imminent, so it would be good to have things installed for then (it's an open conference with remote participation).
Final figures for Sept avail/reliability online.

- WLCG ops coordination
Next meeting is Thurs 6th. Please be aware of survey for..

- Tier-1 status
No updates.

- Storage and data management
No updates.

- Accounting
Note there's a prototype APEL parser for HTCondor (if you're using CREAM as your CEs). Question as to whether the HTCondor parser is coming back due to the Condor CE project being pushed to production? (Jeremy guesses so?)
- Documentation
Two KeyDocs for review:
Mark Mitchell's Core Grid Services page - currently structured but not particularly detailed. Is this page useful? https://www.gridpp.ac.uk/wiki/Core_Grid_services
Robert Frank noted that the page was a useful summary (but this was the first time he had seen it).
Ewan expressed the opinion that the summary was not too useful - pages get created for interesting things and then forgotten after we've moved on, and aggregate pages tend to rot as they get forgotten (the information currently on the page is terribly out of date, as it was last updated some time ago).
Jeremy suggested that the concept of a summary is useful. (Ewan noted that in a hypothetical world where we would spend lots of time keeping pages like this perfectly up to date... it still might not be useful.)
Gareth R noted that the main issue with the wiki is that it's so hard to find things in it (when he started, he certainly didn't find this aggregate page by searching).
Andrew McNab noted that there has been some discussion with Tom Whyntie about how to restructure the wiki front-page to make it easier to actually find things. [There was some discussion about how this could be improved. Possibly a Core-Ops subtask?]
Ewan noted that part of the issue is people's psychological barriers concerning actually altering the wiki in general (not wanting to overwrite other people's work).
Second page: BDII/Information Services. https://www.gridpp.ac.uk/wiki/BDII
Ewan: "it seems harmless"

- Interoperation
No updates. Next EGI ops is in November.

- Monitoring
No updates. Next Consolidation meeting end of week.

- On-duty
Quiet.

- Rollout
No comment.

- Security
Advisory concerning Xrootd monitoring.
Ewan (on behalf of security team): this is not news (as we've already been informed by ATLAS, at least, to make the config changes).
The one wrinkle is that the info was given as a YAIM snippet (as we're now transitioning to Puppet management, we should probably look at giving help) - although the change maps to a single line in a single config file.

- Services
Question: has the perfsonar 3.4 info been updated and circulated? (But there were also requests for people to test such instructions.)
Duncan, Chris W, Ewan and Alessandra were at that meeting; the instructions are still being worked on, and Ewan tested them and gave detailed feedback.
Ewan: the short answer is "no, they are not currently ready for general use". In general, there are several things changing dramatically with the new install. You can technically do the new install via a yum update, but the mesh config urls are changing completely, so there is further config anyway. (There are other changes to introduce more privilege separation.) So it's probably easiest to install from scratch.
Jeremy also noted that there was a recommendation to add IPv6 Perfsonars to dual stack sites, and for some discussion of T3 sites who wanted to add themselves to T2 meshes.

- Tickets
26 open tickets.
The VO nagios update: Brunel having problems with gridpp, Lancs with pheno, RALPP with t2k job submission, Bristol on d/t, Sheffield with pheno, etc (probably CE issue), SRMs at T1 are failing their tests (for 11 days so far).
Kashif: the SRM/T1 issue is a problem with CASTOR (it wants to map Kashif to Ops as it has a default mapping for his DN to that VO, and ignores the VOMS extensions).
Brian noted that CASTOR SRM is, indeed, not VOMS aware. The certificate is mapped to one VO in the gridmapfile [probably the first entry encountered in the mapping file].
[some discussion about ticketing RAL/CERN re CASTOR still not being VOMS aware]

- Tools
No update.

- VOs
Catalin updated us on the CVMFS keys update (to decouple from CERN more).
Also, as we're low on WLCG VO work at present, a good time for small VOs?

- Site updates

AOB.
Gareth R: has anyone tried to sign up for dteam? (We're signing up our new guy, Gordon Stewart, and it looks like the web interface isn't working.)
Ewan managed to use it yesterday, and the interface was "weird", but worked.
Jeremy will raise a ticket.

- Chat Log:
Daniela Bauer: (28/10/2014 10:54) I got kicked off twice so far and the meeting hasn't even started.
Raja Nandakumar: (11:03 AM) glite-ce-job-status -a -e lcgce2.shef.ac.uk
2014-10-28 11:47:38,475 ERROR - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP error] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[HTTP/1.1 404 Not Found]
Elena Korolkova: (11:05 AM) @Raja: does lhcb use WMS for job submission?
Raja Nandakumar: (11:06 AM) No - we submit jobs directly to the CEs
Elena Korolkova: (11:06 AM) Atlas can submit job to lcgce1 and I can do it manually with atlas and t2k proxies.
Lukasz Kreczko: (11:07 AM) it is "online" but in downtime I am straighting out the HTCondor and ARC configuration in the back
Raja Nandakumar: (11:08 AM) Elena - I just tried the glite-ce-job-status command and I get the same error again. with debug mode, i get
2014-10-28 11:10:43,892 DEBUG - Contacting service [https://lcgce2.shef.ac.uk:8443//ce-cream/services/CREAM2]
2014-10-28 11:10:44,015 FATAL - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP/1.1 404 Not Found] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[