Present (mostly surnames):

Lahiff
Davies
Brew
Traynor
Bauer
Crooks
Rand
Korolkova
Mahon
Melaccio
Qin
Roy
Gordon Stewart
Govind
Ian Loader
Coles
Bland
Hill
John Kelly
Mohammad
Kreczko
Raso-Barnett
Smith
Gronbech
Nandakumar
Raul
Frank
Skipsey
Jones
Wahid
Wash

Experiment problems/issues 20' 
Review of weekly issues by experiment/VO

- LHCb

Raja: 3 things - 1) Andrew McNab will be the new LHCb UK Computing Coordinator (the division of labour between Raja and Andrew is not yet decided). In practice, for now, things remain unchanged for the purpose of Ops meetings.
2) Still anticipating the next round of restripping in early Nov.
3) Problem with Sheffield - cannot submit jobs. Raja pasted the following in chat (11:03 AM):

glite-ce-job-status -a -e lcgce2.shef.ac.uk 
2014-10-28 11:47:38,475 ERROR - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP error] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[HTTP/1.1 404 Not Found] 

This occurs for all CEs at Sheffield. Last job to be picked up was on 21st. LHCb do direct submission to CEs.
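
When triaging errors like the one Raja pasted, it can help to pull the fault fields out of the CLI output. The following is a minimal sketch only: the function names are hypothetical, and it assumes the line layout seen in the error above (a "Connection to service [...] failed" prefix followed by Fault*=[...] fields).

```python
import re

# Error line copied from the minutes above.
LINE = ("2014-10-28 11:47:38,475 ERROR - Connection to service "
        "[http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: "
        "FaultString=[HTTP error] - FaultCode=[SOAP-ENV:Server] - "
        "FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[HTTP/1.1 404 Not Found]")

# Assumed layout: fields look like Name=[value] with no nested brackets.
FIELD_RE = re.compile(r"(Fault\w+)=\[([^\]]*)\]")
SERVICE_RE = re.compile(r"Connection to service \[([^\]]+)\] failed")


def parse_cream_fault(line):
    """Return a dict of the service URL and Fault* fields, or None if absent."""
    service = SERVICE_RE.search(line)
    if service is None:
        return None
    fields = dict(FIELD_RE.findall(line))
    fields["service"] = service.group(1)
    return fields


def http_status(fields):
    """Return the HTTP status code found in FaultDetail or FaultString, if any."""
    for key in ("FaultDetail", "FaultString"):
        match = re.search(r"HTTP/\d\.\d (\d{3})", fields.get(key, ""))
        if match:
            return int(match.group(1))
    return None


fault = parse_cream_fault(LINE)
print(fault["service"], http_status(fault))
```

Both FaultDetail and FaultString are checked because the debug-mode output in the chat log carries the HTTP status in FaultString instead.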


- CMS

Daniela: Not much to report. Bristol still has issues (keeping an eye on it). Lukasz notes that it has taken a while to get CE01 back online at Bristol. Most of the changes during the downtime have been towards sharing with the rest of the University. Potential firewall issues were discussed with the University security team: there is no deep packet inspection at that level on storage traffic at Bristol (so this is not the cause of the issues).

- ATLAS

Elena: several problems caused issues for ATLAS running (these were discussed at ADC Weekly & the ATLAS UK meeting) - a VOMS server issue at CERN, a HammerCloud issue with duplicate output names (which then failed and set sites to test), a huge backlog of transferring jobs, and a problem with DQ2 catalogs.
All of these issues are resolved. At the moment there are many merge jobs running (with many inputs and outputs), and these are filling up proddisk spacetokens. ADC is aware of the problem.
The default lifetime in proddisk has been decreased to 4 days to help ameliorate the issue.

Multicore issue with software release validation at RHUL, Shef, ECDF. (These are the 3 most recently added multicore queues in the UK). Following up with Alessandro di Salvo + PanDA experts. Very low activity on multicore in general, however.
Will be discussed in ADC Weekly this afternoon.

- Other

See Chris' notes in Bulletin.

-  DIRAC status
-- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view .

Andrew McNab: (11:13 AM)
Could you hear ok? There was no change with the DIRAC monitoring


- Update needed for https://www.gridpp.ac.uk/wiki/GridPP_Cloud?

11:20 - 11:40 Meetings & updates 20' 
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates


EGI have asked about dteam use in APEL (which has been higher than the historical level since June). This is probably correlated with increased testing of various things - CASTOR at RAL, etc.
Jeremy asked which tests run via dteam.

Steve's tests run using dteam, but should affect every site equally (other than the IPv6 tests, which only test IPv6-enabled endpoints). The network testing has been shifted to dteam from ATLAS.

Oxford figure is Oliver testing things against Oxford.

New aggregator on the blog, comments please. (It aggregates the usual T2 + T1 blogging outputs).

RIPE probes still to be installed. RIPE conference imminent so it would be good to have things installed for then (it's an open conference with remote participation).


Final figures for Sept avail/reliability online.
 

- WLCG ops coordination

Next meeting is Thurs 6th. Please be aware of survey for..

- Tier-1 status

No updates.

- Storage and data management

No updates.

- Accounting

Note there's a prototype APEL parser for HTCondor (if you're using CREAM CEs).
Question: is the HTCondor parser coming back because the Condor CE project is being pushed to production?
(Jeremy guesses so.)


- Documentation

Two KeyDocs for review: 

Mark Mitchell's Core Grid Services page - currently structured but not particularly detailed. Is this page useful? https://www.gridpp.ac.uk/wiki/Core_Grid_services
Robert Frank noted that the page was a useful summary (but this was the first time he had seen it). Ewan expressed the opinion that the summary was not very useful: pages get created for interesting things and are then forgotten once we've moved on, so aggregate pages tend to rot (the information currently on the page is badly out of date, as it was last updated some time ago).
Jeremy suggested that the concept of a summary is useful. (Ewan noted that even in a hypothetical world where we spent lots of time keeping pages like this perfectly up to date, it still might not be useful.)
Gareth R noted that the main issue with the wiki is that it's so hard to find things in it (when he started, he certainly didn't find this aggregate page by searching). 

Andrew McNab noted that there has been some discussion with Tom Whyntie about how to restructure the wiki front-page to make it easier to actually find things.

[There was some discussion about how this could be improved. Possibly a Core-Ops subtask?]

Ewan noted that part of the issue is people's psychological barriers concerning actually altering the wiki in general (not wanting to overwrite other people's work).

Second page: BDII/Information Services. https://www.gridpp.ac.uk/wiki/BDII
Ewan: "it seems harmless"


- Interoperation

No updates. Next EGI ops is in November.

- Monitoring

No updates. Next Consolidation meeting end of week.

- On-duty

Quiet.

- Rollout

No comment.

- Security

Advisory concerning Xrootd monitoring. 

Ewan (on behalf of security team): this is not news (as we've already been told, by ATLAS at least, to make the config changes). The one wrinkle is that the info was given as a YAIM snippet; as we're now transitioning to Puppet management, we should probably look at giving help, although the change maps to a single line in a single config file.


- Services

Question: has the perfSONAR 3.4 info been updated and circulated? (But there were also requests for people to test such instructions.)
Duncan, Chris W, Ewan and Alessandra were at that meeting. The instructions are still being worked on; Ewan tested them and gave detailed feedback.
Ewan: the short answer is "no, they are not currently ready for general use". 
In general, several things change dramatically with the new install. You can technically do the new install via a yum update, but the mesh config URLs are changing completely, so further config is needed anyway. (There are other changes to introduce more privilege separation.) So it's probably easiest to install from scratch.

Jeremy also noted that there was a recommendation to add IPv6 perfSONARs to dual stack sites, and that there was some discussion of T3 sites who wanted to add themselves to T2 meshes.

- Tickets

26 open tickets.

The VO Nagios update: Brunel having problems with gridpp, Lancs with pheno, RALPP with t2k job submission, Bristol in downtime, Sheffield with pheno, etc. (probably a CE issue), and SRMs at T1 are failing their tests (for 11 days so far).

Kashif: the SRM/T1 issue is a problem with CASTOR (it wants to map Kashif to Ops as it has a default mapping for his DN to that VO, and ignores the VOMS extensions).
Brian noted that CASTOR SRM is, indeed, not VOMS aware. The certificate is mapped to one VO in the gridmapfile [probably the first entry encountered in the mapping file].
[some discussion about ticketing RAL/CERN re CASTOR still not being VOMS aware]
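
To illustrate the behaviour Brian described, here is a minimal sketch of a DN lookup that ignores VOMS attributes and returns the first matching mapfile entry. This is not CASTOR's actual code: the first-match rule is the assumption noted above, and the DNs, account names and function names are made up for illustration.

```python
import shlex

# Hypothetical grid-mapfile content; DNs and account names are invented.
GRIDMAP = '''
"/C=UK/O=eScience/OU=Example/CN=some user" opsuser
"/C=UK/O=eScience/OU=Example/CN=some user" dteamuser
"/C=UK/O=eScience/OU=Example/CN=another user" atlasuser
'''


def parse_gridmap(text):
    """Parse lines of the form: "quoted DN" account. Skips blanks and comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # handles the quoted DN containing spaces
        if len(parts) >= 2:
            entries.append((parts[0], parts[1]))
    return entries


def map_dn_first_match(entries, dn, voms_fqans=()):
    """Return the account of the first entry whose DN matches.

    voms_fqans is accepted but deliberately ignored, to mimic a service
    that is not VOMS aware.
    """
    for entry_dn, account in entries:
        if entry_dn == dn:
            return account
    return None


entries = parse_gridmap(GRIDMAP)
dn = "/C=UK/O=eScience/OU=Example/CN=some user"
# The proxy carries a dteam attribute, but the mapping does not look at it.
print(map_dn_first_match(entries, dn, voms_fqans=["/dteam"]))
```

Under these assumptions the user is always mapped via the first entry for their DN, whichever VO their proxy was issued for, which matches the symptom Kashif reported.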

- Tools

No update.

- VOs

Catalin updated us on the CVMFS keys update (to decouple from CERN more).
Also, as we're low on WLCG VO work at present, a good time for small VOs?

- Site updates


AOB.

Gareth R: has anyone tried to sign up for dteam? (We're signing up our new guy, Gordon Stewart, and it looks like the web interface isn't working.)
Ewan managed to use it yesterday, and the interface was "weird", but worked.
Jeremy will raise a ticket.

-
Chat Log:


Daniela Bauer: (28/10/2014 10:54)
I got kicked off twice so far and the meeting hasn't even started.
Raja Nandakumar: (11:03 AM)

glite-ce-job-status -a -e lcgce2.shef.ac.uk 
2014-10-28 11:47:38,475 ERROR - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP error] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[HTTP/1.1 404 Not Found] 

Elena Korolkova: (11:05 AM)
@Raja: does lhcb use WMS for job submission?
Raja Nandakumar: (11:06 AM)
No - we submit jobs directly to the CEs
Elena Korolkova: (11:06 AM)
Atlas can submit job to lcgce1 and I can do it manually with atlas and t2k proxies.
Lukasz Kreczko: (11:07 AM)
it is "online" but in downtime
I am straightening out the HTCondor and ARC configuration in the back
Raja Nandakumar: (11:08 AM)
Elena - I just tried the glite-ce-job-status command and I get the same error again.
with debug mode, i get
2014-10-28 11:10:43,892 DEBUG - Contacting service [https://lcgce2.shef.ac.uk:8443//ce-cream/services/CREAM2]

2014-10-28 11:10:44,015 FATAL - Connection to service [http(s)://lcgce2.shef.ac.uk//ce-cream/services/CREAM2] failed: FaultString=[HTTP/1.1 404 Not Found] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server] - FaultDetail=[<html><head><title>Apache Tomcat/6.0.24 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Taho ma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}

Elena Korolkova: (11:13 AM)
Raja, Could you try ce1, please.
Raja Nandakumar: (11:13 AM)
That was pasted by me for Elena
Andrew McNab: (11:13 AM)
Could you hear ok? THere was no chance with the DIRAC monitoring
Elena Korolkova: (11:13 AM)
lcgce1
Andrew McNab: (11:13 AM)
no change
Raja Nandakumar: (11:14 AM)
[nraja@heplnx104 ~]$ glite-ce-job-status -d -a -e lcgce1.shef.ac.uk
2014-10-28 11:14:00,797 DEBUG - Using certificate proxy file [/tmp/x509up_u27592]
2014-10-28 11:14:00,816 WARN - No configuration file suitable for loading. Using built-in configuration
2014-10-28 11:14:00,816 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-status_CREAM_nraja_20141028-111400.log]
2014-10-28 11:14:00,817 DEBUG - Service address=[https://lcgce1.shef.ac.uk:8443//ce-cream/services/CREAM2]
2014-10-28 11:14:00,817 DEBUG - Contacting service [https://lcgce1.shef.ac.uk:8443//ce-cream/services/CREAM2]
2014-10-28 11:14:01,614 FATAL - JobStatus

Jeremy Coles: (11:14 AM)
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Raja Nandakumar: (11:14 AM)
Slightly different message, but same problem in lcgce1 also.
Elena Korolkova: (11:14 AM)
Thanks, Raja
Duncan Rand: (11:16 AM)
http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest.html
Jeremy Coles: (11:22 AM)
https://www.gridpp.ac.uk/wiki/Core_Grid_services
Ewan Mac Mahon: (11:23 AM)
Well I'm looking at the page and trying to form one.
Yeah, that's not useful.
Jeremy Coles: (11:31 AM)
https://www.gridpp.ac.uk/wiki/BDII
wahid: (11:34 AM)
there is also a bit that stops the DN going - yes as sam says
its the same text
Chris Brew: (11:38 AM)
Is it only me having problems with the audio cutting out?
John Hill: (11:39 AM)
It's OK for me
Gang Qin: (11:39 AM)
I got the cut out 3 times during the past 10 minutes
wahid: (11:40 AM)
on that xrootd monitoring - I just realised that since all disk nodes are sending this info - the change will be needed on all disk servers...
I hadn't done that for one
raul: (11:40 AM)
cutting out for me too.
Jeremy Coles: (11:42 AM)
If the audio cuts try using the proxy set to on. But if you don't need to use the proxy make sure it is off again (to avoid a proxy server overload).
Ewan Mac Mahon: (11:43 AM)
That sounds like something of a security bug too.
On the basis that it presumably would allow a user with a VOMS proxy for one VO to access/delete data for another VO.
One they may not even be a member of any longer.
wahid: (11:44 AM)
well that is indeed poor... However if you do have multiple roles in the same proxy then most grid services will just pick the first
nothing is going to change in Castor 
I think there are many dark secrets
(that aren't that secret)
Elena Korolkova: (11:48 AM)
David Rebatto is kindly helping me
to solve the problem
Ewan Mac Mahon: (11:48 AM)
Isn't this a practical problem for the major VOs and things like production/no production roles etc? That's all VOMS too,
Kashif Mohammad: (11:49 AM)
https://ggus.eu/?mode=ticket_info&ticket_id=109360