Ops Meeting Minutes, Tuesday, 15 July 2014

     Attendees: Alessandra, Andrew M, Brian D, Chris B, Dan T, Daniela B, David C,
     Elena K, Ewan M, Federico M, Gareth R, Gareth S, Ian C, Ian L, Jeremy C,
     John B, John H, Kashif M, Mark S, Matt RB, Raja, Rob F, Sam S, Steve J,
     Wahid B

    Experiment problems/issues

    Review of weekly issues by experiment/VO

    - LHCb
      Raja: Going smoothly. Low-level Monte Carlo.

    - CMS
      (Daniela) - Limited news. Bristol have been struggling. There is a DPM issue
      at Brunel concerning inefficiencies - Raul had previously tried to
      address this with CMS, but they have only now followed up. Wahid
      has been talking to Raul about the issues - there was already a
      well-known problem that was subsequently fixed in the latest release, but
      this current CMS problem may be new.

    - ATLAS
      Lost contact for first bit, then...
      Elena (describing some ops meeting): HPC was discussed. 
      AGIS is very reliable. ATLAS has used it for 2 years. It offers
      dynamic views. Group is working with it to set up new queues.

      New system for assessing site usability is to be brought in. It will create
      an automatic report. It will categorise the site into A, B, C. Liverpool 
      and Sheffield are not T2Ds, thus they are automatically demoted. This 
      will complement existing assessment for the time being at least.

      Alessandra: The ATLAS DC14 programme will include a mix of single
      and multi-core jobs. She will discuss baseline considerations for
      setting this up next week. She will talk about a solution to the
      draining problem.

    - Other
      n/a

    Meetings & updates

    With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

    - General updates
      n/a

    - WLCG ops coordination
      Jeremy: What about ILC, who are dropping a VOMS server? Short discussion.
      ILC will be informed of our consensus, i.e. it is better just to turn off
      the VOMS server, as the system can cope with that. Once proxies have
      expired (24/48 hours), remove it from the Operations Portal. Sites will
      then get alerts and can respond.

-------------------------------------------
      VIDYO BROWNOUT (one of several)
-------------------------------------------

    - Tier-1 status
      Gareth: Castor Updates are complete. FTS2 will be stopped 2 Sept.

    - Storage and data management
      n/a

    - Accounting
      Jeremy: UCL is now the only site using APEL 2. Ticket raised.

    - Documentation
      Jeremy: Problem with stale documents alert has been fixed.

    - Interoperation
      David Crooks: There will be a meeting next Monday. There shall be a 
      new CREAM.
      On migration to central SAMs: some non-UK sites had version issues.
      On APEL: Will follow up at UCL.
      On monitoring reliability - there will be a manual re-computation.
      On UMD: thanks for survey response - about 80 so far.

    - Monitoring
      David Crooks: 4th July Meeting. Discussed SAM3, visualisation. Sites are
      reminded about the site monitoring wiki page.

    - On-duty
      n/a

    - Roll-out
      n/a

    - Security
      JC: Sites are reminded about EGI-ADV-20140625, high risk.

    - Services
      n/a

    - Tickets
      Matt:

29 Open UK tickets today.

FNAL VOMS TICKETS
As seen on TB-SUPPORT - a number of sites got tickets concerning jobs still contacting the FNAL voms server for CMS/ILC. Birmingham, RHUL, Liverpool and the Tier 1's tickets are still being worked on - RHUL's ticket might not have been spotted yet (still assigned).

DECOMMISSIONING THE FTS2 SERVICE
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106615 (2/7)
Gareth opened a ticket to document the retirement, in accordance with ancient grid laws. As naught is happening until the 2nd of September I put on hold till nearer the time. On Hold (14/7)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106770 (10/7)
enmr.eu wanted to add tags to one of the Tier 1's ARC CEs, which of course didn't work. There was an interesting exchange about why a VO would still want to have a site publish tags in the age of cvmfs (essentially so they can minimise changes to the submission gubbins). Andrew offered to add in the tag "VO-enmr.eu-CVMFS" by hand to his CE; it's likely that other sites might be asked to do the same, and it's a solution worth noting for other VOs. In progress (14/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106610 (2/7)
Enabling HyperK at the Tier 1. Ticket looks a little stalled after Chris commented that it was wise for HyperK to be enabled on only ARC CEs (in light of RAL going dairy free). In progress (2/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (4/7)
UCL are still having trouble with nagios tests after a pool node died. Ben is having trouble getting the new disk server set up - I tried to give him some tips and advised shouting out for help. In progress (8/7)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Bristol having trouble with CMS transfers - Lukasz noticed StoRM was being odd (believing there to be no free space when there was). The SE was kicked but the problem (or a similar one) showed up again. Anyone seen similar? (Looking at Chris Walker: StoRM Sage again here). In Progress (9/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
cf TIER 1 ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324
CMS pilots losing contact with their home base. Looks similar to the issue at RAL, where they seem to have had some success (still waiting to see if it was complete). If the RAL chaps could elaborate on the firewall tweaks that brought about this improvement it would be greatly appreciated (The RAL ticket could do with an update too)! In Progress (14/7) 

 
    - Tools
      Kashif: 
      On 12th July, there was chaos with Nagios after a power cut. Duff
      results got into the top BDII and made sites look bad. After the
      system was restarted, sites continued to be impacted due to the
      backlog. Various suggestions were made.
      Jeremy: Why not flush the BDII cache?
      Kashif: A manual restart does that.
      It will be discussed further at the ops meeting and elsewhere.

    - VOs

      LHCb
      An alert was sent out by Andrew Lahiff last week alerting sites that an
      environment variable is needed on ARC CEs for LHCb. It directs jobs
      to specific queues.


      Biomed
      Discussion on biomed. JC to have a word with VO managers. Stephen 
      Burke suggests adding a requirement like TotalJobs < MaxTotalJobs.

      HyperK
      The VO will only use storage at QMUL, London. It runs properly
      on SGE/CREAM CE. It is only to be implemented on ARC/Condor at RAL, as
      CREAM is being phased out.

    - Site updates
      n/a

    Review of WLCG workshop
      See http://indico.cern.ch/event/330837/contribution/3/material/0/0.pdf

      Comments:
      Alessandra commented on CREAM fixes to pass walltime through to Torque/Maui,
      to allow multi-core jobs to work. More on this next week.

      David Crooks commented on Monitoring Technology. The MonALISA model
      will be used. There is also a proposal to put in heavy hardware monitoring
      and, e.g., cvmfs monitoring.
      Discussion is required to check whether this should remain a site task.
      It will be discussed at the next consolidation meeting.
      Please contact David Crooks to share your views on this.


    AOB

    - A reminder to register for GridPP33 20th-22nd August: http://www.gridpp.ac.uk/gridpp33/registration.html. 


----------------------------------------------------------
    CHAT WINDOW CAPTURE (After Vidyo Crash)

John Bland: (15/07/2014 11:33)

I'm on 117.103.105.125

wahid: (11:33 AM)

109.105.124.84

David Crooks: (11:33 AM)

I'm 141.52.27.20 as well

Robert Fay: (11:33 AM)

I'm on the same one as Jeremy

Federico Melaccio: (11:33 AM)

I'm on the same too

Mark Slater: (11:34 AM)

109.105.... is NDGF

Chris Brew: (11:34 AM)

141.52.27.20, got kicked off the first time but not the second time

Raja: (11:34 AM)

The conference status button is not clickable for me

David Crooks: (11:34 AM)

I'm the same as Chris

John Hill: (11:34 AM)

I'm "not in a confernece"

Mark Slater: (11:34 AM)

I've been kicked off once and lost Comms twice :(

John Hill: (11:34 AM)

so I can't find out the router

Daniel Traynor: (11:35 AM)

vidyourouter.ndgf.org now , kicked off the second time only.

Alessandra Forti: (11:36 AM)

yes but you need to be in the meeting to change it. i didn't have any problem with other meeting rooms...

using the same router

Gareth Douglas Roy: (11:37 AM)

https://ggus.eu/?mode=ticket_info&ticket_id=103577

Ewan Mac Mahon: (11:37 AM)

You don't actually have to support biomed if they're more hassle than they're worth.

Alessandra Forti: (11:38 AM)

may as well be

Steve Jones: (11:38 AM)

They are very good "fillers" when we have some slack!!!

Alessandra Forti: (11:38 AM)

it is enough 1

some of them also send jobs with the wallclock set....

Daniel Traynor: (11:40 AM)

hypek woking fine at QM with gridengine and creamce

Jeremy Coles: (11:42 AM)

To find your router click on the config icon and then go to the 'status' page.

David Crooks: (11:46 AM)

Site monitoring wiki :-)

https://www.gridpp.ac.uk/wiki/Site_monitoring_status

Ewan Mac Mahon: (11:47 AM)

Sorry - this doesn't seem to be quite working.

Security stuff is SOP - update and reboot,

but soon/now.

Including you, ECDF.

Jeremy Coles: (11:49 AM)

EGI-ADV-20140625

Ewan Mac Mahon: (11:59 AM)

A major CERN deployment moving to CentOS is sortof a big deal isn't it?

Not a surprise as such, but still.

wahid: (12:02 PM)

well I tried it (package reporter ) straight aftr the mtg and it didn't work - now it does

its simple enough but I still object to them ever asking for everyone to install it everywhere

Samuel Cadellin Skipsey: (12:03 PM)

wahid: I actually rather more object to the hint I heard that it doesn't actually tell the *user* what it is sending to the remove service.

wahid: (12:03 PM)

Ewan - acknowldedged - andy is poking systems team - they are always slow as they like to consult every user for somereason

Sam - thats true

Ewan Mac Mahon: (12:04 PM)

I'm slightly dubious about the package reporter, but on a quick look it basically just seems to ship off the results of an 'rpm -qa'.

Samuel Cadellin Skipsey: (12:05 PM)

It would be nice if the package reporter did local logging.

wahid: (12:05 PM)

but its a perl script so you could get it to print

Samuel Cadellin Skipsey: (12:05 PM)

Sure, to both of you, but it would be nice if the person who wrote it showed they cared.

Ewan Mac Mahon: (12:05 PM)

Which won't work for non-RPM things, and will trawl up unrelated RPMs on (say) shared clusters.

Samuel Cadellin Skipsey: (12:05 PM)

You shouldn't have to tweak it to make it behave with the correct respect for sysadmins

wahid: (12:05 PM)

That wasn't the only time he said the "one more rpm " line

he also said as he often does that they only want "90% of the sites"

but then that quickly turns into a MB mandate

Alessandra Forti: (12:12 PM)

Sam: "it would be nice if the person who wrote it showed they cared" you ask too much.... ;)

Ewan Mac Mahon: (12:13 PM)

have they talked to the shoal folks?

Because if you squint a bit that's a squid monitoring system too.

Alessandra Forti: (12:13 PM)

we could also feedback the request for printing and logging

wahid: (12:23 PM)

they WILL !

Ewan Mac Mahon: (12:25 PM)

'Non-SRM' isn't necessarily helpful if they still need weird stuff.

If we can give then (e.g.) bare S3 interfaces, then that's one thing.

Alessandra Forti: (12:26 PM)

the major problem with non-srm is the space tokens used as quotas

Ewan Mac Mahon: (12:26 PM)

If we move from one set of grid specific tooling to another, we might as well not.

Jeremy Coles: (12:28 PM)

I'll try to get through the remaining talks in 5-10 minutes. I appreciate people will want to leave soon... if you do please note the AOB about GridPP33 on the agenda... please register! Thanks.

Ewan Mac Mahon: (12:28 PM)

What I want to do is DPM/dmlite -> dmlite with a (probably) ceph backend -> just the ceph.

S3 has a strong advantage in multiple implementations existing.

Samuel Cadellin Skipsey: (12:29 PM)

Ewan: I may have a plugin or two to throw at you in a week or three

Ewan Mac Mahon: (12:29 PM)

Ooh. Jolly good.

Does someone want to list all the times where volunteer support for grid middleware has actually worked well?

Now I can't tell if no-one's answer that question or everyone is.

Steve Jones: (12:44 PM)

Thanks