Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002
-- The meeting extension is 109308582. PIN 1234
Chair: Jeremy
Minutes: Daniela
Reserve:
Apologies:

Note: The next two ops meetings will be cancelled due to HepSysman and the WLCG workshop.

Review of weekly issues by experiment/VO

    ** LHCb (Jeremy, Daniela)
    One Tier 1 issue.
    EL7 works at Imperial
https://ggus.eu/index.php?mode=ticket_info&ticket_id=128473

    ** CMS (Jeremy, Daniela)
    https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
    There was clearly an issue at Bristol yesterday; Winnie has been told.
    gfal2 not working on SL6 nodes (seems to be fine on SL7). The ops tests using gfal2 work, but not the production stageout. Now Brunel has been dragged into this mess too. Imperial is trying to get CMS to set the EL7 queues into production:
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=127989
    The gfal2 mess is tracked here:
    https://ggus.eu/index.php?mode=ticket_info&ticket_id=128555
    
   EOS was mentioned, but no note was made.
   Not much pressure on sites right now.
 

    ** ATLAS (Elena)
    Diskless T2: No final decision.
    EL7 multicore queue at Imperial now in production.

  **  Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

   ** Obsolete VOs:
   The following VOs are gone for good and should be removed if you are still supporting them:
    camont, cdf, dzero, neiss
    supernemo - they are in discussion about grid usage - might have to be resurrected.

    ** LSST:
    LSST have now set up a mailing list if you are interested (talk to Alessandra).
    Alessandra has also set up a Wiki page for LSST:
    https://www.gridpp.ac.uk/wiki/LSST_UK

========================================================================
    
    Jeremy:
    The EMI repositories will be shut down on the 15th of June 2017.  Matt: EMI tarballs ready to be moved to UMD.

    All: poolid/gids: Keep them consistent for your own sanity.

    GridPP DIRAC status
    Oxford VAC not working: Kashif will have a look.
    -- https://www.gridpp.ac.uk/gridpp-dirac-sam

*** Meetings & updates

New pilot version for all VAC sites, some minor problems remaining (Jeremy reporting for Andrew)

With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

    
    WLCG ops coordination: Workshop coming up !
    Only for today: Dirac Workshop 2017:
    https://indico.cern.ch/event/609507/timetable/#20170529.detailed
    GridPP is still the only true multi-VO instance, so we will continue to see issues with this. The release quality of DIRAC has improved within the past year. gfal2/EL7 problems seem to be coming to an end after being cooped up with all the authors for 3 days on end. The Imperial GridPP module now has integrated testing. It was also noted that every DIRAC instance has its own special module to adapt to its community on top of main DIRAC. We are trying to push as much of our special functionality as possible back into the main DIRAC branch. LZ is interested in submitting to NERSC via DIRAC, which is apparently already done by Belle.
    Tier-1 status: Nothing to report
    Storage and data management: Nothing to report
    Tier-2 Evolution
    Accounting: Nothing to report
    Documentation:  Nothing to report
    Interoperation: Nothing to report
    On-duty: New rota needed
    Security: Firmware
    Services: Nothing to report
    perfSonar/ipv6 (Duncan): Oxford will fully support ipv6 in the next 6 months (this is what Duncan was told). This is news to Pete. A couple of sites in the US are going to use EOS, and there is an EOS ipv6 release, so CERN will be able to support ipv6 on their storage. Still problems at RHUL. perfSONAR training during the WLCG workshop was discussed: can WLCG support the ipv6 rollout at sites?
    Track the ipv6 rollout via a storage SAM test? Use the WLCG reporting framework? No firm decision.
 
Tickets:
SUSSEX
https://ggus.eu/?mode=ticket_info&ticket_id=122772 (11/7/16)
Atlas xroot/webdav ticket, with just xroot to go. Any luck with the xrootd server? In progress (9/5)

https://ggus.eu/?mode=ticket_info&ticket_id=127767 (18/4)
Availability ticket - Daniela notes that there are still issues with test jobs not running in time, and advises perhaps reserving a slot for tests. On hold (25/5)

RALPP
https://ggus.eu/?mode=ticket_info&ticket_id=127555 (7/4)
Another availability ticket, although Chris points out several valid reasons why this points to the monitoring being fubared, vitiating RALPP's results. On hold (30/5)

OXFORD
https://ggus.eu/?mode=ticket_info&ticket_id=128512 (27/5)
LHCb spotted a problem with the Oxford ARC CE, which Kashif seems to have cleared up with a reboot, but no word from LHCb since. Could do with a poke. In progress (30/5)

BRISTOL
https://ggus.eu/?mode=ticket_info&ticket_id=126864 (28/2)
Request to enable LZ at Bristol. Winnie is progressing with rolling out the pool accounts etc for LZ, Dune and LSST as fast as her limited time allows. In progress (30/5)

GLASGOW
https://ggus.eu/?mode=ticket_info&ticket_id=124052 (25/9/16)
The cursed ticket, requiring new ARC CEs to fix a problem with publishing seen by LHCb. Hopefully enough positive karma has been accrued to give some breathing space to get round to this. On Hold (4/4)
Glasgow VAC for LHCb ? Talk to Andrew if VAC is sufficient

ECDF
https://ggus.eu/?mode=ticket_info&ticket_id=128294 (12/5)
Availability ticket - looking into the ECDF argo page is a journey into the unknown (status), with their infamous -1 availability. If your site seems to be capping out at 99% then I'd blame the ECDF guys for stealing that extra 1% :-P In progress (19/5)
Corresponding ARGO ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=126724
(hasn't been updated in 2 weeks)

SHEFFIELD
https://ggus.eu/?mode=ticket_info&ticket_id=127766 (18/4)
Another availability ticket, just waiting for the numbers to soothe themselves. Hopefully not too badly affected by the next ticket. On hold (25/5)

https://ggus.eu/?mode=ticket_info&ticket_id=128429 (19/5)
BDII test failure ticket, the alarm was fixed by Elena syncing her clocks, but glue-validator failures are still seen. Kashif rightly recommends running the test manually. In progress (2/6)

LANCASTER
https://ggus.eu/?mode=ticket_info&ticket_id=128321 (15/5)
And another availability ticket. After improving test job flow through our cluster we've put the ticket on hold. At least we're in good company. On hold (26/5)

RHUL
https://ggus.eu/?mode=ticket_info&ticket_id=128750 (1/6)
Duncan submitted a perfsonar related ticket after spotting some problems, Govind is investigating the issue. In progress (5/6)

IMPERIAL
https://ggus.eu/?mode=ticket_info&ticket_id=128555 (30/5)
The Imperials are having an issue with gfal-copy within CMS jobs. Despite not being able to replicate manually, some quiddity of the job environment causes gfal-copy to seg fault. The site has tried a different version and raising the ulimit, but no joy yet despite the efforts. In progress (5/6)

100IT has 3 tickets that I won't go into.

TIER 1
https://ggus.eu/?mode=ticket_info&ticket_id=124876 (7/11/16)
gridftp tests failing for echo, due to a problem with the tests. No movement on the counter ticket (https://www.ggus.org/index.php?mode=ticket_info&ticket_id=125026) since April. On Hold (1/1)

https://ggus.eu/?mode=ticket_info&ticket_id=127612 (8/4)
LHCb having problems with the RAL CEs, which seem to be ongoing (although they might have changed in nature). No news on the ticket in the last fortnight though. In progress (23/5)

https://ggus.eu/?mode=ticket_info&ticket_id=127967 (27/5)
Enabling MICE pilots at the Tier 1. The accounts are created but it looks like job submission isn't working yet for this role. In progress (25/5)

https://ggus.eu/?mode=ticket_info&ticket_id=127240 (21/3)
CMS staging tests. The last entry from the user was a request to clarify what the numbers in the plots meant and for additional plots. In progress (18/5)

https://ggus.eu/?mode=ticket_info&ticket_id=127597 (7/4)
CMS would like the Tier 1 to check their xroot/networking performance. In response Andrew L switched off "lazy download". Andrew asked if this has helped, but the issue is muddled by the firewall at RAL dropping packets; awaiting news from the RAL networking team. In progress (30/5)

https://ggus.eu/?mode=ticket_info&ticket_id=117683 (18/11/2015)
Castor Glue 2 publishing ticket. Rob updated that development is still ongoing. On Hold (10/5)


 https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items (no updates besides the minute count)


** Sites:

These are the Wiki pages that should be kept up to date by the sites:
https://www.gridpp.ac.uk/wiki/Category:Sites_Status

Manchester: Alessandra: ARCCE problems with Atlas
QMUL:
Started development of new deployment system including CentOS 7. Investigating hadoop on lustre - could we make it work on the grid? Tender for new storage about to go out. Helping with WLCG ipv6 workshop. Hope to enable pre-emption queue for ATLAS event service by end of month.

Oxford: Pete will update HEPSPEC for new kit.
Imperial: Containers, containers, containers

Glasgow: Preparation for data centre, ARC and HTCondor. Openstack too much hassle (much agreement from other sites on this).

RHUL: ipv6 good news, central IT on board. Plan to set up a dual stack perfSONAR box. Working on VAC. Needs to update HEPSPEC. Working on CentOS 7 ARC.

Liverpool: Cooling will be refurbished, cluster will have to be relocated. Working on DPM upgrade.
Steve set up ARC Condor SL7 using a dodgy release. Steve is preparing a talk for HEP sysman.

Oxford: xrootd work, hope to start CentOS 7 arc ce. ipv6, still waiting for hardware upgrades from central IT.

Birmingham: Switching over to VAC, storage to zfs, looking at EOS for Alice, ipv6 not a priority

Apologies:
Andrew McNab, Raja

Present:
Alessandra F
Daniela B
Elena K
Gareth R
Ian L
Jeremy C
John B
John H
Matt D
Winnie L
Rob C
Robert F
Sam S
Steve J

From the chat window:
Winslowe Lacesso: (06/06/2017 11:04)
I'm sorry can you pls repeat that?
Do you mean the cms-site-readiness. link?
If not please cite link
Daniela Bauer: (11:06 AM)
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
Then scroll to Bristol
https://ggus.eu/index.php?mode=ticket_info&ticket_id=128473
This is the EL7 LHCb link, not related to Bristol, sorry
Jeremy Coles: (11:07 AM)
The link is also in the agenda page.
Daniela Bauer: (11:12 AM)
Sorry, I didn't get the list of all the VOs. Is there a list somewhere ?
John Hill: (11:12 AM)
Ops Bulletin: General Issues
Daniela Bauer: (11:14 AM)
THANKS, FOUND IT
oops caps lock
Peter Gronbech: (11:18 AM)
http://vacmon.gridpp.ac.uk/1f4:15180::/UKI-SOUTHGRID-OX-HEP/t2vacuum.physics.ox.ac.uk/ looks good to me
Jeremy Coles: (11:19 AM)
Interesting compared to https://www.gridpp.ac.uk/gridpp-dirac-sam.
Peter Gronbech: (11:20 AM)
could be down to a recent gridpp certificate renewal....
Brian Davies: (11:24 AM)
My mic is not working, other than what Gareth has put in the bulletin , I can try to answer any the question
Re perSONAR, Matt Doidge and I have been testing ipv6 connection between Lancaster and RAL. Found asymmetric routing which Lancaster fixed. Might be an idea for others sites to check.
FYI ATLAS have a NERSC site
Daniela Bauer: (11:33 AM)
So has CMS
Jeremy Coles: (11:33 AM)
https://indico.cern.ch/event/615970/
Daniela Bauer: (11:33 AM)
but none of then use DIRAC
Brian Davies: (11:48 AM)
what version of gfal is used inside their jobs?
Matt Doidge: (11:53 AM)
gfal2 I think Brian.
Jeremy Coles: (11:53 AM)
Alessandra created this category in early May: https://www.gridpp.ac.uk/wiki/Category:Sites_Status.
raul: (11:54 AM)
Sorry, I've to leave for another meeting
Daniela Bauer: (11:56 AM)
gfal2 and I think they are just using the worker node one
Daniel Peter Traynor: (12:00 PM)
Started development of new deployment system including centos 7. Investigating hadoop on lustre - could we make work on the grid? tender for new storage about to go out. Helping with WLCG ipv6 workshop. Hope to enable pre-emption queue for atlas event service by end of month.
Duncan Rand: (12:03 PM)
got to go.
Ian Loader: (12:04 PM)
got to go
Paige Winslowe Lacesso: (12:04 PM)
Sorry must -
elena: (12:08 PM)
sorry have to leave
Jeremy Coles: (12:17 PM)
https://www.gridpp.ac.uk/wiki/Category:Sites_Status
Alessandra Forti: (12:24 PM)
I've added the new machines both with SL6 and SL7

 

    • 11:00 11:01
      Ops meeting minutes 1m
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't the task will keep coming to them!).

      • Upcoming allocations:

      6th June:
      12th June:
      19th June:
      26th June:

    • 11:01 11:20
      Experiment problems/issues 19m

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 11:20 11:40
      Meetings & updates 20m

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Only for today: Dirac Workshop 2017:
        https://indico.cern.ch/event/609507/timetable/#20170529.detailed
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 11:40 12:20
      Updates 40m
      • Site roundup
        ** With reference to tables under https://www.gridpp.ac.uk/wiki/Category:Sites_Status.
    • 12:20 12:25
      Actions & AOB 5m
      • https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items