
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting

- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6

-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.

-- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002

-- The meeting extension is 109308582. PIN 1234

Chair:  Jeremy C

Minutes: David C

Apologies:

GridPP Ops 8/5/2018

Present:

Andrew McNab, Chris Brew, Daniela Bauer, Dan Traynor, David Crooks (minutes), Duncan Rand, Elena Korolkova, Gareth Roy, Gordon Stewart, Ian Loader, Jeremy Coles (chair), John Hill, Leo Rojas, Linda Cornwall, Mark Slater, Matt Doidge, Pete Gronbech, Raul Lopes, Robert Frank, Rob Currie, Sam Skipsey, Steve Jones, Vip Davda

Experiment updates

LHCb

Andrew: Looked at Ops meeting yesterday, nothing UK specific

Jeremy: Nothing in WLCG Ops either

CMS

Daniela: Fairly quiet, Bristol has 4 tickets (only just after holiday weekend)

IPv6, please go and get it.

ATLAS

Elena: Problem with VOMS proxy, many queues went into test mode. Wrote to experts; HC test disabled and queues put back online. Problem was solved later, EPEL packages renamed? Think it's fixed.

Everything working after 2 hours. 

General UK sites:

- Sussex to go CPU only, use T1 or RHUL for storage

- Create SL7 queues for Brunel (for consistency, currently working OK)

- QMUL transfers, open ticket. Alessandra has checked settings for queues, not sure of situation.

- Sheffield situation: increased timeout in gridftp config, didn't help

Steve: Settings change didn't work for Liverpool?

Elena: Ask DDM experts

Discuss on Thursday

Jeremy: Re WLCG Ops, ATLAS EOS crashed on Friday. VOMS proxy, issue with renewal

Other VO updates?

Pete G: Enabled access for LZ/SKA at Oxford, might want some testing, haven't added new VO for a while - will investigate this week

Gareth R: VAC pool, updated 50% to V3, added vac pipe, all enabled VOs. Notice intermittent rate of jobs. Fair share setting? Have seen LZ/LSST, mostly pheno, LHCb. 

Andrew M: Mostly get pheno with this. 

Daniela: LZ only send to targeted sites. Need 4GB, if have that then let me know. Also discovered bug in sim, halted work.

Mark Slater: Switched to V3, have seen MICE, small number of others, not that much running. 

Gareth: OK, checking that I haven't messed it up.

Andrew M: Accounting portal is good to see shares, permissions (which accounting portal, DIRAC?)

Daniela: Meant to be open

Andrew: Can't see VOs you're not a member of?

Daniela: May have got lost on upgrade of web interface

Gareth R: Daniela, LZ wants 4G/VM? I think we can, need to change config. Can mess around, will get back to you.

Elena: LZ don't currently use VAC queue

Daniela: But could do if the config works

Jeremy: Check back on DIRAC portal next week.

Bulletin

General updates

- EOSC-Goc

- Transfer failures? No update, interesting topic

- OMB

- Hardware survey, 4 yet to come in

- Brunel (passed on to Duncan), Imperial, UCL, Manchester (being worked on)

Storage & Data Management

NTR: Sam gives credit to Brian for gridftp timeout settings

T2 Evolution

- Vac 3.0/Vac pipes: https://www.gridpp.ac.uk/wiki/Vac_configuration_for_GridPP_DIRAC

Documentation

Updates to Interoperation Key Docs

Security

EGI IGTF CA update

Reminder that when doing updates, some worker nodes can get left behind

Services

Gareth R: Do we know when perfsonar 4.1 is due?

Duncan: No, was meant to be Q1.

Tickets

https://ggus.eu/?mode=ticket_info&ticket_id=134899 now closed

IPv6 in Condor (see transcript)

Site Round Table

Manchester (Andrew)

- Final stages of adding extra hardware

- Pretty much final setup of ARC service

- Bringing on storage

Nothing to add from Robert

RALPP (Chris)

- WN/storage in boxes, unpack this week

- 600 TB storage, although half replacing decommissioned estate

- IPv6, really need time to bring to PerfSonar

Nothing to add from Ian Loader

Glasgow (David/Gareth)

- David, documentation and handover 

- Gareth: VAC nodes to new version

- restructuring how compute is provided

- fill rates, poor fill rates on multicore, worth keeping?

- smaller vac pool supports small VOs

- WIP, need numbers

- opportunity to audit site

- not in huge rush to upgrade to C7

- generally tidy up. Changes with central services, pushing IPv6 but slow progress. DC delayed again, planning permission... rely on future plans

Imperial (Daniela)

- mostly working on DiRAC data mover

- LZ had prod run, resume soon

- couple of open DIRAC issues, workshop at end of May

- Duncan: webdav transfers with Brian Bockelman

Sheffield (Elena)

- Going slowly, hired new local sysadmin, hasn't started yet. 

- SL7, going slowly

- couldn't move most of storage to C7, old/built on software RAID, difficult to upgrade

Cambridge (John)

Vac V3, C7 soon. Forward-looking services already on C7. Services planned for decommissioning don't need to move.

Sussex (Leo)

I'm working on the perfsonar upgrade to CentOS 7.
I am also working on our new CentOS 7 cluster, implementing Singularity. We are now able to send jobs to the grid using the Singularity image provided via CVMFS.
(See transcript)

Birmingham (Mark)

- Workers in place, takes central IT a while to sort out naming. 

- Storage (200TB) in rack, needs names. 

- Vac slowly upgrading to V3

- Storage, believe to be in place for ATLAS EOS

- reduce DPM, look to decommission in long term, waiting on ATLAS

- IPv6 got gateway talking, then PerfSonar, then get dual stacking. 

- Don't have timeline, new domain handling with central services. Almost at point of just waiting on them

Lancaster (Matt)

- Upgrading PerfSonar to C7

- trouble getting IPv6 to work, auto conf got turned off, lots of debugging

- Continue looking to move to C7

- WNs have been C7 for ages, services as they are added, next storage

- Singularity: had to move some WNs to C6 for one community, so try singularity

- need to build own copy of Singularity

Oxford (Pete/Vip)

- ATM configuring new VOs, debugging

- Moved ~5 WNs to C7

Edinburgh (Rob)

- taken small steps towards dual stacking storage (head node done, need local config for pool nodes)

- local monitoring

- T3 in a box

Liverpool (Steve)

- Updated UMD3->4

- Installed new hardware, 600 slots, E5-2630v4

- HTCondor smaller VO scheduling, gave small slices, smaller VOs come through in fits and starts

- De facto cap

- Replaced by using large accounting group which works better

- Convert to C7 (close to) complete

- Need to do Vac v3/pipes

- Lots of certificates - currently use Cert Wizard, PeCR?

David: That's how we renew certificates, works very well

Chris B: Also suggest updating all certs at same time, even if not needed for some, to synchronise renewals, move to same timeline

Steve: That's good advice.

- moving to new build system (same as the old build system but reimplemented)

- General cleanup

- XFS kernel crashes, under investigation
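
As a rough illustration of Chris B's suggestion above to renew all certificates at the same time, the sketch below picks one shared renewal date from a set of expiry dates. The host names, expiry dates and the 30-day lead time are all made up for illustration; nothing here reflects any site's actual certificates or the Cert Wizard tool.

```python
from datetime import date, timedelta


def common_renewal_date(expiries, lead_days=30):
    """Return one date on which every certificate is renewed together.

    The earliest expiry drives the schedule: everything is renewed
    lead_days before it, even certificates that still have time left.
    """
    if not expiries:
        raise ValueError("no certificates given")
    return min(expiries.values()) - timedelta(days=lead_days)


# Hypothetical hosts and expiry dates (assumptions, not real site data).
expiries = {
    "ce01.example.ac.uk": date(2018, 9, 14),
    "se01.example.ac.uk": date(2018, 11, 2),
    "perfsonar.example.ac.uk": date(2019, 1, 20),
}

renew_on = common_renewal_date(expiries)
print(f"Renew all {len(expiries)} certificates on {renew_on}")
```

After the first joint renewal all certificates share one expiry date, so later rounds need only a single reminder.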

QMUL

Revisit later

AOB

GDB

Proposal for Lightweight Sites WG: Andrew: bring together different initiatives, give WG that is site oriented

SOC WG Workshop

David: Advert for Workshop, registration open now: https://indico.cern.ch/event/717615/

HEPSYSMAN

Pete: 10 people registered, would very much like people to register/suggest talks. Dan Traynor has suggested having a discussion over role of HEPSYSMAN in light of changing context/outsourcing/etc.

Meeting Close

 

Transcript

Daniela's Test account: (08/05/2018 12:00)
https://ggus.eu/?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&show_columns_check%5B%5D=SCOPE&ticket_id=&supportunit=&su_hierarchy=0&former_su=&vo=cms&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=UKI-SOUTHGRID-BRIS-HEP&specattrib=none&status=open&priority=&typeofproblem=all&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=lastyear&from_date=08+May+2018&to_date=09+May+2018&untouched_date=&scope=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21
David Crooks: (12:05 PM)
Sorry. mic was open
Jeremy Coles: (12:14 PM)
https://www.gridpp.ac.uk/gridpp-dirac-sam?vo=skatelescope.eu
Steve Jones: (12:18 PM)
VO status at sites (not VAC): http://pprc.qmul.ac.uk/~lloyd/gridpp/votable.html
Robert Frank: (12:25 PM)
manchester working on it
raul: (12:26 PM)
I emailed it to Duncan
David Crooks: (12:27 PM)
I've lost Jeremy, is it just me?
John Hill: (12:28 PM)
No I did as well
Matt Doidge: (12:28 PM)
I lost him for a bit as well.
Jeremy Coles: (12:28 PM)
Sorry - am I back?
John Hill: (12:28 PM)
yes
David Crooks: (12:28 PM)
Yes, sorry, can hear you fine
raul: (12:33 PM)
IPv6 in condor in the CMS factory is not properly configured
Sorry no mic
Brunel: Storage is all on Centos6. I'm starting to move it to CentOS7. Everything else is CentOS 7
New hardware being commissioned this week
Leo Rojas (Sussex): (12:45 PM)
Hey, my mic is active and working but you cannot hear me
I'm working on perfsonar upgrade to centos 7
I am also working on our new centos 7 cluster implementing singularity. We are now able to send jobs to the grid using the Singularity image provided via centos 7
Mark Slater: (12:47 PM)
Forgot to mention: Also updated perfsonar to CentOS 7 as well :)
That was Mark Slater (I obviously put the password in the wrong box when signing in!)
Leo Rojas (Sussex): (12:48 PM)
I mean(the singularity image) provided via CVMFS
David Crooks: (12:59 PM)
https://indico.cern.ch/event/717615/

    • 1
      Ops meeting minutes
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't the task will keep coming to them!).

      • Upcoming allocations:

    • 2
      Experiment problems/issues

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
        T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      Please see attached notes.

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 3
      Meetings & updates

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 4
      Discussion
      • Site roundtable (including CentOS7).
    • 5
      Actions & AOB
      • WLCG GDB tomorrow - agenda is here https://indico.cern.ch/event/651353/
      • https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items