Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting

- The intention is to run the meeting in VidyoConnect: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6

-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone.

-- The London (UK) service is on +442030510622.

-- The meeting extension is 109308582. PIN 1234

Chair:  JeremyC

Minutes: SamS

Apologies:

Minutes GridPP Operations Meeting 22 Jan 2019

Chair: Jeremy Coles

Minutes: Sam Skipsey

 

Attending [highest count]:

Alessandra Forti

Daniel Traynor

Daniela Bauer

Darren Moore

Elena Korolkova

Emanuele Simili

Gareth Roy

Gordon Stewart

Ian Loader

Jeremy Coles

John Hill

Kashif Mohammed

Linda Cornwall

Mark Slater

Matt Doidge

Winnie Lacesso

Pete Clarke

Raja Nandakumar

Raul Lopez

Robert Frank

Sam Skipsey

Ste Jones

Teng Li

Vip Davda

 

Raja:

LHCb - having problems with DIRAC-matcher (matcher seems to be under load, so not matching pilots with payloads). Possibly we'll see an increase in pilots exiting with no work / timeout.

Some GGUS tickets open in UK, but the bulk of them are resolved. 

Still seeing aborted pilots at times at Liverpool. [Thanks to Catalin for fixing the RALPP issue]

[Raja gives apologies in advance for not being able to attend next week's Ops meeting]

 

DUNE still have issue with 5GB transfer limit into RAL [via dynafed/S3 - S3 base problem]. Darren Moore notes that confirmation of fix will be in liaison meeting.

 

Daniela:

"CMS is incredibly quiet"

NTR.

 

Elena:

ATLAS had a problem with HTCondorCE at Liverpool (due to misconfig in AGIS). 

Alessandra sent an email to the sites this morning regarding: CentOS 7 (ATLAS wants to force migration to CC7 resources by June); scratchdisk space (ATLAS would like a quota of 100TB per 1000 analysis slots); and IPv6 (a reminder that there's an ongoing WLCG campaign towards this).

Site levels: no CC7 resources at Sheffield or Glasgow (or Birmingham?). Cambridge and Birmingham have VAC nodes, so "don't need to migrate batch system". [John Hill notes Cambridge VAC is CC7]
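As a rough illustration of the scratchdisk request, the sketch below scales the quoted ratio (100TB per 1000 analysis slots) to a site's slot count. Linear scaling is an assumption; the minutes give only the ratio, and the function name and the example slot count are hypothetical.

```python
def scratchdisk_quota_tb(analysis_slots, tb_per_1000_slots=100.0):
    """Return a scratchdisk quota in TB, assuming it scales linearly with slots.

    The default ratio (100 TB per 1000 analysis slots) is the figure quoted
    in the minutes; linear scaling is an assumption made for illustration.
    """
    if analysis_slots < 0:
        raise ValueError("analysis_slots must be non-negative")
    return tb_per_1000_slots * analysis_slots / 1000.0


# Hypothetical site with 2500 analysis slots.
print(scratchdisk_quota_tb(2500))
```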

 

*ATLAS Jamboree at beginning of March - this is a sites Jamboree (5-8 March); the 5th will have a discussion of "hyperconverged" resources.

 

Alessandra would like to understand migration plans for sites. Gareth notes that the planned migration at Glasgow is tied to our new machine room, but this would be after the June deadline (so we'd need to make new plans to hit this). Alessandra thinks a delay past the deadline is fine, with a good excuse (and a new machine room is a good excuse).

 

*all sites - can we update the batch systems status wiki page?

 

-

Jeremy ops updates: all basically well for experiments.

 

-

Other VOs updates:

 

NTR

-

 

GridPP DIRAC status:

 

Lancaster from yesterday looked v slightly slow to start jobs - Matt notes that power-issues caused them to lose a rack, with concomitant effect on slots.

-

 

Meetings and updates:

 

T1 update: Darren - issue with cvmfs over the weekend, which harmed our efficiency, but now recovering.

 

T2 evolution: new VM definitions.

 

Interoperation: EGI OMB last week - Kashif notes it was a short meeting, mostly about HTCondorCE effort.

 

Security: site patch status? Matt update for David: [Sites advised to patch as this is fairly trivial and doesn't need any dt]. Thanks to everyone who attended the security edition of the Technical Meeting. 

Today is the last day to register for next month's SOC meeting at Cosener's House. https://indico.cern.ch/event/775579/

 

-

Services: NTR

 

 

-

Tickets (by Matt):

 

IPv6 tickets

Oxford (Kashif notes this is evolving - 1 router might be updated for IPv6, but DNS etc not so far)

 

Pete Clarke mentioned that GridPP was "headlined" at a JISC meeting recently due to all our work on IPv6 migration and perfSONAR [thanks to Duncan?]

 

Tier 1 MICE LFC ticket - some kind of weird connection issue?

 

RALPP: Chris is debugging an error in the WebDAV test (ROD ticket) - error code "7" (and there are no docs)

 

QMUL LHCb data transfers ticket. 

 

-

GDB updates:

 

Upcoming meetings mentioned in GDB - WLCG HSF OSG workshop, HEPIX Spring, ISGC2019, DIRAC Users workshop, DPM Workshop, etc.

 

SKA-CERN collaboration update: [mostly updates we've seen before]. CERN/SKA collaborate on OpenStack, OpenLab, ESCAPE ("Exascale science"). Common interests: PRACE, GPGPU etc.

 

IPv6 deployment presentation: [timescales as mentioned]. Interesting notes on "reasons why sites have not moved" - the most common is waiting on the infrastructure in which the site is embedded.

 

WLCG Storage Accounting: [need a way of reporting storage space which is not based on SRM - we call this SRR]. All SEs can publish to SRR now - but there's dev work needed to implement this. (Lots of work mostly on the nice API for inspection)

 

Monitoring and Infrastructure: CERN moving to "MONIT" unified monitoring as a service. (Dashboards, Alarms, Search and Archiving all together). Impl based on Kafka/Spark for transport+processing. "need to impl. GDPR"

 

DOMA-QoS: progress on plans for this project - principle is based on abstracting out the "types" of data modality ("needs REPLICAs", "COLD", "needs FAST access", "just OUTPUT" etc) from our hardware-bound ideas of "DISK" and "TAPE" storage, to potentially save money by allowing infrastructure to automatically provide QoS characteristics by any mechanism which meets the QoS tag's performance/reliability/etc requirements. It was mentioned that the QoS group would welcome new members and input, including from Experiments which have not previously been strongly engaged.
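The idea of selecting storage by QoS tag rather than by "DISK" or "TAPE" can be sketched as below. This is a minimal illustration only, not the DOMA-QoS design: the tag names loosely follow the examples in the minutes, and every threshold, backend name and cost figure is a made-up assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosRequirement:
    min_replicas: int      # minimum number of copies the tag requires
    max_latency_s: float   # worst acceptable time to first byte, in seconds


@dataclass(frozen=True)
class StorageOffer:
    name: str
    replicas: int
    latency_s: float
    cost_per_tb: float


# Hypothetical QoS tags and thresholds (assumptions, not from the GDB talk).
QOS_TAGS = {
    "REPLICAS": QosRequirement(min_replicas=2, max_latency_s=60.0),
    "COLD": QosRequirement(min_replicas=1, max_latency_s=86400.0),
    "FAST": QosRequirement(min_replicas=1, max_latency_s=0.01),
    "OUTPUT": QosRequirement(min_replicas=1, max_latency_s=60.0),
}

# Hypothetical storage backends with invented characteristics and costs.
OFFERS = [
    StorageOffer("tape-archive", replicas=1, latency_s=3600.0, cost_per_tb=10.0),
    StorageOffer("disk-raid", replicas=1, latency_s=0.05, cost_per_tb=40.0),
    StorageOffer("disk-mirrored", replicas=2, latency_s=0.05, cost_per_tb=80.0),
    StorageOffer("ssd-cache", replicas=1, latency_s=0.005, cost_per_tb=200.0),
]


def cheapest_offer(tag, offers):
    """Return the cheapest offer meeting the tag's requirements, or None."""
    req = QOS_TAGS[tag]
    suitable = [
        o for o in offers
        if o.replicas >= req.min_replicas and o.latency_s <= req.max_latency_s
    ]
    if not suitable:
        return None
    return min(suitable, key=lambda o: o.cost_per_tb)


for tag in QOS_TAGS:
    offer = cheapest_offer(tag, OFFERS)
    print(tag, "->", offer.name if offer else "no suitable storage")
```

The point of the sketch is that the caller names only the requirement; any mechanism that satisfies it may be chosen, which is where the potential cost saving comes from.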

 

Summary: Stashcache, an introduction to this, and motivation for it [this has fed into some DOMA-QoS and DOMA-ACCESS discussions]. GeoIP in CVMFS v effective. 

 

-

AOB

 

No AOB.

 

 

 

Chat log:

 

John Hill: (22/01/2019 11:09)

Cambridge VAC CentOS7 already

Elena Korolkova: (11:13 AM)

All uk sites but Sheffield and Glasgow have Centos7 resources

and corresponding queues for Centos7

It's a matter of more resources to be moved Centos7

Matt Doidge: (11:23 AM)

https://indico.cern.ch/event/775579/

Mark Slater: (11:29 AM)

Afraid I've got to go - I still need to get IPV6 DNS entries in for perfsonar

Gareth Roy: (11:31 AM)

https://ps-dash.dev.ja.net/perfsonar-graphs/?source=ps-londhx1.ja.net&dest=ps001.gla.scotgrid.ac.uk&displaysetdest=&url=https://ps-londhx1.ja.net/esmond/perfsonar/archive&reverseurl=https://ps001.gla.scotgrid.ac.uk/esmond/perfsonar/archive&displaysetsrc=#start=1547551180&end=1548155980&summaryWindow=3600&timeframe=1w

Duncan Rand: (11:35 AM)

https://ps-dash.dev.ja.net/maddash-webui/details.cgi?uri=/maddash/grids/UK+Mesh+Config+-+UK+IPv6+Latency+-+Loss/ps002.gla.scotgrid.ac.uk/ps-londhx1.ja.net/Packet+Loss

Gareth Roy: (11:37 AM)

Thanks Duncan

Jeremy Coles: (11:39 AM)

https://www.gridpp.ac.uk/wiki/Batch_system_status

 

    • 11:00 11:01
      Ops meeting minutes 1m
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't the task will keep coming to them!).

      • Upcoming allocations:

    • 11:01 11:20
      Experiment problems/issues 19m

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
        T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      Please see attached notes.

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 11:20 11:40
      Meetings & updates 20m

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 11:40 12:20
      Discussion topics 40m
      • January 2019 GDB updates
    • 12:20 12:25
      Actions & AOB 5m