Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234.
- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.
-- The London (UK) service is on +442030510622
-- The meeting extension is 9308582.

Apologies:

Minutes:

EARLY DRAFT NOTES (Jeremy)

 

Alessandra; Chris B; Dan T; Daniela B; David C; Elena K; Ewan MM; Federico M; Gang Q; Gareth R; Gareth S; Ian L; Ian N; Jeremy C; John B; John H; Marcus E; Matt D; Kashif M; Oliver S; Winnie L; Pete G; Robert F; Robert F; Steve J; Sam S; Terry F.

 

 

018: H/w for main compute platform. Other in LMC.

 

 

CMS (Daniela): Nothing to report. Glasgow is on the blacklist as a T3 for user job submission. Nobody knows what happened. There are issues for T2s, so it may be a CMS staff change issue.

 

ATLAS (Elena): Going through the tickets – discussed at ATLAS meeting last week. Quite a number asking for DPM dumps.

 

118732 – Glasgow. Transfers fail with a "no such file or directory" error. Sam confirmed the files do not exist. Now waiting for reply.

 

118052 – http deployment. From December 2015. Sam thinks the problem is solved. Some disk servers are running an old DPM dmlite. Looking to fix. Link shows 60% of tests are successful.

 

117889 – ATLAS request for storage consistency check. Sam was asking which endpoints to have checked. Next on the list for Sam, but it has not been a high priority until now.

 

118695 – Lancaster. Storage consistency. Matt thinks he has solved the problem.

 

117894 – Sussex. Storage consistency. Not in progress yet.

 

118740 – Brunel. Multicore jobs failing. Raul thinks they are failing with lost heartbeat errors, which are difficult to trace. May be related to power cuts.

 

117878 – Brunel. Consistency check. Raul says he is not sure what he should do. ATLAS response.

 

118756 – Manchester. Opened last night.

118679 – Manchester. Http support.

117885 – Manchester storage. Two servers being deployed.

118728 – QMUL. T0 expert file cannot be copied to site. Dan investigating.

 

Four tickets on the ATLAS storage consistency request: Oxford, Birmingham, Sheffield and RHUL. Each requires attention, although Govind noted he will deal with it shortly.

 

ATLAS jamboree at the end of January.  https://indico.cern.ch/event/440821/

 

 

LHCb (Raja): Not much running on grid. Little MC. New request to come in.

 

T2A: Cloud resources on OpenNebula at T1.

 

LIGO (Sam): Some discussion about a problem (re Condor). Paul has just started looking at it again.

 

LSST: VOMS servers showing up properly in VO ID card.

Marcus may update on test transfers from NERSC, as he now has an account and can try some test transfers. Need to know from Joe what data to transfer and its location.

 

LZ (Elena): Ran production jobs for the technical design report to simulate background events. Ran at IC. Daniela was asking sites to support LZ. Checking in the BDII, only RALPP and Sheffield are seen.

Oxford – turn on support on the CPU.  TBD.

ECDF – Still plan to support it. But currently switching compute cluster. Beginning Feb.

Liverpool – Accounts enabled.

 

Daniela: If you enable LZ you need to have pilot accounts enabled.  RALPPD works.

 

Sheffield and Sussex do not have pilot roles for the GridPP VO. Problem at the weekend with the VOMS server down. Admin interface at Manchester. Bug in Dirac triggered. Extra DNS servers were set up, but the setup was not correct.

 

RF: Issue – container that ran … log file 4000 lines/second. Root cause not identified. DNS issues: Andrew was dealing with this, and there need to be some updates higher up the chain (JANET/IT services). Q: Has Andrew opened the ticket?

 

Dirac dependency on VOMS? Was a bug in Dirac setup.

 

RAL – being followed up.

Brunel. No update

UCL sorted? VAC standard images have it.

Sheffield – enabled pilot role for several VOs. Still does not work. Maybe an ARGUS config issue.

 

Durham. Couple to do. In hand.

ECDF. Nothing yet. Switched.

Glasgow. Was on-hold while looking at account management. Looking to centralize.

Birmingham?

RALPPD – done.

 

GDB – David’s talk may be in March.

 

Request to sites. Use of puppet repo.

 

Please register for HEPSYSMAN

 

Ask Mark to forward the requirements for the GANGA meeting

 

 

 

 

 

    • 11:00 11:01
      Ops meeting minutes 1m
      * This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
      * The team composition has been changing. If everybody contributes then the task comes around less often.
      * From the start of GridPP4+ those in fully funded GridPP positions will be expected to contribute. Others are welcome to volunteer!
      * The minutes should contain a list of who attended; apologies; note who took the minutes and highlight actions.
      * A count is maintained at https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items.
      * After uploading minutes to the agenda page the minute taker is expected to:
      ** Update the list of ops actions.
      ** Update their 'count' so the task can be shared fairly.
      Thank you for your support!
    • 11:01 11:20
      Experiment problems/issues 19m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
      - ATLAS
      - Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
      - GridPP DIRAC status [Andrew McNab]
      -- http://www.gridpp.ac.uk/php/gridpp-dirac-sam.php?action=view
      - Status of pilot enabling across sites.
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
      - General updates
      - WLCG ops coordination
      - Tier-1 status
      - Storage and data management
      - Tier-2 Evolution
      - Accounting
      - Documentation
      - Interoperation
      - Monitoring
      - On-duty
      - Rollout
      - Security
      - Services
      - Tickets
      - Tools
      - VOs
      - Site updates
    • 11:40 12:00
      Hot topics... 20m
      - WLCG workshop.
    • 12:00 12:05
      Actions & AOB 5m
      * https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items