WLCG Operations Coordination Minutes - May 8th, 2014

Agenda

Attendance

  • local: Nicolò Magini (secretary), Andrea Sciabà, Maarten Litmaath (ALICE), Marian Babik, Maria Dimou, Simone Campana (ATLAS), Andrej Filipcic (ATLAS), Felix Lee (ASGC), Oliver Keeble, Julia Andreeva, Michail Salichos, Zbigniew Baranowski, Stefan Roiser (LHCb).
  • remote: Josep Flix (chair, PIC), Maite Barroso (Tier-0), Valery Mitsyn (JINR), Thomas Hartmann (KIT), Shawn McKee, Cristina Aiftimiei (EGI), Christoph Wissing (CMS), Peter Solagna (EGI), Di Qing (TRIUMF), Antonio Maria Perez Calero Yzquierdo, Gareth Smith (RAL), Rob Quick (OSG), Dave Dykstra, Alessandra Forti.

News

  • Alastair Dewhurst replaces Simone Campana in the IPv6 task force. Thanks to Simone and a warm welcome to Alastair!
  • Discussion on ARGUS support future
  • Discussion on the mandate and the objectives of a new task force or working group on network and transfer metrics
    • End of the xrootd and perfSONAR task forces
    • open discussion on the mandate - presentation is scheduled
  • 2014 WLCG Workshop in Barcelona (7-9 July):
    • Please register asap and book your hotel: https://indico.cern.ch/event/305362/ (registration will close one month in advance)
    • The agenda is being discussed and potential speakers will soon be contacted

News from EGI

  • Peter presents about the future support of ARGUS
    • SWITCH is currently supporting ARGUS on a best-effort basis, but strongly suggests that another institution take over in the medium term. Alternatives are being evaluated.
    • EGI surveying support for other products.

  • Peter explains that SWITCH has not given a firm deadline for the handover, but probably around 6 months, and it's important to start the discussion on finding alternatives.
  • Pepe asks when INFN can give an answer about taking over support: no timeline for the answer yet.
  • To be assessed again at the next meeting.

Proposal for new Working Group: Network and Transfer Metrics

  • Marian presents a proposal for a new Working Group: Network and Transfer Metrics
    • The proposed mandate is to identify and publish the relevant metrics, ensure that issues can be better understood and fixed, and enable the use of network-aware tools.
    • Objectives and membership are presented.

  • Pepe asks if the WG will implement alarming mechanisms for network and transfer issues. Shawn answers that the scope of the WG would be to provide the metrics needed for this and ensure that the data is well organized. Implementing the alarms is out of scope and there are other project proposals for this.
  • Nicolo and Simone ask if the working group will also handle operational aspects previously covered by the xrootd task force (e.g. deployment and upgrade of monitoring plugins), which are not mentioned in the presentation. Julia suggests that the WG should ensure that the infrastructure needed to collect the required metrics is in place. Shawn answers that while the WG's role is not to fix issues, it is in charge of identifying problems and coordinating with the developers to fix them, as mentioned explicitly for perfSONAR. It is agreed that checking that federation access metrics are published, and coordinating the deployment of the monitoring plugins, are also in scope for the WG.

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • Security support for EMI-2 ended on April 30th; all baseline versions were raised to EMI-3, except for dCache, for which support was extended.
  • CVMFS bugfix release
  • gliteWMS bugfix release

Tier-0 and Tier-1 Grid services

Storage deployment

Status, recent changes and planned changes by site:

  • CERN
    • CASTOR: v2.1.14-11 and SRM-2.11-2 for ATLAS, ALICE, CMS and LHCb
    • EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4); ATLAS (EOS 0.3.8 / xrootd 3.3.4 / BeStMan2-2.3.0); CMS (EOS 0.3.7 / xrootd 3.3.4 / BeStMan2-2.3.0); LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.3.0 (OSG pre-release))
  • ASGC: CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1
    • Recent changes: none; planned changes: none
  • BNL: dCache 2.6.18 (Chimera, Postgres 9.3 with hot backup); http (aria2c) and xrootd/Scalla on each pool
    • Recent changes: none; planned changes: none
  • CNAF: StoRM 1.11.3 emi3 (ATLAS, LHCb, CMS)
  • FNAL: dCache 2.2 (Chimera, Postgres 9) for the disk instance; dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM) for the tape instance; httpd 2.2.3; Scalla xrootd 3.3.6-1; EOS 0.3.21-1 / xrootd 3.3.6-1.slc5 with BeStMan 2.3.0.16
    • Planned changes: upgrade tape instance to Chimera/dCache 2.2 on May 19-20
  • IN2P3: dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes; Postgres 9.2; xrootd 3.3.4 (ALICE T1), xrootd 3.3.4 (ALICE T2)
  • JINR-T1: dCache 2.6.24 (srm-cms.jinr-t1.ru); dCache 2.2.24 with Enstore (srm-cms-mss.jinr-t1.ru); xrootd federation host for CMS: 3.3.6
  • KISTI: xrootd v3.3.4 on SL6 for disk pools, ALICE T1 (redirector only; servers are still 3.2.6 on SL5, to be upgraded); xrootd 20100510-1509_dbg on SL6 for the tape pool; xrootd v3.2.6 on SL5 for disk pools (ALICE T2); DPM 1.8.7-4
  • KIT: dCache 2.6.21-1 (atlassrm-fzk.gridka.de), 2.6.17-1 (cmssrm-kit.gridka.de), 2.6.17-1 (lhcbsrm-kit.gridka.de); xrootd 20100510-1509_dbg (alice-tape-se.gridka.de), 3.2.6 (alice-disk-se.gridka.de), 3.3.3-1 (ATLAS FAX xrootd redirector)
  • NDGF: dCache 2.8.2 (Chimera) on core servers and pool nodes
    • Recent changes: upgraded to dCache 2.8.2
  • NL-T1: dCache 2.2.17 (Chimera) (SURFsara); DPM 1.8.7-3 (NIKHEF)
  • PIC: dCache head nodes (Chimera) and doors at 2.2.23-1; xrootd door to VO servers (3.3.4)
    • Recent changes: none; planned changes: dCache 2.9 tests ongoing in a test instance
  • RAL: CASTOR 2.1.13-9, 2.1.14-5 (tape servers), SRM 2.11-1
    • Planned changes: scheduling upgrade to CASTOR 2.1.14
  • RRC-KI-T1: dCache 2.2.24 + Enstore (ATLAS); dCache 2.6.22 (LHCb); xrootd: EOS 0.3.19 (ALICE)
  • TRIUMF: dCache 2.6.21
    • Recent changes: none; planned changes: none

  • PIC mentions that dCache 2.9 is compatible with Enstore.

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1, fts-server-3.2.3-5 | Put FTS3 server into production | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | None | None |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

  • FNAL added FTS3 to the table; it will also be added for CERN and RAL. FTS3 will be tracked only for servers in the WLCG deployment.

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.3.1-1 (T1 and US T2s) | SL6, gLite | Oracle 11gR2 | ATLAS | None |
| CERN | 1.8.7-4 | SLC6, EPEL | Oracle 11 | ATLAS, OPS, ATLAS xrootd federations | |
| CERN | 1.8.7-4 | SLC6, EPEL | Oracle 12 | LHCb | |

Oracle deployment

  • Note: only Oracle instances with a direct impact on offline computing activities of LHC experiments are tracked here
  • Note: an explicit entry for specific instances is needed only during upgrades, listing affected services. Otherwise sites may list a single entry.

| Site | Instances | Current version | WLCG services | Upgrade plans |
| CERN | CMSR | 11.2.0.4 | CMS computing services | Done on Feb 27th |
| CERN | CASTOR Nameserver | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 4th |
| CERN | CASTOR Public | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 6th |
| CERN | CASTOR Alicestg, Atlasstg, Cmsstg, LHCbstg | 11.2.0.4 | CASTOR for LHC experiments | Done on Mar 10th, 14th and 25th |
| CERN | LCGR | 11.2.0.4 | All other grid services (including e.g. Dashboard, FTS) | Done on Mar 18th |
| CERN | LHCBR | 12.1.0.1 | LHCb LFC, LHCb DIRAC bookkeeping | Done on Mar 24th |
| CERN | ATLR, ADCR | 11.2.0.4 | ATLAS conditions, ATLAS computing services | Done on Apr 1st |
| CERN | HR DB | 11.2.0.4 | VOMRS | Done on Apr 14th |
| CERN | CMSONR_ADG | 11.2.0.4 | CMS conditions (through Frontier) | Done on May 7th |
| BNL | | 11.2.0.3 | ATLAS LFC, ATLAS conditions | TBA: upgrade to 11.2.0.4 (tentatively June) |
| RAL | | 11.2.0.3 | ATLAS conditions | TBA: upgrade to 11.2.0.4 (tentatively June) |
| IN2P3 | | 11.2.0.3 | ATLAS conditions | TBD: upgrade to 11.2.0.4 on May 13th |
| TRIUMF | TRAC | 11.2.0.4 | ATLAS conditions | Done |

  • Oracle upgrades completed at CERN. At Tier-1s, schedule depends on testing of Golden Gate (new replication method).

T0 news

  • First conclusions from the series of meetings on measuring batch job efficiency as a function of where jobs are executed: no indication of any significant difference in job efficiency between the CERN Geneva (Meyrin) and Wigner sites.
    • Some other possible causes of low job efficiency have been found and are being investigated:
      • Intel versus AMD (Some batch applications optimised for Intel CPUs)
      • Zombie pilot jobs
      • Some users transfer data from/to remote locations
      • Virtual vs Bare Metal
      • SLC5 versus SLC6
    • Other actions taken: dedicated perfSONAR measurements between Meyrin and Wigner, SLC6 with standard TCP parameters, and a full mesh of bandwidth measurements.
    • The meetings and investigations will continue; we'll report when there are more findings.
    • Full presentation to the WLCG mgt board available here: https://indico.cern.ch/event/302033/contribution/3/material/slides/0.pdf

  • WMS decommissioning: the machines are powered off, with no tickets or user complaints whatsoever. Only the SAM instances are still running in production.

  • Migration to VOMS-admin: waiting for the new voms-admin release, expected in around three weeks. Ticket open to the developers: GGUS:102984

  • Argus: Sporadic authentication failures have been observed in the site-argus service; apparently some internal timeouts were triggered. We increased the number of nodes in the site-argus alias, which seems to have improved the situation; the root cause is being investigated.
    • Third-level support is best effort and has an uncertain future. This is bad news, as Argus is an important piece of the grid middleware stack; we would like to raise it to WLCG's attention.

  • Maarten suggests to rephrase the statement about Meyrin vs Wigner job efficiency to clarify that no significant difference is seen on job efficiency based purely on the location.
  • Nicolo asks about the status of the SLC6 migration. Maite answers that the progress is currently 70% SLC6, +5% since the last report three weeks ago. At 80%, PES will discuss with experiments if/how much needs to be kept on SLC5.
  • Maite explains that the mails about zombie pilots are not yet sent automatically.

Other site news

Data management provider news

DPM 1.8.8 released

DPM 1.8.8 has been released to EPEL-stable. Highlights and full details available here: https://svnweb.cern.ch/trac/lcgdm/attachment/wiki/Dpm/DPM_releasenotes_Mar2014.txt

Storage Infosys publishing

A series of meetings involving many storage providers (Castor/EOS, dCache, DPM, StoRM) has been initiated as part of the validation of the information system. The aim is to ensure consistent, complete and correct publishing of storage systems to GLUE2, in particular relating to capacity publishing. https://twiki.cern.ch/twiki/bin/view/EGEE/GLUE2Storage
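
As an illustration of the kind of check involved, the sketch below queries a site BDII for the GLUE2 storage capacity objects. It is a hedged example only: the host name is a placeholder, and the object class and attribute names are taken from the GLUE2 LDAP schema, but which of them a given storage system actually fills in is exactly what the validation is meant to verify.

<verbatim>
# Hedged sketch (Python): list the GLUE2 storage capacities a site publishes,
# by shelling out to the standard ldapsearch CLI.
import subprocess

def glue2_storage_capacity(bdii_host, port=2170):
    """Return the raw LDIF describing a site's GLUE2 storage capacities."""
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://%s:%d" % (bdii_host, port),
        "-b", "o=glue",
        "(objectClass=GLUE2StorageServiceCapacity)",
        "GLUE2StorageServiceCapacityType",
        "GLUE2StorageServiceCapacityTotalSize",
        "GLUE2StorageServiceCapacityUsedSize",
        "GLUE2StorageServiceCapacityFreeSize",
    ]
    return subprocess.check_output(cmd, text=True)

# Example with a placeholder host:
# print(glue2_storage_capacity("site-bdii.example.org"))
</verbatim>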

Experiments operations review and Plans

ALICE

  • High activity in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt)
  • KIT
    • smooth sailing for the last 2 weeks
  • CERN
    • SLC6 job efficiencies: next meeting tomorrow

ATLAS

  • MC production:
    • lower activity in the last two weeks, waiting for the new requests to be approved and submitted
    • multi-core MC14 production done, only smaller validation tasks running from time to time
  • FTS3 upgrade went smoothly without ATLAS intervention
  • Rucio stress test planned to start after 20th of May and continue with gradual increase of activity till end of June
  • Rucio commissioning: migration to the Rucio file catalog to be completed in the next two weeks
  • DQ2 site services issues: waiting for a DQ2 fix for uncaught exceptions affecting the stability of the service
  • Low job efficiency at CERN:
    • the pilot wrapper now implements a cleanup of orphan processes via process-group control
    • in production, the problematic tasks at CERN were activated again
  • Analysis and parallel make:
    • sites and users complained about build-job failures: asetup automatically sets MAKEFLAGS to the number of available cores on the host
    • pilot wrapper fix: after the setup, MAKEFLAGS is reset to the number of cores allocated to the batch job (see the sketch after this list)
  • Multi-core production:
    • no massive tasks planned till release 19 is ready for simulation and reconstruction
    • sites asked to reduce the multi-core partition in case of static single-core/multi-core allocation
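
A minimal sketch of the pilot-wrapper fix described above; the function name is illustrative, not the actual ATLAS pilot code. The idea: after the release setup has run, MAKEFLAGS is forced back to the number of cores actually allocated to the batch job, so parallel builds cannot oversubscribe shared worker nodes.

<verbatim>
# Illustrative sketch (Python) of the MAKEFLAGS reset described above.
import os

def reset_makeflags(allocated_cores):
    """Cap parallel make at the cores allocated to this batch job."""
    # asetup may have set MAKEFLAGS to the number of *available* cores on
    # the host (e.g. -j32 on a 32-core node), oversubscribing a 1-core slot.
    os.environ["MAKEFLAGS"] = "-j%d" % allocated_cores

reset_makeflags(1)              # a single-core batch slot
print(os.environ["MAKEFLAGS"])  # -> -j1
</verbatim>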

  • Andrej clarifies that ATLAS will continue to send a mixture of single-core and multi-core jobs in the future.

CMS

  • High priority Production and Processings
    • Heavy Ion reprocessing (now finished)
    • Heavy Ion MC
    • Upgrade MC
  • Oracle upgrade of CMS Online DB went ok yesterday
    • A CMS-internal communication deficit suggested that all Grid jobs would be impacted (blame CW for this!)
    • CERN DBAs and CMS Frontier experts found a way to perform the upgrade transparently (Thanks!)
  • SAM tests
    • The CMS SAM test for gLExec will be made critical on May 15th
      • Tickets will be opened by the end of this week to sites that would fail
  • SAM test for xrootd fallback
    • Not yet critical
    • Still waiting (mainly) for RAL to fix some issues
  • Xrootd Federation - “AAA”
    • Scale testing of Tier-1s ongoing
    • Reminder to sites: Please deploy detailed xrootd monitoring
  • Multi core processing
    • Started to send production workflows through a mixture of multi-core and single-core pilots
      • Executing N single-threaded jobs in N-core pilot (N typically 8)
      • First successful experiences at PIC (where we started first)
      • Ramping up at KIT, RAL and JINR
    • Functional tests at other sites continuing or about to start
  • FTS3 for PhEDEx Debug transfers becoming mandatory now
    • Will send tickets to sites this week
  • Problem in OSG DigiCert CRL on May 1st
    • SAM tests failed for US sites
    • Site readiness metric already corrected
    • Due to an issue on the CMS SAM-Nagios box, the changes were only picked up properly on Monday

  • On Andrea's question, Marian comments that a ticket for SAM availability recalculation for the CRL incident is already open.
  • Simone asks other VOs about their strategy for availability recalculation in case of failures or timeouts in submission of SAM jobs through gliteWMS, which do not necessarily affect production jobs. Andrea answers for CMS that the recalculation is requested if the problem is in the infrastructure, but not if it is at the site; it needs to be investigated case by case. Such failure modes are expected to be reduced with the new CondorG and CREAMCE SAM submission probes.
  • Julia comments that the new SAM will allow each VO to include production efficiency in the availability calculation if desired, e.g. by taking the best of the production and SAM job efficiencies as the site availability (a minimal sketch follows). SAM jobs could still be used to probe individual services, without penalizing the site if production is OK.
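
A minimal sketch of the combination rule Julia mentions, assuming it is one possible option in the new SAM rather than an agreed algorithm: the site availability is the best of the SAM result and the production job efficiency, so a site with healthy production is not penalized by SAM-only failures.

<verbatim>
# Sketch (Python): "best of production and SAM job efficiency".
def site_availability(sam_efficiency, production_efficiency):
    """Both inputs in [0, 1]; healthy production shields SAM-only failures."""
    return max(sam_efficiency, production_efficiency)

# SAM jobs failing (e.g. submission timeouts) while production is fine:
print(site_availability(0.0, 0.95))  # -> 0.95
</verbatim>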

LHCb

  • Incremental stripping campaign finished, all productions closed, many thanks to all Tier1 sites for their support
  • CASTOR->EOS migration of LHCb user data finished, all CERN permanent storage for LHCb distributed computing now allocated at EOS, many thanks to DSS for the migration
  • Problem with some certificates especially for Brazilian VO members to access data at GRIDKA and IN2P3
    • a similar problem occurred last fall at other dCache sites (special characters in certificates) and was fixed by an update; this time the cause seems to be different
    • investigations ongoing

  • Maarten comments that the certificate issues are probably in some library used by dCache.
  • On Pepe's question, Stefan answers that PIC is not affected this time unlike the last problem.

Ongoing Task Forces and Working Groups

Tracking tools evolution TF

  • GGUS: proposal to stop ticket creation through email (more info). At the moment this feature can cause the creation of a lot of fake tickets, and not that many real tickets are opened through email.
  • CMS Comp Ops started the transition from Savannah to GGUS. At some point the GGUS-to-Savannah bridge should be stopped; the current plan is to stop it at the end of June (more details).

FTS3 Deployment TF

gLExec deployment TF

  • Christoph comments that very few CMS sites are still missing gLExec deployment (a couple of Tier-2s and some Tier-3s).

Machine/Job Features

  • Main activity on the development, and soon deployment, of a machine/job features service for cloud infrastructures
    • the "API" was agreed at the level of the URL / GET parameters, for consistency. This allows
      • different implementations of the IaaS service
      • clients to access the values directly instead of using the mjf.py client (if wanted), as only two modes (batch/IaaS) need to be implemented; see the sketch below
  • Note for sites: the mjf.py client does not need to be deployed as far as the LHC VOs are concerned, as all of them will bring the client with them in their software stacks (CVMFS etc.)
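
A minimal sketch of the direct-access mode mentioned above, assuming the convention that $MACHINEFEATURES and $JOBFEATURES point either to a local directory with one file per key (batch mode) or to a base URL answering plain GET requests (IaaS mode); the key name used in the example is illustrative.

<verbatim>
# Sketch (Python) of a direct machine/job features client (not mjf.py).
import os
from urllib.request import urlopen

def read_feature(base_var, key):
    """Return the value of one key, or None if the variable is unset."""
    base = os.environ.get(base_var)
    if base is None:
        return None
    if base.startswith(("http://", "https://")):  # IaaS mode: plain GET
        return urlopen(base.rstrip("/") + "/" + key).read().decode().strip()
    with open(os.path.join(base, key)) as f:      # batch mode: plain files
        return f.read().strip()

# Example with an illustrative key name:
print(read_feature("MACHINEFEATURES", "jobslots"))
</verbatim>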

  • Stefan comments that sites are welcome to volunteer for testing the service for a cloud infrastructure.
  • Alessandra reminds everyone that not all VOs are shipping the mjf client yet, but they are planning to do so. It is available in the SFT CVMFS repository.

Middleware readiness WG

  • Next meeting on Thu May 15, 10:30-12:00 CEST
  • As promised at the WLCG Planning meeting of 2014/04/17, we addressed the following questions to the Tier-0 and Tier-1 contacts, and received very prompt and useful responses that will be on the MW Readiness WG twiki in time for next week's meeting. Many thanks to everyone!
    1. If, how and where you publish the MW versions you run in production.
    2. How you use the baseline versions table, given that the "baseline version" number doesn't necessarily reflect all individual updates of the packages in the dependencies.

Multicore deployment

  • CMS started scale tests of multicore pilots (running N single-threaded jobs inside an N-core pilot, as sketched after this list) at the Tier-1s: PIC, KIT, RAL, JINR and CCIN2P3.
  • The TF now enters a second stage, in which we will evaluate the compatibility of the ATLAS and CMS approaches to submitting multicore jobs to shared sites.
  • Coming sessions will be dedicated to presenting and discussing this experience from the sites' point of view.
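
An illustrative sketch of the scheme being scale-tested, not the actual CMS pilot code: an N-core pilot starts N single-threaded payloads side by side and waits for all of them to finish.

<verbatim>
# Sketch (Python): N single-threaded jobs inside an N-core pilot.
import subprocess

def run_pilot(n_cores=8, payload=("sleep", "1")):
    """Launch one single-threaded payload per allocated core, wait for all."""
    procs = [subprocess.Popen(payload) for _ in range(n_cores)]
    return [p.wait() for p in procs]

print(run_pilot())  # e.g. [0, 0, 0, 0, 0, 0, 0, 0] for N = 8
</verbatim>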

  • Simone asks if the TF has provided recommendations for all batch systems for dynamic resource provisioning. Alessandra and Antonio answer that they will be provided after gathering more experience with simultaneous ATLAS and CMS running. Simone offers dedicated ATLAS test tasks for this.

SHA-2 Migration TF

  • EGI broadcast #2 about the new VOMS servers was sent on May 6
  • a problem with the timeline was discovered on May 7:
    • job submission to CREAM fails when the proxy was signed by a VOMS server with a SHA-512 host certificate (GGUS:104768); a way to check a server certificate's signature algorithm is sketched after this list
    • our new VOMS servers have such certificates
    • the fix has been tested successfully
    • we now need the fix to become available in EMI and UMD repositories
    • all sites then need to update their CEs
    • June 2 looks a bit tight...
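
A small helper sketch for the check implied above; it is a hypothetical example, not an official tool: fetch a server's host certificate and report its signature algorithm, e.g. to see whether a VOMS server's certificate uses SHA-512.

<verbatim>
# Sketch (Python): report the signature algorithm of a server certificate.
import ssl
import subprocess

def signature_algorithm(host, port):
    """Fetch the server certificate and extract its signature algorithm."""
    pem = ssl.get_server_certificate((host, port))
    text = subprocess.run(["openssl", "x509", "-noout", "-text"],
                          input=pem, capture_output=True,
                          text=True, check=True).stdout
    for line in text.splitlines():
        if "Signature Algorithm" in line:
            return line.split(":", 1)[1].strip()  # e.g. sha512WithRSAEncryption

# Example with a placeholder endpoint:
# print(signature_algorithm("voms.example.org", 8443))
</verbatim>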

  • Simone asks if there is also a plan to migrate to RFC proxies, now of interest for ATLAS, and if an update can be provided at the next meeting. Maarten answers that it shouldn't be an issue, as CMS has already started using them; he suggests to switch the SAM preprod instances to RFC proxies.
  • Nicolo confirms that CMS has found no blocking issues with RFC proxies, though some sites needed to upgrade the services (e.g. BeStMan) to recent versions.

WMS decommissioning TF

  • CERN WMS instances for experiments have been switched off on May 5
  • SAM instances have their own timeline

IPv6 validation and deployment TF

  • Today there is a meeting of the HEPiX IPv6 working group, focused on the preparation of the June pre-GDB
  • Quoting today's email from Edoardo Martelli:
    • It's with great pleasure that I can introduce you to lxplus-ipv6.cern.ch, an lxplus instance with dual-stack connectivity. You can ssh to it over IPv6 or IPv4 from anywhere on the Internet. I'd like to thank Steve Traylen, Ignacio Reguero and the IT-PES group for making it possible.

HTTP proxy discovery TF

  • For reference: TF home page
  • Progress has been slow, but some has happened:
    • It is mostly waiting on full implementation of the SquidMonitoringTaskForce recommendations. For that, the GOCDB & OIM fields for registering squids have been defined, and the wlcg-squid-monitor.cern.ch machine is reading them and putting the list into a JSON file. However, the MRTG monitor that reads the file isn't complete yet, so it's too early to ask all sites to register their squids.
    • The wlcg-wpad.cern.ch name has been defined as an alias to wlcg-squid-monitor.cern.ch for now, but wpad.dat is not yet being generated from the JSON file (a sketch of such a generator follows this list).
    • The frontier client has been fully ready to read WPAD/PAC files for a year now. The cvmfs 2.1.19 client implementation is almost complete, except that it doesn't robustly support round-robin over all the http proxies in a list, which is the only way to do load balancing for proxies specified in PAC files; otherwise the listed proxies are tried sequentially.
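
A hedged sketch of the missing step, generating wpad.dat from the JSON squid list; the JSON layout and file names are assumptions for illustration, not the actual format used by wlcg-squid-monitor.cern.ch. As noted above, PAC entries are tried sequentially, so load balancing is up to the client.

<verbatim>
# Sketch (Python): turn a JSON list of registered squids into a wpad.dat
# PAC file. Assumed input format: [{"host": "...", "port": 3128}, ...].
import json

PAC_TEMPLATE = '''function FindProxyForURL(url, host) {
    return "%s";
}
'''

def make_wpad(json_path="squids.json", pac_path="wpad.dat"):
    with open(json_path) as f:
        squids = json.load(f)
    # PAC semantics: entries are tried sequentially, so load balancing is
    # up to the client (round-robin), as noted above for cvmfs 2.1.19.
    proxies = "; ".join("PROXY %s:%d" % (s["host"], s["port"]) for s in squids)
    with open(pac_path, "w") as f:
        f.write(PAC_TEMPLATE % proxies)

# make_wpad()  # writes wpad.dat next to squids.json
</verbatim>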

Action list

  1. Document procedure and forum to track networking issues
    • In the mandate of the Network and Transfer Metrics WG

AOB

-- NicoloMagini - 05 May 2014
