WLCG Operations Coordination Minutes, June 6, 2019

Highlights
Agenda
Attendance
Operations News
Special topics
Middleware News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
- ALICE
- ATLAS
- CMS
- LHCb
Task Forces and Working Groups
Action list
- Specific actions for experiments
- Specific actions for sites
AOB

Highlights

Many thanks to the sites that have provided their answers to the site survey. Sites that have not yet done so: PLEASE respond before the 15th of June.

Agenda

https://indico.cern.ch/event/823800/

Attendance

local: Fabrizio (DPM devs), Federico (LHCb), Julia (WLCG), Konrad (LHCb), Maarten (ALICE + WLCG)
remote: Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alessandro P (EGI), Baptiste (EGI), Catherine (LPSC + IN2P3), Christoph (CMS), Dave (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Johannes (ATLAS), Matt (EGI), Renato (CBPF), Ron (NLT1), Sang-Un (KISTI), Ste (Liverpool), Stephan (CMS), Vladimir (LHCb)
apologies:

Operations News

the next meeting is planned for July 4
- US sites obviously are excused

Special topics

VO input regarding status of migration of sites to CentOS 7

Do your experiment operations follow the migration of sites to CentOS/EL 7?
If yes, what is the status?
If not, does it mean that your workflows are not so concerned because they are shielded?
By using containers?
Any other comments, clarifications are welcome

ALICE

We are following the migration with the centres; the software on CVMFS is fully compatible with CentOS 7.
Containers (via Singularity) are a big part of our future operations strategy (later this year or next year).

ATLAS

We are following up with sites. Enabling containers is part of the migration.
Status here: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment

Alessandra F:
- CentOS 7 highly desirable:
  - avoid Python version issues
  - benefit from better container support
  - user SW compiled on CentOS 7 has to run on that OS
- T1 done, T2 60% done

CMS

CMS follows CentOS 7 upgrades at sites only as it relates to reduced computing capacity/CE downtime. We use Singularity containers, i.e. run SL 6 and CentOS 7 jobs as needed regardless of the base OS.

Stephan:
- CMS uses Singularity both for production and analysis.
- In case of high load on CVMFS, CMS has experienced issues on some SL6 sites, while CentOS 7 sites work well under similar load.

LHCb

We are following the migration status.
All our major sites have already migrated to CentOS 7-compatible systems, and we use CentOS 7 there.
In addition, we need a relatively small amount of SLC5-compatible resources (SLC6 with a compatibility layer) in order to produce MC with the legacy reconstruction version that was used for Run 1 data.
We are finalizing a container solution based on Singularity for legacy productions.
We are considering using a containerized workflow by default.

Federico:
- In general, migration status for LHCb is not very important, since Dirac sends jobs to sites with a particular OS version based on job requirements
- Singularity particularly needed to run SLC5 code on CentOS 7
- first usage expected in 1 month
- the usage will not comply with isolation requirements
  - Maarten: that is a matter for the Traceability WG

Can Singularity run inside Singularity?
- Should work for v3.x on EL7, at least when certain reasonable options are used.
  - Will be even better on EL8 thanks to its newer kernel.
- Singularity has always run fine inside Docker.

Discussion

Catherine:
- site admins in France would like to restrict the Singularity configuration
  as much as possible, e.g. by owner of container image, path, ...
- the default configuration works, but is deemed not good enough
- we would like to have a combined configuration serving both ATLAS and CMS,
  expecting it will also be fine for ALICE and LHCb

Alessandra F:
- though many ATLAS workflows are similar to those in CMS, user workflows
  are done differently
- we also need special treatment of certain sites, e.g. concerning mount points

Maarten:
- concerned admins should join the Containers WG where we try to devise a
  common configuration serving the 4 experiments

VO input regarding SRM usage and dependencies

Since some of the popular storage solutions (DPM, EOS) are moving towards no-SRM, we would like to assess the situation in every LHC VO regarding dependencies on SRM.

Which SRM functionality are you currently using?
Can this functionality be provided via other means?
If yes, are your data management and workload management frameworks ready for a switch? Note: this most probably implies coexistence of SRM-enabled and no-SRM sites.
If not, how do you plan to deal with EOS and DPM sites in the near future? Note: the latest DPM versions have the SRM only as an optional service with minimal support.
Any other comments, clarifications are welcome

ALICE

SRM services are not used by ALICE.

ATLAS

Third party copy, space reporting, tape interaction.

Yes, and we do it already except for tape interaction.
Third-party-copy: Mostly gsiftp already, though gradually deploying alternative options using root and davs, within the context of the DOMA TPC activity.
Space reporting: We do not use SRM space tokens anymore; instead we use directories as discriminators for storage areas. Most sites deploy the space reporting JSON, which is then consumed; some still rely on SRM to query the storage usage.

Right now we depend on SRM for tape interaction; we do not require SRM for anything else, provided sites are able to deploy their space reporting JSONs. First tests of root interaction with CTA are promising, though, so we can at least drop the usage of SRM for CTA tape sites. For dCache tape sites, I'm quite certain that no other interaction protocol is available, neither in FTS nor in GFAL. We will still need gsiftp-enabled sites to transfer to SRM-only destinations (dCache tape sites). PanDA has no dependency on any particular protocol, but instead depends on Rucio to resolve the protocols correctly.

We need to ask sites to upgrade. As long as there is gsiftp enabled we are fine.

Johannes:
- gsiftp should work everywhere
- other protocols currently have issues in the DOMA TPC compatibility matrix
- staging through Xrootd currently is not supported by the FTS
- the JSON file location should be standardized

Maarten:
- ALICE uses Xrootd for staging. The Xrootd code is not specific to ALICE.

Julia:
- We did not insist on a standard location for the JSON file (SRR), since we thought sites might prefer to have flexibility in this respect. In any case, experiments would need to know which sites have already enabled SRR. This flag will be published in CRIC and GOCDB along with the SRR file location.
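
For illustration, a minimal sketch of how a consumer might read such a space reporting JSON. The field names (`storageservice`, `storageshares`, `totalsize`, `usedsize`) are assumed from the WLCG storage resource reporting proposal and are not quoted from these minutes; the host name, share names and sizes are hypothetical.

```python
import json

# Hypothetical SRR document. Field names are an assumption based on the
# WLCG storage resource reporting proposal; all values are made up.
EXAMPLE_SRR = """
{
  "storageservice": {
    "name": "se.example.org",
    "storageshares": [
      {"name": "ATLASDATADISK", "path": ["/atlas/datadisk"],
       "vos": ["atlas"], "totalsize": 1000000000000, "usedsize": 750000000000},
      {"name": "ATLASSCRATCHDISK", "path": ["/atlas/scratchdisk"],
       "vos": ["atlas"], "totalsize": 200000000000, "usedsize": 50000000000}
    ]
  }
}
"""


def summarize_shares(srr_text):
    """Return {share name: (used bytes, total bytes, free bytes)}."""
    doc = json.loads(srr_text)
    shares = doc.get("storageservice", {}).get("storageshares", [])
    summary = {}
    for share in shares:
        total = int(share.get("totalsize", 0))
        used = int(share.get("usedsize", 0))
        summary[share["name"]] = (used, total, total - used)
    return summary


if __name__ == "__main__":
    for name, (used, total, free) in summarize_shares(EXAMPLE_SRR).items():
        print(f"{name}: used {used} of {total} bytes, {free} free")
```

In practice the file would be fetched from the per-site location published in CRIC or GOCDB rather than embedded as a string.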

CMS

CMS uses the gsiftp subset of SRM, i.e. it requires a gsiftp, gridftp, or SRM endpoint at each site (except tape endpoints).

LHCb

Which SRM functionality are you currently using?
- staging, transfers, space accounting
Can this functionality be provided via other means?
- staging: not at the moment
- transfers: yes, to some extent (issues with dCache sites due to space tokens)
- space accounting: some sites provide JSON files with accounting info
If yes, are your data management and workload management frameworks ready for a switch?
- For most of it yes, under certain conditions. We must have
  - a single endpoint for gridftp
  - a single endpoint for xroot
  - a space accounting report available via json
If not, how do you plan to deal with EOS and DPM sites in the near future? Note: the latest DPM versions have the SRM only as an optional service with minimal support.
- If a site does not provide the items listed above, we can't use it.
Any other comments, clarifications are welcome
- We do not have any site with DPM where we need staging
- Most of our DPM sites (T2-D) have updated or will update within 2 months
- ~1/2 of the sites plan to provide SRM as long as possible (barring security issues and bitrot)

Discussion

Maarten:
- mind there may be a dependency on the SRM client tools for quite a while still
  - they will need to remain supported
- we will work with EGI Ops to get other VOs to move away from SRM dependencies
  - the data management client suite supports other protocols for everyone,
    not just for the LHC experiments

Alessandro P:
- we will follow up with VOs

Julia:
- WLCG Ops were mostly interested in understanding the situation with SRM and disk storage, like DPM. We are less concerned about tape, since dCache and StoRM are not planning to stop supporting SRM.

Johannes:
- CERN's new tape solution is being tested by ATLAS.

CREAM migration task force

Twiki page

Julia:
- Membership list currently contains people who had confirmed their participation. We still encourage more people to volunteer.
- The results of the ongoing site survey will help us decide where to focus our efforts

Maarten:
- A mailing list will be set up etc.

Middleware News

Useful Links
Baselines/News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

Normal activity levels on average.
No major issues.

ATLAS

Smooth Grid production over the last weeks with ~300k concurrently running grid job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, and a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~100k concurrently running job slots and ~15k jobs from Boinc. In its current configuration the HLT farm/Sim@P1 adds another ~96k job slots.
Commissioning of a new PanDA worker node pilot version is ongoing. We are continuing to slowly roll out the new version to the sites.
Started a 2nd round of a data carousel test this week.
On-going discussions with the CTA team about how to best use the system.

CMS

smooth running, compute systems busy between 200k and 250k cores
- usual production/analysis mix, i.e. about 50k cores used by analysis
reduced overall capacity due to HLT farm being off after UPS fire at P5
processing of parked B physics data started
heavy ion re-reconstruction in progress, about half done
Monte Carlo generation for re-reprocessing (ultra-legacy) of 2017 configuration started
tape deletion campaign in progress, about 20 PBytes
issue with CMS database service, DBS, understood and resolved by downgrading an external product

LHCb

Smooth running at ~100K jobs. Usual activity: user jobs, MC productions, and WG productions
Poor transfer efficiency from CERN WNs to outside storage (GGUS:141112)

Task Forces and Working Groups

GDPR and WLCG services

Updated list of services

Accounting TF

There was an issue in the T2 accounting reports generated from April onwards: all pledge-related columns were empty. Fixed by Ivan.
There is a long-standing problem with INFN-Roma1-CMS: an erroneous, very high usage value distorts all T2 CPU accounting. APEL experts reported it as fixed, but the portal still shows wrong numbers. Will be followed up.

Archival Storage WG

DPM Upgrade Task Force

The second wave of monitored upgrades to the current production version of DPM (1.12) was announced in May, and quite a few sites have joined it: IN2P3-CPPM, AUVERGRID, INFN-COSENZA, BUDAPEST, GLASGOW, UNIBE-LHEP [moved from the previous wave], IN2P3-LPC. It is worth noting that some sites upgraded with very little support needed, if any (e.g. BUDAPEST). INFN-COSENZA had some setup issues, probably linked to the puppet templates in 1.12 not correctly handling passwords containing special characters like '&'. To the best of our knowledge this small bug is not considered critical; it was fixed some time ago in the current development branch, and the fix will be released with 1.13, which will likely be completed a few weeks after the DPM workshop (13-14th of June).

It is worth mentioning that the management of CERN-IT and EGI have agreed to postpone the deadline for the security support of the DPM legacy components to the end of September 2019. Regular support for those components ended on the 1st of June, and sites seeking it will be advised to upgrade their installation and enable the DOME flavour of the setup.

Discussion

Johannes:
- ATLAS still see deletion issues at DPM sites
Fabrizio:
- many of those issues were due to problems unrelated to the MW
  - certificates, dying HW, ...
- the latest DPM versions are more robust and scalable than older versions
Julia:
- we may soon push all remaining DPM sites to look into upgrading
- let's see after the DPM workshop (June 13-14)
Renato:
- CBPF plans to upgrade next week

Information System Evolution TF

WLCG CRIC demo at the coming GDB

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG

Squid Monitoring and HTTP Proxy Discovery TFs

Traceability WG

Container WG

Action list

Creation date	Description	Responsible	Status	Comments

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

Specific actions for sites

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

AOB

-- JuliaAndreeva - 2019-05-27

Topic revision: r19 - 2019-06-17 - MaartenLitmaath
