WLCG Operations Coordination Minutes, June 6, 2019

Highlights

  • Many thanks to the sites that have answered the site survey. Those that have not yet done so: PLEASE do it before June 15.

Agenda

https://indico.cern.ch/event/823800/

Attendance

  • local: Fabrizio (DPM devs), Federico (LHCb), Julia (WLCG), Konrad (LHCb), Maarten (ALICE + WLCG)
  • remote: Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alessandro P (EGI), Baptiste (EGI), Catherine (LPSC + IN2P3), Christoph (CMS), Dave (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Johannes (ATLAS), Matt (EGI), Renato (CBPF), Ron (NLT1), Sang-Un (KISTI), Ste (Liverpool), Stephan (CMS), Vladimir (LHCb)
  • apologies:

Operations News

  • the next meeting is planned for July 4
    • US sites obviously are excused :-)

Special topics

VO input regarding status of migration of sites to CentOS 7

  • Do your experiment operations follow the migration of sites to CentOS/EL 7?
  • If yes, what is the status?
  • If not, does it mean that your workflows are not so concerned because they are shielded?
  • By using containers?
  • Any other comments, clarifications are welcome

ALICE

  • ALICE is following the migration together with the centres; the software on CVMFS is fully compatible.
  • Containers (via Singularity) are a big part of our future operations strategy (later this year or next year).

ATLAS

  • Alessandra F:
    • CentOS 7 highly desirable:
      • avoid Python version issues
      • benefit from better container support
      • user SW compiled on CentOS 7 has to run on that OS
    • T1 done, T2 60% done

CMS

  • CMS follows CentOS 7 upgrades at sites only as it relates to reduced computing capacity/CE downtime. We use Singularity containers, i.e. run SL 6 and CentOS 7 jobs as needed regardless of the base OS.

  • Stephan:
    • CMS uses Singularity both for production and analysis.
    • In case of high load on CVMFS, CMS has experienced issues on some SL6 sites, while CentOS 7 sites work well under similar load.

LHCb

  • We are following the migration status
  • All our major sites have already migrated to centos7-compatible systems and we use centos7 there
  • In addition, we need a relatively minor amount of slc5-compatible (slc6 with compatibility layer) resources in order to produce MC with legacy reconstruction version which was used for Run1 data
  • We are finalizing a container solution based on Singularity for legacy productions
  • We are considering using a containerized workflow by default

  • Federico:
    • In general, migration status for LHCb is not very important, since Dirac sends jobs to sites with a particular OS version based on job requirements
    • Singularity particularly needed to run SLC5 code on CentOS 7
    • first usage expected in 1 month
    • the usage will not comply with isolation requirements
      • Maarten: that is a matter for the Traceability WG

  • Can Singularity run inside Singularity?
    • Should work for v3.x on EL7, at least when certain reasonable options are used.
      • Will be even better on EL8 thanks to its newer kernel.
    • Singularity has always run fine inside Docker.

Discussion

  • Catherine:
    • site admins in France would like to restrict the Singularity configuration
      as much as possible, e.g. by owner of container image, path, ...
    • the default configuration works, but is deemed not good enough
    • we would like to have a combined configuration serving both ATLAS and CMS,
      expecting it will also be fine for ALICE and LHCb

  • Alessandra F:
    • though many ATLAS workflows are similar to those in CMS, user workflows
      are done differently
    • we also need special treatment of certain sites, e.g. concerning mount points

  • Maarten:
    • concerned admins should join the Containers WG where we try to devise a
      common configuration serving the 4 experiments
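
As a rough illustration of the restrictions Catherine mentions (by owner of container image, by path), the sketch below scans the text of a singularity.conf for the "limit container owners", "limit container groups" and "limit container paths" directives and reports which of them are left unrestricted. The helper names, the sample values and the assumption that multiple values are comma-separated are illustrative only; this is not a configuration agreed by any WG.

```python
# Hypothetical helper: report which "limit container ..." directives in a
# singularity.conf text are left at "null", i.e. not restricting anything.
LIMIT_KEYS = (
    "limit container owners",
    "limit container groups",
    "limit container paths",
)


def parse_limits(conf_text):
    """Return {directive: [values]}; an empty list means 'not restricted'."""
    limits = {key: [] for key in LIMIT_KEYS}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not "key = value".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().lower()
        if key in limits:
            value = value.strip()
            # "null" is treated as "no restriction" (assumed default).
            if value.lower() != "null":
                limits[key] = [v.strip() for v in value.split(",") if v.strip()]
    return limits


def unrestricted(limits):
    """Names of the directives that do not restrict anything, sorted."""
    return sorted(key for key, values in limits.items() if not values)


# Sample fragment; the values are illustrative, not a recommendation.
EXAMPLE_CONF = """
limit container owners = root
limit container groups = null
limit container paths = /cvmfs
"""

if __name__ == "__main__":
    print(unrestricted(parse_limits(EXAMPLE_CONF)))
```

A site could run such a check over its deployed configuration to see at a glance which restrictions are still open before comparing notes with the experiments.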

VO input regarding SRM usage and dependencies

Since some of the popular storage solutions (DPM, EOS) are moving towards no-SRM, we would like to assess the situation in every LHC VO regarding dependencies on SRM.

  • Which SRM functionality are you currently using?
  • Can this functionality be provided via other means?
  • If yes, are your data management and workload management frameworks ready for a switch? Note: this most probably implies coexistence of SRM-enabled and no-SRM sites.
  • If not, how do you plan to deal with EOS and DPM sites in the near future? Note: the latest DPM versions have the SRM only as an optional service with minimal support.
  • Any other comments, clarifications are welcome

ALICE

  • SRM services are not used by ALICE.

ATLAS

  • Third party copy, space reporting, tape interaction.

  • Yes, and we do it already except for tape interaction.
  • Third-party-copy: Mostly gsiftp already, though gradually deploying alternative options using root and davs, within the context of the DOMA TPC activity.
  • Space reporting: We do not use SRM space tokens anymore; instead we use directories as discriminators for storage areas. Most sites deploy the space reporting JSON, which is consumed; some still use SRM to query the storage usage.

  • Right now we depend on SRM for tape interaction; we do not require SRM for anything else if sites are able to deploy their space reporting JSONs. First tests showing root interaction with CTA are promising though, so at least we can drop the usage of SRM for CTA tape sites. For dCache tape sites, I'm quite certain that there is no other interaction protocol available, neither in FTS nor in GFAL. We will still need gsiftp-enabled sites to transfer to SRM-only destinations (dCache tape sites). PanDA has no dependency on any particular protocol, but instead depends on Rucio to resolve the protocols correctly.

  • We need to ask sites to upgrade. As long as there is gsiftp enabled we are fine.

  • Johannes:
    • gsiftp should work everywhere
    • other protocols currently have issues in the DOMA TPC compatibility matrix
    • staging through Xrootd currently is not supported by the FTS
    • the JSON file location should be standardized

  • Maarten:
    • ALICE uses Xrootd for staging. The Xrootd code is not specific to ALICE.

  • Julia:
    • We did not insist on standard JSON file (SRR) location since we thought for sites it might be better to have flexibility in this respect. In any case experiments would need to know which sites have already enabled SRR. This flag will be published in CRIC and GOCDB along with the SRR file location.
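
To make the space reporting discussion above more concrete, here is a minimal sketch of how a consumer might read such a JSON file and compute the free space per storage area. The layout used here (a "storageservice" object with a list of "storageshares", each carrying "name", "path", "totalsize" and "usedsize" in bytes) is an assumption for illustration and should be checked against the actual SRR specification; the share names, paths and numbers are made up.

```python
import json

# Assumed layout, for illustration only: "storageservice" holds a list of
# "storageshares", each with "name", "totalsize" and "usedsize" in bytes.
SAMPLE_SRR = """
{
  "storageservice": {
    "name": "example-se",
    "storageshares": [
      {"name": "ATLASDATADISK", "path": ["/atlas/datadisk"],
       "totalsize": 1000, "usedsize": 750},
      {"name": "ATLASSCRATCHDISK", "path": ["/atlas/scratchdisk"],
       "totalsize": 200, "usedsize": 50}
    ]
  }
}
"""


def free_space_by_share(srr_text):
    """Map each share name to its free space (total minus used)."""
    doc = json.loads(srr_text)
    shares = doc["storageservice"]["storageshares"]
    return {s["name"]: s["totalsize"] - s["usedsize"] for s in shares}


if __name__ == "__main__":
    for name, free in free_space_by_share(SAMPLE_SRR).items():
        print(name, free)
```

Because the file location is not standardized, a real consumer would first have to look up the per-site location, for example from the information published in CRIC and GOCDB as described by Julia.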

CMS

  • CMS uses the gsiftp subset of SRM, i.e. requires a gsiftp, gridftp, or SRM endpoint at each site (except tape endpoints)

LHCb

  • Which SRM functionality are you currently using?
    • staging, transfers, space accounting
  • Can this functionality be provided via other means?
    • staging: not at the moment
    • transfers: yes, to some extent (issues with dCache site due to space token)
    • space accounting: some sites provide a JSON file with accounting info
  • If yes, are your data management and workload management frameworks ready for a switch?
    • For most of it yes, under certain conditions. We must have
      • a single endpoint for gridftp
      • a single endpoint for xroot
      • a space accounting report available via json
  • If not, how do you plan to deal with EOS and DPM sites in the near future? Note: the latest DPM versions have the SRM only as an optional service with minimal support.
    • If the sites do not provide the previous points, we cannot use them.
  • Any other comments, clarifications are welcome
    • We do not have any site with DPM where we need staging
    • Most of our DPM sites (T2-D) have updated or will update within 2 months
    • ~1/2 sites plan to provide SRM as long as possible (barring security issues and bitrot)
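
The three LHCb conditions listed above (a single gridftp endpoint, a single xroot endpoint, a space accounting report via JSON) can be expressed as a simple check. The sketch below is hypothetical: the site record layout, the field names and the example URLs are invented for illustration and do not correspond to any DIRAC or CRIC data structure.

```python
def missing_for_no_srm(site):
    """List which of the three stated conditions a site record fails to meet."""
    problems = []
    # Exactly one endpoint is required for each protocol.
    if len(site.get("gridftp_endpoints", [])) != 1:
        problems.append("single gridftp endpoint")
    if len(site.get("xroot_endpoints", [])) != 1:
        problems.append("single xroot endpoint")
    # A non-empty URL stands for an available accounting report.
    if not site.get("accounting_json_url"):
        problems.append("space accounting JSON")
    return problems


# Hypothetical site records; hostnames and URLs are placeholders.
READY_SITE = {
    "gridftp_endpoints": ["gsiftp://se.example.org"],
    "xroot_endpoints": ["root://se.example.org"],
    "accounting_json_url": "https://se.example.org/report.json",
}
NOT_READY_SITE = {
    "gridftp_endpoints": ["gsiftp://a.example.org", "gsiftp://b.example.org"],
    "xroot_endpoints": ["root://a.example.org"],
}

if __name__ == "__main__":
    print(missing_for_no_srm(READY_SITE))
    print(missing_for_no_srm(NOT_READY_SITE))
```

An empty list means the site record satisfies all three conditions as modelled here.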

Discussion

  • Maarten:
    • mind there may be a dependency on the SRM client tools for quite a while still
      • they will need to remain supported
    • we will work with EGI Ops to get other VOs to move away from SRM dependencies
      • the data management client suite supports other protocols for everyone,
        not just for the LHC experiments

  • Alessandro P:
    • we will follow up with VOs

  • Julia:
    • WLCG Ops were mostly interested in understanding the situation with SRM and disk storage, like DPM. Less concerned about tape, since dCache and StoRM are not planning to stop support of SRM.

  • Johannes:
    • The new CERN tape solution is being tested by ATLAS.

CREAM migration task force

  • Julia:
    • The membership list currently contains the people who have confirmed their participation. We still encourage more people to volunteer.
    • The results of the ongoing site survey will help us decide where to focus our efforts

  • Maarten:
    • A mailing list will be set up etc.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average.
  • No major issues.

ATLAS

  • Smooth Grid production over the last weeks with ~300k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis and a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~100k concurrently running job slots and ~15k jobs from Boinc. In its current configuration the HLT farm/Sim@P1 adds a further ~96k job slots.
  • Commissioning of a new PanDA worker node pilot version on-going. We are continuing to slowly roll out the new version to the sites.
  • Started a 2nd round of a data carousel test this week.
  • On-going discussions with the CTA team about how to best use the system.

CMS

  • smooth running, compute systems busy between 200k and 250k cores
    • usual production/analysis mix, i.e. about 50k cores used by analysis
  • reduced overall capacity due to HLT farm being off after UPS fire at P5
  • processing of parked B physics data started
  • heavy ion re-reconstruction in progress, about half done
  • Monte Carlo generation for re-reprocessing (ultra-legacy) of 2017 configuration started
  • tape deletion campaign in progress, about 20 PBytes
  • issue with CMS database service, DBS, understood and resolved by downgrading an external product

LHCb

  • Smooth running at ~100K jobs. Usual activity: user jobs, MC productions, and WG productions
  • Poor transfer efficiency from CERN WN to outside storage (GGUS:141112)

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • There was an issue in the T2 accounting reports generated from the start of April: all pledge-related columns were empty. Fixed by Ivan.
  • There is a long-standing problem with INFN-Roma1-CMS: an erroneous, very high usage value distorts all T2 CPU accounting. Reported as fixed by the APEL experts, but the portal still shows wrong numbers. Will be followed up.

Archival Storage WG

DPM Upgrade Task Force

The second wave of monitored upgrades to the current production version of DPM (1.12) was announced in May, and quite a few sites have joined it: IN2P3-CPPM, AUVERGRID, INFN-COSENZA, BUDAPEST, GLASGOW, UNIBE-LHEP [moved from the previous wave], IN2P3-LPC. It is worth noting that some sites upgraded with very little support needed, if any (e.g. BUDAPEST). INFN-COSENZA had some setup issues, probably linked to the puppet templates in 1.12 not correctly handling passwords that contain special characters like '&'. To the best of our knowledge this small bug is not considered critical; it was fixed some time ago in the current development branch and the fix will be released with 1.13, which will likely be complete a few weeks after the DPM workshop (June 13-14).

It is also worth mentioning that the management of CERN-IT and EGI have agreed to postpone the deadline for security support of the DPM legacy components to the end of September 2019. Regular support for those components ended on June 1, and sites asking for it will be advised to upgrade their installation and enable the DOME flavour of the setup.

Discussion

  • Johannes:
    • ATLAS still see deletion issues at DPM sites
  • Fabrizio:
    • many of those issues were due to problems unrelated to the MW
      • certificates, dying HW, ...
    • the latest DPM versions are more robust and scalable than older versions
  • Julia:
    • we may soon push all remaining DPM sites to look into upgrading
    • let's see after the DPM workshop (June 13-14)
  • Renato:
    • CBPF plans to upgrade next week

Information System Evolution TF

  • WLCG CRIC demo at the coming GDB

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

Traceability WG

Container WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2019-05-27
Topic revision: r19 - 2019-06-17 - MaartenLitmaath
 