WLCG Operations Coordination Minutes, Sep 3, 2020

Highlights

Agenda

https://indico.cern.ch/event/950710/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra (ATLAS + Manchester), Andrew (TRIUMF), Borja (monitoring), Christoph (CMS), David B (IN2P3-CC), David C (Technion), Dimitrios (ATLAS + WLCG), Eric C (storage), Eric F (IN2P3), Eric G (CERN-IT-DB), Giuseppe (CMS), Johannes (ATLAS), Julia (WLCG), Luca (storage), Maarten (ALICE + WLCG), Matt (Lancaster), Nikolay (monitoring), Panos (WLCG), Pedro (monitoring), Pepe (PIC), Petr (ATLAS + Prague), Stephan (CMS), Thomas (DESY), Tim (CERN-IT-CDA), Vladimir (LHCb)
  • apologies:

Operations News

  • the next meeting is planned for Oct 1st

Special topics

SRR deployment

see the presentation

Discussion

  • Alessandra: why don't dCache sites deploy the SRR?
  • Julia:
    • SRR does get deployed as part of the release,
      but at many sites the collected info is not useful yet
    • when the prototype at DESY looks good,
      we will follow up with the other sites
  • Christoph:
    • dCache can be configured in many different ways
    • the current SRR component only works for some of them
    • we are trying to find an SRR configuration that works out of the box at most sites

  • Alessandra: should we push for the SRR publisher cron job to be supported natively?
  • Julia: some storage MW providers prefer it to remain an external component
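
For context, the SRR is a JSON document that the publisher (e.g. the cron job mentioned above) writes for consumption by WLCG tools. Below is a minimal sketch of its general shape; the field names follow our reading of the WLCG SRR schema (an assumption) and all values are invented:

```python
import json
import time

# Sketch only: field names follow our reading of the WLCG SRR schema
# (an assumption) and all values are invented for illustration.
srr = {
    "storageservice": {
        "name": "dcache.example.org",
        "implementation": "dCache",
        "implementationversion": "5.2.0",
        "latestupdate": int(time.time()),  # epoch seconds
        "storageshares": [
            {
                "name": "atlas-disk",
                "totalsize": 100 * 10**12,  # bytes
                "usedsize": 62 * 10**12,    # bytes
                "vos": ["atlas"],
            }
        ],
    }
}

print(json.dumps(srr, indent=2))
```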

WLCG Critical Services proposal - next step

Reminder: a proposal to modernize and simplify the WLCG list of critical services was presented and discussed in the Ops Coordination meeting on June 4 and its implementation was agreed in the meeting on July 2.

To facilitate the process, 4 template pages have been created, 1 per experiment.

Please fill out the urgency and impact columns for services that are relevant to your experiment and add services that are missing. The criticality column holds the product of the provided values and is calculated automatically.

Please do not add back very generic services like e-mail, web or SSO: we know they are critical for essentially everybody, and we do not want to dilute the tables with information that does not really help.

Please do not remove services that are irrelevant to your experiment, because the tables provided here will be merged with those of the other experiments and such services may be relevant to them.

We will follow up per experiment and merge the tables from the 4 templates into combined tables on a new version of the Critical Services page. Those tables will also have 2 extra columns showing per service:

  • the maximum criticality among the 4 experiments
  • the sum of the criticalities of the 4 experiments

All criticality columns will be numerically sortable and each value will be displayed on a background whose color indicates the range in which that value falls. We currently envision 3 such ranges: top, high, moderate.
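
For illustration, here is a minimal sketch of the intended arithmetic; the score scales, example values and range thresholds below are made up, not the agreed ones:

```python
# Sketch only: the score scales, example values and range thresholds
# below are assumptions, not the values agreed in the meeting.

def criticality(urgency: int, impact: int) -> int:
    """Criticality is the product of the urgency and impact scores."""
    return urgency * impact

# Hypothetical per-experiment scores for one service.
scores = {
    "ALICE": {"urgency": 2, "impact": 3},
    "ATLAS": {"urgency": 3, "impact": 3},
    "CMS":   {"urgency": 3, "impact": 2},
    "LHCb":  {"urgency": 1, "impact": 2},
}

crits = {exp: criticality(s["urgency"], s["impact"]) for exp, s in scores.items()}

# The two derived columns of the merged table:
max_crit = max(crits.values())  # maximum criticality among the 4 experiments
sum_crit = sum(crits.values())  # sum of the criticalities of the 4 experiments

def display_range(value: int) -> str:
    """Map a criticality value to a colour range (thresholds made up)."""
    if value >= 9:
        return "top"
    if value >= 4:
        return "high"
    return "moderate"

print(crits, max_crit, sum_crit, display_range(max_crit))
```

Sorting the merged table by the maximum highlights services no single experiment can do without, while the sum highlights broadly shared dependencies.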

We will then present and discuss the overall results in the next Ops Coordination meeting.

Discussion

  • there were no objections

SAM migration status

see the presentation

Discussion

  • Julia: does the proposed timeline look OK?
  • there were no objections

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly normal to high activity levels
  • No major problems for ALICE workflows
    • Several big T2 sites experienced power or cooling issues
      • GRIF_IRFU down between Aug 18 and Sep 2

ATLAS

  • Stable Grid production with up to ~420k concurrently running job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~70-90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from BOINC. Occasional additional peaks of ~100k job slots from HPCs.
  • Ramped down job slots used for Folding@Home jobs to ~15k on average. Planning to ramp down to a few hundred by the end of September.
  • Finishing the 2nd DRAW_RPVLL reprocessing of the full Run 2 data, started on 16 July, using Data Carousel mode and iDDS; no problems.
  • No other major operational issues apart from the usual storage or transfer-related problems at sites.
  • TPC migration: slowly moving ready dCache and DPM sites to HTTPS TPC in production. Status on Aug 20: 7 dCache and 3 DPM sites (4 Tier-1, 5 Tier-2, 1 Tier-3).

Discussion

  • Julia: do you track somewhere which sites can use TPC in production?
  • Johannes, Petr: there is a JIRA ticket
  • Julia: the FTS team asked whether that information could be put in CRIC
  • Petr: if CRIC already contains the list of protocols per site, then the preferred protocol per site for TPC can be defined
  • Julia: yes, protocols are in CRIC and there is a possibility to prioritize them. She will look into the JIRA ticket and discuss with the CRIC team how to model the necessary info there.
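
To illustrate what Petr describes, here is a small sketch of how a preferred TPC protocol could be derived from a CRIC-like list of prioritized protocols per site; the data layout and field names are hypothetical, not the actual CRIC schema:

```python
# Hypothetical data layout: a CRIC-like list of protocols per site with
# a priority (lower number = preferred). NOT the actual CRIC schema.
site_protocols = {
    "EXAMPLE-T2": [
        {"flavour": "HTTPS",  "endpoint": "davs://se.example.org:443",  "priority": 1},
        {"flavour": "XROOTD", "endpoint": "root://se.example.org:1094", "priority": 2},
    ],
}

def preferred_tpc_protocol(site: str):
    """Pick the highest-priority protocol declared for TPC at a site."""
    protocols = site_protocols.get(site, [])
    return min(protocols, key=lambda p: p["priority"], default=None)

print(preferred_tpc_protocol("EXAMPLE-T2"))
```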

CMS

  • running smoothly at around 250k cores
    • usual production/analysis split of 4:1
  • nevertheless significant production queue
    • main processing activities:
      • Run 2 ultra-legacy Monte Carlo
      • Run 2 pre-UL Monte Carlo
    • investigating under-utilization of some sites
  • tape data deletion campaign in its final stage
    • first sites already recycled/repacked tapes
  • migration to Rucio ongoing
    • migration of nanoAOD samples to Rucio complete
    • large successful test with CTA last week
  • EOS space at CERN tight due to special samples
  • end of CMS CREAM-CE support reached
    • 1/16/4 Tier-1/2/3 sites with CREAM-CE(s) remaining
    • factory upgrade over the next several weeks
  • release of old SL6 VMs ongoing; expect to make the Sep 30 deadline
  • working with CERN MonIT team on SAM3 to SiteMon migration

LHCb

  • smooth running,
    • mostly (90%) MC simulation (fast and full G4)
    • "legacy" stripping cycles of Run1+Run2 data (both pp and ion collisions) ended
  • recovered ~15k slots from the HLT farm that were given to Folding@Home
  • currently ~120k slots in total
  • CASTOR-->CTA migration plans being defined.
    • Need to get rid of SRM for T0-->T1 tape transfers; discussions with the relevant people have started.

Discussion

  • Julia:
    • Xrootd for TPC does not yet work at all the relevant sites
    • HTTP could work instead
    • this is being followed up by the experts involved

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • The new accounting workflow, which involves monthly data validation by site admins and report generation by CRIC using the validated metrics, works well.
  • More and more sites enable the SRR, and the WLCG Storage Space Accounting (WSSA) system switches to the SRR as soon as it is enabled.

Archival Storage WG

Containers WG

CREAM migration TF

Details here

On Sep 1 reminders were sent to all sites that had not yet displayed evidence of setting up alternative solutions.

Summary:

  • 90 tickets
  • 23 done: 11 ARC, 12 HTCondor
  • 16 sites plan for ARC, 14 are considering it
  • 21 sites plan for HTCondor, 14 are considering it, 8 consider using SIMPLE
  • 3 tickets without reply

Discussion

  • Johannes: what are the expectations for sites to make the deadline?
  • Maarten:
    • as with most deployment campaigns, there will be a tail stretching beyond the deadline
    • however, it looks fairly certain that at least the vast majority of the capacity will be fine

dCache upgrade TF

  • Out of the 43 dCache sites used by the LHC experiments, 40 have been successfully upgraded to version 5.2 or higher. Only 2 sites still need to upgrade; the third remaining site is going to migrate away from dCache. However, the majority of the sites have a problem with the SRR: the storage shares section is empty. A solution has been prototyped at DESY; after validation we will need to deploy it on all dCache sites (see the sketch below).
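
As an illustration of the symptom, a minimal sketch that flags an SRR document whose storage shares section is empty or unusable; the field names follow our reading of the WLCG SRR schema (an assumption) and the path is invented:

```python
import json

def has_usable_shares(path: str) -> bool:
    """Return True if the SRR file at `path` declares at least one
    storage share with a positive total size (field names assumed)."""
    with open(path) as f:
        doc = json.load(f)
    shares = doc.get("storageservice", {}).get("storageshares", [])
    return any(s.get("totalsize", 0) > 0 for s in shares)

# Hypothetical usage:
# print(has_usable_shares("/var/spool/srr/storage.json"))
```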

DPM upgrade TF

  • DPM version 1.14 has been released. Many issues affecting third-party copy have been fixed in this version. The DPM upgrade TF is starting a new upgrade cycle, which includes upgrading to version 1.14 and enabling macaroons (see the sketch below).
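
For context, macaroons are bearer tokens requested over HTTPS. A minimal sketch of such a request, assuming the macaroon-request convention used by dCache and, to our understanding, also adopted by DPM; the URL, path, caveats and validity are invented:

```python
import requests  # third-party: pip install requests

# Hypothetical endpoint, path, caveats and validity; the
# "application/macaroon-request" convention is our understanding of the
# dCache/DPM behaviour, not a verified API contract.
resp = requests.post(
    "https://dpm.example.org/dpm/example.org/home/atlas/file.root",
    headers={"Content-Type": "application/macaroon-request"},
    json={"caveats": ["activity:DOWNLOAD"], "validity": "PT1H"},
    cert=("/path/to/usercert.pem", "/path/to/userkey.pem"),  # X.509 client auth
)
resp.raise_for_status()
token = resp.json()["macaroon"]  # response key assumed; bearer token for TPC
```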

StoRM upgrade TF

  • StoRM version 1.11.18 has been released. Many issues affecting third-party copy have been fixed in this version. The StoRM upgrade TF is starting a new upgrade cycle.

Information System Evolution TF

  • Network information (site IP subnets) required by the NOTED project has been enabled in WLCG-CRIC. The possibility to use WLCG-CRIC as a repository of network topology information for the WLCG infrastructure was discussed with Edoardo Martelli; it would not be limited to the LHC experiments, but could also be used by other collaborations like protoDUNE, XENON, Belle II and soon JUNO.
  • The AGIS to ATLAS CRIC migration is ongoing and has successfully passed the 3rd step, migrating the AGIS API to the CRIC API. Some ATLAS production services have been switched to CRIC in production. The last remaining step of the migration (moving the AGIS WebUI into CRIC) is currently in the implementation phase. During the migration, new feature requests from ATLAS have been implemented in ATLAS-CRIC.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

MW Readiness WG

Network Throughput WG


Traceability WG

Action list

| Creation date | Description | Responsible | Status | Comments |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

AOB
