CVMFS Sync Meeting

Europe/Zurich
31/S-023 (CERN)

Participants
  • Fabrizio Furano
  • Hugo Gonzalez Labrador
  • Maria Arsuaga Rios
  • Valentin Volkl

CVMFS Meeting Minutes

Attendees: Fabrizio, Hugo, Maria


1. Critical Priorities and Smooth Operations

Decision: Backup and Escalation Process

  • Action: Fabrizio will create a document describing critical alarms and response procedures for backup purposes
  • Primary objective: Ensure smooth operations of CVMFS service
  • Backup arrangement: Hugo will serve as backup when Fabrizio is on holidays

2. Strategic Initiatives

The meeting reviewed three key topics previously collected in a planning document created by Hugo (https://docs.google.com/document/d/1l11stJRKtQw7dCEszY-T6Cgaf_piSR9HtDRyOoMXbwk/edit?usp=sharing). Responsibilities were assigned as follows:

A. Publisher Nodes Harmonization (Phase 1)

Owner: Fabrizio (in coordination with SFT and R&D)

Context:
- 72 publisher nodes enable individual communities to build and publish software into S3
- Current requirements: community-specific key pairs, CVMFS server packages, S3 credentials
- Strategic objective: Repatriate functionality under IT/CD ownership

Approach:
- Standardize software environment across all publisher nodes with uniform package deployment
- Model similar to lxplus (some communities may have access to software they don't strictly require)
- Key goal: Remove dependency on CephFS
- Most communities operate on a single node and don't require shared filesystem access
- CephFS dependency appears to have been introduced for convenience rather than necessity

Assessment: No fundamental technical constraints prevent hosting these services elsewhere


B. Bypass Local ZFS Pools on S1 Backends

Owner: Fabrizio

Current State:
- S1 backends maintain local ZFS pools populated by copying data from S0 (via HTTP GET/HEAD)
- Several repositories (e.g., unpacked, lhcbdev) already served directly from S0

Planned Changes:
- Bypass ZFS pools to simplify system architecture
- Expected benefits: Reduced access latency by avoiding redundant data copies
- Some monitoring components may need adjustment (scripts assume locally mounted data)

Risk Assessment:
- Primary risk: Reduced resilience during S3 outages (uncached data becomes temporarily unavailable)
- Mitigation: Backup machines maintain full S0 copies; business continuity (BC) can be restored by redirecting the S1 backend aliases from S0 to the backup system
- Strategic recommendation: Backup should ultimately be implemented in the S3 service if aligned with group strategy
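
The mitigation above amounts to a health-gated choice of upstream for each S1 backend. The following is a minimal sketch under stated assumptions, not the procedure agreed in the meeting: the host names are placeholders, `head_ok` and `choose_upstream` are hypothetical helpers, and the probe targets the repository manifest `.cvmfspublished` that CVMFS repositories expose over HTTP. In practice the switch would be made by redirecting the aliases; the code only illustrates the decision.

```python
import urllib.error
import urllib.request
from typing import Callable


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP HEAD request on `url` answers with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors and malformed URLs all count as "down".
        return False


def choose_upstream(
    repo: str,
    primary: str,
    backup: str,
    probe: Callable[[str], bool] = head_ok,
) -> str:
    """Pick the upstream an S1 backend should point at for `repo`.

    `primary` stands for the S0/S3 endpoint and `backup` for the backup
    machine holding a full S0 copy. Both are assumed base URLs.
    """
    manifest_url = f"{primary.rstrip('/')}/cvmfs/{repo}/.cvmfspublished"
    return primary if probe(manifest_url) else backup


if __name__ == "__main__":
    # Placeholder hosts; replace with real endpoints before use.
    print(choose_upstream(
        "unpacked.cern.ch",
        "http://s0.example.org",
        "http://backup.example.org",
    ))
```

Passing the probe as a parameter keeps the decision logic testable without network access.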


C. Definition and Governance of unpacked.cern.ch

Owner: Hugo (in addition to backup responsibilities to ensure smooth operations)

Current Situation:
- Storage impact: ~90% of total CVMFS storage volume
  - 1.1 PB logical
  - 100 TB in S3
  - 30 billion objects
- Usage: Only ~6% of CERN-domain requests (based on one week of data)
- Interpretation: Primarily used as archival repository rather than active software distribution
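
The figures above imply a very small mean object size, which is why the object count rather than the volume is the concern. A quick back-of-the-envelope check, assuming decimal units (1 TB = 10^12 bytes) and that the 100 TB and the 30 billion objects describe the same S3 content:

```python
# Figures quoted in the minutes, read as decimal units (assumption).
LOGICAL_BYTES = 1_100 * 10**12  # 1.1 PB logical
S3_BYTES = 100 * 10**12         # 100 TB stored in S3
OBJECTS = 30 * 10**9            # 30 billion objects


def mean_object_size(stored_bytes: int, objects: int) -> float:
    """Mean stored size per object, in bytes."""
    return stored_bytes / objects


def reduction_factor(logical_bytes: int, stored_bytes: int) -> float:
    """Ratio of logical volume to stored volume."""
    return logical_bytes / stored_bytes


if __name__ == "__main__":
    print(f"mean object size: {mean_object_size(S3_BYTES, OBJECTS) / 1e3:.1f} kB")
    print(f"logical / stored: {reduction_factor(LOGICAL_BYTES, S3_BYTES):.0f}x")
```

Under these assumptions the mean object is roughly 3.3 kB and the logical volume is about 11 times the stored volume.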

Key Issues:

  1. Inefficient Pruning:

    • Unused container images not pruned efficiently
    • Deletion latencies >24 hours (observed by Valentin)
    • Improvements planned for 2026 objectives
    • Challenge is primarily policy-driven rather than technical
  2. Lack of Visibility:

    • Operational logs expose only opaque content hashes
    • No direct linkage to originating container image or community
    • Current usage cannot be attributed to specific users or experiments
  3. Scalability Concerns:

    • Compressed S3 footprint relatively modest (~100 TB)
    • Number of objects (30 billion) presents scalability challenges for S3
    • Planned mitigations (removing bucket indexes) address symptoms but not root cause
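
To illustrate the visibility gap, the sketch below shows the kind of join that attribution would require: a lookup from content hash to originating image and community, applied to the hashes seen in the logs. No such index exists today according to the minutes. The index, the names, and the one-hash-to-one-image mapping are all hypothetical simplifications, since a content-addressed object can be shared by several images.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

UNKNOWN = "unattributed"

# Hypothetical index type: content hash -> (container image, community).
HashIndex = Dict[str, Tuple[str, str]]


def attribute_requests(requested_hashes: Iterable[str], hash_index: HashIndex) -> Counter:
    """Count requests per (image, community).

    Hashes missing from the index are counted under (UNKNOWN, UNKNOWN),
    which is where every request lands today.
    """
    counts: Counter = Counter()
    for content_hash in requested_hashes:
        counts[hash_index.get(content_hash, (UNKNOWN, UNKNOWN))] += 1
    return counts


if __name__ == "__main__":
    # Toy data for illustration only.
    index = {"aa11": ("registry.example.org/img-a:1.0", "community-x")}
    log_hashes = ["aa11", "aa11", "cc33"]
    for (image, community), hits in attribute_requests(log_hashes, index).most_common():
        print(f"{hits:3d}  {community:15s} {image}")
```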

Strategic Objectives:

  1. Lifecycle Management Policy:

    • IT should define and enforce policies for CVMFS-hosted container images
    • Systematic removal of unused or obsolete artifacts (similar to Harbor registry)
    • Despite existing technical mechanisms, repository continues to grow exponentially
  2. Enhanced Telemetry:

    • SFT must provide enhanced telemetry from CVMFS FUSE clients
    • Enable identification of calling process, container image, and responsible community
    • Aligned with ongoing work: https://github.com/cvmfs/cvmfs/pull/3735
    • Deploy telemetry-enabled CVMFS FUSE clients on lxplus and lxbatch systems
  3. Usage Reduction:

    • Reduce usage to clearly identified and justified communities
    • Requires close coordination with SFT on software development and deployment
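
A lifecycle policy of the kind described above could be expressed as a simple retention rule once telemetry provides last-access data. This is a sketch only: the `ImageRecord` fields, the 180-day threshold, and the rule itself are hypothetical and were not decided in the meeting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ImageRecord:
    """Hypothetical per-image metadata that telemetry would have to supply."""
    name: str
    community: Optional[str]          # None if no responsible community is known
    last_access: Optional[datetime]   # None if never observed in use


def prune_candidates(
    images: List[ImageRecord],
    now: datetime,
    max_idle_days: int = 180,  # illustrative threshold, not an agreed value
) -> List[str]:
    """Return names of images with no owner, no recorded use, or idle too long."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        image.name
        for image in images
        if image.community is None
        or image.last_access is None
        or image.last_access < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2026, 1, 1)
    images = [
        ImageRecord("img-active", "community-x", datetime(2025, 12, 1)),
        ImageRecord("img-stale", "community-x", datetime(2024, 1, 1)),
        ImageRecord("img-orphan", None, datetime(2025, 12, 1)),
    ]
    print(prune_candidates(images, now))
```

The rule only selects candidates; actual deletion would still go through whatever review the policy defines.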

3. PDC Backup Strategy

Decision: Selective Backup Approach

  • Exclusions from backup:
    • unpacked.cern.ch
    • sft-nightlies

Rationale:
- Valentin confirmed agreement with this approach
- Will optimize storage usage
- Excluded repositories can be recreated when needed

Action: Document this decision formally
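
One way to make the decision explicit and reviewable is to keep the exclusions as a single list that the backup job filters against. The sketch below is an assumption about how that could look, not the existing backup tooling. Repository names are written as they appear in these minutes and may need to be adjusted to the fully qualified names used in production.

```python
from typing import Iterable, List

# Repositories excluded from the PDC backup, as recorded in these minutes.
BACKUP_EXCLUDED = frozenset({"unpacked.cern.ch", "sft-nightlies"})


def repositories_to_back_up(repositories: Iterable[str]) -> List[str]:
    """Return the repositories that should be backed up, preserving input order."""
    return [repo for repo in repositories if repo not in BACKUP_EXCLUDED]


if __name__ == "__main__":
    # Example input; "lhcbdev" is used only because it is named earlier in the minutes.
    print(repositories_to_back_up(["unpacked.cern.ch", "lhcbdev", "sft-nightlies"]))
```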


Action Items Summary

| Owner | Action | Priority |
|-------|--------|----------|
| Fabrizio | Create critical alarms documentation for backup procedures | High |
| Fabrizio | Lead publisher nodes harmonization (Phase 1) with SFT/R&D | Medium |
| Fabrizio | Implement bypass of local ZFS pools on S1 backends | Medium |
| Hugo | Serve as backup for Fabrizio during holidays | High |
| Hugo | Lead governance and definition work for unpacked.cern.ch | Medium |
| Hugo | Ensure smooth operations of CVMFS service | High |
| SFT | Provide enhanced telemetry from CVMFS FUSE clients | Medium |
| Fabrizio/Valentin | Document PDC backup exclusions (unpacked, sft-nightlies) | High |


Next Steps

  1. Fabrizio to circulate critical alarms document for review
  2. Coordination meeting with SFT/R&D on publisher nodes harmonization timeline
  3. Hugo to assess current unpacked.cern.ch usage patterns and coordinate with SFT on telemetry deployment
  4. Formal documentation of PDC backup strategy decisions

Minutes compiled from meeting discussion highlights

    • 11:00 - 11:10
      Operational status & service overview 10m
      • Current CVMFS service health (repositories, stratum-0/1, monitoring)
      • Any incidents since last meeting
      • Short-term operational risks or alerts
    • 11:10 - 11:20
      Issues & interventions follow-up 10m
      • Review of open issues
      • Status of ongoing or planned interventions
      • Decisions needed / actions agreed

      Follow up:
      From Enrico: The Ceph team requests the return of a 300 TB RBD volume. I'd like to get it back asap, as the cluster hosting it is filling up and we do not have new capacity yet.
      There is no ticket, as this was mentioned in our weekly Monday meeting.

      My idea is to keep backups in PDC only. But you should check with Fabrizio what his strategy for backups is.
      What is essential for me is to get those 300 TB back.

    • 11:20 - 11:30
      Onboarding & knowledge transfer 10m
      • Onboarding progress for Hugo and Maria
      • Open questions or unclear areas
      • Documentation gaps or runbooks to improve
      • Identify topics needing deeper walkthroughs (future sessions)

    • 11:30 - 11:40
      Collaborator input – SFT / Valentin 10m
      • Updates from SFT side
      • Cross-team dependencies or changes impacting CVMFS
      • Technical discussions or requests for support
      • Upcoming developments worth tracking

    • 11:40 - 11:45
      Planning & priorities 5m
      • Short-term priorities until next sync
      • Upcoming milestones or expected changes
      • Topics to follow up

    • 11:45 - 11:50
      AOB 5m