CVMFS Sync Meeting Minutes
Attendees: Fabrizio, Hugo, Maria
1. Critical Priorities and Smooth Operations
Decision: Backup and Escalation Process
- Action: Fabrizio will create a document describing critical alarms and response procedures for backup purposes
- Primary objective: Ensure smooth operations of CVMFS service
- Backup arrangement: Hugo will serve as backup while Fabrizio is on holiday
2. Strategic Initiatives
The meeting reviewed three key topics previously collected in a planning document created by Hugo (https://docs.google.com/document/d/1l11stJRKtQw7dCEszY-T6Cgaf_piSR9HtDRyOoMXbwk/edit?usp=sharing). Responsibilities were assigned as follows:
A. Publisher Nodes Harmonization (Phase 1)
Owner: Fabrizio (in coordination with SFT and R&D)
Context:
- 72 publisher nodes enable individual communities to build and publish software into S3
- Current requirements: community-specific key pairs, CVMFS server packages, S3 credentials
- Strategic objective: Repatriate functionality under IT/CD ownership
Approach:
- Standardize software environment across all publisher nodes with uniform package deployment
- Model similar to lxplus (some communities may have access to software they don't strictly require)
- Key goal: Remove dependency on CephFS
- Most communities operate on a single node and don't require shared filesystem access
- CephFS dependency appears to have been introduced for convenience rather than necessity
Assessment: No fundamental technical constraints prevent hosting these services elsewhere
B. Bypass Local ZFS Pools on S1 Backends
Owner: Fabrizio
Current State:
- S1 backends maintain local ZFS pools populated by copying data from S0 (via HTTP GET/HEAD)
- Several repositories (e.g., unpacked, lhcbdev) already served directly from S0
Planned Changes:
- Bypass ZFS pools to simplify system architecture
- Expected benefits: Reduced access latency by avoiding redundant data copies
- Some monitoring components may need adjustment (scripts assume locally mounted data)
Risk Assessment:
- Primary risk: Reduced resilience during S3 outages (uncached data becomes temporarily unavailable)
- Mitigation: Backup machines maintain full S0 copies; business continuity (BC) can be restored by redirecting the S1 backend aliases from S0 to the backup system
- Strategic recommendation: Backup should ultimately be implemented in the S3 service if aligned with group strategy
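The fallback described under Risk Assessment can be sketched as an ordered list of origins: the S0 origin first, the backup system second. This is only an illustration under stated assumptions. In the minutes the switch is made by redirecting aliases, not by application code, and the host names and the injected `fetch` callable below are hypothetical. The `data/<two hex characters>/<remainder>` layout is the usual CVMFS content-addressed object layout.

```python
from typing import Callable, Optional, Sequence, Tuple


def object_path(content_hash: str, suffix: str = "") -> str:
    """Relative path of a content-addressed object: data/<2 hex chars>/<rest><suffix>."""
    if len(content_hash) < 3:
        raise ValueError("content hash too short")
    return f"data/{content_hash[:2]}/{content_hash[2:]}{suffix}"


def fetch_with_fallback(
    path: str,
    origins: Sequence[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Tuple[str, bytes]:
    """Try each origin in order and return (origin, payload) for the first hit.

    `fetch` is a hypothetical callable returning None when the object
    cannot be retrieved (for example during an S3 outage).
    """
    for origin in origins:
        payload = fetch(f"{origin.rstrip('/')}/{path}")
        if payload is not None:
            return origin, payload
    raise LookupError(f"{path} unavailable on all origins")
```

Passing the transport in as `fetch` keeps the ordering logic testable without network access; a real deployment would rely on the alias redirect instead.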
C. Definition and Governance of unpacked.cern.ch
Owner: Hugo (in addition to backup responsibilities to ensure smooth operations)
Current Situation:
- Storage impact: ~90% of total CVMFS storage volume
  - 1.1 PB logical
  - 100 TB in S3
  - 30 billion objects
- Usage: Only ~6% of CERN-domain requests (based on one week of data)
- Interpretation: Primarily used as archival repository rather than active software distribution
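The storage figures above imply two derived quantities that are useful when reading the rest of this section: the ratio between logical and stored volume, and the average size of a stored object. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and treating the quoted values as exact:

```python
# Figures quoted in the minutes; decimal units are an assumption.
LOGICAL_BYTES = 1_100 * 10**12   # 1.1 PB logical
STORED_BYTES = 100 * 10**12      # 100 TB in S3
OBJECT_COUNT = 30 * 10**9        # 30 billion objects


def summarize(logical_bytes: int, stored_bytes: int, object_count: int) -> dict:
    """Derive the volume reduction factor and the mean stored object size."""
    return {
        "reduction_factor": logical_bytes / stored_bytes,
        "avg_stored_object_bytes": stored_bytes / object_count,
    }


if __name__ == "__main__":
    for key, value in summarize(LOGICAL_BYTES, STORED_BYTES, OBJECT_COUNT).items():
        print(f"{key}: {value:,.1f}")
```

With these inputs the logical volume is 11 times the stored volume and the average stored object is roughly 3.3 kB, so the object count, not the byte volume, dominates the load on S3.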
Key Issues:
- Inefficient Pruning:
  - Unused container images are not pruned efficiently
  - Deletion latencies >24 hours (observed by Valentin)
  - Improvements planned as part of the 2026 objectives
  - Challenge is primarily policy-driven rather than technical
- Lack of Visibility:
  - Operational logs expose only opaque content hashes
  - No direct linkage to originating container image or community
  - Current usage cannot be attributed to specific users or experiments
- Scalability Concerns:
  - Compressed S3 footprint relatively modest (~100 TB)
  - Number of objects (30 billion) presents scalability challenges for S3
  - Planned mitigations (removing bucket indexes) address symptoms but not root cause
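To make the visibility issue concrete, the sketch below shows what can be recovered from a request path alone: the repository name and an opaque content hash, which is enough to compute a per-repository request share but not to name an image or community. It assumes request paths of the form `/cvmfs/<repository>/data/<two hex characters>/<remainder>`; the input list and the function name are hypothetical, and real access logs would need their own parsing.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Repository name and content hash are the only identifiers in the path.
OBJECT_URL = re.compile(
    r"/cvmfs/(?P<repo>[^/]+)/data/(?P<prefix>[0-9a-f]{2})/(?P<rest>[0-9a-f]+)(?P<suffix>[A-Z]?)"
)


def request_share(paths: Iterable[str]) -> Dict[str, float]:
    """Fraction of object requests per repository; non-object paths are ignored."""
    counts: Counter = Counter()
    for path in paths:
        match = OBJECT_URL.fullmatch(path)
        if match:
            counts[match.group("repo")] += 1
    total = sum(counts.values())
    return {repo: n / total for repo, n in counts.items()} if total else {}
```

Nothing in the matched groups identifies the calling process or container image, which is the gap the telemetry work is meant to close.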
Strategic Objectives:
- Lifecycle Management Policy:
  - IT should define and enforce policies for CVMFS-hosted container images
  - Systematic removal of unused or obsolete artifacts (similar to Harbor registry)
  - Despite existing technical mechanisms, repository continues to grow exponentially
- Enhanced Telemetry:
  - SFT must provide enhanced telemetry from CVMFS FUSE clients
  - Enable identification of calling process, container image, and responsible community
  - Aligned with ongoing work: https://github.com/cvmfs/cvmfs/pull/3735
  - Deploy telemetry-enabled CVMFS FUSE clients on lxplus and lxbatch systems
- Usage Reduction:
  - Reduce usage to clearly identified and justified communities
  - Requires close coordination with SFT on software development and deployment
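One possible shape for the lifecycle management policy, once telemetry can attribute accesses to images, is a retention rule based on last access. The sketch below is purely illustrative: the `ImageRecord` fields, the 180-day default, and the rule that never-observed images are candidates are all assumptions, not decisions recorded in these minutes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical per-image record built from client telemetry."""
    name: str
    last_access: Optional[date]   # None means no access was ever observed
    community: Optional[str]      # None means no responsible community is known


def removal_candidates(
    images: Iterable[ImageRecord],
    today: date,
    retention_days: int = 180,    # assumed value, to be set by policy
) -> List[str]:
    """Names of images with no observed access within the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        img.name
        for img in images
        if img.last_access is None or img.last_access < cutoff
    ]
```

The output is a list of candidates for review rather than an automatic deletion list, since the minutes describe the challenge as policy-driven.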
3. PDC Backup Strategy
Decision: Selective Backup Approach
- Exclusions from backup:
  - unpacked.cern.ch
  - sft-nightlies
Rationale:
- Valentin confirmed agreement with this approach
- Reduces the storage required for backups
- Excluded repositories can be recreated when needed
Action: Document this decision formally
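When this decision is documented, the exclusion list could also be recorded in a machine-readable form so that backup tooling and documentation stay in step. A minimal sketch, assuming repositories are identified by the names used in these minutes; the function name and the sample inventory are hypothetical:

```python
from typing import Iterable, List

# Repositories excluded from the PDC backup, as named in the minutes.
EXCLUDED_FROM_BACKUP = frozenset({"unpacked.cern.ch", "sft-nightlies"})


def repositories_to_back_up(all_repositories: Iterable[str]) -> List[str]:
    """Return the repositories that remain in scope for backup, in input order."""
    return [repo for repo in all_repositories if repo not in EXCLUDED_FROM_BACKUP]
```

Matching is by exact name, so the set would need adjusting if the backup inventory uses fully qualified repository names throughout.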
Action Items Summary
| Owner | Action | Priority |
|-------|--------|----------|
| Fabrizio | Create critical alarms documentation for backup procedures | High |
| Fabrizio | Lead publisher nodes harmonization (Phase 1) with SFT/R&D | Medium |
| Fabrizio | Implement bypass of local ZFS pools on S1 backends | Medium |
| Hugo | Serve as backup for Fabrizio during holidays | High |
| Hugo | Lead governance and definition work for unpacked.cern.ch | Medium |
| Hugo | Ensure smooth operations of CVMFS service | High |
| SFT | Provide enhanced telemetry from CVMFS FUSE clients | Medium |
| Fabrizio/Valentin | Document PDC backup exclusions (unpacked, sft-nightlies) | High |
Next Steps
- Fabrizio to circulate critical alarms document for review
- Coordination meeting with SFT/R&D on publisher nodes harmonization timeline
- Hugo to assess current unpacked.cern.ch usage patterns and coordinate with SFT on telemetry deployment
- Formal documentation of PDC backup strategy decisions
Minutes compiled from meeting discussion highlights