Dops + Ddev

Europe/Zurich
2/R-014 (CERN)

Description

The monthly Dops meeting (Dirac(X) operations) will run just before the weekly Ddev (Dirac(X) developers) meeting. 

Zoom Meeting ID
62504856418
Host
Federico Stagni

Dops – 16/04/2026

At CERN: Federico, Christophe, Christopher, Alexandre, André, Yan
On Zoom: Andrei, Hideki, Xiaomei, Luisa, Heloise, Dhiraj, Loris, Daniela, Bertrand, Vladimir, Ryun, Natthan
Apologies:


This meeting is being recorded.

From now on, we plan to record every meeting.

Previous meetings + follow-ups


Communities issues and requests : roundtable

LHCb:

 Federico+Christopher+Christophe+Alexandre+Ryun

  • Kept installing latest release. NTR otherwise

Belle2

 Hideki, Ueda

  • Memory issues from JobMonitor

Juno+BES3:

 Xiaomei

  • LFNs that are too long: check the MySQL version; recent enough versions would at least avoid registering a truncated LFN
  • monitoriFiles not working correctly for production transformations. Maybe buggy.
    •  Federico+Chris in LHCb a different mechanism is used
    •  Luisa in CTAO also a different solution

EGI+IN2P3

 Andrei, Mazen

  • EGI: NTR
  • The new IN2P3 service is operational now, for 1 specific community (which is using Rucio as DM)

CTAO

 Luisa, Natthan, Loris, Stella

  • Use case of possibly 10k short transformations (out of 1 production)

CLIC

 André

  • NTR

GridPP:

 Daniela, Simon

  • Nothing new for production (still on 8.0.74)

  • Pre-prod

    • My attempts to address this comment by Chris B (https://github.com/DIRACGrid/diracx/pull/851#discussion_r2999473646) resulted in us upgrading the test server from v9.0.2 to v9.0.20. This threw up a bunch of issues, the most prominent (and possibly most interesting to other people) one being: https://github.com/DIRACGrid/DIRAC/pull/8515 : You do need a storage management system to run DIRAC and the code should allow for this, as it did in v9.0.2. The link points to Federico’s fix.
    • We also see: WARNING: MYSQL_OPT_RECONNECT is deprecated and will be removed in a future version. Fix in https://github.com/DIRACGrid/DIRAC/pull/8507
    • Second pass at using one OpenSearch server for production and 3 pre-prod servers. This is done via index prefixes.
      • To re-iterate what we stated during the DIRAC workshop: We really can’t afford one OpenSearch server per DIRAC install.
    • The first issue was that in v8 and v9, no matter what we did, OpenSearch indices were also created for WMS/RMS despite self.activityMonitoring being False (the rogue indices look like this, for example: dirac00._rmsmonitoring-index-2026-03)
    • This was fixed in https://github.com/DIRACGrid/DIRAC/pull/8490 by Federico.
    • We force backported this to our v8 installs to be able to continue testing. While it would be nice to have this backported to v8, it’s not vital.
    • We think https://github.com/DIRACGrid/DIRAC/issues/8489 : [Feature] Introduce the concept of a global prefix for OpenSearch indexes is going in the right direction, and could be of interest to other smaller DIRAC installations.
    • v9.0 pre-prod server: This issue (https://github.com/DIRACGrid/DIRAC/issues/8453): ‘/tmp’ filling up with proxies has been fixed, but unless it’s backported to v9.0, we cannot use v9.0. The original issue was a refactor gone wrong, so could we put the refactor of the refactor back into the release, please ?
    • At the moment we cherry-picked the solution back to our v9.0 test server, but that would be an unfortunate approach to a production server.
  • diracos issues

    • After upgrading to 2.60 we noticed the webapp using 100% CPU. Fixed in https://github.com/DIRACGrid/tornado_m2crypto/pull/7
    • https://github.com/DIRACGrid/DIRACOS2/issues/174 ([Bug]: DIRACOS2 2.58+ requires $HOME): Resulted in failing jobs on the certification server that were attempting a “traditional” (as opposed to cvmfs) DIRAC install. The Imperial College grid site has not had $HOME for years, and I suspect we aren’t the only ones. The only reason this does not show up in the wild is that prod instances tend to use the cvmfs version.
    • https://github.com/DIRACGrid/DIRACOS2/issues/169: [Feature]: Include htcondor-25 in diracos2: We are running HTCondor 25 on our site and we see crashes that seem to be induced by jobs coming from the DIRAC certification server. The HTCondor mailing list suggests this might be a mismatch between the submitting and receiving condors. EGI is currently running a campaign to get sites to upgrade their condor installs. We think it would be a good idea to test this hypothesis before we hit a real problem.
  • Web App

  • Documentation:

    • I made a first pass at documenting how to install DiracX in a container (https://github.com/DIRACGrid/diracx/pull/851). I would like to get this released (but got sidetracked testing fixes for 9.0.20), and I am also at in-person meetings for 5 of the next 6 working days. Maybe something to discuss for the ops part of this meeting (any possibility of releasing a first pass, in combination with an “improvement” issue?)
    • https://github.com/DIRACGrid/DIRAC/issues/8513 (Documentation request for: NumberOfGPUs & AvailableRAM)

Releases announcements and reviews

DiracOS

DIRAC

  • v9.1.6 (+ v9.1.5, 4, 3)
    • v9.1.5 is buggy, and has been yanked in pypi
    • Core NEW: (#8484) Add DIRAC_FAST_PROCESS_POOL as experimental feature to speed up the REA
    • WMS CHANGE: (#8479) get cpu work left from a single source of truth

v9.1.7 is waiting for a DiracX release first (see below)

DiracX

  • v0.0.13 (+ v0.0.12, v0.0.11)

    • implemented authdb tables cleanup (#815)
    • replace container base images with pixi-managed environments (#810)
    • add diracx-tasks (#842) (63d3a01)
      • This should have been logically in v0.1.0, but for technical reasons (chicken-and-egg issue??) could not be done
    • add task to clean sandbox store (#883) (ab38f04)
    • core: strict UTC datetime validation for pydantic models (#477) (ed3d934)
  • v0.1.0

    • not yet there, but first PRs merged for it

A “proper” release is waiting for diracx-charts

New documentation:

  • how to make a release (tested during last hackathon)
  • advanced tutorial (tested during last hackathon)
  • deploy in containers: https://github.com/DIRACGrid/diracx/pull/851 (draft)
    • this is for DiracX services, tasks will be done later on

Reminder: the PR titles should match the conventional commits spec, this is enforced via https://github.com/amannn/action-semantic-pull-request – this is determining how releases are numbered.
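
As a rough illustration of the reminder above, the sketch below checks a PR title against the Conventional Commits header format and reports the version bump that the spec associates with it (`fix` gives a patch, `feat` gives a minor, a `!` marker gives a major). The function name `bump_for_title` and the example titles are hypothetical, and the set of types actually accepted by the DiracX repositories, as well as how bumps are applied to 0.x versions, is configured in the release tooling and may differ.

```python
import re
from typing import Optional

# Conventional Commits header: type, optional "(scope)", optional "!", then ": description".
TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$"
)


def bump_for_title(title: str) -> Optional[str]:
    """Return the bump implied by a PR title, or None if the title is malformed."""
    match = TITLE_RE.match(title)
    if match is None:
        return None  # would be rejected by a title check
    if match["breaking"]:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"  # e.g. docs, chore: valid title, no version bump by itself


# Hypothetical titles, for illustration only.
for title in ("feat: add diracx-tasks", "fix(core): strict UTC datetime validation", "Add diracx-tasks"):
    print(f"{title!r} -> {bump_for_title(title)}")
```

A title without a recognised `type:` prefix returns `None`, which is the case a title check is meant to catch before merge.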

Dirac-CWL

DiracX-web

Pilot


Feature requests, and developers’ issues: inputs and prioritizations from communities

Nothing specific.

Prioritized backlog: communities input

https://github.com/orgs/DIRACGrid/projects/30/views/3 contains the prioritized backlog.


AOB

  • CMS and DiracX:
    • CMS wants to use DiracX as their Workflow Management System (basically, the production system). Their review concluded that it’s feasible. Technical work and contribution from CMS starts now.
    • 3 developers added to the DiracGrid organization in github, to the diracproject-users ML, and to the DiracX mattermost:
      • Valentin Kuznetsov
      • Alan Malta
      • Todor Ivanov
  • Certification machines
    • All working as expected
  • DIRAC as an “HSF affiliated project” : https://hepsoftwarefoundation.org/projects/affiliated.html
    •  Andrei sent out an updated answer
  • CHEP is in about 40 days

Next appointments

  • Few changes to some of the next meetings

    •  Federico will host the Ddev on April 30th
    • There won’t be a Ddev on Thursday 14th of May (holiday at CERN), so it is brought forward to Wednesday 13th of May, same time
    • The next Dops will be on Wednesday 20th of May, same time
    •  Federico will host the Ddev on May 28th (CHEP week)
  • WS/hackathons/conferences:

    • 10:00 11:00
      Dirac(X) operations (Dops)
      Convener: Federico Stagni (CERN)
    • 11:00 12:00
      Dirac(X) developers (Ddev)
      Convener: Alexandre Franck Boyer (CERN)

      # DIRAC Development Meeting (Ddev)

      **At CERN**: Christophe, Chris, Federico, Yan, Alexandre, Andre
      **On Zoom**: Heloise, Jorge, Mazen, Andrei, Loris, Natthan, Daniela, Hideki
      **Apologies**: Stella

      ## Product Goals & Roadmaps

      - Transition to DiracX:

      ```mermaid
      flowchart LR
          subgraph CWL["CWL"]
              CWL1("CWL submission endpoint"):::inprogress
              CWL2("CWL production system")
              CWL3("Transformation system machinery"):::blocked
              CWL4("Use CWL natively in new matcher"):::blocked
          end

          subgraph Core["Core"]
              CoreTasks("Tasks"):::inprogress
              Core2("RSS"):::inprogress
              Core3("DMS")
          end

          subgraph WMS["WMS"]
              WMS1("Matcher"):::inprogress
              WMS2("Pilot authentication"):::inprogress
              WMS3("Pilot submission"):::blocked
          end

          CWL3 --> CWL4
          CoreTasks --> Core2 --> Core3
          CoreTasks --> WMS1
          CoreTasks --> CWL3
          WMS1 --> CWL4
          CoreTasks --> WMS3

          click CoreTasks "https://www.github.com" "This is a tooltip for a link"

          classDef done fill:#B2DFDB,stroke:#00897B,color:black,stroke-width:2px;
          classDef inprogress fill:#FFF9C4,stroke:#F9A825,color:black,stroke-width:2px;
          classDef blocked fill:#BBBBBB,stroke:#222222,color:black,stroke-width:2px;

          subgraph Legend
              L2("Completed"):::done
              L4("In progress"):::inprogress
              L1("Ready for work")
              L3("Blocked"):::blocked
          end
      ```

      - CWL integration:

      ```mermaid
      flowchart LR
          subgraph dirac_cwl["dirac-cwl"]
              job1("Prototype Job Endpoint"):::done
              transformation("Prototype Transformation Endpoint"):::inprogress
              workflows("Workflows"):::inprogress
              prod("Prototype Production Endpoint"):::inprogress
          end

          subgraph DiracX1["DiracX"]
              prod_diracx("Implement the CWL Production System")
              trans_diracx("Implement the CWL Transformation endpoint")
              trans_diracx_original("Implement the Transformation System"):::blocked
              diracx_tasks("Implement DiracX Tasks"):::blocked
              job_diracx("Implement the CWL Job Endpoint"):::inprogress
          end

          diracx_tasks --> trans_diracx_original
          trans_diracx_original --> trans_diracx
          transformation --> trans_diracx
          job1 --> workflows
          prod --> prod_diracx
          prod_diracx -.-> deliver2(["Can submit productions to DiracX /productions"]):::milestone
          trans_diracx -.-> deliver3(["Can submit transformations to DiracX /transformations"]):::milestone
          job_diracx -.-> deliver5(["Can submit jobs to DiracX /jobs"]):::milestone

          classDef done fill:#B2DFDB,stroke:#00897B,color:black,stroke-width:2px;
          classDef inprogress fill:#FFF9C4,stroke:#F9A825,color:black,stroke-width:2px;
          classDef blocked fill:#BBBBBB,stroke:#222222,color:black,stroke-width:2px;
          classDef milestone fill:#FFDFE5,stroke:#FF5978,color:#8E2236,stroke-width:2px;

          subgraph Legend
              L1("Completed"):::done
              L2("In progress"):::inprogress
              L3("Ready for work")
              L4("Blocked"):::blocked
              L5("Milestone"):::milestone
          end
      ```

       

      ## Refinements 

      ### Needs triage
      https://github.com/orgs/DIRACGrid/projects/30/views/7

      **Goal: build a shared understanding of the project.**

      > DIRAC
      - [DictCache issue? (blocked)](https://github.com/DIRACGrid/DIRAC/issues/8472)
      - [Job grouping in HPC with no ext. connectivity](https://github.com/DIRACGrid/DIRAC/issues/8475)
      - [Certification Instance](https://github.com/DIRACGrid/DIRAC/issues/7658)
      - [Support for Rucio40](https://github.com/DIRACGrid/DIRAC/issues/8508)
      - [Default of 0 GPU](https://github.com/DIRACGrid/DIRAC/issues/8513)

      > DiracOS2
      - [Require $HOME](https://github.com/DIRACGrid/DIRACOS2/issues/174)
      - [Handle updates?](https://github.com/DIRACGrid/DIRACOS2/issues/30)
      - [Python warnings](https://github.com/DIRACGrid/DIRACOS2/issues/146)

      > WebAppDIRAC

      > diracx-web
      - [Oauth2-proxy](https://github.com/DIRACGrid/diracx-web/issues/482)
      - [Sync types with backend openapi](https://github.com/DIRACGrid/diracx-web/issues/237)

      > diracx
      - [ReleasePlease token not needed?](https://github.com/DIRACGrid/diracx/issues/884)
      - [Job match making phase2](https://github.com/DIRACGrid/diracx/issues/868)
      - [Scrap use of local git repo](https://github.com/DIRACGrid/diracx/issues/875)
      - [CWL job submission endpoint](https://github.com/DIRACGrid/diracx/issues/858)
      - [Merging DBs?](https://github.com/DIRACGrid/diracx/issues/860)
      - [CLI tests](https://github.com/DIRACGrid/diracx/pull/104)
          - oldest PR in diracx. [name=Chris] will resurrect it later.
      - [Config mechanism](https://github.com/DIRACGrid/diracx/issues/830)
      - [Integrate MCP Server](https://github.com/DIRACGrid/diracx/issues/827)
      - [RSS](https://github.com/DIRACGrid/diracx/issues/790)
          - [Phase1](https://github.com/DIRACGrid/diracx/issues/836)
          - [Phase2](https://github.com/DIRACGrid/diracx/issues/889)

      > Pilot

      > diracx-charts
      - [Replace Minio with Seaweed](https://github.com/DIRACGrid/diracx-charts/issues/191)
      - [Add hook rather than re-write entrypoint](https://github.com/DIRACGrid/diracx-charts/issues/257)

      > container-images
      - Shall we archive [container-images](https://github.com/DIRACGrid/container-images)? There is still a secret image in there, which can be useful, so we keep it (the DIRAC images in `management` will slowly die and be archived)

      > dirac-cwl
      - [assigning an output sandbox to a job from the api](https://github.com/DIRACGrid/dirac-cwl/issues/92)
      - [dirac-cwl executor tests](https://github.com/DIRACGrid/dirac-cwl/issues/116)

      > signurlarity
      - [Factorize benchmark functions](https://github.com/DIRACGrid/signurlarity/issues/25)

      **External deps**

      ### [Temporary Section] In progress, predating the new organization

      https://github.com/orgs/DIRACGrid/projects/30/views/8

      Various people still need to deal with old and stale PRs. We will take them into account in the next sprints.


      ### External dependencies

      https://github.com/orgs/DIRACGrid/projects/30/views/9

      ---

      [Planning Poker](https://en.wikipedia.org/wiki/Planning_poker)
      Story point values (based on the Fibonacci sequence)
      - `1pt`: Trivial, very clear (small bug fix, config change)
      - `2pts`: Small, well understood (small feature, clear requirements)
      - `3pts`: Medium, some unknowns (moderate feature)
      - `5pts`: Large, significant complexity (major feature, integration)
      - `8pts`: Very large, many unknowns (should probably be split)
      - `13+pts`: TOO BIG - must split!
      - `?`: not enough knowledge to answer (remember it's ok to ask any questions)

      ## Sprints

      ### Planning (Velocity and Planning Poker)

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Current Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ![](https://codimd.web.cern.ch/uploads/upload_4d419513411c38ecdcdcd593b2c0a19e.png)

       

      **Average Velocity: 3.07 x FTEs** *Last update: Jan 21st*
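
The minutes do not say how the average is computed. One reading that reproduces the 3.07 figure is the unweighted mean of the per-sprint velocities (story points divided by people) for the five sprints up to January 21st, as listed in the Previous Sprints summary. The sketch below makes that assumption explicit; the names `SPRINTS_TO_JAN_21`, `sprint_velocity` and `average_velocity` are illustrative only.

```python
# (story points delivered, people available) per sprint, copied from the
# "Previous Sprints" summary, for the sprints up to the January 21st update.
SPRINTS_TO_JAN_21 = {
    "November 10th": (22, 4.3),
    "November 26th": (6, 3.0),
    "December 10th": (6, 3.0),
    "January 7th": (15, 3.9),
    "January 21st": (6, 2.5),
}


def sprint_velocity(story_points: float, people: float) -> float:
    """Velocity of one sprint: story points delivered per person (FTE)."""
    return story_points / people


def average_velocity(sprints: dict) -> float:
    """Unweighted mean of the per-sprint velocities (an assumed method)."""
    velocities = [sprint_velocity(points, people) for points, people in sprints.values()]
    return sum(velocities) / len(velocities)


print(round(average_velocity(SPRINTS_TO_JAN_21), 2))  # 3.07
```

Averaging the unrounded per-sprint values matters here: the velocities as rounded in the summary would average to 3.06.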

      #### :warning: Velocity is a planning tool, not a performance target

      - Velocity going down is NOT bad
      - Velocity going up is NOT always good (might mean over-estimation)
      - Velocity varies sprint-to-sprint
      - We track it to improve estimation, not to judge people

      **What affects velocity:**
      - Estimation accuracy (we're still learning)
      - Complexity of work

      **Our focus:** Delivering value and hitting commitments, not maximizing velocity numbers.

      ### April 30th (IN PROGRESS):

      #### Target and Context
      > Chris & Christophe will deliver diracx-tasks
      - Clean up existing issues/PRs: [Burning Charts](https://github.com/orgs/DIRACGrid/projects/30/insights?period=3M)
      - Finish Phase1 of the RSS migration
      - Finish Phase1-2 of the Matcher
      - More diracx cleanup
      - Make diracx-web stable

      #### Availability

      - [name=alexandre] 70%
      - [name=natthan] 30%
      - [name=luisa] 0%
      - [name=loris] 70%
      - [name=stella] 0% (comes back in 1 month)
      - [name=jorge] 60%
      - [name=ryun] 5%
      - [name=federico] 10%
      - [name=heloise] 40%
      - [name=christophe] 0%
      - [name=chris] 5%
      - [name=janusz] 0%
      - [name=mazen] 10%
      - [name=andrei] 10%
      - [name=yan] 100%
      - [name=daniela] 5%

      _ FTEs * _ = _ story points

      Expected Story Points:
      Persons:
      Expected Velocity:
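
The template above was left blank in these minutes. As a sketch of how it could be filled in, the snippet below sums the availabilities listed for this sprint into FTEs and multiplies by a velocity per FTE. It assumes that each percentage is a fraction of one FTE, and it uses the 3.07 average (last updated on January 21st, so likely stale) purely as an example input. The result is illustrative and is not a figure agreed by the team; the names used are hypothetical.

```python
# Availability per person for the April 30th sprint, in percent, as listed above.
AVAILABILITY_PERCENT = {
    "alexandre": 70, "natthan": 30, "luisa": 0, "loris": 70,
    "stella": 0, "jorge": 60, "ryun": 5, "federico": 10,
    "heloise": 40, "christophe": 0, "chris": 5, "janusz": 0,
    "mazen": 10, "andrei": 10, "yan": 100, "daniela": 5,
}


def total_ftes(availability_percent: dict) -> float:
    """Sum the percentages and convert to FTEs (assumes 100% == 1 FTE)."""
    return sum(availability_percent.values()) / 100


def expected_story_points(ftes: float, velocity_per_fte: float) -> float:
    """The template's formula: FTEs * velocity = story points."""
    return ftes * velocity_per_fte


ftes = total_ftes(AVAILABILITY_PERCENT)
print(ftes)                                          # 4.15
print(round(expected_story_points(ftes, 3.07), 1))   # 12.7
```

Passing the velocity as a parameter keeps the capacity estimate separate from the choice of which velocity to trust, which the recent sprints suggest is higher than the January average.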

      #### Sprint Planning: 

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ### April 16th (DONE):

      Expected Story Points: 45
      Persons: 2.6
      Expected Velocity: 17.3

      *20 Story Points / 2.6 people = 7.7 velocity*

      Comments:
      - Fewer people available during the sprint (holidays, CTAO had deadlines). Also, some people seemed to spend more time than originally planned, and some of them less.
      - A lot of bug fixes that were not planned

      #### Sprint review: https://github.com/orgs/DIRACGrid/projects/30/views/11

      Related to our goals:
      - **DIRAC to DiracX transition:**
          - RSS: Part1 of Phase1 complete (DB and models are in)
          - Removed EdDSA keys support
          - Bug fixes (e.g. replica map)

      - **CWL integration:**
          - Support for transformations with a static list of input data: POC (generated a few questions but will be revisited later)
          - Documentation to transition from jobdescription.xml modules to pre/post processing commands through the LHCb use case.

      - **Match-Making POC:**
          - Data Models: generated a few questions related to the specification itself

      - **DIRAC maintenance:**
          - A lot of small various Dirac fixes


      #### Sprint retrospective

      *The sprint is a boat :boat: ; we are trying to reach an island (target); identify anchors (what slowed you down), wind (what helped), and rocks ahead (risks for next sprint)*

      :warning: **Focus on the process, not people. We're here to improve together! 🚀**

      **:anchor: Anchors (what slowed you down)**
      - *Example: Unclear requirements on X; Waiting for Y delayed Z; ...*
      - [name=Alexandre] We have 1 certification machine: very rarely, multiple people need it at the same time, which can be a blocking point. As it is rare, I don't think we should take any action, but it is worth keeping in mind in case it happens more frequently in the future.
      - [name=Federico] Bug fixes. We should test every feature release in certification.

      **:cloud: Wind (what helped)**
      - *Example: Good communication in weekly meetings; Quick code reviews; Clear acceptance criteria on user stories; ...*

      **🪨 Rocks (risks for next sprint)**
      - *Example: Team member K on vacation; Dependency on external API L; Technical debt in M; ...*

      ---

      ### Previous Sprints
      #### Summary

      - April 2nd:
        - *37 Story Points / 3.1 people = 11.9 velocity*
        - Comments: NTR

      - March 19th:
        - *23 Story Points / 3.5 people = 7.2 velocity* 
        - Comments:
          - Need to adapt the velocity computation because we are processing a lot of tasks not planned originally in the sprint (which is expected since we still have a lot of PRs without any attached issue to process, ...)

      - March 5th:
        - *19 Story Points / 2.8 people = 6.8 velocity*
        - Comments:
          - Fewer people available during this sprint, but more realistic expectations: we almost reached the expected velocity!!
          - LHCb AI hackathon: [name=Alexandre] was much less available than expected.
          - Took into account items that were in progress before scrum process (added some SP): resurrecting diracx-web, RSS simplified...

      - February 19th:
        - *38 Story Points / 4.4 people = 8.6 velocity*
        - Comments:
          - French holidays
          - [name=Alexandre] was more available than expected, but did not manage to quickly follow all the PRs.
          - A few tasks have been delayed (10 SP): waiting for further discussion on scheduling and diagrams for new LHCb workflows
          - Lot of "unplanned" items: expected as long as we have to deal with the large backlog of old items.

      - February 5th:
        - *29 Story Points / 3.1 people = 9.4 velocity*
        - Comments:
          - LHCb-CERN had a computing workshop
          - Various people worked on old PRs I did not take into account :warning:

      - January 21st:
        - *6 Story Points / 2.5 people = 2.4 velocity*
        - Comments:
          - LHCb-CERN had a team retreat, LHCb-Spain had a conference.

      - January 7th:
        - *15 Story Points / 3.9 people = 3.8 velocity*
        - Comments:
          - No specific comment, the sprint was split by the holidays.

      - December 10th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - About the same as the previous sprint: still a gap between expected/actual availability

      - November 26th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - Much lower than the previous sprint because it included tasks started before the sprint.
          - Lots of "almost done" PRs: we are improving the description of the tasks and their size but still not enough (each task should bring value though).

      - November 10th:
        - *22 Story Points / 4.3 people = 5.1 velocity*


      #### Actionable Results from the Retrospective

      - **Action:** Avoid verbose (AI-generated) issues with many implementation details that can become outdated over time.
        - Owner: developers and product owners
        - When: Sprint11
        - Status: 15/04/26 In Progress
      - **Action:** Better view of the PRs ready to be reviewed vs needing changes.
        - Owner: developers
        - When: Sprint8
        - Status: 15/04/26 In Progress
      - **Action:** Better communicate when a PR is going to be big, as soon as possible. Split the work in this case.
        - Owner: developers
        - When: Sprint6
        - Status: 21/01/26 DONE
      - **Action:** Better use of the mattermost channel to get reviews on a given PR
        - Owner: everyone
        - By when: Sprint3
        - Status: 04/02/26 DONE
      - **Action:** Define estimates and velocity based on Sprint2's results, taking into account external contributions (bonus Story Points) and availability
        - Owner: alexandre
        - By when: Sprint3
        - Status: DONE
      - **Action:** Better define the scrum roles
        - Owner: alexandre
        - By when: Sprint5
        - Status: DONE
      - **Action:** Better define `DONE` criteria (what should be included into the PR, and how to make sure we are not introducing too much technical debt)
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE
      - **Action:** Avoid planning dependent tasks in the same sprint
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE

      ## AOB