Dops + Ddev

Europe/Zurich
2/R-014 (CERN)

Description

The monthly Dops meeting (Dirac(X) operations) will run just before the weekly Ddev (Dirac(X) developers) meeting. 

Zoom Meeting ID
62504856418
Host
Federico Stagni
    • 10:00 - 11:00
      Dirac(X) operations (Dops)
      Convener: Federico Stagni (CERN)

      Dops – 18/06/2026

      At CERN: Federico, Alexandre, Ryunosuke, Cedric, Marco, Ryun
      On Zoom: Christophe, Luisa, Loris, Jorge, Simon, Daniela (first 30 min only), Heloise, Ueda, Janusz, Bertrand, Dhiraj, Xiaomei, Juraj, Yan, Stella
      Apologies: Andrei


      this meeting is being recorded

      Previous meetings + follow-ups

      • Dops 4 weeks ago. Follow-ups
        • On the request for a long term support release (DIRAC):
          • We agreed on creating a rel-v9r0 branch, to which all “needed” fixes have been backported from integration
          • Release(s) created accordingly
            • v9.0.22 is “everyone’s” target
            • This is to be considered a “lily-pad” release, a stepping stone for jumping to higher releases later on.
            • Daniela wants to test one more issue on it and will get to it soon
        • Federico sent out a Google Form to collect requirements for the DiracX Transformation/Production system
          • 3 “simple” answers received. Most of the “higher requirements” ones still to come
        • older CWL is coming to DiracX, and the new “hints”: https://codimd.web.cern.ch/SllN13jAQNSG25MjHB8Swg?both
          • “Large” PR needs to be split

      Communities issues and requests : roundtable

      LHCb:

       Federico+Christopher+Christophe+Alexandre+Ryun

      • Running in production the latest releases of everything

      Belle2

       Ueda, Hideki, Cedric

      • Smooth operation

      Juno+BES3:

       Xiaomei

      • Nothing new
      • From previous meeting: monitorFiles is not working correctly for production transformations. Possibly a bug.

      EGI+IN2P3

       Andrei, Mazen

      • NTR

      JINR

       Igor

      • NTR

      CTAO

       Luisa, Natthan, Loris, Stella

      • Answers prepared for the prod system, feedback from colleagues needed
      • Preliminary design review of CTAO computing earlier this week
        • Stressed the importance and usefulness of Transformation plugins
      • From previous meeting: use case of possibly 10k short transformations (out of 1 production request)

      CLIC

       André

      • NTR

      FCC

       Juraj

      • Started effort, pointers collected. Started collecting answers for the form.

      CMS

       Andrea, Marco

      • Not yet

      GridPP:

       Daniela, Simon, Janusz

      • Production: v8.0.78 + security patches (will upgrade next week to v8.0.80)
      • Fixed a couple of minor bugs in WebAppDIRAC, now in 5.0.14

      Releases announcements and reviews

      DiracOS

      DIRAC

      • Note on the existing branches:

        • there are now 3 “live” branches: rel-v8r0, rel-v9r0, integration
        • please, target integration for “everything” unless:
          • it’s a real bug fix (maybe backported)
          • it’s a security fix (but that should go through a CVE)
        • the “sweeper” of PRs is no longer in action: you need to create the separate PRs yourself
          • to avoid “I’ll do it later”, PRs will only be merged if they are already created for all the branches.
      • v8.0.80

        • includes some security fixes (ported to every branch)
      • v9.0.22

        • various backports from integration
      • v9.1.12 + v9.1.11, v9.1.10

        • several not-only-security fixes
        • several performance fixes
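
The branch note above (target integration unless it is a real bug fix or a security fix, and merge only once PRs exist for all the branches concerned) could be checked mechanically. The following is a minimal sketch, not an existing DIRAC tool: the function names, the `kind` labels and the `backport` flag are hypothetical, and only the three branch names come from these minutes.

```python
from __future__ import annotations

# Hypothetical sketch of the "PRs for all branches before merging" rule.
# Only the branch names come from the minutes; the rest is illustrative.
LIVE_BRANCHES = ("rel-v8r0", "rel-v9r0", "integration")


def required_branches(kind: str, backport: bool = False) -> tuple[str, ...]:
    """Return the branches a change must target.

    kind: "feature", "bugfix" or "security" (labels assumed for this sketch).
    backport: whether maintainers decided that a bug fix is backported.
    """
    if kind == "security" or (kind == "bugfix" and backport):
        return LIVE_BRANCHES
    # Everything else targets integration only.
    return ("integration",)


def missing_branches(kind: str, open_prs: set[str], backport: bool = False) -> list[str]:
    """List branches that still lack a PR; merge only when this is empty."""
    return [b for b in required_branches(kind, backport) if b not in open_prs]
```

For example, `missing_branches("security", {"integration"})` reports that the two release branches still need their own PRs.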

      DiracX

      • v0.2.0
        • added AsyncTwoLevelCache

      CWL

      DiracX-web

      • first non-alpha release ?
        • Not yet

      Pilot

      • NTR

      Agentic and AI developments

      • Open issue to discuss at some point, or followed-up among communities.

      Feature requests, and developers’ issues: inputs and prioritizations from communities

      Transformation/Production “system” in DiracX

      • Nothing more to report for this meeting. We expect all updates by the next DOps.

      Prioritized backlog: communities input

      https://github.com/orgs/DIRACGrid/projects/30/views/3 contains the prioritized backlog.


      AOB


      Next appointments

      • Next meetings:

        • Next DOps on August 13th (no DOps in the middle of summer)
      • WS/hackathons/conferences:

        • DiracX hackathon: 1st and 2nd of July
          • 20 people registered. Search for another room unsuccessful
          • Social event on the evening of July 1st.
        • 12th DUW: 13th-16th October
          • registrations open! Free of charge, thanks to a local sponsor (waiting for Jiri to update a few things)
        • Planning for DiracX hackathons in 2027

          I have made a temporary booking for your event in our calendar. I will contact you in September to confirm the booking with you

    • 11:00 - 12:00
      Dirac(X) developers (Ddev)
      Convener: Alexandre Franck Boyer (CERN)

      # DIRAC Development Meeting (Ddev)

      **At CERN**: Federico, Christophe, Chris
      **On Zoom**: Loris, Heloise, Natthan, Simon, Bertrand, Amir, Yan, Andrei, Jorge, Stella, Daniela, Mazen
      **Apologies**: 

      ## Product Goals & Roadmaps

      - Transition to DiracX:

      ```mermaid
      flowchart LR
          subgraph CWL["CWL"]
              CWL1("CWL submission endpoint"):::inprogress
              CWL2("CWL production system")
              CWL3("Transformation system machinery"):::blocked
              CWL4("Use CWL natively in new matcher"):::blocked
          end

          subgraph Core["Core"]
              CoreTasks("Tasks"):::inprogress
              Core2("RSS"):::inprogress
              Core3("DMS")
          end

          subgraph WMS["WMS"]
              WMS1("Matcher"):::inprogress
              WMS2("Pilot authentication"):::inprogress
              WMS3("Pilot submission"):::blocked
          end

          CWL3 --> CWL4
          CoreTasks --> Core2 --> Core3
          CoreTasks --> WMS1
          CoreTasks --> CWL3
          WMS1 --> CWL4
          CoreTasks --> WMS3

          click CoreTasks "https://www.github.com" "This is a tooltip for a link"

          classDef done fill:#B2DFDB,stroke:#00897B,color:black,stroke-width:2px;
          classDef inprogress fill:#FFF9C4,stroke:#F9A825,color:black,stroke-width:2px;
          classDef blocked fill:#BBBBBB,stroke:#222222,color:black,stroke-width:2px;

          subgraph Legend
              L2("Completed"):::done
              L4("In progress"):::inprogress
              L1("Ready for work")
              L3("Blocked"):::blocked
          end
      ```

      - CWL integration:

      ```mermaid
      flowchart LR
          subgraph dirac_cwl["dirac-cwl"]
              job1("Prototype Job Endpoint"):::done
              transformation("Prototype Transformation Endpoint"):::inprogress
              workflows("Workflows"):::inprogress
              prod("Prototype Production Endpoint"):::inprogress
          end

          subgraph DiracX1["DiracX"]
              prod_diracx("Implement the CWL Production System")
              trans_diracx("Implement the CWL Transformation endpoint")
              trans_diracx_original("Implement the Transformation System"):::blocked
              diracx_tasks("Implement DiracX Tasks"):::blocked
              job_diracx("Implement the CWL Job Endpoint"):::inprogress
          end

          diracx_tasks --> trans_diracx_original
          trans_diracx_original --> trans_diracx
          transformation --> trans_diracx
          job1 --> workflows
          prod --> prod_diracx
          prod_diracx -.-> deliver2(["Can submit productions to DiracX /productions"]):::milestone
          trans_diracx -.-> deliver3(["Can submit transformations to DiracX /transformations"]):::milestone
          job_diracx -.-> deliver5(["Can submit jobs to DiracX /jobs"]):::milestone

          classDef done fill:#B2DFDB,stroke:#00897B,color:black,stroke-width:2px;
          classDef inprogress fill:#FFF9C4,stroke:#F9A825,color:black,stroke-width:2px;
          classDef blocked fill:#BBBBBB,stroke:#222222,color:black,stroke-width:2px;
          classDef milestone fill:#FFDFE5,stroke:#FF5978,color:#8E2236,stroke-width:2px;

          subgraph Legend
              L1("Completed"):::done
              L2("In progress"):::inprogress
              L3("Ready for work")
              L4("Blocked"):::blocked
              L5("Milestone"):::milestone
          end
      ```

      ## Refinements 

      ### Needs triage
      https://github.com/orgs/DIRACGrid/projects/30/views/7

      **Goal: build a shared understanding of the project.**

      > DIRAC

      > DiracOS2
      - [Handle updates?](https://github.com/DIRACGrid/DIRACOS2/issues/30)
      - [Python warnings](https://github.com/DIRACGrid/DIRACOS2/issues/146)

      > WebAppDIRAC

      > diracx-web
      - [Oauth2-proxy](https://github.com/DIRACGrid/diracx-web/issues/482)
      - [Sync types with backend openapi](https://github.com/DIRACGrid/diracx-web/issues/237)

      > diracx
      - [pixi lock file hook](https://github.com/DIRACGrid/diracx/issues/942)
      - [Fix warnings in test suite](https://github.com/DIRACGrid/diracx/issues/935)
      - [Cache static endpoint responses](https://github.com/DIRACGrid/diracx/issues/835)
      - [ADRs](https://github.com/DIRACGrid/diracx/issues/588)
      - [CLI tests](https://github.com/DIRACGrid/diracx/pull/104) - oldest PR in diracx. [name=Chris] will resurrect it later.
      - [Integrate MCP Server](https://github.com/DIRACGrid/diracx/issues/827) - still needs to be discussed on DOps first
      - [RSS](https://github.com/DIRACGrid/diracx/issues/790)
          - [Phase1](https://github.com/DIRACGrid/diracx/issues/836)
          - [Phase2](https://github.com/DIRACGrid/diracx/issues/889)

      > Pilot

      > diracx-charts

      > dirac-cwl
      - [assigning an output sandbox to a job from the api](https://github.com/DIRACGrid/dirac-cwl/issues/92)
      - [dirac-cwl executor tests](https://github.com/DIRACGrid/dirac-cwl/issues/116)

      > signurlarity
      - [rustfs docker image not stable](https://github.com/DIRACGrid/signurlarity/issues/38)

      - [name=Chris] TODO: we should add lhcb workflow transition documentation directly in diracx for the other communities
      - TODO: adding a word about dropping pre-commit ci

      **External deps**

      ### [Temporary Section] In progress, predating the new organization

      https://github.com/orgs/DIRACGrid/projects/30/views/8

      Various people still need to deal with old and stale PRs. We will take them into account in the next sprints. 

       

      ### External dependencies

      https://github.com/orgs/DIRACGrid/projects/30/views/9


      ---

      [Planning Poker](https://en.wikipedia.org/wiki/Planning_poker)
      Story point values (based on the Fibonacci sequence)
      - `1pt`: Trivial, very clear (small bug fix, config change)
      - `2pts`: Small, well understood (small feature, clear requirements)
      - `3pts`: Medium, some unknowns (moderate feature)
      - `5pts`: Large, significant complexity (major feature, integration)
      - `8pts`: Very large, many unknowns (should probably be split)
      - `13+pts`: TOO BIG - must split!
      - `?`: not enough knowledge to answer (remember it's ok to ask any questions)
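
The scale above can be expressed as a small lookup, for instance to sanity-check estimates entered on the project board. This is a sketch only: the helper name `review_estimate` is hypothetical and the wording of each level is copied from the list above.

```python
from __future__ import annotations

# Scale and guidance copied from the list above; the helper is hypothetical.
SCALE = {
    1: "Trivial, very clear",
    2: "Small, well understood",
    3: "Medium, some unknowns",
    5: "Large, significant complexity",
    8: "Very large, many unknowns (should probably be split)",
}


def review_estimate(points: int | None) -> str:
    """Return the guidance attached to an estimate; None stands for '?'."""
    if points is None:
        return "?: not enough knowledge to answer, ask questions first"
    if points >= 13:
        return "TOO BIG: must split"
    if points not in SCALE:
        raise ValueError(f"{points} is not on the scale {sorted(SCALE)}")
    return SCALE[points]
```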

      ## Sprints

      ### Planning (Velocity and Planning Poker)

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Current Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ![](https://codimd.web.cern.ch/uploads/upload_31860179052ee5ef48e43f9bc9141c9d.png)


      **Average Velocity: 3.07 x FTEs** *Last update: Jan 21st*

      #### :warning: Velocity is a planning tool, not a performance target

      - Velocity going down is NOT bad
      - Velocity going up is NOT always good (might mean over-estimation)
      - Velocity varies sprint-to-sprint
      - We track it to improve estimation, not to judge people

      **What affects velocity:**
      - Estimation accuracy (we're still learning)
      - Complexity of work

      **Our focus:** Delivering value and hitting commitments, not maximizing velocity numbers.

      ### June 24th (IN PROGRESS):

      #### Target and Context
      - **Transition**: 
          - Clean up existing issues/PRs: [Burning Charts](https://github.com/orgs/DIRACGrid/projects/30/insights?period=3M)
          - Integrate ADRs & precise roadmap
          - ~~Finish Job Monitoring and stabilize diracx-web~~
      - **RSS**: Finish Phase1
      - **Match Making**: Finish Phase3
      - **Pilot**: Finish PR1

      #### Availability

      - [name=alexandre] 50% 
      - [name=natthan] 20%
      - [name=luisa] %
      - [name=loris] 60%
      - [name=stella] 30%
      - [name=jorge] 70%
      - [name=ryun] %
      - [name=federico] 20%
      - [name=heloise] 30%
      - [name=christophe] %
      - [name=chris] %
      - [name=janusz] %
      - [name=mazen] %
      - [name=andrei] %
      - [name=yan] 100%
      - [name=Simon] 10%
      - [name=daniela] 10%
      - [name=Hideki] %
      - [name=Benedikt] 20%

      _ FTEs * _ = _ story points

      Expected Story Points:
      Persons:
      Expected Velocity:
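
As an illustration of the `FTEs * velocity = story points` formula, the sketch below sums the availabilities declared so far and multiplies by a velocity per FTE. It assumes that blank entries (`%`) mean "not yet declared" and are skipped, and it uses the 3.07 average quoted above purely as an example input, so the result is not the agreed figure for this sprint. All function names are hypothetical.

```python
from __future__ import annotations

import re

# Availability entries as written in the list above.
ENTRIES = """
[name=alexandre] 50%
[name=natthan] 20%
[name=luisa] %
[name=loris] 60%
[name=stella] 30%
[name=jorge] 70%
[name=ryun] %
[name=federico] 20%
[name=heloise] 30%
[name=christophe] %
[name=chris] %
[name=janusz] %
[name=mazen] %
[name=andrei] %
[name=yan] 100%
[name=Simon] 10%
[name=daniela] 10%
[name=Hideki] %
[name=Benedikt] 20%
"""

PATTERN = re.compile(r"\[name=(\w+)\]\s*(\d+)?\s*%")


def declared_percentages(text: str) -> dict[str, int]:
    """Parse '[name=x] NN%' lines, skipping entries without a number."""
    return {
        m.group(1).lower(): int(m.group(2))
        for m in PATTERN.finditer(text)
        if m.group(2) is not None
    }


def ftes(percentages: dict[str, int]) -> float:
    """Convert summed percentages into full-time equivalents."""
    return sum(percentages.values()) / 100


def expected_story_points(fte: float, velocity_per_fte: float) -> float:
    """Capacity estimate: FTEs times velocity, rounded to one decimal."""
    return round(fte * velocity_per_fte, 1)
```

With the eleven availabilities declared so far, `ftes(declared_percentages(ENTRIES))` gives 4.2 FTEs; the total changes as the remaining people fill in their line.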

      #### Sprint Planning: 

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ### June 11th (DONE):

      Expected Story Points: 56
      Persons: 3.8
      Expected Velocity: 14.7

      Effective velocity: 20 / 3.8 ≈ 5.3

      Comments: big difference between expected velocity and effective one. 
      - CHEP planning and long weekends (CH, France) largely affected it.

      #### Sprint Planning: 

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      #### Sprint review: https://github.com/orgs/DIRACGrid/projects/30/views/11

      Related to our goals:
      - **DIRAC to DiracX transition:**
          - Mostly bug fixes

      - **CWL integration:**
          - NTR

      - **Match-Making POC:**
          - NTR

      - **DIRAC maintenance:**
          - Prepared rel-v9r0: the “lily-pad” release

      #### Sprint retrospective

      *The sprint is a boat :boat: ; we are trying to reach an island (target); identify anchors (what slowed you down), wind (what helped), and rocks ahead (risks for next sprint)*

      :warning: **Focus on the process, not people. We're here to improve together! 🚀**

      **:anchor: Anchors (what slowed you down)**
      - *Example: Unclear requirements on X; Waiting for Y delayed Z; ...*
          - LHCb got issues in production with the AuthDB not cleaned up correctly: we should have better described the issue originally. This is expected to be better with new issues and the template.
          - Certification machine can't be accessed by everyone: should define a clear policy about what we expect from this instance and who should take care of the tests.
          - A lot of delay due to the lack of reviewers: shall we let developers review PRs (they could do the first pass, and then be assisted by architects)? This could come with some guidelines.
          - We should be more careful with PR titles: we merged a `feat` PR which should have been `chore` or `docs` (it bumped the diracx version in the wrong way; corrected before the release).
          - The pixi lock file was migrated to v7 and then reverted to v6 in another PR: we should be more careful about it. Maybe we could have a pre-commit hook to make sure the file is not touched if there are no dependency changes -> minimum pixi version in pixi.toml
          - CWL PR:
              - was meant to move work done from dirac-cwl to diracx, but many other features were added, making it hard to review (breaking our rule that big PRs should be split)
              - was rediscussed after being implemented and certified with the architects: as developers, we should better explain our roadmap (CWL, but also in general) because it was not clear to everyone (let's come up with a plan we all agree on before the hackathon)
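
The pre-commit idea raised above for the pixi lock file could look roughly like the check below. It is a sketch under assumptions: the lock file is taken to be `pixi.lock`, the manifest either `pixi.toml` or `pyproject.toml`, and the function name is hypothetical. The list of changed files would have to be supplied by the hook, for example from `git diff --cached --name-only`.

```python
from __future__ import annotations

# Assumed file names: the minutes only mention the pixi lock file and
# pixi.toml; pyproject.toml is included because pixi can also use it
# as its manifest.
LOCK_FILE = "pixi.lock"
MANIFESTS = {"pixi.toml", "pyproject.toml"}


def lock_change_is_justified(changed_files: list[str]) -> bool:
    """Return False when the lock file changed without a manifest change."""
    names = {path.rsplit("/", 1)[-1] for path in changed_files}
    if LOCK_FILE not in names:
        # Lock file untouched: nothing to check.
        return True
    return bool(names & MANIFESTS)
```

Such a check would also flag deliberate lock refreshes that do not touch the manifest, so it would need an explicit way to be bypassed.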

      **:cloud: Wind (what helped)**
      - *Example: Good communication in weekly meetings; Quick code reviews; Clear acceptance criteria on user stories; ...*

      **🪨 Rocks (risks for next sprint)**
      - *Example: Team member K on vacation; Dependency on external API L; Technical debt in M; ...*
      - LHCb week
      - CTAO review (Christophe away for a week)

      ---

      ### Previous Sprints
      #### Summary

      - May 28th:
        - *16.2 Story Points / 4.2 people = 3.8 velocity*
        - Comments:
          - Big difference between expected velocity and effective one. 
          - CHEP planning and long weekends (CH, France) largely affected it.

      - May 14th:
        - *11 Story Points / 4 people = 2.75 velocity*
        - Comments:
          - Lowest velocity since we started, but:
          - RSS end of phase1 is actually trickier than what we initially thought
          - Many people having to start preparing presentations (CHEP, LPC retreat)
          - Many people are working on tasks that are not in the scope of the sprint (to prepare future sprints): LHCb commands to replace the workflow modules, integration of CWL job submission endpoint within diracx
          - A CI failure in DIRAC prevented merging

      - April 30th:
        - *42 Story Points / 4.1 people = 10.2 velocity*
        - Comments:
          - ~1/4 of the counted SP come from the integration of the tasks
          - `diracx-tasks` are here :tada: 
          - All the essential components are here to transition now.
          - RSS Phase1:
            - Should have been completed but it's still under development. [name=Loris] any blocking point? 
          - New Matcher: 
            - working on a v0.2 schema design that makes explicit more details about what we want

      - April 16th:
        - *20 Story Points / 2.6 people = 7.7 velocity*
        - Comments:
          - Fewer people available during the sprint (holidays, CTAO had deadlines). Also, some people seemed to spend more time than originally declared, others less.
          - A lot of bug fixes that were not planned

      - April 2nd:
        - *37 Story Points / 3.1 people = 11.9 velocity*
        - Comments: NTR

      - March 19th:
        - *23 Story Points / 3.5 people = 7.2 velocity* 
        - Comments:
          - Need to adapt the velocity computation because we are processing a lot of tasks not planned originally in the sprint (which is expected since we still have a lot of PRs without any attached issue to process, ...)

      - March 5th:
        - *19 Story Points / 2.8 people = 6.8 velocity*
        - Comments:
          - Fewer people available during this sprint, but more realistic expectation, we almost reached the expected velocity!!
          - LHCb AI hackathon: [name=Alexandre] was much less available than expected.
          - Took into account items that were in progress before scrum process (added some SP): resurrecting diracx-web, RSS simplified...

      - February 19th:
        - *38 Story Points / 4.4 people = 8.6 velocity*
        - Comments:
          - French holidays
          - [name=Alexandre] was more available than expected, but did not manage to quickly follow all the PRs.
          - A few tasks have been delayed (10 SP): waiting for further discussion on scheduling and diagrams for new LHCb workflows
          - A lot of "unplanned" items: expected as long as we have to deal with the large backlog of old items.

      - February 5th:
        - *29 Story Points / 3.1 people = 9.4 velocity*
        - Comments:
          - LHCb-CERN had a computing workshop
          - Various people worked on old PRs I did not take into account :warning:

      - January 21st:
        - *6 Story Points / 2.5 people = 2.4 velocity*
        - Comments:
          - LHCb-CERN had a team retreat, LHCb-Spain had a conference.

      - January 7th:
        - *15 Story Points / 3.9 people = 3.8 velocity*
        - Comments:
          - No specific comment, the sprint was split by the holidays.

      - December 10th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - About the same as the previous sprint: still a gap between expected/actual availability

      - November 26th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - Much lower than the previous sprint because it included tasks started before the sprint.
          - Lots of "almost done" PRs: we are improving the description of the tasks and their size but still not enough (each task should bring value though).

      - November 10th:
        - *22 Story Points / 4.3 people = 5.1 velocity*
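
For reference, each velocity in the summary above is story points divided by people, and the quoted "Average Velocity: 3.07 x FTEs" (last updated Jan 21st) is consistent with the plain mean of the per-sprint velocities of the five sprints up to that date. The sketch below reproduces that figure; the averaging method is an assumption inferred from the numbers, not a documented rule.

```python
# (story points, people) for the five sprints up to January 21st,
# copied from the summary above, oldest first.
SPRINTS = {
    "November 10th": (22, 4.3),
    "November 26th": (6, 3),
    "December 10th": (6, 3),
    "January 7th": (15, 3.9),
    "January 21st": (6, 2.5),
}


def velocity(story_points: float, people: float) -> float:
    """Velocity of one sprint: story points delivered per person (FTE)."""
    return story_points / people


def average_velocity(sprints: dict) -> float:
    """Unweighted mean of the per-sprint velocities (assumed method)."""
    values = [velocity(sp, people) for sp, people in sprints.values()]
    return sum(values) / len(values)


print(round(average_velocity(SPRINTS), 2))  # 3.07
```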


      #### Actionable Results from the Retrospective

      - **Action:** Feature PRs should be thoroughly tested in certification.
        - Owner: developers
        - When: Sprint12
        - Status: 29/04/26 In Progress
      - **Action:** Avoid verbose (AI-generated) issues with many implementation details that can become outdated over time.
        - Owner: developers and product owners
        - When: Sprint11
        - Status: 15/04/26 In Progress
      - **Action:** Better view of the PRs ready to be reviewed vs needing changes.
        - Owner: developers
        - When: Sprint8
        - Status: 15/04/26 In Progress
      - **Action:** Better communicate when a PR is going to be big, as soon as possible. Split the work in this case.
        - Owner: developers
        - When: Sprint6
        - Status: 21/01/26 DONE
      - **Action:** Better use of the mattermost channel to get reviews on a given PR
        - Owner: everyone
        - By when: Sprint3
        - Status: 04/02/26 DONE
      - **Action:** Define estimates and velocity based on Sprint2's results, taking into account external contributions (bonus Story Points) and availability
        - Owner: alexandre
        - By when: Sprint3
        - Status: DONE
      - **Action:** Better define the scrum roles
        - Owner: alexandre
        - By when: Sprint5
        - Status: DONE
      - **Action:** Better define `DONE` criteria (what should be included into the PR, and how to make sure we are not introducing too much technical debt)
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE
      - **Action:** Avoid planning dependent tasks in the same sprint
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE

      ## AOB