Dops + Ddev

Europe/Zurich
2/R-014 (CERN)

Description

The monthly Dops meeting (Dirac(X) operations) will run just before the weekly Ddev (Dirac(X) developers) meeting. 

Zoom Meeting ID
62504856418
Host
Federico Stagni

Dops – 12/03/2026

At CERN: Federico, Christophe, Christopher, Alexandre, André, Alan
On Zoom: Andrei, Hideki, Xiaomei, Luisa, Heloise, Stella, Loris, Vladimir, Daniela, Simon, Janusz, Bertrand, Mazen, Vladimir, Natthan
Apologies:


Previous meetings + follow-ups


Communities' issues and requests: roundtable

LHCb:

 Federico+Christopher+Christophe+Alexandre+Ryun

  • Kept installing latest release (now running with 9.1.1 - patched, see below)
  • Regularly running 350k+ jobs
  • Matcher under stress when LHCb HLT farm starts ramping up
  • Database optimizations (some of which went into DIRAC v9.1.0) had to be added in order to address the heavy load

Belle2

 Hideki, Ueda

  • Memory consumption increased on one of the servers
  • Test server moving to v9

Juno+BES3:

 Xiaomei

  • From the previous meeting: heavy productions since August, running on 2 servers.
    • “Only 20k” running jobs, but the high job frequency caused by short jobs is putting pressure on the SandboxStore and the CS
      •  Federico + Christophe: solutions in v8 are mostly outside of DIRAC itself:
        • disk performance is critical
        • in LHCb we created a DNS load balancer
      • increase the validity of the CS (option to increase the refresh time)

EGI+IN2P3

 Andrei, Mazen

  • Set up a second service (IN2P3) for specific VO
  • Test system v9+X
    • choosing FC plugin for v9

CTAO

 Luisa, Natthan, Loris, Stella

  • NTR

CLIC

 André
  • NTR

GridPP:

 Daniela, Simon

  • Nothing to report

Releases announcements and reviews

DIRAC

  • v9.0.20

    • Last of the v9.0 patches
  • v9.1.0

    • First minor release since “TBD” adoption
    • Contains MySQL changes (optional, suggested) detailed in the release notes
      • See “Deployment” notes in release notes
    • Yanked in pypi
  • v9.1.1

    • Should have also been a minor release
    • Contains MySQL changes (optional, suggested) detailed in the release notes
      • See “Deployment” notes in release notes
    • Yanked in pypi
  • v9.1.2

    • Patched version due to bug introduced in v9.1.0 (also in v9.1.1)

DiracX

DiracOS

  • 2.58
    • No news since last Dops

Dirac-CWL

  • Introduced a new JobWrapper that can run in DIRAC. A JobReport of its status was recently introduced

Pilot

  • Will (finally) complete the removal of py2 support on Monday

Feature requests, and developers’ issues: inputs and prioritizations from communities

Jobs’ match-making (matching) mechanism for DiracX: issues and plans

  •  Federico sent out a mail thread with questions; Andrei, Luisa and Ueda answered. The questions, with a summary of the answers:
    • What are the limitations you encountered with the current system?
      • Expressing RAM requirements
        •  Federico: it’s in v9
      • No priority boost for long Waiting jobs
      • No priority manipulation
      • It is not easy to understand the reason why a job has not been matched.
        • there is an attempt to do that, in a script
    • Do you make use of “Tags” for match-making? Which ones, and why?
      • Yes, for specific classification
    • Do you inject in the JDL specific, VO-dependent parameters?
      • Belle2 adds OS tags
    • Do some of your VOs or users use or are interested in CWL?
      • Yes (CTAO), not yet (EGI-FG, Belle2)
    • Do you have access to nodes that include heterogeneous resources (e.g. in HPCs)?
      • Not yet, but on the horizon
    • What could be a target rate of match-making operations?
      • 200 Hz
  • Next:
    •  Federico will create an “epic” issue with user stories
    • A design will follow
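
To make two of the limitations above concrete (tag and RAM requirements, and the difficulty of understanding why a job was not matched), here is a minimal illustrative sketch. It is not DIRAC or DiracX code and does not anticipate the design that will follow: the `Job` and `Resource` classes, the `mismatch_reasons` and `explain` helpers, the tag names and the site names are all hypothetical. The idea is only that a matcher could return the reasons for a mismatch instead of a bare yes/no.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job description: required tags and RAM (in MB).
    job_id: int
    tags: frozenset = frozenset()
    ram_mb: int = 0


@dataclass(frozen=True)
class Resource:
    # Hypothetical description of what a pilot slot offers.
    name: str
    tags: frozenset = frozenset()
    ram_mb: int = 0


def mismatch_reasons(job: Job, resource: Resource) -> list[str]:
    """Return the reasons why `job` cannot run on `resource` (empty list = match)."""
    reasons = []
    missing = job.tags - resource.tags
    if missing:
        reasons.append(f"missing tags: {', '.join(sorted(missing))}")
    if job.ram_mb > resource.ram_mb:
        reasons.append(
            f"needs {job.ram_mb} MB RAM, resource offers {resource.ram_mb} MB"
        )
    return reasons


def explain(job: Job, resources: list[Resource]) -> dict[str, list[str]]:
    """Map each resource name to the reasons the job does not match it."""
    return {r.name: mismatch_reasons(job, r) for r in resources}


# Example with made-up tags ("GPU", and "el9" standing in for an OS tag).
job = Job(job_id=1, tags=frozenset({"GPU", "el9"}), ram_mb=4000)
resources = [
    Resource("site_a", frozenset({"el9"}), 8000),
    Resource("site_b", frozenset({"GPU", "el9"}), 2000),
    Resource("site_c", frozenset({"GPU", "el9"}), 8000),
]
report = explain(job, resources)
for name, reasons in report.items():
    print(name, "->", reasons or "match")
```

A real matcher would also have to cover priorities and the target rate discussed above, which this sketch deliberately ignores.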

MP Jobs accounting

Existing issue: https://github.com/DIRACGrid/diracx/issues/294

  •  Federico sent out a mail thread with questions and a short list of user stories. A few answers were collected; in very short summary:
    • MP jobs and MP pilots run almost everywhere, but at a low level; the accounting missing for them is only partially noticeable, but it is nonetheless needed
  • Next:
    •  Federico will add the user stories to the issue above
    • Design and implementation should follow soon after (rather high priority)
  • Related, noted here for visibility: https://github.com/DIRACGrid/diracx/issues/562
    • this “epic” is about accounting and monitoring (OLAP). The plan in there is not working, but maybe a few ideas could be borrowed from it. To be designed.

Need for TSCatalog?

Prioritized backlog: communities input

https://github.com/orgs/DIRACGrid/projects/30/views/3 contains the prioritized backlog.


AOB


Next appointments

    • 10:00 - 11:00
      Dirac(X) operations (Dops)
      Convener: Federico Stagni (CERN)
    • 11:00 - 12:00
      Dirac(X) developers (Ddev)
      Convener: Alexandre Franck Boyer (CERN)

      # DIRAC Development Meeting (Ddev)

      **At CERN**: Federico, Christopher, Alexandre, André, Ryun
      **On Zoom**: Andrei, Janusz, Hideki, Xiaomei, Michel, Dhiraj, Ryun, Heloise, Stella, Loris, Vladimir, Daniela, Simon


      ## Product Goals & Roadmaps

      - Transition to DiracX:

      ```mermaid
      flowchart LR
          subgraph CWL["CWL"]
              CWL1("CWL submission endpoint")
              CWL2("CWL production system")
              CWL3("Transformation system machinery"):::blocked
              CWL4("Use CWL natively in new matcher"):::blocked
          end

          subgraph Core["Core"]
              CoreTasks("Tasks")
              Core2("RSS")
              Core3("DMS")
          end

          subgraph WMS["WMS"]
              WMS1("Matcher"):::blocked
              WMS2("Pilot authentication")
              WMS3("Pilot submission"):::blocked
          end

          CWL3 --> CWL4
          CoreTasks --> Core2 --> Core3
          CoreTasks --> WMS1
          CoreTasks --> CWL3
          WMS1 --> CWL4
          CoreTasks --> WMS3

          click CoreTasks "https://www.github.com" "This is a tooltip for a link"

          classDef done fill:#B2DFDB,stroke:#00897B,color:black,stroke-width:2px;
          classDef blocked fill:#BBBBBB,stroke:#222222,color:black,stroke-width:2px;

          subgraph Legend
              L2("Completed"):::done
              L1("Ready for work")
              L3("Blocked"):::blocked
          end
      ```

      - CWL integration:
      ![](https://codimd.web.cern.ch/uploads/upload_190b20d13cb4b3543a96af631ca1967d.png)


      ## Refinements 

      ### Needs triage
      https://github.com/orgs/DIRACGrid/projects/30/views/7

      **Goal: build a shared understanding of the project.**

      > DIRAC
      - [DictCache issue? (blocked)](https://github.com/DIRACGrid/DIRAC/issues/8472)
      - [Gracefully stop pilots](https://github.com/DIRACGrid/DIRAC/issues/8346)
      - [Job grouping in HPC with no ext. connectivity](https://github.com/DIRACGrid/DIRAC/issues/8475)

      > WebAppDIRAC

      > diracx
      - [fastAPI status](https://github.com/DIRACGrid/diracx/issues/823)
      - [doctest](https://github.com/DIRACGrid/diracx/issues/141)
      - [Stop using UPGRADE_REQUIRED](https://github.com/DIRACGrid/diracx/issues/542)
      - [CLI tests](https://github.com/DIRACGrid/diracx/pull/104)
          - oldest PR in diracx. [name=Chris] will resurrect it later.
      - [Config mechanism](https://github.com/DIRACGrid/diracx/issues/830)
      - [Integrate MCP Server](https://github.com/DIRACGrid/diracx/issues/827)
      - [RSS](https://github.com/DIRACGrid/diracx/issues/790)

      > Pilot

      > diracx-charts
      - [Move away from bitnami](https://github.com/DIRACGrid/diracx-charts/issues/181)
      - [Swap the default values of enabled](https://github.com/DIRACGrid/diracx-charts/issues/166)
      - [Release notes](https://github.com/DIRACGrid/diracx-charts/issues/237)
      - [Managing secret updates](https://github.com/DIRACGrid/diracx-charts/issues/84)
          - need design?
      - [Gateway API](https://github.com/DIRACGrid/diracx-charts/issues/68)
          - still relevant?
      - [Find a way to run the demo completely offline](https://github.com/DIRACGrid/diracx-charts/issues/48)
      - [Handling hotfixing](https://github.com/DIRACGrid/diracx-charts/issues/37)
          - still relevant? No

      > dirac-cwl
      - [assigning an output sandbox to a job from the api](https://github.com/DIRACGrid/dirac-cwl/issues/92)
      - [dirac-cwl executor tests](https://github.com/DIRACGrid/dirac-cwl/issues/116)

      **External deps**
      - [Drop boto]: needs to be split into smaller issues, and to address signurlity issues (add to the gh project)


      ### [Temporary Section] In progress, predating the new organization

      https://github.com/orgs/DIRACGrid/projects/30/views/8

      Various people still need to deal with old and stale PRs. We will take them into account in the next sprints.

      > diracx
      - [feat (JobDB): pydantic datetime validation](https://github.com/DIRACGrid/diracx/pull/477)
          - Plan to be processed after LHCb Week (Sprint8: Feb 19th - March 5th)

      > DIRAC
      - [Summary tables](https://github.com/DIRACGrid/DIRAC/pull/8199)


      ### External dependencies

      https://github.com/orgs/DIRACGrid/projects/30/views/9

      ---

      [Planning Poker](https://en.wikipedia.org/wiki/Planning_poker)
      Story point values (based on the Fibonacci sequence)
      - `1pt`: Trivial, very clear (small bug fix, config change)
      - `2pts`: Small, well understood (small feature, clear requirements)
      - `3pts`: Medium, some unknowns (moderate feature)
      - `5pts`: Large, significant complexity (major feature, integration)
      - `8pts`: Very large, many unknowns (should probably be split)
      - `13+pts`: TOO BIG - must split!
      - `?`: not enough knowledge to answer (remember it's ok to ask any questions)

      ## Sprints

      ### Planning (Velocity and Planning Poker)

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Current Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ![](https://codimd.web.cern.ch/uploads/upload_fb41dd923f471fc2b1566c127706bd3c.png)


      **Average Velocity: 3.07 x FTEs** *Last update: Jan 21st*

      #### :warning: Velocity is a planning tool, not a performance target

      - Velocity going down is NOT bad
      - Velocity going up is NOT always good (might mean over-estimation)
      - Velocity varies sprint-to-sprint
      - We track it to improve estimation, not to judge people

      **What affects velocity:**
      - Estimation accuracy (we're still learning)
      - Complexity of work

      **Our focus:** Delivering value and hitting commitments, not maximizing velocity numbers.

      ### March 19th (IN PROGRESS):

      #### Target and Context
      - Chris & Christophe still working on the foundations (`diracx-tasks`)
      - Clean up existing issues/PRs
      - Start design of the migration of the RSS
      - Make diracx-web stable

      #### Availability

      - [name=alexandre] 70%
      - [name=natthan] 10%
      - [name=luisa] 
      - [name=loris] 80%
      - [name=stella] 50%
      - [name=jorge] 
      - [name=ryan]
      - [name=federico] 20%
      - [name=heloise] 40%
      - [name=christophe] 0%
      - [name=chris] 0%
      - [name=janusz] 20%
      - [name=mazen] 10%

      _ FTEs * _ = _ story points

      Expected Story Points:
      Persons:
      Expected Velocity:

      #### Sprint Planning: 

      - Backlog: https://github.com/orgs/DIRACGrid/projects/30/views/3
      - Sprint: https://github.com/orgs/DIRACGrid/projects/30/views/1

      ### March 5th (DONE):

      Expected Story Points: 26
      Persons: 2.8
      Expected Velocity: 9.2 

      *19 Story Points / 2.8 people = 6.8 velocity*
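
The figures above follow two simple relations, sketched here in Python. The function names are hypothetical and the rounding is an assumption: the minutes only give the final numbers, but 2.8 people at an expected velocity of 9.2 does round to the 26 expected story points, and 19 / 2.8 does round to 6.8.

```python
def velocity(story_points: float, people: float) -> float:
    """Story points completed per full-time person in one sprint."""
    if people <= 0:
        raise ValueError("people must be positive")
    return story_points / people


def expected_story_points(people: float, expected_velocity: float) -> float:
    """Planning figure: available people (FTEs) times the expected velocity."""
    return people * expected_velocity


# Figures from the March 5th sprint.
planned = expected_story_points(people=2.8, expected_velocity=9.2)
actual = velocity(story_points=19, people=2.8)

print(f"expected story points: {round(planned)}")   # 26
print(f"actual velocity: {round(actual, 1)}")       # 6.8
```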

      Comments:
      - Fewer people were available during this sprint, but expectations were more realistic: we almost reached the expected velocity!
      - LHCb AI hackathon: [name=Alexandre] was much less available than expected.
      - Took into account items that were in progress before scrum process (added some SP): resurrecting diracx-web, RSS simplified...
      - Closed 2 PRs that were new and made without going through the expected flow (issue -> ddev meeting discussion -> PR)
      - Lots of pending PRs, I see 2 issues:
          - Reviewing bottleneck: as a reviewer, it's hard to follow the development flow
          - Developers pushing their work near the end of the sprint

      #### Sprint review: https://github.com/orgs/DIRACGrid/projects/30/views/11

      Related to our goals:
      - **DIRAC to DiracX transition:**
          - More DiracX documentation
          - Ed25519 key in place
          - DiracX-Web is coming back (still in alpha)

      - **CWL integration:**

      - **DIRAC maintenance:**
          - RSS streamlining


      #### Sprint retrospective

      *The sprint is a boat :boat: ; we are trying to reach an island (target); identify anchors (what slowed you down), wind (what helped), and rocks ahead (risks for next sprint)*

      :warning: **Focus on the process, not people. We're here to improve together! 🚀**

      **:anchor: Anchors (what slowed you down)**
      - *Example: Unclear requirements on X; Waiting for Y delayed Z; ...*
      - Review capacity was limited this week: we could benefit from delegating review responsibilities earlier in this case.
      - PRs tended to appear toward the end of the sprint, making it harder to review and merge in time. Spreading submissions more evenly would help.

      **:cloud: Wind (what helped)**
      - *Example: Good communication in weekly meetings; Quick code reviews; Clear acceptance criteria on user stories; ...*
      - We collectively plan our availability/velocity better
      - Quick code review: helpful!

      **🪨 Rocks (risks for next sprint)**
      - *Example: Team member K on vacation; Dependency on external API L; Technical debt in M; ...*

      ---

      ### Previous Sprints
      #### Summary

      - February 19th:
        - *38 Story Points / 4.4 people = 8.6 velocity*
        - Comments:
          - French holidays
          - [name=Alexandre] was more available than expected, but did not manage to quickly follow all the PRs.
          - A few tasks have been delayed (10 SP): waiting for further discussion on scheduling and diagrams for new LHCb workflows
          - Lots of "unplanned" items: expected as long as we have to deal with the large backlog of old items.

      - February 5th:
        - *29 Story Points / 3.1 people = 9.4 velocity*
        - Comments:
          - LHCb-CERN had a computing workshop
          - Various people worked on old PRs that I did not take into account :warning:

      - January 21st:
        - *6 Story Points / 2.5 people = 2.4 velocity*
        - Comments:
          - LHCb-CERN had a team retreat, LHCb-Spain had a conference.

      - January 7th:
        - *15 Story Points / 3.9 people = 3.8 velocity*
        - Comments:
          - No specific comment, the sprint was split by the holidays.

      - December 10th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - About the same as the previous sprint: still a gap between expected/actual availability

      - November 26th:
        - *6 Story Points / 3 people = 2 velocity*
        - Comments:
          - Much lower than the previous sprint because it included tasks started before the sprint.
          - Lots of "almost done" PRs: we are improving the description of the tasks and their size but still not enough (each task should bring value though).

      - November 10th:
        - *22 Story Points / 4.3 people = 5.1 velocity*
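
The sprint history above can be turned into an average velocity with a few lines of Python. The minutes do not say how the "Average Velocity: 3.07" figure (last updated Jan 21st) was obtained; the sketch below assumes a plain mean of the per-sprint velocities over the five sprints up to January 21st, which reproduces that value. The `average_velocity` helper is hypothetical.

```python
from statistics import mean

# (sprint end date, story points done, people available), oldest first,
# copied from the summary above.
sprints = [
    ("November 10th", 22, 4.3),
    ("November 26th", 6, 3.0),
    ("December 10th", 6, 3.0),
    ("January 7th", 15, 3.9),
    ("January 21st", 6, 2.5),
    ("February 5th", 29, 3.1),
    ("February 19th", 38, 4.4),
    ("March 5th", 19, 2.8),
]


def average_velocity(history):
    """Mean of the per-sprint velocities (story points / people)."""
    return mean(points / people for _, points, people in history)


# Assumption: the published average covers the sprints up to January 21st.
up_to_jan_21 = average_velocity(sprints[:5])
print(f"{up_to_jan_21:.2f}")  # 3.07
```

The same helper can be called on a longer slice of `sprints` when the average is next updated.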


      #### Actionable Results from the Retrospective

      - **Action:** Better view of the PRs ready to be reviewed vs needing changes.
        - Owner: developers
        - By when: Sprint8
        - Status: 19/02/26 In Progress
      - **Action:** Better communicate when a PR is going to be big, as soon as possible. Split the work in this case.
        - Owner: developers
        - By when: Sprint6
        - Status: 21/01/26 DONE
      - **Action:** Better use of the mattermost channel to get reviews on a given PR
        - Owner: everyone
        - By when: Sprint3
        - Status: 04/02/26 DONE
      - **Action:** Define estimates and velocity based on Sprint2's results, taking into account external contributions (bonus Story Points) and availability
        - Owner: alexandre
        - By when: Sprint3
        - Status: DONE
      - **Action:** Better define the scrum roles
        - Owner: alexandre
        - By when: Sprint5
        - Status: DONE
      - **Action:** Better define `DONE` criteria (what should be included into the PR, and how to make sure we are not introducing too much technical debt)
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE
      - **Action:** Avoid planning dependent tasks in the same sprint
        - Owner: everyone
        - By when: Sprint2
        - Status: DONE

      ## AOB