BiLD-Dev

Europe/Zurich
2/R-014 (CERN)

Federico Stagni (CERN)
Description

Bi-Weekly "Loyal" DIRAC developers meeting, followed by the LHCbDIRAC developers meeting.

Zoom: BiLD
https://cern.zoom.us/j/62504856418?pwd=TU1kb01SOFFpSDBJeWVBdU9qemVXQT09

Meeting ID: 62504856418
Passcode: 12345678
 


BiLD – 10/04/2025

At CERN: Federico, Alexandre, Robin, Theau, André
On Zoom: Janusz, Jorge, Xiaomei, Cedric
Apologies: Daniela, Christophe, Christopher


Follow-up from previous meetings

  • Last BiLD was 2 weeks ago, March 27th

DIRAC communities roundtable

LHCb:

Federico+Alexandre+Ryun+Vladimir+Alexey+Robin+Theau

  • Migrated to DIRAC v9 and diracx, and lots of other updates:
    1. we took everything down
    2. we applied optional updates, taking advantage of the whole system being down anyway
      • MySQL updates:
        • update MySQL to 8.4
        • update ROW_FORMAT to “Dynamic” (many tables were created when the default was “Compact”)
        • updated character set to utf8mb4
        • optimize (defrag)
        • updated the passwords, adding special characters (which were NOT OK before) – PR, also backported to v8
      • a few machines resized
    3. deployed lhcbdiracx and lhcbdiracx-web
      • Update plan followed: https://codimd.web.cern.ch/dnfwITCRRTSvhopGDjlHSA?both
        • effectively, more than a full week of downtime
        • Restart NOT smooth - many hotfixes and quickly-merged PRs both for DIRAC and DiracX (and LHCbDIRAC and lhcbdiracx)
        • a token is added to the proxy, obtained by calling diracx
        • SandboxStore forwards to DiracX
        • Turned the WorkloadManagement/JobStateUpdate to diracx route on and off several times, using the never-tested-before JobStateUpdate legacy adapter
        • 10 lhcbdiracx pods running, for now (scaled by hand)
    • Issues (probably forgetting some!):
      • The MySQL update scripts were partly incomplete – now should be alright
      • AccountingDB (MySQL) optional updates could not complete (a few tables are too big)
      • diracx client extension code was re-worked (and basically tested in production)
      • All user jobs that were still in the system (or those that were added at the initial restart) failed until we put in this PR
      • diracx scalability issues (should all be solved by now):
        • The SandboxMetadataDB access was VERY slow (from diracx) because of a non-optimized query, fixed
        • this query is specific to diracx in order to verify that a user can actually get the sandbox
        • we also cleaned the DB from quite some inconsistencies
        • AuthDB was missing indices
        • too many (refresh?) tokens requested (DIRAC fix)
        • DIRAC Framework/ProxyManager was under some load, so a few instances were added; this was reflected on the diracx pods – possible related issue and maybe a PR
        • We are now running with the DIRAC JobStateUpdate services (not the diracx ones)
      • We realized that pilots could not, e.g., update the job status – an issue was opened, and a PR was opened and closed (??)
        • for the moment, we gave the pilot the JobAdministrator property
      • Too many DNS requests from the DIRAC SandboxStore machine to the lhcbdiracx pods – not fully resolved, not clear who’s at fault
      • (we discovered that) Pilots can no longer use host certificates. The LHCb HLT farm was relying on that, so now it first downloads the pilot proxy (with a token inside).
      • The diracx /api/jobs/status route sometimes (?) stores local time instead of UTC, fixed with PR ?
      • Decent monitoring quickly became important. The openshift one is alright-ish but OTEL is needed, so a PR was created
    • Status:
      • went through a few releases, now running with alpha versions, and hotfixes
      • we are now running “almost everything”. Went up to 100k running jobs yesterday
      • more issues still coming up one-by-one but now “running”
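
The local-time versus UTC issue noted above for the diracx /api/jobs/status route can be illustrated with a minimal sketch. This is not the diracx code or its fix: the helper names `to_utc` and `utc_now` are hypothetical, and the fixed +02:00 offset is only an assumed stand-in for Europe/Zurich summer time. The idea is simply to refuse timestamps without timezone information and to normalise everything else to UTC before storing it.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Return ts as a timezone-aware UTC datetime.

    Naive datetimes are rejected, because there is no way to know
    whether they were meant as local time or as UTC.
    """
    if ts.tzinfo is None:
        raise ValueError("naive datetime: timezone unknown")
    return ts.astimezone(timezone.utc)


def utc_now() -> datetime:
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


# Assumed example: 10:00 local time with a fixed +02:00 offset.
zurich_summer = timezone(timedelta(hours=2))
local = datetime(2025, 4, 10, 10, 0, tzinfo=zurich_summer)
print(to_utc(local).isoformat())  # 2025-04-10T08:00:00+00:00
```

Rejecting naive datetimes at the boundary makes a mixed local/UTC column fail loudly instead of silently storing values that are off by the local offset.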

ILC/Calice/FCC

André

  • NTR

Belle2

Hideki

  • NTR

Juno:

Xiaomei

  • NTR

GridPP:

Janusz

  • NTR

CTAO

Natthan

  • Now using the Let's Encrypt service provided by our cluster to generate signed certificates

Topics from GitHub discussions and bots


DIRAC releases

  • v8.0.71
    • No new tags created since last time for v8
  • v9
    • …too many

DIRAC projects

DIRAC:

Issues by milestone:

Other issues:

PRs discussed:

  • NTR

WebApp:

  • 1 more pre-release made
  • from previous meeting: one draft PR

Pilot:

  • PR feat: Adding JWT support alongside X509 auth
    • the new Pilot command can call the route directly, no need to use the CLI
    • the integration tests for this will be set up once diracx is updated with the connected diracx PR
      • for this to be done neatly, the branching strategy of diracx will need to be defined.

DIRACOS:

  • NTR

Documentation:

  • NTR

OAuth2:

  • NTR

management

  • NTR

diraccfg

  • NTR

DB12

Rucio

  • NTR

Tests

  • from previous meeting: Federico started adding Rucio to DIRAC integration tests
    • –> to Janusz

DiracX:

Issues

PRs discussed:

DiracX-charts:

  • Merged PRs for using DB passwords with special characters
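
The minutes do not say why special characters in DB passwords caused trouble. One common cause, assumed here purely for illustration, is a password embedded unescaped in a database connection URL, where characters such as `@`, `:`, `/` and `#` are URL delimiters. The sketch below percent-encodes the credentials using only the Python standard library; `build_db_url`, the host name and the credentials are hypothetical and are not taken from the DiracX charts.

```python
from urllib.parse import quote_plus, unquote_plus, urlsplit


def build_db_url(user: str, password: str, host: str, port: int,
                 database: str, scheme: str = "mysql") -> str:
    """Build a connection URL with percent-encoded credentials.

    Hypothetical helper: encoding user and password keeps characters
    like '@' or ':' from being parsed as URL delimiters.
    """
    return (f"{scheme}://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# Made-up credentials and host, for illustration only.
url = build_db_url("dirac", "p@ss:w/rd#1", "db.example.org", 3306, "JobDB")
print(url)  # mysql://dirac:p%40ss%3Aw%2Frd%231@db.example.org:3306/JobDB

# Round trip: the parsed password decodes back to the original.
parts = urlsplit(url)
print(unquote_plus(parts.password))  # p@ss:w/rd#1
```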

DiracX-web:

  • feat: enable remote backend connection (PR #318)

Release planning, tests and certification

  • Certification machines

    • NTR for today
    • Federico will update it with the latest diracx goodies
  • Next hackathon(s)

    • not sure…
  • Federico: we will tag DIRAC v9 (and diracx, web, etc.) just after Easter

Next appointments

  • Meetings:

    • BiLD: in 6 weeks! (Easter, then DiracX hackathon)
  • WS/hackathons/conferences:

AOB


LHCbDIRAC

 
    • 10:00 10:10
      Items from Previous BiLD-Dev 10m
    • 10:10 10:20
      DIRAC Communities roundtable 10m
    • 10:20 10:30
      DIRAC releases 10m
    • 10:30 10:55
      DIRAC projects 25m
      • DIRAC
      • WebApp
      • Pilot
      • DIRACOS2
      • VMDIRAC
      • Documentation
      • OAuth2
      • DiracX
      • other externals (include Rucio)
    • 10:55 11:00
      Release planning, tests and certification 5m
    • 11:00 11:15
      Weekly development(s) focus 15m
    • 11:15 11:25
      AOB
      Convener: Federico Stagni (CERN)
    • 11:25 11:40
      LHCbDIRAC 15m