BiLD-Dev
Description
Bi-Weekly "Loyal" DIRAC developers meeting, followed by the LHCbDIRAC developers meeting.
Zoom: BiLD
https://cern.zoom.us/j/62504856418?pwd=TU1kb01SOFFpSDBJeWVBdU9qemVXQT09
Meeting ID: 62504856418
Passcode: 12345678
BiLD – 10/04/2025
At CERN: Federico, Alexandre, Robin, Theau, André
On Zoom: Janusz, Jorge, Xiaomei, Cedric
Apologies: Daniela, Christophe, Christopher
Follow-up from previous meetings
- Last BiLD was 2 weeks ago, March 27th
DIRAC communities roundtable
LHCb:
Federico+Alexandre+Ryun+Vladimir+Alexey+Robin+Theau
- Migrated to DIRAC v9 and diracx, and lots of other updates:
- we took everything down
- we applied optional updates, taking the opportunity while the whole system was down anyway
- MySQL updates:
- update MySQL to 8.4
- update ROW_FORMAT to “Dynamic” (many tables were created when the default was “Compact”)
- updated character set to utf8mb4
- optimize (defrag)
- updated the passwords, adding special characters (which were NOT OK) – PR, also backported to v8
- few machines resized
- deployed lhcbdiracx and lhcbdiracx-web
- Update plan followed: https://codimd.web.cern.ch/dnfwITCRRTSvhopGDjlHSA?both
- effectively, more than a full week of downtime
- Restart NOT smooth - many hotfixes and quickly-merged PRs both for DIRAC and DiracX (and LHCbDIRAC and lhcbdiracx)
- a token is added to the proxy, calling diracx to get it
- SandboxStore forwards to DiracX
- Turned on and off several times the WorkloadManagement/JobStateUpdate to diracx route, with the never-tested-before JobStateUpdate legacy adapter
- 10 lhcbdiracx pods running, for now (scaled by hand)
- Issues (probably forgetting some!):
- The MySQL update scripts were partly incomplete – now should be alright
- AccountingDB (MySQL) optional updates could not complete (few tables are too big)
- diracx client extension code was re-worked (and basically tested in production)
- All user jobs that were still in the system (or those that were added at the initial restart) failed until we put in this PR
- diracx scalability issues (should all be solved by now):
- The SandboxMetadataDB access was VERY slow (from diracx) because of non-optimized query, fixed
- this query is specific to diracx in order to verify that a user can actually get the sandbox
- we also cleaned the DB from quite some inconsistencies
- AuthDB was missing indices
- too many (refresh?) tokens requested (DIRAC fix)
- DIRAC Framework/ProxyManager under some load, added few instances, reflected on diracx pods – possible related issue and maybe a PR
- We are now running with the DIRAC JobStateUpdate services (not the diracx ones)
- We realized that pilots could not e.g. update the jobs status – an issue was opened and a PR opened and closed (??)
- for the moment, we gave the pilot the JobAdministrator property
- Too many DNS requests from the DIRAC SandboxStore machine to the lhcbdiracx pods – not fully resolved, not clear who’s at fault
- (we discovered that) Pilots can no longer use the host certificates. The LHCb HLT farm was relying on that, so now it first downloads the pilot proxy (with a token inside).
- The diracx /api/jobs/status route sometimes (?) stores local time instead of UTC, fixed with PR ?
- Decent monitoring quickly became important. The openshift one is alright-ish but OTEL is needed, so PR created
- Status:
- went through a few releases, now running with alpha versions, and hotfixes
- we are now running “almost everything”. Went up to 100k running jobs yesterday
- more issues still coming up one-by-one but now “running”
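The optional per-table MySQL updates listed above (ROW_FORMAT, character set, optimize) can be sketched as a small statement generator. This is only an illustration and not the update scripts that were actually used: the function name and the table name are hypothetical, and the code only builds the SQL text without connecting to any database.

```python
def table_update_statements(tables):
    """Build per-table SQL for the optional updates (hypothetical helper)."""
    statements = []
    for table in tables:
        # Table names are assumed to come from a trusted source (e.g. the schema itself).
        statements.append(f"ALTER TABLE `{table}` ROW_FORMAT=DYNAMIC;")
        statements.append(f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4;")
        statements.append(f"OPTIMIZE TABLE `{table}`;")
    return statements


if __name__ == "__main__":
    # "ExampleTable" is a placeholder, not a real DIRAC table name.
    for statement in table_update_statements(["ExampleTable"]):
        print(statement)
```

Each of these statements rebuilds the table, so on very large tables they can run for a long time, which fits with the AccountingDB updates not completing.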
ILC/Calice/FCC
André
- NTR
Belle2
Hideki
- NTR
Juno:
Xiaomei
- NTR
GridPP:
Janusz
- NTR
CTAO
Natthan
- Now using the Let's Encrypt service provided by our cluster to generate signed certificates
Topics from GitHub discussions and bots
DIRAC releases
- v8.0.71
- No new tags created since last time for v8
- v9
- …too many
DIRAC projects
DIRAC:
Issues by milestone:
- v8.0:
- Using cgroups to limit job resource usage
- PR is there, not finally reviewed
- v9.0:
- Nothing left in there
- After v9.0:
- Some issues being actively added
- Asked about Is PilotBundle.bundleProxy() still useful?
- apparently it is
PRs discussed:
- NTR
WebApp:
- 1 more pre-release made
- from previous meeting: one draft PR
Pilot:
- PR feat: Adding JWT support alongside X509 auth
- the new Pilot command can call directly the route, no need to use the CLI
- the integration tests for this will be set up once diracx is updated with the connected diracx PR
- for this to be done neatly, the branching strategy of diracx will need to be defined.
DIRACOS:
- NTR
Documentation:
- NTR
OAuth2:
- NTR
management
- NTR
diraccfg
- NTR
DB12
- from previous meeting https://github.com/DIRACGrid/DIRAC/issues/7760#issuecomment-2482420604
- Federico proposed to create “alternate” benchmark
Rucio
- NTR
Tests
- from previous meeting: Federico started adding Rucio to DIRAC integration tests
- –> to Janusz
DiracX:
Issues
- few open issues have been discussed briefly
- Open access and require auth not working inside a router
PRs discussed:
- Janusz prepared some suggestions in https://github.com/DIRACGrid/diracx/pull/467 based on an external review tool. It remains to be decided which of these suggestions should be kept
DiracX-charts:
- Merged PRs for using DB passwords with special characters
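One common way special characters in a DB password cause trouble is when the password is embedded unescaped in a connection URL. Whether that was the actual cause here is an assumption; the sketch below only illustrates percent-encoding the credentials, and the helper name, URL scheme, host and credentials are all hypothetical.

```python
from urllib.parse import quote_plus, unquote_plus, urlsplit


def build_db_url(user, password, host, db_name):
    """Build a connection URL with percent-encoded credentials (hypothetical helper)."""
    # Characters such as '@', ':', '/' and '#' would otherwise be read as URL delimiters.
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db_name}"


if __name__ == "__main__":
    url = build_db_url("dirac", "p@ss:w/rd#1", "db.example.org", "ExampleDB")
    parts = urlsplit(url)
    print(url)
    print(parts.hostname)                 # host is parsed correctly despite '@' in the password
    print(unquote_plus(parts.password))   # original password is recovered after decoding
```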
DiracX-web:
- feat: enable remote backend connection (PR #318)
Release planning, tests and certification
- Certification machines
- NTR for today
- Federico will update it with the latest diracx goodies
- Next hackathon(s)
- not sure…
- Federico: we will tag DIRAC v9 (and diracx, web, etc.) just after Easter
Next appointments
- Meetings:
- BiLD: in 6 weeks! (Easter, then DiracX hackathon)
- WS/hackathons/conferences:
- DiracX hackathon: 5 and 6 May - https://indico.cern.ch/event/1501369/
- few registered already
- Dirac Users’ Workshop: 17th-20th September 2025 - https://indico.cern.ch/e/duw11
- registration is open; Xiaomei added some information about accommodation. It is suggested to stay in the nearby hotels.
AOB
- fromPreviousMeeting DIRAC was invited to be an “HSF affiliated project” : https://hepsoftwarefoundation.org/projects/affiliated.html
- Andrei, André, Federico met with Edoardo and Michel Jouvin for few clarifications. Andrei will call a consortium meeting
LHCbDIRAC
- Theau is having issues running everything (lhcbdiracx-web + lhcbdiracx) on his provided computer. Federico suggests using a virtual machine, and checking with Ian whether he can get a large flavor in his personal tenant.
- fromPreviousMeeting Alexandre posted update to Moving Job finalization step from the workflow to the JobWrapper: Transition Plan for Enhancing HPC Exploitation in DIRAC/LHCbDIRAC with connected draft PR Draft: feat(wms): New LHCb workflows