BiLD – 10/04/2025
At CERN: Federico, Alexandre, Robin, Theau, André
On Zoom: Janusz, Jorge, Xiaomei, Cedric
Apologies: Daniela, Christophe, Christopher
Follow-up from previous meetings
- Last BiLD was 2 weeks ago, March 27th
DIRAC communities roundtable
LHCb:
Federico+Alexandre+Ryun+Vladimir+Alexey+Robin+Theau
- Migrated to DIRAC v9 and diracx, and lots of other updates:
- we took everything down
- MySQL updates (optional, applied because the whole system was down anyway):
- update MySQL to 8.4
- update ROW_FORMAT to “Dynamic” (many tables were created when the default was “Compact”)
- updated character set to utf8mb4
- optimize (defrag)
- updated the passwords, adding special characters (which were NOT OK before) – PR, also backported to v8
- a few machines resized
- deployed lhcbdiracx and lhcbdiracx-web
- Update plan followed: https://codimd.web.cern.ch/dnfwITCRRTSvhopGDjlHSA?both
- effectively, more than a full week of downtime
- Restart NOT smooth - many hotfixes and quickly-merged PRs both for DIRAC and DiracX (and LHCbDIRAC and lhcbdiracx)
- a token is added to the proxy, calling diracx to get it
- SandboxStore forwards to DiracX
- Turned the WorkloadManagement/JobStateUpdate to diracx route on and off several times, with the never-tested-before JobStateUpdate legacy adapter
- 10 lhcbdiracx pods running, for now (scaled by hand)
- Issues (probably forgetting some!):
- The MySQL update scripts were partly incomplete – now should be alright
- AccountingDB (MySQL) optional updates could not complete (a few tables are too big)
- diracx client extension code was re-worked (and basically tested in production)
- All user jobs that were still in the system (or those that were added at the initial restart) failed until we put in this PR
- diracx scalability issues (should all be solved by now):
- The SandboxMetadataDB access was VERY slow (from diracx) because of a non-optimized query, fixed
- this query is specific to diracx in order to verify that a user can actually get the sandbox
- we also cleaned the DB from quite some inconsistencies
- AuthDB was missing indices
- too many (refresh?) tokens requested (DIRAC fix)
- DIRAC Framework/ProxyManager under some load, added a few instances, reflected on diracx pods – possible related issue and maybe a PR
- We are now running with the DIRAC JobStateUpdate services (not the diracx ones)
- We realized that pilots could not e.g. update the jobs status – an issue was opened and a PR opened and closed (??)
- for the moment, we gave the pilot the JobAdministrator property
- Too many DNS requests from the DIRAC SandboxStore machine to the lhcbdiracx pods – not fully resolved, not clear who’s at fault
- (we discovered that) Pilots can no longer use host certificates. The LHCb HLT farm was relying on that, so now it first downloads the pilot proxy (with a token inside).
- The diracx /api/jobs/status route sometimes (?) stores local time instead of UTC, fixed with PR ?
- Decent monitoring quickly became important. The OpenShift one is alright-ish but OTEL is needed, so a PR was created
- Status:
- went through a few releases, now running with alpha versions, and hotfixes
- we are now running “almost everything”. Went up to 100k running jobs yesterday
- more issues still coming up one-by-one but now “running”
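The optional MySQL updates listed in the LHCb report (ROW_FORMAT, character set, optimize) can be sketched as generated SQL. This is only an illustration and not the actual LHCb update scripts: the helper names are hypothetical, and `JobDB`/`Jobs` are used purely as example inputs. Each statement rebuilds the table, so run time grows with table size.

```python
from typing import List

# Query to list tables still using the old "Compact" row format in one schema.
# The %s placeholder is meant to be bound by the DB driver (assumption).
FIND_COMPACT_TABLES = (
    "SELECT TABLE_SCHEMA, TABLE_NAME "
    "FROM information_schema.TABLES "
    "WHERE ROW_FORMAT = 'Compact' AND TABLE_SCHEMA = %s;"
)


def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def maintenance_statements(schema: str, table: str) -> List[str]:
    """Return the three maintenance statements for one table (hypothetical helper)."""
    target = f"{quote_identifier(schema)}.{quote_identifier(table)}"
    return [
        # Switch the row format to Dynamic
        f"ALTER TABLE {target} ROW_FORMAT=DYNAMIC;",
        # Convert existing columns and the table default to utf8mb4
        f"ALTER TABLE {target} CONVERT TO CHARACTER SET utf8mb4;",
        # Defragment the table
        f"OPTIMIZE TABLE {target};",
    ]


if __name__ == "__main__":
    for statement in maintenance_statements("JobDB", "Jobs"):
        print(statement)
```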
ILC/Calice/FCC
André
Belle2
Hideki
Juno:
Xiaomei
GridPP:
Janusz
CTAO
Natthan
- Now using the Let's Encrypt service provided by our cluster to generate signed certificates
Topics from GitHub discussions and bots
- only un-answered DIRAC and DiracX topics with discussion updates:
DIRAC releases
- v8.0.71
- No new tags created since last time for v8
- v9
DIRAC projects
DIRAC:
Issues by milestone:
Other issues:
PRs discussed:
WebApp:
- 1 more pre-release made
- from previous meeting: one draft PR
Pilot:
- PR feat: Adding JWT support alongside X509 auth
- the new Pilot command can call directly the route, no need to use the CLI
- the integration tests for this will be set up once diracx is updated with the connected diracx PR
- for this to be done neatly, the branching strategy of diracx will need to be defined.
DIRACOS:
Documentation:
OAuth2:
management
diraccfg
DB12
Rucio
Tests
- from previous meeting: Federico started adding Rucio to DIRAC integration tests
DiracX:
Issues
PRs discussed:
DiracX-charts:
- Merged PRs for using DB passwords with special characters
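A common reason special characters in DB passwords cause trouble is that the password is embedded in a connection URL, where characters such as `@`, `:`, `/` and `#` have a structural meaning. The minutes do not say how the merged PRs handle this; the sketch below only illustrates percent-encoding under that assumption. `build_db_url`, the `mysql+aiomysql` driver string, the host name and the credentials are all hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def build_db_url(user: str, password: str, host: str, port: int,
                 database: str, driver: str = "mysql+aiomysql") -> str:
    """Build a DB connection URL, percent-encoding the user and password."""
    # safe="" makes quote() encode every reserved character, including "/"
    return (
        f"{driver}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    url = build_db_url("dirac", "p@ss:w/rd#1", "db.example.org", 3306, "JobDB")
    print(url)
    # The URL still parses cleanly, and the password round-trips after decoding
    parts = urlsplit(url)
    print(parts.hostname, parts.port, unquote(parts.password))
```

Without the encoding step, the `@` in the password would be read as the separator before the host, and the `#` would start a URL fragment.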
DiracX-web:
- feat: enable remote backend connection (PR #318)
Release planning, tests and certification
Next appointments
AOB
LHCbDIRAC