BiLD-Dev
Europe/Zurich
Description
Bi-Weekly "Loyal" DIRAC developers meeting.
And, following, the LHCbDIRAC developers meeting.
BiLD (Bi-weekly DIRAC Development meeting) – 28/05/2020
At CERN: Nobody, of course!
On Vidyo: Federico, Andrei, Andrii, André, Christopher, Christophe, Igor, Daniela, Hideki, Janusz, Alexandre, Simon, Cedric, Aymane, Vladimir
Apologies: Marko
Follow-up from previous meeting
NTR
DIRAC communities roundtable
GridPP:
- Working on v7r1 in certification setup
- not going very well
- Solved the issues with the proxy and M2Crypto
- Switched to HTCondorCEs at Imperial; submitting from DIRAC (v6r22) leaves lots of jobs in “Held” state: proxy keeps expiring?
- It is not clear if this is due to DIRAC or the CE
CLIC:
- Tried dirac-management for creating tarballs inside our CI
- Re-added -D/--destination option for outputting tarballs
- Works
- Xroot5 in diracos: Marko working on it
LHCb:
- M2Crypto issues: all appear to be fixed
- one flag is still needed at high scale; it might become the default in one or two patches
- Pilot3 files: also available on an S3-based web server
- working fine
- Running on HPCs: CINECA (all “standard” but with fat KNL nodes). SDumont (no CE, SLURM, fat nodes, limited CPUTime)
- CINECA looks OKish but there are some doubts on how DB12 works there. Also, several jobs are killed by the watchdog, to be investigated.
- One issue on SDumont: the computation of the CPU time left needs to be fixed (see discussions in https://github.com/DIRACGrid/DIRAC/issues/4544)
France Grilles:
- Pilot3?
- Not yet
- Strong request came to maintain the REST interface (RESTDirac extension)
- maintaining it is a nuisance right now (it is on a separate machine)
- Christophe: once the core of DIRAC talks HTTPS, it will be trivial
- Andrei: nevertheless we have an operational issue right now.
EGI:
- Check-in being tested: resolved a few bugs on our side
- Running v7r0 in production
- Development machine is based on v7r1
Belle2:
- Migration to v6r22 ongoing
- Thinking about moving to Pilot3:
- Question about how to provide the Pilot3 files and how the pilot wrapper works.
- Federico: we can’t put pilot files on CVMFS
- The dirac-distribution container does not work with the structure of BelleDIRAC, where the web and DIRAC extensions are merged
- Rucio and DIRAC:
- certification is starting in BelleDIRAC
- mid-June: once validated, it will be committed to vanilla DIRAC
- covers all the methods that were in LCG FC (in a way this is specific to Belle2)
Nica:
- Updated DIRAC to v7r0p24
- Users need dirac-dms-* scripts in their jobs, and this is still not possible (Pilot3)
- so do users need dirac-configure?
- Daniela thinks it is linked to https://groups.google.com/forum/#!topic/diracgrid-forum/kPcMb1ZTcS0 ?
- [see discussion in Pilot3 part below]
- DIRACOS was being downloaded from lhcb-rpm.cern.ch, which effectively created a DDoS attack (Andrei should have sorted it out by uploading to DIRACOS)
- New site in Mexico (SSH CE, torque):
- problem with watchdog killing the jobs
- Federico thinks the watchdog can be disabled by touching the file DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK (see https://groups.google.com/d/msg/diracgrid-forum/tcDu1glXJcQ/hIBZbSX7AwAJ)
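A minimal sketch of the "touch a file" approach mentioned for the Mexican site. The file name is the one quoted in the minutes. The directory where the watchdog looks for it is an assumption here (the job's working directory) and should be checked against the linked forum thread. The helper name `disable_watchdog_cpu_check` is hypothetical.

```python
from pathlib import Path

# File name as quoted in the minutes.
FLAG_NAME = "DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK"


def disable_watchdog_cpu_check(directory="."):
    """Create the (empty) flag file in `directory` and return its path.

    Assumption: the watchdog looks for the flag in the job's working
    directory, so `directory` defaults to the current directory.
    """
    flag = Path(directory) / FLAG_NAME
    flag.touch(exist_ok=True)  # same effect as the shell command `touch`
    return flag
```

Calling `disable_watchdog_cpu_check()` at the start of a job payload would be the Python equivalent of running `touch DISABLE_WATCHDOG_CPU_WALLCLOCK_CHECK` in the job script.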
Juno:
- Moving from CREAM to HTCondor CEs (no issues)
- Planning to move to version 7
Current situation
DIRAC
- v6r22:
- NTR
- v7r0:
- created v7r0p25
- mostly bug fixes
- v7r1:
- v7r1p3 created, just inheriting changes from v6r22 and v7r0
- v7r2:
- Not created yet
WebApp:
- looking for a new person responsible for the WebApp, as Zoltan has left
- Tests fixed for v4r0, not v4r1 (prettier)
- Some tasks to look at
Pilot3:
- https://groups.google.com/forum/#!topic/diracgrid-forum/kPcMb1ZTcS0
- DIRACSYSCONFIG variable needed?
- etc/dirac.cfg linked to pilot.cfg?
DIRACOS:
- xroot5?
- Chris looking at it, taking over from Marko (who is on vacation). Not so easy. Evaluating the effort needed.
- Discussing how to evolve in DIRACOSv2
VM:
- Igor: small fix in PR, to be merged.
Documentation:
- https://github.com/DIRACGrid/DIRAC/pull/4622 fixed a couple of bugs
- https://github.com/DIRACGrid/DIRAC/pull/4621 can be rebased to v7r0
OAuth2:
- Being tested in EGI framework
tornado and other externals
- NTR
management
- All versions from releases.cfg uploaded to EGI CVMFS
- Andrei made a script to pick up all the packages from releases.cfg (stand-alone script)
- Still “private” but can be and should be added to this repo
- Maybe can be added to the dirac-distribution
- https://github.com/DIRACGrid/DIRAC/issues/4604
- in general: we should be doing more automated deploys
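To illustrate the idea of picking up all the packages from releases.cfg, here is a rough sketch. It is not Andrei's script: the sample content, the assumed `Releases` / `Modules` layout and the helper names are hypothetical, and a real tool would probably rely on the diraccfg parser instead of this hand-rolled one.

```python
def parse_cfg(text):
    """Parse nested 'Name { key = value }' text into nested dicts."""
    root = {}
    stack = [root]
    pending = None  # last bare name seen, waiting for its opening brace
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line == "{":
            if pending is None:
                raise ValueError("'{' without a section name")
            section = {}
            stack[-1][pending] = section
            stack.append(section)
            pending = None
        elif line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()
        elif "=" in line:
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
        else:
            pending = line
    if len(stack) != 1:
        raise ValueError("unclosed section")
    return root


def packages_per_release(cfg):
    """Map each release name to a list of (module, version) pairs."""
    result = {}
    for release, options in cfg.get("Releases", {}).items():
        pairs = []
        for item in options.get("Modules", "").split(","):
            item = item.strip()
            if not item:
                continue
            name, _, version = item.partition(":")
            # Assumption: a module without ':version' follows the release version.
            pairs.append((name, version or release))
        result[release] = pairs
    return result


# Hypothetical sample; the WebAppDIRAC version is made up for illustration.
SAMPLE = """
Releases
{
  v7r0p25
  {
    Modules = DIRAC, WebAppDIRAC:v4r0p10
  }
}
"""
```

The resulting mapping could then drive an automated upload loop, one (module, version) pair at a time.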
diraccfg
- NTR
Release planning, tests and certification
- Not much done with respect to what was discussed two BiLD meetings ago (which is still valid).
- Proposal (from GridPP): make the DIRAC certification multi-VO
- The second VO would be gridpp “catchall” VO
- We developers might (or might not) be part of it
- In general: OK, but:
- we will still run the first hackathon for v7r2 in a single-VO instance
- Daniela will write an issue in GitHub collecting what needs to be done, including:
- integration and system tests with single and multi-VO in mind
- update of trello tasks
Weekly development(s) focus
NTR
DIRAC: current PRs and tasks being worked on, or topics from Google forum
PRs:
- v7r0:
- https://github.com/DIRACGrid/DIRAC/pull/4620 (Fail HTCondorCE.getPilotOutput if workingDirectory not available or if condor_transfer_data fails ) : Pay attention to notes for PilotManager service
- v7r1:
- NTR
- v7r2:
- https://github.com/DIRACGrid/DIRAC/issues/4524 (SubLogger are not flexible enough) : move to v7r2
On issues:
- https://github.com/DIRACGrid/DIRAC/issues/4616 maybe just some documentation on when to use the variable
- https://github.com/DIRACGrid/DIRAC/issues/4609 try out the 2 proposed “solutions” (~hacks)
- https://github.com/DIRACGrid/DIRAC/issues/4578 give an error message
AOB
Next BiLD in 2 weeks.
LHCbDIRAC
- Creation of releases: should be fine now (anyway, we’ll need more automation)
- M2Crypto:
- looks fine; still running on only one machine, can be tried on more than one
- Client using M2Crypto: we should try out ourselves first
- BKK
- password updated also for the production instance
- Cert instance (INT12r) moved to 19c
- need to update the instant client on the machine (in puppet) [Chris]
- A bit more sensible docs to go to https://lbdevops.web.cern.ch/lbdevops/
- see also https://gitlab.cern.ch/lhcb-dirac/LHCbDIRAC/-/merge_requests/811#note_3480088
- http://lhcbdistributedcomputingshifter-docs.web.cern.ch/ (DCS docs) could also be moved to https://gitlab.cern.ch/lhcb-ops
- Chris B: is it OK to use a certificate for downloading user proxies?
- A: no other solutions, please document in the docs above