DPPS Workload-BDMS workshop
-
1
Welcome and Objectives
- Overview of the workshop goals
- Key deliverables
Speakers: Mieke Bouwhuis (NIKHEF), Dr Mykhailo Dalchenko (University of Geneva), Nectarios Benekos (University of Peloponnese (GR))
Drafted Minutes
Statement of the week, citing EL (14-1-2025):
"There is nothing that can not be solved, what we are missing is meetings like this."
WMS
- Server-client
- Not necessarily a DIRAC server at each DC. For now, for the Legacy instance, we have servers in 2 DCs (CSCS, PIC) + 1 server at DESY but without DB. The deployment could also be at a single highly reliable DC, which is easier to maintain.
- CTADIRAC client on the local system submits jobs to the server; the pilot installs the client on the worker node on the fly (see the sketch below).
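A minimal sketch of the client-side submission path mentioned above, using the DIRAC Python Job API (which CTADIRAC builds on). It assumes an already configured and initialized client with valid credentials; the job name, echo payload and site names are placeholders, not the actual CTADIRAC configuration.

```python
# Minimal sketch: submit a job from a local CTADIRAC/DIRAC client to the WMS server.
# Assumes an initialized DIRAC client environment (configuration + proxy/token).
# Job name, executable and site names below are placeholders.
from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("dpps-wms-smoke-test")  # hypothetical job name
job.setExecutable("/bin/echo", arguments="hello from a DPPS worker node")
job.setDestination(["ARC.CSCS.ch", "HTCONDOR.PIC.es"])  # hypothetical site names

dirac = Dirac()
result = dirac.submitJob(job)
if result["OK"]:
    print("submitted job", result["Value"])
else:
    print("submission failed:", result["Message"])
```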
- DCs
- DCs currently: ARC, HTCondor
- Frascati: slurm via ssh tunnel
- In the offsite agreement: should we require ARC?
- Renew the survey from N. Neyroud; if 3 out of 4 DCs use HTCondor, we can ask Frascati to run HTCondor
- Frascati DC is built from scratch and they don't have expertise in these technologies; Slurm was the easiest. Padova is currently helping them; with this support they could also run HTCondor
- discuss the DC requirements with SS; we don't have direct contacts at the moment
- also use of k8s; it is not a requirement
- do we agree with Slurm over SSH? We would then have to negotiate with the DCs. HTCondor is preferred, ARC is older. We need to discuss this with the offsite ITC work package
- LA is in favour of using ARC or HTCondor: better for maintenance; Slurm is not widely used and not maintained either
- Deployments
- k8s deployment compliant with Rel 0.0
- 2 VMs + k8s not ready yet, but the deployment for the next release is ready
- PBolle runs OpenSearch for DIRAC on his cluster; is that for the test VMs?
- we asked DESY for OpenSearch; it is not only for testing, it is only used for logging and monitoring
- we have to include it in the Helm charts
- there is a ticket about this, also to monitor things in Grafana
- DBs (more general, not related only to WMS)
- when we know what we want, we can include things in the SLAs with the DCs
- more critical are the DBs; it is important that they are provided by the DC: backups, service, maintenance
- questionnaire to the DCs: N. Neyroud made one in 2019; we can already ask for MySQL or MariaDB, required by DIRAC (?), fully open source; the questionnaire needs to be redone
- is it going to be a shared DB service, or can we also use a distributed DB?
- set up the requirements, collect our own knowledge about DBs; this needs a dedicated meeting
- other systems like SUSS and SOSS need to be included as well; they are in the design phase and we should include them in the discussion
- MF (SUSS) is already busy with this and is on top of the SKAO developments regarding this
- first we have to agree among ourselves, then afterwards negotiate with the DCs
- we already have a list of DBs, some are already set: MySQL and PostgreSQL
- Legacy instance
- be able to run the pipelines for prod6; problem of 2 data management systems
- is there a reason to run new pipelines on the old system?
- if we can substitute with DPPS easily, we can use the new system
- the FileCatalogDB will change from DIRAC to Rucio
- we can register the files in Rucio without moving them; that would be easier than maintaining two systems: this should be tested (see the sketch below)
- we will start the current version of WMS and BDMS with its own storage scopes
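A rough sketch of what registering existing files in Rucio without moving them could look like, using the Rucio client's replica API. The scope, RSE name, dataset name, file name, size and checksum are invented placeholders, not the real prod6 layout.

```python
# Sketch: register an existing file (already on a storage element) in Rucio
# without moving it, by declaring a replica on the RSE that already holds it.
# Scope, RSE, dataset, file name, size and checksum below are made up.
from rucio.client import Client

client = Client()
client.add_replicas(
    rse="CSCS-DISK",  # hypothetical RSE already holding the file
    files=[
        {
            "scope": "prod6",
            "name": "/ctao/MC/prod6/gamma/run00001.simtel.zst",
            "bytes": 123456789,
            "adler32": "deadbeef",
        }
    ],
)
# Optionally attach it to a dataset (assumed to exist already) so rules and
# metadata can be managed at dataset level.
client.add_files_to_dataset(
    scope="prod6",
    name="prod6_gamma_dl0",
    files=[{"scope": "prod6", "name": "/ctao/MC/prod6/gamma/run00001.simtel.zst"}],
)
```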
- Offsite DCs
- Galera clusters are independent; this is an issue when e.g. a DC powers down; then we need backups and an agreement with the DCs on accepted downtime
- a request proxy is available, not used by us; it allows a temporary shutdown and, once back online, restores all information; we can ask the DIRAC team about it in the next days
- we need the redundancy: if you lose the DC that hosts the DB, you must be able to replicate; maybe update the requirements for DPPS; it should be described in the dependencies, and there should be requirements that go into the SLAs, e.g. how much downtime we can afford
- can we have a shared cluster distributed over different DCs?
- k8s decides how to distribute the different servers over the clusters
- we agreed that there is no advantage to having a shared cluster managed by k8s distributed over different DCs. k8s would manage the deployment at the level of each DC (there could be more than one). We could for instance deploy part of the services on one DC (e.g. DIRAC WMS) and part on another (e.g. DIRAC Transformation/Production System), or all services on a single DC.
- Test instance
- try the update to v9 for this hackathon; migrations are done one by one by the DIRAC community until it has become DiracX
- v9 should also work with k8s, try it now; restore from backup is part of the use cases
- which DPPS release is going to have DiracX?
- the releases of DIRAC are not fixed
- DIRAC is very modular; components are updated per module
- we can use the current DIRAC to switch to IAM, just because it supports certificates; tokens are not completely working yet; we could switch to IAM now
- CWL integration is multi-stage; for DiracX they write new microservices during v9, then the update of the API will be DiracX
- we should put in the plan what is going to be in the different releases
- the DIRAC releases aren't clear; developments are user community contributions; we could also contribute, MD could contribute to CWL
- question whether we should add support in the current DIRAC or focus on support for DiracX; the person who does the CWL in DIRAC is about to leave
- Rel 0.0
- everything for Rel 0.0 is ready
- for WMS, SonarQube reported a lot of issues; do we go for the pragmatic solution?
- how we deal with SonarQube should be in the release plan, otherwise this is going to block the release
- we require that new developments are of good quality
- not all warnings are that important, like naming conventions
- open an issue so we can take a decision; we can adapt the SonarQube profile, the messages are not always appropriate
- same in ACADA: it is a problem when there are many errors/warnings, people then ignore them; we should look at the SonarQube defaults
- Rel 0.1
- focus on CWL support in WMS, it worked for CalibPipe
- error and status info is already done
- integration with BDMS: this is actually in Rel 0.0
- we are rediscussing the UCs; we don't want separate UCs depending on where you want to deploy
- sometimes the test report requires comments by hand
- do you want to connect the staging environment to external storage/computing? That would be an extension of the UC
- for the next release use IAM, because DIRAC v9 will use IAM
- for internal usage set up an IAM service; CNAF provides the service at the moment; we could install it ourselves, it would be preferable to have it in the CI
- we can ask for a test user
- DIRAC has its own IAM service
- we should do the integration with SimPipe
- Rel 1.0
- requires more work; uses the metadata support from DIRAC; maybe not included in this release
- scatter-gather features: not supported in DIRAC now, but DiracX will support them
- Rel 1.0 is not the same as M1; Rel 1.0 is Q3 2025
- we need to contribute to DIRAC to get the CWL support in there; otherwise we have no control over their schedule
- Rel 1.2
- a Transformation consists of many jobs, each doing the same thing but handling different input
- how do you trigger the next transformation, and how do you know that the things required for the next step are finished?
- it is data driven, not implemented yet; for the training we now wait for the training to finish; you could check if the metadata are complete; you still need monitoring
- we now have an agent implemented that checks whether transformations are finished, in order to trigger the next one
- do you want to have a separate production that does the gathering?
- get the obs IDs from ACADA, then run the production on these; then you know which metadata to check to start the rest
- how manual is the bookkeeping? Who is responsible when the production fails?
- in some cases automatic reschedule; we started to implement scenarios of DC problems
- CTAO hiring somebody specifically for this
- in CWL, jobs can end up in 3 states; you can indicate the error codes
- is CTADIRAC not specific to CTAO?
- in LHCb they did similar developments but too specific to their use cases; that is why CTADIRAC is generalized as far as it is not only applicable to CTAO
- what is the role of the person who will do this?
- SDMC manager/operator? Someone who presses the button, a data manager?
- if we talk about testing, it is the AIV people
- we have to understand what these people should expect, they need procedures
- it is too early for these procedures
- we should write this down by the end of this year: what should be automated or not; there will be stages in doing this
- data quality monitoring also plays a big role in this
- which metadata do you have for MC?
- particle type, prod campaign, pointing direction, zenith angle, subset for training or not, CTA site
- we apply a metadata query in the transformation (see the sketch after this list)
- when you have processed a file, does the transformation stop?
- it stays alive; in some cases we may want to stop the transformation when no more files are expected. We can extend the agent to have more options than what is implemented now
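A rough sketch of the kind of metadata query a data-driven transformation could apply to select its MC input, here via the DIRAC File Catalog client. The metadata keys, values and path are illustrative only and do not reflect the agreed CTAO metadata schema; an initialized DIRAC environment is assumed.

```python
# Sketch: select MC input files for a transformation by metadata query
# against the DIRAC File Catalog. Keys, values and the base path are
# placeholders, not the agreed CTAO metadata schema.
from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient

fc = FileCatalogClient()
query = {
    "particle": "gamma",
    "prod_campaign": "prod6",
    "zenith_angle": 20,
    "training_subset": True,
    "site": "North",
}
result = fc.findFilesByMetadata(query, path="/ctao")
if result["OK"]:
    lfns = result["Value"]
    print(f"{len(lfns)} files match the query, e.g. {lfns[:3]}")
else:
    print("query failed:", result["Message"])
```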
- Monitoring
- the appearance of a file can trigger a new transformation
- Rel 2.0
- Now using the DIRAC metadata catalogue; this will need to change to Rucio
- a push/polling mechanism is now being used
- we don't need to poll every few seconds
- OpenSearch has not been working for months at DESY (at CC-Lyon it does work)
- VOMS decommissioning in March: what will happen to the prod files? IAM certificates work, storage elements work, compute elements work
- do we need to change the mapping of the files?
- the IAM instance at CNAF is used; job submission works with tokens; storage elements do not, only with a certificate proxy, which also works with the IAM instance
- the current MoU will be for the current instance; for production we need a new MoU; the only difference will be response time and number of users
- users need to be migrated to IAM
- non-active users will just not be migrated
BDMS
- RethinkDB is no longer actively developed; it only gets bug fixes and there will be no new features; if we think it has enough features we can use it, if we need other features we can change to another DB
- why would CTAO be in the high size - high complexity region, more than other experiments in the (Rucio) community? LHC experiments have even more data and higher complexity
- Rucio is actually in the RDBMS range
- if we don't use SQL for metadata but a document-based DB, why restrict only to the choices presented?
- PostgreSQL and MongoDB are even supported, like in ACADA
- there are many document-based DBs, why this one?
- that is considering only the complexity of the problem; in the diagram, PostgreSQL allows storing data in a document-oriented way inside a relational database
- what prevents us from using a relational DB?
- none
- BDMS is not released yet, just the policy part; the rest still needs to be done, also for the other subpackages
- why can't the complete branch be merged into the main branch?
- because we differ in what needs to be done
- everything in main needs to work: documentation, testing, added functionality
- a full-branch integration is impossible to test
- it indeed runs on the test cluster, but what does it add to the main branch?
- in practice no prototype is compliant; take pieces from each prototype, merge them piece by piece and run them in the CI/CD
- for Rel 0.0 everything in the main branch is ready
- for other releases the Rome group wants to include the whole prototype
- next steps: continue in tiny steps, broken down into a list of things that need to be done
- instead of working on the small proposals, people are going back to the entire system
- documentation still needs to be written; it will have to be done next week
- then the other subpackages need to be tagged and we can release
- the release will consist of Helm charts and test reports
- usually: open a branch, make an MR, describe it and tag the people
- do this for whatever you are working on
- change the due date when dates have passed
- discuss the open items at the TCCs, fix issues
- feel free to push stuff, even if not perfect
- we need a handbook on how to handle Git, reviews, MRs, issues, etc.
- MRs shouldn't be so large; they take too long to review
- when can I ask people to review?
- anybody: they can indicate they don't have time and can reassign
- should feel free to assign anybody
- review required, manual check needed
- we are probably not getting the Ultimate edition of GitLab
Todo:
- Discuss how to proceed with BDMS after Rel0.0
- Finish up Rel 0.0 issues for BDMS
- Set up Git policy
- Set up concise GitLab/DevOps wiki
Metadata presentation
- what is the meaning of "put data to metadata plugin"?
- to indicate that we use the plugin
- are the data set in both systems, not just in Rucio?
- part in Rucio, part in RethinkDB
- the default database with the Rucio plugin is PostgreSQL
- when splitting the metadata: how do you ensure that you have stored them consistently?
- what is shown are files with metadata, but they don't specify a link to the schema, like a JSON schema; we should be able to validate the metadata linked to files (see the sketch below)
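A minimal sketch of validating per-file metadata against a JSON Schema with the standard jsonschema package. The schema and metadata fields below are invented for illustration; in practice the schema would come from the metadata description document.

```python
# Sketch: validate the metadata attached to a file against a JSON Schema
# before ingesting/registering it. Schema and fields are invented for
# illustration; the real schema would come from the metadata description doc.
import jsonschema

DL0_METADATA_SCHEMA = {
    "type": "object",
    "required": ["obs_id", "tel_id", "start_time", "data_category"],
    "properties": {
        "obs_id": {"type": "integer", "minimum": 0},
        "tel_id": {"type": "integer"},
        "start_time": {"type": "string"},
        "data_category": {"type": "string", "enum": ["A", "B", "C"]},
    },
    "additionalProperties": True,
}

metadata = {
    "obs_id": 42,
    "tel_id": 1,
    "start_time": "2025-01-14T21:00:00Z",
    "data_category": "A",
}

# Raises jsonschema.ValidationError if the metadata do not match the schema.
jsonschema.validate(instance=metadata, schema=DL0_METADATA_SCHEMA)
print("metadata valid")
```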
- ACADA-DPPS interface: ACADA tells us there is a file in a certain folder, and changes ownership
- a list of files is provided as a summary (not available yet from ACADA, nobody is working on that)
- generally there are two cases where we ingest: from ACADA, and runs from the WMS
- presented here: what is done for the current prototype
- questioning the release numbers used for this prototype, in relation to the DPPS releases
- if Rucio already provides what we need, why do we need to implement it?
BIG question marks about:
- what will be implemented
- who is taking responsibility for the MRs
- who is taking responsibility for the tags
- who is responsible for the BDMS repo
- who to get approvals from
- Queries
- we have a requirement that all metadata are queryable, for DPPS most queries are known well ahead of time
- except: the current requirements state that the reply should come within 1 sec (?)
- was the data model document used for the prototype?
- the metadata description doc is a draft, which is not used here; nothing in common here
- Changefeed
- is the push-method agent supported by DIRAC?
- there is no requirement regarding the changefeed feature
- the archive should not do anything; other jobs make additions to the archive, not the archive itself
- arguments for using RethinkDB?
- question whether we should separate the metadata DB
- do not split the software, only the backend
- a separated setup will be difficult to synchronize
- Rucio plugin
- we need to write a new plugin for DIRAC no matter what; can you hide the RethinkDB behind the Rucio API? Can we have it compliant with the Rucio API?
- this is part of the Oscars LAPP project, see presentation
- the question will arise why it is not compliant with Rucio
- we should test the stuff on the existing backends, and only move to RethinkDB if we don't meet the requirements (see the sketch at the end of this subsection)
- if the existing ones don't suffice we can move to an alternative
- do you have to prepare the DB architecture beforehand? Can't you make queries that are not compatible?
- this can be of general interest for Rucio
- should we split the user metadata from the default metadata, and then put the user metadata in a separate relational database?
- it is required that we define the UCs, NOW, and the requirements
- difference between requirements for DPPS and SUSS
- there is stuff to do for DIRAC on the Rucio plugin; who will do it?
- Belle II has an MR on this
- first find out the status, see how much work it is; LAPP could contribute to this
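A small sketch of how the metadata tests could be written against the generic Rucio client API, so the same test runs unchanged on the existing backends (default plugin on the relational DB, or another configured plugin) before any move to RethinkDB is considered. Scope, DID name and metadata keys are placeholders.

```python
# Sketch: exercise metadata through the generic Rucio client API, so the same
# test can run against whichever metadata plugin/backend the server uses.
# Scope, DID name and metadata keys below are placeholders.
from rucio.client import Client

client = Client()
scope, name = "prod6", "/ctao/MC/prod6/gamma/run00001.simtel.zst"

# attach metadata to the DID
client.set_metadata(scope=scope, name=name, key="particle", value="gamma")
client.set_metadata(scope=scope, name=name, key="zenith_angle", value=20)

# read it back
print(client.get_metadata(scope=scope, name=name))

# query DIDs by metadata filter (which filters are supported depends on the
# configured metadata plugin/backend)
for did in client.list_dids(scope, filters={"particle": "gamma"}, did_type="file"):
    print(did)
```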
Swiss contribution BDMS
- interface BDMS and SUSS
- you have a data product that will be handed over to SUSS
- it is not queried, it is all handed over to SUSS
- according to boss, all data need to be queryable and available for every user
- at the moment these archives are separated
- not wise to start with a single archive; maybe keep it in mind, MF is taking this into account
- we only support restricted queries
- the bulk archive does not require fine-grained authorization
- Discuss:
- DPPS-SUSS interface: discuss archive and science separately
- the Swiss group should be included in this discussion
- has the replica deletion been tested on site?
- we can do this on the North site by spinning up a Docker Compose setup
- aren't we getting an installation on the North site, with the same functionality as the DCs?
- central services can be deployed at the DCs; onsite, the instance is only composed of storage, worker nodes and access to CVMFS
- it can't be the same as a DC
- not a full DPPS installation
- what if we lose the connection with the site?
- so a minimal installation: just a Rucio instance and a Rucio client
- you need an RSE on site
- We need:
- a strategy on where we store the data, QoS, access, staging, ...
- write a doc, define the strategy; who??
- for cat B we need to install a lot of stuff onsite anyway
- you only need CVMFS, worker nodes, an RSE for the data and the DL0 ingestion client; a very minimalistic deployment
- latency between server and client is not an issue on site
- what if the CVMFS connection is broken?
- the main thing is to replicate the data
- didn't we previously talk about dropping category B?
- not everybody agrees
- the difference between cat A and cat B can be large, in terms of calibration
- Slide 9: why use a scope for storing temporary data sets?
- we use it to look things up efficiently
- then use a directory structure instead of a scope; scope is more for access
- you can set rules on the LFNs (the directory structure is used in DIRAC and is mapped to data sets etc. in Rucio; DIRAC does not know anything else); see the sketch below
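A rough sketch of how an onsite dataset could be protected by a Rucio replication rule before its onsite replica is deleted. The scope, dataset name and RSE expression are placeholders and do not reflect the actual CTAO RSE naming or attributes.

```python
# Sketch: ask Rucio to keep one replica of an onsite dataset at an offsite DC;
# once the rule is satisfied, the onsite replica can be cleaned up.
# Scope, dataset name and RSE expression below are placeholders.
from rucio.client import Client

client = Client()
rule_ids = client.add_replication_rule(
    dids=[{"scope": "acada_north", "name": "obs_2025_01_14_dl0"}],
    copies=1,
    rse_expression="tier=1&type=DISK",  # hypothetical RSE attributes
    lifetime=None,        # keep until explicitly removed
    grouping="DATASET",   # keep the whole dataset together on one RSE
)
print("created rule(s):", rule_ids)
```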
Missing in BDMS group:
- Central discussion
- Consensus on design and architecture (and accepting that Rucio is the chosen technology)
- Discussion on requirements, UCs
- Keeping to the general guidelines of development, testing and integration
- Documentation
Concrete BDMS actions:
- organise BDMS meetings
- for DBs: make a table of the different strategies, do benchmarking (à la the Oscars LAPP group), take out the subjectiveness (see the sketch after this list)
- keep the Monday-afternoon developers' meetings and get used to the GitLab and DevOps way of working
- communicate where the documentation is
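A minimal sketch of the kind of benchmarking harness that could back such a comparison table, timing metadata queries against the roughly 1-second response requirement mentioned earlier. `run_query` is a placeholder to be bound to each candidate backend (Rucio default plugin, PostgreSQL, RethinkDB prototype, ...).

```python
# Sketch of a tiny benchmarking harness for comparing metadata backends
# against the ~1 s query-response requirement mentioned above.
# `run_query` is a placeholder callable bound to each candidate backend.
import statistics
import time


def benchmark(run_query, repetitions=20):
    """Time a metadata query callable; report median and worst latency in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


if __name__ == "__main__":
    # placeholder query standing in for a real backend call
    example = benchmark(lambda: sum(range(10_000)))
    print(example, "requirement: ~1 s per query")
```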
Other topics
- reasoning behind the categorization of the use cases:
- Use cases describe a functionality. The DPPS functionalities are defined in the DPPS function tree. Use cases are divided into these groups based on the functionality they describe.
- For UCs in general: place the UC in the function group that belongs to the functionality described in the UC
- Hence, for the specific case of the UCs in Rel 0.0, that describe the deployment of services: this is the functionality "Manage DPPS services"
- Release manager checks the release report; add this to the release procedure
- usage of labels in GitLab, templates in Git; check with ACADA
- more frequent AIV meetings, focused on releases and more focused on open issues
- the QA plan and SW lifecycle doc need to be revived, but are not required for Rel 0.0
- Revise the UC template
- the discussion on monitoring and log data from ACADA is becoming more urgent
-
2
Workload Status and Plans
Updates from the Workload team.
• Challenges and plans for (CTA)DIRAC integration.
• Objectives for the DIRACX hackathon.
Speakers: Luisa ARRABITO (LUPM IN2P3/CNRS), Natthan PIGOUX
10:30
Coffee
-
3
BDMS Status and Plans
Updates from the BDMS team.
• Challenges and plans for RUCIO integration.
• Optimizing workflows for data management.
Speakers: Etienne Lyard, Georgios Zacharis, Stefano Gallozzi (INAF, Osservatorio Astronomico di Roma), Syed Anwar Ul Hasan
4
Metadata needs for CTAO
Rehearsal of the contribution to DIRAC-Rucio miniworkshop
Speakers: Frederic Gillardo (Centre National de la Recherche Scientifique (FR)), Georgios Zacharis
12:00
Lunch
-
5
DIRAC-RUCIO integration needs for CTAO
Speaker: Maximilian Linhoff (TU Dortmund | CTAO)
-
6
Getting Rucio and DIRAC deployed using K8s
Speaker: Dr Volodymyr Savchenko (EPFL, Switzerland / Department of Astronomy, University of Geneva)
-
15:30
Coffee
-
7
Hackathon preparation
• Defining action items for the DIRACX hackathon.
• Setting priorities for the DIRAC-RUCIO hackathon.
-
8
Metadata handling in RUCIO: gaps and opportunities
- Discuss the metadata handling options
- Establish metrics for performance testing
- Implement prototypes
-
12:00
Lunch
-
9
BDMS prototyping
- Continued prototyping and testing.
-