DPPS Workload-BDMS workshop
-
1
Welcome and Objectives
- Overview of the workshop goals
- Key deliverables
Speakers: Mieke Bouwhuis (NIKHEF), Dr Mykhailo Dalchenko (University of Geneva), Nectarios Benekos (University of Peloponnese (GR))
Drafted Minutes
Statement of the week, citing EL (14-1-2025):
"There is nothing that can not be solved, what we are missing is meetings like this."
WMS
- Server-client
- Not necessarily a DIRAC server at each DC. For now, for the Legacy instance, we have servers in 2 DCs (CSCS, PIC) + 1 server at DESY but without DB. The deployment could also be at a single highly reliable DC, which is easier to maintain.
- CTADIRAC client on the local system submits jobs to the server; the pilot installs the client on the worker node on the fly (see the sketch below).
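A minimal sketch of the client-side submission path mentioned above, using the DIRAC Python Job API (which CTADIRAC builds on). It assumes an already configured and initialized client with valid credentials; the job name, echo payload and site names are placeholders, not the actual CTADIRAC configuration.

```python
# Minimal sketch: submit a job from a local CTADIRAC/DIRAC client to the WMS server.
# Assumes an initialized DIRAC client environment (configuration + proxy/token).
# Job name, executable and site names below are placeholders.
from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("dpps-wms-smoke-test")  # hypothetical job name
job.setExecutable("/bin/echo", arguments="hello from a DPPS worker node")
job.setDestination(["ARC.CSCS.ch", "HTCONDOR.PIC.es"])  # hypothetical site names

dirac = Dirac()
result = dirac.submitJob(job)
if result["OK"]:
    print("submitted job", result["Value"])
else:
    print("submission failed:", result["Message"])
```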
- DCs
- DCs currently: ARC, HTCondor
- Frascati: slurm via ssh tunnel
- In the offsite agreement: should we require ARC?
- Renew the survey from N. Neyroud; if 3 out of 4 DCs use HTCondor, we can ask Frascati to run HTCondor
- Frascati DC is built from scratch and they don't have expertise in these technologies; Slurm was the easiest. Padova is currently helping them; with this support they could also run HTCondor
- discuss the DC requirements with SS; we don't have direct contacts at the moment
- also use of k8s; it is not a requirement
- do we agree with Slurm over SSH? We would then have to negotiate with the DCs. HTCondor is preferred, ARC is older. We need to discuss this with the offsite ITC work package
- LA is in favour of using ARC or HTCondor: better for maintenance; Slurm is not widely used and not maintained either
- Deployments
- k8s deployment compliant with Rel 0.0
- 2 VMs + k8s not ready yet, but the deployment for the next release is ready
- PBolle runs OpenSearch for DIRAC on his cluster; is that for the test VMs?
- we asked DESY for OpenSearch; it is not only for testing, it is only used for logging and monitoring
- we have to include it in the Helm charts
- there is a ticket about this, also to monitor things in Grafana
- DBs (more general, not related only to WMS)
- when we know what we want, we can include things in the SLAs with the DCs
- more critical are the DBs; it is important that they are provided by the DC: backups, service, maintenance
- questionnaire to the DCs: N. Neyroud made one in 2019; we can already ask for MySQL or MariaDB, required by DIRAC (?), fully open source; the questionnaire needs to be redone
- is it going to be a shared DB service, or can we also use a distributed DB?
- set up the requirements, collect our own knowledge about DBs; this needs a dedicated meeting
- other systems like SUSS and SOSS need to be included as well; they are in the design phase and we should include them in the discussion
- MF (SUSS) is already busy with this and is on top of the SKAO developments regarding this
- first we have to agree among ourselves, then afterwards negotiate with the DCs
- we already have a list of DBs, some are already set: MySQL and PostgreSQL
- Legacy instance
- be able to run the pipelines for prod6; problem of 2 data management systems
- is there a reason to run new pipelines on the old system?
- if we can substitute with DPPS easily, we can use the new system
- the FileCatalogDB will change from DIRAC to Rucio
- we can register the files in Rucio without moving them; that would be easier than maintaining two systems: this should be tested (see the sketch below)
- we will start the current version of WMS and BDMS with its own storage scopes
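A rough sketch of what registering existing files in Rucio without moving them could look like, using the Rucio client's replica API. The scope, RSE name, dataset name, file name, size and checksum are invented placeholders, not the real prod6 layout.

```python
# Sketch: register an existing file (already on a storage element) in Rucio
# without moving it, by declaring a replica on the RSE that already holds it.
# Scope, RSE, dataset, file name, size and checksum below are made up.
from rucio.client import Client

client = Client()
client.add_replicas(
    rse="CSCS-DISK",  # hypothetical RSE already holding the file
    files=[
        {
            "scope": "prod6",
            "name": "/ctao/MC/prod6/gamma/run00001.simtel.zst",
            "bytes": 123456789,
            "adler32": "deadbeef",
        }
    ],
)
# Optionally attach it to a dataset (assumed to exist already) so rules and
# metadata can be managed at dataset level.
client.add_files_to_dataset(
    scope="prod6",
    name="prod6_gamma_dl0",
    files=[{"scope": "prod6", "name": "/ctao/MC/prod6/gamma/run00001.simtel.zst"}],
)
```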
- Offsite DCs
- Galera clusters are independent; this is an issue when e.g. a DC powers down; then we need backups and an agreement with the DCs on accepted downtime
- a request proxy is available, not used by us; it allows a temporary shutdown and, once back online, restores all information; we can ask the DIRAC team about it in the next days
- we need the redundancy: if you lose the DC that hosts the DB, you must be able to replicate; maybe update the requirements for DPPS; it should be described in the dependencies, and there should be requirements that go into the SLAs, e.g. how much downtime we can afford
- can we have a shared cluster distributed over different DCs?
- k8s decides how to distribute the different servers over the clusters
- we agreed that there is no advantage to having a shared cluster managed by k8s distributed over different DCs. k8s would manage the deployment at the level of each DC (there could be more than one). We could for instance deploy part of the services on one DC (e.g. DIRAC WMS) and part on another (e.g. DIRAC Transformation/Production System), or all services on a single DC.
- Test instance
- try the update to v9 for this hackathon; migrations are done one by one by the DIRAC community until it has become DiracX
- v9 should also work with k8s, try it now; restore from backup is part of the use cases
- which DPPS release is going to have DiracX?
- the releases of DIRAC are not fixed
- DIRAC is very modular; components are updated per module
- we can use the current DIRAC to switch to IAM, just because it supports certificates; tokens are not completely working yet; we could switch to IAM now
- CWL integration is multi-stage; for DiracX they write new microservices during v9, then the update of the API will be DiracX
- we should put in the plan what is going to be in the different releases
- the DIRAC releases aren't clear; developments are user community contributions; we could also contribute, MD could contribute to CWL
- question whether we should add support in the current DIRAC or focus on support for DiracX; the person who does the CWL in DIRAC is about to leave
- Rel 0.0
- everything for Rel 0.0 is ready
- for WMS, SonarQube reported a lot of issues; do we go for the pragmatic solution?
- how we deal with SonarQube should be in the release plan, otherwise this is going to block the release
- we require that new developments are of good quality
- not all warnings are that important, like naming conventions
- open an issue so we can take a decision; we can adapt the SonarQube profile, the messages are not always appropriate
- same in ACADA: it is a problem when there are many errors/warnings, people then ignore them; we should look at the SonarQube defaults
- Rel 0.1
- focus on CWL support in WMS, it worked for CalibPipe
- error and status info is already done
- integration with BDMS: this is actually in Rel 0.0
- we are rediscussing the UCs; we don't want separate UCs depending on where you want to deploy
- sometimes the test report requires comments by hand
- do you want to connect the staging environment to external storage/computing? That would be an extension of the UC
- for the next release use IAM, because DIRAC v9 will use IAM
- for internal usage set up an IAM service; CNAF provides the service at the moment; we could install it ourselves, it would be preferable to have it in the CI
- we can ask for a test user
- DIRAC has its own IAM service
- we should do the integration with SimPipe
- Rel 1.0
- requires more work; uses the metadata support from DIRAC; maybe not included in this release
- scatter-gather features: not supported in DIRAC now, but DiracX will support them
- Rel 1.0 is not the same as M1; Rel 1.0 is Q3 2025
- we need to contribute to DIRAC to get the CWL support in there; otherwise we have no control over their schedule
- Rel 1.2
- a Transformation consists of many jobs, each doing the same thing but handling different input
- how do you trigger the next transformation, and how do you know that the things required for the next step are finished?
- it is data driven, not implemented yet; for the training we now wait for the training to finish; you could check if the metadata are complete; you still need monitoring
- we now have an agent implemented that checks whether transformations are finished, in order to trigger the next one
- do you want to have a separate production that does the gathering?
- get the obs IDs from ACADA, then run the production on these; then you know which metadata to check to start the rest
- how manual is the bookkeeping? Who is responsible when the production fails?
- in some cases automatic reschedule; we started to implement scenarios of DC problems
- CTAO hiring somebody specifically for this
- in CWL, jobs can end up in 3 states; you can indicate the error codes
- is CTADIRAC not specific to CTAO?
- in LHCb they did similar developments but too specific to their use cases; that is why CTADIRAC is generalized as far as it is not only applicable to CTAO
- what is the role of the person who will do this?
- SDMC manager/operator? Someone who presses the button, a data manager?
- if we talk about testing, it is the AIV people
- we have to understand what these people should expect, they need procedures
- it is too early for these procedures
- we should write this down by the end of this year: what should be automated or not; there will be stages in doing this
- data quality monitoring also plays a big role in this
- which metadata do you have for MC?
- particle type, prod campaign, pointing direction, zenith angle, subset for training or not, CTA site
- we apply a metadata query in the transformation (see the sketch after this list)
- when you have processed a file, does the transformation stop?
- it stays alive; in some cases we may want to stop the transformation when no more files are expected. We can extend the agent to have more options than what is implemented now
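A rough sketch of the kind of metadata query a data-driven transformation could apply to select its MC input, here via the DIRAC File Catalog client. The metadata keys, values and path are illustrative only and do not reflect the agreed CTAO metadata schema; an initialized DIRAC environment is assumed.

```python
# Sketch: select MC input files for a transformation by metadata query
# against the DIRAC File Catalog. Keys, values and the base path are
# placeholders, not the agreed CTAO metadata schema.
from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient

fc = FileCatalogClient()
query = {
    "particle": "gamma",
    "prod_campaign": "prod6",
    "zenith_angle": 20,
    "training_subset": True,
    "site": "North",
}
result = fc.findFilesByMetadata(query, path="/ctao")
if result["OK"]:
    lfns = result["Value"]
    print(f"{len(lfns)} files match the query, e.g. {lfns[:3]}")
else:
    print("query failed:", result["Message"])
```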
- Monitoring
- the appearance of a file can trigger a new transformation
- Rel 2.0
- Now using the DIRAC metadata catalogue; this will need to change to Rucio
- a push/polling mechanism is now being used
- we don't need to poll every few seconds
- OpenSearch has not been working for months at DESY (at CC-Lyon it does work)
- VOMS decommissioning in March: what will happen to the prod files? IAM certificates work, storage elements work, compute elements work
- do we need to change the mapping of the files?
- the IAM instance at CNAF is used; job submission works with tokens; storage elements do not, only with a certificate proxy, which also works with the IAM instance
- the current MoU will be for the current instance; for production we need a new MoU; the only difference will be response time and number of users
- users need to be migrated to IAM
- non-active users will just not be migrated
BDMS
- RethinkDB is no longer actively developed; it only gets bug fixes and there will be no new features; if we think it has enough features we can use it, if we need other features we can change to another DB
- why would CTAO be in the high size - high complexity region, more than other experiments in the (Rucio) community? LHC experiments have even more data and higher complexity
- Rucio is actually in the RDBMS range
- if we don't use SQL for metadata but a document-based DB, why restrict only to the choices presented?
- PostgreSQL and MongoDB are even supported, like in ACADA
- there are many document-based DBs, why this one?
- that is considering only the complexity of the problem; in the diagram, PostgreSQL allows storing data in a document-oriented way inside a relational database
- what prevents us from using a relational DB?
- none
- BDMS is not released yet, just the policy part; the rest still needs to be done, also for the other subpackages
- why can't the complete branch be merged into the main branch?
- because we differ in what needs to be done
- everything in main needs to work: documentation, testing, added functionality
- a full-branch integration is impossible to test
- it indeed runs on the test cluster, but what does it add to the main branch?
- in practice no prototype is compliant; take pieces from each prototype, merge them piece by piece and run them in the CI/CD
- for Rel 0.0 everything in the main branch is ready
- for other releases the Rome group wants to include the whole prototype
- next steps: continue in tiny steps, broken down into a list of things that need to be done
- instead of working on the small proposals, people are going back to the entire system
- documentation still needs to be written; it will have to be done next week
- then the other subpackages need to be tagged and we can release
- the release will consist of Helm charts and test reports
- usually: open a branch, make an MR, describe it and tag the people
- do this for whatever you are working on
- change the due date when dates have passed
- discuss the open items at the TCCs, fix issues
- feel free to push stuff, even if not perfect
- we need a handbook on how to handle Git, reviews, MRs, issues, etc.
- MRs shouldn't be so large; they take too long to review
- when can I ask people to review?
- anybody: they can indicate they don't have time and can reassign
- should feel free to assign anybody
- review required, manual check needed
- we are probably not getting the Ultimate edition of GitLab
Todo:
- Discuss how to proceed with BDMS after Rel0.0
- Finish up Rel 0.0 issues for BDMS
- Set up Git policy
- Set up concise GitLab/DevOps wiki
Metadata presentation
- what is the meaning of "put data to metadata plugin"?
- to indicate that we use the plugin
- are the data set in both systems, not just in Rucio?
- part in Rucio, part in RethinkDB
- the default database with the Rucio plugin is PostgreSQL
- when splitting the metadata: how do you ensure that you have stored them consistently?
- what is shown are files with metadata, but they don't specify a link to the schema, like a JSON schema; we should be able to validate the metadata linked to files (see the sketch below)
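A minimal sketch of validating per-file metadata against a JSON Schema with the standard jsonschema package. The schema and metadata fields below are invented for illustration; in practice the schema would come from the metadata description document.

```python
# Sketch: validate the metadata attached to a file against a JSON Schema
# before ingesting/registering it. Schema and fields are invented for
# illustration; the real schema would come from the metadata description doc.
import jsonschema

DL0_METADATA_SCHEMA = {
    "type": "object",
    "required": ["obs_id", "tel_id", "start_time", "data_category"],
    "properties": {
        "obs_id": {"type": "integer", "minimum": 0},
        "tel_id": {"type": "integer"},
        "start_time": {"type": "string"},
        "data_category": {"type": "string", "enum": ["A", "B", "C"]},
    },
    "additionalProperties": True,
}

metadata = {
    "obs_id": 42,
    "tel_id": 1,
    "start_time": "2025-01-14T21:00:00Z",
    "data_category": "A",
}

# Raises jsonschema.ValidationError if the metadata do not match the schema.
jsonschema.validate(instance=metadata, schema=DL0_METADATA_SCHEMA)
print("metadata valid")
```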
- ACADA-DPPS interface: ACADA tells us there is a file in a certain folder, and changes ownership
- a list of files is provided as a summary (not available yet from ACADA, nobody is working on that)
- generally there are two cases where we ingest: from ACADA, and runs from the WMS
- presented here: what is done for the current prototype
- questioning the release numbers used for this prototype, in relation to the DPPS releases
- if Rucio already provides what we need, why do we need to implement it?
BIG question marks about:
- what will be implemented
- who is taking responsibility for the MRs
- who is taking responsibility for the tags
- who is responsible for the BDMS repo
- who to get approvals from
- Queries
- we have a requirement that all metadata are queryable, for DPPS most queries are known well ahead of time
- except: the current requirements state that the reply should come within 1 sec (?)
- was the data model document used for the prototype?
- the metadata description doc is a draft, which is not used here; nothing in common here
- Changefeed
- is the push-method agent supported by DIRAC?
- there is no requirement regarding the changefeed feature
- the archive should not do anything; other jobs make additions to the archive, not the archive itself
- arguments for using RethinkDB?
- question whether we should separate the metadata DB
- do not split the software, only the backend
- a separated setup will be difficult to synchronize
- Rucio plugin
- we need to write a new plugin for DIRAC no matter what; can you hide the RethinkDB behind the Rucio API? Can we have it compliant with the Rucio API?
- this is part of the Oscars LAPP project, see presentation
- the question will arise why it is not compliant with Rucio
- we should test the stuff on the existing backends, and only move to RethinkDB if we don't meet the requirements (see the sketch at the end of this subsection)
- if the existing ones don't suffice we can move to an alternative
- do you have to prepare the DB architecture beforehand? Can't you make queries that are not compatible?
- this can be of general interest for Rucio
- should we split the user metadata from the default metadata, and then put the user metadata in a separate relational database?
- it is required that we define the UCs, NOW, and the requirements
- difference between requirements for DPPS and SUSS
- there is stuff to do for DIRAC on the Rucio plugin; who will do it?
- Belle II has an MR on this
- first find out the status, see how much work it is; LAPP could contribute to this
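A small sketch of how the metadata tests could be written against the generic Rucio client API, so the same test runs unchanged on the existing backends (default plugin on the relational DB, or another configured plugin) before any move to RethinkDB is considered. Scope, DID name and metadata keys are placeholders.

```python
# Sketch: exercise metadata through the generic Rucio client API, so the same
# test can run against whichever metadata plugin/backend the server uses.
# Scope, DID name and metadata keys below are placeholders.
from rucio.client import Client

client = Client()
scope, name = "prod6", "/ctao/MC/prod6/gamma/run00001.simtel.zst"

# attach metadata to the DID
client.set_metadata(scope=scope, name=name, key="particle", value="gamma")
client.set_metadata(scope=scope, name=name, key="zenith_angle", value=20)

# read it back
print(client.get_metadata(scope=scope, name=name))

# query DIDs by metadata filter (which filters are supported depends on the
# configured metadata plugin/backend)
for did in client.list_dids(scope, filters={"particle": "gamma"}, did_type="file"):
    print(did)
```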
Swiss contribution BDMS
- interface BDMS and SUSS
- you have a data product that will be handed over to SUSS
- it is not queried, it is all handed over to SUSS
- according to boss, all data need to be queryable and available for every user
- at the moment these archives are separated
- not wise to start with a single archive; maybe keep it in mind, MF is taking this into account
- we only support restricted queries
- the bulk archive does not require fine-grained authorization
- Discuss:
- DPPS-SUSS interface: discuss archive and science separately
- the Swiss group should be included in this discussion
- has the replica deletion been tested on site?
- we can do this on the North site by spinning up a Docker Compose setup
- aren't we getting an installation on the North site, with the same functionality as the DCs?
- central services can be deployed at the DCs; onsite, the instance is only composed of storage, worker nodes and access to CVMFS
- it can't be the same as a DC
- not a full DPPS installation
- what if we lose the connection with the site?
- so a minimal installation: just a Rucio instance and a Rucio client
- you need an RSE on site
- We need:
- a strategy on where we store the data, QoS, access, staging, ...
- write a doc, define the strategy; who??
- for cat B we need to install a lot of stuff onsite anyway
- you only need CVMFS, worker nodes, an RSE for the data and the DL0 ingestion client; a very minimalistic deployment
- latency between server and client is not an issue on site
- what if the CVMFS connection is broken?
- the main thing is to replicate the data
- didn't we previously talk about dropping category B?
- not everybody agrees
- the difference between cat A and cat B can be large, in terms of calibration
- Slide 9: why use a scope for storing temporary data sets?
- we use it to look things up efficiently
- then use a directory structure instead of a scope; scope is more for access
- you can set rules on the LFNs (the directory structure is used in DIRAC and is mapped to data sets etc. in Rucio; DIRAC does not know anything else); see the sketch below
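A rough sketch of how an onsite dataset could be protected by a Rucio replication rule before its onsite replica is deleted. The scope, dataset name and RSE expression are placeholders and do not reflect the actual CTAO RSE naming or attributes.

```python
# Sketch: ask Rucio to keep one replica of an onsite dataset at an offsite DC;
# once the rule is satisfied, the onsite replica can be cleaned up.
# Scope, dataset name and RSE expression below are placeholders.
from rucio.client import Client

client = Client()
rule_ids = client.add_replication_rule(
    dids=[{"scope": "acada_north", "name": "obs_2025_01_14_dl0"}],
    copies=1,
    rse_expression="tier=1&type=DISK",  # hypothetical RSE attributes
    lifetime=None,        # keep until explicitly removed
    grouping="DATASET",   # keep the whole dataset together on one RSE
)
print("created rule(s):", rule_ids)
```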
Missing in BDMS group:
- Central discussion
- Consensus on design and architecture (and accepting that Rucio is the chosen technology)
- Discussion on requirements, UCs
- Keeping to the general guidelines of development, testing and integration
- Documentation
Concrete BDMS actions:
- organise BDMS meetings
- for DBs: make a table of the different strategies, do benchmarking (à la the Oscars LAPP group), take out the subjectiveness (see the sketch after this list)
- keep the Monday-afternoon developers' meetings and get used to the GitLab and DevOps way of working
- communicate where the documentation is
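A minimal sketch of the kind of benchmarking harness that could back such a comparison table, timing metadata queries against the roughly 1-second response requirement mentioned earlier. `run_query` is a placeholder to be bound to each candidate backend (Rucio default plugin, PostgreSQL, RethinkDB prototype, ...).

```python
# Sketch of a tiny benchmarking harness for comparing metadata backends
# against the ~1 s query-response requirement mentioned above.
# `run_query` is a placeholder callable bound to each candidate backend.
import statistics
import time


def benchmark(run_query, repetitions=20):
    """Time a metadata query callable; report median and worst latency in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


if __name__ == "__main__":
    # placeholder query standing in for a real backend call
    example = benchmark(lambda: sum(range(10_000)))
    print(example, "requirement: ~1 s per query")
```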
Other topics
- reasoning behind the categorization of the use cases:
- Use cases describe a functionality. The DPPS functionalities are defined in the DPPS function tree. Use cases are divided into these groups based on the functionality they describe.
- For UCs in general: place the UC in the function group that belongs to the functionality described in the UC
- Hence, for the specific case of the UCs in Rel 0.0, that describe the deployment of services: this is the functionality "Manage DPPS services"
- Release manager checks the release report; add this to the release procedure
- usage of labels in GitLab, templates in Git; check with ACADA
- more frequent AIV meetings, focused on releases and more focused on open issues
- the QA plan and SW lifecycle doc need to be revived, but are not required for Rel 0.0
- Revise the UC template
- the discussion on monitoring and log data from ACADA is becoming more urgent
-
2
Workload Status and Plans
Updates from the Workload team.
• Challenges and plans for (CTA)DIRAC integration.
• Objectives for the DIRACX hackathon.
Speakers: Luisa ARRABITO (LUPM IN2P3/CNRS), Natthan PIGOUX
10:30
Coffee
-
3
BDMS Status and Plans
Updates from the BDMS team.
• Challenges and plans for RUCIO integration.
• Optimizing workflows for data management.
Speakers: Etienne Lyard, Georgios Zacharis, Stefano Gallozzi (INAF, Osservatorio Astronomico di Roma), Syed Anwar Ul Hasan
4
Metadata needs for CTAO
Rehearsal of the contribution to DIRAC-Rucio miniworkshop
Speakers: Frederic Gillardo (Centre National de la Recherche Scientifique (FR)), Georgios Zacharis
12:00
Lunch
-
5
DIRAC-RUCIO integration needs for CTAO
Speaker: Maximilian Linhoff (TU Dortmund | CTAO)
-
6
Getting Rucio and DIRAC deployed using K8s
Speaker: Dr Volodymyr Savchenko (EPFL, Switzerland / Department of Astronomy, University of Geneva)
-
15:30
Coffee
-
7
Hackathon preparation
• Defining action items for the DIRACX hackathon.
• Setting priorities for the DIRAC-RUCIO hackathon.
-
8
Metadata handling in RUCIO: gaps and opportunities
- Discuss the metadata handling options
- Establish metrics for performance testing
- Implement prototypes
-
12:00
Lunch
-
9
BDMS prototyping
- Continued prototyping and testing.
-