CTA Dev Meeting

Europe/Zurich
513/R-068 (CERN)

Michael Davis (CERN)
    • 1
      CTA Release Workflow
    • 2
      CTA Release Roadmap

      See CTA Release Roadmap

      Release 4.8.1-1

      • Milestones link
      • Release date: 12 Dec
      • Pre-prod deployment date: 12 Dec
      • Prod deployment date: 12 Dec
      • Fixed protobuf error introduced by the previous release v4.8.0-1/v5.8.0-1.

      Release 4.8.2-1

      • Milestones link
      • Release date: 14 Dec
      • Pre-prod deployment date: 15 Dec
      • Prod deployment date: -
      • Optimisations on catalogue DB queries

      Public Release

      • Latest version available on public repo: v4.7.14-1, v5.7.14-1
      • Release 4.8.2-1 was done in a separate branch. It needs to be merged back into main.
      • We don't plan any other tagged releases before Christmas.
      • Jacek and Michael discussed creating a merge request related to the issue #242 (cta-frontend-grpc - problem with loading pem_root_certs), so that it can be reviewed by Michael.
    • 3
      CTA dev topics

      Review scheduler retry logic for archive and retrieve

      • We decided to apply a "pause" between all retries. How this delay should be implemented is still to be discussed.
      • For details check #37

      Handle 'unavailable' files in user and repack retrieves originated from problematic tapes

      • How should unavailable files be handled during the repack retrieve workflow?
      • Use is_unavailable flag vs reducing repack retrieve retries to zero.
      • For details check #218

      Amend code convention: include headers should use the complete path from the project root

      • Use full path vs relative file locations in header files.
      • For details check #249

      Allow VO override for repack

      • For details check #31

      REPACKING tape state and queue cleanup

      • Feedback of deploying v4.8.1-1 fix on production (fixed protobuf error - #ops-937).

      Several Free drive STALE because of long global scheduler lock acquisition time

      stagerrm issues continued

      • For details check ops issue #ops-943
      • There are several other stagerrm issues ongoing, such as #152 and #151. We should have a unified approach.

      221213 Database intervention

      journalctl filling disk causing problems in CI

      "Needs discussion" topics

      "Dev issue needed" topics

      Review scheduler retry logic for archive and retrieve

      • Implementing a "try again after T seconds" mechanism is complex and requires changes to the current implementation of the object store.
      • In particular, we would need to create a new queue subtype to keep track of the requests that we want to retry later. This is a non-trivial task.
      • The new PostgreSQL scheduler will make it much easier to implement this feature in the future (#147).
      • Therefore, we will not implement this yet.

      As a compromise, we will set the number of retries to 0 (zero) for repack requests, as discussed in the following topic.

      Handle 'unavailable' files in user and repack retrieves originated from problematic tapes

      • We discussed the two options presented in issue #218.
      • The two options are not mutually exclusive. However, option #2 (do not retry when repacking) is much simpler to implement and operate, while option #1 (manually disable some files on a problematic tape) is more complex and requires changing the catalogue.
      • Therefore, we will implement option #2, but will keep discussing with our external collaborators if option #1 is also necessary.
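      Option #2 can be sketched as a small retry policy. This is only an illustration of the decision above, not the scheduler code: the `RetrieveRequest` type, the `is_repack` field and the default of 2 retries for user retrieves are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical default for user retrieves, chosen only for this example.
DEFAULT_USER_RETRIEVE_RETRIES = 2


@dataclass(frozen=True)
class RetrieveRequest:
    # Hypothetical request type; the real scheduler objects are different.
    archive_file_id: int
    is_repack: bool


def max_retries(request, user_retries=DEFAULT_USER_RETRIEVE_RETRIES):
    """Option #2: repack retrieves get no retries, user retrieves keep theirs."""
    return 0 if request.is_repack else user_retries


def should_retry(request, failed_attempts):
    """Retry only while the number of failures is within the retry budget."""
    return failed_attempts <= max_retries(request)
```

      With this policy a repack retrieve that fails once is reported as failed immediately, while a user retrieve is still retried until its budget is used up.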

      Amend code convention: include headers should use the complete path from the project root

      • It was decided that all include directives will be changed to use the full path from the project root.
      • Richard will handle it.
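      A change like this could be partly automated. The sketch below shows one possible way to turn an include written relative to the including file into a path from the project root. The function names, the `known_files` set and the example paths in the usage note are hypothetical; a real migration would also have to respect the include directories configured in the build.

```python
import posixpath


def to_root_relative(source_file, include):
    """Resolve an include relative to source_file into a path from the project root.

    Both arguments use forward slashes; source_file is relative to the root.
    """
    source_dir = posixpath.dirname(source_file)
    return posixpath.normpath(posixpath.join(source_dir, include))


def rewrite_include_line(source_file, line, known_files):
    """Rewrite a quoted #include if it resolves to a known project file.

    Lines that are not a plain quoted include, or whose target is not in
    known_files (for example includes that are already root-relative),
    are returned unchanged. The line is expected without a trailing newline.
    """
    stripped = line.strip()
    if not (stripped.startswith('#include "') and stripped.endswith('"')):
        return line
    include = stripped[len('#include "'):-1]
    candidate = to_root_relative(source_file, include)
    if candidate in known_files:
        return '#include "{}"'.format(candidate)
    return line
```

      For example, with a hypothetical file scheduler/Scheduler.cpp that includes "../common/exception/Exception.hpp", the rewritten directive would read #include "common/exception/Exception.hpp".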

      REPACKING tape state and queue cleanup

      • Release 4.8.1-1 successfully fixed the protobuf bug introduced in 4.8.0-1. The monitoring data confirms this.

      Several Free drive STALE because of long global scheduler lock acquisition time

      • A free drive will only be marked as STALE if it has not updated its status in the past 4 hours (threshold increased from 10 minutes to 4 hours).
      • This change is made only on the client side (the backend does not calculate this).
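      The client-side check amounts to comparing the age of the last status update against a threshold. A minimal sketch follows, assuming the client has the last update time as a timezone-aware timestamp; the function and constant names are hypothetical and do not come from the CTA code.

```python
from datetime import datetime, timedelta, timezone

# New threshold from the minutes; the previous value was 10 minutes.
STALE_THRESHOLD = timedelta(hours=4)


def is_free_drive_stale(last_update, now, threshold=STALE_THRESHOLD):
    """Return True when a free drive has not reported for longer than threshold."""
    return now - last_update > threshold
```

      With the old 10 minute threshold, a free drive that last reported 30 minutes ago would already be shown as STALE; with the new threshold it is not.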

      stagerrm issues continued

      • There are several stagerrm-related issues in both our operations and development pages.
      • We need to aggregate all of them and discuss a common approach.
      • To be discussed between Joao, Julien and Richard.

      Improvements in gitlab CI workflow

      • The CI stage cta_valgrind has been taking a long time, and impacts the time that it takes to merge a commit into main.
      • Therefore, we will remove cta_valgrind from the list of mandatory CI stages (will be kept as optional). It will still be done as part of the scheduled CI tests.
      • The person tagging the release must check that the latest commits pass the Valgrind tests. This must be written into the release checklist!
      • Besides this, the file ReleaseNotes.md is always a source of rebase conflicts. We need to think of a strategy to avoid these conflicts (for example by clearly separating each person's commits in different files, or in different segments of the same file).
    • 4
      CTA dev board review

      Objective

      • Look at the active issues in our CTA dev board.
      • Decide if they should be kept, removed, reassigned, prioritised, etc.

      Review "In progress" issues

      • Full CTA board: link

      Review specific topic

      We did not cover this topic this week.

      It will be kept for a future dev meeting.

    • 5
      AOB

      Other

      • Room 513/R-068 is booked every week, until the end of the year, for the CTA dev meeting.