CTA Dev Meeting

Name: CTA Dev Meeting
Start: 2022-12-16T14:00:00+01:00
End: 2022-12-16T15:30:00+01:00
Location: CERN

Friday 16 Dec 2022, 14:00 → 15:30 Europe/Zurich

513/R-068 (CERN)

Michael Davis (CERN)

- 14:00 → 14:10
  CTA Release Workflow 10m
  - Release procedure instructions
- 14:10 → 14:20
  CTA Release Roadmap 10m
  See CTA Release Roadmap
  
  Release 4.8.1-1
  - Milestones link
  - Release date: 12 Dec
  - Pre-prod deployment date: 12 Dec
  - Prod deployment date: 12 Dec
  - Fixed protobuf error introduced by the previous release v4.8.0-1/v5.8.0-1.
  Release 4.8.2-1
  - Milestones link
  - Release date: 14 Dec
  - Pre-prod deployment date: 15 Dec
  - Prod deployment date: -
  - Optimisations on catalogue DB queries
  Public Release
  - Latest version available on public repo: v4.7.14-1, v5.7.14-1
  Release 4.8.2-1 was done in a separate branch. It needs to be merged back into main.
  We don't plan any other tagged releases before Christmas.
  Jacek and Michael discussed opening a merge request for issue #242 (cta-frontend-grpc - problem with loading pem_root_certs), so that Michael can review it.
- 14:20 → 14:30
  CTA dev topics 10m
  Review scheduler retry logic for archive and retrieve
  - We decided to apply a "pause" between all retries. How this delay should be implemented is still to be discussed.
  - For details check #37
  Handle 'unavailable' files in user and repack retrieves originated from problematic tapes
  - How should unavailable files be handled during the repack retrieve workflow?
  - Use is_unavailable flag vs reducing repack retrieve retries to zero.
  - For details check #218
  Amend code convention: include headers should use the complete path from the project root
  - Use full path vs relative file locations in header files.
  - For details check #249
  Allow VO override for repack
  - For details check #31
  REPACKING tape state and queue cleanup
  - Feedback on deploying the v4.8.1-1 fix in production (fixed protobuf error - #ops-937).
  Several Free drive STALE because of long global scheduler lock acquisition time
  - For details check ops issue #ops-929
  stagerrm issues continued
  - For details check ops issue #ops-943
  - There are several other stagerrm issues ongoing, such as #152 and #151. We should have a unified approach.
  221213 Database intervention
  - For details check ops issue #ops-948
  journalctl filling disk causing problems in CI
  - For details check ops issue #ops-956
  "Needs discussion" topics
  - Issues link
  "Dev issue needed" topics
  - Ops issues link
  Review scheduler retry logic for archive and retrieve
  Implementing a "try again after T seconds" mechanism is complex and requires changes to the current implementation of the object store.
  In particular, we would need to create a new queue subtype to keep track of the requests that we want to retry later. This is a non-trivial task.
  The new PostgreSQL scheduler will make it much easier to implement this feature in the future (#147).
  Therefore, we will not implement this yet.
  As a compromise, we will modify the number of retries to 0 (zero) in the case of repack requests, as discussed in the following topic.
  Handle 'unavailable' files in user and repack retrieves originated from problematic tapes
  We discussed the two options presented in issue #218.
  The two options are not mutually exclusive. However, option #2 (do not retry when repacking) is much simpler to implement and operate, while option #1 (manually disable some files on a problematic tape) is more complex and requires changing the catalogue.
  Therefore, we will implement option #2, but will keep discussing with our external collaborators if option #1 is also necessary.
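The effect of option #2 can be illustrated with a minimal sketch. This is not the CTA scheduler code: the names `RequestKind`, `max_retries` and `should_retry`, as well as the default retry count of 2, are hypothetical and only serve to show the intended behaviour.

```python
from enum import Enum


class RequestKind(Enum):
    USER_RETRIEVE = "user_retrieve"
    REPACK_RETRIEVE = "repack_retrieve"


# Hypothetical default: the real retry counts are not stated in these minutes.
DEFAULT_MAX_RETRIES = 2


def max_retries(kind: RequestKind) -> int:
    """Return how many retries are allowed for a request of this kind."""
    if kind is RequestKind.REPACK_RETRIEVE:
        # Option #2: a failed repack retrieve is never retried.
        return 0
    return DEFAULT_MAX_RETRIES


def should_retry(kind: RequestKind, failures_so_far: int) -> bool:
    """Decide whether to requeue a request that has just failed.

    failures_so_far counts all failures, including the one just observed.
    """
    return failures_so_far <= max_retries(kind)
```

With this policy a repack retrieve that fails once is reported as failed immediately, while a user retrieve is still requeued until its retries are exhausted.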
  Amend code convention: include headers should use the complete path from the project root
  We decided to change all include directives to use the full path from the project root.
  Richard will handle it.
  REPACKING tape state and queue cleanup
  Release 4.8.1-1 successfully fixed the protobuf bug introduced in 4.8.0-1. The monitoring data confirms this.
  Several Free drive STALE because of long global scheduler lock acquisition time
  We will only mark as STALE a free drive that did not update its status in the past 4 hours (increase from 10 mins to 4 hours).
  This change is made only on the client side (the backend does not calculate this).
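The new rule can be sketched as a small client-side check. This is an illustration under assumptions, not the actual client code: the function name `is_stale`, the status string `"Free"` and the way the last update time is passed in are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold agreed in the meeting (previously 10 minutes).
STALE_THRESHOLD = timedelta(hours=4)


def is_stale(drive_status: str, last_update: datetime, now: datetime) -> bool:
    """Return True if a free drive has not updated its status for over 4 hours.

    Only free drives are considered here; other statuses are out of scope
    for this sketch.
    """
    return drive_status == "Free" and (now - last_update) > STALE_THRESHOLD


# Usage example with fixed timestamps.
now = datetime(2022, 12, 16, 14, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=30)
old = now - timedelta(hours=5)
print(is_stale("Free", recent, now))  # False
print(is_stale("Free", old, now))     # True
```

A drive that last reported 30 minutes ago would have been shown as STALE under the old 10-minute threshold, but is not under the new one.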
  stagerrm issues continued
  There are several stagerrm-related issues in both our operations and development pages.
  We need to aggregate all of them and discuss a common approach.
  To be discussed between Joao, Julien and Richard.
  Improvements in gitlab CI workflow
  The CI stage cta_valgrind has been taking a long time, which increases the time it takes to merge a commit into main.
  Therefore, we will remove cta_valgrind from the list of mandatory CI stages (it will be kept as optional). It will still be run as part of the scheduled CI tests.
  The person tagging the release must check that the last commit passes the Valgrind tests. This must be added to the release checklist!
  
  Besides this, the file ReleaseNotes.md is a frequent source of rebase conflicts. We need to think of a strategy to avoid these conflicts (for example by clearly separating each person's commits in different files, or in different segments of the same file).
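One possible shape for the "different files" idea is sketched below, purely as an illustration and not as an agreed procedure. It assumes a hypothetical directory of one-line fragment files, one per change, which the person tagging the release assembles into a single section. The function name, the directory layout and the file naming are all assumptions.

```python
from pathlib import Path


def assemble_release_notes(fragment_dir: Path, version: str) -> str:
    """Build one release-notes section from per-change fragment files.

    Each *.md file in fragment_dir is assumed to hold a one-line description.
    Because every change adds its own file, two branches never edit the
    same lines and rebases do not conflict.
    """
    lines = [f"# {version}", ""]
    # Sort by file name so the output is deterministic.
    for fragment in sorted(fragment_dir.glob("*.md")):
        text = fragment.read_text(encoding="utf-8").strip()
        if text:
            lines.append(f"- {text}")
    return "\n".join(lines) + "\n"
```

The trade-off is an extra assembly step at tagging time, which would also need to be added to the release checklist.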
- 14:30 → 14:40
  CTA dev board review 10m
  Objective
  - Look at the active issues in our CTA dev board.
  - Decide if they should be kept, removed, reassigned, prioritised, etc.
  Review "In progress" issues
  - Full CTA board: link
  Review specific topic
  - Every week we review the issues of one topic.
  - This week: "Scheduler" and "Object Store" labels
  We did not cover this topic this week.
  It is deferred to a future dev meeting.
- 14:40 → 14:50
  AOB 10m
  Other
  - Room 513/R-068 is booked every week until the end of the year for the CTA dev meeting.
