CTA Dev Meeting

Europe/Zurich
31/S-028 (CERN)

Michael Davis (CERN)
    • 1
      CTA Release Workflow
    • 2
      CTA Release Roadmap

      See CTA Release Roadmap

      Release 4.8.0-1

      • Milestones link
      • Release date: 28 Nov
      • Pre-prod deployment date: 30 Nov
      • Prod deployment date: 5 Dec

      Release 4.8.1-1 - DELAYED (see minutes)

      • Milestones link
      • Release date: TBD (already stress tested)
      • Pre-prod deployment date: TBD
      • Prod deployment date: TBD
      • Both 4.8.1-1 and 4.8.2-1 will be released simultaneously

      Release 4.8.2-1 - DELAYED (see minutes)

      • Milestones link
      • Release date: TBD
      • Pre-prod deployment date: TBD
      • Prod deployment date: TBD
      • Catalogue v13 release
      • Both 4.8.1-1 and 4.8.2-1 will be released simultaneously

      Public Release

      • Version available on public repo: v4.7.14-1, v5.7.14-1
      • New versions will be released after we have used them internally in production.
      • We have decided to change the approach to issue #218 (unavailable files). Instead of using the new IS_ACCESSIBLE column, we will simply not retry repack retrieve requests. We therefore need to revert some of the existing commits, and there is no urgency to release 4.8.2-1 (catalogue v13.0) for now.
    • 3
      CTA dev topics

      Review scheduler retry logic for archive and retrieve

      • The retry logic should take into account the type of error encountered when deciding whether, and for how long, the queue should sleep between retries.
      • For details check #37

      Handle 'unavailable' files in user and repack retrieves originated from problematic tapes

      • At what point in the retrieve workflow should we check for unavailable files?
      • For details check #218

      Allow VO override for repack

      • For details check #31

      REPACKING tape state and queue cleanup

      • Feedback on deploying version v4.8.0-1 in production.

      Several Free drives STALE because of long global scheduler lock acquisition time

      r_alice_test_datachallenge archives queues not being absorbed

      "Needs discussion" topics

      "Dev issue needed" topics

      Review scheduler retry logic for archive and retrieve

      • We decided to apply a minimum "time window" before cancelling any job request.
      • Until this window has elapsed, we should not cancel the request. Instead, if necessary, we should delay or retry it. The exact details still need to be defined.
      • The delay should be applied per tape file. We can take advantage of different tape copies to bypass delays on some tapes.
      • TODO: Write a document proposing the new behaviour/approach.
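
      As a starting point for that document, the decisions above could be sketched roughly as follows. This is only an illustration under stated assumptions, not CTA code: the class names, the returned action strings, and both constants (MIN_WINDOW_SECONDS, RETRY_DELAY_SECONDS) are hypothetical, and the real values and details are still to be defined.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical values: the minutes leave the exact numbers undefined.
MIN_WINDOW_SECONDS = 3600.0   # minimum "time window" before a request may be cancelled
RETRY_DELAY_SECONDS = 300.0   # delay applied to a tape file after a failure


@dataclass
class TapeCopy:
    """One copy of a file on a given tape (hypothetical model)."""
    vid: str
    delayed_until: float = 0.0  # this copy is not retried before this time


@dataclass
class RetrieveRequest:
    """A retrieve request with all known tape copies of the file (hypothetical model)."""
    created_at: float
    copies: List[TapeCopy]


def on_failure(request: RetrieveRequest, failed_copy: TapeCopy, now: float) -> str:
    """Decide what to do after `failed_copy` failed at time `now`.

    Returns "retry:<vid>", "delay", or "cancel".
    """
    # The delay is applied per tape file, not to the whole request.
    failed_copy.delayed_until = now + RETRY_DELAY_SECONDS

    # Once the minimum time window has elapsed, cancelling is allowed.
    if now - request.created_at >= MIN_WINDOW_SECONDS:
        return "cancel"

    # Inside the window: use another tape copy to bypass the delay if possible.
    for copy in request.copies:
        if copy.delayed_until <= now:
            return f"retry:{copy.vid}"

    # Every copy is currently delayed: wait instead of cancelling.
    return "delay"
```

      In this sketch a request with two tape copies is retried on the second copy as soon as the first one fails, is delayed if both copies have failed recently, and is only cancelled once the minimum time window has passed.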

      Handle 'unavailable' files in user and repack retrieves originated from problematic tapes

      • We decided to change the approach to this problem. Instead of using the IS_ACCESSIBLE column (which must be reverted in the git repo before any new release), we will simply remove all the retry logic from the repack retrieve requests. This means that operators can quickly get a list of all tape files that failed to retrieve (files that remain on the tape after the repack). They can then manually issue a new repack, mount the tape on a different drive, or handle the tape as they see fit.
      • Vlado will write a document on how the retry logic should be done for repacking (failed segments), taking into account the discussion during this meeting.
      • Catalogue commits are to be reverted from main and put back into a separate branch. The commit that adds IS_ACCESSIBLE should be removed from this branch.

      Allow VO override for repack

      • We won't discuss this for now. Once we are more familiar with operating the new REPACKING behaviour, after the New Year, we will revisit this topic.

      REPACKING tape state and queue cleanup - Wrong WARNING messages

      • For now, operations will filter out these messages, since they are not a problem.
      • They will be removed permanently (or have their priority reduced) in a future commit.

      Several Free drives STALE because of long global scheduler lock acquisition time

      • The only thing to do on the dev side is to increase the STALL constant. The rest will be handled by operations.

      r_alice_test_datachallenge archives queues not being absorbed

      • Vova will create a dev issue and link it to the existing ops issue.
    • 4
      CTA dev board review

      Objective

      • Look at the active issues in our CTA dev board.
      • Decide if they should be kept, removed, reassigned, prioritised, etc.

      Review "In progress" issues

      • Full CTA board: link

      Review specific topic

    • 5
      AOB

      Other

      • Room 513/R-068 is booked every week, until the EOY, for the CTA dev meeting.