CTA deployment meeting

Europe/Zurich
31/S-028 (CERN)


Status of operational tools:

  • Vlado's and David's slides can be found here.
  • The CTA pool supply script is invoked via Rundeck. For consistency reasons, this script should only be used on the production CTA instance; non-production CTA instances (e.g. ctapps) shall instead have the necessary tapes statically allocated in advance.
  • The replacement for tsmod-daily-report should not send mails but rather feed a dashboard (via a monitoring sink) that displays current information as well as plots usable for historical trends, such as a percentile-based distribution of drive runtime (see the runtime-percentile sketch after this list). (The current script sends a mail twice a day, and the information is not kept persistently.)
  • So far, CTA treats supply pools like any other pool (similar to CASTOR). Following discussion, some of the checks done by tsmod-daily-report could be moved into CTA, such as consistency checks on supply tapes (empty, neither disabled nor read-only, etc.; see the consistency-check sketch after this list). Vlado will create a corresponding ticket.
  • Alarm system:
    • Handling distinct element thresholds (ticket) is now implemented for drives and tapes and working well.
    • Disabling functionality (tapes, drives, libraries for CTA) is not yet implemented. Disabling libraries can wait, but for tapes and drives this should be provided with priority.
    • The new system runs as a daemon (rather than being invoked at regular intervals, as is the case for the CASTOR alarm system). This requires handling the expiry of the Kerberos (KRB5) authentication tokens needed for invoking CTA and CASTOR commands (see the token-renewal sketch after this list). A possible solution is to de-daemonize it and invoke it via Rundeck. David/Vlado to work out a solution.
  • Dashboards: It is felt that there are currently far too many dashboards, which have grown organically (see overviews here, here, or here). These need to be cleaned up and structured by functionality and audience. For operations, the following would be needed:
    • EOSCTA instances
    • tapeserver cluster
    • CTA tape back-end
    • Julien will work on a cleanup proposal, gathering input from tape operations.
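
As an illustration of the percentile-based runtime distribution mentioned above, here is a minimal C++ sketch; the runtime values are made up, and a real version would push the results to the monitoring sink rather than print them:

// percentiles.cpp - illustrative only: computes a percentile-based
// distribution of drive runtimes (in hours), as the dashboard could plot.
// Build: g++ -std=c++11 -o percentiles percentiles.cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Nearest-rank percentile of an already sorted sample.
double percentile(const std::vector<double>& sorted, double p) {
  size_t idx = static_cast<size_t>(p / 100.0 * (sorted.size() - 1) + 0.5);
  return sorted[idx];
}

int main() {
  // Hypothetical drive runtimes for one day, in hours.
  std::vector<double> runtimes = {0.5, 1.2, 3.4, 7.9, 8.1, 10.0, 22.5, 23.9};
  std::sort(runtimes.begin(), runtimes.end());
  for (double p : {50.0, 90.0, 99.0}) {
    std::printf("p%.0f drive runtime: %.1f h\n", p, percentile(runtimes, p));
  }
  return 0;
}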
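
The supply-pool consistency check could look like the sketch below; the Tape struct and its fields are assumptions standing in for whatever the CTA catalogue actually exposes:

// supply_check.cpp - illustrative consistency check for supply-pool tapes:
// each tape must be empty, not disabled and not read-only.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical, simplified view of a catalogue tape entry.
struct Tape {
  std::string vid;
  uint64_t dataInBytes;  // must be 0 for a supply tape
  bool disabled;         // must be false
  bool readOnly;         // must be false
};

int main() {
  // In reality this list would be fetched from the CTA catalogue.
  std::vector<Tape> supplyPool = {
      {"V01001", 0, false, false},     // consistent
      {"V01002", 4096, false, false},  // not empty -> flagged
      {"V01003", 0, true, false},      // disabled  -> flagged
  };
  for (const Tape& t : supplyPool) {
    if (t.dataInBytes != 0 || t.disabled || t.readOnly) {
      std::printf("inconsistent supply tape: %s\n", t.vid.c_str());
    }
  }
  return 0;
}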
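
For the token-expiry problem, if the daemon form is kept, one possible approach is to re-acquire the Kerberos ticket from a keytab well before it expires. In this sketch the principal, keytab path and renewal interval are placeholders:

// krb5_renew.cpp - sketch of periodic Kerberos ticket renewal inside a
// long-running daemon, by shelling out to kinit with a keytab.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

int main() {
  for (;;) {  // daemon main loop
    // Placeholder principal and keytab; a failure here should itself
    // raise an alarm rather than be silently ignored.
    int rc = std::system("kinit -kt /etc/krb5.keytab.ctaops ctaops@CERN.CH");
    if (rc != 0) {
      std::fprintf(stderr, "kinit failed; CTA/CASTOR commands will fail\n");
    }
    // ... invoke the CTA and CASTOR commands with a valid ticket ...
    std::this_thread::sleep_for(std::chrono::hours(6));  // well below ticket lifetime
  }
}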

FTS and GC:

  • Work is ongoing to enhance the GC to mark files as evicted (cf. discussion at the previous meeting).

Migration:

  • Error-tolerant injection code for directories is ready (cf. last meeting). The next step is to implement it for files. Michael will use the current PPS with the imported ATLAS namespace to exercise the code. Julien will hand over the PPS instance to Michael after informing ATLAS.

Devel/ops status updates:

  • A new release should be prepared for deployment next week. Julien will select the features and start freezing and building the release on Friday.

  • Namespace tombstones: Andreas has provided a fix that should avoid leaking file tombstones in the EOS namespace when tape-backed files are deleted. It needs testing and is likely to find its way into EOS 4.5.7.

  • Python vs C++ for testing scripts: Shell/Python is showing its limitations when running complex concurrent system tests. For this case, Eric proposes developing helper applications in C++ with a well-documented, easy-to-follow flow, so that these helper tools remain easy for DevOps to follow (a minimal sketch follows this list).

  • The EOS namespace backup/recovery procedure needs to be tested (ticket); in particular ensuring that directory/path names can be correctly recovered.

  • Versioning of DB/protocols for backwards-compatible upgrades: Starting on Oct 1st. See Steve's mail below.
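
To give a flavour of the C++ test helpers proposed above, here is a purely illustrative sketch (not an actual CTA tool) of a concurrent writer with an explicit, linear flow:

// concurrent_writer.cpp - illustrative test helper: N threads each write a
// file concurrently, then the main thread verifies that all writes succeeded.
// Build: g++ -std=c++11 -pthread -o concurrent_writer concurrent_writer.cpp
#include <atomic>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

int main() {
  const int nThreads = 8;  // degree of concurrency under test
  std::atomic<int> failures(0);
  std::vector<std::thread> workers;

  for (int i = 0; i < nThreads; ++i) {
    workers.emplace_back([i, &failures] {
      // Each worker writes its own file; in a real system test this would
      // be, e.g., an xrdcp transfer into an EOSCTA instance.
      std::ofstream out("/tmp/test_file_" + std::to_string(i));
      out << "payload " << i << "\n";
      if (!out) ++failures;
    });
  }
  for (std::thread& t : workers) t.join();

  std::printf("%d/%d writes failed\n", failures.load(), nThreads);
  return failures.load() == 0 ? 0 : 1;
}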

--------------------------

Hi Eric and Michael,

 
From Tuesday 1st October 2019 onwards the CTA protobuf message formats and the catalogue database schema should be upgradable in a production environment.  During the development phase of CTA I said that developers could change the protobuf messages and the database schema without having to worry (too much) about a smooth upgrade path.  This grace period will end at midnight on Monday 30th September 2019.  Please make any necessary cleanups / modifications before the end of this period.
 
I see that Eric tries to ensure that the ID of each message field is unique across all messages.  I also see that Michael does not follow this practice.  My personal opinion is that both solutions have their own merits: Eric’s solution prevents a software bug from successfully parsing the wrong type of message, while Michael’s solution is nice and simple.  I personally don’t mind that both approaches are used after 1st October.  I would just like to know that you have both discussed this and exchanged your opinions.
 
AN ERIC EXAMPLE:
 
cat CTA/objectstore/cta.proto
// Pointer to the scheduler global lock
message SchedulerGlobalLockPointer {
  required string address = 110;
  required EntryLog log = 111;
}

// Pointer to the archive queue
message ArchiveQueuePointer {
  required string address = 120;
  required string name = 121;
}

// Pointer to the tape queue
message RetrieveQueuePointer {
  required string address = 130;
  required string vid = 131;
}

// The root entry. This entry contains all the most static information, i.e.
// the admin handled configuration information
message RootEntry {
  repeated ArchiveQueuePointer archive_queue_to_transfer_for_user_pointers = 1050;
  repeated ArchiveQueuePointer archive_queue_failed_pointers = 1062;
  repeated ArchiveQueuePointer archive_queue_to_report_for_user_pointers = 1068;
  repeated ArchiveQueuePointer archive_queue_to_transfer_for_repack_pointers = 1069;
  repeated ArchiveQueuePointer archive_queue_to_report_to_repack_for_success_pointers = 1072;
  repeated ArchiveQueuePointer archive_queue_to_report_to_repack_for_failure_pointers = 1073;
  repeated RetrieveQueuePointer retrieve_queue_to_transfer_for_user_pointers = 1060;
  repeated RetrieveQueuePointer retrieve_queue_to_report_for_user_pointers = 1063;
  repeated RetrieveQueuePointer retrieve_queue_failed_pointers = 1065;
  repeated RetrieveQueuePointer retrieve_queue_to_report_to_repack_for_success_pointers = 1066;
  repeated RetrieveQueuePointer retrieve_queue_to_report_to_repack_for_failure_pointers = 1067;
  repeated RetrieveQueuePointer retrieve_queue_to_transfer_for_repack_pointers = 1071;
  optional DriveRegisterPointer driveregisterpointer = 1070;
  optional AgentRegisterPointer agentregisterpointer = 1080;
  optional RepackIndexPointer repackindexpointer = 1085;
  optional RepackQueuePointer repackrequestspendingqueuepointer = 1086;
  optional RepackQueuePointer repackrequeststoexpandqueuepointer = 1088;
  optional string agentregisterintent = 1090;
  optional SchedulerGlobalLockPointer schedulerlockpointer = 1100;
}
 
 
A MICHAEL EXAMPLE:
 
cat CTA/xrootd-ssi-protobuf-interface/eos_cta/protobuf/cta_frontend.proto
...
message Request {
  oneof request {
    cta.eos.Notification notification = 1;      //< EOS WFE Notification
    cta.admin.AdminCmd admincmd       = 2;      //< CTA Admin Command
  }
}
 
 
 
//
// Metadata responses sent by the CTA Frontend
//
 
message Response {
  enum ResponseType {
    RSP_INVALID                       = 0;      //< Response type was not set
    RSP_SUCCESS                       = 1;      //< Request is valid and was accepted for processing
    RSP_ERR_PROTOBUF                  = 2;      //< Framework error caused by Google Protocol Buffers layer
    RSP_ERR_CTA                       = 3;      //< Server error reported by CTA Frontend
    RSP_ERR_USER                      = 4;      //< User request is invalid
  }
  ResponseType type                   = 1;      //< Encode the type of this response
  map<string, string> xattr           = 2;      //< xattribute map
  string message_txt                  = 3;      //< Optional response message text
  cta.admin.HeaderType show_header    = 4;      //< Type of header to display (for stream responses)
}
 
Cheers,
 
Steve

 

Agenda:

  • 14:00-14:40  Status of operational tools (40m)
    Supply pool logic, Daily TSMOD report, Labelling, Mount/Unmount, Media-check, Drive-Test, Alarm System, Monitoring, etc.

  • 14:40-15:30  Dev and Deployment Status, next steps (50m)
    • API and versioning needs
    • status of pending devs (backpressure, cancel, repack)
    • others