CTA deployment meeting
Status of operational tools:
- Vlado's and David's slides can be found here.
- The CTA pool supply script is invoked via Rundeck. For consistency reasons, this script should only be used on the production CTA instance; non-production CTA instances (e.g. ctapps) shall instead have the necessary tapes statically allocated in advance.
- The replacement for tsmod-daily-report should not send mails but instead feed a dashboard (via a monitoring sink) that displays current information and plots that can be used for historical trends, such as a percentile-based distribution of drive runtime (see the percentile sketch after this list). (The current script sends a mail twice a day; its information is not kept persistently.)
- So far, CTA treats supply pools like any other pool (as CASTOR does). Following discussion, some of the checks done by tsmod-daily-report could be moved into CTA, such as checking supply-pool consistency (tapes are empty, neither disabled nor read-only, etc.; see the consistency-check sketch after this list). Vlado will create a corresponding ticket.
- Alarm system:
- Handling distinct element thresholds (ticket) is now implemented for drives and tapes and is working well.
- Disabling functionality for tapes, drives and libraries is not yet implemented in CTA. Disabling libraries can wait, but for tapes and drives this should be provided with priority.
- The new system runs as a daemon (rather than being invoked at regular intervals, as is the case for the CASTOR alarm system). This requires handling the expiring authentication tokens (KRB5) needed to invoke CTA and CASTOR commands. A possible solution is to de-daemonize it and invoke it via Rundeck (see the Kerberos sketch after this list). David/Vlado to work out a solution.
- Dashboards: It is felt that there are currently far too many dashboards that have grown organically (see overviews here, here, or here). These need to be cleaned up and structured by functionality and audience. For operations, the following would be needed:
- EOSCTA instances
- tapeserver cluster
- CTA tape back-end
- Julien needs to work on a cleanup proposal, gathering input from tape operations.
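As an illustration of the percentile-based dashboard feed mentioned above, a minimal Python sketch: it reduces per-drive runtimes to a percentile summary and prints a JSON document that a monitoring sink could ingest. The producer name, document layout and the source of the runtime figures are assumptions, not the agreed design.

import json
import sys
import time
from statistics import quantiles

def runtime_percentiles(runtimes_hours, points=(50, 90, 95, 99)):
    # statistics.quantiles with n=100 returns the 1st..99th percentile cuts.
    cuts = quantiles(runtimes_hours, n=100)
    return {"p%d" % p: round(cuts[p - 1], 2) for p in points}

if __name__ == "__main__":
    runtimes = [3.2, 5.1, 4.8, 7.9, 2.4, 6.6, 5.5, 8.1]  # hours, one per drive
    doc = {
        "producer": "cta-drive-report",  # hypothetical producer name
        "timestamp": int(time.time()),
        "drive_runtime_hours": runtime_percentiles(runtimes),
    }
    # Stand-in for the real monitoring-sink transport (HTTP POST, AMQP, ...).
    json.dump(doc, sys.stdout)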
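For the supply-pool consistency checks that could move into CTA, the invariant is simple to state: a supply tape must be empty, not disabled and not read-only. A sketch of the check, assuming tape metadata is available as dictionaries with these (hypothetical) field names:

def supply_pool_violations(tapes):
    # Yield (vid, problem) pairs for tapes breaking the supply-pool invariant.
    for t in tapes:
        if t["data_bytes"] > 0:
            yield t["vid"], "not empty"
        if t["disabled"]:
            yield t["vid"], "disabled"
        if t["read_only"]:
            yield t["vid"], "read-only"

# Example usage:
#   for vid, problem in supply_pool_violations(tapes):
#       print(vid, problem)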
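On the KRB5 token-expiry point: if the alarm checks are de-daemonized and run from Rundeck, every run can obtain a fresh ticket from a keytab before invoking the CTA/CASTOR commands, so no long-lived credential ever needs renewing. A sketch of that pattern; the keytab path, principal and the invoked command are placeholders:

import os
import subprocess

KEYTAB = "/etc/krb5.keytab.ctaops"  # hypothetical keytab
PRINCIPAL = "ctaops@CERN.CH"        # hypothetical principal

def main():
    # Use a private credential cache so concurrent runs do not clash.
    cache = "/tmp/krb5cc_ctaops_%d" % os.getpid()
    os.environ["KRB5CCNAME"] = cache
    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
    try:
        # Each Rundeck invocation starts with a freshly obtained ticket,
        # so token expiry inside a long-lived daemon no longer arises.
        subprocess.run(["cta-admin", "drive", "ls"], check=True)
    finally:
        subprocess.run(["kdestroy", "-c", cache], check=False)

if __name__ == "__main__":
    main()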
FTS and GC:
- Work is ongoing to enhance the GC to mark files as evicted (cf. discussion at the previous meeting).
Migration:
- Error-tolerant injection code for directories is ready (cf. last meeting). The next step is to implement it for files (see the sketch below). Michael will use the current PPS with the imported ATLAS namespace to exercise the code. Julien will hand over the PPS instance to Michael after informing ATLAS.
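The per-file pass can follow the same error-tolerant shape as the directory code: attempt every injection, record failures with their reason, and continue instead of aborting the run. This is a sketch of the pattern only; inject_file and the failure-log format stand in for the real tooling:

import json

def inject_files(paths, inject_file, failure_log="failed_injections.json"):
    # Collect failures for a later retry pass rather than stopping on the
    # first error; report successes and failures at the end.
    failures = []
    for path in paths:
        try:
            inject_file(path)  # placeholder for the actual injection call
        except Exception as exc:
            failures.append({"path": path, "error": str(exc)})
    with open(failure_log, "w") as f:
        json.dump(failures, f, indent=2)
    return len(paths) - len(failures), failures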
Devel/ops status updates:
- A new release should be prepared for deployment next week. Julien will select the features and start freezing and building the release on Friday.
- Namespace tombstones: A fix has been provided by Andreas that should avoid leaking file tombstones in the EOS namespace when tape-backed files are deleted. It needs testing and is likely to find its way into EOS 4.5.7.
- Python vs C++ for testing scripts: Using shell/Python shows limitations when running complex concurrent system tests. For such cases, Eric proposes developing helper applications in C++ with a well-documented, easy-to-understand flow, so that these helper tools can be correctly followed by devops.
- The EOS namespace backup/recovery procedure needs to be tested (ticket); in particular, ensuring that directory/path names can be correctly recovered.
- Versioning of DB/protocols for backwards-compatible upgrades: starting on Oct 1st. See Steve's mail below.
--------------------------
Hi Eric and Michael,
message SchedulerGlobalLockPointer {
  required string address = 110;
  required EntryLog log = 111;
}

// Pointer to the archive queue
message ArchiveQueuePointer {
  required string address = 120;
  required string name = 121;
}

// Pointer to the tape queue
message RetrieveQueuePointer {
  required string address = 130;
  required string vid = 131;
}

// The root entry. This entry contains all the most static information, i.e.
// the admin handled configuration information
message RootEntry {
  repeated ArchiveQueuePointer archive_queue_to_transfer_for_user_pointers = 1050;
  repeated ArchiveQueuePointer archive_queue_failed_pointers = 1062;
  repeated ArchiveQueuePointer archive_queue_to_report_for_user_pointers = 1068;
  repeated ArchiveQueuePointer archive_queue_to_transfer_for_repack_pointers = 1069;
  repeated ArchiveQueuePointer archive_queue_to_report_to_repack_for_success_pointers = 1072;
  repeated ArchiveQueuePointer archive_queue_to_report_to_repack_for_failure_pointers = 1073;
  repeated RetrieveQueuePointer retrieve_queue_to_transfer_for_user_pointers = 1060;
  repeated RetrieveQueuePointer retrieve_queue_to_report_for_user_pointers = 1063;
  repeated RetrieveQueuePointer retrieve_queue_failed_pointers = 1065;
  repeated RetrieveQueuePointer retrieve_queue_to_report_to_repack_for_success_pointers = 1066;
  repeated RetrieveQueuePointer retrieve_queue_to_report_to_repack_for_failure_pointers = 1067;
  repeated RetrieveQueuePointer retrieve_queue_to_transfer_for_repack_pointers = 1071;
  optional DriveRegisterPointer driveregisterpointer = 1070;
  optional AgentRegisterPointer agentregisterpointer = 1080;
  optional RepackIndexPointer repackindexpointer = 1085;
  optional RepackQueuePointer repackrequestspendingqueuepointer = 1086;
  optional RepackQueuePointer repackrequeststoexpandqueuepointer = 1088;
  optional string agentregisterintent = 1090;
  optional SchedulerGlobalLockPointer schedulerlockpointer = 1100;
}
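To make the compatibility property concrete: a protobuf reader skips fields whose tag numbers it does not know, so new queue pointers can be added under fresh tags (as with 1072/1073 above) without breaking readers built from the older schema. A minimal Python sketch, assuming the schema above has been compiled with protoc into a hypothetical module objectstore_pb2:

from objectstore_pb2 import RootEntry  # hypothetical generated module

# Writer built from the newer schema populates a field that older readers
# do not know about (a fresh tag number such as 1072).
root = RootEntry()
ptr = root.archive_queue_to_report_to_repack_for_success_pointers.add()
ptr.address = "ArchiveQueue-0001"  # illustrative values
ptr.name = "repack"
blob = root.SerializeToString()

# A reader built from the older schema parses the same blob: the unknown
# tag is skipped (and, with proto2, preserved on re-serialization), so old
# and new components can coexist during a rolling upgrade.
older = RootEntry()  # imagine this class predates tag 1072
older.ParseFromString(blob)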