FTS3 Steering Meeting

Europe/Zurich
28/R-015 (CERN)

  • A request was made to share the criteria used to decide whether an error is recoverable
  • Agreed that restoring a running FTS3 from a backup is dangerous
    • Additionally, for postmortems the backed-up data is very likely too old to be useful anyway
    • Backups can be disabled, or done with the lowest integrity constraints possible, so as to reduce the load
    • [FTS-485] - Document new backup policy
  • A low number of actives for a link seems not to completely honor the activity shares
    • Hard to debug, since the number of queued transfers per activity is needed too, and this historical view is not available
    • To follow up with the Dashboard to see whether this can be better visualized
    • See if this can be artificially reproduced offline
    • [FTS-484] - Activity shares: honored?
  • Stalled connections still happen, although they are not as critical
    • Backporting FTS-435 to fts-rest 3.4.2 may reduce the impact, or at least narrow down the possibilities if it happens again
  • Browsers raise SSL complaints when accessing the monitoring
    • Root CAs need to be installed: CERN CA, UK CA
    • May be the reason for multiple errors from the monitoring (manual certificate verification is only temporary)
  • No Oracle FTS3 is used within WLCG
  • ATLAS will test the new functionality running on fts3-pilot.cern.ch. The upgrade to production will follow once the green light is given.
  • Experiments have been asked to document how they use, or expect to use, the messaging, so that it can be better tested on that side too.
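
As a possible starting point for the offline reproduction mentioned under FTS-484, the sketch below shows one way a small number of active slots can fail to match the configured shares simply because slots are whole numbers. It is an illustration only, not the FTS3 scheduler: the function name allocate_slots, the largest-remainder rounding, and the activity names and weights other than "Express" are assumptions made for this example. It also ignores how many files are actually queued beyond "zero or non-zero".

```python
from math import floor


def allocate_slots(shares, queued, slots):
    """Split `slots` active transfers across activities by share weight.

    Hypothetical model, not FTS3 code: only activities with queued work
    compete, and whole slots are assigned with the largest-remainder
    method (an assumption chosen for this sketch).
    """
    eligible = {a: w for a, w in shares.items() if queued.get(a, 0) > 0}
    total = sum(eligible.values())
    if total == 0 or slots <= 0:
        return {a: 0 for a in shares}

    # Ideal (fractional) number of slots per activity.
    ideal = {a: slots * w / total for a, w in eligible.items()}
    alloc = {a: floor(x) for a, x in ideal.items()}

    # Hand the remaining slots to the largest fractional remainders.
    leftover = slots - sum(alloc.values())
    by_remainder = sorted(eligible, key=lambda a: ideal[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1

    return {a: alloc.get(a, 0) for a in shares}


# Hypothetical weights; only the name "Express" comes from the minutes.
shares = {"Express": 4, "Production": 5, "User": 1}
queued = {"Express": 10, "Production": 10, "User": 10}

few = allocate_slots(shares, queued, slots=2)
many = allocate_slots(shares, queued, slots=20)
print(few)
print(many)
```

In this model, two slots cannot represent a 4:5:1 split, so one activity is starved, while twenty slots match the weights exactly. Comparing such a model against real per-activity queue counts would still require the historical view noted as missing above.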
    • 16:00 - 16:30
      News and updates (30m)
      * Release 3.4: https://fts3-service.web.cern.ch/content/fts-340
      * Operational changes
      * How to handle backups
    • 16:30 - 16:45
      Requirements gathering (15m)
      * Messages: which ones and which fields are in use? Any type expectations?
    • 16:45 - 17:15
      AOB (30m)
      ATLAS
      * Stalled connections: We still see stalled connections (>3600 seconds) from time to time. We implemented a restart of our agent when they appear, so it is not really a blocker anymore, but it would be good to understand why they still occur.
      * Transactional submission: What is the status of the internal job id? https://its.cern.ch/jira/browse/FTS-372
      * Shares: There are some concerns about the Express share not working properly (see mail in attachment).
      * Monitoring: It is often broken (e.g. https://fts3.cern.ch:8449/fts3/ftsmon/#/generic-error?jobId=cb422eda-cc99-11e5-a4a7-02163e010724). It would be good to have something a bit more stable.
      * Changing the priority of active requests: This would be very useful if we could boost some file transfers. https://its.cern.ch/jira/browse/FTS-300
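
The stalled-connection workaround described by ATLAS (restart the agent when a connection has been stuck for more than 3600 seconds) could be approximated by a check like the one below. This is a minimal sketch under stated assumptions, not the ATLAS agent code: the function find_stalled, the last_activity mapping of connection id to last-activity timestamp, and the connection ids are all hypothetical; only the 3600-second threshold comes from the agenda item.

```python
import time

STALL_THRESHOLD_S = 3600  # threshold quoted in the ATLAS agenda item


def find_stalled(last_activity, now=None, threshold=STALL_THRESHOLD_S):
    """Return the ids of connections idle for longer than `threshold` seconds.

    `last_activity` maps a (hypothetical) connection id to the Unix
    timestamp of its last observed activity.
    """
    now = time.time() if now is None else now
    return sorted(cid for cid, ts in last_activity.items() if now - ts > threshold)


# Example with fixed timestamps so the outcome does not depend on the clock.
observed = {"conn-a": 0, "conn-b": 5000}
stalled = find_stalled(observed, now=6000)
print(stalled)
```

A caller would restart its agent (or log the event for later analysis) whenever the returned list is non-empty; recording the stalled ids may also help in understanding why the stalls still occur.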