CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

SpectraLogic Library

  • Delivery has been arranged but at present there is no date for installation

Monitoring and Statistics

  • Most of the monitoring dashboards are now online
  • The "all files/data on tape" cumulative graph using Cedric's statistics DB will be tested during the fill of the upcoming ALICE instance and the ATLAS read/write stress tests. These statistics will subsequently be wiped.
  • The PL/SQL code for statistics has been superseded; Cedric will remove it.

EOSCTAATLAS

  • The ATLAS SFO and DDM teams have started talking to each other about the "is file safely on tape" handshake and the "Too many OpenDir queries" issue. In the medium term this issue will be fixed using FTS. We will follow up with the FTS team in due course.
  • No other blocking issues for ATLAS. As all our resources are now required to prepare for the ALICE recall campaign, ATLAS can go on the back-burner for a while.
  • The read/write stress test will be resumed once the ALICE instance has been prepared.

EOSCTACMS

  • The hop from EOSCMS to Tier-1 is not working properly. This is being followed up with the FTS team.

EOSCTAALICE

  • The ALICE reprocessing campaign is foreseen to run from April until the end of June.
  • We will set up a ~5PB CTA instance for this campaign. The hardware to be used for this instance is scheduled to be retired in the summer.
  • Discussion on whether to use RAIN or single replica for this instance left many questions unanswered. Julien will set up a test RAIN configuration and a decision will be made when we have more information.
    • 14:00 14:10
      Monitoring and Statistics 10m
      • Update on monitoring/dashboards (David)
      • When should we get rid of Giuseppe's PL/SQL code for Statistics?
      • Testing Cedric's "all files/data on tape" statistics DB.
    • 14:10 14:30
      EOSCTAATLAS 20m
      • Too many OpenDir queries: At the IT/ATLAS meeting, ATLAS said they are happy to use the FTS "is file safely on tape" feature to fix this issue, which Eddie says will be available in the summer. (This means ATLAS and CMS will have the same setup.) Is there any pressing reason to fix this in the interim, or can we just leave it for now?
      • Status of simultaneous read/write stress test (Julien)
      • Any other issues for ATLAS?
    • 14:30 14:40
      EOSCTACMS 10m
      • Brief status update (Julien)
    • 14:40 15:00
      EOSCTAALICE 20m
      • Report back from yesterday's meeting with Latchezar (Steve/Cedric). See Steve's notes from the meeting on CodiMD.
      • ALICE want to do a campaign of 3 weeks of data transfers (subsequently revised to run until mid-May, i.e. 5 weeks) followed by 3 months of reprocessing. This is extremely challenging, as it will take 3 weeks just to drain the 5 PB of disks and move them to the EOSCTAALICE instance. We need to communicate to ALICE that this is an exceptional request and that we will do it on a best-effort basis, with no guarantee that all the data will be there by mid-May.
      • Will the "spinner" space be single replica or RAIN?
      • Technical proposal from IT-ST to Latchezar on Tue 14 April on how the ALICE recall campaign will work (Steve/Julien).
      • Confirm that the Garbage Collector tests will be done on a different instance (EOSCTAALICEPPS?). Due to the time pressure above it will not be possible to start the GC tests until the ALICE instance is fully installed and recalls are underway.
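      For a rough sense of scale for the ALICE timeline above, the sustained rates implied by the quoted figures (5 PB, 3 weeks, 5 weeks) can be estimated as follows. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 10^15 bytes), a constant rate with no downtime, and that the volume to transfer is about 5 PB, none of which is stated in the notes.

```python
PB = 10**15                       # decimal petabyte in bytes (assumption)
SECONDS_PER_WEEK = 7 * 24 * 3600


def sustained_rate_gb_per_s(volume_bytes: float, weeks: float) -> float:
    """Average rate in GB/s (decimal) needed to move volume_bytes in `weeks` weeks."""
    return volume_bytes / (weeks * SECONDS_PER_WEEK) / 10**9


# Rate implied by draining 5 PB of disk in 3 weeks.
drain_rate = sustained_rate_gb_per_s(5 * PB, 3)
# Rate needed to move 5 PB of data within the 5-week transfer window.
window_rate = sustained_rate_gb_per_s(5 * PB, 5)

print(f"5 PB in 3 weeks: {drain_rate:.2f} GB/s")   # 2.76 GB/s
print(f"5 PB in 5 weeks: {window_rate:.2f} GB/s")  # 1.65 GB/s
```

      Real throughput would be lower than these averages whenever hardware is being moved or reinstalled, which is one reason for the best-effort caveat.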
    • 15:00 15:05
      AOB 5m
      • Need to test the failure scenario where a disk stops working, i.e. EOS reports that a disk replica exists but it has gone. Should we detect this scenario and simply remove the entry for the non-existent disk replica so that the client can retry?
      • GitLab issues have been reorganised into a Kanban Dashboard to allow us to more easily work on parallel tasks and prioritize between them.
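      The disk-failure scenario raised under AOB (a catalogue entry points to a disk replica that no longer exists) could be handled along the lines sketched below. This is purely illustrative and does not use any EOS or CTA API: `drop_stale_replicas`, the dictionary layout and the `replica_exists` callback are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List


def drop_stale_replicas(
    replicas: Dict[str, List[str]],
    replica_exists: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Prune catalogue entries whose replica is gone.

    `replicas` maps a file path to the disk locations recorded for it
    (hypothetical layout). Entries for which `replica_exists(path, location)`
    is False are removed in place. Returns {path: [removed locations]}.
    """
    removed: Dict[str, List[str]] = {}
    for path, locations in replicas.items():
        stale = [loc for loc in locations if not replica_exists(path, loc)]
        if stale:
            # Keep only the locations that were verified to exist.
            replicas[path] = [loc for loc in locations if loc not in stale]
            removed[path] = stale
    return removed


# Toy example: the catalogue lists two replicas, but only one is really on disk.
catalogue = {"/example/file1": ["disk-01"], "/example/file2": ["disk-02"]}
on_disk = {("/example/file2", "disk-02")}

removed = drop_stale_replicas(catalogue, lambda p, loc: (p, loc) in on_disk)
print(removed)  # {'/example/file1': ['disk-01']}
```

      Once the stale entry is gone, a file with no remaining disk replica looks like it needs a fresh recall, so a client retry would trigger one rather than failing again on the missing copy.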