CTA deployment meeting

Europe/Zurich
600/R-001 (CERN)

Michael Davis (CERN)

What would it mean to go to production during the Covid-19 lockdown?

  • Steve suggested that we could create a tool to import tapes written in CTA into CASTOR. This would give us the option of a full rollback. It should not be too hard to do, and it reduces the risk of proceeding with the migration.
  • Spectra library delivery is delayed, but this is not a blocker. We can load media into IBM Lib1.
  • For DCS operators, there is not much change to their tools. They would need to learn cta-admin. Training could be done using videoconferencing/collaboration tools.
  • Commissioning/acceptance testing with ATLAS SFO was eventually completed, but took longer than expected due to communication difficulties and availability of experts. No problems on the CTA side but the rate of archive requests from SFO was much lower than we hoped for. Julien wrote a detailed report.
  • Previously we had 2 CTA instances, now we have 3 (Production, PPS and Migration). PPS will be used for write tests, and Migration will be used to run migration tests for each experiment before it is migrated to Production.
  • The main thing is that the CTA instance configuration must be stable so we can share the information. We are converging on the final production configuration.
  • Administering the CTA tape side is easy; administering the EOS buffer is a different story. Julien raised the idea that operating the CTA EOS instance could be shared with the experiment operations team. After discussion, we preferred the idea that a couple of people in the EOS disk operations team could train on CTA and give us support. In return, Julien could help their team with issues such as hardware specs and procurement.

Conclusions

  • There are no blocking issues to stop us going to production. Obviously in the current teleworking situation communication is more difficult and everything takes a little longer. We can proceed, but on a more conservative schedule.
  • We need ATLAS to confirm that they are satisfied with the results of the SFO tests.
  • We will proceed with the rest of the commissioning tests we had planned. These are less complicated because we do not depend on anyone else.
  • It is important that everyone knows how the instances are configured. Once Julien has the final production instance configuration he will document it.
  • Approach EOS operations team to see if a couple of people could be trained in CTA operations.
  • The "import CTA tapes to CASTOR" tool seems like a good idea that would let us all sleep easier.

Communication

  • Zoom has been more reliable than Vidyo. However, Julien had a poor connection and was not able to dial in to the meeting using the French phone number.
  • Is it possible to get a CERN phone for Julien?

ATLAS Repack

  • As the schedule for moving ATLAS to production will be put back by at least a few weeks, Vlado will push ahead with repacking ATLAS on CASTOR as quickly as possible. The goal is to repack all of ATLAS within one month.

CTA Testing Status

  • The ATLAS SFO test was completed. We need to verify with ATLAS that they are now satisfied and that we can go to production.
  • Julien will continue with the simultaneous write/recall/delete tests, the dual-copy tape pool test, the Tier-1 export test and the "What happens when the buffer is full" test.
  • In parallel, Julien will proceed with setting up CMS testing.
  • GC is ready for ALICE. We need to fix issue #631 before we can start testing.
  • No data has been sent to the PUBLIC test instance from nTOF or NA62. This is not on our critical path so we will not push them.
    • 13:30 13:50
      What would it mean to go to production during the Covid-19 lockdown? 20m

      "What could possibly go wrong?"

      What issues will be made more complex if we are teleworking?

      • This is not "business as usual" teleworking, it is "best effort". People (other team members and people outside our team) may not be available when we need them. Communication lines are lengthened and many things take longer than usual.
      • Impact on tape operations (Vlado): managing operations from off-site. Delivery and installation of new library. Procurement of tape media.
      • Training tape operators remotely
      • Monitoring CTA instances and fixing problems: monitoring the health of the EOS buffer, recovering files that did not make it to tape, detecting stuck/failed jobs, etc.
      • Maintaining the instances: installing/upgrading new versions of EOS, CTA, etc.
      • What else?
    • 13:50 14:00
      Communication 10m

      Communication is more difficult and the possibilities for miscommunication are higher.

      • Do talk about your problems and frustrations. Besides our online meetings you can call me any time.
      • Spend some time on documentation. Keep the documentation websites up-to-date.
      • Any other ideas about how we can communicate more effectively?
    • 14:00 14:10
      ATLAS Repack (Vlado) 10m

      ATLAS CASTOR 7TB tape stats

      • Started with 1545 tapes initially: 683 repacked, 862 still to repack. The remaining tapes represent ~6 PB of free space needed.
      • We currently have in free supply: 5 PB on LTO-7M (9 TB) in IBMLIB1 and 5 PB of LTO-7M in stock on a pallet (waiting for the Spectra Logic library) + 5.5 PB on IBM JE (20 TB) media
      • In addition we also have 6 PB of free space on IBM JD media (15 TB)
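
The figures above can be cross-checked with a short calculation. This is only a sketch: it assumes every remaining tape is full at the nominal 7 TB and that 1 PB = 1000 TB, and the variable names are illustrative.

```python
# Figures taken from the ATLAS CASTOR 7 TB tape stats above.
TOTAL_TAPES = 1545
REPACKED_TAPES = 683
TAPE_CAPACITY_TB = 7  # assumption: each remaining tape is full at nominal capacity

remaining_tapes = TOTAL_TAPES - REPACKED_TAPES
required_pb = remaining_tapes * TAPE_CAPACITY_TB / 1000  # assumption: 1 PB = 1000 TB

# Free supply in PB, as listed above.
free_supply_pb = {
    "LTO-7M in IBMLIB1": 5.0,
    "LTO-7M on pallet (waiting for Spectra Logic library)": 5.0,
    "IBM JE (20 TB)": 5.5,
    "IBM JD (15 TB)": 6.0,
}
total_free_pb = sum(free_supply_pb.values())

print(f"Tapes still to repack: {remaining_tapes}")
print(f"Space needed: ~{required_pb:.1f} PB")
print(f"Total free supply: {total_free_pb:.1f} PB")
```

Under these assumptions the total free supply comfortably exceeds the space needed for the remaining repack.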

      Questions

      • Do we continue the repack of ATLAS tapes before we migrate them to CTA? (I would say YES, to eliminate/postpone/buy time from the pressure to have massive repack working, given the manpower.)
      • Do we continue slowly or more aggressively? (I can repack 100 tapes/week with 1 output stream, to have nice collocation. How much time do we have before ATLAS switches? I think at least a month, so we could repack it all if we increase the output streams.)
      • What destination media do we use? (I kept the 15 TB tapes unused so that we can spread the potential data over multiple streams/media if they are coming fast. With additional LHC delays due to coronavirus that is unlikely; we should be fine with 20 LTO drives + 10 TS1160 drives for any new data. It means that we can fill all the 15 TB tapes.)
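
As a rough illustration of the "slowly or more aggressively" question, the sketch below estimates how long the remaining 862 tapes would take at 100 tapes/week per output stream. It assumes that throughput scales linearly with the number of output streams and that "a month" means 4 weeks; neither assumption comes from the minutes, and the function names are hypothetical.

```python
import math

REMAINING_TAPES = 862            # from the tape stats above
TAPES_PER_WEEK_PER_STREAM = 100  # quoted rate with 1 output stream


def weeks_to_repack(streams: int) -> float:
    """Weeks needed for the remaining tapes, assuming linear scaling with streams."""
    return REMAINING_TAPES / (TAPES_PER_WEEK_PER_STREAM * streams)


def streams_needed(deadline_weeks: float) -> int:
    """Smallest number of streams that meets the deadline under the same assumption."""
    return math.ceil(REMAINING_TAPES / (TAPES_PER_WEEK_PER_STREAM * deadline_weeks))


for n in (1, 2, 3):
    print(f"{n} stream(s): {weeks_to_repack(n):.1f} weeks")

print(f"Streams needed for a 4-week deadline: {streams_needed(4)}")
```

In practice, adding output streams trades collocation for speed, so the real rate may not scale this cleanly.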
    • 14:10 14:20
      CTA Testing Update 10m

      ATLAS Testing

      • ATLAS SFO test debrief (Julien)
      • Simultaneously run our own write/recall/delete tests on the instance
      • Test dual-copy tape pools
      • Tier-1 export test
      • "What happens when the buffer is full" test

      ALICE Testing

      • Steve says the Garbage Collector will be ready for testing from next week. Next steps?

      CMS Testing

      • Julien has been working on test specifications.

      Public Testing

      • Are we seeing any traffic from nTOF or NA62?
    • 14:20 14:25
      AOB 5m