CTA deployment meeting

Name: CTA deployment meeting
Start: 2020-11-09T14:00:00+01:00
End: 2020-11-09T16:00:00+01:00
Location: CERN

Monday 9 Nov 2020, 14:00 → 16:00 Europe/Zurich

600/R-001 (CERN)

Michael Davis (CERN)

ALICE Post-migration

ALICE instance was upgraded this morning.
Steve wants to discuss space reporting for the garbage collector (#905) as requirements are not clear. See detailed notes below.
Julien: ensure that ALICE have tested implicit prepare and are happy with it.
XrdAliceTokenAcc puppetization and ALICE probe (#105 and #666): after the upgrade, our probe is succeeding, but the ALICE probe is failing. Julien has disabled the reporting for now and will investigate with ALICE.
Steve has e-mailed ALICE to retry their recall with >1,000 clients.

Getting CMS into Production

Repack:

Repack of public_user is underway. First priority is to separate CMS user data prior to migration.

EOS Backup test (#113):

Read test successful, but cannot implement prepare evict to clean up buffer after file has been copied out, as this is not exposed in the XRootD Python interface. Elvin has opened a ticket with XRootD developers.
Write test will be on EOSCTA PUBLIC PPS. Julien will configure SSS.

FTS Archive Monitoring (m-bit) test:

On 5/11/2020, Katy tested archiving 12 TB (1667 files) from EOS to CTA PPS using fts3-devel. Julien deliberately corrupted some files: only successful archivals were reported as safely on tape.
Michael will investigate and fix the error reporting in the sys.archive.error xattr (#921). This is preventing reporting of archival failures via query prepare (jobs are only marked as failed when they time out).
Julien pointed out that CMS timeouts were not set properly: too long for archivals and too short for recalls. Katy corrected the settings for the tests, Julien will check with Katy/CMS DDM team to ensure these are set properly in production.

Reconciliation of Namespace and Catalogue:

Katy says their top-level directories are /cms and cms/store. Eric Vaandering (EV) said: "the only thing we can do is check the CTA namespace with what is in Rucio which is basically paths containing cms/store/{data, mc, hidata, himc, results, generator, lumi, relval}. Anything else is outside of our visibility."
Michael sent Katy a list of temporary files and test files which could possibly be deleted prior to migration.
EV said (2/11/2020): "Igor’s tools will generate both lists [data missing in CTA and dark data in CTA]. We will generate those and work with you, but we probably shouldn’t take action on anything before also verifying it in PhEDEx since we have not yet achieved 100.0% consistency between our two systems."
Conclusion: we will continue with the Rucio/CTA reconciliation but it is likely that this activity will continue after the migration.
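
The two lists EV mentions (data missing in CTA and dark data in CTA) amount to set differences between the two catalogues, restricted to the paths Rucio can see. The following is a minimal Python sketch of that idea only. It is not Igor's tooling, and the function names and the flat lists of path strings are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Subdirectories of cms/store that EV listed as visible to Rucio.
RUCIO_VISIBLE = ("data", "mc", "hidata", "himc", "results", "generator", "lumi", "relval")


def in_rucio_scope(path: str) -> bool:
    """True if the path contains cms/store/<visible subdirectory>/ (hypothetical helper)."""
    return any(f"cms/store/{sub}/" in path for sub in RUCIO_VISIBLE)


def reconcile(cta_paths: Iterable[str], rucio_paths: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing_in_cta, dark_in_cta) for paths within Rucio's visibility.

    Assumes both inputs use the same path form, so plain string equality is enough.
    """
    cta = {p for p in cta_paths if in_rucio_scope(p)}
    rucio = set(rucio_paths)
    missing_in_cta = rucio - cta   # known to Rucio, absent from CTA
    dark_in_cta = cta - rucio      # present in CTA, unknown to Rucio
    return missing_in_cta, dark_in_cta
```

Paths outside cms/store/{data, mc, ...} are dropped before comparison, which matches the statement that anything else is outside Rucio's visibility. As noted in the quote, neither list should be acted on without further verification.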

OTGs:

Confirm dates with CMS and send out OTGs next week.

Getting LHCb into Production

Oliver is following up the Dirac+FTS+TPC issue with Mihai and Chris.
In the meantime, Michael will ask Chris to try an XRootD TPC between CTA and T1.

Getting PUBLIC into Production

COMPASS say they started preparing for their DAQ tests last week and will send data shortly.
NA62 are scheduled to do recall tests this week. Vova is contacting Barbara.
We have re-established contact with n_TOF. Their code uses libshift, i.e. a lot of Cns_ API calls rather than XRoot calls. They are willing to change this to standard XRoot calls but are reluctant to migrate to CTA before the current operation year. Michael and Vova will follow up and try to make a plan which works for them.

Steve's minutes from the "free space" discussions

The concept of free disk space is different for a pure disk-only EOS instance and an EOSCTA instance. A disk-only instance is concerned with reporting free space over long periods of time and is not interested in filtering down to just the file systems that are currently writable (RW in EOS speak). An EOSCTA instance, on the other hand, is only interested in RW file systems, as it needs to know whether it can write the next file to the small disk cache without running out of space.
There are at least four components/areas of an EOSCTA instance that rely on calculating free space:
- The tape server back pressure.
- The tape aware garbage collector in the MGM.
- The MGM code used to reply to "xrdfs MGM_HOST query space /" commands.
- The ENOSPC logic of EOS.
There are at least two different places in the EOS MGM that calculate free space:
- The MGM code used to reply to "xrdfs MGM_HOST query space /" commands.
- The tape aware garbage collector in the MGM.
Setting aside the argument over which values should be used to calculate free space, such a calculation can be made from at least the following:
- The amount of free raw space in a filesystem.
- The status of a file system: Is it booted?
- The “active” status of a file system: Is it on-line?
- The configuration of a file system: Is it writeable (RW)?
- The scaling factor to convert a physical/raw layout, such as 2-replica or RAIN, to logical space.
- The amount of headroom reserved for EOS in a filesystem. The main reason for EOS headroom is to deal with files that are streamed into EOS and for which the end user does not know the final size. CTA should not have this problem. When a pre-existing file is xrdcp'ed into EOS, its size is implicitly sent when the file is opened. When a cta-taped daemon starts writing a file from its memory to disk, it encodes the final size of that file in the URL used to open the file.
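
To illustrate how the inputs listed above could combine, here is a minimal Python sketch. It is not the EOS MGM code: the `FileSystem` record, its field names, the `usable_free_bytes` function and the example scaling factor are all assumptions. It takes the EOSCTA view described earlier, counting only file systems that are booted, on-line and RW.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FileSystem:
    """Hypothetical per-file-system record; field names are illustrative only."""
    free_raw_bytes: int      # free raw space reported for the file system
    booted: bool             # status: is it booted?
    online: bool             # "active" status: is it on-line?
    rw: bool                 # configuration: is it writable (RW)?
    headroom_bytes: int = 0  # space reserved as EOS headroom


def usable_free_bytes(filesystems: Iterable[FileSystem], scaling_factor: float = 1.0) -> int:
    """Logical free space summed over booted, on-line, RW file systems.

    scaling_factor converts raw space to logical space; 0.5 is assumed here
    for a 2-replica layout and 1.0 for a single replica.
    """
    raw = 0
    for fs in filesystems:
        if not (fs.booted and fs.online and fs.rw):
            continue  # ignore file systems that cannot take the next file
        # Headroom is subtracted per file system and never goes negative.
        raw += max(fs.free_raw_bytes - fs.headroom_bytes, 0)
    return int(raw * scaling_factor)
```

With headroom set to zero, the per-file-system term reduces to the raw free space, so the result depends only on the three status checks and the scaling factor.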
The current plan to move forward is:
- Set headroom to zero for each EOS filesystem in our EOSCTA instances. The current belief as explained above is that CTA does not need any EOS headroom.
- Have the tape-aware garbage collector copy the behaviour of the cta-taped daemon by having it call an external script to get what the CTA operators think is the current amount of free space in a given EOS space.
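
The second step of the plan, delegating the free-space answer to an external script, could look roughly like the Python sketch below. The garbage collector itself lives in the MGM, so this shows only the pattern. The function name, the timeout and the assumption that the script prints a single integer number of bytes are all hypothetical, not the actual cta-taped script interface.

```python
import subprocess
import sys
from typing import Sequence


def query_free_bytes(command: Sequence[str], timeout_s: float = 10.0) -> int:
    """Run an operator-supplied command and parse the free space from its stdout.

    Assumes (hypothetically) that the command prints one non-negative integer.
    Raises if the command fails, times out, or prints anything else.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    value = int(result.stdout.strip())
    if value < 0:
        raise ValueError(f"negative free space reported: {value}")
    return value


if __name__ == "__main__":
    # Stand-in for a real operator script: a one-line Python command.
    print(query_free_bytes([sys.executable, "-c", "print(12345)"]))
```

Failing loudly on a timeout or unparsable output is a design choice for the sketch. A real caller would need a policy for what to do when the script cannot be trusted.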

- 14:00 → 14:10
  ALICE Post-migration 10m
  - Steve is working on space reporting for the garbage collector (see #905).
  - Implicit prepare test.
  - XrdAliceTokenAcc puppetization and ALICE probe (#105 and #666).
  - Test of >1,000 clients (Latchezar said 10,000).
- 14:10 → 14:20
  Getting CMS into Production 10m
  - See Putting EOSCTACMS into Production: TO DO List
  - See Notes from meeting with CMS 02/10/2020
  TO DO
  - Reconciliation Rucio ←→ CTA
  - Test archive monitoring with FTS
  - Test EOS Backup (Elvin's daemon and user backup tools on EOS CMS) (#113).
  - Repack public_user to separate CMS files.
  - OTGs: CASTOR and CTA.
  Schedule
  - 9 Nov 2020: Pre-MWGR tests (CASTOR only)
  - 18 Nov 2020: MWGR#4 (CASTOR only)
  - w/c 23 Nov 2020: Put CASTOR CMS into recall-only mode
  - 30 Nov 2020: Migrate to CTA
  - 07 Dec 2020: CMS in production
  - End Dec 2020? : CMS retires PhEDEx
- 14:20 → 14:30
  Getting LHCb into Production 10m
  - See Putting EOSCTALHCB into Production (CodiMD)
  TO DO/Schedule
  - DAQ test : waiting for network connectivity
  - Test HTTP TPC with CTA (FTS multi-hop with one "hop" as QoS change, one hop as the transfer) to be agreed with Christophe
- 14:30 → 14:40
  Getting PUBLIC into Production 10m
  - Migration of SMEs from CASTOR to CTA, see Putting EOSCTAPUBLIC into Production
  COMPASS
  - Metadata check checksum bulk query: they will do it the slow way for now
  - Mon 9 Nov 2020 : DAQ tests on CTA (#69)
  NA62
  - w/c 9 Nov 2020: Recall tests (30 TB) on EOSCTA PPS (#72).
  - Some NA62 files still in r_public_user, will be repacked as part of the repack campaign. These files are not currently available on EOSCTA PPS.
  NA61/SHINE
  - Will start CTA integration in January
  - Physics starts in June
  n_TOF
  - Michael pinged them last week
- 14:40 → 14:50
  CTA Repack 10m
  - Julien: Maintenance processes and repack will be disabled on all tape servers except for two?
  - Vlado: Test repack with 3+2 drives.
  - Cedric: Remove "superseded" files in favour of the recycle bin.
- 14:50 → 15:00
  Tape Lifecycle and Workflow documentation 10m
  - Some documentation exists: CASTOR tape states, Vlado's documentation of Tape Lifecycle and Broken Tape Workflows.
  - Michael is synthesising this and will output a state diagram for review by the team.
  - The Archive Workflow documentation is missing a description for what happens in the case of dual-copy tape pools. Michael will create a state diagram for this as well.
  - Michael to investigate/fix the sys.archive.error issue.
- 15:00 → 15:05
  AOB 5m
