ALICE Post-migration
- ALICE instance was upgraded this morning.
- Steve wants to discuss space reporting for the garbage collector (#905) as requirements are not clear. See detailed notes below.
- Julien: ensure that ALICE have tested implicit prepare and are happy with it.
- XrdAliceTokenAcc puppetization and ALICE probe (#105 and #666): after the upgrade, our probe is succeeding, but the ALICE probe is failing. Julien has disabled the reporting for now and will investigate with ALICE.
- Steve has e-mailed ALICE to retry their recall with >1,000 clients.
Getting CMS into Production
Repack:
- Repack of public_user is underway. First priority is to separate CMS user data prior to migration.
EOS Backup test (#113):
- Read test successful, but cannot implement prepare evict to clean up buffer after file has been copied out, as this is not exposed in the XRootD Python interface. Elvin has opened a ticket with XRootD developers.
- Write test will be on EOSCTA PUBLIC PPS. Julien will configure SSS.
FTS Archive Monitoring (m-bit) test:
- On 5/11/2020, Katy tested archiving 12 TB (1667 files) from EOS to CTA PPS using fts3-devel. Julien deliberately corrupted some files: only successful archivals were reported as safely on tape.
- Michael will investigate and fix the error reporting in the sys.archive.error xattr (#921). This is preventing reporting of archival failures via query prepare (jobs are only marked as failed when they time out).
- Julien pointed out that CMS timeouts were not set properly: too long for archivals and too short for recalls. Katy corrected the settings for the tests; Julien will check with Katy/CMS DDM team to ensure these are set properly in production.
Reconciliation of Namespace and Catalogue:
OTGs:
- Confirm dates with CMS and send out OTGs next week.
Getting LHCb into Production
- Oliver is following up the Dirac+FTS+TPC issue with Mihai and Chris.
- In the meantime, Michael will ask Chris to try an XRootD TPC between CTA and T1.
Getting PUBLIC into Production
- COMPASS say they started preparing for their DAQ tests last week and will send data shortly.
- NA62 are scheduled to do recall tests this week. Vova is contacting Barbara.
- We have re-established contact with n_TOF. Their code uses libshift, i.e. a lot of Cns_ API calls rather than XRoot calls. They are willing to change this to standard XRoot calls but are reluctant to migrate to CTA before the current operation year. Michael and Vova will follow up and try to make a plan which works for them.
Steve's minutes from the "free space" discussions
- The concept of free disk space is different for a pure disk-only EOS instance and an EOSCTA instance. A disk-only instance is concerned with reporting free space over long periods of time and is not interested in filtering down to just the file systems that are currently writable (RW in EOS speak). An EOSCTA instance on the other hand is only interested in RW file systems as it really wants to know if it can write the next file to the small disk cache without running out of space.
- There are at least 4 components/areas of an EOSCTA instance that rely on calculating free space:
- The tape server back pressure.
- The tape aware garbage collector in the MGM.
- The MGM code used to reply to "xrdfs MGM_HOST query space /" commands.
- The ENOSPC logic of EOS.
- There are at least two different places in the EOS MGM that calculate free space:
- The MGM code used to reply to "xrdfs MGM_HOST query space /" commands.
- The tape aware garbage collector in the MGM.
- Leaving aside the argument over which values should be used, a free space calculation can draw on at least the following:
- The amount of free raw space in a filesystem.
- The status of a file system: Is it booted?
- The “active” status of a file system: Is it on-line?
- The configuration of a file system: Is it writeable (RW)?
- The scaling factor to convert physical/raw layout such as 2 replica or RAIN to logical space.
- The amount of headroom reserved for EOS in a filesystem. The main reason for EOS headroom is to deal with files that are streamed into EOS and for which the end user does not know the final size. CTA should not have this problem. When a pre-existing file is xrdcp’ed into EOS, its size is implicitly sent when the file is opened. When a cta-taped daemon starts writing a file from its memory to disk, it encodes the final size of that file in the URL used to open the file.
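Illustration only (not the MGM or cta-taped code): a minimal Python sketch of how the inputs listed above could be combined into one logical free space figure. The FileSystem class, its field names and the logical_free_bytes function are hypothetical, and whether each input should be used at all is exactly what is under discussion.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FileSystem:
    """Hypothetical per-filesystem record; field names are assumptions."""
    free_raw_bytes: int      # free raw space reported by the file system
    booted: bool             # boot status: is it booted?
    online: bool             # "active" status: is it on-line?
    writable: bool           # configuration: is it RW?
    headroom_bytes: int = 0  # space reserved for EOS headroom


def logical_free_bytes(filesystems: Iterable[FileSystem],
                       scale_factor: float) -> int:
    """Return logical free space over booted, on-line, RW file systems.

    scale_factor is the number of raw bytes consumed per logical byte,
    e.g. 2.0 for a 2-replica layout (assumed convention).
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    usable_raw = 0
    for fs in filesystems:
        # Skip file systems that cannot accept the next file.
        if not (fs.booted and fs.online and fs.writable):
            continue
        # Never let headroom push a file system's contribution below zero.
        usable_raw += max(0, fs.free_raw_bytes - fs.headroom_bytes)
    return int(usable_raw / scale_factor)
```

With headroom set to zero, the result reduces to the raw free space of the eligible file systems divided by the scaling factor.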
- The current plan to move forward is:
- Set headroom to zero for each EOS filesystem in our EOSCTA instances. The current belief as explained above is that CTA does not need any EOS headroom.
- Have the tape-aware garbage collector copy the behaviour of the cta-taped daemon by having it call an external script to get what the CTA operators think is the current amount of free space in a given EOS space.
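Illustration only: a minimal Python sketch of the "call an external script" pattern in the last point. The real garbage collector lives in the MGM, so this only shows the shape of the interaction. The command line, the convention of passing the space name as the last argument, and the assumption that the script prints a single integer number of bytes are all hypothetical.

```python
import subprocess
from typing import Sequence


def free_bytes_from_script(command: Sequence[str], space_name: str,
                           timeout_s: float = 10.0) -> int:
    """Run an operator-supplied script and return the free bytes it reports.

    Assumed contract (not confirmed by the minutes): the script takes the
    EOS space name as its last argument and prints one integer to stdout.
    """
    result = subprocess.run(
        [*command, space_name],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises TimeoutExpired if the script hangs
        check=True,         # raises CalledProcessError on non-zero exit
    )
    # int() raises ValueError if the output is not a plain integer.
    return int(result.stdout.strip())
```

Errors are raised rather than swallowed, so the caller can decide whether to keep the last known value or stop collecting when the script fails.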