Why we may need RAIN on SSDs
- Handling files bigger than a single SSD
- We don’t have to use the converter to convert file layouts (though we still have to use it to change spaces)
Debrief on the ALICE CTA+RAIN tests
RAIN layout
Most of the problems we experienced were not specific to RAIN; they were caused by using the EOS converter. One RAIN-specific problem is that, when recalling from tape to RAIN, ls -y reports d10::t1. Here this means there is one file copy with 10 stripes, not 10 replicas of the file.
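To make the d10::t1 ambiguity concrete, here is a minimal Python sketch that splits such a field into its two counters and labels the disk counter according to the layout. It assumes the field always has the form d<N>::t<M> and that the t counter is the number of tape copies. The names parse_copy_summary and describe are hypothetical helpers, not part of EOS, and whether a file is RAIN has to be supplied by the caller.

```python
import re
from typing import NamedTuple


class CopySummary(NamedTuple):
    disk_count: int  # number after "d": replicas, or stripes for a RAIN file
    tape_count: int  # number after "t": assumed to be the number of tape copies


# Assumed field format, based on the "d10::t1" example in these notes.
_PATTERN = re.compile(r"^d(\d+)::t(\d+)$")


def parse_copy_summary(field: str) -> CopySummary:
    """Split a 'd<N>::t<M>' field into its two integer counters."""
    match = _PATTERN.match(field.strip())
    if match is None:
        raise ValueError(f"unexpected copy summary: {field!r}")
    return CopySummary(int(match.group(1)), int(match.group(2)))


def describe(field: str, is_rain: bool) -> str:
    """Label the disk counter as stripes (RAIN) or replicas (otherwise)."""
    summary = parse_copy_summary(field)
    unit = "stripes" if is_rain else "replicas"
    return f"disk: {summary.disk_count} {unit}, tape copies: {summary.tape_count}"
```

With this sketch, describe("d10::t1", is_rain=True) labels the 10 as stripes, which is the reading that applies to a file recalled onto RAIN.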
What are the use cases for conversion?
The EOS devs did not think we would use the converter. But there are several cases where we do need to use it:
- changing the disk layout (single to RAIN)
- changing the space (SSD to spinner)
- group rebalancing (changing number of groups)
- See also EOS converters and file identifiers in the EOSCTA docs site
Note: some new EOS developments and recent discussions with the EOS team mean that we will be able to further limit the cases where we need to use the converter. To be reviewed.
Problems we need to investigate
- When we created a layout with 2 replicas, this created 2 tape copies. We need to investigate what happens when we archive with RAIN: how many tape copies are created?
- Space policy: There are several ways to specify which space name a file will land on: default space; space specified by URL; per-directory space specified by sys.forced.space.
- Performance issues: we have observed a long latency when closing a file after writing. Was this caused by bad or mismatched disk servers, or by something else?
- Solution to the problem of copying files from tape to RAIN. Do we need SSDs? If we do, do we need to use the converter? (Julien would like to isolate performance of tape drive from performance of disk system).
- Querying of free space: do file system statistics work as we expect on a RAIN layout?
- Will Mihai’s rewrite of converter code still change the disk file ID? (Answer: the converter code was refactored but it still changes the file ID).
Problems we know how to solve
Conversion changes disk file IDs, but we rely on disk file IDs to get file metadata from the EOS namespace. Two possible solutions:
- EOS keeps an index mapping archive file IDs to disk file IDs, so we no longer need to store the disk file ID.
- EOS can change the file IDs and sends us an event so that we update our catalogue with the new disk file ID.
The EOS devs preferred the second solution. However, even if this is a synchronous event, the two sides could still get out of sync if the operation fails after we update our DB but before the sync message gets back to the MGM. Such cases should be fixed by a retry. In some rare cases we may have to search back in the logs to get the correct disk file ID.
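The failure window described here (our DB is updated but the reply never reaches the MGM) is only repaired by a retry if the update is idempotent. The following Python sketch illustrates that idea under stated assumptions: the catalogue is modelled as an in-memory dict from archive file ID to disk file ID, a lost reply is modelled as a ConnectionError, and apply_disk_id_update, notify_with_retry and DiskIdMismatch are hypothetical names, not the real EOS or CTA interfaces.

```python
from typing import Callable, Dict


class DiskIdMismatch(Exception):
    """The catalogue holds neither the old nor the new disk file ID."""


def apply_disk_id_update(catalogue: Dict[int, int], archive_id: int,
                         old_disk_id: int, new_disk_id: int) -> None:
    """Idempotent update: applying the same event twice leaves the same state."""
    current = catalogue[archive_id]  # KeyError if the archive file is unknown
    if current == new_disk_id:
        return  # already applied by an earlier attempt, so the retry is a no-op
    if current != old_disk_id:
        # Unexpected state: this is the rare case that needs manual recovery.
        raise DiskIdMismatch(f"archive {archive_id}: found {current}")
    catalogue[archive_id] = new_disk_id


def notify_with_retry(send: Callable[[], None], max_attempts: int = 3) -> int:
    """Call send() until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


# Demo: the first attempt updates the catalogue but its reply is lost.
catalogue = {4001: 111}
calls = {"n": 0}


def flaky_send() -> None:
    calls["n"] += 1
    apply_disk_id_update(catalogue, 4001, old_disk_id=111, new_disk_id=222)
    if calls["n"] == 1:
        raise ConnectionError("reply lost before reaching the sender")


attempts = notify_with_retry(flaky_send)
```

In the demo the second attempt finds the new ID already in place and succeeds without changing anything. The DiskIdMismatch branch corresponds to the rare cases where the logs would have to be searched for the correct disk file ID.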
Problems we need to find a solution to
- Conversion fires the DELETE workflow (FIXED)
- When a new file is created by the converter, we lose the tape file system tag (65535). Can’t prepare evict. (FIXED)
- When will the Converter work of Mihai be merged into the master branch? (Currently in testing, will be merged shortly)
- Recall onto RAIN, we get d10::t1
- After configuring conversion on the destination space and setting the number of threads, the MGM has to be restarted before it processes the conversion jobs. We are also not sure where the converter should be switched on (target space?)
- How to get the list of failed conversion jobs? (without having to parse the converter logfile)
What is the roadmap for EOS converter in the next few weeks?