Skyhook introduces programmable storage for relational databases by embedding data semantics and data management methods directly within an object storage system. This allows a database such as PostgreSQL to scale out by offloading bandwidth- and CPU-intensive data management tasks to the storage system without any changes to the database itself. Skyhook builds on the user-extensible interfaces of the Ceph distributed object store, so the core of the storage system does not need to change either. Skyhook partitions datasets along semantic boundaries and leverages Ceph's "object class" (aka "cls") extension mechanism to embed access methods that can interpret data semantics and apply predicates and other data management tasks within storage servers at the level of individual objects. Additionally, Skyhook indexes object data and metadata using Ceph's internal key/value service (built with RocksDB), thus creating locally queryable metadata on each storage server. We found that Skyhook abstractions can also be used for non-relational data and are working on embedding access libraries such as HDF5 with minimal changes to HDF5 itself. We are also exploring the use of Skyhook to embed ROOT dataset access functions into Ceph and are looking for feedback from the HEP community.
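As a rough illustration of the pushdown idea described above, the following Python sketch simulates partitioned objects that filter their own rows and consult local metadata before scanning. It is not Skyhook's code and does not use Ceph's API: `StoredObject`, `partition`, `pushdown_query`, the fixed-size row partitioning, and the per-column min/max statistics are all hypothetical stand-ins chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Row = Dict[str, int]


@dataclass
class StoredObject:
    """One partition of a table, held by a simulated storage server."""
    rows: List[Row]
    # Hypothetical local metadata: per-column (min, max), standing in for
    # entries a real system might keep in a server-side key/value store.
    stats: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def build_stats(self) -> None:
        """Record the min and max of every column in this object."""
        for col in self.rows[0]:
            values = [r[col] for r in self.rows]
            self.stats[col] = (min(values), max(values))

    def select(self, col: str, lo: int, hi: int) -> List[Row]:
        """Server-side access method: filter locally, return only matches."""
        cmin, cmax = self.stats[col]
        if hi < cmin or lo > cmax:
            return []  # metadata rules out any match, so skip the scan
        return [r for r in self.rows if lo <= r[col] <= hi]


def partition(rows: List[Row], rows_per_object: int) -> List[StoredObject]:
    """Split a table into fixed-size objects (an assumed, simplistic policy)."""
    objects = []
    for i in range(0, len(rows), rows_per_object):
        obj = StoredObject(rows[i:i + rows_per_object])
        obj.build_stats()
        objects.append(obj)
    return objects


def pushdown_query(objects: List[StoredObject], col: str,
                   lo: int, hi: int) -> List[Row]:
    """Client side: send the predicate to each object and merge the results."""
    result: List[Row] = []
    for obj in objects:
        result.extend(obj.select(col, lo, hi))
    return result


table = [{"id": i, "energy": i * 10} for i in range(12)]
objects = partition(table, rows_per_object=4)
matches = pushdown_query(objects, "energy", 50, 80)
print([r["id"] for r in matches])
```

In this toy setup only matching rows cross the boundary between "storage" and "client", and the first object is never scanned because its recorded range cannot satisfy the predicate.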
Recorded Meeting Video: https://www.youtube.com/watch?v=zNLZhlfXUNg