In the Data Science and Systems Integration Program, we have explored various ways to separate I/O code from user science code. After seven years of developing in-house solutions and contributing to external ones (including Intake), we propose an abstraction, named Tiled, that we think is a broadly useful building block.
Tiled is a data access service for data-aware portals and data science tools. It has a Python client that feels much like h5py to use and integrates naturally with dask, but nothing about the service is Python-specific; it also works from curl. Tiled’s service sits atop databases, filesystems, and/or remote services to enable search and structured, chunkwise access to data in an extensible variety of formats, providing data in a consistent structure regardless of the format in which it is stored at rest. The natively supported formats span slow but widespread interchange formats (e.g. CSV, JSON) and fast, efficient ones (e.g. C buffers, Apache Arrow DataFrames). Tiled enables slicing and sub-selection so that only the data of interest is read and transferred, and it enables parallelized download of many chunks at once. Users can thus access data with very light software dependencies and fast partial downloads.
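The idea of chunkwise access with sub-selection can be illustrated with a minimal sketch. This is not Tiled's implementation or API: the names `chunks_for_slice`, `fetch_chunk`, and `read_slice` are hypothetical, the data is a one-dimensional in-memory list with fixed-size chunks, and `fetch_chunk` merely stands in for a network request. It shows how a client might work out which chunks overlap a requested slice, fetch only those in parallel, and trim the result.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks_for_slice(length, chunk_size, start, stop):
    """Return indices of the fixed-size chunks that overlap [start, stop)."""
    start = max(0, start)
    stop = min(length, stop)
    if start >= stop:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def fetch_chunk(data, chunk_size, index):
    """Hypothetical stand-in for a network request returning one chunk."""
    return data[index * chunk_size:(index + 1) * chunk_size]


def read_slice(data, chunk_size, start, stop):
    """Fetch only the chunks needed for [start, stop), in parallel, then trim."""
    start = max(0, start)
    stop = min(len(data), stop)
    needed = chunks_for_slice(len(data), chunk_size, start, stop)
    if not needed:
        return []
    # Each chunk is requested independently, so the requests can run concurrently.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda i: fetch_chunk(data, chunk_size, i), needed))
    joined = [value for part in parts for value in part]
    # The first fetched chunk may begin before `start`; drop the excess at both ends.
    offset = needed[0] * chunk_size
    return joined[start - offset:stop - offset]
```

In this sketch, a request for elements 25 through 46 of a 100-element dataset split into chunks of 10 touches only three of the ten chunks; the rest are never transferred.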
Tiled puts an emphasis on structures rather than formats, including N-dimensional strided arrays (i.e. numpy-like arrays), tabular data (i.e. pandas-like “dataframes”), and hierarchical structures thereof (e.g. xarrays, HDF5-compatible structures like NeXus).
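The distinction between structure and format can be made concrete with a small sketch using only the Python standard library. It is an illustration under stated assumptions, not Tiled's code: the `structure` dictionary, the `export` function, and the mapping from media types to serializers are all hypothetical. One array-like structure is exported as JSON, CSV, or a raw C buffer, and the structure description is what lets a client interpret the raw bytes again.

```python
import csv
import io
import json
import struct

# A format-independent description of the data: a 2 x 3 array of float64.
structure = {"shape": (2, 3), "dtype": "float64"}
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]


def to_json(rows):
    """Widespread but verbose interchange format."""
    return json.dumps(rows).encode()


def to_csv(rows):
    """Another widespread text format, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().encode()


def to_c_buffer(rows):
    """Compact binary format: little-endian float64 values in row-major order."""
    flat = [value for row in rows for value in row]
    return struct.pack(f"<{len(flat)}d", *flat)


def from_c_buffer(payload, shape):
    """Rebuild the rows from raw bytes using the shape from the structure."""
    n_rows, n_cols = shape
    flat = struct.unpack(f"<{n_rows * n_cols}d", payload)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# Hypothetical mapping from a requested media type to a serializer.
SERIALIZERS = {
    "application/json": to_json,
    "text/csv": to_csv,
    "application/octet-stream": to_c_buffer,
}


def export(rows, media_type):
    """Serialize the same structure into whichever format the client requests."""
    return SERIALIZERS[media_type](rows)
```

The raw buffer carries no shape information of its own, which is why a structure description that travels separately from the bytes is useful: the client reads the shape and dtype once, then decodes any chunk it receives.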
Tiled implements extensible access control enforcement based on web security standards, similar to JupyterHub Authenticators. Like Jupyter, Tiled can be used by a single user or deployed as a shared resource. Tiled facilitates local client-side caching in a standard web browser or in Tiled’s Python client, making efficient use of bandwidth and enabling an offline “airplane mode.” Service-side caching of “hot” datasets and resources is also possible.
Tiled is conceptually “complete” but still new enough that there is room for disruptive suggestions and feedback. We are particularly interested in exploring how Awkward Array could be added to the set of supported structures, and how its incorporation might prompt us to rethink other aspects of Tiled’s design.
Recorded meeting video: https://youtu.be/moIyT5cKT5s