Description
The Joint Research Centre (JRC) of the European Commission has set up the JRC Earth Observation Data and Processing Platform (JEODPP) as a pilot infrastructure that enables its knowledge production units to process and analyze big geospatial data in support of EU policy needs. The highly heterogeneous data domains and analysis workflows of the various JRC projects require a flexible set-up of the data access and processing environments.
The basis of the platform consists of a petabyte-scale data storage system running EOS, a distributed file system developed and maintained by CERN. Three data processing levels have been implemented on top of this data storage and are delivered through a cluster of processing servers. The batch processing level allows running large-scale data processing tasks in a parallelized environment. The web-based remote desktop level provides access to tools and software libraries for fast prototyping in a standard desktop environment. The highest abstraction level is defined by an interactive processing environment based on Jupyter notebooks.
Interactive data processing in notebooks allows for advanced data analysis and on-the-fly visualization of the results on interactive maps. The processing in the notebooks is delegated via HTTP requests to a pool of processing servers for deferred parallel execution. In response to these requests, the servers return the results of the data processing as a JSON stream or as map tiles, which are rendered in the notebook. The processing is based on a custom-developed API for analysis and visualization of geospatial data. The notebook-based approach also lets users share data analysis and processing workflows with other users, instead of merely sharing the output data of processing results. This facilitates collaborative data analysis.
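The deferred, tile-wise delegation described above can be illustrated with a minimal Python sketch. It is not the JEODPP API: the names `DeferredChain`, `fake_server` and `render_tiles`, the JSON layout, and the operation names are all hypothetical. The HTTP call is replaced by a local stand-in function so that the example runs without a server.

```python
import json
from concurrent.futures import ThreadPoolExecutor


class DeferredChain:
    """Records processing steps without executing them (hypothetical class)."""

    def __init__(self, collection):
        self.collection = collection
        self.steps = []

    def then(self, op, **params):
        # Nothing is computed here; the step is only appended to the recipe.
        self.steps.append({"op": op, "params": params})
        return self

    def to_request(self, tile):
        # Serialize the recipe plus one tile address (zoom, column, row).
        z, x, y = tile
        body = {
            "collection": self.collection,
            "steps": self.steps,
            "tile": {"z": z, "x": x, "y": y},
        }
        return json.dumps(body, sort_keys=True)


def fake_server(request_body):
    """Stand-in for a processing server; a real one would be reached over HTTP."""
    request = json.loads(request_body)
    return {"tile": request["tile"], "steps_run": len(request["steps"])}


def render_tiles(chain, tiles, workers=4):
    """Send one request per tile to the pool and collect results in order."""
    bodies = [chain.to_request(tile) for tile in tiles]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_server, bodies))


# Assumed operation names, for illustration only.
chain = DeferredChain("sentinel2").then("ndvi").then("colorize", palette="greens")
results = render_tiles(chain, [(3, 4, 2), (3, 5, 2)])
```

In this sketch the notebook side only describes the work. The recipe is executed once per requested tile, so only the area currently shown on the map needs to be processed.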
All processing levels are interconnected through data access and data sharing interfaces based either on traditional file system access or on HTTP-based access provided by CS3 software. Users can access data on the central platform storage from their office desktop via CIFS through a dedicated gateway, or from the Internet via a dedicated NextCloud instance. Access to data from the farm of processing servers is provided through a FUSE client specific to EOS in order to offer the highest possible data throughput. A major challenge is to set up consistent and reliable access control across all data access methods and tools. A migration of the cloud service to CERNBox will be assessed, since it would facilitate the management of multi-protocol data sharing.