CS3 2019 - Cloud Storage Synchronization and Sharing Services

Name: CS3 2019 - Cloud Storage Synchronization and Sharing Services
Start: 2019-01-28T03:30:00+01:00
End: 2019-01-30T21:50:00+01:00
Location: CNR

Advanced geo-spatial data analysis with Jupyter

29 Jan 2019, 10:10

20m

CNR

National Research Council - Piazzale Aldo Moro 7, 00185 Roma, Italy

Presentation
Cloud infrastructure and software stacks for data science
Data science: applications and infrastructure

Mr Paul Hasenohr (European Commission - Joint Research Centre)

The JRC Earth Observation Data and Processing Platform (JEODPP) serves JRC projects and their partners for big data applications, with an emphasis on geospatial data. It has evolved into a multi-petabyte-scale platform offering advanced Web-enabled services for container-based batch processing, remote desktop access, and interactive analysis and visualization through the JEO-lab service.

The JEO-lab service provides a powerful and flexible Web-based environment in which both data science specialists and less experienced occasional users can interactively analyze and visualize geo-spatial data. JEO-lab is based on Jupyter notebooks using a Python kernel and an API built on C++ libraries exposed to Python via SWIG.

The design of the JEO-lab service separates the coding inside notebooks from the actual data processing, which is deferred to and executed by a series of back-end service nodes. Processing is initiated via an interactive map, which requests map tiles containing the processing results from the back-end service nodes acting as a tile engine. All processing chains from the notebooks are encoded into JSON objects and stored in a Redis key-value store. The back-end tile engine nodes retrieve and decode the JSON objects, apply the processing chains, and send the results back to the notebook clients.
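The flow described above can be illustrated with a minimal sketch. It is not the JEO-lab implementation: the key name, the `scale` and `clip` operations, and the helper functions are hypothetical, and a plain dictionary stands in for the Redis store so that the example runs without a server.

```python
import json

# Stand-in for the key-value store (assumption): a dict keeps the sketch
# runnable without a server. A Redis client's set/get calls would replace it.
store = {}

def publish_chain(store, key, steps):
    """Notebook side: encode a processing chain as JSON and store it under a key."""
    store[key] = json.dumps({"steps": steps})

# Hypothetical operations that a tile engine node might support.
OPERATIONS = {
    "scale": lambda values, factor: [v * factor for v in values],
    "clip": lambda values, low, high: [min(max(v, low), high) for v in values],
}

def render_tile(store, key, tile_values):
    """Tile engine side: fetch and decode the chain, then apply each step in order."""
    chain = json.loads(store[key])
    for step in chain["steps"]:
        operation = OPERATIONS[step["op"]]
        tile_values = operation(tile_values, **step["params"])
    return tile_values

publish_chain(store, "chain:demo", [
    {"op": "scale", "params": {"factor": 2}},
    {"op": "clip", "params": {"low": 0, "high": 5}},
])
result = render_tile(store, "chain:demo", [1, 2, 3, 4])
print(result)
```

Because the notebook only writes a description of the work, the node that renders a tile can be any back-end node with access to the store.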

The move to the JupyterLab environment greatly improved the usability of JEO-lab. It makes it easier to manage code and to display the results side by side on an interactive map. Split maps allow convenient comparison of different analysis workflows.

A set of widgets provided by various Jupyter extensions enables the creation of advanced user interfaces for data analysis in the style of desktop tools, where the parameters of the underlying Python functions are controlled interactively through appropriate widgets. In this way, powerful analysis tools serve the needs of both specialists and desk officers without requiring any programming knowledge from the end user. A series of customized thematic processing interfaces based on Jupyter notebooks has been developed to support JRC projects in various data analysis fields and visualization modes.
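As a rough illustration of widget-controlled function parameters, the sketch below binds a slider to one argument of an analysis function using the `interact` helper from ipywidgets. The abstract does not say which widget libraries JEO-lab uses, so ipywidgets is an assumption here, and `threshold_mask`, `show_mask`, `build_ui`, and the sample values are all hypothetical.

```python
def threshold_mask(values, threshold):
    """Analysis function: flag the values that exceed the threshold."""
    return [v > threshold for v in values]

SAMPLE = [0.1, 0.4, 0.7, 0.9]  # made-up index values, for illustration only

def show_mask(threshold=0.5):
    """Callback run on every slider change; prints a short summary."""
    mask = threshold_mask(SAMPLE, threshold)
    print(f"threshold={threshold:.2f} selected={sum(mask)} of {len(mask)}")

def build_ui():
    """Create the slider-driven interface; meant to be called in a notebook cell."""
    # Imported lazily so the analysis function stays usable without ipywidgets.
    from ipywidgets import FloatSlider, interact
    slider = FloatSlider(min=0.0, max=1.0, step=0.05, value=0.5)
    return interact(show_mask, threshold=slider)
```

Calling `build_ui()` in a notebook cell renders the slider; the end user moves it and never edits the function itself.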

The separation of coding (notebook) and processing (tile engine) nodes improves security and scalability, but makes it difficult for users to extend the existing API. To overcome this limitation, a mechanism has been implemented that embeds user-provided Python code, in the form of modules and functions, into the JSON objects of the processing chain. The code is then executed by the tile engine nodes, where the C++ libraries instantiate a Python interpreter on the fly.
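The idea of shipping user code inside a JSON object can be sketched as follows. This is an assumption-laden illustration, not the JEODPP mechanism: the field names (`op`, `name`, `source`) and both helper functions are hypothetical, and the standard `exec` built-in stands in for the interpreter that the C++ libraries instantiate.

```python
import json

def pack_user_function(name, source):
    """Notebook side: wrap user-supplied source code in a JSON processing step."""
    return json.dumps({"op": "user_function", "name": name, "source": source})

def run_user_function(encoded_step, values):
    """Processing side: decode the step, execute the source, call the function."""
    step = json.loads(encoded_step)
    namespace = {}
    # exec runs arbitrary code; a real service would need sandboxing and limits.
    exec(step["source"], namespace)
    return namespace[step["name"]](values)

user_source = """
def stretch(values):
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]
"""

encoded = pack_user_function("stretch", user_source)
stretched = run_user_function(encoded, [10, 20, 30])
print(stretched)
```

Transporting the function as text keeps the notebook and processing nodes decoupled: only a JSON string crosses the boundary.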

Export functions make it possible to save the processing results in a range of formats for further analysis, reporting, or distribution. The output files can then be retrieved through the JEODPP NextCloud instance.

In addition to the interactive processing environment, two new notebook-based tools for large-scale data processing are currently in a prototype phase:
• The Dask environment (https://dask.org) provides an interface to a Kubernetes cluster with Dask workers and makes it possible to launch parallel Python processing, such as NumPy computations or machine learning algorithms, over multiple nodes, transparently for the user;
• Through the integration of Kubernetes with HTCondor, users can submit jobs to HTCondor (the main batch processing environment of the JEODPP) from notebooks via the Kubernetes cluster. With Kubernetes acting as a meta-scheduler, both environments can share the same computing resources in a coordinated way, leading to more efficient use of the infrastructure.
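To give a flavour of the first item, here is a minimal sketch of splitting a computation into tasks with `dask.delayed`. It runs on Dask's default local scheduler; in a cluster deployment, tasks would typically be routed to the workers by creating a `dask.distributed.Client` first, but how the JEODPP prototype is wired up is not described in the abstract. The function names and the sum-of-squares workload are hypothetical.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into contiguous (start, stop) pairs."""
    size, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + size + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def sum_of_squares(start, stop):
    """Work unit: the kind of function that would run on a single worker."""
    return sum(i * i for i in range(start, stop))

def parallel_sum_of_squares(n_items, n_chunks=4):
    """Build one delayed task per chunk, then combine the partial results."""
    from dask import delayed  # lazy import: the helpers above work without Dask
    partials = [delayed(sum_of_squares)(a, b)
                for a, b in chunk_bounds(n_items, n_chunks)]
    return delayed(sum)(partials).compute()
```

With Dask installed, `parallel_sum_of_squares(10)` evaluates the four chunks as independent tasks and sums their results.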

These new services are expected to make users more autonomous and help them fully exploit the processing capabilities offered by the JEODPP.

References:
https://doi.org/10.1016/j.future.2017.11.007

Armin Burger (European Commission - Joint Research Centre)
Mr Paul Hasenohr (European Commission - Joint Research Centre)
Pierre Soille (European Commission - Joint Research Centre)
Mr Davide De Marchi (European Commission - Joint Research Centre)

JRC-JEODPP_CS3_2019_Rome.pdf
