Jun 6 – 9, 2017
DESY (Deutsches Elektronen-Synchrotron)
Europe/Zurich timezone

Experiences with HTCondor Universes on a Petabyte Scale Platform for Earth Observation Data Processing

Jun 8, 2017, 5:00 PM
25m
SemRoom 4a/4b (Building 1b) (DESY (Deutsches Elektronen-Synchrotron))

SemRoom 4a/4b (Building 1b)

DESY (Deutsches Elektronen-Synchrotron)

Notkestraße 85 D-22607 Hamburg Germany Tel.: +49 40 8998-0 N53° 34.399 E009° 52.830
HTCondor presentations and tutorials Workshop

Speaker

Dr Dario Rodriguez (European Commission - DG JRC)

Description

Experiences with HTCondor Universes on a Petabyte Scale Platform for Earth Observation Data Processing

Dario Rodriguez, Veselin Vasilev, and Pierre Soille

European Commission, Joint Research Centre (JRC)
Directorate I. Competences. Unit I.3 Text and Data Mining
Via E. Fermi 2749, I-21027 Ispra (Va), Italy

Scientific and technical support to policies ranging from environment to emergency situations often require the analysis of large amount of Earth Observation data. For instance, the Copernicus programme of the European Union with its fleet of satellites managed by the European Space Agency will generate about 10 terabytes for satellite image data per day when it will be in full operational capacity. Numerous projects at the Joint Research Centre rely on the processing of large amounts of Earth Observation data. This motivated the development of a flexible and cost-effective petabyte-scale data and processing platform called the JRC Earth Observation Data and Processing Platform [1]. It is based on commodity hardware and CERN EOS storage backend and is detailed in [2]. Scientific workflows for geospatial data analysis are based on a variety of software, libraries, and tools that are often applied to numerous input data sets. A workload manager is therefore essential to manage multiple jobs at the same time. There are several open source queuing programs that could be used to meet various scheduling and networking needs. While HTCondor can manage very well large environments of heterogeneous resources, it was selected for the JEODPP for its unique ability to manage many different types of environments where the jobs may behave better according to the computing environment. The choice of a given environment depends basically on the type of runtime, available resources, and relationship of inter-dependencies between processes. Currently, HTCondor provides more than nine different environments called `universes', each of which enables users to take advantage of the scheduling in a unique way. In this presentation, the HTCondor universes Vanilla, Docker, and Parallel are explored and illustrated for 3 different Earth Observation applications as briefly described hereafter:

  • The Global Human Settlement (GHS) framework produces global spatial information about the human presence on the planet over time [3]. This in the form of built upmaps, population density maps and settlement maps. The input satellite images covering most of the landmass are processed fully automatically using HTCondor Vanilla and Docker universe in a single core for one job. Each job executes MATLAB program in a runtime environment that generates analytics and knowledge reporting objectively and systematically about the presence of population and built-up infrastructure 4 . This kind of application is a perfect candidate for HTCondor because they are loosely coupled (embarrassingly parallel). It means that there is no dependency and communication between tasks. Many of the applications running on our platform share this characteristic. The experiments conducted for this application show that the Docker universe introduces very little overhead compared to the Vanilla universe. However, the Docker universe has the advantage of allowing the existence of multiple isolated user-space (Docker containers) coping with possibly conflicting software library requirements.

  • The SUMO (Search for Unidentified Maritime Objects) is an automatic ship detection software from SAR imagery based on JAVA 5. For this application, all the Sentinel-1 satellite images acquired over the Mediterranean Sea over the period of 1 year were processed6. A Docker image containing all the software dependencies to run SUMO was built and served as a basis to launch the jobs through HTCondor with the Docker universe. In addition, this application takes advantage of the HTCondor dynamic slots allowing the execution using multi-threads given the high random-access memory requirements of SUMO. Basically, one multicore application scales up to the maximum number of physical CPUs (hyperthreading in our platform was disabled) on a worker host if has not enough memory to accommodate more than one process.

  • Hydrodynamic and ecosystem simulations over the Mediterranean Sea are used to assess the marine environment in the EU, to set baselines, identify data gaps and simulate scenarios. The hydrodynamic models and ecosystem model used in this application are GETM1, GOTM2, FABM3 and ERGOM. They are based on numerical methods dealing with the spatial domain by separating it into numerous components through a discretization process that produces a model grid 7. These models are implemented as MPI application based on FORTRAN and it is running by using the parallel universe of HTCondor. One job is executed in two steps: First, it triggers the startup of a virtual HPC cluster based on Docker containers and; Second, the job run the working script on the virtual HPC Cluster by using OpenMPI. In this form, we can deploy a virtual HPC cluster on demand under the umbrella of HTCondor.

In the near future, the potential of the HTCondor grid universe will be investigated to benefit from other platforms holding complementary resources.

References

[1] P. Soille, A. Burger, D. Rodriguez, V. Syrris, and V. Vasilev.; Towards a JRC earth observation data and processing platform Proc. of the 2016 Conference on Big Data fromSpace (BiDS'16), pages 65-68, 2016.. Available from doi: 10.2788/854791

[2] V. Vasilev, Rodriguez. D., A. Burger, and P. Soille; Flexible and cost-efective petabyte-scale architecture with HTCondor processing and EOS storage backend for Earth observation applications. In Abstract book of 3rd European HTCondor, Hamburg, Germany, 2017. DESY. Submitted.

[3] M. Pesaresi, G. Huadong, X. Blaes, D. Ehrlich, S. Ferri, L. Gueguen, M. Halkia, M. Kaumann, T. Kemper, L. Lu, M. Marin-Herrera, G. Ouzounis, M. Scavazzon, P. Soille, V. Syrris, and L. Zanchetta. A global human settlement layer form optical HR/VHR RS data: concepts and results. 6(5):2102{2131, 2013. doi: 10.1109/JSTARS.2013.2271445

4 C. Corbane, M. Pesaresi, V. Syrris, T. Kemper, P. Politis, P. Soille, A. Florczyk, F. Sabo, D. Rodriguez, L. Maffenini, and S. Ferri. Global mapping of human settlements with Sentinel-1 and Sentinel-2 data: Recent developments in the Global Human Settlement Layer. Slides of presentation at WorldCover'2017, ESA, Frascati, Italy, March 2017. Available from http://worldcover2017.esa.int/files/2.2-p1.pdf

5 C. Santamaria, M. Stasolla, V. Fernandez Arguedas, P. Argentieri, M. Alvarez, and H. Greidanus. Sentinel-1 Maritime Surveillance: Testing and Experiences with Long-term Monitoring. Technical Report EUR 27591 EN, 2015

6 C. Santamaria, M. Alvarez, H. Greidanus, V. Syrris, P. Soille, and P. Argentieri. Mass processing of Sentinel-1 images for maritime surveillance. Remote Sensing, 2017. Submitted.

7 D. Macias, E. Garcia-Gorriz, A. Dosio, A. Stips, and K. Keuler; Obtaining the correct sea surface temperature: bias correction of regional climate model data for the Mediterranean Sea. Climate Dynamics, pages 1-23, 2016. Available from doi: 10.1007/s00382-016-3049-z

----------------------------------

  1. http://www.getm.eu/about-getm/.
  2. http://www.gotm.net/.
  3. https://sourceforge.net/projects/rmbm/.

Primary authors

Dr Dario Rodriguez (European Commission - DG JRC) Veselin Vasilev (European Commission, Joint Research Centre (JRC)) Pierre Soille (European Commission)

Presentation materials