European HTCondor Workshop 2019

Europe/Rome
EC-JRC Ispra

EC-JRC Ispra

European Commission – Joint Research Centre, Via Enrico Fermi 2749, I-21027 Ispra (VA), Italy
N45° 48' 36.09'' E008° 37' 16.72'' · N45° 48.601 E008° 37.278 · 45.80998, 8.62135
https://www.openstreetmap.org/#map=17/45.80998/8.62135
Helge Meinhard (CERN), Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA), Antonio Puertas Gallardo (European Commission)
Description

The European HTCondor Workshop 2019 was hosted by the European Commission's Joint Research Centre in Ispra, Lombardy, Italy, very close to the shores of Lago Maggiore.

The Joint Research Centre (JRC) is the European Commission's science and knowledge service; it employs scientists to carry out research and provide independent scientific advice and support to EU policy. The JRC is spread over six sites (Brussels, Geel, Ispra, Karlsruhe, Petten, Seville) and employs around 3000 staff from throughout the EU.

The workshop was the fifth edition in Europe after the successful events at CERN in December 2014, ALBA in February 2016, DESY in June 2017 and RAL in September 2018.

The workshops are opportunities for novice and experienced users of HTCondor to learn, get help and exchange experiences with each other and with the HTCondor developers and experts. They are primarily aimed at users from EMEA, but open to everyone. The workshops consist of presentations, tutorials and "office hours" for consultancy; the HTCondor-CE (Compute Element) is covered prominently as well.

The workshops address participants from academia and research as well as from commercial entities.

    • Registration: Main Entrance

      Workshop Registration

    • Logistics JRC: Presentation European Commission Joint Research Centre
      Convener: Helge Meinhard (CERN)
    • Workshop presentations

      Antonio

      Convener: Helge Meinhard (CERN)
      • 3
        Philosophy of HTCondor

        Provides an overview of HTCondor's design and the principles behind it

        Speaker: Gregory Thain (University of Wisconsin-Madison)
    • 10:40 AM
      Coffee Break

    • Workshop presentations

      Antonio

      Convener: Antonio Puertas Gallardo (European Commission)
      • 4
        HTCondor ClassAd Advanced Tutorial

        An in-depth coverage of the HTCondor ClassAd language, with examples.

        Speaker: Gregory Thain (University of Wisconsin-Madison)
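
        A minimal evaluation sketch using the Python classad module that ships with HTCondor; the attribute names and values below are illustrative, not taken from the tutorial:

        import classad

        # A machine-like ad with two illustrative attributes.
        machine = classad.ClassAd({"Memory": 2048, "OpSys": "LINUX"})

        # Store a requirements-style expression in the ad and evaluate it in that context.
        machine["MeetsJobRequirements"] = classad.ExprTree('Memory >= 1024 && OpSys == "LINUX"')
        print(machine.eval("MeetsJobRequirements"))   # -> True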
      • 5
        Python Bindings Tutorial

        An overview of HTCondor's official Python APIs, including job submission, monitoring, and what's new. Beyond just PowerPoint slides, this talk will attempt to let people learn live via Jupyter notebook sessions.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
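
        A minimal submission-and-monitoring sketch with these bindings, assuming HTCondor 8.8 or later and a reachable local schedd; executable and file names are illustrative:

        import htcondor

        # Describe the job with the same commands used in a condor_submit file.
        sub = htcondor.Submit({
            "executable": "/bin/sleep",
            "arguments": "60",
            "output": "sleep.out",
            "error": "sleep.err",
            "log": "sleep.log",
            "request_cpus": "1",
            "request_memory": "128MB",
        })

        schedd = htcondor.Schedd()            # local schedd
        with schedd.transaction() as txn:     # queue one job
            cluster_id = sub.queue(txn)

        # Monitor: list the jobs of the new cluster and their status codes.
        for ad in schedd.query("ClusterId == %d" % cluster_id,
                               ["ClusterId", "ProcId", "JobStatus"]):
            print(ad)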
      • 6
        HTCondor Advanced Job Submission

        HTCondor Advanced Job Submission

        Speaker: Gregory Thain (University of Wisconsin-Madison)
    • 1:00 PM
      Lunch
    • Workshop presentations

      Antonio

      Convener: Christoph Beyer (DESY)
      • 7
        HTCondor Negotiator Policy and Configuration

        An explanation of the algorithms and policies of the HTCondor negotiator

        Speaker: Gregory Thain (University of Wisconsin-Madison)
    • 3:45 PM
      Coffee break
    • Office hours

      Meet the HTCondor developers and ask your questions

    • Workshop presentations

      Antonio

      Convener: Chris Brew (RAL)
      • 8
        HTCondor and containers for Batch and Interactive use: (Mostly) a success story

        An HTC cluster using HTCondor was set up at Bonn University in 2017/2018.
        All infrastructure is fully puppetised, including the HTCondor configuration.
        Both interactive and batch jobs run inside Singularity containers, and users only have to choose the desired OS via a job parameter from an offered collection of container images, without setting up or building a container themselves.

        The container images are rebuilt daily and are provided by a CernVM filesystem (CVMFS) along with various software packages. The data to be analysed is stored on a CephFS file system.

        This talk presents the successful migration from a "classic" PBS cluster setup with separate login nodes providing an environment similar to the batch environment, a regular source of admin headaches, to a modern and flexible HTCondor-based system that offers users several different environments for both interactive and batch usage. Both the successes and the several pitfalls and issues that were encountered will be discussed.

        Speaker: Oliver Freyermuth (University of Bonn (DE))
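
        As a hedged illustration of the workflow described above, a user job might select its OS image through a single custom attribute; the attribute name +ContainerOS and its value are hypothetical stand-ins for the site's actual job parameter:

        import htcondor

        sub = htcondor.Submit({
            "executable": "analysis.sh",            # hypothetical user script
            "output": "analysis.out",
            "error": "analysis.err",
            "log": "analysis.log",
            "+ContainerOS": '"CentOS7"',            # hypothetical OS-selection job parameter
        })

        schedd = htcondor.Schedd()
        with schedd.transaction() as txn:
            sub.queue(txn)   # the pool wraps the job in the matching Singularity image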
      • 9
        Update on Implementation and Usage of HTCondor at DESY

        In this talk we provide new details of the DESY configurations for HTCondor. We focus on features needed for user registry integration, node maintenance operations and fair-share / quota handling. We are working on integrating Docker, Jupyter and GPUs into our smooth and transparent operating model.

        Speakers: Mr Thomas Finnern (DESY), Mr Christoph Beyer
    • Workshop presentations

      Antonio

      Convener: Dario Rodriguez (EC-JRC)
      • 10
        HTMap and HTC Notebooks: Bringing HTC to Python

        Introduction to some recent work by the HTCondor team to enable Python code, including Python embedded in Jupyter Notebooks, to easily and naturally leverage high throughput computing via HTCondor.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
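
        A minimal HTMap sketch along the lines of what this talk introduces, assuming the htmap package is installed and an HTCondor pool is reachable:

        import htmap

        def double(x):
            return 2 * x

        # Each input becomes an HTCondor job; iterating the map waits for
        # the outputs and yields them in input order.
        doubled = htmap.map(double, range(10))
        print(list(doubled))   # [0, 2, 4, ..., 18]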
      • 11
        From interactive to distributed computing of land parcel signatures using HTCondor

        In the framework of the Common Agricultural Policy (CAP) of the European Union, a big technological shift is happening. For decades, the correct payment of subsidies to farmers was controlled by means of remotely sensed images, through visual interpretation and field visits, to verify that a randomly selected percentage of the land parcels respected all the rules. In recent years, we have been witnessing a big evolution in Earth Observation, with the Copernicus Programme and the Sentinel satellites providing high-resolution coverage of every piece of land every few days, together with the availability of cloud platforms capable of storing the large amount of data captured. Coupling this with the ease of use of cloud computing platforms and a wide number of tools for extracting valuable information from big geospatial datasets using machine learning techniques puts the CAP controls sector on the edge of a revolution: from 2021, every single agricultural parcel in Europe will be constantly monitored for the full year. This involves the calculation and constant update of temporal profiles that monitor the vegetation status of land parcels in all their phases, from ploughing to sowing, from ripening to harvesting.
        In this context, the JEODPP (Joint Research Centre Big Data Platform) group was involved in preliminary studies to assess the feasibility of the new CAP Monitoring. In a first stage, by accessing the full catalogue of Sentinel-2 images, we developed an interactive tool to calculate the Normalized Difference Vegetation Index profile for a single parcel at a time inside a JupyterLab notebook (the s2explorer application).
        Once the algorithm was tested and verified, the need to scale to regional or national level arose. This implied processing millions of vector polygons, each of them covered by more than 50 images per year, a perfect use case for the HTCondor workload manager services already available inside the JEODPP platform. The C++ routines developed for the interactive prototype were compiled into a standalone executable and the calculation was divided into three phases: 1) compilation of the list of satellite images involved in the selected spatial and temporal range, 2) creation of a job for each image, 3) collection of the results into a single binary file; a typical map-reduce schema. All these phases were performed using HTCondor jobs and, in particular, the second phase was heavily parallelized over the hundreds of cores available inside the JEODPP platform.
        We executed a first test on a region in Hungary (10K parcels for a full year processed in less than half an hour) and then scaled to the whole of Catalonia (640K parcels processed in 4 hours).
        The need to evaluate the results of the batch processing in depth generated the idea to “close the circle”, that is, to provide an interactive tool to visually assess the calculations made by the HTCondor jobs. We developed a Python application running inside JupyterLab that can visualize all the land parcels involved in the calculation and, by clicking on each of them, immediately display the vegetation profile and the imagettes extracted from each individual satellite acquisition date. The tool is widely used by the JRC D.5 unit and is the basis for the future characterization of crops by means of machine learning algorithms, a key component of the CAP Monitoring.
        This use case is an example of the involvement of HTCondor services in a complex environment where the need for interactive prototyping goes hand in hand with heavy distributed processing needs, and contributes to creating an integrated solution.

        Speaker: Mr Csaba Wirnhardt (European Commission, Joint Research Centre (JRC) Directorate D. Sustainable Resources. Unit D.5 Food Security)
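
        A schematic sketch of phase 2 of the map-reduce schema described above (one HTCondor job per satellite image) using the Python bindings; the executable name and image identifiers are hypothetical:

        import htcondor

        images = [{"image": name} for name in ("S2A_T31TCG_20190101", "S2A_T31TCG_20190106")]

        sub = htcondor.Submit({
            "executable": "extract_ndvi_profile",   # hypothetical standalone C++ routine
            "arguments": "$(image)",
            "output": "$(image).out",
            "error": "$(image).err",
            "log": "ndvi.log",
        })

        schedd = htcondor.Schedd()
        with schedd.transaction() as txn:
            # Queue one job per image from the item data.
            sub.queue_with_itemdata(txn, itemdata=iter(images))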
      • 12
        What's new in HTCondor? What's coming up?

        Overview of new features recently released as well as discussion about the HTCondor development roadmap.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
    • 10:35 AM
      Coffee break
    • Office hours

      Meet the HTCondor developers and ask your questions

    • Workshop presentations

      Antonio

      Convener: Catalin Condurache (Science and Technology Facilities Council STFC (GB))
    • Workshop presentations

      Antonio

      Convener: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
      • 14
        Automation & Monitoring in the CERN Condor pool

        The CERN HTCondor pool currently offers 200K cores of compute power to hundreds of users in the HEP community. Managing such a cluster requires a significant effort in daily operations, not only because of the scale, but also because of the diversity of the resources. In this scenario, the adoption of automation and monitoring tools becomes a strong requirement to optimize both resource usage and operators' time.

        This talk presents different projects and prototypes that have been developed and integrated in our infrastructure to make these daily operations more efficient.

        Speaker: Luis Fernandez Alvarez (CERN)
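
        A minimal monitoring-style sketch, not CERN's actual tooling: counting the slot states advertised to a pool's collector with the Python bindings.

        from collections import Counter

        import htcondor

        coll = htcondor.Collector()                      # local pool by default
        slots = coll.query(htcondor.AdTypes.Startd,      # one ad per slot
                           "true",
                           ["Machine", "State", "Activity"])
        print(Counter(ad["State"] for ad in slots))      # e.g. Claimed vs Unclaimed counts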
      • 15
        Edge cases at CERN

        The overwhelming majority of batch workload at CERN is very similar: experiment pilots, experiment Tier-0 production, and local users running single-core jobs. However, one size doesn't fit all, and we now have a number of different edge cases that this talk will cover:

        GPUs, from machine learning to software validation
        Users who are in a grey area between HPC & HTC
        Lower-priority preemptible workloads
        Backfill of SLURM resources
        Running HTCondor on Kubernetes

        Speaker: Ben Jones (CERN)
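
        For the GPU case, a hedged sketch of such a job request using the standard request_gpus submit command; the executable and resource values are illustrative:

        import htcondor

        sub = htcondor.Submit({
            "executable": "train_model.sh",   # hypothetical ML workload
            "request_cpus": "4",
            "request_gpus": "1",              # ask the pool for one GPU
            "request_memory": "8GB",
            "output": "train.out",
            "error": "train.err",
            "log": "train.log",
        })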
      • 16
        Federating HTCondor pools

        Describes the several ways in which more than one HTCondor pool can be joined.

        Speaker: Gregory Thain (University of Wisconsin-Madison)
    • 1:00 PM
      Lunch
    • Visits: Lab visits

      Labs visits

    • 4:00 PM
      Coffee break
    • Workshop presentations

      Antonio

      Convener: Josep Flix (PIC)
      • 17
        Moving Job Data

        Discussion about how HTCondor jobs can access their data, including some recent developments.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
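
        A minimal sketch of HTCondor's built-in file-transfer submit commands, one of the data-access mechanisms in scope here; all file names are illustrative:

        import htcondor

        sub = htcondor.Submit({
            "executable": "process.sh",
            "should_transfer_files": "YES",
            "when_to_transfer_output": "ON_EXIT",
            "transfer_input_files": "input.dat, config.yaml",   # shipped to the execute node
            "transfer_output_files": "result.dat",              # shipped back on exit
            "output": "process.out",
            "error": "process.err",
            "log": "process.log",
        })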
      • 18
        RESTful interfaces to HTCondor

        Discusses work in progress on possible RESTful interfaces to HTCondor.

        Speaker: Gregory Thain (University of Wisconsin-Madison)
      • 19
        'Startd flocking' and 'HT-HPC' c/o the Milan Physics Dept.

        Group-owned and operated clusters at the Physics Department of the University of Milan (UNIMI) are pooled together via the HTCondor 'startd flocking' feature. We describe the setup, its applications and possible use for parallel applications, with some preliminary performance results.

        Speaker: Francesco Prelz (Università degli Studi e INFN Milano (IT))
      • 20
        Planning for Future Scales and Complexity of HTCondor Pools in the CMS Experiment

        The resource needs of high energy physics experiments such as CMS at the LHC are expected to continue to grow significantly over the next decade, and will be more and more satisfied by computing capacity with non-standard characteristics. This presents challenges not only of scale but of complexity in resource provisioning and allocation. In this contribution, we will present results of recent HTCondor scale tests we have conducted using the CMS Global Pool Integration Test Bed (ITB) employing the multi-threaded Negotiator, where we have pushed the size of the pool to the maximum limits with currently-available hardware and explored effective performance limitations of the submit nodes in our infrastructure with realistic payloads. We will also discuss recent integration of resource-specific job matching conditions to satisfy HPC and Cloud use cases, where resources may not be suitable for running all kinds of workflows. Finally, we will review some specific use cases that we have difficulty solving with the current implementation of HTCondor.

        Speaker: James Letts (Univ. of California San Diego (US))
    • Workshop presentations

      Antonio

      Convener: Antonio Puertas Gallardo (EC-JRC)
      • 21
        Dynamic Batch service with HTCondor and Kubernetes

        One of the main challenges of the JRC Big Data Platform (JEODPP) [1] is to offer well-consolidated computational services, such as the batch system or interactive data visualization, on which users can process large-scale geospatial data, while ensuring a smooth user experience combined with easy administration of all resources, from hardware to applications.
        Due to the heterogeneity of the user requirements on the JEODPP platform, many services have a demand that varies over time. As an example, users who want to visualise data interactively require a considerable amount of resources only for short periods, mainly during the core hours of the day. Considering that the batch service should use the full capacity whenever possible, resources should be allocated dynamically. Moreover, keeping a physical separation between HTCondor, interactive nodes, and all our other services, as in our previous setting, is not satisfactory from the service management, monitoring, and performance perspectives. In that fixed setting, a lot of idle time after operations ended added up to a lack of resources at peak times. To address these issues, we are moving to run all our services under Kubernetes control.
        Running batch workloads in Kubernetes is possible [2] and Kubernetes has a specific resource for that purpose (Job), but it is quite limited. To address this limitation, we present in this work an implementation of HTCondor within Kubernetes. HTCondor daemons are packaged inside pre-configured Docker images and deployed as a service in a container cluster handled by Kubernetes. As a proof of concept, we have successfully run an actual JEODPP use case (atmospheric correction with Sen2Cor [3]) on Kubernetes and compared its performance with a standard HTCondor pool. Finally, we present some considerations for implementing other services based on Kubernetes namespaces, and some constraints on implementing this solution in a production environment such as the JEODPP.

        References

        [1] P. Soille, A. Burger, D. Rodriguez, V. Syrris, and V. Vasilev. Towards a JRC earth observation data and processing platform. Proc. of the 2016 Conference on Big Data from Space (BiDS'16), pages 65-68, 2016. doi: 10.2788/854791

        [2] D. Aiftimiei, T. Boccali, M. Panella, A. Italiano, G. Donvito, D. Michelotto, M. Caballer et al. Geographically distributed Batch System as a Service: the INDIGO-DataCloud approach exploiting HTCondor. In J. Phys. Conf. Ser., vol. 898, p. 052033. 2017.

        [3] M. Main-Knorn, B. Pflug, J. Louis, V. Debaecker, U. Müller-Wilm and F. Gascon, Sen2Cor for Sentinel-2. In Image and Signal Processing for Remote Sensing XXIII, vol. 10427, p. 1042704. International Society for Optics and Photonics, 2017.


        Speaker: Mr Luca Marletta (European Commission, Joint Research Centre (JRC))
      • 22
        The Natural Language Processing automatization at JRC for Knowledge Management

        In our Directorate F (Health, Consumers and Reference Materials) at the JRC, the Knowledge for Health and Consumer Safety Unit F.7 deals with anticipating knowledge needs, mapping knowledge gaps and suggesting research topics to be carried out in the Directorate and possibly in the JRC. For example, thousands of publications are released every year on topics where the JRC has strong competence and a mandate for scientific advice to the European Commission; keeping pace with this continuously growing body of knowledge is an increasing challenge. The velocity of literature production makes it impossible to deliver state-of-the-art answers for EC policy makers without automating the whole process. The only way to face this issue is to apply Machine Learning (AI) tools in the field of Natural Language Processing. The same approach can be used on extracts from raw text, including those from speeches, to reveal sentiments and feelings that can be used to understand trends and (political) shifts, which may improve the JRC's insight into policy developments. We will present examples of benchmark use of this approach, built by combining tools like IBM Watson Natural Language Understanding and AllenNLP Machine Comprehension models, installed locally at the JRC.

        Speaker: Mr Mauro Petrillo (European Commission Joint Research Centre)
      • 23
        HTCondor-CE overview: from Clusters to Grids

        Overview of the design of the HTCondor CE

        Speaker: Gregory Thain (University of Wisconsin-Madison)
      • 24
        HTCondor-CE Basics and Architecture

        An introduction to HTCondor-CE including an overview of its architecture and supported batch systems.

        Speaker: Brian Hua Lin (University of Wisconsin - Madison)
    • 10:40 AM
      Coffee break
    • Office hours

      Meet the HTCondor developers and ask your questions

    • Workshop presentations

      Antonio

      Convener: Christoph Beyer
    • 1:00 PM
      Lunch
    • Visits

      Labs visits

    • Workshop presentations

      Antonio

      Convener: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
    • 4:05 PM
      Coffee break
    • Office hours

      Meet the HTCondor developers and ask your questions

    • Workshop presentations

      Antonio

      Convener: Chris Brew (Science and Technology Facilities Council STFC (GB))
      • 30
        Virgo: a computing evolution tale
        Speaker: Mr Gabriele Gaetano Fronze' (University e INFN Torino (IT), Subatech Nantes (FR))
      • 31
        Reflections on HTC Needs and Directions

        Reflections on the common upcoming requirements and directions of the scientific high throughput computing community.

        Speaker: Miron Livny (University of Wisconsin-Madison)
    • Workshop dinner: Hotel Lido

      Dinner at Angera Hotel Lido

    • Workshop presentations

      Antonio

      Convener: Catalin Condurache (EGI)
      • 32
        Building a European wide, Bioinformatics jobs execution network

        With more than 2,000 bioinformatics tools available, Usegalaxy.eu (https://usegalaxy.eu) is the biggest Galaxy instance in Europe, covering most of the hottest bioinformatics topics and communities.
        One year after its public launch in March 2018, Usegalaxy.eu has reached the important milestone of 5 million executed jobs and over 6 thousand registered users.

        Several computer centers across Europe are currently sharing their remote computation power to support the Usegalaxy.eu load: IT, UK, CZ, DE, PT, ES, ...
        To create this network of shared computational resources, we leverage:
        Pulsar (https://pulsar.readthedocs.org), a TES-like service written in Python that allows a Galaxy server to automatically interact with those remote systems;
        VGCN (https://github.com/usegalaxy-eu/vgcn), a virtual image which has all of the required components to act as a Galaxy compute node as part of an HTCondor cluster;
        Terraform (https://github.com/usegalaxy-eu/terraform), a set of scripts for safely and efficiently building the infrastructure in a modern cloud environment.
        Galaxy's job destination framework allows job execution parameters to be determined dynamically at runtime, offering a flexible way to choose the job endpoints, and the Pulsar layer ensures execution details are exchanged correctly so that jobs run properly on the local and/or remote HTCondor clusters.

        Speaker: Gianmauro Cuccuru (University of Freiburg)
      • 33
        Large-scale aerial photo processing for tree health monitoring with HTCondor

        Authors: Martinez-Sanchez, Laura; Rodriguez-Aseretto, Dario; Soille, Pierre; Beck, Pieter S. A. (European Commission, Joint Research Centre (JRC))

        The Canopy Health Monitoring (CanHeMon) project ran at the Joint Research Centre of the European Commission from mid-2015 to mid-2018 and was funded by DG SANTE. DG SANTE is responsible, among other things, for the European Union's Plant Health legislation, which aims to put in place effective measures to protect the Union's territory and its plants, as well as ensuring trade is safe and the impacts of climate change on the health of EU crops and forests are mitigated. For specific harmful organisms that threaten its crops and forests, the EU takes emergency control measures. The pine wood nematode (Bursaphelenchus xylophilus) is such a quarantine pest. It can kill European coniferous tree species and has been spreading through Portugal since the end of the 1990s.
        As part of the EU emergency measures against the pine wood nematode (PWN) laid down in Decision 2012/535/EU, Portugal should perform, outside and during the flight season of the PWN's vector, surveys of coniferous trees located in the 20 km wide buffer zone established along the Spanish border, with the aim to detect trees which are dead, in poor health or affected by fire or storm. These trees shall be felled and removed to avoid that they act as attractants for the longhorn beetle (Monochamus sp.), the insect vector responsible for the spread of PWN [1]. The CanHeMon project tasked the Joint Research Centre with analysing a portion of the buffer zone, using remote sensing data, to support detection on the ground of declining pine trees. During the project, a 400 km² area was imaged twice, in autumn 2015 and autumn 2016, at 15 cm resolution from aircraft, and individual declining tree crowns were detected using a MaxEnt-based [2,3], iterative image analysis algorithm, the performance of which was gauged through visual photointerpretation. The scalability of the automated methods was then tested using an image mosaic of the entire buffer zone at 30 cm resolution.
        We sought an image analysis platform that could efficiently handle and parallelise the computations on the large (terabyte) volumes of image data in this project. The JRC Earth Observation Data and Processing Platform (JEODPP) [4], which was developed in parallel with the CanHeMon project [5,6], increasingly met these needs over the course of the project. Being an in-house service of the EC, it facilitates processing of data whose licensing does not permit public distribution. It is a versatile platform that brings the users to the data through web access and allows for large-scale batch processing of scientific workflows, remote desktop access for fast prototyping in legacy environments, and interactive data visualisation/analysis with JupyterLab.
        The storage and processing nodes underlying the JEODPP infrastructure consist of commodity hardware equipped with a stack of open source software. The storage service relies on the CERN EOS distributed file system, which provides a disk-based, low-latency storage service suitable for multi-petabyte scale data. EOS is built on top of the XRootD protocol developed for high energy physics applications, but also offers almost fully POSIX-compliant access through a dedicated FUSE client called FUSEX that is suitable for other areas. As of summer 2019, the storage capacity of the EOS distributed file system of the JEODPP amounts to 14 PiB, corresponding to a net capacity of 7 PiB given that all data are replicated once to ensure their availability and decrease the likelihood of data loss in case of disk failure. For all other services, the JEODPP relies on processing servers with a total of 2,200 cores distributed over 64 nodes. On average, 15 GB of RAM is available to each core. The batch processing service, called JEO-batch, is orchestrated with HTCondor. All applications running on the JEODPP are deployed within Docker containers to ease the management of applications having conflicting requirements in terms of library versions. Docker images are created by combining and modifying standard images downloaded from repositories. For the canopy health monitoring application, we created a Debian image with all the libraries needed to run the code (mainly R and GDAL libraries).
        Covering the entire PWN buffer zone with 4-band images of 30 cm resolution, stored in 8-bit, generates 2.4 TB of data. The associated texture layers [7] used in the analyses here added an additional 50 TB. The data were delivered and processed in 24,904 tiles measuring 1 km by 1 km. Processing a single tile in each iteration takes 40 to 55 minutes, with a memory usage of 5-7 GB on a regular CPU (with 2 to 8 cores). Processing the entire buffer zone on a single CPU would thus take more than a year. Assigning all of the 2,200 cores of the JEODPP to the batch processing service and submitting the job with HTCondor, the task would be completed in less than two hours. In practice, between 100 and 500 cores of the JEODPP were used in the processing at any one time. The results of this processing were used to support management of the area on the ground and make recommendations on the use of remote sensing for large-area surveys in the context of plant health.

        References

        [1] de la Fuente, B., Saura, S. and Beck, P. S. A. Predicting the spread of an invasive tree pest: the pine wood nematode in Southern Europe. Journal of Applied Ecology 55(5): 2374-2385, 2018. https://doi.org/10.1111/1365-2664.13177
        [2] Phillips, S.J., Anderson, R.P., Dudík, M., Schapire, R.E. and Blair, M.E. Opening the black box: an open-source release of Maxent. Ecography 40(7): 887-893, 2017. https://doi.org/10.1111/ecog.03049
        [3] Phillips, S.J. and Dudík, M. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31(2): 161-175, 2008. https://doi.org/10.1111/j.0906-7590.2008.5203.x
        [4] Soille, P., Burger, A., De Marchi, D., Kempeneers, P., Rodriguez, D., Syrris, V. and Vasilev, V. A versatile data-intensive computing platform for information retrieval from big geospatial data. Future Generation Computer Systems 81: 30-40, 2018. https://doi.org/10.1016/j.future.2017.11.007
        [5] Beck, P. S. A., Martínez-Sanchez, L., Di Leo, M., Chemin, Y., Caudullo, G., de la Fuente, B. and Zarco-Tejada, P. J. The Canopy Health Monitoring (CanHeMon) project. Publications Office of the European Union, Luxembourg, 2019. ISBN 978-92-79-99639-9. doi: 10.2760/38697
        [6] Beck, P. S. A., Martínez-Sanchez, L., Di Leo, M., Chemin, Y., Caudullo, G., de la Fuente, B. and Zarco-Tejada, P. J. Remote Sensing in support of Plant Health Measures: Findings from the Canopy Health Monitoring project. Publications Office of the European Union, Luxembourg, 2019. ISBN 978-92-76-02051-6. doi: 10.2760/767468
        [7] Haralick, R.M., Shanmugam, K. and Dinstein, I. Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 3(6): 610-621, 1973. https://doi.org/10.1109/TSMC.1973.4309314

        Speaker: Laura Martinez Sanchez (JRC)
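
        A schematic, hypothetical sketch of the per-tile batch processing described in this abstract, with one docker-universe job per 1 km by 1 km tile; the Docker image, wrapper script and resource values are invented for illustration:

        import htcondor

        tiles = [{"tile": "tile_%05d" % i} for i in range(1, 24905)]   # 24,904 tiles

        sub = htcondor.Submit({
            "universe": "docker",
            "docker_image": "jeodpp/canhemon-r-gdal:latest",   # hypothetical Debian image with R and GDAL
            "executable": "detect_declining_crowns.sh",        # hypothetical wrapper around the R code
            "arguments": "$(tile)",
            "request_cpus": "2",
            "request_memory": "7GB",
            "output": "$(tile).out",
            "error": "$(tile).err",
            "log": "canhemon.log",
        })

        schedd = htcondor.Schedd()
        with schedd.transaction() as txn:
            sub.queue_with_itemdata(txn, itemdata=iter(tiles))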
      • 34
        Large scale mapping of human settlements from earth observation data with JEO-batch of the JRC Earth Observation Data and Processing Platform

        The spatial distribution of built-up areas and their expansion represent one of the most important forms of land use / land cover change confronting climate, environmental and socio-economic systems at a global scale. Characterizing the status and dynamics of built-up areas over large areas is technically feasible thanks to the availability of a panoply of earth observation data with different spatial, spectral and temporal characteristics. In the last 10 years, at the Joint Research Centre of the European Commission, the Global Human Settlement Layer project has been exploiting different sources of satellite imagery to monitor changes in the European and global built-up landscapes to better inform policies and decision making (Corbane et al. 2017) (Florczyk et al. 2016).
        To meet the demands of large-scale mapping of human settlements from space, not only are mass storage infrastructures needed, but novel data analytics combined with high-performance computing platforms also have to be designed. The JRC EO Data and Processing Platform (JEODPP), developed in the framework of the JRC Big Data Analytics (BDA) project, provides petabyte-scale storage coupled to high-throughput computing capacities, enabling and facilitating the extraction of built-up areas from large volumes of satellite data at both the European and global scales (Soille et al. 2018). In this work, we present the JEO-batch feature of the JEODPP, a low-level batch processing service orchestrated by a dedicated workload manager, and its utility for the execution of two main automated workflows for extracting built-up areas over large zones. The first workflow is implemented on a pan-European coverage of Very High Resolution satellite data from the Copernicus contributing missions acquired in 2015; the second workflow exploits a global Sentinel-2 pixel-based composite from the Copernicus constellation of satellites acquired mainly in 2018.
        Although the two workflows build on the same classifier, the number of images/tiles to be processed, their projections, the characteristics of the remote sensing sensors (in particular their spatial resolutions) and the derived outputs required different configurations of the workload automation. Taking advantage of the Docker universe of JEO-batch, which relies on the HTCondor architecture, the workflows, originally coded in Matlab, were compiled and successfully run on the JEODPP. The table below summarizes the main characteristics of the massive batch processing that allowed extracting built-up areas at the European and global scales.

        [Table (not reproduced here): Main characteristics of the massive batch processing that allowed extracting built-up areas at the European and global scales]

        References:

        • Corbane, Christina, Martino Pesaresi, Panagiotis Politis, Vasileios Syrris, Aneta J. Florczyk, Pierre Soille, Luca Maffenini, et al. 2017. “Big Earth Data Analytics on Sentinel-1 and Landsat Imagery in Support to Global Human Settlements Mapping.” Big Earth Data 1 (1–2): 118–44. https://doi.org/10.1080/20964471.2017.1397899.
        • Florczyk, Aneta Jadwiga, Stefano Ferri, Vasileios Syrris, Thomas Kemper, Matina Halkia, Pierre Soille, and Martino Pesaresi. 2016. “A New European Settlement Map From Optical Remotely Sensed Data.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (5): 1978–92. https://doi.org/10.1109/JSTARS.2015.2485662.
        • Soille, P., A. Burger, D. De Marchi, P. Kempeneers, D. Rodriguez, V. Syrris, and V. Vasilev. 2018. “A Versatile Data-Intensive Computing Platform for Information Retrieval from Big Geospatial Data.” Future Generation Computer Systems 81: 30–40. https://doi.org/10.1016/j.future.2017.11.007.
        Speakers: Dr Christina Corbane (Joint Research Centre), Dr Dario Rodriguez (European Commission - DG JRC)
      • 35
        Security Mechanisms in HTCondor

        Overview of the security mechanisms available in HTCondor, including discussion on security configuration and new security features being introduced in the HTCondor v8.9 series.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
    • 10:50 AM
      Coffee break
    • Workshop presentations

      Antonio

      Convener: Helge Meinhard (CERN)
      • 36
        SciTokens and Credential Management

        Presentation on work to integrate distributed authorization technologies such as SciTokens and OAuth 2.0 into HTCondor, and what this means for end-users and system administrators.

        Speaker: Todd Tannenbaum (University of Wisconsin Madison (US))
      • 37
        Container support in HTCondor

        Container support in HTCondor

        Speaker: Gregory Thain (University of Wisconsin-Madison)
    • Miscellaneous
    • 1:00 PM
      Lunch-buffet
    • Visits: Nuclear Reactor Visit

      Labs visits

    • Bus departure from JRC Ispra: ESSOR