Q&A with Holden Karau on Spark, ML and Big Data

Europe/Zurich
513/R-070 - Openlab Space (CERN)

Luca Canali (CERN), Maria Girone (CERN)
Description

Discussions on topics around Spark, data engineering, ML, and cloud solutions for Big Data, plus follow-up on Holden's talk in the morning.

    • 15:00 15:10
      Intro and goals 10m
      Speakers: Luca Canali (CERN), Maria Girone (CERN)
    • 15:10 16:30
      Use cases and discussions 1h 20m

      Please contact luca.canali@cern.ch if you have specific topics you want to discuss, so that we can better organize the discussion and time for Q&A.

      Current proposals:

      • Additional questions and follow-up from the morning's computing seminar.

      • Discussion on integrating Spark with Python, covering performance and usability, including ideas on further use of the Arrow integration to pass data from Spark to the Coffea framework developed at FNAL (Lindsey Gray, Andrew Melo, CMS)

      • Drill-down on performance and Spark + Parquet, in the context of speeding up data extraction for the Spark-based framework developed for the NXCals project. Several optimizations have been tested or are in the pipeline (including sorting by timestamp, partitioning, and splitting into multiple files). There is interest in understanding the roadmap and current work in this area in the Spark and open source communities, which could help with further tuning of the platform (Jakub Wozniak, BE-CO).

      • Interest in discussing Spark Structured Streaming: its evolution in Spark 3, integration with Kafka, and a possible Kafka client upgrade to 2.0 (from the IT-CM monitoring team)

      • Possible interest from the team working on Kubernetes and Kubeflow (Ricardo Brito Da Rocha, IT-CM)

      • Possible interest from the team working on SWAN, which integrates Spark with Jupyter and is developing distributed processing for ROOT with Spark (Enric Tejedor Saavedra, EP-SFT)

      Speakers: Jakub Wozniak (CERN), Lindsey Gray (Fermi National Accelerator Lab. (US))