GSoC 2017 - Big Data Tools for Physics Analysis

31/S-028 (CERN)

Participants: Krishnan, Danilo, Prasanth, Kacper, Enric

# Status of tasks

* Synchronous execution
 - If the kernel is busy running a Spark job, it cannot respond to messages coming from the JS frontend (in particular, the "stop job" message).
 - Krishnan opened an issue on jupyter-widgets to ask for advice. They suggested launching the Spark execution in another thread, but this could cause problems (e.g. thread-local state). Moreover, it does not solve the main issue: some thread needs to selectively process only the "stop job" messages, and not every single cell-execution message received from the frontend.
   https://github.com/jupyter-widgets/ipywidgets/issues/1349

 - ACTION: Krishnan suggested another strategy: interrupt the running kernel (similar to Control+C in the Spark shell). He will try this and report the results; a sketch of the idea follows this list.
 - ACTION: Enric will trigger a conversation with Benjamin so that Krishnan can ask about this particular problem.
 - ACTION: Krishnan will write a small document where he describes and depicts the communication between the frontend and the backend, including all the channels and actors.
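
A minimal sketch of the interrupt strategy, assuming a PySpark SparkContext is available in the kernel: a kernel interrupt raises KeyboardInterrupt in the main thread, just like Control+C in the Spark shell, and SparkContext.cancelAllJobs() (an existing PySpark call) cancels the running jobs.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def run_cancellable(action):
    """Run a blocking Spark action; cancel all jobs if the kernel is interrupted."""
    try:
        return action()
    except KeyboardInterrupt:
        # A kernel interrupt lands here, exactly like Control+C in the Spark shell.
        sc.cancelAllJobs()
        print("Spark jobs cancelled by user interrupt")

# Usage: wrap the blocking action instead of calling it directly, e.g.
# result = run_cancellable(lambda: sc.parallelize(range(10**7)).count())
```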

* Task information for JS display
 - Krishnan showed the information that can be queried from the Spark UI for every task: start and end time, the executor it ran on and some metrics, but nothing that obviously correlates a task with the user's code.
 - With that information we can at least build a time-series event display showing which tasks ran, on which executors, and with what metrics (see the sketch after this list).
 - ACTION: Krishnan will send an example of the information obtained for a task.
 - ACTION: Krishnan will continue to investigate what other information we can obtain and whether it is possible at all to correlate tasks with code, since the graph display of Spark shows some information in that regard. He will also look at other monitoring approaches for Spark.
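
For reference, a sketch of how per-task information can be pulled from Spark's monitoring REST API (the /api/v1 endpoints are part of Spark; the localhost:4040 address assumes a local driver with the default UI port):

```python
import requests

BASE = "http://localhost:4040/api/v1"  # assumed: driver UI on the default port

def task_events():
    """Yield (stage name, task id, executor, launch time) for every task."""
    app_id = requests.get(BASE + "/applications").json()[0]["id"]
    stages = requests.get("{}/applications/{}/stages".format(BASE, app_id)).json()
    for stage in stages:
        url = "{}/applications/{}/stages/{}/{}/taskList".format(
            BASE, app_id, stage["stageId"], stage["attemptId"])
        for task in requests.get(url).json():
            # Start time, executor and metrics are available per task,
            # but nothing here ties a task back to a line of user code.
            yield stage["name"], task["taskId"], task["executorId"], task["launchTime"]
```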

* Automatic detection of Spark jobs in a cell
 - Krishnan found that the Jupyter-Spark extension can automatically detect the execution of a Spark job in a cell. It is the JS frontend that, by inspecting the list of Spark jobs returned by the web server, notices that a new job was added and enables the display in that case.
   https://github.com/mozilla/jupyter-spark

 - ACTION: Krishnan will try to implement the automatic detection based on his findings. One way to do it could be to inspect the list of jobs when a cell is executed and show the display if a new job appeared (sketched below).
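
One possible shape for that detection, sketched with IPython's cell-execution events (pre_run_cell and post_run_cell are real IPython hooks; the jobs endpoint is the same REST API as above, and show_display is a hypothetical stand-in for enabling the JS monitoring widget):

```python
import requests
from IPython import get_ipython

BASE = "http://localhost:4040/api/v1"  # assumed: driver UI on the default port

def _job_ids():
    app_id = requests.get(BASE + "/applications").json()[0]["id"]
    jobs = requests.get("{}/applications/{}/jobs".format(BASE, app_id)).json()
    return {job["jobId"] for job in jobs}

_jobs_before = set()

def _pre_run_cell(*args):
    global _jobs_before
    _jobs_before = _job_ids()

def _post_run_cell(*args):
    new_jobs = _job_ids() - _jobs_before
    if new_jobs:
        show_display(new_jobs)  # hypothetical: enable the JS monitoring display

ip = get_ipython()
ip.events.register('pre_run_cell', _pre_run_cell)
ip.events.register('post_run_cell', _post_run_cell)
```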

# Jupyter Spark roadmap

- Jupyter started a sub-project to integrate Spark with notebooks. The intended approach seems very basic (progress bars + cancellation of jobs) in comparison to what we want to achieve.

   https://github.com/jupyter/roadmap/blob/master/spark.md

- ACTION: Enric will write an e-mail to Benjamin to discuss how to integrate (if possible) Krishnan's work in the Jupyter roadmap.

# Agenda
1. Status of assigned tasks

   We will discuss the progress on the following tasks:
   * Check what information we can get for a running task from the Spark REST server. Can we link a task with a particular operation in the chain (e.g. map, reduce)? This will determine the monitoring display that we can offer to the user. One idea is to show a refreshed time series that tells us which tasks are running where, the task type, and also allows us to detect bottlenecks in our app.
   * Check how we can automatically detect that a Spark action was executed from a cell, so that the monitoring display is activated only then for the Spark-generic part.
   * Start with the synchronous execution case: make sure the communication from the JS client to the kernel process also works in this case.

2. Contribution to Jupyter-Spark

   Discuss how to contribute to the roadmap of Jupyter regarding the integration with Spark:
   https://github.com/jupyter/roadmap/blob/master/spark.md

3. AOB