GSoC 2017 - Big Data Tools for Physics Analysis
Participants: Krishnan, Danilo, Prasanth, Kacper, Enric
# Status of tasks
* Synchronous execution
- While the kernel is busy running a Spark job, it cannot respond to messages coming from the JS frontend (in particular, the "stop job" message).
- Krishnan opened an issue on jupyter-widgets to ask for advice on this. They suggested launching the Spark execution in another thread, but this could cause problems (e.g. thread-local state). Moreover, it does not solve the main issue: some thread needs to selectively process only the "stop job" messages, and not every cell execution message received from the frontend.
https://github.com/jupyter-widgets/ipywidgets/issues/1349
- ACTION: Krishnan suggested another strategy: interrupt the running kernel, similar to a Control+C in the Spark shell (see the sketch after this topic). He will try this and report the results.
- ACTION: Enric will trigger a conversation with Benjamin so that Krishnan can ask about this particular problem.
- ACTION: Krishnan will write a short document describing and depicting the communication between the frontend and the backend, including all the channels and actors involved.
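A minimal sketch of the interrupt strategy, assuming a PySpark SparkContext `sc` already exists in the kernel. Jupyter delivers a frontend "interrupt kernel" request as a SIGINT to the kernel process, so a custom signal handler can cancel the running Spark jobs before aborting the cell; `cancelAllJobs()` is a standard SparkContext method, the handler itself is illustrative:

```python
import signal

# Hypothetical handler: on SIGINT (i.e. a kernel interrupt), cancel the
# Spark jobs first, then abort the cell as a normal Control+C would.
def _cancel_spark_jobs(signum, frame):
    sc.cancelAllJobs()  # standard PySpark API: cancels all scheduled or running jobs
    raise KeyboardInterrupt()

signal.signal(signal.SIGINT, _cancel_spark_jobs)
```

A finer-grained variant could call `sc.setJobGroup()` at the start of each cell and `sc.cancelJobGroup()` in the handler, so that only the jobs started by the interrupted cell are cancelled.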
* Task information for JS display
- Krishnan showed the information that can be queried from the Spark UI for every task, which includes the start and end times, the executor where the task ran and some metrics, but nothing that obviously correlates a task with the user's code.
- With the aforementioned information, we can at least build a time-series event display showing tasks, the executors where they run and per-task information.
- ACTION: Krishnan will send an example of the information obtained for a task.
- ACTION: Krishnan will continue to investigate what other information we can obtain and whether it is possible at all to correlate tasks with code, since Spark's graph display shows some information in that regard. He will also have a look at other monitoring approaches for Spark.
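As a starting point for the ACTION items above, a sketch of how per-task information could be pulled from the REST monitoring API behind the Spark UI (served by the driver, port 4040 by default). The endpoint paths and most field names follow the documented API; `.get` is used for the one field that is not present in all Spark versions:

```python
import requests

BASE = "http://localhost:4040/api/v1"  # default Spark UI address, adjust as needed

# Take the first application served by this driver
app_id = requests.get(BASE + "/applications").json()[0]["id"]

for stage in requests.get(BASE + "/applications/{0}/stages".format(app_id)).json():
    # Per-task data is exposed per stage attempt via the taskList endpoint
    # (paginated; offset/length query parameters are available)
    tasks = requests.get(
        BASE + "/applications/{0}/stages/{1}/{2}/taskList".format(
            app_id, stage["stageId"], stage["attemptId"])
    ).json()
    for t in tasks:
        # launch time and executor are available; note there is no field
        # linking a task back to the notebook code that produced it
        print("{0} {1} {2} {3}".format(
            t["taskId"], t["launchTime"], t["executorId"], t.get("duration")))
```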
* Automatic detection of Spark jobs in a cell
- Krishnan found that the Jupyter-Spark extension can automatically detect the execution of a Spark job in a cell. It is the JS frontend that, by inspecting the list of Spark jobs obtained from the web server, realizes that a new job was added and enables the display in that case.
https://github.com/mozilla/jupyter-spark
- ACTION: Krishnan will try to implement the automatic detection based on his findings. One way to do it could be to inspect the list of jobs when a cell is executed and show the display if there is a new job (see the sketch below).
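A minimal sketch of that detection idea, here placed in the Python backend for illustration (Jupyter-Spark does it in the JS frontend): IPython's cell events snapshot the job list before each cell and diff it afterwards. The `JobDetector` helper, the hard-coded URL and the application id are assumptions; `events.register` and the `/jobs` REST endpoint are real APIs.

```python
import requests

JOBS_URL = "http://localhost:4040/api/v1/applications/{0}/jobs"

def _job_ids(app_id):
    # jobId is part of the documented job data returned by the REST API
    return set(j["jobId"] for j in requests.get(JOBS_URL.format(app_id)).json())

class JobDetector(object):
    """Hypothetical helper: detects Spark jobs started by a cell."""

    def __init__(self, app_id):
        self.app_id = app_id
        self.before = set()

    def pre_run_cell(self, *args):
        self.before = _job_ids(self.app_id)

    def post_run_cell(self, *args):
        new_jobs = _job_ids(self.app_id) - self.before
        if new_jobs:
            # this is where the extension would enable the monitoring display
            print("Spark jobs started by this cell: {0}".format(sorted(new_jobs)))

detector = JobDetector(app_id="app-20170601120000-0000")  # example application id
ip = get_ipython()  # available inside a running kernel
ip.events.register('pre_run_cell', detector.pre_run_cell)
ip.events.register('post_run_cell', detector.post_run_cell)
```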
# Jupyter Spark roadmap
- Jupyter started a sub-project to integrate Spark with notebooks. The intended approach seems very basic (progress bars + cancellation of jobs) in comparison to what we want to achieve.
https://github.com/jupyter/roadmap/blob/master/spark.md
- ACTION: Enric will write an e-mail to Benjamin to discuss how to integrate (if possible) Krishnan's work into the Jupyter roadmap.