Talk
Title Spark - a modern approach for distributed analytics
Author(s) Surdy, Kacper (speaker) ; Kothuri, Prasanth (speaker) (CERN)
Corporate author(s) CERN. Geneva
Imprint 2016-08-03. - Streaming video.
Series (Workshops)
Lecture note on 2016-08-03T10:30:00
Subject category Workshops
Abstract

The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of big data. It is widely used for implementing data warehouses and data lakes. Spark has also emerged as one of the leading engines for data analytics. The Hadoop platform is available at CERN as a central service provided by the IT department.

By attending the session, a participant will acquire knowledge of the essential concepts needed to benefit from the parallel data processing offered by the Spark framework. The session is structured around practical examples and tutorials.

Main topics:

  • Architecture overview - work distribution, concepts of a worker and a driver
  • Computing concepts of transformations and actions
  • Data processing APIs - RDD, DataFrame, and SparkSQL
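
The distinction between transformations and actions named in the topics above can be illustrated with a small toy model. This is a minimal sketch in plain Python and is not taken from the talk: `ToyRDD`, `square`, `calls`, `pending`, and `result` are hypothetical names, and the class only mimics the idea that transformations are recorded lazily while an action triggers the computation. It does not use or reproduce the Spark API, and it runs on a single machine with no driver or workers.

```python
# Toy illustration only: ToyRDD is a hypothetical class, not the Spark API.
class ToyRDD:
    def __init__(self, source, steps=()):
        self._source = source   # iterable holding the input records
        self._steps = steps     # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work on the data.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Apply every recorded step to each record, in order.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the recorded steps and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records each time the mapping function really runs


def square(x):
    calls.append(x)
    return x * x


squares = ToyRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
pending = len(calls)        # no record has been processed at this point
result = squares.collect()  # the action runs the whole pipeline
```

In this sketch, building `squares` only records the two steps; `square` is first called when `collect()` runs. Calling a second action such as `count()` re-executes the recorded steps, since the toy model keeps no cached results.
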
Copyright/License © 2016-2024 CERN
Submitted by zbigniew.baranowski@cern.ch

 


 Record created 2016-09-09, last modified 2022-11-02

