Hadoop Tutorial - Efficient data ingestion

Name: Hadoop Tutorial - Efficient data ingestion
Start: 2016-07-20T10:30:00+02:00
End: 2016-07-20T12:10:00+02:00
Location: CERN

by Daniel Lanza Garcia (CERN), Zbigniew Baranowski (CERN)

Wednesday 20 Jul 2016, 10:30 → 12:10 Europe/Zurich

31/3-004 - IT Amphitheatre (CERN)

Description

The Hadoop ecosystem is the leading open-source platform for distributed storage and processing of "big data". The Hadoop platform is available at CERN as a central service provided by the IT department.

Real-time data ingestion into the Hadoop ecosystem is a non-trivial process because of the system's specific characteristics. Making it efficient (low latency, optimized data placement, small footprint on the cluster) requires effort that is often underestimated.

In this tutorial attendees will learn about:

The important aspects of storing data in the Hadoop Distributed File System (HDFS).
Data ingestion techniques and engines that are capable of shipping data to Hadoop efficiently.
Setting up a full data ingestion flow into the Hadoop Distributed File System from various sources (streaming, log files, databases) using the best practices and components available in the ecosystem (including Sqoop, Kite, Flume and Kafka).

Webcast

There is a live webcast for this event