CERN Computing Seminar

Random Decision Forests on Apache Spark

by Tom White (Cloudera)

Europe/Zurich
31/3-004 - IT Amphitheatre (CERN)

Description

Apache Spark continues to gain momentum as the new processing paradigm for Apache Hadoop, and it offers the data scientist a lot to like: it is natively distributed, provides a REPL, has Python APIs in addition to its native Scala API, and ships with MLlib, a library of machine learning algorithms.

Spark includes an implementation of random decision forests, an important and popular ensemble classifier/regressor algorithm. This talk will introduce Spark and random decision forests to the curious, and demonstrate the process of analyzing a real-world data set with them. The session will cover loading data and understanding the data set, and introduce ideas like training and test set evaluation, ensemble methods, feature types, and supporting concepts like impurity and entropy.
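For readers unfamiliar with impurity and entropy, the following is a minimal plain-Python sketch of the two impurity measures commonly used when growing decision trees. It is an illustration only, not the talk's material and not Spark's implementation; the function names (class_proportions, gini, entropy, weighted_impurity) are hypothetical, and the label lists are assumed to be non-empty.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a candidate split: the size-weighted average of the
    impurity of the two child nodes (lower means a better split)."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
```

A node containing a single class has impurity zero under both measures, while an even two-class mix gives the maximum for two classes. A tree learner picks, at each node, the split whose children have the lowest weighted impurity.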

About the speaker

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net and IBM's developerWorks, and has spoken at several conferences, including ApacheCon 2008, where he spoke on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.

More information
Webcast
There is a live webcast for this event