CHEP 2016 Conference, San Francisco, October 8-14, 2016

Name: CHEP 2016 Conference, San Francisco, October 8-14, 2016
Start: 2016-10-10T08:00:00-07:00
End: 2016-10-14T18:00:00-07:00
Location: San Francisco Marriott Marquis

America/Los_Angeles timezone

XRootD Popularity on Hadoop Clusters

13 Oct 2016, 14:30

15m

GG A+B (San Francisco Marriott Marquis)

Oral, Track 5: Software Development

Speakers: Luca Menichetti (CERN), Marco Meoni (Universita di Pisa & INFN (IT)), Nicolo Magini (Fermi National Accelerator Lab. (US))

The CMS experiment has implemented a computing model in which distributed monitoring infrastructures collect all kinds of data and metadata about the performance of computing operations. This data can be probed further with Big Data analytics approaches to discover patterns and correlations that can improve the throughput and efficiency of the computing model.

CMS has already begun to store a large set of operational data (user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs) in a Hadoop cluster. This makes it possible to run fast, arbitrary queries on the data and to test several MapReduce-based computing frameworks.

In this work we analyze the XRootD logs collected in Hadoop through Gled and Flume, and we benchmark their aggregation at the dataset level for popularity monitoring queries, thus showing how dashboard and monitoring systems can benefit from Hadoop parallelism. The time needed to process XRootD time-series logs on the existing Oracle DBMS does not scale linearly with data volume. Big Data architectures, by contrast, do scale linearly, which makes re-processing any user-defined time interval very effective. The entire set of existing Oracle queries is replicated in the Hadoop data store, and the results are validated against the Oracle output.
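The dataset-level aggregation described above can be pictured with a small MapReduce-style sketch in plain Python. It is only an illustration under stated assumptions: the record layout (dataset, user, open time, bytes read), the sample values, and the function names are hypothetical and are not taken from the actual XRootD monitoring schema or from the queries used in this work.

```python
from datetime import datetime, timezone

# Hypothetical access records: (dataset, user, open time in ISO 8601, bytes read).
# The real XRootD monitoring records are not specified here; this layout is assumed.
RECORDS = [
    ("/A/B/AOD", "alice", "2016-10-01T10:00:00+00:00", 500),
    ("/A/B/AOD", "bob", "2016-10-02T11:00:00+00:00", 300),
    ("/C/D/RAW", "alice", "2016-10-02T12:00:00+00:00", 200),
    ("/A/B/AOD", "alice", "2016-11-01T09:00:00+00:00", 100),
]


def map_record(record, start, end):
    """Map step: emit (dataset, partial aggregate) for records inside [start, end)."""
    dataset, user, opened, nbytes = record
    timestamp = datetime.fromisoformat(opened)
    if start <= timestamp < end:
        yield dataset, (1, nbytes, {user})


def reduce_values(left, right):
    """Reduce step: merge two partial aggregates belonging to the same dataset."""
    return left[0] + right[0], left[1] + right[1], left[2] | right[2]


def popularity(records, start, end):
    """Aggregate accesses, bytes read and distinct users per dataset."""
    totals = {}
    for record in records:
        for key, value in map_record(record, start, end):
            totals[key] = reduce_values(totals[key], value) if key in totals else value
    return {
        dataset: {"accesses": count, "bytes_read": nbytes, "users": len(users)}
        for dataset, (count, nbytes, users) in totals.items()
    }


if __name__ == "__main__":
    start = datetime(2016, 10, 1, tzinfo=timezone.utc)
    end = datetime(2016, 11, 1, tzinfo=timezone.utc)
    for dataset, stats in sorted(popularity(RECORDS, start, end).items()):
        print(dataset, stats)
```

Because the reduce step is associative and commutative, the partial aggregates can be merged in any order across workers, and re-processing a different user-defined interval only means re-running the same job with new `start` and `end` values.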

These results constitute the set of features on top of which a mining platform is designed to predict the popularity of a new dataset, the best location for replicas, or the proper amount of CPU and storage in future timeframes. Learning techniques applied to Big Data architectures are extensively explored to study the correlations between aggregated data and to search for patterns in the CMS computing ecosystem. Examples of this kind are primarily represented by operational information such as file access statistics or dataset attributes, which are organised in samples suitable for feeding several classifiers.
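One way to picture how aggregated access statistics might be organised into samples for a classifier is sketched below. The abstract does not state which features, labels, or thresholds are used, so everything here is an assumption made for illustration: the weekly granularity, the two features (accesses and distinct users), the label rule, and the names `build_samples` and `threshold`.

```python
def build_samples(weekly_stats, threshold):
    """Turn per-dataset weekly statistics into (dataset, features, label) samples.

    weekly_stats maps a dataset name to a list of (accesses, users) tuples,
    one per week, in chronological order (hypothetical layout).
    The label is 1 when the following week reaches `threshold` accesses,
    which is an assumed definition of "popular", not the one used in this work.
    """
    samples = []
    for dataset, weeks in weekly_stats.items():
        # Pair each week with the week that follows it.
        for current, following in zip(weeks, weeks[1:]):
            features = [current[0], current[1]]
            label = 1 if following[0] >= threshold else 0
            samples.append((dataset, features, label))
    return samples


if __name__ == "__main__":
    weekly = {"/A/B/AOD": [(10, 3), (50, 8), (2, 1)]}
    for sample in build_samples(weekly, threshold=20):
        print(sample)
```

The resulting feature lists and labels have the shape that most classifier libraries expect as training input; which classifiers are fed with them is left open here, as in the text above.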

Primary Keyword (Mandatory): Analysis tools and techniques
Secondary Keyword (Optional): Databases

Author: Marco Meoni (Universita di Pisa & INFN (IT))

Co-authors: Domenico Giordano (CERN), Luca Menichetti (CERN), Nicolo Magini (Fermi National Accelerator Lab. (US)), Tommaso Boccali (Universita di Pisa & INFN (IT))

marcomeoni_CHEP2016_highlights.pdf

marcomeoni_CHEP2016.pdf
