CMS Big Data Science Project

Start: 2016-07-20T10:00:00-05:00
End: 2016-07-20T11:00:00-05:00

Wednesday 20 Jul 2016, 10:00 → 11:00 US/Central

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))

Description

FNAL room: DIR/ Snake Pit-WH2NE - Wilson Hall 2nd fl North East

CERN room:

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

https://account.cern.ch/account/Externals/

If that is not possible, people can join the meeting by phone; call-in numbers are here:

http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

10502145


Attendance:
From FNAL: Oli, Jim P, Jim K, Saba, Matteo
Alexey (Princeton)
Bo, Cristina (FNAL)
Matthew Link, Kevin Lannon (ND)
Dorian, Kacper, Luca (Hadoop service at CERN)
Mike Hildreth

- Matthew Link and Kevin Lannon's talk - Vertica in HEP
Q&A:
How is the data structured? In ROOT, each row of the data table is a vector of some structure; in our case we encode arrays of structs, which could be equivalent to the ROOT approach.
When running Vertica, are you running from disk? The Vertica database is on local disk; when running ROOT we were using a Hadoop-based system.
Do you index your data table in any way? Only on the primary key; we then use keys for all the table objects, such as muons or leptons.
Do you perform object or event selection too? We work with object selection; the plot on slide 6 does not show the number of events.
How big was each ROOT file? We use roughly 37 ROOT files of 1.9 GB each.
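
To illustrate the Q&A points above, here is a minimal sketch of how arrays of structs can be flattened into a keyed object table, and how object selection differs from event selection. It is not the code shown in the talk: the field names (event_id, muon_index, pt, eta), the values, and the 20 GeV threshold are all assumptions made up for the example.

```python
# Hypothetical events, each holding an array of muon structs (all values are made up).
events = [
    {"event_id": 1, "muons": [{"pt": 32.0, "eta": 0.4}, {"pt": 24.5, "eta": -1.9}]},
    {"event_id": 2, "muons": []},
    {"event_id": 3, "muons": [{"pt": 27.1, "eta": 2.1}, {"pt": 5.0, "eta": 0.1}]},
]


def flatten_muons(events):
    """Turn arrays of structs into flat rows keyed by (event_id, muon_index)."""
    rows = []
    for event in events:
        for index, muon in enumerate(event["muons"]):
            rows.append({"event_id": event["event_id"], "muon_index": index, **muon})
    return rows


muon_table = flatten_muons(events)

# Object selection: keep individual muons above an assumed pt threshold.
selected_muons = [row for row in muon_table if row["pt"] > 20.0]

# Event selection: keep events that contain at least one selected muon.
selected_event_ids = sorted({row["event_id"] for row in selected_muons})

print(len(muon_table), len(selected_muons), selected_event_ids)
```

With object selection the count is per muon, so it can exceed the number of selected events, which is why a plot of selected objects does not show the number of events.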

- Princeton workflow
Full-scale test: code complete, Parquet file has all the information
We save it in flat Parquet file(s) ➜ read it and produce plots (even stacked plots) with the Histogrammar stack function ➜ follow up with Alexey on how to make the plots nice
still missing the recoil calculation
implementing the scale factors and event weight
should convert ROOT files into JSON files
this week's milestone: finish code
skimming part is complete except weight propagation
follow up because calculation should be done on the fly
also have to implement recoil calculation
trigger: nothing for the moment
skimming is done, next two weeks focus on plotting part, and calculate additional properties into the system
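
The idea of computing the event weight on the fly during skimming, noted above, could look roughly like the sketch below. This is not the Princeton code: the field names (met, gen_weight, scale_factors), the bin edges, and the cut value are hypothetical and only show the weight being computed in the same pass that skims and fills the histogram.

```python
from math import prod

# Hypothetical skimmed events; names and numbers are illustrative only.
events = [
    {"met": 120.0, "gen_weight": 1.0, "scale_factors": [0.98, 1.02]},
    {"met": 260.0, "gen_weight": 0.5, "scale_factors": [0.95]},
    {"met": 80.0, "gen_weight": 1.0, "scale_factors": [1.00]},
]


def event_weight(event):
    """Event weight computed on the fly as generator weight times all scale factors."""
    return event["gen_weight"] * prod(event["scale_factors"])


def fill_histogram(events, edges, min_met):
    """Skim on an assumed MET cut and fill a weighted histogram in a single pass."""
    counts = [0.0] * (len(edges) - 1)
    for event in events:
        if event["met"] < min_met:
            continue  # event fails the skim
        for i in range(len(edges) - 1):
            if edges[i] <= event["met"] < edges[i + 1]:
                counts[i] += event_weight(event)
                break
    return counts


histogram = fill_histogram(events, edges=[100.0, 200.0, 300.0], min_met=100.0)
print(histogram)
```

Because the weight is derived inside the loop, nothing extra has to be stored in the skimmed file, at the cost of recomputing it on every pass.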

- Testing workflow at CERN
Set up accounts for list of people using Hadoop cluster: Matteo, Jim P, Alexey, Cristina

- NERSC workflow
Test by moving two ROOT files to NERSC and converting them into HDFS
running a selection implemented by Alexey
to use Cristina’s code, moved to Scala
join is needed, looking at clever way to do that
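
For reference, one common way to do such a join is a hash join: index one table by the key, then stream over the other. The sketch below is only an illustration in plain Python, not the approach chosen for the NERSC workflow; the table layouts and the event_id key are assumptions.

```python
from collections import defaultdict

# Hypothetical per-event and per-muon tables sharing an event_id key.
event_rows = [{"event_id": 1, "weight": 0.9}, {"event_id": 2, "weight": 1.1}]
muon_rows = [
    {"event_id": 1, "pt": 32.0},
    {"event_id": 1, "pt": 24.5},
    {"event_id": 3, "pt": 27.1},
]


def hash_join(left, right, key):
    """Inner join: index the left rows by key, then stream over the right rows."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


joined = hash_join(event_rows, muon_rows, "event_id")
for row in joined:
    print(row)
```

Indexing the smaller table keeps the lookup structure small; rows without a partner on the other side (event 2 and the muon from event 3 here) are dropped, as in any inner join.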

- Poster
SC poster due on Friday
Abstract: 150 words ➜ combination of Grace-Hopper and CHEP
needs two-page write up
update use case
draft of the poster
Matteo: Introduction
Cristina: problem description
Jim: histogrammar
Saba: NERSC


- 10:00 → 10:20
  
  REU report 20m
  
  Speakers: Kevin Patrick Lannon (University of Notre Dame (US)), Matthew Link
  
  CMS Big Data.pdf