CMS Big Data Science Project


Fermilab room: PPD/ Quarium-WH8SW - Wilson Hall 8th fl South West

CERN room:


Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; call-in numbers are here:

The meeting ID is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

  • attendance:
    • Matteo Cremonesi (FNAL), Cristina Mantilla (FNAL/Johns Hopkins), Saba Sehrish (FNAL), Jim Kowalkowski (FNAL), Jim Pivarski (Princeton), Alexey Svyatkovskiy (Princeton), Bo Jayatilaka (FNAL), Maria Girone (CERN/openlab), Ian Fisk (Simons Foundation), Volker Tresp (Siemens), Tobias Enrich (LMU), Jin Chang (FNAL), Ruth Pordes (FNAL)


Many thanks to JimK for writing notes!


first milestone in 4 weeks: load BACON ntuples in hadoop+spark and produce a plot


  • Overall goal and time schedule
    • Realize CMS analysis use case in industry big data technology
    • Document comparison to traditional analysis using the HEP-specific ROOT framework in a write-up by Fall 2016
    • Start with hadoop+spark; once that is complete, possibly branch out


  • use case discussion
    • Physics: Matteo, Cristina
      • currently publishing results of analysis use case with 2015 data
      • plan is to further develop the analysis and integrate 2016 data, publish update in Fall 2016
    • technical team: Matteo, Cristina, JimP, Alexey, Saba
      • start with BACONprod ntuples and not MINIAOD
      • Matteo and Cristina will publish to slack
        • code via github
        • files to download (one small with a few events, one ~GB size)
      • JimP will publish interface code to slack via github (already done)


  • testing platforms
    • Alexey has a 10-node test cluster at Princeton and will give access
    • Ian is planning to have a test setup at Simons in New York and will also provide access
    • Matteo and Cristina to figure out, together with Alexey and Ian, how to transfer larger quantities of BACON ntuple files to Princeton and New York


  • Milestones and meeting time schedule
    • Meeting every two weeks in this time slot, Wednesdays at 10 AM CST, 5 PM CET
      • Next Meeting March 16
    • first milestone in 4 weeks: load BACON ntuples in hadoop+spark and produce a plot
    • everyone: think about further milestones and parts of the project that need to be accomplished by Fall 2016


  • technical discussion
    • discussion about content of BACON ntuples (flat or flat/flat) ➜ answer is flat (simple structure of classes)
    • discussion of loading data from ROOT files or from pre-converted data in HDFS
    • discussion about analysis in python or scala ➜ will start with python for the interactive part, scala will be used for slimming/skimming
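
To make the terms in the technical discussion concrete, here is a minimal sketch in plain python (no hadoop or spark), assuming a hypothetical flat event record. The field names (met, jet_pt, jet_eta), the cut value and the bin width are illustrative assumptions, not actual BACON branches or analysis cuts. Skimming is shown as dropping events, slimming as dropping fields, and the plot as a text histogram. The minutes assign slimming/skimming to scala; python is used here only to illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass, asdict


@dataclass
class Event:
    """Hypothetical flat record: one scalar per field, no nested collections."""
    run: int
    event: int
    met: float      # missing transverse energy in GeV (illustrative)
    jet_pt: float   # leading jet pT in GeV (illustrative)
    jet_eta: float  # leading jet pseudorapidity (illustrative)


def skim(events, min_met):
    """Skimming: keep only events that pass a selection (fewer rows)."""
    return [e for e in events if e.met >= min_met]


def slim(events, keep):
    """Slimming: keep only the named fields of each event (fewer columns)."""
    return [{k: v for k, v in asdict(e).items() if k in keep} for e in events]


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter(int(v // bin_width) * bin_width for v in values)


# Made-up events, for illustration only.
events = [
    Event(1, 1, 250.0, 310.0, 0.4),
    Event(1, 2, 80.0, 120.0, -1.2),
    Event(1, 3, 410.0, 505.0, 2.1),
    Event(1, 4, 265.0, 290.0, 0.0),
]

selected = slim(skim(events, min_met=200.0), keep={"met", "jet_pt"})
counts = histogram((e["met"] for e in selected), bin_width=100)

# Text "plot": one '#' per event in each populated bin.
for edge in sorted(counts):
    print(f"[{edge}, {edge + 100}) {'#' * counts[edge]}")
```

The same three steps map naturally onto a filter, a projection and a keyed count in a distributed framework, which is the shape the first milestone (load ntuples, produce a plot) would likely take.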