CMS Big Data Science Project

Europe/Berlin
600/R-001 (CERN)

Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
Description

FNAL room: Dark Side-WH6NW - Wilson Hall 2nd fl North East

CERN room: 600-R-001

Instructions to create a light-weight CERN account to join the meeting via Vidyo:

If that is not possible, people can join the meeting by phone; the call-in numbers are here:

The meeting id is hidden below the Videoconference Rooms link, but here it is again:

  • 10502145

170705 - Big Data Meeting

  • Marc Paterno: the HDF5 and conversion topic will be scheduled for a later date
  • some issues from FNAL
    • 2 students working from FNAL
    • Sun Yong
      • he will be working until the end of September, then go to Padova for his PhD -> partnership with Padova
      • wants to set up his analysis workflow in Spark
    • started with another student to reproduce all of last year's CHEP work
      • that work used a different physics data format
      • need to convert to the new physics data format
      • it was also using AVRO
      • task: convert completely to spark-root (see the spark-root sketch after the minutes)
      • some issues:
        • Scala programming questions; need another person, more experienced in Scala, to have a look at the code
    • problem with moving files from MIT to CERN-HDFS
      • needs an intermediate step
      • about 50 TB in total
      • only 5 TB of temporary space available on EOS
      • staged copies
        • roughly 10 steps (about 5 TB per stage out of ~50 TB)
  • news at CERN
    • Vagg started doing performance tests on the xrootd-connector
    • can very easily saturate the network -> the network is the bottleneck
      • a 1 Gbit/s link is easily saturated
      • testing ROOT and Parquet on 10 Gbit/s machines
      • observed lower transfer rates from CERNBox vs. public EOS
        • will contact the EOS team and ask about the differences
  • action items
    • Vagg and Matteo: metrics for the 50 TB application within 2 weeks
      • what open data to use, file lists, etc.
    • Matteo will send a Panda example file
  • New version of the eos-connector on analytics
    • reads Parquet files and ROOT files
    • performance improved through client-side buffering (see the buffering sketch after the minutes)
      • the Parquet library was parsing byte by byte, so buffering helps a lot on the connector side
      • wrapped with standard Hadoop buffering, to help with very small filesystem interactions
  • Saba
    • working with a summer student, looking at the Scala code
    • reviewing the extensive use of user-defined functions; Spark cannot optimize them, so trying to find an alternative
      • replace them with simple select queries (see the select-query sketch after the minutes)
      • very close to wrapping this up
    • had trouble running the code on Cori; a solution is available: using Shifter images to run Spark on Cori works now
    • ready for more performance testing on Cori
    • Cori: the maximum used so far was 512 nodes, each node has 24 cores
    • another possibility is to use open data
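
Sketch of the spark-root conversion mentioned under the FNAL items: a minimal Spark (Scala) job that reads a ROOT TTree with the DIANA-HEP spark-root data source and writes Parquet, replacing the AVRO-based intermediate step. The format string, class name, and paths are assumptions for illustration, not taken from the meeting.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: read a ROOT TTree into a DataFrame via spark-root and
// write it out as Parquet, avoiding a separate AVRO conversion step.
object RootToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("root-to-parquet").getOrCreate()

    val events = spark.read
      .format("org.dianahep.sparkroot")     // spark-root data source (assumed format string)
      .load("hdfs:///path/to/input.root")   // hypothetical input path

    events.write.parquet("hdfs:///path/to/output.parquet")  // hypothetical output path
    spark.stop()
  }
}
```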
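
Sketch of the client-side buffering idea from the eos-connector item: wrapping the raw remote stream in Hadoop's standard buffered stream so that byte-by-byte Parquet reads hit an in-memory buffer instead of triggering tiny remote filesystem calls. This assumes the connector exposes the remote file as a Hadoop FSInputStream; the helper name and buffer size are made up here, not the connector's actual API.

```scala
import org.apache.hadoop.fs.{BufferedFSInputStream, FSDataInputStream, FSInputStream}

// Sketch: wrap the raw remote stream in Hadoop's standard buffered stream so
// that a reader parsing byte by byte is served from memory rather than issuing
// a tiny filesystem call for every read.
object BufferedOpen {
  def openBuffered(raw: FSInputStream, bufferSizeBytes: Int = 1 << 20): FSDataInputStream =
    new FSDataInputStream(new BufferedFSInputStream(raw, bufferSizeBytes))
}
```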
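
Sketch of the select-query replacement from Saba's item: the same cut expressed once as a user-defined function (which Catalyst treats as a black box) and once as a built-in column expression that Spark can optimize. Column names and cut values are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfToSelect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-to-select").getOrCreate()
    import spark.implicits._

    // Toy DataFrame with hypothetical columns.
    val df = Seq((10.0, 0.5), (25.0, 1.2)).toDF("pt", "eta")

    // UDF version: opaque to Catalyst, so the filter cannot be optimized or pushed down.
    val passesPtCut = udf((pt: Double) => pt > 20.0)
    val withUdf = df.filter(passesPtCut($"pt"))

    // Built-in expression version: same result, but Catalyst can optimize the plan.
    val withBuiltin = df.filter($"pt" > 20.0)

    withUdf.show()
    withBuiltin.show()
    spark.stop()
  }
}
```
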
    • 1. News
      Speakers: Matteo Cremonesi (Fermi National Accelerator Lab. (US)), Oliver Gutsche (Fermi National Accelerator Lab. (US))
    • 2. Discussion