170705 - Big Data Meeting
- Marc Paterno: HDF5 and conversion: will be scheduled for a later date
- updates and issues from FNAL
- 2 students working from FNAL
- Sun Yong
- he will keep working until the end of September, and then go to Padova for his PhD -> partnership with Padova
- wants to set up his analysis workflow in Spark
- started, together with another student, to reproduce all the work done for CHEP last year
- the physics data format is different now
- need to convert to the new physics data format
- the old workflow was also using AVRO
- task: convert the workflow completely to spark-root (see the sketch after this section)
- some issues:
- Scala programming questions; someone more experienced in Scala should have a look at the code
- problem with moving files from MIT to CERN-HDFS
- needs intermediate step
- about 50 TB total
- only have 5 TB temporary space on EOS
- staged copies
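Below is a minimal sketch of what the spark-root conversion step mentioned above could look like; the package name, input path, and output path are assumptions for illustration, not details from the meeting.

```scala
// Hypothetical sketch: read a ROOT file with spark-root and persist it as Parquet,
// replacing the previous AVRO intermediate format. Paths are made up.
import org.apache.spark.sql.SparkSession

object RootToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("root-to-parquet")
      .getOrCreate()

    // spark-root (org.dianahep:spark-root) exposes ROOT TTrees as Spark DataFrames.
    val events = spark.read
      .format("org.dianahep.sparkroot")
      .load("hdfs:///user/someuser/data/events.root")        // hypothetical input path

    // Write the same events out as Parquet for the downstream Spark analysis.
    events.write
      .mode("overwrite")
      .parquet("hdfs:///user/someuser/data/events_parquet")  // hypothetical output path

    spark.stop()
  }
}
```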
- news at CERN
- Vagg started doing some performance tests on the xrootd-connector (see the sketch below)
- the network can very easily be saturated -> the network is the bottleneck
- a 1 Gbit link is easily saturated
- on 10 Gbit machines, testing ROOT and Parquet reads
- lower transfer rates from CERNBox vs. public EOS
- will contact EOS and ask about differences
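A rough sketch of how the ROOT vs. Parquet read-rate comparison could be scripted is shown below; it assumes the Hadoop-XRootD connector makes root:// paths readable from Spark, and the dataset locations are hypothetical.

```scala
// Rough sketch: compare scan times for ROOT and Parquet copies of the same data
// read over the xrootd-connector. Dataset paths are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadThroughputTest {
  def timeCount(label: String, df: DataFrame): Unit = {
    val start = System.nanoTime()
    val rows = df.count()                       // forces a full scan of the input
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label%-8s rows=$rows%d time=$seconds%.1f s")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xrootd-read-test").getOrCreate()

    val rootDf = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/opendata/some/dataset.root")       // hypothetical

    val parquetDf = spark.read
      .parquet("root://eospublic.cern.ch//eos/opendata/some/dataset_parquet") // hypothetical

    timeCount("root", rootDf)
    timeCount("parquet", parquetDf)

    spark.stop()
  }
}
```

Counting rows is only a proxy for raw throughput; dividing the dataset size by the elapsed time gives an effective read rate to compare against the 1 Gbit and 10 Gbit link capacities.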
- action items
- Vagg and Matteo: metrics for the 50 TB application within 2 weeks
- what open data to use, file lists, etc.
- Matteo will send Panda example file
- new version of the eos-connector is available on analytics
- it can read both Parquet and ROOT files
- performance improved through client-side buffering (see the sketch below)
- the Parquet library was reading byte by byte, so buffering on the connector side helps a lot
- the stream is wrapped with standard Hadoop buffering to avoid very small filesystem interactions
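The snippet below is an illustrative sketch (not the connector's actual code) of why client-side buffering helps: wrapping the raw stream in a BufferedInputStream turns a reader's byte-by-byte read() calls into a few large reads against the remote filesystem. The helper name and buffer size are made up.

```scala
// Illustrative sketch of client-side read buffering.
import java.io.{BufferedInputStream, InputStream}

object BufferedReads {
  // Hypothetical helper: wrap any raw stream with a 1 MB client-side buffer, so
  // single-byte reads are served from memory instead of hitting EOS each time.
  def withBuffer(raw: InputStream, bufferSize: Int = 1 << 20): InputStream =
    new BufferedInputStream(raw, bufferSize)

  // Naive consumer that reads one byte at a time, as a parser might do; without
  // the buffer, every read() would be a tiny remote filesystem interaction.
  def readByteByByte(in: InputStream): Long = {
    var count = 0L
    var b = in.read()
    while (b != -1) {
      count += 1
      b = in.read()
    }
    count
  }
}
```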
- Saba
- working with a summer student, looking at the Scala code
- reviewing the extensive use of user-defined functions; Spark cannot optimize them, so trying to find an alternative (see the sketch after this item)
- replace with simple select queries
- very close to wrapping this up
- had trouble running the code on Cori; a solution is available: using Shifter images to run Spark on Cori, it works now
- ready for more performance testing on Cori
- Cori: the maximum used so far was 512 nodes, each node has 24 cores
- another possibility is to use open data
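Below is a small sketch of the kind of rewrite discussed in Saba's code review: replacing a user-defined function, which is opaque to Spark's optimizer, with built-in column expressions. The column names and cut values are hypothetical.

```scala
// Hypothetical example: the same selection written with a UDF and with built-in
// column expressions. Spark cannot look inside the UDF, but it can optimize the
// built-in version (predicate pushdown, column pruning, codegen).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UdfVsBuiltin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-vs-builtin").getOrCreate()
    import spark.implicits._

    val events = Seq((25.0, 1.1), (60.0, -0.4), (15.0, 2.3)).toDF("pt", "eta")

    // UDF version: treated as a black box by the optimizer.
    val passesCut = udf((pt: Double, eta: Double) => pt > 20.0 && math.abs(eta) < 2.4)
    val withUdf = events.filter(passesCut($"pt", $"eta"))

    // Built-in version: the same cut written as plain column expressions.
    val withBuiltins = events.filter($"pt" > 20.0 && abs($"eta") < 2.4)

    withUdf.show()
    withBuiltins.show()

    spark.stop()
  }
}
```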