170705 - Big Data Meeting
- Marc Paterno: HDF5 and conversion: will be scheduled for a later date
- updates and issues from FNAL
- 2 students working from FNAL
- Sun Yong
- he will keep working until the end of September, and then go to Padova for his PhD -> partnership with Padova
- wants to set up his analysis workflow in Spark
- started, together with another student, to reproduce all the work done for CHEP last year
- the physics data format is different now
- need to convert to the new physics data format
- the old workflow was also using AVRO
- task: convert the workflow completely to spark-root (see the sketch after this section)
- some issues:
- Scala programming questions; someone more experienced in Scala should have a look at the code
- problem with moving files from MIT to CERN-HDFS
- needs intermediate step
- about 50 TB total
- only have 5 TB temporary space on EOS
- staged copies
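Below is a minimal sketch of what the spark-root conversion step mentioned above could look like; the package name, input path, and output path are assumptions for illustration, not details from the meeting.

```scala
// Hypothetical sketch: read a ROOT file with spark-root and persist it as Parquet,
// replacing the previous AVRO intermediate format. Paths are made up.
import org.apache.spark.sql.SparkSession

object RootToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("root-to-parquet")
      .getOrCreate()

    // spark-root (org.dianahep:spark-root) exposes ROOT TTrees as Spark DataFrames.
    val events = spark.read
      .format("org.dianahep.sparkroot")
      .load("hdfs:///user/someuser/data/events.root")        // hypothetical input path

    // Write the same events out as Parquet for the downstream Spark analysis.
    events.write
      .mode("overwrite")
      .parquet("hdfs:///user/someuser/data/events_parquet")  // hypothetical output path

    spark.stop()
  }
}
```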
- news at CERN
- Vagg started doing some performance tests on the xrootd-connector (see the sketch below)
- the network can very easily be saturated -> the network is the bottleneck
- a 1 Gbit link is easily saturated
- on 10 Gbit machines, testing ROOT and Parquet reads
- lower transfer rates from CERNBox vs. public EOS
- will contact EOS and ask about differences
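A rough sketch of how the ROOT vs. Parquet read-rate comparison could be scripted is shown below; it assumes the Hadoop-XRootD connector makes root:// paths readable from Spark, and the dataset locations are hypothetical.

```scala
// Rough sketch: compare scan times for ROOT and Parquet copies of the same data
// read over the xrootd-connector. Dataset paths are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadThroughputTest {
  def timeCount(label: String, df: DataFrame): Unit = {
    val start = System.nanoTime()
    val rows = df.count()                       // forces a full scan of the input
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label%-8s rows=$rows%d time=$seconds%.1f s")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xrootd-read-test").getOrCreate()

    val rootDf = spark.read
      .format("org.dianahep.sparkroot")
      .load("root://eospublic.cern.ch//eos/opendata/some/dataset.root")       // hypothetical

    val parquetDf = spark.read
      .parquet("root://eospublic.cern.ch//eos/opendata/some/dataset_parquet") // hypothetical

    timeCount("root", rootDf)
    timeCount("parquet", parquetDf)

    spark.stop()
  }
}
```

Counting rows is only a proxy for raw throughput; dividing the dataset size by the elapsed time gives an effective read rate to compare against the 1 Gbit and 10 Gbit link capacities.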
- action items
- Vagg and Matteo: metrics for the 50 TB application within 2 weeks
- what open data to use, file lists, etc.
- Matteo will send Panda example file
- new version of the eos-connector is available on analytics
- it can read both Parquet and ROOT files
- performance improved through client-side buffering (see the sketch below)
- the Parquet library was reading byte by byte, so buffering on the connector side helps a lot
- the stream is wrapped with standard Hadoop buffering to avoid very small filesystem interactions
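The snippet below is an illustrative sketch (not the connector's actual code) of why client-side buffering helps: wrapping the raw stream in a BufferedInputStream turns a reader's byte-by-byte read() calls into a few large reads against the remote filesystem. The helper name and buffer size are made up.

```scala
// Illustrative sketch of client-side read buffering.
import java.io.{BufferedInputStream, InputStream}

object BufferedReads {
  // Hypothetical helper: wrap any raw stream with a 1 MB client-side buffer, so
  // single-byte reads are served from memory instead of hitting EOS each time.
  def withBuffer(raw: InputStream, bufferSize: Int = 1 << 20): InputStream =
    new BufferedInputStream(raw, bufferSize)

  // Naive consumer that reads one byte at a time, as a parser might do; without
  // the buffer, every read() would be a tiny remote filesystem interaction.
  def readByteByByte(in: InputStream): Long = {
    var count = 0L
    var b = in.read()
    while (b != -1) {
      count += 1
      b = in.read()
    }
    count
  }
}
```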
- Saba
- working with a summer student, looking at the Scala code
- reviewing the extensive use of user-defined functions; Spark cannot optimize them, so trying to find an alternative (see the sketch after this item)
- replace with simple select queries
- very close to wrapping this up
- had trouble running the code on Cori; a solution is available: using Shifter images to run Spark on Cori, it works now
- ready for more performance testing on Cori
- Cori: the maximum used so far was 512 nodes, each node has 24 cores
- another possibility is to use open data
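Below is a small sketch of the kind of rewrite discussed in Saba's code review: replacing a user-defined function, which is opaque to Spark's optimizer, with built-in column expressions. The column names and cut values are hypothetical.

```scala
// Hypothetical example: the same selection written with a UDF and with built-in
// column expressions. Spark cannot look inside the UDF, but it can optimize the
// built-in version (predicate pushdown, column pruning, codegen).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UdfVsBuiltin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-vs-builtin").getOrCreate()
    import spark.implicits._

    val events = Seq((25.0, 1.1), (60.0, -0.4), (15.0, 2.3)).toDF("pt", "eta")

    // UDF version: treated as a black box by the optimizer.
    val passesCut = udf((pt: Double, eta: Double) => pt > 20.0 && math.abs(eta) < 2.4)
    val withUdf = events.filter(passesCut($"pt", $"eta"))

    // Built-in version: the same cut written as plain column expressions.
    val withBuiltins = events.filter($"pt" > 20.0 && abs($"eta") < 2.4)

    withUdf.show()
    withBuiltins.show()

    spark.stop()
  }
}
```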