ProtoDUNE and IT Storage Services

Location: 31/S-028 (CERN)

ProtoDUNE - IT storage: Data Challenge 2.0 post-mortem, 19 April 2018

Present in the room: Nektarios (EP-NP), Andrea (EP-NP), Geoff (FNAL), Jan (IT-ST), Cristi (IT-ST), Luca (IT-ST)

Remote: Tanya (FNAL), Kevin (FNAL), Steven (FNAL), Elisabetta (IN2P3), Denis (IN2P3), Ruth (FNAL), Maxim (BNL)

 

Steven Timm gave an overview of the Data Challenge, presenting its goals, the results and the issues found. The main goal was to demonstrate that data movement can run, without disrupting the DAQ, at a sustained rate of 20 Gbit/s from EHN-1 to EOS and on to Fermilab. Over the test, 300 TB of data were sent from np04-srv-001 and np04-srv-002 to EOS using third-party xrdcp, showing good throughput (33.5 Gbit/s with both machines together) and a failure rate below 5%. Issues were found on the cataloguing side: F-FTS depends on ls and rm commands on the EOS FUSE mount, and the ls command to EOS hung for many hours. Copying of metadata files and symlinks was also impacted by the FUSE issues, and 700 out of 5500 metadata files failed on the first try.
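As an illustration of how such a transfer can be driven, a minimal sketch of a third-party xrdcp copy from one of the EHN1 machines to EOS is given below. Only the host names np04-srv-001/002 come from the minutes; the file paths and the EOS endpoint are placeholders, not the actual Data Challenge configuration.

```python
"""Minimal sketch of a third-party xrdcp transfer from EHN1 to EOS.

Only np04-srv-001 is taken from the minutes; the source path, the EOS
endpoint and the destination path are placeholders (assumptions).
"""
import subprocess

SOURCE = "root://np04-srv-001//data/run004513/raw_0001.root"  # placeholder source
DESTINATION = (
    "root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/raw_0001.root"
)  # assumed EOS endpoint and path

def third_party_copy(src: str, dst: str) -> bool:
    """Run xrdcp in third-party-copy mode so the data flows source -> EOS directly."""
    result = subprocess.run(
        ["xrdcp", "--tpc", "only", "--force", src, dst],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"transfer failed: {result.stderr.strip()}")
        return False
    return True

if __name__ == "__main__":
    third_party_copy(SOURCE, DESTINATION)
```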

Finally, the outflow to FNAL only reached 2 GByte/s (16 Gbit/s); it still needs to be investigated where the bottleneck is (the belief is that it sits at the Fermilab end, but this cannot be proven yet). No obvious network contention was observed.

Cristi reported from the service side and went through the issues seen during DC2, which were aggravated by high load on the EOS instance (AMS and NA62 contributed to the load, and the instance was also nearing its space capacity).

Discussion:

- The problems were then discussed, broken down by storage use case:

1. DQM: issues experienced by Maxim using the FUSE mount. This was followed up with the EOS experts last week and the agreement is to move away from FUSE for production activities, towards native xrootd or the EOS API. This will improve the “ls” timing and prevent eventual freezes of the mounted filesystem.

2. Data recording: the path from EHN1 to EOS, with back-up to CASTOR, has been certified. The data aggregation to FNAL was impacted because F-FTS polls via “ls”, and the FUSE mount issues were slowing this down. It has been agreed that xrootd (xrdfs ls) or the EOS API (eos ls) should be used instead (a minimal sketch follows this list).

3. Data processing and data analysis: no apparent issues. Maxim asked whether staging the full file on the worker node would pose any problem with an expected input file of 8 GB. The actual limit on the job directory is 20 GB, so a local copy of 8 GB is fine (keeping in mind that the 20 GB limit covers input and output files together).
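As referenced in point 2 above, here is a minimal sketch of polling a directory over the xrootd protocol instead of running ls on the FUSE mount, using the XRootD Python bindings. The endpoint and directory are placeholders (assumptions), and this is not the actual F-FTS code.

```python
"""Sketch: list an EOS directory over xrootd instead of `ls` on the FUSE mount.

Requires the XRootD Python bindings; the endpoint and directory below are
placeholders (assumptions), not the real F-FTS configuration.
"""
from XRootD import client
from XRootD.client.flags import DirListFlags

EOS_ENDPOINT = "root://eospublic.cern.ch"                     # assumed EOS endpoint
WATCH_DIR = "/eos/experiment/neutplatform/protodune/rawdata"  # placeholder directory

def list_directory(endpoint: str, path: str):
    """Return (name, size) pairs for the directory, queried via the xrootd protocol."""
    fs = client.FileSystem(endpoint)
    status, listing = fs.dirlist(path, DirListFlags.STAT)
    if not status.ok:
        raise RuntimeError(f"dirlist failed: {status.message}")
    return [(entry.name, entry.statinfo.size) for entry in listing]

if __name__ == "__main__":
    for name, size in list_directory(EOS_ENDPOINT, WATCH_DIR):
        print(f"{size:>12} {name}")
```

The same listing can be obtained from the command line with xrdfs <endpoint> ls -l <path> or eos ls -l <path>, as mentioned in the minutes.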

- Monitoring:

The need to revise the quotas for official usage and for individual users was pointed out. It was also requested to have the possibility of alarms based on occupancy once the quotas are reached (both global and per user).
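As a sketch of the kind of occupancy alarm requested, the snippet below checks quota usage and warns above a threshold. The monitoring-format output of `eos quota ls -m`, the field names `usedbytes`/`maxbytes`, the `-p <path>` option, the quota path and the 90% threshold are all assumptions to be checked against the real instance.

```python
"""Sketch: raise an alarm when EOS quota occupancy crosses a threshold.

The output format of `eos quota ls -m` (key=value pairs), the field names
`usedbytes`/`maxbytes`, the `-p <path>` option and the quota path are all
assumptions to verify against the actual instance.
"""
import subprocess

QUOTA_PATH = "/eos/experiment/neutplatform/protodune/"  # placeholder quota node
ALARM_THRESHOLD = 0.90                                  # warn above 90% occupancy

def check_occupancy(path: str, threshold: float) -> None:
    out = subprocess.run(
        ["eos", "quota", "ls", "-m", "-p", path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        used = float(fields.get("usedbytes", 0))
        cap = float(fields.get("maxbytes", 0))
        if cap and used / cap >= threshold:
            who = fields.get("uid") or fields.get("gid") or "project"
            print(f"ALARM: {who} at {used / cap:.0%} of quota under {path}")

if __name__ == "__main__":
    check_occupancy(QUOTA_PATH, ALARM_THRESHOLD)
```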

- Storage:

Steven reported that, based on the expected rates, NP04 will have enough buffer space on EHN1 for one weekend. This is in line with what IT asks of the experiments in case of major problems.
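As a rough cross-check of what “one weekend” of buffer means, assuming the 20 Gbit/s Data Challenge target as the ingest rate (the minutes do not state the actual expected DAQ rate, and the 48-hour window is also an assumption):

```python
"""Back-of-the-envelope buffer estimate for one weekend of data taking.

The 20 Gbit/s rate is the Data Challenge target, used here as an assumed
stand-in for the expected DAQ rate; the 48-hour window is also an assumption.
"""
RATE_GBIT_S = 20          # assumed sustained ingest rate (Gbit/s)
WEEKEND_HOURS = 48        # assumed holding window (two days)

rate_gbyte_s = RATE_GBIT_S / 8                          # 2.5 GB/s
buffer_tb = rate_gbyte_s * WEEKEND_HOURS * 3600 / 1000  # ~432 TB
print(f"~{buffer_tb:.0f} TB of buffer for {WEEKEND_HOURS} h at {RATE_GBIT_S} Gbit/s")
```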

Cristi will be the storage contact and will attend the upcoming protoDUNE-IT coordination meetings.

Agenda:
    • 15:30 - 15:35  Introductions (5m)
    • 15:35 - 15:45  ProtoDUNE Data Challenge 2.0 overview (10m)
      Speaker: Steven Timm (Fermi National Accelerator Lab. (US))
    • 15:45 - 15:55  EOS report (10m)
      Speaker: Cristian Contescu (CERN)
    • 15:55 - 16:30  Discussion (35m)