WLCG throughput call

Europe/Zurich
31/S-027 (CERN)

Attended:

Adam, Alessandro, Enrico, Fernando, Stefan, Shawn, HsinYen, Bruno, Marian

 

Agenda:

0) Agenda Review/Update, News

- News: 

  - perfSONAR 3.5.1 was released; the status of the upgrades can be found at http://grid-monitoring.cern.ch/perfsonar_report.txt

 

1) perfSONAR status

 - Issues to report for 3.5.1 

   - Frederic reported PostgreSQL corruption seen during the upgrade to 3.5.1 at one of the FR sites; he will forward more details

   - Shawn mentioned that one major change, which might be unexpected in a minor release, is the new location of the configuration, log and service files

   - Frederic will update the Puppet modules to support 3.5.1 changes; Marian to check on the status of the official support 

 - WLCG deployment/operations status 

   - Detailed report has been sent before the meeting and was discussed. 

   - Marian expressed concerns about the status of the perfSONAR deployment in the UK: quite a few sonars are offline and the one at IC was unregistered.

   - Overall there has been significant progress in all regions and almost all issues were resolved

   - Only a few things remain to be resolved wrt. commissioning.

   - The plan is to give a final report on the commissioning status at the next WLCG ops and discuss how we’re going to follow up in the future.

 

2) WLCG network throughput SU tickets 

CBPF (https://ggus.eu/index.php?mode=ticket_info&ticket_id=120081)

was resolved by changing the routing preferences from ESNet to GEANT. ESNet is looking into the issue.

 

ASGC (https://ggus.eu/index.php?mode=ticket_info&ticket_id=119820)

Packet loss and high latency for certain packets (queuing issue?) were reported by perfSONAR on the ASGC to CERN path, but not confirmed by the counters. HsinYen commented offline that there is an increase in packet drop events on the WAN routers at TPE, CHI and AMS.

Unfortunately there are very few sonars in Asia and sites there have limited peering, which hampers any further investigation.

Throughput tests show peaks of 400 Mbit/s (usually 200 Mbit/s) with frequent retransmissions occurring in bursts.

 

The plan is to run more throughput tests from different locations (CERN, AGLT2, ICEPP) and to investigate with tcpdump what could be the source of the retransmissions. Since none of the reported tests at ASGC show throughput above 1 Gbit/s, one idea is to perform an on-site iperf test to confirm that 10 Gbit/s can be achieved locally (sysctl settings were checked offline and are correct).

 

3) Focus Topic: Proposal for WLCG-wide bandwidth testing (and corresponding meshes)

- Marian proposed identifying 2 meshes with 50 instances each that would run more frequent tests (every 20 hours); the main constraint right now is that perfSONAR only supports schedules up to 24 hours. The Google Drive sheet that was discussed is at https://docs.google.com/spreadsheets/d/1CjRtWkll8CyElNgltr2KXTVn5HM3DKb4j061OfctHVw/edit#gid=0&fvid=1292058812

- Shawn commented that the current limitation might go away with pScheduler, which is planned in 3.6

- Frederic suggested lowering the duration of the tests to 10 seconds; the issue is latency, since on longer paths there would not be enough time to ramp up

- Frederic suggested we could auto-generate meshes bi-weekly. This looks like a viable long-term option, but it requires a way to auto-generate meshes, which is currently missing. A new configuration system is planned for 3.6 and a prototype is already available, so we can ask whether it’s possible to implement an API that we could use to auto-generate meshes

- Shawn commented that we need to do something soon; Marian suggested disabling the current mesh and enabling the 2 new meshes. To be followed up.
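
A rough back-of-the-envelope sketch of the ramp-up concern raised for 10-second tests. It assumes idealised TCP slow start (the window doubles every RTT, no loss), an initial window of 10 segments and an MSS of 1460 bytes; the 10 Gbit/s target and the RTT values are illustrative assumptions, not figures from the call.

```python
import math


def slow_start_seconds(rtt_s, target_bps, mss_bytes=1460, init_segments=10):
    """Idealised time for TCP slow start to grow the congestion window
    to the bandwidth-delay product (one doubling per RTT, no loss)."""
    # Bandwidth-delay product expressed in MSS-sized segments.
    bdp_segments = target_bps * rtt_s / 8 / mss_bytes
    if bdp_segments <= init_segments:
        return 0.0  # the initial window already fills the path
    # Number of doublings needed to reach the BDP, one per round trip.
    rounds = math.ceil(math.log2(bdp_segments / init_segments))
    return rounds * rtt_s


if __name__ == "__main__":
    # Hypothetical RTTs: regional, transatlantic, Europe to Asia.
    for rtt_ms in (20, 150, 300):
        t = slow_start_seconds(rtt_ms / 1000, 10e9)
        print(f"RTT {rtt_ms} ms: ~{t:.2f} s of ramp-up at 10 Gbit/s")
```

Under these assumptions a 300 ms path needs about 4.5 s to ramp up, i.e. nearly half of a 10-second test, while a 20 ms path needs well under a second. Real paths with loss or smaller buffers would take longer.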

 

4) Round-table about throughput, network, monitoring, data transfer and new issues to track

- Alessandro reported that Frascati should be fine now. There was a long discussion on VM support in perfSONAR. The preferred way to run is on physical boxes, but if this is not an option it can also run well on a VM (full-node VMs or a combination of pinned CPUs and SR-IOV are needed in this case).

- Bruno will follow up on the commissioning issues in DE

- Frederic reported negative latencies in FR, which are likely due to the fact that only one NTP server is used. Having 5-6 NTP servers configured is recommended, but this is currently difficult due to site policies in FR. We also discussed iptables logs of dropped packets, which are likely caused by the tracepath tests lacking the corresponding iptables openings (tracepath works fine without them, so this is just an issue with logging, to be followed up). There are also many "connection reset by peer" and "test interrupted" messages reported, to be followed up.

- Stefan mentioned that CERN will look into running perfSONAR with Puppet and on OpenStack
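
A minimal sketch of how a clock offset between two hosts can produce the negative latencies reported above. One-way delay is computed from timestamps taken on two different clocks, so any offset between them is added to (or subtracted from) the true delay. The function name and all numbers are hypothetical and only illustrate the arithmetic; this is not how perfSONAR itself is implemented.

```python
def measured_owd_ms(true_delay_ms, sender_offset_ms, receiver_offset_ms):
    """One-way delay as derived from two timestamps taken on different hosts.

    Each offset is that host's clock error relative to true time
    (positive means the clock runs ahead).
    """
    send_ts = 0.0 + sender_offset_ms               # packet leaves at true t = 0
    recv_ts = true_delay_ms + receiver_offset_ms   # arrives after the true delay
    return recv_ts - send_ts


if __name__ == "__main__":
    true_delay = 5.0   # ms, assumed symmetric path
    a_offset = 8.0     # ms, host A clock runs ahead (poorly synchronised)
    b_offset = 0.0     # ms, host B clock is accurate

    a_to_b = measured_owd_ms(true_delay, a_offset, b_offset)
    b_to_a = measured_owd_ms(true_delay, b_offset, a_offset)
    print(f"A to B: {a_to_b} ms, B to A: {b_to_a} ms, sum: {a_to_b + b_to_a} ms")
```

With these numbers the A to B direction reads as -3 ms and the B to A direction as 13 ms, while their sum still equals the true round-trip time because the offsets cancel. This is why a single drifting NTP source shows up in one-way latency but not in round-trip measurements.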

 

5) AOB and next meeting

- WG review will be at the next WLCG operations co-ordination (https://indico.cern.ch/event/514078/)

- Update on the WG will be presented at HEPiX next week
