Network and Transfer Metrics WG (18 Mar 2015) Minutes

Attended: Jorge, Bruno, Ian, Ulf, Hsin-Yen, Pedro, Henryk, Alessandro, Enrico, Shawn, Marian (Excused: Tony, Jason, Stefan, John)

Agenda/slides presented: https://indico.cern.ch/event/379017/

We're still missing input to the use case document; please provide it ASAP (https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit)

Next meetings: 8 Apr, 6 May, 3 June, 8 July, 2 Sept - all at 4pm CEST

1) perfSONAR status

Mesh configurations were changed according to what was agreed at the last meeting. In addition, there are two new meshes, Latin America and Dual-stack. 

Marian reported on the analysis done within LHCOPN/LHCONE based on the new freshness metrics, which can identify gaps in the measurements. A summary of the issues found was presented. Shawn has set up a testbed to validate 3.4.2rc, which was released last week and provides fixes for the issues identified. The plan is to monitor the testbed for the next couple of days; if no issues are reported, 3.4.2rc will get the green light. We should benefit from the auto-updates that were set up during the 3.4.1 campaign, so all sonars should be updated within 24 hours once 3.4.2 is released. Once it is deployed, we'll resume LHCOPN/LHCONE validation and continue with the full-mesh latency ramp-up.
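As an illustration of how a freshness check can expose gaps, here is a minimal sketch. It is not the metric actually used in the analysis: the function names, the tolerance factor and the idea of comparing consecutive timestamps against an expected test interval are all assumptions made for this example.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[int], expected_interval: int,
              tolerance: float = 2.0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of consecutive measurements that are
    further apart than tolerance * expected_interval (all in seconds).

    Hypothetical helper: the threshold rule is an assumption."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def is_fresh(timestamps: List[int], now: int, max_age: int) -> bool:
    """True if the most recent measurement is no older than max_age seconds."""
    return bool(timestamps) and now - max(timestamps) <= max_age


# Example: tests expected every 60 s, one long outage between 120 and 600
samples = [0, 60, 120, 600, 660]
gaps = find_gaps(samples, expected_interval=60)
fresh = is_fresh(samples, now=700, max_age=120)
```

With this kind of check, a mesh cell is flagged either because its latest result is too old or because its history contains holes longer than the allowed multiple of the test interval.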

2) Esmond status

Shawn will coordinate the activity in OSG; the plan is to make it production ready by Q3. The system is currently running and being tested on VMs. Performance has recently improved significantly and we’re now gathering 100% of the meshes (some with no data, since measurements are missing due to the issues reported in point 1). The plan is to introduce completeness and accuracy checks to validate that the content of esmond matches what is seen/measured in the local measurement archives.
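A possible shape for such checks is sketched below. This is only an illustration under stated assumptions: records are modelled as plain dictionaries keyed by (source, destination, timestamp), and the function names and tolerance are hypothetical. It does not use the esmond API and says nothing about how the data would actually be retrieved.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical record key: (source host, destination host, timestamp)
Key = Tuple[str, str, int]


def completeness(local: Dict[Key, float], central: Dict[Key, float]) -> float:
    """Fraction of local archive records that are also in the central store."""
    if not local:
        return 1.0  # nothing was expected, so nothing is missing
    present = sum(1 for key in local if key in central)
    return present / len(local)


def mismatches(local: Dict[Key, float], central: Dict[Key, float],
               rel_tol: float = 1e-6) -> List[Key]:
    """Keys present in both stores whose values differ beyond rel_tol."""
    shared = local.keys() & central.keys()
    return sorted(k for k in shared
                  if not math.isclose(local[k], central[k], rel_tol=rel_tol))


# Example with made-up values: one record missing, one value differing
local_ma = {("a", "b", 1): 10.0, ("a", "b", 2): 20.0,
            ("b", "a", 1): 5.0, ("b", "a", 2): 6.0}
central_store = {("a", "b", 1): 10.0, ("a", "b", 2): 21.0,
                 ("b", "a", 1): 5.0}

ratio = completeness(local_ma, central_store)
bad = mismatches(local_ma, central_store)
```

Completeness here answers "did every local record arrive?", while the mismatch list answers "did the records that arrived keep their values?"; keeping the two separate makes it easier to tell collection problems from storage or conversion problems.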

3) Network performance incidents follow up

At the last WLCG operations coordination it was pointed out that we're missing a procedure to track and follow up on network performance incidents. The recent AGLT2-SARA incident was discussed in this context. A draft proposal was presented and discussed; please comment. In summary:

- A new mailing list will be established for follow-up; the proposed name is wlcg-network-throughput. Initial participation will be the same as for the WG mailing list (transfer systems, experiments, perfsonar support, esnet, lhcopn/lhcone).

- Experiments can report potential network performance incidents/degradations to the mailing list. The WLCG perfSONAR support unit will investigate and confirm whether the issue is network related. Once confirmed, it will notify the relevant sites and try to assist in narrowing down the problem to particular link(s). Affected sites will be contacted and should open an incident with their network providers. Ongoing incidents will be tracked on the WG page (link to be added).

- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider while informing the wlcg-network-throughput mailing list. If the problem is confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, or if unclear, sites should escalate to the WLCG operations coordination.

The plan is to propose this procedure at the next WLCG operations coordination (2nd of April). Please send your comments, if any, before that date. 

Ian commented that we should try to make sure we don’t create too many tickets with the network providers. Shawn responded that we’re introducing the WLCG perfSONAR support unit in the middle to ensure that only relevant cases are reported to the network providers.

Bruno commented that sites observing network problems usually have a standard procedure. This has been added to the procedure now.

4) Integration projects

- FTS performance (no news)

- Experiment’s interface to perfSONAR

A revised proposal was presented and discussed. Henryk has already developed a prototype for esmond2mq and presented a status update on it. Initial tests show that esmond is currently too slow to respond to some of the raw data queries and additional tuning/optimisation will need to be performed. To be followed up with OSG and esmond developers.