In place of the normal face to face meeting at CERN - a virtual Zoom meeting.
Please register to say you will attend the Zoom meeting - connection details will be sent to those who register.
Notes (Day 1) by Francesco Prelz.
Agenda at https://indico.cern.ch/event/995980/
Attendees on Zoom: Dave Kelsey, Martin Bly, Raja Nandakumar, Dimitrios Christidis, Duncan Rand, Edoardo Martelli, Jiri Chudoba, Shawn McKee, Catalin Condurache, Pepe Flix, Bruno Hoeft, Francesco Prelz, Andrea Sciaba (joins after the break)
Apologies: Tim Chown (Day 1), Marian Babik, Kars Ohrenberg, Andrea Sciaba (will be late).
Raja: Nothing new from LHCb. Running smoothly. The "Chinese" connectivity problem was pinpointed and resolved.
Nothing from DUNE as well: they are still trying to come up with a computing model - so far ignoring IPv6 - but they should take IPv6 into account from the start! Will try to write a note about this in the CDR.
Dimitrios (representing ATLAS): ATLAS will announce in a few weeks that they are moving away from SRM (which has been their intention for a while) to WebDAV. Not sure about the status of available monitoring options for this protocol.
Dave: Will WebDAV run under FTS or elsewhere ?
Dimitrios: Under FTS - where WebDAV will be the transfer protocol of choice.
Shawn: It may be wise to push a bit back on this development until enough monitoring is in place to make some sense of the traffic.
Bruno: The WebDAV log file should contain the source and destination.
Dave: We should check exactly which info the log files contain. We had issues with XROOTD where we couldn't tell the endpoint protocol.
There are enough people and bodies involved. We should be feeding this issue back to the appropriate management.
Shawn: There's an ATLAS software week this coming week - I can introduce the issue in one of the sessions.
Dimitrios: We have multiple storage implementations - and we would need to check all of them.
Dave: Doesn't FTS manage all the transfers ?
Shawn: Yes, but there are also questions and challenges on whether there's enough person-power to maintain FTS.
Dave: Thanks Dimitrios for raising this issue - we can now ensure it is fed back to the appropriate contexts.
Duncan: There can also be multiple streams, some on IPv4 and some on IPv6.
Pepe: Xrootd monitoring is also sub-optimal.
Duncan: There was work done on Xrootd - how can that be properly rolled out to production? We know that third-party copies are incorrectly recorded in the logs.
Pepe: There are data collected via Elasticsearch and sent to UChicago and CERN - but it doesn't work. The sites may not have proper recipes to apply these configurations ?
Dave: Configuration can be wrong - in terms of IPv6 over v4 preference.
Shawn: There is also a session at the ATLAS week next Tuesday on monitoring and analytics. We can ask Illya whether plots can be made telling IPv4 from IPv6. Illya makes plots from all sorts of perfsonar data - it may be interesting to see what we can tell from the data that are already collected.
Dave: Big bag of worms, but it would be interesting to understand what data can be found.
Dimitrios: It appeared that IPv6 usage is dropping - but no sites have taken any action. Andrea directed our attention to two of them.
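A minimal sketch of how transfers could be told apart by address family, assuming a log record exposes the source and destination addresses as text (which, as discussed above, still has to be verified for the WebDAV and XRootD logs). The function names and the record layout are hypothetical; only Python's standard ipaddress module is used.

```python
import ipaddress
from collections import Counter
from typing import Iterable, Tuple


def address_family(addr: str) -> str:
    """Return 'ipv4', 'ipv6' or 'unknown' for a textual address."""
    # Accept the bracketed "[2001:db8::1]" notation used in URLs.
    text = addr.strip().strip("[]")
    try:
        ip = ipaddress.ip_address(text)
    except ValueError:
        # Host names (or address:port strings) cannot be classified here.
        return "unknown"
    if ip.version == 6 and ip.ipv4_mapped is not None:
        # ::ffff:a.b.c.d is IPv4 traffic seen through a dual-stack socket.
        return "ipv4"
    return "ipv4" if ip.version == 4 else "ipv6"


def transfer_family(src: str, dst: str) -> str:
    """Classify one transfer from its two endpoint addresses."""
    families = {address_family(src), address_family(dst)}
    if "unknown" in families:
        return "unknown"
    if len(families) == 1:
        return families.pop()
    return "mixed"


def count_families(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count transfers per family for (source, destination) pairs."""
    return Counter(transfer_family(src, dst) for src, dst in records)
```

Counting IPv4-mapped addresses as IPv4 is a design choice: a dual-stack server may log such an address even though the packets travelled over IPv4.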
Other news from Shawn: I have Duncan's request on adjusting the thresholds on Maddash - and changing the colors. For throughput, traceroute, etc... It was more complicated than we thought - will implement that soon.
We've been working on writing the perfsonar data to reduce the latency between when the data is produced and when it appears in elasticsearch.
Trying to push data to a bus instead of polling them. Issues and crashes were found; we set a milestone to address them that should be completed in a few weeks.
Duncan: In the Maddash dashboard there is a '10G' version
Shawn: It's hard to combine the data. With mixed 10Gb/s (or even 40Gb/s) vs 1Gb/s testing the 1Gb/s endpoints will be overrun. We'll implement cuts on the 10Gb/s and 100Gb/s levels.
Bruno (KIT): No news on the IPv6 side. New link about to get operational between our site and CERN. Still rolling out IPv6 on the campus.
BELLE-2 is still on IPv4-only. Received an e-mail from the network manager - was investigating the issue with them but got no reply so far.
Catalin: Nothing to report from EGI. I still need to update the EGI wiki page as I mentioned last time. Several people changed. I am more an observer here now...
Dave: You can be an observer who can volunteer to do some testing!
Martin (RAL): As of yesterday, with assistance from Edoardo, we moved to the 100 Gb/s link and got rid of the old three 10 Gb/s links. There are still issues with IPv4. ICMP traffic and jumbo frames aren't making it through properly. After updating the border routers, the many virtual domains in the firewall had to be updated to handle jumbo frames. CMS is badly affected by this when accessing off-site data.
Dave: Are the issues on IPv4 only ?
Martin: "Bandwidth is crap" through the firewall - IPv6 is 3-4 times better than IPv4, but still bad. Need to figure out the reason.
Bruno: You may have a carrot for IPv6 here....
Edoardo: Which brand firewall...?
Bruno: Fortinet, an active/passive pair, in failover mode.
Martin: We'll need to make some internal network configuration changes. We will need to change switches to Mellanox boxes, which should increase bandwidth in both directions. Public internet should be removed from the LHCOPN link before the end of the financial year (March). We will put in place a leaf-spine infrastructure linked with routers dedicated to HPC users, a new tape archive replacing Castor, etc. Multiple exit points will exist. The department director wants this to happen! An external contractor will take care of the planning and managing of the leaf-spine transition. The contractor still hasn't accessed the site, but is in contact.
Francesco : Italy is still a dead country as far as IPv6 goes. CNAF has had to roll back twice a major upgrade on their Cisco border routers, and this is cause for more pressing concerns. Pinged the Turin T2 again for the dual-stack SE status: it is still pending, due to some sociological issues.
Pepe (represents PIC after Fernando left): We are trying to locate a person to take charge of the network at PIC, but it's hard during the pandemic. Two quick updates from PIC:
We are in the process of upgrading the network link to 100 Gb/s. An external firm will take care of that. There will be two independent 100 Gb/s connections to Barcelona. Should be happening in these coming weeks.
We tested the ENSTORE tape storage management software (developed at Fermilab). The developers agreed to make all needed changes to make it work on IPv6!
Dave: Persuading Fermilab to do this is a great service to the community...
Pepe: They were actually ahead of us on IPv6 development, but we helped with upgrading to Centos 7 first.
Dave: Any other site running ENSTORE, besides Fermilab and PIC ?
Pepe: The Russian Tier-1.
Bruno: Dubna or Moscow ?
Pepe: Probably both.
Edoardo (CERN) : As you may remember we reported to the IT Department that we may not have enough IPv4 addresses at hand for the new Data Center. An internal audit was conducted, showing that 70% of the IPv4 subnets are in use, but with very inefficient allocation within each subnet (order of 30% usage, and therefore 30% overall IPv4 address occupancy). The use of Virtual Machines will be decreasing in the near future, preferring containers over VMs - thus further reducing the use of IP addresses. So the address crunch seems not to be an issue anymore, once again. Other managers were also reluctant to go for IPv6-only public addresses, but will keep dual-stack for public-facing services.
Jiri (Prague): At the end of September Costin wrote that the mystery with ALICE was solved. Only an upgrade to Root6 and Xrootd 4.x is needed, but internally ALICE has to switch from using ROOT's TXNetFile to TNetXNGFile too, so that the new client API is used instead. The request is in the pipeline already, it shouldn't be long before it is ready for production.
Duncan: There was an issue with Queen Mary <-> TRIUMF data transfers. The transfers were failing, except for a brief period when they were succeeding 100%. We noticed a route change from Paris-Amsterdam-GEANT (working) to JANET-London-GEANT (not working). Terry did some debugging and eventually found a router between JANET and CANARIE not forwarding jumbo frames. No ICMP packet-too-large message was being propagated. The MTU was enlarged to ensure a happy ending.
Shawn: Path MTU discovery is not implemented or enabled for IPv6 in Linux - leading to this sort of problem.
Martin: Moons ago we had similar issues on our internal network. OPNR was not handling jumbo frames correctly, but the switch at least logged the issue. Locating where the problem occurs is hard.
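The failure mode discussed above can be illustrated with a toy model (not a network tool). It assumes that a path is just a list of per-hop MTUs, and that a hop unable to forward a packet either returns an ICMPv6 Packet Too Big message or drops the packet silently. All names and MTU values below are hypothetical.

```python
from typing import List, Optional

IPV6_MIN_MTU = 1280  # every IPv6 link must support at least this (RFC 8200)


def first_blocking_hop(packet_size: int, hop_mtus: List[int]) -> Optional[int]:
    """Index of the first hop whose MTU is below packet_size, else None."""
    for index, mtu in enumerate(hop_mtus):
        if packet_size > mtu:
            return index
    return None


def discover_path_mtu(sender_mtu: int, hop_mtus: List[int],
                      ptb_delivered: bool) -> Optional[int]:
    """Simulate path MTU discovery along a list of hop MTUs.

    Returns the packet size that finally gets through, or None when the
    sender never learns that its packets are too big (the black-hole case).
    """
    if any(mtu < IPV6_MIN_MTU for mtu in hop_mtus):
        raise ValueError("IPv6 links must have an MTU of at least 1280")
    size = sender_mtu
    while True:
        hop = first_blocking_hop(size, hop_mtus)
        if hop is None:
            return size  # the packet fits through every hop
        if not ptb_delivered:
            return None  # silently dropped: large packets never arrive
        # Packet Too Big reports the MTU of the constricting link.
        size = hop_mtus[hop]
```

With a sender using jumbo frames, `discover_path_mtu(9000, [9000, 1500, 9000], ptb_delivered=False)` models the situation where large packets vanish while small ones still pass, which is why such faults are hard to locate.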
Dave: Good to see that things were followed up and fixed. News from other UK sites ?
Duncan: Not much.
Dave: Universities don't want to break the network for off-site students...
Fermilab have quite good and stable IPv6 traffic. Their throughput exceeded that of all EU sites. They access the FTS service.
KISTI, an ALICE-only site, does not transfer files with FTS.
Kurchatov Institute in Moscow is still IPv4-only. No reply was obtained when asking for a roadmap to IPv6. No apparent sign of engagement.
All other Tier-1 sites are dual-stack and IPv6-ready.
Dave: Do people at BNL use their own FTS?
Bruno: There are three servers, CERN, RAL and Fermilab.
The usual summary page is shown:
Sheffield completed the transition, but no other GGUS ticket could be closed.
The Portuguese site should in principle be done - finishing this up with Stephan Lammel.
Dave: The UK are leading with 5 missing sites...
Shawn: Midwest Tier-2 had to roll back due to issues with an old Brocade device. Patrick at Southwest Tier-2 should be making progress. NorthEast Tier-2 is depending on the network infrastructure at some building near Boston.
Bruno: Four German T2 sites (Wuppertal, Munich, Goettingen and Freiburg) are pending on dual-stack support on the department network, which is facing many issues (lack of IPv6-enabled network equipment, lack of manpower, ...).
Dave: It would be nice if site managers at least reacted and updated the tickets.
The percentage of IPv6-available storage hasn't changed:
Dave: If we cannot close a ticket we should close the site... The check that Operations do on GGUS is just on changes to the tickets ?
Andrea: They just check inactive tickets that are still open. It's good that someone does it...
Edoardo: Did anyone complain that this work is only functioning on IPv6?
Shawn: Someone mentioned that it would be "nice" to have it on IPv4 as well. There actually is an IPv4 RFC that allows packing a flow label into the options header, but 64 bits of room would have to be added to the options header for IPv4. And IPv4 options headers (probably due to the drive to deprecate them) are often dropped or corrupted in transit.
Another possible location for IPv6 would be the hop-by-hop options header: Fernando Gont has performed various tests on how the network behaves with various options enabled, with some being rather disruptive, probably due to old hardware.
Traffic flow monitoring applications can easily and normally get at the flow label bits. While a few of them may still have problems consuming these, other locations would definitely be more problematic.
Dave: Is the described eBPF option still writing the flow label ?
Shawn: Yes. Multiple flows going through the same socket get the same flow label - to get a new one the socket has to be closed and reopened. Some existing applications recycle sockets: using eBPF is an option to handle the labeling in this case.
Dave: Anyone heeding the request for help ?
Edoardo: A technical student could explore options for labeling inside the switches instead of on the endpoint network stack.
Shawn: It can be hard for switches to track a label at wire speed. On top of that, the switch should be able to tell that a host is running ATLAS or CMS and there are hosts serving both nodes.
Edoardo: If a site is e.g. ALICE only, couldn't a label be added by the border router ?
Shawn: We are always asking for an application type code as well - the table with type bits is now being filled for ATLAS and BELLE-2. At least for ATLAS knowing the application type is critical: we want to make sure it is marked - we didn't get to the detail of specifying whether this is a MUST or SHOULD.
Edoardo: There is an RFC saying the flow label should be a random value. Did you consider amending the RFC ?
Shawn: Brian Carpenter has been in our meetings to discuss exactly that. We need prototype work to understand what the challenges are, and if no issues are found we can work on a revised RFC. It takes a long time to get an RFC written and approved. It's in our plans for the future.
The only foreseen use for the flow label was to provide entropy. We can, if needed, keep/add a few more bits that can be set at random. We have no way to ask the kernel to add packet-by-packet randomness.
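As a rough illustration of the idea discussed above (experiment and application type codes plus a few random bits inside the 20-bit IPv6 flow label), here is a minimal sketch. The split into 8 experiment bits, 6 application bits and 6 entropy bits is an assumption made for this example only, not the layout agreed by the working group, and the function names are hypothetical.

```python
import secrets
from typing import Optional, Tuple

FLOW_LABEL_BITS = 20  # width of the IPv6 flow label field (RFC 8200)

# Hypothetical split of the 20 bits, chosen only for illustration.
EXPERIMENT_BITS = 8
APPLICATION_BITS = 6
ENTROPY_BITS = FLOW_LABEL_BITS - EXPERIMENT_BITS - APPLICATION_BITS  # 6


def _check(name: str, value: int, bits: int) -> None:
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} must fit in {bits} bits")


def pack_flow_label(experiment: int, application: int,
                    entropy: Optional[int] = None) -> int:
    """Build a 20-bit label: experiment | application | entropy."""
    if entropy is None:
        # The remaining bits stay random, preserving some entropy.
        entropy = secrets.randbits(ENTROPY_BITS)
    _check("experiment", experiment, EXPERIMENT_BITS)
    _check("application", application, APPLICATION_BITS)
    _check("entropy", entropy, ENTROPY_BITS)
    return ((experiment << (APPLICATION_BITS + ENTROPY_BITS))
            | (application << ENTROPY_BITS)
            | entropy)


def unpack_flow_label(label: int) -> Tuple[int, int, int]:
    """Split a 20-bit label back into (experiment, application, entropy)."""
    _check("label", label, FLOW_LABEL_BITS)
    entropy = label & ((1 << ENTROPY_BITS) - 1)
    application = (label >> ENTROPY_BITS) & ((1 << APPLICATION_BITS) - 1)
    experiment = label >> (APPLICATION_BITS + ENTROPY_BITS)
    return experiment, application, entropy
```

Since the label is fixed per socket, as noted earlier in the discussion, the entropy bits in this sketch would be drawn once per connection rather than per packet.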
Dave: People may be keen to close the item about IPv6-only testing quickly, even if we keep discussing this every time.... Independently of whether anyone is available to actually perform the testing, can we agree at least on what should be done in an ideal world?
The path would call for all central services to be dual-stack before attempting ipv6-only ? Is there anything to do to get there ?
Bruno: The site can set up only a small island where IPv6-only is running to check whether there are any problems. I cannot imagine switching IPv4 off for real.
Dave: If all services are dual stack, all worker nodes are dual stack and all transfers are happening on IPv6, I suppose at some point we *should* be able to turn off IPv4. Maintaining IPv4 forever is an endless waste.
Tunnelling mechanisms to involve sites that are IPv4-only should also be foreseen.
Martin: There is a lot of infrastructure beyond the services we need that uses IPv4.
Dave: It may be local stuff only. WLCG may be saying that all its services have to be IPv6, e.g. to provide Shawn with trackable traffic. It is useful to be able to test at least that scenario. What and where should we be testing? On worker nodes? Anything or anywhere else ?
Martin: We should be moving on the worker node front - there are new services that keep adding to the plot.
Dave: Should our priority be on checking IPv6-only containers then ?
Francesco: What is valuable in this egg hunt is finding places where IPv4 addresses (or, as a matter of fact, any protocol-specific address) are stored, either in databases or in packet payloads, as this is what makes any protocol translation technique for IPv6-only network islands fail. These can be intercepted at the WN side or at the service provisioning side. Acting on the service side would be much more efficient in reducing "noise" from well-known services, but harder to do on production services that cannot be easily tampered with.
Bruno: This looks like a replica of all the "tiers". Some services are special to Tier-1 or Tier-2. We may need a complete "PPS" pre-production service replica.
Francesco: Could we then be sampling traffic at the WN boundary to locate what protocols and services are used, beyond the well-known ones ?
Dave: Testing on dual-stack WNs may also help in reducing the noise from services we know work on IPv6.
Shawn: All the WNs at my site are dual-stack.
Martin: We have switches with port mirroring too.
Shawn: We have Elasticflow running over Elasticsearch collecting a lot of netflow data.
Dave: Is Francesco volunteering to do some work here ?
Francesco: I can actually put my hands on some traffic capture data from the Milan Tier-2 worker nodes, but these are ATLAS-only, while getting good coverage of the services needed by all experiments (and, as Bruno pointed out, at all tier levels) is essential to increase the chance of locating the critical services.
*End of Day 1*
Notes by Duncan Rand
present: Dave Kelsey, Duncan Rand, Francesco Prelz, Bruno Hoeft, Tim Chown, Andrea Sciaba, Edoardo Martelli, Martin Bly, Dimitrios Christidis, Catalin Condurache.
Duncan summarised Marian’s report
“Concerning ETF, most of the developments that we have discussed related to changes in job submission/worker node testing are now in production for all the experiments. We had a quick discussion with LHCb concerning IPv6-only instance, they have shown interest to get it back and promised to follow up on it this year. Both CMS and LHCb are planning to develop new storage tests (to replace the old ones and support more protocols in addition to srm), so it will be an opportunity to look into IPv6 in more detail (and check the status of underlying libraries such as gfal2, xroot, etc.).
Concerning perfSONAR, we’re currently working on publishing the results directly from perfSONAR toolkits and to some extent this is already working on some of the US perfSONARs, we plan to extend it to the rest of the world this year. Infrastructure monitoring has been updated and now uses psconfig/PWA as its main source (https://psetf.aglt2.org/etf/check_mk/). There is also a plan to integrate ipv6 toolkit written by Fernando Gont, which would help us run more extensive IPv6 tests (wrt. conformance to RFCs; https://www.si6networks.com/research/tools/ipv6toolkit/). WLCG network throughput had just one ticket on IPv6 from IHEP, which was resolved quickly, no other IPv6 issues were seen.”
We had a quick look at the ETF web pages. Martin was unable to access the IPv6-only etf-atlas-ipv6.cern.ch from his desktop at RAL.
Remind ourselves about cases where IPv4 is used, e.g. srm-atlas tape instance at RAL. Transition to CTA aiming to complete by end of 2021. Lots of data to be copied over.
Possible issue also at KIT with gridftp transfers. Bruno will try to come up with example transfers that can be investigated.
Francesco’s idea to look into packets, e.g. for hidden IPv4 literals. Idea is to find traffic that NAT64 cannot handle. Francesco will experiment at his Tier-2 to see what might be possible, for example dual-stack some WN or use Elastic-search for analysis. Can also survey experiment architecture and form an inventory. Perhaps collect list of IPv4 hosts.
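A minimal sketch of the kind of check this could start from: scanning a captured payload for textual IPv4 literals. It assumes the payload bytes have already been extracted from a capture by some other tool; the function name is hypothetical. It only finds dotted-quad text, so addresses carried in binary form would be missed, and version strings such as 1.2.3.4 would be reported as false positives.

```python
import ipaddress
import re
from typing import List

# Candidate dotted quads in ASCII text, not embedded in a longer
# dotted or numeric sequence; validated afterwards with ipaddress.
_DOTTED_QUAD = re.compile(rb"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?![\d.])")


def find_ipv4_literals(payload: bytes) -> List[str]:
    """Return the distinct IPv4 literals found in payload, in order."""
    found: List[str] = []
    for match in _DOTTED_QUAD.finditer(payload):
        text = match.group(1).decode("ascii")
        try:
            ipaddress.IPv4Address(text)
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a valid address
        if text not in found:
            found.append(text)
    return found
```

For example, `find_ipv4_literals(b"host=192.0.2.17;port=2811")` returns the single literal in that string, which is exactly the kind of embedded address that NAT64 cannot translate.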
Do we need to try to persuade sites to make their WN dual-stack? We have not made a formal request. It is usually true that once sites have made their storage dual-stack it is a smaller step to do so with WN. How to monitor this? ETF perhaps?
We agreed that we didn’t want to submit to the virtual CHEP in May (which now requires full submissions ahead of time). It would be nice to aim for the next non-virtual CHEP. Two more conferences - ISGC and HEPiX. ISGC 2020 was cancelled. ISGC 2021 is in the last week in March. HEPiX will also be in March and will have an early morning session in European time and also one later in the evening for North Americans. Dave will draft a HEPiX abstract and invited anyone who wishes to present it to do so.
Ongoing support and encouragement of Tier-2 storage migration. Francesco and Martin to investigate packet capture and analysis. Primarily trying to root out IPv4 literals. General encouragement to move to dual-stack worker nodes. PIC worker nodes are mostly dual-stack, some old ones remain IPv4. Monitoring data transfers, ATLAS (WLCG?) moving to adopt webdav instead of gridftp for FTS, but webdav doesn’t identify IPv4/IPv6 usage (DOMA-tpc). Andrea will follow up with the FTS developers.
Perhaps aim for IPv6-only by the start of Run4. We have signalled our move to IPv6-only but not given a timetable. Will try to come up with a proposal for the WLCG MB towards the end of 2021. Discuss more at F2F in June. Edoardo reported starting to use remote clouds from Microsoft and Google - but they are using IPv4.
Edoardo will book room (31-28) for F2F meeting in June. But we also plan for another virtual meeting split over two days if travel still not possible then - propose 29 & 30 June 2021. Then, also one hour meetings at 16:00 (CET) 25 Feb, 16:00 (CEST) 29 April and 16:00 (CEST) 20 May. We will miss March because of the HEPiX and ISGC conferences.