HEPiX IPv6 Working Group - virtual F2F meeting

Europe/Zurich
Zoom

David Kelsey (Science and Technology Facilities Council STFC (GB))
Description

In place of the normal face-to-face meeting at CERN, this is a fully virtual Zoom meeting.

Please register to say you will attend the Zoom meeting - connection details will be available to those who register. 

Timings are approximate.

Registration: IPv6 virtual F2F participants
Zoom Meeting ID: 64006864374
Host: Edoardo Martelli
Alternative host: Bruno Heinrich Hoeft
IPv6 meeting minutes for January 18th, 2022 - Day 1

(notes by Francesco Prelz)

Meeting agenda at: https://indico.cern.ch/event/1115437/

 

In attendance: Martin Bly, Nick Buraglio, Catalin Condurache, Costin Grigoras, Bruno Hoeft, Hironori Ito, Dave Kelsey, Kars Ohrenberg, Edoardo Martelli, Francesco Prelz, Andrea Sciabà, Edward Simmonds, Tim Skirvin

Dave K. reviews the agenda.

Round-table updates:

Costin: no news from Alice. Storage is almost all IPv6-ready. Trying to keep the xrootd version up to date on the client side. Message exchange with the central system now mostly happens over IPv6.

Dave K.: What fraction of your clients are IPv6 now?

Costin G.: Today we have 76% of worker nodes on IPv6.

 

Dave K.: Do you have any time evolution of this number?

Costin G.: The figure was actually higher (!) some time ago, but we started a few new sites. MonALISA plots at http://alimonitor.cern.ch?3863 are shown.

 

Dave K.: Did you find unexpected problems along the way?

Costin G.: No debugging was needed. There was some issue with resolution, but it was not IPv6-related.

 

Bruno H. (KIT): Not very much to report. Trying to get rid of our NATs, as our worker nodes are behind NAT. Had to *purchase* some more IPv4 addresses; they are now advertised via LHCONE and LHCOPN. IPv6-only was not considered a mature enough way out.

Edoardo M. (CERN): Not much to report. CERN deployed an IPv6-compliant version of Kubernetes as the default K8s version on the CERN cloud.

Dave K.: Is IPv6 on by default when you start a VM?

Edoardo M.: Probably not - don't know whether the address is assigned via DHCPv6 or pulled from the database.

Andrea S.: Will come to the Tier-2 report later - nothing else.

Martin B. (RAL): Not a huge amount of IPv6-specific stuff, but there are general internal changes in the Tier-1 network. Moving slowly to the new leaf-spine setup. The 'old' and 'new' sections talk via a connection on the super-spine, and there is routing between the two sections. Outbound traffic is currently routed out of the old section, but will move to the new section shortly. The networking infrastructure team tried to implement 'macro-segmentation' (i.e. insert a firewall in the middle of the internal routing) before Christmas, but it didn't go well: a bug in the Fortinet routing firmware stopped the show. When the next step will take place is uncertain. Finer-grained 'micro-segmentation' will follow. There is a SOC (Security Operations Centre - real-time threat identification sharing) project going on to monitor the network:

           optical taps to CERN/JANET will be installed on the external network tomorrow.

           Software updates need to take place before dual-stack worker nodes can be started.

           Some unexpected routing failures were triggered by the choice of the wrong gateway. The "grass will be green again" when everything is properly installed.

 

Dave K.: Is it a complex list of things that need to move to get to dual-stack WNs?

Martin B.: Need to raise the version level on some components (HTCondor, maybe). Then dual-stacking the worker nodes becomes trivial. No changes to the network will happen before the LHC starts, due to resource constraints.

Duncan R. (Imperial): Business as usual.

Catalin C. (EGI): Not too much to say.

Dave K.: Is it true that all of EGI, down to the core services, is now fully dual-stack?

Catalin C.: EGI is mostly trying to "shadow" what WLCG is doing.

 

Francesco P.: Nothing new. Talked with my colleagues in Turin today. The hardware is configured. The action of updating the storage configuration is pending on a specific personnel issue.

Noticed that the crossover point (where the IPv6 fraction is larger than the IPv4 one) was reached in our usual statistics:

        https://orsone.mi.infn.it/~prelz/ipv6_vofeed/

        https://orsone.mi.infn.it/~prelz/ipv6_bdii/

 

Thank you Bruno for the juicy PCAP snippets. Now stripping them of the VLAN tags and looking at the contents. Will come back with feedback on the contents and on whether any improvement is needed in the strategy for collecting the samples.
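For reference, a minimal sketch (not part of the discussion) of the kind of VLAN-stripping step described above, using scapy; the input and output file names are illustrative.

```python
# Minimal sketch of stripping 802.1Q VLAN tags from a PCAP sample with scapy.
# File names are illustrative; real samples may also carry QinQ (nested) tags,
# in which case only the outer tag is removed here.
from scapy.all import rdpcap, wrpcap, Ether, Dot1Q

packets = rdpcap("lhcopn_sample.pcap")      # hypothetical input capture
stripped = []
for pkt in packets:
    if pkt.haslayer(Dot1Q):
        tag = pkt[Dot1Q]
        # Rebuild the Ethernet frame without the 802.1Q header, restoring the
        # original EtherType carried inside the tag.
        eth = Ether(src=pkt[Ether].src, dst=pkt[Ether].dst, type=tag.type)
        stripped.append(eth / tag.payload)
    else:
        stripped.append(pkt)
wrpcap("lhcopn_sample_untagged.pcap", stripped)
```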

 

Dave K.: What commercial content providers are known to be using IPv6 for, say, media streaming?

 

Nick B.: Netflix has been IPv6-enabled for a long time. US mobile carriers have been offering dual-stack for >8 years.

 

Dave K.: Apart from the known Google statistics, where is the best place in the world to see/measure this traffic? Don't actually know what it means to count 'total traffic'.

 

Nick B.: Need to cross-reference several data sources. Very large backbone networks do sample data, typically 1 in 2000. It's taken years for ESnet to build tools to obtain this kind of measurement. It used to be easier to isolate IP protocols by VLAN; with new versions of Netflow this is not necessary anymore.

 

Kars O. (DESY): Business as usual. Everything works (TM).

 

Nick:

 

Tim S. (FNAL): Deploying dual-stack dCache tomorrow morning! Dual-stack worker nodes were rolled out recently, too.

 

Nick B.: What are you using for containerisation? Docker? Singularity inside Docker. How are you handling the IPv6 address per container?

 

Tim S.: Not yet - currently assigning addresses per worker node.

 

Nick B.: A problem that comes up is that Docker (used to) handle IPv6 very poorly. Once you pick a solution let us know, as this question gets asked often enough.
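As an aside, a small sketch of how one might check whether containers on a node actually received IPv6 addresses; it assumes the `GlobalIPv6Address` field reported by `docker inspect`, and the container name is hypothetical.

```python
# Sketch: check whether a Docker container was given a global IPv6 address by
# reading the GlobalIPv6Address fields from `docker inspect` output.
import json
import subprocess

def container_ipv6_addresses(name):
    """Return the global IPv6 addresses (if any) of the container's networks."""
    raw = subprocess.check_output(["docker", "inspect", name])
    networks = json.loads(raw)[0]["NetworkSettings"]["Networks"]
    return [net["GlobalIPv6Address"] for net in networks.values()
            if net.get("GlobalIPv6Address")]

addresses = container_ipv6_addresses("wn-payload")   # hypothetical container name
print(addresses or "no IPv6 address assigned (IPv4-only container networking)")
```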

 

Edward S. (a.k.a. "Tim's boss"): We are deploying IPv6 at a pace that is too slow for some people and too fast for others. As we have to cater to OS support for all of scientific computing, we are progressing as fast as we can.

 

Hironori Ito (BNL): We have been running dual-stack hosts for a long time, so just snooping.

 

Dave K.: Does that mean that all the data transfers to BNL are occurring over IPv6?

 

Hironori: Not really....

 

Raja had to leave but sent an update via e-mail (please paste here). He also reports on the Fermilab dCache dual-stack deployment.

 

( - coffee break - )

 

Andrea S. shows and comments on the usual Tier-2 migration twiki page: https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6

 

Andrea S.: The situation is pretty much unchanged. US-ATLAS, which used to move at a slower pace, is now catching up, but there is no good news from the rest of the world. Some tickets have received no comments or updates in a year. I am apparently unable to 'extort' any more progress/information.

 

Dave K.: Even if you invite people to update the tickets, there's no reply?

 

Andrea S.: Exactly - no follow-up.

 

Dave K.: We'll chase the sites that don't at least communicate the reasons for the delay. We always said that IPv6 availability would eventually become a requirement to flag a site as 'available'. We should perhaps follow up on this. Let's renew our 'campaign' for 2022.

 

Andrea S.: MIT, and in part Vanderbilt, are the most reluctant US sites.

 

Dave K.: Thanks for the continuing heroic effort.

 

Andrea S.: Wouldn't call it 'heroic' anymore...

 

Dave K.: We don't report on the Tier-1s anymore. "The Tier-1 campaign is done".

 

The IPv6 traffic measurement on LHCOPN (https://twiki.cern.ch/twiki/bin/view/LHCOPN/LHCOPNEv4v6Traffic) is shown and commented upon, as well as the latest 'top talkers' list: https://twiki.cern.ch/twiki/bin/view/LHCOPN/TopTalkers202112

 

 

Dave K.: Still need to investigate the reasons for the IPv4 preference, and fix the configuration.

 

Nick B.: Most likely the choice is application-based, as we've found time and again.

 

Francesco P.: Applications that are not IPv6-aware "should not" exist in WLCG - these are likely all configuration issues.

 

Martin B.: There are cases that can confuse applications into failing over to IPv4.

 

Francesco P.: There are also implementations of 'happy eyeballs' that are sneaking into various applications - notably curl/libcurl.

 

Nick B.: Libraries used by applications that have no notion of IPv6 are also, somewhat unexpectedly, quite common.

 

Dave K.: Do you have a list of libraries that you have found to fail with IPv6?

 

Nick B.: I actually started a GitHub repository on 'things that don't work in IPv6-only': https://github.com/buraglio/broken-v6only.git, but I haven't been getting any updates for a while.

 

Costin G.: Python has options to delay the IPv4 part.
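As an illustration of the kind of option being referred to (a sketch, not necessarily what ALICE uses): asyncio in Python 3.8+ exposes Happy Eyeballs parameters that stagger the per-address connection attempts, so the IPv4 attempt only starts after a short delay. The host and port below are placeholders.

```python
# Sketch: Python's asyncio Happy Eyeballs knobs (Python >= 3.8). getaddrinfo()
# normally sorts IPv6 addresses first; happy_eyeballs_delay starts the next
# (typically IPv4) attempt after a short delay instead of waiting for a full
# connect timeout on a broken IPv6 path.
import asyncio

async def probe(host="se.example.org", port=1094):   # placeholder dual-stack endpoint
    reader, writer = await asyncio.open_connection(
        host, port,
        happy_eyeballs_delay=0.25,   # seconds before the next attempt (RFC 8305 default)
        interleave=1,                # alternate address families in the attempt order
    )
    print("connected to", writer.get_extra_info("peername"))
    writer.close()
    await writer.wait_closed()

asyncio.run(probe())
```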

 

Dave K.: Send it up to a satellite link?

 

Nick B.: I have multiple network providers "for fun" at home. One is the Starlink satellite link, which is surprisingly usable for mostly anything. I can get around 70 Mbit/s, with 30 ms RTT. So I push all IPv4 traffic to Starlink and the rest to another big provider...

 

Last agenda item: IPv6-only testing.

 

Dave K.: There is a testbed at CERN that experiments are using, but there is no significant news there since October. Does anyone have more plans to build IPv6-only expansions?

 

Tim S.: Moved from two to three nodes.

 

Bruno H.: A proposal to set aside a few worker nodes for IPv6-only was turned down.

 

Dave K.: In WLCG the situation is made simpler by the fact that the set of services we have to support is more limited than in the general problem.

 

Nick B.: It's definitely easier to move a self-contained environment than an entire data center.

 

Nick B.: We may have a mandate to migrate to IPv6-only in a finite amount of time, but other people don't. Moving to dual-stack may be good enough for them.

 

    Most OSes, according to the RFC, will try IPv6 first if both AAAA and A DNS records are found, and may then wait for a timeout on the first attempt.
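For concreteness, a small sketch of the behaviour described above (the hostname is a placeholder): getaddrinfo() applies the RFC 6724 ordering, so a dual-stack destination's IPv6 addresses come first, and a naive client only falls back to IPv4 after the IPv6 attempt fails or times out.

```python
# Sketch: inspect the address ordering a dual-stack client would use. On a
# dual-stack host, AAAA-derived (IPv6) entries normally come before A-derived
# (IPv4) ones, and socket.create_connection((host, port)) tries them in order.
import socket

host = "dual-stack-service.example.org"   # placeholder destination
for family, _, _, _, sockaddr in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP):
    print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
```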

 

Francesco P.: "Happy eyeballs" was designed to work around this specific problem.

 

Dave K.: Can we identify a small, not-so-critical system that gives some valuable benefit if people access it?

 

Francesco P.: Preventing delays, failed IPv6 connections and other issues that make people go "happy eyeballs", i.e. being proactive in not offering a worse service on IPv6, may be preferable.

 


IPv6 two-day meeting - Day 2
============================

(notes by Duncan Rand)

 

Present: Kelsey, Rand, Chown, Grigoras, Bly, Buraglio, Hoeft, Martelli, Ohrenberg, Nandakumar, Prelz, Babik

 

Agenda: https://indico.cern.ch/event/1115437/

 

Deployment of dual-stack on services other than storage.

 

There is a growing number of data transfers to and from the WNs, i.e. not using FTS, so the WNs need to have IPv6 too. We have encouraged dual-stack WNs in the past. How can we monitor the status of deployment of dual-stack WNs? Raja has based his DUNE ETF tests on the CMS ETF.

 

Tackling deployment: do we think it is a good idea to encourage dual-stacking? Several sites have already rolled out dual-stack without apparent issues. We need to prioritise, e.g. WNs at the top of the list. It is good to provide a framework - it gets better results. Don't just say 'do everything', as some sites might get analysis paralysis and stall; it is quite psychological. Start with low-hanging fruit - easy wins build confidence. If so, we need to come up with a recipe and test it out. More work up front, but it leads to better results in the end. We are not asking for IPv6 to be used for everything, e.g. PXE booting; it is more like adding IPv6 to an existing IPv4 WN.

Is it only outgoing transfers that need IPv6? Alice: yes. The Alice VO box has some incoming ports requested to be open. For LHCb, VO boxes are all already dual-stack (at most T1s). What about CMS and ATLAS? Proposal to just ask for IPv6 to be added to the WNs. Do we need CEs also? If the WNs are dual-stack then the CE is also likely to be dual-stack.

Discussion on the extent of our request. We need to be wary of irritating sysadmins: best to specify the result and let them choose the path to get there. We should prepare a draft and check with the LHC experiments, then try it with a few test sites. Should we then have a roll-out campaign? How do we do this? Do we have yet another ticket campaign? Should we do a site survey? Some sites still haven't implemented dual-stack storage. Discussion on the timing of this with respect to the LHC schedule. Try to have something done by the next monthly meeting.
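As a rough illustration of what adding IPv6 to an existing IPv4 service or WN means at the socket level (a sketch under assumed settings, not a recipe from the discussion): a single IPv6 listening socket with IPV6_V6ONLY disabled serves both families, which also makes a quick reachability check on a freshly dual-stacked node. The port number is arbitrary.

```python
# Sketch: one IPv6 listening socket that accepts both native IPv6 connections and
# IPv4 connections (the latter appear as IPv4-mapped addresses, e.g. ::ffff:192.0.2.10).
import socket

srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)   # also accept IPv4
srv.bind(("::", 8444))                                       # arbitrary test port
srv.listen()
print("listening on [::]:8444 for IPv4 and IPv6")
conn, addr = srv.accept()
print("connection from", addr[0])
conn.close()
srv.close()
```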

 

Conference submissions

 

ISGC: Already agreed not to submit an abstract to the ISGC conference. Fully virtual meeting?

HEPiX: Spring meeting 25-29th April. Fully virtual meeting. Duty-bound to submit an abstract on our progress.

CHEP: The last abstract we submitted was for Adelaide 2019. Norfolk, VA, USA is the next planned meeting; it looks delayed to May 2023.

TNC is a possibility, but the deadline to submit has passed.

 

Marian - update from Networking Research Group

 

Marian showed his slides on packet marking etc. The idea is to do both UDP fireflies and IPv6 header marking; a given transfer might do both. Currently only xrootd supports fireflies. We also want to talk to EOS, StoRM, Echo etc. There are pros and cons to the two approaches; we are in an exploration phase. We also need to think about the collectors - a network of collectors will be needed. Potential deployment of CS9 (kernel 5.14) is interesting for the possibilities it offers.
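Purely for illustration of the 'UDP firefly' idea: a firefly is a small UDP datagram carrying metadata about a transfer flow, sent to a collector alongside the transfer itself. The field names, collector address and port in this sketch are assumptions, not the actual scitags specification.

```python
# Illustrative-only firefly: a small JSON datagram describing a flow, sent over UDP
# to a collector. Field names, addresses and the port are placeholders, not the
# real scitags/flowd schema.
import json
import socket
import time

COLLECTOR = ("collector.example.org", 10514)   # hypothetical collector endpoint

firefly = {
    "timestamp": time.time(),
    "experiment": "atlas",                     # illustrative owner/activity marking
    "activity": "analysis",
    "src": "[2001:db8::a]:45000",              # documentation-prefix addresses
    "dst": "[2001:db8::b]:1094",
    "state": "start",
}

sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sock.sendto(json.dumps(firefly).encode(), COLLECTOR)
sock.close()
```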

 

Nick - US news

 

ESnet v6: figuring out how to work on the US Gov mandate. Already in quite a good situation. Already building ESnet6, which was designed to be v6-only when the memo came out. Have had v6 for a long time. Finding that running a v6-only network is non-trivial: vendors differ in their v6 support. Should be fully v6-only on the management network this year. Can't remove v4 on the data plane; will transport v4 for the foreseeable future. Might push v4 into a VRF - undecided, no great advantage though. Piloting a NAT64 solution at the border of some small v6-only data centres, e.g. Berkeley. Try to test new ideas out on themselves first. All this is driven by the US Gov memo. Various policy-related documents have been completed. However, it is a difficult task with no budget. Start with low-hanging fruit, e.g. desktop systems. It's a big collaboration project. Q: how many people are working on this? It's part of our daily work. Reassuring people that we're not going to turn off v4. The goal is to be 80% v6-only by 2025. Lots of science instruments etc. are unlikely to be v6-only, e.g. FPGA arrays from the 1980s. What about scientific instruments currently being built, e.g. LSST and SKA? Has there been any talk of v6 training within the WLCG? We did a little in collaboration meetings. Nick used to teach IPv6 workshops. Tim runs courses for Janet members and has delivered some training for GridPP sysadmins.

 

AOB

None.

 

Dates of future meetings

Can we meet in person in June? Full day 9th and morning of 10th June proposed, perhaps in person. This is still "to be confirmed".

 Monthly meetings on 3rd March, 7th April, 12th May at 16:00 CERN time.
