HEPiX IPv6 Working Group - Face to face meeting - CERN
Day 1 - 17 Sep 2019

(notes by Francesco Prelz)

Attendees (around the table): 
  M. Bly, A. Sciaba, Dimitrios Christidis (U Texas) - the ATLAS representative!,
  K. Ohrenberg, D. Rand, D. Kelsey, E. Martelli, F. Prelz

On Vidyo: C. Condurache (now as an 'observer' from EGI).

Apologies were received from Jiri Chudoba and Bruno Hoeft.

----------------------------------------------

1. Agenda for the afternoon is briefly reviewed and agreed upon.

Edoardo M. is asked, as a courtesy, to present his LHCONE slides
from last week again, and he graciously agrees.

Dimitrios Christidis introduces himself. He has been working in ATLAS on
distributed data management for the past year and a half. Excellent, welcome!

Dave K.: I keep hearing about the big LHC experiments wanting to change their
data (transfer) models: is any new tool using (or forgetting) IPv6? It
would be good to hear from the experiment reps.

Urgent topics: should we, as a HEPiX working group, be presenting something
at the HEPiX meeting in Amsterdam (Oct 14-18)? Of course! Only Martin B. is
currently registered to attend. He kindly agrees to present a few status
slides, as long as these are not provided "at the last minute" (a few hours
will do...).

2. Roundtable updates.

Catalin C. (EGI): was assigned various tasks within the EGI foundation. His
  colleagues think it's very good to keep an eye on the activities of the IPv6
  group, as it has been driving changes at least within WLCG. There's no assigned
  task to do this, but Catalin would be happy to continue in this group.
Dave K.: You are very welcome. How many non-WLCG sites in EGI have gone dual-stack?
Catalin C.: Will find out - perhaps some cloud-only sites.
  Please change the e-mail address in the mailing list (note taken). 

Edoardo M. (CERN): No special news. Everything just works (TM).
Duncan R.: Any news about the new data centre, as it was one of our use cases?
Edoardo M.: The new site in Prevessin (close to the CCC)
  has been approved by both the IT and general CERN management. It *should* be
  running with only IPv6 *public* addresses (and IPv4 internal connectivity
  for e.g. the technical network and DAQ systems).
Dave K.: Any timescale?
Edoardo M.: There's no time to build this during the long shutdown, so it
  will be built during Run 3, and probably be completed at the end of Run 3.
  There are talks of delaying the restart of LHC to end of 2021
  (will be decided in November). There are no plans on procurement of
  network gear yet.
  The plan is apparently to keep the storage ('Data' centre) in Meyrin and
  the CPUs ('Computing' centre) in Prevessin.
  The new site may also host large Trigger and DAQ farms (one floorful per
  large experiment?)
Dave K.: Is the IPv6 traffic at CERN continuing to increase?
Edoardo M.: Shows the traffic stats at the LHCONE/LHCOPN border router:
  https://twiki.cern.ch/twiki/bin/view/LHCOPN/LHCOPNEv4v6Traffic
  apparently yes, it keeps increasing.
  General Internet IPv6 traffic is also now peaking at 30 Gb/s and
  the IPv6/IPv4 ratio is going up. See:
  https://netstat.cern.ch/monitoring/network-statistics/ext/?p=EXT

Duncan R. (Imperial): GridPP is going to refresh their perfSONAR hardware.
  Perhaps (hopefully) people will enable IPv6 while they install it.
  10 out of 18 sites, representing 80% of the data volume/traffic,
  now have IPv6 enabled on their storage.
  Brunel has an IPv6-only queue for ATLAS. Not sure whether this is
  actually required by ATLAS.

Andrea S. (CMS): nothing special aside from the Tier2 report (see later).

Kars O. (DESY): Nothing special.
Dave K.: Is the traffic going up (we could add a plot for HEPiX/CHEP)?
Kars O.: IPv6 traffic has been slowly ramping up. Will find a plot.
Dave K.: Any plan for IPv6-only?
Kars O.: Not really.
  Three beamlines of the European XFEL facility are now operational, producing
  1-2 PB of data per week.
Duncan R.: Imperial plans to go IPv6-only at some stage - for the campus!

Dimitrios C. (ATLAS): will find out more details about the data model changes later.
  No news for the time being.

M. Bly (RAL): Trying to sort out an IPv6 packet loss issue. They tried connecting
  a bypass route around the RAL firewall to the new 100 Gb/s border router
  pair, as there was some suspicion that the previous gang of four routers was
  causing the problem. After the change the packet loss is still there, though.
  There is sporadic packet loss everywhere (including on the OPN circuits), plus
  a terrible, constant packet loss rate (1 or 2 packets in 600), while IPv4 is
  absolutely clean in that respect and uses the same physical path.
  CERN to Tier-1 traffic is clean, while Tier-2 traffic shows packet loss variations
  in the outbound direction only.
Dave K.: Are the experiments noticing the effect, or are they happy with the
  current status?
Martin B.: We don't hear from them, but we do need to follow up any router
  configuration issues with the router admins and/or router vendors.
  Another open issue (that we don't have an answer for yet) is that
  BNL and FZK IPv6 (FTS) traffic 'fell off a cliff' on Friday for no
  good reason, and this was only noticed at close of business on Friday. Maybe
  some other issue, having nothing to do with IPv6, is at play there.
  Some anomalies in what should be the good routing path were found: we will
  try to squash them.

Francesco P. (INFN):
  CMS in Rome is still waiting to hire someone knowledgeable to handle
  the dCache upgrade and storage transition. Should we volunteer to help?
  How much of the setup is CMS-specific?
Duncan R.: For dCache, very little.
Francesco P.: OK, will try contacting the site.
  In Turin the IPv6 storage transition was sidetracked by
  a few local accidents (sudden blackout damage, collapse of the roof...)
  but should be completed shortly. The network equipment configuration is OK.
  Everything is at a standstill as far as enabling IPv6 on more general-purpose
  networks goes: too much risk and too much work for no immediate advantage.
  The plan to enable IPv6 by default on the wireless network at the
  Milan Department of Physics was stopped short by a University of Milan
  decision to centralise all wireless networking management.
  On the other hand, no site, including CNAF, is reporting IPv6 problems,
  which suggests that GARR is treating IPv6 as a first-class citizen.

3. Andrea S. on Tier-2 deployment status:

The twiki page is up-to-date, but there have been no updates since August 28th:
https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#WLCG_Tier_2_IPv6_deployment_stat
Dave K.: What's behind the fact that ATLAS is below the other experiments?
Andrea S.: After Alastair left, nobody has been seriously chasing ATLAS sites...
  Overall, progress is still slow. We don't try to correct for the
  bias introduced by sites underestimating the time needed to complete the
  transition.
Dave K.: On the plus side, we are up to 70% completion. Analysing the reasons
  for delay: is the 72% 'network' reason just an attempt to shift the blame
  to central networking?
Andrea S.: Probably not.  

- coffee break -

4. Monitoring - including perfSONAR and ETF

Dave K.: Where are we with perfSONAR?

Duncan R.: A new version (4.2.0) has appeared. Some sites (that
  have auto-update enabled in perfSONAR) updated automatically, but didn't
  report correctly afterwards: Marian had to fix something.
  There is an instance of ETF that is used to monitor the perfSONAR hosts,
  on epgperf.ph.bham.ac.uk, or:
  https://psetf.opensciencegrid.org/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fhost%3Depgperf.ph.bham.ac.uk%26site%3Detf%26view_name%3Dhost

Dave K.: There was some discussion on allowing/configuring remote management of
  perfSONAR boxes.
Duncan R.: Some people prefer a 'head in the sand' approach and
  disable auto-updates anyway.
  The site above reports 217 up-to-date perfSONAR sites.
  MaDDash reports: https://psmad.opensciencegrid.org/maddash-webui/
Dave K.: The experiment-specific IPv6 ETF measurement sites (e.g.
  https://etf-cms-ipv6.cern.ch/etf/check_mk/index.py,
  https://etf-atlas-ipv6.cern.ch/etf/check_mk/index.py...) are all up and provide
  a lot of information for IPv6 as well (most of the status is surprisingly
  'green'). It remains to be understood who should be taking care of handling
  any issue that's spotted there. We, as a working group, cannot interfere with
  the existing administrative procedures of each experiment.

  LHCOPN statistics show that the time is not yet ripe for our idea of making
  LHCOPN the first IPv6-only link/service. Can we still sell the
  idea that going IPv6-only simplifies e.g. router management?

Edoardo M: IPv6 would drastically reduce the number of network prefixes
  that need to be announced by routing protocols. Simplification could
  be a good selling point.

Dave K.: A stronger argument would be that we *have* to use IPv6 as the new
  computing centre at CERN *is* IPv6-only...

Duncan R.: A number of big companies (Facebook and the like) realised that
  running dual-stack internally is expensive, and have IPv6-only 
  *internal* services.
 
The Google IPv6 statistics (http://google.com/ipv6) are checked and commented
upon: the adoption rate seems to be growing no faster than linearly. Belgium
is still the continental champion, at 51% adoption.

5. As agreed earlier, Edoardo M. now shows the 'multiple LHCONE' talk
given at the GDB at Fermilab last week:

https://indico.cern.ch/event/739882/contributions/3520004/attachments/1906199/3148167/EM-multiONE-GDB.pdf

Francesco P.: As long as there are no serious security concerns (host
  administrators can access multiple networks and generate network cross-talk),
  various network virtualisation technologies are available today to implement
  'multiONE' services on shared hosts. One may argue that software that has been
  updated to work with IPv6 (adding source and destination address loops,
  strategies and configuration) is ready to work under 'multiONE' as well.
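
As an aside, here is a minimal Python sketch (illustrative only; host and port
are placeholders) of the kind of address loop such IPv6-ready software
typically contains, trying every address the resolver returns regardless of
address family:

  import socket

  def connect_any(host, port, timeout=5.0):
      # Try every address the resolver returns (AAAA and A records alike)
      # and hand back the first socket that connects successfully.
      last_error = None
      for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
              host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
          s = socket.socket(family, socktype, proto)
          s.settimeout(timeout)
          try:
              s.connect(sockaddr)
              return s
          except OSError as err:
              last_error = err
              s.close()
      raise last_error or OSError("no usable address for %s" % host)

  # Example (placeholder endpoint):
  # sock = connect_any("se.example.org", 2811)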

 

HEPiX IPv6 Working Group - Face to face meeting - CERN
Day 2 - 18 Sep 2019

(notes by Duncan Rand)

Attendees: Kars Ohrenberg, Martin Bly, Dimitrios Christidis, Edoardo Martelli, Dave Kelsey, Duncan Rand, Francesco Prelz, Andrea Manzi, Andrea Sciaba

Remote: none.

Apologies - same list as yesterday.

https://indico.cern.ch/event/836709/

1. Data Transfer and Network Monitoring

FTS at FNAL is still IPv4-only because of ‘transfers to the LPC EOS instance (not yet IPv6)’.

Efficiency issue: FTS monitoring still uses IPv6 = true/false; the FTS servers need to be modified to publish the IP version as true, false or unknown. This also needs to be extended to HTTP and XRootD, which is being discussed in the DOMA TPC working group. The planned changes for HTTP need to be implemented in the storage elements; XRootD is still under discussion. Need to check with the DOMA TPC working group. One could also filter based on the FTS transfer phase.
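
For illustration only - the field name below is hypothetical, not the actual FTS message schema - a small Python sketch of how a monitoring consumer could map such a message onto the three-valued true/false/unknown reporting discussed above:

  def ip_version_flag(transfer_record):
      # 'ipv6' is a hypothetical field name: present and True for IPv6,
      # present and False for IPv4, absent when the server could not tell.
      value = transfer_record.get("ipv6")
      if value is None:
          return "unknown"
      return "true" if value else "false"

  # ip_version_flag({"ipv6": True})  -> "true"
  # ip_version_flag({"ipv6": False}) -> "false"
  # ip_version_flag({})              -> "unknown"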

Andrea M. reported that he will be leaving CERN and will therefore be unable to continue attending these meetings. Edward Karavakis will replace him.

There is also the FTS instance at MIT, which may well still be IPv4-only. What about Belle II - are we concerned about their status? Probably, since they use WLCG infrastructure. We should encourage their involvement, especially as IPv6 may well play a part in multiONE. The message from multiONE: ‘you have to dedicate a prefix to multiONE; if you can do that with IPv4, fine, but it is probably easier, given the shortage of IPv4 addresses, to do it with IPv6’.

Transfers between storage and local worker nodes, and vice versa: the default for GridFTP in GFAL is IPV6=false. There is a plan to release a new version of GFAL which enables IPv6. This had been planned earlier, but there was a problem accessing DPM from an IPv4-only client from within DIRAC. The patch is now directly in Globus - we are waiting for sites to upgrade to this version.
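
As a possible stop-gap while waiting for the new GFAL release, a hedged sketch of flipping the option per application through the gfal2 Python bindings; the option group and key ("GRIDFTP PLUGIN" / "IPV6") are assumed to match the current gfal2 GridFTP plugin configuration and should be checked against DMC-1151:

  import gfal2

  # Create a gfal2 context and enable IPv6 for the GridFTP plugin.
  # Group/key names are assumed to match /etc/gfal2.d/gsiftp_plugin.conf;
  # verify against the gfal2 documentation and DMC-1151.
  ctx = gfal2.creat_context()
  ctx.set_opt_boolean("GRIDFTP PLUGIN", "IPV6", True)

  # Subsequent gsiftp operations through this context would then prefer
  # IPv6, e.g. (placeholder URL):
  # print(ctx.stat("gsiftp://se.example.org/dpm/example.org/home/somefile"))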

Summary of issues from a follow-up email by Andrea M.:

* IPv6 monitoring for XRootD and HTTP TPC

- I will discuss what's needed during the next DOMA TPC meeting and report it to the list

* FTS changes on the messaging to better report the IP version used (most probably we will work on this task in October)

- https://its.cern.ch/jira/browse/FTS-1349

* Check whether, in the FTS InfluxDB data, we can query and filter out the transfer errors happening before the transfers actually start

- I tried to add "t_failure_phase != TRANSFER_PREPARATION" as a filter in Grafana, but I get a timeout when querying; I have to check with the monitoring team why (see the query sketch after this summary)

* gfal2 IPV6=true for the GridFTP plugin

- https://its.cern.ch/jira/browse/DMC-1151

* DIRAC issue with dual-stack DPM and IPv4-only clients

- https://its.cern.ch/jira/browse/LCGDM-2817
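
On the InfluxDB point above, a rough sketch of the kind of query involved (the connection details and the 'transfers' measurement name are placeholders; only the t_failure_phase condition comes from the discussion):

  from influxdb import InfluxDBClient

  # Placeholder connection parameters; the real FTS monitoring
  # database details are not specified in these notes.
  client = InfluxDBClient(host="influxdb.example.cern.ch", port=8086,
                          database="fts")

  # Keep only failures that happen after the transfer has actually
  # started, i.e. exclude the TRANSFER_PREPARATION phase.
  query = ("SELECT count(*) FROM transfers "
           "WHERE t_failure_phase != 'TRANSFER_PREPARATION'")
  result = client.query(query)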

 

2. Dates of next meetings

Arranged for (CEST/CET) 11:30 on 10 Oct, 16:00 on 26 Nov, and a face-to-face meeting at CERN on 16-17 Jan 2020.

 

3. IPv6 only discussion

The main use case is that we need to be ready for opportunistic offers of IPv6-only worker nodes. New developments: the new machine room at CERN is likely not to have enough public IPv4 addresses, so it may use public IPv6 addresses only. Also, multiONE will create multiple overlay networks; this will need a number of new addresses - sites could use IPv4 if they have enough, but it is easier with IPv6. Finally, running dual-stack is unnecessarily complex, so moving to IPv6-only could simplify things a lot.

Whether to turn off IPv4 is probably a site decision. What about EGI sites that support other, non-particle-physics communities? The message we are giving out is that if you are building up a new collaboration (e.g. SKA) you should do it with IPv6. Do we want to set a date for sites to turn off IPv4? The beginning of Run 3 is probably too soon; what about the start of Run 4? An IPv6-only data centre at CERN could work if other sites are dual-stack. What about NAT64 as well? We need to understand what is happening with the Brunel, QMUL, Slovenian and Nebraska IPv6-only exercises.

Perhaps we should declare that from a certain date IPv4 is no longer supported. More testing is needed to identify applications that cannot be dealt with by NAT64 and DNS64.
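
As a starting point for those tests, a small sketch that checks, from behind a DNS64 resolver, whether a (possibly IPv4-only) host gets AAAA records synthesized in the well-known 64:ff9b::/96 NAT64 prefix (RFC 6052); the host name is a placeholder, and sites using a local NAT64 prefix would have to adjust it:

  import ipaddress
  import socket

  # Well-known DNS64/NAT64 prefix (RFC 6052); local prefixes may differ.
  NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")

  def nat64_addresses(host):
      # Return the IPv6 addresses of 'host' that fall inside the NAT64
      # prefix, i.e. AAAA records synthesized by DNS64 from A records.
      synthesized = []
      for info in socket.getaddrinfo(host, None, socket.AF_INET6):
          addr = ipaddress.ip_address(info[4][0])
          if addr in NAT64_PREFIX:
              synthesized.append(addr)
      return synthesized

  # Example (placeholder host): an IPv4-only service seen through DNS64
  # print(nat64_addresses("ipv4only.example.org"))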

4. LHCOPN/LHCONE

Top talkers page: https://twiki.cern.ch/twiki/bin/view/LHCOPN/LHCOPNEv4v6Traffic can be used to identify large IPv4 traffic flows. Of the Tier-1s, only the Russian site remains IPv4-only - they said they would make the change over the summer.

How many sites peer over LHCONE using IPv6?

https://twiki.cern.ch/twiki/bin/view/LHCONE/LhcOneVRF