HEPiX IPv6 Working Group F2F meeting - Day 1
CERN, April 10th 2014, 14:00
Notes by Francesco Prelz

Attendance:
In person: Jerome Bernier; Alastair Dewhurst; Marek Elias; Dave Kelsey; Fernando Lopez; Edoardo Martelli; Kars Ohrenberg; Francesco Prelz; Duncan Rand; Michail Salichos; Andrea Sciaba; Ulf Tigerstedt; Ramiro Voicu; Christopher Walker; Tony Wildish
Remotely: Raja Nandakumar

Dave reviews the agenda for today and tomorrow on Indico. No other technical issues to add. People are happy with the agenda.

Minutes review: minor comments.

Roundtable site updates:

CERN: IPv6 deployment is progressing. On April 1st (for real!) DHCPv6 was enabled for all devices at the CERN Computing Centre and at Wigner. One issue: some Windows servers whose off-site access was administratively blocked were nevertheless getting off-site access via IPv6 once they obtained a public IPv6 address; the blocking firewall rule was applied only to hosts tagged as 'IPv6-Ready' in the network database. A temporary workaround was applied to plug this hole. Next step: on May 6th an IPv6 address will be offered to all devices on WiFi or 'portable' sockets. The last step will be extending fixed device outlet access beyond buildings 31, 28 and 600.
DaveK: are training courses for sysadmins being offered?
EdoardoM: No, just a workshop for the IT departments. Support people attended courses.

CMS: An IPv6-accessible BeStMan node and OSG Computing Element nodes are available at the CMS T2 in Nebraska (UNL, care of Brian Bockelman). No production Storage Element yet (ideally they should be joining Tony's transfer mesh).
AlastairD: from the ATLAS perspective a few big test instances would be handier than many small-scale ones.
DaveK: the small-scale, low-level tests have to work before attempts to push the scale up are made.

KIT: FTS3 and dCache are still in the same state. One of the dCache people complains that there are no clients against which IPv6 access can be tested (see the client-side sketch further below). The test environment includes a UI, a worker node with PBS and a CREAM CE. The DNS is now working.

CC-IN2P3 (Lyon): An infrastructure with DNS and test machines is ready. The production DNS will go to IPv6 in a few days. No dual-stack plans.

USLHC: No news.

NDGF: The IPv6 dCache test stand is included in the ATLAS HammerCloud. The trouble is that the only storage available that is large enough for HammerCloud was IPv4-only. Troubleshooting and fixing this mixed configuration introduced some delay, but should be done now. Xrootd via IPv6 was also tested against the dual-stack dCache server, but the new IPv6-enabled xrootd client is not following the protocol (wrong null-termination of certain strings).

FZU: A new testing site (fzu-ipv6) had to be registered in the GOCDB for political reasons, but no ATLAS pilot jobs are coming in yet. Worker nodes have only public IPv6 addresses, plus private-network IPv4 addresses to talk to Torque (running it on IPv6 only failed). In principle the WLCG Management Board should be forgiving with IPv6-related downtimes: the political pressure shouldn't be this extreme. WN to head node connections are IPv6-only. The production DPM pool to head node connection is IPv6-only.

INFN: The IPv6-capable new CNAF router is still switched off; we keep urging/encouraging people on the matter. Our UberFTP IPv6 pull request was eventually merged, at the gentle request of OSG (Brian Bockelman). Still trying to make progress with the high-availability tools (out of the RHEL5 clustering suite) used in Milan.

DESY: Nothing important to report: just waiting for more people to use IPv6.

PIC: No significant news. Will share a few items in the technical issues roundtable.
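Aside on the KIT remark about missing IPv6 test clients: a minimal sketch of what a client-side IPv6 reachability check could look like in Python. The hostname and port below are placeholders, not agreed testbed endpoints; this is only an illustration of the kind of check meant, not a prescribed procedure.

import socket

def check_ipv6_service(host, port, timeout=10):
    """Check that 'host' publishes an AAAA record and accepts TCP connections over IPv6."""
    try:
        # Restrict resolution to IPv6, so only AAAA records are considered.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror as err:
        print("%s: no usable AAAA record (%s)" % (host, err))
        return False
    for family, socktype, proto, _, sockaddr in infos:
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            print("%s: connected to [%s]:%d over IPv6" % (host, sockaddr[0], sockaddr[1]))
            return True
        except socket.error as err:
            print("%s: connect to %s failed (%s)" % (host, sockaddr, err))
        finally:
            s.close()
    return False

if __name__ == "__main__":
    # Placeholder endpoint; substitute a real testbed SE and port (e.g. 2811 for gridFTP).
    check_ipv6_service("se.example.org", 2811)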
Queen Mary: The College moved to a new Cisco firewall that is showing higher latency and had some teething problems. A StoRM SE was set up on the IPv6 network and the ipv6.hepix.org VO was enabled. The availability of this SE is shown in the BDII. Should decide whether all testbed resources accessible to the ipv6.hepix.org VO should eventually end up being listed in some BDII.

Imperial: The tickets submitted to GOCDB about the IPv6 entries have been closed. Refreshed an old Globus ticket about IPv6 addresses being logged as dotted quads in the gridftp logs. Playing with the FTS3 RESTful interface into DPM and StoRM uncovered a bug in GFAL2 causing wrong/missing file checksums when files are accessed via IPv6.

DaveK: written reports on the technical details of the tests made should be shared with the group.

- coffee, just 1/2 hour late -

Plans for the June 10th pre-GDB IPv6 workshop: Dave covers the (many!) items in the attached list. A generic ATLAS/CMS IPv6 strategy discussion follows. Will come back to this tomorrow for more detailed planning.

Testing updates: The Chicago, GridKa and NDGF sites seem to be getting timeouts and no longer work with point-to-point connections from CERN. Data transfer speeds are under 1 MB/s on the FNAL <-> CERN link (both directions), as well as DESY -> CERN and Caltech -> FNAL. As usual, site people should troubleshoot these issues - why aren't they doing that?

DaveK: Status of SRM and dCache? Are they still not working?
FernandoL: With dCache 2.3, gridFTP is working, SRM-LS and SRM-RM are working, while SRM-CP is not. This was reported to the dCache developers. A developer at DESY will try to reproduce it.
UlfT: Did you try ARC-CP? Try it.
AndreaS: Could the problem be in the client rather than the server?
FernandoL: Maybe.

Status of (HT)Condor: it supports operation of a single-stack pool, either IPv4 or IPv6, mainly because every network endpoint is identified by only one address. This means that on a dual-stack node (provided the host's 'hostname' resolves to IPv6; otherwise a bug prevents the IPv6 address from being advertised correctly) only the IPv6 address is made known to the central collector, and the Condor services will be contactable via IPv6 only. It also means that a genuine dual-stack pool with a dual-stack central manager will effectively be partitioned between the IPv4-reachable and the IPv6-reachable nodes (an illustrative socket-level sketch follows the batch-system notes below). Proper handling of multiple network endpoints for each service is dealt with by the patch described in this ticket: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3982 which is about 50% complete. Alan De Smet is working on it, but he is also the main contact for this year's Condor Week, so little progress is likely until May. FrancescoP volunteered for alpha- or beta-testing of this branch (master-ipv6-mixed-mode in Git) once it becomes usable.

As for the CREAM and Nordugrid GAHPs mentioned at the last F2F meeting, Jaime Frey opened and closed a ticket addressing all the issues we identified: the latest gridftp_client library is now used and the IPv6 options in both gsoap and org.glite.security.gss are enabled: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4243 These changes should appear in the next development release, 8.1.5.

Torque is understood not to work - many people tried and failed. Does anyone know the state of Slurm?
UlfT: There is nothing in the code supporting IPv6.
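Socket-level aside on the (HT)Condor single-stack limitation above (generic Python, not Condor code): a daemon that advertises only one of its addresses is reachable on that address family only, even on a dual-stack host. By contrast, a single listening socket can accept both families if IPV6_V6ONLY is cleared, assuming an OS that supports IPv4-mapped addresses (e.g. Linux). A minimal sketch:

import socket

def dual_stack_listener(port):
    """Listen for both IPv4 and IPv6 clients with a single wildcard IPv6 socket."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Clear IPV6_V6ONLY so the wildcard bind on '::' also accepts IPv4 clients,
    # which then show up as ::ffff:a.b.c.d mapped addresses.
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))
    s.listen(5)
    return s

if __name__ == "__main__":
    listener = dual_stack_listener(8649)  # arbitrary example port
    while True:
        conn, addr = listener.accept()
        # addr[0] is either a plain IPv6 address or an IPv4-mapped one (::ffff:...).
        print("connection from %s" % addr[0])
        conn.close()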
Checked the latest release notes: some services can be talked to via dCache's 'PASV used as a redirecting tool', while EPSV can't be used that way; we need to close the loop on this issue with the dCache developers. For gridFTP proper, this can probably be worked around by violating RFC 2428, which allows only the port number to be filled in the EPSV reply. Kars will get in touch with the development team; then, if feasible, we will try to move forward with FrancescoP opening a dCache ticket.
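For reference on the PASV/EPSV difference above: a PASV reply carries both a host address and a port, which is what lets dCache redirect the data connection to a different pool node, whereas an RFC 2428 EPSV reply carries only a port, so the client must reconnect to the host it is already talking to. A small Python sketch of the two reply formats (an illustration of the wire format only, not of dCache internals; the addresses and ports are made up):

import re

def parse_pasv(reply):
    """Parse a '227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)' reply: host AND port."""
    m = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not m:
        raise ValueError("not a PASV reply: %r" % reply)
    h1, h2, h3, h4, p1, p2 = (int(x) for x in m.groups())
    return "%d.%d.%d.%d" % (h1, h2, h3, h4), p1 * 256 + p2

def parse_epsv(reply):
    """Parse a '229 Entering Extended Passive Mode (|||port|)' reply: port only (RFC 2428)."""
    m = re.search(r"\(\|\|\|(\d+)\|\)", reply)
    if not m:
        raise ValueError("not an EPSV reply: %r" % reply)
    return int(m.group(1))

if __name__ == "__main__":
    # PASV can point the client at a different (IPv4) host, e.g. a dCache pool node:
    print(parse_pasv("227 Entering Passive Mode (192,0,2,10,78,52)"))    # ('192.0.2.10', 20020)
    # EPSV only tells the client which port to use on the host it already contacted:
    print(parse_epsv("229 Entering Extended Passive Mode (|||20020|)"))  # 20020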