US ATLAS Computing Facility

US/Eastern
Description

CANCELED!!

Facilities Team Google Drive Folder

Conflicts with IRIS-HEP retreat. 

It would still be useful to add NOTEs to each section as updates.


    • 1
      WBS 2.3 Facility Management News
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 2
      OSG-LHC
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 3
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
      • 4
        Service Development & Deployment
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        ServiceX

        • Two new PRs that improve user experience and logging.
        • Both UC and FAB instances on 1.2.2
        • BNL OKD instance should be up and running.

        XCache

        • Running fine.
        • Will ask the developers on Thursday about improving performance for initial requests.

        Analytics

        • Working on integrating HEPSpec analytics into periodic running.


      • 5
        R&D Activities
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 6
      WBS 2.3.1 Tier1 Center
      Speaker: Eric Christian Lancon (Brookhaven National Laboratory (US))
    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running recently.
        • General: Bad IGTF certificate release last Monday.
        • AGLT2: A network issue caused dCache to hang, causing a service disruption.
        • MWT2: Bandwidth limited by a failure in a UC network switch.
      • Nice discussion with the Taiwan team on the plan for them to join the US Cloud as a Tier-2 after their Tier-1 MOU ends on Sep 30, 2023.
        • The discussion happened the day after this facility meeting would have occurred.
        • Pushed back the start date for the Taiwan site being in production as a Tier-2 to Nov 1, 2023. This allows time to sort out various issues without affecting operation under the current Tier-1 MOU.
        • We will have weekly meetings until things settle down.
        • Shawn wrote a draft description of the network configuration, which we are all reviewing.
      • The zombie pilot issue is addressed by a recent pilot release; we still need time to test.
        • Opened a ticket, requested by Alexei, to track progress on the issue.
      • Need to start thinking about FY24 procurement and operations.
        • Will we have a face-to-face meeting later in the year?
      • 7
        AGLT2
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        08/23

        The UM network expert helped the UM site do a system update on the Cisco border switches, to address the issue we noticed on 08/05 with the power outage (the management VLAN stopped working after a power cycle of the switches). The update went smoothly.


        08/24

        One of the management switches in the UM Tier3 server room (sw11-m-01) lost connectivity, and a power cycle brought it back online. No services were impacted by this switch’s downtime. 


        08/28

        The UM campus network went completely down due to security concerns. Our site's AFS servers are hosted on grid.umich.edu, which is part of the UM campus network, so the UM login nodes were not accessible. Four dCache pool nodes (msufs17/21/23, umfs40) also had their pool services offline in the same time window; the reason was unknown, and the fix was to restart the pools. The UM site took the opportunity to migrate the six AFS servers from the grid.umich.edu domain to the aglt2.org domain to avoid future disruptions caused by the campus network. This migration involved a lot of detailed work and took four days until we could fully recover everything in AFS, but we should have a more coherent (with respect to the AGLT2 network) and robust AFS system in the future.


        09/06 and 09/08

        The UM site received the two replacement PDUs from CyberPower, and we replaced the two bad PDUs in Rack 4, which required taking the dCache pool node umfs30 offline for half an hour.

        09/09

        OSG released a bad certificate RPM (osg-ca-certs-1.114-1), and our cert hosting node gate01 picked up the bad version from auto-update. OSG soon released a fixed version, osg-ca-certs-1.114-2, which gate01 also picked up, and we restarted all dCache head and door nodes as recommended by OSG. During the weekend, however, there was still a 60% transfer failure rate. It turned out that some bad PEM files from the bad version of osg-ca-certs, containing duplicated certificate entries, were still in place, and we had to reinstall osg-ca-certs-1.114-2 to get the right PEM files.
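
        For future incidents of this kind, a minimal sketch of a check that flags PEM files containing the same certificate more than once (the symptom of the bad osg-ca-certs build). This is our illustration, not an OSG tool; the CA directory path is the usual OSG location but may differ per site.

          #!/usr/bin/env python3
          """Scan a CA directory for PEM files that contain the same certificate
          more than once (the symptom we hit with osg-ca-certs-1.114-1).
          A sketch, not an OSG tool; adjust CA_DIR for your deployment."""
          import hashlib
          import re
          from pathlib import Path

          CA_DIR = Path("/etc/grid-security/certificates")  # typical OSG CA location
          CERT_RE = re.compile(
              r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
          )

          for pem in sorted(CA_DIR.glob("*.pem")):
              certs = CERT_RE.findall(pem.read_text(errors="replace"))
              # Hash each certificate block; duplicates collapse to one digest.
              digests = {hashlib.sha256(c.encode()).hexdigest() for c in certs}
              if len(digests) < len(certs):
                  print(f"{pem.name}: {len(certs)} certs, only {len(digests)} unique")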

      • 8
        MWT2
        Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC's network is degraded to 1x100Gbps while UChicago IT Services diagnoses issues with one of the line cards on our uplink switches.

        All new worker nodes have been put online. We are troubleshooting network issues on the newly purchased IU storage node.

        HTCondor-CE has been upgraded to version 6 at all three sites.

        A number of weeks were spent investigating zombie pilot jobs. Some operational/draining issues also arose from the zombie pilots running, with a mismatch between how many jobs ATLAS thought we were running and what was actually running.
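
        As an illustration (not the pilot team's actual fix), the kind of cross-check that exposes such a mismatch: count what the local HTCondor schedd believes is running and compare it with the ATLAS-side number. The ATLAS count is passed in by hand here, since the exact monitoring query is site-dependent.

          #!/usr/bin/env python3
          """Cross-check locally running jobs against the count ATLAS reports.
          A sketch: condor_q is queried for running jobs, while the ATLAS-side
          number (e.g. from BigPanDA) must be supplied by the caller."""
          import subprocess

          def local_running_jobs() -> int:
              # JobStatus == 2 means "Running" in HTCondor.
              out = subprocess.run(
                  ["condor_q", "-constraint", "JobStatus == 2", "-af", "ClusterId"],
                  capture_output=True, text=True, check=True,
              ).stdout
              return len(out.split())

          def report(atlas_running: int, tolerance: int = 50) -> None:
              local = local_running_jobs()
              delta = local - atlas_running
              status = "MISMATCH" if abs(delta) > tolerance else "OK"
              print(f"{status}: condor={local} atlas={atlas_running} delta={delta}")

          if __name__ == "__main__":
              report(atlas_running=1200)  # replace with the ATLAS-reported count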

        Eight additional WebDAV doors were added to webdav.mwt2.org.

        MISP/Zeek servers have been built and are running in semi-production.

        Brief network outage Aug 29 for switch upgrades at UChicago.

      • 9
        NET2
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

        Storage:

        • All but 3 disk servers are online, providing 7.4 PB;
        • All 3 remaining disk servers are offline for maintenance and are expected to be operational within the next week;
        • The NET2 SE has been stable since its launch.

        Cluster:

        • Connection with Harvester established;
        • Collaborative work is ongoing with Fernando Harald Barreiro Megino, Lincoln Bryant, and Ryan Taylor to finalize the cluster configuration;
        • The Squid proxy server is successfully deployed and operational in High Availability (HA) mode, currently running two instances, but it can easily be scaled up.
        • We are exploring the option of configuring the Squid proxy server to operate in "headless mode" in order to take advantage of the new CVMFS proxy sharding feature; a toy sketch of the sharding idea follows this list.
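
        For intuition, a toy sketch (not actual CVMFS client code) of what proxy sharding does: each object path is hashed to deterministically select one proxy, so the proxies' caches hold disjoint slices of the namespace instead of N copies of the same hot set. The proxy endpoints below are placeholders.

          """Toy illustration of proxy sharding; not actual CVMFS client code."""
          import hashlib

          # Placeholder proxy endpoints; substitute the real squid instances.
          PROXIES = ["http://squid-0.example.org:3128",
                     "http://squid-1.example.org:3128"]

          def pick_proxy(object_path: str) -> str:
              # Hash the path and map it deterministically onto one proxy, so
              # every client sends the same object to the same proxy cache.
              digest = hashlib.sha1(object_path.encode()).digest()
              return PROXIES[int.from_bytes(digest[:8], "big") % len(PROXIES)]

          for path in ("/cvmfs/atlas.cern.ch/sw/a", "/cvmfs/atlas.cern.ch/sw/b"):
              print(path, "->", pick_proxy(path))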

        Monitoring:

        • Site network monitoring using SNMP is now in place and registered in CRIC; a minimal polling sketch follows this list.
        • The perfSONAR server is up, but we are still debugging, with the help of Sowmya Balasubramanian, why the hosts are not visible in the lookup service.
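
        For reference, a minimal sketch of the kind of SNMP query that underlies such monitoring, using the pysnmp library. The switch hostname and community string are placeholders; the production setup may use a different collector entirely.

          """Poll a 64-bit interface byte counter over SNMPv2c with pysnmp.
          Host and community string below are placeholders."""
          from pysnmp.hlapi import (
              getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
              ContextData, ObjectType, ObjectIdentity,
          )

          error_indication, error_status, error_index, var_binds = next(getCmd(
              SnmpEngine(),
              CommunityData("public", mpModel=1),               # SNMPv2c
              UdpTransportTarget(("switch.example.org", 161)),
              ContextData(),
              ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", 1)),  # ifIndex 1
          ))

          if error_indication:
              print(error_indication)
          else:
              for name, value in var_binds:
                  print(f"{name} = {value}")
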
      • 10
        SWT2
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        • Running well, not much to report
        • There will be a network outage of roughly 15 minutes for OU_OSCER_ATLAS_SE (se1.oscer.ou.edu) some time in the next week or so, while OneNet upgrades and migrates the OU OFFN (DMZ) network connection. There might be some failed transfers during that time, but they should be retried.
    • 11
      WBS 2.3.3 HPC Operations
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
    • WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 12
        Analysis Facilities - BNL
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 13
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14
        Analysis Facilities - Chicago
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Updated the dnsPolicy of the HTCondor deployments. The default DNS policy uses the host's DNS server for pods with hostNetwork, so the Dask workers were unable to resolve in-cluster Kubernetes DNS names. The Triton inference part of the AGC notebook happened to be the first application (condor job) that needed to resolve a Kubernetes DNS name (the Triton endpoint). We set dnsPolicy: ClusterFirstWithHostNet; now condor jobs can also resolve Kubernetes DNS names, which got the Triton part of the AGC notebook going. A sketch of applying the change follows.
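
        A sketch of how such a patch could be applied with the official Kubernetes Python client; the deployment name htcondor-worker and namespace htcondor are hypothetical stand-ins for the actual HTCondor deployment, and the same change can of course be made directly in the manifest or Helm values.

          """Patch a Deployment so host-networked pods still use cluster DNS.
          Sketch with the official Kubernetes Python client; the deployment
          and namespace names below are hypothetical."""
          from kubernetes import client, config

          config.load_kube_config()  # or load_incluster_config() inside a pod

          patch = {"spec": {"template": {"spec": {
              # With hostNetwork: true the default policy resolves via the host's
              # DNS; ClusterFirstWithHostNet restores cluster DNS so condor jobs
              # can reach in-cluster services such as the Triton endpoint.
              "dnsPolicy": "ClusterFirstWithHostNet",
          }}}}

          client.AppsV1Api().patch_namespaced_deployment(
              name="htcondor-worker", namespace="htcondor", body=patch,
          )
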
    • 15
      AOB