US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 1
      WBS 2.3 Facility Management News
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
    • 2
      OSG-LHC
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (next week)

      • OSG 3.5
        • vo-client 115
        • python-scitokens 1.6.2
      • OSG 3.5-upcoming
        • HTCondor 9.0.7 (has GSI, proxy delegation works)
      • OSG 3.6
        • vo-client 115
        • XRootD 5.3.2
        • xrootd-multiuser 2.0.3
        • XCache 3.0.0
        • osg-xrootd 3.6-10
        • HTCondor 9.0.7 (no GSI, proxy delegation broken)
        • blahp 2.2.0 (no GSI)
        • python-scitokens 1.6.2
      • OSG 3.6-upcoming
        • HTCondor 9.3.0 (no GSI, proxy delegation works)

      Miscellaneous

      • How's testing of XRootD in 3.6 going?
      • HTCondor-CE updates to support tokens
        • Known issue with C-style comments outside of routes in JOB_ROUTER_ENTRIES (thanks for the report, Wenjing!): https://opensciencegrid.org/docs/release/notes/#known-issues
        • CEs on token-supporting versions of HTCondor-CE
          • gate01.aglt2.org
          • gate02.grid.umich.edu
          • gate04.aglt2.org
          • gridgk05.racf.bnl.gov
          • osg-gk.mwt2.org
        • CEs on old versions of HTCondor-CE
          • atlas-ce.bu.edu
          • gk01.atlas-swt2.org
          • gk04.swt2.uta.edu
          • grid1.oscer.ou.edu
          • gridgk01.racf.bnl.gov
          • gridgk02.racf.bnl.gov
          • gridgk03.racf.bnl.gov
          • gridgk04.racf.bnl.gov
          • gridgk06.racf.bnl.gov
          • gridgk08.racf.bnl.gov
          • iut2-gk.mwt2.org
          • mwt2-gk.campuscluster.illinois.edu
          • ouhep0.nhn.ou.edu
          • spce01.sdcc.bnl.gov
          • tier2-01.ochep.ou.edu
          • uct2-gk.mwt2.org
    • Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 3
        TBD
    • 4
      WBS 2.3.1 Tier1 Center
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
    • WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running - some site problems....
        • AGLT2 was down for a Condor upgrade when the network failed.
          • Their new network redundancy scheme is not quite in service.
        • MWT2 memcache problem last weekend caused site to drain.
      • Planning continues to expend the funding completely by the end of the grant.
        • https://docs.google.com/spreadsheets/d/1-CV5UgeVsDYj8KrVMvuLP0lAVAcjNEQ8TgdYY6911vo
      • XRootD continues to need to be restarted once in a while.
      • AGLT2 updated to HTCondor 9.0.6 and HTCondor CE 5.1.osg35 and ran into a weird bug in how comments in a configuration file are handled.
      • We continue to bang on removing SRM and getting SRR reporting to be stable.
        • As a side effect of this, I have noticed that our storage element definitions are inconsistent. I think that Horst and Alessandra got this right at OU and we need to iterate at the other sites.
      • There has been an extended discussion on setting various CRIC parameters.
      • Ofer and I have been planning how to prioritize the updates that we need to get done between now and the start of Run 3.
      • IPv6 is still not implemented at NET2 and CPB.
      • Mark Sosebee will retire in January (though he might come back part time).
      • 5
        AGLT2
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        Hardware

        Three R740xd2 servers from MSU are in the production system. The IO benchmark shows that a strip size of 512K for RAID6 gives the best IO performance, about 10% better.

         

        Incident:

        On 11/22/2021, starting at 10am local time, the 10G commodity link connecting the AGLT2 UM site to Merit went down, so all nodes in the aglt2.org domain lost access to the Merit DNS servers. The issue was resolved around 7pm, when Merit repaired the hardware on this link. During this window all data transfers failed; the site had already been drained to 8% before the network outage because of a planned Condor update.

        dCache pool umfs06_12 caused jobs to fail when staging in files; restarting the dCache service resolved the problem.

        System update:

        Condor was updated from 8.8.15 to 9.0.6, and condor-ce was updated from 4.5.2 to 5.1.2. During this update we switched the Condor cluster's authentication from host-based to token-based, which went smoothly because we had already practiced it on a testbed. After the update, however, we hit an issue with condor-ce: it could see the incoming jobs, but the jobs could not be submitted to the local Condor system. It took a few hours of debugging to find the cause: the new htcondor-ce does not support the comment format used in our job router configuration. This has already been reported as a bug to the HTCondor development team. At about 13:00 on 11/23/2021, the site started to ramp up with jobs. Throughout the draining and update problems, BOINC jobs were able to fill all of the site's job slots.
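
        For sites preparing the same update, the comment problem described above could be checked for in advance. The following is a minimal sketch, not an official HTCondor-CE tool: it assumes routes are written as ClassAds enclosed in [ ... ] and that the affected comments use the C-style /* ... */ form. The function name and the sample configuration are hypothetical and only for illustration.

```python
def comments_outside_routes(config_text):
    """Return (line_number, comment) for each /* ... */ found outside any [ ... ] block."""
    findings = []
    depth = 0          # current nesting level of [ ... ] route blocks
    in_string = False  # True while inside a double-quoted string
    i = 0
    n = len(config_text)
    while i < n:
        ch = config_text[i]
        if in_string:
            if ch == "\\":      # skip the escaped character inside a string
                i += 2
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif config_text.startswith("/*", i):
            end = config_text.find("*/", i + 2)
            end = n if end == -1 else end + 2
            if depth == 0:      # comment sits outside every route block
                line_no = config_text.count("\n", 0, i) + 1
                findings.append((line_no, config_text[i:end]))
            i = end
            continue
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        i += 1
    return findings


# Hypothetical configuration fragment used only to exercise the check.
SAMPLE = '''JOB_ROUTER_ENTRIES @=jre
/* hypothetical comment outside any route */
[
  name = "Example_Route";
  /* comment inside a route */
  TargetUniverse = 5;
]
@jre
'''

for line_no, comment in comments_outside_routes(SAMPLE):
    print(f"line {line_no}: {comment}")
```

        Any comment the sketch reports would be a candidate to move inside a route or to remove before the update; the release notes linked in the OSG-LHC section remain the authoritative description of the issue.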

      • 6
        MWT2
        Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        One of our MD3460 s-nodes went offline again Nov 15, back up later that day. Discussing retirement plans for these nodes (also now out of warranty).

        Site issue over last weekend caused the site to drain. Back online Monday.

        IU scheduled to update to HTCondor CE 5 / HTCondor 9 November 29. UC will update the first week of January. UIUC will be scheduled after the UC update.

      • 7
        NET2
        Speaker: Prof. Saul Youssef

         

        Source of occasional HC bumping us offline probably found and dealt with.  

        Problem with the 2 x 100Gb networking between NET2 and NESE Ceph. 

        Minor post xrd bump:  nodes rebooting where container loses GPFS mount.

        Planning for networking re-arrangements, worker nodes.  

        Working on NESE Tape with NESE and MIT teams.

        Todo:  

        • new perfsonar hardware
        • ipv6
        • OSG 3.5/3.6 upgrade

         

      • 8
        SWT2
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Site generally running well

        - Still seeing xrootd hangups, have cron in place to restart hung services

        - Waiting for next xrootd patch

        - Trying to switch SAM/SiteMon over from GRIDFTP to XROOTD for primary SE monitoring. This is currently causing UNKNOWN status, possibly because SiteMon is trying to monitor the internal xrootd door, which isn't possible

        - SiteMon/MONIT team is looking at this
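
        A cron-driven restart like the one mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the script in use at OU: the probe only tests whether a TCP connection to the service port succeeds within a timeout, the function names are hypothetical, and the restart command in the usage comment is a placeholder. A hung daemon may still accept TCP connections, so a production check would likely need a protocol-level probe.

```python
import socket
import subprocess


def port_responds(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def restart_if_hung(host, port, restart_cmd, run=subprocess.run):
    """Run restart_cmd when the port does not respond; return True if a restart was issued."""
    if port_responds(host, port):
        return False
    run(restart_cmd, check=False)
    return True


# Example call from a cron job (service unit name is a placeholder, 1094 is the
# default XRootD port):
#   restart_if_hung("localhost", 1094, ["systemctl", "restart", "my-xrootd.service"])
```

        Passing the command runner as a parameter keeps the restart logic testable without touching a real service.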

        UTA:

        • New storage (~2PB) is coming online; it will mostly be used to retire existing storage.
        • WebDAV door is performing fairly well, considering the load.  Working on converting existing GridFTP to include WebDAV.
        • We have IPv6 addresses for the perfSONAR machines and are in the process of setting them up.
        • We are investigating an issue where some jobs failed to use the correct FRONTIER_SERVER variable.

         

    • 9
      WBS 2.3.3 HPC Operations
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
      • TACC
        • running fairly well but ran out of jobs. Followed up with DPA and a new dedicated task has been assigned.
        • "Texascale" mode coming up - will be offline for a week starting Dec 6 to run only the largest jobs
      • NERSC
        • Cori scheduled maintenance last week, plus ongoing filesystem instability
        • No updates for Perlmutter this week.
    • WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 10
        Analysis Facilities - BNL
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 11
        Analysis Facilities - SLAC
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 12
        Analysis Facilities - Chicago
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        We are working on adding Jupyter Notebook selection and scheduling, and on unifying the atlas-ml.org website with af.uchicago.edu.

    • WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)

      Working numerous issues:

      • Cleaning up SRR reporting and disabling SRM protocol fallback (particularly AGLT2 and maybe BNL)
        • Fred and Ofer reviewing WLCG storage availability reporting
      • Pointed out change in Kibana monitor auth to ADC (result of shift to OpenStack, requires membership in es-atlas-kibana e-group to view); also found plots now missing on brokerage page
      • Need to clarify procedure for adding elements to CRIC/Topology
      • F-S DevOps: working with Michal to understand squid failover threshold for ticket alerts
        • SLATE squid container update to OSG 3.6?
      • HTCondor-CE updates?  (Pushed back at BNL until next week)
      • xrootd standalone server deployed at BNL for testing; Qiulan will help to configure
      • Prioritized readiness list for Run3
      • 13
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14
        Service Development & Deployment
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches

        • All working fine.
        • AGLT2 networking issue was handled gracefully and automatically.
        • Upgraded to 5.3.4

        VP

        • Working fine
        • RAL still has not upgraded to 5.3.4 and most failures are coming from them.

        Rucio

        • VP integration development continues. Heartbeat endpoint PR now in review.
        • Oracle DB change is in and working fine.

        ServiceX

        • AF deployed instances work stably
        • A lot of developments and cleanups
        • Testing FABRIC deployed instance.

         

      • 15
        Kubernetes R&D at UTA
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        UTA_SWT2 decommissioning is nearing completion; the hardware will be used for a Kubernetes cluster at CPB.

    • 16
      AOB