US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

      Pre-scrubbing prep meeting Friday at 2 PM Central.

       

      Sites should start FY21 equipment purchase planning ASAP due to the global chip shortage.

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      US ATLAS downtimes

      • According to Alexey, the AGLT2 SE downtime was not added to the calendar because the AGLT2_SE resource has an FQDN/service combination of head01.aglt2.org/GridFTP, while the ATLAS CRIC-registered GridFTP endpoint is associated with dcdum02.aglt2.org (https://atlas-cric.cern.ch/core/service/detail/AGLT2_SE_0/)
      • Still verifying with Alexey, but it looks like we need an audit so that each CRIC endpoint and its protocols correspond to a Topology-registered resource and its services
      • Topology service -> CRIC protocol mapping (an illustrative sketch follows this list). Currently CRIC only picks up GridFTP/SRM downtimes:
        • GridFtp -> GRIDFTP
        • SRMv2 -> SRMv2
        • WebDAV -> WEBDAV (not yet implemented)
        • XRootD component -> XROOTD (not yet implemented)
        • XRootD HA component -> XROOTD (not yet implemented)
        • XRootD cache server -> XROOTD (not yet implemented)
      • OSG will start working on the ability to more easily declare downtime for an entire site (i.e., Topology Resource Group): https://opensciencegrid.atlassian.net/browse/SOFTWARE-3526
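
      As a rough illustration of the mapping above (not an official OSG or CRIC tool; the service names are taken verbatim from the list, and the audit logic is hypothetical), a short Python sketch could flag which registered services of a resource would currently have their downtimes propagated to CRIC:

        # Topology service -> CRIC protocol, as listed above; None marks
        # combinations whose downtime propagation is not yet implemented.
        TOPOLOGY_TO_CRIC = {
            "GridFtp": "GRIDFTP",
            "SRMv2": "SRMv2",
            "WebDAV": None,
            "XRootD component": None,
            "XRootD HA component": None,
            "XRootD cache server": None,
        }

        def downtime_coverage(topology_services):
            """Split a resource's Topology services into those whose downtimes
            currently reach CRIC and those that do not."""
            covered = [s for s in topology_services if TOPOLOGY_TO_CRIC.get(s)]
            uncovered = [s for s in topology_services if not TOPOLOGY_TO_CRIC.get(s)]
            return covered, uncovered

        # Hypothetical SE with three registered services:
        covered, uncovered = downtime_coverage(["GridFtp", "WebDAV", "XRootD component"])
        print("Downtimes propagate for:", covered)            # ['GridFtp']
        print("Downtimes do NOT propagate for:", uncovered)   # ['WebDAV', 'XRootD component']
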
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR))

      Mostly quiet at T1.

      dCache experts are currently performing deletion-rate tests, aiming to increase the deletion rate from 16 Hz to something much higher (see the rough estimate below). More results soon.
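
      For context on why 16 Hz is limiting, here is a back-of-the-envelope estimate (assuming a sustained, constant deletion rate; the actual target rate was not quoted):

        def drain_hours(n_files, rate_hz):
            """Hours needed to delete n_files at a constant rate_hz deletions per second."""
            return n_files / rate_hz / 3600.0

        # At ~16 Hz, one million queued deletions take roughly 17 hours;
        # a tenfold higher rate would bring that under 2 hours.
        for rate_hz in (16, 160):
            print(f"{rate_hz:>4} Hz: {drain_hours(1_000_000, rate_hz):.1f} h per 1M files")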

      Investigating 11 missing files (trying to understand why dCache removed the files soon after they were deleted).

       

    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Several serious problems over the last 2 weeks:


        • AGLT2: Main switch at UM failed.
        • MWT2: AC problems at UC. Also some trouble keeping site full.
        • NET2: Network issues plus other smaller issues.
        • SWT2_OU: One third of jobs trying to return output files to CERN fail.
          • Limited outgoing bandwidth: O(100 kb/s)
          • Incoming network works fine with reasonable bandwidth.
        • SWT2_CPB: Outage last Saturday and intermittent problems thereafter.
      • These site problems smoked out an issue preventing storage downtimes from being properly declared and propagated to CRIC (see the OSG-LHC report above).
      • Could NET2 and SWT2 please report on their IPv6 status?
      • Heard that there was progress on XRootD TPC transfers.
      • Preparing for pre-scrubbing.
      • Get your purchases in early, as there are long lead times for servers because of the worldwide chip shortage. Planning to arrange a presentation on Dell's newest servers.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        1) MSU resolved the cooling issue in the server room (details in [1] below).

        2) On May 18th, the core switch (Mellanox 2700) died from a storage device issue. To recover, we replaced it with the new Dell S5232F switch. Afterwards, however, we saw a lot of packet loss among different hosts and had to connect all the other switches and core service hosts directly to the new switch. It appears to be a spanning-tree issue between the older Dell switches (whose OS cannot be updated to OS10) and the new switch. We cannot resolve this issue, especially since we have planned to replace all the switches in mid-June.

        During this incident, we lost 3 dCache pools with 150 TB of data on an aging, out-of-warranty MD3260 storage enclosure (the vdisk failed and could not be recovered without technical support). We declared the lost files.

        One big challenge in the recovery involved VMware: one of the VMware nodes had trouble booting into its system (due to the network issue), so we had to migrate the images from this host to the other two, and we are still working with VMware support to bring this host back into the cluster.

        This incident caused a downtime of 6 days. We still have ~30 worker nodes suffering from significant packet loss, but the ATLAS job failure rate is low (~3%); the impact is larger on Tier-3 jobs.

        3) Had difficulties setting the site downtime; it did not seem to match CERN and file-access expectations.
        Part of that was an operator problem (Philippe) until he started using the proper OSG wiki info.
        Part of that must be some missing Topology entries (more services besides GridFTP in the SE?) or the advertisement between OSG and CRIC.

         

        [1] MSU CRAC problems: We replaced the control board of CRAC#2, as it was still operating but could no longer read its temperature sensors.
        CRAC#2 had been relying on CRAC#1 for control, making CRAC#1 a potential single point of total failure if it ever stopped. This scenario was deemed too dangerous, even for the couple of months we have left, so we paid for a control-board replacement.
        Separately, CRAC#1 turned off one of its compressors last Sunday, a repeat of what happened a month ago.
        These concerns will disappear when we move to the MSU data center in June-July.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jessica Lynn Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Multiple cooling failures at UC last Thursday and Friday. UC worker nodes were offline over the weekend, but storage remained online. Harvester took a while to fill the site each time we came out of downtime.

        Issues with declaring the MWT2 SE offline due to WebDAV; Fred followed up on this.

        Updated condor-ce on our GKs to 4.4.1-1.osg35.el7.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Incident when one of the GPFS pools filled up => GGUS ticket, fixed immediately.

        Made some squid config progress with squid team.

        Working on XRootD 5.2.0 with containers.

        Preparing to buy worker nodes.

        One node being tested with IPv6.

        Met with Mark re: figuring out how to add new resources in OSG in the GitHub era.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • IPv6 testing continues, along with setting up the deployment process. Will ask for AAAA records for the perfSONAR machines (a quick check is sketched after this list).
        • Starting to clear the logistics backlog that will allow decommissioning of UTA_SWT2.
        • An incident with a rack-level switch caused problems with production; looking at updating the firmware.
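
        A quick way to confirm the AAAA records once they are published might look like the following (standard library only; the hostname is a placeholder, not an actual SWT2 perfSONAR node):

          import socket

          def has_aaaa(hostname):
              """Return True if the host resolves to at least one IPv6 address."""
              try:
                  return bool(socket.getaddrinfo(hostname, None, family=socket.AF_INET6))
              except socket.gaierror:
                  return False

          print(has_aaaa("psonar01.example.edu"))  # placeholder hostname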

         

        OU:

        - Running well, getting great opportunistic throughput, up to 70% (2k slots) extra.

        - Seeing some OU-CERN outbound transfer failures/timeouts; working with both local network folks and WLCG experts to isolate the issue. Will likely migrate to the OU DMZ soon.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))

      NERSC was failing jobs all weekend, perhaps related to a config change / teething problems in the handover from Doug to Lincoln. Reverted the config change; jobs aren't outright failing now, but a lot of jobs are "closed". Doug advises that PanDA is not getting results quickly enough from NERSC. Investigating.

      TACC is running more or less smoothly; need to switch the proxy over from the robot cert to a personal cert with the production role ASAP (a rough sketch follows).
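
      A minimal sketch of generating such a proxy, assuming voms-proxy-init is installed and the personal certificate is in its default location (this is not necessarily the exact procedure that will be used for TACC):

        import subprocess

        def make_production_proxy(hours=96):
            """Request a VOMS proxy carrying the /atlas/Role=production FQAN."""
            subprocess.run(
                ["voms-proxy-init",
                 "-voms", "atlas:/atlas/Role=production",
                 "-valid", f"{hours}:00"],
                check=True,
            )

        make_production_proxy()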

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

        All running fine on the legacy ML platform. 

        For the AF, equipment is being racked: hyperconverged storage/CPU and login servers. The GPU server will not arrive until September.

        Starting to work on setting up the AF - DNS, website, account provisioning.  

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • OIM/CRIC downtime declaration issues during AGLT2 switch incident
        • Organized discussion as main topic at ADC Weekly Round Table
        • Bug found in CRIC; clarifications for OSG Topology
        • Brian/Alexey following up
      • HTCondor_CE version tracking added to Facility Services spreadsheet
      • BNL XRootd test shows stable memory profile
      • BNL_ANALY_VP now running well with transfer activity on BNL XCache
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches 

        • MWT2 - from time to time the load gets very high and data bypasses the cache
        • AGLT2 - had an issue with the main switch, but the caches are running fine
        • All others running fine.

        VP

        • runs fine

        ServiceX

        • stress testing with local (MWT2) data: 2.4 TB (1400 files) in ~8 min from up to 280 transformers (see the throughput estimate after this list)
        • bugging Rucio developers to improve replica ordering
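
        The quoted numbers imply roughly the following rates (purely illustrative arithmetic on the figures above):

          total_gb = 2.4 * 1000          # 2.4 TB transferred
          n_files = 1400
          seconds = 8 * 60               # ~8 minutes
          transformers = 280

          print(f"aggregate: ~{total_gb / seconds:.1f} GB/s")                              # ~5 GB/s
          print(f"per transformer: ~{total_gb * 1000 / seconds / transformers:.0f} MB/s")  # ~18 MB/s
          print(f"file rate: ~{n_files / seconds:.1f} files/s")                            # ~2.9 files/s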

        Squids

        • running fine (apart, again, from the AGLT2 switch issue)
        • fixed some CRIC settings
        • still some things to understand about Frontier Squid requests; talking to Julio and Dave.
    • 14:40 14:45
      AOB 5m