US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      Releases

      • HTCondor-CE 5.1.0, HTCondor 9.0.0 (upcoming): recommended update to support both tokens + GSI!
      • Frontier Squid security fix (already deployed in ATLAS, thanks DevOps!)
      • HTCondor 8.8.13
      • XRootD 5.2.0 RC1 available in upcoming-testing

      - HTCondor Week May 24-28 registration open! https://agenda.hep.wisc.edu/event/1579/

      - WLCG CE + Pilot Factory Hackathon June 3-4: https://indico.cern.ch/event/1032742/

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        Scale Validation of the XRootD Monitoring pipeline 15m

        Derek Weitzel will cover the work to verify the transfer accounting of XRootD and the OSG’s XRootD Monitoring Collector pipeline, which is replacing the legacy GLED collector currently hosted at UCSD. We found that the single largest issue with the monitoring is unreliable communication between the XRootD instances and the collectors, because the monitoring stream is sent over UDP.

        Speaker: Derek Weitzel (University of Nebraska Lincoln (US))
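
As a rough illustration of how UDP loss between an XRootD instance and a collector could be estimated, the sketch below counts gaps in a stream of per-packet sequence numbers. It assumes each monitoring packet carries a one-byte sequence counter that wraps at 256 and that packets arrive in order; the function name `count_lost` and the `modulus` parameter are hypothetical and are not taken from the talk or the collector code.

```python
def count_lost(seqs, modulus=256):
    """Estimate lost packets from received sequence numbers.

    Assumes in-order arrival and a counter that wraps at `modulus`
    (both are assumptions made for this sketch).
    """
    lost = 0
    prev = None
    for s in seqs:
        if prev is not None:
            # Distance from the previous packet, allowing for wrap-around.
            gap = (s - prev) % modulus
            if gap > 1:
                lost += gap - 1
        prev = s
    return lost


if __name__ == "__main__":
    received = [0, 1, 2, 5, 6, 254, 255, 0, 1]
    print(count_lost(received))
```

A reordered packet would show up as a very large gap in this sketch, and a loss of a full `modulus` packets would go unnoticed, so the result is only a lower-bound style estimate under the stated assumptions.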
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Some issues running over the last couple of weeks:
        • AGLT2 dCache issues
        • MWT2 not being kept full
        • NET2 problems with CRIC definitions


      • IPv6 complete at MWT2. SWT2 (CPB) close. NET2?
      • Close on OSG 3.5/XRootD 5.1.2 (5.2.0)
      • Need to think about OSG 3.6/Tokens/IAM
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
        • Added 3 C6420 Nodes (288 cores) to be shared between UM Tier3 and Tier2.
        • Site started draining (65% usage) on 6 May. Reported to ADC and added a second gatekeeper; the site started to ramp up after 24 hours.
        • Had low transfer efficiency and job stage-in errors (27% job failure) over the weekend, with file access timeouts. The solution was to restart all dCache services.
        • Updated dCache from 6.2.15 to 6.2.21 on Monday. This seems to have helped: transfer errors greatly decreased and job efficiency increased.
        • Will also test adding 50% more memory to one of the MSU dCache pool nodes with the most pools and files, as those nodes still seem to cause more errors than they should.
        • Updated 3 squid servers (UM site) to 4.15-1.1 to address the recent security alert; MSU has updated one squid server to SL7, with another still to be updated.
      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Recovery from server room cooling issues ran through Friday, but most services were back up by Wednesday afternoon. It again took a while for HammerCloud to put the site back online.

        IU workers moved to IPv6

        Site keeps draining due to issues with aipanda158

        Updated all frontier-squids to 4.15-1.1

         

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        NESE_SCRATCHDISK issue, resolved over the weekend

        Working on the latest XRootD 5.2 from Wei & Andy. It is working successfully; testing continues.

        Planning networking upgrades

        BGP peering between BU and ESnet for IPv6 successful.

        Working on IPv6 dual-stack testing at NET2

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, running well.

        - Getting very good opportunistic throughput in OU_OSCER_ATLAS_OPP (SCORE) and _TEST (MCORE) ES jobs at the moment.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant (University of Chicago (US))
    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        All running fine.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • BNL and SWT2 XRootD testbeds have been updated with the latest test version and are undergoing stress tests. BNL monitoring is set up.
      • Ongoing discussion about VOMS-IAM testing; awaiting development on VOMS import
      • Frontier-Squid DevOps meeting today
        • Latest security update from OSG-testing applied to SLATE squids
        • Discussion of update process
        • Non-SLATE squids still running at some sites - need an audit of configs across sites
      • Mark S. is compiling information on downtime problems, planning for discussion at ADC weekly in two weeks
        • Also looking at site CRIC configurations, e.g. NET2
      • Jobs still not getting brokered to ANALY_BNL_VP queue, unclear why (some suggestions from Rod)
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        * all running fine. One restart of one of the AGLT2 nodes.

        * LRZ-LMU adding one more node

        * all updated to send gStream over TCP.

        VP

        * working fine.

        * Ofer is discussing the VP queue at BNL with Rod.

        ServiceX

        * running stress tests 

    • 14:40 14:45
      AOB 5m