US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      COVID-19 Research

      • COVID-19 payloads will be submitted through the OSG VO
      • Sites can give priority to COVID-19 OSG pilots through HTCondor-CE configuration. Specific documentation will be published and announced

      3.5.13 (tomorrow!)

      • CA certificate update
      • Maybe XRootD 4.11.3-2 (fixes an issue at OU) depending on site testing results
      • HTCondor-CE 4.2.1: use SSL auth instead of GSI for advertising to the central collector
      • GridFTP: includes a patch that fixes missing transfer logs

      3.5.14/3.4.48 (next Tuesday)

      HTCondor security release (HTCondor-CEs unaffected)

      Other

      • Does anyone use the OSG rolling release repositories?
      • When will the first ATLAS site upgrade to HTCondor-CE 4 (available in OSG 3.5 release) and HTCondor 8.9 (available in OSG 3.5 upcoming)?
      • The GridFTP replacement, XRootD standalone, is ready to be piloted. We're very interested in ATLAS needs and feedback
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Normal operation in general at T1, but in min-safe mode (access to SDCC is permitted only for hardware failures or to fix unexpected outages, until further notice).
      • ARC-CE/gridftp interface enabled for ATLAS GPU jobs on the IC cluster. Will switch to gridftp submission mode on the CERN harvester side soon.
      • CVMFS and HTCondor upgrades are ongoing on the farm nodes, in a rolling fashion.
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

      The Tier 2 sites are running well, and I have been pinging sites when I see job failures, open tickets, etc. The number of open team tickets was dangerously close to zero, but a flurry of activity this morning opened some more tickets.

      There will be a pre-review review of the Tier 2 sites in preparation for the 5-year renewal of the Tier 2 program.

      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Service:

        • Updated OSG to 3.5 on 2 out of 3 gatekeepers; the ATLAS production gatekeeper is still on 3.4, waiting for the condor update on our cluster.
        • Testing and updating condor from 8.6.13 to 8.8.7; encountered some problems due to new features in 8.8.7-1, which were fixed with workarounds. We also plan to rebuild all the worker nodes with a separate partition for condor jobs instead of sharing it with /tmp. The condor head node is also an SL6 node; we plan to update it to SL7 with condor 8.8.7-1.
        • dCache was updated to 5.2.16; it took longer than planned.
        • Had a problem with the dCache database (zpool problem) that took half a day to recover. HTCondor ramped down because of this, and there were related GGUS tickets 146141 (solved) and 146144 (same as 146141, request to close) on 22 March 2020.

        Tickets:

        146371: file transfer errors with gfal-copy, but fine with xrdcp; still investigating. We restarted the pool; it worked for a while and then stopped working again.

        Hardware:

        Finished the retirement of old storage for the last purchase cycle, with no more planned until the next cycle; we are updating the storage by year of purchase.

        Access during lockdown

        Working remotely, but access to T2 equipment is allowed for Wenjing and Shawn at UM and for Philippe and Dan Hayden at MSU.

         
         
      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Site COVID-19 Status

        • UC: As of today, UChicago is only permitting Essential Staff building access on a per-building basis limited to one day a week
        • IU: No access to the IUPUI server room. Compute maintenance is on best-effort
        • UIUC: NCSA is remote. Compute maintenance is on best-effort

        UC

        • Investigating low level dCache transfer errors
        • Added additional xrootd dCache doors
        • Downgraded kernels on the R740xd2 storage nodes back to stock. We were running mainline to get better network performance; bonds looked better, but it caused thousands of transfer errors per day.

        UIUC

        • ICCP will retire 3824 cores of our older worker nodes in the coming months (rows 67, 68, and 69 on the v51 tab on the USATLAS capacity spreadsheet)
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Access to MGHPCC is still allowed with scheduling and preparation.  Not a major limitation for us in practice.

        Added two new NESE gateway nodes for gridftp transfers. NESE nodes are working; we are working with ADC to move more into production. A new AGIS site, BU_NESE, and a new NESE_DATADISK have been created; it will be a "nucleus" site and is being tested by ADC. New storage has arrived, except for a couple of management switches from DELL, which have been delayed until this month.

        Ordering fans to replace broken ones after various C6000 chassis failures.

        Rolling kernel updates are in progress on the worker nodes.

        SLATE node installed (atlas-slate01.bu.edu) and first pass at installation attempted.  We'll be in touch with SLATE team soon.

        Proceeding to prepare a large volume tape tier for NESE & NET2.   Aiming for initial ~30PB storage with ~0.5PB front end.  Meeting with vendors (IBM, SpectraLogic and Quantum).   Want to compare notes with Xin and BNL. 

        Smooth operations otherwise in the past two weeks except that the site isn't really getting saturated. 

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        SWT2_CPB:

          Investigating an issue with our MD3XXXi-based storage systems, which show episodic failures when staging files to worker nodes. Looking at memory pressure settings in the kernel and driver firmware updates.

         

        OU:

        Not much, things are running fine.

        Some job failures because of incorrect condor JDL files coming in from the pre-production Harvester instance. Being worked on.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))

      NERSC keeps running. Got 3000 nodes earlier this week.

      New tasks were assigned to NERSC; put them on pause until older tasks finish.

      Lincoln and DB met on Friday to discuss how to go forward at TACC and transfer knowledge (and create confusion) from DB to Lincoln.

       

       

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Condor update to the security-patched version 8.8.8 rolled across the shared pool (T3 affected), with only ~20% draining at once.

        Announced to users that BNL's "min-safe" operations (only ~5% of staff on site) may affect response times, but we strive to keep the facility 100% operational.

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        ARC-CE with GridFTP job submission for ANALY_SLAC_GPU is working.

      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Running smoothly.

        Now opportunistically running COVID jobs through the Folding at Home platform.

        Postponed update of GPU drivers and CUDA versions.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))

      In the midst of evaluating monitoring and communication tools and protocols.  Have had fruitful discussions, as a group, thus far in an attempt to identify problem areas and potential solutions.

      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Working smoothly.

        Missing traces and events (both at CERN and UC). Investigating with Thomas B.

        dCache logs are being ingested, for now billing data only (Filebeat configuration; doors and pools still to be instrumented).

        Hope to start getting ESnet data next week.

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        XCache servers are working smoothly.

        At MWT2, failures from TRIUMF (working with Simon, Andy, Matevz on understanding the issue) and LRZ (downtime).

        At AGLT2 moved to ANALY_AGLT2_VP queue. Works well. Will try to ramp up in a day or two.

        At Prague: networking issues (Puppet/k8s interaction); storage was 6 RAID arrays, not split into JBODs (78); new NIC (20 Gbps).

        Will work on Munich inclusion in VP.

    • 14:40 14:45
      AOB 5m