
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Two items to note:

      • Prescrubbing at BNL: we now need to finalize the WBS 2.3.x presentations for the scrubbing. Target drafts by next week's facility coordination meeting?
      • Upcoming quarterly reports: we would like to see WBS 2.3.x reports by July 20th to give us time to finish the WBS 2.3 report
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      - investigating issues with xrootd access

      - received a small first batch of Run 3 RAW data on Tuesday evening

      - Petr V. requested that OSG sites check the site topology to ensure SRM references for disk-only sites are removed

       

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running over the last two weeks
        • Communication issue with aipanda machines caused almost all running jobs to fail on 6/22/2022
        • MWT2 some disruption between 6/25/2022 and 6/27/2022
        • NET2 disruption between 6/25/2022 and 7/1/2022
      • Please state how you are dealing with the current Linux kernel security issue.
      • Please describe any updates you are doing to OSG, dCache, XRootD, etc.
      • Please describe your procurement plans today.
        • I really want to get our orders out earlier this year than last year's late September.
        • I will check with Dell to find out what CPUs, server types, storage types, etc. might actually be available to help guide what you order.
        • We can follow up as needed at next week's Facility Management Meeting
      • Please enter your quarterly reporting.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        06/24/2022

        7 nodes became blackhole nodes because of a CVMFS issue, later diagnosed as caused by one of the squid servers.

        06/29/2022

        One of the SLATE squid servers, sl-um-es5, stopped working because of both an iptables issue and a full var partition. This caused intermittent CVMFS issues. We got 2 GGUS tickets for this.

         

        06/30

        The SAM test jobs stopped running on 06/28. This started after the SAM test job team changed the leave_in_queue conditions on ETF. We could not find any obvious cause after a couple of days of debugging. Eventually we restarted the condor-ce services on both ATLAS gatekeepers, which got the SAM test jobs running again but also removed all running jobs on the gatekeepers, about 4000 jobs in total.

        07/06

        upgraded dCache 7.2.16 to 7.2.19 (with reboot to new kernel)
        Got all WNs updated and ready for reboot to new kernel.
        Starting rolling drain and reboot in batches

        All R6525 (AMD Milan 7413) servers from the January 2022 order have shipped.
        A fraction have already been received.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgrading Elasticsearch to 8.3. The cluster was upgraded to 7.17 last week.

        Still waiting on UChicago IT Services to configure our new Juniper networking gear from our most recent purchase.

        Updating condor to 9.0.13-1.1.osg36.el7 on the workers. IU is done. UC is halfway done. UIUC still needs to be upgraded.

        A switch and servers rebooted at IU last weekend. Back online by Monday.

        Replacing the motherboard on the problematic dCache pool node appears to have fixed the lockup issues. Another dCache pool node had a bad NIC; this has also been replaced and the pool node is back online.

        Removed ALRB testing variables from the workers and gatekeepers.

        Applied user.max_net_namespaces=0 for kernel mitigation.
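
        A minimal sketch of how one might verify that this setting is active on a worker. It assumes the sysctl key is exposed under /proc/sys in the usual way (dots in the key map to path separators); the helper names read_sysctl and net_namespaces_disabled are hypothetical and not part of any site tooling.

```python
from pathlib import Path


def read_sysctl(key: str, proc_sys: str = "/proc/sys") -> str:
    """Return the current value of a sysctl key by reading its /proc/sys file.

    Assumption: each dot in the key is a path separator, which holds for
    simple keys such as user.max_net_namespaces.
    """
    path = Path(proc_sys).joinpath(*key.split("."))
    return path.read_text().strip()


def net_namespaces_disabled(proc_sys: str = "/proc/sys") -> bool:
    """True if user.max_net_namespaces is currently set to 0."""
    return read_sysctl("user.max_net_namespaces", proc_sys) == "0"
```

        Calling net_namespaces_disabled() on a node reports whether the running kernel has the value applied; the proc_sys argument exists only so the check can be pointed at a test directory.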

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Received all of the R6525's that were outstanding (45 machines).  Starting to rack them.
        • Fixed a configuration problem in the compute nodes of the Kubernetes cluster.
        • Testing IPv6 and OSG 3.6 XRootD Standalone
          • Acts as a proxy to the backend storage (replicates existing services)
          • Drops GridFTP as an available protocol
        • Reconfigured AC unit to avoid some problems associated with additional load

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      Cori running well

       

      Ongoing debugging of Perlmutter failed jobs: the Pilot cannot update the PanDA status a significant fraction of the time, so jobs get marked as failed despite completing successfully.

      Attempting to convert Perlmutter to use the latest Pilot3 code - may help

    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • ATLAS Analysis Facility Task Force Mandate document (review and comment)
        • Discussion of AE2 outcomes document at AF Forum last week
        • Presentations at BNL/JLAB/HSF S&C Round Table next Tuesday
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        We are creating two additional platforms.

        One will serve educational purposes and, unlike the Codas workshop, will have tools that are usable by all of HEP, not only ATLAS (ServiceX for CERN Open Data, Jupyter with a ROOT kernel, etc.).

        The other one will be dedicated to ATLAS Analytics with tools that support Analytics efforts.

         

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Updated facility services spreadsheet
        • Big progress at NET2, now supporting token auth
      • XRootd access broken at BNL (GGUS).  Causing problems with VP queue (especially at BNL) and elsewhere
      • Numerous SLATE squid issues in the past week (iptables, partition size at AGLT2, OOM at IU, ...)
      • DOMA BDT discussion today about using X.509 and tokens for the Data Challenge
      • 2.3.5 folks please get QR in ASAP...
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        ES

        • upgraded to 7.17 last Thursday
        • today trying to upgrade to 8.3

        XCaches

        • all stable

        VP

        • all running fine except BNL

        Testing SLATE deployed Varnish based CVMFS caching at AGLT2.

         

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        Yesterday Patrick found that, although the K8s cluster uses the same compute node setup as the Tier2 cluster, one of the node parameters was preventing K8s from running containers (jobs stuck in the ContainerCreating state). Once he rolled that setting back, jobs started to run.
        Pinged Fernando today, waiting for ATLAS test jobs.

    • 14:55 15:05
      AOB 10m