US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Recent meetings

      • Last week was the LHCOPN/LHCONE meeting in Manchester, UK:  https://indico.cern.ch/event/1479019/
      • WLCG DOMA was today:  https://indico.cern.ch/event/1520247/
      • Next week is HEPiX in Lugano, Switzerland:  https://indico.cern.ch/event/1477299/

       

      We are working on a 5-year estimator for our facilities, with the goal of understanding our resource needs to deliver US targets up to the start of HL-LHC.

      Please consider attending HTC25 in Madison, Wisconsin, June 2-6.  On June 4th we intend to have joint USATLAS-USCMS meetings:  https://agenda.hep.wisc.edu/event/2297/

       

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release this week: XRootD 5.7.3-1.5 (gstream fixes, adding support for purge plugins)
      • Kuantifier: verified access to test NET2 cluster, need Eduardo to set up unprivileged Prometheus
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • 75 servers (3 racks) arriving at BNL this week. Expect them to be available to the Tier-1 in ~2 weeks
        • RBT submitted to meet the WLCG request for 5 PB additional tape

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • Gridgk04 and Gridgk06 rebooted unexpectedly over the weekend (cause under investigation)
          • This caused a temporary dip in running jobs; service was restored Monday
        • Security fix has been pushed out across the ATLAS T1 pool per HTCondor dev recommendation
          • SEC_TOKEN_REQUEST_LIMITS = DENY
          • SEC_ISSUED_TOKEN_EXPIRATION = 0
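One way to confirm that the two settings above are present in a node's configuration is sketched below. It is a minimal illustration only: it assumes the configuration is available as plain `NAME = value` text, and the helper names `parse_condor_config` and `missing_or_wrong` are hypothetical, not part of HTCondor.

```python
# Expected values, taken from the two settings listed in the notes above.
EXPECTED = {
    "SEC_TOKEN_REQUEST_LIMITS": "DENY",
    "SEC_ISSUED_TOKEN_EXPIRATION": "0",
}


def parse_condor_config(text):
    """Parse simple 'NAME = value' lines; later assignments override earlier ones.

    Hypothetical helper: handles only plain assignments and '#' comments,
    not macros, includes, or conditionals.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()
    return settings


def missing_or_wrong(settings, expected=EXPECTED):
    """Return {name: found_value} for every expected setting that does not match."""
    return {k: settings.get(k) for k, v in expected.items() if settings.get(k) != v}


sample = """
# pushed per HTCondor dev recommendation
SEC_TOKEN_REQUEST_LIMITS = DENY
SEC_ISSUED_TOKEN_EXPIRATION = 0
"""
problems = missing_or_wrong(parse_condor_config(sample))
print(problems or "all expected settings present")
```

In practice the text would come from wherever the site keeps its deployed configuration; the sample string here only stands in for that.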

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • ATLAS reprocessing started Monday 17
          • 310K+ files restored so far.

          • Target is to use BNL-OSG2_MCTAPE (size: 5414.3 TB; datasets: 2073; files: 67548)

          • No major issues observed at dCache or HPSS

        • Integration/test instance migrated to OpenShift

        WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan

        • NTR
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Pretty good running over the last couple of weeks.
        • MWT2 back to full production after doing a rolling update.
        • NET2 still working on repairs for the high core count servers.
      • EL9
        • MSU is past all install-system issues but is still working to find installation parameters that work.
        • UTA working on installing new storage servers so it can update its storage to EL9.
          • The rest of CPB is at EL9.
      • Operations and Procurement plans
        • Sent out templates yesterday.
        • We will need to define milestones to match the contents of the plans.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC: a sequence of reboots of compute nodes and login nodes this week

        Perlmutter: following up on inode quota usage

        • Doug requested that the inode quota be increased from 20M to 50M, and SCORE has been reduced to 2 workers with 10 nodes each per submission (a factor of 5 reduction)
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

        Perlmutter in downtime today

        Over the weekend we ran out of inodes (again!!!)

        • Asked to increase inode quota to 50M for 6 months
        • Reduced the number of running SCORE Slurm jobs from 5 to 2 (i.e. workers in Harvester)
        • Reduced the number of nodes running SCORE Slurm jobs from 20 to 10
        • Net reduction of a factor of 5 in the number of SCORE jobs running on NERSC -  MadGraph jobs caused havoc...
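The factor of 5 quoted above can be checked with a short calculation. This sketch assumes each Harvester worker corresponds to one Slurm job using the stated number of nodes, so the total scales as workers times nodes per job; the function name `node_slots` is illustrative only.

```python
def node_slots(workers, nodes_per_job):
    """Total nodes requested at once: one Slurm job per worker (assumed)."""
    return workers * nodes_per_job


before = node_slots(workers=5, nodes_per_job=20)  # previous SCORE setup
after = node_slots(workers=2, nodes_per_job=10)   # reduced SCORE setup
reduction = before / after

print(f"before={before} nodes, after={after} nodes, reduction factor={reduction:g}")
```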

         

        Success in HEP-CCE Globus Compute: the first PanDA validation jobs successfully started with the test Harvester and the Globus Compute submitter.

        • Working on a monitor for Globus Compute; need to work with the PanDA team to come up with a working solution for a Globus Compute sweeper.
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • HTCondor update.
          • Interruption will be brief.  Login nodes and the scheduler are already updated. Will rebuild the worker image and deploy tomorrow.
          • The update addresses a security issue and will also fix a bug affecting coffea-casa (job svc classad)
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • CRIC migrated to python3
          • Led to HammerCloud not being able to whitelist sites
        • Tasks were sent to ARM queues with a release that does not have merge available for that release. All tasks were fixed.
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        XCaches

        • All moved to FluxCD or direct Docker deployment
        • Wuppertal needs to fix gStream monitoring
        • Next Monday 8:30 CDT meeting on how DE will use XCaches in the HTC-only era.

         

        Varnishes

        • All working fine
        • Need Rod to change port
        • Agreed to get PIC, IN2P3-CC and Roma to set up instances next

         

        ServiceX/Y

        • We had a meetup at UofW.
        • A lot of new functionalities discussed: RDFrame support, Joins, ARM support, ServiceX-Local, new version of local cache, ...
        • ServiceY will be continued as a demonstrator; its functionalities will be picked up and reimplemented in ServiceX on the ServiceX timeline.

         

        CREST

        • Had one more HLT test
        • Need to update the CERN OpenStack k8s cluster due to node retirement.

         

        Analytics

        • Brand-new Logstash configs and templates for WLCG_WPAD and cms-frontier data
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
    • 14:25 14:35
      AOB 10m