US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      Documentation for OSG XRootD standalone, the GridFTP replacement, is live: https://opensciencegrid.org/docs/data/xrootd/install-standalone/

      • HTTP/S enabled by default
      • Supports HTTPS third-party copy

       

      Meeting notes:

      • SWT2 and NET2 are interested in testing xrootd-https; Xin/Tier1 already is.
      • RHEL8 for OSG? (Doug) Timeframe: OLCF decision on VM for Harvester coming up. Also python3 as default? (Brian thinks yes)
    • 13:20 13:35
      Topical Report
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Full Run 2 reprocessing is ongoing; for BNL: ~1.3M files, 2.9PB to stage from tape.
      • slow deletion on DATADISK
        • GGUS 144845
        • The cleaner has been running fine since the dCache upgrade, but this time DOMA-http TPC tests were also running at the same time. An external script is being used to help speed up the release of deleted space, ~4PB.
    • 13:40 14:00
      Tier2 Centers
      Convener: Shawn Mc Kee (University of Michigan (US))
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

            Hardware:

            - no big change, no major issue, typical maintenance.

            - Progress continues on retiring older T2 storage at MSU and T3 storage at UM.

            Services:

            - One new GGUS ticket (144783) about jobs losing heartbeat. We verified the issue at the site.
              The number of jobs losing heartbeat has been consistent at the site, about 100-200 jobs per day.
              The issue shows symptoms similar to those seen at other sites (see MWT2 ticket 144756)
              and has been tentatively tracked down to the pilot, with a fix recently put in place.

            - Condor problem: on Jan 21st, starting around 4am, the number of running jobs in Condor began to drop, down to 20%.
              We spent a few hours investigating; eventually, rebooting the Condor central server
              and another Tier 3 submission machine solved the problem.

            - Getting close to adding (restoring) the xrootd.aglt2.org SAN to the dCache doors' SSL certificate.

        NOTE: Wenjing Wu is on vacation starting today through the next two weeks and will then be working for one week from China (use non-Gmail email to reach her: wuwj@ihep.ac.cn or wwu@cern.ch). She will be back on the 17th of February.

         

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        GGUS Tickets:

        • Ticket 144756 "problems at ANALY_MWT2_UCORE" (Closed)

          • Jobs stuck in “scouting” status. Pilot stuck in endless monitoring loop

          • Pilot update pushed that fixed the issue. Jobs no longer getting stuck for days

        • Ticket 144542 "pilot stage-in issues" (Closed)

          • No update for a couple of weeks now after our last change. I pinged it last Monday, thinking 144756 was a similar issue. Closed it now that there doesn't seem to be a problem and nobody has commented or complained.

        • Ticket 144798 & 144808 (Closed)

          • Duplicate of 144756

          • Reopened as 144808. We manually evicted a large number of jobs to allow new production jobs in, as we weren't sure when a fix would happen.

        • Ticket 144840 "MWT2 stage-in issues"

          • "Auth Failed" errors are appearing on xrootd file downloads. Currently investigating through manual testing and checking logs.

        UC:

        Began network setup, but fell behind trying to get software from the vendor. ETA is next week.

        UIUC:

        Still waiting on new purchase arrival.

        IU:

        Ready for IPv6 setup according to network team. Will begin trial setup in the coming weeks.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Smooth operations. 

        Started the IPv6 journey. BU networking is working on getting addresses; we're preparing to dual-stack the DDM endpoints first.

        New NESE endpoint working.  

        Prep work for adding new DELL NESE storage (6PB raw).  Storage arrived.  Networking gear still arriving.  Still waiting on UPS power to three new racks at MGHPCC.

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

          An issue with the XrootdFS mount on a GridFTP door caused problems with deletions.

          Everything else running well.

         

        OU:

        - Nothing to report, site running well.

        - There was a brief HC site outage over the weekend, caused by HC jobs being killed by the pilot because they consumed too much RAM. Those HC jobs were stopped again by Petr.

         

    • 14:00 14:05
      HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
    • 14:05 14:20
      Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        BNL T3: ATLAS GPFS filesystem outage for a few hours on Sunday. We think the cluster went down because new (incompatible) kernel modules were built and installed on Friday, which caused stale mounts from 12pm to 6pm on Sunday.

        Otherwise normal operations

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        ML platform running fine. Every now and then a new user comes and needs a bit of help with starting work. No shortage of GPUs now. 

        Starting work on a Reinforcement Learning OpenAI environment for smarter caching decisions. This experience will be valuable for other use cases.

    • 14:20 14:40
      Continuous Operations
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        New ES nodes should be connected to the cluster next week. ES will be updated at the same time.

        We informed people of the pending removal of the "spare" ES cluster. Two people asked for a delay. The new removal date is the 25th.

        Slowly replaying perfSONAR data from tape. Still some issues to fix.

        Getting meta and status perfSONAR indices into RMQ and tape. Work has been done on getting ESnet data to follow the same data flow.

        Starting work on organizing data annotations.

         

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        XCache was working stably at MWT2, AGLT2, Prague, BNL.

        VP changes requested by Rod, Tadashi. 

        Created XXX_VP_DISK for all 4 sites and "connected" them to ANALY queues at sites.

        There are edge cases that need to be addressed, e.g. when the original data copy exists only on tape.

        Quite a bit of traffic on all XCaches (> 3Gbps).

        Now reporting all requests and replies to/from VPservice to ES so we can monitor it. Need to find a way to label jobs brokered against VP copies; right now it's rather complex to identify them.

        ServiceX work: new high-performance transformer, work on Kafka deployment, monitoring, performance characteristics.

    • 14:40 14:45
      AOB 5m