US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

      top of meeting -- to be discussed in sections

      • FedOps Squid milestone
      • CRIC pains
      • AF white paper - responding to Brian's latest
      • Storage accounting
      • News from HEPiX
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.6/3.5.32 (this week)

      • Patched HTCondor 8.9.11: fixes a crashing bug. All CEs on 8.9.x should update (a minimal update sketch follows this list).
      • osg-scitokens-mapfile (similar to vo-client, but for token issuers)
      • vo-client (enmr updates)
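
      For CE admins, a minimal update sketch, assuming a stock OSG 3.5 install where the HTCondor 8.9.x series comes from the osg-upcoming yum repository; repo and service names may differ per site:

        # pull the patched HTCondor 8.9.11 build and restart the affected services
        yum update --enablerepo=osg-upcoming condor
        systemctl restart condor condor-ce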

      OSG 3.6/3.5.33 (next week)

      • HTCondor 8.9.12

      Other

      • HTCondor plans to drop GRAM and CREAM support in 9.0
      • XRootD plans to drop support for XRootD 4 at the end of the month
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • DDM deletion slowness: 5 PB of secondary data were deleted by DDM, which took several days to finish. DDM ops was asked to slow down, which helped clear the dark data more quickly. On the dCache side, several cleaner parameters were adjusted and a huge backlog of old entries was cleaned out of the table used by the cleaner (a configuration sketch follows this list). More details in this discussion:  https://codimd.web.cern.ch/PPMRQACQTaKG3szTrpL9ZQ#
      • HPSS: ATLAS was very busy in the past two weeks, staging 1.24 PB (361,349 files). The number of active requests went above the 100K limit several times; the reason is under investigation. The massive staging load overloaded the dCache admin service, causing other transfer errors, e.g. GGUS 151014. A second PnfsManager was added to handle the requests. This is the first time we have encountered this issue with the new version of dCache (6.2.12).
        • 6.2.12 was the latest version at the time we upgraded dCache.
      • BNL FTS to be updated next Monday (03/22), 3 to 10 AM EST.
        • to version 3.10
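
      For reference, a minimal sketch of the kind of cleaner tuning and trash-table cleanup described above. The property and table names are recalled from the standard dCache/Chimera defaults and the values are purely illustrative, not the settings actually used at BNL:

        # /etc/dcache/dcache.conf (illustrative values, check against the dCache Book)
        cleaner.limits.batch-size = 500   # deletions sent to a pool per request
        cleaner.limits.threads = 5        # pools processed concurrently

        -- one-off cleanup of the Chimera trash table the cleaner works from
        -- (run against the chimera database; column name and age cut are illustrative)
        DELETE FROM t_locationinfo_trash WHERE ictime < now() - interval '90 days';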
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Smooth running this week:



      • MWT2 down this week to upgrade dCache to 6.2.15 (from version 5)
      • The current dip in job success is a second attempt to run a derivation production.
        • The code is still broken, finding negative pT or abs(eta) values.
        • Tadashi has again made changes to prevent massive numbers of scout jobs from being created.
      • Long discussion with Mark on operations yesterday after prior meeting with Mark/Ofer.
        • Mark will do much of the monitoring I have been doing.
        • Could everyone please monitor the US ATLAS Mattermost channel, as Mark will triage problems there with your input. Hard problems will be documented with a ticket.
      • HEPiX has increased the pressure to finish the IPv6 migration
        • MWT2_IU has installed new Juniper routers that should solve the problem at IU.
        • NET2 is still thinking about it.
        • SWT2_CPB is negotiating with their network team but believes they are close.
      • Frontier-Squid DevOps meeting immediately following
        • Mark/Patrick have told me that SWT2_CPB is now in good shape.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Update

        1) Both the UM and MSU sites updated both core switches to the latest JunOS v18. Reminder: during the previous update the MSU router had to be downgraded to v17 because of a memory-leak bug causing restarts every ~6h.

        Incidents:

        1) Some BOINC jobs flooded /var/log/messages with squashfs errors; this happened on ~10 worker nodes. Also, because BOINC now uses squashfs, we are rebuilding the worker nodes with a bigger /tmp area (1 GB/core).

        2) MSU router: we added an anti-spoofing filter on the packet source address to protect against potential attacks on our DNS or NTP servers. This filter is applied to the public VLAN. At first the filter specified only the corresponding public subnet, and a sampling ping test did not spot any problem. A few hours later, however, several hundred Condor jobs lost their heartbeat and were dropped. It was then found that a fraction of the pings were failing from UM to MSU (but not from MSU to UM?!) on the private subnets. The solution was to add the private subnet to the filter definition; in fact UM had to do a similar thing earlier. It is still not clear why this is needed, and it is still being investigated before moving to the new routers and campus research networks.
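
        For context, a hedged sketch of this kind of source-address filter in JunOS syntax; the prefixes are placeholders, not the real AGLT2 subnets:

          firewall {
              family inet {
                  filter public-vlan-anti-spoof {
                      term allow-known-sources {
                          from {
                              source-address {
                                  192.0.2.0/24;     /* public subnet (placeholder) */
                                  10.10.0.0/16;     /* private subnet, the entry that had to be added */
                              }
                          }
                          then accept;
                      }
                      term drop-spoofed {
                          then {
                              count spoofed-src;
                              discard;
                          }
                      }
                  }
              }
          }

        The filter would then be applied as an input filter on the public-VLAN interface (set interfaces ... family inet filter input public-vlan-anti-spoof).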

         

        3) Tried to update from OSG 3.5 to OSG 3.6, which includes updates to both Condor and Condor-CE.

        With Condor, we hit an authentication problem after updating the head node from 8.8.12 to 8.9.11 (the worker nodes were updated to 8.9.11 a month ago). We could not resolve the authentication issue after trying different configurations, so we rolled the Condor head node back to 8.8.12.

        With Condor-CE, after the update the CE could not receive any new jobs due to authentication failures. After many reconfiguration attempts we still could not resolve the issue, so we rolled back the update, but the authentication issue persisted for ATLAS jobs (it works for OSG jobs). It is possibly caused by a bug in Condor-CE: the ATLAS jobs keep reusing an old security session, Condor-CE rejects it and sends back a rejection message, but this message is blocked by the CERN firewall, so the ATLAS side never receives it. Condor-CE started to receive new jobs again after ~12 hours, once the security session expired on the ATLAS side.
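
        For the record, a hedged sketch of the kind of checks and knobs involved here; this is not the configuration actually tried at AGLT2, and the method list below is only an assumption based on the stock 8.9 security defaults:

          # on the CE: trace an end-to-end test submission to see where authentication fails
          condor_ce_trace --debug ce.example.org      # placeholder CE hostname

          # on the 8.9 head node: spell out the accepted authentication methods explicitly
          # rather than relying on the changed 8.9 defaults (illustrative values)
          SEC_DEFAULT_AUTHENTICATION_METHODS = FS, IDTOKENS, GSI
          SEC_DAEMON_AUTHENTICATION_METHODS = $(SEC_DEFAULT_AUTHENTICATION_METHODS)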

        4) Update on the "pilot error 1151" problem: the few remaining errors were resolved after rebooting the dCache pool node msufs04. It had ~5k hanging processes, which were likely locking some memory resources.

        5) Still debugging transfer failures with CNAF: 60% of the transfers fail, and curl tests show 6% packet loss. This is very likely a network issue between the two sites; it is not yet clear on which segment.

        6) We have one black-hole worker node (newly built; something is not working with the cfengine policies) that failed over 1000 jobs in one day due to missing software.

        7) Building the C6420 worker nodes with a big /tmp (100 GB) to run BOINC jobs. BOINC jobs recently changed to use squashfs, which requires an extra 1 GB per job slot.

        8) Firmware and software have been updated on all the worker nodes at MSU.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgrading dCache to 6.2.15

        • Upgrade is complete, finishing testing this morning
        • Ran system updates in addition to dCache updates
        • Will update topology to end the downtime after testing is complete today

        Upgraded the ElasticSearch cluster to 7.11.1

        Added a SLATE squid at IU for failover. MWT2 is now running two SLATE squids
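
        For illustration, the client-side reason a second squid provides failover: proxies in the Frontier client configuration are tried in order, so the IU squid is used when the UC squid is unavailable. The string below is a generic sketch with placeholder hostnames, not the actual MWT2 setting:

          export FRONTIER_SERVER="(serverurl=http://frontier.example.org:8000/atlr)(proxyurl=http://squid-uc.example.org:3128)(proxyurl=http://squid-iu.example.org:3128)"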

        Work continues on the relocation of the MWT2_UC data center; quotes for the RFQ swing equipment have been received and are under discussion.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        We had a period with nearly zero use of GPFS and nearly 100% use of NESE, and this caused ~20% of jobs to fail on stage-out.  A likely solution is to add a few more NESE gateways for NET2 traffic.

        We're going to go through and update our OSG registration.

        Smooth operations other than that.

        Working on XRootD via OSG 3.6; preparing for semi-production testing of that setup, at first on the BU side.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, running well.

        - OU is about to switch over to the new SRR space reporting (a hedged example of the SRR record follows below).

        - Possibly maintenance downtime next Wednesday for OSCER cluster updates.
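
        For reference, a hedged sketch of the WLCG SRR JSON record that this reporting publishes. The field names are an approximation of the SRR common format, and the values and paths are placeholders, not OU's actual numbers:

          {
            "storageservice": {
              "name": "OU_OSCER_ATLAS-storage",
              "implementation": "xrootd",
              "latestupdate": 1616000000,
              "storageshares": [
                {
                  "name": "ATLASDATADISK",
                  "totalsize": 1000000000000000,
                  "usedsize": 750000000000000,
                  "vos": ["atlas"],
                  "path": ["/xrd/atlasdatadisk"]
                }
              ]
            }
          }

        Here "latestupdate" is a Unix timestamp and the sizes are in bytes; the record is typically published at an HTTP endpoint registered in CRIC.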

        UTA:

        • Enabled SRR Reporting for UTA_SWT2 in ATLAS CRIC; SWT2_CPB was already enabled
        • SLATE-based squid being tested
        • GGUS ticket #150941 appears to be a communication issue between the data servers and the redirector in XRootD.  We are accelerating our deployment of new redirector hardware; it is doubtful that this by itself fixes the problem, but it will allow for more detailed logging (see the tracing sketch below).
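
        For reference, a hedged sketch of the kind of extra tracing that can be enabled on the new redirector (standard XRootD directives; the levels shown are illustrative, not the exact SWT2 settings):

          # xrootd config on the redirector (and optionally the data servers)
          xrd.trace all
          xrootd.trace emsg login redirect stall
          ofs.trace all
          cms.trace all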
    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), lincoln bryant

      Dealing with the loss of Doug's effort, which has been switched to the BNL AF.  Replacement effort will come in about six months.  Lincoln and Doug have been sharing notes on NERSC accounts and Doug's scripts.  Getting set up to switch TACC to use the new Globus door at BNL.

       

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m

        Working fine.

        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • CRIC migration wrap-up at yesterday's ADC Weekly - still some questions about WLCG vs. ATLAS master; need documentation
      • Need clarification about SRR configuration in CRIC - newly configured by OU; apply the same format to all US XRootD sites?
      • Met with Mark and Fred to re-organize the ops support effort; setting up a Facilities Support web page to aggregate useful monitoring links and document procedures - served from the new SDCC Drupal site until the new US ATLAS web site is ready
      • F/S DevOps meeting later this afternoon; Lincoln has merged the SWT2 SLATE squids into OSG topology
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        ES upgraded.

        XCaches work fine.

        VP is working fine. Rucio integration ready.

        Squids are working fine. Investigating why BOINC jobs show a high cache hit rate but still transfer information from Frontier.

    • 14:40 14:45
      AOB 5m