US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      Releases this week:

      • XRootD 4.11.2
      • UberFTP 2.8-3 (repackaging after OSG contributed patches to the new Grid Community Forum upstream: https://github.com/gridcf/uberftp)
      • HCC VO update (important if your site supports HCC!)

      Reminders

      Other

      There was an issue with the GRACC -> WLCG accounting process for January that was resolved last week (the initial APEL report was broken but was promptly fixed). Xin mentioned that he needed to manually update numbers in CRIC for BNL.

    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Normal operations in general.
        • Two WNs were built incompletely and became black holes due to missing CVMFS files. They were taken down for a rebuild.
        • January job accounting numbers were initially off by ~50% and were later corrected on APEL. The numbers in CRIC were fixed manually.
      • data17 reprocessing started today. BNL tape staging is running fine so far.
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Software update:

        Updated the OSG software and HTCondor-CE to the most recent release on all 3 gatekeepers.

        Frontier Squid was also updated to 4.10-1.1.osg34.el6.

        Plan to upgrade all our SLC6 nodes to SLC7, including the dCache, HTCondor, and AFS services.


        Job Errors:

        Many jobs are failing with this error:

        Non-zero return code from RAWtoESD (65); Logfile error in log.RAWtoESD: "AthMpEvtLoopMgr ERROR Failure in waiting or sub-process finished abnormally"

        Some worker nodes failed 100% of their jobs. We identified and rebuilt around 15 affected worker nodes; after rebuilding, they no longer fail many jobs (failure rate below 10%).

        Note: This error also appears in jobs at 8 other sites, and AGLT2 accounts for 1/5 of the failures. There is no ticket, and it is not clear whether the error comes from the jobs themselves or from the worker nodes.
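One way to separate bad worker nodes from bad jobs is to tally the failure rate per node. The sketch below is only an illustration under stated assumptions: the job records, the node names (wn-a, wn-b, wn-c), and the helper names are hypothetical and do not come from any AGLT2 tooling or from the PanDA API.

```python
from collections import defaultdict


def failure_rate_by_node(jobs):
    """Map each worker node to the fraction of its jobs that failed.

    `jobs` is an iterable of (node_name, succeeded) pairs. This record
    layout is an assumption made for the example.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for node, succeeded in jobs:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return {node: failures[node] / totals[node] for node in totals}


def nodes_at_or_above(rates, threshold):
    """Return the sorted node names whose failure rate is >= threshold."""
    return sorted(node for node, rate in rates.items() if rate >= threshold)


# Hypothetical job records: (worker node, job succeeded?)
sample_jobs = [
    ("wn-a", False), ("wn-a", False),
    ("wn-b", True), ("wn-b", True), ("wn-b", False),
    ("wn-c", True),
]

rates = failure_rate_by_node(sample_jobs)
always_failing = nodes_at_or_above(rates, 1.0)
print(always_failing)
```

Nodes that fail every job regardless of task are candidates for a rebuild, while an error spread evenly across nodes and sites would point toward the jobs themselves.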

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        UC

        • dCache upgrade to 5.2 in progress as of this morning
        • Site drained via switcher3 since Monday - is this new behavior?
        • Updated capacity spreadsheet and topology for new dCache purchases

        UIUC

        • 24 new workers (1960 cores) received Monday, in the process of being racked
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        We're having some trouble keeping the site consistently full: GPFS sometimes gets slightly clogged, which causes stage-in timeouts, which in turn lead to blacklisting by HC. I'm not sure whether this overlaps with global production issues. We're still investigating.

        SLATE node transfer happening at MGHPCC today. 

        BU networking has agreed to set up for IPv6 (NET2 is the first requestor at BU). Started a "project". I'll know more about timescales by Oklahoma. The main issue is updating the DNS infrastructure.

        NESE storage racks have UPS power now. The new storage nodes are racked, powered, and being tested.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • SWT2_CPB
          • ADC forcibly changed the PanDA queues to use the Rucio mover rather than LSM.
            • This caused many problems, but we used it as a chance to adopt the Rucio mover.
            • We can use the Rucio mover for reading, and this is our preferred option.
            • We cannot use the Rucio mover for writing to storage.
              • The Rucio mover would not honor the lan_write configuration in AGIS, and wan_write does not work from the compute nodes.
              • Even if it had worked, the PFNs probably could not have been registered, as was the case when we tried the xrootd mover: the PFN contains the .local domain rather than the atlas-swt2.org domain.
            • We have moved back to LSM for writes for now.
            • We also discovered an issue with the xrdadler32 command from xrootd that shows up during writes. It affects the xrootd site mover and probably the Rucio mover; LSM avoids the issue.
          • Completed the change-out of the UPS batteries.
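For cross-checking a checksum reported during writes, an Adler-32 value can be computed independently with the Python standard library. This is a minimal sketch, not a reproduction of the xrdadler32 issue mentioned above; the function name and the 8-digit lowercase hex output format are assumptions chosen for the example.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so that large files do not have to fit
    in memory. The output format is an assumption for this example.
    """
    checksum = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Passing the running value continues the checksum across chunks.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")
```

Comparing this value against the one recorded for the same file at the storage element is one way to tell whether a mismatch comes from the transfer or from the checksum tool.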

        OU:

        • Nothing to report, site running well.
        • Need HS06 values for Gold 6230 CPUs.
        • Having some xrootd issues with Third-Party-Copy stress tests; following up with experts.


    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Presentation at SLAC ATLAS group meeting to push for Jupyter

        https://docs.google.com/presentation/d/1B9Xiwk9VwcUqNPxjTrVNwqFoT2UzRutpvn6eSvoJX1w/edit?usp=sharing

      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        All running smoothly. Mostly used by David Miller and Alexander Bogatskii for hyperparameter scanning of the CLARIANT network for top tagging. Several new users.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        After the ES update, everything is working smoothly. Need to define default apps in Kibana for the different spaces.

        Helping Ivan move to the DPA space.

        Helping Maria with the data popularity project and Petya with perfSONAR data.

        Helping Nikolai H with data reported by XCache.

        Some issues with perfSONAR data replay from tape.

        Should work on site-specific dashboards.

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        Changes to how Rucio presents the VP service to JEDI are now in production and passing my tests.

        JEDI logs currently show no VP activity, even though VP jobs are arriving at both AGLT2 and Prague2 (but not at MWT2, as our ANALY queue is offline).

        Created ANALY_MWT2_VP and am now trying to get jobs to come to it; these jobs should read through XCache and write out to AGLT2.


    • 14:40 14:45
      AOB 5m