US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Name
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.3 and 3.4.37 (tomorrow)

      • HTCondor-CE 4.0.1 (3.5 only, upgrade notes)
      • HTCondor 8.8.5 in 3.5, including new default security configuration (upgrade notes)
      • HTCondor 8.8.5 in 3.4, without the new default security configuration above (upgrade notes)
      • Fixes to Slurm accounting probes

      Other

      • Any feedback on our community testing emails/process?
        • No feedback but Brian expressed interest in increased ATLAS engagement
      • ATLAS XCache has changed rather substantially and needs testing. New version available in the 'fresh' tag
        • Wenjing, Brian, and Ilija will work together on deploying the new version
      • Fred was wondering how BOINC hours could be accounted
        • Since BOINC work doesn't come through the CE, the full OSG accounting workflow can't be used and development work will be required to get the data out of BOINC
        • BrianL will discuss with the GRACC team to see if they can/are willing to accept data generated by ATLAS BOINC so that the data can be sent to APEL with the rest of the monthly accounting data
    • 13:20 14:00
      Topical Report
      • 13:20
        OSiRIS ATLAS Event Service 15m
        Speaker: Benjeman Jay Meekhof (University of Michigan (US))
    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Service:

        3 GGUS tickets:

        1) Analysis jobs failed with "not enough resource available". This was caused by a ulimit set on our Condor system; it is fixed.

        2) UCORE jobs failed at authentication due to a problematic worker node; the node has been rebuilt.

        3) 2 sites have transfer errors (timeouts) to AGLT2; unresolved. The ticket priority has been changed to low.

        Hardware:

        Received 3 R740xd2 storage nodes (dCache pool nodes): 1 to MSU, 2 to UM. Awaiting provisioning.

         

        Update on BOINC operation at AGLT2:

        Reminder: we had difficulties with the kernel not effectively applying a lower priority to the BOINC jobs. They ended up receiving roughly half of the CPU cycles, which was never the goal. At the last meeting, Wenjing reported on switching to cgroups to try to tame that behavior, but we did not have results or graphs to show at the time.

        Here is a link to the current CPU efficiency, which is already back in a much more reasonable and acceptable range. We are continuing to tune this BOINC-backfilling model.

        https://monit-grafana.cern.ch/d/000000696/job-accounting-historical-data?orgId=17&from=now-7d&to=now&var-bin=1h&var-groupby=dst_experiment_site&var-country=USA&var-federation=All&var-resources=All&var-tier=2&var-cloud=US&var-site=All&var-computingsite=All&var-nucleus=All&var-cores=All&var-eventservice=All&var-groups=All&var-inputdatatypes=All&var-inputprojects=All&var-outputproject=All&var-gshare=All&var-resourceserporting=All&var-processingtype=All&var-jobtype=All&var-jobstatus=All&var-error_category=All&var-measurement_suffix=1h&var-measurement_suffix_CQ=1h&var-retention_policy=long&var-division_factor=1&panelId=34&fullscreen
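
        As a rough illustration of why cgroup CPU weights can change the split described above, the sketch below computes the fraction of CPU time each group would get under full contention, assuming cgroup v1 cpu.shares semantics (CPU time divided in proportion to the weights of busy sibling groups). The group names and weight values are hypothetical and are not the settings used at AGLT2.

```python
def cpu_fractions(shares):
    """Return each group's fraction of CPU time when every group is fully busy.

    Assumes proportional sharing: a group's fraction is its weight divided
    by the sum of all weights (cgroup v1 cpu.shares behavior under contention).
    """
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


# Hypothetical weights: 1024 is the cgroup v1 default, 2 is the minimum.
equal = cpu_fractions({"grid_jobs": 1024, "boinc_backfill": 1024})
weighted = cpu_fractions({"grid_jobs": 1024, "boinc_backfill": 2})

for label, result in (("equal weights", equal), ("low BOINC weight", weighted)):
    for name, fraction in result.items():
        print(f"{label}: {name} gets {fraction:.2%} of contended CPU time")
```

        With equal weights each group gets half of the contended CPU time, which resembles the split reported above. With a much lower weight, backfill work mostly runs on cycles the grid jobs leave idle.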

         
      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        UC

        • Various storage work
          • Combination of broken space reporting and upstream change to a new reaper caused the site to fill up and be taken offline for a few days
          • Added IPv6 addresses to the MWT2 dCache nodes
          • Ongoing IPv6 debugging issues. Seems the UChicago IPv6 route isn't being advertised correctly. Ryan is working with Edoardo and ESnet to fix
        • Final quotes sent to purchasing

        IU

        • Compute quotes finalized

        UIUC

        • Quarterly PM today from 8am to 8pm CST
      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Minor operations issues:

        1. A low rate of CA errors, only on outgoing transfers and only to certain external sites.  Might or might not be a problem on our end, but we have to investigate.

        2. We're occasionally still seeing too many squid failovers. 

        News:

        o Storage purchase is out, including a SLATE node.

        o More worker and storage purchases should go out in the next few days. 

        o We need a few manual operations from CERN DDM to set up the NESE GridFTP endpoint for testing.

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Nothing to report, all running well.

         

        UTA:

        - Storage server issues have shown up in both SWT2_CPB and UTA_SWT2, which required moving data to other servers. 

        - Seeing high loads on some data servers, which are causing problems with Event Index jobs. We are testing whether a firmware update fixes the problem.

        - SLATE node is being worked on.

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 14:30
      AOB 5m