US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Happy New Year! Welcome to the first US ATLAS Computing Facility meeting of 2019. We're trying out a new format to follow the new WBS 2.3 organization, and we expect this to be an iterative process.

      • Management updates
      • OSG-LHC updates
      • Area Reports - each meeting focuses on a topic from one of the five areas
      • To keep the meeting to a reasonable length (goal of 1 hour), we should post Site Updates in advance and only call out significant issues requiring discussion during the meeting.

      Notes:

      • Quarterly reports due at end of week 
      • May LHC Ops review - questions about computing - will discuss at the next meeting
      • Significant meetings
        • ATLAS Sites Jamboree at CERN first week of March
        • WLCG/OSG/HSF at JLab
        • HEPiX
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
      • XRootD 4.9.0 RC3 available in osg-testing
      • We're working on an XCache docker image and would be interested in getting it tested in the ATLAS environment as soon as it's ready
      • HTCondor-CE 3.2.0 is an important update for receiving backfill OSG jobs. Please update ASAP and merge in any changes from /etc/condor-ce/condor_mapfile.rpmnew
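
One way to review the pending changes before merging is to diff the active mapfile against the packaged `.rpmnew` copy. The sketch below uses only the Python standard library. The two paths are taken from the note above; the helper name `mapfile_diff` is hypothetical and chosen for illustration. It only prints differences and does not modify either file.

```python
import difflib
from pathlib import Path

# Paths from the note above. RPM writes a .rpmnew file when the installed
# config file has local modifications.
CURRENT = Path("/etc/condor-ce/condor_mapfile")
PACKAGED = Path("/etc/condor-ce/condor_mapfile.rpmnew")


def mapfile_diff(current_text, packaged_text):
    """Return unified-diff lines from the current mapfile to the packaged one."""
    return list(
        difflib.unified_diff(
            current_text.splitlines(),
            packaged_text.splitlines(),
            fromfile="condor_mapfile",
            tofile="condor_mapfile.rpmnew",
            lineterm="",
        )
    )


if __name__ == "__main__":
    if CURRENT.exists() and PACKAGED.exists():
        for line in mapfile_diff(CURRENT.read_text(), PACKAGED.read_text()):
            print(line)
    else:
        print("Mapfile or .rpmnew copy not found; nothing to compare.")
```

Lines prefixed with `+` exist only in the packaged file and are candidates to merge; lines prefixed with `-` are local entries that the packaged file does not contain.
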
    • 13:20 13:40
      Topical Area Report

      Each meeting, one of the WBS 2.3 areas will present on significant topics within that area (see http://bit.ly/facility-wbs):

      2.3.1 Tier-1 Operations -- Eric
      2.3.2 Tier-2s Operations -- Shawn
      2.3.3 HPC Operations -- Doug
      2.3.4 Analysis Facilities Operations -- Wei & Will
      2.3.5 Continuous Integration and Operations (CIOPS) -- Rob & Hiro

      This week we'll have Fred report on pricing/configurations from Dell.

      Next meeting we'll have a report from WBS 2.3.5 on Continuous Integration from Rob/Hiro.  Tentative schedule going forward:

      • WBS 2.3.5 - Rob/Hiro - January 30
      • WBS 2.3.1 - Eric - Feb 13
      • WBS 2.3.2 - Shawn - Feb 27
      • WBS 2.3.3 - Doug - Mar 13
      • WBS 2.3.4 - Wei/Will - Mar 27
      • 13:20
        Information on compute configuration 10m
        Speaker: Fred Luehring (Indiana University (US))
    • 13:40 14:20
      Site Reports
      • 13:40
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Electrical maintenance in the data center is scheduled for several non-consecutive days this month. During the maintenance days, we will run ES jobs to avoid draining part of the computing farm early. Running ES jobs on >14K cores now, on both score and mcore PQs. No visible impact on network traffic.
        • Scheduled downtime on Jan 22nd, 6am to 6pm, for a dCache upgrade.
        • space tokens
          • running low on DATADISK during the holiday break
          • SCRATCHDISK full on Sunday
          • discussion ongoing with ADC on increasing the "min low" limit on our tokens, to trigger DDM deletion earlier 
        • XCache server deployed, being tested
      • 13:45
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Set up a new AGLT2_HOSPITAL queue. The difference is that the input (read) data for its jobs is not local but comes from other US storage elements. It is also a multi-core queue.

        Incidents:

        • Massive data transfer failures occurred a few times between late December and early January, due to the failure of the authentication service in dCache and to one storage node losing network connectivity for a short period.

        • Some of the Condor worker nodes show unusually high load (over 1000), with or without jobs using the CPU. The symptoms include high load, a hanging /tmp directory, lost connection to the Condor head node, 100% swap usage, and hanging sanity-check processes. We updated a few worker nodes from 8.4.11 to 8.6.12 for debugging purposes.
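
The symptoms listed above (load over 1000, swap at 100%) lend themselves to a simple automated check. The following is a minimal sketch, not the site's actual sanity check: the thresholds, the function names, and the choice to read `/proc/meminfo` are all assumptions made for illustration.

```python
import os


def swap_used_fraction(meminfo_text):
    """Compute the fraction of swap in use from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return 0.0 if total == 0 else (total - free) / total


def node_problems(load1, swap_fraction, max_load=1000.0, max_swap=0.99):
    """Return human-readable problems; the default thresholds are illustrative."""
    problems = []
    if load1 > max_load:
        problems.append(f"1-minute load {load1:.0f} exceeds {max_load:.0f}")
    if swap_fraction >= max_swap:
        problems.append(f"swap usage at {swap_fraction:.0%}")
    return problems


if __name__ == "__main__":
    # Linux only: relies on /proc/meminfo and os.getloadavg().
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            swap = swap_used_fraction(f.read())
        for problem in node_problems(os.getloadavg()[0], swap):
            print(problem)
```

Keeping the parsing and the threshold logic in separate functions lets the check be exercised with captured `/proc/meminfo` text from an affected node, without needing to reproduce the load condition.
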

        System updates:

        • We had 2 dCache updates this quarter: from 4.2.6 to 4.2.12, and from 4.2.12 to 4.2.21. The latter is to support the xrootd-TPC and HTTP-TPC tests. During the first dCache update, we also updated the system firmware and SL7.5.

        • AFS client 1.8 has been compiled and installed on our CentOS 7 host. The version available for SL7 is still 1.6. We have not yet tested 1.8 on the SLC 7 nodes.

        • All the SL 7 nodes, including worker nodes, grid service nodes, and interactive nodes, have been upgraded from SL7.5 to SL7.6, and all security patches were applied promptly. All the SL7.6 hosts have been rebooted to run the most recent kernel (3.10.0-957.1.3.el7.x86_64).

        • All the worker nodes have had the lustre-client upgraded from 2.10.4 to 2.10.6 to support the most recent kernel (3.10.0-957.1.3.el7.x86_64).

        • All three of our OSG gatekeepers have had condor upgraded from 8.6.11 to 8.6.13.

      • 13:50
        MWT2 5m
        Speaker: Lincoln Bryant (University of Chicago (US))

        System updates:

        • Upgraded dCache from 3.1 (deprecated release) to 3.2 (supported release) at the first of the year.
        • We are also planning a downtime for the first week of Feb (tentative; just heard back from Ryan Harden today)
        • Will potentially upgrade dCache's Postgres to v10 (in preparation for migration to dCache 4.2) and get IPv6 plumbed at UC

        Working on equipment purchases:

        Chicago

        • dCache expansion (6 MD1200 shelves)
        • XCache server (and ATLAS production SLATE server for additional services)
        • Machine learning server (for containerized GPU Panda jobs and co-scheduled with the ML platform)

        • dCache s-node expansion: Dell MD1200 (12x10 TB), 100 TB usable/shelf. Quantity: 6
        • XCache server: Dell R740XD; 12TB 7.2K RPM NLSAS 12Gbps 512e 3.5in; 800GB SSD SATA Mix Use 6Gbps 512n 2.5in. Quantity: 1
        • ML server: Nortech 5U chassis, redundant power supplies, dual Intel Xeon 12-core 6146, 192GB DDR4-2666 ECC REG DIMM, six enterprise 480GB solid state drives, eight GeForce RTX 2080 Ti video cards, 2-port SFP+ 10Gb NIC, three years parts and labor. Quantity: 1
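
As a quick check of the dCache expansion figures above, the arithmetic can be written out. The shelf count, disks per shelf, disk size, and usable capacity per shelf come from the figures above; the totals are derived here and are not quoted from the site report.

```python
SHELVES = 6                 # MD1200 shelves
DISKS_PER_SHELF = 12        # 12 x 10 TB disks per shelf
DISK_TB = 10
USABLE_TB_PER_SHELF = 100   # usable capacity quoted per shelf

raw_tb_per_shelf = DISKS_PER_SHELF * DISK_TB        # raw capacity of one shelf
total_raw_tb = SHELVES * raw_tb_per_shelf           # raw capacity of the expansion
total_usable_tb = SHELVES * USABLE_TB_PER_SHELF     # usable capacity of the expansion
usable_fraction = USABLE_TB_PER_SHELF / raw_tb_per_shelf

print(f"raw: {total_raw_tb} TB, usable: {total_usable_tb} TB "
      f"({usable_fraction:.0%} of raw)")
# prints "raw: 720 TB, usable: 600 TB (83% of raw)"
```
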

         

        Indiana & Illinois

        • Zeroing in on final compute configurations

      • 13:55
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:00
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU: Nothing to report, everything is running smoothly.

        UTA_SWT2 & SWT2_CPB: Intermittent issue with the deletion service, which fails because gridftp servers report that a non-existent file is a directory. Trying to replicate.

        UTA_SWT2: The space reporting script had an issue and was not updating correctly. As a result, we filled our disks, which caused problems. The issue has been resolved, and a new script is in place that will avoid similar issues in the future. Will roll out to SWT2_CPB later this week.

        SWT2_CPB: No additional problems to report.

      • 14:05
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Electrical maintenance this week

        • Discussing migrating interactive nodes to dual-powered common pool to lessen impact of further interventions

        Migration to a shared-pool architecture has been approved by the liaison; this implies rethinking the "long" queue implementation

        Updating systemd and kernel for vulnerabilities at the same time

    • 14:20 14:25
      AOB 5m