US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)

Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 – 13:15
      Top of the Meeting 15m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
    • 13:15 – 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 – 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 – 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 – 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 – 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 – 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:15 – 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Except for a brief outage of the SE, stable operations at the Tier-1
          • the dCache core domain (admin node) ran out of memory in the early morning of Dec 31. This was quickly fixed by Hiro with a restart of the domain. As this had not happened before, we suspect the situation was related to unusual operations in conjunction with performance optimization of the replica-creation processes.
        • The Tier-1 was flagged by ADC operations for DDM transfer errors caused by missing files.
          • Our investigation has shown that this is not a problem at the BNL site. All files reported as lost in the context of this ticket were created by jobs running at the ORNL_Titan site, which uses the BNL SE to store its job output files. For the data transfer between ORNL_Titan and BNL, the pilot uses a specific site mover to move the produced files to the BNL SE. We found that some transfers suffer from a high failure rate and need many retries until, according to the site mover, they eventually succeed. However, even when they are reported as successfully transferred, the files do not exist at the destination SE (BNL). We suspect a race condition in the site mover code, most likely due to timing issues in the transfer failure-recovery section, that leads to the deletion of a file that was in fact successfully transferred to BNL; see the illustrative sketch at the end of this report. Note that FTS handles such cases correctly, but FTS is not managing these particular transfers.
          • Missing files were declared lost by the T1 Storage Management Group
        • Updated the FY16 capacity/procurement table.
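
        The following is a minimal, hypothetical sketch of how such a race could arise; it is not the actual pilot site mover code, and the copy/cleanup commands (gfal-copy, gfal-rm) and function names are placeholders. The point is only that a failure-recovery path which cleans up the destination asynchronously can remove a copy that in fact completed, after success has already been reported.

            # Hypothetical illustration of the suspected race (placeholder names only).
            import subprocess
            import threading

            def cleanup_destination(dst):
                # Failure-recovery path: remove what is assumed to be a partial copy.
                subprocess.run(["gfal-rm", dst], check=False)

            def transfer(src, dst, timeout=600):
                # Watchdog removes the destination if the transfer "takes too long".
                watchdog = threading.Timer(timeout, cleanup_destination, args=(dst,))
                watchdog.start()
                try:
                    subprocess.run(["gfal-copy", src, dst], check=True)
                    return True  # success is reported here ...
                finally:
                    # ... but if the timer fired between the copy finishing and this
                    # cancel(), the cleanup has already removed the good copy at the SE.
                    watchdog.cancel()
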
      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        With few exceptions, AGLT2 ran smoothly over the holiday break.

        A disk shelf at MSU lost 3 disks of a RAID-6 configuration just prior to New Year's Day, and we were unable to recover it (RAID-6 tolerates only two concurrent disk failures).  There were 16087 files on the pool, 3900 of which had replicas elsewhere at AGLT2.  Those replicas were promoted to primary copies; the remaining files were declared lost.  Shawn documented the procedure he followed on our Wiki:

        https://www.aglt2.org/wiki/bin/view/AGLT2/RecoverFromLostPool
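
        Purely as an illustration of the bookkeeping step in that procedure (the Wiki page above is the authoritative reference), the recovery amounts to splitting the files that lived on the dead pool into those with a surviving replica, which are promoted to primary copies, and those without, which are declared lost; the function and argument names below are hypothetical:

            # Hypothetical sketch: partition the files that were on the lost pool.
            def split_lost_pool(pool_files, replica_map):
                """pool_files: LFNs that were stored on the dead pool.
                replica_map: dict mapping LFN -> set of other pools holding a replica."""
                promote, declare_lost = [], []
                for lfn in pool_files:
                    if replica_map.get(lfn):
                        promote.append(lfn)       # replica elsewhere: promote to primary copy
                    else:
                        declare_lost.append(lfn)  # no surviving copy: declare lost to DDM
                return promote, declare_lost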

        The pool itself consisted of 750GB drives that are now in short supply at AGLT2, so we chose to permanently retire this shelf to supplement available spares.

        It was expected that the 10 MSU R630s would be in production by New Year's, but they are not yet ready; perhaps they will be installed by the end of this week.  Multiple infrastructure changes were required to bring these machines online, involving IT support at MSU as well as AGLT2 moves, and this has all taken longer to effect than expected.


        Some adjustments were made to the tables in the WLCG-v37 tab to reflect the actual deployment of the final 2015 Dell R630 WN purchases.

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is running well

        • Full of ATLAS jobs (MCORE, SCORE, Analy and Opport)
        • Rolling update to Condor 8.4.2
        • Renaming some Illinois nodes

         

        Networking

        • Illinois ⇔ Indiana high latency fixed
        • Was taking inefficient path via I2
        • Now direct route via ESnet

         

        New hardware status

        • UChicago
          • 18 Ceph Servers
          • Rolling online of new servers / upgrade of existing servers.
        • Indiana
          • 24 R630 (E5-2650 v3, 128GB) have been delivered
          • Waiting on PDU to provide power (110 V ⇒ 220 V)
      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Reported pre-Christmas reprocessing experience at the latest ADC meeting https://indico.cern.ch/event/469712/

        We had a few hours of downtime yesterday due to a bad LUN problem in GPFS.  This affects GPFS metadata response times and can cause BeStMan to slow down or become unresponsive.  The problem is over as of today.

        Smooth running other than that.

        No capacity updates.  550 TB + 24 worker nodes are still on the way from Dell.

         

      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - sites running well, no problems over the holidays

        - network quite stable as well

        - seeing same lost heartbeat jobs as Wei at SLAC, at both OU and LU, so this cannot be a site issue

         

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        We had a shutdown at UTA_SWT2 for a couple of days for electrical work in the facility.  The system came back with few problems.

         

        No major problems observed for SWT2_CPB during the break.

         

        Working on putting storage online.

      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        smooth operation most of the time

        have to limit s-core (single-core) production due to a high rate of "lost heartbeat" failures, an issue seen at many sites.

        5 disks in one storage node failed.  No data was lost, but we have to move data around.

        have one new CPU node online.  Will use Elasticsearch to compare the configuration of bare-metal nodes and variously sited VMs.
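
        A rough sketch of what such a comparison might look like, assuming configuration snapshots are indexed one document per host; the endpoint, the index name ("node-config") and the field names ("hostname", "node_type", "kernel") are assumptions for illustration, not the actual WT2 schema:

            # Query Elasticsearch for configuration documents of one class of nodes
            # and compare a field of interest (e.g. kernel version) across classes.
            from elasticsearch import Elasticsearch

            es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

            def configs_by_type(node_type, size=1000):
                resp = es.search(
                    index="node-config",
                    body={"query": {"term": {"node_type": node_type}}, "size": size},
                )
                return {h["_source"]["hostname"]: h["_source"] for h in resp["hits"]["hits"]}

            bare_metal = configs_by_type("bare-metal")
            vms = configs_by_type("vm")
            print(sorted({c.get("kernel") for c in bare_metal.values()}))
            print(sorted({c.get("kernel") for c in vms.values()}))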

        storage PO received by vendor

        working on new batch node procurement: 14 M630s, ~$110K

    • 15:15 – 15:20
      AOB 5m