US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)

Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 - 13:15
      Top of the Meeting 15m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
      • Apologies from Horst, Saul
      • Forthcoming facilities workshop in Clemson, https://indico.cern.ch/event/472826/
      • The week following Clemson there is a workshop on campus research HPC best practices that might be of interest, particularly for campus clusters: http://www.ncsa.illinois.edu/Conferences/ARCC/agenda.html

    • 13:15 - 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 - 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 - 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 - 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 - 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 - 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Andy:

       

      What needs to change for async caching support:

      1)    XrdPosix package to add async POSIX style I/O

      2)    XrdPss package to use async POSIX style I/O

      3)    XrdOucCache package to provide an async cache interface; this also impacts the XrdPosix package, because XrdPosix is responsible for loading and using the caching interface.

       

      The issue here is that all of these interfaces are public, which means we need to implement this without breaking ABI compatibility (i.e., it must be backward compatible).
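
      As a rough sketch of the kind of change involved (class and method names below are illustrative only, not the actual XrdOucCache or XrdPosix interfaces), one ABI-preserving pattern is to leave the published synchronous interface untouched and add the callback-based entry point in an extended interface whose default implementation falls back to the blocking call:

      // Illustrative sketch only: these are not the real XrdOucCache classes.
      // The published base interface keeps its layout, so existing cache plugins
      // and callers stay binary compatible; async support lives in an extension.

      // Completion callback supplied by the caller (e.g. the POSIX layer).
      class CacheIOCB
      {
      public:
         // Called when the read completes; result is bytes read or -errno.
         virtual void Done(int result) = 0;

         virtual     ~CacheIOCB() {}
      };

      // Existing, published synchronous interface: its layout must not change.
      class CacheIO
      {
      public:
         virtual int  Read(char *buff, long long offs, int rlen) = 0;

         virtual     ~CacheIO() {}
      };

      // Extended interface carrying the new async entry point.  The default
      // implementation falls back to the blocking Read(), so old-style cache
      // implementations keep working without modification.
      class CacheIO2 : public CacheIO
      {
      public:
         using CacheIO::Read;   // keep the synchronous overload visible

         virtual void Read(CacheIOCB &iocb, char *buff, long long offs, int rlen)
                          {iocb.Done(Read(buff, offs, rlen));}

         virtual     ~CacheIO2() {}
      };

      Whether the actual xrootd changes take exactly this form is up to the developers; the sketch only illustrates how async support can be added without breaking the existing public ABI.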

       

      Time estimates:

      a)    1 week to design and code up the new caching interface (4/5/16).

      b)    2 weeks to retrofit XrdPosix package to use (a) (3/21/16).

      c)    1 week to retrofit XrdPss package to use (b) (3/25/16).

       

      The above will always be available as work proceeds in the pssasync branch of the xrootd GitHub repo, so other parallel work can proceed.  Please be aware that I go on vacation 3/28/16 for 12 days with limited, if any, internet connectivity, so it is likely that we will not have a production-quality version until 4/15/16 to 4/20/16, depending on how it goes.

    • 14:15 - 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Smooth operations at full utilization of the compute farm (mostly MCORE)
        • AWS 100k core test still in preparation
          • Issues found with provisioning system based on APF
            • Now understood and fixed
          • Issues with S3 keys when running in 3 US regions
            • Understood and fixed by pilot developers
          • Scale test not to start before next week
        • Hiro has developed and deployed data management services for end users working on the shared T3 at BNL
          • Much improved bandwidth (over dq2-get) for data replication to T3 storage
        • Deployment of FY16 disk storage in progress
          • Hardware will be handed over to storage management group on or before March 15.
      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        pgsql was updated from 9.3.11 to 9.5.1 in advance of a dCache upgrade from the 2.10 to the 2.13 series.  This occurred during a full downtime on Tuesday of last week.  At the same time our WNs were completely rebuilt, updating Condor to 8.4.4, cvmfs to 2.1.20, the OSG-WN client to 3.3.8, and glibc to 2.12-1.166.el6_7.7, along with various other sl and sl-security updates.  Gatekeepers were updated to OSG 3.3.9, utilizing the OSG installation of Condor 8.4.3.  The master Condor machine is also on Condor 8.4.4, which works around a possible issue with the collector process in 8.4.3.

        Generally all upgrades went smoothly, modulo interactions between the various components.  The dCache update in particular surprised us with how quickly it went.  Several items were not immediately obvious, but a dCache documentation search showed the way.  The xrootd plugins required a bit more work, and consultations between Gerd, Ilija and Shawn will likely result in new plugin rpms in the near future.

        There are no outstanding issues with our site at this time.  However, we have noticed some recent jobs that are crashing WNs.  These jobs run a process called "JSAPrun.exe".  Condor will suddenly report jobs running this process with a LoadAv (as seen in condor_status) of many tens, or even many hundreds, which results in the WN either crashing or becoming unresponsive.  We then get hung_task_timeout dumps in /var/log/messages indicating processes that have been blocked for more than 120 seconds.  We have only just discovered this and have not yet had a chance to do any further digging, but we mention it here because other sites may also be seeing it.
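
        For sites that want a quick check for this pattern, the minimal sketch below (illustrative only, not an AGLT2 or Condor tool; the 4x-cores threshold is an arbitrary choice) compares the 1-minute load average from /proc/loadavg against the node's core count and flags anything far above it:

        // Illustrative sketch: flag a worker node whose 1-minute load average is
        // far beyond its core count, the "LoadAv of many tens or hundreds"
        // pattern described above.  Build with: g++ -o loadcheck loadcheck.cpp
        #include <cstdio>
        #include <unistd.h>

        int main()
        {
            double load1 = 0.0;
            FILE *f = std::fopen("/proc/loadavg", "r");
            if (!f || std::fscanf(f, "%lf", &load1) != 1) {
                std::perror("/proc/loadavg");
                if (f) std::fclose(f);
                return 2;
            }
            std::fclose(f);

            long cores = sysconf(_SC_NPROCESSORS_ONLN);   // online cores on this WN
            double threshold = 4.0 * cores;               // arbitrary "runaway" factor

            if (load1 > threshold) {
                std::printf("RUNAWAY? load1=%.1f cores=%ld threshold=%.1f\n",
                            load1, cores, threshold);
                return 1;   // non-zero exit so a cron or monitoring wrapper can alert
            }
            std::printf("OK load1=%.1f cores=%ld\n", load1, cores);
            return 0;
        }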

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well, except for IU

        • Networking problems at IU
        • Last week the issues were in the Indiana GigaPoP, related to the ESnet conversion
        • Down again as of last night, tickets pending
        • Condor pool offline

         

        Scan for the latest OpenSSL bug (the DROWN attack) shows MWT2 clean

         

        Minor update of dCache to 2.10.56-1

        • Helped with some XrootD door issues
        • Removed an old monitoring plugin that was causing java null pointer exceptions
        • Still some issues; Lincoln is following up with Gerd

         

        New Disk at UChicago

        • Still in process of migrating LOCALGROUPDISK to Ceph
        • Migrating user data from older Ceph system to new Ceph (many tiny files).
        • Servers will be converted to dCache (~350TB)

         

        OSG 3.3.9

        • New lsm-get in use, removing the need for DCAP at MWT2
        • Reports to Elastic Search
        • Will be switching compute nodes to OSG 3.3.9 wn client

         

        minRSS and maxRSS now set

        • New Panda Queues for HIMEM (see the sketch after this list)
          • MWT2_HIMEM (2G-5G) - only nodes with >= 5GB/core
          • MWT2_HIMEM_MCORE (2G-3G) - only nodes with >= 3GB/core
        • ANALY_MWT2_MCORE (cpus=8, maxrss 16GB)
          • Very busy with jobs
          • But users do not use all cores
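
        A purely illustrative sketch (this is not PanDA or AGIS brokerage code; only the memory windows quoted above come from the configuration) of how a job's requested RSS maps onto the new HIMEM queues:

        #include <cstdio>
        #include <string>

        // Illustrative only: encode the minRSS/maxRSS windows quoted above to show
        // how a job's requested RSS selects a HIMEM queue.  Real routing is done by
        // PanDA against the configured queue parameters, not by code like this.
        std::string himemQueueFor(double rssGB, bool multicore)
        {
            if (multicore) {
                // MWT2_HIMEM_MCORE: 2-3 GB jobs, only on nodes with >= 3 GB/core
                if (rssGB > 2.0 && rssGB <= 3.0) return "MWT2_HIMEM_MCORE";
            } else {
                // MWT2_HIMEM: 2-5 GB jobs, only on nodes with >= 5 GB/core
                if (rssGB > 2.0 && rssGB <= 5.0) return "MWT2_HIMEM";
            }
            return "";   // not a HIMEM job: handled by the existing queues
        }

        int main()
        {
            std::printf("%s\n", himemQueueFor(4.0, false).c_str());  // MWT2_HIMEM
            std::printf("%s\n", himemQueueFor(2.5, true).c_str());   // MWT2_HIMEM_MCORE
            return 0;
        }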

         

        ATLAS Analytics

        • Now keeping 1 copy of the data at Clemson, 1 copy at UC with redundant head nodes.
        • Currently riding out a scheduled downtime at Clemson. Kibana was up but now seems to be down.
        • Users have been notified.

         

        misc

        • Cleaning Nagios cruft and converting to Icinga.
        • Building SL7 machines and puppet rules for non-critical services.
        • No plans to run OSG software on SL7 for now.
      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - All sites running fine, no problems

        - Work ongoing to get OSG installed on new OSCER cluster. Currently working through some SELinux issues.

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2

        Facility electrical work forced a shutdown over the weekend.  During the shutdown we added memory to the nodes that had 24GB of memory.

        • Provides ~320 additional single job slots or ~80 additional multi-core slots
        • Doubles multicore capacity

        SWT2_CPB

        Bringing 400TB of storage online.

         

        UTA - Expecting network interruption this weekend.

      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Putting SMR-based storage (~1PB usable) in service, to be added as needed.  Moving selected data to SMR storage (a small selection sketch follows the list):

        • large files (1GB+) that haven't been accessed for 2+ years.
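
        A minimal selection pass along these lines (illustrative only, not a SLAC/WT2 tool; the scan root is a placeholder and the filesystem must be maintaining atime) would list candidates by size and last access time:

        // Illustrative sketch: list regular files >= 1 GB whose last access time is
        // more than ~2 years ago, i.e. candidates for migration to SMR storage.
        // Build with: g++ -std=c++17 -o smr_scan smr_scan.cpp
        #include <cstdint>
        #include <cstdio>
        #include <ctime>
        #include <filesystem>
        #include <sys/stat.h>

        namespace fs = std::filesystem;

        int main(int argc, char *argv[])
        {
            const char *root = (argc > 1) ? argv[1] : ".";            // directory tree to scan
            const std::uintmax_t minSize = 1024ULL * 1024 * 1024;     // 1 GB
            const std::time_t twoYears = 2LL * 365 * 24 * 3600;       // ~2 years, in seconds
            const std::time_t now = std::time(nullptr);

            for (const auto &entry : fs::recursive_directory_iterator(
                     root, fs::directory_options::skip_permission_denied)) {
                if (!entry.is_regular_file()) continue;
                if (entry.file_size() < minSize) continue;

                struct stat st;                                       // need atime, so use stat()
                if (stat(entry.path().c_str(), &st) != 0) continue;
                if (now - st.st_atime > twoYears)
                    std::printf("%s\n", entry.path().c_str());        // migration candidate
            }
            return 0;
        }

        In practice the candidate list would more likely come from the storage system's own metadata than from a filesystem walk, but the selection criteria are the ones quoted above.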

        Putting batch nodes in service via OpenStack. Delayed due to SLAC computing center personnel change.

    • 15:15 - 15:20
      AOB 5m