US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      The OSG BDII is shutting down, and the ATLAS SAM tests are changing the way in which they select queues to test.  From the announcement that was sent out yesterday by Ryu Sawada:

      "The ATLAS SAM tests are going to change the way they select the queues for the SAM tests. The selection so far was done using BDII information except for HTCONDOR-CEs. Soon it will be done selecting from the queues that are effectively used, i.e. the queues attached to the PandaQueues in AGIS and a new flag ETF_default=1. "

      "No negative impact is expected. But please watch SAM results of your site, and if you find any false results, please contact us for the correction by sending a ticket to GGUS."
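
      The selection rule described in the announcement can be sketched as a simple filter. This is only an illustration: the record layout (keys "name", "panda_queues", "ETF_default") and the queue names are hypothetical stand-ins, not the real AGIS schema, and it assumes both conditions (attached to a PanDA queue and flagged ETF_default=1) must hold.

```python
from typing import Dict, List


def select_etf_queues(ce_queues: List[Dict]) -> List[Dict]:
    """Return queues attached to a PanDA queue and flagged ETF_default=1.

    The dictionary keys used here are hypothetical, chosen for illustration.
    """
    selected = []
    for queue in ce_queues:
        attached = bool(queue.get("panda_queues"))   # used by at least one PanDA queue
        flagged = queue.get("ETF_default") == 1      # explicitly marked for ETF/SAM testing
        if attached and flagged:
            selected.append(queue)
    return selected


# Made-up example records.
example = [
    {"name": "ce01/long", "panda_queues": ["SITE_A"], "ETF_default": 1},
    {"name": "ce01/test", "panda_queues": [], "ETF_default": 1},
    {"name": "ce02/short", "panda_queues": ["SITE_A"], "ETF_default": 0},
]
print([q["name"] for q in select_etf_queues(example)])
```

      A queue that is no longer attached to any PanDA queue, or that lacks the flag, would drop out of the SAM tests under this reading, which is why sites are asked to watch their results.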

       

      https://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome

      JSON reporting of space usage is now active for all US sites. 

       

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      perfSONAR v4.0 RC3 is out.  Hoping this is basically the final version for v4.0.  If no problems are found, a release could happen in a couple of weeks.  We will need to get our sites updated (auto-updates should run, but this needs checking).

      Working on a new mesh configuration.  See https://meshconfig-itb.grid.iu.edu/  It will become the production version at http://meshconfig.grid.iu.edu soon (within about a week).  Everyone can get an account if interested.  Admin access for specific meshes must be requested if needed.

      Lots of reorganization of network service components is planned in OSG.  Some ITB instances will be removed and resources (memory/CPU) rebalanced.  New monitoring will be Docker-based ETF running on a CentOS 7.3 VM: https://gitlab.cern.ch/etf/docker/blob/master/README.md  All services will need updates once perfSONAR v4.0 is released.

      Next week is the LHCONE/LHCOPN meeting at BNL.  Hope some of you will be attending.  https://indico.cern.ch/event/581520/

      Analytics on network metrics are showing occasional packet loss problems at various locations.  Need to start opening tickets (after perfSONAR v4).

      Analytics links:  

       http://tiny.cc/PktLossNoUnknown  (Shows 6 months of packet loss by src/dest)

       http://tiny.cc/pSLink   (Shows network stats by specific site)

      Test emails by subscription are being issued, e.g.:

      Dear Shawn McKee,

              this mail is to let you that there was a significant change in packet loss detected by PerfSONAR.

      The site CA-SCINET-T2 (142.150.19.61)'s links got improved, total number from 5 to 0 links.
      These are all the bad links for the past hour:


      Best regards,
      ATLAS AAS

      Comments from Rob:  Improve the email messages to make what is being communicated obvious.  
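
      One way to act on this comment is to state the site, the direction of the change, and the affected links explicitly. The sketch below is hypothetical: the function name, parameters, and wording are assumptions made for illustration, not the alerting service's actual code.

```python
from typing import List


def format_packet_loss_alert(recipient: str, site: str, host: str,
                             bad_before: int, bad_after: int,
                             bad_links: List[str]) -> str:
    """Build a plain-text alert that states up front what changed and where."""
    if bad_after < bad_before:
        trend = "improved"
    elif bad_after > bad_before:
        trend = "degraded"
    else:
        trend = "unchanged"
    lines = [
        f"Dear {recipient},",
        "",
        f"Packet loss status for {site} ({host}): {trend}.",
        f"Links with significant packet loss went from {bad_before} to {bad_after}.",
        "",
    ]
    if bad_links:
        lines.append("Links with significant packet loss in the past hour:")
        lines.extend(f"  - {link}" for link in bad_links)
    else:
        # Say so explicitly instead of printing an empty list.
        lines.append("No links showed significant packet loss in the past hour.")
    lines += ["", "Best regards,", "ATLAS AAS"]
    return "\n".join(lines)
```

      Called with the values from the sample above (CA-SCINET-T2, 142.150.19.61, 5 links before, 0 after, empty list), this produces a message that says the status improved and that no bad links remain, rather than ending in an empty section.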

       

    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      0. Xrootd proxy cache server at AGLT2.

      1. Under heavy load, the xrootd proxy cache sometimes can't send data back to some clients (broken pipe/send failure). Currently focusing on checking OS/networking settings. Increasing "txqueuelen" on the NIC (ens2) from 1000 to 20000 did not help. Reviewing other parameters.

      2. Question about uncommitted data in memory when a client closes its connection. We would prefer to commit the data to disk to increase proxy efficiency, but this is not always possible under heavy load. Will discard those data.

      3. Occasional loss of file descriptors (including TCP). 22 files so far in the last two days of stress testing (out of 224k). _May_ be due to a Linux kernel semaphore bug which is fixed in the latest kernel. Need to confirm.

      4. After 1. is understood, will enter a long period of stress testing to check stability, memory usage, file/TCP descriptors, and networking.

      5. Packaging as a product. 
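
      For the checks in items 1 and 3, a minimal sketch of how the current transmit queue length and the number of open descriptors of a process could be read on Linux. It relies on the standard /sys/class/net/<interface>/tx_queue_len and /proc/<pid>/fd entries; the function names and the overridable root directories are assumptions added for illustration and testing.

```python
import os


def read_tx_queue_len(interface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the transmit queue length the kernel reports for an interface."""
    path = os.path.join(sysfs_root, interface, "tx_queue_len")
    with open(path) as handle:
        return int(handle.read().strip())


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count open file descriptors (files and sockets) of a process.

    Each entry under /proc/<pid>/fd is one open descriptor. Reading another
    user's process normally requires root privileges.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
```

      For example, read_tx_queue_len("ens2") should return 20000 after the change described in item 1, and sampling count_open_fds(pid) for the proxy process at intervals during the stress test would show whether the descriptor count keeps growing.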

    • 13:50 14:10
      Site movers 20m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      Ilija will send instructions on how to set up internal Xrootd doors for jobs to access sites' local storage.

    • 14:10 14:30
      OS performances testing 20m
      Speaker: Doug Benjamin (Duke University (US))

      Charge usage - ALCC - 5,527,098 hours

      ERCAP - 2,197,078 hours

    • 14:30 16:05
      Site Reports
      • 14:30
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))

        Site is in downtime now.

        dCache upgrade ongoing

        • software version : 2.10 to 3.0.11
        • network : channel bonding on dCache control nodes, for fault tolerance and load balancing; upgrade of switch software

        Issues with AGIS PanDA queue blacklisting system

        • resulted in loss of CPU cycles
        • bugs with regard to downtime cancellation and manual online operation, fixed
        • policy of switcher: when to drain a site before a downtime. ADC will revisit current policy and present to sites.
      • 14:35
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        We have updated from dCache 2.16 to the 3.x series.  There is a DB schema change that took 5 hours to complete.  Unfortunately, our monthly chimera dumps are now broken, as the schema change broke chimera_find.sh.  Hiro promises that he can fix this, and there is also a dCache ticket open for it.

        Our gatekeepers are updated to OSG 3.3.21 now, and the new [Resource Entry xxx] sections are in place in the 30-gip.ini file.  Following directions posted by Wei and John, AGIS was also updated to connect the listed queues.

        We have been notified that there will be a complete power outage in the UM server room on Saturday, June 24.  We plan to shut down all services on Friday afternoon, June 23, to prepare for this.  Hopefully we can get most services back up on Saturday afternoon, but that is far from certain this far in advance.

         

         

      • 14:40
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is full of jobs - operating well

         

        OSG 3.3.22 installed on all gatekeepers

        • Ready for AGIS reporting and BDII retirement
        • HTCondor 8.4.11 installed across the site
        • CVMFS 2.3.3 fully deployed (2.3.5 to be released soon)

         

        USERDISK decommissioning

        • SCRATCHDISK increased to 300TB
        • Waiting on ADC to change the PanDA queue to use SCRATCHDISK for output

         

        New switches at UChicago are fully deployed

        • Cisco 6509 has been retired - all nodes moved to new Junipers
        • In the future, the 8x10Gb connection to SciDMZ will be replaced with 2x40Gb

         

        Network monitoring and other issues

        • Full access to all switch data with SNMP at UChicago and Illinois
        • Working on the same at Indiana
        • Monitoring all port connections
        • A 2x10Gb uplink at Indiana was degraded to 1x10Gb (fixed)

         

         

         

         

      • 14:45
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Smooth operations with full sites, with the following exceptions:

        1) Checksum mismatch errors.  This generated a ticket for us, but the problem was on the source end.  Details can be found here http://egg.bu.edu/NET2%7binf:NET2%7d/gadget:Studies/section:report/2017-03/checksum_mismatch_exotics/

        2) ATLASSCRATCHDISK space is being used.

        3) Deletions are still happening via Bestman at our site.

        4) We still have a mystery problem with HTCONDOR-CE where the site drains for reasons not yet understood.  We're still investigating and have been in contact with Brian.

        5) Working intensively on NESE, MGHPCC floor and WAN networking.  Had a very useful meeting with Alastair Dewhurst re: CEPH/Gridftp and his "Echo" project.

         

      • 14:50
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - sites are mostly running well

        - occasionally held HTCondor-CE jobs on OU_OSCER_ATLAS; potentially related to internal OSCER authentication issues; following up

         

      • 14:55
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2

        Planning on updating the hardware soon.

        No production issues

        SWT2_CPB

        Had an issue with transfers from two Canadian sites (McGill, UToronto) due to asymmetric routing.  CANARIE discovered the misconfigured router and fixed it.

        An issue with space reporting exists.  One data server had a configuration issue and was reporting more space being used than what was physically on disk.  This has been resolved, and we will see how much the overall space reporting was affected.
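
        A consistency check of the kind that would catch this can be sketched as follows. The input layout (per-server "reported_used_tb" and "on_disk_tb" values), the server names, and the tolerance are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def find_overreporting_servers(servers: Dict[str, Dict[str, float]],
                               tolerance_tb: float = 0.1) -> List[str]:
    """Return names of servers whose reported usage exceeds what is on disk.

    `servers` maps a server name to its reported used space and the space
    actually measured on disk, both in TB (hypothetical layout).
    """
    flagged = []
    for name, usage in servers.items():
        excess = usage["reported_used_tb"] - usage["on_disk_tb"]
        if excess > tolerance_tb:
            flagged.append(name)
    return sorted(flagged)


# Made-up numbers for illustration only.
sample = {
    "ds01": {"reported_used_tb": 120.0, "on_disk_tb": 120.0},
    "ds02": {"reported_used_tb": 150.0, "on_disk_tb": 95.5},
}
print(find_overreporting_servers(sample))
```

        The small tolerance avoids flagging servers for rounding differences between the two measurements.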

      • 15:00
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:05 16:10
      AOB 5m