US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:15 13:30
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:30 13:35
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Wei: 

      Vincent Garonne moved back to Oslo, still working on DDM. Mario Lassnig is in charge of DDM. Martin Barisits is in charge of RUCIO.

      From Bob:

      A concrete plan is nearly in place for implementing WLCG diskless sites for production.  These sites would use storage at a "nearby" T2 site.  See: https://indico.cern.ch/event/642836/contributions/2608398/attachments/1467335/2268911/Diskless_28May.pdf

       

    • 13:35 13:45
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:45 13:50
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:50 13:55
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:55 14:00
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      North American Throughput Meeting
      ============================

      31-May-2017, 10-11 AM Eastern

      Attending:  Dave, Ilija, Shawn, Marian, Phillipe, Saul, Duncan, Garhan, Andy

      https://indico.cern.ch/event/640627/

      perfSONAR v4.0
        - Update progress and issues 
          Shawn reported on OSG networking upgrades and data loss


      Network Measurement Platform status and updates
          Marian reported on ps_etf and meshconfig.grid.iu.edu.  Review of the services monitored (https://etf-ps.cern.ch/etf/check_mk)

      Update on Analytics
          Ilija reported on work to find changes in packet-loss, throughput, etc.  See paper https://arxiv.org/pdf/1508.01280.pdf
          Trying this method on CERN-BNL link analysis.   Machine-learning also being tried on perfSONAR data to find anomalies in
          our data. (Someone working on Titan...need details)
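
          A minimal sketch of the kind of change detection described above, assuming a
          plain list of packet-loss measurements.  It compares the mean of two adjacent
          windows and is not the method from the linked paper; the function name, window
          size, and threshold are illustrative assumptions.

```python
from statistics import mean


def find_mean_shifts(values, window=5, threshold=0.02):
    """Return indices where the mean of the next `window` samples differs most
    from the mean of the previous `window` samples, if above `threshold`."""
    # Score every index that has a full window on both sides.
    scores = {}
    for i in range(window, len(values) - window + 1):
        before = mean(values[i - window:i])
        after = mean(values[i:i + window])
        scores[i] = abs(after - before)

    # Keep only local maxima so a single step change is reported once.
    shifts = []
    for i, score in scores.items():
        neighbours = [scores[j] for j in range(i - window, i + window + 1)
                      if j in scores]
        if score > threshold and score == max(neighbours):
            shifts.append(i)
    return shifts


if __name__ == "__main__":
    # Hypothetical packet-loss fractions with a step change at index 10.
    loss = [0.001] * 10 + [0.05] * 10
    print(find_mean_shifts(loss))
```

          A flat series returns an empty list; real perfSONAR data would need a window
          and threshold tuned to the measurement interval.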

      Round-table 
          Saul mentioned that MGHPCC is down for maintenance and this was an opportunity to go to 100G.  When the site is back up it will be
          using 100G.  Shawn asked about using that path for LHCONE; Saul: yes, should be used.

          Andy: minor update to pScheduler in the next few days (intermittent lock-up fix).  IPv6 may be having some issues.

          Marian: Question about Docker support for the full toolkit?   Andy: whether to do this will be discussed at next week's
          face-to-face in Ann Arbor.

          Lots of Q&A and account setup for meshconfig.

      AOB and next meeting

          Demo of OpenvSwitch / OpenFlow  + OpenStack for our next meeting.

          Watch email for next meeting date.

       

    • 14:00 14:05
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Wei and Andy:

      The inverse RUCIO Name2Name component is ready. It is a plugin that identifies file replicas at different sites as the same file and thus improves the Xrootd proxy cache's hit rate. It requires Xrootd release 4.7, which will be ready soon.

      Working with the RUCIO team to report the proxy cache's contents back to RUCIO (in progress).
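
      A minimal sketch of the inverse-mapping idea, not the plugin's actual code.  It
      assumes replicas follow a deterministic layout of the form
      .../rucio/<scope>/<xx>/<yy>/<name>, so the site-specific prefix can be dropped and
      a site-independent "scope:name" key recovered.  The function name and the host
      names are hypothetical.

```python
from urllib.parse import urlparse


def replica_cache_key(pfn_url, marker="/rucio/"):
    """Map a site-specific replica URL to a site-independent 'scope:name' key.

    Returns None when the path does not match the assumed layout.
    """
    path = urlparse(pfn_url).path
    idx = path.find(marker)
    if idx < 0:
        return None
    # Expected remainder: <scope parts...>/<xx>/<yy>/<name>
    parts = path[idx + len(marker):].split("/")
    if len(parts) < 4:
        return None
    name = parts[-1]
    # Scopes such as user.jdoe are assumed to be stored as user/jdoe.
    scope = ".".join(parts[:-3])
    return f"{scope}:{name}"


if __name__ == "__main__":
    # Hypothetical replicas of one file at two sites.
    site_a = ("root://xrd.site-a.example:1094//atlas/rucio/"
              "mc15_13TeV/ab/cd/AOD.01234._000001.pool.root.1")
    site_b = ("root://dcache.site-b.example//pnfs/site-b.example/data/atlas/"
              "atlasdatadisk/rucio/mc15_13TeV/ab/cd/AOD.01234._000001.pool.root.1")
    print(replica_cache_key(site_a))
    print(replica_cache_key(site_a) == replica_cache_key(site_b))
```

      With a shared key, a cache that already holds the file from one site can serve a
      request that names the replica at another site.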

    • 14:05 14:15
      Site movers 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      All completed except BNL_LOCAL-condor, which needs to set “deprecate_oldmover = True”. According to Xin:

      We can't change it for BNL_LOCAL-condor for the time being, as pilot 
      running ES jobs there isn't ready for it. It's said the patch is already 
      in, we can do the switch after it's released to production.

      I guess we don't need this item in the agenda in the future.

       

    • 14:15 14:25
      OS performances testing 10m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:25 14:40
      HPCs integration 15m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 14:40 16:15
      Site Reports
      • 14:40
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))

        T1 services are running fine. 

      • 14:45
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Over the Memorial Day weekend dCache suddenly "crashed".  Everything looked "normal" but writes were timing out.  This was tracked to "too many locks" and cleared by doing a vacuum on all postgres DBs.  A secondary issue then appeared, where the dCacheDomain was running out of memory (at 2g).  Increasing it to 3g resolved this problem.  We have been running stably since that time.
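
        A minimal sketch of vacuuming every database on a PostgreSQL server, as
        described above, using the standard vacuumdb client.  The helper names, the
        "postgres" user, and the dry-run default are assumptions; this is not the
        procedure actually run at AGLT2.

```python
import shutil
import subprocess


def vacuum_all_command(user="postgres", analyze=True):
    """Build a vacuumdb command line that covers all databases."""
    cmd = ["vacuumdb", "--all", "--username", user]
    if analyze:
        # Also refresh planner statistics after vacuuming.
        cmd.append("--analyze")
    return cmd


def run_vacuum(dry_run=True):
    """Return the command; execute it only when dry_run is False
    and vacuumdb is available on this host."""
    cmd = vacuum_all_command()
    if dry_run or shutil.which("vacuumdb") is None:
        return cmd  # nothing executed
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    print(" ".join(run_vacuum(dry_run=True)))
```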

        Reminder that we will be down for a power outage from noon on Friday, June 23 until sometime on Monday, June 26, when all services can be restarted.  We will do some software updates and dCache maintenance during this period.

         

      • 14:50
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is now full of jobs and operating well

         

        Updated site to OSG 3.3.24

         

        Retirement of several storage nodes in dCache

         

        Illinois Campus Cluster lost hypervisors

        • The building that houses the ICC, ACB (Advanced Computation Building), had a scheduled power outage on 5/20
          • Upgrades to power feed to building
          • All equipment was powered off for 6 hours
          • First power down for many items in over 8 years
        • Cold start caused various hardware failures
        • Hypervisor cluster would not restart, causing loss of MWT2 VMs
        • VMs were migrated to the campus base hypervisor, a move that had been planned for a few weeks later
          • MWT2 VLAN had been extended to campus hypervisors
          • Migration took several days
          • VMs now use NFS to access GPFS due to VLAN outside ACB
          • Took time to reconfigure VMs to new setup
          • Some tuning still needed but all is working well

         

        USERDISK down to only 12TB in use

         

        SCRATCHDISK deletion is still an issue

        • Deletion relies on rucio space reporting, which seems to show a lot more free space than dCache does
        • Lincoln/Judith working with Hiro/Armen

         

      • 14:55
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        We had the annual 1 day MGHPCC-wide power shutdown last week.  Notable improvements made:

        1. Migrated NFS to new servers (mostly a Tier 3 issue)

        2. 100G WAN gear was installed and configured.  Use of 100G is only waiting for NoX to switch us over.

        3. USERDISK is almost empty according to plan.  Moved storage to other tokens as requested by Armen.

        4. Lots of NESE activity.  A CEPH cluster was built from Harvard-contributed equipment as a test ATLAS DDM endpoint.

        Smooth operations with only minor problems.   High level of LIGO jobs for a few days.  

      • 15:00
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - Nothing to report; all sites are running well.

        - We're seeing some lost-heartbeat jobs, but we believe they are not site related: we see them at multiple sites, and BU is seeing them as well (right now, Tuesday afternoon).  In the past we've never been able to find a local source for them, and we believe they're PanDA related.

         

      • 15:05
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        1) Smooth operations over the past few weeks.

        2) Remaining WNs installed since the last meeting.

        3) Capacity spreadsheet updated to reflect current status.

      • 15:10
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:15 16:20
      AOB 5m