US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Preliminary ATLAS session at the OSG AHM (All Hands Meeting): https://indico.cern.ch/event/613466/

      ADC workshop on containers 3/8 : https://indico.cern.ch/event/612601/

       

    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      The ADC needs a US volunteer to work with the ADC team on the AGIS deprecated-fields issues.  See:
      https://indico.cern.ch/event/573928/contributions/2322349/attachments/1396258/2129079/Sitemovers.final.migration.16.01.17.pdf    

      There are plenty of jobs in the pipe to keep all sites busy.

      Overlay jobs are crashing the Frontier infrastructure
          (lots of data and many requests per job to the conditions DB, ConDB)

      Interim solution: temporarily limit the number of running/queued jobs to 2k/1k, which limits production to ~50k events/day.
      A long-term solution is under discussion.

      Users are advised to run only on DAODs, because non-experts running directly on AODs will most probably produce incorrect physics results.

      Private production is STRICTLY forbidden by physics coordination policy.

      Users can now finish/abort their own tasks from BigPanDA.

      See full list of DPA issues at
      https://docs.google.com/presentation/d/1mZ0kknGOJVNyF3onfKSEHUYAT0rHxBYcON7kYZvlj3k/edit?ts=58a0e636#slide=id.g1cd506785b_0_277

      Everyone is working hard to retire the BDII.  A new AGIS parameter to be used for monitoring, etf_default, will soon be added.  It will initially be set to true when a queue has pq_is_default=1 and pq_capability=score.  Tuning and testing are required.
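
      A minimal sketch of that initialization rule, assuming the queue parameters are available as a plain mapping (the field names pq_is_default and pq_capability come from the note above; the record layout itself is only an assumption for illustration):

      ```python
      # Illustrative sketch of the initial etf_default setting described above.
      # The queue record layout is an assumption; only the two field names and
      # the rule itself are taken from the note.
      def initial_etf_default(queue):
          """Initial etf_default value for a PanDA queue record."""
          return queue.get("pq_is_default") == 1 and queue.get("pq_capability") == "score"

      print(initial_etf_default({"pq_is_default": 1, "pq_capability": "score"}))  # True
      print(initial_etf_default({"pq_is_default": 1, "pq_capability": "mcore"}))  # False
      ```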

      We are trying to migrate all sites to the new mover control mechanism (use_newmover = true in AGIS).  There was some confusion, but LSM is a valid mover to use with the new controls.  If your queues have not yet moved, please expedite the transition with Alexey Anisenkov, Jose Caballero and the atlas-adc-agis group.

      ==============  This is an extract from a more complete Email  =======================

      On Monday 6 February 2017 at 10:00 CERN time we will change the tool used to manage the status of PanDA queues, which was previously done through curl 'http://panda.cern.ch:25943/server/pandamon/query?'. That tool will be switched off.

      Instead, PanDA queue states will be managed by the AGIS centralized blacklisting system and synchronized to the other systems. 

      You can monitor blacklisted queues and see detailed instructions and examples for the new blacklisting CLI here: http://atlas-agis.cern.ch/agis/pandablacklisting/list/. Alternatively, you can view blacklisting details in the usual AGIS PandaQueue view by selecting the fields in the upper menu, like here: http://atlas-agis.cern.ch/agis/pandaqueue/table_view/?&vo_name=atlas&show_2=0&show_3=0&show_4=0&state=ACTIVE&show_10=1&show_11=1&show_12=1&show_13=1

      ======================================
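
      For site admins who simply want to check whether one of their queues currently appears on that blacklisting page, a minimal sketch along the following lines should suffice (this just text-searches the rendered list page; it is not the official blacklisting CLI, and the queue name is a placeholder):

      ```python
      # Minimal check against the AGIS blacklisting list page (not the official CLI).
      # The queue name is a placeholder; substitute one of your own PanDA queues.
      import urllib.request

      URL = "http://atlas-agis.cern.ch/agis/pandablacklisting/list/"
      QUEUE = "EXAMPLE_SITE_MCORE"  # placeholder

      page = urllib.request.urlopen(URL, timeout=30).read().decode("utf-8", "replace")
      if QUEUE in page:
          print("%s appears on the blacklisting page -- check details in AGIS" % QUEUE)
      else:
          print("%s is not listed as blacklisted" % QUEUE)
      ```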

      I've gone through the "ADC List of Existing Monitors" document.  The Google doc compresses down to only about nine monitoring sites, with variations mostly in the parameters but sometimes in the names.  I come up with:

      http://panglia.triumf.ca
      (I have asked about the future of this one; a date on the page is 3+ years old, and it may go away simply through a lack of support from Victoria)

      http://bigpanda.cern.ch/dash/production/#cloud_US
      (there are many variations on the basic dashboard here)

      http://dashb-atlas-job.cern.ch/dashboard
      This one is interesting, as the basic "dashb-atlas" host string has MANY variations, including:
      dashb-atlas-job-prototype.cern.ch, dashb-atlas-ssb.cern.ch, dashb-atlas-ddm-acc.cern.ch, dashb-atlas-ddm.cern.ch

      http://dashb-fts-transfers.cern.ch
      This seems to be moving to the MONIT portal -- dashboard named "MONIT FTS Transfers Plots"

      http://adc-ddm-mon.cern.ch/ddmusr01/plots

      http://wlcg-sam-atlas.cern.ch

      https://etf-atlas-prod.cern.ch/etf/check_mk

      http://apfmon.lancs.ac.uk
      This is an ATLAS Panda dashboard and is outside the scope of the migration

      http://wlcg-squid-monitor.cern.ch

      Early indications are that these two will migrate:
      dashb-atlas-ddm.cern.ch (and, I assume, all variations on dashb-atlas-*)
      wlcg-sam-atlas.cern.ch

      New prototypes are
      monit-grafana.cern.ch
      monit.cern.ch/app/kibana

      All information to do with raw and processed data is in the new monitoring infrastructure.  I will report more information as developments proceed.

       

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))

      1) ~290M events in the MC production queue as of 2/14. MC16 simulation is coming soon, with MC16 digitization+reconstruction at the end of this month.

      2) Sites may have noticed some heavily failing pmerge tasks over the past couple of days. The tasks were aborted and resubmitted.

      3) Pilot release from Paul on 2/2 (v67.5) - details in posted shift summaries

      4) No follow-up issues for US sites

    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky, Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 14:10
      Site movers 20m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:10 14:30
      OS performances testing 20m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:30 16:05
      Site Reports
      • 14:30
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Tape Staging test: completed yesterday. BNL performed well. Transfer efficiency dropped to ~80% on 02/13 due to a short network switch interruption.
        • On Monday morning a network switch went down, interrupting dCache transfers for ~4 minutes; service recovered shortly afterwards. Discussion is ongoing about a more fault-tolerant network connection for the servers.
        • BNL FTS is now IPv6-enabled.
        • The new PanDA site mover configuration is enabled for all BNL queues.
        • Two CE gatekeepers have had HTCondor upgraded to a newer version (8.4.11), in preparation for publishing queue names to the OSG Collector using the new osg-configure package.
        • The "no AFS day" exercise is ongoing; we expect little impact on production jobs, though some user analysis jobs may be affected if they explicitly use files from CERN AFS.
      • 14:35
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        AGLT2 is running smoothly, with no apparent issues.

        Our core count has exceeded 10,000 for the first time since the site was established.

        All dCache servers and worker nodes from the Fall RBT purchases are online.  Equipment purchased with closeout funds is also now online, and the Capacity Spreadsheet is up to date.

        Some of the closeout funds were designated for replacement 1Gb and 10Gb switches.  The latter was necessary for the expansion of the UM worker node complement; we received an S4048-ON switch for this, and it is performing admirably.  The 1Gb N2048 switches are being used at both MSU and UM for public NIC connections for worker nodes that still rely on 1Gb NICs.  Stage-in rates have nearly tripled as a result, to 25-30 MB/s for each connected WN.  Three of these switches are still to be installed at MSU, but we expect this to be complete within the next two weeks, leaving only a small number of 1Gb public NICs on the PC6248.

        dCache was upgraded from 2.13.50 to 2.16.26.  The DB schema updates took ~5 hours to complete, but left us with extremely slow write rates and many timeouts.  Discussion with dCache support led us to vacuum some of the tables, which restored our efficiency.  The adjustments we made are now part of the dCache upgrade scheme.
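
        For reference, the vacuuming itself is ordinary PostgreSQL maintenance on the dCache databases; a minimal sketch, assuming a Chimera database named "chimera" and using illustrative table names (the actual tables to vacuum were identified with dCache support):

        ```python
        # Sketch of vacuuming dCache/PostgreSQL tables after a schema migration.
        # Database, user and table names are assumptions for illustration; the
        # real list of tables came out of the discussion with dCache support.
        import psycopg2

        conn = psycopg2.connect(dbname="chimera", user="postgres", host="localhost")
        conn.autocommit = True  # VACUUM cannot run inside a transaction block

        with conn.cursor() as cur:
            for table in ("t_inodes", "t_dirs"):  # illustrative table names
                cur.execute("VACUUM (VERBOSE, ANALYZE) %s" % table)
        conn.close()
        ```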

        We expect to upgrade our gatekeepers to the just-released OSG 3.3.21 with HTCondor 8.4.11 sometime in the next two weeks.

         

      • 14:40
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is full of jobs except Illinois

         

        Illinois is currently doing its monthly PM

        • Down until later tonight
        • Performing some GPFS upgrades
        • Moving MWT2 GPFS metadata to NVMe (more inodes, better performance)

         

        OSG 3.3.21 to be installed on all gatekeepers later this week

        • Has many fixes for AGIS reporting (for BDII retirement)
        • Will also go to HTCondor 8.4.11 at this time

         

        Working on migration to latest "wrapper" and job flow

        • Using wrapper 0.9.15; moving to 0.9.16
        • Cleans up job flow
        • CONNECT nearly done; MWT2 next

         

        Retirement of the Cisco 6509 caused a dCache issue

        • The Cisco handled the NAT routing to the public network
        • Two dCache storage servers had a routing misconfiguration that sent traffic through the NAT
        • The nodes were cut off from the WAN; we needed to fix the route to the nodes' public IPs (see the sketch after this list)
        • Monitoring caught the problem quickly
        • NAT is now provided by a dedicated node rather than by a switch
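
        The fix itself was routine Linux routing configuration; a minimal sketch of the kind of change involved, with placeholder addresses and interface names (the real values are site-specific):

        ```python
        # Sketch: repoint a storage server's default route at the real public
        # gateway instead of the retired NAT path.  All addresses and interface
        # names below are placeholders.
        import subprocess

        PUBLIC_GW = "192.0.2.1"   # placeholder public gateway
        PUBLIC_IF = "eth1"        # placeholder public-facing interface

        subprocess.run(["ip", "route", "replace", "default",
                        "via", PUBLIC_GW, "dev", PUBLIC_IF], check=True)

        # Verify the node can reach the WAN again
        subprocess.run(["ping", "-c", "3", "8.8.8.8"], check=True)
        ```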

         

        New switches are in

        • All switches are racked and installed
        • However, missing 40Gb modules for EX4500 to 9608
        • Downtime in a few weeks to install/configure the 40Gb connections

         

        Network monitoring at Illinois

        • Gaining access to SNMP on all switches used by MWT2
        • Will soon be able to create Grafana graphs of network usage (see the sketch below)
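
        A minimal sketch of the sort of SNMP polling involved, using the standard net-snmp command-line tools from Python (the switch hostname and community string are placeholders):

        ```python
        # Sketch of polling interface byte counters from a switch over SNMP,
        # the raw data behind the planned Grafana network-usage graphs.
        # Hostname and community string are placeholders.
        import subprocess

        SWITCH = "switch1.example.edu"
        COMMUNITY = "public"

        # 64-bit inbound octet counters for every interface (IF-MIB)
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, SWITCH, "IF-MIB::ifHCInOctets"],
            capture_output=True, text=True, check=True,
        )
        for line in out.stdout.splitlines():
            print(line)
        ```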

         

        New Purchases

        • UChicago
          • 26 R430
          • 1040 cores
          • Installed and some nodes are online
          • Waiting for new switches to bring remaining nodes online
        • Indiana
          • 15 R430
          • 600 cores
          • All nodes online
        • Illinois
          • 8 C6320
          • 448 cores
          • Not yet installed

        MWT2 Site total will be

        • 18520 cores
        • 192K HS06
        • Spreadsheet, GIP and OIM have all been updated

         

         

         

      • 14:45
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Switched to use_newmover in AGIS with no problems.  Still using the local LSM.

        Going to San Diego for the OSG AHM.

        Issues:

        1) Problem with jobs using the ~usatlas1 home directory and overloading NFS.  Moved to a stronger NFS server (~16 hours of downtime to switch).

        2) An issue came up in the past day that might be a world-wide problem: errors occurred in our SRM because BeStMan refuses to delete files with names longer than 231 characters.  Since 256 characters is the limit on many file systems, this may soon hit other sites.  DDM has been notified.  (See the sketch after this list.)

        3) Waiting for cable replacements from DELL to bring up new storage.

        4) NESE activities ramping up.

        5) Switchover to condor-ce ready.  Will ask Jose to switch us over after the meeting.

        6) Investigating bandwidth after the peering with LHCONE.  Asked Hiro for a load test. 

        7) Still need to make JSON file for Wei.
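
        A quick way for a site to check whether its storage namespace already contains file names that would trip the BeStMan deletion limit described in item 2 above; the 231-character threshold comes from the note, and the mount point is a placeholder:

        ```python
        # Scan a storage namespace mount for file names longer than the
        # 231-character limit that BeStMan refuses to delete (threshold taken
        # from the note above; the mount point is a placeholder).
        import os

        MOUNT = "/mnt/atlas_storage"   # placeholder namespace mount point
        LIMIT = 231

        for root, _dirs, files in os.walk(MOUNT):
            for name in files:
                if len(name) > LIMIT:
                    print(len(name), os.path.join(root, name))
        ```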

      • 14:50
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - in the process of adding the latest nodes into Schooner (OU_OSCER_ATLAS)

        - also in the process of updating the spreadsheet with both the dedicated and opportunistic Schooner capacity

         

      • 14:55
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • Implemented the space-usage.json file for reporting storage usage (see the sketch after this list).
        • Everything else running normally
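
        These notes do not spell out the schema of space-usage.json, so the structure below is purely illustrative of the kind of report being produced; the field names, token and numbers are placeholders, not the actual ATLAS DDM format:

        ```python
        # Purely illustrative sketch of generating a space-usage JSON report.
        # Field names, token and numbers are placeholders; the real schema for
        # space-usage.json is defined by ATLAS DDM and is not shown in these notes.
        import json
        import time

        report = {
            "generated": int(time.time()),
            "storage": [
                {"token": "ATLASDATADISK",
                 "total_bytes": 1_000_000_000_000,
                 "used_bytes": 750_000_000_000},
            ],
        }

        with open("space-usage.json", "w") as f:
            json.dump(report, f, indent=2)
        ```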

        SWT2_CPB:

        • Implemented the space-usage.json reporting file
        • Suffering from a lack of pilots in our queues
          • Last Thursday a partition filled up, which caused problems with our local batch system (Torque)
          • Torque was cleared up and fixed, but the pilot factories were no longer sending pilots to our queues
          • A different issue arose with Torque when our CE was rebooted on Sunday; this was cleared up on Monday
          • We started to receive pilots for the analysis queue and the single-core production queues, but we are receiving only about one job per hour in the multi-core queue
          • The problem is still occurring.  We need help from someone with access to APF (AutoPyFactory).

         

      • 15:00
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:05 16:10
      AOB 5m