US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Preliminary ATLAS session at the OSG AHM (All Hands Meeting): https://indico.cern.ch/event/613466/

      ADC workshop on containers 3/8 : https://indico.cern.ch/event/612601/

       

    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      The ADC needs a US volunteer to work with the ADC team on the AGIS deprecated-fields issues.  See:
      https://indico.cern.ch/event/573928/contributions/2322349/attachments/1396258/2129079/Sitemovers.final.migration.16.01.17.pdf    

      There are plenty of jobs in the pipe to keep all sites busy.

      Overlay jobs are crashing the Frontier infrastructure
          (lots of data and many requests per job to the conditions DB, ConDB)

      Interim solution: temporarily limit the number of running/queued jobs to 2k/1k, which limits production to ~50k events/day.
      A long-term solution is under discussion.

      Users are advised to run only on DAODs, because non-experts running directly on AODs will most probably produce incorrect physics results.

      Private production is STRICTLY forbidden by physics coordination policy.

      Users can now finish/abort their own tasks from BigPanDA.

      See full list of DPA issues at
      https://docs.google.com/presentation/d/1mZ0kknGOJVNyF3onfKSEHUYAT0rHxBYcON7kYZvlj3k/edit?ts=58a0e636#slide=id.g1cd506785b_0_277

      Everyone is working hard to retire the BDII.  A new AGIS parameter to be used for monitoring, etf_default, will soon be added.  It will initially be set to true when a queue has pq_is_default=1 and pq_capability=score.  Tuning and testing are required.
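
      A minimal sketch of that initialization rule, assuming the queue parameters are available as a plain mapping (the field names pq_is_default and pq_capability come from the note above; the record layout itself is only an assumption for illustration):

      ```python
      # Illustrative sketch of the initial etf_default setting described above.
      # The queue record layout is an assumption; only the two field names and
      # the rule itself are taken from the note.
      def initial_etf_default(queue):
          """Initial etf_default value for a PanDA queue record."""
          return queue.get("pq_is_default") == 1 and queue.get("pq_capability") == "score"

      print(initial_etf_default({"pq_is_default": 1, "pq_capability": "score"}))  # True
      print(initial_etf_default({"pq_is_default": 1, "pq_capability": "mcore"}))  # False
      ```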

      We are trying to migrate all sites to the new mover control mechanism (use_newmover = true in AGIS).  There was some confusion, but LSM is a valid mover to use with the new controls.  If your queues have not yet moved, please expedite the transition with Alexey Anisenkov, Jose Caballero and the atlas-adc-agis group.

      ==============  This is an extract from a more complete Email  =======================

      On Monday 6 February 2017 at 10:00 CERN time we will change the tool used to manage the status of PanDA queues, which was previously done through curl 'http://panda.cern.ch:25943/server/pandamon/query?'. That tool will be switched off.

      Instead, PanDA queue states will be managed by the AGIS centralized blacklisting system and synchronized to the other systems. 

      You can monitor blacklisted queues and see detailed instructions and examples for the new blacklisting CLI here: http://atlas-agis.cern.ch/agis/pandablacklisting/list/. Alternatively, you can view blacklisting details in the usual AGIS PandaQueue view by selecting the fields in the upper menu, like here: http://atlas-agis.cern.ch/agis/pandaqueue/table_view/?&vo_name=atlas&show_2=0&show_3=0&show_4=0&state=ACTIVE&show_10=1&show_11=1&show_12=1&show_13=1

      ======================================
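
      For site admins who simply want to check whether one of their queues currently appears on that blacklisting page, a minimal sketch along the following lines should suffice (this just text-searches the rendered list page; it is not the official blacklisting CLI, and the queue name is a placeholder):

      ```python
      # Minimal check against the AGIS blacklisting list page (not the official CLI).
      # The queue name is a placeholder; substitute one of your own PanDA queues.
      import urllib.request

      URL = "http://atlas-agis.cern.ch/agis/pandablacklisting/list/"
      QUEUE = "EXAMPLE_SITE_MCORE"  # placeholder

      page = urllib.request.urlopen(URL, timeout=30).read().decode("utf-8", "replace")
      if QUEUE in page:
          print("%s appears on the blacklisting page -- check details in AGIS" % QUEUE)
      else:
          print("%s is not listed as blacklisted" % QUEUE)
      ```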

      I've gone through the "ADC List of Existing Monitors" document.  The Google doc compresses down to only about nine monitoring sites, with variations mostly in the parameters but sometimes in the names.  I come up with:

      http://panglia.triumf.ca
      (I have asked about the future of this one; a date on the page is 3+ years old, and it may go away simply through a lack of support from Victoria)

      http://bigpanda.cern.ch/dash/production/#cloud_US
      (there are many variations on the basic dashboard here)

      http://dashb-atlas-job.cern.ch/dashboard
      This one is interesting, as the basic "dashb-atlas" host string has MANY variations, including:
      dashb-atlas-job-prototype.cern.ch, dashb-atlas-ssb.cern.ch, dashb-atlas-ddm-acc.cern.ch, dashb-atlas-ddm.cern.ch

      http://dashb-fts-transfers.cern.ch
      This seems to be moving to the MONIT portal -- dashboard named "MONIT FTS Transfers Plots"

      http://adc-ddm-mon.cern.ch/ddmusr01/plots

      http://wlcg-sam-atlas.cern.ch

      https://etf-atlas-prod.cern.ch/etf/check_mk

      http://apfmon.lancs.ac.uk
      This is an ATLAS Panda dashboard and is outside the scope of the migration

      http://wlcg-squid-monitor.cern.ch

      Early indications are that these two will migrate:
      dashb-atlas-ddm.cern.ch (and, I assume, all variations on dashb-atlas-*)
      wlcg-sam-atlas.cern.ch

      New prototypes are
      monit-grafana.cern.ch
      monit.cern.ch/app/kibana

      All information to do with raw and processed data is in the new monitoring infrastructure.  I will report more information as developments proceed.

       

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))

      1) ~290M events in the MC production queue as of 2/14. MC16 simulation is coming soon, with MC16 digitization+reconstruction at the end of this month.

      2) Sites may have noticed some heavily failing pmerge tasks over the past couple of days. The tasks were aborted and resubmitted.

      3) Pilot release from Paul on 2/2 (v67.5) - details in posted shift summaries

      4) No follow-up issues for US sites

    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky, Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 14:10
      Site movers 20m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:10 14:30
      OS performances testing 20m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:30 16:05
      Site Reports
      • 14:30
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Tape Staging test: completed yesterday. BNL performed well. Transfer efficiency dropped to ~80% on 02/13 due to a short network switch interruption.
        • On Monday morning a network switch went down, interrupting dCache transfers for ~4 minutes; service recovered shortly afterwards. Discussion is ongoing about a more fault-tolerant network connection for the servers.
        • BNL FTS is now IPv6-enabled.
        • The new PanDA site mover configuration is enabled for all BNL queues.
        • Two CE gatekeepers have had HTCondor upgraded to a newer version (8.4.11), in preparation for publishing queue names to the OSG Collector using the new osg-configure package.
        • The "no AFS day" exercise is ongoing; we expect little impact on production jobs, though some user analysis jobs may be affected if they explicitly use files from CERN AFS.
      • 14:35
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        AGLT2 is running smoothly, with no apparent issues.

        Our core count has exceeded 10,000 for the first time since the site was established.

        All dCache servers and worker nodes from the Fall RBT purchases are online.  Equipment purchased with closeout funds is also now online, and the Capacity Spreadsheet is up to date.

        Some of the closeout funds were designated for replacement 1Gb and 10Gb switches.  The latter was necessary for the expansion of the UM worker node complement; we received an S4048-ON switch for this, and it is performing admirably.  The 1Gb N2048 switches are being used at both MSU and UM for public NIC connections for worker nodes that still rely on 1Gb NICs.  Stage-in rates have nearly tripled as a result, to 25-30 MB/s for each connected WN.  Three of these switches are still to be installed at MSU, but we expect this to be complete within the next two weeks, leaving only a small number of 1Gb public NICs on the PC6248.

        dCache was upgraded from 2.13.50 to 2.16.26.  The DB schema updates took ~5 hours to complete, but left us with extremely slow write rates and many timeouts.  Discussion with dCache support led us to vacuum some of the tables, which restored our efficiency.  The adjustments we made are now part of the dCache upgrade scheme.
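
        For reference, the vacuuming itself is ordinary PostgreSQL maintenance on the dCache databases; a minimal sketch, assuming a Chimera database named "chimera" and using illustrative table names (the actual tables to vacuum were identified with dCache support):

        ```python
        # Sketch of vacuuming dCache/PostgreSQL tables after a schema migration.
        # Database, user and table names are assumptions for illustration; the
        # real list of tables came out of the discussion with dCache support.
        import psycopg2

        conn = psycopg2.connect(dbname="chimera", user="postgres", host="localhost")
        conn.autocommit = True  # VACUUM cannot run inside a transaction block

        with conn.cursor() as cur:
            for table in ("t_inodes", "t_dirs"):  # illustrative table names
                cur.execute("VACUUM (VERBOSE, ANALYZE) %s" % table)
        conn.close()
        ```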

        We expect to upgrade our gatekeepers to the just-released OSG 3.3.21 with HTCondor 8.4.11 sometime in the next two weeks.

         

      • 14:40
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is full of jobs except Illinois

         

        Illinois is currently doing its monthly PM

        • Down until later tonight
        • Performing some GPFS upgrades
        • Moving MWT2 GPFS metadata to NVMe (more inodes, better performance)

         

        OSG 3.3.21 to be installed on all gatekeepers later this week

        • Has many fixes for AGIS reporting (for BDII retirement)
        • Will also go to HTCondor 8.4.11 at this time

         

        Working on migration to latest "wrapper" and job flow

        • Using wrapper 0.9.15; moving to 0.9.16
        • Cleans up job flow
        • CONNECT nearly done; MWT2 next

         

        Retirement of the Cisco 6509 caused a dCache issue

        • The Cisco handled the NAT routing to the public network
        • Two dCache storage servers had a routing misconfiguration that sent traffic through the NAT
        • The nodes were cut off from the WAN; we needed to fix the route to the nodes' public IPs (see the sketch after this list)
        • Monitoring caught the problem quickly
        • NAT is now provided by a dedicated node rather than by a switch
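
        The fix itself was routine Linux routing configuration; a minimal sketch of the kind of change involved, with placeholder addresses and interface names (the real values are site-specific):

        ```python
        # Sketch: repoint a storage server's default route at the real public
        # gateway instead of the retired NAT path.  All addresses and interface
        # names below are placeholders.
        import subprocess

        PUBLIC_GW = "192.0.2.1"   # placeholder public gateway
        PUBLIC_IF = "eth1"        # placeholder public-facing interface

        subprocess.run(["ip", "route", "replace", "default",
                        "via", PUBLIC_GW, "dev", PUBLIC_IF], check=True)

        # Verify the node can reach the WAN again
        subprocess.run(["ping", "-c", "3", "8.8.8.8"], check=True)
        ```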

         

        New switches are in

        • All switches are racked and installed
        • However, missing 40Gb modules for EX4500 to 9608
        • Downtime in a few weeks to install/configure the 40Gb connections

         

        Network monitoring at Illinois

        • Gaining access to SNMP on all switches used by MWT2
        • Will soon be able to create Grafana graphs of network usage (see the sketch below)
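
        A minimal sketch of the sort of SNMP polling involved, using the standard net-snmp command-line tools from Python (the switch hostname and community string are placeholders):

        ```python
        # Sketch of polling interface byte counters from a switch over SNMP,
        # the raw data behind the planned Grafana network-usage graphs.
        # Hostname and community string are placeholders.
        import subprocess

        SWITCH = "switch1.example.edu"
        COMMUNITY = "public"

        # 64-bit inbound octet counters for every interface (IF-MIB)
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, SWITCH, "IF-MIB::ifHCInOctets"],
            capture_output=True, text=True, check=True,
        )
        for line in out.stdout.splitlines():
            print(line)
        ```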

         

        New Purchases

        • UChicago
          • 26 R430
          • 1040 cores
          • Installed and some nodes are online
          • Waiting for new switches to bring remaining nodes online
        • Indiana
          • 15 R430
          • 600 cores
          • All nodes online
        • Illinois
          • 8 C6320
          • 448 cores
          • Not yet installed

        MWT2 Site total will be

        • 18520 cores
        • 192K HS06
        • Spreadsheet, GIP and OIM have all been updated

         

         

         

      • 14:45
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Switched to use_newmover in AGIS with no problems.  Still using the local LSM.

        Going to San Diego for the OSG AHM.

        Issues:

        1) Problem with jobs using the ~usatlas1 home directory and overloading NFS.  Moved to a stronger NFS server (~16 hours of downtime to switch).

        2) An issue came up in the past day that might be a world-wide problem: errors occurred in our SRM because BeStMan refuses to delete files with names longer than 231 characters.  Since 256 characters is the limit on many file systems, this may soon hit other sites.  DDM has been notified.  (See the sketch after this list.)

        3) Waiting for cable replacements from DELL to bring up new storage.

        4) NESE activities ramping up.

        5) Switchover to condor-ce ready.  Will ask Jose to switch us over after the meeting.

        6) Investigating bandwidth after the peering with LHCONE.  Asked Hiro for a load test. 

        7) Still need to make JSON file for Wei.
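
        A quick way for a site to check whether its storage namespace already contains file names that would trip the BeStMan deletion limit described in item 2 above; the 231-character threshold comes from the note, and the mount point is a placeholder:

        ```python
        # Scan a storage namespace mount for file names longer than the
        # 231-character limit that BeStMan refuses to delete (threshold taken
        # from the note above; the mount point is a placeholder).
        import os

        MOUNT = "/mnt/atlas_storage"   # placeholder namespace mount point
        LIMIT = 231

        for root, _dirs, files in os.walk(MOUNT):
            for name in files:
                if len(name) > LIMIT:
                    print(len(name), os.path.join(root, name))
        ```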

      • 14:50
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - in the process of adding the latest nodes into Schooner (OU_OSCER_ATLAS)

        - also in the process of updating the spreadsheet with both the dedicated and opportunistic Schooner capacity

         

      • 14:55
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • Implemented the space-usage.json file for reporting storage usage (see the sketch after this list).
        • Everything else running normally
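
        These notes do not spell out the schema of space-usage.json, so the structure below is purely illustrative of the kind of report being produced; the field names, token and numbers are placeholders, not the actual ATLAS DDM format:

        ```python
        # Purely illustrative sketch of generating a space-usage JSON report.
        # Field names, token and numbers are placeholders; the real schema for
        # space-usage.json is defined by ATLAS DDM and is not shown in these notes.
        import json
        import time

        report = {
            "generated": int(time.time()),
            "storage": [
                {"token": "ATLASDATADISK",
                 "total_bytes": 1_000_000_000_000,
                 "used_bytes": 750_000_000_000},
            ],
        }

        with open("space-usage.json", "w") as f:
            json.dump(report, f, indent=2)
        ```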

        SWT2_CPB:

        • Implemented the space-usage.json reporting file
        • Suffering from a lack of pilots in our queues
          • Last Thursday a partition filled up, which caused problems with our local batch system (Torque)
          • Torque was cleared up and fixed, but the pilot factories were no longer sending pilots to our queues
          • A different issue arose with Torque when our CE was rebooted on Sunday; this was cleared up on Monday
          • We started to receive pilots for the analysis queue and the single-core production queues, but we are receiving only about one job per hour in the multi-core queue
          • The problem is still occurring.  We need help from someone with access to APF (AutoPyFactory).

         

      • 15:00
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:05 16:10
      AOB 5m