US ATLAS Computing Integration and Operations
-
-
13:00
→
13:15
Top of the Meeting 15m
Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
Preliminary ATLAS session at the OSG AHM: https://indico.cern.ch/event/613466/
ADC workshop on containers 3/8 : https://indico.cern.ch/event/612601/
-
13:15
→
13:20
ADC news and issues 5m
Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
The ADC needs a US volunteer to work with the ADC crew on AGIS deprecated-fields issues. See:
https://indico.cern.ch/event/573928/contributions/2322349/attachments/1396258/2129079/Sitemovers.final.migration.16.01.17.pdf
There are plenty of jobs in the pipe to keep all sites busy.
Overlay jobs were crashing the Frontier infrastructure:
lots of data / many requests per job from the conditions DB (ConDB)
Solution: temporarily limit the number of running/queued jobs to 2k/1k, limiting production to ~50k events/day
A long-term solution is under discussion.
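The temporary throttle described above can be expressed as a simple submission gate. The 2k running / 1k queued caps come from the minutes; the function and argument names below are illustrative, not the actual ADC implementation.

```python
# Hedged sketch of the temporary overlay-job throttle that protects
# the Frontier/ConDB infrastructure. Caps are from the minutes; the
# function name is hypothetical.

RUNNING_CAP = 2000   # max concurrently running overlay jobs
QUEUED_CAP = 1000    # max queued overlay jobs

def may_submit_overlay(running: int, queued: int) -> bool:
    """Return True if another overlay job may be queued without
    exceeding the temporary caps."""
    return running < RUNNING_CAP and queued < QUEUED_CAP

if __name__ == "__main__":
    print(may_submit_overlay(1500, 800))   # capacity left on both counts
    print(may_submit_overlay(2000, 500))   # running cap reached
```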
Recommend that users run only on DAOD, because non-experts running on AOD will most probably produce incorrect physics results
Private production is STRICTLY forbidden by physics coordination policy.
Users can now finish/abort their own tasks from bigpanda
See the full list of DPA issues at
https://docs.google.com/presentation/d/1mZ0kknGOJVNyF3onfKSEHUYAT0rHxBYcON7kYZvlj3k/edit?ts=58a0e636#slide=id.g1cd506785b_0_277
Everyone is working hard to obsolete the BDII. A new AGIS parameter to be used for monitoring, etf_default, will be added soon. The initial setting is true when a queue has pq_is_default=1 & pq_capability=score. Tuning and testing are required.
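The stated initial-setting rule for the new monitoring flag can be sketched as a one-line predicate. The field names pq_is_default and pq_capability follow the minutes; representing an AGIS queue record as a plain dict is an assumption for illustration.

```python
# Hedged sketch: initial value of the new AGIS etf_default flag, per
# the rule in the minutes (true when pq_is_default=1 and
# pq_capability=score). The dict-based queue record is illustrative.

def initial_etf_default(queue: dict) -> bool:
    """Compute the initial etf_default value for an AGIS panda queue."""
    return queue.get("pq_is_default") == 1 and queue.get("pq_capability") == "score"

if __name__ == "__main__":
    print(initial_etf_default({"pq_is_default": 1, "pq_capability": "score"}))  # True
    print(initial_etf_default({"pq_is_default": 1, "pq_capability": "mcore"}))  # False
```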
Trying to migrate all sites to the new mover control mechanism (use_newmover = true in AGIS). There was some confusion, but LSM is a valid mover to use with the new controls. If your queues have not moved, please expedite the transition with Alexey Anisenkov, Jose Caballero and the atlas-adc-agis group.
============== This is an extract from a more complete Email =======================
On Monday 6 February 2017 at 10AM CERN time we will change the tool to manage the status of PanDA queues, previously done through curl 'http://panda.cern.ch:25943/server/pandamon/query?'. This tool will be switched off.
Instead, PanDA queue states will be managed by the AGIS centralized blacklisting system and synchronized to the other systems.
You can monitor blacklisted queues, see detailed instructions and examples for the new blacklisting CLI here: http://atlas-agis.cern.ch/agis/pandablacklisting/list/. Alternatively you can also view blacklisting details in the usual AGIS Pandaqueue view, by selecting the fields in the upper menu like here: http://atlas-agis.cern.ch/agis/pandaqueue/table_view/?&vo_name=atlas&show_2=0&show_3=0&show_4=0&state=ACTIVE&show_10=1&show_11=1&show_12=1&show_13=1
======================================
I've gone through the "ADC List of Existing Monitors" document. The Google doc boils down to only about 9 monitoring sites, with variations mostly in the parameters, but sometimes in the names. I come up with:
http://panglia.triumf.ca
(I have made inquiries about the future of this one; a date on the page is 3+ years old, and it may go away simply through lack of support from Victoria)
http://bigpanda.cern.ch/dash/production/#cloud_US
(there are many variations on the basic dashboard here)
http://dashb-atlas-job.cern.ch/dashboard
This one is interesting, as the basic "dashb-atlas" host string has MANY variations, including:
dashb-atlas-job-prototype.cern.ch, dashb-atlas-ssb.cern.ch, dashb-atlas-ddm-acc.cern.ch, dashb-atlas-ddm.cern.ch
http://dashb-fts-transfers.cern.ch
This seems to be moving to monit portal -- dashboard named "MONIT FTS Transfers Plots"
http://adc-ddm-mon.cern.ch/ddmusr01/plots
http://wlcg-sam-atlas.cern.ch
https://etf-atlas-prod.cern.ch/etf/check_mk
http://apfmon.lancs.ac.uk
This is an ATLAS Panda dashboard and is outside the scope of the migration
http://wlcg-squid-monitor.cern.ch
Early indications are that these 2 will migrate:
dashb-atlas-ddm.cern.ch (and, I assume, all variations on dashb-atlas-*)
wlcg-sam-atlas.cern.ch
New prototypes are:
monit-grafana.cern.ch
monit.cern.ch/app/kibana
All information to do with raw and processed data is in the new monitoring infrastructure. I will report more information as developments proceed.
-
13:20
→
13:30
Production 10m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
1) ~290M events in the MC production queue as of 2/14. MC16 simulation is coming soon; MC16 digi+reco at the end of this month
2) Sites may have noticed some heavily failing pmerge tasks over the past couple of days. The tasks were aborted and resubmitted
3) Pilot release from Paul on 2/2 (v67.5) - details in posted shift summaries
4) No follow-up issues for US sites
-
13:30
→
13:35
Data Management 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5m
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5m
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:45
→
13:50
FAX and Xrootd Caching 5m
Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky, Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
14:10
Site movers 20m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
→
14:30
OS performance testing 20m
Speaker: Doug Benjamin (Duke University (US))
-
14:30
→
16:05
Site Reports
-
14:30
BNL 5m
Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- Tape staging test: completed yesterday; BNL performed well. Transfer efficiency dropped to ~80% on 02/13 due to a short network switch interruption.
- On Monday morning a network switch went down, interrupting dCache transfers for ~4 minutes; service recovered shortly afterward. There is ongoing discussion on a more fault-tolerant network connection for the servers.
- BNL FTS is now IPv6-enabled.
- The new PanDA site mover configuration is enabled for all BNL queues.
- Two CE gatekeepers had Condor upgraded to a newer version (8.4.11), in preparation for publishing queue names to the OSG Collector using the new osg-configure package.
- The "no AFS day" is ongoing; we expect little impact on production jobs, though some user analysis jobs may be affected if they explicitly use files from CERN AFS.
-
14:35
AGLT2 5m
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
AGLT2 is running smoothly, with no apparent issues.
Our core count has exceeded 10,000 for the first time since the site was established.
All dCache servers and worker nodes from the Fall RBT purchases are online. Equipment purchased with closeout funds is also now online, and the Capacity Spreadsheet is up to date.
Some of the closeout funds were designated for replacement 1Gb and 10Gb switches; the latter was necessary for the expansion of the UM worker-node complement. We received an S4048-ON switch for this, and it is performing admirably. The 1Gb N2048 switches are being used at both MSU and UM for public NIC connections on worker nodes that still rely on 1Gb NICs. Stage-in rates have nearly tripled as a result, to 25-30MB/s for each connected worker node. Three of these switches have yet to be installed at MSU, but we expect this to be completed within the next 2 weeks, leaving only a small number of 1Gb public NICs on the PC6248.
dCache was upgraded from 2.13.50 to 2.16.26. The DB schema updates took ~5 hours to complete and left us with extremely slow write rates and many timeouts. Discussion with dCache support led us to vacuum some of the tables, which restored our efficiency. The adjustments we made are now part of the dCache upgrade procedure.
We expect to upgrade our gatekeepers to the just-released OSG 3.3.21 with HTCondor 8.4.11 sometime in the next 2 weeks.
-
14:40
MWT2 5m
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site is full of jobs, except Illinois
Illinois is currently doing its monthly PM (preventive maintenance)
- Down until later tonight
- Performing some GPFS upgrades
- Moving MWT2 GPFS metadata to NVME (more inodes, better performance)
OSG 3.3.21 to be installed on all gatekeepers later this week
- Has many fixes for AGIS reporting (for BDII retirement)
- Will also go to HTCondor 8.4.11 at this time
Working on migration to latest "wrapper" and job flow
- Using wrapper 0.9.15; moving to 0.9.16
- Cleans up job flow
- CONNECT nearly done; MWT2 next
Retirement of CISCO 6509 caused a dCache issue
- The CISCO handled NAT routing to the public network
- Two dCache storage servers had a routing misconfiguration that sent traffic through the NAT
- The nodes were cut off from the WAN; we needed to fix the route to the nodes' public IPs
- Monitoring caught the problem quickly
- NAT is now provided by a dedicated node rather than a switch
New switches are in
- All switches are racked and installed
- However, the 40Gb modules for connecting the EX4500 to the 9608 are missing
- Downtime in a few weeks to install/configure the 40Gb connections
Network monitoring at Illinois
- Gaining access to SNMP on all switches used by MWT2
- Will soon be able to create Grafana graphs of network usage
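The basic step behind such graphs is converting two samples of an SNMP interface octet counter into a rate. This is a hedged sketch under stated assumptions: a 64-bit counter (e.g. ifHCInOctets) and a fixed polling interval; the helper name is hypothetical and not the actual MWT2 tooling.

```python
# Hedged sketch: bandwidth from two SNMP octet-counter samples,
# handling a single counter wrap. ifHCInOctets is a 64-bit counter
# per IF-MIB; the function name is illustrative.

COUNTER_MAX = 2**64  # 64-bit high-capacity counter width

def octets_per_second(prev: int, curr: int, interval_s: float) -> float:
    """Rate from two counter samples taken interval_s apart."""
    delta = (curr - prev) % COUNTER_MAX  # modulo handles one wraparound
    return delta / interval_s

if __name__ == "__main__":
    # 300 MB transferred over a 30 s polling interval -> 10 MB/s
    print(octets_per_second(1_000_000, 301_000_000, 30.0))
```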
New Purchases
- UChicago
- 26 R430
- 1040 cores
- Installed and some nodes are online
- Waiting for new switches to bring remaining nodes online
- Indiana
- 15 R430
- 600 cores
- All nodes online
- Illinois
- 8 C6320
- 448 cores
- Not installed as yet
MWT2 Site total will be
- 18520 cores
- 192K HS06
- Spreadsheet, GIP and OIM have all been updated
-
14:45
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
Switched to use_newmover in AGIS with no problems. Still using the local lsm.
Going to San Diego.
Issues:
1) Problem with jobs using the ~usatlas1 home directory and overloading NFS. Moved to a stronger NFS server (~16 hours of downtime for the switch).
2) An issue came up in the past day that might be a world-wide issue: errors occurred in our SRM where Bestman refuses to delete files whose names are longer than 231 characters. Since 256 characters is the limit on many file systems, this may soon hit other sites. DDM has been notified.
3) Waiting for cable replacements from DELL to bring up new storage.
4) NESE activities ramping up.
5) Switchover to condor-ce ready. Will ask Jose to switch us over after the meeting.
6) Investigating bandwidth after the peering with LHCONE. Asked Hiro for a load test.
7) Still need to make JSON file for Wei.
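Issue 2 above (the Bestman deletion failures) amounts to a check on the length of the final path component. This is a hedged sketch, not NET2's actual tooling: the 231-character threshold is from the minutes, and the function name is illustrative.

```python
# Hedged sketch: flag file names whose final path component exceeds a
# length limit, as in the Bestman deletion issue. Threshold per the
# minutes; function name is hypothetical.

import os

def name_too_long(path: str, limit: int = 231) -> bool:
    """True if the last path component is longer than the limit."""
    return len(os.path.basename(path)) > limit

if __name__ == "__main__":
    short = "/srm/atlas/rucio/data/AOD.root"
    long_name = "/srm/atlas/rucio/" + "x" * 240 + ".root"
    print(name_too_long(short))      # False
    print(name_too_long(long_name))  # True
```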
-
14:50
-
14:55
SWT2-UTA 5m
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2:
- Implemented the space-usage.json file for reporting storage usage
- Everything else running normally
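Producing a space-usage.json style summary can be sketched as below. The real file follows a schema agreed with ADC/DDM; the field names (total_space, used_space, free_space) and the numbers here are placeholders for illustration, not the actual SWT2 layout.

```python
# Hedged sketch of writing a space-usage.json storage summary. Field
# names and values are illustrative assumptions, not the ADC schema.

import json

def write_space_usage(path: str, total_bytes: int, used_bytes: int) -> None:
    """Write a minimal storage-usage summary as JSON."""
    report = {
        "total_space": total_bytes,
        "used_space": used_bytes,
        "free_space": total_bytes - used_bytes,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)

if __name__ == "__main__":
    write_space_usage("space-usage.json", 2_000_000_000_000, 1_500_000_000_000)
    with open("space-usage.json") as f:
        print(json.load(f)["free_space"])  # 500000000000
```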
SWT2_CPB:
- Implemented the space-usage.json reporting file
- Suffering from a lack of pilots in our queues
- Last Thursday a partition filled up, which caused problems with our local batch system (Torque)
- Torque was cleaned up and fixed, but the pilot factories were no longer sending pilots to our queues
- A different Torque issue arose when our CE was rebooted on Sunday; it was cleared up on Monday
- We started to receive pilots for the analysis queue and the single-core production queues, but we are receiving only one job per hour in the multi-core queue
- The problem is still occurring. We need help from someone with access to APF.
-
15:00
WT2 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
-
16:05
→
16:10
AOB 5m