Preliminary ATLAS session at the OSG AHM: https://indico.cern.ch/event/613466/
ADC workshop on containers on 3/8: https://indico.cern.ch/event/612601/
The ADC needs a US volunteer to work with the ADC team on the AGIS deprecated-fields issues. See:
https://indico.cern.ch/event/573928/contributions/2322349/attachments/1396258/2129079/Sitemovers.final.migration.16.01.17.pdf
There are plenty of jobs in the pipe to keep all sites busy.
Overlay jobs are crashing the Frontier infrastructure:
they read large amounts of data and make many requests per job to the conditions DB (ConDB)
Short-term solution: temporarily limit the number of running/queued jobs to 2k/1k, which limits production to ~50k events/day
A long-term solution is under discussion.
Users are advised to run only on DAODs, because non-experts running on AODs will most probably produce incorrect physics results.
Private production is STRICTLY forbidden by physics coordination policy.
Users can now finish/abort their own tasks from bigpanda
See full list of DPA issues at
https://docs.google.com/presentation/d/1mZ0kknGOJVNyF3onfKSEHUYAT0rHxBYcON7kYZvlj3k/edit?ts=58a0e636#slide=id.g1cd506785b_0_277
Everyone is working hard to retire the BDII. A new AGIS parameter to be used for monitoring, etf_default, will be added soon. It will initially be set to true when a queue has pq_is_default=1 and pq_capability=score. Tuning and testing are required.
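The initial-setting rule above can be sketched as follows (a minimal illustration only; the dictionary layout stands in for an AGIS queue record and is an assumption, not the actual AGIS schema):

```python
def initial_etf_default(queue):
    """Initial etf_default value: true only for default single-core (SCORE) queues.

    `queue` is a plain dict standing in for an AGIS panda-queue record;
    the field names follow the rule quoted above.
    """
    return queue.get("pq_is_default") == 1 and queue.get("pq_capability") == "score"

# A default SCORE queue gets etf_default = true; anything else stays false.
print(initial_etf_default({"pq_is_default": 1, "pq_capability": "score"}))  # True
print(initial_etf_default({"pq_is_default": 1, "pq_capability": "mcore"}))  # False
```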
We are trying to migrate all sites to the new mover control mechanism (use_newmover = true in AGIS). There was some confusion, but LSM is a valid mover to use with the new controls. If your queues have not yet moved, please expedite the transition with Alexey Anisenkov, Jose Caballero and the atlas-adc-agis group.
============== This is an extract from a more complete Email =======================
On Monday 6 February 2017 at 10:00 CERN time we will change the tool used to manage the status of PanDA queues, previously done through curl 'http://panda.cern.ch:25943/server/pandamon/query?'. That tool will be switched off.
Instead, PanDA queue states will be managed by the AGIS centralized blacklisting system and synchronized to the other systems.
You can monitor blacklisted queues, see detailed instructions and examples for the new blacklisting CLI here: http://atlas-agis.cern.ch/agis/pandablacklisting/list/. Alternatively you can also view blacklisting details in the usual AGIS Pandaqueue view, by selecting the fields in the upper menu like here: http://atlas-agis.cern.ch/agis/pandaqueue/table_view/?&vo_name=atlas&show_2=0&show_3=0&show_4=0&state=ACTIVE&show_10=1&show_11=1&show_12=1&show_13=1
======================================
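The table-view URL quoted in the email is just a base path plus query parameters, so a site can build variations of it programmatically. A small sketch (only the parameters visible in the quoted URL are used; their meanings beyond that are not assumed):

```python
from urllib.parse import urlencode

BASE = "http://atlas-agis.cern.ch/agis/pandaqueue/table_view/"

def table_view_url(**params):
    """Build an AGIS panda-queue table-view URL from query parameters."""
    return BASE + "?" + urlencode(params)

# Reproduce the query from the announcement: active ATLAS queues, with the
# extra columns show_10..show_13 (the blacklisting details) enabled.
url = table_view_url(vo_name="atlas", state="ACTIVE",
                     show_10=1, show_11=1, show_12=1, show_13=1)
print(url)
```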
I've gone through the "ADC List of Existing Monitors" document. The Google doc compresses down to only about nine monitoring sites, with variations mostly in the parameters but sometimes in the names. I come up with:
http://panglia.triumf.ca
(I have asked about the future of this one; a date on the page is 3+ years old, and it may go away simply through lack of support from Victoria)
http://bigpanda.cern.ch/dash/production/#cloud_US
(there are many variations on the basic dashboard here)
http://dashb-atlas-job.cern.ch/dashboard
This one is interesting, as the basic "dashb-atlas" host string has MANY variations, including:
dashb-atlas-job-prototype.cern.ch, dashb-atlas-ssb.cern.ch, dashb-atlas-ddm-acc.cern.ch, dashb-atlas-ddm.cern.ch
http://dashb-fts-transfers.cern.ch
This seems to be moving to monit portal -- dashboard named "MONIT FTS Transfers Plots"
http://adc-ddm-mon.cern.ch/ddmusr01/plots
http://wlcg-sam-atlas.cern.ch
https://etf-atlas-prod.cern.ch/etf/check_mk
http://apfmon.lancs.ac.uk
This is an ATLAS Panda dashboard and is outside the scope of the migration
http://wlcg-squid-monitor.cern.ch
Early indications are that these two will migrate:
dashb-atlas-ddm.cern.ch (and, I assume, all variations on dashb-atlas-*)
wlcg-sam-atlas.cern.ch
New prototypes are:
monit-grafana.cern.ch
monit.cern.ch/app/kibana
All information to do with raw and processed data is in the new monitoring infrastructure. I will report more information as developments proceed.
1) ~290M events in the MC production queue as of 2/14. MC16 simulation is coming soon; MC16 digi+reco at the end of this month
2) Sites may have noticed some heavily failing pmerge tasks over the past couple of days. The tasks were aborted and resubmitted
3) Pilot release from Paul on 2/2 (v67.5) - details in the posted shift summaries
4) No follow-up issues for US sites
AGLT2 is running smoothly, with no apparent issues.
The running core count has exceeded 10,000 for the first time since the site was established.
All dCache servers and worker nodes (WN) from the Fall RBT purchases are online. Equipment purchased with closeout funds is also now online, and the Capacity Spreadsheet is up to date.
Some of the closeout funds were designated for replacement 1Gb and 10Gb switches; the latter were needed for the expansion of the UM WN complement. We received an S4048-ON switch for this, and it is performing admirably. The 1Gb N2048 switches are being used at both MSU and UM for public NIC connections on WN that still rely on 1Gb NICs; stage-in rates have nearly tripled as a result, to 25-30 MB/s per connected WN. Three of these switches remain to be installed at MSU, but we expect this to be complete within the next 2 weeks, leaving only a small number of 1Gb public NICs on the PC6248.
dCache was upgraded from 2.13.50 to 2.16.26. The DB schema updates took ~5 hours to complete, but left us with extremely slow write rates and many timeouts. Discussion with dCache support led us to vacuum some of the tables, which restored our efficiency. The adjustments we made are now part of the dCache upgrade procedure.
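For reference, the table cleanup amounts to running PostgreSQL VACUUM ANALYZE on the affected tables. A minimal sketch of preparing those statements (the table names below are hypothetical examples, not the actual tables we vacuumed):

```python
# Build VACUUM ANALYZE statements for a set of PostgreSQL tables.
# VACUUM cannot run inside a transaction block, so each statement must be
# executed separately (e.g. via psql or an autocommit connection).
TABLES = ["t_inodes", "t_dirs"]  # hypothetical dCache/Chimera table names

def vacuum_statements(tables):
    """Return one standalone VACUUM ANALYZE statement per table."""
    return [f"VACUUM ANALYZE {t};" for t in tables]

for stmt in vacuum_statements(TABLES):
    print(stmt)
```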
We expect sometime in the next 2 weeks to upgrade our gatekeepers to the just-released 3.3.21 version with HTCondor 8.4.11.
The site is full of jobs, except Illinois
Illinois is currently doing its monthly PM
OSG 3.3.21 to be installed on all gatekeepers later this week
Working on migration to latest "wrapper" and job flow
Retirement of the Cisco 6509 caused a dCache issue
New switches are in
Network monitoring at Illinois
New Purchases
MWT2 Site total will be
Switched to use_newmover in AGIS with no problems. Still using the local LSM.
Going to San Diego.
Issues:
1) Problem with jobs using the ~usatlas1 home directory, overloading NFS. We moved to a stronger NFS server (~16 hours of downtime for the switch).
2) An issue came up in the past day that might affect sites worldwide. Errors occurred in our SRM where BeStMan refuses to delete files with names longer than 231 characters. Since 256 characters is the limit on many file systems, this may soon hit other sites. DDM has been notified.
3) Waiting for cable replacements from DELL to bring up new storage.
4) NESE activities ramping up.
5) Switchover to HTCondor-CE is ready. We will ask Jose to switch us over after the meeting.
6) Investigating bandwidth after the peering with LHCONE. Asked Hiro for a load test.
7) Still need to make JSON file for Wei.
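A quick way for a site to check whether it is exposed to the long-filename deletion problem in item 2 is to scan its namespace for names near the limit. A minimal sketch (the 231-character threshold comes from the report above; the sample names are illustrative):

```python
import os

def long_named_files(root, limit=231):
    """Yield paths whose final name component exceeds `limit` characters."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if len(name) > limit:
                yield os.path.join(dirpath, name)

# Example with an in-memory list instead of a real namespace walk:
names = ["short.root", "x" * 240 + ".root"]
flagged = [n for n in names if len(n) > 231]
print(len(flagged))  # 1
```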
UTA_SWT2:
SWT2_CPB: