The OSG BDII is shutting down, and the ATLAS SAM tests will soon switch the way in which they test queues. From the announcement sent out by Ryu Sawada:
"The ATLAS SAM tests are going to change the way they select the queues for the SAM tests. The selection so far was done using BDII information except for HTCONDOR-CEs. Soon it will be done selecting from the queues that are effectively used, i.e. the queues attached to the PandaQueues in AGIS and a new flag ETF_default=1. "
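The new selection logic described above can be sketched roughly as follows. Note the record layout and field names ("panda_queue", "ETF_default") are assumptions for illustration; the real AGIS schema may differ.

```python
def select_etf_queues(queues):
    """Keep only queues attached to a PandaQueue and flagged ETF_default=1."""
    return [q["name"] for q in queues
            if q.get("panda_queue") and q.get("ETF_default") == 1]

# Example records (hypothetical): only "long" is both attached to a
# PandaQueue and flagged as the ETF default.
queues = [
    {"name": "long",   "panda_queue": "SITE_PROD", "ETF_default": 1},
    {"name": "short",  "panda_queue": "SITE_PROD", "ETF_default": 0},
    {"name": "legacy", "panda_queue": None,        "ETF_default": 1},
]
print(select_etf_queues(queues))  # ['long']
```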
"No negative impact is expected. But please watch SAM results of your site, and if you find any false results, please contact us for the correction by sending a ticket to GGUS."
JSON reporting of space usage is now active for all US sites.
perfSONAR v4.0 RC3 is out; hopefully this is essentially the final version for v4.0. If no problems are found, a release could happen in a couple of weeks. We will need to get our sites updated (auto-updates should run, but this needs checking).
Work is ongoing on the new mesh configuration; see https://meshconfig-itb.grid.iu.edu/ It will become the production version at http://meshconfig.grid.iu.edu soon (within about a week). Anyone interested can get an account; request admin access for specific meshes if needed.
A significant reorganization of network service components is planned in OSG: some ITB instances will be removed and resources (memory/CPU) rebalanced. The new monitoring will be a Docker-based ETF running on a CentOS 7.3 VM; see https://gitlab.cern.ch/etf/docker/blob/master/README.md All services will need updates once perfSONAR v4.0 is released.
Next week is the LHCONE/LHCOPN meeting at BNL. We hope some of you will be attending. https://indico.cern.ch/event/581520/
Analytics on network metrics are showing occasional packet-loss problems at various locations. We need to start opening tickets (after perfSONAR v4.0 is deployed).
http://tiny.cc/PktLossNoUnknown (Shows 6 months of packet loss by src/dest)
http://tiny.cc/pSLink (Shows network stats by specific site)
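The kind of check that could drive ticket creation can be sketched as below. The 2% threshold and the measurement format are assumptions for illustration, not the actual analytics pipeline.

```python
def flag_lossy_links(measurements, threshold=0.02):
    """Return (src, dest) pairs whose mean packet-loss fraction exceeds threshold."""
    flagged = []
    for (src, dest), losses in measurements.items():
        if sum(losses) / len(losses) > threshold:
            flagged.append((src, dest))
    return flagged

# Hypothetical loss fractions per measurement window.
measurements = {
    ("AGLT2", "MWT2"): [0.00, 0.01, 0.00],   # mean ~0.3%: fine
    ("AGLT2", "BNL"):  [0.05, 0.08, 0.04],   # mean ~5.7%: flag for a ticket
}
print(flag_lossy_links(measurements))  # [('AGLT2', 'BNL')]
```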
Test emails by subscription are being issued, e.g.:
Dear Shawn McKee,
this mail is to let you that there was a significant change in packet loss detected by PerfSONAR.
The site CA-SCINET-T2 (22.214.171.124)'s links got improved, total number from 5 to 0 links.
These are all the bad links for the past hour:
Comments from Rob: Improve the email messages to make what is being communicated obvious.
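One possible clearer phrasing, along the lines Rob suggests, is sketched below: state the direction of the change explicitly rather than leaving the reader to infer it. The function name and fields are hypothetical, not the actual alert code.

```python
def format_loss_alert(site, host, bad_before, bad_after):
    """Render a packet-loss change notice that states the direction of change explicitly."""
    direction = "improved" if bad_after < bad_before else "degraded"
    return (f"perfSONAR packet-loss update for {site} ({host}):\n"
            f"links with significant packet loss went from {bad_before} to "
            f"{bad_after} in the past hour ({direction}).")

# The confusing example email above would instead read:
print(format_loss_alert("CA-SCINET-T2", "22.214.171.124", 5, 0))
```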
0. Xrootd proxy cache server at AGLT2.
1. Under heavy load, the xrootd proxy cache sometimes can't send data back to some clients (broken pipe / send failure). The current focus is on checking OS/networking settings. Increasing "txqueuelen" on the NIC (ens2) from 1000 to 20000 did not help. Reviewing other parameters.
2. There is a question about uncommitted data in memory when a client closes its connection. We would prefer to commit the data to disk to increase proxy efficiency, but that is not always possible under heavy load, so those data will be discarded.
3. Occasional loss of file descriptors (including TCP): 22 files so far in the last two days of stress testing (out of 224k). This _may_ be due to a Linux kernel semaphore bug that is fixed in the latest kernel. Need to confirm.
4. After item 1 is understood, we will enter a long period of stress testing to check stability, memory usage, file/TCP descriptors, and networking.
5. Packaging as a product.
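The descriptor loss in item 3 could be tracked by periodically sampling the proxy's open-descriptor count from /proc (Linux-specific) and comparing it against the number of active connections. A minimal sketch, using the current process for demonstration since the xrootd proxy's PID depends on the deployment:

```python
import os

def open_fd_count(pid="self"):
    """Count file descriptors currently open for a process via /proc (Linux)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Sampling this for the xrootd proxy's PID over the stress test would
# show whether descriptors leak relative to active connections.
before = open_fd_count()
f = open("/dev/null")          # opening a file adds one descriptor
print(open_fd_count() - before)
f.close()
```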
Site is in downtime now.
dCache upgrade ongoing
Issues with AGIS PanDA queue blacklisting system
We have updated from dCache 2.16 to the 3.x series. A DB schema change took 5 hours to complete. Unfortunately, our monthly Chimera dumps are now broken, as the schema change broke chimera_find.sh. Hiro promises that he can fix this, and a dCache ticket is also open for it.
Our gatekeepers are updated to OSG 3.3.21 now, and the new [Resource Entry xxx] sections are in place in the 30-gip.ini file. Following directions posted by Wei and John, AGIS was also updated to connect the listed queues.
We have been notified that there will be a complete power outage in the UM server room on Saturday, June 24. We plan to shut down all services on Friday afternoon, June 23, to prepare for this. Hopefully we can get much of it back up on Saturday afternoon, but that is far from certain at this time.
Site is full of jobs - operating well
OSG 3.3.22 to be installed on all gatekeepers
New switches at UChicago are fully deployed
Network monitoring and other issues
Smooth operations with full sites, with the exception of:
1) Checksum mismatch errors. This generated a ticket for us, but the problem was on the source end. Details can be found here: http://egg.bu.edu/NET2%7binf:NET2%7d/gadget:Studies/section:report/2017-03/checksum_mismatch_exotics/
2) ATLASSCRATCHDISK space is being used.
3) Deletions are still happening via Bestman at our site.
4) We still have a mystery problem with HTCondor-CE where the site drains for reasons not yet understood. We're still investigating and have been in contact with Brian.
5) Working intensively on NESE, MGHPCC floor and WAN networking. Had a very useful meeting with Alastair Dewhurst re: CEPH/Gridftp and his "Echo" project.
Planning on updating the hardware soon.
No production issues
Had an issue with transfers from two Canadian sites (McGill, UToronto) due to asymmetric routing. CANARIE discovered the misconfigured router and fixed it.
An issue with space reporting exists. One data server had a configuration issue and was reporting more space used than was physically on disk. This has been resolved; we will see how much the overall space reporting was affected.
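A simple per-server sanity check would catch this class of misreporting: flag any server whose reported usage exceeds its physical capacity. The data layout and names below are assumptions for illustration.

```python
def overreporting_servers(servers):
    """Return names of servers reporting more used space than physically present."""
    return [name for name, (reported_used, physical_total) in servers.items()
            if reported_used > physical_total]

# Hypothetical (reported_used, physical_total) in bytes.
servers = {
    "dataserver01": (90e12, 100e12),   # 90 TB used of 100 TB: consistent
    "dataserver02": (130e12, 100e12),  # reports more than the disks hold
}
print(overreporting_servers(servers))  # ['dataserver02']
```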