Happy New Year! Welcome to the first US ATLAS Computing Facility meeting of 2019. We're trying out a new format to follow the new WBS 2.3 organization and we expect this to be an iterative process.
Notes:
Each meeting, one of the WBS 2.3 areas will present on significant topics within that area. http://bit.ly/facility-wbs:
2.3.1 Tier-1 Operations -- Eric
2.3.2 Tier-2s Operations -- Shawn
2.3.3 HPC Operations -- Doug
2.3.4 Analysis Facilities Operations -- Wei & Will
2.3.5 Continuous Integration and Operations (CIOPS) -- Rob & Hiro
This week we'll have Fred report on pricing/configurations from Dell.
Next meeting we'll have a report from WBS 2.3.5 on Continuous Integration from Rob/Hiro. Tentative schedule going forward:
Set up a new AGLT2_HOSPITAL queue; the difference is that jobs' input (read) data is non-local, coming from other US storage elements. It is also a multi-core queue.
Incidents:
Massive data transfer failures occurred a few times between late December and early January, caused by the failure of the authentication service in dCache and by one storage node losing network connectivity for a short period.
Some of the Condor work nodes have unusually high load (over 1000), with or without jobs using the CPU. The symptoms include high load, a hanging /tmp directory, lost connection with the Condor head node, 100% swap usage, and hanging sanity-check processes. We updated a few work nodes from 8.4.11 to 8.6.12 for debugging purposes.
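A minimal sketch of a health check that flags the symptom pattern above (extreme load plus saturated swap). The thresholds and function names are illustrative assumptions, not our actual monitoring:

```python
# Hypothetical node health check for the symptoms seen on the affected
# Condor work nodes: very high load average and near-100% swap usage.
# Thresholds are illustrative assumptions, not production values.

def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])

def swap_used_fraction(meminfo: str) -> float:
    """Fraction of swap in use, parsed from /proc/meminfo contents."""
    fields = {}
    for line in meminfo.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue
        fields[key.strip()] = int(rest.split()[0])  # values are in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total

def node_unhealthy(load1: float, swap_frac: float,
                   load_limit: float = 1000.0,
                   swap_limit: float = 0.99) -> bool:
    """True when the node shows the pathological pattern we observed."""
    return load1 > load_limit or swap_frac >= swap_limit
```

On a live node the inputs would come from reading /proc/loadavg and /proc/meminfo; the functions take strings so the logic is testable anywhere.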
System updates:
We had two dCache updates this quarter: from 4.2.6 to 4.2.12, and from 4.2.12 to 4.2.21. The latter supports the xrootd-TPC and HTTP-TPC tests. During the first dCache update, we also updated the system firmware and upgraded to SL7.5.
AFS client 1.8 is compiled and installed on our CentOS 7 host. The version available for SL7 is still 1.6. We have not tested 1.8 on the SLC7 nodes yet.
All the SL7 nodes, including work nodes, grid service nodes, and interactive nodes, have been upgraded from SL7.5 to SL7.6, with all security patches applied promptly. All the SL7.6 hosts have been rebooted to run the most recent kernel (3.10.0-957.1.3.el7.x86_64).
All the work nodes have had the lustre-client upgraded from 2.10.4 to 2.10.6; this update supports the most recent kernel (3.10.0-957.1.3.el7.x86_64).
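A sketch of the kind of post-reboot check that confirms a host is running at least the target kernel; the version-parsing scheme is an assumption for illustration:

```python
# Hypothetical post-reboot kernel check: confirms a host runs at least
# the target kernel named in the update notes. The parsing scheme is an
# illustrative assumption, not a site tool.

EXPECTED = "3.10.0-957.1.3.el7.x86_64"

def kernel_tuple(release: str) -> tuple:
    """Turn '3.10.0-957.1.3.el7.x86_64' into a sortable tuple of ints."""
    version, _, rest = release.partition("-")
    # Drop the '.el7.x86_64' suffix, keeping only the numeric build part.
    build = rest.split(".el")[0] if ".el" in rest else rest
    parts = version.split(".") + build.split(".")
    return tuple(int(p) for p in parts if p.isdigit())

def kernel_ok(running: str, expected: str = EXPECTED) -> bool:
    """True when the running kernel is at least the expected one."""
    return kernel_tuple(running) >= kernel_tuple(expected)
```

On a live host, `platform.release()` from the standard library would supply the running kernel string.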
All three of our OSG gatekeepers have had Condor upgraded from 8.6.11 to 8.6.13.
System updates:
Working on equipment purchases:
Chicago
| Item | Hardware | Description | Quantity |
| --- | --- | --- | --- |
| dCache s-node expansion | Dell MD1200 (12x10 TB) | 100 TB usable/shelf | 6 |
| XCache server | Dell R740XD | 12TB 7.2K RPM NLSAS 12Gbps 512e 3.5in; 800GB SSD SATA Mix Use 6Gbps 512n 2.5in | 1 |
| ML server | Nortech | 5U chassis, redundant power supplies, dual Intel Xeon 12-core 6146, 192GB 2666MHz DDR4-2666 ECC REG DIMM, six enterprise 480GB solid state drives, eight GeForce RTX 2080 Ti video cards, 2-port SFP+ 10Gb NIC, three years parts and labor | 1 |
Indiana & Illinois
OU: Nothing to report, everything is running smoothly.
UTA_SWT2 & SWT2_CPB: Intermittent issue with the deletion service causing failures because gridftp servers report that a non-existent file is a directory. Trying to replicate.
UTA_SWT2: The space reporting script had an issue and was not updating correctly; we filled our disks, which caused problems. The issue has been resolved, and a new script is in place that will avoid similar issues in the future. It will be rolled out to SWT2_CPB later this week.
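A minimal sketch of the kind of guard such a script can add: report usage and mark a storage area non-writable once free space drops below a safety margin, so a reporting failure cannot silently fill the disks. The mount point and the 10% margin are hypothetical:

```python
# Hypothetical space-reporting guard. Publishes used/free space for a
# storage path and flags it non-writable below a safety margin, so a
# stale report can't lead to filled disks. The 10% margin and the path
# passed by the caller are illustrative assumptions.
import shutil

def space_report(path: str, min_free_fraction: float = 0.10) -> dict:
    """Return a usage report for `path`; `writable` is False once the
    free fraction falls below the safety margin."""
    usage = shutil.disk_usage(path)
    free_frac = usage.free / usage.total
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "free_fraction": free_frac,
        "writable": free_frac >= min_free_fraction,
    }
```

A caller would refuse new transfers whenever `space_report(area)["writable"]` is False, independent of how stale the last published report is.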
SWT2_CPB: No additional problems to report.
Electrical maintenance this week
Migration to a shared-pool architecture has been approved by the liaison; this implies a re-thinking of the "long" queue implementation.
Updating systemd and the kernel at the same time to address vulnerabilities.