Happy New Year! Welcome to the first US ATLAS Computing Facility meeting of 2019. We're trying out a new format to follow the new WBS 2.3 organization and we expect this to be an iterative process.
Notes:
Each meeting, one of the WBS 2.3 areas will present on significant topics within that area. http://bit.ly/facility-wbs:
2.3.1 Tier-1 Operations -- Eric
2.3.2 Tier-2s Operations -- Shawn
2.3.3 HPC Operations -- Doug
2.3.4 Analysis Facilities Operations -- Wei & Will
2.3.5 Continuous Integration and Operations (CIOPS) -- Rob & Hiro
This week we'll have Fred report on pricing/configurations from Dell.
Next meeting we'll have a report from WBS 2.3.5 on Continuous Integration from Rob/Hiro. Tentative schedule going forward:
Set up a new AGLT2_HOSPITAL queue; the difference is that jobs' input (read) data is non-local, coming from other US storage elements. It is also a multi-core queue.
Incidents:
Massive data transfer failures occurred a few times between late December and early January, caused by the failure of the authentication service in dCache and by one storage node losing network connectivity for a short period.
Some of the Condor work nodes have unusually high load (over 1000), with or without jobs using the CPU. The symptoms include high load, a hanging /tmp directory, lost connection with the Condor head node, 100% swap usage, and hanging sanity-check processes. We updated a few work nodes from 8.4.11 to 8.6.12 for debugging purposes.
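A minimal sketch of a health check that flags the symptom pattern above (extreme load plus saturated swap). The thresholds and function names are illustrative assumptions, not our actual monitoring:

```python
# Hypothetical node health check for the symptoms seen on the affected
# Condor work nodes: very high load average and near-100% swap usage.
# Thresholds are illustrative assumptions, not production values.

def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])

def swap_used_fraction(meminfo: str) -> float:
    """Fraction of swap in use, parsed from /proc/meminfo contents."""
    fields = {}
    for line in meminfo.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue
        fields[key.strip()] = int(rest.split()[0])  # values are in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total

def node_unhealthy(load1: float, swap_frac: float,
                   load_limit: float = 1000.0,
                   swap_limit: float = 0.99) -> bool:
    """True when the node shows the pathological pattern we observed."""
    return load1 > load_limit or swap_frac >= swap_limit
```

On a live node the inputs would come from reading /proc/loadavg and /proc/meminfo; the functions take strings so the logic is testable anywhere.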
System updates:
We had two dCache updates this quarter: from 4.2.6 to 4.2.12, and from 4.2.12 to 4.2.21. The latter supports the xrootd-TPC and HTTP-TPC tests. During the first dCache update, we also updated the system firmware and upgraded to SL7.5.
AFS client 1.8 is compiled and installed on our CentOS 7 host. The version available for SL7 is still 1.6. We have not tested 1.8 on the SLC7 nodes yet.
All the SL7 nodes, including work nodes, grid service nodes, and interactive nodes, have been upgraded from SL7.5 to SL7.6, with all security patches applied promptly. All the SL7.6 hosts have been rebooted to run the most recent kernel (3.10.0-957.1.3.el7.x86_64).
All the work nodes have had the lustre-client upgraded from 2.10.4 to 2.10.6; this update supports the most recent kernel (3.10.0-957.1.3.el7.x86_64).
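A sketch of the kind of post-reboot check that confirms a host is running at least the target kernel; the version-parsing scheme is an assumption for illustration:

```python
# Hypothetical post-reboot kernel check: confirms a host runs at least
# the target kernel named in the update notes. The parsing scheme is an
# illustrative assumption, not a site tool.

EXPECTED = "3.10.0-957.1.3.el7.x86_64"

def kernel_tuple(release: str) -> tuple:
    """Turn '3.10.0-957.1.3.el7.x86_64' into a sortable tuple of ints."""
    version, _, rest = release.partition("-")
    # Drop the '.el7.x86_64' suffix, keeping only the numeric build part.
    build = rest.split(".el")[0] if ".el" in rest else rest
    parts = version.split(".") + build.split(".")
    return tuple(int(p) for p in parts if p.isdigit())

def kernel_ok(running: str, expected: str = EXPECTED) -> bool:
    """True when the running kernel is at least the expected one."""
    return kernel_tuple(running) >= kernel_tuple(expected)
```

On a live host, `platform.release()` from the standard library would supply the running kernel string.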
All three of our OSG gatekeepers have had Condor upgraded from 8.6.11 to 8.6.13.
System updates:
Working on equipment purchases:
Chicago
| Item | Hardware | Description | Quantity |
| --- | --- | --- | --- |
| dCache s-node expansion | Dell MD1200 (12x10 TB) | 100 TB usable/shelf | 6 |
| XCache server | Dell R740XD | 12TB 7.2K RPM NLSAS 12Gbps 512e 3.5in; 800GB SSD SATA Mix Use 6Gbps 512n 2.5in | 1 |
| ML server | Nortech | 5U chassis, redundant power supplies, dual Intel Xeon 12-core 6146, 192GB 2666MHz DDR4-2666 ECC REG DIMM, six enterprise 480GB solid state drives, eight GeForce RTX 2080 Ti video cards, 2-port SFP+ 10Gb NIC, three years parts and labor | 1 |
Indiana & Illinois
OU: Nothing to report, everything is running smoothly.
UTA_SWT2 & SWT2_CPB: Intermittent issue with the deletion service causing failures because gridftp servers report that a non-existent file is a directory. Trying to replicate.
UTA_SWT2: The space reporting script had an issue and was not updating correctly; we filled our disks, which caused problems. The issue has been resolved, and a new script is in place that will avoid similar issues in the future. It will be rolled out to SWT2_CPB later this week.
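A minimal sketch of the kind of guard such a script can add: report usage and mark a storage area non-writable once free space drops below a safety margin, so a reporting failure cannot silently fill the disks. The mount point and the 10% margin are hypothetical:

```python
# Hypothetical space-reporting guard. Publishes used/free space for a
# storage path and flags it non-writable below a safety margin, so a
# stale report can't lead to filled disks. The 10% margin and the path
# passed by the caller are illustrative assumptions.
import shutil

def space_report(path: str, min_free_fraction: float = 0.10) -> dict:
    """Return a usage report for `path`; `writable` is False once the
    free fraction falls below the safety margin."""
    usage = shutil.disk_usage(path)
    free_frac = usage.free / usage.total
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "free_fraction": free_frac,
        "writable": free_frac >= min_free_fraction,
    }
```

A caller would refuse new transfers whenever `space_report(area)["writable"]` is False, independent of how stale the last published report is.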
SWT2_CPB: No additional problems to report.
Electrical maintenance this week
Migration to a shared-pool architecture has been approved by the liaison; this implies a re-thinking of the "long" queue implementation.
Updating systemd and the kernel at the same time to address vulnerabilities.