US ATLAS Computing Integration and Operations
virtual room
your office
-
-
13:00
→
13:15
Top of the Meeting 15m
Speaker: Robert William Gardner Jr (University of Chicago (US))
- Apologies from Horst, Saul
- Forthcoming facilities workshop in Clemson, https://indico.cern.ch/event/472826/
- The week after Clemson there is a workshop on campus research HPC best practices that may be of interest, relevant for campus clusters: http://www.ncsa.illinois.edu/Conferences/ARCC/agenda.html
-
13:15
→
13:25
Capacity News: Procurements & Retirements 10m
-
13:25
→
13:35
Production 10m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:35
→
13:40
Data Management 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:40
→
13:45
Data transfers 5m
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:45
→
13:50
Networks 5m
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:50
→
13:55
FAX and Xrootd Caching 5m
Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
From Andy:
What needs to change for async caching support:
1) XrdPosix package to add async POSIX style I/O
2) XrdPss package to use async POSIX style I/O
3) XrdOucCache package to provide an async cache interface; this also impacts the XrdPosix package, which is responsible for loading and using the caching interface.
The issue here is that all of these interfaces are public, which means we need to implement this without breaking ABI compatibility (i.e., it must be backward compatible).
Time estimates:
a) 1 week to design and code up the new caching interface (4/5/16).
b) 2 weeks to retrofit XrdPosix package to use (a) (3/21/16).
c) 1 week to retrofit XrdPss package to use (b) (3/25/16).
The above will always be available, as work proceeds, in the pssasync branch of the xrootd GitHub repo, so other parallel work can continue. Please be aware that I go on vacation on 3/28/16 for 12 days with limited (if any) internet connectivity, so it is likely that we will not have a production-quality version until 4/15/16 to 4/20/16, depending on how it goes.
-
14:15
→
15:15
Site Reports
-
14:15
BNL 5m
Speaker: Michael Ernst
- Smooth operations at full utilization of the compute farm (mostly MCORE)
- AWS 100k core test still in preparation
- Issues found with provisioning system based on APF
- Now understood and fixed
- Issues with S3 keys when running in 3 US regions
- Understood and fixed by pilot developers
- Scale test not to start before next week
- Hiro has developed and deployed data management services for end users working on the shared T3 at BNL
- Much improved bandwidth (over dq2-get) for data replication to T3 storage
- Deployment of FY16 disk storage in progress
- Hardware will be handed over to storage management group on or before March 15.
-
14:20
AGLT2 5m
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
PostgreSQL was updated from 9.3.11 to 9.5.1 in advance of a dCache upgrade from the 2.10 to the 2.13 series; both took place during a full downtime last Tuesday. At the same time our WNs were rebuilt completely, updating Condor to 8.4.4, cvmfs to 2.1.20, the OSG-WN client to 3.3.8, glibc to 2.12-1.166.el6_7.7, and various other sl and sl-security updates. Gatekeepers were updated to OSG 3.3.9, utilizing the OSG installation of Condor 8.4.3. The master Condor machine is also on Condor 8.4.4, which works around a possible issue with the collector process in 8.4.3.
Generally all upgrades went smoothly, modulo interactions between the various components. The dCache update in particular surprised us with how quickly it went. Several items were not immediately obvious, but a dCache documentation search showed the way. The xrootd plugins required a bit more work, and consultations between Gerd, Ilija and Shawn will likely result in new plugin rpms in the near future.
There are no outstanding issues with our site at this time. However, we have noticed some recent jobs that crash WNs. These jobs run a process called "JSAPrun.exe". Condor will suddenly report jobs running this process with a (condor_status) LoadAv of many tens, even many hundreds, which results in the WN either crashing or becoming unresponsive. We then get hung_task_timeout dumps in /var/log/messages indicating processes that have been blocked for more than 120 seconds. We have only just discovered this and have not yet had a chance to dig further, but I mention it here because other sites may be seeing it too.
-
14:25
MWT2 5m
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site has been running well except for IU
- Networking problems at IU
- Last week the issue was in the Indiana GigaPoP for the ESnet conversion
- Down again as of last night, tickets pending
- Condor pool offline
Scan for the latest OpenSSL bug (DROWN attack) shows MWT2 clean
Minor update of dCache to 2.10.56-1
- Helped with some XrootD door issues
- Removed an old monitoring plugin that was causing java null pointer exceptions
- Still some issues Lincoln is following up with Gerd
New Disk at UChicago
- Still in process of migrating LOCALGROUPDISK to Ceph
- Migrating user data from older Ceph system to new Ceph (many tiny files).
- Servers will be converted to dCache (~350TB)
OSG 3.3.9
- New lsm-get in use, removing the need for DCAP at MWT2
- Reports to Elastic Search
- Will be switching compute nodes to OSG 3.3.9 wn client
minRSS and maxRSS now set
- New Panda Queues for HIMEM
- MWT2_HIMEM (2G-5G) - only nodes with >= 5GB/core
- MWT2_HIMEM_MCORE (2G-3G) - only nodes with >= 3GB/core
- ANALY_MWT2_MCORE (cpus=8, maxrss 16GB)
- Very busy with jobs
- But users do not use all cores
ATLAS Analytics
- Now keeping 1 copy of the data at Clemson, 1 copy at UC with redundant head nodes.
- Currently riding out a scheduled downtime at Clemson. Kibana was up but now seems to be down.
- Users have been notified.
misc
- Cleaning Nagios cruft and converting to Icinga.
- Building SL7 machines and puppet rules for non-critical services.
- No plans to run OSG software on SL7 for now.
-
14:30
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
-
14:40
SWT2-UTA 5m
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2
Facility electrical work forced a shutdown over the weekend. During the shutdown we added memory to the nodes that had 24GB of memory.
- Provides ~320 additional single job slots or ~80 additional multi-core slots
- Doubles multicore capacity
SWT2_CPB
Bringing 400TB of storage online.
UTA - Expecting network interruption this weekend.
-
14:45
WT2 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
15:15
→
15:20
AOB 5m