US ATLAS Computing Integration and Operations
-
-
13:00
→
13:15
Top of the Meeting 15m Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:15
→
13:20
ADC news and issues 5m Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
The BDII is no longer in use for SAM tests.
I (Bob) need to report the 2017 pledged and installed capacity for both CPU and storage. For now, I have the numbers below from the WLCG-v42 tab of our Normalizations spreadsheet.
I am told the 2017 disk pledge for BNL is wrong. The installed total is clearly lower than that pledge, although a purchase of as-yet-undetermined size is in progress.
Also, none of us has a good idea at this time where 2017 funds will be spent, and I will include that statement in the report. If any of the numbers below are wrong, I NEED TO KNOW ASAP.
-----------------------------------------
Numbers are
disk -- pledged TB/actual TB/LocalGroup Disk subset (TB)
cpu -- pledged HS06/actual HS06
BNL/Tier1
disk 15640/11917/653
cpu 211830/156212
AGLT2
disk 4242/6850/445
cpu 57500/96080
MWT2
disk 6363/8558/500
cpu 86250/192277
NET2
disk 4242/5500/230
cpu 57500/96964
SWT2
disk 4242/6933/164
cpu 57500/109327
-
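The pledged and installed figures reported under ADC news and issues can be cross-checked with a few lines of Python. This is only a sketch: the values are copied from the numbers above, the unlabeled two-number lines are assumed to be the cpu entries, and the function names are illustrative. Note that the BNL disk pledge is reported as possibly wrong, so its shortfall should be read with that caveat.

```python
# (pledged, installed) pairs copied from the numbers in the minutes.
# Assumption: lines carrying only two numbers are the cpu (HS06) entries.
DISK_TB = {
    "BNL/Tier1": (15640, 11917),
    "AGLT2": (4242, 6850),
    "MWT2": (6363, 8558),
    "NET2": (4242, 5500),
    "SWT2": (4242, 6933),
}
CPU_HS06 = {
    "BNL/Tier1": (211830, 156212),
    "AGLT2": (57500, 96080),
    "MWT2": (86250, 192277),
    "NET2": (57500, 96964),
    "SWT2": (57500, 109327),
}


def fraction_installed(pledged, installed):
    """Installed capacity as a fraction of the pledge (1.0 means exactly met)."""
    return installed / pledged


def shortfalls(table):
    """Return {site: installed - pledged} for sites that are below pledge."""
    return {
        site: installed - pledged
        for site, (pledged, installed) in table.items()
        if installed < pledged
    }
```

Calling `shortfalls(DISK_TB)` or `shortfalls(CPU_HS06)` lists only the sites below pledge, which is the quickest way to spot entries that need a second look before the report goes out.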
13:20
→
13:30
Production 10m Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5m Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5m Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5m Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:45
→
13:50
FAX and Xrootd Caching 5m Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
Xrootd Proxy Cache:
The source of the network-related issue on the AGLT2 test bed has still not been identified. Univ. Chicago will set up a test machine with a 40Gbps NIC so we can see whether the problem is reproducible there.
We will continue to use the AGLT2 test bed for the long-running test, and monitor memory and file descriptor (TCP connection) usage.
-
13:50
→
14:00
Site movers 10m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
1. Most US sites have moved to the new site movers, but some have not completed all of the additional changes in AGIS. The instructions from Ilija are below. We will need to ask each site to make these changes and then monitor for site issues for several hours to a day.
There are two different changes that have to be made for all the queues:
1) switch to the new site mover configurator by setting use_newmover to True
2) switch off the variables used by the old mover configurator by setting deprecate_oldmover to True
2. Held a dedicated meeting on setting the correct internal Xrootd door for each site in AGIS. A document is in progress:
https://docs.google.com/document/d/1FKbXCHZ-NA__nFlELUpm_D32OOdXGgq786GeEQotOUA/edit?usp=sharing
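The two per-queue changes listed under item 1 above lend themselves to a simple checklist script. The sketch below assumes the queue settings have already been obtained as a plain dictionary. How they are exported from AGIS is not covered here, and the queue names in the usage note are hypothetical. Only the parameter names use_newmover and deprecate_oldmover come from the instructions.

```python
# Settings every queue should have once the switch is complete,
# per the two steps in the instructions above.
REQUIRED = {"use_newmover": True, "deprecate_oldmover": True}


def pending_changes(queues):
    """Map each queue name to the required settings it has not yet applied.

    `queues` is a hypothetical structure: {queue_name: {setting: value}}.
    Queues that already satisfy both requirements are omitted from the result.
    """
    pending = {}
    for name, settings in queues.items():
        missing = [
            key for key, wanted in REQUIRED.items()
            if settings.get(key) != wanted
        ]
        if missing:
            pending[name] = missing
    return pending
```

For example, `pending_changes({"EXAMPLE_PROD": {"use_newmover": True}})` would flag `deprecate_oldmover` as still outstanding for that made-up queue, giving a per-site to-do list for the follow-up.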
-
14:00
→
14:10
OS performance testing 10m Speaker: Doug Benjamin (Duke University (US))
-
14:10
→
14:25
HPCs integration 15m Speaker: Taylor Childers (Argonne National Laboratory (US))
-
14:25
→
16:00
Site Reports
-
14:25
BNL 5m Speaker: Xin Zhao (Brookhaven National Laboratory (US))
-
14:30
AGLT2 5m Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
Periodic dCache problems were traced to the dCacheDomain on our dCache admin machine head01. They were caused by direct memory usage exceeding the 2GB we had allocated. The limit was increased to 4GB and the service has run fine since then. Current usage is 2.7GB and we are watching the process closely via OMD. There is some concern that usage still seems to be growing slowly, which may indicate a memory leak. It is also true that the number of transfers per unit time is much higher here than it has ever been in the past.
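One way to turn the "slowly growing" observation into a number is to fit a straight line through periodic usage readings and project when the limit would be reached. The sketch below is an illustration only, not what OMD does: the function names are hypothetical, the readings would have to come from the site's own monitoring, and only the 2.7GB usage and 4GB limit are taken from the report above.

```python
def growth_rate(samples):
    """Least-squares slope (GB per day) through (day, usage_gb) readings.

    Needs at least two readings taken at different times.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def days_until_limit(current_gb, limit_gb, rate_gb_per_day):
    """Days until usage reaches the limit at a constant growth rate.

    Returns None when usage is flat or shrinking, since no limit is reached.
    """
    if rate_gb_per_day <= 0:
        return None
    return (limit_gb - current_gb) / rate_gb_per_day
```

With a measured rate in hand, `days_until_limit(2.7, 4.0, rate)` gives a rough estimate of how long the 4GB allocation would last. A steady positive slope under constant transfer load would support the leak hypothesis, while a slope that tracks the transfer rate would point to load instead.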
All UM APC Symmetra 80k batteries were replaced today. We will be able to get a run-time calibration tomorrow. Prior to the replacement there was less than 1min of run time on the batteries.
AGLT2 will have a full outage starting on June 23 as all power will be off in the UM server room and building. We will likely restart services on Saturday afternoon, June 24. We are hoping the new batteries will allow us to keep a minimal VMWare configuration running, but will not know if this will be the case until just prior to the shutdown.
Today the "deprecate_oldmover" parameter on our main Production queue was set to true. Our test queue has been running this way now for quite some time without issue. This is the only change required to switch a queue, as all related, obsolete parameters are then set to 'OLDMOVERDEPRECATED' automagically. We will slowly switch over all of our queues now to this setting.
-
14:35
MWT2 5m Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site is now full of jobs and operating well
Problems during the last two weeks
- Network issues at UChicago
- DNS network reverse lookup problems affected SE access
- Bad fiber from UChicago to off campus caused large packet loss
- Bad PDU at Indiana has about 500 cores offline
- New PDU should be installed today
- Dell C6320 chassis at Illinois has cooling problems
- Nodes in chassis in top of rack have been shutting down (about 300 cores offline)
- Dell believes it is a cooling issue which can be fixed with firmware update
- However, ICC admins are having problems updating to the new firmware (working with Dell)
- When update issue resolved, will apply to all chassis (56 total) during next ICC PM (April 19)
OSG 3.3.22-2 installed on all nodes
- OSG 3.3.22-1 had bad version of Xrootd (4.6.0 has many problems)
- Permissions issue caused gfal-copy to fail among other problems
- New tarball 3.3.22-2 released with XrootD 4.5.0
- Downgrade XrootD to 4.5.0 on nodes using RPM installation
- OSG 3.3.23-1 is now available and will be installed on all nodes this week
- cvmfs 2.3.5
- HTCondor-CE updates
- Removes gip and osg-info-services
All MWT2 and CONNECT PanDA queues have been converted to new AGIS site mover schema
- "newmovers" set on all Qs
- "deprecate oldmovers" set on all Qs
- Initially thought direct I/O was not working
- "Admin" (ie ddl) misunderstanding error!
- Some HC jobs do not use direct I/O even though Q configured to use it
Frontier access is overloading MWT2 squids
- This is causing very slow access to CVMFS repositories by CVMFS client using same squids
- Has an impact on efficiency and interactive users complain about slow access
- Known issue by ADC but no solution as yet
- MWT2 solution is to install separate CVMFS client only squids on our Stratum-1 servers
USERDISK and GROUPDISK decommissioning continuing
- Waiting on ADC to change Panda Q to use SCRATCHDISK for output by ANALY Qs
- Reducing size of GROUPDISK and adding freed space to DATADISK
CONNECT Blue Waters
- Had another run to use up 18K node hours
- Used over 8K cores for about 48 hours
- Mark N has applied for 1M node hours from Illinois quota for 2017 (should know soon)
Storage decommissioning
- In FY17 we are scheduled to retire over 1PB of old storage
- Spread over 7 servers more than 5 years old
- This will reduce MWT2 to about 7PB total storage
- DATADISK: >6000TB
- LOCALGROUPDISK: 500TB
- SCRATCHDISK: 300TB
-
14:40
NET2 5m Speaker: Prof. Saul Youssef (Boston University (US))
-
14:45
-
14:50
SWT2-UTA 5m Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2
- Things running fine
- Will likely take this site down for extended rebuild in the near future
- Disabled all unused Panda Queues (will delete in a week, per Alexey)
SWT2_CPB
- Production is running better even with reprocessing tasks
- Cleared up space reporting issue and starting to investigate reason for dark data
- Starting to work on conversion to HTCondor-CE
- Analysis queue is doing direct reads incorrectly
- I should now have the correct permissions in AGIS to try out a solution
-
14:55
WT2 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
16:00
→
16:05
AOB 5m