US ATLAS Computing Integration and Operations
-
-
13:00
→
13:15
Top of the Meeting 15mSpeakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Robert William Gardner Jr (University of Chicago (US))
-
13:15
→
13:20
ADC news and issues 5mSpeakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
The big item being pushed by the ADC now is a switch to the new pilot and Site Mover Configuration. For those sites using their own LSM, as long as that LSM conforms to the "standard" then there should be little to no issue.
Jose has been running a low level of such pilots to all sites for some time and has noted only Lucille has any issues. Horst and Joel are working with Jose to understand this.
With the new mover configuration, PandaQueue attributes such as seprodpath, copyprefixin, and copyprefix will no longer be needed and will be set to some unusable value to ensure they do not crop up in unexpected ways. Eventually these will be entirely eliminated from AGIS.
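As a rough illustration of the cleanup described above, the sketch below scans a queue record for the legacy attributes named here. The dictionary layout, the `UNUSABLE` sentinel, the queue name, and the function name are all assumptions made for illustration; they are not the actual AGIS schema or API.

```python
# Hypothetical sketch: the record layout and sentinel value are assumptions,
# not the real AGIS/PandaQueue schema.
LEGACY_ATTRS = ("seprodpath", "copyprefixin", "copyprefix")  # named in the minutes
UNUSABLE = "UNUSED"  # assumed stand-in for "some unusable value"


def legacy_attrs_in_use(queue):
    """Return the legacy attributes that still hold a usable-looking value."""
    return [
        name for name in LEGACY_ATTRS
        if queue.get(name) not in (None, "", UNUSABLE)
    ]


example_queue = {
    "name": "EXAMPLE_QUEUE",  # made-up queue name
    "seprodpath": UNUSABLE,
    "copyprefix": "srm://old.example.org/^root://new.example.org/",  # made-up value
}
print(legacy_attrs_in_use(example_queue))  # → ['copyprefix']
```

A check of this kind could be run per queue before the attributes are removed, to spot any that still carry a live value.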
There are plenty of events in the MC queue to keep sites busy for some time to come.
@TCB on Monday: reported status of space reporting JSON. This is less an issue with sites compared to SRM itself. OSG has a pre-beta version of the LVS document.
-
13:20
→
13:30
Production 10mSpeaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5mSpeaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5mSpeaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5mSpeaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:45
→
13:50
FAX and Xrootd Caching 5mSpeakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
14:10
Site movers 20mSpeaker: Wei Yang (SLAC National Accelerator Laboratory (US))
Ilija/Alexey: US analysis sites with "direct_access_lan" will not move for now.
Ale/Saul: confusion over what LSM refers to: propose to call site level script "LSM", and call the python code in Panda pilot "Site Mover Configuration" or "LSM driver".
David: should check the section on "Associated DDM Storages" to make certain the list of DDM endpoints is selected in the correct order for the Panda queues.
Joel/Horst: LUCILLE moved to new movers, see error. Note: LUCILLE_MCORE uses xrdcp mover for read, lsm for write. LUCILLE_CE uses xrdcp/lsm for read, lcgcp for write/log. Is this correct?
Jose's summary on site movers and plan on moving APF:
-- we have been running for a long while a special factory setup that uses Alexey's pilot (instead of Paul's pilot) with a new input option -m1. This pilot is, as I understand, using the new movers. Alexey can correct me there if I am wrong. This dedicated setup has been submitting, at a very low rate (average 1 pilot per hour), to every single queue in US ATLAS, including T1 and all T2. If some queue is missing that was just by mistake, not on purpose.
-- As Alexey pointed out earlier, ALMOST all queues in that setup show jobs finished successfully. I understand that means that those sites are ready to work with the new movers.
-- and yes, to make it as clear as possible, lsm is being respected. Nobody is talking about replacing lsm, or changing it, or whatever
-- there is one site that seems to be failing 100% of the time with Alexey's pilot: LUCILLE. This needs to be sorted out.
-- I have checked the failure rate of every queue with both pilots (Alexey's and Paul's) for an entire day of production. Numbers are quite similar. The fact that we are running very few in one case plays against it, so I am not very concerned about the exact differences. The important thing is that they are similar.
-- My plan now was to move an entire factory (currently 1/3 of total production) to run only Alexey's pilot, in order to have a sense of what happens when scaling up. We have never moved anything from 0 to 100% at the factories for US ATLAS.
If having an entire factory running Alexey's pilot works fine, and failure rates are similar to the other factories, queue by queue, then I planned to switch queues in AGIS. This step assumes that running Alexey's pilot is equivalent to running Paul's pilot + AGIS setup for new movers.
-- If sites start playing with AGIS on their own, in parallel, my plan goes to hell. So I am going to leave it to sites to decide if they still want me to take care of this or they prefer to run things on their own.
-
14:10
→
14:30
OS performances testing 20mSpeaker: Doug Benjamin (Duke University (US))
-
14:30
→
16:05
Site Reports
-
14:30
BNL 5mSpeaker: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR))
-
14:35
AGLT2 5mSpeakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
AGLT2 is running smoothly.
Sometime next week we will take a short SE outage to upgrade dCache, and to finish off the H330 firmware updates. This is in response to the Dell announcement that firmware prior to 25.5.0.0019 could potentially corrupt the controlled disks. Two dCache pool servers at both sites, R730xd from the most recent Dell purchases, are impacted and need to be updated. All WN updates have been completed.
The dCache update that will simultaneously take place will update from our current 2.13.50 to the 2.16 Golden Release. There is a problem with the 2.13 release in that certain certificates from RU and CA are not supported prior to 2.13.52, the last of the 2.13 release series.
All 2016 funds are spent out. Five new R630 are in production at MSU, and of 4 N2048 switches purchased there, one is in full production, with the remaining 3 set to take the load off the last of the PC6248 at MSU towards the end of February when Mike Nila returns from 2 weeks at CERN. 13 R630 were purchased at UM. Most cabling is in place for these, and the 10Gb network switch (S4048-ON) is received but not yet configured. We hope to bring these into production next week. The single N2048 purchased for UM is in production with 18 32-core WN attached in a bonded 2x1Gb configuration. Kibana plots of stage-in rates show a clear improvement for these machines over the rate when attached to the PC6248.
With all of the N2048 in place, the PC6248 at AGLT2 will no longer be in use for public NIC connections.
Sometime in the next few weeks we will be upgrading our gatekeepers to the most recent OSG release, 3.3.20 from our current 3.3.18. This should bring into play several HTCondor-CE reporting updates that will soon be required.
The list of affected controllers includes the following RAID controllers: H330, H730, H730P, H830, SD33-2S, SD33-2D
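The firmware and dCache thresholds above are comparisons of dotted version strings. A minimal sketch of such a check follows; the helper names and the sample firmware value 25.4.0.0017 are made up for illustration, while the controller list and the threshold 25.5.0.0019 come from these notes.

```python
# Hypothetical helper names; the controller list and threshold come from the minutes.
AFFECTED_CONTROLLERS = {"H330", "H730", "H730P", "H830", "SD33-2S", "SD33-2D"}
FIXED_FIRMWARE = "25.5.0.0019"


def version_key(version):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def needs_firmware_update(controller, firmware):
    """True if the controller is on the affected list and predates the fixed firmware."""
    return (controller in AFFECTED_CONTROLLERS
            and version_key(firmware) < version_key(FIXED_FIRMWARE))


print(needs_firmware_update("H330", "25.4.0.0017"))     # made-up firmware value → True
print(needs_firmware_update("H330", "25.5.0.0019"))     # → False
print(version_key("2.13.50") < version_key("2.13.52"))  # → True
```

Comparing integer tuples rather than raw strings avoids the case where, for example, "2.13.9" would sort after "2.13.50" lexically.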
-
14:40
MWT2 5mSpeakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site is full of jobs.
Switch problems at UChicago have a number of nodes offline for ATLAS jobs
- Cooling problem casualty was the old Cisco 6509
- Due to stability problems, nodes on this switch are running opportunistic jobs only; a small number of compute nodes are affected.
- New Juniper top of rack switches are on order
New Purchases
- UChicago
- 26 R430
- 1040 cores
- Installed, waiting for new switches and cables
- Indiana
- 15 R430
- 600 cores
- Illinois
- 8 C6320
- 448 cores
MWT2 Site total will be
- 18520 cores
- 192K HS06
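As a quick cross-check of the purchase figures above, the arithmetic can be sketched as follows. Node counts and core totals are taken from the list; cores per node and HS06 per core are derived here, and "192K" is read as 192,000.

```python
# Node counts and core totals come from the purchase list; the rest is derived.
purchases = {
    "UChicago": {"model": "R430", "nodes": 26, "cores": 1040},
    "Indiana": {"model": "R430", "nodes": 15, "cores": 600},
    "Illinois": {"model": "C6320", "nodes": 8, "cores": 448},
}

for site, p in purchases.items():
    # Integer division is exact for all three entries (40, 40, and 56).
    print(f"{site}: {p['cores'] // p['nodes']} cores per {p['model']}")

new_cores = sum(p["cores"] for p in purchases.values())
print("new cores:", new_cores)  # → 2088

site_cores = 18520
site_hs06 = 192_000  # "192K HS06", read as 192,000
print("HS06 per core:", round(site_hs06 / site_cores, 2))  # → 10.37
```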
Firmware on all Dell nodes upgraded to avoid data corruption
New Movers
- All Panda queues for MWT2 and CONNECT are now using the "new movers" configuration
- MWT2 LSM in use without any changes
- CONNECT uses the "gfal2" mover except Bluewaters, which uses lsm
OSG 3.3.20 installed on all gatekeepers
- Waiting on instruction from Xin on how to report resources to AGIS
dCache upgraded to 2.13.51
- Problem with certs from CA and RU fixed in this point release
- Will be looking to move to 2.16 in the near future
MWT2 face to face meeting this week in Urbana
-
14:45
NET2 5mSpeaker: Prof. Saul Youssef (Boston University (US))
22/23 new worker nodes are installed and in production.
1.5 PB of storage is installed, will begin testing once we fix a bad cable issue.
Will check with Jose and update AGIS config.
Very close to switching over to 100% HTCondor on the BU side.
Still need to update the spreadsheets.
Will add JSON following Wei's instructions.
-
14:50
SWT2-OU 5mSpeaker: Dr Horst Severini (University of Oklahoma (US))
-
14:55
SWT2-UTA 5mSpeaker: Patrick Mcguigan (University of Texas at Arlington (US))
-
15:00
WT2 5mSpeaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
16:05
→
16:10
AOB 5m