US ATLAS Computing Integration and Operations
virtual room
your office
-
-
1
Top of the Meeting
Speaker: Kaushik De (University of Texas at Arlington (US))
Present: Michael, Dave, Fred, Kaushik, Bob, Saul, Armen, Mark, Alden, Horst
Apologies: Rob
From Rob:
1) There is a nice summary of the sites jamboree from Ale https://indico.cern.ch/event/440821/
2) More generally, there were interesting discussions at the WLCG meeting, https://indico.cern.ch/event/433164/other-view?view=standard and summaries are being posted.
3) There is a cloud-level action item from Alessandra about publishing maxrss values, agreed at the jamboree.
Other items:
Reminder OSG AHM at Clemson Mar 14-17: https://indico.fnal.gov/conferenceDisplay.py?confId=10571
Latest ADC weekly meeting for general ATLAS info: https://indico.cern.ch/event/469716/
Michael: The OSG Technology Area has requested that sites move to OSG 3.3. Bob and Dave have tried this version and found an issue with dccp. Xin reported this at the OSG ops meeting, and they will look into it. A reply from Bockelman is expected soon. We encourage other US sites to test this version, since 3.2 will go away in the near future.
Saul: NET2 ran into a problem with SLURM while upgrading to the HTCondor CE (including the new OSG release) and filed a ticket.
-
2
Capacity News: Procurements & Retirements
-
3
Production
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
4
Data Management
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
5
Data transfers
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
6
Networks
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
7
FAX and Xrootd Caching
Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
Site Reports
-
8
BNL
Speaker: Michael Ernst
- OSG Technology Area is asking sites to transition to OSG SW release 3.3.x a.s.a.p. as they want to drop support for OSG 3.2 by ~August
- I have sent a corresponding announcement to the US ATLAS T2 list and received feedback from AGLT2 and MWT2 indicating lack of support for dccp in rel 3.3
- This was brought up by Xin at yesterday's OSG production operations meeting. The OSG SW team is looking into the issue. US ATLAS Facilities are expected to receive a response from Brian Bockelman (who heads the OSG Technology Area)
- Smooth operation of the Tier-1 center over the course of the last 2 weeks, utilization of CPU at capacity
- Reprocessing of 2012 data running since late last week on T1s and T2s worldwide
- The T1 at BNL leads all sites by a large margin, with an overall contribution of 33%, followed by RAL (9.5%) and SIGNET (8%). This puts a lot of stress on the tape system, which is showing excellent staging performance of up to 50 TB / 60k RAW data files retrieved from tape in 24h: http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=All+T21&sitesCat[]=All+Countries&activities[]=reprocessing&resourcetype=All&sitesSort=5&sitesCatSort=0&start=2016-02-01&end=2016-02-03&timerange=daily&granularity=Hourly&generic=0&sortby=11&series=All
- At the T1 we are in the process of implementing the memory configuration as requested by ADC at the WLCG collaboration meeting
- We've made the following changes in AGIS:
queue name             maxrss (GB)   minrss (GB)
BNL_ATLAS_2            8             2.5
BNL_PROD               5             0
BNL_PROD_MCORE         24            0
BNL_PROD_MCOREHIMEM    64            24
ANALY_XX queues        3             0
- The first part of FY16 disk storage procurement has arrived, ~2.3 PB of usable disk space.
- Article about BNL/ATLAS AWS cloud work in InformationWeek at http://www.informationweek.com/cloud/infrastructure-as-a-service/brookhaven-lab-finds-aws-spot-instances-hit-sweet-spot/d/d-id/1324145
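As a rough illustration of how the AGIS memory limits listed in the BNL report could be read, the values can be held in a small mapping and queried for the queues whose memory window covers a job's expected RSS. This is only a sketch: the `queues_for` helper and the inclusive [minrss, maxrss] interpretation are assumptions made for illustration, not AGIS or PanDA brokerage behavior, and `ANALY_XX` stands in for all analysis queues.

```python
# Values copied from the BNL table (GB), stored as (maxrss, minrss).
# "ANALY_XX" is a placeholder for the analysis queues, as in the notes.
AGIS_RSS_GB = {
    "BNL_ATLAS_2": (8.0, 2.5),
    "BNL_PROD": (5.0, 0.0),
    "BNL_PROD_MCORE": (24.0, 0.0),
    "BNL_PROD_MCOREHIMEM": (64.0, 24.0),
    "ANALY_XX": (3.0, 0.0),
}


def queues_for(rss_gb):
    """Hypothetical helper: queues whose [minrss, maxrss] window contains rss_gb."""
    return sorted(
        name
        for name, (maxrss, minrss) in AGIS_RSS_GB.items()
        if minrss <= rss_gb <= maxrss
    )


if __name__ == "__main__":
    for rss in (2.0, 6.0, 30.0):
        print(rss, queues_for(rss))
```

Under this reading, a 30 GB job would fit only the MCOREHIMEM queue, while a 2 GB job would fit every queue whose minrss is 0.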
-
9
AGLT2
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
The dCache upgrade worked without a hitch; unfortunately, dCache itself did not. We upgraded from version 2.10.42-1 to 2.10.51-1 and then ran into OOM conditions. In the end, we rolled back to 2.10.42-1 and the OOM conditions went away. The problem was introduced at 2.10.49-1 in the third-party Netty library. To quote Gerd Behrmann, "The bug causes problems with throttling file loading when a slow HTTP client reads the file. Due to this the data gets queued in memory and for large files you will quickly run out of heap memory." Netty was downgraded and dCache 2.10.52-1 has now been released. However, we will not do that upgrade now, choosing instead to move to dCache 2.13 sometime towards the end of February.
All MSU R630 are now running jobs in Condor. This is the last of the 2015 funds, and the v38 spreadsheet has been updated accordingly.
It is no longer necessary (and perhaps has not been for some time) to notify the OSG GOC that a change in the APEL Normalization factor has been made. As long as the value in the resource is updated in OIM, it should propagate correctly into the WLCG reporting within 24 hours. If it is NOT seen to update, then it should be reported.
The flapping NIC on msufs02 was fixed by re-seating the SFP+ cable at the EX4500 switch end. However, some problems continue with the switch ports of this unit. It will be updated to a newer software version sometime soon.
The new AGIS parameters minrss and maxrss were updated this morning, taking on reasonable values (we hope) for our site.
Over the next month we will prepare to update our site software; the update will probably take place towards the end of February. This may require a site downtime of a day or two, but we are considering doing a rolling update over all WN. Software to be updated includes OSG-CE, OSG-WN, cvmfs, HTCondor, and other small pieces as needed.
-
10
MWT2
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site is running well
- Full of ATLAS jobs (MCORE, SCORE, Analy and Opportunistic)
- Good efficiency
IU nodes now operational
- Over 13500 cores
- HS 133,303, APEL Factor of 9.87
- Accounting updated in OIM, REBUS, V38 and WLCG-V38
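The quoted APEL factor is consistent with dividing the total HS value by the core count. A quick check, assuming the factor is the site-average benchmark score per core and taking the core count as exactly 13,500 (the notes say only "over 13500"):

```python
TOTAL_HS = 133_303  # HS value quoted in the report
CORES = 13_500      # assumption: the report says "over 13500"


def apel_factor(total_hs, cores):
    """Average benchmark score per core (assumed definition of the factor)."""
    return total_hs / cores


if __name__ == "__main__":
    print(round(apel_factor(TOTAL_HS, CORES), 2))
```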
New Disk at UChicago
- Ceph based
- Will move MWT2_UC_LOCALGROUPDISK from dCache to Ceph
- Might change name to MWT2_LOCALGROUPDISK to aid in transition
- Lincoln is testing by using gfal-copy to push data into the system
OSG 3.3.8
- All head nodes have been using 3.3.x stack for a long time without problems
- CE (HTCondorCE)
- Squid
- CVMFS servers/clients
- GUMS
- Condor 8.4.3
- Still using 3.2.24 on worker nodes
- DCAP has been removed in 3.3, but we still use it in LSM
- Working on LSM update to remove DCAP
- Will use GFAL2 (gfal-copy, gfal-rm, gfal-ls)
Virtual Memory issues
- Large jobs causing many problems
- OOM killing other jobs
- Nodes hanging/crashing
- lost heartbeat errors
- Upgrade to HTCondor 8.4.3 and cgroups help control large jobs
- cgroup "soft" allows flexible RSS
- hard virtual memory limit puts jobs into HELD
- Exposed inconsistent swapfile policy (little to no swap on some nodes)
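The soft and hard limits described above can be pictured with a toy model. This is not HTCondor code: the function name, its parameters, and the state names other than HELD are illustrative assumptions. The idea is that a soft RSS limit tolerates an overshoot while the node has spare memory, whereas exceeding the hard virtual memory limit always puts the job on hold.

```python
def job_state(rss_gb, vmem_gb, requested_rss_gb, vmem_limit_gb, node_free_gb):
    """Toy decision model for one job on one node (all sizes in GB)."""
    # Hard limit: exceeding the virtual memory limit always holds the job.
    if vmem_gb > vmem_limit_gb:
        return "HELD"
    # Soft limit: going over the requested RSS is tolerated as long as
    # the node still has enough free memory to absorb the overshoot.
    overshoot = rss_gb - requested_rss_gb
    if overshoot <= 0 or overshoot <= node_free_gb:
        return "RUNNING"
    # Otherwise the node is under memory pressure and the job is the
    # first candidate to be pushed back toward its soft limit.
    return "UNDER_PRESSURE"


if __name__ == "__main__":
    print(job_state(3, 4, 2, 8, 4))
    print(job_state(3, 10, 2, 8, 4))
    print(job_state(6, 7, 2, 8, 1))
```

In this picture the third case is where available swap matters, which may be why the change exposed nodes with little or no swap.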
FAX Door issues
- Doors at IU were causing problems
- Some type of internal IU networking issue (internal low level packet loss)
- Moved all doors to UC
- Will be moving doors off storage nodes onto VM like SRM (FAX and WebDAV)
WebDAV certificate issues
- Door is currently on a storage node (uct2-s13.mwt2.org)
- Needed a subject with SubjectAltName
- webdav.mwt2.org
- uct2-s13.mwt2.org
- Now supported in OSG PKI tools (osg-gridadmin-request -a)
- CI-Logon support added this Monday (2/1/2016)
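To illustrate why both names are needed in the certificate, here is a minimal sketch of the name check a client effectively performs: the host it connects to must appear among the certificate's SubjectAltName DNS entries. The `covers` helper is hypothetical and does exact, case-insensitive matching only; wildcard entries and full TLS validation are deliberately left out.

```python
# DNS names listed in the report for the WebDAV door certificate.
SAN_DNS_NAMES = ["webdav.mwt2.org", "uct2-s13.mwt2.org"]


def covers(san_dns_names, hostname):
    """Hypothetical helper: True if hostname exactly matches a SAN DNS entry."""
    wanted = hostname.lower().rstrip(".")
    return any(name.lower().rstrip(".") == wanted for name in san_dns_names)


if __name__ == "__main__":
    for host in ("webdav.mwt2.org", "uct2-s13.mwt2.org", "other.mwt2.org"):
        print(host, covers(SAN_DNS_NAMES, host))
```

With only the storage node's name in the certificate, a client connecting through the `webdav.mwt2.org` alias would fail this check.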
minRSS and maxRSS now set
-
11
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
-
12
-
13
SWT2-UTA
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
-
14
WT2
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
15
AOB
-