US ATLAS Computing Integration and Operations
Virtual room: your office
13:00 → 13:15  Top of the Meeting (15m)
Speaker: Robert William Gardner Jr (University of Chicago (US))
- OSG AHM reminder: https://indico.fnal.gov/conferenceDisplay.py?confId=10571, and US ATLAS Facilities meeting, March 14-17, 2016, Clemson University.
- ESnet LHCONE site coordinator meetings:
  - OU: Friday, Feb 12, 12 pm CST (done)
  - IU: Thursday, Feb 11, 9 am CST (done)
  - UTA: Wednesday, Feb 17, 2 pm CST (today)
  - BU: being discussed (NOX versus MIT at MANLAN); will check status bi-weekly
  - Duke: likely this Friday
  - UT/TACC: TBD, pending LEARN-ESnet peering
- OSG 3.3 upgrade discussion. OSG 3.2 will be deprecated.
13:15 → 13:25  Capacity News: Procurements & Retirements (10m)
13:25 → 13:35  Production (10m)
Speaker: Mark Sosebee (University of Texas at Arlington (US))
13:35 → 13:40  Data Management (5m)
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
13:40 → 13:45  Data Transfers (5m)
Speaker: Hironori Ito (Brookhaven National Laboratory (US))
13:45 → 13:50  Networks (5m)
Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
13:50 → 13:55  FAX and Xrootd Caching (5m)
Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
14:15 → 15:15  Site Reports
14:15  BNL (5m)
Speaker: Michael Ernst
Smooth operations for the last two weeks
- Running at capacity, mostly MCORE (production) jobs
- Heavily dominated by reprocessing jobs (2015 data) for the last ~10 days
- Disk storage is tight; free space is <1 PB
Preparing for the AWS 100k-core scale test with the Event Service, scheduled for next week
min/max RSS implemented according to the ADC request
14:20  AGLT2 (5m)
Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
No news is good news? We are running well.
Preparation is ongoing for a full software update at AGLT2. This will include the glibc fixes, which should appear in our RPM repos overnight. The osg-wn-client will be 3.3.8 (but may move to 3.3.9), CVMFS will be 2.1.20, and HTCondor will be 8.4.3. OSG-CE work on our test gatekeeper is ongoing today, using the current 3.3.9 version.
The dcap and lcg-util RPMs, which are still needed by our lsm* utilities, were taken from EPEL.
Next week we plan a likely upgrade of dCache from the 2.10 series to the 2.13 series.
Note that the OSG RPM suites require Java 1.7, but if Java 1.8 is also installed and set as the default, the OSG software still runs fine, so we will make Java 1.8 the default everywhere (a small version-check sketch follows). See OSG ticket https://ticket.opensciencegrid.org/28484
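As a quick sanity check for that default-Java switch, the snippet below reads the active `java -version`; it is an illustrative sketch, not part of AGLT2's actual procedure.

```python
# Hypothetical helper: confirm the default `java` is 1.8 after the switch,
# while a 1.7 JDK remains installed for the OSG RPM dependencies.
import re
import subprocess

# `java -version` prints to stderr, e.g.: java version "1.8.0_71"
out = subprocess.run(["java", "-version"], capture_output=True, text=True)
match = re.search(r'version "(\d+\.\d+)', out.stderr)
print("default java:", match.group(1) if match else "unknown")
```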
14:25  MWT2 (5m)
Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site has been running well
- Full of ATLAS jobs (MCORE, SCORE, Analy, and opportunistic)
- Good efficiency
Illinois down for preventive maintenance (PM) on the campus cluster
Updated glibc pushed to all nodes
New disk at UChicago
- Ceph-based
- Migrating LOCALGROUPDISK
- Lincoln is using gfal-copy over SRM to copy from dCache to Ceph, but it is slow (a transfer sketch follows this list)
- Currently migrated 193 TB out of 368 TB
- Need kernel 4.4 to fix controller problems
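For reference, each migration copy could look roughly like the sketch below, using the gfal2 Python bindings; the endpoint URLs and tuning values are hypothetical placeholders, not MWT2's actual configuration.

```python
# Minimal sketch of one dCache -> Ceph file copy via the gfal2 Python
# bindings. SRC/DST are hypothetical placeholders.
import gfal2

SRC = "srm://dcache.mwt2.example:8443/pnfs/mwt2/localgroupdisk/file1"
DST = "root://ceph-gw.uchicago.example//localgroupdisk/file1"

ctx = gfal2.creat_context()
params = ctx.transfer_parameters()
params.overwrite = True        # replace partial copies left by failed attempts
params.checksum_check = True   # verify source and destination checksums match
params.timeout = 3600          # seconds; large files over SRM can be slow
ctx.filecopy(params, SRC, DST)
```

Running many such copies in parallel workers is one common way to amortize the per-transfer SRM overhead.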
OSG 3.3.9
- All head nodes have been running the 3.3.x stack for a long time without problems:
  - CE (HTCondor-CE)
  - Squid
  - CVMFS servers/clients
  - GUMS
  - Condor 8.4.3
- Still using 3.2.35 on worker nodes
- Testing the new LSM
  - Uses GFAL2 via xrootd, then SRM, then FAX to try to stage in the file (a fallback sketch follows below)
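The xrootd → SRM → FAX ordering amounts to a plain fallback loop; here is a minimal sketch under assumed, hypothetical replica URLs (not the actual LSM code):

```python
# Sketch of an LSM-style stage-in fallback: try a direct xrootd read,
# then SRM, then the FAX federation redirector. All URLs are hypothetical.
import gfal2

SOURCES = [
    "root://xrootd.mwt2.example//pnfs/mwt2/atlas/file1",     # local xrootd
    "srm://dcache.mwt2.example:8443/pnfs/mwt2/atlas/file1",  # SRM
    "root://faxredirector.example//atlas/rucio/file1",       # FAX
]

def stage_in(dest="file:///scratch/file1"):
    """Copy from the first working source; return the URL that succeeded."""
    ctx = gfal2.creat_context()
    for src in SOURCES:
        try:
            ctx.filecopy(src, dest)
            return src
        except gfal2.GError as err:
            print("stage-in from %s failed: %s" % (src, err))
    raise RuntimeError("all stage-in sources failed")
```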
minRSS and maxRSS now set
- MCORE needs 24 GB for reprocessing jobs
- Changed HTCondor-CE to request an RSS of 24 GB (previously 16 GB)
- Many nodes have only 2 GB/core, so this can cause idle cores due to lack of free memory (see the arithmetic sketch below)
- Might create MWT2_MCORE_HIMEM to handle jobs needing more than 2 GB/core
  - Redirect only to nodes with 3 GB or more per core
  - MWT2 has almost 4000 cores that fit this criterion
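The idle-core effect is simple arithmetic; the sketch below illustrates it with example node sizes, which are assumptions rather than an actual MWT2 inventory:

```python
# Why a 24 GB RSS request strands cores on 2 GB/core nodes: an 8-core
# MCORE slot now needs 3 GB/core, so RAM, not cores, becomes the limit.
def mcore_slots(node_cores, node_ram_gb, job_cores=8, job_rss_gb=24):
    """Number of MCORE jobs a node can host, limited by cores and RAM."""
    return min(node_cores // job_cores, int(node_ram_gb // job_rss_gb))

print(mcore_slots(16, 32))  # 2 GB/core node: RAM fits 1 job, 8 cores idle
print(mcore_slots(16, 48))  # 3 GB/core node: both 8-core slots filled
```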
14:30  NET2 (5m)
Speaker: Prof. Saul Youssef (Boston University (US))
We have had smooth operations over the past two weeks, with only one brief SRM incident. We are running lots of reprocessing (maxrss is already 24000 MB for our MCORE queues). The main things we are working on are:
- Bringing the new storage online and into GPFS
- Transitioning to HTCondor-CE on the BU and HU sides
- Re-formulating our WAN update plan
- Preparing to join LHCONE
- Working with MOC to add a pool of MOC worker nodes
There is also a problem with HU availability reporting via SAM, which is broken and still needs to be tracked down.
14:35  SWT2-OU (5m)
Speaker: Dr Horst Severini (University of Oklahoma (US))
14:40  SWT2-UTA (5m)
Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
14:45  WT2 (5m)
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
15:15 → 15:20  AOB (5m)