US ATLAS Computing Integration and Operations
-
-
13:00
→
13:15
Top of the Meeting 15mSpeakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:15
→
13:20
ADC news and issues 5mSpeakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:20
→
13:30
Production 10mSpeaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management 5mSpeaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers 5mSpeaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks 5mSpeaker: Dr Shawn McKee (University of Michigan ATLAS Group)
-
13:45
→
13:50
FAX and Xrootd Caching 5mSpeakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
→
14:10
OS performances testing 10mSpeaker: Doug Benjamin (Duke University (US))
-
14:10
→
14:25
HPCs integration 15mSpeaker: Taylor Childers (Argonne National Laboratory (US))
-
14:25
→
16:00
Site Reports
-
14:25
BNL 5mSpeaker: Xin Zhao (Brookhaven National Laboratory (US))
-
14:30
AGLT2 5mSpeakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
We have updated all of our gatekeepers to the newest OSG release, 3.3.25. In conjunction with the update we worked with an OSG team to establish the new lcmaps-voms mapping described at this URL:
https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/InstallLcmapsVoms
This has been running at AGLT2 for a week now without any issues that we are aware of.
The released set of VOMS mappings, in /usr/share/osg/voms-mapfile-default, is not yet complete; in particular, the atlas mappings are not yet updated. However, two override files and two "ban" files, located in /etc/grid-security, are searched for account mappings before the default file. Only the first matching mapping, based on either the first FQAN of the presented certificate or the identity (DN) portion of the presented certificate, is used to do the account mapping. We have put the following VOMS override file in place at AGLT2:
[root@gate03 ~]# cat /etc/grid-security/voms-mapfile
"/atlas/usatlas/Role=production/Capability=NULL" usatlas1
"/atlas/usatlas/Role=software/Capability=NULL" usatlas2
"/atlas/usatlas/Role=lcgadmin/Capability=NULL" usatlas2
"/atlas/Role=lcgadmin/Capability=NULL" usatlas2
"/atlas/usatlas/Capability=NULL" usatlas3
"/atlas/Role=production/Capability=NULL" usatlas1
"/atlas/Capability=NULL" usatlas4
"/atlas/calib-muon/Role=NULL/Capability=NULL" muoncal
"/osg/ligo/Role=NULL/Capability=NULL" ligo
"/fermilab/*" fermilab
"/cms/*" uscms01
We also have a /etc/grid-security/grid-mapfile to map the specific DN in use at AGLT2 for the muon calibration effort.
[root@gate03 ~]# cat /etc/grid-security/grid-mapfile
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=diehl/CN=490810/CN=Edward Diehl" muoncal
Once AGLT2 has modified dCache to use a similar mechanism, we will no longer depend on GUMS and will turn off our GUMS service. OSG will deprecate and then entirely remove its GUMS support in the future.
The dCache pgsql DBs crashed over the past weekend, causing all jobs to fail. It seems that auto-vacuum is not working for our pgsql 9.5.7 instance. This is under investigation, but as of 3pm Monday we were back in business. As a reminder, we are running dCache 3.8.11.
AGLT2 will go offline at noon Friday for a complete power outage in the UM server room. We hope to be back up by day's end on Monday, after multiple maintenance items are completed. The switcher2 set our queues offline at noon today.
Singularity is installed on all AGLT2 worker nodes.
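The account-mapping lookup described earlier in this report (override files consulted before the default file, first match wins) can be sketched in Python as below. This is only an illustration of the search order as reported, not the LCMAPS plugin's actual implementation. The function names, the exact precedence (ban lists, then DN map, then VOMS override, then default), the use of shell-style wildcard matching, and the example DN "/DC=org/CN=Some User" are all assumptions made for the sketch.

```python
from fnmatch import fnmatchcase
import shlex


def parse_mapfile(text):
    """Parse lines of the form: "pattern" account -> list of (pattern, account)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, account = shlex.split(line)  # shlex strips the quotes around the pattern
        entries.append((pattern, account))
    return entries


def first_match(entries, value):
    """Return the account of the first entry whose pattern matches value, else None."""
    for pattern, account in entries:
        if fnmatchcase(value, pattern):  # '*' treated as a wildcard, as in "/cms/*"
            return account
    return None


def map_account(dn, first_fqan, grid_map, voms_map, voms_default,
                banned_dns=(), banned_fqans=()):
    """Hypothetical lookup order: bans, DN map, VOMS override, VOMS default."""
    if any(fnmatchcase(dn, p) for p in banned_dns):
        return None
    if any(fnmatchcase(first_fqan, p) for p in banned_fqans):
        return None
    # Only the first mapping found is used; later sources are not consulted.
    for entries, key in ((grid_map, dn), (voms_map, first_fqan), (voms_default, first_fqan)):
        account = first_match(entries, key)
        if account is not None:
            return account
    return None


# Excerpt of the AGLT2 override files quoted in the report.
VOMS_MAP = parse_mapfile('''
"/atlas/usatlas/Role=production/Capability=NULL" usatlas1
"/atlas/usatlas/Role=software/Capability=NULL" usatlas2
"/fermilab/*" fermilab
"/cms/*" uscms01
''')
GRID_MAP = parse_mapfile(
    '"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=diehl/CN=490810/CN=Edward Diehl" muoncal'
)

if __name__ == "__main__":
    print(map_account("/DC=org/CN=Some User",
                      "/atlas/usatlas/Role=production/Capability=NULL",
                      GRID_MAP, VOMS_MAP, []))
```

Passing the default file's entries as the last argument models the fallback; an empty list there means an unmatched certificate gets no account.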
-
14:35
MWT2 5mSpeakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
Site is now full of jobs and operating well
Problems over the last three weeks
The roof-top condenser fan at UChicago failed two weeks ago. The building engineers installed a temporary fix while the unit is being replaced.
- Required some reduction in workers for a day to lower temp in room
Illinois
- 10K DDN system lost an entire disk tray
- Caused loss of any disk redundancy
- ICC Admins migrated all data to 12K DDN
- Unfortunately, other bad disks caused the loss of some data, but nothing important
- Site was down for 4 days to do backup/reformat/restore
- One positive is the FS is now reformatted with GPFS 4.x
-
14:40
NET2 5mSpeaker: Prof. Saul Youssef (Boston University (US))
Normal operations except for
a) The slurm database is causing occasional problems at NET2/Harvard. We're working on it.
b) We're repairing a steady stream of old nodes with failing plastic fans.
We sped up scanning of GPFS so we can update space tokens every ~6 hours rather than every 24 hours. This relates to central deletion.
NESE activities are ramping up; testing POC Ceph cluster; planning first major purchases; network redesign for MGHPCC floor is underway.
-
14:45
SWT2-OU 5mSpeaker: Dr Horst Severini (University of Oklahoma (US))
-
14:50
SWT2-UTA 5mSpeaker: Patrick Mcguigan (University of Texas at Arlington (US))
-
14:55
WT2 5mSpeaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
16:00
→
16:05
AOB 5m