US ATLAS Computing Integration and Operations
-
13:00
→
13:05
Top of the Meeting (5m). Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:05
→
13:15
Singularity / CentOS 7 deployment in the US cloud (10m). Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
Discussion among Rob, Xin and Wei:
It is better to decouple the CentOS 7 migration from the Singularity deployment, so that the C7 migration can happen sooner. ADC doc for C7 migration:
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Readiness
US sites have the option of a Rolling Transition or a Big Bang Transition. We will work site-by-site to help with the transition. ADC strongly suggests that the Singularity 2.4.2 RPMs be installed on C7 WNs.
- AGLT2 is in the process, using Rolling Transition
- BNL is doing a Rolling Transition, plus Containerized WNs (see below), so the migration is moving forward largely unnoticed.
On Singularity: see presentation at ADC Site Jamboree:
- Singularity version 2.4.2, not 2.3.x
- Ultimately pilot 2 will invoke Singularity - compatible with US APF (and EU APF and aCT)
- Pilot 2 is not quite ready.
- Incompatible with Containerized WNs (encapsulating the payload in a container)
- Containerized WNs are not a requirement, and you are on your own to support them
- But they are not forbidden either (good for learning and trying things out).
- Will need container_type and container_options settings in AGIS / PanDA Queue
- For HPCs, investigating methods to reduce container image size
- single release image
- use SquashFS instead of Ext3 - doing this for NERSC - reduces image size by a factor of 3
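The AGIS / PanDA queue settings mentioned above might look like the fragment below. The field names container_type and container_options come from the discussion; the values shown (a Singularity wrapper mode and a CVMFS bind mount) are illustrative assumptions, not the prescribed configuration:

```json
{
  "container_type": "singularity:wrapper",
  "container_options": "-B /cvmfs --contain"
}
```

Sites should confirm the exact values with ADC before setting them in AGIS.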
-
13:15
→
13:20
ADC news and issues (5m). Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:20
→
13:25
OSG software issues (5m). Speaker: Brian Lin (University of Wisconsin)
We're trying to track down deprecated OSG environment variables (https://jira.opensciencegrid.org/browse/SOFTWARE-3011). The following don't appear to be used by any pilots:
- OSG_DATA
- OSG_DEFAULT_SE
- OSG_GLEXEC_LOCATION
- OSG_HOSTNAME
- OSG_LOCATION
- OSG_STORAGE_ELEMENT
So we would like to remove them in OSG 3.4 or, at the very least, announce their deprecation.
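As a quick sanity check, a site or pilot maintainer could scan a job's environment for these variables before they are removed. A minimal sketch, assuming only the variable list from the bullets above (the function name is ours, not part of any OSG tool):

```python
import os

# Deprecated OSG attributes slated for removal (see SOFTWARE-3011)
DEPRECATED = [
    "OSG_DATA",
    "OSG_DEFAULT_SE",
    "OSG_GLEXEC_LOCATION",
    "OSG_HOSTNAME",
    "OSG_LOCATION",
    "OSG_STORAGE_ELEMENT",
]

def deprecated_osg_vars(environ=None):
    """Return, sorted, the deprecated OSG variables still set in the environment."""
    if environ is None:
        environ = os.environ
    return sorted(v for v in DEPRECATED if v in environ)

if __name__ == "__main__":
    found = deprecated_osg_vars()
    if found:
        print("Still set:", ", ".join(found))
```

Running this inside a pilot's environment would show whether any of the listed variables are still being exported there.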
-
13:25
→
13:30
Production (5m). Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management (5m). Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers (5m). Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks (5m). Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Last week I attended two networking meetings: LHCONE/LHCOPN in Abingdon (https://indico.cern.ch/event/681168/) and the perfSONAR annual developer meeting in Amsterdam (no public link). Lots of good discussion at both. LHCONE/LHCOPN meeting report: https://indico.cern.ch/event/681168/attachments/1616425/2569199/LHCOPNE-20180307-Abingdon-meeting-report.pdf
Today was the 2nd HEPiX NFV WG meeting (https://indico.cern.ch/event/705126/). The next meeting is April 25 at 10 AM Eastern. Live notes at https://docs.google.com/document/d/1CTsAqioZY8pcCDf3S7GbObHD_Sic06BF15dPmaVjOcM/edit
Questions on these meetings?
I won't go into other networking details here unless there are questions. Next week at the OSG AHM meeting there are 4 talks on networking:
- USATLAS meeting: Network evolution (Shawn)
- Joint USATLAS/FIFE/USCMS meeting: perfSONAR discussion (Shawn)
- Tuesday afternoon: OSG Networking Analytics: Evolution and Status (Shawn / Ilija)
- Wednesday afternoon: OSG Networking (Shawn)
If you have questions (or specific things you think need covering in any of the above) bring it up now or email me.
-
13:45
→
13:50
XCache (5m). Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
13:55
HPCs integration (5m). Speaker: Taylor Childers (Argonne National Laboratory (US))
Harvester deployment:
ALCF: The locally installed Rucio version was outdated enough to cause issues. Had to reinstall Harvester to get things consistent again. Back online, but needs work. Discussing with Doug G whether we should continue with dedicated tasks or grid-style running; each comes with its own benefits/drawbacks.
NERSC: Harvester up and running on Cori-P1/P2, processed 50M+ events over the past 7 days.
OLCF: Harvester now running for the Allocation jobs queue. Running 3 batch jobs at a time with 800 nodes each.
Container deployment:
NERSC: done
OLCF/ALCF: still in development.
-
13:55
→
14:30
Site Reports
-
13:55
BNL (5m). Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- running fine in general
- a new xrootd version (release candidate) was put in place last week, which fixed an issue that caused xrootd to crash with core dumps. The official release will come later.
- new WNs have been in production for several weeks. The migration of the rest of the farm to SL7 will be combined with the installation of new top-of-rack switches, to minimize downtime.
-
14:00
AGLT2 (5m). Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
Four C6420 chassis and sleds are being racked today. We will configure them as SL7 WN as we get them up and ready to go.
All WN at MSU are now running SL7 as are 1/3 of the WN at UM. We are developing a plan to move the balance of the UM WN to SL7 by the end of March.
As of today, all of our dCache servers are dual IPv4/IPv6 stacked. We have not yet registered AAAA records though.
We have a network interruption between UM and MSU on Thursday night that will adversely impact HTCondor communications between our sites for up to 4 hours. Consequently, we will idle down all MSU WN starting later this afternoon so as to lose as few jobs as possible during the outage.
On Friday after the MSU WN set is back online, we will add SL7 Analysis and LMEM queues, rounding out the SL7 Panda Queue complement for AGLT2. When the complement of SL6 WN drops below some threshold, we will delete the SL6 Panda Queues and become SL7-only.
We are coordinating with the OSG folks on moving our non-ATLAS gatekeeper to SL7. This will most likely happen some time next week.
Singularity is installed on all WN as they are built, but no special configuration considerations have been implemented. Versions:
singularity-2.4.2-1.osg34.el7.x86_64
singularity-runtime-2.4.2-1.osg34.el7.x86_64
-
14:05
MWT2 (5m). Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
Overall, the site is performing well and is full of jobs
Singularity upgraded to 2.4.2 on all workers
UC
- four of the twenty new C6420s online and running jobs
- remaining sixteen are built but still offline
- SPEC results are low (less than 50% of what they are expected to be)
- BIOS settings are consistent, appear to be correct
- has anyone else had this issue with this latest batch of workers?
IU
- still waiting on power
- work order is in, but timeframe is unknown
UIUC
- nothing new to report
-
14:10
NET2 (5m). Speaker: Prof. Saul Youssef (Boston University (US))
Just about ready to start "NET3", a joint Tier 3 with BU, Harvard and UMASS/Amherst.
Progress on the HTCondor migration with Brian Lin's help. Harvard has upgraded to OSG 3.4 with the new HTCondor. The problem has so far not reappeared. Setting up to do the same on the BU side.
Working on the LCMAPS and BeStMan migration (we're not worried about usatlas1,2,3,4 since they are all in the same unix group and group permissions are enough to do everything). We're planning to use Wei's gridftp-posix with a callout for Adler32 checksum computation.
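For context on the checksum callout: Adler32 can be computed in streaming fashion from the Python standard library, which is roughly what such a callout needs to do. A minimal sketch under that assumption (function name and chunk size are ours; this is not Wei's actual plugin):

```python
import zlib

def adler32_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # zlib.adler32 accepts a running value, so large files
            # never need to be held in memory at once
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for the multi-GB files a storage-element callout would typically see.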
Working on GPFS migration so that the system pool is on warrantied equipment.
Preparing for an initial NESE data lake deployment: ~12 PB raw, including substantial buy-in from Harvard.
Reminder: We're planning to migrate the NET2 storage endpoint into NESE.
Added Fermilab access for OSG jobs.
Sites consistently full with smooth operations.
Hoping for ESnet's help to restart our LHCONE peering.
SL7 transition is on the agenda.
-
14:15
SWT2-OU (5m). Speaker: Dr Horst Severini (University of Oklahoma (US))
- all OU sites working well
- still working on getting rucio to use READ_LAN and WRITE_LAN, in order to stage-in/out from internal xrootd directly. Working with Mario and Alexey on that
- Lucille is ready to be migrated from Lucille_SE to OU_OSCER_ATLAS_SE
- taking brief OSCER downtime this afternoon for RAM replacement and BIOS updates
-
14:20
SWT2-UTA (5m). Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
SWT2_CPB
- Seemingly solved an issue related to XRootD checksumming that was causing many problems.
- Major power outage when the utility feed burned up and the building generator failed. Both have been repaired.
- Delayed working on HTCondor while dealing with the above
UTA_SWT2
- Updated firmware in the Dell 4032 stack to avoid lockup issues
- Power outage at SWT2_CPB affected the network path for this cluster.
-
14:25
WT2 (5m). Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
→
14:35
AOB (5m)