Discussion among Rob, Xin and Wei:
It is better to decouple the CentOS 7 migration from the Singularity deployment, so that the C7 migration can happen sooner. ADC doc for C7 migration:
US sites have the option of a Rolling Transition or a Big Bang Transition. We will work site-by-site to help with the transition. ADC strongly suggests that the Singularity 2.4.2 RPMs be installed on C7 WNs.
On Singularity: see presentation at ADC Site Jamboree:
We're trying to track down deprecated OSG environment variables (https://jira.opensciencegrid.org/browse/SOFTWARE-3011). The following don't appear to be used by any pilots:
So we would like to remove them in OSG 3.4 or, at the very least, announce their deprecation.
Last week I attended two networking meetings: LHCONE/LHCOPN in Abingdon (https://indico.cern.ch/event/681168/) and the perfSONAR annual developer meeting in Amsterdam (no public link). Lots of good discussion at both. The LHCONE/LHCOPN meeting report is at https://indico.cern.ch/event/681168/attachments/1616425/2569199/LHCOPNE-20180307-Abingdon-meeting-report.pdf
Today was the 2nd HEPiX NFV WG meeting (https://indico.cern.ch/event/705126/). Next meeting: April 25, 10 AM Eastern. Live notes are at https://docs.google.com/document/d/1CTsAqioZY8pcCDf3S7GbObHD_Sic06BF15dPmaVjOcM/edit
Questions on these meetings?
I won't go into other networking details here unless there are questions. Next week at the OSG AHM there are four talks on networking:
USATLAS meeting: Network evolution (Shawn)
Joint USATLAS/FIFE/USCMS meeting: perfSONAR discussion (Shawn)
Tuesday afternoon: OSG Networking Analytics: Evolution and Status (Shawn / Ilija)
Wednesday afternoon: OSG Networking (Shawn)
If you have questions (or specific things you think need covering in any of the above), bring them up now or email me.
ALCF: The locally installed Rucio version was outdated enough to cause issues. Had to reinstall Harvester to get things consistent again. Back online, but needs work. Discussing with Doug G whether we should continue with dedicated tasks or switch to grid-style running; each comes with its own benefits and drawbacks.
NERSC: Harvester up and running on Cori-P1/P2, processed 50M+ events over the past 7 days.
OLCF: Harvester now running for the Allocation jobs queue. Running 3 batch jobs at a time with 800 nodes each.
OLCF/ALCF: still in development.
Four C6420 chassis and sleds are being racked today. We will configure them as SL7 WN as they come up.
All WN at MSU are now running SL7 as are 1/3 of the WN at UM. We are developing a plan to move the balance of the UM WN to SL7 by the end of March.
As of today, all of our dCache servers are dual IPv4/IPv6 stacked. We have not yet registered AAAA records though.
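Once the AAAA records are registered, a quick resolver check along these lines could confirm which servers are visible over IPv6. This is only a sketch: the function names are made up for illustration, the hostnames passed in would be the site's own, and `socket.getaddrinfo` reports what the local resolver returns (including any /etc/hosts entries), not what is authoritatively published in DNS.

```python
import socket


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return the set of families ('IPv4', 'IPv6') for which hostname resolves."""
    found = set()
    for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        try:
            # A non-empty result means at least one address of this family exists.
            if resolver(hostname, None, family):
                found.add(label)
        except socket.gaierror:
            # No record of this family (or the name does not resolve at all).
            pass
    return found


def missing_aaaa(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that do not resolve to any IPv6 address."""
    return [h for h in hostnames if "IPv6" not in address_families(h, resolver)]
```

Calling `missing_aaaa([...])` with the list of dCache server names would return those still lacking an IPv6 address from the point of view of the machine running the check.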
We have a network interruption between UM and MSU on Thursday night that will adversely impact HTCondor communications between our sites for up to 4 hours. Consequently we will be idling down all MSU WN starting later this afternoon to minimize the number of jobs lost during the outage.
On Friday after the MSU WN set is back online, we will add SL7 Analysis and LMEM queues, rounding out the SL7 Panda Queue complement for AGLT2. When the complement of SL6 WN drops below some threshold, we will delete the SL6 Panda Queues and become SL7-only.
We are coordinating with the OSG folks on moving our non-ATLAS gatekeeper to SL7. This will most likely happen some time next week.
Singularity is installed on all WN as they are built, but no special configuration considerations have been implemented. Versions:
Overall, site is performing well and is full of jobs
Singularity upgraded to 2.4.2 on all workers
Just about ready to start "NET3", a joint Tier 3 with BU, Harvard and UMASS/Amherst.
Progress on the HTCondor migration with Brian Lin's help. Harvard has upgraded to OSG 3.4 with the new HTCondor. The problem has so far not reappeared. Setting up to do the same on the BU side.
Working on LCMAPS and BeStMan migration (we're not worried about usatlas1,2,3,4 since they are all in the same unix group and group permissions are enough to do everything). We're planning to use Wei's gridftp-posix with a callout for Adler32 checksum computing.
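For reference, the Adler32 computation itself can be sketched with Python's standard `zlib` module. This is not the gridftp-posix callout interface, which is not described here; the function name, the chunk size, and the zero-padded 8-digit lowercase hex output are assumptions chosen for illustration.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files never sit in memory.
            checksum = zlib.adler32(chunk, checksum)
    # Mask to 32 bits and format as zero-padded lowercase hex (assumed convention).
    return format(checksum & 0xFFFFFFFF, "08x")
```

Reading in chunks keeps memory use flat regardless of file size, which matters for multi-GB data files.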
Working on GPFS migration so that the system pool is on warrantied equipment.
Preparing for starter NESE data lake deployment: ~12 PB raw, including substantial buy-in from Harvard.
Reminder: We're planning to migrate the NET2 storage endpoint into NESE.
Added Fermilab access for OSG jobs.
Sites consistently full with smooth operations.
Hoping for ESnet's help in restarting our LHCONE peering.
SL7 transition is on the agenda.
- all OU sites working well
- still working on getting Rucio to use READ_LAN and WRITE_LAN, in order to stage-in/out from internal xrootd directly. Working with Mario and Alexey on that
- Lucille is ready to be migrated from Lucille_SE to OU_OSCER_ATLAS_SE
- taking brief OSCER downtime this afternoon for RAM replacement and BIOS updates