Singularity 3.1.0 in osg-upcoming
Sites that run opportunistic jobs need to update to at least HTCondor-CE 3.2.0 OR add the following line to the top of /etc/condor-ce/condor_mapfile:
GSI ".*,/[/A-Za-z0-9\.]*/Role=[A-Za-z0-9\.]*/Capability=NULL$" GSS_ASSIST_GRIDMAP
For example: https://github.com/opensciencegrid/htcondor-ce/blob/master/config/condor_mapfile.osg
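For sites taking the mapfile route, the change amounts to prepending that one line. A minimal sketch, demonstrated on a temporary file rather than the live /etc/condor-ce/condor_mapfile (on a real CE you would back up and edit the file in place, then run condor_ce_reconfig):

```python
# Prepend the GSI mapping line to a condor_mapfile-style file.
# Demonstrated on a scratch file; the production target would be
# /etc/condor-ce/condor_mapfile (back it up first).
import os
import tempfile

GSI_LINE = r'GSI ".*,/[/A-Za-z0-9\.]*/Role=[A-Za-z0-9\.]*/Capability=NULL$" GSS_ASSIST_GRIDMAP'

def prepend_line(path: str, line: str) -> None:
    """Write `line` at the top of `path`, keeping the existing contents."""
    with open(path) as f:
        original = f.read()
    with open(path, "w") as f:
        f.write(line + "\n" + original)

# Demo on a scratch file standing in for the real mapfile.
fd, path = tempfile.mkstemp(suffix=".mapfile")
with os.fdopen(fd, "w") as f:
    f.write("CLAIMTOBE .* anonymous\n")  # placeholder existing entry
prepend_line(path, GSI_LINE)
with open(path) as f:
    print(f.readline().rstrip())  # the GSI line is now first
os.remove(path)
```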
Hardware:
9 new C6420 worker nodes at the AGLT2 UM site have been put online, adding 56*9 = 504 cores; another 7 at MSU are still waiting to be racked.
Site Service:
All AGLT2 PanDA queues have been updated in AGIS to use Singularity.
AGLT2_UCORE is in test mode; we are ready to move to Harvester.
We found a low utilization rate in our Condor system, mostly due to configuration: 10% of cores were still in static partitioning, and some worker nodes were misconfigured so that not all of their available cores were claimed by Condor. We performed a usage analysis from both the Condor system and the ATLAS job archive (90% of AGLT2's cores can be used by ATLAS); the wall-time utilization of Condor was below 50%. To improve utilization, we reconfigured over 100 worker nodes and used ATLAS@home (BOINC) jobs to backfill. As a result, the CPU utilization of the cluster now reaches about 90%. AGLT2 is the biggest site contributing to ATLAS@home, contributing an average of 5200 CPU-days per day and simulating 1.7M events per day.
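As a sanity check on figures like these, wall-time utilization is just claimed core-hours over available core-hours. A minimal sketch in Python, using illustrative placeholder numbers rather than AGLT2's real accounting data:

```python
# Rough wall-time utilization check: claimed core-hours vs. available
# core-hours over a window. All input figures are illustrative placeholders.

def utilization(claimed_core_hours, total_cores, window_hours):
    """Fraction of available core-hours actually used."""
    available = total_cores * window_hours
    return claimed_core_hours / available

# Hypothetical 10,000-core cluster over one day: Condor jobs claim
# 110,000 core-hours, and BOINC backfill adds another 105,000.
cores, window = 10_000, 24
condor_only = utilization(110_000, cores, window)
with_boinc = utilization(110_000 + 105_000, cores, window)
print(f"Condor only: {condor_only:.0%}, with BOINC backfill: {with_boinc:.0%}")
```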
Site downtime between 2nd and 3rd March due to a power outage. We updated firmware on all nodes and switches, and updated dCache to 5.05. During the downtime we ran ATLAS@home jobs once the power was back on.
UC and UIUC are running well and are full of jobs. We are currently debugging a problem with one of the hypervisors at IU that is affecting scheduling efficiency there.
In the past week, one of our dCache servers started having issues. We were in a site downtime over the weekend while bringing dCache back up. We are now back online (using spare MD1200s in place of the faulty hardware), but we are still debugging with Dell to get appropriate warranty replacements.
During the dCache downtime, we also updated our gatekeepers to htcondor-ce-3.2.1-1.osg34.el7.
IU received new compute equipment last week. Fred and Neeha are in the process of bringing the new workers online.
The new storage at UC was put online a couple weeks ago. We are in the process of bringing the rest of the new UC hardware online. UC is still waiting on the arrival of the machine learning machine, but otherwise we've received everything in the most recent equipment order.
UC and IU switches have been reconfigured for IPv6; we are still testing. Adding IPv6 to the UIUC switches is still in progress.
We are dealing with a GPFS hardware issue in the pool that contains the GPFS metadata. It's being repaired, but we stopped the GPFS metadata scan (and thus updating of the JSON file) and borrowed some space temporarily to evacuate the affected LUNs and rebuild.
We had a meeting with Wei as part of the preparation for setting up a new NESE DDM endpoint. We're planning essentially a copy of what we're using now, but in Docker containers on the NESE side: multiple GridFTP endpoints with Wei's Adler callout and DNS round-robin for load sharing.
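DNS round-robin here just means publishing several A records under one hostname so clients spread across the GridFTP doors. A hedged sketch of how one could enumerate the endpoints behind such a name (the pool hostname is hypothetical; 2811 is the standard GridFTP control port):

```python
# List the addresses a DNS name resolves to; with round-robin DNS,
# one hostname fans out to every GridFTP door behind it.
import socket

def resolve_endpoints(host: str, port: int = 2811) -> list[str]:
    """Return the distinct IPs published for `host` (the round-robin pool)."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# Demo with a local name; a real pool name (e.g. a hypothetical
# "gridftp.nese.example.org") would return one address per door.
print(resolve_endpoints("localhost"))
```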
ALCC allocations are exhausted at OLCF and ALCF; we used more than 125% of the allocation at OLCF.
Now running at significantly reduced priority at OLCF.
At ALCF we will run backfill for jobs < 802 nodes. Initial testing of 1024 jobs shows that there is a scaling issue between Harvester and Yoda.
Running scaling tests at NERSC to understand the issues; up to 500 nodes works fine most of the time.
Now testing 750 nodes per worker when Cori comes back tomorrow.
- Issues with the SLAC storage element, so testing against BNL and AGLT2.
Rwg: introduced Marc Weinberg to the group; he will work on NSF HPC.