Service:

3 GGUS tickets:

1) Analysis jobs failed with "not enough resource available". This was caused by a ulimit set on our Condor system; it has been fixed.

2) UCORE jobs failed at authentication due to a problematic worker node; the node has been rebuilt.

3) Two sites have transfer errors (timeouts) to AGLT2; this is unresolved. The ticket has been changed to low priority.

Hardware:

Received 3 R740xd2 storage nodes (dCache pool nodes): 1 for MSU, 2 for UM. They are waiting to be provisioned.

 

Update on BOINC operation at AGLT2:

Reminder: we had difficulties with the kernel not effectively applying a lower priority to the BOINC jobs. They ended up receiving roughly half of the CPU cycles, which had never been the goal. At the last meeting Wenjing reported on switching to cgroups to try to tame that behavior, but we did not have results or graphs to show at the time.
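To illustrate why cgroups can help here, the following is a minimal sketch of how proportional CPU weights (for example, cgroup v1 cpu.shares) divide CPU time when every group is runnable and the CPUs are saturated. The group names and weight values are hypothetical and are not AGLT2's actual settings; the function name is ours.

```python
def cpu_fraction_under_contention(weights):
    """Return each group's fraction of CPU time when all groups are busy.

    With proportional-share scheduling, a group's fraction under full
    contention is its weight divided by the sum of all weights. When the
    other groups are idle, a group may use the spare cycles regardless
    of its weight.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical weights (assumption): 1024 is the cgroup v1 default for
# cpu.shares and 2 is its minimum allowed value.
weights = {"condor_jobs": 1024, "boinc": 2}
fractions = cpu_fraction_under_contention(weights)

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

With weights like these, the backfill group only gets a small fraction when the regular jobs want the CPU, but it can still soak up otherwise idle cycles. For comparison, two groups with equal weights would split a saturated CPU evenly.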

Here is a link to the current CPU efficiency, which is already back in a much more reasonable and acceptable range while we continue to tune this BOINC-backfilling model:

https://monit-grafana.cern.ch/d/000000696/job-accounting-historical-data?orgId=17&from=now-7d&to=now&var-bin=1h&var-groupby=dst_experiment_site&var-country=USA&var-federation=All&var-resources=All&var-tier=2&var-cloud=US&var-site=All&var-computingsite=All&var-nucleus=All&var-cores=All&var-eventservice=All&var-groups=All&var-inputdatatypes=All&var-inputprojects=All&var-outputproject=All&var-gshare=All&var-resourceserporting=All&var-processingtype=All&var-jobtype=All&var-jobstatus=All&var-error_category=All&var-measurement_suffix=1h&var-measurement_suffix_CQ=1h&var-retention_policy=long&var-division_factor=1&panelId=34&fullscreen