Outstanding tickets
-
147841 UKI-SCOTGRID-GLASGOW less urgent waiting for reply 2020-07-14 14:27:00 UKI-SCOTGRID-GLASGOW: deletion problems
- Problem with these deletions resolved. A final set of work remains to solve the underlying problem with the files in the namespace
-
147792 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-07-13 08:41:00 UKI-NORTHGRID-MAN-HEP deletion errors with message: DavPosix::unlink Authentication error
- Files likely declared lost; to follow up on the ticket.
-
147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-13 09:11:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures
- Still seeing periods of overloaded disk servers; plan to ease the disk server problems later in the week with ZFS tuning
- O(3PB) of new servers with empty space; writes are directed to the empty machine, which overloads it; similar issues seen at GLA and SHEF
- Discussion on storage solutions and interaction between experiments and sites followed; some notes:
- Infrastructure and maintenance: difficult to fund this.
- Experiments change their mind, based on current needs … planning for the wrong eventualities
- Sam - number of hurdles for new technologies; and assumptions of established code working
- Experience with setting up a DPM xroot cache: works well
- Use of cache to ‘change type’, e.g. direct-io to locally staged.
- GLA, proxy cache to talk to DPM, for local users (set up with existing hardware)
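A rough illustration of the overload noted above for newly added empty servers. This is only a sketch: it assumes, hypothetically, that new writes are spread across servers in proportion to their free space, which is not confirmed here as the actual placement policy. The server names and sizes are invented for the example.

```python
def write_share(free_tb):
    """Fraction of new writes each server receives, ASSUMING placement
    proportional to free space (a hypothetical policy for illustration)."""
    total = sum(free_tb.values())
    return {name: free / total for name, free in free_tb.items()}


# Hypothetical pool: nine nearly full servers (5 TB free each)
# and one newly added, empty server (300 TB free).
servers = {f"old{i:02d}": 5.0 for i in range(1, 10)}
servers["new01"] = 300.0

shares = write_share(servers)
print(f"new01 receives {shares['new01']:.0%} of writes")  # 300/345, about 87%
```

Under this assumption one machine takes most of the incoming write traffic, which is consistent with the overload seen when empty servers join a nearly full pool.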
-
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-07-14 10:00:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Restarting services in DPM;
- in progress.
-
146651 RAL-LCG2 urgent in progress 2020-07-14 09:27:00 singularity and user NS setup at RAL
- Ticket was updated, largely on hold until other priority upgrades rolled out.
-
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
-
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
-
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- Access to the site to re-rack the machines is being planned; requires agreement from the University for non-emergency access.
CPU
-
RAL
- Largely recovered from quota change, and some network issues
-
Northgrid
- Lancs: no jobs for the last couple of days
-
London
-
SouthGrid
-
Scotgrid
- ECDF few jobs
- GLA down to 50% of nominal capacity:
- O(600) cores off due to problems with AC downtime
- Ceph queue issues for xrootd transfers:
- rebuild xrootd plugins using RAL updates
- Sam to update the Jira
- If not resolved by Tuesday, fallback to gridFTP
Other new issues
- New Site monitoring MONIT Page:
- Lancaster
- Jobs failing; appear to hit memory limits, especially (and unusually?) with voms-proxy-init
- Changed the configuration to allow for softer memory limits; needs to be picked up by Harvester
- Noted that ARC runtime environment scripts for modifying the environment are useful
Ongoing issues
-
CentOS7 - Sussex
-
Grand Unified queues
News round-table
-
Vip
- OX working OK; problems with Condor upgrade to 8.8.X, not working for ATLAS; went back to 8.6
- Following up on question of Sheffield
-
Matt
-
Peter
- Bad disk server; how to identify the device name?
- Matt, Sam to dig out the recipes (sent via mailing list)
-
Sam
-
Gareth
- Cores offline from AC issues; access restrictions make things challenging
-
JW
-
Patrick
- Access to DC still needed