ScotGrid Technical Meeting

Other Institutes

Gareth Roy (University of Glasgow)
ScotGrid Technical Meeting 2 April 2014
Andy W
David C
Gang Q
Gareth R (chair)
Oliver S
Sam S (minutes)
Ewan: this week is Grid Week. Set ourselves an ambitious timeline for
upgrading/replacing services (by the end of this week!), including
SLURM migration and ARGUS.
This may actually result in Ewan's death.
Ewan had some questions about the cluster, but will follow up by email.
(Do we have to publish Glue 1.3 presently? Yes.)
It was also mentioned that giving feedback on SLURM/Grid integration
experience would be useful at a GridPP level.
Andy: The long running middleware ports opening saga is resolved -
further deployments are now much easier.
Some things still need checked, but a lot has been done already.
(This now makes the EMI3 upgrade easier, which we are now working on.)
All being well, the upgrade should complete before the deadline.
Some problems with APEL accounting host (which is a VM). Can't even
power off the VM in the hypervisor(!). Considering bouncing the
Wahid has been working on the perfSONAR issue (which is resolved), and the
EMI3-WN tarball. (It does not work out of the box.)
Tickets 102914 LHCb "discussion". Waiting for LHCb.
102202 EMI2/3 ticket (progressing - now done site BDII, and ARGUS: we
note that clear cache + reload policy is necessary on the ARGUS
upgrade. Tested EMI3 WNs and CE, seem to be happy. Plan is to now roll
out WNs and then CEs - rate limiting step for CEs is that we also need
to switch them to the EMI3 publisher. This is complicated by the fact
that, to avoid the issues with BDIIs for site info, one of our CEs
publishes all our site info - the plan is to switch this to a separate
box that only publishes site info (so no CE is special). This box
being up and working limits our ability to migrate all of the CEs.
Last thing on our list is the UI.)
Working on disks for the (now out of warranty) disk servers.
Re: Group Chat question - you also need to be careful to publish your
ARGUS (and add it to the Site BDII site_urls list so it is properly
Discussion - Ewan's plans to upgrade the Site BDII by replacement. Do
you need to do anything other than updating the GOCDB to change a site
BDII url?
(Advice is to ask TBSUPPORT just to be sure.)
Gareth asked if people want him to start pushing people to get an
internal discussion to happen? (It was agreed that there would be an
internal (technical) discussion, but there seems to be little momentum
on this.)
The benefit of this would be that we would get to know what
CloudSoft do, and they would get to know more about our technical
Site Contacts - looks like the GOC has a site contact name including
Mark's email address. We need to remove this.
While we are doing this, we should also check over the contact names
to check everything is okay.
Quarterly Report: (via Vidyo shares)
There was a discussion about the nature of the Availability metric
derived from Steve Lloyd's centralised tests (versus the actual SAM
tests from ATLAS, which differ significantly in their values).
Publishing metric probably needs corrected for ECDF as they're having
APEL issues.
Storage accounting and VO support table needs updated. (For ex: need
to remove NGS from the list of supported VOs, since the NGS is
Durham's supported VOs: atlas, cms, dream, zero, lhcb, pheno, zeus,
ilc, ops, camont, cdf,, mice, gridpp,,
compchem, planck
Storage use needs to be generated (using the tools).
A draft quarterly report will be circulated.
Chat log:
Andrew John Washbrook: (02/04/2014 11:02)
hi chaps
That's a relief - glad he is not talking to himself
not quiet - just a bit shy
Ewan: (11:11 AM)
all of our service nodes are virtual!
Andrew John Washbrook: (11:12 AM)
good - you are the go-to guy for all my problems!
Ewan: (11:13 AM)
David, have you got documentation for fixing the ARGUS EMI1 alarm, as
that's what I got ticketed for
Gareth Roy: (11:14 AM)
It's an upgrade, the alarm should go once you've moved to EMI-3.... if
you're already on EMI-3 it might be a false positive
Ewan: (11:15 AM)
I'm already EMI3 but it's not publishing something that is being tested
quite a few UK sites are failing this alarm
Sam Skipsey: (11:19 AM)
The documentation claims that top bdiis talk to the GOC
Andrew John Washbrook: (11:24 AM)
Sam Skipsey: (11:33 AM)
dpm-sql-usage-by-vo-user on your DPM head
    • 11:00 AM - 11:15 AM
      Tickets 15m
      • Durham 5m
      • Edinburgh 5m
      • Glasgow 5m
    • 11:15 AM - 11:25 AM
      Research Updates 10m
      • Codebase - CloudSoft 5m
    • 11:25 AM - 11:35 AM
      AOB 10m
      • Site Contacts 5m
      • Quarterly Report 5m