Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 14 0782 with code: 4880.

Apologies: Kashif

Experiment problems/issues (20')     

Review of weekly issues by experiment/VO

Yesterday's WLCG daily ops update: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek120827#Monday.

- LHCb

LHCb reprocessing starts 20th Sept.
Two sites of concern: Durham and RHUL seem to have various problems with pilots getting aborted. This is not fully understood; waiting on it.
Posting a link for RHUL, see transcript.
Jeremy: Might be related to the removal of a CE.

Raja: Govind, let me know when the site is stable so that I can have it re-enabled.
Govind: It's stable now.


- CMS

Stuart: Things are going reasonably OK. Brunel had an issue a few days ago. Security Challenge: sites are ready.
Challenge details discussed.

- ATLAS

* DDM central catalog load balancing not functioning correctly over the weekend.

Manchester
Problem with a disk server; a warning DT was declared.
On Friday the DPM service could not be restarted. Outage DT on Sunday.

ggus 85485
ggus 85316 (opened by Brian)

Glasgow
Power cut over the weekend; DT declared.
Still seeing cmt timeout errors.
https://ggus.eu/ws/ticket_info.php?ticket=85508 is closed.

Hopefully the cmt issues will be resolved shortly.

UCL
Storage problem.
https://ggus.eu/ws/ticket_info.php?ticket=85467 was last updated by Ben on 24.08.
The site is blacklisted.

RAL
FTS errors for T2s in the IT and FR clouds.
https://ggus.eu/ws/ticket_info.php?ticket=85438
Reopened: files were declared lost but they are still in the system.

Lancaster
Problem with one disk server.
The analysis queue was set to test mode an hour ago because of "Staging input file failed" errors.
https://ggus.eu/ws/ticket_info.php?ticket=85538

Durham
https://ggus.eu/ws/ticket_info.php?ticket=84123 was last updated 2 weeks ago.
The site is in test mode.

Brian on the tickets: Manchester had a ticket because they had an at-risk with a specific issue for one disk server; my ticket was because at that point the whole site was failing transfers.


- Other



 11:20         
Meetings & updates (20')     


With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

EGI requires that dteam is supported as well as ops.
AW: Moving to glite-cluster, but doing things manually. Currently MyEGI and REBUS disagree.

Chris: Might be able to help; we use glite-cluster, a very simple service. It is the easiest way of solving publishing issues after decommissioning LCG-CEs.

VO feedback on EMI-WN: We also need to check that smaller VOs are able to use EMI WNs. If anyone is in contact with the smaller VOs, please get them to test the EMI WN queues.

Rebranding NGS to National e-Infrastructure Service.

Steve: installing EMI cluster, had issue with EMI CE.

GOC DB at risk from 8-2 on the 30th for some updates.



- Tier-1 status
- Accounting
- Documentation

Request to review Glue 2 documentation.

- Interoperation

Next meeting 10th September.
Stuart reviewing service update tickets.

- Monitoring
- On-duty

The dashboard takes a long time to update tests (WMS at Imperial). The acceptable length of alarms for the month has been exceeded. Had to close alarms because the site was in DT but the dashboard was not aware of it, which might count against the site.

Any services down for over a month at Brunel? dgc-grid-50, an SE, is to be retired (see transcript).

- Rollout

- Security

- Services

14 sites in the BNL dashboard. RALPP and IC are not in the gridpp community.

Cambridge: Asymmetry in bandwidth; John is going to have a look at it.


UK CA tag meeting tomorrow - contact Jeremy with any questions.
VOMS: Waiting for Kashif to return - is he back next week? Pete: A week from now.
Still running with the NGS config; moving to GridPP? Robert: some things to sort out, should do it soon, no date planned.

- Tickets

Matt: Up to 48 tickets; the buildup is due to people being on holiday. Also 7 tickets for gLite 3.1 and other tickets on UserDNs. It may be that we need to look at holiday cover for sites/VOs.

Matt (site): Having problems with WMSes not getting job updates.

- Tools

- VOs

- Site updates

 11:40         
Site roundtable (20')     

- General problems & issues

- EMI-WN rollout

* EMI-2 SL5 in progress at Liverpool. Test cluster 'almost' working: EMI-2/SL5 CREAM, TORQUE and WN; 1 node, 8 slots.

* RAL T1:

EMI-2 SL5 queue consisting of 4 worker nodes (32 job slots in total). It's behind the "gridTest" queue available on each CE:

lcgce03.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
lcgce07.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
lcgce08.gridpp.rl.ac.uk:8443/cream-pbs-gridTest
lcgce09.gridpp.rl.ac.uk:8443/cream-pbs-gridTest

Ian: As far as I know these are available to all - I've just double checked.

* Oxford: EMI2 on SL5 test system behind the CE 't2ce02.physics.ox.ac.uk'.
It's an EMI 2 on SL5 Cream CE, with a pair of 8-core EMI 2 on SL5 worker nodes
(so a grand total of sixteen cores).

* Brunel 1 (all EMI-1: CREAM, WN, glexec, Argus): dc2-grid-68, dc2-grid-70, dgc-grid-43.

* Brunel 2: EMI-2 CREAM (with glexec) running on SL6. It's dc2-grid-65, 16 job slots.


- glexec/ARGUS status -- http://tinyurl.com/ceomfrn

CMS pushing for glexec
Alessandra: ATLAS to report in September.

RHUL:
Sheffield: We have an ARGUS server but not glexec on the WNs. In principle we have an EMI CE ready for production and are finishing the final bits of the 10G installation.
Glasgow:
RAL-LCG2: Nothing specific, EMI2, SL6
Liverpool: Steve looking at EMI2 services, hopefully working soon. Longer term, possibly moving DPM to EMI2/SL6. Still working on networking.
Manchester: There is the storage to rescue, which will be the main thing this week. 10G interfaces on storage.
RALPP: Replacing our gLite 3.2 services with EMI; the BDII will go live later today, ARGUS later. Testing 10G; hit a problem that the S4810 is not as capable a router as hoped.
Brian: Working with Andrew Lahiff, looking into FTS3. One of the main differences is ... focussed rather than channels. How many gridftp transfers can you handle? Use current numbers for defaults; see transcript for details.
Cambridge: gLite 3.1 CE, to be solved by turning it off over the next couple of weeks. Do have a plan to move gradually to EMI; will do glexec and ARGUS at the same time.
Oxford: EM working on the EMI test setup, waiting to see if users have problems. Migrated storage to EMI. This week helping Sussex. Interlagos nodes seem sensitive to power blips. Working on FTS transfers.
IC: Daniela working on moving to EMI aggressively.
Jeremy: UCL? Duncan: Not sure.
ECDF: Heavy load. gLite 3.1: shouldn't have any now. Working on moving over to EMI. Did have ARGUS/CE in the background, not in production. No plans for glexec WNs. Operations: rising failure rate for ATLAS, cmtsite timeouts, some network issues, FTS stats seem low, perfSONAR boxes need firewall ports opened.
Lancs: gLite 3.1-free for a while; ancillary services on EMI, need to move the hard stuff now. Have an ARGUS server running but nothing uses it. Need the glexec WN tarball. EMI-2 on SL5 or SL6?

Jeremy: The glexec tarball is working, but some config is required at site level. Not sure that EMI are going to work on it any more.

 12:00         
Actions (5')     

To be completed:
https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

Completed:
https://www.gridpp.ac.uk/wiki/Operations_Team_Completed_Actions

 12:05         
AOB (1')

Reminder of GridPP29 registration; see transcript.


Transcript:

[11:01:25] Stuart Wakefield joined
[11:01:54] Mark Norman joined
[11:02:08] Sam Skipsey joined
[11:02:21] RECORDING David joined
[11:02:48] Rob Harper joined
[11:03:00] Andrew McNab joined
[11:03:10] Chris Brew joined
[11:03:43] Raja Nandakumar http://lhcbweb.pic.es/DIRAC/LHCb-Production/visitor/systems/accountingPlots/getPlotImg?file=Z:eNp1js1KA0EQhJ8oYXqYTXb7JgpRsiTiJg8wZsqlcf-Y7gX16d3kpIiHOhQUX31JS76MQ3qQiyUN3IihU6q4vt-tz_unVX3yq5fHc72e37Fs2zzOkwztsnG8y5IaizYroFvGkE7S4yuaeEceUkJ8gBBBCgcJSyryvthggFaMD8vxLrealIi7qNbg6qKycaF0Dj8Pf_9dW8Y0ZjvEHkqBD3P_inx8e5ZuNF3oajHbHx_a_udTsn1OuOEKvlHwDZeGYkA=&nocache=1345809041002
[11:03:56] Matt Doidge joined
[11:05:25] John Hill joined
[11:06:43] Pete Gronbech joined
[11:07:10] Alessandra Forti atlas report is given by Elena even if I'm here as I wasn't around last week
[11:07:36] Duncan Rand joined
[11:07:39] Elena Korolkova fine with me
[11:08:38] Jeremy Coles okay - thanks. I uploaded the report.
[11:09:12] Govind Songara Raja, until 23rd ce3 was broken and it is stable since then, please enable job submission
[11:09:32] Govind Songara ce3 is emi-2 cream
[11:10:01] Jeremy Coles Elena you dropped out for 10 seconds.
[11:13:08] Raja Nandakumar Hi Govind, what about the 27th afternoon?
[11:13:55] Alessandra Forti I closed Brian ticket and added a reference to the team ticket.
[11:16:03] Andrew Washbrook joined
[11:18:23] Elena Korolkova you can lose upto 3% of month efficiency
[11:18:34] Elena Korolkova because of that
[11:19:20] Sam Skipsey sure, Elena. As I said, we were seeing if CVMFS would just roll out nicely with automount, but it turns out that new "many core" nodes are never free of atlas jobs, so you do actually need to offline them to let cvmfs update.
[11:19:29] Govind Songara Raja: on the 27th the torque server was hanging, which affected all jobs. I still need to find it
[11:20:44] Govind Songara Has anyone noticed 2.5.7 torque server hangs?
[11:20:52] Chris Brew yes
[11:21:00] Robert Frank joined
[11:21:34] Alessandra Forti yes, it is difficult to update cvmfs for atlas on multicores, it can take weeks.
[11:22:20] Chris Brew it has a bug - it runs out of file handles
[11:22:31] Chris Brew torque not cvmfs
[11:22:46] Andrew Washbrook Hi Chris - https://ggus.eu/ws/ticket_info.php?ticket=85514
[11:23:07] Andrew Washbrook I have probably done something wrong - but cannot see where the issue is
[11:23:10] Chris Brew increase the number for root in /etc/security/limits.conf to increase the time between hangs
[11:23:19] Chris Brew or update to a newer version
[11:23:38] Andrew Washbrook I will just use glite-cluster if this drags on
[11:24:00] Chris Brew We run 2.5.12-1
[11:24:13] Govind Songara Thanks Chris, i will try
[11:24:42] Chris Brew but some of the package names have changed so it doesn't sit so nicely with the glite/EMI stacks
[11:24:54] raul lopes joined
[11:26:29] John Hill UserDN - I've just made the change
[11:26:44] Andrew Washbrook sorry - an oversight - will do this today
[11:33:43] raul lopes it's dgc-grid-50
[11:33:49] raul lopes to be retired
[11:34:04] raul lopes SE
[11:35:55] Rob Harper Nothing to add.
[11:36:37] raul lopes I declared dgc-grid-50 in downtime for 6 weeks because the EGI rules for retiring an SE say so.
[11:37:04] Matt Doidge It's all gone quiet for me...
[11:37:34] Matt Doidge left
[11:37:39] Matt Doidge joined
[11:42:16] Jeremy Coles https://perfsonar.racf.bnl.gov:8443/exda/?page=25&cloudName=UK
[11:45:22] Robert Frank no date planned
[11:46:28] Robert Frank are we still planning to go ahead with 1 server
[11:50:51] Alessandra Forti for VOMS?
[11:51:06] Alessandra Forti we should have another discussion with NGS/NIS
[11:51:37] Alessandra Forti there were some unanswered questions to take a decision
[12:01:47] Brian Davies machine froze when I opened my mic. Looking at FTS3 with Andrew L, starting with RALPP and ECDF for T2s. Initial number of slots per T2 SE as follows:
[12:02:02] Brian Davies T2 slots
durham 7
bristol 24
b'ham 26
ucl 32
cambridge 33
rhul 37
brunel 42
oxford 43
sheffield 43
ecdf 47
liverpool 48
manchester 50
ralpp 62
glasgow 65
lancaster 65
qmul 81
Imperial 168

[12:06:12] raul lopes sorry. I've got to leave. Construction works in the office.
[12:06:15] raul lopes left
[12:12:43] Raja Nandakumar Apologies - I too have to go.
[12:12:45] Raja Nandakumar left
[12:14:28] Robert Frank not at the moment
[12:15:19] Jeremy Coles https://www.gridpp.ac.uk/gridpp29/
[12:16:05] Brian Davies left
[12:16:11] Gareth Roy left
[12:16:13] Mark Norman left
[12:16:14] Robert Frank left
[12:16:14] John Hill left
[12:16:15] Stuart Purdie left
[12:16:15] John Bland left
[12:16:19] Stuart Wakefield left
[12:16:20] Duncan Rand left