Deployment team & sites

Europe/London
EVO - GridPP Deployment team & sites meeting

Jeremy Coles
Description
This is the biweekly DTEAM & sites meeting. The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area. The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 77907 with code 4880.
Deployment team & sites (04 Jan 2011) Chaired by: Jeremy Coles

Jeremy Coles (Chair+minutes)
Richard Hellier
Matt Doidge
Santanu Das
Gareth Smith
Rob Fay
Raja Nandakumar
Pete Gronbech
Ben Waugh
Andrew McNab
Andrew Washbrook
Mingchao Ma
 
Apologies: Alessandra Forti

Experiment problems & issues

Raja: Things have been running smoothly. UCL has a problem – GGUS ticket 65727 was opened for NFS/SQLite locking. Even the power failure on New Year's Eve did not create much of a stir, even though many jobs were running – DIRAC may have resubmitted jobs until they finished correctly, masking any failures.

CMS – no report

ATLAS – no report

Pete: Oxford offline in Panda this morning. Does anybody know anything?

Gareth: Some knock-on effects from RAL issues. Only a few sites are now running. There were two problems over the holiday period. A load/full service class problem for ATLAS: some disk servers were deployed into that area but it ran on one service class for quite some time. In the end batch capacity and FTS channels were reduced for ATLAS. Essentially the load was such that CASTOR could not process anything in that service class for a couple of days between Christmas and New Year. ATLAS is now in the process of deleting files to create space. Altogether a bit messy, and it led to failing ATLAS SAM tests. This was fixed on the 30th.

On New Year's Eve a check found an overheating power distribution unit in the machine room. It was turned off as a matter of urgency. There was little immediate impact as two PSUs are used. The SRMs were taken down for a short while to consider options. Overnight, though, the Oracle database disk arrays were complaining. They use two PSUs: one side UPS and the other clean LPD power. When checked, the disk arrays hosting the LFC/3D databases were running only on the UPS feed, which is known to be noisy, and the disk arrays shut themselves down… on New Year's Day RAL took all the services using the DB down as a precaution. People on site did some re-cabling of the disk arrays and this led to an outage on the afternoon of 1st January. Services came back okay. Still running at reduced capacity for ATLAS.
Perhaps sites are offline because RAL is not taking files from them. Many UK sites are off in Panda: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhome. [See update from Gareth below]


Meetings & Updates

Jeremy: ROD did function over the period thanks to the efforts of Daniela and Stuart. Just a couple of problems were reported and these are being followed up today as sites come out of "at risk". Overall things ran as would be expected.

Tier-1 update given above.

Operational security: No new issues to report.
Next GDB is on 12th January: http://indico.cern.ch/conferenceDisplay.py?confId=106639. Focus is on technical areas that are coming back into focus such as Argus. Afternoon will include updates on the experiment data access/management demonstrators.
 
Site review & plans

Richard (RAL): Working on ARGUS server at RAL.

Matt (Lancaster): A number of issues over the period – worked up until Christmas Eve and late changes affected site availability. gLite-APEL was deployed just before… it is not publishing at the moment, and has not for the last 10 days. Waiting for feedback as we believe a warning comes at 14 days.

JC: And critical at 30 days. Please note that there was a general APEL problem that affected all sites. The RAL broker went down on the 31st as the log file was too big. This has now been fixed.

Santanu (Cambridge): Not much to report. Will install CREAM this week. APEL much the same. Have site accounting up and running and will check if this is being fed properly. Have not tried the latest gLite-APEL but now using Quill to cross-check accounting figures.

Pete: Need you to use the new Quill package to see how many CPU hours are recorded and compare this with APEL – expect about a factor of 10 difference.

Santanu: May only have two weeks of data at the moment. In answer to Jeremy's question about failures noted over the period, Cambridge had two power cuts in 24 hours. Everything was restarted but some services had not restarted properly. It is not clear why all services did not restart.

Gareth: To wrap up the earlier discussion: the ATLAS problem – because RAL was full we were not able to receive output from Tier-2s, so production was turned off. This is now coming back slowly. We should now see some work at the T2s.

Rob (Liverpool): Mainly OK. An issue with /dev/null turning into a regular file (rather than a special device file), which causes problems.

Pete: Seen at Oxford too… some ATLAS jobs seem to lead to this situation and it takes out the WN.
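[As an illustration only, not something presented at the meeting: a minimal check along the following lines could be run on a worker node to detect and repair the condition. It assumes a Linux WN where root can recreate the device node; the script itself and any cron wiring are assumptions.]

    #!/usr/bin/env python
    # Minimal sketch (not from the minutes): check whether /dev/null is still a
    # character device on a worker node and recreate it if it has been replaced
    # by a regular file. Requires root privileges to restore the node.
    import os
    import stat

    DEV_NULL = "/dev/null"

    def devnull_is_healthy(path=DEV_NULL):
        """Return True if path exists and is a character device."""
        try:
            st = os.stat(path)
        except OSError:
            return False
        return stat.S_ISCHR(st.st_mode)

    def restore_devnull(path=DEV_NULL):
        """Remove the bogus regular file and recreate the device node.

        Equivalent to: rm -f /dev/null && mknod -m 666 /dev/null c 1 3
        """
        if os.path.exists(path):
            os.remove(path)
        os.mknod(path, 0o666 | stat.S_IFCHR, os.makedev(1, 3))

    if __name__ == "__main__":
        if not devnull_is_healthy():
            restore_devnull()
            print("%s was not a character device; recreated it" % DEV_NULL)

[Run periodically (e.g. from cron) such a check would catch the condition before the misdirected output fills the WN.]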

The GridPP Nagios server had a hiccup on Saturday around 5pm. The proxy for Kashif did not last 30 days. The proxy was renewed at 5pm and things came back to life. The outage was about 7 hours. Not sure what the knock-on effect was.
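
[A possible guard against this, sketched below and not something agreed at the meeting: a small check that warns when the proxy used by the monitoring tests is close to expiry. The proxy path and the two-day threshold are assumptions; voms-proxy-info -timeleft reports the remaining lifetime in seconds.]

    #!/usr/bin/env python
    # Hypothetical sketch: warn if the monitoring proxy is close to expiry.
    # PROXY_FILE and WARN_SECONDS are assumptions, not values from the minutes.
    import subprocess
    import sys

    PROXY_FILE = "/var/lib/nagios/proxy.pem"   # assumed location of the Nagios proxy
    WARN_SECONDS = 2 * 24 * 3600               # warn when less than two days remain

    def proxy_time_left(proxy_file):
        """Return the remaining proxy lifetime in seconds via voms-proxy-info."""
        out = subprocess.Popen(
            ["voms-proxy-info", "-timeleft", "-file", proxy_file],
            stdout=subprocess.PIPE).communicate()[0]
        return int(out.decode().strip())

    if __name__ == "__main__":
        left = proxy_time_left(PROXY_FILE)
        if left < WARN_SECONDS:
            print("WARNING: proxy %s has only %d seconds left" % (PROXY_FILE, left))
            sys.exit(1)
        print("OK: proxy %s has %d seconds left" % (PROXY_FILE, left))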

Ben (UCL): Not much to report. The LHCb issue will be looked at. Adam Davidson is looking at CREAM and has it almost working to submit jobs to Legion. Hope to set up two CEs, one fronting the new hardware.

Govind (RHUL): We switched off the MON box before the holiday. We see the same error as in the GGUS ticket. There is an earlier error. Over New Year there were some FTS failures, so the site was offline in Panda.

Andrew (Manchester): Things ran pretty much smoothly while at risk. Had the CREAM CE problem again and this is being investigated.

Andrew (ECDF): Fine at ECDF. The only thing of note: on Christmas Eve thousands of ATLAS jobs failed. The 15.6.3 release had failed to install correctly. Took that out of the published releases and things worked fine. Have an APEL box ready.

Mingchao (security): A quiet period for security over the holiday. No critical issues concerning UK sites (there are 2-3 sites outside of the UK with issues).


AOB: Nothing reported


EVO chat:
[10:58:49] Santanu Das joined
[10:59:22] Richard Hellier Yes!
[11:00:08] Gareth Smith joined
[11:00:48] Rob Fay joined
[11:01:14] Raja Nandakumar joined
[11:01:26] Jeremy Coles Will start in 2-3 minutes....
[11:01:37] Pete Gronbech joined
[11:04:33] Ben Waugh joined
[11:07:09] Govind Songara joined
[11:13:03] Andrew McNab joined
[11:13:30] Andrew Washbrook joined
[11:13:48] Pete Gronbech http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhome
[11:19:07] Mingchao Ma joined
[11:31:24] Ben Waugh can you hear me?
[11:34:36] Ben Waugh left
[11:35:24] Ben Waugh joined
[11:40:26] Richard Hellier left
[11:40:29] Raja Nandakumar left
[11:40:30] Ben Waugh left
[11:40:31] Rob Fay left
[11:40:32] Andrew McNab left
[11:40:32] Andrew Washbrook left
[11:40:33] Mingchao Ma left
[11:40:34] Gareth Smith left
[11:40:34] Govind Songara left
[11:40:35] Matthew Doidge left
 

    • 11:00-11:20  Experiment problems/issues (20m)
      - Review of Christmas & New Year holiday period issues by experiment/VO: LHCb, CMS, ATLAS, Other
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance issues
    • 11:20-11:35  Meetings & updates (15m)
      - ROD team comments & observations
      - Tier-1 update
      - Operational security -- checking results at https://pakiti.egi.eu
      - Upcoming meetings -- the next GDB is on 12th January: http://indico.cern.ch/conferenceDisplay.py?confId=106639
    • 11:35-12:00  Site review and plans (25m)
      - Problems that arose over the last 3 weeks -- accounting information does not seem to have been updating recently (since the last week of December); a probably related ticket is https://gus.fzk.de/ws/ticket_info.php?ticket=65771. If things were good generally then specific problems still appear for: Lancaster, RHUL, UCL, Cambridge and RALPP.
      - Plans for the coming month
    • 12:00-12:05  AOB (5m)