Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting


Description
- This is the weekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 115728 with code: 4880.
Apologies:
Ops Team minutes – Tuesday 15th November 2011
 
Present:
Jeremy (Chair+minutes); Raul; Pete G; Emyr J; Stuart W; Brian D; Gareth S; Alessandra F; Santanu D; Catalin C; Ben W; Ewan M; Wahid B; John B; Mark M; Mingchao M; Mark S; Govind; Andrew M; Kashif M; Raja; Stuart P; Chris B; Daniela; Andrew W; Rob H; Stephen J; Elena; Sam S
 
 
LHCb: Nothing significant to report. Reprocessing is winding down. It is not completely clear why some user jobs running at RAL have low efficiency (as low as 70%); no user has complained.
Have we provided enough info to IC for CVMFS?
Daniela: Yes the tests are now passing.
Bad WNs at Tier-1
Tier-1 outage for a firewall update this morning. There had been some low-level packet loss on that route which it is hoped the firmware update will improve. RAL also wishes to update one of the top-BDIIs to the UMD release.
Catalin: UMD 1.3.0 will be rolled out on all five nodes of the top-BDII alias. The first node will be deployed by Thursday morning, followed by a one-week cooling-off period before the remaining nodes. If anyone notices problems please ticket the Tier-1.
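For reference, a minimal sketch of how a site could check that every node behind the top-BDII alias still responds during the rollout. The alias hostname below is a placeholder (substitute the real RAL alias); port 2170 is assumed as the usual top-BDII LDAP port.

```python
# Minimal connectivity check for the nodes behind a top-BDII alias.
# Hostname is a placeholder -- substitute the real RAL alias.
import socket

ALIAS = "top-bdii.example.gridpp.rl.ac.uk"  # placeholder, not the real alias
PORT = 2170  # usual top-BDII LDAP port (assumption)

# Resolve every address behind the alias and try a TCP connect to each.
addrs = {info[4][0] for info in socket.getaddrinfo(ALIAS, PORT, proto=socket.IPPROTO_TCP)}
for ip in sorted(addrs):
    try:
        with socket.create_connection((ip, PORT), timeout=5):
            print(f"{ip}: OK")
    except OSError as exc:
        print(f"{ip}: FAILED ({exc})")
```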
ATLAS:
Problem at IC because the software area is full; ATLAS is trying to clean it. Elena will follow up on the ATLAS side.
Brunel is offline – initially thought to be due to the new pilot version, but the site is now in downtime.
Small issue with Durham.
Lancaster has a small problem with analysis, as does ECDF. The sites have been informed.
Brian suggested adding RHUL, Liverpool and Sheffield to the gpd sites.
 
CMS
Generally okay. RAL PP and IC load tests go through the Tier-1 firewall; it is believed this is possibly now fixed.
GS: The Tier-2 route out of the site affects everyone else at RAL. The main issue was the load this placed on the RAL link for others, for example when talking to other Tier-1s over the OPN. Packet loss tends to appear when the load is 10-20%.
 
HARDWARE:
Responses are due back by Friday. Duncan had asked about de facto guidelines for deploying switches etc. in clusters, given issues with certain types of kit. The Tier-1 has various thoughts. Many sites are looking at deploying 10Gb switches for storage, but the picture is less clear for WNs. Is there anything outstanding?
A quick summary of the funding situation, to get a consistent picture.

PG: For anyone who has not been contacted by their CB rep: GridPP is putting in bids to upgrade links from clusters to the JANET backbone. Clusters are funded via GridPP hardware money; the JANET funding is fine. We do not want clusters to be bottlenecked by cross-site issues. This may involve contributing to campus networks, upgrading switches etc. There are a lot of unknowns – whether the money will actually come, and how much – but if it does come it needs to be spent this financial year!

Ewan: While we do not know how much, it could be quite a lot. It is useful to think big, but in a tiered fashion: Part 1 is what we definitely need, while Part 2 is what would be very useful.
Pete: Outline proposals are needed by the end of this week. Price estimates will have to be used. Sites will need to have good chats with university networking teams.
Mark: At Glasgow we came up with a quick figure with the internal network teams. We can't upgrade everything, but we discussed that if we give them x amount they can consider what partial contribution this could make to other upgrades they might propose.
Pete: This is in the guidelines. There is a subtle distinction: we want to improve bandwidth to the cluster, but we cannot upgrade everything. If there is already a 10Gb link and the physics traffic can make use of it, then consider a second 10Gb link.

Alessandra: Manchester has the problem that it is connected through Net North West. We would need to upgrade the network link in Warrington, but we cannot fund that as it belongs to JANET. Even if the connection is already good, sites are encouraged to double the link bandwidth.
Pete: JANET are fully on board with this proposal and will get money at some point; this may help them.
Alessandra: The point is that there are limits.
PG: For us, we will want to move to 10Gb; putting the hardware in place future-proofs us for when the Oxford link gets upgraded. It is a case of buying to future-proof.
Chris W: QMUL is getting an upgrade to 10Gb, with JANET matching that. Costs are something like £3xk to install and £3xk recurrent for 3 years.
PG: Probably not, as this “colour” of money is capital. But there is a grey area – you could buy with a 5-year warranty.
Chris: It would be helpful if we could pay the first year's recurrent costs and the college paid for later years.
Glasgow's ideas break into three areas: the cluster, the interconnect, and the campus itself. If we make a capital expenditure contribution then others may benefit. There are many old Nortel switches that would go. We are looking at putting in a second (resilient) link to the cluster (layer-3 resiliency) and upgrading fibre-optic cable that is out of date.
AF: Are you planning to upgrade WNs to 10Gb?
Not at the moment. There are no 10Gb cards on this… it is a site decision.
 
The LHC portal gives PowerConnect information for Dell. Is the 1000-24F on the discount list? Could you send the link?
The Dell contact can get you access to the Dell portal. Gary's details can be sent to people on request.
Ewan: The Dell folks have also made comments about items not on the portal.
CW: The S4810 comes in at around £9000 list price. As the supply chain improves, prices may vary. If things are not on the portal now, they may be later.
The 8024F (PowerConnect) switch on the portal was £1180-something; the S4810 (Force10) is £9800. As a warning – Oxford found that the 8024F, which takes fibre (SFP+) connections, was still cheaper fully populated with cables than the RJ-45 version, with lower power consumption etc. If starting 10Gb from scratch, SFP+ modules may be better. Both are 24-port switches; they vary in the number of RJ-45 connections.

Do not limit yourself to just one supplier! There may be other campus equipment elsewhere… for example the Cisco campus network at Glasgow would skew interest. The cluster may be Dell-centric due to price, but campus and wide-area networks are more likely Cisco, so the hardware award needs to be thought through given the different costing models.

Ewan: A note of caution on the 8024F. We have two at Oxford. They are good value, but they are 24-port 10Gb switches and they are not great for interlinking. Ours are connected with 4 links, which takes up many ports. If we added another switch we would use 16 more ports just linking the switches together, so the expansion potential is limited – hence the interest in the S4810, which has 40Gb interlinking ports. Come up with a plan that considers this impact in the future.
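As a rough sketch of that port arithmetic (assuming 24-port switches fully meshed with 4-link trunks; the figures are illustrative, not a statement of the actual Oxford configuration):

```python
# Ports consumed by full-mesh trunking of identical switches, assuming
# a k-link trunk between every pair (illustrative sketch only).
def trunk_ports(n_switches: int, links_per_trunk: int = 4) -> int:
    """Total switch ports used purely for inter-switch links."""
    # Each pair of switches needs links_per_trunk links, and every link
    # consumes one port at each end.
    pairs = n_switches * (n_switches - 1) // 2
    return pairs * links_per_trunk * 2

PORTS_PER_SWITCH = 24  # e.g. a 24-port 10Gb switch such as the 8024F

for n in (2, 3, 4):
    used = trunk_ports(n)
    free = n * PORTS_PER_SWITCH - used
    print(f"{n} switches: {used} ports on trunks, {free} left for hosts")
# 2 switches:  8 trunk ports (4 per switch), 40 left for hosts
# 3 switches: 24 trunk ports (16 more than the 2-switch case), 48 left
# 4 switches: 48 trunk ports, 48 left -- interlinking quickly dominates,
# which is why 40Gb uplink ports (as on the S4810) are attractive.
```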

Other sites updates/input:
Emyr: Three 36-port InfiniBand switches. Dell R610s for service nodes. Will need to get decent hardware for storage.
PG: The service-node side may be less in scope, but it depends on whether this is in addition to the JANET bid or a pure GridPP bid.
Chris: Two S4810s to link to the new RAL 40Gb backbone with resilient links. Also discussing overlap with the Tier-1 bid.
Mark: Lawrie is coming up with a plan: a core switch upgrade, with a couple of single-Gb switches coming off it for WNs. The main thing we hope to get through is bypassing as much of the campus network as we can, as traffic from the shared cluster to storage is causing problems. Our own 10Gb link would be ideal, going straight into the site router connection to JANET. Cross-campus traffic would then be separate, which would be good. A minor issue is that the OK from the main campus is taking time to come back.
Network upgrades within the cluster are in scope. The main problems may be to do with the connection out to JANET. We should try to look at the whole path from the cluster to the outside link.
There is a notional idea of the ATLAS analysis job requirements. For the existing 6-core Intels, Gb is fine, but for some of the potential future processors (the new AMD is 64-core) Gb is not enough, so those WNs would probably need 10Gb too (see the sketch after this exchange).
Will those WNs come with 10Gb on motherboard?
EM: Probably not. The existing R815 with the new processors has Gb on the motherboard and the new version is the same. The 6145 is two of these in a 2U box. We would expect them to take add-in cards without problems.
AF: Other vendors (e.g. Viglen) do not seem to offer this on board.
PG: Except these have 4 CPUs per motherboard.
AF: There is the possibility of a 10Gb card.
AF: Not entirely sure of the underlying fabrics, so getting things on the motherboard is less likely at the start. The market may change direction in two years. Annoying.
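A rough worked example of why higher core counts push WNs towards 10Gb. The per-job I/O rate below is an assumed illustrative figure, not a number quoted at the meeting:

```python
# Back-of-envelope WN bandwidth estimate: one analysis job per core,
# each streaming data at an assumed rate (illustrative figure only).
ASSUMED_MB_PER_SEC_PER_JOB = 2.5  # assumption for illustration

def wn_demand_gbit(cores: int, mb_s_per_job: float = ASSUMED_MB_PER_SEC_PER_JOB) -> float:
    """Aggregate network demand in Gbit/s if every core runs one analysis job."""
    return cores * mb_s_per_job * 8 / 1000  # MB/s -> Gbit/s

for cores in (6, 12, 48, 64):
    demand = wn_demand_gbit(cores)
    verdict = "fits in a 1Gb NIC" if demand <= 1.0 else "needs 10Gb"
    print(f"{cores:3d} cores: ~{demand:.2f} Gbit/s -> {verdict}")
# At 2.5 MB/s per job a 6-core node needs ~0.12 Gbit/s, but a 64-core box
# needs ~1.28 Gbit/s -- already over a single GbE link before any headroom.
```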
SD: Also thinking about the Dell 8012… we need to see if we have a PoP room for all the links. The university needs to check whether there is 10Gb capability to the link room; if not, more cabling will be needed to the group cabinet. The two HP switches are not easily linkable to the Dell kit.
PG: Don’t feel too constrained with the Vendor choice. Campus directions are important too!
It is worth saying that JANET has a framework for discounts from many of these other vendors. Force10 are there and…
Any queries? The main aim of the discussion is to ensure similar formats for the bids. One aspect perhaps missed so far – if you are thinking about new management servers such as perfSONAR boxes, this might be the time to bid for those. There is scope for additional monitoring boxes.
PG: On the perfSONAR side how many?
Two: one inbound and one outbound. A link to the spec will be sent out. Would we want something with a 10Gb link?
Not sure!
Most of the data transfer to storage is 10Gb! Will dig out spec and circulate it.
For the cost it is almost not worth thinking about… yes, but to run the 10Gb card you will need a better node.
 
[10:59:43] Matthew Doidge joined
[10:59:50] John Bland joined
[10:59:50] Brian Davies joined
[10:59:55] Rob Harper joined
[11:00:13] Rob Harper I hear you
[11:00:28] Santanu Das left
[11:00:58] Queen Mary, U London London, U.K. joined
[11:01:24] Mark Slater joined
[11:01:31] Mohammad kashif joined
[11:01:37] Santanu Das joined
[11:01:58] Daniela Bauer joined
[11:02:01] Ben Waugh joined
[11:02:08] Andrew Washbrook joined
[11:02:29] Sam Skipsey joined
[11:02:34] Raja Nandakumar joined
[11:02:53] Jeremy Coles Mark are you able to hear me?
[11:03:27] David Crooks joined
[11:03:39] Mark Mitchell Hi Jeremy
[11:03:44] Mark Mitchell I can hear the call
[11:03:49] Ewan Mac Mahon joined
[11:04:15] Govind Songara joined
[11:04:46] Elena Korolkova joined
[11:05:42] Gareth Smith joined
[11:07:20] Wahid Bhimji joined
[11:08:32] Mingchao Ma joined
[11:09:32] Emyr James joined
[11:09:46] raul lopes joined
[11:11:17] Pete Gronbech joined
[11:12:03] Emyr James left
[11:12:08] Emyr James joined
[11:12:33] Stuart Wakefield joined
[11:17:13] Brian Davies left
[11:18:17] Gareth Smith left
[11:19:01] Alessandra Forti joined
[11:30:01] RECORDING Santanu joined
[11:36:48] Catalin Condurache left
[11:48:54] Santanu Das left
[11:50:22] Santanu Das joined
[11:50:32] Ben Waugh I'll give up on audio!
[11:50:51] Ewan Mac Mahon Maybe you need a network upgrade 
[11:51:04] RECORDING Santanu joined
[11:53:30] Ben Waugh If service nodes etc. are likely to be out of scope, what about network switches within cluster, i.e. between disk servers and WNs, even if bandwidth from storage to outside world is limited?
[11:54:42] Ben Waugh Thanks.
[11:55:36] Alessandra Forti 64
[11:59:08] Ewan Mac Mahon There is an extent to which we need 10Gbit kit now though - we can't really wait two years.
[12:01:09] Wahid Bhimji For ecdf - no point in us saying anything here really - for the gridpp specific storage we would add some 10 gig switch(es) but for the rest of the link we need to speak to systems team which we have started but are coninuing tomorrow. There is a least one narrow pipe that could be changed
[12:01:41] Wahid Bhimji (Also we will have to be compatible with their (non-Dell) existing stuff like other
[12:01:51] John Bland fire drill, bye
[12:01:53] John Bland left
[12:03:07] Matthew Doidge Got to run (to a meeting with the network experts about this!).
[12:03:16] Matthew Doidge left
[12:05:57] Mingchao Ma left
[12:09:13] Mark Mitchell http://www.perfsonar.net/overview.html
[12:10:17] Ben Waugh left
[12:10:18] Mark Slater left
[12:10:23] Govind Songara left
[12:10:24] Andrew McNab left
[12:10:24] Mohammad kashif left
[12:10:25] Raja Nandakumar left
[12:10:25] David Crooks left
[12:10:26] Alessandra Forti left
[12:10:29] Emyr James left
[12:10:34] Stuart Purdie left
[12:10:35] Chris Brew left
[12:10:35] Daniela Bauer left
[12:10:43] Santanu Das left
[12:10:49] Stuart Wakefield left
[12:10:55] raul lopes left
[12:11:06] Andrew Washbrook left
[12:15:29] Wahid Bhimji left
[12:17:33] Pete Gronbech left
[12:20:39] Rob Harper left
[12:25:02] Stephen Jones left
[12:36:30] Mark Mitchell http://www.cisco.com/en/US/prod/collateral/ps6418/ps6419/ps6421/prod_case_study0900aecd8033e808.html
[12:58:09] Elena Korolkova left
[13:00:13] Ewan Mac Mahon Yes, I think it's time to go.
[13:01:18] Ewan Mac Mahon It's probably actually possible to put an SRM interface on a truck full of disk servers system.....
[13:02:27] Mark Mitchell http://www.ja.net/services/connections/tariffs.html
[13:03:50] Ewan Mac Mahon So at Oxford we've essentially got a promise of an upgrade to 2x10Gbit links - in the long run that might (maybe) wind up being routed as one for us, one for the rest of the university.
[13:04:07] Ewan Mac Mahon So that would be essentially a 10Gbit dedicated link, but at no (separate) cost.
[13:04:10] Mark Mitchell Dark fibre Cost UK
[13:04:29] Queen Mary, U London London, U.K. left
[13:04:37] Mark Mitchell left
[13:04:46] Ewan Mac Mahon left
[13:04:58] Sam Skipsey left
 
1. Experiment problems/issues – Review of weekly issues by experiment/VO: LHCb, CMS, ATLAS, Other
2. Hardware discussion – Focus will be on the network bids. If time permits we'll look at cluster purchase plans. Reminder: ensure that your service nodes are on resilient hardware and not ignored!
3. AOB