Introduction (John Gordon)
Since the previous meeting: TEG F2F 23rd January, and LHCOPN at BNL.
Next meetings: March 21st, April 18th and May 9th.
Upcoming: ISGC 26th March, OGF (Oxford, UK) 11th March, EGI User Forum 26th March, HEPiX 23rd April 2012, WLCG workshop 19th May, CHEP 2012 21st May, EGI Technical Forum (Prague) September 2012.
Grid Engine SIG – if interested contact firstname.lastname@example.org
CERN WMS update to EMI-1 versions.
Vidyo worked for the TEG yesterday. Looks fine now. Shall we use it from now on? Yes.
IB: If it is useful we could invite the Vidyo team to come and talk to the GDB regarding hints and tips on usage.
JG: A good idea.
Still looking for volunteers for the GDB chair search committee.
IB: Should we change the nature of the GDB in the future? We have the TEGs, and from yesterday’s discussions it was clear there needs to be a technical place to follow them up. There are fewer deployment issues now than at most points in the last 10 years… so what should the future GDB look like and what is its role? Should it remain every month? If less frequent we may lose momentum for technical discussions. We certainly need to keep it in some form.
JT: There was a ramp-up for WLCG in trying to get things deployed from previous technical discussions.
JG: There were several TEGs and several had low participation, so we need some forum.
MS: Need to reduce the overlap between MB and GDB.
IB: There are now very few MBs.
MS: Still things like glexec and changes to info system that need to be followed up in GDB.
IB: No intent to move MB to telephone only. Open to suggestions on how to make things better – MB could be in middle of month.
MS: Perhaps reverse order so discussions in GDB take place before decisions in MB!
JT: Could split GDB over two days and have MB in afternoon of second day.
JG: Maria suggested yesterday a widening of the service coordination meetings.
IB: There were a couple of coordination actions from yesterday. We should think about the ops coordination, WLCG T1 service coordination, GDB etc. and look to reduce duplication and optimise.
JT: If splitting over two days, could have the T2 meeting on morning of first GDB day to encourage people to come.
JS: One thing is missing – there should be consistent recording of what happened.
IB: Another thing is we used to have minutes.
JG: We still have recording of the discussions
MS: Recording what has been said is useful for those not here but often leave without understanding what is the consensus. Actions are not clearly recorded.
Summary from WLCG TEG workshop (Ian Bird)
Context – what does the future look like for WLCG? From where will come the effort?
Asked at the TEG summary yesterday for each chair to prepare a two-slide summary for their area.
SM/DM (Wahid Bhimji)
There is a need for an overall architecture diagram and concrete recommendations.
Should document (the architecture of) both what we have now and what we want in the future.
Some feedback yesterday – what kind of environment variables to pass from the WNs (such as CPU power); information like remaining job lifetime, which is important for pilots; and handling of signals to do with job termination (hitting time or memory limits).
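The signal-handling point can be sketched as follows. This is a minimal, hypothetical illustration (not anything prescribed in the discussion) of how a pilot might trap the termination signal a batch system typically delivers (SIGTERM) before the hard kill, so it can record state during the grace period:

```python
import os
import signal

terminating = False

def on_term(signum, frame):
    # Batch systems commonly deliver SIGTERM shortly before a hard
    # SIGKILL when a job hits its wall-time or memory limit; the
    # handler uses that grace period to mark state for a clean exit.
    global terminating
    terminating = True

signal.signal(signal.SIGTERM, on_term)

# Demonstrate: deliver SIGTERM to ourselves, as the batch system would.
os.kill(os.getpid(), signal.SIGTERM)
assert terminating  # handler ran; a real pilot would now checkpoint and exit
```

In a real pilot framework the handler would drain or abort the running payload rather than just set a flag; the sketch only shows the trap mechanism.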
Do we need the WMS in the future or not? Only SAM tests will rely on it after ATLAS moves software installation to Panda; CMS uses glideinWMS, and LHCb direct submission.
JG: Extensions of the JDL for multi-core jobs are fine for the user telling the WMS what they want. Does the information exist in the BDII to then do the matchmaking?
CG: The urgent thing is to pass the requirement to the CE and batch system. Since the path is not to use the WMS, we will look at direct submission.
JG: How do you know which sites to submit to?
CG: We may already know, but clearly we need to put this in the information system at some point. Some is there and used by MPI. The JDL extension is needed now – we would like to be able to specify a range of cores rather than an exact number. We can start with 4 cores and simplify the configuration of the batch systems.
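As a sketch, the kind of request being discussed might look like the CREAM JDL fragment below. `CPUNumber` and `SMPGranularity` are existing CREAM JDL attributes; the range request is an assumption illustrating the extension asked for here, since no such attribute existed at the time:

```
[
  Executable     = "pilot_wrapper.sh";
  CPUNumber      = 4;   // exact core count, as supported today
  SMPGranularity = 4;   // all requested cores on the same worker node
  // Hypothetical extension discussed above: a range instead of an
  // exact number, e.g. MinCPUNumber = 4; MaxCPUNumber = 8;
]
```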
JG: Same question about the batch systems. Does CREAM decide what to pull down, or the batch system?
CG: Information about whether jobs are I/O-intensive or not is to help schedule jobs and avoid getting WNs stuck. If the site configures BLAH so that this information is passed, then that is fine… The VO is just helping to do the scheduling in the best possible way. When a pilot is started with a specification, it cannot then download a job which does not fit that specification. This may require some changes to the pilot frameworks.
IB: How will you know?
CG: In CMS, we consider high i/o jobs…
JT: This slide is nice (next steps) and it makes it clear how much is still to be done.
IB: Where you have ‘proof of concept’ do you have people in mind?
CG: For multi-core, yes. We need to identify whether we can do that with the middleware available. In the discussion a few sites indicated their availability to participate, and we got feedback from one of the CMS developers.
IB: In your list you have a number of items
CG: The two main items needing a wider dedicated forum are clouds and virtualisation.
IB: We need a place to discuss batch systems. Coming back to GDB format, perhaps we should look at the HEPiX style of using interest groups. There may then for example be a special interest group on batch systems and someone in that may feel motivated to do some of the needed work. Some of these may merge with HEPiX activities later.
CG: We have not made much use of the pre-GDB days.
Security (Romain Wartel)
Identified three main risks and the work now will focus on recommendations to mitigate these.
IB: Could use a summary of the current situation
JG: Such as agreement on identity switching but less progress on storage security.
IB: Following up on previous comments on interest groups, then having one on NoSQL would be a good example.
JG: Question about availability: one infrastructure is doing it, but EGI is also doing this using the same infrastructure. Hindered by the team saying we cannot keep adding more profiles. The concern is that we fragment.
MM: What are the implications for the developers? We have a common tool, which is MyEGI, and what we would propose for now is a myWLCG showing the tests that all the VOs want shown.
JG: What is different about myWLCG?
MM: To be used by the experiments. Adapted to the experiments needs.
PG: The difference being that MyEGI uses ops tests and you would have something using the experiment tests
JG: I was under the impression that you could deploy a VO Nagios and you would have a route using this …
IB: For next MB propose an editing team to bring all these reports together. The summary today has picked out the main outstanding issues anyway. In a future MB/GDB we’ll review the priorities and understand where the effort can come from to push things forward. There is potentially a lot of work in the recommendations.
JG: So you’ll be setting a single list of recommendations and priorities?
IB: It was clear yesterday that things are not exclusive so we do need an overview at some level – we have to look at this in the MB.
Signed off some actions previously on moving to glue2.0 etc. Will attach Markus’s document to the agenda. We should follow the plans and not just reinvent them.
Siteview – Current status and plans (Julia Andreeva)
JT: Slide 6. We heard a proposal from ops TEG that we would have SAM tests extended to have VO tests that are site responsibilities. Sites want a list of things that are an issue for the experiment. My question on this slide – is there anything different than this in your slide?
JA: No. You redefine the collector to look into SAM.
JT: From the site perspective, is there anything adding value?
JG: The suggestion here is having an overall status in view for the site.
JA: Some sites want to see more. They may be passing all the tests but they want to see anything that affects their status, even if outside the site’s control.
M: We have come to an agreement between experiments and sites within the TEG, and concluded that the most critical thing for site monitoring is to agree on metrics. Therefore we explain today the current metrics, which will be replaced in the coming weeks…
JA: There are several views.
JT: Slide 7…. This stuff is new and not available from Nagios so this is what I am looking for.
IF: It is a useful exercise for the TEG. One worry – the reason for the SSB to work was that the experiments needed it for shifters. You have had a prototype for 3 years and have little feedback from the sites. There was a heated MB on this topic. It would be nice to hear from the sites whether they want this dashboard.
JA: There was also frustration from the developers.
JT: When this came out we were involved and were a big fan. We stopped using it because the information going into it was not good.
??: The most difficult part for metrics was that we depended on the experiments to export the information. Sometimes their agents were not publishing properly and this was difficult.
JT: Most of the time was spent debugging the input to the tool.
JT: Back to Ian’s question. One reason I was harsh about the status slide: the status part should be no different from SAM. When I say I don’t want to look at anything, what I mean is that there is an automatic notification to my site monitoring, to which the site can determine how to respond. SAM is good for this. If SAM says it is working but there is nothing happening, I want to be able to check more widely, and that is where this view comes in… not to see if my site is working, but why my site is not getting more work.
MJ: At GRIF we were aware of the early version of the tool. We found it useful. Marketing to other sites was difficult because it was hard to know what it was for versus the VO-specific tools. … We should try to put some effort into it and make sure it is known as a tool that people should use, also for the experiment people.
JA: Perhaps it is the mandate of the operations group to decide what tools should be used. If it is made much more useful for operations.
JT: The framework for seeing things is fine but there is a problem with what is going into it.
JS: This story has been ongoing for quite a long time. Depending on the requirements this is either possible or not. No reason we could not have something in place for first 2012 collisions. If it is used then it will improve with time. If we can not do this then we should stop talking about it. Deploy now and use some simple metrics.
JA: Deployed and reconfigured with metrics that are agreed to be useful.
JS: It will not be perfect from day one.
JG: We have some metrics defined already so could we not evaluate over the next month?
M: In the TEG we came to an agreement of having restricted experiment participation and … coming to a selection of results that the VOs consider important. Take the opportunity of the TEG community being already in place to…
JG: Do the experiments need to explain to the sites what the metrics are before we start evaluating them on them?
M: You have to start from a set of metrics that make sense.
JT: Set of metrics understandable by sites but not covering all of what the VO want, and then there are another set … we could go with what is in place but it should not stop the process Maria was talking about.
ATLAS: You need to start from the bottom not the top. Aggregate. Sometimes it is taking half a day for the on-duty to understand what is the problem.
IB: SAM is a framework for notifying sites of problems. This is for visualising the status. They should not have different data sources.
IF: For CMS SSB includes number of analysis jobs failed in last hour…
IB: There is some simple stuff, more complex stuff and other things that people might want to see.
Maria: We should start from basic tests and display those.
SC: Start with SAM critical tests. With some information on whether the site is blacklisted or not should be there for information. Next step would be to extend the SAM tests with quality checks – outcomes of hammercloud tests, connectivity with other clouds etc. This is a second step.
JG: The first part is not in place now?
JG: What is the action?
M: We propose in the TEG that we start with this group and these people then tell Julia what is required.
PG: Slightly confusing for sysadmin is that there are many pages. This page seems to be on another domain.
JG: There should be a link from the dashboard link.
JT: Remember, this does not mean that SAM is no longer the primary source indicating whether a site is working or not.
EMI-1 Updates (Cristina)
Only one since last time: Update 12, 19.01.2012.
The BDII update inclusion is useful.
LHCOPN update (John Shade)
JG: What are the primitive services (slide 6) that cause alarms?
JS: Latency tests and bandwidth tests – so you can see if there is for example a one way delay.
MS: Do we expect that LHC experiments start using BoD for WNs?
JG: The T1 use-case is almost met 100% of the time by dedicated lines.
??: There is a use-case for transferring between two sites for short periods at high-bandwidths.
Maurice: We use lightpaths on demand in surfnet for life sciences depending on when users are taking data.
JG: So then we should move away from FTS-managed transfers?
??: Yes in that FTS does not know what other users are doing. BoD takes account of competition.
MS: Then if we need to integrate into the FTS then we need to know now for the FTS 3 architecture.
IB: What is the time scale? This is not coming in the next year.
Maurice: This is my question for many of the areas – what are the timelines… in regard of the long shutdown etc. what is expected.
MS: 2-3 years or 3-4 years?
??: There is a protocol already being used in many US networks that is available now for testing if there is engagement.
JG: The problem often comes in the cross-domain region.
On the timescale of 3 years, NSI, which is being developed in the OGF, will be promoted.
IB: Need to check the strategy. On 3 years I don’t think anything is going to change. You should disconnect this from FTS3.
Maurice: I would put in the hooks for this but not do anything on the protocol – it gets the conversation going.
JG: On ATLAS full matrix network monitoring, SC said the opposite of Slide 24 indicating that there was a plan to have regular testing of a full mesh matrix.
JS: The concern is that the monitoring impacts the network. Perhaps it is intended to have a full matrix but not with continuous testing of all.
SC: First of all, it is not just ATLAS. The idea is that, beyond the site paths covered by the OPN, it would be good to have tests between major T2s and T1s, and between major T2s. Even there we can start from something simple – what will be tested and how often. It needs central coordination. The indication was that this needs testing within WLCG and not just LHCONE. There may be sites that never join LHCONE and we still need to know their status.
JG: Curious on update on PIC connection to LHCONE
GM: We were connected through the NREN-GEANT link with a 1Gb limitation. The upgrade to 10Gb was happening but we wanted to test… production traffic was going through this link and it was causing more and more problems, so it was removed. The connection was done to test functionality.
??: LHCOPN and LHCONE are currently fundamentally different. The former has dedicated resources. The results for LHCONE will be very different and we should be aware of that.
JG: So what you are saying is that we can’t have something like FTS monitoring the traffic and throttling bandwidth.
There was a discussion about what can be gleaned from the network monitoring and how important it is for experiments in determining which sites to use for the source of transfers.
A brief reminder on security – Romain Wartel
Storage accounting (John Gordon)
What now? (slide 18). Do we wait until 2013, or go with gstat or IGI?
SC: When you discuss installed capacity do you mean in space tokens?
JG: Definition is storage area so it could be either
IB: It is important to show how pledges have been used so it needs to cover everything.
SC: In the experiments we use a collector via SRM to gather space used/available and store this in a database. The numbers are compared to the BDII. Can this be useful here? ATLAS uses it to automatically clean up space – it is very reliable and based on a remote query of the SRM.
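The comparison step of such a collector could be sketched roughly as below. The SRM query itself is not shown (tools and endpoints vary per site); `srm_report` and `bdii_report` are hypothetical inputs standing in for the numbers gathered from the SRM and published in the BDII, and the 5% tolerance is an arbitrary choice for illustration:

```python
def compare_space(srm_report, bdii_report, tolerance=0.05):
    """Flag space tokens where SRM-reported usage and the BDII
    disagree by more than `tolerance` (fractional difference)."""
    mismatches = []
    for token, srm_used in srm_report.items():
        bdii_used = bdii_report.get(token)
        if bdii_used is None:
            mismatches.append((token, "missing from BDII"))
            continue
        denom = max(srm_used, bdii_used, 1)
        if abs(srm_used - bdii_used) / denom > tolerance:
            mismatches.append((token, f"SRM={srm_used} BDII={bdii_used}"))
    return mismatches

# Example with made-up numbers (TB used per space token):
srm = {"ATLASDATADISK": 950, "ATLASSCRATCHDISK": 80}
bdii = {"ATLASDATADISK": 940, "ATLASSCRATCHDISK": 60}
print(compare_space(srm, bdii))
```

The same output could equally feed John’s suggestion of comparing observed usage against pledges; only the second input would change.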
JG: Perhaps compare what you see and pledges.
CG: Does it rely on spacetokens?
SC: Yes it does.
CG: I thought so and that’s why we could not adopt it.
JG: The buffers are T1D0… so you could ask SRM about them.
SC: In ATLAS the buffers are very limited – of order 5%. CMS maybe it is more like 90%.
ML: If you don’t pass a spacetoken then SRM may return unnamed space.
GM: The query for dCache will return the disk in front of the tape (for a D0 query).
RFC/SHA-2 proxies (Maarten Litmaath)
IGTF would like CAs to move from SHA-1 to SHA-2 signatures ASAP. For WLCG this implies moving to RFC.
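For context, the change is in the hash algorithm used inside the CA’s signature, not in the certificate format itself. A trivial illustration of the two algorithm families, using Python’s standard hashlib rather than any grid middleware:

```python
import hashlib

data = b"certificate-to-be-signed"
# SHA-1: 160-bit digest, the family IGTF wants phased out.
print("SHA-1  :", hashlib.sha1(data).hexdigest())
# SHA-256: a SHA-2 family member, the migration target.
print("SHA-256:", hashlib.sha256(data).hexdigest())
```

The middleware question discussed below is whether every component in the chain (proxies, dCache, GridSite, etc.) can verify signatures made with the SHA-2 family.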
MS: Virtual IDs derived from certificates are held in the state of storage systems… if you sent everyone to CERN for certificates, what would happen?
ML: It would be an extreme case if the CAs panicked because SHA-1 could be broken. With many of the CAs we can make something happen, and as a fallback it is a possibility.
JG: if you talk about testing to be ready to move in 2013 – what testing needs to be done?
ML: Myself and others would need to test product by product for SHA-2 and RFC.
ML: When we move to RFC proxies, that is the big thing – say some time late autumn this year. The big change is that once the first dCache moves to jglobus2 there is no going back: the legacy proxy will not be accepted by the upgraded dCache.
JT: jglobus1 supports RFC.
JG: Need to make sure the production users are kept aware of the timeline.
MS: To what extent could this be out of our control – a forced update on the other side of the Atlantic…
ML: GridPMA has not come back to indicate a problem.
JG: Who is going to test things?
IB: Who is driving the timescales here?
ML: We can make a huge amount of progress this year
HM: Just as much a communication issue as technical issue.
IB: If we have a plan – no way in 2012 but can do it for June 2013 then take this plan to the other communities and ask them to consider this in their planning.
ML: During a big shutdown we can make these big changes.
JT: We have to take seriously that we have to be as ready as we can. Some of the CAs will HAVE to move if SHA-1 is cracked.
IB: In that case we should consider having a catch-all CA setup to deal with the transition.
MJ: It is likely that we have pressure from CA managers to move in January. Best to make sure they are aware of our plans to be ready for Jan 2013 but expect to move in June 2013… therefore request that they do not issue SHA-2 certs until June 2013 in their own planning.
What about the experiment software?
ML: For LHCb only DIRAC. SC is checking in ATLAS
SC: Checked, and ATLAS is RFC compliant. Panda and DPM use GridSite and … SHA-2 testing is more difficult but can be done.
ML: Probably we can collaborate with EMI who have set up a CA for testing.
JG: Need a matrix of components and status and will hear a more detailed plan…
ML: It may be a good idea to set up the CA now anyway, and add it to our own distribution.
Meeting finished at 17:10
Alvaro Fernandez Casani
Gonzalo Merino Arevalo
Doina Cristina Aiftimie