Analysis Facility Pilot (weekly discussion)

513/R-068 (CERN), Europe/Zurich
Zoom Meeting ID: 61085982895
Host: Markus Schulz
Alternative host: Ben Jones
Present: Andrea, Ben, Enric, Diogo, Ignacio

This was the first meeting in a while.

Ben pointed out at the beginning that the Architecture Review Board (ARB) will meet at the end of the month.

With Gavin's new role on the board, the prospects for acceptance of the presented plan are not too bad.

 

Ben started to check the resources that could be used in the pilot. We are waiting for new resources to arrive. The idea is to provide resources by using extra slots on the nodes.

The new pledges for April are larger than expected, so there will be more resources than anticipated. The planned form of provisioning poses no risk of impacting usage by the experiments.

Andrea wanted to know whether we would create an extra queue.

Ben explained that there will be an extra slot type. Its start policy will accept only certain types of jobs, namely interactive and analysis jobs. The restriction is not by user, but by the declared job type.
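As a rough illustration of such a start policy, the slots could match on a custom attribute that jobs declare in their submit file. This is only a sketch; the attribute name `AnalysisFacilityJobType` and the type strings are made up for illustration, not the actual pilot configuration.

```
# Hypothetical sketch, not the real pilot config.
# Jobs would declare their type in the submit file, e.g.:
#   +AnalysisFacilityJobType = "interactive"   (or "analysis")

# On the dedicated slots, only accept jobs declaring one of these types:
START = (TARGET.AnalysisFacilityJobType =?= "interactive") || \
        (TARGET.AnalysisFacilityJobType =?= "analysis")
```

Using `=?=` (meta-equals) means jobs that do not set the attribute at all evaluate to false and are never matched to these slots.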

Andrea wanted clarification whether this means that every user on SWAN will be matched.

Ben saw no reason why this should be restricted: these jobs can only run on this type of resource, and those resources are limited.

However, we might still want to limit by user as well.

It was pointed out that currently only very few people submit via Dask.

This was followed by a discussion on what types of tests should be conducted before we invite experiments to try the pilot.

Scalability tests were identified as especially important.

Ben pointed out that testing on the live batch system is difficult: it is not a controlled environment and the hardware is very diverse.

From here a short discussion branched out on the characteristics of the new machines. 

They have more memory, so using extra memory should be fine. The CPU shares are normally only about 80% full, but there can be an impact on efficiency.

We did not reach the point of defining what level of impact would be acceptable.

 

The need to have a variety of workloads that closely resemble realistic analysis work was discussed. 

Andrea pointed out that the workload he uses in tests can run with a large number of cores and is well packaged. However, there are only two different workloads, and to cover more of the variety we need to bring in new users soon.

Diogo emphasized the importance of scale tests.

These should cover both RDataFrame and Coffea.

Ben knows someone else who uses Dask. ---> follow up!

We discussed how we could use Andrea's workloads to emulate a larger number of users and a greater variance of workloads. It was proposed that he could run with different identities and configurations. Automating this scenario would require some work. The variation in performance of the same workloads can be used to estimate the impact of the jobs on the system.
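A minimal sketch of how the two existing workloads could be fanned out across emulated users with varied configurations. All names here (the user naming scheme, the workload labels, the parameter choices) are illustrative assumptions, not the actual test setup.

```python
# Hypothetical sketch: fan out a small set of reference workloads across
# many emulated users with randomized configurations for a scale test.
import random
from dataclasses import dataclass

@dataclass
class JobConfig:
    user: str        # emulated identity (made-up naming scheme)
    workload: str    # one of the existing reference workloads
    cores: int       # vary the footprint to add variance
    chunk_size: int  # e.g. events per task in the task graph

def emulate_users(n_users, seed=0):
    """Produce one randomized job configuration per emulated user."""
    rng = random.Random(seed)  # fixed seed -> reproducible test runs
    workloads = ["rdataframe_benchmark", "coffea_benchmark"]  # the two known workloads
    return [
        JobConfig(
            user=f"pilotuser{i:03d}",
            workload=rng.choice(workloads),
            cores=rng.choice([2, 4, 8, 16]),
            chunk_size=rng.choice([50_000, 100_000, 200_000]),
        )
        for i in range(n_users)
    ]

configs = emulate_users(20)
```

Keeping the seed fixed makes a run repeatable, so performance differences between runs can be attributed to the system rather than to the generated mix.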

Ben brought to our attention that the negotiation cycle of the pool is currently 2 minutes. For interactive use we should use a different fair share.

A different negotiator for these job types is probably the best approach. It should have a cycle on the time scale of seconds. We might also add a limit per user to avoid flooding the system due to trivial errors on the user side.
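In HTCondor terms, the shorter cycle and the per-user cap could look roughly like the fragment below. The interval values and the limit name are illustrative assumptions only; the actual numbers were not agreed in the meeting.

```
# Hypothetical sketch; values are illustrative, not agreed numbers.
# A negotiator dedicated to the interactive/analysis slots, cycling on
# the order of seconds instead of the default minutes:
NEGOTIATOR_INTERVAL = 10
NEGOTIATOR_MIN_INTERVAL = 5

# Guard against a user accidentally flooding the system: cap concurrent
# jobs per user via a concurrency limit that submit files would declare
# with e.g.  concurrency_limits = af_user_<name>
AF_USER_LIMIT = 50
```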

This was followed by a discussion on resource analysis.

 

We discussed the next steps:

A Google Slides deck should be created to list the next steps. Everyone is invited to contribute to the list and to indicate which activities they can contribute to.

The doc has been created by Andrea and shared via Mattermost: https://docs.google.com/presentation/d/1L_vCwIQ0UbrQ-0TEnA-VqJSw_mLk3ned_Qc48rzz3PY/edit#slide=id.p

- The work on extending Andrea's scripts can start now; testing can only begin after the new resources have arrived.

- SWAN changes to make access via JupyterLab easier are already in QA; we now have an Alma9 image that works.

- Connect to batch first, then start Dask (kinit is still needed).

The question came up whether we should hold this back until the ARB. Everyone agreed that this is a service improvement and can be rolled out independently.

- Run tests at the bleeding edge to see how the system handles overload.

Andrea was concerned that Coffea has so many dependencies. One person reported that the CMS Coffea tests have already been run on Alma9 and they run fine.

- What will the number of nodes be? ---> Ben will get numbers (follow up)

Ben pointed out that Alma9 will need a higher version of HTCondor. Ben was sure that the problematic nodes that blocked the upgrade have been retired.

It was suggested to test without the Condor version request. We will check that Alma9 works with Condor 23. ---> follow up

The documents for the ARB are all in the Google Doc. Ben will look after the docs and share a link.

- Maybe February is the right time to invite some power users. ---> follow up by suggesting individuals

 

Agenda: 14:00-15:00 Discussion and next steps (1h)